Make unique IDs for dna_segs

Generates unique identifiers (IDs) for the features of a dna_seg. The IDs can be based on the values of existing columns, or generated from scratch.

Usage

make_unique_ids(dna_seg_input, old_id = NULL, new_id = "id")

Arguments

dna_seg_input: Either a single dna_seg or a list of dna_seg objects.
old_id: Either a character vector of dna_seg column names, or NULL. The IDs are built from the values of the given columns, or generated from scratch if this argument is NULL.
new_id: A character string naming the dna_seg column in which the generated IDs are stored. The column is created if it does not already exist in the dna_segs.

Value

Either a single dna_seg object or a list of dna_seg objects, matching the input given using dna_seg_input.

Details

This function generates unique identifiers for dna_segs. Having unique identifiers is necessary for certain other functions, like converting a dna_seg into a FASTA file, as most tools that make use of FASTA files require unique headers for each sequence in the FASTA file.
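As a rough illustration of why this matters, the base R sketch below shows how headers built directly from a non-unique column would collide. The vector `feature_names` is made-up example data and the `>` prefix mimics a FASTA header line; nothing here calls the package itself.

```r
# Made-up feature names, as they might appear in a dna_seg "name" column
feature_names <- c("A", "A", "B", "B", "B", "C")

# FASTA-style headers built directly from the names
headers <- paste0(">", feature_names)

# TRUE when at least one header occurs more than once
has_duplicates <- any(duplicated(headers))

# The names responsible for the collisions
repeated <- unique(feature_names[duplicated(feature_names)])

print(has_duplicates)
print(repeated)
```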

If old_id is left as NULL, the generated IDs are simply the row numbers of the features. If old_id refers to one or more dna_seg columns, the values of those columns are concatenated, separated by "_". A counter is then appended to each concatenated value, again separated by "_". The counter starts at 1 for each combination of values and increases by 1 each time the same combination is found again. See the examples below.
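The numbering scheme described above can be sketched in base R on a plain data.frame. This is only an illustration of the logic, not the package's actual implementation: the helper name `sketch_unique_ids` is hypothetical, and returning the row numbers as character strings is an assumption.

```r
# Hypothetical helper illustrating the numbering logic; not part of the package.
sketch_unique_ids <- function(df, old_id = NULL) {
  if (is.null(old_id)) {
    # No columns given: the IDs are the row numbers
    # (returned as character here, which is an assumption)
    return(as.character(seq_len(nrow(df))))
  }
  # Concatenate the values of the chosen columns, separated by "_"
  cols <- unname(as.list(df[, old_id, drop = FALSE]))
  key <- do.call(paste, c(cols, sep = "_"))
  # Number the occurrences of each combination in order of appearance
  counter <- ave(seq_along(key), key, FUN = seq_along)
  paste(key, counter, sep = "_")
}

df <- data.frame(name = c("A", "A", "B", "B", "B", "C"),
                 type = c("gene", "gene", "gene", "protein", "gene", "gene"))

ids_one <- sketch_unique_ids(df, old_id = "name")
ids_two <- sketch_unique_ids(df, old_id = c("name", "type"))
ids_none <- sketch_unique_ids(df)
```

Because the counter restarts for every distinct combination, adding more columns to old_id changes which rows share a counter: row 4 here is the second "B" by name alone, but the first "B_protein" when type is included.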

Author

Mike Puijk

Examples

## Prepare dna_seg
names1 <- c("A", "A", "B", "B", "B", "C")
types1 <- c("gene", "gene", "gene", "protein", "gene", "gene")

## Make dna_seg
dna_seg_raw <- dna_seg(data.frame(name = names1,
                                  start = (1:6) * 3,
                                  end = (1:6) * 3 + 2,
                                  strand = rep(1, 6),
                                  type = types1))

## Generate IDs based on a single column
dna_seg_edit <- make_unique_ids(dna_seg_input = dna_seg_raw,
                                old_id = "name")
dna_seg_edit[, .(name, type, id)]
#>      name    type     id
#>    <char>  <char> <char>
#> 1:      A    gene    A_1
#> 2:      A    gene    A_2
#> 3:      B    gene    B_1
#> 4:      B protein    B_2
#> 5:      B    gene    B_3
#> 6:      C    gene    C_1

## Generate IDs based on multiple columns
dna_seg_edit <- make_unique_ids(dna_seg_input = dna_seg_raw, 
                                old_id = c("name", "type"))
dna_seg_edit[, .(name, type, id)]
#>      name    type          id
#>    <char>  <char>      <char>
#> 1:      A    gene    A_gene_1
#> 2:      A    gene    A_gene_2
#> 3:      B    gene    B_gene_1
#> 4:      B protein B_protein_1
#> 5:      B    gene    B_gene_2
#> 6:      C    gene    C_gene_1