Filter dna_seg features by looking at a maximum within groups
Takes a dna_seg or a list of dna_seg objects, groups the features by the
column named in group_by, and keeps, for each group, the feature with the
maximum value in the column named in longest.
Arguments
- dna_seg_input
Either a single dna_seg or a list of dna_seg objects.
- group_by
A character string, representing a dna_seg attribute that the features will be grouped by.
- longest
A character string, representing a dna_seg attribute. After grouping, features will be taken with the maximum value in the column given by this argument.
- ignore_boundaries
Logical. If TRUE, any features with "boundaries" as their gene_type will be kept regardless.
Value
Either a single dna_seg object or a list of dna_seg objects,
matching the input given using dna_seg_input.
Details
This was intended to take the longest transcript per gene, although it can
be used for other purposes. If group_by points to a column with gene IDs,
the result intentionally mimics the output of the primary_transcript.py
script from OrthoFinder. This allows dna_segs to be loaded from FASTA files
that have not yet been processed by primary_transcript.py, which preserves
the FASTA metadata that primary_transcript.py would otherwise remove.
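The selection described above can be illustrated with a minimal base R sketch. It is not the package's implementation: keep_max_per_group is a hypothetical helper, it works on a plain data.frame rather than a dna_seg, it assumes "length" as the default for longest, it ignores ignore_boundaries, and it assumes that ties are resolved by keeping the first feature in the group.

```r
# Hypothetical helper (not part of genoPlotR): keep one row per group,
# namely the first row holding the maximum value of the `longest` column.
keep_max_per_group <- function(features, group_by, longest = "length") {
  # Split the row indices according to the grouping column
  idx_by_group <- split(seq_len(nrow(features)), features[[group_by]])
  # Within each group, pick the index of the first maximum
  keep <- vapply(
    idx_by_group,
    function(idx) idx[which.max(features[[longest]][idx])],
    integer(1)
  )
  # Return the kept rows in their original order
  features[sort(unname(keep)), , drop = FALSE]
}

# Illustrative data: five transcripts belonging to two genes
features <- data.frame(
  name = c("1A", "1B", "2A", "2B", "2C"),
  gene = c("1", "1", "2", "2", "2"),
  length = c(30, 60, 60, 30, 60),
  stringsAsFactors = FALSE
)

result <- keep_max_per_group(features, group_by = "gene")
print(result$name)
```

In this sketch, transcripts 2A and 2C tie on length, and which.max keeps the one that appears first. How max_by_group itself resolves ties is not stated here, so check its output if your data contains equal maxima.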
Examples
## Prepare dna_seg
names1 <- c("1A", "1B", "2A", "2B", "2C")
genes1 <- c("1", "1", "2", "2", "2")
starts1 <- c(1, 1, 101, 101, 101)
ends1 <- c(30, 60, 160, 130, 160)
lengths1 <- abs(starts1 - ends1)+1
## Make dna_seg
dna_seg_raw <- dna_seg(data.frame(name=names1, start=starts1, end=ends1,
strand=rep(1, 5), length=lengths1,
gene=genes1))
dna_seg_raw
#>      name start   end strand length   gene gene_type region_plot    col   fill
#>    <char> <num> <num>  <num>  <num> <char>    <char>      <char> <char> <char>
#> 1:     1A     1    30      1     30      1    arrows          NA grey20 grey80
#> 2:     1B     1    60      1     60      1    arrows          NA grey20 grey80
#> 3:     2B   101   130      1     30      2    arrows          NA grey20 grey80
#> 4:     2A   101   160      1     60      2    arrows          NA grey20 grey80
#> 5:     2C   101   160      1     60      2    arrows          NA grey20 grey80
#>      lty   lwd   pch   cex
#>    <num> <num> <num> <num>
#> 1:     1     1     8     1
#> 2:     1     1     8     1
#> 3:     1     1     8     1
#> 4:     1     1     8     1
#> 5:     1     1     8     1
## Take longest feature per gene name
dna_seg_edit <- max_by_group(dna_seg_input = dna_seg_raw, group_by = "gene")
dna_seg_edit
#>      name start   end strand length   gene gene_type region_plot    col   fill
#>    <char> <num> <num>  <num>  <num> <char>    <char>      <char> <char> <char>
#> 1:     1B     1    60      1     60      1    arrows          NA grey20 grey80
#> 2:     2A   101   160      1     60      2    arrows          NA grey20 grey80
#>      lty   lwd   pch   cex
#>    <num> <num> <num> <num>
#> 1:     1     1     8     1
#> 2:     1     1     8     1