Filter dna_seg features by looking at a maximum within groups
max_by_group.Rd
Takes a dna_seg or a list of dna_seg objects. It groups their features based on group_by and, per group, takes the feature with the maximum value in the column given by longest.
Arguments
- dna_seg_input
Either a single dna_seg or a list of dna_seg objects.
- group_by
A character string, representing a dna_seg attribute that the features will be grouped by.
- longest
A character string, representing a dna_seg attribute. After grouping, the feature with the maximum value in the column given by this argument is taken from each group.
- ignore_boundaries
Logical. If TRUE, any features with "boundaries" as their gene_type will be kept regardless.
Value
Either a single dna_seg object or a list of dna_seg objects, matching the input given using dna_seg_input.
Details
This function was intended to take the longest transcript per gene, although it can be used for other purposes. If group_by points to a column with gene IDs, it intentionally mimics the output of the primary_transcript.py script from OrthoFinder, so that dna_segs can be loaded in from FASTA files before primary_transcript.py is used.
This preserves the metadata from the FASTA files, which primary_transcript.py would otherwise remove.
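The selection described above can be illustrated with a minimal base R sketch on a plain data.frame. This is not the package's implementation: the helper name max_by_group_sketch, the default longest = "length", the tie-breaking rule (first row wins), and the choice to exclude "boundaries" rows from the grouping are all assumptions made for illustration.

```r
# Hypothetical helper, NOT the package's code: a base R sketch of
# "keep the row with the maximum value per group".
max_by_group_sketch <- function(df, group_by, longest = "length",
                                ignore_boundaries = FALSE) {
  # Rows kept unconditionally when ignore_boundaries is TRUE (assumption:
  # they are also left out of the per-group comparison).
  is_boundary <- rep(FALSE, nrow(df))
  if (ignore_boundaries && "gene_type" %in% names(df)) {
    is_boundary <- !is.na(df$gene_type) & df$gene_type == "boundaries"
  }
  candidates <- which(!is_boundary)

  # Split the candidate row indices by group, then pick the first maximum
  # in each group (which.max() returns the first of any tied values).
  groups <- split(candidates, df[[group_by]][candidates])
  winners <- vapply(groups,
                    function(idx) idx[which.max(df[[longest]][idx])],
                    integer(1))

  # Return the selected rows in their original order.
  keep <- sort(c(unname(winners), which(is_boundary)))
  df[keep, , drop = FALSE]
}

# Toy feature table: two transcripts for gene "1", three for gene "2".
features <- data.frame(
  name   = c("1A", "1B", "2B", "2A", "2C"),
  gene   = c("1", "1", "2", "2", "2"),
  length = c(30, 60, 30, 60, 60),
  stringsAsFactors = FALSE
)

result <- max_by_group_sketch(features, group_by = "gene")
result$name
# [1] "1B" "2A"
```

In this sketch, tied features ("2A" and "2C" both have length 60) are resolved by keeping whichever comes first in the table. How the real function breaks ties is not documented here, so row order should not be relied on without checking.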
Examples
## Prepare dna_seg
names1 <- c("1A", "1B", "2A", "2B", "2C")
genes1 <- c("1", "1", "2", "2", "2")
starts1 <- c(1, 1, 101, 101, 101)
ends1 <- c(30, 60, 160, 130, 160)
lengths1 <- abs(starts1 - ends1)+1
## Make dna_seg
dna_seg_raw <- dna_seg(data.frame(name = names1, start = starts1, end = ends1,
                                  strand = rep(1, 5), length = lengths1,
                                  gene = genes1))
dna_seg_raw
#> name start end strand length gene gene_type region_plot col fill
#> <char> <num> <num> <num> <num> <char> <char> <char> <char> <char>
#> 1: 1A 1 30 1 30 1 arrows NA grey20 grey80
#> 2: 1B 1 60 1 60 1 arrows NA grey20 grey80
#> 3: 2B 101 130 1 30 2 arrows NA grey20 grey80
#> 4: 2A 101 160 1 60 2 arrows NA grey20 grey80
#> 5: 2C 101 160 1 60 2 arrows NA grey20 grey80
#> lty lwd pch cex
#> <num> <num> <num> <num>
#> 1: 1 1 8 1
#> 2: 1 1 8 1
#> 3: 1 1 8 1
#> 4: 1 1 8 1
#> 5: 1 1 8 1
## Take longest feature per gene name
dna_seg_edit <- max_by_group(dna_seg_input = dna_seg_raw, group_by = "gene")
dna_seg_edit
#> name start end strand length gene gene_type region_plot col fill
#> <char> <num> <num> <num> <num> <char> <char> <char> <char> <char>
#> 1: 1B 1 60 1 60 1 arrows NA grey20 grey80
#> 2: 2A 101 160 1 60 2 arrows NA grey20 grey80
#> lty lwd pch cex
#> <num> <num> <num> <num>
#> 1: 1 1 8 1
#> 2: 1 1 8 1