Filter dna_seg features by looking at a maximum within groups

Takes a dna_seg or a list of dna_seg objects, groups their features by the column given by group_by and, for each group, keeps only the feature with the maximum value in the column given by longest.

Usage

max_by_group(
  dna_seg_input,
  group_by,
  longest = "length",
  ignore_boundaries = TRUE
)

Arguments

dna_seg_input: Either a single dna_seg or a list of dna_seg objects.
group_by: A character string naming the dna_seg column by which the features will be grouped.
longest: A character string naming a dna_seg column. Within each group, the feature with the maximum value in this column is kept.
ignore_boundaries: Logical. If TRUE, any features with "boundaries" as their gene_type are kept regardless of grouping.

Value

Either a single dna_seg object or a list of dna_seg objects, matching the type of input supplied as dna_seg_input.

Details

This function was intended to take the longest transcript per gene, although it can be used for other purposes. If group_by points to a column with gene IDs, it intentionally mimics the output of the primary_transcript.py script from OrthoFinder, so dna_segs can be loaded from the original FASTA files, before primary_transcript.py has been run on them. This preserves the FASTA metadata, which primary_transcript.py removes.
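The selection rule can be illustrated in base R on a plain data.frame. This is only a minimal sketch, not the package's implementation: max_by_group_sketch is a hypothetical helper, and two behaviours are assumptions, namely that ties go to the first feature in the group (inferred from the example output further down, where 2A is kept over 2C) and that boundary features are simply set aside before grouping.

```r
# Hypothetical helper, NOT part of the package: keeps one row per group,
# the first row holding the group's maximum in the `longest` column.
max_by_group_sketch <- function(df, group_by, longest = "length",
                                ignore_boundaries = TRUE) {
  # Assumption: rows whose gene_type is "boundaries" are kept as they are.
  is_boundary <- rep(FALSE, nrow(df))
  if (ignore_boundaries && "gene_type" %in% names(df)) {
    is_boundary <- !is.na(df$gene_type) & df$gene_type == "boundaries"
  }
  candidates <- which(!is_boundary)
  # which.max() returns the position of the first maximum, so ties
  # resolve to the earliest row of each group.
  winners <- tapply(candidates, df[[group_by]][candidates], function(idx) {
    idx[which.max(df[[longest]][idx])]
  })
  keep <- sort(c(which(is_boundary), as.vector(winners)))
  df[keep, , drop = FALSE]
}

# Same features as in the Examples section, as a plain data.frame
features <- data.frame(
  name   = c("1A", "1B", "2B", "2A", "2C"),
  length = c(30, 60, 30, 60, 60),
  gene   = c("1", "1", "2", "2", "2"),
  stringsAsFactors = FALSE
)
result <- max_by_group_sketch(features, group_by = "gene")
result$name
#> [1] "1B" "2A"
```

Passing a different column name as longest (for example a score column, if one exists) would select by that column instead, which is how the function can serve purposes other than picking the longest transcript.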

Author

Mike Puijk

Examples

## Prepare dna_seg
names1 <- c("1A", "1B", "2A", "2B", "2C")
genes1 <- c("1", "1", "2", "2", "2")
starts1 <- c(1, 1, 101, 101, 101)
ends1 <- c(30, 60, 160, 130, 160)
lengths1 <- abs(starts1 - ends1)+1

## Make dna_seg
dna_seg_raw <- dna_seg(data.frame(name=names1, start=starts1, end=ends1,
                                  strand=rep(1, 5), length=lengths1,
                                  gene=genes1))
dna_seg_raw
#>      name start   end strand length   gene gene_type region_plot    col   fill
#>    <char> <num> <num>  <num>  <num> <char>    <char>      <char> <char> <char>
#> 1:     1A     1    30      1     30      1    arrows          NA grey20 grey80
#> 2:     1B     1    60      1     60      1    arrows          NA grey20 grey80
#> 3:     2B   101   130      1     30      2    arrows          NA grey20 grey80
#> 4:     2A   101   160      1     60      2    arrows          NA grey20 grey80
#> 5:     2C   101   160      1     60      2    arrows          NA grey20 grey80
#>      lty   lwd   pch   cex
#>    <num> <num> <num> <num>
#> 1:     1     1     8     1
#> 2:     1     1     8     1
#> 3:     1     1     8     1
#> 4:     1     1     8     1
#> 5:     1     1     8     1

## Take longest feature per gene name
dna_seg_edit <- max_by_group(dna_seg_input = dna_seg_raw, group_by = "gene")
dna_seg_edit
#>      name start   end strand length   gene gene_type region_plot    col   fill
#>    <char> <num> <num>  <num>  <num> <char>    <char>      <char> <char> <char>
#> 1:     1B     1    60      1     60      1    arrows          NA grey20 grey80
#> 2:     2A   101   160      1     60      2    arrows          NA grey20 grey80
#>      lty   lwd   pch   cex
#>    <num> <num> <num> <num>
#> 1:     1     1     8     1
#> 2:     1     1     8     1