Create comparisons between DNA segments

Create a list of comparison objects from a list of dna_seg objects or files by parsing (and, where needed, executing) sequence alignments. If the sequence alignment result files already exist, they will be parsed. If not, DIAMOND or a BLAST program is executed to generate the sequence alignment results between the dna_segs, respecting the order of the dna_segs. Executing DIAMOND or BLAST requires that the command-line implementations of these tools are installed.

Usage

comparisons_from_dna_segs(
  dna_segs = NULL,
  seg_labels = NULL,
  files = NULL,
  mode = "full",
  tool = "blast",
  algorithm = "blastp",
  sensitivity = "default",
  output_path = NULL,
  all_vs_all = FALSE,
  filt_high_evalue = NULL,
  filt_low_per_id = NULL,
  filt_length = "auto",
  use_cache = TRUE,
  verbose = FALSE,
  ...
)

Arguments

dna_segs: A list of dna_seg objects to create comparisons between. Either dna_segs or files must be provided.
seg_labels: A character vector containing DNA segment labels.
files: A character vector, containing file paths to the FASTA or GenBank files. The comparisons will be made between these files. Either dna_segs or files must be provided.
mode: Determines how the comparisons will be filtered. One of "besthit", "bidirectional", or "full". If mode is "besthit", only the best hit is kept for each input query (see best_hit). If mode is "bidirectional", hits are only kept if they are the best hits for their query in both directions (see bidirectional_best_hit). "full" means that all sequence alignment results are considered.
tool: Choice of sequence alignment tool. Either "blast" or "diamond".
algorithm: Choice of BLAST algorithm to run. One of: "blastp", "blastp-fast", "blastp-short", "tblastx", "blastn", "blastn-short", "megablast", or "dc-megablast".
sensitivity: Choice of sensitivity option when running DIAMOND. One of: "fast", "default", "mid-sensitive", "sensitive", "more-sensitive", "very-sensitive", or "ultra-sensitive".
output_path: Path to the folder that will contain the output files. Both the sequence alignment result and the FASTA files used to make them will be stored here.
all_vs_all: Logical. If TRUE, sequence alignments will be performed for every combination of the inputs, instead of just the ones necessary for plotting. Note that this can take a long time, so use with caution.
filt_high_evalue: A numeric value. Filters out all comparisons with an e-value higher than this value (unfiltered when left as NULL).
filt_low_per_id: A numeric value. Filters out all comparisons with a percentage identity lower than this value (unfiltered when left as NULL).
filt_length: A number indicating the minimum length required for hits, or "auto". If "auto", it will be determined based on the choice of tool and algorithm (150 for DIAMOND or any blastp algorithm, 450 for tblastx, 900 for any blastn algorithm).
use_cache: Logical. If FALSE, the function never checks for existing files. This includes the FASTA files used as input for sequence alignment, the database files used by DIAMOND and BLAST, and the sequence alignment results themselves.
verbose: Logical. If TRUE, reports timings when creating new files.
...: Arguments to pass to other functions (the functions executing the sequence alignment tools, run_blast and run_diamond).
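
The "auto" rule described for filt_length can be sketched as a small helper. This is only an illustration of the documented thresholds, not the package's own code: auto_filt_length is a hypothetical name, and the sketch assumes that "megablast" and "dc-megablast" count as blastn algorithms.

```r
# Hypothetical helper, not part of the package: mirrors the documented
# "auto" thresholds for filt_length.
auto_filt_length <- function(tool = "blast", algorithm = "blastp") {
  # DIAMOND and every blastp variant use the shortest minimum length.
  if (tool == "diamond" || startsWith(algorithm, "blastp")) {
    return(150)
  }
  if (algorithm == "tblastx") {
    return(450)
  }
  # Assumption: "blastn", "blastn-short", "megablast", and "dc-megablast"
  # are all treated as blastn algorithms.
  900
}

auto_filt_length("diamond")             # 150
auto_filt_length("blast", "tblastx")    # 450
auto_filt_length("blast", "megablast")  # 900 under the assumption above
```

Passing an explicit number as filt_length bypasses this rule entirely.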

Value

A list of comparison objects.

Details

Unless use_cache is set to FALSE, this function looks for the required files using a combination of the seg_labels (if these are provided) and the names of the dna_segs or files that were provided as input. If it cannot find sequence alignment results named in the form "query_subject" (in other words, "dna_seg1_dna_seg2"), it runs DIAMOND or BLAST to generate these results. The same naming system is used to look for the FASTA files required as input for the sequence alignment.
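
As an illustration of the "query_subject" naming, the sketch below builds the names that would be looked up for three hypothetical labels. It assumes that the comparisons necessary for plotting are those between neighbouring dna_segs, and that all_vs_all covers each unordered pair once. Neither detail is confirmed here, and file extensions are left out because they are not specified.

```r
# Hypothetical labels, used only for illustration.
seg_labels <- c("genome1", "genome2", "genome3")

# Assumed default (all_vs_all = FALSE): each dna_seg is paired with the
# next one in the list, following the input order.
adjacent <- paste(head(seg_labels, -1), tail(seg_labels, -1), sep = "_")
print(adjacent)
# "genome1_genome2" "genome2_genome3"

# Assumed all_vs_all = TRUE: every unordered pair of inputs.
all_pairs <- apply(combn(seg_labels, 2), 2, paste, collapse = "_")
print(all_pairs)
# "genome1_genome2" "genome1_genome3" "genome2_genome3"
```

The number of pairs grows quadratically with the number of inputs, which is why all_vs_all can take a long time.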

If output_path is left as NULL, the current working directory will be used instead.

Author

Mike Puijk

Examples

if (FALSE) { # \dontrun{
## Comparisons from a vector of GenBank files using DIAMOND
comparisons <- comparisons_from_dna_segs(
  files = c("genome1.gb", "genome2.gb", "genome3.gb"),
  tool = "diamond",
  output_path = "output/diamond",
  sensitivity = "very-sensitive",
  verbose = TRUE
)
} # }