Creating dna_segs from files

Functions to parse dna_seg objects from different file formats. Support is included for the following file formats: GenBank, EMBL, ptt, FASTA, and tabular data.

Usage

read_dna_seg_from_file(
  file,
  tagsToParse = c("CDS"),
  fileType = "detect",
  meta_lines = 2,
  gene_type = "auto",
  header = TRUE,
  extra_fields = NULL,
  boundariesToParse = NULL,
  read_sequence = FALSE,
  verbose = FALSE,
  ...
)

read_dna_seg_from_files(
  files,
  tagsToParse = c("CDS"),
  fileType = "detect",
  meta_lines = 2,
  gene_type = "auto",
  header = TRUE,
  extra_fields = NULL,
  boundariesToParse = NULL,
  read_sequence = FALSE,
  verbose = FALSE,
  ...
)

read_dna_seg_from_embl(
  file,
  tagsToParse = c("CDS"),
  boundariesToParse = NULL,
  extra_fields = NULL,
  read_sequence = FALSE,
  gene_type = "auto",
  verbose = FALSE,
  ...
)

read_dna_seg_from_genbank(
  file,
  tagsToParse = c("CDS"),
  boundariesToParse = NULL,
  extra_fields = NULL,
  read_sequence = FALSE,
  gene_type = "auto",
  verbose = FALSE,
  ...
)

read_dna_seg_from_fasta(file, read_sequence = FALSE, ...)

read_dna_seg_from_ptt(file, meta_lines = 2, header = TRUE, ...)

read_dna_seg_from_tab(file, header = TRUE, ...)

Arguments

file: A character string containing a file path, or a file connection.
tagsToParse: A character vector of tags to parse for GenBank or EMBL files. Common examples include "CDS", "gene", "tRNA", "repeat_region", and "misc_feature".
fileType: A character string containing the file format to parse. Must be one of: "genbank", "embl", "ptt", "fasta", or "detect". If "detect" is chosen, then it will attempt to determine the file format automatically.
meta_lines: The number of lines in the ptt file that represent metadata, not counting the header lines. Standard for NCBI files is 2 (name and length, number of proteins).
gene_type: A character string, determines how genes are visualized. Must be a valid gene type (see gene_types). For GenBank and EMBL files, if this argument is "auto", the genes will appear as arrows if there are no introns, and as exons (blocks) when there are introns present.
header: Logical. If TRUE, parses the first line of the tabular file as a header containing column names.
extra_fields: A character vector of extra fields to parse for GenBank or EMBL files. These fields will be added as columns in the resulting dna_seg object, unless the field was always empty.
boundariesToParse: A character vector of tags to parse as sequence boundaries for GenBank or EMBL files. Common examples include "source", "contig", "chromosome", and "scaffold".
read_sequence: Logical. If TRUE, will add a sequence column to the dna_seg containing the DNA or amino acid sequence of the features.
verbose: Logical. If TRUE, reports timings whenever it starts parsing a file.
...: Further arguments to pass to as.dna_seg.
files: A list or character vector containing file paths. Supports wildcard expansion (e.g. *.txt).

Value

A list of dna_seg objects for read_dna_seg_from_files, and a single dna_seg object otherwise.
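
The difference between the two return types can be seen with a short sketch. It assumes a genoPlotR build that exports read_dna_seg_from_files and ships the example files used in the Examples section. When either assumption fails, the guard skips the call and `segs` stays NULL.

```r
# Sketch only: the package call is guarded, so this runs without genoPlotR.
segs <- NULL
if (requireNamespace("genoPlotR", quietly = TRUE) &&
    "read_dna_seg_from_files" %in% getNamespaceExports("genoPlotR")) {
  # Bundled example files, the same ones used in the Examples section.
  files <- c(
    system.file("extdata/BG_plasmid.gbk", package = "genoPlotR"),
    system.file("extdata/BG_plasmid.embl", package = "genoPlotR")
  )
  if (all(nzchar(files))) {
    # Several files in, a list of dna_seg objects out.
    segs <- genoPlotR::read_dna_seg_from_files(files)
    print(length(segs))
    # Each list element can be inspected like a single-file result.
    print(head(segs[[1]]))
  }
}
```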

Details

GenBank and EMBL files are two commonly used file types that can hold a great variety of information, so the user has to choose which features to extract. "CDS" features are most often of interest, but other feature tags such as "gene", "misc_feature", or "tRNA" can be parsed as well. If a CDS or gene feature contains an inner "pseudo" tag marking it as a pseudogene, it is reported as a "CDS_pseudo" or "gene_pseudo" feature type, respectively, in the resulting table. In these two file types, the following fields are parsed (in addition to the mandatory name, start, end, and strand): gi (from db_xref=GI), uniprot_id, gene, locus_id (from locus_tag=), proteinid, product, color, and region_plot. The sequence of CDS features can also be read by setting read_sequence = TRUE. In addition, extra tags can be parsed with the argument extra_fields. If a feature has more than one field with a given name, only the first one is parsed.
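
As a rough illustration of feature selection, the sketch below parses two feature tags from the bundled GenBank file and requests one extra field. The choice of "gene" as the second tag and "transl_table" as the extra field is only an example. The call is guarded, so `seg` stays NULL when genoPlotR or the example file is not available.

```r
# Sketch only: guarded so that it runs even without genoPlotR installed.
seg <- NULL
if (requireNamespace("genoPlotR", quietly = TRUE)) {
  gbk <- system.file("extdata/BG_plasmid.gbk", package = "genoPlotR")
  if (nzchar(gbk)) {
    seg <- genoPlotR::read_dna_seg_from_genbank(
      gbk,
      tagsToParse = c("CDS", "gene"),   # feature tags to extract
      extra_fields = "transl_table"     # extra qualifier to keep as a column
    )
    # Count the parsed features per feature type.
    print(table(seg$feature))
  }
}
```

Because an extra field that is always empty is not added, it is safer to check `"transl_table" %in% names(seg)` before using that column.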

Tab files representing DNA segments should have at least the following columns: name, start, end, and strand. If these column names are not all present, or if there is no header at all, then the first 4 columns of the file are assumed to be these 4 columns, in that order. If the tab file does have a header, any additional columns can be supplied as needed, such as line width and type, pch, and/or cex. See dna_seg for more information. An example:

name	start	end	strand	fill
feat1A	2	1345	1	blue
feat1B	1399	2034	1	red
feat1C	2101	2932	-1	grey
feat1D	2800	3120	1	green
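
The example table above can be tried out with a small sketch: it writes the same rows to a temporary file, checks the four mandatory columns with base R, and only then hands the file to read_dna_seg_from_tab if genoPlotR is installed. The temporary file and the variable names are illustrative, not part of the package.

```r
# Write the example table to a temporary, tab-separated file.
tab_file <- tempfile(fileext = ".tab")
writeLines(c(
  "name\tstart\tend\tstrand\tfill",
  "feat1A\t2\t1345\t1\tblue",
  "feat1B\t1399\t2034\t1\tred",
  "feat1C\t2101\t2932\t-1\tgrey",
  "feat1D\t2800\t3120\t1\tgreen"
), tab_file)

# Base R check that the mandatory columns are present in the header.
tab <- read.delim(tab_file, header = TRUE, stringsAsFactors = FALSE)
required <- c("name", "start", "end", "strand")
stopifnot(all(required %in% names(tab)))
print(tab)

# Guarded package call: skipped when genoPlotR is not installed.
if (requireNamespace("genoPlotR", quietly = TRUE)) {
  seg <- genoPlotR::read_dna_seg_from_tab(tab_file, header = TRUE)
  print(names(seg))
}
```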

FASTA files are parsed differently depending on the first sequence header (defline) found in the file. When parsing FASTA files from UniProtKB, metadata is parsed from the deflines, creating extra columns in the resulting dna_seg object, including locus_id, gene, and product. Alternatively, positional information of features is parsed from FASTA files from Ensembl when their headers contain this information. In all other cases, each entry in a FASTA file results in a single feature, and these features are placed one after another in the resulting dna_seg. Some support is included for parsing metadata from FASTA files from NCBI.
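
For the generic case, the "one feature per entry, placed one after another" layout can be pictured with a base R sketch. This is not the package's parser. It assumes 1-based coordinates and no gap between consecutive entries, and the two sequences are made up.

```r
# Two generic FASTA entries written to a temporary file (made-up sequences).
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">entryA", "ATGCATGCATGC", ">entryB", "GGCCTTAA"), fasta_file)

lines <- readLines(fasta_file)
is_defline <- startsWith(lines, ">")

# Assign each sequence line to the entry opened by the preceding defline.
entry_id <- cumsum(is_defline)
entry_length <- as.vector(
  tapply(nchar(lines[!is_defline]), entry_id[!is_defline], sum)
)

# Assumption: entries are laid end to end, 1-based, without gaps.
end <- cumsum(entry_length)
start <- end - entry_length + 1
layout <- data.frame(
  name  = sub("^>", "", lines[is_defline]),
  start = as.integer(start),
  end   = as.integer(end)
)
print(layout)
```

If genoPlotR is installed, the result of `read_dna_seg_from_fasta(fasta_file)` can be compared against this layout. The actual coordinates depend on the package's implementation.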

Ptt (or protein table) files are a tabular format providing information on each protein of a genome (or plasmid, virus, etc.).

Author

Lionel Guy, Jens Roat Kultima, Mike Puijk

Examples

## Read DNA segment from tab
dna_seg3_file <- system.file('extdata/dna_seg3.tab', package = 'genoPlotR')
dna_seg3 <- read_dna_seg_from_tab(dna_seg3_file)

## From GenBank file
bq_genbank <- system.file('extdata/BG_plasmid.gbk', package = 'genoPlotR')
bq <- read_dna_seg_from_file(bq_genbank, fileType = "detect")

## Parsing extra fields in the GenBank file
bq <- read_dna_seg_from_file(bq_genbank,
                             extra_fields = c("db_xref", "transl_table"))
names(bq)
#>  [1] "name"         "start"        "end"          "strand"       "length"      
#>  [6] "gi"           "gene"         "locus_id"     "product"      "proteinid"   
#> [11] "feature"      "gene_type"    "seq_origin"   "db_xref"      "transl_table"
#> [16] "region_plot"  "col"          "fill"         "lty"          "lwd"         
#> [21] "pch"          "cex"         

## From EMBL file
bq_embl <- system.file('extdata/BG_plasmid.embl', package = 'genoPlotR')
bq <- read_dna_seg_from_embl(bq_embl)

## From ptt file
bq_ptt <- system.file('extdata/BQ.ptt', package = 'genoPlotR')
bq <- read_dna_seg_from_ptt(bq_ptt)