Creating dna_segs from files
Functions to parse dna_seg objects from different file formats. Support is
included for the following file formats: GenBank, EMBL, ptt, FASTA, and
tabular data.
Usage
read_dna_seg_from_file(
  file,
  tagsToParse = c("CDS"),
  fileType = "detect",
  meta_lines = 2,
  gene_type = "auto",
  header = TRUE,
  extra_fields = NULL,
  boundariesToParse = NULL,
  read_sequence = FALSE,
  verbose = FALSE,
  ...
)
read_dna_seg_from_files(
  files,
  tagsToParse = c("CDS"),
  fileType = "detect",
  meta_lines = 2,
  gene_type = "auto",
  header = TRUE,
  extra_fields = NULL,
  boundariesToParse = NULL,
  read_sequence = FALSE,
  verbose = FALSE,
  ...
)
read_dna_seg_from_embl(
  file,
  tagsToParse = c("CDS"),
  boundariesToParse = NULL,
  extra_fields = NULL,
  read_sequence = FALSE,
  gene_type = "auto",
  verbose = FALSE,
  ...
)
read_dna_seg_from_genbank(
  file,
  tagsToParse = c("CDS"),
  boundariesToParse = NULL,
  extra_fields = NULL,
  read_sequence = FALSE,
  gene_type = "auto",
  verbose = FALSE,
  ...
)
read_dna_seg_from_fasta(file, read_sequence = FALSE, ...)
read_dna_seg_from_ptt(file, meta_lines = 2, header = TRUE, ...)
read_dna_seg_from_tab(file, header = TRUE, ...)
Arguments
- file
A character string containing a file path, or a file connection.
- tagsToParse
A character vector of tags to parse for GenBank or EMBL files. Common examples include "CDS", "gene", "tRNA", "repeat_region", and "misc_feature".
- fileType
A character string containing the file format to parse. Must be one of: "genbank", "embl", "ptt", "fasta", or "detect". If "detect" is chosen, then it will attempt to determine the file format automatically.
- meta_lines
The number of lines in the ptt file that represent metadata, not counting the header lines. Standard for NCBI files is 2 (name and length, number of proteins).
- gene_type
A character string, determines how genes are visualized. Must be a valid gene type (see gene_types). For GenBank and EMBL files, if this argument is "auto", the genes will appear as arrows if there are no introns, and as exons (blocks) when there are introns present.
- header
Logical. If TRUE, parses the first line of the tabular file as a header containing column names.
- extra_fields
A character vector of extra fields to parse for GenBank or EMBL files. These fields will be added as columns in the resulting dna_seg object, unless the field was always empty.
- boundariesToParse
A character vector of tags to parse as sequence boundaries for GenBank or EMBL files. Common examples include "source", "contig", "chromosome", and "scaffold".
- read_sequence
Logical. If TRUE, will add a sequence column to the dna_seg containing the DNA or amino acid sequence of the features.
- verbose
Logical. If TRUE, reports timings whenever it starts parsing a file.
- ...
Further arguments to pass to as.dna_seg.
- files
A list or character vector containing file paths. Supports wildcard expansion (e.g. *.txt).
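The wildcard support described for files can be pictured with base R's Sys.glob(). The sketch below is only an illustration under that assumption: it does not show how read_dna_seg_from_files expands patterns internally, and the scratch directory and file names are made up for the demonstration.

```r
# Create a scratch directory with a few empty files (made-up names).
scratch <- file.path(tempdir(), "dna_seg_glob_demo")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch, c("seg1.tab", "seg2.tab", "notes.txt"))))

# Expand a wildcard pattern into concrete paths with base R.
# Assumption: this mimics, but is not, the package's own expansion.
tab_files <- Sys.glob(file.path(scratch, "*.tab"))
basename(tab_files)
```

A character vector such as tab_files, or a pattern like the one passed to Sys.glob() here, is the kind of value the files argument is described as accepting.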
Details
GenBank and EMBL files are two commonly used file types that often contain a great variety of information. To properly extract data from these files, the user has to choose which features to extract. Commonly "CDS" features are of interest, but other feature tags such as "gene", "misc_feature", or "tRNA" may be of interest as well. Should a feature contain an inner "pseudo" tag indicating that this CDS or gene is a pseudogene, it will be presented as a "CDS_pseudo" or a "gene_pseudo" feature type, respectively, in the resulting table. In these two file types, the following fields are parsed (in addition to the mandatory name, start, end, and strand): gi (from db_xref=GI), uniprot_id, gene, locus_id (from locus_tag=), proteinid, product, color, and region_plot. The sequence itself from CDS tags can also be read using read_sequence = TRUE. In addition, extra tags can be parsed with the argument extra_fields. If there is more than one field for a given name, only the first one is parsed.
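Two of the rules above, the "_pseudo" suffix and keeping only the first field of a given name, can be sketched in base R. The helpers label_feature() and first_qualifier() are hypothetical and are not part of the package; the qualifier strings are made-up example data, and the package's actual parser may work differently.

```r
# Hypothetical helper (not a package function): append "_pseudo" to CDS or
# gene features whose qualifiers include a bare /pseudo flag.
label_feature <- function(tag, qualifiers) {
  is_pseudo <- "/pseudo" %in% qualifiers
  if (is_pseudo && tag %in% c("CDS", "gene")) paste0(tag, "_pseudo") else tag
}

# Hypothetical helper: return the first value found for a qualifier name,
# or NA when the qualifier is absent.
first_qualifier <- function(qualifiers, name) {
  pattern <- paste0("^/", name, "=")
  hits <- grep(pattern, qualifiers, value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  gsub('"', "", sub(pattern, "", hits[1]))
}

# Made-up qualifiers for one feature; db_xref appears twice on purpose.
quals <- c('/locus_tag="EX_0001"', "/pseudo",
           '/db_xref="GI:123"', '/db_xref="GeneID:456"')

label_feature("CDS", quals)
label_feature("tRNA", quals)
first_qualifier(quals, "db_xref")
first_qualifier(quals, "product")
```

In this sketch only the first db_xref value survives, which mirrors the "only the first one is parsed" rule stated above.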
Tab files representing DNA segments should have at least the following columns: name, start, end, and strand. If these column names are not all present, or if there is no header at all, then these 4 columns are assumed to be the first 4 columns of the file, in that order. If the tab file does have a header, then any additional columns can be supplied as needed, such as line width and type, pch, and/or cex. See dna_seg for more information. An example:
name | start | end | strand | fill |
feat1A | 2 | 1345 | 1 | blue |
feat1B | 1399 | 2034 | 1 | red |
feat1C | 2101 | 2932 | -1 | grey |
feat1D | 2800 | 3120 | 1 | green |
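As a minimal sketch, the example table can be written to a temporary file and checked with base R before it is handed to the reader. This assumes the file is tab-separated; the call to read_dna_seg_from_tab is guarded so the snippet also runs where genoPlotR is not installed.

```r
# Write the example table to a temporary tab-separated file.
tab_path <- tempfile(fileext = ".tab")
writeLines(c(
  "name\tstart\tend\tstrand\tfill",
  "feat1A\t2\t1345\t1\tblue",
  "feat1B\t1399\t2034\t1\tred",
  "feat1C\t2101\t2932\t-1\tgrey",
  "feat1D\t2800\t3120\t1\tgreen"
), tab_path)

# Inspect it with base R to confirm the four mandatory columns are present.
tab <- read.delim(tab_path, header = TRUE, stringsAsFactors = FALSE)
required <- c("name", "start", "end", "strand")
all(required %in% names(tab))

# If genoPlotR is installed, the same file can be passed to the reader.
if (requireNamespace("genoPlotR", quietly = TRUE)) {
  seg <- genoPlotR::read_dna_seg_from_tab(tab_path)
  print(head(seg))
}
```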
FASTA files are parsed differently depending on the first sequence header (defline) found in the file. When parsing FASTA files from UniProtKB, metadata is parsed from the deflines, creating extra columns in the resulting dna_seg object, including locus_id, gene, and product. Alternatively, positional information of features is parsed from Ensembl FASTA files when their headers contain it. In all other cases, each entry in a FASTA file results in a single feature, and the features are placed one after another in the resulting dna_seg. Some support is included for parsing metadata from FASTA files from NCBI.
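The fallback case, one feature per FASTA entry with the features placed end to end, can be sketched in base R as follows. This is an illustration and not the package's implementation: it assumes features are laid out without gaps starting at position 1, that every entry has at least one sequence line, and the sequences themselves are made up.

```r
# Made-up FASTA content: two entries, the first spanning two sequence lines.
fasta_lines <- c(">seqA", "ATGCATGC", "ATG", ">seqB", "GGGCCC")

# Assign each line to its entry by counting deflines seen so far.
is_header <- startsWith(fasta_lines, ">")
entry_id <- cumsum(is_header)

# Total sequence length per entry (assumes no entry is empty).
seq_lengths <- as.vector(
  tapply(nchar(fasta_lines[!is_header]), entry_id[!is_header], sum)
)

# Place the entries one after another (assumption: no gaps in between).
ends <- cumsum(seq_lengths)
starts <- ends - seq_lengths + 1

features <- data.frame(
  name = sub("^>", "", fasta_lines[is_header]),
  start = starts,
  end = ends,
  strand = 1
)
features
```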
Ptt (or protein table) files are a tabular format providing information on each protein of a genome (or plasmid, or virus, etc.).
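To make the role of meta_lines concrete, the sketch below writes a tiny ptt-style file and reads it with base R, skipping the metadata lines before the header. The file content is made up, and the column layout (Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product) follows the usual NCBI convention as an assumption; it is not taken from this page and does not show how read_dna_seg_from_ptt works internally.

```r
# A tiny made-up ptt file: two metadata lines, a header line, then rows.
ptt_path <- tempfile(fileext = ".ptt")
writeLines(c(
  "Example plasmid - 1..5000",
  "2 proteins",
  "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
  "10..399\t+\t129\t1001\tgenA\tEX_0001\t-\t-\thypothetical protein",
  "450..1199\t-\t249\t1002\tgenB\tEX_0002\t-\t-\ttransporter"
), ptt_path)

# Skip the metadata lines, then treat the next line as the header.
meta_lines <- 2
ptt <- read.delim(ptt_path, skip = meta_lines, header = TRUE,
                  stringsAsFactors = FALSE)

# Split "start..end" into numeric coordinates and recode the strand.
coords <- do.call(rbind, strsplit(ptt$Location, "..", fixed = TRUE))
ptt$start <- as.integer(coords[, 1])
ptt$end <- as.integer(coords[, 2])
ptt$strand <- ifelse(ptt$Strand == "+", 1, -1)

ptt[, c("Gene", "start", "end", "strand")]
```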
Examples
## Read DNA segment from tab
dna_seg3_file <- system.file('extdata/dna_seg3.tab', package = 'genoPlotR')
dna_seg3 <- read_dna_seg_from_tab(dna_seg3_file)
## From GenBank file
bq_genbank <- system.file('extdata/BG_plasmid.gbk', package = 'genoPlotR')
bq <- read_dna_seg_from_file(bq_genbank, fileType = "detect")
## Parsing extra fields in the GenBank file
bq <- read_dna_seg_from_file(bq_genbank,
                             extra_fields = c("db_xref", "transl_table"))
names(bq)
#> [1] "name" "start" "end" "strand" "length"
#> [6] "gi" "gene" "locus_id" "product" "proteinid"
#> [11] "feature" "gene_type" "seq_origin" "db_xref" "transl_table"
#> [16] "region_plot" "col" "fill" "lty" "lwd"
#> [21] "pch" "cex"
## From EMBL file
bq_embl <- system.file('extdata/BG_plasmid.embl', package = 'genoPlotR')
bq <- read_dna_seg_from_embl(bq_embl)
## From ptt files
bq_ptt <- system.file('extdata/BQ.ptt', package = 'genoPlotR')
bq <- read_dna_seg_from_ptt(bq_ptt)