The Reference Sequence (RefSeq) database[1] is an open-access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. RefSeq was introduced in 2000.[2][3] The database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes.
Content | |
---|---|
Description | Curated non-redundant sequence database of genomes |
Contact | |
Research center | National Center for Biotechnology Information |
Primary citation | Pruitt KD et al. (2005)[1] |
Access | |
Website | https://www.ncbi.nlm.nih.gov/RefSeq |
For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (121,461 distinct "named" organisms as of July 2022),[4] while GenBank includes sequences for any organism submitted (approximately 504,000 formally described species).[5]
The RefSeq collection comprises several data types of differing origins, so standard categories and identifiers are needed to store each one. Each category is identified by a two-letter accession prefix; the most important are:
Category | Description |
---|---|
NC | Complete genomic molecules |
NG | Incomplete genomic region |
NM | mRNA |
NR | ncRNA |
NP | Protein |
XM | Predicted mRNA model |
XR | Predicted ncRNA model |
XP | Predicted protein model (eukaryotic sequences) |
WP | Predicted protein model (prokaryotic sequences) |
For more details and more categories, see Table 1 in Chapter 18 of the book The Reference Sequence (RefSeq) Database.
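The category table above can be turned into a small lookup. The following is a minimal sketch, not an NCBI tool: the function name `classify_accession` and the regular expression are assumptions made for illustration, and the pattern assumes the common layout of a two-letter prefix, an underscore, an identifier, and an optional `.version` suffix. Prefixes not listed in the table are reported as unknown rather than rejected.

```python
import re

# Prefix meanings copied from the table above. Other RefSeq prefixes exist
# and are reported as "unknown category" by this sketch.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Incomplete genomic region",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "Protein",
    "XM": "Predicted mRNA model",
    "XR": "Predicted ncRNA model",
    "XP": "Predicted protein model (eukaryotic)",
    "WP": "Predicted protein model (prokaryotic)",
}

# Assumed layout: two uppercase letters, "_", an alphanumeric identifier,
# and an optional ".<version>" suffix.
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return (prefix, description, version) for a RefSeq-style accession.

    version is None when the accession carries no ".<version>" suffix.
    """
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix, _identifier, version = match.groups()
    description = REFSEQ_PREFIXES.get(prefix, "unknown category")
    return prefix, description, int(version) if version else None


print(classify_accession("NM_000546.6"))  # ('NM', 'mRNA', 6)
```

Keeping the table as a plain dictionary makes it easy to extend with the additional categories listed in the handbook chapter.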
Several projects to improve RefSeq services are in development at the NCBI, often in collaboration with research centers such as EMBL-EBI.
According to RefSeq release 213 (July 2022), the number of species represented in the database, counted as distinct taxonomic IDs, is as follows:[4]
Group | Species |
---|---|
Archaea | 1443 |
Bacteria | 69122 |
Fungi | 16869 |
Invertebrate | 5715 |
Mitochondrion | 13648 |
Plant | 9177 |
Plasmid | 6073 |
Plastid | 9430 |
Protozoa | 746 |
Vertebrate (mammalian) | 1509 |
Viral | 11620 |
Vertebrate (other) | 5237 |
Other | 4 |
Complete | 121461 |
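The "Complete" row is not the sum of the rows above it. A quick check, sketched below with the counts copied from the table, shows that the per-group counts add up to more than the overall total. This suggests the groups overlap, presumably because organelle and plasmid records belong to organisms that are also counted in their own taxonomic group; that explanation is an inference from the numbers, not a statement from the release statistics.

```python
# Species counts per group, copied from the release 213 table above.
species_by_group = {
    "Archaea": 1443,
    "Bacteria": 69122,
    "Fungi": 16869,
    "Invertebrate": 5715,
    "Mitochondrion": 13648,
    "Plant": 9177,
    "Plasmid": 6073,
    "Plastid": 9430,
    "Protozoa": 746,
    "Vertebrate (mammalian)": 1509,
    "Viral": 11620,
    "Vertebrate (other)": 5237,
    "Other": 4,
}
complete_total = 121461  # the "Complete" row

row_sum = sum(species_by_group.values())
excess = row_sum - complete_total

print(row_sum)  # 150593
print(excess)   # 29132 taxonomic IDs are counted in more than one group
```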
The counts of accessions and of base pairs (or residues, for proteins) per molecule type are:[4]
Molecule type | Accessions | Base pairs / residues |
---|---|---|
Genomic | 40,758,769 | 2.923212393984×10¹² |
RNA | 45,781,716 | 1.22253022047×10¹¹ |
Protein | 234,520,053 | 9.129062394×10¹⁰ |
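Dividing each total by its accession count gives a rough mean record length per molecule type. The sketch below is illustrative only: the integers are the table's values written out in full from their scientific notation, and the helper name `mean_length` is hypothetical.

```python
# (accessions, base pairs or residues) per molecule type, from the table above.
totals = {
    "genomic": (40_758_769, 2_923_212_393_984),
    "rna": (45_781_716, 122_253_022_047),
    "protein": (234_520_053, 91_290_623_940),
}


def mean_length(accessions, units):
    """Mean number of base pairs (or residues) per accession."""
    return units / accessions


mean_lengths = {name: mean_length(*pair) for name, pair in totals.items()}

for name, value in mean_lengths.items():
    print(f"{name}: {value:,.1f}")
```

On these figures the mean genomic record is on the order of tens of thousands of base pairs, the mean RNA record a few thousand bases, and the mean protein record a few hundred residues.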