Glossary
Amplicons: Pieces of DNA formed as the products of natural or artificial amplification events. In the Genome Sequencer FLX System, an amplicon is a PCR product that is flanked by appropriate 'Primer A' and 'Primer B' sequences. Amplicons are used in ultra-deep sequencing for accurate detection of low-frequency mutations. Detection is enabled by the high levels of oversampling obtained when subjecting specific amplicon populations to 454 Sequencing.
Assembly: Putting sequenced fragments of DNA into their correct chromosomal positions.
Bacterial artificial chromosome (BAC): An artificially created chromosome in which medium-sized segments of foreign DNA (pieces of DNA 100 to 300 kilobases in length from another species) are cloned into bacteria. BACs are used to sequence the genetic code of organisms in genome projects. A short piece of the organism's DNA is amplified as an insert in BACs, and then sequenced. Finally, the sequenced parts are rearranged in silico, resulting in the genomic sequence of the organism.
Base: The four chemical building blocks, represented by the letters A, C, G, and T, that compose DNA.
Base pair (bp): Two nitrogenous bases (adenine and thymine or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs.
Complementary DNA (cDNA): DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase. It is often used in gene cloning, as a gene probe, or in the creation of a cDNA library. Partial sequences of cDNAs are often obtained as expressed sequence tags.
ChIP-Sequencing: ChIP-Seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify binding sites of DNA-associated proteins. It is used primarily to determine how transcription factors and other chromatin-associated proteins influence phenotype-affecting mechanisms. Determining how proteins interact with DNA to regulate gene expression is essential for fully understanding many biological processes and disease states. This epigenetic information is complementary to genotype and expression analysis.
Chromosome: The physical structures in cells containing the large stretches of DNA (hundreds of millions of bases) and the information for thousands of genes.
Consensus Accuracy: The percentage of correct calls in the Genome Sequencer system's consensus as compared to the reference sequence.
Contig: A set of overlapping DNA segments derived from a single genetic source in shotgun DNA sequencing projects. Assembled contigs can be ordered into scaffolds and then used to deduce the original DNA sequence of the source.
N50 Contig Size: The contig size such that contigs of that size or larger together contain 50% of the bases of the assembly.
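The N50 definition above can be sketched in a few lines. This is a minimal illustration, assuming the common convention that the N50 is the length of the contig at which the running total, counted from the largest contig down, first reaches half of the assembly. The function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 contig size for a list of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the largest contig down until half of all bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly: 290 bases in total, so the halfway point is 145 bases.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

In this example the two largest contigs (80 + 70 = 150 bases) are the first to cover half of the assembly, so the N50 is 70.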
Cosmid: DNA from a bacterial virus into which a small fragment of a genome is spliced so that it can be amplified and sequenced. A cosmid is an artificially constructed structure. (On a technical level, a cosmid contains the cos site of phage lambda and can be packaged in a lambda phage particle for infection into E. coli, permitting cloning of larger DNA fragments than can be introduced into bacterial hosts in plasmid vectors.)
Coverage Depth: The total number of nucleotides from reads that are mapped to a given position.
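As a rough illustration of coverage depth, the sketch below counts how many read bases land on each reference position. It assumes reads are given as hypothetical `(start, length)` pairs with 0-based start positions and no gaps; real mapping output carries more detail.

```python
def coverage_depth(reads, reference_length):
    """Return a list with the number of read bases mapped to each position.

    reads: iterable of (start, length) pairs, 0-based, ungapped (an assumption).
    """
    depth = [0] * reference_length
    for start, length in reads:
        # Clip each read to the reference boundaries before counting.
        first = max(start, 0)
        last = min(start + length, reference_length)
        for pos in range(first, last):
            depth[pos] += 1
    return depth


# Three hypothetical reads mapped to a 10-base reference.
example_reads = [(0, 5), (2, 5), (4, 4)]
print(coverage_depth(example_reads, 10))
```

Position 4 is covered by all three example reads, so its depth is 3, while the last two positions are not covered at all.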
De novo: Refers to new, previously unknown genomes, which are sequenced and assembled without a reference sequence.
Diploid: A full set of genetic material consisting of paired chromosomes, one from each parental set. Most animal cells except the gametes have a diploid set of chromosomes. The diploid human genome has 46 chromosomes.
DNA: The molecule that encodes genetic information through the sequence of four constituent bases.
DNA Sequence: The precise order of bases for a DNA fragment, a gene, a chromosome, or an entire genome.
Epigenetics: Changes in gene expression without a change in the underlying DNA sequence of the organism.
Expressed Sequence Tag (EST): A tiny portion of an entire gene (200-500 nucleotides long) that can be used to help identify unknown genes and to map their positions within a genome.
Eukaryote: Cell or organism with a membrane-bound, structurally discrete nucleus and other well-developed subcellular compartments. Eukaryotes include all organisms except viruses, bacteria, and blue-green algae.
Flow: During a sequencing Run, nucleotides are flowed sequentially across the PTP device, one at a time, in the cyclical order "TACG", as controlled by the Run script. When the flowed nucleotide is complementary to the next nucleotide (or homopolymer) on the DNA template in any given well, the polymerase extends the nascent DNA strand in that well. Addition of one or more nucleotide(s) releases a corresponding number of pyrophosphate (PPi) molecules. One molecule of ATP is synthesized for each PPi released, causing a flash of light (signal) whose intensity is proportional to the number of nucleotides incorporated.
FlowMapper Software: A software algorithm that aligns flowgrams of sequences against a reference sequence that has been converted into many ideal sub-flowgrams and built into a database.
Flowgram: Data processing extracts information about the signal intensity in each well, over all flows. The signal intensity for each flow is plotted as a function of flow order, yielding a flowgram for the well. The signal intensity is proportional to the number of bases added (linear relationship): if no nucleotide is added in that well during a flow, the signal will be very low (background); if one nucleotide is added, the signal will be similar in intensity to the key signal; if more than one nucleotide is added, the height of the signal will be correspondingly higher.
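The relationship between flows, homopolymers, and flowgram signals can be sketched as follows. This is a simplified, noise-free model, not the Genome Sequencer software: it takes the sequence of the strand being synthesized, cycles through the flow order "TACG", and records how many nucleotides are incorporated per flow. The function names and the `max_flows` limit are hypothetical.

```python
def ideal_flowgram(synthesized, flow_order="TACG", max_flows=400):
    """Return the ideal signal (bases incorporated) for each nucleotide flow."""
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer is incorporated entirely within a single flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals


def call_bases(signals, flow_order="TACG"):
    """Convert (possibly noisy) signals back to bases by rounding each one."""
    called = []
    for i, signal in enumerate(signals):
        called.append(flow_order[i % len(flow_order)] * round(signal))
    return "".join(called)


flows = ideal_flowgram("TTAGGGC")
print(flows)
print(call_bases([2.1, 0.9, 0.1, 2.8, 0.05, 0.0, 1.1]))
```

Rounding each signal to the nearest integer is the simplest possible base caller; it shows why long homopolymers are harder to call, since the same absolute noise matters more as the expected signal grows.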
Flowgram Signal Space: The quantitative representation of any base call (whether a singlet or homopolymer stretch), based on the relative signal intensity generated during an individual nucleotide "flow" (incorporation) step.
Fosmid: A bacterially-propagated phagemid vector system suitable for cloning genomic inserts approximately 40 kilobases in size. Largely supplanted by BACs and PACs for genome mapping and sequencing.
GC-rich area: Many DNA sequences carry long stretches of repeated G and C which often indicate a gene-rich region.
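One simple way to look for GC-rich areas is to compute the fraction of G and C bases in windows along a sequence. The sketch below is illustrative only; the function names, window size, and step are hypothetical choices, and no particular threshold for "GC-rich" is implied by the definition above.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Return (start, gc_fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_fraction("GGCCAATT"))
print(gc_windows("GGCCAATT", window=4, step=4))
```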
Gene: The fundamental unit of heredity. A gene is a specific sequence of bases (usually thousands of bases) located at a specific position on a particular chromosome. Genes are transcribed to produce multiple molecules of mRNA, which are then translated to produce multiple copies of a specific protein.
Genome: All the bases in all the genes on all the chromosomes of a species. The human genome consists of over 3 billion bases.
Genomics: The comprehensive study of all genes and their function in biological pathways.
Genotyping: The measurement of genetic variation between species members.
Haploid: A single set of chromosomes (half the full set of genetic material) present in the egg and sperm cells of animals and in the egg and pollen cells of plants. Human beings have 23 chromosomes in their reproductive cells.
Haplotype: A way of denoting the collective genotype of a number of closely linked loci on a chromosome.
Homopolymer: An uninterrupted stretch of a single nucleotide.
Introns: Non-coding portions of precursor mRNA that interrupt the protein-coding sequence of a gene and are removed before the mature mRNA is formed. Once the introns are spliced out, the resulting mRNA sequence consists of exons, ready to be translated into protein.
Junk DNA: Stretches of DNA that do not code for genes; most of the genome consists of so-called junk DNA which may have regulatory and other functions. Also called non-coding DNA.
Key (or key sequence): The sequencing key is a known sequence of four nucleotides located immediately downstream from the sequencing primer. It is therefore the first to be sequenced in each well.
Library: A library is a collection of DNA fragments representative of the entire DNA sample to be sequenced. Each library is created from user-supplied purified DNA.
Mapped Reads: Sequences from a Genome Sequencer system run that have been aligned against a reference sequence using FlowMapper software, and were found to be above the user-specified threshold.
Metagenomics: The study of genetic material recovered directly from environmental samples.
Methylation: The purpose of a methylation project is to determine which cytosines (those 5' of a guanosine) are methylated. It is thought that the pattern of methylation (also called epigenetic modifications) controls gene expression. To perform a methylation project, two similar samples must be sequenced. The first sample is a reference for the region of interest. The second sample is that same region after treatment with bisulfite. Bisulfite converts non-methylated cytosines to uracil. Upon sequencing, these uracil bases are read as thymines. When the two samples are compared, methylated cytosines are read as cytosines in both samples, while unmethylated cytosines are read as thymines in the bisulfite-treated sample.
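The comparison step described above can be sketched as a position-by-position check. This is a minimal illustration, assuming the reference and the bisulfite-treated sequences are already aligned, of equal length, and on the same strand; the function name `call_methylation` and the example sequences are hypothetical.

```python
def call_methylation(reference, bisulfite):
    """Classify each cytosine that is 5' of a G in the reference.

    Returns a dict mapping the 0-based cytosine position to
    'methylated', 'unmethylated', or 'ambiguous'.
    """
    if len(reference) != len(bisulfite):
        raise ValueError("sequences must be aligned to the same length")
    calls = {}
    for i in range(len(reference) - 1):
        if reference[i] == "C" and reference[i + 1] == "G":
            if bisulfite[i] == "C":
                calls[i] = "methylated"      # protected from conversion
            elif bisulfite[i] == "T":
                calls[i] = "unmethylated"    # converted C -> U, read as T
            else:
                calls[i] = "ambiguous"       # unexpected base at this site
    return calls


# Hypothetical aligned pair: the C at position 1 stays C, the C at position 4 reads as T.
print(call_methylation("ACGTCGAC", "ACGTTGAT"))
```

The final C in the example reference is not followed by a G, so it is not reported even though it was converted.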
MID (Multiplex Identifier): Short sequence that can be introduced immediately downstream from the Key sequence in all templates of a DNA library. MIDs can be recognized by the data analysis software and used to identify the library to which an individual read belongs. This feature allows multiple libraries tagged with different MIDs to be sequenced together, within an individual PTP device.
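Sorting reads by MID can be sketched as a prefix match: strip the key, then look for a known MID at the start of what remains. The key value, MID sequences, and library names below are hypothetical examples, and the sketch requires exact matches, whereas real analysis software may tolerate sequencing errors in the MID.

```python
def demultiplex(reads, mids, key="TCAG"):
    """Assign reads to libraries by the MID found right after the key.

    reads: list of read strings that start with the key.
    mids:  dict mapping a library name to its MID sequence (exact match only).
    key:   example four-nucleotide key (an assumed value for illustration).
    Returns (reads_by_library, unassigned_reads).
    """
    by_library = {name: [] for name in mids}
    unassigned = []
    for read in reads:
        if not read.startswith(key):
            unassigned.append(read)
            continue
        rest = read[len(key):]
        for name, mid in mids.items():
            if rest.startswith(mid):
                # Keep only the insert sequence, without key and MID.
                by_library[name].append(rest[len(mid):])
                break
        else:
            unassigned.append(read)
    return by_library, unassigned


example_mids = {"library_1": "ACGAGT", "library_2": "ACGCTC"}
example_reads = ["TCAGACGAGTGGGTTT", "TCAGACGCTCAAACCC", "TCAGTTTTTTGG", "GGGG"]
print(demultiplex(example_reads, example_mids))
```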
Nucleotide Space: A sequence of nucleotide characters.
non-coding RNA (ncRNA): Any RNA molecule that is not translated into a protein. Includes transfer RNA and ribosomal RNA (the primary constituent of ribosomes).
Oligonucleotide: A molecule usually composed of 25 or fewer nucleotides; used as a DNA synthesis primer.
PAC: P1-artificial chromosome, a bacterially-propagated phagemid vector system suitable for cloning genomic inserts up to several hundred kilobases in size. The human genome is being mapped and sequenced primarily with BACs and PACs.
Paired-end reads: Used to determine the orientation and relative positions of contigs produced by de novo shotgun sequencing and assembly. Also good for identification of genomic structural variations.
- DNA fragmented into 3,000, 8,000 or 20,000 base pairs, adaptors added at ends, circularized to connect at both added adaptors
- Fragmented again with base tags in middle of fragment
- Run PCR and sequencing, able to determine relative direction of base tag nucleotides and distance from one another (each is 3K, 8K, or 20K apart)
Paleogenomics: The comparative analysis of the genomes of several key species to deduce the organization of ancestral genomes and the scenario that has led to the present-day genomes.
Polymerase chain reaction (PCR): The process used to make multiple copies of DNA strands.
PHRED: A software program that reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files.
Pharmacogenomics: The study of the interaction of an individual's genetic makeup and response to a drug.
Primer: Short preexisting polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase.
Prokaryote: Cell or organism lacking a membrane-bound, structurally discrete nucleus and other subcellular compartments. Bacteria are examples of prokaryotes.
Proteomics: The study of the full set of proteins encoded by a genome.
Pyrosequencing: 'Sequencing by synthesis': sequencing of single-stranded DNA by synthesis of the complementary strand one base at a time. Each added nucleotide is detected and recorded.
Q20 Read Length: Read length at which the bases are 99% accurate, and higher for all preceding bases.
Q40+: Bases with 99.99% accuracy.
%Bases Q40+: The portion of an assembled genome with a quality score of 40 or higher.
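The accuracy figures quoted for Q20 (99%) and Q40 (99.99%) are consistent with the Phred-style scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below assumes that scale; the function names and example scores are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred-style quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-style quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def percent_bases_at_least(quality_scores, threshold=40):
    """Percentage of bases whose quality score is at or above the threshold."""
    passing = sum(1 for q in quality_scores if q >= threshold)
    return 100 * passing / len(quality_scores)


print(phred_to_error_prob(20))   # about 0.01, i.e. 99% accuracy
print(phred_to_error_prob(40))   # about 0.0001, i.e. 99.99% accuracy
print(percent_bases_at_least([40, 45, 30, 20]))
```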
Resequencing: Used for determining a change in DNA sequence from a "reference" sequence. It is often performed using PCR to amplify the region of interest (a pre-existing DNA sequence is required to design the PCR primers). Resequencing uses three steps: extraction of DNA or RNA from biological tissue; amplification of the RNA or DNA (often by PCR); and sequencing. The resultant sequence is compared to a reference or a normal sample to detect mutations.
Serial analysis of gene expression (SAGE): A technique used to produce a snapshot of the messenger RNA population in a sample of interest in order to analyze the expressed genes in eukaryotic organisms (gene expression profiling).
Sanger method: The chain termination method. Chain-terminating nucleotides are labeled with dyes, a different color for each of the four bases, and a chromatogram is produced. The method yields long read lengths (about 750 bp).
Scaffold: A series of contigs that are in the right order but not necessarily connected in one continuous stretch of sequence.
Shotgun sequencing: A method used for sequencing long DNA strands to increase speed.
- DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads.
- Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing.
- Computer programs then use the overlapping ends of different reads to assemble them into a contiguous sequence.
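The last step above, joining reads by their overlapping ends, can be sketched with a greedy merge: repeatedly find the pair of reads with the longest suffix-to-prefix overlap and join them. This is a toy illustration, assuming error-free reads from a single strand; production assemblers use far more elaborate methods. The function names, the `min_len` parameter, and the example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap; returns the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACA", "ACATTGA"]))
```

Reads that share no overlap of at least `min_len` bases are left as separate contigs, which mirrors how gaps in coverage split an assembly.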
Single nucleotide polymorphism (SNP): A difference at a single nucleotide position (one base pair), in a coding or non-coding region.
Transcriptome: The set of all messenger RNA (mRNA) molecules, or "transcripts," produced in one cell or a population of cells. The term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type. Unlike the genome, which is roughly fixed for a given cell line (excluding mutations), the transcriptome can vary with external environmental conditions. Because it includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time, with the exception of mRNA degradation phenomena such as transcriptional attenuation. The study of transcriptomics examines the expression level of mRNAs in a given cell population, often using high-throughput techniques based on DNA microarray technology.
Ultra-deep sequencing: Sequencing the same target of DNA, using amplicons, many times to find rare mutations.
16S rRNA: A 1542 nt long component of the small prokaryotic ribosomal subunit. 16S rDNA sequences contain hypervariable regions which can provide species-specific signature sequences useful for bacterial identification to species level, particularly in metagenomic studies.