Finding the Needle in the Genomic Haystack With DRAGEN
Illumina scientists tailored research solutions to detect genes that cause both common and rare diseases but sit in difficult-to-read regions of the genome
Originally published on Illumina News Center
NORTHAMPTON, MA / ACCESSWIRE / March 8, 2024 / When scientists want to sequence a DNA sample on an Illumina system, they don't try to read all of the roughly 3 billion base pairs of the human genome at once. Instead, they slice the DNA into short fragments of about 500 base pairs that are easier to work with and faster to read.
DNA samples, in the form of a small amount of tissue or fluid, usually contain many cells, and thus many copies of the organism's genome. Once the system captures images of the fragments, it reassembles the data into one complete sequence by comparing where the fragments overlap.
Think of it like tossing several identical copies of a book into a paper shredder, each one at a random angle. You can't reassemble any individual copy like you would a puzzle, because all the pieces are the same shape. (Especially if you don't have an intact book to reference.) But, since each copy of the book was shredded in different random locations from the others, you can match up fragments from different copies based on where the text overlaps.
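The shredded-book idea can be sketched in a few lines of Python. This is only a toy illustration of overlap-based reassembly, not the method any Illumina system or DRAGEN pipeline uses; the function names `overlap_length` and `greedy_assemble`, the `min_overlap` parameter, and the example fragments are all hypothetical. It repeatedly merges the two fragments whose ends overlap the most.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Toy assembler: keep merging the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up fragments cut from different "copies" of the same short sequence.
fragments = ["GTTAGC", "ATGGCGTAC", "GTACGTTA"]
assembled = greedy_assemble(fragments)
print(assembled)  # ['ATGGCGTACGTTAGC']
```

A greedy merge like this works on a tiny, non-repetitive example, but it can join the wrong pieces as soon as the same stretch of text appears in more than one place.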
If the species of interest has never been sequenced before, scientists must rely on these overlaps to assemble the fragments into longer contiguous sequences, or "contigs," and build a reference genome from them. Fortunately, human reference genomes are available thanks to the Human Genome Project, completed in 2003, and the ongoing work of the Genome Reference Consortium. Any two humans share about 99.9% of the same base pairs, so scientists can identify an individual's genetic variants by comparing that person's sequence to existing references.
Unfortunately, in many regions of the human genome, the sequence of base pairs is highly repetitive. Entire genes, many thousands of base pairs long, may be duplicated multiple times, with only a handful of base pair variations to differentiate the copies. Furthermore, the number of duplications of a given gene, and the specific differences between the copies, frequently vary from person to person.
These regions of "high homology" are notoriously difficult to analyze, even with a reference genome available. Fragments from them are likely to "fit" in several possible locations, leaving the system with low confidence that it's aligned them correctly.
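To see why such fragments are ambiguous, consider a minimal sketch in Python. It is an illustration under stated assumptions, not how DRAGEN or any production aligner works: the helper `candidate_positions`, the `max_mismatches` parameter, and the made-up sequences below are hypothetical. The toy reference holds two copies of a "gene" that differ by a single base, and the function lists every position where a read fits within a given mismatch budget.

```python
def candidate_positions(reference: str, read: str, max_mismatches: int = 1) -> list[int]:
    """List every start position where `read` fits `reference` within the mismatch budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Made-up sequences: two gene copies that differ at a single base (A vs. T).
GENE_COPY_A = "ACGTTGCAAGGCT"
GENE_COPY_B = "ACGTTGCTAGGCT"
REFERENCE = "TTTT" + GENE_COPY_A + "CCGGAACC" + GENE_COPY_B + "GGGG"

# A read from the part both copies share fits in two places, even with no mismatches allowed.
print(candidate_positions(REFERENCE, "ACGTTGC", max_mismatches=0))  # [4, 25]

# A read covering the differing base is unique only if no mismatches are tolerated.
print(candidate_positions(REFERENCE, "TGCAAGG", max_mismatches=0))  # [8]
print(candidate_positions(REFERENCE, "TGCAAGG", max_mismatches=1))  # [8, 29]
```

Because real reads contain occasional sequencing errors, an aligner generally has to tolerate some mismatches, which is exactly what lets a read from one gene copy fit the other.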
Many genetic diseases, however, result from having an atypical number of copies of specific genes, or from a variant in just one gene of a multigene family with many copies. To screen for these diseases, sequencing systems and data analysis pipelines must be sophisticated enough to accurately detect variants even in high-homology regions.