The human genome is (almost) complete — here’s what’s left to do

The release of the draft human genome sequence in 2001 was a seismic moment for genetics, and it paved the way for advances in our understanding of the genomic basis of human biology and disease.

But sections were left unsequenced, and some sequence information was incorrect. Now, two decades later, we have a much more complete version, published as a preprint (which is yet to undergo peer review) by an international consortium of researchers.

Technological limitations meant the original draft human genome sequence covered just the “euchromatic” portion of the genome — the 92% of our genome where most genes are found, and which is most active in making gene products such as RNA and proteins.

The newly updated sequence fills in most of the remaining gaps, bringing the total to 3.055 billion base pairs (“letters”) of our DNA code. The data has been made publicly available, in the hope that other researchers will use it to further their research.

Why did it take 20 years?

Much of the newly sequenced material is the “heterochromatic” part of the genome, which is more “tightly packed” than the euchromatic genome and contains many highly repetitive sequences that are very challenging to read accurately.

These regions were once thought not to contain any important genetic information, but they are now known to contain genes involved in fundamentally important processes such as the formation of organs during embryonic development. Among the 200 million newly sequenced base pairs are an estimated 115 genes predicted to be involved in producing proteins.

Two key factors made the completion of the human genome possible:

1. Choosing a very special cell type

The newly published genome sequence was created using human cells derived from a very rare type of tissue called a complete hydatidiform mole, which occurs when a fertilized egg loses all the genetic material contributed to it by the mother.

Most cells contain two copies of each chromosome, one from each parent, and the two copies differ in their DNA sequence. A cell from a complete hydatidiform mole has two copies of the father’s chromosomes only, so the two chromosomes in each pair are identical in sequence. This makes the full genome sequence much easier to piece together.

2. Advances in sequencing technology

After decades of glacial progress, the Human Genome Project achieved its 2001 breakthrough by pioneering a method called “shotgun sequencing”, which involved breaking the genome into very small fragments of about 200 base pairs, cloning them inside bacteria, deciphering their sequences, and then piecing them back together like a giant jigsaw.

This was the main reason the original draft covered only the euchromatic regions of the genome — only these regions could be reliably sequenced using this method.
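The jigsaw analogy can be made concrete with a toy example. The sketch below is a minimal illustration, not the pipeline the Human Genome Project used: it assumes error-free fragments and repeatedly merges the two fragments with the longest overlap (a simple "greedy" strategy). The function names, the 20-letter toy genome and the minimum overlap of 3 letters are all made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair of fragments until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained inside another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join: the remaining pieces stay separate
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Toy genome cut into 8-letter fragments that overlap by 4 letters.
toy_genome = "ATGCGTACGTTAGCCTAGGA"
toy_fragments = [toy_genome[i:i + 8] for i in range(0, 13, 4)]
assembled = greedy_assemble(toy_fragments)
print(assembled)
```

On this toy input the four fragments chain back into the original 20-letter string. Real genomes are far harder, because the same overlap can occur in many places once sequences repeat.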

The latest sequence was deduced using two complementary new DNA-sequencing technologies. One was developed by PacBio and allows longer DNA fragments to be sequenced with very high accuracy. The second, developed by Oxford Nanopore, produces ultra-long stretches of continuous DNA sequence. These new technologies allow the jigsaw pieces to be thousands or even millions of base pairs long, making the genome much easier to assemble.
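A small sketch shows why longer pieces help with repetitive DNA. It assumes idealised, error-free reads and two made-up genomes that contain the same 8-letter repeat three times but differ in the order of the short stretches between the copies. The names `read_multiset`, `REPEAT`, `genome_a` and `genome_b` are hypothetical. When every read is too short to span a repeat plus a letter on each side, the two genomes yield exactly the same collection of reads, so no assembler could tell them apart. Longer reads remove the ambiguity.

```python
from collections import Counter


def read_multiset(genome, read_len):
    """Every substring of length `read_len`, counted with multiplicity.

    This stands in for a perfect, error-free set of sequencing reads.
    """
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "ACGTTGCA"  # hypothetical 8-letter repeat, present three times

# Same building blocks, but the stretches between the repeats are swapped.
genome_a = "GG" + REPEAT + "TTT" + REPEAT + "CCC" + REPEAT + "GG"
genome_b = "GG" + REPEAT + "CCC" + REPEAT + "TTT" + REPEAT + "GG"

# Reads of 9 letters cannot span a repeat with a flanking letter on both sides.
short_reads_match = read_multiset(genome_a, 9) == read_multiset(genome_b, 9)

# Reads of 12 letters can, so they carry information about the true order.
long_reads_match = read_multiset(genome_a, 12) == read_multiset(genome_b, 12)

print(short_reads_match, long_reads_match)  # True False
```

The same logic scales up: heterochromatic regions contain repeats far longer than the fragments used in 2001, so only reads long enough to bridge them can place the surrounding sequence unambiguously.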

The new information has the potential to advance our understanding of human biology, including how chromosomes function and maintain their structure. It should also improve our understanding of genetic conditions such as Down syndrome that have an underlying chromosomal abnormality.

Is the genome now completely sequenced?

Well, no. An obvious omission is the Y chromosome, because the complete hydatidiform mole cells used to compile this sequence contained two identical copies of the X chromosome and no Y. However, work on the Y chromosome is underway, and the researchers anticipate their method can sequence it accurately despite its highly repetitive sequences.

Even though sequencing the (almost) complete genome of a human cell is an extremely impressive landmark, it is just one of several crucial steps towards fully understanding humans’ genetic diversity.

The next job will be to study the genomes of diverse populations (the complete hydatidiform mole cells were European). Once the new technology has matured sufficiently to be used routinely to sequence many human genomes from different populations, this work will be better positioned to make a significant impact on our understanding of human history, biology, and health.

Both care and technological development are needed to ensure this research reflects the full diversity of the human genome, so that discoveries are not limited to specific populations and existing health disparities are not made worse.

Article by Melissa Southey, Chair Precision Medicine, Monash University and Tu Nguyen-Dumont, Senior research fellow, Monash University

This article is republished from The Conversation under a Creative Commons license. Read the original article.


Nvidia and Harvard develop AI tool that speeds up genome analysis

Researchers affiliated with Nvidia and Harvard today detailed AtacWorks, a machine learning toolkit designed to bring down the cost and time needed for rare and single-cell experiments. In a study published in the journal Nature Communications, the coauthors showed that AtacWorks can run analyses on a whole genome in just half an hour compared with the multiple hours traditional methods take.

Most cells in the body carry a complete copy of a person’s DNA, with billions of base pairs crammed into the nucleus. But an individual cell pulls out only the subsection of genetic components that it needs to function, with cell types like liver, blood, or skin cells using different genes. The regions of DNA that determine a cell’s function are easily accessible, more or less, while the rest are wound up around proteins.

AtacWorks, which is available from Nvidia’s NGC hub of GPU-optimized software, works with ATAC-seq, a method for finding open areas in the genome of cells that was pioneered by Harvard professor Jason Buenrostro, one of the paper’s coauthors. ATAC-seq measures the intensity of a signal at every spot on the genome. Peaks in the signal correspond to regions with open DNA. The fewer cells available, the noisier the data appears, making it difficult to identify which areas of the DNA are accessible.

ATAC-seq typically requires tens of thousands of cells to get a clean signal. Applying AtacWorks produces the same quality of results with just tens of cells, according to the coauthors.

AtacWorks was trained on labeled pairs of matching ATAC-seq datasets, one high-quality and one noisy. Given a downsampled copy of the data, the model learned to predict an accurate high-quality version and identify peaks in the signal. Using AtacWorks, the researchers found that they could spot accessible chromatin, a complex of DNA and protein whose primary function is packaging long molecules into more compact structures, in a noisy sequence of 1 million reads nearly as well as traditional methods did with a clean dataset of 50 million reads.
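The idea of a matched noisy and clean pair can be sketched in a few lines. This is a toy simulation under stated assumptions, not AtacWorks code and not real ATAC-seq data: the genome length, bin size, peak locations and all function names are invented for illustration. It keeps 2% of the reads, mirroring the ratio of 1 million to 50 million mentioned above, and builds the (noisy, clean) signal tracks that a denoising model could be trained on.

```python
import random

GENOME_LEN = 10_000   # toy genome length in base pairs (assumption)
BIN_SIZE = 100        # resolution of the signal track (assumption)
PEAKS = [(2_000, 2_300), (6_500, 6_900)]  # hypothetical open regions


def simulate_reads(n_reads, rng):
    """Draw read positions, 20x more likely inside an open region than outside."""
    weights = [
        20 if any(start <= pos < end for start, end in PEAKS) else 1
        for pos in range(GENOME_LEN)
    ]
    return rng.choices(range(GENOME_LEN), weights=weights, k=n_reads)


def signal_track(read_positions):
    """Count reads per bin: a crude stand-in for the ATAC-seq signal."""
    track = [0] * (GENOME_LEN // BIN_SIZE)
    for pos in read_positions:
        track[pos // BIN_SIZE] += 1
    return track


def make_training_pair(clean_reads, fraction, rng):
    """Return (noisy_track, clean_track); the noisy track keeps `fraction` of reads."""
    n_keep = round(len(clean_reads) * fraction)
    noisy_reads = rng.sample(clean_reads, n_keep)  # downsample without replacement
    return signal_track(noisy_reads), signal_track(clean_reads)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean_reads = simulate_reads(50_000, rng)
noisy_track, clean_track = make_training_pair(clean_reads, 0.02, rng)

# A model would be trained to map noisy_track to clean_track (not shown here).
print(len(noisy_track), sum(noisy_track), sum(clean_track))
```

Plotting the two tracks side by side would show the same peaks in both, but with far more bin-to-bin jitter in the downsampled one. That jitter is what a trained denoiser has to remove.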

AtacWorks could allow scientists to conduct research with a smaller number of cells, reducing the cost of sample collection and sequencing. Analysis, too, could become faster and cheaper. Running on Nvidia Tensor Core GPUs, AtacWorks took under 30 minutes for inference on a genome, a process that would take 15 hours on a system with 32 CPU cores.

In the Nature Communications paper, the Harvard researchers applied AtacWorks to a dataset of stem cells that produce red and white blood cells — rare subtypes that couldn’t be studied with traditional methods. With a sample set of only 50 cells, the team was able to use AtacWorks to identify distinct regions of DNA associated with cells that develop into white blood cells, and separate sequences that correlate with red blood cells.

“With very rare cell types, it’s not possible to study differences in their DNA using existing methods,” Nvidia researcher Avantika Lal, first author on the paper, said. “AtacWorks can help not only drive down the cost of gathering chromatin accessibility data, but also open up new possibilities in drug discovery and diagnostics.”

