ALLPATHS-LG, a new standard for assembling a billion-piece genome puzzle

For starters, genome sequencing requires a specialized laboratory machine to read DNA into a sequence of four different letters, or bases. Yet that sequence does not flow out in one perfect, continuous string. It comes out in many small pieces (called “reads”), which must be assembled into a single coherent whole using complicated computational methods.

In a paper published in the December 27 early edition of the Proceedings of the National Academy of Sciences (PNAS), Broad Institute scientists unveil ALLPATHS-LG, a new method that promises a cheaper and more accurate way of putting together all these genomic pieces.

“It’s called ALLPATHS because it sees multiple paths through the data, making possible more accurate assemblies,” said Iain MacCallum, a co-author and co-leader of the genome assembly group at the Broad. “LG means it can now assemble large genomes — for example, the human genome.”

The new method uses the cheapest form of DNA sequence data currently available. Over the past several years, sequencing technology has rocketed forward, yielding a thousand-fold drop in costs. While these new data are considerably less expensive to generate, they are also much more difficult to use: the reads are about 100 bases long, roughly one-eighth the length typically achieved with the older form of sequencing (known as “capillary” sequencing).

To assemble a large genome, a very large number of the new short reads are required. “Think about a puzzle with a billion or so pieces looking nearly alike,” said Chad Nusbaum, co-director of the Broad’s Genome Sequencing and Analysis Program and a co-author of the PNAS paper. “It’s very difficult to put those pieces back together correctly.”
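To give a rough sense of the scale behind the “billion or so pieces,” the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration, not part of ALLPATHS-LG: the 3-billion-base genome size is an assumed round figure for a human-sized genome, the function names are hypothetical, and “coverage” (the average number of reads overlapping each base) is a standard sequencing term that the article itself does not use.

```python
# Back-of-the-envelope read arithmetic (illustrative assumptions only).

def reads_needed(genome_size, read_length, coverage):
    """Number of reads required to cover the genome `coverage` times on average."""
    total_bases = genome_size * coverage
    # Ceiling division using integers only.
    return -(-total_bases // read_length)

def coverage_from_reads(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return num_reads * read_length / genome_size

GENOME_SIZE = 3_000_000_000  # assumed: roughly human-sized genome
READ_LENGTH = 100            # from the article: reads are about 100 bases long

# Reads needed to cover every base just once on average.
print(reads_needed(GENOME_SIZE, READ_LENGTH, coverage=1))

# Average coverage provided by one billion 100-base reads.
print(round(coverage_from_reads(1_000_000_000, READ_LENGTH, GENOME_SIZE), 1))
```

Under these assumptions, a single pass over the genome already takes tens of millions of reads, and a billion reads corresponds to each base being covered a few dozen times over, which is why so many nearly identical pieces have to be sorted out.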

Researchers have recently tried to tackle this problem, but with mixed results. For example, a paper published in November critiqued assemblies created by SOAPdenovo, a new program from BGI, a genomics research institute in Shenzhen, China. SOAPdenovo is one of a few programs capable of assembling large genomes from short reads. “As a field, we really started to question whether good genome assemblies could ever be built from this super-cheap data,” said co-author Sante Gnerre, who co-leads the Broad’s genome assembly group.

In their PNAS paper, Broad researchers tested ALLPATHS-LG by comparing it to existing methods for assembling whole genomes from short reads, including SOAPdenovo, as well as to older methods that rely on the more expensive capillary-based reads. Comparison to the latter assemblies was essential; in spite of their expense, they define the quality standard for the field.

The resulting genome assemblies were evaluated according to three criteria: completeness – how much of the genome is actually present in the final assembly; continuity – how long the stretches of information are in the assembly; and accuracy – how accurately the stretches of overlapping base reads are put together. “Our goal was to meet the old capillary quality standard using the new cheap data,” said David Jaffe, co-author and director of Computational Research and Development in the Broad’s Genome Sequencing and Analysis Program.
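Two of these criteria lend themselves to a simple numerical sketch. The example below is an illustration under stated assumptions, not the evaluation used in the PNAS paper: completeness is approximated as the fraction of an assumed genome size covered by assembled bases, and continuity is summarized with N50, a commonly used assembly statistic that the article does not name. The function names and the toy numbers are hypothetical. Accuracy is left out because measuring it requires comparing the assembly against a trusted reference sequence.

```python
# Toy assembly metrics (illustrative assumptions only).

def completeness(assembled_bases, genome_size):
    """Crude completeness: fraction of the genome present in the assembly."""
    return assembled_bases / genome_size

def n50(stretch_lengths):
    """N50: the length L such that stretches of length >= L together
    contain at least half of all assembled bases."""
    total = sum(stretch_lengths)
    running = 0
    for length in sorted(stretch_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

# Hypothetical assembly: five contiguous stretches, in bases.
stretches = [400, 300, 200, 60, 40]
genome_size = 1250  # assumed true genome size for the toy example

print(completeness(sum(stretches), genome_size))  # 0.8
print(n50(stretches))                             # 300
```

In this toy case the assembly contains 80 percent of the genome, and half of its bases sit in stretches at least 300 bases long. Longer stretches push N50 up, which is one way to express “how long the stretches of information are” as a single number.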

The ALLPATHS-LG assemblies turned out better than the SOAPdenovo assemblies. And by most metrics, the ALLPATHS-LG assemblies met or were within a factor of two of the old quality standard. “We were pleasantly surprised,” said Jaffe. “We know that the ALLPATHS assemblies still have many imperfections, but they’re already good enough for many purposes.” Based on these results, the Broad Institute is now proceeding with the sequencing and assembly of large genomes using the ALLPATHS-LG model; about a dozen mammals and fish have so far been analyzed.

Along with improving computational methods, researchers at the Broad also developed better lab tools to prepare DNA samples for sequencing. “These laboratory improvements were key to achieving the new quality standard,” said Jaffe, who worked closely with Robert Nicol’s team in the Broad’s Genome Sequencing Platform and Andreas Gnirke’s group in the Genome Sequencing and Analysis Program to improve sample preparation.

The ALLPATHS-LG algorithm is available as an open source tool. “Anybody can download it and we are eager to help anyone use it,” added Jaffe. The team is committed to helping researchers work with the tool, both in the lab and for computational analyses. “We expect that as sequencing costs keep dropping, genome sequencing will become routine, even standard of care,” Jaffe said. “We believe ALLPATHS-LG is the best method to put it all together and we intend to keep improving it.”

Interested researchers can email Jaffe to learn more about the tool and how to put it to use.

Paper(s) cited:
Gnerre S, MacCallum I, Przybylski D, Ribeiro F, Burton J, Walker B, Sharpe T, Hall G, Shea T, Sykes S, Berlin A, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander E, Jaffe DB. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences USA. December 27, 2010. DOI: 10.1073/pnas.1017351108