Indexing All Life's Known Biological Sequences

There is a lot of biological sequence data out there, but how can you search it? This paper proposes a methodology that uses annotated de Bruijn graphs to scalably index very large sets of DNA or protein sequences. The authors use efficient data structures and algorithms to compress petabases of DNA sequence, make the resulting indexes available to the community, and show that these indexes are useful for several kinds of analysis.
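
To make the idea of an annotated de Bruijn graph concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the paper relies on compressed data structures, whereas this sketch uses a plain dictionary purely for illustration. The function names (`kmers`, `build_index`, `successors`, `query`), the `min_fraction` parameter, and the sample labels are all assumptions introduced here. Each k-mer is a node, edges are implied by (k-1)-character overlaps, and the annotation records which input samples contain each k-mer.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer (a graph node) to the set of sample labels containing it."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def successors(index, kmer, alphabet="ACGT"):
    """Return the indexed k-mers that overlap kmer's last k-1 characters."""
    return [kmer[1:] + c for c in alphabet if kmer[1:] + c in index]


def query(index, seq, k, min_fraction=1.0):
    """Return labels containing at least min_fraction of the query's k-mers."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return set()
    counts = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids inserting empty entries into the defaultdict.
        for label in index.get(kmer, ()):
            counts[label] += 1
    total = len(query_kmers)
    return {label for label, n in counts.items() if n / total >= min_fraction}


# Hypothetical toy input: two short "samples" indexed with k = 4.
samples = {"sample_a": "ACGTACGA", "sample_b": "TTACGTAC"}
index = build_index(samples, k=4)
print(sorted(query(index, "ACGTAC", k=4)))  # labels sharing all query k-mers
print(successors(index, "GTAC"))            # outgoing edges of one node
```

Searching then reduces to looking up the query's k-mers and intersecting or counting their labels. A dictionary like this would not scale to petabases, which is why compact representations of both the graph and its annotations matter at that scale.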
