Justin Paschall, a scientific software engineer from the AMPLab at UC Berkeley, recently came to Phosphorus to speak at length about ADAM, a framework developed in the Apache Spark ecosystem to replace the ad-hoc pipelines used to interpret massive amounts of genomic and transcriptomic (DNA and RNA) sequencing data.
ADAM addresses the fragile gluing-together of workflows that tends to happen when teams try to work around major bottlenecks in scalability, long-term stability, and speed. ADAM builds on the Avro and Parquet frameworks, providing a more consistent and effective means of persisting, sharing, and processing genomic data. By contrast, genomic data is currently stored in a variety of legacy formats, most of which were not designed with scalability or Big Data in mind. (For instance, the VCF format can be difficult to parse and was designed more for human readability than for large-scale data processing.)
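To make the parsing point concrete, here is a minimal Python sketch. It is not ADAM's code or schema. It takes one VCF-style data line and converts it into a typed record, which is roughly the structure a schema-based format declares up front. The `Variant` class, its field names, the choice to treat `PASS` as an empty filter list, and the sample line are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Variant:
    """Hypothetical typed record for illustration; not ADAM's real schema."""
    contig: str
    position: int
    variant_id: Optional[str]
    reference: str
    alternates: List[str]
    quality: Optional[float]
    filters: List[str]
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed, tab-separated columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO packs key=value pairs and bare flags into one ';'-separated string.
    info_map: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            info_map[key] = value  # bare flags are stored with an empty string

    # "." is the VCF marker for a missing value, so every column needs a check.
    return Variant(
        contig=chrom,
        position=int(pos),
        variant_id=None if vid == "." else vid,
        reference=ref,
        alternates=[] if alt == "." else alt.split(","),
        quality=None if qual == "." else float(qual),
        # Assumption for this sketch: PASS is represented as "no failed filters".
        filters=[] if flt in (".", "PASS") else flt.split(";"),
        info=info_map,
    )


# Illustrative sample line only.
sample = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
variant = parse_vcf_line(sample)
print(variant.position, variant.alternates, variant.info["DP"])
```

Every tool that reads the text format has to repeat this kind of string splitting, missing-value handling, and type conversion. With a schema-defined format, the field types are declared once and readers get typed values back.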
ADAM is open source and originated at the same UC Berkeley AMPLab that developed Apache Spark. It re-envisions the way genomic analysis uses clusters of computers to solve Big Data challenges in science and biomedicine, bringing greater efficiency and parallelization, shorter development time, and advantages for end users.
For more info on ADAM, check out the project’s GitHub page here. You can also check out the video of Justin Paschall’s talk, below.