30 Jan 2015
Today we’re excited to announce the release of Guacamole, a variant calling platform that can run an order of magnitude faster than existing solutions and allows for faster iteration on variant calling algorithms.
Background
Variant calling, the process of identifying differences between two or more genomes, is an important stage in many genome analysis pipelines.
Human genomes are commonly sequenced as billions of overlapping short reads that are aligned in ways that maximize agreement about the underlying base pairs, but the data is typically noisy enough that many positions show evidence for two or more possible values:
(A, C, or T?)
In addition, at hundreds of gigabytes of data per genome, a typical variant caller only attempts to process a few genomes at a time and can take hours to run. This has slowed the development of better variant callers: who wants to tweak parameters and experiment with algorithms when the turnaround time is measured in hours?
As a result, a few callers have become de facto standards, but they weren’t designed to scale to thousands (let alone millions) of genomes and, worse, there is mounting evidence that they diverge in their output (for fixed inputs), sometimes wildly!
All downstream treatment decisions hinge on the variants discovered in patients’ genomes, so it is crucial that we gain better understanding of how to quickly and accurately identify variants.
Guacamole
With this in mind, we developed Guacamole, which makes it easy to implement variant calling algorithms and run them on modern, distributed infrastructure. Along with CycleDash (our variant visualizer) and Ketrew (our workflow manager), it lets us iterate on variant callers, run them, and view their output easily and in a matter of minutes. Guacamole 0.0.0 includes:
- Three reference variant callers, each expressed in ~100-200 lines of code and able to run on full genomes in as little as 20 minutes:
  - “Germline standard” caller: emits any variant whose likelihood (based on the quality of the reads displaying it) clears a certain threshold.
  - “Germline threshold” caller: emits variants that appear in more than a certain percentage of the reads.
  - “Somatic standard” caller: compares a “tumor” sample to a “normal” sample (both aligned to the same reference) and emits likely variants that are specific to the former (e.g. mutations found in a cancer tumor).
- Primitives for performing genomic analysis:
  - partitionLociByApproximateDepth:
    - Assigns genomic regions to processing tasks, balancing the number of reads that align to each region.
    - High-coverage regions are divided into many small partitions; low-coverage regions are merged.
  - pileupFlatMap:
    - Runs a user-specified function at each locus of the genome, passing it a pileup of the relevant bases from all reads mapping to that locus (a toy sketch of this style of caller follows this list).
  - windowFlatMap:
    - Runs a user-specified function on sliding windows of a specified size across the genome.
    - All reads mapping within a window centered on each locus are passed to the user function.
    - Like pileupFlatMap, but for algorithms that require additional context around each locus.
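To give a concrete (if greatly simplified) feel for this style of caller, here is a minimal Python sketch of a percentage-threshold caller driven by a pileup flat-map; the data structures and helper names are hypothetical stand-ins for illustration, not Guacamole’s actual Scala/Spark API:

from collections import Counter

# Toy pileups: for each locus, the reference base and the bases observed in the
# reads overlapping that locus. These structures are simplified, hypothetical
# stand-ins for Guacamole's pileup objects.
pileups = {
    ("chr1", 10000): {"ref": "C", "bases": ["C", "C", "T", "T", "T", "C", "T"]},
    ("chr1", 10001): {"ref": "A", "bases": ["A", "A", "A", "A", "A", "A", "A"]},
}

def threshold_call(locus, pileup, min_fraction=0.3):
    # Emit a variant for any non-reference base seen in at least
    # min_fraction of the reads covering this locus.
    counts = Counter(pileup["bases"])
    depth = sum(counts.values())
    return [
        (locus, pileup["ref"], base, count / depth)
        for base, count in counts.items()
        if base != pileup["ref"] and count / depth >= min_fraction
    ]

def pileup_flat_map(pileups, fn):
    # Local analogue of a pileup flat-map: apply fn at every locus and
    # concatenate whatever calls it emits.
    calls = []
    for locus, pileup in pileups.items():
        calls.extend(fn(locus, pileup))
    return calls

print(pileup_flat_map(pileups, threshold_call))
# [(('chr1', 10000), 'C', 'T', 0.5714285714285714)]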
What Are We Doing With It?
- We used Guacamole to submit variant calls to the recent DREAM Somatic Mutation Calling Challenge.
- We’ve also begun to tailor it to various clinical applications, including the development of precision cancer immunotherapies with collaborators from Mt. Sinai and Memorial Sloan Kettering.
Limitations / Roadmap
The callers included in 0.0.0 are experimental; we don’t recommend using them in production yet. Below are some things we are working on adding or improving.
Structural Variants
While the SNV-calling functionality in Guacamole is fairly robust, the indel- and SV-calling functionality is still fairly naïve.
- We are working on better structural-variant calling.
- We also plan to allow Guacamole to ingest reads from multiple sequencing platforms, e.g. longer PacBio reads, and make calls based on “hybrid” short- and long-read data.
Scalability
Spark, which we use as our distributed runtime, is difficult to debug and frequently fails on inputs larger than a few hundred GB (compressed; this can end up being tens of TB in deserialized-Scala-object form).
- We have built a prototype variant caller using the Crunch/Scrunch interfaces to Hadoop’s implementation of MapReduce to establish that these algorithms do complete when run on more robust runtimes.
- We are also actively working to improve Spark’s monitoring capabilities.
- Building a pipeline capable of rapidly analyzing genomic data from hundreds or thousands of patients is a high priority.
Clinical Applications
Guacamole will be used in various downstream applications in the coming months:
- An upcoming cancer vaccine trial will use somatic variant calls from Guacamole (in conjunction with other callers) to identify candidate biomarkers in patients’ cancer cells.
- We are studying concordance between existing variant callers, Guacamole included, in an effort to better characterize the failure modes of current approaches.
You can find more info about Guacamole, as well as see what we’re working on and file issues, on GitHub.
28 Jan 2015
When iterating on a variant caller,
it’s helpful to be able to rapidly identify the most interesting variants called
and to visually inspect the evidence used by the caller to make each call.
CycleDash is our variant call analysis
application, used to inspect, query, compare, and export variant calls imported
from VCF files.
CycleDash displays variant calls in a tabular format. It provides summary statistics,
annotations (e.g. the gene within which a variant occurs), a discussion/commenting
system, and a powerful query omnibar which allows the user to quickly filter and
sort the variants using an SQL-inspired syntax. Users can also export the
filtered variants as a VCF file for further downstream analysis.
Users can manually upload a VCF file to CycleDash through the web interface or they
can point CycleDash to a VCF file stored on HDFS.
When a VCF is submitted through either method, a request to process the VCF is placed on a work queue.
Once processing is complete, the processed data is inserted into a PostgreSQL database
in a denormalized format.
The variants are then available for inspection through the web interface.
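For example, registering an HDFS-hosted VCF for processing might look roughly like the following Python sketch; the endpoint path and payload fields are hypothetical placeholders for illustration, not CycleDash’s documented API:

import requests

# Hypothetical endpoint and field names; consult CycleDash's API docs for the real ones.
CYCLEDASH_URL = "http://cycledash.example.com/api/runs"

payload = {
    "vcf_path": "/hdfs/datasets/dream/synth3/somatic.vcf",   # VCF already stored on HDFS
    "caller_name": "guacamole-somatic-standard",             # free-form label for the run
}

response = requests.post(CYCLEDASH_URL, json=payload)
response.raise_for_status()

# CycleDash enqueues a processing job for the VCF; once a worker finishes,
# the variants land in PostgreSQL and show up in the web interface.
print("Submitted:", response.status_code)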
What’s in 0.0.0?
- A RESTful JSON API for submitting and querying VCFs.
- A dashboard listing all submitted VCFs + metadata.
- An “Examine” page for filtering and exploring the variants in a VCF, including:
  - A SQL-like query language for filtering and sorting variants.
  - Download functionality for saving queried variants.
  - An embedded pileup viewer for viewing variants in context.
  - Summary statistics compared to a validation VCF, e.g. precision, recall, and F1 score (see the sketch after this list).
- Workers for processing and annotating VCFs.
- A comment system for discussing variants.
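For reference, here is a minimal Python sketch of how precision, recall, and F1 can be computed for a call set against a validation (“truth”) set, treating each variant as a (contig, position, ref, alt) tuple; the numbers are made up and this is not CycleDash’s implementation:

# Variants as (contig, position, ref, alt) tuples; toy data for illustration.
called = {("20", 61795, "G", "T"), ("20", 63244, "A", "C"), ("20", 65900, "G", "A")}
truth = {("20", 61795, "G", "T"), ("20", 65900, "G", "A"), ("20", 66370, "G", "A")}

true_positives = len(called & truth)
precision = true_positives / len(called)   # fraction of calls that are in the truth set
recall = true_positives / len(truth)       # fraction of true variants that were called
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667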
What Are We Doing With It?
We’ve primarily used CycleDash to debug and improve Guacamole, our
somatic variant caller. Ketrew posts the results of a run of
Guacamole (or any other variant caller) to CycleDash,
which in turn processes and imports the variant calls into PostgreSQL, where they
are stored for later inspection.
In the future, we would like to use CycleDash to explore the functional consequences of variants.
Exploring, discussing, and analyzing the results of a variant calling
pipeline is often done in a series of Excel spreadsheets, written notes, and
static images. When mature, CycleDash may be an improvement on existing workflows.
Current Limitations
CycleDash currently has trouble querying and serving call sets with over a million
variants. We hope to improve performance for large call sets by optimizing our SQL queries.
If we are not able to obtain the desired performance with PostgreSQL, we are open to exploring
other databases. It’s a priority that CycleDash be responsive for even
the most complex queries over the largest call sets.
Queryability
Our query engine cannot yet execute queries like “find all true positives” or
“show all indels”, and it doesn’t yet support queries over annotations or
comments. We are also planning to support aggregate queries, user-defined
functions, and saved queries.
Robust Error Handling
CycleDash fails in odd ways when it gets malformed or unexpected input. This is
very confusing for users, and figuring out what went wrong often involves diving
into logs. Better error reporting is part of our ongoing goal to
offer a seamless and pleasant UX to developers and non-technical users alike.
Roadmap
Some other highlights of our roadmap include the following:
- Comparison and concordance between different call sets.
- Introducing histograms and other visualizations and statistics.
- More annotation options and better pluggability for adding new annotations.
- A new, more performant, and purpose-built pileup viewer, pileup.js.
- Links between variants in CycleDash and the UCSC Genome Browser.
You can track (and add to) the project’s issues & features on
GitHub Issues.
23 Jan 2015
One of the key features of CycleDash is the embedded pileup view,
currently provided by BioDalliance. This lets you inspect the reads which
contributed to a variant call:
(This pileup shows a single nucleotide variation between the top sequence (C)
and the bottom sequence (T), relative to a reference genome (C).)
We noticed that the two pileup tracks were quite slow to load, often taking in
excess of 30 seconds to appear. This latency discouraged use of the feature, so
we set out to improve it.
Pileup data is typically stored in a BAM file, which encodes all the
short reads in the order in which they appear in the genome. Because genomes
are large and sequenced at high read depth, BAM files can be quite large.
For example, the synth3 BAM files from the DREAM Somatic Mutation
Calling Challenge weigh in at 177.4GB each.
Transferring such a large file to a user’s web browser is impractical. Instead,
BioDalliance (and other genome browsers such as IGV) load a BAM
Index (BAI), which they then use to load only small portions of the
larger BAM file.
As BAM files grow, so do their corresponding BAI index files. For the synth3
data set, the two BAI files are each 8.6MB (4.9MB compressed). This is still quite a bit of
information to transfer to a web browser, and it accounted for the slow load
time that we were seeing:
(The two BAI files block the loading of other resources for seven seconds.)
We tried some standard tricks like applying GZip compression to the BAI files,
but the size of these files remained a fundamental barrier to interactive use.
In principle, only a tiny portion of the BAI file is needed at any given time.
For example, if the user wants to look at Chromosome 20, then there’s no need
to load index data for Chromosome 3.
After some discussion on BioStars and the samtools-devel
mailing list, we decided to index the BAM index! This results in a
small JSON file which records the beginning and ending of each contig in the
BAI file (“contig” is a general term for a contiguous stretch of DNA, e.g. a
chromosome):
{
"chunks": [
[8, 716520], // byte offsets for Chromosome 1
[716520, 1463832], // byte offsets for Chromosome 2
[1463832, 2070072], // etc.
...
]
}
We call this an “index chunk file” because it records the locations of
different chunks of the BAM Index. For example, the above file records that
Chromosome 3 begins at byte 1,463,832 in the BAI file and extends through byte
2,070,072. To display pileups from Chromosome 3, a genome viewer need only load
these ~600KB of the BAI file, rather than the full 8.6MB. This results in a
considerable speedup:
(The overall load time has been reduced from 10s to something like 3s.)
You might wonder, what happens if this index of the index gets too
large? For a human genome, it’s quite small (1.7KB for synth3) and can be
inlined into the page or requested separately. And its size depends only on the
number of contigs, not the number of reads, so it won’t grow with sequencing
depth (like a BAI file does).
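Here is a minimal Python sketch of how a client can combine the index chunk file with an HTTP Range request to fetch only one contig’s slice of the BAI; the URLs are placeholders, and it assumes each chunk is a half-open [start, end) byte range listed in contig order, as in the JSON above:

import json
import urllib.request

# Placeholder URLs; in practice these point at the .bai.json and .bai files
# stored alongside the BAM.
INDEX_CHUNKS_URL = "http://example.com/synth3.bam.bai.json"
BAI_URL = "http://example.com/synth3.bam.bai"

with urllib.request.urlopen(INDEX_CHUNKS_URL) as response:
    chunks = json.load(response)["chunks"]

# Chromosome 3 is the third contig listed in the BAI.
start, end = chunks[2]

# HTTP Range headers use inclusive byte positions, hence the end - 1.
request = urllib.request.Request(
    BAI_URL, headers={"Range": "bytes=%d-%d" % (start, end - 1)}
)
with urllib.request.urlopen(request) as response:
    chr3_index = response.read()

print(len(chr3_index), "bytes of BAI data cover Chromosome 3")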
To incorporate BAI indices into your own software, you’ll want to look at the
bai-indexer Python project and the change to BioDalliance which added support.
We typically store the index chunks in a .bai.json file alongside the standard
.bam and .bai files, so they never get separated.
While we’d hoped not to add new files and formats to an already large
family, we found that it was impossible to incorporate higher level
indexing data into the BAI file itself. By using a JSON format, we stay
flexible to future improvements and extensions to this scheme.
By creating a higher level index, we were able to significantly improve the
load time of genomic visualizations. We’d love to see wider support for this
convention, so check out the bai-indexer project!
21 Jan 2015
We just released Ketrew 0.0.0 in opam’s main repository.
Ketrew means Keep Track of Experimental Workflows: it is a workflow engine
based on an Embedded Domain Specific Language (EDSL).
- The EDSL is a simple OCaml library providing combinators to:
  - define complex workflows (interdependent steps/programs using a lot of data, with many parameter variations, running on different hosts with various schedulers), and
  - submit them to the engine.
- The engine orchestrates those workflows while keeping track of everything that succeeds, fails, or gets lost.
Ketrew can be run as a standalone application (i.e. run the engine from the
command line), or using a client-server architecture (i.e. run a proper server,
and connect through HTTPS).
What’s in 0.0.0?
- An EDSL that can be used to build complex workflows: we wanted to keep the
EDSL usable by OCaml beginners (read: it is not a bunch of GADTs carrying
first-class modules).
- Client-server (HTTPS + JSON) and standalone modes.
- An implementation of the engine which is naive/slow but testable &
understandable.
- A command-line client that can explore the state of workflows and pilot the
engine interactively.
- A persistence layer based on an “explorable” (but slow) git database.
- Access through SSH to LSF and PBS-based computing clusters, and
also to Unix machines that do not have a batch scheduler.
- A plugin infrastructure to add backends (to be linked with the library, or loaded using Dynlink and/or Findlib; cf. the documentation).
What We Are Doing With It
Ketrew has been used for very diverse things like running backups or building
documentation websites (see smondet/build-docs-workflow).
But we actually develop and use Ketrew to run bioinformatics pipelines on
very large amounts of data:
- Biomedical pipelines involve various long-running computational steps; for each of them, we want to run many parameter variations.
- At the same time, bioinformatics tools are infamous for being quite poorly
engineered; they tend to fail in mysterious ways, for seemingly valid inputs.
- The computing infrastructure is also very diverse and gives little control to
the end user.
Ketrew is designed with those adverse conditions in mind.
By the way, we just started open-sourcing our Ketrew pipelines for genomics
experiments: see hammerlab/biokepi. It is a target library (i.e. a set of
functions to create Ketrew.Target.t values which wrap the installation and
running of bioinformatics tools), but it also contains a cool GADT-based
pipeline-description module; we’re on the road to concise and well-typed
bioinformatics pipelines!
Limitations & Future Work
We are switching to faster database backends: see the library smondet/trakeva (and PR #98).
We had to artificially slow Ketrew down: submitting too many targets (≥ 100 or
so) that all check on files/conditions via the same SSH host would cause
connection errors. We will work on less naive approaches to accessing remote
hosts (see issue #69).
We are also working on the engine itself; see issue #73.
(G)UI
We routinely run workflows that contain more than 1000 steps each, but we need
better user interfaces (both web- and command-line-based, and even
emacs/vim-friendly) to go much further.
Deployment
Because of ocp-build, Ketrew 0.0.0 requires OCaml 4.01.0; we are switching to
an ocamlbuild/oasis-based build while waiting for Assemblage to be ready.
We hope that you will find Ketrew useful for any of the various kinds of
workflows you may want to track! Feel free to file issues or reach out at
hammerlab/ketrew if you have any questions.
If you are in New York City at the end of January, we will be at the
Compose conference to present Ketrew.
05 Dec 2014
Genomic data sets can be quite large, with BAM (binary alignment) files easily weighing in at hundreds of gigabytes or terabytes for a whole genome at high read depth. Distributed file systems like HDFS are a natural fit for storing these beasts.
This presents interactive genomic exploration tools like IGV with a challenge: how can you interact with a terabyte data file that may be on a remote server? Downloading the entire file could take days, so this isn’t an option.
IGV and other tools create interactive experiences using a few tricks:
- They require a BAI (BAM Index) file, which lets them quickly determine the locations in the BAM file which contain reads in a particular range of interest.
- They set the rarely-used Range: bytes HTTP header to request small portions of the BAM file from the remote web server.
In practice, this means that IGV can display reads from any location in a 100 GB BAM file while only transferring ~100KB of data over the network.
Popular web servers like Apache and nginx support the Range: bytes header, but WebHDFS, the standard HTTP server for content on HDFS, does not. WebHDFS does support requesting byte ranges within a file, but it uses its own URL parameters to do so. IGV does not know about these parameters, so it can’t speak directly to WebHDFS.
We can solve this problem by building an adapter. And that’s exactly what igv-httpfs is.
igv-httpfs exports a WSGI server which understands Range: bytes
HTTP headers and converts them to appropriate HTTP requests to WebHDFS. This allows IGV to load files from HDFS.
igv-httpfs also sets appropriate CORS headers so that web-based genome visualizations such as BioDalliance which make cross-origin XHRs can enjoy the same benefits.
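The core translation can be sketched in a few lines of WSGI; the following is a simplified illustration (single-range requests only, placeholder host name), not igv-httpfs’s actual code:

import re
import urllib.request

# Placeholder WebHDFS endpoint; WebHDFS exposes files under /webhdfs/v1/<path>.
WEBHDFS = "http://namenode.example.com:50070/webhdfs/v1"

def app(environ, start_response):
    # e.g. a request for /datasets/synth3.bam with "Range: bytes=1000-2000".
    url = "%s%s?op=OPEN" % (WEBHDFS, environ["PATH_INFO"])
    status = "200 OK"

    match = re.match(r"bytes=(\d+)-(\d+)", environ.get("HTTP_RANGE", ""))
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        # WebHDFS takes an offset and a length instead of a byte range.
        url += "&offset=%d&length=%d" % (start, end - start + 1)
        status = "206 Partial Content"

    body = urllib.request.urlopen(url).read()
    start_response(status, [
        ("Content-Type", "application/octet-stream"),
        # CORS header so browser-based viewers can issue cross-origin requests.
        ("Access-Control-Allow-Origin", "*"),
    ])
    return [body]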
As with any WSGI app, we recommend running igv-httpfs behind a WSGI HTTP server (we use gunicorn) and an HTTP server (we use nginx). This improves performance and lets you enable gzip compression, an essential optimization for serving genomic data over the network.
Update (2015-12-14): Another approach to serving HDFS content over HTTP is to create an NFS mount for the HDFS file system, say at /hdfs. With this in place, you can serve HDFS files using a standard HTTP server like nginx or Apache. Some care is required to correctly support CORS, caching, and bioinformatics MIME types, but support for range requests comes built in. You can use our nginx configuration as a template.