spark-bam improves on hadoop-bam in 3 ways:

- parallelism: split computation is distributed across the cluster rather than serialized on the driver
- correctness: additional validation checks eliminate known record-boundary false-positives
- algorithm/API clarity: its accuracy is far easier to reason about
hadoop-bam computes splits sequentially on one node. Depending on the storage backend, this can take many minutes for modest-sized (10-100GB) BAMs, leaving a large cluster idling while the driver bottlenecks on an eminently-parallelizable task.
For example, on Google Cloud Storage (GCS), two factors compound split-computation latency: high per-request round-trip latency, and seek-heavy access patterns that defeat client-side read-ahead/buffering.
spark-bam identifies record-boundaries in each underlying file-split in the same Spark job that streams through the records, eliminating the driver-only bottleneck and maximally parallelizing split-computation.
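Below is a minimal sketch of this idea, with hypothetical names (`FileSplit`, `firstRecordStart`, `records`) and placeholder logic standing in for spark-bam's actual API:

```scala
import org.apache.spark.SparkContext

// Hypothetical stand-in for a storage-level file-split.
case class FileSplit(path: String, start: Long, end: Long)

// Placeholder: the real logic scans forward from `split.start` for the
// first offset that passes the record-boundary checks described below.
def firstRecordStart(split: FileSplit): Long = split.start

// Placeholder: stream the records between two offsets.
def records(path: String, from: Long, until: Long): Iterator[String] = Iterator.empty

def loadReads(sc: SparkContext, splits: Seq[FileSplit]) =
  sc.parallelize(splits)          // one task per file-split
    .flatMap { split =>
      // Boundary detection runs on the executors, in the same job that
      // streams through the records: no driver-side bottleneck.
      records(split.path, firstRecordStart(split), split.end)
    }
```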
An important impetus for the creation of spark-bam was the discovery of two TCGA lung-cancer BAMs for which hadoop-bam produces invalid splits:
HTSJDK threw an error when trying to parse reads from essentially random data fed to it by hadoop-bam:

```
MRNM should not be set for unpaired read
```
These BAMs were rendered unusable, and questions remain around whether such invalid splits could silently corrupt analyses.
spark-bam fixes these record-boundary-detection “false-positives” by adding additional checks:
Validation check | spark-bam | hadoop-bam |
---|---|---|
Negative reference-contig idx | ✅ | ✅ |
Reference-contig idx too large | ✅ | ✅ |
Negative locus | ✅ | ✅ |
Locus too large | ✅ | 🚫 |
Read-name ends with \0 | ✅ | ✅ |
Read-name non-empty | ✅ | 🚫 |
Invalid read-name chars | ✅ | 🚫 |
Record length consistent w/ #{bases, cigar ops} | ✅ | ✅ |
Cigar ops valid | ✅ | 🌓* |
Subsequent reads valid | ✅ | ✅ |
Non-empty cigar/seq in mapped reads | ✅ | 🚫 |
Cigar consistent w/ seq len | 🚫 | 🚫 |
* Cigar-op validity is not verified for the “record” that anchors a record-boundary candidate BAM position, but it is verified for the subsequent records that hadoop-bam checks
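For illustration, here is a hedged sketch of how a few of the checks above might look when applied to a candidate record's fixed-length header fields (the field names follow the BAM spec, but this is not spark-bam's actual implementation):

```scala
// Hypothetical, simplified view of a candidate record's header; the on-disk
// BAM encoding packs these as little-endian integers followed by a
// NUL-terminated read name.
case class CandidateHeader(
  refIdx: Int,       // reference-contig index (-1 ⇒ unmapped)
  locus: Int,        // 0-based leftmost position (-1 ⇒ unmapped)
  readName: String   // read name, without its trailing NUL
)

def passesChecks(
  h: CandidateHeader,
  numContigs: Int,
  contigLength: Int => Int
): Boolean =
  h.refIdx >= -1 &&                                       // negative reference-contig idx
  h.refIdx < numContigs &&                                // reference-contig idx too large
  h.locus >= -1 &&                                        // negative locus
  (h.refIdx < 0 || h.locus <= contigLength(h.refIdx)) &&  // locus too large
  h.readName.nonEmpty &&                                  // read-name non-empty
  h.readName.forall(c => c > ' ' && c <= '~')             // invalid read-name chars
```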
spark-bam detects BAM-record boundaries using the pluggable `Checker` interface.
Four implementations are provided:

- **eager**: the default, production-worthy record-boundary-detection algorithm. It can be compared against the `seqdoop` checker using the `check-bam`, `compute-splits`, and `compare-splits` commands.
- **full**: a debugging-oriented `Checker`, used by the `full-check` command or its stand-alone "main" app at `org.hammerlab.bam.check.full.Main`.
- **seqdoop**: a `Checker` that mimics hadoop-bam's `BAMSplitGuesser` as closely as possible, implementing its logic more efficiently/directly.
- **indexed**: a `Checker` that simply reads from a `.records` file (as output by `index-records`) and reflects the read-positions listed there. It can serve as a "ground truth" against which to check either the `eager` or `seqdoop` checkers (using the `-s` or `-u` flags to `check-bam`, resp.).
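spark-bam's actual `Checker` trait (under `org.hammerlab.bam.check`) differs in detail, but a rough sketch of the shape such a pluggable interface takes:

```scala
// Rough sketch only; not spark-bam's literal trait.
trait Checker {
  /** @return true if the offset `pos` (into the uncompressed BGZF stream of
    * the BAM at `path`) is a valid record start, per this implementation. */
  def isRecordStart(path: String, pos: Long): Boolean
}

// Pluggability then makes comparisons like check-bam's straightforward,
// e.g. finding positions where two checkers disagree:
def disagreements(a: Checker, b: Checker, path: String, positions: Seq[Long]): Seq[Long] =
  positions.filter(p => a.isRecordStart(path, p) != b.isRecordStart(path, p))
```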
hadoop-bam is poorly suited to handling the increasingly-long reads produced by e.g. PacBio and Oxford Nanopore sequencers.
For example, a BGZF block holds at most 64KiB of uncompressed data, so a 100kbp read's record (sequence, qualities, and tags) necessarily spans multiple BGZF blocks, causing hadoop-bam to reject it as invalid.
spark-bam is robust to such situations, because it is agnostic to buffer sizes and to reads' positions relative to BGZF-block boundaries.
Analyzing hadoop-bam's correctness (as discussed above) proved quite difficult due to subtleties in its implementation.
Its record-boundary-detection is sensitive, in terms of both output and runtime, to:

- the size of the (≈256KB) buffer it reads candidate records from
- the JVM's heap size (see the failure mode described below)

spark-bam's accuracy is dramatically easier to reason about: every candidate position is evaluated against the same fixed set of checks, independent of buffer sizes and heap size.
This allows for greater confidence in the correctness of computed splits and downstream analyses.
While evaluating hadoop-bam’s correctness, BAM positions were discovered that BAMSplitGuesser
would correctly deem as invalid iff the JVM heap size was below a certain threshold; larger heaps would avoid an OOM and mark an invalid position as valid.
An overview of this failure mode:

- at a candidate position, at least one plausible-looking record is decoded, so the `decodedAny` flag is set to `true`
- a subsequent "record" appears to declare an enormous length (≈1GB), and `BAMRecordCodec` attempts to allocate a byte-array of that size and read the "record" into it
- in a sufficiently small heap, the allocation fails with an OOM, and the position is (correctly) deemed invalid
- in a larger heap, the allocation succeeds, and a `RuntimeEOFException` is thrown while attempting to read ≈1GB of data from a buffer that is only ≈256KB in size
- the `decodedAny` flag then signals that this position is valid, because at least one record was decoded before "EOF" (which actually only represents "end of 256KB buffer") occurred

This resulted in positions that hadoop-bam correctly ruled out in sufficiently-memory-constrained test contexts, but false-positived on in more-generously-provisioned settings, which is obviously an undesirable relationship to correctness.
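A distilled, hypothetical reconstruction of that control flow makes the heap-size dependence concrete (stand-in names throughout; this is not hadoop-bam's literal code):

```scala
import java.nio.{ ByteBuffer, ByteOrder }

// Stand-in for HTSJDK's RuntimeEOFException.
class RuntimeEOFException extends RuntimeException

def readInt(buf: Array[Byte], off: Int): Int = {
  if (off + 4 > buf.length) throw new RuntimeEOFException
  ByteBuffer.wrap(buf, off, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
}

def readRecord(buf: Array[Byte], off: Int, len: Int): Array[Byte] = {
  val record = new Array[Byte](len)  // small heap: OutOfMemoryError here
  if (off + len > buf.length)
    throw new RuntimeEOFException    // large heap: "EOF" once the ≈256KB buffer runs out
  System.arraycopy(buf, off, record, 0, len)
  record
}

def positionLooksValid(buf: Array[Byte]): Boolean = {
  var decodedAny = false
  var off = 0
  try {
    while (off < buf.length) {
      val len = readInt(buf, off)    // garbage data can yield lengths of ≈1GB
      readRecord(buf, off + 4, len)
      decodedAny = true              // at least one "record" decoded
      off += 4 + len
    }
    decodedAny
  } catch {
    case _: RuntimeEOFException => decodedAny  // trusts decodedAny ⇒ heap-size-dependent false positive
    case _: OutOfMemoryError    => false       // OOM path: position correctly ruled out
  }
}
```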