spark-bam and hadoop-bam were compared on the following datasets:
Raw data for the above can be found in this Google sheet.
Across the datasets described above:
There are no known situations where spark-bam incorrectly classifies a BAM-record-boundary.
On the above data, hadoop-bam exhibited false-positive rates between 1 per 18,000 and 1 per 625 million uncompressed BAM positions.
For each of spark-bam's additional checks, false-positives were discovered that only that check correctly rules out (a sketch of this style of check follows below):
In addition, several hundred false-negatives were discovered in GiaB PacBio long-read data: hadoop-bam missed positions that are true read-starts. None of these positions fell on split boundaries, but correctness errors would likely ensue if they did.
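The individual checks aren't enumerated here; as a rough illustration of the style of validation involved, the sketch below tests the fixed-length prefix of a candidate record against invariants from the BAM spec. The function name, the particular set of checks, and the thresholds are assumptions for illustration, not spark-bam's actual code.

```scala
import java.nio.{ ByteBuffer, ByteOrder }

/**
 * Illustrative boundary check over the fixed-length prefix of a candidate
 * BAM record (field layout per the SAM/BAM spec; the specific checks here
 * are an assumed sketch, not spark-bam's implementation).
 */
def looksLikeRecordStart(bytes: Array[Byte], numReferences: Int): Boolean = {
  if (bytes.length < 32) return false
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)

  val blockSize = buf.getInt           // record length after this field
  val refId     = buf.getInt           // must be in [-1, numReferences)
  val pos       = buf.getInt           // 0-based position, -1 if unmapped
  val lReadName = buf.get & 0xff       // includes trailing NUL, so >= 2
  buf.get                              // mapq: any value is legal
  buf.getShort                         // bin
  val nCigarOps = buf.getShort & 0xffff
  buf.getShort                         // flag
  val lSeq      = buf.getInt
  val nextRefId = buf.getInt           // mate's reference, same bounds
  val nextPos   = buf.getInt

  blockSize >= 32 &&
    refId     >= -1 && refId     < numReferences &&
    nextRefId >= -1 && nextRefId < numReferences &&
    pos     >= -1 &&
    nextPos >= -1 &&
    lReadName >= 2 &&                  // real read name plus NUL terminator
    lSeq >= 0 &&
    // fixed fields + name + cigar + packed seq + quals must fit the record
    32 + lReadName + 4 * nCigarOps + (lSeq + 1) / 2 + lSeq <= blockSize
}
```

A check that applies only the first few of these invariants admits false-positives of exactly the kind described above; each additional invariant rules out another family of byte sequences that merely resemble record starts.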
Four CLI commands compare spark-bam’s speed with hadoop-bam’s in various ways:
The latter two time a single CPU computing splits, at which hadoop-bam is much faster; the former two better reflect spark-bam's gains from parallelization.
In particular, `time-load` does minimal work beyond split computation, returning just the first read from every partition, so spark-bam is much faster than hadoop-bam. `count-reads` amortizes spark-bam's split-computation edge over more subsequent work, so the difference is less pronounced. A sketch contrasting the two workload shapes follows below.
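To make the contrast concrete, here is a minimal sketch of the two workload shapes in plain Spark. `reads` stands for an RDD of BAM records produced by either library's split computation; the function names are placeholders, not either project's API.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// time-load-style workload: touch only the first record of each partition,
// so nearly all of the elapsed time is the split computation itself.
def firstReadPerPartition[T: ClassTag](reads: RDD[T]): Array[T] =
  reads.mapPartitions(_.take(1)).collect()

// count-reads-style workload: scan every record, amortizing the
// split-computation cost over a full pass of the data.
def countAllReads[T](reads: RDD[T]): Long =
  reads.count()
```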
`count-reads` and `time-load` were each run, with and without the `-s` flag (i.e. with each of spark-bam and hadoop-bam running first, and so incurring Spark-setup costs, colder caches, etc.), on the 10 DREAM synthetic BAMs described above. A sketch of this order-swapped timing setup follows below.
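The point of running both orders is that whichever tool runs first absorbs one-time setup and cold-cache costs. A minimal sketch of such an order-swapped harness, with structure and names assumed for illustration rather than taken from the CLI's implementation:

```scala
// Time a computation, returning its result and elapsed milliseconds.
def time[T](body: => T): (T, Long) = {
  val start  = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000)
}

// Run both tools' workloads in the requested order and return
// (spark-bam millis, hadoop-bam millis). Whichever runs first pays
// any one-time Spark-setup and cache-warming costs.
def runBothOrders[A, B](
  sparkBamRun:  () => A,
  hadoopBamRun: () => B,
  sparkBamFirst: Boolean
): (Long, Long) =
  if (sparkBamFirst) {
    val (_, sbMillis) = time(sparkBamRun())
    val (_, hbMillis) = time(hadoopBamRun())
    (sbMillis, hbMillis)
  } else {
    val (_, hbMillis) = time(hadoopBamRun())
    val (_, sbMillis) = time(sparkBamRun())
    (sbMillis, hbMillis)
  }
```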
`count-reads`:
`time-load`: