Process BAM files using Apache Spark and HTSJDK; inspired by hadoop-bam.
$ spark-shell --packages=org.hammerlab.bam:load_2.11:1.2.0-M1
import spark_bam._, hammerlab.path._
val path = Path("test_bams/src/main/resources/2.bam")
// Load an RDD[SAMRecord] from `path`; supports .bam, .sam, and .cram
val reads = sc.loadReads(path)
// RDD[SAMRecord]
reads.count
// 2500
import hammerlab.bytes._
// Configure maximum split size
sc.loadReads(path, splitSize = 16 MB)
// RDD[SAMRecord]
// Only load reads in specific intervals
sc.loadBamIntervals(path)("1:13000-14000", "1:60000-61000").count
// 129
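
These loaders return ordinary RDDs, so they compose with the usual Spark operations; for example, reusing only the calls shown above to compare a whole-file count with an interval-restricted one:

val all = sc.loadReads(path).count
// 2500: every read in the file
val sub = sc.loadBamIntervals(path)("1:13000-14000").count
println(s"$sub of $all reads overlap 1:13000-14000")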
In sbt:
libraryDependencies += "org.hammerlab.bam" %% "load" % "1.2.0-M1"
In Maven:
<dependency>
<groupId>org.hammerlab.bam</groupId>
<artifactId>load_2.11</artifactId>
<version>1.2.0-M1</version>
</dependency>
From spark-shell (unlike sbt's %%, --packages needs the full artifact name, including the Scala-version suffix):
spark-shell --packages=org.hammerlab.bam:load_2.11:1.2.0-M1
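
Once linked by any of these routes, the same API works from a standalone application run with spark-submit. A minimal sketch, assuming only the loadReads call shown above (the CountReads object and app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import spark_bam._, hammerlab.path._

object CountReads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-reads"))
    // Load a .bam/.sam/.cram path passed on the command line and count its reads
    println(sc.loadReads(Path(args(0))).count)
    sc.stop()
  }
}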
spark-bam uses Java NIO APIs to read files, and needs the google-cloud-nio connector in order to read from Google Cloud Storage (gs:// URLs).
Download a shaded google-cloud-nio JAR:
GOOGLE_CLOUD_NIO_JAR=google-cloud-nio-0.20.0-alpha-shaded.jar
wget https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.20.0-alpha/$GOOGLE_CLOUD_NIO_JAR
Then include it in your --jars list when running spark-shell or spark-submit:
spark-shell --jars $GOOGLE_CLOUD_NIO_JAR --packages=org.hammerlab.bam:load_2.11:1.2.0-M1
…
import spark_bam._, hammerlab.path._
val reads = sc.loadBam(Path("gs://bucket/my.bam"))
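
If reading a gs:// path fails because no filesystem provider matches the "gs" scheme, listing the installed NIO providers is a quick way to check that the connector JAR actually made it onto the classpath; this is plain Java NIO, not a spark-bam API:

import java.nio.file.spi.FileSystemProvider
import scala.collection.JavaConverters._

// "gs" should appear here once the shaded google-cloud-nio JAR is on the classpath
FileSystemProvider.installedProviders.asScala.map(_.getScheme)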