Process BAM files using Apache Spark and HTSJDK; inspired by hadoop-bam.

$ spark-shell --packages=org.hammerlab.bam:load_2.11:1.2.0-M1
import spark_bam._, hammerlab.path._

val path = Path("test_bams/src/main/resources/2.bam")

// Load an RDD[SAMRecord] from `path`; supports .bam, .sam, and .cram
val reads = sc.loadReads(path)
// RDD[SAMRecord]

reads.count
// 2500

import hammerlab.bytes._

// Configure maximum split size
sc.loadReads(path, splitSize = 16 MB)
// RDD[SAMRecord]

// Only load reads in specific intervals
sc.loadBamIntervals(path)("1:13000-14000", "1:60000-61000").count
// 129
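The interval strings above follow the usual contig:start-end convention. As a point of reference, here is a minimal sketch of parsing such a string — a hypothetical helper for illustration, not part of spark-bam's API:

```scala
// Hypothetical helper: split a "contig:start-end" string into its parts.
case class Interval(contig: String, start: Long, end: Long)

def parseInterval(s: String): Interval = {
  val Array(contig, range) = s.split(":")
  val Array(start, end) = range.split("-")
  Interval(contig, start.toLong, end.toLong)
}

parseInterval("1:13000-14000")
// Interval("1", 13000, 14000)
```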

Linking

SBT

libraryDependencies += "org.hammerlab.bam" %% "load" % "1.2.0-M1"

Maven

<dependency>
  <groupId>org.hammerlab.bam</groupId>
  <artifactId>load_2.11</artifactId>
  <version>1.2.0-M1</version>
</dependency>

From spark-shell

spark-shell --packages=org.hammerlab.bam:load_2.11:1.2.0-M1

On Google Cloud

spark-bam uses Java NIO APIs to read files, and needs the google-cloud-nio connector in order to read from Google Cloud Storage (gs:// URLs).
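For context, this works through NIO's FileSystemProvider mechanism: Paths.get(URI) dispatches on the URI scheme to whichever provider is registered on the classpath. A minimal sketch using plain Java NIO (no spark-bam or GCS APIs; the gs:// line is illustrative):

```scala
import java.net.URI
import java.nio.file.Paths

// The "file" scheme always has a registered provider, so this resolves:
val local = Paths.get(URI.create("file:///tmp"))

// Without google-cloud-nio on the classpath, a gs:// URI instead throws
// java.nio.file.FileSystemNotFoundException:
// Paths.get(URI.create("gs://bucket/my.bam"))
```

Adding the shaded connector JAR to the classpath registers a provider for the "gs" scheme, which is all spark-bam needs to open gs:// paths.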

Download a shaded google-cloud-nio JAR:

GOOGLE_CLOUD_NIO_JAR=google-cloud-nio-0.20.0-alpha-shaded.jar
wget https://oss.sonatype.org/content/repositories/releases/com/google/cloud/google-cloud-nio/0.20.0-alpha/$GOOGLE_CLOUD_NIO_JAR

Then include it in your --jars list when running spark-shell or spark-submit:

spark-shell --jars $GOOGLE_CLOUD_NIO_JAR --packages=org.hammerlab.bam:load_2.11:1.2.0-M1
…
import spark_bam._, hammerlab.path._

val reads = sc.loadBam(Path("gs://bucket/my.bam"))