With spark-bam on the classpath, a `SparkContext` can be “enriched” with relevant methods for loading BAM files by importing:

```scala
import spark_bam._
```

The primary method exposed is `loadReads`, which will load an `RDD` of HTSJDK `SAMRecord`s from a `.sam`, `.bam`, or `.cram` file:

```scala
sc.loadReads(path)
// RDD[SAMRecord]
```
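For a fuller picture, here is a minimal sketch of the whole flow, from a `SparkContext` to counting the loaded reads; the local-mode setup and the test-BAM path (reused from the `Path` example below) are illustrative only:

```scala
import org.apache.spark.{ SparkConf, SparkContext }
import spark_bam._
import hammerlab.path._

// Illustrative local-mode context; any existing SparkContext works the same way.
val sc = new SparkContext(
  new SparkConf().setAppName("load-reads").setMaster("local[4]")
)

val path = Path("test_bams/src/main/resources/2.bam")

// `import spark_bam._` is what adds loadReads to the SparkContext.
val reads = sc.loadReads(path)
println(s"Loaded ${reads.count} reads")
```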
Arguments:

- `path` (required)
  - a `hammerlab.paths.Path`; can be constructed from a `URI`, a `String` (representing a `URI`), or a `java.nio.file.Path`:

    ```scala
    import hammerlab.path._
    val path = Path("test_bams/src/main/resources/2.bam")
    ```
- `bgzfBlocksToCheck`: optional (default: 5)
- `readsToCheck`: optional
- `maxReadSize`: optional
- `splitSize`: optional (default: taken from the underlying Hadoop filesystem APIs); can be given in byte-size shorthands like `16m` or `32MB`
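As a sketch of overriding some of these defaults (the parameter names come from the list above, but the bare-`Int` named-argument forms are an assumption; the real API may wrap these values in dedicated types):

```scala
// Hypothetical invocation; the named arguments shown here are a guess at the
// signature, not the definitive API.
val reads = sc.loadReads(
  path,
  bgzfBlocksToCheck = 10,
  splitSize = 32 * 1024 * 1024  // 32MB splits
)
```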
When the `path` is known to be an indexed `.bam` file, reads can be loaded from only specified genomic-loci regions:

```scala
import org.hammerlab.genomics.loci.parsing.ParsedLoci
import org.hammerlab.genomics.loci.set.LociSet
import org.hammerlab.bam.header.ContigLengths
import org.hammerlab.hadoop.Configuration

implicit val conf: Configuration = sc.hadoopConfiguration

val parsedLoci = ParsedLoci("1:11000-12000,1:60000-")
val contigLengths = ContigLengths(path)

// "Join" `parsedLoci` with `contigLengths` to e.g. resolve open-ended intervals
val loci = LociSet(parsedLoci, contigLengths)

sc.loadBamIntervals(
  path,
  loci
)
// RDD[SAMRecord] with only reads overlapping [11000,12000) and [60000,∞) on chromosome 1
```
Arguments:

- `path` (required)
- `loci` (required): `LociSet` indicating the genomic intervals to load
- `splitSize`: optional (default: taken from underlying Hadoop filesystem APIs)
- `estimatedCompressionRatio`: optional (default: `3.0`)
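Putting the pieces together, a small sketch of downstream use: counting the interval-restricted reads per contig with standard HTSJDK `SAMRecord` accessors (the aggregation itself is illustrative, not part of spark-bam):

```scala
val intervalReads = sc.loadBamIntervals(path, loci)

// Tally reads per contig within the requested intervals;
// getContig is standard HTSJDK SAMRecord API.
val countsByContig =
  intervalReads
    .map(read => read.getContig -> 1L)
    .reduceByKey(_ + _)
    .collect
```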
`loadReadsAndPositions` is the implementation of `loadReads`: it takes the same arguments, but returns `SAMRecord`s keyed by their BGZF position (`Pos`).
Primarily useful for analyzing split computations, e.g. in the `compute-splits` command.
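A sketch of inspecting those positions, assuming the result is a pair-RDD of (`Pos`, `SAMRecord`) — the exact container type is a guess:

```scala
// Hypothetical: print where the first few records begin within the
// BGZF-compressed file; getReadName is standard HTSJDK SAMRecord API.
sc.loadReadsAndPositions(path)
  .take(3)
  .foreach { case (pos, read) => println(s"$pos\t${read.getReadName}") }
```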
`loadSplitsAndReads` is similar to `loadReads`, but also returns the computed `Split`s alongside the `RDD[SAMRecord]`.
Primarily useful for analyzing split computations, e.g. in the `compute-splits` command.
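Again as a sketch only — the return shape (here assumed to be a pair of RDDs, the splits and the reads) is an assumption, not the definitive signature:

```scala
// Hypothetical: compare the number of computed splits to the records loaded.
val (splits, reads) = sc.loadSplitsAndReads(path)
println(s"splits: ${splits.count}, reads: ${reads.count}")
```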