Bioinformatics Workflows on a Hadoop Cluster
11 Jun 2015

Quick experience report on our adventures running old-school batch-processing-style computational workflows on a YARN-managed Hadoop cluster.
Ketrew on YARN
We run a whole lot of bioinformatics software artifacts in the form of long pipelines of finely tuned analysis steps. The project hammerlab/biokepi is our repository of bioinformatics workflows; it uses Ketrew as a workflow engine and exposes a high-level, Lego-style pipeline-building API in OCaml.
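To give a flavor of the Lego-style idea, here is a toy model (for illustration only, not Biokepi’s actual API) in which a pipeline is a typed value built by snapping steps together; a workflow engine like Ketrew can then turn such a description into a graph of jobs:

```ocaml
(* Toy model of a typed pipeline description; the constructor names are
   illustrative, not Biokepi's real API. *)
type fastq = Fastq_file of string
type bam = Bam_file of string
type vcf = Vcf_file of string

(* The type parameter tracks what kind of file a pipeline produces. *)
type _ pipeline =
  | Input_fastq : fastq -> fastq pipeline
  | Bwa : fastq pipeline -> bam pipeline
  | Gatk_indel_realigner : bam pipeline -> bam pipeline
  | Mutect : bam pipeline * bam pipeline -> vcf pipeline  (* normal, tumor *)

(* Align a sample: BWA followed by indel realignment. *)
let align fq = Gatk_indel_realigner (Bwa (Input_fastq fq))

(* Somatic variant calling from a normal/tumor pair of samples. *)
let somatic_calls ~normal ~tumor : vcf pipeline =
  Mutect (align normal, align tumor)
```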
Biokepi is quite agnostic of the computing environment: any Ketrew backend with a shared filesystem can be used by creating a Run_environment.Machine.t.
In house, we want to use our Hadoop cluster. Nowadays the Hadoop ecosystem uses the YARN scheduler. Ketrew gained support for YARN a few months ago, and this made it into the 1.0.0 release that recently hit the shelves.
Ketrew has to be able to run both Hadoop applications (which have built-in knowledge of YARN resources) and arbitrary programs (as on classical batch-processing clusters). The implementation in ketrew_yarn.ml provides both.
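From the workflow side, targeting YARN is meant to be just another way of saying how a Ketrew target runs. The sketch below is hypothetical pseudo-usage: the constructor names and option labels (yarn_application, yarn_distributed_shell, container_memory, timeout, application_name) are my approximation of Ketrew’s EDSL, so consult ketrew_yarn.ml and the EDSL documentation for the actual interface.

```ocaml
(* Hypothetical sketch; names, labels, and signatures are approximate. *)
open Ketrew.EDSL

(* A YARN-aware Hadoop application: Ketrew only daemonizes its
   "master process", which then negotiates resources with YARN itself. *)
let hadoop_app =
  target "spark-job"
    ~make:(yarn_application
             Program.(sh "spark-submit --master yarn my-analysis.jar"))

(* An arbitrary Unix program: wrapped in YARN's distributed shell so
   that it gets a container of its own. *)
let unix_program =
  target "bwa-mem-sample1"
    ~make:(yarn_distributed_shell
             ~container_memory:(`GB 16)
             ~timeout:(`Seconds (24 * 3600))
             ~application_name:"bwa-mem-sample1"
             Program.(sh "bwa mem ref.fasta r1.fq r2.fq > sample1.sam"))
```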
Running YARN-enabled Hadoop applications is just a matter of daemonizing their “master process” on one of the cluster’s login-nodes; this process then requests resources from YARN. To run arbitrary programs we had to wrap them within the cutely named org.apache.hadoop.yarn.applications.distributedshell.Client class; it runs the “application master” that requests (part of) a cluster node from YARN and then launches the given shell script.
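Concretely, the wrapping boils down to assembling a command line along the lines of the sketch below (a simplified illustration of the idea behind ketrew_yarn.ml; the jar path and the resource values are made up for the example):

```ocaml
(* Build the command that asks the distributed-shell "application master"
   for one container and runs our script in it. *)
let distributed_shell_command ~hadoop_bin ~distributed_shell_jar
    ~container_memory_mb ~timeout_ms ~script_path =
  String.concat " " [
    hadoop_bin;
    "org.apache.hadoop.yarn.applications.distributedshell.Client";
    "-jar"; distributed_shell_jar;        (* the distributed-shell jar itself *)
    "-shell_script"; script_path;         (* the arbitrary program to run *)
    "-num_containers"; "1";               (* one container is enough here *)
    "-container_memory"; string_of_int container_memory_mb;
    "-timeout"; string_of_int timeout_ms;
  ]

let () =
  print_endline
    (distributed_shell_command
       ~hadoop_bin:"yarn"
       ~distributed_shell_jar:
         "/usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar"
       ~container_memory_mb:1024
       ~timeout_ms:(60 * 60 * 1000)
       ~script_path:"/nfs/work/my-biokepi-step.sh")
```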
Once this was implemented, there were still a few more obstacles to break through.
Actually Making It Work
A Hadoop cluster is built around a non-POSIX distributed filesystem: HDFS. Most bioinformatics tools won’t know about HDFS and will assume good old Unix-style filesystem access. Luckily, we have a decent NFS mount configured.
The gods of distributed computing won’t let you use NFS without cursing you; this time it was the user IDs.
YARN’s default behavior is to run “containers” as the yarn Unix user (a container is a unit of computational resources, e.g. a core of a node in the cluster); hence all the POSIX files created by the bioinformatics tools would be manipulated by the yarn user.
The first hacky attempt was to clumsily chmod 777 everything, set a permissive umask, and hope that the yarn user would be able to read and write.

Nope.
Of course, NFS is based on UIDs, not usernames, and the user yarn has different UIDs on different nodes of the cluster (which can be seen as a misconfiguration bug). A given step of the pipeline will write files as user 42, and a later one, on a different node, will try to write as user 51… Even with a lot of black-magic chmod/setfacl incantations, we ended up with pipelines failing after a few hours and leaving terabytes of unusable files on our, by then inconsistent, NFS mount.
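A quick way to see the mismatch is to ask each node which UID the name yarn resolves to; the snippet below is a small diagnostic sketch (it needs OCaml’s unix library), not something from our pipelines:

```ocaml
(* Print which UID the local /etc/passwd (or LDAP/NIS) gives to "yarn".
   Run it on several nodes: if the numbers differ, NFS file ownership
   cannot stay consistent across the cluster.
   (Unix.getpwnam raises Not_found if the user does not exist.) *)
let () =
  let user = "yarn" in
  let uid = (Unix.getpwnam user).Unix.pw_uid in
  Printf.printf "%s: user %S has UID %d\n" (Unix.gethostname ()) user uid
```

Compile with something like `ocamlfind ocamlopt -package unix -linkpkg uid_check.ml -o uid_check` and run it on two different nodes to compare.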
To save our poor souls, there just so happens to be something called “YARN Security.” The scheduler can be configured to run the containers as the Unix user that started the application, just like other schedulers have been doing for the past 30 years (cf. documentation).
After updating to the latest Cloudera distribution to enjoy a few bug fixes, it
finally works!
We’ve been running hundreds of CPU-hours of bioinformatics pipelines,
generating terabytes of data,
and posting results to our instance of hammerlab/cycledash.
Next Steps
Always outnumbered, never outgunned, we’re now working on:
- Getting more run-time information from YARN (issue hammerlab/ketrew#174): so far we get basic “application status” information from YARN, as well as post-mortem application logs. We want Ketrew’s UI to show more real-time information, especially the logs of the cluster node running the YARN container (see the sketch after this list).
- Improving Biokepi to become a “production ready” pipeline framework. This means adding more tools and improving configurability and scalability; cf. hammerlab/biokepi/issues.
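For the curious, the post-mortem information we currently pull comes from the stock YARN command-line tools; a minimal sketch of shelling out to them looks like the following (the read_command helper and the application id are made up for the example; only the two yarn subcommands are real):

```ocaml
(* Run a command and return its standard output (illustrative helper;
   uses OCaml's unix library). *)
let read_command cmd =
  let ic = Unix.open_process_in cmd in
  let buf = Buffer.create 1024 in
  (try
     while true do Buffer.add_channel buf ic 1 done
   with End_of_file -> ());
  ignore (Unix.close_process_in ic);
  Buffer.contents buf

(* Post-mortem status and aggregated logs for a finished application. *)
let application_report app_id =
  let status =
    read_command (Printf.sprintf "yarn application -status %s" app_id) in
  let logs =
    read_command (Printf.sprintf "yarn logs -applicationId %s" app_id) in
  (status, logs)

let () =
  (* Made-up application id, for the example only. *)
  let status, _logs = application_report "application_1433791000000_0042" in
  print_string status
```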
As always, help is warmly welcome!