Introducing Ketrew 0.0.0
21 Jan 2015
We just released Ketrew 0.0.0 in opam’s main repository.
Ketrew means Keep Track of Experimental Workflows: it is a workflow engine based on an Embedded Domain Specific Language.
- The EDSL is a simple OCaml library providing combinators to:
- define complex workflows (interdependent steps/programs using a lot of data, with many parameter variations, running on different hosts with various schedulers)
- submit them to the engine.
- The engine orchestrates those workflows while keeping track of everything that succeeds, fails, or gets lost.
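To make the split between "EDSL" and "engine" concrete, here is a minimal, self-contained sketch of the idea in plain OCaml. It is a toy model and not Ketrew's API: the names `target`, `run_workflow`, `status`, and the record fields are all hypothetical, and every step runs locally and synchronously rather than on remote hosts through a scheduler.

```ocaml
(* Toy model of a workflow engine. NOT Ketrew's API: every name here is
   hypothetical and only illustrates "targets with dependencies". *)
type status = Succeeded | Failed of string

type target = {
  name : string;
  dependencies : target list;
  run : unit -> (unit, string) result;  (* the step itself *)
}

(* The "EDSL" part: one combinator to describe a step and what it needs. *)
let target ?(dependencies = []) name run = { name; dependencies; run }

(* The "engine" part: run dependencies first and record every outcome. *)
let run_workflow root =
  let log : (string, status) Hashtbl.t = Hashtbl.create 16 in
  let rec go t =
    match Hashtbl.find_opt log t.name with
    | Some Succeeded -> true
    | Some (Failed _) -> false
    | None ->
      (* Visit all dependencies, even after one of them has failed. *)
      let deps_ok =
        List.fold_left (fun ok d -> go d && ok) true t.dependencies in
      let status =
        if not deps_ok then Failed "dependency failed"
        else
          match t.run () with
          | Ok () -> Succeeded
          | Error msg -> Failed msg
      in
      Hashtbl.replace log t.name status;
      status = Succeeded
  in
  let ok = go root in
  (ok, log)
```

In this sketch, a step whose dependency fails is never run but is still recorded as failed, which is the "keeping track of everything" part in miniature; the real engine additionally has to deal with remote hosts, schedulers, and jobs that get lost.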
Ketrew can be run as a standalone application (i.e. run the engine from the command line), or using a client-server architecture (i.e. run a proper server, and connect through HTTPS).
What’s in 0.0.0?
- An EDSL that can be used to build complex workflows: we wanted to keep the EDSL usable by OCaml beginners (read: it is not a bunch of GADTs carrying 1st class modules).
- Client-server (HTTPS + JSON) and standalone modes.
- An implementation of the engine which is naive/slow but testable & understandable.
- A command-line client that can explore the state of workflows and pilot the engine interactively.
- A persistence layer based on an “explorable” (but slow) git database.
- Access through SSH to LSF and PBS-based computing clusters, and also to Unix machines that do not have a batch scheduler.
- A plugin infrastructure to add backends (to be linked with the library or loaded using Dynlink and/or Findlib; cf. documentation).
What We Are Doing With It
Ketrew has been used for very diverse things like running backups or building
documentation websites (see smondet/build-docs-workflow).
But we actually develop and use Ketrew to run bioinformatics pipelines on
very large amounts of data:
- Biomedical computational pipelines involve various long-running computational steps; for each of them, we want to run many parameter variations.
- At the same time, bioinformatics tools are infamous for being quite poorly engineered; they tend to fail in mysterious ways, for seemingly valid inputs.
- The computing infrastructure is also very diverse and gives little control to the end user.
Ketrew is designed with those adverse conditions in mind.
By the way, we just started open-sourcing our Ketrew pipelines for genomics
experiments, see hammerlab/biokepi: it
is a target library (i.e. a set of functions to create Ketrew.Target.t
values which wrap the installation and running of bioinformatics tools), but it
also contains a cool GADT-based pipeline description module; we’re on the road
to concise and well-typed bioinformatics pipelines!
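For readers wondering what a GADT buys in a pipeline description, here is a small illustrative sketch. It is not Biokepi's actual module: the type `pipeline`, its constructors, and the tool names "aligner" and "caller" are hypothetical. The point is only that the type parameter records what each step produces, so ill-formed pipelines are rejected by the compiler.

```ocaml
(* Hypothetical sketch of a GADT-based pipeline description.
   These are NOT Biokepi's types; the names are made up. *)

(* Phantom types naming the kind of data a step produces. *)
type fastq
type bam
type vcf

type _ pipeline =
  | Fastq_file : string -> fastq pipeline
  | Align : string * fastq pipeline -> bam pipeline        (* tool name, input *)
  | Call_variants : string * bam pipeline -> vcf pipeline  (* tool name, input *)

(* Render a pipeline as a nested expression; works for any result type. *)
let rec describe : type a. a pipeline -> string = function
  | Fastq_file path -> path
  | Align (tool, input) -> Printf.sprintf "%s(%s)" tool (describe input)
  | Call_variants (tool, input) ->
    Printf.sprintf "%s(%s)" tool (describe input)

(* A well-typed pipeline: reads are aligned, then variants are called. *)
let example : vcf pipeline =
  Call_variants ("caller", Align ("aligner", Fastq_file "sample.fastq"))
```

With these definitions, something like `Align ("aligner", example)` does not type-check, because `Align` expects a `fastq pipeline` and `example` is a `vcf pipeline`.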
Limitations & Future Work
Performance
We are switching to faster database backends: see the library smondet/trakeva (see PR #98).
We had to slow Ketrew down quite artificially: submitting too many targets (≥ 100-ish) that all check on files/conditions on the same SSH host would cause connection errors. We will work on less naive approaches for accessing remote hosts (see issue #69).
We are also working on the engine itself, see issue #73.
(G)UI
We routinely and successfully run workflows that contain more than 1000 steps each, but we need better user interfaces to go much further (both web-based and command-line-based, and even emacs/vim-friendly).
Deployment
Because of ocp-build, Ketrew 0.0.0 requires OCaml 4.01.0; we are switching
to an ocamlbuild/oasis-based build while waiting for Assemblage to be ready.
Closing Remarks
We hope that you will find Ketrew useful for any of the various kinds of
workflows you may want to track! Feel free to file issues or reach out at
hammerlab/ketrew if you have any questions.
If you are in New York City by the end of January, we will be at the Compose conference to present Ketrew.