ETL Workflow is a simple, opinionated way to help you structure type-safe Extract-Transform-Load (ETL)
pipelines. The Domain Specific Language (DSL) is flexible enough to create linear pipelines that involve a single
Extract source and a single Load sink:

```
Extract source A ~> Transform A to B ~> Load B (sink 1)
```
all the way to stitching multiple Extract sources together and flowing the data through to multiple Load sinks:

```
Extract source A ~>                              ~> Load D (sink 1)
                    \                           /
Extract source B ~>  Transform (A, B, C) to D  ~> Load D (sink 2)
                    /                           \
Extract source C ~>                              ~> Load D (sink 3)
```
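For intuition, here is a minimal, self-contained sketch of the linear shape above using toy case classes and a `~>` operator. These definitions are illustrative only; the real etl-workflow API may differ:

```scala
// Toy model of the linear pipeline shape (NOT the actual etl-workflow API):
// an Extract produces values, a Transform maps them, a Load consumes them.
final case class Transform[A, B](run: A => B)
final case class Load[A, S](consume: A => S)
final case class ETLPipeline[S](unsafeRunSync: () => S)

final case class Extract[A](produce: () => A) {
  // Attach a transformer: still an Extract, now producing B.
  def ~>[B](t: Transform[A, B]): Extract[B] =
    Extract(() => t.run(produce()))
  // Attach a sink: the pipeline is complete and can be run.
  def ~>[S](l: Load[A, S]): ETLPipeline[S] =
    ETLPipeline(() => l.consume(produce()))
}

// Extract source A ~> Transform A to B ~> Load B (sink 1)
val pipeline: ETLPipeline[Boolean] =
  Extract(() => 21) ~> Transform[Int, Int](_ * 2) ~> Load[Int, Boolean] { n =>
    println(n) // the side effect happens only when the pipeline is run
    n == 42
  }
```

Nothing executes when `pipeline` is defined; invoking `pipeline.unsafeRunSync()` runs the side effect and returns the sink's status.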
It is built on an immutable, functional architecture where side effects are executed at the end of the world, when the pipeline is run.

This library is intended to be used in conjunction with Spark (especially for doing ETL) to minimize boilerplate and to give an almost whiteboard-like representation of your pipeline.
Add the resolver and dependency to your sbt build:

```sbt
resolvers += Resolver.bintrayRepo("calvinlfer", "maven")

libraryDependencies += "com.ghostsequence" %% "etl-workflow" % "<latest version info at the top>"
```
An ETL pipeline consists of the following building blocks:
- `Extract[A]`: a producer of a single element of data whose type is `A`. This is the start of the ETL pipeline; you can connect it to `Transform`ers or to a `Load[A, AStatus]` to create an `ETLPipeline[AStatus]` that can be run.
- `Transform[A, B]`: a transformer of an element `A` to `B`; you can attach these after an `Extract[A]` or before a `Load[B]`.
- `Load[B, BStatus]`: the end of the pipeline, which consumes the data `B` flowing through the pipeline and produces a status `BStatus` indicating whether consumption happened successfully.
- `ETLPipeline[ConsumeStatus]`: the fully assembled ETL pipeline, which can be executed using `unsafeRunSync()` to produce a `ConsumeStatus` indicating whether the pipeline finished successfully.
Note: At the end of the day, these building blocks are a reification of values and functions. You can build an ETL pipeline out of functions and values but it helps to have a Domain Specific Language to increase readability.
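To make the reification point concrete, here is a hypothetical desugaring (the names are illustrative, not the library's API) showing that a pipeline is nothing more than function composition with the side effect deferred:

```scala
// A pipeline, stripped of the DSL, is just values and functions:
val extract: () => Int       = () => 21                        // Extract[Int]
val transform: Int => String = n => (n * 2).toString           // Transform[Int, String]
val load: String => Boolean  = s => { println(s); s.nonEmpty } // Load[String, Boolean]

// Composing them builds the pipeline; no side effect runs yet.
val pipeline: () => Boolean = () => load(transform(extract()))

// Only invoking it executes the effects, at the end of the world.
val status: Boolean = pipeline()
```

The DSL adds readability on top of this structure; the deferred-execution behaviour is the same.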
See here for examples of how to get started.
Make sure you have the correct Bintray credentials before proceeding:

```
sbt release
```

This will automatically create a Git tag and publish the library to Bintray for all Scala versions.