ETL Workflow (beta)


ETL Workflow is a simple, opinionated way to help you structure type-safe Extract-Transform-Load (ETL) pipelines. The Domain Specific Language (DSL) is flexible enough to create linear pipelines that involve a single Extract source and a single Load sink

Extract source A ~> Transform A to B ~> Load B (sink 1)

all the way to stitching multiple Extract sources together and flowing the data through to multiple Load sinks


Extract source A ~>                               ~> Load D (sink 1)
                   \                             /
Extract source B    ~> Transform (A, B, C) to D ~>   Load D (sink 2)
                   /                             \
Extract source C ~>                               ~> Load D (sink 3)
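
Concretely, here is a minimal, self-contained sketch of the idea in Scala. The definitions below are simplified stand-ins written for illustration; the real types, constructors, and combinators in etl-workflow may differ (only the type names, ~>, and unsafeRunSync() come from this README, and zip and and are hypothetical names for the fan-in and fan-out combinators):

// Simplified stand-ins for the building blocks (illustration only)
final case class Transform[A, B](run: A => B)

final case class Load[B, S](consume: B => S) {
  // Fan-out: feed the same B to two sinks and collect both statuses (hypothetical)
  def and[S2](that: Load[B, S2]): Load[B, (S, S2)] =
    Load(b => (consume(b), that.consume(b)))
}

final case class ETLPipeline[S](run: () => S) {
  def unsafeRunSync(): S = run()
}

final case class Extract[A](produce: () => A) {
  def ~>[B](t: Transform[A, B]): Extract[B] = Extract(() => t.run(produce()))
  def ~>[S](l: Load[A, S]): ETLPipeline[S]  = ETLPipeline(() => l.consume(produce()))
  // Fan-in: pair this source with another (hypothetical)
  def zip[B](that: Extract[B]): Extract[(A, B)] = Extract(() => (produce(), that.produce()))
}

// Linear: Extract[Int] ~> Transform[Int, String] ~> Load[String, Boolean]
val linear: ETLPipeline[Boolean] =
  Extract(() => 42) ~> Transform((i: Int) => i.toString) ~> Load((s: String) => { println(s); true })

// Fan-in and fan-out: two sources zipped, one transform, two sinks
val multi: ETLPipeline[(Boolean, Boolean)] =
  (Extract(() => 1) zip Extract(() => 2)) ~>
    Transform((ab: (Int, Int)) => ab._1 + ab._2) ~>
    (Load((n: Int) => { println(s"sink 1: $n"); true }) and Load((n: Int) => { println(s"sink 2: $n"); true }))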

It is built on an immutable and functional architecture where side effects are executed at the end of the world, when the pipeline is run.

This is intended to be used in conjunction with Spark (especially for ETL) in order to minimize boilerplate and give you an almost whiteboard-like representation of your pipeline.

Usage

resolvers += Resolver.bintrayRepo("calvinlfer", "maven")

libraryDependencies += "com.ghostsequence" %% "etl-workflow" % "<latest version info at the top>"

Building Blocks

An ETL pipeline consists of the following building blocks:

Extract[A]

A producer of a single element of data whose type is A. This is the start of the ETL pipeline; you can connect it to Transforms, or to a Load[A, AStatus] to create an ETLPipeline[AStatus] that can be run.
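
For example, continuing the simplified sketch above (the in-memory source is hypothetical):

// A producer of a single Int (hypothetical in-memory source)
val extractAnswer: Extract[Int] = Extract(() => 42)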

Transform[A, B]

A transformer of an element A into an element B. You can attach these after an Extract[A] or before a Load[B].
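
For example, continuing the sketch (the rendering step is hypothetical):

// Turn an Int into a human-readable String
val render: Transform[Int, String] = Transform(i => s"value = $i")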

Load[B, BStatus]

The end of the pipeline. It takes the data B flowing through the pipeline, consumes it, and produces a status BStatus that indicates whether consumption happened successfully.
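
For example, continuing the sketch (a stdout sink with a Boolean status, both hypothetical):

// Print to stdout and report success as a Boolean status
val printSink: Load[String, Boolean] = Load(s => { println(s); true })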

ETLPipeline[ConsumeStatus]

This represents the fully assembled ETL pipeline, which can be executed using unsafeRunSync() to produce a ConsumeStatus that indicates whether the pipeline finished successfully.
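
Putting the pieces above together (still within the simplified sketch):

// Assemble the pipeline, then run it at the end of the world
val pipeline: ETLPipeline[Boolean] = extractAnswer ~> render ~> printSink
val status: Boolean = pipeline.unsafeRunSync()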

Note: At the end of the day, these building blocks are a reification of values and functions. You can build an ETL pipeline out of functions and values but it helps to have a Domain Specific Language to increase readability.
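
To make that concrete, the same linear pipeline can be written with nothing but values and function composition:

// The DSL-free equivalent of extractAnswer ~> render ~> printSink
val produce: () => Int         = () => 42
val format:  Int => String     = i => s"value = $i"
val consume: String => Boolean = s => { println(s); true }
val runPipeline: () => Boolean = () => consume(format(produce()))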

Examples

See the examples in the repository for how to get started.

Inspiration

Release process

Make sure you have the correct Bintray credentials before proceeding:

sbt release

This will automatically create a Git tag and publish the library to Bintray for all supported Scala versions.
