spark-plug

A Scala driver for launching Amazon EMR jobs.

why?

We run a lot of reports. In the past, these were kicked off by bash scripts that typically did date math and copied scripts and config files to S3 before calling the Amazon elastic-mapreduce command-line client to launch the job. The EMR client invocation ends up being dozens of lines of bash, adding each step and passing its arguments.

It's been a pain to share defaults or add any abstraction over common job steps. Date arithmetic and conditionally adding EMR steps are also awkward in bash. Lastly, the EMR client offers less control than the EMR API over certain options.

simple example

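// stage, date, year, month and day are assumed to be defined earlier in the script
val flow = JobFlow(
  name      = s"${stage}: analytics report [${date}]",
  cluster   = Master() + Core(8) + Spot(10),
  bootstrap = Seq(MemoryIntensive),
  steps     = Seq(
    SetupDebugging(),
    // run the Hive script from S3, passing the report date as Hive variables
    new HiveStep("s3://bucket/location/report.sql",
      Map("YEAR" -> year, "MONTH" -> month, "DAY" -> day))
  )
)

// launch the job flow, overriding the default Hadoop version
val id = Emr.run(flow)(ClusterDefaults(hadoop="1.0.3"))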
println(id)
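
The example above takes `year`, `month` and `day` as given. As a rough sketch of the date math and conditional steps mentioned earlier, the values could be derived as follows. Nothing here is part of spark-plug: `ReportDate` and its methods are hypothetical helpers, the step names are plain strings standing in for real steps, zero-padded month and day values are an assumption about what the Hive script expects, and `java.time` assumes Java 8 or later.

```scala
import java.time.LocalDate

// Hypothetical helpers for illustration only; not part of spark-plug.
object ReportDate {
  // Reports are assumed to cover the day before the run date.
  def yesterday(today: LocalDate): LocalDate = today.minusDays(1)

  // Hive variables for a report date, with zero-padded month and day.
  def hiveVars(date: LocalDate): Map[String, String] = Map(
    "YEAR"  -> date.getYear.toString,
    "MONTH" -> f"${date.getMonthValue}%02d",
    "DAY"   -> f"${date.getDayOfMonth}%02d"
  )

  // Step names as strings: the monthly rollup is added only on the
  // first day of a month, the daily report always runs.
  def stepNames(date: LocalDate): Seq[String] =
    Seq("daily-report") ++ (if (date.getDayOfMonth == 1) Seq("monthly-rollup") else Nil)
}
```

In a real script, the map from `hiveVars` would be passed to `HiveStep`, and the conditional would append actual steps to the `steps` sequence instead of strings.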

API documentation

download

Available from Maven Central with group ID com.bizo and artifact ID spark-plug_2.10 (the build for Scala 2.10).
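
For an sbt build, the dependency line might look like the sketch below. The version string is a placeholder, not a real release number; check Maven Central for the current one.

```scala
// build.sbt
// "VERSION" is a placeholder; replace it with a release published on Maven Central.
libraryDependencies += "com.bizo" % "spark-plug_2.10" % "VERSION"
```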
