Richard Darst edited this page Apr 13, 2018 · 12 revisions

Planning

  • No prerequisite of previous R or Python experience

    We do require some basic programming experience (say, the equivalent of a hypothetical "Programming 101"), but it doesn't have to be specifically in R or Python.

  • Should focus on hands-on doing rather than lectures plus separate exercises (see the CodeRefinery approach)

  • Presentation technology bikeshedding

    If this were a Python-only course, Jupyter notebooks would be the obvious choice. But what about R users? Jupyter isn't that popular among them; R users tend to use RStudio, which provides R Markdown documents that support the same kind of literate programming as Jupyter notebooks.

Notes

Key topics:

  • IO, data storage formats (local disks, scratch, ...)
  • Comparison of type of tools/libraries for different tasks
  • Filesystems (what we have available)?
  • matplotlib/ggplot
  • Optimizing memory usage
  • Parallelization: split, apply, combine, array jobs

Secondary topics:

  • profiling
  • slurm scripts/slurm history/array jobs
  • memory/object models
  • seff
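
The "profiling" bullet above could be taught with nothing but the standard library. A minimal sketch (function names are made up for illustration): `cProfile` shows where time goes function by function, and `timeit` compares two implementations of the same computation.

```python
# Minimal profiling sketch using only the standard library:
# cProfile for per-function timings, timeit for quick comparisons.
import cProfile
import io
import pstats
import timeit

def slow_sum(n):
    # Deliberately slow: an explicit Python loop.
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_sum(n):
    # Closed-form equivalent: 0.5 * (0 + 1 + ... + (n-1)).
    return 0.5 * n * (n - 1) / 2

# cProfile: where is the time actually spent?
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())

# timeit: quick wall-clock comparison of the two versions.
t_slow = timeit.timeit(lambda: slow_sum(10_000), number=100)
t_fast = timeit.timeit(lambda: fast_sum(10_000), number=100)
print(f"slow: {t_slow:.4f}s  fast: {t_fast:.4f}s")
```

The same pattern transfers to R (`Rprof`, `system.time`), so the lesson structure can be shared between both courses.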

Python specific

R specific

How much do we want to teach tidyverse ("Hadleyverse") packages vs. base R?

  • ggplot2 at least is IMHO quite a lot better than base R plotting, and it is widely used.

Themes

Unlike the outline, these are the big lessons people should learn via the things we teach.

  • use the right tools, data structures, and libraries
  • automation of workflows. Don't do everything manually
  • use good file formats
  • good development environments, IDEs, ...
  • profiling (and less debugging)

Outline

The general idea is that we do the same workshop/session/lecture/whatever twice, once with R and once with Python. That allows us to reuse lecture materials for both courses and share improvements.

Day 1

  • Introduction
    • What does the course cover?
    • Data Frames
      • What kind of data structure is it? Compare it to the other usual suspects: lists, dicts, and N-d arrays.
        • Special features: Categories/Factors, missing values
      • Useful for tabular data (CSV files, some similarities with RDBMS)
  • Get people set up
    • Start an RStudio / Jupyter notebook session on a compute node via Slurm
    • SSH keys (at least for R)
  • Introductory exercises
    • numpy/pandas beginnings (/ similar stuff for R)
  • Profiling, debugging
  • A few more short exercises
  • I/O
    • HDF5 / pytables
    • sqlite
    • csv
  • Even more exercises
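
The I/O bullets above (CSV, sqlite) can be demonstrated as one round trip; a sketch of the Python side, assuming pandas is available (the HDF5/pytables part is omitted here since it needs extra dependencies, and the file names are placeholders):

```python
# Round trip through two of the lesson's storage formats with pandas.
import os
import sqlite3
import tempfile

import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "score": [1.5, 2.0, float("nan")],  # note the missing value
})

tmpdir = tempfile.mkdtemp()

# CSV: human-readable and ubiquitous, but loses dtypes and is slow at scale.
csv_path = os.path.join(tmpdir, "scores.csv")
df.to_csv(csv_path, index=False)
df_csv = pd.read_csv(csv_path)

# sqlite: a single-file relational database, queryable without loading it all.
db_path = os.path.join(tmpdir, "scores.db")
with sqlite3.connect(db_path) as conn:
    df.to_sql("scores", conn, index=False, if_exists="replace")
    df_sql = pd.read_sql("SELECT name, score FROM scores", conn)

print(df_csv)
print(df_sql)
```

The missing value survives both round trips (as an empty CSV field and a SQL NULL), which ties back to the "missing values" point in the data-frames intro.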

Day 2

  • Maybe move part of I/O from day 1 here?
  • Split-apply-combine
    • Motivation, why is this a common and useful workflow?
    • Running on a parallel batch system
      • Small problem: Everything in one process
      • Medium: Apply part in parallel using multiprocessing or another simple technique.
      • Large: Apply part in parallel using slurm array jobs, and using job dependencies to correctly order the split, apply, and combine phases.
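
The small and medium scales above can be sketched in a few lines, assuming pandas. A `ThreadPoolExecutor` stands in for the parallel apply step here for portability; for CPU-bound work, `multiprocessing.Pool` is the usual drop-in. The large (Slurm) scale is only hinted at in comments.

```python
# Split-apply-combine at the "small" and "medium" scales from the outline.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Small: split, apply, and combine all happen inside one groupby call.
small = df.groupby("group")["value"].mean()

# Medium: split explicitly, run the apply step in parallel, then combine.
def apply_step(item):
    key, chunk = item
    return key, chunk["value"].mean()

chunks = list(df.groupby("group"))                 # split
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(apply_step, chunks))   # apply (in parallel)
combined = pd.Series(results).sort_index()         # combine

assert (small == combined).all()
print(combined)

# Large (sketch only): write each chunk to disk, run the apply step as a
# Slurm array job (one task per chunk), and combine in a dependent job:
#   sbatch --array=0-9 apply.sh
#   sbatch --dependency=afterok:<jobid> combine.sh
```

The point of showing all three scales with the same split/apply/combine vocabulary is that moving from a laptop to the cluster only changes the machinery, not the workflow.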

Day 3

  • Visualization with matplotlib & ggplot
    • Seaborn could be interesting too (a statistics-focused layer on top of matplotlib), but I have no personal experience with it.
    • For matplotlib we could cover tricks like using LaTeX to render math in axis labels etc.
  • Workflows for visualization
    • repeatability is important!
    • putting plotting stuff into scripts vs. redoing it
    • using make for managing workflows
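
The "plotting in scripts" point can be made concrete with a minimal repeatable-plotting sketch: the whole figure lives in a script, so it can be regenerated (e.g. from a Makefile rule) whenever the data changes. This uses the object-oriented matplotlib API and a non-interactive backend; the output file name is a placeholder.

```python
# A repeatable plot: everything needed to rebuild the figure is in this
# script, so no manual GUI steps are required.
import matplotlib
matplotlib.use("Agg")          # no display needed; suits batch/cluster jobs
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel(r"$x^2$")        # math rendered via mathtext
ax.set_title("Repeatable plot from a script")
fig.tight_layout()
fig.savefig("plot.png", dpi=150)
```

A Makefile rule like `plot.png: plot.py data.csv` then re-runs the script only when its inputs change, which is exactly the workflow-management point above.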

Python course breakdown

  • Day 1
    • 30 min: general course intro and Jupyter notebook intro
    • 30 min: data types and numpy
      • ufuncs, broadcasting, ...
    • 30 min: pandas intro, some puzzles. Baby names example (from: pandas 100 puzzles)
    • 30 min: advanced dataframe operations (PROFILING)
    • 30 min: read data to pandas from sqlite (and other formats)
    • 15 min: More about notebooks: publishing, version control, questions and so on.
  • Day 2: data handling
    • 30 min: Split-apply-combine (DEBUGGING)
    • 15 min: small vs. large files: intro and a basic benchmark
    • 30 min: advanced storage formats: HDF5, sqlite, containers for machine learning, etc.
    • 30 min: basic automation with makefiles
  • Day 3: visualization
    • 10 min: Intro and video: some high-level visualization motivation video (one that describes the four(?) types of visual ways to represent data and what each is good for).
    • 30 min: matplotlib basic concepts: figures vs. axes vs. axis objects, object-orientedness, common arguments, the object-oriented API vs. the implicit global (pyplot) API, etc. Seaborn. The graphics-stack big picture: flexible but hard-to-use tools vs. limited-purpose but easy-to-use tools. Scriptability of graphics.
    • ?? min: matplotlib/seaborn examples.
    • (insert rest here)
    • 30 min: interactive visualization with jupyter and widgets.
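
The "ufuncs, broadcasting" bullet in Day 1 fits in a very short demo: elementwise ufuncs replace explicit Python loops, and broadcasting stretches shapes so arrays of different dimensions combine without copying data.

```python
# NumPy ufuncs and broadcasting in a nutshell.
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# ufunc: np.exp applies elementwise, in C, to the whole array at once.
b = np.exp(a)

# Broadcasting: a (3, 1) column against a length-4 row gives a (3, 4) grid.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,), treated as (1, 4)
grid = col + row                    # shape (3, 4)

print(grid)
# [[ 1  2  3  4]
#  [11 12 13 14]
#  [21 22 23 24]]
```

This is a natural warm-up puzzle before the pandas material, since pandas alignment builds on the same elementwise semantics.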

3rd party resources for inspiration

Dataset ideas