ME414 Introduction to Data Science and Big Data Analytics

LSE Methods Summer Programme 2016

Kenneth Benoit, Department of Methodology, LSE
Slava Mikhaylov, University College London (UCL)
Jack Blumenau (Labs), Department of Methodology, LSE

This repository contains the class materials for the LSE Methods Summer Programme course ME414 Introduction to Data Science and Big Data Analytics taught in August 2016 by Kenneth Benoit and Slava Mikhaylov.

Overview

Data Science and Big Data Analytics are exciting new areas that combine scientific inquiry, statistical knowledge, substantive expertise, and computer programming. One of the main challenges for businesses and policy makers when using big data is to find people with the appropriate skills. Good data science requires experts who combine substantive knowledge with data analytical skills, which makes it a prime area for social scientists with an interest in quantitative methods. This course integrates prior training in quantitative methods (statistics) and coding with substantive expertise and introduces the fundamental concepts and techniques of Data Science and Big Data Analytics.

Typical students will be Masters and PhD students from any field who need the fundamentals of data science or who work with large datasets and databases. Practitioners from industry, government, or research organisations with some basic training in quantitative analysis or computer programming are also welcome. Because this course surveys diverse techniques and methods, it makes an ideal foundation for more advanced or more specific training. Our applications are drawn from social, political, economic, legal, and business and marketing fields, rather than engineering or other sciences.

Objectives

This course aims to provide an introduction to the data science approach to the quantitative analysis of data using the methods of statistical learning, an approach blending classical statistical methods with recent advances in computational and machine learning. We will cover the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using the methods we cover. We also cover data preparation and processing, including working with structured databases, key-value formatted data (JSON), and unstructured textual data. At the end of this course students will have a sound understanding of the field of data science, the ability to analyse data using some of its main methods, and a solid foundation for more advanced or more specialised study.

The course will be delivered as a series of morning lectures, followed by lab sessions in the afternoon where students will apply the lessons in a series of instructor-guided exercises using data provided as part of the exercises. The course will cover the following topics:

an overview of data science and the challenge of working with big data using statistical methods
how to integrate the insights from data analytics into knowledge generation and decision-making
how to acquire data, both structured and unstructured, and to process it, store it, and convert it into a format suitable for analysis
an overview of regression and classification methods and related methods for assessing model fit and cross-validating predictive models
supervised learning approaches, including linear and logistic regression, decision trees, and naive Bayes
unsupervised learning approaches, including clustering, association rules, and principal components analysis
quantitative methods of text analysis, including mining social media and other online resources

Prerequisites

An introduction to quantitative methods at any level would serve as a very useful foundation for this course, although no formal prerequisites are required. Familiarity with computer programming or database structures is a benefit, but not formally required.

Preparing before the course

We strongly recommend that you spend some time in July and August before the course working through some of the following materials:

James et al. (2013) An Introduction to Statistical Learning: With applications in R, Springer, Chapters 1-2. Note: The book is available from the authors' page here.
An Introduction to R.
Downloading and installing RStudio and R on your computer.
Data Camp R tutorials.
Data Camp R Markdown tutorials, first chapter.
R Codeschool.
Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media, Chapters 1-3. Note: Online version is available from the authors' page here.

Important Specifics

Computer Software

Computer-based exercises will feature prominently in the course, especially in the lab sessions. The use of all software tools will be explained in the sessions, including how to download and install them. All of the class work will be done using R, using publicly available packages.

Main Texts

The primary texts are:

James et al. (2013) An Introduction to Statistical Learning: With applications in R, Springer. Note: The book is available from the authors' page here.
Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media. Note: Online version is available from the authors' page here.
Zumel, N. and Mount, J. (2014). Practical Data Science with R. Manning Publications.

The following are supplemental texts which you may also find useful:

Lantz, B. (2013). Machine Learning with R. Packt Publishing.
Lesmeister, C. (2015). Mastering Machine Learning with R. Packt Publishing.
Conway, D. and White, J. (2012) Machine Learning for Hackers. O'Reilly Media.
Leskovec, J., Rajaraman, A. and Ullman, J. (2011). Mining of Massive Datasets. Cambridge University Press.
Zafarani, R., Abbasi, M. A. and Liu, H. (2014) Social Media Mining: An introduction. Cambridge University Press.
Hastie et al. (2009) The Elements of Statistical Learning: Data mining, inference, and prediction. Springer. Note: The book is available from the authors' page here.

Instructions for Submitting Homeworks

Each homework will be a single file in the RMarkdown format. The files linked below are named very carefully, to make it easy for us to identify your completed lab assignments.

Obtaining the starter files.

Each day in the schedule below links to a starter file for you to download and work with. These are in the RMarkdown format. You should embed your answers, with code, into your copy of the starter file.

Renaming the starter files.

For example, the first assignment file is named ME414_assignment1_LASTNAME_FIRSTNAME.Rmd, which you will need to rename by replacing the uppercase terms with your own last and first names, e.g. ME414_assignment1_Bloggs_Joe.Rmd.
From RStudio, you can create an HTML output file with your evaluated code, figures, and text answers by clicking the "Knit HTML" button at the top of the editor pane. The resulting HTML file will be saved in your working directory.
You will need to upload the resulting HTML file (renamed!) to the appropriate assignment folder on the course Moodle page.

We will walk you through this process in the Day 1 lab, so don't worry if it seems complicated the first time. This sort of careful workflow process and file management is part of learning practical data science too!
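The renaming and knitting steps above can also be done from the R console. The following is only a minimal sketch, not part of the official course materials: the helper `assignment_filename()` is hypothetical, and the knitting step assumes that the `rmarkdown` package is installed and that the starter file sits in your working directory.

```r
# Hypothetical helper (not part of the course materials): builds a file name
# following the ME414_assignmentN_LASTNAME_FIRSTNAME.Rmd pattern.
assignment_filename <- function(number, last, first, ext = "Rmd") {
  sprintf("ME414_assignment%d_%s_%s.%s", number, last, first, ext)
}

starter <- assignment_filename(1, "LASTNAME", "FIRSTNAME")
mine    <- assignment_filename(1, "Bloggs", "Joe")

# Rename the downloaded starter file, if it is in the working directory.
if (file.exists(starter)) {
  file.rename(starter, mine)
}

# Knit to HTML from the console instead of using the "Knit HTML" button.
# Skipped if the file is missing or the rmarkdown package is not installed.
if (file.exists(mine) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(mine, output_format = "html_document")
}
```

Replace "Bloggs" and "Joe" with your own last and first names. The knitted HTML file takes the same base name as the .Rmd file, so renaming before knitting means the output is already named correctly for upload.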

Instructions for use of course materials

You have three options for downloading the course material found on this page:

You can download the materials by clicking on each link.
You can "clone" the repository, using the buttons found to the right side of your browser window as you view this repository. This is the button labelled "Clone in Desktop". If you do not have a git client installed on your system, you will need to get one here and also make sure that git is installed. Cloning is the preferred option, since you can refresh your clone as new content gets pushed to the course repository. (New material will be pushed to the course repository at least once per day while the course takes place.)
Statically, you can choose the button on the right marked "Download zip" which will download the entire repository as a zip file.

You can also subscribe to the repository if you have a GitHub account, which will send you updates each time new changes are pushed to the repository.

Instructors

Kenneth Benoit is Professor of Quantitative Social Research Methods at the Department of Methodology, LSE. With a background in political science, his substantive work focuses on political party competition, political measurement issues, and electoral systems. His research and teaching are primarily in the field of social science statistical applications. His recent work concerns the quantitative analysis of text as data, for which he has developed a package for the R statistical software.

Dr. Slava Mikhaylov is a Senior Lecturer in Quantitative Methods at UCL and has been teaching quantitative methods at the UCL Political Science department for the last five years. He is currently involved in an ESRC Big Data infrastructure investment initiative, the Consumer Data Research Centre. One of Slava's responsibilities in the Centre is the development and provision of big data analytics training for the academic and professional community (data users). In addition, Slava Mikhaylov is deputy director of the UCL Q-Step Centre, an ESRC-funded initiative to promote quantitative methods.

Jack Blumenau is a PhD student in the LSE Government Department and will start a postdoc in the LSE Methodology Department in the autumn. Jack will serve as the lab assistant for this course and will deliver one of the sessions, related to his PhD research.

Course Schedule

Monday, August 15: Overview and introduction to data science [KB, SM]

We will use this session to get to know the range of interests and experience students bring to the class, as well as to survey the machine learning approaches to be covered. We will also discuss and demonstrate the R software.

Resources

Lecture Notes
Assignment 1 as R markdown
Assignment 1 solution as R markdown

Required reading:

James et al. (2013), Chapters 1-2. Note: The book is available from the authors' page here.
An Introduction to R.
Downloading and installing RStudio and R on your computer.
Data Camp R tutorials.
Data Camp R Markdown tutorials, first chapter.
R Codeschool.
Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media, Chapters 1-3. Note: Online version is available from the authors' page here.

Tuesday, August 16: Data structures and databases [KB]

Data types and formats, record cleaning, linkage, SQL, JSON, massive data processing.

Resources

Lecture Notes
Assignment 2 as R markdown
Assignment 2 solution as R markdown

Required reading:

Grolemund and Wickham, Chapters 4, 8-10.
Lantz, Chapters 2, 12.
Zumel and Mount, Chapters 2-4.

Wednesday, August 17: Linear Regression [SM]

Linear regression model and supervised learning.

Resources

Lecture Notes
Assignment 3 as R markdown
Assignment 3 solution as R markdown

Required reading:

James et al., Chapter 3.

Thursday, August 18: Classification [SM]

Logistic regression, discriminant analysis, Naive Bayes, evaluating model performance.

Resources

Lecture Notes
Assignment 4 as R markdown
Assignment 4 solution as R markdown

Required reading:

James et al., Chapter 4.

Friday, August 19: Resampling methods, model selection and regularization [SM]

Cross-validation, bootstrap, ridge and lasso.

Resources

Lecture Notes
Assignment 5 as R markdown
Assignment 5 solution as R markdown

Required reading:

James et al., Chapters 5-6.

Monday, August 22: Non-linear models and tree-based methods [SM]

GAMs, local regression, decision trees, random forest, boosting.

Resources

Lecture Notes
Assignment 6 as R markdown
Assignment 6 solution as R markdown

Required reading:

James et al., Chapters 7-8.

Tuesday, August 23: Unsupervised learning and dimensional reduction [KB]

Cluster analysis, PCA, correspondence analysis, association rules.

Resources

Lecture Notes
Assignment 7 as R markdown
Association rule mining code example
- R code
- book dataset for example
Assignment 7 solution as R markdown

Required reading:

James et al., Chapter 10.

Wednesday, August 24: Text analysis [KB]

Working with text in R, sentiment analysis, dictionary methods.

Resources

Lecture Notes
Assignment 8 as R markdown or html
- sample zip file UKimmigTexts.zip of texts for building a corpus
Assignment 8 solution as R markdown

Required reading:

Grimmer, J, and B M Stewart (2013), "Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis.
Benoit, Kenneth and Alexander Herzog. In press. "Text Analysis: Estimating Policy Preferences From Written and Spoken Words." In Analytics, Policy and Governance, eds. Jennifer Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.

Thursday, August 25: Topic modelling [JB]

Resources

Lecture Notes
Assignment 9 as R markdown
Assignment 9 solution as R markdown

Required reading:

Blei, David (2012). "Probabilistic topic models." Communications of the ACM, 55(4): 77-84.
Blei, David, Andrew Y. Ng, and Michael I. Jordan (2003). "Latent Dirichlet allocation." Journal of Machine Learning Research 3: 993-1022.
Blei, David (2014). "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models." Annual Review of Statistics and Its Application, 1: 203-232.

Friday, August 26: Mining the Social Web [KB, JB]

Working with the Twitter API, Facebook API, JSON data, and examples.

Resources

Lecture Notes
General examples from the lecture
Streaming example code
REST example code
Assignment 10 as R markdown. Note: This is a take-home set only, as there is no lab on Day 10.

Required reading:

Broniatowski, David A, Michael J Paul, and Mark Dredze. 2013. "National and Local Influenza Surveillance Through Twitter: an Analysis of the 2012-2013 Influenza Epidemic." PLoS ONE 8(12): 83672–78. PDF here
Twitter Authentication setup:
- Official
- Walkthrough
Twitter API documentation:
- Overview of REST API
- Overview of streaming API

Assessment

Exam: Friday, August 26, Time: 14:00, Room: Clement House CLM.4.02

Instructions: Complete and submit the exam just as you would any lab assignment: rename the file, edit the R Markdown, knit it, and submit your knitted HTML file through Moodle. The Moodle page for the course is linked here.
Formatting: Put your own textual answers in boldface (using **boldface type** in RMarkdown), so that we can easily identify them when reviewing your HTML file.
Deadline: Monday 29 August 17:00 London time (GMT+1)
