- Using a mixed-method approach in Dr. David Jeong's lab to study social media interactions as well as the ethical implications of emerging media technology.
- Developed a Reddit comment scraper for sentiment analysis on behalf of the SCU Media Research Lab - Data Science Team.
- Installing Node and NPM on Windows and macOS is straightforward because you can just use the provided installer:
- Download the required installer:
- Go to https://nodejs.org/en/
- Select the button to download the LTS build that is "Recommended for most users".
- Install Node by double-clicking on the downloaded file and following the installation prompts.
- The easiest way to test that Node is installed is to run the "version" command in your terminal/command prompt and check that a version string is returned:
> node -v
v16.15.1
- The Node.js package manager NPM should also have been installed, and can be tested in the same way:
> npm -v
8.13.1
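The same check can be made from inside a script, which can help a scraper fail early on an outdated runtime. This is only a sketch: the helper name `meetsMinimumMajor` and the minimum major version of 16 are assumptions chosen to match the example output above, not requirements stated by this project.

```javascript
// Hypothetical helper: true when a version string such as "v16.15.1" or "8.13.1"
// has a major version of at least minMajor.
function meetsMinimumMajor(versionString, minMajor) {
  const major = Number.parseInt(versionString.replace(/^v/, '').split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.version is Node's own version string, the same value `node -v` prints.
if (!meetsMinimumMajor(process.version, 16)) {
  console.warn(`Node ${process.version} is older than the assumed minimum (v16).`);
}
```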
- Much as pip does for Python, NPM fetches any packages (JavaScript libraries) that an application needs for Node development, testing, and/or production. It may also be used to run tests and tools used in the development process.
- You can manually use NPM to separately fetch each needed package. Typically we instead manage dependencies using a plain-text definition file named `package.json`. The `package.json` file should contain everything NPM needs to fetch and run your application.
- The following steps show how you can use NPM to download a package, save it into the project dependencies, and then require it in a Node application.
- First create a directory named `reddit-scraper-bot` for your new application and navigate into it:
mkdir reddit-scraper-bot
cd reddit-scraper-bot
- Use the `npm init` command to create a `package.json` file for your application. This command prompts you for a number of things, including the name and version of your application and the name of the initial entry point file (by default this is `index.js`). For now, just accept the defaults:
npm init
- Now run `npm install [insert package name]` for each of the following dependencies that this project needs:
  - `entities`: Decodes HTML entities (e.g. `&amp;` becomes `&`, `&quot;` becomes `"`, `&lt;` becomes `<`, `&gt;` becomes `>`).
  - `json2csv`: Converts JSON to CSV.
  - `path`: Node.js "path" module published to the NPM registry.
  - `snoowrap`: Fully featured JavaScript wrapper that provides a simple interface to access every Reddit API endpoint.
  - `vader-sentiment`: JavaScript port of the VADER sentiment analysis tool, which lets a Node.js app determine sentiment from text.
  - `fs`: Provides a lot of very useful functionality to access and interact with the file system (it is a core module that ships with Node.js, so `require('fs')` works even without installing it).
  - `dayjs`: JavaScript date library for parsing, validating, manipulating, and formatting dates.
  - `prompt-sync`: A very simple synchronous prompt for Node.js, with no C++ bindings and no bash scripts.
  - `pm2`: Production process manager for Node.js applications that keeps them alive for as long as you want without downtime and facilitates common system admin tasks.
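Once the installs are done, a quick way to confirm that nothing was skipped is to ask Node whether it can resolve each name. The sketch below is an illustration, not part of the project's scripts: `findMissingPackages` is a hypothetical helper, and `pm2` is left out on the assumption that it is used as a command-line tool rather than required in code.

```javascript
// The dependency names from the list above (pm2 omitted, see the note above).
const REQUIRED_PACKAGES = [
  'entities', 'json2csv', 'path', 'snoowrap', 'vader-sentiment',
  'fs', 'dayjs', 'prompt-sync',
];

// Hypothetical helper: returns the names that Node cannot resolve from this folder.
function findMissingPackages(names) {
  return names.filter((name) => {
    try {
      require.resolve(name); // throws when the module cannot be resolved
      return false;
    } catch (err) {
      return true;
    }
  });
}

const missing = findMissingPackages(REQUIRED_PACKAGES);
if (missing.length > 0) {
  console.log(`Missing packages, try: npm install ${missing.join(' ')}`);
} else {
  console.log('All listed packages can be resolved.');
}
```

Run it from inside `reddit-scraper-bot` so that resolution starts from the folder where the packages were installed.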
- Go to this website, make a new Reddit account (or use a pre-existing one), and you will see a box at the bottom that reads: "are you a developer, create an app."
- Click "create app" and obtain the following credentials from the portal to properly configure your Reddit API instance.
- To properly secure your new Reddit credentials, make a file called `.env` and copy the format below, filling in the variables for the Reddit API instance:
CLIENT_ID=[insert-client-id]
CLIENT_SECRET=[insert-reddit-client-secret]
USER_AGENT='Student Research Lab Robot by /u/anon-username' 'https://github.com/Santa-Clara-Media-Lab/student-research-lab-robot'
USER_NAME=[your-reddit-username]
PASS_WORD=[your-reddit-password]
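One way to read these values at runtime is sketched below. It parses `.env`-style text by hand so that it needs nothing beyond Node's built-in modules. The helper name `parseEnv` and the `redditOptions` object are hypothetical, and the project's own scripts may load the file differently.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: turns KEY=VALUE lines into an object.
// Blank lines and lines starting with # are skipped, and one pair of
// surrounding quotes is removed from a value.
function parseEnv(text) {
  const values = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=VALUE line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quoted = /^(['"])(.*)\1$/.exec(value);
    if (quoted) value = quoted[2];
    values[key] = value;
  }
  return values;
}

// Read .env from the current folder if it exists.
const envPath = path.join(process.cwd(), '.env');
const env = fs.existsSync(envPath) ? parseEnv(fs.readFileSync(envPath, 'utf8')) : {};

// Map the variable names from the format above onto the option names
// used for script-type credentials in snoowrap.
const redditOptions = {
  userAgent: env.USER_AGENT,
  clientId: env.CLIENT_ID,
  clientSecret: env.CLIENT_SECRET,
  username: env.USER_NAME,
  password: env.PASS_WORD,
};
```

With `snoowrap` installed, passing `redditOptions` to `new snoowrap(...)` would then create the API client. Keep `.env` out of version control so the credentials stay private.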
- Holistic Method:
- Scrapes across various subreddits with set limits on the number of posts and comments, to avoid overloading the Reddit server with requests.
- To use this method, do the following:
- Copy the code linked above, name it `holisticMethodReddit.js`, and save it in your current coding folder, `reddit-scraper-bot`.
- Go to the command line, type the following, then hit enter:
pm2 start holisticMethodReddit.js
- Wait: this method takes a long time, but over a couple of hours it will output large amounts of metadata into `holisticRedditComments.csv` in your coding folder.
- Check the `redditComments` folder for your outputted `holisticRedditComments.csv` file!
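The idea of capping posts and comments per subreddit while pacing requests can be sketched as follows. This is not the linked `holisticMethodReddit.js` script: the function names, the default limits, the delay, and the injected `fetchPosts` callback are all hypothetical stand-ins so that the example runs without Reddit credentials.

```javascript
// Hypothetical pause helper used to space out requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits each subreddit in turn, keeps at most `postLimit` posts and
// `commentLimit` comments per post, and waits `delayMs` between subreddits.
// `fetchPosts(subreddit, postLimit)` is an assumed callback that resolves to
// an array of { id, comments: [...] } objects.
async function scrapeSubreddits(subreddits, fetchPosts, options = {}) {
  const { postLimit = 10, commentLimit = 50, delayMs = 1000 } = options;
  const rows = [];
  for (const [index, subreddit] of subreddits.entries()) {
    if (index > 0) await sleep(delayMs); // pace requests to the server
    const posts = await fetchPosts(subreddit, postLimit);
    for (const post of posts.slice(0, postLimit)) {
      for (const comment of post.comments.slice(0, commentLimit)) {
        rows.push({ subreddit, postId: post.id, comment });
      }
    }
  }
  return rows;
}
```

Passing the fetcher in as a callback keeps the limiting logic separate from the Reddit client, so the limits can be tried out against fake data before any real requests are made.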
- Quality Method:
- Recursively scrapes all comments and their child replies from a specific post ID.
- Say we wanted to scrape all the comments from this thread. We would do the following:
- Copy the code linked above, name it `qualityMethodReddit.js`, and save it in your current coding folder, `reddit-scraper-bot`.
- Go to the post comment URL, https://www.reddit.com/r/virtualreality/comments/obdzm5, and get `obdzm5`, which is the post ID.
- Go to the command line, type the following, then hit enter:
pm2 start qualityMethodReddit.js
- It will ask you to input a post ID, so enter `obdzm5`.
- Check the `redditComments` folder for your outputted `qualityRedditComments.csv` file!
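Two parts of this method lend themselves to a short illustration: pulling the post ID out of the URL, and walking comments together with their nested replies. The sketch below is not the linked `qualityMethodReddit.js` script. Both helper names are hypothetical, and the `{ id, body, replies }` comment shape is an assumption made so that the example runs on plain objects.

```javascript
// Hypothetical helper: pulls the post ID out of a Reddit comments URL.
// Returns null when the URL has no /comments/<id> segment.
function extractPostId(url) {
  const match = /\/comments\/([a-z0-9]+)/i.exec(url);
  return match ? match[1] : null;
}

// Hypothetical helper: walks a comment tree depth-first and returns a flat list.
// Each comment is assumed to look like { id, body, replies: [...] }.
function flattenComments(comments, depth = 0, rows = []) {
  for (const comment of comments) {
    rows.push({ id: comment.id, body: comment.body, depth });
    flattenComments(comment.replies || [], depth + 1, rows);
  }
  return rows;
}

const postId = extractPostId('https://www.reddit.com/r/virtualreality/comments/obdzm5');
console.log(postId);
```

A flat list with a `depth` field is convenient for CSV output, since each row can still record how deeply the reply was nested.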
- Updating outdated Node.js and NPM versions.
- Fixing the Node.js environment path variables for Windows/Mac per this website's instructions.
- Identifying and installing packages for dependencies not found.
- Receiving Reddit status codes 429 and/or 503 when using the Reddit API via my scripts.
- Having Reddit API credentials that don’t match what you have on the developer portal and/or entering inaccurate information.
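For the 429 and 503 status codes mentioned above, a common mitigation is to wait and retry with increasing delays. The sketch below shows one way to do that. `withRetry` is a hypothetical helper, the retry count and delays are arbitrary, and it assumes the thrown error exposes the HTTP status as `err.statusCode`, which may not match what your Reddit client actually throws.

```javascript
// Hypothetical pause helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `request()` and retries with exponential backoff
// when the thrown error carries a statusCode of 429 or 503.
// Any other error, or running out of retries, rethrows the last error.
async function withRetry(request, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await request();
    } catch (err) {
      const retryable = err && (err.statusCode === 429 || err.statusCode === 503);
      if (!retryable || attempt >= retries) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 1x, 2x, 4x, ... the base delay
    }
  }
}
```

Wrapping each API call, for example `withRetry(() => someRedditCall())`, where `someRedditCall` stands in for whatever request your script makes, keeps the retry policy in one place.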