Wikgraph

This project, which started as an end-of-year project for CSC111 at UofT, focuses on finding gaps in the knowledge on Wikipedia. The information on Wikipedia can be treated as a microcosm of collective human knowledge, so finding gaps or underdeveloped areas in it suggests directions that we should explore as a society.

This project entails running data analysis on the entire Wikipedia dataset, which is by nature very large. Running the computations therefore requires substantial processing power and memory (at least 16 GB of RAM, plus a paging scheme such as a swapfile or swap partition). More detailed instructions can be found in the project report.

Dev instructions (from project root)

  1. Create a venv: python -m venv venv and enter the venv:
    • source venv/bin/activate on Unix
    • venv\Scripts\activate.bat on Windows
  2. Install requirements: pip install -r requirements.txt
  3. Install local package: pip install -e .
  4. Place data files in respective locations in data/
  5. Run tests: pytest -v
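A common slip with the steps above is running pip install outside the venv. The following is a minimal sketch of a check for that; the function name in_virtualenv is hypothetical and not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("venv active" if in_virtualenv() else "venv NOT active")
```

If this prints "venv NOT active", activate the venv (step 1) before installing requirements.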

File Structure: IMPORTANT

Each of the subpoints below is a directory in the repo. When pushing, make sure that your code is as clean as possible, that you are not pushing unnecessary files, and that no files are in the wrong location.

The root directory will contain things like this README, requirements.txt, etc. Try not to clutter it up with things that would be better placed in a subdirectory.

data

This directory is meant for data storage. Its contents will not be pushed, but the directory structure will remain. We don’t push the data because it’s bad practice to push files that are obtainable outside of the project (especially if these files are large).

raw

Raw files that have not yet been processed. This includes the wikidump.

reduced

Smaller sections of the wikidump that we can run trials on.

processed

This is where output will go. We may push some of these files or find another way to share them, since the processing time will be very long.
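Since the contents of data/ are not pushed, it can help to recreate the layout locally before placing files in it. This is a minimal sketch, assuming the three subdirectories named above sit directly under data/; the helper name ensure_data_layout is hypothetical.

```python
from pathlib import Path
from typing import List

# Subdirectories of data/ described in this README.
DATA_SUBDIRS = ("raw", "reduced", "processed")


def ensure_data_layout(root: Path) -> List[Path]:
    """Create data/raw, data/reduced and data/processed under root if missing."""
    created = []
    for name in DATA_SUBDIRS:
        path = root / "data" / name
        # exist_ok=True makes this safe to run repeatedly.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling ensure_data_layout(Path.cwd()) from the project root leaves existing files untouched and only creates directories that are missing.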

proposal

Directory for the project proposal. Only push tex, pdf, and bib files.

report

Directory for the project report. Only push tex, pdf, and bib files.

wikigraph

This is where all the Python files will go. There should generally be no subfolders here, though there are some exceptions. This keeps path management simple (how Python modules are imported, etc).

All Python files here need to include the following:

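"""Module docstring"""
import os  # Toward the top of the file

if __name__ == '__main__':
    # Switch to the project root (the parent of wikigraph/).
    # abspath keeps this working when __file__ is a relative path.
    os.chdir(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))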

This snippet changes the working directory to the project root, so the code behaves the same no matter where you execute it from. This smooths out some differences between how VS Code and PyCharm or a terminal launch Python. Some of our TAs use VS Code, so this is NECESSARY.

We should also make sure to document our code very well.

test

This directory is where we will put unit tests, but it is also okay to keep ad hoc tests for other things here. Try to make sure that your code is as clean as possible when you’re pushing.
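As a rough illustration of the shape a unit test here might take: pytest collects files named test_*.py and functions named test_*. The helper normalize_title below is hypothetical and only stands in for real project code.

```python
def normalize_title(title: str) -> str:
    """Hypothetical helper: trim whitespace and use underscores for spaces."""
    return title.strip().replace(" ", "_")


def test_normalize_title_replaces_spaces() -> None:
    # pytest reports a failure if this assertion does not hold.
    assert normalize_title(" Graph theory ") == "Graph_theory"
```

In a real test file the helper would be imported from the wikigraph package instead of being defined alongside the test.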