Data hosting (#100)
Co-authored-by: Sam Cunliffe <samcunliffe@users.noreply.github.com>
niksirbi and samcunliffe committed Nov 10, 2023
1 parent b329030 commit 4993bbc
Showing 136 changed files with 960 additions and 1,024 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/test_and_deploy.yml
@@ -36,6 +36,14 @@ jobs:
        - os: windows-latest
          python-version: "3.10"
    steps:
      # Cache the test data to avoid re-downloading
      - name: Cache Test Data
        uses: actions/cache@v3
        with:
          path: ${{ github.workspace }}/.WAZP/*
          key: cached-test-data
          enableCrossOsArchive: true

      # A hack because chrome isn't in the PATH on Windows
      - name: Fix Chrome application path on Windows
        if: matrix.os == 'windows-latest'
83 changes: 80 additions & 3 deletions CONTRIBUTING.md
@@ -124,6 +124,12 @@ For Windows, be sure to download the ``chromedriver_win32.zip`` file, extract th

It's a good idea to test locally before pushing. Pytest will run all tests and also report test coverage.

#### Test data
For some tests, you will need to use real experimental data.
We store some sample projects in an external data repository.
See [sample projects](#sample-projects) for more information.


### Continuous integration
All pushes and pull requests will be built by [GitHub actions](https://docs.github.com/en/actions). This will usually include linting, testing and deployment.

@@ -139,7 +145,7 @@ We use [semantic versioning](https://semver.org/), which includes `MAJOR`.`MINOR
* MINOR = new feature
* MAJOR = breaking change

-We use [`setuptools_scm`](https://github.com/pypa/setuptools_scm) to automatically version WAZP. It has been pre-configured in the `pyproject.toml` file. [`setuptools_scm` will automatically infer the version using git](https://github.com/pypa/setuptools_scm#default-versioning-scheme). To manually set a new semantic version, create a tag and make sure the tag is pushed to GitHub. Make sure you commit any changes you wish to be included in this version. E.g. to bump the version to `1.0.0`:
+We use [`setuptools_scm`](https://github.com/pypa/setuptools_scm) to automatically version WAZP. It has been pre-configured in the `pyproject.toml` file. `setuptools_scm` will automatically infer the version using git. To manually set a new semantic version, create a tag and make sure the tag is pushed to GitHub. Make sure you commit any changes you wish to be included in this version. E.g. to bump the version to `1.0.0`:

```sh
git add .
@@ -175,8 +181,6 @@ If you create a new documentation source file (e.g. `my_new_file.md` or `my_new_
my_new_file
```



### Building the documentation locally
We recommend that you build and view the documentation website locally, before you push it.
To do so, first install the requirements for building the documentation:
@@ -197,5 +201,78 @@ rm -rf docs/build
sphinx-build docs/source docs/build
```

## Sample projects

We maintain some sample WAZP projects for testing, examples and tutorials. They are hosted on an [external data repository](https://gin.g-node.org/SainsburyWellcomeCentre/WAZP).
Our hosting platform of choice is called [GIN](https://gin.g-node.org/) and is maintained by the [German Neuroinformatics Node](https://www.g-node.org/).
GIN has a GitHub-like interface and git-like [CLI](https://gin.g-node.org/G-Node/Info/wiki/GIN+CLI+Setup#quickstart) functionality.

### Project organisation

The projects are stored in folders named after the species - e.g. `jewel-wasp` (*Ampulex compressa*).
Each species folder may contain various WAZP sample projects as zipped archives. For example, the `jewel-wasp` folder contains the following projects:
- `short-clips_raw.zip` - a project containing short ~10 second clips extracted from raw .avi files.
- `short-clips_compressed.zip` - same as above, but compressed using the H.264 codec and saved as .mp4 files.
- `entire-video_raw.zip` - a project containing the raw .avi file of an entire video, ~32 minutes long.
- `entire-video_compressed.zip` - same as above, but compressed using the H.264 codec and saved as an .mp4 file.

Each WAZP sample project has the following structure:
```
{project-name}.zip
├── videos
│   ├── {video1-name}.{ext}
│   ├── {video1-name}.metadata.yaml
│   ├── {video2-name}.{ext}
│   ├── {video2-name}.metadata.yaml
│   └── ...
├── pose_estimation_results
│   ├── {video1-name}{model-name}.h5
│   ├── {video2-name}{model-name}.h5
│   └── ...
├── WAZP_config.yaml
└── metadata_fields.yaml
```
To learn more about how the sample projects were generated, see `scripts/generate_sample_projects` in the [WAZP GitHub repository](https://github.com/SainsburyWellcomeCentre/WAZP).
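
As a rough illustration of the layout above, the following sketch checks whether an unzipped project folder contains the expected top-level entries. The helper `missing_entries()` is hypothetical and not part of WAZP; only the folder and file names are taken from the structure shown above.

```python
from pathlib import Path

# Top-level entries expected in an unzipped sample project,
# taken from the project structure described above.
EXPECTED_DIRS = ("videos", "pose_estimation_results")
EXPECTED_FILES = ("WAZP_config.yaml", "metadata_fields.yaml")


def missing_entries(project_dir):
    """Return the expected top-level entries that are absent.

    Hypothetical helper for illustration; it does not inspect
    the contents of the sub-folders.
    """
    project_dir = Path(project_dir)
    missing = [d for d in EXPECTED_DIRS if not (project_dir / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (project_dir / f).is_file()]
    return missing
```

An empty list means all four expected entries were found.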

### Fetching projects
To fetch the data from GIN, we use the [pooch](https://www.fatiando.org/pooch/latest/index.html) Python package, which can download data from pre-specified URLs and store them locally for all subsequent uses. It also provides some nice utilities, like verification of sha256 hashes and decompression of archives.

The relevant functionality is implemented in the `wazp.datasets.py` module. The most important parts of this module are:

1. The `sample_projects` registry, which contains a list of the zipped projects and their known hashes.
2. The `find_sample_projects()` function, which returns the names of available projects per species, in the form of a dictionary.
3. The `get_sample_project()` function, which downloads a project (if not already cached locally), unzips it, and returns the path to the unzipped folder.
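
To make the three parts above more concrete, here is a minimal sketch of how a `pooch`-based registry and fetcher could be put together. It is not the actual content of the module: `BASE_URL`, `sample_registry`, `group_by_species()` and `fetch_project()` are hypothetical names and values chosen for illustration, and the `None` hashes are placeholders for the real sha256 hashes.

```python
from pathlib import Path, PurePosixPath

# Assumed values, for illustration only.
LOCAL_DATA_DIR = Path.home() / ".WAZP" / "sample_data"
BASE_URL = "https://gin.g-node.org/SainsburyWellcomeCentre/WAZP/raw/master/"

# Maps "{species}/{project}.zip" to its known hash.
# None is a placeholder: pooch skips verification for None,
# so a real registry should store the sha256 hash instead.
sample_registry = {
    "jewel-wasp/short-clips_raw.zip": None,
    "jewel-wasp/short-clips_compressed.zip": None,
}


def group_by_species(registry):
    """Return {species: [project names]} built from the registry keys."""
    projects = {}
    for key in registry:
        species, archive = key.split("/", 1)
        projects.setdefault(species, []).append(PurePosixPath(archive).stem)
    return projects


def fetch_project(species, project_name):
    """Download (if not cached) and unzip a project; return the extracted file paths."""
    import pooch  # imported lazily so the helper above works without pooch

    fetcher = pooch.create(
        path=LOCAL_DATA_DIR, base_url=BASE_URL, registry=sample_registry
    )
    return fetcher.fetch(
        f"{species}/{project_name}.zip",
        processor=pooch.Unzip(),  # unpack the archive after download
        progressbar=True,  # requires tqdm
    )
```

Calling `group_by_species(sample_registry)` needs no network access, whereas `fetch_project()` downloads from the assumed URL on first use and reuses the cached copy afterwards.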

Example usage:
```python
>>> from wazp.datasets import find_sample_projects, get_sample_project

>>> projects_per_species = find_sample_projects()
>>> print(projects_per_species)
{'jewel-wasp': ['short-clips_raw', 'short-clips_compressed', 'entire-video_raw', 'entire-video_compressed']}

>>> project_path = get_sample_project('jewel-wasp', 'short-clips_raw')
>>> print(project_path)
/home/user/.WAZP/sample_data/jewel-wasp/short-clips_raw
```

### Local storage
By default, the projects are stored in the `~/.WAZP/sample_data` folder. This can be changed by setting the `LOCAL_DATA_DIR` variable in the `wazp.datasets.py` module.
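
If you want to see which projects are already present locally, a sketch like the one below can help. It assumes the `{data-dir}/{species}/{project-name}` layout suggested by the example path above; `list_cached_projects()` is a hypothetical helper, not a WAZP function.

```python
from pathlib import Path

# Default location mentioned above.
DEFAULT_DATA_DIR = Path.home() / ".WAZP" / "sample_data"


def list_cached_projects(data_dir=DEFAULT_DATA_DIR):
    """Return {species: [unzipped project folders]} found under data_dir.

    Assumes one sub-folder per species, each holding one folder
    per unzipped project. Plain files (e.g. zip archives) are ignored.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return {}
    return {
        species.name: sorted(p.name for p in species.iterdir() if p.is_dir())
        for species in sorted(data_dir.iterdir())
        if species.is_dir()
    }
```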

### Adding new projects
Only core WAZP developers may add new projects to the external data repository.
To add a new project, you will need to:

1. Create a [GIN](https://gin.g-node.org/) account
2. Ask to be added as a collaborator on the [WAZP data repository](https://gin.g-node.org/SainsburyWellcomeCentre/WAZP) (if not already)
3. Download the [GIN CLI](https://gin.g-node.org/G-Node/Info/wiki/GIN+CLI+Setup#quickstart) and set it up with your GIN credentials, by running `gin login` in a terminal.
4. Clone the WAZP data repository to your local machine, by running `gin get SainsburyWellcomeCentre/WAZP` in a terminal.
5. Add your new projects to the cloned repository, then run `gin commit -m <message> <filename>`. Make sure to follow the [project organisation](#project-organisation) as described above. Don't forget to modify the README file accordingly.
6. Upload the committed changes to the GIN repository, by running `gin upload`. Latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
7. Determine the sha256 checksum hash of each new project archive, by running `sha256sum {project-name}.zip` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; print(pooch.file_hash('/path/to/file.zip'))"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'hash_registry.txt')"`.
8. Update the `wazp.datasets.py` module on the [WAZP GitHub repository](https://github.com/SainsburyWellcomeCentre/WAZP) by adding the new projects to the `sample_projects` registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual [guidelines for contributing code](#contributing-code). Additionally, you may want to update the scripts in `scripts/generate_sample_projects`, depending on how you generated the new projects. Make sure to test whether the new projects can be fetched successfully (see [fetching projects](#fetching-projects) above) before submitting your pull request.

You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
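
If neither `sha256sum` nor `pooch` is at hand for step 7, the hash can also be computed with the Python standard library. The sketch below is one way to do it; `sha256_of_file()` is a hypothetical helper name, and it should give the same hex digest as `sha256sum` for the same file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the sha256 hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use low for large video archives.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```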

## Template
This package layout and configuration (including pre-commit hooks and GitHub actions) have been copied from the [python-cookiecutter](https://github.com/SainsburyWellcomeCentre/python-cookiecutter) template.
3 changes: 1 addition & 2 deletions MANIFEST.in
@@ -4,9 +4,8 @@ include *.md
recursive-include wazp/*.py
recursive-include wazp/pages *.py

recursive-exclude sample_project *.avi
recursive-exclude sample_project *.h5
recursive-exclude docs *
recursive-exclude scripts *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]

2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -3,5 +3,5 @@ myst-parser
nbsphinx
pydata-sphinx-theme
setuptools-scm
-sphinx
+sphinx>=7.1
sphinx-autodoc-typehints
7 changes: 6 additions & 1 deletion docs/source/conf.py
@@ -83,12 +83,17 @@
"**/includes/**",
]

# Don't check the anchors for the following URLs during linkcheck
linkcheck_anchors_ignore_for_url = [
"https://gin.g-node.org/G-Node/Info/wiki/",
]

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "pydata_sphinx_theme"
html_title = "wazp"

-# Cutomize the theme
+# Customize the theme
html_theme_options = {
"icon_links": [
{
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -25,7 +25,9 @@ dependencies = [
"PyYAML",
"shapely",
"openpyxl",
-"defusedxml"
+"defusedxml",
+"pooch",
+"tqdm",
]

classifiers = [