
Dataset Catalog


An educational project: a web service for uploading datasets to S3-compatible storage and retrieving information about them.

Documentation

Technology Stack

  • Python 3.9
  • FastAPI
  • Uvicorn
  • PostgreSQL
  • MinIO, an S3-compatible storage
  • Containers

Project Layout

├── app                 # Application logic
│   ├── handlers        # Request handlers (similar to controllers or views in other frameworks)
│   ├── models          # SQLAlchemy table definitions and additional data types like enums
│   ├── repositories    # Modules containing SQLAlchemy query expressions → DB access interface
│   ├── schemas         # Pydantic schemas validating input and output data
│   ├── settings        # Application settings
│   ├── utils           # Helpers that don't contain any business logic and can be extracted
│   └── application.py  # FastAPI entry point and its configuration
├── main.py             # Web service entry point with additional settings
├── migrations          # Alembic migrations
├── pip.conf            # pip config for working with a private package registry like Artifactory
├── setup.cfg           # Python environment configs: linters, mypy rules, pytest, etc.
└── tests               # Test suite run with Pytest
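
To make the layering concrete, here is a condensed sketch of how a request might travel through the layers above: a Pydantic schema, a repository with SQLAlchemy expressions, and a FastAPI handler. Every name in it is hypothetical and only mirrors the directory roles; it is not code from the repository.

```python
# A condensed, hypothetical illustration of the layering above.
# All names are made up for the example; they mirror the directory roles only.
from fastapi import APIRouter
from pydantic import BaseModel
import sqlalchemy as sa

# app/schemas: Pydantic schema validating output data
class DatasetOut(BaseModel):
    id: int
    name: str

# app/repositories: SQLAlchemy query expressions behind a small interface
datasets = sa.table("datasets", sa.column("id"), sa.column("name"))

def select_dataset_by_id(dataset_id: int):
    return sa.select(datasets).where(datasets.c.id == dataset_id)

# app/handlers: the request handler that ties schema and repository together
router = APIRouter()

@router.get("/datasets/{dataset_id}", response_model=DatasetOut)
async def get_dataset(dataset_id: int) -> DatasetOut:
    # The real handler would execute the repository query against PostgreSQL;
    # a stub row keeps this sketch self-contained.
    query = select_dataset_by_id(dataset_id)  # illustration only
    return DatasetOut(id=dataset_id, name="example")
```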

Run Locally

make up

cp .env.example .env
make venv
# Check PostgreSQL logs to make sure
# LOG: database system is ready to accept connections
make migrate
make serve

Open Swagger UI in your favourite browser (with FastAPI's defaults that's http://localhost:8000/docs).
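
If you prefer to check from a script, here is a minimal smoke test, assuming FastAPI's default docs path and the localhost:8000 address used in the upload example below:

```python
# Minimal smoke test: confirm the service is up before uploading anything.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/docs", timeout=5) as response:
    print(response.status)  # expect 200 once `make serve` is running
```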

Upload a Big File

  1. Download a few big datasets. You can find some on Kaggle. I've tried Nearby Social Network - All Posts, specifically allposts.csv (~47 GB).

  2. Generate an MD5 hash for the file you want to upload (`md5` on macOS, `md5sum` on Linux):

md5 allposts.csv
MD5 (allposts.csv) = 148a68b39a273bfda5ece7d868c9c1c8
  3. Make a request (a Python equivalent is sketched below the curl example):
curl -X 'PUT' \
  'http://localhost:8000/datasets' \
  -H 'accept: application/json' \
  -H 'content-md5: 148a68b39a273bfda5ece7d868c9c1c8' \
  -H 'Content-Type: multipart/form-data' \
  -F 'dataset_name=Nearby Social Network - All Posts' \
  -F 'dataset_file=@allposts.csv;type=text/csv'
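
The same request can be scripted. The sketch below is a hypothetical client, not part of the project: it computes the MD5 with Python's hashlib and sends the same multipart PUT with the requests library. Keep in mind that plain requests buffers the whole multipart body in memory, so for files this big a streaming encoder such as requests_toolbelt's MultipartEncoder is a better fit.

```python
# Hypothetical upload client (not part of the repository).
# Assumes the endpoint from the curl example above: PUT /datasets on localhost:8000.
import hashlib

import requests

PATH = "allposts.csv"

# Hash the file in chunks so a ~47 GB file never sits in memory at once.
md5 = hashlib.md5()
with open(PATH, "rb") as fh:
    for chunk in iter(lambda: fh.read(8 * 1024 * 1024), b""):
        md5.update(chunk)

with open(PATH, "rb") as fh:
    response = requests.put(
        "http://localhost:8000/datasets",
        headers={"accept": "application/json", "content-md5": md5.hexdigest()},
        data={"dataset_name": "Nearby Social Network - All Posts"},
        files={"dataset_file": (PATH, fh, "text/csv")},
    )

print(response.status_code, response.text)
```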

Start another upload to simulate simultaneous uploads. You'll see some of your CPU cores load up (thanks to the ThreadPool in the MinIO SDK, within the limits of the GIL). Memory consumption doesn't grow much, but disk read and write traffic does, sometimes reaching twice the size of the uploaded file because of SpooledTemporaryFile.
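
The core of that upload path can be reduced to a few lines. The handler below illustrates the general technique only, with placeholder endpoint, credentials and bucket; it is not the repository's actual handler.

```python
# Illustration only: stream an uploaded file into S3-compatible storage.
# The MinIO endpoint, credentials and bucket name here are placeholders.
from fastapi import FastAPI, File, Form, Header, UploadFile
from minio import Minio

app = FastAPI()
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

@app.put("/datasets")
async def upload_dataset(
    dataset_name: str = Form(...),
    dataset_file: UploadFile = File(...),
    content_md5: str = Header(...),
):
    # UploadFile wraps a SpooledTemporaryFile; it can be handed to the SDK as-is.
    # length=-1 with an explicit part_size tells the SDK to read the stream
    # part by part (a multipart upload) instead of loading the whole file.
    result = client.put_object(
        "datasets",                     # placeholder bucket
        dataset_file.filename,
        dataset_file.file,
        length=-1,
        part_size=10 * 1024 * 1024,     # 10 MiB parts
        content_type=dataset_file.content_type,
    )
    return {"dataset_name": dataset_name, "etag": result.etag, "md5": content_md5}
```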

No doubt this kind of naive testing does not show how the implemented web service will work in the production environment. But at least it shows how you can upload large files to any S3-compatible storage in your FastAPI application.


© Andrey Krisanov, 2021