
Dataset Catalog


An educational project: a web service for uploading datasets to S3-compatible storage and retrieving information about them.

Documentation

Technology Stack

  • Python 3.9
  • FastAPI
  • Uvicorn
  • PostgreSQL
  • MinIO, an S3-compatible storage
  • Containers

Project Layout

├── app                 # Application logic
│   ├── handlers        # Request handlers (similar to controllers or views in other frameworks)
│   ├── models          # SQLAlchemy table definitions and additional data types like enums
│   ├── repositories    # Modules containing SQLAlchemy query expressions → DB access interface
│   ├── schemas         # Pydantic schemas validating input and output data
│   ├── settings        # Application settings
│   ├── utils           # Helpers that don't contain any business logic and can be extracted
│   └── application.py  # FastAPI entry point and its configuration
├── main.py             # Web service entry point with additional settings
├── migrations          # Alembic migrations
├── pip.conf            # pip config for working with a private package registry like Artifactory
├── setup.cfg           # Python environment configs: linters, mypy rules, pytest, etc.
└── tests               # Test suite run with Pytest
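
To make the layering concrete, here is a condensed sketch of how a request might travel through the layers above: a Pydantic schema, a repository with SQLAlchemy expressions, and a FastAPI handler. Every name in it is hypothetical and only mirrors the directory roles; it is not code from the repository.

```python
# A condensed, hypothetical illustration of the layering above.
# All names are made up for the example; they mirror the directory roles only.
from fastapi import APIRouter
from pydantic import BaseModel
import sqlalchemy as sa

# app/schemas: Pydantic schema validating output data
class DatasetOut(BaseModel):
    id: int
    name: str

# app/repositories: SQLAlchemy query expressions behind a small interface
datasets = sa.table("datasets", sa.column("id"), sa.column("name"))

def select_dataset_by_id(dataset_id: int):
    return sa.select(datasets).where(datasets.c.id == dataset_id)

# app/handlers: the request handler that ties schema and repository together
router = APIRouter()

@router.get("/datasets/{dataset_id}", response_model=DatasetOut)
async def get_dataset(dataset_id: int) -> DatasetOut:
    # The real handler would execute the repository query against PostgreSQL;
    # a stub row keeps this sketch self-contained.
    query = select_dataset_by_id(dataset_id)  # illustration only
    return DatasetOut(id=dataset_id, name="example")
```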

Run Locally

make up

cp .env.example .env
make venv
# Check PostgreSQL logs to make sure
# LOG: database system is ready to accept connections
make migrate
make serve

Open Swagger UI in your favourite browser (with FastAPI's defaults that's http://localhost:8000/docs).
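
If you prefer to check from a script, here is a minimal smoke test, assuming FastAPI's default docs path and the localhost:8000 address used in the upload example below:

```python
# Minimal smoke test: confirm the service is up before uploading anything.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/docs", timeout=5) as response:
    print(response.status)  # expect 200 once `make serve` is running
```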

Upload a Big File

  1. Download a few big datasets. You can find some on Kaggle. I've tried Nearby Social Network - All Posts, specifically allposts.csv (~47 GB).

  2. Generate an MD5 hash for the file you want to upload (`md5` on macOS, `md5sum` on Linux):

md5 allposts.csv
MD5 (allposts.csv) = 148a68b39a273bfda5ece7d868c9c1c8
  3. Make a request (a Python equivalent is sketched below the curl example):
curl -X 'PUT' \
  'http://localhost:8000/datasets' \
  -H 'accept: application/json' \
  -H 'content-md5: 148a68b39a273bfda5ece7d868c9c1c8' \
  -H 'Content-Type: multipart/form-data' \
  -F 'dataset_name=Nearby Social Network - All Posts' \
  -F 'dataset_file=@allposts.csv;type=text/csv'
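
The same request can be scripted. The sketch below is a hypothetical client, not part of the project: it computes the MD5 with Python's hashlib and sends the same multipart PUT with the requests library. Keep in mind that plain requests buffers the whole multipart body in memory, so for files this big a streaming encoder such as requests_toolbelt's MultipartEncoder is a better fit.

```python
# Hypothetical upload client (not part of the repository).
# Assumes the endpoint from the curl example above: PUT /datasets on localhost:8000.
import hashlib

import requests

PATH = "allposts.csv"

# Hash the file in chunks so a ~47 GB file never sits in memory at once.
md5 = hashlib.md5()
with open(PATH, "rb") as fh:
    for chunk in iter(lambda: fh.read(8 * 1024 * 1024), b""):
        md5.update(chunk)

with open(PATH, "rb") as fh:
    response = requests.put(
        "http://localhost:8000/datasets",
        headers={"accept": "application/json", "content-md5": md5.hexdigest()},
        data={"dataset_name": "Nearby Social Network - All Posts"},
        files={"dataset_file": (PATH, fh, "text/csv")},
    )

print(response.status_code, response.text)
```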

Start another upload to simulate simultaneous uploads. You'll see some of your CPU cores load up (thanks to the ThreadPool in the MinIO SDK, within the limits of the GIL). Memory consumption doesn't grow much, but disk read and write traffic does, sometimes reaching twice the size of the uploaded file because of SpooledTemporaryFile.
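
The core of that upload path can be reduced to a few lines. The handler below illustrates the general technique only, with placeholder endpoint, credentials and bucket; it is not the repository's actual handler.

```python
# Illustration only: stream an uploaded file into S3-compatible storage.
# The MinIO endpoint, credentials and bucket name here are placeholders.
from fastapi import FastAPI, File, Form, Header, UploadFile
from minio import Minio

app = FastAPI()
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

@app.put("/datasets")
async def upload_dataset(
    dataset_name: str = Form(...),
    dataset_file: UploadFile = File(...),
    content_md5: str = Header(...),
):
    # UploadFile wraps a SpooledTemporaryFile; it can be handed to the SDK as-is.
    # length=-1 with an explicit part_size tells the SDK to read the stream
    # part by part (a multipart upload) instead of loading the whole file.
    result = client.put_object(
        "datasets",                     # placeholder bucket
        dataset_file.filename,
        dataset_file.file,
        length=-1,
        part_size=10 * 1024 * 1024,     # 10 MiB parts
        content_type=dataset_file.content_type,
    )
    return {"dataset_name": dataset_name, "etag": result.etag, "md5": content_md5}
```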

No doubt this kind of naive testing does not show how the implemented web service will work in the production environment. But at least it shows how you can upload large files to any S3-compatible storage in your FastAPI application.


© Andrey Krisanov, 2021