Marian machine translation training pipeline for thousands of models

hplt-project/OpusPocus

OpusPocus on LUMI

Modular NLP pipeline manager.

OpusPocus aims to simplify the description and execution of popular and custom NLP pipelines, including dataset preprocessing, model training, and evaluation. The pipeline manager can execute pipelines through a simple CLI (Bash) or through common HPC schedulers (Slurm, HyperQueue).

It uses OpusCleaner for data preparation and OpusTrainer for training scheduling (development in progress).

Structure

  • go.py - pipeline manager entry script
  • opuspocus/ - OpusPocus modules
  • config/ - default configuration files (pipeline config, Marian training config, ...)
  • examples/ - pipeline manager usage examples
  • scripts/ - helper scripts that are not yet integrated into OpusPocus
  • tests/ - unit tests

Installation

  1. Install MarianNMT.

  2. Prepare the OpusCleaner and OpusTrainer Python virtual environments.

  3. Install the OpusPocus requirements.

pip install -r requirements.txt
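
Steps 2 and 3 can be sketched as a small script. This is only an illustration, not the project's documented procedure: it assumes OpusCleaner and OpusTrainer are installed from PyPI under the package names opuscleaner and opustrainer, and the virtual environment locations (venvs/opuscleaner, venvs/opustrainer) and the DRY_RUN switch are hypothetical. How OpusPocus locates these environments is not covered here. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical locations for the two virtual environments; adjust as needed.
OPUSCLEANER_VENV="${OPUSCLEANER_VENV:-venvs/opuscleaner}"
OPUSTRAINER_VENV="${OPUSTRAINER_VENV:-venvs/opustrainer}"

# Hypothetical switch: print commands by default, set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Create a virtual environment at $1 and install package $2 into it.
# The package names passed below are assumed PyPI names.
setup_venv() {
    run python3 -m venv "$1"
    run "$1/bin/pip" install "$2"
}

setup_venv "$OPUSCLEANER_VENV" opuscleaner
setup_venv "$OPUSTRAINER_VENV" opustrainer

# Step 3: install the OpusPocus requirements (run from the repository root).
run pip install -r requirements.txt
```

Review the printed commands first, then rerun with DRY_RUN=0 to apply them.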

Usage (Simple Pipeline)

See the examples/ directory for example pipeline executions.

  1. Initialize the pipeline.
$ ./go.py init \
    --pipeline-config path/to/pipeline/config/file \
    --pipeline-dir pipeline/destination/directory
  2. Execute the pipeline.
$ ./go.py run \
    --pipeline-dir pipeline/destination/directory \
    --runner bash
  3. Check the pipeline status.
$ ./go.py traceback --pipeline-dir pipeline/destination/directory

OR

$ ./go.py status --pipeline-dir pipeline/destination/directory
