The config `pipeline.full.simple.yml` contains many errors #36

bhaddow · 2024-08-02T09:33:07Z

Running with the config pipeline.full.simple.yml fails with the following error:

TypeError("__init__() got an unexpected keyword argument 'python_venv_dir'")

Fixing this (by commenting out the config variable) brings more errors. It looks like this sample config file needs to be updated.


bhaddow · 2024-08-02T09:43:43Z

After fixing various config errors, I get

ValueError: Dataset clean.en-es is not registered in the gather.en-es categories.json

I don't know what this means.

varisd · 2024-08-02T09:49:17Z

I am currently working on having a basic automated test for the config files to make sure there are no typos (or obsolete stuff).
I am planning to remove the pipeline.full.*.yml configs. Please refer to pipeline.preprocess.yml (preprocessing) and the pipeline.train.*.yml configs in the future instead.

We can add the "full" example back to config/ later but for the time being, I think it is better to split the preprocessing and training part for clarity (and easier execution).
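A minimal sketch of what such a config check could look like, assuming each pipeline step is a Python class whose `__init__` takes its config options as keyword arguments (which the `TypeError` above suggests, but which is not confirmed here). `ExampleStep` and `unexpected_keys` are hypothetical names for illustration and are not part of OpusPocus.

```python
import inspect


def unexpected_keys(step_cls, step_config):
    """Return config keys that step_cls.__init__ would not accept."""
    params = inspect.signature(step_cls.__init__).parameters
    # A **kwargs parameter accepts anything, so nothing can be flagged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = set(params) - {"self"}
    return sorted(key for key in step_config if key not in accepted)


class ExampleStep:
    """Hypothetical stand-in for a pipeline step class."""

    def __init__(self, step_label, compressed=True):
        self.step_label = step_label
        self.compressed = compressed


# Hypothetical step config containing one obsolete option.
config = {"step_label": "raw.en-es", "compressed": False, "python_venv_dir": "venv"}
print(unexpected_keys(ExampleStep, config))  # → ['python_venv_dir']
```

Running a check like this over every step in every sample config would report obsolete options before the pipeline is initialized, rather than one `TypeError` at a time.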

bhaddow · 2024-08-02T09:55:04Z

OK, but pipeline.train.simple.yml has the same issues.

On the final blocker

ValueError: Dataset clean.en-es is not registered in the gather.en-es categories.json

The training data is in the categories.json for decontaminate, but somehow does not make it through to the gather step. I took a quick look at gather.py, but I cannot see what is happening here.

bhaddow · 2024-08-02T10:10:03Z

If I set

compressed = False

in the raw step, it gets a bit further

varisd · 2024-08-02T10:29:59Z

What are the contents of your input .*/raw .*/valid and .*/test directories?

bhaddow · 2024-08-02T10:38:55Z

Finally, if I add

valid_dataset: floresdev

to the train_model steps, the init runs without error.

bhaddow · 2024-08-02T10:40:05Z

(opuspocus-venv) [vor]bhaddow: ls -R data/
data/:
test  train  valid

data/test:
newstest2013.en  newstest2013.es

data/train:
nc.en  nc.es  news-commentary-v18.en-es.tsv.gz

data/valid:
floresdev.en  floresdev.es

I think my naming is different from the naming that OpusPocus expects, but the expected naming is not documented afaik.

varisd · 2024-08-02T12:01:12Z

Yes there is a mismatch. You need to adjust the following (global) parameters in config/pipeline.preprocess.yml:

global:
  [...]
  - raw_para_dir: data/train
  - raw_mono_src_dir: data/train  # it should actually point to the directory with the mono src corpora
  - raw_mono_tgt_dir: data/train # same, but tgt corpora
  - valid_data_dir: data/valid
  - test_data_dir: data/test
  - valid_dataset: floresdev
  [...]

Note that the above is just the config excerpt.
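As a rough way to check a data directory against these parameters, here is a sketch that assumes files are named `<dataset>.<lang>` (plus `.gz` when compressed). That naming is only inferred from the `ls -R data/` listing and the `valid_dataset: floresdev` setting in this thread; it is not a documented convention. `missing_dataset_files` is a hypothetical helper, not an OpusPocus function.

```python
from pathlib import Path


def missing_dataset_files(data_dir, dataset, languages, compressed=False):
    """Return expected file names for `dataset` that are absent from data_dir.

    Assumes the (unconfirmed) naming scheme <dataset>.<lang>[.gz].
    """
    suffix = ".gz" if compressed else ""
    expected = [f"{dataset}.{lang}{suffix}" for lang in languages]
    return [name for name in expected if not (Path(data_dir) / name).is_file()]
```

For example, `missing_dataset_files("data/valid", "floresdev", ["en", "es"])` would return an empty list for the directory listing shown earlier in the thread.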

varisd · 2024-08-02T12:03:02Z

Also, regarding the train directory: you need to unify the format to either .gz files (compressed=True) or files with no .gz suffix (compressed=False).
We currently do not support identification by file suffix. We can create an issue if it is a desired feature.
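Until something like that exists, a directory can be checked by hand. The sketch below lists files whose suffix disagrees with the chosen `compressed` value; `check_compression` is a hypothetical helper written for this thread, not part of OpusPocus, and it only looks at the `.gz` suffix.

```python
from pathlib import Path


def check_compression(directory, compressed):
    """Return names of files whose suffix disagrees with `compressed`."""
    mismatched = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        is_gz = path.name.endswith(".gz")
        if is_gz != compressed:
            mismatched.append(path.name)
    return mismatched
```

With the `data/train` directory listed earlier in the thread, `check_compression("data/train", compressed=False)` would flag only `news-commentary-v18.en-es.tsv.gz`.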

bhaddow · 2024-08-02T12:56:50Z

There is a documentation issue here then.

The news-commentary-v18.en-es.tsv.gz file is not intended for training, I just downloaded it then unpacked.

varisd · 2024-08-02T13:00:44Z

> There is a documentation issue here then.

Yes, the data preparation needs to be better documented. Related to issue #35.

bhaddow · 2024-08-02T14:34:29Z

I just set raw_data_dir here and it seems to work.

global:
  [...]
  - raw_para_dir: data/train
  - raw_mono_src_dir: data/train  # it should actually point to the directory with the mono src corpora
  - raw_mono_tgt_dir: data/train # same, but tgt corpora
  - valid_data_dir: data/valid
  - test_data_dir: data/test
  - valid_dataset: floresdev
  [...]

bhaddow · 2024-08-06T10:32:17Z

When running, I get the error:

FileNotFoundError: [Errno 2] No such file or directory: 'scripts/bash_runner_submit.py'

It looks like the path to the bash_runner script is hard-coded. How should this be set? Should there be a config variable that points to the location of the OpusPocus scripts?

varisd · 2024-08-13T11:23:35Z

> When running, I get the error:
> FileNotFoundError: [Errno 2] No such file or directory: 'scripts/bash_runner_submit.py'
> It looks like the path to the bash_runner script is hard-coded. How should this be set? Should there be a config variable that points to the location of the OpusPocus scripts?

This is a hardwired path to the bash-submit wrapper in the scripts/ directory. At the moment, the pipeline needs to be executed from the repository root for it to work correctly.
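One possible way to remove the dependency on the working directory is to resolve the script against an explicit repository root instead of a bare relative path. This is only a sketch of the general technique, not how OpusPocus handles it; `resolve_script` and its `repo_root` argument are hypothetical, and only the relative path `scripts/bash_runner_submit.py` comes from the error message above.

```python
from pathlib import Path


def resolve_script(repo_root, relative_path="scripts/bash_runner_submit.py"):
    """Build an absolute path to a helper script under repo_root."""
    script = Path(repo_root).resolve() / relative_path
    if not script.is_file():
        # Fail early with the full path, which is easier to debug
        # than a relative path that depends on the working directory.
        raise FileNotFoundError(f"Expected helper script at {script}")
    return script
```

The root could come from a config variable, as suggested above, or be derived from the location of the calling module via `Path(__file__).resolve()`, depending on how the package is laid out.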

rggdmonk · 2024-08-26T09:33:42Z

Resolved.

bhaddow changed the title ~~The config pipeline.full.simple.yml contains unrecognised arguments~~ The config pipeline.full.simple.yml contains many errors Aug 2, 2024

rggdmonk added the bug and need investigation labels Aug 2, 2024

rggdmonk added the to close label Aug 23, 2024

rggdmonk closed this as completed Aug 26, 2024

rggdmonk removed the to close label Aug 26, 2024
