The config pipeline.full.simple.yml contains many errors #36

Closed
bhaddow opened this issue Aug 2, 2024 · 15 comments
Labels
bug (Something isn't working), need investigation (Unknown scope)

Comments

@bhaddow
Contributor

bhaddow commented Aug 2, 2024

Running with the config pipeline.full.simple.yml fails with the following error

TypeError("__init__() got an unexpected keyword argument 'python_venv_dir'")

Fixing this (by commenting out the config variable) brings more errors. It looks like this sample config file needs to be updated.

@bhaddow bhaddow changed the title from "The config pipeline.full.simple.yml contains unrecognised arguments" to "The config pipeline.full.simple.yml contains many errors" Aug 2, 2024
@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

After fixing various config errors, I get

ValueError: Dataset clean.en-es is not registered in the gather.en-es categories.json

I don't know what this means.

@rggdmonk rggdmonk added the bug (Something isn't working) and need investigation (Unknown scope) labels Aug 2, 2024
@varisd
Contributor

varisd commented Aug 2, 2024

I am currently working on having a basic automated test for the config files to make sure there are no typos (or obsolete stuff).
I am planning to remove the pipeline.full.*.yml configs. Please refer to pipeline.preprocess.yml (preprocessing) and the pipeline.train.*.yml configs in the future instead.

We can add the "full" example back to config/ later but for the time being, I think it is better to split the preprocessing and training part for clarity (and easier execution).
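
A config check of this kind could be sketched roughly as follows. This is only an illustration, not the OpusPocus test suite: `ExampleStep` and `unknown_keys` are hypothetical names, and the sketch assumes that each step's config keys are passed to its constructor as keyword arguments, which is what the `unexpected keyword argument 'python_venv_dir'` error above suggests.

```python
import inspect


class ExampleStep:
    """Hypothetical stand-in for a pipeline step class (not OpusPocus code)."""

    def __init__(self, step_label, src_lang, tgt_lang):
        self.step_label = step_label
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang


def unknown_keys(step_cls, step_config):
    """Return the config keys that step_cls.__init__ would reject."""
    params = inspect.signature(step_cls.__init__).parameters
    # A constructor that takes **kwargs accepts any key.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = set(params) - {"self"}
    return sorted(key for key in step_config if key not in accepted)


# Illustrative config entry containing one obsolete key.
config = {
    "step_label": "raw.en-es",
    "src_lang": "en",
    "tgt_lang": "es",
    "python_venv_dir": "/path/to/venv",
}
print(unknown_keys(ExampleStep, config))  # ['python_venv_dir']
```

Running such a check over every step in every sample config would flag obsolete keys before the pipeline is initialised, instead of surfacing them one TypeError at a time.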

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

OK, but pipeline.train.simple.yml has the same issues.

On the final blocker

ValueError: Dataset clean.en-es is not registered in the gather.en-es categories.json

The training data is in the categories.json for decontaminate, but somehow does not make it through to the gather step. I took a quick look at gather.py, but I cannot see what is happening here.

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

If I set

compressed = False

in the raw step, it gets a bit further.

@varisd
Contributor

varisd commented Aug 2, 2024

What are the contents of your input .*/raw, .*/valid, and .*/test directories?

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

Finally, if I add

valid_dataset: floresdev

to the train_model steps, the init runs without error.

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

(opuspocus-venv) [vor]bhaddow: ls -R data/
data/:
test  train  valid

data/test:
newstest2013.en  newstest2013.es

data/train:
nc.en  nc.es  news-commentary-v18.en-es.tsv.gz

data/valid:
floresdev.en  floresdev.es

I think my naming is different from the naming that OpusPocus expects, but the expected naming is not documented afaik.

@varisd
Contributor

varisd commented Aug 2, 2024

Yes there is a mismatch. You need to adjust the following (global) parameters in config/pipeline.preprocess.yml:

global:
  [...]
  - raw_para_dir: data/train
  - raw_mono_src_dir: data/train  # it should actually point to the directory with the mono src corpora
  - raw_mono_tgt_dir: data/train # same, but tgt corpora
  - valid_data_dir: data/valid
  - test_data_dir: data/test
  - valid_dataset: floresdev
  [...]

Note that the above is just the config excerpt.
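
Since the expected file naming is not documented in this thread, a quick pre-flight check can help when adjusting these paths. The sketch below assumes the convention inferred from the directory listing above, `<dataset>.<lang>` (for example `floresdev.en`), with a `.gz` suffix when `compressed` is true. The helper `missing_dataset_files` is hypothetical and not part of OpusPocus.

```python
from pathlib import Path


def missing_dataset_files(data_dir, dataset, langs, compressed=False):
    """List the files <dataset>.<lang>[.gz] that are absent from data_dir.

    The naming pattern is an assumption inferred from the thread, not a
    documented OpusPocus convention.
    """
    suffix = ".gz" if compressed else ""
    expected = [Path(data_dir) / f"{dataset}.{lang}{suffix}" for lang in langs]
    return [str(path) for path in expected if not path.is_file()]


# Example: check the validation set named by valid_dataset in the excerpt.
print(missing_dataset_files("data/valid", "floresdev", ["en", "es"]))
```

An empty list means every expected file was found. Any path that is printed is one the config points to but that does not exist on disk.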

@varisd
Contributor

varisd commented Aug 2, 2024

Also, regarding the train directory: you need to unify the format to either .gz files (compressed=True) or no .gz suffix (compressed=False).
We currently do not support identification by file suffix. We can create an issue if it is a desired feature.
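
Until suffix detection exists, a mixed directory can be spotted before running the pipeline. The following is a minimal sketch with hypothetical helper names (`compression_mix`, `is_uniform`). It only looks at file suffixes and does not inspect file contents.

```python
from pathlib import Path


def compression_mix(data_dir):
    """Split the files in data_dir into .gz-suffixed and plain file names."""
    files = [p for p in Path(data_dir).iterdir() if p.is_file()]
    gz = sorted(p.name for p in files if p.suffix == ".gz")
    plain = sorted(p.name for p in files if p.suffix != ".gz")
    return gz, plain


def is_uniform(data_dir):
    """True if the directory holds only .gz files or only uncompressed ones."""
    gz, plain = compression_mix(data_dir)
    return not (gz and plain)
```

For the `data/train` directory listed earlier in the thread (`nc.en`, `nc.es`, and `news-commentary-v18.en-es.tsv.gz`), `is_uniform` would return `False`, which matches the mismatch described here.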

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

There is a documentation issue here then.

The news-commentary-v18.en-es.tsv.gz file is not intended for training, I just downloaded it then unpacked.

@varisd
Contributor

varisd commented Aug 2, 2024

> There is a documentation issue here then.

Yes, the data preparation needs to be better documented. Related to issue #35.

@bhaddow
Contributor Author

bhaddow commented Aug 2, 2024

I just set raw_data_dir here and it seems to work.

global:
  [...]
  - raw_para_dir: data/train
  - raw_mono_src_dir: data/train  # it should actually point to the directory with the mono src corpora
  - raw_mono_tgt_dir: data/train # same, but tgt corpora
  - valid_data_dir: data/valid
  - test_data_dir: data/test
  - valid_dataset: floresdev
  [...]

@bhaddow
Contributor Author

bhaddow commented Aug 6, 2024

When running, I get the error:

FileNotFoundError: [Errno 2] No such file or directory: 'scripts/bash_runner_submit.py'

It looks like the path to the bash_runner script is hard-coded. How should this be set? Should there be a config variable that points to the location of the OpusPocus scripts?

@varisd
Contributor

varisd commented Aug 13, 2024

> When running, I get the error:
>
> FileNotFoundError: [Errno 2] No such file or directory: 'scripts/bash_runner_submit.py'
>
> It looks like the path to the bash_runner script is hard-coded. How should this be set? Should there be a config variable that points to the location of the OpusPocus scripts?

This is a hardwired path to the bash-submit-wrapper in the scripts/ directory. At the moment, the pipeline needs to be executed from the repository root to work correctly.
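
As long as the path stays relative, a guard like the one below could turn the bare FileNotFoundError into a more descriptive message. This is a sketch only: `check_repo_root` is a hypothetical helper, and the relative path is copied from the error message in this thread rather than from the OpusPocus source.

```python
from pathlib import Path

# Relative path taken from the error message quoted in this thread.
SUBMIT_SCRIPT = Path("scripts/bash_runner_submit.py")


def check_repo_root(cwd=None):
    """Raise a descriptive error if the submit wrapper is not reachable from cwd."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    if not (base / SUBMIT_SCRIPT).is_file():
        raise FileNotFoundError(
            f"{SUBMIT_SCRIPT} not found under {base}; "
            "run the pipeline from the repository root"
        )
    return base
```

Calling `check_repo_root()` at startup would fail early with a hint about the working directory. Resolving the script relative to the package location instead of the working directory would remove the restriction, but that is a design decision for the maintainers.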

@rggdmonk
Contributor

Resolved.
