Config files contain a lot of repetition - can this be avoided? #27

Open
bhaddow opened this issue Jul 22, 2024 · 1 comment
Labels
feature New feature or request

Comments

@bhaddow
Contributor

bhaddow commented Jul 22, 2024

Hi

Looking at a sample config file, there is a lot of repetition. For example:

    - step: raw
      step_label: gather.${global.src_lang}-${global.tgt_lang}
      src_lang: ${global.src_lang}
      tgt_lang: ${global.tgt_lang}
      raw_data_dir: ${global.raw_data_dir}
    - step: raw
      step_label: valid.${global.src_lang}-${global.tgt_lang}
      src_lang: ${global.src_lang}
      tgt_lang: ${global.tgt_lang}
      raw_data_dir: ${global.valid_data_dir}

These step stanzas are all nested within pipeline. Why not specify the source and target language once, at the pipeline level? Can this be done with OmegaConf? If it is not supported directly, I think it could be done with a custom resolver. It would make the config file much easier to read.
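As a rough illustration of the idea (not using OmegaConf, and not this project's actual loader), here is a minimal sketch in plain Python that fills in missing step keys from pipeline-level values. The function name `apply_pipeline_defaults`, the `INHERITED_KEYS` tuple, and the directory paths are all hypothetical and only serve the example.

```python
from copy import deepcopy

# Hypothetical: the pipeline-level keys that steps are allowed to inherit.
INHERITED_KEYS = ("src_lang", "tgt_lang")


def apply_pipeline_defaults(pipeline: dict) -> list:
    """Return copies of the steps with missing inherited keys filled in
    from the pipeline level. Values set on a step take precedence."""
    defaults = {k: pipeline[k] for k in INHERITED_KEYS if k in pipeline}
    resolved = []
    for step in pipeline.get("steps", []):
        # Later entries win, so explicit step values override the defaults.
        resolved.append({**defaults, **deepcopy(step)})
    return resolved


# Placeholder config mirroring the stanzas above (paths are made up).
pipeline = {
    "src_lang": "en",
    "tgt_lang": "de",
    "steps": [
        {"step": "raw", "step_label": "gather.en-de", "raw_data_dir": "data/raw"},
        {"step": "raw", "step_label": "valid.en-de", "raw_data_dir": "data/valid",
         "src_lang": "fr"},  # explicit override of the pipeline-level value
    ],
}

steps = apply_pipeline_defaults(pipeline)
```

With this, the first step picks up both languages from the pipeline level, while the second keeps its own `src_lang` and inherits only `tgt_lang`.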

@bhaddow bhaddow added the feature New feature or request label Jul 22, 2024
@varisd
Contributor

varisd commented Jul 24, 2024

Related to issue #22

Basically, with dataclasses and their inheritance, we can set up the pipeline config and the pipeline steps in the following way.
Config example:

    pipeline:
      src_lang: en
      tgt_lang: de
      steps:
        - step: raw
          step_label: gather.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.raw_data_dir}
        - step: raw
          step_label: valid.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.valid_data_dir}

tl;dr: We can get a reasonable simplification with dataclasses, and later we can consider some "syntactic sugar" for the most common step configurations that the refactor does not simplify.

The dataclass implementation would then have a general "pipeline" dataclass (containing things like src_lang and tgt_lang), and the "raw" step dataclass (and the other step dataclasses) could, by default, inherit the "pipeline" values (src_lang, tgt_lang) unless overwritten by the user.
This would simplify config files when defining models/corpora in one direction.
For the opposite direction, we would have to either add additional optional arguments to pipeline steps (e.g., "reverse") or add some "fake" steps, such as BackwardTrainSteps, which would in practice create a regular TrainSteps with "rewired" arguments.
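To make the proposal more concrete, here is a minimal sketch using only the standard-library `dataclasses` module. It is an assumption about how this could look, not the project's implementation: `PipelineConfig`, `RawStepConfig`, `resolve_step` and `reverse_step` are hypothetical names, and the paths are placeholders.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class PipelineConfig:
    """Hypothetical pipeline-level settings shared by all steps."""
    src_lang: str
    tgt_lang: str


@dataclass
class RawStepConfig:
    """Hypothetical "raw" step; languages are optional and default to
    the pipeline-level values when left unset."""
    step_label: str
    raw_data_dir: str
    src_lang: Optional[str] = None
    tgt_lang: Optional[str] = None


def resolve_step(step: RawStepConfig, pipeline: PipelineConfig) -> RawStepConfig:
    """Return a copy of the step with unset languages taken from the pipeline."""
    return replace(
        step,
        src_lang=step.src_lang if step.src_lang is not None else pipeline.src_lang,
        tgt_lang=step.tgt_lang if step.tgt_lang is not None else pipeline.tgt_lang,
    )


def reverse_step(step: RawStepConfig) -> RawStepConfig:
    """Return a copy with source and target swapped (the "rewired" variant).
    Other fields, including step_label, are left unchanged."""
    return replace(step, src_lang=step.tgt_lang, tgt_lang=step.src_lang)


pipeline = PipelineConfig(src_lang="en", tgt_lang="de")
gather = resolve_step(RawStepConfig("gather.en-de", "data/raw"), pipeline)
backward = reverse_step(gather)
```

Here `gather` ends up with the pipeline's languages, and `backward` is the same step with the direction swapped. A "reverse" flag on the step or a dedicated backward step class could both be thin wrappers around something like `reverse_step`.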
