Look for Solutions for Multilingual Model Support #26

Open
varisd opened this issue Jul 19, 2024 · 2 comments
Labels
feature (New feature or request) · need investigation (Unknown scope) · question (Further information is requested)

Comments

@varisd
Contributor

varisd commented Jul 19, 2024

I am opening a discussion regarding ideas for the multilingual support in OpusPocus.

Personally, I think that multilingual training can be done in the current state using the following approach. Let us assume two language pairs, en-de and en-fr:

  • We set up two Corpus steps, one for each language pair (e.g., corpus.en-de, corpus.en-fr). Each step directory will have its languages set to "x" (source) and "y" (target). The corpus files inside should still contain the language pair in their filenames, e.g. corpusA.en-de.x.
  • Before training, we merge the corpus directories using MergeSteps. This should not be a problem, because all corpus files will still have unique filenames (language pair plus the x/y suffix).
  • The training will then just have to pick the correct corpora. OpusTrainer should help a lot here, since we will need to schedule the language pairs correctly.
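
To make the filename argument concrete, here is a minimal sketch of merging two corpus directories while checking that no filename clashes. It is not the actual MergeSteps implementation: `merge_corpus_dirs` is a hypothetical helper, and the directory layout is only assumed from the naming scheme described above.

```python
from pathlib import Path
import shutil


def merge_corpus_dirs(step_dirs, merged_dir):
    """Copy every corpus file from step_dirs into merged_dir.

    Hypothetical helper (not part of OpusPocus). It raises
    FileExistsError if two steps contain a file with the same name,
    which is the situation the langpair-in-filename scheme avoids.
    """
    merged = Path(merged_dir)
    merged.mkdir(parents=True, exist_ok=True)
    copied = []
    for step_dir in step_dirs:
        for src in sorted(Path(step_dir).iterdir()):
            if not src.is_file():
                continue
            dst = merged / src.name
            if dst.exists():
                raise FileExistsError(f"{src.name} appears in more than one step")
            shutil.copy2(src, dst)
            copied.append(dst.name)
    return sorted(copied)
```

With files named like corpusA.en-de.x and corpusA.en-fr.x, the merge succeeds; if both steps used a bare corpusA.x, the helper would fail instead of silently overwriting one corpus.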

Did anyone try this approach before? Are there any other proposals?

@varisd added the feature, question, and need investigation labels Jul 19, 2024
@bhaddow
Contributor

bhaddow commented Jul 19, 2024

Whatever the implemented solution is, it should be fairly transparent to the user. So they should be able to say "give me a model that translates from fr or de into en" and not have to worry about creating extra CorpusSteps and MergeSteps.

There are some things that multilingual modelling definitely requires:

  • Insertion of the special tokens, e.g. a <2de> token to say that this sentence should be translated to German
  • Mixing of data. Normally temperature sampling is necessary. Should this be done by OpusTrainer?
  • Multilingual back-translation. OK, this does not need to be in the initial version, but would be needed to create strong systems. I don't know what extra complexity this adds (if any).
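
As an illustration of the data-mixing point, temperature sampling is commonly formulated as drawing language pair i with probability proportional to (n_i / N)^(1/T), where n_i is the size of that pair's corpus and N the total. The sketch below computes those weights; the function name and the example corpus sizes are made up for illustration, and this says nothing about how OpusTrainer would be configured to use them.

```python
def temperature_weights(sizes, temperature=5.0):
    """Return sampling probabilities per language pair.

    sizes: mapping from language pair to number of sentence pairs.
    temperature: T = 1 keeps the natural proportions; larger T
    flattens the distribution towards uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {
        pair: (count / total) ** (1.0 / temperature)
        for pair, count in sizes.items()
    }
    norm = sum(scaled.values())
    return {pair: value / norm for pair, value in scaled.items()}


if __name__ == "__main__":
    # Hypothetical corpus sizes, for illustration only.
    print(temperature_weights({"de-en": 9_000_000, "fr-en": 1_000_000}))
```

Raising the temperature upsamples the smaller language pair relative to its share of the data, which is why the resulting weights would then be handed to whatever component schedules the training data.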

@varisd
Contributor Author

varisd commented Jul 19, 2024

> Whatever the implemented solution is, it should be fairly transparent to the user. So they should be able to say "give me a model that translates from fr or de into en" and not have to worry about creating extra CorpusSteps and MergeSteps.

I agree that merging corpus steps is not the most elegant approach right now. Merging could be avoided by allowing multiple CorpusStep arguments where necessary (e.g. for model or subword training). We would still need to figure out a reasonable way of specifying the correct corpus from the correct step, though (in a way that will not be confusing for the user).

  • Insertion of the special tokens, e.g. a <2de> token to say that this sentence should be translated to German
    At worst, this can be implemented as an additional Step that does the corpus modification (similar to Decontaminate).
  • Mixing of data. Normally temperature sampling is necessary. Should this be done by OpusTrainer?
    This should be done by OpusTrainer. I think that was one of the motivations behind it in the first place.
  • Multilingual back-translation. OK, this does not need to be in the initial version, but would be needed to create strong systems. I don't know what extra complexity this adds (if any).
    Right now, the back-translated corpora are the result of translation steps (so they are in a separate directory). The real question is how to "merge" or pass multiple corpus steps so that they can be used by training, etc.
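
For the special-token point, the corpus modification itself is small. The following is a rough sketch as plain functions rather than an actual OpusPocus Step; the names `add_target_token` and `tag_corpus` are hypothetical, and the `<2xx>` format simply follows the `<2de>` example quoted above.

```python
def add_target_token(line, target_lang):
    """Prefix one source sentence with a target-language token."""
    text = line.rstrip("\n")
    return f"<2{target_lang}> {text}"


def tag_corpus(lines, target_lang):
    """Yield every source line of a corpus with the token prepended."""
    for line in lines:
        yield add_target_token(line, target_lang)


if __name__ == "__main__":
    for tagged in tag_corpus(["Hello world\n", "Good morning\n"], "de"):
        print(tagged)
```

A Step built around something like this would only need to know the target language of each corpus, which the language pair in the filename already provides.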
