Look for Solutions for Multilingual Model Support #26

varisd · 2024-07-19T09:42:56Z

I am opening a discussion to collect ideas for multilingual support in OpusPocus.

Personally, I think that multilingual training can be done in the current state using the following approach. Let us assume two language pairs, en-de and en-fr:

1. We set up two Corpus steps, one for each language pair (e.g., corpus.en-de, corpus.en-fr). Each step directory will have its languages set to "x" (source) and "y" (target). The corpus files inside should still contain the langpair in their filename, e.g. corpusA.en-de.x
2. Before training, we merge the corpus directories using MergeSteps. This should not be a problem, because all corpus files will still have unique filenames (with the x and y language suffixes).
3. Training then just has to pick the correct corpora. OpusTrainer should help a lot here, since we will need to schedule the langpairs correctly.
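
To make the naming convention concrete, here is a minimal sketch of why the merge should be collision-free. It uses plain Python only, not the OpusPocus API: `corpus_filename` and `merge_listings` are hypothetical helpers, and the dataset name `corpusA` is taken from the example above.

```python
from collections import Counter


def corpus_filename(dataset: str, langpair: str, side: str) -> str:
    """Build a name like 'corpusA.en-de.x' (x = source side, y = target side)."""
    if side not in ("x", "y"):
        raise ValueError(f"side must be 'x' or 'y', got {side!r}")
    return f"{dataset}.{langpair}.{side}"


def merge_listings(*listings: list[str]) -> list[str]:
    """Merge per-step file listings, refusing to merge if any filename repeats."""
    merged = [name for listing in listings for name in listing]
    duplicates = sorted(n for n, count in Counter(merged).items() if count > 1)
    if duplicates:
        raise ValueError(f"filename collision while merging: {duplicates}")
    return sorted(merged)


# Hypothetical contents of the two Corpus step directories.
en_de = [corpus_filename("corpusA", "en-de", side) for side in ("x", "y")]
en_fr = [corpus_filename("corpusA", "en-fr", side) for side in ("x", "y")]

merged = merge_listings(en_de, en_fr)
print(merged)
```

Because the langpair stays in the filename, the same dataset name can appear in both steps without clashing. Dropping the langpair and keeping only the x/y suffix would make the merge fail.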

Has anyone tried this approach before? Are there any other proposals?

bhaddow · 2024-07-19T09:54:11Z

Whatever the implemented solution is, it should be fairly transparent to the user. So they should be able to say "give me a model that translates from fr or de into en" and not have to worry about creating extra CorpusSteps and MergeSteps.

There are some things that multilingual modelling definitely requires:

- Insertion of the special tokens, e.g. a <2de> token to say that this sentence should be translated to German
- Mixing of data. Normally temperature sampling is necessary. Should this be done by OpusTrainer?
- Multilingual back-translation. OK, this does not need to be in the initial version, but would be needed to create strong systems. I don't know what extra complexity this adds (if any).
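
For reference, a minimal sketch of what temperature sampling over language pairs usually means, assuming the common formulation in which each pair's sampling probability is proportional to its share of the data raised to the power 1/T. The function name, the corpus sizes, and T = 5 are illustrative assumptions. Nothing here is OpusTrainer's API.

```python
def temperature_weights(sizes: dict[str, int], temperature: float = 5.0) -> dict[str, float]:
    """Return sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    T = 1 reproduces the raw data proportions; larger T flattens the
    distribution towards uniform, which upsamples the smaller language pairs.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {pair: (n / total) ** (1.0 / temperature) for pair, n in sizes.items()}
    norm = sum(scaled.values())
    return {pair: value / norm for pair, value in scaled.items()}


# Hypothetical corpus sizes (sentence pairs) for the two language pairs.
weights = temperature_weights({"en-de": 900_000, "en-fr": 100_000}, temperature=5.0)
print(weights)
```

The resulting weights would then have to be passed to whatever component does the actual mixing.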

varisd · 2024-07-19T10:06:49Z

> Whatever the implemented solution is, it should be fairly transparent to the user. So they should be able to say "give me a model that translates from fr or de into en" and not have to worry about creating extra CorpusSteps and MergeSteps.

I agree that merging corpus steps is not the most elegant approach right now. Merging could be avoided by allowing multiple CorpusStep arguments where necessary (e.g. for model or subword training). We would still need a reasonable way of specifying the correct corpus from the correct step, though, one that will not confuse the user.

> Insertion of the special tokens, e.g. a <2de> token to say that this sentence should be translated to German

At worst, this can be implemented as an additional Step that does the corpus modification (similar to Decontaminate).
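
The core of such a step could be as small as the following sketch. It assumes the tag is prepended to each source-side line and separated by a single space. `tag_source_line` and `tag_corpus` are hypothetical names, not existing OpusPocus functions.

```python
from collections.abc import Iterable, Iterator


def tag_source_line(line: str, target_lang: str) -> str:
    """Prepend a target-language token such as '<2de>' to one source line."""
    text = line.rstrip("\n")
    return f"<2{target_lang}> {text}\n"


def tag_corpus(lines: Iterable[str], target_lang: str) -> Iterator[str]:
    """Lazily tag every source line of a corpus with the same target language."""
    for line in lines:
        yield tag_source_line(line, target_lang)


# Hypothetical source-side lines of an en-de corpus.
tagged = list(tag_corpus(["Hello world\n", "How are you?\n"], "de"))
print("".join(tagged), end="")
```

The subword vocabulary would also need to treat `<2de>` as a single symbol, which is not covered here.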

> Mixing of data. Normally temperature sampling is necessary. Should this be done by OpusTrainer?

This should be done by OpusTrainer. I think that was one of the motivations behind it in the first place.

> Multilingual back-translation. OK, this does not need to be in the initial version, but would be needed to create strong systems. I don't know what extra complexity this adds (if any).

Right now, the back-translated corpora are the result of translation steps (so they are in a separate directory). The real question is how to "merge" or pass multiple corpus steps so they can be used by training, etc.

varisd added the labels "feature" (New feature or request), "question" (Further information is requested), and "need investigation" (Unknown scope) on Jul 19, 2024
