Should OpusCleaner have the notion of a "project"? #146

Open
bhaddow opened this issue Jan 12, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@bhaddow
Contributor

bhaddow commented Jan 12, 2024

I am trying to understand the intended workflow for OpusCleaner.

Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I am happy, then upload the data to the cluster for training. Maybe I come back the next day and want to create a new version of this data set, or maybe I want to train a completely different MT system.

For this, would it be useful if OC had the notion of a "project"? I open a project, add files to it, set some project-wide rules and parameters, and then maybe some dataset-specific parameters. If I then want to work on a different MT system, I open a different project. I can copy the project file onto a different server and initialise it (by downloading the files). Maybe projects could have versions, so I can track which data/rule set I used.

@bhaddow bhaddow added the enhancement New feature or request label Jan 12, 2024
@jelmervdl
Collaborator

My idea was to leave project management to things like git, and let OpusCleaner use the filesystem as a project structure. So you'd run opuscleaner server from your project dir, e.g. hplt/v1/eng-fry. You can then also use git to commit the json files generated by OpusCleaner to track filters per dataset.

A hybrid approach might be what Jupyter Lab does, where you can also change directories (to some degree) from the web interface to open up a different project. But by default it will treat the current working directory as the root of your lab session.

@bhaddow
Contributor Author

bhaddow commented Jan 12, 2024

Ah, that makes sense. Having opuscleaner manage projects would add more complexity. Running multiple instances would be simpler.

It would still be useful to have some config that applies to all the datasets in a directory, say to set languages or default rules. It could be just a json/yaml file that lives in the directory.
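
To make the idea concrete, here is a minimal sketch of how a directory-level config could be layered under per-dataset settings. Everything in it is an assumption for illustration: the file name `opuscleaner-defaults.json`, the `languages` and `filters` keys, and the merge order are hypothetical and not part of OpusCleaner.

```python
import json
from pathlib import Path

# Hypothetical file name; OpusCleaner does not define such a file.
DEFAULTS_NAME = "opuscleaner-defaults.json"


def merge_config(defaults, dataset):
    """Combine directory-wide defaults with one dataset's own config.

    Scalar settings in the dataset config override the defaults.
    The (assumed) "filters" lists are concatenated, defaults first.
    """
    merged = {**defaults, **dataset}
    merged["filters"] = defaults.get("filters", []) + dataset.get("filters", [])
    return merged


def load_effective_config(dataset_config_path):
    """Read a dataset's JSON config and apply defaults from its directory."""
    dataset_config_path = Path(dataset_config_path)
    defaults_path = dataset_config_path.parent / DEFAULTS_NAME

    defaults = {}
    if defaults_path.exists():
        defaults = json.loads(defaults_path.read_text(encoding="utf-8"))

    dataset = json.loads(dataset_config_path.read_text(encoding="utf-8"))
    return merge_config(defaults, dataset)
```

With this layout, a dataset that needs nothing special inherits the directory's languages and default rules, while a dataset with its own rules gets them appended after the shared ones. The defaults file could be committed to git alongside the per-dataset json files.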

@jelmervdl
Collaborator

There seems to be some agreement on that. You're not the first to bring it up! #101
