
Update README.md
j-dominguez9 committed Jun 11, 2024
1 parent 7965883 commit 9702049
Showing 1 changed file with 4 additions and 4 deletions.
README.md: 4 additions & 4 deletions
@@ -23,10 +23,10 @@ Comments and suggestions for how to improve this are also very welcome!

## Getting Started

-To get started with Translation Agent, follow these steps:
+To get started with `translation-agent`, follow these steps:

### Installation:
-- The Poetry package manager is required for installation (https://python-poetry.org/docs/#installation) Depending on your environment, this might work:
+- The Poetry package manager is required for installation ([Poetry Installation](https://python-poetry.org/docs/#installation)). Depending on your environment, this might work:

```bash
pip install poetry
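# Note: the step below is an assumed, typical Poetry workflow (it is not
# shown in this diff) -- after installing Poetry, install the project's
# dependencies from the repo root:
poetry install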
```

@@ -55,13 +55,13 @@ for both commercial and non-commercial purposes.

## Ideas for extensions

-Here’re ideas we haven’t had time to experiment with but that we hope the open-source community will:
+Here are ideas we haven’t had time to experiment with but that we hope the open-source community will:
- **Try other LLMs.** We prototyped this primarily using gpt-4-turbo. We would love for others to experiment with other LLMs as well as other hyperparameter choices and see if some do better than others for particular language pairs.
- **Glossary Creation.** What’s the best way to efficiently build a glossary -- perhaps using an LLM -- of the most important terms that we want translated consistently? For example, many businesses use specialized terms that are not widely used on the internet and that LLMs thus don’t know about, and there are also many terms that can be translated in multiple ways; for example, “open source” in Spanish can be “Código abierto” or “Fuente abierta”; both are fine, but it’d be better to pick one and stick with it for a single document.
- **Glossary Usage and Implementation.** Given a glossary, what’s the best way to include it in the prompt? (One naive baseline is sketched after this list.)
- **Evaluations on different languages.** How does its performance vary across different languages? Are there changes that make it work better for particular source or target languages? (Note that for very high levels of performance, which MT systems are approaching, we’re not sure if BLEU is a great metric.) Also, its performance on lower-resource languages needs further study.
- **Error analysis.** We’ve found that specifying a language and a country/region (e.g., “Spanish as colloquially spoken in Mexico”) does a pretty good job for our applications. Where does the current approach fall short? We’re also particularly interested in understanding its performance on specialized topics (like law, medicine) or special types of text (like movie subtitles) to understand its limitations.
-- **Better evals.** Finally, we think better evaluations (evals) is a huge and important research topic. As with other LLM applications that generate free text, current evaluation metrics appear to fall short. For example, we found that even on documents where our agentic workflow captures context and terminology better, resulting in translations that our human raters prefer over current commercial offerings, evaluation at the sentence level (using the FLoRes dataset) resulted in the agentic system scoring lower on BLEU. Can we design better metrics (perhaps using an LLM to evaluate translations?) that capture translation quality at a document level that correlates better with human preferences?
+- **Better evals.** Finally, we think better evaluations (evals) is a huge and important research topic. As with other LLM applications that generate free text, current evaluation metrics appear to fall short. For example, we found that even on documents where our agentic workflow captures context and terminology better, resulting in translations that our human raters prefer over current commercial offerings, evaluation at the sentence level (using the [FLORES](https://github.com/facebookresearch/flores) dataset) resulted in the agentic system scoring lower on BLEU. Can we design better metrics (perhaps using an LLM to evaluate translations?) that capture translation quality at a document level that correlates better with human preferences? (A minimal sentence-level BLEU sketch follows this list.)
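For the glossary-usage question above, here is a minimal, hypothetical sketch of one naive baseline: keep only the glossary terms that appear in the source text and prepend them to the translation prompt. The `glossary` contents, prompt wording, and `build_prompt` helper are illustrative assumptions, not part of this repo.

```python
# A naive, hypothetical baseline for injecting a glossary into a prompt.
# Glossary contents and prompt wording are illustrative assumptions.

glossary = {
    "open source": "código abierto",  # pick one rendering and stick with it
    "machine learning": "aprendizaje automático",
}

def build_prompt(source_text: str, source_lang: str, target_lang: str) -> str:
    # Include only glossary entries that occur in the source text,
    # to keep the prompt short.
    relevant = {
        term: translation
        for term, translation in glossary.items()
        if term.lower() in source_text.lower()
    }
    glossary_lines = "\n".join(f"- {t} -> {tr}" for t, tr in relevant.items())
    return (
        f"Translate the following {source_lang} text into {target_lang}.\n"
        "Use these translations consistently for the listed terms:\n"
        f"{glossary_lines}\n\n"
        f"Text:\n{source_text}"
    )

print(build_prompt("We love open source.", "English", "Spanish"))
```

Even this baseline leaves open questions: matching should probably be smarter than substring lookup (lemmatization, multi-word expressions), and a long glossary may call for retrieval rather than wholesale inclusion.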
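To make the BLEU point above concrete, here is a minimal sentence-level scoring sketch using the `sacrebleu` library (`pip install sacrebleu`). The hypothesis and reference sentences are made up for illustration; this is not the repo's evaluation code.

```python
# Minimal sentence-level BLEU with sacrebleu (pip install sacrebleu).
# The sentences below are illustrative, not from FLORES or this repo.
import sacrebleu

# A hypothetical system output and a single reference translation that
# express the same meaning with different wording.
hypothesis = "El software de código abierto se usa en todas partes."
reference = "El software de fuente abierta se utiliza en todas partes."

score = sacrebleu.sentence_bleu(hypothesis, [reference])
print(f"BLEU = {score.score:.1f}")  # low despite an acceptable paraphrase
```

A reasonable paraphrase like this can score low at the sentence level, which illustrates why sentence-level BLEU may penalize translations that read better at the document level.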

## Related work
