- BERTScore: Evaluating Text Generation with BERT
- Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
- SlimPajama-DC: Understanding Data Combinations for LLM Training
- Line-Based Splitter to Generate Train/Dev/Test Dataset
bash ./bin/etl/train_dev_test_splitter_for_lines_data.sh ${DATA_LINES_PATH} ${DEV_DATA_SIZE} ${TEST_DATA_SIZE}
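
For readers who want to see what a line-based split amounts to, here is a minimal Python sketch. It is not the code behind `train_dev_test_splitter_for_lines_data.sh`. The function name `split_lines`, the `seed` parameter, and the choice to shuffle before splitting are all assumptions made for illustration; the actual script may keep the original line order or write its outputs differently.

```python
import random
from typing import List, Sequence, Tuple


def split_lines(
    lines: Sequence[str],
    dev_size: int,
    test_size: int,
    seed: int = 0,  # hypothetical parameter, used only to make the shuffle repeatable
) -> Tuple[List[str], List[str], List[str]]:
    """Split lines into (train, dev, test) with fixed dev and test sizes."""
    if dev_size < 0 or test_size < 0:
        raise ValueError("dev_size and test_size must be non-negative")
    if dev_size + test_size > len(lines):
        raise ValueError("dev_size + test_size exceeds the number of lines")

    # Shuffle a copy so the caller's data is left untouched (assumed behaviour).
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)

    # The first dev_size lines go to dev, the next test_size to test, the rest to train.
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


if __name__ == "__main__":
    sample = [f"line {i}" for i in range(10)]
    train, dev, test = split_lines(sample, dev_size=2, test_size=3)
    print(len(train), len(dev), len(test))
```

Fixing the dev and test sizes as absolute line counts, rather than ratios, mirrors the `${DEV_DATA_SIZE}` and `${TEST_DATA_SIZE}` arguments of the command above; whatever remains becomes the training set.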
- SlimPajama-DC Text Corpus Low-Length Filtering and Deduplication
python ./bin/etl/dataset/text_corpus/text_corpus_slimpajama_dc_processor.py ./bin/etl/dataset/text_corpus/text_corpus_slimpajama_dc_processor.json
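
The two steps named above, low-length filtering and deduplication, can be illustrated with a short Python sketch. This is not the logic of `text_corpus_slimpajama_dc_processor.py`, and it does not read the JSON config. The names `normalized` and `filter_and_dedup` are hypothetical, the `min_chars` default of 200 is an assumed threshold, and the sketch swaps in exact-match hashing for simplicity, which will not catch near-duplicates the way a MinHash-based approach would.

```python
import hashlib
import string
from typing import Iterable, Iterator

# Translation table that deletes ASCII punctuation and whitespace.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalized(text: str) -> str:
    """Lowercase the text and drop punctuation and whitespace (hypothetical helper)."""
    return text.translate(_STRIP).lower()


def filter_and_dedup(docs: Iterable[str], min_chars: int = 200) -> Iterator[str]:
    """Yield documents that are long enough and not exact duplicates.

    min_chars is an assumed threshold, measured on the normalized text.
    """
    seen = set()
    for doc in docs:
        key_text = normalized(doc)
        # Low-length filter: skip documents with too little real content.
        if len(key_text) < min_chars:
            continue
        # Exact deduplication on the normalized text (a simplification).
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


if __name__ == "__main__":
    corpus = ["short", "Hello, world!", "hello world", "another document"]
    # A tiny threshold is used here only so the toy corpus passes the filter.
    for kept in filter_and_dedup(corpus, min_chars=6):
        print(kept)
```

Measuring length after normalization keeps documents that are mostly punctuation or whitespace from passing the filter, and hashing the same normalized form lets trivially different copies collapse to one entry.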