Dialog-KoELECTRA


Introduction

Dialog-KoELECTRA is a language model specialized for dialogue. It was trained on 22 GB of colloquial and written Korean text. Dialog-KoELECTRA is based on ELECTRA, a method for self-supervised language representation learning that can pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.

Dialog-KoELECTRA can train faster and use less memory by enabling the mixed precision option during pre-training. During fine-tuning, hyperparameter optimization is available through the NNI option.


Released Models

We are initially releasing a small version of the pre-trained model. The model was trained on Korean text. We hope to release other models, such as base and large models, in the future.

| Model | Layers | Hidden Size | Params | Max Seq Len | Learning Rate | Batch Size | Train Steps | Train Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dialog-KoELECTRA-Small | 12 | 256 | 14M | 128 | 1e-4 | 512 | 1M | 28 days |

How to use from the transformers library

The Dialog-KoELECTRA model is uploaded to the Hugging Face Hub, so it is easy to use.

```python
from transformers import ElectraTokenizer, ElectraForSequenceClassification

tokenizer = ElectraTokenizer.from_pretrained("skplanet/dialog-koelectra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained("skplanet/dialog-koelectra-small-discriminator")
```
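
The loaded tokenizer and model can then be used for classification. The sketch below is illustrative only: the example sentence is arbitrary, and the classification head is randomly initialized until the model is fine-tuned on a downstream task.

```python
import torch

# Tokenize an arbitrary Korean sentence.
inputs = tokenizer("영화 정말 재미있게 봤어요!", return_tensors="pt")

# Run the classification head on top of the pre-trained encoder. Note that
# ElectraForSequenceClassification adds a freshly initialized head, so the
# prediction is only meaningful after fine-tuning.
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.argmax(dim=-1).item())
```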

If you want to download the model directly without using the transformers library, you can use the links below.


| Model | PyTorch-Generator | PyTorch-Discriminator | TensorFlow-v1 | ONNX |
| --- | --- | --- | --- | --- |
| Dialog-KoELECTRA-Small | link | link | link | link |
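
If you use the ONNX export, inference can be run with onnxruntime. The snippet below is a sketch only: the local file name and the input tensor names (`input_ids`, `attention_mask`, `token_type_ids`) are assumptions, so confirm them with `session.get_inputs()` against the actual exported graph.

```python
import numpy as np
import onnxruntime as ort
from transformers import ElectraTokenizer

# Hypothetical local path to the downloaded ONNX model.
session = ort.InferenceSession("dialog-koelectra-small-discriminator.onnx")

tokenizer = ElectraTokenizer.from_pretrained("skplanet/dialog-koelectra-small-discriminator")
encoded = tokenizer("안녕하세요", return_tensors="np")

# The expected input names are assumptions; check session.get_inputs().
feeds = {name: encoded[name].astype(np.int64)
         for name in ("input_ids", "attention_mask", "token_type_ids")}
print(session.run(None, feeds)[0].shape)
```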

Model Performance

Dialog-KoELECTRA shows strong performance on downstream tasks over colloquial data.

The first three tasks (NSMC, Question Pair, Korean-Hate-Speech) use colloquial data; the last three (Naver NER, KorNLI, KorSTS) use written data.

| Model | NSMC (acc) | Question Pair (acc) | Korean-Hate-Speech (F1) | Naver NER (F1) | KorNLI (acc) | KorSTS (spearman) |
| --- | --- | --- | --- | --- | --- | --- |
| DistilKoBERT | 88.60 | 92.48 | 60.72 | 84.65 | 72.00 | 72.59 |
| KoELECTRA-Small | 89.36 | 94.85 | 63.07 | 85.40 | 78.60 | 80.79 |
| Dialog-KoELECTRA-Small | 90.01 | 94.99 | 68.26 | 85.51 | 78.54 | 78.96 |

Train Data

| Style | Corpus name | Size |
| --- | --- | --- |
| Dialog | Aihub Korean dialog corpus<br>NIKL Spoken corpus<br>Korean chatbot data<br>KcBERT | 7GB |
| Written | NIKL Newspaper corpus<br>namuwikitext | 15GB |

Vocabulary

We applied morpheme analysis using huggingface_konlpy when creating the vocabulary. In our experiments, this gave better performance than a vocabulary built without morpheme analysis.

| Vocabulary size | Unused token size | Limit alphabet | Min frequency |
| --- | --- | --- | --- |
| 40,000 | 500 | 6,000 | 3 |

Demo


Pre-training

Use preprocess.py to preprocess raw text. Preprocessing only removes repeated characters and Chinese characters. It has the following arguments:

  • --corpus_dir: A directory containing raw text files.
  • --output_file: File created after preprocessing.

Then run (for example)

```bash
python3 preprocess.py \
    --corpus_dir raw_data_dir \
    --output_file preprocessed_data.txt
```
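
The exact cleanup rules live in preprocess.py; the regular expressions below are illustrative assumptions of the kind of cleanup described above (collapsing repeated characters and stripping Chinese characters):

```python
import re

def clean_line(text: str) -> str:
    # Remove Chinese (CJK Unified Ideographs) characters.
    text = re.sub(r"[\u4e00-\u9fff]+", "", text)
    # Collapse characters repeated more than three times in a row.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    return text.strip()

print(clean_line("진짜 대박ㅋㅋㅋㅋㅋ 中文"))  # -> "진짜 대박ㅋㅋㅋ"
```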

Use build_vocab.py to create a vocabulary file from raw text or preprocessed data. It has the following arguments:

  • --corpus: A raw text file or preprocessed file to turn into a vocabulary file.
  • --tokenizer: The tokenizer to use, either wordpiece or mecab_wordpiece (wordpiece by default).
  • --vocab_size: The number of words in the vocabulary (40,000 by default).
  • --min_frequency: The minimum frequency a pair must have to produce a merge operation (3 by default).
  • --limit_alphabet: The number of initial tokens that can be kept before computing merges (6,000 by default).
  • --unused_size: The number of unused tokens (500 by default).

Then run (for example)

```bash
python3 build_vocab.py \
    --corpus preprocessed_data.txt \
    --tokenizer mecab_wordpiece \
    --vocab_size 40000 \
    --min_frequency 3 \
    --limit_alphabet 6000 \
    --unused_size 500
```
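
Under the hood this step trains a WordPiece vocabulary. A minimal equivalent sketch using the Hugging Face tokenizers library is shown below; it is not the repository's exact implementation (it skips the mecab morpheme-analysis option and the unused-token padding), but it uses the same settings as the table above:

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary on the preprocessed corpus.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["preprocessed_data.txt"],
    vocab_size=40000,
    min_frequency=3,
    limit_alphabet=6000,
)
tokenizer.save_model(".")  # writes vocab.txt to the current directory
```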

Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text. It has the following arguments:

  • --corpus_dir: A directory containing raw text files to turn into Dialog-KoELECTRA examples. A text file can contain multiple documents with empty lines separating them.
  • --vocab_file: File defining the wordpiece vocabulary.
  • --output_dir: Where to write out Dialog-KoELECTRA examples.
  • --max_seq_length: The number of tokens per example (128 by default).
  • --num_processes: If greater than 1, parallelize across multiple processes (1 by default).
  • --blanks-separate-docs: Whether blank lines indicate document boundaries (True by default).
  • --do-lower-case/--no-lower-case: Whether to lowercase the input text (True by default).
  • --tokenizer_type: The tokenizer to use, either wordpiece or mecab_wordpiece (wordpiece by default).

Then run (for example)

```bash
python3 build_pretraining_dataset.py \
    --corpus_dir data/train_data/raw/split_normalize \
    --vocab_file data/vocab/vocab.txt \
    --tokenizer_type wordpiece \
    --output_dir data/train_data/tfrecord/pretrain_tfrecords_len_128_wordpiece_train \
    --max_seq_length 128 \
    --num_processes 8
```
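
To sanity-check the generated examples, you can peek at one record with TensorFlow 2 in eager mode. This optional snippet prints the feature names without assuming what they are, since the exact layout of each example is defined by the ELECTRA data pipeline:

```python
import tensorflow as tf

# Read a single serialized example from the generated shards.
dataset = tf.data.TFRecordDataset(
    tf.io.gfile.glob("data/train_data/tfrecord/pretrain_tfrecords_len_128_wordpiece_train/*")
)
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    # Print each feature name and its length.
    for name, feature in example.features.feature.items():
        print(name, len(feature.int64_list.value))
```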

Use run_pretraining.py to pre-train a Dialog-KoELECTRA model. It has the following arguments:

  • --data_dir: a directory where pre-training data, model weights, etc. are stored.
  • --model_name: a name for the model being trained. Model weights will be saved in <data-dir>/models/<model-name> by default.
  • --hparams (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See configure_pretraining.py for the supported hyperparameters.
  • --use_tpu (optional): Whether to use a TPU when training the model.
  • --mixed_precision (optional): Option for whether to use mixed precision when training the model.

Then run (for example)

```bash
python3 run_pretraining.py \
    --data_dir data/train_data/tfrecord/pretrain_tfrecords_len_128_wordpiece_train \
    --model_name data/ckpt/pretrain_ckpt_len_128_small_wordpiece_train \
    --hparams data/config/small_config_kor_wordpiece_train.json \
    --mixed_precision
```
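
As an illustration of what the --hparams file can contain, the sketch below mirrors the settings from the Released Models table. The exact key names are assumptions; see configure_pretraining.py for the supported fields.

```python
import json

# Illustrative hyperparameters mirroring the Released Models table;
# key names are assumptions -- check configure_pretraining.py.
hparams = {
    "model_size": "small",
    "vocab_file": "data/vocab/vocab.txt",
    "vocab_size": 40000,
    "max_seq_length": 128,
    "learning_rate": 1e-4,
    "train_batch_size": 512,
    "num_train_steps": 1000000,
}

with open("data/config/small_config_kor_wordpiece_train.json", "w") as f:
    json.dump(hparams, f, indent=2)
```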

Use pytorch_convert.py to convert the TensorFlow model to a PyTorch model. It has the following arguments:

  • --tf_ckpt_path: A directory where the TensorFlow checkpoint is stored.
  • --pt_discriminator_path: Where to write out the PyTorch discriminator model.
  • --pt_generator_path (optional): Where to write out the PyTorch generator model.

Then run (for example)

```bash
python3 pytorch_convert.py \
    --tf_ckpt_path model/ckpt/pretrain_ckpt_len_128_small \
    --pt_discriminator_path model/pytorch/dialog-koelectra-small-discriminator \
    --pt_generator_path model/pytorch/dialog-koelectra-small-generator
```
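
After conversion, you can sanity-check the exported discriminator by loading it with transformers. This is a minimal check and assumes the output directory also contains the matching config and vocabulary files:

```python
from transformers import ElectraForPreTraining, ElectraTokenizer

model = ElectraForPreTraining.from_pretrained("model/pytorch/dialog-koelectra-small-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("model/pytorch/dialog-koelectra-small-discriminator")

inputs = tokenizer("토크나이저와 모델이 잘 변환되었는지 확인합니다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # one replaced-token-detection logit per input token
```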

Fine-tuning

Use run_finetuning.py to fine-tune and evaluate a Dialog-KoELECTRA model on a downstream NLP task. It has the following arguments:

  • --config_file: A YAML file containing model hyperparameters, data paths, etc.
  • --nni: Whether to use NNI when fine-tuning the model.

Then run (for example)

```bash
python3 run_finetune.py --config_file conf/hate-speech/electra-small.yaml
```
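
If you prefer to fine-tune directly with the transformers Trainer API rather than the provided script, a minimal sketch looks like the following. The toy sentences and hyperparameters are purely illustrative and stand in for a real downstream dataset such as NSMC.

```python
from datasets import Dataset
from transformers import (ElectraForSequenceClassification, ElectraTokenizer,
                          Trainer, TrainingArguments)

tokenizer = ElectraTokenizer.from_pretrained("skplanet/dialog-koelectra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "skplanet/dialog-koelectra-small-discriminator", num_labels=2
)

# Toy binary sentiment data standing in for a real corpus.
train = Dataset.from_dict({
    "text": ["정말 재미있었어요", "시간 낭비였다", "배우들 연기가 최고", "줄거리가 너무 지루해"],
    "label": [1, 0, 1, 0],
})
train = train.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                  batched=True)

args = TrainingArguments(output_dir="finetune-sketch",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```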

References

  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
  • KoELECTRA: Pretrained ELECTRA Model for Korean

Contact Info

For help or issues using Dialog-KoELECTRA, please submit a GitHub issue.

For personal communication related to Dialog-KoELECTRA, please contact Wonchul Kim (wonchul.kim@sk.com).


Citation

If you apply this library to any project or research, please cite our code:

```bibtex
@misc{DialogKoELECTRA,
  author       = {Wonchul Kim and Junseok Kim and Okkyun Jeong},
  title        = {Dialog-KoELECTRA: Korean conversational language model based on ELECTRA model},
  howpublished = {\url{https://github.com/skplanet/Dialog-KoELECTRA}},
  year         = {2021},
}
```

License

The Dialog-KoELECTRA project is licensed under the Apache License 2.0.

 Copyright 2020 ~ present SK Planet Co. RB Dialog solution

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.