Script(s) to generate datasets for probing tasks of BERT (project TINSE)
Make sure you have access to the Docker daemon. Then create the conda environment and install the Python dependencies:

```bash
conda create -n tinse python=3.8
conda activate tinse
pip install -r requirements.txt
```
To install neuralcoref from source:

```bash
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
```
Download the required spaCy pipeline:

```bash
python -m spacy download en_core_web_sm
```
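To verify that neuralcoref and the spaCy pipeline work together, you can run a quick sanity check (this snippet is illustrative and not part of the project scripts; it uses neuralcoref's documented `add_to_pipe` API):

```python
# Quick sanity check: load the spaCy pipeline and attach neuralcoref.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister has a dog. She loves him.")
print(doc._.has_coref)        # True if coreference clusters were found
print(doc._.coref_clusters)   # the resolved clusters
```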
Start an Elasticsearch container in another terminal:

```bash
docker run -p 12375:9200 -p 12376:9300 -e "discovery.type=single-node" --detach --name es -v esdata1:/usr/share/elasticsearch/data:rw docker.elastic.co/elasticsearch/elasticsearch:7.16.2
```
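Once the container is up, you can check that Elasticsearch responds on the mapped HTTP port (assuming the container runs on the local machine and uses the ports from the command above):

```bash
# Should return a small JSON document with cluster name and version info
curl http://localhost:12375
```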
Once created, you only need to make sure the container is running before you call the script to create datasets. If it is not running, start it via:

```bash
docker container start es
```
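Whether the container is currently running can be checked with the standard Docker CLI:

```bash
docker ps --filter "name=es"
```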
Finally, run the dataset creation script:

```bash
conda activate tinse
python dataset_creation.py -t=<tasks> -s=<size> -sq=<samples_per_query> ...
```
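For example, an illustrative invocation (using the options documented in the table below) that generates bm25 and semsim datasets with 1000 samples and at most 5 samples per query:

```bash
python dataset_creation.py -t=bm25,semsim -s=1000 -sq=5
```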
Datasets are saved in `/datasets` with the naming scheme `{source}_{task}_{size}_{samples_per_query}_{timestamp}.json`.
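For instance, an msmarco-based bm25 dataset of size 10000 with 5 samples per query would be named `msmarco_bm25_10000_5_<timestamp>.json` (the exact timestamp format is whatever the script writes).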
Option | Description | Default |
---|---|---|
-s, --size | Size of the generated dataset(s) | 10000 |
-sq, --samples_per_query | Maximum number of passage samples with the same query in the generated dataset | 5 |
-src, --source | Source dataset | msmarco |
-t, --tasks | Tasks to generate datasets for. Possible tasks are: ['bm25', 'tf', 'semsim', 'ner', 'corefres', 'factchecking']. Should be comma-separated | bm25,semsim,ner,tf |
-sp, --sample_path | Reuse an existing sample of a dataset. Specify the name of a file in ./datasets/samples/. Every time a dataset is newly sampled, it is saved in csv format with the naming format `{src}_{size}_{samples_per_query}_{timestamp}.csv`. If set, --size and --samples_per_query are ignored | - |
-ph, --port_http | HTTP port for the Elasticsearch container; should correspond to the port the docker container is bound to | 12375 |
-pt, --port_tcp | TCP port for the Elasticsearch container; should correspond to the port the docker container is bound to | 12376 |
--split | Only relevant for the fact-checking task. Train, validation and test split ratio. Must add up to 100 | 70,15,15 |
--neg_sample_ratio | Only relevant for coreference resolution. Ratio of easy to hard negative samples. The first number is the percentage of easy examples (a random word from the passage), the second of hard examples (other entities in the passage). Must add up to 100 | 50,50 |
--id_pairs | Flag indicating whether ID pairs (MSMARCO) from the csv (assets/msmarco/passage_re_ranking/msm_id_pairs.csv) should be used to create the datasets | - |
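As an illustration, reusing a previously saved sample instead of drawing a new one might look like this (the csv name is a placeholder for an actual file in ./datasets/samples/):

```bash
python dataset_creation.py -t=ner,corefres -sp=msmarco_10000_5_<timestamp>.csv
```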
Data used and produced by the script:

- Source datasets (msmarco, fever, glove): mappings from dataset key to files are defined in src/dataset_sources.py. If another dataset should be added, it should follow this schema (see the sketch below).
- Generated datasets are stored in json format under /datasets.
- Generated samples, used to construct datasets that share the same query/document pairs, are stored under ./datasets/samples/.
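The actual schema is defined in src/dataset_sources.py; purely as a hypothetical sketch of what a key-to-files mapping of this kind could look like (all names and paths below are illustrative, not the real code):

```python
# Hypothetical sketch only -- the real mapping lives in src/dataset_sources.py.
DATASET_SOURCES = {
    "msmarco": {
        "passages": "assets/msmarco/passage_re_ranking/collection.tsv",
        "queries": "assets/msmarco/passage_re_ranking/queries.tsv",
    },
    "fever": {
        "claims": "assets/fever/train.jsonl",
    },
}
```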
To keep the script running after you disconnect (e.g. on a remote machine), run it inside a tmux session:

- Start a session: `tmux`
- Run the script
- To detach from the tmux session: `Ctrl + b`, then `d`
- To reattach to the tmux session: `tmux attach`
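Put together, a remote run might look like this (the session name tinse is only an example):

```bash
tmux new -s tinse                                   # start a named session
conda activate tinse
python dataset_creation.py -t=bm25,semsim -s=1000   # run the script
# detach with Ctrl+b, then d -- the script keeps running
tmux attach -t tinse                                # reattach later
```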