valevo/Thesis

Statistical Methodology for Quantitative Linguistics:
A Case Study of Learnability and Zipf’s Law

Master of Logic Thesis of Valentin Vogelmann

Contents

/figures

Figures and tables, see figures/README.md for navigation.

/data

Linguistically pre-processed (see Section 2.1 of the thesis for pipeline details) and subsequently pickled (using Python's pickle module) corpora: Wikipedia dumps in 7 languages. Each folder, prefixed with a language code (see below), contains the corpus split into multiple files to stay below GitHub's file size limit. Use wiki_from_pickles in /data/reader.py to load a corpus from a folder in /data. corpus.py contains wrappers that turn the corpora loaded with wiki_from_pickles into Python objects with convenient functionality.
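As a rough illustration of what loading a corpus split across several pickle files involves, here is a minimal sketch. It is not the repository's wiki_from_pickles. The function name load_pickled_chunks, the `.pkl` file extension, and the assumption that each file holds a pickled list of articles are all hypothetical; check /data/reader.py for the actual layout and interface.

```python
import pickle
from pathlib import Path


def load_pickled_chunks(folder):
    """Load all *.pkl files in `folder` in name order and concatenate them.

    Hypothetical helper: assumes every file contains one pickled list.
    """
    articles = []
    for path in sorted(Path(folder).glob("*.pkl")):
        with path.open("rb") as handle:
            # Each chunk is assumed to be a list; extend keeps one flat list.
            articles.extend(pickle.load(handle))
    return articles
```

Unpickling executes code embedded in the file, so only load pickles from sources you trust.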

Language codes: Esperanto - EO, Finnish - FI, Indonesian - ID, Korean - KO, Norwegian (the Bokmål variant) - NO, Turkish - TR, and Vietnamese - VI.

See Table 2.1 of the thesis for the basic size characteristics of the corpora.

/src

Code for generating subcorpora with the Subsampling and Filtering methods and for analysing them.

  • /src/stats/: Helper functionality, such as calculating ranks, frequencies, and typicality, or performing maximum likelihood estimation (MLE)
  • /src/subsampling: Scripts to analyse the aspects of the Subsampling method examined in the thesis, such as variance and convergence
  • /src/filtering: Implementations of the TypicalityFilter and SpeakerRestrictionFilter sampling algorithms, in both sequential and parallelised versions (using Python's multiprocessing library)
  • /src/evaluation: Functions and scripts for evaluation of filtering results, such as lexical diversity and Jaccard distance
  • shell_scripts: Shell scripts to deploy Python code on SurfSARA's LISA computing cluster
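To make two of the quantities named in the list above concrete, here is a minimal sketch of a rank-frequency table (the basis of a Zipf plot) and of the Jaccard distance between two vocabularies. The function names rank_frequency and jaccard_distance are illustrative and are not taken from /src. The repository's own implementations may differ, for example in how ties between equally frequent words are ranked.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples; rank 1 is the most frequent word."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken alphabetically (an
    # arbitrary choice made here only to keep the output deterministic).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|, treating both inputs as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets are treated as identical by convention.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)
```

For example, `rank_frequency("a b a c b a".split())` yields `[(1, 'a', 3), (2, 'b', 2), (3, 'c', 1)]`, and two vocabularies that share one of three distinct words have a Jaccard distance of 2/3.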
