heliport

A language identification tool that aims to be both fast and accurate. Originally started as a HeLI-OTS port to Rust.

Installation

From PyPI

Install the package in your environment:

pip install heliport

Then download the model:

heliport-download

From source

Install the requirements, then clone the repo, build the package, and compile the model:

git clone https://github.com/ZJaume/heliport
cd heliport
pip install .
heliport-convert

Usage

CLI

Run the heliport command, which reads lines from stdin and prints one language label per line:

cat sentences.txt | heliport
eng_latn
cat_latn
rus_cyrl
...
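To drive the CLI from a script, the command can be run as a subprocess. The following is a minimal sketch, not part of heliport: the helper name label_lines is hypothetical, and it assumes the command prints exactly one label per input line, in order, as in the example above.

```python
import subprocess


def label_lines(sentences, command=("heliport",)):
    """Pipe sentences to a command on stdin and pair each with its output line.

    Assumption: the command emits one label per input line, in input order.
    """
    # One sentence per line; embedded newlines would break the pairing.
    payload = "".join(s.replace("\n", " ") + "\n" for s in sentences)
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    labels = result.stdout.splitlines()
    if len(labels) != len(sentences):
        raise ValueError(f"expected {len(sentences)} labels, got {len(labels)}")
    return list(zip(sentences, labels))
```

The length check is there so that a mismatch between input and output lines fails loudly instead of silently shifting labels onto the wrong sentences.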

Python package

>>> from heliport import Identifier
>>> i = Identifier()
>>> i.identify("L'aigua clara")
'cat_latn'
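A common follow-up is to tally the predicted languages over a whole file. The sketch below uses only the identify call shown above; the helper language_counts is a hypothetical name, and it takes the identify function as a parameter so that it does not depend on any other part of the heliport API.

```python
from collections import Counter


def language_counts(lines, identify):
    """Count predicted labels over the non-empty lines of an iterable.

    `identify` is any callable mapping a string to a label, for example
    the `identify` method of a heliport `Identifier` instance.
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines instead of classifying them
            counts[identify(text)] += 1
    return counts
```

With heliport installed, this could be called as language_counts(open("sentences.txt", encoding="utf-8"), Identifier().identify), assuming sentences.txt holds one sentence per line.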

Rust crate

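use std::sync::Arc;
use heliport::identifier::Identifier;
use heliport::lang::Lang;
use heliport::load_models;

// Load the character and word models from the model directory.
let (charmodel, wordmodel) = load_models("/dir/to/models");
// The identifier shares both models through reference-counted pointers.
let identifier = Identifier::new(
    Arc::new(charmodel),
    Arc::new(wordmodel),
);
// identify returns the predicted language together with its score.
let (lang, score) = identifier.identify("L'aigua clara");
assert_eq!(lang, Lang::cat_Latn);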

Benchmarks

Speed benchmarks with 100k random sentences from OpenLID, all the tools running single-threaded:

| tool                      | time (s) |
|---------------------------|----------|
| CLD2                      | 1.12     |
| HeLI-OTS                  | 60.37    |
| lingua all high preloaded | 56.29    |
| lingua all low preloaded  | 23.34    |
| fasttext openlid193       | 8.44     |
| heliport                  | 2.33     |
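The timings are easier to compare as throughput and as ratios against heliport. The short script below derives both from the figures in the table and the stated 100k sentences. The function names are illustrative, and the derived numbers are plain arithmetic on the table, not new measurements. By this arithmetic heliport takes roughly twice as long as CLD2 and HeLI-OTS takes roughly 26 times as long as heliport.

```python
# Times in seconds, copied from the benchmark table above.
TIMES = {
    "CLD2": 1.12,
    "HeLI-OTS": 60.37,
    "lingua all high preloaded": 56.29,
    "lingua all low preloaded": 23.34,
    "fasttext openlid193": 8.44,
    "heliport": 2.33,
}
SENTENCES = 100_000  # size of the benchmark set


def throughput(seconds, n=SENTENCES):
    """Sentences processed per second."""
    return n / seconds


def relative_to(tool, baseline="heliport", times=TIMES):
    """How many times longer `tool` took than `baseline`."""
    return times[tool] / times[baseline]


# Print the tools from fastest to slowest.
for name, seconds in sorted(TIMES.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {throughput(seconds):10.0f} sent/s  {relative_to(name):6.2f}x")
```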