
Precompile Python HuggingFace's Tokenizers=0.10.3 #65

Open
SamuelLarkin opened this issue Sep 27, 2021 · 9 comments

@SamuelLarkin
Collaborator

Hi,
I'm trying to install HuggingFace's Tokenizers 0.10.3 with pip install tokenizers==0.10.3 and it fails. If I try to install version 0.10.1 it succeeds, because pip finds a prebuilt Compute Canada wheel:

pip install tokenizers==0.10.1
Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx512, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Processing /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic/tokenizers-0.10.1+computecanada-cp38-cp38-linux_x86_64.whl
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.1+computecanada

Based on this output, I would like tokenizers==0.10.3+computecanada for Python 3.8.

Thanks

@fieldsa
Collaborator

fieldsa commented Sep 28, 2021

This package is not currently available in the Compute Canada wheelhouse:
$ avail_wheels

/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic/tokenizers-0.10.1+computecanada-cp38-cp38-linux_x86_64.whl # (the latest version available)

CC software list: https://docs.computecanada.ca/wiki/Available_Python_wheels

Attempting to install in a virtualenv requires rust and cargo (crates), and results in an error with rust/1.41.0:

$ module load StdEnv/2018.3
$ module load python/3.8.0
$ module load rust/1.41.0
$ . ~/venv/test-tokenizers/bin/activate
$ pip install tokenizers==0.10.3

[..]
Caused by:
    process didn't exit successfully: `rustc --edition=2018 --crate-name bitvec /home/fieldsa/.cargo/registry/src/github.com-1ecc6299db9ec823/bitvec-0.19.5/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 --cfg 'feature="alloc"' --cfg 'feature="std"' -C metadata=e573222695fb9170 -C extra-filename=-e573222695fb9170 --out-dir /tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps -L dependency=/tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps --extern funty=/tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps/libfunty-2289090f5a439874.rmeta --extern radium=/tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps/libradium-72e277b2ee5f2108.rmeta --extern tap=/tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps/libtap-31bb11a449977869.rmeta --extern wyz=/tmp/pip-install-a5ji38el/tokenizers_a44e33f99b4942a5b298511ad70ef886/target/release/deps/libwyz-342e26516d1da351.rmeta --cap-lints allow` (exit code: 1)
  warning: build failed, waiting for other jobs to finish...
  error: build failed
  cargo rustc --lib --manifest-path Cargo.toml --features pyo3/extension-module --release --verbose -- --crate-type cdylib
  error: cargo failed with code: 101
 
  ----------------------------------------
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers

Thus, providing this package in the CC CVMFS may require a request upstream to Compute Canada. Alternatively, compiling the package locally will require a custom EasyBlock for this particular version.

@fieldsa
Collaborator

fieldsa commented Sep 28, 2021

If I install the latest nightly build of rust from rustup (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh) instead of loading rust from CVMFS, I can compile most of the crates. However, near the end, just before building the tokenizers wheel, it fails with an error due to GLIBC_2.18.

     Compiling paste-impl v0.1.18
       Running `rustc --crate-name paste_impl --edition=2018 /home/fieldsa/.cargo/registry/src/github.com-1ecc6299db9ec823/paste-impl-0.1.18/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi --crate-type proc-macro --emit=dep-info,link -C prefer-dynamic -C embed-bitcode=no -C debug-assertions=off -C metadata=fc76f3898ac42ac4 -C extra-filename=-fc76f3898ac42ac4 --out-dir /tmp/pip-install-szsebxbe/tokenizers_95b5288ea9e945aa95b1befdf1f6769a/target/release/deps -L dependency=/tmp/pip-install-szsebxbe/tokenizers_95b5288ea9e945aa95b1befdf1f6769a/target/release/deps --extern proc_macro_hack=/tmp/pip-install-szsebxbe/tokenizers_95b5288ea9e945aa95b1befdf1f6769a/target/release/deps/libproc_macro_hack-f801c04b47fa343a.so --extern proc_macro --cap-lints allow`
  error: `proc-macro` crate types currently cannot export any items other than functions tagged with `#[proc_macro]`, `#[proc_macro_derive]`, or `#[proc_macro_attribute]`
    --> /home/fieldsa/.cargo/registry/src/github.com-1ecc6299db9ec823/paste-impl-0.1.18/src/lib.rs:25:1
     |
  25 | pub fn expr(input: TokenStream) -> TokenStream {
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  error: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /tmp/pip-install-szsebxbe/tokenizers_95b5288ea9e945aa95b1befdf1f6769a/target/release/deps/libproc_macro_hack-f801c04b47fa343a.so)
    --> /home/fieldsa/.cargo/registry/src/github.com-1ecc6299db9ec823/paste-impl-0.1.18/src/lib.rs:10:5
     |
  10 | use proc_macro_hack::proc_macro_hack;
     |     ^^^^^^^^^^^^^^^
  
  error: could not compile `paste-impl` due to 2 previous errors

$ rpm -qf /lib64/libc.so.6 
glibc-2.17-307.el7.1.x86_64
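
The rpm query above pins down the root cause: the host glibc is 2.17, one minor version short of the 2.18 that the freshly built shared object requires. A minimal sketch of that comparison (the helper name here is mine, not from any real tool):

```python
def glibc_satisfies(installed: str, required: str) -> bool:
    """Compare dotted glibc version strings numerically, not lexically."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

# CentOS 7 ships glibc 2.17; the .so built with nightly rust wants 2.18.
print(glibc_satisfies("2.17", "2.18"))  # False
```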

Some forums suggest this is due to Debian- vs. CentOS-compiled shared objects, while another suggests using RHEL 8. Compute Canada addresses the GLIBC_2.18 issue on the following page: https://docs.computecanada.ca/wiki/Installing_software_in_your_home_directory#Installing_binary_packages :

[..] they may fail using errors such as /lib64/libc.so.6: version GLIBC_2.18 not found.
Often such binaries can be patched using our setrpaths.sh script, using the syntax setrpaths.sh --path path [--add_origin] where path refers to the directory where you installed that software. [..]

Some archive files, such as [..] python wheels (.whl files), may contain shared objects that need to be patched. The setrpaths.sh script extracts and patches these objects and updates the archive.

Shared object (during build):

  • /tmp/pip-install-jp25ytle/tokenizers_639c73cdd4ec440e81ebd0e251636201/target/release/deps/libproc_macro_hack-f801c04b47fa343a.so
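
Following the quoted documentation, patching would mean pointing setrpaths.sh at the directory holding the built artifacts. The directory below is hypothetical, and since setrpaths.sh exists only on Compute Canada systems, this sketch just assembles and prints the command rather than executing it:

```python
import shlex

# Hypothetical location of the built wheel; substitute wherever pip
# (or `pip wheel`) actually left the tokenizers wheel.
wheel_dir = "/home/user/wheelhouse"

# Syntax taken from the Compute Canada docs quoted above.
cmd = ["setrpaths.sh", "--path", wheel_dir, "--add_origin"]
print(shlex.join(cmd))  # setrpaths.sh --path /home/user/wheelhouse --add_origin
```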

@SamuelLarkin
Collaborator Author

Thanks @fieldsa
I tried to install HF's Tokenizers 0.10.3 in a conda environment and got the same error message about GLIBC_2.18. CentOS is NOT a great OS for research; everyone else is using Ubuntu, so we get issues like this because CentOS lags behind.

I'll take a look at the link you provided.

@ddamoursNRC
Collaborator

Sam, using the StdEnv/2020 module with the latest Rust module seemed to work for me:

#!/bin/bash
module load StdEnv/2020
module load miniconda3-4.8.2-gcc-9.2.0-sbqd2xu
module load rust/1.53.0
conda create -c conda-forge -p tokenizers python=3
conda activate tokenizers
pip install tokenizers==0.10.3

@SamuelLarkin
Collaborator Author

That sounds a bit like black magic ;) but I'll give it a try.

Thanks

@SamuelLarkin
Collaborator Author

I gave @ddamoursNRC's script a try, and it fails when I try to use the package, with the same GLIBC_2.18 error message:

python -c 'import tokenizers'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/projects/DT/mtp/models/WMT2020/opt/miniconda3/envs/tokenizers-0.10.3/lib/python3.10/site-packages/tokenizers/__init__.py", line 79, in <module>
    from .tokenizers import (
ImportError: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /gpfs/projects/DT/mtp/models/WMT2020/opt/miniconda3/envs/tokenizers-0.10.3/lib/python3.10/site-packages/tokenizers/tokenizers.abi3.so)
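
One way to confirm which glibc the running interpreter is actually linked against (and hence why the GLIBC_2.18 symbol is missing) is a quick ctypes probe. Note that gnu_get_libc_version is a glibc-only API, so the fallback covers musl or non-Linux hosts:

```python
import ctypes

def running_glibc_version() -> str:
    """Return the glibc version the current process is linked against."""
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.gnu_get_libc_version.restype = ctypes.c_char_p
        return libc.gnu_get_libc_version().decode()
    except (OSError, AttributeError):
        return "unknown (not glibc)"

print(running_glibc_version())  # e.g. "2.17" on CentOS 7
```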

@fieldsa
Collaborator

fieldsa commented Nov 26, 2021

This may be a great opportunity to test Singularity on Trixie, to leverage the latest gcc when running miniconda, rust, and the tokenizers build. Is there any interest in giving it a try?

@fieldsa
Collaborator

fieldsa commented Nov 26, 2021

Alternatively, without using containers, it may be possible to use a login shell without loading CVMFS for running miniconda and custom-compiled libraries.

The downside is that you wouldn't be able to leverage the CC CVMFS-provided modules and python wheels. However, if there is a compatibility issue between very new code and old OS libraries, this could help minimize GLIBC version conflicts by compiling all software components locally with the same toolchain.

If the version in the OS is just too old, then Singularity containers may be the best approach.

@nrcfieldsa

@SamuelLarkin - was this issue resolved and you were able to run Tokenizers?
