
HuggingBento: A Bento-flavoured distro running Hugging Face Transformers #108

Draft · wants to merge 3 commits into main
Conversation

gregfurman (Collaborator) commented Aug 24, 2024

What is this?

Creates a distribution of Bento for use with NLP pipelines. It uses the knights-analytics/hugot library to run Hugging Face pipelines with ONNX models from Go.

It introduces three new components:

  • nlp_classify_text for text classification pipelines (processor_text_classifier.go)
  • nlp_classify_tokens for token and NER classification pipelines (processor_token_classifier.go)
  • nlp_extract_features for feature extraction pipelines (processor_feature_extractor.go)

Since there is a lot of config overlap between these processors, a single processor.go file defines the config shared amongst all processor types.

All of these share a single ONNX Runtime session that is atomically initialised when the first HuggingBento processor is created. The underlying ONNX Runtime library allows only one session at a time, which required some care during integration testing to keep runs from being flaky.

Building HuggingBento

Note: the Go build tag huggingbento ensures that files in this distro are compiled only when the tag is explicitly specified (e.g. `go build -tags huggingbento`).

Docker

Run the command below to build a fresh image locally (without using any cached layers).

docker build --platform=linux/amd64  -f resources/huggingbento/Dockerfile -t warpstreamlabs/huggingbento:latest --no-cache .

Binary

  • Follow the README at resources/huggingbento/README.md for local installation of the required external dependencies (C bindings for the tokenizer and the ONNX Runtime dynamic library).
  • Build with make huggingbento

Testing

Integration Tests

  • Running integration tests locally requires the dependencies listed above. There is a test for each of the three processors.

Steps to manually test

  • Once the build is complete, create a Bento config named config.yaml with the following content:
input:
  generate:
    interval: '@every 10s'
    batch_size: 5
    mapping: root = "Japanese Bento boxes taste amazing!"

pipeline:
  processors:
    - nlp_classify_text:
        pipeline_name: classify-incoming-data
        problem_type: multiLabel
        enable_model_download: true
        model_download_options:
          model_repository: KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english

# Out: [{"Label":"NEGATIVE","Score":0.00014481653},{"Label":"POSITIVE","Score":0.99985516}]
# ...
  • This loads a processor that classifies the sentiment of text using the KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english model, downloading the model and relevant files from the Hugging Face repository.
  • The model should download, and once completed, you should get 1-2 batches of identical output: [{"Label":"NEGATIVE","Score":0.00014481653},{"Label":"POSITIVE","Score":0.99985516}].

TODO

  • Fix the GitHub workflow for releasing and testing this in CI
  • Write a specific component guide for usage, like the serverless one
  • Implement a pipeline for zero-shot evaluation
  • Perhaps run tests inside docker-compose to allow for local testing
  • Better describe the fields of each component (e.g. NewStringAnnotatedEnumField)
  • Add a generate/ directory for generating ONNX Runtime and Hugging Face bindings for any OS/ARCH combo, like ollama's multiple generate scripts

@gregfurman gregfurman self-assigned this Aug 24, 2024
internal/impl/huggingface/processor_feature_extraction.go (outdated, resolved)
model_repository: "KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english"


# In: "This meal tastes like old boots."
Collaborator:

I think it would be better if you could provide the huggingface processors with a bloblang mapping for the input. You could keep it this way, but I would assume the incoming data would be in a JSON format, and the user would have to know to apply a mapping to it / use a branch processor.

This is how the http processor works: it more or less requires a branch processor, but I think that is harder for a new user to understand than a bloblang mapping field.

resources/huggingbento/README.md (outdated, resolved)
#!/bin/bash

ONNXRUNTIME_VERSION=${ONNXRUNTIME_VERSION:-"1.18.0"}
DEPENDENCY_DEST=${DEPENDENCY_DEST:-"/usr/lib"}
Collaborator:

set this to /usr/lib/local on macOS like the README.md?

gregfurman (Author) replied Sep 16, 2024:

The script assumes Linux, which is where the above would work. Not sure if changing the default to macOS would confuse people more. Perhaps I'll add a comment mentioning this.

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

:::caution BETA
Collaborator:

I think I would mark it as experimental, because having it at BETA limits what can be changed outside of a major release. What do you think?

gregfurman (Author):

Yeah that makes sense. Will change.

Collaborator:

I think that there needs to be a way of including these processors somewhere else so that they don't appear on the website. i.e. moved to somewhere like serverless.

gregfurman (Author):

What about adding an admonition at the top of the docs saying this is only available in the HuggingBento distro? Generating the docs into a new location is doable, but it could end up being more trouble than it's worth if a text block would suffice. Thoughts?

gregfurman and others added 2 commits September 16, 2024 18:23
Co-authored-by: Jem Davies <131159520+jem-davies@users.noreply.github.com>