nim docs #1709

Merged · 9 commits · Jul 30, 2024
1 change: 1 addition & 0 deletions docs/index.md
@@ -119,6 +119,7 @@ auto_examples/mlflow_plugin/index
auto_examples/mmcloud_agent/index
auto_examples/modin_plugin/index
auto_examples/kfmpi_plugin/index
auto_examples/nim_plugin/index
auto_examples/onnx_plugin/index
auto_examples/openai_batch_agent/index
auto_examples/papermill_plugin/index
2 changes: 2 additions & 0 deletions docs/integrations.md
@@ -102,6 +102,8 @@ orchestrated by Flyte itself, within its provisioned Kubernetes clusters.
- Run Databricks jobs in your workflows with the Databricks agent.
* - {doc}`Memory Machine Cloud <auto_examples/mmcloud_agent/index>`
- Execute tasks using the MemVerge Memory Machine Cloud agent.
* - {doc}`NIM <auto_examples/nim_plugin/index>`
- Serve optimized model containers with NIM.
* - {doc}`OpenAI Batch <auto_examples/openai_batch_agent/index>`
- Submit requests for asynchronous batch processing on OpenAI.
* - {doc}`SageMaker Inference <auto_examples/sagemaker_inference_agent/index>`
23 changes: 23 additions & 0 deletions examples/nim_plugin/Dockerfile
@@ -0,0 +1,23 @@
########################
# NOTE: For CI/CD only #
########################
FROM python:3.11-slim-buster
LABEL org.opencontainers.image.source=https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in

# Copy the actual code
COPY . /root/

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
40 changes: 40 additions & 0 deletions examples/nim_plugin/README.md
@@ -0,0 +1,40 @@
(nim_plugin)=

# NIM

```{eval-rst}
.. tags:: Inference, NVIDIA
```

Serve optimized model containers with NIM in a Flyte task.

[NVIDIA NIM](https://www.nvidia.com/en-in/ai/), part of NVIDIA AI Enterprise, provides a streamlined path
for developing AI-powered enterprise applications and deploying AI models in production.
It includes an out-of-the-box optimization suite, enabling AI model deployment across any cloud,
data center, or workstation. Since NIM can be self-hosted, you get greater control over cost and data privacy,
as well as more visibility into behind-the-scenes operations.

With NIM, you can invoke the model's endpoint as if it were hosted locally, minimizing network overhead.
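
For instance, once a `NIM` object is wired into a task via its pod template (see the example linked below), the sidecar's endpoint can be called with any OpenAI-compatible client. A minimal sketch, assuming a `nim_instance` configured as in the example:

```python
from openai import OpenAI

# `nim_instance.base_url` points at the NIM sidecar running in the same pod.
client = OpenAI(base_url=f"{nim_instance.base_url}/v1", api_key="nim")  # key is required but ignored
```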

## Installation

To use the NIM plugin, run the following command:

```
pip install flytekitplugins-inference
```

## Example usage

For a usage example, see {doc}`NIM example usage <serve_nim_container>`.

```{note}
NIM can only be run in a Flyte cluster, not locally, as it must be deployed as a sidecar service in a Kubernetes pod.
```
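
Once the plugin is installed and the example is registered, a `pyflyte` invocation along these lines should trigger a remote execution (the file and task names below come from the example):

```
pyflyte run --remote serve_nim_container.py model_serving
```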

```{toctree}
:maxdepth: -1
:hidden:

serve_nim_container
```
Empty file.
113 changes: 113 additions & 0 deletions examples/nim_plugin/nim_plugin/serve_nim_container.py
@@ -0,0 +1,113 @@
# %% [markdown]
# (serve_nim_container)=
#
# # Serve Generative AI Models with NIM
#
# This guide demonstrates how to serve a Llama 3 8B model with NIM as a sidecar within a Flyte task.
#
# First, instantiate NIM by importing it from the `flytekitplugins.inference` package and specifying the image name along with the necessary secrets.
# The `ngc_image_secret` is required to pull the image from NGC, the `ngc_secret_key` is used to pull models
# from NGC after the container is up and running, and `secrets_prefix` is the environment variable prefix to access {ref}`secrets <secrets>`.
#
# Below is a simple task that serves a Llama NIM container:
# %%
from flytekit import ImageSpec, Resources, Secret, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

image = ImageSpec(
    name="nim",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
)

nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
)


@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    accelerator=A10G,
    secret_requests=[
        Secret(
            group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR
        )  # must be mounted as an env var
    ],
    requests=Resources(gpu="0"),
)
def model_serving() -> str:
    client = OpenAI(
        base_url=f"{nim_instance.base_url}/v1", api_key="nim"
    )  # api key required but ignored

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Write a limerick about the wonders of GPU computing.",
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content


# %% [markdown]
# :::{important}
# Replace `ghcr.io/flyteorg` with a container registry to which you can publish.
# To upload the image to the local registry in the demo cluster, indicate the registry as `localhost:30000`.
# :::
#
# The `model_serving` task initiates a sidecar service to serve the model, making it accessible on localhost via the `base_url` property.
# Both the completions and chat completions endpoints can be used.
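#
# For example, the completions endpoint can be exercised with the same client as in the task above
# (a sketch; the prompt and parameters are illustrative):
#
# ```python
# completion = client.completions.create(
#     model="meta/llama3-8b-instruct",
#     prompt="Write a limerick about the wonders of GPU computing.",
#     max_tokens=128,
# )
# ```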
#
# You need to mount the secret as an environment variable, since the NIM container reads the NGC API key from the `NGC_API_KEY` environment variable.
#
# By default, the NIM instantiation sets `cpu`, `gpu`, and `mem` to `1`, `1`, and `20Gi`, respectively. You can modify these settings as needed.
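#
# A sketch of overriding those defaults, assuming the `NIM` constructor accepts `cpu`, `gpu`, and `mem`
# keyword arguments matching the documented defaults:
#
# ```python
# nim_instance = NIM(
#     image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
#     secrets=NIMSecrets(...),  # as defined above
#     cpu=2,
#     gpu=1,
#     mem="40Gi",
# )
# ```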
#
# To serve a fine-tuned Llama model, specify the HuggingFace repo ID in `hf_repo_ids` as `[<your-hf-repo-id>]` and the
# LoRA adapter memory in `lora_adapter_mem`. Set the `NIM_PEFT_SOURCE` environment variable by
# passing `env={"NIM_PEFT_SOURCE": "..."}` when instantiating `NIM`, as shown below.
#
# Here is an example initialization for a fine-tuned Llama model:
# %%
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
        hf_token_key="hf-key",
        hf_token_group="hf",
    ),
    hf_repo_ids=["<your-hf-repo-id>"],
    lora_adapter_mem="500Mi",
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},
)

# %% [markdown]
# :::{note}
# Native directory and NGC support for LoRA adapters is coming soon.
# :::
#
# NIM containers can be integrated into different stages of your AI workflow, including data pre-processing,
# model inference, and post-processing. Flyte also allows serving multiple NIM containers simultaneously,
# each with different configurations on various instances.
#
# This integration enables you to self-host and serve optimized AI models on your own infrastructure,
# ensuring full control over costs and data security. By eliminating dependence on third-party APIs for AI model access,
# you gain not only enhanced control but also potentially lower expenses compared to traditional API services.
#
# For more detailed information, refer to the [NIM documentation by NVIDIA](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).
1 change: 1 addition & 0 deletions examples/nim_plugin/requirements.in
@@ -0,0 +1 @@
flytekitplugins-inference>=1.13.1a5