rag-api-pipeline
is a Python-based data pipeline tool that allows you to easily generate a vector knowledge base from any REST API data source. The
resulting database snapshot can be then plugged-in into a Gaia node's LLM model with a prompt and provide contextual responses to user queries using RAG
(Retrieval Augmented Generation).
The following sections help you to quickly setup and execute the pipeline on your REST API. If you're looking more in-depth information about how to use thes tool, tech stack and/or how it works under the hood, check the content menu on the left.
- Poetry (Docs)
- Python 3.11.x
- (Optional): a Python virtual environment manager or your preference (e.g. conda, venv)
- Docker and Docker compose
- Qdrant vector database (Docs)
- LLM model provider (either a Gaianet node or Ollama)
Git clone or download this repository to your local.
If using a custom virtual environment, you should activate your virtual environment, otherwise poetry will handle the environment for you
Navigate to the directory where this repository was cloned/download and execute the following on a terminal:
poetry install
Copy config/.env/sample
into config/.env
file and set environment variables accordingly. Check the environment variables section
for details.
Define the pipeline manifest for your REST API you're looking to extract data from. Check how to define an API pipeline manifest in Defining an API Pipeline Manifest for details, or take a look at the in-depth review of the sample manifests available in API Examples.
Set the REST API key in a config/secrets/api_key
file, or specify it using the --api-key
as argument to the CLI
Get the base URL of your Qdrant Vector DB or deploy a local Qdrant
(Docs) vector database instance using docker:
# IMPORTANT: make sure you use `qdrant:v1.10.1` for compatibility with Gaianet node
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
Get your Gaianet node running (Docs) or install Ollama (Docs) provider locally. The latter is recommended if you're looking to run the pipeline on consumer hardware.
Load the LLM embeddings model of your preference into the LLM provider you chose in the previous step:
- You can find info on how to customize a Gaianet node here
- If you chose Ollama, follow these instructions to import the LLM embeddings model:
- Make sure the Ollama service is up and running
- Go to the folder where the embeddings model is located. For this example, the llm model file is
nomic-embed-text-v1.5.f16.gguf
(Source) - Create a file with name
Modelfile
and paste the following (replace<path/to/model>
with your local directory):
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf
- Import the model by running the following command on a terminal:
ollama create Nomic-embed-text-v1.5
- Make sure the model setting by running the command:
ollama show Nomic-embed-text-v1.5
Now you're ready to use the rag-api-pipeline
CLI commands to execute different tasks of the RAG pipeline, from extracting data from an API source to generating vector embeddings
and a database snapshot. If you need more details about the parameters available on each command you can execute:
poetry run rag-api-pipeline <command> --help
Below you can find the default instructions available and an in-depth review of both the functionality and available arguments that each command offers:
# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE ----openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openapi|ollama]
# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openapi|ollama]
# or run using an already chunked dataset
poetry run rag-api-pipeline from-chunked API_MANIFEST_FILE --chunked-data-file <jsonl-file> [--llm-provider openapi|ollama]
-
run-all: executes the entire RAG data pipeline including API endpoint data streams, data normalization, data chunking, vector embeddings and database snapshot generation. You can specify the following arguments to the command:
API_MANIFEST_FILE
: API pipeline manifest file (mandatory)--llm-provider [ollama|openai]
: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)--api-key
: API Auth key. If not specified, it will try to get it fromconfig/secrets/api_key
--openapi-spec-file
: API OpenAPI YAML spec file. default toconfig/openapi.yaml
--source-manifest-file
: Airbyte API Connector YAML manifest. If specified, it will omit the API Connector manifest generation step.--full-refresh
: clean up cache and extract API data from scratch.--normalized-only
: run pipeline until the data normalization stage.--chunked-only
: run pipeline until the data chunking stage.
-
from-normalized: executes the RAG data pipeline using an already normalized JSONL dataset. You can specify the following arguments to the command:
API_MANIFEST_FILE
: API pipeline manifest file (mandatory)--llm-provider [ollama|openai]
: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)--normalized-data-file
: path to the normalized dataset in JSONL format (mandatory). Check the Architecture section for details on the required data schema.
-
from chunked: executes the RAG data pipeline using an already chunked dataset in JSONL format. You can specify the following arguments to the command:
API_MANIFEST_FILE
: API pipeline manifest file (mandatory)--llm-provider [ollama|openai]
: backend embeddings model provider. default: openai-like backend (e.g. gaia rag-api-server)--chunked-data-file
: path to the chunked dataset in JSONL format (mandatory). Check the Architecture section for details on the required data schema.
Cached API stream data and results produced from running any of the CLI commands are stored in <OUTPUT_FOLDER>/<api_name>
. The following files and folders
are created by the tool within this baseDir
folder:
{baseDir}/{api_name}_source_generated.yaml
: generated Airbyte Stream connector manifest.{baseDir}/cache/{api_name}/*
: extracted API data is cached into a local DuckDB. Database files are stored in this directory. If the--full-refresh
argument is specified to therun-all
command, the cache will be cleared and API data will be extracted from scratch.{baseDir}/{api_name}_stream_{x}_preprocessed.jsonl
: data streams from each API endpoint is preprocessed and stored in JSONL format{baseDir}/{api_name}_normalized.jsonl
: preprocessed data streams from each API endpoint are joined together and stored in JSONL format{baseDir}/{api_name}_chunked.jsonl
: normalized data that goes through the data chunking stage is then stored in JSONL format{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot
: vector embeddings snapshot file that was exported from Qdrant DB{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz
: compressed knowledge base that contains the vector embeddings snapshot
The following environment variables can be adjusted in config/.env
based on user needs:
- Pipeline config parameters:
API_DATA_ENCODING
(="utf-8"): default data encoding used by the REST APIOUTPUT_FOLDER
(="./output"): output folder where cached data streams, intermediary stage results and generated knowledge base snapshot are stored
- LLM provider settings:
LLM_API_BASE_URL
(="http://localhost:8080/v1"): LLM provider base URL (default to a local openai-based provider such as gaia node)LLM_API_KEY
(="empty-api-key"): API key to authenticate requests to the LLM providerLLM_EMBEDDINGS_MODEL
(="Nomic-embed-text-v1.5"): name of the embeddings model to be consumed through the LLM providerLLM_EMBEDDINGS_VECTOR_SIZE
(=768): embeddings vector sizeLLM_PROVIDER
(="openai"): LLM provider backend to use. It can be eitheropenai
orollama
(gaianet offers an openai compatible API)
- Qdrant DB settings:
QDRANTDB_URL
(="http://localhost:6333"): Qdrant DB base URLQDRANTDB_TIMEOUT
(=60): timeout for requests made to the Qdrant DBQDRANTDB_DISTANCE_FN
(="COSINE"): score function to use during vector similarity search. Avaiable functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']
- Pathway-related variables:
AUTOCOMMIT_DURATION_MS
(=1000): the maximum time between two commits. Every autocommit_duration_ms milliseconds, the updates received by the connector are committed automatically and pushed into Pathway's dataflow. More info can be found hereFixedDelayRetryStrategy
(docs) config parameters:PATHWAY_RETRY_MAX_ATTEMPTS
(=10): max retries to be performed if a UDF async execution failsPATHWAY_RETRY_DELAY_MS
(=1000): delay in milliseconds to wait for the next execution attempt
- UDF async execution: set the maximum No of concurrent operations per batch during udf async execution. Zero means no specific limits. Be careful when settings
this parameters for the embeddings stage as it could break the LLM provider with too many concurrent requests
CHUNKING_BATCH_CAPACITY
(=0): max No. of concurrent operation during data chunking operationsEMBEDDINGS_BATCH_CAPACITY
(=15): max No. of concurrent operation during vector embeddings operations
TBD
-
Start with building your containers:
docker compose -f local.yml build
. -
Build production containers with
docker compose -f prod.yml build
-
To run your application invoke:
docker compose -f prod.yml rm -svf
docker compose -f prod.yml up
- If trying to install
pillow-heif
missinng module:- Add the following flags
export CFLAGS="-Wno-nullability-completeness"
- Add the following flags
- Libraries required for having libmagic working:
- MacOS:
brew install libmagic
pip install python-magic-bin
- MacOS:
We use Vocs framework for generating the Documentation site. Run pnpm run dev
if you want to work on some updates. Execute the following
commands if you're looking to build and deploy the updated documentation on Github pages:
pnpm run build
pnpm run deploy
Presentation slides can be found here
🛠️ Built 🛠️ with ❤️ by RaidGuild