TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices

TPI-LLM (Tensor Parallelism Inference for Large Language Models) is an LLM serving system designed to bring LLM functions to low-resource edge devices. While cloud LLM services have achieved great success, they raise privacy concerns: users do not want their conversations uploaded to the cloud, as those conversations may contain sensitive personal information.

Our TPI-LLM system addresses the privacy issue by enabling LLM inference on edge devices with limited resources. It leverages multiple edge devices to perform inference through tensor parallelism, combined with a sliding-window memory scheduler to minimize memory usage. Currently, TPI-LLM can run Yi-34B in full precision on 4 laptops with 5GB of memory each, and Llama 2-70B on 8 devices with 3GB of memory each. Furthermore, TPI-LLM achieves over 80% lower TTFT and token latency than Accelerate, and over 90% lower than Transformers.

In the future, more acceleration techniques and LLM models will be supported.

Installation

Use the Source Code

  1. Clone this repo and enter the project folder.

  2. Add PYTHONPATH to .bashrc:

> vim ~/.bashrc
export PYTHONPATH=<PATH-TO-TPI-LLM>/src

  3. Create a new conda environment and install dependencies:

> conda create -n tpi-llm python=3.9
> conda activate tpi-llm
(tpi-llm) > pip install -r requirements.txt

Using Pre-built Docker Image

We provide Docker images for TPI-LLM on Docker Hub. This is the easiest way to get started, but running inside a container may slow down inference.

If a container serves as the master node, use docker cp <HOST_MODEL_PATH> master:/root/TPI-LLM/ to copy the pretrained model files into the master container.
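For example, assuming the master container is named master (as in the docker run command below) and the weights live on the host at the example path used later in this README:

# Copy the pretrained weights from the host into the master container
> docker cp /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft master:/root/TPI-LLM/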

Build from Dockerfile

If you prefer to build the Docker image yourself, you can modify and use the provided Dockerfile in our repo.

> docker build -t tpi-llm:local .
> docker run -dit --name master tpi-llm:local
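To emulate a multi-node setup with several containers on one machine, one possible sketch is to place the containers on a shared Docker network so the workers can reach the master. The network and container names below are our own choices, not part of the repo:

> docker network create tpi-net
> docker run -dit --name master --network tpi-net tpi-llm:local
> docker run -dit --name worker1 --network tpi-net tpi-llm:local
# Open a shell in each container (assuming the image ships bash) and run the
# commands from the "Run on Multiple Hosts" section below.
> docker exec -it master bash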

How to Use?

Download Pretrained Model Weights

To get started, you’ll need to download the pretrained model weights from Hugging Face.

Please make sure that the downloaded weight files conform to the HuggingFace format.

After downloading, save the model files in a directory of your choice, which we’ll refer to as /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft.
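For example, one way to fetch a checkpoint in HuggingFace format is the huggingface_hub CLI. The repo id and model name below are placeholders, and gated models such as Llama 2 also require huggingface-cli login:

> pip install -U huggingface_hub
> huggingface-cli download <HF-REPO-ID> --local-dir /root/TPI-LLM/pretrained_models/<MODEL-NAME>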

Run on Your Laptop

Run the example script for a trial:

> python examples/run_multiprocess.py --world_size 4 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4

This command will run 4 processes on a single machine, creating a pseudo-distributed environment that leverages tensor parallelism for Llama inference.

First-Time Setup:

If this is your first time running the task, the master node will automatically slice the pretrained weight files. Suppose there are 4 nodes in total (including the master node); the sliced weight files should look like the following:

> ls <PATH-TO-MODEL-FILES>
|- config.json
|- model-00001-of-00004.safetensors
|- model-00002-of-00004.safetensors
|- model-00003-of-00004.safetensors
|- model-00004-of-00004.safetensors
|- model.safetensors.index.json
|- ...
|- split/
|--- node_0
|--- node_1
|--- node_2
|--- node_3

Subsequent Runs:

For subsequent runs, the sliced model weight files are reused. To re-split them, include the --split_bin option.
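For example, to force a fresh split, re-run the earlier single-machine command with --split_bin appended:

> python examples/run_multiprocess.py --world_size 4 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4 --split_bin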

Run on Multiple Hosts

Assume we have 2 laptops with IP addresses as follows:

IP of host 1: 192.168.2.1 (master node)
IP of host 2: 192.168.2.2 (worker node)

The master node acts as the task publisher: it initiates the prompt and displays the generated text to users. It also slices the pretrained weight files and serves as a file server that distributes the sliced files to the other worker nodes.

Step 1: To launch the master node, run the following command on laptop 1:

# Run the master node on laptop 1 (IP: 192.168.2.1, RANK = 0)
> python examples/run_multihost.py --rank 0 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4

NOTE: Please make sure the master node is reachable from all other nodes. The master node also participates in tensor-parallel inference.
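To check reachability before launching, you can, for instance, ping the master and probe the default ports used in this README from each worker (nc may need to be installed, and the port checks will only succeed once the master process is running):

> ping -c 3 192.168.2.1
> nc -zv 192.168.2.1 29500   # --master_port
> nc -zv 192.168.2.1 29600   # --file_port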

Step 2: To launch other worker nodes, use the following command on other laptops (e.g., laptop 2):

# Run the worker node on host 2 (IP: 192.168.2.2, RANK = 1)
> python examples/run_multihost.py --rank 1 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/sync --memory_window 4

The worker nodes will automatically download their weight files from the master node. If the files have been downloaded before, you can use the --force_download option to force a re-download.
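For example, re-run the worker command with the flag appended:

# Run the worker node on host 2 and force it to re-download its weight slices
> python examples/run_multihost.py --rank 1 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/sync --memory_window 4 --force_download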

Other Arguments

TPI-LLM provides several optional parameters that you can customize to control various aspects of the inference process. Below is a list of these options:

| Argument | Default | Type | Description |
|---|---|---|---|
| --prompt | "" | str | The input prompt. |
| --length | 20 | int | Maximum length of the generated sequence. |
| --prefix | "" | str | Text added prior to the input for context. |
| --split_bin | False | bool | Split the pretrained model files (available only on the master node). |
| --save_dir | "split" | str | The directory to save the split model files. |
| --seed | 42 | int | Random seed for reproducibility. |
| --file_port | 29600 | int | Port on the master node that the file server listens on. |
| --force_download | False | bool | Force worker nodes to re-download model weight slices (available only on non-master nodes). |
| --temperature | 1.0 | float | Sampling temperature for text generation (available only on the master node). |
| --k | 0 | int | Number of highest-probability tokens to keep for top-k sampling (available only on the master node). |
| --p | 0.9 | float | Cumulative probability for nucleus (top-p) sampling (available only on the master node). |
| --disable_memory_schedule | False | bool | Set to True to disable memory window scheduling; this may lead to higher speed. |
| --memory_window | 2 | int | Size of the memory window used during inference. Should be at least 2. |
| --torch_dist | False | bool | Whether to use torch.distributed. |
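For example, to generate a longer completion with top-k/top-p sampling on the master node (the sampling values here are illustrative):

> python examples/run_multihost.py --rank 0 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 64 --memory_window 4 --temperature 0.7 --k 40 --p 0.9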

Cite Us

Our paper is currently under review. Citation information will be added here once it is published.
