  • April 12, 2024

Easily Migrating LLM Inference Serving from vLLM to Friendli Container


vLLM is an open-source inference engine that provides a starting point for serving your large language models (LLMs). In production environments, however, vLLM faces challenges: optimizations such as efficient quantized models [link1, link2] and efficient use of computation (e.g., MoE techniques) become crucial. Friendli Container is a much better option for these environments, and it is gaining popularity among companies that need to serve LLMs at scale. While vLLM provides an easy entry point to inference serving, this article illustrates how Friendli Container is equally easy to use, with one simple extra step.

Friendli Container: Built for Production

Friendli Container leverages unique optimizations, including the Friendli DNN library optimized for generative AI, iteration batching (also known as continuous batching), efficient quantization, and TCache techniques, which make it well suited to production environments and heavy workloads. As shown in articles 1 and 2, Friendli Container exhibits roughly 10x faster TTFT (time to first token) and 10x faster TPOT (time per output token) than vLLM under modest loads while serving an AWQ-quantized Mixtral 8x7B model on an NVIDIA A100 80GB GPU.
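To make the two latency metrics concrete, here is a minimal sketch of how TTFT and TPOT could be computed from token arrival timestamps. It assumes a common definition (TPOT as the average gap between output tokens after the first one); the function name `latency_metrics` is hypothetical, and this is not necessarily the methodology used in the benchmarks cited above.

```python
from typing import Sequence, Tuple


def latency_metrics(request_sent: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds.

    request_sent: wall-clock time at which the request was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")

    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_sent

    # TPOT (assumed definition): average gap between consecutive tokens
    # after the first one. With a single token there is no gap to average.
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

In practice the timestamps would come from a streaming response, recording `time.perf_counter()` as each chunk arrives.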

Moving to Friendli Container: An Easy Transition

Launching an inference server with vLLM is straightforward. As described in the blog post and the documentation, you can install it in your local environment with pip install vllm, or download a pre-built Docker image with:

bash
docker pull vllm/vllm-openai:latest

With the image, you can launch the server with the following command:

bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1

Transitioning from vLLM to Friendli Container is very easy. Here's what you need to do:

  1. Sign up: Create a Friendli Suite account and generate a Personal Access Token and a Container Secret for user authentication and container activation.
  2. Download: Pull the trial image for Friendli Containers from Friendli's registry via Docker login with your Personal Access Token.
bash
export FRIENDLI_PAT="YOUR PERSONAL ACCESS TOKEN"
export YOUR_EMAIL="YOUR EMAIL"

docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_PAT
docker pull registry.friendli.ai/trial
  3. Launch Friendli Container: Launching a Friendli Container closely resembles launching a vLLM server. You'll need your Container Secret, and you'll specify the model name and the port.
bash
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mistral-7B-v0.1 \
  --web-server-port 8000
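The similarity between the two launch commands can be sketched as a small helper that builds either one from the same inputs. This is only an illustration under stated assumptions: the function name `docker_run_args` is hypothetical, it uses only the images and flags shown in this article, and it forwards the secret with Docker's `-e NAME` form, which passes the variable through from the calling environment.

```python
import os
from typing import List

HF_CACHE = os.path.expanduser("~/.cache/huggingface")
CONTAINER_PORT = 8000  # port used inside the container in both commands above


def docker_run_args(engine: str, model: str, host_port: int = 8000) -> List[str]:
    """Build the `docker run` argument list for "vllm" or "friendli"."""
    # Options shared by both engines: GPUs, Hugging Face cache mount, port mapping.
    args = [
        "docker", "run", "--gpus", "all",
        "-v", f"{HF_CACHE}:/root/.cache/huggingface",
        "-p", f"{host_port}:{CONTAINER_PORT}",
    ]
    if engine == "vllm":
        args += ["--ipc=host", "vllm/vllm-openai:latest", "--model", model]
    elif engine == "friendli":
        args += [
            # `-e NAME` without a value forwards NAME from the caller's environment.
            "-e", "FRIENDLI_CONTAINER_SECRET",
            "registry.friendli.ai/trial",
            "--hf-model-name", model,
            "--web-server-port", str(CONTAINER_PORT),
        ]
    else:
        raise ValueError(f"unknown engine: {engine!r}")
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`. Apart from the image name, the only differences are the secret and the names of the model and port flags.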

OpenAI Compatible Inference API: Use Your Favorite Tools

Both Friendli Container and vLLM offer an OpenAI-compatible inference API. This allows you to simply send text completion requests through cURL, which works identically for both vLLM and Friendli Container.

bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
    }'

The same API also lets you use popular tools such as the OpenAI Python SDK seamlessly with either engine.

python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"  # Fill in your Friendli/vLLM endpoint
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
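Because both servers expose the same API, switching between them only changes the base URL. The following standard-library sketch illustrates this by building the same completion request for any endpoint and extracting the generated text from an OpenAI-style response body. The helper names `completion_request` and `first_completion_text` are hypothetical and are not part of either product.

```python
import json
import urllib.request


def completion_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 7, temperature: float = 0) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body: bytes) -> str:
    """Return the text of the first choice in a completions response."""
    return json.loads(response_body)["choices"][0]["text"]
```

To send the request against a running server, open it with `urllib.request.urlopen(req)` and pass the bytes from `resp.read()` to `first_completion_text`.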

Ready to Take Your LLMs to the Next Level?

Head over to https://friendli.ai/products/container/ to start your free trial and experience the power of Friendli Containers for high-performance LLM serving!


Written by


FriendliAI Tech & Research

