- April 12, 2024
Easily Migrating LLM Inference Serving from vLLM to Friendli Container
vLLM is an open-source inference engine that provides a starting point for serving your large language models (LLMs). In production environments, however, vLLM faces challenges: optimizations such as efficient quantized models [link1, link2] and efficient use of computation (e.g., MoE techniques) become crucial. Friendli Container is a much better option for these settings, and it is gaining popularity among companies that need to serve LLMs at scale. While vLLM provides an easy entry point to inference serving, this article shows that Friendli Container is just as easy to use, with one simple extra step.
Friendli Container: Built for Production
Friendli Container leverages unique optimizations, including the Friendli DNN library optimized for generative AI, iteration batching (also known as continuous batching), efficient quantization, and TCache techniques, making it well suited to production environments. It offers superior performance and handles heavy workloads efficiently. As shown in these articles 1 and 2, Friendli Container exhibits roughly 10x faster TTFT (time to first token) and 10x faster TPOT (time per output token) under modest loads while serving an AWQ-quantized Mixtral 8x7B model on an NVIDIA A100 80GB GPU.
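To make the two latency metrics concrete, here is a minimal sketch of how TTFT and TPOT could be computed from per-token arrival times for a single request. The function name `ttft_and_tpot` and the timestamps are hypothetical illustrations, not measurements from vLLM or Friendli Container, and the sketch assumes TPOT is averaged over the tokens that follow the first one.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_sent: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one request.

    request_sent: timestamp at which the request was sent.
    token_times:  arrival timestamp of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        # With a single token there is no inter-token interval to average.
        return ttft, 0.0
    # TPOT: average time per output token after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timestamps in seconds, chosen only for illustration.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

In practice the timestamps would come from a streaming client that records when each token arrives.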
Moving to Friendli Container: An Easy Transition
Launching an inference server with vLLM is straightforward. As described in the blog post and the documentation, you can install vLLM in your local environment with `pip install vllm`, or pull a pre-built Docker image:
```bash
docker pull vllm/vllm-openai:latest
```
With the image, you can launch the server with the following command:
```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1
```
Transitioning from vLLM to Friendli Container is very easy. Here's what you need to do:
- Sign up: Create a Friendli Suite account and generate a Personal Access Token and a Container Secret for user authentication and container activation.
- Download: Pull the trial image for Friendli Containers from Friendli's registry via Docker login with your Personal Access Token.
```bash
export FRIENDLI_PAT="YOUR PERSONAL ACCESS TOKEN"
export YOUR_EMAIL="YOUR EMAIL"
docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_PAT
docker pull registry.friendli.ai/trial
```
- Launch Friendli Container: Launching a Friendli Container closely resembles launching a vLLM server. You'll need to provide your Container Secret and specify the model name and the port.
```bash
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mistral-7B-v0.1 \
  --web-server-port 8000
```
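The difference between the two launch commands comes down to a small change in the engine arguments. The sketch below captures that change in a hypothetical helper, `translate_args`, which is not part of either tool. It covers only the flags that appear in this article (`--model`, `--hf-model-name`, `--web-server-port`) and deliberately rejects any other vLLM flag, since its Friendli equivalent would have to be looked up in the Friendli Container documentation.

```python
from typing import List

# Flag correspondence taken from the two `docker run` commands in this article.
# Any flag not listed here is treated as unknown rather than guessed.
VLLM_TO_FRIENDLI = {"--model": "--hf-model-name"}


def translate_args(vllm_args: List[str], port: int = 8000) -> List[str]:
    """Translate vLLM engine arguments into Friendli Container arguments."""
    translated = []
    for arg in vllm_args:
        if arg.startswith("--"):
            if arg not in VLLM_TO_FRIENDLI:
                raise ValueError(f"no known Friendli equivalent for {arg}")
            translated.append(VLLM_TO_FRIENDLI[arg])
        else:
            # Flag values (e.g. the model name) are passed through unchanged.
            translated.append(arg)
    # The Friendli command in this article sets the web server port explicitly.
    translated += ["--web-server-port", str(port)]
    return translated


print(translate_args(["--model", "mistralai/Mistral-7B-v0.1"]))
# prints ['--hf-model-name', 'mistralai/Mistral-7B-v0.1', '--web-server-port', '8000']
```

The Container Secret is not handled here because it is passed to Docker as an environment variable rather than as an engine argument.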
OpenAI Compatible Inference API: Use Your Favorite Tools
Both Friendli Container and vLLM offer an OpenAI-compatible inference API. This allows you to simply send text completion requests through cURL, which works identically for both vLLM and Friendli Container.
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
Moreover, it also allows you to use popular tools like the OpenAI Python SDK seamlessly on either platform.
```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"  # Fill in your Friendli/vLLM endpoint

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
```
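If you prefer not to add the OpenAI SDK as a dependency, the same completion request can be sketched with only the Python standard library. The helper names `build_completion_request` and `send_completion` are hypothetical. The endpoint, model, and parameters mirror the cURL example in this article and assume a server is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request for the OpenAI-compatible /completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request):
    """Send the request and return the decoded JSON response.

    Requires a running vLLM or Friendli Container server.
    """
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_completion_request(
    "http://localhost:8000/v1",  # assumed local endpoint, as in the cURL example
    "mistralai/Mistral-7B-v0.1",
    "San Francisco is a",
)
print(request.full_url)
```

Calling `send_completion(request)` against a running server returns the parsed response body. Because both servers expose the same API, only the base URL needs to change when you switch between them.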
Ready to Take Your LLMs to the Next Level?
Head over to https://friendli.ai/products/container/ to start your free trial and experience the power of Friendli Containers for high-performance LLM serving!
Written by
FriendliAI Tech & Research