Friendli Engine
The fastest LLM serving engine on the market


GROUNDBREAKING PERFORMANCE

40~80% cost savings

Up to 6× fewer GPUs required

10.7× higher throughput

6.2× lower latency

What Friendli Engine offers

Speed up the serving of LLMs, thus slashing costs by 40~80%

Friendli Engine is highly optimized to make LLM serving fast and cost-effective. Run LLM inference on Friendli Engine, the fastest engine on the market: our performance testing shows that Friendli Engine is significantly faster than vLLM and TensorRT-LLM.

Multi-LoRA serving on a single GPU

Friendli Engine simultaneously supports multiple LoRA models on fewer GPUs (even on just a single GPU!), a remarkable leap in making LLM customization more accessible and efficient.
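The general idea behind multi-LoRA serving can be sketched in a few lines: one copy of the base weights is shared by every request, and each request optionally adds a small low-rank update (A times B) from the adapter it selects. This is a toy illustration in plain Python, not Friendli Engine's implementation; the adapter names, the 2x2 weights, and the `forward` / `serve_batch` helpers are all hypothetical.

```python
def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

# Shared base weight, loaded once for all requests (toy 2x2 identity).
BASE_W = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical adapters: each is a low-rank pair (A: 2x1, B: 1x2).
ADAPTERS = {
    "support-bot": ([[1.0], [0.0]], [[0.0, 0.5]]),
    "sql-helper": ([[0.0], [1.0]], [[2.0, 0.0]]),
}

def forward(x, adapter_name=None):
    """Compute x @ W, plus (x @ A) @ B when an adapter is selected."""
    y = matvec(x, BASE_W)
    if adapter_name is not None:
        A, B = ADAPTERS[adapter_name]
        delta = matvec(matvec(x, A), B)
        y = [yi + di for yi, di in zip(y, delta)]
    return y

def serve_batch(requests):
    """Serve requests that target different adapters over the same base weights."""
    return [forward(x, name) for x, name in requests]
```

Because each adapter stores only the small A and B matrices, many adapters can sit next to a single copy of the base model, which is why one GPU can plausibly host several customized models at once.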

Deploy LLMs and more!

Friendli Engine supports a wide range of generative AI models, including quantized models and MoE.


Key Technology

Iteration batching (aka continuous batching)

Iteration batching is a new batching technology we invented to handle concurrent generation requests very efficiently. Iteration batching can achieve up to tens of times higher LLM inference throughput than conventional batching while satisfying the same latency requirement. Our technology is protected by our patents in the US and Korea.
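To see why scheduling at the granularity of a single decoding iteration helps, the toy simulation below counts decoding steps under two policies. It is a simplified sketch, not Friendli Engine's scheduler: it assumes every step costs the same, ignores prompt processing, and the function names are made up for illustration.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Request-level batching: each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def iteration_batching_steps(output_lens, batch_size):
    """Iteration-level scheduling: free slots are refilled before every step."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding iteration: every running request emits one token.
        steps += 1
        # Requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps
```

With output lengths `[2, 8, 2, 8]` and two slots, the request-level policy keeps a slot idle while the short request waits for the long one, whereas the iteration-level policy hands that slot to the next request as soon as it frees up.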

DNN library

Friendli DNN Library is a set of optimized GPU kernels carefully curated and designed specifically for generative AI. Our novel library allows Friendli Engine to support faster LLM inference across various tensor shapes and data types, as well as quantization, Mixture of Experts (MoE), LoRA adapters, and so on.

Friendli TCache

Friendli TCache intelligently identifies and stores frequently used computational results. The Friendli Engine leverages the cached results, significantly reducing the workload on the GPUs.
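The page does not describe how TCache decides what to store, so the following is only a generic illustration of reusing recurring computation. It assumes, purely for the sake of example, that the reusable results are keyed by prompt-token prefixes; the `PrefixCache` class and its methods are hypothetical and do not reflect Friendli TCache's actual design.

```python
class PrefixCache:
    """Toy cache keyed by token prefix (an assumed design, for illustration)."""

    def __init__(self):
        self._store = {}  # prefix tuple -> placeholder for cached state

    def longest_cached_prefix(self, tokens):
        """Return the length of the longest prefix already in the cache."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                return end
        return 0

    def process(self, tokens):
        """Return how many tokens had to be computed for this prompt."""
        hit = self.longest_cached_prefix(tokens)
        # Only the uncached suffix is "computed"; its states are then stored.
        for end in range(hit + 1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = f"state@{end}"
        return len(tokens) - hit
```

In this sketch, two prompts that share a long system prompt pay for that shared part only once, which is the kind of saving that would shorten the wait before the first output token.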

Speculative decoding

Friendli Engine natively supports speculative decoding, an optimization technique that rapidly speeds up LLM/LMM inference by making educated guesses on future tokens in parallel while generating the current token. Through validation of the generated potential future tokens, speculative decoding ensures identical model outputs at a fraction of the inference time.
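The draft-then-verify loop can be illustrated with a toy greedy example. Both "models" below are deterministic stand-in functions rather than neural networks, and the draft is rigged to disagree with the target at every fourth position; the names `target_next`, `draft_next`, and `speculative_decode` are hypothetical. The point of the sketch is that accepted tokens always match what the target model alone would have produced.

```python
def target_next(seq):
    """Stand-in for the large model: a deterministic next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 7

def draft_next(seq):
    """Stand-in for a cheap draft model that is wrong at every 4th position."""
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % 7

def greedy_decode(prompt, n_new):
    """Baseline: one target-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, verify them against the target, keep the valid prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        draft = list(seq)
        for _ in range(k):
            draft.append(draft_next(draft))
        # One verification pass; a real engine scores all positions in
        # parallel, this toy checks them in a loop.
        target_passes += 1
        for i in range(len(seq), len(draft)):
            expected = target_next(draft[:i])
            if draft[i] != expected:
                # Reject from the first mismatch and use the target's token.
                seq = draft[:i] + [expected]
                break
        else:
            # Every drafted token was accepted; the target adds one more.
            seq = draft + [target_next(draft)]
    return seq[:goal], target_passes
```

Comparing `speculative_decode(prompt, n)` with `greedy_decode(prompt, n)` shows the same token sequence, produced with fewer verification passes than the one-pass-per-token baseline whenever the draft guesses well.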

Highlights

Running Quantized Mixtral 8x7B on a Single GPU

We quantized the Mixtral-8x7B-Instruct v0.1 model with AWQ and ran it on a single NVIDIA A100 80GB GPU. Both TTFT and TPOT outperform a baseline vLLM system: Friendli Engine achieves at least 4.1x faster response time and 3.8x ~ 23.8x higher token throughput.
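For readers unfamiliar with the two latency metrics, the helper below shows one common way to compute them from token arrival timestamps. The page does not spell out its exact definitions, so this assumes TTFT is the delay from sending the request to receiving the first token and TPOT is the average gap between subsequent tokens; `latency_metrics` is a hypothetical name.

```python
def latency_metrics(request_time, token_times):
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    request_time: when the request was sent.
    token_times:  arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one generated token is required")
    ttft = token_times[0] - request_time
    n = len(token_times)
    # Average inter-token gap after the first token (assumed definition).
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot
```

For example, a request sent at t = 0.0 s whose tokens arrive at 0.5, 0.75, 1.0, and 1.25 s has a TTFT of 0.5 s and a TPOT of 0.25 s under these definitions.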

Quantized Llama 2 70B on Single GPU

With Friendli Engine, running AWQ-ed models is seamless. For example, you can run AWQ-ed LLMs (e.g., 4-bit Llama 2 70B on a single A100 80GB GPU) natively on Friendli Engine. Running LLMs with AWQ on Friendli Engine gives you efficient LLM deployment without sacrificing accuracy.
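A back-of-the-envelope calculation shows why 4-bit quantization makes the single-GPU case plausible. The sketch assumes roughly 70 billion parameters and counts weight storage only; it ignores quantization scales and zero-points, the KV cache, and activations, so real memory use is higher. The helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9        # rough parameter count for a 70B model
GPU_MEMORY_GB = 80     # single A100 80GB

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # about 140 GB
int4_gb = weight_memory_gb(N_PARAMS, 4)   # about 35 GB

fits_fp16 = fp16_gb < GPU_MEMORY_GB
fits_int4 = int4_gb < GPU_MEMORY_GB
```

Under these assumptions the 16-bit weights alone exceed one 80 GB GPU, while the 4-bit weights leave headroom for the KV cache and activations.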

Even faster TTFT with Friendli TCache

Friendli TCache reuses recurring computations, optimizing TTFT (Time to First Token) by leveraging cached results. We show that our Engine delivers 11.3x to 23x faster TTFT compared to vLLM.

HOW TO USE

Three ways to run generative AI models with Friendli Engine:

Friendli Container

Serve generative AI models with Friendli Engine in your GPU environment


Friendli Dedicated Endpoints

Build and run generative AI models on autopilot


Friendli Serverless Endpoints

Call a fast and affordable API for open-source generative AI models


1. Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Engine. See the detailed results and methodology here.

2. Performance compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150. Evaluation conducted by FriendliAI.

3. Performance of Friendli Container compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150, mean request per second = 0.5. Evaluation conducted by FriendliAI.
