  • August 4, 2022
  • 2 min read

Friendli Engine: How Good is it on Small Models?


In our previous blog post, "Friendli Engine: How to Serve Large-scale Transformer Models", we showed the dramatic performance gains and cost savings of using Friendli Engine (a.k.a. PeriFlow or Orca) to run large-scale generative models like GPT 175B, thanks to our patented technologies. Since then, we have received many inquiries about the performance of Orca when serving smaller generative models (e.g., models with a few billion parameters) on a single GPU.

Yes, Orca significantly outperforms FasterTransformer for models ranging from hundreds of millions to a few billion parameters! And we at FriendliAI are still working non-stop to optimize Orca for small models as well as large ones.

Today, we compare Orca against FasterTransformer again, this time with two smaller models: one with 1.3B parameters and one with 345M parameters.

In both cases, we ran our evaluation on an NVIDIA A10G GPU. The figures below show throughput and mean normalized latency. Since each request in the trace requires a different processing time, which is roughly proportional to the number of generated tokens, we report mean latency normalized by the number of generated tokens of each request.
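To make the metric concrete, here is a minimal sketch of how mean normalized latency could be computed from a request trace. The `Request` type, its field names, and the sample numbers are hypothetical and chosen for illustration only; they are not taken from our evaluation.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One completed request in a trace (hypothetical record layout)."""
    latency_ms: float       # end-to-end latency of the request, in milliseconds
    generated_tokens: int   # number of tokens generated for the request


def mean_normalized_latency(requests):
    """Average of per-request latency divided by generated tokens (ms/token)."""
    per_request = [
        r.latency_ms / r.generated_tokens
        for r in requests
        if r.generated_tokens > 0  # skip requests that generated nothing
    ]
    if not per_request:
        raise ValueError("trace contains no requests with generated tokens")
    return sum(per_request) / len(per_request)


# Illustrative trace: placeholder values, not measured data.
trace = [
    Request(latency_ms=200.0, generated_tokens=20),    # 10 ms/token
    Request(latency_ms=1000.0, generated_tokens=100),  # 10 ms/token
    Request(latency_ms=64.0, generated_tokens=4),      # 16 ms/token
]
print(mean_normalized_latency(trace))  # 12.0
```

Normalizing each request before averaging keeps long generations from dominating the mean, so a request that produces 100 tokens and one that produces 4 tokens carry equal weight.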

In our last blog post, because FasterTransformer does not have its own scheduler, we compared Orca against FasterTransformer paired with a custom scheduler that mimics the batching scheduler of the NVIDIA Triton Inference Server. This time, we used an actual NVIDIA Triton Inference Server.

GPT 1.3B with A10G GPU

Mean normalized latency and throughput of Orca and FasterTransformer on GPT 1.3B with an A10G GPU

At the same latency level of 11 ms/token, Orca achieves 55.4X higher throughput than FasterTransformer. GPT-Neo is one example of a Transformer-based generative model of this size.

GPT 345M with A10G GPU

Mean normalized latency and throughput of Orca and FasterTransformer on GPT 345M with an A10G GPU

At the same latency level of 11 ms/token, Orca achieves 26.1X higher throughput than FasterTransformer. GPT-2 medium (355M) is one example of a similar-sized model.
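A "same latency level" comparison like the ones above can be read off two latency-throughput curves. The sketch below shows one plausible way to do this, assuming latency rises with throughput and using linear interpolation between measured points. The function name, the curve points, and the target latency are made-up placeholders, not the measurements behind the figures, and the interpolation method is an assumption rather than a description of how the reported ratios were derived.

```python
def throughput_at_latency(curve, target_latency):
    """Interpolate the throughput at which a curve reaches target_latency.

    curve: list of (throughput, latency) points; latency is assumed to be
    non-decreasing as throughput grows.
    """
    points = sorted(curve)  # order points by throughput
    for (t0, l0), (t1, l1) in zip(points, points[1:]):
        if l0 <= target_latency <= l1:
            if l1 == l0:
                return t1  # flat segment: take the higher throughput
            frac = (target_latency - l0) / (l1 - l0)
            return t0 + frac * (t1 - t0)
    raise ValueError("target latency is outside the measured range")


# Placeholder curves in arbitrary units: (throughput, latency per token).
baseline = [(1.0, 5.0), (2.0, 10.0), (3.0, 20.0)]
candidate = [(10.0, 5.0), (20.0, 10.0), (30.0, 20.0)]

target = 15.0  # hypothetical latency level
ratio = throughput_at_latency(candidate, target) / throughput_at_latency(baseline, target)
print(ratio)  # 10.0
```

Fixing the latency level and comparing throughput answers a practical question: how much more traffic a system can handle while still meeting the same per-token latency.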

Summary

These results show that Orca provides significantly higher throughput and lower latency than NVIDIA FasterTransformer. As the load becomes heavier, Orca delivers higher throughput with a relatively small increase in latency.

Regardless of model size, Orca continues to outperform existing serving systems. We hope these results make Orca useful to a broader range of users, from companies running heavy models to those working with relatively small ones.

*The research on Orca was presented at OSDI 2022 on July 12th. You can read the paper here.

**Orca was developed by FriendliAI. We provide the end-to-end AI development platform Friendli Suite as our product. For more information, check the link.


Written by


FriendliAI Tech & Research

