  • July 12, 2024
  • 7 min read

Showcasing FriendliAI’s Integration with LiteLLM

LiteLLM recently added FriendliAI as one of its LLM inference API providers. LiteLLM lets users call over 100 large language models with load balancing, fallbacks, and cost tracking, all in the OpenAI API format. You can leverage FriendliAI’s blazing-fast performance and cost-efficiency alongside LiteLLM’s versatile features.

This blog post explores how the Friendli Serverless Endpoint can be used with LiteLLM. We cover basic usage, example code for different response types, and the budget manager provided by LiteLLM. Stay tuned for a fun experiment that uses the budget manager to compare the cost-efficiency of FriendliAI and OpenAI models. In this experiment, we generated approximately ten times more tokens with FriendliAI’s meta-llama-3-70b-instruct model than with OpenAI’s GPT-4o model under the same budget. By the end, you'll be well-equipped to make the most of LiteLLM and FriendliAI for your specific needs. So please follow along!


Basic Usage

This section covers the basic usage of the LiteLLM Python SDK for chat completions with four different response types: default, streaming, asynchronous, and asynchronous streaming. Throughout this blog, we use FriendliAI’s meta-llama-3-70b-instruct model and send it the message “Hello from LiteLLM”.

Before diving in, make sure you have a Friendli Personal Access Token. You can get your token here. Install the required library and export your token as follows:

$ pip install litellm
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
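
Before sending any request, it can help to fail fast when the token is missing. The following is a minimal sketch: the helper name `require_friendli_token` is hypothetical and not part of LiteLLM, and it only assumes the `FRIENDLI_TOKEN` variable exported above.

```python
import os


def require_friendli_token(env=None):
    """Return the Friendli token from the environment or raise a clear error.

    Hypothetical helper for illustration; it is not part of LiteLLM.
    """
    env = os.environ if env is None else env
    token = env.get("FRIENDLI_TOKEN", "").strip()
    if not token:
        raise RuntimeError("FRIENDLI_TOKEN is not set; export it before calling LiteLLM.")
    return token


if __name__ == "__main__":
    try:
        require_friendli_token()
        print("FRIENDLI_TOKEN found.")
    except RuntimeError as err:
        print(err)
```

Calling this once at startup turns a missing token into an immediate, readable error instead of a failed request later on.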

Default Example Code

This example demonstrates how you can use the LiteLLM Python SDK to generate a response. LiteLLM runs LLM inference through the ‘completion’ function.

python
from litellm import completion

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[
       {"role": "user", "content": "Hello from LiteLLM"}
    ],
)

print(response.choices[0].message.content)

Streaming Example Code

This example demonstrates how you can use the LiteLLM Python SDK to generate a streaming response. Responses can be streamed by setting the ‘stream’ argument to ‘True’ in the completion function.

python
from litellm import completion

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[
       {"role": "user", "content": "Hello from LiteLLM"}
    ],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
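
If you also need the full text once streaming ends, the deltas can be joined as they arrive. The sketch below assumes only the `chunk.choices[0].delta.content` attribute path used in the loop above. The names `collect_stream_text` and `fake_chunk` are hypothetical, and the stand-in chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Join the text deltas of streamed chunks into one string."""
    parts = []
    for chunk in chunks:
        # A chunk may carry no text (for example the final one), so treat None as "".
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hello"), fake_chunk(" from"), fake_chunk(" an AI!"), fake_chunk(None)]
print(collect_stream_text(demo))  # prints: Hello from an AI!
```

With a real request, you would pass the object returned by `completion(..., stream=True)` in place of `demo`.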

Async Example Code

This example demonstrates how you can use the LiteLLM Python SDK to generate an asynchronous response. Asynchronous chat completions are supported using the ‘acompletion’ function.

python
from litellm import acompletion
import asyncio

async def test_get_response():
    response = await acompletion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
           {"role": "user", "content": "Hello from LiteLLM"}
        ],
    )
    print(response.choices[0].message.content)

asyncio.run(test_get_response())

Async Streaming Example Code

This example demonstrates how you can use the LiteLLM Python SDK to generate an asynchronous streaming response.

python
from litellm import acompletion
import asyncio

async def test_get_response():
    response = await acompletion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
           {"role": "user", "content": "Hello from LiteLLM"}
        ],
        stream=True,
    )
    async for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(test_get_response())
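
A common reason to use the asynchronous API is to send several requests concurrently. The following is a sketch under stated assumptions: `ask_many` is a hypothetical helper, and `fake_acompletion` is a stand-in with the same keyword arguments used above so that the example runs offline.

```python
import asyncio


async def ask_many(prompts, acomplete, model="friendliai/meta-llama-3-70b-instruct"):
    """Send one chat request per prompt concurrently; results keep the prompt order."""
    tasks = [
        acomplete(model=model, messages=[{"role": "user", "content": prompt}])
        for prompt in prompts
    ]
    return await asyncio.gather(*tasks)


async def fake_acompletion(model, messages):
    """Stand-in for an async completion call (illustration only)."""
    await asyncio.sleep(0)
    return f"echo: {messages[0]['content']}"


results = asyncio.run(ask_many(["Hello from LiteLLM", "Hello again"], fake_acompletion))
print(results)
```

With LiteLLM installed, you could pass `acompletion` as the `acomplete` argument and read `choices[0].message.content` from each returned response.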

Results

The chat completion inference result of “Hello from LiteLLM” using FriendliAI’s meta-llama-3-70b-instruct model with LiteLLM is as follows:

Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far?

# Result of print(response)

ModelResponse(
    id=None,
    choices=[
        Choices(
            finish_reason='stop',
            index=0,
            message=Message(
                content="Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far?",
                role='assistant'
            )
        )
    ],
    created=1720661080,
    model='friendliai/meta-llama-3-70b-instruct',
    object='chat.completion',
    system_fingerprint=None,
    usage=Usage(
        completion_tokens=25,
        prompt_tokens=16,
        total_tokens=41
    )
)

Congratulations on getting the basics under your belt! You have taken the first step in leveraging LiteLLM and FriendliAI for your projects. As the LLM asked, how's your day going so far? We hope it has been productive and enjoyable. Also, note the ‘total_tokens’ field under ‘usage’ in the response above. We will use this field to count the total number of tokens used in our final experiment.

Stay tuned as we delve deeper into more advanced features and exciting experiments in the following sections. Let's continue exploring the full potential of these powerful tools together!

Budget Manager

An interesting feature of LiteLLM is its BudgetManager class, which lets you manage budgets and track the cost spent by each user. Advanced features include storing user budgets in a database and resetting user budgets after a set duration. You can check out the implementation code here.

User-based Rate Limiting Code

In this example, we will explore how to use the BudgetManager class to manage and enforce user-specific budgets. This feature is particularly useful for controlling the costs associated with running LLM inferences. The code goes through the process of creating a budget for a user, checking their current usage against the budget, and updating the cost after an inference is made.

Here's the code implementation:

python
from litellm import BudgetManager, completion

budget_manager = BudgetManager(project_name="test_project")

user = "user_id"

if not budget_manager.is_valid_user(user):
    budget_manager.create_budget(total_budget=0.001, user=user)

if budget_manager.get_current_cost(user=user) <= budget_manager.get_total_budget(user):
    response = completion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
            {"role": "user", "content": "Hello from LiteLLM"}
        ],
    )
    budget_manager.update_cost(completion_obj=response, user=user)
else:
    print("Sorry - no more budget!")

# user_cost.json
{
    "user_id": {
        "total_budget": 0.001,
        "current_cost": 3.68e-05,
        "model_cost": {
            "friendliai/meta-llama-3-70b-instruct": 3.68e-05
        }
    }
}
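
To see how much budget a user has left, the recorded cost can be subtracted from the total budget. The sketch below assumes only the JSON layout shown above; `remaining_budget` is a hypothetical helper, not a LiteLLM function, and the sample data is copied from the output above.

```python
import json


def remaining_budget(ledger, user):
    """Return the unspent part of a user's budget, given the layout shown above."""
    entry = ledger[user]
    return entry["total_budget"] - entry["current_cost"]


# Sample data copied from the user_cost.json output above.
sample = json.loads("""
{
    "user_id": {
        "total_budget": 0.001,
        "current_cost": 3.68e-05,
        "model_cost": {
            "friendliai/meta-llama-3-70b-instruct": 3.68e-05
        }
    }
}
""")

print(f"Remaining budget: ${remaining_budget(sample, 'user_id'):.7f}")
```

To inspect a real run, you could load the file with `json.load(open("user_cost.json"))` instead of using the inline sample.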

The Final Budget Manager Experiment

Now that we have covered all the basics, let's try something fun! Have you ever wanted to see how many inferences you could make on a strict budget? This experiment can help us understand how much LLMs actually cost. We used the budget manager to see how many inferences could be made with FriendliAI’s meta-llama-3-70b-instruct model for $0.001. Let’s keep sending the model “Hello from LiteLLM” until we run out of money.

Here's the code implementation. It tracks the total number of inferences and tokens used, stops once the budget is exceeded, and prints a summary:

python
from litellm import BudgetManager, completion

budget_manager = BudgetManager(project_name="test_project")

user = "user_id"

total_inferences = 0
total_tokens = 0
response = None  # stays None if the budget is already spent before the first call

if not budget_manager.is_valid_user(user):
    budget_manager.create_budget(total_budget=0.001, user=user)

while True:
    # The budget is checked before each request, so the last request may overshoot it.
    if budget_manager.get_current_cost(user=user) <= budget_manager.get_total_budget(user):
        response = completion(
            model="friendliai/meta-llama-3-70b-instruct",
            messages=[
                {"role": "user", "content": "Hello from LiteLLM"}
            ],
        )
        budget_manager.update_cost(completion_obj=response, user=user)
        total_tokens += response.usage.total_tokens
        total_inferences += 1
    else:
        print("Sorry - no more budget!")
        print(f"Total number of successful inferences: {total_inferences}")
        print(f"Total number of used tokens: {total_tokens}")
        if response is not None:
            print(f"Example of a response is: {response.choices[0].message.content}")
        break

FriendliAI’s meta-llama-3-70b-instruct Model Results

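In this run, 27 inferences using a total of 1281 tokens were made before the $0.001 budget ran out. Because the budget is checked before each request, the final request pushes the recorded cost slightly over the limit, to $0.0010248.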

Sorry - no more budget!
Total number of successful inferences: 27
Total number of used tokens: 1281
Example of a response is: Hello from me! It's nice to meet you, LiteLLM! How are you doing today?

# user_cost.json
{
    "user_id": {
        "total_budget": 0.001,
        "current_cost": 0.0010248,
        "model_cost": {
            "friendliai/meta-llama-3-70b-instruct": 0.0010248
        }
    }
}

OpenAI’s GPT-4o Model Results

Next, we ran the same experiment with OpenAI’s GPT-4o model by replacing the model value with "gpt-4o" in the experiment code. In this run, 6 inferences using a total of 126 tokens were made before the $0.001 budget ran out. Under the same budget, we were able to use over 10 times as many tokens with FriendliAI’s meta-llama-3-70b-instruct model as with OpenAI’s GPT-4o model!

Sorry - no more budget!
Total number of successful inferences: 6
Total number of used tokens: 126
Example of a response is: Hello! How can I assist you today?

# user_cost.json
{
    "user_id": {
        "total_budget": 0.001,
        "current_cost": 0.0011700000000000002,
        "model_cost": {
            "gpt-4o-2024-05-13": 0.0011700000000000002
        }
    }
}
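
The inference counts in both runs follow from the loop's `<=` check: a request is sent whenever the recorded cost has not yet passed the budget, so the last request overshoots it. The following offline sketch replays that rule using the average cost per request derived from the reported totals. Treating every request as equally expensive is a simplifying assumption, and `count_requests` is a hypothetical helper.

```python
def count_requests(budget, cost_per_request):
    """Count requests sent while the spend is still <= budget, as in the experiment loop."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    spent = 0.0
    requests = 0
    while spent <= budget:  # same comparison as the experiment code
        spent += cost_per_request
        requests += 1
    return requests, spent


# Average cost per request, derived from the totals recorded in user_cost.json.
friendli_avg = 0.0010248 / 27
gpt4o_avg = 0.00117 / 6

print(count_requests(0.001, friendli_avg)[0])  # prints: 27
print(count_requests(0.001, gpt4o_avg)[0])     # prints: 6
```

If you want a hard cap instead, one option is to stop once the remaining budget falls below the estimated cost of the next request.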

Token Cost Comparison

This graph compares the number of tokens generated by FriendliAI and OpenAI models within a $0.001 budget on LiteLLM:

With $0.001, we can generate ~10.17 times more tokens with FriendliAI’s meta-llama-3-70b-instruct model (1281 tokens) compared to OpenAI’s GPT-4o model (126 tokens).

Similarly, we can compare the cost per 1M tokens for FriendliAI and OpenAI models as below:

python
from litellm import model_cost

print(model_cost["friendliai/meta-llama-3-70b-instruct"]["input_cost_per_token"] * 1000000)  # $0.8 per 1M tokens
print(model_cost["friendliai/meta-llama-3-70b-instruct"]["output_cost_per_token"] * 1000000)  # $0.8 per 1M tokens
print(model_cost["gpt-4o"]["input_cost_per_token"] * 1000000)  # $5 per 1M tokens
print(model_cost["gpt-4o"]["output_cost_per_token"] * 1000000)  # $15 per 1M tokens
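
These per-token prices are enough to reproduce the costs recorded in the experiment by hand. The sketch below hard-codes the four prices printed above so that it runs without LiteLLM; `PRICES` and `request_cost` are hypothetical names. The post does not report how GPT-4o's 126 tokens were split. The split of 72 prompt and 54 completion tokens used here is the one implied by the recorded cost of $0.00117 at these prices.

```python
# USD per token, taken from the per-1M-token prices printed above.
PRICES = {
    "friendliai/meta-llama-3-70b-instruct": {"input": 0.8 / 1_000_000, "output": 0.8 / 1_000_000},
    "gpt-4o": {"input": 5 / 1_000_000, "output": 15 / 1_000_000},
}


def request_cost(model, prompt_tokens, completion_tokens):
    """Cost in USD: prompt tokens at the input price plus completion tokens at the output price."""
    price = PRICES[model]
    return prompt_tokens * price["input"] + completion_tokens * price["output"]


friendli = "friendliai/meta-llama-3-70b-instruct"

# Single response from the Results section: 16 prompt + 25 completion tokens.
print(f"{request_cost(friendli, 16, 25):.7f}")

# Whole FriendliAI run: 1281 tokens. Input and output cost the same, so the split is irrelevant.
print(f"{request_cost(friendli, 1281, 0):.7f}")

# Whole GPT-4o run with the implied 72/54 split (an inference, not a reported figure).
print(f"{request_cost('gpt-4o', 72, 54):.7f}")
```

The last two values match the ‘current_cost’ entries recorded in user_cost.json for each run, $0.0010248 and $0.00117.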

Conclusion

This tutorial shows basic examples of integrating LiteLLM with Friendli Serverless Endpoints for chat completions. We also demonstrate LiteLLM’s budget manager to limit user inference costs. Combining these learnings, we present a practical experiment that calculates the number of inference requests that could be made under a specific budget.

Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs using Friendli Endpoints on LiteLLM’s versatile platform!


Written by

FriendliAI Tech & Research

