# ⚡ Best Practices for Production
## Expected Performance in Production
### 1 LiteLLM Uvicorn Worker on Kubernetes
| Description | Value |
|---|---|
| Avg latency | 50ms |
| Median latency | 51ms |
| `/chat/completions` Requests/second | 35 |
| `/chat/completions` Requests/minute | 2100 |
| `/chat/completions` Requests/hour | 126K |
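If you want to sanity-check throughput on your own deployment, a rough load test can be run with `curl` and `xargs`. This is only a sketch: the endpoint, master key, and `fake-openai-endpoint` model are the placeholder values used later in this doc, and a dedicated load-testing tool will give more reliable numbers.

```shell
# Rough throughput check: fire N requests with CONCURRENCY parallel workers
# against the proxy, then report approximate requests/second.
N=200
CONCURRENCY=20

START=$(date +%s)
seq "$N" | xargs -P "$CONCURRENCY" -I {} curl -s -o /dev/null \
  'http://0.0.0.0:4000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-1234' \
  -d '{"model": "fake-openai-endpoint", "messages": [{"role": "user", "content": "ping"}]}'
ELAPSED=$(( $(date +%s) - START ))
[ "$ELAPSED" -eq 0 ] && ELAPSED=1
echo "$N requests in ${ELAPSED}s (~$(( N / ELAPSED )) req/s)"
```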
## 1. Switch off Debug Logging
Remove `set_verbose: True` from your `config.yaml`:
```yaml
litellm_settings:
  set_verbose: True
```
You should only see the following level of detail in the proxy server logs:
```shell
# INFO: 192.168.2.205:11774 - "POST /chat/completions HTTP/1.1" 200 OK
# INFO: 192.168.2.205:34717 - "POST /chat/completions HTTP/1.1" 200 OK
# INFO: 192.168.2.205:29734 - "POST /chat/completions HTTP/1.1" 200 OK
```
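To confirm debug output is actually gone after the change, you can grep the proxy container's logs. The container name `litellm` below is an assumption — substitute whatever you named yours.

```shell
# Count DEBUG lines in the proxy container's recent logs.
# Expect 0 once set_verbose is removed; only INFO access logs should remain.
docker logs --since 10m litellm 2>&1 | grep -c "DEBUG" || true
```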
## 2. On Kubernetes - Use 1 Uvicorn worker [Suggested CMD]
Use this Docker `CMD`. This will start the proxy with 1 Uvicorn async worker. (Ensure that you're not setting `run_gunicorn` or `num_workers` in the `CMD`.)
```dockerfile
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml"]
```
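Before deploying to Kubernetes, the same arguments can be passed to `docker run` for a quick local check — arguments placed after the image name override the image's default `CMD`. A sketch, assuming your config file is `proxy_server_config.yaml` in the current directory:

```shell
# Start the proxy locally with a single Uvicorn async worker.
docker run \
  -v "$(pwd)/proxy_server_config.yaml:/app/proxy_server_config.yaml" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --port 4000 --config /app/proxy_server_config.yaml
```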
## 3. Batch write spend updates every 60s
The default proxy batch write interval is 10s. This makes it easy to see spend when debugging locally. In production, we recommend a longer interval of 60s, which reduces the number of connections used for DB writes.
```yaml
general_settings:
  master_key: sk-1234
  proxy_batch_write_at: 60 # 👈 Frequency of batch writing spend updates to the DB (in seconds)
```
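With a 60s interval, spend recorded in the DB can trail actual traffic by up to a minute — if spend looks stale while testing, that lag is expected. You can watch a key's recorded spend catch up via the `/key/info` endpoint; the key and master key below are the placeholder values used elsewhere in this doc.

```shell
# Inspect a key's recorded spend. With proxy_batch_write_at: 60, the "spend"
# field in the response can trail real traffic by up to ~60 seconds.
curl -s 'http://0.0.0.0:4000/key/info?key=sk-1234' \
  -H 'Authorization: Bearer sk-1234'
```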
## 4. Move spend logs to a separate server
Writing each spend log to the DB can slow down your proxy. In testing, we saw a 70% improvement in median response time by moving spend log writes to a separate server.

**Spend Logs**
A log of the key, tokens, model, and latency for each call made through the proxy.
### 1. Start the spend logs server
```shell
docker run -p 3000:3000 \
  -e DATABASE_URL="postgres://.." \
  ghcr.io/berriai/litellm-spend_logs:main-latest

# RUNNING on http://0.0.0.0:3000
```
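Before wiring the proxy to it, you can confirm the container is up and port 3000 is reachable. This is just a generic container/port check, not an endpoint exposed by the spend logs server.

```shell
# Confirm the spend logs container is running and listening on :3000.
docker ps --filter "ancestor=ghcr.io/berriai/litellm-spend_logs:main-latest"
nc -z localhost 3000 && echo "spend logs server is reachable on :3000"
```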
### 2. Connect to the proxy
Example litellm_config.yaml
```yaml
model_list:
  - model_name: fake-openai-endpoint
    litellm_params:
      model: openai/my-fake-model
      api_key: my-fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/

general_settings:
  master_key: sk-1234
  proxy_batch_write_at: 5 # 👈 Frequency of batch writing logs to the spend logs server (in seconds)
```
Add `SPEND_LOGS_URL` as an environment variable when starting the proxy:
```shell
# 👈 KEY CHANGE: SPEND_LOGS_URL points the proxy at the spend logs server
docker run \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e DATABASE_URL="postgresql://.." \
  -e SPEND_LOGS_URL="http://host.docker.internal:3000" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --detailed_debug

# Running on http://0.0.0.0:4000
```
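`host.docker.internal` only resolves out of the box on Docker Desktop. A more portable sketch is to put both containers on a user-defined Docker network and address the spend logs server by container name; the names `litellm-net` and `spend-logs` below are arbitrary.

```shell
# Run both containers on one Docker network so the proxy can reach the
# spend logs server by container name instead of host.docker.internal.
docker network create litellm-net

docker run -d --network litellm-net --name spend-logs \
  -e DATABASE_URL="postgres://.." \
  ghcr.io/berriai/litellm-spend_logs:main-latest

docker run -d --network litellm-net --name litellm \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e DATABASE_URL="postgresql://.." \
  -e SPEND_LOGS_URL="http://spend-logs:3000" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```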
### 3. Test Proxy!
```shell
curl --location 'http://0.0.0.0:4000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer sk-1234' \
  --data '{
    "model": "fake-openai-endpoint",
    "messages": [
      {"role": "system", "content": "Be helpful"},
      {"role": "user", "content": "What do you know?"}
    ]
  }'
```
In your LiteLLM Spend Logs Server, you should see the following:

**Expected Response**

```shell
Received and stored 1 logs. Total logs in memory: 1
...
Flushed 1 log to the DB.
```
#### Machine Specification
A `t2.micro` should be sufficient to handle 1k logs/minute on this server. It consumes at most 120MB of memory and <0.1 vCPU.
## 5. Switch off resetting budgets

Add this to your `config.yaml`. `disable_spend_logs` stops per-API-call spend logs from being written to the LiteLLM database (spend per Key, User and Team is still tracked), and `disable_reset_budget` stops the proxy from periodically resetting budgets.
```yaml
general_settings:
  disable_spend_logs: true    # Don't write per-request spend logs to the DB
  disable_reset_budget: true  # Don't run the periodic budget reset task
```
## 6. Switch off litellm.telemetry

Switch off all telemetry tracking done by litellm:
```yaml
litellm_settings:
  telemetry: False
```
## Machine Specifications to Deploy LiteLLM
| Service | Spec | CPUs | Memory | Architecture | Version |
|---|---|---|---|---|---|
| Server | `t2.small` | 1 vCPU | 8GB | x86 | |
| Redis Cache | - | - | - | - | 7.0+ Redis Engine |
## Reference Kubernetes Deployment YAML

Reference Kubernetes `deployment.yaml` that was load tested by us:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
        - name: litellm-container
          image: ghcr.io/berriai/litellm:main-latest
          imagePullPolicy: Always
          env:
            - name: AZURE_API_KEY
              value: "d6******"
            - name: AZURE_API_BASE
              value: "https://ope******"
            - name: LITELLM_MASTER_KEY
              value: "sk-1234"
            - name: DATABASE_URL
              value: "po**********"
          args:
            - "--config"
            - "/app/proxy_config.yaml" # Update the path to mount the config file
          volumeMounts: # Define volume mount for proxy_config.yaml
            - name: config-volume
              mountPath: /app
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 120
            periodSeconds: 15
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 120
            periodSeconds: 15
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 10
      volumes: # Define volume to mount proxy_config.yaml
        - name: config-volume
          configMap:
            name: litellm-config
```
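For this deployment to start, the `litellm-config` ConfigMap it references must exist and contain the file mounted at `/app/proxy_config.yaml`. A minimal sketch (the local filename `litellm_config.yaml` is an assumption — use whatever your config file is called):

```shell
# Create the ConfigMap the deployment mounts at /app/proxy_config.yaml,
# then apply the deployment and wait for the pods to become ready.
kubectl create configmap litellm-config \
  --from-file=proxy_config.yaml=./litellm_config.yaml

kubectl apply -f deployment.yaml
kubectl rollout status deployment/litellm-deployment
```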
Reference Kubernetes `service.yaml` that was load tested by us:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: litellm-service
spec:
  selector:
    app: litellm
  ports:
    - protocol: TCP
      port: 4000
      targetPort: 4000
  type: LoadBalancer
```
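Once the service is applied, the LoadBalancer address can be used to hit the same health endpoint the probes above use. A sketch, assuming your cluster assigns an external IP (some providers return a hostname instead):

```shell
# Expose the deployment and hit the liveness endpoint through the LoadBalancer.
kubectl apply -f service.yaml
kubectl get service litellm-service   # wait for EXTERNAL-IP to be assigned

EXTERNAL_IP=$(kubectl get service litellm-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl "http://${EXTERNAL_IP}:4000/health/liveliness"
```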