The AI Proxy Advanced plugin offers several load-balancing algorithms that define how requests are distributed across AI models. This guide provides a configuration example for each algorithm.
Semantic routing
Semantic routing distributes requests based on the similarity between the prompt and the description of each model. This allows Kong to automatically select the model that is best suited for the given domain or use case.
To set up semantic routing with the AI Proxy Advanced plugin, you need to configure the following parameters:
- config.embeddings to define the model used to compare the prompts with the model descriptions.
- config.vectordb to define the vector database parameters. Only Redis is supported, so you need a Redis instance running in your environment.
- config.targets[].description to define the description to be matched against the prompts.
For example, the following configuration uses two OpenAI models: one for questions related to Kong, and another for questions related to Microsoft.
_format_version: "3.0"
services:
  - name: openai-chat-service
    url: https://httpbin.konghq.com/
    routes:
      - name: openai-chat-route
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          embeddings:
            auth:
              header_name: Authorization
              header_value: Bearer <token>
            model:
              name: text-embedding-3-small
              provider: openai
          vectordb:
            dimensions: 1024
            distance_metric: cosine
            strategy: redis
            threshold: 0.7
            redis:
              host: redis-stack-server
              port: 6379
          balancer:
            algorithm: semantic
          targets:
            - model:
                name: gpt-4
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
              description: "What is Kong?"
            - model:
                name: gpt-4o-mini
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
              description: "What is Microsoft?"
You can validate this configuration by sending requests and checking the X-Kong-LLM-Model response header to see which model was used. In the response to the following request, the X-Kong-LLM-Model header value is openai/gpt-4.
curl --request POST \
  --url http://localhost:8000/chat \
  --header 'Content-Type: application/json' \
  --header 'User-Agent: insomnia/10.0.0' \
  --data '{
    "messages": [
      {
        "role": "system",
        "content": "You are an IT specialist"
      },
      {
        "role": "user",
        "content": "Who founded Kong?"
      }
    ]
  }'
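Similarly, a prompt that is semantically closer to the second target's description should be routed to the gpt-4o-mini model. For the following request, you would expect the X-Kong-LLM-Model response header to be openai/gpt-4o-mini:

# A prompt about Microsoft should match the second target's description.
curl --request POST \
  --url http://localhost:8000/chat \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {
        "role": "system",
        "content": "You are an IT specialist"
      },
      {
        "role": "user",
        "content": "Who founded Microsoft?"
      }
    ]
  }'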
Weighted round-robin
The round-robin algorithm distributes requests to the different models in rotation. By default, all models have the same weight and receive the same percentage of requests. However, this can be configured with the config.targets[].weight parameter.
If you have three models and want to assign 70% of requests to the first one, 25% of requests to the second one, and 5% of requests to the third one, you can use the following configuration:
_format_version: "3.0"
services:
  - name: openai-chat-service
    url: https://httpbin.konghq.com/
    routes:
      - name: openai-chat-route
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          balancer:
            algorithm: round-robin
          targets:
            - model:
                name: gpt-4
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
              weight: 70
            - model:
                name: gpt-4o-mini
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
              weight: 25
            - model:
                name: gpt-3
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
              weight: 5
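To check the resulting distribution, you can send a batch of requests and count how often each model appears in the X-Kong-LLM-Model response header. The following is a minimal sketch, assuming that header is returned for these requests as in the semantic routing example above; over 100 requests the counts should roughly match the 70/25/5 weights:

# Send 100 chat requests and count how many were served by each model,
# based on the X-Kong-LLM-Model response header.
for i in $(seq 1 100); do
  curl --silent --output /dev/null --dump-header - \
    --request POST \
    --url http://localhost:8000/chat \
    --header 'Content-Type: application/json' \
    --data '{"messages": [{"role": "user", "content": "Hello"}]}' \
  | grep -i '^x-kong-llm-model'
done | sort | uniq -c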
Consistent-hashing
The consistent-hashing algorithm uses a request header to route requests with the same header value to the same AI model. By default, the header is X-Kong-LLM-Request-ID, but it can be customized with the config.balancer.hash_on_header parameter.
For example:
_format_version: "3.0"
services:
  - name: openai-chat-service
    url: https://httpbin.konghq.com/
    routes:
      - name: openai-chat-route
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          balancer:
            algorithm: consistent-hashing
            hash_on_header: X-Hashing-Header
          targets:
            - model:
                name: gpt-4
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
            - model:
                name: gpt-4o-mini
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
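With this configuration, all requests that carry the same X-Hashing-Header value should be routed to the same target. As a quick check (again assuming the X-Kong-LLM-Model response header is available), repeating the following request with an arbitrary but fixed header value should always report the same model, while a different value may select a different one:

# Requests with the same X-Hashing-Header value should always report
# the same model in the X-Kong-LLM-Model response header.
curl --silent --output /dev/null --dump-header - \
  --request POST \
  --url http://localhost:8000/chat \
  --header 'Content-Type: application/json' \
  --header 'X-Hashing-Header: user-123' \
  --data '{"messages": [{"role": "user", "content": "Hello"}]}' \
| grep -i '^x-kong-llm-model'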
Lowest-latency
The lowest-latency algorithm distributes requests to the model with the lowest response time. By default, the latency is calculated based on the time the model takes to generate each token (tpot). You can change the value of config.balancer.latency_strategy to e2e to use the end-to-end response time instead.
For example:
_format_version: "3.0"
services:
  - name: openai-chat-service
    url: https://httpbin.konghq.com/
    routes:
      - name: openai-chat-route
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          balancer:
            algorithm: lowest-latency
            latency_strategy: e2e
          targets:
            - model:
                name: gpt-4
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
            - model:
                name: gpt-4o-mini
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
Lowest-usage
The lowest-usage algorithm distributes requests to the model with the lowest usage volume. By default, the usage is calculated based on the total number of tokens in the prompt and in the response. However, you can customize this using the config.balancer.tokens_count_strategy parameter. You can use:
- prompt-tokens to count only the tokens in the prompt
- completion-tokens to count only the tokens in the response
For example:
_format_version: "3.0"
services:
  - name: openai-chat-service
    url: https://httpbin.konghq.com/
    routes:
      - name: openai-chat-route
        paths:
          - /chat
    plugins:
      - name: ai-proxy-advanced
        config:
          balancer:
            algorithm: lowest-usage
            tokens_count_strategy: prompt-tokens
          targets:
            - model:
                name: gpt-4
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>
            - model:
                name: gpt-4o-mini
                provider: openai
                options:
                  max_tokens: 512
                  temperature: 1.0
              route_type: llm/v1/chat
              auth:
                header_name: Authorization
                header_value: Bearer <token>