Looking for the plugin's configuration parameters? You can find them in the AI Proxy Advanced configuration reference doc.
The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time. This lets you set up load balancing between targets.
The plugin accepts requests in one of a few defined and standardized formats, translates them to the configured target format, and then transforms the response back into a standard format.
The following table describes which providers and requests the AI Proxy Advanced plugin supports:
Provider | Chat | Completion | Streaming |
---|---|---|---|
OpenAI (GPT-4, GPT-3.5) | ✅ | ✅ | ✅ |
OpenAI (GPT-4o and Multi-Modal) | ✅ | ✅ | ✅ |
Cohere | ✅ | ✅ | ✅ |
Azure | ✅ | ✅ | ✅ |
Anthropic | ✅ | ❌ | Only chat type |
Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats) | ✅ | ✅ | ✅ |
Llama2 (raw, OLLAMA, and OpenAI formats) | ✅ | ✅ | ✅ |
Llama3 (OLLAMA and OpenAI formats) | ✅ | ✅ | ✅ |
Amazon Bedrock | ✅ | ✅ | ✅ |
Gemini | ✅ | ✅ | ✅ |
How it works
The AI Proxy Advanced plugin will mediate the following for you:
- Request and response formats appropriate for the configured provider and route_type
- The following service request coordinates (unless the model is self-hosted):
  - Protocol
  - Host name
  - Port
  - Path
  - HTTP method
- Authentication on behalf of the Kong API consumer
- Decorating the request with parameters from the config.options block, appropriate for the chosen provider
- Recording usage statistics of the configured LLM provider and model in your selected Kong log plugin output
- Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
- Fulfillment of requests to self-hosted models, based on select supported format transformations
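The mediation steps above are driven entirely by the plugin configuration. The following is a minimal single-target sketch in decK declarative format; field names follow the configuration reference, and the model name and API key are placeholders, so check the reference doc for the exact schema in your plugin version:

```yaml
# Illustrative single-target configuration (decK declarative format).
# The API key and model name are placeholders.
plugins:
  - name: ai-proxy-advanced
    config:
      targets:
        - route_type: llm/v1/chat        # request/response format to mediate
          auth:
            header_name: Authorization   # Kong injects auth on the consumer's behalf
            header_value: Bearer <OPENAI_API_KEY>
          model:
            provider: openai
            name: gpt-4
            options:                     # decorates the upstream request
              max_tokens: 512
              temperature: 1.0
```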
Flattening all of the provider formats lets you standardize the manipulation of the data before and after transmission. It also allows you to offer Kong consumers a choice of LLMs with consistent request and response formats, regardless of the backend provider or model.
This plugin currently only supports REST-based full text responses.
Load balancing
This plugin supports several load-balancing algorithms, similar to those used for Kong upstreams, allowing efficient distribution of requests across different AI models. The supported algorithms include:
- Lowest-usage: Balances the load by routing requests to the model with the lowest usage volume, measured by factors such as prompt token counts, response token counts, or other resource metrics.
- Round-robin (weighted): Distributes requests across models in proportion to their configured weights.
- Consistent-hashing: Routes requests with the same value of a given header to the same model (sticky sessions).
Additionally, semantic routing works similarly to load-balancing algorithms like lowest-usage or least-connections, but instead of volume or connection metrics, it uses the similarity score between the incoming prompt and the descriptions of each model. This allows Kong to automatically choose the model best suited for handling the request, based on performance in similar contexts.
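A weighted round-robin balancer across two targets might be configured as follows. This is an illustrative sketch in decK declarative format: field names follow the configuration reference, and the weights, model names, and API key are placeholders:

```yaml
# Illustrative weighted round-robin load balancing across two models.
# Roughly 80% of requests go to gpt-4, 20% to gpt-3.5-turbo.
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        algorithm: round-robin
      targets:
        - weight: 80
          route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
          model:
            provider: openai
            name: gpt-4
        - weight: 20
          route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
          model:
            provider: openai
            name: gpt-3.5-turbo
```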
Semantic routing
The AI Proxy Advanced plugin supports semantic routing, which enables distribution of requests based on the similarity between the prompt and the description of each model. This allows Kong to automatically select the model that is best suited for the given domain or use case.
By analyzing the content of the request, the plugin can match it to the most appropriate model that is known to perform better in similar contexts. This feature enhances the flexibility and efficiency of model selection, especially when dealing with a diverse range of AI providers and models.
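Semantic routing requires an embeddings model and a vector database so that prompts can be compared against each target's description. The following sketch is illustrative only: the embeddings model, Redis settings, and target descriptions are placeholder assumptions, and the exact field names should be confirmed against the configuration reference:

```yaml
# Illustrative semantic routing sketch: incoming prompts are embedded and
# matched against each target's description; the closest match wins.
plugins:
  - name: ai-proxy-advanced
    config:
      embeddings:
        model:
          provider: openai
          name: text-embedding-3-small
      vectordb:
        strategy: redis
        dimensions: 1536
        distance_metric: cosine
        redis:
          host: localhost
          port: 6379
      balancer:
        algorithm: semantic
      targets:
        - route_type: llm/v1/chat
          description: "Mathematics, calculations, and data analysis"
          model:
            provider: openai
            name: gpt-4
        - route_type: llm/v1/chat
          description: "Creative writing and general conversation"
          model:
            provider: openai
            name: gpt-3.5-turbo
```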
Request and response formats
The plugin’s config.route_type should be set based on the target upstream endpoint and model, according to this capability matrix:
Provider Name | Provider Upstream Path | Kong route_type | Example Model Name |
---|---|---|---|
OpenAI | /v1/chat/completions | llm/v1/chat | gpt-4 |
OpenAI | /v1/completions | llm/v1/completions | gpt-3.5-turbo-instruct |
Cohere | /v1/chat | llm/v1/chat | command |
Cohere | /v1/generate | llm/v1/completions | command |
Azure | /openai/deployments/{deployment_name}/chat/completions | llm/v1/chat | gpt-4 |
Azure | /openai/deployments/{deployment_name}/completions | llm/v1/completions | gpt-3.5-turbo-instruct |
Anthropic | /v1/messages | llm/v1/chat | claude-2.1 |
Anthropic | /v1/complete | llm/v1/completions | claude-2.1 |
Llama2 | User-defined | llm/v1/chat | User-defined |
Llama2 | User-defined | llm/v1/completions | User-defined |
Mistral | User-defined | llm/v1/chat | User-defined |
Mistral | User-defined | llm/v1/completions | User-defined |
Amazon Bedrock | Use the LLM chat upstream path | llm/v1/chat | Use the model name for the specific LLM provider |
Amazon Bedrock | Use the LLM completions upstream path | llm/v1/completions | Use the model name for the specific LLM provider |
Gemini | llm/v1/chat | llm/v1/chat | gemini-1.5-flash or gemini-1.5-pro |
Gemini | llm/v1/completions | llm/v1/completions | gemini-1.5-flash or gemini-1.5-pro |
The following upstream URL patterns are used:
Provider | URL |
---|---|
OpenAI | https://api.openai.com:443/{route_type_path} |
Cohere | https://api.cohere.com:443/{route_type_path} |
Azure | https://{azure_instance}.openai.azure.com:443/openai/deployments/{deployment_name}/{route_type_path} |
Anthropic | https://api.anthropic.com:443/{route_type_path} |
Llama2 | As defined in config.model.options.upstream_url |
Mistral | As defined in config.model.options.upstream_url |
Amazon Bedrock | https://bedrock-runtime.{region}.amazonaws.com |
Gemini | https://generativelanguage.googleapis.com |
While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers. For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.model.options.upstream_url plugin option.
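An override might look like the following sketch, where the internal hostname and model name are placeholder assumptions:

```yaml
# Illustrative target fragment pointing an OpenAI-format target at a
# self-hosted, OpenAI-compatible endpoint. The URL and model name are
# placeholders.
targets:
  - route_type: llm/v1/chat
    model:
      provider: openai          # requests use the OpenAI wire format
      name: my-local-model
      options:
        upstream_url: http://llm.internal:8080/v1/chat/completions
```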
Input formats
Kong will mediate the request and response format based on the selected config.provider and config.route_type, as outlined in the table above.
The Kong AI Proxy accepts the following input formats, standardized across all providers. The config.route_type must be configured to match the required request and response format:
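For example, a request to a route configured with the llm/v1/chat route type follows the standard chat format, loosely based on OpenAI's chat completions; the message content here is a placeholder:

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is Kong Gateway?" }
  ]
}
```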
Response formats
Conversely, the response formats are also transformed to a standard format across all providers:
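A llm/v1/chat response, for instance, is returned in a standard shape loosely based on OpenAI's chat completion response; the model name, content, and token counts here are placeholder values:

```json
{
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Kong Gateway is an API gateway..."
      }
    }
  ],
  "model": "gpt-4",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 12,
    "total_tokens": 22
  }
}
```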
The request and response formats are loosely based on OpenAI. See the sample OpenAPI specification for more detail on the supported formats.
Get started with the AI Proxy Advanced plugin
- Configuration reference
- Basic configuration example
- Learn how to use the plugin with different providers: