AI Proxy Advanced

Overview of capabilities

The AI Proxy Advanced plugin supports capabilities across batch processing, multimodal embeddings, agents, audio, image, streaming, and more, spanning multiple providers.

For Kong Gateway versions 3.6 or earlier:

Chat Completions APIs: Multi-turn conversations with system/user/assistant roles.
Completions API: Generates free-form text from a prompt. OpenAI has marked this endpoint as legacy and recommends using the Chat Completions API for new applications.

For Kong Gateway version v3.11+:

Batch, assistants, and files APIs: Support parallel LLM calls for efficiency. Assistants enable stateful, tool-augmented agents. Files provide persistent document storage for richer context across sessions.
Audio capabilities APIs: Provide speech-to-text transcription, real-time translation, and text-to-speech synthesis for voice agents, multilingual interfaces, and meeting analysis.
Image generation and editing APIs: Generate and modify images from text prompts to support multimodal agents with visual input and output.
Responses API: Return response metadata for debugging, evaluation, and response tuning.
AWS Bedrock agent APIs: Support advanced orchestration and real-time RAG with Converse, ConverseStream, RetrieveAndGenerate, and RetrieveAndGenerateStream.
Hugging Face text generation: Enable text generation and streaming using open-source Hugging Face models.
Embeddings API: Provide unified text-to-vector embedding generation with multi-vendor support and analytics.
Realtime streaming: Stream completions token-by-token for low-latency, interactive experiences and live analytics.

The following reference tables detail feature availability across supported LLM providers when used with the AI Proxy Advanced plugin.

Core text generation

Support for chat, completions, and embeddings.

Provider	Chat Completions	Chat Completions streaming	Embeddings
OpenAI (GPT-3.5, GPT-4, GPT-4o, and Multi-Modal)
Cohere
Azure
Anthropic
Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats)
Llama2 (supports Llama2 and Llama3 models and raw, OLLAMA, and OpenAI formats)
Amazon Bedrock
Gemini
Hugging Face

The following providers are supported by the legacy Completions API:

OpenAI
Azure OpenAI
Cohere
Llama2
Amazon Bedrock
Gemini
Hugging Face

Advanced text generation v3.11+

Support for function calling, tool use, and batch processing.

Provider	Files	Batches	Assistants	Responses
OpenAI (GPT-3.5, GPT-4, GPT-4o, and Multi-Modal)
Cohere
Azure
Anthropic
Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats)
Llama2 (supports Llama2 and Llama3 models and raw, OLLAMA, and OpenAI formats)
Amazon Bedrock
Gemini
Hugging Face

Audio features v3.11+

Support for text-to-speech, transcription, and translation.

Provider	Audio: Speech	Audio: Transcriptions	Audio: Translations
OpenAI (GPT-3.5, GPT-4, GPT-4o, and Multi-Modal)
Cohere
Azure
Anthropic
Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats)
Llama2 (supports Llama2 and Llama3 models and raw, OLLAMA, and OpenAI formats)
Amazon Bedrock
Gemini
Hugging Face

Image and realtime features v3.11+

Support for image generation, image editing, and realtime interaction.

Provider	Image: Generations	Image: Edits	Realtime
OpenAI (GPT-3.5, GPT-4, GPT-4o, and Multi-Modal)
Cohere
Azure
Anthropic
Mistral (mistral.ai, OpenAI, raw, and OLLAMA formats)
Llama2 (supports Llama2 and Llama3 models and raw, OLLAMA, and OpenAI formats)
Amazon Bedrock
Gemini
Hugging Face

How it works

The AI Proxy Advanced plugin will mediate the following for you:

Request and response formats appropriate for the configured config.targets.model.provider and config.targets.route_type
The following service request coordinates (unless the model is self-hosted):
- Protocol
- Host name
- Port
- Path
- HTTP method
Authentication on behalf of the Kong API consumer
Decorating the request with parameters from the config.targets.model.options block, appropriate for the chosen provider
Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
Optionally, recording all post-transformation request and response messages exchanged between users and the configured LLM
Fulfillment of requests to self-hosted models, based on select supported format transformations

Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to provide a choice of LLMs to the Kong Gateway Consumers, using consistent request and response formats, regardless of the backend provider or model.

v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.
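
As a rough illustration of the CRUD pattern on llm/v1/responses, the following Python sketch builds (but does not send) the three requests. The proxy address `http://localhost:8000`, the `/responses` Route path, and the ID `resp_123` are assumptions made for this example; substitute the path of the Route that the plugin is attached to.

```python
import json
import urllib.request

# Assumption: the Kong proxy listens here and a Route at /responses has the
# plugin enabled with route_type llm/v1/responses. Both are hypothetical.
BASE_URL = "http://localhost:8000/responses"


def create_response(prompt: str) -> urllib.request.Request:
    """Build a POST request that creates a response."""
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_response(response_id: str) -> urllib.request.Request:
    """Build a GET request that retrieves an existing response."""
    return urllib.request.Request(f"{BASE_URL}/{response_id}", method="GET")


def delete_response(response_id: str) -> urllib.request.Request:
    """Build a DELETE request that removes an existing response."""
    return urllib.request.Request(f"{BASE_URL}/{response_id}", method="DELETE")


# The requests are only constructed here; pass one to
# urllib.request.urlopen() to send it through the gateway.
for req in (
    create_response("Tell me a three sentence bedtime story about a unicorn."),
    get_response("resp_123"),
    delete_response("resp_123"),
):
    print(req.get_method(), req.full_url)
```

Building the requests separately from sending them keeps the sketch runnable without a gateway and makes the method and URL of each operation easy to inspect.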

Request and response formats

Set the plugin’s route_type according to the target upstream endpoint and model, using the capability matrix in the tables that follow.

The following requirements are enforced by upstream providers:

For Azure Responses API, set config.azure_api_version to "preview".

For OpenAI and Azure Realtime APIs, include the header OpenAI-Beta: realtime=v1. Only WebSocket is supported; WebRTC is not.

For OpenAI and Azure Assistant APIs, include the header OpenAI-Beta: assistants=v2.

For requests with large payloads (for example, image edits or audio transcription/translation), consider increasing config.max_request_body_size to three times the raw binary size.
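
The sizing guidance for large payloads is simple arithmetic. The helper below is a hypothetical convenience, not part of the plugin, and assumes that config.max_request_body_size is expressed in bytes.

```python
def suggested_max_request_body_size(raw_bytes: int, factor: int = 3) -> int:
    """Return a candidate value for config.max_request_body_size.

    The default factor of 3 follows the guidance above; treat the result
    as a starting point, not a guarantee.
    """
    if raw_bytes < 0:
        raise ValueError("raw_bytes must be non-negative")
    return raw_bytes * factor


# Example: a 5 MiB audio file sent for transcription.
audio_size = 5 * 1024 * 1024
print(suggested_max_request_body_size(audio_size))
```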

Provider path	Kong Gateway route type	Example model name	Min version
/v1/chat/completions	llm/v1/chat	gpt-4	3.6
/v1/completions	llm/v1/completions	gpt-3.5-turbo-instruct	3.6
/v1/embeddings	llm/v1/embeddings	text-embedding-ada-002	3.11
/v1/files	llm/v1/files	n/a	3.11
/v1/batches	llm/v1/batches	n/a	3.11
/v1/assistants	llm/v1/assistants	gpt-4-1106-preview	3.11
/v1/responses	llm/v1/responses	gpt-4-1106-preview	3.11
/v1/realtime	realtime/v1/realtime	gpt-4o	3.11
/v1/audio/speech	audio/v1/audio/speech	tts-1	3.11
/v1/audio/transcriptions	audio/v1/audio/transcriptions	whisper-1	3.11
/v1/audio/translations	audio/v1/audio/translations	whisper-1	3.11
/v1/images/generations	image/v1/images/generations	dall-e-3	3.11
/v1/images/edits	image/v1/images/edits	dall-e-2	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/openai/deployments/{deployment_name}/chat/completions	llm/v1/chat	gpt-4	3.6
/openai/deployments/{deployment_name}/completions	llm/v1/completions	gpt-3.5-turbo-instruct	3.6
/openai/deployments/{deployment_name}/embeddings	llm/v1/embeddings	text-embedding-ada-002	3.11
/openai/files	llm/v1/files	n/a	3.11
/openai/batches	llm/v1/batches	n/a	3.11
/openai/assistants	llm/v1/assistants	n/a	3.11
/openai/v1/responses	llm/v1/responses	n/a	3.11
/openai/realtime	realtime/v1/realtime	n/a	3.11
/openai/audio/speech	audio/v1/audio/speech	n/a	3.11
/openai/audio/transcriptions	audio/v1/audio/transcriptions	n/a	3.11
/openai/audio/translations	audio/v1/audio/translations	n/a	3.11
/openai/images/generations	image/v1/images/generations	n/a	3.11
/openai/images/edits	image/v1/images/edits	n/a	3.11

Provider path	Kong Gateway route type	Example model name	Min version
User-defined	llm/v1/chat	User-defined	3.6
User-defined	llm/v1/completions	User-defined	3.6
User-defined	llm/v1/embeddings	User-defined	3.11

Provider path ¹	Kong Gateway route type	Example model name	Min version
Use the LLM `chat` upstream path	llm/v1/chat	Use the model name for the specific LLM provider	3.8
Use the LLM `completions` upstream path	llm/v1/completions	Use the model name for the specific LLM provider	3.8
Use the LLM `embeddings` upstream path	llm/v1/embeddings	Use the model name for the specific LLM provider	3.11
Use the LLM `image/generations` upstream path	image/v1/images/generations	Use the model name for the specific LLM provider	3.11
Use the LLM `image/edits` upstream path	image/v1/images/edits	Use the model name for the specific LLM provider	3.11

¹ Supported only when llm_format is set to bedrock

Provider path	Kong Gateway route type	Example model name	Min version
User-defined	llm/v1/chat	User-defined	3.6
User-defined	llm/v1/completions	User-defined	3.6
User-defined	llm/v1/embeddings	User-defined	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/v1/messages	llm/v1/chat	claude-3-opus-20240229	3.6
/v1/complete	llm/v1/completions	claude-2.1	3.6

Provider path	Kong Gateway route type	Example model name	Min version
/v1/chat	llm/v1/chat	command	3.6
/v1/generate	llm/v1/completions	command	3.6
/v2/embed	llm/v1/embeddings	embed-english-v3.0	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/models/{model_provider}/{model_name}	llm/v1/chat	Use the model name for the specific LLM provider	3.9
/models/{model_provider}/{model_name}	llm/v1/completions	Use the model name for the specific LLM provider	3.9
/models/{model_provider}/{model_name}	llm/v1/embeddings	Use the embedding model name	3.11

To use the realtime/v1/realtime route, configure the ws and/or wss protocols on both the Service and the Route that the plugin is associated with.

The following upstream URL patterns are used:

Provider	URL
OpenAI	https://api.openai.com:443/{route_type_path}
Cohere	https://api.cohere.com:443/{route_type_path}
Azure	https://{azure_instance}.openai.azure.com:443/openai/deployments/{deployment_name}/{route_type_path}
Anthropic	https://api.anthropic.com:443/{route_type_path}
Mistral	As defined in `config.targets.model.options.upstream_url`
Llama2	As defined in `config.targets.model.options.upstream_url`
Amazon Bedrock	https://bedrock-runtime.{region}.amazonaws.com
Gemini	https://generativelanguage.googleapis.com
Hugging Face	https://api-inference.huggingface.co
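
The URL patterns in the table can be read as templates. The sketch below resolves them in Python; the dictionary keys, the `upstream_url` function, and its parameters are names invented for this illustration, and the override branch mirrors the behavior of config.targets.model.options.upstream_url only in spirit.

```python
from typing import Optional

# Templates transcribed from the table above. Mistral and Llama2 have no
# fixed host, so they are absent here and require an explicit override.
URL_TEMPLATES = {
    "openai": "https://api.openai.com:443/{route_type_path}",
    "cohere": "https://api.cohere.com:443/{route_type_path}",
    "azure": (
        "https://{azure_instance}.openai.azure.com:443"
        "/openai/deployments/{deployment_name}/{route_type_path}"
    ),
    "anthropic": "https://api.anthropic.com:443/{route_type_path}",
    "bedrock": "https://bedrock-runtime.{region}.amazonaws.com",
    "gemini": "https://generativelanguage.googleapis.com",
    "huggingface": "https://api-inference.huggingface.co",
}


def upstream_url(
    provider: str, upstream_override: Optional[str] = None, **parts: str
) -> str:
    """Resolve the URL a request would be sent to (illustrative only)."""
    if upstream_override:
        # An explicit upstream URL takes precedence for any provider.
        return upstream_override
    if provider not in URL_TEMPLATES:
        raise ValueError(f"{provider} needs an explicit upstream URL")
    return URL_TEMPLATES[provider].format(**parts)


print(upstream_url("openai", route_type_path="v1/chat/completions"))
print(upstream_url("llama2", upstream_override="http://llama.internal:11434/api/chat"))
```

The host `llama.internal` is a placeholder for a self-hosted endpoint.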

While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers. For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.targets.model.options.upstream_url plugin option.

v3.10+ If you are using each provider’s native SDK, Kong Gateway allows you to transparently proxy the request without any transformation and return the response unmodified. This can be done by setting config.llm_format to a value other than openai, such as gemini or bedrock. See the section below for more details.

In this mode, Kong Gateway will still provide useful analytics, logging, and cost calculation.

Input formats

Kong Gateway mediates the request and response format based on the selected config.targets.model.provider and config.targets.route_type.

v3.10+ By default, Kong Gateway uses the OpenAI format, but you can customize this using config.llm_format. If llm_format is not set to openai, the plugin will not transform the request when sending it upstream and will leave it as-is.

The Kong Gateway AI Proxy accepts the following input formats, standardized across all providers. Set config.targets.route_type to match the request and response format you need, as shown in the examples.

Text generation inputs

The following examples show standardized text-based request formats for each supported llm/v1/* route. These formats are normalized across providers to help simplify downstream parsing and integration.

llm/v1/chat:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a scientist."
        },
        {
            "role": "user",
            "content": "What is the theory of relativity?"
        }
    ]
}
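
A client might assemble the llm/v1/chat request body shown above with a small helper. This is a minimal sketch; the function name `build_chat_payload` is hypothetical.

```python
import json


def build_chat_payload(system_prompt: str, user_prompt: str) -> dict:
    """Assemble an llm/v1/chat request body in the standardized format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    }


payload = build_chat_payload(
    "You are a scientist.", "What is the theory of relativity?"
)
print(json.dumps(payload, indent=4))
```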

v3.9+ With Amazon Bedrock, you can include your guardrail configuration in the request:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a scientist."
        },
        {
            "role": "user",
            "content": "What is the theory of relativity?"
        }
    ],
    "guardrailConfig": {
        "guardrailIdentifier": "<guardrail_identifier>",
        "guardrailVersion": "1",
        "trace": "enabled"
    }
}

llm/v1/completions:

{
    "prompt": "You are a scientist. What is the theory of relativity?"
}

llm/v1/embeddings (v3.11+):

{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
}

llm/v1/files (v3.11+):

curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="fine-tune" \
  -F file="@mydata.jsonl"

llm/v1/assistants (v3.11+):

{
    "instructions": "You are a personal math tutor. When asked a question, write and run Python code to answer the question.",
    "name": "Math Tutor",
    "tools": [{"type": "code_interpreter"}],
    "model": "gpt-4o"
}

llm/v1/batches (v3.11+):

{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
}

llm/v1/responses (v3.11+):

This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.

{
    "input": "Tell me a three sentence bedtime story about a unicorn."
}

Audio and image generation inputs

The following examples show standardized audio and image request formats for each supported route. These formats are normalized across providers to help simplify downstream parsing and integration.

audio/v1/audio/speech (v3.11+):

curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3

audio/v1/audio/transcriptions (v3.11+):

curl http://localhost:8000/ \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="gpt-4o-transcribe"

audio/v1/audio/translations (v3.11+):

curl http://localhost:8000/ \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/german.m4a" \
  -F model="whisper-1"

image/v1/images/generations (v3.11+):

{
  "prompt": "A cute baby sea otter",
  "n": 1,
  "size": "1024x1024"
}

image/v1/images/edits (v3.11+):

curl -s -D >(grep -i x-request-id >&2) \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
  -X POST "https://api.openai.com/v1/images/edits" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "model=gpt-image-1" \
  -F "image[]=@body-lotion.png" \
  -F "image[]=@bath-bomb.png" \
  -F "image[]=@incense-kit.png" \
  -F "image[]=@soap.png" \
  -F 'prompt=Create a lovely gift basket with these four items in it'

realtime/v1/realtime (v3.11+):

To use the realtime route, you must configure the protocols ws and/or wss on both the Service and on the Route where the plugin is associated.

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain how rainbows form." }
  ]
}

Response formats

Conversely, the response formats are also transformed to a standard format across all providers:

Text-based responses

llm/v1/chat:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The theory of relativity is a...",
                "role": "assistant"
            }
        }
    ],
    "created": 1707769597,
    "id": "chatcmpl-ID",
    "model": "gpt-4-0613",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 5,
        "prompt_tokens": 26,
        "total_tokens": 31
    }
}
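
Because the chat response has the same shape for every provider, a client can read it with one code path. The sketch below parses the sample body shown above; `summarize` is a hypothetical helper name.

```python
import json

# Sample body, copied from the chat response example above.
SAMPLE = """
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The theory of relativity is a...",
                "role": "assistant"
            }
        }
    ],
    "created": 1707769597,
    "id": "chatcmpl-ID",
    "model": "gpt-4-0613",
    "object": "chat.completion",
    "usage": {"completion_tokens": 5, "prompt_tokens": 26, "total_tokens": 31}
}
"""


def summarize(body: dict) -> dict:
    """Pull out the fields most clients need from a chat response."""
    first = body["choices"][0]
    return {
        "text": first["message"]["content"],
        "finish_reason": first["finish_reason"],
        "total_tokens": body["usage"]["total_tokens"],
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```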

llm/v1/completions:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "text": "The theory of relativity is a..."
        }
    ],
    "created": 1707769597,
    "id": "cmpl-ID",
    "model": "gpt-3.5-turbo-instruct",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 10,
        "prompt_tokens": 7,
        "total_tokens": 17
    }
}

llm/v1/embeddings (v3.11+):

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        .... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

llm/v1/files (v3.11+):

{
  "id": "file-abc123",
  "object": "file",
  "bytes": 120000,
  "created_at": 1677610602,
  "filename": "mydata.jsonl",
  "purpose": "fine-tune"
}

llm/v1/batches (v3.11+):

{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
}

llm/v1/assistants (v3.11+):

{
  "id": "asst_abc123",
  "object": "assistant",
  "created_at": 1698984975,
  "name": "Math Tutor",
  "description": null,
  "model": "gpt-4o",
  "instructions": "You are a personal math tutor. When asked a question, write and run Python code to answer the question.",
  "tools": [
    {
      "type": "code_interpreter"
    }
  ],
  "metadata": {},
  "top_p": 1.0,
  "temperature": 1.0,
  "response_format": "auto"
}

llm/v1/responses (v3.11+):

{
  "id": "resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b",
  "object": "response",
  "created_at": 1741476542,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "gpt-4.1-2025-04-14",
  "output": [
    {
      "type": "message",
      "id": "msg_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "In a peaceful grove beneath a silver moon, a unicorn named Lumina discovered a hidden pool that reflected the stars. As she dipped her horn into the water, the pool began to shimmer, revealing a pathway to a magical realm of endless night skies. Filled with wonder, Lumina whispered a wish for all who dream to find their own hidden magic, and as she glanced back, her hoofprints sparkled like stardust.",
          "annotations": []
        }
      ]
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": null,
    "summary": null
  },
  "store": true,
  "temperature": 1.0,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 36,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 87,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 123
  },
  "user": null,
  "metadata": {}
}
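
The generated text in an llm/v1/responses body sits several levels deep, under output[].content[].text. The following sketch shows one way to collect it; `output_text` is a hypothetical helper, and the sample is a trimmed copy of the response above with shortened text.

```python
def output_text(response: dict) -> str:
    """Concatenate every output_text fragment in a response object."""
    chunks = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message output items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                chunks.append(part["text"])
    return "".join(chunks)


# Trimmed sample; the text is shortened for readability.
sample = {
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {
                    "type": "output_text",
                    "text": "In a peaceful grove beneath a silver moon...",
                    "annotations": [],
                }
            ],
        }
    ],
    "usage": {"input_tokens": 36, "output_tokens": 87, "total_tokens": 123},
}

print(output_text(sample))
```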

Image and audio responses

The following examples show standardized response formats returned by supported audio/ and image/ routes. These formats are normalized across providers to support consistent multimodal output parsing.

audio/v1/audio/speech (v3.11+):

The response contains the audio file content of speech.mp3.

audio/v1/audio/transcriptions (v3.11+):

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100 or a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}

audio/v1/audio/translations (v3.11+):

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

image/v1/images/generations (v3.11+):

{
  "created": 1713833628,
  "data": [
    {
      "b64_json": "..."
    }
  ],
  "usage": {
    "total_tokens": 100,
    "input_tokens": 50,
    "output_tokens": 50,
    "input_tokens_details": {
      "text_tokens": 10,
      "image_tokens": 40
    }
  }
}

image/v1/images/edits (v3.11+):

{
  "created": 1713833628,
  "data": [
    {
      "b64_json": "..."
    }
  ],
  "usage": {
    "total_tokens": 100,
    "input_tokens": 50,
    "output_tokens": 50,
    "input_tokens_details": {
      "text_tokens": 10,
      "image_tokens": 40
    }
  }
}
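
Both image routes return the image as base64 text in data[].b64_json, so the client has to decode it before writing a file. The sketch below does in Python what the `base64 --decode` step of the image edits curl example does; `decode_images` is a hypothetical helper and the payload is a stand-in, not real image data.

```python
import base64
from typing import List


def decode_images(body: dict) -> List[bytes]:
    """Return the raw image bytes for every entry in data[]."""
    return [base64.b64decode(item["b64_json"]) for item in body.get("data", [])]


# Stand-in payload: a real response carries a full base64-encoded image.
fake_png = b"\x89PNG\r\n\x1a\n"  # PNG signature only, for demonstration
body = {
    "created": 1713833628,
    "data": [{"b64_json": base64.b64encode(fake_png).decode("ascii")}],
}

images = decode_images(body)
print(len(images), "image(s) decoded")

# To save the first image:
#     with open("gift-basket.png", "wb") as fh:
#         fh.write(images[0])
```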

realtime/v1/realtime (v3.11+):

{ "type": "message_fragment", "content": "Rainbows form when light is refracted, reflected, and dispersed in water droplets." }

The request and response formats are loosely modeled after OpenAI’s API. For detailed format specifications, see the sample OpenAPI specification.

Supported native LLM formats

When config.llm_format is set to gemini, only the Gemini provider is supported. The following Gemini APIs are available:

/generateContent
/streamGenerateContent

When llm_format is set to bedrock, only the Bedrock provider is supported. Supported Bedrock APIs include:

/converse
/converse-stream
/retrieveAndGenerate
/retrieveAndGenerateStream
/rerank

When config.llm_format is set to cohere, only the Cohere provider is supported. Available Cohere APIs are:

/v1/rerank
/v2/rerank

When config.llm_format is set to huggingface, only the Hugging Face provider is supported. The following Hugging Face APIs are supported:

/generate
/generate_stream
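
The native-format rules above amount to a lookup table: each non-OpenAI value of config.llm_format pairs with exactly one provider and a fixed set of APIs. The sketch below encodes that table; the provider identifiers are assumed to match config.targets.model.provider, and `is_native_call_supported` is a hypothetical helper.

```python
# Native formats and their APIs, transcribed from the lists above.
# Assumption: the "provider" strings match config.targets.model.provider.
NATIVE_FORMATS = {
    "gemini": {
        "provider": "gemini",
        "apis": {"/generateContent", "/streamGenerateContent"},
    },
    "bedrock": {
        "provider": "bedrock",
        "apis": {
            "/converse",
            "/converse-stream",
            "/retrieveAndGenerate",
            "/retrieveAndGenerateStream",
            "/rerank",
        },
    },
    "cohere": {"provider": "cohere", "apis": {"/v1/rerank", "/v2/rerank"}},
    "huggingface": {
        "provider": "huggingface",
        "apis": {"/generate", "/generate_stream"},
    },
}


def is_native_call_supported(llm_format: str, provider: str, api: str) -> bool:
    """Check a (format, provider, API) combination against the table."""
    entry = NATIVE_FORMATS.get(llm_format)
    if entry is None:
        return False  # not a native format covered by this table
    return provider == entry["provider"] and api in entry["apis"]


print(is_native_call_supported("bedrock", "bedrock", "/converse"))
print(is_native_call_supported("gemini", "bedrock", "/converse"))
```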

Caveats and limitations

The following sections detail the provider and statistic logging limitations.

Provider-specific limitations

Anthropic: Does not support llm/v1/completions or llm/v1/embeddings.
Llama2: Raw format lacks support for llm/v1/embeddings.
Bedrock and Gemini: Only support auth.allow_override = false.

Statistics logging limitations

Anthropic: No statistics logging for llm/v1/completions.
OpenAI and Azure: No statistics logging for assistants, batch, or audio APIs.
Bedrock: No statistics logging for image generation or editing APIs.

Load balancing

This plugin supports several load-balancing algorithms, similar to those used for Kong upstreams, allowing efficient distribution of requests across different AI models. The supported algorithms include:

Algorithm	Description
Consistent-hashing (sticky-session on given header value)	The consistent-hashing algorithm routes requests based on a specified header value (`X-Hashing-Header`). Requests with the same header are repeatedly routed to the same model, enabling sticky sessions for maintaining context or affinity across user interactions.
Lowest-latency	The lowest-latency algorithm is based on the response time for each model. It distributes requests to models with the lowest response time.
Lowest-usage	The lowest-usage algorithm in AI Proxy Advanced is based on the volume of usage for each model. It balances the load by distributing requests to models with the lowest usage, measured by factors such as prompt token counts, response token counts, cost (v3.10+), or other resource metrics.
Priority group v3.10+	The priority algorithm routes requests to groups of models based on assigned weights. Higher-weighted groups are preferred, and if all models in a group fail, the plugin falls back to the next group. This allows for reliable failover and cost-aware routing across multiple AI models.
Round-robin (weighted)	The round-robin algorithm distributes requests across models based on their respective weights. For example, if your models `gpt-4`, `gpt-4o-mini`, and `gpt-3` have weights of `70`, `25`, and `5` respectively, they’ll receive approximately 70%, 25%, and 5% of the traffic in turn. Requests are distributed proportionally, independent of usage or latency metrics.
Semantic	The semantic algorithm distributes requests to different models based on the similarity between the prompt in the request and the description provided in the model configuration. This allows Kong to automatically select the model that is best suited for the given domain or use case. This feature enhances the flexibility and efficiency of model selection, especially when dealing with a diverse range of AI providers and models.
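
The weighted round-robin row can be illustrated with a smooth weighted round-robin, a common textbook way to interleave targets in proportion to their weights. This is a swapped-in technique for illustration and not necessarily how the plugin schedules requests; the model names and weights are the ones used in the table.

```python
from collections import Counter
from typing import Dict, List


def smooth_weighted_round_robin(weights: Dict[str, int], n: int) -> List[str]:
    """Pick n targets so that each is chosen in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every target earns credit equal to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The target with the most credit is selected.
        chosen = max(current, key=current.get)
        # It pays back the total so the other targets get their turn.
        current[chosen] -= total
        picks.append(chosen)
    return picks


weights = {"gpt-4": 70, "gpt-4o-mini": 25, "gpt-3": 5}
distribution = Counter(smooth_weighted_round_robin(weights, 100))
print(distribution)
```

Over any window whose length equals the sum of the weights, each target is picked exactly as many times as its weight, which matches the 70%, 25%, and 5% split described in the table.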

Retry and fallback

The load balancer has customizable retries and timeouts for requests, and can redirect a request to a different model in case of failure. This allows you to have a fallback in case one of your targets is unavailable.

In v3.10+, this plugin supports fallback across targets with any supported format. In versions earlier than 3.10, fallback is not supported across targets with different formats: you can still use multiple providers, but only if their formats are compatible. For example, load balancers with the following target combinations are supported:

Different OpenAI models
OpenAI models and Mistral models with the OpenAI format
Mistral models with the OLLAMA format and Llama models with the OLLAMA format

Some errors, such as client errors, result in a failure and don’t fail over to another target.

v3.10+ To fail over on conditions other than network errors, set config.balancer.failover_criteria to include:

Additional HTTP error codes, like http_429 or http_502
The non_idempotent setting, because most AI services are called with POST requests, which are not idempotent
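
The interaction between these criteria can be sketched as a small decision function. The logic below is an assumption modeled on how proxies commonly treat retries (non-idempotent methods are retried only when explicitly allowed); it is not the plugin's actual implementation, and `should_fail_over` is a hypothetical name.

```python
from typing import Iterable, Optional

IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def should_fail_over(
    method: str, status: Optional[int], criteria: Iterable[str]
) -> bool:
    """Decide whether a failed attempt may be retried on another target.

    status is None when the attempt failed at the network level.
    """
    allowed = set(criteria)
    # Assumption: non-idempotent requests (such as POST) are retried only
    # when non_idempotent is listed in the criteria.
    if method.upper() not in IDEMPOTENT_METHODS and "non_idempotent" not in allowed:
        return False
    if status is None:
        # Network errors are treated as eligible by default.
        return True
    # HTTP errors are eligible only when their code is listed, e.g. http_429.
    return f"http_{status}" in allowed


criteria = ["http_429", "http_502", "non_idempotent"]
print(should_fail_over("POST", 429, criteria))
print(should_fail_over("POST", 429, ["http_429"]))
```

With this reading, a rate-limited POST moves to the next target only when both http_429 and non_idempotent are configured.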

Request and response formats

The plugin’s route_type should be set based on the target upstream endpoint and model, based on this capability matrix:

The following requirements are enforced by upstream providers:

For Azure Responses API, set config.azure_api_version to "preview".

For OpenAI and Azure Realtime APIs, include the header OpenAI-Beta: realtime=v1.

Only WebSocket is supported—WebRTC is not supported.

For OpenAI and Azure Assistant APIs, include the header OpenAI-Beta: assistants=v2.

For requests with large payloads (e.g., image edits, audio transcription/translation), consider increasing config.max_request_body_size to three times the raw binary size.

Provider path	Kong Gateway route type	Example model name	Min version
/v1/chat/completions	llm/v1/chat	gpt-4	3.6
/v1/completions	llm/v1/completions	gpt-3.5-turbo-instruct	3.6
/v1/embeddings	llm/v1/embeddings	text-embedding-ada-002	3.11
/v1/files	llm/v1/files	n/a	3.11
/v1/batches	llm/v1/batches	n/a	3.11
/v1/assistants	llm/v1/assistants	gpt-4-1106-preview	3.11
/v1/responses	llm/v1/responses	gpt-4-1106-preview	3.11
/v1/realtime	realtime/v1/realtime	gpt-4o	3.11
/v1/audio/speech	audio/v1/audio/speech	tts-1	3.11
/v1/audio/transcriptions	audio/v1/audio/transcriptions	whisper-1	3.11
/v1/audio/translations	audio/v1/audio/translations	whisper-1	3.11
/v1/images/generations	image/v1/images/generations	dall-e-3	3.11
/v1/images/edits	image/v1/images/edits	dall-e-2	3.11
/v1/realtime	realtime/v1/realtime	gpt-4o	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/openai/deployments/{deployment_name}/chat/completions	llm/v1/chat	gpt-4	3.6
/openai/deployments/{deployment_name}/completions	llm/v1/completions	gpt-3.5-turbo-instruct	3.6
/openai/deployments/{deployment_name}/embeddings	llm/v1/embeddings	text-embedding-ada-002	3.11
/openai/files	llm/v1/files	n/a	3.11
/openai/batches	llm/v1/batches	n/a	3.11
/openai/assistants	llm/v1/assistants	n/a	3.11
/openai/v1/responses	llm/v1/responses	n/a	3.11
/openai/realtime	realtime/v1/realtime	n/a	3.11
/openai/audio/speech	audio/v1/audio/speech	n/a	3.11
/openai/audio/transcriptions	audio/v1/audio/transcriptions	n/a	3.11
/openai/audio/translations	audio/v1/audio/translations	n/a	3.11
/openai/images/generations	image/v1/images/generations	n/a	3.11
/openai/images/edits	image/v1/images/edits	n/a	3.11
/openai/realtime	realtime/v1/realtime	n/a	3.11

Provider path	Kong Gateway route type	Example model name	Min version
User-defined	llm/v1/chat	User-defined	3.6
User-defined	llm/v1/completions	User-defined	3.6
User-defined	llm/v1/embeddings	User-defined	3.11

Provider path ¹	Kong Gateway route type	Example model name	Min version
Use the LLM `chat` upstream path	llm/v1/chat	Use the model name for the specific LLM provider	3.8
Use the LLM `completions` upstream path	llm/v1/completions	Use the model name for the specific LLM provider	3.8
Use the LLM `embeddings` upstream path	llm/v1/embeddings	Use the model name for the specific LLM provider	3.11
Use the LLM `image/generations` upstream path	image/v1/images/generations	Use the model name for the specific LLM provider	3.11
Use the LLM `image/edits` upstream path	image/v1/images/edits	Use the model name for the specific LLM provider	3.11

¹ Supported only when llm_format is set to bedrock

Provider path	Kong Gateway route type	Example model name	Min version
User-defined	llm/v1/chat	User-defined	3.6
User-defined	llm/v1/completions	User-defined	3.6
User-defined	llm/v1/embeddings	User-defined	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/v1/messages	llm/v1/chat	claude-3-opus-20240229	3.6
/v1/complete	llm/v1/completions	claude-2.1	3.6

Provider path	Kong Gateway route type	Example model name	Min version
/v1/chat	llm/v1/chat	command	3.6
/v1/generate	llm/v1/completions	command	3.6
/v2/embed	llm/v1/embeddings	embed-english-v3.0	3.11

Provider path	Kong Gateway route type	Example model name	Min version
/models/{model_provider}/{model_name}	llm/v1/chat	Use the model name for the specific LLM provider	3.9
/models/{model_provider}/{model_name}	llm/v1/completions	Use the model name for the specific LLM provider	3.9
/models/{model_provider}/{model_name}	llm/v1/embeddings	Use the embedding model name	3.11

To use the realtime/v1/realtime route, users must configure the protocols to ws and/or wss on both the service and on the route where the plugin is associated.

The following upstream URL patterns are used:

Provider	URL
OpenAI	https://api.openai.com:443/{route_type_path}
Cohere	https://api.cohere.com:443/{route_type_path}
Azure	https://{azure_instance}.openai.azure.com:443/openai/deployments/{deployment_name}/{route_type_path}
Anthropic	https://api.anthropic.com:443/{route_type_path}
Mistral	As defined in `config.targets.model.options.upstream_url`
Llama2	As defined in `config.targets.model.options.upstream_url`
Amazon Bedrock	https://bedrock-runtime.{region}.amazonaws.com
Gemini	https://generativelanguage.googleapis.com
Hugging Face	https://api-inference.huggingface.co

While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers. For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.targets.model.options.upstream_url plugin option.

v3.10+ If you are using each provider’s native SDK, Kong Gateway allows you to transparently proxy the request without any transformation and return the response unmodified. This can be done by setting config.llm_format to a value other than openai, such as gemini or bedrock. See the section below for more details.

In this mode, Kong Gateway will still provide useful analytics, logging, and cost calculation.

Input formats

Kong Gateway mediates the request and response format based on the selected config.targets.model.provider and config.targets.route_type.

v3.10+ By default, Kong Gateway uses the OpenAI format, but you can customize this using config.llm_format. If llm_format is not set to openai, the plugin will not transform the request when sending it upstream and will leave it as-is.

The Kong Gateway AI Proxy accepts the following inputs formats, standardized across all providers. The config.targets.route_type must be configured respective to the required request and response format examples.

Text generation inputs

The following examples show standardized text-based request formats for each supported llm/v1/* route. These formats are normalized across providers to help simplify downstream parsing and integration.

{
    "messages": [
        {
            "role": "system",
            "content": "You are a scientist."
        },
        {
            "role": "user",
            "content": "What is the theory of relativity?"
        }
    ]
}
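
As a rough client-side sketch, the chat payload above could be built and sent with Python's standard library. The helper name `build_chat_request` and the `/chat` route path are assumptions for this example; use whatever Route the plugin is attached to in your deployment.

```python
import json
import urllib.request


def build_chat_request(url: str, system_prompt: str, user_prompt: str) -> urllib.request.Request:
    """Build a POST request carrying the standardized chat payload."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Route path on a local gateway.
request = build_chat_request(
    "http://localhost:8000/chat",
    "You are a scientist.",
    "What is the theory of relativity?",
)

# Sending requires a running gateway, so it is left commented out here:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```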

      
        
      
    
v3.9+ With Amazon Bedrock, you can include your guardrail configuration in the request:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a scientist."
        },
        {
            "role": "user",
            "content": "What is the theory of relativity?"
        }
    ],
      "guardrailConfig": {
              "guardrailIdentifier":"<guardrail_identifier>",
              "guardrailVersion":"1",
              "trace":"enabled"
          }
}
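
A minimal sketch of attaching the guardrail block to an existing chat payload is shown here. The helper name `with_guardrail` is hypothetical; the field names are taken from the example above.

```python
import copy


def with_guardrail(payload: dict, identifier: str, version: str, trace: str = "enabled") -> dict:
    """Return a copy of a chat payload with a guardrailConfig attached."""
    result = copy.deepcopy(payload)  # leave the caller's payload untouched
    result["guardrailConfig"] = {
        "guardrailIdentifier": identifier,
        "guardrailVersion": version,
        "trace": trace,
    }
    return result


chat = {"messages": [{"role": "user", "content": "What is the theory of relativity?"}]}
guarded = with_guardrail(chat, "<guardrail_identifier>", "1")
```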

      
        
      
    
{
    "prompt": "You are a scientist. What is the theory of relativity?"
}

Supported in: v3.11+

  {
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
  }

Supported in: v3.11+

curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="fine-tune" \
  -F file="@mydata.jsonl"

Supported in: v3.11+

{
    "instructions": "You are a personal math tutor. When asked a question, write and run Python code to answer the question.",
    "name": "Math Tutor",
    "tools": [{"type": "code_interpreter"}],
    "model": "gpt-4o"
}

Supported in: v3.11+

{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
}

Supported in: v3.11+

This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.

  {
    "input": "Tell me a three sentence bedtime story about a unicorn."
  }

Audio and image generation inputs

The following examples show standardized audio and image request formats for each supported route. These formats are normalized across providers to help simplify downstream parsing and integration.

Supported in: v3.11+

curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Supported in: v3.11+

curl http://localhost:8000/ \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="gpt-4o-transcribe"

Supported in: v3.11+

curl http://localhost:8000/ \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/german.m4a" \
  -F model="whisper-1"

Supported in: v3.11+

{
  "prompt": "A cute baby sea otter",
  "n": 1,
  "size": "1024x1024"
}

Supported in: v3.11+

curl -s -D >(grep -i x-request-id >&2) \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > gift-basket.png) \
  -X POST "https://api.openai.com/v1/images/edits" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "model=gpt-image-1" \
  -F "image[]=@body-lotion.png" \
  -F "image[]=@bath-bomb.png" \
  -F "image[]=@incense-kit.png" \
  -F "image[]=@soap.png" \
  -F 'prompt=Create a lovely gift basket with these four items in it'

Supported in: v3.11+

To use the realtime route, you must configure the protocols ws and/or wss on both the Service and on the Route where the plugin is associated.

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain how rainbows form." }
  ]
}

Response formats

Responses are likewise transformed to a standard format across all providers:

Text-based responses

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The theory of relativity is a...",
                "role": "assistant"
            }
        }
    ],
    "created": 1707769597,
    "id": "chatcmpl-ID",
    "model": "gpt-4-0613",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 5,
        "prompt_tokens": 26,
        "total_tokens": 31
    }
}
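
Because the chat response has the same shape for every provider, a client can read it with one code path. The following is a sketch under that assumption; the helper name `summarize_chat_response` is hypothetical, and the sample body reuses the values from the example above.

```python
import json


def summarize_chat_response(body: str) -> dict:
    """Pull the assistant text, finish reason, and token total from a chat response."""
    data = json.loads(body)
    first_choice = data["choices"][0]
    usage = data.get("usage", {})
    return {
        "text": first_choice["message"]["content"],
        "finish_reason": first_choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


SAMPLE_RESPONSE = """
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "The theory of relativity is a...", "role": "assistant"}
        }
    ],
    "created": 1707769597,
    "id": "chatcmpl-ID",
    "model": "gpt-4-0613",
    "object": "chat.completion",
    "usage": {"completion_tokens": 5, "prompt_tokens": 26, "total_tokens": 31}
}
"""

summary = summarize_chat_response(SAMPLE_RESPONSE)
print(summary["text"])
```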

      
        
      
    
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "text": "The theory of relativity is a..."
        }
    ],
    "created": 1707769597,
    "id": "cmpl-ID",
    "model": "gpt-3.5-turbo-instruct",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 10,
        "prompt_tokens": 7,
        "total_tokens": 17
    }
}

Supported in: v3.11+

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        ... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Supported in: v3.11+

{
  "id": "file-abc123",
  "object": "file",
  "bytes": 120000,
  "created_at": 1677610602,
  "filename": "mydata.jsonl",
  "purpose": "fine-tune"
}

Supported in: v3.11+

{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
}

Supported in: v3.11+

{
  "id": "asst_abc123",
  "object": "assistant",
  "created_at": 1698984975,
  "name": "Math Tutor",
  "description": null,
  "model": "gpt-4o",
  "instructions": "You are a personal math tutor. When asked a question, write and run Python code to answer the question.",
  "tools": [
    {
      "type": "code_interpreter"
    }
  ],
  "metadata": {},
  "top_p": 1.0,
  "temperature": 1.0,
  "response_format": "auto"
}

Supported in: v3.11+

{
  "id": "resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b",
  "object": "response",
  "created_at": 1741476542,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "gpt-4.1-2025-04-14",
  "output": [
    {
      "type": "message",
      "id": "msg_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "In a peaceful grove beneath a silver moon, a unicorn named Lumina discovered a hidden pool that reflected the stars. As she dipped her horn into the water, the pool began to shimmer, revealing a pathway to a magical realm of endless night skies. Filled with wonder, Lumina whispered a wish for all who dream to find their own hidden magic, and as she glanced back, her hoofprints sparkled like stardust.",
          "annotations": []
        }
      ]
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": null,
    "summary": null
  },
  "store": true,
  "temperature": 1.0,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 36,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 87,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 123
  },
  "user": null,
  "metadata": {}
}
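
The generated text in a response like the one above sits inside nested `output` and `content` arrays. A minimal sketch of collecting it follows; the helper name `collect_output_text` is hypothetical, and the sample dictionary is a trimmed version of the example above.

```python
def collect_output_text(response: dict) -> str:
    """Concatenate every output_text chunk found in message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message output items
        for chunk in item.get("content", []):
            if chunk.get("type") == "output_text":
                parts.append(chunk["text"])
    return "".join(parts)


sample = {
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "In a peaceful grove...", "annotations": []}
            ],
        }
    ],
    "usage": {"input_tokens": 36, "output_tokens": 87, "total_tokens": 123},
}

story = collect_output_text(sample)
```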

      
        
      
    
Image and audio responses

The following examples show standardized response formats returned by supported audio/ and image/ routes. These formats are normalized across providers to support consistent multimodal output parsing.

Supported in: v3.11+

The response contains the audio file content of speech.mp3.

Supported in: v3.11+

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100 or a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}

Supported in: v3.11+

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

Supported in: v3.11+

{
  "created": 1713833628,
  "data": [
    {
      "b64_json": "..."
    }
  ],
  "usage": {
    "total_tokens": 100,
    "input_tokens": 50,
    "output_tokens": 50,
    "input_tokens_details": {
      "text_tokens": 10,
      "image_tokens": 40
    }
  }
}

Supported in: v3.11+

{
  "created": 1713833628,
  "data": [
    {
      "b64_json": "..."
    }
  ],
  "usage": {
    "total_tokens": 100,
    "input_tokens": 50,
    "output_tokens": 50,
    "input_tokens_details": {
      "text_tokens": 10,
      "image_tokens": 40
    }
  }
}
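
Both image responses above carry the image as base64 text in `b64_json`. A short sketch of turning those entries into raw bytes follows; the helper name `decode_images` is hypothetical, and the sample payload is a stand-in rather than a real image.

```python
import base64
from typing import List


def decode_images(response: dict) -> List[bytes]:
    """Decode every b64_json entry in an image response into raw bytes."""
    return [
        base64.b64decode(item["b64_json"])
        for item in response.get("data", [])
        if "b64_json" in item
    ]


# Stand-in payload: the bytes below are placeholder data, not a real PNG file.
placeholder = base64.b64encode(b"\x89PNG placeholder").decode("ascii")
sample = {"created": 1713833628, "data": [{"b64_json": placeholder}]}

images = decode_images(sample)

# Writing the first image to disk would look like this:
# with open("output.png", "wb") as handle:
#     handle.write(images[0])
```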

      
        
      
    
{ "type": "message_fragment", "content": "Rainbows form when light is refracted, reflected, and dispersed in water droplets." }

The request and response formats are loosely modeled after OpenAI’s API. For detailed format specifications, see the sample OpenAPI specification.

Supported native LLM formats

When config.llm_format is set to gemini, only the Gemini provider is supported. The following Gemini APIs are available:

/generateContent
/streamGenerateContent

When config.llm_format is set to bedrock, only the Bedrock provider is supported. Supported Bedrock APIs include:

/converse
/converse-stream
/retrieveAndGenerate
/retrieveAndGenerateStream
/rerank

When config.llm_format is set to cohere, only the Cohere provider is supported. Available Cohere APIs are:

/v1/rerank
/v2/rerank

When config.llm_format is set to huggingface, only the Hugging Face provider is supported. The following Hugging Face APIs are supported:

/generate
/generate_stream
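
The pairing rules above can be summarized as a lookup table. This sketch only restates the lists in this section; the function name `is_native_call_supported` and the lowercase provider keys are assumptions for illustration, and the check compares API names exactly as listed rather than full request paths.

```python
# Native formats, the single provider each one supports, and the listed APIs.
NATIVE_FORMATS = {
    "gemini": {
        "provider": "gemini",
        "apis": {"/generateContent", "/streamGenerateContent"},
    },
    "bedrock": {
        "provider": "bedrock",
        "apis": {
            "/converse",
            "/converse-stream",
            "/retrieveAndGenerate",
            "/retrieveAndGenerateStream",
            "/rerank",
        },
    },
    "cohere": {"provider": "cohere", "apis": {"/v1/rerank", "/v2/rerank"}},
    "huggingface": {"provider": "huggingface", "apis": {"/generate", "/generate_stream"}},
}


def is_native_call_supported(llm_format: str, provider: str, api: str) -> bool:
    """Check a (format, provider, API) combination against the lists above."""
    entry = NATIVE_FORMATS.get(llm_format)
    if entry is None:
        return False  # not one of the native formats covered in this section
    return provider == entry["provider"] and api in entry["apis"]


print(is_native_call_supported("bedrock", "bedrock", "/converse"))
```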

Caveats and limitations

The following sections detail the provider and statistic logging limitations.

Provider-specific limitations

Anthropic: Does not support llm/v1/completions or llm/v1/embeddings.
Llama2: Raw format lacks support for llm/v1/embeddings.
Bedrock and Gemini: Only support auth.allow_override = false.

Statistics logging limitations

Anthropic: No statistics logging for llm/v1/completions.
OpenAI and Azure: No statistics logging for assistants, batch, or audio APIs.
Bedrock: No statistics logging for image generation or editing APIs.

Templating v3.7+

The plugin allows you to substitute values in the config.targets[].model.name and any parameter under config.targets[].model.options with specific placeholders, similar to those in the Request Transformer Advanced plugin.

The following templated parameters are available:

$(headers.header_name): The value of a specific request header.
$(uri_captures.path_parameter_name): The value of a captured URI path parameter.
$(query_params.query_parameter_name): The value of a query string parameter.
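
To make the placeholder syntax concrete, here is a rough sketch of how such substitution could work. It is not the plugin's implementation: the function name `render_template`, the case-insensitive header lookup, and the choice to replace a missing value with an empty string are all assumptions made for this example.

```python
import re
from typing import Mapping, Optional

# Matches $(headers.name), $(uri_captures.name), and $(query_params.name).
_PLACEHOLDER = re.compile(r"\$\((headers|uri_captures|query_params)\.([^)]+)\)")


def render_template(
    template: str,
    headers: Optional[Mapping[str, str]] = None,
    uri_captures: Optional[Mapping[str, str]] = None,
    query_params: Optional[Mapping[str, str]] = None,
) -> str:
    """Substitute the three placeholder kinds with values from a request."""
    sources = {
        # Header names are compared case-insensitively in this sketch.
        "headers": {k.lower(): v for k, v in (headers or {}).items()},
        "uri_captures": dict(uri_captures or {}),
        "query_params": dict(query_params or {}),
    }

    def substitute(match: "re.Match[str]") -> str:
        source, key = match.group(1), match.group(2)
        lookup = key.lower() if source == "headers" else key
        # Assumption: a missing value becomes an empty string.
        return str(sources[source].get(lookup, ""))

    return _PLACEHOLDER.sub(substitute, template)


model_name = render_template("$(headers.x-model)", headers={"X-Model": "gpt-4o"})
print(model_name)
```

In a real configuration the template string would be the value of config.targets[].model.name or of an option under config.targets[].model.options.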

You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:

Action	Description
Select different models dynamically on one provider	Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider.
Use one chat route with dynamic Azure OpenAI deployments	Configure a dynamic route to target multiple Azure OpenAI model deployments.
Use multiple routes to map multiple Azure deployments	Use separate Routes to map Azure OpenAI SDK requests to specific deployments of GPT-3.5 and GPT-4.

FAQs

Can I authenticate to Azure AI with Azure Identity?
