This guide walks you through setting up the AI Proxy plugin with streaming.
## What is request streaming
In an LLM (Large Language Model) inference request, Kong Gateway calls the upstream provider's REST API to generate the next chat message for the caller.
Normally, this request is processed and completely buffered by the LLM before being sent back to Kong Gateway, and then to the caller, in a single large JSON block. This process can be time-consuming, depending on the `max_tokens` value, other request parameters, and the complexity of the request sent to the LLM model.
To avoid making the user wait for their chat response with a loading animation, most models can stream each word (or sets of words and tokens) back to the client. This allows the chat response to be rendered in real time.
For example, a client could set up their streaming request using the OpenAI Python SDK like this:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none"
)

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
    stream=True,
)

print('>')
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
The client won’t have to wait for the entire response. Instead, tokens will appear as they come in, for example:
```
$ python3 long-streaming-request.py
> Kong Inc. is a software company providing cloud-native connectivity solutions for APIs and....
```
## How AI Proxy streaming works
In streaming mode, a client can set `"stream": true` in their request, and the LLM server will stream each part of the response text (usually token by token) as a server-sent event.
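To illustrate what such a stream looks like on the wire, here is a minimal sketch of parsing an OpenAI-style server-sent event body into text deltas. The event payloads below are illustrative examples, not captured from a real provider:

```python
import json

# Illustrative raw SSE body, as an OpenAI-compatible provider might stream it.
raw_sse = (
    'data: {"choices": [{"delta": {"content": "Kong"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " Inc."}}]}\n\n'
    'data: [DONE]\n\n'
)

def iter_sse_events(body: str):
    """Yield the decoded JSON payload of each 'data:' event, stopping at [DONE]."""
    for event in body.split("\n\n"):
        if not event.startswith("data:"):
            continue
        payload = event[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Concatenate the content deltas into the full message.
text = "".join(
    event["choices"][0]["delta"].get("content", "")
    for event in iter_sse_events(raw_sse)
)
print(text)  # -> Kong Inc.
```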
Kong Gateway captures each batch of events and translates it into the Kong Gateway inference format. This ensures that all providers are compatible with the same client framework, including OpenAI-compatible SDKs.
In a standard LLM transaction, requests proxied directly to the LLM look like this:
```mermaid
sequenceDiagram
    actor Client
    participant Kong Gateway
    Note right of Kong Gateway: AI Proxy plugin
    Client->>+Kong Gateway: Sends request
    Kong Gateway->>+Cloud LLM: Sends proxy request information
    Cloud LLM->>+Client: Sends chunk to client
```
When streaming is requested, requests proxied directly to the LLM look like this:
```mermaid
flowchart LR
    A(Client)
    B(Kong Gateway with AI Proxy plugin)
    C(Cloud LLM)
    D[[transform frame]]
    E[[read frame]]
    subgraph main
    direction LR
        subgraph 1
        A
        end
        subgraph 3
        C
        end
        subgraph 2
        D
        E
        end
        A --> B --request--> C
        C --response--> B
        B --> D --> E
        E --> B
        B --> A
    end
    linkStyle 2,3,4,5,6 stroke:#b6d7a8,color:#b6d7a8
    style 1 color:#fff,stroke:#fff
    style 2 color:#fff,stroke:#fff
    style 3 color:#fff,stroke:#fff
    style main color:#fff,stroke:#fff
```
The new streaming framework captures each event, sends the chunk back to the client, and then exits early.
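As a conceptual sketch of the read-frame/transform-frame step in the diagram above, the following translates a hypothetical provider chunk into an OpenAI-style delta chunk. The provider chunk shape here is invented for illustration; Kong Gateway's real internal frame format is not shown in this guide:

```python
import json

def transform_frame(provider_event: str) -> str:
    """Translate a hypothetical provider chunk into an OpenAI-style delta chunk.
    Purely illustrative of the transform step, not Kong Gateway's real code."""
    chunk = json.loads(provider_event)
    openai_chunk = {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {"content": chunk["text"]}}],
    }
    # Re-emit the translated chunk as a single server-sent event.
    return "data: " + json.dumps(openai_chunk) + "\n\n"

frame = transform_frame('{"text": "Kong"}')
print(frame, end="")
```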
It also estimates token counts for LLM services that don't stream back token usage when the message completes.
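Kong Gateway's actual estimator isn't documented in this guide. As a rough illustration of how such an estimate can work, a common generic heuristic assumes roughly four characters per token:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token, never below the
    word count. A generic heuristic for illustration only, not Kong Gateway's
    actual algorithm."""
    words = len(text.split())
    by_chars = max(1, round(len(text) / 4))
    return max(words, by_chars)

print(estimate_tokens("Kong Inc. is a software company"))  # -> 8
```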
## Streaming limitations
Keep the following limitations in mind when you configure streaming for the AI Proxy plugin:
- Don't expect multiple AI features to be applied and work simultaneously.
- You can’t use the Response Transformer plugin or any other response phase plugin when streaming is configured.
- The AI Request Transformer plugin will work, but the AI Response Transformer plugin will not. This is because Kong Gateway can't check every single response token against a separate system.
- Streaming currently doesn't work with the HTTP/2 protocol. You must disable HTTP/2 in your `proxy_listen` configuration.
## Configuration
The AI Proxy plugin already supports request streaming; all you have to do is ask Kong Gateway to stream the response tokens back to you.
The following is an example `llm/v1/completions` route-type streaming request:

```json
{
    "prompt": "What is the theory of relativity?",
    "stream": true
}
```
You should receive each batch of tokens as HTTP chunks, each containing one or many server-sent events.
### Response streaming configuration parameters
In the AI Proxy plugin configuration, you can set the optional `config.response_streaming` field to one of three values:
| Value | Effect |
|---|---|
| `allow` | Allows the caller to optionally specify a streaming response in their request (default is not to stream) |
| `deny` | Prevents the caller from setting `stream=true` in their request |
| `always` | Always returns streaming responses, even if the caller hasn't specified it in their request |
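For example, in declarative configuration, setting this field on an `ai-proxy` plugin entry might look like the sketch below. Other required fields (provider, model, and auth settings) are elided, so this is not a complete, working plugin configuration:

```yaml
plugins:
  - name: ai-proxy
    config:
      response_streaming: allow   # or "deny" / "always"
      # ...provider, model, and auth settings go here...
```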