Configure Streaming with AI Proxy - Plugin - unreleased

This guide walks you through setting up the AI Proxy plugin with streaming.

What is request streaming

In an LLM (large language model) inference request, Kong Gateway calls the upstream provider’s REST API to generate the next chat message for the caller. Normally, the LLM processes the request and buffers the complete response before sending it back to Kong Gateway, which returns it to the caller as a single large JSON block. This can be slow, depending on max_tokens, other request parameters, and the complexity of the request sent to the LLM.

To avoid making the user wait for their chat response with a loading animation, most models can stream each word (or sets of words and tokens) back to the client. This allows the chat response to be rendered in real time.

For example, a client could set up their streaming request using the OpenAI Python SDK like this:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none"
)

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
    stream=True,
)

print('>')
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

The client won’t have to wait for the entire response. Instead, tokens will appear as they come in, for example:

$ python3 long-streaming-request.py
> Kong Inc. is a software company providing cloud-native connectivity solutions for APIs and....

How AI Proxy streaming works

In streaming mode, a client sets "stream": true in its request, and the LLM server streams each part of the response text (usually token by token) as a server-sent event. Kong Gateway captures each batch of events and translates it into the Kong Gateway inference format, so every provider works with the same client tooling, including OpenAI-compatible SDKs.
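To make the event format concrete, here is a minimal sketch of how a client could split a received chunk into server-sent events and join the text deltas. It assumes OpenAI-style chat chunks (choices[0].delta.content) and an OpenAI-style [DONE] sentinel. The helper names and the sample payload are hypothetical and are not taken from Kong Gateway itself.

```python
import json
from typing import Iterator


def iter_sse_data(raw: str) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a decoded chunk."""
    # Events are separated by a blank line; data lines start with "data:".
    for event in raw.replace("\r\n", "\n").split("\n\n"):
        data_lines = [
            line[len("data:"):].strip()
            for line in event.split("\n")
            if line.startswith("data:")
        ]
        if data_lines:
            yield "\n".join(data_lines)


def collect_text(raw: str) -> str:
    """Concatenate the delta text from OpenAI-style chat completion chunks."""
    parts = []
    for payload in iter_sse_data(raw):
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Made-up sample: one HTTP chunk carrying three events.
sample = (
    'data: {"choices": [{"delta": {"content": "Kong"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " Inc."}}]}\n\n'
    "data: [DONE]\n\n"
)
print(collect_text(sample))  # prints "Kong Inc."
```

In practice an SDK such as the OpenAI Python client does this parsing for you; the sketch only shows what travels over the wire.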

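In a standard LLM transaction, the LLM buffers the complete response and returns it through Kong Gateway to the client as a single JSON block.

When streaming is requested, the LLM sends the response as a series of server-sent events, and Kong Gateway relays each batch to the client as it arrives.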

The new streaming framework captures each event, sends the chunk back to the client, and then exits early.

It also estimates token usage for LLM services that don’t stream back token counts when the message is completed.
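This guide doesn’t describe how Kong Gateway performs the estimate. As a rough illustration of the idea only, the sketch below applies a common rule of thumb of about four characters per token to the text gathered from a stream. Both the heuristic and the function name are assumptions, not the plugin’s actual algorithm.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate a token count from text length (assumed heuristic)."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


# Text reassembled from streamed chunks (made-up sample data).
streamed_chunks = ["Kong Inc. is ", "a software ", "company."]
completion_text = "".join(streamed_chunks)
print(estimate_tokens(completion_text))  # 32 characters -> 8
```

A real tokenizer gives different counts per model, so treat any such estimate as approximate.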

Streaming limitations

Keep the following limitations in mind when you configure streaming for the AI Proxy plugin:

Multiple AI features shouldn’t be expected to apply and work simultaneously.
You can’t use the Response Transformer plugin or any other response phase plugin when streaming is configured.
The AI Request Transformer plugin will work, but the AI Response Transformer plugin will not. This is because Kong Gateway can’t check every single response token against a separate system.
Streaming currently doesn’t work with the HTTP/2 protocol. You must disable HTTP/2 in your proxy_listen configuration.
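For the HTTP/2 limitation above, disabling the protocol means removing the http2 flag from each listener in proxy_listen. The sketch below shows that transformation on a sample value. The helper name is hypothetical, and the sample string is an assumed example of a proxy_listen value, so check it against your own kong.conf.

```python
def without_http2(proxy_listen: str) -> str:
    """Return a proxy_listen value with the http2 flag removed from every listener."""
    listeners = []
    for listener in proxy_listen.split(","):
        # Each listener is an address followed by space-separated flags.
        tokens = [token for token in listener.split() if token != "http2"]
        listeners.append(" ".join(tokens))
    return ", ".join(listeners)


# Assumed example value; your configuration may differ.
example = (
    "0.0.0.0:8000 reuseport backlog=16384, "
    "0.0.0.0:8443 http2 ssl reuseport backlog=16384"
)
print(without_http2(example))
# prints "0.0.0.0:8000 reuseport backlog=16384, 0.0.0.0:8443 ssl reuseport backlog=16384"
```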

Configuration

The AI Proxy plugin already supports request streaming. All you have to do is ask Kong Gateway to stream the response tokens back to you.

The following is an example llm/v1/completions route-type streaming request:

{
  "prompt": "What is the theory of relativity?",
  "stream": true
}

You should receive each batch of tokens as HTTP chunks, each containing one or many server-sent events.
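If you aren’t using an SDK, you can send the request above with the Python standard library alone. The sketch below is one way to do it. The URL is a hypothetical route path, and the [DONE] sentinel is assumed to follow the OpenAI convention, so adjust both to your setup.

```python
import json
import urllib.request

# Hypothetical route URL; replace it with the route your AI Proxy plugin serves.
URL = "http://127.0.0.1:8000/completions"


def build_stream_request(url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request that asks for a streamed completion."""
    body = json.dumps({"prompt": prompt, "stream": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )


def stream_events(request: urllib.request.Request):
    """Yield the payload of each `data:` line as it arrives."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response object iterates line by line
            line = raw_line.decode("utf-8").strip()
            if line.startswith("data:"):
                yield line[len("data:"):].strip()


def main() -> None:
    """Print each event payload until the assumed end-of-stream sentinel."""
    request = build_stream_request(URL, "What is the theory of relativity?")
    for payload in stream_events(request):
        if payload == "[DONE]":
            break
        print(payload)
```

Call main() against a running gateway to watch the events arrive one by one.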

Response streaming configuration parameters

In the AI Proxy plugin configuration, you can set an optional field config.response_streaming to one of three values:

Value	Effect
`allow`	Allows the caller to optionally request a streaming response (responses are not streamed unless the caller asks for it)
`deny`	Prevents the caller from setting `stream=true` in their request
`always`	Always returns streaming responses, even if the caller hasn’t specified it in their request
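The table can be read as a small decision function. The sketch below models it in Python and builds a minimal plugin configuration fragment. Only the name ai-proxy and the config.response_streaming field come from this guide. The function names are hypothetical, the other fields the plugin requires are omitted, and raising an error for deny is just a stand-in for the gateway rejecting the request, whose exact behavior isn’t stated here.

```python
VALID_MODES = ("allow", "deny", "always")


def plugin_config(mode: str) -> dict:
    """Build a partial AI Proxy plugin config (other required fields omitted)."""
    if mode not in VALID_MODES:
        raise ValueError(f"response_streaming must be one of {VALID_MODES}")
    return {"name": "ai-proxy", "config": {"response_streaming": mode}}


def resolve_streaming(mode: str, client_requested: bool) -> bool:
    """Return True if the response is streamed under the given mode."""
    if mode == "always":
        return True  # streams even if the caller didn't ask
    if mode == "deny":
        if client_requested:
            # Stand-in for the gateway refusing the request (assumption).
            raise PermissionError("streaming is denied by the plugin configuration")
        return False
    if mode == "allow":
        return client_requested  # streams only when the caller asks
    raise ValueError(f"unknown mode: {mode!r}")


print(plugin_config("allow"))
print(resolve_streaming("allow", client_requested=False))  # prints False
print(resolve_streaming("always", client_requested=False))  # prints True
```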
