Tutorial

AI API Streaming with Server-Sent Events: Complete Guide

April 17, 2026 · 8 min read

Streaming lets your AI application display responses token-by-token as they are generated, rather than waiting for the entire response. This dramatically improves perceived latency and user experience. Under the hood, OpenAI-compatible AI APIs use Server-Sent Events (SSE) to push tokens to your client in real time.

How SSE Streaming Works

When you set stream=True, the API sends a series of small JSON chunks over an HTTP connection instead of one large response. Each chunk contains one or more tokens. The connection stays open until the response is complete.

  1. Client sends a POST request with stream: true
  2. Server responds with Content-Type: text/event-stream
  3. Server pushes data: {...} events as tokens are generated
  4. Final event: data: [DONE] signals completion
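
On the wire, each event is a `data:` line carrying a small JSON chunk. A minimal sketch of parsing those lines might look like this (`parse_sse_line` is a hypothetical helper; the field names follow the chunk shape used in the SDK examples below):

```python
import json

def parse_sse_line(line: str):
    """Return the token carried by one SSE `data:` line, or None."""
    if not line.startswith("data: "):
        return None  # blank keep-alive lines, comments, etc.
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
tokens = [t for t in (parse_sse_line(line) for line in raw) if t]
print("".join(tokens))  # Hello
```

The official SDKs do this parsing for you, which is why the examples below simply iterate over the stream object.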

Python: Basic Streaming

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aipower.me/v1",
    api_key="YOUR_AIPOWER_KEY",
)

# Enable streaming
stream = client.chat.completions.create(
    model="deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "Explain how neural networks learn"}],
    stream=True,
)

# Process tokens as they arrive
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()  # Newline at the end
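
If you also need the complete reply once streaming finishes (for example, to append it to the chat history), accumulate the deltas as they arrive. The sketch below simulates the chunk objects with `types.SimpleNamespace` instead of a live API call, but the loop body is the same:

```python
from types import SimpleNamespace

def fake_chunk(text):
    # Mimics the shape of a streamed chat-completion chunk
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

stream = [fake_chunk("Neural "), fake_chunk("networks "),
          fake_chunk(None), fake_chunk("learn.")]

full_response = ""
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:  # some chunks (e.g. the final one) carry no content
        full_response += content
        print(content, end="", flush=True)
print()
# full_response now holds the whole reply for storage
```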

Node.js: Basic Streaming

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aipower.me/v1",
  apiKey: "YOUR_AIPOWER_KEY",
});

async function streamResponse() {
  const stream = await client.chat.completions.create({
    model: "deepseek/deepseek-chat",
    messages: [{ role: "user", content: "Explain how neural networks learn" }],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
  console.log();
}

streamResponse();

Building a Real-Time Chat UI (FastAPI + SSE)

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

@app.post("/api/chat")
async def chat(messages: list):
    def generate():
        # Plain sync generator: Starlette iterates it in a threadpool,
        # so the blocking OpenAI stream does not stall the event loop
        stream = client.chat.completions.create(
            model="deepseek/deepseek-chat",
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # In production, JSON-encode content here: a token that
                # contains a newline would otherwise break the SSE framing
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Frontend: Consuming SSE in JavaScript

async function sendMessage(messages) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(messages),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let result = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Stream-decode and buffer: a network read can end mid-line,
    // so keep the trailing partial line for the next iteration
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return result;
      result += data;
      updateChatUI(result); // Re-render the message in your UI
    }
  }
  return result;
}

Streaming with Error Handling

def stream_with_retry(messages, model="deepseek/deepseek-chat", max_retries=3):
    for attempt in range(max_retries):
        emitted = ""  # Track what has already been yielded to the caller
        try:
            stream = client.chat.completions.create(
                model=model, messages=messages, stream=True,
            )
            for chunk in stream:
                content = chunk.choices[0].delta.content
                if content:
                    emitted += content
                    yield content
            return
        except Exception as e:
            # A retry restarts generation from scratch, so retrying after
            # tokens were already yielded would show the user duplicated text
            if emitted or attempt == max_retries - 1:
                raise
            print(f"Stream error (attempt {attempt + 1}): {e}")
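
You can exercise this retry shape without a live API by swapping the API call for any callable that returns an iterator of tokens (`flaky` below is a hypothetical stand-in that drops the connection once before succeeding):

```python
def stream_with_retry_from(source, max_retries=3):
    # Generic version of the retry loop: `source` is any callable
    # returning an iterator of tokens
    for attempt in range(max_retries):
        emitted = False
        try:
            for token in source():
                emitted = True
                yield token
            return
        except Exception:
            # Only retry if nothing has been emitted yet; a mid-stream
            # retry would replay tokens the caller already received
            if emitted or attempt == max_retries - 1:
                raise

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("simulated drop before the first token")
    yield from ["Hello", ", ", "world"]

result = "".join(stream_with_retry_from(flaky))
print(result)  # Hello, world
```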

Performance Tips

  • First-token latency: streaming typically shows the first token in 200-500 ms, versus 2-5 s of waiting for a complete non-streaming response
  • Cost is identical: Streaming does not cost more or less than non-streaming requests
  • Abort early: Cancel the stream if the user navigates away to save output tokens
  • Buffer for rendering: Batch UI updates every 50ms instead of per-token to avoid jank
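
The last tip can be sketched as a small buffer that repaints at most once per interval (`TokenBuffer` and its `render` callback are illustrative names, not part of any SDK):

```python
import time

class TokenBuffer:
    """Batches streamed tokens and repaints at most every `interval` seconds."""

    def __init__(self, render, interval=0.05):
        self.render = render        # callback that repaints the UI
        self.interval = interval
        self.text = ""
        self.pending = False
        self.last_flush = 0.0

    def add(self, token):
        self.text += token
        self.pending = True
        now = time.monotonic()
        if now - self.last_flush >= self.interval:
            self.flush(now)

    def flush(self, now=None):
        # Repaint only if there is something new since the last flush
        if self.pending:
            self.render(self.text)
            self.pending = False
            self.last_flush = now if now is not None else time.monotonic()

renders = []
buf = TokenBuffer(renders.append, interval=0.05)
for tok in ["Hel", "lo", ",", " world"]:
    buf.add(tok)
buf.flush()  # final repaint with whatever is left in the buffer
```

In a real chat UI, `render` would update the message bubble; calling `flush()` once after the stream ends guarantees the last tokens are never dropped.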

All 16 models on AIPower support streaming. Start building real-time AI UIs at aipower.me with 10 free API calls.

GET STARTED WITH AIPOWER

16 AI models. One API. OpenAI SDK compatible.

Who should use AIPower?

  • Developers needing both Chinese and Western AI models
  • Chinese teams that can't access OpenAI / Anthropic directly
  • Startups wanting multi-model redundancy through one API
  • Anyone tired of paying grey-market intermediary premiums

3 steps to first API call

  1. Sign up — email only, 10 free trial calls, no card
  2. Copy your API key from the dashboard
  3. Change base_url in your OpenAI SDK → done

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aipower.me/v1",  # ← only change
    api_key="sk-your-aipower-key",
)

response = client.chat.completions.create(
    model="auto-cheap",   # or anthropic/claude-opus, deepseek/deepseek-chat, openai/gpt-5, etc.
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

+100 bonus calls on first $5 top-up · WeChat Pay + Alipay + card accepted