Architecture

AI API Rate Limits Explained: How to Handle Throttling Like a Pro

April 16, 2026 · 6 min read

Every AI API has rate limits. Hit them and your application breaks. Understanding and handling rate limits properly is the difference between a demo and a production application. Here's how to do it right.

Rate Limits by Provider

Provider    Default RPM    Default TPM    429 Behavior
OpenAI      60-10,000      60K-2M         Retry-After header
Anthropic   60-4,000       80K-400K       Retry-After header
DeepSeek    60             1M             Variable wait
AIPower     200/min        Unlimited      429 + retry hint
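
OpenAI and Anthropic return a Retry-After header on 429 responses (see the table). A minimal parser sketch, assuming the integer-seconds form of the header (it can also be an HTTP date, which this falls through on); with the openai SDK, the headers are typically reachable via the raised error's response object:

```python
def retry_after_seconds(headers, default=1.0):
    """Read a Retry-After value (in seconds) from response headers."""
    raw = headers.get("Retry-After") or headers.get("retry-after")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default  # Missing header or HTTP-date form: use a default wait
```

Waiting exactly as long as the server asks beats guessing: you resume the instant your quota refreshes, no earlier and no later.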

The Right Way to Handle 429 Errors

import time
import random
from openai import OpenAI

client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

def call_with_retry(messages, model="deepseek/deepseek-chat", max_retries=5):
    """Exponential backoff with jitter — the production-grade pattern."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except Exception as e:
            if "429" in str(e) or "rate" in str(e).lower():
                if attempt == max_retries - 1:
                    break  # Out of retries — no point sleeping again
                # Exponential backoff: 1s, 2s, 4s, 8s + random jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait)
            else:
                raise  # Re-raise non-rate-limit errors
    raise RuntimeError("Max retries exceeded")
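
To see the schedule this loop produces, here is the same formula in isolation, seeded so the jitter is reproducible (no API call involved):

```python
import random

random.seed(42)
# Base delays 1, 2, 4, 8, 16 seconds, each padded with up to 1s of jitter
waits = [(2 ** attempt) + random.uniform(0, 1) for attempt in range(5)]
```

Each wait is strictly longer than the last, and the jitter means two clients that were rate-limited at the same moment will not retry at the same moment.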

Pattern: Request Queue with Concurrency Control

import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

class RateLimitedQueue:
    def __init__(self, max_concurrent=10, rpm_limit=180):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.interval = 60 / rpm_limit  # Seconds between request starts
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def _pace(self):
        # Serialize scheduling so request starts are spaced self.interval
        # apart globally — sleeping per concurrent slot would allow up to
        # max_concurrent requests per interval and blow past the RPM limit
        async with self._lock:
            now = asyncio.get_running_loop().time()
            start = max(self._next_start, now)
            self._next_start = start + self.interval
        await asyncio.sleep(max(0.0, start - now))

    async def call(self, messages, model="deepseek/deepseek-chat"):
        await self._pace()
        async with self.semaphore:
            return await aclient.chat.completions.create(
                model=model, messages=messages,
            )

    async def batch(self, message_list, model="deepseek/deepseek-chat"):
        tasks = [self.call(msgs, model) for msgs in message_list]
        return await asyncio.gather(*tasks, return_exceptions=True)

# Process 1000 requests without hitting rate limits
queue = RateLimitedQueue(max_concurrent=10, rpm_limit=180)
results = asyncio.run(queue.batch(all_messages))
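
Because batch() passes return_exceptions=True, failed calls come back interleaved with the successes instead of raising. A standalone sketch of splitting them apart, where the flaky coroutine is a stand-in for a real API call:

```python
import asyncio

async def flaky(i):
    # Stand-in for an API call; request 2 simulates a rate-limit failure
    if i == 2:
        raise RuntimeError("429: rate limited")
    return i * 10

async def main():
    results = await asyncio.gather(*(flaky(i) for i in range(4)),
                                   return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return successes, failures

successes, failures = asyncio.run(main())
```

gather preserves input order, so you can also zip results back against the original message list to know exactly which requests need a retry pass.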

Pattern: Multi-Provider Fallback for Rate Limits

When one provider rate-limits you, fall back to another:

# With AIPower, use smart routing as automatic fallback
# "auto" routes to available models — if DeepSeek is rate-limited,
# it tries Qwen, then GLM, then others
response = client.chat.completions.create(
    model="auto",  # Never rate-limited because it has 10+ backend providers
    messages=[{"role": "user", "content": "Hello!"}],
)
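
AIPower handles this routing for you, but the same idea works by hand against any set of providers. A hedged sketch where call_fn stands in for your SDK call and the model names are illustrative:

```python
def call_with_fallback(models, call_fn):
    """Try each model in order, moving on only for rate-limit errors."""
    last_error = None
    for model in models:
        try:
            return call_fn(model)
        except Exception as e:
            if "429" in str(e) or "rate" in str(e).lower():
                last_error = e  # Rate-limited: try the next model
                continue
            raise  # Any other error is a real failure
    raise last_error

# Fake call_fn that rate-limits the first model, for demonstration
def fake_call(model):
    if model == "deepseek/deepseek-chat":
        raise RuntimeError("429: rate limited")
    return f"answer from {model}"

result = call_with_fallback(
    ["deepseek/deepseek-chat", "qwen/qwen-plus"], fake_call)
```

Note that only rate-limit errors trigger the fallback; an auth failure or malformed request should surface immediately rather than burn through every provider.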

Monitoring Rate Limit Usage

from collections import deque
import time

class RateMonitor:
    def __init__(self, window_seconds=60):
        self.calls = deque()
        self.window = window_seconds

    def record(self):
        now = time.time()
        self.calls.append(now)
        # Remove calls outside the window
        while self.calls and self.calls[0] < now - self.window:
            self.calls.popleft()

    @property
    def current_rpm(self):
        return len(self.calls)

    def safe_to_call(self, limit=180):
        return self.current_rpm < limit
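
Pairing the monitor with safe_to_call() gives a proactive gate: check before calling, record after. A usage sketch — the class is repeated so the snippet runs on its own, and the limit is tiny for demonstration:

```python
import time
from collections import deque

class RateMonitor:
    def __init__(self, window_seconds=60):
        self.calls = deque()
        self.window = window_seconds

    def record(self):
        now = time.time()
        self.calls.append(now)
        while self.calls and self.calls[0] < now - self.window:
            self.calls.popleft()

    @property
    def current_rpm(self):
        return len(self.calls)

    def safe_to_call(self, limit=180):
        return self.current_rpm < limit

monitor = RateMonitor(window_seconds=60)
sent = 0
for _ in range(10):
    if monitor.safe_to_call(limit=5):
        monitor.record()
        sent += 1  # In real code: make the API call here
    # In real code: else sleep briefly and re-check
```

The gate stops exactly at the limit, so you throttle yourself before the provider has to.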

Best Practices Summary

  1. Always implement retry with exponential backoff — never retry immediately
  2. Add jitter — prevents thundering herd when many clients retry simultaneously
  3. Use a request queue — don't fire all requests at once
  4. Monitor your RPM — stay under limits proactively
  5. Use an API gateway — AIPower's smart routing auto-distributes across providers
  6. Cache responses — identical queries shouldn't hit the API twice
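
Point 6 is the cheapest win. A minimal in-memory cache sketch, keyed on the model plus the exact messages; call_fn stands in for the real SDK call, and a production version would bound the cache size and add a TTL:

```python
import hashlib
import json

_cache = {}

def cached_call(messages, model, call_fn):
    """Return a cached response for identical (model, messages) inputs."""
    key = hashlib.sha256(
        json.dumps([model, messages], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(messages, model)  # Cache miss: hit the API once
    return _cache[key]

# Fake call_fn that counts how often the "API" is actually hit
hits = {"n": 0}
def fake_call(messages, model):
    hits["n"] += 1
    return "response"

msgs = [{"role": "user", "content": "Hello"}]
cached_call(msgs, "deepseek/deepseek-chat", fake_call)
cached_call(msgs, "deepseek/deepseek-chat", fake_call)
```

Two identical requests, one API call — every cache hit is a request that can never be rate-limited.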

AIPower's gateway distributes your requests across 10 providers, dramatically reducing the chance of hitting any single provider's rate limit. Try it at aipower.me — 200 RPM default, 10 free calls.

GET STARTED WITH AIPOWER

16 AI models. One API. OpenAI SDK compatible.

Who should use AIPower?

  • Developers needing both Chinese and Western AI models
  • Chinese teams that can't access OpenAI / Anthropic directly
  • Startups wanting multi-model redundancy through one API
  • Anyone tired of paying grey-market intermediary premiums

3 steps to first API call

  1. Sign up — email only, 10 free trial calls, no card
  2. Copy your API key from the dashboard
  3. Change base_url in your OpenAI SDK → done
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aipower.me/v1",  # ← only change
    api_key="sk-your-aipower-key",
)

response = client.chat.completions.create(
    model="auto-cheap",   # or anthropic/claude-opus, deepseek/deepseek-chat, openai/gpt-5, etc.
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

+100 bonus calls on first $5 top-up · WeChat Pay + Alipay + card accepted · docs · security