AI Gateway Pricing
This page describes the pricing model and list prices (in USD) of Singdata AI Gateway in overseas regions. For Lakehouse compute, storage, and network pricing, see Lakehouse Pricing. For an overview of all pricing pages, see Pricing and Billing.
Overview
Singdata AI Gateway is a one-stop platform that aggregates and manages mainstream LLMs (Anthropic Claude, OpenAI GPT, Qwen, DeepSeek, GLM, Kimi, MiniMax, and more) behind a unified API, so you do not have to register, integrate, and fund a separate account on each vendor's platform.
Billing Modes
Pay-as-you-go
Each API call is billed by the number of tokens it actually consumes, with no minimum spend. Metering is captured at the individual call level and can be aggregated by API key, application, or tenant. Bills are issued monthly, and itemized usage is available in the Billing Center of the console.
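As a rough illustration of call-level metering rolled up along one dimension, the sketch below sums per-call costs by API key, application, or tenant. The record layout, field names, and cost values are hypothetical and are not the Billing Center's export format.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical per-call metering records; field names and costs are illustrative only.
CALLS = [
    {"api_key": "key-a", "application": "chatbot", "tenant": "t1", "cost_usd": Decimal("0.0600")},
    {"api_key": "key-a", "application": "search", "tenant": "t1", "cost_usd": Decimal("0.0100")},
    {"api_key": "key-b", "application": "chatbot", "tenant": "t2", "cost_usd": Decimal("0.0250")},
]


def aggregate(calls, dimension):
    """Sum per-call cost over one dimension: 'api_key', 'application', or 'tenant'."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for call in calls:
        totals[call[dimension]] += call["cost_usd"]
    return dict(totals)


by_tenant = aggregate(CALLS, "tenant")
by_app = aggregate(CALLS, "application")
by_key = aggregate(CALLS, "api_key")
```

Using `Decimal` rather than `float` keeps the sums exact, which matters when many very small per-call amounts are added together.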
Billing Dimensions
Different model types are billed along different dimensions. The fields used in the price tables below have the following meanings. Unit prices in the tables are in USD per 1 million tokens.
Chat Models Billed by Token
Input and output are priced separately. One token is roughly 0.5 Chinese characters or 0.75 English words.
| Field | Meaning |
|---|---|
| Input | Unit price for tokens in the prompt portion of the request |
| Output | Unit price for tokens generated by the model |
| Context Window | Tiered pricing across context-window ranges; tokens are billed at the unit price of the tier their request falls into |
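Tier selection can be sketched as follows, using the two gpt-5.4 tiers from the OpenAI table on this page. Two assumptions are made here that the page does not state: the tier is chosen by the request's input token count, and unit prices are USD per 1 million tokens. The function name is hypothetical.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

# (tier upper bound in tokens, input price, output price); None means no upper bound.
GPT_5_4_TIERS = [
    (272_000, Decimal("2.5"), Decimal("15")),
    (None, Decimal("5"), Decimal("22.5")),
]


def tiered_cost(input_tokens, output_tokens, tiers=GPT_5_4_TIERS):
    """Bill the whole request at the first tier whose bound covers the input size."""
    for upper, in_price, out_price in tiers:
        if upper is None or input_tokens <= upper:
            return (input_tokens * in_price + output_tokens * out_price) / MILLION
    raise ValueError("no tier matched")


small = tiered_cost(100_000, 2_000)  # lower tier
large = tiered_cost(300_000, 2_000)  # upper tier
```

Under these assumptions a request that crosses the 272K boundary is billed entirely at the higher tier, not only for the tokens above the boundary.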
Caching and Cost Savings
Caching lets the model reuse leading content that you send repeatedly (a long system prompt, a fixed knowledge base document, and so on) instead of recomputing it on every call. Tokens that hit the cache are priced well below the standard input price, which can substantially reduce cost for long prompts and multi-turn conversations.
There are two types of caching, mapped to different columns in the price tables.
| Type | How It Works | Corresponding Columns in Price Table |
|---|---|---|
| Explicit Cache | You explicitly tell the model to store a segment. A one-time write fee applies on creation (1.25× the input price in the tables below, or 2× for Claude's 1-hour tier), and a much lower hit fee (0.1× the input price) applies on each subsequent reuse | Explicit·Write, Explicit·Write·5min, Explicit·Write·1h, Explicit·Hit |
| Implicit Cache | The system automatically detects repeated prefixes and caches them, with no manual action required. There is no write fee; only the lower hit price applies on hits | Implicit·Hit |
How each column is billed:
- Explicit·Write: Charged when a prompt segment is first written to the cache, calculated as the number of tokens written multiplied by this unit price
- Explicit·Write·5min / Explicit·Write·1h: Anthropic Claude's explicit cache offers two retention tiers. The 5-minute tier has the lower write price (1.25× the input price in the table below); the 1-hour tier costs more to write (2× the input price) but suits content that is reused repeatedly within an hour
- Explicit·Hit: Charged when a subsequent request hits the cache, calculated as the number of hit tokens multiplied by this unit price, which is one tenth of the input price for every model listed below
- Implicit·Hit: When the system detects a repeated prefix in a request, the hit portion is billed at this unit price. Because the system writes to the cache automatically, no separate write fee applies
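To see when an explicit cache pays off, the sketch below compares the cost of a repeated prefix with and without caching, using the claude-sonnet-4-6 prices from the table on this page. It assumes prices are USD per 1 million tokens, that the first call pays the 5-minute write price, and that every later call hits the cache within the retention window. The function names are hypothetical.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

# claude-sonnet-4-6 list prices from this page: Input, Explicit·Write·5min, Explicit·Hit
INPUT, WRITE_5MIN, HIT = Decimal("3"), Decimal("3.75"), Decimal("0.3")


def prefix_cost_without_cache(prefix_tokens, calls):
    """Every call pays the full input price for the repeated prefix."""
    return prefix_tokens * calls * INPUT / MILLION


def prefix_cost_with_cache(prefix_tokens, calls):
    """First call pays the write price; each later call pays the hit price."""
    write = prefix_tokens * WRITE_5MIN
    hits = prefix_tokens * (calls - 1) * HIT
    return (write + hits) / MILLION


uncached = prefix_cost_without_cache(10_000, 5)  # 0.15 USD
cached = prefix_cost_with_cache(10_000, 5)       # 0.0495 USD
```

With these prices a single call is more expensive with caching (3.75 versus 3 per million tokens), and caching is already cheaper from the second call onward.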
Vendors differ in which cache types they support. The matrix below summarizes current support:
| Vendor | Explicit Cache | Implicit Cache |
|---|---|---|
| Anthropic Claude | Supported (5-minute and 1-hour write tiers) | Not supported |
| OpenAI GPT | Not supported | Supported |
| Qwen | Supported | Partially supported (3.5 series supported; 3.6 / 3.7 series not yet available) |
| DeepSeek | Partially supported (v3.2 supported) | Supported |
| GLM | Partially supported (5.1 supported) | Supported |
| Kimi | Supported | Supported |
| MiniMax | Not supported | Supported |
How Multimodal Embeddings Are Billed
Embedding models are priced separately by input data type. Text input uses a single input price, while image and video inputs are billed by the number of multimodal tokens at a unit price higher than text input.
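A minimal sketch of this split, using the two qwen3-vl-embedding prices from the Qwen table on this page (0.1 for text input, 0.258 for image / video input). It assumes prices are USD per 1 million tokens and that a request's text tokens and multimodal tokens are metered separately. The function name is hypothetical.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

TEXT_PRICE = Decimal("0.1")          # qwen3-vl-embedding, text input
MULTIMODAL_PRICE = Decimal("0.258")  # qwen3-vl-embedding, image / video input


def embedding_cost(text_tokens, multimodal_tokens):
    """Bill text tokens and multimodal tokens at their own unit prices."""
    return (text_tokens * TEXT_PRICE + multimodal_tokens * MULTIMODAL_PRICE) / MILLION


cost = embedding_cost(text_tokens=20_000, multimodal_tokens=5_000)  # 0.00329 USD
```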
Overseas Model List Prices
Anthropic Claude Series
| Model | Input | Output | Explicit·Write·5min | Explicit·Write·1h | Explicit·Hit |
|---|---|---|---|---|---|
| claude-opus-4-7 | 5 | 25 | 6.25 | 10 | 0.5 |
| claude-opus-4-6 | 5 | 25 | 6.25 | 10 | 0.5 |
| claude-opus-4-5 | 5 | 25 | 6.25 | 10 | 0.5 |
| claude-sonnet-4-6 | 3 | 15 | 3.75 | 6 | 0.3 |
| claude-sonnet-4-5 | 3 | 15 | 3.75 | 6 | 0.3 |
| claude-haiku-4-5 | 1 | 5 | 1.25 | 2 | 0.1 |
OpenAI GPT Series
| Model | Context Window | Input | Output | Implicit·Hit |
|---|---|---|---|---|
| gpt-5.4-pro | ≤272K | 30 | 180 | — |
| gpt-5.4-pro | >272K | 60 | 270 | — |
| gpt-5.4 | ≤272K | 2.5 | 15 | 0.25 |
| gpt-5.4 | >272K | 5 | 22.5 | 0.5 |
| gpt-5.4-mini | — | 0.75 | 4.5 | 0.075 |
| gpt-5.5 | ≤272K | 5 | 30 | 0.5 |
| gpt-5.5 | >272K | 10 | 45 | 1 |
| gpt-5.2-pro | — | 21 | 168 | — |
| gpt-5.2-2025-12-11 | 1M tokens | 1.75 | 14 | 0.175 |
| gpt-5-2025-08-07 | 1M tokens | 1.25 | 10 | 0.125 |
| gpt-5-mini-2025-08-07 | 1M tokens | 0.25 | 2 | 0.025 |
| gpt-5-nano-2025-08-07 | 1M tokens | 0.05 | 0.4 | 0.005 |
| gpt-4.1-2025-04-14 | 1M tokens | 2 | 8 | 0.5 |
| gpt-4.1-mini-2025-04-14 | 1M tokens | 0.4 | 1.6 | 0.1 |
| gpt-4o-2024-08-06 | 128K tokens | 2.5 | 10 | 1.25 |
| o4-mini | — | 1.1 | 4.4 | 0.275 |
| o3 | — | 2 | 8 | 0.5 |
Qwen Series
| Sub-series | Model | Context Window | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit |
|---|---|---|---|---|---|---|---|
| Qwen Max | qwen3.7-max | 0–1M | 2.5 | 7.5 | 3.125 | 0.25 | 0.5 |
| Qwen Max | qwen3.6-max-preview | 0–128K | 1.3 | 7.8 | 1.625 | 0.13 | — |
| Qwen Max | qwen3.6-max-preview | 128K–256K | 2 | 12 | 2.5 | 0.2 | — |
| Qwen Max | qwen3-max | 0–32K | 1.2 | 6 | 1.5 | 0.12 | 0.24 |
| Qwen Max | qwen3-max | 32K–128K | 2.4 | 12 | 3 | 0.24 | 0.48 |
| Qwen Max | qwen3-max | 128K–256K | 3 | 15 | 3.75 | 0.3 | 0.6 |
| Qwen Max | qwen3-max-preview | 0–32K | 1.2 | 6 | 1.5 | 0.12 | 0.24 |
| Qwen Max | qwen3-max-preview | 32K–128K | 2.4 | 12 | 3 | 0.24 | 0.48 |
| Qwen Max | qwen3-max-preview | 128K–256K | 3 | 15 | 3.75 | 0.3 | 0.6 |
| Qwen Plus | qwen3.6-plus | 0–256K | 0.5 | 3 | 0.625 | 0.05 | — |
| Qwen Plus | qwen3.6-plus | 256K–1M | 2 | 6 | 2.5 | 0.2 | — |
| Qwen Plus | qwen3.5-plus | 0–256K | 0.4 | 2.4 | 0.5 | 0.04 | 0.08 |
| Qwen Plus | qwen3.5-plus | 256K–1M | 0.5 | 3 | 0.625 | 0.05 | 0.1 |
| Qwen Flash | qwen3.6-flash | 0–256K | 0.25 | 1.5 | 0.3125 | 0.025 | — |
| Qwen Flash | qwen3.6-flash | 256K–1M | 1 | 4 | 1.25 | 0.1 | — |
| Qwen Flash | qwen3.5-flash | 0–1M | 0.1 | 0.4 | 0.125 | 0.01 | — |
| Qwen Embedding | qwen3-vl-embedding (text input) | — | 0.1 | — | — | — | — |
| Qwen Embedding | qwen3-vl-embedding (image / video input) | — | 0.258 | — | — | — | — |
DeepSeek Series
| Model | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit |
|---|---|---|---|---|---|
| deepseek-v3.2 | 0.57 | 1.71 | 0.7125 | 0.057 | 0.114 |
| deepseek-v4-flash | 0.2 | 0.4 | — | — | — |
| deepseek-v4-pro | 2.4 | 4.8 | — | — | — |
GLM Series
| Model | Context Window | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit |
|---|---|---|---|---|---|---|
| glm-4.7 | 0–32K | 0.431 | 2.007 | — | — | 0.0862 |
| glm-4.7 | 32K–166K | 0.574 | 2.294 | — | — | 0.1148 |
| glm-5 | 0–32K | 0.573 | 2.58 | — | — | 0.1146 |
| glm-5 | 32K–166K | 0.86 | 3.154 | — | — | 0.172 |
| glm-5.1 | 0–32K | 0.825 | 3.301 | 1.03125 | 0.0825 | 0.165 |
| glm-5.1 | 32K–200K | 1.1 | 3.851 | 1.375 | 0.11 | 0.22 |
Kimi Series
| Model | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit |
|---|---|---|---|---|---|
| kimi-k2.5 | 0.574 | 3.011 | 0.7175 | 0.0574 | 0.1148 |
| kimi-k2.6 | 0.8939 | 3.7131 | 1.117375 | 0.08939 | 0.17878 |
MiniMax Series
| Model | Input | Output | Implicit·Hit |
|---|---|---|---|
| MiniMax-M2.5 | 0.304 | 1.213 | 0.0608 |
Billing Examples
Text model: Calling claude-sonnet-4-6 with 10,000 input tokens and 2,000 output tokens, no cache hit:
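A minimal sketch of this calculation, using the claude-sonnet-4-6 row from the Anthropic table and assuming the list prices are USD per 1 million tokens. The variable names are illustrative.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

# claude-sonnet-4-6 list prices from this page
INPUT_PRICE, OUTPUT_PRICE = Decimal("3"), Decimal("15")

input_cost = 10_000 * INPUT_PRICE / MILLION   # 0.03 USD
output_cost = 2_000 * OUTPUT_PRICE / MILLION  # 0.03 USD
total_no_cache = input_cost + output_cost     # 0.06 USD
```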
Text model with explicit cache: Calling claude-haiku-4-5. The first request writes 5,000 tokens to the 5-minute cache tier and outputs 1,000 tokens. The second request hits the cache for 5,000 tokens, with 500 new input tokens and 1,000 output tokens:
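A minimal sketch of this two-request calculation, using the claude-haiku-4-5 row from the Anthropic table and assuming the list prices are USD per 1 million tokens. It also assumes the first request has no input tokens beyond the 5,000 written to the cache, since none are mentioned. The variable names are illustrative.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

# claude-haiku-4-5 list prices from this page
INPUT, OUTPUT = Decimal("1"), Decimal("5")
WRITE_5MIN, HIT = Decimal("1.25"), Decimal("0.1")

# Request 1: 5,000 tokens written to the 5-minute tier, 1,000 output tokens.
first = (5_000 * WRITE_5MIN + 1_000 * OUTPUT) / MILLION          # 0.01125 USD

# Request 2: 5,000 cache-hit tokens, 500 new input tokens, 1,000 output tokens.
second = (5_000 * HIT + 500 * INPUT + 1_000 * OUTPUT) / MILLION  # 0.006 USD

total_explicit = first + second                                  # 0.01725 USD
```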
Text model with implicit cache: Calling gpt-5-2025-08-07 with 10,000 input tokens (of which 6,000 hit the implicit cache) and 2,000 output tokens:
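A minimal sketch of this calculation, using the gpt-5-2025-08-07 row from the OpenAI table and assuming the list prices are USD per 1 million tokens. It assumes the 6,000 hit tokens are billed at the Implicit·Hit price in place of the input price, and the remaining 4,000 input tokens at the standard input price. The variable names are illustrative.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)  # assumption: unit prices are USD per 1M tokens

# gpt-5-2025-08-07 list prices from this page
INPUT, OUTPUT, IMPLICIT_HIT = Decimal("1.25"), Decimal("10"), Decimal("0.125")

input_tokens, hit_tokens, output_tokens = 10_000, 6_000, 2_000

miss_cost = (input_tokens - hit_tokens) * INPUT / MILLION  # 0.005 USD
hit_cost = hit_tokens * IMPLICIT_HIT / MILLION             # 0.00075 USD
output_cost = output_tokens * OUTPUT / MILLION             # 0.02 USD
total_implicit = miss_cost + hit_cost + output_cost        # 0.02575 USD
```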
