AI Gateway Pricing

This page describes the pricing model and list prices (in USD) of Singdata AI Gateway in overseas regions. For Lakehouse compute, storage, and network resources, see Lakehouse Pricing. For an overview of all pricing topics, see Pricing and Billing.

Overview

Singdata AI Gateway is a one-stop platform that aggregates and manages mainstream LLMs (Anthropic Claude, OpenAI GPT, Qwen, DeepSeek, GLM, Kimi, MiniMax, and more) behind a unified API, so you do not have to register, integrate, and fund a separate account on each vendor's platform.

Billing Modes

Pay-as-you-go

Each API call is billed by the number of tokens it actually consumes, with no minimum spend. Metering is captured at the individual call level and can be aggregated by API key, application, or tenant. Bills are issued monthly, and itemized usage is available in the Billing Center of the console.
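Because metering is recorded per call, rolling usage up to another dimension is a simple grouping step. The following is a minimal sketch only: the record fields (api_key, input_tokens, output_tokens) and the function name are hypothetical, since this page does not describe the actual export format of the Billing Center.

```python
from collections import defaultdict

# Hypothetical per-call usage records; the field names are illustrative only.
calls = [
    {"api_key": "key-a", "input_tokens": 1200, "output_tokens": 300},
    {"api_key": "key-b", "input_tokens": 500, "output_tokens": 100},
    {"api_key": "key-a", "input_tokens": 800, "output_tokens": 200},
]


def aggregate_tokens(records, dimension):
    """Sum input and output tokens for each distinct value of `dimension`."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for record in records:
        bucket = totals[record[dimension]]
        bucket["input_tokens"] += record["input_tokens"]
        bucket["output_tokens"] += record["output_tokens"]
    return dict(totals)


usage_by_key = aggregate_tokens(calls, "api_key")
print(usage_by_key)
```

The same function would group by application or tenant if the records carried those fields.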

Billing Dimensions

Different model types are billed along different dimensions. The fields used in the price tables below have the following meanings. All unit prices are list prices in USD per 1,000,000 tokens.

Chat Models Billed by Token

Input and output are priced separately. One token is roughly 0.5 Chinese characters or 0.75 English words.

Field | Meaning
Input | Unit price for tokens in the prompt portion of the request
Output | Unit price for tokens generated by the model
Context Window | Tiered pricing across context-window ranges; tokens are billed at the unit price of the tier their request falls into
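Tiered pricing can be sketched as a lookup followed by the usual per-token multiplication. The example below uses the gpt-5.4 list prices from this page. Two points are assumptions rather than documented behavior: that the tier is chosen by the request's input token count, and that 272K means 272,000 tokens. The names GPT_5_4_TIERS and tiered_cost are illustrative.

```python
TOKENS_PER_UNIT = 1_000_000  # list prices are quoted per one million tokens

# (inclusive upper bound on input tokens, input price, output price).
# None marks the open-ended top tier. Prices are the gpt-5.4 list prices.
GPT_5_4_TIERS = [
    (272_000, 2.5, 15.0),
    (None, 5.0, 22.5),
]


def tiered_cost(input_tokens, output_tokens, tiers):
    """Return the USD cost of one call, priced at the first tier that fits.

    Assumption: the tier is selected by the input token count of the request.
    """
    for upper_bound, input_price, output_price in tiers:
        if upper_bound is None or input_tokens <= upper_bound:
            total = input_tokens * input_price + output_tokens * output_price
            return total / TOKENS_PER_UNIT
    raise ValueError("no tier matches the request size")


small = tiered_cost(100_000, 2_000, GPT_5_4_TIERS)   # falls in the <=272K tier
large = tiered_cost(300_000, 2_000, GPT_5_4_TIERS)   # falls in the >272K tier
print(f"small request: ${small:.3f}, large request: ${large:.3f}")
```

Note that under this reading the whole request is billed at the higher tier's prices once it crosses the boundary, not just the tokens beyond it.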

Caching and Cost Savings

If you repeatedly send the same leading content (a long system prompt, a fixed knowledge base document, and so on), the model can keep that content available for reuse on subsequent calls instead of recomputing it from scratch. That is what caching does. Tokens that hit the cache are priced well below the standard input price, which can substantially reduce cost for long prompts and multi-turn conversations.

There are two types of caching, mapped to different columns in the price tables.

Type | How It Works | Corresponding Columns in Price Table
Explicit Cache | You explicitly tell the model to store a segment. A one-time write fee applies on creation (slightly higher than the input price), and a much lower hit fee applies on each subsequent reuse | Explicit·Write, Explicit·Write·5min, Explicit·Write·1h, Explicit·Hit
Implicit Cache | The system automatically detects repeated prefixes and caches them, with no manual action required. There is no write fee; only the lower hit price applies on hits | Implicit·Hit

How each column is billed:

  • Explicit·Write: Charged when a prompt segment is first written to the cache, calculated as the number of tokens written multiplied by this unit price
  • Explicit·Write·5min / Explicit·Write·1h: Anthropic Claude's explicit cache offers two retention tiers. The 5-minute tier has a lower unit price; the 1-hour tier has a higher unit price but suits cases where the same content is reused repeatedly within an hour
  • Explicit·Hit: Charged when a subsequent request hits the cache, calculated as the number of hit tokens multiplied by this unit price, which is significantly lower than the input price
  • Implicit·Hit: When the system detects a repeated prefix in a request, the hit portion is billed at this unit price. Because the system writes to the cache automatically, no separate write fee applies
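To see when an explicit cache pays off, it helps to compare the cost of a shared prefix with and without caching. The sketch below uses the claude-sonnet-4-6 list prices from this page (Input 3, Explicit·Write·5min 3.75, Explicit·Hit 0.3). It assumes every call after the first arrives while the cached segment is still retained; the function names are illustrative.

```python
TOKENS_PER_UNIT = 1_000_000  # list prices are quoted per one million tokens

# claude-sonnet-4-6 list prices (USD per 1M tokens) from the price table.
INPUT_PRICE = 3.0
WRITE_5MIN_PRICE = 3.75
HIT_PRICE = 0.3


def prefix_cost_without_cache(prefix_tokens, calls):
    """Every call pays the standard input price for the shared prefix."""
    return calls * prefix_tokens * INPUT_PRICE / TOKENS_PER_UNIT


def prefix_cost_with_explicit_cache(prefix_tokens, calls):
    """The first call pays the write price; each later call pays the hit price.

    Assumption: all later calls hit the cache within the retention window.
    """
    if calls < 1:
        return 0.0
    write = prefix_tokens * WRITE_5MIN_PRICE / TOKENS_PER_UNIT
    hits = (calls - 1) * prefix_tokens * HIT_PRICE / TOKENS_PER_UNIT
    return write + hits


uncached = prefix_cost_without_cache(5_000, 3)
cached = prefix_cost_with_explicit_cache(5_000, 3)
print(f"without cache: ${uncached:.5f}, with explicit cache: ${cached:.5f}")
```

With a single call the cached path costs more, because the write price exceeds the input price; under these prices the saving starts with the second call that reuses the prefix.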

Vendors differ in which cache types they support. The matrix below summarizes current support:

Vendor | Explicit Cache | Implicit Cache
Anthropic Claude | Supported (5-minute and 1-hour write tiers) | Not supported
OpenAI GPT | Not supported | Supported
Qwen | Supported | Partially supported (3.5 series supported; 3.6 / 3.7 series not yet available)
DeepSeek | Partially supported (v3.2 supported) | Supported
GLM | Partially supported (5.1 supported) | Supported
Kimi | Supported | Supported
MiniMax | Not supported | Supported

How Multimodal Embeddings Are Billed

Embedding models are priced separately by input data type. Text input uses a single input price, while image and video inputs are billed by the number of multimodal tokens at a unit price higher than text input.

Overseas Model List Prices

Anthropic Claude Series

Model | Input | Output | Explicit·Write·5min | Explicit·Write·1h | Explicit·Hit
claude-opus-4-7 | 5 | 25 | 6.25 | 10 | 0.5
claude-opus-4-6 | 5 | 25 | 6.25 | 10 | 0.5
claude-opus-4-5 | 5 | 25 | 6.25 | 10 | 0.5
claude-sonnet-4-6 | 3 | 15 | 3.75 | 6 | 0.3
claude-sonnet-4-5 | 3 | 15 | 3.75 | 6 | 0.3
claude-haiku-4-5 | 1 | 5 | 1.25 | 2 | 0.1

OpenAI GPT Series

Model | Context Window | Input | Output | Implicit·Hit
gpt-5.4-pro | ≤272K | 30 | 180 | -
gpt-5.4-pro | >272K | 60 | 270 | -
gpt-5.4 | ≤272K | 2.5 | 15 | 0.25
gpt-5.4 | >272K | 5 | 22.5 | 0.5
gpt-5.4-mini | - | 0.75 | 4.5 | 0.075
gpt-5.5 | ≤272K | 5 | 30 | 0.5
gpt-5.5 | >272K | 10 | 45 | 1
gpt-5.2-pro | - | 21 | 168 | -
gpt-5.2-2025-12-11 | 1M tokens | 1.75 | 14 | 0.175
gpt-5-2025-08-07 | 1M tokens | 1.25 | 10 | 0.125
gpt-5-mini-2025-08-07 | 1M tokens | 0.25 | 2 | 0.025
gpt-5-nano-2025-08-07 | 1M tokens | 0.05 | 0.4 | 0.005
gpt-4.1-2025-04-14 | 1M tokens | 2 | 8 | 0.5
gpt-4.1-mini-2025-04-14 | 1M tokens | 0.4 | 1.6 | 0.1
gpt-4o-2024-08-06 | 128K tokens | 2.5 | 10 | 1.25
o4-mini | - | 1.1 | 4.4 | 0.275
o3 | - | 2 | 8 | 0.5

Qwen Series

Sub-series | Model | Context Window | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit
Qwen Max | qwen3.7-max | 0–1M | 2.5 | 7.5 | 3.125 | 0.25 | 0.5
Qwen Max | qwen3.6-max-preview | 0–128K | 1.3 | 7.8 | 1.625 | 0.13 | -
Qwen Max | qwen3.6-max-preview | 128K–256K | 2 | 12 | 2.5 | 0.2 | -
Qwen Max | qwen3-max | 0–32K | 1.2 | 6 | 1.5 | 0.12 | 0.24
Qwen Max | qwen3-max | 32K–128K | 2.4 | 12 | 3 | 0.24 | 0.48
Qwen Max | qwen3-max | 128K–256K | 3 | 15 | 3.75 | 0.3 | 0.6
Qwen Max | qwen3-max-preview | 0–32K | 1.2 | 6 | 1.5 | 0.12 | 0.24
Qwen Max | qwen3-max-preview | 32K–128K | 2.4 | 12 | 3 | 0.24 | 0.48
Qwen Max | qwen3-max-preview | 128K–256K | 3 | 15 | 3.75 | 0.3 | 0.6
Qwen Plus | qwen3.6-plus | 0–256K | 0.5 | 3 | 0.625 | 0.05 | -
Qwen Plus | qwen3.6-plus | 256K–1M | 2 | 6 | 2.5 | 0.2 | -
Qwen Plus | qwen3.5-plus | 0–256K | 0.4 | 2.4 | 0.5 | 0.04 | 0.08
Qwen Plus | qwen3.5-plus | 256K–1M | 0.5 | 3 | 0.625 | 0.05 | 0.1
Qwen Flash | qwen3.6-flash | 0–256K | 0.25 | 1.5 | 0.3125 | 0.025 | -
Qwen Flash | qwen3.6-flash | 256K–1M | 1 | 4 | 1.25 | 0.1 | -
Qwen Flash | qwen3.5-flash | 0–1M | 0.1 | 0.4 | 0.125 | 0.01 | -
Qwen Embedding | qwen3-vl-embedding (text input) | - | 0.1 | - | - | - | -
Qwen Embedding | qwen3-vl-embedding (image / video input) | - | 0.258 | - | - | - | -

DeepSeek Series

Model | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit
deepseek-v3.2 | 0.57 | 1.71 | 0.7125 | 0.057 | 0.114
deepseek-v4-flash | 0.2 | 0.4 | - | - | -
deepseek-v4-pro | 2.4 | 4.8 | - | - | -

GLM Series

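Model | Context Window | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit
glm-4.7 | 0–32K | 0.431 | 2.007 | - | - | 0.0862
glm-4.7 | 32K–166K | 0.574 | 2.294 | - | - | 0.1148
glm-5 | 0–32K | 0.573 | 2.58 | - | - | 0.1146
glm-5 | 32K–166K | 0.86 | 3.154 | - | - | 0.172
glm-5.1 | 0–32K | 0.825 | 3.301 | 1.03125 | 0.0825 | 0.165
glm-5.1 | 32K–200K | 1.1 | 3.851 | 1.375 | 0.11 | 0.22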

Kimi Series

Model | Input | Output | Explicit·Write | Explicit·Hit | Implicit·Hit
kimi-k2.5 | 0.574 | 3.011 | 0.7175 | 0.0574 | 0.1148
kimi-k2.6 | 0.8939 | 3.7131 | 1.117375 | 0.08939 | 0.17878

MiniMax Series

Model | Input | Output | Implicit·Hit
MiniMax-M2.5 | 0.304 | 1.213 | 0.0608

Billing Examples

Text model: Calling claude-sonnet-4-6 with 10,000 input tokens and 2,000 output tokens, no cache hit:

10000 / 1,000,000 × 3 + 2000 / 1,000,000 × 15 = $0.06

Text model with explicit cache: Calling claude-haiku-4-5. The first request writes 5,000 tokens to the 5-minute cache tier and outputs 1,000 tokens. The second request hits the cache for 5,000 tokens, with 500 new input tokens and 1,000 output tokens:

First call: 5000 / 1,000,000 × 1.25 (Write·5min) + 1000 / 1,000,000 × 5 = $0.00625 + $0.005 = $0.01125
Second call: 5000 / 1,000,000 × 0.1 (Hit) + 500 / 1,000,000 × 1 + 1000 / 1,000,000 × 5 = $0.0005 + $0.0005 + $0.005 = $0.006

Text model with implicit cache: Calling gpt-5-2025-08-07 with 10,000 input tokens (of which 6,000 hit the implicit cache) and 2,000 output tokens:

Cached portion: 6000 / 1,000,000 × 0.125 = $0.00075
Uncached input: 4000 / 1,000,000 × 1.25 = $0.005
Output: 2000 / 1,000,000 × 10 = $0.02
Total: $0.02575
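The three examples above follow one pattern: multiply each token count by its unit price, sum, and divide by 1,000,000. A minimal sketch of that pattern follows. The function name call_cost and the price dictionary keys are illustrative and not part of any AI Gateway API; the prices are the list prices used in the examples.

```python
TOKENS_PER_UNIT = 1_000_000  # list prices are quoted per one million tokens


def call_cost(prices, input_tokens=0, output_tokens=0, write_tokens=0, hit_tokens=0):
    """Return the USD cost of one call.

    `prices` maps 'input', 'output', 'write', and 'hit' to USD per 1M tokens.
    `input_tokens` counts only prompt tokens that were neither written to
    nor served from the cache. A missing price key is treated as 0.
    """
    total = (
        input_tokens * prices.get("input", 0.0)
        + output_tokens * prices.get("output", 0.0)
        + write_tokens * prices.get("write", 0.0)
        + hit_tokens * prices.get("hit", 0.0)
    )
    return total / TOKENS_PER_UNIT


# List prices from the tables on this page (USD per 1M tokens).
SONNET_4_6 = {"input": 3.0, "output": 15.0}
HAIKU_4_5 = {"input": 1.0, "output": 5.0, "write": 1.25, "hit": 0.1}  # 5-minute write tier
GPT_5 = {"input": 1.25, "output": 10.0, "hit": 0.125}  # gpt-5-2025-08-07

plain = call_cost(SONNET_4_6, input_tokens=10_000, output_tokens=2_000)
first = call_cost(HAIKU_4_5, write_tokens=5_000, output_tokens=1_000)
second = call_cost(HAIKU_4_5, hit_tokens=5_000, input_tokens=500, output_tokens=1_000)
implicit = call_cost(GPT_5, hit_tokens=6_000, input_tokens=4_000, output_tokens=2_000)

for label, cost in [("plain", plain), ("first", first), ("second", second), ("implicit", implicit)]:
    print(f"{label}: ${cost:.5f}")
```

Keeping the uncached, written, and hit token counts as separate arguments mirrors how the price tables split the input side into separate columns.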