Global, Smart
Token service.

We abstract AI inference into a standardized, metered, and governable Smart Token Service that connects the major foundation models and runs across multiple GPU and xPU vendors.

8 Global cities
<300ms P50 first token
99.9% SLA target
4+ GPU / xPU vendors
One API · every major foundation model
OpenAI · Anthropic · Google · DeepSeek · Qwen · Doubao · Llama · Mistral · Kimi · Zhipu · Grok · Gemini

Three lines of code.
That's it.

We speak OpenAI. If your code already calls openai.chat.completions, change the base URL and API key and you're done. Switch models with a string. No SDK lock-in.

OpenAI-compatible · works with OpenAI SDKs, Claude Code, Cursor, Codex, LangChain
Every major model · DeepSeek, Qwen, Doubao, Gemini, Llama, Mistral, Kimi, all on one base URL
Auto-failover · we route around outages so your agent never blinks
example.py
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.smartoken.com/v1",
    api_key="sk-••••••••••••••",
)

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        {"role": "user", "content": "Explain TaaS in one tweet."}
    ],
)

print(response.choices[0].message.content)
● 200 OK · routed → sg-2 / deepseek-v3
TTFT 38ms · 247 tokens
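
To make "switch models with a string" concrete, here is a minimal sketch that builds two chat completion requests using only the Python standard library. It assumes the `/chat/completions` path that OpenAI-compatible endpoints conventionally expose under the base URL; the `build_chat_request` helper and the placeholder key are illustrative and not part of any SDK. Nothing is sent over the network.

```python
import json
import urllib.request

BASE_URL = "https://api.smartoken.com/v1"  # base URL from example.py
API_KEY = "sk-your-key"                     # placeholder, not a real key


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only the model string differs between these two requests.
req_a = build_chat_request("deepseek-v3", "Explain TaaS in one tweet.")
req_b = build_chat_request("llama-3.3-70b", "Explain TaaS in one tweet.")

for req in (req_a, req_b):
    body = json.loads(req.data)
    print(req.full_url, "->", body["model"])
```

Passing either request to `urllib.request.urlopen` would send it; with the `openai` SDK the same switch is just a different `model=` argument.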

Live in four minutes,
not four sprints.

From signup to first token, the whole thing takes about as long as one cup of coffee.

01

Sign up

Sign up with email or GitHub.

~30 sec
02

Top up credits

WeChat Pay & Stripe supported. Pay-as-you-go.

~1 min
03

Get API key

Create scoped API keys per project.

~30 sec
04

Plug into your agent

Point your agent's OpenAI-compatible base URL at our endpoint and paste in your key.

~2 min
Works with: Claude Code · Codex · Cursor · OpenClaw · Hermes · OpenRouter · LangChain · Cline · Continue · Aider

Every model
that matters.

Browse full catalog →
DeepSeek V3 · deepseek-v3 · Hot
671B MoE flagship. Strong reasoning at a fraction of frontier-model pricing.
Per 1M tokens: $0.14 in / $0.28 out · Ctx: 128K

Qwen2.5 Max · qwen-max · CN
Alibaba's flagship. Strong performance on Chinese and multilingual tasks.
Per 1M tokens: $1.60 in / $6.40 out · Ctx: 32K

Doubao Pro · doubao-pro-256k · CN
ByteDance flagship. 256K context at competitive cost.
Per 1M tokens: $0.40 in / $1.20 out · Ctx: 256K

Gemini 2.0 Flash · gemini-2.0-flash · Multimodal
Google's fast multimodal. Vision, audio, and tool use built in.
Per 1M tokens: $0.10 in / $0.40 out · Ctx: 1M

Llama 3.3 70B · llama-3.3-70b · OSS
Meta's open flagship. Self-hostable. Strong reasoning, no lock-in.
Per 1M tokens: $0.55 in / $0.75 out · Ctx: 128K

Mixtral 8x7B · mixtral-8x7b @ groq · Fast
LPU-accelerated. 500+ tok/s. Real-time voice agents, low-latency UX.
Per 1M tokens: $0.24 in / $0.24 out · Ctx: 32K
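
As a rough worked example of how the per-1M-token rates translate into a bill, the sketch below estimates cost for a hypothetical workload. The rates are copied from the cards above; the workload size and the `estimate_cost` helper are assumptions for illustration, and actual invoices may differ (for example through volume discounts).

```python
from decimal import Decimal

# USD per 1M tokens, copied from the catalog cards: (input rate, output rate).
RATES = {
    "deepseek-v3": (Decimal("0.14"), Decimal("0.28")),
    "qwen-max": (Decimal("1.60"), Decimal("6.40")),
    "doubao-pro-256k": (Decimal("0.40"), Decimal("1.20")),
    "gemini-2.0-flash": (Decimal("0.10"), Decimal("0.40")),
    "llama-3.3-70b": (Decimal("0.55"), Decimal("0.75")),
}
TOKENS_PER_UNIT = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost, metering input and output separately."""
    rate_in, rate_out = RATES[model]
    return (rate_in * input_tokens + rate_out * output_tokens) / TOKENS_PER_UNIT


# Hypothetical workload: 2M input tokens and 500K output tokens.
for model in RATES:
    cost = estimate_cost(model, 2_000_000, 500_000)
    print(f"{model:18s} ${cost:.2f}")
```

Using `Decimal` rather than floats keeps the arithmetic exact, which matters once small per-call amounts are summed over many requests.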

Built for production,
not for slides.

A smart token factory: broad model coverage, accurate metering, smart failover, and infrastructure that lets your data stay where it should.

Global edge inference

Requests stay in-region. Singapore traffic processes in Singapore, London in London — no cross-border data transfer required for compliance-sensitive workloads.

One API, every modality

Doubao, Qwen, DeepSeek, Gemini, Llama, plus enterprise proprietary models — across text, image, video, and multimodal. OpenAI-compatible. Switch by changing one string.

Unified accelerator layer

A single abstraction over NVIDIA GPUs, AMD GPUs, Google TPUs, and Groq LPUs. Self-hosted optimized inference stack (vLLM, SGLang, TensorRT-LLM) with distributed scheduling for predictable performance.

Smart routing & failover

When an upstream provider degrades, traffic shifts within seconds to the next-best endpoint. Same model name, no app-level changes — your service stays up while their incident gets fixed.
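
The failover idea can be illustrated with a small client-side sketch: try endpoints in order and return the first successful reply. This is not the platform's routing implementation, which runs server-side and is not described here; `UpstreamError`, `call_with_failover`, and the endpoint name `sg-1` are hypothetical, and the two stub functions stand in for real network calls.

```python
from typing import Callable, Sequence


class UpstreamError(Exception):
    """Raised by an endpoint callable when its provider is degraded."""


def call_with_failover(
    endpoints: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try each (name, send) pair in order; return (name, reply) on first success."""
    errors = []
    for name, send in endpoints:
        try:
            return name, send(prompt)
        except UpstreamError as exc:
            errors.append(f"{name}: {exc}")  # remember why this endpoint was skipped
    raise UpstreamError("all endpoints failed: " + "; ".join(errors))


def degraded(prompt: str) -> str:
    # Stub simulating an upstream incident.
    raise UpstreamError("503 from provider")


def healthy(prompt: str) -> str:
    # Stub simulating a working endpoint.
    return f"echo: {prompt}"


served_by, reply = call_with_failover([("sg-1", degraded), ("sg-2", healthy)], "ping")
print(served_by, reply)
```

The caller keeps using the same model name throughout; only the endpoint that serves the request changes.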

Token-precise metering

Pay-as-you-go on actual usage. Input/output tokens metered separately, broken down by project, team, and key. No credit card required to start, no minimum commit.
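
To show what a per-project or per-key breakdown might look like, here is a minimal sketch that aggregates usage records with input and output tokens kept separate. The `UsageRecord` shape, the field names, and the sample numbers are assumptions for illustration; the console's actual export format is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    project: str
    api_key: str
    input_tokens: int
    output_tokens: int


def totals_by(records, field: str) -> dict:
    """Sum (input_tokens, output_tokens) grouped by the given record field."""
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        bucket = totals[getattr(record, field)]
        bucket[0] += record.input_tokens
        bucket[1] += record.output_tokens
    return {key: tuple(value) for key, value in totals.items()}


# Hypothetical usage log.
records = [
    UsageRecord("search", "key-a", 1200, 300),
    UsageRecord("search", "key-b", 800, 100),
    UsageRecord("chatbot", "key-c", 500, 900),
]

by_project = totals_by(records, "project")
by_key = totals_by(records, "api_key")
print(by_project)
print(by_key)
```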

Enterprise-ready ops

SSO, audit logs, per-key rate limits, IP allow-lists, and a single invoice across every model. SLA-backed uptime with status page and incident playbook.

Pay for tokens.
Nothing else.

One simple model: pay per token used. No seats, no platform fee, no minimum commit. Sign up, top up, ship.

★ For most teams
Pay-as-you-go
$0 to start, then per-token

Per-call billing on actual token usage. Pricing per model is published — no hidden markup, no platform subscription. Top up any amount to begin.

  • Access to every major model, every modality
  • Per-token metering with itemized invoices
  • Smart routing & automatic failover included
  • No credit card required, no minimum spend
Enterprise
Custom

For high-volume workloads, regulated industries, and teams that need dedicated capacity. Volume pricing, custom SLAs, and tailored deployment options.

  • Volume-based pricing & reserved capacity
  • Dedicated endpoints with stable latency
  • Custom SLA, audit logs, SSO & RBAC
  • Region pinning & on-prem options
  • Named solutions engineer, 24/7 support
Per-model rates published in console · Self-serve billing: view & download itemized invoices in the console · Volume discounts auto-applied

Not a router.
A platform you can run a business on.

Built for the gap between a working demo and a service that handles real production traffic — smart routing, edge acceleration, deep inference tuning, and data safety as defaults.

Operated, not just routed

We host and operate critical models on our own infrastructure.

Routing-only platforms inherit every upstream incident. We absorb them — mirrored deployments, warm capacity, and engineers paged before your users notice.

99.9% SLA target
8 Global PoPs
All Modalities

Smart routing across providers

Each request is routed by latency, cost, or quality to the best-fit endpoint. Pin fallback chains so an upstream incident degrades — instead of breaks — your product.
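
One way to picture policy-based routing is as a sort over candidate endpoints: the first healthy entry serves the request and the rest form the fallback chain. The sketch below is illustrative only; the `Endpoint` fields, the `quality_score` metric, the endpoint names, and all numbers are assumptions, not the platform's actual routing signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    p50_latency_ms: float
    usd_per_1m_output: float
    quality_score: float  # higher is better; hypothetical metric
    healthy: bool = True


# Each policy maps an endpoint to a sort key where lower is better.
POLICIES = {
    "latency": lambda e: e.p50_latency_ms,
    "cost": lambda e: e.usd_per_1m_output,
    "quality": lambda e: -e.quality_score,
}


def rank(endpoints, policy: str) -> list:
    """Return healthy endpoints, best first; index 0 is primary, the rest are fallbacks."""
    return [e for e in sorted(endpoints, key=POLICIES[policy]) if e.healthy]


candidates = [
    Endpoint("edge-a", 120, 0.40, 0.70),
    Endpoint("edge-b", 300, 0.28, 0.85),
    Endpoint("edge-c", 90, 1.20, 0.90, healthy=False),  # currently degraded
]

for policy in POLICIES:
    print(policy, [e.name for e in rank(candidates, policy)])
```

Because the degraded endpoint is filtered out rather than failing the request, an incident changes which endpoint answers instead of whether one does.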

Global network acceleration

Multi-region PoPs with anycast ingress and warm pools — requests land on the nearest edge, cutting TLS round-trips and cold-start tax for users in every market.

Inference-engine optimizations

Prefill–decode disaggregation, continuous batching, paged KV cache, speculative decoding — production-grade tuning that lifts throughput and pushes TTFT down on the same hardware.
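
Of the techniques listed, the paged KV cache is the easiest to sketch in a few lines. The toy allocator below shows the core bookkeeping idea: cache memory is split into fixed-size blocks, and each sequence gets a new block only when its last one fills up. It is a simplified illustration with hypothetical names, and says nothing about how the production stack (or engines such as vLLM) implements it.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.lengths: dict[str, int] = {}             # seq id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token; return the physical block that holds it."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")      # 3 tokens need 2 blocks
for _ in range(2):
    alloc.append_token("b")      # 2 tokens fit in 1 block
blocks_a = len(alloc.block_tables["a"])
free_before = len(alloc.free_blocks)
alloc.release("a")               # blocks become reusable immediately
free_after = len(alloc.free_blocks)
print(blocks_a, free_before, free_after)
```

Allocating on demand instead of reserving the full context window per request is what lets more sequences share the same memory.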

Zero-retention & data safety

Prompts and outputs are not persisted by default. Region pinning keeps data in-jurisdiction, with per-key audit logs and signed request envelopes for sensitive workloads.
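
The envelope format is not specified on this page, so the following is only a generic sketch of what signing a request can look like: an HMAC-SHA256 over a timestamp and a canonical JSON body. The field names, the shared-secret scheme, and both helper functions are assumptions for illustration, not the service's actual protocol.

```python
import hashlib
import hmac
import json


def sign_envelope(secret: bytes, body: dict, timestamp: int) -> dict:
    """Wrap a request body with a timestamp and an HMAC-SHA256 signature."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    message = f"{timestamp}.{payload}".encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"timestamp": timestamp, "payload": payload, "signature": signature}


def verify_envelope(secret: bytes, envelope: dict) -> bool:
    """Recompute the signature and compare in constant time."""
    message = f"{envelope['timestamp']}.{envelope['payload']}".encode("utf-8")
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])


secret = b"demo-secret"  # placeholder; never hard-code real secrets
envelope = sign_envelope(secret, {"model": "deepseek-v3", "prompt": "hi"}, 1700000000)
tampered = dict(envelope, payload=envelope["payload"].replace("hi", "bye"))

print(verify_envelope(secret, envelope))
print(verify_envelope(secret, tampered))
```

Including the timestamp in the signed message lets a verifier reject stale or replayed envelopes as well as modified ones.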

Common questions.

Don't see what you're looking for? Our docs go deeper, or talk to a human.

Read the docs
What is Token-as-a-Service (TaaS)?
TaaS is a unified API platform for accessing leading proprietary and open-source AI models — across LLM, image, video, and audio — through a single OpenAI-compatible endpoint. You pay for the tokens you use; we handle model hosting, capacity, scheduling, and failover.
How is Smartoken different from a typical API aggregator?
Most aggregators are thin proxies: when an upstream vendor degrades, your app degrades with it. We host and operate critical models on our own infrastructure, run smart routing across providers, and back the whole thing with an SLA. You get broader coverage, higher reliability, and a single invoice.
When should teams use TaaS instead of self-hosting?
If your team's core value isn't running GPU clusters, TaaS gets you to production faster: no provisioning, no batching/scheduling code, no on-call rotation for inference outages. Use TaaS for the 80% of workloads that benefit from elasticity, and bring up dedicated endpoints when a specific workload justifies it.
How does pricing actually work?
Per-token, pay-as-you-go. Each model has a published per-million-token rate for input and output. You top up an account balance and we deduct as you call the API. No platform fee, no seat cost, no minimum commit. Volume discounts apply automatically; high-volume customers can negotiate dedicated rates.
What about data privacy and compliance?
By default, requests stay in the region where they originate — Singapore traffic processes in Singapore, London in London. For sensitive workloads, enable zero-retention mode (no prompt or output storage), region pinning, and per-key audit logs. We support PDPA, GDPR, and financial-grade compliance requirements out of the box.
How do I migrate from OpenAI / another provider?
Change your base_url and API key — that's the entire migration. Our endpoint is OpenAI-compatible (chat completions, embeddings, function calling, streaming). Your existing SDK, prompts, and tooling work unchanged. Switch back any time; we don't lock your data.
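
Since streaming is part of the compatibility surface, here is a minimal sketch of reading an OpenAI-style streamed response: server-sent `data:` lines carrying JSON chunks with `choices[0].delta.content`, ending with `data: [DONE]`. The sample lines are hand-written for illustration rather than captured from the service, and `iter_stream_text` is a hypothetical helper; with the `openai` SDK you would normally pass `stream=True` and let the SDK do this parsing.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (not real service output).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "data: [DONE]",
]

reply = "".join(iter_stream_text(sample))
print(reply)
```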
How do I move from PAYG to dedicated infrastructure?
Same account, same API. When a workload outgrows shared capacity — predictable latency requirements, regulated environments, or sustained high QPS — we provision dedicated endpoints alongside your existing keys. Migration is a config change, not a re-platforming project.
Which models are available?
Open-source: DeepSeek, Qwen, Llama, Mistral, and dozens more across text, vision, and audio. Proprietary: Gemini, Doubao, and select enterprise models, with new providers added regularly. The full catalog is in the model library, searchable by capability, context window, and price.

Stop juggling
vendor APIs.

Ship one integration. Get every major model, every region, and every price tier — through a single OpenAI-compatible endpoint.

no credit card · pay-as-you-go · cancel anytime