Back to Developer Zone

Alibaba Cloud Models

Explore all 11 models from Alibaba Cloud with detailed pricing, pros & cons, and developer recommendations.

11
Models
$0.0000
Lowest Input
1M
Max Context
3
Quality Tiers

Quick Recommendations

Best Value: Qwen-RobotWorld ($0.0000/1M)
Best Quality: Qwen3.7-Max

Qwen3.7-Max

Flagship

Long-horizon agent workflows, coding agents, complex reasoning

Official Pricing

When to use: Frontier agent workloads requiring long autonomous runs, complex multi-step coding tasks, and deep research analysis.

Upgrade Highlights

  • 1M token context — removes limits on document-heavy agent work
  • 65K max output — massive single-turn generation
  • Sustained 35-hour autonomous kernel optimization (1,158 tool calls)
  • SWE-Verified 80.4, LiveCodeBench 91.6 — rivals Claude Opus 4.6
  • OpenAI + Anthropic API compatible — drop-in replacement
Input Price
$2.50
per 1M tokens
Output Price
$7.50
per 1M tokens
Cached Input
$0.250
per 1M tokens
Batch Input
per 1M tokens
Context Window: 1M
Max Output: 65,536 tokens
Knowledge Cutoff: 2026-05
VisionFunction CallingFine-tuningJSON Mode

Pros

  • 1M context window for document-heavy agent work
  • 65K max output — longest in Qwen family
  • Cross-harness compatibility (Claude Code, OpenClaw, Qwen Code)
  • 35-hour sustained autonomous execution
  • Competitive with Claude Opus 4.6 on coding benchmarks

Cons

  • Proprietary — no open weights or self-hosting
  • Higher cost than Qwen 3.6 line
  • No vision support
  • API-only access

Performance

Output Speed~55 tok/s
Rate Limit2,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU-Pro
89.6%
LiveCodeBench
91.6%
SWE-Verified
80.4%
GPQA Diamond
92.4%

Qwen3.7-Plus

Mid-tier

Multimodal tasks, cost-effective agent deployment

Official Pricing

When to use: Cost-effective multimodal deployments needing video and image understanding alongside text, with long context requirements.

Upgrade Highlights

  • Multimodal input: text + video + image in one model
  • 1M context at $0.40/1M — 6x cheaper than Qwen3.7-Max
  • Strong agent capability at mid-tier cost
  • OpenAI-compatible API
Input Price
$0.400
per 1M tokens
Output Price
$1.60
per 1M tokens
Cached Input
$0.100
per 1M tokens
Batch Input
per 1M tokens
Context Window: 1M
Max Output: 16,384 tokens
Knowledge Cutoff: 2026-05
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • 1M context at mid-tier pricing
  • Multimodal: text, video, and image input
  • Strong speed-capability balance
  • Proprietary but very affordable

Cons

  • Proprietary — no self-hosting
  • Less capable than Qwen3.7-Max on complex reasoning
  • 16K max output

Performance

Output Speed~80 tok/s
Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU-Pro
84.2%
LiveCodeBench
78.5%
MMMU
72.1%

Qwen3-235B-A22B

Flagship

Complex reasoning, multilingual tasks

Official Pricing

When to use: Best value flagship for multilingual workloads, complex reasoning, and cost-sensitive production deployments.

Upgrade Highlights

  • MoE architecture: 235B params, only 22B active — GPT-4 class at 1/10 the price
  • 131K context — handles long documents and codebases
  • 100+ language support — best-in-class for non-English tasks
  • Open-source: full weights on HuggingFace for self-hosting
  • $0.40/$1.20 per 1M tokens — undercuts GPT-4o by 90%
Input Price
$0.400
per 1M tokens
Output Price
$1.20
per 1M tokens
Cached Input
$0.100
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • MoE 235B total / 22B active — flagship performance at low cost
  • 131K context window
  • Strong multilingual (100+ languages)
  • Open-source weights available

Cons

  • No vision support
  • Max output 8K tokens
  • Less ecosystem integration than GPT-4

Performance

Output Speed~70 tok/s
Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU
86.8%
LiveCodeBench
63.7%
IFEval
86.2%

Agents Using This Model

2

Qwen3-30B-A3B

Mid-tier

Efficient multilingual inference

Official Pricing

When to use: High-throughput multilingual tasks where cost efficiency matters most.

Upgrade Highlights

  • Only 3B active params — runs on consumer GPUs
  • 131K context at $0.15/1M input — cheapest long-context option
  • Open-source for full customization
  • Strong function calling for agent workflows
Input Price
$0.150
per 1M tokens
Output Price
$0.600
per 1M tokens
Cached Input
$0.040
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • MoE 30B total / 3B active — ultra-efficient
  • 131K context
  • Excellent cost-performance ratio
  • Open-source

Cons

  • Smaller active params limit complex reasoning
  • No vision
  • 8K max output

Performance

Output Speed~120 tok/s
Rate Limit10,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU
78.5%
LiveCodeBench
48.2%

Qwen3-32B

Mid-tier

Balanced performance and cost

Official Pricing

When to use: When you need reliable dense model performance for coding and general tasks.

Upgrade Highlights

  • Dense 32B architecture — no MoE routing overhead
  • 131K context for long-form content
  • Strong coding: LiveCodeBench 55.3%
  • Open-source with full HuggingFace support
Input Price
$0.200
per 1M tokens
Output Price
$0.600
per 1M tokens
Cached Input
$0.050
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • Dense 32B — consistent performance
  • 131K context
  • Strong coding ability
  • Open-source

Cons

  • No vision
  • 8K max output
  • Higher latency than MoE variants

Performance

Output Speed~65 tok/s
Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU
83.2%
LiveCodeBench
55.3%

Qwen3-14B

Lite

Lightweight general tasks

Official Pricing

When to use: Budget-friendly option for summarization, translation, and simple Q&A.

Upgrade Highlights

  • 14B dense — fits on single GPU
  • 131K context at just $0.10/1M input
  • Good enough for most everyday tasks
  • Open-source for fine-tuning
Input Price
$0.100
per 1M tokens
Output Price
$0.300
per 1M tokens
Cached Input
$0.030
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • Compact 14B dense model
  • 131K context
  • Very low cost
  • Open-source

Cons

  • Limited complex reasoning
  • No vision
  • 8K max output

Performance

Output Speed~90 tok/s
Rate Limit10,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU
77.1%
LiveCodeBench
42.8%

Qwen3-8B

Lite

Edge deployment, simple tasks

Official Pricing

When to use: Edge devices, local deployment, or ultra-low-cost batch processing.

Upgrade Highlights

  • 8B params — runs on RTX 3060 or equivalent
  • $0.05/1M input — among the cheapest available
  • 131K context despite small size
  • Ideal for local/offline deployment
Input Price
$0.050
per 1M tokens
Output Price
$0.150
per 1M tokens
Cached Input
$0.010
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • Tiny 8B — runs on laptop GPUs
  • 131K context
  • Extremely cheap
  • Open-source

Cons

  • Basic reasoning only
  • No vision
  • 8K max output

Performance

Output Speed~150 tok/s
Rate Limit20,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU
71.5%
LiveCodeBench
33.1%

Qwen-VL-Plus

Mid-tier

Multimodal understanding, document analysis

Official Pricing

When to use: Document analysis, image captioning, visual Q&A, and multimodal RAG pipelines.

Upgrade Highlights

  • Native multimodal — processes images and text together
  • 131K context handles multi-page documents
  • Strong OCR: chart, table, and diagram understanding
  • Multilingual VQA across 100+ languages
Input Price
$0.200
per 1M tokens
Output Price
$0.800
per 1M tokens
Cached Input
$0.050
per 1M tokens
Batch Input
per 1M tokens
Context Window: 131K
Max Output: 8,192 tokens
Knowledge Cutoff: 2025-04
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • Native vision-language model
  • 131K context with images
  • Strong document OCR and chart understanding
  • Multilingual VQA

Cons

  • No fine-tuning
  • 8K max output
  • Higher cost than text-only Qwen3

Performance

Output Speed~55 tok/s
Rate Limit3,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMMU
68.2%
MathVista
62.5%

Qwen-RobotManip

Flagship

Robotic manipulation, dexterous hand control

Official Pricing

When to use: For robotic manipulation tasks: grasping, assembly, and dexterous hand control in research and industrial settings.

Upgrade Highlights

  • First Qwen-Robot VLA manipulation model
  • 38,100+ hours of open-source training data
  • Unified state-action space across robot types
  • Camera-frame end-effector incremental pose control
  • Part of complete Qwen-Robot Suite (Manip + Nav + World)
Input Price
$0.0000
per 1M tokens
Output Price
$0.0000
per 1M tokens
Cached Input
per 1M tokens
Batch Input
per 1M tokens
Context Window: 0
Max Output: 0 tokens
Knowledge Cutoff: 2026-06
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • VLA model for precise robotic manipulation
  • 38,100+ hours of training from open-source data
  • Multi-robot-type support via unified action space
  • Open-source under Apache 2.0

Cons

  • Specialized for robotics — not a general LLM
  • Requires robot hardware or simulator for deployment
  • No text generation capabilities
  • Very new — limited community adoption

Performance

Output Speed
Rate Limit

Multimodal

Image InputImage OutputAudio InputAudio Output

Qwen-RobotNav

Flagship

Robot navigation, path planning, autonomous mobility

Official Pricing

When to use: For mobile robot navigation: instruction-following, point navigation, object tracking, and autonomous driving tasks.

Upgrade Highlights

  • VLN model: vision-language navigation for physical agents
  • Unified 4 task types: instruction, point/goal, tracking, driving
  • Controlled observation encoding + tool interface
  • Open-source: full weights for customization
  • Part of complete Qwen-Robot Suite (Manip + Nav + World)
Input Price
$0.0000
per 1M tokens
Output Price
$0.0000
per 1M tokens
Cached Input
per 1M tokens
Batch Input
per 1M tokens
Context Window: 0
Max Output: 0 tokens
Knowledge Cutoff: 2026-06
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • VLN model unifying 4 navigation task types
  • Controlled observation encoding for 3D spatial awareness
  • Covers instruction following, point/goal navigation, tracking, driving
  • Open-source under Apache 2.0

Cons

  • Specialized for robotics navigation only
  • Requires robot hardware or simulator
  • No text generation
  • Very new — limited real-world validation

Performance

Output Speed
Rate Limit

Multimodal

Image InputImage OutputAudio InputAudio Output

Qwen-RobotWorld

Flagship

Physical world prediction, robot planning

Official Pricing

When to use: For robot planning and world simulation: predicting outcomes of actions across manipulation, driving, and navigation scenarios.

Upgrade Highlights

  • World model: predicts physically plausible futures
  • Cross-scene: works across manipulation, driving, navigation
  • Natural language action interface
  • Open-source: full weights for research and deployment
  • Part of complete Qwen-Robot Suite (Manip + Nav + World)
Input Price
$0.0000
per 1M tokens
Output Price
$0.0000
per 1M tokens
Cached Input
per 1M tokens
Batch Input
per 1M tokens
Context Window: 0
Max Output: 0 tokens
Knowledge Cutoff: 2026-06
VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

  • World model for predicting physically plausible futures
  • Cross-scene: manipulation, driving, and navigation
  • Natural language action interface for intuitive control
  • Open-source under Apache 2.0

Cons

  • Specialized for world simulation only
  • No text generation or robot control
  • Requires integration with Manip/Nav for full stack
  • Very new — limited benchmarks available

Performance

Output Speed
Rate Limit

Multimodal

Image InputImage OutputAudio InputAudio Output

Side-by-Side Comparison

ModelTierInputOutputContext
Qwen3.7-MaxFlagship$2.50$7.501M
Qwen3.7-PlusMid-tier$0.400$1.601M
Qwen3-235B-A22BFlagship$0.400$1.20131K
Qwen3-30B-A3BMid-tier$0.150$0.600131K
Qwen3-32BMid-tier$0.200$0.600131K
Qwen3-14BLite$0.100$0.300131K
Qwen3-8BLite$0.050$0.150131K
Qwen-VL-PlusMid-tier$0.200$0.800131K
Qwen-RobotManipFlagship$0.0000$0.00000
Qwen-RobotNavFlagship$0.0000$0.00000
Qwen-RobotWorldFlagship$0.0000$0.00000