Alibaba Cloud Models

Explore all 11 models from Alibaba Cloud with detailed pricing, pros & cons, and developer recommendations.

Models

$0.0000

Lowest Input

Max Context

Quality Tiers

Quick Recommendations

Best Value: Qwen-RobotWorld ($0.0000/1M)

Best Quality: Qwen3.7-Max

Qwen3.7-Max

Flagship

Long-horizon agent workflows, coding agents, complex reasoning

Official Pricing

When to use: Frontier agent workloads requiring long autonomous runs, complex multi-step coding tasks, and deep research analysis.

Upgrade Highlights

◆1M token context — removes limits on document-heavy agent work
◆65K max output — massive single-turn generation
◆Sustained 35-hour autonomous kernel optimization (1,158 tool calls)
◆SWE-Verified 80.4, LiveCodeBench 91.6 — rivals Claude Opus 4.6
◆OpenAI + Anthropic API compatible — drop-in replacement

Input Price

$2.50

per 1M tokens

Output Price

$7.50

per 1M tokens

Cached Input

$0.250

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 1M

Max Output: 65,536 tokens

Knowledge Cutoff: 2026-05

VisionFunction CallingFine-tuningJSON Mode

Pros

1M context window for document-heavy agent work
65K max output — longest in Qwen family
Cross-harness compatibility (Claude Code, OpenClaw, Qwen Code)
35-hour sustained autonomous execution
Competitive with Claude Opus 4.6 on coding benchmarks

Cons

Proprietary — no open weights or self-hosting
Higher cost than Qwen 3.6 line
No vision support
API-only access

Performance

Output Speed~55 tok/s

Rate Limit2,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU-Pro

89.6%

LiveCodeBench

91.6%

SWE-Verified

80.4%

GPQA Diamond

92.4%

Agents Using This Model

QoderWork Qoder Cloud Agents JVS Agent Suite

Qwen3.7-Plus

Mid-tier

Multimodal tasks, cost-effective agent deployment

Official Pricing

When to use: Cost-effective multimodal deployments needing video and image understanding alongside text, with long context requirements.

Upgrade Highlights

◆Multimodal input: text + video + image in one model
◆1M context at $0.40/1M — 6x cheaper than Qwen3.7-Max
◆Strong agent capability at mid-tier cost
◆OpenAI-compatible API

Input Price

$0.400

per 1M tokens

Output Price

$1.60

per 1M tokens

Cached Input

$0.100

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 1M

Max Output: 16,384 tokens

Knowledge Cutoff: 2026-05

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

1M context at mid-tier pricing
Multimodal: text, video, and image input
Strong speed-capability balance
Proprietary but very affordable

Cons

Proprietary — no self-hosting
Less capable than Qwen3.7-Max on complex reasoning
16K max output

Performance

Output Speed~80 tok/s

Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU-Pro

84.2%

LiveCodeBench

78.5%

MMMU

72.1%

Qwen3-235B-A22B

Flagship

Complex reasoning, multilingual tasks

Official Pricing

When to use: Best value flagship for multilingual workloads, complex reasoning, and cost-sensitive production deployments.

Upgrade Highlights

◆MoE architecture: 235B params, only 22B active — GPT-4 class at 1/10 the price
◆131K context — handles long documents and codebases
◆100+ language support — best-in-class for non-English tasks
◆Open-source: full weights on HuggingFace for self-hosting
◆$0.40/$1.20 per 1M tokens — undercuts GPT-4o by 90%

Input Price

$0.400

per 1M tokens

Output Price

$1.20

per 1M tokens

Cached Input

$0.100

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

MoE 235B total / 22B active — flagship performance at low cost
131K context window
Strong multilingual (100+ languages)
Open-source weights available

Cons

No vision support
Max output 8K tokens
Less ecosystem integration than GPT-4

Performance

Output Speed~70 tok/s

Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU

86.8%

LiveCodeBench

63.7%

IFEval

86.2%

Agents Using This Model

Smolagents Dify

Qwen3-30B-A3B

Mid-tier

Efficient multilingual inference

Official Pricing

When to use: High-throughput multilingual tasks where cost efficiency matters most.

Upgrade Highlights

◆Only 3B active params — runs on consumer GPUs
◆131K context at $0.15/1M input — cheapest long-context option
◆Open-source for full customization
◆Strong function calling for agent workflows

Input Price

$0.150

per 1M tokens

Output Price

$0.600

per 1M tokens

Cached Input

$0.040

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

MoE 30B total / 3B active — ultra-efficient
131K context
Excellent cost-performance ratio
Open-source

Cons

Smaller active params limit complex reasoning
No vision
8K max output

Performance

Output Speed~120 tok/s

Rate Limit10,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU

78.5%

LiveCodeBench

48.2%

Qwen3-32B

Mid-tier

Balanced performance and cost

Official Pricing

When to use: When you need reliable dense model performance for coding and general tasks.

Upgrade Highlights

◆Dense 32B architecture — no MoE routing overhead
◆131K context for long-form content
◆Strong coding: LiveCodeBench 55.3%
◆Open-source with full HuggingFace support

Input Price

$0.200

per 1M tokens

Output Price

$0.600

per 1M tokens

Cached Input

$0.050

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

Dense 32B — consistent performance
131K context
Strong coding ability
Open-source

Cons

No vision
8K max output
Higher latency than MoE variants

Performance

Output Speed~65 tok/s

Rate Limit5,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU

83.2%

LiveCodeBench

55.3%

Qwen3-14B

Lite

Lightweight general tasks

Official Pricing

When to use: Budget-friendly option for summarization, translation, and simple Q&A.

Upgrade Highlights

◆14B dense — fits on single GPU
◆131K context at just $0.10/1M input
◆Good enough for most everyday tasks
◆Open-source for fine-tuning

Input Price

$0.100

per 1M tokens

Output Price

$0.300

per 1M tokens

Cached Input

$0.030

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

Compact 14B dense model
131K context
Very low cost
Open-source

Cons

Limited complex reasoning
No vision
8K max output

Performance

Output Speed~90 tok/s

Rate Limit10,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU

77.1%

LiveCodeBench

42.8%

Qwen3-8B

Lite

Edge deployment, simple tasks

Official Pricing

When to use: Edge devices, local deployment, or ultra-low-cost batch processing.

Upgrade Highlights

◆8B params — runs on RTX 3060 or equivalent
◆$0.05/1M input — among the cheapest available
◆131K context despite small size
◆Ideal for local/offline deployment

Input Price

$0.050

per 1M tokens

Output Price

$0.150

per 1M tokens

Cached Input

$0.010

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

Tiny 8B — runs on laptop GPUs
131K context
Extremely cheap
Open-source

Cons

Basic reasoning only
No vision
8K max output

Performance

Output Speed~150 tok/s

Rate Limit20,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMLU

71.5%

LiveCodeBench

33.1%

Qwen-VL-Plus

Mid-tier

Multimodal understanding, document analysis

Official Pricing

When to use: Document analysis, image captioning, visual Q&A, and multimodal RAG pipelines.

Upgrade Highlights

◆Native multimodal — processes images and text together
◆131K context handles multi-page documents
◆Strong OCR: chart, table, and diagram understanding
◆Multilingual VQA across 100+ languages

Input Price

$0.200

per 1M tokens

Output Price

$0.800

per 1M tokens

Cached Input

$0.050

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 131K

Max Output: 8,192 tokens

Knowledge Cutoff: 2025-04

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

Native vision-language model
131K context with images
Strong document OCR and chart understanding
Multilingual VQA

Cons

No fine-tuning
8K max output
Higher cost than text-only Qwen3

Performance

Output Speed~55 tok/s

Rate Limit3,000 RPM

Multimodal

Image InputImage OutputAudio InputAudio Output

Benchmarks

MMMU

68.2%

MathVista

62.5%

Qwen-RobotManip

Flagship

Robotic manipulation, dexterous hand control

Official Pricing

When to use: For robotic manipulation tasks: grasping, assembly, and dexterous hand control in research and industrial settings.

Upgrade Highlights

◆First Qwen-Robot VLA manipulation model
◆38,100+ hours of open-source training data
◆Unified state-action space across robot types
◆Camera-frame end-effector incremental pose control
◆Part of complete Qwen-Robot Suite (Manip + Nav + World)

Input Price

$0.0000

per 1M tokens

Output Price

$0.0000

per 1M tokens

Cached Input

—

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 0

Max Output: 0 tokens

Knowledge Cutoff: 2026-06

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

VLA model for precise robotic manipulation
38,100+ hours of training from open-source data
Multi-robot-type support via unified action space
Open-source under Apache 2.0

Cons

Specialized for robotics — not a general LLM
Requires robot hardware or simulator for deployment
No text generation capabilities
Very new — limited community adoption

Performance

Output Speed—

Rate Limit—

Multimodal

Image InputImage OutputAudio InputAudio Output

Qwen-RobotNav

Flagship

Robot navigation, path planning, autonomous mobility

Official Pricing

When to use: For mobile robot navigation: instruction-following, point navigation, object tracking, and autonomous driving tasks.

Upgrade Highlights

◆VLN model: vision-language navigation for physical agents
◆Unified 4 task types: instruction, point/goal, tracking, driving
◆Controlled observation encoding + tool interface
◆Open-source: full weights for customization
◆Part of complete Qwen-Robot Suite (Manip + Nav + World)

Input Price

$0.0000

per 1M tokens

Output Price

$0.0000

per 1M tokens

Cached Input

—

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 0

Max Output: 0 tokens

Knowledge Cutoff: 2026-06

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

VLN model unifying 4 navigation task types
Controlled observation encoding for 3D spatial awareness
Covers instruction following, point/goal navigation, tracking, driving
Open-source under Apache 2.0

Cons

Specialized for robotics navigation only
Requires robot hardware or simulator
No text generation
Very new — limited real-world validation

Performance

Output Speed—

Rate Limit—

Multimodal

Image InputImage OutputAudio InputAudio Output

Qwen-RobotWorld

Flagship

Physical world prediction, robot planning

Official Pricing

When to use: For robot planning and world simulation: predicting outcomes of actions across manipulation, driving, and navigation scenarios.

Upgrade Highlights

◆World model: predicts physically plausible futures
◆Cross-scene: works across manipulation, driving, navigation
◆Natural language action interface
◆Open-source: full weights for research and deployment
◆Part of complete Qwen-Robot Suite (Manip + Nav + World)

Input Price

$0.0000

per 1M tokens

Output Price

$0.0000

per 1M tokens

Cached Input

—

per 1M tokens

Batch Input

—

per 1M tokens

Context Window: 0

Max Output: 0 tokens

Knowledge Cutoff: 2026-06

VisionFunction CallingFine-tuningJSON ModeFree Tier

Pros

World model for predicting physically plausible futures
Cross-scene: manipulation, driving, and navigation
Natural language action interface for intuitive control
Open-source under Apache 2.0

Cons

Specialized for world simulation only
No text generation or robot control
Requires integration with Manip/Nav for full stack
Very new — limited benchmarks available

Performance

Output Speed—

Rate Limit—

Multimodal

Image InputImage OutputAudio InputAudio Output

Side-by-Side Comparison

Model	Tier	Input	Output	Cached	Context	Max Output
Qwen3.7-Max	Flagship	$2.50	$7.50	$0.250	1M	65,536
Qwen3.7-Plus	Mid-tier	$0.400	$1.60	$0.100	1M	16,384
Qwen3-235B-A22B	Flagship	$0.400	$1.20	$0.100	131K	8,192
Qwen3-30B-A3B	Mid-tier	$0.150	$0.600	$0.040	131K	8,192
Qwen3-32B	Mid-tier	$0.200	$0.600	$0.050	131K	8,192
Qwen3-14B	Lite	$0.100	$0.300	$0.030	131K	8,192
Qwen3-8B	Lite	$0.050	$0.150	$0.010	131K	8,192
Qwen-VL-Plus	Mid-tier	$0.200	$0.800	$0.050	131K	8,192
Qwen-RobotManip	Flagship	$0.0000	$0.0000	—	0	0
Qwen-RobotNav	Flagship	$0.0000	$0.0000	—	0	0
Qwen-RobotWorld	Flagship	$0.0000	$0.0000	—	0	0