Model Signal logo Model Signal Fast, verified AI updates
Coding

Qwen 3.7‑Plus: Multimodal Coding Agent with Vision‑Language Upgrade

3 min read

Quick Summary

Qwen 3.7‑Plus is a new multimodal agent model that adds vision capabilities to the strong text backbone of Qwen 3.7. It can read screens, interact with GUIs, and generate code from visual references while keeping the coding and tool‑use strengths of its predecessor. Benchmarks show notable gains in several coding‑related tasks, especially in terminal‑based and spreadsheet benchmarks.

Key Points

  • Multimodal agent: Handles both visual (screens, images) and textual inputs in a single loop.
  • Coding performance: Improves the Terminal‑Bench 2.0 score to 70.3, the highest among listed models.
  • Productivity workflows: Supports end‑to‑end GUI navigation, CLI commands, and code generation from visual cues.
  • Cross‑framework compatibility: Works with Claude Code, OpenClaw, Qwen Code, and other agent scaffolds.
  • Available via API: Hosted on Alibaba Cloud Model Studio for easy integration.

What Actually Changed?

  • Vision‑Language Integration: Qwen 3.7‑Plus adds perception, reasoning, and grounding over visual inputs while retaining the original Qwen 3.7 text model.
  • Hybrid Agent Loop: The model can switch between GUI (visual) and CLI (text) interactions within the same task, enabling workflows such as reading a screenshot of a UI and issuing corresponding commands.
  • Benchmark Improvements: Compared to Qwen 3.6‑Plus, the new model raises several coding scores (e.g., Terminal‑Bench 2.0 from 61.6 → 70.3, SpreadsheetBench‑v1 from 80.2 → 86.3, Kernel Bench L3 from 1.03/48 % → 2.06/98 %).

Coding Impact

  • Higher Terminal Performance: The 70.3 score on Terminal‑Bench 2.0 indicates stronger ability to execute and reason about command‑line tasks, useful for automation scripts and DevOps tooling.
  • Better Spreadsheet Handling: A jump to 86.3 on SpreadSheetBench‑v1 suggests more reliable generation and manipulation of spreadsheet formulas and data.
  • Improved Code Generation from Visual Context: The multimodal capability lets developers feed UI screenshots or design mock‑ups and receive corresponding front‑end code, streamlining prototyping.
  • Consistent Agentic Strength: Despite the vision upgrade, the model still performs competitively on SWE‑Verified (77.7) and SWE‑Multilingual (75.8) benchmarks, supporting full‑stack development tasks.

Model / Tool Comparison

Model / Tool Terminal‑Bench 2.0 SWE‑Verified SpreadsheetBench‑v1 Kernel Bench L3 (score/%)
Qwen 3.7‑Plus 70.3 77.7 86.3 2.06 / 98 %
Qwen 3.6‑Plus 61.6 78.8 80.2 1.03 / 48 %
Opus‑4.6 65.4 80.8 2.63 / 98 %
K2.6 66.7 80.2 1.41 / 80 %
GLM‑5.1 63.5 2.00 / 78 %
DeepSeek‑V4‑Pro 67.9

Benchmarks are taken directly from the Qwen 3.7‑Plus release notes.

Strengths

  • Unified vision‑language agent: Eliminates the need for separate OCR or image‑analysis pipelines.
  • Strong coding scores: Leads the listed models on terminal and spreadsheet benchmarks.
  • Framework agnostic: Works across multiple agent scaffolds, easing integration into existing pipelines.
  • High multilingual coding ability: Maintains solid performance on SWE‑Multilingual (75.8).

Limitations / Concerns

  • Mixed coding benchmark results: Some tasks (e.g., NL2repo 41.1, SciCode 51.3) are lower than Qwen 3.6‑Plus, indicating room for improvement on repository‑level code synthesis.
  • Vision‑only gains not quantified: The release does not provide separate vision‑only benchmark numbers, so the exact impact of the visual module on non‑coding tasks is unclear.
  • Availability limited to Alibaba Cloud: Access requires using the Alibaba Cloud Model Studio API, which may not suit all deployment environments.

Should I Try It?

If you need a coding assistant that can also understand screenshots, UI layouts, or other visual inputs, Qwen 3.7‑Plus offers a clear advantage over text‑only models. Its top scores on terminal and spreadsheet benchmarks make it attractive for automation, DevOps, and data‑analysis scripts. However, if your primary workload is pure code generation from text or repository‑level tasks, you may want to compare its lower NL2repo and SciCode scores with other specialized models.

Sources

  1. Qwen 3.7‑Plus announcement (qwen.ai)

Why This Matters

Higher Terminal Performance: The 70.3 score on Terminal‑Bench 2.0 indicates stronger ability to execute and reason about command‑line tasks, useful for automation scripts and DevOps tooling.
Better Spreadsheet Handling: A jump to 86.3 on SpreadSheetBench‑v1 suggests more reliable generation and manipulation of spreadsheet formulas and data.
Improved Code Generation from Visual Context: The multimodal capability lets developers feed UI screenshots or design mock‑ups and receive corresponding front‑end code, streamlining prototyping.
Consistent Agentic Strength: Despite the vision upgrade, the model still performs competitively on SWE‑Verified (77.7) and SWE‑Multilingual (75.8) benchmarks, supporting full‑stack development tasks.