Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

¹ByteDance Seed, ²M-A-P, ³Humanlaya AI

  • Workflow-GYM is a benchmark designed to evaluate whether GUI agents can autonomously complete real-world professional workflows through graphical user interfaces.
  • Unlike existing GUI benchmarks, which mainly focus on general-purpose applications and short tasks, Workflow-GYM covers 338 long-horizon tasks spanning 58 specialized software tools across multiple domains.
  • Our evaluation of SOTA models reveals that professional software workflows remain a major challenge: even the strongest models reach a success rate of only about 30%. Through extensive analysis, we identify several key bottlenecks, including complex interface manipulation, long-horizon planning, and professional software knowledge.
  • Workflow-GYM provides a realistic testbed for studying the next generation of AI agents capable of performing economically valuable work in professional software environments.

Pipeline

Workflow-GYM is built in collaboration with domain experts from diverse professional fields. Rather than relying on synthetic tasks, we collect workflows directly from real-world professional practices and transform them into executable GUI benchmark tasks. Each workflow is required to be realistic, domain-specific, long-horizon, and objectively verifiable. The resulting benchmark contains 338 tasks spanning 58 professional software systems across engineering, scientific computing, finance, data analysis, multimedia creation, and other specialized domains.

To ensure quality and reproducibility, every task is instantiated in a dedicated virtual-machine environment and undergoes multi-stage validation. Domain experts manually verify task solvability, review instructions and evaluation criteria, and conduct end-to-end testing to identify ambiguities or environment issues. Only tasks that successfully pass all validation stages are included in the final benchmark.
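
The inclusion rule above (a task enters the benchmark only if it passes every validation stage) can be sketched as a simple filter. This is a minimal illustration, not the authors' tooling: the `CandidateTask` class, the stage names, and the task IDs are hypothetical labels that paraphrase the stages described in the text.

```python
from dataclasses import dataclass, field

# Hypothetical stage names paraphrasing the validation steps described above.
VALIDATION_STAGES = (
    "solvability_verified",
    "instructions_and_criteria_reviewed",
    "end_to_end_tested",
)


@dataclass
class CandidateTask:
    """A candidate workflow task and the validation stages it has passed."""
    task_id: str
    passed_stages: set = field(default_factory=set)


def is_accepted(task: CandidateTask) -> bool:
    """A task is accepted only if it has passed every validation stage."""
    return all(stage in task.passed_stages for stage in VALIDATION_STAGES)


def final_benchmark(candidates):
    """Keep only the candidates that passed all stages."""
    return [task for task in candidates if is_accepted(task)]


# Illustrative data only: one fully validated task, one that failed end-to-end testing.
candidates = [
    CandidateTask("task_a", set(VALIDATION_STAGES)),
    CandidateTask("task_b", {"solvability_verified", "instructions_and_criteria_reviewed"}),
]
accepted_ids = [task.task_id for task in final_benchmark(candidates)]
```

Here `accepted_ids` contains only `task_a`, since `task_b` is missing one stage.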

Workflow-GYM pipeline diagram

Data Statistics

Workflow-GYM covers a total of 338 tasks across 6 primary categories and 23 secondary categories. Each task is assigned one of three difficulty levels (easy, medium, or hard) according to its number of manually annotated operational steps.

Difficulty   Share    Tasks   Steps
Easy         38.2%    129     30–44
Medium       47.0%    159     45–60
Hard         14.8%    50      61–110
Task category distribution across the 6 primary and 23 secondary categories.
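
As a rough illustration of the difficulty rule, the sketch below maps a task's annotated step count to a level using the step ranges listed above, then computes each level's share of a task set. The function names are hypothetical, and the step counts in the example are placeholders chosen only to reproduce the reported task counts (129, 159, 50).

```python
from collections import Counter

# Inclusive step ranges per difficulty level, as listed in the statistics above.
DIFFICULTY_BANDS = [
    ("easy", 30, 44),
    ("medium", 45, 60),
    ("hard", 61, 110),
]


def difficulty_for_steps(num_steps: int) -> str:
    """Return the difficulty level whose step range contains num_steps."""
    for label, low, high in DIFFICULTY_BANDS:
        if low <= num_steps <= high:
            return label
    raise ValueError(f"{num_steps} steps is outside the listed 30-110 range")


def difficulty_shares(step_counts):
    """Map each difficulty level to its percentage of the given tasks."""
    counts = Counter(difficulty_for_steps(n) for n in step_counts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tasks given")
    return {
        label: round(100 * counts[label] / total, 1)
        for label, _, _ in DIFFICULTY_BANDS
    }


# Placeholder step counts that reproduce the reported split of 338 tasks.
example_steps = [30] * 129 + [50] * 159 + [70] * 50
shares = difficulty_shares(example_steps)
```

With these counts, `shares` comes out to 38.2% easy, 47.0% medium, and 14.8% hard, matching the percentages reported above.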

Leaderboard

Model            Avg Score   Pass@3
Gemini-3.1-Pro   30.67       41.12
Kimi-K2.6        29.68       41.42
Seed-2.0-Lite    18.24       28.40
GPT-5.4          17.85       26.33
GPT-5.4-Mini     15.98       27.22
Gemini-3-Flash   7.89        15.98

Outcome breakdown, as a proportion of all tasks (%). The chart legend also lists Other Failures, for which no values are labeled.

Model            Success   Final-State but Incorrect   Workflow Incompletion
Gemini-3.1-Pro   30.67%    42.50%                      26.23%
Kimi-K2.6        29.68%    46.45%                      23.47%
Seed-2.0-Lite    18.54%    31.85%                      49.11%
GPT-5.4          17.85%    20.22%                      61.93%
GPT-5.4-Mini     15.98%    24.46%                      59.37%
Gemini-3-Flash   7.89%     18.84%                      73.18%
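
This page does not define how Avg Score and Pass@3 are computed. The sketch below shows one common reading, stated as an assumption: each task is run three times with a binary success outcome, Avg Score is the success rate over all runs, and Pass@3 is the share of tasks solved in at least one of the three runs. The function names and the example results are hypothetical.

```python
def average_score(results):
    """Success rate over every run of every task, as a percentage (assumed definition)."""
    runs = [outcome for task_runs in results.values() for outcome in task_runs]
    return 100 * sum(runs) / len(runs)


def pass_at_k(results, k=3):
    """Percentage of tasks with at least one success in their first k runs (assumed definition)."""
    solved = sum(1 for task_runs in results.values() if any(task_runs[:k]))
    return 100 * solved / len(results)


# Illustrative outcomes only: task name -> success flag for each of three runs.
example_results = {
    "task_a": [True, False, False],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, False, False],
}
avg = average_score(example_results)   # 4 successes out of 12 runs
p3 = pass_at_k(example_results, k=3)   # 2 of 4 tasks solved at least once
```

Under this reading, Pass@3 is never lower than Avg Score, which agrees with the ordering of the two columns in the leaderboard.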

Showcase

Real agent rollouts. Click any task to compare different models' execution traces side-by-side.

FAQ

How can I get my model's results onto the verified leaderboard?

If you want your results verified and displayed on the verified leaderboard, schedule a meeting with us (zhangge.eli@bytedance.com, zhuliya.julia@bytedance.com, dingjingzhe@bytedance.com) so that we can run your agent code on our side and report the results.