PAPER_TITLE

FIRST_AUTHOR_LAST, FIRST_AUTHOR_FIRST; SECOND_AUTHOR_LAST, SECOND_AUTHOR_FIRST

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

¹ByteDance Seed, ²M-A-P, ³Humanlaya AI

Workflow-GYM is a benchmark designed to evaluate whether GUI agents can autonomously complete real-world professional workflows through graphical user interfaces.
Unlike existing GUI benchmarks, which mainly focus on general-purpose applications and short tasks, Workflow-GYM covers 300 long-horizon tasks spanning 50+ specialized software tools across multiple domains.
Our evaluation of SOTA models reveals that professional software workflows remain a major challenge: even the strongest models achieve only around 30% success rates. Through extensive analysis, we identify several key bottlenecks, including complex interface manipulation, long-horizon planning, professional software knowledge etc..
Workflow-GYM provides a realistic testbed for studying the next generation of AI agents capable of performing economically valuable work in professional software environments.

Paper Link

Dataset (Coming Soon)

Pipeline

Workflow-GYM is built in collaboration with domain experts from diverse professional fields. Rather than relying on synthetic tasks, we collect workflows directly from real-world professional practices and transform them into executable GUI benchmark tasks. Each workflow is required to be realistic, domain-specific, long-horizon, and objectively verifiable. The resulting benchmark contains 338 tasks spanning 56 professional software systems across engineering, scientific computing, finance, data analysis, multimedia creation, and other specialized domains.

To ensure quality and reproducibility, every task is instantiated in a dedicated virtual-machine environment and undergoes multi-stage validation. Domain experts manually verify task solvability, review instructions and evaluation criteria, and conduct end-to-end testing to identify ambiguities or environment issues. Only tasks that successfully pass all validation stages are included in the final benchmark.

Data Statistics

Workflow-GYM covers a total of 338 tasks across 6 primary categories and 23 secondary categories. Each task is classified into three difficulty levels: easy, medium, and hard, according to the number of manually annotated operational steps.

Easy 38.2% 129 tasks 30–44 steps

Medium 47.0% 159 tasks 45–60 steps

Hard 14.8% 50 tasks 61–110 steps

Task category distribution across the 6 primary and 23 secondary categories. Hover any slice to see the exact task count and percentage.

Leaderboard

Showcase

Real agent rollouts. Click any task to compare different models' execution traces side-by-side.

FAQ

How can I get my model's results onto the verified leaderboard?

If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (zhangge.eli@bytedance.com, zhuliya.julia@bytedance.com, dingjingzhe@bytedance.com) to run your agent code on our side and have us report the results.

What information is needed to test a model on Workflow-Gym?

Workflow-Gym uses the OpenAI-compatible API format. To test a model on this dataset, please provide:

Required information

base_url:: https://api.xxx.com/v1
api_key_env:: XXX_API_KEY
model:: xxx

Example

base_url:: https://api.openai.com/v1
api_key_env:: "sk_xxx"
model:: gpt-5.4