Workflow-GYM is built in collaboration with domain experts from diverse professional fields. Rather than relying on synthetic tasks, we collect workflows directly from real-world professional practice and convert them into executable GUI benchmark tasks. Every workflow must be realistic, domain-specific, long-horizon, and objectively verifiable. The resulting benchmark contains 338 tasks spanning 58 professional software systems across engineering, scientific computing, finance, data analysis, multimedia creation, and other specialized domains.
To ensure quality and reproducibility, every task is instantiated in a dedicated virtual-machine environment and undergoes multi-stage validation. Domain experts manually verify that each task is solvable, review its instructions and evaluation criteria, and run end-to-end tests to identify ambiguities or environment issues. Only tasks that pass every validation stage are included in the final benchmark.
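The inclusion rule described above, where a task enters the benchmark only if it passes every validation stage, can be illustrated with a minimal sketch. The stage names, the `CandidateTask` record, and the function names below are hypothetical and chosen for illustration; they paraphrase the validation steps in the text and do not describe the actual Workflow-GYM tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stage names that paraphrase the validation steps in the text.
STAGES = (
    "solvability_check",
    "instruction_review",
    "evaluation_criteria_review",
    "end_to_end_test",
)


@dataclass
class CandidateTask:
    """Illustrative record for one candidate task (not the real data model)."""
    task_id: str
    # Maps a stage name to True (passed) or False (failed).
    # A stage missing from the dict is treated as not yet passed.
    results: Dict[str, bool] = field(default_factory=dict)


def passes_all_stages(task: CandidateTask) -> bool:
    """Return True only if every stage was recorded as passed."""
    return all(task.results.get(stage, False) for stage in STAGES)


def select_final_tasks(candidates: List[CandidateTask]) -> List[CandidateTask]:
    """Keep only the candidates that pass every validation stage."""
    return [task for task in candidates if passes_all_stages(task)]


# Made-up example data: one fully validated task, one failing a stage,
# and one with a stage that was never run.
candidates = [
    CandidateTask("task-a", {stage: True for stage in STAGES}),
    CandidateTask("task-b", {**{stage: True for stage in STAGES},
                             "end_to_end_test": False}),
    CandidateTask("task-c", {"solvability_check": True}),
]

final_ids = [task.task_id for task in select_final_tasks(candidates)]
print(final_ids)
```

Treating a missing stage result as a failure is a deliberate choice in this sketch: a task that has not been fully reviewed is excluded rather than admitted by default.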