Huawei benchmark finds AI assistants fail months-long tasks

Huawei and academic partners released Claw-Anything, a benchmark for months-long, multi-service personal-assistant tasks. OpenAI’s GPT-5.5 scored 34.5% pass@1, and agents succeeded on 6.7% of proactive tasks.

Researchers from Huawei Technologies, Beijing Institute of Technology, Peking University and the Chinese Academy of Sciences released Claw-Anything, a benchmark that evaluates AI assistants on months-long, multi-service personal tasks. On the benchmark, OpenAI’s GPT-5.5 scored 34.5% pass@1, and agents completed 6.7% of proactive tasks.

Claw-Anything runs scenarios that span more than three months of simulated user activity. Tasks require coordination across multiple backend services and interaction with both command-line Linux environments and graphical Android devices. Example tasks include cross-referencing past alerts with calendar events, assembling recent work from notes and communication threads, and taking actions that involve several separate accounts and tools.

Each task carries an average context of about 191,700 words, compared with roughly 1,700 to 12,000 words in many existing benchmarks. Tasks involve an average of 10.1 interdependent services, which forces agents to retrieve information and act across different backends rather than solve problems within a single system.

The benchmark uses pass@1 as its primary metric, scoring whether an agent completes a task correctly on its first attempt with no retries. Agents scored 25.9% on reactive tasks, where they respond to explicit requests, and 6.7% on proactive tasks, where they identify a need and act without being asked. The paper states, “Current models remain unreliable even when given broader access to the user’s digital world.”
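As a rough illustration of the metric, pass@1 can be computed as the share of tasks an agent gets right on its single first attempt, split by task type. The sketch below is a minimal example under stated assumptions: the function name `pass_at_1`, the result format and the sample outcomes are hypothetical and do not come from the Claw-Anything code or data.

```python
from collections import defaultdict


def pass_at_1(results):
    """Return the first-attempt success rate per task category.

    `results` is a list of (category, passed_first_attempt) pairs.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Made-up outcomes, not benchmark data: one attempt per task, no retries.
sample = [
    ("reactive", True), ("reactive", False),
    ("reactive", False), ("reactive", True),
    ("proactive", False), ("proactive", False),
    ("proactive", True), ("proactive", False),
]

scores = pass_at_1(sample)
for category, rate in scores.items():
    print(f"{category}: {rate:.1%}")
```

Because each task gets exactly one attempt, a retry that would have succeeded does not count toward the score.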

In ablation experiments, the researchers removed access to tools required for cross-service workflows and found that success rates dropped to nearly zero. The result indicates that most tasks require genuine cross-account retrieval and execution rather than isolated reasoning in a single environment.

The researchers released an automated pipeline that generated 2,000 training environments and published the dataset and code in public repositories. The team fine-tuned the open-weight Qwen3.5-27B on 1,500 successful agent trajectories from that data, raising its pass@1 by 23.7%. The fine-tuned model surpassed several closed-source models on the Claw-Anything leaderboard, including Claude Sonnet.
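The article does not describe how the successful trajectories were selected, so the following is only a sketch of one common approach: keep the runs that a checker marked as successful and use those as fine-tuning examples. The `Trajectory` class, its fields and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent run in a generated environment (hypothetical schema)."""
    env_id: str
    steps: list
    success: bool


def select_successful(trajectories):
    """Keep only the runs marked as successful, preserving their order."""
    return [t for t in trajectories if t.success]


# Made-up records for illustration; not taken from the released dataset.
runs = [
    Trajectory("env-001", ["read_calendar", "send_email"], True),
    Trajectory("env-002", ["open_notes"], False),
    Trajectory("env-003", ["search_messages", "create_event"], True),
]

training_set = select_successful(runs)
print(len(training_set), "of", len(runs), "runs kept for fine-tuning")
```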

The paper notes the dataset and tools are intended to reduce data leakage in benchmarks and to reflect the practical demands of assistants that operate across calendars, email, messaging, file stores and device interfaces over long time spans.
