Oppo’s X-OmniClaw turns Android phones into on-device AI agents

Oppo published X-OmniClaw, an open-source Android agent that runs mainly on-device, uses the camera, microphone and screen for context, stores long-term memory, automates apps via deeplinks, and calls cloud LLMs for heavy reasoning.

Oppo’s Multi-X Team published X-OmniClaw on GitHub as an open-source framework that runs primarily on Android devices. The project is designed to keep core logic local, use a phone’s sensors for context, and call cloud language models only for complex reasoning.

Oppo’s technical report describes X-OmniClaw as an “edge-native architecture that executes directly on the user’s physical device, thereby eliminating the gap between simulated environments and real-world interaction contexts.” The code builds on the open-source HermesApp codebase and adapts ideas from earlier persistent-agent frameworks for desktop systems.

The framework is organized into three interacting systems: Omni Perception, Omni Memory and Omni Action. Omni Perception merges camera feeds, on-screen content and voice input into a single pipeline. A vision-language model interprets the scene before the agent takes action, for example identifying a product in view before searching shopping apps.

Omni Memory creates a running semantic record from the device’s photo gallery and session logs. The system turns images and session data into structured memory so the agent can recall objects, scenes and events across app switches and multiple sessions. The report states, “runtime continuity is what lets X-OmniClaw operate as an ongoing device agent rather than a one-shot response system.”
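To make the idea of structured memory concrete, here is a minimal sketch of a tag-based store. It is not taken from the X-OmniClaw code: the `MemoryEntry` schema, the `MemoryStore` class and the `recall` method are hypothetical names chosen for illustration, and the tags stand in for labels that a vision-language model might produce from gallery images or session logs.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One structured record derived from a photo or session log (hypothetical schema)."""
    source: str                              # e.g. "gallery" or "session"
    timestamp: int                           # seconds since the epoch
    tags: set = field(default_factory=set)   # labels assumed to come from a vision model
    note: str = ""                           # free-form detail, such as a file name


class MemoryStore:
    """Toy in-memory store; a real agent would persist entries on the device."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def recall(self, *tags):
        """Return entries that carry every requested tag, newest first."""
        wanted = {t.lower() for t in tags}
        hits = [
            e for e in self._entries
            if wanted <= {t.lower() for t in e.tags}
        ]
        return sorted(hits, key=lambda e: e.timestamp, reverse=True)
```

With entries tagged this way, a request such as the parrot-video demo would reduce to `store.recall("parrot")`, and the result would survive app switches because it lives outside any single session.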

Omni Action combines XML interface data, on-device visual models and optical character recognition to find and tap interface elements. A behavior cloning feature records a user’s navigation path once and generates an Android deeplink that replays the route in future sessions, allowing the agent to bypass repeated manual navigation.
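The element-finding step can be sketched as a lookup in the interface XML with a fallback to OCR results. This is an illustration under assumptions, not Oppo's implementation: it assumes the XML resembles an Android UI Automator hierarchy dump, where each `node` carries `text`, `content-desc` and a `bounds` attribute of the form `[x1,y1][x2,y2]`, and it takes OCR output as a plain list of `(text, bounds)` pairs. The function names `center` and `find_tap_target` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Bounds format assumed from UI Automator dumps: "[x1,y1][x2,y2]".
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center(bounds):
    """Convert '[x1,y1][x2,y2]' into the (x, y) point at the middle of the box."""
    m = BOUNDS_RE.fullmatch(bounds)
    if m is None:
        raise ValueError(f"unexpected bounds format: {bounds!r}")
    x1, y1, x2, y2 = map(int, m.groups())
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def find_tap_target(xml_dump, label, ocr_boxes=()):
    """Look for `label` in the XML hierarchy first, then in OCR results.

    Returns the tap coordinates, or None if the label is not found anywhere.
    """
    wanted = label.lower()
    root = ET.fromstring(xml_dump)
    for node in root.iter("node"):
        text = node.get("text", "").lower()
        desc = node.get("content-desc", "").lower()
        if wanted in (text, desc) and node.get("bounds"):
            return center(node.get("bounds"))
    # Fallback: elements drawn as images expose no text in the XML,
    # so match against text recognized on the screenshot instead.
    for text, bounds in ocr_boxes:
        if text.lower() == wanted:
            return center(bounds)
    return None


SAMPLE = """<hierarchy>
  <node text="" bounds="[0,0][1080,2400]">
    <node text="Search" content-desc="" bounds="[100,200][300,300]" />
  </node>
</hierarchy>"""
```

Checking the structured XML before OCR is a design choice in this sketch: XML bounds are exact when present, while OCR covers custom-drawn controls that the hierarchy does not describe.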

Oppo demonstrated several use cases. In one demo, the agent identified a physical product with the camera, opened Taobao, scrolled through the results and returned a price summary without any typing. Another demo showed a floating on-screen assistant that solves math exercises by reading the screen, processing each question and advancing through the steps on its own. In a media workflow example, the agent scanned a gallery for parrot photos using its semantic memory, opened the CapCut editor via deeplink, batch-selected matching files and assembled a highlight video.

The framework keeps most processing on the device and uses cloud LLMs only for high-level reasoning. Because the agent runs on the phone itself, it retains access to the real camera, photos and local files, which agents running on remote virtual devices cannot reach. Oppo said it will publish supporting assets and continue updating the project to let developers build agents that combine local sensing, long-term memory and selective cloud reasoning.
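The split between local processing and selective cloud reasoning can be pictured as a simple routing rule. The sketch below is an assumption-laden illustration, not the project's actual logic: the `Task` fields, the `route` function, the stage names and the `cloud_threshold` parameter are all hypothetical, and a real system would likely use a richer signal than a step count.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    needs_sensors: bool    # camera, gallery or local files are involved
    reasoning_steps: int   # rough estimate of planning depth (hypothetical metric)


def route(task, cloud_threshold=3):
    """Return the ordered stages a task passes through in this toy model.

    Sensing and acting always stay on the device; only the reasoning
    stage may be sent to a cloud LLM when the task looks complex.
    """
    stages = []
    if task.needs_sensors:
        stages.append("local:perception")
    if task.reasoning_steps >= cloud_threshold:
        stages.append("cloud:reasoning")
    else:
        stages.append("local:reasoning")
    stages.append("local:action")
    return stages
```

For example, `route(Task("compare prices for the product in view", True, 5))` yields perception on the device, reasoning in the cloud, and the resulting taps back on the device.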
