Agent S: Building Autonomous GUI Agents That Learn from Experience with 11.9k+ GitHub Stars

Explore Agent S, the open-source framework for autonomous GUI agents that achieves 72.6% on OSWorld, surpassing human performance. Learn how it combines hierarchical planning, episodic memory, and multi-model architecture for intelligent desktop automation.

Agent S is an open-source framework that enables autonomous interaction with computers through graphical user interfaces (GUIs). Created by Simular AI, it is a framework for computer-use agents: AI systems that observe the screen, plan actions, and control the mouse and keyboard to complete multi-step tasks on their own. With 11.9k GitHub stars and active development, Agent S has progressed from about 20% on the OSWorld benchmark to 72.6% with Agent S3, a result the project reports as surpassing human performance on that benchmark.

What is Agent S?

Agent S is an open-source framework designed to build intelligent GUI agents that learn from past experiences and perform complex tasks autonomously on computers. Unlike traditional automation tools that rely on predefined scripts, Agent S uses hierarchical planning and episodic memory to adapt to new situations. The framework supports Windows, macOS, and Linux, making it accessible across platforms.

The project is maintained by Simular AI and has evolved through three major versions. Agent S1 (released October 2024) introduced the core hierarchical planning approach. Agent S2 (March 2025) improved performance and generalization. Agent S3 (October 2025) achieved the breakthrough of surpassing human performance on OSWorld, a comprehensive benchmark for desktop automation tasks.

What makes Agent S unique is its combination of experience-augmented hierarchical planning, external knowledge integration, and episodic memory. The framework doesn't just execute tasks—it learns from them, building a knowledge base that improves future performance. This approach has proven more effective than pure scaling or reinforcement learning alone.

Core Features and Architecture

Hierarchical Planning with Memory - Agent S uses a two-level planning system. High-level planning breaks tasks into subtasks, while low-level planning handles specific GUI interactions. The framework maintains episodic memory of past interactions, allowing it to recognize similar situations and apply learned strategies.
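
The two-level loop described above can be sketched roughly as follows. This is an illustrative toy, not Agent S's actual implementation: `EpisodicMemory`, `plan_subtasks`, `execute_subtask`, and the word-overlap similarity are all hypothetical names and simplifications. A real system would presumably call an LLM to decompose the task and use a stronger retrieval method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EpisodicMemory:
    """Hypothetical store mapping past task descriptions to the subtask plans used."""
    episodes: Dict[str, List[str]] = field(default_factory=dict)

    def remember(self, task: str, subtasks: List[str]) -> None:
        self.episodes[task] = list(subtasks)

    def recall(self, task: str) -> Optional[List[str]]:
        """Return the plan of the most similar past task, or None if nothing overlaps."""
        words = set(task.lower().split())
        best_plan, best_score = None, 0.0
        for past_task, plan in self.episodes.items():
            union = words | set(past_task.lower().split())
            if not union:
                continue
            # Jaccard word overlap stands in for a real similarity measure.
            score = len(words & set(past_task.lower().split())) / len(union)
            if score > best_score:
                best_plan, best_score = plan, score
        return best_plan


def plan_subtasks(task: str, memory: EpisodicMemory) -> List[str]:
    """High level: reuse a remembered plan if one exists, else use a stub plan."""
    recalled = memory.recall(task)
    if recalled is not None:
        return list(recalled)
    # Placeholder for an LLM call that would decompose the task.
    return [f"open the application needed for: {task}", f"carry out: {task}"]


def execute_subtask(subtask: str) -> str:
    """Low level: stand-in for the clicks and keystrokes a subtask expands into."""
    return f"done: {subtask}"


def run(task: str, memory: EpisodicMemory) -> List[str]:
    subtasks = plan_subtasks(task, memory)
    results = [execute_subtask(s) for s in subtasks]
    memory.remember(task, subtasks)  # store the episode for future recall
    return results
```

Calling `run` twice with similarly worded tasks makes the second call reuse the first plan, which is the "recognize similar situations" behavior in miniature.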

Multi-Model Architecture - The framework supports multiple LLM providers including OpenAI (GPT-5), Anthropic Claude, Google Gemini, and open-source models via vLLM. For grounding (mapping the agent's description of a UI element to concrete screen coordinates), it uses specialized UI understanding models such as UI-TARS-1.5-7B, which locate UI elements and return their coordinates.

Grounding Models for Precise UI Interaction - Agent S uses dedicated grounding models that understand GUI layouts and element positions. The UI-TARS model family provides state-of-the-art performance in identifying clickable elements, text fields, and other interactive components. This grounding layer translates high-level agent decisions into precise mouse and keyboard actions.
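
One small piece of this grounding layer can be illustrated in isolation: converting a point predicted in the grounding model's coordinate space into real screen pixels. The sketch below assumes the model reports coordinates on a fixed-size grid (for example 1920x1080) and that a simple linear rescale is enough. The function name `scale_to_screen` is hypothetical and is not part of the Agent S API.

```python
from typing import Tuple


def scale_to_screen(
    x: float,
    y: float,
    grounding_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the grounding model's coordinate space to screen pixels."""
    g_w, g_h = grounding_size
    s_w, s_h = screen_size
    if g_w <= 0 or g_h <= 0:
        raise ValueError("grounding dimensions must be positive")
    px = round(x * s_w / g_w)
    py = round(y * s_h / g_h)
    # Clamp so a slightly out-of-range prediction cannot land off-screen.
    return min(max(px, 0), s_w - 1), min(max(py, 0), s_h - 1)


# A point at the center of a 1920x1080 grid, mapped onto a 2560x1440 display.
center = scale_to_screen(960, 540, (1920, 1080), (2560, 1440))
```

The clamp is a defensive choice for this sketch; whether a real agent should clamp or reject an out-of-range prediction depends on how much it trusts the grounding model.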

Cross-Platform Support - The framework runs on Windows, macOS, and Linux. It accounts for platform-specific differences in GUI rendering and interaction patterns, so the same agent setup can be reused across desktop operating systems.

Local Code Execution Environment - For tasks requiring computation beyond GUI interaction, Agent S includes an optional local coding environment. This allows the agent to execute Python and Bash code directly, enabling data processing, file manipulation, and system automation without GUI interaction.
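
To make the idea of a local coding environment concrete, here is a minimal sketch of running a Python snippet in a separate interpreter process with a timeout. It uses only the standard library. The helper name `run_python_snippet` and the returned dictionary shape are assumptions for illustration, and nothing here reflects how Agent S implements its own environment. A subprocess is not a sandbox, so code run this way has the same permissions as the calling user.

```python
import subprocess
import sys
from typing import Dict, Union


def run_python_snippet(code: str, timeout_s: float = 10.0) -> Dict[str, Union[bool, str]]:
    """Run a Python snippet in a child interpreter and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    # A zero exit status is treated as success.
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Returning a structured result instead of raising lets a calling agent inspect `stderr` and decide whether to retry with corrected code.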

Reflection Agent for Quality Assurance - Agent S3 includes a reflection component that validates actions and corrects mistakes. This secondary agent reviews the primary agent's decisions, catching errors before they compound and improving overall task success rates.
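
The propose-then-review pattern described above can be sketched as a small loop. This is a generic illustration under stated assumptions, not Agent S3's reflection component: `propose` and `review` are hypothetical callables standing in for the primary agent and the reflection agent, and the retry limit is arbitrary.

```python
from typing import Callable, List, Optional, Tuple

# (task, feedback so far) -> proposed action
Proposer = Callable[[str, List[str]], str]
# (task, action) -> (approved, feedback note)
Reviewer = Callable[[str, str], Tuple[bool, str]]


def act_with_reflection(
    task: str, propose: Proposer, review: Reviewer, max_attempts: int = 3
) -> Optional[str]:
    """Return the first action the reviewer approves, or None if attempts run out."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        action = propose(task, feedback)
        approved, note = review(task, action)
        if approved:
            return action
        feedback.append(note)  # the next proposal can take the critique into account
    return None
```

Passing the accumulated feedback back to the proposer is what lets a rejected action inform the next attempt instead of being retried blindly.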

Behavior Best-of-N Sampling - For critical tasks, Agent S can generate multiple rollouts and select the best outcome. This technique improved Agent S3's performance from 66% to 72.6% on OSWorld, showing the value of sampling several attempts and selecting among them in agentic systems.
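
Generic best-of-N selection is simple to express in code. The sketch below runs N independent rollouts and keeps the highest-scoring one. It assumes a numeric `score` function is available, which is a placeholder: this article does not describe how Agent S3 judges its rollouts, and `best_of_n`, `run_rollout`, and `score` are hypothetical names.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    run_rollout: Callable[[random.Random], T],
    score: Callable[[T], float],
    n: int,
    seed: int = 0,
) -> Tuple[T, float]:
    """Run n independent rollouts and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    candidates: List[T] = [run_rollout(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)
```

The cost grows linearly with N, since every rollout is a full task attempt, so this fits tasks where a wrong outcome is more expensive than the extra compute.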

Getting Started

Prerequisites - You'll need Python 3.8+, a single monitor setup (Agent S is designed for single-screen environments), and API keys for your chosen LLM provider. For macOS, install Tesseract: brew install tesseract.

Installation - The simplest approach is installing via pip:

pip install gui-agents

For development work, clone the repository and install in editable mode:

git clone https://github.com/simular-ai/Agent-S.git
cd Agent-S
pip install -e .

API Configuration - Set your API keys as environment variables. The example covers OpenAI, Anthropic, and a Hugging Face token:

export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
export HF_TOKEN="your-huggingface-token"
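
Before launching the agent, it can help to confirm that the variables above are set. The following is a small standalone check using only the standard library. The helper name `missing_keys` is hypothetical, and which keys you need depends on the providers you choose.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN"])
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary.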

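Running Agent S3 - The recommended setup uses GPT-5 with UI-TARS grounding. First, set up a Hugging Face Inference Endpoint for UI-TARS-1.5-7B. Then run the command below, with --ground_url pointing at wherever that endpoint is served (the example uses a local address):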

agent_s \
    --provider openai \
    --model gpt-5-2025-08-07 \
    --ground_provider huggingface \
    --ground_url http://localhost:8080 \
    --ground_model ui-tars-1.5-7b \
    --grounding_width 1920 \
    --grounding_height 1080

For tasks requiring code execution, add the --enable_local_env flag. This allows the agent to run Python and Bash scripts locally.
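
If you launch the agent from a script, the command above can be assembled programmatically. This sketch only builds the argument list from the flags shown in this article; it does not call any Agent S Python API. The function name `build_agent_s_command` and its defaults are assumptions that mirror the example command.

```python
import shlex
from typing import List


def build_agent_s_command(
    ground_url: str,
    provider: str = "openai",
    model: str = "gpt-5-2025-08-07",
    ground_provider: str = "huggingface",
    ground_model: str = "ui-tars-1.5-7b",
    grounding_width: int = 1920,
    grounding_height: int = 1080,
    enable_local_env: bool = False,
) -> List[str]:
    """Assemble the agent_s argument list from the flags shown in this article."""
    cmd = [
        "agent_s",
        "--provider", provider,
        "--model", model,
        "--ground_provider", ground_provider,
        "--ground_url", ground_url,
        "--ground_model", ground_model,
        "--grounding_width", str(grounding_width),
        "--grounding_height", str(grounding_height),
    ]
    if enable_local_env:
        cmd.append("--enable_local_env")  # opt in to local Python and Bash execution
    return cmd


# Render the command as a shell-safe string for logging or copy-paste.
command_text = shlex.join(build_agent_s_command("http://localhost:8080"))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues if a URL or model name contains special characters.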

Real-World Use Cases

Data Entry and Processing - Agent S excels at repetitive data entry tasks. It can extract data from one application, process it, and enter it into another—all without human intervention. Insurance companies and financial institutions use similar agents to automate claims processing and account management.

Software Testing and QA - The framework can navigate complex applications, fill forms, click buttons, and verify results. QA teams can define test scenarios, and Agent S executes them across different configurations and platforms, catching regressions faster than manual testing.

System Administration and DevOps - Agent S can automate server configuration, log analysis, and system monitoring through web dashboards and CLI tools. It handles multi-step workflows like deploying applications, configuring databases, and managing infrastructure—tasks that typically require manual intervention.

Customer Support Automation - Support teams can use Agent S to handle routine customer requests. The agent can navigate ticketing systems, look up customer information, process refunds, and generate responses—freeing human agents for complex issues requiring judgment and empathy.

How It Compares

vs. Claude Computer Use (Anthropic) - Claude's computer-use capability is powerful but closed-source and API-only. Agent S is open-source, allowing customization and local deployment. However, Claude benefits from Anthropic's extensive safety research. Agent S3 reports a higher OSWorld score (72.6% vs Claude's reported 62-65%) while offering more flexibility for developers.

vs. OpenAI Operator - OpenAI's Operator is a commercial product focused on web automation. Agent S supports both web and desktop applications, making it more versatile. Agent S is also open-source and free to use, though Operator may offer better integration with OpenAI's ecosystem for organizations already using OpenAI models.

vs. Traditional RPA Tools (UiPath, Automation Anywhere) - Legacy RPA platforms require extensive configuration and maintenance. Agent S is designed to learn from experience and adapt to UI changes without re-scripting. Traditional RPA excels at highly structured, repetitive tasks in enterprise environments, while Agent S is better suited to novel situations and tasks that need reasoning.

What's Next

The Agent S roadmap focuses on improving generalization across different applications and operating systems. The team is working on better handling of edge cases, improved error recovery, and enhanced support for mobile automation through AndroidWorld benchmarks. Integration with more LLM providers and grounding models is ongoing.

The broader computer-use agent landscape is rapidly evolving. As these systems become more capable, we'll likely see widespread adoption in enterprise automation, customer service, and software development. Agent S's open-source nature positions it as a critical foundation for this emerging ecosystem. The framework demonstrates that with the right architecture—combining hierarchical planning, episodic memory, and multi-model systems—AI agents can achieve human-level performance on complex, real-world tasks.
