Ai2 Releases MolmoWeb, an Open-Source Browser Agent That Outperforms GPT-4o on Web Tasks
The Allen Institute for AI released MolmoWeb, an open-source 8B-parameter visual web agent that navigates browsers using only screenshots and outperforms GPT-4o on WebVoyager benchmarks. The release includes MolmoWebMix, one of the largest open web agent training datasets with 36K human trajectories.
The Allen Institute for AI (Ai2) has released MolmoWeb, a fully open-source visual web agent that navigates browsers using only screenshots, with no HTML parsing and no accessibility trees. The 8-billion-parameter model outperforms agents built on much larger proprietary models, including OpenAI's GPT-4o, on key web navigation benchmarks, marking a significant milestone for open-source AI agents.
What Happened
MolmoWeb is built on Ai2's Molmo 2 multimodal model family and comes in two sizes: 4B and 8B parameters. The agent works by interpreting screenshots of webpages the way a human would — looking at pixel-level visual information rather than underlying code — then deciding and executing actions like clicking coordinates, typing text, scrolling, and switching tabs. At each step, the model receives a task instruction (e.g., "Find the cheapest nonstop flights from Seattle to Tokyo"), a screenshot of the current browser view, and its action history, then produces a natural-language thought followed by the next browser action.
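The per-step loop described above (task instruction + current screenshot + action history in, a thought plus the next browser action out) can be sketched as follows. This is a hypothetical illustration, not MolmoWeb's actual API: the `Step`/`AgentState` types, the prompt fields, and the action strings like `click(412, 180)` are assumptions for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str   # natural-language reasoning the model emits first
    action: str    # e.g. 'click(412, 180)', 'type("Tokyo")', 'scroll(down)'

@dataclass
class AgentState:
    task: str                                     # the task instruction
    history: list = field(default_factory=list)   # prior Steps this episode

def build_prompt(state: AgentState, screenshot: bytes) -> dict:
    """Assemble the per-step model input: instruction, screenshot, history."""
    return {
        "instruction": state.task,
        "image": screenshot,
        "history": [(s.thought, s.action) for s in state.history],
    }

def run_step(state: AgentState, screenshot: bytes, model) -> Step:
    """One iteration: the model produces a thought and the next action."""
    prompt = build_prompt(state, screenshot)
    thought, action = model(prompt)   # model: any callable -> (thought, action)
    step = Step(thought, action)
    state.history.append(step)
    return step
```

The key design point is that the only observation is pixels: nothing in the prompt depends on the page's DOM, so the same loop works on any site a human could see.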
Alongside the models, Ai2 released MolmoWebMix, one of the largest and most diverse web agent training datasets ever published. It includes 36,000 human task trajectories across 1,100+ websites, 623,000 individual subtask demonstrations from crowdworkers, synthetic trajectories from accessibility-tree agents, and over 2.2 million screenshot QA pairs from approximately 400 websites. The full training and evaluation pipeline, reproducible model checkpoints, and data collection tools are all publicly available.
Key Details
- WebVoyager benchmark: 78.2% — the 8B model achieves this score, outperforming agents built on GPT-4o and other proprietary foundations on real-world web navigation tasks.
- Test-time scaling pushes to 94.7% — using pass@4 (running four attempts and taking the best), MolmoWeb reaches 94.7% on WebVoyager and 60.5% on Online-Mind2Web.
- ScreenSpot v2: beats Claude 3.7 and OpenAI CUA — the model outperforms both Anthropic and OpenAI's computer-use agents on this visual grounding benchmark.
- Fully open release — models on Hugging Face, code on GitHub, training data on Hugging Face Datasets, and a live demo at molmoweb.allen.ai.
- No proprietary distillation — unlike other open-weight web agents, MolmoWeb was trained without compressing a proprietary vision-based agent, using human browsing data and synthetic text-only agent trajectories instead.
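The pass@4 figure above means a task counts as solved if any of four independent attempts succeeds, and the reported score is the fraction of tasks solved that way. A minimal sketch of that metric (my own helper for illustration, not Ai2's evaluation code):

```python
def pass_at_k(attempt_results, k=4):
    """Fraction of tasks solved under pass@k.

    attempt_results: one list per task of booleans (success per attempt).
    A task is solved if any of its first k attempts succeeded.
    """
    solved = sum(any(results[:k]) for results in attempt_results)
    return solved / len(attempt_results)

# Example: 3 tasks, 4 attempts each
runs = [
    [False, True, False, False],   # solved on attempt 2
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved immediately
]
# pass_at_k(runs, 4) -> 2/3; pass_at_k(runs, 1) -> 1/3
```

The gap between single-attempt accuracy (78.2%) and pass@4 (94.7%) suggests the model usually knows a correct path exists; a verifier or reranker that picks the right attempt could capture much of that headroom.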
What Developers and Users Are Saying
Gartner analyst Arun Chandrasekaran called MolmoWeb "an innovation paradigm of computer use agents similar to the proprietary frontier model providers, but with an open approach." He noted it "lowers the barrier to entry for studying agentic behavior and understanding agent decision-making processes that are otherwise opaque." In their blog post, Ai2 framed the release in terms of their broader open-source mission: "In many ways, web agents today are where LLMs were before OLMo — the community needs an open foundation to build on."
The release positions MolmoWeb against closed-source competitors including OpenAI's ChatGPT Atlas, Anthropic's computer use capabilities, and Perplexity's Comet browser agent. The developer community has highlighted the significance of the open dataset release — MolmoWebMix's 36,000 human trajectories and 2.2 million screenshot QA pairs represent a resource that simply didn't exist in the open-source ecosystem before.
What This Means for Developers
MolmoWeb's open release gives developers three concrete capabilities they didn't have before. First, the models can be self-hosted locally or in private cloud environments, enabling web automation workflows without sending sensitive data to third-party APIs. Second, the MolmoWebMix dataset enables fine-tuning web agents on custom domains — teams can train specialized agents for their specific web applications. Third, the full training pipeline means researchers can reproduce, modify, and extend the approach rather than treating it as a black box.
The 4B parameter model is small enough to run on consumer hardware, while the 8B model offers stronger performance for production use cases. Developers building browser automation, testing frameworks, or accessibility tools should evaluate MolmoWeb as a drop-in alternative to proprietary computer-use APIs.
What's Next
The release comes during a turbulent period for Ai2. CEO Ali Farhadi and COO Sophie Lebrecht recently resigned to join Microsoft under Mustafa Suleyman, and key researchers Hanna Hajishirzi and Ranjay Krishna — who led the OLMo and Molmo projects respectively — are also departing. Despite the leadership changes, Ai2 has committed to continuing its open-source AI research mission. The models, data, and code are available now on Hugging Face and GitHub for immediate use.
Sources
- Ai2 Official Blog — MolmoWeb: An open agent for automating web tasks
- GeekWire — Ai2 releases open-source web agent to rival closed systems
- SiliconANGLE — Ai2 releases open-source visual AI agent
- The New Stack — Ai2 launches MolmoWeb, an open-source web agent
- VentureBeat — Ai2 releases MolmoWeb, an open-weight visual web agent
- GitHub — allenai/molmoweb repository