Paper reading: WebArena: A Realistic Web Environment for Building Autonomous Agents
The authors' claim: current agents are primarily created and tested in simplified synthetic environments, which leads to a disconnect from real-world scenarios.
They built an environment for language-guided agents that is realistic and reproducible.
Overall very well-done and comprehensive work.
Strengths
- Website environments in 4 domains: e-commerce, social forum discussions, collaborative software development, and content management.
- Enriched with tools (map) and external knowledge bases (manuals?).
- A set of benchmark tasks that evaluate the functional correctness of task completions.
- The environment is implemented in OpenAI Gym and shipped in Docker containers.
- Maintains reproducibility by making the environment standalone (no reliance on live websites) and free of CAPTCHAs.
- Environment definition:
- State space S, action space A, observation space O
- Observation space:
- web page url
- open tabs (multi-tab web-based tasks are included to promote tool usage)
- web page content
- DOM tree or
- screenshot or
- accessibility tree
- Action space:
- element selection:
- by coordinates (x, y)
- by element id (numerical)
- Transition function: $T: S \times A \to S$
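The environment definition above can be made concrete with a toy sketch. This is not WebArena's actual API: the names `State`, `Observation`, `Click`, `TOY_SITE`, `observe`, and `transition` are all hypothetical, and the two-page "site" is invented purely to illustrate an observation built from URL, open tabs, and page content, plus a transition function $T: S \times A \to S$ driven by element-id clicks.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical state: the URLs of all open tabs and the active tab index."""
    tabs: Tuple[str, ...]
    active: int = 0

    @property
    def url(self) -> str:
        return self.tabs[self.active]


@dataclass(frozen=True)
class Observation:
    """What the agent sees: URL, open tabs, and one rendering of the page."""
    url: str
    open_tabs: Tuple[str, ...]
    content: str  # stands in for a DOM tree, screenshot, or accessibility tree


@dataclass(frozen=True)
class Click:
    """Element selection, either by numerical element id or by (x, y) coordinates."""
    element_id: Optional[int] = None
    coords: Optional[Tuple[int, int]] = None

    def __post_init__(self) -> None:
        if (self.element_id is None) == (self.coords is None):
            raise ValueError("specify exactly one of element_id or coords")


# Invented toy site: (current url, clicked element id) -> next url.
TOY_SITE = {
    ("shop/home", 1): "shop/cart",
    ("shop/cart", 2): "shop/checkout",
}


def observe(state: State) -> Observation:
    """Map a state to an observation; the content string is a placeholder."""
    return Observation(state.url, state.tabs, f"<page {state.url}>")


def transition(state: State, action: Click) -> State:
    """Toy T: S x A -> S. Unknown or coordinate clicks leave the state unchanged."""
    target = TOY_SITE.get((state.url, action.element_id))
    if target is None:
        return state
    tabs = state.tabs[:state.active] + (target,) + state.tabs[state.active + 1:]
    return replace(state, tabs=tabs)


s0 = State(tabs=("shop/home",))
s1 = transition(s0, Click(element_id=1))
print(observe(s1).url)
```

Keeping `State` immutable makes each step a pure function of the previous state and the action, which mirrors the $T: S \times A \to S$ formulation in the notes.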
- Crafting the benchmark dataset
- Intent Collection
- seed intents from human annotators
- abstract and high-level, creative (e.g. "create a Reddit account identical to my GitLab one"), formulated from a template
- Evaluation of correctness:
- Information-seeking tasks:
- The answer is a string.
- exact match
- must_include
- fuzzy_match (calls GPT-4 to evaluate)
- Site navigation and content & configuration tasks:
- a reward function to evaluate the intermediate state
- locator: JavaScript, API call, or database query
- find result text
- keywords
- reuse match functions
- Unachievable Tasks:
- expect the agent to produce N/A
- Human performance: 78.24% task success rate
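The correctness checks listed above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's implementation: the whitespace and case normalization is an assumption, `fuzzy_match` takes a hypothetical `judge` callable in place of a real GPT-4 call, and `score_state` is an invented helper showing how a locator result can reuse the same match functions.

```python
from typing import Callable, Iterable


def _norm(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(answer: str, reference: str) -> bool:
    """True when the answer equals the reference after normalization."""
    return _norm(answer) == _norm(reference)


def must_include(answer: str, required: Iterable[str]) -> bool:
    """True when every required keyword appears in the answer."""
    normalized = _norm(answer)
    return all(_norm(keyword) in normalized for keyword in required)


def fuzzy_match(answer: str, reference: str,
                judge: Callable[[str, str], bool]) -> bool:
    """Delegate to a judge; in the paper this role is played by GPT-4.
    Here `judge` is a hypothetical stand-in supplied by the caller."""
    return judge(answer, reference)


def score_state(locate: Callable[[], str], required: Iterable[str]) -> bool:
    """For navigation/content tasks: a locator (JavaScript, API, or database
    query in the notes; any callable here) fetches text from the site state,
    then the keyword check is reused."""
    return must_include(locate(), required)


# Unachievable tasks: the agent is expected to answer "N/A".
print(exact_match(" n/a ", "N/A"))
print(must_include("Order total: $25.00", ["$25.00", "order"]))
print(score_state(lambda: "Merge request status: Merged", ["merged"]))
```

Passing the judge and the locator in as callables keeps the sketch runnable offline; a real setup would plug in an LLM call and a query against the site's state.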
Challenges
- Do numerical element IDs cause hallucination?
- Not enough description of the accessibility tree.