Leo's blog

Recommend McDonald's

Language Models can Solve Computer Tasks

UCI/CMU, NeurIPS 2023

Motivation

Previous agents need large amounts of expert demonstrations and task-specific reward functions to solve unseen tasks.

An LLM can execute computer tasks guided by NL prompts.

The paper focuses on the MiniWoB++ environment, a small RL-style benchmark of game-like tasks that run in a web browser.

Strength

  • novel (maybe? not sure) Recursively Criticize and Improve (RCI) prompt format
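
To make the RCI idea concrete, here is a minimal sketch of the loop as I understand it: generate an answer, ask the model to criticize it, then ask for an improved answer, and repeat. The `llm` callable, the `rounds` parameter, and the prompt wording are all my own assumptions, not the paper's actual prompts.

```python
from typing import Callable


def rci(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Criticize-then-improve loop. `llm` is any prompt -> text function (hypothetical)."""
    # Initial attempt.
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(rounds):
        # Step 1: ask the model to find problems in its own answer.
        critique = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "Review the answer above and point out any problems with it."
        )
        # Step 2: ask for a revised answer conditioned on the critique.
        answer = llm(
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
            "Based on the critique, give an improved answer."
        )
    return answer
```

Each round costs two model calls on top of the initial one, so `rounds=2` means five calls per answer.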

Challenges

  • not that related, so I didn’t read it thoroughly.

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

DeepMind, ICLR 2024

Motivation

LLM agents on real-world websites still suffer from open-domain tasks, limited context length, and a lack of inductive bias for HTML.

They introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions.

Strength

  • method:
    • two-model method.
      • HTML-T5 to predict the next sub-instruction (planning) and generate related HTML snippets (summarization)
      • Flan-U-PaLM prompted to generate Python programs
    • Actions are represented as Python programs with Selenium WebDriver calls.
      • a large action space
      • flexibility
  • Supports any website without predefined action spaces.
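
The two-model loop above can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' code: `planner` stands in for HTML-T5, `programmer` for Flan-U-PaLM, and `get_html`, `execute`, the `"DONE"` stop signal, and `max_steps` are hypothetical names I made up.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    sub_instruction: str
    snippet: str
    program: str


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    planner: Callable[[str, str, List[str]], Tuple[str, str]],
    programmer: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Plan + summarize with one model, synthesize a program with the other."""
    history: List[str] = []
    trace: List[Step] = []
    for _ in range(max_steps):
        # Planner role: next sub-instruction plus the relevant HTML snippet.
        sub_instruction, snippet = planner(instruction, get_html(), history)
        if sub_instruction == "DONE":  # hypothetical stop signal
            break
        # Programmer role: a Python program (as text) for this sub-instruction.
        program = programmer(sub_instruction, snippet)
        # Running generated code is the unsafe part; sandboxing is up to `execute`.
        execute(program)
        history.append(sub_instruction)
        trace.append(Step(sub_instruction, snippet, program))
    return trace
```

Keeping `execute` as an injected callable makes it easy to swap in a sandbox or a dry-run logger instead of running the generated program directly.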

Challenges

  • method: a bit complicated
  • action representation: safety? Running LLM-generated code is not safe.
  • Use a smaller model for planning – WHY?

MIND2WEB: Towards a Generalist Agent for the Web

OSU-NLP, NeurIPS 2023 Datasets and Benchmarks Track

A dataset of real tasks on real-world websites, including tasks and user interaction traces.

Dataset Format

  • Action Traces
    • cleaned HTML and raw HTML at each step
    • representation of the action

Strength

  • Very good project homepage – https://osu-nlp-group.github.io/Mind2Web/
  • annotation process:
    • human-annotated dataset
    • select element then select action
  • Method:
    • two-step.
      1. select candidate DOM elements
        1. a 0-1 score for each candidate element
        2. random negative samples
      2. generate the action
        1. a multiple-choice question over the candidate elements
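
A rough sketch of the two-step method, under my own assumptions: `score` stands in for the small ranking model that gives each element a 0-1 score, and `pick` stands in for the LLM answering a multiple-choice question (returning an index, or `None` for "none of the above"). The grouping into rounds follows my reading of the paper; `top_k` and `group_size` are illustrative values.

```python
from typing import Callable, List, Optional


def rank_candidates(
    task: str,
    elements: List[str],
    score: Callable[[str, str], float],
    top_k: int = 50,
) -> List[str]:
    """Step 1: score every element (one scorer call each) and keep the top-k."""
    ranked = sorted(elements, key=lambda el: score(task, el), reverse=True)
    return ranked[:top_k]


def choose_element(
    task: str,
    candidates: List[str],
    pick: Callable[[str, List[str]], Optional[int]],
    group_size: int = 5,
) -> Optional[str]:
    """Step 2: repeated multiple-choice rounds until at most one element is left."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    remaining = list(candidates)
    while len(remaining) > 1:
        winners = []
        for i in range(0, len(remaining), group_size):
            group = remaining[i:i + group_size]
            choice = pick(task, group)  # index into group, or None
            if choice is not None:
                winners.append(group[choice])
        remaining = winners
    return remaining[0] if remaining else None
```

Note the cost: step 1 needs one scorer call per element on the page, and step 2 needs roughly `top_k / group_size` LLM calls in the first round alone.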

Challenges

  • computing a 0-1 score for each candidate element is too expensive
    • you need many model calls for a single action
    • infeasible in practice

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

NUS, arXiv preprint

Motivation:

Reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. Addressing this, we propose Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents’ planning capabilities.

Strength

  • Retrieve related experience as new ICL examples
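
A minimal sketch of "retrieve related experience as ICL examples". To keep it self-contained I use a bag-of-words cosine similarity as a stand-in for whatever embedding model the paper uses; the `(task, trajectory)` memory layout and the prompt format are my assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, memory: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k past (task, trajectory) pairs most similar to the query task."""
    q = embed(query)
    ranked = sorted(memory, key=lambda m: cosine(q, embed(m[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, memory: List[Tuple[str, str]], k: int = 2) -> str:
    """Put the retrieved experiences in front of the new task as in-context examples."""
    shots = "\n\n".join(
        f"Task: {task}\nTrajectory: {traj}" for task, traj in retrieve(query, memory, k)
    )
    return f"{shots}\n\nTask: {query}\nTrajectory:"
```

Usage: `build_prompt("buy a green shirt", memory)` yields a few-shot prompt whose examples change with the current task, with no weight updates involved.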

Challenges

  • lacks a comparison against fine-tuning the model on the memory database

SteP: Stacked LLM Policies for Web Actions

Cornell and ASAPP Research; COLM 2024

The authors' claim: specifying one large prompt to handle all possible behaviors and states is extremely complex, and decomposing it into distinct policies requires careful handling of control between policies. They propose SteP, an approach that dynamically composes policies to solve a diverse set of web tasks.

SteP defines a Markov Decision Process where the state is a stack of policies.

This enables dynamic control, where any policy can choose to invoke any other policy.
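
Here is a small sketch of what "the state is a stack of policies" could look like in code. It is a simplification of my own: a policy here only sees its instruction and the results it has collected so far (no web page), and the three action kinds `"act"`, `"call"`, and `"return"` are names I invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# ("act", command) | ("call", policy_name, instruction) | ("return", result)
Action = Tuple[str, ...]
Policy = Callable[[str, List[str]], Action]


def run(
    policies: Dict[str, Policy],
    root: str,
    instruction: str,
    max_steps: int = 50,
) -> List[str]:
    """Run the policy on top of the stack until the stack is empty."""
    # Each frame: (policy name, its instruction, results it has seen so far).
    stack: List[Tuple[str, str, List[str]]] = [(root, instruction, [])]
    executed: List[str] = []
    for _ in range(max_steps):
        if not stack:
            break
        name, instr, results = stack[-1]
        action = policies[name](instr, results)
        kind = action[0]
        if kind == "act":
            # Primitive web action: record it (a real agent would send it to the browser).
            executed.append(action[1])
            results.append(action[1])
        elif kind == "call":
            # Hand control to another policy by pushing a new frame.
            stack.append((action[1], action[2], []))
        elif kind == "return":
            # Pop the frame and pass the result back to the caller, if any.
            stack.pop()
            if stack:
                stack[-1][2].append(action[1])
    return executed
```

Because any policy may emit a `"call"`, control flow is decided at run time rather than fixed in one big prompt.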

Strengths

  • some sort of “mid-level intent” prompt for completing high-level tasks

Challenges

  • I didn’t realize how policies are constructed when reading the introduction and the abstract.
    • Okay, I see why they avoid saying it: the policies are manually crafted.
  • The need to manually craft policies is then a big issue.

WebArena: A Realistic Web Environment for Building Autonomous Agents

The authors' claim: current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios.

They built an environment for language-guided agents that is realistic and reproducible.

Overall very well-done and comprehensive work.

Strengths

  • website environment in 4 domains: e-commerce, social forum discussions, collaborative software development and content management.
  • enriched with tools (map) and external knowledge bases (manuals?)
  • set of benchmark tasks focusing on evaluating the functional correctness of task completions.
  • the environment is implemented as an OpenAI Gym environment and shipped in Docker containers.
  • Reproducibility is maintained by making the environment standalone (no reliance on live websites, no CAPTCHAs).
  • Environment definition:
    • State space S, action space A, observation space O
      • Observation space:
        • web page url
        • opened tabs (consider multi-tab web-based tasks to promote tool usage)
        • web page content
          • DOM tree or
          • screenshot or
          • accessibility tree
      • Action space:
        • element selection:
          • by coordinates (x, y)
          • by element id (numerical)
    • Transition function: $T: S \times A \to S$
  • The craft of benchmark dataset
    • Intent Collection
      • seed intents from human annotators
      • abstract and high-level, creative (e.g., "create a Reddit account identical to my GitLab one"), formulated by a template
    • Evaluation of correctness:
      • Information-seeking tasks:
        • The answer is a string.
        • exact match
        • must_include
        • fuzzy_match (call gpt-4 to evaluate)
      • site navigation and content & config tasks:
        • a reward function to evaluate the intermediate state
          • locator: JavaScript, API call, or database query
            • find result text
          • keywords
            • reuse match functions
      • Unachievable Tasks:
        • expect the agent to produce N/A
      • human performance 78.24%
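
The three string-matching checks for information-seeking tasks can be sketched like this. The function names mirror the notes above, but the bodies are my own guess at reasonable behavior (for example, the case-insensitive comparison is an assumption), and `judge` is a hypothetical stand-in for the GPT-4 call.

```python
from typing import Callable, List


def exact_match(answer: str, reference: str) -> bool:
    """Answer must equal the reference (ignoring case and surrounding whitespace)."""
    return answer.strip().lower() == reference.strip().lower()


def must_include(answer: str, keywords: List[str]) -> bool:
    """Every required keyword must appear somewhere in the answer."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in keywords)


def fuzzy_match(answer: str, reference: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the answer is semantically equivalent to the reference."""
    prompt = (
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Is the candidate semantically equivalent to the reference? Answer yes or no."
    )
    return judge(prompt).strip().lower().startswith("yes")
```

For unachievable tasks, the same `exact_match` can be reused with `"N/A"` as the reference.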

Challenges

  • Do numerical element IDs cause hallucination?
  • not enough description of the accessibility tree