Leo's blog

Recommend McDonald's

UCI/CMU, NeurIPS 2023

Motivation

Previous agents need large amounts of expert demonstrations and task-specific reward functions to solve unseen tasks.

An LLM can execute computer tasks guided by NL prompts.

The paper focuses on the MiniWoB++ environment, a collection of small, game-like RL environments that run on the web.

Strength

  • Possibly novel (not sure): the Recursively Criticizes and Improves (RCI) prompting format
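
To make the RCI idea concrete for myself, here is a minimal sketch of the loop: draft an answer, ask the model to criticize it, then ask it to improve the answer given the critique. The prompt wording, the `rci` function, and `fake_llm` are my own illustrative assumptions, not the paper's code.

```python
from typing import Callable


def rci(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Draft an answer, then alternate critique and improvement prompts."""
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        # Criticize: ask the model to find problems in its own answer.
        critique = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "Review the answer and find problems with it."
        )
        # Improve: ask the model to rewrite the answer using the critique.
        answer = llm(
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
            "Based on the problems you found, improve the answer."
        )
    return answer


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    if prompt.endswith("find problems with it."):
        return "The button name is missing."
    if prompt.endswith("improve the answer."):
        return "click the 'Submit' button"
    return "click the button"


result = rci("Submit the form", fake_llm, max_rounds=1)
print(result)
```

Swapping `fake_llm` for a real model call is the only change needed to try this on an actual task.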

Challenges

  • Not that related, so I didn’t read it thoroughly.

A REAL-WORLD WEBAGENT WITH PLANNING, LONG CONTEXT UNDERSTANDING, AND PROGRAM SYNTHESIS

DeepMind, ICLR 2024

Motivation

LLM agents on real-world websites still suffer from open-domainness, limited context length, and a lack of inductive bias on HTML.

They introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions.

Strength

  • Method:
    • Two-model method.
      • HTML-T5 predicts the next sub-instruction (planning) and generates the related HTML snippets (summarization).
      • Flan-U-PaLM is prompted to generate Python programs.
    • Actions are represented as Python programs with Selenium WebDriver calls.
      • a large action space
      • flexibility
  • Supports any website without a predefined action space.
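
The two-model loop above can be sketched roughly as follows. This is only my reading of the control flow: `run_episode`, the `"DONE"` stop signal, and all the stub functions are hypothetical, and the stubs stand in for HTML-T5, Flan-U-PaLM, and a real browser. The generated program is kept as a string and never executed here.

```python
from typing import Callable, List, Tuple


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan_and_summarize: Callable[[str, List[str], str], Tuple[str, str]],
    write_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 5,
) -> List[str]:
    """Alternate between the planner/summarizer and the program writer."""
    history: List[str] = []  # sub-instructions completed so far
    for _ in range(max_steps):
        html = get_html()
        sub_instruction, snippet = plan_and_summarize(instruction, history, html)
        if sub_instruction == "DONE":  # hypothetical stop signal
            break
        program = write_program(sub_instruction, snippet)
        execute(program)
        history.append(sub_instruction)
    return history


# Hypothetical stubs so the sketch runs without a browser or a model.
pages = ["<input id='q'><button id='go'>Search</button>", "<div id='results'></div>"]
executed: List[str] = []


def get_html() -> str:
    # Pretend the page changes after the first program has been executed.
    return pages[min(len(executed), len(pages) - 1)]


def plan_and_summarize(instruction: str, history: List[str], html: str) -> Tuple[str, str]:
    if "results" in html:
        return "DONE", ""
    return "type the query into the search box", "<input id='q'>"


def write_program(sub_instruction: str, snippet: str) -> str:
    # A real system would return model-generated Selenium code like this string.
    return "driver.find_element(By.ID, 'q').send_keys('shoes')"


history = run_episode(
    "search for shoes", get_html, plan_and_summarize, write_program, executed.append
)
print(history, executed)
```

Keeping `execute` as an injected callable is a deliberate choice in this sketch: it marks the single place where generated code would actually run.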

Challenges

  • Method: a bit complicated.
  • Action representation: is it safe? Running LLM-generated code is not safe.
  • Why use a smaller model for planning?

MIND2WEB: Towards a Generalist Agent for the Web

OSU-NLP, NeurIPS 2023 Datasets and Benchmarks Track

A dataset of real tasks on real-world websites, including tasks and user interaction traces.

Dataset Format

  • Action Traces
    • cleaned HTML and raw HTML at each step
    • a representation of each action

Strength

  • Very good project homepage – https://osu-nlp-group.github.io/Mind2Web/
  • annotation process:
    • human-annotated dataset
    • select an element, then select an action
  • Method:
    • Two-step.
      1. Select candidate DOM elements
        1. a 0-1 score for each candidate element
        2. random negative samples
      2. Generate the action
        1. a multiple-choice question over the candidate elements
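
A rough sketch of the two steps, as I understand them: score every element, keep the top few, then turn the survivors into a multiple-choice prompt. The `overlap_score` function is a toy stand-in I made up for the learned scoring model, and the prompt text and function names are assumptions of mine.

```python
import string
from typing import Callable, List


def rank_candidates(
    task: str, elements: List[str], score: Callable[[str, str], float], k: int = 2
) -> List[str]:
    """Step 1: score every element independently (one scorer call each), keep the top k."""
    return sorted(elements, key=lambda el: score(task, el), reverse=True)[:k]


def multichoice_prompt(task: str, candidates: List[str]) -> str:
    """Step 2: format the surviving candidates as a multiple-choice question."""
    lines = [f"Task: {task}", "Which element should be acted on next?"]
    for letter, el in zip(string.ascii_uppercase, candidates):
        lines.append(f"{letter}. {el}")
    return "\n".join(lines)


def overlap_score(task: str, element: str) -> float:
    """Hypothetical scorer: fraction of task words that appear in the element text."""
    words = task.lower().split()
    return sum(w in element.lower() for w in words) / len(words)


elements = [
    "<button>Search flights</button>",
    "<a>Sign in</a>",
    "<input placeholder='flights from'>",
    "<a>Help</a>",
]
top = rank_candidates("search flights", elements, overlap_score, k=2)
prompt = multichoice_prompt("search flights", top)
print(prompt)
```

Note that `rank_candidates` calls the scorer once per element, so a page with thousands of DOM elements needs thousands of scorer calls before a single action is chosen.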

Challenges

  • Computing a 0-1 score for each candidate element is too complex
    • you need many model calls for a single action
    • infeasible in practice

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

NUS, arXiv preprint

Motivation:

Reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. Addressing this, we propose Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents’ planning capabilities.

Strength

  • Retrieve related experience as new ICL examples
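
A minimal sketch of the retrieval idea: look up the most similar past experiences and paste them into the prompt as in-context examples. The bag-of-words cosine similarity is a simple stand-in for whatever embedding model a real system would use, and the memory schema (`situation`, `action`) is my own assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, memory: List[Dict[str, str]], k: int = 1) -> List[Dict[str, str]]:
    """Return the k stored experiences most similar to the current situation."""
    q = embed(query)
    return sorted(
        memory, key=lambda m: cosine(q, embed(m["situation"])), reverse=True
    )[:k]


def build_prompt(query: str, examples: List[Dict[str, str]]) -> str:
    """Use retrieved experiences as in-context examples ahead of the new situation."""
    shots = "\n\n".join(
        f"Situation: {m['situation']}\nAction: {m['action']}" for m in examples
    )
    return f"{shots}\n\nSituation: {query}\nAction:"


memory = [
    {"situation": "search box is empty on the shopping page", "action": "type the product name"},
    {"situation": "login form is shown", "action": "enter username and password"},
]
query = "the search box on the page is empty"
examples = retrieve(query, memory, k=1)
prompt = build_prompt(query, examples)
print(prompt)
```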

Challenges

  • No comparison against fine-tuning the model on the memory database.

SteP: Stacked LLM Policies for Web Actions

Cornell and ASAPP Research; COLM 2024

Authors’ claim: Specifying one large prompt to handle all possible behaviors and states is extremely complex, and decomposing it into distinct policies requires careful handling of control between policies. They propose SteP, an approach that dynamically composes policies to solve a diverse set of web tasks.

It uses a Markov Decision Process where the state is a stack of policies.

This enables dynamic control, where any policy can choose to invoke any other policy.
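
To check my understanding of the policy stack, here is a small sketch. The top policy on the stack decides what happens next: call another policy (push), emit a web action, or return (pop). The policies here are hand-written functions standing in for prompted LLMs, and the `("call" | "act" | "return", payload)` protocol is my own assumption, not the paper's interface.

```python
from typing import Any, Callable, Dict, List, Tuple

Decision = Tuple[str, Any]  # ("call", (name, arg)) | ("act", action) | ("return", value)
Policy = Callable[[str, List[str]], Decision]


def run(policies: Dict[str, Policy], root: str, task: str, max_steps: int = 20) -> List[str]:
    """Execute policies with an explicit stack; returns the web actions emitted."""
    stack = [{"policy": root, "arg": task, "results": []}]
    actions: List[str] = []
    for _ in range(max_steps):
        if not stack:
            break
        frame = stack[-1]
        kind, payload = policies[frame["policy"]](frame["arg"], frame["results"])
        if kind == "call":  # push the invoked policy onto the stack
            name, arg = payload
            stack.append({"policy": name, "arg": arg, "results": []})
        elif kind == "act":  # emit a web action and remember it in this frame
            actions.append(payload)
            frame["results"].append(payload)
        else:  # "return": pop and hand the value back to the caller, if any
            stack.pop()
            if stack:
                stack[-1]["results"].append(payload)
    return actions


def book_flight(task: str, results: List[str]) -> Decision:
    if not results:
        return ("call", ("fill_form", "from=NYC"))
    return ("return", "booked")


def fill_form(arg: str, results: List[str]) -> Decision:
    if not results:
        return ("act", f"type {arg}")
    if len(results) == 1:
        return ("act", "click submit")
    return ("return", "form done")


actions = run({"book_flight": book_flight, "fill_form": fill_form}, "book_flight", "book a flight")
print(actions)
```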

Strengths

  • some sort of “mid-level intent” prompt for completing high-level tasks

Challenges

  • From the introduction and the abstract alone, I couldn’t tell how the policies are constructed.
    • Okay, I see why they avoid saying it: the policies are manually crafted.
  • The need to manually craft policies is then a big issue.