Leo's blog

Recommend McDonald's

UCI/CMU, NeurIPS 2023

Motivation

Previous agents need large amounts of expert demonstrations and task-specific reward functions to solve unseen tasks.

An LLM can execute computer tasks guided by natural-language prompts.

Focus on the MiniWoB++ environment, an RL-style benchmark of small game-like tasks on the web.

Strength

  • novel (maybe? not sure) Recursively Criticize and Improve (RCI) prompt format
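
The RCI loop can be sketched in a few lines. Here `llm` is a hypothetical stand-in for a chat-model call, and the prompt wording is mine, not the paper's:

```python
# Minimal sketch of a Recursively Criticize and Improve (RCI) style loop.
# `llm` is a hypothetical stand-in for a chat-model call; the prompt
# wording is illustrative, not the paper's exact template.
def rci(llm, task, rounds=2):
    """Draft an answer, then repeatedly ask the model to critique and improve it."""
    answer = llm(f"Task: {task}\nGive your answer.")
    for _ in range(rounds):
        critique = llm(f"Task: {task}\nAnswer: {answer}\n"
                       "Review your answer and find problems with it.")
        answer = llm(f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
                     "Based on the critique, improve the answer.")
    return answer
```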

Challenges

  • not that closely related, so I didn’t read it thoroughly.

A REAL-WORLD WEBAGENT WITH PLANNING, LONG CONTEXT UNDERSTANDING, AND PROGRAM SYNTHESIS

Deepmind, ICLR 2024

Motivation

LLM agents on real-world websites still suffer from open-domainness, limited context length, and a lack of inductive bias for HTML.

They introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural-language instructions.

Strength

  • method:
    • two-model method.
      • HTML-T5 to predict the next sub-instruction (planning) and generate related HTML snippets (summarization)
      • Flan-U-PaLM prompted to generate Python programs
    • Actions are represented as Python programs with Selenium driver calls.
      • a large action space
      • flexibility
  • Supports any website without predefined action spaces.
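
As a rough illustration of this action representation (my sketch, not the paper's code): the agent emits a small Python program against a Selenium-style `driver`. Here the program is dry-run against a stub driver so the snippet stays self-contained; in WebAgent the program would run against a real Selenium WebDriver.

```python
# Illustration only: an action represented as a generated Python program
# that calls Selenium-style driver methods. The stub driver records calls
# instead of touching a browser, which also sidesteps the safety issue of
# executing LLM-generated code directly.
GENERATED_PROGRAM = """
box = driver.find_element("name", "q")
box.send_keys("adapters for usb-c")
box.submit()
"""

class StubElement:
    def __init__(self, log): self.log = log
    def send_keys(self, text): self.log.append(("send_keys", text))
    def submit(self): self.log.append(("submit",))

class StubDriver:
    def __init__(self): self.log = []
    def find_element(self, by, value):
        self.log.append(("find_element", by, value))
        return StubElement(self.log)

driver = StubDriver()
exec(GENERATED_PROGRAM, {"driver": driver})  # a real agent would sandbox this
```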

Challenges

  • method: a bit complicated
  • action representation: safety? Running code generated by an LLM is not safe.
  • Use a smaller model for planning – WHY?

MIND2WEB: Towards a Generalist Agent for the Web

OSU-NLP, NeurIPS 2023 Datasets and Benchmarks Track

A dataset of real tasks on real-world websites, including tasks and user interaction traces.


Dataset Format


  • Action Traces
    • cleaned HTML and raw HTML at each step
    • a representation of each action

Strength

  • Very good project homepage – https://osu-nlp-group.github.io/Mind2Web/
  • annotation process:
    • human-annotated dataset
    • select element then select action
  • Method:
    • two-step:
      1. select candidate DOM elements
        1. a 0–1 score for each candidate element
        2. random negative samples
      2. generate the action
        1. multiple choice over the candidate elements
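
A minimal sketch of this two-step pipeline, with stand-in functions `score` (the small ranking model) and `choose` (the multiple-choice LLM) — both names are mine, not the paper's:

```python
# Sketch of a Mind2Web-style two-step pipeline with stub models.
def select_candidates(elements, score, k=3):
    """Step 1: score every DOM element in [0, 1]; keep the top-k."""
    ranked = sorted(elements, key=score, reverse=True)
    return ranked[:k]

def predict_action(task, elements, score, choose, k=3):
    """Step 2: multiple choice over the top-k candidates (plus a 'none' option)."""
    options = select_candidates(elements, score, k) + ["none of the above"]
    return choose(task, options)
```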

Challenges

  • a 0-1 score for each candidate element is too expensive
    • you need many model calls for a single action
    • infeasible at scale

WebArena

CMU, ICLR 2024

The authors’ claim: current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios.

They built an environment for language-guided agents that is realistic and reproducible.

Overall very well-done and comprehensive work.

Strengths

  • website environment in 4 domains: e-commerce, social forum discussions, collaborative software development and content management.
  • enriched with tools (map) and external knowledge bases (manuals?)
  • set of benchmark tasks focusing on evaluating the functional correctness of task completions.
  • the environment is implemented against the OpenAI Gym interface and shipped in Docker containers.
  • Maintains reproducibility by making the environment standalone (without relying on live websites). No CAPTCHAs.
  • Environment definition:
    • State space S, action space A, observation space O
      • Observation space:
        • web page url
        • opened tabs (consider multi-tab web-based tasks to promote tool usage)
        • web page content
          • DOM tree or
          • screenshot or
          • accessibility tree
      • Action space:
        • element selection:
          • by coordinates (x, y)
          • by element id (numerical)
    • Transition function: $T: S \times A \to S$
  • The craft of benchmark dataset
    • Intent Collection
      • seed intents from human annotators
      • abstract, high-level, and creative (e.g. “create a Reddit account identical to my GitLab one”), formulated from templates
    • Evaluation of correctness:
      • Information seek tasks:
        • The answer is a string.
        • exact match
        • must_include
        • fuzzy_match (call gpt-4 to evaluate)
      • site navigation and content&config tasks:
        • a reward function to evaluate the intermediate state
          • locator: javascript or api or database query
            • find result text
          • keywords
            • reuse match functions
      • Unachievable Tasks:
        • expect the agent to produce N/A
      • human performance: 78.24% success rate
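
The string evaluators above can be sketched as follows. `exact_match` and `must_include` follow the names in these notes; the GPT-4-based `fuzzy_match` is omitted since it needs a model call. This is my sketch, not the benchmark's implementation:

```python
# Sketch of WebArena-style string evaluators for information-seeking tasks.
def exact_match(answer: str, reference: str) -> float:
    """Reward 1.0 only for a (case/whitespace-normalized) exact match."""
    return float(answer.strip().lower() == reference.strip().lower())

def must_include(answer: str, keywords: list) -> float:
    """Reward 1.0 only if every required keyword appears in the answer."""
    a = answer.lower()
    return float(all(k.lower() in a for k in keywords))
```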

Challenges

  • Do the numerical element ids cause hallucination?
  • not enough description of the accessibility tree

WebVoyager

ZJU, Tencent, ACL 2024

RQ: previous studies of web understanding for LLMs focus on managing complex HTML text, while visual web agents have been overlooked.

Strengths

  • We introduce WebVoyager (Figure 1), a multimodal web agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention.

  • To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the final results automatically.

  • Auto-generated dataset using self-instruct. They also evaluate WebVoyager on other datasets.

  • Use visual marks (Set-of-Mark prompting) to hint at the clickable areas


  • human evaluation of 300 tasks.
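
The GPT-4V evaluation protocol could look roughly like this: pack the task, the saved screenshots, and the agent's final answer into one vision request. The message structure follows the OpenAI vision API's `image_url` content parts, but the prompt wording and function name are my assumptions, not the paper's:

```python
# Sketch (assumed format): build a GPT-4V-style request that asks the model
# to judge a whole navigation trajectory from its screenshots.
import base64

def build_eval_messages(task, screenshot_bytes, final_answer):
    content = [{"type": "text",
                "text": f"Task: {task}\nFinal answer: {final_answer}\n"
                        "Given the screenshots of the trajectory, was the task "
                        "completed? Reply SUCCESS or FAILURE."}]
    for png in screenshot_bytes:  # one data-URL image part per screenshot
        b64 = base64.b64encode(png).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```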

Drawbacks

  • while I understand the difficulty of recruiting more participants for human evaluation, I’d like to see all models (not only GPT-4) evaluated by humans, as GPT-4 shows only a ~60-70% success rate in human evaluation.

Dataset

{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--1", "ques": "Search for women's golf polos in m size, priced between 50 to 75 dollars, and save the lowest priced among results.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--2", "ques": "Find a gaming desktop with Windows 11 Home, and the disk size should be 1TB.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--3", "ques": "Find climbing gears and sort the results by price high to low. Answer the first 3 results after sorting.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--4", "ques": "Find the used Nintendo Switch Lite on Amazon then filter by 'Used - Good', tell me the cheapest one that is 'Used - Good'.", "web": "https://www.amazon.com/"}

40 Amazon tasks in total, which may be useful?
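
The task records above are JSON Lines (one JSON object per line), so loading them is a one-liner per record; the file path in the comment is hypothetical:

```python
# Parse one WebVoyager task record (JSON Lines: one object per line).
import json

SAMPLE = ('{"web_name": "Amazon", "id": "Amazon--0", '
          '"ques": "Search an Xbox Wireless controller with green color '
          'and rated above 4 stars.", "web": "https://www.amazon.com/"}')

task = json.loads(SAMPLE)
# For a whole file (hypothetical path):
# tasks = [json.loads(line) for line in open("tasks.jsonl")]
```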

Persona Hub

Create 1B personas, open-sourced 200k of them

Target: synthetic data for better LLM training

Persona -> task

Beats 70B open-source models with a compact 7B model

  • Text-to-persona: use any text as input to obtain corresponding personas just by prompting the LLM “Who is likely to [read|write|like|dislike|…] the text?”

  • Persona-to-Persona: persona relationship expansion
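
Both prompting steps can be sketched with a stand-in `llm` function; the prompt wording is my paraphrase of the notes above, not the paper's exact template:

```python
# Sketch of the two persona-expansion prompting steps with a stub LLM.
def text_to_persona(llm, text, relation="read"):
    """Derive a persona from any text: 'Who is likely to <relation> the text?'"""
    return llm(f"Who is likely to {relation} the following text?\n\n{text}")

def persona_to_persona(llm, persona):
    """Expand one persona into related personas (colleagues, patients, ...)."""
    return llm("Who is in a close relationship with the persona below? "
               f"List them.\n\n{persona}")
```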

{
"input persona": "A software engineer who disagrees with the established computer scientist's methodologies and approaches",
"synthesized text": "The software engineer, John, is working on a project where he needs to calculate the time complexity of his newly developed algorithm. He disagrees with the established computer scientist's methodologies and approaches and wants to use his own method. \n\nJohn's algorithm is a recursive function that calls itself twice for each level of recursion. The base case (when the recursion stops) occurs when the input size is 1. \n\nGiven that the time taken by the algorithm when the input size is 1 is C (a constant), John needs to express the time complexity T(n) of his algorithm as a recurrence relation. \n\nWhat is the recurrence relation for the time complexity of John's algorithm?",
"description": "math problem"
}

Concerns

  • May not be that related to our use case?
    • They are trying to synthesize more high-quality training data for LLM pre-training