Leo's blog

Recommend McDonald's

UCI/CMU, NeurIPS 2023

Motivation

Previous agents need large amounts of expert demonstrations and task-specific reward functions to solve unseen tasks.

An LLM can execute computer tasks guided by natural-language prompts.

Focus on the MiniWoB++ environment, an RL-style benchmark of small game-like tasks on the web.

Strength

  • novel (maybe? not sure) Recursively Criticize and Improve (RCI) prompt format
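
The RCI loop can be sketched in a few lines. Here `llm` is a hypothetical stand-in for a chat-model call, and the prompt wording is mine, not the paper's:

```python
# Minimal sketch of a Recursively Criticize and Improve (RCI) style loop.
# `llm` is a hypothetical stand-in for a chat-model call; the prompt
# wording is illustrative, not the paper's exact template.
def rci(llm, task, rounds=2):
    """Draft an answer, then repeatedly ask the model to critique and improve it."""
    answer = llm(f"Task: {task}\nGive your answer.")
    for _ in range(rounds):
        critique = llm(f"Task: {task}\nAnswer: {answer}\n"
                       "Review your answer and find problems with it.")
        answer = llm(f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
                     "Based on the critique, improve the answer.")
    return answer
```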

Challenges

  • not that closely related, so I didn’t read it thoroughly.

A REAL-WORLD WEBAGENT WITH PLANNING, LONG CONTEXT UNDERSTANDING, AND PROGRAM SYNTHESIS

Deepmind, ICLR 2024

Motivation

LLM agents on real-world websites still suffer from open-domainness, limited context length, and a lack of inductive bias for HTML.

They introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural-language instructions.

Strength

  • method:
    • two-model method.
      • HTML-T5 to predict the next sub-instruction (planning) and generate related HTML snippets (summarization)
      • Flan-U-PaLM prompted to generate Python programs
    • Actions are represented as Python programs with Selenium driver calls.
      • a large action space
      • flexibility
  • Supports any website without predefined action spaces.
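
As a rough illustration of this action representation (my sketch, not the paper's code): the agent emits a small Python program against a Selenium-style `driver`. Here the program is dry-run against a stub driver so the snippet stays self-contained; in WebAgent the program would run against a real Selenium WebDriver.

```python
# Illustration only: an action represented as a generated Python program
# that calls Selenium-style driver methods. The stub driver records calls
# instead of touching a browser, which also sidesteps the safety issue of
# executing LLM-generated code directly.
GENERATED_PROGRAM = """
box = driver.find_element("name", "q")
box.send_keys("adapters for usb-c")
box.submit()
"""

class StubElement:
    def __init__(self, log): self.log = log
    def send_keys(self, text): self.log.append(("send_keys", text))
    def submit(self): self.log.append(("submit",))

class StubDriver:
    def __init__(self): self.log = []
    def find_element(self, by, value):
        self.log.append(("find_element", by, value))
        return StubElement(self.log)

driver = StubDriver()
exec(GENERATED_PROGRAM, {"driver": driver})  # a real agent would sandbox this
```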

Challenges

  • method: a bit complicated
  • action representation: safety? Running code generated by an LLM is not safe.
  • Use a smaller model for planning – WHY?

MIND2WEB: Towards a Generalist Agent for the Web

OSU-NLP, NeurIPS 2023 Datasets and Benchmarks Track

A dataset of real tasks on real-world websites, including tasks and user interaction traces.


Dataset Format


  • Action Traces
    • cleaned HTML and raw HTML at each step
    • a representation of each action

Strength

  • Very good project homepage – https://osu-nlp-group.github.io/Mind2Web/
  • annotation process:
    • human-annotated dataset
    • select element then select action
  • Method:
    • two-step:
      1. select candidate DOM elements
        1. a 0–1 score for each candidate element
        2. random negative samples
      2. generate the action
        1. multiple choice over the candidate elements
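
A minimal sketch of this two-step pipeline, with stand-in functions `score` (the small ranking model) and `choose` (the multiple-choice LLM) — both names are mine, not the paper's:

```python
# Sketch of a Mind2Web-style two-step pipeline with stub models.
def select_candidates(elements, score, k=3):
    """Step 1: score every DOM element in [0, 1]; keep the top-k."""
    ranked = sorted(elements, key=score, reverse=True)
    return ranked[:k]

def predict_action(task, elements, score, choose, k=3):
    """Step 2: multiple choice over the top-k candidates (plus a 'none' option)."""
    options = select_candidates(elements, score, k) + ["none of the above"]
    return choose(task, options)
```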

Challenges

  • a 0-1 score for each candidate element is too expensive
    • you need many model calls for a single action
    • infeasible at scale

WebArena

CMU, ICLR 2024

The authors’ claim: current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios.

They built an environment for language-guided agents that is realistic and reproducible.

Overall very well-done and comprehensive work.

Strengths

  • website environment in 4 domains: e-commerce, social forum discussions, collaborative software development and content management.
  • enriched with tools (map) and external knowledge bases (manuals?)
  • set of benchmark tasks focusing on evaluating the functional correctness of task completions.
  • the environment is implemented against the OpenAI Gym interface and shipped in Docker containers.
  • Maintains reproducibility by making the environment standalone (without relying on live websites). No CAPTCHAs.
  • Environment definition:
    • State space S, action space A, observation space O
      • Observation space:
        • web page url
        • opened tabs (consider multi-tab web-based tasks to promote tool usage)
        • web page content
          • DOM tree or
          • screenshot or
          • accessibility tree
      • Action space:
        • element selection:
          • by coordinates (x, y)
          • by element id (numerical)
    • Transition function: $T: S \times A \to S$
  • The craft of benchmark dataset
    • Intent Collection
      • seed intents from human annotators
      • abstract, high-level, and creative (e.g. “create a Reddit account identical to my GitLab one”), formulated from templates
    • Evaluation of correctness:
      • Information seek tasks:
        • The answer is a string.
        • exact match
        • must_include
        • fuzzy_match (call gpt-4 to evaluate)
      • site navigation and content&config tasks:
        • a reward function to evaluate the intermediate state
          • locator: javascript or api or database query
            • find result text
          • keywords
            • reuse match functions
      • Unachievable Tasks:
        • expect the agent to produce N/A
      • human performance: 78.24% success rate
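
The string evaluators above can be sketched as follows. `exact_match` and `must_include` follow the names in these notes; the GPT-4-based `fuzzy_match` is omitted since it needs a model call. This is my sketch, not the benchmark's implementation:

```python
# Sketch of WebArena-style string evaluators for information-seeking tasks.
def exact_match(answer: str, reference: str) -> float:
    """Reward 1.0 only for a (case/whitespace-normalized) exact match."""
    return float(answer.strip().lower() == reference.strip().lower())

def must_include(answer: str, keywords: list) -> float:
    """Reward 1.0 only if every required keyword appears in the answer."""
    a = answer.lower()
    return float(all(k.lower() in a for k in keywords))
```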

Challenges

  • Do the numerical element ids cause hallucination?
  • not enough description of the accessibility tree

WebVoyager

ZJU, Tencent, ACL 2024

RQ: previous studies of web understanding for LLMs focus on managing complex HTML text, while visual web agents have been overlooked.

Strengths

  • We introduce WebVoyager (Figure 1), a multimodal web agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention.

  • To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the final results automatically.

  • Auto-generated dataset using self-instruct. They also evaluate WebVoyager on other datasets.

  • Use visual marks (Set-of-Mark prompting) to hint at the clickable areas


  • human evaluation of 300 tasks.
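
The GPT-4V evaluation protocol could look roughly like this: pack the task, the saved screenshots, and the agent's final answer into one vision request. The message structure follows the OpenAI vision API's `image_url` content parts, but the prompt wording and function name are my assumptions, not the paper's:

```python
# Sketch (assumed format): build a GPT-4V-style request that asks the model
# to judge a whole navigation trajectory from its screenshots.
import base64

def build_eval_messages(task, screenshot_bytes, final_answer):
    content = [{"type": "text",
                "text": f"Task: {task}\nFinal answer: {final_answer}\n"
                        "Given the screenshots of the trajectory, was the task "
                        "completed? Reply SUCCESS or FAILURE."}]
    for png in screenshot_bytes:  # one data-URL image part per screenshot
        b64 = base64.b64encode(png).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```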

Drawbacks

  • while I understand the difficulty of recruiting more participants for human evaluation, I’d like to see all models (not only GPT-4) evaluated by humans, as GPT-4 shows only a ~60-70% success rate in human evaluation.

Dataset

{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--1", "ques": "Search for women's golf polos in m size, priced between 50 to 75 dollars, and save the lowest priced among results.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--2", "ques": "Find a gaming desktop with Windows 11 Home, and the disk size should be 1TB.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--3", "ques": "Find climbing gears and sort the results by price high to low. Answer the first 3 results after sorting.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--4", "ques": "Find the used Nintendo Switch Lite on Amazon then filter by 'Used - Good', tell me the cheapest one that is 'Used - Good'.", "web": "https://www.amazon.com/"}

40 Amazon tasks in total, which may be useful?
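
The task records above are JSON Lines (one JSON object per line), so loading them is a one-liner per record; the file path in the comment is hypothetical:

```python
# Parse one WebVoyager task record (JSON Lines: one object per line).
import json

SAMPLE = ('{"web_name": "Amazon", "id": "Amazon--0", '
          '"ques": "Search an Xbox Wireless controller with green color '
          'and rated above 4 stars.", "web": "https://www.amazon.com/"}')

task = json.loads(SAMPLE)
# For a whole file (hypothetical path):
# tasks = [json.loads(line) for line in open("tasks.jsonl")]
```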

Persona Hub

Create 1B personas, open-sourced 200k of them

Target: synthetic data for better LLM training

Persona -> task

Beats 70B open-source models with a compact 7B model

  • Text-to-persona: use any text as input to obtain corresponding personas just by prompting the LLM “Who is likely to [read|write|like|dislike|…] the text?”

  • Persona-to-Persona: persona relationship expansion
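
Both prompting steps can be sketched with a stand-in `llm` function; the prompt wording is my paraphrase of the notes above, not the paper's exact template:

```python
# Sketch of the two persona-expansion prompting steps with a stub LLM.
def text_to_persona(llm, text, relation="read"):
    """Derive a persona from any text: 'Who is likely to <relation> the text?'"""
    return llm(f"Who is likely to {relation} the following text?\n\n{text}")

def persona_to_persona(llm, persona):
    """Expand one persona into related personas (colleagues, patients, ...)."""
    return llm("Who is in a close relationship with the persona below? "
               f"List them.\n\n{persona}")
```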

{
"input persona": "A software engineer who disagrees with the established computer scientist's methodologies and approaches",
"synthesized text": "The software engineer, John, is working on a project where he needs to calculate the time complexity of his newly developed algorithm. He disagrees with the established computer scientist's methodologies and approaches and wants to use his own method. \n\nJohn's algorithm is a recursive function that calls itself twice for each level of recursion. The base case (when the recursion stops) occurs when the input size is 1. \n\nGiven that the time taken by the algorithm when the input size is 1 is C (a constant), John needs to express the time complexity T(n) of his algorithm as a recurrence relation. \n\nWhat is the recurrence relation for the time complexity of John's algorithm?",
"description": "math problem"
}

Concerns

  • May not be that related to our use case?
    • They are trying to synthesize more high-quality training data for LLM pre-training