Paper reading: WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Posted on 2024-07-15 in Paper Reading. Word count: 323. Reading time ≈ 1 minute

ZJU, Tencent, ACL2024

RQ: Previous studies on web understanding for LLMs focus on managing complex HTML text, while visual web agents have been overlooked.

Strengths

We introduce WebVoyager (Figure 1), a multimodal web agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention.
To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the final results automatically.
Auto-generated dataset built with self-instruct; WebVoyager is also evaluated on other datasets.
Overlays visual marks on the screenshot (Set-of-Mark style) to hint at clickable areas.
Human evaluation of 300 tasks.
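
The automated evaluation idea above (save screenshots during navigation, then let a multimodal model judge the trajectory plus the final result) can be sketched roughly as follows. This is only an illustration under assumptions: `Trajectory`, `success_rate`, `stub_judge`, and the `last_k` parameter are hypothetical names, and the judge is a pluggable callable standing in for a GPT-4V call, not the paper's actual prompt or code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str               # natural-language task given to the agent
    screenshots: List[str]  # paths of saved screenshots, in navigation order
    final_answer: str       # the agent's final response


# A judge receives (task, screenshots, final_answer) and returns True on success.
# In practice this would wrap a multimodal model call; here it is left abstract.
Judge = Callable[[str, List[str], str], bool]


def success_rate(trajectories: List[Trajectory], judge: Judge, last_k: int = 3) -> float:
    """Fraction of trajectories the judge accepts, showing it only the last_k screenshots."""
    if not trajectories:
        return 0.0
    passed = sum(
        judge(t.task, t.screenshots[-last_k:], t.final_answer) for t in trajectories
    )
    return passed / len(trajectories)


def stub_judge(task: str, screenshots: List[str], final_answer: str) -> bool:
    """Placeholder judge: accepts any trajectory with screenshots and a non-empty answer."""
    return bool(screenshots) and bool(final_answer.strip())
```

Swapping `stub_judge` for a real model-backed judge leaves the aggregation unchanged, which also makes it easy to compare the automatic verdicts against human labels on the same trajectories.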

Drawbacks

While I understand the difficulty of recruiting more participants for human evaluation, I’d like to see all models (not only GPT-4) evaluated by humans, as GPT-4 only shows a ~60-70% success rate in human evaluation.

Dataset

{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--1", "ques": "Search for women's golf polos in m size, priced between 50 to 75 dollars, and save the lowest priced among results.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--2", "ques": "Find a gaming desktop with Windows 11 Home, and the disk size should be 1TB.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--3", "ques": "Find climbing gears and sort the results by price high to low. Answer the first 3 results after sorting.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--4", "ques": "Find the used Nintendo Switch Lite on Amazon then filter by 'Used - Good', tell me the cheapest one that is 'Used - Good'.", "web": "https://www.amazon.com/"}

40 Amazon tasks in total, which may be useful?
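
To pull out the tasks for one site, a minimal sketch is below. It assumes the dataset is stored as JSON Lines (one object per line, as in the sample above); the helper names `parse_tasks` and `tasks_for_site` are made up for illustration, and the inline `SAMPLE` just reuses two of the records shown.

```python
import json
from collections import Counter

# Two records copied from the sample above; the real file is assumed to be JSON Lines.
SAMPLE = '''{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--2", "ques": "Find a gaming desktop with Windows 11 Home, and the disk size should be 1TB.", "web": "https://www.amazon.com/"}'''


def parse_tasks(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def tasks_for_site(tasks, web_name):
    """Keep only the tasks whose web_name matches."""
    return [t for t in tasks if t["web_name"] == web_name]


tasks = parse_tasks(SAMPLE)
amazon_tasks = tasks_for_site(tasks, "Amazon")
counts = Counter(t["web_name"] for t in tasks)  # tasks per site
```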

Paper reading: Scaling Synthetic Data Creation with 1,000,000,000 Personas

Posted on 2024-07-15 in Paper Reading. Word count: 221. Reading time ≈ 1 minute

Creates 1B personas; 200k of them are open-sourced

Target: synthetic data for better LLM training

Persona -> task

Beats a 70B open-source model with a compact 7B model

Text-to-persona: use any text as input to obtain corresponding personas just by prompting the LLM “Who is likely to [read|write|like|dislike|…] the text?”
Persona-to-Persona: persona relationship expansion
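
The two prompting steps can be sketched as plain prompt builders. The Text-to-Persona question follows the wording quoted above; the Persona-to-Persona wording, the function names, and the `RELATIONS` tuple are assumptions for illustration, and no LLM call is made here.

```python
# Relations taken from the "[read|write|like|dislike|…]" pattern in the note.
RELATIONS = ("read", "write", "like", "dislike")


def text_to_persona_prompt(text, relation="read"):
    """Build the Text-to-Persona prompt for a given piece of text."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation}")
    return f"Who is likely to {relation} the text?\n\nText:\n{text}"


def persona_to_persona_prompt(persona):
    """Build a relationship-expansion prompt (wording is a hypothetical stand-in)."""
    return f"Who is in a close relationship with the given persona?\n\nPersona:\n{persona}"


prompt = text_to_persona_prompt("A tutorial on recurrence relations.", "write")
```

Each returned persona can be fed back through `persona_to_persona_prompt` to expand the set, which is the relationship-expansion step the note mentions.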

{
 "input persona": "A software engineer who disagrees with the established computer scientist's methodologies and approaches",
 "synthesized text": "The software engineer, John, is working on a project where he needs to calculate the time complexity of his newly developed algorithm. He disagrees with the established computer scientist's methodologies and approaches and wants to use his own method. \n\nJohn's algorithm is a recursive function that calls itself twice for each level of recursion. The base case (when the recursion stops) occurs when the input size is 1. \n\nGiven that the time taken by the algorithm when the input size is 1 is C (a constant), John needs to express the time complexity T(n) of his algorithm as a recurrence relation. \n\nWhat is the recurrence relation for the time complexity of John's algorithm?",
 "description": "math problem"
}

Concerns

May not be that related to our use-case?
- They are trying to synthesize more high-quality training data for LLM pre-training

Paper reading: Understanding Human-AI Workflows for Generating Personas

Posted on 2024-07-15 in Paper Reading. Word count: 131. Reading time ≈ 1 minute

DIS 2024

Generates personas from text, from the perspective of human-AI collaboration

RQ: which persona-generation subtasks should be delegated to user researchers vs. LLMs to produce representative and empathy-evoking personas?

Strength:

Interesting findings
- Involving user researchers in identifying key characteristics makes the age variance smaller
- LLM-auto can match the ground-truth distribution
- LLM-summary can generate the most statistically representative personas

Concerns:

(Somewhat) trivial findings (LLM-summary and LLM-grouping perform the best):
- Involving user researchers in more stages yields better performance
- No-LLM baseline (how does summarizing help?)
- Human workflow – which part is the most time-consuming?
How do you define a better “persona”?
- What is the usage of personas? Are they going to be processed by LLMs?
- Will “be more expressive” help?

Paper reading: An Empathy-Based Sandbox Approach to Bridge the Privacy Gap among Attitudes, Goals, Knowledge, and Behaviors [extensive reading]

Posted on 2024-07-15 in Paper Reading. Word count: 78. Reading time ≈ 1 minute

arXiv 2024, Toby

Risk-free privacy sandbox

Strength

Interesting and novel idea
Strong use case

Concerns:

Mostly on implementation.
Browsing history – hallucination? It also may not be effective, since browsing history is sensitive and cannot be read by web apps.
How did they mimic and use social media posts?
- They basically can’t.

Idea:

An LLM agent that impersonates a persona and browses the internet for a while (using real Twitter/Google accounts, mimicking real-person behavior), as a longitudinal study
- But for what?
- ethical concerns

Paper reading: ChatDev: Communicative Agents for Software Development [extensive reading]

Posted on 2024-07-15 in Paper Reading. Word count: 99. Reading time ≈ 1 minute

arXiv 2024, THUNLP

A chat-powered software development framework integrating multiple “software agents” for active involvement in three core phases of the software lifecycle: design, coding, and testing.

Communicative De-hallucination

Strength:

Novel Communicative De-hallucination
Agent-based self-consistency?
Defines dataset, goal, and metric

Concerns:

Mainly on evaluation metrics.
Completeness – why would there be incomplete code?
How do you define executability?
- How much code can be compiled isn’t a good metric
- Python?
Consistency – weird definition
- Similarity between semantic embeddings of the software code and the textual requirements
- WHY?
Quality:
- Product of completeness, executability, and consistency
- Okay
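
As a rough sketch of how these metrics combine: consistency is taken here as the cosine similarity of two embedding vectors, and quality as the product of the three scores, as described in the notes above. The function names and the toy vectors are illustrative assumptions, not ChatDev's code, and how the embeddings are produced is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def quality(completeness, executability, consistency):
    """Quality as the product of the three component scores (each assumed in [0, 1])."""
    return completeness * executability * consistency


# Toy embeddings standing in for the requirement text and the generated code.
requirement_vec = [1.0, 0.0]
code_vec = [1.0, 0.0]
consistency = cosine_similarity(requirement_vec, code_vec)
score = quality(1.0, 0.5, consistency)
```

Because the scores are multiplied, a zero on any one component (for example, code that does not run) drives the overall quality to zero regardless of the other two.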

Paper reading: LASER: LLM Agent with State-Space Exploration for Web Navigation [extensive reading]

Posted on 2024-07-15 in Paper Reading. Word count: 45. Reading time ≈ 1 minute

arXiv 2024, Tencent AI LAB

RQ: LLM with state-space exploration for webshop task

Strength:

Allowing LLM agents to develop complex solutions (including going backward) to yield a better result
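
A toy sketch of the idea: the web task is modeled as a small state graph where each state exposes only its own actions, including ones that go backward. The state and action names below are hypothetical, chosen for a WebShop-like shopping flow, and the action sequence is hard-coded where a real agent would have an LLM pick the next action.

```python
# Hypothetical state space: state -> {action: next_state}.
TRANSITIONS = {
    "search": {"search": "results"},
    "results": {
        "select_item": "item",
        "next_page": "results",
        "back_to_search": "search",
    },
    "item": {"buy": "done", "back_to_results": "results"},
    "done": {},
}


def run(actions, start="search"):
    """Follow a sequence of actions, rejecting any action not allowed in the current state."""
    state = start
    path = [state]
    for action in actions:
        allowed = TRANSITIONS[state]
        if action not in allowed:
            raise ValueError(f"action {action!r} not allowed in state {state!r}")
        state = allowed[action]
        path.append(state)
    return path


# The agent opens an item, backs out, and picks another one before buying.
path = run(["search", "select_item", "back_to_results", "select_item", "buy"])
```

Restricting the action set per state is what lets the agent recover from a wrong choice: backing out is just another legal transition rather than a failure.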

Concerns:

Comparison with WebGPT: what’s the difference? WebGPT is an obviously stronger baseline.