Leo's blog


WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

ZJU, Tencent, ACL 2024

RQ: Previous studies on web understanding for LLMs focus on managing complex HTML text, while visual web agents have been overlooked.

Strengths

  • We introduce WebVoyager (Figure 1), a multimodal web agent designed to autonomously accomplish web tasks online from start to finish, managing the entire process end-to-end without any intermediate human intervention.

  • To accurately evaluate the capabilities of web agents in end-to-end task completion, we propose an automated evaluation protocol using GPT-4V. Specifically, we save screenshots throughout the online navigation process and then use GPT-4V to evaluate these trajectories together with the final results automatically.

  • Auto-generated dataset using self-instruct; WebVoyager is also evaluated on other datasets.

  • Uses visual marks (Set-of-Mark prompting) to indicate the clickable areas

    (screenshot omitted)

  • Human evaluation of 300 tasks.

Drawbacks

  • While I understand the difficulty of recruiting more participants for human evaluation, I'd like to see all models (not only GPT-4) evaluated by humans, since GPT-4 shows only a ~60-70% success rate in human evaluation.

Dataset

{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--1", "ques": "Search for women's golf polos in m size, priced between 50 to 75 dollars, and save the lowest priced among results.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--2", "ques": "Find a gaming desktop with Windows 11 Home, and the disk size should be 1TB.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--3", "ques": "Find climbing gears and sort the results by price high to low. Answer the first 3 results after sorting.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--4", "ques": "Find the used Nintendo Switch Lite on Amazon then filter by 'Used - Good', tell me the cheapest one that is 'Used - Good'.", "web": "https://www.amazon.com/"}

40 Amazon tasks in total; possibly useful for us?
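Since the task file is plain JSONL (one task object per line), a minimal loader is enough. The two embedded rows below are copied from the sample above; the full file ships with the paper's release:

```python
import json
from collections import Counter

# Two rows copied from the sample above, for illustration.
SAMPLE = '''\
{"web_name": "Amazon", "id": "Amazon--0", "ques": "Search an Xbox Wireless controller with green color and rated above 4 stars.", "web": "https://www.amazon.com/"}
{"web_name": "Amazon", "id": "Amazon--1", "ques": "Search for women's golf polos in m size, priced between 50 to 75 dollars, and save the lowest priced among results.", "web": "https://www.amazon.com/"}
'''

def load_tasks(jsonl_text: str):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

tasks = load_tasks(SAMPLE)
per_site = Counter(t["web_name"] for t in tasks)  # e.g. tasks per website
```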

Scaling Synthetic Data Creation with 1,000,000,000 Personas (Persona Hub)

Tencent AI Lab, arXiv 2024

Creates 1B personas and open-sources 200k of them

Target: synthetic data for better LLM training

Persona -> task

A compact 7B model beats 70B open-source models

  • Text-to-persona: use any text as input to obtain corresponding personas just by prompting the LLM “Who is likely to [read|write|like|dislike|…] the text?”

  • Persona-to-Persona: persona relationship expansion
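The two prompting strategies above can be sketched as plain prompt builders. The wording is my paraphrase of the bullets, not the paper's actual templates:

```python
# Hypothetical prompt templates paraphrasing the two strategies above.

def text_to_persona(text: str, relation: str = "read") -> str:
    """Text-to-Persona: ask who is likely to read/write/like/dislike the text."""
    assert relation in {"read", "write", "like", "dislike"}
    return (f"Who is likely to {relation} the following text? "
            f"Describe that person as a persona.\n\nText: {text}")

def persona_to_persona(persona: str) -> str:
    """Persona-to-Persona: expand the set via interpersonal relationships."""
    return (f"Given the persona below, list personas who are in close "
            f"relationship with them (e.g., family, colleagues, patients).\n\n"
            f"Persona: {persona}")
```

Each returned string would be sent to an LLM; the replies become new personas, which are then fed back in to grow the pool.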

{
"input persona": "A software engineer who disagrees with the established computer scientist's methodologies and approaches",
"synthesized text": "The software engineer, John, is working on a project where he needs to calculate the time complexity of his newly developed algorithm. He disagrees with the established computer scientist's methodologies and approaches and wants to use his own method. \n\nJohn's algorithm is a recursive function that calls itself twice for each level of recursion. The base case (when the recursion stops) occurs when the input size is 1. \n\nGiven that the time taken by the algorithm when the input size is 1 is C (a constant), John needs to express the time complexity T(n) of his algorithm as a recurrence relation. \n\nWhat is the recurrence relation for the time complexity of John's algorithm?",
"description": "math problem"
}

Concerns

  • May not be that related to our use case?
    • They are trying to synthesize more high-quality training data for LLM pre-training

DIS 2024

Generates personas from text, from the perspective of human-AI collaboration

RQ: which persona-generation subtasks should be delegated to user researchers vs. LLMs to produce representative and empathy-evoking personas?

(figure omitted)

Strengths:

  • Interesting findings
    • Introducing a user researcher to identify key characteristics makes the age variance smaller
    • LLM-auto can match the ground-truth distribution
    • LLM-summary generates the most statistically representative personas

Concerns:

  • (Somewhat) trivial findings (LLM-summary and LLM-grouping perform the best):
    • Introducing the user researcher at more stages yields better performance
    • Missing no-LLM baseline (how much does summarizing itself help?)
    • Missing analysis of the human workflow – which part is the most time-consuming?
  • How do you define a better "persona"?
    • What is the intended usage of the personas? Are they going to be processed by LLMs?
    • Will "being more expressive" actually help?

arXiv 2024, Toby

Risk-free privacy sandbox

Strengths

  • Interesting and novel idea
  • Strong use case

Concerns:

  • Mostly on the implementation.
  • Browsing history – hallucination risk? It may also not be effective, as browsing history is sensitive and cannot be read by web apps.
  • How did they mimic and use social media posts?
    • They basically can't.

Idea:

  • An LLM agent that impersonates a persona and browses the internet for a while (using real Twitter/Google accounts, mimicking real-person behavior), longitudinally
    • But for what?
    • Ethical concerns

ChatDev: Communicative Agents for Software Development

arXiv 2024, THUNLP

A chat-powered software development framework integrating multiple “software agents” that are actively involved in three core phases of the software lifecycle: design, coding, and testing

Communicative De-hallucination
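As I understand it, communicative de-hallucination means the assistant agent first asks the instructor for missing details instead of guessing, and only commits to an answer once nothing is left unclear. A toy sketch, with stub functions standing in for the two LLM agents (all names, prompts, and the "language" heuristic are mine):

```python
# Toy sketch of communicative de-hallucination: ask before answering.
# The two "agents" are stubs standing in for LLM calls.

def assistant(instruction: str, clarifications: list[str]) -> str:
    """Ask a clarifying question while details are missing, else answer."""
    known = instruction + " " + " ".join(clarifications)
    if "language" not in known:
        return "QUESTION: Which programming language should I use?"
    return "SOLUTION: print('hello')"

def instructor(question: str) -> str:
    """Stub instructor reply resolving the open question."""
    return "Use the Python language."

def dehallucinated_exchange(instruction: str, max_turns: int = 3) -> str:
    clarifications: list[str] = []
    for _ in range(max_turns):
        reply = assistant(instruction, clarifications)
        if reply.startswith("QUESTION:"):
            clarifications.append(instructor(reply))  # assistant asks, instructor answers
        else:
            return reply  # no open questions -> deliver the final solution
    return reply
```

The point of the pattern is that the clarifying round-trip happens before the final answer, which is what suppresses hallucinated details.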

Strengths:

  • Novel Communicative De-hallucination
  • Agent-based self-consistency?
  • Defines dataset, goal, and metric

Concerns:

  • Mainly on the evaluation metrics.
  • Completeness – why would there be incomplete code at all?
  • How do you define executability?
    • "How much of the code compiles" isn't a good metric
    • And they use Python, which isn't compiled anyway?
  • Consistency – weird definition
    • Similarity between semantic embeddings of the code and of the textual requirements
    • WHY?
  • Quality:
    • Product of completeness, executability, and consistency
    • Okay
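The composite quality score, as the notes describe it, is just the product of the three components. A sketch, assuming each component is already normalized to [0, 1]:

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Composite quality = product of the three component metrics."""
    for m in (completeness, executability, consistency):
        assert 0.0 <= m <= 1.0  # assumed normalization
    return completeness * executability * consistency
```

Note the multiplicative form means any single zero component zeroes out the whole score, which may or may not be the intended behavior.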