Anthropic Wants Its AI Agent to Control Your Computer

Demos of AI agents can look impressive, but getting the technology to work reliably and without annoying (or costly) errors in real life can be a challenge. Current models can answer questions and converse with almost humanlike skill, and they are the backbone of chatbots such as OpenAI’s ChatGPT and Google’s Gemini. They can also carry out tasks on a computer when given a simple command, either by accessing the computer screen and input devices such as a keyboard and trackpad, or through low-level software interfaces.

Anthropic says that Claude outperforms other AI agents on several key benchmarks, including SWE-bench, which measures an agent’s software development skills, and OSWorld, which gauges an agent’s ability to use a computer operating system. The claims have yet to be independently verified. Anthropic says Claude completes OSWorld tasks correctly 14.9 percent of the time. That is well below humans, who generally score around 75 percent, but considerably higher than the current best agents, including OpenAI’s GPT-4, which succeed roughly 7.7 percent of the time.

Anthropic says that several companies are already testing the agentic version of Claude. They include Canva, which is using it to automate design and editing tasks, and Replit, which uses the model for coding chores. Other early users include The Browser Company, Asana, and Notion.

Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says that agentic AI tends to lack the ability to plan far ahead and often struggles to recover from errors. “In order to show them to be useful we must obtain strong performance on tough and realistic benchmarks,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.

Kaplan notes that Claude can already troubleshoot some errors surprisingly well. When it hit a terminal error while trying to start a web server, for instance, the model knew how to revise its command to fix the problem. It also worked out that it had to enable popups when it ran into a dead end while browsing the web.

Many tech companies are now racing to develop AI agents as they chase market share and prominence. In fact, it might not be long before many users have agents at their fingertips. Microsoft, which has poured upwards of $13 billion into OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and eventually buy goods for its customers.

Sonya Huang, a partner at the venture firm Sequoia who focuses on AI companies, says that for all the excitement around AI agents, most companies are really just rebranding AI-powered tools. Speaking to WIRED ahead of the Anthropic news, she said that the technology currently works best when applied in narrow domains such as coding-related work. “You need to choose problem spaces where if the model fails, that’s okay,” she says. “Those are the problem spaces where truly agent native companies will arise.”

A key challenge with agentic AI is that errors can be far more problematic than a garbled chatbot reply. Anthropic has imposed certain constraints on what Claude can do, for example by limiting its ability to use a person’s credit card to buy things.

If errors can be avoided well enough, says Press of Princeton University, users might learn to see AI, and computers, in a completely new way. “I’m super excited about this new era,” he says.

Source: https://www.wired.com/story/anthropic-ai-agent/

