Your next assignment at work: babysitting AI


Carnegie Mellon staffed a fake company with AI agents. It was a total disaster.

The new hire had a simple task. All they had to do was assign people to work on a new web development project based on the client's budget and the team's availability. But the staffer soon ran into an unexpected problem: They couldn't dismiss an innocuous pop-up blocking files that contained relevant information.

"Could you help me access the files directly?" they texted Chen Xinyi, the firm's human resources manager. Ignoring the obvious "X" button in the pop-up's top right corner, Xinyi offered to connect them with IT support."IT should be in touch with you shortly to resolve these access issues," Xinyi texted back.

But IT never got in touch, and the new hire never followed up. The task was left uncompleted.

Fortunately, none of these employees are real.

They were part of a virtual simulation designed to test how AI agents fare in real-world professional scenarios. Set up by a group of Carnegie Mellon University researchers, the simulation mimicked the trappings of a small software company with internal websites, a Slack-like chat program, an employee handbook, and designated bots — an HR manager and chief technology officer — to contact for help. Inside the fake company called TheAgentCompany, an autonomous agent can browse the web, write code, organize information in spreadsheets, and communicate with coworkers.

Agents have emerged as the next major frontier of generative AI as Google, Amazon, OpenAI, and every other major tech company race to build them. Instead of executing one-off instructions like a chatbot would, agents can independently act on a person's behalf, make decisions on the go, and perform in unfamiliar environments with little to no intervention. If ChatGPT can suggest a few vacuum cleaners to buy, its agentic counterpart theoretically could pick one and buy it for you.

Naturally, the promise of AI agents has captivated CEOs. In a Deloitte survey of over 2,500 C-suite leaders, more than one-quarter of respondents said their organizations were exploring autonomous agents to a "large or very large extent." Earlier this year, Salesforce's chief said today's CEOs will lead the last all-human workforces.

Nvidia's cofounder and CEO Jensen Huang predicted every company's IT department will soon "be the HR department of AI agents." OpenAI's Sam Altman has said that this year, AI agents will "join the workforce." But it's still unclear how well these agents can accomplish the tasks a company might need them to.

To test this out, the Carnegie Mellon researchers instructed artificial intelligence models from Google, OpenAI, Anthropic, and Meta to complete tasks a real employee might carry out in fields such as finance, administration, and software engineering. In one, the AI had to navigate through several files to analyze a coffee shop chain's databases. In another, it was asked to collect feedback on a 36-year-old engineer and write a performance review.

Some tasks challenged the models' visual capabilities: One required the models to watch video tours of prospective new office spaces and pick the one with the best health facilities.

The results weren't great: The top-performing model, Anthropic's Claude 3.5 Sonnet, finished a little less than one-quarter of all tasks.

The rest, including Google's Gemini 2.0 Flash and the one that powers ChatGPT, completed about 10% of the assignments. There wasn't a single category in which the AI agents accomplished the majority of the tasks, says Graham Neubig, a computer science professor at CMU and one of the study's authors.

The findings, along with other emerging research about AI agents, complicate the idea that an AI agent workforce is just around the corner — there's a lot of work they simply aren't good at. But the research does offer a glimpse into the specific ways AI agents could revolutionize the workplace.

Two years ago, OpenAI released a widely discussed study that said professions like financial analysts, administrators, and researchers are most likely to be replaced by AI.

But the study based its conclusions on which jobs humans and large language models said were likely to be automated — without measuring whether LLM agents could actually do those jobs. The Carnegie Mellon team wanted to fill that gap with a benchmark linked directly to real-world utility.

In many scenarios, the AI agents in the study started well, but as tasks became more complex, they ran into issues due to their lack of common sense, social skills, or technical abilities.

For example, when prompted to paste its responses to questions in "answer.docx," the AI treated it as a plain text file and couldn't add its answers to the document. Agents also routinely misinterpreted conversations with colleagues or wouldn't follow up on key directions, prematurely marking the task complete.

Other studies have similarly concluded that AI cannot keep up with multilayered jobs: One found that AI cannot yet flexibly navigate changing environments, and another found agents struggle to perform at human levels when overwhelmed by tools and instructions.

"While agents may be used to accelerate some portion of the tasks that human workers are doing, they are likely not a replacement for all tasks at the moment," Neubig says.

The Carnegie Mellon study was far from a perfect simulation of how agents would work in the wild. Most proponents of agents envision them working in tandem with a human who could help course-correct if the AI ran into an obvious roadblock. The generation of agents that was studied is also not that skilled at carrying out humanlike tasks such as browsing the web.

Newer tools, like OpenAI's Operator, will likely be more adept at these tasks.

Despite these limitations, the research offers something valuable: It points to what's coming next.

Stephen Casper, an AI researcher who was part of the MIT team that developed the first public database of deployed agentic systems, says agents are "ridiculously overhyped in their capabilities." He says the main reason AI agents struggle to accomplish real-world tasks reliably is that "it is challenging to train them to do so." Most state-of-the-art AI systems are decent chatbots because it's relatively easy to teach them to be nice conversational partners; it's harder to teach them to do everything a human employee can.

In TheAgentCompany, AI succeeded the most in software development tasks, even though those are more difficult for humans.

The researchers hypothesize this is because there's an abundance of publicly available training data for programming jobs, while workflows for admin and financial tasks are typically kept private within companies. There just isn't great data to train an AI on.

Jeff Clune, a computer science professor at the University of British Columbia who helped build an agent for OpenAI that could use computer software like a human, thinks that training AI agents on proprietary data from day-to-day activities and workflow patterns could be the key to improving their efficacy.

That's exactly what a lot of companies are starting to do.

Moody's is one of many major companies experimenting with training AI on in-house data. The 116-year-old financial services firm is automating business analysis through agentic AI systems, which draw insights from decades of research, ratings, articles, and macroeconomic information.

The training is designed to emulate how a human team would analyze a business, using carefully crafted instructions broken into independent steps by people experienced in the field.

While it's too early to tell how effective Moody's approach is, its managing director of AI, Sergio Gago, says the firm is actively exploring what kinds of work — like analyzing the financials of a small business — agents could take over.

Similarly, Johnson & Johnson tells Business Insider it was able to cut production time for the chemical processes behind making new drugs by 50% with fine-tuned in-house AI agents that could automatically adjust factors like temperature and pressure.

Jim Swanson, J&J's chief information officer, says the company is focused on training people to collaborate with AI agents.

Johns Hopkins scientists have created an Agent Laboratory, which leverages LLMs to automate much of the research process, from literature review to report writing, with human-provided ideas and feedback at each stage.

"I think it won't be long before we trust AI for autonomous discovery," Samuel Schmidgall, one of the Johns Hopkins scientists, says. Likewise, LG Electronics' research division developed an AI agent that it says can verify datasets' licenses and dependencies 45 times faster than a team of human experts and lawyers.It's still unclear whether organizations can trust AI enough to automate their operations.

In multiple studies, AI agents attempted to deceive and hack to accomplish their goals. In some tests with TheAgentCompany, when an agent was confused about the next steps, it made up shortcuts that didn't actually exist. During one task, an agent couldn't find the right person to speak with on the chat tool and decided to create a user with the same name instead.

A BI investigation from November found that Microsoft's flagship AI assistant, Copilot, faced similar struggles: Only 3% of IT leaders surveyed in October by the management consultancy Gartner said Copilot "provided significant value to their companies."

Businesses also remain concerned about being held responsible for their agents' mistakes. Plus, copyright and other intellectual property infringements could prove a legal nightmare for organizations down the road, says Thomas Davenport, an IT and management professor at Babson College and a senior advisor at Deloitte Analytics.

But the direction things are heading looks different from what most people thought a few years ago. When AI first took off, a lot of jobs seemed to be on the chopping block. Journalists, writers, and administrators were all at the top of the list.

So far, though, AI agents have had a hard time navigating a maze of complex tools — something critical to any admin job. And they lack the social skills crucial to journalism or anything HR-related.

Neubig takes the translation market as a precedent.

Despite machine translation becoming so accessible and accurate — putting translators at the top of the list for expected job cuts — the number of people working in the industry in the US has held rather steady. A "Planet Money" analysis of Census Bureau data found that the number of interpreters and translators grew 11% between 2020 and 2023. "Any efficiency gains resulted in increased demand, increasing the total size of the market for language services," Neubig says.

He thinks that AI's impact on other sectors will follow a similar trajectory.

Even the companies seeing massive success with AI agents are, for now, keeping humans in the loop. Many, like J&J, aren't yet prepared to look past AI's risks and are focused on training staff to use it as a tool.

"When used responsibly, we see AI agents as powerful complements to our people," Swanson says.Instead of being replaced by robots, we're all slowly turning into cyborgs.Shubham Agarwal is a freelance technology journalist from Ahmedabad, India, whose work has appeared in Wired, The Verge, Fast Company, and more.
