OpenAI CEO Sam Altman has predicted that AI would surpass "junior" software engineers by the end of this year, but a recent study from OpenAI itself points to a different reality: even the most advanced AI models still struggle to match human programmers.
In a newly published paper, OpenAI researchers acknowledged that although AI technology is progressing rapidly, current state-of-the-art models remain inadequate for most coding work. To measure this, they built a new benchmark called SWE-Lancer from more than 1,400 real-world software engineering tasks posted on the freelancing platform Upwork.
The benchmark evaluated three large language models (LLMs): OpenAI's own o1 reasoning model, its flagship GPT-4o, and Anthropic's competing model, Claude 3.5 Sonnet. The models were given two types of assignments: individual contributor tasks, which involved fixing specific bugs and implementing the fixes, and management-style tasks, which required making decisions from a broader, project-level perspective. Notably, the models were denied internet access during testing to prevent them from copying answers found online.
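To make that setup easier to picture, here is a minimal, hypothetical sketch of how a harness for a benchmark like this could represent the two task types and tally the dollar value a model "earns." The class names, the grading callback, and the toy task are illustrative assumptions, not the actual SWE-Lancer implementation.

```python
# Hypothetical sketch (not the real SWE-Lancer harness): represent the two
# task categories described in the article and total the payout value of
# tasks a model resolves, with evaluation run entirely offline.
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass
class Task:
    title: str
    kind: Literal["ic_swe", "swe_manager"]  # bug-fix/implementation vs. managerial decision
    payout_usd: float                        # value of the original Upwork posting
    passes: Callable[[str], bool]            # grader: does the submission resolve the task?


def evaluate(tasks: list[Task], solve: Callable[[Task], str]) -> float:
    """Run a model's `solve` function over every task and return the total
    dollar value of the tasks it resolved (no network access assumed)."""
    earned = 0.0
    for task in tasks:
        submission = solve(task)             # model output: a patch or a chosen proposal
        if task.passes(submission):
            earned += task.payout_usd
    return earned


# Toy usage: one illustrative task and a trivial stand-in "model".
demo = [Task("Fix login crash", "ic_swe", 500.0, lambda s: "null check" in s)]
print(evaluate(demo, lambda t: "add a null check before dereferencing the session"))
```

Scoring by the dollar value of each task, rather than a simple pass rate, mirrors the article's framing of how much freelance work the models could actually complete.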
While the LLMs managed to complete Upwork tasks collectively worth hundreds of thousands of dollars, they succeeded mostly on superficial software issues. When confronted with complex bugs in larger projects, they often failed to locate or correctly diagnose the root cause. The "solutions" they produced were frequently rough and incomplete, looking confident at first glance but falling apart under closer inspection.
The paper highlighted that although the three LLMs generally "outpaced humans in processing speed," they struggled to grasp the broader context and background of a bug, producing incorrect or incomplete fixes. Among them, Claude 3.5 Sonnet performed slightly better than the two OpenAI models, but most of its answers were still flawed. The researchers stressed that any model must achieve "higher reliability" before it can be trusted with real-world coding tasks.
In summary, the study suggests that while these advanced models work quickly and handle narrow, well-defined tasks well, they still fall short of human engineers on genuine software engineering challenges.
Although these LLMs have made significant progress in recent years and are expected to continue developing, their skills in software engineering are still insufficient to replace human programmers in real-life scenarios. Nevertheless, this hasn't stopped some CEOs from replacing human programmers with immature AI models in pursuit of efficiency.