Palmyra LLM by Writer Shines in Enterprise-Level AI Performance Benchmarks

2024-01-10

Writer is a three-year-old San Francisco startup that raised $100 million in September 2023 to bring its proprietary, enterprise-focused large language models to more companies. Although it makes fewer headlines than high-profile LLM players such as OpenAI, Anthropic, Meta, or France's Mistral AI, Writer's family of smaller models, called Palmyra, is showing promise in enterprise use cases. Companies including Accenture, Vanguard Group, HubSpot, and Pinterest are Writer clients, using the company's writing and productivity platform powered by the Palmyra models.

Last month, Stanford HAI's Center for Research on Foundation Models added new models to its benchmark suite and introduced a new benchmark called HELM Lite, which tests in-context learning. For LLMs, in-context learning means picking up a new task from a small set of examples presented directly in the prompt.
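To make the idea concrete, here is a minimal sketch of what a few-shot, in-context prompt looks like. The task, example reviews, and helper function are all hypothetical illustrations, not part of HELM Lite or any specific model's API; the point is simply that the worked examples in the prompt are the model's only "training" for the task.

```python
# Illustrative sketch of in-context (few-shot) learning: the model receives
# a handful of worked examples in the prompt and must infer the task pattern.
# No specific model, benchmark, or API is assumed here.

def build_few_shot_prompt(examples, query):
    """Assemble a prompt from (input, output) example pairs plus a new query."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model is expected to complete this line
    return "\n".join(lines)

examples = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("It broke after two days and support never replied.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Setup was painless and it just works.")
print(prompt)
```

A benchmark like HELM Lite scores how well a model completes prompts of this general shape across many tasks, rather than how well it was fine-tuned for any one of them.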

Writer's LLM performed "surprisingly" well in AI benchmark tests.

Although GPT-4 ranked highly on the new benchmark, Writer's Palmyra X V2 and X V3 models performed "surprisingly" well "despite being smaller models," wrote Percy Liang, director of Stanford's Center for Research on Foundation Models.
Palmyra's performance was particularly strong in machine translation, where it ranked first. May Habib, CEO of Writer, said in a LinkedIn post: "Writer's Palmyra X performs even better than the classic benchmarks suggest. We are not only the top model on the MMLU benchmark but the top model overall, second only to the GPT-4 preview version analyzed. And on the translation benchmark, which is new, we rank first."

Enterprises need models that are economical to build with

In an interview, Habib said it would be economically challenging for enterprises to run models like GPT-4, which are trained on 1.2 trillion tokens, in their own environments. "Generative AI use cases in 2024 now need to make economic sense," she said.

She also pointed out that enterprises build use cases on top of GPT models and then, "two to three months later, the prompts no longer work because the models have been fine-tuned, and their own serving costs are too high." Referring to the Stanford HAI HELM Lite leaderboard, she maintained that GPT-4 (0613) is rate-limited, so "it will be fine-tuned," while GPT-4 Turbo is just "a preview version, and we don't know their plans for this model."

Habib added that she believes Stanford HAI's benchmarking work is "closest to real-world enterprise use cases and real enterprise practitioners," unlike rankings from platforms such as Hugging Face. "Their scenarios are closer to actual usage," she said.