Microsoft's Orca 2 LLM: Performance Comparable to 10x Parameter Models

2023-12-13

Microsoft Research has released Orca 2 LLM, a fine-tuned version of Llama 2 that performs as well as or better than models containing ten times its number of parameters. Orca 2 achieves this through the use of synthetic training data and a technique called "Prompt Erasure."

Orca 2 uses a teacher-student scheme, in which a larger, more capable LLM acts as a teacher for a smaller student LLM, with the goal of improving the student's performance. Microsoft's training technique teaches the smaller model multiple reasoning techniques, as well as how to choose the most effective technique for a given task. To do this, the teacher is given complex prompts designed to trigger specific reasoning behaviors. However, in the scheme called Prompt Erasure, the student is given only the task requirements and the desired response, not the teacher's prompt. In benchmark tests, a 13-billion-parameter Orca 2 model outperformed a baseline Llama 2 model with the same number of parameters by 47.54%, and the 7-billion-parameter Orca 2 was "better or comparable" to a 70-billion-parameter Llama 2 on reasoning tasks.

Although LLMs such as ChatGPT often perform well on a wide range of tasks given only minimal prompting, hosting these models is challenging because of their memory and compute requirements. Smaller models can also perform well after fine-tuning, and many researchers have explored training them on synthetic datasets generated by larger LLMs. For example, Google's "Distilling Step-by-Step" method prompts a teacher LLM to generate a small fine-tuning dataset that contains both input-output labels and the "rationales" for why the output labels were chosen. Another example is Stability AI's Stable Beluga models, which were trained using Microsoft's original Orca 1 scheme of "explanation tuning," where the teacher LLM is prompted to "generate detailed answers."

Like Orca 1, Orca 2's training dataset is generated by a teacher LLM that is given detailed prompts.
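The data-construction idea behind Prompt Erasure can be illustrated with a short sketch. Everything here is hypothetical (the function names, the strategy prompts, and the example task are not from the Orca 2 paper); it only shows the shape of the scheme: the teacher sees a strategy-selecting prompt, while the student's training example keeps just the task and the teacher's answer.

```python
# Illustrative sketch of "Prompt Erasure" training-data construction.
# All names and prompt texts below are hypothetical, not from the paper.

# Strategy-inducing prompts shown ONLY to the teacher LLM.
STRATEGY_PROMPTS = {
    "step_by_step": "Solve the problem step by step, showing your work.",
    "explain_answer": "Answer the question, then explain your answer.",
    "direct": "Answer the question directly and concisely.",
}

def build_teacher_prompt(task: str, strategy: str) -> str:
    """Prompt seen by the teacher: strategy instruction plus the task."""
    return f"{STRATEGY_PROMPTS[strategy]}\n\nTask: {task}"

def make_student_example(task: str, strategy: str, teacher_answer: str) -> dict:
    """Training example for the student. The strategy prompt is erased
    (the `strategy` argument is deliberately unused), so the student must
    learn to pick a suitable strategy from the task alone."""
    return {"input": task, "target": teacher_answer}

# The teacher was steered to reason step by step, but the student's
# example retains only the bare task and the teacher's final answer.
example = make_student_example(
    task="If a train travels 60 km in 45 minutes, what is its speed in km/h?",
    strategy="step_by_step",
    teacher_answer="45 minutes is 0.75 hours, so speed = 60 / 0.75 = 80 km/h.",
)
```

Fine-tuning the student on many such (task, answer) pairs, where the answers exhibit different strategies, is what pushes it to internalize strategy selection rather than relying on the prompt.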
For Orca 2, however, Microsoft introduces a new method called "Cautious Reasoning," which pairs training tasks with prompts that induce the teacher to use a specific problem-solving strategy, such as "step-by-step" or "explain your answer." Then, during training of the student, the teacher's prompt is erased, which pushes the student to learn to choose the correct strategy on its own.

To evaluate this approach, Microsoft compared Orca 2's performance with several baseline models, including Llama 2, ChatGPT (GPT-3.5), and GPT-4, on benchmark tasks covering reasoning, language understanding, text completion, and summarization. On the reasoning benchmarks, the 13-billion-parameter Orca 2 model outperformed all baselines except ChatGPT and GPT-4. The researchers also found that giving Orca 2 a "cautious" system prompt ("You are a cautious assistant. You carefully follow instructions.") produced a small performance improvement over an empty system prompt.

Several users have posted about Orca 2 on X. One user pointed out, "You don't need tricks like 'explain every step' to prompt it. It already understands." AI researcher Rudi Ranck wrote, "Many brilliant ideas are simple... like 'Prompt Erasure' in Orca 2: instead of showing the full prompt, only the task and the answer are shown to the model (it filters out the full prompt that generated those answers). It helps the model strategize at a higher level. This is a really nice paper. I highly recommend reading it."

The 7-billion- and 13-billion-parameter Orca 2 models are available on Hugging Face.
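As a minimal sketch of using the released checkpoints with the "cautious" system prompt: the ChatML-style template below follows the format shown on the microsoft/Orca-2-13b Hugging Face model card (verify against the card before relying on it); the helper function name is my own.

```python
# Sketch: formatting a single-turn prompt for Orca 2 with the "cautious"
# system message. Template per the microsoft/Orca-2-13b model card
# (assumption -- check the card for the current format).

CAUTIOUS_SYSTEM = "You are a cautious assistant. You carefully follow instructions."

def build_orca2_prompt(user_message: str, system_message: str = CAUTIOUS_SYSTEM) -> str:
    """Build a ChatML-style prompt string for Orca 2."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant"
    )

prompt = build_orca2_prompt("How many days are there in February in a leap year?")

# The resulting string would then be tokenized and passed to the model,
# e.g. with Hugging Face transformers:
#   tokenizer = AutoTokenizer.from_pretrained("microsoft/Orca-2-13b")
#   model = AutoModelForCausalLM.from_pretrained("microsoft/Orca-2-13b")
#   outputs = model.generate(**tokenizer(prompt, return_tensors="pt"))
```

The system message slot is where the cautious prompt mentioned above goes; swapping in an empty string reproduces the empty-system-prompt baseline from the evaluation.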