Microsoft Showcases New Prompting Techniques to Unlock the Potential of Cutting-Edge AI Models

2023-12-13

Microsoft researchers have developed an advanced prompting technique to better guide OpenAI's powerful GPT-4 language model, achieving state-of-the-art performance on key AI benchmarks.

The researchers focused on the widely recognized MMLU benchmark, which covers 57 knowledge domains ranging from mathematics to medicine. By adapting Medprompt, a prompting method originally designed for medical questions, the team achieved the highest MMLU score ever recorded: 90.10%.

Medprompt combines three different prompting strategies:

  • Dynamic few-shot selection
  • Self-generated chain of thought
  • Choice-shuffle ensembling

First, dynamic few-shot selection leverages the model's ability to adapt quickly to a specific domain through few-shot learning. Rather than relying on a fixed set of hand-picked exemplars, the method selects few-shot examples that closely match each incoming question, improving their relevance and representativeness (see the sketch below).
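As a rough illustration, one way to implement this selection is to embed the training questions and pick the nearest neighbors of the test question. The `embed` helper below is only a placeholder (seeded random vectors so the sketch runs); a real pipeline would call an embedding model, and the cosine-similarity selection is an assumption about the mechanics, not Microsoft's exact implementation:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic random vector per text.
    A real system would call an embedding model here (assumption)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def select_few_shot(test_question: str, train_set: list[dict], k: int = 5) -> list[dict]:
    """Pick the k training examples most similar to the test question."""
    q = embed(test_question)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(train_set,
                    key=lambda ex: cosine(q, embed(ex["question"])),
                    reverse=True)
    return ranked[:k]

# Usage: the selected exemplars are placed in the prompt ahead of the test question.
train = [{"question": "What is 2 + 2?", "answer": "4"},
         {"question": "Name the powerhouse of the cell.", "answer": "Mitochondria"}]
print(select_few_shot("What is 3 + 5?", train, k=1))
```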

The second strategy involves self-generated chains of thought. The model is prompted to produce its own intermediate reasoning steps, which improves its ability to solve complex reasoning tasks. Unlike traditional approaches that rely on manually written examples, Medprompt automates this step, reducing the risk of erroneous reasoning paths (a sketch follows below).
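A minimal sketch of the idea, assuming a hypothetical `call_model` helper that wraps a GPT-4 call: the model writes a reasoning chain for each training item, and, as a simple safeguard against faulty reasoning, chains whose final answer disagrees with the known label are discarded.

```python
COT_TEMPLATE = (
    "Question: {question}\n"
    "Answer choices: {choices}\n"
    "Think step by step, then finish with 'Answer: <letter>'."
)

def call_model(prompt: str) -> str:
    """Placeholder for a chat-completion call to the model (assumption)."""
    raise NotImplementedError

def build_self_generated_cot(train_set: list[dict]) -> list[dict]:
    """Have the model write its own reasoning chain for each training item,
    keeping only chains whose final answer matches the known label."""
    exemplars = []
    for ex in train_set:
        prompt = COT_TEMPLATE.format(question=ex["question"],
                                     choices=", ".join(ex["choices"]))
        reply = call_model(prompt)
        predicted = reply.rsplit("Answer:", 1)[-1].strip()
        if predicted.startswith(ex["answer"]):  # drop chains that reach the wrong label
            exemplars.append({**ex, "chain_of_thought": reply})
    return exemplars
```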

Lastly, a majority-voting ensemble improves prediction performance by combining the outputs of multiple inference runs. This includes a technique called choice shuffling, which reorders the multiple-choice options on each run to counter position bias, making the model's answers to multiple-choice questions more robust (see the sketch below).
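The following sketch shows one way choice-shuffle ensembling can work, again using a placeholder `call_model`. Votes are tallied against the option text rather than the letter, so shuffling the labels does not split the count:

```python
import random
from collections import Counter

def call_model(prompt: str) -> str:
    """Placeholder for a model call that returns a single answer letter (assumption)."""
    raise NotImplementedError

def choice_shuffle_ensemble(question: str, choices: list[str], runs: int = 5) -> str:
    """Query the model several times with the options in a different order each
    time, then majority-vote over the underlying option texts."""
    votes = Counter()
    for _ in range(runs):
        shuffled = random.sample(choices, k=len(choices))  # shuffled copy
        labels = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled))
        reply = call_model(f"{question}\n{labels}\nAnswer with a single letter.").strip()
        if not reply:
            continue
        idx = ord(reply[0].upper()) - ord("A")
        if 0 <= idx < len(shuffled):
            votes[shuffled[idx]] += 1  # vote for the option text, not the letter
    return votes.most_common(1)[0][0]
```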

While initially designed for the medical field, Microsoft found that Medprompt's fine-grained prompting methods transfer well to a wide range of disciplines, as demonstrated on the comprehensive MMLU (Massive Multitask Language Understanding) benchmark. The original implementation of Medprompt on GPT-4 achieved an impressive 89.1% on MMLU. By increasing the number of ensembled model calls and running a simpler prompting method in parallel with the original strategy, the team then developed Medprompt+.

Medprompt+ achieved a milestone result, scoring a record-breaking 90.10% on MMLU. The gain comes from combining the outputs of the underlying Medprompt strategy with those of the simpler prompt, guided by a control policy that uses the inferred confidence of candidate answers. Notably, this approach relies on access to GPT-4's confidence scores (logprobs), a forthcoming API feature.
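The exact control policy is not spelled out here, but a plausible sketch is confidence-based routing: accept the cheap prompt's answer when the logprob-derived confidence is high, and fall back to the full Medprompt ensemble otherwise. Both helpers below are placeholders, not Microsoft's published code, and the threshold is an illustrative assumption:

```python
import math

def call_model_with_logprobs(prompt: str) -> tuple[str, float]:
    """Placeholder: return (answer_letter, logprob_of_that_answer_token).
    Assumes an API that exposes token log probabilities."""
    raise NotImplementedError

def run_full_medprompt(question: str, choices: list[str]) -> str:
    """Placeholder for the full Medprompt pipeline (few-shot + CoT + ensembling)."""
    raise NotImplementedError

def medprompt_plus(question: str, choices: list[str], threshold: float = 0.8) -> str:
    """Answer with the simple prompt when the model is confident in it,
    otherwise fall back to the heavier Medprompt ensemble."""
    simple_prompt = f"{question}\n" + "\n".join(choices) + "\nAnswer with a single letter."
    answer, logprob = call_model_with_logprobs(simple_prompt)
    if math.exp(logprob) >= threshold:  # convert logprob to a probability
        return answer
    return run_full_medprompt(question, choices)
```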

Microsoft has also open-sourced promptbase, an evolving collection of resources, best practices, and example scripts for eliciting optimal performance from models like GPT-4. The goal is to lower the barrier for researchers and developers who want to unlock these models' potential through better prompting.

Microsoft's prompting work on GPT-4 takes on additional significance when set against Google's upcoming Gemini Ultra. Google's announcement of Gemini Ultra (which won't be available until next year) was accompanied by a controversial demo video, initially praised for its apparent real-time AI interaction but later acknowledged by Google to have been assembled from static images and text prompts rather than the dynamic, real-time interaction it appeared to show.

For now, Microsoft's prompting work has pushed GPT-4's measured capabilities beyond what Gemini Ultra is expected to deliver when it launches next year: prompted in this way, GPT-4 surpasses the results Google has reported for Gemini Ultra on popular benchmarks such as MMLU.

With GPT-4 demonstrating intelligence akin to subject matter experts when properly guided, this milestone highlights the extraordinary natural language abilities of these models and the promising opportunity to further expand their capabilities through more sophisticated prompts. Microsoft's latest advancements underscore the importance of better prompting strategies as these powerful tools progress towards reliable, ethical, and beneficial outcomes.