Intelligent Spectrum AI Releases Text Quality Evaluation Model CritiqueLLM

2023-12-12

In the field of natural language processing, precisely evaluating the quality of model-generated text is crucial for model development. Traditional metrics such as BLEU and ROUGE rely on lexical overlap to assign scores, which limits their ability to capture the overall meaning of a text. Evaluation methods built on large proprietary models can understand text semantics better, but they are often hampered by high cost, limited accessibility, and data privacy concerns. CritiqueLLM was created to address these issues: an efficient text quality evaluation model that is both interpretable and easy to scale. It comprehensively evaluates text generated by large pre-trained language models, assigns a rating from 1 to 10, and accompanies the rating with an insightful explanation of the text's strengths and weaknesses. In evaluation experiments, CritiqueLLM's judgments correlate significantly with human evaluators across 8 different instruction-following tasks, surpassing ChatGPT and rivaling GPT-4. Notably, in the reference-free setting, CritiqueLLM even outperforms GPT-4 on certain tasks, demonstrating superior evaluation capability.

The development of CritiqueLLM involves four key steps:

1. User query augmentation: A small set of queries collected from public platforms is automatically augmented to obtain broad coverage, and the resulting data is carefully filtered for diversity and answer difficulty. Responses from large models of varying capability are then collected on this query set.

2. Referenced evaluation data collection: Prompts are designed for GPT-4 to generate evaluation results from the user query, the reference answer, and the model-generated text. The prompts include detailed evaluation criteria so that GPT-4's judgments align with human evaluation (a simplified prompt sketch appears below).

3. Reference-free evaluation data rewriting: Building on the referenced data, GPT-4 is further asked to rewrite its evaluation results, removing the parts that mention the reference answer while keeping the rest as unchanged as possible, which yields evaluation data without reference text.

4. Training CritiqueLLM: Using the referenced and reference-free evaluation data, separate CritiqueLLM models are trained for the two settings, so that given a user query, a model-generated answer, and an optional reference answer, they produce a complete evaluation consisting of an explanation and a score.

The result is two CritiqueLLM models, one for referenced evaluation and one for reference-free evaluation, covering the two common settings of text quality evaluation. CritiqueLLM exhibits outstanding correlation with human evaluations, especially its 66-billion-parameter version, and self-consistent decoding significantly improves evaluation accuracy even for the smaller models. CritiqueLLM also excels at generating evaluation explanations, on par with GPT-4 and far ahead of ChatGPT and small-scale models. Unlike other concurrent work, CritiqueLLM stands out for its efficient data construction method and its in-depth analysis of evaluation with and without reference text.
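As a rough illustration of the referenced and reference-free evaluation settings described in steps 2 and 3, the minimal sketch below shows how an evaluation request might be composed. The prompt wording, the criteria text, and the 'Score: <n>' output format are assumptions made for illustration, not the exact templates used to build CritiqueLLM's data.

```python
from typing import Optional

def build_critique_prompt(query: str, answer: str, reference: Optional[str] = None) -> str:
    """Compose an evaluation request for a judge model (GPT-4 during data
    collection, or CritiqueLLM at inference time)."""
    parts = [
        "You are a strict evaluator of instruction-following quality.",
        f"User query:\n{query}",
        f"Model answer:\n{answer}",
    ]
    if reference is not None:
        # Referenced setting: the judge can compare against a gold answer.
        parts.append(f"Reference answer:\n{reference}")
    parts.append(
        "Explain the strengths and weaknesses of the model answer, then give "
        "an overall score from 1 to 10 on the last line in the form 'Score: <n>'."
    )
    return "\n\n".join(parts)

# The same example in the referenced and reference-free settings.
with_ref = build_critique_prompt("Summarize the article.", "The article argues ...",
                                 reference="A concise human-written summary ...")
no_ref = build_critique_prompt("Summarize the article.", "The article argues ...")
```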
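The self-consistent decoding mentioned above can be pictured as sampling several critiques and aggregating their scores. The sketch below assumes a placeholder `generate_critique` call and a parseable 'Score: <n>' line; neither is taken from the released model's actual interface.

```python
import re
from statistics import mean
from typing import Callable, List

def extract_score(critique: str) -> float:
    """Pull the numeric score out of a critique that ends with 'Score: <n>'."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", critique)
    if match is None:
        raise ValueError("no score found in critique")
    return float(match.group(1))

def self_consistent_score(prompt: str,
                          generate_critique: Callable[[str, float], str],
                          n_samples: int = 5,
                          temperature: float = 0.8) -> float:
    """Sample several critiques at non-zero temperature and average their scores."""
    scores: List[float] = []
    for _ in range(n_samples):
        critique = generate_critique(prompt, temperature)
        try:
            scores.append(extract_score(critique))
        except ValueError:
            continue  # skip samples whose score cannot be parsed
    if not scores:
        raise RuntimeError("no parseable scores in any sampled critique")
    return mean(scores)
```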
CritiqueLLM also scales well with model size: performance improves as the model grows larger. Furthermore, the evaluations it produces can not only replace evaluators such as GPT-4 but also serve as feedback to directly optimize the text generated by large models such as ChatGPT, improving the quality of model-generated text. In summary, CritiqueLLM shows the potential to replace GPT-4 in evaluation work and provides valuable feedback for improving text generation models, offering an innovative perspective and a reliable reference for future model development and evaluation research.
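To make the feedback use mentioned above concrete, here is a minimal sketch of one critique-and-revise pass: a critic model evaluates an answer, and the generator is asked to rewrite it in light of that critique. The `critic` and `generate` callables and all prompt wording are hypothetical placeholders, not a published recipe.

```python
from typing import Callable

def critique_and_revise(query: str,
                        answer: str,
                        critic: Callable[[str], str],
                        generate: Callable[[str], str]) -> str:
    """Use a critique of the current answer as feedback for one revision pass."""
    critique = critic(
        f"Evaluate the following answer and point out its weaknesses.\n"
        f"Query: {query}\nAnswer: {answer}"
    )
    revision_prompt = (
        f"Query: {query}\n"
        f"Previous answer: {answer}\n"
        f"Feedback: {critique}\n"
        "Rewrite the answer so that it addresses the feedback."
    )
    return generate(revision_prompt)
```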