AI Quantization Techniques Face New Challenges: Efficiency vs. Accuracy

2024-12-26

Quantization, a key technique for improving the efficiency of artificial intelligence (AI) models, is gradually approaching its performance limits. Quantization shrinks a model by reducing the number of bits—the smallest unit of information a computer processes—used to represent its values. The idea mirrors how people simplify information in everyday conversation: asked for the time, most people say "noon" rather than "12:01:04.004"; both answers convey the same meaning at different levels of precision. How much precision an AI model actually needs depends on the application.

AI models contain several components that can be quantized, most notably the parameters—the internal variables a model uses to make predictions or decisions. Running a model involves millions of calculations, and representing each parameter with fewer bits reduces the memory and compute those calculations require, improving efficiency. Quantization should not be confused with "distillation," which is a more involved and selective process of pruning parameters.
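To make this concrete, here is a minimal sketch of symmetric 8-bit weight quantization in Python with NumPy. It illustrates the general idea only; the function names and the simple per-tensor scaling are assumptions for this example, not the scheme used by any particular model or framework.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 with a single symmetric scale (illustrative only)."""
    scale = np.max(np.abs(weights)) / 127.0  # largest-magnitude weight maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=10_000).astype(np.float32)  # toy "parameter" tensor
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Storing 8 bits per parameter instead of 32 cuts memory (and often compute) roughly 4x;
# the mean absolute error below is the information given up in exchange.
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```

Production systems typically refine this with per-channel scales and calibration data, but the trade-off is the same: fewer bits per parameter, slightly less faithful values.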

Yet quantization may offer fewer benefits than previously assumed. A study by researchers from Harvard, Stanford, MIT, Databricks, and Carnegie Mellon University found that quantized models tend to perform worse when the original, unquantized model was trained on large amounts of data over a long period. In other words, past a certain point it may be more effective to train a smaller model directly than to compress a large one.

That could be bad news for AI companies that train very large models to improve answer quality and then rely on quantization to keep serving costs down. The downside is already visible in practice: Meta's Llama 3, for example, reportedly degraded noticeably after quantization, possibly because of how it was trained.

Moreover, inference—running a trained model, as when ChatGPT answers a question—typically costs more in aggregate than training it. Training one of Google's flagship Gemini models is estimated to have cost $191 million, for instance, yet using that model to generate 50-word answers for half of Google's search queries would cost roughly $6 billion per year.
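A rough back-of-envelope calculation shows why inference dominates at this scale. The query volume and per-answer cost below are purely hypothetical placeholders chosen to land near the $6 billion estimate cited above; they are not reported figures.

```python
# Hypothetical numbers for illustration only.
queries_per_day = 4.5e9       # assumed: roughly half of a major search engine's daily queries
cost_per_answer_usd = 0.0036  # assumed: cost of generating one ~50-word answer

annual_inference_cost = queries_per_day * cost_per_answer_usd * 365
print(f"annual inference cost: ${annual_inference_cost:,.0f}")  # ~ $5.9 billion

# Compared with a one-time training cost on the order of $191 million,
# inference spending at this volume passes it in under two weeks.
print(f"days to exceed training cost: {191e6 / (queries_per_day * cost_per_answer_usd):.1f}")
```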

Although evidence suggests that the returns from ever more data and compute eventually diminish, large AI labs continue to train models on ever-larger datasets. There are already signs, however, that this scaling strategy does not always pay off.

If labs are unwilling to train models on smaller datasets, is there another way to limit the degradation? The researchers found that training models at low precision from the start may make them more robust. Here, "precision" refers to the number of digits a numeric data type can represent accurately. Most models today are trained in 16-bit ("half precision") floating point and then quantized to 8 bits after training. Extremely low precision, however, may be counterproductive: below about 7 or 8 bits, quality drops noticeably unless the original model has a very large number of parameters.
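The sketch below shows why very low bit widths are risky: it round-trips the same toy weight distribution through a simple uniform quantizer at several bit widths. The sharp rise in error below roughly 8 bits is a property of this simplified scheme and only illustrates the kind of quality loss described above; it does not reproduce the study's experiments.

```python
import numpy as np

def uniform_roundtrip(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantize x to a symmetric uniform grid with `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 levels per side at 8 bits
    scale = np.max(np.abs(x)) / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=100_000).astype(np.float32)

for bits in (16, 8, 6, 4, 2):
    err = float(np.mean(np.abs(w - uniform_roundtrip(w, bits))))
    print(f"{bits:2d} bits -> mean abs error {err:.6f}")
```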

In short, AI models do not behave entirely predictably, and familiar computational shortcuts do not always apply. The researchers stress the limits of quantization and the difficulty of driving down inference costs. Going forward, more attention may need to go to the quality of data rather than its quantity, and to model architectures that can be trained stably at low precision.