Google AI Launches the CardBench Benchmark

2024-09-03

Cardinality estimation (CE) plays a crucial role in optimizing relational database query performance. It involves predicting the number of intermediate results a query will return, which directly guides the query optimizer in choosing an execution plan. Accurate cardinality estimates are essential for optimizing join order, deciding whether to use indexes, and selecting join methods, and so directly affect query execution time and overall database performance. Conversely, inaccurate estimates can steer the optimizer toward inefficient plans, degrading performance, sometimes by several orders of magnitude. Cardinality estimation has therefore become a core component of database management and the focus of extensive research on improving its accuracy and efficiency.

Current cardinality estimation methods nevertheless face several limitations. Traditional CE techniques, the mainstream in modern database systems, rely on heuristics and simplified models, such as assuming uniform data distributions and independence between columns. Although computationally efficient, these methods often fail to produce accurate estimates for complex queries involving multiple tables and filters. Learning-based cardinality estimation models have emerged in response, taking a data-driven approach to deliver more accurate predictions. However, these models still face practical challenges: high training costs, dependence on large-scale datasets, and a lack of systematic benchmark evaluations. To address these limitations, researchers at Google have introduced CardBench, a benchmark designed to provide a systematic evaluation framework for learning-based cardinality estimation models.
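To make the traditional heuristics concrete, here is a minimal sketch (not CardBench code; the function name and numbers are illustrative) of the classic estimator that assumes a uniform data distribution and independence between filter columns, simply multiplying per-predicate selectivities:

```python
# Illustrative sketch of a traditional cardinality estimator.
# Assumptions baked in: uniform value distribution and independence
# between columns -- exactly the simplifications the article describes.
def estimate_cardinality(table_rows: int, selectivities: list[float]) -> int:
    """Estimate result rows by multiplying per-predicate selectivities."""
    estimate = float(table_rows)
    for s in selectivities:
        estimate *= s  # independence assumption: selectivities just multiply
    return max(1, round(estimate))

# Example: a 1,000,000-row table with two filters of selectivity 0.1 and 0.05.
# Under independence, combined selectivity is 0.1 * 0.05 = 0.005.
print(estimate_cardinality(1_000_000, [0.1, 0.05]))  # -> 5000
```

When the filtered columns are correlated (say, `city` and `zip_code`), the true result can be orders of magnitude away from this product, which is precisely the failure mode that motivates learned estimators.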
CardBench is notable for its breadth: it covers thousands of queries over 20 distinct real-world databases, more than any previous benchmark. This design enables comprehensive evaluation of learning-based CE models under varied conditions. CardBench supports three key model settings: instance-based models (trained on a single dataset), zero-shot models (pre-trained on multiple datasets and tested on unseen datasets), and fine-tuned models (pre-trained, then fine-tuned with a small amount of data from the target dataset).

Beyond the datasets themselves, CardBench ships tools for computing the necessary data statistics, generating realistic SQL queries, and creating annotated query graphs to support CE model training. Its training data is split into two parts: single-table queries with multiple filter predicates, and binary join queries over two tables. With 9,125 single-table queries and 8,454 binary join queries, CardBench constructs a robust and challenging model evaluation environment. Notably, the ground-truth labels were produced with Google BigQuery, and obtaining them consumed seven CPU-years of query execution time, underscoring the investment behind the benchmark.

Performance evaluations on CardBench show encouraging results, particularly for fine-tuned models. Zero-shot models lose accuracy on complex queries involving joins, while fine-tuned models reach accuracy comparable to instance-based models with far less training data. For example, a fine-tuned graph neural network (GNN) model achieves a median q-error of only 1.32 and a 95th-percentile q-error of 120 on binary join queries, significantly outperforming the zero-shot model.
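The q-error figures cited above use the standard ratio-based accuracy measure for cardinality estimates: the larger of estimate/actual and actual/estimate, so 1.0 is a perfect estimate. A minimal sketch (the function name is our own):

```python
def q_error(estimated: float, actual: float) -> float:
    """q-error = max(est/actual, actual/est); 1.0 means a perfect estimate."""
    # Clamp to 1 to avoid division by zero on empty results, a common convention.
    estimated = max(estimated, 1.0)
    actual = max(actual, 1.0)
    return max(estimated / actual, actual / estimated)

# A median q-error of 1.32 means the typical estimate is within a factor
# of 1.32 of the true cardinality, in either direction.
print(q_error(1320, 1000))  # -> 1.32
print(q_error(1000, 1320))  # -> 1.32 (symmetric: under- and over-estimates
                            #    of the same ratio are penalized equally)
```

This symmetry is why q-error is preferred over relative error for CE evaluation: underestimating by 10x is as harmful to plan selection as overestimating by 10x.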
Furthermore, the research shows that fine-tuning pre-trained models significantly improves performance even with limited training data, offering a practical path for real-world scenarios where training data is scarce. In conclusion, CardBench represents a significant advance in learning-based cardinality estimation. By providing a comprehensive and diverse benchmarking platform, it enables systematic evaluation and comparison of CE models and lays a solid foundation for continued innovation in this critical field. In particular, its support for fine-tuned models lowers training costs and opens new avenues for performance optimization in practical applications.