Apple Introduces GSM-Symbolic: An In-Depth Evaluation of LLMs' Mathematical Reasoning Capabilities

2024-10-14

Recent advancements in the mathematical reasoning abilities of large language models (LLMs) have garnered significant attention, particularly since the introduction of the GSM8K benchmark, which assesses grade-school math problem solving. Although LLMs have shown improved scores on GSM8K, doubts remain about whether their reasoning capabilities have genuinely advanced. Existing evaluation metrics may capture only part of the picture: research indicates that LLMs rely on probabilistic pattern matching rather than true logical reasoning, leading to token bias and sensitivity to minor input variations. Additionally, GSM8K's static question set and single aggregate accuracy metric limit its effectiveness in assessing LLMs' reasoning abilities under diverse conditions.

Logical reasoning is critical for intelligent systems, yet the consistency of LLMs' performance in this area requires further validation. Studies have shown that LLMs can accomplish certain tasks through probabilistic pattern matching, but complex problems typically demand more formal logical reasoning, and small variations in input tokens can significantly change outcomes. Although transformers perform effectively in some scenarios, they depend on external storage (such as scratchpads) and additional computation to handle intricate tasks. On balance, the evidence indicates that LLMs rely more on matching patterns encountered during training than on genuine logical comprehension.

To more accurately evaluate LLMs' reasoning capabilities, researchers at Apple conducted an extensive study employing a new benchmark called GSM-Symbolic. This benchmark generates a diverse range of mathematical problems using symbolic templates, offering a more reliable and controllable assessment method. The study found that as numerical values or problem complexity increase, LLM performance significantly declines. Furthermore, the inclusion of seemingly related but actually irrelevant information can reduce performance by up to 65%, indicating that LLMs primarily depend on pattern matching rather than formal logical reasoning.

The GSM8K dataset comprises over 8,000 elementary-level math problems and answers and is commonly used to assess LLMs. However, its widespread use introduces risks such as data contamination and sensitivity to minor problem variations. To address these issues, researchers developed GSM-Symbolic, a benchmark that generates diverse problem instances using symbolic templates, thereby providing a more robust evaluation method. By testing over 20 open-source and proprietary models with 5,000 samples derived from 100 templates, the study unveiled both the strengths and limitations of LLMs in mathematical reasoning.
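The templating idea can be illustrated with a minimal sketch. Everything below (the question text, name pool, and value ranges) is an invented, illustrative example, not one of the paper's actual templates: a question with placeholders for names and numbers is instantiated many times, with a known ground-truth answer computed for each instance.

```python
import random

# Hypothetical GSM-Symbolic-style template: placeholders for a name and
# two numeric values, plus a function computing the ground-truth answer.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def ground_truth(x: int, y: int) -> int:
    return x + y

def generate_instances(n: int, seed: int = 0):
    """Instantiate the template n times with random names and values."""
    rng = random.Random(seed)
    names = ["Sophie", "Liam", "Mia", "Noah"]  # illustrative name pool
    instances = []
    for _ in range(n):
        name = rng.choice(names)
        x, y = rng.randint(2, 50), rng.randint(2, 50)  # keep values plausible
        instances.append({
            "question": TEMPLATE.format(name=name, x=x, y=y),
            "answer": ground_truth(x, y),
        })
    return instances

for inst in generate_instances(3):
    print(inst["question"], "->", inst["answer"])
```

Because each template yields many concrete instances, a model's accuracy becomes a distribution over variants rather than a single memorizable number, which is what makes the benchmark harder to contaminate.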

Preliminary experiments reveal significant performance discrepancies among models on GSM-Symbolic (a variant of the GSM8K dataset), with accuracies lower than those observed on GSM8K. The study also investigated how altering names and numerical values affects LLMs, finding that numerical changes degrade performance far more than name changes. Problem complexity matters as well: the more clauses a question contains, the more pronounced the performance drop. These results suggest that models rely more on pattern matching than on genuine logical reasoning, as even the addition of extra clauses typically diminishes performance.
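Reporting results per template variant rather than as one aggregate score is what exposes this variance. A hedged sketch of the bookkeeping, assuming each model's per-question correctness has already been collected as booleans (the toy data below is invented):

```python
from statistics import mean, stdev

def accuracy_distribution(results_per_variant):
    """Given one list of per-question booleans (True = correct) per template
    variant, return the mean and spread of per-variant accuracy."""
    accs = [sum(r) / len(r) for r in results_per_variant]
    return {"mean": mean(accs), "std": stdev(accs),
            "min": min(accs), "max": max(accs)}

# Toy example: 4 variants of the same template, 5 questions each
variants = [
    [True, True, False, True, True],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [False, False, True, True, True],
]
print(accuracy_distribution(variants))
```

A wide min-to-max spread on instances of the same underlying problem is the signature of pattern matching: a model that truly reasoned would score similarly regardless of which names and numbers were drawn.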

This study provides an in-depth analysis of LLMs' reasoning capabilities and highlights the limitations of the current GSM8K evaluation method. By introducing the GSM-Symbolic benchmark, researchers assessed LLMs' mathematical reasoning across various problem variants. Results demonstrate significant variability in LLM performance when numerical values are altered or irrelevant clauses are added. Moreover, as problem complexity increases, LLM performance is adversely affected, indicating a reliance on pattern matching over genuine logical reasoning. The GSM-NoOp test further revealed LLMs' inadequacies in filtering out irrelevant information, leading to substantial performance declines. Overall, this research underscores the necessity for further advancements to enhance LLMs' logical reasoning capabilities.
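The GSM-NoOp idea, inserting a clause that sounds relevant but does not change the answer, can be sketched as follows. The distractor sentence and helper below are illustrative inventions in the spirit of the benchmark, not taken from it:

```python
def add_noop_clause(question: str, distractor: str) -> str:
    """Append a seemingly relevant but answer-irrelevant clause just before
    the final question sentence, GSM-NoOp style."""
    body, sep, final_q = question.rpartition(". ")
    return f"{body}. {distractor} {final_q}" if sep else f"{distractor} {question}"

base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
noop = add_noop_clause(base, "Five of the kiwis are a bit smaller than average.")
print(noop)
# The ground-truth answer is unchanged (44 + 58 = 102); a model that
# subtracts the five "smaller" kiwis is pattern matching, not reasoning.
```

Since the correct answer is identical before and after the insertion, any accuracy drop on the modified question isolates the model's inability to filter irrelevant information.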