Alibaba Cloud’s Qwen team has announced Qwen2-Math, a new series of large language models designed for mathematical problem solving that reportedly outperforms models from artificial intelligence (AI) industry giants.
“Over the past year, we have dedicated significant effort to researching and enhancing the reasoning capabilities of large language models, with a particular focus on their ability to solve arithmetic and mathematical problems,” the Qwen team wrote in a post on GitHub.
“We hope that Qwen2-Math can contribute to the community by solving complex mathematical problems,” they added.
A New Math Whiz on the Block
The Qwen2-Math series is a set of base models with variants ranging from 1.5 billion to a staggering 72 billion parameters. Each base model has an “instruct” variant tailored to follow user instructions more precisely and to support interactive use, making it suitable for educational use cases.
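For readers who want to try one of the instruct variants, they follow the standard Hugging Face chat workflow. The sketch below is a minimal example, assuming the checkpoints are published under the Qwen organization (e.g., `Qwen/Qwen2-Math-7B-Instruct`); it is illustrative rather than the team’s official quickstart.

```python
# Minimal inference sketch (assumed repo id: "Qwen/Qwen2-Math-7B-Instruct");
# requires transformers, torch, and accelerate to be installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-Math-7B-Instruct"  # 1.5B and 72B variants follow the same pattern
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat-style prompt asking for a step-by-step solution.
messages = [
    {"role": "system", "content": "You are a helpful math assistant."},
    {"role": "user", "content": "Solve step by step: if 3x + 5 = 47, what is x?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
answer = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(answer)
```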
The models are designed specifically to tackle complex mathematical problems more effectively than current state-of-the-art large language models (LLMs). Results from various benchmarks show them outperforming GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B.
The most impressive of the three, Qwen2-Math-72B, has the largest parameter count and is designed for highly intricate tasks that demand deep reasoning and extensive data processing.
The 7B variant, meanwhile, strikes a balance between capability and computational cost, making it a good fit for mid-range tasks. Finally, the 1.5B variant is far less resource-hungry yet still benefits from the Qwen2 architecture.
Qwen’s Architecture
The models were initially built upon the Qwen2 foundation and then underwent extensive training on a corpus of curated datasets comprising mathematical texts, books, code, exam questions, and synthetic data.
The researchers employed reinforcement learning techniques to further enhance the models’ capabilities. Using a math-specific reward model, they fine-tuned the Qwen2-Math models to produce even more accurate solutions, which led to the creation of the Qwen2-Math-Instruct variants. This method made the “instruct” variants superior in performance across various mathematical benchmarks.
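The post does not spell out the exact RL recipe, but one common pattern consistent with this description is rejection sampling: draw several candidate solutions, score each with the reward model, and keep (or train on) the best. The toy sketch below illustrates the idea; `sample_solution` and `reward_score` are hypothetical stand-ins, not the team’s actual components.

```python
import random

def sample_solution(problem: str, seed: int) -> str:
    """Stand-in for the policy model: emit one candidate worked solution."""
    rng = random.Random(seed)
    answer = rng.choice(["12", "14", "16"])
    return f"Step 1: subtract 5. Step 2: divide by 3. Final answer: {answer}"

def reward_score(problem: str, solution: str) -> float:
    """Stand-in for a math-specific reward model.

    A real reward model is a trained network scoring solution quality;
    here we crudely reward the known-correct final answer.
    """
    return 1.0 if solution.endswith("14") else 0.1

def best_of_n(problem: str, n: int = 8) -> str:
    """Rejection sampling: keep the highest-reward candidate.

    During training, high-scoring samples can be fed back as fine-tuning
    data, which is one way a reward model sharpens an instruct model.
    """
    candidates = [sample_solution(problem, seed=i) for i in range(n)]
    return max(candidates, key=lambda s: reward_score(problem, s))

print(best_of_n("If 3x + 5 = 47, what is x?"))
```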
Rigorous evaluations on both English and Chinese mathematical benchmarks, including GSM8K, MATH, MMLU-STEM, CMATH, and Gaokao Math, have demonstrated the strengths of Qwen2-Math. Notably, the largest Qwen2-Math-Instruct model achieved exceptional results on competition-level problems such as the 2024 American Invitational Mathematics Examination (AIME).
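Benchmark harnesses vary, but GSM8K-style scoring typically reduces to exact match on the final numeric answer. Below is a minimal sketch of that scoring step; the helper names are ours, and real evaluations add prompt formatting, sampling settings, and per-dataset answer normalization.

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Take the last number in a response as the final answer,
    the usual convention for GSM8K-style grading."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose final answer matches the reference."""
    correct = sum(
        extract_final_answer(p) == extract_final_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["3x = 42, so x = 14.", "The total is 17 apples."]
refs = ["14", "18"]
print(exact_match_accuracy(preds, refs))  # 0.5
```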
To showcase the models’ capabilities, the team provided case studies in their GitHub post, including problems from the International Mathematical Olympiad (IMO).
The introduction of Qwen2-Math has significant implications for both education and scientific research. In the educational sector, the models could serve as a powerful tool for students and educators, explaining complex mathematical concepts, offering step-by-step solutions to problems, and even generating practice exercises.
The Qwen team has acknowledged that the new models currently have limitations due to their “English-only support.” They plan to release bilingual models soon and are also developing multilingual LLMs.
“We will continue to enhance our models’ ability to solve complex and challenging mathematical problems,” the Qwen team stated.