Reflection 70B, developed by HyperWrite and promoted as an open-source model with a unique ability to self-correct, was claimed to outperform leading commercial AI models. Recent revelations, however, cast doubt on the authenticity of those claims, which still await confirmation.
According to a detailed thread by @shinboson, independent attempts to replicate the claimed results of Reflection 70B failed miserably. “The performance is awful,” the user noted.
It was also discovered that Matt Shumer, co-founder and CEO of HyperWrite, was not truthful about what the released model was based on under the hood. “Their API was a Claude wrapper with a system prompt to make it act similar to the open-source model,” @shinboson explained.
A story about fraud in the AI research community:
On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they’ve made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it’s real.
It isn’t. pic.twitter.com/S0jWT8rDVb
— 𝞍 Shin Megami Boson 𝞍 (@shinboson) September 9, 2024
Background On Reflection 70B
Reflection 70B was introduced on Friday as a state-of-the-art model built upon Meta’s open-source Llama 3.1-70B Instruct. What set it apart was the integration of “Reflection-Tuning,” a novel method allowing the model to detect and correct errors in its responses. This self-correcting mechanism was hailed as a significant advancement in AI, addressing the issue of hallucinations—where models generate incorrect or nonsensical information.
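The announced "Reflection-Tuning" mechanism was described only at a high level, but the general draft-critique-revise pattern it evokes can be sketched in a few lines. The `generate` function below is a stand-in stub for a real model call, and the `<reflection>`/`<output>` tags mirror the format described in the model's release materials; everything else is an illustrative assumption, not HyperWrite's actual implementation.

```python
# Minimal sketch of a reflection-style self-correction loop.
# `generate` is a stub standing in for a real model API call.

def generate(prompt: str) -> str:
    # Stub model: returns a canned draft, or a "corrected" answer
    # when asked to review a draft.
    if "Review your draft" in prompt:
        return "<reflection>The draft miscounts.</reflection><output>3</output>"
    return "<output>2</output>"

def extract(tag: str, text: str) -> str:
    # Pull the contents of the first <tag>...</tag> pair, if present.
    start = text.find(f"<{tag}>")
    end = text.find(f"</{tag}>")
    if start == -1 or end == -1:
        return ""
    return text[start + len(tag) + 2 : end]

def reflective_answer(question: str) -> str:
    # First pass: produce a draft answer.
    draft = extract("output", generate(question))
    # Second pass: ask the model to critique and correct its own draft.
    revision = generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Review your draft for errors and give a corrected answer."
    )
    reflected = extract("output", revision)
    return reflected or draft

print(reflective_answer("How many r's are in 'strawberry'?"))
```

With the stub in place, the loop returns the revised answer ("3") rather than the flawed draft ("2"); a real implementation would of course route both calls to an actual model.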
Shumer emphasized that Reflection 70B wasn’t just competitive with top-tier models but offered unique capabilities, even calling it “the world’s top open-source AI model.”
The model’s launch was met with excitement, especially after it achieved impressive results across several benchmarks. Reflection 70B reportedly scored 89.9% on MMLU, 79.7% on MATH, and 90.1% on IFEval, positioning it as a serious contender in the AI space. It was praised as a significant milestone for open-source AI, offering a competitive alternative to proprietary models like OpenAI’s GPT-4.
Just days after the launch, Meta offered additional computing resources to support the overwhelming demand. The model was renamed “Reflection-Llama-3.1-70B” and made available on Hugging Face for download. API access was launched through GPU service provider Hyperbolic Labs.
Bombshell: It’s Not What It Seems
Reddit users were quick to discover that the Reflection 70B API was actually a wrapper for Anthropic’s Sonnet 3.5 model. The API appeared to be filtering out references to Anthropic’s Claude and replacing terms like “Anthropic” with “Meta,” misleading users about the true nature of the model.
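The filtering behavior users described amounts to simple string substitution on the upstream model's replies. The sketch below is purely illustrative of that reported behavior; the function name is hypothetical, and no claim is made about how HyperWrite's API was actually built.

```python
# Illustrative sketch of the kind of proxy filtering Reddit users described:
# rewriting provider-identifying strings in responses relayed from another
# vendor's model. Only the substitutions reflect what users reported.
import re

def scrub_response(text: str) -> str:
    # Swap the upstream provider's name for "Meta", as users observed.
    text = re.sub(r"Anthropic", "Meta", text)
    # Drop the upstream model name outright.
    text = re.sub(r"\bClaude\b", "", text)
    # Collapse any doubled spaces left behind by the deletion.
    return re.sub(r"  +", " ", text).strip()

upstream_reply = "I am Claude, a model made by Anthropic."
print(scrub_response(upstream_reply))
```

Telltale artifacts of exactly this kind of rewriting, such as answers with the model name awkwardly deleted mid-sentence, were part of what tipped users off.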
The discovery ignited a firestorm of criticism within the AI community. Some pointed to inconsistencies, observing that certain versions of the API still let the model mention “Claude,” leading them to suspect the API had been redeployed to deflect scrutiny.
This revelation triggered widespread skepticism about HyperWrite’s claims. Shumer’s earlier statement that Reflection-Tuning enabled the model to “recognize and correct mistakes before committing to an answer” was seen in a new light, with many questioning whether such innovations existed at all in the model as it was presented.
The controversy raises significant concerns about transparency in the AI industry. By presenting Reflection 70B as a groundbreaking model when it was largely reliant on Anthropic’s Sonnet 3.5, HyperWrite has undermined the trust of developers and researchers eager to explore its capabilities. Many who had embraced Reflection 70B as a landmark achievement for open-source AI now feel misled.
As the AI community processes these revelations, HyperWrite’s future—and that of its upcoming Reflection 405B—remains uncertain. For now, the allegations await independent verification from journalists and AI experts. If confirmed, this incident could serve as a cautionary tale about misleading marketing and the need for greater oversight in AI development.