Reflection 70B saga continues as training data provider releases post-mortem report

On September 5, 2024, Matt Shumer, co-founder and CEO of the startup Hyperwrite AI (also known as OthersideAI), took to the social network X to post the bombshell news that he had fine-tuned a version of Meta’s open-source Llama 3.1-70B into a more performant large language model (LLM) called Reflection 70B, which his post described as “the world’s top open-source model.” Shortly after its release, however, third-party evaluators in the AI research and hosting community struggled to reproduce the claimed benchmark results, leading to accusations of fraud. Researchers pointed to discrepancies between the published results and their own independent tests, sparking a wave of criticism on social platforms such as Reddit and X.

Sahil Chaudhary, founder of Glaive, the AI startup whose synthetic data Shumer said he had trained Reflection 70B on (and in which he later disclosed he had invested a “small” amount), has now released a post-mortem report on his Glaive AI blog and published resources so the open-source AI community can test the model and his training process themselves. In the report, Chaudhary explains that the main obstacle to reproducing the original results was a bug in the evaluation code. Although the revised scores are not as high as first reported, Chaudhary asserts that they more accurately reflect the model’s capabilities. Even so, many in the AI community who were initially excited about the model remain skeptical, feeling they were burned by erroneous claims and possibly deceived.

Source: https://venturebeat.com/ai/reflection-70b-saga-continues-as-training-data-provider-releases-post-mortem-report/


On September 5th, 2024, Matt Shumer, co-founder and CEO of the startup Hyperwrite AI (also known as OthersideAI) took to the social network X to post the bombshell news that he had fine-tuned a version of Meta’s open source Llama 3.1-70B into an even more performant large language model (LLM) known as Reflection 70B — so performant, in fact, based on alleged third-party benchmarking test results he published, that it was “the world’s top open-source model,” according to his post.

However, shortly after its release, third-party evaluators in the AI research and hosting community struggled to reproduce the claimed results, leading to accusations of fraud.

Researchers cited discrepancies between the announced benchmark results and their independent tests, sparking a wave of criticism on social platforms such as Reddit and X.

In response to these concerns, Shumer pledged to review the issues alongside Sahil Chaudhary, founder of Glaive, the AI startup whose synthetic data Shumer claimed he had trained Reflection 70B on, and in which he later revealed he had invested what he called a small amount.

Now, nearly a month later, Chaudhary last night released a post-mortem report on his Glaive AI blog about the Reflection 70B model and published resources for the open-source AI community to test the model and his training process on their own. He says while he was unable to reproduce all of the same benchmarks, he “found a bug in the initial code,” resulting in several results appearing higher than what he has found on recent tests of Reflection 70B. However, other benchmark results appear higher than before — adding to the mystery.

As Chaudhary wrote in the post:

There were a lot of mistakes made by us in the way we launched the model, and handled the problems reported by the community. I understand that things like these have a significant negative effect on the open source ecosystem, and I’d like to apologize for that. I hope that this adds some clarity to what happened, and is a step in the direction of regaining the lost trust. I have released all of the assets required to independently verify the benchmarks and use this model.

Sharing model artifacts

To restore transparency and rebuild trust, Chaudhary shared several resources to help the community replicate the Reflection 70B benchmarks. These include:

  • Model weights: Available on Hugging Face, providing the fine-tuned Reflection 70B model.
  • Training data: Released for public access, enabling independent tests on the dataset used to fine-tune the model.
  • Training scripts and evaluation code: Available on GitHub, these scripts allow for reproduction of the model’s training and evaluation process.

These resources aim to clarify how the model was developed and offer a path for the community to validate the original performance claims.

Reproducing the benchmarks

In his post-mortem, Chaudhary explained that a major issue with reproducing the initial benchmark results stemmed from a bug in the evaluation code. This bug caused inflated scores in certain tasks, such as MATH and GSM8K, due to an error in how the system handled responses from an external API. The corrected benchmarks show lower, but still strong, performance on those tasks, with MATH dropping by almost nine percentage points.

The updated benchmark results for Reflection 70B are as follows:

  • MMLU: 90.94%
  • GPQA: 55.6%
  • HumanEval: 89.02%
  • MATH: 70.8%
  • GSM8K: 95.22%
  • IFEVAL: 87.63%

Compare that to the originally stated performance of:

  • MMLU: 89.9%
  • GPQA: 55.3%
  • HumanEval: 91%
  • MATH: 79.7%
  • GSM8K: 99.2%
  • IFEVAL: 90.13%

Although most of the revised scores are lower than those initially reported (MMLU and GPQA are slightly higher), Chaudhary asserts that they are more accurate reflections of the model’s capabilities.
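To make the size of each change easier to see, the two sets of scores can be compared directly. The short Python sketch below simply subtracts the originally stated figures from the updated ones; the numbers are copied from the lists above, while the function and variable names are illustrative and not taken from Chaudhary's evaluation code.

```python
# Scores (in percent) copied from the two lists in this article.
original = {"MMLU": 89.9, "GPQA": 55.3, "HumanEval": 91.0,
            "MATH": 79.7, "GSM8K": 99.2, "IFEVAL": 90.13}
updated = {"MMLU": 90.94, "GPQA": 55.6, "HumanEval": 89.02,
           "MATH": 70.8, "GSM8K": 95.22, "IFEVAL": 87.63}


def score_deltas(before, after):
    """Return updated minus original for each benchmark, in percentage points."""
    return {name: round(after[name] - before[name], 2) for name in before}


deltas = score_deltas(original, updated)
for name, delta in deltas.items():
    direction = "lower" if delta < 0 else "higher"
    print(f"{name}: {delta:+.2f} points ({direction} than first reported)")
```

Run as written, this shows four benchmarks moving down (HumanEval, MATH, GSM8K, IFEVAL) and two moving up (MMLU, GPQA), with MATH showing the largest change.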

He also addressed concerns about dataset contamination, confirming that tests showed no significant overlap between the training data and benchmark sets.
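One common way to test for contamination is to look for long word sequences shared between training examples and benchmark items. The Python sketch below is a minimal illustration of that idea only. It is not the procedure Chaudhary used, which this article does not describe, and the function names, the 8-word window, and the sample strings are all assumptions made for the example.

```python
def word_ngrams(text, n=8):
    """Return the set of lowercase n-word sequences found in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(training_examples, benchmark_items, n=8):
    """Return indices of training examples sharing any n-gram with a benchmark item."""
    benchmark_grams = set()
    for item in benchmark_items:
        benchmark_grams |= word_ngrams(item, n)
    return [
        index
        for index, example in enumerate(training_examples)
        if word_ngrams(example, n) & benchmark_grams
    ]


# Hypothetical data, for illustration only.
training = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "A completely unrelated sentence about fine-tuning models on synthetic data",
]
benchmark = ["quick brown fox jumps over the lazy dog near the river"]
flagged = find_overlaps(training, benchmark)
print(flagged)
```

An exact n-gram match like this misses paraphrased or translated benchmark items, so a clean result from such a check is weaker evidence than a flagged overlap.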

Reflecting on a hasty release

Chaudhary admitted that the decision to release Reflection 70B was made hastily, driven by enthusiasm for the model’s performance on reasoning-based tasks.

He noted that the launch lacked sufficient testing, particularly regarding the compatibility of the model files, and that he and Shumer had not verified whether the model could be easily downloaded and run by the community.

“We shouldn’t have launched without testing, and with the tall claims of having the best open-source model,” Chaudhary wrote. He also acknowledged that more transparency was needed, especially regarding the model’s strengths and weaknesses. While Reflection 70B excels at reasoning tasks, it struggles in areas like creativity and general user interaction, a fact that was not communicated at launch.

Clarifying API confusion

One of the more serious accusations involved the suspicion that the Reflection 70B API was simply relaying outputs from Anthropic’s Claude model.

Users reported strange behavior in the model’s outputs, including responses that seemed to reference Claude directly.

Chaudhary addressed these concerns, saying that although some of these behaviors were reproducible, there was no use of Claude APIs or any form of word filtering in the Reflection 70B model.

He reiterated that the API was run on Glaive AI’s compute infrastructure, and Matt Shumer had no access to the code or servers used during this period.

Looking ahead

In closing, Chaudhary emphasized his commitment to transparency and expressed his hope that this post-mortem and the release of model artifacts will help restore trust in the project. He also confirmed that Matt Shumer is continuing independent efforts to reproduce the benchmark scores.

Despite the setbacks, Chaudhary believes the “reflection tuning” approach — in which a model is given time to check its responses for accuracy before outputting them to a user — has potential and encourages further experimentation by the AI community. “The approach explored has merit, and I look forward to others continuing to explore this technique,” he said.

Shumer, for his part, has posted on X stating: “I am still in the process of validating Reflection myself, as Sahil wrote in his postmortem, but I am encouraged by Sahil’s transparency here on the benchmarks he reported and the API he ran. We still believe in + are working on this approach. Hoping to finish up my repro soon.”

Skepticism among open source AI community remains

Despite Chaudhary’s claims to offer transparency and an innocent explanation for what happened with Reflection 70B, many in the AI community who were initially excited about the model and its stated performance remain skeptical, feeling as though they were burned by erroneous claims and potentially tricked before.

“Still doesn’t feel like anything adds up here,” wrote Alexander Moini, an AI researcher, on X, adding “It took a month to get the model weights on to HF (Hugging Face)?”

Yuchen Jin, co-founder and CTO of Hyperbolic Labs, a startup that offers cloud-based GPUs and other AI services on demand, also voiced skepticism on X toward Chaudhary’s post-mortem report. Jin, who initially worked hard and late to host Reflection 70B before criticizing Shumer over its discrepancies, pointed out that Chaudhary’s claim on X that he “reproduced all but two of the initially reported scores” doesn’t match the data he provided, which shows at least four benchmark scores changing from before to now.

But perhaps the most damning commentary comes from the subreddit r/LocalLLaMA, where one user, “FuckSides,” pointed out that Chaudhary could have used the intervening month to fine-tune a new model to back up his claim that it randomly outputs text indicating it is actually Anthropic’s Claude 3.5 under the hood. Such a model would retroactively explain the outputs users saw earlier, which had led them to conclude that Reflection 70B was a fraudulent wrapper around that proprietary model served through an API.

Meanwhile, another Redditor, “DangerousBenefit,” looked into the training data Chaudhary released today and found it was filled with many instances of the phrase “as an AI language model,” which indicates it could have been generated primarily by OpenAI’s ChatGPT and likely wasn’t properly cleaned.
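A check like the one described above can be approximated with a simple case-insensitive substring search. The Python sketch below assumes the dataset has already been loaded as a list of text strings; the actual format of the released training data is not described in this article, and the sample records and function name are hypothetical.

```python
def find_phrase(records, phrase="as an AI language model"):
    """Return indices of records containing the phrase, ignoring case."""
    needle = phrase.lower()
    return [i for i, text in enumerate(records) if needle in text.lower()]


# Hypothetical records, for illustration only.
records = [
    "As an AI language model, I cannot browse the internet.",
    "The derivative of x squared is 2x.",
    "I'm sorry, but as an AI language model I don't have opinions.",
]
hits = find_phrase(records)
print(f"{len(hits)} of {len(records)} records contain the phrase")
```

A high hit rate would suggest that boilerplate from a chat assistant was left in the data, though the phrase alone does not identify which model produced it.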

Regardless, the more data the Reflection 70B creators publish about the model, the more evidence the open source AI community has to pore over and check their work.
