What follows is a critical analysis of “Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models”

<aside> 👤 This is a joint document written by three MIT EECS seniors (Class of 2024): Raunak Chowdhuri, Neil Deshmukh, and David Koplow.

</aside>

<aside> 🚨 On June 24th, Armando Solar-Lezama (Professor in EECS and Associate Director & COO of CSAIL, MIT), Tonio Buonassisi (Professor of Mechanical Engineering, MIT), and Yoon Kim (Assistant Professor in EECS and CSAIL, MIT) released a public statement regarding the paper. Please read it below.

https://people.csail.mit.edu/asolar/CoursesPaperStatement.pdf

</aside>

<aside> 📌 Update: we’ve run preliminary replication experiments for all of the zero-shot testing here, and so far we’ve reviewed about 33% of the pure-zero-shot dataset. See the histogram page in the Google Sheet for the latest results: of the subset of 96 questions graded so far, roughly 32% of responses are incorrect, roughly 58% are correct, and the rest are invalid or only mostly correct.

We’ve also run and released the expert-prompting zero-shot experiments, but haven’t had time to validate these yet. Some of the clear issues with this prompting system are detailed in the section below.

⚠️ We want to make clear that our grading process involved both our own manual grading and crowdsourcing some of the manual grading effort to PhD-level experts who reached out to us after our original post. Because the manual grading process is still underway, we cannot verify that every question is graded correctly until we have finished grading entirely and double-checked the grades. In the meantime, we’ve made the grading spreadsheet public and enabled commenting so that anyone in the community can suggest corrections. Thank you for your patience!

</aside>

<aside> 💻 Replication code is available in a Colab notebook here.

</aside>

Summary

A paper seemingly demonstrating that GPT-4 could ace the MIT EECS + Math curriculum recently went viral on Twitter, getting over 500 retweets in a single day. Like many others, we were excited to read the analysis behind such a feat, but what we found left us surprised and disappointed. Even though the authors of the paper said they manually reviewed the published dataset for quality, we found clear signs that a significant portion of the evaluation dataset was contaminated in a way that let the model cheat, like a student who was fed the answers to a test right before taking it.

We think this should raise broader questions about the recent flurry of academic work that uses Large Language Models (LLMs) like GPT to shortcut data validation, a foundational principle in any kind of science, and especially in machine learning. These papers are often uploaded to arXiv and widely shared on Twitter before any legitimate peer review, in this case potentially spreading bad information and setting a poor precedent for future work.

<aside> 🕊️ Several of the authors listed on the discussed paper are undergraduate researchers. Consequently, we believe it's inappropriate to hold these individuals accountable for any lapses present in the work.

Instead, we believe the responsibility should lie with the supervising authors. They are the ones who are expected to ensure that the work meets the rigorous standards of public scholarship within their field.

</aside>

Background

We discovered the paper on the morning of Friday, June 16, after it started to go viral on Twitter with headlines like "GPT-4 scores 100% on MIT EECS Curriculum." We decided to dig a little deeper. Within an hour, we were skeptical of the paper's methodology. Within two, we had realized that the dataset itself was suspect.