> We surveyed students before releasing grades to capture their experience. [...] Only 13% preferred the AI oral format. 57% wanted traditional written exams. [...] 83% of students found the oral exam framework more stressful than a written exam.
[...]
> Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.
Yeah, not sure the conclusion of the article really matches the data.
Students were invited to talk to an AI. They did so, and having done so they expressed a clear preference for written exams - which can be taken under exam conditions to prevent cheating, something universities have hundreds of years of experience doing.
I know some universities started using the square wheel of online assessment during covid and I can see how this octagonal wheel seems good if you've only ever seen a square wheel. But they'd be even better off with a circular wheel, which really doesn't need re-inventing.
That's what so surprising to me - they data clearly shows the experiment had terrible results. And the write up is nothing but the author stating: "glowing success!".
And they didn't even bother to test the most important thing. Were the LLM evaluations even accurate! Have graders manually evaluate them and see if the LLMs were even close or were wildly off.
This is clearly someone who had a conclusion to promote regardless of what the data was going to show.
> And they didn't even bother to test the most important thing. Were the LLM evaluations even accurate!
This is not true; the professor and the TAs graded every student submission. See this paragraph from the article:
(Just in case you are wondering, I graded all exams myself and I asked the TA to also grade the exams; we mostly agreed with the LLM grades, and I aligned mostly with the softie Gemini. However, when examining the cases when my grades disagreed with the council, I found that the council was more consistent across students and I often thought that the council graded more strictly but more fairly.)
At the risk of perhaps stating the obvious, there appears to be a whiff of aggression from this article. The "fighting fire with fire" language, the "haha, we love old FakeFoster, going to have to see if we change that" response to complaints that the voice was intimidating ... if there wasn't a specific desire to punish the class for LLM use by subjecting them to a robotic NKVD interrogation then the authors should have been more careful to avoid leaving that impression.
Tried it in earnest. Definitely detect some aggression, and would feel stressed if this were an exam setting. I think it was pg who said that any stress you add in an interview situation is just noise, and dilutes the signal.
Also, given that there's so many ways for LLMs to go off the rails (it just gave me the student id I was supposed to say, for example), it feels a bit unprofessional to be using this to administer real exams.
Not that bad? I gave it a random name and random net ID and it basically screamed at me to HANG UP RIGHT NOW AND FIGURE OUT THE CORRECT NET ID. Hahaha
That does not resemble any good professor I've ever heard. It's very aggressive and stern, which is not generally how oral exams are conducted. Feels much more like I'm being cross examined in court.
Also tried it and it could have been a lot better. If I had any type of interview with that voice (press interview, mentor interview, job interview) I would think I was being scammed, sold something, or had entered the wrong room.
The belligerence about changing the voice is so weird. And it does sort of set a tone straight off. "We got feedback that the voice was frightening and intimidating. We're keeping it tho."
I've got a long standing disagreement with an AI CEO that believes LLM convergence indicates greater accuracy. How to explain basic cause and effect in these AI use cases is a real challenge. The essential basic understanding of what an LLM is is not there, and that lack of comprehension is a civilization wide issue.
I don't think they're terrible, but I'm grading on a curve because it's their first attempt and more of a trial run. It seems promising enough to fix the issues and try again.
The quote you gave is not the conclusion of the article. It's a self-evident claim that just as well could have been the first sentence of the article ("take-home exams are dead"), followed by an opinion ("reverting ... feels like a regression") which motivated the experiment.
Some universities and professors have tried to move to a take-home exam format, which allows for more comprehensive evaluation with easier logistics than a too-brief in-class exam or an hours-long outside-of-class sitting where unreasonable expectations for mental and sometimes physical stamina are factors. That "take-home exams are dead" is self-evident, not a result of the experiment in the article. There used to be only a limited number of ways to cheat at a take-home exam, and most of them involved finding a second person who also lacked a moral conscience. Now, it's trivial to cheat at a take-home exam all by yourself.
You also mentioned the hundreds of years of experience universities have at traditional written exams. But the type and manner of knowledge and skills that must be tested for vary dramatically by discipline, and the discipline in question (computer science / software engineering) is still new enough that we can't really say we've matured the art of examining for it.
Lastly, I'll just say that student preference is hardly the way to measure the quality of an exam, or much of anything about education.
> The quote you gave is not the conclusion of the article.
Did I say "conclusion" ? Sorry, I should have said the section just before the acknowledgements, where the conclusion would normally be, entitled "The bigger point"
> they expressed a clear preference for written exams
When I was a student, I would have been quite vocal with my clear preferences for all exams being open-book and/or being able to amend my answers after grading for a revised score.
What I'm saying is, "the students would prefer..." isn't automatically case closed on what's best. Obviously the students would prefer a take-home because you can look up everything you can't recall / didn't show up to class to learn, and yes, because you can trivially cheat with AI (with a light rewrite step to mask the "LLM voice").
But in real life, people really will ask you to explain your decisions and to be able to reason about the problem you're supposedly working on. It seems clear from reading the revised prompts that the intent is to force the agent to be much fairer and easier to deal with than this first attempt was, so I don't think this is a bad idea.
Finally, (this part came from my reading of the student feedback quotes in the article) consider that the current cohort of undergrads is accustomed to communicating mainly via texting. To throw in a further complication, they were around 13-17 when COVID hit, decreasing human contact even more. They may be exceedingly nervous about speaking to anyone who isn't a very close friend. I'm sympathetic to them, but helping them overcome this anxiety with relatively low stakes is probably better than just giving up on them being able to communicate verbally.
> being able to amend my answers after grading for a revised score
How do you expect that to work? After the exam, you talk to your friends (and to ChatGPT) and know the correct answers even if you could have never produced them during the exam.
Not the person you're replying to, but I've had some courses in which you received your graded exams and had an opportunity to regain some points by choosing some number of incorrect responses and redoing the work to obtain a correct answer.
This was pre-LLM, but you could cheat back then too. LLMs make it a bit easier by showing you the work to "show" on your corrections.
Not the case for the class in the blog post, but we also have many online classes. Many professionals prefer these online classes because they can attend without having to commute, and can do it from a place of their own convenience.
Such classes do not have the luxury of pen-and-paper exams, and asking people to go to testing centers is a huge overkill.
Take home exams for such settings (or any other form of written exam) are becoming very prone to cheating, just because the bar to cheating is very low. Oral exams like that make it a bit harder to cheat. Not impossible, but harder.
I did a C# module online run by a Norwegian University. It was worth 6 points, 180 grants you a bachelor's degree in Norway (or did, I think there have been changes since). The course ran over ten weeks and there were weekly assignments. Of course it would have been easy to cheat on those but there would be no point because there was a five hour invigilated open book exam at the end of the course. Had to go to a testing centre about 35 km away to take the exam but that really wasn't a great inconvenience. If I had wanted to pursue a whole degree then I would have had 30 such exams, roughly one a month if you do the degree over the traditional three years. That doesn't seem like overkill to me, it's a lot less effort than attending lectures and tutorials for three years as I did for my Applied Physics degree.
One student had to talk to an AI for more than 60 minutes. These guys are creating a dystopia. Also students will just have an AI pick up the phone if this gets used for more than 2 semesters.
It's not that the oral format should be dismissed, just that the idea of your exam being speaking to a machine to be judged on the merit of your time in a course is dystopian. Talking to another human is fine.
Very different. A scantron machine is deterministic and non-chaotic.
In addition to being non-deterministic LLMs can product vastly different output from very slightly different input.
That’s ignoring how vulnerable LLMs are to prompt injection, and if this becomes common enough that exams aren’t thoroughly vetted by humans, I expect prompt attacks to become common.
Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
I saw this piece as the start of an experiment, and the use of a "council of AI" as they put it to average out the variability sounds like a decent path to standardization to me (prompt injecting would not be impossible, but getting something past all the steps sounds like a pretty tough challenge)
They mention getting 100% agreement between the LLMs on some questions and lower rates on other, so if an exam was composed of only questions where there is near 100% convergence, we'd be pretty close to a stable state.
I agree it would be reassuring to have a human somewhere in the loop, or perhaps allow the students to appeal the evaluation (at cost?) if they is evidence of a disconnect between the exam and the other criteria. But depending on how the questions and format is tweaked we could IMHO end up with something reliable for very basic assessments.
PS:
> Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
Nothing indeed. The arms race hasn't started here, and will keep going IMO.
So the whole thing is a complete waste of time then as an evaluation exercise.
>council of AIs
This only works if the errors and idiosyncrasies of different models are independent, which isn’t likely to be the case.
>100% agreement
When different models independently graded tests 0% of grades matched exactly and the average disagreement was huge.
They only reached convergence on some questions when they allowed the AIs to deliberate. This is essentially just context poisoning.
1 model incorrectly grading a question will make the other models more likely to incorrectly grade that question.
If you don’t let models see each other’s assessments, all it takes is one person writing an answer in a slightly different way that causes disagreement among models to vastly alter the overall scores by tossing out a question.
This is not even close to something you want to use to make consequential decisions.
Imagine that LLMs reproduce the biases of their training sets and human data sets are biased against nonstandard speakers with rural accents/dialects/AAVE as less intelligent. Do you imagine their grade won't be slightly biased when the entire "council" is trained on the same stereotypes?
Appeals aren't a solution either, because students won't appeal (or possibly even notice) a small bias given the variability of all the other factors involved, nor can it be properly adjucated in a dispute.
I might be given too much credit, but given the tone of the post they're not trying to apply this to some super precise extremely competitive check.
If the goal is to assess whether a student properly understood the work they submitted or more generally if they assimilated most concepts of a course, the evaluation can have a bar low enough for let's say 90% of the student to easily pass. That would give enough of margin of error to account for small biases or misunderstandings.
I was comparing to mark sheet tests as they're subject to similar issues, like students not properly understanding the wording (and usually the questions and answer have to be worded in pretty twisted ways to properly) or straight checking the wrong lines or boxes.
To me this method, and other largely scalable methods, shouldn't be used for precise evaluations, and the teachers proposing it also seem to be aware of these limitations.
A technological solution to a human problem is the appeal we have fallen for too many times these last few decades.
Humans are incredibly good at solving problems, but while one person is solving 'how do we prevent students from cheating' a student is thinking 'how I bypass this limitation preventing me from cheating'. And when these problems are digital and scalable, it only takes one student to solve that problem for every other student to have access to the solution.
I feel like the arms race between student cheaters and teacher testing has been going on for hundreds of years, ever since the first answer key written on the back of a hand
University exams being marked by hand, by someone experienced enough to work outside a rigid marking scheme, has been the standard for hundreds of years and has proven scalable enough. If there are so many students that academics can’t keep up, there are likely too many students to maintain a high standard of education anyway.
> there are likely too many students to maintain a high standard of education anyway.
Right on point. I find particularly striking how little is said about whether the best students achieve the best grades. Authors are even candid that different LLMs asses differently, but seem to conclude that LLMs converging after a few rounds of cross reviews indicate they are plausible so who cares. The apparences are safe.
A limitation of written exams is in distance education, which simply was hardly a thing for the hundreds of years exams were used. Just like WFH is a new practice employers have to learn to deal with, study from home (SFH) is a phenomenon that is going to affect education.
The objections to SFH exist and are strikingly similar to objections to WFH, but the economics are different. Some universities already see value in offering that option, and they (of course) leave it to the faculty to deal with the consequences.
Distance education is a tiny percentage of higher education though. Online classes at a local university are more common, but you can still bring the students in for proctored exams.
Even for distance education though, proctored testing centers have been around longer than the internet.
> Distance education is a tiny percentage of higher education though.
It is about a third of the students I teach, which amounts to several hundreds per term. It may be niche, but it is not insignificant, and definitely a problem for some of us.
> Even for distance education though, proctored testing centers have been around longer than the internet.
I don't know how much experience you have with those. Mine is extensive enough that I have a personal opinion that they are not scalable (which is the focus of the comment I was replying to). If you have hundreds of students disseminated around the world, organising a proctored exam is a logistical challenge.
It is not a problem at many universities yet, because they haven't jumped on the bandwagon. However domestic markets are becoming saturated, visas are harder to get for international students, and there is a demand for online education. I would be surprised that it doesn't develop more in the near future.
I agree that proctoring across hundreds of locations globally could be a challenge.
I think the end result though is that schools either limit their students to a smaller number of locations where they can have proctored exams, or they don’t and they effectively lose their credentialing value.
It is literally perfect linear scaling. For every student you must expend constant minutes of TA time grading the exam. Why is it unconscionable that the university should have an expense scale at the same rate it receives tuition revenue? $90,000 of tuition pays for a lot of grading hours. I feel that scalability is a cultural meme that has lost the plot.
There are phrases that hn loves and "scalable" is one of them. Here, it is particularly inappropriate.
Some people dream that technology (preferably duly packaged by for-profit SV concerns) can and will eventually solve each and every problem in the world; unfortunately what education boils down to is good, old-fashioned teaching. By teachers. Nothing whatsoever replaces a good, talented, and attentive teacher, all the technologies in the world, from planetariums to manim, can only augment a good teacher.
Grading students with LLMs is already tone-deaf, but presenting this trainwreck of a result and framing it as any sort of success... Let's just say it reeks of 2025.
If a student is willing and desire to learn, an LLM is better than a bad teacher.
If a student doesn't want to learn, and is instead being forced to (either as a minor, or via certification required to obtain work & money), then they have every incentive to cheat. An LLM is insufficient in this case - a teacher is both the enforcer and the tutor in this case.
There's also nothing wrong with a teacher using an LLM to help with the grading imho.
Not the comprehensive physics exams I assigned as a prof. A well set exam takes at least 20-30 min to grade. That's 8-12 hours of work, and in practice, took several sittings over several days.
If you are going to set an exam that can be graded in 5-10 min, you are not getting a lot of signal out of it.
I wanted to do oral exams, but they are much more exhausting for the prof. Nominally, each student is with you for 30 min, but (1) you need to think of slightly different question for each student (2) you need to squeeze all the exams in only a couple of days to avoid giving later students too much extra time to prepare.
I have never, on my own free will, assigned multiple-choice questions in a serious course. And never will.
- They have a base marks of 20-25% (by random guessing) instead of 0.
- You never see the working. So you can't check if students are thinking correctly. Slightly wrong thinking can get you right answers.
- They don't even remotely reflect real life at all. Written worked through problems on the other hand - I still do those in my professional life as a scientist all the time. It's just that I am setting the questions for myself.
- The format doesn't allow for extended thought questions.
In my undergrad, I had some excellent profs who would set long work through exam question in such a way that you learned something even in the exams. Simply a joy taking those exams that gave a comprehensive walk through of the course. As a prof, I have always tried to replicate that.
On the surface, true. Multiple choice tests are a counter example.
Thinking deeper, though, multiple choice tests require SIGNIFICANTLY more preparation. I would go so far as to say almost all individual professors are completely unqualified to write valid multiple choice tests.
The time investment in multiple choice comes at the start - 12 hours writing it instead of 12 hours grading it - but it’s still a lot of time and frankly there is only very general feedback on student misunderstandings.
Is this a new thing or do you think that most professors were always unable to do their job? Why do you think you are an exception?
I don't believe that your argument is more than an ad-hoc value judgment lacking justification. And it's obvious that if you think so little of your colleagues, that they would also struggle to implement AI tests.
I agree with you and the other posters actually, but I think the efficiency compared with typed work is the reason it’s having such a slow adoption. Another thing to remember is that there is always a mild Jevons paradox at play; while it's true that it was possible in previous centuries, teacher expectations have also increased which strains the amount of time they would have grading handwritten work.
> Why is this a problem now, but was not a problem for the past few centuries? This class had 36 students, you could grade that in a single evening.
At least in Germany, if there are only 36 students in a class, usually oral exams are used because in this case oral exams are typically more efficient. For written exams, more like 200-600 students in a class is the common situation.
I assure you, oral exams are completely scalable. But it does require most of a university's budget to go towards labs and faculty, and not administration and sports arenas and social services and vanity projects and three-star dorms.
But "in any functioning society" is not our society. Human civilization is marginally functional, wildly spotty in the distribution of comfort, with the majority of humanity receiving significantly less than others.
One way of scaling out interactive/oral assessment (and personalized instruction in general) is to hire a group of course assistants/tutors from the previous cohort.
I think it works differently at different schools and in different countries, but hourly (often undergraduate work-study) course assistants in the US can be very affordable since they typically still pay tuition and are paid at a lower rate than fully funded (usually graduate student) TAs.
As a student I really would not want to be taught by someone who was simply a couple of years ahead of me. I want my tutor to be a lot more experienced in both the subject and in tutoring.
To clarify the point here for people who didn't read OP: the oral exams here are customized and tailored to the student's individual unique project, that's the point and why they are not written:
> In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing," good...Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all...Oral exams are a natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. The problem? Oral exams are a logistical nightmare. You cannot run them for a large class without turning the final exam period into a month-long hostage situation.
Written exams do not do the same thing. You can't say 'just do a written exam'. So sure, the students may prefer them, but so what? That's apples and oranges.
[...]
> Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.
Yeah, not sure the conclusion of the article really matches the data.
Students were invited to talk to an AI. They did so, and having done so they expressed a clear preference for written exams - which can be taken under exam conditions to prevent cheating, something universities have hundreds of years of experience doing.
I know some universities started using the square wheel of online assessment during covid and I can see how this octagonal wheel seems good if you've only ever seen a square wheel. But they'd be even better off with a circular wheel, which really doesn't need re-inventing.