
That's what's so surprising to me - the data clearly shows the experiment had terrible results. And yet the write-up is nothing but the author declaring "glowing success!".

And they didn't even bother to test the most important thing: were the LLM evaluations even accurate? Have graders manually evaluate them and see whether the LLMs were close or wildly off.

This is clearly someone who had a conclusion to promote regardless of what the data was going to show.





> And they didn't even bother to test the most important thing: were the LLM evaluations even accurate?

This is not true; the professor and the TA graded every student submission. See this paragraph from the article:

(Just in case you are wondering, I graded all exams myself and I asked the TA to also grade the exams; we mostly agreed with the LLM grades, and I aligned mostly with the softie Gemini. However, when examining the cases when my grades disagreed with the council, I found that the council was more consistent across students and I often thought that the council graded more strictly but more fairly.)


At the risk of stating the obvious, there is a whiff of aggression in this article. The "fighting fire with fire" language, the "haha, we love old FakeFoster, going to have to see if we change that" response to complaints that the voice was intimidating ... if there wasn't a specific desire to punish the class for LLM use by subjecting them to a robotic NKVD interrogation, then the authors should have been more careful to avoid leaving that impression.

You can try out the voice yourself. It's not that bad.

https://elevenlabs.io/app/talk-to?agent_id=agent_8101k9d1pq4...


Tried it in earnest. I definitely detected some aggression, and would feel stressed if this were an exam setting. I think it was pg who said that any stress you add in an interview situation is just noise, and dilutes the signal.

Also, given that there are so many ways for LLMs to go off the rails (it just gave me the student ID I was supposed to say, for example), it feels a bit unprofessional to be using this to administer real exams.


Not that bad? I gave it a random name and random net ID and it basically screamed at me to HANG UP RIGHT NOW AND FIGURE OUT THE CORRECT NET ID. Hahaha

That does not resemble any good professor I've ever heard. It's very aggressive and stern, which is not generally how oral exams are conducted. It feels much more like being cross-examined in court.


Also tried it and it could have been a lot better. If I had any type of interview with that voice (press interview, mentor interview, job interview) I would think I was being scammed, sold something, or had entered the wrong room.

The belligerence about changing the voice is so weird. And it does sort of set a tone straight off. "We got feedback that the voice was frightening and intimidating. We're keeping it tho."

It’s not an intimidating voice. Gen Z are just crybabies.

I found "well, the LLMs converge when given each other's scores, so they agree and are correct" to be quite the jump to a conclusion.

I've got a long-standing disagreement with an AI CEO who believes LLM convergence indicates greater accuracy. Explaining basic cause and effect in these AI use cases is a real challenge. The basic understanding of what an LLM actually is just isn't there, and that lack of comprehension is a civilization-wide issue.

Accuracy versus precision is something we learn in high school chemistry.

https://i.imgur.com/EshEhls.png
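A toy sketch of the difference in Python (all numbers made up for illustration): three graders sharing the same systematic bias will agree tightly with one another while still being far from the true score.

    import random

    random.seed(0)
    true_score = 80      # hypothetical "correct" grade for one exam
    shared_bias = -15    # all three graders trained on the same skewed rubric

    # each grader = shared bias plus a little individual noise
    grades = [true_score + shared_bias + random.gauss(0, 2) for _ in range(3)]

    print(grades)                     # tightly clustered near 65 -> precise
    print(sum(grades) / len(grades))  # ~65, nowhere near 80      -> not accurate

Their scores converge, but they converge on the bias, not on the truth.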

When someone at that level pretends to not understand it, there is no way to mince words.

This is malice.


They did compare the automated grades to the author's own manual ones. It's in there if you read more closely.

As far as I can tell, there is very little empirical evidence of efficacy for most modern educational "advances".

Having said that, LLMs can be good tutors if used correctly.


I don't think they're terrible, but I'm grading on a curve because it's their first attempt and more of a trial run. It seems promising enough to fix the issues and try again.


