"In the examine|AI system, the base AI (e.g. ChatGPT) is continuously supervised and corrected by a supervisor AI. The supervisor can both passively monitor and evaluate the output of the base AI and actively query it. This way, users and developers interact with the team of base and supervisor systems. Performance, robustness and truthfulness are enhanced by the automated evaluation, critique and improvement afforded by the supervisor.
Our approach is inspired by the Socratic method, which aims to identify underlying assumptions, contradictions and errors through dialog and radical questioning."
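The base/supervisor pattern described above can be sketched in a few lines. This is a hypothetical illustration, not examine|AI's actual implementation: the two model functions are stubs standing in for real chat-completion calls, and the stopping phrase "no issues" is an assumed convention.

```python
def base_ai(prompt: str) -> str:
    # Stub standing in for the base model (e.g. ChatGPT).
    return f"answer to: {prompt}"

def supervisor_ai(instruction: str) -> str:
    # Stub standing in for the supervisor model.
    return f"critique of: {instruction}"

def supervised_answer(user_prompt: str, max_rounds: int = 2) -> str:
    """Passively evaluate each base answer; actively query for revisions."""
    answer = base_ai(user_prompt)
    for _ in range(max_rounds):
        critique = supervisor_ai(
            f"Check this answer for errors and contradictions:\n{answer}"
        )
        # Assumed convention: the supervisor signals approval with "no issues".
        if "no issues" in critique.lower():
            break
        # Socratic-style follow-up: feed the critique back to the base AI.
        answer = base_ai(f"{user_prompt}\nRevise in light of: {critique}")
    return answer
```

The loop captures both modes from the description: passive monitoring (the critique of each answer) and active querying (asking the base AI to revise).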
> Also PLoS one is basically not peer reviewed - they accept every paper after a short review.
That is absolutely not true. PLOS ONE has proper peer review; their review guidelines simply focus on technical soundness and de-emphasize subjective noteworthiness.
(As a personal anecdote, I managed to get one of my papers rejected from PLOS ONE once...)
Which is often terrible advice that mostly serves to limit liability. Most medical doctors will not have a good, evidence-based answer to such questions.
Better reply: search the internet for authoritative sources (e.g., official guidance from governmental institutions) or medical guidelines.
Being very efficient at mostly extractive summarization while abstaining from abstractive summarization does seem a better bet, though, because fewer things can go wrong and it is easier to check the summaries against the full text.
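That checkability is the point: each sentence of an extractive summary should occur (near-)verbatim in the source, so a simple containment check flags fabricated content. A minimal sketch, with deliberately naive sentence splitting on ". ":

```python
def unsupported_sentences(summary: str, source: str) -> list[str]:
    """Return summary sentences that do not occur verbatim in the source."""
    sentences = [s.strip() for s in summary.split(". ") if s.strip()]
    return [s for s in sentences if s.rstrip(".") not in source]
```

For an abstractive summary no such cheap check exists, since the summary paraphrases rather than quotes, which is exactly why it is harder to audit.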
"BioNTech CEO expects vaccine can be fridge-stored for two weeks"
"Speaking at an online media briefing on the purchase of an additional German production site, Chief Executive Ugur Sahin said tests have recently confirmed the genetic compound remains stable at 2 to 8 degrees Celsius for five days but he expects storability at those conditions to be two weeks or longer."
The original Nabla article is missing information on how they primed GPT-3 for each use-case, and how much effort they put into finding good ways of priming.
All fancy GPT-3 demos seem to rely on good priming.
The time-scheduling problems are probably a hard limit of GPT-3's capabilities.
The "kill yourself" advice, on the other hand, might have been avoided by better priming.
How is it possible that the original submission has been on the front page for 8+ hours, and all discussion is focused on this completely unrelated link?
Have people stopped reading the submitted links in favor of the comments, to the point that the discussion no longer relates to the original submission at all?
A shorter reply would be: It would be great to compare PET not only to GPT-3, but also to other models, especially ones geared towards few-shot learning.
Do you know of any other models that should be used for such a comparison, or are there already any relevant results on SuperGLUE that should be mentioned?
This appears to be SOTA on SuperGLUE with few-shot learning.
PET (well, a version called iPET from the same author) is at #9 on the SuperGLUE leaderboard [1], and none of the models above it mention being evaluated by few-shot learning.
The results reported there are what most people would call ‘semi-supervised learning’, not ‘few-shot’. The true few-shot results are in a few places in the paper, https://arxiv.org/abs/2009.07118, labeled with ‘- dist’.
There are many BERT-based models that would have made for a good numeric comparison, had they tested on few-shot learning, but I'm not aware of any that have.