"In the examine|AI system, the base AI (e.g. ChatGPT) is continuously supervised and corrected by a supervisor AI. The supervisor can both passively monitor and evaluate the output of the base AI and actively query it. This way, users and developers interact with the team of base and supervisor systems. Performance, robustness and truthfulness are enhanced by the automated evaluation, critique and improvement afforded by the supervisor.
Our approach is inspired by the Socratic method, which aims to identify underlying assumptions, contradictions and errors through dialog and radical questioning."
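The base/supervisor pattern described above can be sketched in a few lines. This is a hypothetical illustration, not examine|AI's actual implementation: the two model functions are stubs standing in for real chat-completion calls, and the stopping phrase "no issues" is an assumed convention.

```python
def base_ai(prompt: str) -> str:
    # Stub standing in for the base model (e.g. ChatGPT).
    return f"answer to: {prompt}"

def supervisor_ai(instruction: str) -> str:
    # Stub standing in for the supervisor model.
    return f"critique of: {instruction}"

def supervised_answer(user_prompt: str, max_rounds: int = 2) -> str:
    """Passively evaluate each base answer; actively query for revisions."""
    answer = base_ai(user_prompt)
    for _ in range(max_rounds):
        critique = supervisor_ai(
            f"Check this answer for errors and contradictions:\n{answer}"
        )
        # Assumed convention: the supervisor signals approval with "no issues".
        if "no issues" in critique.lower():
            break
        # Socratic-style follow-up: feed the critique back to the base AI.
        answer = base_ai(f"{user_prompt}\nRevise in light of: {critique}")
    return answer
```

The loop captures both modes from the description: passive monitoring (the critique of each answer) and active querying (asking the base AI to revise).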
> Also PLoS one is basically not peer reviewed - they accept every paper after a short review.
That is absolutely not true. PLOS ONE has proper peer review; their review guidelines simply focus on technical soundness and de-emphasize subjective noteworthiness.
(As a personal anecdote, I managed to get one of my papers rejected from PLOS ONE once...)
Which is often terrible advice that mostly serves to limit liability. Most medical doctors will not have a good, evidence-based answer to such questions.
Better reply: search the internet for authoritative sources (e.g., official guidance from governmental institutions) or medical guidelines.
Being very efficient at mostly extractive summarization while abstaining from abstractive summarization does seem a better bet, though, because fewer things can go wrong and it is easier to check the summaries against the full text.
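That checkability is the point: each sentence of an extractive summary should occur (near-)verbatim in the source, so a simple containment check flags fabricated content. A minimal sketch, with deliberately naive sentence splitting on ". ":

```python
def unsupported_sentences(summary: str, source: str) -> list[str]:
    """Return summary sentences that do not occur verbatim in the source."""
    sentences = [s.strip() for s in summary.split(". ") if s.strip()]
    return [s for s in sentences if s.rstrip(".") not in source]
```

For an abstractive summary no such cheap check exists, since the summary paraphrases rather than quotes, which is exactly why it is harder to audit.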
"BioNTech CEO expects vaccine can be fridge-stored for two weeks"
"Speaking at an online media briefing on the purchase of an additional German production site, Chief Executive Ugur Sahin said tests have recently confirmed the genetic compound remains stable at 2 to 8 degrees Celsius for five days but he expects storability at those conditions to be two weeks or longer."
The original Nabla article is missing information on how they primed GPT-3 for each use-case, and how much effort they put into finding good ways of priming.
All fancy GPT-3 demos seem to rely on good priming.
The time-scheduling problems are probably a hard limit of GPT-3's capabilities.
The "kill yourself" advice, on the other hand, might have been avoided by better priming.
How is it possible that the original submission has been on the front page for 8+ hours, and all discussion is focused on this completely unrelated link?
Have people stopped reading the submitted links in favor of the comments, to the point that the discussion no longer relates to the original submission at all?
A shorter reply would be: It would be great to compare PET not only to GPT-3, but also to other models, especially ones geared towards few-shot learning.
Do you know of any other models that should be used for such a comparison, or are there already any relevant results on SuperGLUE that should be mentioned?
This appears to be SOTA on SuperGLUE with few-shot learning.
PET (well, a version called iPET from the same author) is at #9 on the SuperGLUE leaderboard [1], and none of the models above it mention being evaluated by few-shot learning.
The results reported there are what most people would call ‘semi-supervised learning’, not ‘few-shot’. The true few-shot results are in a few places in the paper, https://arxiv.org/abs/2009.07118, labeled with ‘- dist’.
There are many BERT-based models that would have made for a good numeric comparison, had they tested on few-shot learning, but I'm not aware of any that have.