I mean they’re building the labeled dataset right now by having creators label it for them.
I would suspect this helps make moderation models better at estimating confidence levels for AI-generated content that isn’t labeled as such (i.e., for deception).
Surprised we aren’t seeing more of this kind of dataset labeling for this new world (outside of CAPTCHAs).
I’ve never understood why people want a more verbose version of SQL.
I think what people really want is business rules, data cleaning, and schema discovery.
If you had to use English against multiple source systems and tons of joins, the sentence would run to paragraphs.
Where I think there’s value is in using something like a data catalog to label business rules against a data warehouse, tied to dashboard queries and other common ones.
But that’s a hard problem, and the model is unique to every customer. And it’s always changing.
Combining schema discovery and a data catalog seems like it might be a hard problem requiring a lot of LLM prompt-engineering gymnastics, but maybe I underestimate the state of the art.
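For what it’s worth, the mechanical part of that combination is small; the hard part is the catalog content itself. A minimal sketch: discover the schema (real SQLite introspection here), merge in catalog annotations (the `catalog` dict, its `table.column` keys, and the rule text are all hypothetical), and render both into a prompt for the model.

```python
import sqlite3

def discover_schema(conn):
    """Introspect table and column names from a SQLite database."""
    tables = {}
    cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table,) in cur.fetchall():
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        tables[table] = cols
    return tables

def build_prompt(schema, catalog, question):
    """Render the schema plus any catalog business rules into a prompt."""
    lines = ["You are a SQL assistant. Schema:"]
    for table, cols in schema.items():
        lines.append(f"- {table}({', '.join(cols)})")
        for col in cols:
            rule = catalog.get(f"{table}.{col}")
            if rule:
                lines.append(f"    rule for {col}: {rule}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")

# Hypothetical data-catalog annotations, keyed by table.column.
catalog = {"orders.status": "'shipped' excludes returns initiated within 30 days"}

prompt = build_prompt(discover_schema(conn), catalog,
                      "What was shipped revenue last month?")
print(prompt)
```

The point of the sketch is that the prompt is only as good as those rule strings, and those are the per-customer, always-changing part.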
This is a really great idea and use case. It also makes a ton of sense as a pilot for this type of open source project, given that extensions are smaller in scope.
I mean, even having it produce a best-effort draft documenting what the extension code is doing would be awesome.
Unless it’s made into an extension itself, and then you have recursive hell.