Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates this a lot today is complex inputs. For simple 1-2 page PDFs what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in the ways I described in another comment.
Are really large inputs solved at midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15 pages, and I don't see any marketing around long-context or complex inputs on the site.
I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works OK.
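For concreteness, here's a minimal sketch of one way to divvy up context: split a long document into overlapping page windows so each chunk fits a model's budget. The window/overlap sizes and page-range representation are my assumptions for illustration, not anyone's actual implementation.

```python
def page_windows(num_pages: int, window: int = 15, overlap: int = 2):
    """Yield (start, end) page ranges; start inclusive, end exclusive.

    Consecutive windows share `overlap` pages, so a table row that
    straddles a boundary lands intact in at least one window.
    """
    if num_pages <= window:
        yield (0, num_pages)
        return
    step = window - overlap
    start = 0
    while start < num_pages:
        end = min(start + window, num_pages)
        yield (start, end)
        if end == num_pages:
            break
        start += step

print(list(page_windows(40, window=15, overlap=2)))
# → [(0, 15), (13, 28), (26, 40)]
```

Each window then gets extracted independently, and the overlap gives you a cross-check when stitching results back together.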
I do like the UI you appear to have for citing information: drawing polygons around the data and showing where it appears in the PDF. Nice.
Firstly, as a function of the independent components in our pipeline. For example, we rely on commercial models for document layout and character recognition. We evaluate each of these, select the highest-accuracy option, then fine-tune where required.
Secondly, we evaluate accuracy per customer. This is because however good the individual components are, if the model "misinterprets" a single column, every row of data will be wrong in some way. This is more difficult to put a top-level number on, and something we're still working on scaling on a per-customer basis, but it's much easier when the customer has historic extractions they've done by hand.
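A hypothetical sketch of that per-customer check: score an extraction cell by cell against a customer's historic hand-done extraction. The field names and list-of-dicts shape are invented for illustration.

```python
def cell_accuracy(extracted, ground_truth):
    """Fraction of (row, column) cells matching the hand extraction."""
    total = matches = 0
    for ext_row, gt_row in zip(extracted, ground_truth):
        for col, gt_val in gt_row.items():
            total += 1
            if ext_row.get(col) == gt_val:
                matches += 1
    return matches / total if total else 0.0

truth = [{"invoice": "A1", "amount": 100}, {"invoice": "A2", "amount": 250}]
# One misinterpreted column ("amount" here) makes every row wrong:
pred = [{"invoice": "A1", "amount": 10}, {"invoice": "A2", "amount": 25}]
print(cell_accuracy(pred, truth))  # → 0.5
```

Note how a single bad column halves the cell-level score even though half the cells are fine, which is exactly why a top-level number hides this failure mode.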
Great Q - there is definitely a lot of competition in dev-tool offerings, but less so in end-to-end experiences for non-technical users.
Some of the things we offer above and beyond dev tools:
1. Schema building to define “what data to extract”
2. A hosted web app to review, audit and export extracted data
3. Integrations into downstream applications like spreadsheets
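To make item 1 concrete, a user-built schema might boil down to something like the sketch below. The `LineItem` fields are made up for illustration; the point is that a typed schema can be rendered into extraction instructions automatically.

```python
from dataclasses import dataclass, fields

@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float

# Render the schema as field-name → type-name pairs, e.g. to build a
# prompt or validate the model's structured output against it.
schema_fields = {f.name: f.type.__name__ for f in fields(LineItem)}
print(schema_fields)
# → {'description': 'str', 'quantity': 'int', 'unit_price': 'float'}
```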
Outside of those user-facing pieces, the biggest engineering effort for us has been in dealing with very complex inputs, like 100+ page PDFs. Just dumping them into ChatGPT and asking nicely for the structured data falls over in both obvious (# input/output tokens exceeded) and subtle ways (e.g. missing a row in the middle of the extraction).
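One hedged sketch of a guard against that subtle failure: extract per chunk with overlapping pages, then merge the per-chunk rows, using a stable key to deduplicate rows that appear in the overlap. The `invoice` key is an assumption for illustration.

```python
def merge_chunk_rows(chunks, key="invoice"):
    """Merge per-chunk row lists, deduplicating rows seen in overlaps."""
    seen, merged = set(), []
    for rows in chunks:
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged

chunk_a = [{"invoice": "A1"}, {"invoice": "A2"}]
chunk_b = [{"invoice": "A2"}, {"invoice": "A3"}]  # A2 repeats in the overlap
print(merge_chunk_rows([chunk_a, chunk_b]))
# → [{'invoice': 'A1'}, {'invoice': 'A2'}, {'invoice': 'A3'}]
```

If a row expected in the overlap is missing from one chunk's output, that's a signal the model dropped something mid-extraction and the chunk should be retried.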
I bought it and used it this way professionally for 2 weeks before returning it. Honestly, I loved it, but the screen is still slightly muddy compared to looking at a MacBook, and the bigger issue is that it's still way too heavy. My face started hurting after the first hour.
(author here) This really depends on what you're trying to functionally achieve, the organisation of your knowledge base and the fidelity you're looking for.
Let's take the example of "where should the assistant look for information on topic X": the absolute minimum would be to identify the possible topics and the hierarchy of places you could look.
From the product-engineering POV, on the build-your-own path, a well-defined, limited search space should be easily doable by a single engineer in a few days to a week. As this scales out to an entire company's knowledge base, it quickly becomes a quarters-long project for a small ML team to build the ongoing training jobs, data pipelines, and monitoring tools required to make it robust.
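The "few days" version of that topic-to-source mapping can be as simple as the sketch below: a hand-maintained topic hierarchy plus keyword matching, with a fall-back to searching everything. The topics, keywords, and source names are all invented for illustration.

```python
# Assumed toy hierarchy: topic → places to look.
TOPIC_SOURCES = {
    "payroll": ["hr-wiki", "finance-drive"],
    "deploys": ["eng-runbooks", "incident-log"],
}
TOPIC_KEYWORDS = {
    "payroll": {"salary", "payslip", "payroll"},
    "deploys": {"deploy", "rollback", "release"},
}

def route(question: str):
    """Return the sources to search for a question, by keyword match."""
    words = set(question.lower().split())
    hits = [t for t, kw in TOPIC_KEYWORDS.items() if words & kw]
    if not hits:  # no topic matched: search everything
        return sorted({s for srcs in TOPIC_SOURCES.values() for s in srcs})
    return [s for t in hits for s in TOPIC_SOURCES[t]]

print(route("how do I rollback a release"))
# → ['eng-runbooks', 'incident-log']
```

It's the scaling of this (learned routing, retraining pipelines, monitoring drift) that turns it into the quarters-long project.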
From the POV of users, we designed our system to give users the option to provide as much or as little feedback as they like. We can go quite far with upvotes/downvotes on whole answers, but we also accept per-reference votes and full natural-language feedback. We're still working on even deeper feedback mechanisms for power users and admins, but we've typically seen the vast majority of users engaging in per-answer voting, and then exponentially smaller groups using the more detailed mechanisms.
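Those feedback tiers could be captured in a single event record, sketched below. The field names are illustrative, not the actual data model.

```python
from dataclasses import dataclass
from typing import Optional
from collections import Counter

@dataclass
class Feedback:
    answer_id: str
    vote: int                          # +1 / -1 on the whole answer
    reference_id: Optional[str] = None  # set only for per-reference votes
    comment: Optional[str] = None       # free-text, mostly power users

events = [
    Feedback("a1", +1),
    Feedback("a1", -1, reference_id="doc-7"),
    Feedback("a1", +1, comment="great, but cite the HR wiki"),
]
# Bucket each event into its most detailed tier for reporting.
tiers = Counter(
    "comment" if e.comment else "reference" if e.reference_id else "answer"
    for e in events
)
print(dict(tiers))
```

In practice the "answer" bucket would dwarf the others, matching the exponential drop-off described above.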