A lot of fuzzers use Mario or other simple games as an internal testcase. I'm aware of a hypervisor fuzzer from 2016 that did it, and I'm positive there are others (both before and since). Hell, tom7 has a fuzzer for exploring program states that uses Super Mario Bros as the example from 2013 (https://www.cs.cmu.edu/~tom7/mario/mario.pdf, plus a youtube video https://youtu.be/xOCurBYI_gY), and he's definitely not the first either.
We are huge fans of tom7 and that paper was one of our inspirations for using NES as a domain for researching autonomous state space search! I think he does a very good job of explaining why the problem is hard.
Right, in case it wasn't clear to readers this isn't a bad thing. Lots of people use games because they're good analogs for other programs and evocatively show fuzzing exploration progress. Not being the first to point a fuzzer at Mario doesn't matter.
Thanks for flagging this! The work we're announcing today was completed in 2018, and we have since moved on to much more challenging problems both in the Nintendo domain and elsewhere. Totally not looking to pick a fight over priority though. This is such a hilariously understudied and under-explored area, we really value anybody else who's trying to work on these problems.
I think you underestimate the level to which this area has been studied. And I wish you would talk about these new results then instead of announcing 5+ year old results then.
It would be great to see progress in this area (not my primary area of work BTW) but I am not seeing anything here, technically, that is going to make that happen -- maybe it is just getting all the parts in place and magic happens. It just makes me scratch my head a bit.
It's possible you did not make it to the end of the talk where I explain this, but the thing that excites me is that we can now apply fuzzing and related techniques to things which are neither Nintendo games nor tiny stateless libraries and parsers, because of this: https://antithesis.com/blog/deterministic_hypervisor/
As for getting to the newer stuff, yeah, totally, just give us some time. There's a bit of a backlog. :-)
I just rewatched the end of the video to make sure I didn't miss anything. Deterministic execution and replay is very-very well-known and understood. It is possible that your packaging and market fit is right on. Lots of cottage industry in DB testing and bug finding -- but not clear how this generalizes and why something like Coyote [1] (to pick one) wouldn't work as well.
So, fuzzing has been applied to very stateful and very large industrial systems for some time. And yes it is very cool but I feel like I am seeing more "sizzle than steak" so to speak. Great engineering though, hypervisor work is very challenging.
It is absolutely possible to write a large stateful system from the ground up so that autonomous testing techniques can be applied to it. FoundationDB and Tiger Beetle are both examples of this, I think Resonate might be another one, and Toby Bell's talk at Strange Loop last year is a great guide on how to do so.
What's much harder is to take an arbitrary system, written in an arbitrary way, without these techniques in mind, and make it amenable to this sort of testing. From the start of our company, we believed that unless this was possible, the market would be too hard to crack, because most human beings are not very foresightful and not able to justify a bunch of extra work.
Hypervisor-based snapshot fuzzing like Nyx-Net and deterministic userspaces like Facebook's now-discontinued Hermit project are the other ways I know of accomplishing that goal. We believe that both of them have some pretty serious practical limitations which our approach does not share.
EDIT: Maybe the way to get to the crux of the disagreement is for me to turn the question around. Why do you believe that the vast majority of stateful and concurrent systems are not tested with fuzzing?
At the end of the day you have 2 problems (1) how to make execution deterministic within some boundary be it a process, hypervisor, or distributed system and (2) how you handle non-determinism when data crosses this boundary. You can move the boundaries and effort around but the problems always exist. So, if you are claiming that you have a sweet spot on this tradeoff then I could certainly believe that, if you claim that you eliminated this boundary issue then I am highly credulous.
I'll agree with you on indeterminate behaviors though, I suspect they will eventually be seen like the "billion dollar" mistake of null pointers.
1) Fuzzing is under-utilized even for simple code. AFL is dead easy to use and, even so, most projects don't have it in CI runs. So, despite how much I like it, in general it seems people do not see value in this type of testing.
2) The effort to handle external state (say a restful call to get stock ticker info) needs to be mocked -- which is deeply unpopular -- or handled by record/replay which works ok-ish but eventually breaks down with divergences. Outside of well-chosen domains it these eventually pop-up and add an additional pain point that builds on item 1.
Deterministic execution might be well understood by its proponents, but it's a completely niche technique that practically no one uses in practice. You have this jaded tone like this is something that everyone is doing, and everyone knows that this isn't true, so we're curious... why are you writing these things?
Have any of these methods found clips and speedrunning shortcuts? Examples: Clip the base of the flagpole to skip some animation time. Clip into and walk below the floor to run past obstacles. Etc.
Hmm... looks like Vimeo is currently having some kind of outage. If anybody at Vimeo is reading this and wants some help with testing backend systems, email is in bio.
"IJON: Exploring Deep State Spaces via Fuzzing" https://casa.rub.de/fileadmin/img/Publikationen_PDFs/2020_IJ...