You can run whisper.cpp (a very solid speech-to-text engine) on a Raspberry Pi (https://github.com...

kkielhofner · on March 7, 2023

I’m a big fan of Whisper and whisper.cpp, but doing reliable wake word detection with any kind of reasonable latency on a Raspberry Pi is likely to be a poor fit and a very bad experience.

The Whisper model operates on 30 sec speech chunks. Input audio has to be padded to that length. So you’re constantly going to be recording audio, padding, looking for wake word, and then activating full recording upon detection. All on padded 30 sec chunks looking back…
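
To make the padding overhead concrete, here is a minimal sketch in plain Python. It assumes 16 kHz mono audio (the rate Whisper expects) and uses hypothetical helper names (`pad_to_window`, `padding_fraction`) that are not part of Whisper or whisper.cpp; it only illustrates the fixed 30 sec window described above, not how either project actually implements it.

```python
SAMPLE_RATE = 16_000          # assumption: 16 kHz mono input, as Whisper expects
WINDOW_SECONDS = 30           # fixed window length the model operates on
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples per window


def pad_to_window(samples):
    """Keep the most recent 30 s of audio and zero-pad it to a full window.

    Hypothetical helper for illustration only.
    """
    recent = list(samples[-WINDOW_SAMPLES:])
    return recent + [0.0] * (WINDOW_SAMPLES - len(recent))


def padding_fraction(clip_seconds):
    """Fraction of the 30 s window that is padding for a clip of this length."""
    clip_samples = min(int(clip_seconds * SAMPLE_RATE), WINDOW_SAMPLES)
    return 1 - clip_samples / WINDOW_SAMPLES


# Stand-in for a short wake word recording: 2 seconds of constant signal.
two_second_clip = [0.1] * (2 * SAMPLE_RATE)
window = pad_to_window(two_second_clip)

print(len(window))           # 480000
print(padding_fraction(2))   # about 0.93, i.e. most of the window is silence
```

In this sketch a two-second utterance still yields a full 480,000-sample window, which is why a short wake word costs about as much to process as a long sentence.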

Then there is model size and availability. Whisper base or maybe even tiny could potentially give decent results for wake word detection but I’m skeptical. Wake words can be surprisingly tricky.

That’s just for the wake word, assuming you’re going to stream the audio afterward, because reliably doing ASR and NLP to figure out speaker intent is far too challenging and time-consuming to be done on Raspberry Pi class hardware with anything approaching acceptable response times. Whisper does pretty well with relatively noisy, low-quality speech, and far-field microphones are amazing, but I doubt that’s enough to provide anything approaching Echo/Alexa quality in the real world.

This is a carefully scripted demo[0] showing it takes a whopping 15 seconds to wake, ASR the speech, and return the result. The average person could easily take their phone out of their pocket, unlock it, look for the weather app, and read the weather in less time.

This demo[1] claims "real time", but looking at the example videos it clearly isn't, and the accuracy leaves a lot to be desired. This is with three threads on a Raspberry Pi 4.

I just tried asking Echo what the weather is like and it was so fast I had trouble timing it - somewhere around one second.

[0] - https://youtube.com/watch?v=Aor6CFkcWzU&si=EnSIkaIECMiOmarE

[1] - https://github.com/ggerganov/whisper.cpp/discussions/166

ggerganov · on March 7, 2023

Recently I made some progress on efficiently detecting short voice commands (wake words) on RPi4 [0]. Check out the "command" example in whisper.cpp and its "Guided mode" operation. There are additional improvements on the way too.

[0] https://twitter.com/ggerganov/status/1602759833312456704

kkielhofner · on March 7, 2023

As I said, I really appreciate and respect all of the work you’re doing on whisper.cpp, but when it comes to things like wake words and commands I have to think Whisper is just fundamentally the wrong tool for the job. Tiny is a 39M parameter model with fairly poor accuracy and high latency (without GPU) that just about maxes out a Raspberry Pi - all for a few very carefully pronounced words under ideal conditions (in this case).

That said, there isn’t much in the open source space (that I’ve found) that’s even remotely competitive with Alexa/Echo so I’m all for any efforts and attention in this area. Perfect is the enemy of good but this thread started off with people wondering if Whisper or anything based on it was close to Alexa/Echo for wake word activated assistant tasks. I think it’s very safe to say it isn’t.

Again, I really appreciate your work on making Whisper more accessible to the masses for local ASR - please don’t take this as criticism of your efforts. If anything, I’ve been involved in open source projects myself, and it’s frustrating when people try to jam a square peg in a round hole, only to come back and complain your labor of love didn’t work for them.

physicles · on March 7, 2023

True, but note that they're using the tiny model. In my experimentation, you need at least the small model to get transcription I'd call "good", which is still a bit slower than you'd like on a moderately fast laptop from 2019.

That said, Whisper is incredible and the era of very good local speech-to-text on moderate hardware is basically here, or will be in the next year.