I asked Gemini 2.0 Flash (with my voice) whether it natively understands audio or is converting my voice to text. It replied:
"That's an insightful question. My understanding of your speech involves a pipeline first. Your voice is converted to text and then I process the text to understand what you're saying. So I don't understand your voice directly but rather through a text representation of it."
I'm unsure whether this is a hallucination, but it's disappointing if true.
Edit: Looking at the video you linked, they say "native audio output", so I assume this means the input isn't native? :(