Generating Expressive Portrait Videos with Audio2Video

artninja1988 · on Feb 28, 2024

Looks like it's from some of the same authors as this paper: https://github.com/HumanAIGC/AnimateAnyone . Which sadly never released the code or model.

ashildr · on Feb 28, 2024

This is becoming frightening now. Maybe we need to start to cryptographically sign original sources and require others to do so to prove if something is a real recording? I can spot some details that don’t seem to be moving ‘right’ according to physics, but this is extremely convincing.

ShamelessC · on Feb 28, 2024

Gets mentioned a lot. Typical response is “take a picture/video/recording of the video with your own crypto signature”. Thwarted.

ashildr · on Feb 28, 2024

Yes, but we don’t have an accessible vetting infrastructure for this. I don’t know anything about “ShamelessC” so I have no idea if a video of Hillary Clinton drinking the blood of children should be trusted or not.

ShamelessC · on Feb 29, 2024

I believe you're just describing the current system.

spoonjim · on Feb 28, 2024

If LIDAR data is included, faking it is significantly harder.

HeatrayEnjoyer · on Feb 28, 2024

Just train a model to generate LIDAR data

ashildr · on Feb 28, 2024

By using some tricks we already have computer generated 3D models. I don’t see why we would not get 3D models of complete videos, toon, especially if state actors with a lot of money are interested in that.

johnfernow · on Feb 28, 2024

On smartphones at least, you could require users to use a specific app to establish the authenticity of the footage. You could have a code hash signature for the app to compare against the current hash and make sure the compiled code was not altered. The app can be open source so people can trust it and compile it themselves, but if the hash doesn't match the videos would be considered untrusted. You would also have to take measures to ensure that there aren't runtime modifications — a difficult thing to accomplish for sure, but something some companies are getting increasingly good at.

In addition to LIDAR data, throw in gyroscope data (would make recording a screen more obvious) and GPS data (would need the screen where you say you are — would also need to make sure device is not rooted or jailbroken to prevent spoofing of GPS) and it becomes even more challenging to fake a video. I think securing the app against modification or runtime injections is probably the biggest point of focus, but even if you were able to defeat all those measures, you'd still need to have models generate convincing LIDAR, gyroscope and GPS data. Not impossible of course, but at that point you need a rooted phone that is able to successfully hide that it is rooted, several well trained models, and the ability to defeat the video app's own security measures against both binary and runtime modifications.

On a technical level, it may not be possible to develop an app that records videos with LIDAR, gyroscope and GPS data that could not be fooled by recording a screen. In practice, I think it is possible to develop an app that could establish the authenticity of videos that nearly everyone except maybe state actors would be incapable of defeating (maybe the world's richest corporations might also have the funds to do so, though I think the odds of a whistleblower or leaks is a bit higher there — maybe Microsoft could gather a team of highly talented AI developers to generate fake videos passed as real ones without notice, but I think the likelihood of not a single one of the employees revealing that info to journalists, the government or public is low.)

I share many concerns about device attestation, as it has potential to limit a user's freedom to do what they want with the hardware that they bought and software and services that they pay for. That said, if I really was dying to use a rooted phone (I'm not currently, due to it being more hassle than it's worth), I wouldn't mind buying a second, heavily locked down device for proving videos I record are real. The device could even be powered by entirely open source software, but have a hash for the compiled ISO that is used to install the OS.

ShamelessC · on Feb 28, 2024

It will never work so long as edge cases appear. In reality, the field of journalism will expand their vetting processes as best as they can. Otherwise, we’re screwed and will have to live with the consequences (which may be overblown).

ashildr · on Feb 28, 2024

The owners of “journalism” already decided that vetting is irrelevant and costs their money, so they have been mostly routing around that part of the process for quite a while. Also: Quite a large number of people already left common reality and followed propaganda into cloud-cuckoo-land. But right now it would be still possible to in theory disprove a lot of the nonsense. This will completely change and the consequences can’t be overlown. I think that we need an automated vetting infrastructure, a kind of web of trust for recorded media, that helps to track down a recording to the original source and decide whether we trust the path that is leading there. But I don’t have any hope. We already have fake phone calls from “the president” and doctored photos, imagine what the next elections will look like. Hunter Biden will personally admit having a threesome with the Hillary and Soros, sponsored by RTV.

jerpint · on Feb 28, 2024

Perhaps signing any kind of upload or content with a GPG key that proves your identity?

pedalpete · on Feb 28, 2024

People will claim this isn't possible, or will never work, but they often miss the fact that we don't need to sign every video. Only source videos of certain people need to be signed.

For example, Tom Hanks was used to shill some dental product (I think). He's a public figure. He, and we, would want to be able to validate that anything using his likeness is actually him.

For me, nobody cares if a video of me is me or not, so it doesn't need to have a signature.

grugagag · on Feb 29, 2024

Nobody cares if a video of you is you or not for now but this is subject to change in the near future. Perhaps advertising to your friends and acquaintances with a video of you selling or endorsing some scam product would make you care more?

smusamashah · on Feb 28, 2024

This is amazing, even the breathing can be seen. Next big step (or may be its done already) is to generate expressive audio and we are all set for a generated model you can have a video call with.

pmx · on Feb 29, 2024

Suno's Bark has expressive audio

https://suno-ai.notion.site/Bark-v0-Examples-e572bcfcdf65429...

CSSer · on Feb 28, 2024

This is impressive overall, but I wonder if someone can provide some insight into, or confirmation of, an observation of mine.

I’m having a hard time describing it, but there’s a stiff, shiny, and plastic feel to a lot of these. You can really see it in the video of the woman wearing sunglasses. No one can, or at least likely wouldn’t, hold their eyebrow muscles that stiff for that long. Combined with the head and mouth movements, the expressiveness just feels off, almost animatronic. Does anyone else get that?

unraveller · on Feb 28, 2024

If you can describe it that just means the next version can avoid it. But what you describe sounds like modern soap operas which are plastic surgery expositions.

The last mona lisa doing shakespeare and the last ai girl doing multiple voices are extremely convincing, I think you over estimate just how intensely we look at things normally after we accept them. This is great for many things right now and will be greater once they can perfect all the subtle movements that make video worth watching over audio (easier information absorption and recall).

grugagag · on Feb 29, 2024

I agree, it looks uncanny. The immitation will get better and better. Sometimes in wondering why we’re fascinated with generating artificial humans when there are literally billions of humans that could be filmed with cameras.

nicwolff · on Feb 29, 2024

Humans want to be paid.

lacrimacida · on March 1, 2024

What are you gonna do with an economy where people don’t get paid then? You cant have your cake and eat it too

turnsout · on Feb 29, 2024

This is incredibly good—definitely good enough for videoconferencing. To me this is a more promising path than the classical 3D approach Apple is taking with Personas on the Vision Pro. Just need to make it realtime!

ashildr · on Feb 28, 2024

Maybe someone more on the autistic spectrum here can tell if they feel that the facial expressions are “right”? The examples are too convincing to my brain to properly focus on analyzing that part.

smusamashah · on Feb 28, 2024

Not autistic or anything but I got motion sick from Sora generated videos. So much that I don't want to look at most pf them again, at least on Sora page in original quality.

I could see things getting blurry and some blending. The very last video on the page has teeth issue. Expressions were as good as an actor for me and lip movement looked just perfect.

noduerme · on Feb 28, 2024

Coming soon to a political scandal near you...