
What's the big deal about AI being trained on my data? The only argument I can understand is that a megacorp makes money from content I created, even though that would only net me a few dollars, if someone were interested enough to pay. What other issues are there?


>The only argument I can understand is that a megacorp makes money from content I created

For professional content creators, that is exactly the biggest problem. That's why so much of the pushback comes from the artist community. You don't want your content-production tools like the Adobe suite to retroactively say, "yeah, by the way, we are scanning your content as it's produced and using that to make more money".

Even if you as a person don't mind, that's a huge NDA breach for a company. You don't want AI training to produce assets similar to in-production media that isn't even out yet. It'd be a huge leak at best and leave the project dead on arrival at worst. And I'm sure you've seen the year-plus of debate on copyright issues.


Let's say I tweet that I hate cats because I'm allergic to them.

I would be OK with that tweet being findable through search, but if a model is trained on that information and starts spitting out that I'm a cat hater, period (without the context), then that's a different thing.

Also, there is an expectation of privacy on platforms like FB or Insta, because the posts require at least a login to be read. Because of that, people may be posting things that they expect only their friends, and friends of friends, to read, not to become part of a world knowledge canon.

For me personally: on platforms like HN and Reddit, I have no issue with models being trained on my posts. FB/Insta I treat as an extended group chat and off limits for crawlers and ML datasets.


A) Copyright laws exist for a reason; even big tech should respect them (considering their utter disregard for anything but money, well...).
B) Meta especially already knows too much about everyone, and now they can add AI trained on all of that (but what could possibly go wrong with that, right?).
C) I don't want AI to be trained on my stuff, to the point that I don't share anything publicly anymore and only use encrypted clouds for backups (and I'm not even sure I'll keep doing that).

By the way, not everything is about money. Convincing people that it is may be an even better trick than the devil convincing people he doesn't exist (assuming one believes in such things, and then considers the devil to be evil rather than the people ending up in hell, but I digress).


You don't have a copyright claim against Meta for posts you made on FB and Instagram, though.


That depends on the terms of service and the jurisdiction you are based in, and on whether or not you want to go to court over it. IANAL, but I am fairly certain terms of service do not trump national copyright laws. At most, you might forgo some usage rights; the copyright itself stays with you (at least where I live, since you cannot sell copyright, only grant usage rights, if I understood, and remember, the finer points of our laws around that stuff correctly).

That big tech gets away with it is not necessarily because it is legal, but rather because people don't care, lawmakers are only starting to care, and even if people did care, it is nigh impossible to litigate successfully.

None of the above is written in stone, nor should it stop us from doing something about it.


IANAL either, but AFAIK you grant Meta an irrevocable, perpetual, worldwide license over your content. Meta isn't likely to have overlooked such a basic legal requirement. What they were doing wasn't illegal.


The EU-level problem isn't that LLMs are trained on your publicly scrapable Reddit or HN posts, since you're not getting paid for those contributions either way.

The way I understand it, the laws are meant to cover LLMs being trained on the work of authors and artists who make a living from the content they create, such as articles, books, music, images, and videos. Their work is normally copyrighted, but LLMs don't give a shit: they can just scrape and "learn" your content, style, and patterns, and then regurgitate it with slight alterations without the original authors getting paid.

It's no secret that some LLMs have been trained on pirated books off libgen.


Sure, but this is about FB and Instagram posts, posts that Meta owns anyway (due to the ToS).


In Europe at least, laws override ToS and other agreements. You can sign all kinds of things with your clients/users, but laws can render parts of those agreements void.

In Poland we have a whole catalogue of voided agreement clauses that gets updated regularly. It's a safety mechanism against companies exploiting the legal system to push people into signing things that are considered unfavourable to them - https://uokik.gov.pl/niedozwolone-klauzule


I agree, but those terms weren't voided before this law. Meta wasn't doing anything illegal, since you had signed the rights to your content over to them. That's why I have a hard time accepting any arguments of legality/copyright against this.


The government can void it after the agreement was signed. That's because companies keep working around existing laws and pushing clients into signing unfavorable deals.

We've had banks (and trading platforms), for example, with bad terms, and the government stepped in and made those terms void.

The reasoning is that a consumer/user should be able to expect fair treatment when agreeing to ToS, without having to spend money on legal advice before clicking "I agree".


The conversation went like this: I asked why people are against AI training on their posts, some people said "because of copyright", I pointed out that Meta has a license from them to use the posts for anything, and now your post is "but the government can force them not to". Yes, they can, but that wasn't the original question.


Neither Meta nor anyone else has a license to do whatever they wish under EU law. We don't have wildcard copyright assignments; you need to specify the fields of use for which a person grants the rights. That's exactly so that when a new technology is invented, the original author can decide not to grant the rights for that field, or to grant them to someone else.

That is, each legally binding copyright assignment needs an enumeration of fields, like radio, TV, streaming, and so on, and when a new field appears, the rights belong to the original author by default. Statements like "in all fields, current and future" are not legally binding.

There is no such thing as "people signed the copyright over to Meta, so Meta can use it in whatever fashion".

You can read an excerpt of Meta's ToS here; these ToS do not allow them to train models on the content people publish. https://www.quora.com/When-I-post-a-photo-on-Facebook-or-Ins...


Privacy is another: if they are training on chat messages, then in a sense the model "knows" the information in your chats. Data exfiltration would be really interesting in this space, a weird blend of tech and social engineering.

I don't really care about the money, but as a society we should care a lot about such a blatant heist of personal effort. For many people, their words online are the work of their days; it's not at all fair for a company to take that and use it, and it's even more insidious that it will be used to replace the people who created those works.


But this is about FB posts, not personal writings, for example. Meta owns your FB posts, don't they?


I don't think it is fair to assume your average person understands that their personal rants are ending up in an LLM / publicly accessible AI, even if the T&Cs say they can.

That's what the fuss is about, I guess: we are working out the ethics our societies will accept on a pioneering front of technology. I don't think Facebook gets away with it on the letter of the law this time; they will need to adjust to new regulations as we decide what's ethical.


The article very clearly mentions training on your public posts.




