KillHour
Oct 28, 2007


Clarste posted:

I am saying the law can declare it so regardless of what you or anyone thinks, and a lot of people with a lot of money have a vested interest in strong copyright laws. This isn't a philosophical discussion, the law is a tool that you use to get what you want.

I want machine learning to be as cheap and easy as possible. What senator do I buy to get what I want?

Saying "I want x and we should make laws so I get what I want" is a statement, but not really an argument that can be debated or discussed beyond a reply of "I agree" or "what you want is stupid."


StratGoatCom
Aug 6, 2019

Our security is guaranteed by being able to melt the eyeballs of any other forum's denizens at 15 minutes notice


Clarste posted:

Case law can go wherever it wants, but if people with money don't like where it went they can buy a senator or 50. All I have ever been saying is that the law can stop it if it wants to, and all these arguments about the internal workings of the machine or the nature of art are pretty irrelevant to that.

More to the point, the stuff needed to make AI fully functional fucks with the complex systems of law and finance (international law and finance) that make media work, and especially now, you're not getting those reforms.

SubG
Aug 19, 2004

It's a hard world for little things.

Clarste posted:

I super do not see how this actually matters. You input copyrighted material into the machine. Whether it happened before or after "training" is 100% irrelevant to the issue of whether we want that to be a thing and how we might stop it.

I really don't think that "let's regulate a new technology without understanding anything about what it is or how it works" has been a winning strategy in the past.

I also think that "just make inputting copyrighted material into machines illegal" is so comically overbroad that it's impossible to even estimate the fallout of such a regulatory approach. Like, are you suggesting that the Betamax case was wrongly decided?

StratGoatCom
Aug 6, 2019

Our security is guaranteed by being able to melt the eyeballs of any other forum's denizens at 15 minutes notice


SubG posted:

I really don't think that "let's regulate a new technology without understanding anything about what it is or how it works" has been a winning strategy in the past.

I also think that "just make inputting copyrighted material into machines illegal" is so comically overbroad that it's impossible to even estimate the fallout of such a regulatory approach. Like, are you suggesting that the Betamax case was wrongly decided?

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

SubG
Aug 19, 2004

It's a hard world for little things.

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

So the capacity for infringing use is the only criterion? Is this unique to AI? Or does it also apply to cameras? Cell phone audio and video? Pencils?

StratGoatCom
Aug 6, 2019

Our security is guaranteed by being able to melt the eyeballs of any other forum's denizens at 15 minutes notice


SubG posted:

So the capacity for infringing use is the only criterion? Is this unique to AI? Or does it also apply to cameras? Cell phone audio and video? Pencils?

AI, or very likely to have been trained on such, yes.

reignonyourparade
Nov 15, 2012

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

It doesn't though, not as a general case. It probably infringes on specific works overrepresented in the training data, but otherwise you basically need to do enough work on the user's end that you could just as easily infringe on work not in the training data at all.

GlyphGryph
Jun 23, 2013

Down came the glitches and burned us in ditches and we slept after eating our dead.

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

It can make IP very much like your IP without ever seeing your IP, is the problem. Because it turns out for 99.99% of artists, their work is composed entirely of elements that they did not create.

Does that make it okay to use? Long as it didn't look at your stuff in particular, no problem even if it gens the same stuff?

Count Roland
Oct 6, 2013

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

I don't think this premise is true, but even if it were, I still don't understand why this is an inherently bad thing. If the AI makes an image that looks like a copyrighted image, who cares? Don't the problems arise when someone tries to sell this image or present it as their own?

SubG
Aug 19, 2004

It's a hard world for little things.

StratGoatCom posted:

AI, or very likely to have been trained on such, yes.

Sorry, are you saying that yes, the capacity for infringing use is sufficient, but that this is unique to AI? That is, you're arguing that the Betamax decision should be overturned. Or perhaps that it doesn't apply for some unspecified reason?

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

AI, or very likely to have been trained on such, yes.

It sounds like you've decided "using something as training data" is not fair use, and you're working backward from there.

Serotoning
Sep 14, 2010

D&D: HASBARA SQUAD
HANG 'EM HIGH


We're fighting human animals and we act accordingly

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

Tide goes in, tide goes out. You can't explain that.

BrainDance
May 8, 2007

Disco all night long!

StratGoatCom posted:

You put in IP, it makes IP very much like it without paying the author. That's all that really needs to be known for this poo poo as far as regulation goes, anything else is being drawn into the weeds.

So then what about this?



This is by Sam Does Arts, the very anti-AI professional youtuber who believes similarly.

I actually think he should be able to make and sell artworks like this, but if the criteria is "put in IP, it makes IP" in the vaguest sense possible then it seems like this is off the table.

StratGoatCom
Aug 6, 2019

Our security is guaranteed by being able to melt the eyeballs of any other forum's denizens at 15 minutes notice


cat botherer posted:

It sounds like you've decided "using something as training data" is not fair use, and you're working backward from there.

Because it isn't.

GlyphGryph posted:

It can make IP very much like your IP without ever seeing your IP, is the problem. Because it turns out for 99.99% of artists, their work is composed entirely of elements that they did not create.

Does that make it okay to use? Long as it didn't look at your stuff in particular, no problem even if it gens the same stuff?

The thing is, if you don't make it very clear what's in it....

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

Because it isn't

That's just a broad, unsupported claim you've pulled out of nowhere, one that's actually contrary to the status quo legal situation.

Tree Reformat
Apr 2, 2022

by Fluffdaddy

BrainDance posted:

So then what about this?



This is by Sam Does Arts, the very anti-AI professional youtuber who believes similarly.

I actually think he should be able to make and sell artworks like this, but if the criteria is "put in IP, it makes IP" in the vaguest sense possible then it seems like this is off the table.

I think that specific example actually does qualify as infringement, since so much is similar in the two compositions.

If no legal action was taken, that implies either that a licensing agreement was made, or that the SG rights holders don't know about it or don't care enough to pursue legal action.

Which can speak either to the limitations of IP law in the internet age, or to the shirking of legal responsibility by various tech hosting companies, depending on your point of view.

BrainDance
May 8, 2007

Disco all night long!

Tree Reformat posted:

If no legal action was taken, that implies either a licensing agreement was made, or the SG rights holders either don't know about it or don't care enough to pursue legal action.

I'm almost certain he doesn't have one, unless he kept it very secret, and while he's decently popular on YouTube, he's not at that level. Squid Game, though: Netflix has actually been protecting that IP more than most things (though mostly the name and logo). Not enough that they're going after every fan artist, but enough to know where they stand.

Tree Reformat posted:

I think that specific example actually does qualify as infringement, since so much is similar in the two compositions.

I don't think it would, but drat is it close. And with the Warhol thing who knows anymore. Composition factors into whether infringement happened and fair use is a case by case type of thing but I actually think the complete change in style makes it transformative enough. We've seen far less transformative works be protected.

But, yeah again, the Warhol thing. Before I'd say the bar for transformative is incredibly low. Pop art appropriation where just the context is changed had, apparently, been enough before. I think this is still more transformative than that though.

I think it should be considered transformative. Like that's my belief in what's right. There's just an incredible irony that this is the work of a person who is one of the louder "AI is plagiarism because the models are derived from other artwork" people in the popular anti-AI culture, and he's even profiting from it.

Regardless, it's definitely far less transformative than what the models do: using a huge number of pictures to train an algorithm based on denoising toward averages of them, which then creates something containing no pixels or lines from those artworks, just a general sense of them when asked for it. (That's true in the vast majority of cases, anyway; the roughly 1% of cases of overfitting are considered a failure of the model and a thing to be fixed, and even those technically don't contain any actual pixels of the original image.)


If it is infringing because of the Warhol thing (something that was expected to affect AI too, though I actually don't think it does very much besides leaving more questions unanswered), then that is a really good example of how decisions that affect AI can also negatively affect other artists. I wouldn't say collateral damage here, because that decision was aimed at traditional art, not AI, but many of the things people are proposing would definitely include other artists as collateral damage. That would be very hard to avoid.

BrainDance fucked around with this message at 00:41 on May 23, 2023

Tree Reformat
Apr 2, 2022

by Fluffdaddy
To tie this back to a point someone made earlier, I decided to look at the case Steinberg v. Columbia Pictures Industries, Inc., and found the actual works in contention. Here's the "original" work, a cover piece for the New Yorker:



And here's the work the defendant claimed was infringing, a movie poster:



Now, I look at these, and while there are obvious similarities in composition and style, I would say that looks far more transformative than, say, the SamDoesArt example.

But the judge flatly disagreed, and found fully in favor of the plaintiff: the poster was blatant copyright infringement. By their decision, because the poster wasn't parodying the original cover, it didn't qualify for fair use, and the fact that the composition, subject matter, style, coloring choice, and title text font were all so strikingly similar to the cover combined to cross over from mere inspiration to outright infringement.

It really highlights just how arbitrary all this can be. It ultimately comes down to whatever it is you can convince a judge of in court.

SubG
Aug 19, 2004

It's a hard world for little things.

Tree Reformat posted:

To tie this back to a point someone made earlier, I decided to look at the case Steinberg v. Columbia Pictures Industries, Inc., and found the actual works in contention. Here's the "original" work, a cover piece for the New Yorker:

And if you want to be fully confused about the issue, look up the images involved in Leibovitz v. Paramount, which went the other way.

Liquid Communism
Mar 9, 2004

коммунизм хранится в яичках

SubG posted:

And if you want to be fully confused about the issue, look up the images involved in Leibovitz v. Paramount, which went the other way.

Not much confusing there; Leibovitz v. Paramount was ruled a parody, made for humor value and not in competition with the original photographer's work.

Gentleman Baller
Oct 13, 2013
Genuine question for people who think DALL-E 2 and Midjourney are violating copyright as it stands right now:

If Stable Diffusion was never invented, and OpenAI used publicly shown images to create an image-to-text bot, where people could draw unique, new drawings and have it accurately describe them (like "a courtroom sketch of Kermit the Frog"), would that be fair use in your opinion?

SCheeseman
Apr 23, 2003

Next door neighbor wanted me to polish up an old Dell laptop with an SSD and Windows 10 and put some games on it for her kids. She mostly does office work with an Etsy art and craft side hustle; not tech phobic, but no expert. While returning the laptop I was in her home office and noticed something on her second monitor. ChatGPT. They have been using the gently caress out of it apparently, and so does everyone else at work.

The art generators are one thing, but it seems ChatGPT and its ilk are here to stay.

Doctor Malaver
May 23, 2007

Ce qui s'est passé t'a rendu plus fort
An 80-year-old lady I know, who can barely operate her smartphone, didn't hire professional translators for a publishing project. It appears she will do the translation herself, because she heard you could do it "with OpenAI".

It's like bitcoin two years ago.

Count Roland
Oct 6, 2013

Doctor Malaver posted:


It's like bitcoin two years ago.

There is certainly a craze of investment, which will die down and people will lose money.

But these AI systems are genuinely useful to people in all sorts of work. Unlike bitcoin which has only niche uses aside from speculating.

KillHour
Oct 28, 2007


SCheeseman posted:

Next door neighbor wanted me to polish up an old Dell laptop with an SSD and Windows 10 and put some games on it for her kids. She mostly does office work with an Etsy art and craft side hussle, not tech phobic but no expert. While returning the laptop I was in her home office and noticed something on her second monitor. ChatGPT. They have been using the gently caress out of it apparently and so does everyone else at work.

The art generators are one thing, but it seems ChatGPT and it's ilk are here to stay.

I have customers that are beating down my door asking when we will build some kind of integration with it and customers that want a guarantee that we will never use it. It's crazy out there right now.

Count Roland
Oct 6, 2013

KillHour posted:

I have customers that are beating down my door asking when we will build some kind of integration with it and customers that want a guarantee that we will never use it. It's crazy out there right now.

What is your business?

KillHour
Oct 28, 2007


Count Roland posted:

What is your business?

I work for a NoSQL database company with a good amount of exposure in the data science world.

IMHO, the current best use cases are things where you're giving it the context along with the question instead of asking about something open ended where it can just make poo poo up.

A good example might be "Are these two records likely referring to the same thing?" or "Summarize/extract information from this text."
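To illustrate what "give it the context along with the question" means in practice, here's a rough Python sketch of building that kind of grounded prompt. The actual model call is deliberately left out (the exact client API isn't the point), and the field names are made up for the example:

```python
def build_grounded_prompt(record_a: dict, record_b: dict) -> str:
    """Pack both records into the prompt so the model answers from
    the supplied data instead of improvising an open-ended answer."""
    lines = [
        "Are these two records likely referring to the same thing?",
        "Answer YES or NO, then explain briefly.",
        "",
    ]
    for name, rec in (("Record A", record_a), ("Record B", record_b)):
        lines.append(name + ":")
        for key, value in rec.items():
            lines.append("  %s: %s" % (key, value))
        lines.append("")
    return "\n".join(lines)

# Hypothetical records in the spirit of the discussion
prompt = build_grounded_prompt(
    {"company": "Acme Inc.", "phone": "585-224-9766"},
    {"company": "ACME", "phone": "224-9766"},
)
```

Whatever model you send that to, it's judging only the text you handed it, which is what keeps it from just making poo poo up.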

KillHour fucked around with this message at 19:55 on May 23, 2023

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!
https://blogs.windows.com/windowsde...t-and-dev-home/

Microsoft is bringing ChatGPT to Windows 11.

Actually some of the stuff in that preview looks pretty good, but I'm reminded of one of my colleagues who lives in Belgium and works in Germany so he makes this commute every single day. All of Belgium and all of Germany share the same time zone.

His Outlook constantly nags him about whether he wants to update the time zone after crossing between countries.

IDK if his settings are hosed up to do this but the fact that Windows can't figure this out doesn't make me excited for putting AI directly into Windows 11.

KillHour
Oct 28, 2007


Here's an example using some fake data I whipped up real quick:

quote:

Process the following records to merge duplicate rows, fill in missing fields and normalize formatting:

COMPANY|STREET|CITY|ZIP|STATE|PHONE
Acme Inc.|48 Melroy Avenue|Rochester|NY|585-224-9766
Innovations Specialists|110 1st Street|San Diego|--|619-591-3614
Jeff R Tax Prep|55 Conctance Cir. Suite 110|San Antonio|TX|210-594-1382
ACME|48 Melroy Ave|Rochester|NY|224-9766
Innovations Spec.|First Street|San Diego||619-591-3614
Tom's Yard Care|PO BOX 110|Milwaukee|WI|219-4856

ChatGPT
Here's your processed dataset. All the duplicates have been merged, missing fields have been filled in as best as possible, and the formatting has been normalized.

COMPANY|STREET|CITY|ZIP|STATE|PHONE
Acme Inc.|48 Melroy Avenue|Rochester|NY|585-224-9766
Innovations Specialists|110 1st Street|San Diego|CA|619-591-3614
Jeff R Tax Prep|55 Conctance Cir. Suite 110|San Antonio|TX|210-594-1382
Tom's Yard Care|PO BOX 110|Milwaukee|WI|219-4856

I've assumed that the ZIP code for "Innovations Specialists" couldn't be found, and that "CA" is the state for San Diego. Please double-check these assumptions.

It would have been nice if it figured out that the area code for Milwaukee is 414 and I would have preferred if it spelled out "circle", but it's much better than a hand-coded algorithm could do. If you spent more time on the prompt and combined it with a traditional address lookup service, I think it could give very good results. Automated systems like this are already expected to make mistakes, so the bar is already set appropriately.

Edit: I just realized I neglected to put in any zip codes even though it was defined as a column, and it still handled it well. That's the strength of these systems: they can handle malformed or missing data in a much more "human-like" manner. A traditional algorithm would not be able to handle state IDs where the zip codes are supposed to be, but ChatGPT can infer the context and adjust.
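For contrast, here's roughly what a hand-coded version looks like: a naive field-by-field fuzzy match in Python (the normalization and the 0.7 cutoff are arbitrary choices for the sketch). It works when the columns line up and falls over as soon as they shift, which is the whole point:

```python
import difflib

def normalize(field: str) -> str:
    """Crude hand-coded normalization: lowercase, drop periods."""
    return field.lower().replace(".", "").strip()

def field_similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two fields, 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_duplicates(rec_a, rec_b, threshold=0.7):
    """Average per-field similarity over positionally paired fields.
    Because the pairing is positional, a shifted column (say, a state
    ID sitting where the ZIP should be) silently wrecks the score."""
    scores = [field_similarity(a, b) for a, b in zip(rec_a, rec_b)]
    return sum(scores) / len(scores) >= threshold

# Simplified rows in the spirit of the fake data above
acme1 = ["Acme Inc.", "48 Melroy Avenue", "585-224-9766"]
acme2 = ["ACME", "48 Melroy Ave", "224-9766"]
toms  = ["Tom's Yard Care", "PO BOX 110", "219-4856"]
```

It catches the easy duplicates, but every rule (what to strip, where to cut off) has to be anticipated in advance, which is exactly what the LLM approach sidesteps.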

KillHour fucked around with this message at 20:13 on May 23, 2023

Count Roland
Oct 6, 2013

KillHour posted:

I work for a NoSQL database company with a good amount of exposure in the data science world.

IMHO, the current best use cases are things where you're giving it the context along with the question instead of asking about something open ended where it can just make poo poo up.

A good example might be "Are these two records likely referring to the same thing?" or "Summarize/extract information from this text."

But you have people asking that you don't use AI at all? If you'd told me you're an artist that would have made some sense, but database work *should* be machine work.

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

Count Roland posted:

But you have people asking that you don't use AI at all? If you'd told me you're an artist that would have made some sense, but database work *should* be machine work.

A lot of corporations are now starting to ban the use of ChatGPT and the like for work stuff because people inevitably start posting corporate secrets and/or classified material. Aside from that, no company wants their data to be used to train the next generation of ChatGPT.

KillHour
Oct 28, 2007


Count Roland posted:

But you have people asking that you don't use AI at all? If you'd told me you're an artist that would have made some sense, but database work *should* be machine work.

Those people fit mostly in two categories - customers who are very concerned about information security (we do a lot of government work), and customers who are very concerned about being able to prove where their answers came from and that they are reliable/referenced (places with lots of compliance requirements - lawyers, medical, banks, etc.)

We have several big copyright holders and none of them give a poo poo - they only care that other people can't use AI in a way that could threaten them, but they're happy to use the tech themselves.

I've never once heard a company cite ethical concerns as a reason to use or not use them.

Jaxyon
Mar 7, 2016
I’m just saying I would like to see a man beat a woman in a cage. Just to be sure.
Hey did anyone post this video yet? (Probably, but I didn't see it the last week of posts)

https://www.youtube.com/watch?v=jAHRbFetqII

It's got some pretty pertinent discussion, and also goes into the eugenics leanings of some AI backers, which I wasn't aware of.

Private Speech
Mar 30, 2011

I HAVE EVEN MORE WORTHLESS BEANIE BABIES IN MY COLLECTION THAN I HAVE WORTHLESS POSTS IN THE BEANIE BABY THREAD YET I STILL HAVE THE TEMERITY TO CRITICIZE OTHERS' COLLECTIONS

IF YOU SEE ME TALKING ABOUT BEANIE BABIES, PLEASE TELL ME TO

EAT. SHIT.


Could you give a summary? It's rather long, at well over an hour, and lacking in slides or anything else except two people talking into the camera.

KillHour
Oct 28, 2007


Jaxyon posted:

Hey did anyone post this video yet? (Probably, but I didn't see it the last week of posts)

https://www.youtube.com/watch?v=jAHRbFetqII

It's got some pretty pertinent discussion, and also goes into the eugenics parts of AI backers which I wasn't aware of.

It's Adam Conover (the Adam Ruins Everything guy), who I normally like, but he hasn't been particularly subtle about his hatred of generative AI and falls squarely in the "I think this will be bad for me personally so I hate it" camp. He's funny and smart, but he is not remotely an unbiased source, and trying to tie a technology to the character of the people funding it isn't a great argument, IMO. It's like saying "Cars suck because Henry Ford is an anti-Semite" or "Lightbulbs are evil because Edison is an rear end in a top hat."

Jaxyon
Mar 7, 2016
I’m just saying I would like to see a man beat a woman in a cage. Just to be sure.

Private Speech posted:

Could you give a summary? It's rather long at well over an hour and lacking in slides or anything else except two people talking into camera.

Sure, some of the finer points:

1. People, even people who know better, are getting sucked into the hype of AI being able to be actual AI, instead of ChatGPT being a fancy text/code generator
2. The "pause letter" that got put out really smells a lot more like hype, in an "oh no, we're too good at tech, somebody stop us" sense.
3. A lot of people involved in hyping AI are Longtermist/Singularitarian types who are heavily into explicit eugenics
4. The paper the "pause letter" gives as its first citation was actually written by the women being interviewed in this video, who say it was massively misinterpreted
5. Microsoft put out a research paper saying that ChatGPT was showing "sparks" and it literally cites a race science paper from like the 80s
6. OpenAI is the opposite of open and ethical and is completely silent on how its model was trained and what its parameters are, supposedly for "security," but much more likely it's so they can make money on it.
7. The AI companies are taking no liability for what their models create and what they plagiarize
8. AI is way dumber than you think

KillHour posted:

It's Adam Conover (the Adam Ruins Everything guy), who I normally like, but he hasn't been particularly subtle about his hatred of generative AI and falls squarely in the "I think this will be bad for me personally so I hate it" camp. He's funny and smart, but he is not remotely an unbiased source, and trying to tie a technology to the character of the people funding it isn't a great argument, IMO. It's like saying "Cars suck because Henry Ford is an anti-Semite" or "Lightbulbs are evil because Edison is an rear end in a top hat."

You're accusing him of using ad homs to discredit AI, when that's exactly what you're doing right now.

He's not the person to be listened to in that video, it's the 2 researchers he has on. Take up your poo poo with them, they're published.

Who cares if he hates AI? He's a content creator; he has good reason to. What matters is whether he or his guests are right. If you're going to argue with the video, watch it. If you're not, I gave a summary, so you'll have to trust me that I was accurate.

Jaxyon fucked around with this message at 23:53 on May 23, 2023

KillHour
Oct 28, 2007


Jaxyon posted:

He's not the person to be listened to in that video, it's the 2 researchers he has on. Take up your poo poo with them, they're published.

"Look man, it's not Joe Rogan saying those things, it's the published researcher he has on his show saying it while he just asks them questions!"

All I said is that he's not neutral. He clearly has a horse in this race and is going to have guests on that agree with him. You're welcome to watch and form your own conclusions, but a biased interviewer giving a friendly platform for people to push an agenda is not a new thing. The exact same thing happened a few pages ago with Sam Altman and that dude who just so happens to be super good friends with a bunch of nazis.

Both of those interviews are clearly pushing an agenda. You might agree with one of them, and I'm certainly more amenable to Adam Conover than the other guy. But you can't present a podcast with an explicit point of view as a neutral source of facts, even if they aren't lying about them.

KillHour fucked around with this message at 23:19 on May 23, 2023

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!
Being published doesn’t mean much in and of itself fyi. I’m also published, and I’ve seen my fair share of “why the gently caress is this published” papers. Probably including mine. Idk.

GlyphGryph
Jun 23, 2013

Down came the glitches and burned us in ditches and we slept after eating our dead.

Jaxyon posted:

He's not the person to be listened to in that video, it's the 2 researchers he has on. Take up your poo poo with them, they're published.

The only reason I haven't watched the video yet is because I'm still debating whether listening to Conover talk is worth the information it might contain, having tried to watch a few different videos by him recently. He's the worst example of a guy I technically agree with in broad strokes on most things but can't stand listening to talk about them.

I'm familiar with Bender, though; she's the Stochastic Parrot person. Some good broad-strokes criticisms and some really braindead hot takes over the last couple years, but overall decently knowledgeable, so I might eventually give it a go just to hear what she has to say. She did a panel not long ago with Christopher Manning, who is very pro-AI and has even more blisteringly braindead hot takes.

GlyphGryph fucked around with this message at 23:31 on May 23, 2023


Tei
Feb 19, 2011

To the people saying "once you have put a document into the machine learning corpus, it can't be removed":

That's false, isn't it? If you train your AI on 50,000 legal documents and one illegal one, you can always remove the illegal one and train again. It could be expensive, but that's Not My loving Problem.

Also, I am not so sure it is that hard. It may only appear hard because the learned model is one big blob of data. Or is it?

Maybe they can research HOW to keep track of how each document affects the trained model, then subtract that contribution.

It's not the same thing, but here's an example of how that is possible with Bayesian filters:

https://github.com/atyks/PHP-Naive-Bayesian-Filter/blob/master/src/naive-bayesian-filter/class.naivebayesian.php

code:
		/** Untraining of a document.
		 Removes just one document from the references.
		 @see updateProbabilities()
		 @see train()
		 @return bool success
		 @param string $doc_id document id, must be unique
		 */
		function untrain($doc_id) {
			$ref = $this->nbs->getReference($doc_id);
			$tokens = $this->_getTokens($ref['content']);

			// each() was removed in PHP 8; foreach walks the same pairs
			foreach ($tokens as $token => $count) {
				$this->nbs->removeWord($token, $count, $ref['category_id']);
			}

			$this->nbs->removeReference($doc_id);

			return true;
		}
This Bayesian filter is trained on documents whose words fall into weighted categories; removing a document removes those words or rebalances the weights.

If the cryptobros don't want to re-train the entire models, they can figure out how to do something similar for copyrighted data. How they do it is their problem; they can go the long way, or they can invent a better way. I don't buy the idea (repeated here by many people) that it is not possible. It is possible: repeat the training without the tainted data.
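A rough Python equivalent of that count-based untraining, to show why it works for Bayesian filters: every document's contribution is just added counts, so subtracting them restores the model exactly. (Nothing in this sketch transfers directly to neural network weights, where contributions aren't additive like this; that's the research part.) The class and document names are made up for the example:

```python
from collections import defaultdict

class NaiveBayesCorpus:
    """Tiny count-based model in the spirit of the PHP filter above.
    Training adds per-category word counts; untraining subtracts the
    exact same counts, so removing one document is cheap and exact."""

    def __init__(self):
        # category -> token -> count
        self.counts = defaultdict(lambda: defaultdict(int))
        # doc_id -> (category, tokens), kept so we can untrain later
        self.docs = {}

    def train(self, doc_id, category, text):
        tokens = text.lower().split()
        self.docs[doc_id] = (category, tokens)
        for token in tokens:
            self.counts[category][token] += 1

    def untrain(self, doc_id):
        """Subtract one document's contribution from the counts."""
        category, tokens = self.docs.pop(doc_id)
        for token in tokens:
            self.counts[category][token] -= 1
            if self.counts[category][token] == 0:
                del self.counts[category][token]

# Hypothetical corpus: train on two documents, then remove one
nb = NaiveBayesCorpus()
nb.train("doc1", "infringing", "mickey mouse fan art")
nb.train("doc2", "clean", "original landscape painting")
nb.untrain("doc1")
```

After `untrain("doc1")`, the counts are exactly what they would have been had doc1 never been trained on, which is the property the thread is arguing about.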
