How many quarters after Q1 2016 till Marissa Mayer is unemployed?
1 or fewer
2
4
Her job is guaranteed; what are you even talking about?
Nenonen
Oct 22, 2009

I've always got thirty bucks in my pocket

sinky posted:

music industry lawyers gonna :commissar:

sounds like a case for Steven A. Schwartz and Chong Ke

Mega Comrade
Apr 22, 2004

Listen buddy, we all got problems!

sinky posted:

:munch:

https://www.bbc.co.uk/news/articles/c0434yx8vgxo

music industry lawyers gonna :commissar:

Although if songs have been used for training I guess they'll have to pay Sony, while individuals who had their stuff scraped off the internet for AI products can keep getting hosed.

For all the talk of AI companies not needing permission, they sure are racing to strike deals with companies to use their data all of a sudden.

Mister Facetious
Apr 21, 2007

I think I died and woke up in L.A.,
I don't know how I wound up in this place...

:canada:

Mega Comrade posted:

For all the talk of AI companies not needing permission, they sure are racing to strike deals with companies to use their data all of a sudden.

They have enough venture capital to burn on the tech, but not the litigation :v:

SCheeseman
Apr 23, 2003

Mega Comrade posted:

For all the talk of AI companies not needing permission, they sure are racing to strike deals with companies to use their data all of a sudden.

Conversely, IP holders sure seem comfortable with making these deals, rather than taking things to court where they could argue a case to outright prevent use of their works by AI companies. They're both hedging their bets and making compromises.

This proves out what I've been saying for a while: copyright enforcement was never going to solve most of the problems AI is causing, since all the IP holders want this to happen.

Volmarias
Dec 31, 2002

EMAIL... THE INTERNET... SEARCH ENGINES...

SCheeseman posted:

Conversely, IP holders sure seem comfortable with making these deals, rather than taking things to court where they could argue a case to outright prevent use of their works by AI companies. They're both hedging their bets and making compromises.

This proves out what I've been saying for a while: copyright enforcement was never going to solve most of the problems AI is causing, since all the IP holders want this to happen.

The IP holders just want to get paid. Suing to make the other guy not use your stuff doesn't get you paid.

SCheeseman
Apr 23, 2003

Volmarias posted:

The IP holders just want to get paid. Suing to make the other guy not use your stuff doesn't get you paid.

Sure it does, if they won they'd have way more leverage over licensing than they do today.

Volmarias
Dec 31, 2002

EMAIL... THE INTERNET... SEARCH ENGINES...

SCheeseman posted:

Sure it does, if they won they'd have way more leverage over licensing than they do today.

Yes, but that means being paid later, not now.

Main Paineframe
Oct 27, 2010

Volmarias posted:

The IP holders just want to get paid. Suing to make the other guy not use your stuff doesn't get you paid.

It does if you win, since the other guy would generally be forced to pay for their usage up to that point.

SCheeseman posted:

Conversely, IP holders sure seem comfortable with making these deals, rather than taking things to court where they could argue a case to outright prevent use of their works by AI companies. They're both hedging their bets and making compromises.

This proves out what I've been saying for a while: copyright enforcement was never going to solve most of the problems AI is causing, since all the IP holders want this to happen.

Copyright enforcement solves the "people's work is being fed into giant for-profit media generators and they don't see a single cent from it" problem, which is probably the most important one to solve.

But are all that many rights holders in creative industries making these deals? From what I recall, most of these deals aren't coming from media companies, they're coming from companies that have vast libraries of other people's content that they acquired cheaply and license out even more cheaply (or put out for free with advertising next to it). For example, stock photo companies and Stack Overflow.

SCheeseman
Apr 23, 2003

Volmarias posted:

Yes, but that means being paid later, not now.

I agree, all the companies involved are shortsighted and greedy, looking for the least risky option that generates the most profit for those at the top of the pyramid. They don't care about crushing AI, they see it as a shortcut to more money.

Mega Comrade
Apr 22, 2004

Listen buddy, we all got problems!

Mister Facetious posted:

They have enough venture capital to burn on the tech, but not the litigation :v:

They probably make lobbying harder too. It's difficult to convince politicians that you're a net good for the country when so many national institutions are suing you.

Volmarias
Dec 31, 2002

EMAIL... THE INTERNET... SEARCH ENGINES...

Mega Comrade posted:

They probably make lobbying harder too. It's difficult to convince politicians that you're a net good for the country when so many national institutions are suing you.

Politicians have exactly zero concerns with taking donations from businesses they supposedly oppose

Mega Comrade
Apr 22, 2004

Listen buddy, we all got problems!

Volmarias posted:

Politicians have exactly zero concerns with taking donations from businesses they supposedly oppose

But it takes more money to convince them

Volmarias
Dec 31, 2002

EMAIL... THE INTERNET... SEARCH ENGINES...

Mega Comrade posted:

But it takes more money to convince them

It really doesn't, which is honestly the most frustrating thing. Our politicians are cheap dates.

SCheeseman
Apr 23, 2003

Main Paineframe posted:

Copyright enforcement solves the "people's work is being fed into giant for-profit media generators and they don't see a single cent from it" problem, which is probably the most important one to solve.

I don't think it's the most important, since given a ruling that made clear that training is infringement, generative AI companies will still have troves of licensed data to draw from thanks to these agreements. Artists may get a payday, that's all well and good, but then what? Licensing for individual artists? Fair recurring payments?

No. The old models will be thrown away, since they'd be an ongoing liability. New models based on licensed data will continue to exist and all the other problems AI causes, like the threat to job security, will remain.

Clarste
Apr 15, 2013

Just how many mistakes have you suffered on the way here?

An uncountable number, to be sure.

Volmarias posted:

It really doesn't, which is honestly the most frustrating thing. Our politicians are cheap dates.

The amount of money it takes to buy a politician is really less than pennies for the people doing so. Any variance in price probably wouldn't even be noticed.

Agents are GO!
Dec 29, 2004

Perestroika posted:

desperately need some cumin right loving now

Call me

Origin
Feb 15, 2006

sinky posted:

:munch:

https://www.bbc.co.uk/news/articles/c0434yx8vgxo

music industry lawyers gonna :commissar:

Although if songs have been used for training I guess they'll have to pay Sony, while individuals who had their stuff scraped off the internet for AI products can keep getting hosed.

I remember having to be one of the point men at my college for when students would get letters from RIAA members. It was usually from Sony, and the lawyer doing the work was some British guy.

Kwyndig
Sep 23, 2006

Heeeeeey


If we're lucky Sony will get a bug up their rear end about it and shut the whole thing down, either through the infamous RIAA billion-dollar fines or just a court order to cease operating infringing AI.

The Lone Badger
Sep 24, 2007

More likely the RIAA will get a payout whenever the model generates music using all the unaffiliated stuff it ingested.

Mister Facetious
Apr 21, 2007

I think I died and woke up in L.A.,
I don't know how I wound up in this place...

:canada:
One of the weaker ones will offer an ownership stake.

Pleasant Friend
Dec 30, 2008

Copyright holders signing deals for the "right" to use their stuff are sensible to sign for anything they can get, because legally they have no right or remedy for compensation. This isn't stealing, it isn't even piracy.

Neito
Feb 18, 2009

😌Finally, an avatar that describes my love of tech❤️‍💻, my love of anime💖🎎, and why I'll never see a real girl 🙆‍♀️naked😭.

Boris Galerkin posted:

It looks like Google has inserted their generative AI results into Google searches now.

It's been in there for a while, but I heard a rumor that now you can't turn them off. I've pretty much slid over to DDG for everything now, but searchability of the web is basically an impossible problem now, short of very specific searches for very specific things.

Also, Reddit just cut a deal to shove its posts into ChatGPT: https://www.theverge.com/2024/5/16/24158529/reddit-openai-chatgpt-api-access-advertising

Main Paineframe
Oct 27, 2010

SCheeseman posted:

I don't think it's the most important, since given a ruling that made clear that training is infringement, generative AI companies will still have troves of licensed data to draw from thanks to these agreements. Artists may get a payday, that's all well and good, but then what? Licensing for individual artists? Fair recurring payments?

No. The old models will be thrown away, since they'd be an ongoing liability. New models based on licensed data will continue to exist and all the other problems AI causes, like the threat to job security, will remain.

New models based exclusively on licensed data are lovely and uneconomical. That's the entire reason they didn't just use licensed data in the first place. The amount (and variety) of training data available has a substantial impact on both the quality and the versatility of the model, after all. Paying to license the massive amounts of data they need to maintain current levels of quality would be wildly uneconomical, and there isn't nearly enough public domain data for their needs either.

They're not just using artists' work for the sheer fun of stealing it. They're using artists' work because they desperately need artists' work. They can't make this poo poo work without it. Without being able to ingest a hundred million data pieces scraped from all sorts of different sources, they won't be able to build the kinds of generalist models that are the real threat to people's livelihoods here. So if they become subject to copyright, then they either pay out the rear end or release new models that are so much worse that they absolutely crush all the fantasies of ChatGPT being able to replace everything.

SCheeseman
Apr 23, 2003

Main Paineframe posted:

New models based exclusively on licensed data are lovely and uneconomical.

At the same time there's people saying that in order to avoid model collapse training will require a more curated approach.

I'm more inclined to believe that they're doing it this way because it's cheaper, faster and easier, not because there isn't any other way to practically do it.

Main Paineframe
Oct 27, 2010

SCheeseman posted:

At the same time there's people saying that in order to avoid model collapse training will require a more curated approach.

I'm more inclined to believe that they're doing it this way because it's cheaper, faster and easier, not because there isn't any other way to practically do it.

They already curate training data, but that doesn't mean they don't start with a whole fuckton of it. In fact, because losses to curation are substantial, that's all the more reason they need to scoop up so much. One of the datasets used to train ChatGPT3 was a 45TB snippet from Common Crawl (which tries to scrape the entire internet), which they cut down to 545GB after curation and filtering. They supplemented that with a WebText2 dataset (every link that had ever been posted on Reddit) filtered down to only links with more than a certain number of upvotes, and then threw in all of English Wikipedia and two unidentified databases of books (and if you're thinking their refusal to identify these datasets is suspicious, a lot of people think that!).

Besides, the fact that it's cheaper, faster, and easier matters a lot. If it's hard, slow, and expensive, then it becomes much less economical to do.
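
To put the curation loss in that Common Crawl example in perspective, here is a minimal Python sketch of the retention ratio. It uses only the two sizes quoted above (45TB before, 545GB after) and assumes decimal units; `retention_ratio` is a hypothetical helper name for illustration.

```python
# Rough retention ratio for the filtering step described above.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
TB = 10**12
GB = 10**9


def retention_ratio(raw_bytes: int, kept_bytes: int) -> float:
    """Fraction of the raw data that survives curation and filtering."""
    return kept_bytes / raw_bytes


ratio = retention_ratio(45 * TB, 545 * GB)
print(f"kept {ratio:.2%} of the raw crawl, discarded {1 - ratio:.2%}")
```

On those figures only a little over 1% of the raw crawl survives filtering, which is why the starting pile has to be so large.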

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

Neito posted:

It's been in there for a while, but I heard a rumor that now you can't turn them off. I've pretty much slid over to DDG for everything now, but searchability of the web is basically an impossible problem now, short of very specific searches for very specific things.

Well, it's being put into DDG too.

Neito
Feb 18, 2009

😌Finally, an avatar that describes my love of tech❤️‍💻, my love of anime💖🎎, and why I'll never see a real girl 🙆‍♀️naked😭.

Boris Galerkin posted:

Well, it's being put into DDG too.

At least the DDG one I have to actually acknowledge and actively click to have it spew its LLM hallucinations at me.

HootTheOwl
May 13, 2012

Hootin and shootin
They can pry my posts out of my cold dead substack

BiggerBoat
Sep 26, 2007

Don't you tell me my business again.
We're so loving doomed.

Leon Sumbitches
Mar 27, 2010

Dr. Leon Adoso Sumbitches (pronounced soom-'beh-cheh) (born January 21, 1935) is heir to the legendary Adoso family oil fortune.





Main Paineframe posted:

They already curate training data, but that doesn't mean they don't start with a whole fuckton of it. In fact, because losses to curation are substantial, that's all the more reason they need to scoop up so much. One of the datasets used to train ChatGPT3 was a 45TB snippet from Common Crawl (which tries to scrape the entire internet), which they cut down to 545GB after curation and filtering. They supplemented that with a WebText2 dataset (every link that had ever been posted on Reddit) filtered down to only links with more than a certain number of upvotes, and then threw in all of English Wikipedia and two unidentified databases of books (and if you're thinking their refusal to identify these datasets is suspicious, a lot of people think that!).

Besides, the fact that it's cheaper, faster, and easier matters a lot. If it's hard, slow, and expensive, then it becomes much less economical to do.

545 GB of text based data seems like a whole lot.

Could one of the book databases be Project Gutenberg? Maybe Google books? I'm curious what the drama/suspicion is about.

Hel
Oct 9, 2012

Jokatgulm is tedium.
Jokatgulm is pain.
Jokatgulm is suffering.

Leon Sumbitches posted:

545 GB of text based data seems like a whole lot.

Could one of the book databases be Project Gutenberg? Maybe Google books? I'm curious what the drama/suspicion is about.

Presumably the book databases are something like Library Genesis or Sci-Hub, which aren't exactly legal and authorized, and even in the cases where it's supported by the original author, probably not for the purpose of LLM harvesting.

shoeberto
Jun 13, 2020

which way to the MACHINES?

Neito posted:

At least the DDG one I have to actually acknowledge and actively click to have it spew its LLM hallucinations at me.

Some hot ~~insider info~~ but we're generally working on making it less obtrusive and more opt-in. But it's very easy to permanently disable, too. I would say that we're pretty acutely aware of how much people don't want this shoved down their throat, and are trying to balance it against discoverability.

Nothingtoseehere
Nov 11, 2010


Leon Sumbitches posted:

545 GB of text based data seems like a whole lot.

Could one of the book databases be Project Gutenberg? Maybe Google books? I'm curious what the drama/suspicion is about.

If you're storing characters in 32 bit blocks, then that's about 1,400,000,000,000 characters, or 280,000,000,000 words, roughly enough for 2.8 million 100k novels. That's a lot, but it's not a lot a lot - you can store that on a single commercial hard drive, or keep it all in a single server rack's RAM (not the VRAM of the GPU though).
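
A rough sanity check of that back-of-envelope estimate, as a minimal Python sketch. The assumptions are labeled and are not from the post itself: decimal gigabytes (10^9 bytes), about 5 characters per word, and 100,000 words per novel; `corpus_estimates` is a hypothetical helper name. With 4 bytes per character, 545 GB works out to roughly 136 billion characters, about ten times fewer than the figure quoted above, so the numbers are best read as an order-of-magnitude guess.

```python
# Back-of-envelope size of a 545 GB text corpus.
# Assumptions (illustrative only): decimal GB, ~5 characters per word,
# 100,000 words per novel.
GB = 10**9


def corpus_estimates(size_bytes, bytes_per_char=4,
                     chars_per_word=5, words_per_novel=100_000):
    """Return (characters, words, novels) for a corpus of size_bytes."""
    chars = size_bytes // bytes_per_char
    words = chars // chars_per_word
    novels = words // words_per_novel
    return chars, words, novels


# 32-bit (4-byte) characters, as in the post above.
wide = corpus_estimates(545 * GB, bytes_per_char=4)
# 1 byte per character, typical for mostly-English UTF-8 text.
narrow = corpus_estimates(545 * GB, bytes_per_char=1)

for label, (chars, words, novels) in (("4 bytes/char", wide),
                                      ("1 byte/char", narrow)):
    print(f"{label}: {chars:,} chars, {words:,} words, {novels:,} novels")
```

At one byte per character the same 545 GB comes to a bit over a million novel-length texts, so the "a lot, but not a lot a lot" conclusion holds either way.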

Leon Sumbitches
Mar 27, 2010

Dr. Leon Adoso Sumbitches (pronounced soom-'beh-cheh) (born January 21, 1935) is heir to the legendary Adoso family oil fortune.





Nothingtoseehere posted:

If you're storing characters in 32 bit blocks, then that's about 1,400,000,000,000 characters, or 280,000,000,000 words, roughly enough for 2.8 million 100k novels. That's a lot, but it's not a lot a lot - you can store that on a single commercial hard drive, or keep it all in a single server rack's RAM (not the VRAM of the GPU though).

Ya, 2.8m novels is ~ 2% of total books published since 1440, so not a lot by that metric either.

From what I can tell, the general development plan for ChatGPT and the other major players is "feed it more training data, it gets better on its own". It's both unproven and likely to hit a wall as training data begins to peter out. They've scraped the low hanging fruit of free data and are now entering deal territory with data in walled gardens. With the consolidation and enshittification of the current Internet, there aren't that many sites to make data deals with.

I've heard people also incorrectly apply Moore's Law to AI, and afaik that's also not likely.

Remulak
Jun 8, 2001
I can't count to four.
Yams Fan
It feels like they’re asymptotically approaching the best that can be done with the current approaches, no matter how much data they throw at it. That means a bust as hype/vc money goes to the next thing, then in 5-10 years some new approach will be better we really mean it this time like VR/AR/metaverse, then bust, then…..

HopperUK
Apr 29, 2007

Why would an ambulance be leaving the hospital?
Is it accurate that the 'scrape the internet' style of training AI will falter as more of the internet is already written by AI, or is that just a hypothetical?

blastron
Dec 11, 2007

Don't doodle on it!


Has this been posted here yet?

https://youtu.be/dDUC-LqVrPU

There is convincing research showing that there are indeed diminishing returns on adding more data to a model, and that the sheer quantity of data required to get a general-purpose model to be an “expert” on specialized topics simply doesn’t exist.

Kwyndig
Sep 23, 2006

Heeeeeey


HopperUK posted:

Is it accurate that the 'scrape the internet' style of training AI will falter as more of the internet is already written by AI, or is that just a hypothetical?

It's already happening, LLMs trained on generated data output nonsense.

Sundae
Dec 1, 2005

HopperUK posted:

Is it accurate that the 'scrape the internet' style of training AI will falter as more of the internet is already written by AI, or is that just a hypothetical?

Kwyndig posted:

It's already happening, LLMs trained on generated data output nonsense.

Yep. Companies had to manually intervene to break the loop of generative AI programs reading each other's answers once one of them decided that yes, you can melt eggs. That's an easily-caught one because of its absurdity, but it's not a big leap from there to having something less obvious become "fact" from AI horseshit.

Magic Underwear
May 14, 2003


Young Orc

Remulak posted:

It feels like they’re asymptotically approaching the best that can be done with the current approaches, no matter how much data they throw at it. That means a bust as hype/vc money goes to the next thing, then in 5-10 years some new approach will be better we really mean it this time like VR/AR/metaverse, then bust, then…..

There is some truth to that in terms of training data and maybe parameters, but overall AI is advancing rapidly and becoming more efficient. GPT-4o just got announced and it has great potential; conversational multimodal AI that can understand what it sees and hears is going to be in every cellphone.
