cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Carp posted:

Wow, yeah, LoRA sounds very interesting, and I agree that it would make fine-tuning a large model much easier. How has it worked out for you? There is so much new information out there about deep learning and LLMs. If I come across the paper again, I'll be sure to let you know.
I think transfer learning tools like LoRA are going to be the main way that stuff like ChatGPT gets used in industry. It's certainly been the main (only) way I've used language models in the past.
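The core low-rank trick behind LoRA is simple enough to sketch in a few lines of numpy (the sizes and rank here are made-up illustration, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 8                           # hidden size and adapter rank (illustrative)
W = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01      # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

x = rng.normal(size=(d,))

# LoRA forward pass: original output plus a low-rank correction B @ A @ x
y = W @ x + B @ (A @ x)

# Because B starts at zero, the adapted model initially matches the base model
assert np.allclose(y, W @ x)

# The payoff: only 2*d*r parameters train instead of d*d
print(2 * d * r, "trainable vs", d * d, "frozen")
```

The point is that you only ever backpropagate through A and B, so fine-tuning fits on much smaller hardware than retraining W would.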


cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Carp posted:

What have you used them for in the past, and what do you think of ChatGPT and GPT-4?
I haven't used either, but I've used older pretrained language models/vector embeddings like FastText. The gist is that these pretrained models embed each word or subword as a numeric vector in some latent space (usually something like 300-dimensional - much lower-dimensional than the vocabulary space), so that words that are semantically close are usually close in the embedding space. Newer, more sophisticated models embed using more context information, as opposed to a simple 1 word = 1 vector setup.

Here's a typical example of where I'd use something like this: Suppose I have a bunch of consumer reviews with 1-5 stars, each review associated with a product, user demo info, etc., but also containing a text review field with natural language. There's potentially a lot of good info locked up in that text field. However, there's probably too little data to train my own text model, most of the text is short (which means the words have little context of their own), and it would be too much of a PITA anyway. So instead, I just take the word vector embeddings. Each word in each review gets its vector. Then those vectors are averaged over each review, so each review now has a semantic vector associated with it (there are other/better ways to do that, but whatever). Just like with the words, semantically similar review texts have similar embedding vectors. These embeddings can then be used alongside the other review data in a downstream model to actually predict the rating.
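A toy sketch of that averaging trick, with a made-up three-dimensional embedding table standing in for real pretrained FastText vectors:

```python
import numpy as np

# Toy stand-in for a pretrained embedding table (real FastText vectors
# are ~300-dimensional; these are 3-d just for illustration)
embeddings = {
    "great":    np.array([0.9, 0.1, 0.0]),
    "terrible": np.array([-0.8, 0.2, 0.1]),
    "battery":  np.array([0.0, 0.7, 0.6]),
    "screen":   np.array([0.1, 0.6, 0.7]),
}

def review_vector(text):
    """Average the vectors of all in-vocabulary words in a review."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

r1 = review_vector("great battery")
r2 = review_vector("great screen")
r3 = review_vector("terrible battery")

# Reviews with similar sentiment end up closer in the embedding space
assert cosine(r1, r2) > cosine(r1, r3)
```

The resulting per-review vectors can then just be concatenated with the structured features and fed to whatever downstream model predicts the rating.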

I'm no expert on the newer models, but the main part of the transfer learning task is still going to be text => numeric vectors, except going through a more sophisticated, context-aware transformer model rather than what is essentially a dictionary from words to vectors.

cat botherer fucked around with this message at 00:54 on Mar 31, 2023

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Aramis posted:

More specifically, anything related to structuring information for human consumption is definitely going to be dead in the water real quick. Technical writers, copy editors, etc...
I think that's a bit premature. Technical writing especially is pretty exacting, and it's usually for custom products or w/e where there's not going to be anything too similar in the training set. If it becomes a productivity multiplier, there could be fewer of those jobs, but I don't see them going away.

Carp posted:

That's a pretty good summary. Much better than my notes earlier in the thread, which are a little confused.
Thanks!

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SaTaMaS posted:

Another promising area is the possibility that ChatGPT can look at really old languages like Cobol and Fortran and not just improve the documentation but translate it into modern languages using cleaner code.
I actually think that would be one of the less likely areas, although it could still help. Legacy Cobol systems are all byzantine unstructured code written mostly by people who are now dead. These systems don't have any real specs for how exactly they should work or what they should do, but they're usually vital, and must keep doing whatever it is they are doing. Correctness of a replacement basically means that it should be identical in behavior to the old system, but there's no way of actually verifying that.

I think it's one area where ChatGPT would really be led astray. ChatGPT only understands text (including code) and textual contexts. Good code written in a modern structured programming language will usually have a pretty decent mapping between syntax and computational semantics. ChatGPT has no idea about any kind of computational semantics, but it is possible that there exists a faithful enough mapping,
code:
(program semantics) <- (syntax) -> (ChatGPT's internal representation) -> (generated syntax) -> (generated semantics)
such that the semantics of the generated code are faithful enough to the original. This would work best within the same language, but would probably also work translating between, e.g., Python and Ruby.

Cobol is not modern or structured - the relationship between syntax and semantics is, in CS terminology, "hosed up." Because the code is unstructured, it's a massively complex ball of entropy with all sorts of non-local interactions. A piece of code might do very different things depending on the current program state, etc. The program can only be understood as a whole, and only by running it many, many times with different inputs - so it's just something ChatGPT can't do.

cat botherer fucked around with this message at 17:28 on Mar 31, 2023

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Seyser Koze posted:

So all we need is a society completely unlike the one we live in, run by people completely unlike the ones running it, and a ton of people losing their jobs will be no issue. Great.
It's pretty incredible that labor-saving technologies hurt workers, rather than freeing them from rote tasks. One could say it is a contradiction, even.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SCheeseman posted:

Impede, obstruct, whatever. In any case what you want isn't going to make AI art generators uneconomical, it'll make the 'legal' ones economical only for the entrenched IP hoarders. What you want changes nothing about how people will in actuality be exploited and may even serve to make it worse!
:yeah:

Whatever your problem is, the answer is not "expand copyright protections."

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
Like it or not, this is the future. I think it's more promising than quantum computers. The best thinking machine is the human brain, so the right way to do AI is to create disembodied brains. The hard part is figuring out how exactly to torture them to get what you want.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

This isn't some ip cartel, this is enforcement of long standing rights in law in basically every legal system.
It's not, though. In Anglo systems, this stuff falls under fair use. IDK much about other systems, but the answer is not to increase the power of IP holders. That just empowers rent-seeking behavior and creates unintended consequences. I think people really need to take a deep breath here.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

For the hundredth time, this isn't. This is merely using already existing rules. Allowing it to be otherwise in fact will have that effect you fear, because it makes literally anything free real estate for billionaire bandits. Indeed, the point is laundering this behavior, much as crypto was laundering for securities bs.
It's not using already existing rules (in the US/Britain). It's fair use. I have no idea how crypto laundering is related, other than being a computer thing. You're really just pulling all of your assertions out of thin air here.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

So? By training those models, they clearly crossed long established lines on copyright law.
They haven't though. Training models on copyrighted things is nothing new. It's been going on for well over a decade. If it was crossing established lines, there would be case law on it by now. Can you actually point to evidence of your legal theories?

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

Given the commercial nature of these models and that they create similar outputs WITHOUT permission, no I do not think fair use harbor applies.
You don't think it does, but it does. Fair use covers all sorts of commercial purposes. Again, where is the case law? Copyright holders would have every incentive to go after people training models on their IP, so if your legal theories weren't complete bunk, there would be case law by now. Can you show it, or are you just pulling all of this out of your rear end?

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

eXXon posted:

There's a class action lawsuit over GitHub CoPilot currently ongoing, filed in November. Microsoft asked to dismiss it in January. No idea what to expect next.

I have a hard time seeing how scraping millions of GPL-licensed repos and charging money for what might well be minimally transformative derivatives that remove all license information is consistent with the GPL, but I suppose we'll see.
There's nothing special about the GPL compared to restrictive copyrights here. The GPL itself holds up, but they're claiming fewer rights than, e.g., Nintendo is with Mario. The only novel thing is that it's about code instead of movies or books, but it's still generative and doesn't duplicate copyrighted code. It's pretty hard to see how a court would side against Microsoft while maintaining previous decisions.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

KwegiboHB posted:

How about we... and I know I'm being crazy over here... NOT torture the disembodied brains. Or the embodied brains either.
Let's be practical. Think of the swing voters.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

BrainDance posted:

There was this really cool artificial life game series called Creatures back in the day. It was really what got me into the early internet (it had a scripting language to create objects, the creatures had artificial DNA and could evolve so you could export them and share them. It was actually incredibly cool and way better than I'm explaining it.) Like a really complex tamagotchi.

But it really kinda mirrored some of this. Some people took it very seriously. And then one guy, Antinorn, started a website "tortured norns" that was exactly what it sounds like. poo poo hit the fan and the community was divided. But what I remember is him getting a poo poo ton of death threats and stuff.

I guess this is a stupid story and not interesting at all. I have no real idea why Antinorn did it, but I think it's just a thing people are gonna do with anything that looks alive but they know isn't actually alive. Maybe because it feels kinda taboo?
Yeah, I think most people have an empathetic response to something that mimics a human or animal, even if they know on an intellectual level that the program or whatever isn't sentient. In the same way, I don't think it's necessarily a bad thing to get weirded out by someone who really likes to torture Furbys. I shouldn't talk, though: I loved the Lemmings games as a little kid, especially blowing them up. I'd never hurt a real lemming, though; they're adorable.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
ChatGPT cannot control a robot. JFC people, come back to reality.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

KillHour posted:

:catstare: Mods!?

Edit: I mean other mods who haven't had their soul devoured by math.
Quaternions are straightforward, but miserably so.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Delthalaz posted:

Regarding the fears of AI superintelligence and world domination, I'll be a lot more concerned if Paradox can ever develop an "AI" that can beat an average human player without cheating. Those games are pretty complicated, but not nearly as complicated as the real world, so...
poo poo, I'd welcome an AI superintelligence at this point (not that that's coming within the next century or ever). On one hand, we might all die, but on the other hand, a literal deus ex machina is one of the more realistic ways of dealing with climate change and environmental collapse.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
Folks, that was a joke post.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Bar Ran Dun posted:

Another AI showing human reasoning article, this time in the times. Based on a Microsoft paper.

https://www.nytimes.com/2023/05/16/technology/microsoft-ai-human-reasoning.html

Same as before though: “They literally acknowledge in their paper’s introduction that their approach is subjective and informal and may not satisfy the rigorous standards of scientific evaluation.”
Yeah that's just slop to build up more buzz. The AI researchers should stay in their lane and let philosophers deal with these kind of questions.

Also a new funny Bard thing just dropped:

https://twitter.com/goodside/status/1657396491676164096

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SubG posted:

There really isn't any plausible argument for capitalism ending mediaeval feudalism. The normal framing is that agrarian feudalism (roughly the thousand years preceding the 16th Century, although you can fiddle with the endpoints a lot) was supplanted by mercantilism (roughly the 16th to 18th Centuries) which lead to capitalism (somewhere around the late 18th/early 19th Century).

And even if you want to construct an idiosyncratic definition of capitalism that encompasses what's traditionally called mercantilism it isn't like mercantilism killed feudal agrarianism either...mediaeval feudalism largely collapsed under a number of crises which it was unable to handle, generally referred to as the Crisis of the Late Middle Ages: the famine of the early 14th Century; the Black Death; the ending of the Mediaeval Warm Period and the start of the Little Ice Age; the Western Schism; endless peasant uprisings and popular revolts; the Hundred Years' War; and so on. Feudalism/manorialism limped along for a little while afterward, but Thirty Years' War and the Peace of Westphalia are pretty strong arguments that it was dead long before capitalism became dominant.
Mercantilism is a pattern of foreign trade policy, not a system of production in itself. Production by European powers and their empires during the 16th-18th centuries had much more in common with capitalism than with feudalism. Proper capitalism didn't get cranking until well into the 19th century, but imperialism (with or without mercantilist trade policies), like capitalism, is predicated on continuous, and thus exponential, expansion. Unlike capitalism, but like feudalism, production in imperial-mercantilist systems depended on control of land. During the mercantilist era, growth came from using the dominance of European technology to force non-Europeans to employ themselves and their land for Europeans' benefit. That got tapped out before too long - the Earth is only so big, and there are only so many people. Further advances in technology provided the way to continue expansion, since those advances allowed huge increases in production in themselves, independent of control of land and slave exploitation.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

PT6A posted:

Yeah, I think one aspect of feudalism that gets dismissed or ignored a lot is that it's based on mutual obligation in a semi-theocratic framework. If you upset God by failing to execute the obligations of your divinely-ordained station, it's open season on you!
You're kind of overstating the non-materially-motivated aspects of feudalism. During sieges or whatever, the safety of peasants was not a priority. During famines, many starved. The power of lords versus kings waxed and waned, and wasn't generally based on any kind of higher principle. Sustainable control required the general survival of the peasants underneath, working the land, and the non-aggression of liege lords above.

During most of the Middle Ages, cavalry was king. That required horses and armor, which the great mass of peasants couldn't supply, but landowners could. Until the advent of longbows and guns, knights were essentially invincible against peasants, so peasant uprisings were easy to squash. Meanwhile, the lords needed the peasants to farm the land, and the peasants needed the land to eat. This shifted power toward landowners, within limits. Any kind of spiritual obligation only existed on Sundays; people were animated by material concerns, same as now.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Count Roland posted:

I believe LLMs are poor at logic. Dealing with facts requires the AI to state things are true or false. Such statements can be logically modified ie if x is true then y. A model that is guessing the next symbols in a phrase will sometimes pull this off but can't itself be reliable. The AI needs to do logical operations. Which I assume is possible, given how logic-based computing is.
You're kind of touching on symbolic AI or expert systems, which was really the first way of doing AI before statistical methods took over. They're really useful in some problem domains. I think figuring out how to integrate the two approaches will be a big deal in the future, given that they tend to be good at complementary things.
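For anyone who hasn't seen one, the core of a classic expert system is just forward chaining over if-then rules - a minimal sketch with toy facts and rules, obviously nothing like a production engine:

```python
# Each rule: (set of premises, conclusion). The engine keeps applying
# rules until no new facts appear - forward chaining.
rules = [
    ({"has_fur", "says_meow"}, "is_cat"),
    ({"is_cat"}, "is_mammal"),
    ({"is_mammal"}, "is_animal"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire the rule if all premises hold and it adds something new
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

result = forward_chain({"has_fur", "says_meow"}, rules)
assert "is_animal" in result
```

Unlike an LLM, every conclusion here comes with an exact derivation chain you can audit, which is exactly the property statistical models lack.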

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Liquid Communism posted:

The AI's entire 'memory' consists of its training set. Hence why you cannot remove something from said training set without retraining the AI, or it will continue to use what has been indexed.

It is incapable of creativity. It is simply pulling elements from training data that is tagged similarly to the prompt given.

This is a large part of why the EU is looking at it sideways, as present designs cannot comply with the GDPR both in proving they do not contain PII, or obeying right to be forgotten.
Let's just stop and think how much space ChatGPT or StableDiffusion would take up if they retained their entire training set...
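Some rough arithmetic, assuming a ~2 GB Stable Diffusion checkpoint and a LAION-scale training set of ~2 billion images (order-of-magnitude figures, not exact):

```python
# Back-of-envelope: could the weights possibly store the training set?
checkpoint_bytes = 2 * 1024**3       # ~2 GB model checkpoint (assumed)
training_images = 2_000_000_000      # ~2 billion training images (assumed)

bytes_per_image = checkpoint_bytes / training_images
print(f"{bytes_per_image:.2f} bytes of weights per training image")

# About one byte per image: not enough to store even a single pixel,
# let alone the image itself
assert bytes_per_image < 2
```

Whatever the model retains, it is not a database of its training data.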

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Clarste posted:

The idea is to define the program as something that cannot "learn" and can only "copy" so therefore anything in its training set is copying by definition. Like tracing. A computer cannot have a style, it can only trace things.

It is literally already illegal to make copies with a machine! Netflix is giving you permission to watch it on your computer, but not to spread it any further than that! It's in the Terms of Service you skipped!
This would be a good point if these models copied images, but they don't.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
Yeah you cannot copyright styles. That's never been the case, and doing it on a computer does not change that.

https://www.thelegalartist.com/blog/you-cant-copyright-style

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Liquid Communism posted:

Yep. Even the 'draw a thing in Bob Ross' style' prompt is a dodge, because the algorithm has no idea what Bob Ross' style is. It knows there were files in its training set that were human-tagged as being produced by or similar to Bob Ross, and will now iterate on parts of them to generate an image that the human user will then decide is or is not what they wanted.
It does not do this. Are you reading any of the posts where people have repeatedly explained that the models do not contain their training sets? It cannot "iterate" on the tagged set to generate a new image because that set of images does not exist at prediction time. You haven't the faintest idea of what you are talking about here.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
This thread would be a lot easier if people argued based on the ML models and copyright laws that actually exist. It seems that people think these models are some kind of database.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Clarste posted:

I am saying the law can declare it so regardless of what you or anyone thinks, and a lot of people with a lot of money have a vested interest in strong copyright laws. This isn't a philosophical discussion, the law is a tool that you use to get what you want.
Case law can go anywhere, always. That's a specious argument in itself.

Clarste posted:

I super do not see how this actually matters. You input copyrighted material into the machine. Whether it happened before or after "training" is 100% irrelevant to the issue of whether we want that to be a thing and how we might stop it.
It absolutely matters, because it factors into whether it's fair use or not. It has also been a thing for years now. ML models are nothing new.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Clarste posted:

Case law can go wherever it wants, but if people with money don't like where it went they can buy a senator or 50. All I have ever been saying is that the law can stop it if it wants to, and all these arguments about the internal workings of the machine or the nature of art are pretty irrelevant to that.
Places like Getty want it restricted (and generally oppose any fair use whenever possible). However, there's even bigger money (the tech industry) that wants to maintain the status quo. Machine learning on fair-use data has been a thing for many years now.

cat botherer fucked around with this message at 18:36 on May 22, 2023

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

AI, or very likely to have been trained on such, yes.
It sounds like you've decided "using something as training data" is not fair use, and you're working backward from there.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

Because it isn't
That's just a broad, unsupported claim you've pulled out of nowhere, that is actually contrary to the status-quo legal situation.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
It’s actually kind of astonishing how basic most of the math is. It’s just intuition on the best way to use it.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SubG posted:

No, the thing they cite is not The Bell Curve. They cite an opinion piece in The Wall Street Journal signed by 52 scientists. It's called Mainstream Science on Intelligence.

The reference was removed in the most recent version of the paper.

And for whatever it's worth, they start talking about the paper at around 50 minutes into the video. The get to the part where it cites "Mainstream Science on Intelligence" about three minutes later. Which I mention because went looking for it as well.
It’s never great when you’re citing The Wall Street Journal for a definition of intelligence. It also doesn’t help when the piece is defending a race science book.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
Languages are indeed fuzzy, which is probably why computational linguistics hasn't made nearly as much progress as simpler statistical models on things like machine translation.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.
English isn’t really “less structured” than Russian. What Russian conveys in conjugation and declension, English conveys with word order and sometimes more words, like auxiliary verbs. In linguistic terms, English is analytic in that it breaks things down, with a small ratio of morphemes (word parts) to words. Russian is the opposite in that it is a synthetic language. Speech in both languages can exist on a wide continuum of ambiguous to exact.

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

NoiseAnnoys posted:

exactly, thank you.

again, don't get me wrong, i'd love for ai to help us crack some of these long undeciphered scripts and manuscripts, but considering the problems these translation apps have with living languages with absolutely huuuuuuge corpuses to draw from, the ais involved need to be way more powerful, or we need to find waaaaaaay more text/data to feed into them.
Given that we have no idea what language family Linear A belongs to (it's thought to be non-Indo-European), I would be surprised if it isn't information-theoretically impossible to decipher with current evidence. Same deal with the Indus script :(.

gurragadon posted:

So, its basically just elitism and snobbery from the people in Moscow? Like how French people think (used to think?) that regional accents weren't really French.
Definitely a current thing. They're still assholes to French Canadians about it, even though Canadian French is much closer to the standard French of a couple hundred years ago. Not that English speakers should talk, given how people view dialects like AAVE.

e: ambiguous typo

cat botherer fucked around with this message at 18:09 on May 24, 2023

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SubG posted:

With the VM? Nah. There's no "curve of possibilities" because there's nothing indicating what the underlying distribution is. We can estimate e.g. the entropy of the script and from that estimate the amount of information the VM encodes...but to a first order approximation that just tells us how much additional information we'd need in order to produce a meaningful "solution".

A lot of the approaches to understanding the VM start out with a hypothesis like "maybe it's actually Chinese" (or Hebrew or Vietnamese or whatever) because if that happens to be true then you get a huge amount of information about the text more or less for free. But all "solutions" of this form are explicitly predicated on the idea that Voynichese isn't a previously unknown language.
Probability theory and information theory are two sides of the same coin. KL divergence (either for variational Bayes or information gain from prior to posterior) is just relative entropy. The Bayesian optimal model is the one that has minimal message length. It's all the same thing but from different perspectives.

As you say, you never know what the underlying distribution is, but you also can never know the actual entropy of the script, because it depends on an unknown optimal code, or equivalently, a distribution to describe it. Information entropy is just the expected value of the negative log probability, which requires knowledge of the probability distribution in the first place. Thus, information theory is inseparable from probability. No matter what, some kind of assumptions must be made, and anything we infer is colored by those choices.
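To make the "two sides of the same coin" bit concrete, here's KL divergence computed as cross-entropy minus entropy on a made-up distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p*log(p), in nats."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Expected code length when the true distribution is p but we code for q."""
    return -np.sum(np.asarray(p) * np.log(q))

def kl(p, q):
    """Relative entropy D(p||q): the overhead of using the wrong model."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]   # "true" symbol distribution (unknowable in practice)
q = [0.5, 0.3, 0.2]   # our model of it

# Coding with a wrong model always costs extra nats; the cost vanishes
# only when the model matches the true distribution
assert kl(p, q) > 0
assert abs(kl(p, p)) < 1e-12
```

The catch for the VM is exactly the point above: without p you can estimate neither the entropy nor the overhead, so every estimate smuggles in a modeling assumption.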

cat botherer fucked around with this message at 00:10 on May 26, 2023

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

SubG posted:

I understand what you're trying to say, but I don't see how this contradicts anything I said. The issue is that any estimation of the entropy of Voynichese, and therefore information in the VM, just tells us how much there is, not what it is. Put in slightly different terms: it lets us figure out how to compress the text, not how to decrypt/translate it.
That’s very true. However, a good parsimonious (i.e., good-compression) statistical model of the manuscript would be more-or-less optimal for describing its essential features. If the model structure were interpretable linguistically, it would give you meaningful linguistic and semantic information. Of course, it’s probably impossible to come up with any such well-reasoned model in the case of the Voynich manuscript. We don’t know enough about it to propose any kind of meaningful prior (we’re obviously in agreement there).

cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

Bar Ran Dun posted:

That’s amazing to me. These models are only copyrighted? They aren’t patenting these models?
Code can’t be patented; that would be the equivalent of patenting a specific machine (as opposed to a specific design of some machine). Algorithms can be patented in some circumstances, but thankfully that’s been somewhat cut down by a general prohibition on “X, but on a computer” patents. For software, the best option is usually to protect it as a trade secret.

Even with an algorithmic patent, it’s extremely hard to prove anyone else is using it without access to their code (it’s almost impossible to reach the level of evidence/suspicion needed to sue or get a subpoena). Getting a patent also means you have to show the whole rear end of your algorithm, which is thus not a good idea given how easy it is to infringe undetected.

With a lot of this stuff, places like OpenAI will publish papers on sometimes innovative aspects of what they’re doing. However, if it’s anything like some places I’ve worked, they’re holding back some important but non-obvious practical details. They aren’t idiots.

The concept of patents on algorithms is incoherent. “Math” results or techniques cannot be patented, but courts consider algorithms to not be part of math. Mathematicians and computer scientists disagree.

cat botherer fucked around with this message at 01:34 on May 27, 2023


cat botherer
Jan 6, 2022

I am interested in most phases of data processing.

StratGoatCom posted:

If your model ate someone's stuff and it emulates it, you are not covered under fair use.

Capice?
Once again, that’s a fact that only exists in your head. By your description, all artistic influence would be copyright infringement.
