Instant Grat
Jul 31, 2009

Just add
NERD RAAAAAAGE

Dr. Video Games 0031 posted:

"DLSS 3" was just a term imposed by Nvidia's marketing. If you look at their SDKs and development guides and such, you'll see they just refer to the two main technologies as "DLSS Super Resolution" and "DLSS Frame Generation." DLSS 3 now encompasses both.

Ok but the marketing team is still clearly using "DLSS 2" to mean spatial upsampling, and "DLSS 3" to mean framegen:

https://www.nvidia.com/en-us/geforce/news/baldurs-gate-3-and-even-more-dlss-games-this-august/

quote:

August looks to be another great month for DLSS and RTX, headlined by the official launch of Baldur’s Gate 3, enhanced with DLSS 2 and DLAA.

Additionally, DLSS 3 has been added to DESORDRE: A Puzzle Game Adventure and Remnant II, and will be available at the launch of Lost Soul Aside.

I don't think that quote would've been different even if Baldur's Gate had shipped with a DLL that had a version number starting with 3.


kliras
Mar 27, 2021

Dr. Video Games 0031 posted:

"DLSS 3" was just a term imposed by Nvidia's marketing. If you look at their SDKs and development guides and such, you'll see they just refer to the two main technologies as "DLSS Super Resolution" and "DLSS Frame Generation." DLSS 3 now encompasses both.
with reflex boost baked in, too. i don't mind the dlss 3 naming, makes it easier in some way. as long as i can disable frame gen on its own

repiv
Aug 13, 2009

the good news is that AMD is committed to copying nvidia's nomenclature around this stuff so FSR3 will probably be just as confusing

it should have just been DLFG 1.0 and FFG 1.0

Branch Nvidian
Nov 29, 2012



GPU Megat[H]read - DLSS 2 ver. 3.1.1.0 - v2

repiv
Aug 13, 2009

DLSS: 3.0 The Frames Are (Not) Real

Brownie
Jul 21, 2007
The Croatian Sensation
I’m very much in the camp of “DLSS looks better than native”, but I cannot understand how people can play with frame generation. The artifacts are still so visible and distracting to me when watching clips of Spider-Man or Cyberpunk.

repiv
Aug 13, 2009

it works best at higher output frame rates (100+) so the 60fps limit of most video hosts makes it hard to demonstrate properly

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



I can't wait to see what bullshit they try to use to justify DLSS "4"

Zero VGS
Aug 16, 2002
ASK ME ABOUT HOW HUMAN LIVES THAT MADE VIDEO GAME CONTROLLERS ARE WORTH MORE
Lipstick Apathy
I just noticed that when I'm running Warframe at 4K 120Hz with DLAA (via a DLL injector that enables it), the game has been averaging about 250W, as opposed to around 190W with antialiasing disabled. I didn't realize those Tensor Cores or Optical Flow burn so much juice.

Refresh my memory, how can I set it in Afterburner so that the voltage is capped beyond a certain frequency (as in, the GPU will keep trying to boost the clocks higher but can't raise the voltage any further)?
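
(For logging the kind of power delta described above, nvidia-smi can be polled from a script; a minimal sketch, assuming nvidia-smi is on the PATH. It only measures board power, and it doesn't answer the voltage/frequency-curve question, which still has to be done in Afterburner's curve editor.)

code:
# Poll GPU board power draw once a second and print a running average.
# Assumes the NVIDIA driver's nvidia-smi utility is installed and on PATH.
import subprocess
import time

samples = []
try:
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        watts = float(out.stdout.strip().splitlines()[0])
        samples.append(watts)
        print(f"now: {watts:6.1f} W   avg: {sum(samples) / len(samples):6.1f} W")
        time.sleep(1)
except KeyboardInterrupt:
    pass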

repiv
Aug 13, 2009

SourKraut posted:

I can't wait to see what bullshit they try to use to justify DLSS "4"

i bet they're at least experimenting with incorporating all the newer generative AI techniques into a DLSS 1.0 type thing again

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
real time Stable Diffusion to draw the frames before they're immanent

Gyrotica
Nov 26, 2012

Grafted to machines your builders did not understand.
DLSS4 will draw the frames from a game better than the one you are playing.

change my name
Aug 27, 2007
Probation
Can't post for 5 hours!

repiv posted:

DLSS: 3.0 The Frames Are (Not) Real

You think that's frames you're generating?

Josh Lyman
May 24, 2009


Brownie posted:

I’m very much in the camp of “DLSS looks better than native”, but I cannot understand how people can play with frame generation. The artifacts are still so visible and distracting to me when watching clips of Spider-Man or Cyberpunk.
Even with all the crazy fast effects in D4, DLSS quality looks good to me. The main thing is it lets me use ultra textures at 1440p and max out my 165hz monitor.

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
where we're going, we won't need FSR

Truga
May 4, 2014
Lipstick Apathy

repiv posted:

it works best at higher output frame rates (100+) so the 60fps limit of most video hosts makes it hard to demonstrate properly

that's the funny thing about dlss3/framegen, to me. if you're on a monster gpu, it doesn't really do much for you because the game already plays smoothly and at a framerate high enough that the additional frame of latency is whatever, so it's hardly useful

but in places where doubling fps would actually help a ton (shitboxes pushing 40fps on a good day), the additional latency can be noticeable, and the artifacts very visible lol
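
Rough numbers on why that latency penalty bites harder at low base framerates: frame generation holds back one rendered frame so it can interpolate between it and the previous one, so the added delay is on the order of one real frame time. A toy calculation, not Nvidia's actual pipeline:

code:
# Toy model: frame generation buffers one real frame, so the added input latency
# is roughly one real frame time (ignoring the cost of the interpolation itself).
for real_fps in (40, 60, 100, 120):
    frame_time_ms = 1000 / real_fps
    presented_fps = real_fps * 2  # one generated frame per real frame
    print(f"{real_fps:>3} fps real -> ~{presented_fps} fps presented, "
          f"~{frame_time_ms:.0f} ms extra latency from the held-back frame")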

pofcorn
May 30, 2011
I am telling you right now – that motherfucking frame back there is not real.

SlowBloke
Aug 14, 2017

repiv posted:

DLSS: 3.0 The Frames Are (Not) Real

Pictured: a 4090 doing an 8K frame

njsykora
Jan 23, 2012

Robots confuse squirrels.


Truga posted:

that's the funny thing about dlss3/framegen, to me. if you're on a monster gpu, it doesn't really do much for you because the game already plays smoothly and at a framerate high enough that the additional frame of latency is whatever, so it's hardly useful

but in places where doubling fps would actually help a ton (shitboxes pushing 40fps on a good day), the additional latency can be noticeable, and the artifacts very visible lol

Also it uses a bunch of VRAM to generate those frames which you're already probably limited on if you're on a lower tier GPU lol.

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



It’s not nvidia’s fault that they can’t afford to equip their mid-tier cards with more VRAM to fully take advantage of the software features they’re pushing for their cards.

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy

pofcorn posted:

I am telling you right now – that motherfucking frame back there is not real.

Lmao

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib

Truga posted:

that's the funny thing about dlss3/framegen, to me. if you're on a monster gpu, it doesn't really do much for you because the game already plays smoothly and at a framerate high enough that the additional frame of latency is whatever, so it's hardly useful

but in places where doubling fps would actually help a ton (shitboxes pushing 40fps on a good day), the additional latency can be noticeable, and the artifacts very visible lol

The only thing I've used frame gen for is path tracing in Cyberpunk, but I found it genuinely useful for that purpose.

Branch Nvidian
Nov 29, 2012



SourKraut posted:

It’s not nvidia’s fault that they can’t afford to equip their mid-tier cards with more VRAM to fully take advantage of the software features they’re pushing for their cards.

Careful, you'll summon Paul with a 5000 word dissertation on why it's actually the consumer's fault that Nvidia can't put more vram on their cards or something.

Cygni
Nov 12, 2005

raring to post

Cheapish stuff up on the EVGA B-stock:

https://www.evga.com/products/productlist.aspx?type=8


$99 1660
$219 3060
$289 3060 Ti
$329 3070
$349 3080

Stuff normally doesn't last long.

Dr. Video Games 0031
Jul 17, 2004

There was a 3060 for around $100 and a 3080 for $350 but those went in like 20 minutes.

shrike82
Jun 11, 2005

is there typically a clear benefit to swapping in the latest DLSS version for games that aren't updated, e.g. BG3?
i switched it using DLSS Swapper and haven't noticed any improvement in IQ or performance

kliras
Mar 27, 2021
it's hard to notice most of the time. more of a "might as well" than something you should feel fomo about

steckles
Jan 14, 2006

Truga posted:

but in places where doubling fps would actually help a ton (shitboxes pushing 40fps on a good day), the additional latency can be noticeable, and the artifacts very visible lol
I'm sure future iterations of DLSS will allow for more interpolated frames per "real" one. Who knows, maybe they'll up the interpolation to 20 fake frames per real one and nVidia will advertise that the 5060 8GB or 6060 7GB can do 300fps at 6k in Cyberpunk with Path Tracing. The super advanced predictive Reflex AI needed to keep latency under control will be so good that simply moving the mouse will cause games to play themselves.
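
Extending the toy model from earlier in the thread to N generated frames per real frame: the presented rate scales with N, but new input only shows up in real frames, which still arrive at the base rate. That's why "real" 40fps with 20 fake frames in between would look like 800+ fps and still feel like 40.

code:
# Toy model: N generated frames per real frame multiplies the presented rate,
# but fresh input only lands in real frames, which still come at the base rate.
real_fps = 40
real_frame_ms = 1000 / real_fps
for n_generated in (1, 3, 20):
    presented = real_fps * (n_generated + 1)
    print(f"{n_generated:>2} generated per real: ~{presented:>4} fps presented, "
          f"new input still only every ~{real_frame_ms:.0f} ms")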

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference

quote:

MLC-LLM makes it possible to compile LLMs and deploy them on AMD GPUs using ROCm with competitive performance. More specifically, AMD Radeon™ RX 7900 XTX gives 80% of the speed of NVIDIA® GeForce RTX™ 4090 and 94% of the speed of NVIDIA® GeForce RTX™ 3090Ti for Llama2-7B/13B. Besides ROCm, our Vulkan support allows us to generalize LLM deployment to other AMD devices, for example, a SteamDeck with an AMD APU.
...
We made the following things to bring ROCm support:
  • Reuse the whole MLC pipeline for existing targets (such as CUDA and Metal), including memory planning, operator fusion, etc.
  • Reuse a generic GPU kernel optimization space written in TVM TensorIR and re-target it to AMD GPUs.
  • Reuse TVM’s ROCm code generation flow that generates low-level ROCm kernels through LLVM.
  • Finally, export generated code as a shared or static library that can be invoked by CLI, Python and REST APIs.

someone just did geohot's job for him lol

Yudo
May 15, 2003


24gb of VRAM for under $1k is going to get people interested, particularly for LLMs. This is also why I think 50xx won't have GDDR7 and some huge bus: LLM inference is all about memory capacity and memory speed. It would be a 1080ti type mistake. Every time a 4090 is bought for AI, Jensen sheds a tear.
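
A back-of-envelope version of the "inference is bandwidth-bound" argument: at batch size 1, every generated token has to stream essentially the whole weight set through the memory bus, so tokens/sec is capped at roughly bandwidth divided by model size. Spec-sheet bandwidths and 4-bit weights assumed; real throughput lands well under these ceilings, and presumably software/kernel maturity is why the linked post sees 80% rather than the ~95% the bandwidth ratio alone would suggest.

code:
# Upper bound on batch-1 decode speed: memory bandwidth / bytes read per token,
# which is roughly the size of the (quantized) weights.
GB = 1e9
cards = {                      # nominal spec-sheet bandwidth, GB/s
    "RTX 4090":    1008,
    "RTX 3090 Ti": 1008,
    "RX 7900 XTX":  960,
}
llama2_7b_q4_bytes = 7e9 * 0.5   # ~4 bits per weight -> ~3.5 GB

for name, bw in cards.items():
    ceiling = bw * GB / llama2_7b_q4_bytes
    print(f"{name:12s} ~{ceiling:4.0f} tok/s ceiling for Llama2-7B at 4-bit")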

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Yudo posted:

24gb of VRAM for under $1k is going to get people interested, particularly for LLMs. This is also why I think 50xx won't have GDDR7 and some huge bus: LLM inference is all about memory capacity and memory speed. It would be a 1080ti type mistake. Every time a 4090 is bought for AI, Jensen sheds a tear.

yeah I don't think it's a given that NVIDIA will end up owning the market long-term, the software moat only needs to be figured out once and there's a lot of financial incentive to do so. We'll eventually hit a plateau and slow down and that'll give companies a breather to design chips and build software tooling etc. I think NVIDIA rides the moments of innovation when it's raining money, because they dominate the research world. but if there's specific profitable niches then AMD can just drop in and build a clone of that, and Intel is fishing for platform-independent code with OneAPI and SyCL. Google and Amazon both have incentives to write the software that lets you pay them money for their dedicated tensor hardware too (and I'm sure they will continue to iterate on that too). Tesla has a thing too, the Dojo training architecture, designed by jim keller even, but lolelon

hot take: apple silicon macbook pro is a potentially neat platform for big-network inference, iirc 70B wants enough VRAM (>30GB iirc?) that only the A6000s and bigboi cards can inference them. but apple silicon can inference on unified VRAM very efficiently (same as the Steam Deck there). And 3060 performance but big VRAM is at least an interesting proposition even if that's all you get, not sure how the performance shakes out with shaders vs the neural engine but it'll definitely run it without memory-capacity slowdowns.
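
Sanity check on the "70B wants >30GB" figure, since weight memory is just parameter count times bytes per parameter (KV cache and activations come on top):

code:
# Rough weight-only footprint for a 70B-parameter model at common precisions.
params = 70e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, b in bytes_per_param.items():
    gb = params * b / 1e9
    print(f"{fmt}: ~{gb:.0f} GB of weights, plus KV cache / activations on top")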

you have to have enough capacity and bandwidth, yup. and it's tough to get that from a multi-GPU model. the idea of tightly-linked systems with high-performance links between their caches and addressing a lot of memory in total is kinda what's gonna make MCM tick too. Cache coherency is still the basic model of multi-socket scaling even in these modern days. Things like NVLink or Infinity Fabric Cache-Coherent interconnect/openCXL or IBM's shared cache thingy make a lot of sense if you are building the One Fast System model (like DGX Server family, or mainframes). At some point that's the only way to keep the bandwidth high enough in a multi-node/multi-chiplet system, remote has to be almost as fast as local, otherwise some classes of programs are restricted to local-only scales.

one extremely funni theory I heard for AMD multi-GCD shenanigans was that they'd basically build the thing with two cache ports. Normally one goes to cache, for multi-GCD configurations the second port goes to the other GCD and it reads its memory/cache for you. Or you could use the second port for a stacked cache and have that double your bandwidth. but because of the "remote needs to be reasonably close to local speeds" principle, a cache port and a network port are pretty much the same thing architecturally.

AMD can also make different MCDs with more PHYs if they want, it won't increase bandwidth (which will be limited by the Infinity Link) but it's higher capacity for a given GCD config. The MCD is a nice little point of indirection for system design, and the MCD concept gives them more physical fanout than NVIDIA. They could probably do 96GB 7900XTX configs if they wanted, I'd think. Clamshell 4-PHY MCDs.

Paul MaudDib fucked around with this message at 03:05 on Aug 10, 2023

Shipon
Nov 7, 2005

Yudo posted:

24gb of VRAM for under $1k is going to get people interested, particularly for LLMs. This is also why I think 50xx won't have GDDR7 and some huge bus: LLM inference is all about memory capacity and memory speed. It would be a 1080ti type mistake. Every time a 4090 is bought for AI, Jensen sheds a tear.

gddr6x was already a waste of money and didn't really offer anything for all the extra thermal difficulties it presented, gddr7 is absolutely not going to be worth it

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Shipon posted:

gddr6x was already a waste of money and didn't really offer anything for all the extra thermal difficulties it presented, gddr7 is absolutely not going to be worth it

GDDR6X has PAM4, GDDR7 only adopted PAM3 (1.5 bits per symbol instead of PAM4's 2) to save power, and it hits its bandwidth ratings in spite of this. notionally gddr7x with pam4 could perform "better" ofc, and NVIDIA seems to like to do this for the bandwidth-density (per-line/per-trace), they have consistently done this throughout the Pascal and later era for at least some cards. 7X may be a key part of the feed-the-beast picture for blackwell, 512b alone isn't enough to get you to big generational gains over 4090, but 512b and 7x etc might stack to be enough.

but the commodity memory standards have done alright overall, GDDR6 and GDDR5 both caught up to 6X/5X later in their lifecycle anyway. the commodity market iterates faster and more reliably. and yeah the power and thermal problems are getting kinda obscene with PAM4. but I also think it's always been a bit of a play for supply too - NVIDIA is big enough they need to make some big bulk orders with vendors to satisfy their order quantities (especially with build kits now being something almost everyone uses) and since they're doing a big bulk order anyway, if they need something "semicustom" to make it better for their needs... there's vendors who will provide that! and if they won't pay to update it throughout the generation, then it won't be, because nobody else is using it.
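
Putting numbers on the PAM4/PAM3 difference: per-pin data rate is symbol rate times bits per symbol, and PAM3 packs 3 bits into 2 symbols (1.5 bits/symbol) versus PAM4's 2, so at the same data rate GDDR7 has to run a higher symbol rate in exchange for fewer voltage levels to resolve. Rough arithmetic with assumed per-pin speeds:

code:
# Per-pin data rate = symbol rate * bits per symbol.
# PAM4 (GDDR6X): 4 levels -> 2 bits/symbol.
# PAM3 (GDDR7):  3 levels, 3 bits packed into 2 symbols -> 1.5 bits/symbol.
def symbol_rate_gbaud(data_rate_gbps, bits_per_symbol):
    return data_rate_gbps / bits_per_symbol

for target_gbps in (21, 24, 32):   # per-pin data rates, Gbps
    print(f"{target_gbps} Gbps/pin: PAM4 needs {symbol_rate_gbaud(target_gbps, 2.0):.1f} Gbaud, "
          f"PAM3 needs {symbol_rate_gbaud(target_gbps, 1.5):.1f} Gbaud")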

Paul MaudDib fucked around with this message at 03:16 on Aug 10, 2023

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
I was reading this thread before going to bed and had a dream that I overclocked a GTX 1650 to 3090 levels but I had to use a fire extinguisher on it afterwards

Shipon
Nov 7, 2005

Paul MaudDib posted:

GDDR6X has PAM4, GDDR7 only adopted PAM3 (1.5 bits per symbol instead of PAM4's 2) to save power, and it hits its bandwidth ratings in spite of this. notionally gddr7x with pam4 could perform "better" ofc, and NVIDIA seems to like to do this for the bandwidth-density (per-line/per-trace), they have consistently done this throughout the Pascal and later era for at least some cards. 7X may be a key part of the feed-the-beast picture for blackwell, 512b alone isn't enough to get you to big generational gains over 4090, but 512b and 7x etc might stack to be enough.

but the commodity memory standards have done alright overall, GDDR6 and GDDR5 both caught up to 6X/5X later in their lifecycle anyway. the commodity market iterates faster and more reliably. and yeah the power and thermal problems are getting kinda obscene with PAM4. but I also think it's always been a bit of a play for supply too - NVIDIA is big enough they need to make some big bulk orders with vendors to satisfy their order quantities (especially with build kits now being something almost everyone uses) and since they're doing a big bulk order anyway, if they need something "semicustom" to make it better for their needs... there's vendors who will provide that! and if they won't pay to update it throughout the generation, then it won't be.

is memory bandwidth the issue at all for cards that don't have the starved lanes of the lower tier 40 series though?

Hughmoris
Apr 21, 2007
Let's go to the abyss!

gradenko_2000 posted:

I was reading this thread before going to bed and had a dream that I overclocked a GTX 1650 to 3090 levels but I had to use a fire extinguisher on it afterwards

How many fps

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



gradenko_2000 posted:

I was reading this thread before going to bed and had a dream that I overclocked a GTX 1650 to 3090 levels but I had to use a fire extinguisher on it afterwards

The Way It's Meant to be Brazed

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Shipon posted:

is memory bandwidth the issue at all for cards that don't have the starved lanes of the lower tier 40 series though?

it's an implication of their continued monolithic design. AMD puts 2 PHYs on each MCD, times six MCDs on the 7900XTX (times 2GB per module = 24GB). That's 12 PHYs, same as NVIDIA puts on the 4090. 7900XTX is a much smaller GCD than the 4090 die, it's like half the size. and the area overhead is happening in places that aren't performance-sensitive.

AMD is already trading off along the "we need more capacity than these modules give us at a given bandwidth-density" problem, they just don't have to do it on the GCD. the MCD works like a memory expander and fans out to twice as many PHYs and a bunch of cache, and the infinity link PHY is still a bit smaller than a single GDDR PHY.

iirc measurements from locuza or someone suggested that on a 4090, 17% was PHYs, 34% was caches, and the rest was everything else. so doubling the PHYs (4060 gets 12) would put 29% of the die (34/117) into PHYs at that point. That's getting to be a fair chunk.
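
The arithmetic behind that 29%, using the rough split above and doubling only the PHY area:

code:
# Rough 4090-class die split per the estimate above, then double the PHY area
# while holding cache and logic constant.
phy, cache, logic = 17, 34, 49            # percent of the original die
new_total = 2 * phy + cache + logic       # 117 relative units
print(f"PHYs:  {2 * phy / new_total:.0%}")   # ~29%
print(f"Cache: {cache / new_total:.0%}")     # ~29%
print(f"Logic: {logic / new_total:.0%}")     # ~42%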

Like it's a definite tradeoff axis here. NVIDIA is betting on monolithic. Monolithic can't have ~30% of your one and only die eaten by PHYs out of the gate, another ~30% of cache, and only the remaining ~40% for everything doing the work. NVIDIA's gonna make the fastest memory they can make for that, because it's easier than routing a big bus (which they're gonna do for blackwell, and it'll be expensive as poo poo) and spending a ton of area on PHYs. AMD can use smaller infinity links and do whatever the gently caress they want on the other side with the MCD (cache etc). And they can have big increases in registers/L1/L2 on the GCD too if they want! But that comes with certain downsides depending on how you build it (and on what side of the link caches live on) for things like idle power/low-power.

350mm2 is not pushing it at all, they could totally go bigger if they wanted, and the MCD arrangement also gives them a lot more physical fanout to simplify the routing/placement. it's just already not very efficient on power, and some of that may have been performance misses in general, more so than the MCD approach itself. But rumors have swirled that it's gonna be RX480/5700XT style midrange-only until they can regain their footing.

now, could they have maybe not doubled PHYs like AMD, but thrown in a few more? yeah, def, you could take the 4060 and 4070 prices and slide them up $50/70 for 4060 12GB and 4070 16GB and get a viable product, if you knew upfront (rather than paying for clamshell etc). but for NVIDIA the problem is they specced and manufactured these chips and cards a long while back and the market has kinda shifted under them while the inventory bubble burned through. in Q4 2021 do you know that 8GB is going to be a problem on a 4060 (bearing in mind it'll drive up prices/costs in a soft market)? and a lot of those cards were actually built out mid-2022 on build kits bought at shortage pricing and now partners get rekt or nvidia writes a check (:lol:) or you wait for it to burn through. I've said for a long time, the ada bubble comes after the ampere bubble, lol.

the "TSMC won't let NVIDIA cancel 5nm allocation" thing was a looong time ago lol. they took some of it up with consumer production specced out back then, and just warehoused it all. and jensen never takes a loss

Paul MaudDib fucked around with this message at 07:11 on Aug 10, 2023

Yudo
May 15, 2003

Paul MaudDib posted:

yeah I don't think it's a given that NVIDIA will end up owning the market long-term, the software moat only needs to be figured out once and there's a lot of financial incentive to do so. We'll eventually hit a plateau and slow down and that'll give companies a breather to design chips and build software tooling etc. I think NVIDIA rides the moments of innovation when it's raining money, because they dominate the research world. but if there's specific profitable niches then AMD can just drop in and build a clone of that, and Intel is fishing for platform-independent code with OneAPI and SyCL. Google and Amazon both have incentives to write the software that lets you pay them money for their dedicated tensor hardware too (and I'm sure they will continue to iterate on that too). Tesla has a thing too, the Dojo training architecture, designed by jim keller even, but lolelon

hot take: apple silicon macbook pro is a potentially neat platform for big-network inference, iirc 70B wants enough VRAM (>30GB iirc?) that only the A6000s and bigboi cards can inference them. but apple silicon can inference on unified VRAM very efficiently (same as the Steam Deck there). And 3060 performance but big VRAM is at least an interesting proposition even if that's all you get, not sure how the performance shakes out with shaders vs the neural engine but it'll definitely run it without memory-capacity slowdowns.

you have to have enough capacity and bandwidth, yup. and it's tough to get that from a multi-GPU model. the idea of tightly-linked systems with high-performance links between their caches and addressing a lot of memory in total is kinda what's gonna make MCM tick too. Cache coherency is still the basic model of multi-socket scaling even in these modern days. Things like NVLink or Infinity Fabric Cache-Coherent interconnect/openCXL or IBM's shared cache thingy make a lot of sense if you are building the One Fast System model (like DGX Server family, or mainframes). At some point that's the only way to keep the bandwidth high enough in a multi-node/multi-chiplet system, remote has to be almost as fast as local, otherwise some classes of programs are restricted to local-only scales.

one extremely funni theory I heard for AMD multi-GCD shenanigans was that they'd basically build the thing with two cache ports. Normally one goes to cache, for multi-GCD configurations the second port goes to the other GCD and it reads its memory/cache for you. Or you could use the second port for a stacked cache and have that double your bandwidth. but because of the "remote needs to be reasonably close to local speeds" principle, a cache port and a network port are pretty much the same thing architecturally.

AMD can also make different MCDs with more PHYs if they want, it won't increase bandwidth (which will be limited by the Infinity Link) but it's higher capacity for a given GCD config. The MCD is a nice little point of indirection for system design, and the MCD concept gives them more physical fanout than NVIDIA. They could probably do 96GB 7900XTX configs if they wanted, I'd think. Clamshell 4-PHY MCDs.

It is also worth considering that no one really likes CUDA except Nvidia, and I think they realize now that they missed their window to assert a complete monopoly. ROCm isn't all that much better either, from what I understand. Both are still a bitch to program, and companies have to bring on talent to do it. As you suggest, platform independence is the future. Every large company in AI (and even some of the smaller ones) is going to try its hand at AI ASICs. CUDA isn't an option for them. PyTorch 2.0 also simplified its optimization workflow considerably, opening the door to targeting non-Nvidia hardware.
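
On the PyTorch 2.0 point: the same model code runs unchanged on a CUDA build or a ROCm build, since ROCm presents the GPU through the same torch.cuda interface, and torch.compile is the backend-agnostic entry point. A minimal sketch (the tiny model here is just a stand-in):

code:
# Works on a CUDA or ROCm build of PyTorch 2.x; ROCm builds expose the GPU via
# the torch.cuda namespace and set torch.version.hip.
import torch
import torch.nn as nn

backend = "ROCm/HIP" if torch.version.hip else ("CUDA" if torch.version.cuda else "CPU-only")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"build: {backend}, running on: {device}")

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
model = torch.compile(model)   # PyTorch 2.x compiler; same call regardless of vendor

x = torch.randn(8, 4096, device=device)
with torch.no_grad():
    print(model(x).shape)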

I have seen people online talk about using Macs for LLMs. They certainly have the capacity, but looking at the specs of the M2, only the M2 Max and M2 Ultra have memory bandwidth competitive with a midrange GPU. You are correct though in that 70b LLMs need around 40gb of VRAM to get out of bed, and M1 and M2 unified memory seems like one of the easiest ways of pulling that off. If AMD could stack memory as you describe, that would be a killer product. LLM inference does not need monster FP, but rather as much VRAM and VRAM speed as possible. I doubt AMD can match Nvidia in pure crunch, but that isn't as relevant here as it is in, say, AI image and video generation.

The monolithic train is reaching its final stop. New high-NA EUV machines are going to have a much smaller reticle limit than current ones, so gigantic 800mm^2 chips are not going to be a thing going forward. Nvidia knows how to do chiplets, and I am curious to see how they decide to proceed. They rule the roost at the moment, and I'm sure the big players resent that Nvidia has so much power over them. I don't think the sell for Intel or AMD would be that hard if they had merely comparable products.

Shipon posted:

gddr6x was already a waste of money and didn't really offer anything for all the extra thermal difficulties it presented, gddr7 is absolutely not going to be worth it

For games? If Nvidia wants to continue using monolithic GPUs and has fixed the margins it will accept, memory buses can't get any bigger. Cache isn't an appealing "solution" going forward either: there isn't room on the die for another Ampere-to-Lovelace type jump, and SRAM barely scales from N5 to N3. Using faster memory is a potential way to mitigate a bus size limited by the GPU's physical dimensions. Otherwise, I would agree that it isn't going to be some revelation for video games, as we can already observe how titles scale given a resolution and a set of GPUs with very different memory bandwidth (a 4070 Ti vs. a 4080, for example). For many AI tasks, however, it is a huge deal. LLM performance needs as much memory throughput as possible. Even training, which is far more FP intensive, is being hobbled not by a lack of compute but by how much that compute stalls waiting on every other part to keep up. I am not an expert on this stuff, but that to me is where faster VRAM will shine the brightest.
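
The bus-width-versus-pin-speed tradeoff in numbers: total bandwidth is bus width times per-pin data rate, so faster memory and a wider bus are interchangeable levers, with the wider bus costing PHY/die-edge area and the faster memory costing power and signal-integrity headroom. The GDDR7 figures below are assumptions, not announced parts:

code:
# Bandwidth in GB/s = (bus width in bits / 8) * per-pin data rate in Gbps.
def bandwidth_gbs(bus_bits, pin_gbps):
    return bus_bits / 8 * pin_gbps

configs = [
    ("192-bit @ 21 Gbps GDDR6X (4070 Ti-class)", 192, 21),
    ("384-bit @ 21 Gbps GDDR6X (4090-class)",    384, 21),
    ("384-bit @ 32 Gbps GDDR7 (hypothetical)",   384, 32),
    ("512-bit @ 28 Gbps GDDR7 (hypothetical)",   512, 28),
]
for label, bits, gbps in configs:
    print(f"{label:44s} ~{bandwidth_gbs(bits, gbps):5.0f} GB/s")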

Yudo fucked around with this message at 06:14 on Aug 10, 2023


shrike82
Jun 11, 2005

yeah, going to disagree with the walls of text - LLMs have entrenched the dominance of Nvidia hardware if anything. there are a lot of toy repos out there that cut down LLMs to fit and be compatible with CPUs including Apple Silicon, but it's not really useful beyond students and execs wanting something to run locally on their laptops. a) it's lossy and b) why try to penny-pinch on on-prem or private cloud hardware - businesses are happy to throw money at racks of nvidia cards to demonstrate having an AI strategy
