GPU Megat[H]read - the cores of wrath grew heavy on the die that day


steckles: Jan 14, 2006

repiv posted:

epic did a seriously impressive flex by loading up a scene from disney's moana and rendering it in realtime through nanite

I once tried to load that scene into one of my many toy ray tracers. It... did not go well.

v1ld posted:

There have to be really good reasons why this isn't done, it's pretty obvious, but would be good to know why a movie studio which has no realtime constraint on rendering a scene wouldn't pursue those kinds of approaches.

Attempts at ray re-ordering have been made since at least the mid 90s, it's just that Moore's Law usually made the problem go away if you were willing to wait a bit. Predictable brute force is usually preferable to a clever algorithm. We really do seem to be bumping up against some limits though, and just waiting another year for MOAR CORES and twice the memory for half the price isn't feasible, so there's renewed interest in these things.

On the GPU side, batching and re-ordering can be non-trivial to parallelize in a way that makes GPUs happy, so perhaps people have just been avoiding it.
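For anyone wondering what "ray re-ordering" looks like in its simplest form, here is a minimal CPU-side sketch in Python. It is an illustration only, not how any production renderer does it: the `(origin, direction)` ray tuples, the `cell` size, and the sort key (direction octant first, then a coarse grid cell of the origin) are all assumptions made up for this example.

```python
def octant(direction):
    """3-bit key built from the signs of the direction's x, y, z components."""
    x, y, z = direction
    return int(x < 0) | (int(y < 0) << 1) | (int(z < 0) << 2)


def reorder_rays(rays, cell=1.0):
    """Sort rays so neighbours in the list point the same general way
    and start in the same coarse grid cell (hypothetical coherence key)."""
    def key(ray):
        origin, direction = ray
        cell_id = tuple(int(c // cell) for c in origin)
        return (octant(direction), cell_id)

    # sorted() is stable, so rays with equal keys keep their original order.
    return sorted(rays, key=key)


rays = [
    ((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
    ((0.5, 0.5, 0.5), (-1.0, 1.0, 1.0)),
    ((0.6, 0.1, 0.2), (1.0, 2.0, 3.0)),
]
coherent = reorder_rays(rays)
```

After sorting, the two rays heading into the +x/+y/+z octant sit next to each other, so they are more likely to touch the same geometry and cache lines when traced back to back. The sort itself is the part that is awkward to do efficiently on a GPU.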

steckles fucked around with this message at 03:36 on Nov 6, 2022

# ? Nov 6, 2022 03:19


v1ld: Apr 16, 2012

I was watching that same video and what you said about scene size simply not fitting in memory struck home: trying to divvy the work across boxes won't work for path tracing since you still need to fit the full scene in on each box. Right?

So you get better compute distribution, but you're still bumping into per-box memory limits?

E: Guess what I'm asking is if the fundamental bottleneck of pathtracing re: multi-machine parallelization is the full scene has to be on each machine? Rays can hit any part of the scene, so that would seem to be the case?

v1ld fucked around with this message at 03:34 on Nov 6, 2022

# ? Nov 6, 2022 03:31

steckles: Jan 14, 2006

v1ld posted:

E: Guess what I'm asking is if the fundamental bottleneck of pathtracing re: multi-machine parallelization is the full scene has to be on each machine? Rays can hit any part of the scene, so that would seem to be the case?

Yeah, you don't know what geometry a ray is gonna hit until you trace it, so you generally need the entire scene and textures resident in memory at all times for decent performance. This is more of a management problem than a parallelization problem though. My understanding of common production pipelines is that multiple machines never coordinate on the same frame, each machine will be dedicated to a single frame and the whole cluster will be working on a bunch of different frames at once. They'll all be referencing the same geometry files though so if that won't fit in memory of all of the machines, you need to start getting clever.

VVVVV: Typically paths are traced hundreds or thousands of times per-pixel and averaged in off-line rendering. Fundamentally, path tracing would be impossible without averaging. Tracing even double digit numbers of rays per pixel and getting decent results requires nutty space magic like ReSTIR and Deep learning driven de-noising filters.
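A toy illustration of why the averaging matters, assuming a made-up "path" that returns 10.0 one time in ten and 0.0 otherwise (so the true pixel value is 1.0). `noisy_path_sample`, `render_pixel` and `pixel_noise` are hypothetical names for this sketch, not part of any renderer.

```python
import random
import statistics


def noisy_path_sample(rng):
    """Stand-in for one traced path: mostly dark, occasionally finds the light."""
    return 10.0 if rng.random() < 0.1 else 0.0  # true mean is 1.0


def render_pixel(samples_per_pixel, rng):
    """Average many independent path samples into one pixel estimate."""
    total = sum(noisy_path_sample(rng) for _ in range(samples_per_pixel))
    return total / samples_per_pixel


def pixel_noise(samples_per_pixel, trials=200, seed=0):
    """Standard deviation of the pixel estimate over repeated renders."""
    rng = random.Random(seed)
    estimates = [render_pixel(samples_per_pixel, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)
```

Comparing `pixel_noise(16)` with `pixel_noise(1024)` shows the usual Monte Carlo behaviour: noise falls roughly with the square root of the sample count, so 64 times the samples buys only about 8 times less noise. That is why low sample counts lean so heavily on smarter sampling and denoising.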

steckles fucked around with this message at 03:53 on Nov 6, 2022

# ? Nov 6, 2022 03:45

necrobobsledder: Mar 21, 2005; Lay down your soul to the gods rock 'n roll; Nap Ghost

From writing raytracers before in school and on a whim, there's issues with basic trivial parallelism methods in ray tracing due to the sheer noise issues as scene complexity increases (more bounces = more error amplification). Some rays also get really, really complex, which can screw with timings, blow out caches, and invalidate pipelines as you get bad spatial jitter issues on top. So some approaches include over-sampling and averaging, which is similar to the antialiasing issues in rasterization approaches.

John Carmack had a good (long) talk on all the physics and lighting rendering systems from 9 years ago that is still relevant to the discussion

https://www.youtube.com/watch?v=P6UKhR0T6cs

# ? Nov 6, 2022 03:47

istewart: Apr 13, 2005; Still contemplating why I didn't register here under a clever pseudonym

I was interviewing for a job earlier in the year helping test CXL memory breakout boxes, so I imagine that will become a big deal for render farms in the next couple years.

https://en.wikipedia.org/wiki/Compute_Express_Link if anybody hasn't heard about it. I guess you could call it an industry-standard take on stuff like NVLink or vendor-specific fiber channel links.

# ? Nov 6, 2022 03:56

repiv: Aug 13, 2009

v1ld posted:

Genuinely asking here, not being skeptical. It just seems like such a different problem space to games.

it is a very different problem space to games of course, we're talking many orders of magnitude difference in scale and processing power, the main point i was driving at was just that raytracing isn't just good for absolute photorealism (in response to truga)

traditional realtime rasterization techniques are unrealistic, but they're not unrealistic in a way that an artist would deliberately deviate from reality to achieve a stylized look most of the time, unless they are working backwards by art-directing around the limitations of tech

# ? Nov 6, 2022 03:57

Dr. Video Games 0031: Jul 17, 2004

The incoherence issue is part of what shader execution reordering is supposed to solve in the 40-series, right? Obviously it won't eliminate the issue, but it's supposed to provide a pretty big speedup to ray tracing just by sorting the rays beforehand.

# ? Nov 6, 2022 03:58

repiv: Aug 13, 2009

steckles posted:

Tracing even double digit numbers of rays per pixel and getting decent results requires nutty space magic like ReSTIR and Deep learning driven de-noising filters.

speaking of nutty space magic, some of the nerds here may find this overview interesting, from a channel that popped up out of nowhere with improbably high production values

https://www.youtube.com/watch?v=gsZiJeaMO48

goes over the fundamentals, then RIS (the foundational trick Q2RTX used) and then ReSTIR (the new improved trick to be used in Portal RTX/Racer RTX/Cyberpunks RT update)
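A rough sketch of the RIS idea, for those who prefer code to video: draw a handful of cheap candidates, then keep one with probability proportional to how useful it looks, using a streaming weighted reservoir. This is a simplified illustration, not the Q2RTX or ReSTIR implementation; the light dictionaries, the `target` function and the uniform candidate distribution are all assumptions for the example.

```python
import random


class Reservoir:
    """Streaming weighted reservoir holding a single sample."""

    def __init__(self):
        self.sample = None
        self.w_sum = 0.0
        self.count = 0

    def update(self, candidate, weight, rng):
        self.w_sum += weight
        self.count += 1
        # Replace the kept sample with probability weight / w_sum.
        if weight > 0 and rng.random() * self.w_sum < weight:
            self.sample = candidate

    def contribution_weight(self, target):
        """RIS weight W = w_sum / (M * target(sample)), or 0 if nothing was kept."""
        if self.sample is None or target(self.sample) == 0:
            return 0.0
        return self.w_sum / (self.count * target(self.sample))


def pick_light(lights, target, num_candidates, rng):
    """Draw candidates uniformly, keep one in proportion to target(light)."""
    reservoir = Reservoir()
    source_pdf = 1.0 / len(lights)
    for _ in range(num_candidates):
        light = rng.choice(lights)
        reservoir.update(light, target(light) / source_pdf, rng)
    return reservoir


lights = [
    {"name": "dim", "intensity": 1.0},
    {"name": "off", "intensity": 0.0},
    {"name": "bright", "intensity": 9.0},
]
rng = random.Random(42)
counts = {}
for _ in range(2000):
    chosen = pick_light(lights, lambda l: l["intensity"], 32, rng).sample
    counts[chosen["name"]] = counts.get(chosen["name"], 0) + 1
```

Because a reservoir is just a sample plus a running weight sum, it is cheap to store per pixel, which is the property ReSTIR builds on when it shares reservoirs between neighbouring pixels and across frames.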

# ? Nov 6, 2022 04:29

MarcusSA: Sep 23, 2007

repiv posted:

speaking of nutty space magic, some of the nerds here may find this overview interesting, from a channel that popped up out of nowhere with improbably high production values

https://www.youtube.com/watch?v=gsZiJeaMO48

goes over the fundamentals, then RIS (the foundational trick Q2RTX used) and then ReSTIR (the new improved trick to be used in Portal RTX/Racer RTX/Cyberpunks RT update)

This is an interesting video but I could do without the talking dog.

# ? Nov 6, 2022 04:34

shrike82: Jun 11, 2005

Lol front page of Reddit

# ? Nov 6, 2022 07:29

MarcusSA: Sep 23, 2007

shrike82 posted:

Lol front page of Reddit

Lot goin on right there

# ? Nov 6, 2022 07:31

Dr. Video Games 0031: Jul 17, 2004

shrike82 posted:

Lol front page of Reddit

happy for them

# ? Nov 6, 2022 07:38

gradenko_2000: Oct 5, 2010; HELL SERPENT; Lipstick Apathy

shrike82 posted:

Lol front page of Reddit

Dudes absolutely rock

# ? Nov 6, 2022 07:54

infraboy: Aug 15, 2002; Phungshwei!!!!!!1123

Why do other countries get the cool 4090s

# ? Nov 6, 2022 09:32

Palladium: May 8, 2012; Very Good
✔️✔️✔️✔️

infraboy posted:

Why do other countries get the cool 4090s

it doesn't look like a box, more like a bag though

# ? Nov 6, 2022 12:20

Kivi: Aug 1, 2006; I care

Snagged one ex-miner A4000 for cheap. Should be good fit for either of my cases (Fractal Design R6, early rev NCase M1) as FE's were not available around here and everything else is huge gaming poo poo. 140 W TDP :toot:

# ? Nov 6, 2022 12:32

K8.0: Feb 26, 2004; Her Majesty's 56th Regiment of Foot

While the West struggles to figure out real time path tracing, China has pioneered the real time integration of 2D and 3D waifus.

# ? Nov 6, 2022 14:16

Beef: Jul 26, 2004

Are there any Intel Arc benchmarks for Diretide around?

Most of what I can find are youtubers bitching about 4090 perf.

# ? Nov 6, 2022 14:27

repiv: Aug 13, 2009

MarcusSA posted:

This is an interesting video but I could do without the talking dog.

vtubers are the future of graphics tech im afraid

your videos will be narrated by dogs and your drivers will be written by animes

# ? Nov 6, 2022 14:38

Dr. Video Games 0031: Jul 17, 2004

shout out to that furry who made the ray tracing video though.

# ? Nov 6, 2022 14:40

sauer kraut: Oct 2, 2004

Kivi posted:

Snagged one ex-miner A4000 for cheap. Should be good fit for either of my cases (Fractal Design R6, early rev NCase M1) as FE's were not available around here and everything else is huge gaming poo poo. 140 W TDP

Can you use those with the consumer drivers? A 16GB 3070 sounds kinda great.

# ? Nov 6, 2022 14:47

Arivia: Mar 17, 2011

shrike82 posted:

Lol front page of Reddit

why is the 4090 box so huge, does it come with an extra waifu statue or something

# ? Nov 6, 2022 14:59

Dr. Video Games 0031: Jul 17, 2004

The A4000 doesn't quite match the 3070 performance, though it can get kind of close if you overclock it (source: steve walton's review). It seems there's some decent overclocking headroom there, even with the locked voltages.

Just be prepared for the card to be very loud, even at stock. It's a typical noisy workstation blower gpu.

# ? Nov 6, 2022 15:06

Truga: May 4, 2014; Lipstick Apathy

Arivia posted:

why is the 4090 box so huge, does it come with an extra waifu statue or something

https://www.youtube.com/watch?v=0frNP0qzxQc

# ? Nov 6, 2022 15:12

Kivi: Aug 1, 2006; I care

Dr. Video Games 0031 posted:

The A4000 doesn't quite match the 3070 performance, though it can get kind of close if you overclock it (source: steve walton's review). It seems there's some decent overclocking headroom there, even with the locked voltages.

Just be prepared for the card to be very loud, even at stock. It's a typical noisy workstation blower gpu.

Wait, is HWU the same as TechSpot? I watched this before buying https://www.youtube.com/watch?v=HEagFvmjW4w

I don't mind the loudness. My current card is already loud so I already game with headphones or in another room so it's fine. I plan to do some airflow mods to exhaust the air bit better.

# ? Nov 6, 2022 16:09

Subjunctive: Sep 12, 2006; ✨sparkle and shine✨

repiv posted:

speaking of nutty space magic, some of the nerds here may find this overview interesting, from a channel that popped up out of nowhere with improbably high production values

https://www.youtube.com/watch?v=gsZiJeaMO48

goes over the fundamentals, then RIS (the foundational trick Q2RTX used) and then ReSTIR (the new improved trick to be used in Portal RTX/Racer RTX/Cyberpunks RT update)

This video was great, thanks a bunch for linking it.

# ? Nov 6, 2022 16:42

OhFunny: Jun 26, 2013; EXTREMELY PISSED AT THE DNC

Kivi posted:

Wait, is HWU the same as TechSpot? I watched this before buying https://www.youtube.com/watch?v=HEagFvmjW4w

I don't mind the loudness. My current card is already loud so I already game with headphones or in another room so it's fine. I plan to do some airflow mods to exhaust the air bit better.

Yep. Hardware Unboxed's video reviews are TechSpot's written reviews or if you prefer vice versa.

# ? Nov 6, 2022 16:48

mobby_6kl: Aug 9, 2009; by Fluffdaddy

Subjunctive posted:

This video was great, thanks a bunch for linking it.

There's also a great book that goes into all the nerdy details, including math and code, in a fairly understandable way. It's not exactly new so might not address the most recent developments but still fascinating.

https://www.amazon.com/Physically-B...ps%2C187&sr=8-1

# ? Nov 6, 2022 16:56

Saukkis: May 16, 2003; Unless I'm on the inside curve pointing straight at oncoming traffic the high beams stay on and I laugh at your puny protest flashes.
I am Most Important Man. Most Important Man in the World.

v1ld posted:

I was watching that same video and what you said about scene size simply not fitting in memory struck home: trying to divvy the work across boxes won't work for path tracing since you still need to fit the full scene in on each box. Right?

I was thinking about this and I guess it would be possible to divide the scene into separate boxes and put every box on its own dedicated computer. When a ray travels between these boxes, the computers would communicate the ray information between them.

This would be quite a bit more complicated than every computer calculating one full frame, but it would reduce the memory demand quite a bit. Hard to say how big the IO demands between the computers would be and if this would reduce the accuracy of the rays. It would probably be necessary to adjust how big of a box every computer would have to take care of, or some computers would handle several fixed size boxes, maybe adjusting this dynamically.

# ? Nov 6, 2022 17:35

Subjunctive: Sep 12, 2006; ✨sparkle and shine✨

mobby_6kl posted:

There's also a great book that goes into all the nerdy details, including math and code, in a fairly understandable way. It's not exactly new so might not address the most recent developments but still fascinating.

https://www.amazon.com/Physically-B...ps%2C187&sr=8-1

I believe I own a hardback of the first printing of that book, but have never taken the time to read it. :smith:

# ? Nov 6, 2022 18:34

steckles: Jan 14, 2006

repiv posted:

speaking of nutty space magic, some of the nerds here may find this overview interesting, from a channel that popped up out of nowhere with improbably high production values

https://www.youtube.com/watch?v=gsZiJeaMO48

goes over the fundamentals, then RIS (the foundational trick Q2RTX used) and then ReSTIR (the new improved trick to be used in Portal RTX/Racer RTX/Cyberpunks RT update)

I wish that video existed when I was first working my way through the ReSTIR paper, would've saved me a few frustrating hours trying to put everything together.

Saukkis posted:

I was thinking about this and I guess it would be possible to divide the scene into separate boxes and put every box on its own dedicated computer. When a ray travels between these boxes, the computers would communicate the ray information between them.

This is exactly what Disney is working on right now with their renderer. Even with an insanely fast network, the overhead of doing it per-ray is much too high to be worthwhile. What they're doing instead is waiting until "enough" rays can be sent over the network in a batch that the overhead becomes reasonable. There still needs to be a lot of shared data, like some kind of high-level BVH so that each node knows where to send its ray queries, and the time-to-first image is much longer. With a fast computer, you can usually get a noisy but reasonable approximation of the image in a few seconds, even for huge scenes as long as they fit in memory. Once you start deferring all your rays and batching stuff up you might get the final image much faster, but you're waiting multiple minutes for the first, noisy images to hit the screen so the cost of iterating on the shot to dial it in is higher. The increase in artist time needs to be weighed against the decrease in render time though, so strictly faster isn't always better in a production environment.
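The "wait until enough rays pile up, then ship a batch" idea can be sketched roughly like this. It is a toy model, not Disney's renderer: `RayRouter`, the flat dictionary of axis-aligned boxes standing in for the shared high-level BVH, the ray dictionaries and the `send` callback are all hypothetical.

```python
from collections import defaultdict


class RayRouter:
    """Queues rays per owning node and only sends them in batches."""

    def __init__(self, node_bounds, batch_size, send):
        self.node_bounds = node_bounds  # {node_id: (lo_corner, hi_corner)}
        self.batch_size = batch_size
        self.send = send                # callable(node_id, list_of_rays)
        self.pending = defaultdict(list)

    def owner(self, point):
        """Find which node's box contains the point (stand-in for a BVH lookup)."""
        for node_id, (lo, hi) in self.node_bounds.items():
            if all(l <= p < h for l, p, h in zip(lo, point, hi)):
                return node_id
        return None

    def submit(self, ray):
        node_id = self.owner(ray["origin"])
        if node_id is None:
            return  # ray left the scene, nothing to forward
        queue = self.pending[node_id]
        queue.append(ray)
        if len(queue) >= self.batch_size:
            self.send(node_id, queue)       # one network message per batch
            self.pending[node_id] = []

    def flush(self):
        """Send whatever is left, e.g. at the end of a pass."""
        for node_id, queue in self.pending.items():
            if queue:
                self.send(node_id, queue)
        self.pending.clear()


sent = []
router = RayRouter(
    {"a": ((0, 0, 0), (1, 1, 1)), "b": ((1, 0, 0), (2, 1, 1))},
    batch_size=2,
    send=lambda node_id, batch: sent.append((node_id, len(batch))),
)
for x in (0.1, 0.2, 0.3, 1.5):
    router.submit({"origin": (x, 0.5, 0.5), "direction": (1.0, 0.0, 0.0)})
after_submit = list(sent)
router.flush()
```

The trade-off described above falls straight out of `batch_size`: bigger batches amortize the network overhead better, but nothing reaches the other node (or the screen) until a queue fills or gets flushed.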

# ? Nov 6, 2022 20:32

mobby_6kl: Aug 9, 2009; by Fluffdaddy

Subjunctive posted:

I believe I own a hardback of the first printing of that book, but have never taken the time to read it.

I've only read the few bits that were immediately useful for getting my lovely renderer working :ssh:

# ? Nov 6, 2022 22:14

karoshi: Nov 4, 2008; "Can somebody mspaint eyes on the steaming packages? TIA" yeah well fuck you too buddy, this is the best you're gonna get. Is this even "work-safe"? Let's find out!

Rumors about Navi31 having problems at the last moment to achieve the '3.3GHz-3.7GHz' range it was designed for. Fixed for Navi32, which will hit near 90% of Navi 31's compute performance. It'll make for interesting 7800 series performance. Let's see what pricing they go for for their fully enabled and operational Navi32 cards.

# ? Nov 6, 2022 22:28

Dr. Video Games 0031: Jul 17, 2004

karoshi posted:

Rumors about Navi31 having problems at the last moment to achieve the '3.3GHz-3.7GHz' range it was designed for. Fixed for Navi32, which will hit near 90% of Navi 31's compute performance. It'll make for interesting 7800 series performance. Let's see what pricing they go for for their fully enabled and operational Navi32 cards.

I find this hard to believe. It more sounds like some AMD fans are having a hard time accepting that RDNA3 isn't as powerful as they were hyping it up to be.

edit: lmao, I went to see who was spreading this rumor and found that greymon55, one of the previous AMD "leakers" (who was wrong about most things) deleted their account.

Who was the leaker who first leaked the SKU names and memory/GCD configs (20GB for the XT and 24GB for the XTX)?

edit 2: if this can be believed (the leaker himself says to take it with a grain of salt), then it seems like 3 GHz was the original target, and you shouldn't take max bios limits seriously.

Dr. Video Games 0031 fucked around with this message at 00:17 on Nov 7, 2022

# ? Nov 6, 2022 23:56

lih: May 15, 2013; Just a friendly reminder of what it looks like.

We'll do punctuation later.

karoshi posted:

Rumors about Navi31 having problems at the last moment to achieve the '3.3GHz-3.7GHz' range it was designed for. Fixed for Navi32, which will hit near 90% of Navi 31's compute performance. It'll make for interesting 7800 series performance. Let's see what pricing they go for for their fully enabled and operational Navi32 cards.

people are just speculating because some leaks (though not the only really reliable one which didn't have anything about clock speed) had been saying >3GHz for ages and then in the end we just got "the architecture is designed to reach 3GHz" with lower stock speeds, and the performance is a little underwhelming overall. the 3.7GHz claims always seemed wildly unreasonable and i'd seen it counterclaimed previously that it was just the max in the bios so not going to be anything like the actual speed but who knows how reliable any of that is

it's possible the 3GHz leaks were correct in so much as they were about the target & architecture design but they didn't actually have firmer info so it got overhyped, or just that people were just making poo poo up. the ones claiming higher certainly were.

the names + memory configs of the skus were first leaked by wccftech i think so no clear sources beyond that

lih fucked around with this message at 00:31 on Nov 7, 2022

# ? Nov 7, 2022 00:22

Paul MaudDib: May 3, 2006; TEAM NVIDIA:
FORUM POLICE

SwissArmyDruid posted:

Alleged slide from the presentation, courtesy of Videocardz.

Hmm. It doesn't seem like they were able to divest *anything* from the GPU die except the memory interface(s). And then as a result, found themselves with a bunch of spare room that they decided to whack L3 onto.

I don't.... it's not like Zen 1 block diagrams, right? Where you looked at the Zen 1 block diagram and went, "Ah, I see where the next evolution is going to be, they will take the two side-by-side 4-core CCXes and fuse them together into a single unified 8-core CCX with unified L3 cache, that they then bond with another 8-core CCX to make their new CCD." They wouldn't, of course, they'd just leave it at a single CCX per CCD. And it took them two generations to unify the L3 across all eight cores, but you could see where the tech was going.

I don't know what they do here unless they go vertical. Stacking cache on top of the MCDs is what everyone thinks, but is AMD crazy enough to stack GCDs? We might need that original pre-revision 4090 cooler if that's the case.

Note that infinity cache (L3) is moved to the MCD - that's the red boxes. This implies the cache is segmented per MCD - what you have isn't 1 big cache, it's really 6 little caches. Cache/SRAM does not shrink at 5nm so pushing those to the 6nm dies is beneficial too. The infinity cache did not get larger, but, it's higher-bandwidth.

Since typically memory bits are interleaved across the lanes (right? each access is hitting all memory lanes?) I think they'd always evict in unison, so there's no reason a segmented cache wouldn't work, but, maybe that also has implications as far as the ability to scale GCDs or memory/cache coherence model (I haven't really thought this fully through). I really think based on the patents that multi-GCD is going to follow the same "cache-coherent interconnect paradigm" that CPUs have been using for ages, but, it's possible they may go to a different coherence state-machine model since the GPU model is very read-heavy and write-seldom (not really my area and I also haven't thought this fully through either).
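To make the "six little caches" picture concrete, here is a toy address-interleaving sketch. The six MCDs come from the slide; the 256-byte interleave granularity and the simple modulo mapping are pure assumptions for illustration, and the real hardware mapping is not given anywhere in this thread.

```python
NUM_MCDS = 6            # six memory/cache dies, per the slide
INTERLEAVE_BYTES = 256  # hypothetical granularity, not a published figure


def mcd_for_address(addr):
    """Map a physical address to the MCD (and its cache slice) that owns it."""
    return (addr // INTERLEAVE_BYTES) % NUM_MCDS


# A linear 4 KiB read walks across every slice in turn.
touched = {mcd_for_address(a) for a in range(0, 4096, INTERLEAVE_BYTES)}
```

Under a mapping like this, each slice only ever caches the addresses behind its own memory controller, so any reasonably large linear access spreads across all six and the segmentation is invisible to the GCD.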

One interesting tidbit that was in the patents was that the cache was not shown as part of the memory controller in the multi-GCD patents - it was rather conspicuously on the other side of the diagram next to the interconnect. I think the short-term implication is that the GCD will explicitly manage what it wants evicted (is that a side-cache?) and maybe it broadcasts that to other caches too. Also, you can think of the cache as potentially being another client on the interconnect - the chip just has two 1tbps links (let's say) and one of them normally goes to cache and the other to the interconnect, so you can talk to the other GCD at basically cache-speed and it can talk to its cache at cache-speed. Or perhaps the cache would just be another client of the interconnect and each GCD treats its cache like a fully remote thing that it is pushing/pulling data from, and thus could ask for the other GCD's cache data directly. Maybe if you have 2 lanes what you connect to is the other GCD's cache and you communicate through the caches like a PCIe aperture/window. Or maybe since it's segmented it's really six mini-caches that the GCD manages?

They also did increase L0, L1, and L2 a lot, but, remember that's per-GCD and the GCD has to feed more shaders (each with dual-issue) as well as a lot of other units that are faster as well. NVIDIA also significantly increased L2 on Turing when they moved to dual-issue and added tensor/RT. Obviously RT is not very cacheable (although Shader Execution Reordering/Ray Binning may help a little bit) but maybe it's particularly beneficial for tensor (larger working set vs scalar execution?) or just necessary for dual-issue to perform in general. It is coincidental that both AMD and NVIDIA did dual-issue and bigger low-level caches at the same time (obviously NVIDIA could not do super giant L3 without the SRAM density of TSMC N7 - that is an unsung hero behind all of AMD's 7nm designs, they are just taking advantage of the huge cache density advantage that TSMC gave them) and I wonder what the exact synergy is there.

I just am not a believer in stacking GCD-style dies yet. I was skeptical about even stacked cache this generation when I heard it come up with Ada - "boy that's gonna be fun to pull heat through". But maybe you could do it underneath, as you say. I am also skeptical about pulling out random parts of the graphics pipeline and moving them off-die - memory/cache are a nice clear delineation, but, the "RT accelerator chiplet"/"tensor accelerator chiplet" idea never made sense to me. Those are clients of the SM engines even on NVIDIA's uarch and in AMD's model they're built into something else too iirc, it's an "accelerator instruction". These GPU partitions have been engineered forever as a huge integrated thing with lots of shared resources and scheduling etc, it seems almost impossible to pull it apart after the fact, but, maybe specific functional blocks can be pulled out. But it would have to avoid any significant performance anomalies, which gets back to things like caches - a lot of the improvement of Maxwell/Pascal/etc was making sure the right data was sitting in cache when you were done, so you could move to the next stage. If it lives on another die, that's harder, obviously if you have superfast interconnects that can link a lot of smaller caches it gets easier but I generally think having things being processed off-die (even stacked) is more challenging in a lot of subtle ways. Maybe with the side-cache idea of the patents, AMD is just moving towards the Command Processors explicitly managing what they think needs to be in the cache and that might avoid some of the problem with pulling random pieces off the GPU pipeline.

I still think the first couple multi-GCD gens are going to be more or less full GCDs with a cache-coherent interconnect like multi-socket processors. We could see stacked cache (again, especially under the GCD) and we could see multi-GCD, perhaps even together (no reason why not). But I just think first-gen multi-GCD will be two separate GCDs on a PCB, using the long-distance multisocket style links and not the low-power on-chip ones that you use with TSV/etc. It will of course be much much wider - you have to be talking almost the same speed as native memory to get good scaling in graphics, I think. That's not an inherent problem, you can just do a wider IF link, but, it's a lot of data to move.

And if you can just jump up to an actually-big GCD you can push off the multi-GCD thing for a little while longer. NVIDIA couldn't respond if AMD dropped a 600mm2 or 800mm2 GCD, with a monolithic design with the IO still on-die you can't devote enough area to "GCD". And I think it would also significantly ease some of the PCB routing problems that come along with superwide buses in the GDDR6+ era - Hawaii went to 512b and everything since has been either HBM or narrower. Maybe a 768b bus is feasible with 12xMCD because the GDDR trace routing will be a lot more "diffuse" rather than being crammed in one place.

Paul MaudDib fucked around with this message at 00:58 on Nov 7, 2022

# ? Nov 7, 2022 00:36

lih: May 15, 2013; Just a friendly reminder of what it looks like.

We'll do punctuation later.

the Angstronomics leak, which was spot on, said there is a stacked cache variant in the works that has 192MB of cache, but performance benefits are limited for the cost increase (which makes sense given how much cache & memory bandwidth it already has). will probably show up on a stupidly priced 7950 XTX in a year or something

# ? Nov 7, 2022 00:44

ZombieCrew: Apr 1, 2019

I'm looking to grab a 3080 or 3080ti if the price is right. Are there any brands that are preferred? Any to avoid?

# ? Nov 7, 2022 07:39

MarcusSA: Sep 23, 2007

ZombieCrew posted:

I'm looking to grab a 3080 or 3080ti if the price is right. Are there any brands that are preferred? Any to avoid?

Honestly not really at this point. I'm completely happy with my FE 3080.

I'd just look for the best deal you can.

# ? Nov 7, 2022 07:47


Bad Munki: Nov 4, 2008; We're all mad here.

I mean, probably not EVGA.

# ? Nov 7, 2022 07:53
