steckles
Jan 14, 2006

Regarding path tracing, I don't know that we're going to see a true path tracing focused GPU until we start getting hundreds of megabytes of L2 and like a gigabyte of L3 cache standard. Even if you could make ray queries free, it wouldn't gain a gigantic amount of performance on current architectures because of memory thrashing.

Both NVidia and AMD have put a lot of effort into making the retrieval of random bytes from huge arrays as efficient as possible in the last couple generations. I think that's had as much effect on the gen-on-gen RT improvements as increased ray/box and ray/triangle intersection performance has, but it's gonna need to be turbocharged if we want to start shooting a practical number of rays for next-level path tracing. Stuff like SER can definitely help in the narrow sense, but we'll need new engines built around keeping millions or billions of rays in flight to really make progress. Also, being able to prebuild BVHs and stream them as needed would be a benefit for all architectures even now; the fact that it's treated as a black box was a misstep in the API design.

Anyway, a true path tracing first engine would need to be something like this: Rasterize a depth/normal buffer. Using that, spawn millions or billions of ray queries. Batch those rays based on their origin and direction. Once your batches are large enough, clip them against the acceleration structure. Where a ray enters a leaf node, add it to another batch that's queued for geometry intersection. Once a batch of leaf node ray queries gets large enough, load the actual triangles associated with the BLAS node, clip the rays against them, and batch the hit position and surface normals. Once the batches of intersection data get large enough, load the relevant textures and run your surface shaders. Spawn more ray queries as needed and put them into new TLAS/BLAS batches. After running the shaders, add the computed colour attenuation to a list that's kept per-pixel and then every frame, collapse the list to generate a final color.

Basically, do one thing at a time, keep as much in L2/L3 as possible, and make sure that every request to glacially slow video memory is as contiguous as possible and can serve as many operations as possible. This is already best practice for current rasterization workloads, it's just being taken to a ridiculous extreme. It's not the kind of thing you could easily bolt onto an existing engine, nor is it the kind of thing hardware and driver level shenanigans are going to do for you. Some developer will need to be brave enough to write it from scratch. Hopefully the APIs and architectures will evolve to support such a thing, because RT as it currently exists is gonna be hard to scale otherwise.
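
To make the batching part concrete, here's a toy single-threaded C++ sketch of the "don't run a stage until its input batch is huge" pattern. Every type, number, and stage in it is invented for illustration, and a real engine would do this with GPU queues and compute dispatches rather than std::vector, but the control flow is the idea:

code:
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Accumulate items and only run the expensive stage once the batch is big,
// so whatever that stage touches (TLAS nodes, one BLAS's triangles, one
// material's textures) is loaded once per ~million items instead of per ray.
template <typename Item>
class WorkBatch {
public:
    WorkBatch(std::size_t flushSize, std::function<void(std::vector<Item>&)> process)
        : flushSize_(flushSize), process_(std::move(process)) {}

    void push(Item item) {
        items_.push_back(std::move(item));
        if (items_.size() >= flushSize_) flush();
    }
    void flush() {
        if (!items_.empty()) { process_(items_); items_.clear(); }
    }

private:
    std::size_t flushSize_;
    std::function<void(std::vector<Item>&)> process_;
    std::vector<Item> items_;
};

struct Ray { float origin[3], dir[3]; int pixel; };

int main() {
    // Stage chain: TLAS traversal -> BLAS intersection -> shading, each one a
    // batch feeding the next. The stage bodies are stubs.
    WorkBatch<Ray> shadeBatch(1 << 20, [](std::vector<Ray>&) { /* load textures, shade, spawn new rays */ });
    WorkBatch<Ray> blasBatch (1 << 20, [&](std::vector<Ray>& rays) {
        for (auto& r : rays) shadeBatch.push(r);   // pretend every ray hit a triangle
    });
    WorkBatch<Ray> tlasBatch (1 << 20, [&](std::vector<Ray>& rays) {
        for (auto& r : rays) blasBatch.push(r);    // pretend every ray reached a leaf
    });

    for (int i = 0; i < (1 << 22); ++i)            // primary rays from the depth/normal pass
        tlasBatch.push(Ray{{0, 0, 0}, {0, 0, 1}, i});
    tlasBatch.flush(); blasBatch.flush(); shadeBatch.flush();
    return 0;
}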

Shipon
Nov 7, 2005
the founders edition cards also actually look nice compared to the gaming plastic garbage the partners always put out

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

Paul MaudDib posted:

Based on data mining from the NVIDIA data leak,

steckles posted:

Regarding path tracing,

Just wanted to say again how much I love this thread. I learn so so much here.

repiv
Aug 13, 2009

batching larger amounts of work in order to extract better hardware utilization is at odds with realtime as we currently know it, you can only batch so much and stay under 16ms (or less)

unless we move towards "frameless" architectures where the GPU can churn on batches out of phase with the framerate, which sounds like a nightmare

repiv fucked around with this message at 23:32 on Apr 14, 2023

steckles
Jan 14, 2006

repiv posted:

unless we move towards "frameless" architectures where the GPU can churn on batches out of phase with the framerate, which sounds like a nightmare
With stuff like DLSS3 and temporal accumulation in general, we're already "frameless" in a limited sense. No reason current techniques wouldn't still apply. Indeed, with more data in flight, you'd have opportunities for accumulating in more intelligent places rather than adding up all the frames and hoping.

Nightmare is definitely the right word. True path tracing is gonna need to throw out a lot of what we're used to, but NVidia and AMD could give us hardware to support that. It's just not gonna look like what we have now.

SCheeseman
Apr 23, 2003

I remember a Carmack tweet talking about a theoretical framerate-less VR OLED display that displays each ray just as it's calculated, relying on human perceptual image persistence to fill the gaps.

repiv
Aug 13, 2009

SCheeseman posted:

I remember a Carmack tweet talking about a theoretical framerate-less VR OLED display that displays each ray just as it's calculated, relying on human perceptual image persistence to fill the gaps.

as steckles says though, for optimal hardware utilization you want a breadth-first/wavefront architecture where rays are processed in huge batches rather than one at a time

you fire off N million rays and don't get anything back until they're all done, and you want N to be as large as possible for the best hardware utilization

that naturally scales up to where N neatly fills one frame interval (with enough slack for denoising, postprocessing, etc) but how to scale up to larger N without tanking the framerate or causing visual issues is an open problem
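
napkin math with made-up throughput numbers, just to put a scale on the N that fits in one frame:

code:
#include <cstdio>

int main() {
    const double raysPerSecond = 2.0e9;   // hypothetical sustained throughput, not a benchmark of anything
    const double frameBudgetMs = 16.0;    // 60 fps target
    const double slackFraction = 0.25;    // reserved for denoising, post, UI

    const double raysPerFrame = raysPerSecond * (frameBudgetMs / 1000.0) * (1.0 - slackFraction);
    std::printf("N = %.0f million rays per frame\n", raysPerFrame / 1e6);   // ~24 million with these numbers
    return 0;
}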

repiv fucked around with this message at 00:06 on Apr 15, 2023

Truga
May 4, 2014
Lipstick Apathy

Subjunctive posted:

the partners could branch out, like my motherboard vendor trying to sell me cloud storage (what?)

if it's not gigabyte offering you a gigabyte of free space i'm gonna be disappointed

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Shipon posted:

the founders edition cards also actually look nice compared to the gaming plastic garbage the partners always put out

100%, I said this when Ampere came out, those are fuckin sharp looking cards. That's Apple-level design language right there, you can tell exactly who their customer is: mature adults who don't want GAMERZZZ design language, just give me something that says "fast and futuristic" without being tacky. I totally bet they outsourced that to an actual industrial design house, I know BMW did some cases for Asrock etc and I bet NVIDIA went out to someone who specializes in that and had a concept done.

20 series were ok but 30-series knocked it out of the park. Personally don't care for the chrome highlights on the 4070 though. The copper brown on 4090 is... whatever. It's fine. 30-series were great though.

I am actually not that much of a fan of the coolers themselves though. It's hard to make a direct comparison since they're also by far the smallest cards in every generation, apart from actual single-fan mITX cards. But really the noise-normalized performance is not all that great, and they have a pretty distinct whine especially if you spin them up.

I think part of the problem is the semi-blower design. It's still a blower, it still has to spin at relatively high speed to generate a lot of static pressure. Just not quite as much as if it was pushing through the whole card.

The flow-through design is very nice. I liked it with Fury and Vega cards and I still like it now, blow-through is an aerodynamic/thermodynamic improvement over axial designs. It is unfortunate that it came just as sandwich cases took off, because it blows right into the motherboard, and it (again) happens to be an otherwise desirable card due to being a high-production-values "SFF" (ish) card with great VRMs/etc on the smaller side of things. Buildzoid always has great things to say - with apologies to Scrubs it's "yeah they grab a handful of premium 70A smart power stages and throw them at the board and whatever sticks that's the dosage".

Maybe a half-axial half-blowthrough would be better than half-blower half-blowthrough. Would fix the whine problem. And it is obvious that they know it's a problem: the later FE designs tend to have "bulged" fans where the fan runs right into the silver highlighting because they pushed the fan size a bit. Especially with fans there simply is no replacement for displacement; bigger fans have better acoustic profiles, and for a given airflow they run at a lower noise level.

A quad-slot partner card does better than the 2.5-slot NVIDIA card. But that's kind of obviously a given too. There's no replacement for displacement in heatsink size either.

Paul MaudDib fucked around with this message at 05:47 on Apr 15, 2023

hobbesmaster
Jan 28, 2008

Paul MaudDib posted:

The partners grumble a lot but the reality is they can make and sell even triple-fan cards at MSRP, they just don't like to. Especially once the boat-freight shipments come in - all the early shipments are air-freighted and that's an additional cost. That's part of the justification for why partners get to overcharge you at launch - air-freighted rush-production cards, high demand, limited supply, whatever. But as much as they grumble about it, they actually really ought to be hitting MSRP after a couple months and they really hate that NVIDIA is twisting their arms and making them do it. A big cooler (the 600w vs the 450w) costs like $10 more in bulk according to Igor. The partners just like that because they can charge you $150 more for it.

There's this weird tension where people want more margin for partners but they also don't want to pay higher prices, but, the reality is a lot of the higher prices come from the partners gouging you. They can sell at MSRP and make a profit, even on a fancy card - they're doing it right now on 4090s. Everything above that is gravy.

When EVGA withdrew, the rumor was 5-10% margin. That is an extremely narrow margin for a product like a GPU.
https://www.igorslab.de/en/evga-pulls-the-plug-with-loud-bang-yet-it-has-long-been-editorial/

At least with GDDR6X, nvidia was selling the GPU core and memory as a non-negotiable “kit”. That’s going to be some massive percentage of the BOM cost. As nvidia knows exactly what it costs to produce the FE, they were likely milking their partners for everything they were worth. At least at the height of the crypto stuff.

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

Truga posted:

if it's not gigabyte offering you a gigabyte of free space i'm gonna be disappointed

ASUS, 200GB—we’re both disappointed

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

steckles posted:

Anyway, a true path tracing first engine would need to be something like this: Rasterize a depth/normal buffer. Using that, spawn millions or billions of ray queries. Batch those rays based on their origin and direction. Once your batches are large enough, clip them against the acceleration structure. Where a ray enters a leaf node, add it to another batch that's queued for geometry intersection. Once a batch of leaf node ray queries gets large enough, load the actual triangles associated with the BLAS node, clip the rays against them, and batch the hit position and surface normals. Once the batches of intersection data get large enough, load the relevant textures and run your surface shaders. Spawn more ray queries as needed and put them into new TLAS/BLAS batches. After running the shaders, add the computed colour attenuation to a list that's kept per-pixel and then every frame, collapse the list to generate a final color.

Basically to do one thing at a time, keep as much in L2/L3 as possible, and make sure that every request to glacially slow video memory is as contiguous as possible and can serve as many operations as possible. This is already best practice for current rasterization workloads, it's just being taken to a ridiculous extreme. It's not the kind of thing you could easily bolt onto an existing engine, nor is it the kind of thing hardware and driver level shenanigans are going to do for you. Some developer will need to be brave enough write it from scratch. Hopefully the APIs and architectures will evolve to support such a thing, because RT as it currently exists is gonna be hard to scale otherwise.

I'm going to chew on this for a while but I'd really encourage you to look at this Intel video; it's super simple and lays out very clearly what their assessment of the problem is and what their proposed solutions to it are. I'd be super curious to know what your thoughts are around this and if it'd help your idea.

https://www.youtube.com/watch?v=SA1yvWs3lHU&t=289s

I'll summarize it (semi-)briefly as:

Problems:
  • Divergence (obviously) reduces the number of lanes doing effective work at any given time
  • For a synchronous call, the average latency increases with the number of lanes (because you have to wait for the last ray that goes the farthest)
  • Generally the results are not predictable as to what will happen with a ray hit (different materials etc)
  • Some tasks are so sparse that you're tying up a whole warp waiting for them

Solutions:
  • Lower warp size (wave8)
  • Shader Execution Reordering
  • Async dispatch (sounds like a promise/future model) with Shader Execution Reordering happening behind the scenes

Obviously this is tremendously expensive in transistor terms: it's not shocking at all that Intel is using the transistors of a 3070 to compete with a 3050 or whatever it is! It's an absolute ton of additional SM complexity/scheduler overhead/etc. But cache isn't going to shrink much between 6nm and 3nm, while logic is. And so all that SM complexity is going to get a lot cheaper, while cache isn't going to get much better in the meantime. That's my pet theory at least: Arc looks like a design aimed a couple nodes ahead of where it is. And it wouldn't be surprising if NVIDIA recognized the problem too - 4090 benefits from much higher clocks but 3090 had fairly poor SM scaling, and 2080 Ti wasn't really that great either. AMD had the same problem with wave64 with GCN: Fiji and Vega just didn't scale to 64CU very well, and they switched to wave32. I think at some point everyone is going to have to shrink wave size again.

To some extent the things you're talking about are orthogonal: even if you had that, it would be desirable to increase the percentage of lanes doing effective work, and increase the number of lanes doing aligned access. And an async facility is really kind of what you need for that "massively increase the number of requests in-flight" too. If you are just tossing tasks into a queue (and especially if you don't ever wait on the result!) then the scheduler can just batch up a bunch of them, slice them in some efficient ways, and run them. Iterating over the raster image and tossing off async tasks is exactly what you wanted, right? It just doesn't work without some kind of realignment facility.
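
Here's a toy C++ sketch of the promise/future queue model I mean. Everything in it is invented for illustration (the sort key is a stand-in for whatever realignment the scheduler actually does, and the "trace" just fabricates a hit), but it shows the shape of fire-now, batch-behind-the-scenes, wait-later:

code:
#include <algorithm>
#include <cmath>
#include <future>
#include <utility>
#include <vector>

struct RayQuery { float origin[3], dir[3]; };
struct Hit      { float t; int materialId; };

class AsyncRayQueue {
public:
    // Caller gets a future immediately; nothing is traced yet.
    std::future<Hit> trace(RayQuery q) {
        pending_.emplace_back(std::move(q), std::promise<Hit>{});
        return pending_.back().second.get_future();
    }
    // The "scheduler": sort the accumulated queries into a more coherent
    // order, run them as one big batch, then fulfil every promise.
    void flush() {
        std::sort(pending_.begin(), pending_.end(), [](const auto& a, const auto& b) {
            return a.first.dir[2] < b.first.dir[2];   // stand-in coherence key
        });
        for (auto& [query, promise] : pending_)
            promise.set_value(Hit{std::hypot(query.dir[0], query.dir[1], query.dir[2]), 0});   // fake hit
        pending_.clear();
    }
private:
    std::vector<std::pair<RayQuery, std::promise<Hit>>> pending_;
};

int main() {
    AsyncRayQueue queue;
    std::vector<std::future<Hit>> results;
    for (int i = 0; i < 1024; ++i)                   // toss tasks into the queue, don't wait
        results.push_back(queue.trace(RayQuery{{0, 0, 0}, {0, 0, 1}}));
    queue.flush();                                   // scheduler batches, reorders, runs
    return results.front().get().materialId;         // only now does anyone block on a result
}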

Paul MaudDib fucked around with this message at 03:08 on Apr 15, 2023

Arivia
Mar 17, 2011
Give me a gpu designed by Jony Ive. Only one display output and it’s usb-c.

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

HyperCard Shading Language

repiv
Aug 13, 2009


intel are doing cool things with their architecture but unfortunately i think they're going to struggle to get mileage out of it in the real world, coming from a minority market share. it even comes up in that presentation:

they want developers to use the original DXR 1.0 model, as it gives the driver/hardware full control over scheduling, but RDNA doesn't do anything clever with DXR 1.0 so devs are encouraged to use the simpler DXR 1.1 model and handle sorting/binning manually

everyone is doing something different with their raytracing architectures, and it's bleeding into the programming model, so to extract the most performance from all hardware, engines will need multiple codepaths. in practice they'll usually just do what's fastest on AMD.
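
the manual sorting/binning from the DXR 1.1 path amounts to something like this (toy c++, not any real engine's code): collect the hit records from your inline ray queries, bucket them by material, then shade one bucket at a time so each material's shader and textures only get pulled in once

code:
#include <cstdio>
#include <unordered_map>
#include <vector>

struct HitRecord { int pixel; int materialId; float t; };

int main() {
    // pretend these came back from a pile of inline ray queries
    std::vector<HitRecord> hits = {
        {0, 7, 1.5f}, {1, 2, 0.9f}, {2, 7, 3.1f}, {3, 2, 2.2f}, {4, 5, 0.4f},
    };

    // bin by material so divergent per-material shading becomes a series of
    // coherent passes instead of one pass that thrashes between materials
    std::unordered_map<int, std::vector<HitRecord>> bins;
    for (const auto& h : hits) bins[h.materialId].push_back(h);

    for (const auto& [materialId, bucket] : bins) {
        // "shade" everything that hit this material in one go
        std::printf("material %d: %zu hits\n", materialId, bucket.size());
    }
    return 0;
}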

steckles
Jan 14, 2006

Paul MaudDib posted:

Obviously this is tremendously expensive in transistor terms: it's not shocking at all that Intel is using the transistors of a 3070 to compete with a 3050 or whatever it is! It's an absolute ton of additional SM complexity/scheduler overhead/etc. But cache isn't going to shrink much between 6nm and 3nm, but logic is. And so all that SM complexity is going to get a lot cheaper, while cache isn't going to get much better in the meantime. That's my pet theory at least, Arc looks like a design aimed a couple nodes ahead of where it is. And it wouldn't be surprising if NVIDIA recognized the problem too - 4090 benefits from much higher clocks but 3090 had fairly poor SM scaling, and 2080 Ti wasn't really that great either. AMD had the same problem with wave64 with GCN, Fiji and Vega just didn't scale to 64CU very well, and they switched to wave32. I think at some point everyone is going to have to shrink wave size again.

To some extent the things you're talking about are orthogonal: even if you had that, it would be desirable to increase the percentage of lanes doing effective work, and increase the number of lanes doing aligned access. And an async facility is really kind of what you need for that "massively increase the number of requests in-flight" too. If you are just tossing tasks into a queue (and especially if you don't ever wait on the result!) then the scheduler can just batch up a bunch of them, slice them in some efficient ways, and run them. Iterating over the raster image and tossing off async tasks is exactly what you wanted, right? It just doesn't work without some kind of realignment facility.
I think Intel's solution is a very logical way to handle the current way of doing raytracing, as an add-on to a traditional workload. Lots of tools to help ensure you've got something to work on while waiting for your ray queries to return and making sure that a stalled instruction doesn't lock up too much of the GPU. That's a smart way to do things as you'll typically have non-RT shaders to run, post processing, rasterizing new shadow maps, sampling textures, rasterizing new micropolys for Nanite, denoising, and all kinds of other things in a typical engine.

When the memory required by the raytracing step gets too large though, as it would with true path tracing, all the other work you could be scheduling gets stalled because what it needs isn't in cache any more. The short term solution to that is obviously more cache, but I doubt you could ever add enough that any sort of hardware level scheduling could hope to remain tenable. Ultimately, ray tracing as it exists now on GPUs is basically rays starting in random places going in random directions needing random geometry and random textures from random places in video memory. That's just an insanely hard thing to schedule for and we're gonna get more out of reducing the randomness to start with.

We really haven't even scratched the surface though and there are so many different ways we could be using visibility queries that haven't been experimented with yet. Nobody has really come up with a good LOD method for raytracing either. Intel showed a method of stochastically picking a LOD on a per-pixel basis which looked great, but would be hard or impossible to implement with the current APIs. While I'm at it, one dumb idea I had was to treat a nanite mesh as the input for an old skool finite element radiosity solver. Basically have all the patches iteratively swap radiance with each other and store it in like some Gaussian Mixture Model or Spherical Harmonic nonsense as part of the surface properties. I'm sure there would be a billion corner cases, but it'd be an interesting experiment.

steckles fucked around with this message at 02:16 on Apr 15, 2023

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Arivia posted:

Give me a gpu designed by Jony Ive. Only one display output and it’s usb-c.

GPUs by BALENCIAGA

Dogen
May 5, 2002

Bury my body down by the highwayside, so that my old evil spirit can get a Greyhound bus and ride
Pininfarina is out of cars and into GPUs!

repiv
Aug 13, 2009

Arivia posted:

Give me a gpu designed by Jony Ive. Only one display output and it’s usb-c.

no display outputs, only a wifi antenna for connecting to a miracast display

Kazinsal
Dec 13, 2011

Dogen posted:

Pininfarina is out of cars and into GPUs!

I'd buy an RTX 5090 NSX Edition.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

repiv posted:

intel are doing cool things with their architecture but unfortunately i think they're going to struggle to get mileage out of it in the real world, coming from a minority market share. it even comes up in that presentation:

they want developers to use the original DXR 1.0 model, as it gives the driver/hardware full control over scheduling, but RDNA doesn't do anything clever with DXR 1.0 so devs are encouraged to use the simpler DXR 1.1 model and handle sorting/binning manually

everyone is doing something different with their raytracing architectures, and it's bleeding into the programming model, so to extract the most performance from all hardware engines will need multiple codepaths. in practice they'll usually just do what's fastest on AMD.

yeah probably a fair criticism and I don't really follow the DXR generations closely enough to know what's going on there. I don't really follow Khronos politics but I probably should.

it's cool from a GPGPU perspective at least! an async task/promise queue is a really good model for a lot of GPGPU stuff imo. There are a lot of things that just don't work very well right now because they're very sparse and don't really keep a GPU busy... but if you batched and aligned them, they would be fine.

the other class of problems that don't work well is ones that require a lot of state per-thread and that sort of gets to the cold shower steckles gave me:

steckles posted:

I think Intel's solution is a very logical way to handle the current way of doing raytracing, as an add-on to a traditional workload. Lots of tools to help ensure you've got something to work on while waiting for your ray queries to return and making sure that a stalled instruction doesn't lock up too much of the GPU. That's a smart way to do things as you'll typically have non-RT shaders to run, post processing, rasterizing new shadow maps, sampling textures, rasterizing new micropolys for Nanite, denoising, and all kinds of other things in a typical engine. When the memory required by the raytracing step gets too large though, as it would with true path tracing, all the other work you could be scheduling gets stalled because what it needs isn't in cache any more. The short term solution to that is obviously more cache, but I doubt you could ever add enough that any sort of hardware level scheduling could hope to remain tenable.

is there a technical reason the async scheduling and reordering needs to happen on the GPU at all? If the state-per-ray is relatively minimal (position, vector, luminosity/whatever) then can't you just immediately spill to CPU, schedule/reorder, and dump over batches? Or spill certain parts of the rayspace, like spill unless it's some area or some vector-direction that is being modeled right then? I realize of course that may not be much of a useful coherence/alignment in itself since ray behavior is so random but the point is just to define some region that you're currently working on and to spill everything else until later.

I guess with a sufficient number of rays/bounces this devolves to copying raylists to the GPU, doing BVH traversal, and copying the result back, and of course there may not be enough work there compared to pcie bandwidth; it may not be worthwhile. On the other hand, for a purely path-traced game this means that you can just partition your scene across multiple GPUs and get good scaling. And if the collective amount of VRAM is sufficient, that's fine, you don't need every GPU to hold the full queue (holding the full BVH tree would be really nice of course). Like if total VRAM can hold it all, you can totally do a multi-GCD/MCM design with path-tracing. (I think that's been remarked before, it's the reason why it works for big pixar movies and stuff too, right?)

I wonder if spilling favors L1/L2 or L3 - NVIDIA chose the former and AMD chose the latter. I’m guessing nvidia probably needed some extra for Shader Execution Reordering to work, perhaps a reason they went with that over an L3. I actually don't know if AMD's thing is a pure data cache or an inclusive cache or what. And actually from what I remember they may be using it as a side cache and their firmware can choose what to do with it, at least that was kinda suggested by an incidental aspect of one of their patents.

I wonder if AMD's thing about memory was because they modeled path-tracing (or even if it's just self-evident that there's too little VRAM headroom to spill much) and they think path-tracing is going to do ironically poorly on NVIDIA's GPUs due to VRAM limitations.

That's really the money question around your objections in a practical sense I think - are we talking gigs of ray-state, tens of gigs of ray-state, or hundreds of gigs of ray-state for useful path-tracing? Obviously not too helpful if path-tracing requires people to have 128GB or 256GB of system memory even if it's feasible at a GPU level.
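
Napkin math with an invented per-ray footprint, just to put the orders of magnitude on the table:

code:
#include <cstdio>

int main() {
    const double bytesPerRay = 48.0;   // origin + direction + throughput + pixel id, roughly (my guess, not a measured number)
    for (double raysInFlight : {1e8, 1e9, 1e10}) {
        std::printf("%.0e rays in flight -> %.1f GB of ray state\n",
                    raysInFlight, raysInFlight * bytesPerRay / 1e9);
    }
    return 0;
}

At 48 bytes a ray, 100M rays in flight is ~5GB, a billion is ~48GB, ten billion is ~480GB. "Millions or billions in flight" spans everything from fits-in-VRAM to needs-system-memory-or-worse.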

steckles posted:

Ultimately, ray tracing as it exists now on GPUs is basically rays starting in random places going in random directions needing random geometry and random textures from random places in video memory. That's just an insanely hard thing to schedule for and we're gonna get more out of reducing the randomness to start with.

:yea: Obviously a lot better to avoid randomness than to handle it better. But I mean, how is that really possible with raytracing given the positions and angles are essentially random?

you could get maybe one good layer of coherence by sorting by ray source, but after that it's effectively random as far as I've ever heard; you just can't predict bounces in a useful way. Aligning by material or some other "hints" seems about as good as it gets.

steckles posted:

We really haven't even scratched the surface though and there are so many different ways we could be using visibility queries that haven't been experimented with yet. Nobody had really come up with a good LOD method for raytracing either. Intel showed a method of stochastically picking a LOD on a per-pixel basis which looked great, but would be hard or impossible to implement with the current APIs. While I'm at it, one dumb idea I had was to treat a nanite mesh as the input for an old skool finite element radiosity solver. Basically have all the patches iteratively swap radiance with each other and store it in like some Gaussian Mixture Model or Spherical Harmonic nonsense as part of the surface properties. I'm sure there would be a billion corner cases, but it'd be an interesting experiment.

Over my head, my background is GPGPU not graphics. :) But yea and I loved the :spergin: speech from the Nanite guy. It's really interesting to see a compsci-theory-based approach to dynamic LOD and data structures at each level.

What would be the point of having patches swap luminosity with each other, I guess you're fuzzing... something? And that produces a statistically useful property in the rays/bounces somehow?

Paul MaudDib fucked around with this message at 05:06 on Apr 15, 2023

8-bit Miniboss
May 24, 2005

CORPO COPS CAME FOR MY :filez:

Paul MaudDib posted:

100%, I said this when Ampere came out, those are fuckin sharp looking cards. That's Apple-level design language right there, you can tell exactly who their customer is: mature adults who don't want GAMERZZZ design language, just give me something that says "fast and futuristic" without being tacky. I totally bet they outsourced that to an actual industrial design house, I know BMW did some cases for Asrock etc and I bet NVIDIA went out to someone who specializes in that and had a concept done.

20 series were ok but 30-series knocked it out of the park. Personally don't care for the chrome highlights on the 4070 though. The brown on 4090 is... whatever. It's fine. 30-series were great though.

I am actually not that much of a fan of the coolers themselves though. It's hard to make a direct comparison since they're also by far the smallest cards in every generation, apart from actual single-fan mITX cards. But really the noise-normalized performance is not all that great, and they have a pretty distinct whine especially if you spin them up.

I think part of the problem is the semi-blower design. It's still a blower, it still has to spin at relatively high speed to generate a lot of static pressure. Just not quite as much as if it was pushing through the whole card.

The flow-through design is very nice. I liked it with Fury and Vega cards and I still like it now, blow-through is an aerodynamic/thermodynamic improvement over axial designs. It is unfortunate that it came just as sandwich cases took off, because it blows right into the motherboard, and it (again) happens to be an otherwise desirable card due to being a high-production-values card with great VRMs/etc on the smaller side of things. Buildzoid always has great things to say - with apologies to Scrubs it's "yeah they grab a handful of premium 70A smart power stages and throw them at the board and whatever sticks that's the dosage".

Maybe a half-axial half-blowthrough would be better than half-blower half-blowthrough. Would fix the whine problem. And it is obvious that they know it's a problem, the later FE designs tend to have "bulged" fans where it's going right into the silver highlighting because they pushed the fan size a bit. Especially with fans there simply is no replacement for displacement, bigger fans have better acoustic profiles even for a given noise level it's a lower noise.

A quad-slot partner card does better than the 2.5-slot NVIDIA card. But that's kind of obviously a given too. There's no replacement for displacement in heatsink size either.

The big thing for me is the cross-frame design and the PCI bracket anchoring into it, which makes it one of the most structurally sound video cards to date. The 3080 FE I had was solid, as is the 4080 FE that replaced it.

repiv
Aug 13, 2009

Paul MaudDib posted:

:yea: Obviously a lot better to avoid randomness than to handle it better. But I mean, how is that really possible with raytracing given the positions and angles are essentially random?

yeah, surely you want to maximise randomness up front in order to extract the broadest set of signals from the fewest possible rays?

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

I don’t understand why the rays have to be visited in random order, even if they’re composed of randomly sampled source vertices and directions. After the rays are generated can’t they be sorted into a useful traversal order? Even doing that on subsets of the rays if generating them up front isn’t feasible (with billions of rays it would add up) seems better than a really random traversal. Maybe math doesn’t give us a good way to sort vectors quickly or something? Would surprise me but I barely have high school math.

repiv
Aug 13, 2009

sorting the random rays before tracing them can help claw back some coherence, but steckles mentioned the goal being to make rays less random in the first place and I'm struggling to think how that would work out

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

I did always wonder why ray tracing used random rays rather than just iterating the light sources, generating sorted sample sets, and walking them. Anything that doesn’t get hit is in shadow, badabing.

steckles
Jan 14, 2006

Paul MaudDib posted:

is there a technical reason the async scheduling and reordering needs to happen on the GPU at all? If the state-per-ray is relatively minimal (position, vector, luminosity/whatever) then can't you just immediately spill to CPU, schedule/reorder, and dump over batches? Or spill certain parts of the rayspace, like spill unless it's some area or some vector-direction that is being modeled right then? I realize of course that may not be much of a useful coherence/alignment in itself since ray behavior is so random but the point is just to define some region that you're currently working on and to spill everything else until later.
I suppose, at a high level, portions of the scheduling could happen on the CPU. Bundling rays based on origin/direction and clipping them into the TLAS on the CPU and then queuing BLAS/intersection work on the GPU might work. There's been a lot of work on "out-of-core" raytracing already, so I'm sure you could jimmy something together. As you say though, you'd probably run out of PCIE bandwidth pretty quick. It's an idea that might have more legs in a system with unified memory like the consoles or some kind of hypothetical raytracing-having Apple M3 chip.

Paul MaudDib posted:

On the other hand for a purely path-traced game this means that you can just partition your scene across multiple GPUs and get good scaling. And if the collective amount of VRAM is sufficient, that's fine, you don't need every GPU to hold the full queue (holding the full BVH tree would be really nice of course). Like if total VRAM can hold it all, you can totally do a multi-GCD/MCM design with path-tracing. (I think that's been remarked before, it's the reason why it works for big pixar movies and stuff too, right?)
If you were doing explicit multi-GPU and all the geometry and BVHs weren't resident in every GPU's memory, then the need for batching and extreme async would probably be even greater. Needing to go to another GPU, or worse, system memory, would be so expensive that you'd want to make sure you were servicing huge numbers of rays to make it worthwhile. The alternative, sending rays to the GPU where the data is, has the same issue. Each sent ray would have huge overhead, so you'd want to do as many rays at once as you could.

Paul MaudDib posted:

I wonder if spilling favors L1/L2 or L3 - NVIDIA chose the former and AMD chose the latter. I’m guessing nvidia probably needed some extra for Shader Execution Reordering to work, perhaps a reason they went with that over an L3.
SER or any sort of OOO needs as much data as possible in nearby caches to work effectively, so NVidia upping the L1 and L2 to support it in the 4000 series makes sense. I wonder when we're gonna start seeing huge re-order buffers and big instruction caches in SMs like on CPUs.

Paul MaudDib posted:

I wonder if AMD's thing about memory was because they modeled path-tracing (or even it's just self-evident that there's too little VRAM headroom to spill much) and they think path-tracing is going to do ironically poorly on NVIDIA's GPUs due to VRAM limitations.
I know the meme is that AMD hired somebody's incompetent nephew to handle raytracing in RDNA2/3, but some of the instructions in the ISA for streaming BVH data, the hardware units that can do either texture sampling or ray intersection, and the huge L3 all point to a bet that massively async raytracing would be standard in the future. Who knows, maybe they're right, there will be more Fine Wine™ and a 7900xtx will run 2030 games at 6fps while the 4090 will only run them at 4fps.

Paul MaudDib posted:

That's really the money question around your objections in a practical sense I think - are we talking gigs of ray-state, tens of gigs of ray-state, or hundreds of gigs of ray-state for useful path-tracing? Obviously not too helpful if path-tracing requires people to have 128GB or 256GB of system memory even if it's feasible at a GPU level.
I wouldn't call them objections so much as observations. I've written at least one ray tracer per year since 2005 and have been following the field for longer than that and the eternal struggle has always been about keeping stuff in cache for as long as possible and maximizing memory access coherence. How much state do we need? I dunno, it's like GPU cores or memory, devs will find a way to use all of it no matter how much is there.

Paul MaudDib posted:

:yea: Obviously a lot better to avoid randomness than to handle it better. But I mean, how is that really possible with raytracing given the positions and angles are essentially random?
Well, that's where batching comes in: you don't start tracing until you've got enough rays to make it worthwhile. Sort rays into a 5D structure and don't trace a bin until there are enough rays in it to justify it. This works really well off-line, but obviously there are constraints in the real-time world that would make it difficult to implement in its purest form.
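
Something like this toy C++ key is what I mean by a 5D structure: three origin dimensions plus two direction angles, with completely arbitrary grid resolutions. Purely illustrative, nothing tuned:

code:
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };   // dir assumed normalized

// Quantize origin into a coarse world grid and direction into theta/phi bins,
// then pack the five indices into one 64-bit sort/bin key.
std::uint64_t binKey5D(const Ray& r, float cellSize, int dirBins) {
    auto cell = [&](float x) {
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(std::floor(x / cellSize)) + 2048) & 0xFFF;
    };
    const float pi = 3.14159265f;
    float theta = std::acos(std::fmax(-1.0f, std::fmin(1.0f, r.dir[2])));   // [0, pi]
    float phi   = std::atan2(r.dir[1], r.dir[0]) + pi;                      // [0, 2*pi]
    std::uint64_t t = static_cast<std::uint64_t>(theta / pi * (dirBins - 1));
    std::uint64_t p = static_cast<std::uint64_t>(phi / (2.0f * pi) * (dirBins - 1));
    return (cell(r.origin[0]) << 52) | (cell(r.origin[1]) << 40) |
           (cell(r.origin[2]) << 28) | (t << 14) | p;
}

int main() {
    std::unordered_map<std::uint64_t, std::vector<Ray>> bins;
    const std::size_t traceThreshold = 1 << 16;   // don't trace a bin until it's this full

    Ray r{{10.f, 2.f, -3.f}, {0.f, 0.f, 1.f}};
    auto& bin = bins[binKey5D(r, /*cellSize=*/4.0f, /*dirBins=*/64)];
    bin.push_back(r);
    if (bin.size() >= traceThreshold) {
        // traceBatch(bin): every ray in here starts near the same place and
        // points roughly the same way, so they touch similar BVH nodes.
        bin.clear();
    }
    return 0;
}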

Paul MaudDib posted:

What would be the point of having patches swap luminosity with each other, I guess you're fuzzing... something? And that produces a statistically useful property in the rays/bounces somehow?
Radiosity is a way of encoding scene radiance on the surface geometry, and outside of the visibility term, you can calculate how much light a pair of triangles send to each other with a quite simple equation. After calculation, you'd just query any given micro triangle for the incoming luminance and you never need to actually trace a ray from the frame buffer. Like I said, probably a million corner cases that would make it a pain, but it'd be a fun experiment.
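
For reference, the point-to-point version of that equation between two small patches is roughly F = cos(theta_i) * cos(theta_j) * A_j / (pi * r^2), ignoring visibility. A toy C++ version (the patch values are made up, and treating whole patches as points is exactly the kind of shortcut that breeds those corner cases):

code:
#include <cmath>
#include <cstdio>

struct Patch { float center[3]; float normal[3]; float area; };

// Point-to-point form factor approximation: the fraction of light leaving a
// small patch a that arrives at patch b, ignoring occlusion between them.
float formFactor(const Patch& a, const Patch& b) {
    float d[3] = {b.center[0] - a.center[0], b.center[1] - a.center[1], b.center[2] - a.center[2]};
    float r2   = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
    float r    = std::sqrt(r2);
    float cosA = (d[0] * a.normal[0] + d[1] * a.normal[1] + d[2] * a.normal[2]) / r;
    float cosB = -(d[0] * b.normal[0] + d[1] * b.normal[1] + d[2] * b.normal[2]) / r;
    if (cosA <= 0.0f || cosB <= 0.0f) return 0.0f;   // patches face away from each other
    return cosA * cosB * b.area / (3.14159265f * r2);
}

int main() {
    Patch floorPatch{{0, 0, 0}, {0, 0, 1}, 1.0f};    // 1 m^2 patch looking up
    Patch ceilPatch {{0, 0, 2}, {0, 0, -1}, 1.0f};   // 1 m^2 patch 2 m above, looking down
    // Iteratively gathering radiance with factors like this is the classic
    // finite element radiosity loop; storing the result per micro-triangle is
    // the experiment described above.
    std::printf("F = %f\n", formFactor(floorPatch, ceilPatch));
    return 0;
}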

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

repiv posted:

that naturally scales up to where N neatly fills one frame interval (with enough slack for denoising, postprocessing, etc) but how to scale up to larger N without tanking the framerate or causing visual issues is an open problem

btw I think this is maybe right where MCM/multi-GCD can come in. There is no reason you can't have GxN samples if the BVH+each working ray set can be contained in VRAM per GCD. Blit your average of the image back when you're done, or get the kill-signal. Sharing a set of MCDs across a group of GCDs seems potentially fraught with performance issues but I guess since all that really has to be shared is BVH maybe that's not that big or is sufficiently-small relative to bandwidth/local misses.

like a ray buffer doesn't have to be fast, just large. 32gb of stick memory onboard would be like 10-20 gb/s that'd probably be fine for a big raybuffer lol. You just need to do your buffering for the rays.

Be funni if the Radeon SSG strategy suddenly became viable. NVMe is pretty fast now, or Optane. a HBM2E stack or HMC or whatever per GCD is plenty. plain old DDR is probably fine..

repiv posted:

you fire off N million rays and don't get anything back until they're all done, and you want N to be as large as possible for the best hardware utilization

honestly I don't see frametime as a problem though. it's path-tracing, there's not that much pipeline after the RT runs. A little UI compositing, whatever. Target N as accurately as possible, leave a little wiggle room (including any remaining pipeline), and if you have a bad frame... just ship what you've got. the image will... be noisier in some places during a single frame of samples I guess?

oh no, DLSS has one less temporal input, for the 0.1% of samples it couldn't finish, or didn't get passed in time during a few of the rays, etc.

(and really that's a useful property for a lot of GPGPU hacks: "oh no the AI does the best with the samples it's got" seems like it'd also be fairly good for outright errors, perhaps caused by some very slight program incorrectness, ahem stochastic inaccuracy of results. Being able to deliberately cut the corner on explicit sync if it's Probably Correct is very desirable for a lot of high-concurrency GPGPU stuff. And honestly from what I've been told outright twiddling the network to introduce variation tends to improve convergence, so... why are accidental data errors different?)

repiv posted:

as steckles says though, for optimal hardware utilization you want a breadth-first/wavefront architecture.

actually the point they made about scheduling samples in 5D is interesting. One idea that is clearly very hot right now is varying your sampling to the most effective points of your image (moving edges or high-frequency areas (fences) etc) and there's also no reason you can't use both temporal and spatial weighting. Use the optical flow accelerator to find your hotspots, combine that with where your TAAU thinks it's running critically low on temporal samples for the image frequency, and throw samples in the ideal places to make your total upscaler pipeline work well. I totally expect DLSS4 or DLSS5 to visit this concept, the OFA's role in DLSS3 is a blaring "TO BE CONTINUED" to me. You think that's the only neat trick OFA enables? k.

Having a fast, high-accuracy (incredibly so - this is third-gen for OFA) bitmap of your image hotspots across various data-dimensions seems like an insane win for input to AI/ML, like it's not just actual visual pixels but this throws out a lot of useful information about any frame sequence you feed it. You can generate whatever DLSS metadata or image metadata as a pixel map and feed it to OFA and see where... temporal flow is. Those are your artifacts, or artifact-prone areas.

I truly do believe that this is DLSS1.0 all over again: DLSS3 isn't even the thing here, it's all the things that are going to come after this, DLSS4 and DLSS5. I know this is a hot take but Ada crossing some threshold (real or marketing) of necessary OFA accuracy/latency may lead to a situation like Turing where it ages relatively well due to various DLSS enhancements over its lifetime. In the long term the point is disambiguating the output res from the internal rendering (or sampling) res - and then generalizing that to a per-pixel shading rate that's optimal to feed your TAAU algorithm.

I have seen the idea of variable rate sampling working intelligently with both spatial+temporal sampling come up in multiple places, even in regular raster. Don't just do your sampling per-frame, feed it where the image is actually struggling or changing. And OFA looks really really useful for that, here is a bitmap to FFT unit that pre-digests what's going on into FFT macro-blocks for you so you can make useful decisions with a small number of elements (lower dimensionality I guess). Knowing "here's an edge or wire I think needs some samples" seems super achievable imo? And you can feed that state into where the algorithm decides to throw the examples for the next frame, maintain heatmaps etc.

Blackwell still looks poised to be insane. Like I guess NVIDIA isn't going MCM on consumer (rumors only tho) but they're also going to 3nm and potentially doing a uarch shakeup after 3 generations of Turing derivatives? maybe there's something they want to set up architecturally for MCM in future generations (cope), could be pathtracing or something else.

Paul MaudDib fucked around with this message at 06:41 on Apr 15, 2023

repiv
Aug 13, 2009

speaking of future DLSS versions, nvidia let slip in one of their talks that realtime AI denoising is coming soon

that was inevitable really, they've been doing offline AI denoising in OptiX for ages

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

8-bit Miniboss posted:

The big thing for me is the cross frame design and its PCI bracket anchoring into it that makes it one of the most structurally sound video cards to date. The 3080 FE I had was solid as is the 4080 FE that replaced it.

this is a really interesting comment and I hadn't thought about that.

if only a partner had had an idea besides "big cooler/large VRM/max power limit" in 10 years :smith:

AIO GPU cards being relatively compatible with using other slots on your motherboard is funni though. I don't feel that bad about the 3090 Kingpin Hybrid overall though because of that, it's a fast premium card that actually lets me use slots in my case, will never have VRAM come into question, and is generally a nicely engineered card (great temps even on VRAM - the AIO cools really well, even the memory).

I'm gonna hold onto my 3090 until blackwell, despite how tantalizing the shopmyexchange 4090 FE remains. 1599 no tax, 10% off first day if you sign up for the credit card, and a couple months ago I saw them run a "15% off first day for all new cardholders" and "this stacks". Not sure if it'd work out but even at just 10% or 15% off $1599 no tax it's a good deal.

I'd be right back in the "no slots bro" situation and I enjoy being able to run 2-3 fast cards on my 9900K though. Hell I can use all my x1 PCIe slots too. I could do some random poo poo on the chipset if I needed. Pure luxury - I have a true 2-slot card and I really don't even need much clearance. I am not buying another machine without Thunderbolt (probably X670E Creator or X770) that will free me from the need for an AIO just to use some random pcie card, but in the meantime the AIO is super duper nice.

like there is some giant "hey these turbo huge 30-series coolers from the partners... you can crack the core and VRAM if you leave them unsupported for long periods of time" tech-media thing kicking off right now and like lol yea the average AIB partner cooler probably is to the point where you should get a support noose for the end of the card

Paul MaudDib fucked around with this message at 05:20 on Apr 15, 2023

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

steckles posted:

We really haven't even scratched the surface though and there are so many different ways we could be using visibility queries that haven't been experimented with yet. Nobody has really come up with a good LOD method for raytracing either. Intel showed a method of stochastically picking a LOD on a per-pixel basis which looked great, but would be hard or impossible to implement with the current APIs. While I'm at it, one dumb idea I had was to treat a nanite mesh as the input for an old skool finite element radiosity solver. Basically have all the patches iteratively swap radiance with each other and store it in like some Gaussian Mixture Model or Spherical Harmonic nonsense as part of the surface properties. I'm sure there would be a billion corner cases, but it'd be an interesting experiment.

it'd be interesting to see whether OFA could improve this at all. Being able to have some raw "draw poo poo here" metrics would be great. Although I guess at a certain level of detail the heatmaps don't have to be GPU accelerated at all.

it seems like there's some fairly enlightening questions like "what is the number of BVH regions behind this raster region/space voxel/OFA motion estimation group" or "what is the average depth of rays shot in this region" that could guide LOD tuning too. Yes you can compute it but being like "yo there's a chunky thing at X but there's nothing ahead/behind" or even just "the time spent in this region is out of control" are relevant info for building an optimal tree.

I have no idea if that's the kind of thing that could be passed back via shader feedback/etc and maybe fed forward to the next frame's sampling (with a history buffer).

The nanite guy has got the right idea, though: this isn't a "rebuild every frame" or even every-N-frames thing, you should rebalance it based on where poo poo's happening and where rays need a better level of both detail and sampling. And the structure needs to be Concurrently Eventually Mostly Correct - you want some reasonable BVH tree based on the past performance of the card on the previous frame, but it doesn't always need to be the exact same LOD for every patch at every point in time.

And of course for certain kinds of scripted content, the optimal sample density for each frame can be "hinted" by precomputation, I suppose. Why can't you bake the parts of the scene where poo poo is happening - especially if it's logical parts ("that soccer ball") and not just a moving pixel region? And OFA gives you those regions in the final image. Or if you have that information from the engine itself (motion vectors) great.

Don't build/trace a pure BVH hierarchy - you want to trace a Huffman coding of the optimal BVH traversal through OFA objects around that timestamp, ideally tracked across all threads in any warp that impacted it. If there's certain samples that you want to weight, give them higher probability.

Paul MaudDib fucked around with this message at 06:37 on Apr 15, 2023

SwissArmyDruid
Feb 14, 2014

by sebmojo

Paul MaudDib posted:

GPUs by BALENCIAGA

BALENCIAGA: The Way It's Meant To Be Worn

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
Louis Vuitton already has a parallel color scheme to Noctua, it wouldn't take much

DoctorRobert
Jan 20, 2020

pyrotek posted:

Sure, but then the time of day and lighting in general has to be static, and reflections won't look the same no matter what. Outside of reflections that is why Final Fantasy 7 Remake looked so good and probably explains why it is taking so long to get to the next one because it is probably more open and might be weird for it not to have real-time lighting.

The 7900 XTX is a pretty good card, but stick away from the reference model if you find any still out there, and the ASRock Phantom Gaming frequently has similar issues. The full die is enabled on the XTX so there really isn't room for enhanced versions like with the inevitable Ti cards for Nvidia. I wouldn't count on a 7950 XTX for that reason. Also, the more you spend on a XTX, the closer you get in price to a 4080, and the PNY 4080 is excellent at MSRP.

Actually, looking at Micro Center, there are a few 4080s now from $50 to $100 off, and the Gigabyte Eagle is $1150 at at Best Buy. I'm glad things are finally starting to go in the right direction.

You can't do night/day lighting with raster? :confused: Could swear there's games all over the place with that.

R.e. the xtx, thought AMD hosed it up the first time so it doesn't hit the right clocks and their cheaper cards later this year ought to be able to, so they could release a refresh of xtx at the same time with better clocks

R.e. the 1080ti, it was $400 maybe 5 years back, and there's no point spending 4x that for a card that still won't hold up the same way 5 years from now, given how everything-hungry the tracing trend is. Games don't look all that different with it at present, and there are hilariously few using it, from the perspective of justifying the expense

Yaoi Gagarin
Feb 20, 2014

I do wonder what a GPU designed from the ground up around RT would look like. Something with absolutely no thought given to legacy raster or even GPGPU performance.

Shipon
Nov 7, 2005

DoctorRobert posted:

You can't do night/day lighting with raster? :confused: Could swear there's games all over the place with that.

R.e. the xtx, thought AMD hosed it up the first time so it doesn't hit the right clocks and their cheaper cards later this year ought to be able to, so they could release a refresh of xtx at the same time with better clocks

R.e. the 1080ti, it was $400 maybe 5 years back, and there's no point spending 4x that for a card that still won't hold up the same way 5 years from now, given how everything-hungry the tracing trend is. Games don't look all that different with it at present, and there are hilariously few using it, from the perspective of justifying the expense

Even ignoring raytracing, hitting 60 FPS is going to be a fair challenge with a 1080Ti nowadays without some compromises to quality settings at 4k. You might be able to crack 100ish at 1440p.

EDIT: Went back through some more recent comparison videos, 60 FPS is about what you could expect out of a 1080Ti with post-2020 AAA games at 1440p, which is still plenty fine for most gamers for sure.

Shipon fucked around with this message at 09:54 on Apr 15, 2023

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

SwissArmyDruid posted:

BALENCIAGA: The Way It's Meant To Be Worn

EE: Epstein Edition?

repiv
Aug 13, 2009

DoctorRobert posted:

You can't do night/day lighting with raster? :confused: Could swear there's games all over the place with that.

you can, but it further limits your options for baking the lighting as a continuously moving sun explodes the amount of precomputation and data storage required for a given quality level

in practice raster lighting is a continuum where the more static it is, the better quality it can be, and the more things need to be dynamic the more compromises need to be made
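
rough numbers (all invented) for why a moving sun explodes the bake - storage and bake time scale linearly with however many sun positions you keyframe:

code:
#include <cstdio>

int main() {
    const double lightmapSetMB = 300.0;   // one static lighting bake for a level, made-up size
    for (int keyframes : {1, 8, 24, 96}) {
        std::printf("%3d time-of-day keyframes -> %.1f GB of baked lighting\n",
                    keyframes, keyframes * lightmapSetMB / 1024.0);
    }
    return 0;
}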

repiv fucked around with this message at 12:18 on Apr 15, 2023

Taima
Dec 31, 2006

tfw you're peeing next to someone in the lineup and they don't know
Has anyone else upgraded to any AM5 X3D chip for gaming/ GPU support reasons? I'm curious about your experience.

I've found it to be a giant upgrade for gaming overall, but there are so many side considerations; does the DDR5 help? Does it help FPS or make the system more snappy, or both? How much is AM5 doing, how much is the DDR5 doing and how much is the cpu doing? Does the PCIE 5.0 help on the 4090? Does the gen 5 SSD support help? Probably not, right- I use a SK hynix Platinum P41 2TB which is only a gen 4 drive.

It feels like a giant upgrade (in my case, 7800X3D paired with 4090 and DDR5 6000 CL30). The biggest upgrade is just the overall snappiness of, well, everything. Even Windows is snappier than with my previous 5800X AM4 setup.

I'm also getting a lot more FPS (hard to say, maybe 20% more? I could be making that up) but the real meat and potatoes is the 1% lows. It makes games feel extremely butter smooth, and titles that have a lot of "stuff" associated with their operation such as, of all titles, World of Warcraft, run night and day better in capital cities. I get ~80 fps with 60fps 99% fps in Valdrakken, which is kind of jaw dropping as everyone had basically written off Wow capital cities as performing like poo poo regardless of how beefy your specs are.

TLOU Part 1 went from absolutely drowning in cpu heavy scenes at "4K" DLSS quality / 120 fps mode, to often times running at capped 120 fps with 99% fps almost as high- up to 110 fps! The 1% performance on this setup is crazy. I never paid much attention to 1% lows in general but now that they're so solid, I feel like it's made such a big difference to my admittedly subjective experience.

Anyways my question is this: what is doing the heavy lifting here? How much does low latency DDR5 6000 make a difference? Does the actual AM5 platform make a difference? How much is the CPU changing things? I wish I could understand more broadly what's actually going on here when I'm playing games. Too many variables changed at once for me to get it, especially because I am a layman when it comes to this stuff especially compared to thread veterans.

As an aside, I even got a huge discount on an open box Asus Tuf x670e Plus motherboard because someone opened it, got super confused at the cpu not working, and returned it. It's clear that they just didn't understand how to flash BIOS, and I'm guessing this is a great way to score a good deal on AM5 in general if you're willing to roll the dice, as this is probably a common situation. Which is weird, because it's so easy, but I'm not going to complain. People talked some poo poo on Asus AM5 but with the latest firmware the system has been absolutely rock solid.

It's just weird because the general advice always given to me about gaming is that the CPU doesn't matter much at 4K. That feels like it's very much changing now though and I guess a CPU is back on the menu as PS5 titles progress, even at 4K. It's an interesting development imo and something that might take a second to catch on with many consumers. Even in situations with only a modest fps increase, the 1% lows on these titles make a big difference to the moment to moment experience.

If you've performed a similar upgrade, what has made the biggest impact to your everyday gaming? What games get the biggest lift? How is your day to day experience with the 3D Cache?

Taima fucked around with this message at 12:19 on Apr 15, 2023

lih
May 15, 2013

Just a friendly reminder of what it looks like.

We'll do punctuation later.
the last of us is an extremely cpu-intensive game on pc due to being a pretty poor port, so it's going to benefit the most from having a top-end cpu like the 7800x3d.

what you're describing sounds like you're seeing the benefits of the 7800x3d's giant cache in particular though.

  • Reply