steckles
Jan 14, 2006

SwissCM posted:

Path tracing is a little different to ray tracing. Path tracing calculates light from the source, ray tracing calculates light from the camera. Ray tracing is more efficient, but less correct and can't do complex light interactions as well as path tracing can (caustics is a common example). Ray tracing is even more efficient if you add tech like eye tracking.
I am compelled by the ray tracing gods to nitpick you here. Path tracing always starts from the camera and will always converge to a correct result. Starting paths from the light source would be more accurately called particle tracing. Particle tracing will also always converge to a correct result. They're really just different ways to express the same set of equations. Bidirectional path tracing uses paths that start from both the light and camera to construct paths that more efficiently capture all interactions.

Most production ray-tracing-based renderers use fairly straightforward path tracing. I suspect this is because it's easier to get right and offers more opportunities for SIMD and GPU acceleration than BDPT or particle tracing.

steckles
Jan 14, 2006

SwissCM posted:

What's a good resource to find out more about path tracing?
Although it doesn't go into quite as much detail as 1gnoirents' diagram, Physically Based Rendering: From Theory to Implementation is the definitive text on path tracing.

steckles
Jan 14, 2006

shrike82 posted:

it's not clear the frame + movement vectors and previous frame CNN autoencoder approach they've gone with is the only/best way to do game upscaling - just look at how fast Nvidia is iterating on their implementations. given how fast the broader ML field moves, it would not surprise me to see Microsoft or a game developer roll out an implementation of a better approach that some random academic group (or even Microsoft Research) publishes out of nowhere.
I'm reminded of the introduction of Morphological Antialiasing during the PS3/360 era. A new technique requiring the awesome power of the Cell processor comes along, things seem dire for other platforms as the method doesn't work well without SPUs, and the image quality gain is too large to ignore. I remember being blown away seeing God of War III for the first time. It was a bigger leap in graphical fidelity at the time than going from 1440p to 4K is today, in my opinion. After a while, cheaper algorithms came along that maybe didn't look quite as good, but they worked everywhere and so became the default everywhere.

I've got some experience with neural networks and machine learning algorithms. Enough to implement a digit classifier with MNIST from scratch in C++ anyway. I'm also familiar with reconstruction algorithms for computer graphics as writing ray tracers has been a dorky hobby of mine for the last 20 years. The work NVIDIA's been doing on neural denoising is extremely impressive and could be a total game changer for stuff like the architectural visualisation field.

DLSS is also extremely impressive, but I don't think it's self-evident that they've hit the best spot on the quality/computation curve or that their algorithm is universally applicable. Nobody outside of NVIDIA knows exactly what they're doing, but one part of their algorithm we do know about is an autoencoder. Autoencoders are, put very simply, networks that accept a bunch of inputs, squeeze them through a "choke point" in the middle that forces the inputs into a much smaller representation, and then use an expander that attempts to reconstruct the original inputs from that compressed middle layer. These are great at denoising, inpainting missing details, or, if you've got a larger number of outputs than inputs, upscaling. Running them is just a bunch of matrix multiplications, which is what the Tensor Cores are designed to do quickly. They usually produce crap results when the inputs are too far outside their training set though, and you can't train on everything.
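
If it helps, here's a toy sketch of that bottleneck idea. This has nothing to do with NVIDIA's actual network; layer sizes, the training loop, and the data are all made up purely to illustrate "compress, then reconstruct":

```python
# Minimal dense autoencoder sketch in NumPy. Not DLSS, just the bottleneck idea.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# 64 inputs squeezed through an 8-wide "choke point", then expanded back.
W_enc = rng.normal(0, 0.1, (64, 8))
W_dec = rng.normal(0, 0.1, (8, 64))

def forward(x):
    code = relu(x @ W_enc)        # compress to the bottleneck
    recon = code @ W_dec          # expand back to the original size
    return code, recon

def train_step(x, lr=1e-2):
    """One step of plain gradient descent on squared reconstruction error."""
    global W_enc, W_dec
    code, recon = forward(x)
    err = (recon - x) / x.shape[0]        # dL/d(recon) for mean squared error
    grad_dec = code.T @ err
    grad_code = err @ W_dec.T
    grad_code[code <= 0] = 0.0            # ReLU gradient
    grad_enc = x.T @ grad_code
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return float(np.mean((recon - x) ** 2))

batch = rng.normal(0, 1, (32, 64))        # stand-in for real image patches
for _ in range(100):
    loss = train_step(batch)
print("reconstruction MSE:", loss)
```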

To engage in a bit of semi-informed speculation, I think part of the reason DLSS still isn't widely available and has shown up in so few shipped titles is that, in its current state, it requires a degree of per-title or even per-scene tuning to produce decent results. Not retraining the underlying network, but massaging and filtering its inputs. It's in UE4 now, but NVIDIA does approval on a per-developer basis before they let you turn it on. Throwing the doors open and letting everybody play with it would mean that its limits and weak points are quickly found. NVIDIA has determined, correctly in my opinion, that a few titles that implement DLSS really well, combined with the promise of more to come, is going to sell more cards than those same few games plus a bunch of awful indie crap that uses DLSS badly.

I guess what I'm getting at is that DLSS doesn't seem, to me, to be a general solution to the problem of game upscaling and there are almost certainly cheaper, non neural network based algorithms that will do well enough. Now that they've opened that door, I think we're going to see a lot of rapid progress in the field, and I can't wait to see where things are in a couple years.

steckles
Jan 14, 2006

v1ld posted:

So having the extra data available from a good TAA implementation may not be sufficient to have a good DLSS implementation in the same game? Interesting if so and may explain the gap between promise and reality right now. Thanks for the post, good read along with the one you responded to.
I think it's a matter of neural networks being really hard to reason about, and there probably need to be adjustments to avoid getting funky results. Stuff like the number of frames' worth of data DLSS uses to do reconstruction is gonna be highly dependent on the specifics of a given game or scene. This could extend to altering the colour grading to avoid weird situations too far outside the training set.

Again, this is just speculation, but it's telling that getting access to DLSS involves jumping through hoops and seems to be behind an NDA. There haven't been, as far as I know, any really technical third-party deep dives or blog posts from gamedevs loving around with it. Why haven't we seen somebody add it to a Doom source port yet? This is the kind of stuff that gamedevs and graphics programmers would love to see and would be super informative, but it's been crickets so far.

steckles
Jan 14, 2006

TAAU and FSR aren't mutually exclusive. You could, in theory, use TAAU to get to the base resolution and then FSR up to a higher resolution. Depending on the overhead and your target frame rate and resolution, that could be a viable setup. The fact that FSR will work in purely forward renderers that don't necessarily have motion vectors can't be entirely ignored either.

I dunno, it seems good enough. FSR ultra quality for a ~30% frame rate boost could be a reasonable compromise if you were struggling to hit 60hz/120hz/244hz at your desired resolution on your hardware. If AMD can convince/pay enough developers to support it, it'll have done its job, which is to show up.

I've always wondered why pure frame interpolation didn't see more uptake. I remember reading a paper in like 2011 about it from the Force Unleashed developers. They basically just used motion vectors to re-project every second frame while keeping the base rate at 30hz. They sampled the controller every 16ms to ensure that camera motion vectors updated at 60hz which had the effect of reducing perceived latency as well. It looked pretty good at the time and was viable on an Xbox 360. Something like that combined with rendering a quarter or eighth resolution intermediate frame to fill in occlusion artefacts would probably look pretty good if it were constrained to high frame rates.
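
Purely to illustrate the reprojection idea (this is a guess at the general shape, not the Force Unleashed implementation), a toy version might look like this, ignoring disocclusions entirely:

```python
# Toy motion-vector reprojection: push the last frame forward half a frame
# using per-pixel motion vectors. A real implementation runs on the GPU.
import numpy as np

def reproject(prev_frame, motion_vectors, step=0.5):
    """prev_frame: (H, W, 3) colours; motion_vectors: (H, W, 2) pixel offsets
    describing how far each pixel moved since the previous frame."""
    h, w, _ = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Gather: for each output pixel, look back along the motion vector.
    src_x = np.clip((xs - step * motion_vectors[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((ys - step * motion_vectors[..., 1]).round().astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]

frame = np.random.rand(8, 8, 3)
mv = np.full((8, 8, 2), 2.0)          # everything moving 2px per frame in +x/+y
half_step = reproject(frame, mv)       # rough stand-in for the in-between frame
```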

steckles fucked around with this message at 20:22 on Jun 22, 2021

steckles
Jan 14, 2006

repiv posted:

FSR does fit nicely there, but AFAIK there just aren't that many engines left that haven't already gone all-in on TAA

I can only think of a couple of strategy series (Civ and Anno), Forza (at least as of Horizon 4) and VR in general, but VR people demand so much clarity they tend to go the other way and bruteforce supersample instead
Well, motion vectors aren't a monolithic thing either and that's always going to be a problem for reconstruction techniques. Camera motion vectors are easy enough, but a developer might be using a particular GPU physics system or animation library that can't provide the sub-pixel accurate vectors needed to properly re-project samples.

Player-relative motion aside, if you've got shadows in your game you'd need motion vectors relative to shadow-casting lights and to reproject your shadow maps to avoid ghosting. Ray traced reflections have their own problems. Computing the motion vectors of reflected geometry is non-trivial. As far as I know, games just ignore this and treat all motion as relative to the player. I'm not saying stuff like this is gonna be the death of TAAU and DLSS, but it is an area where a purely spatial upscaler could have perceptually better image quality at a given resolution, depending on which artefacts you're sensitive to.

steckles
Jan 14, 2006

Dr. Video Games 0031 posted:

I think using DLSS Frame Generation will probably increase input lag by a lot.

I would think that it'd work by looking at frames N-1 and N and then trying to generate frame N+0.5 by extrapolating the detected motion vectors, rather than looking at frames N-1 and N and then trying to generate frame N-0.5 based on the difference. Motion vectors should allow you to do this for opaque stuff already; I guess all the optical flow stuff is so they can project transparent stuff forward.

I'm kinda surprised it's taken so long for somebody to come back to frame interpolation. The Force Unleashed devs showed an experimental 30 -> 60 FPS frame interpolation technique back in like 2011 that ran on an Xbox 360 and looked pretty good as I recall. They were projecting forward from the previous frame and polled controller input at 60hz rather than 30hz, which let them update camera movement motion vectors at 60hz. This apparently had the effect of reducing perceived latency. I'd be curious to see how something like that would work today, projecting the last frame forward and rendering like a 1/16th resolution frame to fill in disocclusions.

steckles
Jan 14, 2006

repiv posted:

it's been a staple of VR rendering for years, any time you miss vsync they extrapolate the last frame to the current frames head position to keep you from :barf:ing. AFAIK they even borrow the video encoders optical flow engine like nvidia is doing.

due to the artifacts it causes that's generally treated as a last resort though, not something you'd rely on all the time, which is probably why it didn't spread outside of VR until now with a (seemingly) better quality spin on the idea
Yeah, makes sense that it'd have found a home in VR where framerate > everything. It seemed so far along like ten years ago, so I guess I'd have expected more work on it before now.

steckles
Jan 14, 2006

repiv posted:

frozen 1 was the last disney film to use their old raster renderer and it looks conspicuously video-gamey compared to big hero 6 just a year later
Raytracing and Physically Based Rendering were definitely possible in the old REYES-based RenderMan. I believe Cars was the first movie that Pixar made that relied heavily on raytracing effects. Mixing and matching the two different pipelines was a real pain in the rear end though. I know a bunch of people who work or have worked on technical teams at Disney and Pixar and various VFX houses, and one thing that's often overlooked is the industry's switch to the ACES color space, which happened right around the same time all the big production houses started using fully path traced pipelines. I don't know much about zany color encoding nonsense, but the consensus amongst my friends is that ACES was the real game changer. It apparently enables much more sophisticated tone mapping and grading than could be easily done before. The Lego Movie was the first to use it, and I recall thinking it was the first CG movie to actually look "real" in a way that earlier ones didn't.

Apparently production scene sizes are getting so large that they won't actually fit into any amount of RAM anymore and they're becoming impractical to just move around on company networks. They're totally disk and network bandwidth limited now, so a lot of current research is focusing on ray re-ordering and clever caching to increase locality of reference. I wonder if some of it will filter down to GPU-sized problems where keeping local caches full is as important as ray/primitive intersection performance.

Incidentally, UE5's Nanite has some REYES-ish qualities to it. They managed to crack the automatic mesh decimation problem in a way that nobody could figure out with REYES, but I think there's a lot of similarity there. I heard from somebody that Weta was experimenting with something like Nanite for their production renderer. Basically, they were running geometry through a preprocessing step every frame to reduce the whole scene down to 1-2 triangles/texels per pixel for visible geometry and used some super simple heuristics for off-screen geometry. The resulting geometry is absolutely tiny compared to the input scene, so they wouldn't need to worry about disk or network latency; everything would fit in memory. Not sure if it went anywhere, but that's supposedly what they were researching. Something like that might be possible on a GPU too, although I think the API would need to expose some way to update BVH coordinates rather than the black box it is right now.

Exciting times.

steckles
Jan 14, 2006

repiv posted:

epic did a seriously impressive flex by loading up a scene from disney's moana and rendering it in realtime through nanite
I once tried to load that scene into one of my many toy ray tracers. It... did not go well.

v1ld posted:

There have to be really good reasons why this isn't done, it's pretty obvious, but would be good to know why a movie studio which has no realtime constraint on rendering a scene wouldn't pursue those kinds of approaches.
Attempts at ray re-ordering have been made since at least the mid-90s; it's just that Moore's Law usually made the problem go away if you were willing to wait a bit. Predictable brute force is usually preferable to a clever algorithm. We really do seem to be bumping up against some limits though, and just waiting another year for MOAR CORES and twice the memory for half the price isn't feasible, so there's renewed interest in these things.

On the GPU side, batching and re-ordering can be non-trivial to parallelize in a way that makes GPUs happy, so perhaps people have just been avoiding it.

steckles fucked around with this message at 03:36 on Nov 6, 2022

steckles
Jan 14, 2006

v1ld posted:

E: Guess what I'm asking is if the fundamental bottleneck of pathtracing re: multi-machine parallelization is the full scene has to be on each machine? Rays can hit any part of the scene, so that would seem to be the case?
Yeah, you don't know what geometry a ray is gonna hit until you trace it, so you generally need the entire scene and textures resident in memory at all times for decent performance. This is more of a management problem than a parallelization problem though. My understanding of common production pipelines is that multiple machines never coordinate on the same frame: each machine will be dedicated to a single frame and the whole cluster will be working on a bunch of different frames at once. They'll all be referencing the same geometry files though, so if that won't fit in memory on each machine, you need to start getting clever.

VVVVV: In offline rendering, paths are typically traced hundreds or thousands of times per pixel and averaged. Fundamentally, path tracing would be impossible without averaging. Tracing even double-digit numbers of rays per pixel and getting decent results requires nutty space magic like ReSTIR and deep-learning-driven denoising filters.
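
To make the averaging point concrete, here's a toy estimator (nothing to do with any real renderer, the "scene" is just a disc) showing how the noise in a Monte Carlo estimate shrinks as the per-pixel sample count goes up:

```python
# Estimate the fraction of a pixel covered by a disc by averaging random
# visibility samples. The same averaging is what makes path tracing converge.
import random

def covered(x, y):
    # "Scene": a disc of radius 0.4 centred in the pixel.
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.4 ** 2

def estimate_coverage(samples_per_pixel):
    hits = sum(covered(random.random(), random.random())
               for _ in range(samples_per_pixel))
    return hits / samples_per_pixel

for spp in (4, 64, 1024):
    print(spp, estimate_coverage(spp))   # noise shrinks as spp grows
```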

steckles fucked around with this message at 03:53 on Nov 6, 2022

steckles
Jan 14, 2006

repiv posted:

speaking of nutty space magic, some of the nerds here may find this overview interesting, from a channel that popped up out of nowhere with improbably high production values

https://www.youtube.com/watch?v=gsZiJeaMO48

goes over the fundamentals, then RIS (the foundational trick Q2RTX used) and then ReSTIR (the new improved trick to be used in Portal RTX/Racer RTX/Cyberpunks RT update)
I wish that video existed when I was first working my way through the ReSTIR paper, would've saved me a few frustrating hours trying to put everything together.


Saukkis posted:

I was thinking about this and I guess it would be possible to divide the scene into separate boxes and put every box in its dedicated computer. When a ray travels between these boxes, the computers would communicate the ray information between them.
This is exactly what Disney is working on right now with their renderer. Even with an insanely fast network, the overhead of doing it per-ray is much too high to be worthwhile. What they're doing instead is waiting until "enough" rays can be sent over the network in a batch that the overhead becomes reasonable. There still needs to be a lot of shared data, like some kind of high-level BVH so that each node knows where to send its ray queries, and the time-to-first-image is much longer. With a fast computer, you can usually get a noisy but reasonable approximation of the image in a few seconds, even for huge scenes, as long as they fit in memory. Once you start deferring all your rays and batching stuff up, you might get the final image much faster, but you're waiting multiple minutes for the first, noisy images to hit the screen, so the cost of iterating on the shot to dial it in is higher. The increase in artist time needs to be weighed against the decrease in render time though, so strictly faster isn't always better in a production environment.

steckles
Jan 14, 2006

My friend picked up a 4090 and I tried out Portal and Witcher 3 at his place. DLSS3 is pretty cool and I didn't really notice any latency, but I must be particularly sensitive to the weird compression-y artefacts it introduces. I felt like I was watching a YouTube video on a TV with frame interpolation turned on. I'm sure it'll improve in time, but I'd personally leave it off right now. My friend didn't mind the look though. It seems to perform best when going from a high frame rate to an even higher frame rate, so I'm not sure how well something like a 4060, where the base frame rate will be way lower, would fare.

I do think it's amusing that people were always complaining about all the "hacks" developers needed to get good looking lighting in video games before ray tracing was a thing. Now we've got ray tracing, but we're also burdened with an entirely new set of temporal hacks to get anything usable. Maybe one day we'll return to the good old days, when a frame was a frame, not some monstrosity stitched together from the corpses of previous, dead frames.

steckles
Jan 14, 2006

mobby_6kl posted:

I dunno. I'd have to see it in action but being mad about "fake frames" while the raster pipeline is full of horrible hacks that vaguely approximate what things should look like seems pretty weird.
SSR, sure, that's the very definition of a hack. This is totally splitting hairs, but I wouldn't really call shadow maps a hack. They're fundamentally solving the same visibility query that intersecting a ray with geometry does. You absolutely can do stuff like penumbra, translucency, and pixel-perfect sharpness with shadow maps. There's decades' worth of research from the film industry that hasn't really been tapped yet in games.
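
To spell out the "same visibility query" point, here's roughly what a shadow map lookup boils down to. The light-space transform, bias, and names here are just illustrative, not any particular engine's API:

```python
# "Is this point lit?" is just a depth comparison in light space, which is
# the same question a shadow ray answers by intersecting geometry.
import numpy as np

def shadow_map_visibility(point_world, light_view_proj, shadow_map, bias=1e-3):
    """Return 1.0 if point_world can see the light, 0.0 if it's occluded."""
    p = light_view_proj @ np.append(point_world, 1.0)   # to light clip space
    p = p[:3] / p[3]                                    # perspective divide
    u = int((p[0] * 0.5 + 0.5) * (shadow_map.shape[1] - 1))
    v = int((p[1] * 0.5 + 0.5) * (shadow_map.shape[0] - 1))
    stored_depth = shadow_map[v, u]                     # closest occluder the light saw
    return 1.0 if p[2] - bias <= stored_depth else 0.0

# e.g. with an identity "light transform" and an empty (far-plane) shadow map:
sm = np.ones((64, 64))
print(shadow_map_visibility(np.array([0.2, 0.3, 0.5]), np.eye(4), sm))  # 1.0 -> lit
```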

I'm kind of reminded of the mid-late 2000s when everyone started doing HDR, bokeh, and bloom in their games, but few did it well and most were a blurry brown mess. Now we've got a few standout raytracing titles, but we need to put up with a lot of ghosting, interpolation, and blur. In the case of RTGI, I find the way lighting lags its cause to be very distracting. Shadows from fans and quickly moving objects in Portal RTX have an unrealistic, odd look to them that comes from the temporal reconstruction. We put up with it now because it's new and shiny, but nobody wants that. Honestly, we're probably at the dawn of a glorious new age of hacks that developers will create to bring, I dunno, "immediacy" back to ray traced lighting.

steckles
Jan 14, 2006

JackBandit posted:

Deep learning is actually an insane inflection point on image and video modeling and temporal data, this would not be just a “technology progressing as expected” case. I don’t have much experience with rendering but there’s a reason why Neurips and the other ML conferences sell out registration in minutes and SIGGRAPH is 75% deep learning papers. It really is a new world.
I honestly don't think deep learning is going to find too much use in graphics in the near to medium term beyond running relatively small pre-trained networks, like DLSS does, or for denoising. All the image generation networks are amazing, but they're also inscrutable black boxes that would be tough to art direct. I'd love to see somebody make a game using NeRFs though. That would be awesome.

repiv posted:

yeah shadowmaps can model point lights pretty well, where they fall apart completely is with large or oddly shaped area lights.
You can definitely do some neat stochastic things with multiple maps per emitter to get arbitrarily close to ground truth, but yeah that's where they fall apart. I remember when the first raytraced ambient occlusion images started gaining attention in the 90s and the Renderman people coming up with some absolutely crazy ways to do it with shadow maps.

Truga posted:

so what you're saying is, the first person who builds path tracing assisted shadowmaps to file off the edge cases wins the RT race? :v:
Probably. We're realistically always gonna be at a deficit for ray tracing performance. Using the rays that can be traced in your frame budget to fill in the cracks of your more garden-variety hacks is probably gonna make up a lot of the research in the next few years.

steckles
Jan 14, 2006

New Zealand can eat me posted:

I'm not sure I agree, things like Restir DI (Reservoir Spatio-Temporal Importance Resampling) by itself seem to indicate the opposite. Any time you can throw monte carlo at a problem and see gains like this, a more efficient educated guess is right around the corner
ReSTIR isn't a deep learning algorithm. It can be combined with DL-based de-noising/image reconstruction and maybe neural path guiding, but it's a "classic" light transport algorithm that plays nice with GPUs. It's not even that impressive against PSSMLT or Path Guiding for complex scenes.

Edit: To clarify further: Stochastic algorithms, like ReSTIR or TAA, and Deep Learning/whatever passes for AI these days have very little to do with one another. Stochastic techniques have a real part to play in network training, and the application of carefully formed noise is integral to content generation with trained networks, but just because an algorithm can adapt to its input, like almost every light transport algorithm invented in the last 25 years does, doesn't make it AI or Deep Learning. I'm sure NVidia doesn't mind the public's confusion about where one ends and the other begins, but some DL algorithms are complementary to light transport, rather than a requirement for it.

Edit2: One really cool thing that DL can do for path tracing is to compress extremely complicated BSDFs into very small networks. Super impressive stuff, but the training happens off-line and the CPU/GPU only need to evaluate the final network.

steckles fucked around with this message at 03:29 on Dec 22, 2022

steckles
Jan 14, 2006

Paul MaudDib posted:

the ray volume is so low that pixels aren't being sampled enough. when you've got a raster with motion, and it's harmonic/periodic movement and you sample that at a fixed rate, you're getting an aliased sample - basically the beat frequencies of the motion and the sampling will cause aliasing actually in the subpixel samples themselves.
Hmmm, I don’t think that’s what is going on in this case. More likely, because shadows don’t have motion vectors, the temporal reconstruction algorithm can only use a simple differential brightness heuristic to determine which samples to include when reconstructing each pixel. The developers have obviously decided that smearing is preferable to noise, so they’ve likely got their non-motion vector based rejection criteria set to fairly inclusive levels. They could likely fix the issue, but not without introducing an unacceptable amount of noise into the image. A situationally dependent magic parameter with no physical basis that only lets you choose between two undesirable results, sounds kind of like a… hack to me.
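
For what it's worth, the kind of brightness-difference rejection I'm describing is easy to caricature in a few lines. This is a toy version, not what any shipping denoiser actually does; the threshold is exactly the situational magic parameter I'm complaining about:

```python
# Accumulate the history pixel only if it's "close enough" in brightness to
# the new sample, otherwise fall back to the noisy new sample.
import numpy as np

def temporal_accumulate(history, current, threshold=0.1, blend=0.9):
    """history, current: (H, W) luminance buffers."""
    diff = np.abs(history - current)
    accept = diff < threshold                  # inclusive -> smearing, strict -> noise
    blended = blend * history + (1.0 - blend) * current
    return np.where(accept, blended, current)
```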

I think this issue is indicative of a broader problem facing all temporal reconstruction algorithms: You need to track a potentially infinite number of motion vectors per pixel to track all of the dis/occlusions that could affect it during any frame. You’ve got a pixel receiving light in the presence of moving geometry? You need to store relative vectors for all the moving lights and geometry that could be affecting it. That pixel is also receiving some bounce light? Well now you also need to store the relative motion vectors for all the moving lights and geometry for all of the points that might contribute some illumination to that pixel. And so on. You basically need as many motion vectors as there are path vertices if you want to accurately handle dis/occlusion without smearing. I’m not sure what a path tracer that could take that into account would look like. I doubt it would look anything like ReSTIR though.

steckles
Jan 14, 2006

repiv posted:

someone profiled portal on AMD and saw the issue was extremely high register pressure, which could be AMDs driver doing something stupid, or the way RTXDI is structured just being pathological for the way RDNA does raytracing half in software
https://twitter.com/JirayD/status/1601036292380250112
Yeah, utilization is appalling on RDNA. They hardly spend any time doing raytracing or anything else for that matter and hardware utilization is like 5% on 6900xt. I don't know that NVidia went too far out of their way to make AMD look bad intentionally but I doubt they're putting much effort into tuning RTXDI performance for anything but the 4000 series.

steckles fucked around with this message at 21:55 on Apr 11, 2023

steckles
Jan 14, 2006

Regarding path tracing, I don't know that we're going to see a true path tracing focused GPU until we start getting hundreds of megabytes of L2 and like a gigabyte of L3 cache standard. Even if you could make ray queries free, it wouldn't gain a gigantic amount of performance on current architectures because of memory thrashing.

Both NVidia and AMD have put a lot of effort into making the retrieval of random bytes from huge arrays as efficient as possible in the last couple of generations. I think that's had as much effect on the gen-on-gen RT improvements as increased ray/box and ray/triangle intersection performance has, but it's gonna need to be turbocharged if we want to start shooting a practical number of rays for next-level path tracing. Stuff like SER can definitely help in the narrow sense, but we'll need new engines built around keeping millions or billions of rays in flight to really make progress. Also, being able to prebuild BVHs and stream them as needed would be a benefit for all architectures even now; the fact that it's treated as a black box was a misstep in the API design.

Anyway, a true path tracing first engine would need to be something like this: Rasterize a depth/normal buffer. Using that, spawn millions or billions of ray queries. Batch those rays based on their origin and direction. Once your batches are large enough, clip them against the acceleration structure. Where a ray enters a leaf node, add it to another batch that's queued for geometry intersection. Once a batch of leaf node ray queries gets large enough, load the actual triangles associated with the BLAS node, clip the rays against them, and batch the hit position and surface normals. Once the batches of intersection data get large enough, load the relevant textures and run your surface shaders. Spawn more ray queries as needed and put them into new TLAS/BLAS batches. After running the shaders, add the computed colour attenuation to a list that's kept per-pixel and then every frame, collapse the list to generate a final color.

Basically: do one thing at a time, keep as much in L2/L3 as possible, and make sure that every request to glacially slow video memory is as contiguous as possible and can serve as many operations as possible. This is already best practice for current rasterization workloads, it's just being taken to a ridiculous extreme. It's not the kind of thing you could easily bolt onto an existing engine, nor is it the kind of thing hardware and driver level shenanigans are going to do for you. Some developer will need to be brave enough to write it from scratch. Hopefully the APIs and architectures will evolve to support such a thing, because RT as it currently exists is gonna be hard to scale otherwise.
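
For what it's worth, here's an extremely hand-wavy sketch of that batch-everything flow, just to make the queueing idea concrete. The scene/intersection functions are placeholder stubs, the batch sizes are invented, and real code would obviously live on the GPU:

```python
# Rays pile up in queues keyed by a coarse bin; scene memory only gets
# touched once a queue is full, so every fetch serves as many rays as possible.
from collections import defaultdict

BATCH = 4            # tiny so the example actually runs; think millions in practice

def coarse_bin(ray):          # placeholder: bucket rays by direction sign
    return tuple(d > 0 for d in ray["dir"])

def intersect_batch(rays):    # placeholder for BVH traversal + triangle tests
    return [{"ray": r, "hit": None} for r in rays]

queues = defaultdict(list)
shaded = []

def submit(ray):
    queues[coarse_bin(ray)].append(ray)

def pump():
    for key, rays in list(queues.items()):
        if len(rays) >= BATCH:            # only flush full batches
            shaded.extend(intersect_batch(rays))
            queues[key].clear()

for i in range(8):
    submit({"org": (0, 0, 0), "dir": (1, 1, (-1) ** i)})
    pump()
print(len(shaded), "rays traced in batches")
```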

steckles
Jan 14, 2006

repiv posted:

unless we move towards "frameless" architectures where the GPU can churn on batches out of phase with the framerate, which sounds like a nightmare
With stuff like DLSS3 and temporal accumulation in general, we're already "frameless" in a limited sense. No reason current techniques wouldn't still apply. Indeed, with more data in flight, you'd have opportunities for accumulating in more intelligent places rather than adding up all the frames and hoping.

Nightmare is definitely the right word. True path tracing is gonna need to throw out a lot of what we're used to, but NVidia and AMD could give us hardware to support that. It's just not gonna look like what we have now.

steckles
Jan 14, 2006

Paul MaudDib posted:

Obviously this is tremendously expensive in transistor terms: it's not shocking at all that Intel is using the transistors of a 3070 to compete with a 3050 or whatever it is! It's an absolute ton of additional SM complexity/scheduler overhead/etc. But cache isn't going to shrink much between 6nm and 3nm, but logic is. And so all that SM complexity is going to get a lot cheaper, while cache isn't going to get much better in the meantime. That's my pet theory at least, Arc looks like a design aimed a couple nodes ahead of where it is. And it wouldn't be surprising if NVIDIA recognized the problem too - 4090 benefits from much higher clocks but 3090 had fairly poor SM scaling, and 2080 Ti wasn't really that great either. AMD had the same problem with wave64 with GCN, Fiji and Vega just didn't scale to 64CU very well, and they switched to wave32. I think at some point everyone is going to have to shrink wave size again.

To some extent the things you're talking about are orthogonal: even if you had that, it would be desirable to increase the percentage of lanes doing effective work, and increase the number of lanes doing aligned access. And an async facility is really kind of what you need for that "massively increase the number of requests in-flight" too. If you are just tossing tasks into a queue (and especially if you don't ever wait on the result!) then the scheduler can just batch up a bunch of them, slice them in some efficient ways, and run them. Iterating over the raster image and tossing off async tasks is exactly what you wanted, right? It just doesn't work without some kind of realignment facility.
I think Intel's solution is a very logical way to handle the current way of doing raytracing, as an add-on to a traditional workload. Lots of tools to help ensure you've got something to work on while waiting for your ray queries to return and making sure that a stalled instruction doesn't lock up too much of the GPU. That's a smart way to do things as you'll typically have non-RT shaders to run, post processing, rasterizing new shadow maps, sampling textures, rasterizing new micropolys for Nanite, denoising, and all kinds of other things in a typical engine.

When the memory required by the raytracing step gets too large though, as it would with true path tracing, all the other work you could be scheduling gets stalled because what it needs isn't in cache any more. The short term solution to that is obviously more cache, but I doubt you could ever add enough that any sort of hardware level scheduling could hope to remain tenable. Ultimately, ray tracing as it exists now on GPUs is basically rays starting in random places going in random directions needing random geometry and random textures from random places in video memory. That's just an insanely hard thing to schedule for and we're gonna get more out of reducing the randomness to start with.

We really haven't even scratched the surface though and there are so many different ways we could be using visibility queries that haven't been experimented with yet. Nobody has really come up with a good LOD method for raytracing either. Intel showed a method of stochastically picking a LOD on a per-pixel basis which looked great, but would be hard or impossible to implement with the current APIs. While I'm at it, one dumb idea I had was to treat a nanite mesh as the input for an old skool finite element radiosity solver. Basically have all the patches iteratively swap radiance with each other and store it in like some Gaussian Mixture Model or Spherical Harmonic nonsense as part of the surface properties. I'm sure there would be a billion corner cases, but it'd be an interesting experiment.

steckles fucked around with this message at 02:16 on Apr 15, 2023

steckles
Jan 14, 2006

Paul MaudDib posted:

is there a technical reason the async scheduling and reordering needs to happen on the GPU at all? If the state-per-ray is relatively minimal (position, vector, luminosity/whatever) then can't you just immediately spill to CPU, schedule/reorder, and dump over batches? Or spill certain parts of the rayspace, like spill unless it's some area or some vector-direction that is being modeled right then? I realize of course that may not be much of a useful coherence/alignment in itself since ray behavior is so random but the point is just to define some region that you're currently working on and to spill everything else until later.
I suppose, at a high level, portions of the scheduling could happen on the CPU. Bundling rays based on origin/direction and clipping them into the TLAS on the CPU and then queuing BLAS/intersection work on the GPU might work. There's been a lot of work on "out-of-core" raytracing already, so I'm sure you could jimmy something together. As you say though, you'd probably run out of PCIE bandwidth pretty quick. It's an idea that might have more legs in a system with unified memory like the consoles or some kind of hypothetical raytracing-having Apple M3 chip.

Paul MaudDib posted:

On the other hand for a purely path-traced game this means that you can just partition your scene across multiple GPUs and get good scaling. And if the collective amount of VRAM is sufficient, that's fine, you don't need every GPU to hold the full queue (holding the full BVH tree would be really nice of course). Like if total VRAM can hold it all, you can totally do a multi-GCD/MCM design with path-tracing. (I think that's been remarked before, it's the reason why it works for big pixar movies and stuff too, right?)
If you were doing explicit multi-GPU and all the geometry and BVHs weren't resident in every GPU's memory, then the need for batching and extreme async would probably be even greater. Needing to go to another GPU, or worse, system memory, would be so expensive that you'd want to make sure you were servicing huge numbers of rays to make it worthwhile. The alternative, sending rays to the GPU where the data is, has the same issue. Each sent ray would have huge overhead, so you'd want to do as many rays at once as you could.

Paul MaudDib posted:

I wonder if spilling favors L1/L2 or L3 - NVIDIA chose the former and AMD chose the latter. I’m guessing nvidia probably needed some extra for Shader Execution Reordering to work, perhaps a reason they went with that over an L3.
SER or any sort of OOO needs as much data as possible in nearby caches to work effectively, so NVidia upping the L1 and L2 to support it in the 4000 series makes sense. I wonder when we're gonna start seeing huge re-order buffers and big instruction caches in SMs like on CPUs.

Paul MaudDib posted:

I wonder if AMD's thing about memory was because they modeled path-tracing (or even it's just self-evident that there's too little VRAM headroom to spill much) and they think path-tracing is going to do ironically poorly on NVIDIA's GPUs due to VRAM limitations.
I know the meme is that AMD hired somebody's incompetent nephew to handle raytracing in RDNA2/3, but some of the instructions in the ISA for streaming BVH data, the hardware's either-texture-sampling-or-ray-intersection design, and the huge L3 all point to a bet that massively async raytracing would be standard in the future. Who knows, maybe they're right and there will be more Fine Wine™ and a 7900xtx will run 2030 games at 6fps while the 4090 will only run them at 4fps.

Paul MaudDib posted:

That's really the money question around your objections in a practical sense I think - are we talking gigs of ray-state, tens of gigs of ray-state, or hundreds of gigs of ray-state for useful path-tracing? Obviously not too helpful if path-tracing requires people to have 128GB or 256GB of system memory even if it's feasible at a GPU level.
I wouldn't call them objections so much as observations. I've written at least one ray tracer per year since 2005 and have been following the field for longer than that and the eternal struggle has always been about keeping stuff in cache for as long as possible and maximizing memory access coherence. How much state do we need? I dunno, it's like GPU cores or memory, devs will find a way to use all of it no matter how much is there.

Paul MaudDib posted:

:yea: Obviously a lot better to avoid randomness than to handle it better. But I mean, how is that really possible with raytracing given the positions and angles are essentially random?
Well, that's where batching comes in: sort rays into a 5D structure (origin plus direction) and don't start tracing a bucket until it contains enough rays to make the memory traffic worthwhile. This works really well offline, but obviously there are constraints in the real-time world that would make it difficult to implement in its purest form.
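
A minimal sketch of what I mean by a 5D structure, with invented bucket resolutions and threshold, just to show the bookkeeping:

```python
# Bucket rays by quantized origin (3D) plus quantized direction (2D, spherical
# angles), and only dispatch a bucket once it's full.
import math
import random
from collections import defaultdict

def bin_key(origin, direction, cell=1.0, angular_bins=8):
    ox, oy, oz = (int(c // cell) for c in origin)
    theta = math.acos(max(-1.0, min(1.0, direction[2])))     # polar angle
    phi = math.atan2(direction[1], direction[0]) + math.pi   # azimuth in [0, 2pi)
    return (ox, oy, oz,
            int(theta / math.pi * angular_bins) % angular_bins,
            int(phi / (2 * math.pi) * angular_bins) % angular_bins)

buckets = defaultdict(list)

def enqueue(origin, direction, threshold=256):
    key = bin_key(origin, direction)
    buckets[key].append((origin, direction))
    if len(buckets[key]) >= threshold:
        batch = buckets.pop(key)
        # trace_batch(batch)  # all of these rays touch similar parts of the BVH
        return batch
    return None

traced = 0
for _ in range(5000):
    d = (random.uniform(0.1, 1.0), random.uniform(0.1, 1.0), random.uniform(0.1, 1.0))
    n = math.sqrt(sum(c * c for c in d))
    batch = enqueue((0.5, 0.5, 0.5), tuple(c / n for c in d))
    if batch:
        traced += len(batch)
print(traced, "rays dispatched in full batches")
```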

Paul MaudDib posted:

What would be the point of having patches swap luminosity with each other, I guess you're fuzzing... something? And that produces a statistically useful property in the rays/bounces somehow?
Radiosity is a way of encoding scene radiance on the surface geometry, and outside of the visibility term, you can calculate how much light a pair of triangles send to each other with a quite simple equation. After calculation, you'd just query any given micro-triangle for the incoming luminance and you never need to actually trace a ray from the frame buffer. Like I said, probably a million corner cases that would make it a pain, but it'd be a fun experiment.
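
That "quite simple equation" is the classic point-to-point form factor kernel. A sketch, ignoring the visibility term, with toy patch values:

```python
# Point-to-point form factor approximation used in classic radiosity.
import numpy as np

def form_factor(p_i, n_i, p_j, n_j, area_j):
    """Approximate fraction of energy leaving a small patch at p_i that
    arrives at a small patch at p_j."""
    d = p_j - p_i
    r2 = float(d @ d)
    d /= np.sqrt(r2)
    cos_i = max(0.0, float(n_i @ d))     # emitter facing the receiver?
    cos_j = max(0.0, float(n_j @ -d))    # receiver facing the emitter?
    return cos_i * cos_j * area_j / (np.pi * r2)

# Two unit-area patches facing each other, 2 units apart.
f = form_factor(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.0, -1.0]), area_j=1.0)
print(f)   # cosine terms are 1, so this is 1 / (pi * 4)
```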

steckles
Jan 14, 2006

Paul MaudDib posted:

it'd be interesting to see whether OFA could improve this at all. Being able to have some raw "draw poo poo here" metrics would be great. Although I guess at a certain level of detail the heatmaps don't have to be GPU accelerated at all.
Eh, there have been a billion attempts at using feedback to drive path samples, but they're usually more complicated and don't perform as well as path-space methods that just shoot rays where the signal is bright. The original Primary Sample Space Metropolis Light Transport paper was released 21 years ago and is still one of the simplest and most effective path guiding methods for most scenes. I believe there's an extension that applies Population Monte Carlo techniques I don't really understand, but it makes it play nice with GPUs.

There's also been a ton of work on path guiding by basically storing an octree of directions where the signal is bright. It's super simple and probably wouldn't be hard to extend to big fat ray bundles, but it's also one of those highly async methods that would need a rethink of how presented frames and light transport work together. Realistically though, an algorithm that focuses on shooting as many dumb rays as possible rather than trying to be clever is probably gonna work better for games in the long run.

Paul MaudDib posted:

it seems like there's some fairly enlightening questions like "what is the number of BVH regions behind this raster region/space voxel/OFA motion estimation group" or "what is the average depth of rays shot in this region" that could guide LOD tuning too. Yes you can compute it but being like "yo there's a chunky thing at X but there's nothing ahead/behind" or even just "the time spent in this region is out of control" are relevant info for building an optimal tree.
Optimal BVH construction is an open problem, and the difference between an okay tree for a scene and a good one can be huge. Clever heuristics can help, but they introduce corner cases and pathological behaviour that can just as easily make things worse. The linear bounding volume hierarchy algorithms used by GPUs are focused on multithreaded build speed rather than quality, and it's directly affecting RT performance on all architectures. Characters and moving stuff obviously still need to be dynamically computed, but there's no reason why a developer shouldn't be able to spend 48 hours generating the optimal tree for static level geometry and ship that. Like I said earlier, the fact that that's missing from the current APIs is a huge misstep.
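
The usual yardstick for "okay tree vs good tree" is the surface area heuristic: a node's expected cost is the traversal cost plus each child's cost weighted by the chance (surface-area ratio) that a ray hitting the parent also hits that child. A toy version, with invented constants rather than anything from a real driver:

```python
def sah_cost(parent_area, left_area, right_area, n_left, n_right,
             c_traverse=1.0, c_intersect=2.0):
    p_left = left_area / parent_area        # probability a ray hits the left child
    p_right = right_area / parent_area
    return c_traverse + c_intersect * (p_left * n_left + p_right * n_right)

# A lopsided split vs a balanced one over the same 100 triangles:
print(sah_cost(10.0, 9.5, 2.0, 90, 10))   # bad: the big child holds most triangles
print(sah_cost(10.0, 5.5, 5.5, 50, 50))   # better: lower expected cost
```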

Paul MaudDib posted:

The nanite guy has got the right idea though this isn't a "rebuild every frame" or even every N frames, you should rebalance it based on where poo poo's happening and where rays need a better performance level of both detail and sampling.
Yeah, in the same way that Nanite was a fundamental rethink of how geometry should work, the next generation of ray tracing is going to have to follow a similar frameless approach. That's why I think encoding RT results on micropolygons and shooting rays completely separately from scene presentation would be an interesting thing to try.

steckles
Jan 14, 2006

repiv posted:

the console APIs do allow developers to cook BVHes ahead of time, i wonder if they're already taking advantage of the fact that they can spend much more time on BVH refinement than is practical on PC

I recall reading somewhere that Epic did that for their Matrix thingy on the consoles.

steckles
Jan 14, 2006

repiv posted:

yeah that's about the extent of what's publicly known about the console RT APIs, but in principle they should be able to expose a lot more of the nitty gritty details since they have a fixed hardware target. directly exposing the raw BVH representation ought to be on the table, which could enable console-exclusive tricks like streaming in sub-trees at arbitrary depth on the fly. maybe they could do nanite-style fine-grained cluster streaming while PC is stuck flipping between discrete LODs.
We know from the published RDNA ISA docs that the ray/box intersection instructions just take a bunch of texture coordinates. Combined with PRT, you should be able to do some pretty zany stuff for streaming huge acceleration structures on console. Of course, that'd probably just make the current wave of appalling PC console ports even worse.

NVidia is a bit more cagey about their low-level architecture, but I'd be shocked if they didn't have similar functionality for dynamically loading BVH data. Maybe what's holding it back on PC is the lack of an agreed upon binary BVH format.

steckles
Jan 14, 2006

Subjunctive posted:

I did always wonder why ray tracing used random rays rather than just iterating the light sources, generating sorted sample sets, and walking them. Anything that doesn’t get hit is in shadow, badabing.
That is exactly how zero bounce direct lighting works. The randomness comes in either when you’ve got large lights which can be partially occluded, or when you want to know what light is being reflected off non-light surfaces and contributing to the illumination at a particular spot. For that, you can only really do random sampling. There are lots of clever ways to increase the likelihood you’re shooting rays in “good” directions, but randomness is fundamentally baked into the whole concept.
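
A sketch of that partial-occlusion case: pick random points on an area light, shoot shadow rays at them, and the fraction that get through tells you how lit the point is. The occluded() function here is a placeholder "scene", not a real intersection routine:

```python
import random

def occluded(p, q):
    # Placeholder scene: a slab at y = 1 blocks rays whose endpoint on the
    # light falls between x = 0 and x = 0.5.
    return p[1] < 1.0 < q[1] and 0.0 < q[0] < 0.5

def direct_light(shading_point, light_corner, light_edge_u, light_edge_v, n=64):
    hits = 0
    for _ in range(n):
        u, v = random.random(), random.random()
        sample = tuple(light_corner[i] + u * light_edge_u[i] + v * light_edge_v[i]
                       for i in range(3))
        if not occluded(shading_point, sample):
            hits += 1
    return hits / n   # 1.0 = fully lit, 0.0 = fully in shadow, in between = penumbra

# Quarter of this 2x2 light is blocked, so this prints roughly 0.75.
print(direct_light((0.25, 0.0, 0.0), (-1.0, 2.0, -1.0), (2.0, 0.0, 0.0), (0.0, 0.0, 2.0)))
```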

steckles
Jan 14, 2006

Subjunctive posted:

You want a random sample but you don’t have to walk them in random order, right? I don’t get why you can’t do bounced light the same way, hmm.
There are two problems that mean randomness will always be part of ray tracing.

The first is that to compute the incoming radiance at any given point, you need information about the entire scene. Each pixel needs to estimate the average colour within some cone of directions, depending on the BSDF of the material and the orientation of the geometry. To calculate that, you need to do the same for every single surface within that cone and to calculate those, you need all the surfaces in those cones and so on forever. That's obviously not a tractable problem, so sampling a random assortment of directions to some random depth of recursion is the best that we can do. Eliminate the randomness between pixels, and you end up with correlation artefacts which cause ugly banding when still and nasty flickering in motion.

The more fundamental problem is that to answer the question "can these two points in space see each other?", you need to load an unknowable-in-advance amount of geometry to make that determination. You can have two rays that start and end 1mm apart and one will need to check 1kb of geometry and the other will need to check 50mb worth because it took a different path through the BVH. No matter how coherently you're processing samples, you're still gonna end up being hit by bad memory access patterns when determining visibility.

There are a few ways to address these. For the visibility problem, going super async and batching huge numbers of rays together, or making sure your whole scene will fit in L2 cache, are basically the only strategies that work. If you didn't mind false occlusions and light leaking, you could trace against low-resolution proxy geometry.

For sampling the path integrals, you can pick your random numbers in a way that maximizes the distance between points on the hypercube (Quasi-Monte Carlo), you can sample clusters of rays when you find a path with a high contribution (MLT, MEMLT, Path Guiding, too many others to list), you can sample clusters of random numbers when you find a good point on the hypercube (PSSMLT), you can try to share rays between pixels when you find a path with a high contribution (ERPT, ReSTIR), you can use some proxy representation of scene radiance (Voxel Cone Tracing, Light Probes, VPLs), or you can use a surface-based approach where radiance is computed at fixed locations and each pixel is interpolated from its nearest neighbors (Radiosity, Surfels, Photon Mapping, plain old Light Maps). All of these serve to minimize randomness, but they come with various tradeoffs in maximum quality, time to image, or memory usage, and you can always come up with some pathological scene geometry that will make any algorithm perform badly.
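
As a tiny taste of the Quasi-Monte Carlo idea at the top of that list, here's a Halton sequence versus plain random sampling on a throwaway integral; the integrand is arbitrary, the point is just that the better-spread points converge faster:

```python
import random

def halton(index, base):
    """Standard radical-inverse (Halton) low-discrepancy sequence."""
    f, result = 1.0, 0.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def quarter_disc_area(points):
    # Estimate the area of a quarter disc of radius 1 (true value: pi/4).
    inside = sum(1 for x, y in points if x * x + y * y < 1.0)
    return inside / len(points)

n = 4096
random_pts = [(random.random(), random.random()) for _ in range(n)]
halton_pts = [(halton(i, 2), halton(i, 3)) for i in range(1, n + 1)]
print("random:", quarter_disc_area(random_pts),
      "halton:", quarter_disc_area(halton_pts),
      "exact:", 3.14159265 / 4)
```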

steckles fucked around with this message at 07:45 on Apr 28, 2023

steckles
Jan 14, 2006

repiv posted:

the main thing that differentiates "proper" supersampling from just rendering at a higher resolution then scaling down is the sampling pattern, the points sampled within each pixel are rotated about 45 degrees from the pixel grid which has nicer visual properties
Beyond just sampling at a higher rate than your output pixel grid, there's no strict definition of supersampling, but yeah, any decent implementation is gonna move the subsamples. Even a rotated grid isn't great, and the patterns used by MSAA can be quite complicated. Ideally you want a different set of stratified samples per pixel and then reconstruct them with a really huge filter.

Offline renderers use filters that touch 16, 36, or even 64 pixels per sample. 1000 samples with a box filter will usually look worse than 16 samples with a good filter that approximates sinc. It’d be interesting to see where the inflection point was on the GPU, where you’re better off throwing resources into filtering versus rendering. Of course, I guess it’d be pointless given that good sampling isn’t really compatible with all the temporal reconstruction and denoising shenanigans that are needed today.
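
For reference, a wide "approximates sinc" filter of the kind offline renderers lean on is something like a Lanczos (windowed sinc) kernel; this little sketch just prints how the weights fall off with distance, with a = 3 meaning each sample touches a 6x6 pixel neighbourhood instead of the 1x1 of a box filter:

```python
import math

def lanczos(x, a=3):
    """Lanczos kernel: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

# Weight given to pixels at increasing distance from a sample:
for d in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(d, round(lanczos(d), 4))
```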

steckles
Jan 14, 2006

Yeah, morphological AA is good at smoothing lines, but pretty bad at resolving Moiré patterns, which is what more-triangles-than-pixels will usually give you. The whole concept was pretty neat when it landed on the PS3 though. I recall being blown away by how smooth God of War 3 looked when it came out. I suppose it might still have use in a pure SSAA setting to remove some high frequencies before downsampling and get you a slightly smoother image for the same sample count.

Actually, that reminds me: I would sometimes combat Moiré when resizing photos by adding a light noise layer over the affected areas in Photoshop before downsampling, and it'd often work really well to smooth stuff out while still looking natural. I wonder if you could apply something stupid like that, driven by some metric like the surface area of a nanite mesh relative to its projected size, to break things up and trade Moiré for "filmic" noise. I'd guess it'd look terrible and require a ton of scene-specific tuning, but it might be a fun thing to code up.

steckles
Jan 14, 2006

repiv posted:

maybe once they're all-in on software rasterization they could randomize the sample positions per-pixel to trade aliasing for noise? unreal isn't at the point where they can get that experimental though, they still use hardware rasterization for parts of the pipeline so the software rasterizer needs to match the regular sampling grid of the HW rasterizer for them to fit together seamlessly
I wonder how much performance they'd be losing by just rasterizing everything in the shader. Although, if I recall the nanite paper correctly, they're using a DDA-type algorithm for that, which might be tough to adapt to per-pixel sample points, so maybe adding that would be impractical right now. Hardware support for per-pixel sample locations would be cool though.

That no-TAA unreal video looked shimmery as hell, but I did really like the complete lack of ghosting. Maybe I've spent too long playing with graphics algorithms and the parts of my brain that notice small rendering errors are over-developed, but I've always been super sensitive to temporal artefacts.

steckles
Jan 14, 2006

repiv posted:

i wonder which way the pendulum will swing from here with hardware raster, do the IHVs add quadless rasterization modes to make dense geometry more efficient without having to resort to software raster ala nanite?

or do they embrace compute eating everything, and add facilities to make software rasterization more flexible (e.g. alleviate the current limitations imposed by having to use atomics for depth sorting, which practically forces using a vbuffer)
Hardware support for efficient sorting would be pretty sweet, although I don't know if an algorithm exists which wouldn't need at least some atomic operations.

Nanite proved you can ignore the hardware rasterizer and still get high performance. Maybe it's time to look at some more advanced algorithms. Computing a list of triangles which overlap each pixel and calculating coverage analytically might be viable for rendering billions of little triangles with high quality. That'd probably break a whole pile of other stuff as there'd no longer be a 1-to-1 correspondence between the colour and depth buffers, but it'd eliminate Moiré. Probably not practical on modern GPUs though, given the immense amount of ALU and memory you'd need to allocate to every pixel. Perhaps some of the "deep" buffer algorithms developed for order-independent transparency could be adapted to do AA on micro-triangles instead.

repiv posted:

maybe one day someone will come up with a practical way to do texture-space shading that doesn't cause more problems than it solves, then we could maybe move beyond doing temporal filtering in screen space
Yeah, texture/surface-space shading seems like it's perpetually one-last-problem-to-solve away from being truly viable in a real-time setting. A shame, as it was used with such success in Reyes for so long. I guess in a movie you can tune all the geometry to avoid aliasing on a per-shot basis, so they rarely ran into these issues. There was also a huge amount of work put into writing surface and displacement shaders that avoided aliasing in the first place, but that kinda went away when everyone moved to path tracing and you were gonna be taking 1000 samples per pixel anyway.

steckles
Jan 14, 2006

repiv posted:

i was thinking along the lines of allowing you to export to the ROPs from compute, so a software rasterizer can do blending and output more than 64 bits at once
I suppose some kind of shared "blend unit" or something could be added somewhere in the mix to handle such things without completely changing the render pipeline. It'd probably add a bunch of new opportunities for weird stalls, but maybe that'd be fine. Perhaps what we need is some dedicated memory right on the die that knows how to do blending and z-buffer operations. Some kind of "embedded" DRAM, or eDRAM for short. Surely that's never been tried before.

steckles
Jan 14, 2006

Falcorum posted:

As a dev, this "porting toolkit" is utterly bizarre and I can't see who it's for at all. It's not actually for porting so companies that aren't porting their games to Macs already won't care, and companies that are will have better tools anyway.
I figured it’d work as a tool to show developers that there might be enough Mac users to justify a real port. “Look at these people jumping through hoops to play your game. If these people exist, maybe the market for Mac gamers is big enough to justify a real port.” It’s not like it’s a small installed base and the GPU in the M1/M2 is good enough that it wouldn’t be a total embarrassment on the performance front.

On the flip side, perhaps leadership at Apple is nervous about games and this is a weird baby step dreamed up to gauge interest while not offending the ardent anti-game contingent too much. “Look at these people jumping through hoops to play games on our computers. Maybe there are enough dollars there to justify more support from us.”

steckles
Jan 14, 2006

Indiana_Krom posted:

Also the thing about path tracing is that it doesn't actually get significantly harder to do with more complex games.
The best you're going to achieve when tracing rays is, in aggregate, log2(N) work per visibility query, where N is the number of triangles. Put more simply, for every doubling of the number of triangles, you'd expect to do a single additional ray/triangle intersection computation per visibility query. This is generally not achievable in practice, as the optimal BVH is very hard to compute and the quality of the trees generated by drivers today is pretty bad. They're focused on quick build times at the cost of quality, and there's no facility to load precomputed BVHs in any API yet. The other issue is that once your geometry is large enough not to fit in L2, you tend to spend more time waiting for triangles to be loaded from memory than actually computing the intersection.

The computational complexity of rasterization is, in aggregate, N in the absence of any occlusion culling: doubling the number of triangles will double the computation needed. The difference is that rasterization is extremely memory friendly. Basically every part of the rasterization process can be broken down into huge contiguous reads and writes, which play very nicely with memory, and the increase in overall compute utilization easily makes up for the worse theoretical complexity.
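
Quick arithmetic to make the log2(N) claim concrete, assuming a well-built tree (which is exactly the catch above):

```python
# Each doubling of the triangle count only adds roughly one extra traversal
# step per ray; rasterization work scales linearly instead.
import math

for n in (1_000_000, 2_000_000, 4_000_000, 8_000_000):
    print(f"{n:>9} triangles -> ~{math.log2(n):.1f} traversal steps per ray")
# Versus rasterization, where 8x the triangles is roughly 8x the work.
```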

steckles fucked around with this message at 18:26 on Jul 4, 2023

steckles
Jan 14, 2006

Mark my words, in 15-20 years, on the off chance society somehow hasn't collapsed by then, we'll be looking back at the smeared temporally accumulated and interpolated mess of today's games with the same disdain we have for the blurry brownness that was that first crop of games with "HDR".

steckles
Jan 14, 2006

K8.0 posted:

Nah. Games now are doing a markedly better job of showing detail than they ever did before. We're years into the era of details being smaller than one pixel. Throw that stuff in a conventional renderer with no supersampling and it would look horrid.

The only games that will actually look bad when you go back to run them will be the ones capped to low frame rates. Even quite bad spatiotemporal aliasing is barely noticeable if you are running 500 FPS.
I am only being a little hyperbolic. Modern TAA is of course effective at combatting moiré, but the root cause of that is the unsophisticated handling of geometry in today's games. Like what nanite has done for mesh decimation and getting closer to the ideal of two triangles per pixel, I think we've only just scratched the surface of what's possible with new algorithms and engine design. Of course, stuff like temporally accumulated RT and GI, where the lighting follows its own idea of time completely, is something we'll never fix just by adding up more frames.

Plus I think people will just naturally become less tolerant of ghosting and weird artifacts once the shininess of real time path tracing and more-triangles-than-pixels wears off. Like, nobody actually wants those things, right?

steckles
Jan 14, 2006

Truga posted:

but in places where doubling fps would actually help a ton (shitboxes pushing 40fps on a good day), the additional latency can be noticeable, and the artifacts very visible lol
I'm sure future iterations of DLSS will allow for more interpolated frames per "real" one. Who knows, maybe they'll up the interpolation to 20 fake frames per real one and nVidia will advertise that the 5060 8GB or 6060 7GB can do 300fps at 6k in Cyberpunk with Path Tracing. The super advanced predictive Reflex AI needed to keep latency under control will be so good that simply moving the mouse will cause games to play themselves.

steckles
Jan 14, 2006

shrike82 posted:

yeah going to disagree with the walls of texts - LLMs have entrenched the dominance of Nvidia hardware if anything. there are a lot of toy repos out there that cut down LLMs to fit and be compatible with CPUs including Apple Silicon but it's not really useful beyond students and execs wanting something to run locally on their laptops. a) it's lossy and b) why try to penny pinch on on-prem or private cloud hardware - businesses are happy to throw money at racks of nvidia cards to demonstrate having an AI strategy

Eh, I don’t know. We’re still relatively early in the development of machine learning and it’s hard to say where things are going. Nvidia has the best support and most developed software ecosystem for sure, but ultimately most DL algorithms just need to do as many matrix multiplications as possible. A simpler architecture without all the GPU baggage designed solely to feed billions of MADDs could end up being the most cost effective approach as models continue to grow. Plenty of companies are experimenting with such designs.

I wouldn’t be surprised if we see a bunch more competing products, as Alphabet, Amazon, Meta, Microsoft, and others develop in house, single purpose hardware that is cheaper to rent if you’re already in their cloud ecosystem.

steckles
Jan 14, 2006

I can't remember the last time some GPU thing caused this much hand-wringing. We may be in our 30s/40s now, but there's an indignant 13 year old console warrior in all of us.

You're all lucky I don't have a billion dollar company, I'd be paying companies to keep all TAA out of games altogether.

steckles
Jan 14, 2006

Blorange posted:

This is great, it's just fast and intense enough to trigger the uncanny valley. I suddenly care that it's not switching instantly.
The ironic thing about temporal reconstruction algorithms is that the faster they converge, the more noticeable they tend to be.

Anyway, Ray Reconstruction looks pretty cool. I'm sure Nvidia will remain cagey about what they're actually doing, but I'll take a guess: If you take ray direction and the BSDF/PDF of the surface into account, you can use smaller blur kernels for the same noise level by lowering the weight of intra-frame samples you'd expect to have high variance. That would need a bit of extra data supplied by the engine though. Such information would also let them lower the weight of inter-frame samples whose motion vectors would otherwise imply they should be included in the kernel.

I think they might be shifting the colour of the accumulated frame based on the next intra-frame colour as well. "Reconstructing" it, if you will. Where the changes in hue come from a surface texture rather than a change in geometry, as is the case with the Cyberpunk footage shown, you could look at the current frame and say "these pixels used to be yellow, but now they're purple, let's just pretend they were always purple and accumulate them instead of discarding them". I'm sure the heuristics for that would get pretty groady, but that seems like a great use case for a small neural network.
