Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Remind me what the Quadro and Fire are better at doing than the consumer cards again?

Higher quality bins of chips, much better handling of geometry-heavy scenes (not just lots of triangles, but lots of small triangles), and the driver and technical support commensurate with a workstation-level GPU (not just perf, but some weird/edge-case features that don't matter much for consumers but might for a professional system).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Stragen posted:

Getting away from DirectX/OpenGL for a moment, does anyone have experience with Nvidia's Optix realtime raycasting?

I have used it to create simple tests and it seemed to perform OK at low resolution, e.g. 1024x768 on some average hardware (GTX 560 Ti & Intel i7 Bloomfield 2.9 GHz).

But I wonder how it performs in real-world applications? YouTube has dozens of tech demos but almost nothing in the way of real applications using the library.

It would be nice to know I am not wasting my time before I invest heaps of it coding the CUDA programs and a scene graph and managing GPU paging and all the other related parts. All that is necessary because OptiX is just a raycasting library; it is not a complete scene management solution like NVIDIA's SceniX, which I do not wish to use for various reasons.

Any experiences people can share about Optix would be helpful.

It depends a lot on your ultimate use-case. How dynamic is your scene? What's your scene complexity? What light transport/shading technique are you using -- primary + shadow rays, or full path tracing? For a 560 Ti, I wouldn't expect to be ray-tracing Crysis or anything. Performance depends a lot on acceleration structures (which have to be updated when the scene changes), so which structures you use for which objects in the hierarchy, and how often they need to be rebuilt, will end up impacting your performance a lot.

There's more information here, although it's obviously based on somewhat older hardware: http://graphics.cs.williams.edu/papers/OptiXSIGGRAPH10/Parker10OptiX.pdf

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Illusive gently caress Man posted:

I need to implement shadow mapping for the last part of a class project, and I totally understand it except for one thing. How do you create the frustum / projection+view matrices for the light source so that it covers everything in the camera's frustum? Everything I read seems to just skip over this.

As a dead Dane once said, "There's the rub."

In the simplest case, you don't -- just create the light/shadow frustum so that it covers the entire scene (or at least a reasonable area around the player) and use that. This guarantees you're covering all the occluders your viewer might see, but has the substantial downside of not allocating your shadow map texels very efficiently, leading to bad aliasing.

A large portion of shadow mapping research has been dedicated to dealing with this problem. Most games use an adaptation of this called "Cascaded Shadow Maps" [1], where you render the shadow map multiple times from the same light viewpoint with different-sized frusta. This gives you multiple levels of detail which you can pick from based on how far the scene location is from the camera -- distant objects use a larger frustum (lower texel density) while nearer objects use a shadow map with the same resolution but a smaller frustum (higher texel density). This works pretty well, and can be improved upon by a variety of techniques that let you adjust the near/far bounds of each cascade based on the content and fit it more tightly to the actual camera frustum.

There are other perspective-warping techniques that use a single shadow map [2]. These work really well in some circumstances, but have points where they fail badly in the general case. They can be a good option if you have a lot of control over the scene.

Finally, there's some real "blue-sky" research into just eliminating the shadow frustum altogether and sampling only points which correspond to actual screen samples. Irregular Shadow Maps [3] is a neat idea, but the cases where it really provides the best benefit can usually be solved almost as well with much less expensive methods.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/ee416307(v=vs.85).aspx
[2] http://http.developer.nvidia.com/GPUGems/gpugems_ch14.html
[3] http://visual-computing.intel-research.net/publications/papers/2009/izb/soft_shadows_larrabee.pdf

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

There are only two real solutions for point lights:
- Cast separate shadowmaps for objects (or clusters of them) being hit by the light.
- Use a cubemap.

The first usually produces higher-quality results, but can't deal with lights inside shadow targets and also causes a ton of state thrash.

Secret third option: Don't cast shadows from point lights, which is much more common and viable than you'd initially think. Most artificial light sources are embedded in something, which makes them inherently directional.

Dual paraboloid shadow maps are worth looking at too

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Shameproof posted:

Nobody has any idea? I just need a power of ten. I've heard that if something changes less than every ten frames you should go static.

If you have any question as to how often it will be updated AND the size of the resource is not "large" relative to PCIe bandwidth (i.e. most geometry buffers, some small-to-medium textures), just make it Dynamic.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Van Kraken posted:

Is there a way to have a GLSL geometry shader operate on two types of primitive at once? I'm writing one that duplicates both points and triangles and I'd rather not have two almost-identical shaders.

You could use a pre-processor macro
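
For example, something like this (a minimal sketch -- the INPUT_POINTS define and the uOffset uniform are made up for illustration); compile the same source twice, once with INPUT_POINTS defined and once without:

code:
#version 330 core
// Compiled twice from one file: with INPUT_POINTS it's a point-duplicating GS,
// without it it's a triangle-duplicating GS.
#ifdef INPUT_POINTS
layout(points) in;
layout(points, max_vertices = 2) out;
#define VERT_COUNT 1
#else
layout(triangles) in;
layout(triangle_strip, max_vertices = 6) out;
#define VERT_COUNT 3
#endif

uniform vec4 uOffset;   // where the duplicated copy goes (hypothetical)

void emitCopy(vec4 offset)
{
    for (int i = 0; i < VERT_COUNT; ++i)
    {
        gl_Position = gl_in[i].gl_Position + offset;
        EmitVertex();
    }
    EndPrimitive();
}

void main()
{
    emitCopy(vec4(0.0));   // the original primitive
    emitCopy(uOffset);     // the duplicated copy
}
You'd build two program objects from the same file, one per input primitive type.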

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

NVIDIA is only interested in talking about techniques that require (and in turn, sell) their latest generation of hardware, so they're never going to offer insight on implementing techniques on older hardware. The main thing they do in this case is stuff instance data into a constant buffer, which is probably only viable on DX10 because constant manipulation is much slower on DX9, but you could always do this (and they acknowledge as such) by stuffing the data into vertex streams instead.

The point of that article was to give an example of how to use instancing (a DX10 feature) to render a huge amount of skinned geometry effectively. You can issue one Draw call, with one model's worth of vertices and a texture containing all of the skinning data, rather than rendering 10,000 models one at a time or using other, much more cumbersome approaches. The core idea (or the one you're referring to, at least) should be portable to DX9, you're right -- and in fact it's an approach that's been used in a few DX9 engines.
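
A rough GLSL analogue of the idea (the article itself is written against DX10/HLSL): stuff the per-instance data into a buffer texture, index it with gl_InstanceID, and issue a single instanced draw. uInstanceData and uViewProj are made-up names, and a full skinning palette would be fetched the same way as the single per-instance matrix shown here:

code:
#version 330 core
// One glDrawElementsInstanced() call renders every instance; each instance
// pulls its own transform out of a buffer texture via gl_InstanceID.
layout(location = 0) in vec3 aPosition;

uniform samplerBuffer uInstanceData;   // four RGBA32F texels per instance = one mat4 (column vectors)
uniform mat4 uViewProj;

void main()
{
    int base = gl_InstanceID * 4;
    mat4 world = mat4(texelFetch(uInstanceData, base + 0),
                      texelFetch(uInstanceData, base + 1),
                      texelFetch(uInstanceData, base + 2),
                      texelFetch(uInstanceData, base + 3));
    gl_Position = uViewProj * world * vec4(aPosition, 1.0);
}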

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Instancing is a DX9 feature. The paper specifically acknowledges that this was possible before using vertex streams to store instance data, the point of it (i.e. what it was purporting to offer over known techniques) was that you could get more performance by using a constant buffer instead.

Yeah, instancing "existed" in DX9, but it was really cumbersome and sometimes problematic. Robust first-class instancing support was one of the "features" of DX10.

Anyways, this is all beside the point. You can in fact do something close to this in DX9.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...
Really what you want to be doing is supersampling -- either by rendering a higher-resolution framebuffer and resolving it down, or by replaying the pixel shader multiple times at different offsets within the shader and combining the results. However, this is, as you'd expect, pretty expensive, and wasteful in areas where you don't have high-frequency effects that you need to clean up. You could do something like having the shader determine if it's in an area with high-frequency effects (using the same logic as your edge outlining) and only super-sampling those areas.
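
A bare-bones sketch of that selective approach -- shadeScene() and isHighFrequency() are placeholders for your existing shading and edge-detection logic, and uInvResolution is assumed to be 1.0 / viewport size:

code:
#version 330 core
in vec2 vUV;
out vec4 fragColor;

uniform vec2 uInvResolution;   // 1.0 / viewport size

vec4 shadeScene(vec2 uv)
{
    // ... your existing per-pixel shading goes here ...
    return vec4(uv, 0.0, 1.0);   // placeholder
}

bool isHighFrequency(vec2 uv)
{
    // ... reuse the same test as your edge outlining ...
    return false;                // placeholder
}

void main()
{
    if (!isHighFrequency(vUV))
    {
        fragColor = shadeScene(vUV);   // cheap path: one evaluation
        return;
    }
    // Expensive path: re-evaluate the shading at 4 sub-pixel offsets and average.
    const vec2 offsets[4] = vec2[4](vec2(-0.25, -0.25), vec2(0.25, -0.25),
                                    vec2(-0.25,  0.25), vec2(0.25,  0.25));
    vec4 sum = vec4(0.0);
    for (int i = 0; i < 4; ++i)
        sum += shadeScene(vUV + offsets[i] * uInvResolution);
    fragColor = sum * 0.25;
}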

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

movax posted:

I don't know if we have a dedicated GPGPU/CUDA thread, are there folks in this thread that have played with CUDA? Looking to figure out the best way to tune threads/blocks/grids (or write a simple scheduler to parcel up the workload based on input size).

Have you looked at the occupancy calculator that comes with the SDK? It's probably the best place to start.

As for variable workloads, the best practice is to put all of your work "packets" into a device memory array, with a size and a "next" counter. Then dispatch the kernel as a "maximal launch" (i.e. a number of blocks guaranteed to fill all the processors in the GPU). In the kernel, run a loop that does a global atomic fetch-and-increment on the "next" counter. If the value is greater than or equal to your workload size, have the block exit; otherwise, fetch the workload description for the index you received and process it in that iteration. The kernel will end once all the blocks have finished their work and there are no more entries in the queue.

This works great for variable sized workloads of discrete independent tasks which can each still take advantage of block-level parallelism and where you know the workload beforehand. There are also ways to leave a workload processor "spinning" and feed work batches into it as a real-time producer/consumer, but they get a bit trickier.
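
Here's that pattern sketched as a GLSL compute shader, to stick with the shading language used elsewhere in this thread; the CUDA version is structurally identical, with blocks in place of workgroups and the counter sitting in device memory. WorkItem, processItem(), and uWorkCount are placeholder names:

code:
#version 430
layout(local_size_x = 64) in;

struct WorkItem { vec4 data; };

layout(std430, binding = 0) buffer Queue   { WorkItem items[]; };
layout(std430, binding = 1) buffer Counter { uint next; };   // initialized to 0 before dispatch

uniform uint uWorkCount;

shared uint sIndex;

void processItem(uint idx)
{
    // ... per-item work goes here, e.g. read items[idx] and use the whole
    // workgroup (gl_LocalInvocationIndex) to process it in parallel ...
}

void main()
{
    // Dispatched as a fixed, "maximal launch" number of workgroups; each one
    // loops, pulling items off the queue until it's drained.
    while (true)
    {
        if (gl_LocalInvocationIndex == 0u)
            sIndex = atomicAdd(next, 1u);
        barrier();               // make sIndex visible to the whole group
        uint idx = sIndex;
        barrier();               // everyone has read it before it can be overwritten
        if (idx >= uWorkCount)
            return;              // queue empty: the whole workgroup exits together
        processItem(idx);
    }
}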

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...
Branching in a shader like that is completely harmless, since it should be uniform across all pixels generated by the same triangle. In other words, the triangle will have a single material id, so the branch will essentially be constant. Branching only has a real cost (beyond the relatively cheap ops needed to compute the branch condition) if the branch is divergent -- i.e. if two pixels that are part of the same SIMD workload on the core ("close in screenspace" for lack of a better term) take different paths at the branch. Otherwise, it might as well not be there.

The texture approach, on the other hand, is going to be far worse because you're effectively doubling your texture bandwidth requirements (not to mention texture fetch requests) in every case. Again, if a triangle has a uniform material type, then what you've effectively done is cut your texture cache size in half by doubling the number of requests for no real gain.

Of course, in reality the original shader is still flawed. Assuming that Tex0 and Tex1 are sampled with mip-maps, the compiler will actually have to move BOTH fetches outside of the branches and replace them with a conditional move anyways, meaning that they'll end up generating almost exactly the same code at the GPU level. Why? Because in order to find the mip level, the shader needs the screen-space derivatives of the texture coordinates (dDX(U), dDY(U), dDX(V), dDY(V)) to figure out the minification power. Since the GPU does this by comparing registers from adjacent pixels, it requires that all the neighboring pixels have a common code path to the point where the derivatives are calculated (and thus to where the textures are fetched). Thus your shader effectively becomes:

code:
vec4 colour0 = texture(tex0, coord);
vec4 colour1 = texture(tex1, coord);
vec4 colour = (vertexMaterial == 0) ? colour0 : colour1;
You can get around this problem by computing the derivatives manually outside the branch, and then doing the fetch in a branch as before:

code:
vec4 colour;
vec2 coord_ddx = dFdx(coord);
vec2 coord_ddy = dFdy(coord);
if (vertexMaterial == 0)
{
    colour = textureGrad(tex0, coord, coord_ddx, coord_ddy);
}
else if (vertexMaterial == 1)
{
    colour = textureGrad(tex1, coord, coord_ddx, coord_ddy);
}
This should be the most efficient method, unless you're already saturating your texture fetch pipeline in the shader (because TextureGrad can't be issued as fast as the simpler Texture instruction).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

MarsMattel posted:

Interesting, thanks!

Now, I'm not sure this is the correct way of doing things (i.e. I'm not sure if I'm handling multiple materials correctly in my polygon generation), but I can currently occasionally end up with a single triangle having multiple materials. How would that impact things? Would there be a performance hit only on those triangles with multiple materials (which since there would be a small number of these triangles, probably isn't much of a problem)?

Correct; you'd pay the cost for the divergence only for triangles with fragments (actually 32-fragment clusters -- something like 8x4 pixel blocks usually) that follow both branches.

Check this out http://www.nvidia.com/content/PDF/GDC2011/Nathan_Hoobler.pdf (p. 22)
This is a bit more technical/specific than you are looking for, but the concepts are basically the same, and the diagrams should make it a bit more clear. The other sections of that article are also applicable.

Hubis fucked around with this message at 04:35 on Nov 14, 2013

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

So, one thing's really been bugging me about heightmap terrain. Generally, the ideal size of a terrain heightmap is a power of 2 + 1 height samples per axis, because then performing LOD on it is just a matter of collapsing it to half resolution, and it can be partially collapsed.

How do you align textures with that though (i.e. alpha masks for terrain features), since those generally need to be (or perform much better if they're) a power of 2 on each axis instead?

Well, two parts:

First, performance on non-power-of-2 textures is on par with every other surface on any reasonably modern GPU -- if someone is saying otherwise, either it's very old advice, or they've got results I'd be very interested in seeing.

Second, it doesn't matter because your geometry is at (2^N + 1) but your textures can still just be (2^M). If M = N, you'll end up with one texel per "square" in the geometry, since the heightmap describes the edges of the patches. Draw it out and you'll see what I mean.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

That's the problem though, they'll have one texel per quad, but they'll be centered on the terrain quads, which has problems with the edges in particular because the area between the edge texels and the edge of the terrain quads will either be clamped or mirror the opposite side of the terrain.

What I'd like is for the terrain texels to be centered on the mesh points, but doing that requires a 2^N+1 texture.

Ah, I see what you're saying. In that case, your options are either to resize the textures to (2^N+1) like you said, OR resize them to (2^N+2) and create a guard band by copying the second/inner row of the neighboring texture into the outer row of the texture so that it interpolates seamlessly. Note that you'll have to be crafty with how you generate mip maps in either case, and you'll still have discontinuities with anisotropic filtering. On the other hand, the guard-band method can be extended to be up to 16px wide (for example) so that anisotropic filtering still works correctly.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Madox posted:

edit:
I'm aware that I can probably copy the multisample texture to a plain texture and then use that instead, but wonder if there is a proper way to use the multisample texture directly without that extra step

This has kind of already been answered, but to underline -- No, there's no way to Sample() from a MSAA texture. This is because the samples in MSAA are not at regular intervals, and so the interpolator hardware can't know how to automatically interpolate at a given texture coordinate. Load()ing individual samples explicitly and then resolving them yourself (either as a pre-process, or in the sampling shader) is the only semantically meaningful way to access it.

ResolveSubresource() triggers a fast path that does a box filter on the MSAA texture to a non-MSAA texture of the same resolution (in other words, averaging the samples for each pixel). It's worth noting that this actually isn't the best way to resolve an MSAA texture quality-wise, but that gets into sampling theory. The other option would be to, as you said, either write your own resolve pass or do the MSAA resolve per-sample in the shader (probably bad performance-wise).
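
In GLSL terms (the same idea applies to Load() in D3D), a custom resolve is just a fullscreen pass that fetches each sample explicitly and combines them however you like -- a plain box filter here, with uSampleCount assumed to match the texture's sample count:

code:
#version 330 core
out vec4 fragColor;

uniform sampler2DMS uColorMS;
uniform int uSampleCount;   // must match the MSAA texture's sample count

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec4 sum = vec4(0.0);
    for (int i = 0; i < uSampleCount; ++i)
        sum += texelFetch(uColorMS, coord, i);   // third argument is the sample index
    fragColor = sum / float(uSampleCount);       // box filter; swap in a fancier weighting here
}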

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Spite posted:

Generally, hoist everything you can. So if you can hoist something off the GPU (or push it up the pipeline), it's worthwhile. You'd have one matrix multiply instead of # of vertexes multiplies.

While this is generally true, I'd only do it if it doesn't require you to update constant buffers more often. Mapping and updating buffers adds memcpy and buffer-management overhead in the API, whereas the cost of the transform is probably negligible (it will only matter if your vertex transform is actually bottlenecking throughput).

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Boz0r posted:

I'm going to take a shot at implementing Metropolis light transport for a project at uni. I have a simple homebrewed raytracer that's single-threaded but easy to debug. Would it be a good idea to do the implementation in a GPU-accelerated framework like NVIDIA OptiX, or something? What would you recommend?

If you want it to be fast, then a GPU ray tracer is a good option. I'd recommend OptiX if you can, just because, as much of a pain as learning a non-trivial API can be, it solves a lot of fiddly technical problems that are much less obvious as well.

vv This is also true vv

Hubis fucked around with this message at 20:15 on Jan 24, 2014

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

Couple of related questions regarding meshes:

1) How much of a performance gain does it actually tend to be when a relatively complex mesh is composed of strips using primitive restart rather than discrete triangles? (still using indexed vertices, of course)
2) Is there a good algorithm that can 'bake' a triangle mesh into strips suitable for use with primitive restart? Ideally something that will be called the first time the game loads and then the result gets saved to a cache somewhere. As close to optimal as possible assuming that the complexity doesn't get worse than, say, n^2.

Are the meshes static or dynamic?

If they are static, then you might not see any performance gain at all unless you are vertex-shader bound. Even if you are VS-bound, so long as the vertices are indexed the gains will vary depending on how cache-efficient your vertex ordering is (since it skips vertices whose results are still in the cache) and how wide/efficient your vertex data itself is.

If the meshes are dynamic, then stripping gets more important, since it means touching less memory when the buffer is updated. That has more to do with CPU workload, however, so if you're GPU-bound it may not amount to much.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

Static, I'm mostly wondering if it's worth it to bake some mesh data in this manner before shipping it. I'd be a little surprised, but it's intended for mobile devices which have all sorts of surprising performance bottlenecks.

OK, for mobile devices this might make a lot more sense, since they're often chunkers (tile-based renderers, so vertex load gets multiplied). The chance of being vertex bound is greater, but the general trends are the same -- it will matter much more with wide per-vertex formats than small ones, and it depends on how much computation is actually being saved by re-using results already in the cache.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

UraniumAnchor posted:

That's what I figured.

I suppose my third question is where I might find a good algorithm that does so in a reasonable span of time.

"Reasonable" here meaning that it's alright if it takes a while since I can just bake it into the package (or worst case store it in the device cache one time and only rebuild it on a reinstall).

There seems to be an abundance of packages that will claim to do it FOR you, but that's not really what I want either unless it's open source in some way.

take a look at Assimp -- it has a model import flag for "optimizing meshes" that might give you what you're looking for.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Madox posted:

Also for efficiency - what happens in the shader is that if you have 4 textures and are picking one of them using the array index, the shader will sample every texture and THEN throw 3 away, so be aware of that (unless cards are a lot better now - I haven't looked at asm in a while). It's not picking one sampling to do.

As I understand it, all branching is done this way. Both branches run always, and then one is thrown away.

Actually, depending on how you do it, all four textures will be sampled no matter what (they are sampled outside the branch and then the results are conditionally moved into the output). This is because all of the sample instructions need to execute across the local screen-space neighborhood to produce the screen-space UV derivatives that mip-mapping uses to pick its mip level.

If you're not using mip-mapping, then the samples will only be executed for branches that are actually taken in the local screen-space neighborhood of the rendered geometry. In other words, if you are drawing a triangle that only samples from one texture then only one of those branches will ever be taken (the shader is smart enough to skip sections with 0 coverage). You're right about what happens if even one pixel in the local neighborhood takes a different branch, however.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Madox posted:

That's interesting, I assumed all four textures would be sampled no matter what, all the time. The last time I looked at the assembler output of the D3D9 shader compiler, the assembly always sampled all textures and there was no smartness about mipmapping, though maybe it's better at run time now -- this was like 5 years ago?

What you're seeing is correct -- it's hoisting out the tex instructions because it can't guarantee derivatives if they were in branches. Since Mipmapping is a state property, the compiler doesn't know if that's safe or not (hardware-level drivers may recompile the shader based on actual state, but that's beyond what HLSL does).

The key thing is that this is really all about the derivatives, not mipmapping -- you can put tex2d samples inside branches if you manually hoist calculating (dUV/ddx, dUV/ddy) out above the branch and then feed that into the branched tex2D instructions. In SM 4.0+ I believe that's the SampleGrad instruction.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

OneEightHundred posted:

Is it legal to alias the outputs of a compute shader as a different type in another shader?

Like, if I have a compute shader output to a structured buffer that it thinks contains this:
code:
struct BlockV
{
    int4 stuff[8];
};
... and a pixel shader reads the same structured buffer thinking that it contains this:
code:
struct Block
{
    int stuff[32];
};
.... is that supported behavior?

You almost certainly answered your question by now, but the answer is "100% yes".

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Joda posted:

For my B.Sc. project I need to do multiple samplings of 4 separate buffers per fragment. To achieve somewhat decent frame times, I want to avoid sampling too many separate textures and cache misses if possible. Say I want to pack 128 bits of arbitrary information into a GL_RGBA32F format, are there any guides on how to "cheat" GLSL in a way that will allow me to pack and unpack the information? An example of what I want to do:

Fragment input:
code:
[Calculate stuff]
vec4 tempOut = vec4();
tempOut += ((quarter-precision vec4 cast) normal1);
tempOut += ((quarter-precision vec4 cast) normal2) << 32;
tempOut += ((quarter precision vec4 cast) position) << 64;
tempOut += ((quarter precision vec4 cast) postion2) << 96;

fragOut = tempOut;
Texel output:
code:
in vec4 input;

void main() {
	vec4 normal1 = (quarter-precision vec4 cast) ((input << 96) >> 96);
	vec4 normal2 = (quarter-precision vec4 cast) ((input << 64) >> 96);
	etc.
}
Is it even possible to do this sort of bit-wise cast in GLSL, where it takes the bits in a sequence and re-interprets them as something else? (i.e. if I have a 32-bit unsigned int that was 2147483648 and I wanted to re-interpret it as a ivec4 it'd take the bits of 0111...111 which would make (128,256,256,256)).

1) Why not use GL_RGBA32UI instead? You should be able to do your casting natively there -- pack the normals/positions into four 32-bit words however you see fit (see the sketch at the end of this post).

2) If you're converting your vectors/positions to signed normals (i.e. a uniform mapping to bits covering the range [0, 1]) you won't need these, but it might be worth looking at the floatBitsToUint/UintBitsToFloat GLSL functions.

3) Be sure what you're doing is actually helping. Texture caches on modern GPUs are really just memory maps, so having 4 separate RGBA8 textures being loaded sequentially will not necessarily be any less efficient than 1 RGBA32 read. With the 128-bit format you're packing your 'structure' adjacently in memory, but remember that each thread/pixel in the GPU isn't being executed sequentially -- it's running in parallel as part of a SIMD group (a 32-thread Warp/64-thread Wavefront depending on if you're talking NV or AMD).

Conceptually, the code might be:
code:
for each pixel in pixels:
  s1 = Load(Tex1, pixel)
  Process(s1)
  s2 = Load(Tex2, pixel)
  Process(s2)
  s3 = Load(Tex3, pixel)
  Process(s3)
  s4 = Load(Tex4, pixel)
  Process(s4)
But in reality what is happening is

code:
s1[pixels]
for each pixel in pixels:
  s1[pixel] = Load(Tex1, pixel)
Process(s1)
s2[pixels]
for each pixel in pixels:
  s2[pixel] = Load(Tex2, pixel)
Process(s2)
s3[pixels]
for each pixel in pixels:
  s3[pixel] = Load(Tex3, pixel)
Process(s3)
s4[pixels]
for each pixel in pixels:
  s4[pixel] = Load(Tex4, pixel)
Process(s4)
Also, the pixels are placed in their SIMD groups based on locality, and the textures are laid out such that spatially-local accesses are cache efficient. Thus, if you want to reduce cache misses, what you want to do is make sure that as many of the texels being sampled by the local group for that instruction fit into a cache line as possible. This means you want to do two things: make sure that the texture coordinate accesses are local, and reduce the size of that memory access as much as possible.

Now sampling multiple textures MIGHT be a problem for you. First, the texture unit itself has a limit on how much throughput it can handle in terms of requests; however, if you're just doing four taps you shouldn't be hitting this -- it's usually more of an issue for things like soft shadowing shaders that are doing 13+ taps per pixel. You might also have a problem if your texture reads are dependent (that is, the coordinate for one read is derived from the result of another); in that case you have to serialize your reads, and depending on what other work your shader is doing you may not be able to effectively hide the latency of the texture access with arithmetic instructions. From what it looks like though this isn't what you are doing.

It may all be a wash anyways if all of your math ops require all the values to be read before they can do any work. The memory behavior will still be slightly different, but you're still pulling the same total bandwidth, so there will be a lot more variability based on the hardware's specific caching strategy. Still, if there are ANY operations you're doing that only need a subset of the results, the compiler should be able to reorder the shader so that they are executing while the rest of the data is loading, leading to better total throughput by overlapping latencies.

Finally, all of this is not to say this isn't a clever idea. There's definitely value in it for something like G-Buffer packing in deferred shaders. For example, a Normal doesn't really need all four channels in an RGBA texture -- you could choose to pack it as 3xFloat16 values, or even two (deriving the third from the fact that the normal is, well, normalized). If you can use bit packing to compress below what you'd be using in separate textures then the gains might start to appear. Just keep in mind that the packing/unpacking instructions aren't free either, so if you're at all instruction-bound then it may hurt more than it helps.
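
To make (1) and (2) concrete, here's a sketch of the write side -- the particular encodings (half floats for the normals' XY, 16-bit unorm for the positions) are just placeholders for whatever precision trade-off you actually want:

code:
#version 420
// Write everything to a single GL_RGBA32UI target using the packing intrinsics.
layout(location = 0) out uvec4 packedOut;

in vec3 vNormal1;
in vec3 vNormal2;
in vec2 vPosition1;   // assumed pre-mapped into [0, 1]
in vec2 vPosition2;

void main()
{
    packedOut.x = packHalf2x16(vNormal1.xy);   // z goes elsewhere, or gets reconstructed
    packedOut.y = packHalf2x16(vNormal2.xy);
    packedOut.z = packUnorm2x16(vPosition1);
    packedOut.w = packUnorm2x16(vPosition2);
}
The read side is a texelFetch through a usampler2D followed by the matching unpackHalf2x16/unpackUnorm2x16 (or uintBitsToFloat) calls on each channel.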

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Joda posted:

Whenever I implement Lambertian shading I get these concentric circles on uniformly coloured, flat surfaces. I assume it's a consequence of only having 8 bits per colour channel, but I don't recall ever actually seeing them in professional applications. Is there something I'm doing wrong, or is this kind of artefact just usually hidden by visual complexity (i.e. more polygons, normal mapping etc.)?

could be a linear-sRGB color space problem
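
If that's what's going on -- shading computed in linear space but written straight to an 8-bit non-sRGB target -- the dark end of the gradient gets starved of precision, which is exactly where banding shows up. One fix is to render to an sRGB format (GL_SRGB8_ALPHA8 with GL_FRAMEBUFFER_SRGB enabled) and let the hardware do the conversion; the other is to apply the transfer function yourself at the very end. A sketch of the manual route (vLinearColor is a made-up varying standing in for your shaded result):

code:
#version 330 core
in vec3 vLinearColor;   // shaded result, in linear light
out vec4 fragColor;

// Standard piecewise linear -> sRGB encode.
vec3 linearToSrgb(vec3 c)
{
    return mix(c * 12.92,
               1.055 * pow(c, vec3(1.0 / 2.4)) - 0.055,
               step(vec3(0.0031308), c));
}

void main()
{
    fragColor = vec4(linearToSrgb(vLinearColor), 1.0);
}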

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Sex Bumbo posted:

You specify an srgb texture explicitly in the same way you specify an rgb texture explicitly so unless you deliberately did it, you probably don't have one (E: and it might need to be enabled). Also, using linear lighting doesn't stop banding, it just shifts the bands around to less perceptible values.

Working in linear doesn't cause banding, but switching back and forth between linear and sRGB (which it doesn't sound like he's doing) could. I've worked on projects before that ended up with weird banding issues because someone was careless with which post-processing stages were linear and which were sRGB when they tried to reduce the number of passes, for example.

Sex Bumbo posted:

If you take a screenshot and look at the values, they should be adjacent rgb colors, like 127, 127, 127 in one pixel and 128, 128, 128 in the next -- it just happens that your eyes pick up on the small difference. If they're not adjacent values there might be some bad format conversion or something going on. Also it happens more than it should in professional apps imo.

see:
http://loopit.dk/banding_in_games.pdf
https://www.shadertoy.com/view/MslGR8

Yeah, this is also good advice. If you're getting quantization issues, then you'll end up with 'gaps' between bands as mentioned (since the problem comes from projecting a format with lower precision in that range onto a format with higher precision).

Also, is your banding in the bright/medium/dark end of the range?

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Tres Burritos posted:

What's the simplest way to make a heatmap using shaders? (Assuming that's even a good idea?)

I'm fairly certain you could do it by creating a flat plane and then putting point lights over it where your different datapoints are. This also allows you to change the intensity / color of the light based on whatever factors you choose, correct?

And if that's all you're doing, would it be smart to use deferred rendering?

It depends on how accurate you need it to be, and what you want to use it for.

The easiest way would be the "splatting" method -- render your sample points as quads over your destination map with additive blending on. The size of the 'splat' should correspond to the importance/intensity of the sample (so that the 'edge' of the splat is where the weight function hits zero). Inside the pixel shader, you compute how far the fragment being shaded is from the center of the splat, weight it with something like a windowed gaussian, and output a value scaled by that function. All the splats will additively blend together and give you a localized density in the output texture, which you can then read from and feed into a color lookup if you want false-color rendering. This is also known as a "scatter" approach (you're "scattering" a single input to multiple outputs).
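
A sketch of the splat's fragment shader -- vLocalPos is a made-up varying running from -1 to 1 across the quad, uIntensity is the sample's weight, and the falloff is just one example of a windowed kernel:

code:
#version 330 core
// Rendered with additive blending (glBlendFunc(GL_ONE, GL_ONE)) into a
// single-channel float target (R16F/R32F).
in vec2 vLocalPos;      // -1..1 across the splat quad
out vec4 fragColor;

uniform float uIntensity;

void main()
{
    float r2 = dot(vLocalPos, vLocalPos);   // squared distance from the splat center
    if (r2 > 1.0)
        discard;                            // outside the splat's footprint
    // Gaussian-ish falloff, windowed so it hits zero exactly at the edge.
    float w = exp(-4.0 * r2) * (1.0 - r2);
    fragColor = vec4(uIntensity * w, 0.0, 0.0, 1.0);
}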

The above way is going to be the most efficient for most cases, though you might run into performance problems if you end up with most of your points in a very small area of the render target. In that case you'll run into a blending bottleneck, since the pixels will be contending for the same blending interlocks (although this isn't as much of a problem as it sounds for small amounts of overdraw). The other approach is something closer to what is done with "tiled" deferred lighting: divide the output into regular tiles (8x8 is probably a good number) and go through all the points so that each tile has a list of the points which may be affecting it (points may appear in multiple lists). Then, using a compute shader pass (ideally -- though this is doable with a fullscreen PS pass as well, albeit less efficiently), figure out what tile each pixel belongs to and compute the sum of the distance-weighted value for every sample that hits that tile. Then output that, without blending. This can be more efficient if there's a lot of "hot spots" with blend contention, but it's still not necessarily a win, because now you've got a much smaller number of threads doing a lot of work in serial, instead of the entire GPU distributing the workload (and then relying on the blend interlock to coalesce it). This is what's referred to as a "gather" approach (you're "gathering" multiple inputs to a single output).

e:
And if the number of points is relatively small and performance is not critical, jam them into a constant buffer and just do a full screen PS pass that iterates over all the samples at every pixel and accumulates a weight. This is roughly equivalent to the second ("gather") approach except without the tiling step, which means you will probably have a lot of wasted work processing samples that don't affect the pixel at all. Still, if the number of samples is small (or their radius of effect is large) it could be roughly as efficient.

Hubis fucked around with this message at 15:57 on Oct 30, 2015

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Tres Burritos posted:

Ahhh yep, I follow.


Good poo poo. So where you put

code:
float totalWeight = 0.0;
    for (float i = 0.0; i < 69.0; ++i) {
        
    	totalWeight += d(uv, vec2(
            sin(iGlobalTime * 0.3 + float(i)) + sin(i * i), 
            cos(iGlobalTime * 0.4 + float(i * 2.0))
        ));
    }
I'd just be comparing that fragment against all the uniform points or whatever that got passed in. The problem with this seems to be that it doesn't run so hot on a 4k display, I'm guessing looping through all those fragments (for like 1000 points on a GTX 980) is a little expensive. Or maybe shaderToy just doesn't like fullscreen.

I think I'm going to try that first and see how it goes.


I'm not quite getting this one, so for each datapoint I'd create a quad (based on the intensity / whatever) of the point, then I'd run the gaussian function for just that plane and then render that to a texture?
And then when planes are overlapping the GPU would just know what's going on and do some behind the scenes blending? Would you have to make sure that the planes are in distinct layers (y = 0, y = 1, y = 2) so that you didn't get weird collision artifacts (I'm fairly certain I've seen that before)?

They can all be in the same plane -- if you configure additive blending like Joda described, the blend/framebuffer unit will resolve overlapping regions correctly.

You'd accumulate your splats to a single-valued texture (R16_FLOAT or whatever) and then when you rendered that texture, you'd remap the float value to a color via a lookup.
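
Something like this for the final pass (uMaxDensity and uColorRamp are made-up names -- normalize however makes sense for your data):

code:
#version 330 core
in vec2 vUV;
out vec4 fragColor;

uniform sampler2D uDensity;     // the single-channel accumulation target
uniform sampler1D uColorRamp;   // e.g. a 256x1 false-color gradient
uniform float uMaxDensity;      // normalization factor for the lookup

void main()
{
    float d = texture(uDensity, vUV).r / uMaxDensity;
    fragColor = texture(uColorRamp, clamp(d, 0.0, 1.0));
}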

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Doc Block posted:

Am I allowed to ask Metal questions in here?

This would probably be the right place, though I'd be curious to see how much expertise is out there

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Sex Bumbo posted:

As an experiment you might want to try lowering the bit depth, even to something extreme like 565, just to see how it affects performance. Other than that, I'm not familiar with Metal but there aren't really that many different ways to downsample a texture and blur it that I'm aware of. Are you able to profile it somehow? You can do hacky profiling like doing a pass-through non-blur just to see if it's the blur kernel that's the bottleneck.

Does Metal support 10-11-10?

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Doc Block posted:

Whatever they're called then when you specify the texcoords in the fragment shader so the GPU doesn't know to prefetch the necessary texels.

Or is that not an issue anymore? I'll have to play around and see.

Not a thing anymore (and hasn't been really since 2005 or so).

"Dependent Texture Reads" refer to one texture fetch relying on the value of a previous texture fetch to determine its lookup location. A screen-space distortion shader is a perfect example -- the value you output is fetched from the rendered framebuffer using a coordinate offset by a second "distortion map". This is potentially bad because the first texture read injects round-trip latency so the shader unit sits idle, then the texture unit sits idle while the shader unit computes the new sample position. If you have well balanced workloads that will keep the shader unit busy with other things then this isn't that big of a deal.

Reading something straight from a vertex interpolator will be no faster than if you read it and did some kind of (simple) math on it first. Either way it's executing some shader functions to put the interpolated (and possibly modified) value into a register that it then feeds to the texture unit. There's no "fast path" where you just pump the TEXCOORD semantic straight to the texture unit anymore.
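
To make the distortion example concrete (uScene, uDistortion and uStrength are made-up names) -- the second fetch can't be issued until the first one comes back, which is what makes it a dependent read:

code:
#version 330 core
in vec2 vUV;
out vec4 fragColor;

uniform sampler2D uScene;        // the already-rendered framebuffer
uniform sampler2D uDistortion;   // offset vectors, stored biased into [0, 1]
uniform float uStrength;

void main()
{
    // First fetch: read the distortion vector.
    vec2 offset = texture(uDistortion, vUV).rg * 2.0 - 1.0;
    // Second fetch: its coordinate depends on the first result -- the "dependent read".
    fragColor = texture(uScene, vUV + offset * uStrength);
}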

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Sex Bumbo posted:

I thought you were wrong about this, at least regarding A7, but

https://developer.apple.com/library...SPlatforms.html

It is I who am the big dummy

It's a fair possibility -- my experience is mostly with discrete PC GPUs -- but yeah I would have been very surprised if the PowerVR architecture were hugely different in that area.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Ralith posted:

I think the trick is that the dev-to-efficient-game-engine time ends up being much shorter in turn.

Yeah. The "Hello 3D World" program is going to be a lot bigger/more complicated in both DX12 and Vulkan (more so in Vulkan, relatively, due to the ability to write super-basic programs in GL), but a lot of that is boilerplate code that any "real" application is going to need anyways.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

haveblue posted:

Licensed engines are also a lot more common now, so Unity or Epic can incur that time and then everyone else gets to skip it.

Yes, and this is really part of the plan. If you look at the entire rendering stack, from user API down to GPU silicon, as a finite amount of work distributed among all the companies working in the industry (from indie dev to Epic to Microsoft to NVIDIA), then you are really redistributing a lot of the optimization work done at the driver/OS level and pushing it into user space. There may be efficiencies that can be gained by that, but ultimately there is still unavoidable added work that's either going to have to be done by developers who are currently focusing on other existing tasks, or by new/expanded middleware. Part of the reason why companies like Epic are pushing the new APIs is that they simultaneously give them the ability to deliver more performance and create an even bigger niche for them to fill.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Suspicious Dish posted:

And yet still a lot of companies with engines built in the last year are also able to handle the task.

Part of it is because Vulkan's design really doesn't give you a big opportunity to "get it wrong" like you do with high-level APIs, and there's a lack of strange or opaque behavior. Experienced engine programmers aren't upset at Vulkan for making them do more work, they're thankful they don't have to guess what the driver is doing anymore.

Oh, sure. And, even if the API change doesn't provide any new efficiencies over the old stack, the fact that the code is being done at the middleware level instead of the OS/Driver level means developers can have multiple potential solutions to choose from instead of just trying to hope that the end user's platform developers got it right.

The corollary to this is just that you're going to have a lot more modules of code implementing the same thing at a much lower level than before, as opposed to a single code base (kernel module/driver) that gets way more testing because it's being used by everything. Also, "guessing what the driver is doing" is a double-edged sword, because sometimes the driver does things in an unspecified manner precisely because the best practice might change between hardware generations.

I think there's going to be a somewhat painful transition with perf issues and stability bugs, but hopefully expertise and middleware will get to the point where that's no longer as much of an issue and we can reap the overhead-reduction benefits.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Joda posted:

I'm trying to use an RG16F to encode normals for a scene that is centrally projected, since it saves me a texture tap. I did some testing that shows that access to an RGB16F is much slower than RG16F/R11G11B10, probably because it requires an extra texture read. The 11_11_10 texture is too imprecise for normals. The problem is that since it's projected with central projection there's a possibility for both negative and positive Z coordinates, so I need to buffer the sign of the Z coordinate, and I thought I could do that by sacrificing precision on the Y coordinate.

You could use UNORM surfaces and then use shader language intrinsics to unpack them manually as needed (I believe this is recommended SOP). Some APIs will even let you alias a surface as float or UNORM.
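
As a sketch of the "sacrifice some Y precision" idea on an RG16 UNORM target -- X keeps its full range, and the sign of Z gets folded into the top half of the Y range. These helpers would live in whichever shaders write and read the G-buffer, and the exact encoding is just one possibility:

code:
// Encode: xy remapped from [-1,1] to [0,1], sign of z stored in y's top half.
vec2 encodeNormal(vec3 n)
{
    vec2 xy = n.xy * 0.5 + 0.5;
    float y = xy.y * 0.5;            // squeeze y into [0, 0.5) -- costs one bit
    if (n.z >= 0.0)
        y += 0.5;                    // positive z lives in [0.5, 1.0]
    return vec2(xy.x, y);
}

// Decode: recover the sign, undo the squeeze, reconstruct z from normalization.
vec3 decodeNormal(vec2 enc)
{
    float zSign = (enc.y >= 0.5) ? 1.0 : -1.0;
    float y01   = (enc.y >= 0.5) ? (enc.y - 0.5) * 2.0 : enc.y * 2.0;
    vec2 xy     = vec2(enc.x, y01) * 2.0 - 1.0;
    return vec3(xy, zSign * sqrt(max(0.0, 1.0 - dot(xy, xy))));
}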

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...
Yup! That extension is exactly what I meant. Not sure what's up with the shader error, though...

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Sex Bumbo posted:

I'm honestly impressed with how many hoops every loving vulkan thing wants you to jump through. Guess that's what you miss out on when the api isn't made by msft. I think this is the first time I had to upgrade my python version to install an sdk for a low level graphics api. Good job guys.

It's more "This is what happens when you want the vendor driver stack to get out of the way".

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...

Sex Bumbo posted:

Also I disagree with this, OpenGL should be forgotten forever. It sucks enough that I'd rather use, I don't know, literally anything else if I need some trivial graphics task done.

I don't really get the push to keep people futzing around on DX11. I know it's going to still be supported, but even if I'm just dicking around at home, DX11 is so unsatisfying now. People aren't idiots, they can figure out how to initialize swap chains and descriptor tables and learn about the hardware at the same time. It's fun.

Well, OpenGL will be around forever for the same reason that OpenGL has been such a PITA to push forward -- there's a ton of heavily used software packages that rely on it and aren't going to be ported to something new, let alone something fundamentally different.

I kind of agree about DX11 -- it feels like a bad middle-ground between the "low-level" APIs of DX12/Vulkan and the more high-level wrapper engines. That being said, I definitely think it will be around for quite a long time (longer than DX9, which I think is only just now finally dying off) because there is a huge complexity gap between DX11 and DX12, and there aren't a lot of developers who are going to be able to take advantage of the benefits it confers in the near term.

Minsky posted:

Swap chains and descriptors are all fine and nice and easy to understand.

The thing that scares me most about looking at Vulkan code as someone who writes drivers is the responsibility for application to handle resource barriers. Meaning, if you render to a texture in one draw and then want to read from it in another draw, you the app developer have to manually put a barrier command there in between to make sure the first draw finishes and the relevant caches are synchronized. In OGL/DX, the driver would detect all of this for you.

This puts more control in your hands, but it also can introduce a lot more hardware-specific errors that you may not be aware of if you choose to primarily develop on a particular hardware vendor's GPU that happens to have coherent caches between those two kinds of operations. Vulkan ships with a debug runtime to catch these kinds of mistakes, but it is probably not very mature just yet.

What I anticipate happening is that there will be a lot of growing pains where people start crashing machines because code that was previously pretty well sandboxed in DX11 is now stomping all over video memory and hitting nasty race conditions. A lot of people will "fix" this by wrapping the low-level APIs with classes that are overly aggressive with barriers and mutexes, making them "safe" but slow. Over time these wrapper layers will get better and you'll get closer to peak performance, but I think on average CPU-side performance will be *worse* for ported apps than their DX11 equivalents, and GPU performance may not be as good either, because the driver isn't going to have the leeway to do some of the smart things that have "bloated" into it over the past decade or so.

There are definitely ways to get much better performance out of the low level APIs, but you've got to do a lot of work to get there -- and it will only really help if you're already being limited by driver/API overhead.

Hubis
May 18, 2003

Boy, I wish we had one of those doomsday machines...
If anyone is going to be at GDC this year, feel free to swing by and check out my talk Wednesday afternoon: Fast, Flexible, Physically-Based Volumetric Light Scattering

https://www.youtube.com/watch?v=lQIZzKBydk4
