Rhusitaurion
Sep 16, 2003

One never knows, do one?
Anybody ever implemented Dual Contouring? I'm working on implementing it to meshify signed distance functions, and I've got it mostly working except for the quadratic error function stuff.

All of the implementations I can find online (including the reference implementation from the paper) do stupid poo poo like type out every single individual multiply in every matrix operation, rather than using a library. I think I've worked out what that nonsense is doing, and translated it to use Eigen, but I don't really know anything about least-squares minimization or how this stuff is supposed to work:

code:
#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Eigen::Matrix3f;
using Eigen::Vector3f;

// Takes vectors of points and their corresponding normals, and returns the
// point that minimizes the sum of squared distances to the planes defined by
// each point/normal pair (I think?)
auto qef = [](const std::vector<const Vector3f*> &ps, const std::vector<const Vector3f*> &ns) -> Vector3f {
    Vector3f mp = Vector3f::Zero();   // centroid of the points
    Matrix3f Ahat = Matrix3f::Zero(); // sum of outer products n * n^T
    Vector3f bhat = Vector3f::Zero(); // sum of n * (n . p)

    for (size_t i = 0; i < ps.size(); ++i) {
        const auto &p = *ps[i];
        const auto &n = *ns[i];
        Ahat += n * n.transpose();
        bhat += n * n.dot(p);
        mp += p;
    }

    mp /= (float)ps.size();
    bhat -= Ahat * mp; // shift the problem so we solve relative to the centroid

    // Truncated pseudo-inverse via SVD, to handle (near-)singular Ahat
    Eigen::JacobiSVD<Matrix3f> svd(Ahat, Eigen::ComputeFullU | Eigen::ComputeFullV);
    auto U = svd.matrixU();
    auto S = svd.singularValues();
    auto V = svd.matrixV();
    S(0) = 1.0f / S(0);
    for (int si = 1; si < 3; ++si) {
        S(si) = (std::abs(S(si)) < 0.1f) ? 0.0f : 1.0f / S(si);
    }
    Vector3f x = V.transpose() * Eigen::DiagonalMatrix<float,3>(S) * U.transpose() * bhat;
    return mp + x;
};
However, this is clearly wrong, since a sphere looks lumpy as poo poo:

[image: lumpy sphere mesh]
If I just use the midpoint (i.e. "return mp"), it at least looks symmetrical, but still kind of "boxy":

[image: boxy sphere mesh]

Rhusitaurion
Sep 16, 2003

One never knows, do one?

Nippashish posted:

I don't know anything about dual contouring, but I think this line is wrong:
code:
Vector3f x = V.transpose() * Eigen::DiagonalMatrix<float,3>(S) * U.transpose() * bhat;
It looks to me like you're trying to compute a stable version of inv(Ahat)*bhat. The SVD gives you three matrices such that Ahat = U * S * transpose(V) with U and V as orthogonal matrices. But that means inv(Ahat) = V * inv(S) * transpose(U), and you have an extra transpose on your V.

Yup, looks like you're right:

[image: sphere with the fix applied]
I misinterpreted what Eigen's SVD would give me - I thought V would come back as V-transpose (for some reason), and I'd have to transpose it back. Thanks!
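
For posterity, the corrected solve is just this (following Nippashish's formula, inv(Ahat) = V * inv(S) * transpose(U)):
code:
    Vector3f x = V * Eigen::DiagonalMatrix<float,3>(S) * U.transpose() * bhat;
    return mp + x;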

Rhusitaurion
Sep 16, 2003

One never knows, do one?
I have a question about geometry shaders.

I'm using them to generate 3D geometry from 4D geometry. For example:

https://imgur.com/zD9J15J

The way this works is I have a tetrahedral mesh that I send into the geometry shader as lines_adjacency (since it gives you 4 points at a time - very convenient). There (and this is the sketchy part), I have a bunch of branchy code that determines whether each tetrahedron intersects the view 3-plane, and emits somewhere between 0 and 6 vertices (6 for the case where the whole tetrahedron is in-plane) in a triangle strip.

It's a neat trick, but it seems sketchy. I'm no GPU wizard, but my understanding is that geometry shaders are slow, and branchy shaders are slow. Additionally, they don't seem to be supported in WebGL or Metal.

Is there any reasonable alternative for generating geometry that's dependent on transformed vertices? I could do this on the CPU, but I'd end up doing essentially all the vertex transforms there, which seems lovely. I could save a lot of work with some kind of BVH, but still. Compute shaders seem promising, but I think I'd have to send the transformed vertices back to the CPU to do the 4-to-many vertex expansion.

Rhusitaurion
Sep 16, 2003

One never knows, do one?

Hubis posted:

Basically, have a shared memory array that is the size of your maximum possible triangles per dispatch, then compute your actual triangles into there and use atomic increment on a global counter to fetch and then increment a write offset into the output array by that amount. You are effectively reimplementing the GS behavior, but completely relaxing the order dependency.

That makes sense, thanks! I've not really messed with compute shaders before, so I wasn't sure what is and isn't possible. I think this will be pretty doable, since the maximum number of vertices is not that much more than the number of input vertices.
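
To check my understanding, here's a CPU-side C++ analogue of that scheme (names and sizes made up; the real thing would be a compute shader using shared memory and an atomicAdd on the global counter):
code:
#include <atomic>
#include <cstdint>
#include <vector>

// Global output array and a single atomic write cursor, standing in for the
// output SSBO and its global counter.
std::vector<uint32_t> out_indices(1 << 20);
std::atomic<uint32_t> out_count{0};

// One "workgroup": generate up to the max possible triangles locally, then
// reserve a contiguous block of the output with one atomic add and copy in.
void emit_block(const std::vector<uint32_t> &local_tris) {
    uint32_t base = out_count.fetch_add(static_cast<uint32_t>(local_tris.size()));
    for (size_t i = 0; i < local_tris.size(); ++i)
        out_indices[base + i] = local_tris[i];
}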

Suspicious Dish posted:

Also worth pointing out that it becomes a lot easier if you generate indexed triangles instead of triangle strips, since you can just jam triangles through without guaranteeing that strips are similar.

I think using indexed triangles would actually make some of the logic easier as well, so that's good to know.

I'm thinking something like: have a vertex buffer with 6 output vertices per tetrahedron - one for each of the (4 choose 2) = 6 pairs of its vertices, i.e. its edges. Then also have an index buffer to connect up the ones that actually land in the view 3-plane, using Hubis's suggestion.

Rhusitaurion
Sep 16, 2003

One never knows, do one?

Hubis posted:

...attempt at a Grand Unified Geometry Pipeline to fix both Geometry Shaders and Tessellation, but are still not broadly supported across platforms. For what you need to do, I would concur with the suggestion of using a compute shader that reads a vertex array as input and produces an index buffer as output. For optimal performance you might have to be creative by creating a local index buffer in shared memory and then appending it to your output IB as a single block (to preserve vertex reuse). Basically, have a shared memory array that is the size of your maximum possible triangles per dispatch, then compute your actual triangles into there and use atomic increment on a global counter to fetch and then increment a write offset into the output array by that amount. You are effectively reimplementing the GS behavior, but completely relaxing the order dependency.

Many months later, I actually ended up doing something like this, using Vulkan, but I'm wondering if there's a better way than what I've done.

For each object, I allocate one large buffer that will contain the input vertices, space for computed vertices/indices, an indirect draw struct, and another SSBO with a vertex counter. Then, on each frame, for each object:
1. In one command buffer (not per-object), reset the vertex and index counters to 0 with fill commands
2. In another command buffer, dispatch the compute shader. It operates on the input vertex buffer and atomically increments the vertex and index counters to get the offsets in the output buffers where it writes vertices and indices.
3. In another command buffer, do an indirect draw call.

Then I submit the 3 command buffers with semaphores to make sure that they execute in the order above. The first submission also depends on a semaphore that the draw submission from the last frame triggers.
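
For context, here's roughly how that one large per-object buffer is laid out (names made up, just to illustrate):
code:
#include <vulkan/vulkan.h>
#include <cstdint>

// In practice these are just offsets into a single VkBuffer, not an actual
// packed struct.
struct ObjectBufferHeader {
    VkDrawIndexedIndirectCommand draw; // draw.indexCount doubles as the index counter
    uint32_t vertexCount;              // the other atomic counter, for vertices
};
// ...followed by: input vertices | computed vertices | computed indices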

This seems to work fine (except when I hard-lock my GPU with compute shader buffer indexing bugs), but I'm wondering if I'm doing anything obviously stupid. I could double-buffer the computed buffers, perhaps, but I'm not sure it's worth the hassle. I thought about using events instead of semaphores, but 1. I'm not sure it's wise to use an event per rendered object, and 2. events can't be used across queues, and the compute queue is not necessarily the same as the graphics queue.

Thoughts?

Rhusitaurion fucked around with this message at 00:16 on May 8, 2020

Rhusitaurion
Sep 16, 2003

One never knows, do one?

Ralith posted:

First, you don't need three command buffers. If you're only using a single queue, which is probably the case, you only need one command buffer for the entire frame. Semaphores are only used for synchronizing presentation and operations that span multiple queues. Note that you don't need to use a dedicated compute queue just because it's there; the graphics queue is guaranteed to support compute operations, and for work that your frame is blocked on it's the right place.
Not sure why I didn't realize this earlier. It does make things easier.

quote:

Events definitely aren't appropriate. What you need here is a memory barrier between your writes and the following reads, and between your reads and the following writes. Without suitable barriers your code is unsound, even if it appears to work in a particular case.
Now that it seems like I should be using a single queue and command buffer, memory barriers definitely make sense. I was thinking about events because they would allow the work to be interleaved at the most granular level between the different stages, but I see now that barriers should allow the same thing.

quote:

Second, maybe I misunderstood but it sounds like you're zeroing out memory, then immediately overwriting it? That's not necessary.
Yeah, I realize now that I didn't explain this well. The compute stage treats an indirect draw struct's indexCount as an atomic counter, to "allocate" space in a buffer to write index data into. That index data changes per frame, so I have to re-zero the counter before each compute dispatch. There's also another atomic that works the same way for the vertex data that the indices index. Is there some other way to reset these, or to avoid resetting them?
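
Concretely, the reset is just a couple of 4-byte fills over the counters (sketch; the offsets are wherever those fields live in the big buffer):
code:
#include <vulkan/vulkan.h>
#include <cstddef>

// Zero only the two counters, not the whole buffer: indexCount inside the
// VkDrawIndexedIndirectCommand, plus the separate vertex counter.
void reset_counters(VkCommandBuffer cmd, VkBuffer buf,
                    VkDeviceSize draw_struct_offset,
                    VkDeviceSize vertex_counter_offset) {
    vkCmdFillBuffer(cmd, buf,
                    draw_struct_offset + offsetof(VkDrawIndexedIndirectCommand, indexCount),
                    sizeof(uint32_t), 0);
    vkCmdFillBuffer(cmd, buf, vertex_counter_offset, sizeof(uint32_t), 0);
}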

quote:

Third, a single global atomic will probably serialize your compute operations, severely compromising performance. Solutions to this can get pretty complex; maybe look into a parallel prefix sum scheme to allocate vertex space.
Well, it's 2 atomics per object, but yeah, it's probably not great. Thanks for the pointer. I'll look into it, but it sounds complicated, so the current solution may remain in place for a while...
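
For future me, I think the core of the idea is just an exclusive scan over per-workgroup counts (toy CPU version; the GPU implementation is the complicated part):
code:
#include <cstdint>
#include <numeric>
#include <vector>

// Each workgroup reports how many vertices it wants to write; an exclusive
// scan turns those counts into non-overlapping write offsets, with no
// contention on a single global atomic.
std::vector<uint32_t> scan_offsets(const std::vector<uint32_t> &counts) {
    std::vector<uint32_t> offsets(counts.size());
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), 0u);
    return offsets;
}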

Rhusitaurion
Sep 16, 2003

One never knows, do one?
Dumb question about memory barriers - this page says that no GPU gives a poo poo about VkBufferMemoryBarrier vs. VkMemoryBarrier. This seems to imply that if I use a VkBufferMemoryBarrier per object to synchronize reset->compute->draw, it will be implemented as a global barrier, so I might as well just do all resets, then all computes, then all draws with global barriers in between. But as far as I can tell, this is essentially what my semaphore solution is currently accomplishing, since semaphores work like a full memory barrier.

Is that post full of poo poo, or can I use VkBufferMemoryBarriers as they seem to be intended, i.e. to provide fine-grained synchronization?

Rhusitaurion fucked around with this message at 19:46 on May 8, 2020

Rhusitaurion
Sep 16, 2003

One never knows, do one?

Ralith posted:

Semaphores introduce an execution dependency, not a memory barrier. You cannot use semaphores as a substitute for memory barriers under any circumstances. For operations that span queues you need both; for operations on a single queue, semaphores aren't useful.

I'm probably misinterpreting the spec here, but the section on semaphore signaling says that all memory accesses by the device are in the first access scope, and similarly for waiting, all memory accesses by the device are in the second access scope. Granted, it might not be the best way to do it, but it seems like relying on a semaphore for memory dependencies is allowed.


Rhusitaurion
Sep 16, 2003

One never knows, do one?

Ralith posted:

No, you're right, I misremembered. Using a semaphore as you were is not unsound, just unnecessary effort for extra overhead. Note that you do typically need explicit barriers when expressing inter-queue dependencies regardless, but that's for managing ownership transitions when using resources with exclusive sharing mode.

Got it. Thanks for the advice - I've switched over to a single command buffer with barriers, and it seems like it works. Not sure if I got the src and dst masks and whatnot correct, but the validation layers are not complaining, at least!
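
For anyone curious, the frame now records roughly like this (a sketch; the masks are my best guess at what the validation layers accepted, and the actual fills/dispatches/draws are elided):
code:
#include <vulkan/vulkan.h>

// Reset counters -> compute -> indirect draw, all in one command buffer,
// with global memory barriers between the stages.
// (The read->write barrier between last frame's draws and this frame's
// fills is omitted here.)
void record_frame(VkCommandBuffer cmd /*, per-object buffers, pipelines... */) {
    // 1. Zero the vertex/index counters for every object (vkCmdFillBuffer).

    VkMemoryBarrier fill_to_compute{VK_STRUCTURE_TYPE_MEMORY_BARRIER};
    fill_to_compute.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    fill_to_compute.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         0, 1, &fill_to_compute, 0, nullptr, 0, nullptr);

    // 2. Dispatch the compute shader for every object (vkCmdDispatch).

    VkMemoryBarrier compute_to_draw{VK_STRUCTURE_TYPE_MEMORY_BARRIER};
    compute_to_draw.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    compute_to_draw.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT |
                                    VK_ACCESS_INDEX_READ_BIT |
                                    VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT | VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
                         0, 1, &compute_to_draw, 0, nullptr, 0, nullptr);

    // 3. vkCmdDrawIndexedIndirect for every object, inside the render pass.
}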
