Spite
Jul 27, 2001

Small chance of that...

passionate dongs posted:

edit: Is there somewhere to look for OpenGL best practices? Is this something the red book would have?

There's really only the most basic of "best practices" out there.
You can look at:
http://developer.apple.com/library/mac/#documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide and the more recent WWDC talks on it.

Also keep in mind that it varies (sometimes quite heavily) by GPU vendor and platform.

Spite
Jul 27, 2001

Small chance of that...

Paniolo posted:

Why are you using separate OpenGL contexts for everything?

If you are using multiple contexts you MUST use a separate context per thread.

To answer the main question, yes, multiple contexts can share resources. Note that not everything is shared - fences, for example, aren't (but sync objects are). Check the spec for specifics.
If you are trying to do something like
Context A, B, C are all shared
Context B modifies a texture
Context A and C get the change
That should work.
Create context A. Create contexts B and C as contexts shared with A. That should allow you to do this.

Keep in mind you've now created a slew of race conditions, so watch your locks!
Also, you have to make sure commands have been flushed to the GPU on the current context before expecting the results to show up elsewhere.
For example:
Context A calls glTexSubImage2D on Texture 1
Context B calls glBindTexture(1), Draw

This will have undefined results. You must do the following:

Context A calls glTexSubImage2D on Texture 1
Context A calls glFlush
Context B calls glBindTexture(1), Draw

On OS X, you must also call glBindTexture before the changes are picked up; I'm not sure about Windows - it's probably vendor-specific. And again, you need a context per thread. Do not try to access the same context from separate threads; only pain and suffering awaits.
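
A minimal sketch of that ordering, assuming the contexts already share resources and each thread has made its own context current (tex, width, height, pixels, and the cross-thread signaling are placeholders):
code:
/* Thread owning context A: update the texture, then flush so the
   commands actually reach the GPU before another context samples it. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glFlush();
/* ...signal thread B here (condition variable, semaphore, etc.)... */

/* Thread owning context B: re-bind so the shared change is picked up
   (required on OS X, harmless elsewhere), then draw. */
glBindTexture(GL_TEXTURE_2D, tex);
/* ...draw calls that sample tex... */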

You probably want to re-architect your design, as it doesn't sound very good to me. Also, can't Qt pass you an opaque pointer? Couldn't you hold a single context there and lock around its use? Or have a single background thread doing the rendering?

Spite fucked around with this message at 03:24 on Nov 3, 2011

Spite
Jul 27, 2001

Small chance of that...

shodanjr_gr posted:

That's what I ended up doing. I create context on application launch and then any other contexts that are initialized share resources with that one context.

I already have wrappers around all GPU-specific resources and the convention is that all of them must be locked for Read/Write before access. I am also figuring out a way to ensure consistency between the GPU and CPU versions of a resource (potentially have "LockForWriteCPU" and "LockForWriteGPU" functions along with a state variable and then an unlock function that updates the relevant copy of the data based on the lock state).

There's a couple of issues. First of all, I am using multiple widgets for rendering and I am not guaranteed that I will get the same context for all widgets (even if they are in the same thread, I believe), and I plan on multithreading those as well since they tank the UI thread. Additionally, I plan on having an OpenCL context in a separate thread. That's on top of a bunch of threads that produce resources for me. I wanted to minimize the amount of "global" locking that takes place in this scheme, hence this exercise...

This is way complicated. Have you tried a single OpenGL context that does all your GL rendering and passes the result back to your widgets? The widgets can update themselves as the rendering comes back. Remember: there's only one GPU so multiple threads will not help with the actual rendering.

Also, GPU and CPU resources are separate from each other unless you are using CLIENT_STORAGE or just mapping the buffers and using that mapping on the CPU side. You can track what needs to be uploaded to the GPU by keeping dirty bits and setting them whenever the CPU copy changes. In general, multiple threads should not be trying to update the same GPU object at once - that gets nasty very fast.
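
As a rough illustration of the dirty-bit idea (this wrapper type and its field names are made up for the example, not taken from your code):
code:
typedef struct {
    GLuint  texture;      /* GPU copy */
    void   *cpuPixels;    /* CPU copy */
    int     width, height;
    int     gpuDirty;     /* CPU copy changed; GPU copy needs a re-upload */
} SharedImage;

/* Called from whatever thread edits the CPU copy (under its own lock). */
void sharedImageTouchCPU(SharedImage *img)
{
    img->gpuDirty = 1;
}

/* Called only from the single GL thread, right before the image is used. */
void sharedImageSync(SharedImage *img)
{
    if (img->gpuDirty) {
        glBindTexture(GL_TEXTURE_2D, img->texture);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, img->width, img->height,
                        GL_RGBA, GL_UNSIGNED_BYTE, img->cpuPixels);
        img->gpuDirty = 0;
    }
}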

Spite
Jul 27, 2001

Small chance of that...

shodanjr_gr posted:

You mean as in having a single class that manages the context and rendering requests get posted to that class which does Render-To-Texture for each request then returns the texture to a basic widget for display?

That's how I'd approach it myself. Just post requests to this thread/object and have it pass back whatever is requested or necessary for the UI to draw. I'm making quite a few assumptions because I don't know exactly what your architecture needs to be, but it will vastly simplify your locking and GPU handling to do it that way.

As for the second part, do you think multiple threads will be accessing the same CPU object simultaneously often? You can always make your CPU side more copy-based - if you don't mind taking copies of things, passing those around, and modifying the copy instead of locking. Your program's needs will determine which approach is more efficient.

Spite
Jul 27, 2001

Small chance of that...
Yikes, that's hard. It could be one of approximately a million things. I'm not familiar with SlimDX, but it's quite possible it's something with that. Do other 3D apps and games run fine? Have you tried writing a simple C D3D or OpenGL app that does the same thing?

Spite
Jul 27, 2001

Small chance of that...

OneEightHundred posted:

Yes, you could do that. You'd still want to use GetProcAddress for anything that you weren't sure would be available (i.e. calls exposed by extensions).

Not really. What's happening is that wglGetProcAddress asks the driver for its implementations of the functions. So you'd need a .lib for each driver and probably each build of each driver. Then you'd need a separate exe for each and it would be nasty.

This is also why you have to create a dummy context, get the new context creation function and create a new context if you want to use GL3+. It sucks.

Spite
Jul 27, 2001

Small chance of that...

SAHChandler posted:

I had a feeling this is what was being done. (especially since I did a dumpbin to a .def file of the ATI drivers, and only about 24 functions are actually present within the opengl driver dll. some of which are for egl :raise:)


There's also the gl3w library which focuses on using and loading only opengl 3/4 functions.


One question though, on a slightly related matter. Since OpenCL works well with OpenGL, and there are a variety of implementations out there, writing for each implementation and distributing multiple executables isn't a factor because they didn't go the route of clGetProcAddress style system right?

There's an extension for ES2 compatibility, which is probably why they are there.

As for OpenCL: also keep in mind that OpenCL is driven by Apple, which controls its driver stack and can release headers that work with all its parts.

Spite
Jul 27, 2001

Small chance of that...
A sampler2D is essentially a number that corresponds to a texture unit. Most GPUs have 16 units these days.
ie:

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex);

Will bind the texture 'tex' to unit 0.
Then you set the sampler to 0 and it will sample from that texture.
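
The application-side half of that, setting the sampler uniform to unit 0 (the program handle and the "diffuseTex" uniform name are placeholders):
code:
GLint loc = glGetUniformLocation(program, "diffuseTex");

glUseProgram(program);
glUniform1i(loc, 0);   /* the sampler2D in the shader now samples from unit 0 */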

As for defaults, it will depend on the implementation. It should default to 0, but I'd check the spec to be sure.

Spite
Jul 27, 2001

Small chance of that...
If you want to do some cross referencing between GL and DX, try:
http://developer.apple.com/graphicsimaging/opengl/capabilities/

I'd hope everything relatively recent/worth developing for supports depth read :)

Spite
Jul 27, 2001

Small chance of that...

OneEightHundred posted:

Yeah looks like I was confusing the X1000-series' lack of PCF with what looks like pre-PS2.0 inability to sample depth textures as floats.

That looks good though, guess everything I care about supports FP16 color buffers. :toot:

The ATI X1xxx series doesn't let you filter float textures in hardware, from what I recall. There's always room for a dirty shader hack though!

Spite
Jul 27, 2001

Small chance of that...

zzz posted:

I haven't touched GPU stuff in a while, but I was under the impression that static branches based on global uniform variables will be optimized away by all modern compilers/drivers and never get executed on the GPU, so it wouldn't make a significant difference either way...?

Best way to find out is benchmark both, I guess :)

You'd hope so, but I wouldn't assume that! The vendors do perform various crazy optimizations based on the data. I've seen a certain vendor attempt to optimize 0.0 passed in as a uniform by recompiling the shader and making that a constant. Doesn't always work so well when those values are part of an array of bone transforms, heh.

Basically, you don't want to branch if you can avoid it. Fragments are executed in groups, so if you have good locality for your branches (ie, all the fragments in a block take the same branch) you won't hit the nasty miss case.

Spite
Jul 27, 2001

Small chance of that...

nolen posted:

Cross-posting this from the Mac OS/iOS Apps thread.

I'm trying to do something with cocos2d, but have been told that it will involve some OpenGL ES work to achieve what I'm looking to do.

This is just a mockup I whipped up in my photo editor.



Let's say I want to load in an image like the one of the left but alter it at runtime to look like the one on the right.

I have no idea where to start with plotting texture points to a 2D polygon and all that jazz. Any suggestions or simple examples that would help?

Split your quad into 2 quads. Then look into an affine transform.

Also cocos2D is one of the worst pieces of software known to man (not helpful, I know...)

Spite
Jul 27, 2001

Small chance of that...
The default OpenGL implementation in Win32 doesn't have all the entry points. All the modern stuff is requested from the driver via wglGetProcAddress. So you can break on that and see what it returns.

Or you can use one of the various tracing interposers to get a call trace and see what it's doing.

When I've done stuff similar to what you are describing, I've taken a CRC of the texture data when it's passed in to glTexImage2D and recorded the id that's bound to that unit. Then you can store that away and do whatever you want when it's bound again.
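
Something along these lines, as a sketch (myCRC32 and rememberTexture are hypothetical helpers, and this assumes tightly packed RGBA8 data):
code:
/* Interposed glTexImage2D hook: checksum the pixels and remember which
   texture name they were uploaded to. */
static void onTexImage2D(GLsizei w, GLsizei h, const void *pixels)
{
    GLint boundTex = 0;
    glGetIntegerv(GL_TEXTURE_BINDING_2D, &boundTex);

    uint32_t crc = myCRC32(pixels, (size_t)w * h * 4);  /* assumes RGBA8 */
    rememberTexture((GLuint)boundTex, crc);             /* your own lookup table */
}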

Spite
Jul 27, 2001

Small chance of that...

OneEightHundred posted:

I believe VBOs are mandatory for all geometry in the forward-compatibility contexts. D3D switched to buffer-only ages ago, at least circa D3D8.

If you have a frequently-updated buffer, create it as DYNAMIC. Use discards (pass NULL to glBufferSubData), and if you're doing CPU calculations then use SSE intrinsics and write to the buffer by using MapBuffer and _mm_stream_si128. Discards alone probably make them better than vertex arrays.

Yes, use VBOs for sure. VAR is probably a super stale code path in most drivers these days.

Note: calling glBufferSubData with NULL doesn't discard the buffer. You should use glBufferData with NULL to get that orphaning behavior; glBufferSubData is specified to update only the data, not to reallocate the buffer's storage. You can also map unsynchronized (glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT) to get nonblocking behavior.
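
The orphan-then-fill pattern looks roughly like this (vbo, size, and newData are placeholders):
code:
glBindBuffer(GL_ARRAY_BUFFER, vbo);

/* Orphan: same size, NULL data. The driver can hand back fresh storage
   instead of stalling on whatever the GPU is still reading. */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);

/* Or map without synchronization, if you're managing the hazards yourself: */
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(ptr, newData, size);
glUnmapBuffer(GL_ARRAY_BUFFER);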

Spite
Jul 27, 2001

Small chance of that...

Bisse posted:

So the OpenGL driver may not implement the only currently allowed way to render? :confused:

It's a bit more complicated and nasty than that.
In Windows, the OS only implements OpenGL 1.1 or something like that. So that's all you are guaranteed to have. Everything else has to go through wglGetProcAddress, which queries the driver and returns a function pointer to the function.

To get a modern (3+) OpenGL context, you have to call wglCreateContextAttribsARB, which is not part of the old OpenGL. So you have to actually create an old context, call wglGetProcAddress to get the new creation function, and _then_ create your real context. It's a disaster.

And glBegin/End should absolutely be banned. One of OGL's biggest issues is that it has a billion ways to do things, but only one of them is fast, and the others don't tell you they are slow. As was said earlier, a more friendly API could be built on top of OGL to do similar stuff. Begin/End and fixed function are really out-of-date ways of thinking about modern GPUs and graphics - they may be user-friendly, but they have nothing to do with how the hardware works or how you should organize your rendering.

A good low-level graphics API should only have fast paths. The ARB is attempting to remove the slow crap. Unfortunately they'll never succeed because there are too many apps and people that are using the old stuff and not adopting the new stuff.

Spite
Jul 27, 2001

Small chance of that...

PalmTreeFun posted:

This is a pretty basic math theory question, but I'm taking a graphics class right now and I need a little help. Long story short, my teacher isn't the best at speaking English, much less at explaining things, and I had to go read the book just to figure out what convolution and reconstruction was. I have that figured out, but what I can't wrap my head around is resampling. You know, scaling images/resampling audio. I understand what it's supposed to do, but I don't quite get the math.

For simplicity's sake, say I'm doing it on audio or some other 1-dimensional data. If I have this data set:

f(x) = 0, 1, 4, 5, 3, 5, 7

And I want to resample this using different filters with a radius of 0.5 in order to figure out, say, f(2.5) and f(2.75), with f(2) having the value 4 in the above data set. My question is, what results should I be getting with my estimates if I use, say, a box filter (1/(2r) if -r <= x < r, 0 otherwise) as opposed to a tent/linear filter (1-abs(x) if abs(x) < 1, 0 otherwise).

I hope I didn't make that too confusing, I just am not sure how to exactly compute resampled values. The book doesn't make it very clear. It says something about taking the data points, reconstructing and smoothing them, then resampling a new set of data, but I don't understand how you convolute two functions (a reconstruction and a smoothing one, as opposed to a function and a data set) together.

How's your math? Convolution has different meanings depending on which domain you are in. The easiest way to think about it is as the overlap between two functions. Or you can think about it as multiplying every data point in function A by every data point in function B and adding the results together. Of course, this really isn't feasible in real time, so you use a small 'kernel' as the second function.

Take for example a gaussian blur.
You'll typically see something like this:
0.006 0.061 0.242 0.383 0.242 0.061 0.006
That's the kernel, that's function B.
Your set is
0, 1, 4, 5, 3, 5, 7

Let's say I want to find the blurred value of the middle element, which is 5
0*0.006 + 1*0.061 + 4*0.242 + 5*0.383 + 3*0.242 + 5*0.061 + 7*0.006 ≈ 4.02
You repeat that for each value in your set to get the convolved set, ie
f(x-3)*0.006 + f(x-2)*0.061 + f(x-1)*0.242 + f(x)*0.383 + f(x+1)*0.242 + f(x+2)*0.061 + f(x+3)*0.006

That's a 1D convolution; it can be extended to higher dimensions. If you are curious about the math, you should probably take a class on signal processing, as it gets quite complex.
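
In code, the 1D version is just a pair of loops; here's a small self-contained C example using the data set and kernel above (edge handling simply clamps to the ends of the set):
code:
#include <stdio.h>

/* Convolve 'in' (length n) with a symmetric kernel of odd length klen,
   clamping at the edges. */
void convolve1d(const float *in, float *out, int n, const float *kernel, int klen)
{
    int r = klen / 2;
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int k = -r; k <= r; k++) {
            int j = i + k;
            if (j < 0) j = 0;
            if (j > n - 1) j = n - 1;
            sum += in[j] * kernel[k + r];
        }
        out[i] = sum;
    }
}

int main(void)
{
    float data[7]   = { 0, 1, 4, 5, 3, 5, 7 };
    float kernel[7] = { 0.006f, 0.061f, 0.242f, 0.383f, 0.242f, 0.061f, 0.006f };
    float blurred[7];

    convolve1d(data, blurred, 7, kernel, 7);
    for (int i = 0; i < 7; i++)
        printf("%.3f ", blurred[i]);
    printf("\n");
    return 0;
}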

Or am I explaining the wrong thing?

Spite
Jul 27, 2001

Small chance of that...

PalmTreeFun posted:

I think so. I understand the part you explained already, but basically what I want to know is how scaling/reconstructing a sound/image works. Like, you convert a set of discrete data to a continuous function somehow, and you can use a kernel (thanks for explaining what that was, I didn't know that that and the "filter" were the same thing, this teacher really sucks at explaining things) to extrapolate new, "in-between" data.

Like, if you used something like a simple average to find a value between elements 1 and 2 (1 and 4) in the example I gave, you'd get a new value 2.5, because that's halfway from one to the other. The problem is, I don't get how you convey different ways of getting new values using a kernel. Same in reverse, shrinking the set instead of expanding it. I had an assignment on the last homework where we had to resample a data set using two different kernels, one being a tent and the other a box, and I had no idea how to compute that.

E: For what it's worth, here are the lecture slides on the topic:

http://pages.cs.wisc.edu/~cs559-1/syllabus/02-01-resampling2/resampling_cont.pdf

Scroll down to the page that says "Resampling".

E2: I just figured out what exactly the box/triangle filters do, (box is rounding up/down, tent is linear interpolation) but I still don't understand how the process in general is done. Like, I have no clue what's going on with the other filters, like Gaussian, B-Spline cubic, Catmull-Rom cubic, etc.

It's kind of odd they'd have you do this without giving you the background theory around filtering and time domain vs frequency domain.

Forgive me if I'm going a bit overboard with the explanation. The Fourier transform converts a function in the time domain into its equivalent in the frequency domain. The easiest way to visualize this is to think about a sine wave: its Fourier transform is simply two peaks, one at the positive frequency and one at the negative frequency, representing the frequency of the wave.

Now, every function (at least the ones you'll be interested in) has a transform that converts to this domain. Why is this interesting? Because you can multiply two functions together in the frequency domain to apply a filter. For example, a box function can be a lowpass filter.

However, this is a problem, since you need all of both functions in order to do the transform. So we would like to apply the filter in the time domain as the data is fed to us. A multiplication in the frequency domain corresponds to a convolution in the time domain. So we can take the filter we are interested in, convert it to the time domain via the inverse Fourier transform, and then do the operation with the kernel we get in the time domain.

The box/triangle filters are in the frequency domain. If you want to apply them to the data set, you need to apply the inverse Fourier transform and convolve the result with the set. The box filter's inverse Fourier transform is the sinc function, so we can make a kernel out of sin(x)/x and use that if we wanted to.

Gaussian is a special case, since the Fourier transform of a Gaussian is also a Gaussian.

The splines are slightly different because they reconstruct points on a path. You can think of something like Catmull-Rom, etc. as simply a function F(t) that happens to pass through the control points.

Spite
Jul 27, 2001

Small chance of that...

Contero posted:

In gl 3.1+, do vertex array objects give a performance benefit over manually binding VBOs? What is the advantage of using them?

You have to use VAO in GL 3+.

In legacy contexts they should ideally provide a performance enhancement. However, most drivers don't implement them in a way that makes a noticeable difference.
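
A minimal core-profile setup looks something like this (verts, vertCount, and an attribute at location 0 are assumed):
code:
GLuint vao, vbo;

glGenVertexArrays(1, &vao);
glBindVertexArray(vao);              /* required before vertex setup in core GL */

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

/* The VAO records this vertex layout, so later you just rebind the VAO. */
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void *)0);

/* Per frame: */
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertCount);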

Spite
Jul 27, 2001

Small chance of that...
Firstly, what has changed since it used to work?
What OS, HW, driver, etc?

Do other gl calls work?
Can you clear the screen to a color?
Can you draw a simple quad?
Do other glGet* calls work?
Can you call glGetString(GL_RENDERER) or GL_VERSION or GL_EXTENSIONS?

Also it occurs to me that you may not be creating the context correctly in the dynamically linked case. How is the context created? Are you using glXGetProcAddress, etc? Does that return valid function pointers?

Spite
Jul 27, 2001

Small chance of that...
Sounds like you don't have a context. I'm not sure why that would occur and I don't have much experience with xwin.

Try calling glXGetProcAddress and seeing if it returns valid pointers. Where does the code create your context and how does it do so? Look for glCreateContext or glXCreateContext.

Spite
Jul 27, 2001

Small chance of that...
That makes sense; the Mesa lib has its own GL implementation. Fun stuff.

Spite
Jul 27, 2001

Small chance of that...
Also, are you sure the Quadro actually implements the tessellation shaders in hardware? Sometimes they don't have full implementations, or they fall back to software for stupid reasons.

Spite
Jul 27, 2001

Small chance of that...

baka kaba posted:

I have a couple of newbie questions, if anyone can help.

I've been doing a bit of work learning OpenGL ES 2.0, and I'm having a bit of trouble finding information. It might be best if I explain what I think is going on and what I'm trying to do:

So I have a bunch of 2D textures generated and filled with bitmaps, and when it comes to using them in the shader program I need to activate a texture unit and bind the texture to it. Then I pass in to the shader the numbers of the texture units I'm using, and the shader can use them. This all works fine.

My understanding is that a texture unit is actually in hardware, maybe some dedicated memory near the GPU? Different devices have different numbers of texture units available - the one I'm using (a Galaxy Nexus, if it matters) reports 8 available texture units when I call
code:
GLES20.glGetIntegerv(GLES20.GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS,maxTexUnits,0);
which is the maximum number available, not the max any one shader can use.

If these are in hardware, it seems like there would be a performance benefit in keeping textures in a texture unit as long as possible, and even if they're not I'd also avoid unnecessary glActiveTexture and glBindTexture calls which don't look cheap. But my phone is reporting 8 texture units, and I can actually activate and bind to any of the texture unit constants, right up to GL_TEXTURE31 (the highest in the API) and everything still renders just fine.

So my questions are really: are texture units in fast GPU memory, so am I taking the right approach trying to fill them and not swap too much in and out? And what's with the ID numbers, are they just for my reference (with the actual memory arrangement handled however the system sees fit) so I can use any 8 I like, but no more?

Or is this a really bad way to do things, and they're really meant for passing several textures to the shader at once and not for any kind of actual storage?

So, this is old but I can perhaps shed some light on it.

Firstly, calling glActiveTexture with anything higher than GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS-1 should be an error. It is a bug in the implementation if you can set, say, unit 8. You are probably just resetting the same unit over and over (whichever one you specified with the last call to glActiveTexture that succeeded).

You're misunderstanding what it means by "texture unit."

On a basic level, texturing involves some memory on the GPU to hold the actual texture data in some format (which is almost always something other than the linear array of pixels you're used to thinking of). There's usually a texture cache, because texture accesses tend to be pretty coherent spatially (if you are sampling a pixel, it's very likely you'll be looking at the ones immediately around it soon).

There is also usually specific hardware to take your texture coordinates, use them to look up into whatever format the data is actually in, fetch it, and do whatever sampling, interpolation, etc. you've asked for. If you set up filtering, this will actually result in multiple lookups and some interpolation to return you an actual color. This piece of hardware is what you usually refer to as a texture unit - if you could read memory directly, you could write code to do it all by hand, but it's always doing the same thing and it's faster to have dedicated hardware do it.

So your hardware has 8 of these units, which means you can access 8 different textures in a shader, one per unit. They aren't specific caches or pieces of memory, they are more like lookup and translation units.

As for keeping them bound, you are correct that it will be faster, but not for that reason. Calling glActiveTexture and glBindTexture does not actually move data around; it changes GL state. And it actually changes two levels of state, because OpenGL is a terribly designed API.

Think of OpenGL as a giant state machine. When you create a context, you are making a big struct with all the OpenGL state that is mandated by the OpenGL spec. You can't access this struct directly; you have to use the GL API calls. Originally, OpenGL only allowed one texture, so it only had the one call to change the current texture state, which is glBindTexture. You give it the type and the texture 'name' (that numerical id you got from glGenTextures) and that's that. Eventually they added support for multiple textures, but instead of letting you specify which unit to modify directly, they added the glActiveTexture call, which tells GL which texture unit to modify in subsequent calls.

Think of the code looking like this:
code:
struct Giant_GL_Context
{
    .....
    //some state
    uint32_t currentActiveTexture;
    .....
    //texture state
    GLInternalDriverTexture* textures[INTERNAL_MAX_TEXTURES];
    .....
};

void glActiveTexture(GLenum texture)
{
    //there would be error checks here

    //note: GL_TEXTURE0 is 0x84C0 but we want a real index
    context->currentActiveTexture = texture-GL_TEXTURE0;
}

void glBindTexture(GLenum target, GLuint texture)
{
    //there would also be error checking to make sure target is sane and texture exists, etc
    GLInternalDriverTexture* tex = internalLookupTextureByName(context, texture);
    context->textures[context->currentActiveTexture] = tex;

    //other stuff
}
State changes are the most expensive thing in a graphics API on the CPU side. So if you can reorder things to minimize them you'll see a benefit. If your app is small, it's probably not worth the effort, but if you are doing 100s of draws a frame, it's worth your while.
Instead of doing:
set texture A
draw mesh A
set texture B
draw mesh B
set texture A
draw mesh C
set texture B
draw mesh D
try:
set texture A
draw mesh A
draw mesh C
set texture B
draw mesh B
draw mesh D

Spite
Jul 27, 2001

Small chance of that...

BattleMaster posted:

Cool thanks, it works great. It's funny, a few days ago I was looking at the different options for face culling and I was wondering "why would anyone want to use anything other than back?" I guess I should assume that if something exists in the spec it exists for a reason.

Don't assume this :) The spec is a giant political document as much as a functional one.

BattleMaster posted:

Earlier I was also puzzling over why the UniformMatrix family of functions had an argument to transpose your matrix. And last night I was having trouble getting an orthogonal projection matrix to work. It turns out that I was formatting my matrix as row major when OpenGL wants them as column major. So I set transpose to true when I fed it my matrix and it worked perfectly. I guess they put that there for people/libraries that prefer to work with it in row-major. Honestly it's a wonder that my rotation and perspective matrices worked when I was giving it the transpose by accident.

Direct3D also takes row-major matrices. That argument was mainly a bone thrown to people porting stuff to GL from D3D - TransGaming, etc. I think it's a bad feature, since I feel GL should not do conversions itself, and this one requires the CPU to do the transpose.

BattleMaster posted:

Edit: Kind of a silly question but would it be problematic to feed my shader several transform matrices (for instance, a transpose, rotation, and scale matrices) and have the GPU multiply them together rather than doing it on the CPU and feeding only one to it? I have a feeling that the GPU is faster at multiplying floating-point matrices than the CPU is but I'm not sure how expensive it is to feed uniforms to a shader program. (Though I guess doing it once on the CPU instead of having the GPU do it 3 times for every triangle might be better.)

Generally, hoist everything you can. If you can move something off the GPU (or push it up the pipeline), it's worthwhile. You'd have one matrix multiply per draw on the CPU instead of one per vertex on the GPU.
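
In other words, something like this on the CPU side (mat4_mul is a hypothetical helper standing in for whatever math library you use, and "u_mvp" is a placeholder uniform name):
code:
/* Combine the matrices once per draw on the CPU... */
float vp[16], mvp[16];
mat4_mul(vp,  projection, view);   /* vp  = projection * view         */
mat4_mul(mvp, vp,         model);  /* mvp = projection * view * model */

/* ...so the vertex shader does a single mat4 * vec4 per vertex
   instead of multiplying several matrices together for every vertex. */
glUniformMatrix4fv(glGetUniformLocation(program, "u_mvp"), 1, GL_FALSE, mvp);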

Spite
Jul 27, 2001

Small chance of that...

Malcolm XML posted:

Yeah though I think the memory space uniforms inhabit is really fast and is cached as well

This greatly depends on the hardware.

For example, most of the modern cards do not actually have uniform registers anymore and treat everything as a constant buffer (it's just that what were previously 'uniforms' are now part of a magic global constant buffer). Constant buffer reads are cached similarly to textures. A fun bit of trivia: the early NV Tesla cards didn't cache constant buffer reads, so they were slower than uniforms.

OneEightHundred: That was more true for AMD's Northern Islands. Southern Islands and on are (more) scalar and I think the scheduling happens on the compute units themselves. I haven't read their docs in a while though so I don't remember all the details.

EDIT: I may be misremembering NI, which I think is purely vector all the time. Is that what you meant? In terms of GPGPU they'd have to vectorize everything.

Spite fucked around with this message at 06:58 on Jan 8, 2014

Spite
Jul 27, 2001

Small chance of that...
Pixel buffers aren't really analogous to texture buffer objects. PBO is more for asynchronous upload/download. Why are you using them and not just straight textures?

EDIT: To be more clear: PBO is just used for transferring data, not storing it.
Basically, how they work is that you bind the buffer, and the glTexImage call reads from that buffer instead of from a client pointer you pass; the argument that was a pointer becomes a byte offset into the buffer.
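
Roughly like this for an upload (pbo, tex, width, height, and pixels are placeholders):
code:
/* Stage the pixels in the PBO... */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, pixels, GL_STREAM_DRAW);

/* ...then the texture call sources from the bound PBO; the last argument
   is now a byte offset into the buffer, not a client pointer. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);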

Spite fucked around with this message at 05:57 on Jan 10, 2014

Spite
Jul 27, 2001

Small chance of that...

slovach posted:

Am I missing something here or what with initializing GL?

I need a valid context to even get wglChoosePixelFormatARB() ... but to get a valid context, I need to have a pixel format set. And then SetPixelFormat() expects a filled out PIXELFORMATDESCRIPTOR anyway.

:psyduck:

PIXELFORMATDESCRIPTOR is a Windows-specific struct, not a general OpenGL struct. You fill that out yourself and use your Windows device context to get a pixel format (via ChoosePixelFormat).
Then, with the pixel format it returns, you call SetPixelFormat, then wglCreateContext.

code:
PIXELFORMATDESCRIPTOR pfd =
{
    sizeof(PIXELFORMATDESCRIPTOR),
    1,
    PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER,    //Flags
    PFD_TYPE_RGBA,             //format
    32,                        //bits per fragment
    0, 0, 0, 0, 0, 0,
    0,
    0,
    0,
    0, 0, 0, 0,
    24,                        //bits for depth
    8,                         //bits for stencil
    0,                         //aux
    PFD_MAIN_PLANE,
    0,
    0, 0, 0
};

HDC hDC = GetDC(hWnd);

int pixfmt = ChoosePixelFormat(hDC, &pfd);
SetPixelFormat(hDC, pixfmt, &pfd);

HGLRC ctx = wglCreateContext(hDC);
wglMakeCurrent(hDC, ctx);
That's one way to do it, anyway.

Because Microsoft didn't want to implement OpenGL 3, and OpenGL is a messy API in general, you have to create an OpenGL context using the Windows APIs. Then you can request the function pointers for the newer OpenGL context creation methods.
So you'll need to call
wglGetProcAddress
And find the address of wglCreateContextAttribsARB.
Then you'll end up with two contexts, so you'll have to destroy the temporary one.
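
A sketch of that dance, continuing from the snippet above ('ctx' and 'hDC' are from the earlier code; the constants come from the WGL_ARB_create_context extension, and error handling is omitted):
code:
typedef HGLRC (WINAPI *PFNWGLCREATECONTEXTATTRIBSARBPROC)(HDC, HGLRC, const int *);

#define WGL_CONTEXT_MAJOR_VERSION_ARB    0x2091
#define WGL_CONTEXT_MINOR_VERSION_ARB    0x2092
#define WGL_CONTEXT_PROFILE_MASK_ARB     0x9126
#define WGL_CONTEXT_CORE_PROFILE_BIT_ARB 0x00000001

/* The temporary legacy context is current, so the driver will answer this. */
PFNWGLCREATECONTEXTATTRIBSARBPROC wglCreateContextAttribsARB =
    (PFNWGLCREATECONTEXTATTRIBSARBPROC)wglGetProcAddress("wglCreateContextAttribsARB");

int attribs[] = {
    WGL_CONTEXT_MAJOR_VERSION_ARB, 3,
    WGL_CONTEXT_MINOR_VERSION_ARB, 2,
    WGL_CONTEXT_PROFILE_MASK_ARB,  WGL_CONTEXT_CORE_PROFILE_BIT_ARB,
    0
};

HGLRC realCtx = wglCreateContextAttribsARB(hDC, NULL, attribs);

/* Throw away the temporary context and switch to the real one. */
wglMakeCurrent(NULL, NULL);
wglDeleteContext(ctx);
wglMakeCurrent(hDC, realCtx);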

Or use one of the utilities to do it for you.

Spite
Jul 27, 2001

Small chance of that...
Vectors should be 16-byte aligned.

You could also make a giant array of floats and then build vectors out of them in your shader, but I wouldn't recommend it.
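
For instance, if you were mirroring a std140 uniform block in C, the 16-byte slots look like this (a hypothetical block, just to show the padding rules):
code:
/* GLSL (std140):          C-side mirror:                              */
/* uniform Light {                                                     */
/*     vec3  position;                                                 */
/*     float intensity;                                                */
/*     vec4  color;                                                    */
/* };                                                                  */
typedef struct {
    float position[3];   /* vec3 takes a 16-byte slot...               */
    float intensity;     /* ...and a following float can fill its tail */
    float color[4];      /* vec4 is naturally 16-byte aligned          */
} Light;                 /* 32 bytes total, matching the std140 layout */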

Spite
Jul 27, 2001

Small chance of that...

Sex Bumbo posted:

Can you elaborate on why it's faster to use 16 byte aligned elements? I tested this out with a simple read-modify-write shader with linear memory access. It's about 10% slower (old GTS 250, gonna try some others too) to do unaligned which I wouldn't qualify as "slowwwwww" but it's still significant.

I imagined linear access would result in a chunk of memory being requested or scheduled, and if it were packed that would result in less bandwidth. Also if it were reading it in as a chunk, the alignment of the elements wouldn't matter would it?

It depends on the hardware. If the hardware only has vector units, then any unaligned values have to be read/masked/swizzled to get them in the right place. More hardware is scalar now, but it's still easiest to think of GPUs as consuming aligned vectors.

OpenCL compiles down to CUDA on nv hardware. I'm not sure what they do with OpenGL compute.
