 
  • Locked thread
Doc Block
Apr 15, 2003
Fun Shoe

Suspicious Dish posted:

ok, i want to write up my next infopost. today is going to be on what a “graphics card” is, besides just a gpu, which i covered in my first infopost. graphics cards are a quirk of the desktop pc architecture, where the only way to insert new modules is through the pcie expansion card slot, and thus the pcie bus.

pcie is, compared to a native connection to the memory controller, a fairly small bus with limited bandwidth. that's the reason your graphics card contains the monitor out, and the reason it has separate vram. it's not horribly small, but, well, 1920*1080 * 4 bytes per pixel * 60 frames per second...

this is the primary rule of graphics cards on desktop machines: *it’s expensive to move poo poo from the cpu to the gpu, and the gpu to the cpu*.

let's talk about how other systems work: laptops, consoles, phones, basically every other device is an embedded machine where the cpu and gpu both have equal access to system memory. i mentioned that a gpu is basically a giant parallel array of horribly broken and underpowered cpu's, and really what they're doing is computing a giant set of pixels in ram. usually, the cpu allocates some part of system memory and tells the gpu "render into this bit of memory and tell me when you're done", so there's sort of a hand-off, and it's not a free-for-all.

btw, this is known as a “unified memory architecture”. fun fact: amd actually tried to introduce this on desktops by bundling a radeon in their cpus but nvidia countered it with some marketing that plays right into /r/pcmasterrace: “amd’s gpus are so lovely they can only sell them by bundling them with cpu’s” and it died.
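
quick back-of-the-envelope on where that trailing multiplication ends up (just finished 1080p frames at 60 Hz, 4 bytes per pixel, nothing else):

code:

#include <stdio.h>

int main(void)
{
    double bytes_per_frame  = 1920.0 * 1080.0 * 4.0;   /* ~8.3 MB per frame */
    double bytes_per_second = bytes_per_frame * 60.0;  /* ~498 MB/s at 60 fps */
    printf("%.1f MB/frame, %.0f MB/s\n",
           bytes_per_frame / 1e6, bytes_per_second / 1e6);
    return 0;
}

so call it roughly half a gigabyte per second just to shuttle finished frames across the bus, before a single texture or vertex has moved.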

problem with shared memory systems is that they can wind up lowering overall performance whenever the CPU and GPU both want to access the memory bus at the same time. and main memory is still slower than the fast dedicated VRAM on most discrete GPUs.

on mobile, powervr has a small framebuffer cache of its own, called the tile buffer (because powervr uses tile based rendering), which it renders into and then flushes to memory all at once to reduce bus contention. but this has performance drawbacks for certain kinds of post processing: basically, any post processing that can't be done in the tile buffer, like bloom/blur effects, kills your performance because the GPU must flush the tile cache to memory multiple times. (on powervr, shaders can read what's already in the tile buffer, which is also how alpha blending is done.)

contrast with a more traditional GPU architecture, which can render into any part of VRAM and thus can do these effects much faster, since the GPU can just leave the framebuffer where it is in VRAM and read it back as a texture in subsequent render passes, without having to copy it out first and then re-clear its tile cache for the next render pass.

Shame Boy
Mar 2, 2010

Doc Block posted:

main memory is still slower than the fast dedicated VRAM on most discrete GPUs.

i agree with your post but presumably this part at least would go away if they were commonly / always integrated from the get-go and the system just had that fast RAM itself

Doc Block
Apr 15, 2003
Fun Shoe
idk. main memory could be faster, but then it'd also be more expensive.

and shared/unified memory systems would still have issues re: wanting to access the memory bus at the same time as something else.

josh04
Oct 19, 2008


"THE FLASH IS THE REASON
TO RACE TO THE THEATRES"

This title contains sponsored content.

ate all the Oreos posted:

yeah that's the "special" one that uses the dedicated hardware. they call it CUDA or CUVENC to make it seem like it's normal CUDA but it actually uses a separate little module on the card. it still works fine / isn't broken but it can't really do "real" CUDA

i don't mean NVENC, which replaced it, but the thing they're talking about taking out of the driver here, which worked on pre-kepler cards which afaik didn't have encoding hardware: https://web.archive.org/web/20150804115020/https://developer.nvidia.com/nvidia-video-codec-sdk

oh, apparently mainconcept had something similar at the time but i assume all this stuff got swept away by the proliferation of hardware encoders

Shame Boy
Mar 2, 2010

josh04 posted:

i don't mean NVENC, which replaced it, but the thing they're talking about taking out of the driver here, which worked on pre-kepler cards which afaik didn't have encoding hardware: https://web.archive.org/web/20150804115020/https://developer.nvidia.com/nvidia-video-codec-sdk

oh i was getting it mixed up with NVCUVID which is both dedicated hardware (that's been on cards going back much farther than NVENC) and also only decodes (so it's basically worthless to me), huh

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Doc Block posted:

problem with shared memory systems is that they can wind up lowering overall performance whenever the CPU and GPU both want to access the memory bus at the same time. and main memory is still slower than the fast dedicated VRAM on most discrete GPUs.

on mobile, powervr has a small framebuffer cache of its own, called the tile buffer (because powervr uses tile based rendering), which it renders into and then flushes to memory all at once to reduce bus contention. but this has performance drawbacks for certain kinds of post processing: basically, any post processing that can't be done in the tile buffer, like bloom/blur effects, kills your performance because the GPU must flush the tile cache to memory multiple times. (on powervr, shaders can read what's already in the tile buffer, which is also how alpha blending is done.)

contrast with a more traditional GPU architecture, which can render into any part of VRAM and thus can do these effects much faster, since the GPU can just leave the framebuffer where it is in VRAM and read it back as a texture in subsequent render passes, without having to copy it out first and then re-clear its tile cache for the next render pass.

first, the pvr's "tile rendering" is "tile deferred rendering", which goes to extreme lengths to try and prevent overdraw, which kills a lot of important optimizations (early z) and is basically equivalent to a depth pre-pass in practice.

second, amd and nv now use tilers even on desktop gpus because of the power savings. the traditional "brute force by memory bandwidth" approach is dead, dead, dead. and a traditional tiler has fast embedded tile memory in each gpu core group or whatever it's called, which is what's used to render to and copy to/fro. but you can usually fetch / texture from system ram.

hackbunny
Jul 22, 2007

I haven't been on SA for years but the person who gave me my previous av as a joke felt guilty for doing so and decided to get me a non-shitty av

thank you, this cleared up a lot of things for me

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Shinku ABOOKEN posted:

good thread rated :five:

can someone tell me how do gpus interact with virtualization? i heard gpus are stateful as gently caress

the state really isn't the issue: i can run two games and a browser side by side and they all render fine without stomping on each other. gpus have a native concept of "jobs" and know how to schedule them and switch out everything they need to, similar to process scheduling on a regular cpu.

the issue is security. i mentioned that everything is in vram: frame buffers, textures, etc. are all sitting there, and there's really nothing to prevent one game from fetching the texture for my bank browser if it wanted to. well, the gpu driver is supposed to validate and prevent that, but in a virtualized environment, the guest is the thing running the driver, and it can't be trusted. a malicious person could craft custom gpu machine code that fetched someone else's job, and that's a huge no-no in virtualization.

there's two ways to get around this. the first is to build gpu's that have virtual memory and mmu's and switch them out between jobs. that used to be impossibly slow but it's becoming less slow so i think this solution is starting to happen for gpu's designed for virt, like what you get from amazon.

the other approach is to, instead of exposing the raw pcie device to the guest, expose a high-level api that doesn't let you emit "texture fetch from this vram address" in your shader and instead requires "texture fetch from bound texture 1". vmware, in 2008, realized that this would be a huge market, and bought a whole gpu driver company called "Tungsten Graphics" to help them write this virt driver. so the vmware guest gpu driver basically sends a high-level shader and command stream to the host, which is compiled into native code *there*.
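
not the actual vmware/svga protocol, just a toy C sketch of the difference between a raw command (an absolute vram address the host can't validate) and a paravirtualized one (an opaque slot the host resolves and checks):

code:

#include <stdint.h>
#include <stdio.h>

struct texture { uint64_t vram_addr; };

struct guest_ctx {                      /* per-guest state the host keeps */
    struct texture textures[8];
    uint32_t num_textures;
};

/* raw device model: the guest's command stream carries an absolute vram
   address, which the host has no good way to validate */
struct raw_tex_fetch { uint64_t vram_addr; };

/* paravirtualized model: the guest only names an opaque slot; the host
   resolves it against resources that guest actually created */
struct pv_tex_fetch { uint32_t bound_slot; };

static void host_execute(const struct guest_ctx *g, struct pv_tex_fetch cmd)
{
    if (cmd.bound_slot >= g->num_textures) {        /* host-side validation */
        printf("rejected: slot %u isn't yours\n", (unsigned)cmd.bound_slot);
        return;
    }
    printf("fetching from 0x%llx (an address only the host ever sees)\n",
           (unsigned long long)g->textures[cmd.bound_slot].vram_addr);
}

int main(void)
{
    struct guest_ctx g = { .textures = { { 0x100000u } }, .num_textures = 1 };
    host_execute(&g, (struct pv_tex_fetch){ .bound_slot = 0 });   /* fine */
    host_execute(&g, (struct pv_tex_fetch){ .bound_slot = 5 });   /* rejected */
    return 0;
}

the guest never gets to put a real address on the wire, which is the whole trick: the host compiles the high-level stream into native code and fills in addresses it already trusts.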

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.
why does opencl suck so bad

why can't i easily have a hypervisor thing that lets me play video games on windows and then alt-tab over to linux to do cuda poo poo, without involving a second drat graphics card so one can sit idle while the other does stuff

why

atomicthumbs fucked around with this message at 21:05 on Oct 3, 2017

akadajet
Sep 14, 2003

atomicthumbs posted:

why does opencl suck so bad

why can't i easily have a hypervisor thing that lets me play video games on windows and then alt-tab over to linux to do cuda poo poo, without involving a second drat graphics card so one can sit idle while the other does stuff

why

because nobody cares enough about your use case

Doc Block
Apr 15, 2003
Fun Shoe

Suspicious Dish posted:

first, the pvr’s “tile rendering” is “tile deferred rendering”, which goes to extreme lengths to try and prevent overdraw, which kills a lot of important optimizations (early z) and is basically equivalent to a depth pre-pass in practice.

I'm aware the powervr uses deferred shading, but it wasn't relevant to that post so I didn't mention it.

quote:

second, amd and nv now use tilers even on desktop gpus because of the power savings. the traditional “brute force by memory bandwidth” approach is dead, dead, dead. and a traditional tiler has fast embedded tile memory in each gpu core group or whatever it’s called, which is what’s used to render to and copy to/fro. but you can usually fetch / texture from system ram.

ok...? a desktop tiler from nVidia or AMD still has a few gigabytes of fast VRAM that it doesn't have to share with the CPU, assuming you aren't talking about some integrated GPU that uses shared memory.

the problem with doing stuff like bloom/blur/etc on shared memory tilers is that it pretty much requires writing the tile cache out to system memory so that it can be read back in as a texture in a later pass. those multiple writes out to system memory hurt performance (source: tried to add bloom & blur to an iOS game, saw the FPS go way down while the memory bandwidth per frame went way up). so long as you can keep your post processing FX passes all happening in the tile cache (don't need to sample from other fragments or be at different resolution) it's fast enough, but if it needs to write the tile cache out to system memory as an intermediary step for additional rendering passes then your performance goes off a cliff.
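
a rough gles2-shaped sketch of that kind of frame (the handles and draw helpers are made-up app-side names, and a gl context is assumed current); the thing to notice is that every framebuffer switch between passes forces the tiler to resolve its tile memory out to system ram:

code:

#include <GLES2/gl2.h>

/* app-side objects, assumed created elsewhere; the names are illustrative only */
extern GLuint sceneFBO, blurFBO;                  /* offscreen framebuffers */
extern GLuint sceneTex, blurTex;                  /* their color attachments */
extern GLuint sceneProgram, blurProgram, compositeProgram;
extern void drawScene(void);
extern void drawFullscreenQuad(void);

void render_frame_with_bloom(void)
{
    /* pass 1: draw the scene into an offscreen color texture */
    glBindFramebuffer(GL_FRAMEBUFFER, sceneFBO);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glUseProgram(sceneProgram);
    drawScene();
    /* switching away from sceneFBO = full tile buffer flush to system ram */

    /* pass 2: blur, sampling the texture pass 1 just wrote */
    glBindFramebuffer(GL_FRAMEBUFFER, blurFBO);
    glUseProgram(blurProgram);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, sceneTex);
    drawFullscreenQuad();
    /* another flush on the way out of blurFBO */

    /* pass 3: composite scene + blur into the default framebuffer */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glUseProgram(compositeProgram);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, sceneTex);
    glActiveTexture(GL_TEXTURE1);
    glBindTexture(GL_TEXTURE_2D, blurTex);
    drawFullscreenQuad();
}

each of those flushes is the extra write-out-and-read-back traffic described above; on a discrete card the intermediate targets just sit in vram, which is the "leave the framebuffer where it is and read it back as a texture" case.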

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.

ate all the Oreos posted:

so I'm trying to get ffmpeg running nice and fast on my dinky old server box just for fun and I notice it has CUDA support, great cool. I go and get my old graphics card and shove it in (only like 5 years old at this point) and no actually it's the wrong kind of CUDA you need the kind of CUDA that does video, which is only available on the newest cards. what the heck?

"cuda" might not actually be cuda in this case; graphics cards have video encoding and decoding blocks built in these days, and their capabilities differ by generation

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.
back before someone recycled a GTX 780 i discovered that for machine learning poo poo (restricted to opencl) running stuff on the integrated GPU in my 4790K using beignet was surprisingly a third as fast as doing it on my amd 7870. interesting and neat

josh04
Oct 19, 2008


"THE FLASH IS THE REASON
TO RACE TO THE THEATRES"

This title contains sponsored content.

atomicthumbs posted:

why does opencl suck so bad

vulkan sapped all its momentum

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.

Suspicious Dish posted:

You'd think they'd decode some magical proprietary format if that was the case but nah it's just the same H.264 that you can already decode in software.

doing decoding on quicksync saves a lot of battery power on a laptop and i think the encoder in my gtx 780 is faster than doing it on the cpu (either via quicksync or software)

atomicthumbs fucked around with this message at 21:47 on Oct 3, 2017

Condiv
May 7, 2008

Sorry to undo the effort of paying a domestic abuser $10 to own this poster, but I am going to lose my dang mind if I keep seeing multiple posters who appear to be Baloogan.

With love,
a mod


josh04 posted:

vulkan sapped all its momentum

is vulkan capable of the same stuff as opencl? cool

josh04
Oct 19, 2008


"THE FLASH IS THE REASON
TO RACE TO THE THEATRES"

This title contains sponsored content.

Condiv posted:

is vulkan capable of the same stuff as opencl? cool

i mean, believe it when you actually see it running, but the intermediate language spec should mean that you can directly load and run unchanged cl kernels from vulkan

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.
let me phrase it this way: why does CUDA blow opencl's pants off

Shame Boy
Mar 2, 2010

atomicthumbs posted:

"cuda" might not actually be cuda in this case; graphics cards have video encoding and decoding blocks built in these days, and their capabilities differ by generation

yeah that's what I was saying in that post, and then what we've discussed for like two days since that post lol

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

atomicthumbs posted:

let me phrase it this way: why does CUDA blow opencl's pants off

because nvidia didn't spend time making opencl fast

atomicthumbs
Dec 26, 2010


We're in the business of extending man's senses.

ate all the Oreos posted:

yeah that's what I was saying in that post, and then what we've discussed for like two days since that post lol

well excuuuuuuuuuse me mister "i read the entire thread for context before responding"

Condiv
May 7, 2008

Sorry to undo the effort of paying a domestic abuser $10 to own this poster, but I am going to lose my dang mind if I keep seeing multiple posters who appear to be Baloogan.

With love,
a mod


i'm going to have to be getting into hpc sooner than later, and i really really really don't want to tie myself to cuda

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Suspicious Dish posted:

second, amd and nv now use tilers even on desktop gpus because of the power savings. the traditional "brute force by memory bandwidth" approach is dead, dead, dead.

lol no it's not

they're going that direction to squeeze more performance out, not because memory bandwidth has fallen out of favor. it hasn't, desktop and data center gpus are transitioning to even more tightly coupled memory technologies (hbm, hbm2) to scale bandwidth even more

a nvidia titan xp (one of the last of the gddr5 gpu designs) has about 550 GB/s memory bandwidth. as for cpus, very high end ones have quad channel ddr4 which can get you to about 80 GB/s. 550 >> 80 my friend, it isn't all stupid pcmasterrace bullshit
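
roughly where those two numbers come from, assuming the usual published figures (a 384-bit bus running gddr5x at 11.4 Gbps per pin for the titan xp, four 64-bit channels of ddr4-2666 for the cpu):

code:

#include <stdio.h>

int main(void)
{
    double titan_xp_GBps  = (384.0 / 8.0) * 11.4; /* bus width in bytes * per-pin Gbps = ~547 GB/s */
    double quad_ddr4_GBps = 4.0 * 8.0 * 2.666;    /* channels * 8 bytes * GT/s = ~85 GB/s */
    printf("titan xp: ~%.0f GB/s, quad-channel ddr4-2666: ~%.0f GB/s\n",
           titan_xp_GBps, quad_ddr4_GBps);
    return 0;
}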

those amd cpus with integrated unified memory gpus didn't have quad ddr4. they were dual channel ddr3 iirc and that meant that no matter how much die area they spent on gpu compute resources they were going to be memory bottlenecked out of competing with midrange and high end discrete graphics cards.

that was okay because the product line was targeted at the entry level market. the real problem with those products was the CPU cores

what you're saying about the reasons for hardware to be partitioned as it is now is misguided imo and I can effortpost about that later when not phone posting if you want that to be here

Shame Boy
Mar 2, 2010

atomicthumbs posted:

well excuuuuuuuuuse me mister "i read the entire thread for context before responding"

after i made that post i got a bit worried that you had like, made another post acknowledging it or something because i sure as hell didn't read any posts after yours before mashing reply :v:

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

atomicthumbs posted:

let me phrase it this way: why does CUDA blow opencl's pants off

minor reason: afaik the overall design of cuda is much more usable, or so I've been told

major reason: because nvidia invested a shitload of money and the time of really smart sw engineers into the cuda toolchain, and nobody did the same for opencl tools. the only company with a similar level of motivation to invest heavily was amd, and amd had money and management problems, so opencl just didn't have enough push behind it to compete.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

BobHoward posted:

lol no it's not

they're going that direction to squeeze more performance out, not because memory bandwidth has fallen out of favor. it hasn't, desktop and data center gpus are transitioning to even more tightly coupled memory technologies (hbm, hbm2) to scale bandwidth even more

yeah, tilers are just more scalable because you waste less memory bandwidth. i'm not saying memory bandwidth is superfluous -- it's super expensive. but nvidia had been marketing against tilers for years and trying to influence d3d/opengl to poo poo on tilers, and then they silently switched to a tiler arch themselves.

BobHoward posted:

a nvidia titan xp (one of the last of the gddr5 gpu designs) has about 550 GB/s memory bandwidth. as for cpus, very high end ones have quad channel ddr4 which can get you to about 80 GB/s. 550 >> 80 my friend, it isn't all stupid pcmasterrace bullshit

those amd cpus with integrated unified memory gpus didn't have quad ddr4. they were dual channel ddr3 iirc and that meant that no matter how much die area they spent on gpu compute resources they were going to be memory bottlenecked out of competing with midrange and high end discrete graphics cards.

that was okay because the product line was targeted at the entry level market. the real problem with those products was the CPU cores

what you're saying about the reasons for hardware to be partitioned as it is now is misguided imo and I can effortpost about that later when not phone posting if you want that to be here

i mean most consoles have unified memory but they have an actual fast gddr5 connection between the system ram and gpu. that's the direction i'd like to see pc's go in. uploads become a tlb swap from cpu-owned to gpu-owned.

my understanding is that as high-performance compute becomes more mainstream, manual system ram / vram management is going the way of the dodo in favor of gpus that can page fault, and more, smaller caches with each core, rather than large vram banks.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
also please do gpu effortposts!!! as much as i claim to know i am happy to be proven wrong and i welcome all knowledge like this

Doc Block
Apr 15, 2003
Fun Shoe

Suspicious Dish posted:

yeah, tilers are just more scalable because you waste less memory bandwidth. i’m not saying memory bandwidth is superfluous — it’s super expensive. but nvidia had been marketing against tilers for years and trying to influence d3d/opengl to poo poo on tilers, and then they silently switched to a tiler arch themselves.

i mean most consoles have unified memory but they have an actual fast gddr5 connection between the system ram and gpu. that’s the direction i’d like to see pc’s go in. uploads become a tlb swap from cpu-owned to gpu-owned.

consoles get worse texturing etc performance IIRC. even on equal hardware with the GPU directly attached to system memory, the fact that the memory is shared is gonna give the system worse performance than dedicated VRAM.

and as long as desktop GPUs are removable/upgradeable, I don't think you're gonna see them have a direct connection to main memory vs over an expansion bus. as to your earlier post, them being removable is why the display controller etc is on the graphics card instead of the motherboard.

Doc Block fucked around with this message at 23:32 on Oct 3, 2017

Shaggar
Apr 26, 2006
idk anything about gpus but would it make sense to let the GPU request a chunk of system ram (if it were fast enough) to use as an addition to its own dedicated stuff? Not so much shared ram as additional dedicated, or would it still just be too slow?

Shame Boy
Mar 2, 2010

Doc Block posted:

consoles get worse texturing etc performance IIRC. even on equal hardware with the GPU directly attached to system memory, that the memory is shared is gonna give the system worse performance than dedicated VRAM.

and as long as desktop GPUs are removable/upgradeable, I don't think you're gonna see them have a direct connection to main memory vs over an expansion bus. as to your earlier post, them being removable is why the display controller etc is on the graphics card instead of the motherboard.

how practical would it be to just make them like, a standard socket like the CPU? so instead of swapping out a whole massive card you just have a second CPU-like socket that you put a chip in, and then it shares the rest of the stuff (or maybe has its own RAM sticks, but you can swap them too like normal RAM)

Doc Block
Apr 15, 2003
Fun Shoe

Shaggar posted:

idk anything about gpus but would it make sense to let the GPU request a chunk of system ram (if it were fast enough) to use as an addition to its own dedicated stuff? Not so much shared ram as additional dedicated, or would it still just be too slow?

it's not that the CPU and GPU are trying to access the same piece of memory at the same time (that causes different problems), it's that they're both trying to access memory at all at the same time. a read or write has to go out over the memory bus, and everything else that comes along wanting to access memory has to wait until that's finished.

of course, that's not counting stuff like crossbar switch architectures. but IIRC RAM itself can only be doing one read or write at a time, so you're still gonna run out of memory bandwidth.

Shaggar
Apr 26, 2006
that makes sense. I guess it wouldn't really help unless you're moving basically all your processing over to the gpu.

Doc Block
Apr 15, 2003
Fun Shoe

ate all the Oreos posted:

how practical would it be to just make them like, a standard socket like the CPU? so instead of swapping out a whole massive card you just have a second CPU-like socket that you put a chip in, and then it shares the rest of the stuff (or maybe has its own RAM sticks, but you can swap them too like normal RAM)

I mean, sure, you could do it if you could get GPU and mobo vendors to agree on a socket. but sooner or later you'd have to be doing new GPU socket versions, which would mean not being able to upgrade your GPU without also upgrading your motherboard (and that means possibly CPU and RAM too).

ditto for display stuff: it'd suck if you swapped in a new GPU but were stuck with your motherboard's old display controller that can't do 4K or the latest DisplayPort or whatever.

and LOL at the cooling for socketed GPUs above the low end.

Doc Block fucked around with this message at 00:05 on Oct 4, 2017

fishmech
Jul 16, 2006

by VideoGames
Salad Prong

ate all the Oreos posted:

how practical would it be to just make them like, a standard socket like the CPU? so instead of swapping out a whole massive card you just have a second CPU-like socket that you put a chip in, and then it shares the rest of the stuff (or maybe has its own RAM sticks, but you can swap them too like normal RAM)

well then you can't swap cards easily between systems from different years anymore, with probably an erratic pace of when the socket changes need to happen and also an erratic pace of when the old sockets can stop getting new cards.

you can swap today's latest and greatest gpu into most boards from like 2010 and still get a notable effect, but it'd be unlikely that the 6 year old gpu socket would still be getting the latest gpus at that point.

munce
Oct 23, 2010

Doc Block posted:

you can check via opengl whether or not the GPU supports texture fetch in the vertex shader. of course, we're talking about android, so the driver might lie, but still. I forget the exact enum, but it's something like GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS.

yeah but the problem is you need to check on the device itself, so either test all gpus yourself or let users download and check if it works on their device. building a list of compatible devices before releasing a commercial product seems pretty essential, but is practically very hard to do. i'm also just annoyed that something so basic as a texture lookup can make a program just not work at all on a proportion of devices.

Doc Block
Apr 15, 2003
Fun Shoe

munce posted:

yeah but the problem is you need to check on the device itself, so either test all gpus yourself or let users download and check if it works on their device. building a list of compatible devices before releasing a commercial product seems pretty essential, but is practically very hard to do. i'm also just annoyed that something so basic as a texture lookup can make a program just not work at all on a proportion of devices.

well if the hardware might not be able to do it, and you won't check for it at runtime and use an alternate shader/rendering path if it can't, what result do you expect? any "master list" of hardware-supported features would either be incomplete, quickly out of date, or both. especially for something as fragmented as the Android ecosystem.
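
the runtime check itself is tiny, something like this (gles2, context assumed current); if it reports zero units, the driver exposes no vertex texture fetch and you pick the fallback shader path:

code:

#include <GLES2/gl2.h>

/* returns nonzero if the driver exposes any vertex texture image units,
   i.e. texture fetch in the vertex shader should work */
int has_vertex_texture_fetch(void)
{
    GLint units = 0;
    glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS, &units);
    return units > 0;
}

then you select the fancy or the fallback rendering path once at startup instead of trying to maintain a device list.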

this is also one of the situations where it's nice to have a publisher that can help you with Q/A stuff.

Doc Block fucked around with this message at 01:42 on Oct 4, 2017

The_Franz
Aug 8, 2003

josh04 posted:

vulkan sapped all its momentum

opencl never really had any momentum

it sounds like the long term plan is to bring opencl and vulkan closer together, although to what degree remains to be seen. they've been adding extensions to vulkan needed for running opencl kernels and there's already a working compiler for building vulkan compute shaders from a subset of opencl. it's still pretty rough, but early days etc...

Doc Block
Apr 15, 2003
Fun Shoe
isn't part of the reason that opencl never caught on that apple was the only one really invested in it?

like, didn't apple ask nvidia to do the initial design/implementation back when they were still friends, but nvidia would only do it if opencl was published as an open standard? and didn't intel insist on it being gimped and having all that dumb "BUT ALSO RUN KERNEL ON CPU!" garbage to get them to support opencl on their integrated GPUs, because intel still thought realtime raytracing on the CPU was gonna be a thing and that it would kill ATI and NVIDIA?

i could be (read: probably am) completely misremembering the previous paragraph, but it does seem like apple was the only company that tried to push opencl at all.

Doc Block fucked around with this message at 04:12 on Oct 4, 2017

Bulgakov
Mar 8, 2009


рукописи не горят (manuscripts don't burn)

eager to see how things turn out after apple attempts to put an amd vega in an imac pro

Bulgakov
Mar 8, 2009


рукописи не горят (manuscripts don't burn)

timmy ives: "...but we design for a very defined thermal envelope"

amd: "hold my stale beer"

  • Locked thread