champagne posting
Apr 5, 2006

YOU ARE A BRAIN
IN A BUNKER

rjmccall posted:

intel is completely dropping avx-512 support from its next line of desktop/mobile processors (alder lake). the p-cores are the same design as their new server cores (sapphire rapids), so they do actually have avx-512 silicon on core, but it’s going to be permanently fused off, either because they don’t want to include avx-512 on the e-cores or because they want better yields on the p-cores or both. so they’re throwing out the “well, at least it’ll be the isa everywhere” idea, and it’ll probably kill off avx-512 in non-specialized hardware

i of course don’t give a poo poo because x86 is haram

What exactly does the AVX-512 instruction do, and when do you use it? The higher level instructions all seem terribly ethereal to me


Zlodo
Nov 25, 2006

champagne posting posted:

What exactly does the AVX-512 instruction do, and when do you use it? The higher level instructions all seem terribly ethereal to me

they're used to stress test cpu temperature

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

champagne posting posted:

What exactly does the AVX-512 instruction do, and when do you use it? The higher level instructions all seem terribly ethereal to me

do you know what simd is? basically doing a similar operation on a small vector of values at once, like adding four floats in one instruction instead of doing four separate float adds. x86 has been building out its set of simd operations for a long time: sse, sse2, sse3, sse4, and so on. (there was also a dead-end design called mmx that shall never be spoken of again.) all of the sse stuff operates on 16-byte (128-bit) vectors. intel has been pushing an orthogonal set of extensions to take most (all?) of those operations and make them work on 32-byte and 64-byte vectors; those are avx/avx2 and avx-512. but i think in practice they're a package deal and you either just have the 16-byte vectors or you have all three widths

working with more data at once is a lot faster if you have the hardware to actually do it, but it’s trickier to set up the conditions to make it possible. more importantly, it actually takes a lot of hardware to make it worthwhile, and that makes hardware more expensive and really power-hungry, so intel doesn’t really want to do it on processors that don’t care about peak vector performance. so intel has been simulating the wider instructions on low-cost processors, and even when hardware is available it’s usually powered off until needed, which means there’s a big latency hit to use it. that, plus the fact that a lot of processors still don’t support the wider instructions, plus the inherent difficulties in setting up the wider operations in the first place, means that software developers have been really reluctant to adopt them. so after like 8 years of selling everyone on how avx-512 is great and is going to be everywhere, intel’s caving on the whole thing and saying they’re exclusive to like their serious datacenter/supercomputer processors

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

lol, and i also remember mmx hype

champagne posting
Apr 5, 2006

YOU ARE A BRAIN
IN A BUNKER

I know nothing of simd but thanks for the post it makes sense

Sapozhnik
Jan 2, 2005

Nap Ghost
sounds complicated op have you tried cuda

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
it's useful for loving around with big matrices, mostly. this is used in various digital signals processing poo poo like image manipulation but also in ~machine learning~. problem is, doing identical operations on a whole bunch of values in parallel is exactly what gpu's are optimized for doing, so a lot of this poo poo has moved off of the cpu.

i've linked one particular example of avx512 usage before (i'm fascinated by how extremely c++ it is) but i can do it again. here's a plain c++ implementation of ordered dithering. it's operating on an array of color values (color values as in red, green or blue if the color format is rgb) that may be 8-bit ints or whatever, does a little bit of float math on each value, clamps it, rounds it back to an int and stores it to a destination array. it does this one value at a time.

here's the same function but using avx512 intrinsics. it's doing the same thing except to 16 values at a time (there's a bunch of templated bullshit further up in the file that loads the various kinds of possible input values into a 512-bit vector of 32-bit floats so it's straightforward to do float maths on them). instead of saying "x = x + d" you say "x = _mm512_add_ps(x, d)" and now you've added 16 d's to 16 x'es all at once, effectively doing 16 loop iterations in a single instruction. completely unreadable of course, but according to the original commit message this version managed to do over 2 pixels per clock cycle on skylake-x if the input was 32-bit float (slightly less for other data types that required conversion). that's a lotta pixel pushing.

TheFluff fucked around with this message at 00:31 on Aug 20, 2021

Sagacity
May 2, 2003
Hopefully my epitaph will be funnier than my custom title.

Captain Foo posted:

lol, and i also remember mmx hype
hang on, i was doing software rendering in the late 90s, right before 3d cards came along and rendered it obsolete. mmx was really well suited to that

things like being able to do saturated adds were extremely useful when trying to do blending, for instance

the main drawback was that it was only for integer math, but for pixel pushing this wasn't a problem

Kazinsal
Dec 13, 2011


the most obnoxious part about MMX was that it reused the x87 register stack so if you wanted to use MMX in the middle of code where you were also using the FPU you needed to issue an FSAVE instruction to save the whole FPU state to memory, wait for the FPU to reinitialize to 0, run your MMX code, then issue an FRSTOR afterwards to reload the FPU state

this was in some cases a really goddamn annoying delay because worst case you're now waiting on memory latency and transfer time to save the state, then waiting on the time to reset the FPU, and then the potential memory latency and transfer time to read the state back out of memory (which in the MMX era was like, 66 MHz SDR SDRAM I think). fuuuuuck that

intel had the good sense to just throw a bunch of new registers in for SSE

eschaton
Mar 7, 2007

Don't you just hate when you wind up in a store with people who are in a socioeconomic class that is pretty obviously about two levels lower than your own?
that’s the main reason Mac OS X 10.4 for Intel used a different ABI than NEXTSTEP through Rhapsody: NEXTSTEP used 80487-and-later floating point, while Mac OS X 10.4 for Intel used SSE or whatever to get around both the pain of MMX/x87 switching and the pain of x87 in the first place

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

eschaton posted:

that’s the main reason Mac OS X 10.4 for Intel used a different ABI than NEXTSTEP through Rhapsody: NEXTSTEP used 80487-and-later floating point, while Mac OS X 10.4 for Intel used SSE or whatever to get around both the pain of MMX/x87 switching and the pain of x87 in the first place

I like the idea of building up a backlog of "ugh, abi" todos that get to marinate for a decade until new hardware appears. I imagine the idea to do that got floated like thirty seconds after sse came out or the abi shipped (whichever was later)

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
unfortunately there was a lot that we still didn’t have ready then, like the rework of the objc abi. but yeah, intel macs always had sse support even when they were 32-bit only. (all 64-bit-capable processors have sse)

i think arm64 was the only real wish-list “we’re going to be on this abi forever so we might as well get it right” cycle that apple has ever had

Soricidus
Oct 21, 2010
freedom-hating statist shill
so when’s the mips macbook coming

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

eschaton posted:

that’s the main reason Mac OS X 10.4 for Intel used a different ABI than NEXTSTEP through Rhapsody: NEXTSTEP used 80487-and-later floating point, while Mac OS X 10.4 for Intel used SSE or whatever to get around both the pain of MMX/x87 switching and the pain of x87 in the first place

this decision must've made it easier to pull off both the ppc->x86 and x86->arm transitions, in terms of producing fewer weird FP edge case bugs when recompiling for the new cpu. no worries about the strange (but standards compliant lol) 80-bit intermediate precision feature of x87

(in an ideal world everyone who uses an ieee float or double knows how to avoid making their code fragile to differences in rounding and denorm behavior etc, but we definitely do not live in that world)

Grum
May 7, 2007
the thing people like about avx-512 over the earlier instruction sets is the mask registers. they make non big matrix code way easier. before, to branch on some vector lanes but not others, you needed to use bitwise ops to implement branchless selects and at a certain level of branchiness it just becomes pointless to vectorise, but with avx-512 you get party tricks like this

mystes
May 31, 2006

Weren't people complaining that AVX-512 caused thermal throttling to the point where it was pointless before? Or was that overblown?

necrotic
Aug 2, 2005
I owe my brother big time for this!

mystes posted:

Weren't people complaining that AVX-512 caused thermal throttling to the point where it was pointless before? Or was that overblown?

prime95 with the AVX instructions on will make your CPU hit the thermal limit most likely, if that’s any indication.

Probably different in proper (not home computing) setups, I don’t know that much.

Cybernetic Vermin
Apr 18, 2005

mystes posted:

Weren't people complaining that AVX-512 caused thermal throttling to the point where it was pointless before? Or was that overblown?

nah, they had way better throughput than any alternative, the issue on a consumer chip is that a few (usually not that important) threads deciding to get fancy with avx-512 then blew through the thermal budget of the chip, and on a consumer system that is very bad as the very scalar ui thread throttles back.

seems to me that it is doubling down on dumb that they are just roping off the instructions, on a consumer system they could for convenience just decode to run on a suitably narrow unit (afaik what amd does even for 256-bit avx, and almost certainly for the forthcoming 512-bit). a multi-cycle execution loses most of the advantages, but simplifies development, and you gain some fringe performance benefits from encodings you are already using.

e: but do note that i don't know poo poo

Cybernetic Vermin fucked around with this message at 20:00 on Aug 20, 2021

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
yeah, i don't understand why they don't just rewrite to narrow. presumably the e-cores are still microcoded

crazypenguin
Mar 9, 2005
nothing witty here, move along
You still need enough registers?

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
eh, i'm sure the microarchitecture register file is a lot wider than 512 bits even on an e-core

repiv
Aug 13, 2009

avx512 has 32 registers though, 2 kilobytes seems like a lot by register standards? no idea how expensive that is silicon wise though

Cybernetic Vermin
Apr 18, 2005

repiv posted:

avx512 has 32 registers though, 2 kilobytes seems like a lot by register standards? no idea how expensive that is silicon wise though

doesn't the infrastructure to spill registers already exist though? i.e. much like the register file may through renaming contain memory locations, the contents of a register may also have been spilled into memory. otherwise there is presumably some way (on a "real" avx-512 cpu) of regaining this 2k of register file wasted in a thread that has once run a bit of avx but no longer cares about the register content?

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

mystes posted:

Weren't people complaining that AVX-512 caused thermal throttling to the point where it was pointless before? Or was that overblown?

it was a problem, but also massively overblown. turning on avx-512 lowered the clock rate by a few percent. this means that using it in a background task or something that isn't your bottleneck can be a bad idea, but if a 5% slowdown negates the benefit of vectorizing something then you've really hosed up and shouldn't be using avx-512 for the problem you're trying to solve anyway.

Nomnom Cookie
Aug 30, 2009



Cybernetic Vermin posted:

doesn't the infrastructure to spill registers already exist though? i.e. much like the register file may through renaming contain memory locations, the contents of a register may also have been spilled into memory. otherwise there is presumably some way (on a "real" avx-512 cpu) of regaining this 2k of register file wasted in a thread that has once run a bit of avx but no longer cares about the register content?

uhhhhhhhhh....do you mean PUSH? although it seems like you're talking about x64 spilling registers on its own, which it doesn't do. where would it write to?

JawnV6
Jul 4, 2004

So hot ...

Cybernetic Vermin posted:

otherwise there is presumably some way (on a "real" avx-512 cpu) of regaining this 2k of register file wasted in a thread that has once run a bit of avx but no longer cares about the register content?

iirc "no"

suffix
Jul 27, 2013

Wheeee!
just throw an interrupt and pass it to the software emulation

Cybernetic Vermin
Apr 18, 2005

JawnV6 posted:

iirc "no"

oho, that is interesting. if so, the cost of ever touching avx-512, with it just locking you out of 2k of register file for the rest of your thread lifetime, is kind of fantastic. still seems pretty doable to create a spill area (e.g. a small cache) to push the registers off into, but if nothing of the sort yet exists i guess it is more sensible to keep any little bit of library code from just monopolizing a resource like that.

Kazinsal
Dec 13, 2011


Intel spitting out more and more extended versions of AVX was entirely a response to GPGPU becoming the de facto standard for doing massively parallel vector compute. there was no way that AVX on 16 or 24 cores would ever keep up with compute shaders running vector operations on ten thousand cores in parallel, but xeon phi was an objective failure and they didn't have a usable GPU architecture that scaled into discrete cards that can match what Nvidia and AMD put out until literally this year so bolting on more and more registers and widening the SIMD data sizes was their way of trying to stay relevant until they could get Xe-HPC taped out and in the hands of the HPC crowd

Shaggar
Apr 26, 2006
it'll be interesting to see if intel's hpg stuff can even remotely compete w/ amd and nvidia. it would be nice to have a 3rd party in the gpu market

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
gpu compute has really high startup/teardown and memory-traffic costs, and it really sucks at non-parallel computation, so cpu simd fills some pretty useful sweet spots. but i agree that there’s been a failure to appreciate that that’s what cpus should be doing, filling a sweet spot instead of trying to be the end-all of computation

to me the real question is how much having a wider vector helps vs just having more capable hardware. if i’ve got a core that can multiply 16 floats in parallel, i mean, that’s a lot of execution units. if i use scalar code then my overheads will probably tie the core up before it can issue 16 iterations, but a 4x loop might well be good enough to keep 4 iterations in flight. how much better would a 16x loop be with the same backing hardware? enough to cover the pretty substantial costs of wider registers? and if it really matters for some client, are they really still in that sweet spot where they should stick to the cpu?

Hed
Mar 31, 2004

Fun Shoe

Captain Foo posted:

lol, and i also remember mmx hype

the MMX part was meh but for consumers the doubled L1/L2 (can’t remember which or both) was awesome.

repiv
Aug 13, 2009

on paper it sounds like ARM had the right idea with SVE, where the vector width is decoupled from the instruction set, so they can have 512bit SIMD where it makes sense but they can still make low power cores with 128 or 256bit SIMD that run the same code

still waiting for SVE to ship on anything besides niche HPC stuff though

Athas
Aug 6, 2007

fuck that joker

repiv posted:

on paper it sounds like ARM had the right idea with SVE, where the vector width is decoupled from the instruction set, so they can have 512bit SIMD where it makes sense but they can still make low power cores with 128 or 256bit SIMD that run the same code

still waiting for SVE to ship on anything besides niche HPC stuff though

I have done lots of GPU programming, which also decouples the hardware vector width from the programming model (although the width is often exploited for synchronisation optimisations). It is a good model. On GPUs it depends on a really loose memory model, which gives the hardware a lot of leeway to schedule the vector units however it wants. I suspect the friendlier memory model of CPUs will prevent this from scaling anywhere like it does on GPUs. I suspect AMD's semi-intentional idea of having many relatively small vector units and then doing normal superscalar execution of independent vector instructions is the best CPU strategy.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Kazinsal posted:

Intel spitting out more and more extended versions of AVX was entirely a response to GPGPU becoming the de facto standard for doing massively parallel vector compute. there was no way that AVX on 16 or 24 cores would ever keep up with compute shaders running vector operations on ten thousand cores in parallel, but xeon phi was an objective failure and they didn't have a usable GPU architecture that scaled into discrete cards that can match what Nvidia and AMD put out until literally this year so bolting on more and more registers and widening the SIMD data sizes was their way of trying to stay relevant until they could get Xe-HPC taped out and in the hands of the HPC crowd

tom forsyth (one of the people who created xeon phi) has a rather different take on whether it was a failure

https://tomforsyth1000.github.io/blog.wiki.html#%5B%5BWhy%20didn%27t%20Larrabee%20fail%3F%5D%5D

it actually met p much all the primary project goals and made intel money

as that post mentions, AVX512 is just phi vector instructions with a better encoding. so it all originated in 2005, and i kinda doubt 2005 intel management was thinking larrabee AKA phi would be a stepping stone to Xe-HPC. as forsyth mentions, intel management was just hostile to gpus. even using larrabee for graphics was politically difficult; not only did it ultimately get killed but as forsyth mentions, intel's regular gpu team was on a leash and not allowed to even try to build something bigger. (there was likely no technical limitation preventing them from scaling up, by their nature gpus scale up with relative ease)

Cybernetic Vermin
Apr 18, 2005

it certainly was the cool kind of crazy

quote:

It had the full DX11 feature set, and there were over 300 titles running perfectly - you download the game from Steam and they Just Work - they totally think it's a graphics card! But it's still actually running FreeBSD on that card, and under FreeBSD it's just running an x86 program called DirectXGfx (248 threads of it). And it shares a file system with the host and you can telnet into it and give it other work to do and steal cores from your own graphics system - it was mind-bending!

Athas
Aug 6, 2007

fuck that joker
The fact that the Xeon Phi ran its own loving OS was not at all the cool kind of crazy. Would you really want to janitor user accounts on your GPU? When I discovered that aspect I lost most of my interest in our Phi (which is now an ornament in my office).

JawnV6
Jul 4, 2004

So hot ...

BobHoward posted:

as forsyth mentions, intel management was just hostile to gpus. even using larrabee for graphics was politically difficult; not only did it ultimately get killed but as forsyth mentions, intel's regular gpu team was on a leash and not allowed to even try to build something bigger. (there was likely no technical limitation preventing them from scaling up, by their nature gpus scale up with relative ease)

ehhhhh idk about this? kinda loses the thread

Nomnom Cookie
Aug 30, 2009



Kazinsal posted:

Intel spitting out more and more extended versions of AVX was entirely a response to GPGPU becoming the de facto standard for doing massively parallel vector compute. there was no way that AVX on 16 or 24 cores would ever keep up with compute shaders running vector operations on ten thousand cores in parallel, but xeon phi was an objective failure and they didn't have a usable GPU architecture that scaled into discrete cards that can match what Nvidia and AMD put out until literally this year so bolting on more and more registers and widening the SIMD data sizes was their way of trying to stay relevant until they could get Xe-HPC taped out and in the hands of the HPC crowd

there are so many people just panting for intel to be competitive in hpc/gpu that every time intel puts out a statement like "our next batch of igp will definitely be good" it just gets reported everywhere and everyone is like oh yeah intel put those hot fragments in my raster hole and then it launches and won't even run without a hacked bios on a particular mobo lmao

i guess what im saying is benchmarks or gtfo


Dylan16807
May 12, 2010

Cybernetic Vermin posted:

oho, that is interesting. if so, the cost of ever touching avx-512, with it just locking you out of 2k of register file for the rest of your thread lifetime, is kind of fantastic. still seems pretty doable to create a spill area (e.g. a small cache) to push the registers off into, but if nothing of the sort yet exists i guess it is more sensible to keep any little bit of library code from just monopolizing a resource like that.

I don't think it's that you lose space and can't reclaim it, I think it's that vector registers are completely separate


repiv posted:

avx512 has 32 registers though, 2 kilobytes seems like a lot by register standards? no idea how expensive that is silicon wise though

well for comparison's sake the gracemont efficiency cores already have 2KB of normal registers, and going by this page about skylake there's probably significantly more than 32 vector registers already and it doesn't take all that much space to bump them from 256 to 512 bits https://travisdowns.github.io/blog/2020/05/26/kreg2.html
