hobbesmaster
Jan 28, 2008

Were they running RTOSes on those HFT things that are trying to execute trades or just hoping for the best on Linux?

Inept
Jul 8, 2003

shrike82 posted:

hft kinda came and passed as a market trend - the action has been machine learning for a while, a european fund that tried to headhunt me recently talked about having a 10K+ V100/A100 server farm

I suppose they just switch back and forth between that and crypto mining depending on what they think is best that day

Beef
Jul 26, 2004
HFT firms still get their Xeon SKUs, so the demand must be there.

Those 10k+ GPU farms are for model training. The deployment (inference) is on CPUs and to some extent FPGAs.

movax
Aug 30, 2008

I’d expect FPGAs to get some play on the I/O side there as well — probably not commodity NICs. If you’re targeting end-to-end latency, you have to own everything out to the furthest interface, i.e. the Ethernet link.

BlankSystemDaemon
Mar 13, 2009




shrike82 posted:

i haven't kept up with hft in years but the last i heard it was all about fpgas - i'd be surprised if a system builder using mainstream CPUs etc, albeit one targeting finance, is doing anything particularly interesting

hft kinda came and passed as a market trend - the action has been machine learning for a while, a european fund that tried to headhunt me recently talked about having a 10K+ V100/A100 server farm

The point was more to talk about the CPUs being used in various fields, their core counts, and clock rates - and how little of a lead Intel has, if they have any at all.
There are a few exceptions, like if you buy 8-way scalable Xeons and 7U servers with room for 7 motherboards connected to a backplane, giving up to 224 cores and 24TB of memory to a single OS - or design your software such that it can scale out arbitrarily across many systems, in which case blade servers (or their functional Open Compute Project equivalent) are still the king (and probably always will be, even if ARM takes over) - and both of those are use-case specific scenarios.

movax posted:

I’d expect FPGAs to get some play on the I/O side there as well — probably not commodity NICs. If you’re targeting end-to-end latency, you have to own everything out to the furthest interface, i.e. the Ethernet link.

HFTs have been doing privately owned radio chains over the air for years, to try and get a lead on the market.

BlankSystemDaemon fucked around with this message at 10:09 on May 2, 2022

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

ConanTheLibrarian posted:

We haven't seen ARM CPUs clock as high as x86 ones. Is that an intrinsic property of ARM or could we get to a point where their CPUs are running at the same wattage as x86 but getting a whole lot more work done?

Nope, it's not intrinsic to Arm. None of the design techniques which enable high clock speeds in today's desktop x86 cores are really all that specific to any particular ISA. Don't get me wrong, there are things you can do in ISAs which make it harder to design high frequency implementations, but I don't think you'll find any of those problems in Arm.

So why aren't there any high frequency Arm cores? Because high freq costs lots of area and power, and one or both have always been unacceptable prices to pay in the markets today's Arm cores were designed for.

For example, Apple's P cores are designed for iPhones first and Macs second. As a premium brand with high device prices, they aren't too sensitive to area, but they care a lot about power. So the M1/A14 P core is designed for medium frequency and very wide execution. Peak ST clocks aren't even much better in M1 than A14 (3.2 GHz vs 3.1 GHz).

So flip the question on its head: why haven't x86 designers explored this medium frequency / wide execution design space? This is where the answer is a flaw specific to one of these ISAs: x86 is awful to decode, and it gets worse the more instructions you want to decode in "parallel". (Scare quotes because there is no such thing as true parallel decode of future x86 instructions. There's always a serial dependency chain, and that's the problem in a nutshell.)

hobbesmaster
Jan 28, 2008

BlankSystemDaemon posted:

HFTs have been doing privately owned radio chains over the air for years, to try and get a lead on the market.

That’s for NYSE to CME, to front-run whatever commodities futures affect stocks and vice versa. Within the exchanges' data centers in NYC or Chicago, everyone gets exactly the same length of network cable to the exchange’s servers, as if they were impedance matching or something.

BlankSystemDaemon
Mar 13, 2009




hobbesmaster posted:

That’s for NYSE to CME, to front-run whatever commodities futures affect stocks and vice versa. Within the exchanges' data centers in NYC or Chicago, everyone gets exactly the same length of network cable to the exchange’s servers, as if they were impedance matching or something.

They use fiber optics because that has an order of magnitude lower latency, but yeah, I thought it was common knowledge that they all have to be the same length.

Coffee Jones
Jul 4, 2004

16 bit? Back when we was kids we only got a single bit on Christmas, as a treat
And we had to share it!

Beef posted:

Yep, this is why corporate machines need more firepower than a chromebook or 2/4 threads. Although more threads don't help much if McAfee decides that now is a good time to scan your entire HD.

my current nodejs project for work will create about 200k files.
code:
$ find . -type f | wc -l
  222949 <- file count in directory
so - if I were running Windows, and the scanner decided this was anomalous behavior and scanned the files before they're written or accessed, performance would go out the window

JawnV6
Jul 4, 2004

So hot ...

BobHoward posted:

So flip the question on its head: why haven't x86 designers explored this medium frequency / wide execution design space? This is where the answer is a flaw specific to one of these ISAs: x86 is awful to decode, and it gets worse the more instructions you want to decode in "parallel". (Scare quotes because there is no such thing as true parallel decode of future x86 instructions. There's always a serial dependency chain, and that's the problem in a nutshell.)

Not saying it's trivial but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.

LightRailTycoon
Mar 24, 2017

JawnV6 posted:

Not saying it's trivial but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.

X86 instructions are variable length, so you need to decode all preceding instructions, or get wacky and speculate all possible instruction decodings.

JawnV6
Jul 4, 2004

So hot ...
next time you can just go ahead and read both sentences instead of posting after the first one

LightRailTycoon posted:

X86 instructions are variable length, so you need to decode all preceding instructions,

JawnV6 posted:

Why would you care where the last instruction ended,

LightRailTycoon posted:

or get wacky and speculate all possible instruction decodings.

JawnV6 posted:

just compute all of them.



you don't have to "decode all preceding instructions", you just need to know how many bytes there are from a given starting IP to its end+1. why would it matter how you ended up at any particular IP? are you decoding the instructions that lead up to a branch target?

LightRailTycoon
Mar 24, 2017

JawnV6 posted:

next time you can just go ahead and read both sentences instead of posting after the first one







you don't have to "decode all preceding instructions", you just need to know how many bytes there are from a given starting IP to its end+1. why would it matter how you ended up at any particular IP? are you decoding the instructions that lead up to a branch target?

You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.

hobbesmaster
Jan 28, 2008

LightRailTycoon posted:

You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.

That’s why it’s “speculative”

LightRailTycoon
Mar 24, 2017

hobbesmaster posted:

That’s why it’s “speculative”

Right, on x86, you either run a single-issue, sequential decoder, or you find every possible decoding and speculate. On an architecture with fixed-size instructions, you can decode in parallel without speculating.

JawnV6
Jul 4, 2004

So hot ...

LightRailTycoon posted:

You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.

it literally does not matter though. do you have to decode the instructions before a branch target? why is incrementing the PC from my last known instruction (where I had my decode table already computed) fundamentally different?

my best faith read here is you're assuming a startup cost instead of steady state

LightRailTycoon
Mar 24, 2017
On a fixed instruction size arch, say A64 ARM, all instructions are 32 bits long, so you can take the stream of instructions, break it up into 32-bit chunks, and send it off to n simple instruction decoders, with minimal interaction between them.
On x86, instructions vary between 8 and 120 bits in length, you can't just do that. Instead, you have to settle for serial decoding and lose parallelism, or have a much more complicated decode, where all possible instruction decodings are tested for a given instruction stream.
This makes ARM work better with many short pipelines, and x86 work better with fewer, longer ones.
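To make the contrast concrete, here's a rough sketch (just illustrative Python, nothing like real decoder hardware; length_of stands in for whatever logic works out an instruction's size from its first bytes, and is assumed to always return at least 1):
code:
# Fixed-width ISA: every instruction boundary is known up front, so the
# chunks could all be handed to independent decoders in the same cycle.
def split_fixed_width(code: bytes, width: int = 4):
    return [code[i:i + width] for i in range(0, len(code), width)]

# Variable-width ISA: each boundary depends on (at least partially) decoding
# the previous instruction, so this loop is inherently serial.
def walk_variable_width(code: bytes, length_of):
    boundaries, ip = [], 0
    while ip < len(code):
        n = length_of(code, ip)   # must finish before the next iteration can start
        boundaries.append((ip, n))
        ip += n
    return boundaries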

JawnV6
Jul 4, 2004

So hot ...

LightRailTycoon posted:

On x86, instructions vary between 8 and 120 bits in length, you can't just do that. Instead, you have to settle for serial decoding and lose parallelism, or have a much more complicated decode, where all possible instruction decodings are tested for a given instruction stream.
again you keep skipping past it, being lovely and calling it 'bits' doesn't help your case. you don't need "every possible decode that led to IP 0xf00," and for some reason you seem to think that's a necessary component when it is not. you need the forward-looking decode for every byte offset.

im at IP 0xf00. I need to know where that instruction ends. then ill have the next IP. and ill need to know where that one ends. at no point did I need 'every possible decode' I needed the forward offset from ONE IP.

forbidden dialectics
Jul 26, 2005





JawnV6 posted:

again you keep skipping past it, being lovely and calling it 'bits' doesn't help your case. you don't need "every possible decode that led to IP 0xf00," and for some reason you seem to think that's a necessary component when it is not. you need the forward-looking decode for every byte offset.

im at IP 0xf00. I need to know where that instruction ends. then ill have the next IP. and ill need to know where that one ends. at no point did I need 'every possible decode' I needed the forward offset from ONE IP.

I think what they're saying is that you don't know how many bytes :smug: the forward offset is to fetch the next instruction until AFTER you've decoded the current one, within a single pipeline. This is one of the inherent tradeoffs of x86.

LightRailTycoon
Mar 24, 2017

forbidden dialectics posted:

I think what they're saying is that you don't know how many bytes :smug: the forward offset is to fetch the next instruction until AFTER you've decoded the current one, within a single pipeline. This is one of the inherent tradeoffs of x86.
Yes! Sorry my frustration about not being able to express my thoughts read as being lovely.

I wish I could find the article I read about speculative instruction decode, it was impressive.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

JawnV6 posted:

Not saying it's trivial but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.

Not saying it's impossible, just that it costs a lot of area and power; this gets much worse as you try to decode more instructions in parallel, and it has almost certainly limited the effective width of many x86 implementations. In practice x86 designers often do whatever they can to avoid powering their decoder all the time (example: Intel's uop cache), which should tell you something.

Assuming we call the first instruction decoded in a cycle N, yes, the most basic technique is to build a bunch of parallel decoders to try each possible offset for instruction N+1 — 15 of them, to be precise.

Decoding N+2 is where things start to hurt: for each possible offset of N+1, you have to decode each possible offset of N+2. You should be able to avoid the combinatorial explosion by taking advantage of overlaps, but that's still a lot of partial decodes running in parallel. If my mental math is right, you'd need a total of 106 single-instruction decoders to provide eight decoded x86 instructions per cycle.

None of this eliminates the serial dependency chain either, it just burns a ton of area and power to do some of the work before the chain must be resolved. You still have to determine the size of N to figure out which decoder to look at for N+1, then use that decoder's size result to pick the right decoder for N+2, and so on. And I think you'll still pay some of that combinatorial complexity explosion in the form of the muxes which select decoder outputs for N+2 onwards.

There might be a clever way to get around this next bit, I haven't tried to think about it deeply, but the next cycle is also problematic. Since x86 instructions are as little as 1 byte long, you have to do some byte-granularity data realignment to prepare data at the decoder inputs each cycle, and the amount to align by depends on computing all the sizes of the instructions being decoded in this cycle. So there's a lot of pressure to get that dependency chain done super fast.

Another issue is simply how much data must be fetched per cycle to keep decoders fed. M1's 8-wide decode needs exactly 32 bytes per cycle from L1 icache, no more, no less. An 8-wide x86 decoder needs anywhere from 8 to 120 bytes. I expect that practical x86 designs just fetch much less data than the worst case per cycle, and if you run into a sequence of really big instructions, oops, you lose.
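For what it's worth, one way to arrive at that 106 figure is one tentative single-instruction decoder at every byte offset an instruction in the group could start at, with overlapping offsets shared between slots (a back-of-the-envelope sketch of that assumption, not a claim about any real design):
code:
# x86 instructions are 1 to 15 bytes. Slot 0 starts at offset 0; slot k can
# start anywhere up to k*15 bytes in, so offsets 0 .. (width-1)*15 each need
# a decoder if overlapping offsets are shared between slots.
MAX_LEN = 15

def decoders_needed(width: int) -> int:
    return (width - 1) * MAX_LEN + 1

print(decoders_needed(8))   # 106
print(decoders_needed(4))   # 46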

JawnV6
Jul 4, 2004

So hot ...
y'all it's 4 bits per byte, this is a trivial amount of information to slosh around a die

it can be computed in parallel, I get that y'all are hung up on "why would I decode starting from the middle of an instruction" but it doesn't matter. it doesn't matter in any way/shape/form that the LIP is highly unlikely to land there, we're just trying to answer "what if I landed here, how many bytes to the next instruction" and that continues to be embarrassingly parallel

BobHoward posted:

Decoding N+2 is where things start to hurt: for each possible offset of N+1, you have to decode each possible offset of N+2. You should be able to avoid the combinatorial explosion by taking advantage of overlaps, but that's still a lot of partial decodes running in parallel. If my mental math is right, you'd need a total of 106 single-instruction decoders to provide eight decoded x86 instructions per cycle.

see bob can actually articulate this in a way that reveals the gap. I've never tried to say you would limit things to "each possible offset of N+1" even though that's really tempting in a sparing design. It's what's hamstringing everyone in this discussion, you only need to do that on startup. If you happened to have an oracle (or someone who'd chewed through every byte offset..) you wouldn't care about carefully selecting which of 15 offsets from N might be relevant, you'd just go ask and know they're all there.

from the top, my first attempt at this:

JawnV6 posted:

Why would you care where the last instruction ended, just compute all of them.

I don't give half a poo poo where N+1 is, I'm saying (and have been saying) compute every offset from every byte. Ferry it around next to the cache line, just in case. Just compute them all when you page a text in, have that poo poo lying around in case it gets executed.

"Can't be done in parallel" is BS, you're solving the wrong problem. "Where's my next instruction" vs. "what are the 4bit values for inst length associated with every byte in this cache line"

Beef
Jul 26, 2004
I'm not privy to the details of the front end, but from my work I do know that the front end is rarely the bottleneck. The workloads where the front end is the bottleneck are typically extremely branchy, and prefetching helps to keep the front end fed.

Scaling up x86 decode appears to be an engineering problem that has its solutions. Yes, it is more than likely using a lot more area and power than a fixed-width ISA would. But as mentioned numerous times in this thread: decode is such a tiny part of a Xeon core's surface that it really does not matter. It doesn't even seem to matter for E-cores. So I highly doubt that decode is what is stopping Intel from making super wide cores.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Beef posted:

decode is such a tiny part of a Xeon core's surface that it really does not matter.

:hai:

BlankSystemDaemon
Mar 13, 2009




I think Intel is confused. They seem to think that SR-IOV is new, while I have an HP DL380p Gen8 from 2013 that supports it, so it's only a decade old, no big.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Given that they’re talking about the GVT-g feature for GPUs, context should be sufficient to figure out they’re referring to sr-iov being new in their GPUs.

Xylophone-rib levels of nitpicking.

BlankSystemDaemon
Mar 13, 2009




in a well actually posted:

Given that they’re talking about the GVT-g feature for GPUs, context should be sufficient to figure out they’re referring to sr-iov being new in their GPUs.

Xylophone-rib levels of nitpicking.

The headline of the article, meanwhile, is: Do 11th Generation Intel® Processors Support GVT-g Technology?

EDIT: Also, based on a friend's experience and my own, SR-IOV support is a loving hellscape. Even if the vendor claims it's supported, unless you know someone who's confirmed it with a specific revision of a vendor's motherboard and know the exact firmware version, there's a good chance it won't work.

BlankSystemDaemon fucked around with this message at 22:41 on May 4, 2022

Agreed
Dec 30, 2003

The price of meat has just gone up, and your old lady has just gone down

in a well actually posted:

Xylophone-rib levels of nitpicking.

To what does this refer? Is this about how in The Lion King, during Scar's big musical number one of the hyenas plays the rib xylophone and the same rib gives different notes at one point?

hobbesmaster
Jan 28, 2008

Agreed posted:

To what does this refer? Is this about how in The Lion King, during Scar's big musical number one of the hyenas plays the rib xylophone and the same rib gives different notes at one point?

Wait, that gag in the Poochie episode was based on something?

Agreed
Dec 30, 2003

The price of meat has just gone up, and your old lady has just gone down

No, I noticed that as a kid, it's here, and I was just hoping that I wasn't the only one :negative:

I searched for this and found https://www.youtube.com/watch?v=pYrRqMHQY7o which I assume must be the thing here (the episode did come out 4 years after The Lion King, though!). I was apparently that guy as a kid, lmfao

Agreed fucked around with this message at 23:14 on May 4, 2022

Cygni
Nov 12, 2005

raring to post

Moore's Law is Dead believes a Sapphire Rapids-based HEDT (ahem, "Extreme Xeon") part is coming this year, to be shown off in a few weeks. 24c, 5 GHz+, quad-channel memory. He says it's monolithic, but I think there is a chance it's actually 2 tiles.

Believe it when you see it etc etc

Pablo Bluth
Sep 7, 2007

I've made a huge mistake.
Not matching the 3990X isn't a great benchmark when you look at the further performance gains of the 5000-series Threadripper. That said, with AMD going PRO-only on Threadripper and the reported low supply, they'll sell - if only because Dell will make it easy to buy one.

repiv
Aug 13, 2009

If anyone here happens to have the AORUS Z690I Ultra, it's being recalled due to PCIe4 being broken

https://videocardz.com/newz/gigabyte-announces-recall-and-exchange-for-z690i-aorus-ultra-motherboard-citing-pcie-gen4-related-issues

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy

Cygni posted:

Moore's Law is Dead believes a Sapphire Rapids-based HEDT (ahem, "Extreme Xeon") part is coming this year, to be shown off in a few weeks. 24c, 5 GHz+, quad-channel memory. He says it's monolithic, but I think there is a chance it's actually 2 tiles.

I'd believe it: between Alder Lake being a slam dunk and Threadripper being kinda dead, people are gonna want a new HEDT toy to play with, and this is gonna sell if they make it

SwissArmyDruid
Feb 14, 2014

by sebmojo
Chances AMD goes, "Hey Intel, you wanna just go halfsies on this?" and Intel agrees in light of Intel shareholders continuing to vote down executive bonuses?

https://www.tomshardware.com/news/amd-fsr2-tested-on-intel-integrated-graphics

mobby_6kl
Aug 9, 2009

by Fluffdaddy
Has anyone used a Jasper Lake laptop? I'm looking at this for a travel machine

mobby_6kl posted:

I was looking to replace it with either the OneNote 4 (which is good but too expensive for a travel toy) or this new Chuwi MiniBook X:


https://store.chuwi.com/products/minibook-x?sca_ref=1075383.Gjt2ZMkWAM&sca_source=techtablet

Pretty decent specs: 10.8" 2K full touchscreen | Intel Jasper Lake N5100 | UHD Graphics GPU | Windows 11 | 12GB DDR4+512GB SSD

I think I'd really prefer an ULV Alder Lake CPU (the N5100 is ok but seems to be the same speed as the Core M), no stupid hole punch, and a bigger battery though. I did just receive my new work X1 Yoga (see posts above) so I'll see if I can just use it instead. It's thin and light but the ~3 extra inches make a big difference :pervert:
What bothers me a bit is the CPU performance. In geekbench it seems to do about 600 single core, which is the same as my current Core m3 tablet. Not great, but it is quad core vs my 2 cores... yet the multicore results are almost the same, around my 1400. (https://www.youtube.com/watch?v=bsboQHnMvPM&t=673s)
I would've expected over 2000 at least, what with double the core count. Why wouldn't it scale? Is it just running into power limits?

Doubling total performance would be nice, having exactly the same after 3-4 years would be a bummer.

E: the m3 does have hyperthreading but there's no way that makes that much of a difference.
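Rough arithmetic behind that expectation, using the numbers above (assuming naive linear scaling, which real chips rarely hit):
code:
single_core, cores = 600, 4
ideal_multi = single_core * cores        # ~2400 if it scaled perfectly
observed = 1400
print(observed / ideal_multi)            # ~0.58 -- consistent with a power/thermal cap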

mobby_6kl fucked around with this message at 17:54 on May 24, 2022

Vanagoon
Jan 20, 2008


Best Dead Gay Forums
on the whole Internet!
Crossposting from the White Whale thread. Took me years to dig this back up.

I don't know if anyone else is interested in this but I found the Intel Groove Machine I had posted about way earlier in the thread. I emailed one of the people who worked on it named Michael Henry and got him to send me a copy. His site is https://isorhythm.com. His correspondence was funny - he kept calling Windows "Windoze"

I couldn't figure out how to video record the display so I resorted to videoing my monitor like a goon.
https://i.imgur.com/Gu07zvk.mp4

Here's a link if anyone wants to mess with it. It was made in 1997 and still works under Windows 10 64 bit~!

https://www.mediafire.com/file/vpta0q8xgeh1b61/Groove.rar/file

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



Vanagoon posted:

Crossposting from the White Whale thread. Took me years to dig this back up.

I don't know if anyone else is interested in this but I found the Intel Groove Machine I had posted about way earlier in the thread. I emailed one of the people who worked on it named Michael Henry and got him to send me a copy. His site is https://isorhythm.com. His correspondence was funny - he kept calling Windows "Windoze"

I couldn't figure out how to video record the display so I resorted to videoing my monitor like a goon.
https://i.imgur.com/Gu07zvk.mp4

Here's a link if anyone wants to mess with it. It was made in 1997 and still works under Windows 10 64 bit~!

https://www.mediafire.com/file/vpta0q8xgeh1b61/Groove.rar/file

Could this be turned into a screen saver? :monocle:

mobby_6kl
Aug 9, 2009

by Fluffdaddy

SourKraut posted:

Could this be turned into a screen saver? :monocle:

A screen saver is just an exe renamed to .scr. Then just right-click to install it. A useful piece of knowledge retained from the 9x days :)
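A minimal sketch of that trick (hypothetical filenames; whether the program actually behaves nicely once installed depends on how it handles the /s, /c and /p arguments Windows passes to screensavers):
code:
import shutil

# Copy the executable and give it the .scr extension.
shutil.copy("groove.exe", "Groove.scr")
# After that, right-clicking Groove.scr in Explorer offers an "Install" entry.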

BlankSystemDaemon
Mar 13, 2009




Most .scr files are PE32 binary images, so once 32-bit support gets completely stripped from Windows, an absolute shitload of screensavers will stop working, as even ones made today aren't PE32+ binary images.
