|
Were they running RTOSes on those HFT things that are trying to execute trades or just hoping for the best on Linux?
|
# ? May 2, 2022 02:33 |
|
|
|
shrike82 posted: hft kinda came and went as a market trend - the action has been machine learning for a while, a european fund that tried to headhunt me recently talked about having a 10K+ V100/A100 server farm

I suppose they just switch back and forth between that and crypto mining depending on what they think is best that day
|
# ? May 2, 2022 04:17 |
|
HFT firms still get their Xeon SKUs, so the demand must be there. Those 10k+ GPU farms are for model training. The deployment (inference) is on CPUs and, to some extent, FPGAs.
|
# ? May 2, 2022 08:40 |
|
I’d expect FPGAs to get some play on the I/O side there as well — probably not commodity NICs. If you’re targeting end-to-end latency, you have to own everything out to the furthest interface, i.e. the Ethernet link.
|
# ? May 2, 2022 10:04 |
shrike82 posted: i haven't kept up with hft in years but the last i heard it was all about fpgas - i'd be surprised a system builder using mainstream CPUs etc, albeit one targeting finance, is doing anything particularly interesting

There are a few exceptions, like if you buy 8-way scalable Xeons and 7U servers with room for 7 motherboards connected to a backplane, with up to 224 cores and 24TB of memory under a single OS, or design your software so it can scale out arbitrarily across many systems, in which case blade servers (or their functional Open Compute Project equivalent) are still king (and probably always will be, even if ARM takes over) - and either of those is a use-case-specific scenario.

movax posted: I’d expect FPGAs to get some play on the I/O side also there — probably not commodity NICs. If you’re targeting end-to-end latency, you have to own everything out to the furthest interface, I.e. Ethernet link.

BlankSystemDaemon fucked around with this message at 10:09 on May 2, 2022 |
|
# ? May 2, 2022 10:05 |
|
ConanTheLibrarian posted: We haven't seen ARM CPUs clock as high as x86 ones. Is that an intrinsic property of ARM or could we get to a point where their CPUs are running at the same wattage as x86 but getting a whole lot more work done?

Nope, it's not intrinsic to Arm. None of the design techniques which enable high clock speeds in today's desktop x86 cores are really all that specific to any particular ISA. Don't get me wrong, there are things you can do in ISAs which make it harder to design high-frequency implementations, but I don't think you'll find any of those problems in Arm.

So why aren't there any high-frequency Arm cores? Because high frequency costs lots of area and power, and one or both have always been unacceptable prices to pay in the markets today's Arm cores were designed for. For example, Apple's P cores are designed for iPhones first and Macs second. As a premium brand with high device prices, they aren't too sensitive to area, but they care a lot about power. So the M1/A14 P core is designed for medium frequency and very wide execution. Peak ST clocks aren't even much better in M1 than A14 (3.2 GHz vs 3.1 GHz).

So flip the question on its head: why haven't x86 designers explored this medium-frequency / wide-execution design space? This is where the answer is a flaw specific to one of these ISAs: x86 is awful to decode, and it gets worse the more instructions you want to decode in "parallel". (Scare quotes because there is no such thing as true parallel decode of future x86 instructions. There's always a serial dependency chain, and that's the problem in a nutshell.)
|
# ? May 2, 2022 13:13 |
|
BlankSystemDaemon posted: HFTs have been doing privately owned radio chains over the air for years, to try and get a lead on the market.

That’s for NYSE to CME, to front-run whatever commodities futures affect stocks and vice versa. Within the data centers in NYC or Chicago where the exchanges are hosted, everyone gets exactly the same length of network cable to the exchange’s servers, as if they were impedance matching or something.
|
# ? May 2, 2022 16:24 |
hobbesmaster posted:That’s for NYSE to CME to front run whatever commodities futures effect stocks and vice versa. Within a data center in NYC or Chicago the exchanges have data centers where everyone gets exactly the same length of network cable to the exchange’s servers as if they were impedance matching or something.
|
|
# ? May 2, 2022 16:45 |
Beef posted: Yep, this is why corporate machines need more firepower than a chromebook or 2/4 threads. Although more threads don't help much if McAfee decides that now is a good time to scan your entire HD.

my current nodejs project for work will create about 200k files. code:
|
|
# ? May 2, 2022 18:07 |
|
BobHoward posted: So flip the question on its head: why haven't x86 designers explored this medium frequency / wide execution design space? This is where the answer is a flaw specific to one of these ISAs: x86 is awful to decode, and it gets worse the more instructions you want to decode in "parallel". (Scare quotes because there is no such thing as true parallel decode of future x86 instructions. There's always a serial dependency chain, and that's the problem in a nutshell.)

Not saying it's trivial, but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.
|
# ? May 2, 2022 18:07 |
|
JawnV6 posted: Not saying it's trivial but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.

x86 instructions are variable-length, so you need to decode all preceding instructions, or get wacky and speculate all possible instruction decodings.
|
# ? May 2, 2022 18:32 |
|
next time you can just go ahead and read both sentences instead of posting after the first one

LightRailTycoon posted: X86 instructions are variable length, so you need to decode all preceding instructions,
JawnV6 posted: Why would you care where the last instruction ended,
LightRailTycoon posted: or get wacky and speculate all possible instruction decodings.
JawnV6 posted: just compute all of them.

you don't have to "decode all preceding instructions", you need to know how many bytes sit between any given start and its end+1. why would it matter how you ended up at any particular IP? are you decoding the instructions that lead up to a branch target?
|
# ? May 2, 2022 18:37 |
|
JawnV6 posted: next time you can just go ahead and read both sentences instead of posting after the first one

You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.
|
# ? May 2, 2022 18:40 |
|
LightRailTycoon posted: You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.

That’s why it’s “speculative”
|
# ? May 2, 2022 18:47 |
|
hobbesmaster posted: That’s why it’s “speculative”

Right: on x86, you either run a single-issue, sequential decoder, or you find every possible decoding and speculate. On an architecture with fixed-size instructions, you can decode in parallel without speculating.
|
# ? May 2, 2022 18:51 |
|
LightRailTycoon posted: You don’t know where each instruction begins until you have at least partially decoded the preceding instructions.

it literally does not matter though. do you have to decode the instructions before a branch target? why is incrementing the PC from my last known instruction (where I had my decode table already computed) fundamentally different? my best-faith read here is that you're assuming a startup cost instead of steady state
|
# ? May 2, 2022 18:56 |
|
On a fixed-instruction-size arch, say A64 ARM, all instructions are 32 bits long, so you can take the stream of instructions, break it up into 32-bit chunks, and send it off to n simple instruction decoders, with minimal interaction between them. On x86, instructions vary between 8 and 120 bits in length, so you can't just do that. Instead, you have to settle for serial decoding and lose parallelism, or have a much more complicated decode, where all possible instruction decodings are tested for a given instruction stream. This makes ARM work better with many short pipelines, and x86 work better with fewer, longer ones.
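The serial-vs-parallel contrast above can be sketched in a few lines of Python. This is a software toy, not hardware: `toy_length` is an invented stand-in for a real x86 length decoder, and the point is only the shape of the dependency, fixed-width start offsets are a pure function of the index, while variable-length offsets form a serial chain.

```python
# Toy contrast between fixed-width and variable-length instruction streams.
# The length rule is invented (low nibble of the first byte + 1); real x86
# length decoding is far messier, but the dependency shape is the same.

def fixed_width_starts(stream: bytes, width: int = 4) -> list[int]:
    # Fixed-width (AArch64-style): start offsets depend only on the index,
    # so every chunk can go to an independent decoder in parallel.
    return list(range(0, len(stream), width))

def toy_length(stream: bytes, ip: int) -> int:
    # Invented stand-in for an x86 length decoder: 1..16 bytes.
    return (stream[ip] & 0x0F) + 1

def variable_length_starts(stream: bytes) -> list[int]:
    # Variable-length (x86-style): you only learn where instruction N+1
    # starts after working out the length of instruction N - a serial chain.
    starts, ip = [], 0
    while ip < len(stream):
        starts.append(ip)
        ip += toy_length(stream, ip)
    return starts
```

With 16 bytes, the fixed-width starts are always `[0, 4, 8, 12]`; in the variable-length case the same 16 bytes can hold anywhere from one to sixteen instructions depending on their contents, which is exactly why the boundaries have to be discovered.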
|
# ? May 2, 2022 19:23 |
|
LightRailTycoon posted: On x86, instructions vary between 8 and 120 bits in length, you can't just do that. Instead, you have to settle for serial decoding and lose parallelism, or have a much more complicated decode, where all possible instruction decodings are tested for a given instruction stream.

I'm at IP 0xf00. I need to know where that instruction ends; then I'll have the next IP. And I'll need to know where that one ends. At no point did I need 'every possible decode', I needed the forward offset from ONE IP.
|
# ? May 2, 2022 20:33 |
|
JawnV6 posted: again you keep skipping past it, being lovely and calling it 'bits' doesn't help your case. you don't need "every possible decode that lead to IP 0xf00," and for some reason you seem to think that's a necessary component when it is not. you need the forward-looking decode for every byte offset.

I think what they're saying is that you don't know how many bytes the forward offset is to fetch the next instruction until AFTER you've decoded the current one, within a single pipeline. This is one of the inherent tradeoffs of x86.
|
# ? May 2, 2022 22:39 |
|
forbidden dialectics posted:I think what they're saying is that you don't know how many bytes the forward offset is to fetch the next instruction until AFTER you've decoded the current one, within a single pipeline. This is one of the inherent tradeoffs of x86. I wish I could find the article I read about speculative instruction decode, it was impressive.
|
# ? May 2, 2022 23:26 |
|
JawnV6 posted: Not saying it's trivial but parallel x86 decode is absolutely possible? Why would you care where the last instruction ended, just compute all of them.

Not saying it's impossible, just that it costs a lot of area and power, this gets much worse as you try to decode more instructions in parallel, and it has almost certainly limited the effective width of many x86 implementations. In practice x86 designers often do whatever they can to avoid powering their decoder all the time (example: Intel's uop cache), which should tell you something.

Assuming that we call our first instruction decoded in the cycle N, yes, the most basic technique is to build a bunch of parallel decoders to try each possible offset for instruction N+1. 15 to be precise. Decoding N+2 is where things start to hurt: for each possible offset of N+1, you have to decode each possible offset of N+2. You should be able to avoid the combinatorial explosion by taking advantage of overlaps, but that's still a lot of partial decodes running in parallel. If my mental math is right, you'd need a total of 106 single-instruction decoders to provide eight decoded x86 instructions per cycle.

None of this eliminates the serial dependency chain either, it just burns a ton of area and power to do some of the work before the chain must be resolved. You still have to determine the size of N to figure out which decoder to look at for N+1, then use that decoder's size result to pick the right decoder for N+2, and so on. And I think you'll still pay some of that combinatorial complexity explosion in the form of the muxes which select decoder outputs for N+2 onwards.

There might be a clever way to get around this next bit, I haven't tried to think about it deeply, but the next cycle is also problematic. Since x86 instructions are as little as 1 byte long, you have to do some byte-granularity data realignment to prepare data at the decoder inputs each cycle, and the amount to align by depends on computing all the sizes of the instructions being decoded in this cycle. So there's a lot of pressure to get that dependency chain done super fast.

Another issue is simply how much data must be fetched per cycle to keep decoders fed. M1's 8-wide decode needs exactly 32 bytes per cycle from L1 icache, no more, no less. An 8-wide x86 decoder needs anywhere from 8 to 120 bytes. I expect that practical x86 designs just fetch much less data than the worst case per cycle, and if you run into a sequence of really big instructions, oops, you lose.
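The "106 decoders" figure can be sanity-checked with quick arithmetic. This sketch assumes (as the post does) 1-to-15-byte instruction lengths and an 8-wide decode group, and counts one single-instruction decoder per byte offset at which any instruction in the group could start:

```python
# The k-th instruction in an 8-wide decode group (k = 0..7) can start
# anywhere from byte offset k (all predecessors 1 byte) to 15*k (all
# predecessors 15 bytes). One decoder per reachable offset covers every case.
MAX_LEN = 15
GROUP_WIDTH = 8

reachable = set()
for k in range(GROUP_WIDTH):
    reachable.update(range(k, MAX_LEN * k + 1))

print(len(reachable))  # offsets 0..105 -> 106 decoders
```

The intervals overlap, so the reachable offsets are the contiguous range 0..105, i.e. 7×15 + 1 = 106 decoders, matching the mental math above.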
|
# ? May 3, 2022 06:15 |
|
y'all, it's 4 bits per byte, this is a trivial amount of information to slosh around a die. it can be computed in parallel. I get that y'all are hung up on "why would I decode starting from the middle of an instruction" but it doesn't matter. it doesn't matter in any way/shape/form that the LIP is highly unlikely to land there, we're just trying to answer "what if I landed here, how many bytes to the next instruction" and that continues to be embarrassingly parallel

BobHoward posted: Decoding N+2 is where things start to hurt: for each possible offset of N+1, you have to decode each possible offset of N+2. You should be able to avoid the combinatorial explosion by taking advantage of overlaps, but that's still a lot of partial decodes running in parallel. If my mental math is right, you'd need a total of 106 single-instruction decoders to provide eight decoded x86 instructions per cycle.

see, bob can actually articulate this in a way that reveals the gap. I've never tried to say you would limit things to "each possible offset of N+1" even though that's really tempting in a sparing design. It's what's hamstringing everyone in this discussion, you only need to do that on startup. If you happened to have an oracle (or someone who'd chewed through every byte offset..) you wouldn't care about carefully selecting which of 15 offsets from N might be relevant, you'd just go ask and know they're all there. from the top, my first attempt at this:

JawnV6 posted: Why would you care where the last instruction ended, just compute all of them.

I don't give half a poo poo where N+1 is, I'm saying (and have been saying) compute every offset from every byte. Ferry it around next to the cache line, just in case. Just compute them all when you page a text in, have that poo poo lying around in case it gets executed. "Can't be done in parallel" is BS, you're solving the wrong problem. It's not "where's my next instruction", it's "what are the 4-bit values for instruction length associated with every byte in this cache line"
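The scheme being argued for can be sketched in Python rather than gates. Everything here is a toy: `toy_length` is a made-up stand-in for a real x86 length decoder, and the table is a list rather than 4-bit sideband storage. The point is the split: one independent length entry per byte (embarrassingly parallel), after which recovering instruction boundaries from any entry IP is pure table chasing.

```python
# Precompute "if an instruction started at this byte, how long would it be?"
# for every byte in a line. Each entry depends only on the bytes at that
# offset, so all entries can be computed in parallel (in hardware, one small
# length decoder per byte). The length rule is invented for the toy.

def toy_length(line: bytes, ip: int) -> int:
    return (line[ip] & 0x0F) + 1  # stand-in: 1..16 bytes

def length_table(line: bytes) -> list[int]:
    # Embarrassingly parallel: no entry depends on any other entry.
    return [toy_length(line, i) for i in range(len(line))]

def instruction_starts(table: list[int], entry: int) -> list[int]:
    # Given ANY entry IP (e.g. a branch target), boundaries fall out of
    # simple lookups; no decoding of preceding instructions is needed.
    starts, ip = [], entry
    while ip < len(table):
        starts.append(ip)
        ip += table[ip]
    return starts
```

The table is computed once per line; `instruction_starts(table, 0)` and `instruction_starts(table, 5)` both work without knowing how execution arrived at that offset, which is the "why would you care where the last instruction ended" position in miniature.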
|
# ? May 3, 2022 19:22 |
|
I'm not privy to the details of the front end, but from my work I do know that the front end is rarely the bottleneck. The workloads where the front end is the bottleneck are typically extremely branchy, and prefetching helps to keep the front end fed. Scaling up x86 decode appears to be an engineering problem that has solutions. Yes, it is more than likely using a lot more area and power than a fixed-width ISA would. But as mentioned numerous times in this thread: decode is such a tiny part of a Xeon core's surface that it really does not matter. It doesn't even seem to matter for E-cores. So I highly doubt that decode is what is stopping Intel from making super-wide cores.
|
# ? May 3, 2022 19:55 |
|
Beef posted:decode is such a tiny part of a Xeon core's surface that it really does not matter.
|
# ? May 3, 2022 20:37 |
I think Intel is confused. They seem to think that SR-IOV is new, while I have an HP DL380p Gen8 from 2013 that supports it, so it's only a decade old, no big.
|
|
# ? May 4, 2022 22:02 |
|
Given that they’re talking about the GVT-g feature for GPUs, context should be sufficient to figure out they’re referring to sr-iov being new in their GPUs. Xylophone-rib levels of nitpicking.
|
# ? May 4, 2022 22:21 |
in a well actually posted: Given that they’re talking about the GVT-g feature for GPUs, context should be sufficient to figure out they’re referring to sr-iov being new in their GPUs.

The headline of the article, meanwhile, is: Do 11th Generation Intel® Processors Support GVT-g Technology?

EDIT: Also, based on a friend's experience and my own, SR-IOV support is a loving hellscape. Even if the vendor claims it's supported, unless you know someone who's confirmed it with a specific revision of a vendor's motherboard and know the exact firmware version, there's a good chance it won't work.

BlankSystemDaemon fucked around with this message at 22:41 on May 4, 2022 |
|
# ? May 4, 2022 22:37 |
|
in a well actually posted:Xylophone-rib levels of nitpicking. To what does this refer? Is this about how in The Lion King, during Scar's big musical number one of the hyenas plays the rib xylophone and the same rib gives different notes at one point?
|
# ? May 4, 2022 22:57 |
|
Agreed posted:To what does this refer? Is this about how in The Lion King, during Scar's big musical number one of the hyenas plays the rib xylophone and the same rib gives different notes at one point? Wait, that gag in the Poochie episode was based on something?
|
# ? May 4, 2022 23:03 |
|
No, I noticed that as a kid, it's here, and I was just hoping that I wasn't the only one. I searched for this and found https://www.youtube.com/watch?v=pYrRqMHQY7o which I assume must be the thing here (the episode did come out 4 years after The Lion King, though!). I was apparently that guy as a kid, lmfao

Agreed fucked around with this message at 23:14 on May 4, 2022 |
# ? May 4, 2022 23:05 |
|
Moore's Law Is Dead believes a Sapphire Rapids based HEDT (ahem, "Extreme Xeon") part is coming this year, shown off in a few weeks. 24c, 5GHz+, quad-channel memory. He says it's monolithic, but I think there is a chance it's actually 2 tiles. Believe it when you see it, etc etc
|
# ? May 7, 2022 19:41 |
|
Not matching the 3990X isn't a great benchmark, when you look at the further performance gains of the 5 series TR. That said, with AMD going PRO only on TR and the reported low supply, they'll sell. If only because Dell will make it easy to buy one.
|
# ? May 7, 2022 20:16 |
|
If anyone here happens to have the AORUS Z690I Ultra, it's being recalled due to PCIe4 being broken https://videocardz.com/newz/gigabyte-announces-recall-and-exchange-for-z690i-aorus-ultra-motherboard-citing-pcie-gen4-related-issues
|
# ? May 13, 2022 14:53 |
|
Cygni posted: Moores Law is Dead believes a Sapphire Rapids based HEDT (ahem, "Extreme Xeon") part is coming this year, shown off in a few weeks. 24c, 5ghz+, quad-channel memory. He says its monolithic, but I think there is a chance its actually 2 tiles.

I'd believe it: between Alder Lake being a slam dunk and Threadripper being kinda dead, people are gonna want a new HEDT toy to play with, and this is gonna sell if they make it
|
# ? May 13, 2022 15:11 |
|
Chances AMD goes, "Hey Intel, you wanna just go halfsies on this?" and Intel agrees in light of Intel shareholders continuing to vote down executive bonuses? https://www.tomshardware.com/news/amd-fsr2-tested-on-intel-integrated-graphics
|
# ? May 19, 2022 12:11 |
|
Has anyone used a Jasper Lake laptop? I'm looking at this for a travel machine

mobby_6kl posted: I was looking to replace it with either the OneNote 4 (which is good but too expensive for a travel toy) or this new Chuwi MiniBook X:

I would've expected over 2000 at least, what with double the core count. Why wouldn't it scale? Is it just running into power limits? Doubling total performance would be nice; having exactly the same after 3-4 years would be a bummer.

E: the m3 does have hyperthreading but there's no way that makes that much of a difference.

mobby_6kl fucked around with this message at 17:54 on May 24, 2022 |
# ? May 24, 2022 15:58 |
|
Crossposting from the White Whale thread. Took me years to dig this back up.

I don't know if anyone else is interested in this, but I found the Intel Groove Machine I had posted about way earlier in the thread. I emailed one of the people who worked on it, named Michael Henry, and got him to send me a copy. His site is https://isorhythm.com. His correspondence was funny - he kept calling Windows "Windoze"

I couldn't figure out how to video record the display, so I resorted to videoing my monitor like a goon. https://i.imgur.com/Gu07zvk.mp4

Here's a link if anyone wants to mess with it. It was made in 1997 and still works under Windows 10 64-bit~! https://www.mediafire.com/file/vpta0q8xgeh1b61/Groove.rar/file
|
# ? May 25, 2022 07:52 |
|
Vanagoon posted:Crossposting from the White Whale thread. Took me years to dig this back up. Could this be turned into a screen saver?
|
# ? May 30, 2022 16:58 |
|
SourKraut posted:Could this be turned into a screen saver? A screen saver is just an exe renamed to .scr. Then just right-click to install it. A useful piece of knowledge retained from the 9x days
|
# ? May 30, 2022 21:25 |
|
|
Most .scr files are PE32 binary images, so once 32-bit support gets completely stripped from Windows, an absolute shitload of screensavers will stop working, as even ones made today aren't PE32+ binary images.
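The PE32-vs-PE32+ distinction is easy to check programmatically. A minimal sketch following the published PE/COFF layout: `e_lfanew` at file offset 0x3C points at the `PE\0\0` signature, which is followed by a 20-byte COFF header and then the Optional Header, whose first field is a magic of 0x10B for PE32 (32-bit) or 0x20B for PE32+ (64-bit). Error handling here is deliberately minimal.

```python
# Classify a Windows binary (.exe/.scr) as PE32 or PE32+ by reading the
# Optional Header magic, per the PE/COFF format layout.
import struct

def pe_kind(data: bytes) -> str:
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE image")
    # e_lfanew (offset of the PE signature) lives at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # Optional Header starts after the 4-byte signature + 20-byte COFF header.
    (magic,) = struct.unpack_from("<H", data, e_lfanew + 24)
    return {0x10B: "PE32", 0x20B: "PE32+"}.get(magic, "unknown")
```

Running this over a screensaver folder, e.g. `pe_kind(open(path, "rb").read())`, would show how many are still 32-bit images.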
|
|
# ? May 30, 2022 23:03 |