karoshi
Nov 4, 2008

"Can somebody mspaint eyes on the steaming packages? TIA" yeah well fuck you too buddy, this is the best you're gonna get. Is this even "work-safe"? Let's find out!

Methylethylaldehyde posted:

You can buy it, but even going from DDR4-3600 CAS 16 down to CAS 8 still leaves you a factor of 10 slower than SRAM in the best case. The '10ish ns' average latency across 2-3 generations suggests that it's near the fundamental limits of physics under the cost/capacity/speed/distance/power tradeoffs we've settled on for DRAM products.

If someone got a large enough hair up their rear end about latency in DRAM, you could trade capacity, power and speed for lower overall latency. But it would need to be a completely new mask set for a really niche product. It would probably be way easier to bring the memory to sub-ambient, crank the voltage to the bleeding edge of what it can handle, and bring the timings down as low as you can.

Going to the 'no cost is prohibitive' area means you should just superglue 2 pounds of raw 5 nanometer SRAM dies to a bare DIMM PCB, charge $1 million for it, and enjoy your tCL=1, no refresh needed, main memory.

I'm frankly surprised the product doesn't exist yet. There are 128GB DRAM modules, so dividing by 8 means a 16GB SRAM DIMM should be "doable". I'm probably missing some other scaling factor, but still, a 4GB SRAM DIMM surely would be nice for putting your worst pointer-chasing, mission-critical code/data on.

I'd like to see it just to expose the limits of current memory controllers, and the related improvements for different applications.
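For what it's worth, here is the arithmetic behind that divide-by-8 guess as a quick sketch; it just scales by transistor count (1T DRAM cell vs 6T SRAM cell) with an assumed overhead factor, and ignores the real cell-area ratios, which are less favorable to SRAM.

code:
# Back-of-the-envelope behind the "divide by 8" guess above.
# Scaling purely by transistor count is optimistic (it ignores the DRAM capacitor's tiny
# footprint and real cell-area ratios); the overhead factor is an assumption for illustration.
dram_capacity_gb = 128          # existing large RDIMM capacity
transistors_per_dram_bit = 1    # 1T1C cell
transistors_per_sram_bit = 6    # classic 6T cell
overhead = 1.3                  # assumed extra area for SRAM periphery/wiring

sram_capacity_gb = dram_capacity_gb * transistors_per_dram_bit / (transistors_per_sram_bit * overhead)
print(f"~{sram_capacity_gb:.0f} GB")   # ~16 GB, the same ballpark as the post's estimate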


Boat Stuck
Apr 20, 2021

I tried to sneak through the canal, man! Can't make it, can't make it, the ship's stuck! Outta my way son! BOAT STUCK! BOAT STUCK!

karoshi posted:

Going to the 'no cost is prohibitive' area means you should just superglue 2 pounds of raw 5 nanometer SRAM dies to a bare DIMM PCB, charge $1 million for it, and enjoy your tCL=1, no refresh needed, main memory.

I'm frankly surprised the product doesn't exist yet. There are 128GB DRAM modules, so dividing by 8 means a 16GB SRAM DIMM should be "doable". I'm probably missing some other scaling factor, but still, a 4GB SRAM DIMM surely would be nice for putting your worst pointer-chasing, mission-critical code/data on.

I'd like to see it just to expose the limits of current memory controllers, and the related improvements for different applications.

That's sorta what 3D V-Cache is. You need the SRAM to be physically closer to the CPU to really minimize latency, since at nanosecond speeds distance matters.

Beef
Jul 26, 2004
The DRAM 'protocol' and controller itself is also adding some fundamental latencies. You have refresh cycles, row selects, etc.

Making DIMMs out of SRAM cells is a fantastic way to piss away a ton of money for little gain. But it's a good thought exercise.
I'll check if I can figure out what gains can be expected in a DRAM simulator.

Beef
Jul 26, 2004
Oops, refresh cycles won't be needed for SRAM DIMMs.

Encrypted
Feb 25, 2016

SwissArmyDruid posted:

Please to be remembering that every DIMM of DDR5 is, and I emphasize that this is very crude approximation, two sticks of SO-DIMM glued together, that's how they're getting the extra bandwidth. You are technically getting dual-channel from a single stick of DDR5.

By that same token, x4 sticks of DDR5 would be like x8 sticks of DDR4.

Therefore, for best performance, you would probably want as 2x sticks, yes.
lol jesus christ

BlankSystemDaemon
Mar 13, 2009



Twerk from Home posted:

You're describing Intel Optane here, right? Especially in the DIMM format, my understanding is that this is what it is trying to be.

Getting rid of paging would be pretty interesting.
Sure, but like I was getting at, it doesn't have the write endurance yet, and it's one of those things that seem to be at least one scientific breakthrough away from happening at present.

Beef
Jul 26, 2004

Beef posted:

The DRAM 'protocol' and controller itself is also adding some fundamental latencies. You have refresh cycles, row selects, etc.

Making DIMMs out of SRAM cells is a fantastic way to piss away a ton of money for little gain. But it's a good thought exercise.
I'll check if I can figure out what gains can be expected in a DRAM simulator.

I ran this past a more knowledgeable colleague as a fun thought experiment:
You still have the extra latency of going off-chip, talking to a controller, plus the added latency from the physically larger SRAM modules, etc.
In a linear access scenario, e.g. reading pages, you will not really notice any improved latency. However, there will be a significant gain when your workload is doing constant random accesses, causing constant row switching. An extra bonus is that your row switch won't be delayed by the controller while a refresh is in progress. In that particular scenario, your big rear end SRAM DIMMs might provide you with roughly 2x better latency. Workloads with that type of access pattern (e.g. large-scale sparse graph analysis) are typically significantly memory-latency bound, so it might make sense for some three letter agencies with unlimited funds.
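To put some illustrative numbers on the row-switching effect described above, here is a toy latency model; every timing in it is an assumption chosen to land in a plausible range, not a measurement of any real DIMM.

code:
# Toy model: average load latency for a DRAM DIMM vs a hypothetical SRAM DIMM.
# All timings are illustrative assumptions, not datasheet numbers.
T_BASE = 20.0    # ns, off-chip trip + controller overhead, paid by both (assumed)
T_CAS  = 14.0    # ns, column access when the row is already open
T_RCD  = 14.0    # ns, activating a new row
T_RP   = 14.0    # ns, precharging before opening a different row
T_REF  = 2.0     # ns, average stall from refresh collisions (assumed, DRAM only)

def dram_latency(row_hit_rate):
    row_miss_cost = (1.0 - row_hit_rate) * (T_RP + T_RCD)
    return T_BASE + T_CAS + row_miss_cost + T_REF

sram_latency = T_BASE + T_CAS   # no rows to activate, no refresh

for hit_rate, label in [(0.9, "mostly linear"), (0.1, "mostly random")]:
    print(f"{label}: {dram_latency(hit_rate) / sram_latency:.1f}x worse than SRAM")
# mostly linear: ~1.1x, mostly random: ~1.8x -- roughly the "2x in the random case" above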

Potato Salad
Oct 23, 2014

nobody cares


highly useful in pchem too, and I'm excited for it

modern tools optimize memory tier and arrangement predictively mostly alright but there's only so much that can be done before sticking a gigantic stack of memory on die

karoshi
Nov 4, 2008

"Can somebody mspaint eyes on the steaming packages? TIA" yeah well fuck you too buddy, this is the best you're gonna get. Is this even "work-safe"? Let's find out!

Beef posted:

I ran this past a more knowledgeable colleague as a fun thought experiment:
You still have the extra latency of going off-chip, talking to a controller, plus the added latency from the physically larger SRAM modules, etc.
In a linear access scenario, e.g. reading pages, you will not really notice any improved latency. However, there will be a significant gain when your workload is doing constant random accesses, causing constant row switching. An extra bonus is that your row switch won't be delayed by the controller while a refresh is in progress. In that particular scenario, your big rear end SRAM DIMMs might provide you with roughly 2x better latency. Workloads with that type of access pattern (e.g. large-scale sparse graph analysis) are typically significantly memory-latency bound, so it might make sense for some three letter agencies with unlimited funds.

Thanks for the analysis. So you're saying that we should just fill all the space under the IHS with SRAM dies?

I bet you can fit around 4GB in a thicc chiplet. So you'd have something like a 12-layer HBM chiplet, but SRAM, and put 6 of them like some of the NV modules. Intel could try this on some of their projected multi-chiplet jigsaw-puzzle modules.

NV A100 runs 80GB of HBM https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf, which could be 10GB of SRAM. That's enough to run CS:GO. And the GPU die is bigger than your typical CPU.

Beef
Jul 26, 2004
No, it's impractical as gently caress: an order of magnitude higher power draw, heat production, physical size, production cost, etc. AMD's 3D V-Cache fits 64MB of SRAM on a chiplet. I don't know the math of how much heat 10GB of SRAM produces, but I bet you it's going to need some impressive separate cooling. Also, the applications that benefit most from external SRAM modules are also the applications doing random access into data sets larger than, e.g., that GPU's 80GB of HBM.
Plus, you're swapping a ~120-cycle load latency for a ~70-cycle load latency in a specific scenario, for 10-100x the cost. The cycles you save by using SRAM cost you dearly, and you don't want to then give them back by putting it off-chip. If you have tons of money to burn, there are more effective ways to deal with memory latency.
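To get a feel for what that ~120-to-~70-cycle trade is worth at the application level, here is a simple Amdahl-style estimate; the memory-stall fractions are assumptions for illustration.

code:
# Rough Amdahl-style estimate of cutting load latency from ~120 to ~70 cycles.
# The fraction of runtime spent stalled on these loads is assumed, not measured.
old_latency, new_latency = 120, 70
for stall_fraction in (0.2, 0.5, 0.8):
    new_time = (1 - stall_fraction) + stall_fraction * (new_latency / old_latency)
    print(f"{stall_fraction:.0%} memory-bound -> {1 / new_time:.2f}x overall speedup")
# 20% -> 1.09x, 50% -> 1.26x, 80% -> 1.50x; a modest win for a 10-100x cost increase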

And now I'm going :goonsay: on this thing, as this is completely my poo poo. Architecture chat!


Memory latencies are not going to improve. We're fully in the 'memory wall' future and there is no magic tech in the pipeline to solve this.

There are three strategies for processors to deal with the memory wall: reduce, amortize or tolerate latency.
1) Registers, scratchpads and caches reduce latency by pulling in what you will need soon, or repeatedly, and keeping it close by in fast SRAM.
2) Amortize means pulling in more for each access, going wider: larger cache lines, larger page sizes, wider interfaces (HBM, GDDR, DDR5).
3) Tolerate latency means doing something else while you wait for your access. GPUs do that to some extent with warp/SMT scheduling, CPUs do it to some extent with large out-of-order execution windows and HyperThreading.

The way CPUs have been doing 1), 2) and 3) is in this gentle incremental way, without needing to rewrite your application too much. And that kind of works well, if your application can make use of it. That is, you have a nicely predictable control flow with nicely predictable memory access patterns that the hardware can exploit. (That's why data-oriented programming has become so popular in the video games industry this last decade.)

If your algorithm is fundamentally doing pointer chasing (e.g. large graph analysis), you're poo poo out of luck: most of the silicon on your CPU is at best just a waste of energy. For example, visiting a graph node means causing a cache miss, pulling in an entire cache line from a different NUMA node, using 4B of that line once, and then keeping that poor line around until it gets evicted. Your OoO does jack poo poo because there's a data hazard between your loads: you need the result of the previous load to know what to load next.
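A tiny model of why that data hazard hurts so much: independent misses can be overlapped by the out-of-order core, dependent ones cannot. The latency and overlap numbers below are assumptions for illustration only.

code:
# Independent (streaming) misses overlap; dependent (pointer-chasing) misses serialize.
# Both numbers below are illustrative assumptions.
N_LOADS      = 1_000_000
MISS_LATENCY = 120   # cycles per cache-missing load (assumed)
MAX_OVERLAP  = 10    # misses the core can keep in flight when loads are independent (assumed)

independent_cycles = N_LOADS * MISS_LATENCY / MAX_OVERLAP   # limited by memory-level parallelism
dependent_cycles   = N_LOADS * MISS_LATENCY                 # each load waits for the previous one
print(f"pointer chasing is ~{dependent_cycles / independent_cycles:.0f}x slower here")   # ~10x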

For those types of applications, you're talking 100x speedups and more if you use a CPU and memory architecture that goes all-in on 3). That's what Tera (now Cray) did with the XMT, and that's what Intel is prototyping with PIUMA (https://arxiv.org/abs/2010.06277).

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib
There are still interesting changes that could be implemented with the move away from monolithic dies. TSMC's die stacking tech can go up to something like 8 layers. Why not make a separate chiplet like that to serve as the L3 for your CPU (512MB if each layer is 64MB), then take all silicon that was L3 on the logic die(s) and turn it into more L2 and L1.

IBM are doing some pretty wild stuff with caches for their Telum CPUs, and there will be Sapphire Rapids SKUs that have 64GB of on-package HBM as L4 cache. Compared to the 4 core/8 thread stagnation we had for much of the 2010s, we're in a period of really interesting architectural changes.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Beef posted:

Memory latencies are not going to improve. We're fully in the 'memory wall' future and there is no magic tech in the pipeline to solve this.

There are three strategies for processors to deal with the memory wall: reduce, amortize or tolerate latency.
1) Registers, scratchpads and caches reduce latency by pulling in what you will need soon, or repeatedly, and keeping it close by in fast SRAM.
2) Amortize means pulling in more for each access, going wider: larger cache lines, larger page sizes, wider interfaces (HBM, GDDR, DDR5).
3) Tolerate latency means doing something else while you wait for your access. GPUs do that to some extent with warp/SMT scheduling, CPUs do it to some extent with large out-of-order execution windows and HyperThreading.

DDR5 is actually not (2). Clock speed is up, but per-channel bandwidth is worse: the DIMM's 64 data pins are no longer a single channel, they're two fully independent 32-bit channels, each with its own set of command/address/chip select pins. Burst length has doubled from 8 to 16 to preserve one burst moving 64 bytes, as that's the cache line size for a lot of CPUs, including nearly all x86.

As I understand it, the point of this is kind of 3, but not quite. It's really happening due to the reality that mass market consumer gear now features 8-core (and more) CPUs. Lots of active cores generate lots of memory requests, which demands scaling up the number of memory controller channels just to increase the number of transactions in flight so nobody's starved. If you don't want to pay for more MC channels with greater datapath width, you decrease channel width.

Another way of putting it: there are important multicore loads where you have N threads, each of which is generating reasonably temporally local access patterns. If the system doesn't support at least N open DRAM pages, there will be lots of page open/close thrashing, destroying temporal locality and tanking memory performance.

For those not familiar: opening a DRAM page means reading an entire row of data from the 2D DRAM array and storing its contents in a scratch SRAM buffer. Data transfers to and from a DDRn memory chip actually interact with these scratch buffers, rather than directly accessing the DRAM. When you see anything referring to RAS latency, that's talking about the time it takes to send the DRAM chip a command to open a page (read a row into a buffer). This prepares that page for future column (CAS) accesses, which is when data is sent to or from the CPU. Column accesses inside a page are much, much faster than row accesses loading or storing pages because they take place entirely inside SRAM, which is why it's important to reduce the frequency of opening and closing pages.
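For reference, the burst-length arithmetic behind the DDR5 point above works out like this; only the widths and burst lengths come from the post, the rest is bookkeeping.

code:
# One burst still moves one 64-byte cache line on both DDR4 and DDR5.
def bytes_per_burst(channel_width_bits, burst_length):
    return channel_width_bits // 8 * burst_length

ddr4 = bytes_per_burst(channel_width_bits=64, burst_length=8)    # one 64-bit channel, BL8
ddr5 = bytes_per_burst(channel_width_bits=32, burst_length=16)   # two independent 32-bit channels per DIMM, BL16
print(ddr4, ddr5)   # 64 64 -- same line per burst, but twice as many independent channels per DIMM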

Beef
Jul 26, 2004

ConanTheLibrarian posted:

There are still interesting changes that could be implemented with the move away from monolithic dies. TSMC's die stacking tech can go up to something like 8 layers. Why not make a separate chiplet like that to serve as the L3 for your CPU (512MB if each layer is 64MB), then take all silicon that was L3 on the logic die(s) and turn it into more L2 and L1.

There are some natural limitations on the size of L1 and L2; that's mostly why you do not see them differ that much across modern CPUs. You only really see L1/L2 D$ increases when the cache line size increases, which I think IBM is doing for Power-9. However, CL size has its tradeoffs: the larger it is, the less efficient your cache is. I do not think we're going to see that 64B CL grow on Intel x86 CPUs soon.

Stacking also gives you some interesting thermal problems. My bet is that we're stuck on +1 layer if we're talking about logic and memory; logic-on-memory designs, basically. If we do see multiple layers, it will probably be for extra metal/communication layers and heat-dissipation layers for some wacky thermal management. You might also see some truly dark silicon types of circuits find their way onto separate layers, like embedded ASICs that are only used occasionally.

Beef
Jul 26, 2004

BobHoward posted:

DDR5 is actually not (2).

Good point! It is not wider as in a wider bus, but it's a wider interface because of more busses/channels/open pages/command buffers etc.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Beef posted:

There are some natural limitations on the size of L1 and L2; that's mostly why you do not see them differ that much across modern CPUs. You only really see L1/L2 D$ increases when the cache line size increases, which I think IBM is doing for Power-9. However, CL size has its tradeoffs: the larger it is, the less efficient your cache is. I do not think we're going to see that 64B CL grow on Intel x86 CPUs soon.

Another datapoint to illustrate this: Apple M1 performance core L1 caches. They're enormous (192KiB I + 128KiB D), the line size is 128B, and it's likely that the major tradeoff Apple's designers accepted to get such huge caches was a frequency limit. That was OK in Apple's context, since they like to build ultra-wide low clock rate microarchitectures, but that uarch design style doesn't work as well in x86 land. (It's far, far harder for x86 front ends to feed their cores with lots of decoded instructions per clock - x86 decode is not fun.)

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

BobHoward posted:

(It's far, far harder for x86 front ends to feed their cores with lots of decoded instructions per clock - x86 decode is not fun.)

Apparently Apple was heavily involved in designing the aarch64/armv8 ISA (actually it was supposedly the other way around: Apple didn't just work with ARM to design it, Apple actually designed it and got it approved by ARM), and one of their major goals was to make decoding and reordering easier so they could run a very wide frontend and a very deep reorder buffer, and that's one of the reasons they've been able to absolutely stomp x86 in IPC. If you look at Anandtech's SPEC measurements and the clocks, they're doing almost quadruple the IPC of x86 in some benchmarks.

(Instruction density is a thing, but not that much of a thing… the RISC-V folks did a paper when they did their design and across the SPEC suite ARMv8 is only about 12% larger than the x86 equivalent)

Paul MaudDib fucked around with this message at 14:06 on Apr 28, 2022
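For readers wondering how an IPC figure falls out of SPEC numbers: it's just score normalized by clock. The inputs below are made-up placeholders, not Anandtech's published results.

code:
# "IPC" comparisons like the one above are really per-clock throughput: score / clock.
# The scores and clocks here are placeholders, not Anandtech's measurements.
def perf_per_ghz(score, clock_ghz):
    return score / clock_ghz

wide_low_clock    = perf_per_ghz(score=10.0, clock_ghz=3.2)   # hypothetical M1-style design
narrow_high_clock = perf_per_ghz(score=9.0, clock_ghz=5.0)    # hypothetical x86-style design
print(f"~{wide_low_clock / narrow_high_clock:.1f}x higher per-clock throughput")   # ~1.7x with these inputs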

BlankSystemDaemon
Mar 13, 2009



Paul MaudDib posted:

Apparently Apple was heavily involved in designing the aarch64/armv8 ISA (actually it was supposedly the other way around: Apple didn't just work with ARM to design it, Apple actually designed it and got it approved by ARM), and one of their major goals was to make decoding and reordering easier so they could run a very wide frontend and a very deep reorder buffer, and that's one of the reasons they've been able to absolutely stomp x86 in IPC. If you look at Anandtech's SPEC measurements and the clocks, they're doing almost quadruple the IPC of x86 in some benchmarks.

(Instruction density is a thing, but not that much of a thing… the RISC-V folks did a paper when they did their design and across the SPEC suite ARMv8 is only about 12% larger than the x86 equivalent)
According to who?

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

BlankSystemDaemon posted:

According to who?

A former Apple silicon engineer?

BlankSystemDaemon
Mar 13, 2009



Paul MaudDib posted:

A former Apple silicon engineer?
Whom you know personally, and who's also your uncle? :v:

mdxi
Mar 13, 2006

to JERK OFF is to be close to GOD... only with SPURTING

BlankSystemDaemon posted:

Whom you know personally, and who's also your uncle? :v:

There was a really nice and interesting conversation going on here before you decided to wade in and BOFH it up.

If you've got countervailing technical or historical information, then say so.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
My bad was a former Apple kernel engineer not silicon

https://9to5mac.com/2021/01/05/magic-of-m1-mac-began-10-years-ago/amp/

JawnV6
Jul 4, 2004

So hot ...

BobHoward posted:

(It's far, far harder for x86 front ends to feed their cores with lots of decoded instructions per clock - x86 decode is not fun.)

yeah 'cause they're not trying hard enough

BlankSystemDaemon
Mar 13, 2009



mdxi posted:

There was a really nice and interesting conversation going on here before you decided to wade in and BOFH it up.

If you've got countervailing technical or historical information, then say so.
Okay, apparently asking for the source is not allowed anymore, so I had to go dig it up myself.
Not that I have any reason to distrust Shac, but it'd be interesting with an independent verification that goes into more detail - I'm just not sure that's going to happen due to Apple core IP being involved.

EDIT: Semi-beaten, because I left the page open with a reply half-written.

EDIT2: Also, AppliedMicro's X-Gene from 2011 was ARMv8, so I'm curious what definition of Aarch64 Shac is using.
The one I'm familiar with is from FreeBSD, where it means ARMv8 (and distinguishes it from ARMv5, v6, and v7, because those are all 32bit) - that nomenclature descends from LLVM, where it's also called Aarch64, and since that's an Apple thing, it doesn't seem to me like they're different? Especially considering there isn't and has never been any 64bit ARM target that isn't Aarch64 in LLVM.
X-Gene 3 was being sampled as late as 2017, although that was after AppliedMicro had been acquired by MACOM, only to be resold the same year.

EDIT3: Apple also has a longer history of doing custom ARM cores, with the A6 integrating ARMv7 plus some things not found in the v7 spec like VFPv4, so I wouldn't be completely shocked to learn that the M1 is more custom silicon than it has anything to do with ARM.

EDIT4: Cyclone was very much an ARMv8 chip, in the full sense - implementing both Aarch32 and Aarch64.
Given that that article's from 2013, it kinda doesn't make sense to say that Apple alone was responsible for ARMv8, especially as AppliedMicro were demoing a core on FPGA back in 2011, and since Samsung's first ARMv8 chip was only a year later it doesn't seem likely they did the entire thing for themselves.

BlankSystemDaemon fucked around with this message at 20:04 on Apr 28, 2022

BlankSystemDaemon
Mar 13, 2009



To bring this back in line, does anyone know of a breakdown similar to this but for Alder Lake on Windows/Linux?

New Zealand can eat me
Aug 29, 2008

:matters:


This archived blog post is the best I could find

I have no idea how good or accurate it is, and seems to start at a lower level, but there are some sources cited at the end

BlankSystemDaemon
Mar 13, 2009



New Zealand can eat me posted:

This archived blog post is the best I could find

I have no idea how good or accurate it is, and seems to start at a lower level, but there are some sources cited at the end
That's at least a good starting point for keeping an eye out for programmatic interfaces in the architecture software development manuals.

It'll be interesting to see whether it's even possible for developers to optimize the heuristics mentioned, or whether the manuals will just dictate that "this is how it'll be done" and all scheduler code will end up very similar.

New Zealand can eat me
Aug 29, 2008

:matters:


Did Intel ever put out any Thread Director patches for Linux? I remember the last Phoronix shootout had W11 beating the poo poo out of Linux because of that + maybe missing cpu freq/power management too?

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

I think you got the wrong impression from Shac Ron's tweets and went down some rabbit holes.

The key claim: "Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes."

Designing an ISA - the instruction set architecture - is a different task than designing an implementation of that ISA. The ISA specification is essentially a hardware/software interface contract; a successful ISA spec allows binary programs developed on any spec compliant implementation to run without changes on any other.

For best results, the ISA design process should be informed by expert implementors. My takeaway from these tweets is that, if true, as the party paying for the work to be done, Apple's CPU design team would've provided both overall goals for the ISA and feedback on draft versions of the spec. (Arm's ISA designers naturally would've been getting internal and external feedback from other sources, too.)

I don't know how true Shac Ron's claims are, but I have seen similar unverifiable claims that AArch64 was heavily Apple influenced from similar sources.

By the way, while it's true that Applied Micro launched X-Gene 1 in 2012 before Apple shipped A7 in 2013, X-Gene was one of the most paper launches of all time. You will be hard pressed to find evidence that X-Gene 1 and 2 had any significant design wins in any year, ever. (If WikiChip is to be believed, the sales total for both generations was just 25K units.)

BlankSystemDaemon
Mar 13, 2009



New Zealand can eat me posted:

Did Intel ever put out any Thread Director patches for Linux? I remember the last Phoronix shootout had W11 beating the poo poo out of Linux because of that + maybe missing cpu freq/power management too?
Unless it's dual-licensed with something that's also copy-free, that won't really benefit anyone but Linux.
It's also absolutely the sort of thing that their well-published software development manuals are meant for, since it isn't up to Intel to implement it and I don't believe they employ kernel scheduler developers.

I'm also not sure it's worth mentioning, but don't trust any numbers from Phoronix - the benchmark suite is filled with both statistical issues and technical ones (which have been pointed out to Michael many times), and often his conclusions based on the benchmark have almost nothing to do with the benchmarks in question.

EDIT: To provide a specific example, Michael attributes an improvement in fourier transformations and a database search to "the Intel hardware P-States improvements with FreeBSD 13 kernel and/or other power management improvements".
Needless to say, hwpstate_intel(4) may make a difference, but unless you're testing that as the only variable (ie. setting it between 0 and 100), you can't make declarations about whether that's got anything to do with it or not.

BobHoward posted:

I think you got the wrong impression from Shac Ron's tweets and went down some rabbit holes.

The key claim: "Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes."

Designing an ISA - the instruction set architecture - is a different task than designing an implementation of that ISA. The ISA specification is essentially a hardware/software interface contract; a successful ISA spec allows binary programs developed on any spec compliant implementation to run without changes on any other.

For best results, the ISA design process should be informed by expert implementors. My takeaway from these tweets is that, if true, as the party paying for the work to be done, Apple's CPU design team would've provided both overall goals for the ISA and feedback on draft versions of the spec. (Arm's ISA designers naturally would've been getting internal and external feedback from other sources, too.)

I don't know how true Shac Ron's claims are, but I have seen similar unverifiable claims that AArch64 was heavily Apple influenced from similar sources.

By the way, while it's true that Applied Micro launched X-Gene 1 in 2012 before Apple shipped A7 in 2013, X-Gene was one of the most paper launches of all time. You will be hard pressed to find evidence that X-Gene 1 and 2 had any significant design wins in any year, ever. (If WikiChip is to be believed, the sales total for both generations was just 25K units.)
Yeah, it's a good point.

It'd be neat if someone sat down with some of the people involved in the entire process and wrote a book about it and all the war stories.
I'd read the absolute poo poo out of it.

BlankSystemDaemon fucked around with this message at 22:13 on Apr 28, 2022

JawnV6
Jul 4, 2004

So hot ...

BobHoward posted:

I think you got the wrong impression from Shac Ron's tweets and went down some rabbit holes.

i used to eat lunch with that guy

New Zealand can eat me
Aug 29, 2008

:matters:


I didn't mean to imply that the patches would give you source code to read, more so that if they've actually implemented it on that side of things, that seems like a much more accessible point for trying to work backwards from and figure out what it's actually doing. Was thinking about the M1 article and how nice their profiling tools are compared to Windows

BlankSystemDaemon posted:

I'm also not sure it's worth mentioning, but don't trust any numbers from Phoronix - the benchmark suite is filled with both statistical issues and technical ones (which have been pointed out to Michael many times), and often his conclusions based on the benchmark have almost nothing to do with the benchmarks in question.

Yep, well aware. He makes a LOT of sloppy mistakes and doesn't care. Everything he does is optimized for maximal blog spam. That's not to say some posts aren't useful, but it's purely from a "copy pasting relevant parts of the press release" perspective. If his speculation was as accurate as he acts like it is, he wouldn't be posting about it online for a living

Thank you for reminding me though, I've gotten so used to filtering through the crap that I almost forgot it existed

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib

Beef posted:

There are some natural limitations on the size of L1 and L2; that's mostly why you do not see them differ that much across modern CPUs. You only really see L1/L2 D$ increases when the cache line size increases, which I think IBM is doing for Power-9. However, CL size has its tradeoffs: the larger it is, the less efficient your cache is.
I ended up deleting a line that said that there would be trade-offs since I thought that much would be obvious. Despite that, Telum has 32MB of L2 per core. Sure it's probably slower than L2 on an x64 CPU, but I'd be surprised if it was slower than AMD's 32MB of L3 cache per 8 core CCD.

JawnV6 posted:

i used to eat lunch with that guy
So is he worth listening to?

cerious
Aug 18, 2010

:dukedog:
https://www.anandtech.com/show/17366/intel-meteor-lake-client-soc-up-and-running

Meteor Lake powered on, but still a year away at least. Should be a pretty sweet laptop part!

Happy_Misanthrope
Aug 3, 2007

"I wanted to kill you, go to your funeral, and anyone who showed up to mourn you, I wanted to kill them too."

Paul MaudDib posted:

Apparently Apple was heavily involved in designing the aarch64/armv8 ISA (actually it was supposedly the other way around: Apple didn't just work with ARM to design it, Apple actually designed it and got it approved by ARM), and one of their major goals was to make decoding and reordering easier so they could run a very wide frontend and a very deep reorder buffer, and that's one of the reasons they've been able to absolutely stomp x86 in IPC. If you look at Anandtech's SPEC measurements and the clocks, they're doing almost quadruple the IPC of x86 in some benchmarks.

(Instruction density is a thing, but not that much of a thing… the RISC-V folks did a paper when they did their design and across the SPEC suite ARMv8 is only about 12% larger than the x86 equivalent)

There are many differing aspects of the open x86 market vs Apple's relatively closed ecosystem which prevent an Intel chip constructed like the M1 Ultra from currently being viable in the x86 world, but when reading the original Anandtech piece on the M1, my takeaway was that this 'bottleneck' of the x86 architecture is a very, very difficult thing to solve with the current ISA. Like there are many areas for improvement no doubt, but the kind of perf/watt we're seeing from the M1 family might be incredibly difficult to achieve on x86 because of this, even if processes were equalized. Was this your take as well?

Happy_Misanthrope fucked around with this message at 00:20 on May 1, 2022

BlankSystemDaemon
Mar 13, 2009



Happy_Misanthrope posted:

There are many differing aspects of the open x86 market vs Apple's relatively closed ecosystem which prevent an Intel chip constructed like the M1 Ultra from currently being viable in the x86 world, but when reading the original Anandtech piece on the M1, my takeaway was that this 'bottleneck' of the x86 architecture is a very, very difficult thing to solve with the current ISA. Like there are many areas for improvement no doubt, but the kind of perf/watt we're seeing from the M1 family might be incredibly difficult to achieve on x86 because of this, even if processes were equalized. Was this your take as well?
I believe it's pretty well-accepted that x86-adjacent architectures will never get to the performance/watt numbers of ARM without breaking with tradition and getting rid of a lot of compatibility.

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib

BlankSystemDaemon posted:

I believe it's pretty well-accepted that x86-adjacent architectures will never get to the performance/watt numbers of ARM without breaking with tradition and getting rid of a lot of compatibility.

We haven't seen ARM CPUs clock as high as x86 ones. Is that an intrinsic property of ARM or could we get to a point where their CPUs are running at the same wattage as x86 but getting a whole lot more work done? Intel can't keep pushing up CPU power indefinitely, so once they hit the limit of what the market will put up with, it seems like the performance crown could shift to ARM designs.

Indiana_Krom
Jun 18, 2007
Net Slacker
There is a lot that goes into the architecture that determines the final clock speeds; one of the ways x86 gets to 5+ GHz is by making sure it does as little work as possible each "tick". The more you try to do in a single clock cycle, the longer it takes to do it and the longer the clock cycle must be to accommodate the workload. ARM is optimized for a different spot on that curve, so it does more work per clock cycle but as a result cannot reach as high of a clock speed. There is a lot of stuff that goes on way down in the nuts and bolts of architecture involving pipelines, parallelism and latency that is honestly beyond my ability to explain in a forum post. Probably the best "at a glance" metric is to look at the number of pipeline stages: the ARM10 had 6 pipeline stages, Intel *lake architectures have 14 stages (some NetBurst designs reached as high as 31!). Generally speaking, the more stages you have, the less work you need to do in each stage, so the less time it takes for each stage to complete, which allows the clock speed to go higher.

So basically, even if you threw unlimited power and cooling at the current iteration of ARM architecture, it is simply not engineered to be able to reach x86 clock speeds due to the internal latency somewhere in its pipeline stages. There is some minimum amount of time required for one or more of the pipeline stages to complete their work, which sets a ceiling that no amount of throwing power at it can overcome. Say something on the order of 0.3 nanoseconds, which means no matter what you do with the energy input the processor frequency cannot exceed 3.33 GHz. x86 gets to 5+ GHz by doing less work in each stage, so every stage in the pipeline can complete its work within like 0.13 nanoseconds or something, hence 7+ GHz overclocks if you can ram enough power through one (under LN2 for instance).

At the various architecture design teams there are probably people who know what the highest latency pipeline stage is that they could use to tell you an approximation of the highest clock limit physically possible on a given chip.
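The frequency ceiling implied by that slowest-stage argument is just the reciprocal of the slowest stage delay; the sketch below only reuses the 0.3 ns and 0.13 ns examples from the post above.

code:
# Max clock is capped by the slowest pipeline stage: f_max = 1 / t_slowest_stage.
def max_clock_ghz(slowest_stage_ns):
    return 1.0 / slowest_stage_ns

print(f"{max_clock_ghz(0.30):.2f} GHz")   # ~3.33 GHz, the 0.3 ns example above
print(f"{max_clock_ghz(0.13):.2f} GHz")   # ~7.69 GHz, the 0.13 ns example above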

FuturePastNow
May 19, 2014


I think if you look at like the 9W-15W ULV notebook processors and ignore the max turbo speed they can't maintain very long, the clocks are probably a lot more comparable to ARM CPUs in the same power range.

BlankSystemDaemon
Mar 13, 2009



ConanTheLibrarian posted:

We haven't seen ARM CPUs clock as high as x86 ones. Is that an intrinsic property of ARM or could we get to a point where their CPUs are running at the same wattage as x86 but getting a whole lot more work done? Intel can't keep pushing up CPU power indefinitely, so once they hit the limit of what the market will put up with, it seems like the performance crown could shift to ARM designs.
There isn't really that big of a clock difference at the datacenter scale, which is where the majority of x86-adjacent CPUs get sold.

High core count Xeons like the 3rd-gen Platinum 8380 have 40 cores and a base clock of 2.3GHz and (can temporarily, given thermal headroom) boost to 3.4GHz, while high core count third party ARM server CPUs like the A64FX (found in the world's fastest supercomputer) clock at 2.2GHz, the ARM IP core known as Neoverse N1 (which underpins Amazon's foray into ARM-based butt services) clocks at 2GHz with double the number of cores of the high core count Xeon, and the Ampere Altra has 80 cores that boost up to 3.3GHz (again, probably constrained by thermal headroom, which isn't much given they're at 250W per socket), and the Ampere Altra Max is 128 cores at 3GHz boost.

EDIT: Also, while I have the utmost respect for overclockers, being able to push a CPU that's won the silicon lottery to some 8GHz but with only 4 cores, and only one of the cores being able to achieve the highest speed with some absolutely ludicrous level of cooling, isn't saying much about the performance of the CPU or anything else (hint: it doesn't need to run stable, it needs to be stable enough to run the benchmark).
Similarly, the all-core 5.2GHz octocore CPUs you see on the BlackCore HFT systems Intel ship also have the advantage of having the pick of the litter for CPUs, as Intel gets to choose which CPUs to use for those even before sending product SKUs out, so that doesn't really factor in either (hint: if you ever see a BlackCore system, it'll likely be as it's going into the shredder, because HFT firms treat their systems the same way entities that use mainframes do; they pulp them rather than let anyone buy used gear that could in theory contain secrets).

BlankSystemDaemon fucked around with this message at 18:02 on May 1, 2022


shrike82
Jun 11, 2005

i haven't kept up with hft in years but the last i heard it was all about fpgas - i'd be surprised if a system builder using mainstream CPUs etc, albeit one targeting finance, is doing anything particularly interesting

hft kinda came and passed as a market trend - the action has been machine learning for a while, a european fund that tried to headhunt me recently talked about having a 10K+ V100/A100 server farm

  • Reply