Discussion Quorum
Dec 5, 2002
Armchair Philistine
I know, lazy use of terminology, but it's not like the USAF is letting Tom's Hardware benchmark the thing - you take what you can get :v:

Sidesaddle Cavalry
Mar 15, 2013

Oh Boy Desert Map
Benchmarking in a vacuum aside, an emergency chip demand for such a scenario would cover a lot more than just expendable MCUs - there'd be plenty of demand out there for general application processors, too.

mobby_6kl
Aug 9, 2009

by Fluffdaddy
If WWIII were to break out, I would not hesitate to send my army of Arduinos and ESPs to the front line

human garbage bag
Jan 8, 2020

by Fluffdaddy

BobHoward posted:

But why do you believe that? You still haven't given any actual reason why you think Intel chips automatically equals more TFLOPS per wafer. Much less any indication that you understand why peak TFLOPS numbers are not actually what you should be optimizing for in a super. Or any indication that you understand that you can't just decide to make Intel chips on TSMC's process, you'd have to do a costly port.

I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for it.

silence_kit
Jul 14, 2011

by the sex ghost

human garbage bag posted:

I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for it.

That trade-off, I think, is pretty classic in computer chip design, and it can be made in multiple ways at multiple levels of abstraction.

I wouldn't assume though that current computer chips are so well engineered that their design purely consists of making this trade-off. I suspect, although I don't really know--I have never worked as a computer chip designer or in any capacity for the computer chip industry--that there are ways to push the boundaries of what I'll vaguely call the 'speed x efficiency' figure of merit through better computer design.

Discussion Quorum
Dec 5, 2002
Armchair Philistine
The new king of the Top 500, Fugaku, uses 48-core ARM SoCs, and Cray is offering ARM options. Obviously we're not talking mobile chips but the potential is clearly there.

DrDork
Dec 29, 2003
commanding officer of the Army of Dorkness

silence_kit posted:

That trade-off, I think, is pretty classic in computer chip design, and it can be made in multiple ways at multiple levels of abstraction.

I wouldn't assume though that current computer chips are so well engineered that their design purely consists of making this trade-off. I suspect, although I don't really know--I have never worked as a computer chip designer or in any capacity for the computer chip industry--that there are ways to push the boundaries of what I'll vaguely call the 'speed x efficiency' figure of merit through better computer design.

The other part to this is ensuring that your hardware is tailored to the types of work you're going to be performing on it. ARM and x86/x64 use different instruction sets, those sets are suited to different types of work, and they require different sorts of code optimizations to get the most out of them.

It's really not as simple as "one being better than the other" as much as it is about using the correct tool for the job.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
x86 (and x86-64) today is translated into RISC-like micro-ops anyway, so what we're seeing is a trade-off between instruction density (which helps with instruction-cache efficiency) and a simpler ISA that is easier for the CPU's schedulers to reason about. ARM also has more efficient ways of switching between its ISA subsets, such as Thumb (which, last I remember, was supported on Apple's ARM processors). In contrast, x86's modes (real, protected, etc.) are pretty awkward.

Let's also not forget that the AltiVec vector units in older Macs were pretty solid and were used by Virginia Tech for some problems; it's just that the PowerPC CPUs couldn't hold a candle to the raw engineering might of what Intel had developed by that point. I don't really think it's fair to use the Top 500 as a barometer of architectural capability, given that literally a single instruction is sometimes used as an excuse to switch to a different processor. It's more that scientific / engineering supercomputer projects as a market are more like algorithms in search of an architecture to implement them than code that is ported to whatever architecture is supposed to be the fastest / best out there. It's a quasi-hybrid of custom-silicon-style engineering and general commercial software engineering.

BlankSystemDaemon
Mar 13, 2009




necrobobsledder posted:

it's just that the PowerPC CPUs couldn't hold a candle to Andy Glew by that point.
fixed that for you, otherwise your sentence is correct :v:

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

human garbage bag posted:

I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for it.

You need to think about this in a fundamentally different way. Today, performance depends on energy efficiency. This is because chips have become quite thermally limited, so you have to work inside a limited power budget (the TDP). The better your perf/W metric, the more performance you get out of X watts of TDP.
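
To put that in toy numbers (everything below is made up, just to show the shape of the math):
code:
#include <stdio.h>

/* Toy model: within a fixed thermal budget, sustained throughput is
 * roughly (FLOPS per watt) x (watts you can actually dissipate).
 * Both "designs" and all figures here are hypothetical. */
int main(void) {
    double tdp_watts = 125.0;                /* shared power budget   */
    double design_a_gflops_per_watt = 20.0;  /* less efficient core   */
    double design_b_gflops_per_watt = 35.0;  /* more efficient core   */

    printf("Design A sustained: %.0f GFLOPS\n", tdp_watts * design_a_gflops_per_watt);
    printf("Design B sustained: %.0f GFLOPS\n", tdp_watts * design_b_gflops_per_watt);

    /* The more efficient design wins inside the same power envelope,
     * regardless of which ISA it happens to speak. */
    return 0;
}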

What you're doing is looking at today's ARM chip designs optimized for very tiny highly mobile devices with correspondingly tiny TDPs, and wrongly concluding that ARM can never be high performance / high TDP.

This isn't so. The only thing preventing investment in such ARM cores has been the difficulty of coming up with a market for them. When nearly all the software people want to run on high performance computers is x86 software, x86 CPUs are what they buy. Doesn't matter if you've got a great ARM chip with nothing to run on it.

Apple has the advantage of controlling so much of their own software stack, and the foundation layer of an OS and application environment which has gone through many, many ports so there's a lot of structure there to make it easier. (The OS we now call macOS/iOS/iPadOS/tvOS started out its life on Motorola 68K CPUs, but had already been ported to PowerPC, i386, SPARC, and PA-RISC before Apple acquired it by buying NeXT in 1996.)

Microsoft tried to make ARM Windows a thing, but so far has failed due to a combination of their own missteps and the lack of enough good hardware to make it attractive to consumers.

It's taken a long time, and people predicting overnight sea changes years ago were all wrong, but the industry finally seems to have enough momentum behind pushing ARM up out of the mobile space for it to stick. There was never a technical limitation preventing this, it's just a chicken and egg problem where the costs of developing either the chicken or the egg are high, it doesn't make sense to do one without the other, and usually the corporations responsible for the chicken aren't responsible for the egg (and vice versa).

But in your weird hypothetical scenario, none of that applies! The command economy could disregard existing software stacks. Supercomputers tend to need lots of performance tuning work done on the software which runs on them anyway, so porting to a different architecture is frequent. And an intelligent technical director of this last-ditch effort to optimize for total FLOPS manufacturing capacity (because the FLOPS aliens demand more FLOPS???) would take a look at x86 and realize there's overhead inherent to that ISA which makes it a worse choice than ARM. (And ARM a worse choice than a VLIW ISA... this thought experiment doesn't end up making lots of ARMs either.)

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Eh, the x86 decoder is small. FLOPS gods are more likely to be mad about all those microjoules and mm going to cache.

movax
Aug 30, 2008

I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

movax posted:

I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth.

One of the big wins for Fugaku is that they went with HBM2; the flop-to-bandwidth ratio is very well balanced.

They also have their own custom interconnect off die but the jury’s still out on that.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

PCjr sidecar posted:

Eh, the x86 decoder is small. FLOPS gods are more likely to be mad about all those microjoules and mm going to cache.

Agreed. The point of even bringing it up was that while x86 overhead often gets overestimated (mostly by people still unable to move past the Great CISC vs RISC wars of the 1980s and 1990s), it's real, so if you were seriously optimizing for nothing but maximum FLOPS per wafer, you'd avoid it.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

movax posted:

I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth.

I suspect the Mac Pro is gonna be the last one to go Apple silicon. Everything else can do LPDDR, just scaled out.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

BobHoward posted:

Agreed. The point of even bringing it up was that while x86 overhead often gets overestimated (mostly by people still unable to move past the Great CISC vs RISC wars of the 1980s and 1990s), it's real, so if you were seriously optimizing for nothing but maximum FLOPS per wafer, you'd avoid it.

Sure; if we’re at that level, any instruction set is a decadent luxury.

Sidesaddle Cavalry
Mar 15, 2013

Oh Boy Desert Map
My maximum perf architecture involves a 32-bit carry-lookahead adder just iterating +1 over and over again

BlankSystemDaemon
Mar 13, 2009




Ian Cutress has posted a deep dive on Intel Lakefield over at AnandTech.

SwissArmyDruid
Feb 14, 2014

by sebmojo
Are the C4000 Atoms/Celerons/etc. sufficiently safe to use? Like, do we know that Intel addressed the underlying issue, since they basically held up all of their customers as a shield against publicity?

I got burned by the C2000 broken clock signal twice and just swore off anything with an Intel logo on it and moved work to an ARM-based NAS, but I'm back in the market for a new NAS and the one that has the features I want has a J4005.

NewFatMike
Jun 11, 2015

D. Ebdrup posted:

Ian Cutress has posted a deep dive on Intel Lakefield over at AnandTech.

Intel CPU and Platform Discussion: This is a Balls-on-Balls approach

https://i.imgur.com/PAbeTBB.png

Cygni
Nov 12, 2005

raring to post

D. Ebdrup posted:

Ian Cutress has posted a deep dive on Intel Lakefield over at AnandTech.

I was really looking forward to this part and all its new technology, but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core.

Hope they’ve got a rabbit in the hat or the Apple silicon Macs are gonna whoop it in its class, even running through Rosetta.

SwissArmyDruid
Feb 14, 2014

by sebmojo

NewFatMike posted:

Intel CPU and Platform Discussion: This is a Balls-on-Balls approach

https://i.imgur.com/PAbeTBB.png

Or, as they say across the pond, "bollocks on bollocks", which I think is especially appropriate.

BlankSystemDaemon
Mar 13, 2009




NewFatMike posted:

Intel CPU and Platform Discussion: This is a Balls-on-Balls approach

https://i.imgur.com/PAbeTBB.png
Performance is stored in the balls.

Cygni posted:

I was really looking forward to this part and all its new technology, but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core.

Hope they’ve got a rabbit in the hat or the Apple silicon Macs are gonna whoop it in its class, even running through Rosetta.
I mean, they were never going to be able to deliver high performance with just one high-performance core in a heterogeneous setup, and for reasons which are kinda self-explanatory they can't beat ARM on energy efficiency - this has always seemed like a doomed concept from Intel, if you ask me.

BlankSystemDaemon
Mar 13, 2009




Also, have you not seen the A64FX from Fujitsu?
That thing demonstrates that ARM is going places.

NewFatMike
Jun 11, 2015

Cygni posted:

I was really looking forward to this part and all its new technology, but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core.

Hope they’ve got a rabbit in the hat or the Apple silicon Macs are gonna whoop it in its class, even running through Rosetta.

Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

D. Ebdrup posted:

Also, have you not seen the A64FX from Fujitsu?
That thing demonstrates that ARM is going places.

Good year for ARM and nothing else imo.

Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!

Cygni
Nov 12, 2005

raring to post

NewFatMike posted:

Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

They do, and like most ARM implementations, they can run all of the cores simultaneously, with the scheduler balancing which threads go to big and which go to small. Although the current Apple A12Z devkits running macOS are only running with the big cores on, probably because the scheduler on the OS side isn't ready yet.

BlankSystemDaemon
Mar 13, 2009




NewFatMike posted:

Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!
https://i.imgur.com/p4SdcCC.mp4

Encrypted
Feb 25, 2016

NewFatMike posted:

Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

Yep, they've had it since the A10/iPhone 7 era with the 'efficiency cores'. Here's a deeper look into the cores on the A11/A12:
https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/2

The interesting thing was that they basically ripped out the efficiency cores and put them into the Apple Watch Series 4/5, which finally made its performance not too garbage.

movax
Aug 30, 2008

NewFatMike posted:

Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.


Good year for ARM and nothing else imo.

Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!

I actually piled a bunch of cash into $WDC recently, partially based on logic, partially based on hopes that they’ll do something with SweRV and really push RISC-V. Still think the legions of low-cost chip makers who are tired of paying ARM or Cadence for IP should be all over it.

NewFatMike
Jun 11, 2015

This is awesome reading, thanks fellas!

I wonder how much Android/ChromeOS scheduler tech is going to be useful for these Lakefield processors, since they already work with ARM big.LITTLE implementations (and how much of that is upstreamed to the mainline Linux kernel). Or if Windows on ARM will have any transferable lessons.

So much cool stuff has been happening with CPUs in the last few years that I'm waiting for the other shoe to drop and something really awesome to turn out to be a wet fart.

Encrypted
Feb 25, 2016

Perhaps, but maybe not much beyond the minimal effort required of them, especially since they are ditching Qualcomm for their own hardware.

sincx
Jul 13, 2012

furiously masturbating to anime titties
.

sincx fucked around with this message at 05:55 on Mar 23, 2021

BlankSystemDaemon
Mar 13, 2009




The biggest problem with heterogeneous CPU cores is that the scheduler in every single OS that wants to support this has to grow support for tracking energy use / scheduler quanta, and also needs an algorithm for deciding when something should be moved around.
Qualcomm is supposedly helping Microsoft implement something for Windows, but everything I've heard about it leads me to believe they're not even close to getting it right - and who knows what state Apple's macOS is in? Plus, the place where it'll really matter is smartphones, which ultimately means Linux and iOS respectively, and for reasons which are too dumb to get into, that means none of the rest of the open-source world will benefit from it.

On the off-chance that someone might not be able to conceptualize the problem, think about this:
You have a process that will take some amount of time to run on a 2GHz core and longer to run on a 1GHz core - but bringing the 2GHz core out of its deepest sleep state takes time, as does stepping it out of the power-saving modes it starts in when it's brought out of sleep. So do you run the process on the slower core and hope that it doesn't take too long, or do you bring up the faster core?
To complicate matters, some schedulers like the one in FreeBSD subdivide processes into tasks called quanta - i.e. a given process is subdivided into tasks which each thread can steal from any other thread, but there's a penalty to preemption/stealing since it effectively involves a cache invalidation (preemption is not a zero-cost thing, and getting the balance right is loving hard).
To even begin solving this problem, you need to code a system that can factor energy efficiency into the computation (instead of just how quickly something can be executed, which is what schedulers optimize for now) - and preferably you make this a runtime switch, since even a boot-time or, worse yet, a compile-time switch is just completely pointless.
And since cores aren't going to get substantially faster for the foreseeable future, you also have to ensure that this process is efficient enough that it doesn't take up considerable CPU time, even if you're working with multi-threaded processes which go from 2-4 all the way up to 512+ threads.
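
To make that decision concrete, here's a toy model of the run-it-slow-or-wake-the-big-core choice (all numbers invented; a real scheduler also has to weigh deadlines, cache affinity and preemption cost):
code:
#include <stdio.h>

/* Toy model of the wake-the-big-core-or-not decision.
 * Every figure below is made up purely for illustration. */
struct core {
    double freq_ghz;     /* crude proxy for throughput                */
    double active_watts; /* power draw while running the task         */
    double wakeup_ms;    /* time to leave deep sleep and ramp clocks  */
    double wakeup_mj;    /* energy spent on that wakeup               */
};

static void estimate(const struct core *c, double task_mcycles,
                     double *time_ms, double *energy_mj) {
    double run_ms = task_mcycles / c->freq_ghz;            /* Mcycles / GHz = ms */
    *time_ms   = c->wakeup_ms + run_ms;
    *energy_mj = c->wakeup_mj + c->active_watts * run_ms;  /* W * ms = mJ */
}

int main(void) {
    struct core little = { 1.0, 0.4, 0.0, 0.0 };  /* already awake        */
    struct core big    = { 2.0, 2.5, 3.0, 5.0 };  /* asleep, must wake up */
    double task_mcycles = 50.0;                   /* pending work         */

    double t, e;
    estimate(&little, task_mcycles, &t, &e);
    printf("little core: %5.1f ms, %5.1f mJ\n", t, e);
    estimate(&big, task_mcycles, &t, &e);
    printf("big core:    %5.1f ms, %5.1f mJ\n", t, e);
    /* The big core finishes sooner but burns more energy; picking
     * between those two columns at runtime is the scheduler's problem. */
    return 0;
}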

EDIT: And I can guarantee that I've forgotten or glossed over several other details which are big problems on their own - plus, there's only a very small number of people in programming who have a deep enough technical understanding of and interest in schedulers to actually work on them, and to add insult to injury it's one of the few places where the best debugging tools you have are flamegraphs from dtrace on long-running production systems, as that's where you'll find all the problems.

BlankSystemDaemon fucked around with this message at 09:59 on Jul 4, 2020

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
You’ve also got thrashing problems. Let’s say you boot up the big core and it churns through the task quickly, like it’s supposed to. But then you power it off, the slow core isn’t fast enough, and it starts falling behind. So you turn the big core back on and waste a bunch of energy waiting for it to boot back up. And it gets ahead of the workload again and turns back off. Etc.

Hugely tricky thing to get right even above and beyond normal scheduling.
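
The usual band-aid is some kind of hysteresis - a toy sketch (trace and thresholds invented, nothing like a real governor):
code:
#include <stdio.h>

/* Toy hysteresis rule against thrashing: only wake the big core after the
 * backlog has been high for a couple of ticks, and only park it again after
 * the backlog has stayed low for a few ticks. */
int main(void) {
    int backlog[] = { 2, 9, 9, 8, 3, 2, 2, 9, 9, 3, 2, 2 };  /* work per tick */
    int n = (int)(sizeof backlog / sizeof backlog[0]);

    const int high_mark = 6, low_mark = 4;    /* backlog thresholds          */
    const int wake_after = 2, park_after = 3; /* ticks of evidence required  */
    int big_on = 0, high_ticks = 0, low_ticks = 0;

    for (int t = 0; t < n; t++) {
        if (backlog[t] >= high_mark)     { high_ticks++; low_ticks = 0; }
        else if (backlog[t] <= low_mark) { low_ticks++;  high_ticks = 0; }

        if (!big_on && high_ticks >= wake_after) big_on = 1;  /* pay wakeup cost once */
        if (big_on  && low_ticks  >= park_after) big_on = 0;  /* park only when calm  */

        printf("t=%2d backlog=%d big core %s\n", t, backlog[t], big_on ? "on" : "off");
    }
    return 0;
}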

BlankSystemDaemon
Mar 13, 2009




Paul MaudDib posted:

You’ve also got thrashing problems. Let’s say you boot up the big core and it churns through the task quickly, like it’s supposed to. But then you power it off, the slow core isn’t fast enough, and it starts falling behind. So you turn the big core back on and waste a bunch of energy waiting for it to boot back up. And it gets ahead of the workload again and turns back off. Etc.

Hugely tricky thing to get right even above and beyond normal scheduling.
It's not up to the scheduler to deal with race conditions, though. Schedulers are part of the kernel whereas race conditions are on the programmer of the userspace program. If a kernel has race conditions, you're not going to be using it.
EDIT: To expand on this, threading is done asynchronously so there's not really a situation in which the slow core can 'fall behind' as you describe it. Or rather, if there is, someone writing their userland program hosed up in MAJOR ways.

BlankSystemDaemon fucked around with this message at 10:04 on Jul 4, 2020

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
https://twitter.com/VideoCardz/status/1279340014938329088?s=19

Stubb Dogg
Feb 16, 2007

loskat naamalle
big.LITTLE architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times for no real reason, until it turned out the manufacturer hosed up and the CPUs were reporting different capabilities for the big and little cores. So if an app was started on a big core, it would read the capability bits and enable certain code paths, but after the process got migrated to a little core that did not support those features, the app would promptly crash on an illegal instruction. So the kernel needs to virtualise CPU capabilities and present only the minimum common supported subset to userland.
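
As a toy model of that last point (feature bit names made up; the asymmetry mirrors the Exynos-style case where the little cores speak a newer ISA revision than the big ones):
code:
#include <stdio.h>

/* Toy model: the kernel intersects the capability bits of every online core
 * and only advertises the intersection to userland. */

#define CAP_BASELINE    (1u << 0)
#define CAP_NEW_ATOMICS (1u << 1)

int main(void) {
    unsigned big_caps    = CAP_BASELINE;                    /* older-ISA big core    */
    unsigned little_caps = CAP_BASELINE | CAP_NEW_ATOMICS;  /* newer-ISA little core */

    unsigned advertised = big_caps & little_caps;  /* least common denominator */

    printf("advertise new atomics to userland? %s\n",
           (advertised & CAP_NEW_ATOMICS) ? "yes (unsafe on this SoC)" : "no (safe)");
    return 0;
}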

BlankSystemDaemon
Mar 13, 2009




Stubb Dogg posted:

big.LITTLE architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times for no real reason, until it turned out the manufacturer hosed up and the CPUs were reporting different capabilities for the big and little cores. So if an app was started on a big core, it would read the capability bits and enable certain code paths, but after the process got migrated to a little core that did not support those features, the app would promptly crash on an illegal instruction. So the kernel needs to virtualise CPU capabilities and present only the minimum common supported subset to userland.
Heterogeneous cores don't necessarily have to have the same capabilities, for what it's worth - it just means that applications which want to take advantage of the high-performance cores should be written so that their high-performance threads communicate asynchronously with the main thread.
But you're absolutely right, just like the scheduler optimizations, it's an enormous complication that's laid entirely at the feet of the individual developers, even if libraries are developed for it.

Also, if you don't wanna type out heterogeneous something-or-other, just use the abbreviation HMP.

Potato Salad
Oct 23, 2014

nobody cares


Discussion Quorum posted:

For scale, the upgraded CPU in the F-35 advertises a Dhrystone of 2900 which puts it on par with a Raspberry Pi 3B+ (yeah I know it likely kills the Pi in other capabilities but we're talking about ~FLOPS~).

https://www.l3commercialaviation.com/avionics/products/high-performance-icp/

You haven't played around with the phased array radar kit for raspberry pi 3?

Adbot
ADBOT LOVES YOU

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

D. Ebdrup posted:

The biggest problem with heterogeneous CPU cores is that the scheduler in every single OS that wants to support this has to grow support for tracking energy use / scheduler quanta, and also needs an algorithm for deciding when something should be moved around.
Qualcomm is supposedly helping Microsoft implement something for Windows, but everything I've heard about it leads me to believe they're not even close to getting it right - and who knows what state Apple's macOS is in? Plus, the place where it'll really matter is smartphones, which ultimately means Linux and iOS respectively, and for reasons which are too dumb to get into, that means none of the rest of the open-source world will benefit from it.

On the state of macOS: iOS already supports it, and that means macOS can too. iOS began life as a fork of macOS, and both still build a lot of components from the same source code. That includes the XNU kernel. While they likely haven't turned on the code paths for AMP support in Intel Mac XNU kernel builds, obviously they've got a relatively easy path to get there for ARM Macs. (AMP = asymmetric multiprocessing, Apple's chosen terminology for this)

The open source world can technically benefit, as XNU is an open-source kernel. The actual code probably isn't too useful, though, due to a mix of license incompatibility and just being too different. My understanding is that the Linux scheduler algorithm isn't very much like anything else, which would likely make it difficult to apply AMP modifications from a very different scheduler design.

Even the BSDs (which, frankly, don't matter any more) aren't likely to benefit much. XNU is weird in its own way, it's a highly modified mashup of Mach and BSD kernel code. The last time they synced any of the BSD bits up with a mainline BSD was probably before Mac OS X 10.0 in 2001, but more importantly the XNU scheduler is descended from Mach rather than BSD. (The mashup is roughly Mach for scheduler + VM, BSD for traditional UNIX syscalls, and custom NeXT/Apple bits for I/O.)

Stubb Dogg posted:

big.LITTLE architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times for no real reason, until it turned out the manufacturer hosed up and the CPUs were reporting different capabilities for the big and little cores. So if an app was started on a big core, it would read the capability bits and enable certain code paths, but after the process got migrated to a little core that did not support those features, the app would promptly crash on an illegal instruction. So the kernel needs to virtualise CPU capabilities and present only the minimum common supported subset to userland.

That was Samsung, and it wasn't merely a reporting fuckup. In the Exynos 9810, they chose to use their in-house ARMv8.0 big core paired with a Cortex-A55 ARMv8.2 LITTLE core. Among other things, ARMv8.1 onwards added some new atomic instructions, so the cores weren't just asymmetric in performance, they were asymmetric in capability.

The reporting mistake was not that they reported different capabilities. Technically, reporting different capabilities would be correct, but in practice it could never work. A common technique used by software which wants to take advantage of optional new features in CPUs is to query CPU feature support once on launch, set global variables / function pointers / whatnot to direct future code execution down the right paths, and never query CPU capabilities again. If such a program queries on the more capable CPU and then gets switched to the less capable, bad things are gonna happen.
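
A minimal sketch of that query-once pattern, assuming an aarch64 Linux target where the ARMv8.1 atomics show up in getauxval(AT_HWCAP) as HWCAP_ATOMICS (the fast path is stubbed out; the point is only that the check is cached at launch and never repeated):
code:
#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)  /* fallback define: aarch64 hwcap bit for LSE atomics */
#endif

static int have_lse;  /* cached once at launch, steers code paths forever after */

static void add_relaxed(long *p, long v) {
    if (have_lse) {
        /* a real build would dispatch to an LSE (LDADD) implementation here;
         * this sketch just reuses the generic builtin */
        __atomic_fetch_add(p, v, __ATOMIC_RELAXED);
    } else {
        /* generic fallback every core supports */
        __atomic_fetch_add(p, v, __ATOMIC_RELAXED);
    }
}

int main(void) {
    have_lse = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;  /* queried exactly once */

    long counter = 0;
    add_relaxed(&counter, 1);
    printf("LSE reported: %d, counter: %ld\n", have_lse, counter);
    /* If the process later migrates to a core without the feature, nothing
     * ever re-checks have_lse - which is why the mixed-ISA setup breaks. */
    return 0;
}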

So, if you're writing OS support code for that Exynos, you want to make sure the least-common-denominator feature set is what gets reported, even on the CPUs which can do more. That has a chance of working; your application authors have to do extra things to gently caress up.

In fact, apparently the Android Linux kernel already should have been doing just that. But, for reasons best known to themselves, Samsung chose to write a patch which overrode the default behavior and reported that all CPU cores supported the new ARMv8.1 atomic instructions.

So, there were poor choices all around! Both hardware and software were hosed.

It is a safe bet that Apple will not have such problems, because their performance and efficiency cores are both custom Apple designs and they can make sure that both support exactly the same ISA, which is what you really want.
