|
I know, lazy use of terminology, but it's not like the USAF is letting Tom's Hardware benchmark the thing - you take what you can get
|
# ? Jun 24, 2020 02:39 |
|
|
Benchmarking in a vacuum aside, an emergency chip demand for such a scenario would cover a lot more than just expendable MCUs - there'd be plenty of demand out there for general application processors, too.
|
# ? Jun 24, 2020 05:21 |
|
If WWIII were to break out, I would not hesitate to send my army of Arduinos and ESPs to the front line
|
# ? Jun 24, 2020 10:07 |
|
BobHoward posted:But why do you believe that? You still haven't given any actual reason why you think Intel chips automatically equals more TFLOPS per wafer. Much less any indication that you understand why peak TFLOPS numbers are not actually what you should be optimizing for in a super. Or any indication that you understand that you can't just decide to make Intel chips on TSMC's process, you'd have to do a costly port.

I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for that energy efficiency.
|
# ? Jun 24, 2020 13:34 |
|
human garbage bag posted:I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for that energy efficiency.

That trade-off I think is pretty classic in computer chip design, and there are multiple ways at multiple levels of abstraction to perform it. I wouldn't assume, though, that current computer chips are so well engineered that their design purely consists of making this trade-off. I suspect, although I don't really know - I have never worked as a computer chip designer or in any capacity for the computer chip industry - that there are ways to push the boundaries of what I'll vaguely call the 'speed x efficiency' figure of merit through better chip design.
|
# ? Jun 24, 2020 15:16 |
|
The new king of the Top 500, Fugaku, uses 48-core ARM SoCs, and Cray is offering ARM options. Obviously we're not talking mobile chips but the potential is clearly there.
|
# ? Jun 24, 2020 15:52 |
|
silence_kit posted:That trade-off I think is pretty classic in computer chip design, and there are multiple ways at multiple levels of abstraction to perform this trade-off. The other part to this is ensuring that your hardware is tailored to the types of work that you're going to be performing on it. ARM vs x86/x64 use different instruction sets, and those sets perform well addressing different types of work, and require different sorts of code optimizations to get the most out of them. It's really not as simple as "one being better than the other" as much as it is about using the correct tool for the job.
|
# ? Jun 24, 2020 16:04 |
|
x86 (and x86-64) today is translated into RISC-like micro-ops anyway, so what we're seeing is a trade-off between instruction density (which helps instruction cache efficiency) and a simpler ISA that's easier for the CPU's schedulers to reason about. ARM also has more efficient ways of switching between its ISA subsets, such as Thumb (which, last I remember, was supported on Apple's ARM processors). In contrast, x86's modes (real, protected, etc.) are pretty awkward. Let's also not forget that the AltiVec units in older Macs were pretty solid and were used by Virginia Tech for some problems; it's just that the PowerPC CPUs just couldn't hold a candle to the raw engineering might of what Intel had developed by that point. I don't really think it's fair to use the Top 500 as a barometer of architectural capability, given that literally single instructions are sometimes used as an excuse to switch to a different processor. It's more that scientific / engineering supercomputer projects as a market are algorithms in search of an architecture to implement them, rather than code that is ported to whatever architecture is supposed to be the fastest / best out there. It's a quasi-hybrid of custom silicon style engineering and general commercial software engineering.
|
# ? Jun 24, 2020 18:27 |
necrobobsledder posted:it's just that the PowerPC CPUs just couldn't hold a candle to Andy Glew by that point.
|
|
# ? Jun 24, 2020 18:40 |
|
human garbage bag posted:I just assumed that Intel CPUs were faster than ARM CPUs simply because ARM CPUs are better optimized for energy efficiency, which means they sacrificed some computing power for that energy efficiency.

You need to think about this in a fundamentally different way. Today, performance depends on energy efficiency. This is because chips have become quite thermally limited, so you have to work inside a limited power budget (the TDP). The better your perf/W metric, the more performance you get out of X watts of TDP.

What you're doing is looking at today's ARM chip designs optimized for very tiny, highly mobile devices with correspondingly tiny TDPs, and wrongly concluding that ARM can never be high performance / high TDP. This isn't so. The only thing preventing investment in such ARM cores has been the difficulty of coming up with a market for them. When nearly all the software people want to run on high performance computers is x86 software, x86 CPUs are what they buy. Doesn't matter if you've got a great ARM chip with nothing to run on it.

Apple has the advantage of controlling so much of their own software stack, and the foundation layer of an OS and application environment which has gone through many, many ports, so there's a lot of structure there to make it easier. (The OS we now call macOS/iOS/iPadOS/tvOS started out its life on Motorola 68K CPUs, but had already been ported to PowerPC, i386, SPARC, and PA-RISC before Apple acquired it by buying NeXT in 1996.) Microsoft tried to make ARM Windows a thing, but so far has failed due to a combination of their own missteps and the lack of enough good hardware to make it attractive to consumers. It's taken a long time, and people predicting overnight sea changes years ago were all wrong, but the industry finally seems to have enough momentum behind pushing ARM up out of the mobile space for it to stick.
There was never a technical limitation preventing this, it's just a chicken and egg problem where the costs of developing either the chicken or the egg are high, it doesn't make sense to do one without the other, and usually the corporations responsible for the chicken aren't responsible for the egg (and vice versa). But in your weird hypothetical scenario, none of that applies! The command economy could disregard existing software stacks. Supercomputers tend to need lots of performance tuning work done on the software which runs on them anyways, porting to a different architecture is frequent. And an intelligent technical director of this last ditch effort to optimize for total FLOPS manufacturing capacity (because the FLOPS aliens demand more FLOPS???) would take a look at x86 and realize there's overhead inherent to that ISA which makes it a worse choice than ARM. (And ARM a worse choice than a VLIW ISA... this thought experiment doesn't end up making lots of ARMs either.)
|
# ? Jun 24, 2020 22:25 |
|
Eh, the x86 decoder is small. The FLOPS gods are more likely to be mad about all those microjoules and mm going to cache.
|
# ? Jun 24, 2020 22:39 |
|
I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth.
|
# ? Jun 24, 2020 23:29 |
|
movax posted:I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth. One of the big wins for Fugaku is that they did HBM2; it’s very well balanced flop/bandwidth. They also have their own custom interconnect off die but the jury’s still out on that.
|
# ? Jun 24, 2020 23:35 |
|
PCjr sidecar posted:Eh the X86 decoder is small. FLOPS gods are more likely to be mad about all those microjoules and mm going to cache. Agreed. The point of even bringing it up was that while x86 overhead often gets overestimated (mostly by people still unable to move past the Great CISC vs RISC wars of the 1980s and 1990s), it's real, so if you were seriously optimizing for nothing but maximum FLOPS per wafer, you'd avoid it.
|
# ? Jun 25, 2020 00:44 |
|
movax posted:I'm a bit curious about the memory controller side of things. Someone correct me if I'm wrong, but the "stock" ARM DDR / memory controllers have historically been fairly anemic. No doubt Apple does their own, but to date their SoCs have all been optimized for mobile usage / talking w/ LPDDRx. I guess there are some whitepapers to dig up on the ARM server side of things that talk about that controller performance for more desktop / server type workloads, and how they perform in terms of latency and bandwidth.

I suspect the Mac Pro is gonna be the last one to go Apple silicon. Everything else can do LPDDR, just scaled out.
|
# ? Jun 25, 2020 00:59 |
|
BobHoward posted:Agreed. The point of even bringing it up was that while x86 overhead often gets overestimated (mostly by people still unable to move past the Great CISC vs RISC wars of the 1980s and 1990s), it's real, so if you were seriously optimizing for nothing but maximum FLOPS per wafer, you'd avoid it. Sure; if we’re at that level, any instruction set is a decadent luxury.
|
# ? Jun 25, 2020 01:02 |
|
My maximum perf architecture involves a 32-bit carry-lookahead adder just iterating +1 over and over again
|
# ? Jun 25, 2020 14:54 |
Ian Cutress has posted a deep dive on Intel Lakefield over at anand.
|
|
# ? Jul 3, 2020 13:23 |
|
Are the C4000 Atoms/Celerons/etc. sufficiently safe to use? Like, do we know that Intel addressed the underlying issue, since they basically held up all of their customers as a shield against publicity? I got burned by the C2000 broken clock signal twice and just swore off anything with an Intel logo on it and moved work to an ARM-based NAS, but I'm back in the market for a new NAS and the one that has the features I want has a J4005.
|
# ? Jul 3, 2020 17:56 |
|
D. Ebdrup posted:Ian Cutress has posted a deep dive on Intel Lakefield over at anand. Intel CPU and Platform Discussion: This is a Balls-on-Balls approach https://i.imgur.com/PAbeTBB.png
|
# ? Jul 3, 2020 19:00 |
|
D. Ebdrup posted:Ian Cutress has posted a deep dive on Intel Lakefield over at anand.

I was really looking forward to this part and all its new technology, but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core. Hope they’ve got a rabbit in the hat or the Apple silicon Macs are gonna whoop it in its class, even running through Rosetta.
|
# ? Jul 3, 2020 19:07 |
|
NewFatMike posted:Intel CPU and Platform Discussion: This is a Balls-on-Balls approach Or, as they say across the pond, "bollocks on bollocks", which I think is especially appropriate.
|
# ? Jul 3, 2020 19:33 |
NewFatMike posted:Intel CPU and Platform Discussion: This is a Balls-on-Balls approach Cygni posted:I was really looking forward to this part and all it’s new technology but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core.
|
|
# ? Jul 3, 2020 19:34 |
Also, have you not seen the A64FX from Fujitsu? That thing demonstrates that ARM is going places.
|
|
# ? Jul 3, 2020 19:35 |
|
Cygni posted:I was really looking forward to this part and all its new technology but it seems like the performance is solidly in “lol” territory. Just doesn’t have the power budget to run the big core.

Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

D. Ebdrup posted:Also, have you not seen the A64FX from Fujitsu?

Good year for ARM and nothing else imo. Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!
|
# ? Jul 3, 2020 20:14 |
|
NewFatMike posted:Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

They do, and like most ARM implementations they can run all of the cores simultaneously, with the scheduler balancing which threads go to big and which go to small. Although the current Apple A12Z devkits running macOS are only running with the big cores on, probably because the scheduler on the OS side isn't ready yet, I guess.
|
# ? Jul 3, 2020 20:31 |
NewFatMike posted:Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!
|
|
# ? Jul 3, 2020 21:11 |
|
NewFatMike posted:Do Apple ARM chips use a big.LITTLE approach? I don't know why I never wondered.

Yep, they've had it since the A10/iPhone 7 era with the 'efficiency cores'. Here's a deeper look into the cores on the A11/A12: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets/2 The interesting thing was that they basically ripped out the efficiency cores and put them into the Apple Watch 4/5, which finally made it not too garbage for performance.
|
# ? Jul 3, 2020 23:26 |
|
NewFatMike posted:Also nonedit: banging my arms into a bloody pulp on my desk RISC-V! RISC-V!

I actually piled a bunch of cash into $WDC recently, partially based on logic, partially based on hopes that they'll do something with SweRV and really push RISC-V. Still think the legions of low-cost chip makers who are tired of paying ARM or Cadence for IP should be all over it.
|
# ? Jul 3, 2020 23:49 |
|
This is awesome reading, thanks fellas! I wonder how much Android/ChromeOS scheduler tech is going to be useful for these Lakefield processors, since they already work with ARM big.LITTLE implementations (and how much of that is upstreamed to the mainline Linux kernel). Or if Windows on ARM will have any transferable lessons. So much cool stuff has been happening with CPUs in the last few years that I'm waiting for the other shoe to drop and something really awesome turning out to be a wet fart.
|
# ? Jul 4, 2020 02:18 |
|
Perhaps, but maybe not much beyond the minimal effort required of them, especially since they're ditching Qualcomm for their own hardware.
|
# ? Jul 4, 2020 03:30 |
|
.
sincx fucked around with this message at 05:55 on Mar 23, 2021 |
# ? Jul 4, 2020 07:09 |
The biggest problem with heterogeneous CPU cores is that the scheduler of every single OS that wants to support this has to grow support for tracking energy use / scheduler quanta, and also needs an algorithm for deciding when something should be moved around. Qualcomm is supposedly helping Microsoft implement something for Windows, but everything I've heard about it leads me to believe they're not even close to getting it right - and who knows what state Apple's macOS is in? Plus, the places where it'll really matter are smartphones, which ultimately means Linux and iOS respectively, and for reasons which are too dumb to get into, that means none of the rest of the open-source world will benefit from it.

On the off-chance that someone might not be able to conceptualize the problem, think about this: you have a process that will take some time to run on a 2GHz core and a little more time on a 1GHz core - but bringing the 2GHz core out of its deepest sleep state takes time, as does stepping it out of the power saving modes that it starts in when it's brought out of sleep. So do you run the process on the slower core and hope that it doesn't take that long, or do you bring up the faster core? To complicate matters, some schedulers like the one in FreeBSD subdivide processes into quanta - i.e. a given process is split into tasks which any thread can steal from any other thread - but there's a penalty to preemption/stealing since it effectively involves a cache invalidation (preemption is not a zero-sum thing, and getting the balance right is loving hard).

To even begin solving this problem, you need to code a system that can factor energy efficiency into the computation (instead of just how quickly something can be executed, which is what schedulers optimize for now) - and preferably, you make this a runtime switch, since even a boot-time or worse-yet a compile-time switch is just completely pointless.
And since cores aren't going to get substantially faster for the foreseeable future, you also have to ensure that this process is efficient enough that it doesn't take up considerable CPU time, even if you're working with multi-threaded processes which go from 2-4 all the way up to 512+ threads.

EDIT: And I can guarantee that I've forgotten or glossed over several other details which are big problems on their own - plus, there's only a very small number of people in programming who have a deep enough technical understanding and interest to actually work on schedulers, and to add insult to injury it's one of the few places where the best debugging tools you have are flamegraphs from dtrace on long-running production systems, as that's where you'll find all the problems.

BlankSystemDaemon fucked around with this message at 09:59 on Jul 4, 2020 |
|
# ? Jul 4, 2020 09:51 |
|
You’ve also got thrashing problems. Let’s say you boot up the big core and it churns through the task quickly like it’s supposed to. But then you power it off, the slow core isn't fast enough, and it starts falling behind. So you turn the big core back on and waste a bunch of energy waiting for it to come back up. Then it gets ahead of the workload again and turns back off. Etc. Hugely tricky thing to get right, even above and beyond normal scheduling.
|
# ? Jul 4, 2020 09:58 |
Paul MaudDib posted:You’ve also got thrashing problems. Let’s say you boot up the big core and it churns through the task quickly like it’s supposed to do. But then you power it off, and the slow core isn't fast enough and it starts falling behind. So you turn the big core back on and waste a bunch of energy waiting for it to boot back up. And it gets ahead of the workload again and turns back off. Etc.

EDIT: To expand on this, threading is done asynchronously, so there's not really a situation in which the slow core can 'fall behind' as you describe it. Or rather, if there is, someone writing their userland program hosed up in MAJOR ways.

BlankSystemDaemon fucked around with this message at 10:04 on Jul 4, 2020 |
|
# ? Jul 4, 2020 10:02 |
|
https://twitter.com/VideoCardz/status/1279340014938329088?s=19
|
# ? Jul 4, 2020 10:27 |
|
Big.Little architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times for no apparent reason, until it turned out the manufacturer hosed up and the CPUs were reporting different capabilities for big and little cores. So if an app was started on a big core, it would read the capability bits and enable certain code paths, but after the process got migrated to a little core that did not support those features, the app would promptly crash on an illegal instruction. So the kernel needs to virtualise CPU capabilities and present only the minimum common supported subset to userland.
|
# ? Jul 4, 2020 11:40 |
Stubb Dogg posted:Big.Little architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times with no real reason, until it turned out manufacturer hosed up and CPUs were reporting different CPU capabilities for big and little cores. So if app was started on Big core, it would read capability bits and enable certain code paths, but after process got migrated to little core that did not support those features, app would promptly crash on illegal instruction. So kernel needs to virtualise CPU capabilities and present only minimum common supported subset to userland. But you're absolutely right, just like the scheduler optimizations, it's an enormous complication that's laid entirely at the feet of the individual developers, even if libraries are developed for it. Also, if you don't wanna type out heterogeneous something-or-other, just use the abbreviation HMP.
|
|
# ? Jul 4, 2020 14:34 |
|
Discussion Quorum posted:For scale, the upgraded CPU in the F-35 advertises a Dhrystone of 2900 which puts it on par with a Raspberry Pi 3B+ (yeah I know it likely kills the Pi in other capabilities but we're talking about ~FLOPS~). You haven't played around with the phased array radar kit for raspberry pi 3?
|
# ? Jul 4, 2020 14:37 |
|
|
D. Ebdrup posted:The biggest problem with heterogeneous CPU cores is all the schedulers for every single OS that wants to support this has to grow support for tracking energy use / scheduler quanta, and also needs an algorithm for when something should be moved around.

On the state of macOS: iOS already supports it, and that means macOS can too. iOS began life as a fork of macOS, and both still build a lot of components from the same source code. That includes the XNU kernel. While they likely haven't turned on the code paths for AMP support in Intel Mac XNU kernel builds, obviously they've got a relatively easy path to get there for ARM Macs. (AMP = asymmetric multiprocessing, Apple's chosen terminology for this)

The open source world can technically benefit, as XNU is an open-source kernel. The actual code probably isn't too useful, though, due to a mix of license incompatibility and just being too different. My understanding is that the Linux scheduler algorithm isn't very much like anything else, which would likely make it difficult to apply AMP modifications from a very different scheduler design. Even the BSDs (which, frankly, don't matter any more) aren't likely to benefit much. XNU is weird in its own way; it's a highly modified mashup of Mach and BSD kernel code. The last time they synced any of the BSD bits up with a mainline BSD was probably before Mac OS X 10.0 in 2001, but more importantly the XNU scheduler is descended from Mach rather than BSD. (The mashup is roughly Mach for scheduler + VM, BSD for traditional UNIX syscalls, and custom NeXT/Apple bits for I/O.)

Stubb Dogg posted:Big.Little architectures are really complex even beyond power optimization. There were numerous fuckups on Android where apps would crash at completely random times with no real reason, until it turned out manufacturer hosed up and CPUs were reporting different CPU capabilities for big and little cores.
So if an app was started on a big core, it would read capability bits and enable certain code paths, but after the process got migrated to a little core that did not support those features, the app would promptly crash on an illegal instruction. So the kernel needs to virtualise CPU capabilities and present only the minimum common supported subset to userland.

That was Samsung, and it wasn't merely a reporting fuckup. In the Exynos 9810, they chose to use their in-house ARMv8.0 big core paired with a Cortex-A55 ARMv8.2 LITTLE core. Among other things, ARMv8.1 onwards added some new atomic instructions, so the cores weren't just asymmetric in performance, they were asymmetric in capability.

The reporting mistake was not that they reported different capabilities. Technically, reporting different capabilities would be correct, but in practice it could never work. A common technique used by software which wants to take advantage of optional new CPU features is to query CPU feature support once on launch, set global variables / function pointers / whatnot to direct future code execution down the right paths, and never query CPU capabilities again. If such a program queries on the more capable CPU and then gets switched to the less capable one, bad things are gonna happen.

So, if you're writing OS support code for that Exynos, you want to make sure the least common denominator feature set is what gets reported even on the CPUs which can do more. That has a chance of working; your application authors have to do extra things to gently caress up. In fact, apparently the Android Linux kernel already should have been doing just that. But, for reasons best known to themselves, Samsung chose to write a patch which overrode the default behavior and reported that all CPU cores supported the new ARMv8.1 atomic instructions. So, there were poor choices all around! Both hardware and software were hosed.
It is a safe bet that Apple will not have such problems, because their performance and efficiency cores are both custom Apple designs and they can make sure that both support exactly the same ISA, which is what you really want.
|
# ? Jul 4, 2020 20:27 |