Beef
Jul 26, 2004

JawnV6 posted:

this is from 24 days ago lol https://fortune.com/longform/microchip-designer-jim-keller-intel-fortune-500-apple-tesla-amd/

He's not a designer, at least not in ground-level jargon; I'd call him an architect.

And as for what someone like Jim Keller does all day: meetings, so many meetings. A PI position is already back-to-back meetings, and he has been doing media work as well.

Beef
Jul 26, 2004
Rumor has it he really has a family health issue and cannot keep pulling what I assume is an insane workload inside Intel; he's still consulting for now. I kind of believe this one, as otherwise he would just jump ship like he has done in the past.

Shorting INTC based on that news and the fact that Zen3 is on schedule after all is one hell of a risky trade.

For gamers and PC users, yeah, Zen 3 can absolutely crush it if they manage to get CPI parity and bring down some of the latencies. But institutional shareholders don't really give a poo poo. Desktop is 1) relatively low-margin, 2) not a growth sector, 3) not a fab priority given the current problems meeting demand in other areas, and 4) not where Intel's boutique approach is very profitable.
Intel is still a boutique firm with a crapton of customized chips and systems for a wide variety of customers. AMD isn't making a Zen variant for high-speed trading or base stations, and even if they had the engineering bandwidth for it, I doubt they would get the fab bandwidth. Intel doesn't need to pay off OEMs with cash; they pay them off with custom designs.

Beef
Jul 26, 2004
Chiming in with some GPU vs CPU NN training experience. GPUs have a distinct advantage when training deep networks with convolution layers, because a lot of the VRAM-resident data gets reused and there is a high computational intensity (flops/byte). However, there are a ton of critically useful models that are shallow feedforward networks. On top of that, many real-world datasets are sparse (take a shot every time someone says 'embedding' at a conference). In those cases, the GPU's training advantage largely becomes one of existing software and frameworks. One major advantage of CPU training is that you don't have to fight to make your model fit the relatively limited GPU RAM.
It was also nice not having to fight for the handful of GPU nodes in the on-premise cluster :coal:

Beef fucked around with this message at 10:04 on Jul 27, 2020

Beef
Jul 26, 2004

Twerk from Home posted:

How many of the nodes are new enough to have the AVX512 CPU instructions though? Or are you in private industry and not a university that's still running Westmere in production? https://www.vanderbilt.edu/accre/technical-details/

Yeah, most are Broadwells and Haswells, but I did see some improvements testing with Skylake AVX512. Even if the first layer is a sparse computation, the rest is the usual dense matrix multiplication that does well under AVX512.
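
For reference, the kind of dense kernel I mean is just the plain inner-product loop. A minimal sketch (my own toy code, not anything from our actual stack; compile with something like -O3 -fopenmp-simd -march=skylake-avx512) that a compiler will happily turn into 512-bit FMAs:

code:
/* Toy dense layer forward pass: out = W * in (row-major, no bias/activation).
   The inner loop is the part that vectorizes well under AVX512; a sparse
   first layer would be handled by separate gather-style code. */
void dense_forward(const float *W, const float *in, float *out,
                   int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        #pragma omp simd reduction(+:acc)
        for (int c = 0; c < cols; ++c)
            acc += W[r * cols + c] * in[c];
        out[r] = acc;
    }
}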

Beef
Jul 26, 2004

Not so much changing the name as repainting the store's window frame.

Beef
Jul 26, 2004
In case this is not obvious: the passworded zips sharing the same intel123 password are simply there to circumvent their anti-virus/mail server checks. I've been given pre-release binaries by an Intel person I was collaborating with, and it was all zipped up in the same way.

Beef fucked around with this message at 22:45 on Aug 6, 2020

Beef
Jul 26, 2004
With all that battery chat, I'm absolutely charged up to hear about what's happening at HotChips right now.

https://www.anandtech.com/show/15984/hot-chips-2020-live-blog-next-gen-intel-xeon-ice-lakesp-930am-pt

Beef
Jul 26, 2004
We're firmly in the age of dark silicon: we can put more transistors on a die than we can effectively power all the time. If you were to use those AVX512 transistors for, say, extra L2 cache, you would effectively hobble your chip: that extra cache cannot be power gated, and it's not making enough of a difference to make up for halving your operating frequency.

A lot of those ARM chip efficiencies vanish when they have to start supporting a ton of PCIe lanes, addressing more than 16GB of LPDDR, running an inter-core fabric for high-bandwidth cache coherency traffic, etc. There's a reason why ARM chips are slow to become real competitors, even though for decades every manager with an Excel sheet has been extrapolating phone SoC flops up against Xeon flops and exclaiming that ARM is more efficient.

edit: What *does* hobble x86 is its much stronger memory consistency model.

Beef fucked around with this message at 00:07 on Aug 28, 2020

Beef
Jul 26, 2004
Slow down with the takedown pounce there, buddy. AVX512 is power gated, as in dynamically, not during binning. No, it is not tightly coupled with what you mentioned; it sits behind a separate port in the backend with its own register files. Bringing it up comes at a cost, and it does bring down the frequency significantly. That's not something you can do with cache. Maybe some day with an LLC?

You are right that tight AVX512-using loops will consume more dynamic power. But what do you think uses more dynamic power in most workloads on average: AVX512 that gets power gated when not in use, or SRAM that is hit on nearly every instruction? Frequency will not halve (that's an exaggeration), but it definitely takes a hit. And I assure you that a larger L2 will not make up for the voltage drop.

The point is that the architects chose to use the transistors for a sometimes-useful feature that can be power gated when not in use, because the alternatives suck worse.

Beef
Jul 26, 2004
Those ARM designs are super-wide issue. My guess would be that following Intel's TSO memory model would make poor use of that.

Beef
Jul 26, 2004
Without watching: is it because it's business-to-business advertising?

Beef
Jul 26, 2004

ColTim posted:

Something I've never quite understood is per core memory bandwidth. Naively I expected that to be bottlenecked by the RAM speed itself, but from testing on my old X99 6800K it looked like the per-core bandwidth topped out around 10-12GB/s, despite having quad-channel DDR4 3200 (with a theoretical bandwidth of ~100GB/s). I think it may have something to do with the cache performance as well; like regardless of the bandwidth of the RAM, each core can only load X cache lines per clock or something...

Things may be a bit different on the smaller client dies (probably higher per-core numbers there) but it definitely threw me for a loop!

Yeah, that looks low. How did you get that number? The STREAM benchmark suite is typically pretty good at putting it through its paces.

As already pointed out:
- You need multiple threads on multiple cores to saturate your bandwidth, at least on Xeons. A single core's load-store buffers cannot hold enough in-flight memory operations, for example (rough sketch below).
- If the cache is getting thrashed, it's doing a ton of flushes and prefetches, which is memory bandwidth you won't see unless you use hardware counters. Good memory benchmarks use uncached (non-temporal) memory operations to avoid this.
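
If you want to poke at it yourself, a bare-bones triad kernel in the spirit of STREAM looks roughly like this (my own sketch, not the real benchmark; array size made up, needs about 1.5 GB of RAM; compile with -O3 -fopenmp and run with OMP_NUM_THREADS=1 vs. all cores to see the single-core ceiling):

code:
/* Rough STREAM-triad-style bandwidth probe: a[i] = b[i] + s*c[i].
   Counts 2 reads + 1 write of a double per iteration, like STREAM does
   (write-allocate traffic not included). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 26)   /* 64M doubles per array, far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double s = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; ++i)
        a[i] = b[i] + s * c[i];
    double t1 = omp_get_wtime();

    double gb = 3.0 * N * sizeof(double) / 1e9;
    printf("triad: %.2f GB/s\n", gb / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
The real STREAM also uses non-temporal stores and repeats each kernel a few times; this is just enough to show the one-thread vs. all-cores gap.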

Beef
Jul 26, 2004
A new experimental architecture: https://arxiv.org/abs/2010.06277
gently caress caches, massive threading.

Beef
Jul 26, 2004
With Dalian sold off, the same bean counters can also go brag to their Republican buddies that they retreated from China.

Beef
Jul 26, 2004
Intel should have added the Lena encoder and it would have won in every benchmark tbh.

Beef
Jul 26, 2004

screamin and creamin posted:

So now the Intel chuds are posting their hot takes that it's all just about video and image encoding. Even the generalized benchmarks are all rigged!!!!!!!!!!!!!!!

I think there are some jokes that have gone over your head in your rush to post Intel chud takedown posts.

https://www.dangermouse.net/esoteric/lenpeg.html

Beef
Jul 26, 2004
The CPU will speculate on which branch will be taken, as determined by the branch predictor from historical data (or compiler hints; an untaken assert is virtually free). It will not compute both branches at the same time, at least not in the backends I've worked with. I think GPUs do something like that to handle divergent warps.

The vast majority of branches are correctly predicted. Mispredicts happen mostly in situations where the branch depends on dynamic data, such as a sorting routine or string search. A 100% taken/untaken branch is basically free. If the branch predictor is not seeing a trend (something like an 80 or 90% threshold), it will not cause speculation.
You can test this yourself (rough sketch at the end of this post); the worst cases are the 80/20 or 70/30 ones, because they will sometimes trick the predictor with a sequence above the threshold and then cause mispredicts.

In short, that AVX 128/256 throttling due to mispredicts is not something you will get in the wild, unless you are doing some really weird stuff.
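
As promised, a rough way to test it yourself (my own toy code, threshold and sizes made up; compile with -O2, since at higher optimization levels the compiler may if-convert the branch away entirely): run the same data-dependent branch over shuffled vs. sorted input and time both.

code:
/* Data-dependent branch: near-perfectly predicted on sorted input,
   frequently mispredicted on shuffled input. Time both to see the gap. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

static long long sum_big(const int *v, int n)
{
    long long s = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] >= 128)      /* the branch under test */
            s += v[i];
    return s;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int *v = malloc(N * sizeof *v);
    srand(42);
    for (int i = 0; i < N; ++i) v[i] = rand() % 256;

    clock_t t0 = clock();
    long long r1 = sum_big(v, N);            /* random: ~50/50 branch */
    clock_t t1 = clock();

    qsort(v, N, sizeof *v, cmp_int);
    clock_t t2 = clock();
    long long r2 = sum_big(v, N);            /* sorted: predictable branch */
    clock_t t3 = clock();

    printf("random %.3fs sorted %.3fs (sums %lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t3 - t2) / CLOCKS_PER_SEC, r1, r2);
    free(v);
    return 0;
}
To play with the 80/20 or 70/30 cases, fill the array with a biased coin flip on the condition instead of sorting it.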

Beef
Jul 26, 2004
When has the top-bin segment ever made sense for Intel CPUs? It only does for 'price-insensitive' consumers.

Beef
Jul 26, 2004
Speaking of companies poaching Intel engineers:

https://www.bizjournals.com/portland/news/2020/12/31/microsoft-corp-lease-85-000-square-feet-hillsboro.html

Beef
Jul 26, 2004
Quoting from HN comments: The report is all speculation, not some credible source leak.

edit: I'm quoting it because it's factually correct? The source of the 'news' is a report from a market analysis firm, speculation from before the CEO change.

Beef fucked around with this message at 01:41 on Jan 21, 2021

Beef
Jul 26, 2004
At first I'm excited about Intel doing another high-performance CPU project. But then I realize we're only going to hear about it in 3 to 5 years, if at all :(
There's the graph processor Intel is working on, but I doubt it's something the general public is going to be programming.

Beef
Jul 26, 2004

ConanTheLibrarian posted:

Any recollection of how long it typically took for a new mem tech's prices to settle down to something reasonable? Since next year's CPUs will be the first to use DDR5, I've mentally pencilled in the following gen as a decent time to buy in, but that assumes a quick enough drop in RAM prices.

Depends on https://en.wikipedia.org/wiki/List_of_earthquakes_in_Taiwan

Beef
Jul 26, 2004
My guess is that we will see it used in practice a bit like the mobile cores, sort of as a power-state thing. Low load: small cores used; high load: big cores used.

Dark silicon's a bitch

Beef
Jul 26, 2004
So is performance mounted in the big balls or the small balls?

Beef
Jul 26, 2004
Death to marketing and sales departments, I say. And pass the savings on to YOU!

Beef
Jul 26, 2004
Have there been any known attacks that used any of the speculative shenanigans? Hardware vectors seem so much :effort: compared to the bazillion software/networking vulnerabilities out there.

Beef
Jul 26, 2004
I fully expect to see all leading-edge CPUs moving to some kind of separate-node IO blocks.
IO blocks barely scale with transistor density, and IO bandwidth demands for interconnects, PCIe etc. are not going to stop growing. That means more and more of the surface of your server or desktop die is going to be IO, meaning that the gain from being on the expensive leading edge is decreasing. imec and TSMC are putting out a bunch of technologies to integrate multiple dies, like wafer-on-wafer, wafer-on-chip or whatever. These seem to come out either cheaper or faster than the current through-silicon-via tech, so that should address some of the downsides of IO chiplets.

GloFo is betting the house on it and specializing in IO-block processes; they seem to be doing well and are expanding their fab capacity. Like with memory, we might even see a schism in process technology?

Beef
Jul 26, 2004

BlankSystemDaemon posted:

Why yes, everyone definitely wants to deal with NUMA penalties on single-CPU systems!

Ah yes, the famous Zen NUMA penalty that makes it an inferior product to Intel's.

What makes you think that socket interconnects are in any way like silicon-on-silicon links? The IO blocks are the things that drive the uncore. A chiplet means that the IO blocks driving the uncore are linked to the cores through bumps or through-silicon vias or whatever, instead of being linked through metal layers directly. Latencies are of the same order of magnitude.

Beef
Jul 26, 2004

BlankSystemDaemon posted:

You can't tell me that something like this isn't going to have an impact:


Specifically, it means that in order not to have an order of magnitude higher latency for core-to-core communication, the scheduler has to be modified to be aware of these latency impacts - or have some way of testing them.

I was mostly talking about IO chiplets, where the extra few ns don't mean much compared to PCIe/CXL wire lengths.

The picture you show is pretty much NUCA-caches.jpg. You will see a similar pattern in modern Xeons, within the same order of magnitude. And Xeons do not even scale yet to that many cores.

We're already in the age of NUCA caches, where you miss your L3 on core 3, need to hop to the directory on core 0, and it tells you to go and fetch the line from core 8 on the other side of the die, two hops away on your on-die mesh network. Going chiplet-to-chiplet only adds a few ns in cases where you're already hosed by multiple hops and extra wire length. If doing chiplets means you can cram more L3 cache on the package, you're probably compensating well enough.

Will a 64-core monolith have better L2/3 latencies than a 64-core chiplet design on average? Sure, probably, but you will be paying through the nose for that extra percent.


Are you talking about the OS scheduler? NUCA caches have been a thing for a while, so I imagine it's already aware. I've definitely used numactl manually to pin threads to sibling cores on Xeons to avoid sharing caches with the other side of the mesh.
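
For anyone who hasn't done it: numactl wraps the whole process (CPU binding plus memory binding), and you can get the per-thread version programmatically with sched_setaffinity. A minimal Linux-only sketch, core id made up:

code:
/* Pin the calling thread to one core (Linux). numactl --physcpubind/--membind
   does the process-wide equivalent from the command line, plus NUMA memory binding. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* 0 = calling thread */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    return pin_to_core(2) == 0 ? 0 : 1;   /* hypothetical core id */
}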

Beef
Jul 26, 2004

PCjr sidecar posted:

The first gen Phis that never got out of the labs ran FreeBSD, because Larrabee did. Second gen, which shipped in volume, was a PCIe card and ran Linux. Third gen was Linux and standalone.

Some folks ran PS3 clusters (on Linux, mostly, “Other OS”) but it was spectacularly difficult to use. Roadrunner got most of its top500 flops from IBM Cell blades, but most of the science was done on the Opteron blades.

It was fun to ssh into a KNL and just have CentOS running. We had to modify htop to fit all the CPUs on screen. It would have had an impact if it hadn't been launched 2 years late.

Beef
Jul 26, 2004
Regarding the chiplet discussion: is there some public information on how much direct extra latency they cause?

Beef
Jul 26, 2004
But that's cache latency. There are so many factors that come into play that are not chiplet related. I mean something like: an interposer adds X ns compared to regular metal layers.

Beef
Jul 26, 2004
Those are again cache latency tests though.

I'll drop the question. It looks like something that can only be answered by silicon design guys. We don't really have a way to test the same architecture with and without chiplets. With chiplets it's going to be at higher core counts, which means a mesh/uncore that has to scale further and adds extra latency. The LLC is also going to be larger and a distributed NUCA cache. All of that adds latency that is technically unrelated to routing through an interposer metal layer.

Beef
Jul 26, 2004
Instructions get fused in ARM and uops get fused on Intel, further complicating things.

It is just not useful anymore to use the terms RISC and CISC to describe CPU architectures.

Beef
Jul 26, 2004

BobHoward posted:

I'm still fond of John Mashey's old Usenet postings on how and why to classify something as RISC or CISC. There are some points which are aging but most of it is timeless.

https://userpages.umbc.edu/~vijay/mashey.on.risc.html


A common misunderstanding is that RISCiness is somehow about the total number of instructions. As Mashey points out, to a CPU architect RISC was mostly about reducing the complexity of things like addressing modes.

It's also interesting and relevant that while x86 ends up on the CISC side of the dividing line, it has several points which are RISCy. I think this is one of the many factors which allowed it to survive past the 1980s. (obviously, the most important factor is being selected by IBM for the 1981 PC)

Thanks, that was an interesting read!

I also have some thoughts on the points about the weird CISCy instructions, like:

quote:

4a) Have NO operations that combine load/store with arithmetic,
i.e., like add from memory, or add to memory.

My first thought was that this aged badly. But actually, now that I think about it, macro/micro-op fusion is enabling simplifications in the ISA such as this. E.g. engineer A worked on a memory controller that can do indexed loads, so he pushes to add an indirect/indexed load instruction to the ISA. Engineer B tells A he's a dumbass for wanting to grow the ISA, as they can simply add a macro-op fusion rule to the frontend to fuse add+load.

Personally, I don't think it's fair to call a modern x86 a CISC architecture; in practice the asm produced by modern compilers/OSes looks pretty damned RISCy to me. Is there anything in common use in x86-64 that could still be considered CISCy?
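
To make the fusion point concrete, the toy case I have in mind (my own example; the asm in the comments is just the typical shape, not exact compiler output) is the same `s += a[i]` either becoming one CISCy load-plus-add instruction on x86, or staying as two simple instructions on a RISC ISA that the frontend could fuse back together:

code:
/* Toy reduction loop to illustrate the "load combined with arithmetic" point. */
long sum(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; ++i)
        s += a[i];   /* x86-64: often a single load-op, e.g.  add rax, [rdi+rcx*8]
                        RISC-V:  ld t0, 0(a0)  then  add a1, a1, t0
                        i.e. two simple ops that a frontend fusion rule could merge */
    return s;
}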

Beef
Jul 26, 2004

BlankSystemDaemon posted:

Data-oriented design, whether using C++ as in a lot of video games, Rust, or any other language, implicitly means that you have to understand the architecture to write effective code.

Yeah, I know - I do wonder what he's been up to there, since I haven't been able to find much that he's done there.

Well, part of the problem is that what they ship isn't truly open source. The PHY and MAC for the NIC, DDR training, PCI bus training, and many other parts that go on the motherboard are still completely proprietary, and replacing them is arguably a lot more difficult than making a CPU.


Speaking of all this CPU stuff, I remember seeing an article (with, I wanna say, some sort of orange background, with black text?) which goes in-depth about the evolution of modern processors - it starts out by describing a basic processor and then moves on to pipelining and finally covers staged pipelining.
Does this ring a bell for anyone?

This isn't what you asked, but the reference work for processor architecture and design is the Hennessy and Patterson: https://www.amazon.com/Computer-Architecture-Quantitative-John-Hennessy/dp/012383872X


I'm glad data-oriented design has become popular, but you always had to have an understanding of modern CPU architecture to write effective code. I'm afraid of getting to that stage of the hype cycle where a ton of people are just cargo-culting the gently caress out of it. That said, I can't wait to see people write branchless everything 'because branches are EVIL'.

fake edit: mandatory branchless fizzbuzz https://gist.github.com/andrewgho/6a37d71d9848896af12a


Kazinsal posted:

The lea instruction is so versatile that I'm surprised it's not turing complete. It's basically a mov that lets you do addressing mode calculations as if you're dereferencing a bunch of pointer arithmetic, but without the actual dereference. I'm not sure how much it's used in modern compiler output (note to future-self: I should try poking at godbolt a bit if tomorrow's slow at work to confirm this) because, like you said, other instructions probably fuse into sequences of uops better, but drat if it's not fun to fiddle with.

I once asked the same question, along the lines of "why isn't this just an add". IIRC it's still heavily used because the frontend has a specialized port or something to deal with it, freeing up the rest and making the whole thing that tiny bit faster. It doesn't make any difference for the backend.
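
For what it's worth, the bread-and-butter compiler uses of lea are plain address math and cheap multi-operand arithmetic. A made-up example of the C that typically produces it (the asm in the comments is the usual shape, exact registers will vary):

code:
/* lea is the address computation without the dereference:
     p + i + 2   ->   lea rax, [rdi + rsi*4 + 8]    (rdi = p, rsi = i, 4-byte ints)
   It also doubles as a cheap 3-operand add/shift:
     x * 5       ->   lea rax, [rdi + rdi*4]                                       */
int *addr_of(int *p, long i)
{
    return p + i + 2;   /* pure address math, no load */
}

long times5(long x)
{
    return x * 5;       /* commonly lowered to a single lea */
}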

Beef
Jul 26, 2004
Mostly clickbait. He says it in the conclusion as well, but in the context that it sits in the same in-between hole as the 3600. Barely better than the more affordable SKUs, not enough performance for high-productivity work.

Beef
Jul 26, 2004
https://www.anandtech.com/show/1657...n-return-of-idf

Doubling down on the IDM model. Chiplet products incoming.

Beef
Jul 26, 2004
To be fair to Intel, it wasn't obvious that EUV would reach maturity in time for their 10nm. I was working at imec years back and EUV was a running joke, always a year or two in the future, for a decade or so. Intel was a large investor in ASML and EUV tech, but simply chose to adopt it later rather than sooner.
It was still a mistake and hubris to think that quad patterning and new materials would make that 10nm node viable, but that's easy to say in hindsight.

Beef
Jul 26, 2004

JawnV6 posted:

I worked on a CPU team and there's just no good way to simulate architectural improvement effects in that long-term steady-state kind of thing.

There have been developments in architectural simulators in the past decade, and there are simulation techniques now that do that kind of analysis successfully, e.g. Graphite or Sniper. We're still talking a 1000x slowdown compared to hardware, but at least you can simulate a few billion cycles of a multi-socket multi-core. However, slow-as-gently caress cycle-accurate simulators are still the architects' bread and butter; it's hard to make them trust anything else. You can bet they will have to start trusting other simulation techniques if they want to start moving beyond *spits on the floor* SPEC benchmark workloads.
