|
JawnV6 posted:this is from 24 days ago lol https://fortune.com/longform/microchip-designer-jim-keller-intel-fortune-500-apple-tesla-amd/ And as to what someone like Jim Keller does all day: meetings, so many meetings. A PI position is already back-to-back meetings, and he has been doing media work as well.
|
# ¿ Jun 12, 2020 09:29 |
|
|
Rumor has it he really has a family health issue and cannot keep pulling what I assume is the insane workload inside Intel; he's still consulting for now. I kind of believe this one, as he would otherwise jump ship like he has done in the past. Shorting INTC based on that news and the fact that Zen 3 is on schedule after all is one hell of a risky trade. For gamers and PC users, yeah, Zen 3 can absolutely crush it if they manage to get CPI parity and bring down some of the latencies. But institutional shareholders don't really give a poo poo. Desktop is 1) relatively low-margin, 2) not a growth sector, 3) not a fab priority given the current problems meeting demand in other areas, and 4) not where Intel's boutique approach is very profitable. Intel is still a boutique firm with a crapton of customized chips and systems for a wide variety of customers. AMD isn't making a Zen variant for high-speed trading or base stations, and even if they had the engineering bandwidth for it I doubt they would get the fab bandwidth. Intel doesn't need to pay off OEMs with cash; they pay them off with custom designs.
|
# ¿ Jun 22, 2020 12:15 |
|
Chiming in with some GPU vs CPU NN training experience. GPUs have a distinct advantage when training deep networks with convolution layers, because a lot of the VRAM-resident data gets reused and the computational intensity (flops/byte) is high. However, there are a ton of critically useful models that are shallow feedforward networks. On top of that, many real-world datasets are sparse (take a shot every time someone says 'embedding' at a conference). In those cases, the GPU's training advantage largely becomes one of existing software and frameworks. One major advantage of CPU training is that you don't have to fight to make your model fit the relatively limited GPU RAM. It was also nice not having to fight for the handful of GPU nodes in the on-residence cluster. Beef fucked around with this message at 10:04 on Jul 27, 2020 |
# ¿ Jul 27, 2020 09:51 |
|
Twerk from Home posted:How many of the nodes are new enough to have the AVX512 CPU instructions though? Or are you in private industry and not a university that's still running Westmere in production? https://www.vanderbilt.edu/accre/technical-details/ Yeah, most are Broadwells and Haswells; I did see some improvements testing with Skylake AVX512. Even if the first layer is a sparse computation, the rest is the usual dense matrix mult that does well under AVX512.
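For the "usual dense matrix mult" part, a minimal sketch (my own illustration, not anything from the post) of the kind of kernel that autovectorizes well: the inner loop is a unit-stride multiply-add stream, which a compiler built with something like `-O3 -march=skylake-avx512` will happily turn into wide FMA instructions.

```c
/* Naive dense matmul, C += A * B (all n x n, row-major float32).
 * The j-loop reads and writes contiguous memory with a loop-invariant
 * scalar `a`, which is the shape autovectorizers handle best. */
void matmul(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (int j = 0; j < n; j++)     /* unit stride: vectorizes */
                C[i * n + j] += a * B[k * n + j];
        }
}
```

A real implementation would add cache blocking on top, but the vectorization story is the same.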
|
# ¿ Jul 28, 2020 00:24 |
|
SwissArmyDruid posted:https://videocardz.com/newz/intel-to-introduce-new-logos-for-its-core-series Not as much changing the name as repainting the store's window frame.
|
# ¿ Jul 31, 2020 16:36 |
|
In case this is not obvious: the passworded zips sharing the same intel123 password are simply there to circumvent their anti-virus/mail server checks. I've been sent pre-release binaries by an Intel person I was collaborating with, and it was all zipped in that same way.
Beef fucked around with this message at 22:45 on Aug 6, 2020 |
# ¿ Aug 6, 2020 22:38 |
|
With all that battery chat, I'm absolutely charged up to hear about what's happening at HotChips right now. https://www.anandtech.com/show/15984/hot-chips-2020-live-blog-next-gen-intel-xeon-ice-lakesp-930am-pt
|
# ¿ Aug 18, 2020 08:09 |
|
We're firmly in the age of dark silicon: we can put more transistors on a die than we can effectively power all at once. If you were to use those AVX512 transistors for, say, extra L2 cache, you would effectively hobble your chip: that extra cache cannot be power gated, and it's not making enough of a difference to make up for halving your operating frequency. A lot of those ARM chip efficiencies vanish when they have to start supporting a ton of PCIe lanes, addressing more than 16GB of LPDDR, an inter-core fabric for high-bandwidth cache coherency traffic, etc. There's a reason why ARM chips are slow to become real competitors, even though for decades every manager with an Excel sheet has been comparing phone SoC flops with Xeon flops and exclaiming that ARM is more efficient. edit: What *does* hobble x86 is its much stronger memory consistency model. Beef fucked around with this message at 00:07 on Aug 28, 2020 |
# ¿ Aug 28, 2020 00:02 |
|
Slow down with the takedown pounce there, buddy. AVX512 is power gated dynamically, not during binning. No, it is not tightly coupled with what you mentioned; it sits behind a separate port in the backend with its own register files. Powering it up comes at a cost, and it does bring down the frequency significantly. That's not something you can do with cache. Maybe some day with an LLC? You are right that tight AVX512-using loops will consume more dynamic power. But what do you think uses more dynamic power in most workloads on average: AVX512 that gets power gated when not used, or SRAM that is hit on nearly every instruction? Frequency will not halve, that's an exaggeration, but it definitely takes a hit. And I assure you that a larger L2 will not make up for the voltage drop. The point is that the architects chose to use the transistors for a sometimes-useful feature that can be power gated when not in use, because the alternatives suck worse.
|
# ¿ Aug 29, 2020 01:20 |
|
Those ARM designs are super wide issue. My guess would be that following Intel TSO would make poor use of that.
|
# ¿ Aug 29, 2020 01:29 |
|
Without watching : is it because it's business to business advertising?
|
# ¿ Aug 30, 2020 08:35 |
|
ColTim posted:Something I've never quite understood is per core memory bandwidth. Naively I expected that to be bottlenecked by the RAM speed itself, but from testing on my old X99 6800K it looked like the per-core bandwidth topped out around 10-12GB/s, despite having quad-channel DDR4 3200 (with a theoretical bandwidth of ~100GB/s). I think it may have something to do with the cache performance as well; like regardless of the bandwidth of the RAM, each core can only load X cache lines per clock or something... Yeah, that looks low. How did you get to that number? The STREAM benchmark suite is typically pretty good at putting it through its paces. As already pointed out: - You need multiple threads on multiple cores to saturate your bandwidth. At least Xeons do. A single load-store buffer cannot hold enough in-flight memory operations, for example. - If the cache is getting thrashed, it's doing a ton of flushes and prefetches, which is memory bandwidth you won't see unless you use hardware counters. Good memory benchmarks use uncached memory operations to avoid this.
|
# ¿ Oct 12, 2020 19:13 |
|
A new experimental architecture: https://arxiv.org/abs/2010.06277 gently caress caches, massive threading.
|
# ¿ Oct 16, 2020 10:43 |
|
With Dalian sold off, the same bean counters can also go brag to their republican buddies that they retreated from China.
|
# ¿ Oct 20, 2020 08:27 |
|
Intel should have added the Lena encoder and it would have won in every benchmark tbh.
|
# ¿ Nov 28, 2020 00:27 |
|
screamin and creamin posted:So now the Intel chuds are posting their hot takes that it's all just about video and image encoding. Even the generalized benchmarks are all rigged!!!!!!!!!!!!!!! I think there are some jokes that have gone over your head in your rush to post Intel chud takedown posts. https://www.dangermouse.net/esoteric/lenpeg.html
|
# ¿ Nov 29, 2020 11:41 |
|
The CPU will speculate on which branch will be taken, as determined by the branch predictor from historical data (or compiler hints; an untaken assert is virtually free). It will not compute both branches at the same time, at least not in the backends I've worked with. I think GPUs do that in a certain way to handle divergent warps. The vast majority of branches are correctly predicted. Mispredicts happen mostly in situations where the branch depends on dynamic data, such as a sorting routine or string search. A 100% taken/untaken branch is basically free. If the branch predictor is not seeing a trend (something like a 90 or 80% threshold) it will not cause speculation. You can test this yourself; the worst cases are the 80/20 or 70/30 ones, because they will sometimes trick the predictor with a sequence above the threshold and then cause mispredicts. In short, that 128/256 AVX throttling due to mispredicts is not something you will get in the wild, unless you are doing some really weird stuff.
|
# ¿ Jan 6, 2021 20:27 |
|
When has the top bin segment ever made sense for Intel CPUs? It only does for 'price insensitive' consumers.
|
# ¿ Jan 18, 2021 13:34 |
|
Speaking of companies poaching Intel engineers: https://www.bizjournals.com/portland/news/2020/12/31/microsoft-corp-lease-85-000-square-feet-hillsboro.html
|
# ¿ Jan 18, 2021 13:35 |
|
Quoting from HN comments: The report is all speculation, not some credible source leak. edit: I'm quoting it because it's factually correct? The source of the 'news' is a report from a market analysis firm, speculation from before the CEO change. Beef fucked around with this message at 01:41 on Jan 21, 2021 |
# ¿ Jan 21, 2021 01:35 |
|
At first I'm excited about Intel doing another high-performance CPU project, but then I realize we're only going to hear about it in 3 to 5 years, if at all. There's the graph processor Intel is working on, but I doubt it's something the general public is going to be programming.
|
# ¿ Jan 21, 2021 11:31 |
|
ConanTheLibrarian posted:Any recollection about how long it typically took for the new mem tech's prices to settle down to something reasonable? Since next year's CPUs will be the first to use DDR5, I've mentally pencilled in the follow gen as a decent time to buy in, but that's assumes a quick enough drop in RAM prices. Depends on https://en.wikipedia.org/wiki/List_of_earthquakes_in_Taiwan
|
# ¿ Jan 30, 2021 13:38 |
|
My guess is that we will see it used in practice a bit like with the mobile cores, sort of as a power state thing. Low load: small cores used; high load: big cores used. Dark silicon's a bitch
|
# ¿ Feb 2, 2021 16:08 |
|
So is performance mounted in the big balls or the small balls.
|
# ¿ Feb 4, 2021 00:27 |
|
Death to marketing and sales departments, I say. And pass the savings on to YOU!
|
# ¿ Feb 24, 2021 22:12 |
|
Have there been any known attacks that used any of the speculative shenanigans? Hardware attack vectors seem like such a small surface compared to the bazillion software/networking vulnerabilities out there.
|
# ¿ Feb 26, 2021 22:20 |
|
I fully expect all leading-edge CPUs to move to some kind of separate-node IO blocks. IO blocks barely scale with transistor density, and IO bandwidth demands for interconnects, PCIe etc. are not going to stop growing. That means more and more surface of your server or desktop die is going to be IO, meaning that the gain from being at the expensive leading edge is decreasing. imec and TSMC are putting out a bunch of technologies to integrate multiple dies, like wafer-on-wafer, wafer-on-chip or whatever. These seem to be getting either cheaper or faster than the current through-silicon-via tech, so that addresses some of the downsides of IO chiplets. GloFo is betting the house on it, specializing in IO-block processes; they seem to be doing well and are expanding their fab capacity. Like memory, we might even see a schism in process technology?
|
# ¿ Mar 17, 2021 12:36 |
|
BlankSystemDaemon posted:Why yes, everyone definitely wants to deal with NUMA penalties on single-CPU systems! Ah yes the famous Zen NUMA penalty that is making it an inferior product to Intel's. What makes you think that socket interconnects are in any way like silicon on silicon links? The IO blocks are the things that drive the uncore. A chiplet means that the IO blocks driving the uncore are linked with the core through bumps or through-silicon-via or whatever, instead of being linked through metal layers directly. Latencies are in the same order of magnitude.
|
# ¿ Mar 17, 2021 14:31 |
|
BlankSystemDaemon posted:You can't tell me taht something like this isn't going to have an impact: I was mostly talking IO chiplets, where the extra few ns don't mean much compared to a PCIe/CXL wire length. The picture you show is pretty much NUCA-caches.jpg. You will see a similar pattern in modern Xeons, within the same order of magnitude. And Xeons do not even scale yet to that many cores. We're already in the age of NUCA caches where you miss your L3 on core 3, need to hop to the directory on core 0 and it tells you to go and fetch the line from core 8 on the other side of the die, two hops away on your on-die mesh network. The few extra ns from going chiplet-to-chiplet are adding a few ns in the cases where you're already hosed by multiple hops and extra wire length. If doing chiplets means you can cram more L3 cache on package, you're probably compensating well enough. Will a 64-core monolith have better L2/3 latencies than a 64-core chiplet design on average? Sure, probably, but you will be paying through the nose for that extra percent. Are you talking about the OS scheduler? NUCA caches have been a thing for a while, so I imagine it's already aware. I've definitely already manually used numactl to pin threads to sibling cores on Xeons to avoid sharing caches with the other side of the mesh.
|
# ¿ Mar 17, 2021 23:40 |
|
PCjr sidecar posted:The first gen phis that never got out of the labs ran freebsd, because Larrabee did. Second gen which shipped in volume was a PCIe card and ran Linux. Third gen was linux and standalone. It was fun to ssh into a KNL and just have CentOS running. We had to modify htop to fit all the CPUs on screen. It would have had an impact if it hadn't launched 2 years late.
|
# ¿ Mar 17, 2021 23:56 |
|
Regarding the chiplet discussion: is there some public information on how much direct extra latency they cause?
|
# ¿ Mar 19, 2021 21:02 |
|
But that's cache latency. There are so many factors that come into play that are not chiplet-related. I mean something like: an interposer adds X ns compared to a regular metal layer.
|
# ¿ Mar 20, 2021 03:03 |
|
That's again cache latency tests though. I'll drop the question; it looks like one that can only be answered by silicon design guys. We don't really have a way to test the same architecture with and without chiplets. With chiplets it's going to be at higher core counts, which means a mesh/uncore that has to scale further and adds extra latency. The LLC is also going to be larger and a distributed NUCA cache. All of that adds latency that is technically unrelated to routing through an interposer metal layer.
|
# ¿ Mar 20, 2021 09:11 |
|
Instructions get fused in ARM and uops get fused on Intel, further complicating things. It is just not useful anymore to use the terms RISC and CISC to describe CPU architectures.
|
# ¿ Mar 21, 2021 10:22 |
|
BobHoward posted:I'm still fond of John Mashey's old Usenet postings on how and why to classify something as RISC or CISC. There are some points which are aging but most of it is timeless. Thanks, that was an interesting read! I also have the impression that most of the weird CISCy instructions like quote:4a) Have NO operations that combine load/store with arithmetic, My first thought was that this aged badly. But actually, now that I think about it, macro/micro-op fusion is actually enabling simplifications in the ISA such as this. E.g. engineer A worked on a MC that can do indexed loads, so he pushes to add an indirect/indexed load instruction to the ISA. Engineer B tells A he's a dumbass for wanting to grow the ISA, as they can simply add a macro-op fusion rule to the frontend to fuse add+load. Personally, I don't think it's fair to call a modern x86 a CISC architecture; in practice the asm produced by a modern compiler/OS looks pretty damned RISCy to me. Is there anything in common use in x86-64 that could still be considered CISCy?
|
# ¿ Mar 22, 2021 11:22 |
|
BlankSystemDaemon posted:Data-oriented design, whether using C++ as in a lot of video games, rust, or any other language, implicitly means that you have to understand the architecture to write effective code. This isn't what you asked, but the reference work for processor architecture and design is the Hennessy and Patterson: https://www.amazon.com/Computer-Architecture-Quantitative-John-Hennessy/dp/012383872X I'm glad data-oriented design has become popular, but you always had to have an understanding of modern CPU architecture to write effective code. I'm afraid of getting to that stage of the hype cycle where a ton of people are just cargo culting the gently caress out of it. That said, I can't wait to see people write branchless everything 'because branches are EVIL'. fake edit: mandatory branchless fizzbuzz https://gist.github.com/andrewgho/6a37d71d9848896af12a Kazinsal posted:The lea instruction is so versatile that I'm surprised it's not turing complete. It's basically a mov that lets you do addressing mode calculations as if you're dereferencing a bunch of pointer arithmetic, but without the actual dereference. I'm not sure how much it's used in modern compiler output (note to future-self: I should try poking at godbolt a bit if tomorrow's slow at work to confirm this) because, like you said, other instructions probably fuse into sequences of uops better, but drat if it's not fun to fiddle with. I once asked the same question, along the lines of "why isn't this an add". IIRC it's still heavily used because the frontend has a specialized port or something to deal with it, freeing up the rest and making the whole thing that tiny bit faster. It doesn't make any difference for the backend.
|
# ¿ Mar 22, 2021 20:07 |
|
Mostly clickbait. He says it in the conclusion as well, but in the context that it sits in the same in-between hole as the 3600. Barely better than the more affordable SKUs, not enough performance for high-productivity work.
|
# ¿ Mar 23, 2021 13:28 |
|
https://www.anandtech.com/show/1657...n-return-of-idf Doubling down on IDM model. Chiplet products incoming.
|
# ¿ Mar 23, 2021 23:07 |
|
To be fair to Intel, it wasn't obvious that EUV would reach maturity in time for their 10nm. I was working at imec years back, and EUV was a running joke: always a year or two in the future, for a decade or so. Intel was a large investor in ASML and EUV tech, but simply chose to do it later rather than sooner. It was still a mistake and hubris to think that quad patterning and new materials would make that 10nm node viable, but that's hindsight.
|
# ¿ Apr 3, 2021 09:35 |
|
|
JawnV6 posted:I worked on a cpu team and there's just no good way to simulate architectural improvement effects in that long-term steady-state kind of thing. There have been developments in the past decade on architectural simulators, and there are simulation techniques now that do that kind of analysis successfully, e.g. Graphite or Sniper. We're still talking 1000x slowdown compared to hardware, but at least you can simulate a few bil cycles of a multi-socket multi-core. However, slow-as-gently caress cycle-accurate simulators are still the architects' bread and butter; it's hard to make them trust anything else. You can bet they will have to start trusting other simulation techniques if they want to start moving beyond *spits on the floor* SPEC benchmark workloads.
|
# ¿ Apr 15, 2021 20:38 |