WhyteRyce
Dec 30, 2001

but wouldn't you want to know you flagged a bunch of uncorrectable errors when you come out of S3????

I'm pretty sure there were multiple times a customer came back and asked why they saw errors during their qual, and we'd trace it back to someone shutting off some checker somewhere and not telling anyone

WhyteRyce fucked around with this message at 18:10 on Aug 14, 2023


JawnV6
Jul 4, 2004

So hot ...

WhyteRyce posted:

I was once told by a manager not to try to run some tests until B-step because we can't expect it to work on A-step, and I couldn't get him to see the problem with that logic

Sorry for the stream of nonsensical rants but I have strong feelings

lol aside from the obvious that's been covered.... what if it passes!??!?!?

like sure don't throw a bunch of bugs at pre-si to bat away with "known issue" and all that churn but if it's supposed to fail... it sure would be nice if the content actually stressed that bit and failed?? which we should know in the a-step time frame???

hobbesmaster
Jan 28, 2008

WhyteRyce posted:

one interface owner once set the dropped packet threshold to 100% to get his tests running. I forget when they finally changed it

when I started someone told me to remove the checker that reads back PCIe AERs because it always failed for him when exercising power management states

i225?

Don’t tell me otherwise; it’s too funny
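
For the curious, that checker doesn't have to do much. Here's a minimal sketch, assuming a Linux host and the standard aer_dev_* sysfs counters (which not every platform exposes): snapshot the AER counts around the suspend/resume cycle and flag anything that moved, instead of just deleting the check.

code:

from pathlib import Path

# Assumes a Linux kernel that exposes per-device AER counters in sysfs
# (aer_dev_correctable/nonfatal/fatal); devices without AER just lack the files.
AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")

def snapshot():
    """Grab every AER counter the kernel exposes: {(device, severity, error): count}."""
    counts = {}
    for dev in Path("/sys/bus/pci/devices").iterdir():
        for fname in AER_FILES:
            path = dev / fname
            if not path.exists():
                continue
            for line in path.read_text().splitlines():
                parts = line.split()
                if len(parts) == 2 and parts[1].isdigit():
                    counts[(dev.name, fname, parts[0])] = int(parts[1])
    return counts

def regressions(before, after):
    """Any counter that moved across the power-management cycle is a failure."""
    return [(k, before.get(k, 0), v) for k, v in after.items() if v > before.get(k, 0)]

if __name__ == "__main__":
    before = snapshot()
    input("run your S3 / suspend-resume cycle now, then press enter... ")
    for (bdf, severity, err), old, new in regressions(before, snapshot()):
        print(f"FAIL {bdf} {severity} {err}: {old} -> {new}")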

Beef
Jul 26, 2004
I'm not anywhere near production, but on a prototype we tried to do extensive full-application testing on various simulators/emulation/FPGA before taping because we knew there was only budget for one stepping.
There is a ton of dumb poo poo we caught, but the amount of dumb poo poo some on-contract design engineer will put in that on-contract validation engineers will never check because it's not on the checklist is staggering.

Somehow, I still had to find out, running my software on silicon, that through all those layers of software no one had ever thought that someone would pass size=0 to a DMA instruction.

Less-dumb issues we found on silicon were usually in the bucket of "poo poo going wrong when some buffer is full". Is it not done in pre-Si validation to cover the various states and transitions for those "shitter's full, backpressure the poo poo" scenarios?

WhyteRyce
Dec 30, 2001

Beef posted:

I'm not anywhere near production, but on a prototype we tried to do extensive full-application testing on various simulators/emulation/FPGA before taping because we knew there was only budget for one stepping.

Touched on earlier but one of Intel's problems was that future steppings were just assumed. And that trickled down to bug finding and feature enablement.

I'm long gone from there, but one of the issues my old co-workers would describe was people trying to beat into the heads of worthless middle management and lazy engineers that you can't assume whoever you're getting to fab your product will just let you crank out 6-8 steppings and try to ship something on C2. But I guess decades of habit and learnings are hard to break

quote:

Somehow, I still had to find out, running my software on silicon, that through all those layers of software no one had ever thought that someone would pass size=0 to a DMA instruction.

I literally hit a bug on something because of something like that. The person writing the actual industry spec said he never explicitly forbade that scenario, because if he had to write out everything you couldn't do then the spec would be too long (:wtf:). No one else down the chain, from the technical leads to the domain owners, thought to question the guy writing the spec. Thinking it was bullshit, I got a customer to weigh in on it; they went ape poo poo, and the same people who'd told me it wasn't allowed then acted like paleontologists who had just discovered a dinosaur egg themselves.

WhyteRyce fucked around with this message at 23:18 on Aug 14, 2023

ijyt
Apr 10, 2012

BIG HEADLINE posted:

So apparently the tradeoff for the 14700K's performance bump is an additional 30W of juice: https://www.guru3d.com/news-story/core-i7-14700k-raptor-lake-refresh-cpu-faster-but-uses-30w-more.html

lol

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

Heh, I don't get the point of more than 8 e cores, surely they just waste space and take thermal/power headroom from the p cores. If your background tasks need more than those 8 e cores then poo poo, that's a hell of an "idle" load for a desktop

HalloKitty fucked around with this message at 20:09 on Aug 15, 2023

Beef
Jul 26, 2004

HalloKitty posted:

Heh, I don't get the point of more than 8 e cores, surely they just waste space and thermal/power headroom from the p cores. If your background tasks need more than those 8 e cores then poo poo, that's a hell of an "idle" load for a desktop

Cinebench score

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Beef posted:

Cinebench score

The tragedy of Cinebench's popularity is that CPU raytracing is such a totally irrelevant workload for 99% of PC enthusiasts. I dunno how serious you were being here, but if serious, I agree - I think Intel probably does make product decisions to juice Cinebench scores rather than provide real value to users

Beef
Jul 26, 2004
I was being a bit flippant, but the grain of truth is there. When architectures are being planned or designed, you cannot run a full Cinebench or similar. But historically the cycle-accurate simulation tools used to drive design decisions could only do short sequential execution traces; good luck catching any kind of higher-order behavior there. So SPECint-type benchmarks were king, still as irrelevant to PC users as Cinebench.

Methods, attitudes and tools have been changing, but with the long design pipelines of modern CPUs I can bet that the Rocket Lake refresh has been designed chiefly against the SPEC benchmarks. Meteor Lake and beyond is probably where we will see the effect of any changes due to competition, Jim Keller, and Pat.

Josh Lyman
May 24, 2009


WhyteRyce posted:

Touched on earlier but one of Intel's problems was that future steppings were just assumed. And that trickled down to bug finding and feature enablement.

I'm long gone from there, but one of the issues my old co-workers would describe was people trying to beat into the heads of worthless middle management and lazy engineers that you can't assume whoever you're getting to fab your product will just let you crank out 6-8 steppings and try to ship something on C2. But I guess decades of habit and learnings are hard to break
That's kinda hosed to have hardware errors that you're just planning to fix in future steppings. What happens to people who spent $500 of their hard earned dollars on earlier steppings? Are they just stuck with a buggy chip?

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



I’m assuming they’d try to fix as much as they could via microcode updates later?

VorpalFish
Mar 22, 2007
reasonably awesome™

HalloKitty posted:

Heh, I don't get the point of more than 8 e cores, surely they just waste space and thermal/power headroom from the p cores. If your background tasks need more than those 8 e cores then poo poo, that's a hell of an "idle" load for a desktop

If your workload scales with core count you're probably better off with e cores than p cores, no? Given it's supposed to be 25% of the die area for 33% of the performance, many workloads that actually scale past 8 cores should be better off with more e cores if the ratio is approximately 4:1.

They aren't there just for "background tasks" they're there to improve multicore performance beyond what would be possible with just p cores.
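
Back-of-the-envelope version of that ratio, using the rough figures above (one e-core at ~1/4 the area and ~1/3 the throughput of a p-core; these are the post's approximations, not measured numbers):

code:

# normalize a p-core to 1.0 area / 1.0 throughput
P_AREA, P_PERF = 1.0, 1.0
E_AREA, E_PERF = 0.25, 0.33   # rough e-core figures quoted above, not measurements

e_cores_per_p_area = P_AREA / E_AREA          # 4 e-cores fit in one p-core's footprint
cluster_perf = e_cores_per_p_area * E_PERF    # ~1.32x a single p-core's throughput
print(f"{e_cores_per_p_area:.0f} e-cores in a p-core's area -> {cluster_perf:.2f}x the throughput")
# only pays off if the workload actually scales across those extra threads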

ijyt
Apr 10, 2012

BobHoward posted:

The tragedy of Cinebench's popularity is that CPU raytracing is such a totally irrelevant workload for 99% of PC enthusiasts. I dunno how serious you were being here, but if serious, I agree - I think Intel probably does make product decisions to juice Cinebench scores rather than provide real value to users

Wait do people actually look at the scores?? I thought it was just for a quick and easy full load test.

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

On the datacenter side that’s the pitch for Sierra Forest; 144 E cores per socket. Is that going to be faster or cheaper than Bergamo?

power crystals
Jun 6, 2007

Who wants a belly rub??

ijyt posted:

Wait do people actually look at the scores?? I thought it was just for a quick and easy full load test.

How else are you gonna be smug that your PC is 1% faster than your friend's?

FuturePastNow
May 19, 2014


the only score I use Cinebench for is the CPU temperature

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.

in a well actually posted:

On the datacenter side that’s the pitch for Sierra Forest; 144 E cores per socket. Is that going to be faster or cheaper than Bergamo?

Is that also to fend off ARM server challengers?

JawnV6
Jul 4, 2004

So hot ...

Josh Lyman posted:

That's kinda hosed to have hardware errors that you're just planning to fix in future steppings. What happens to people who spent $500 of their hard earned dollars on earlier steppings? Are they just stuck with a buggy chip?

? it's right there in the post you quoted, "try to ship on C2." No consumers are getting B0's, you're tilting at an imaginary windmill. It's only a pain for the post-si folks who have to do big feature validation in a compressed time frame.

that said, steppings got weird before I left. I sorta understood "X0," less so the '/prime ones that seemed more about rear end-covering than technical descriptions.

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.
In my experience with datacom parts, if it went to C0 it was a huge fuckup; B0 should be the production parts. Nowhere near the complexity of a server-class CPU though.

At a previous company one part before my time went to F and it was spoken of in hushed tones as people tried to forget the shame

WhyteRyce
Dec 30, 2001

JawnV6 posted:

? it's right there in the post you quoted, "try to ship on C2." No consumers are getting B0's, you're tilting at an imaginary windmill. It's only a pain for the post-si folks who have to do big feature validation in a compressed time frame.

that said, steppings got weird before I left. I sorta understood "X0," less so the '/prime ones that seemed more about rear end-covering than technical descriptions.

On one project the actual steppings were getting dangerously close to the sku designations and I was wondering what happens if the streams cross

Prime was an asscovering on the project I was on, but that kind of got blown out of the water because they immediately had to do a real stepping right after it. I might be misremembering, but I swear they even made the stepping ID the same and you had to look elsewhere to get the prime designation, so it was a pain in the rear end for some of our test scripts

X0 was a rear end-covering too, right? As in, "hey we can't get this important thing ready in time for tape-in and then we end up blocking a whole bunch of people, so let's make up a new stepping designation to redefine what A0 is so that we can say we were actually ready on the new A0"

priznat posted:

In my experience with datacom parts, if it went to C0 it was a huge fuckup; B0 should be the production parts. Nowhere near the complexity of a server-class CPU though.

At a previous company one part before my time went to F and it was spoken of in hushed tones as people tried to forget the shame

I was once on a project where we found a bug, but because it didn't fail all the time on all the parts, the designer was convinced they could just implement a screen. There was another stepping already planned too, and he could easily have gotten the fix into it in time, but it "had timing implications". We later had to do another stepping literally just for that issue and only that issue

WhyteRyce fucked around with this message at 19:12 on Aug 15, 2023

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

VorpalFish posted:

If your workload scales with core count you're probably better off with e cores than p cores, no? Given it's supposed to be 25% of the die area for 33% of the performance, many workloads that actually scale past 8 cores should be better off with more e cores if the ratio is approximately 4:1.

They aren't there just for "background tasks" they're there to improve multicore performance beyond what would be possible with just p cores.

It's a poor match for desktop use though, and it's also made worse by the fact they're cut down feature wise, so no AVX-512. What's the use case, really?
If you actually need tons of threads and your application scales well, and performance is critical, you're not running that on a desktop with dual-channel RAM, you're building a workstation or more likely getting a server or servers somewhere else to do it.

The desktop is going to be best served with beefy cores to power through less well threaded tasks, and a number of cores that can handle background tasks to keep power usage way the hell down when doing nothing or when doing light tasks.

Anyway, just my stupid thoughts

HalloKitty fucked around with this message at 20:17 on Aug 15, 2023

Potato Salad
Oct 23, 2014

nobody cares


HalloKitty posted:

It's a poor match for desktop use though, and it's also made worse by the fact they're cut down feature wise, so no AVX-512. What's the use case, really?
If you actually need tons of threads and your application scales well, and performance is critical, you're not running that on a desktop with dual-channel RAM.

The desktop is going to be best served with beefy cores to power through less well threaded tasks, and a number of cores that can handle background tasks to keep power usage way the hell down when doing nothing or when doing light tasks

I agree with you entirely. I do want to try to establish that there is a mid-to-high-performance middle ground here. I'm not concerned with somebody who is going to be running some headline application like AutoCAD or FinalCut on a non-WS laptop, because those kinds of workers usually know they need to heft around some beef to get things done.

I'm thinking about the sheer number of applications that desktop productivity users genuinely will be running these days combined with the number of identity, security, configuration, and other agents that will be running on a machine in an environment with any kind of security team whatsoever....

ramping up that e core count has made a night-and-day change in the performance and responsiveness of my laptop fleet

Potato Salad fucked around with this message at 20:23 on Aug 15, 2023

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.

WhyteRyce posted:

I was once on a project where we found a bug, but because it didn't fail all the time on all the parts, the designer was convinced they could just implement a screen. There was another stepping already planned too, and he could easily have gotten the fix into it in time, but it "had timing implications". We later had to do another stepping literally just for that issue and only that issue

Yikes, would it have pushed the tapeout? Is that why he decided not to, or was it just pure hubris?

The combo of pre-silicon emulation, a mad rush to interop as much as possible immediately after A0 parts come back (through several firmware revisions), continuing emulation all the while to catch any fallout, AND having a good support team to feed parts + devkits to trusted customers/partners who relay sightings has been the methodology that's been fairly consistent across all the successful B0 parts... and the occasional metal rev here and there :haw:

Cygni
Nov 12, 2005

raring to post

I personally think ecores make great sense for desktop. For background tasks, they can complete those tasks with less power budget use and less heat output, which leaves more power/thermal budget for the king kong kores to run the physics on the 3D animated phallus on screen. Combined with the modern boosting behavior of "just keep going until you hit a thermal or power limit", you can get more gaming performance out of a given size of silicon.

Where I think it starts shifting into the realm of marketing is going from a handful of e-cores to a boatload of e-cores on desktop. I understand that not all of us came up with an 80x86 and have made it a habit of turning literally every single background program and process off prior to gaming, and I've seen the people gaming with like 10 background programs running, 40 YT tabs, and a systray that goes all the way across the taskbar. But I have to feel that having 12+ e-cores is more to win on graphs than it is to provide the end user an actual benefit.

If you had a real world, money making reason to be crunching huge multicore datasets, you probably wouldn't be running it on your home gaming computer.

So yeah, my uneducated hobbyist take is that some ecore good, but I'm not convinced the 12+ e-cores is for more than graphs.

VorpalFish
Mar 22, 2007
reasonably awesome™

HalloKitty posted:

It's a poor match for desktop use though, and it's also made worse by the fact they're cut down feature wise, so no AVX-512. What's the use case, really?
If you actually need tons of threads and your application scales well, and performance is critical, you're not running that on a desktop with dual-channel RAM, you're building a workstation or more likely getting a server or servers somewhere else to do it.

The desktop is going to be best served with beefy cores to power through less well threaded tasks, and a number of cores that can handle background tasks to keep power usage way the hell down when doing nothing or when doing light tasks.

Anyway, just my stupid thoughts

That's the thing though: if your application doesn't scale well and needs high per-core performance, well... you don't need more p cores either.

Adding e cores over p cores makes all the sense in the world if you already have 8 p cores. If your application scales, you get more out of the e cores and if it doesn't, well more cores isn't helping.

Ditching AVX-512 does suck, but that's a different scenario altogether. The original comment was why add e cores after 8, not that they should have stayed with homogeneous cores entirely. If you're going to have e cores at all, and you're losing AVX-512 no matter what, it makes sense to add them over p cores once you have enough p cores to feed the desktop applications that are hungry for per-core performance.

Cygni posted:

I personally think ecores make great sense for desktop. For background tasks, they can complete those tasks with less power budget use and less heat output, which leaves more power/thermal budget for the king kong kores to run the physics on the 3D animated phallus on screen. Combined with the modern boosting behavior of "just keep going until you hit a thermal or power limit", you can get more gaming performance out of a given size of silicon.

Where I think it starts shifting into the realm of marketing is going from a handful of e-cores to a boatload of e-cores on desktop. I understand that not all of us came up with an 80x86 and have made it a habit of turning literally every single background program and process off prior to gaming, and I've seen the people gaming with like 10 background programs running, 40 YT tabs, and a systray that goes all the way across the taskbar. But I have to feel that having 12+ e-cores is more to win on graphs than it is to provide the end user an actual benefit.

If you had a real world, money making reason to be crunching huge multicore datasets, you probably wouldn't be running it on your home gaming computer.

So yeah, my uneducated hobbyist take is that some ecore good, but I'm not convinced the 12+ e-cores is for more than graphs.

This isn't really an argument against e-cores so much as it is pointing out that most desktop users don't need high-core-count CPUs, period. In the same vein, it would be silly for someone who mostly games to buy, say, a 7950X. If you're a typical desktop user, yes, your needs are probably met by a 6-8 core CPU very comfortably.

VorpalFish fucked around with this message at 23:41 on Aug 15, 2023

Methylethylaldehyde
Oct 23, 2004

BAKA BAKA

Beef posted:

I'm not anywhere near production, but on a prototype we tried to do extensive full-application testing on various simulators/emulation/FPGA before taping because we knew there was only budget for one stepping.
There is a ton of dumb poo poo we caught, but the amount of dumb poo poo some on-contract design engineer will put in that on-contract validation engineers will never check because it's not on the checklist is staggering.

Somehow, I still had to find out, running my software on silicon, that through all those layers of software no one had ever thought that someone would pass size=0 to a DMA instruction.

Less-dumb issues we found on silicon were usually in the bucket of "poo poo going wrong when some buffer is full". Is it not done in pre-Si validation to cover the various states and transitions for those "shitter's full, backpressure the poo poo" scenarios?

You'd think one of the low-hanging-fruit tests would be setting every value that has a 'buffer size must be 1 through 65534 inclusive' constraint to 0, 65535, -1, etc. just to see if whatever hardware logic is used throws an exception, throws a fit, or throws a segfault at you.

How many hours does it take a full-fat processor sim to even get to the point where the PCIe bus fully initializes and you could pass that DMA size=0 instruction to it? Or do you have to use a 'Turing-complete spherical cow in a vacuum' stand-in for 95% of the processor, and only test the one part of the chip responsible for the PCIe endpoint?
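
Something like this is all that boundary sweep amounts to; a pytest-style sketch, where program_dma() is a hypothetical stand-in (here just a toy software model) for whatever register poke or driver call the real environment would use:

code:

import pytest

VALID_MIN, VALID_MAX = 1, 65534   # "buffer size must be 1 through 65534 inclusive"

def program_dma(size: int) -> str:
    """Toy software model standing in for the DUT: accept in-range sizes, reject the rest.
    On silicon/emulation this would be the real driver call or register write."""
    return "ok" if VALID_MIN <= size <= VALID_MAX else "bad-size"

@pytest.mark.parametrize("size", [0, -1, VALID_MIN, VALID_MAX, VALID_MAX + 1, 2**16, 2**32 - 1])
def test_dma_size_boundaries(size):
    result = program_dma(size)
    if VALID_MIN <= size <= VALID_MAX:
        assert result == "ok"
    else:
        # out-of-range sizes should be rejected cleanly, not hang or corrupt anything
        assert result == "bad-size"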

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

ijyt posted:

Wait do people actually look at the scores?? I thought it was just for a quick and easy full load test.

There are many people out there who are convinced that Cinebench is a gold standard CPU performance benchmark. I have interacted with them. It's weird.

Josh Lyman
May 24, 2009


JawnV6 posted:

? it's right there in the post you quoted, "try to ship on C2." No consumers are getting B0's, you're tilting at an imaginary windmill. It's only a pain for the post-si folks who have to do big feature validation in a compressed time frame.

that said, steppings got weird before I left. I sorta understood "X0," less so the '/prime ones that seemed more about rear end-covering than technical descriptions.
I remember in the Athlon Thunderbird days that new steppings, notably AXIA and later AYHJA, were highly desired because they overclocked better. Am I misunderstanding steppings or is this no longer done?

repiv
Aug 13, 2009

BobHoward posted:

The tragedy of Cinebench's popularity is that CPU raytracing is such a totally irrelevant workload for 99% of PC enthusiasts.

what's more is that the rendering engine that Cinebench is based on is basically on life support; it hasn't had any significant development in years and most users have moved on to alternatives AFAICT

one of the current big names in CPU rendering has their own standalone benchmark similar to cinebench but reviewers haven't caught on yet

https://corona-renderer.com/benchmark

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Beef posted:

I was being a bit flippant, but the grain of truth is there. When architectures are being planned or designed, you cannot run a full Cinebench or similar. But historically the cycle-accurate simulation tools used to drive design decisions could only do short sequential execution traces; good luck catching any kind of higher-order behavior there. So SPECint-type benchmarks were king, still as irrelevant to PC users as Cinebench.

Cygni posted:

Where I think it starts shifting into the realm of marketing is going from a handful of e-cores to a boatload of e-cores on desktop.

Cygni's thing here is more what I was going for, Beef. Was thinking at the macro level, not the microarchitectural. It's easy to predict that a trivially parallelizable benchmark scales well with core count, and Intel seems to be basing some high level decisions on that.

Intel's P cores have grown too big and hot (thanks to chasing SPECint single thread) to put tons of them into a consumer part. That put Intel at risk of losing the simplistic parallel throughput benchmarks popular with PC hardware reviewers and influencers. What to do? Well, they've thrown in a ton of "E" cores (which are really more like parallel throughput cores) for seemingly little reason other than to win at Cinebench and the like.

Hasturtium
May 19, 2020

And that year, for his birthday, he got six pink ping pong balls in a little pink backpack.

BobHoward posted:

Cygni's thing here is more what I was going for, Beef. Was thinking at the macro level, not the microarchitectural. It's easy to predict that a trivially parallelizable benchmark scales well with core count, and Intel seems to be basing some high level decisions on that.

Intel's P cores have grown too big and hot (thanks to chasing SPECint single thread) to put tons of them into a consumer part. That put Intel at risk of losing the simplistic parallel throughput benchmarks popular with PC hardware reviewers and influencers. What to do? Well, they've thrown in a ton of "E" cores (which are really more like parallel throughput cores) for seemingly little reason other than to win at Cinebench and the like.

How much of a difference do the E cores make for code compilation and parallelized tasks outside of Cinebench and video encoding?

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Hasturtium posted:

How much of a difference do the E cores make for code compilation and parallelized tasks outside of Cinebench and video encoding?

The E cores are a big deal for compilation, enough to make the 12900K almost 50% slower than a 13900K:



This is a very rough comparison, but they're in the same ballpark as Skylake cores with worse vector performance. Would people be griping this much if they were socketing 2 extra 6700Ks onto their boards next to a P-core only main processor?

Edit: it's also worth noting that the E cores are way more power efficient, but that doesn't show in the current desktop parts because problems with voltage regulation and power delivery mean they're cranking the E core voltage to the moon.

Laptop parts are doing 2P+8E plus a GPU to boot in a 15W TDP. Hell, I'd buy a 2P+16E desktop part if Intel would sell it to me; in theory it should take similar die space to 6P alone and be way more power efficient.

Twerk from Home fucked around with this message at 02:51 on Aug 16, 2023

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

HalloKitty posted:

Heh, I don't get the point of more than 8 e cores, surely they just waste space and take thermal/power headroom from the p cores. If your background tasks need more than those 8 e cores then poo poo, that's a hell of an "idle" load for a desktop

the point is the same as AMD's c-core approach. you have performance cores to run latency-sensitive workloads (especially games) quickly, but at a certain point if your workload is scaling well to >16 threads or >32 threads, then the e-cores/c-cores are more area-efficient. even if a workload still demands a fast core for the hot threads, there often is stuff that can be dumped into a task queue and offloaded to something else. promise/future is a model that's been used fairly successfully in general, and task queues/executors/etc are often a convenient interface.
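
toy illustration of that task-queue/future pattern (python futures here, nothing game-engine shaped; the assumption is just that on a hybrid part the OS/Thread Director decides where the pool threads land, the code only pushes the batchable work off the hot path):

code:

import hashlib, os
from concurrent.futures import ThreadPoolExecutor

def background_task(blob: bytes) -> str:
    # stand-in for asset decompression, audio mixing, telemetry, etc.
    return hashlib.sha256(blob).hexdigest()

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    # dump the batchable work into the queue and get futures back immediately
    futures = [pool.submit(background_task, os.urandom(1 << 20)) for _ in range(32)]

    # ... latency-sensitive "hot thread" work happens here, unblocked ...

    results = [f.result() for f in futures]   # collect whenever convenient

print(len(results), "background tasks finished")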

you have to distinguish between intel's specific e-cores (and the processors that use them) and e-cores as a general approach. AMD's c-cores are still notionally e-cores, and will tackle that same idea of area efficiency for low-priority batch tasks. they just are doing their e-cores a specific way for certain business+technical reasons. apple has also not really made the e-cores work that great tbh, while they are small, what I've read from people who try to program on apple silicon macos is that getting stuck on an e-core is death for performance. Very good for battery life and background daemons/etc but don't do anything interactive with them.

tbh I think AMD's zen5c cores might be more generally applicable and perhaps even reasonably useful in games (beyond background workers for directstorage/etc). Alder/Raptor integrate the e-cores in a very weird way, every single e-core request has a fairly massive amount of latency. and actually, within an e-core CCX it's consistently the worst of all? :confused:



I'd be curious what the latency looks like on Tremont (N5100 or similar). Maybe it's another consequence of bolting a second processor onto the main platform and tremont has more sensible latency because it's not having to jump all the way back to the ringbus. the efficiency is also lovely because oops, DLVR doesn't work so you're running the e-cores at way higher voltage than is ideal. intel's still flailing badly overall on the integration front, AMD has just designed a lot better legos that plug together a lot more easily and scalably. infinity fabric is a nice abstraction for the communication port, the CCD itself just needs to implement the fabric link and then they just have an appropriately-sized switch with enough ports for the number of CCDs they want in that product. and in contrast every single intel product has a novel one-off communications bus, some of them are ringbus, some have e-cores bolted on, some don't have ringbus (tremont atom), some have mesh... and because everything is a one-off it constantly has problems even getting out the door.

alder lake is a product that so clearly follows conway's law and has a dozen conflicting objectives and goals from varying teams. getting golden cove out the door ASAP and regaining mindshare in the gaming community was critical, getting the transition to big.little started (and getting people used to gracemont/getting that product out in the world for experimentation) was critical. but half the groups didn't execute well, the plug seems to be pulled on AVX-512 indefinitely (?) for big.little configs, the DLVR being broken even after 2 years of work, etc. And this is the best product they could shake out of it. And a lot of the subsequent products have stumbled too and so these ones are lingering again. Intel's just not executing consistently, and again, I think the one-off nature of everything is part of that. Easiest way to build reliability is to build one tool and use it a lot, just like Soyuz/Proton/space-x. But Intel obviously doesn't have a winning formula nailed down yet, for the bits they have and the goals of their products.

Paul MaudDib fucked around with this message at 04:54 on Aug 16, 2023

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
latest MLID bullshit: rentable cores apparently are sort of a CMT-ish concept that Intel is moving to in the royal cores (designed by keller while he was at intel). it sounds like they're packaging a couple "base tier" cores in a CMT unit with shared resources (cache, accelerator/neural units, etc) along with a fast core. The goal being to keep the area down without sacrificing performance.

it's an interesting idea, my thought is still that this is good for keeping more balls in the air juggling. as long as you are not just absolutely saturating the units then you can have one thread running the hard bit while a couple other threads run their glue code or wait for memory access to complete etc. it is the same big.little concept except applied to the CCX itself. Most of the time when you have bulk load (saturating the units) it benefits more from the area efficiency of the e-cores (e-threads?) but also there is often a need for some stuff to run urgently and with high perf, and the p-thread handles that. But if you put them in the same CCX (CMT module) then transferring threads between the CMT threads could be fairly seamless.

the Thread Director™ was a little baffling at first but in hindsight I think intel is going to dive more deeply into this area of trying to introspect what's actually going on in the threads, to schedule them more efficiently for the actual available resources. especially with the pivot to big.little making scheduling more complex, there's value to be extracted there from scheduling inefficiencies.

Intel had 2 competing designs; one was SMT4-based and was abandoned in favor of this one (which is why there's a weird generation with no hyperthreading (it would have had SMT4) but also no rentable units).

Paul MaudDib fucked around with this message at 04:57 on Aug 16, 2023

Khorne
May 1, 2002

Paul MaudDib posted:

it sounds like they're packaging a couple "base tier" cores in a CMT unit with shared resources (cache, accelerator/neural units, etc) along with a fast core. The goal being to keep the area down without sacrificing performance.
What if we package multiple cores together and have them share infrequently used, expensive resources like the FPU? I bet we could achieve insane clock speeds and avoid complexities like SMT with this future-looking design.

Bridgedozer

If this architecture crushes zen6 it will be very entertaining given the history.

Khorne fucked around with this message at 07:35 on Aug 16, 2023

Dr. Video Games 0031
Jul 17, 2004

https://www.techpowerup.com/312447/intel-announces-termination-of-tower-semiconductor-acquisition

quote:

Intel Corporation (Nasdaq: INTC) today announced that it has mutually agreed with Tower Semiconductor (Nasdaq: TSEM) to terminate its previously disclosed agreement to acquire Tower due to the inability to obtain in a timely manner the regulatory approvals required under the merger agreement, dated Feb. 15, 2022. In accordance with the terms of the merger agreement and in connection with its termination, Intel will pay a termination fee of $353 million to Tower.

Oof.

Beef
Jul 26, 2004

Methylethylaldehyde posted:

You'd think one of the low-hanging-fruit tests would be setting every value that has a 'buffer size must be 1 through 65534 inclusive' constraint to 0, 65535, -1, etc. just to see if whatever hardware logic is used throws an exception, throws a fit, or throws a segfault at you.

How many hours does it take a full-fat processor sim to even get to the point where the PCIe bus fully initializes and you could pass that DMA size=0 instruction to it? Or do you have to use a 'Turing-complete spherical cow in a vacuum' stand-in for 95% of the processor, and only test the one part of the chip responsible for the PCIe endpoint?

As a software person, I also expected that. It's probably a case of the contractor just turning the design document into bullet points to test; they don't get paid more for going beyond that. If you are just checking output signals for given input signals, you might also not be aware that there is some state machine behind the scenes that you are not completely covering. That takes a designer.

As a software person, I'm also in a culture where the person writing the code is also the person writing the tests and doing the actual testing. You know the adage: "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live."
From what I heard, the culture is different for hardware engineers, where functional design teams are separate from validation teams. You can spot the perverse incentives from a mile away: "Hey, we worked long hours during the weekend and shipped the design ahead of time, collect your bonus!" (+ delays because it couldn't pass validation in time)

It's also a case of the "at load" bugs slipping through, because those are the types of errors not caught with simulation/emulation functional testing. E.g., a host-communication buffer never gets stressed during testing because the x86 host is running a few million times faster than the simulation.


Regarding simulation speeds: my guesstimate is on the order of a day, on a limited-model FPGA RTL-emulation platform. On a full-core cycle-accurate simulator you're looking at the order of years; you can't even realistically do full-die that way.
So yes, you use spherical cows. The more spherical your cow becomes in simulation, the more you have to trust that things work as documented or advertised.
There are system-level simulators like Simics that can boot an OS in a work day or two. Those still rely on functional simulators to execute the instructions, but in x86 space there's SDE that JITs it down to near-native speeds.

Beef fucked around with this message at 12:59 on Aug 16, 2023

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Hasturtium posted:

How much of a difference do the E cores make for code compilation and parallelized tasks outside of Cinebench and video encoding?

Compilation: it depends on the language you're compiling and the size of the project, but as Twerk from Home said, this is one of the things lots of Intel-style E cores should actually help with. (But it's not something a typical PC enthusiast ever does.)

Some specifics... C family languages usually have single-threaded compilers. However, medium to big projects contain many source files. So, the build system spawns N copies of the compiler in parallel, each one processing one source file. Every time one compiler process finishes, the build system spawns a replacement on the next source file in the queue, until there are no files left to process. Typically N is chosen to match the machine's core (or hardware thread) count.

After the compile phase comes the link phase, in which all the object files produced by the compile phase must be linked together into a single output executable. Much like the compilers, C family linkers have traditionally been single threaded, so this part of a build didn't scale at all. However, in recent years, there's finally been progress on multi-threaded linkers. As I understand it, these new linkers don't scale as well as compilation, but it is now possible to get some multi-core speedup rather than none.

Balancing all this: developers seldom compile an entire project from scratch. They modify a handful of files, then compile to test their work. Any competent build system will use source file modification time and dependency analysis to figure out exactly what actually needs to be recompiled. Unchanged source files mostly don't need to be, since there's already an object file on disk from past compile passes and unless it #included a changed header, recompiling won't produce different results.
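
Roughly what the build system is doing under the hood, sketched in Python. The real thing is make -j / ninja doing this properly; the directory, compiler name, and flags below are just placeholders.

code:

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SOURCES = sorted(Path("src").glob("*.c"))   # placeholder project layout
N = os.cpu_count() or 4                     # one compiler process per hardware thread

def needs_rebuild(src: Path, obj: Path) -> bool:
    # crude mtime check; a real build system also tracks #include dependencies
    return not obj.exists() or obj.stat().st_mtime < src.stat().st_mtime

def compile_one(src: Path) -> Path:
    obj = src.with_suffix(".o")
    if needs_rebuild(src, obj):
        subprocess.run(["cc", "-O2", "-c", str(src), "-o", str(obj)], check=True)
    return obj

# compile phase: embarrassingly parallel, scales with core count
with ThreadPoolExecutor(max_workers=N) as pool:
    objects = list(pool.map(compile_one, SOURCES))

# link phase: traditionally single-threaded (newer linkers like mold/lld parallelize it)
subprocess.run(["cc", *map(str, objects), "-o", "app"], check=True)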


Rocko Bonaparte
Mar 12, 2002

Every day is Friday!

Beef posted:

As a software person, I'm also in a culture where the person writing the code is also the person writing the tests and doing the actual testing.

I have heard of much smaller outfits having the hardware engineers also do the validation. It is not an alien concept there. I have also seen plenty of self-described software people who were not writing tests at all.

At the size of Intel, it definitely pays to have some experts that can do JTAG, oscilloscope, and logic analyzer magic. And integration along with physical issues is just a pain in the rear end. But yeah, tons of poo poo gets kind of thrown over a wall.

Regarding the full spectrum of testing and simulation options: "segfault" came up. More practically, for hardware, pants making GBS threads is probably a kernel panic or BSOD. If you got that far, you could be minutes into a boot in real time. Booting the OS is less the problem there than just eating up all this experimental BIOS code, any of which could poo poo its pants before you even have a chance to try min/max buffer sizes.
