BobHoward
Feb 13, 2012


SwissArmyDruid posted:

That is correct. Frames that take too long to render, followed by one or more "runt" frames that are already outdated by the time they are sent to the display that are cut off in mid-scan to be replaced by another frame as the renderer tries to catch up. One frame at 10 FPS + three frames at 120 divided by four = averages to an objectively above-average 92 FPS... but the stutter got you killed in the process.

The bolded part is not how game renderers typically work, FYI

What you seem to be thinking is that every N milliseconds, without fail, the game renderer tries to start rendering a frame based on a snapshot of world state at that moment. If any one frame takes longer than N ms to come out, that means the next frame past it is delayed and therefore outdated.

But most games don't start frames on a fixed schedule like that. Instead they're opportunistic: whenever the rendering thread finishes submitting commands to the graphics driver for frame #N, that's when it takes a snapshot of world state to begin drawing frame #N+1. (If there's some form of vsync, it may sleep for a bit before starting N+1.)

Usually, anyway. I'm not saying runt frames are impossible - there are millions of ways to solve this problem, and some of them could create effects like that - but the method I describe is common because it's easy to implement and works reasonably well.
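If it helps, here's a toy simulation of that opportunistic loop (all numbers and names are made up; it's not any real engine's code). Note that the frame after the slow one takes its world snapshot only after the slow frame finishes, so it isn't stale:

code:
/* Toy model of an opportunistic render loop (no fixed frame schedule). */
#include <stdio.h>

int main(void) {
    double render_ms[6] = { 8.0, 8.0, 100.0, 8.0, 8.0, 8.0 }; /* one slow frame */
    double clock_ms = 0.0;
    for (int frame = 0; frame < 6; frame++) {
        /* World state is snapshotted *now*, right before we build GPU commands
           for this frame -- not on some fixed tick that happened earlier. */
        double snapshot_time = clock_ms;
        clock_ms += render_ms[frame];   /* time spent building/submitting the frame */
        printf("frame %d: snapshot at %6.1f ms, submitted at %6.1f ms\n",
               frame, snapshot_time, clock_ms);
        /* If vsync were on, we'd also sleep here until the next vblank. */
    }
    return 0;
}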

Game world state updates (aka the physics tick, or physics FPS) may or may not be tied to graphics frame rate, depending on the game.


fake edit: I think you might be remembering the side effects of AFR multi-GPU? AFR can produce runt frames because the video driver is doing some queueing behind the application's back. IIRC the driver basically lies about frame completion status to get an app to start work on the next frame before the current one is actually done rendering, which leads to the problem you describe when the first frame takes an unexpectedly long time to complete but the second does not.

AFR bad, nobody should AFR


BobHoward
Feb 13, 2012

I'm stupid and missed the context then

BobHoward
Feb 13, 2012


Cygni posted:

I sorta remember it being a confluence of Creative putting Aureal out of business and killing 3D audio, Creatives drivers becoming bloated and unusable, and “sufficient” audio implementations of AC97 getting integrated into south bridges setting the stage for nobody bothering anymore. I guess you could say it was really just CPUs getting powerful enough to do audio in software as a side project, really.

It's this. Old school sound cards only had value as long as it was worthwhile to use a DSP to offload audio calculations. Once CPUs got fast enough to do the same functions with a tiny fraction of their compute power, soundcards became uncompetitive with the combo of a simple DAC and a standard universal software audio mixer delivered as part of the OS.

AC97's role: it is to audio DACs what AHCI is to SATA controllers - a standardized HW/SW interface that reduces or eliminates the need for a custom driver for each vendor's chip.

BobHoward
Feb 13, 2012


Icept posted:

What's the theoretical use case for the shame chiplet? Running all the Windows / OS / services / background stuff on it and devoting the big boy package to the foreground application?

Yes. Consider these approximations for Apple's M1 small cores relative to M1 big cores:

Area: ~0.25x
Power @ max freq: ~0.1x
Perf @ max freq: ~0.33x

The small cores have about 3.3x perf/W (:eyepop:) and 1.3x perf/area. You wouldn't want a chip with nothing but the small cores since high ST performance is quite important for general purpose computing, but having some small cores is awesome. Using less energy to run all those lightweight system threads frees up power to run the threads you want to go fast on the big cores.
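Back-of-envelope check of those ratios, if you want the arithmetic spelled out (these are my rough approximations from above, not official figures):

code:
/* perf/W and perf/area of a small core relative to a big core, using the
   approximate ratios quoted above. */
#include <stdio.h>

int main(void) {
    double area = 0.25, power = 0.10, perf = 0.33;   /* small core relative to big */
    printf("perf/W ratio:    %.1fx\n", perf / power); /* ~3.3x */
    printf("perf/area ratio: %.1fx\n", perf / area);  /* ~1.3x */
    return 0;
}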

That said, will AMD and Intel have small cores as good as Apple's? Seems very doubtful! Small cores are where you expect the advantages of a clean RISC architecture to be greatest, and Apple's been putting a lot of effort into their small core designs for a long time, while AMD and Intel have not.

And will Microsoft have a scheduler as good at using small cores as Apple's? Also doubtful.

BobHoward
Feb 13, 2012


Klyith posted:

If we lived in a universe where work was always easily divided over an arbitrary # of threads, Jaguar would have been good (and x86 would have been trounced by RISC 30 years ago).

Agree with your overall point but bolded makes no sense at all

BobHoward
Feb 13, 2012


Cygni posted:

Is there anyone smarter than me here to tell me whether or not I should be pissed that Pluton is built into the 6000 series APUs? The takes I’ve seen range from “it’s just a TPM” to “it’s a full access hardware privacy nightmare which hands all of your information to Microsoft, of all people”

Necroing this to note that an actual subject matter expert doesn't think it's likely to be a bad thing:

https://mjg59.dreamwidth.org/58125.html

(Matthew J. Garrett is a prominent Linux developer who's done a lot of work on Linux support for Secure Boot including use of TPMs, as well as various other things. So he knows what he's talking about, and would be more than willing to poo poo on Pluton if he thought it could do evil things.)

BobHoward
Feb 13, 2012


Harik posted:

It's an incredibly dumb idea. Notably, Epyc does not have a few MB of flash for a BIOS, so that's still on board in SPI NAND chips or whatever. It's utterly trivial to allocate 1k or whatever of ROM on flash chips when you're talking dell/hp/lenovo volume, so that's where the key should have lived.

Replacing a processor takes a screwdriver and a few minutes time. Replacing a surface-mount BGA takes complete disassembly and a specialized workstation. This isn't for security, it's for killing the secondary market.

True secure boot means verifying that every single instruction executed during boot comes from a trusted source. In practice, this means the processor (or, in this case, a security coprocessor - apparently an Arm Cortex-A5) has to validate the firmware's cryptographic signature before allowing any of it to be used, whether it's data or code. That means the processor needs its own local secure key storage, and a way to make that storage immutable (fuse bits).
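In code-shaped terms, the boot ROM's job is roughly this (a toy sketch with a fake checksum standing in for real hashing/signature verification, and invented names - not AMD's or anyone else's actual boot ROM):

code:
/* Toy model of a hardware root of trust: verify before executing anything external.
   The "digest" is a stand-in for real crypto (the fuses would hold e.g. a SHA-256
   of the vendor's public key, and the firmware would carry a real signature). */
#include <stdint.h>
#include <stdio.h>

static uint32_t toy_digest(const uint8_t *buf, size_t len) {
    uint32_t d = 2166136261u;                    /* FNV-1a: NOT a secure hash */
    for (size_t i = 0; i < len; i++) d = (d ^ buf[i]) * 16777619u;
    return d;
}

int main(void) {
    const uint8_t good_fw[]     = "vendor-signed firmware image";
    const uint8_t tampered_fw[] = "vendor-signed firmware imagE";

    /* Pretend this value was burned into on-die fuses at the factory. */
    uint32_t fused_trusted_digest = toy_digest(good_fw, sizeof good_fw);

    printf("good fw:     %s\n", toy_digest(good_fw, sizeof good_fw) == fused_trusted_digest
                                  ? "boot it" : "refuse to boot");
    printf("tampered fw: %s\n", toy_digest(tampered_fw, sizeof tampered_fw) == fused_trusted_digest
                                  ? "boot it" : "refuse to boot");
    return 0;
}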

You're proposing storing the key in the same R/W memory as the firmware. I do not think you know as much about secure systems design as you think you do.

BobHoward
Feb 13, 2012

You already anticipated the response: you can, in fact, replace an external flash chip. Or hack its pseudo-OTP region.

You seem to be starting from the assumption that nobody cares about making it extremely difficult (hopefully impossible) to boot untrusted software, and concluding that a secure boot system which puts the root of trust in the SoC itself must be motivated by other goals. What I am telling you is that this just isn't true. There are people who really care about minimizing the chance an attacker can compromise the root of trust, and in order to do that the root of trust pretty much has to live inside the SoC itself.

And yes, an attacker could replace the whole SoC. There are ways to detect that - e.g. store a private key in each SoC's secure key storage, and when a new node comes up on your network, have it sign a challenge to prove it holds a key your organization provisioned.
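The challenge-response part is bog standard, something like this (toy sketch: a made-up keyed checksum stands in for a real signature scheme, and all names are invented):

code:
/* Toy challenge-response attestation: only a part holding the provisioned key
   can produce the expected response to a fresh challenge. The "MAC" here is
   NOT cryptography, it just illustrates the message flow. */
#include <stdint.h>
#include <stdio.h>

static uint32_t toy_mac(uint32_t secret_key, uint32_t challenge) {
    uint32_t x = secret_key ^ challenge;
    x *= 2654435761u;
    x ^= x >> 16;
    return x;
}

int main(void) {
    const uint32_t provisioned_key = 0xC0FFEE42u; /* lives in the legit SoC's fuses */
    const uint32_t challenge = 0x12345678u;       /* fresh random nonce from the server */

    uint32_t expected  = toy_mac(provisioned_key, challenge);
    uint32_t resp_good = toy_mac(provisioned_key, challenge); /* legit node */
    uint32_t resp_fake = toy_mac(0xDEADBEEFu, challenge);     /* swapped-in SoC guessing */

    printf("legit node:  %s\n", resp_good == expected ? "admit" : "reject");
    printf("swapped SoC: %s\n", resp_fake == expected ? "admit" : "reject");
    return 0;
}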

BobHoward
Feb 13, 2012


Crunchy Black posted:

This is just out of my rear end but I don’t think infinity fabric can scale that way. Since there’s no dedicated inter CPU bus ala QPI you’d run out of bandwidth cpu to cpu and have bottlenecks before it became effective unless you were running something really optimized for x86 CPU only

Infinity fabric literally is AMD's equivalent of QPI???

Paul MaudDib posted:

the terminology is fairly muddled because infinity fabric is actually a name for several different interconnects that do similar things, but there is an on-chip link that is used between the IO die and the chiplets (and presumably between IO die nodes on Rome/Milan) and also an external one that is basically PCIe based that is used between the CPU sockets or between CPU and devices in the PCIe sockets.

No form of Infinity Fabric is based on PCIe technology. The idea that it is has been spread far and wide by tech writers whose limited understanding of what's going on has led them to unwarranted conclusions.

The conceptual background needed here is that PCIe, Infinity Fabric, and QPI are all protocol stacks. That is, you can analyze (and implement) them as several distinct layers of software and hardware. At the bottom of all of them is a physical layer (PHY) which transmits and receives packets ("flits" in QPI terminology) over physical wires. The PHY is also known as a SERDES (serializer/deserializer) because as part of its functionality it translates between a wide datapath at a relatively low clock rate and a serial datapath at a much higher clock rate.

What AMD did with IF was to define multiple options for the physical layer. One is a specialized SERDES only suitable for very short range die-to-die interconnect on a substrate. The other is a more conventional long-range SERDES which can handle off-package connections through a backplane (possibly with connectors in the signal path). The tradeoff here is that by reducing range and requiring all-soldered connections you can greatly reduce the energy per bit (possibly latency too).

The protocol running through these different physical layers is identical, and is not in any sense PCIe.

The grain of truth in the "IF is based on PCIe" thing is that the chips which support the long-distance option do so with multimode SERDES which can be configured as either PCIe or Infinity Fabric lanes. This lets system integrators choose what kind of tradeoff between general purpose I/O and socket-to-socket IF they want to make. However, it's a mistake to think that an AMD SERDES configured for Infinity Fabric is some form of PCIe. It probably isn't even fully electrically compatible with PCIe in that mode, and definitely isn't protocol compatible. They're using internal muxes/selectors to disconnect the PHY from one kind of controller and connect it to another.

This kind of thing is pretty old. Over a decade ago I worked on a chip with a broadly similar concept - system integrators could decide whether they wanted each of three SERDES lanes to be a PCIe 1.x root port or a SATA port. Just like Infinity Fabric, you wouldn't say that one of those SERDES lanes in SATA mode was really a form of PCIe; in SATA mode the PHY was disconnected from the PCIe root complex and configured in ways which made it electrically incompatible with PCIe, both externally and internally.

BobHoward
Feb 13, 2012


Kibner posted:

what in the hell? lmao

CC6 is the deep sleep (power gated) state, so this is likely something along the lines of the core failing to power back up from deep sleep once some continuously incrementing counter in always-powered uncore support circuitry rolls over.

Some Redditor figured out that the time is roughly 0x380000000000000 ticks of the chip's 2800 MHz TSC, which they interpret as significant thanks to all those zeroes. I'm less sure about this numerology, because 0x38 doesn't tingle my RTL designer spidey senses, but sometimes poo poo is weird and it's someone else's design I have no insight into.

BobHoward
Feb 13, 2012


crazypenguin posted:

A redditor deciding the 1044 days is wrong and it's actually 1042 days and titling their post that way because of bullshit numerology they pulled out of their rear end because it has a pleasing number of zeros is hilarious

Well they're not entirely wrong about counter rollovers occurring close to a value with a ton of zeroes. For example, a 16-bit counter's maximum value is 0xFFFF; increment it by 1 and it rolls over to 0x0000 rather than 0x10000 (since the latter requires 17 bits to represent). The reason I don't fully buy the theory is that 0x37FFF.... is not a very plausible top value for a counter - a top value should be all-ones, but 0x37 hex = 0b110111 binary.
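The arithmetic behind the 1042 days, in case anyone wants to check it (0x380000000000000 and 2800 MHz are the numbers from that Reddit thread):

code:
/* 0x380000000000000 ticks of a 2800 MHz counter, expressed in days,
   plus the 16-bit rollover example from above. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t ticks = 0x380000000000000ULL;      /* 0x38 * 2^52 */
    double seconds = (double)ticks / 2.8e9;     /* 2800 MHz TSC */
    printf("%.1f days\n", seconds / 86400.0);   /* ~1042.5 days */

    uint16_t c = 0xFFFF;                        /* max value of a 16-bit counter */
    c++;                                        /* rolls over */
    printf("0xFFFF + 1 = 0x%04X\n", c);         /* prints 0x0000 */
    return 0;
}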

BobHoward
Feb 13, 2012


Dr. Video Games 0031 posted:

Right, Intel still relies heavily on kernel-level scheduling, but I doubt it would perform anywhere near as well without the thread director. AMD lacks anything like that, and it's not something they can just patch in either.

I know this is a necro but I think you & others have overestimated how significant Thread Director is, probably because Intel's marketing has encouraged it and tech media has mostly gone along. I read the spec once and it's just an engine to automate collection of several per-core statistics (instruction mix, current clock speed, power, temperature, etc).

The resulting data can be useful to a scheduler, but the scheduler's still 100% in charge of making the actual decisions. If and when AMD does heterogeneous core types, they may not need something like this at all - Intel's need is driven by things which are probably going to remain unique to Intel. (What I'm thinking of: their overcomplicated DVFS, poorly matched core types, and use of SMT in the big cores. Speaking of SMT, when big cores are running 2 threads and those threads are competing for execution resources, making optimal scheduling decisions gets really complicated. This is probably why Intel chose to capture instruction mix.)

Even if AMD ends up needing this, it ain't that complicated. Both Intel and AMD have had the raw data sources (performance counters, sensors) for a long time. The only fundamentally new thing in Thread Director is that to minimize data collection overhead, Intel threw in a tiny microcontroller core whose only job is to periodically scan performance counters and sensors and dump the data into a table for easy ingestion by the kernel.
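To make concrete how mundane that table is, here's the kind of thing I mean (an invented sketch - this is NOT Intel's actual Thread Director / hardware feedback layout, just an illustration of "microcontroller periodically dumps per-core stats where the scheduler can read them"):

code:
/* Invented illustration of a per-core feedback table a tiny management core
   might maintain for the OS scheduler. Field names and layout are made up. */
#include <stdint.h>

struct core_feedback {
    uint8_t perf_capability;    /* how fast this core can currently go (power/thermal limits) */
    uint8_t energy_efficiency;  /* how cheaply it can do the work right now */
    uint8_t workload_class;     /* derived from the instruction mix of whatever it's running */
};

struct feedback_table {
    uint64_t last_update;            /* when the microcontroller last refreshed the stats */
    struct core_feedback core[64];   /* one entry per logical CPU */
};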

BobHoward
Feb 13, 2012


Klyith posted:

It's not just less cache. I saw a neat article about it, a whole lot of what makes them compact is just removal of "extra" space. Some of that means that the compact cores have to run at lower frequencies, due to signal interference between components that are now closer together, and also power density.

But another part is actually wasted space, because the first run of the design is more modular and "blocky" with all the various sub-components. Makes the design and fab prototyping a lot faster when you chop stuff up into highly partitioned modules.

So I don't know if there will ever be an AMD CPU that uses both normal and compact at the same time. It seems likely that the C version will always be a later follow-up to the standard core, analogous to the old tick-tock cycle. The reason they can squish it down is because they've got complete understanding of how the base model worked.

re: signal interference etc, kind of but not really? Especially the signal interference part. You need to flip your mental model over: AMD chose to make a lower frequency version of the core because (a) lots of servers want core count more than high clocks and (b) clock speed costs area and power, so reducing it enables them to create a variant better optimized for such servers. I'll try to explain why (b) is true...

That article claims AMD uses the same RTL for Zen 4 and 4c, and I'll take it at face value. RTL is a cycle-accurate model written in a high level language (usually Verilog) in a style which is amenable to translation (usually automated, sometimes augmented by human effort) to a netlist - a big list of logic gates and the wires which connect them together. This netlist is the input to physical design - placement and routing.

So how can the same RTL be translated into less or more area, depending on frequency target? Well, for any given logic gate, physical design tools can choose between several different versions in the cell library. These variants all do the same logical function (e.g. 5-input NAND gate, or whatever), the difference lies in the tradeoff between area and output drive strength. High-current output drivers cost more area.

Why have multiple drive strengths? Each gate's output has to drive a wire which goes somewhere. Wire capacitance scales with wire length and loads, so the longer and/or more heavily loaded the wire is, the harder the gate output has to yank on it to quickly move it between logic 0 and logic 1.

With a low speed target, most gates can be the minimum size versions. Doesn't matter if the wire takes its sweet time moving around, there's plenty of slack in the timing budget. As target clock speed goes up, so does the number of gates which have to be bigger versions of themselves. This even has a (usually mild) cascading effect: the extra area used by larger gates increases average wire length, which upsizes even more gates. In some cases, extra long wires may need repeaters inserted in the middle. This is how speed literally costs area. (And power - longer wires switched at higher speed burn lots more power than shorter, slower ones.)
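Toy numbers to make that concrete (values invented; the scaling is the point):

code:
/* Toy RC-delay estimate: upsizing a driver (halving its output resistance)
   roughly halves the delay on the wire it drives, at the cost of gate area.
   All values are invented for illustration. */
#include <stdio.h>

int main(void) {
    double c_wire  = 2e-15;   /* wire + load capacitance, farads */
    double r_small = 8e3;     /* output resistance of a minimum-size driver, ohms */
    double r_big   = 4e3;     /* output resistance of an upsized driver, ohms */

    /* Elmore-style estimate: delay ~ 0.69 * R * C */
    printf("small driver: %.1f ps\n", 0.69 * r_small * c_wire * 1e12);
    printf("big driver:   %.1f ps\n", 0.69 * r_big   * c_wire * 1e12);
    return 0;
}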

The modularity thing is not a first-run thing to improve speed of prototyping, it's a side effect of chasing clock speed. Trying to push Fmax to the limit creates lots more physical design optimization work. Since engineers work best in small teams, most of the time the design gets broken down into smaller chunks, each of which is assigned its own team and some floorplan to play inside. Nobody's perfect at predicting the ideal area and layout for each block, so there's usually some give-and-take iterations during the design process. The end result is never ideal on area efficiency, but it's faster.

However, when you choose to reduce speed targets (as in Zen 4c), you can greatly reduce the amount of labor put into physical design optimization, live with fewer partitions, and as a side effect gain some area back.

Re: normal and compact at the same time, it depends on the size of AMD's labor pool. The compact variant doesn't actually need the big variant to go first, and should require fewer engineer hours, but if they both have to be done in parallel that still needs more engineers on staff.

BobHoward
Feb 13, 2012


Klyith posted:

Neato, thanks for the writeup. So a 5/5c hybrid sounds much more plausible that I thought.

Yeah. Don't depend on the fine details of what I wrote - I'm reading a lot between the lines of that article and making tons of guesses/assumptions about AMD's engineering process - but I'm very confident there's no technical barrier which would block AMD from launching 5 and 5c at the same time, possibly on the same chip. It's just more expensive.

(But note: this should still be far less expensive than designing a new small core with a different uarch. Starting from the same RTL saves them a lot of effort on both the design and validation sides.)

BobHoward
Feb 13, 2012

I don't know what Baldur's Gate's deal is (never really looked at it, because ehhhh D&D), but planetary / galaxy scale simulation games like HoI, Stellaris, and Rimworld loving big caches is not a surprise. Big sims have to walk through big in-memory databases of all the simulated objects quite a lot, so fitting more of that DB into cache can speed up the sim a ton.

BobHoward
Feb 13, 2012


PC LOAD LETTER posted:

If they're gonna double stack like that then I'd think they'll have to lower power/clocks (even with the process shrink).

Or is the IOD big enough that you can put a CCD and the L3 die side by side on top of it instead of all 3 on top of each other?

I suppose if the edges over hang they can just put a support underneath it if necessary. Can't be that hard or expensive to put a bit of aluminum under it if necessary.

Flip that idea on its head, IMO. Stacking is real bad for power, but SRAM is low power density, so... why not make the IOD and the L3 die one big die? Lay things out so there's only SRAM directly underneath the CCDs, push all I/O functions to the edges where there's no overlap. The heatspreader would be in direct contact with the CCD, so this would actually offer better cooling of the CCD than today's X3D products.

That does make the IO/cache die huge, but one advantage of it being mostly SRAM is that SRAM can have enough redundancy to be easily repairable to improve yield.

This is just speculation on my part. No idea if the economics work, no idea if there's some other fatal problem with the idea.

BTW, when there is a need to resolve overhang with a spacer, my guess is that package designers would use a blank piece of silicon rather than a metal like aluminum. This would keep the coefficient of thermal expansion of all the support materials for the die on top the same, so that there are no bending forces as temperatures move around.

e: if only I'd read Cygni's post before writing this one, lol

BobHoward fucked around with this message at 09:26 on Sep 30, 2023

BobHoward
Feb 13, 2012


repiv posted:

x86 needs to add javascript instructions like ARM did, it's falling behind the webshit curve

The Arm "javascript instruction" (there's only one) is way less specific to Javascript than people tend to think - it's just a variant of the floating point to integer conversion instruction. Because JS has this idiotic thing where integers are represented as floating point doubles, it leans FP-to-int a lot, and building a variant of FP-to-int with the exact rounding mode and other behaviors needed by JS is a very low implementation complexity thing with enough reward to be worth it.

BobHoward
Feb 13, 2012


hobbesmaster posted:

Also memory segmentation as a general concept is an absolutely critical foundation of modern software security

It really isn't though? Only x86 uses segments anymore, and even it doesn't use them for much in long (64-bit) mode. Most RISCs (including Arm and RISC-V) don't support segments at all. POWER/PowerPC do, because IBM had to be weird, but that's all I can think of off the top of my head.

BobHoward
Feb 13, 2012


BlankSystemDaemon posted:

NUCA is a loving mess.

I love how CPU designers looked at the issues inherit with NUMA and thought "hey, let's do that, but make it even worse".
NUMA works decently if you're bottlenecked by thread scale-out and aren't doing certain memory sensitive workloads, but it's a loving nightmare to deal with when you're working with any kind of I/O for either auxiliary storage or networking.
NUCA has all of those issues, but now it's for any workload.

What on earth are you complaining about here? It's the laws of physics which dictate that the costs, both energy and time, of accessing memory resources go up as the memories get bigger and/or further away. NUMA and NUCA aren't lazy designers deciding to make life difficult for everyone else, they're just consequences of this reality.

The only way to achieve true uniformity in a large system is to restrict performance globally, in all cases, just because you don't want CPUs to be capable of running faster when working with local (or relatively local) data.

BobHoward
Feb 13, 2012


Subjunctive posted:

I think you could signal it with a specific NOP sequence since on modern processors you can embed arbitrary bytes in it? I think that trick has been used before for hypervisor hints or something? Maybe?

I really think you're both barking up the wrong tree. There's a known working example in Darwin, and it doesn't require wacky compiler-involved schemes because it builds on a universal and well understood primitive - threads. As a programmer you use a simple OS API to inform the kernel what kind of work each thread does, and the kernel's scheduler uses that as a hint.

So if a thread is marked as "user interactive", that means it exists to respond to user input events (clicks, keypresses, etc). These almost always get scheduled on P cores, with higher prio than other P-core users, because UI latency is so important. At the other extreme, putting a thread in the "background" bucket means that latency and throughput explicitly aren't important, so the scheduler should feel free to lock it to E cores for best energy efficiency. There's a few other options for various kinds of middle ground.
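Concretely, on macOS/iOS it's about one call per thread. A minimal illustration using the pthread QoS API (the "work" being done here is obviously just a placeholder):

code:
/* Minimal illustration of Darwin's thread QoS classes (macOS/iOS only). */
#include <pthread.h>
#include <pthread/qos.h>
#include <stdio.h>

static void *batch_work(void *arg) {
    (void)arg;
    /* Latency and throughput explicitly don't matter for this thread, so the
       scheduler is free to park it on E cores. */
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);
    /* ... churn through the batch job ... */
    return NULL;
}

int main(void) {
    /* A UI/event-handling thread would use QOS_CLASS_USER_INTERACTIVE instead. */
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);

    pthread_t t;
    pthread_create(&t, NULL, batch_work, NULL);
    pthread_join(t, NULL);
    puts("done");
    return 0;
}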

BobHoward
Feb 13, 2012


Subjunctive posted:

Yeah, that’s true. I’m not sure what the foreground/background equivalent is for power-efficiency-vs-throughput, but tagging on these unit of the thread makes sense.

Yeah, in the x86 world things get more complex because Intel's chips are the most prevalent heterogeneous designs and they're way more complicated than Apple's. In Apple's chips, you have two kinds of cores, they're very differentiated into distinct roles, and they aren't pre-overclocked at the factory. In Intel's, while there are still two kinds of core (three in new products where they've finally decided to do truly efficient efficiency cores), their roles overlap a lot more, and the performance/power/efficiency tradeoffs of P cores are highly variable depending on circumstances. (Mainly: whether it's running 1 or 2 threads, and how high it's currently allowed to clock itself.)

I did forget some details about Apple's system - there's some kind of activity tracking mechanism that, while thread B is doing work on behalf of thread A, gives B the same QoS and priority as A, and IIRC this extends across the userspace/kernel boundary. I don't know how essential it is, though, and the low hanging fruit should be in the base idea of letting applications put their threads in different QoS buckets.

BobHoward
Feb 13, 2012


BlankSystemDaemon posted:

Ah, so there are still microarchitectural differences between them.
That sucks.

No, BurritoJustice was saying there aren't uarch differences between Zen 4 and 4C. Literally the same execution units and pipeline stages and so forth, but to save space and reduce power 4C has two optimizations:

- The through-silicon via (TSV) sites used to do die stacking in a full Zen 4 core take some area, and are present even in parts which don't have 3D V-Cache. Zen 4C doesn't support 3D V-Cache, so some of its area reduction comes from omitting TSV pad structures and 3D V-Cache support logic.

- A significant part of "clock go fast brrrrr" design methodology consists of selecting larger and faster versions of gates in the cell library during physical design. Large is fast due to RC (resistance-capacitance) delay. The wires connecting gates together have some resistance, and they also have substantial parasitic capacitance, so the speed at which they can switch state depends on how much current the gate driving the wire can sink or source to drain or fill the capacitance. Higher gate output current requires that it be built from bigger, hotter transistors. Because 4C targets lower Fmax, AMD was able to save a lot of area and power by selecting smaller and lower power versions of gates from the cell library.

Arguably you don't actually want 4C's level of microarchitectural identity, because Z4 is a uarch designed around hitting higher clock speed targets. If you could go back in time and wave a magic wand to have the AMD team design a new clean-sheet uarch designed from the ground up for the space 4C sits in, they almost certainly could've beaten 4C in one or more of power, area, and performance. That said, I think AMD was very smart to do 4C how they did it - it's effective enough, low-risk, and fits with their budget and engineering team headcount.

BobHoward
Feb 13, 2012


GRECOROMANGRABASS posted:

Say bud, I think it would make more sense if you thought about how Intel diligently perfected the art of slowly releasing minor incremental updates to their Pentium 4/Core-based architecture for well over a decade.

Wat

You picked the wrong pair of chips here, P4 to Core was not incremental at all. Unless you count Core as incremental from Pentium 3, which it kinda was, in a loose way. P4, though? Dead end.

Also there were at least 2 microarchitectures post-Core where Intel shook things up quite a bit, Nehalem and Sandy Bridge. If you want to make a case for them coasting on incremental improvements to one uarch for far too long, SB's probably where you have to start from.

BobHoward
Feb 13, 2012


Harik posted:

It's not about security, any state level actor who can replace the firmware in a supply chain attack can afford to throw in a new epyc that immediately locks to their key.

The proper way to do it is to have a OTP region/256 efuse bits on your flash chip that the processor reads first. It's less secure than doing it properly, because a physical supply chain attack now only has to replace a designed-to-be-removed processor instead of bringing in the full suite of SMT rework tools to remove a ROM from the board. One of these things you can do in a few unattended minutes, the other you can't.

Serial flash chips with OTP already exist for doing exactly this kind of thing:

Doing the way they did is 100% about killing the secondary market using "security" as an excuse.

Repeatedly shouting that a dumb external flash chip with an OTP region is the proper way doesn't make it so. The supply chain threat model isn't the James Bond spy bullshit you're imagining, it's more boring things like taking advantage of a PCB assembly house's sloppy component procurement processes to get them to buy a batch of fake flash chips.

Therefore, the state of the art in designing a hardware root of trust is to embed it into something which is far harder to craft a substitute for. It's not that hard or expensive to make an ersatz flash chip that'll provide the same functionality as the original, plus let you rewrite the supposedly OTP region if you tickle it the right way. It's profoundly more expensive to make an ersatz EPYC. Also more difficult to inject it into the supply chain somewhere without getting noticed.

Is AMD's implementation the best possible? I dunno, but it seems legitimately designed to provide higher boot security.

I doubt that killing secondary sales was a concern. What drives the bulk of AMD's sales in this market is delivering better perf/W and absolute performance in each successive generation. The customers who buy large numbers of these chips new don't view older generations as viable substitutes, because in data centers, operational costs are very important, and the old poo poo just isn't competitive.

BobHoward
Feb 13, 2012


Subjunctive posted:

It has the Neural Engine NPU (16 cores/18T op/s in M3) but it’s not clear exactly what it’s being used for. There are some PyTorch conversion tools, but not much direct API for accessing it AIUI.

This gives sort of an overview.

https://machinelearning.apple.com/research/neural-engine-transformers

Instead of explicitly requesting the ANE, you just ask Core ML to run a model and it chooses whether to use CPU, GPU, ANE, or even a combination of those resources. Makes it so that in theory, programmers don't have to worry about whether the device they're running on even has an ANE, but lots of people complain because they do want an explicit "run this on the ANE" interface.

Most of the macOS/iOS features Apple builds on Core ML seem to be about images in one way or another. Photos.app automatically classifies your picture library and somehow manages to learn things like your cat's name (it might have asked, I forget) so you can search by image content. (It's both pretty good and inaccurate, it has a hard time telling the difference between my tuxedo cat and other tuxedo cats.) They also use ML to do real-time image enhancement on the video feed from the laptop's webcam so you look better in video calls.

My actual unironic favorite is built-in near-zero-UI OCR. In a lot of contexts, you can now briefly hover the mouse cursor over text in an image (or even a paused video) and it will turn to the text selection cursor; you can then select text in the image to copy and paste it elsewhere. It handles a surprising number of languages; my main use of it is to C&P into Google Translate. Kinda neat to be able to translate signage and so forth in images you find on the Internet.

Dr. Video Games 0031 posted:

They haven't said. There's a 50/50 chance that Microsoft waffles long enough that the AI boom wanes before any of this becomes reality, they only do a half-hearted attempt at pushing "AI PCs," and then drop the matter and pretend it never happened within 5 years.

Yuuup. The best theory I've seen is that Microsoft's in it to sell expensive Azure cloud compute time to suckers who've bought into the AI hype and need to train big models. Pushing the "AI PC" is them trying to stimulate (or simulate?) demand. Once the next AI winter comes - AI has been doing boom and bust cycles since it first became a thing in the 1960s - they'll probably drop it.

Which is a shame as there really are useful things you can do with an on-device NPU, as Apple's been showing since 2017. Nothing revolutionary, nothing actually "intelligent", just nice little quality of life features. Things you could even do without a NPU, but it would be inconvenient due to battery drain.

BobHoward
Feb 13, 2012


Subjunctive posted:

What’s the motivation for moving power management onto the DIMM for DDR6? Is there some limit that’s being hit when pushing it from the motherboard, or do they just want to avoid motherboards loving with voltages badly?

Probably just wanting to move the regulators closer to the devices they're powering in order to provide better stability and tighter regulation tolerances.

Ohm's law says V = I*R. Translated: the voltage drop across a resistor is equal to the current flowing through it times its resistance. The resistance of a conductor is proportional to its length, meaning the further a regulator is from the device it's powering, the more error there will be in the voltage the device sees.
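For a sense of scale (made-up but plausible numbers):

code:
/* IR droop across a power delivery path: V = I * R. Numbers are made up. */
#include <stdio.h>

int main(void) {
    double current = 10.0;      /* amps drawn by a busy DIMM */
    double resistance = 0.002;  /* 2 milliohms of trace/connector resistance */
    double droop = current * resistance;
    printf("droop = %.0f mV on a ~1.1 V rail\n", droop * 1000.0);  /* 20 mV */
    return 0;
}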

Regulator circuits often use a feedback line from the point of load to accurately sense the voltage there. This lets them compensate for IR droop on the main supply line - the regulator can simply adjust its output voltage above the nominal level until the voltage at the device is correct. The reason this works despite the distance is Ohm's law again, in the other direction: the sense line carries next to zero current, so there's next to zero voltage drop across it.

However, IDK if they want to get into doing that kind of thing across a DIMM interface, particularly in multi-DIMM systems where each DIMM is going to be a different distance from the regulator. Physically placing the regulator as close as possible to its load is the best choice from a performance standpoint, so there you are.

BobHoward
Feb 13, 2012


crazypenguin posted:

e: and it looks like Apple's A17 Pro is 35 TOPS, and A18 will probably come out at the same time, so maybe qualcomm isn't that far ahead of everyone here

For what it's worth, Apple traditionally used 16-bit TOPS as their marketing number (*), and their NPUs always double that number when doing 8-bit computations. Some think that, for whatever reason, they chose to market A17 Pro using 8-bit TOPS while sticking with 16-bit numbers for M3. The reasoning is simple: in the past Apple has reused the same NPU block in both A-series and M-series chips, and M3/A17 Pro are both N3, launched at about the same time, and share lots of other cores (same CPUs, for example). There's no obvious reason why A17 Pro's NPU would really be about 2x as fast.

Frustratingly, in the months since launch, nobody seems to have benchmarked this to confirm or deny the hypothesis. Or if they have, I can't find it.

By the way, yes, this is a huge problem for NPU TOPS comparisons in general. Be sure you're comparing the same thing...

edited to add this footnote:
* - iirc they seldom or never explicitly said they were using 16-bit, people had to test M1 to determine that the marketing number was 16-bit TOPs

BobHoward fucked around with this message at 03:24 on Apr 26, 2024

BobHoward
Feb 13, 2012


Cygni posted:

My understanding is that they were initially for doing computational photography tricks, which is how phones get the “pictures” from the tiny sensor to look good to people. Behind the scenes, it’s stitching together multiple pictures and applying filters to them in real time to make a hybrid monster image that people like.

Supposedly it also gets used for FaceID and those Animoji/memoji you haven’t seen in years.

Nah, computational photog stuff in Apple's SoCs is handled by a dedicated block, the Image Signal Processor (ISP). It's existed in their chips a lot longer than the Apple Neural Engine (ANE). Apple's far from the only company with an ISP - all cellphone SoCs have had one for ages.

The ANE and similar "AI" engines are coprocessors heavily specialized to accelerate matrix math, as that's the root of so-called "AI". I don't think I've ever heard much about what's in a typical ISP but my guess would be DSP cores for some programmability and possibly some image filter engines that are slightly less programmable.

Apple has recently started borrowing the ANE for some camera functions - an example being that on Apple Silicon MacBooks, the ANE is used to do advanced "AI" image enhancement on the webcam's output. However, as far as I know, the ANE postprocesses the ISP's output rather than taking over the whole pipeline.

The first ANE was in 2017's A11 Bionic, used in the iPhone X, which was the first iPhone with FaceID - so yeah, at the time, FaceID was the ANE's headline feature.


BobHoward
Feb 13, 2012


Cygni posted:

My understanding from articles back at the time was that the NPU was an offshoot of the ISPs to allow them to do different computational photography tricks than the ISPs were doing, stuff like subject recognition etc. I might fully be wrong though, not really a camera guy :shrug:

I guess those articles weren't wrong, but also not quite right? The wrong part: the ANE was a new block designed to accelerate inference, not an offshoot of the ISP. The right part is that one of the things you can do with spicy matrix math is, as you mentioned, the new kinds of image processing tasks made possible by building and training a model. So, sometimes it does have a role to play in work that was formerly ISP-only. But the ISP is still there to this day; it wasn't made redundant by the NPU.
