Dr. Video Games 0031
Jul 17, 2004

priznat posted:

Was this from the “AI PC event” or whatever

The only really new news with all this is some manufacturers are adding a “copilot” key to keyboards lol

Microsoft is modifying their keyboard standards to add a Copilot key, so expect to see one on every Windows laptop and OEM keyboard pretty soon: https://arstechnica.com/gadgets/2024/01/ai-comes-for-your-pcs-keyboard-as-microsoft-adds-dedicated-copilot-key/

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.
As long as they don’t gently caress with the enter key. Anyone making it 2 rows high can burn in hell

repiv
Aug 13, 2009

ANSI is the more dominant standard worldwide, so if anything it's us ISO tall-enter enjoyers who are going to be made to fall in line at some point, so that manufacturers only have to wrangle one physical layout.

Cygni
Nov 12, 2005

raring to post

Zen 4 APUs are also real. These are the Phoenix 1/2 mobile parts in the AM5 socket.



8700G = 7840/7940/8840/8940 on mobile
8600G = 7640/8640 on mobile
8500G = 7540/8540 on mobile

Clocks are actually pretty close to the mobile parts despite the higher TDP. 8700G goes to 2.9ghz on the GPU instead of 2.7 for the 8840U on mobile, for example. Both have the same turbo clocks on the CPU, but likely will be able to hold those clocks longer on desktop.

On the IGP front, even the 8500G should be able to smoke the last-gen 5700G on AM4, even with the lower memory bandwidth on AM5 vs the mobile parts. Now that sub $100 AM5 boards are out and DDR5 is dirt cheap, you can build a pretty good little APU box for 1080p gaming again! hooray!

8500G is also AMD's first "hybrid" architecture for desktop, using Phoenix 2 silicon (aka the Z1 non-Extreme). Despite AMD promising to be more upfront when they use zen4c cores, the release slide deck conveniently forgets to mention that this part is a hybrid. Must have run out of space to mention it after they included 4 AI-specific slides in a 12-slide deck.

https://www.anandtech.com/show/21208/amd-unveils-ryzen-8000g-series-processors-zen-4-apus-for-desktop-with-ryzen-ai

Cygni fucked around with this message at 22:26 on Jan 8, 2024

Anime Schoolgirl
Nov 28, 2002

i'd peg the RDNA3 4 CUs at 720p maybe even with some upscaling for modern AAA shitheaps, 1080p is definitely doable if you're playing esports titles like CSGO2 though.

An NH-L9a-AM5 is probably going to be a hard requirement for an 8700G if you're going to use one of those in an X600 Deskmini, depending on how hot these things run in practice. The default cooler for the Deskmini is a 35W puck fit only for an Athlon 200GE/3000G; not sure if they're changing that for the X600.

Cygni
Nov 12, 2005

raring to post

Anime Schoolgirl posted:

i'd peg the RDNA3 4 CUs at 720p maybe even with some upscaling for modern AAA shitheaps, 1080p is definitely doable if you're playing esports titles like CSGO2 though.
yeah, modern AAA is likely gonna be rough without FSR, but its a moving target obvi. feel like it should be able to play a huge amount of the Steam back catalog pretty well at 1080p.

i mean i ran a 1440p Korean special for years with an HD 7700 and then a GTX 960 lol, so i guess "well" is subjective too

BlankSystemDaemon
Mar 13, 2009



Cygni posted:

Zen 4 APUs are also real. These are the Phoenix 1/2 mobile parts in the AM5 socket.



8700G = 7840/7940/8840/8940 on mobile
8600G = 7640/8640 on mobile
8500G = 7540/8540 on mobile

Clocks are actually pretty close to the mobile parts despite the higher TDP. 8700G goes to 2.9ghz on the GPU instead of 2.7 for the 8840U on mobile, for example. Both have the same turbo clocks on the CPU, but likely will be able to hold those clocks longer on desktop.

On the IGP front, even the 8500G should be able to smoke the last-gen 5700G on AM4, even with the lower memory bandwidth on AM5 vs the mobile parts. Now that sub $100 AM5 boards are out and DDR5 is dirt cheap, you can build a pretty good little APU box for 1080p gaming again! hooray!

8500G is also AMD's first "hybrid" architecture for desktop, using Phoenix 2 silicon (aka the Z1 non-Extreme). Despite AMD promising to be more upfront when they use zen4c cores, the release slide deck conveniently forgets to mention that this part is a hybrid. Must have run out of space to mention it after they included 4 AI-specific slides in a 12-slide deck.

https://www.anandtech.com/show/21208/amd-unveils-ryzen-8000g-series-processors-zen-4-apus-for-desktop-with-ryzen-ai
Holy poo poo, the 8600G with DIMM-wide ECC memory and the right motherboard (*) looks perfect for a passively-cooled always-on router+firewall & NAS+HTPC combo.

*: Plenty of daughterboard slots, not necessarily a shitload of PCIe lanes - though I would be curious as to how many it has.

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.

BlankSystemDaemon posted:

Holy poo poo, the 8600G with DIMM-wide ECC memory and the right motherboard (*) looks perfect for a passively-cooled always-on router+firewall & NAS+HTPC combo.

*: Plenty of daughterboard slots, not necessarily a shitload of PCIe lanes - though I would be curious as to how many it has.

That looks interesting! My 2500k is still clunking along as my nas but would be nice to upgrade to something capable but lower power and retire it before it dies.

Cygni
Nov 12, 2005

raring to post

BlankSystemDaemon posted:

Holy poo poo, the 8600G with DIMM-wide ECC memory and the right motherboard (*) looks perfect for a passively-cooled always-on router+firewall & NAS+HTPC combo.

*: Plenty of daughterboard slots, not necessarily a shitload of PCIe lanes - though I would be curious as to how many it has.

AMD's video encoders were historically pretty bad, but dunno if Phoenix is better on that front. obvi no plex support. my nas/htpc/plex/seedbox/pihole server is an underclocked 11700k i got used and a clearance z590 itx board which has been real nice for the quicksync/plex support, but no ECC support

tbh the 8500G might be the better ultra low power box with the 4C cores being a fair amount more efficient at lower power envelopes, but no idea what encoders it has as AMD has tried as hard as it can to never talk about Phoenix2

Cygni fucked around with this message at 21:40 on Jan 9, 2024

kliras
Mar 27, 2021
amf is still pretty dang terrible. the encoding on their enterprise-y xilinx products is one of the best if not the best out there, but that's not something consumers or even prosumers can get their hands on

there's always svt-av1 on the cpu i guess

Anime Schoolgirl
Nov 28, 2002

BlankSystemDaemon posted:

Holy poo poo, the 8600G with DIMM-wide ECC memory and the right motherboard (*) looks perfect for a passively-cooled always-on router+firewall & NAS+HTPC combo.

*: Plenty of daughterboard slots, not necessarily a shitload of PCIe lanes - though I would be curious as to how many it has.
You might have to look at the Ryzen Pro variants of the APUs because on the AM4 APUs only the Pro variants supported ECC and they might repeat that segmentation with the AM5 APUs.

Also the Zen 4 APUs have 8 lanes on the "main" PCIe x16 slot.

gradenko_2000
Oct 5, 2010

HELL SERPENT
Lipstick Apathy
yeah I saw a couple of people highlighting how the 8500G and 8300G are going to be starved of PCIe lanes

BurritoJustice
Oct 9, 2012

The AV1 encoder in RDNA3 is really fundamentally broken. It encodes in 64x16 pixel tiles, which isn't that unusual, but the hardware support for outputting subtiles is broken and it will only output resolutions that fit in 64x16 tiles (with black pixel padding). They added a special case for 1080p to minimise the impact, but it still outputs 1920x1082, which means you have to manually trim the 2 extra pixels from videos, and a number of streaming services just straight up won't accept a 1082p stream. Naive VMAF scores are in the 30s because of the extra pixels, but even if you trim them out you're still getting decently lower VMAF than NVENC or QSV.

The other common resolution that doesn't fit the tile size is 3440x1440, but that's less important than 1080p.

An AMD Dev on the Mesa gitlab said it is fixed in the RDNA3+ encoder block, but that's not coming to desktop anymore, so you've got to wait until RDNA4 to get a functioning AV1 encoder from AMD. Hopefully they use RDNA3+ in APUs.
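
A quick back-of-the-envelope check of those numbers (assuming the 64x16 tile grid described above):

\[
1920 = 30 \times 64 \quad\text{(width fits the grid exactly)}, \qquad
1080 = 67 \times 16 + 8 \;\Rightarrow\; \text{a naive tiling rounds the height up to } 68 \times 16 = 1088
\]

So a naive encode would come out at 1920x1088; the 1082 output suggests the 1080p special case trims most of that padding but still leaves 2 extra rows.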

Cygni
Nov 12, 2005

raring to post

Anime Schoolgirl posted:

You might have to look at the Ryzen Pro variants of the APUs because on the AM4 APUs only the Pro variants supported ECC and they might repeat that segmentation with the AM5 APUs.

Also the Zen 4 APUs have 8 lanes on the "main" PCIe x16 slot.

Phoenix is listed with ECC support in TechPowerUp's database. They also show 20 PCIe 4.0 lanes from the CPU, so that might mean a full 16x to the first PCIe slot. They could also be playing games with how they count the lanes. The 5700G shows 16 3.0 lanes, for comparison.

https://www.techpowerup.com/cpu-specs/ryzen-7-8700g.c3434

BurritoJustice
Oct 9, 2012

Cygni posted:

Phoenix is listed with ECC support in TechPowerUp's database. They also show 20 PCIe 4.0 lanes from the CPU, so that might mean a full 16x to the first PCIe slot. They could also be playing games with how they count the lanes. The 5700G shows 16 3.0 lanes, for comparison.

https://www.techpowerup.com/cpu-specs/ryzen-7-8700g.c3434

It depends on which lanes they wire up when they put it on an AM5 substrate. Most AM5 boards wire up two M.2 slots from the CPU, and if they put all 16 usable lanes (4 for chipset) into the PCIe slot then you won't have your M.2 slots.

It'll likely be 8x to primary PCIe and 4x/4x to the m.2 slots. You'll lose a PCIe slot this way if they're wired 8x/8x, but otherwise moving down to 8x on the primary isn't a big deal.

The bigger deal are the CPUs based off the 2+4 small phoenix die (8500G/8300G), they only have 10 usable lanes so allocation gets really tricky. They might end up going 4x slot, 4x/2x M.2.
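
A rough lane-budget sketch of those guesses (the splits are speculation from the post above, not confirmed allocations):

\[
\text{full Phoenix die: } 16 \text{ usable} = 8\,(\text{x16 slot}) + 4 + 4\,(\text{two M.2}), \qquad
\text{small Phoenix 2 die: } 10 \text{ usable} = 4\,(\text{slot}) + 4 + 2\,(\text{M.2})
\]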

BlankSystemDaemon
Mar 13, 2009



priznat posted:

That looks interesting! My 2500k is still clunking along as my nas but would be nice to upgrade to something capable but lower power and retire it before it dies.
This morning while in bed I got to thinking if Supermicro had any new interesting products, and H13SAE-MF appears to be basically everything I want out of a product.
All the PCIe lanes are directly attached to the CPU according to the manual, there are plenty of DIMM slots to add a lot of RAM for ZFS ARC and they're aligned right for proper airflow, and it has a dual-RJ45 Intel i210 Ethernet controller, so the rest of the daughterboard slots can be filled with a SAS-8i, a SAS-8e, and a dual-SFP+ Intel X520 Ethernet controller, plus two M.2 drives.

Cygni posted:

AMD's video encoders were historically pretty bad, but dunno if Phoenix is better on that front. obvi no plex support. my nas/htpc/plex/seedbox/pihole server is an underclocked 11700k i got used and a clearance z590 itx board which has been real nice for the quicksync/plex support, but no ECC support

tbh the 8500G might be the better ultra low power box with the 4C cores being a fair amount more efficient at lower power envelopes, but no idea what encoders it has as AMD has tried as hard as it can to never talk about Phoenix2
For an HTPC, all I care about is Kodi being able to use the H.264+H.265+AV1 decoding, as any encoding I do will be with ffmpeg using CRF with proper tunes and presets to get better quality vs filesize than any hardware encoder can manage.
ECC support is, to me, as much about system stability as it is about data reliability - plus, with Supermicro, there's generally a good chance that the Machine Check Exceptions generated from ECC errors will actually be passed to the OS like they're supposed to be, so it can decide whether a process or the kernel needs to be restarted instead of risking memory corruption.

I don't trust Intel or AMD to implement HMP properly, because the OS scheduler can't know if it's better to put a process on a thread that's fast but consumes more energy vs slow but consumes much less energy - and as far as I can figure out, short of having the compiler embed the information into the process somehow, there's no way it can know.

Anime Schoolgirl posted:

You might have to look at the Ryzen Pro variants of the APUs because on the AM4 APUs only the Pro variants supported ECC and they might repeat that segmentation with the AM5 APUs.

Also the Zen 4 APUs have 8 lanes on the "main" PCIe x16 slot.
Ah, very good point, thank you.
In that case, Ryzen 5 Pro 7540U looks even better.

ConanTheLibrarian
Aug 13, 2004


dis buch is late
Fallen Rib

BurritoJustice posted:

It depends on which lanes they wire up when they put it on an AM5 substrate. Most AM5 boards wire up two M.2 slots from the CPU, and if they put all 16 usable lanes (4 for chipset) into the PCIe slot then you won't have your M.2 slots.

It'll likely be 8x to primary PCIe and 4x/4x to the m.2 slots. You'll lose a PCIe slot this way if they're wired 8x/8x, but otherwise moving down to 8x on the primary isn't a big deal.

The bigger deal are the CPUs based off the 2+4 small phoenix die (8500G/8300G), they only have 10 usable lanes so allocation gets really tricky. They might end up going 4x slot, 4x/2x M.2.

Asus to the rescue: https://www.tomshardware.com/news/asus-demos-geforce-rtx-4060-ti-with-m2-slots

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.

BlankSystemDaemon posted:

This morning while in bed I got to thinking if Supermicro had any new interesting products, and H13SAE-MF appears to be basically everything I want out of a product.
All the PCIe lanes are directly attached to the CPU according to the manual, there are plenty of DIMM slots to add a lot of RAM for ZFS ARC and they're aligned right for proper airflow, and it has a dual-RJ45 Intel i210 Ethernet controller, so the rest of the daughterboard slots can be filled with a SAS-8i, a SAS-8e, and a dual-SFP+ Intel X520 Ethernet controller, plus two M.2 drives.

Very nice. Supermicro or asrock rack would be my gotos for sure. They can be hard to find at retailers, hopefully newegg Canada has em.

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

BlankSystemDaemon posted:

I don't trust Intel or AMD to implement HMP properly, because the OS scheduler can't know if it's better to put a process on a thread that's fast but consumes more energy vs slow but consumes much less energy - and as far as I can figure out, short of having the compiler embed the information into the process somehow, there's no way it can know.

the compiler can’t know, because power-vs-time tradeoffs are an operational concern, not a compile-time one

best they could do is label different code regions (either with metadata or signalling instructions) with attributes about their relationship to the program’s task, such that the operator can configure the scheduler to respond to them in a locally-appropriate way, I think

Helter Skelter
Feb 10, 2004

BEARD OF HAVOC


Not really. That card still relies on having the full 16 lanes available (and also motherboard support for bifurcation). The GPU portion uses 8 and the M.2 slots use the rest. If there are only 8 lanes going to the GPU slot, the drives are cut off.

Koskun
Apr 20, 2004
I worship the ground NinjaPablo walks on
I was surprised to see this does actually exist; I would have thought it would be a paper launch. Though I can only find one site in Europe that has any in stock. Seems the markup on it, and the requirement for a motherboard that supports bifurcation, haven't really had it flying off the shelves.

What I would like to see tested is a video card where they add an M.2 slot, not for the system to use, but for the video card itself. A TB or two of cache could be interesting if utilized right.

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

AMD did that for some GPGPU thing a while back, I believe.

BlankSystemDaemon
Mar 13, 2009



priznat posted:

Very nice. Supermicro or asrock rack would be my gotos for sure. They can be hard to find at retailers, hopefully newegg Canada has em.
I've found one retailer that has them in Denmark, but it's INCREDIBLY expensive.
ASRockRack is completely unavailable in Denmark, so I think I might have to go Supermicro - though I'm not really complaining.

Subjunctive posted:

the compiler can’t know, because power-vs-time tradeoffs are an operational concern, not a compile-time one

best they could do is label different code regions (either with metadata or signalling instructions) with attributes about their relationship to the program’s task, such that the operator can configure the scheduler to respond to them in a locally-appropriate way, I think
That's kinda what I was getting at, some compile-time flag to set a runtime-readable attribute for the scheduler to know what to do with.
The issue for that is that there's no tooling in place to do it in MSVC, LLVM, or GCC - and even once you get it in, it'll take time to start getting used, and the update cycles tend to be slow at the places where the most high-performance software is developed (because understandably, they want their programmers to program, not maintain the software they use to program).

So that's at least half a decade until HMP is maybe doable? By that point, I'm sure the industry will have moved on.

priznat
Jul 7, 2009

Let's get drunk and kiss each other all night.

BlankSystemDaemon posted:

I've found one retailer that has them in Denmark, but it's INCREDIBLY expensive.
ASRockRack is completely unavailable in Denmark, so I think I might have to go Supermicro - though I'm not really complaining.

Yeah I like supermicro even more than asrock rack, definitely a solid choice. I especially love their little plastic m.2 retention plug things, everyone needs to adopt those!!

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

BlankSystemDaemon posted:

The issue for that is that there's no tooling in place to do it in MSVC, LLVM, or GCC

I think you could signal it with a specific NOP sequence since on modern processors you can embed arbitrary bytes in it? I think that trick has been used before for hypervisor hints or something? Maybe?

BlankSystemDaemon
Mar 13, 2009



priznat posted:

Yeah I like supermicro even more than asrock rack, definitely a solid choice. I especially love their little plastic m.2 retention plug things, everyone needs to adopt those!!
The tool-less installation of this USB-C 3.2 Gen2 10Gbps M.2 enclosure is quite similar, and I really dig it.
I've yet to try Supermicro motherboards; is there really no PMbus header anymore?

Subjunctive posted:

I think you could signal it with a specific NOP sequence since on modern processors you can embed arbitrary bytes in it? I think that trick has been used before for hypervisor hints or something? Maybe?
It'll be nice in half a decade, sure.

BlankSystemDaemon fucked around with this message at 02:39 on Jan 11, 2024

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Subjunctive posted:

I think you could signal it with a specific NOP sequence since on modern processors you can embed arbitrary bytes in it? I think that trick has been used before for hypervisor hints or something? Maybe?

I really think you're both barking up the wrong tree. There's a known working example in Darwin, and it doesn't require wacky compiler-involved schemes because it builds on a universal and well understood primitive - threads. As a programmer you use a simple OS API to inform the kernel what kind of work each thread does, and the kernel's scheduler uses that as a hint.

So if a thread is marked as "user interactive", that means it exists to respond to user input events (clicks, keypresses, etc). These almost always get scheduled on P cores, with higher prio than other P-core users, because UI latency is so important. At the other extreme, putting a thread in the "background" bucket means that latency and throughput explicitly aren't important, so the scheduler should feel free to lock it to E cores for best energy efficiency. There's a few other options for various kinds of middle ground.
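
For the curious, a minimal sketch of what that looks like from C on macOS, using the public pthread QoS calls in <pthread/qos.h> (the worker function and its contents are made up for illustration):

code:

#include <pthread.h>
#include <pthread/qos.h>   /* Darwin-only: QoS classes for threads */

/* Hypothetical batch worker: indexing, thumbnailing, etc. */
static void *background_worker(void *arg) {
    (void)arg;
    /* A thread can also reclassify itself at runtime with
       pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0). */
    /* ... batch work here; the scheduler is free to park this on E cores ... */
    return NULL;
}

int main(void) {
    /* Tell the kernel up front that this thread is "background" work:
       latency and throughput explicitly don't matter. */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_set_qos_class_np(&attr, QOS_CLASS_BACKGROUND, 0);

    pthread_t t;
    pthread_create(&t, &attr, background_worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);

    /* The main thread of a GUI app typically runs at QOS_CLASS_USER_INTERACTIVE,
       which is what steers it onto P cores with high priority. */
    return 0;
}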

BlankSystemDaemon
Mar 13, 2009



BobHoward posted:

I really think you're both barking up the wrong tree. There's a known working example in Darwin, and it doesn't require wacky compiler-involved schemes because it builds on a universal and well understood primitive - threads. As a programmer you use a simple OS API to inform the kernel what kind of work each thread does, and the kernel's scheduler uses that as a hint.

So if a thread is marked as "user interactive", that means it exists to respond to user input events (clicks, keypresses, etc). These almost always get scheduled on P cores, with higher prio than other P-core users, because UI latency is so important. At the other extreme, putting a thread in the "background" bucket means that latency and throughput explicitly aren't important, so the scheduler should feel free to lock it to E cores for best energy efficiency. There's a few other options for various kinds of middle ground.
Darwin is the first OS I've heard of to use an interactivity score for it, and yeah, that seems like a pretty good solution - I'll have to look more into it and see if there's a FreeBSD developer I can forward it to.
Best of all, all Unix-like OSes already have the plumbing; for example, on FreeBSD you just set the kern.sched.interact OID using sysctl(8) and each user or process can then set its priority via rtprio(1) or rtprio(2).
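
For reference, a minimal sketch of the rtprio(2) side of that, assuming FreeBSD's <sys/rtprio.h> interface and dropping the calling process to idle priority:

code:

#include <sys/types.h>
#include <sys/rtprio.h>   /* FreeBSD: realtime/idle priority API */
#include <stdio.h>

int main(void) {
    /* Drop the calling process to idle priority; 0 is the highest idle
       priority, RTP_PRIO_MAX (31) the lowest. */
    struct rtprio rtp = { .type = RTP_PRIO_IDLE, .prio = RTP_PRIO_MAX };
    if (rtprio(RTP_SET, 0 /* 0 == current process */, &rtp) != 0) {
        perror("rtprio");
        return 1;
    }
    /* ... low-priority batch work goes here ... */
    return 0;
}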

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

BobHoward posted:

I really think you're both barking up the wrong tree. There's a known working example in Darwin, and it doesn't require wacky compiler-involved schemes because it builds on a universal and well understood primitive - threads. As a programmer you use a simple OS API to inform the kernel what kind of work each thread does, and the kernel's scheduler uses that as a hint.

So if a thread is marked as "user interactive", that means it exists to respond to user input events (clicks, keypresses, etc). These almost always get scheduled on P cores, with higher prio than other P-core users, because UI latency is so important. At the other extreme, putting a thread in the "background" bucket means that latency and throughput explicitly aren't important, so the scheduler should feel free to lock it to E cores for best energy efficiency. There's a few other options for various kinds of middle ground.

Yeah, that’s true. I’m not sure what the foreground/background equivalent is for power-efficiency-vs-throughput, but tagging it at the unit of the thread makes sense.

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

Subjunctive posted:

Yeah, that’s true. I’m not sure what the foreground/background equivalent is for power-efficiency-vs-throughput, but tagging it at the unit of the thread makes sense.

Yeah, in the x86 world things get more complex because Intel's chips are the most prevalent heterogeneous designs and they're way more complicated than Apple's. In Apple's chips, you have two kinds of cores, they're very differentiated into distinct roles, and they aren't pre-overclocked at the factory. In Intel's, while there's still two kinds of core (three in new products where they've finally decided to do truly efficient efficiency cores), their roles overlap a lot more, and the performance/power/efficiency tradeoffs of P cores are highly variable depending on circumstances. (Mainly: whether it's running 1 or 2 threads, and how high it's currently allowed to clock itself.)

I did forget some details about Apple's system - there's some kind of activity tracking mechanism that, while thread B is doing work on behalf of thread A, gives B the same QoS and priority as A, and IIRC this extends across the userspace/kernel boundary. I don't know how essential it is, though, and the low hanging fruit should be in the base idea of letting applications put their threads in different QoS buckets.

Kibner
Oct 21, 2008

Acguy Supremacy
The latest blog post for the Linux port to AS has some insight into how apps can be assigned to the efficiency and performance cores. Scroll down to the Energy-Aware Scheduling section:

https://asahilinux.org/2024/01/fedora-asahi-new/

quote:

Apple Silicon is the uncontested champion of power efficiency. It’s not unusual to get 12-15 hours of battery life out of a MacBook Air on macOS. Is Apple doing some secret sauce magic to squeeze these unbelievable numbers out of the ARM64 cores? Not really. They’re using tricks that Linux has been able to do for quite some time but, until now, no one has really taken advantage of.

Most x86 multi-core processors are comprised of identical CPU cores copy-pasted n-many times - they are symmetric, and fairly distributing work to the cores is relatively simple. Apple Silicon, of course, is a heterogeneous system comprised of the performance P-cores and the efficiency E-cores. Let’s try to schedule a task on Apple Silicon ourselves. We’re informed that Task A requires 10 “performance” (the actual metric is unimportant). Examining our device tree, we see the performance cores can provide 10 performance at 1.2 GHz, whereas the efficiency cores can only muster that at 2.4 GHz, its maximum operating point. So the scheduler places Task A on a performance core, since it can meet the performance requirements at a lower operating point.

However, we’ve made a critical mistake. The P-cores draw 3.7 W at 1.2 GHz whereas the E-cores actually only draw 1.6 W at 2.4 GHz (hence the name efficiency). The scheduler should have placed the task on an E-core! Energy-Aware Scheduling lets us tell the scheduler how much power each core uses at a given operating point, enabling it to make better scheduling choices that minimise energy consumption. This works a treat when the scheduler is making accurate predictions about a task’s performance requirements, but what happens when it can’t do that?

While working on speaker support, we found that Pipewire and Wireplumber were constantly being mis-scheduled onto P-cores. By default, the kernel prioritizes never being “late” above everything else for real-time threads, thus audio processing, due to its real-time nature, was always being given full performance. We did the math, and we found we don’t need anywhere near full performance to run our DSP code. To fix this, we gave Pipewire and Wireplumber the ability to use utilisation clamping, a scheduler feature that lets applications peg their performance requirements to a fixed range. We cap Pipewire and Wireplumber to an extremely low maximum performance so the scheduler restricts them to efficiency cores at their lowest operating point. Both still function perfectly, and we get to save oodles of battery life for our users! This awesome feature goes so underutilised that it wasn’t even enabled in the standard Fedora kernel until we asked for it to be a couple of weeks ago (CONFIG_UCLAMP_TASK), and we sincerely hope that its enablement in Fedora leads to more widespread adoption. It’s not just about reducing energy consumption, either: this feature can also be used in the other direction to optimize for performance in cases where the kernel’s scheduler doesn’t make good decisions by default, which can be important for games and other mixed CPU/GPU workloads.

Putting EAS and utilisation clamping together, we took a 15" M2 MacBook Air from about 6 hours of useable battery life just sitting at the desktop to about 8-10 hours of 1080p30 YouTube, 12-15 hours of desktop use, and an enormous 25-28 hours of screen-off idle time. We still have many more tricks up our sleeves to eke out more battery life, and a deep dive on EAS utilisation clamping is in the works. Watch this space!

I don’t know how the task scheduler determines how much performance an app needs.
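
For anyone wondering what the utilisation clamping described above looks like from userspace, here's a minimal sketch for Linux using the sched_setattr() syscall. There's no glibc wrapper, so the struct and flag values are declared locally to mirror the kernel uapi; the clamp value of 100 is an arbitrary illustration, not what PipeWire actually uses.

code:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc doesn't wrap sched_setattr(), so mirror the uapi struct here. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;   /* 0..1024 */
    uint32_t sched_util_max;   /* 0..1024 */
};

#define SCHED_FLAG_KEEP_ALL       0x18  /* keep existing policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40  /* only touching the max clamp */

int main(void) {
    struct sched_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size        = sizeof attr;
    attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MAX;
    /* Cap this thread's apparent utilisation at ~10% of a full-speed core
       (1024 == full scale); 100 is an arbitrary example value. */
    attr.sched_util_max = 100;

    if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0) != 0) {
        perror("sched_setattr");   /* needs CONFIG_UCLAMP_TASK in the kernel */
        return 1;
    }
    /* ... cheap but latency-sensitive work (e.g. audio DSP) goes here;
       the energy-aware scheduler will now prefer efficiency cores for it ... */
    return 0;
}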

BlankSystemDaemon
Mar 13, 2009



BobHoward posted:

Yeah, in the x86 world things get more complex because Intel's chips are the most prevalent heterogeneous designs and they're way more complicated than Apple's. In Apple's chips, you have two kinds of cores, they're very differentiated into distinct roles, and they aren't pre-overclocked at the factory. In Intel's, while there's still two kinds of core (three in new products where they've finally decided to do truly efficient efficiency cores), their roles overlap a lot more, and the performance/power/efficiency tradeoffs of P cores are highly variable depending on circumstances. (Mainly: whether it's running 1 or 2 threads, and how high it's currently allowed to clock itself.)

I did forget some details about Apple's system - there's some kind of activity tracking mechanism that, while thread B is doing work on behalf of thread A, gives B the same QoS and priority as A, and IIRC this extends across the userspace/kernel boundary. I don't know how essential it is, though, and the low hanging fruit should be in the base idea of letting applications put their threads in different QoS buckets.
What's AMD's HMP design like, comparatively? I know Intel brags about their Thread Director that's part of the microcode, but that seems like a highly complex solution, and what little I've read suggests that AMD's solution is "simpler" (though maybe not easier...)

The userspace/kernel boundary isn't really a problem for any of the BSDs, since they're made as a single OS.

BurritoJustice
Oct 9, 2012

BlankSystemDaemon posted:

What's AMD's HMP design like, comparatively? I know Intel brags about their Thread Director that's part of the microcode, but that seems like a highly complex solution, and what little I've read suggests that AMD's solution is "simpler" (though maybe not easier...)

The userspace/kernel boundary isn't really a problem for any of the BSDs, since they're made as a single OS.

4C cores are 35% smaller and have a much lower fmax, but there aren't any differences in functional units this generation. For reference on clocks, 4C in servers maxes out at 3.1GHz and on desktop the highest they're clocking them stock is 3.7GHz 1T boost. The 3.7GHz boost is way out of the efficiency curve for the core though; it takes as much voltage to take a 4C core to that clock as it does to take a regular Zen 4 core to 5GHz.

Next generation Zen5 will have full width AVX512 (2x512b per cycle), while 5C will keep the same double pumped design from Zen4 (2x256, running over two cycles) and so there will also be a gap in FP performance to worry about.

BlankSystemDaemon
Mar 13, 2009



BurritoJustice posted:

4C cores are 35% smaller and have a much lower fmax, but there aren't any differences in functional units this generation. For reference on clocks, 4C in servers maxes out at 3.1GHz and on desktop the highest they're clocking them stock is 3.7GHz 1T boost. The 3.7GHz boost is way out of the efficiency curve for the core though; it takes as much voltage to take a 4C core to that clock as it does to take a regular Zen 4 core to 5GHz.

Next generation Zen5 will have full width AVX512 (2x512b per cycle), while 5C will keep the same double pumped design from Zen4 (2x256, running over two cycles) and so there will also be a gap in FP performance to worry about.
Ah, so there are still microarchitectural differences between them.
That sucks.

Canned Sunshine
Nov 20, 2005

CAUTION: POST QUALITY UNDER CONSTRUCTION



Zen 5 is also moving from 4 to 6 ALUs, so I'm wondering now if Zen 5c will see that, or will remain at 4 if it's based off of Zen 4.

That's disappointing if it's the case.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Canned Sunshine posted:

Zen 5 is also moving from 4 to 6 ALUs, so I'm wondering now if Zen 5c will see that, or will remain at 4 if it's based off of Zen 4.

That's disappointing if it's the case.

Ostensibly the 4c/5c cores are competing against Intel E-cores that not only clock lower but also only have 128-bit wide vector units, right? They have no need to worry about getting outperformed.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
The 1.1.0.1 version of the AM5 ComboPI still as poo poo as 1.1.0.0?

BobHoward
Feb 13, 2012

The only thing white people deserve is a bullet to their empty skull

BlankSystemDaemon posted:

Ah, so there are still microarchitectural differences between them.
That sucks.

No, BurritoJustice was saying there aren't uarch differences between Zen 4 and 4C. Literally the same execution units and pipeline stages and so forth, but to save space and reduce power 4C has two optimizations:

- The through-silicon via (TSV) sites used to do die stacking in a full Zen 4 core take some area, and are present even in parts which don't have 3D V-Cache. Zen 4C doesn't support 3D V-Cache, so some of its area reduction comes from omitting TSV pad structures and 3D V-Cache support logic.

- A significant part of "clock go fast brrrrr" design methodology consists of selecting larger and faster versions of gates in the cell library during physical design. Large is fast due to RC (resistance-capacitance) delay. The wires connecting gates together have some resistance, and they also have substantial parasitic capacitance, so the speed at which they can switch state depends on how much current the gate driving the wire can sink or source to drain or fill the capacitance. Higher gate output current requires that it be built from bigger, hotter transistors. Because 4C targets lower Fmax, AMD was able to save a lot of area and power by selecting smaller and lower power versions of gates from the cell library.

Arguably you don't actually want 4C's level of microarchitectural identity, because Z4 is a uarch designed around hitting higher clock speed targets. If you could go back in time and wave a magic wand to have the AMD team design a new clean-sheet uarch designed from the ground up for the space 4C sits in, they almost certainly could've beaten 4C in one or more of power, area, and performance. That said, I think AMD was very smart to do 4C how they did it - it's effective enough, low-risk, and fits with their budget and engineering team headcount.
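
To put the cell-sizing point in rough first-order terms (textbook approximations, not AMD's actual numbers):

\[
t_\text{delay} \approx 0.69\,R_\text{drive}\,C_\text{load}, \qquad
R_\text{drive} \propto \frac{1}{W_\text{gate}}, \qquad
P_\text{dyn} \propto C\,V^2 f
\]

Halving a gate's drive resistance to hit a higher clock roughly means doubling its width, which costs area and adds capacitance of its own; targeting a lower Fmax lets 4C walk that trade-off back toward smaller, cooler cells.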

Arzachel
May 12, 2012
Yeah, I feel like 4C is the result of one of the hyperscalers going "make more cores" instead of an explicit efficiency core design

PC LOAD LETTER
May 23, 2005
WTF?!

Combat Pretzel posted:

The 1.1.0.1 version of the AM5 ComboPI still as poo poo as 1.1.0.0?

Varies on a board-to-board basis. It's fine for me though (Gigabyte Aorus X670E Master). The previous version was poo poo for me and I have no clue why.
