BurritoJustice
Oct 9, 2012

The full cache can't really be used by cores on one CCD, at least not in any beneficial sense. Going to the other CCD is as slow as going to RAM, it's only an option at all because it's necessary to maintain cache coherence.

A 64MB cache AMD CPU is really best described as 2x32MB.

BurritoJustice
Oct 9, 2012

buffbus posted:

With the important note that this does not apply to all AMD CPUs. It's a large part of why some prefer the 7800X3D over the 7950X3D.

Yeah, it's why I specified 64MB.

In my ideal world where they say "2x32MB" or "96MB+32MB", we'd also be spared all the dogshit articles saying "You could install your operating system in cache now!" because AMD markets Genoa-X as 1.152GB L3 instead of 12x96MB.

BurritoJustice
Oct 9, 2012

Twerk from Home posted:

When the E cores wake up it slows down the ring bus, slowing down how fast the P-cores are able to access L3 or memory: https://chipsandcheese.com/2021/12/16/alder-lake-e-cores-ring-clock-and-hybrid-teething-troubles/.

This was fixed in Raptor Lake.

E cores and P cores are definitely better able to work on the same data than multi-CCD AMD, nothing you mentioned comes close to the cross-CCD penalty. It just turns out that as long as your CCX is large enough and your memory is uniform, you can just have NUCA without it really mattering too much for a lot of workloads.

Cygni posted:

one of MLID's rumors was that AMD was pushing board vendors to certify this summer's X870 boards (if they end up shipping them, it seems questionable) up to DDR5-8000, which would still require you to go to 2:1 mode on the memory controller and will likely perform worse across the board with single CCD CPUs.

but hey, marketing will like it! number go up!

2:1 mode is faster than 1:1 mode for gaming when properly configured, even for single CCD CPUs (which you treat all AMD CPUs as for gaming). Running FCLK:UCLK at 1:1 is a large latency decrease and gaming performance increase, just like it was for AM4 era Zen CPUs, but you normally lose too much memory bandwidth to make it worth it. DDR5-8000 means you can run 4000:2000:2000 (MCLK:UCLK:FCLK), letting you still saturate your GMI link on bandwidth while providing lower latency than 1:1 mode.
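
For anyone who wants the ratio arithmetic spelled out, here's a rough Python sketch. The function name and the divider parameter are made up for illustration. The only facts it relies on are that DDR does two transfers per memory clock and that 2:1 mode halves UCLK relative to MCLK.

```python
def clocks(mt_per_s: int, uclk_divider: int) -> tuple[float, float]:
    """Return (MCLK, UCLK) in MHz for a given DDR5 transfer rate.

    uclk_divider is 1 for 1:1 mode and 2 for 2:1 mode (hypothetical parameter).
    """
    mclk = mt_per_s / 2          # DDR: two transfers per memory clock
    uclk = mclk / uclk_divider   # memory controller clock
    return mclk, uclk


print(clocks(8000, 2))  # (4000.0, 2000.0)
print(clocks(6000, 1))  # (3000.0, 3000.0)
```

At DDR5-8000 in 2:1 mode the UCLK lands on 2000MHz, which is an FCLK you can actually run, so UCLK and FCLK end up matched. At DDR5-6000 in 1:1 mode the UCLK is 3000MHz and the FCLK has no hope of matching it.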

BurritoJustice
Oct 9, 2012

Cygni posted:

huh ive never tried 7000+ myself, ive only got 6000 rated DDR5 and never really bothered beyond timings with my 7800X3D. i was just going off Buildzoid's numbers and opinions, but i might be wrong then.

BZ doesn't believe in FCLK:UCLK sync being a thing at all despite it empirically showing benefit. He's repeatedly said in the past that there is also no reason to maintain the 2:3 ratio that AMD recommends, but you can microbench it making a measurable difference in latency.

I love a good opportunity to bring out my favourite infographic:



Aeryn didn't even hit 8000, only 7800, but you can still see it wins at everything. Dropping your FCLK to 1950MHz only loses out to the 2200MHz config in one test, because the bandwidth microbench is purely a GMI bench, so you can see the benefit of UCLK:FCLK match. You're still only looking at a 136.20/127.29 ≈ 7% mean performance increase over just using DDR5-6000 with buildzoid timings, lol.
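
The percentage is just the ratio of the two chart means quoted above. A quick Python check, with a helper name that is purely illustrative:

```python
def percent_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# 136.20 = mean for the DDR5-7800 config, 127.29 = mean for DDR5-6000 with buildzoid timings
gain = percent_gain(136.20, 127.29)
print(f"{gain:.1f}%")  # 7.0%
```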

BurritoJustice
Oct 9, 2012

Combat Pretzel posted:

Hmmm I’m running 2033MHz FCLK because of some Buildzoid rambling adjacent to his timings. I guess I should put it back to 2000MHz since I run RAM at 6000MT.

It is still better to run higher FCLK than 2:3, but only if it's a good few straps higher.

I'd either fiddle around finding the highest you can run, which will typically be in the range [2133,2200], or just drop back down to 2000.

BurritoJustice
Oct 9, 2012

Dr. Video Games 0031 posted:

Users who might want some advantage from overclocking but don't want to manually oc for funsies should just use the buildzoid timings. Despite him being wrong about a few things, his DDR5-6000 timings are highly compatible with Hynix kits and provide a tangible benefit over stock DDR5-6000 expo. There seems to be a bigger gap between buildzoid's timings and EXPO timings than there is between buildzoid timings and the DDR5-7800 config.

Yeah, BZ timings are great. He did a good job of choosing conservative enough timings that still capture the majority of the performance benefit of manual tweaking.

IIRC his suggestion of using 2033MHz is because the early AGESAs actually hard enforced the 2:3 ratio by silently upping the MCLK/UCLK to match, so his 2033 numbers are actually running at 6100MT/s which is why he sees a performance benefit. Newer AGESAs don't do this and will absolutely let you desync your FCLK, and doing it to gain 33MHz is not worth it.
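
The 6100MT/s figure falls out of the ratio if you assume the firmware holds FCLK:UCLK at 2:3 and keeps UCLK equal to MCLK (1:1 mode). A small sketch under those assumptions, with a made-up function name:

```python
def enforced_memory_speed(fclk_mhz: float) -> float:
    """MT/s implied when FCLK:UCLK is forced to 2:3 and UCLK == MCLK (assumed 1:1 mode)."""
    uclk = fclk_mhz * 3 / 2   # 2:3 ratio
    mclk = uclk               # 1:1 mode
    return mclk * 2           # DDR: two transfers per clock


print(enforced_memory_speed(2000))  # 6000.0
print(enforced_memory_speed(2033))  # 6099.0
```

The 2033 strap is nominally 2033.33MHz, which is where the round 6100 comes from.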

BurritoJustice
Oct 9, 2012

BlankSystemDaemon posted:

As in, it'll show a noticeable performance benefit in real-world gaming, or synthetic benchmarks that just benefit from bandwidth and don't take latency into account?

I mean in real-world videogame Baldur's Gate 3 you're gaining 10FPS just from BZ timings over EXPO in the chart I posted.

It's been true for generations now that if you are running highly tuned/overclocked ram you're basically a generation ahead on gaming performance. A 9900K with tuned to the gills ram sits roughly in between a 5800X and 5800X3D in games.

BurritoJustice
Oct 9, 2012


That's the one, yeah.

BurritoJustice
Oct 9, 2012

They already moved the power management with DDR5, if you have UDIMMs they take 5V and if you have RDIMMs they take 12V and all the regulation is done on DIMM.

Dynamic voltage/frequency scaling is coming though, which Intel experimented with on DDR5 with XMP 3.0 but nobody uses because it sucks to have your ram going back to JEDEC all the time.

BurritoJustice
Oct 9, 2012

Kibner posted:

afaik, Windows can't really tell. It just knows that if XBox Game Bar recognizes an app and the CPU is a 7950X3D or a 7900X3D, then put the app's threads on cores 0-6/8/12/16 (depending on which cpu and if hyperthreading is turned on).

But maybe that has changed or my understanding has always been wrong!

It doesn't actually do any thread affinity or pinning, the way the driver works is this:

- Detect game is running thanks to game bar
- Change CPPC Preferred Core order to be all cache cores before all frequency cores (by default, non-vcache cores are first so that single threaded tasks can get full boost clock).
- Change the Windows power plan policy to park 50% of cores, which will turn off the bottom 50% of cores in the CPPC preferred cores order (which are now all the non-vcache cores).
- If load on active cores exceeds a certain threshold, disable core parking until load goes below another threshold.

The key differences between this and actually doing core affinity are:

- Your second CCD doesn't do anything while the game is running.
- If you do load up a multithreaded task in the background, it'll just turn off the parking and because there is no affinity nothing will keep the game from moving over to the other cores (which it usually does immediately and kills game performance).
- If you do intermittent stuff in the background it'll also end up starting and stopping the parking which leads to stuttery performance.
- If you ever install the VCache driver and later turn off the second CCD in the UEFI, you'll amusingly end up parking half your cache cores whenever you play a game (down to 3/4 cores!). It's weirdly insidious, you basically need to reinstall Windows to remove it.

If you want a 7950X3D for the use case of wanting background MT tasks while gaming, you've gotta get used to using affinity. I have genuinely no clue why AMD doesn't use affinity in the first place.
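
To make the parking behaviour above easier to picture, here's a toy Python model of it. This is not AMD's driver code. The function name, the core numbering (cache CCD as cores 0-7) and the default 50% fraction are assumptions taken from the description, and the load-threshold unparking step is left out.

```python
def active_cores(cache_cores, freq_cores, game_running, park_fraction=0.5):
    """Return the cores left unparked, following the steps described above."""
    if not game_running:
        # Default CPPC order: frequency cores first, nothing parked.
        return list(freq_cores) + list(cache_cores)
    # Game detected: cache cores go first in the preferred order...
    order = list(cache_cores) + list(freq_cores)
    # ...and the bottom `park_fraction` of that order gets parked.
    keep = len(order) - int(len(order) * park_fraction)
    return order[:keep]


# 7950X3D with both CCDs enabled: only the cache CCD stays awake.
print(active_cores(range(8), range(8, 16), game_running=True))
# Second CCD disabled in UEFI but driver still installed: half the cache cores get parked.
print(active_cores(range(8), [], game_running=True))
```

The second call is the "down to 4 cores" case: the policy parks half of whatever is in the list, with no check that those cores are actually the non-vcache ones.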

And to add more specific numbers to the frequency discussion, the fused Fmax for each of the SKUs is as follows (VCache CCD/frequency CCD on the dual CCD parts):

7800X3D: 5050MHz
7900X3D: 5150MHz/5650MHz
7950X3D: 5250MHz/5750MHz

The higher VCache clocks on the dual CCD SKUs aren't just at the top end of the VFT curve, the whole curve is brought up so you're typically always running around that +100/+200MHz range over the 7800X3D.

BurritoJustice
Oct 9, 2012

Kibner posted:

Ooh, thanks for that write up! I actually hadn't heard some of those specifics before. Especially the part with the 3D cache driver.

I need to go read up on how the 7950X3D behaves on *nix since that is what I'm running it on right now. I imagine there isn't anything done automatically to put game threads on the X3D cores.

Yep, there is nothing specific on Linux and if you enable preferred cores (which will be on by default with the 6.9 kernel) it'll actually put games on the frequency cores first. I just launch with

code:
 taskset -c 0-7 {game} 
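
If you'd rather do the same thing from a launcher script, a rough Python equivalent looks like this. It's Linux only, the function name is made up, and whether logical CPUs 0-7 are actually your VCache cores depends on your topology, so check that before copying the range.

```python
import os
import subprocess


def run_pinned(cmd, cores):
    """Run cmd with its threads restricted to the given logical CPUs (Linux only)."""
    def pin():
        # Runs in the child just before exec; 0 means "the calling process".
        os.sched_setaffinity(0, set(cores))
    return subprocess.run(cmd, preexec_fn=pin)


# Same mask as `taskset -c 0-7 {game}`; "./game" is a placeholder path:
# run_pinned(["./game"], range(8))
```

Threads the game spawns inherit the mask, same as with taskset.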

BurritoJustice
Oct 9, 2012

The Ryzen 7000 threadrippers are atrocious for gaming because they use the giant 12 channel IO die from Genoa, which when fully enabled has four NUMA domains with three memory channels and three CCDs each. The configuration used for TR however is four total CCDs and four total memory channels, one per NUMA node. So each CCD only has a single local memory channel.

One of the CCDs is the gaming targeted one and has an Fmax that is 500MHz higher than all the others, and if you turn on gaming mode in the UEFI it will disable all the other CCDs to maximise gaming performance, but you're still crippled for gaming by the awful memory topology.

BurritoJustice
Oct 9, 2012

Combat Pretzel posted:

I would have figured that they'd have seen the error of their ways with the earlier Threadrippers and gone with a UMA memory controller this time around.

Threadripper non-Pro is just an afterthought they are limping along so they can technically claim they still have HEDT chips, they'll never spin up a new die for it so the best we are getting is the full size IO die cut down enough for it to not cut into their high margin lines.

You have to be in a weird spot of wanting lots of cores but not particularly caring about memory or IO to want to buy TR non-pro now, it's a platform that isn't really a good buy for any real use cases. There's a reason MSI literally didn't bother to make a TRX50 motherboard while the other big vendors only made one SKU each, it's going to sell close to zero chips and they know it.

BurritoJustice
Oct 9, 2012

Klyith posted:

The memory bandwidth is fine, it's latency that gets killed by having to hop across the IO die to a different controller. That's real bad for games, but plenty of Real Work type things are much less bothered.

Bandwidth is also not incredible, you've got four times the cores of a 7950X and only twice the memory bandwidth.
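
Taking the "four times the cores, twice the bandwidth" comparison at face value, the per-core share works out like this. It's a rough sketch that assumes every channel runs at the same speed. The function name is illustrative, and the 64-core / 4-channel figures are just 4x and 2x of a 16-core dual channel 7950X.

```python
def channels_per_core(cores: int, memory_channels: int) -> float:
    """Memory channels available per core, as a crude bandwidth-per-core proxy."""
    return memory_channels / cores


desktop = channels_per_core(16, 2)        # 7950X: 16 cores, dual channel
threadripper = channels_per_core(64, 4)   # 4x the cores, 2x the channels
print(threadripper / desktop)  # 0.5
```

So each core gets about half the bandwidth it would on the desktop part when everything is loaded.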

BurritoJustice
Oct 9, 2012

They're just rebranded Raphael, there is nothing different about the dies. You get standard UDIMMs connected to the same IOD.
