HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

FuturePastNow posted:

just validates my decision to never overclock anything

If you set xmp, the risk is there

I manually reduced the i/o die voltage when setting up both my am4 and am5 boards after setting xmp, so I guess that would help, but I did that because I hate the way power usage soars when setting xmp

HalloKitty fucked around with this message at 09:28 on Apr 25, 2023


BlankSystemDaemon
Mar 13, 2009



Desuwa posted:

Not sure I'd count EXPO as overclocking when things are advertised at those speeds. The manufacturers love to call it overclocking as a way of shifting responsibility when things go poorly (like this debacle) and being able to advertise things they can't guarantee.
It can be overclocking (and most likely is) because while the JEDEC specs do say that DDR5 can be between 4800MT/s and 8400MT/s (i.e. 2400MHz and 4200MHz respectively), what you're paying G. Skill and other ODMs for is that they've tested the memory up to the speed listed.
By default, all DDR5 memory has the SPD programmed to run at 4800MT/s, which is why you have to enable EXPO.
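
For anyone tripping over MT/s versus MHz, here's a minimal sketch of the conversion used above. The function name is made up for illustration; the only fact it relies on is that DDR transfers data on both clock edges, so the clock is half the transfer rate.

```python
def mts_to_clock_mhz(transfer_rate_mts: int) -> float:
    """DDR moves data on both clock edges, so the clock is half the transfer rate."""
    return transfer_rate_mts / 2


# Rates mentioned in this thread: SPD default, the usual EXPO target, top of the quoted range.
for rate in (4800, 6000, 8400):
    print(f"DDR5-{rate}: {mts_to_clock_mhz(rate):.0f} MHz clock")
```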

Maybe Micron, Samsung, or SK Hynix will start manufacturing DDR5 at higher speeds eventually? But as far as I know, that's not happened yet.

Dr. Video Games 0031 posted:

My voltages are all pretty normal on my gigabyte board. The CPU's "VDDCR_SOC" reading is 1.25V with expo enabled and 1.1V without it, and the motherboard's "VCORE SoC" reading is 1.296V with or without EXPO enabled. I'm not entirely sure what the difference between these is and why the motherboard reading is constant regardless of settings, but as far as I can tell, none of these are dangerous levels or anything.
VDDCR_SOC in particular is the voltage rail for the System Management Unit (which does PBO et cetera), the Platform Security Processor (as mentioned previously ITT, it's responsible for DRAM timing and secure boot, though how those two are linked I've no idea), and the built-in GPU cores.

Dr. Video Games 0031
Jul 17, 2004

Tom's Hardware has a theory after speaking to some industry contacts: https://www.tomshardware.com/news/amd-ryzen-7000-burning-out-root-cause-identified-expo-and-soc-voltages-to-blame

the tl;dr seems to be that soc voltage is going too high and some critical monitoring parts of the chip are getting zapped. Tom's Hardware says that some temperature sensors and thermal protection mechanisms are dying, which is causing the chip to behave in an unpredictable manner and start drawing excessive amounts of current in a death spiral. So I guess high SoC voltage is being blamed here and the damage is showing up on the vcore pins due to the nature of the death spiral.

I don't entirely understand the explanation, to be honest. Aren't there also current and power limits that the chip would have to disregard for this to happen? I suppose PBO would need to be enabled so those limits are lifted? But some people say they're just enabling expo. And can damaged temperature sensors on the i/o die cause this kind of death spiral in the ccd? actual question, because i really don't know and this article doesn't explain much.

Icept
Jul 11, 2001
It seems like a bit too obvious design flaw that the component intended to protect the chip against overvoltage is succumbing to overvoltage.

kliras
Mar 27, 2021
iirc, my 2700x also had some very weird power fluctuations at docp. at this point, that's probably normal, and the main story seems to be that the guardrails are broken more than anything else

BlankSystemDaemon
Mar 13, 2009



Icept posted:

It seems like a bit too obvious design flaw that the component intended to protect the chip against overvoltage is succumbing to overvoltage.
I think you're underestimating just how poorly firmware can be written.

There's a joke about firmware, which is that companies are afraid to move it out of a former employee's home directory, because changing the build environment might break the build.
Except it's not a joke.

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
I’m sure glad I’ve adopted lower voltages for the memory overclock than what Buildzoid originally suggested. I had SOC at 1.2V and dropped it to 1.15V now due to all this stuff (and VDDIO etc. to 1.25V). Still seems to work, no WHEA errors so far.

kliras
Mar 27, 2021
the scene of the crime lines up with the ccd's according to buildzoid

https://twitter.com/Buildzoid1/status/1650793026803978242

power crystals
Jun 6, 2007

Who wants a belly rub??

BlankSystemDaemon posted:

I think you're underestimating just how poorly firmware can be written.

It might also just be some engineer at AMD thinking "nobody will run the SoC at the kind of voltage that would kill this" and no one else even noticing the assumption to challenge it.

My personal guess with regards to why asus seems to be most affected is it's a side effect of their struggle to hit rated memory speeds in general on AM5. Maybe it's as simple as them juicing the voltage a bit more to try to make it work, or maybe it's something weirder. Everybody else seems to just be a victim of laziness, except that said laziness is blowing up expensive hardware in an entirely avoidable way.

Enos Cabell
Nov 3, 2004


I spent yesterday evening swapping out the guts of my pc for my 7800x3d build and missed this whole conversation. All I did in BIOS so far was flash to the latest version and then enable XMP for the Corsair CL30 ddr5. Should I disable XMP for now until whatever this mess is gets untangled? Gigabyte Aorus A670 board.

kliras
Mar 27, 2021
update bios at least. consider disabling expo if you're paranoid

always worth keeping in mind that redditors tend to use the most insane settings, in part because they always publish the "great results" they're getting but never come back, tail between their legs, to admit that the same settings blew up the pc

Dr. Video Games 0031
Jul 17, 2004

Gigabyte has joined Asus in straight-up deleting the first several bios versions from their support pages. I suspect the latest bios versions will be fine to use with the X3D chips. I'm on the latest B650 Aorus Pro bios, and I am not noticing any anomalies with any of the voltage readouts with expo enabled.

repiv
Aug 13, 2009

MSI did the same, if every board maker is doing it then the mistake was probably in AMDs reference code

kliras
Mar 27, 2021

repiv posted:

MSI did the same, if every board maker is doing it then the mistake was probably in AMDs reference code

game of telephone caveats and all that

https://twitter.com/1usmus/status/1650791399699218433

https://twitter.com/1usmus/status/1650793669895626752

hobbesmaster
Jan 28, 2008

Icept posted:

It seems like a bit too obvious design flaw that the component intended to protect the chip against overvoltage is succumbing to overvoltage.

You see it’s just like a fuse.

More seriously, VSoC was also the main way you could “easily” fry AM4 zen parts. SoCs generally simply don’t have the space for serious protection circuitry, that’s the job of the motherboard.

Klyith
Aug 3, 2007

GBS Pledge Week

Dr. Video Games 0031 posted:

I don't entirely understand the explanation, to be honest. Aren't there also current and power limits that the chip would have to disregard for this to happen? I suppose PBO would need to be enabled so those limits are lifted? But some people say they're just enabling expo. And can damaged temperature sensors on the i/o die cause this kind of death spiral in the ccd?

Voltage isn't one of the things limited by PBO, just amps & watts. More voltage means you can deliver more power in fewer amps.
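
A quick sketch of that last point, using P = V × I. The 100 W figure and the two voltages are made-up numbers for illustration, not readings from any chip.

```python
def amps_for_power(watts: float, volts: float) -> float:
    """Current needed to deliver a given power at a given voltage (I = P / V)."""
    return watts / volts


power_w = 100.0  # hypothetical power draw, for illustration only
for volts in (1.0, 1.25):
    print(f"{power_w:.0f} W at {volts:.2f} V needs {amps_for_power(power_w, volts):.0f} A")
```

So a current limit on its own doesn't cap power if the voltage is allowed to climb.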

But there's a good chance that there's more going on than just too much power delivered to the silicon -- Buildzoid made a pretty good point that the temperatures needed to bubble & burn the substrate are not possible with a functioning CPU. So there's almost certainly another step after putting too much power through the CCD, where something shorts and the whole vcore rail dumps through.

IMO that sounds very likely. It's gotta be a multi-level chain of failure for it to be this destructive and never seen by AMD during engineering & testing.


Icept posted:

It seems like a bit too obvious design flaw that the component intended to protect the chip against overvoltage is succumbing to overvoltage.

Because that's actually the VRM's job. "Voltage Regulator Module"

The SOC just requests more or less volts, and that's used by the VRM to select the actual volts. And there's more interpretation than just the CPU saying "I want 1.2V" and the VRM doing that.

So when you run a -30mV offset, what happens is the CPU requests a voltage level, the VRM compares that level versus a defined voltage curve, selects the correct voltage, subtracts 30mV, then outputs that. This is why PBO2 curve optimizer is better than a static offset: fiddling with that curve is more granular.

If you set the mobo to manual voltage because you want to do an all-core constant 5ghz OC, it ignores the CPU's requests entirely.
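
Roughly, the flow described above could be sketched like this. Everything here is hypothetical: the curve values, the names, and the dict lookup are stand-ins to show the order of operations, not how any real VRM controller or BIOS implements it.

```python
from typing import Optional

# Hypothetical voltage curve: level the CPU requests (mV) -> voltage the curve selects (mV).
# These numbers are invented for illustration.
VOLTAGE_CURVE_MV = {1100: 1110, 1200: 1215, 1300: 1320}


def vrm_output_mv(requested_mv: int, offset_mv: int = 0,
                  manual_mv: Optional[int] = None) -> int:
    """Return the voltage the VRM would supply in this simplified model."""
    if manual_mv is not None:
        # Manual voltage mode: the CPU's request is ignored entirely.
        return manual_mv
    # Otherwise: look the request up on the curve, then apply the static offset.
    return VOLTAGE_CURVE_MV[requested_mv] + offset_mv


print(vrm_output_mv(1200, offset_mv=-30))    # curve picks 1215, minus the 30 mV offset
print(vrm_output_mv(1200, manual_mv=1250))   # manual mode wins regardless of the request
```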


And the SOC itself is supposed to get just a steady, constant voltage supply.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Again, the caveat with this theory being that VCore is the only voltage the cache die gets, and none of those voltages have anything to do with VCore. VSOC and VDIMM aren’t fed to the cache.

The burned pads consistently in the VCore region are suggestive that it is VCore, and it happening on X3D chips in particular (which have an obvious reason to be sensitive towards VCore) makes sense along those lines. If it’s caused by zapping the VSOC then it would damage regular CPUs at an equal rate.

(which I have said is likely a distinct possibility too but the failures do seem to point towards X3D being hit at a higher rate - but again that also can be explained by enthusiasts having more X3D chips than the public as a whole. Crowd-sourced failure reports suck rear end, they’re just all we have until someone gets a failure lab report.)

Again, it’s possible there are actually multiple issues or failure modes here, for example if AMD’s reference AGESA implementation (it seems) doesn’t enforce any voltage or thermal limits and there’s also this other issue around the VDIMM/VSOC

Paul MaudDib fucked around with this message at 15:11 on Apr 25, 2023

Ulio
Feb 17, 2011


Great timing, I just upgraded to Ryzen 7000 on Saturday. I had to upgrade Bios since I was getting all sorts of crashes after upgrading to w11.

PC LOAD LETTER
May 23, 2005
WTF?!

Combat Pretzel posted:

I’m sure glad I’ve adopted lower voltages for the memory overclock than what Buildzoid originally suggested.
He suggested 1.35v vSOC for his quick n' dirty Hynix guide. I think he mentioned off hand that going over 1.4v was a bad idea for long term use.

Supposedly the people whose chips are burning up somehow had their vSOC at 1.44-1.5v or higher.

If it's happening because AMD or the mobo guys screwed up the BIOS settings somehow then that is a pretty dumb mistake. Some people are posting screenshots of their EXPO RAM with the vSOC at 1.2v default for DDR5 6000 kits though so whatever it is it seems to be inconsistent.

edit: checking Gigabyte's site for my mobo looks like they deleted all the BIOSes prior to F9 which was released 3/22 and has AGESA 1.0.0.5c.

PC LOAD LETTER fucked around with this message at 15:44 on Apr 25, 2023

Combat Pretzel
Jun 23, 2004

No, seriously... what kurds?!
Those are the voltages he initially suggested:

quote:

VDDIO / MVDD / MVDDQ: 1.35V
VSOC: 1.25V
I figured, I have ECC, if memory transfers are poo poo, I'm gonna notice and went -50mV on all these from the get-go. Now with this CPU killing drama, I'm at -100mV on all these. So far it works.
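
The arithmetic, as a small sketch. The starting values are the suggested ones quoted above; the helper name is made up, and everything is kept in integer millivolts to avoid floating-point fuzz.

```python
# Suggested starting voltages from the quote above, in millivolts.
SUGGESTED_MV = {"VDDIO / MVDD / MVDDQ": 1350, "VSOC": 1250}


def apply_offset(settings_mv: dict, offset_mv: int) -> dict:
    """Apply the same offset (in mV) to every rail."""
    return {name: mv + offset_mv for name, mv in settings_mv.items()}


for offset in (-50, -100):
    for name, mv in apply_offset(SUGGESTED_MV, offset).items():
        print(f"{offset:+d} mV  {name}: {mv / 1000:.2f} V")
```

At -100mV that works out to 1.25V on the memory rails and 1.15V on VSOC.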

--ninja edit: Also, why would VSOC need a bump anyway? Because the IF is ramped up in clock?

hobbesmaster
Jan 28, 2008

Combat Pretzel posted:

--ninja edit: Also, why would VSOC need a bump anyway? Because the IF is ramped up in clock?

It’s also where the memory controller is, VDIMM and everything else are the voltages on the memory side.

PC LOAD LETTER
May 23, 2005
WTF?!
Ah OK thanks for the correction.

My understanding is for memory overclocking the memory controller on AM5 tends to need some extra volts and didn't much matter for the IF bus clocks which was more affected by die quality.

runaway dog
Dec 11, 2005

I rarely go into the field, motherfucker.
so, I guess I bought the wrong ram? I got an xmp labeled set because I didn't see any expo 6400 trident rgb and after reading some threads saying it didn't matter, but instead of expo on my new ryzen 7800x3d+ asus x670e-e it has DOCP, everything seemed to work fine when I was using it, I've turned it off on account of this whole melting situation but should I try to get some expo ram or does it not matter?

runaway dog fucked around with this message at 16:21 on Apr 25, 2023

hobbesmaster
Jan 28, 2008

Just set voltages manually to something sane.

PC LOAD LETTER
May 23, 2005
WTF?!
Yeah the problem gets avoided by checking and if necessary manually setting your voltages lower.

Some people apparently had their voltages automatically set way too drat high for some reason when EXPO was used. What's not clear yet is if it's because AMD screwed something up or the mobo vendors did in the BIOS.

DOCP is AMD's version of XMP and precedes EXPO.

edit: yeah apparently it's ASUS's rebrand of AMP from way back in 2015, got my memory mixed up: https://www.tomshardware.com/reviews/ddr-dram-faq,4154-5.html

PC LOAD LETTER fucked around with this message at 16:31 on Apr 25, 2023

Dr. Video Games 0031
Jul 17, 2004

edit: ^^^^ "DOCP" is an Asus feature that adapts XMP profiles to AMD platforms. Other brands either have their own thing or will just use the XMP name.

runaway dog posted:

so, I guess I bought the wrong ram? I got an xmp labeled set because I didn't see any expo 6400 trident rgb and after reading some threads saying it didn't matter, but instead of expo on my new ryzen 7800x3d+ asus x670e-e it has DOCP, everything seemed to work fine when I was using it, I've turned it off on account of this whole melting situation but should I try to get some expo ram or does it not matter?

Was it actually running at full speed? (6400 MHz as reported by windows task manager.) If so, that's honestly impressive since Zen 4 typically struggles to go above 6200, and the safe reliable setting is considered 6000. If your motherboard is setting the speed to 6000 or whatever instead, then yeah, I'd say you probably got the wrong ram. But if it works, it works.

What voltages is it setting? Because that's the main issue here.

runaway dog
Dec 11, 2005

I rarely go into the field, motherfucker.

Dr. Video Games 0031 posted:

edit: ^^^^ "DOCP" is an Asus feature that adapts XMP profiles to AMD platforms. Other brands either have their own thing or will just use the XMP name.

Was it actually running at full speed? (6400 MHz as reported by windows task manager.) If so, that's honestly impressive since Zen 4 typically struggles to go above 6200, and the safe reliable setting is considered 6000. If your motherboard is setting the speed to 6000 or whatever instead, then yeah, I'd say you probably got the wrong ram. But if it works, it works.

What voltages is it setting? Because that's the main issue here.

1.4 and yeah it was running at 6400, I never realized that the xmp/expo profiles were on the ram itself, but I guess it makes sense.

It was stable tho cp77 did crash but I just chalked it up to the new path tracing, I haven't really had a chance to try any other games yet, but cinebench ran a bunch of times no problem

I'm just gonna get an expo kit, I had ram instability on my last system rarely because I used mismatched kits, but it was still annoying

wanna get this right this time and I'm new to amd and ddr5 so whats the good CL sweet spot for trident z5 neo 6000? 30/38/38/96?

runaway dog fucked around with this message at 16:49 on Apr 25, 2023

PC LOAD LETTER
May 23, 2005
WTF?!
Dude you got that kit running at 6400 with ease mostly stable.

That is pretty good RAM. And a mobo. And CPU. Most people struggle to get AM5 systems to 6400 right now. 6600 is technically possible but incredibly hard to pull off.

Just set it at 6000 and maybe lower the RAM to 1.35v and you'll be fine. No need to spend money.

runaway dog
Dec 11, 2005

I rarely go into the field, motherfucker.

PC LOAD LETTER posted:

Dude you got that kit running at 6400 with ease mostly stable.

That is pretty good RAM. And a mobo. And CPU. Most people struggle to get AM5 systems to 6400 right now. 6600 is technically possible but incredibly hard to pull off.

Just set it at 6000 and maybe lower the RAM to 1.35v and you'll be fine. No need to spend money.

well I would be returning the old kit, but drat I didn't realize I had some good samples, that never happens for me, I did notice it not going above 75c (Tdie) or 85c (Tctl/Tdie) on the nh-d15s in cinebench which I was looking into and people were saying they were getting similar temps on big watercooling rigs, and that was with the cpu soc voltage doing its 1.40v+ thing

DoctorRobert
Jan 20, 2020

Looks awesome. Not so sure about the control setup on the Asus handhelds from the pics I've seen but all the same. Down the line, if these igpus keep getting stronger at this pace, coupled maybe with one of those Frore cooler things from CES if they ever pan out, you could have a total beast in steam deck form

PC LOAD LETTER
May 23, 2005
WTF?!

runaway dog posted:

well I would be returning the old kit, but drat I didn't realize I had some good samples

Well I mean I guess if you didn't have to spend money sure returning your current kit and getting the one you'd linked would be fine.

But usually the good stuff down clocks and undervolts just fine. And 6000 is basically where you'd want to be right now anyways. So maybe try that first if you don't want to deal with returns.

Just make sure your vSOC is under 1.3v whatever you do even if you do decide to return it and get that other kit instead.
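
If you want to sanity-check a reading from your monitoring tool, here's a rough sketch using only the numbers that have come up in this thread (under 1.3v as the ceiling, 1.4v as the long-term warning, 1.44v and up reported on the burned chips). The function name and bucket labels are made up, and none of this is an official AMD limit.

```python
def classify_vsoc(volts: float) -> str:
    """Bucket a vSOC reading using the rough numbers quoted in this thread."""
    if volts < 1.30:
        return "ok"        # under the 1.3v ceiling suggested above
    if volts <= 1.40:
        return "elevated"  # over the suggested ceiling, not past the 1.4v long-term warning
    return "too high"      # past 1.4v; burned chips were reportedly at 1.44v and up


# Example readings (hypothetical): a manual setting, an auto-set value, a reported bad one.
for reading in (1.25, 1.38, 1.44):
    print(f"{reading:.2f}V -> {classify_vsoc(reading)}")
```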

kliras
Mar 27, 2021
sounds like some asus bios updates dropped an hour ago? might wanna go refresh your motherboard's support page in case a new bios shows up

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
gigabyte dropped new bioses too. I think this is an “update your bios” moment, i wouldn’t sleep on it regardless of brand or processor. If the VSOC thing is true it could be killing non-X3D too and it seems to have been part of AMD’s reference implementation.

hobbesmaster
Jan 28, 2008

What’s funny is from the standpoint of someone that has only personally tuned/OC’d AM4 even the normal VSoC settings on zen4 seem insane.

Yes yes, the IO Die is completely different and even has graphics on it too but voltages usually drift down, not up, you know?

runaway dog
Dec 11, 2005

I rarely go into the field, motherfucker.
idk, I set the ram manually to 6000 and that's it, everything else set to defaults, and it still auto set the cpu soc to 1.38 volts, so I picked manual mode then another separate option popped up that said offset and the option was auto and it tells me to press + and - and I do and then it wants to turn on the smart ai overclocking thing which I always had nothing but headaches with on my strix z390-e board and tbh I don't understand any of this voltage oc stuff I've never really been into overclocking or manually setting any voltages have always just set xmp profile and called it a day. but I'm uncomfortable fiddling with this crap especially in light of the whole melting situation, I'm tempted to just leave everything at stock until we have a more definitive answer.

edit: dang hopefully strix boards get that new bios update soon and that it does something.

runaway dog fucked around with this message at 18:03 on Apr 25, 2023

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

hobbesmaster posted:

What’s funny is from the standpoint of someone that has only personally tuned/OC’d AM4 even the normal VSoC settings on zen4 seem insane.

Yes yes, the IO Die is completely different and even has graphics on it too but voltages usually drift down, not up, you know?

Yes, the idea of 1.3v being normal is kind of crazy, AM4 is more like 0.8-1.1v right?

I get the impression this IO die is really really iffy, AMD has always been a bit crappy compared to intel (apart from like, X99) but this time they can’t even run all the slots populated etc and intel didn’t have those problems even on their first gen DDR5 stuff (or X99!). If there ends up being a new IO die next gen it’ll be interesting to see how it does.

Prescription Combs
Apr 20, 2005
AMD just quietly announces the Z1 processors... via marketing email?

Seems like a pretty drat cool release to be so quiet about. Not even seeing any big techtubers posting about it.

Cygni
Nov 12, 2005

raring to post

Paul MaudDib posted:

Yes, the idea of 1.3v being normal is kind of crazy, AM4 is more like 0.8-1.1v right?

Yup, three AM4 parts I've got on the bench are all sub 1.1v on SoC at load.

Subjunctive
Sep 12, 2006

✨sparkle and shine✨

runaway dog posted:

edit: dang hopefully strix boards get that new bios update soon and that it does something.

My ROG Strix X670-E Gaming Wifi has a new BIOS (1202) as of April 21, which I'm hoping has the fix I need because I just installed it!


Subjunctive
Sep 12, 2006

✨sparkle and shine✨

Prescription Combs posted:

AMD just quietly announces the Z1 processors... via marketing email?

Seems like a pretty drat cool release to be so quiet about. Not even seeing any big techtubers posting about it.



WTF is "AMD Link"?
