PC LOAD LETTER
May 23, 2005
WTF?!
Yeah, my idea would probably only save money vs. using a gently caress off huge interposer like they're doing right now. As you and FaustianQ note, assembly issues could easily kill any possibility of even the 1st example working out, much less the 2nd or 3rd. It'd definitely be more expensive than standard packaging and PCB-mounted GDDR5X, which will be plenty good for low, mid, and probably high-end-ish cards.

Even in mid-to-late 2016, HBM or HBM2 will probably only make sense for solidly high-end or super-high-end halo products.

An upper limit of ~3GHz doesn't sound all that terrible, though, for a 1024-bit bus, bandwidth- or latency-wise, especially if GPU clocks stay around 1-1.5GHz. Even if you had to drop down to a 512-bit bus at ~2GHz for practicality reasons, you're still looking at a smidge over 1Tbps worth of bandwidth and decent if not good latency. I don't know how the latency compares to Hawaii's L2 cache, but bandwidth-wise I believe that's about on par with Hawaii's L2 cache bandwidth. That seems like a respectable amount of bandwidth to feed even a 2016+ GPU. AMD, nV, or Intel could probably do a fairly respectable job of making 2-4 die CPUs, APUs, or GPUs work like 1 big one with an inter-die bus like that. Over 4? I'm guessing you'd run into major scaling issues just trying to coordinate information properly at those speeds without having latency kill performance, but that is just a guess.
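(For a quick sanity check on those numbers, here's a minimal sketch of the raw bus-bandwidth arithmetic; the widths and clocks are just the speculative figures above, assuming one transfer per clock.)

```python
def bus_bandwidth_tbps(width_bits, clock_ghz, transfers_per_clock=1):
    """Raw bandwidth of a parallel bus in terabits per second."""
    return width_bits * clock_ghz * transfers_per_clock / 1000.0

# 1024-bit bus at the ~3 GHz upper limit mentioned above
print(bus_bandwidth_tbps(1024, 3.0))  # ~3.07 Tbps (~384 GB/s)

# 512-bit bus at ~2 GHz, the "practical" fallback case
print(bus_bandwidth_tbps(512, 2.0))   # ~1.02 Tbps (~128 GB/s)
```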

Yeah, I'm assuming they'd only need a "passive" interposer too. If an "active" interposer (beyond something relatively simple like building repeaters into the interposer to mitigate wire delay) is needed to make the idea work, I don't think we'll see anything like that for a long time no matter what.

suck my woke dick
Oct 10, 2012

:siren:I CANNOT EJACULATE WITHOUT SEEING NATIVE AMERICANS BRUTALISED!:siren:

Put this cum-loving slave on ignore immediately!

Don't forget you can still buy Win7-downgraded Thinkpads.

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party

Paul MaudDib posted:

re CPU + CPU, i.e. big.LITTLE: It's a promising combination on paper, but my understanding is that it's hard to get a chip that does a good job of balancing real-world loads. It takes a lot of energy to switch between the cores, and it's easy to get "thrashing" back and forth between the processor types.
I think it's fairly unlikely that we'll ever see heterogeneous cores outside of ARM64 and probably never on any chips that expect to run above a 10W TDP.

Marinmo
Jan 23, 2005

Prisoner #95H522 Augustus Hill

Paul MaudDib posted:

However, AMD has already done an HBM-based arch and has a significant advantage in that they're partnering with Samsung, who is basically just behind Intel in tech and is capable of going from raw materials to finished packaged chips. TSMC is just another chip maker. In theory the interposer concept unlocks a fuckload of derivative techs - multi-die chips, heterogeneous chips, etc, all tied into HBM. If AMD/Samsung can exploit that it could be a very promising platform in 2 years or so.
I remember way back in the day when AMD started going to poo poo and they sold all their fabs and let TSMC make all their stuff. Coincidentally, if I remember right, that was also just about when AMD went from "this is looking a bit dicey ..." to "all aboard the trainwreck, next stop AMD!". It just seems to me that TSMC and their lovely unreliable factories, unable to churn out just about anything decent in a timely manner (plus their inability to shrink the freaking dies while nVIDIA and Intel somehow managed just fine(ish)), put a lot of nails in AMD's coffin - anyone with some more insight want to confirm/deny this?

TL;DR: I think the move to Samsung is really freaking awesome and I hope it means AMD gets even remotely competitive again.

Fuzzy Mammal
Aug 15, 2001

Lipstick Apathy
I'm curious to see whether both IHVs launch with a small, medium, or large die first, given we don't know much about the new process and its yields. I'm even more curious which ones will get HBM and which will have GDDR5X instead.

SwissArmyDruid
Feb 14, 2014

by sebmojo

Right this very moment, with 300 series silicon, I am willing to bet that the GPU silicon runs hotter than the interposer, so obviously you're going to want to put that right up against a heatsink.

With active interposers, or a heat spreader to bring everything up to the same height above the PCB? I don't think we'll know until we see some numbers out of Polaris.

PC LOAD LETTER
May 23, 2005
WTF?!
SwissArmyDruid:

Yes, I agree the GPU will definitely run hotter than the current "passive" interposer. The interposers and memory for my idea would be below the GPU die like they are now, not on top. Sorry about the lovely mspaints, I guess, or if I'm misunderstanding what you're saying. I was also assuming "passive", or nearly so, interposers would be used. An "active" interposer with significant dedicated logic isn't even being demoed by anyone right now as far as I know, so that'd be even more fantastical than my idea at this point.

I also don't think Polaris will use anything like my idea. I was just tossing out an idea that seemed nifty to me for discussion purposes and to see what holes would get shot in it. Being terribly wrong about something seems to be the best way to invite comments, and I couldn't be the 1st person to think of using interposers that way, so maybe there was some sort of fundamental flaw with the concept that I hadn't seen. If it ever shows up in an actual product it will probably happen a year or 2 in the future at the earliest, key words being if ever.

Fuzzy Mammal:

Generally, trying to do big complex chips on a new process right off the bat is bad juju. Given that AMD has so far presented a relatively small-ish die in their press releases, I think it's reasonable to assume they'll shoot for mid to low end parts first and then do a big-die halo product later in the year, once the quirks are ironed out and yields are better. Unfortunately I don't see how HBM+interposer is going to move far down the product ladder to mid range, much less low range, by late 2016 due to cost. The stuff is just still too new. And GDDR5X seems pretty solid and affordable too. It'd be interesting to see if they can really boost low end affordable graphics card performance with it. Mid range certainly won't suffer.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!
Well, Polaris10 (which is likely Vega10) was shown at CES behind closed doors and is a thing that actually exists *cough*GP100*cough* so AMD will likely release it alongside Zen performance for Q3/Q4, but I don't think AMD will be making use of GDDR5X, at least not until a 14nm refresh or 10nm.

Either way, AMD seems really intent on getting back into mobile, so much so that the latest idiot level rumor is that the CES Polaris demo was an MXM platform for mobile and that AMD has no desktop Polaris yet. God I don't need to get involved in comments.

Anime Schoolgirl
Nov 28, 2002

FaustianQ posted:

Either way, AMD seems really intent on getting back into mobile, so much so that the latest idiot level rumor is that the CES Polaris demo was an MXM platform for mobile and that AMD has no desktop Polaris yet. God I don't need to get involved in comments.
At least they actually had functional samples of their next gen hardware at all :v:

HalloKitty
Sep 30, 2005

Adjust the bass and let the Alpine blast

blowfish posted:

Basically 10 will become standard when all the win 7/8/8.1 shitboxes (which replace win XP shitboxes) finally die in 2025.

Great time to get an easy job doing Windows 10 rollouts...

Durinia
Sep 26, 2014

The Mad Computer Scientist

PC LOAD LETTER posted:

Yeah, in the context of the PC market I would always assume the phrase "heterogeneous chip" refers to some sort of CPU+GPU on-die/package setup.

Interposer chat: any reason why the interposer should have to be entirely under the GPU or CPU die? I'm no EE or GPU architect, but most of the labeled die shots I've seen for modern GPUs like Hawaii or Fiji have the memory interface along the edges of the GPU. The positioning for those particular GPUs isn't ideal for this idea because the memory interface is along 3 of the 4 edges of the die... but what if they could move all of the memory interface to 1 side of the GPU die? Couldn't they then just make an interposer for those areas to the memory and then use "regular" packaging methods to deliver power and other I/O for the chip? I would think small interposers would be cheaper than 1 gently caress off huge one.

As Durinia noted in the post Paul MaudDib linked, an interposer is really just a big chip in itself, just on an older and larger geometry "cheap" process, so the silicon cost and yield savings typical for any other chip would apply, and I would think they'd be significant if not substantial. The rub here is of course that using small interposers + "normal" PBGA packaging might drive the defect rate through the roof for all I know. Or be totally impossible for whatever reason.

Anyways, if my description is off, here is an mspaint of the idea:


alternatively:


goin' for actual bankrupt broke :rms:


You've got a good grasp of the economics. I'm not entirely sure how hard going to multiple interposers (and then finding some way to pad the space between them under the main die) is from a manufacturing standpoint, but it can't be easy. Just looking at the yield equations, adding more chips to the setup is multiplicative: if your bonding yield is only 90%, then adding a 3rd chip makes the overall yield something like 81% - a big hit, especially when you're wasting a lot of expensive chips whenever a bond fails.
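(A minimal sketch of that multiplicative yield math, using the 90% per-bond figure above purely as an example.)

```python
def assembly_yield(per_bond_yield, num_bonds):
    """Overall assembly yield when every bonding step must succeed."""
    return per_bond_yield ** num_bonds

print(assembly_yield(0.90, 1))  # 0.90 - two chips, one bond
print(assembly_yield(0.90, 2))  # 0.81 - a third chip adds a second bond
print(assembly_yield(0.90, 3))  # ~0.73 - and it keeps compounding
```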

A really interesting technology in this space is Intel's EMIB, which is basically inserting little interposers inside of the substrate. This would (hypothetically) fix the padding issue and let you do something like what you're talking about.

Durinia
Sep 26, 2014

The Mad Computer Scientist

Paul MaudDib posted:

Intel is also working on Knight's Landing, of course, which looks like a hell of a chip. That's the new Xeon Phi architecture. Basically it's a bunch of Atom cores on a package with low-latency memory dies...

https://twitter.com/glennklockwood/status/649267360709259264

LOL

Anime Schoolgirl
Nov 28, 2002

was anyone really expecting anything good out of knight's ___

champagne posting
Apr 5, 2006

YOU ARE A BRAIN
IN A BUNKER

edit: I was thinking of the upcoming Intel thing, Kaby Lake, which is something completely different.

champagne posting fucked around with this message at 18:03 on Jan 25, 2016

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Help me grasp how much this makes Knights Landing a failure, because it doesn't look like one? And if it is, I guess it's good news for AMD's HSA Zen designs.

Rastor
Jun 2, 2001

I heard rumors Kaby Lake will get 256MB of on-package RAM; the question is what the yields will be and whether Intel will make them generally available.

Anime Schoolgirl
Nov 28, 2002

FaustianQ posted:

Help me grasp how much this makes Knights Landing a failure, because it doesn't look like one? And if it is, I guess it's good news for AMD's HSA Zen designs.
It takes longer for the KNL CPU to interface with its memory cube RAM than it does with the DDR4 on the motherboard.

This is usually a bad thing.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Anime Schoolgirl posted:

It takes longer for the KNL CPU to interface with its memory cube RAM than it does with the DDR4 on the motherboard.

This is usually a bad thing.

I got that much, but my impression was that that was merely at idle, and that under load the onboard RAM will perform better.

Anime Schoolgirl
Nov 28, 2002

Throughput is probably sky high but that seems to be the reason why it didn't pan out for nvidia

Durinia
Sep 26, 2014

The Mad Computer Scientist

Anime Schoolgirl posted:

Throughput is probably sky high but that seems to be the reason why it didn't pan out for nvidia

Right, the conjecture is that if you load the KNL up with tons of threads, it can keep enough requests in flight to cover that latency. I'm a little skeptical that those small cores have enough ILP capability to do that, but it will make for some interesting experiments when it comes out.
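(For a rough sense of scale, a Little's Law sketch of how many requests have to stay in flight to cover the latency; the bandwidth and latency numbers here are placeholder guesses, not published KNL figures.)

```python
def requests_in_flight(bandwidth_gb_s, latency_ns, request_bytes=64):
    """Little's Law: outstanding requests = throughput * latency.

    GB/s multiplied by ns conveniently works out to bytes in flight."""
    bytes_in_flight = bandwidth_gb_s * latency_ns
    return bytes_in_flight / request_bytes

# e.g. ~400 GB/s sustained at ~150 ns memory latency (placeholder numbers):
print(requests_in_flight(400, 150))  # ~940 outstanding 64-byte requests
```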

Considering their history with the Knights line, I'm not terribly optimistic.

SwissArmyDruid
Feb 14, 2014

by sebmojo

Anime Schoolgirl posted:

It takes longer for the KNL CPU to interface with its memory cube RAM than it does with the DDR4 on the motherboard.

This is usually a bad thing.

Wait, did we know for a fact that Intel was using Micron HMC?

Regardless, :roflolmao:

edit: Oh my god it is.

SwissArmyDruid fucked around with this message at 18:46 on Jan 25, 2016

Durinia
Sep 26, 2014

The Mad Computer Scientist

SwissArmyDruid posted:

Wait, did we know for a fact that Intel was using Micron HMC?

Regardless, :roflolmao:

edit: Oh my god it is.

Yep. :stonklol:

And, I'd add, the latency thing is probably related.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!
Wow, so on that basis it sounds like a functional Zen HSA design should beat a Xeon Phi squarely; go for the throat and do 32 tiny Zen cores that can each run 2 virtual threads, tacked onto a Polaris 10 and fed by 32GB of onboard HBM2.

champagne posting
Apr 5, 2006

YOU ARE A BRAIN
IN A BUNKER

You don't need to beat the phi though, it's pretty useless as it is.

Proud Christian Mom
Dec 20, 2006
READING COMPREHENSION IS HARD

Boiled Water posted:

You don't need to beat the phi though, it's pretty useless as it is.

hey for AMD a win is a win

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!
I dunno, part of me is kind of disappointed Intel won't do dGPUs. An Iris Pro 6200 scaled up to accommodate a 256-bit bus with 128-192 EUs and GDDR5 should be an excellent card (going off how an IP 6200 is roughly an R7 250, so ~8 EU = ~1 CU, so 192 EUs should give better than 270X performance at much lower power draw).
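(A back-of-the-envelope version of that scaling argument; the unit counts are the commonly cited ones - 48 EUs for the Iris Pro 6200, 6 CUs for the R7 250, 20 CUs for the 270X - and it assumes performance scales linearly with unit count, which it won't in practice.)

```python
IRIS_PRO_6200_EUS = 48   # commonly cited EU count for the IP 6200
R7_250_CUS = 6           # roughly the IP 6200's performance class
R9_270X_CUS = 20

eus_per_cu = IRIS_PRO_6200_EUS / R7_250_CUS  # ~8 EU per CU, as above

def cu_equivalent(eu_count):
    """Very rough GCN-CU equivalent for a given Intel EU count."""
    return eu_count / eus_per_cu

print(cu_equivalent(128))  # ~16 CU-equivalents
print(cu_equivalent(192))  # ~24 CU-equivalents, a bit above the 270X's 20
```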

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

FaustianQ posted:

Help me grasp how much this makes Knights Landing a failure, because it doesn't look like one? And if it is, I guess it's good news for AMD's HSA Zen designs.

GPU memory is always very high-latency compared to CPU memory. On the other hand, you can't feed a GPU-sized processor on 30 GB/s. That's the inherent trade-off SIMD processors make.

My impression was that HMC was still lower-latency than HBM and GDDR5 but it's been a while since I looked into it, so it's entirely possible I'm wrong there. Feel free to correct me if you have sources, I couldn't find any on a quick search.

Boiled Water posted:

You don't need to beat the phi though, it's pretty useless as it is.

Knight's Corner is pretty useless, but Knight's Landing is where Intel has been aiming their generational update. KNC uses GDDR5, KNL goes to on-package HMC. KNC is designed as a PCIe coprocessor card; KNL can be socketed. KNC is 22nm; KNL is 14nm. And they're dumping in a bunch of new 512-bit-wide SIMD instructions to boot.

KNC to KNL is going to be like going from Hawaii to Greenland, in terms of timeframe and technology. Whether it actually works well is of course another story, but the tech is solid at least.

Paul MaudDib fucked around with this message at 00:42 on Jan 26, 2016

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Durinia posted:

Right, the conjecture is that if you load the KNL up with tons of threads, it can keep enough requests in flight to cover that latency. I'm a little skeptical that those small cores have enough ILP capability to do that, but it will make for some interesting experiments when it comes out.

Considering their history with the Knights line, I'm not terribly optimistic.

Right, and that's also how GPUs work. You have a bunch of threads running to keep a lot of instructions in-flight, but due to GDDR5 latency most of them are sleeping/idled. If you've got any slides or anything on how Intel may be handling that on KNL I'd be really interested too, because that's really the million-dollar question.

Analogously from GPUs, the solution might involve "memory coalescing" (as CUDA calls it) and possibly speculative prefetching or similar techniques. Basically align your threads' memory accesses as much as possible to try and combine the mini requests into a single big one, and avoid incurring the memory access latency as much as you can.
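(As a toy illustration of what coalescing buys, here's a sketch that counts memory transactions for a 32-thread warp under different access patterns; the 128-byte segment size is the usual GPU transaction granularity, everything else is made up for illustration.)

```python
def transactions_for_warp(addresses, segment_bytes=128):
    """Count the distinct memory segments touched by one warp's loads.

    Coalescing hardware merges per-thread requests that land in the same
    segment into a single transaction."""
    return len({addr // segment_bytes for addr in addresses})

warp = range(32)
coalesced = [4 * tid for tid in warp]    # threads read consecutive 4-byte elements
strided   = [512 * tid for tid in warp]  # each thread reads 512 bytes apart

print(transactions_for_warp(coalesced))  # 1 transaction for the whole warp
print(transactions_for_warp(strided))    # 32 separate transactions
```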

Again, I think the key isn't to think of this as a CPU with a lot of threads - it's a GPU that can switch to an efficient single-threaded/out-of-order execution mode at will. You can theoretically run regular CPU applications on it and it'll kinda-sorta work, which is an advantage. But for optimal performance you will need new applications like GPUs do.

Some of it is fairly straightforward - for example databases have made pretty good gains just by exploiting the enormous bandwidth of GDDR5. But long term you can potentially do entirely different things with GPUs - for example rather than a deterministic tree structure you can make probabilistic structures, where there's many places a piece of data might live, and then search all of them in parallel. That has big advantages for massively concurrent applications, since it drastically reduces synchronization/contention overhead.

Historically, the design precedent of GPGPU is a combination of vector processing (set up once, work on many pieces of data in a pipeline) and barrel processing (rotate between threads of execution to cover latency). Ideally that's where Intel should be aiming.

The CDC Star 6600 in particular is a really interesting piece of machinery that is surprisingly similar to modern GPU architecture in a lot of ways. It's a RISC processor that distinguishes between control processor/execution unit/peripheral IO units, with a single Stream Engine that computes one warp at a time and swaps between them to cover its latency. However processing has gotten much faster since then - back then instruction execution was the bottleneck, now it's memory latency, so it's basically reversed.

Paul MaudDib fucked around with this message at 00:46 on Jan 26, 2016

Professor Science
Mar 8, 2006
diplodocus + mortarboard = party
FWIW, you'll learn more about GPU compute from reading about C* and MasPar than from any other historical artifacts I've seen.

Professor Science fucked around with this message at 06:46 on Jan 26, 2016

Durinia
Sep 26, 2014

The Mad Computer Scientist

Paul MaudDib posted:

My impression was that HMC was still lower-latency than HBM and GDDR5 but it's been a while since I looked into it, so it's entirely possible I'm wrong there. Feel free to correct me if you have sources, I couldn't find any on a quick search.

HBM is the (good) latency outlier, actually. It has similar latency to DDR, as it borrows a lot from its controller architecture. This SK Hynix presentation from Hot Chips (page 14) shows latency info.

HMC has additional latency added by the SerDes conversion on both sides of the abstracted channel, along with some additional switching delay in the base die. It's also a little finicky about getting overloaded with requests. Here's a slide from an SC15 workshop where someone evaluated it under load. Not quite what you're used to...

Paul MaudDib posted:

Right, and that's also how GPUs work. You have a bunch of threads running to keep a lot of instructions in-flight, but due to GDDR5 latency most of them are sleeping/idled. If you've got any slides or anything on how Intel may be handling that on KNL I'd be really interested too, because that's really the million-dollar question.

Analogously from GPUs, the solution might involve "memory coalescing" (as CUDA calls it) and possibly speculative prefetching or similar techniques. Basically align your threads' memory accesses as much as possible to try and combine the mini requests into a single big one, and avoid incurring the memory access latency as much as you can.

Again, I think the key isn't to think of this as a CPU with a lot of threads - it's a GPU that can switch to an efficient single-threaded/out-of-order execution mode at will. You can theoretically run regular CPU applications on it and it'll kinda-sorta work, which is an advantage. But for optimal performance you will need new applications like GPUs do.
They've got 4-way hyperthreading on each core and yes, they're still going to need to do some tricks with pre-fetching, etc., to get the most out of their memory. Interestingly, it's still going to have less than half the bandwidth of Pascal. Regular CPU workloads, unless they're heavily threaded, will be pretty bad. Like, 3-5x slower than Xeon bad. That means that for some apps, Amdahl's Law will be pretty harsh on these things - thus Intel's push to squeeze every ounce of parallelism out of your codes.
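(A quick Amdahl's Law sketch of why those slow cores hurt; the 4x per-core slowdown is just the middle of the 3-5x range above, and the thread count is illustrative.)

```python
def speedup_vs_xeon_core(parallel_fraction, threads, core_slowdown=4.0):
    """Speedup over a single fast Xeon core under Amdahl's Law, assuming
    every KNL thread runs core_slowdown times slower than that Xeon core."""
    time = core_slowdown * ((1 - parallel_fraction) + parallel_fraction / threads)
    return 1.0 / time

print(speedup_vs_xeon_core(0.95, 256))   # ~4.7x - the slow serial 5% dominates
print(speedup_vs_xeon_core(0.999, 256))  # ~51x - only near-perfect parallelism pays off
```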

Paul MaudDib posted:

The CDC Star 6600 in particular is a really interesting piece of machinery that is surprisingly similar to modern GPU architecture in a lot of ways. It's a RISC processor that distinguishes between control processor/execution unit/peripheral IO units, with a single Stream Engine that computes one warp at a time and swaps between them to cover its latency. However processing has gotten much faster since then - back then instruction execution was the bottleneck, now it's memory latency, so it's basically reversed.

Interestingly, it is really very much the same idea that KNL has - except replace "instruction latency" with "memory latency" and "warp" with "thread/SIMD".

Professor Science posted:

FWIW, you'll learn more about GPU compute from reading about C* and MasPar than from any other historical artifacts I've seen.

Seconded. MasPar was very similar to a modern GPU in many aspects of operation (just a bunch of discrete units instead of a single chip). I actually used one in college a bit and managed to not be too horribly scarred by it. :science:

Also, AMD something something because this thread.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Durinia posted:

Also, AMD something something because this thread.

Should just be renamed to Hardware Theory and Speculation

Durinia
Sep 26, 2014

The Mad Computer Scientist

FaustianQ posted:

Should just be renamed to Hardware Theory and Speculation

To be fair, that's mostly what AMD's working on these days in the CPU space.

SpelledBackwards
Jan 7, 2001

I found this image on the Internet, perhaps you've heard of it? It's been around for a while I hear.

Durinia posted:

To be fair, that's mostly what AMD's working on these days in the CPU space.

What if we cross-circuit the accelerator? Might give us more... ohms, or something?

https://www.youtube.com/watch?v=4K4Is93e7VA

PC LOAD LETTER
May 23, 2005
WTF?!

Durinia posted:

A really interesting technology in this space is Intel's EMIB, which is basically inserting little interposers inside of the substrate. This would (hypothetically) fix the padding issue and let you do something like what you're talking about.

Yeah, that's pretty much my idea. I figured it was too obvious for everyone to have overlooked it! Hadn't heard of it though. Thanks for the link.

Sidesaddle Cavalry
Mar 15, 2013

Oh Boy Desert Map
Jim Keller landed at Tesla working on self-driving cars.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Likely very much wanted for his ARM, x86 and SoC experience, pretty critical factors in making something like PX2 work.

EDIT: I'm an idiot and jumped to too many conclusions.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
http://www.anandtech.com/show/10008/amd-launches-65w-kaveri-apus-a10-7860k-a6-7470k-wraith

It looks like the A10-7860K is launching with very aggressive pricing, $118 MSRP? Against an i3-4330, the 7850K was more expensive, had almost twice the TDP, and was slower except in heavily threaded or GPU-heavy loads, but the 7860K is launching $50 cheaper and 30W cooler.

If this was here 2 years ago, they might even have found a niche. Unfortunately it looks like an i3-6100 has an MSRP $1 lower, and Intel has been making strides in integrated graphics while keeping their single-threaded speed advantage. Are there any redeeming features to the FM2+ CPUs that I'm missing?

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Twerk from Home posted:

http://www.anandtech.com/show/10008/amd-launches-65w-kaveri-apus-a10-7860k-a6-7470k-wraith

It looks like the A10-7860K is launching with very aggressive pricing, $118 MSRP? Against an i3-4330, the 7850K was more expensive, had almost twice the TDP, and was slower except in heavily threaded or GPU-heavy loads, but the 7860K is launching $50 cheaper and 30W cooler.

If this was here 2 years ago, they might even have found a niche. Unfortunately it looks like an i3-6100 has an MSRP $1 lower, and Intel has been making strides in integrated graphics while keeping their single-threaded speed advantage. Are there any redeeming features to the FM2+ CPUs that I'm missing?

Do you have an FM2+ motherboard? No? Then it doesn't have any redeeming qualities beyond maybe being a showcase for Bristol Ridge, in the hopes Bristol Ridge can fully take advantage of DDR4 for something like 65 GB/s of bandwidth.
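(For reference, the textbook dual-channel DDR bandwidth arithmetic; which transfer rates Bristol Ridge will actually support is pure speculation.)

```python
def ddr_bandwidth_gb_s(transfer_rate_mts, channels=2, bus_bytes=8):
    """Peak DRAM bandwidth: MT/s * 8 bytes per 64-bit channel."""
    return transfer_rate_mts * bus_bytes * channels / 1000.0

print(ddr_bandwidth_gb_s(2400))  # 38.4 GB/s dual-channel
print(ddr_bandwidth_gb_s(3200))  # 51.2 GB/s
print(ddr_bandwidth_gb_s(4000))  # 64.0 GB/s - roughly that ~65 GB/s figure
```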

But yeah, super sad this wasn't a late 2014/early 2015 part, along with something like the Athlon X4 845K, which then might have had a proper budget build niche.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

FaustianQ posted:

But yeah, super sad this wasn't a late 2014/early 2015 part, along with something like the Athlon X4 845K, which then might have had a proper budget build niche.

Even then it would have been competing and losing against the $99 G3258 + Z97 mobo bundle in the budget niche. I guess for a cheapo desktop for my sister I'm choosing between an i3-6100 and a G4400 now.

EmpyreanFlux
Mar 1, 2013

The AUDACITY! The IMPUDENCE! The unabated NERVE!

Twerk from Home posted:

Even then it would have been competing and losing against the $99 G3258 + Z97 mobo bundle in the budget niche. I guess for a cheapo desktop for my sister I'm choosing between an i3-6100 and a G4400 now.

Eh, that's a Microcenter B&M deal, not online shopping in general, so without a Microcenter nearby it'd become much more competitive ($70+$40 vs $70+$100).
