I've had a hailstorm of nVidia GPUs die around me over the last couple of months....first an 8600gt had busted caps right after the warranty expired...which i fixed...which died again...and after baking it lived again...and a couple of days ago it died again...and sadly it's dead for good....
also Aurora, OrodruinBD, TahmidXL8...some other guys had dead cards...so i kinda looked it up...the info is on inquirer....so read it with a bit of skepticism....
It looks like almost all nvidia cards have heat issues...they eventually die depending on the usage pattern....mostly cause of solder joint failure...
Source: THE BURNING QUESTION on everyone's mind is what Nvidia parts are failing in the field? No GT200 jokes here, NV personnel are still quite sensitive about that, but our moles have told us about the bum GPUs.
The short story is that all the G84 and G86 parts are bad. Period. No exceptions. All of them, mobile and desktop, use the exact same ASIC, so expect them to go south in inordinate numbers as well. There are caveats however, and we will detail those in a bit.
Both of these ASICs have a rather terminal problem with unnamed substrate or bumping material, and it is heat related. If you ask Nvidia officially, you will get no reason why this happened, and no list of parts affected, we tried. Unofficially, they will blame everyone under the sun, and trash their suppliers in very colourful language.
The press is totally stonewalled, but analysts are quite another story. If you call up with Wall Street credentials, they will tell you what is going on, but unfortunately it doesn't seem to be entirely accurate. What analysts tell me they were officially told is that it is a specific batch of parts that only HP got.
The official story is that it was a batch of end-of-life parts that used a different bonding/substrate process for only that batch. Once again, the trusty INQUIRER bullshit detectors went off so loudly that the phone almost vibrated out of my hand. More than enough people tell us both the G84 and G86 use the same ASIC across the board, and no changes were made during their lives.
When the process engineers pinged by the INQ picked themselves off the floor from laughing, they politely said that there is about zero chance that NV would change the assembly process or material set for a batch, much less an EOL part.
On the less technical side, multiple analysts also told us that NV specifically told them that this problem is confined only to HP. I wonder why Dell is having failures in huge numbers for their XPS lines and replacing them with ATI parts? Why is Asus having similar problems? Go check the message boards, any notebooks that came with G84s and G86s have boards filled with dead machine problems. Most of these, especially on the NV forums are being quashed and removed by admins, so act quickly and take screenshots of your posts.
Basically, NV seems to have told each analyst a highly personalised version of the story, and stonewalls everyone else who asks. Why? The magnitude of the problem is huge. If Dell and HP hold their feet to the fire, anyone want to bet that $200 million won't cover it? This has all the hallmarks of things the SEC used to investigate in a time before government was purchasable.
The other problem is the long tail. Failures occur due to heat cycling, cold -> hot -> cold for the non-engineers out there. If you remember, we said all G84s and G86s are affected, and all are the same ASIC, so why aren't the desktop parts dying? They are, you are just low enough on the bell curve that you don't see it in numbers that set off alarm bells publicly yet.
Laptops get turned on and off many times in a day, and due to the power management, throttle down much more than desktops. This has them going through the heat cycle multiple times in a day, whereas desktops typically get turned on and off once a day, sometimes left on for weeks at a time. Failures like this are typically on a bell curve, so they start out slow, build up, then tail off.
Since laptops and desktops have different "customer use patterns", they are at different points on the bell curve. Laptops have got to the "we can't bury this anymore" point, desktops haven't, but they will - guaranteed. The biggest question is whether or not they will be under warranty at that point, not whether or not they are defective. They are.
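To make the "different points on the bell curve" argument concrete, here is a rough Python sketch. It assumes, purely for illustration, that cycles-to-failure follow a normal distribution (my reading of the article's "bell curve") and that a laptop goes through several heat cycles a day against one for a desktop. Every number and name in it is hypothetical; none of them come from the article or from Nvidia.

```python
import math

# All numbers below are made up for illustration; none come from the article.
MEAN_CYCLES_TO_FAILURE = 3000   # hypothetical centre of the bell curve
SD_CYCLES = 800                 # hypothetical spread of the bell curve
LAPTOP_CYCLES_PER_DAY = 6       # hypothetical: many sleep/wake and throttle cycles
DESKTOP_CYCLES_PER_DAY = 1      # hypothetical: switched on and off once a day


def cumulative_failure_fraction(cycles, mean_cycles, sd_cycles):
    """Normal CDF: fraction of parts that have failed after `cycles` heat cycles."""
    z = (cycles - mean_cycles) / (sd_cycles * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))


def failed_after(days, cycles_per_day):
    """Fraction failed after `days` of use at a given number of heat cycles per day."""
    return cumulative_failure_fraction(
        days * cycles_per_day, MEAN_CYCLES_TO_FAILURE, SD_CYCLES
    )


if __name__ == "__main__":
    for days in (180, 365, 730):
        laptop = failed_after(days, LAPTOP_CYCLES_PER_DAY)
        desktop = failed_after(days, DESKTOP_CYCLES_PER_DAY)
        print(f"day {days}: laptop {laptop:.1%} failed, desktop {desktop:.1%} failed")
```

The only point of the sketch is the shape: with the same part and the same curve, the machine that racks up heat cycles faster reaches the steep part of the curve sooner, which is the article's explanation for why laptops showed the problem first.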
If you look at the HP page, the prophylactic fix they offer is to more or less run the fan all the time. Once again, for the non-engineers out there, fan running eats a lot of power, so this destroys the battery life of notebooks. Basically, people bought a machine with a battery life of X, and now it is Y to prevent meltdown from a bum part. It doesn't fix anything, it just makes the failures take longer, hopefully past the warranty period, at a huge battery life cost. Fire up your class actions people, you got shafted.
Back to the engineering, we intoned that this was a cover-up of engineering failures by Nvidia. We also said that they probably knew what was happening. Think we were kidding? Read this, twice, linked again here for those that can't move their mouse to the left, it is that important.
If we knew a year and change ago that these exact parts had heat problems, think Nvidia did? Think the voltage difference between A02 and A03 is coincidence? This is a classic example of not meeting engineering goals and overclocking through brute force (voltage bump in engineering terms) to compensate.
HP and the others were blindsided by this, it happened far too late in the design cycle to compensate, and it looks to have been covered up hastily, badly, and eventually fatally. Blaming suppliers, OEMs and users is completely unfounded and says that NV is unwilling to properly address this issue, only hide from it. NV knew, they made silicon changes to fix another problem that directly led to this problem.
Nvidia is covering this up, hard. All the usual sources are keeping mum on the topic with only a few daring to speak out. Given the sheer magnitude of this, their marketshare for notebooks was huge in the period, this could very well suck up most of their remaining cash. Don't underestimate how bad this is going to be for NV, we highly doubt $200 million will even begin to cover it.
NVIDIA IS IN DEEP trouble over the defective parts problem, and from what we're being told, this is only the tip of the iceberg. NV still insists on stonewalling and spinning because the cost of owning up to the problem could very well sink the company.
If you haven't been following the story, the short version, up till now, is that all G84 and G86 chips are bad. Nvidia is blaming everyone under the sun, but denying they have any hand in the failures. While this may sound plausible, technical analyses by people intimately involved in the requisite semiconductor technologies tell The INQ that it is a bunch of bull: NV simply screwed up. Badly. If it was a problem with the suppliers, NV would not be paying out more than the chip cost, much less gagging OEMs: it would simply be passed along.
In any case, the official story is that there was a small batch of parts given only to HP that went bad. That was comprehensively proved wrong when Dell, Apple, Asus, Lenovo and everyone else under the sun also had problems. NV AR recalled the parts and recanted the story about it only being an EOL test run. Bad fibbers, no cookie. They still stuck to the story about it being only laptop parts, and that it was under control.
If you think it is under control now, the following is part of an email sent Monday by a very tech-savvy reader. "We just got our first casualty from the Nvidia mobile graphics [expletive deleted]. Laptop used by one of our senior engineers started acting up this past weekend. Won't boot except in SAFE mode. Called Dell, they tried a few things, gave up, stated it was the graphics module, and said that because they were SO swamped dealing with that issue, they were just going to send a completely new laptop!"
There are two messages here which have echoes in earlier emails received over the past few weeks. First is that Dell is replacing full laptops over this, contrary to what they claim (read the comments here and here for more). The second is that the small 'under control' problem is far from that. If they had a handle on it, they would not be so far behind and drowning in backorders. Anyone want to bet Dell isn't going to get stuck with the bill here?
To make matters more laughable, the fix that NV is forcing on Dell, HP and everyone else does not fix the problem, it simply makes it less likely to occur during the warranty period. With HP now offering an extended warranty period, and Dell looking likely to do the same, this will only multiply the cost. Add in the fact that Nvidia is sending out defective parts as replacements (there are no good ones), and you have a recipe for a long and expensive tale.
That is where we stand now - NV is simply stonewalling everyone and the costs are adding up. How adult of them. The question of why still remains though, and with another little tidbit of information, it becomes quite clear. There was a digitimes article on July 25, here if you are a subscriber, that said: "Due to Nvidia not clearly explaining the details of the faults reported in its notebook GPUs, some channel vendors have demanded graphics card makers issue a recall for desktop-based discrete graphics cards using the same GPU core, according to sources at graphics card makers."
Reading that, it sounds a mite odd: why would Nvidia keep the partners in the dark like that? They have to be told what the real story is for business reasons, right? When you see stories like these, it is very likely that they are not what they seem, and that the story is simply a nice face-saving Asian 'hello' applied with a backhand.
A little digging revealed what this, and more, is all about, and it's far uglier than just the 'notebook' version. It seems that four board partners are seeing G92 and G94 chips going bad in the field at high rates. If you know what failures look like statistically, they follow a Poisson distribution, aka a bell curve. The failures start out small, and ramp up quickly - very quickly. If you know what you are looking for, you can catch the signs early on. From the sound of the backchannel grumblings, the failures have been flagged already, and NV isn't playing nice with their partners.
Why wouldn't they? Well, the G92 chip is used in the 8800GT, 8800GTS, 8800GS, several mobile flavours of 8800, most of the 9800 suffixes, and a few 9600 variants just to confuse buyers. The G94 is basically only the 9600GT. Basically we are told all G92 and G94 variants are susceptible to the same problem - basically they are all defective. Any guesses as to how much this is going to cost?
From the look of it, all G8x variants other than the G80, and all G9x variants are defective, but we have only been able to get people to comment directly on the G84, G86, G92 and G94, and all variants thereof. Since Nvidia is not acknowledging the obvious G84 and G86 problems, don't look for much word on this new set either - if they can bury it, it will drop their costs.
In the end, what it comes down to is that the problem is far bigger than they are admitting, and crosses generational lines, process lines, and OEM lines. Nvidia is quick to point the finger at everyone but themselves, but after a while, the facts strain those cover stories well past breaking point. There is a common engineering failure here - this problem is far too widespread for it to be anything else. The stonewalling, denials and partner gagging is simply a last-ditch attempt at wallet covering.
With OEMs extending warranties, Nvidia is going to have to cover a lot of laptops for a long time. Desktop boards are going bad as well now, contrary to the statements of Nvidia PR and AR, and the hole keeps getting deeper and deeper. I wonder if they can ever come clean and survive.
of chips. My friend owns a Lan Center / GC and here's a cross section of his BFG's:
6 series (66 GT/68 GT/68 ultra): they're starting to fail this last year
7 series (76 GT, 79 GT, 7950GT): No failures, yet
8 series (86 GTS, 88 GT 512, 88 GTS 512): 86 no issues, 8800 GT single slots 50% failure rate, 8800 GT single slot space "upgrade" no failures, 8800 GTS no failures.
9 series: 9800 GTX, no failures.
All of his major failures are tied up in the 8800 GT cards, some of them were replaced 4 or 5 times until he moved a card out of the slot immediately below the intake fan down one slot. Magically the failures stopped. I shouldn't say magically, did you ever listen to an 8800GT fan?
HOT ON THE heels of its denials that anything is wrong with the G92 and G94s comes another PCN that shows the G92s and G92b are being changed for no reason. Yup, the problems that are plaguing G84 and G86 are the same that affect seemingly all 65nm and now 55nm Nvidia parts.
This PCN is very similar to the one linked above, and the formatting is almost exactly the same, so we won't cover all the details, just the pertinent points. This one is much more important, it confirms that the problems are not confined to the 65nm products. Since Nvidia told us the last one was unimportant and refused to give it to us, we didn't bother asking this time, we just took notes when they were shown to us at a recent conference.
It is titled "G92 GPU Desktop Products" with a subtitle of "Change Bump Material from High Pb to Eutectic Solder", with a date of June 2008 and a number PCN0346A on it. Page 2 has the "PCN Submit Date" of June 13, 2008, "Planned Implementation Date" of July 28, 2008, and a "Proposed First Ship Date for change" of August 17, 2008. Short story here, if you have a G92 or G92b purchased before next week, you likely have a lemon. Remember, these are chip ship dates, not boards in stores.
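For anyone who wants the gaps between those three PCN dates spelled out, a quick Python sketch follows. The dates are the ones quoted above; the variable and function names are just mine for illustration.

```python
from datetime import date

# Dates as quoted from page 2 of the G92 PCN in the article.
pcn_submit = date(2008, 6, 13)
planned_implementation = date(2008, 7, 28)
first_ship = date(2008, 8, 17)


def days_between(start, end):
    """Number of calendar days from `start` to `end`."""
    return (end - start).days


if __name__ == "__main__":
    print("submit -> implementation:", days_between(pcn_submit, planned_implementation), "days")
    print("implementation -> first ship:", days_between(planned_implementation, first_ship), "days")
    print("submit -> first ship:", days_between(pcn_submit, first_ship), "days")
```

Keep in mind these are chip ship dates, so boards built from the changed chips would reach stores some unknown time after the last one.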
The next few chunks, "Change Category" and others are the same, "Class 1", given to everyone under the sun, and OMGWTFBBQ. That is kind of a 'well duh' thing, and is exactly the same as the G86 part PCN.
The big one is the affected parts list. It clearly states that not only are 65nm parts bad, but 55nm ones are as well. The entire list of affected parts is as follows.
Small batch, my arse
Let's see, what do we have here? It looks like they changed the bumping material on the 55nm parts a month and a day after introduction. Yup, no reason for that at all, nothing to see here either.
The next part is a description of what we already knew and told you about on the last PCN story. To use their words, "Nvidia will transition from using high-lead solder (95%Pb/5%Sn) to eutectic solder (63%Sn/37%Pb) flip-chip bump material for the G92 product family. During the transition period Nvidia will be supplying both high-lead and eutectic bump until inventory is depleted. No other materials are being changed."
This makes complete sense, and it is followed by a picture of a modern chip with the bumps and underfill pointed out.
The reasons are the same, supply and robustness, as is the impact statement. Same very curious wording. Nothing new, just bad news.
The "Implementation and Qualification Plan" however does have some new news. It says, "Nvidia has previously qualified numerous products using eutectic solder bumps using the same bump suppliers, substrate vendors, underfill and assembly sites as this device. Qualification data is available upon request." This information backs up our previous assertions that this is quite widespread among all their 65nm and 55nm products. Qual data is available "Now," it says, and samples on July 1, 2008.
Page 4 has the same diagram, and indicates that the eutectic bumps are marked the same way as the G86 ones, with a trailing R on the lot #. Because it is etched on the die, you have no way of knowing which one you have until you take it apart, pull the heatsink, clean off the thermal paste, and read the laser-wielding chicken scratchings. Most stores won't let you do this, and NV is going to be mixing the dies up until they burn off inventory. This means you won't be safe until long after the card is irrelevant, say later in Q4.
The "Recommended Action" and contact info is the same as the G86 PCN, and the Revision history has an Initial Release date of 06/13/08. There is no blank page 5 on this one, it is just the disclaimer that was on page 6 of the last one.
While Nvidia is playing these PCNs off as nothing to worry about, they are. The fact that the defective chip problem extends to the G92 line like we said earlier is bad enough. It pretty much confirms that the problem is the same as the "Small batch of EOL laptops parts only given to HP," that they warned about in July. The bigger problem is that it affects the newer 55nm parts as well. Those were supplanted in a number of days you could almost count to on your fingers and toes if you grew up in a small town in Appalachia, never a good sign. In fact, qual samples were available before the 9800GTX+ actually launched.
It is hard to overstate how bad this is. Basically every 65nm and 55nm Nvidia part appears to be defective. It is not a question of yes or no, but how defective each line is, and what the failure rate for each one is. We are hearing of early failure rates in the teens per cent for 8800GTs and far higher for 9600GTs, so this is not a quibble over split hairs.
To make matters worse, Nvidia has a mound of unsold defective parts that they are going to bleed out into the channel along side of the (hopefully) fixed parts. As a buyer, you have no way of knowing which one you are getting, and it looks like Nvidia isn't keen on helping you figure it out either, that would cost too much.
Until Nvidia comes fully clean on this fiasco, lists all the defective parts, and orders boxes clearly marked, you can't say anything other than just avoid them. Then again, since doing the right thing would likely bankrupt them, we wouldn't hold your breath for it to happen.
---------- Post added May 18th, 2010 at 00:04 ---------- Previous post was May 17th, 2010 at 23:57 ----------
Source: A FEW WEEKS ago, VR-Zone posted a story about Nvidia issuing a Product Change Notification (PCN) about G86 desktop chips and underfill materials. Days later, it strangely disappeared, but now it is back with an 'explanation' from Nvidia PR appended.
Why is Nvidia so afraid of this information getting out? Easy, it basically proves they are not telling the truth once again about the defective chips fiasco. We asked them for a copy of the PCN, but they declined, but luckily we ran into lots of people who had it at IDF, and we took copious notes.
With that, let's take you through, point by point, the Nvidia 6 page PDF that they issued as a PCN. Page 5 is blank, and Page 6 is the standard legal disclaimer, so we will skip those. The PCN is dated May 22, 2008 on the bottom of pages 2-5, July 25 on the bottom of Page 1, and Page 6 is undated. The first big problem is that it is entitled "G86 Desktop Products" with a subtitle "Change Namics 8439-1 Underfill material to Hitachi 3730". Above that there is "Product/Process Change Notice", the usual NDA only disclaimer.
Remember how Nvidia swore up and down that desktop parts were flat out not affected? Remember how we said that all G84 and G86s were because they were the same ASIC? I guess they decided to change this underfill material to better color coordinate with the substrate hues, given the cost of testing, qualification and other work that needs to be done, you certainly wouldn't want to change it for no good reason. The old one worked just fine, right? Not defective either, they said so. Then again, they said the problem was contained to HP as well.
The official Nvidia explanation is that if you change one SKU, you change them all, so this isn't a big deal. Testing and validation costs be damned, you take a 'working' and near EOL part and change it on a whim because they changed another part. OEMs love that, as do stockholders.
The problem is that this story doesn't wash either. If the original desktop G86s are not affected, there is no reason to change, they work and you are only adding cost and risk, as well as likely more expensive materials. There is no way they would take on this cost if it wasn't necessary. That means the desktop chips were bad as well, and needed changes, validating our original story from early July.
On to page two, there is another whopper, but first they repeat that this is a PCN, and the title has not changed from Page 1. The first bombshell is spread out over three boxes entitled, PCN Submit Date:, Planned Implementation Date:, and Proposed First Ship Date:. They are July 25, 2008, Immediate, and July 25, 2008 respectively.
Why is this important? Well, it shows that the companies knew there was a problem, they made a change, and the change didn't start shipping until a month after they said it was all fixed. This seemingly flat out contradicts their 8-K statement,
But I am sure they will come up with a slick PR reason why if you stretch your imagination and squint, they do line up. Either way, if you bought an Nvidia product before July 25, it looks like you probably bought a defective one. Given that the 25th was the start ship date, that means the parts were not going to end users for a bit after that, so you probably aren't safe until mid-August.
In any case, this explains why Dell and the rest would not answer the question of "Is the one I bought since the announcement defective or is it good? " and "Are you still shipping defective parts?" They wouldn't answer here and here and here because they knew that statements like "We are still shipping defective parts to customers," and "you bought a lemon," don't go over well.
The next box is titled "Change Category:", and there are two option check boxes, "Class 1 Change - Major (Customer Approval Required)", and "Class 2 Change - Customer Notification Only". It doesn't take a genius to figure out that this one had Class 1 checked. I wonder how they are going to spin the whole 'it is only minor, you are reading too much into this' in light of that? It is going to be funny to watch in any case.
The next box is "Required Distribution:", and all four boxes, Sales, Marketing, Materials/Planning, Others (Quality and Assembly Engineering) are checked. I guess they meant it when they said it was Major.
Then there is a box with "Attention, for Class 1 Change:" with three bullet points. The first is "Customer should acknowledge the PCN as soon as possible", then "Lack of acknowledgment of the PCN prior to the Proposed First Ship Date constitutes acceptance of the change", and finally "If customer does not accept this change, or would like to work with NVIDIA to change the First Ship Date, please contact your local Sales representative or Program Manager immediately". Yup, nothing to see here, nothing major to worry about, you just need to sign off as a formality. That is the story, and they are sticking to it.
The last box lists affected parts, and has two chips, G86-303-A2 and G86-103-A2, and one kit, G86-213-A2. Told ya so.
Page 3 has three parts, Proposed Change Information, Implementation and Qualification Plan, and Product Marking and Traceability. PCI has three parts, and two of them are whoppers with some truly precious hidden gems. They are, "Description of Change" which says, "NVIDIA will transition from using Namics 8439-1 underfill material to Hitachi 3730 underfill material for the G86 Desktop product skus only." This is followed by "Reason For Change" which says "To increase supply and enhance package robustness," and "Impact of Change (form, fit, function, quality or reliability):" that reads, "There will be no adverse change to form, fit, function or reliability."
Let's look at this closely. The description is pretty obvious, nothing to see here, move along kiddies. The big one is the reason, and it has two parts, the one most people focus on is the part about enhancing package robustness. Now if there was no problem in desktop parts, why is packaging robustness in such dire need of change? And if it really does that, why does the Impact box say there will be "no adverse change to form, fit, function or reliability". Correct me if I am wrong, but do they really need to say "this one won't be a lemon guys" that directly?
The most telling statement is the first half of the Reason box, and that is "To increase supply." This means that either Nvidia was having enough problems getting the Namics 8439-1 underfill that it was limiting their ability to make chips, desktop only mind you, or there are enough defects to drag the yield rate into the toilet. Guess which one it is, Nvidia will likely say it is a supply problem, it can't admit failures that high.
The "Implementation and Qualification Plan" says that "Qualification data is available on request", with two sub boxes saying that both the data and qual samples are available now. Guess that means you can get them to test with.
Product Marking and Traceability is the bottom of Page 3 and most of Page 4. Unfortunately it is a diagram, and our artistic abilities would not allow us to take sufficient notes to re-create it. The text in three paragraphs, two above, one below the diagram, reads as follows. "Product with the eutectic bump will be denoted with an 'R' appended to the end of the lot number in the 4th line. This will not change. The box label will have the 'F' after the lot number for identifying product with Hitachi underfill change," then "During the transition period, traceability will be maintained by NVIDIA. Please contact your Program Manager if a list of affected batch numbers or shipments is required."
Finally below the picture, "For identifying product with Hitachi underfill material, refer to the box label and the lot # on the box will have the letter 'F' as the last character of the lot #."
Lots of words there, but the short story is that if you have a chip, it doesn't seem like you can tell which underfill material you have. The R for eutectic bumps, a significant change that we will discuss in a later article, seems to be the only thing on the chips. The box F label will never make it to the end user, so there doesn't seem to be a way to figure out if you have a lemon or not, other than potentially date of purchase. In any case, if you have access to the box your GPUs shipped in, you can figure this out.
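The marking rule quoted from the PCN boils down to two suffix checks, sketched below in Python. This is only my reading of the quoted text: 'R' as the last character of the lot number on the chip means eutectic bumps, and 'F' as the last character of the lot number on the box label means Hitachi underfill. The function names and the sample lot numbers are hypothetical, and ignoring case and surrounding whitespace is my own assumption, not something the PCN states.

```python
def has_eutectic_bump(chip_lot_number: str) -> bool:
    """True if the lot number read off the chip ends in 'R' (eutectic bump per the PCN text)."""
    return chip_lot_number.strip().upper().endswith("R")


def has_hitachi_underfill(box_lot_number: str) -> bool:
    """True if the lot number on the box label ends in 'F' (Hitachi underfill per the PCN text)."""
    return box_lot_number.strip().upper().endswith("F")


if __name__ == "__main__":
    # Made-up lot numbers, only to show the checks; real formats are not given in the article.
    for lot in ("ABC12345R", "ABC12345"):
        print(lot, "-> eutectic bump:", has_eutectic_bump(lot))
    for lot in ("ABC12345F", "ABC12345"):
        print(lot, "-> Hitachi underfill:", has_hitachi_underfill(lot))
```

As the article points out, the 'F' only appears on the box label, so the second check is no use to an end user who never sees the box the GPUs shipped in.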
The next box is titled "Recommended Action" and says "No customer qualification required. For additional data or questions, please contact your Program Manager." Below that there is a box titled "NVIDIA Contact" which is split into two boxes that say, "In case of questions, please contact your local Sales representative or Program Manager." and "Change approval can be done through NVIDIA Website: http://partners.nvidia.com/."
Closing out Page 4, we have the "PCN revision history", and it says "Date: 7/25/08," "Revision: A", and "Reason: Initial Release." As stated earlier, Page 5 is blank, Page 6 has a legal disclaimer.
In the end, this is one of the smoking guns. Nvidia flat out denied that there were any desktop parts that were defective, but now they are changing materials to "enhance package robustness." I guess you do that on a whim. I wonder how the SEC will square that with the 8-K which said, "MCP and GPU products that are impacted were included in a number of notebook products."