That problem reared its head a lot for the RV770 in particular, with the rise in popularity of stress testing programs like FurMark and OCCT. Although stress testers on the CPU side are nothing new, FurMark and OCCT heralded a new generation of GPU stress testers that were extremely effective in generating a maximum load. Unfortunately for RV770, the maximum possible load and the TDP are pretty far apart, which becomes a problem since the VRMs used in a card only need to be spec’d to meet the TDP of a card plus some safety room. They don’t need to be able to meet whatever the true maximum load of a card can be, as it should never happen.
Why is this? AMD believes that the instruction streams generated by OCCT and FurMark are entirely unrealistic. They try to hit everything at once, and this is something that they don’t believe a game or even a GPGPU application would ever do. For this reason these programs are held in low regard by AMD, and in our discussions with them they referred to them as “power viruses”, a term that’s normally associated with malware. We don’t agree with the terminology, but in our testing we can’t disagree with AMD about the realism of their load – we can’t find anything that generates the same kind of loads as OCCT and FurMark.
Regardless of what AMD wants to call these stress testers, there was a real problem when they were run on RV770. The overcurrent situation they created was too much for the VRMs on many cards, and as a failsafe these cards would shut down to protect the VRMs. At a user level shutting down like this isn’t a very helpful failsafe mode. At a hardware level shutting down like this isn’t enough to protect the VRMs in all situations. Ultimately these programs were capable of permanently damaging RV770 cards, and AMD needed to do something about it. For RV770 they could use the drivers to throttle these programs; until Catalyst 9.8 they detected the program by name, and since 9.8 they detect the ratio of texture to ALU instructions (Ed: We’re told NVIDIA throttles similarly, but we don’t have a good control for testing this statement). This keeps RV770 safe, but it wasn’t good enough. It’s a hardware problem, the solution needs to be in hardware, particularly if anyone really did write a power virus in the future that the drivers couldn’t stop, in an attempt to break cards on a wide scale.