The computing part of the R600 and RV670 chips consisted of 64 universal units, each of which had five ALUs, a flow control unit, and an array of general-purpose registers. Four of the five ALUs were rather simple, each capable of executing one FP MAD instruction. The fifth ALU was more complex, capable of processing such instructions as SIN, COS, LOG, EXP, etc. In fact, each execution unit was a processor with a five-stage pipeline.
Theoretically, the GPU contained 320 execution units, but this was only true when all 64 pipelines were fully loaded, which was not always the case. In 3D applications many operations depend on the results of previous operations, so it is hard to keep the pipelines loaded at all times. Application-specific optimizations in the Catalyst driver were required for that, but it is often impossible to get access to a game's code until the game is officially released.
As a consequence, the ATI Radeon HD architecture often found itself using only one ALU in each execution unit and lagging behind the competing G80/G92-based solutions from Nvidia.
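The effect of dependent operations on a five-ALU unit can be illustrated with a toy model. The sketch below is our own simplification, not ATI's actual shader compiler: the names `pack_bundles` and `utilization` are hypothetical, all five ALUs are treated as interchangeable (ignoring that the fifth one is special), and every operation is assumed to take one step.

```python
def pack_bundles(deps, width=5):
    """Greedily pack operations into bundles of at most `width` slots.

    deps[i] is the set of operation indices that operation i depends on.
    An operation may only be issued in a bundle that comes after the
    bundles containing all of its inputs.
    Returns a list of bundles, each a list of operation indices.
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        # Operations whose inputs were all produced by earlier bundles.
        ready = [i for i in remaining if deps[i] <= done][:width]
        if not ready:
            raise ValueError("dependency cycle")
        bundles.append(ready)
        done.update(ready)
        remaining = [i for i in remaining if i not in done]
    return bundles


def utilization(bundles, width=5):
    """Fraction of ALU slots that actually carry an operation."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# Ten operations where each one needs the result of the previous one.
chain = [set()] + [{i} for i in range(9)]
# Ten operations with no dependencies at all.
independent = [set() for _ in range(10)]

print(len(pack_bundles(chain)), utilization(pack_bundles(chain)))
print(len(pack_bundles(independent)), utilization(pack_bundles(independent)))
```

In this model a fully dependent chain fills only one slot in five, which corresponds to the case described above: 64 busy ALUs out of a nominal 320. Independent operations fill every slot.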
The latter not only had more independent execution modules but also worked at higher clock rates. When creating the RV770, the ATI developers solved the problem of the potential inefficiency of the superscalar architecture in the most direct way: by increasing the number of execution modules from 64 to 160.
Of course, this means there are more transistors in the core, but the 55nm tech process helped keep the size of the core within reasonable limits.
The architecture of the units has not changed much. Each of them still consists of five ALUs, one flow control unit, and a few general-purpose registers.
ATI claims the execution units are now 40% more efficient, but the brutal increase in number (from 64 to 160) is already enough to make the Radeon HD 4800 competitive even under unfavorable conditions. That’s not all, though. As we mentioned above, there are more global changes on the core topology level. With the ring topology partially retained, the placement of the functional subunits has been optimized. The RV770’s execution subunits have been joined into 10 SIMD cores (the previous GPU had 4 such cores) with 16 modules (80 ALUs) in each.
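The unit counts quoted above follow from simple multiplication, which a few lines of Python make explicit. This is only a back-of-the-envelope sketch: `alu_count` is a name of our own, and the figure of 16 modules per core for the previous GPU is derived from the 64 units and 4 cores mentioned in the text rather than stated directly.

```python
ALUS_PER_UNIT = 5  # every execution unit has five ALUs


def alu_count(simd_cores, units_per_core, alus_per_unit=ALUS_PER_UNIT):
    """Return (execution units, total ALUs) for a given core layout."""
    units = simd_cores * units_per_core
    return units, units * alus_per_unit


# Previous generation: 4 SIMD cores, 64 units in total (16 per core, derived).
rv670_units, rv670_alus = alu_count(simd_cores=4, units_per_core=16)
# RV770: 10 SIMD cores with 16 units each.
rv770_units, rv770_alus = alu_count(simd_cores=10, units_per_core=16)

print("RV670:", rv670_units, "units,", rv670_alus, "ALUs")
print("RV770:", rv770_units, "units,", rv770_alus, "ALUs")

# Worst case from the text: only one ALU busy per execution unit.
print("Worst-case busy ALUs:", rv670_units, "vs", rv770_units)
```

Even in the worst case of one busy ALU per unit, the new chip keeps 160 ALUs working where the old one kept 64.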
Each execution core has dedicated control logic, 4 TMUs, and an L1 cache. The cores can communicate locally as well as globally.
Note that the ratio of computing units to texture-mapping units has remained the same at 4 to 1. ATI considers it optimal. You may argue the point, but the argument makes no sense for the RV770 because, unlike its predecessor, this GPU shouldn’t feel a lack of computing or texture-mapping power. Of course, the new chip offers full support for DirectX 10.1.
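The 4 to 1 ratio can be checked from the per-core figures. The sketch below uses our own helper name `unit_to_tmu_ratio` and assumes that the previous GPU also had 4 TMUs in each of its 4 cores, which is what an unchanged ratio with 64 execution units implies.

```python
from fractions import Fraction


def unit_to_tmu_ratio(simd_cores, units_per_core, tmus_per_core):
    """Return (total TMUs, execution units per TMU) for a core layout."""
    units = simd_cores * units_per_core
    tmus = simd_cores * tmus_per_core
    return tmus, Fraction(units, tmus)


# RV770: 10 cores, 16 execution units and 4 TMUs per core.
rv770_tmus, rv770_ratio = unit_to_tmu_ratio(10, 16, 4)
# Previous GPU: 4 cores; 16 units and 4 TMUs per core are assumed here.
rv670_tmus, rv670_ratio = unit_to_tmu_ratio(4, 16, 4)

print("RV770:", rv770_tmus, "TMUs, ratio", rv770_ratio)
print("RV670:", rv670_tmus, "TMUs, ratio", rv670_ratio)
```

Because both the execution units and the TMUs scale with the number of SIMD cores, adding cores raises texturing power along with computing power and leaves the ratio untouched.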