CPUs Revisited: PC Processor Microarchitecture Evolution
September 20, 2005
http://www.extremetech.com/print_article2/0,1217,a=160458,00.asp
By J. Scott Gardner
It's been five years since we took our first in-depth look at PC processor microarchitecture. Since then, we've seen clock rates increase, but not as much as expected. We've seen a push towards multicore and 64-bit processing, all in the context of x86 evolution.
Subtler issues have emerged, too, such as more-efficient power usage, leakage, and new manufacturing processes. All have had an impact on the evolution of PC CPUs.
This article will focus on how the microprocessor landscape has changed since the original article was written almost 5 years ago.
In our previous article, there was always a character in the back of the room, trying to speed things up and get into the glorious details of CPU microarchitecture. By restraining this understandable enthusiasm, the first half of the article was designed to start with the fundamentals and provide analytical tools for evaluating radically different microprocessors.
Even 5 years later, all of these analytical tools remain valid, and ExtremeTech readers continue to refer back to this document. The second half of the article applied the analytical tools to evaluate the Intel P4, AMD Athlon, and VIA/Centaur C3 microprocessors. To wrap things up, the article made a few observations and predictions about CPU architecture. Another excellent source for a quick review of PC microarchitecture is Nick Stam's CPU article, which was originally published in ExtremeTech Magazine. Now, with the luxury of hindsight, we can look again at the x86 processor world.
There is one thing we've stressed in all of our articles. While it's a lot of fun to uncover every detail about CPU internals, at the end of the day it only matters if the CPU features help your software run faster. The chipset, memory, and peripherals also play a part in creating a balanced system architecture that may be optimized for a certain type of software workload.
Don't get too hung up on numbers of execution units and cache sizes, since a lot of software may not show any performance benefit from individual microarchitectural features.
Watt Really Matters in CPU Design
As part of the theme for this fresh look at CPU microarchitecture, we need to add a new admonition: Cool new features should be evaluated for their impact on system power consumption—not just system performance.
Since our last look at CPU microarchitecture, Intel has found a new religion and begun to preach the virtues of "Efficient Computing." No longer would armies of engineers be sacrificed at the Altar of Speed, forcing every last ounce of peak performance out of the CPU design. The effigy of Prescott continues to smolder after passing 115 watts before ever reaching 4 GHz, much less the 5 GHz promised for 2005. The bold prophecy for 10 GHz CPUs has been forsaken, now that the Laws of Physics once again hold sway over the marketing multitude.
A Nautical Analogy?
The Spring and Fall 2005 Intel Developer Forums were the public view of a company with the inertia of an aircraft carrier making a sharp turn in the water. First, Craig Barrett (Intel's former CEO) admitted that they had hit a thermal wall that kept them from increasing clock rates without incurring ludicrous costs for a cooling solution. This was dire news for the Pentium 4 architecture. As we pointed out in the original article, the longer pipeline must be run faster than other architectures in order to accomplish the same amount of work.
Back then, we were only analyzing the 20-stage Pentium 4 (Willamette)—not the 31-stage beast that is found in Prescott. Sure enough, the Fall IDF marked an announcement that the future Intel microarchitecture would not use the Pentium 4 pipeline and would shift to a 14-stage core that is based on the 12-stage Pentium M (Banias/Dothan). The power-efficient Pentium M seemed the perfect vehicle to use in moving back from the thermal wall.
To its credit, Intel seems remarkably agile in abandoning the clock-rate race, since raw speed was so much of the corporate identity. The new focus on system-level performance per watt should benefit us all, though Intel will have to work harder to differentiate itself.
Most computer architects have to accept on faith what they're told about the performance of the underlying transistors. The strategic goal was to use process technology and circuit tricks to push the NetBurst architecture to 10 GHz. Process technology is an arcane science dominated by a priesthood of experts in quantum physics and the chemistry of exotic materials, far removed from the computer science world of most CPU architects.
The Intel microarchitecture was heavily pipelined so as to chop up the computing tasks into small steps, thereby reducing the number of transistor delays at each stage. We were surprised to find that 2 of Willamette's 20 stages were allocated to just driving signals across metal, though we later learned that a DEC Alpha CPU had this feature even earlier. With less work being done in each stage, the Instructions Per Clock (IPC) for the original P4 architecture was reduced by 10 to 20% when compared with the Pentium 3. With the promised clock-rate headroom, the designers saw this as a good trade-off.
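The trade-off can be put into rough numbers. The sketch below assumes the simplest possible model, in which delivered performance is just IPC multiplied by clock frequency (ignoring memory and I/O effects). The function name is ours, chosen only for illustration.

```python
def required_clock_ratio(ipc_loss: float) -> float:
    """Clock-rate multiplier needed to offset a fractional IPC loss.

    Assumes performance = IPC * frequency, which is a deliberate
    simplification: it ignores cache misses, memory latency, and I/O.
    """
    if not 0.0 <= ipc_loss < 1.0:
        raise ValueError("ipc_loss must be in the range [0, 1)")
    return 1.0 / (1.0 - ipc_loss)


if __name__ == "__main__":
    # The 10% and 20% figures are the IPC reductions quoted for the original P4.
    for loss in (0.10, 0.20):
        ratio = required_clock_ratio(loss)
        print(f"IPC down {loss:.0%}: clock must rise {ratio - 1:.1%} to break even")
```

Under this model, a 20% IPC deficit needs a 25% faster clock just to break even, so the design only pays off if the process delivers clock-rate headroom well beyond that.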
Speculation Takes Its Toll
In our earlier article, we said we'd soon "find it humorous that we thought a 1 GHz processor was a fast chip." Well, the humor was quickly followed by irony. The Pentium 4 architecture rapidly scaled in clock rate to over 3 GHz, but the designers started to pay a price for that speed. As an aggressive out-of-order machine with well over 100 instructions in flight, the hardware was struggling to dynamically schedule resources so that the long pipeline would keep moving.
In addition to speculatively fetching, decoding, and dispatching instructions in the pipeline, the microarchitecture would speculatively load data from the L1 cache—even if it was the wrong data and required dependent instructions to be killed later. The Intel design approach was to plan for the best case and worry less about wasted energy if the speculation doesn't pan out.
Moving to a 31-stage pipeline for Prescott kept the clock rate treadmill going for a while longer, but it caused even more wasted energy when software didn't follow the predicted flow. At the time, getting peak performance a few percentage points better seemed like the right trade-off over power efficiency. As history has shown, this design philosophy caused that thermal wall to arrive even earlier.
To grossly oversimplify the description, a field-effect transistor (FET) can work like a switch that allows current to flow between the source node and the drain node whenever there is a voltage applied to the gate node (see diagram below). Basically, the voltage on the gate creates an electric field that controls how much current is allowed to flow through the source-drain channel. It's like a control valve on a water pipe. In this case, the transistor operates as a voltage-controlled current source. A layer of dielectric material (silicon dioxide) insulates the gate node from the current flowing through the source-drain channel.
A shorter channel length allows the source-drain current to switch on even faster. Likewise, reducing the thickness of the gate oxide insulating layer can reduce the transistor switching time.
The problem is that we've now shrunk the transistors to the point where the channel lengths are so short that a significant amount of current leaks through the source-drain channel (sub-threshold leakage), even when the transistor switch is in the OFF position. As temperature is increased, the sub-threshold leakage increases exponentially because of a drop in the threshold voltage.
Current also leaks from the gate node through the oxide and channel and into the underlying substrate (gate leakage). As process geometries have shrunk even further, another leakage effect is band-to-band tunneling (BTBT), where reverse-biased source/drain junctions allow electrons to tunnel their way into the substrate. (Tunneling is one of those quantum mechanics properties that Einstein wouldn't have believed in, since it's based on the Heisenberg Uncertainty Principle and a God who rolls dice. Perhaps one of the many physicists in the ExtremeTech forums will elaborate.)
All three sources of leakage have become a huge problem, and the process technology priesthood is working to come up with new materials and transistor designs that reduce the leakage.
As any overclocker knows, raising the voltage makes chips run faster. The CPU vendors already test and ship their CPUs to run at the highest possible voltage in order to yield the high-end clock rates. The overclocking crowd cranks up the voltage even higher while putting extra effort into keeping the chips cool.
By now, most ExtremeTech readers recognize that dynamic power consumption has a linear relationship with frequency, but a nonlinear, squared relationship with voltage. However, the impact of voltage gets worse when you consider static power consumption, which is almost entirely caused by transistor current leakage. A higher voltage exacerbates current leakage in the transistors, and the leakage power relationship to voltage is a higher-order polynomial. A simplified view of the full power equation is as follows:
Power = C • V^2 • f + g1 • V^3 + g2 • V^5
where C is capacitance, V is the voltage, and f is operating frequency. The g1 term is a parameter representing the sub-threshold leakage, and g2 represents the gate leakage.
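To get a feel for how the terms behave, the simplified equation can be coded directly. This is only a sketch: the coefficient values in the example are hypothetical placeholders in arbitrary units, not measurements from any real process.

```python
def total_power(v: float, f: float, c: float, g1: float, g2: float) -> float:
    """Evaluate the simplified power equation from the text.

    Power = C * V^2 * f  +  g1 * V^3  +  g2 * V^5
    """
    dynamic = c * v**2 * f        # switching power
    subthreshold = g1 * v**3      # sub-threshold leakage term
    gate = g2 * v**5              # gate leakage term
    return dynamic + subthreshold + gate


if __name__ == "__main__":
    # Hypothetical, normalized coefficients chosen only for illustration.
    C, G1, G2 = 1.0, 0.3, 0.2
    base = total_power(v=1.0, f=1.0, c=C, g1=G1, g2=G2)
    for v in (0.8, 0.9, 1.0, 1.1, 1.2):
        p = total_power(v=v, f=1.0, c=C, g1=G1, g2=G2)
        print(f"V = {v:.1f}: power = {p / base:.2f}x baseline")
```

Because the leakage terms rise with the third and fifth power of voltage, their share of total power grows as the voltage goes up, even with the frequency held constant.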
This equation doesn't even consider reverse-junction BTBT leakage or other leakage sources that are emerging as designers try to keep those electrons where they belong. The effect of leakage has changed all the rules for the speed benefits of voltage scaling. It's now obvious that power dissipation quickly rises to reach the limit of the package and cooling solution. You can continue to increase voltage to run the processor a lot faster, but you wouldn't be able to cool it.
There is no escaping the harsh laws of physics, and AMD has been fighting the same limitations of CMOS transistors. AMD, like everyone else, is bumping up against the thermal wall and looking for clever ways to raise clock rate with new process technology. But AMD's Athlon 64 only has a 12-stage pipeline, so it's able to get more work done each clock cycle. If you add in the fact that AMD has 9 execution units (including 2 FP units) with fewer issue restrictions, compared with the 7 execution units of Prescott, then it's clear that AMD doesn't have to push the clock rate as high to match Prescott's performance.
As we observed almost 5 years ago, AMD's architecture isn't as oriented towards streaming media as Intel's approach, so AMD didn't choose a long pipeline with aggressive speculation tuned for the well-ordered data flow of media streams. While media codecs can be heavily optimized for the Pentium 4 architecture, a broad range of other applications will have "branchy" code that can lead to a 30 cycle penalty for a branch misprediction (not counting cache-miss effects). For non-media applications, Intel's smart prefetcher won't be as useful in getting the proper data into caches to reduce latency. The difference in design philosophy is likely one of the reasons that AMD CPUs perform so much better on games, while Intel chips tend to do better on media applications.
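A back-of-the-envelope model shows why "branchy" code hurts a long pipeline. The sketch below uses a textbook-style estimate of average cycles per instruction (CPI). Only the 30-cycle penalty comes from the discussion above; the base CPI, branch frequency, and misprediction rates are hypothetical values picked for illustration.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """Average cycles per instruction including branch misprediction stalls.

    A first-order estimate that ignores cache misses and assumes every
    mispredicted branch costs the full pipeline-flush penalty.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    PENALTY = 30  # cycles, the misprediction penalty quoted in the text
    # Hypothetical workloads: (label, fraction of branches, mispredict rate)
    workloads = [("media-like", 0.05, 0.01), ("branchy", 0.20, 0.05)]
    for label, branches, miss in workloads:
        cpi = effective_cpi(1.0, branches, miss, PENALTY)
        print(f"{label}: effective CPI = {cpi:.3f}")
```

With these made-up inputs, the predictable media-like stream barely notices the penalty, while the branchy workload loses a sizable fraction of its throughput to pipeline flushes.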
Based on a few rumors and conjecture, we were able to predict 5 years ago that Intel would implement simultaneous multithreading (SMT) as a way to deal with the latency sensitivity of the Pentium 4 architecture. Some RISC vendors had already introduced this feature, but Intel needed to introduce a new buzzword to give an old idea new pizzazz.
With Hyper-Threading, architectural state information is duplicated so that software believes two logical CPUs are available. Compute resources are shared, so that the overall cost of Hyper-Threading is only about 5% of the die area. The real benefit is that the physical CPU can work on a different thread whenever a long-latency operation would have otherwise held up execution. On some multi-threaded applications, up to 30% better performance can be achieved. For these applications, it's a clear win. Unfortunately, several single-threaded applications actually ran slightly slower with Hyper-Threading, because of the extra overhead for the control logic.
Why Hyper-Thread If You Can Hyper-Core?
The demise of the Pentium 4 microarchitecture on Intel's roadmap will likely put SMT technology on the shelf for a while. The shorter pipeline of the Pentium M core does not have the latency penalties of Pentium 4, and a mobile processor probably wouldn't want to burn extra power to duplicate all the architectural state. Intel's next-generation architecture will build on the Pentium M core, and thread-level parallelism will be achieved through symmetric multiprocessing (SMP).
Instead of multiple logical processors, several physical CPUs will be integrated together as a single chip. Software won't know the difference, and each core will be simpler and smaller. Intel and AMD both have filled their processor roadmaps with multicore devices, mostly because the extra cores take advantage of the extra die area from process shrinks. The multicore devices will run at a lower voltage and frequency, which, by the power equation above, should yield a non-linear reduction in power consumption.
This question has been asked for decades, since RISC workstations were long ago configured with multiple CPUs. There have also been companies building high-end x86 multiprocessor machines with customized hardware, while low-end x86 SMP motherboards have been available for years. The best SMP applications tended to be server or floating-point tasks where the data processing or number-crunching benefits outweighed the overhead of dealing with cache coherency.
However, even in these applications, it is impossible to get performance to scale linearly with the number of CPUs. An SMP machine creates extra bus traffic and processor stalls to snoop for shared memory that may have been modified by multiple processors. Even in a single-CPU machine, snoop cycles could occur because of other bus masters (such as disk or network controllers) that modify shared memory. The reason that SMP hasn't already found its way onto the mainstream desktop is that few applications scale up very well as you add processors.
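One standard way to see why scaling falls short of linear is Amdahl's law, which we bring in here as an illustration (it is not part of the original analysis). The sketch assumes a fixed fraction of the work can be spread across CPUs and ignores the snoop and coherency overhead described above, so real machines would do worse than these numbers.

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Ideal speedup from Amdahl's law.

    parallel_fraction is the share of run time that can be split across
    CPUs; the rest stays serial. Coherency traffic and synchronization
    costs are not modeled, so this is an upper bound.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cpus)


if __name__ == "__main__":
    # The 50% and 90% parallel fractions are hypothetical examples.
    for fraction in (0.5, 0.9):
        for n in (2, 4, 8):
            print(f"{fraction:.0%} parallel, {n} CPUs: "
                  f"{amdahl_speedup(fraction, n):.2f}x")
```

Even an application that is 90% parallel tops out well short of 8X on eight CPUs, and a half-serial application can never reach 2X no matter how many cores are added.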
While multithreaded operating systems have been available for years, it's very difficult to create multithreaded applications. Multiple threads run at non-deterministic, asynchronous rates, so any data shared between threads may not be correct at the time it's needed. However, simplistic use of operating system mechanisms to force thread synchronization may end up slowing down the application by more than a hundred-fold. It's hard enough to find bugs in a single-threaded application. Even if extra CPU cores are in the system, a lot of programmers may not believe that multithreading their applications is worth the complex coding and debug effort.
While it will be a while before most applications can take advantage of multithreading, an immediate system benefit of multicore will be to increase performance when multiple applications are running simultaneously. While Hyper-Threading shares compute resources, SMP machines provide more compute resources by allowing applications to run on separate CPUs. The overall system response time should be better, since operating system tasks also get reallocated so that they no longer compete as heavily for CPU resources. Unless there is some data dependency between tasks (forcing synchronization delays or snoops on shared memory), an SMP machine will be much faster for multitasking. Of course, the majority of single CPU machines rarely overload the CPU, even while multitasking. This is because most users don't run multiple heavy-weight tasks, though the vendors of multicore chips are working to change that usage model.
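The shared-data problem and the cost of simplistic synchronization, both mentioned earlier, can be shown with a small sketch using Python's standard threading module. The two functions are hypothetical examples of ours: the first takes a lock on every update, which is correct but heavy-handed, while the second lets each thread accumulate privately and synchronizes only once.

```python
import threading


def _run_workers(worker, num_threads: int) -> None:
    """Start num_threads threads running worker and wait for all of them."""
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_with_fine_lock(num_threads: int, increments: int) -> int:
    """Every single update to the shared total takes the lock."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:          # synchronize on every increment
                total += 1

    _run_workers(worker, num_threads)
    return total


def count_with_batched_lock(num_threads: int, increments: int) -> int:
    """Each thread sums privately, then takes the lock once to merge."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        local = 0
        for _ in range(increments):
            local += 1          # no sharing, so no lock needed here
        with lock:              # one synchronization per thread
            total += local

    _run_workers(worker, num_threads)
    return total


if __name__ == "__main__":
    print(count_with_fine_lock(4, 10_000))
    print(count_with_batched_lock(4, 10_000))
```

Both versions produce the same total, but the second acquires the lock only once per thread instead of once per update. Restructuring code this way is exactly the kind of extra design and debug effort that makes programmers hesitate.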
Over time, software developers will gain more experience writing multithreaded applications, and a host of programmers from an unexpected direction are developing expertise in writing multithreaded code for consumer applications. We're talking about console game developers. The PS3 and Xbox 360 are multicore systems, and legions of game programmers will figure out new and creative ways to use multiple CPUs in an environment that has classically been the purview of single-threaded applications.
It was hard enough sorting out single-core benchmarks that would often distort the workload to encourage the purchase of high-end chips and systems. With multicore, it will get even harder to relate benchmark scores to the workload of an average user. To support Intel's strategic focus on multicore performance, the company's been using SpecIntRate in its marketing literature. The problem is that the term "rate" in a Spec benchmark means that multiple copies of the benchmark are run simultaneously.
The SPEC organization defines SpecInt as a measure of speed, while SpecIntRate measures throughput. This is a valuable distinction to help in choosing a processor for a server, since maximizing throughput is critical for this workload. But it's very misleading to use those scores and imply any performance benefit for mainstream applications. For a user who wants maximum performance on a single-threaded application (like most current computer games), a slower-clocked multicore device will likely have less performance, not almost 2X as SpecIntRate would have you believe.
The use of SpecIntRate has been extended to bold predictions of the power/performance benefits of Intel's next-generation architecture. The graphs below, from the Intel Developer Forum, suggest a major improvement in desktop and mobile computing efficiency.
For Conroe, public roadmaps claim a 5X improvement in performance/watt over the original Pentium 4. Comparing the dual-core Merom with the already-efficient, single-core Pentium M, the roadmaps predict a 3X improvement in performance/watt. However, the numerator in these terms continues to use SpecIntRate, so that a next-generation, dual-core device gets immediate credit for nearly a 2X performance benefit by virtue of having 2 cores. Dropping the voltage and implementing special power-management circuit tricks will account for improvements in the denominator of performance/watt.
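The gap between a speed score and a rate score is easy to show with an idealized model. The sketch below is not how SPEC computes its metrics; it simply assumes a rate-style score equals the per-core speed multiplied by the number of cores (perfect scaling, one benchmark copy per core). The chip figures are hypothetical, normalized values.

```python
def speed_and_rate(per_core_speed: float, cores: int) -> tuple[float, float]:
    """Return (speed_score, rate_score) under an idealized model.

    speed_score: what a single-threaded application sees (one core).
    rate_score:  throughput with one benchmark copy per core, assuming
                 perfect scaling with no memory or bus contention.
    """
    return per_core_speed, per_core_speed * cores


def perf_per_watt(score: float, watts: float) -> float:
    """Simple ratio used for the efficiency comparison."""
    return score / watts


if __name__ == "__main__":
    # Hypothetical chips: (label, normalized per-core speed, cores, watts)
    chips = [("single-core", 1.0, 1, 100.0), ("dual-core", 0.8, 2, 80.0)]
    for label, speed, cores, watts in chips:
        s, r = speed_and_rate(speed, cores)
        print(f"{label}: speed = {s:.2f}, rate = {r:.2f}, "
              f"rate/watt = {perf_per_watt(r, watts):.4f}")
```

In this made-up comparison the dual-core part looks twice as efficient on a rate-per-watt basis, yet a single-threaded game would run 20% slower on it. That is the distortion to watch for when the numerator is a throughput score.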
The next-generation pipeline will be more power-efficient than the Pentium 4's, but it will be a while before we can quantify the benefit from microarchitectural improvements alone. These graphs don't say much about microarchitecture, since it seems that clock rates and voltages are varied at each datapoint. The bigger L2 in Dothan doesn't explain the huge improvement over Banias, since the difference is more likely due to a lower voltage after the shrink from 130nm. The shift in process and voltage would also explain why Prescott is shown as more power-efficient than Northwood.
It is very important for the benchmarking community to find ways to model actual desktop and mobile workloads so that users can make valid comparisons. Intel's efforts to train application developers to use multi-threading may eventually lead to a broad range of real-world applications that can be benchmarked.
In our original article, in addition to analyzing the P4 and Athlon, we looked inside the C3 processor from the Centaur design team, which is part of VIA. In the interest of full disclosure, this author has a long history with several of the Centaur founders and has also been paid for consulting work. However, it should be possible to take a quick, objective look at how the Centaur architecture has fared over the 5 years since we last interviewed Glenn Henry while researching for the first article.
Most followers of computer architecture know Glenn well, based on his straight-talking style at Microprocessor Forum or during press interviews. A former IBM Fellow and CTO at Dell, he continues to directly lead the design efforts, and he personally designs much of the hardware and microcode, while sharing the disdain that engineers hold for marketing fluff and distortions. The Centaur design philosophy continues to be a nice contrast to Intel and AMD, since Glenn has always focused on minimizing die size and power consumption while making sure the chips meet mid-range performance targets on real-world applications.
This focus on Efficient Computing didn't have the same media appeal as other vendors' big, powerful chips with high peak performance. However, perhaps Intel's new market focus will help draw attention to the Centaur approach. While other vendors have failed to survive in the x86 market, VIA has carved out a niche for itself as the low-power, low-cost x86 vendor. It's shipped millions of CPUs, but that still only counts for about 1% of the total x86 market. VIA's primary leadership has been for fanless designs, since that usually requires maximum CPU power be approximately 7W TDP.
The recently-announced C7 CPU will run without a fan up to 1.5GHz. Going forward, Centaur has already disclosed some information about the next-generation architecture it calls "CN." This will be VIA's first out-of-order, 64-bit superscalar design. The investment in the brand-new architecture may eventually be needed to keep Centaur competitive as both Intel and AMD turn their attention to Efficient Computing.
The flipside of the coin is that the low-power approach has been Centaur's competitive advantage, particularly in developing markets. If Intel and AMD aggressively pursue this line of attack, then VIA's competitive edge in low power may evaporate, and any advantage may shift to its ability to build smaller, lower cost CPUs.
The move from 32 bits to 64 bits has been extensively covered in ExtremeTech, so there is only one bit of analysis we could add if we woke up 5 years after that first article. The one question we would ask is, "Why 64 bits everywhere?" The need for a 64-bit notebook computer is the main curiosity, since a 64-bit CPU will always burn more power. There are more register bits and state information, requiring wider register ports, buses, etc.
Given that there are still very few desktop applications that need 64-bit virtual addressing and 40-bit physical addressing, there certainly isn't much need for these big applications on a mobile platform. The answer, of course, is that it is just easier to have a single architecture. The software developers will eventually expect to have the extra address space, and we all know how software likes to gobble up main memory. Unless an application is a handheld device that needs extremely low power, we can expect 64 bits to gradually proliferate everywhere.
Revisiting Our Predictions
In the last article, we didn't make any formal predictions and weren't very specific about the timeframe for when we'd see various technologies. That ambiguity seemed to work well, so we can update the status of a few of the long-term ideas. We'll leave it for another article to bring more focus to predicting the future.
Massively Parallel Architectures for the Masses
We made a statement 5 years ago that, "we'll eventually consider it quaint that most computers used only a single processor, since we could be working on machines with hundreds of CPUs on a chip." Well, we can already buy dual-core chips, and in 2006 Intel will rapidly push dual and quad core to replace its own single-core devices. But hundreds of CPUs? Well, if you are counting the number of conventional x86 cores, that may be a while. Even the diminutive Centaur chip is 30 square mm in a 90nm process, though at least a third of any CPU is cache memory. However, you'd have to scale up the memory to avoid starving some of the CPUs.
If you expand your view outside the x86 space, there already are processors with 100s of CPUs. An example is PicoChip, a UK company building specialized chips for wireless basestations. Its architecture is a heterogeneous array with hundreds of 16-bit processors (and specialized hardware) that can be configured for different tasks within the applications. It's the opposite of SMP, which counts on the ability of threads to run on any CPU.
If a heterogeneous computing model were applied to a general-purpose computer, the operating system would have to be a lot smarter about deciding which CPU was best at running a thread. Another term used for this type of machine is "Adaptive Computing," since the hardware resources are adapted to match the application. The floating-point thread could be dispatched to one of the FP CPUs, media-processing might get sent to a DSP-oriented CPU, while pointer-chasing integer processing might get dispatched to fast, lean CPUs.
The adaptive approach could be very power-efficient, since CPUs would only be powered up when needed and would each be optimized for the workload. Unfortunately, the software barrier is probably too high, so it's unlikely that an architecture like this would ever make it into a general-purpose computer.
One big issue is that SMP architectures are difficult to scale beyond 8-way because of the amount of bus bandwidth required for coherency. Large multi-processor machines deal with scaling issues by creating specialized hardware with external copies of the cache tags and extremely fast interconnect. At some point, it is just too expensive to implement shared memory, so for a truly large number of CPUs, the architecture becomes a non-uniform memory architecture (NUMA) cluster. As long as an application can be broken up into data-independent tasks, it can be run today on racks of machines with tens of thousands of processors operating together. As the number of cores on each CPU grows, the compute density of the cluster will increase, as long as the power consumption is manageable.
Seeing the Future of Optical Computing
When we described a world where technology enthusiasts would "pore through the complicated descriptions of the physics of optical processing", we were thinking extremely far into the future. However, with all those electrons leaking away, perhaps optics technology will be accelerated. Intel made a breakthrough by creating a laser that uses the "Raman effect" to build a continuous wave laser in silicon, though the initial energy is provided by an external laser. This technology is probably destined for device-interconnect applications, but it's an important step towards a world of optical switches that can begin to replace electronic transistors.
Complexity Takes a Rest
The premise of our original article was that we continue to ratchet up the complexity of computers, but we quickly get comfortable with the new terminology and hunger for the next wave of innovation. Five years later it doesn't seem that computer architecture really advanced as fast as we expected. It felt like a huge leap to get to out-of-order machines and complicated branch prediction. Now we're talking about simplifying the microarchitecture, reducing the frequency, and hooking a bunch of identical cores together.
This feels a lot different than the heady days of five years ago. Perhaps it's because of the diminishing returns when optimizing a single CPU for performance. It may just be that Intel and AMD aren't quite as public with their microarchitectural details. Intel still hasn't even confirmed the number of pipeline stages in the Pentium M, much less published an architecture diagram. Hopefully, the company will be more forthcoming about the new Merom architecture. As technology enthusiasts, we'll be eagerly looking forward to enjoying all the details.