The Intel 1.4GHz and 1.7GHz Pentium 4 processors and the D850GB mainboard
Friday, April 27, 2001
The 400MHz system bus
To enhance the performance of its CPUs, Intel gave the Pentium 4 family a quad-pumped, 64-bit, 100MHz system bus (or FSB), for an effective 400MHz, fed by a dual-channel RAMBUS memory sub-system.
As a result, P4 systems can benefit from up to 3.2GB/sec of memory bandwidth.
In comparison, the Pentium III makes use of PC133 memory, and has a maximum theoretical memory bandwidth of 1.06GB/sec, while the AMD Athlon coupled with PC2100 memory can push a maximum of 2.1GB/sec.
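The bandwidth figures above follow from simple arithmetic: clock rate times transfers per clock times bus width. The short Python sketch below reproduces them. The function name is ours, and it assumes a 64-bit data path for both PC133 and PC2100, with PC2100 being double-data-rate memory on a 133MHz clock.

```python
def peak_bandwidth_gb_per_s(clock_mhz, transfers_per_clock, bus_width_bits):
    """Theoretical peak bandwidth in GB/sec (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

# Pentium 4 system bus: 100MHz, quad-pumped (4 transfers per clock), 64 bits wide
p4_fsb = peak_bandwidth_gb_per_s(100, 4, 64)

# PC133 SDRAM: 133MHz, one transfer per clock, 64 bits wide (assumed)
pc133 = peak_bandwidth_gb_per_s(133, 1, 64)

# PC2100 DDR: 133MHz, two transfers per clock, 64 bits wide (assumed)
pc2100 = peak_bandwidth_gb_per_s(133, 2, 64)

for name, value in [("P4 FSB", p4_fsb), ("PC133", pc133), ("PC2100", pc2100)]:
    print(f"{name}: {value:.2f} GB/sec")
```

These are theoretical peaks only; sustained bandwidth in real applications is lower.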
The NetBurst Micro-Architecture
The only measure of performance that really matters is the amount of time it takes to execute a given application. Contrary to a popular misconception, it is not clock frequency (MHz) alone or the number of instructions executed per clock (IPC) alone that equates to performance. True performance is a combination of both clock frequency (MHz) and IPC:
Performance = MHz x IPC
This shows that performance can be improved by increasing frequency, IPC or, optimally, both. It turns out that frequency is a function of both the manufacturing process and the micro-architecture. At a given clock frequency, the IPC is a function of processor micro-architecture and the specific application being executed. Although it is not always feasible to improve both the frequency and the IPC, increasing one and holding the other close to constant with the prior generation can still achieve a significantly higher level of performance.
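To make the trade-off concrete, here is a minimal Python sketch of the Performance = MHz x IPC relation. The clock speeds and IPC values are invented for illustration and are not measurements of any real processor.

```python
def relative_performance(mhz, ipc):
    """Performance = MHz x IPC, in arbitrary units."""
    return mhz * ipc

def break_even_ipc(old_mhz, old_ipc, new_mhz):
    """IPC the new design needs in order to match the old one."""
    return old_mhz * old_ipc / new_mhz

# Hypothetical designs: the second gives up 20% of its IPC
# in exchange for a 50% higher clock.
baseline = relative_performance(1000, 1.0)
high_clock = relative_performance(1500, 0.8)

speedup = high_clock / baseline
needed_ipc = break_even_ipc(1000, 1.0, 1500)

print(f"speedup: {speedup:.2f}x")
print(f"IPC needed at 1500MHz to break even: {needed_ipc:.3f}")
```

With these made-up numbers the higher-clocked design still comes out ahead, because its IPC stays above the break-even value of roughly two thirds of the baseline.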
In addition to the two methods of increasing performance described above, it is also possible to increase performance by reducing the number of instructions that it takes to execute the specific task being measured. Single Instruction Multiple Data (SIMD) is a technique used to accomplish this. Intel first implemented 64-bit integer SIMD instructions in 1996 on the Pentium® processor with MMX™ technology and subsequently introduced 128-bit SIMD single-precision floating point (SSE) on the Pentium III processor.
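As a rough illustration of how SIMD cuts the instruction count, the Python sketch below counts the add instructions needed to process an array, assuming every SIMD register is fully packed and ignoring loads, stores and loop overhead. The function names and the element count are ours.

```python
import math

def simd_instruction_count(num_elements, register_bits, element_bits):
    """Instructions needed when each one processes a full register of elements."""
    lanes = register_bits // element_bits
    return math.ceil(num_elements / lanes)

def execution_time_s(instructions, ipc, clock_hz):
    """Time = instructions / (IPC x frequency)."""
    return instructions / (ipc * clock_hz)

NUM_ELEMENTS = 1000  # hypothetical workload size

scalar_adds = NUM_ELEMENTS                                # one element per add
sse_adds = simd_instruction_count(NUM_ELEMENTS, 128, 32)  # 4 floats per 128-bit add
mmx_adds = simd_instruction_count(NUM_ELEMENTS, 64, 16)   # 4 16-bit ints per 64-bit add

# Same hypothetical IPC and clock for both runs; only the instruction count changes.
t_scalar = execution_time_s(scalar_adds, ipc=1.0, clock_hz=1.0e9)
t_sse = execution_time_s(sse_adds, ipc=1.0, clock_hz=1.0e9)

print(scalar_adds, sse_adds, mmx_adds)
```

At equal clock frequency and IPC, a quarter of the instructions means a quarter of the execution time in this idealized model. Real code rarely packs that neatly.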
Most modern x86 processors use a 10-stage pipeline. The Pentium 4, however, has seen this pipeline extended to 20 stages. One result of implementing a longer pipeline within a CPU has always been an overall degradation in the number of operations that can be completed per cycle. On the other hand, the increase in pipeline length also allows designers to raise the operating frequency of a given CPU dramatically. Knowing this, Intel's development team chose to greatly extend the P4's pipeline, calculating that the overall increase in frequency would outweigh the loss in efficiency. This is why the 1.7GHz CPU is less affected than the 1.4GHz CPU, the latter being beaten by the Pentium IIIs in terms of performance.
Misprediction within a 10-stage pipeline is inherently less costly than misprediction within a 20-stage pipeline, because of the time needed to clean out a longer pipeline and start the whole operation over when a branch misprediction occurs. When a CPU mispredicts which way a branch will go (yes or no? true or false?), it must immediately stop what it is doing, flush the pipeline of all the operations it was about to execute under the assumption that it had predicted correctly, go back, and start the whole operation again. The longer the pipeline, the longer it takes to get operations back on track, and the longer it takes the operations themselves to travel from start to finish, which results in an overall drop in efficiency. This is why a CPU with a 10-stage pipeline is more efficient than one with 20 stages: the penalty for branch misprediction inside a 10-stage pipe is lower, because fewer operations need to be flushed.
Getting back to the Pentium 4: as mentioned previously, Intel opted to implement a 20-stage pipeline in its newest darling. The expectation is that the increased number of operations that can be issued into the pipe, combined with better branch-prediction algorithms and a dramatically higher clock frequency, would outweigh any penalty suffered through branch misprediction. Under these conditions, the pay-off would be especially high for chips running at higher frequencies, such as 1.7GHz.
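The trade-off between pipeline depth and clock speed can be sketched with a toy model in Python. Everything in it is an assumption made for illustration: the flush penalty is taken to equal the pipeline depth in cycles, and the base CPI, branch fraction, misprediction rate and the 1.0GHz clock of the 10-stage chip are invented. It is not a model of the actual Pentium III or Pentium 4.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles

def instructions_per_second(clock_hz, cpi):
    """Throughput = frequency / CPI (equivalently, frequency x IPC)."""
    return clock_hz / cpi

# Hypothetical workload: 20% of instructions are branches, 5% of them mispredicted.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# Assumption: a flush costs as many cycles as the pipeline has stages.
cpi_10 = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
cpi_20 = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)

shallow = instructions_per_second(1.0e9, cpi_10)  # hypothetical 10-stage chip at 1.0GHz
deep_14 = instructions_per_second(1.4e9, cpi_20)  # 20-stage chip at 1.4GHz
deep_17 = instructions_per_second(1.7e9, cpi_20)  # 20-stage chip at 1.7GHz

# Clock advantage the deeper pipeline needs just to match the shallow one.
break_even_clock_ratio = cpi_20 / cpi_10

print(f"CPI: 10-stage {cpi_10:.2f}, 20-stage {cpi_20:.2f}")
print(f"break-even clock ratio: {break_even_clock_ratio:.3f}")
```

In this model the deeper pipeline always completes fewer instructions per cycle, and it only wins once its clock advantage exceeds the break-even ratio. Real workloads add cache, memory and execution-unit effects that the model ignores, which is why measured results can look very different from these figures.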
Next: The Advanced Dynamic Execution.