x86 SMP on the cheap

On SMP

Symmetric Multiprocessing (SMP) has been around in Intel chipsets for a while but until recently was not affordable for the average user. The use of more than one CPU (a term that somewhat lacks sense once you use more than one processing unit) has some advantages and side effects that need some explanation. In SMP, several PUs, usually a power-of-two number (2, 4, 8, 16, ...), are used in the same system environment, sharing main memory, chipset, I/O etc. in a managed peer setup. The simplest version is a dual-processor system, which I'll focus on for obvious reasons.

Tasks, threads and programs

As soon as more than one PU is in the game, we need to pay attention to the distinction between programs, tasks and threads. A thread is a sequence of instructions that represents a specific task within a program, e.g. a specific subroutine. It usually receives certain parameters, executes an algorithm on them and returns a result to the caller, which usually is the program's main loop. A program is made up of at least one thread, or of a sequence of threads that usually run one after another. If there is a chance to execute part or all of the program in parallel, the programmer can decide to split those routines into several threads that can then be executed on separate PUs - that is SMP-friendly coding. A drawback of this concept is that you may end up waiting for all of these threads to end before you can continue with the rest of the program, but that is usually worth the speed advantage on the fast lane. This usually only works for a part of the program: in a raytracer, for example, you could spawn a thread each for even and odd lines or for tiles of the image. You couldn't use multiple threads for the initial parsing of the scene description or for the final writing of the image, where all lines/tiles must be combined into one image, though.
In this example you'd run a single thread up to the point of rendering, then spawn multiple parallel rendering threads and at the end wait for them all to complete before returning to one single thread to combine and write the image.

Tasks are really nothing but threads, but they can be whole programs (if the program is a sequence of single threads) or just spawned program fragments (from a multithreaded program). While threads are managed by the program they belong to (spawning and termination, but not scheduling onto PUs), tasks are managed by the operating system.

GTL+ Bus protocol

Intel's GTL+ CPU bus protocol, the basis of SMP on the BX chipset, has been around since the times of the Pentium Pro 150 (November 1995) and is starting to show its age, in particular in the fact that the bus is a resource shared by all PUs. On the EV6 bus used with the Athlon and Alpha, by contrast, each PU has a point-to-point connection to the chipset at full FSB speed. A shared bus isn't a serious drawback for a single-CPU system, but with two or more units it means heavy competition for the bandwidth of the front side bus (FSB). This becomes even more critical the less L2 cache the PUs have to keep frequently needed data ready for processing, which would keep them off the bus. Knowing this, the speed of the FSB becomes a much more critical variable in the system performance equation. In an ideal scenario, each PU uses the bus only while the other doesn't require it. In practice there are colliding needs to access it that force one of the PUs to wait for busmaster control while the other participants (among them busmaster devices like UDMA or SCSI controllers) are using the chipset's functionality.

SMP Light

The Intel Celeron, a variation of the Pentium II core for the budget market, has found wide recognition as the king of the price/performance arena. This is especially true since many of these CPUs can be overclocked far beyond the specified speed bin Intel sells them in.
As a legacy of the Pentium II core, the Celerons also have the ability to work as SMP PUs if the necessary signal pin can be connected to the BX chipset to allow it to distinguish between CPU0 and CPU1. This is true for some manually modified Slot 1 versions of the chip and especially for the PPGA Socket 370 packaging. Until now that meant that these Celerons could be used in Slot 1 SMP boards with a Socket 370 to Slot 1 adapter ("Slotkey" or "Slocket"). With ABit's new offering, the BP6, a motherboard specifically geared toward the "undocumented feature" of Intel Celeron SMP has taken the buzz hostage. Its key feature is a pair of Socket 370 sockets, which allow plug-and-play SMP with PPGA Celerons. This empowers the average Joe - or at least the Joe who can build a computer without endangering his life - to set up an experimental SMP box on a slim budget.

Celerons vs Pentium II/III

Apart from a seriously lower price tag, the PPGA Celeron enjoys the same core used in the Pentium II and 128 KB of L2 cache on the chip at full core clock speed. The Pentium II uses 512 KB of L2 cache on the CPU module at half speed. This allows for fewer FSB accesses but has to be paid for with a slower L1-L2 coupling and higher latencies on L2 requests. The Pentium III additionally has ISSE SIMD instructions, which are of little relevance at this point in history. More importantly, the P3 is available at speeds above 450 MHz, unlike the P2.

The second important value, as outlined above, is the FSB speed. The Celerons, destined to be "low end", are saddled with a 66 MHz FSB specification. They usually have no problem with higher FSBs, but the fixed multiplier will then force the core clock up into regions where it may no longer cooperate. More on this in the overclocking section. The Pentium II and III operate at 100 MHz FSB, with 133 MHz coming soon. The BX chipset is good for 66 or 100 MHz and usually clocks happily between and beyond that.
So the main problem for the Celerons under SMP is their smaller but faster L2 caches, which force them to access the FSB more frequently. To add insult to injury, that 66 MHz FSB is also a third slower than the 100 MHz bus of the Pentium II. On the upside, the Celerons are able to run processes that fit their caches faster than the Pentium II/III.

Expectations

Expectations of such an SMP machine are rather high; the bragging word goes that a dual 366 MHz setup will beat a 650 MHz Pentium III or even an Athlon 600 in performance. Such expectations will be the first casualty of real-life contact with a BP6 machine. Ideally, and only for very few selected applications, an SMP setup will double the processing power of a uniprocessor machine of the same clock speed. Among these applications is RC5, distributed.net's encryption cracking client. What enables its outstanding performance is multithreading (spawning a task for each PU) combined with very low memory access demands. Other tasks are not as accommodating; in particular, it is rather difficult to find fully multithreaded applications that will naturally embrace SMP without any user intervention. This is where multitasking and the operating system enter the picture. Multiple tasks can be distributed across the available PUs if the operating system supports multiple PUs and implements load balancing that shifts work to the less occupied PU swiftly and on the fly.

Operating System

There currently are four widely available operating systems with SMP capabilities: Linux, Windows NT, BeOS and FreeBSD. Windows 9x, the most widely used OS, does not utilize additional CPUs. Windows 2000 will most likely accept Celeron SMP but is still in beta. For the purpose of this article I'll focus on Linux to stay true to the budget idea of a BP6 system, with a few scores under NT for comparison.
The default Linux kernel will only use one CPU, but, as is the advantage of that operating system, an SMP kernel can be compiled within a few minutes by just enabling the SMP option in the kernel configuration. It is pretty much a no-brainer, although you have to use the big kernel model (make bzImage instead of make zImage). Linux has sophisticated load-balanced scheduling for processes and will shift them around between the CPUs at run time to optimize system performance. This helps in particular with software that isn't multithreaded, which is most of it. Nonetheless, some applications and situations will still require special care to exploit SMP, for example the POVRay raytracer I used for some benchmarking.

Multithreaded applications

I will use two examples of multithreaded applications. The first is the RC5 client, which implements an optimal multithreading model that tries to use all available idle cycles on all available PUs. This program needs next to no memory accesses and scales linearly with the core speed of the PUs, less the switching overhead when multitasking with more threads than PUs.
As I said earlier, RC5 gains the full extra CPU power from an SMP setup, and the algorithm is optimized so that additional processes do not gain any performance. In fact there is a small and growing degradation of performance through multitasking switching, as you can see. Secondly I'll use kernel compiles (make clean depend; time make -j N bzImage) with various numbers of simultaneous threads. This makes heavy use of the memory and mass storage subsystems and forces the OS to swap processes around the CPUs as the thread count grows.
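The kernel compile measurement can be automated along the following lines. This is only a minimal sketch, not the script used for the scores in this article: the function names build_command, time_command and sweep are made up for illustration, and the only things taken from the text are the make clean depend and make -j N bzImage invocations.

```python
import subprocess
import time


def build_command(jobs, target="bzImage"):
    """Return the make invocation for a given number of parallel jobs."""
    return ["make", "-j", str(jobs), target]


def time_command(command, cwd=None):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        command,
        cwd=cwd,
        check=True,  # raise if the command fails instead of timing a broken run
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def sweep(job_counts, kernel_dir):
    """Time a from-scratch kernel build for each job count.

    Returns a dict mapping the job count to elapsed seconds.
    kernel_dir is the kernel source directory (an assumption of this sketch).
    """
    results = {}
    for jobs in job_counts:
        # Clean first so every round compiles the same amount of code.
        subprocess.run(
            ["make", "clean", "depend"],
            cwd=kernel_dir,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[jobs] = time_command(build_command(jobs), cwd=kernel_dir)
    return results
```

Calling sweep([1, 2, 4, 8, 16, 32], kernel_dir) with the path to a configured kernel tree would give one wall-clock time per job count. Wall-clock time is the figure that matters here, since the point is how quickly the build finishes rather than how many CPU seconds it consumes.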
We observe that SMP vastly speeds up compilation by utilizing multiple spawned tasks. The gains from running more tasks than CPUs are minimal, though, and level out at around 2^4 processes before performance degrades again as task management overhead kicks in. I attribute the minimal gain to the fact that the "randomness" of the size of the compiled files is better suited to utilizing the full scope of the FSB than the frequent large block requests of, e.g., SETI@Home.

Homogeneous Multitasking

In this category I use the SETI@Home Linux client, running one client on each PU to compare the execution time per work unit with the single-task result further down. Since SETI employs a very memory-bound algorithm, the FSB is under heavy stress and hence a lot of competition takes place between the PUs.
Additionally I use two instances of POVRay, each calculating one half (left or right) of the same image, with one process on each CPU. This requests a lot of information from the memory subsystem at all times, as the ray touches varying parts of the world model and needs information about the material properties.
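A left/right split like this can be driven by a small script. The following is a rough sketch under a few assumptions, not the script used for the benchmark: the executable is assumed to be called povray, the +SC/+EC (start/end column) and +I/+O/+W/+H switches are quoted from memory of POVRay's partial-rendering options and should be checked against the documentation of your version, and column_ranges, povray_command and run_in_parallel are names invented here.

```python
import subprocess


def column_ranges(width, parts):
    """Split `width` pixel columns into `parts` contiguous ranges.

    Ranges are 1-based and inclusive; any remainder goes to the first ranges.
    """
    if parts < 1 or width < parts:
        raise ValueError("need at least one column per part")
    base, extra = divmod(width, parts)
    ranges = []
    start = 1
    for index in range(parts):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def povray_command(scene, output, width, height, columns):
    """Build a partial-render command line (switch names are an assumption)."""
    first, last = columns
    return [
        "povray",
        f"+I{scene}",
        f"+O{output}",
        f"+W{width}",
        f"+H{height}",
        f"+SC{first}",  # first column to render
        f"+EC{last}",   # last column to render
    ]


def run_in_parallel(commands):
    """Start all commands at once, wait for all of them, return exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

For a 640 pixel wide image, column_ranges(640, 2) yields the columns 1 to 320 for one process and 321 to 640 for the other. Starting both processes at once leaves it to the OS scheduler to place one on each CPU. Combining the two partial images into one is not covered by this sketch.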
The results have to be put in perspective with the uniprocessor results; see the comments further down. For POVRay it is important to note that it doesn't multithread on its own. The image splitting must be done with the help of a short script (POVRay does support partial rendering, though), and the parts must be combined at the end of the run. This approach makes sense for very long rendering runs; for animations it is better to render complete even frames on one CPU and complete odd frames on the other.

Heterogeneous Multitasking

In this category I run one instance of SETI@Home and one instance of POVRay on the system in parallel for as long as it takes SETI@Home to finish. Both are single-threaded applications that are scheduled by the OS to run on different PUs, and both make a lot of memory accesses to fetch and deposit data. POVRay serves as a blind load that won't be measured, while we look closely at the SETI@Home result to compare it with the uniprocessor and homogeneous multitasking outcomes.
As you can see, the load on the GTL+ FSB is the crucial bottleneck in the SMP system: heavy memory accesses that won't fit the L2 cache will cut performance by almost 50%. On the other hand, a thoughtful combination of tasks can yield significantly improved results over uniprocessor systems of the same speed.

Single applications

Finally, as a comparison, I run one instance of SETI@Home and one instance of POVRay, each on its own, to see how SMP compares to the classical uniprocessor results on the same platform. Also added are the one-thread results from the kernel compile and RC5 tests.

Conclusions

We see that the benefits of SMP vary strongly with the applications run. Multithreaded applications benefit most and are still a rare find, but that only matters if you're singletasking. Under multitasking conditions the operating system is well capable of acceptable load-balanced scheduling. In this scenario it is more important to achieve a good mix of memory-heavy and memory-light applications to make optimal use of the additional CPU power. From our scores we can say that an SMP machine can make a good rendering machine, e.g. for POVRay, and even scientific heavyweights like FFT and cryptanalysis can show a good gain. Software development, especially with many source modules, will also see a significant productivity leap. This goes double for the tedious phase of debugging, where long compiles in between short test runs and code alterations can be quite aggravating. FFT, because of its heavy FSB demands, can benefit more from running on two equally fast individual machines, but the extra cost may be prohibitive. The solution to this may be efficient distributed computing, unlike SETI@Home's.

This article © 1999 by Armin Lenz and appeared first on the Full On 3D Network. Further redistribution forbidden without expressed written consent from the author.