x86 SMP on the cheap

On SMP

Symmetric Multiprocessing (SMP) has been around in Intel chipsets for a while but until recently was not affordable for the average user.

Using more than one CPU (the "central" in the name makes little sense once there are several processing units, so I'll call them PUs from here on) has some advantages and side effects that need explanation.

In SMP, two or more PUs (commonly 2, 4, 8, 16, ...) are used in the same system environment, sharing main memory, chipset, I/O etc. in a managed peer setup. The simplest version is a dual-processor system, which I'll focus on for obvious reasons.

Tasks, threads and programs

As soon as more than one PU is in the game, we need to pay attention to a distinction between programs, tasks and threads.

A thread is a sequence of instructions that represents a specific task within a program, e.g. a specific subroutine. It usually receives certain parameters, executes an algorithm on them and returns a result to the caller, which usually is the program's main loop.

A program is made up of at least one thread, or of a sequence of threads that usually run one after another. If there is a chance to execute part or all of the program in parallel, the programmer can decide to split those routines into several threads that can then be executed on separate PUs - that is SMP-friendly coding.

A drawback of this concept is that you may end up waiting for all of these threads to end before you can continue with the rest of the program, but that is usually worth the speed advantage on the fast lane.

This usually only works for a part of the program, e.g. in a raytracer you could spawn one thread each for even and odd lines, or for tiles of the image. You couldn't use multiple threads for the initial parsing of the scene description, though, or for the final writing of the image, where all lines/tiles must be combined into one image.

In this example you'd run sequentially up to the point of rendering, then spawn multiple parallel rendering threads, and at the end wait for all of them to complete before returning to a single thread that combines and writes the image.
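The fork-and-join pattern described above can be sketched roughly as follows. This is only an illustration, not code from any real raytracer: the function names, the tiny image size and the placeholder "shading" are all made up for the example. It is written in Python, where threads show the structure well, but the standard interpreter will not run CPU-bound threads on two PUs at once; a real renderer would use native threads or separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tiny image, rendered in horizontal bands of BAND_ROWS lines.
WIDTH, HEIGHT, BAND_ROWS = 8, 6, 3


def parse_scene():
    """Sequential stage: stand-in for parsing the scene description."""
    return {"brightness": 10}


def render_band(scene, first_row, last_row):
    """Parallel stage: 'render' rows first_row..last_row-1 (placeholder shading)."""
    return [[scene["brightness"] + x + y for x in range(WIDTH)]
            for y in range(first_row, last_row)]


def render_image(workers=2):
    scene = parse_scene()  # single thread up to the point of rendering
    bands = [(y, min(y + BAND_ROWS, HEIGHT)) for y in range(0, HEIGHT, BAND_ROWS)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Spawn one rendering job per band.
        futures = [pool.submit(render_band, scene, a, b) for a, b in bands]
        # Wait for all of them; results are collected in band order.
        parts = [f.result() for f in futures]
    # Back on a single thread: combine the bands into one image.
    return [row for part in parts for row in part]
```

The result does not depend on the number of workers, only the elapsed time does, which is exactly what makes this part of a program a good candidate for threading.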

Tasks now are nothing but threads really, but they can be whole programs (if the program is a sequence of single threads) or just spawned program fragments (from a multithreaded program).

While threads are managed by the program they belong to (spawning and termination but not scheduling onto PUs), tasks are managed by the operating system.

GTL+ Bus protocol

Intel's GTL+ CPU bus protocol, the basis of BX chipset SMP, has been around since the times of the Pentium Pro 150 (November 1995) and is starting to show its age, in particular in the fact that it is a resource shared by all PUs. On the EV6 bus used with the Athlon and Alpha, each PU has a point-to-point connection to the chipset at full FSB speed.

That isn't a serious drawback for a single CPU system, but with 2 or more units this means heavy competition for the bandwidth of the front side bus (FSB). This becomes even more critical the less L2 cache the PUs have to keep frequently needed data ready for processing, which would keep them off the bus.

Knowing this, the speed of the FSB becomes a much more critical variable in the system performance equation. In an ideal scenario, each PU would use the bus only while the other doesn't require it.

In practice there are colliding needs to access it that force one of the PUs to wait to gain busmaster control while the other participants (among them also busmaster devices like UDMA or SCSI controllers) are using the chipset's functionality.

SMP Light

The Intel Celeron, a variation of the Pentium II core for the budget market, has found wide recognition as the king of the price/performance arena. This is especially true since many of these CPUs can be overclocked far beyond the specified speed bin Intel sells them in.

As a legacy of the Pentium II core, the Celerons also have the ability to work as SMPUs if the necessary signal pin can be connected to the BX chipset to allow it to distinguish between CPU0 and CPU1.

This is true for some manually modified Slot 1 versions of the chip and especially for the PPGA Socket 370 packaging. Until now that meant with a Socket 370 to Slot 1 adapter ("Slotkey" or "Slocket") these Celerons could be used in Slot 1 SMP boards.

With ABit's new offering, the BP6, a motherboard specifically geared at the "undocumented feature" of Intel Celeron SMP has taken the buzz hostage. Its key feature is a pair of Socket 370 sockets, which allow plug-and-play SMP with PPGA Celerons.

This empowers the average Joe - or at least the Joe who can build a computer without endangering his life - to set up an experimental SMP box on a slim budget.

Celerons vs Pentium II/III

Apart from a seriously lower price tag, the PPGA Celeron enjoys the same core used in the Pentium II and 128 KB of L2 cache on the chip at full core clock speed.

The Pentium II uses 512 KB of L2 cache on the CPU module at half speed. This allows for fewer FSB accesses but has to be paid for with a slower L1-L2 coupling and higher latencies in the L2 requests.

Pentium III additionally has ISSE SIMD instructions, which are of little relevance at this point in history. More importantly the P3 is available at speeds above 450 MHz, unlike the P2.

The second important value, as outlined above, is the FSB speed. The Celerons, destined to be "low end", are saddled with a 66 MHz FSB specification. They usually have no problem with higher FSBs but the fixed multiplier will then force the core clock up into regions where it may no longer cooperate. More on this in the overclocking section.

The Pentium II and III operate at 100 MHz FSB, with 133 MHz coming soon. The BX chipset is good for 66 or 100 MHz and usually clocks happily between and beyond that.

So the main problem for the Celerons under SMP is their smaller (if faster) L2 cache, which forces them to go over the FSB more frequently.

To add insult to injury, that FSB at 66 MHz is also a third slower than the 100 MHz FSB of the Pentium II. On the upside, the Celerons are able to run processes that fit their caches faster than the Pentium II/III.

Expectations

Expectations towards such an SMP machine are rather high; the bragging word goes that a dual 366 MHz setup will trash a 650 MHz Pentium III or even an Athlon 600 in performance.

Such expectations will be the first casualty of real-life contact with a BP6 machine. Ideally, and for very few selected applications, an SMP setup will double the processing power of a uniprocessor machine of the same clock speed.

Among these applications is RC5, distributed.net's encryption cracking client. What enables this outstanding performance is multithreading (spawning of tasks for each PU) and very low memory access demands.

Other tasks are not as accommodating. In particular, it is rather difficult to find fully multithreaded applications that will naturally embrace SMP without any user intervention. Here is where multitasking and the operating system enter the picture.

Multiple tasks can be distributed across the available PUs if the operating system supports multiple PUs and implements a load-balancing that will shift work to the less occupied PU swiftly and on the fly.

Operating System

There currently are four widely available operating systems with SMP capabilities: Linux, Windows NT, BeOS and FreeBSD.

Windows 9x, the most widely used OS, does not utilize additional CPUs. Windows 2000 will most likely accept Celeron SMP but is still in beta.

For the purpose of this article I'll focus on Linux to stay true to the budget thought of a BP6 system, with a few scores under NT for comparison.

Linux' default kernel will only use one CPU, but this is where the advantage of that operating system shows: an SMP kernel can be compiled within a few minutes by just enabling the SMP option in the kernel configuration.

It is pretty much a no-brainer, although you have to use the big kernel model (make bzImage instead of make zImage).

Linux has sophisticated load-balanced scheduling for processes and will shift them around between the CPUs in real time to optimize system performance. This helps in particular with software that isn't multithreaded, which is most of it.

Nonetheless some applications and situations will still require special care to exploit SMP, for example the POVRay raytracer I used for some benchmarking.

The ABit BP6 motherboard

This board quickly gathered a following among hardware editors who got in touch with it, and for good reason. It is rather cheap for the performance levels it offers, works well as a single CPU board and already packs UDMA/66, which should extend its useful life beyond that of other BX chipset motherboards.

I will not dive too deeply into the specifications, just the hit singles:

[Image: BP6.jpg, the ABit BP6 motherboard]

  • dual Socket 370 ATX format motherboard with individual core voltage regulation for each socket, shared FSB clock, temperature control for each socket

  • BX chipset, passive cooling with FSB settings of 66/72/75, 78-82 in 2 MHz steps, 83-100 MHz in 1 MHz steps, 104, 106, 108, 110, 124 and 133 MHz. Supports CPUs with multipliers from 2.0 to 8.0

  • core voltage settings of 2.0, 2.05, 2.10, 2.20 and 2.30 V for Celeron

  • 3 DIMM slots, 5 PCI, 2 ISA, 1 AGP (2X)

  • on-board additional Highpoint HPT366 UDMA/66 controller for two IDE buses (2 units each)

  • Softmenu BIOS for all CPU settings, no jumpers

As you can see from the image, the Socket 370s are mounted "sideways" and sit in close vicinity to several tall capacitors and the memory slots. This is an important fact for the choice of heatsinks for the CPUs.

The board is easy to handle apart from the space restrictions around the CPUs. I recommend you use a case that allows you to slide out the motherboard for work, and cables of sufficient length.

On the stability side I have no complaints about the BP6. It has been working around the clock for weeks now under full load, and the only occasions on which it crashed were traceable to CPU1's thermal problems under overclocked conditions.

The board has my full recommendation for experimental SMP systems such as the one presented below and throughout this article. I would be more hesitant to name it a solution for mission-critical systems.

Test System

The system assembled for this SMP experiment is as follows:

  • ABit BP6 motherboard

  • ATX fullsize tower case with additional case fan

  • 128 MB PC100 SDRAM (1 module) at CAS2

  • 2 Intel Celeron 366 (Week 28) overclocked to maximum stable clockspeed (456 MHz and 550 MHz individually, so system speed is forced to 456 MHz) cooled with TennMax MEGA7 Socket 370 coolers

  • Gigabyte GA-660 TNT2 32MB at 125/150 MHz

  • Seagate 36530A 6.4GB UDMA/33 harddisk

  • 24X CD-ROM Samsung SCR-2431 ATAPI

  • NE2000 compatible ISA network card

  • SuSE 6.1 Linux on Kernel 2.2.5 SMP with XFree 3.3.4 and KDE 1.1

These components are pretty generic and affordable; the total system investment was around $850 at the time of purchase. You may want to buy a Linux distribution with a manual if this is your first installation, so add around $50 for that.

Overclocking

Part of the fun with the BP6 is the overclocking support the board offers. The Celeron has often been proven to be highly overclockable, especially those with a low multiplier (which is locked and hence not manipulable).

For this reason I chose the 366 Celerons, which are the lowest multiplier (and speed grade) CPUs still on the market and so have the fairest chance of successful overclocking.

In particular we are interested in raising the FSB frequency as high as possible, since this is an essential variable for the SMP machine, as discussed earlier and shown in the benchmarks later. Because of the locked multiplier this will drive up the core speed as well.

For overclocking it is key to keep the system cool and well ventilated. This prompted my choice of the MEGA7 coolers, which tested very favourably in my recent review. They have my full approval for the BP6 motherboard and fit without requiring any use of power tools, to stay with our plug-and-pray philosophy for this project.

Naturally these coolers have to be complemented with case fans to cope with the massive heat production of not one but two overclocked CPUs. You'd easily lose 30 MHz in system speed and maybe harddisk integrity without!

After some extensive stability and speed testing, the highest common frequency at which both PUs would still work 100% stable turned out to be 456 MHz. Individually, one of the chips would reach 550 MHz, which of course is the speed I want for the machine in the end. A different second chip may get there eventually.

Those 456 MHz mean an 83 MHz FSB. That is significantly better than the 66 MHz default but doesn't reach the 100 MHz of a Pentium II SMP box, so all main memory transfers will run 17% slower than on a comparable 450 MHz P2 SMP machine!

But then we didn't pay through the nose to buy it, either.
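The clock arithmetic behind these numbers is easy to check. The sketch below assumes the 5.5x multiplier of the Celeron 366, which is not stated above but follows from its rated speed at the stock 66 MHz FSB; the function names are just for illustration.

```python
# Assumption: locked multiplier of the Celeron 366 (366 MHz at 66 MHz FSB).
MULTIPLIER = 5.5


def core_clock(fsb_mhz, multiplier=MULTIPLIER):
    """Core speed in MHz that results from a given FSB setting."""
    return fsb_mhz * multiplier


def fsb_for_core(core_mhz, multiplier=MULTIPLIER):
    """FSB in MHz needed to reach a given core speed."""
    return core_mhz / multiplier


def percent_slower(fsb_mhz, reference_mhz=100.0):
    """How much slower a bus is than the reference bus, in percent."""
    return (1 - fsb_mhz / reference_mhz) * 100


if __name__ == "__main__":
    print(core_clock(83))                # 83 MHz FSB: the common stable setting
    print(core_clock(100))               # 100 MHz FSB: what the better chip manages
    print(round(percent_slower(83), 1))  # deficit against a 100 MHz Pentium II bus
```

With a locked multiplier the FSB is the only knob, so the 550 MHz chip is in effect a chip that is stable on a 100 MHz bus, and the 456 MHz chip is the one holding the shared FSB clock down.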

Performance considerations

With all the pieces in place, we can give thought to the applications that make sense for testing.

The nature of SMP offers the following categories:

  1. multithreaded applications

  2. multitasking of same applications

  3. multitasking of different applications

  4. singletasking application

At first glance the last category would obviously be the worst case, as it only runs on a single CPU.

Wrong.

With SMP there is always the bus sharing issue to consider, so an individual application can actually perform worst in categories 2 or 3.

The power of SMP, on the other hand, is overall system performance, which even at its worst is still better than that of a single CPU system at the same clock. If nothing else, the OS tasks will run on the second CPU, freeing the first one completely for your applications.

You may not finish your application faster on a SMP machine but you'll get more done - that's about what it boils down to.

Multithreaded applications

As examples of multithreaded applications I will use two things. The first is the RC5 client, which implements an optimal multithreading model that will try to use all available idle cycles on all available PUs.

This program needs next to no memory accesses and scales linearly with the core speed of the PUs, less the switching overhead when multitasking with more threads than PUs.

Parallel threads   Time for 64 blocks of 2^28 keys   Switching overhead
 1                 3:43:26.25                        none
 2                 1:51:54.89                        none
 4                 1:52:15.05                        0.3 %
 8                 1:53:46.21                        1.66 %
16                 1:56:08.63                        3.78 %

As I said earlier, RC5 gains the full extra CPU power from an SMP setup, and the algorithm is optimized so that additional processes beyond one per PU do not gain any performance. In fact there is a small and growing degradation of performance through multitasking switching, as you can see.

Secondly I'll use kernel compiles (make clean depend; time make -j N bzImage) with various numbers of simultaneous threads. This makes heavy use of the memory and mass storage subsystems and forces the OS to swap processes around the CPUs as the thread count grows.
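For those who want to retrace the overhead column, here is a small sketch that recomputes it from the measured times. The times are the ones from the table; taking the 2-thread run (one thread per PU) as the baseline is my reading of how the column was derived, and the helper names are made up.

```python
def to_seconds(hms):
    """Convert 'h:mm:ss.ss' into seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Measured times for 64 blocks of 2^28 keys, keyed by number of threads.
RUNS = {
    1: "3:43:26.25",
    2: "1:51:54.89",
    4: "1:52:15.05",
    8: "1:53:46.21",
    16: "1:56:08.63",
}


def switching_overhead(threads, baseline_threads=2):
    """Extra run time in percent relative to one thread per PU."""
    base = to_seconds(RUNS[baseline_threads])
    return (to_seconds(RUNS[threads]) / base - 1) * 100


def smp_speedup():
    """How many times faster two threads are than one."""
    return to_seconds(RUNS[1]) / to_seconds(RUNS[2])


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, round(switching_overhead(n), 2))
    print(round(smp_speedup(), 3))
```

The same baseline also shows how close the dual setup comes to the ideal factor of two for this client.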

Number of threads   Compile time (min:sec)
  1                 4:35.104
  2                 2:44.687
  3                 2:42.322
  4                 2:39.766
  8                 2:37.751
 16                 2:35.932
 32                 2:36.855
 64                 2:37.140
128                 2:37.503

We observe that SMP vastly speeds up compilation by utilizing multiple spawned tasks. The gain from running more tasks than CPUs is minimal though, and levels out at around 2^4 processes before it degrades again as task management overhead kicks in.

I attribute that minimal gain to the fact that the "random" sizes of the compiled files are better suited to utilizing the full scope of the FSB than the frequent large block requests of e.g. SETI@Home.

Homogeneous Multitasking

Under this category I use the SETI@Home Linux client, running one client on each PU, to compare the execution time per work unit with the single task result further down.

Since SETI employs a very memory-bound algorithm, the FSB is under heavy stress and hence a lot of competition takes place between the PUs.

2x SETI@Home: 16:12:20 hours

Additionally I use POVRay in two instances, each calculating one half (left and right) of the same image, with one process on each CPU.

This requests a lot of information from the memory subsystem at all times, as the ray touches varying parts of the world model and needs information about the material properties.

2x POVRay 3.0: 23:14 min

The results have to be put in perspective with the uniprocessor results; see further down for the comments.

For POVRay it is important to remark that it doesn't multithread on its own. The image splitting must be done with the help of a short script (POVRay does support partial rendering, though) and the parts combined at the end of the run.

This approach makes sense for very long rendering runs. For animations it is better to render complete even frames on one CPU and complete odd frames on the other.
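A minimal sketch of such a splitting script could look like the one below. It is not the script I used for the benchmark. It only builds the command lines for two partial renders using POVRay's start/end column switches (+SC, +EC) and deals out animation frames; the executable name, the file names and the 1-based column numbering are assumptions you should check against the documentation of your POVRay version.

```python
def split_columns(width, parts=2):
    """Return 1-based, inclusive (first, last) column ranges covering the image."""
    step = width // parts
    ranges = []
    for i in range(parts):
        first = i * step + 1
        last = width if i == parts - 1 else (i + 1) * step
        ranges.append((first, last))
    return ranges


def povray_commands(scene, width, height, parts=2, binary="povray"):
    """Build one command line per image strip (nothing is executed here)."""
    commands = []
    for i, (first, last) in enumerate(split_columns(width, parts)):
        commands.append([
            binary, f"+I{scene}", f"+W{width}", f"+H{height}",
            f"+SC{first}", f"+EC{last}",  # start and end column of this strip
            f"+Ostrip{i}.tga",            # hypothetical output file name
        ])
    return commands


def split_frames(first, last, cpus=2):
    """For animations: deal whole frames out round-robin, one list per CPU."""
    return [list(range(first + k, last + 1, cpus)) for k in range(cpus)]
```

Each command list can be started as its own process, for instance with Python's subprocess.Popen, and waited on before the strips are combined. The OS scheduler then takes care of placing the two processes on separate PUs.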

Heterogeneous Multitasking

In this category I run one instance of SETI@Home and one instance of POVRay on the system in parallel for as long as it takes SETI@Home to finish.

Both are single threaded applications that are scheduled by the OS to run on different PUs while both use a lot of memory accesses to fetch and deposit data.

POVRay serves as a blind load that won't be measured, while we look closely at the SETI@Home result to compare it with the uniprocessor and homogeneous multitasking outcomes.

Application                 Time elapsed   Loss/gain
SETI@Home uniprocessor      11:01:16       0 %
SETI@Home and POVRay        12:12:40       -10.8 %
2 instances of SETI@Home    16:12:20       +36.0 %

As you can see, the load on the GTL+ FSB is the crucial bottleneck in the SMP system: heavy memory accesses that won't fit the L2 cache will trash the performance of each individual task by almost 50%.

On the other hand a thoughtful combination of tasks can yield significantly improved results over uniprocessor systems of the same speed.
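The loss/gain figures mix two views, and a short sketch makes the difference explicit. The times are the measured ones; the split into "run time change per task" and "throughput gain per machine" is my interpretation of the column, and the function names are invented for the example.

```python
def to_seconds(hms):
    """Convert 'h:mm:ss' into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


UNIPROCESSOR = to_seconds("11:01:16")  # one work unit, SETI@Home alone
WITH_POVRAY = to_seconds("12:12:40")   # one work unit, POVRay on the other PU
TWO_CLIENTS = to_seconds("16:12:20")   # two work units, one client per PU


def runtime_change(elapsed, reference=UNIPROCESSOR):
    """Percent by which a single task takes longer than when run alone."""
    return (elapsed / reference - 1) * 100


def throughput_gain(elapsed, units, reference=UNIPROCESSOR):
    """Percent more work units per hour than a uniprocessor delivers."""
    return (units * reference / elapsed - 1) * 100


if __name__ == "__main__":
    print(round(runtime_change(WITH_POVRAY), 1))      # SETI slowed by the blind load
    print(round(runtime_change(TWO_CLIENTS), 1))      # each client slowed by the other
    print(round(throughput_gain(TWO_CLIENTS, 2), 1))  # but two units get finished
```

So each of the two SETI@Home clients takes nearly half as long again as a lone client, while the machine as a whole still finishes about a third more work units per hour.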

Single applications

Finally, as a comparison, I run one instance of SETI@Home and one instance of POVRay, each on its own, to see how SMP compares to the classical uniprocessor results on the same platform. Also added are the one thread results from the kernel compile and RC5 tests.

Application      Uniprocessor   SMP                       Speedup
RC5-64           3:43:26        1:51:54                   99.6 %
POVRay 3.0       43:09          23:14                     85.7 %
kernel compile   4:35.104       2:44.687                  67.0 %
SETI@Home        11:01:16       16:12:20 (2 work units)   36.0 %
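The speedup column can be recomputed from the two time columns with a few lines. In this sketch the RC5 times are taken at full precision from the thread table further up, and the SETI@Home SMP run is counted as two work units, since two clients ran side by side; the names are made up for the example.

```python
def to_seconds(stamp):
    """Convert 'h:mm:ss', 'mm:ss' or 'm:ss.sss' into seconds."""
    seconds = 0.0
    for field in stamp.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds


# (application, uniprocessor time, SMP time, work units finished in the SMP run)
ROWS = [
    ("RC5-64", "3:43:26.25", "1:51:54.89", 1),
    ("POVRay 3.0", "43:09", "23:14", 1),
    ("kernel compile", "4:35.104", "2:44.687", 1),
    ("SETI@Home", "11:01:16", "16:12:20", 2),
]


def speedup_percent(uni, smp, units=1):
    """Extra work done per unit of time on the SMP box, in percent."""
    return (units * to_seconds(uni) / to_seconds(smp) - 1) * 100


def speedup_table():
    return {name: round(speedup_percent(uni, smp, units), 1)
            for name, uni, smp, units in ROWS}


if __name__ == "__main__":
    for name, value in speedup_table().items():
        print(f"{name:15s} {value:5.1f} %")
```

A value of 100 % would mean the second PU is used perfectly; everything below that is lost to bus contention, sequential program parts or scheduling.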

Conclusions

We see that the benefits of SMP vary strongly with the applications run. Multithreaded applications gain the most, but they are still a rare find, and they only deliver their full gain if you're singletasking.

Under multitasking conditions the operating system is well capable of acceptable load-balanced scheduling. In this scenario it is more important to achieve a good mix of memory-heavy and memory-light applications to make optimal use of the additional CPU power.

From our scores we can say that an SMP machine can make a good rendering machine for e.g. POVRay, and even scientific heavyweights like FFT and cryptanalysis can show a good gain.

Software development, especially with many source modules, will also see a significant productivity leap. This goes double for the tedious phase of debugging, where long compiles in between short test runs and code alterations can be quite aggravating.

FFT, because of its heavy FSB demands, can benefit more from running on two equally fast individual machines, but the extra cost may be prohibitive. The solution to this may be efficient distributed computing, unlike SETI@Home's.

This article © 1999 by Armin Lenz and appeared first on the Full On 3D Network. Further redistribution forbidden without expressed written consent from the author.

 
All content and design of this site is © 2CPU.com 1999, 2000.