|
Linux 2.6 and Hyper-Threading - Kernel Compiles and MP3 Encoding

We'll get the boring kernel compile graph out of the way first, then break down the compile performance on a processor-by-processor basis.

Starting with the 3.2GHz Prescott and its improved Hyper-Threading (HT), we see that a simple 'make' finishes the compile of our 2.6.2 kernel in 327.06 seconds. When we enable HT and specify two processes with 'make -j 2', compile time drops by ~23% (~75 seconds). The Northwood-based 3.2GHz P4 performs better overall in our compile tests than the Prescott, but enabling HT and issuing 'make -j 2' only yields an ~18.7% decrease in compile time (~56 seconds).

The Xeon
results are a little more interesting. Look at what happens when we enable HT and run 'make -j 2': it actually takes longer to compile the code with HT enabled! Since issuing 'make -j 4' with HT enabled seemingly fixes the problem, I'll assume the earlier result is due to some scheduling confusion: I told make to run two simultaneous processes, but four logical processors were awaiting work. The Xeons are the big winners in this test. Going from HT disabled and a simple 'make' to HT enabled with 'make -j 4', we got a ~57% decrease in overall compile time. That's impressive.

You'll
notice that I didn't add 'make -j 4' results for the two uniprocessor test machines. The reason is simple enough: the numbers were literally identical to the 'make -j 2' results. The same holds true when going beyond 'make -j 4' on the dual Xeon machine. I tested all the way up to 16 simultaneous processes and didn't see any improvement past 'make -j 4'.

As we
move along to our next benchmark, BladeEnc MP3 encoding, I should provide some background. When I do audio encoding at home, I generally use LAME (as do most of you, probably), but since it isn't multi-threaded in the least, I went in search of a threaded encoder that might show us some interesting results in our analysis of HT. A gentleman in our IRC channel (#2cpu on irc.freenode.net) pointed me to BladeEnc.
The author decided to parallelize the application using the Message Passing Interface (MPI). The original intent was to split the process of MP3 encoding across several machines, but he does have this to say about its use on multiprocessor platforms:

"So does this scheme work with SMPs? Absolutely. One of the nice things about MPI is that it intentionally doesn't distinguish between whether ranks are on the same physical machine or not. When you use MPI to send a message, you just rely on the MPI implementation to do the fastest thing to send the message to the destination rank (regardless of whether the source and destination ranks are on the same machine or not)."

Since it sounded relatively
cool, I downloaded and configured LAM/MPI and compiled BladeEnc. To ensure BladeEnc would be compiled with the MPI compiler wrapper and not the default C compiler, I had to use a simple export command: "export CC=mpicc". Let's have a look at my results.

All-in-all
we see some moderate improvements on the Xeons with HT enabled and a slight
decrease in encode time on the uniprocessor machines. Nothing earth-shattering,
but we'll take the 12% decrease in encode time on the Xeons.
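
For anyone who wants to check the percentages quoted on this page, here is a minimal shell sketch of the arithmetic. The function name pct_decrease is purely illustrative, and the 252.06 second figure is an assumption derived from the "~75 seconds" saved on the Prescott, not a number read off the graph.

```shell
#!/bin/sh
# pct_decrease BASE NEW
# Prints the relative decrease from BASE to NEW as a percentage,
# rounded to one decimal place. (Illustrative helper, not part of any tool.)
pct_decrease() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# Prescott: 327.06 s with a plain 'make'. With HT and 'make -j 2' roughly
# 75 s were saved, so the HT time is ASSUMED to be about 252.06 s.
pct_decrease 327.06 252.06   # prints 22.9
```

The same helper works for any pair of wall-clock times, for example the real time reported by 'time make' and 'time make -j 2' on your own box.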
|