BRAIN-1: Boxes, Routers And Interconnecting Networks

Welcome to our Beowulf page! A Beowulf is a cluster of very cheap Linux machines, hooked up through a local network and using specialized software to work like a very expensive parallel computer. In October 2000, we installed a small Beowulf cluster (16 compute nodes) in iLab to serve our simulation needs.

Below we share our experience with designing, setting up and using the cluster.

Design

We used the latest hardware available at the time, except that we could not afford the fastest CPUs. Since that information will be obsolete within three months, we will not dwell on the specific parts, but you can have a look at our facilities page for some details.

A few general comments, though: In our experience, cooling is the key to stability, so we fitted each box with an extra 8cm fan and an extra 12cm fan in addition to the CPU fan and power supply fan. The 12cm fan was custom-mounted in the front, in the three unused 5-1/4" bays, and pushes air inside the box. The 8cm fan was mounted on the other side of the front panel and also pushes air inside. We have opened 3 slots in the back to allow for air exhaust, since we only have the power supply's fan pulling air out. These openings are diagonally opposite the 12cm fan, so that the air flow has to go through memory, CPU, chipset and ethernet boards before it can go out. With that configuration, ambient temperature inside each box is 25C and CPU temperature is 29C.

Be very careful when you choose your parts: each hardware operation on a Beowulf will require several hours of hard work (and definitely include an electric screwdriver in your parts list!). It may be a good idea to order one box first to see whether everything is okay. If you plan to have add-on boards like we did (three ethernets per box), make sure that you read the motherboard's documentation before giving assembly instructions to your manufacturer: some of the PCI slots on your motherboard are likely to share interrupts, resulting in slightly degraded performance.

3x100Mbps Channel-Bonded Network

In order to process video sequences in real time with our system, we need a very fast network. So we decided to try an extremely cheap solution: triple-ethernet channel bonding. Channel bonding is a Linux feature which allows you to use multiple network cards as if they were one: traffic is simply load-balanced over the multiple devices. You do need one separate switch per branch in your network, because it is critical for bonding to work that all the boards in one machine have the same MAC (ethernet) address. That would confuse a single switch, since it would not know to which port it should send a packet destined for a given MAC address.
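As a rough illustration of what "load-balanced over the multiple devices" can mean, here is a minimal Python sketch of a round-robin policy, where each outgoing packet goes to the next device in turn. It is a toy model and not the bonding driver's code; the device names and the packet count are assumptions made for the example.

```python
from collections import Counter
from itertools import cycle


def round_robin_bond(packets, slaves):
    """Pair each outgoing packet with the next slave device in turn."""
    next_slave = cycle(slaves)
    return [(packet, next(next_slave)) for packet in packets]


# Hypothetical device names for the three boards in one node.
slaves = ["eth0", "eth1", "eth2"]

# Nine dummy packets, numbered 0..8.
assignment = round_robin_bond(range(9), slaves)

# Count how many packets each device carried.
per_device = Counter(device for _, device in assignment)
for device in slaves:
    print(device, per_device[device])
```

With three devices, each board ends up carrying one third of the packets, which is why the three boards together can approach three times the throughput of one.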

We installed three RealTek RTL8139 boards ($7.60 each) in each node, and connected them through three very cheap switches ($260 each). As mentioned before, be careful about the PCI slots in which you place your boards if you want each to have its own interrupt; placing them in PCI slots 1, 2 and 3 worked for us (rather than nicely interleaving them into slots 1, 3 and 5 as we had originally planned to maximize cooling).

It seems that changing the MAC address when you bring up the bonded channel does not work with the RTL8139 chips. When you bring the bond up, it tries to copy the MAC address of the first enslaved device to the other devices. This looked like it had worked as per ifconfig, but rtl8139-diag from Donald Becker showed that the eeprom had not been modified to the new MAC address. So we decided to just create our own MAC addresses (since we are on a local net) and to permanently burn them into the eeproms. For that, we had to add a few lines of code to rtl8139-diag, and the modified program with eeprom-write capability is here.
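When making up MAC addresses for a private network, one safe choice is the locally administered unicast range: in the first octet, the lowest bit (multicast) is 0 and the next bit (locally administered) is 1. The Python sketch below generates one such address per node. The `02:00:00:00:00:NN` numbering scheme is purely an assumption for illustration, not the addresses actually burned into our eeproms.

```python
def node_mac(node_number, prefix=(0x02, 0x00, 0x00, 0x00, 0x00)):
    """Return a locally administered unicast MAC address for one node.

    The prefix and the use of the node number as the last octet are
    hypothetical choices made for this example.
    """
    if not 0 <= node_number <= 0xFF:
        raise ValueError("node number must fit in one octet")
    first_octet = prefix[0]
    # Bit 0 set would mean multicast; bit 1 set means locally administered.
    if first_octet & 0x01 or not first_octet & 0x02:
        raise ValueError("prefix is not a locally administered unicast prefix")
    octets = list(prefix) + [node_number]
    return ":".join(f"{octet:02x}" for octet in octets)


# One address per node; all three boards in a node share that address.
for n in range(1, 17):
    print(f"n{n:02d}", node_mac(n))
```

Since all boards in one machine must share the same address for bonding, one address per node is enough.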

This done, everything worked great! The bonding kernel module was already included in our Mandrake 7.2beta3 GNU/Linux distribution, so it was just a matter of configuring the various ifcfg-eth* and we were set.
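For readers who have never seen such files, the Python sketch below prints what a set of bonding-related ifcfg files might look like for one node. The key names (`MASTER`, `SLAVE`, and so on) follow the Red Hat-style initscripts convention described in the kernel's bonding documentation; the IP address, the netmask and the exact set of keys are assumptions, and these are not copies of our actual files.

```python
def render_ifcfg(settings):
    """Render a dict of settings as KEY=value lines."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def bonded_node_config(ip_address, netmask="255.255.255.0", n_slaves=3):
    """Return {file name: file content} for one bonded node (illustrative)."""
    files = {
        # The bond device carries the IP address.
        "ifcfg-bond0": render_ifcfg({
            "DEVICE": "bond0",
            "IPADDR": ip_address,
            "NETMASK": netmask,
            "ONBOOT": "yes",
            "BOOTPROTO": "none",
        })
    }
    # Each physical board is enslaved to bond0 and has no address of its own.
    for i in range(n_slaves):
        files[f"ifcfg-eth{i}"] = render_ifcfg({
            "DEVICE": f"eth{i}",
            "MASTER": "bond0",
            "SLAVE": "yes",
            "ONBOOT": "yes",
            "BOOTPROTO": "none",
        })
    return files


# Hypothetical private address for node n01.
for name, content in bonded_node_config("192.168.0.1").items():
    print(f"# {name}")
    print(content)
```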

How about actual performance? Here are some test results using NetPerf, under normal use conditions (multi-user runlevel, many processes running, boxes hooked up through the switches):

TCP transfer rates between two nodes:

[root@n01 /root]# netperf -l 60 -H n11
TCP STREAM TEST to n11
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
 
 65535  65535  65535    60.00     244.06

UDP transfer rates between two nodes:

[root@n01 /root]# netperf -H n11 -t UDP_STREAM -- -m 10240
UDP UNIDIRECTIONAL SEND TEST to n11
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec
 
 65535   10240   9.99        32503      0     266.49
 65535           9.99        32319            264.98

A single board can get up to 94 Mbps throughput, so throughput does not scale perfectly when using 3 boards, but for about $60 per node (including boards, switches and enhanced CAT5 cables), we are fully satisfied with the performance!
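To put a number on "does not scale perfectly", the short Python sketch below divides each NetPerf throughput reported above by the ideal of three boards at 94 Mbps each. The function name and the labels are ours; the figures are taken from the test results above.

```python
SINGLE_BOARD_MBPS = 94.0   # best throughput of one board, from the text
N_BOARDS = 3

# Throughputs in 10^6 bits/sec, copied from the NetPerf output above.
measured = {
    "TCP": 244.06,
    "UDP (first line)": 266.49,
    "UDP (second line)": 264.98,
}


def scaling_efficiency(throughput_mbps, single=SINGLE_BOARD_MBPS, boards=N_BOARDS):
    """Fraction of the ideal aggregate throughput that was achieved."""
    return throughput_mbps / (single * boards)


for label, mbps in measured.items():
    print(f"{label}: {mbps:.2f} Mbps, {scaling_efficiency(mbps):.1%} of ideal")
```

The ideal aggregate is 3 x 94 = 282 Mbps, so the TCP result is roughly 87% of ideal and the UDP results are roughly 94%.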

BRAIN-1 in action!

Here are a few examples of the cluster in action. In these examples, we process 9600 images with our model of bottom-up visual attention. We wrote a perl script to spread the work over the 16 compute nodes (n01 through n16) while having some progress displays on the console (iLab1).
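The general idea of spreading 9600 images over 16 nodes can be sketched in a few lines of Python. This is not our perl script: it only computes a static, even split, and the image file names are hypothetical. A real dispatcher would also launch the jobs and report progress, and might hand out work dynamically, since not all images take the same time to process.

```python
def node_names(count=16):
    """Return the compute node names n01 .. n16."""
    return [f"n{i:02d}" for i in range(1, count + 1)]


def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical file names standing in for the 9600 images.
images = [f"frame{i:06d}.ppm" for i in range(9600)]

# Map each node to its share of the work.
plan = dict(zip(node_names(), split_evenly(images, 16)))

for node, chunk in plan.items():
    print(f"{node}: {len(chunk)} images, {chunk[0]} .. {chunk[-1]}")
```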

The baptism: Just born and up for 1 day, and already all nodes are loaded!

Still a virgin: With over 1/2 TByte of combined disk space -- mostly empty at its birth -- this baby is ready to crunch large datasets!

A first example. Here the images to be processed, the intermediate files, and the results are all read from and written to the console (iLab1), which has 50 GByte of RAID0 disk. iLab1 is not used to compute. Clearly, we have a bottleneck at iLab1's network.

In this second example, we now use the local disk on each machine to read/write all the temporary files. That already reduces network activity a lot.

To distribute the disk accesses more evenly over the network, we installed PVFS (the Parallel Virtual File System, which allows you to transparently stripe data over multiple nodes in a Beowulf). So we now have 1/2 TByte of live disk space accessible as /home/pvfs on all machines. Copying a few GBytes from iLab1 to /home/pvfs looks like it's working great!
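Striping means that consecutive fixed-size pieces of a file are placed on different nodes in turn, so that a large read or write is shared by many disks and network links. The Python sketch below shows the arithmetic for a simple round-robin layout. It is an illustration of the idea, not PVFS code; the 64 KByte stripe size, the 16 I/O nodes and the function name are all assumptions.

```python
def stripe_location(offset, stripe_size=65536, n_io_nodes=16, first_node=0):
    """Map a byte offset in a striped file to (node index, offset on that node).

    Assumes a plain round-robin layout; all parameters are illustrative.
    """
    stripe_index = offset // stripe_size            # which stripe holds this byte
    node = (first_node + stripe_index) % n_io_nodes  # stripes rotate over the nodes
    # Each node stores its stripes back to back in a local file.
    local_offset = (stripe_index // n_io_nodes) * stripe_size + offset % stripe_size
    return node, local_offset


# Where do the first few stripes of a file land?
for stripe in range(4):
    node, local = stripe_location(stripe * 65536)
    print(f"stripe {stripe} -> node {node}, local offset {local}")
```

With this layout, reading 1 MByte sequentially touches all 16 nodes once, instead of hitting a single disk 16 times.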

Let's transfer our data to the PVFS distributed disk, still use local disks for temporary files, and go! At the beginning, it's still very heavy on the network because the images are very small.

A few hours later, however, our visual attention model starts processing larger and larger files. The network load starts clearing up but we still have a lot of activity towards iLab1 (we'll have to cut down on these extremely verbose progress reports in our attention program), and towards n01, which is the control node for PVFS in addition to its computation role.

A few hours later, we are now clearly limited by CPU power rather than network as the images being processed become larger and larger. This has now been going on for a couple of days and the cluster seems rock-solid!

This monitor is too small to have a good look at how the various simulations are progressing ;-)

Parallel processing of visual saliency

Here is a first implementation of our bottom-up, saliency-based visual attention algorithm for real-time operation. It all works at 15 frames/sec, including video grabbing, processing, and display. Have a look! See our Ongoing Projects for more details.

Parallel encoding of mpeg movies

The parallel version of mpeg_encode works great too. Check out the Beowulf at work on a 6,000-frame movie and some processing statistics.