Posts Tagged ‘cluster’

Gentoo Cluster: a Strange OpenMPI Problem

Wednesday, July 27th, 2011

Yesterday, I tried out some MPI jobs on our Gentoo cluster. A really weird problem appeared and was then solved. One test job is the mpihello code below. At first, I used both qsub mpihello and the plain command line mpirun -np 16 --hostfile hosts mpihello. When the number of processes is low, say 1 or 2 processes per node, the jobs finish very quickly. But once the number of processes exceeds some threshold, the job just hangs there and never ends unless it is killed by PBS or by me. The threshold seems to be larger when using plain mpirun than when using qsub. The command pbsnodes shows all nodes up and free. A debug test shows that the master process never receives the messages from the other processes, that is, MPI_Recv waits forever.
Solution: Both the InfiniBand adapter and the Ethernet network cards were running. After the bonded Ethernet cards were disabled on node 7 and node 8, the problem was solved. I am still not exactly sure about the cause; the other nodes still have bonded Ethernet cards running, but so far the cluster is in an okay working state.
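Instead of disabling the bonded interfaces outright, another option is to tell Open MPI explicitly which transports and interfaces to use via its MCA parameters. A minimal sketch of a per-user parameter file; the interface name bond0 is an assumption here and must be adjusted to the actual setup:

```ini
# $HOME/.openmpi/mca-params.conf (sketch, not my production file)
# Restrict Open MPI to the InfiniBand, shared-memory, and self transports,
# so it never touches the TCP interfaces at all:
btl = openib,sm,self
# Alternatively, if TCP must remain enabled, keep it off the loopback
# and the bonded interface (bond0 is assumed):
# btl_tcp_if_exclude = lo,bond0
```

The same parameters can also be passed on the command line with --mca, e.g. mpirun --mca btl openib,sm,self ... for a one-off test.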

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int my_rank;            /* rank of this process */
    int p;                  /* total number of processes */
    int source;
    int dest;
    int tag = 0;
    char message[100];
    char hostname[100];

    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* worker: send a greeting to the master */
        gethostname(hostname, 100);
        sprintf(message, "Work: Hello world from process %d, on %s!",
                my_rank, hostname);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR,
                 dest, tag, MPI_COMM_WORLD);
        /* fprintf(stdout, "going away..., %d, %s\n", my_rank, hostname); */
    } else {
        /* master: collect one message from every worker */
        gethostname(hostname, 100);
        printf("p = %d\n", p);
        printf("Hello world from master process %d, on %s!\n",
               my_rank, hostname);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, MPI_ANY_SOURCE, tag,
                     MPI_COMM_WORLD, &status);
            printf("Master Recvd:%s\n", message);
        }
    }

    MPI_Finalize();
    return 0;
}

Infiniband Installation on Gentoo (II)

Sunday, July 10th, 2011

In a previous post, I wrote up the first part of my experience installing InfiniBand adapters on a Gentoo cluster. Recently, I upgraded the system and found I had forgotten the details of the installation and setup, so I need to write down what I have done.

Step 1: Turn on the infiniband modules in the kernel as discussed in the previous post.

Step 2: Emerge the necessary packages. They used to be in the science overlay, but have now (July 2011) moved to the main tree under the sys-infiniband category. The following packages were installed on my cluster system.


Step 3: Edit the configuration file, /etc/infiniband/openib.conf. The following is the content of my configuration file.

# Start HCA driver upon boot

# Load UCM module

# Load RDMA_CM module

# Load RDMA_UCM module

# Increase ib_mad thread priority

# Load MTHCA

# Load IPATH

# Load eHCA

# Load MLX4 modules

# Load IPoIB

# Enable IPoIB Connected Mode

# Enable IPoIB High Availability daemon
# Xianlong Wang


# Load SDP module

# Load SRP module

# Enable SRP High Availability daemon

# Load ISER module

# Load RDS module

# Load VNIC module
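In the stock OFED openib.conf, each comment above is followed by a yes/no shell variable. The following is only a sketch of that pairing: the variable names are as I recall them from the stock file, and the values are assumptions for a ConnectX setup like this one (MLX4_LOAD=yes is what the init script below tests for). Verify everything against your own file:

```shell
ONBOOT=yes          # Start HCA driver upon boot
UCM_LOAD=no         # Load UCM module
RDMA_CM_LOAD=yes    # Load RDMA_CM module
RDMA_UCM_LOAD=yes   # Load RDMA_UCM module
RENICE_IB_MAD=no    # Increase ib_mad thread priority
MTHCA_LOAD=no       # Load MTHCA (older Mellanox HCAs)
IPATH_LOAD=no       # Load IPATH (QLogic HCAs)
EHCA_LOAD=no        # Load eHCA (IBM HCAs)
MLX4_LOAD=yes       # Load MLX4 modules (ConnectX)
IPOIB_LOAD=yes      # Load IPoIB
SET_IPOIB_CM=yes    # Enable IPoIB Connected Mode
SDP_LOAD=no         # Load SDP module
SRP_LOAD=no         # Load SRP module
SRPHA_ENABLE=no     # Enable SRP High Availability daemon
ISER_LOAD=no        # Load ISER module
RDS_LOAD=no         # Load RDS module
VNIC_LOAD=no        # Load VNIC module
```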

Step 4: Edit the init.d script, /etc/init.d/openib. This is the important part. The original script does not seem to load all the necessary modules, or does not load them in the right order. After all the if-clauses that set POST_LOAD_MODULES, change the following:

PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
POST_UNLOAD_MODULES="$PRE_UNLOAD_MODULES ib_ipoib ib_sa ib_uverbs ib_umad"

to the following (pay attention to the newly added module names):

#Xianlong Wang
# svcrdma: server-side module for NFS/RDMA
# xprtrdma: client-side module for NFS/RDMA


#Xianlong Wang
#add ib_ipoib before ib_cm

#PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
# add xprtrdma module for NFS server and client
PRE_UNLOAD_MODULES="xprtrdma svcrdma ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_ipoib ib_cm ib_local_sa findex"

# Xianlong Wang

if [ "X${MLX4_LOAD}" == "Xyes" ]; then
    PRE_UNLOAD_MODULES="mlx4_en mlx4_ib mlx4_core ${PRE_UNLOAD_MODULES}"
fi

In the start() function, after einfo "Loading HCA and Access Layer drivers", add the following to load the necessary modules:

# Xianlong Wang, hard-coded
if [[ "${MLX4_LOAD}" == "yes" ]]; then
    /sbin/modprobe mlx4_core > /dev/null 2>&1
    rc=$[ $rc + $? ]
    /sbin/modprobe mlx4_ib > /dev/null 2>&1
    rc=$[ $rc + $? ]
    /sbin/modprobe mlx4_en > /dev/null 2>&1
    rc=$[ $rc + $? ]
fi


Step 5: Add the init.d scripts, openib and opensm, to the default run level.

rc-update add openib default
rc-update add opensm default

Step 6: Edit the /etc/conf.d/net file for the IP-over-InfiniBand (IPoIB) settings, and create the symbolic link /etc/init.d/net.ib0 pointing to /etc/init.d/net.lo.

routes_ib0=("default via")

Then add net.ib0 to default run level.

rc-update add net.ib0 default

After rebooting, check the port status by running ibstatus. The following output is given:

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbd
base lid: 0x1
sm lid: 0x6
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbe
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
link_layer: InfiniBand

Use ifconfig to check the IP address. The following output is given.

ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr: Bcast: Mask:
RX packets:43233 errors:0 dropped:0 overruns:0 frame:0
TX packets:44438 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:5280799 (5.0 MiB) TX bytes:2771849 (2.6 MiB)

There is still one problem: ibv_devinfo outputs the following error message.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device

Infiniband Installation on Gentoo

Tuesday, December 7th, 2010

My college recently purchased a Dell M610 cluster and I am in charge of its administration. The cluster consists of 8 nodes; each node has two 1 Gb Ethernet cards and one InfiniBand adapter. The nodes are connected through two backplane Dell PowerConnect 6220 Ethernet switches and one Mellanox M3601Q switch on the chassis. The InfiniBand switch does not come with a subnet manager.

I decided on Gentoo as the base system for a clean and slim installation; in particular, its meta-package administration system, Portage, is a great attraction. Basic system installation was no problem and the Ethernet card setup was easy. But InfiniBand was big trouble at the beginning, because the OFED package from either Mellanox or OpenFabrics only supports Red Hat and SUSE Linux boxes. There is not much documentation available, and all the packages are RPM packages… The official Portage tree does not include any InfiniBand packages. The Gentoo science overlay has a sys-infiniband category, but its packages do not seem well curated and have many issues. I spent quite a lot of time figuring out a solution. Here I record what I did, for future reference and for anyone who might encounter the same problem.

Step 1: Check the hardware information.


The information regarding the InfiniBand adapter is as follows:

04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s – IB QDR / 10GigE] (rev b0)
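A line like the one above can be pulled out of the full lspci listing with a case-insensitive grep. A small sketch, run here against a sample line so it is self-contained; on the real node, just run lspci | grep -i infiniband:

```shell
# Hypothetical sample of one lspci line; on a live node use: lspci | grep -i infiniband
sample='04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)'
printf '%s\n' "$sample" | grep -i infiniband
```

lspci -v or lspci -n on the matching bus address (04:00.0 above) then gives the PCI IDs needed to look up driver support.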

Check whether the device is supported by the Linux kernel: the MT26428 has been supported since kernel v2.6.25 and uses the mlx4_core driver.

Step 2: kernel compilation

I use the Gentoo kernel source 2.6.34-gentoo-r12. Other versions above 2.6.25 should be fine, I assume.

a) Device Drivers -> set “InfiniBand support” as a module. Under “InfiniBand support”, set the following as modules: “InfiniBand userspace MAD support”, “InfiniBand userspace access (verbs and CM)”, “Mellanox ConnectX HCA support”, “IP-over-InfiniBand”, “InfiniBand SCSI RDMA Protocol”, and “iSCSI Extensions for RDMA (iser)”. Set “IP-over-InfiniBand Connected Mode Support” as built-in. For other InfiniBand cards, choose the corresponding driver instead of “Mellanox ConnectX HCA support”.

b) Device Drivers -> Network device support: enable “Ethernet (10000 Mbit)” (the Gigabit Ethernet card has already been configured). Set “Mellanox Technologies ConnectX 10G support” as a module; this provides the driver, mlx4_en.

c) Run “make && make modules_install” to compile the kernel and install the modules.

So far, we have a ready kernel. After rebooting, run lsmod; you should see mlx4_core loaded.