Archive for July, 2011

Gentoo Cluster: ofa_kernel installation

Friday, July 29th, 2011

Previously, I have setup the cluster and installed the Infiniband kernel modules and userspace libraries. However, a problem was lingering. When the command ibv_devinfo was run, the following error message was always given.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device

I have been ignoring this message. But recently I need to run some serious work with parallel computational power. The same error showed up now and then and MPI communication could not be established expect via the TCP/IP socket. The error was so annoying so i decided to solve the problem.
For the first step, I downloaded the OFED-1.5.3.2 installation package from the OpenFabrics website and extracted the ofa_kernel-1.5.3.2 package from it. I have tried the previous versions and it was not successful to install them on my kernel (2.6.38-gentoo-r6). The typical configure-make-make_install procedure was used to install the modules. However, with the configuration option, --with-nfsrdma-mod, the NFS/RDMA modules (svcrdma and xprtrdma) were unable to compile. They were just too many errors. Even after I manually modified all the errors-related sentences and the compilation was finished, the modules could not be loaded at all. So I have to give up that option.
The newly installed modules were placed under /lib/modules/`uname -r`/updates. After rebooting, the computer was frozen during boot-up. Lots of error messages with “Bad RIP value” were shown up. It turned up it was due to NFS/Client mounting. So after “netmount” was removed from the default runlevel, the rebooting was okay. Now the problem seems solved. The command ibv_devinfo gives the information I expected.

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.7.710
node_guid: f04d:a290:9778:efe0
sys_image_guid: f04d:a290:9778:efe3
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: DEL08F0120009
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid: 3
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB


Other diagnostic commands also work fine.
But now a new problem emerges. Although the build-in modules for NFS/RDMA with the kernel (2.6.38-gentoo-r6) were able to load. But whenever I tried to mount a network folder with the rdma protocol, the error message related “Bad RIP value” appeared and the mounting failed. Therefore, I have to switch the traditional TCP protocol. This seems a okay comprise.

After the kernel modules were updated, I installed MVAPICH2 (1.7rc1) using the 3-step installation procedure. I have run some basic test jobs and the osu_benchmarks. It was okay to run the jobs with mpiexec. But when using mpirun_rsh, the following errors were produced without successful results.

[unset]: Unable to get host entry for
[unset]: Unable to connect to on 33276
start..Fatal error in MPI_Init:
Other MPI error
...

By checking the source code, it seems the problem is related a function called gethostbyname which is defined in netdb.h. How to use the package with PBS is needed to figure out.

Gentoo Cluster: a Strange OpenMPI Problem

Wednesday, July 27th, 2011

Yesterday, I tried out some MPI jobs on our gentoo cluster. A really weird problem happened and then solved. One test job is the following mpihello code. At first, I use both qsub mpihello and just command-line mpirun -np 16 --hostfile hosts mpihello. When the number of processes is a low number, say 1 or 2 processes per each node, the jobs end very quickly. But if the number of processes exceeds some threshold, it just hangs there and never ends except being killed by pbs or myself. The threshold seems a larger number when using just mpirun then using qsub. The command pbsnodes shows all nodes are up and free. A debug test shows that the master process does not receive the messages from other processes, that is MPI_Recv is waiting forever.
Solution: Both Infiniband adapter and Ethernet network cards are running. After the bonded ethernet cards are disabled on node 7 and node 8, the problem is solved. I am still not exactly sure about the cause. Other nodes still have bonded ethernet cards running. But so far, it is in an okay working state.

#include
#include
#include
#include "mpi.h"

int main(int argc, char *argv[])
{
int my_rank;
int p;
int source;
int dest;
int tag = 0;
char message[100];
char hostname[100];

MPI_Status status;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);

if (my_rank != 0) {
gethostname(hostname, 100);
sprintf(message, "Work: Hello world from process %d, on %s!",
my_rank, hostname);
dest = 0;
MPI_Send(message, strlen(message) + 1, MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
// fprintf(stdout, "going away..., %d, %s\n", my_rank, hostname);
} else {
gethostname(hostname, 100);
printf("p = %d\n", p);
printf("Hello world from master process %d, on %s!\n",
my_rank, hostname);
for (source = 1; source < p; source++) {
MPI_Recv(message, 100, MPI_CHAR, MPI_ANY_SOURCE, tag,
MPI_COMM_WORLD, &status);
printf("Master Revd:%s\n", message);
}
}
MPI_Finalize();
return 0;
}

umount: device is busy

Wednesday, July 27th, 2011

sometimes, when i try to umount a mounted device, the following error occurs.

xwang@node1 ~ $ umount /mnt/ps4000e/home
umount.nfs: /mnt/ps4000e/home: device is busy
umount.nfs: /mnt/ps4000e/home: device is busy

No one is logged in except myself which I do not use that directory and no other user’s job is running. It is really a mystery to figure out which process causes the “device is busy”. Use google, I found the solution at http://ocaoimh.ie/2008/02/13/how-to-umount-when-the-device-is-busy/.
The solution is to use fuser to find out.

xwang # fuser -m /mnt/ps4000e/home/
/mnt/ps4000e/home/: 7706c

See the manual man fuser for full description of this command. I guess 7706 is the process id which is currently uses the mounted device. Not sure the following letter ‘c’ stands for. So use ps to find out the process.

xwang # ps 7706
PID TTY STAT TIME COMMAND
7706 ? Ss 0:00 /usr/bin/orted --daemonize ....

Now the reason is obvious. I started an mpi job before and it does not end abnormally. The orted is the administration process started by root and it does not exit. So after the process was killed, the device was able to be detached.

Gentoo: NFS/RDMA (Infiniband)

Tuesday, July 12th, 2011

Our cluster system consists of a Dell EqualLogic PS4000e iSCSI SAN (16T) storage array. I used it for database storage and home directory of regular users. The storage array was mounted to the master node using iSCSI initiator, mount point, /mnt/ps4000e/. Then the sub-directory /mnt/ps4000e/home was exported across the cluster, so each node has access to the same home directory. So everyday users do not need move their data files between nodes. NFS services provides the network-based mounting. NFS sever/client is easy to install by following the guideline at http://en.gentoo-wiki.com/wiki/NFS/Server. Data transfer is via the IPoIB mechanism. But since we have Infiniband network, we could use RDMA network. NFS/RDMA achieves much faster speed. Here is my experience to setup NFS/RDMA.

Step 1: Kernel compilation
1) Requirements for NFS Server/Client
For the server node, it is needed to turn on File systems/Network File Systems/NFS server support.
For the client node, it is needed to turn on File systems/Network File Systems/NFS client support.
2) Requirements for RDMA support
Drivers for Infiniband should be compiled as module as said in a previous node. Check if RDMA support is enabled. Make sure that SUNRPC_XPRT_RDMA in the .config file has a value of M.

Step 2: emerge net-fs/nfs-utils
The version of 1.2.3-r1 is installed. The portmap package is no longed needed. Instead, rpcbind as a dependency will be installed instead. If you see the error message that says the nfs-utils package is blocked portmap, un-emerge portmap first. If portmap is pulled by ypserv, un-emerge ypserv and ypbind packages first. After installation of nfs-utils, then emerge ypserv ypbind again.

Step 3: Create the mount point.
edit the /etc/exports file. add the following line,

# /etc/exports: NFS file systems being exported. See exports(5).
/mnt/ps4000e/home 10.0.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_subtree_che
ck,no_root_squash)

The option insecure is important here because the NFS/RDMA client does not use a reserved port.

Step 4: Load necessary modules.
On the server node, svcrdma is needed. On the client node, xprtrdma is needed. I added them into the /etc/init.d/nfs script file. Put the following sentences into an appropriate place in the init.d file.

# svcrdma: server-side module for NFS/RDMA
# xprtrdma: client-side module for NFS/RDMA
/sbin/modprobe svcrdma > /dev/null 2>&1
/sbin/modprobe xprtrdma > /dev/null 2>&1

Remember to unload them when stopping the services. Or add corresponding rmmod commands into the script.

Step 5: Instruct the server to listen on the RDMA transport.

echo "rdma 20049" > /proc/fs/nfsd/portlist

I added it into the nfs script as well.

Step 6: Start the NFS service

/etc/init.d/nfs start

Or add the script to the default run level.

rc-update add nfs default

Step 7. Mount the file system on the client node.
First, ensure that the module xprtrdma has been loaded.

modprobe xprtrdma

Then, use the following command to mount the NFS/RDMA server:

mount -o rdma,port=20049 10.0.0.1:/mnt/ps4000e/home /mnt/ps4000e/home

To verify that the mount is using RDMA, run cat /proc/mounts to check the proto field.
Alternatively for automatic mounting during the boot-up, add the following record to the file /etc/fstab.

10.0.0.1:/mnt/ps4000e/home /mnt/ps4000e/home nfs _netdev,proto=rd
ma,port=20049 0 2

Use the init.d script netmount to mount the NFS/RDMA server.

Infiniband Installation on Gentoo (II)

Sunday, July 10th, 2011

In a previous post, I wrote the first part of my experience to install infiniband adapters on a gentoo cluster.  Recently, I upgraded the system and found i forgot the details of the installation and setup.  So I need to write down what i have done.

Step 1: Turn on the infiniband modules in the kernel as discussed in the previous post.

Step 2:  Emerge necessary packages.  They were in the science layer, but now (July 2011) moved to the main tree under sys-infiniband category.  On my cluster system,  the following packages were installed.


sys-infiniband/dapl-2.0.32
sys-infiniband/infiniband-diags-1.5.8
sys-infiniband/libibcm-1.0.5
sys-infiniband/libibcommon-1.1.2_p20090314
sys-infiniband/libibmad-1.3.7
sys-infiniband/libibumad-1.3.7
sys-infiniband/libibverbs-1.1.4
sys-infiniband/libipathverbs-1.2
sys-infiniband/libmlx4-1.0.1
sys-infiniband/libmthca-1.0.5-r2
sys-infiniband/libnes-1.1.1
sys-infiniband/librdmacm-1.0.14.1
sys-infiniband/libsdp-1.1.108
sys-infiniband/openib-1.4
sys-infiniband/openib-files-1.5.3.1
sys-infiniband/opensm-3.3.9
sys-infiniband/perftest-1.3.0

Step 3:   Edit configuration file, /etc/infiniband/openib.conf. The following is the content of my configuration file.


# Start HCA driver upon boot
ONBOOT=yes

# Load UCM module
UCM_LOAD=no

# Load RDMA_CM module
RDMA_CM_LOAD=yes

# Load RDMA_UCM module
RDMA_UCM_LOAD=yes

# Increase ib_mad thread priority
RENICE_IB_MAD=no

# Load MTHCA
MTHCA_LOAD=no

# Load IPATH
IPATH_LOAD=no

# Load eHCA
EHCA_LOAD=no

# Load MLX4 modules
MLX4_LOAD=yes

# Load IPoIB
IPOIB_LOAD=yes

# Enable IPoIB Connected Mode
SET_IPOIB_CM=yes

# Enable IPoIB High Availability daemon
# Xianlong Wang
#IPOIBHA_ENABLE=yes

#PRIMARY_IPOIB_DEV=ib0
#SECONDARY_IPOIB_DEV=ib1

# Load SDP module
#SDP_LOAD=yes

# Load SRP module
#SRP_LOAD=no

# Enable SRP High Availability daemon
#SRPHA_ENABLE=no

# Load ISER module
#ISER_LOAD=no

# Load RDS module
#RDS_LOAD=no

# Load VNIC module
#VNIC_LOAD=yes

Step 4:  Edit the init.d script, /etc/init.d/openib. This is the important part.  The original one seems does not load all necessary modules  or in the right order. After all the if-clauses for setting POST_LOAD_MODULES, change the following:

PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
POST_UNLOAD_MODULES="$PRE_UNLOAD_MODULES ib_ipoib ib_sa ib_uverbs ib_umad"

to the following (pay attention to those in bold fonts):

#Xianlong Wang
# svcrdma: server-side module for NFS/RDMA
# xprtrdma: client-side module for NFS/RDMA

POST_LOAD_MODULES="$POST_LOAD_MODULES svcrdma xprtrdma"

#Xianlong Wang
#add ib_ipoib before ib_cm

#PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
# add xprtrdma module for NFS server and client
PRE_UNLOAD_MODULES="xprtrdma svcrdma ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_ipoib ib_cm ib_local_sa findex"

# Xianlong Wang

if [ "X${MLX4_LOAD}" == "Xyes" ]; then
PRE_UNLOAD_MODULES="mlx4_en mlx4_ib mlx4_core ${PRE_UNLOAD_MODULES}"
fi


In the start() function, after einfo "Loading HCA and Access Layer drivers", add the following to load the necessary modules:


# Xianlong Wang, hard-coded
if [[ "${MLX4_LOAD}" == "yes" ]]; then
/sbin/modprobe mlx4_core > /dev/null 2>&1
rc=$[ $rc + $? ]
/sbin/modprobe mlx4_ib > /dev/null 2>&1
rc=$[ $rc + $? ]
/sbin/modprobe mlx4_en > /dev/null 2>&1
rc=$[ $rc + $? ]

fi

Step 4: add the init.d scripts, openib and opensm to boot level.

rc-update add openib default
rc-update add opensm default

Step 5: Edit the /etc/conf.d/net file for IPoverIB settings. Create the symbolic link /etc/init.d/net.ib0 to /etc/init.d/net.lo.

config_ib0=("10.0.0.1/24")
routes_ib0=("default via 10.0.0.1")
nis_domain_ib0="abc"
nis_servers_ib0="10.0.0.1"

Then add net.ib0 to default run level.

rc-update add net.ib0 default

After rebooting, check the port status by running ibstatus. The following output is given:

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbd
base lid: 0x1
sm lid: 0x6
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbe
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
link_layer: InfiniBand


Using ifconfig to check the ip address. The following output is given.

ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:43233 errors:0 dropped:0 overruns:0 frame:0
TX packets:44438 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:5280799 (5.0 MiB) TX bytes:2771849 (2.6 MiB)

Still is there a problem. ibv_devinfo outputs the following error message.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device