Previously, I set up the cluster and installed the InfiniBand kernel modules and userspace libraries. However, one problem lingered: whenever the command ibv_devinfo was run, it always gave the following error message.
mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device
I had been ignoring this message. Recently, however, I needed to run some serious parallel computations. The same error showed up now and then, and MPI communication could not be established except via TCP/IP sockets. The error became annoying enough that I decided to solve the problem.
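Checking for this error by hand on every node gets tedious, so here is a minimal sketch of how the check could be scripted. It only looks for a fragment of the error text quoted above. The function names are my own, and the script assumes ibv_devinfo is on the PATH (it returns an empty string otherwise).

```python
import subprocess

# Fragment of the error text quoted above; matching on it is an assumption
# that the wording stays the same across library versions.
XRC_MISMATCH_MARKER = "Kernel does not support XRC"


def has_xrc_mismatch(output: str) -> bool:
    """Return True if ibv_devinfo output contains the XRC mismatch error."""
    return XRC_MISMATCH_MARKER in output


def run_ibv_devinfo() -> str:
    """Run ibv_devinfo and return stdout and stderr combined.

    Returns an empty string if the tool is not installed on this machine.
    """
    try:
        result = subprocess.run(
            ["ibv_devinfo"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout


# Demo on the message from this post rather than on a live system.
sample = (
    "mlx4: There is a mismatch between the kernel and the userspace "
    "libraries: Kernel does not support XRC. Exiting.\n"
    "Failed to open device\n"
)
print(has_xrc_mismatch(sample))
```

On a real node, has_xrc_mismatch(run_ibv_devinfo()) would do the same check against the live output.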
As a first step, I downloaded the OFED-1.5.3.2 installation package from the OpenFabrics website and extracted the ofa_kernel-1.5.3.2 package from it. I had tried earlier versions, but none of them could be installed on my kernel (2.6.38-gentoo-r6). The typical configure, make, make install procedure was used to install the modules. However, with the configuration option --with-nfsrdma-mod, the NFS/RDMA modules (svcrdma and xprtrdma) failed to compile; there were simply too many errors. Even after I manually patched every statement that caused an error and got the compilation to finish, the modules could not be loaded at all. So I had to give up on that option.
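To see which of these modules actually end up loaded, one option is to read /proc/modules, where each line starts with a module name. The sketch below does that. svcrdma and xprtrdma are the names from this post; mlx4_ib is my assumption about the driver module for this adapter, and the helper names are made up for illustration.

```python
from pathlib import Path

# svcrdma and xprtrdma come from the post; mlx4_ib is an assumption about
# which driver module serves the mlx4 adapter.
WANTED = ("svcrdma", "xprtrdma", "mlx4_ib")


def loaded_module_names(proc_modules_text: str) -> set:
    """Extract module names from text in /proc/modules format.

    Each non-empty line starts with the module name followed by a space.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def report(proc_modules_text: str, wanted=WANTED) -> dict:
    """Map each wanted module name to True if it appears as loaded."""
    loaded = loaded_module_names(proc_modules_text)
    return {name: name in loaded for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    for name, is_loaded in report(text).items():
        print(f"{name}: {'loaded' if is_loaded else 'not loaded'}")
```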
The newly installed modules were placed under /lib/modules/`uname -r`/updates. After rebooting, the computer froze during boot-up, and lots of error messages containing “Bad RIP value” were shown. It turned out that this was caused by mounting NFS shares as a client. After “netmount” was removed from the default runlevel, the reboot went fine. Now the problem seems solved. The command ibv_devinfo gives the information I expected.
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.7.710
    node_guid: f04d:a290:9778:efe0
    sys_image_guid: f04d:a290:9778:efe3
    vendor_id: 0x02c9
    vendor_part_id: 26428
    hw_ver: 0xB0
    board_id: DEL08F0120009
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 6
            port_lid: 3
            port_lmc: 0x00
            link_layer: IB
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: IB
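For a quick check across nodes, the port states can be pulled out of output like the above with a few lines of Python. This is only a sketch: it assumes a single HCA per node and the "key: value" layout shown here, and the function names are my own.

```python
def port_states(devinfo_text: str) -> dict:
    """Map port number to its state string from ibv_devinfo-style text.

    Assumes one HCA; with several, later ports would overwrite earlier ones.
    """
    states = {}
    current_port = None
    for raw in devinfo_text.splitlines():
        # Split on the first colon only, since GUID values contain colons.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "port":
            current_port = int(value)
        elif key == "state" and current_port is not None:
            states[current_port] = value
    return states


def active_ports(devinfo_text: str) -> list:
    """Return the port numbers whose state starts with PORT_ACTIVE."""
    states = port_states(devinfo_text)
    return [port for port, state in states.items()
            if state.startswith("PORT_ACTIVE")]


# A shortened copy of the output shown in this post.
sample = """hca_id: mlx4_0
node_guid: f04d:a290:9778:efe0
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
port_lid: 3
port: 2
state: PORT_DOWN (1)
port_lid: 0
"""
print(active_ports(sample))
```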
Other diagnostic commands also work fine.
But now a new problem has emerged. The built-in NFS/RDMA modules shipped with the kernel (2.6.38-gentoo-r6) could be loaded, but whenever I tried to mount a network folder with the rdma protocol, an error message related to “Bad RIP value” appeared and the mount failed. Therefore, I had to switch back to the traditional TCP protocol. This seems an acceptable compromise.
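To confirm which transport each NFS mount really uses after the switch, the proto= option in /proc/mounts can be inspected. Below is a minimal sketch of that idea. The server name and mount point in the sample are made up, and I am assuming the NFS client reports a proto= option in the mount options on this kernel.

```python
from pathlib import Path


def nfs_transports(mounts_text: str) -> dict:
    """Map each NFS mount point to its proto= option from /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        proto = "unknown"
        for option in fields[3].split(","):
            # "mountproto=" does not match because it does not start with "proto=".
            if option.startswith("proto="):
                proto = option[len("proto="):]
        result[fields[1]] = proto
    return result


# Hypothetical entries for illustration only.
sample = (
    "rootfs / rootfs rw 0 0\n"
    "server:/export /mnt/data nfs rw,relatime,vers=3,proto=tcp,mountproto=udp 0 0\n"
)
print(nfs_transports(sample))

if __name__ == "__main__":
    path = Path("/proc/mounts")
    if path.exists():
        print(nfs_transports(path.read_text()))
```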
After the kernel modules were updated, I installed MVAPICH2 (1.7rc1) using the three-step installation procedure. I ran some basic test jobs and the osu_benchmarks. Jobs ran fine with mpiexec. But with mpirun_rsh, the following errors were produced and no job completed successfully.
[unset]: Unable to get host entry for
[unset]: Unable to connect to on 33276
start..Fatal error in MPI_Init:
Other MPI error
...
Judging from the source code, the problem seems to be related to a function called gethostbyname, which is declared in netdb.h. I still need to figure out how to use the package with PBS.
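As a first diagnostic, it may help to check whether every node name that PBS hands to the job can be resolved at all. The sketch below uses Python's socket.gethostbyname, which performs a similar lookup to the C function but is not the code path inside mpirun_rsh. It assumes the job runs under PBS with PBS_NODEFILE pointing at a file with one host name per line; the helper names are my own.

```python
import os
import socket


def read_hosts(path: str) -> list:
    """Return the unique host names listed one per line in a hostfile."""
    hosts = []
    with open(path) as handle:
        for line in handle:
            name = line.strip()
            if name and name not in hosts:
                hosts.append(name)
    return hosts


def unresolvable_hosts(hosts, resolve=socket.gethostbyname) -> list:
    """Return the hosts for which the resolver raises an error.

    The resolver is a parameter so the check can be tried without a network.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile and os.path.exists(nodefile):
        print(unresolvable_hosts(read_hosts(nodefile)))
    else:
        print("PBS_NODEFILE is not set; run this inside a PBS job")
```

An empty list would mean name resolution is fine on that node and the cause lies elsewhere. The error message above prints no host name after "for", which might mean an empty name was looked up, but that is only a guess.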