Posts Tagged ‘mpi’

Gentoo Cluster: Gamess Installation with MVAPICH2 and PBS

Monday, August 1st, 2011

Gamess is an electronic structure calculation package. Its installation is easy if you just want to use “sockets” communication mode. Just emerge it as you regularly do. Then use “rungms” to submit your job. The default rungms is okay to run the serial code. For the parallel computation, you still need to tune the script slightly. But since our cluster has Infiniband installed, it is better to go with the “mpi” communication mode. It took me quite some time to figure out how to install it correctly and make it run with mpiexec.hydra alone or with OpenPBS (Torque). Here is how I did it.

Software packages related:
1. gamess-20101001.3 (Dowload it beforehand from its developer’s website)
2. mvapich2-1.7rc1. (Previous versions should be okay and I installed it under /usr/local/)
3. OFED- (Userspace libraries for Infiniband. See my previous post. Only updated kernel modules installed. Userspace libraries should be the same as in OFED-
4. torque-2.4.14 (OpenPBS)

1. Update the gamess-20101001.3.ebuild with this one and manifest it.
2. Unmask the mpi user flag for gamess in /usr/portage/profiles/base/package.use.mask.
3. Add sci-chemistry/gamess mpi to /etc/portage/package.use; then emerge -av gamess.
4. Update rungms with this one;
5. Create a new script pbsgms as this one;
6. Add kernel.shmmax=XXXXX to /etc/sysctl.conf, in which XXXXX is a large enough integer for shared memory (default value 32MB is too small for DDI). Run /sbin/sysctl -w kernel.shmmax=XXXX to update the setting in-the-fly.
Added on Sept. 9, 2011. It seems that kernel.shmall=XXXXX should be modified as well. Please bear in mind that the unit for kernel.shmall is pages and kernel.shmmax is bytes. And a page is 4096 bytes in usual(use getconf PAGE_SIZE to verify).

7. Environment setting. Create a file /etc/env.d/99gamess


Then update your profile.
8. Create a hostfile, ~/.hosts


This file is only needed by invoking rungms directly.

9. Test your installation: copy a test job input file exam20.inpunder/usr/share/gamess/tests/; submit the job using pbsgms exam20 (other settings will be prompted), or using rungms exam20 00 4.

1. Two changes were made on the ebuild file.
(a). The installation suggestions given in the documentation of Gamess is not enough. More libraries other than mpich are needed to pass over to lked, the linker program for Gamess.
(b) MPI environment constants are needed to exported to the installation program, compddi through an temporary file
2. Many changes were made for the script, rungms. I could not remember all of them. Some are as following.
(a) For parallel computation, the scratch file will be put under /tmp on each node by default.
(b) The script will be working with pbsgms.
(c) System-wide setting for Gamess can be put under /etc/env.d.
(d) A host file is needed if not using PBS. By default, it should be at ~/.hosts. If not found, running on the local host only.
3. The script pbsgms is based on sge-pbs shipped with the Gamess installation package. I have made it to work with Torque. Numerous changes were made.

Gentoo Cluster: ofa_kernel installation

Friday, July 29th, 2011

Previously, I have setup the cluster and installed the Infiniband kernel modules and userspace libraries. However, a problem was lingering. When the command ibv_devinfo was run, the following error message was always given.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device

I have been ignoring this message. But recently I need to run some serious work with parallel computational power. The same error showed up now and then and MPI communication could not be established expect via the TCP/IP socket. The error was so annoying so i decided to solve the problem.
For the first step, I downloaded the OFED- installation package from the OpenFabrics website and extracted the ofa_kernel- package from it. I have tried the previous versions and it was not successful to install them on my kernel (2.6.38-gentoo-r6). The typical configure-make-make_install procedure was used to install the modules. However, with the configuration option, --with-nfsrdma-mod, the NFS/RDMA modules (svcrdma and xprtrdma) were unable to compile. They were just too many errors. Even after I manually modified all the errors-related sentences and the compilation was finished, the modules could not be loaded at all. So I have to give up that option.
The newly installed modules were placed under /lib/modules/`uname -r`/updates. After rebooting, the computer was frozen during boot-up. Lots of error messages with “Bad RIP value” were shown up. It turned up it was due to NFS/Client mounting. So after “netmount” was removed from the default runlevel, the rebooting was okay. Now the problem seems solved. The command ibv_devinfo gives the information I expected.

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.7.710
node_guid: f04d:a290:9778:efe0
sys_image_guid: f04d:a290:9778:efe3
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: DEL08F0120009
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid: 3
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB

Other diagnostic commands also work fine.
But now a new problem emerges. Although the build-in modules for NFS/RDMA with the kernel (2.6.38-gentoo-r6) were able to load. But whenever I tried to mount a network folder with the rdma protocol, the error message related “Bad RIP value” appeared and the mounting failed. Therefore, I have to switch the traditional TCP protocol. This seems a okay comprise.

After the kernel modules were updated, I installed MVAPICH2 (1.7rc1) using the 3-step installation procedure. I have run some basic test jobs and the osu_benchmarks. It was okay to run the jobs with mpiexec. But when using mpirun_rsh, the following errors were produced without successful results.

[unset]: Unable to get host entry for
[unset]: Unable to connect to on 33276
start..Fatal error in MPI_Init:
Other MPI error

By checking the source code, it seems the problem is related a function called gethostbyname which is defined in netdb.h. How to use the package with PBS is needed to figure out.

Gentoo Cluster: a Strange OpenMPI Problem

Wednesday, July 27th, 2011

Yesterday, I tried out some MPI jobs on our gentoo cluster. A really weird problem happened and then solved. One test job is the following mpihello code. At first, I use both qsub mpihello and just command-line mpirun -np 16 --hostfile hosts mpihello. When the number of processes is a low number, say 1 or 2 processes per each node, the jobs end very quickly. But if the number of processes exceeds some threshold, it just hangs there and never ends except being killed by pbs or myself. The threshold seems a larger number when using just mpirun then using qsub. The command pbsnodes shows all nodes are up and free. A debug test shows that the master process does not receive the messages from other processes, that is MPI_Recv is waiting forever.
Solution: Both Infiniband adapter and Ethernet network cards are running. After the bonded ethernet cards are disabled on node 7 and node 8, the problem is solved. I am still not exactly sure about the cause. Other nodes still have bonded ethernet cards running. But so far, it is in an okay working state.

#include "mpi.h"

int main(int argc, char *argv[])
int my_rank;
int p;
int source;
int dest;
int tag = 0;
char message[100];
char hostname[100];

MPI_Status status;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);

if (my_rank != 0) {
gethostname(hostname, 100);
sprintf(message, "Work: Hello world from process %d, on %s!",
my_rank, hostname);
dest = 0;
MPI_Send(message, strlen(message) + 1, MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
// fprintf(stdout, "going away..., %d, %s\n", my_rank, hostname);
} else {
gethostname(hostname, 100);
printf("p = %d\n", p);
printf("Hello world from master process %d, on %s!\n",
my_rank, hostname);
for (source = 1; source < p; source++) {
MPI_Recv(message, 100, MPI_CHAR, MPI_ANY_SOURCE, tag,
MPI_COMM_WORLD, &status);
printf("Master Revd:%s\n", message);
return 0;