Gentoo Cluster: a Strange OpenMPI Problem

Yesterday, I tried out some MPI jobs on our Gentoo cluster. A really weird problem appeared and was then solved. One test job is the mpihello code below. I launched it both with qsub mpihello and directly from the command line with mpirun -np 16 --hostfile hosts mpihello. When the number of processes is low, say one or two processes per node, the jobs finish very quickly. But once the number of processes exceeds some threshold, the job just hangs and never ends unless it is killed by PBS or by me. The threshold seems to be larger when launching with mpirun directly than when going through qsub. The command pbsnodes shows all nodes up and free. A debug test showed that the master process never receives the messages from the other processes; that is, MPI_Recv waits forever.
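
(This is not the exact debug test I ran, but for future reference, raising Open MPI's BTL verbosity is one way to see which transports, InfiniBand or TCP over the Ethernet interfaces, each process actually selects:)

mpirun --mca btl_base_verbose 30 -np 16 --hostfile hosts mpihello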
Solution: Both the InfiniBand adapter and the bonded Ethernet cards were active. After the bonded Ethernet cards were disabled on node 7 and node 8, the problem went away. I am still not exactly sure of the cause, since the other nodes still have bonded Ethernet cards running, but so far the cluster is in an okay working state.
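
(In hindsight, a less drastic workaround might have been to tell Open MPI to avoid the Ethernet path via MCA parameters instead of disabling the cards. A sketch, assuming an Open MPI 1.x build with the openib BTL; bond0 is just my guess at the bonded interface name, not something I verified:)

# use only InfiniBand, shared memory and self; no TCP BTL at all
mpirun --mca btl self,sm,openib -np 16 --hostfile hosts mpihello

# or keep TCP but exclude the loopback and bonded interfaces
mpirun --mca btl_tcp_if_exclude lo,bond0 -np 16 --hostfile hosts mpihello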

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank;
    int p;
    int source;
    int dest;
    int tag = 0;
    char message[100];
    char hostname[100];

    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* worker ranks: send one greeting to the master (rank 0) */
        gethostname(hostname, 100);
        sprintf(message, "Work: Hello world from process %d, on %s!",
                my_rank, hostname);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR,
                 dest, tag, MPI_COMM_WORLD);
        // fprintf(stdout, "going away..., %d, %s\n", my_rank, hostname);
    } else {
        /* master rank: collect one message from each of the p-1 workers */
        gethostname(hostname, 100);
        printf("p = %d\n", p);
        printf("Hello world from master process %d, on %s!\n",
               my_rank, hostname);
        for (source = 1; source < p; source++) {
            /* this is the receive that hung once too many processes ran */
            MPI_Recv(message, 100, MPI_CHAR, MPI_ANY_SOURCE, tag,
                     MPI_COMM_WORLD, &status);
            printf("Master Recvd: %s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}
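
(To build and launch the example with Open MPI's wrapper compiler, assuming the source is saved as mpihello.c:)

mpicc -o mpihello mpihello.c
mpirun -np 16 --hostfile hosts mpihello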
