clsplit: A Combination of split and csplit

August 15th, 2011

Recently, I needed to process a large number of SDF files, and some of them are too big (>500MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands that come with Linux, split and csplit, do not quite meet this need. split is convenient for splitting a text file by line count, but an SDF file can contain many molecular records of varying length, so split will cut records apart at the chunk boundaries, and one has to fix the head and tail of each resulting file manually. csplit is more flexible and can split a file according to patterns. Its weak point is that there is no way to specify how many pattern matches to skip before splitting. As a result, if I use “$$$$” as the record delimiter to csplit an SDF file, every molecular record ends up in its own file. There are just too many of them! That is not what I want. (One could use cat to concatenate them again, but that is too troublesome because of the large number of files.)
I wrote clsplit to meet this need. It is available here. The basic idea behind the script is to automate the fix-up work that is needed after splitting a file by a given number of lines. It is called with the following syntax.
clsplit PATTERN line_number file_name
PATTERN must be a valid pattern for grep. For example, to split a big SDF file, the following command can be used.
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not have exactly 10000 lines each, but who cares about an exact line count! What matters is preserving the integrity of each record.
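For illustration, the record-preserving idea can be sketched in Python as follows. This is not the clsplit script itself: instead of fixing chunk boundaries after a line-based split, the sketch makes a single pass and closes a chunk only on a delimiter line, once at least the requested number of lines has been collected. The function name split_on_records and the tiny sample records are made up for this example.

```python
import re
from typing import Iterable, Iterator, List


def split_on_records(lines: Iterable[str], pattern: str, min_lines: int) -> Iterator[List[str]]:
    """Yield chunks of at least min_lines lines, each ending on a delimiter line."""
    delimiter = re.compile(pattern)
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        # Close the chunk only on a delimiter line, so no record is cut in half.
        if len(chunk) >= min_lines and delimiter.search(line):
            yield chunk
            chunk = []
    if chunk:
        # Trailing records that never reached min_lines form the last chunk.
        yield chunk


# Hypothetical three-record input; a real SDF record is much longer.
sample = ["mol1\n", "atoms\n", "$$$$\n",
          "mol2\n", "atoms\n", "$$$$\n",
          "mol3\n", "$$$$\n"]
chunks = list(split_on_records(sample, r"^\$\$\$\$", 4))
for number, chunk in enumerate(chunks):
    print(number, len(chunk), repr(chunk[-1]))
```

As with clsplit, the chunks are not exactly min_lines long; each one simply runs on until the next record delimiter.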

Gentoo Cluster: Gamess Installation with MVAPICH2 and PBS

August 1st, 2011

Gamess is an electronic structure calculation package. Installation is easy if you just want to use the “sockets” communication mode: emerge it as you regularly do, then use “rungms” to submit your job. The default rungms is fine for serial runs; for parallel computation you still need to tune the script slightly. But since our cluster has Infiniband installed, it is better to go with the “mpi” communication mode. It took me quite some time to figure out how to install it correctly and make it run with mpiexec.hydra, either alone or under OpenPBS (Torque). Here is how I did it.

Related software packages:
1. gamess-20101001.3 (Download it beforehand from its developer’s website)
2. mvapich2-1.7rc1 (Previous versions should be okay; I installed it under /usr/local/)
3. OFED-1.5.3.2 (Userspace libraries for Infiniband. See my previous post. Only the updated kernel modules were installed; the userspace libraries should be the same as in OFED-1.5.3.1)
4. torque-2.4.14 (OpenPBS)

Steps
1. Update the gamess-20101001.3.ebuild with this one and manifest it.
2. Unmask the mpi USE flag for gamess in /usr/portage/profiles/base/package.use.mask.
3. Add sci-chemistry/gamess mpi to /etc/portage/package.use; then emerge -av gamess.
4. Update rungms with this one.
5. Create a new script, pbsgms, as this one.
6. Add kernel.shmmax=XXXXX to /etc/sysctl.conf, where XXXXX is an integer large enough for shared memory (the default value of 32MB is too small for DDI). Run /sbin/sysctl -w kernel.shmmax=XXXXX to update the setting on the fly.
Added on Sept. 9, 2011: It seems that kernel.shmall=XXXXX should be modified as well. Please bear in mind that the unit for kernel.shmall is pages, while the unit for kernel.shmmax is bytes. A page is usually 4096 bytes (use getconf PAGE_SIZE to verify).
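The bytes-versus-pages conversion is easy to get wrong, so here is a small Python sketch of the arithmetic. The 4 GiB target is only a hypothetical value chosen for the example, not a recommendation for Gamess, and shmall_pages is a made-up helper name.

```python
import os


def shmall_pages(shmmax_bytes: int, page_size: int) -> int:
    """Smallest number of pages that covers shmmax_bytes (ceiling division)."""
    return -(-shmmax_bytes // page_size)


# Same value that `getconf PAGE_SIZE` reports.
page_size = os.sysconf("SC_PAGE_SIZE")

# Hypothetical target: 4 GiB of shared memory.
shmmax = 4 * 1024**3

print(f"kernel.shmmax={shmmax}")
print(f"kernel.shmall={shmall_pages(shmmax, page_size)}")
```

With a 4096-byte page, 4 GiB corresponds to 1048576 pages. Since kernel.shmall limits the total shared memory on the system, it should be at least this large.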

7. Environment settings. Create a file, /etc/env.d/99gamess, with the following content:

GMS_TARGET=mpi
GMS_SCR=/tmp/gamess
GMS_HOSTS=~/.hosts
GMS_MPI_KICK=hydra
GMS_MPI_PATH=/usr/local/bin

Then update your profile (on Gentoo, run env-update && source /etc/profile).
8. Create a hostfile, ~/.hosts

node1
node2
...

This file is only needed when invoking rungms directly.

9. Test your installation: copy a test job input file, exam20.inp, found under /usr/share/gamess/tests/; then submit the job using pbsgms exam20 (you will be prompted for the other settings), or using rungms exam20 00 4.

Explanations
1. Two changes were made to the ebuild file.
(a) The installation suggestions given in the Gamess documentation are not sufficient. More libraries than just mpich need to be passed to lked, the linker program for Gamess.
(b) MPI environment constants need to be exported to the installation program, compddi, through a temporary file, install.info.
2. Many changes were made to the rungms script. I cannot remember all of them; some are as follows.
(a) For parallel computation, the scratch files are put under /tmp on each node by default.
(b) The script works together with pbsgms.
(c) System-wide settings for Gamess can be put under /etc/env.d.
(d) A host file is needed when not using PBS. By default, it should be at ~/.hosts. If it is not found, the job runs on the local host only.
3. The pbsgms script is based on sge-pbs, which is shipped with the Gamess installation package. I made numerous changes to get it to work with Torque.
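The host file fallback described in 2(d) can be illustrated with a few lines of Python. This is only a sketch of the logic, not an excerpt from the modified rungms script; the function name read_hosts is made up, and skipping blank lines is an assumption of this example.

```python
import os
import socket
from typing import List


def read_hosts(path: str = "~/.hosts") -> List[str]:
    """Return the host names listed in the host file, or the local host if it is missing."""
    try:
        with open(os.path.expanduser(path)) as handle:
            # One host name per line; blank lines are skipped (an assumption).
            hosts = [line.strip() for line in handle if line.strip()]
    except FileNotFoundError:
        hosts = []
    # Fall back to the local host when the file is missing or empty.
    return hosts or [socket.gethostname()]


print(read_hosts())
```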

Gentoo Cluster: ofa_kernel installation

July 29th, 2011

Previously, I set up the cluster and installed the Infiniband kernel modules and userspace libraries. However, one problem lingered: whenever the command ibv_devinfo was run, it gave the following error message.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device

I had been ignoring this message. But recently I needed to run some serious parallel computations. The same error showed up now and then, and MPI communication could not be established except via TCP/IP sockets. The error was annoying enough that I decided to solve the problem.
For the first step, I downloaded the OFED-1.5.3.2 installation package from the OpenFabrics website and extracted the ofa_kernel-1.5.3.2 package from it. I had tried earlier versions, but could not install them on my kernel (2.6.38-gentoo-r6). The typical configure, make, make install procedure was used to install the modules. However, with the configuration option --with-nfsrdma-mod, the NFS/RDMA modules (svcrdma and xprtrdma) failed to compile. There were just too many errors. Even after I manually fixed all the offending statements and the compilation finished, the modules could not be loaded at all. So I had to give up that option.
The newly installed modules were placed under /lib/modules/`uname -r`/updates. After rebooting, the computer froze during boot-up, and lots of error messages with “Bad RIP value” were shown. It turned out to be caused by NFS client mounting. After “netmount” was removed from the default runlevel, rebooting was fine. Now the problem seems solved. The command ibv_devinfo gives the information I expected.

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.7.710
node_guid: f04d:a290:9778:efe0
sys_image_guid: f04d:a290:9778:efe3
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: DEL08F0120009
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid: 3
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB


Other diagnostic commands also work fine.
But now a new problem has emerged. The built-in NFS/RDMA modules of the kernel (2.6.38-gentoo-r6) load without trouble, but whenever I try to mount a network folder with the rdma protocol, an error message mentioning “Bad RIP value” appears and the mount fails. Therefore, I have had to switch back to the traditional TCP protocol. This seems an acceptable compromise.

After the kernel modules were updated, I installed MVAPICH2 (1.7rc1) using the 3-step installation procedure. I ran some basic test jobs and the osu_benchmarks. Jobs ran fine with mpiexec, but with mpirun_rsh they failed with the following errors.

[unset]: Unable to get host entry for
[unset]: Unable to connect to on 33276
start..Fatal error in MPI_Init:
Other MPI error
...

Judging from the source code, the problem seems to be related to the function gethostbyname, which is declared in netdb.h. I still need to figure out how to use the package with PBS.

Gentoo Cluster: a Strange OpenMPI Problem

July 27th, 2011

Yesterday, I tried out some MPI jobs on our Gentoo cluster. A really weird problem appeared and was then solved. One test job is the mpihello code below. At first, I ran it both with qsub mpihello and directly on the command line with mpirun -np 16 --hostfile hosts mpihello. When the number of processes is low, say 1 or 2 processes per node, the jobs finish very quickly. But if the number of processes exceeds some threshold, the job just hangs and never ends unless it is killed by PBS or by me. The threshold seems to be higher when using plain mpirun than when using qsub. The command pbsnodes shows that all nodes are up and free. A debugging run shows that the master process does not receive the messages from the other processes; that is, MPI_Recv waits forever.
Solution: Both the Infiniband adapter and the Ethernet network cards are running. After the bonded Ethernet cards were disabled on node 7 and node 8, the problem went away. I am still not exactly sure about the cause, since other nodes still have bonded Ethernet cards running. But so far, the cluster is in an okay working state.

#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* gethostname() */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank;            /* rank of this process */
    int p;                  /* total number of processes */
    int source;
    int dest;
    int tag = 0;
    char message[100];
    char hostname[100];

    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* Workers: send one greeting to the master. */
        gethostname(hostname, sizeof(hostname));
        snprintf(message, sizeof(message),
                 "Work: Hello world from process %d, on %s!",
                 my_rank, hostname);
        dest = 0;
        MPI_Send(message, (int)strlen(message) + 1, MPI_CHAR,
                 dest, tag, MPI_COMM_WORLD);
        // fprintf(stdout, "going away..., %d, %s\n", my_rank, hostname);
    } else {
        /* Master: collect one greeting from every worker, in any order. */
        gethostname(hostname, sizeof(hostname));
        printf("p = %d\n", p);
        printf("Hello world from master process %d, on %s!\n",
               my_rank, hostname);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, sizeof(message), MPI_CHAR, MPI_ANY_SOURCE, tag,
                     MPI_COMM_WORLD, &status);
            printf("Master Recvd:%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

umount: device is busy

July 27th, 2011

Sometimes, when I try to umount a mounted device, the following error occurs.

xwang@node1 ~ $ umount /mnt/ps4000e/home
umount.nfs: /mnt/ps4000e/home: device is busy
umount.nfs: /mnt/ps4000e/home: device is busy

No one is logged in except me, I am not using that directory, and no other user’s job is running. It is really a mystery which process causes the “device is busy” error. Using Google, I found the solution at http://ocaoimh.ie/2008/02/13/how-to-umount-when-the-device-is-busy/.
The solution is to use fuser to find out.

xwang # fuser -m /mnt/ps4000e/home/
/mnt/ps4000e/home/: 7706c

See the manual page (man fuser) for a full description of this command. Here 7706 is the ID of the process that is currently using the mounted device, and the trailing letter ‘c’ means that the process has its current directory there. So use ps to find out what the process is.

xwang # ps 7706
PID TTY STAT TIME COMMAND
7706 ? Ss 0:00 /usr/bin/orted --daemonize ....

Now the reason is obvious. I had started an MPI job earlier and it did not end normally. orted is the Open MPI daemon process; it had been started by root and never exited. After the process was killed, the device could be unmounted.
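For reference, the letters that fuser appends to each PID are access-type codes documented in its manual page. The Python sketch below decodes a line in the format shown above; the function name parse_fuser_line is made up, and the sketch assumes the path and the PIDs appear together on one line, as they do on a terminal.

```python
import re

# Access-type codes as listed in the fuser manual page.
ACCESS_TYPES = {
    "c": "current directory",
    "e": "executable being run",
    "f": "open file",
    "F": "open file for writing",
    "r": "root directory",
    "m": "mmap'ed file or shared library",
}


def parse_fuser_line(line):
    """Split 'PATH: PIDcodes ...' into (path, [(pid, [descriptions]), ...])."""
    # Split on the last colon, so that colons inside the path are preserved.
    path, _, rest = line.rpartition(":")
    users = []
    for pid, codes in re.findall(r"(\d+)([cefFrm]*)", rest):
        users.append((int(pid), [ACCESS_TYPES[code] for code in codes]))
    return path.strip(), users


result = parse_fuser_line("/mnt/ps4000e/home/: 7706c")
print(result)
```

For the line from this post, the sketch reports PID 7706 with the access type "current directory".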

Gentoo: NFS/RDMA (Infiniband)

July 12th, 2011

Our cluster system includes a Dell EqualLogic PS4000e iSCSI SAN storage array (16T). I use it for database storage and for the home directories of regular users. The storage array is mounted on the master node through an iSCSI initiator, at the mount point /mnt/ps4000e/. The sub-directory /mnt/ps4000e/home is then exported across the cluster, so each node has access to the same home directory and everyday users do not need to move their data files between nodes. NFS provides the network-based mounting. The NFS server and client are easy to install by following the guideline at http://en.gentoo-wiki.com/wiki/NFS/Server. Data transfer goes via the IPoIB mechanism. But since we have an Infiniband network, we can use RDMA instead, and NFS/RDMA achieves much faster speed. Here is my experience setting up NFS/RDMA.

Step 1: Kernel compilation
1) Requirements for NFS Server/Client
For the server node, turn on File systems/Network File Systems/NFS server support.
For the client node, turn on File systems/Network File Systems/NFS client support.
2) Requirements for RDMA support
Drivers for Infiniband should be compiled as modules, as described in a previous post. Check that RDMA support is enabled: make sure that SUNRPC_XPRT_RDMA in the .config file is set to m (module).

Step 2: emerge net-fs/nfs-utils
Version 1.2.3-r1 is installed. The portmap package is no longer needed; rpcbind will be installed as a dependency instead. If you see an error message saying that the nfs-utils package is blocked by portmap, un-emerge portmap first. If portmap is pulled in by ypserv, un-emerge the ypserv and ypbind packages first, and emerge ypserv and ypbind again after nfs-utils has been installed.

Step 3: Create the mount point.
Edit the /etc/exports file and add the following line:

# /etc/exports: NFS file systems being exported. See exports(5).
/mnt/ps4000e/home 10.0.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_subtree_check,no_root_squash)

The option insecure is important here because the NFS/RDMA client does not use a reserved port.

Step 4: Load necessary modules.
On the server node, svcrdma is needed. On the client node, xprtrdma is needed. I added them to the /etc/init.d/nfs script file. Put the following lines into an appropriate place in the init.d script.

# svcrdma: server-side module for NFS/RDMA
# xprtrdma: client-side module for NFS/RDMA
/sbin/modprobe svcrdma > /dev/null 2>&1
/sbin/modprobe xprtrdma > /dev/null 2>&1

Remember to unload them when stopping the services, or add the corresponding rmmod commands to the script.

Step 5: Instruct the server to listen on the RDMA transport.

echo "rdma 20049" > /proc/fs/nfsd/portlist

I added it into the nfs script as well.

Step 6: Start the NFS service

/etc/init.d/nfs start

Or add the script to the default run level.

rc-update add nfs default

Step 7: Mount the file system on the client node.
First, ensure that the module xprtrdma has been loaded.

modprobe xprtrdma

Then, use the following command to mount the NFS/RDMA server:

mount -o rdma,port=20049 10.0.0.1:/mnt/ps4000e/home /mnt/ps4000e/home

To verify that the mount is using RDMA, run cat /proc/mounts to check the proto field.
Alternatively, for automatic mounting during boot-up, add the following record to the file /etc/fstab.

10.0.0.1:/mnt/ps4000e/home /mnt/ps4000e/home nfs _netdev,proto=rdma,port=20049 0 2

Use the init.d script netmount to mount the NFS/RDMA server.
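The /proc/mounts check from Step 7 can also be scripted. The Python sketch below looks up the proto option for a given mount point; the sample line is made up for illustration (the exact option list on a real system will differ), and mount_proto is a hypothetical helper name.

```python
def mount_proto(mounts_text, mount_point):
    """Return the value of the proto= option for mount_point, or None if absent."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            for option in fields[3].split(","):
                if option.startswith("proto="):
                    return option[len("proto="):]
    return None


# Made-up sample in /proc/mounts format.
sample = (
    "rootfs / rootfs rw 0 0\n"
    "10.0.0.1:/mnt/ps4000e/home /mnt/ps4000e/home nfs "
    "rw,relatime,vers=3,proto=rdma,port=20049,addr=10.0.0.1 0 0\n"
)
print(mount_proto(sample, "/mnt/ps4000e/home"))
```

On a client node, the same function can be fed the contents of /proc/mounts instead of the sample string; a result of rdma means the mount is using the RDMA transport.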

Infiniband Installation on Gentoo (II)

July 10th, 2011

In a previous post, I wrote the first part of my experience installing Infiniband adapters on a Gentoo cluster. Recently, I upgraded the system and found that I had forgotten the details of the installation and setup. So I need to write down what I have done.

Step 1: Turn on the infiniband modules in the kernel as discussed in the previous post.

Step 2: Emerge the necessary packages. They used to be in the science overlay, but have now (July 2011) moved to the main tree under the sys-infiniband category. On my cluster system, the following packages were installed.


sys-infiniband/dapl-2.0.32
sys-infiniband/infiniband-diags-1.5.8
sys-infiniband/libibcm-1.0.5
sys-infiniband/libibcommon-1.1.2_p20090314
sys-infiniband/libibmad-1.3.7
sys-infiniband/libibumad-1.3.7
sys-infiniband/libibverbs-1.1.4
sys-infiniband/libipathverbs-1.2
sys-infiniband/libmlx4-1.0.1
sys-infiniband/libmthca-1.0.5-r2
sys-infiniband/libnes-1.1.1
sys-infiniband/librdmacm-1.0.14.1
sys-infiniband/libsdp-1.1.108
sys-infiniband/openib-1.4
sys-infiniband/openib-files-1.5.3.1
sys-infiniband/opensm-3.3.9
sys-infiniband/perftest-1.3.0

Step 3: Edit the configuration file, /etc/infiniband/openib.conf. The following is the content of my configuration file.


# Start HCA driver upon boot
ONBOOT=yes

# Load UCM module
UCM_LOAD=no

# Load RDMA_CM module
RDMA_CM_LOAD=yes

# Load RDMA_UCM module
RDMA_UCM_LOAD=yes

# Increase ib_mad thread priority
RENICE_IB_MAD=no

# Load MTHCA
MTHCA_LOAD=no

# Load IPATH
IPATH_LOAD=no

# Load eHCA
EHCA_LOAD=no

# Load MLX4 modules
MLX4_LOAD=yes

# Load IPoIB
IPOIB_LOAD=yes

# Enable IPoIB Connected Mode
SET_IPOIB_CM=yes

# Enable IPoIB High Availability daemon
# Xianlong Wang
#IPOIBHA_ENABLE=yes

#PRIMARY_IPOIB_DEV=ib0
#SECONDARY_IPOIB_DEV=ib1

# Load SDP module
#SDP_LOAD=yes

# Load SRP module
#SRP_LOAD=no

# Enable SRP High Availability daemon
#SRPHA_ENABLE=no

# Load ISER module
#ISER_LOAD=no

# Load RDS module
#RDS_LOAD=no

# Load VNIC module
#VNIC_LOAD=yes

Step 4: Edit the init.d script, /etc/init.d/openib. This is the important part. The original script does not seem to load all the necessary modules, or does not load them in the right order. After all the if-clauses that set POST_LOAD_MODULES, change the following:

PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
POST_UNLOAD_MODULES="$PRE_UNLOAD_MODULES ib_ipoib ib_sa ib_uverbs ib_umad"

to the following (pay attention to the lines marked with my name in the comments):

#Xianlong Wang
# svcrdma: server-side module for NFS/RDMA
# xprtrdma: client-side module for NFS/RDMA

POST_LOAD_MODULES="$POST_LOAD_MODULES svcrdma xprtrdma"

#Xianlong Wang
#add ib_ipoib before ib_cm

#PRE_UNLOAD_MODULES="ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_local_sa findex"
# add xprtrdma module for NFS server and client
PRE_UNLOAD_MODULES="xprtrdma svcrdma ib_rds ib_ucm kdapl ib_srp_target scsi_target ib_srp ib_iser ib_sdp rdma_ucm rdma_cm ib_addr ib_ipoib ib_cm ib_local_sa findex"

# Xianlong Wang

if [ "X${MLX4_LOAD}" == "Xyes" ]; then
PRE_UNLOAD_MODULES="mlx4_en mlx4_ib mlx4_core ${PRE_UNLOAD_MODULES}"
fi


In the start() function, after einfo "Loading HCA and Access Layer drivers", add the following to load the necessary modules:


# Xianlong Wang, hard-coded
if [[ "${MLX4_LOAD}" == "yes" ]]; then
    /sbin/modprobe mlx4_core > /dev/null 2>&1
    rc=$(( rc + $? ))
    /sbin/modprobe mlx4_ib > /dev/null 2>&1
    rc=$(( rc + $? ))
    /sbin/modprobe mlx4_en > /dev/null 2>&1
    rc=$(( rc + $? ))
fi

Step 5: Add the init.d scripts, openib and opensm, to the default runlevel.

rc-update add openib default
rc-update add opensm default

Step 6: Edit the /etc/conf.d/net file for the IPoIB settings. Create the symbolic link /etc/init.d/net.ib0 pointing to /etc/init.d/net.lo.

config_ib0=("10.0.0.1/24")
routes_ib0=("default via 10.0.0.1")
nis_domain_ib0="abc"
nis_servers_ib0="10.0.0.1"

Then add net.ib0 to default run level.

rc-update add net.ib0 default

After rebooting, check the port status by running ibstatus. The following output is given:

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbd
base lid: 0x1
sm lid: 0x6
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbe
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
link_layer: InfiniBand
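
When several nodes have to be checked, it can help to parse this output rather than read it by eye. The Python sketch below turns ibstatus text into a dictionary keyed by device and port, and lists the ports whose state is ACTIVE. The function name parse_ibstatus is made up; the sample text is the output shown above.

```python
import re


def parse_ibstatus(text):
    """Map (device, port) to a dict of the 'key: value' fields reported by ibstatus."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Infiniband device '([^']+)' port (\d+) status:", line)
        if header:
            current = ports.setdefault((header.group(1), int(header.group(2))), {})
        elif current is not None and ":" in line:
            # Split on the first colon only; values such as GIDs contain colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


sample = """
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbd
base lid: 0x1
sm lid: 0x6
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:f04d:a290:9778:efbe
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 70 Gb/sec (4X)
link_layer: InfiniBand
"""

ports = parse_ibstatus(sample)
active = [port for port, info in ports.items()
          if info.get("state", "").endswith("ACTIVE")]
print(active)
```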


Use ifconfig to check the IP address. The following output is given.

ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:43233 errors:0 dropped:0 overruns:0 frame:0
TX packets:44438 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:5280799 (5.0 MiB) TX bytes:2771849 (2.6 MiB)

There is still a problem. ibv_devinfo outputs the following error message.

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
Failed to open device

Things Seen and Heard in Hunan

February 13th, 2011

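1. People in the mountains love to build houses, rich or poor alike. This is an older two-story wooden building (once the village primary school, now unused); only very few villagers still live in small buildings like this. Most have been rebuilt as concrete-frame structures of two to five stories. Villagers who are less well off build in several stages and complete the whole project over many years.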

Two-story wooden building

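2. In a typical villager's home, the ancestral tablets stand in the center of the ground floor; quite a few households also enshrine Mao Zedong.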

Mao Zedong

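3. Hunan is hilly country. The valleys get plentiful rain, the soil is fertile and the temperature is high, all of which suits rice growing. Perhaps it was this environment that produced the rice expert Yuan Longping. This poster introduces one of his experimental rice fields: Anjiang.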

Yuan Longping

Infiniband Installation on Gentoo

December 7th, 2010

My college recently purchased a Dell M610 cluster and I am in charge of its administration. The cluster consists of 8 nodes, and each node has two 1Gb Ethernet cards and one Infiniband card (or whatever it should be called). The nodes are connected through two backplane Dell PowerConnect 6220 Ethernet switches and one Mellanox M3601Q switch in the chassis. The Infiniband switch does not come with a subnet manager.

I decided to choose Gentoo as the base system for a clean and slim installation; its package management system, portage, is a particular attraction. Basic system installation was no problem and the Ethernet card setup was easy. But Infiniband was big trouble at the beginning because the OFED package from either Mellanox or OpenFabrics only supports Red Hat and SUSE Linux boxes. There is not much documentation available and all the packages are rpm packages. The official portage tree does not include any Infiniband packages. The Gentoo science overlay has a sys-infiniband category, but the packages do not seem well curated and have many issues. I spent quite a lot of time figuring out a solution. Here I record what I did, for future reference and for anyone who might encounter the same problem.

Step 1: Check the hardware information

lspci

The information regarding the Infiniband adapter is as follows:

04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s – IB QDR / 10GigE] (rev b0)

Check whether it is supported by the Linux kernel at http://kmuto.jp/debian/hcl/.

It is supported from kernel v2.6.25 onward and uses the “mlx4_core” driver.

Step 2: Kernel compilation

I use the gentoo kernel source 2.6.34-gentoo-r12.  Other versions higher than 2.6.25 should be fine, I assume.

a) Device Drivers -> set “Infiniband support” as module. Under “Infiniband support”, set the following ones as modules: “Infiniband userspace MAD support, Infiniband userspace access (verbs and CM), Mellanox ConnectX HCA support, IP-over-Infiniband, Infiniband SCSI RDMA Protocol, iSCSI Extensions for RDMA (iser)”. Set “IP-over-InfiniBand Connected Mode Support” as built-in. For other Infiniband cards, choose the appropriate driver instead of “Mellanox ConnectX HCA”.

b) Device Drivers -> Network device support: enable Ethernet (10000 Mbit) (the Gigabit Ethernet card has already been configured). Set “Mellanox Technologies ConnectX 10G support” as module. This provides the driver, mlx4_en.

c) Run “make && make modules_install” to compile the kernel.

So far, we have a ready kernel.  After rebooting, using the command, lsmod, you should see “mlx4_core” being loaded.
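
The lsmod check can be scripted as well. The Python sketch below reports which of a list of required modules are missing from lsmod output; the sample output, including its sizes, is made up for illustration, and missing_modules is a hypothetical helper name.

```python
def missing_modules(lsmod_text, required):
    """Return the required module names that do not appear in lsmod output."""
    # The first line of lsmod output is the "Module Size Used by" header.
    rows = lsmod_text.strip().splitlines()[1:]
    loaded = {row.split()[0] for row in rows if row.strip()}
    return [name for name in required if name not in loaded]


# Made-up lsmod output; real sizes and dependencies will differ.
sample = """\
Module                  Size  Used by
mlx4_ib                45000  0
mlx4_core             170000  1 mlx4_ib
ib_core                60000  1 mlx4_ib
"""
print(missing_modules(sample, ["mlx4_core", "mlx4_en"]))
```

On a real node, the text can come from running lsmod, for example via subprocess.run(["lsmod"], capture_output=True, text=True).stdout.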

Calligraphy Practicing

October 25th, 2010