Archive for August, 2011

clsplit: Combination of 'split' and 'csplit'

Monday, August 15th, 2011

Recently, I needed to process a large number of SDF files, and some of them are too big (>500MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands that come with Linux, split and csplit, seem unable to meet this need. split is convenient for splitting a text file by lines; but an SDF file can contain many different molecular records, and the records do not have a fixed size, so split will break the integrity of the molecular records. One has to manually fix the heads and tails of each resulting file. csplit is more flexible and can split a file according to patterns. But its weak point is that there is no way to specify how many matched patterns to skip before splitting. As a result, if I use “$$$$” as the record delimiter to csplit an SDF file, it will put each molecular record into its own file. There are just too many of them! That is not what I want. (One could use cat to concatenate them together, but that is too troublesome because of the large number of files.)
I wrote clsplit to meet this need. It is available here. The basic idea behind this script is to simulate the fixing work that follows splitting the file by a specified number of lines. It can be called using the following syntax.
clsplit PATTERN line_number file_name
PATTERN must be a valid pattern for grep. For example, if I want to split a big SDF file, the following command can be used.
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not have exactly 10000 lines, but who cares about an exact line count? What matters more is preserving the integrity of each record.
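The idea can be illustrated with a minimal Python sketch. This is not the clsplit script itself: the function name split_on_records is made up for illustration, it uses Python regular expressions rather than grep patterns, and it assumes each chunk should run past the requested line count until the next delimiter line (the real script may cut on the other side of that boundary).

```python
import re
from typing import Iterable, Iterator, List


def split_on_records(lines: Iterable[str], pattern: str, min_lines: int) -> Iterator[List[str]]:
    """Yield chunks of at least `min_lines` lines, each ending on a delimiter line.

    Hypothetical helper for illustration; not the author's clsplit.
    """
    delimiter = re.compile(pattern)
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        # Close the chunk only on a record delimiter, and only once it is long enough.
        if len(chunk) >= min_lines and delimiter.search(line):
            yield chunk
            chunk = []
    if chunk:
        # Trailing records that never reached min_lines still form a final chunk.
        yield chunk


if __name__ == "__main__":
    # Toy input: three "records" of different lengths, each closed by $$$$.
    demo = ["a\n", "b\n", "$$$$\n", "c\n", "$$$$\n", "d\n", "e\n", "f\n", "$$$$\n"]
    for number, chunk in enumerate(split_on_records(demo, r"^\$\$\$\$", 4)):
        print(number, len(chunk))
```

Because the input is consumed line by line, passing an open file object instead of a list keeps memory use bounded by the size of one chunk.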

Gentoo Cluster: Gamess Installation with MVAPICH2 and PBS

Monday, August 1st, 2011

Gamess is an electronic structure calculation package. Its installation is easy if you just want to use the “sockets” communication mode: just emerge it as you regularly do, then use “rungms” to submit your job. The default rungms is fine for running the serial code; for parallel computation, you still need to tune the script slightly. But since our cluster has Infiniband installed, it is better to go with the “mpi” communication mode. It took me quite some time to figure out how to install it correctly and make it run with mpiexec.hydra alone or with OpenPBS (Torque). Here is how I did it.

Software packages related:
1. gamess-20101001.3 (Download it beforehand from its developer’s website)
2. mvapich2-1.7rc1. (Previous versions should be okay and I installed it under /usr/local/)
3. OFED-1.5.3.2. (Userspace libraries for Infiniband. See my previous post. Only updated kernel modules installed. Userspace libraries should be the same as in OFED-1.5.3.1)
4. torque-2.4.14 (OpenPBS)

Steps
1. Update the gamess-20101001.3.ebuild with this one and manifest it.
2. Unmask the mpi USE flag for gamess in /usr/portage/profiles/base/package.use.mask.
3. Add sci-chemistry/gamess mpi to /etc/portage/package.use; then emerge -av gamess.
4. Update rungms with this one;
5. Create a new script pbsgms as this one;
6. Add kernel.shmmax=XXXXX to /etc/sysctl.conf, in which XXXXX is an integer large enough for shared memory (the default value of 32MB is too small for DDI). Run /sbin/sysctl -w kernel.shmmax=XXXXX to update the setting on the fly.
Added on Sept. 9, 2011. It seems that kernel.shmall=XXXXX should be modified as well. Please bear in mind that the unit for kernel.shmall is pages, while the unit for kernel.shmmax is bytes. A page is usually 4096 bytes (use getconf PAGE_SIZE to verify).

7. Environment setting. Create a file /etc/env.d/99gamess

GMS_TARGET=mpi
GMS_SCR=/tmp/gamess
GMS_HOSTS=~/.hosts
GMS_MPI_KICK=hydra
GMS_MPI_PATH=/usr/local/bin

Then update your profile.
8. Create a hostfile, ~/.hosts

node1
node2
...

This file is only needed when invoking rungms directly.

9. Test your installation: copy a test job input file, exam20.inp, found under /usr/share/gamess/tests/; submit the job using pbsgms exam20 (other settings will be prompted), or using rungms exam20 00 4.
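For the shared memory settings in step 6, the bytes-versus-pages difference is easy to get wrong, so here is a small Python sketch of the arithmetic. The 4 GiB target is a hypothetical value chosen for illustration, not a recommendation from this post, and the helper name shmall_pages is made up. Since kernel.shmall limits the total shared memory in pages, it should be at least kernel.shmmax divided by the page size, rounded up.

```python
import os


def shmall_pages(shmmax_bytes: int, page_size: int) -> int:
    """Smallest number of pages that covers shmmax_bytes (ceiling division)."""
    return -(-shmmax_bytes // page_size)


if __name__ == "__main__":
    # Same value that `getconf PAGE_SIZE` reports (Unix only).
    page_size = os.sysconf("SC_PAGE_SIZE")
    # Hypothetical target of 4 GiB; pick a size that suits your own jobs.
    shmmax = 4 * 1024**3
    print(f"kernel.shmmax={shmmax}")
    print(f"kernel.shmall={shmall_pages(shmmax, page_size)}")
```

The two printed lines have the same form as the entries that go into /etc/sysctl.conf.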

Explanations
1. Two changes were made on the ebuild file.
(a) The installation suggestions given in the Gamess documentation are not enough. More libraries than just mpich need to be passed to lked, the linker program for Gamess.
(b) MPI environment constants need to be exported to the installation program, compddi, through a temporary file, install.info.
2. Many changes were made to the script rungms. I cannot remember all of them. Some are as follows.
(a) For parallel computation, the scratch file will be put under /tmp on each node by default.
(b) The script will be working with pbsgms.
(c) System-wide setting for Gamess can be put under /etc/env.d.
(d) A host file is needed if not using PBS. By default, it should be at ~/.hosts. If it is not found, the job runs on the local host only.
3. The script pbsgms is based on sge-pbs shipped with the Gamess installation package. I have made it work with Torque. Numerous changes were made.
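The host file fallback described in 2(d) can be sketched as follows. This is only an illustration in Python, not code taken from rungms; the function name load_hosts is made up, and treating an empty host file the same as a missing one is my assumption.

```python
import os
import socket
from typing import List, Optional


def load_hosts(path: Optional[str] = None) -> List[str]:
    """Return the host names listed in a host file, or only the local host.

    Hypothetical helper; the path defaults to $GMS_HOSTS, then ~/.hosts.
    """
    hostfile = os.path.expanduser(path or os.environ.get("GMS_HOSTS", "~/.hosts"))
    try:
        with open(hostfile) as handle:
            # One host name per line; blank lines are ignored.
            hosts = [line.strip() for line in handle if line.strip()]
    except FileNotFoundError:
        hosts = []
    # Assumption: a missing or empty host file means "run locally".
    return hosts or [socket.gethostname()]


if __name__ == "__main__":
    print(load_hosts())
```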