Recently I needed to process a large number of SDF files, and some of them were too big (>500 MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands split and csplit, which come with Linux, do not quite meet this need. split is convenient for splitting a text file by line count, but an SDF file contains many molecular records and the records have no fixed size, so split will break the integrity of the records. One then has to fix the head and tail of each resulting file by hand. csplit is more flexible and can split a file according to patterns. Its weak point is that there is no way to specify how many matched patterns to skip before splitting. As a result, if I use “$$$$” as the record delimiter to csplit an SDF file, every molecular record ends up in its own file. There are just too many of them! That is not what I want. (One could use cat to concatenate them, but that is too troublesome because of the large number of files.)
I wrote clsplit to meet this need. It is available here. The basic idea behind the script is to simulate the fixing work that is needed after splitting the file by a given number of lines. It can be called with the following syntax:
clsplit PATTERN line_number file_name
PATTERN must be a valid pattern for grep. For example, to split a big SDF file, the following command can be used:
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not contain exactly 10000 lines, but who cares about an exact line count? What matters more is preserving the integrity of each record.
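The idea of cutting only at record boundaries can be illustrated with a short awk-based sketch. This is not the clsplit script itself: the function name split_records, the optional output prefix, and the use of an awk regular expression (rather than a grep pattern) are assumptions made for illustration. It writes lines to the current chunk and starts a new chunk only after a delimiter line has been seen and at least the requested number of lines has been written.

```shell
# Hypothetical sketch, not the author's clsplit.
# Usage: split_records PATTERN MIN_LINES FILE [PREFIX]
# PATTERN is an awk regular expression matching the record delimiter line.
split_records() (
    pattern=$1
    min_lines=$2
    file=$3
    prefix=${4:-chunk_}
    # Pass the pattern through the environment so awk does not
    # reinterpret backslashes the way it would with -v.
    PAT="$pattern" awk -v n="$min_lines" -v prefix="$prefix" '
        BEGIN { part = 0; count = 0; out = sprintf("%s%03d", prefix, part) }
        {
            print > out
            count++
            # Cut only right after a delimiter line, and only once
            # at least n lines have gone into the current chunk.
            if (count >= n && $0 ~ ENVIRON["PAT"]) {
                close(out)
                part++
                count = 0
                out = sprintf("%s%03d", prefix, part)
            }
        }
    ' "$file"
)
```

For an SDF file, something like split_records '^[$][$][$][$]' 10000 my.sdf my_ would produce my_000, my_001, and so on, each ending on a “$$$$” line and each holding at least 10000 lines (except possibly the last).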
Archive for August, 2011
clsplit: Combination of `split’ and `csplit’
Monday, August 15th, 2011
Gentoo Cluster: Gamess Installation with MVAPICH2 and PBS
Monday, August 1st, 2011
Gamess is an electronic structure calculation package. Its installation is easy if you just want to use the “sockets” communication mode: just emerge it as you regularly do, then use “rungms” to submit your job. The default rungms is fine for running the serial code. For parallel computation, you still need to tune the script slightly. But since our cluster has Infiniband installed, it is better to go with the “mpi” communication mode. It took me quite some time to figure out how to install it correctly and make it run with mpiexec.hydra alone or with OpenPBS (Torque). Here is how I did it.
Related software packages:
1. gamess-20101001.3 (Download it beforehand from its developer’s website)
2. mvapich2-1.7rc1 (Previous versions should be okay; I installed it under /usr/local/)
3. OFED-1.5.3.2. (Userspace libraries for Infiniband. See my previous post. Only updated kernel modules installed. Userspace libraries should be the same as in OFED-1.5.3.1)
4. torque-2.4.14 (OpenPBS)
Steps
1. Update gamess-20101001.3.ebuild with this one and manifest it.
2. Unmask the mpi USE flag for gamess in /usr/portage/profiles/base/package.use.mask.
3. Add sci-chemistry/gamess mpi to /etc/portage/package.use; then run emerge -av gamess.
4. Update rungms with this one.
5. Create a new script, pbsgms, like this one.
6. Add kernel.shmmax=XXXXX to /etc/sysctl.conf, where XXXXX is an integer large enough for shared memory (the default value of 32 MB is too small for DDI). Run /sbin/sysctl -w kernel.shmmax=XXXXX to update the setting on the fly.
Added on Sept. 9, 2011: It seems that kernel.shmall=XXXXX should be modified as well. Please bear in mind that the unit for kernel.shmall is pages, while the unit for kernel.shmmax is bytes. A page is usually 4096 bytes (use getconf PAGE_SIZE to verify).
7. Environment settings. Create a file /etc/env.d/99gamess with the following content:
GMS_TARGET=mpi
GMS_SCR=/tmp/gamess
GMS_HOSTS=~/.hosts
GMS_MPI_KICK=hydra
GMS_MPI_PATH=/usr/local/bin
Then update your profile (on Gentoo, typically env-update followed by source /etc/profile).
8. Create a host file, ~/.hosts, with one node name per line:
node1
node2
...
This file is only needed when invoking rungms directly.
9. Test your installation: copy a test job input file, exam20.inp, from /usr/share/gamess/tests/; submit the job using pbsgms exam20 (other settings will be prompted for), or run rungms exam20 00 4.
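For step 6, the relation between the two kernel settings can be sketched as follows. The helper name shmall_pages and the 4 GiB figure are assumptions chosen only for illustration; the right size depends on your jobs. The sketch converts a byte count for kernel.shmmax into the minimum number of pages that kernel.shmall must allow, rounding up.

```shell
# Hypothetical helper: pages needed to hold BYTES of shared memory.
# Usage: shmall_pages BYTES [PAGE_SIZE]
shmall_pages() {
    bytes=$1
    # Default to the system page size when none is given.
    page_size=${2:-$(getconf PAGE_SIZE)}
    # Integer division, rounded up.
    echo $(( (bytes + page_size - 1) / page_size ))
}

# Illustrative value only: 4 GiB of shared memory.
shmmax=$((4 * 1024 * 1024 * 1024))
echo "kernel.shmmax=$shmmax"
echo "kernel.shmall=$(shmall_pages "$shmmax")"
```

With 4096-byte pages, 4 GiB corresponds to 1048576 pages. The two printed lines have the form used in /etc/sysctl.conf; treat the shmall value as a lower bound, since it limits the total shared memory on the system rather than a single segment.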
Explanations
1. Two changes were made to the ebuild file.
(a) The installation suggestions given in the Gamess documentation are not enough. More libraries besides mpich need to be passed to lked, the linker program for Gamess.
(b) MPI environment constants need to be exported to the installation program, compddi, through a temporary file, install.info.
2. Many changes were made to the script rungms. I cannot remember all of them. Some are as follows.
(a) For parallel computation, the scratch files are put under /tmp on each node by default.
(b) The script works with pbsgms.
(c) System-wide settings for Gamess can be put under /etc/env.d.
(d) A host file is needed if not using PBS. By default, it should be at ~/.hosts. If it is not found, the job runs on the local host only.
3. The script pbsgms is based on sge-pbs, shipped with the Gamess installation package. I have made it work with Torque. Numerous changes were made.
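The host selection described in 2(d) can be sketched roughly as below. This is not taken from the modified rungms; the function name list_hosts is hypothetical, and the sketch only assumes that Torque exports PBS_NODEFILE inside a job and that GMS_HOSTS points at the host file, as in /etc/env.d/99gamess above.

```shell
# Hypothetical sketch of the host selection logic; not the actual rungms code.
list_hosts() {
    hostfile=${GMS_HOSTS:-$HOME/.hosts}
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        # Inside a Torque job: the node file may repeat a node once
        # per allocated core, so print each node name only once.
        sort -u "$PBS_NODEFILE"
    elif [ -r "$hostfile" ]; then
        # Outside PBS: use the host file, skipping blank lines.
        grep -v '^[[:space:]]*$' "$hostfile"
    else
        # No host file found: fall back to the local host only.
        uname -n
    fi
}
```

Calling list_hosts prints one node name per line, which could then be handed to the MPI launcher.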