Posts Tagged ‘linux’

clsplit: Combination of ‘split’ and ‘csplit’

Monday, August 15th, 2011

Recently, I needed to process a large number of SDF files, and some of them are too big (>500MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands that come with Linux, split and csplit, do not seem able to meet this need. split is convenient for splitting a text file by line count, but an SDF file can contain many molecular records and the records do not have a fixed size, so split will break records apart at chunk boundaries. One then has to fix the head and tail of each resulting file by hand. csplit is more flexible and can split a file according to patterns. Its weak point is that there is no way to specify how many matches to skip before splitting. As a result, if I use “$$$$” as the record delimiter to csplit an SDF file, it puts every molecular record into its own file. There are just too many of them! That is not what I want. (One could use cat to concatenate them again, but that is too troublesome because of the large number of files.)
I wrote clsplit to meet this need. It is available here. The basic idea behind the script is to automate the fixing work that is needed after splitting a file by a given number of lines. It can be called using the following syntax.
clsplit PATTERN line_number file_name
PATTERN must be a valid pattern for grep. For example, to split a big SDF file, the following command can be used.
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not have exactly 10000 lines, but who cares about an exact number of lines! What matters more is preserving the integrity of each record.
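The idea can be illustrated with a short sketch. This is not the clsplit script itself; it is a minimal approximation that uses awk instead of grep (so the pattern is an awk regular expression, which may differ slightly from grep's syntax). The function name split_records and the output prefix chunk_ are made up for this example. A chunk is closed only on a line that matches the delimiter, and only after at least the requested number of lines has been written.

```shell
#!/bin/sh
# Sketch only: NOT the original clsplit. split_records and the default
# prefix "chunk_" are hypothetical names chosen for this illustration.
# Usage: split_records PATTERN MIN_LINES FILE [PREFIX]
split_records() {
    pattern=$1
    min_lines=$2
    file=$3
    prefix=${4:-chunk_}

    # The pattern is passed through the environment so that awk does not
    # process backslash escapes in it (as it would with -v).
    PAT="$pattern" awk -v n="$min_lines" -v prefix="$prefix" '
        BEGIN { part = 0; count = 0; out = sprintf("%s%04d", prefix, part) }
        {
            print > out
            count++
            # Cut only on a delimiter line, and only once the current
            # chunk already holds at least n lines.
            if (count >= n && $0 ~ ENVIRON["PAT"]) {
                close(out)
                part++
                count = 0
                out = sprintf("%s%04d", prefix, part)
            }
        }
    ' "$file"
}
```

Called as split_records '\$\$\$\$' 10000 my.sdf, this would write chunk_0000, chunk_0001, and so on, each ending on a complete record.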

umount: device is busy

Wednesday, July 27th, 2011

Sometimes, when I try to umount a mounted device, the following error occurs.

xwang@node1 ~ $ umount /mnt/ps4000e/home
umount.nfs: /mnt/ps4000e/home: device is busy
umount.nfs: /mnt/ps4000e/home: device is busy

No one is logged in except me, I am not using that directory, and no other user’s job is running. It was a real mystery which process was causing the “device is busy” error. Searching Google, I found the solution at http://ocaoimh.ie/2008/02/13/how-to-umount-when-the-device-is-busy/.
The solution is to use fuser to find out which process is using the device.

xwang # fuser -m /mnt/ps4000e/home/
/mnt/ps4000e/home/: 7706c

See the manual (man fuser) for a full description of this command. I guess 7706 is the ID of the process that is currently using the mounted device. According to the man page, the trailing letter ‘c’ means that the process has its current directory on that file system. So I used ps to find out what the process is.

xwang # ps 7706
PID TTY STAT TIME COMMAND
7706 ? Ss 0:00 /usr/bin/orted --daemonize ....

Now the reason is obvious. I had started an MPI job earlier and it did not end normally. The orted is the administration process started by root, and it never exited. After I killed the process, the device could be unmounted.
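For reference, the letters that fuser appends to a process ID can be decoded with a small helper. The sketch below assumes the access-type letters documented in the fuser man page from psmisc; the function name describe_fuser_entry is made up for this example and is not part of any package.

```shell
#!/bin/sh
# Sketch only: describe_fuser_entry is a hypothetical helper.
# It takes one entry as printed by fuser (for example "7706c") and
# explains the access-type letters listed in fuser(1).
describe_fuser_entry() {
    entry=$1
    pid=${entry%%[!0-9]*}     # leading digits: the process ID
    flags=${entry#"$pid"}     # whatever follows: access-type letters
    desc=""
    while [ -n "$flags" ]; do
        rest=${flags#?}               # everything after the first letter
        letter=${flags%"$rest"}       # the first letter itself
        flags=$rest
        case $letter in
            c) meaning="current directory" ;;
            e) meaning="executable being run" ;;
            f) meaning="open file" ;;
            F) meaning="open file for writing" ;;
            r) meaning="root directory" ;;
            m) meaning="mmap'ed file or shared library" ;;
            *) meaning="unknown access type '$letter'" ;;
        esac
        desc="${desc:+$desc, }$meaning"
    done
    printf '%s: %s\n' "$pid" "${desc:-no access letter shown}"
}
```

For the entry shown above, describe_fuser_entry 7706c prints "7706: current directory".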