clsplit: Combination of `split` and `csplit`

Recently, I needed to process a large number of SDF files, and some of them are too big (>500 MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands that come with Linux, split and csplit, seem incapable of meeting this need. split conveniently splits a text file by lines, but an SDF file contains many molecular records of varying lengths, so split will break the integrity of the records, and one has to manually fix the head and tail of each resulting file. csplit is more flexible and can split a file according to patterns, but its weak point is that there is no way to specify how many pattern matches to skip before splitting. As a result, if I use "$$$$" as the record delimiter to csplit an SDF file, it puts each molecular record into its own file, and there are just too many of them! That is not what I want. (One could use cat to concatenate them back together, but that is too troublesome given the large number of files.)
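To illustrate the point: with GNU csplit, the closest one can get is something like the command below, where the +1 offset at least keeps the "$$$$" line attached to its record, but every single record still lands in its own xx## file.

csplit my.sdf '/^\$\$\$\$$/+1' '{*}'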
I wrote clsplit to meet this need. It is available here. The basic idea behind this script is to simulate the manual fixing work after splitting the file by a given number of lines. It can be called using the following syntax.
clsplit PATTERN line_number file_name
PATTERN must be a valid pattern for grep. For example, if I want to split a big SDF file, the following command can be used.
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not contain exactly 10000 lines each, but an exact line count hardly matters; what is important is that the integrity of each record is preserved.
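As an aside, the core idea (cut only at a record delimiter once a target line count has been reached) can be sketched in a few lines of awk. This is a hypothetical one-pass variant for illustration, not the actual clsplit script, and the output naming scheme (my.sdf.000, my.sdf.001, ...) is made up:

awk -v pat='^[$][$][$][$]$' -v n=10000 '
  NR == 1 { out = sprintf("%s.%03d", FILENAME, chunk) }
  { print > out; count++ }
  # start a new chunk only after enough lines have been written AND
  # the line just printed ends a record (matches the delimiter pattern)
  count >= n && $0 ~ pat { close(out); chunk++; count = 0
                           out = sprintf("%s.%03d", FILENAME, chunk) }
' my.sdf

Because the cut happens only on delimiter lines, each output file holds a whole number of records, which is exactly the property that split by itself cannot guarantee.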
