Recently, I have needed to process large numbers of SDF files, and some of them are too big (>500 MB) to load into memory all at once. An immediate solution is to split these big files into smaller chunks. The two commands that come with Linux, split and csplit, seem incapable of meeting this need. split is convenient for splitting a text file by lines; but an SDF file can contain many molecular records of varying length, so split will break the integrity of the records, and one has to manually fix the head and tail of each resulting file. csplit is more flexible and can split a file according to patterns, but its weak point is that there is no way to specify how many pattern matches to skip before splitting. As a result, if I use "$$$$" as the record delimiter to csplit an SDF file, it will put each molecular record into its own file. There are just too many of them! That is not what I want. (One could use cat to concatenate them back together, but that is too troublesome because of the large number of files.)
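For example, a naive GNU csplit command like the one below (shown only for illustration) cuts right after every delimiter line, leaving roughly one file per record:

csplit my.sdf '/\$\$\$\$/+1' '{*}'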
I wrote clsplit to meet this need. It is available here. The basic idea behind this script is to simulate the manual fix-up that is needed after splitting a file by a fixed number of lines. It can be called using the following syntax:
clsplit PATTERN line_number file_name
PATTERN must be a valid grep pattern. For example, if I want to split a big SDF file, the following command can be used:
clsplit \$\$\$\$ 10000 my.sdf
The resulting files usually do not contain exactly 10000 lines, but who cares about an exact line count! What matters is preserving the integrity of each record.
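For anyone curious about the mechanics, here is a minimal sketch of the same idea in Python (an illustration only, not the clsplit script itself; the function name, output-file naming, and command-line handling are made up for the example): write lines into the current chunk, and start a new output file only after the line budget is spent and the line just written matches the delimiter, so no record is ever cut in half.

import re
import sys

def clsplit(pattern, max_lines, path):
    """Split `path` into chunks of roughly `max_lines` lines, cutting only
    right after a line that matches `pattern`, so records stay intact."""
    regex = re.compile(pattern)
    chunk, lines_in_chunk, out = 0, 0, None
    with open(path) as src:
        for line in src:
            if out is None:
                out = open(f"{path}.{chunk:03d}", "w")
            out.write(line)
            lines_in_chunk += 1
            # Roll over only when the line budget is used up AND the line
            # just written is a record delimiter.
            if lines_in_chunk >= max_lines and regex.search(line):
                out.close()
                out, chunk, lines_in_chunk = None, chunk + 1, 0
    if out is not None:
        out.close()

if __name__ == "__main__":
    # e.g.  python clsplit_sketch.py '\$\$\$\$' 10000 my.sdf
    clsplit(sys.argv[1], int(sys.argv[2]), sys.argv[3])

Because the roll-over check runs after each line is written, the chunks slightly overshoot the requested line count rather than undershoot it, which is why the output never lands on exactly 10000 lines.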