Just some thoughts about size and length of a workunit:
Length (expected CPU time): Ideally a couple of hours on an i7 Ivy Bridge or later (assuming one thread continuously dedicated to a single workunit).
Size (input): To minimize bandwidth, download the input data only once if a single input data file is valid for a large set of workunits.
Size (output): It is much better to have a single large (but not too large...) output file than a lot of small files. Compress it before transferring it back (use the gzip flag) if this proves to be useful.
Performance: Try to limit I/O (avoid putting too much stress on the local hard disk).
Buffer your data and write the largest blocks possible (preferably not after every iteration if iterations are very fast). The stdio routines buffer by default, but writing a byte at a time is still a lot slower than writing large blocks (avoid fprintf, prefer fwrite or WriteFile, and preferably avoid sprintf-like tools for formatting). Also be aware of I/O differences between operating systems. As a rule of thumb, do as few writes as possible, in step with the checkpoints.
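
To make the buffering advice concrete, here is a rough sketch in C of one way it could look. It is only an illustration, not how any particular project does it: the result_buffer type and the rb_* function names are made up for this post, and it assumes the results are plain doubles written in raw binary form (which is only safe if the file is read back on a machine with the same byte order).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical in-memory buffer for results; all names are illustrative. */
typedef struct {
    double *data;     /* values waiting to be written */
    size_t  count;    /* number of values currently buffered */
    size_t  capacity; /* maximum number of values held before a forced flush */
    FILE   *fp;       /* output file, opened in binary mode */
} result_buffer;

/* Allocate the buffer and open the output file. Returns 0 on success, -1 on failure. */
int rb_open(result_buffer *rb, const char *path, size_t capacity)
{
    if (capacity == 0) return -1;
    rb->data = malloc(capacity * sizeof *rb->data);
    if (rb->data == NULL) return -1;
    rb->fp = fopen(path, "wb");
    if (rb->fp == NULL) {
        free(rb->data);
        rb->data = NULL;
        return -1;
    }
    rb->count = 0;
    rb->capacity = capacity;
    return 0;
}

/* Write everything buffered so far with a single fwrite call. */
int rb_flush(result_buffer *rb)
{
    if (rb->count > 0) {
        size_t written = fwrite(rb->data, sizeof *rb->data, rb->count, rb->fp);
        if (written != rb->count) return -1;
        rb->count = 0;
    }
    return fflush(rb->fp) == 0 ? 0 : -1;
}

/* Store one value in memory; the disk is touched only when the buffer is full. */
int rb_push(result_buffer *rb, double value)
{
    if (rb->count == rb->capacity && rb_flush(rb) != 0) return -1;
    rb->data[rb->count++] = value;
    return 0;
}

/* Flush what is left, close the file and release the memory. */
int rb_close(result_buffer *rb)
{
    int rc = rb_flush(rb);
    if (fclose(rb->fp) != 0) rc = -1;
    free(rb->data);
    rb->data = NULL;
    rb->fp = NULL;
    return rc;
}
```

The idea would be to call rb_push inside the iteration loop and rb_flush only when the application writes a checkpoint, so the number of disk writes follows the number of checkpoints rather than the number of iterations. The capacity should be chosen large enough that a forced flush between checkpoints is rare.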