Disk space problem
Message boards : Development : Disk space problem

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 320
Credit: 16,289,224
RAC: 4,133
Italy
Message 213 - Posted: 31 Dec 2013, 12:19:09 UTC

Our project needs to transmit and store a large amount of data on the client's local disk, which may cause serious bandwidth and disk-space problems.

Every workunit needs an input file of about 70 to 300 MB. We started using 'sticky files', i.e. files that are supposedly downloaded once and used by a lot of workunits. By 'a lot' I mean thousands of workunits, i.e. lasting for months before being replaced. Unfortunately this is not our case: those input files are shared among too few workunits, so they accumulate over time on the client's disk.

We have to change this somehow. There are two possible approaches:
1- Transmit the needed input data attached to each workunit, compressing and reducing its size as best we can, thus avoiding sticky files altogether.
2- Find a way to identify all the (unique) data that a workunit may need for a very long time, compress it into a sticky file and send it to the client.

Suggestions are obviously welcome.

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 130
Credit: 913,090
RAC: 1,467
Italy
Message 217 - Posted: 2 Jan 2014, 9:23:31 UTC - in response to Message 213.

There are two possible approaches:
1- Transmit the needed input data attached to each workunit, compressing and reducing its size as best we can, thus avoiding sticky files altogether.
2- Find a way to identify all the (unique) data that a workunit may need for a very long time, compress it into a sticky file and send it to the client.


I'm not sure, but I think Rosetta@home uses method number 1.

marco giglio
Volunteer moderator
Project developer
Joined: 12 Nov 13
Posts: 20
Credit: 1,708
RAC: 0
Italy
Message 245 - Posted: 4 Jan 2014, 11:04:51 UTC - in response to Message 217.

As written some days ago in another post, I made some attempts at compressing the input files. With the current inputs the best compression is of course LZMA, but it takes far too much time to compress one file. bzip2 achieves almost the same compression ratio in less time (in exchange for a large amount of memory, if I'm not wrong). The compressed files are ~38% of the size of the originals.
The application will need to be modified in order to decompress files on Windows (on Linux bzip2 should already be available on all machines, and we can state the dependency on the website).
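The bzip2-vs-LZMA trade-off described above can be sanity-checked with Python's standard-library codecs. A minimal sketch, assuming a synthetic CSV-like payload stands in for a real project input file:

```python
import bz2
import lzma
import time

def compare_compression(data: bytes) -> dict:
    """Compress the same payload with bzip2 and LZMA, reporting
    the compressed size and wall-clock time for each codec."""
    results = {}
    for name, compress in (("bz2", bz2.compress), ("lzma", lzma.compress)):
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {"size": len(compressed), "seconds": elapsed}
    return results

# Synthetic CSV-like payload standing in for a real input file.
sample = b"0.123456,1.234567,2.345678,3.456789\n" * 50_000
for name, stats in compare_compression(sample).items():
    print(f"{name}: {stats['size']} bytes in {stats['seconds']:.2f} s")
```

On real inputs the ratios and timings will differ, but the shape of the trade-off (LZMA slightly smaller, noticeably slower) should be visible even on a toy payload.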

Another improvement I discussed with Daniele is the possibility of writing inputs in binary instead of CSV. That should decrease the file size; however, I also expect the compression factor of a binary file to be lower than the one achieved on the CSV file, so we should run experiments on that.

The last improvement concerns stickiness. If I'm not wrong, right now from a 70 MB input file we extract 4 to 6 WUs, each lasting ~2 hours. We could increase the duration of each WU so that the input file is a dependency of a single WU (or 2 WUs if we prefer a shorter computation time). That way we wouldn't need sticky files at all.
However, it depends on the size of the input files. If an input file is ~200 MB (boboviz and I computed a WU coming from such a big file) things are different, since the computation time and the number of dependent WUs grow.

marco giglio
Volunteer moderator
Project developer
Joined: 12 Nov 13
Posts: 20
Credit: 1,708
RAC: 0
Italy
Message 247 - Posted: 4 Jan 2014, 14:58:26 UTC
Last modified: 4 Jan 2014, 15:01:05 UTC

Ok, I ran some experiments on substituting binary input files for the CSV ones; here are the results.

Files used for testing:
separate_At_AT114 ~ 70 MB
separate_At2_AT106 ~ 71 MB
At2_all_obs_script.csv ~ 242 MB

Size of the files when compressed using bzip2
separate_At_AT114.bz2 ~ 24 MB
separate_At2_AT106.bz2 ~ 25 MB
At2_all_obs_script.csv.bz2 ~ 85 MB

I wrote a program that parses the CSV files and writes binary files in which the data are represented as floating-point variables.
Each value in the CSV file occupies, on average, 8 bytes; each floating-point value takes 4 bytes, so we obtain files that are about half the size of the CSVs.
separate_At_AT114.bin ~ 35 MB
separate_At2_AT106.bin ~ 36 MB
At2_all_obs_script.csv.bin ~ 121 MB
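A conversion along these lines can be sketched with Python's struct module. This is a minimal illustration, not the project's actual converter; the 4-byte little-endian float32 packing matches the "8 bytes of text down to 4 bytes" arithmetic above:

```python
import struct

def csv_to_binary(csv_text: str) -> bytes:
    """Parse comma-separated decimal values and pack each one as a
    4-byte little-endian float32, roughly halving ~8-byte text fields."""
    values = [float(field)
              for line in csv_text.splitlines()
              for field in line.split(",") if field]
    return struct.pack(f"<{len(values)}f", *values)

def binary_to_values(blob: bytes) -> list:
    """Unpack the float32 stream back into Python floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

# Round-trip check on values that are exactly representable in float32.
row = "0.125,1.5,2.75,3.0\n"
blob = csv_to_binary(row * 3)
assert len(blob) == 4 * 4 * 3          # 4 floats per row, 4 bytes each
assert binary_to_values(blob)[:4] == [0.125, 1.5, 2.75, 3.0]
```

Note that float32 packing is lossy for values needing more than ~7 significant digits, which is a separate trade-off from the size question.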

Good! Then I compressed these binary files with bzip2, and here is what I obtained:
separate_At_AT114.bin.bz2 ~ 23 MB
separate_At2_AT106.bin.bz2 ~ 24 MB
At2_all_obs_script.csv.bin.bz2 ~ 83 MB

If you compare the compressed binaries with the compressed CSVs you can see there is very little difference, so I don't think it is worth moving to binary inputs.

Another possibility is to exploit correlation in the data.
The current CSV is written as
data1,data2,data3,data4
I don't know how correlated these data are, but they seem to be close to one another, so we could try to write a new input CSV as follows:
data1,data1-data2,data2-data3,data3-data4...
Doing so we maximize the number of 0 characters, thus increasing the compression factor.
This is just an idea, and honestly I'm not so confident about it either, but it is a possibility...
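The delta idea above can be prototyped in a few lines. A sketch with a synthetic slowly-drifting series; whether the real project data behaves this way is exactly the open question:

```python
import bz2

def delta_encode(values):
    """Keep the first value, then store successive differences;
    a slowly varying series turns into many small, repetitive numbers."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by cumulative summation."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Slowly drifting synthetic series: the deltas are all identical, so the
# delta-encoded CSV text is highly repetitive and compresses far better.
series = [1000 + i * 0.5 for i in range(20_000)]
plain = ",".join(f"{v:.3f}" for v in series).encode()
delta = ",".join(f"{v:.3f}" for v in delta_encode(series)).encode()
print(len(bz2.compress(plain)), len(bz2.compress(delta)))
```

On real, noisier data the gain would be smaller, and formatting the deltas with fixed precision can lose information, so a round-trip check against the original values would be needed before adopting this.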


Copyright © 2017 CNR-TN & UniTN