Disk space problem

Message boards : Development : Disk space problem

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 213 - Posted: 31 Dec 2013, 12:19:09 UTC

Our project needs too much data (transmitted to and stored on the client's local disk), and this may raise big bandwidth and space problems.

Every workunit needs an input file of about 70 to 300 MB. We started using 'sticky files', i.e. files that are supposedly downloaded once and used by a lot of workunits. By 'a lot' I mean thousands of workunits, i.e. lasting for months before being replaced. Unfortunately that is not our case: those input files are shared among too few workunits, so they accumulate over time on the client's disk.

We have to change this somehow. There are two possible approaches:
1- Transmit the needed input data attached to each workunit, compressing, squeezing and reducing its size as best we can, thus avoiding sticky files entirely.
2- Find a way to identify all the (unique) data that a workunit may need for a very long time, compress it into a sticky file and send it to the client.

Suggestions are obviously welcome.
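
As a rough illustration of approach 1, here is a minimal Python sketch (not the project's actual tooling; the function name and file names are only examples):

# Sketch of approach 1: compress each workunit's input before attaching it to
# the workunit, so no sticky file is needed. File names are illustrative.
import bz2
import shutil

def compress_input(path):
    # Write a bzip2-compressed copy of `path` and return the new file name.
    out_path = path + ".bz2"
    with open(path, "rb") as src, bz2.open(out_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    return out_path

# Example: compress_input("separate_At_AT114")  # -> "separate_At_AT114.bz2"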

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 182
Credit: 4,633,870
RAC: 28
Italy
Message 217 - Posted: 2 Jan 2014, 9:23:31 UTC - in response to Message 213.

There are two possible approaches:
1- Transmit the needed input data attached to each workunit, compressing, squeezing and reducing its size as best we can, thus avoiding sticky files entirely.
2- Find a way to identify all the (unique) data that a workunit may need for a very long time, compress it into a sticky file and send it to the client.


I'm not sure, but I think Rosetta@home uses method number 1.

marco giglio
Joined: 12 Nov 13
Posts: 20
Credit: 1,708
RAC: 0
Italy
Message 245 - Posted: 4 Jan 2014, 11:04:51 UTC - in response to Message 217.

As written some days ago in another post, I made some attempts at compressing the input files. With the current inputs the best compression is of course LZMA, but it takes way too much time to compress one file. bzip2 achieves almost the same compression ratio but takes less time (in exchange for a large amount of memory, if I'm not wrong). The compressed files are ~38% of the size of the originals.
A modification to the application will be needed in order to decompress the files on Windows (on Linux bzip2 should already be available on all machines and we can state the dependency on the website).
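
A minimal sketch of that decompression step, in Python for brevity (the real application is separate code; paths are only illustrative):

# Sketch: decompress a bzip2 input file from within the application itself,
# so no system bzip2 binary is needed on Windows. Paths are illustrative.
import bz2
import shutil

def decompress_input(compressed_path, output_path):
    # Stream-decompress `compressed_path` (a .bz2 file) to `output_path`.
    with bz2.open(compressed_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Example: decompress_input("separate_At_AT114.bz2", "separate_At_AT114")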

Another improvement I discussed with Daniele is the possibility of writing the inputs in binary instead of CSV. Doing that should decrease the size of the files; however, I also expect the compression factor of a binary file to be lower than the one achieved on the CSV file, so we should run some experiments on that.

The last improvement concerns the stickiness. If I'm not wrong, right now from a 70 MB input file we extract 4 to 6 WUs, each lasting ~2 hrs. We could increase the duration of each WU so that the input file is a dependency of a single WU (or 2 WUs if we prefer a shorter computation time). Doing so we would not need sticky files at all.
However, it depends on the size of the input files. If the input file is ~200 MB (boboviz and I computed a WU coming from such a big file) things are different, since both the computation time and the number of dependent WUs grow.
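
For concreteness, a rough calculation using only the figures quoted above (everything else is assumption):

# Rough arithmetic for merging all WUs that depend on one ~70 MB input file.
wus_per_file = (4, 6)   # WUs currently extracted from one input file
hours_per_wu = 2        # approximate runtime of each current WU

for n in wus_per_file:
    merged = n * hours_per_wu        # one merged WU per input file
    print(f"{n} WUs/file -> 1 WU of ~{merged} h, or 2 WUs of ~{merged / 2:.0f} h")
# i.e. roughly 8-12 h for a single merged WU, or 4-6 h each if split in two;
# either way the input file no longer needs to be sticky.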

marco giglio
Joined: 12 Nov 13
Posts: 20
Credit: 1,708
RAC: 0
Italy
Message 247 - Posted: 4 Jan 2014, 14:58:26 UTC
Last modified: 4 Jan 2014, 15:01:05 UTC

OK, I made some experiments with substituting the CSV input files with binary ones; here are the results.

Files used for testing:
separate_At_AT114 ~ 70 MB
separate_At2_AT106 ~ 71 MB
At2_all_obs_script.csv ~ 242 MB

Size of the files when compressed using bzip2:
separate_At_AT114.bz2 ~ 24 MB
separate_At2_AT106.bz2 ~ 25 MB
At2_all_obs_script.csv.bz2 ~ 85 MB

I wrote a program which parses the CSV files and writes binary files in which the data are represented as floating-point variables.
Each value in the CSV file occupies, on average, 8 bytes, while each floating-point value takes 4 bytes, so we obtain files which are about half the size of the CSVs.
separate_At_AT114.bin ~ 35 MB
separate_At2_AT106.bin ~ 36 MB
At2_all_obs_script.csv.bin ~ 121 MB
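
A minimal sketch of such a CSV-to-binary conversion (assuming every field is numeric and packing values as 32-bit floats; the actual converter used for the numbers above is not shown in this thread):

# Sketch: convert a CSV of numeric fields into packed 32-bit floats.
import csv
import struct

def csv_to_binary(csv_path, bin_path):
    with open(csv_path, newline="") as src, open(bin_path, "wb") as dst:
        for row in csv.reader(src):
            values = [float(field) for field in row]
            dst.write(struct.pack(f"<{len(values)}f", *values))

# Example: csv_to_binary("At2_all_obs_script.csv", "At2_all_obs_script.csv.bin")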

Good! Then I compressed these binary files with bzip2 and here is what I obtained:
separate_At_AT114.bin.bz2 ~ 23 MB
separate_At2_AT106.bin.bz2 ~ 24 MB
At2_all_obs_script.csv.bin.bz2 ~ 83 MB

If you compare the compressed binaries with the compressed CSVs you can see there is only a small difference, hence I don't think it is worth moving to binary inputs.

Another possibility is to exploit the correlation in the data.
The current CSV is written as
data1,data2,data3,data4
I don't know how correlated these data are, but they seem to be close to one another, hence we could try to write a new input CSV as follows:
data1,data1-data2,data2-data3,data3-data4...
Doing so we maximize the number of 0 characters, hence increasing the compression factor.
This is just an idea, and honestly I'm not so confident about it either, but it is a possibility...
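
A minimal sketch of that delta rewrite, applied within each row (purely illustrative; whether it helps depends on how correlated the real data actually are):

# Sketch: rewrite each CSV row as first value + successive differences,
# hoping that many small/zero deltas compress better than the raw values.
import csv

def delta_encode_row(row):
    values = [float(field) for field in row]
    # keep the first value, then data1-data2, data2-data3, ... as above
    return [values[0]] + [values[i] - values[i + 1] for i in range(len(values) - 1)]

def delta_encode_csv(in_path, out_path):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(delta_encode_row(row))

# Example: delta_encode_csv("At2_all_obs_script.csv", "At2_all_obs_delta.csv")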

