Output file size (and plans for the future)

Message boards : Number crunching : Output file size (and plans for the future)

Author Message
Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 884 - Posted: 9 Feb 2017, 10:28:37 UTC
Last modified: 9 Feb 2017, 11:39:17 UTC

We are almost done with the Ec (Escherichia coli mono) experiment. As many of you have probably noticed, we have switched to another organism (Vv, Vitis vinifera). Complex organisms have a larger genome (our Ec dataset contains ~4,000 genes, the Vv one ~28,000). The application basically finds one-to-one interactions between all the genes of the dataset, so the output file is bigger for more complex organisms.

This is obviously a problem. A ~30-minute run produces a ~6 MB output file (gzipped). Many people around the world simply don't have enough internet bandwidth to handle this. It is also a problem for us (more hard disk storage used).

The only way to solve this is to rethink the way the output file is built, maybe with some kind of filtering. But that means rewriting the application's output code, so it will take time.

In the meantime we have stopped the Vv experiment and are switching back to the dream5 experiment (see the science section of the forum), which is also based on Ec (smaller output files).

Thanks all for your understanding.

[edit] You may notice some workunit shortage in the meantime.

No.15
Send message
Joined: 2 Feb 16
Posts: 13
Credit: 64,229,764
RAC: 0
United States
Message 886 - Posted: 9 Feb 2017, 14:13:49 UTC - in response to Message 884.

We are almost done with the Ec (Escherichia coli mono) experiment. [...]


What about something similar to what Rosetta@home does, where the end user can pick how long the WUs run? That would make the file size bigger or smaller depending on one's choices.

No.15
Send message
Joined: 2 Feb 16
Posts: 13
Credit: 64,229,764
RAC: 0
United States
Message 887 - Posted: 9 Feb 2017, 14:15:59 UTC

BTW, I think running out of work is worse than having too big of an output file.

Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 888 - Posted: 9 Feb 2017, 15:14:26 UTC - in response to Message 887.

What about something similar to what Rosetta@home does, where the end user can pick how long the WUs run? That would make the file size bigger or smaller depending on one's choices.
This would be really difficult to do, due to the nature of the algorithm, but we have some ideas for reducing the typical output file size to around 1 MB per workunit.

BTW, I think running out of work is worse than having too big of an output file.
We are still producing new work; it may take some time before a large queue is ready (another thing to do is to really speed up our work generator...)

Col323
Send message
Joined: 23 Nov 16
Posts: 7
Credit: 1,329,132
RAC: 0
Angola
Message 889 - Posted: 9 Feb 2017, 15:43:56 UTC

Forgive the novice question, but what is the difference between the two Ec experiments? If we've run out of work on E. coli, I'm not sure how you can say there's more E. coli work to do that is not redundant. Are we analyzing it from a different angle? Are the results used differently?

I just don't want to feel like my limited CPU cycles are being kept busy just so I'll stick around while you sort out the upload size. But if there's useful work that can be done, then send more of those Ec units my way while the Vv units are sorted out!

Thanks!

Profile [B@P] Daniel
Volunteer developer
Send message
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 890 - Posted: 9 Feb 2017, 16:31:29 UTC

Maybe you could store the data in a binary format? Or use a more effective compression algorithm like bz2 or even xz?

Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 891 - Posted: 9 Feb 2017, 17:00:46 UTC - in response to Message 889.
Last modified: 9 Feb 2017, 17:06:12 UTC

Forgive the novice question, but what is the difference between the two Ec experiments? [...]

The organism is obviously the same. The difference is the expression data: you may think of it as different 'measurements' on slightly different 'individuals', taken with different instruments/probes and/or resolutions. The ecm experiment (the one that is almost ended) is based on a data set publicly available in the COLOMBOS repository (http://www.colombos.net). See below:

Escherichia coli str. K-12 is the model organism for Gram-negative bacteria. More information can be found on the corresponding NCBI genome project webpage and EcoliWiki.

The Escherichia coli compendium in colombos contains expression values for 4321 genes, measured for 4077 condition contrasts. This corresponds to a total of 269 experiments and 5510 samples measured on 73 different platforms. Of these, 254 experiments were retrieved from GEO and 15 experiments were retrieved from ArrayExpress. For additional gene annotations, colombos mainly relies on data from the databases or resources indicated in the table below.

The net3 dream5 experiment is something different, see http://gene.disi.unitn.it/test/forum_thread.php?id=152.

Both experiments are of high interest for us.

Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 893 - Posted: 9 Feb 2017, 17:25:32 UTC - in response to Message 890.
Last modified: 9 Feb 2017, 17:25:44 UTC

Maybe you could store data in binary format? Or use more effective compressing algorithm like bz2 or even xz?

There is not a big difference between a binary format and the gzipped text we use. I just picked a random Vv output file: as plain text it has 1,139,829 lines, and each line needs at least 4+1 bytes (using indexes instead of gene names), for a total of 5,699,145 bytes, which is almost the same as the gzipped version. On the other hand, gzip is internally supported by BOINC, and a dramatic change of the output format would also mean a lot of changes in the post-processing... But we are thinking about a solution like removing from the output file all the interactions that are present only once.
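The arithmetic behind this estimate is easy to check. A minimal sketch (the "4+1 bytes" interpretation, a 4-byte index plus one extra byte per interaction, is taken directly from the figures in the post):

```python
# Back-of-the-envelope size of a binary encoding for the Vv output
# file described above: 1,139,829 interactions at 4 + 1 bytes each.
LINES = 1_139_829
BYTES_PER_LINE = 4 + 1  # 4-byte gene index plus one extra byte

binary_size = LINES * BYTES_PER_LINE
print(binary_size)  # 5699145 bytes, i.e. ~5.4 MiB
```

Since the gzipped text file is already around this size, a binary format alone would buy little; only emitting fewer interactions (the filtering idea) changes the total substantially.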

Profile [B@P] Daniel
Volunteer developer
Send message
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 894 - Posted: 9 Feb 2017, 18:41:04 UTC - in response to Message 893.
Last modified: 9 Feb 2017, 18:41:36 UTC

Maybe you could store data in binary format? Or use more effective compressing algorithm like bz2 or even xz?

There is not a big difference between a binary format and the gzipped text we use. [...]

Compression libraries provide an API which resembles the C file API, so converting the existing code should be pretty straightforward. Here is an example of how to use libbz2: http://linux.math.tifr.res.in/manuals/html/manual_3.html#SEC34. You can also use the bzip2 command to decompress the file first and then process it as usual.
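How much the alternative algorithms would actually help can be probed with the standard library. A rough sketch on synthetic data (the "gene-index gene-index score" line format and the 28,000-gene range are assumptions for illustration; real output files will compress differently):

```python
import bz2
import gzip
import lzma
import random

# Synthetic stand-in for an output file: one "geneA geneB score"
# line per interaction, using integer gene indexes.
random.seed(1)
raw = "\n".join(
    f"{random.randrange(28000)} {random.randrange(28000)} {random.random():.6f}"
    for _ in range(100_000)
).encode()

for name, compress in (("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("xz", lzma.compress)):
    size = len(compress(raw))
    print(f"{name:4} {size:>8} bytes ({size / len(raw):.0%} of raw)")
```

The relative ratios on the real data are what matters; largely random-looking numeric scores tend to limit what any general-purpose compressor can achieve.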

Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 895 - Posted: 9 Feb 2017, 20:10:16 UTC - in response to Message 887.
Last modified: 9 Feb 2017, 20:14:12 UTC

BTW, I think running out of work is worse than having too big of an output file.

Yep, the work generator is running, making Vv workunits, but at the moment it is not able to handle all the requests. Its speed is about 332 new workunits every 9 minutes. I guess we have to optimize it a little bit...

Woof
Send message
Joined: 16 Jan 17
Posts: 3
Credit: 650,991
RAC: 0
Message 896 - Posted: 10 Feb 2017, 4:12:02 UTC - in response to Message 887.

BTW, I think running out of work is worse than having too big of an output file.


You have to think about the scale of things, though.

Let's go with a user with 100 cores to offer (since I have around that many, which made the math easy), where each WU takes only 30 minutes and produces that 6 MB output file.

That's 1.2 GB to upload every hour, ~29 GB every day, ~864 GB per month.

For someone with a lot more compute power that could really scale out. It looks like you have ~500 cores, and while I'm sure that's spread out across projects, if you ran only this project and those Vv WUs you would end up using terabytes of upload per month... and I'm pretty sure your ISP would be giving you a call at that point.


Not to mention what it would be like for the computer at the other end of that. The server status shows ~13K tasks underway, so imagining that as the core count of all end users, scale that out. That's terabytes of transfer PER DAY.
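The scale-out arithmetic above can be reproduced directly, using only the figures quoted in the thread (100 cores, ~30-minute WUs, ~6 MB results):

```python
# Upload bandwidth estimate for one volunteer, per the thread's figures.
CORES = 100
WU_MINUTES = 30
OUTPUT_MB = 6

results_per_hour = CORES * 60 // WU_MINUTES   # 200 results/hour
mb_per_hour = results_per_hour * OUTPUT_MB    # 1200 MB ~ 1.2 GB/hour
gb_per_day = mb_per_hour * 24 / 1000          # ~28.8 GB/day
gb_per_month = gb_per_day * 30                # ~864 GB/month

print(f"{mb_per_hour / 1000:.1f} GB/hour, "
      f"{gb_per_day:.0f} GB/day, {gb_per_month:.0f} GB/month")
```

Scaling the same per-result figure by the ~13K tasks in flight on the server side gives the "terabytes per day" estimate in the post.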

Profile [B@P] Daniel
Volunteer developer
Send message
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 897 - Posted: 10 Feb 2017, 6:54:51 UTC - in response to Message 896.
Last modified: 10 Feb 2017, 7:14:00 UTC

You have to think about the scale of things though. [...]

Good point. Please also keep in mind that one day a GPU app may appear. It is hard to tell how much faster it may be, so let's assume 10x to 50x. So for a given machine with one GPU, the upload size will increase several times over (the actual factor depends on the CPU count and the GPU app's speed).

BTW, a user with those ~500 cores may want to switch them all here when participating in some TN-Grid challenge. Assuming 32 cores per machine, that is 16 physical machines. And each of them may have one or more top GPUs. So the upload size may easily double or even triple.

It would be good to let users configure a max result size or something like this too. Or run two apps, one for tasks with small results and another for big results. Users with limited upload speed, or who pay for transfer bandwidth, could use the first one only.

Profile valterc
Project administrator
Project tester
Send message
Joined: 30 Oct 13
Posts: 624
Credit: 34,677,535
RAC: 1
Italy
Message 899 - Posted: 10 Feb 2017, 11:54:05 UTC - in response to Message 897.
Last modified: 10 Feb 2017, 11:54:33 UTC

The worst scenario (BOINC-wise) is a fast computation with a big output file (Vv); the ideal one would be slow and small (Ec-mono, ecm). Ec dream5 is something in between (fast and small).

But there's another point to take into account: science. Scientists would like to study a specific organism, following their own interests.

Now, Ec-mono is almost ended, Ec-dream5 will be distributed shortly (but it won't last long), and Vv is really important for us and should be done. In the future we are thinking about Hs (Homo sapiens, us); we don't know yet about the computational speed, but we do know that there are a lot of genes (slightly fewer than a plant...).

So, we definitely have to solve this 'big output file' issue...

[AF>FAH-Addict.net]toTOW
Send message
Joined: 17 May 14
Posts: 2
Credit: 3,677,698
RAC: 0
France
Message 907 - Posted: 12 Feb 2017, 21:25:15 UTC

I don't care about the upload/download sizes... so if you set up an option to choose the size of transfers, I'm in for the big ones! :D

Your project is not the worst about file sizes... some projects transfer more than 100 MB per WU...

koschi
Send message
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 908 - Posted: 12 Feb 2017, 21:43:39 UTC

GPUGrid uploads ~150 MB per large WU, but these run ~16 hours on a GTX 970 or 1060, so it's most likely less than what TN-Grid comes up with over the same period on 8 threads.

If you can sort out the server-side capacity issues, with a stable 100/10 Mbit connection I'd opt in to a large transfer queue as well...

Tex1954
Send message
Joined: 3 Apr 16
Posts: 1
Credit: 4,367,188
RAC: 0
United States
Message 920 - Posted: 21 Feb 2017, 5:51:22 UTC - in response to Message 908.

GPUGrid uploads ~150MB per large WU, but these run ~16 hours on a GTX 970 or 1060 [...]


I agree; bandwidth and size mean nothing to me with my 10 MB/s connection... Let people select an option in the preferences for limits while the majority of us crunch on.

Also, if you need more disk space or some other easily obtained piece of hardware, you can reach out to the BOINC community for help, like Milkyway does.

Crunchers keep Milkyway (and others) alive... we are partners in many ways, not just users...

8-)




Copyright © 2024 CNR-TN & UniTN