Message boards : Number crunching : Output file size (and plans for the future)
Author | Message |
---|---|
We are almost done with the Ec (Escherichia coli mono) experiment. As many of you have probably noticed, we have been switching to another organism (Vv, Vitis vinifera). Complex organisms have a larger genome (our Ec dataset contains ~4000 genes, the Vv one ~28000). The application basically finds one-to-one interactions between all the genes of the dataset, so the output file is bigger for more complex organisms. | |
ID: 884 · Reply Quote | |
What about something similar to what rosetta@home does, where the end user can pick how long the WUs run? That would make the file size bigger or smaller depending on one's choices. | |
ID: 886 · Reply Quote | |
BTW I think running out of work is worse than having too big an output file. | |
ID: 887 · Reply Quote | |
"What about something similar to what rosetta@home does? Where the end user can pick how long the WUs run and that would make the file size bigger/smaller depending on one's choices." This would be really difficult to do, due to the nature of the algorithm, but we have some ideas to reduce the typical output file size to around 1 MB per workunit. "BTW I think running out of work is worse than having too big an output file." We are still producing new work; it may take some time before a large queue is ready (another thing to do is to really speed up our work generator...) | |
ID: 888 · Reply Quote | |
Forgive the novice question, but what is the difference between the two Ec experiments? If we've run out of work on e-coli, I'm not sure how you can say there's more e-coli work to do which is not redundant. Are we analyzing it from a different angle? Are the results used differently? | |
ID: 889 · Reply Quote | |
Maybe you could store data in binary format? Or use a more effective compression algorithm like bz2 or even xz? | |
ID: 890 · Reply Quote | |
"Forgive the novice question, but what is the difference between the two Ec experiments? If we've run out of work on e-coli, I'm not sure how you can say there's more e-coli work to do which is not redundant. Are we analyzing it from a different angle? Are the results used differently?" The organism is obviously the same. The difference is the expression data: you may think of it as having different 'measurements' on slightly different 'individuals' with different instruments/probes and/or resolution. The ecm experiment (the one that is almost finished) is based on a data set publicly available in the COLOMBOS repository (http://www.colombos.net), which describes the organism as follows: "Escherichia coli str. K-12 is the model organism for Gram negative bacteria. More information can be found on the corresponding NCBI genome project webpage and EcoliWiki." The net3 dream5 experiment is something different, see http://gene.disi.unitn.it/test/forum_thread.php?id=152. Both experiments are of high interest to us. | |
ID: 891 · Reply Quote | |
"Maybe you could store data in binary format? Or use a more effective compression algorithm like bz2 or even xz?" There is not a big difference between a binary format and the gzipped text we use. I just picked a random Vv output file: as plain text it has 1139829 lines, and in a binary format each line would need at least 4+1 bytes (indexes instead of gene names), for a total of 5699145 bytes, which is almost the same as the gzipped version. On the other hand, gzip is supported internally by BOINC. A dramatic change of the output format would also mean a lot of changes in the post-processing. ... But we are thinking about a solution like removing from the output file all the interactions that are present only once. | |
ID: 893 · Reply Quote | |
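The size estimate in the post above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the line count and the 4+1 bytes per record are the figures quoted in the post, while the function and constant names are made up for illustration.

```python
# Figures quoted in the post above; 4+1 bytes per record is the poster's lower bound.
LINES_IN_SAMPLE_FILE = 1_139_829
MIN_BYTES_PER_RECORD = 4 + 1


def min_binary_size(records: int, bytes_per_record: int) -> int:
    """Lower bound, in bytes, for a binary file with one fixed-size record per line."""
    return records * bytes_per_record


size = min_binary_size(LINES_IN_SAMPLE_FILE, MIN_BYTES_PER_RECORD)
print(f"{size} bytes (~{size / 1e6:.1f} MB)")
```

The result is about 5.7 MB, which matches the poster's point that an uncompressed binary format would not be much smaller than the current gzipped text.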
"Maybe you could store data in binary format? Or use a more effective compression algorithm like bz2 or even xz?" Compression libraries provide an API that resembles the C file API, so converting the existing code should be pretty straightforward. Here is an example of how to use libbz2: http://linux.math.tifr.res.in/manuals/html/manual_3.html#SEC34. You can also use the bzip2 command to uncompress the file first and then process it as usual. | |
ID: 894 · Reply Quote | |
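The "file-like API" point can be illustrated with Python's standard library, where the gzip, bz2 and lzma (xz) modules all expose the same `open()` interface. This is not the project's code, and the post itself refers to the C interface of libbz2. The sample data is synthetic and its tab-separated index-pair layout is an assumption, since the real output format is not described in the thread.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical sample: pairs of integer gene indexes, one pair per line.
sample_lines = [f"{i % 4000}\t{(i * 7) % 4000}\n" for i in range(10_000)]

# All three modules provide open() with the same text-mode interface.
openers = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open}


def round_trip(opener, path, lines):
    """Write the lines through a compressor, then read them back."""
    with opener(path, "wt", encoding="ascii") as fh:
        fh.writelines(lines)
    with opener(path, "rt", encoding="ascii") as fh:
        return fh.readlines()


results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, opener in openers.items():
        path = os.path.join(tmp, f"sample.{name}")
        restored = round_trip(opener, path, sample_lines)
        results[name] = (restored == sample_lines, os.path.getsize(path))

for name, (ok, size) in results.items():
    print(f"{name}: round trip ok={ok}, compressed size={size} bytes")
```

The compressed sizes printed here say nothing about real Vv results; only a test on actual output files would show whether bz2 or xz beats gzip by a useful margin.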
"BTW I think running out of work is worse than having too big an output file." Yep, the work generator is running, making Vv workunits, but at the moment it cannot keep up with all the requests. Its speed is about 332 new workunits every 9 minutes. I guess we have to optimize it a little bit... | |
ID: 895 · Reply Quote | |
"BTW I think running out of work is worse than having too big an output file." You have to think about the scale of things though. Let's go with a user with 100 cores to offer (I have around that many, and it makes the math easy), where each WU takes only 30 minutes and produces that 6MB output file. That's 1.2GB to upload every hour, ~29GB every day, ~864GB per month. For someone with a lot more compute power that could really scale up. It looks like you have ~500 cores, and while I'm sure that's spread out across projects, if you ran only this project and those Vv WUs you would end up using terabytes of upload per month... and I'm pretty sure your ISP would be giving you a call at that point. Not to mention what it would do to the computer at the other end. The server status shows ~13K tasks underway, so take that as the core count of all end users and scale it out. That's terabytes of transfer PER DAY. | |
ID: 896 · Reply Quote | |
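The estimate in the post above is easy to reproduce. The sketch below uses the poster's own figures (30 minutes per WU, 6 MB per result, every core busy around the clock) and additionally assumes decimal units (1 GB = 1000 MB) and a 30-day month. The function name is made up for illustration.

```python
def upload_volume_gb(cores, wu_minutes=30, output_mb=6):
    """Return (GB per hour, GB per day, GB per 30-day month) for a host.

    Assumes every core runs workunits back to back and 1 GB = 1000 MB.
    """
    mb_per_hour = cores * (60 / wu_minutes) * output_mb
    gb_per_hour = mb_per_hour / 1000
    gb_per_day = gb_per_hour * 24
    return gb_per_hour, gb_per_day, gb_per_day * 30


# 100 and 500 cores are the two users discussed; 13,000 is the number of
# tasks underway, used in the post as a stand-in for the total core count.
for cores in (100, 500, 13_000):
    per_hour, per_day, per_month = upload_volume_gb(cores)
    print(f"{cores:>6} cores: {per_hour:8.1f} GB/h {per_day:9.1f} GB/day {per_month:10.1f} GB/month")
```

With these assumptions, 100 cores give 1.2 GB per hour and 864 GB per month, 500 cores give about 4.3 TB per month, and 13,000 cores give about 3.7 TB per day, in line with the figures in the post.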
"BTW I think running out of work is worse than having too big an output file." Good point. Please also keep in mind that one day a GPU app may appear. It is hard to tell how much faster it would be, so let's assume somewhere from 10x to 50x. For a given machine with one GPU, the upload size would then increase several times (the actual value depends on the CPU count and the GPU app speed). BTW, the user with those ~500 cores may want to switch them all here when taking part in some TN-Grid challenge. Assuming 32 cores per machine, that is 16 physical machines, and each of them may have one or more top GPUs, so the upload size could easily double or even triple. It would be good to let users configure a maximum result size or something like that too. Or run two apps, one for tasks with small results and another for big results. Users with limited upload speed, or who pay for transfer bandwidth, could use the first one only. | |
ID: 897 · Reply Quote | |
The worst scenario (BOINC-related) is a fast computation with a big output file (Vv); the ideal one would be slow and small (Ec-mono, ecm); Ec dream5 is something in between (fast and small). | |
ID: 899 · Reply Quote | |
I don't care about the upload/download sizes ... so if you set up an option to choose the size of transfers, I'm in for the big ones ! :D | |
ID: 907 · Reply Quote | |
GPUGrid uploads ~150MB per large WU, but these run ~16 hours on a GTX 970 or 1060, so it's most likely less than what TNGrid comes up with over the same period on 8 threads. | |
ID: 908 · Reply Quote | |
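A rough check of that comparison, as a sketch only: the GPUGrid figures (~150 MB per ~16 hours) and the 8 threads come from the post above, while the TN-Grid figures (30 minutes per WU, 6 MB per result) are the assumptions used earlier in this thread, not measured values.

```python
HOURS = 16
GPUGRID_UPLOAD_MB = 150    # one large WU per ~16 hours, from the post above

TNGRID_THREADS = 8         # from the post above
TNGRID_WU_MINUTES = 30     # assumption taken from an earlier post
TNGRID_OUTPUT_MB = 6       # assumption taken from an earlier post

# Tasks finished on all threads over the same 16-hour window.
tasks_done = TNGRID_THREADS * HOURS * 60 // TNGRID_WU_MINUTES
tngrid_upload_mb = tasks_done * TNGRID_OUTPUT_MB

print(f"GPUGrid: {GPUGRID_UPLOAD_MB} MB, TN-Grid: {tngrid_upload_mb} MB "
      f"({tngrid_upload_mb / GPUGRID_UPLOAD_MB:.1f}x)")
```

Under these assumptions TN-Grid would upload roughly ten times as much as GPUGrid over the same period, which supports the poster's guess.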
"GPUGrid uploads ~150MB per large WU, but these run ~16 hours on a GTX 970 or 1060, so it's most likely less than what TNGrid comes up with over the same period on 8 threads." I agree, bandwidth and size mean nothing to me with my 10 MB/s connection... Let people select an option in the preferences for limits while the majority of us crunch on. Also, if you need more disk space or some other easily obtained hardware, you can reach out to the BOINC community for help like Milkyway does. Crunchers keep Milkyway (and others) alive... we are partners in many ways, not just users... 8-) | |
ID: 920 · Reply Quote | |