Posts by Beyond
1) Message boards : Number crunching : Weird hosts (Message 1487)
Posted 10 Jan 2019 by Beyond
>> setting its max_results_day field to -1, let's see what happens.

You should be able to do this for all hosts so that this problem will automatically take care of itself. Hosts failing for any reason can be cut off and then build back up if/when the problem is fixed. This is also a service to the owner of the machine as they sometimes won't notice that their host is failing. The work restriction is one way to automatically give them a heads up.

>> The problem here is that I do not have any owner's e-mail address (because of the gridcoin pool mechanism)

Yet another problem with BOINC allowing "pools" of users to register as a single user. There should probably be something like a subteam, with the individual users required to provide an e-mail address.
2) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1484)
Posted 9 Jan 2019 by Beyond
You have got to be one of the nicest, most conscientious admins in BoincLand.
3) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1481)
Posted 9 Jan 2019 by Beyond
293 hours because it's prime.
Edit: just looked, think it was between 281 and 282 hours. 281 is prime too...
You received > 7,000 credits. Cool. Wonder why the first guy who finished it didn't get credits?
4) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1479)
Posted 7 Jan 2019 by Beyond
>> Well Beyond, it has passed 252 hours and is still running and still at 73.00%.
>> Care for another guess?

>> Conan

264 hours because 2+4=6 and 6+6=12 and 2x6=12 and 2x12x12-2x12=264. Figure out that reasoning... ;-)
5) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1476)
Posted 6 Jan 2019 by Beyond
>> Conan, looks like you're running this WU: WU=17810575
>> Judging from the speed of the one machine that completed it vs. your machine I'd say you're looking at around 10-11 days total for this WU. It should complete though if your computer doesn't do something untoward (such as reboot).

>> Beyond, yes that is the one.
>> It has passed 200 hours now still at 73% and last checkpoint was at 24 hours.
>> But still running and using a full core so we will see how it goes, must finish soon?

I'm guessing 240-264 hours. Let's start a pool... ;-)
I'll take 252 hours.
6) Message boards : Number crunching : Weird hosts (Message 1474)
Posted 5 Jan 2019 by Beyond
That's not even a fast machine by today's standards.

1) Do you have the e-mail of the owner?
2) You can ban the host in BOINC.
3) I believe in BOINC you can set server side rules so that a bad host will automatically get fewer WUs/day until it's down to 1 per day. Then if it starts returning valid results the WUs/day will increase.
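
For what it's worth, here's a rough Python sketch of the kind of rule I mean in 3). The function name and the exact numbers (drop by one per bad result, double per valid one, never above a project maximum) are my own assumptions for illustration, not BOINC's actual server code:

```python
def update_daily_quota(current: int, result_valid: bool, project_max: int) -> int:
    """Return a host's new WUs/day limit after it reports one result.

    Hypothetical rule, for illustration only:
    - a failed result lowers the limit by 1, but never below 1 per day
    - a valid result doubles the limit, but never above project_max
    """
    if result_valid:
        return min(project_max, current * 2)
    return max(1, current - 1)


# A host that keeps failing drifts down to 1 WU/day ...
quota = 8
for _ in range(10):
    quota = update_daily_quota(quota, result_valid=False, project_max=8)

# ... and climbs back once it starts returning valid results.
recovered = quota
for _ in range(3):
    recovered = update_daily_quota(recovered, result_valid=True, project_max=8)
```

The point is that a broken host throttles itself without the admin doing anything, and a repaired one recovers after a handful of good results.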
7) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1473)
Posted 5 Jan 2019 by Beyond
It worked and it looks like they're being awarded credit. Very nice:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256
8) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1470)
Posted 4 Jan 2019 by Beyond
Thanks valterc! It's also quite possible that some of those "timed out - no response" are still running unless the server sends an abort message...
9) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1466)
Posted 3 Jan 2019 by Beyond
>> Well I guess then that my 141 hours and stuck at 73% for over 2 days, has a way to go.
>> Still taking up a full core so letting it run. Would be nice to get many, many thousands of credit for it (ha, ha).

>> Conan

Conan, looks like you're running this WU:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810575

Judging from the speed of the one machine that completed it vs. your machine I'd say you're looking at around 10-11 days total for this WU. It should complete though if your computer doesn't do something untoward (such as reboot).
10) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1465)
Posted 3 Jan 2019 by Beyond
Sounds fair to me. Thanks!
11) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1461)
Posted 29 Dec 2018 by Beyond
Thanks for the update. 148368_Hs_TCGA-KLF6_wu-6_1543433660987 above finished this morning and validated. I'm aborting 148366_Hs_TCGA-BRCA2_wu-136_1543431411382 at 619 hours (still at 12%). I'd bet this has to be a record run time for a WU:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256
12) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1459)
Posted 29 Dec 2018 by Beyond
>> There are about 30 TCGA workunits still around and for sure those are the very long ones. Theoretically those could run forever. The reason is that for a certain, very rare, type of input the algorithm's complexity is exponential. We usually manage to adjust the input dataset in order to avoid this, but in the current case there was an issue that we were able to fix only after the workunits were distributed. The results are of scientific value, of course, but without a checkpoint inside the critical section of the algorithm this is not the kind of computation to do inside the BOINC framework.

>> Anyway, I will wait for them for another couple of days, then I will abort them server side. I will figure out a way to give credits even in this case.

It looks like this WU has been cancelled by the server? Yet it's still running on my system and probably others. Here's the latest:



The top WU should finish within a day but the bottom one is still at 12% and now over 605 hours (483 hours since the last checkpoint). It looks "dead". Here's a link to the WU:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

and here's a different WU that looks hopeless:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810093
13) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1452)
Posted 19 Dec 2018 by Beyond
Three of the above WUs finished as expected. The 4th is STILL running at 12% completion after 377 hours (and 312 hours since the last checkpoint). It looks like 8 machines are still running this one, some longer than I have:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

I'm not going to be able to run more of these non-checkpointing WUs for a while as I'm doing a major rearrangement of machines (long overdue).
14) Message boards : Number crunching : sse2 vs avx (Message 1449)
Posted 19 Dec 2018 by Beyond
>> Interesting. I followed the link on the task info page to get info about the CPU and OS, and it looks like crashes sometimes occur on the 2700 too. Unfortunately the task page has been deleted today, so I cannot add a link here. Anyway, I will update my report to say that the bug happens mostly on the 1700.

Never had SSE failures on my 2700, only on the two Ryzen 1700s. AVX and FMA are bulletproof on all of them.

Thanks/Ed
15) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1443)
Posted 16 Dec 2018 by Beyond
>> There are about 30 TCGA workunits still around and for sure those are the very long ones. Theoretically those could run forever. The reason is that for a certain, very rare, type of input the algorithm's complexity is exponential. We usually manage to adjust the input dataset in order to avoid this, but in the current case there was an issue that we were able to fix only after the workunits were distributed. The results are of scientific value, of course, but without a checkpoint inside the critical section of the algorithm this is not the kind of computation to do inside the BOINC framework.

>> Anyway, I will wait for them for another couple of days, then I will abort them server side. I will figure out a way to give credits even in this case.

Thanks for the update and explanation. I have four of these WUs currently running on 3 machines. Thought I'd post a small BoincTasks screenshot:



From clues gleaned by looking at the WU history I expect a couple of these to finish within the next 2 days. A third one will probably be longer and the currently 306 hour one is totally mysterious as several faster machines are running it and it's never been completed.
16) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1439)
Posted 15 Dec 2018 by Beyond
Here's the longest one I've seen yet:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

I suspect the 4 "Timed out - no response" instances are still crunching away. My example is on a pretty slow machine. It's at 137 hours and 12%. It looks like these aren't checkpointing often. This WU is showing 117 hours since the last checkpoint (in BoincTasks).

Now at 183 hours, and 156 hours since the last checkpoint (which is troubling). Still at 12%. I assume that if the computer reboots or the WU gets interrupted for any reason it will restart. It appears that 6 computers are currently running this WU and no one has finished it. Is it viable?

Now at 282 hours, 238 hours since the last checkpoint. Still at 12%. Now there seem to be 7 computers actively running this WU (counting the 5 that are listed as "Timed out - no response", one of which is mine). Any possibility of adding some checkpoints to these TCGA WUs? A power glitch or any other interruption would wipe out nearly 10 days of work.
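
To show what I mean by checkpoints, here's a minimal Python sketch of a long loop that saves its progress every so often and picks up from the last save after an interruption. Everything in it (the file layout, the function names, the toy "sum the items" workload) is made up for illustration; it is not the project's code and it doesn't use the BOINC API:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write never leaves a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_item": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(items, path, checkpoint_every=100):
    """Sum the items, saving progress every `checkpoint_every` items."""
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        state["total"] += items[index]  # stand-in for the real computation
        state["next_item"] = index + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With something like this, a reboot costs at most the work done since the last save instead of the whole run. The hard part for these WUs is presumably that the slow section has no natural point at which to save.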
17) Message boards : Number crunching : sse2 vs avx (Message 1429)
Posted 11 Dec 2018 by Beyond
>> Is fma faster than avx on the Ryzen 1700?

>> It seems to be just slightly, though they are so close that it would take longer-term testing to be sure. I think it would be easier for the project to find the best extension for a given processor type, and just use it.

I'll drink to that. None of my machines ever get the fma version so I don't even have the latest executable to test. Looking at the current applications it doesn't even show an fma version for Windows. Maybe v1.10 was the last fma version? I wish we could just pick the app(s) we want to run through project preferences.
18) Message boards : Number crunching : sse2 vs avx (Message 1427)
Posted 11 Dec 2018 by Beyond
>> Thanks a lot. I substituted "fma" for "avx", and it is running fine on my Ryzen 1700, making it productive again.

Is fma faster than avx on the Ryzen 1700?

I've also been testing an app_info. So far it's running well:

<app_info>
    <app>
        <name>gene_pcim</name>
        <user_friendly_name>gene@home v1.11</user_friendly_name>
    </app>
    <file>
        <name>gene_pcim_v1.11_win64__avx.exe</name>
        <executable/>
    </file>
    <app_version>
        <app_name>gene_pcim</app_name>
        <version_num>111</version_num>
        <platform>windows_x86_64</platform>
        <avg_ncpus>1.000000</avg_ncpus>
        <max_ncpus>1.000000</max_ncpus>
        <plan_class>avx</plan_class>
        <api_version>7.3.0</api_version>
        <file_ref>
            <file_name>gene_pcim_v1.11_win64__avx.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
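
If you edit the file by hand (say, swapping "avx" for "fma"), an easy mistake is changing the file name in one place and not the other. Here's a small Python sketch that checks that every file_name under a file_ref matches a name declared in a file block. The function name is hypothetical and this is only a sanity check I'd run myself, not anything the BOINC client provides:

```python
import xml.etree.ElementTree as ET


def missing_file_refs(app_info_xml: str) -> list[str]:
    """Return file names that a <file_ref> uses but no <file> block declares."""
    root = ET.fromstring(app_info_xml)
    # Names declared in top-level <file><name>...</name></file> blocks.
    declared = {entry.findtext("name") for entry in root.findall("file")}
    # Names referenced anywhere via <file_ref><file_name>...</file_name></file_ref>.
    referenced = [ref.findtext("file_name") for ref in root.iter("file_ref")]
    return [name for name in referenced if name not in declared]


# Usage, assuming the app_info.xml sits in the current directory:
# with open("app_info.xml") as handle:
#     print(missing_file_refs(handle.read()))
```

An empty list means the two names agree; anything else lists the references that don't match a declared file.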
19) Message boards : Number crunching : New TCGA workunits (TCGAz) (Message 1426)
Posted 11 Dec 2018 by Beyond
Here's the longest one I've seen yet:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

I suspect the 4 "Timed out - no response" instances are still crunching away. My example is on a pretty slow machine. It's at 137 hours and 12%. It looks like these aren't checkpointing often. This WU is showing 117 hours since the last checkpoint (in BoincTasks).

Now at 183 hours, and 156 hours since the last checkpoint (which is troubling). Still at 12%. I assume that if the computer reboots or the WU gets interrupted for any reason it will restart. It appears that 6 computers are currently running this WU and no one has finished it. Is it viable?
20) Message boards : Number crunching : Wu stuck? (TCGA workunits) (Message 1421)
Posted 9 Dec 2018 by Beyond
>> The outputs of the 'problematic' TCGA workunits were, in some cases, different depending on whether they were calculated on Windows or Linux. There are some validation errors because of this, so I wrote some code to give credits even for the TCGA invalids. See this workunit as an example: http://gene.disi.unitn.it/test/workunit.php?wuid=17810593

Very nice and much appreciated.




Copyright © 2019 CNR-TN & UniTN