New TCGA workunits (TCGAz)

Message boards : Number crunching : New TCGA workunits (TCGAz)

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 1409 - Posted: 4 Dec 2018, 10:48:02 UTC

The new TCGA workunits (with a modified dataset) that contain the TCGAz string should behave correctly. Please let them run. I started with just a few batches.

Retnek
Joined: 26 Jun 17
Posts: 5
Credit: 21,391,086
RAC: 0
Germany
Message 1410 - Posted: 4 Dec 2018, 18:54:16 UTC

Thanks for the warning - some of these jobs seem to need considerably more CPU-time. 17 h for 32% in one case. Anyhow, if there's a chance they come to an end I'll let them run.

Buro87 [Lombardia]
Joined: 23 Nov 16
Posts: 100
Credit: 4,000,541
RAC: 0
Italy
Message 1411 - Posted: 5 Dec 2018, 8:24:11 UTC - in response to Message 1410.

If you can, let them run. If you need to turn off your PC at the end of the day, abort them. If you turn off your PC in the middle of a long stretch without a checkpoint, the task will go back to the previous checkpoint and you'll lose that time.
One WU stayed at 52% for 9 h. Total run time: 33 h.

Buro87 [Lombardia]
Joined: 23 Nov 16
Posts: 100
Credit: 4,000,541
RAC: 0
Italy
Message 1412 - Posted: 5 Dec 2018, 8:27:19 UTC - in response to Message 1411.

I am referring to the old TCGA WUs.

Retnek
Joined: 26 Jun 17
Posts: 5
Credit: 21,391,086
RAC: 0
Germany
Message 1413 - Posted: 5 Dec 2018, 17:42:02 UTC - in response to Message 1411.

No problem letting them run - looking good. One job finished after 1 day 3 hours 52 min 36 sec, completed and validated.

There's one job now at 5% after one day, 19 days to go. Let's see. :-)

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 1414 - Posted: 5 Dec 2018, 17:59:21 UTC - in response to Message 1413.

OK, just to summarize: the TCGA workunits are the problematic ones, really long and with long periods of time (hours) without any checkpoint. Nevertheless we would be very happy if you let them run until completion.

The new TCGAz should behave in the usual way, i.e. you shouldn't worry if you got some.

kain
Joined: 11 Jun 15
Posts: 30
Credit: 15,565,761
RAC: 0
Poland
Message 1415 - Posted: 6 Dec 2018, 18:41:48 UTC - in response to Message 1414.

No problem, my CPUs are working 24/7. I have no mercy.

Retnek
Joined: 26 Jun 17
Posts: 5
Credit: 21,391,086
RAC: 0
Germany
Message 1416 - Posted: 7 Dec 2018, 8:29:41 UTC - in response to Message 1415.

Here's one job with no progress but growing time:

gene@home PC-IM 1.11 (avx)
148368_Hs_TCGA-KLF6_wu-22_1543433692665
03.12.2018 11:36:21
87.729 GFLOPs
time running: 2d 14:56:04
time to complete: 49d 21:49:11
recent progress: 5,000%

Since the job seems to be frozen at 5% while the time to complete keeps rising, I'd better ask: what should I do? No problem letting it run for 100 days, if there's hope for a result.
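
The rising "time to complete" at a frozen 5% is what a plain linear extrapolation would produce: if the remaining time is estimated as elapsed time scaled by the unfinished fraction, a task stuck at 5% always shows about 19 times its elapsed time. The sketch below assumes that simple model; the BOINC client's real estimator may differ, and `linear_eta` is a hypothetical helper name, not part of BOINC.

```python
from datetime import timedelta


def linear_eta(elapsed: timedelta, fraction_done: float) -> timedelta:
    """Remaining time if progress continued at the average rate so far.

    Assumed model: remaining = elapsed * (1 - fraction_done) / fraction_done.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed * ((1.0 - fraction_done) / fraction_done)


# Figures from the task listed above: 2d 14:56:04 elapsed at 5%.
elapsed = timedelta(days=2, hours=14, minutes=56, seconds=4)
eta = linear_eta(elapsed, 0.05)
print(eta)
```

Under this assumption the estimate comes out at roughly 49 days 20 hours, close to the 49d 21:49:11 reported, and it also matches the earlier "5% after one day, 19 days to go". The estimate grows only because elapsed time grows, so it carries no information about whether the task will ever finish.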

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 1417 - Posted: 7 Dec 2018, 10:29:26 UTC - in response to Message 1416.
Last modified: 7 Dec 2018, 10:52:43 UTC

Well, I got this https://gene.disi.unitn.it/test/result.php?resultid=37659353 that spent almost 24 hours without any (visible) progress and eventually got validated. This one https://gene.disi.unitn.it/test/result.php?resultid=37623246 ran for about three days and still needs to be validated. The longest one so far is this one https://gene.disi.unitn.it/test/workunit.php?wuid=17810415 that got more than 2300 credits.
However, there is no 'theoretical' computational limit in the algorithm. A task may run forever.
It would be useful to let the tasks run, even for many days, taking care not to stop BOINC in between (because of the lack of checkpointing in the 'critical' period).

I'm constantly monitoring the situation, adding credits for 'invalid' and 'too late to validate' workunits.

Retnek
Joined: 26 Jun 17
Posts: 5
Credit: 21,391,086
RAC: 0
Germany
Message 1418 - Posted: 7 Dec 2018, 11:06:05 UTC - in response to Message 1417.

valterc wrote: "...I'm constantly monitoring the situation, adding credits for 'invalid' and 'too late to validate' workunits."


Thanks, that's what I love to read. If someone cares for the result, there's some reason in it. I'll let it run and we'll see what happens. Curiosity is a strong motive.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 1420 - Posted: 9 Dec 2018, 16:16:14 UTC

Here's the longest one I've seen yet:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

I suspect the 4 "Timed out - no response" instances are still crunching away. My example is on a pretty slow machine. It's at 137 hours and 12%. It looks like these aren't checkpointing often. This WU is showing 117 hours since the last checkpoint (in BoincTasks).

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 1426 - Posted: 11 Dec 2018, 14:56:47 UTC - in response to Message 1420.

Now at 183 hours, and 156 hours since the last checkpoint (which is troubling). Still at 12%. I assume that if the computer reboots or the WU gets interrupted for any reason it will restart. It appears that 6 computers are currently running this WU and no one has finished it. Is it viable?

JStateson
Joined: 7 May 18
Posts: 2
Credit: 3,577,000
RAC: 0
United States
Message 1432 - Posted: 12 Dec 2018, 14:24:27 UTC

From the few I have looked at, the sse2 and avx versions clearly have problems, but the fma ones are succeeding.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 1439 - Posted: 15 Dec 2018, 17:52:08 UTC - in response to Message 1426.

Now at 282 hours, 238 hours since the last checkpoint. Still at 12%. There now seem to be 7 computers actively running this WU (counting the 5 that are listed as "Timed out - no response", one of which is mine). Is there any possibility of adding some checkpoints to these TCGA WUs? A power glitch or any other interruption would wipe out nearly 10 days of work.

Jim1348
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1440 - Posted: 15 Dec 2018, 20:23:52 UTC - in response to Message 1439.

Beyond wrote: "Now at 183 hours, and 156 hours since the last checkpoint (which is troubling). Still at 12%."

I have seen a few of those. Before they get that far, I abort them. I think you are unnecessarily conscientious; they are duds.

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 1442 - Posted: 16 Dec 2018, 10:11:10 UTC - in response to Message 1439.

There are about 30 TCGA workunits still around, and those are certainly the very long ones. Theoretically they could run forever. The reason is that for a certain, very rare, type of input the algorithm's complexity is exponential. We usually manage to adjust the input dataset to avoid this, but in the current case there was an issue that we were able to fix only after the workunits had been distributed. The results are of scientific value, of course, but without a checkpoint inside the critical section of the algorithm this is not the kind of computation to do inside the BOINC framework.

Anyway, I will wait for them for another couple of days, then I will abort them server side. I will figure out a way to give credits even in this case.
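
To illustrate why a checkpoint inside the long-running section matters, here is a minimal Python sketch of a loop that saves its state periodically and resumes from the last saved state after an interruption. It is not the gene@home application code: the file name, the state layout, the squared-sum stand-in for real work, and the `stop_after` parameter (used only to simulate an interruption) are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run(items, path, checkpoint_every=100, stop_after=None):
    """Process items, checkpointing every `checkpoint_every` items."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and i >= stop_after:
            return None  # simulated interruption: unsaved work is lost
        state["total"] += items[i] * items[i]  # stand-in for real work
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    items = list(range(1000))
    run(items, ckpt, stop_after=250)  # interrupted; last checkpoint at item 200
    resumed = run(items, ckpt)        # restarts from item 200, not from 0
    print(resumed)
```

In this sketch an interruption costs at most `checkpoint_every` items of work. If a single step can itself take days and offers no point at which state can be saved, as described for the problematic inputs above, the whole step is lost on every restart.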

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 1443 - Posted: 16 Dec 2018, 17:32:27 UTC - in response to Message 1442.
Last modified: 16 Dec 2018, 17:39:17 UTC

Thanks for the update and explanation. I have four of these WUs currently running on 3 machines. Thought I'd post a small BoincTasks screenshot:

[BoincTasks screenshot not preserved]

From clues gleaned by looking at the WU history I expect a couple of these to finish within the next 2 days. A third one will probably take longer, and the one currently at 306 hours is totally mysterious, as several faster machines are running it and it has never been completed.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 1452 - Posted: 19 Dec 2018, 16:29:28 UTC

Three of the above WUs finished as expected. The 4th is STILL running at 12% completion after 377 hours (and 312 hours since the last checkpoint). It looks like 8 machines are still running this one, some longer than I have:

http://gene.disi.unitn.it/test/workunit.php?wuid=17810256

I'm not going to be able to run more of these non-checkpointing WUs for a while as I'm doing a major rearrangement of machines (long overdue).

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 1453 - Posted: 19 Dec 2018, 17:04:52 UTC - in response to Message 1452.

Do not worry: if you need to abort them, do that. There are about 18 workunits of the TCGA batch still around (out of a total of 1240); some of them are probably the most problematic ones.

Jim1348
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1454 - Posted: 19 Dec 2018, 17:47:07 UTC - in response to Message 1452.

Beyond wrote: "The 4th is STILL running at 12% completion after 377 hours (and 312 hours since the last checkpoint)."

All projects have stuck work units; some more than others. If the Progress % is not making any progress after a few hours (24 hours is more than enough time), then it is stuck in a loop.


Copyright © 2024 CNR-TN & UniTN