Wu stuck? (TCGA workunits)

thejackhome
Joined: 25 Oct 18
Posts: 3
Credit: 1,114
RAC: 0
Message 1385 - Posted: 29 Nov 2018, 10:29:17 UTC
Last modified: 29 Nov 2018, 10:52:40 UTC

Hi,
a workunit has seemed "stuck" at 6% completion for more than 2 hours, with no checkpoints written.
Is this normal, or has it entered an endless loop and should it be aborted?

Workunit: 148365_Hs_TCGA-AR_wu-25_1543429200632_2
Application: gene@home PC-IM v1.11 (avx)
OS: Windows 7


Never mind! After 2.5 hours it moved on to 6.5% and wrote the checkpoint correctly.

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1386 - Posted: 29 Nov 2018, 12:15:27 UTC - in response to Message 1385.

The input file is made up of 'computational chunks'. Usually every chunk runs for more or less the same time, so it was easy to decide how many chunks to put in a workunit and to forecast the overall computation time.
The TCGA workunits behave in a different (and somewhat unexpected) way: some chunks may take *far* longer than the others to complete.
We don't know exactly why. It could be the intrinsic 'nature' of the TCGA dataset, or there could be some kind of 'artifact' in it.
Anyway, we sent out a limited number of these workunits, hoping to gather enough feedback to help us solve the problem (we don't want to distribute workunits that may take days to complete).

So, please, if you have such workunits, let them finish.

thejackhome
Joined: 25 Oct 18
Posts: 3
Credit: 1,114
RAC: 0
Message 1387 - Posted: 29 Nov 2018, 15:27:04 UTC
Last modified: 29 Nov 2018, 15:27:23 UTC

Thanks for the explanation and the quick feedback, Valterc.
I will certainly let it finish.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,191,036
RAC: 156
United States
Message 1388 - Posted: 29 Nov 2018, 21:45:54 UTC

Just had one that had been running for 16+ hours and was still at 17% on a relatively fast and previously reliable machine. Aborted it before I came over here to check...
Have another one that's getting close to 75% and 7.6 hours on a very fast machine. That one's still running.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,191,036
RAC: 156
United States
Message 1389 - Posted: 29 Nov 2018, 23:55:14 UTC

>> Have another one that's getting close to 75% and 7.6 hours on a very fast machine. That one's still running.

The one above finished in 9:10.

Now I have one on the same very fast machine that has run for 9:06 and is still at 0.5%.

thejackhome
Joined: 25 Oct 18
Posts: 3
Credit: 1,114
RAC: 0
Message 1390 - Posted: 30 Nov 2018, 12:14:54 UTC

Unfortunately, I also have to abort my task.

I usually crunch only during working hours (9-10 hours a day, more or less).
If the application cannot write a checkpoint at least every 4-5 hours, I have to drop it.

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1391 - Posted: 30 Nov 2018, 14:04:46 UTC - in response to Message 1390.
Last modified: 30 Nov 2018, 14:07:15 UTC

Every workunit is built from a certain number of small computational pieces (chunks); in the TCGA experiments the number is 1000 or 200. A checkpoint is written at the end of every chunk. So if a chunk is one of the 'abnormal' ones that run for hours, there will unfortunately be no checkpoint for a long time.
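
A minimal sketch of why a slow chunk is costly, assuming a checkpoint is written only when a chunk finishes and a restarted task resumes from the last completed chunk. The function name `work_lost_on_stop` and the chunk durations are hypothetical, made up for illustration; they are not taken from the gene@home application.

```python
from itertools import accumulate


def work_lost_on_stop(chunk_durations, stop_time):
    """Return how much work is discarded if the task stops at stop_time.

    Assumption: one checkpoint is written at the end of every chunk, so a
    restart resumes from the end of the last chunk completed before stop_time.
    """
    checkpoints = list(accumulate(chunk_durations))  # times at which chunks finish
    if not checkpoints or stop_time >= checkpoints[-1]:
        return 0  # the task had already finished (or had nothing to do)
    last_checkpoint = max((t for t in checkpoints if t <= stop_time), default=0)
    return stop_time - last_checkpoint


# Hypothetical durations in minutes: four quick chunks and one 'abnormal' one.
durations = [2, 3, 2, 240, 2]

# Stopping two hours into the slow chunk throws away those two hours.
print(work_lost_on_stop(durations, 127))
# Stopping between quick chunks loses almost nothing.
print(work_lost_on_stop(durations, 6))
```

With chunks of similar length the loss is bounded by one short chunk; a single multi-hour chunk makes the possible loss just as long.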

Anyway, results are coming back, although more slowly than usual, and even the long workunits seem to be validated (and credit granted) correctly.

Thyme Lawn
Joined: 22 Nov 16
Posts: 5
Credit: 2,261,033
RAC: 0
United Kingdom
Message 1393 - Posted: 1 Dec 2018, 20:50:11 UTC

My hyper-threaded i7-6700K Windows 10 system has 1 validated TCGA task (148365_Hs_TCGA-AR_wu-124_1543429375131_2, 561.1 credits for 11:59:57 runtime), 1 pending validation (148368_Hs_TCGA-KLF6_wu-102_1543433840285_2, 9:10:54 runtime), 4 running and 5 in its work queue.

The running tasks are:


____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1395 - Posted: 2 Dec 2018, 13:32:26 UTC - in response to Message 1393.

The TCGA workunits are problematic. If you happen to have one and find it 'frozen', i.e. showing no progress after a long time, feel free to abort it.
Tomorrow, Monday, I will abort them all, server side.

Thyme Lawn
Joined: 22 Nov 16
Posts: 5
Credit: 2,261,033
RAC: 0
United Kingdom
Message 1396 - Posted: 2 Dec 2018, 23:58:31 UTC - in response to Message 1395.

3 of those tasks have now completed:


The other task mentioned in my previous message was pre-empted for 9 hours and is now running again:


One of the queued tasks has also completed today:


  • 148365_Hs_TCGA-AR_wu-34_1543429216393_2 - 12:37:28 runtime, 19 checkpoints, longest interval 4:08:26 for #10, validation inconclusive (my task ran on an Intel i7-6700K, Windows 10 system using v1.11 (sse2); the _1 task ran for 28:04:24 on an AMD FX-8320E, Linux 4.4.76-1-default system using v1.10 (fma))



Thyme Lawn
Joined: 22 Nov 16
Posts: 5
Credit: 2,261,033
RAC: 0
United Kingdom
Message 1406 - Posted: 3 Dec 2018, 15:48:55 UTC - in response to Message 1396.

>> The other task mentioned in my previous message was pre-empted for 9 hours and is now running again:


Now completed with 30:52:42 runtime and 16 checkpoints made, validation pending.

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1408 - Posted: 3 Dec 2018, 17:09:53 UTC - in response to Message 1406.

The output of the 'problematic' TCGA workunits was, in some cases, different depending on whether it was computed on Windows or Linux. This causes some validation errors, so I wrote some code to grant credit even for the invalid TCGA results. See this workunit as an example: http://gene.disi.unitn.it/test/workunit.php?wuid=17810593

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,191,036
RAC: 156
United States
Message 1421 - Posted: 9 Dec 2018, 16:19:02 UTC - in response to Message 1408.

>> The output of the 'problematic' TCGA workunits was, in some cases, different depending on whether it was computed on Windows or Linux. This causes some validation errors, so I wrote some code to grant credit even for the invalid TCGA results. See this workunit as an example: http://gene.disi.unitn.it/test/workunit.php?wuid=17810593

Very nice and much appreciated.

kain
Joined: 11 Jun 15
Posts: 29
Credit: 15,562,505
RAC: 3
Poland
Message 1435 - Posted: 14 Dec 2018, 1:09:15 UTC
Last modified: 14 Dec 2018, 1:09:53 UTC

Crunched on my 1950X and waiting for validation:
http://gene.disi.unitn.it/test/workunit.php?wuid=17809940 - CPU time: 251,481.41 seconds :(
http://gene.disi.unitn.it/test/workunit.php?wuid=17810657 - CPU time: 149,729.80 seconds, five people resigned...


Validated:
http://gene.disi.unitn.it/test/workunit.php?wuid=17810577 - CPU time: 396,309.20 :O It's too much even for me.
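
For scale, a small sketch converting the three CPU times quoted above into hours and days. It assumes the third figure, which is given without a unit, is also in seconds.

```python
# CPU times as quoted above, keyed by workunit id.
# Assumption: the value for 17810577 is in seconds like the other two.
cpu_seconds = {
    "17809940": 251_481.41,
    "17810657": 149_729.80,
    "17810577": 396_309.20,
}

# Convert each CPU time from seconds to hours.
cpu_hours = {wuid: seconds / 3600 for wuid, seconds in cpu_seconds.items()}

for wuid, hours in cpu_hours.items():
    print(f"wuid {wuid}: {hours:.1f} h ({hours / 24:.1f} days)")
```

The longest of the three works out to roughly 110 hours, i.e. more than four and a half days of CPU time.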

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1436 - Posted: 14 Dec 2018, 10:12:19 UTC - in response to Message 1435.
Last modified: 14 Dec 2018, 10:14:46 UTC

The last one is probably one of the longest I have ever seen (4,704.77 credits...). This one https://gene.disi.unitn.it/test/workunit.php?wuid=17810524 is probably the record so far.

Anyway, there are only a few (TCGA) workunits still around, probably the longest...

kain
Joined: 11 Jun 15
Posts: 29
Credit: 15,562,505
RAC: 3
Poland
Message 1437 - Posted: 14 Dec 2018, 10:54:47 UTC - in response to Message 1436.

GPUGRID has two queues - a short one and a long one. Could you maybe do the same here?

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 1438 - Posted: 14 Dec 2018, 12:05:29 UTC - in response to Message 1437.
Last modified: 14 Dec 2018, 12:05:46 UTC

>> GPUGRID has two queues - a short one and a long one. Could you maybe do the same here?

The TCGA batch behavior was unexpected. Workunits like those, with no checkpoints for a very long time and a somewhat unpredictable running time, are not suited to BOINC. We have no plans to distribute very long workunits in the future, and certainly none like the TCGA ones.

Just wanted to point out that the TCGAz workunits behave correctly.

Thyme Lawn
Joined: 22 Nov 16
Posts: 5
Credit: 2,261,033
RAC: 0
United Kingdom
Message 1444 - Posted: 17 Dec 2018, 10:35:52 UTC

My i7 has just completed the _8 task from workunit 148368_Hs_TCGA-KLF6_wu-154_1543433935389 with 57:17:44 runtime, 41:07:29 of it being between the 15th and 16th checkpoints.

The over-deadline _2 task was completed by a Linux ARMv7 system after running for 320:02:11; pro rata, the long checkpoint interval would have taken almost 10 days there! Aaron deserves a special commendation for sticking with it.
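
The pro rata estimate can be reproduced with a short sketch. It assumes the slow chunk takes the same fraction of the total runtime on both hosts, which is only an approximation; the helper `to_seconds` is a name introduced here for illustration.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' runtime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


i7_runtime = to_seconds("57:17:44")        # total runtime of the _8 task
i7_long_interval = to_seconds("41:07:29")  # gap between checkpoints 15 and 16
arm_runtime = to_seconds("320:02:11")      # total runtime of the _2 task

# Assumption: the long interval is the same share of runtime on both systems.
fraction = i7_long_interval / i7_runtime
arm_long_interval_days = fraction * arm_runtime / 86400

print(f"{fraction:.1%} of runtime, about {arm_long_interval_days:.1f} days on the ARM host")
```

On the i7 the single long interval is roughly 72% of the whole runtime, which scales to about 9.6 days on the ARM system.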

As expected for TCGA tasks, the fact that it has been completed by one Windows system and one Linux system means that validation is inconclusive. 4 other tasks from the WU have timed out, 3 have been aborted and 2 are in progress.



Copyright © 2024 CNR-TN & UniTN