Inconclusive workunits - checkpointing problem?

Message boards : Number crunching : Inconclusive workunits - checkpointing problem?

Dj Ninja
Joined: 3 Feb 17
Posts: 10
Credit: 841,281
RAC: 0
Germany
Message 853 - Posted: 6 Feb 2017, 1:45:13 UTC

I've got a couple of inconclusive workunits on multiple machines.
Here is one of them. When I looked into them, I noticed that all of my wingmen's workunits had been restarted from checkpoints, while mine ran straight through without interruption.

Could there be a checkpointing problem that leads to different results when a workunit is restarted?

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 320
Credit: 16,279,926
RAC: 3,847
Italy
Message 856 - Posted: 6 Feb 2017, 10:18:20 UTC - in response to Message 853.
Last modified: 6 Feb 2017, 10:22:16 UTC

Yes, we know about this. There is a bug somewhere, probably related to the checkpoint mechanism. As far as I know, the problem happens when a task is suspended and restarted before the first checkpoint. We are still investigating this issue. In the last seven days we had 945 invalid workunits (out of 457007), most of them, I guess, for this reason. This is not an alarming ratio, but it is still a problem we would like to fix.
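
For readers who want to see what "restarted before the first checkpoint" means in practice, here is a minimal sketch of the property a checkpointing application is expected to satisfy: a suspended and resumed run should give the same result as an uninterrupted one, including when no checkpoint file exists yet. This is not the project's code and says nothing about the actual bug. The names `run_task` and `results_match`, the JSON state file, and the toy arithmetic are all assumptions made up for illustration.

```python
import json
import os
import tempfile


def run_task(total_steps, checkpoint_every, path, stop_after=None):
    """Run a toy iterative job, resuming from `path` if a checkpoint exists.

    Returns the final result, or None if the run was stopped early.
    (Hypothetical helper, not part of any real client or application.)
    """
    # Resume from the checkpoint if there is one; otherwise start from scratch.
    # The "no checkpoint yet" branch is the case described in the thread:
    # a restart before the first checkpoint has to redo the work from step 0
    # with exactly the same initial state.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return None  # simulate the task being suspended
        # Integer arithmetic keeps the toy result exactly reproducible.
        state["acc"] = (state["acc"] * 31 + state["step"]) % 1_000_003
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file and rename it, so an interruption
            # never leaves a half-written checkpoint behind.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)
    return state["acc"]


def results_match(total_steps, checkpoint_every, stop_after):
    """Compare an uninterrupted run with one suspended after `stop_after` steps.

    `stop_after` must be smaller than `total_steps`.
    """
    with tempfile.TemporaryDirectory() as d:
        straight = run_task(total_steps, checkpoint_every, os.path.join(d, "a.json"))
        path = os.path.join(d, "b.json")
        assert run_task(total_steps, checkpoint_every, path, stop_after=stop_after) is None
        resumed = run_task(total_steps, checkpoint_every, path)
        return straight == resumed
```

Calling `results_match(100, 10, 3)` exercises the case from this thread (suspended after 3 steps, first checkpoint due at step 10), while `results_match(100, 10, 15)` exercises a restart from an existing checkpoint. In this toy version both should agree with the straight run; an application where the first case disagrees would show the kind of symptom described above.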

Dj Ninja
Joined: 3 Feb 17
Posts: 10
Credit: 841,281
RAC: 0
Germany
Message 859 - Posted: 6 Feb 2017, 13:39:24 UTC

Okay, thank you for the explanation. I found the number quite high, much higher than what I've seen on other projects.
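
To put the figures quoted earlier in the thread (945 invalid workunits out of 457007 in seven days) into perspective, the ratio can be worked out with a few lines of Python. The variable names are only illustrative; the two numbers are the ones given in the thread.

```python
invalid = 945      # invalid workunits in the last seven days (from the thread)
total = 457_007    # workunits in the same period (from the thread)

ratio = invalid / total
print(f"invalid ratio: {ratio:.4%}")          # share of invalid workunits
print(f"roughly 1 in {total / invalid:.0f}")  # same figure as "one in N"
```

That comes to about 0.2% of workunits, or roughly one in 484.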



Copyright © 2017 CNR-TN & UniTN