Message boards : News : Storage problem (again)
Author | Message |
---|---|
As all of you already know we are constantly struggling with our problematic storage. The University have bought a new one and they planned the moving procedure. The HPC cluster should be moved this Monday (March, 6th), the other resources will follow. | |
ID: 3095 · Reply Quote | |
Thank you all for your understanding and patience. No problem. We are ready to restart our pc to crunch your wus!! | |
ID: 3096 · Reply Quote | |
I have 7 tasks stuck in upload. | |
ID: 3097 · Reply Quote | |
>>> I have 7 tasks stuck in upload. | |
ID: 3099 · Reply Quote | |
>>> I have 7 tasks stuck in upload. My post was a day earlier than your comment...so yes...stuck. Resolved now. | |
ID: 3100 · Reply Quote | |
I hope this filesystem problem gets repaired soon as I'm starting to see other troubling problems. To wit: Supposedly task 230186_Hs_T142754-TRIML1-wu96_1677560784479 was sent to one of my hosts on 28-Feb-2023 19:35:31 UTC. Looking through the client log, that task is not found in any form. At that time in the log there was an indication that a scheduler request timed out. I was never sent the WU, but evidently the server thought it had been sent. Consequently, 5 days later the task was listed as an error against my host as Timed out -- no response. I'm seeing a rising number of these against all my hosts. All my hosts return WUs within the 5 day return period. Several were flagged against my 128-thread EPYC server, which cannot get even one day of work before hitting the "limit of tasks in progress" message. All systems are up 24/7/365 (mostly, except brief reboots for security fixes). | |
ID: 3101 · Reply Quote | |
I also noticed these "ghost" tasks. | |
ID: 3102 · Reply Quote | |
It is not a major problem, as only about 86 of 15300 WUs were flagged as timed out. I don't see them when the filesystem isn't acting up. Hopefully, after the move, we won't see them anymore. | |
ID: 3103 · Reply Quote | |
I noticed that one way to get lost work units like these marked as invalid immediately, so that they are sent out to another computer, is to detach the computer that was assigned the lost work units from the project and then reattach the same computer. Do this only after all of the work units on that machine have been uploaded and reported, which you can ensure by first setting the project to "No new tasks" mode. The detach and reattach will cause the scheduler to mark the lost work units as abandoned. | |
ID: 3104 · Reply Quote | |
Meanwhile the work generator will be active just a few hours a day, so very few workunits will be available. By the looks of things the server is coping well with the amount of work ready to send (17,636). A lot of work can be produced considering the "work generator" is only "running a few hours a day". Currently 56,087 results are being processed :-) You're welcome; happy to wait until the project has moved to new hardware before I process more work. | |
ID: 3105 · Reply Quote | |
Hi, new user here (not new to boinc). I'm getting failed attempts when I try to attach this project to existing machines. "Failed to add project, please try again later". Is this expected in any way? | |
ID: 3106 · Reply Quote | |
>>> The University have bought a new one and they planned the moving procedure. The HPC cluster should be moved this Monday (March, 6th), the other resources will follow. Any news?? | |
ID: 3108 · Reply Quote | |
>>> The University have bought a new one and they planned the moving procedure. The HPC cluster should be moved this Monday (March, 6th), the other resources will follow. Not really. The University should have finished moving the HPC-related storage. That's probably why the old storage "seems" to be much more reliable than before (I let the work generator run continuously). We are still waiting for our turn to move. | |
ID: 3109 · Reply Quote | |
Thanks for confirming what was happening with the work generator. I have to agree everything seems to be running a lot nicer currently on the old server :-) | |
ID: 3110 · Reply Quote | |
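
For anyone who prefers the command line, the detach and reattach procedure described in post 3104 might be sketched as a sequence of `boinccmd` calls. This is only a rough illustration, not something tested against this project: the project URL and account key below are placeholders, and the helper name `detach_reattach_commands` is made up for the example. The function only builds the commands and does not run them, because the detach step should wait until every task has been uploaded and reported (which can be checked with `boinccmd --get_tasks`).

```python
from typing import List

BOINCCMD = "boinccmd"  # assumption: the BOINC command-line tool is on PATH


def detach_reattach_commands(project_url: str, account_key: str) -> List[List[str]]:
    """Build, in order, the boinccmd calls for the procedure in post 3104.

    Nothing is executed here. Run the detach step only after all tasks
    on the host have been uploaded and reported.
    """
    return [
        # 1. "No new tasks": stop fetching work so the queue can drain.
        [BOINCCMD, "--project", project_url, "nomorework"],
        # 2. Contact the scheduler so that finished tasks get reported.
        [BOINCCMD, "--project", project_url, "update"],
        # 3. Detach the host from the project.
        [BOINCCMD, "--project", project_url, "detach"],
        # 4. Reattach the same host using the account key.
        [BOINCCMD, "--project_attach", project_url, account_key],
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the real project URL and your own key.
    for argv in detach_reattach_commands(
        "https://example.invalid/project/", "YOUR_ACCOUNT_KEY"
    ):
        print(" ".join(argv))
```

Printing the commands first makes it easy to review them and to pause between steps 2 and 3 until the transfer and task lists are empty.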