Storage problem (again)

Message boards : News : Storage problem (again)

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3095 - Posted: 2 Mar 2023, 10:00:42 UTC
Last modified: 2 Mar 2023, 10:01:06 UTC

As you all already know, we are constantly struggling with our problematic storage. The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.
Thank you all for your understanding and patience.

[VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 3096 - Posted: 2 Mar 2023, 10:29:22 UTC - in response to Message 3095.

Thank you all for your understanding and patience.


No problem. We are ready to restart our PCs to crunch your WUs!!

Greg_BE
Joined: 22 Feb 22
Posts: 7
Credit: 321,413
RAC: 0
Belgium
Message 3097 - Posted: 2 Mar 2023, 11:13:09 UTC - in response to Message 3096.
Last modified: 2 Mar 2023, 11:13:55 UTC

I have 7 tasks stuck in upload.
Is there any chance of getting them uploaded to you before Monday?
Otherwise that's a lot of time they will sit on my machine, and there is no way to cancel the upload without killing the tasks.

adrianxw
Joined: 22 Dec 16
Posts: 36
Credit: 8,489,198
RAC: 0
Denmark
Message 3099 - Posted: 3 Mar 2023, 7:13:22 UTC

>>> I have 7 tasks stuck in upload.

Stuck? My tasks are uploading without issue, I've just sent a couple within the last five minutes. Is there another issue?
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Greg_BE
Joined: 22 Feb 22
Posts: 7
Credit: 321,413
RAC: 0
Belgium
Message 3100 - Posted: 3 Mar 2023, 22:57:14 UTC - in response to Message 3099.

>>> I have 7 tasks stuck in upload.

Stuck? My tasks are uploading without issue, I've just sent a couple within the last five minutes. Is there another issue?



My post was a day earlier than your comment... so yes, stuck. Resolved now.

entity
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 2
United States
Message 3101 - Posted: 6 Mar 2023, 15:37:32 UTC - in response to Message 3100.

I hope this filesystem problem gets repaired soon, as I'm starting to see other troubling problems. To wit: supposedly task 230186_Hs_T142754-TRIML1-wu96_1677560784479 was sent to one of my hosts on 28-Feb-2023 19:35:31 UTC. Looking through the client log, that task does not appear in any form; around that time the log shows a scheduler request that timed out. The WU was never actually sent to me, but evidently the server thought it was. Consequently, 5 days later the task was listed as an error against my host: "Timed out -- no response". I'm seeing a rising number of these against all my hosts, even though all of them return WUs within the 5-day deadline. Several were flagged against my 128-thread EPYC server, which cannot get even one day of work before hitting the "limit of tasks in progress" message. All systems are up 24/7/365 (mostly, except for brief reboots for security fixes).

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3102 - Posted: 7 Mar 2023, 10:08:25 UTC - in response to Message 3101.

I also noticed these "ghost" tasks.

I don't know the exact reason, but when the file system is acting up, everything on the server is painfully slow, even at the lower levels (kernel, locks, sockets). The server probably starts sending out a workunit and flags it in the DB, but then times out reading it from disk or sending it via HTTP.

Workunits like these have the effect of delaying the computation until the deadline (and yes, they will be flagged as "timeout error"). I know it's annoying, but at least it's not a waste of computation.

entity
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 2
United States
Message 3103 - Posted: 7 Mar 2023, 14:17:02 UTC - in response to Message 3102.

It is not a major problem, as only about 86 of 15,300 WUs were flagged as timed out. I don't see them when the filesystem isn't acting up. Hopefully, after the move, we won't see them anymore.

Jesse Viviano
Joined: 18 Dec 16
Posts: 1
Credit: 2,724,032
RAC: 0
United States
Message 3104 - Posted: 7 Mar 2023, 23:21:12 UTC - in response to Message 3102.

I noticed one way to get lost workunits like these marked as invalid immediately, forcing them to be sent out to another computer: detach the computer that was assigned the lost workunits from the project, then reattach the same computer. The detach and reattach should be done only after all the workunits on that machine have been uploaded and reported, by first setting the project to "No new tasks" mode. This will cause the scheduler to mark the lost workunits as abandoned.
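For clients managed from the command line, the steps above can be sketched with BOINC's `boinccmd` tool (a sketch, assuming a client running on localhost; `PROJECT_URL` and `ACCOUNT_KEY` are placeholders for this project's master URL and your account key, not values from this thread):

```shell
# 1. Stop fetching new work from the project ("No new tasks" mode)
boinccmd --project PROJECT_URL nomorework

# ... wait until every remaining task has been uploaded and reported ...

# 2. Detach; the scheduler then treats the "ghost"/lost workunits as abandoned
boinccmd --project PROJECT_URL detach

# 3. Reattach the same computer to the project
boinccmd --project_attach PROJECT_URL ACCOUNT_KEY
```

These commands cannot run without a local BOINC client, so treat them as an outline of the procedure rather than a tested script.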

Speedy
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3105 - Posted: 9 Mar 2023, 0:18:52 UTC - in response to Message 3095.

Meanwhile the work generator will be active just a few hours a day, so very few workunits will be available.
Thank you all for your understanding and patience.

By the looks of things the server is coping well with the amount of work ready to send (17,636); that's a lot of work considering the "work generator" is only "running a few hours a day". Currently 56,087 results are being processed :-) You're welcome, happy to wait until the project has moved to the new hardware before I process more work.

davidjharder
Joined: 27 Feb 23
Posts: 1
Credit: 546,165
RAC: 0
Canada
Message 3106 - Posted: 9 Mar 2023, 16:19:54 UTC

Hi, new user here (not new to BOINC). I'm getting failed attempts when I try to attach this project on existing machines: "Failed to add project, please try again later". Is this expected in any way?

[VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 3108 - Posted: 14 Mar 2023, 10:02:07 UTC - in response to Message 3095.

The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.


Any news??

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3109 - Posted: 14 Mar 2023, 19:29:07 UTC - in response to Message 3108.

The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.


Any news??

Not really. The University should have finished moving the HPC-related storage; that's probably why the old storage "seems" much more reliable than before (I am now letting the work generator run continuously).
We are still waiting for our turn to move.

Speedy
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3110 - Posted: 16 Mar 2023, 23:34:06 UTC - in response to Message 3109.
Last modified: 16 Mar 2023, 23:35:23 UTC

Thanks for confirming what was happening with the work generator. I have to agree, everything seems to be running a lot more smoothly on the old server at the moment :-)
Out of curiosity, when the move to the new server happens, what will happen to work that is currently being processed?







Copyright © 2024 CNR-TN & UniTN