Storage problem (again)

Message boards : News : Storage problem (again)

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3095 - Posted: 2 Mar 2023, 10:00:42 UTC
Last modified: 2 Mar 2023, 10:01:06 UTC

As you all already know, we are constantly struggling with our problematic storage. The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.
Thank you all for your understanding and patience.

[VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 3096 - Posted: 2 Mar 2023, 10:29:22 UTC - in response to Message 3095.

Thank you all for your understanding and patience.


No problem. We are ready to restart our PCs to crunch your WUs!!

Greg_BE
Joined: 22 Feb 22
Posts: 7
Credit: 321,413
RAC: 0
Belgium
Message 3097 - Posted: 2 Mar 2023, 11:13:09 UTC - in response to Message 3096.
Last modified: 2 Mar 2023, 11:13:55 UTC

I have 7 tasks stuck in upload.
Is there any chance of getting them uploaded to you before Monday?
Otherwise that's a lot of time they will sit on my machine, and there is no way to cancel the upload without killing the tasks.

adrianxw
Joined: 22 Dec 16
Posts: 36
Credit: 8,489,198
RAC: 0
Denmark
Message 3099 - Posted: 3 Mar 2023, 7:13:22 UTC

>>> I have 7 tasks stuck in upload.

Stuck? My tasks are uploading without issue, I've just sent a couple within the last five minutes. Is there another issue?
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Greg_BE
Joined: 22 Feb 22
Posts: 7
Credit: 321,413
RAC: 0
Belgium
Message 3100 - Posted: 3 Mar 2023, 22:57:14 UTC - in response to Message 3099.

>>> I have 7 tasks stuck in upload.

Stuck? My tasks are uploading without issue, I've just sent a couple within the last five minutes. Is there another issue?



My post was a day earlier than your comment... so yes, stuck. Resolved now.

entity
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 2
United States
Message 3101 - Posted: 6 Mar 2023, 15:37:32 UTC - in response to Message 3100.

I hope this filesystem problem gets repaired soon, as I'm starting to see other troubling problems. To wit: supposedly task 230186_Hs_T142754-TRIML1-wu96_1677560784479 was sent to one of my hosts on 28-Feb-2023 19:35:31 UTC. Looking through the client log, that task does not appear in any form; around that time the log shows a scheduler request that timed out. The WU was never actually sent to me, but evidently the server thought it was. Consequently, 5 days later the task was listed as an error against my host: "Timed out -- no response". I'm seeing a rising number of these against all my hosts, even though all of them return WUs within the 5-day deadline. Several were flagged against my 128-thread EPYC server, which cannot get even one day of work before hitting the "limit of tasks in progress" message. All systems are up 24/7/365 (mostly, except for brief reboots for security fixes).

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3102 - Posted: 7 Mar 2023, 10:08:25 UTC - in response to Message 3101.

I also noticed these "ghost" tasks.

I don't know the exact reason, but when the file system is acting up, everything on the server is painfully slow, even at the lower levels (kernel, locks, sockets). The server probably starts sending out a workunit and flags it in the DB, but then times out reading it from disk or sending it via HTTP.

Workunits like these have the effect of delaying the computation until the deadline (and yes, they will be flagged as "timeout error"). I know it's annoying, but at least it's not a waste of computation.

entity
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 2
United States
Message 3103 - Posted: 7 Mar 2023, 14:17:02 UTC - in response to Message 3102.

It is not a major problem, as only about 86 of 15,300 WUs were flagged as timed out. I don't see them when the filesystem isn't acting up. Hopefully, after the move, we won't see them anymore.

Jesse Viviano
Joined: 18 Dec 16
Posts: 1
Credit: 2,724,032
RAC: 0
United States
Message 3104 - Posted: 7 Mar 2023, 23:21:12 UTC - in response to Message 3102.

I noticed one way to get lost workunits like these marked as invalid immediately, forcing them to be sent out to another computer: detach the computer that was assigned the lost workunits from the project, then reattach the same computer. The detach and reattach should be done only after all the workunits on that machine have been uploaded and reported, by first setting the project to "No new tasks" mode. This will cause the scheduler to mark the lost workunits as abandoned.
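For clients managed from the command line, the steps above can be sketched with BOINC's `boinccmd` tool (a sketch, assuming a client running on localhost; `PROJECT_URL` and `ACCOUNT_KEY` are placeholders for this project's master URL and your account key, not values from this thread):

```shell
# 1. Stop fetching new work from the project ("No new tasks" mode)
boinccmd --project PROJECT_URL nomorework

# ... wait until every remaining task has been uploaded and reported ...

# 2. Detach; the scheduler then treats the "ghost"/lost workunits as abandoned
boinccmd --project PROJECT_URL detach

# 3. Reattach the same computer to the project
boinccmd --project_attach PROJECT_URL ACCOUNT_KEY
```

These commands cannot run without a local BOINC client, so treat them as an outline of the procedure rather than a tested script.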

Speedy
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3105 - Posted: 9 Mar 2023, 0:18:52 UTC - in response to Message 3095.

Meanwhile the work generator will be active just a few hours a day, so very few workunits will be available.
Thank you all for your understanding and patience.

By the looks of things the server is coping well with the amount of work ready to send (17,636); that's a lot of work considering the "work generator" is only "running a few hours a day". Currently 56,087 results are being processed :-) You're welcome, happy to wait until the project has moved to the new hardware before I process more work.

davidjharder
Joined: 27 Feb 23
Posts: 1
Credit: 546,165
RAC: 0
Canada
Message 3106 - Posted: 9 Mar 2023, 16:19:54 UTC

Hi, new user here (not new to BOINC). I'm getting failed attempts when I try to attach this project on existing machines: "Failed to add project, please try again later". Is this expected in any way?

[VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 3108 - Posted: 14 Mar 2023, 10:02:07 UTC - in response to Message 3095.

The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.


Any news??

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 6
Italy
Message 3109 - Posted: 14 Mar 2023, 19:29:07 UTC - in response to Message 3108.

The University has bought a new one and has planned the migration: the HPC cluster should be moved this Monday (March 6th), and the other resources will follow.
I will probably have to shut down the server at some point; I will tell you the date as soon as I know it. Meanwhile the work generator will be active only a few hours a day, so very few workunits will be available.


Any news??

Not really. The University should have finished moving the HPC-related storage; that's probably why the old storage "seems" much more reliable than before (I am now letting the work generator run continuously).
We are still waiting for our turn to move.

Speedy
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3110 - Posted: 16 Mar 2023, 23:34:06 UTC - in response to Message 3109.
Last modified: 16 Mar 2023, 23:35:23 UTC

Thanks for confirming what was happening with the work generator. I have to agree, everything seems to be running a lot more smoothly on the old server at the moment :-)
Out of curiosity, when the move to the new server happens, what will happen to work that is currently being processed?







Copyright © 2024 CNR-TN & UniTN