Server down this weekend

Author Message
Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1277 - Posted: 20 Apr 2018, 18:02:34 UTC
Last modified: 20 Apr 2018, 18:05:33 UTC

Our storage system (managed by the University) has a serious problem. They are working to fix it but cannot estimate how long it will be down. We have decided to stop the server at least until next Monday.
Thank you for your understanding.

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1278 - Posted: 24 Apr 2018, 9:09:58 UTC - in response to Message 1277.

They are still working on the system. I do not know the details but I guess it has to be a serious problem. I cannot do anything until they fix and test it.

Profile marsinph
Joined: 18 Apr 18
Posts: 4
Credit: 3,292,957
RAC: 0
Belgium
Message 1279 - Posted: 25 Apr 2018, 11:47:21 UTC

Hello
Any news about restarting?
Best regards from Belgium

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1280 - Posted: 26 Apr 2018, 9:01:48 UTC - in response to Message 1279.

The storage system is still down. There is a 'status page' here https://icts.unitn.it/alert/sistema-storage-nx-collina-non-raggiungibile (although in Italian). I cannot do anything but wait....

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 1281 - Posted: 27 Apr 2018, 9:01:53 UTC - in response to Message 1280.

The storage system is still down. There is a 'status page' here https://icts.unitn.it/alert/sistema-storage-nx-collina-non-raggiungibile (although in Italian)


It seems that the problem is solved...

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1282 - Posted: 27 Apr 2018, 9:13:10 UTC - in response to Message 1281.

The storage system should be up and running (fixed), but over the next few days it will probably go down a few more times so that it can be fully tested. So I decided to restart the BOINC server components now, but not the work generator until next Monday.

I don't know whether simply restarting the system after several days of inactivity will cause any trouble (probably with deadlines...)

Senilix
Joined: 17 May 14
Posts: 1
Credit: 370,741
RAC: 0
Germany
Message 1283 - Posted: 2 May 2018, 11:26:36 UTC - in response to Message 1282.

The storage system should be up and running (fixed), but over the next few days it will probably go down a few more times so that it can be fully tested. So I decided to restart the BOINC server components now, but not the work generator until next Monday.

I don't know whether simply restarting the system after several days of inactivity will cause any trouble (probably with deadlines...)

Looks like the work generator is back up and running? I just received a couple of dozen fresh WUs...

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 1284 - Posted: 3 May 2018, 6:44:09 UTC - in response to Message 1283.

Looks like the work generator is back up and running? I just received a couple of dozen fresh WUs...


Yessss, after a LONG time I received some WUs (vitis)!!!

[SG]Felix
Joined: 18 Oct 17
Posts: 9
Credit: 1,303,222
RAC: 0
Germany
Message 1302 - Posted: 19 May 2018, 8:15:13 UTC

I got the message that the server is out of disk space, and I can see db_purge is turned off.

Are you aware of it?

mmonnin
Joined: 24 Oct 16
Posts: 14
Credit: 4,519,646
RAC: 0
United States
Message 1303 - Posted: 19 May 2018, 11:05:12 UTC

I don't recall how large the uploads were before, but at 9.4 MB it probably doesn't take long for the disk to fill up.

Purge is something that probably doesn't need to run all the time.
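
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. Only the 9.4 MB figure comes from the post above; the 100 GB of free space and the function name are made-up assumptions, not numbers from the TN-Grid server.

```python
def uploads_until_full(free_bytes: int, upload_bytes: int) -> int:
    """Return how many whole uploads of the given size fit in the free space."""
    if upload_bytes <= 0:
        raise ValueError("upload_bytes must be positive")
    return free_bytes // upload_bytes


MB = 1_000_000
GB = 1_000 * MB

# Hypothetical example: 100 GB free on the upload volume, 9.4 MB per result.
free_space = 100 * GB
upload_size = 9_400_000  # 9.4 MB, as mentioned in the post

print(uploads_until_full(free_space, upload_size), "uploads fit before the disk is full")
```

With many hosts returning results at once, a volume of that size could plausibly fill within days if nothing is purged or moved away.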

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 1304 - Posted: 19 May 2018, 11:43:59 UTC - in response to Message 1302.

I got the message that the server is out of disk space, and I can see db_purge is turned off.



+1

19/05/2018 13:34:59 | TN-Grid Platform | [error] Error reported by file upload server: can't write file /home/boincadm/projects/test/upload/2cc/141327_Hs_T173892-IL6_wu-97_1526673370376_0_0: No space left on server
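
A "No space left" condition like this could in principle be spotted on the server before uploads start failing. Below is a minimal sketch in Python using the standard library's `shutil.disk_usage`; the 10% threshold and both function names are assumptions for illustration, and this is not part of BOINC itself.

```python
import shutil


def is_low(total_bytes: int, free_bytes: int, min_free_fraction: float = 0.10) -> bool:
    """Return True if the free share of the volume is below the threshold."""
    if total_bytes <= 0:
        # Some pseudo-filesystems report zero capacity; treat that as low.
        return True
    return free_bytes / total_bytes < min_free_fraction


def upload_dir_is_low(path: str, min_free_fraction: float = 0.10) -> bool:
    """Check the filesystem that holds `path` (e.g. the project's upload directory)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free in bytes
    return is_low(usage.total, usage.free, min_free_fraction)
```

Run periodically (for example from cron) against the upload directory, a check like `upload_dir_is_low("/path/to/upload")` could send a warning while there is still time to react.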

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1305 - Posted: 19 May 2018, 13:42:38 UTC - in response to Message 1304.
Last modified: 19 May 2018, 14:00:54 UTC

I noticed the problem. Unfortunately I cannot do anything to try to fix it until Monday. For now I was only able to stop the work generator.

mmonnin
Joined: 24 Oct 16
Posts: 14
Credit: 4,519,646
RAC: 0
United States
Message 1306 - Posted: 19 May 2018, 20:42:46 UTC - in response to Message 1305.

I noticed the problem. Unfortunately I cannot do anything to try to fix it until Monday. For now I was only able to stop the work generator.


Work is still being sent out, but the tasks error out for everyone. Can they be stopped?

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1310 - Posted: 20 May 2018, 10:14:43 UTC - in response to Message 1306.

I just stopped the BOINC server....

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1311 - Posted: 22 May 2018, 9:22:49 UTC - in response to Message 1310.
Last modified: 22 May 2018, 9:24:27 UTC

I just stopped the BOINC server....

I moved the upload directory to another storage system and restarted the server. We still have to find a way to reduce the size of the output file.
You may see a lot of workunits erroring out very quickly. The reason is bad input files (zero bytes or similar) that were created while the disk was full. They should auto-abort after a few tries.
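
For anyone curious how such bad input files could be located, here is a minimal sketch in Python. It only covers the zero-byte case (not the "or similar" ones), and the function name is hypothetical; this is not the procedure actually used on the server.

```python
from pathlib import Path


def find_empty_files(directory: str) -> list[Path]:
    """Return all zero-byte regular files under `directory`, recursively, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling `find_empty_files()` on a download or input directory would list the candidates; whether to delete or regenerate them is a separate decision.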

pls2000
Joined: 3 May 18
Posts: 4
Credit: 0
RAC: 0
United States
Message 1312 - Posted: 22 May 2018, 9:41:12 UTC

Some of the tasks I've had waiting to upload have done so. Others are still failing with the message:

2018-05-22 02:38:33 | TN-Grid Platform | [error] Error reported by file upload server: can't open file

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 1313 - Posted: 22 May 2018, 9:53:19 UTC

Welcome back!!
And....my first million!!!

mmonnin
Joined: 24 Oct 16
Posts: 14
Credit: 4,519,646
RAC: 0
United States
Message 1314 - Posted: 22 May 2018, 10:33:32 UTC - in response to Message 1312.

Some of the tasks I've had waiting to upload have done so. Others are still failing with the message:

2018-05-22 02:38:33 | TN-Grid Platform | [error] Error reported by file upload server: can't open file


I am getting this too. Server site is really slow too.

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 206
Italy
Message 1315 - Posted: 22 May 2018, 11:34:43 UTC - in response to Message 1314.
Last modified: 22 May 2018, 16:29:15 UTC

Some of the tasks I've had waiting to upload have done so. Others are still failing with the message:

2018-05-22 02:38:33 | TN-Grid Platform | [error] Error reported by file upload server: can't open file


I am getting this too. Server site is really slow too.

I had to move the upload directory to NFS-mounted storage (which is slower than the previous one). Also, a lot of people are uploading their results, so the server is overwhelmed by work...
[edit] but it is getting better...




Copyright © 2024 CNR-TN & UniTN