Bad hosts topic

manalog
Joined: 5 Oct 15
Posts: 33
Credit: 1,098,442
RAC: 0
Italy
Message 2287 - Posted: 19 Apr 2021, 9:52:43 UTC

Found a bad host: http://gene.disi.unitn.it/test/show_host_detail.php?hostid=25653

Bryn Mawr
Joined: 23 Jun 20
Posts: 43
Credit: 14,256,442
RAC: 7
United Kingdom
Message 2288 - Posted: 19 Apr 2021, 11:03:47 UTC - in response to Message 2287.

Good spot, that’s as bad as they come.

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 340
Italy
Message 2289 - Posted: 19 Apr 2021, 12:26:33 UTC - in response to Message 2288.

This probably belongs to someone who doesn't even know what is happening here... Anyway, I just blacklisted it.

Aurum
Joined: 18 Jul 18
Posts: 97
Credit: 291,241,518
RAC: 266
United States
Message 2670 - Posted: 17 May 2022, 16:10:23 UTC
Last modified: 17 May 2022, 16:12:16 UTC

I have a slug of WUs that are going to take over 3 days to run, e.g. http://gene.disi.unitn.it/test/workunit.php?wuid=35037615

My wingman completed it in a few hours as normal. It affects all the WUs running on the same computer, Rig-31, which has a Xeon E5-2699 v3 with 4x8 GB RAM. I've rebooted and reduced the CPU utilization, but that does not speed them up.

My wingman Technologov is also running Linux, with an E5-2680 v4 (14c/28t) showing 56 processors, so it must be a dual-CPU server MB, and 64 GB RAM.
http://gene.disi.unitn.it/test/show_host_detail.php?hostid=78642

I can't see any reason for this computer to run them so slowly. But I've been running BOINC 24x7 for years and I'm literally wearing out MBs. Some ooze oil from the PWM caps or stop communicating with one or more PCIe slots. This ASRock X99 Extreme4 may be at the end of its life and ready for the scrap heap.

Penny for your thoughts if you have a suggestion as to what might be my problem. TIA

Bryn Mawr
Joined: 23 Jun 20
Posts: 43
Credit: 14,256,442
RAC: 7
United Kingdom
Message 2671 - Posted: 17 May 2022, 18:41:49 UTC - in response to Message 2670.

Probably not your PC.

I also have the occasional WU that takes double the normal time or more where a wingman running similar kit takes the normal time. Why? I don’t know, I just accept it and carry on.

Latest example, my R9/3900 took 20,000 seconds, wingman’s R7/5800x took the normal 10,000 seconds.

Aurum
Joined: 18 Jul 18
Posts: 97
Credit: 291,241,518
RAC: 266
United States
Message 2672 - Posted: 17 May 2022, 20:56:25 UTC
Last modified: 17 May 2022, 20:57:39 UTC

I suspended all but 5 and went away. They're running much faster now. The CPU is 18c/36t, so I'm going to work my way up to 18 WUs. This feels like the problem where WUs load too much into the L3 cache and choke the CPU's traffic cop.

What feels so strange is that TN-Grid is the only project I'm running and yet this only affected a single computer. I sorted BoincTasks by WU name and other WUs with the same prefix on other computers are running at normal speed.

Does anyone know of a utility that monitors CPU cache utilization?

Bryn Mawr
Joined: 23 Jun 20
Posts: 43
Credit: 14,256,442
RAC: 7
United Kingdom
Message 2674 - Posted: 17 May 2022, 22:24:49 UTC - in response to Message 2672.

That makes sense, the Ryzen 5 series has twice the L3 cache of the 3 series.

Aurum
Joined: 18 Jul 18
Posts: 97
Credit: 291,241,518
RAC: 266
United States
Message 2675 - Posted: 18 May 2022, 12:10:56 UTC
Last modified: 18 May 2022, 12:17:01 UTC

Think I see what's happening, the CPU clocks are slowed down:
sudo inxi -C
CPU: Topology: 18-Core model: Intel Xeon E5-2699 v3 bits: 64 type: MT MCP
L2 cache: 45.0 MiB
Speed: 241 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 250 2: 240 3: 235 4: 237
5: 248 6: 236 7: 236 8: 235 9: 234 10: 240 11: 223 12: 222 13: 255 14: 221 15: 212
16: 219 17: 230 18: 250 19: 307 20: 226 21: 221 22: 246 23: 238 24: 236 25: 228
26: 263 27: 249 28: 219 29: 221 30: 291 31: 236 32: 234 33: 245 34: 231 35: 237
36: 238

I had these two lines in my cc_config:

<process_priority>3</process_priority>
<process_priority_special>2</process_priority_special>

I never noticed they caused a problem before so I just left them alone. I changed them to:

<process_priority>0</process_priority>
<process_priority_special>0</process_priority_special>

and the clocks sped up:
CPU: Topology: 18-Core model: Intel Xeon E5-2699 v3 bits: 64 type: MT MCP
L2 cache: 45.0 MiB
Speed: 1247 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 1205 2: 1295 3: 1199
4: 1343 5: 1277 6: 1210 7: 2525 8: 2310 9: 2051 10: 1199 11: 1226 12: 1325 13: 1282
14: 1300 15: 1387 16: 1199 17: 1247 18: 1199 19: 2298 20: 1199 21: 1199 22: 1199
23: 1291 24: 1254 25: 1199 26: 1199 27: 1877 28: 1980 29: 2001 30: 1199 31: 1199
32: 1287 33: 1199 34: 2179 35: 1209 36: 1850
Well that was with CPU utilization at 12% (4t). When I switched it back to 95% (34t) all the clocks went back to 250ish.
I guess it's better to just let the CPU regulate itself and not force it into a different state. Maybe it's just something about the E5-2699 v3.
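
For anyone who wants to script the cc_config change described above rather than edit the file by hand, here is a minimal sketch in Python. It assumes the two options sit inside the usual <options> element of cc_config.xml. The function name reset_priorities and the sample string are made up for illustration, and the thread does not say where the file lives on this machine, so the sketch works on a string instead of a path.

```python
import xml.etree.ElementTree as ET


def reset_priorities(xml_text: str) -> str:
    """Return cc_config XML with both priority options set to 0.

    Hypothetical helper: it only touches the two options named in the post
    and leaves everything else in the document alone.
    """
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        # Assumption: the options belong under <cc_config><options>.
        options = ET.SubElement(root, "options")
    for tag in ("process_priority", "process_priority_special"):
        node = options.find(tag)
        if node is None:
            node = ET.SubElement(options, tag)
        node.text = "0"
    return ET.tostring(root, encoding="unicode")


# Sample input mirroring the values quoted in the post (3 and 2).
sample = """<cc_config>
  <options>
    <process_priority>3</process_priority>
    <process_priority_special>2</process_priority_special>
  </options>
</cc_config>"""

updated = reset_priorities(sample)
print(updated)
```

The client still has to re-read its config files (or be restarted) before a change like this takes effect. As the later posts show, the priority settings turned out not to be the root cause here, so treat this only as a way to get back to the default values.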

Aurum
Joined: 18 Jul 18
Posts: 97
Credit: 291,241,518
RAC: 266
United States
Message 2676 - Posted: 18 May 2022, 17:54:33 UTC

Looks like it's fixed. Installed psensor and watched the CPU temps:

sudo apt-get install psensor psensor-server -y

The E5-2699 v3 has a max CPU temp of 76.4 C before it downclocks the cores to protect them, and the cores were all running hot, some up to 88 C. https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2699%20v3.html

It had a Cooler Master CPU cooler and they have the lowest quality fans. Replaced the fan. I also checked the thermal paste and it was hard, so I replaced it with my ThermalRight TF-8. http://www.thermalright.com/product/tf8-thermal-paste-2g/
TF-8 stays soft and pliable for months, whereas others dry out and harden, forming voids with low thermal conductivity.
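
The temperature check done with psensor above can also be scripted. This is a rough sketch, assuming a Linux box that exposes sensors through the standard sysfs hwmon files (temp*_input, in millidegrees Celsius). The 76.4 C limit is just the figure quoted in this post, not something read from the hardware, and the names read_hwmon_temps and over_limit are made up for illustration.

```python
from pathlib import Path

LIMIT_C = 76.4  # limit quoted in the post above, not read from the CPU


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Yield (sensor_file, degrees_C) for every temp*_input file found."""
    for f in sorted(Path(base).glob("hwmon*/temp*_input")):
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            yield str(f), int(f.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            # Some sensors are unreadable or report junk; skip them.
            continue


def over_limit(temps, limit=LIMIT_C):
    """Return the (name, temp) pairs that are strictly above the limit."""
    return [(name, t) for name, t in temps if t > limit]


# Prints nothing if no sensors are found or none are above the limit.
for name, temp in over_limit(read_hwmon_temps()):
    print(f"{name}: {temp:.1f} C is above {LIMIT_C} C")
```

Not every temp*_input file is a CPU core (drives and chipset sensors show up there too), so the output needs a sanity check against what psensor or lm-sensors report.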

CPU: Topology: 18-Core model: Intel Xeon E5-2699 v3 bits: 64 type: MT MCP
L2 cache: 45.0 MiB
Speed: 2733 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 2608 2: 2609 3: 2610
4: 2295 5: 2610 6: 2611 7: 2611 8: 2611 9: 2611 10: 2611 11: 2611 12: 2612 13: 2609
14: 2612 15: 2612 16: 2612 17: 2612 18: 2612 19: 2613 20: 2613 21: 2613 22: 2613
23: 2299 24: 2609 25: 2614 26: 2614 27: 2614 28: 2614 29: 2614 30: 2614 31: 2609
32: 2609 33: 2609 34: 2610 35: 2610 36: 2610
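
The before/after inxi listings in this thread can be compared with a few lines of Python instead of by eye. This is only a sketch: it assumes the text layout shown in the posts above ("min/max: LOW/HIGH MHz" followed by "Core speeds (MHz): N: SPEED ..."), which may differ between inxi versions. The function names and the 5% tolerance are arbitrary choices for illustration.

```python
import re


def parse_inxi(text):
    """Extract (min_mhz, max_mhz, {core: mhz}) from `inxi -C` style text."""
    flat = " ".join(text.split())  # undo the line wrapping
    m = re.search(r"min/max: (\d+)/(\d+) MHz", flat)
    if m is None or "Core speeds (MHz):" not in flat:
        raise ValueError("text does not look like the inxi output in this thread")
    tail = flat.split("Core speeds (MHz):", 1)[1]
    cores = {int(c): int(s) for c, s in re.findall(r"(\d+): (\d+)", tail)}
    return int(m.group(1)), int(m.group(2)), cores


def throttled(text, tolerance=0.95):
    """List cores running clearly below the advertised minimum clock."""
    lo, _, cores = parse_inxi(text)
    return sorted(c for c, mhz in cores.items() if mhz < lo * tolerance)


# Shortened excerpts of the first and last listings posted above.
slow = """Speed: 241 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 250 2: 240 3: 235 4: 237
5: 248 6: 236"""
fixed = """Speed: 2733 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 2608 2: 2609 3: 2610
4: 2295 5: 2610 6: 2611"""

print("throttled before:", throttled(slow))
print("throttled after:", throttled(fixed))
```

Cores reporting far below the 1200 MHz minimum, as in the first listing, are a strong hint that something outside the normal frequency governor (here, thermal protection) is holding the clocks down.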




Copyright © 2024 CNR-TN & UniTN