OUT of tasks

Message boards : Number crunching : OUT of tasks

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · Next
Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3227 - Posted: 28 May 2023, 10:48:43 UTC - in response to Message 3225.

AMD Zen 2 cores are an equivalent of older Intel Skylake cores. (6th Gen up to 10th Gen Core i7 chips).
While the actual architectural improvements between the 6th and 8th Gen Intel cores are debated, there is a significant increase in computing performance (and in performance per watt) between each generation (except between the 9th and the 10th Gen cores), so it's inadequate to lump all the CPU generations from the 6th to the 10th together.

They [the AMD Zen 2 cores] are significantly slower than newer Intel 12th and 13th gen Core i7 chips.
While this is true, that's not the reason for their poor performance here (on TN-Grid).
The real reason for their poor performance is a misunderstanding of Hyper-Threading (Simultaneous Multi-Threading) that leads to overwhelming the execution units of the CPU cores.

If you are interested, here is the full document:
http://www.cslab.ece.ntua.gr/courses/advcomparch/2007/material/readings/Intel%20Hyper-Threading%20Technology.pdf
The key concepts of this technology are the same for Intel, AMD, or any other CPU manufacturer.

The main point can be found on page 15, titled "Keys to Hyper-Threading Technology Performance"
At the bottom of this page:

Understand Hyper-Threading Technology Processor Resources
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers and some machine-state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic and buses.


That means: if you want your (single-threaded) science application to run as fast as it can, don't use more than 50% of your CPUs for this purpose.
The 12th and 13th gen Intel CPUs have E-cores, which don't have the necessary resources to run TN-Grid (and similar scientific) applications, so the percentage of the usable "CPUs" (threads in reality) on these CPUs is even lower (34% on i9-12xxx, 25% on i9-13xxx).
Look at these two i9-13900k's:
https://gene.disi.unitn.it/test/results.php?hostid=86237&offset=0&show_names=0&state=4&appid= Run time: 1h 37m CPU time: 1h 8m, 1 error
https://gene.disi.unitn.it/test/results.php?hostid=86238&offset=0&show_names=0&state=4&appid= Run time: 1h 34m CPU time: 1h 6m, many errors
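For anyone curious how those percentages come about, here is a small sketch. The core counts are assumptions taken from Intel's public specifications, not from this thread (i9-12900K: 8 P-cores + 8 E-cores, 24 threads; i9-13900K: 8 P-cores + 16 E-cores, 32 threads), and it assumes only one task per physical P-core is worthwhile:

```python
def usable_percentage(p_cores: int, e_cores: int) -> float:
    """Share of OS-visible 'CPUs' worth loading with single-threaded science
    tasks, assuming one task per physical P-core and none on E-cores."""
    threads = p_cores * 2 + e_cores  # P-cores are hyper-threaded, E-cores are not
    return 100.0 * p_cores / threads

print(f"i9-12900K: {usable_percentage(8, 8):.0f}%")   # 33%, close to the 34% quoted
print(f"i9-13900K: {usable_percentage(8, 16):.0f}%")  # 25%, matching the quote
```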

Speedy
Send message
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3228 - Posted: 28 May 2023, 21:50:09 UTC - in response to Message 3227.

Thank you for all the information, very interesting.


That means: if you want your (single-threaded) science application to run as fast as it can, don't use more than 50% of your CPUs for this purpose.
The 12th and 13th gen Intel CPUs have E-cores, which don't have the necessary resources to run TN-Grid (and similar scientific) applications, so the percentage of the usable "CPUs" (threads in reality) on these CPUs is even lower (34% on i9-12xxx, 25% on i9-13xxx).
Look at these two i9-13900k's:
https://gene.disi.unitn.it/test/results.php?hostid=86237&offset=0&show_names=0&state=4&appid= Run time: 1h 37m CPU time: 1h 8m, 1 error
https://gene.disi.unitn.it/test/results.php?hostid=86238&offset=0&show_names=0&state=4&appid= Run time: 1h 34m CPU time: 1h 6m, many errors

Yes, I agree the bottom host has many "errors". Although they are classed as such, I wouldn't call them "errors", because the first 20 at least are "abandoned", which to me just means "aborted by user", since the two tasks had been returned in one example I checked.

entity
Send message
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 3
United States
Message 3232 - Posted: 29 May 2023, 14:26:24 UTC

Looking at the increase in projected run times, I would venture to guess that the WUs have been increased from 600 to 800, with the last one (#58) still having 400. The number of tasks in progress has been declining for several hours.

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3233 - Posted: 29 May 2023, 16:33:37 UTC - in response to Message 3232.

Looking at the increase in projected run times, I would venture to guess that the WUs have been increased from 600 to 800 with the last one (#58) still having 400.
I confirm that. The awarded credits went up as well.

The number of tasks in progress has been declining for several hours.
This is how I think we could give the work generator more time to keep up with the pace of the crunchers. (As every function of the BOINC infrastructure runs on the same server, they take resources from each other. If we reduce the overhead of administering the workunits by decreasing their number, the host can spend more resources on generating new work and comparing results.)

Speedy
Send message
Joined: 13 Nov 21
Posts: 33
Credit: 1,020,742
RAC: 0
New Zealand
Message 3234 - Posted: 29 May 2023, 22:36:31 UTC - in response to Message 3224.

I dedicate 50% threads to BOINC (WCG/SiDock/Rosetta/TN-Grid) and 50% to Stockfish Fishtest chess.

To make BOINC use 50% of the CPU, do you use the preference "Use at most 50% of the CPUs"? I also sent you a PM.


I don't know about Intel vs AMD but it does prefer Linux. 10-15% faster IIRC.

Yes, I completely agree that a lot of BOINC applications like Linux. Contrary to this, I have a feeling that the current project, "Vitis vinifera", appears to run in under 3 hours on Windows (the times given were from before they increased the workunit size). The previous project ran for over 3 hours most of the time.

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3235 - Posted: 29 May 2023, 22:54:04 UTC - in response to Message 3234.
Last modified: 29 May 2023, 22:59:45 UTC

Let me answer:

To make BOINC use 50% of the CPU, do you use the preference "Use at most 50% of the CPUs"?
Yes.

... I have a feeling that the current project, "Vitis vinifera", appears to run in under 3 hours on Windows (the times given were from before they increased the workunit size). The previous project ran for over 3 hours most of the time.
My Windows host with an i7-9700F CPU @ 4.5 GHz can finish one longer "Vitis vinifera" workunit in 1h 52m (~6,700 s); the shorter ones took 1h 23m (~5,000 s).
This CPU is not hyper-threaded (so I don't have to use the above setting) and has 8 cores; I run 7 TN-Grid tasks simultaneously.

Technologov
Send message
Joined: 27 Jan 22
Posts: 36
Credit: 302,393,914
RAC: 8
Ukraine
Message 3236 - Posted: 30 May 2023, 0:40:04 UTC

Available Tasks are still at Zero. What is being done to address this state of affairs?

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3237 - Posted: 30 May 2023, 9:26:24 UTC - in response to Message 3236.
Last modified: 30 May 2023, 9:27:19 UTC

Available Tasks are still at Zero. What is being done to address this state of affairs?
The number of workunits in progress is slowly rising, so the server might be able to fill up every host with work using the present settings (if comparing the longer results doesn't take up too many resources).
Give it a few days until all hosts return the "shorter" workunits, then fill up their queues with the "longer" ones, then the longer results are uploaded and compared.

entity
Send message
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 3
United States
Message 3238 - Posted: 30 May 2023, 13:09:54 UTC

With the longer WUs, I'm still getting "Computer has reached limit of tasks in progress" on some of my machines with a 1-day cache, same as with the smaller WUs. This means the number of WUs downloaded is the same as before the change. WUs would have to get bigger before the number downloaded stops hitting that limit. 6 x the number of threads didn't even provide a half-day cache with the smaller WUs; now it provides maybe a little over a half day, but not one day.

Falconet
Send message
Joined: 21 Dec 16
Posts: 105
Credit: 3,092,711
RAC: 0
Portugal
Message 3239 - Posted: 30 May 2023, 14:45:47 UTC

The longer work units are taking slightly less time to complete than the Hs work units we had before.

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3240 - Posted: 30 May 2023, 16:41:42 UTC - in response to Message 3238.
Last modified: 30 May 2023, 16:42:55 UTC

With the longer WUs, I'm still getting "Computer has reached limit of tasks in progress" on some of my machines with a 1-day cache.
TN-Grid limits the total number of workunits per host (regardless of its core count) to help maximize the total output of the project by spreading the work across as many hosts as possible. Hosts with large core counts finish work more frequently, thus they have a better chance to download work during a shortage.

Same as with the smaller WUs. This means the number of WUs downloaded is the same as before the change.
This means that your host can queue 33% more work than before the change.

WUs would have to get bigger before the number downloaded stops hitting that limit. 6 x the number of threads didn't even provide a half-day cache with the smaller WUs; now it provides maybe a little over a half day, but not one day.
That's no problem (both for you and for the project), provided that your host *always* has *some* work in its queue. As every workunit is a piece in a chain of 58 workunits, workunits that just sit in a computer's queue hold back the completion of the entire chain. The more cores a host has, the more chains it can put on hold. The reason for limiting the total number of workunits is to limit the number of chains a host can put on hold, without reducing the host's throughput.

BTW your hosts are a nice example of why *not* to crunch on virtual cores:
Your 13-year-old AMD Phenom II 1090T X6 CPU can finish a longer VV workunit in 13,100 seconds,
while your 4-year-old AMD Ryzen 9 3900X CPU can finish the same workunit in 12,000 seconds. Of course, it can finish twice as many workunits in the same time as the older CPU (so its RAC is higher), but I guess it could do the same amount of work (or even a little more) if you limited the number of tasks to the number of cores (6). Depending on the extra cache misses the extra task per core inflicts, the performance when running a limited number of tasks simultaneously can be better.

entity
Send message
Joined: 20 Jul 20
Posts: 20
Credit: 31,475,949
RAC: 3
United States
Message 3241 - Posted: 30 May 2023, 18:21:16 UTC - in response to Message 3240.

TN-Grid limits the total number of workunits per host (regardless of its core count) to help maximize the total output of the project by spreading the work across as many hosts as possible. Hosts with large core counts finish work more frequently, thus they have a better chance to download work during a shortage.

That isn't my observation. On the 64-thread system I receive 384 WUs (6 x 64). After the change I still get 384 WUs. If I change the thread count to 128, I will get 768 (6 x 128), and it is consistent across all of my systems.

This means that your host can queue 33% more work than before the change.

I agree, as long as we are only considering the number of "chunks" being downloaded (800 vs 600), but the number of WUs is the same as before. Supposedly, the problem we were trying to solve was related to the work generator and the number of workunits created per unit of time. That doesn't seem to have changed much after the change.

That's no problem (both for you and for the project), provided that your host *always* has *some* work in its queue. As every workunit is a piece in a chain of 58 workunits, workunits that just sit in a computer's queue hold back the completion of the entire chain. The more cores a host has, the more chains it can put on hold. The reason for limiting the total number of workunits is to limit the number of chains a host can put on hold, without reducing the host's throughput.

I am always going to have a number of WUs in the "Ready to Start" state to allow me to continue working through outages, whether at the project site or locally. That is a choice I make as a cruncher, but I try to limit it to less than a day, for the reasons you describe.

BTW your hosts are a nice example of why *not* to crunch on virtual cores:
Your 13-year-old AMD Phenom II 1090T X6 CPU can finish a longer VV workunit in 13,100 seconds,
while your 4-year-old AMD Ryzen 9 3900X CPU can finish the same workunit in 12,000 seconds. Of course, it can finish twice as many workunits in the same time as the older CPU (so its RAC is higher), but I guess it could do the same amount of work (or even a little more) if you limited the number of tasks to the number of cores (6). Depending on the extra cache misses the extra task per core inflicts, the performance when running a limited number of tasks simultaneously can be better.

I choose to run on virtual cores, as I have found it a pain in the backside to constantly go into the BIOS to turn off HT/SMT for different projects. The lower-thread-count systems don't get me enough extra throughput to justify the trouble. The bigger server gives me better payback for running fewer threads, but I do that through the BOINC Manager ("Use 50% of the CPUs"), and yes, I know that by doing that I'm not truly eliminating virtual cores and work isn't always balanced across the sockets, but WUs run in about half the time. Turning off SMT wouldn't get me that much more throughput.

I would advocate for changing the WUs to 1200 from 800. I think that would make them run about the same time as the HS work.

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3242 - Posted: 30 May 2023, 19:26:17 UTC - in response to Message 3241.
Last modified: 30 May 2023, 19:36:41 UTC

That isn't my observation. On the 64 thread system I receive 384 WUs (6 x 64). After the change I still get 384 WUs. If I change the thread count to 128 I will get 768 (6 x 64) and it is consistent across all of my systems.
That's my mistake. Perhaps I should consider it as a new idea then.

I choose to run on virtual cores as I have found it a pain in the back side to have to constantly go into the BIOS to turn off HT/SMT for different projects. The lower thread count systems don't get me enough extra throughput to justify the trouble.
Agreed. I don't recommend turning it off in the BIOS.

The bigger server gives me better pay back for running fewer threads but I do that through the BOINC Manager (Use 50% of the CPUS) and yes I know that by doing that I'm not truly eliminating virtual cores and work isn't always balanced across the sockets but WUs run in about 1/2 the time. Turning off SMT wouldn't get me that much more throughput.
Modern OSes select the cores wisely for power-hungry apps; I let them do it on their own. Windows is bad at running thousands of threads (my Windows 11 PC: 3,300); that's one of the reasons for its degraded performance (compared to Linux).

For anyone interested in methods other than changing BOINC manager Options -> Computing preferences -> "Use at most 50% of the CPUs":

I put an app_config.xml in each project's directory that looks similar to this:
<app_config>
    <app>
        <name>gene_pcim</name>
        <max_concurrent>7</max_concurrent>
    </app>
</app_config>
You can figure out the app name from the project's webpage, or from the BOINC manager's log. (Nothing bad happens if you put an incorrect name here; the BOINC manager will show the known app names in its log, so you can correct the names accordingly.)

The other method is to limit the project itself:
<app_config>
    <project_max_concurrent>7</project_max_concurrent>
</app_config>
This will also limit the maximum number of workunits in the queue.

When I crunch for multiple projects at the same time, it's tedious to set these files so they add up to the number of cores I want to use; in this case I use the cc_config.xml method (the file is located in the BOINC data directory):
<cc_config>
    <options>
        <ncpus>7</ncpus>
    </options>
</cc_config>
This will also limit the maximum number of workunits in the queue.
Originally this value is set to -1 (=all CPUs).

Don't forget to make the BOINC manager re-read the configuration files after any changes you've made.

I would advocate for changing the WUs to 1200 from 800. I think that would make them run about the same time as the HS work.
Agreed; my suggested number was 920 (50 workunits) or 1150 (40 workunits). (1200 would result in 38 + 1/3 workunits.)
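Interestingly, the sizes mentioned in this thread (920 × 50, 1150 × 40, and 1200 giving 38 + 1/3) all divide the same total, which suggests each chain holds a fixed 46,000 chunks. That total is an inference from these numbers, not a figure stated by the project:

```python
from fractions import Fraction

TOTAL = 46_000  # assumed chunks per chain, inferred from 920*50 = 1150*40 = 46,000

for size in (920, 1150, 1200):
    n = Fraction(TOTAL, size)
    print(f"{size} chunks/WU -> {n} workunits per chain")
# 920 and 1150 divide evenly (50 and 40 workunits); 1200 leaves the
# fractional 115/3 = 38 + 1/3 mentioned above.
```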

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3243 - Posted: 31 May 2023, 18:57:38 UTC

The number of tasks in progress is increasing slowly but steadily (~1,000 per day; it's 32,318 atm), while the number ready to send fluctuates between 80 and 120 workunits, so the work generator can only just keep up the pace. Many people (in the northern hemisphere) will leave for summer vacation soon, so the available computing power will decrease; I think the settings are fine for now. But I'm still curious how the project would perform if it were generating even larger (1150-chunk) workunits.

JaviF
Send message
Joined: 14 May 19
Posts: 41
Credit: 0
RAC: 0
Spain
Message 3244 - Posted: 1 Jun 2023, 6:14:22 UTC - in response to Message 2844.

The new project will be done before the northern people go on vacation... there are already less than 2 months left to finish, and it could be earlier if more workunits were provided :)

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3245 - Posted: 2 Jun 2023, 9:07:48 UTC
Last modified: 2 Jun 2023, 9:08:27 UTC

The number of tasks in progress is still increasing slowly. It's 35,016 at the moment, which is near the previous top (35,559). I wonder when we will reach the new top in the number of tasks in progress, and what that number will be. There are 80 workunits ready to send.

JaviF
Send message
Joined: 14 May 19
Posts: 41
Credit: 0
RAC: 0
Spain
Message 3246 - Posted: 2 Jun 2023, 9:59:56 UTC - in response to Message 3245.

Such an amount of workunits ready to be sent is almost zero; as soon as a couple of hosts request more WUs, we go back to 0, and then hosts will be emptying their queues.
80 WUs could even be requested by a single user...

rsNeutrino
Send message
Joined: 12 Mar 23
Posts: 7
Credit: 1,194,671
RAC: 0
Germany
Message 3247 - Posted: 2 Jun 2023, 21:14:34 UTC



[Graph of tasks in progress over time; the image did not survive.]
Slowly rising; the task size change was around where the cross is.
Could be better, but as long as the RTS number is rising, it's enough.

Retvari Zoltan
Send message
Joined: 31 Mar 20
Posts: 43
Credit: 51,206,467
RAC: 0
Hungary
Message 3248 - Posted: 3 Jun 2023, 7:25:31 UTC

The number of tasks in progress is still increasing slowly.
There are 36,919 tasks in progress and 0 workunits ready to send, but a few minutes later there are 36,839 in progress / 44 ready to send.
The average runtime was 2.67 hours before the workunit size change; it rose to 3.4 hours in two days, then slowly rose to its estimated new value: 3.6 hours (2.67 × 8/6 = 3.56).

JaviF
Send message
Joined: 14 May 19
Posts: 41
Credit: 0
RAC: 0
Spain
Message 3249 - Posted: 5 Jun 2023, 6:09:25 UTC - in response to Message 3248.

Again, WUs ready to be sent is around 0; tasks in progress: 34,685.

Seems to be better but not enough...

Copyright © 2024 CNR-TN & UniTN