Optimization

Message boards : Number crunching : Optimization

Author	Message
Krümel Send message Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 800 - Posted: 24 Jan 2017, 8:49:55 UTC Last modified: 24 Jan 2017, 8:55:04 UTC
	Avg. runtime of the new FMA-App on a FX 8320 @ 3,5 GHz (RAM 1.333) = 55 Minutes. AVX-App = 58 Minutes. Now testing SSE2...
	ID: 800 · Reply Quote

Krümel Send message Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 801 - Posted: 24 Jan 2017, 10:43:30 UTC - in response to Message 800.
	SSE2 = 52 Minutes
	ID: 801 · Reply Quote

Crystal Pellet Send message Joined: 1 Jan 17 Posts: 2 Credit: 1,247,672 RAC: 0	Message 802 - Posted: 24 Jan 2017, 10:49:48 UTC - in response to Message 794.
	Quote: I could test it on SandyBridge CPUs only which have slow AVX division and square roots, so AVX app was slower there too. On newer CPUs with faster AVX and memory things may work differently and AVX/FMA versions may be faster. Will see, I hope people will post their results for new apps here.
	SandyBridge i2600 with the slow AVX, running 8 threads (hyperthreading on). The SSE2 application (v1.1) is a bit faster. I tested under the same conditions with 4 tasks simultaneously (on the other 4 threads: 1 LHC-VM and 3 WCG betas).
	Average CPU time of the AVX application: 1.22 hours
	Average CPU time of the SSE2 application: 1.19 hours
	ID: 802 · Reply Quote

Beyond Send message Joined: 2 Nov 16 Posts: 50 Credit: 44,372,499 RAC: 0	Message 803 - Posted: 24 Jan 2017, 17:52:44 UTC - in response to Message 799. Last modified: 24 Jan 2017, 17:53:12 UTC
	Quote: This app was limited by memory speed, so SSE version may be faster. Older version was executing more slow calculations (square roots, divisions) plus loops for AVX were executing less times because of longer vectors, so AVX and FMA were faster. Now with reduced number of these slow calculations and with unrolled loops it may be that SSE is faster. I could test it on SandyBridge CPUs only which have slow AVX division and square roots, so AVX app was slower there too. On newer CPUs with faster AVX and memory things may work differently and AVX/FMA versions may be faster. Will see, I hope people will post their results for new apps here.
	I am seeing a little less than 34 min/workunit on Win 7 64 with the newest AVX version, on an i3-4330 (Haswell) running two instances.
	Would it be possible to test the newest SSE2 app on your machine? I bet it's faster.
	ID: 803 · Reply Quote

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 804 - Posted: 24 Jan 2017, 18:18:22 UTC Last modified: 24 Jan 2017, 18:33:15 UTC
	I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:
	SSE2	20.600
	AVX	21.441
	AVX+FMA	21.361
	AVX2+FMA	21.618
	As you can see, the SSE2 version is the fastest one. I found the reason for this: instruction-level parallelism, meaning the CPU is able to execute multiple instructions at the same time. The previous app version mixed fast (add/sub/mul) and slow (div/sqrt) operations, so new CPUs were able to get some advantage from this, plus they were able to calculate two div/sqrt ops faster using 1 AVX instruction than using 2 SSE2 instructions. As a result they were able to run the app faster. The new app version performs the slow calculations (div/sqrt) first, then uses the calculated values multiple times in fast ops (sub/mul). It looks like CPUs are able to execute two SSE2 instructions in the same time as 1 AVX instruction.
	The application processes the diagonal half of a matrix, so every row has a different length. The app always loads 2 (SSE2 app) or 4 (AVX app) values and processes them. This saves some time, at the cost of calculating values which will be discarded later. Because of this the SSE2 version is better: fewer results are finally discarded, so the app performs better.
	I tried to prepare an app version which used AVX for vectors of length 3 or 4, and SSE2 if there were 1 or 2 elements left. However, this app version was even slower:
	AVX-new	23.059
	AVX+FMA-new	22.923
	AVX2+FMA-new	22.512
	I also tested one more idea: SSE2+FMA. It turned out that it is faster than plain SSE2:
	SSE2+FMA	20.465
	So it looks like it makes sense to prepare an app which will process data using SSE vectors to get the most out of instruction-level parallelism, and use FMA instructions which operate on such vectors to save some time on add/sub operations.
	There is also one more thing worth exploring: use AVX for div/sqrt calculations only; they are faster than SSE, so the app should be a bit faster.
	And one more thing: it looks like the results for the AVX2 app are mixed. For the original app they are worse, but for the modified one they are better. More testing is needed here to determine whether it makes sense to have such an app version too.
	ID: 804 · Reply Quote
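
The padding effect described in Message 804 can be illustrated with a small count of discarded vector lanes. This is only a sketch under assumptions: the 100-row triangular shape and the helper name `padded_lanes` are hypothetical, and the real app's matrix size and loop structure are not given in the thread.

```python
def padded_lanes(row_lengths, width):
    """Return (useful, discarded) element counts when every row is
    processed in full vectors of `width` elements."""
    useful = sum(row_lengths)
    # Round each row length up to a multiple of the vector width.
    processed = sum(-(-n // width) * width for n in row_lengths)
    return useful, processed - useful

# Hypothetical triangular matrix: row i holds i + 1 elements.
rows = [i + 1 for i in range(100)]

for name, width in (("SSE2", 2), ("AVX", 4)):
    useful, discarded = padded_lanes(rows, width)
    print(f"{name}: {useful} useful values, {discarded} discarded")
```

With these assumed row lengths, the 4-wide variant throws away three times as many computed values as the 2-wide one, which is consistent with the explanation that the shorter vectors waste less work.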

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 805 - Posted: 24 Jan 2017, 19:47:43 UTC Last modified: 24 Jan 2017, 19:49:09 UTC
	I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE) vectors, so it should be faster than the SSE version now.
	ID: 805 · Reply Quote

Beyond Send message Joined: 2 Nov 16 Posts: 50 Credit: 44,372,499 RAC: 0	Message 806 - Posted: 24 Jan 2017, 20:22:42 UTC - in response to Message 804.
	Quote: I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs: SSE2 20.600, SSE2+FMA 20.465
	Impressive work. Thanks! It looks like the SSE2+FMA is only 0.659% faster. Is that even worth having another version?
	ID: 806 · Reply Quote
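
For anyone who wants to reproduce the percentage, here is a minimal sketch using the timings posted in Message 804. The function name `percent_faster` is made up for illustration, and the difference is expressed relative to the faster time, which appears to be how the 0.659% figure was obtained.

```python
# Timings as posted in Message 804 (units were not stated in the post).
timings = {
    "SSE2": 20.600,
    "AVX": 21.441,
    "AVX+FMA": 21.361,
    "AVX2+FMA": 21.618,
    "SSE2+FMA": 20.465,
}

def percent_faster(baseline, candidate):
    """Time saved by `candidate`, as a percentage of the candidate's time."""
    return (baseline - candidate) / candidate * 100.0

gain = percent_faster(timings["SSE2"], timings["SSE2+FMA"])
print(f"SSE2+FMA vs SSE2: {gain:.3f}% faster")

# Rank all variants against the plain SSE2 build.
for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {t:7.3f}  {percent_faster(timings['SSE2'], t):+.2f}%")
```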

Col323 Send message Joined: 23 Nov 16 Posts: 7 Credit: 1,329,132 RAC: 0	Message 807 - Posted: 24 Jan 2017, 20:23:45 UTC
	I just wanted to say a big Thank You! I have run a few using your FMA optimizations released on the 23rd, and they did cut my runtimes in half! Now I'm emptying my queue to see what this latest tweak can shave off of that. I really appreciate you digging into this. I love seeing us go faster with the same amount of effort.
	ID: 807 · Reply Quote

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 808 - Posted: 24 Jan 2017, 20:33:27 UTC - in response to Message 806. Last modified: 24 Jan 2017, 20:35:02 UTC
	Quote: I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs: SSE2 20.600, SSE2+FMA 20.465
	Quote: Impressive work. Thanks! It looks like the SSE2+FMA is only 0.659% faster. Is that even worth having another version?
	I am going to modify the code a bit to use AVX for div/sqrt calculations and SSE for the rest. This should improve performance a bit, so in the end it should be a bit faster than this SSE+FMA version. We will see how much faster it is when I have it ready. Also keep in mind that WUs sent by the server now are 100 times longer and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.
	ID: 808 · Reply Quote

Krümel Send message Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 809 - Posted: 24 Jan 2017, 21:32:35 UTC - in response to Message 805.
	Quote: I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE) vectors, so it should be faster than the SSE version now.
	I'll give it a try tomorrow. :)
	ID: 809 · Reply Quote

Woof Send message Joined: 16 Jan 17 Posts: 3 Credit: 650,991 RAC: 0	Message 810 - Posted: 25 Jan 2017, 4:19:01 UTC
	Going from the 0.10 AVX version on my dual E5-2620 machine, it used to average around 8-10K CPU seconds; with the newer app that is cut to around 5-6K secs. Interestingly, if I lower the number of tasks running at once to something like 4-6 instead of using all 24 cores, it drops down to nearly 3-4K seconds. I wonder if this is related to boost clocks or to memory. I've run into some weirdness with things that are memory bound when running on dual-socket motherboards because memory may get allocated in the wrong space, and while the links between CPUs are quick nowadays, it's still a performance penalty. Anyways, simply amazing work!
	ID: 810 · Reply Quote

Krümel Send message Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 811 - Posted: 25 Jan 2017, 7:24:13 UTC
	The new FMA-App is great. Times dropped from 55 Minutes to 46 Minutes!
	ID: 811 · Reply Quote

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 812 - Posted: 25 Jan 2017, 7:50:26 UTC - in response to Message 811. Last modified: 25 Jan 2017, 7:51:58 UTC
	Quote: The new FMA-App is great. Times dropped from 55 Minutes to 46 Minutes!
	Great news :)
	Quote: Going from the 0.10 AVX version on my dual E5-2620 machine, it used to average around 8-10K CPU seconds; with the newer app that is cut to around 5-6K secs. Interestingly, if I lower the number of tasks running at once to something like 4-6 instead of using all 24 cores, it drops down to nearly 3-4K seconds. I wonder if this is related to boost clocks or to memory. I've run into some weirdness with things that are memory bound when running on dual-socket motherboards because memory may get allocated in the wrong space, and while the links between CPUs are quick nowadays, it's still a performance penalty. Anyways, simply amazing work!
	It's memory bandwidth. The app uses about 1.3MB of global data and accesses it in a way which looks random to the memory controller, so it cannot prefetch data efficiently. When fewer apps work in parallel, each of them has more CPU cache available, so there is a bigger chance that the required data will be there. The new app uses almost twice as much memory as the old one to store some precalculated values. Half of this data is in fact a transposed copy of the other half of the matrix, so it should be possible to optimize this. I will add this to my TODO list :)
	ID: 812 · Reply Quote
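
A back-of-the-envelope check of the cache argument in Message 812, as a rough sketch only. The 1.3 MB per task comes from the post; the 15 MB shared cache per socket is an assumed illustrative figure and was not stated in the thread, and the model ignores everything else that competes for cache.

```python
PER_TASK_MB = 1.3      # global data per app instance, from Message 812
ASSUMED_CACHE_MB = 15  # hypothetical shared cache per socket (assumption)

def working_set_mb(tasks, per_task_mb=PER_TASK_MB):
    """Combined global data of `tasks` app instances sharing one cache."""
    return tasks * per_task_mb

# Tasks per socket: 12 corresponds to all 24 threads on a dual-socket box,
# 2-3 corresponds to running only 4-6 tasks in total.
for tasks in (2, 3, 6, 12):
    total = working_set_mb(tasks)
    verdict = "fits in" if total <= ASSUMED_CACHE_MB else "exceeds"
    print(f"{tasks:2d} tasks: {total:5.1f} MB {verdict} {ASSUMED_CACHE_MB} MB")
```

Under these assumptions the combined data only outgrows the cache when every thread runs a task, which is in line with the observed speedup when fewer tasks run at once.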

koschi Send message Joined: 22 Oct 16 Posts: 25 Credit: 17,961,188 RAC: 0	Message 813 - Posted: 25 Jan 2017, 8:51:51 UTC
	Thanks Daniel, the progress and explanations how you reached these are pretty amazing!
	ID: 813 · Reply Quote

Dave Peachey Send message Joined: 6 Nov 16 Posts: 7 Credit: 2,364,725 RAC: 0	Message 814 - Posted: 25 Jan 2017, 10:19:45 UTC Last modified: 25 Jan 2017, 10:22:09 UTC
	Hmm, I tried the updated fma v1.1 version on my quad core Q9550-based (Windows 7 build) machine and it threw up immediate computation errors against all the WUs it tried to process. Needless to say, I terminated the BOINC session as soon as I could. The Stderr output file for all seven of the errored WUs shows the same message:
	<core_client_version>7.6.33</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code -1073741795 (0xc000001d)
	</message>
	]]>
	which, I'm afraid, means nothing to me (ah, Vienna), but may help someone to identify the problem. For now, I'm sticking with the sse2 version on both of my machines which is halving WU processing times on both machines ... and I can't complain too much about that.
	Dave
	ID: 814 · Reply Quote

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 817 - Posted: 25 Jan 2017, 11:16:18 UTC - in response to Message 814.
	Quote: Hmm, I tried the updated fma v1.1 version on my quad core Q9550-based (Windows 7 build) machine and it threw up immediate computation errors against all the WUs it tried to process. Needless to say, I terminated the BOINC session as soon as I could. The Stderr output file for all seven of the errored WUs shows the same message: <core_client_version>7.6.33</core_client_version> <![CDATA[ <message> (unknown error) - exit code -1073741795 (0xc000001d) </message> ]]> which, I'm afraid, means nothing to me (ah, Vienna), but may help someone to identify the problem. For now, I'm sticking with the sse2 version on both of my machines which is halving WU processing times on both machines ... and I can't complain too much about that. Dave
	This CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by a CPU using CPU-Z.
	ID: 817 · Reply Quote
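
The two numbers in the error message are the same value: the negative exit code is the signed 32-bit reading of 0xc000001d, which on Windows is the status code STATUS_ILLEGAL_INSTRUCTION. That matches an AVX/FMA build running on a CPU without those instructions. A minimal sketch of the conversion (the helper name `to_unsigned32` is only illustrative):

```python
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF

STATUS_ILLEGAL_INSTRUCTION = 0xC000001D  # Windows NTSTATUS value

code = to_unsigned32(-1073741795)
print(hex(code))
print(code == STATUS_ILLEGAL_INSTRUCTION)
```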

Dave Peachey Send message Joined: 6 Nov 16 Posts: 7 Credit: 2,364,725 RAC: 0	Message 819 - Posted: 25 Jan 2017, 11:40:45 UTC - in response to Message 817. Last modified: 25 Jan 2017, 11:41:05 UTC
	Quote: This CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by a CPU using CPU-Z.
	Daniel,
	Ah, yes, I see that now - it didn't occur to me that there might be a technology gap! It should, however, work on my larger machine with the i7 6900K CPU so I'll run down the WU cache there and see what happens.
	Cheers
	ID: 819 · Reply Quote
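
On Linux, the same check can be done without CPU-Z by looking at the `flags` line of `/proc/cpuinfo`. The sketch below picks an app variant from such a flag list. The variant names and the flag sets they require are assumptions made for illustration, not the project's actual build requirements, and the example flag strings are abbreviated.

```python
# Hypothetical mapping of app variant to required CPU flags, best first.
# Flag names follow the Linux /proc/cpuinfo convention.
REQUIRED = [
    ("fma", {"sse2", "avx", "fma"}),
    ("avx", {"sse2", "avx"}),
    ("sse2", {"sse2"}),
]

def pick_variant(flags):
    """Return the first variant whose required flags are all present."""
    have = set(flags.split())
    for name, needed in REQUIRED:
        if needed <= have:
            return name
    return None

def read_linux_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""

# Abbreviated example flag lists: an SSE4.1-only CPU and an AVX+FMA CPU.
print(pick_variant("fpu sse sse2 ssse3 sse4_1"))
print(pick_variant("fpu sse sse2 sse4_1 avx fma avx2"))
```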

valterc Project administrator Project tester Send message Joined: 30 Oct 13 Posts: 632 Credit: 34,744,744 RAC: 0	Message 820 - Posted: 25 Jan 2017, 15:14:21 UTC - in response to Message 808.
	Quote: Also keep in mind that WUs sent by the server now are 100 times longer and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.
	Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)
	ID: 820 · Reply Quote

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 821 - Posted: 25 Jan 2017, 15:26:15 UTC - in response to Message 820.
	Quote: Also keep in mind that WUs sent by the server now are 100 times longer and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.
	Quote: Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)
	I was talking about the test data file; it contains one block only.
	ID: 821 · Reply Quote

Dave Peachey Send message Joined: 6 Nov 16 Posts: 7 Credit: 2,364,725 RAC: 0	Message 822 - Posted: 25 Jan 2017, 16:24:16 UTC - in response to Message 820.
	Quote: Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)
	And might we see another doubling of the 'production' WUs to 200 'blocks' (in the not too distant future), given that Daniel has enabled us, once again, to process twice as many WUs as we were doing with the last release of his optimised app? Perhaps I shouldn't be putting such ideas into your head! ;-)
	ID: 822 · Reply Quote
