Message boards : Number crunching : Optimization
Avg. runtime of the new FMA app on an FX 8320 @ 3.5 GHz (RAM 1333) = 55 minutes.
ID: 800
SSE2 = 52 minutes
ID: 801
I could test it on SandyBridge CPUs only, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.

SandyBridge i2600 with the slow AVX, running 8 threads (hyperthreading on): the SSE2 application (v1.1) is a bit faster. I tested under the same conditions with 4 tasks simultaneously (on the other 4 threads: 1 LHC VM and 3 WCG betas). Average CPU hours of the AVX application: 1.22 hours. Average CPU hours of the SSE2 application: 1.19 hours.
ID: 802
This app was limited by memory speed, so the SSE version may be faster. The older version executed more of the slow calculations (square roots, divisions), and its AVX loops ran fewer iterations because of the longer vectors, so the AVX and FMA versions were faster there. Now, with the number of these slow calculations reduced and with the loops unrolled, it may be that SSE is faster. I could test it on SandyBridge CPUs only, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here. Would it be possible to test the newest SSE2 app on your machine? I bet it's faster.
ID: 803
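A minimal sketch of the restructuring described above (hypothetical loop and names, not the actual application code): the slow square root and division are hoisted out and their result is reused, so the remaining loop body contains only cheap multiplies.

    #include <math.h>

    /* Old style: every iteration repeats a slow sqrt and division. */
    void scale_row_old(double *row, int n, double norm, double scale) {
        for (int i = 0; i < n; i++)
            row[i] = row[i] / sqrt(norm) * scale;
    }

    /* New style: one sqrt and one division, reused n times. */
    void scale_row_new(double *row, int n, double norm, double scale) {
        double factor = scale / sqrt(norm);
        for (int i = 0; i < n; i++)
            row[i] *= factor;       /* only a cheap multiply remains */
    }

With fewer div/sqrt instructions left in the hot loop, the wider AVX versions of those instructions matter less, which is consistent with the SSE2 build pulling ahead in the timings reported below.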
I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

    SSE2      20.600
    AVX       21.441
    AVX+FMA   21.361
    AVX2+FMA  21.618

As you can see, the SSE2 version is the fastest. I found the reason for this: instruction-level parallelism, i.e. the CPU can execute multiple instructions at the same time. The previous app version mixed fast (add/sub/mul) and slow (div/sqrt) operations, so newer CPUs could overlap them, and they could also calculate two div/sqrt operations faster with 1 AVX instruction than with 2 SSE2 instructions. As a result they ran that app faster. The new app version performs the slow calculations (div/sqrt) first and then reuses the results many times in fast operations (sub/mul), and it looks like CPUs can execute two SSE2 instructions in the same time as 1 AVX instruction.

The application processes the diagonal half of a matrix, so every row has a different length. The app always loads 2 (SSE2 app) or 4 (AVX app) values and processes them. This saves some time, at the cost of calculating values that are discarded later. Because of this the SSE2 version is better: fewer results are discarded in the end, so the app performs better. I tried to prepare an app version that used AVX for vectors of length 3 or 4 and SSE2 when 1 or 2 elements were left, but this version was even slower:

    AVX-new       23.059
    AVX+FMA-new   22.923
    AVX2+FMA-new  22.512

I also tested one more idea: SSE2+FMA. It turned out to be faster than plain SSE2:

    SSE2+FMA  20.465

So it looks like it makes sense to prepare an app that processes data using SSE vectors, to get the most out of instruction-level parallelism, and uses FMA instructions on those vectors to save some time on add/sub operations. There is also one more thing worth exploring: using AVX for the div/sqrt calculations only, since they are faster than SSE, so the app should be a bit faster still. And one more thing: the results for the AVX2 app are mixed; for the original app they are worse, but for the modified one they are better. More testing is needed to determine whether it makes sense to have such an app version too.
ID: 804
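A rough sketch of the SSE-width FMA idea from the post above. The data layout and names are assumptions, not the project's code: each row of the triangular matrix is walked two doubles at a time with _mm_fmadd_pd, and a possibly partial last pair is simply computed from padding and ignored afterwards, matching the "discarded results" described above.

    #include <immintrin.h>   /* build with -mfma: 128-bit FMA3 on SSE registers */

    /* acc[j] += a[j] * b[j] over a row of 'len' doubles; 'len' may be odd.
       Buffers are assumed padded to an even length, so reading one element
       past 'len' is safe and its result is discarded by the caller. */
    static void row_fma_sse(double *acc, const double *a,
                            const double *b, int len)
    {
        for (int j = 0; j < len; j += 2) {           /* 2 doubles per __m128d */
            __m128d va = _mm_loadu_pd(a + j);
            __m128d vb = _mm_loadu_pd(b + j);
            __m128d vc = _mm_loadu_pd(acc + j);
            _mm_storeu_pd(acc + j, _mm_fmadd_pd(va, vb, vc));  /* vc += va*vb */
        }
    }

Keeping the vectors at SSE width leaves at most one wasted lane per row (instead of up to three with AVX), while the fused multiply-add still saves the separate add instruction.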
I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE) vectors, so it should be faster than the SSE version now.
ID: 805
I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

Impressive work. Thanks! It looks like the SSE2+FMA is only 0.659% faster (20.465 vs. 20.600 for plain SSE2). Is that even worth having another version?
ID: 806
I just wanted to say a big Thank You! I have run a few using your FMA optimizations released on the 23rd, and they did cut my runtimes in half! Now I'm emptying my queue to see what this latest tweak can shave off of that.
ID: 807
I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

I am going to modify the code a bit to use AVX for the div/sqrt calculations and SSE for the rest. This should improve performance a bit, so in the end it should be slightly faster than this SSE+FMA version. We will see how much faster once I have it ready. Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.
ID: 808
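A hedged sketch of the split Daniel describes: the slow divisions are issued as one 256-bit AVX instruction, and the quotients are then consumed at SSE width with FMA. The kernel below is illustrative only; the real application's data layout is not shown in this thread.

    #include <immintrin.h>   /* build with -mavx -mfma */

    /* Hypothetical kernel: out[i] += (num[i] / den[i]) * w[i] for 4 doubles. */
    static void kernel4(double *out, const double *num,
                        const double *den, const double *w)
    {
        /* four slow divisions in a single AVX instruction */
        __m256d q = _mm256_div_pd(_mm256_loadu_pd(num), _mm256_loadu_pd(den));

        /* split the quotients back into two SSE-width halves */
        __m128d qlo = _mm256_castpd256_pd128(q);
        __m128d qhi = _mm256_extractf128_pd(q, 1);

        /* the fast part stays at SSE width, using FMA */
        __m128d lo = _mm_fmadd_pd(qlo, _mm_loadu_pd(w),     _mm_loadu_pd(out));
        __m128d hi = _mm_fmadd_pd(qhi, _mm_loadu_pd(w + 2), _mm_loadu_pd(out + 2));

        _mm_storeu_pd(out,     lo);
        _mm_storeu_pd(out + 2, hi);
    }

Compiling the whole file with AVX enabled keeps the 128-bit operations VEX-encoded, which avoids the SSE/AVX transition penalties that mixing the two encodings would otherwise cause.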
I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE) vectors, so it should be faster than the SSE version now.

I'll give it a try tomorrow. :)
ID: 809
Going from the 0.10 AVX version on my dual E5-2620 machine, which used to average around 8-10K CPU seconds, the newer app cuts that to around 5-6K seconds.
ID: 810
The new FMA app is great.
ID: 811
The new FMA app is great.

Great news :)

Going from the 0.10 AVX version on my dual E5-2620 machine, which used to average around 8-10K CPU seconds, the newer app cuts that to around 5-6K seconds.

It's memory bandwidth. The app uses about 1.3 MB of global data and accesses it in a way that looks random to the memory controller, so it cannot prefetch data efficiently. When fewer instances run in parallel, each one has more CPU cache available, so there is a bigger chance that the required data will already be there. The new app uses almost twice as much memory as the old one to store some precalculated values. Half of this data is in fact a transposed copy of the other half of the matrix, so it should be possible to optimize this. I will add it to my TODO list :)
ID: 812
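One common way to drop the transposed copy mentioned above (sketched here under an assumed layout, since the real data structures are not shown in the thread) is to store only one triangle of the symmetric matrix in packed form and map both (i, j) and (j, i) to the same slot:

    #include <stddef.h>

    /* Packed lower-triangular storage for a symmetric n x n matrix:
       n*(n+1)/2 doubles instead of n*n (or two mirrored copies). */
    static inline size_t tri_index(size_t i, size_t j)
    {
        if (j > i) { size_t t = i; i = j; j = t; }  /* M[i][j] == M[j][i] */
        return i * (i + 1) / 2 + j;
    }

    static inline double sym_get(const double *packed, size_t i, size_t j)
    {
        return packed[tri_index(i, j)];
    }

The saving in cache footprint has to be weighed against the extra branch and the less regular access pattern, so whether it actually helps this kernel can only be settled by measuring.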
Thanks Daniel, the progress and the explanations of how you reached these results are pretty amazing!
ID: 813
Hmm,
ID: 814
Hmm,

This CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by a CPU using CPU-Z.
ID: 817
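Besides CPU-Z, the same features can be checked programmatically; a tiny GCC/Clang probe (just a convenience sketch, not part of the project) prints the relevant flags:

    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();   /* initialise the feature cache (GCC/Clang) */
        printf("SSE2  : %s\n", __builtin_cpu_supports("sse2")   ? "yes" : "no");
        printf("SSE4.1: %s\n", __builtin_cpu_supports("sse4.1") ? "yes" : "no");
        printf("AVX   : %s\n", __builtin_cpu_supports("avx")    ? "yes" : "no");
        printf("FMA   : %s\n", __builtin_cpu_supports("fma")    ? "yes" : "no");
        printf("AVX2  : %s\n", __builtin_cpu_supports("avx2")   ? "yes" : "no");
        return 0;
    }

On Linux, "grep -m1 flags /proc/cpuinfo" lists the same feature flags (sse2, avx, fma, avx2, ...).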
This CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by a CPU using CPU-Z.

Daniel,

Ah, yes, I see that now - it didn't occur to me that there might be a technology gap! It should, however, work on my larger machine with the i7 6900K CPU, so I'll run down the WU cache there and see what happens.

Cheers
ID: 819
Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.

Well, I only doubled the size of the workunits (starting at 2016-12-30: 100 'blocks' instead of 50).
ID: 820
Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.

I was talking about the test data file; it contains one block only.
ID: 821
Well, I only doubled the size of the workunits (starting at 2016-12-30: 100 'blocks' instead of 50).

And might we see another doubling of the 'production' WUs to 200 'blocks' in the not too distant future, given that Daniel has enabled us, once again, to process twice as many WUs as we were doing with the last release of his optimised app? Perhaps I shouldn't be putting such ideas into your head! ;-)
ID: 822