Message boards : Number crunching : Optimization
Author | Message |
---|---|
Surprise! I have just released a new optimized app version (Opti v1.1), two times faster than the previous optimized one (now the official one) :). It can be downloaded from the same place as the previous ones: https://bitbucket.org/sirzooro/pc-boinc/downloads. At the moment only 64-bit versions for Windows and Linux are available. I will add a 32-bit Windows version later.

real 0m54.472s
user 0m52.358s
sys 0m0.045s

real 0m26.208s
user 0m24.142s
sys 0m0.033s
I was also able to add code that uses NEON instructions on 64-bit ARM (AArch64). Here are the results of running the non-NEON and NEON apps on test data on my Odroid C2:

real 2m18.336s
user 2m18.180s
sys 0m0.080s

real 1m48.669s
user 1m48.600s
sys 0m0.060s
At the moment I do not have BOINC libraries ready for ARM64, so there is no app for it yet. I am going to add it later too. If you have them, you can compile it yourself; the source code is in the BitBucket repo on the "additional_optimizations" branch.

If you are curious how I managed to make it even faster, here is the answer. I made the following changes:
- changed the way data is stored, which allowed me to replace unaligned load/store instructions with aligned ones;
- removed unnecessary memory writes;
- changed the calculations a bit: replaced the square root of a product with a product of square roots, so I could calculate these square roots first and then reuse the results multiple times;
- removed some unnecessary code and provided templated versions of the most performance-critical function, so the compiler could optimize it further.
____________ | |
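The square-root change described in the list above can be illustrated with a minimal sketch. It assumes the roots appear in the normalisation term of a correlation, cov(i, j) / sqrt(var_i * var_j); the thread does not show the actual code, so the function names and sample data are hypothetical. The rewrite relies on sqrt(a * b) = sqrt(a) * sqrt(b), which holds for non-negative a and b.

```python
import math

def correlations_naive(cov, var):
    """One square root per pair of variables: n * n roots in total."""
    n = len(var)
    return [[cov[i][j] / math.sqrt(var[i] * var[j]) for j in range(n)]
            for i in range(n)]

def correlations_precomputed(cov, var):
    """One square root per variable, reused for every pair: n roots in total."""
    std = [math.sqrt(v) for v in var]  # computed once, up front
    n = len(var)
    return [[cov[i][j] / (std[i] * std[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical 3x3 covariance matrix; the variances are on the diagonal.
cov = [[4.0, 1.2, 0.5],
       [1.2, 9.0, 2.1],
       [0.5, 2.1, 16.0]]
var = [cov[i][i] for i in range(3)]

naive = correlations_naive(cov, var)
fast = correlations_precomputed(cov, var)
```

Both functions return the same matrix up to floating-point rounding, but the second needs only as many square roots as there are variables. The two results may differ in the last bits, so compare them with a tolerance rather than with exact equality.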
ID: 780 · Reply Quote | |
All I can say is wow. This new version is smoking fast. | |
ID: 781 · Reply Quote | |
Thanks Daniel, | |
ID: 782 · Reply Quote | |
Thanks Daniel! | |
ID: 783 · Reply Quote | |
Hi, thank You. | |
ID: 784 · Reply Quote | |
But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working very well on my FX 8320 (Win 10 64 bit). Keep up your great work!

On my four AMD FX-8320E and 8310 machines the older fma app worked but not the new version. However, the newest sse2 version is working fine (error free and validating properly) on all of those boxes and also on my various AMD Phenom II X6 CPUs, the AMD 5350 APU and the Intel Celeron 1037U. I haven't tried the newest avx, as the previous avx version didn't test faster on my machines. | |
ID: 785 · Reply Quote | |
> Thanks Daniel!

> Hi, thank You.

I found why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and give me a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; it no longer uses AVX2, so it should work fine. I also uploaded separate AVX2 versions for Windows and Linux 64-bit. Could someone with a sufficiently new CPU run some benchmarks with the test data on the AVX2 and FMA versions? I wonder whether there is any performance difference between the AVX2 and FMA versions.
____________ | |
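The crash described here comes down to a build needing an instruction set the CPU does not report. A minimal sketch of that check follows, assuming Linux-style `/proc/cpuinfo` flag names on x86; the variant names, the requirements table and the sample flag lines are hypothetical and are not taken from the project.

```python
# Hypothetical table: instruction-set flags each build variant needs.
# Flag names follow the x86 Linux /proc/cpuinfo spelling; the variants and
# their exact requirements are assumptions for illustration only.
REQUIRED_FLAGS = {
    "sse2": {"sse2"},
    "avx": {"sse2", "avx"},
    "fma": {"sse2", "avx", "fma"},
    "avx2": {"sse2", "avx", "avx2"},
}

def parse_cpuinfo_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

def runnable_variants(cpu_flags):
    """List the variants whose required flags are all reported by the CPU."""
    available = set(cpu_flags)
    return [name for name, needed in REQUIRED_FLAGS.items() if needed <= available]

# Shortened sample flag lines, not complete /proc/cpuinfo output.
no_avx2_cpu = parse_cpuinfo_flags("flags\t\t: fpu sse sse2 avx fma fma4")
avx2_cpu = parse_cpuinfo_flags("flags\t\t: fpu sse sse2 avx fma avx2")

print(runnable_variants(no_avx2_cpu))  # ['sse2', 'avx', 'fma']
print(runnable_variants(avx2_cpu))     # ['sse2', 'avx', 'fma', 'avx2']
```

The check only tells which builds can run without an illegal-instruction fault; it says nothing about which of them is fastest on a given CPU.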
ID: 786 · Reply Quote | |
OK, it seems that the application is getting really fast, which is obviously very good for us. Thanks again. As usual, I will wait for some time before deploying the new application; I will make an announcement when it is ready. | |
ID: 787 · Reply Quote | |
> I found why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and give me a link to the failed task? I would like to check the error message.

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test. | |
ID: 788 · Reply Quote | |
OK, tried SSE2 once again and it is working. :) | |
ID: 789 · Reply Quote | |
> OK, tried SSE2 once again and it is working. :)

Good to hear this :)

I have uploaded an ARM 32-bit version. It turned out that on my Odroid XU4 it is two times faster than the non-NEON app on the Odroid C2, and 1.5 times faster than the NEON one :-O

real 1m10.599s
user 1m9.830s
sys 0m0.345s
____________ | |
ID: 790 · Reply Quote | |
> I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

So far the new fma version is running but seems to be slower on this CPU than the sse2 app. With the previous optimized app, fma was a little faster than sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, whereas with the older optimized app the fma build was faster. Again, strange... | |
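The "about 9%" figure can be reproduced from the two reported ranges. The sketch below assumes the times are in minutes:seconds and uses the midpoint of each range as a representative run time; both are assumptions, since the post gives only the ranges.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def midpoint(low, high):
    """Midpoint of a reported time range, in seconds."""
    return (to_seconds(low) + to_seconds(high)) / 2

sse2 = midpoint("51:08", "53:32")  # 3140.0 s
fma = midpoint("56:10", "58:12")   # 3431.0 s

extra_time_fma = fma / sse2 - 1    # how much longer the fma app runs
time_saved_sse2 = 1 - sse2 / fma   # how much less time the sse2 app needs

print(f"fma takes {extra_time_fma:.1%} longer")       # fma takes 9.3% longer
print(f"sse2 takes {time_saved_sse2:.1%} less time")  # sse2 takes 8.5% less time
```

Either way of expressing the gap rounds to roughly 9%, which matches the figure quoted in the post.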
ID: 791 · Reply Quote | |
Odroid C2 1.75GHz, 1104MHz RAM

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run.sh
Running bin/pc_armv7a-vfpv4-v1.1 -
Loading: 0.601
computeStandardDeviations: 0.002
computeCorrelations: 1.436
pcAlgorithm, l 0: 0.031
pcAlgorithm, l 1: 2.451
pcAlgorithm, l 2: 0.894
pcAlgorithm, l 3: 0.096
pcAlgorithm, l 4: 0.041
pcAlgorithm, l 5: 0.013
pcAlgorithm, l 6: 0.003
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000
real 0m7.615s
user 0m5.520s
sys 0m0.080s

Running bin/pc_armv8-v0.9 -
real 0m10.489s
user 0m8.430s
sys 0m0.070s

Should complete a WU in a bit over 5 hours. Not bad against the ARMv8 app I was running before (7.5-8h)... | |
ID: 792 · Reply Quote | |
Here you can download other applications for Linux on ARM. Same Opti v1.1 code. | |
ID: 793 · Reply Quote | |
> I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, SSE may well be faster. I could test only on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.

BTW, I have created apps for 32-bit Windows, in SSE and non-SSE versions. Please let me know if they work on Windows XP. Cygwin dropped Windows XP support some time ago, so I am not sure whether the current 32-bit Cygwin version can create binaries for WinXP. The previous Win32 apps were compiled on WinXP with an older Cygwin version, so they worked fine. If the new ones do not work, I will have to download an older 32-bit Cygwin version; fortunately there is a mirror which still holds it.
____________ | |
ID: 794 · Reply Quote | |
Odroid C2 1.75GHz, 1104MHz RAM

Running bin/pc_armv8-a -

A lovely 37% gain over the previous ARMv8 app :-D

The ARMv7 vfp4 app works on my Rpi 3. | |
ID: 795 · Reply Quote | |
Please try running the test_run2.sh script, which works on some real data. The test_run.sh script does not show the performance improvement provided by NEON SIMD instructions. Actually, the NEON app is even a bit slower than the non-NEON 64-bit version when running this script. I posted results from running test_run2.sh on my C2 earlier; take a look at them :) | |
ID: 796 · Reply Quote | |
> The new sse2 app is faster than the fma app on this machine (AMD FX-8320E), a reversal from the earlier optimized app.

Thanks for the great explanation about what's probably going on. I was scratching my head over this one and it was starting to hurt. ;-) | |
ID: 797 · Reply Quote | |
Yep thanks, out of laziness I was reusing my test_run.sh that I had adjusted to loop through all pc* in bin. With test_run2.sh the change is quite dramatic!
A saving of 57%, or the app is 2.33x as fast as the ARMv8-v0.9 app. Should complete a WU in ~3.5h, amazing! | |
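The two figures in this post describe the same measurement: a run-time saving s corresponds to a speedup of 1 / (1 - s). A short sketch of the conversion, with hypothetical function names:

```python
def speedup_from_saving(saving):
    """Convert a fractional run-time saving (0 <= saving < 1) to a speedup factor."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return 1.0 / (1.0 - saving)

def saving_from_speedup(speedup):
    """Convert a speedup factor (> 0) to a fractional run-time saving."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

print(round(speedup_from_saving(0.57), 2))  # 2.33
print(round(saving_from_speedup(2.0), 2))   # 0.5
```

A 57% saving therefore does correspond to roughly 2.33x, and a "two times faster" app halves the run time.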
ID: 798 · Reply Quote | |
I am seeing a little less than 34 min/workunit on Win 7 64-bit with the newest AVX version, on an i3-4330 (Haswell) running two instances. | |
ID: 799 · Reply Quote | |