Optimization

Author	Message
[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 780 - Posted: 23 Jan 2017, 1:48:26 UTC
Surprise! I have just released a new optimized app version (Opti v1.1), two times faster than the previous optimized one (now the official one) :). It can be downloaded from the same place as the previous ones: https://bitbucket.org/sirzooro/pc-boinc/downloads. At the moment only 64-bit versions for Windows and Linux are available. I will add a 32-bit Windows version later.

The app_info.xml file provided with the app does not specify a plan class, so make sure you finish or abort your tasks first. Otherwise you will lose them when you install my app! This file also specifies a new app version (10), so if you are still running a previous app installed manually, make sure you have no tasks: you will lose them as well if you replace that file.

Here are results on test data for the previous and the new SSE Linux version:

Previous: real 0m54.472s user 0m52.358s sys 0m0.045s
New: real 0m26.208s user 0m24.142s sys 0m0.033s

I was also able to add code which uses NEON instructions on 64-bit ARM (AArch64). Here are results of running the non-NEON and NEON apps on test data on my Odroid C2:

Non-NEON: real 2m18.336s user 2m18.180s sys 0m0.080s
NEON: real 1m48.669s user 1m48.600s sys 0m0.060s

At the moment I do not have BOINC libraries ready for ARM64, so there is no app for it yet. I am going to add it later too. If you have the libraries you can compile it yourself; the source code is in the BitBucket repo on the "additional_optimizations" branch.

If you are curious how I managed to make it even faster, here is the answer. I made the following changes:
- changed how the data is stored, which allowed me to replace unaligned load/store instructions with aligned ones;
- removed unnecessary memory writes;
- changed the calculations a bit: replaced the square root of a product with the product of square roots, so I was able to calculate these square roots first and then use the results multiple times;
- removed some unnecessary code and provided templated versions of the most performance-critical function, so the compiler could optimize it further.
ID: 780
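
A minimal sketch of the square-root change from the list above, for readers who want to see why it saves work. The real formula used by the app is not given in this thread, so the function names, the input values and the "scale every pair" shape are all assumptions made for illustration. The point is only that sqrt(a*b) equals sqrt(a)*sqrt(b) for non-negative inputs, so n square roots can replace one square root per pair.

```python
import math

def pairwise_scale_naive(values):
    """One square root per pair: sqrt(a_i * a_j)."""
    n = len(values)
    return [[math.sqrt(values[i] * values[j]) for j in range(n)] for i in range(n)]

def pairwise_scale_cached(values):
    """One square root per element, computed first and reused for every pair."""
    roots = [math.sqrt(v) for v in values]
    n = len(values)
    return [[roots[i] * roots[j] for j in range(n)] for i in range(n)]

values = [0.25, 0.5, 0.75, 1.0]  # made-up non-negative inputs
naive = pairwise_scale_naive(values)
cached = pairwise_scale_cached(values)

# Largest difference between the two variants (floating-point rounding only).
max_diff = max(abs(a - b) for ra, rb in zip(naive, cached) for a, b in zip(ra, rb))
print(max_diff)
```

The two variants can differ in the last bits because the rounding happens at different points, which is worth keeping in mind when results from different builds are compared.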

No.15 Joined: 2 Feb 16 Posts: 13 Credit: 64,229,764 RAC: 0	Message 781 - Posted: 23 Jan 2017, 3:06:08 UTC
All I can say is wow. This new version is smoking fast. Thanks for all your work Daniel!
ID: 781

koschi Joined: 22 Oct 16 Posts: 25 Credit: 17,961,188 RAC: 0	Message 782 - Posted: 23 Jan 2017, 9:49:51 UTC
Thanks Daniel, runtime is down from ~93 minutes to 53 minutes on my i7 3770. Amazing work!
ID: 782

Beyond Joined: 2 Nov 16 Posts: 50 Credit: 44,375,756 RAC: 1	Message 783 - Posted: 23 Jan 2017, 16:50:01 UTC
Thanks Daniel! I tried the newest fma version and it crashed on a machine that worked with the old fma optimized app. The new sse2 version worked on all my various CPUs and was over twice as fast as the previous optimized app. To be clear, the new sse2 app is more than 2x faster on every one of my machines, even the ones that ran the old fma version.
ID: 783

Krümel Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 784 - Posted: 23 Jan 2017, 17:47:04 UTC
Hi, thank you. But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working very well on my FX 8320 (Win 10 64 bit). Keep up your great work!
ID: 784

Beyond Joined: 2 Nov 16 Posts: 50 Credit: 44,375,756 RAC: 1	Message 785 - Posted: 23 Jan 2017, 19:08:20 UTC - in response to Message 784. Last modified: 23 Jan 2017, 19:11:26 UTC
[quote]But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working very well on my FX 8320 (Win 10 64 bit). Keep up your great work![/quote]

On my four AMD FX-8320E and 8310 machines the older fma app worked but not the new version. However the newest sse2 version is working fine (error free and validating properly) on all of those boxes and also on my various AMD Phenom II X6 CPUs, the AMD 5350 APU and the Intel Celeron 1037U. Haven't tried the newest avx as the previous avx version didn't test faster for my machines.
ID: 785

[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 786 - Posted: 23 Jan 2017, 19:11:15 UTC - in response to Message 784.
[quote]Thanks Daniel! I tried the newest fma version and it crashed on a machine that worked with the old fma optimized app. The new sse2 version worked on all my various CPUs and was over twice as fast as the previous optimized app. To be clear, the new sse2 app is more than 2x faster on every one of my machines, even the ones that ran the old fma version.[/quote]

[quote]Hi, thank you. But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working very well on my FX 8320 (Win 10 64 bit). Keep up your great work![/quote]

I found out why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and give me a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; it no longer uses AVX2, so it should work fine. I also uploaded separate AVX2 versions for Windows and Linux 64-bit. Could someone with a sufficiently new CPU run some benchmarks with test data on the AVX2 and FMA versions? I wonder if there is any performance difference between the AVX2 and FMA versions.
ID: 786
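
The "Illegal Instruction" crash above comes from running a build that needs an instruction set the CPU lacks. A rough sketch of checking this up front on Linux follows. The build names and the flag sets in REQUIRED are assumptions for illustration, not taken from the project's download page. The flag names themselves (sse2, avx, avx2, fma) are the ones Linux reports on the "flags" line of /proc/cpuinfo on x86.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

# Hypothetical mapping from build name to the flags it needs (assumption).
REQUIRED = {
    "sse2": {"sse2"},
    "avx": {"avx"},
    "fma": {"avx", "fma"},
    "avx2": {"avx2", "fma"},
}

def runnable_builds(flags):
    """List the builds whose required flags are all present."""
    return [name for name, needed in REQUIRED.items() if needed <= flags]

# Made-up sample resembling a CPU that has FMA but no AVX2.
sample = "model name\t: example CPU\nflags\t\t: fpu sse sse2 avx fma\n"
print(runnable_builds(cpu_flags(sample)))
```

On a real Linux machine the text would come from reading /proc/cpuinfo. This only tells you which builds can run at all; which one is fastest still has to be measured.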

valterc Project administrator Project tester Joined: 30 Oct 13 Posts: 635 Credit: 34,757,094 RAC: 4	Message 787 - Posted: 23 Jan 2017, 19:20:55 UTC - in response to Message 780.
OK, it seems that the application is getting really fast. That's obviously very good for us, thanks again. As usual, I will wait for some time before deploying the new application; I will make an announcement when it is ready.

BTW, we are still working on the Mac OS version, apologies for the delay...

Looking at the applications page here http://gene.disi.unitn.it/test/apps.php you may also notice that no one is using the AVX 32-bit Linux version (which is good, I don't see any reason to install a 32-bit OS on an AVX-capable CPU). I will probably deprecate that one and put up a plain 32-bit (no SSE2) version for Linux.
ID: 787

Beyond Joined: 2 Nov 16 Posts: 50 Credit: 44,375,756 RAC: 1	Message 788 - Posted: 23 Jan 2017, 19:27:46 UTC - in response to Message 786. Last modified: 23 Jan 2017, 19:41:26 UTC
[quote]I found out why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and give me a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; it no longer uses AVX2, so it should work fine. I also uploaded separate AVX2 versions for Windows and Linux 64-bit. Could someone with a sufficiently new CPU run some benchmarks with test data on the AVX2 and FMA versions? I wonder if there is any performance difference between the AVX2 and FMA versions.[/quote]

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.
ID: 788

Krümel Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 789 - Posted: 23 Jan 2017, 20:07:51 UTC
OK, tried SSE2 once again and it is working. :) FMA seems to be running too. Thank you Daniel!
ID: 789

[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 790 - Posted: 23 Jan 2017, 20:29:57 UTC - in response to Message 789.
[quote]OK, tried SSE2 once again and it is working. :) FMA seems to be running too. Thank you Daniel![/quote]

Good to hear this :)

I have uploaded an ARM 32-bit version. It turns out that my Odroid XU4 is two times faster than the Odroid C2 running the non-NEON app, and 1.5 times faster than the C2 running the NEON one :-O

real 1m10.599s user 1m9.830s sys 0m0.345s
ID: 790

Beyond Joined: 2 Nov 16 Posts: 50 Credit: 44,375,756 RAC: 1	Message 791 - Posted: 23 Jan 2017, 20:50:11 UTC - in response to Message 788. Last modified: 23 Jan 2017, 21:25:18 UTC
[quote]I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.[/quote]

So far the new fma version is running but seems to be slower on this CPU than the sse2 app. On the previous optimized app the fma was a little faster than the sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...
ID: 791
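
The "about 9%" figure can be reproduced from the two ranges above. This is a small sketch that assumes the ranges are minutes:seconds per task and that the midpoint of each range is a fair summary; neither is stated explicitly in the post.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

def midpoint(low, high):
    """Midpoint of a runtime range, in seconds."""
    return (to_seconds(low) + to_seconds(high)) / 2

sse2 = midpoint("51:08", "53:32")
fma = midpoint("56:10", "58:12")

# How much longer the fma build takes relative to the sse2 build.
ratio = fma / sse2
print(round(ratio, 3))  # 1.093, i.e. roughly 9% in favour of sse2
```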

koschi Joined: 22 Oct 16 Posts: 25 Credit: 17,961,188 RAC: 0	Message 792 - Posted: 23 Jan 2017, 21:12:24 UTC
Odroid C2 1.75GHz, 1104MHz RAM

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run.sh
Running bin/[b]pc_armv7a-vfpv4-v1.1[/b] -
Loading: 0.601
computeStandardDeviations: 0.002
computeCorrelations: 1.436
pcAlgorithm, l 0: 0.031
pcAlgorithm, l 1: 2.451
pcAlgorithm, l 2: 0.894
pcAlgorithm, l 3: 0.096
pcAlgorithm, l 4: 0.041
pcAlgorithm, l 5: 0.013
pcAlgorithm, l 6: 0.003
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

[b]real 0m7.615s[/b]
user 0m5.520s
sys 0m0.080s

Running bin/[b]pc_armv8-v0.9[/b] -

real 0m10.489s
user 0m8.430s
sys 0m0.070s

Should complete a WU in a bit over 5 hours. Not bad against the ARMv8 app I was running before (7.5-8h)...
ID: 792

sorcrosc Volunteer developer Joined: 19 Dec 13 Posts: 26 Credit: 3,866,632 RAC: 0	Message 793 - Posted: 23 Jan 2017, 22:01:05 UTC Last modified: 23 Jan 2017, 22:02:12 UTC
Here you can download other applications for Linux on ARM. Same Opti v1.1 code.

armv6
armv7_vfpv3
armv8

@koschi: try the latest one on your C2 :)
ID: 793

[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 794 - Posted: 23 Jan 2017, 22:11:03 UTC - in response to Message 791.
[quote][quote]I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.[/quote]

So far the new fma version is running but seems to be slower on this CPU than the sse2 app. On the previous optimized app the fma was a little faster than the sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...[/quote]

This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could test it only on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.

BTW, I have created apps for 32-bit Windows, in SSE and non-SSE versions. Please let me know if they work on Windows XP. Cygwin dropped Windows XP support some time ago, so I am not sure whether the current 32-bit Cygwin version can create binaries for WinXP. The previous Win32 apps were compiled on WinXP with an older Cygwin version, so they worked fine. If the new ones do not work, I will have to download an older 32-bit Cygwin version; fortunately there is a mirror which still holds it.
ID: 794
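
To make the "fewer iterations because of longer vectors" point concrete, here is a tiny sketch of the loop trip count for a given vector register width. The element size (32-bit floats) and the array length are assumptions; the thread does not say which data type the app uses.

```python
import math

def loop_iterations(n_elements, register_bits, element_bits=32):
    """Number of vector-loop iterations needed to cover n_elements."""
    lanes = register_bits // element_bits  # elements processed per instruction
    return math.ceil(n_elements / lanes)

n = 1000  # hypothetical array length
sse_iters = loop_iterations(n, 128)  # 128-bit SSE registers: 4 floats each
avx_iters = loop_iterations(n, 256)  # 256-bit AVX registers: 8 floats each
print(sse_iters, avx_iters)
```

Halving the trip count only halves the runtime if the loop is limited by arithmetic. If it is waiting on memory, as suggested above, the wider registers move the same number of bytes and the gain can disappear.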

koschi Joined: 22 Oct 16 Posts: 25 Credit: 17,961,188 RAC: 0	Message 795 - Posted: 23 Jan 2017, 22:33:23 UTC - in response to Message 792. Last modified: 23 Jan 2017, 22:35:40 UTC
[quote]Odroid C2 1.75GHz, 1104MHz RAM

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run.sh
Running bin/pc_armv7a-vfpv4-v1.1 -
Loading: 0.601
computeStandardDeviations: 0.002
computeCorrelations: 1.436
pcAlgorithm, l 0: 0.031
pcAlgorithm, l 1: 2.451
pcAlgorithm, l 2: 0.894
pcAlgorithm, l 3: 0.096
pcAlgorithm, l 4: 0.041
pcAlgorithm, l 5: 0.013
pcAlgorithm, l 6: 0.003
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

real 0m7.615s
user 0m5.520s
sys 0m0.080s

Running bin/pc_armv8-v0.9 -

real 0m10.489s
user 0m8.430s
sys 0m0.070s

Should complete a WU in a bit over 5 hours. Not bad against the ARMv8 app I was running before (7.5-8h)...[/quote]

Running bin/pc_armv8-a -
Loading: 0.376
computeStandardDeviations: 0.003
computeCorrelations: 1.442
pcAlgorithm, l 0: 0.023
pcAlgorithm, l 1: 1.815
pcAlgorithm, l 2: 0.856
pcAlgorithm, l 3: 0.084
pcAlgorithm, l 4: 0.030
pcAlgorithm, l 5: 0.010
pcAlgorithm, l 6: 0.002
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

real 0m6.667s
user 0m4.600s
sys 0m0.070s

A lovely 37% gain over the previous ARMv8 app :-D

The ARMv7 vfp4 app works on my Rpi 3.
ID: 795

[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 796 - Posted: 23 Jan 2017, 22:38:40 UTC - in response to Message 795. Last modified: 23 Jan 2017, 22:43:10 UTC
Please try running the test_run2.sh script, which works on some real data. The test_run.sh script does not show the performance improvement provided by NEON SIMD instructions. Actually the NEON app is even a bit slower than the non-NEON 64-bit version when running this script. I posted results from running test_run2.sh on my C2 earlier, take a look at them :)
ID: 796

Beyond Joined: 2 Nov 16 Posts: 50 Credit: 44,375,756 RAC: 1	Message 797 - Posted: 23 Jan 2017, 23:01:36 UTC - in response to Message 794. Last modified: 23 Jan 2017, 23:02:57 UTC
[quote][quote]The new sse2 app is faster than the fma app on this machine (AMD FX-8320E), a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...[/quote]

This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could test it only on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.[/quote]

Thanks for the great explanation about what's probably going on. I was scratching my head over this one and it was starting to hurt. ;-)
ID: 797

koschi Joined: 22 Oct 16 Posts: 25 Credit: 17,961,188 RAC: 0	Message 798 - Posted: 23 Jan 2017, 23:04:09 UTC Last modified: 23 Jan 2017, 23:04:38 UTC
Yep thanks, out of laziness I was reusing my test_run.sh that I had adjusted to loop through all pc* in bin. With test_run2.sh the change is quite dramatic!

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run2.sh
bin/pc_armv7a-vfpv4-v1.1 input/tile2.txt output/output2.txt 0.05 1 2470
Loading: 0.831
computeStandardDeviations: 0.003
computeCorrelations: 0.369
pcAlgorithm, l 0: 0.001
pcAlgorithm, l 1: 0.064
pcAlgorithm, l 2: 0.893
pcAlgorithm, l 3: 4.866
pcAlgorithm, l 4: 16.922
pcAlgorithm, l 5: 23.217
pcAlgorithm, l 6: 22.773
pcAlgorithm, l 7: 17.738
pcAlgorithm, l 8: 16.013
pcAlgorithm, l 9: 10.758
pcAlgorithm, l 10: 6.917
pcAlgorithm, l 11: 3.896
pcAlgorithm, l 12: 2.017
pcAlgorithm, l 13: 0.736
pcAlgorithm, l 14: 0.205
pcAlgorithm, l 15: 0.041
pcAlgorithm, l 16: 0.005
pcAlgorithm, l 17: 0.000
pcAlgorithm, l 18: 0.000

real 2m10.423s
user 2m8.150s
sys 0m0.120s
diff: output/output2.txt: No such file or directory
#######################################################################

bin/pc_armv8-v0.9 input/tile2.txt output/output2.txt 0.05 1 2470

real 3m48.623s
user 3m46.260s
sys 0m0.110s
diff: output/output2.txt: No such file or directory
#######################################################################

bin/pc_armv8-v1.1 input/tile2.txt output/output2.txt 0.05 1 2470
Loading: 0.466
computeStandardDeviations: 0.003
computeCorrelations: 0.384
pcAlgorithm, l 0: 0.001
pcAlgorithm, l 1: 0.047
pcAlgorithm, l 2: 1.054
pcAlgorithm, l 3: 4.910
pcAlgorithm, l 4: 12.164
pcAlgorithm, l 5: 18.240
pcAlgorithm, l 6: 17.246
pcAlgorithm, l 7: 13.092
pcAlgorithm, l 8: 11.164
pcAlgorithm, l 9: 7.474
pcAlgorithm, l 10: 4.813
pcAlgorithm, l 11: 2.743
pcAlgorithm, l 12: 1.423
pcAlgorithm, l 13: 0.520
pcAlgorithm, l 14: 0.146
pcAlgorithm, l 15: 0.030
pcAlgorithm, l 16: 0.004
pcAlgorithm, l 17: 0.000
pcAlgorithm, l 18: 0.000

real 1m37.931s
user 1m35.870s
sys 0m0.060s
diff: output/output2.txt: No such file or directory

A saving of 57%, or the app is 2.33x as fast as the ARMv8-v0.9 app.
Should complete a WU in ~3.5h, amazing!
ID: 798
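
For anyone who wants to check figures like the "57%" and "2.33x" above against their own runs, here is a short sketch that converts the "real" values printed by the shell's time command into seconds and derives both numbers. The helper name is made up for this example; the two input strings are the real times of the ARMv8 v0.9 and v1.1 runs in Message 798.

```python
import re

def time_to_seconds(text):
    """Convert a 'XmY.YYYs' string, as printed by time, to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

old = time_to_seconds("3m48.623s")  # pc_armv8-v0.9
new = time_to_seconds("1m37.931s")  # pc_armv8-v1.1

speedup = old / new      # how many times as fast the new build is
saving = 1 - new / old   # fraction of runtime saved
print(f"{speedup:.2f}x, {saving:.0%}")  # 2.33x, 57%
```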

hoppisaur Joined: 20 Nov 16 Posts: 1 Credit: 475,585 RAC: 0	Message 799 - Posted: 24 Jan 2017, 2:03:05 UTC - in response to Message 794.
[quote]This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could test it only on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.[/quote]

I am seeing a little less than 34 min/workunit on Win 7 64-bit with the newest AVX version, on an i3-4330 (Haswell) running two instances.
ID: 799
