Optimization
[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 780 - Posted: 23 Jan 2017, 1:48:26 UTC

Surprise! I have just released a new optimized app version (Opti v1.1), two times faster than the previous optimized one (now the official one) :). It can be downloaded from the same place as the previous ones: https://bitbucket.org/sirzooro/pc-boinc/downloads. At the moment only 64-bit versions for Windows and Linux are available. I will add a 32-bit Windows version later.

The app_info.xml file provided with the app does not specify a plan class, so make sure you finish or abort your tasks first. Otherwise you will lose them when you install my app! This file also specifies a new app version (10), so if you are still running a previous app installed manually, make sure you have no tasks: you will also lose them if you replace that file.

Here are the results for the test data from the previous and new SSE Linux versions:

Previous version:
real 0m54.472s
user 0m52.358s
sys 0m0.045s

New version:
real 0m26.208s
user 0m24.142s
sys 0m0.033s


I was also able to add code that uses NEON instructions on 64-bit ARM (AArch64). Here are the results of running the non-NEON and NEON apps on the test data on my Odroid C2:

Non-NEON:
real 2m18.336s
user 2m18.180s
sys 0m0.080s

NEON:
real 1m48.669s
user 1m48.600s
sys 0m0.060s
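For readers who want to turn the two pairs of timings above into speedup factors, here is a minimal Python sketch. The helper names `parse_time` and `speedup` are made up for this illustration and are not part of the app or its test scripts; only the `real` values quoted in the post are used.

```python
import re

def parse_time(stamp):
    """Convert a `time`-style stamp such as '0m54.472s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def speedup(old, new):
    """How many times faster the new run is, based on wall-clock time."""
    return parse_time(old) / parse_time(new)

# 'real' values from the post: x86 SSE (previous vs new), Odroid C2 (non-NEON vs NEON)
sse_speedup = speedup("0m54.472s", "0m26.208s")
neon_speedup = speedup("2m18.336s", "1m48.669s")
print(f"SSE: {sse_speedup:.2f}x, NEON: {neon_speedup:.2f}x")
```

On these numbers the new SSE build is a little over 2x faster, while NEON gives roughly a 1.27x gain over the non-NEON build on the C2.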


At the moment I do not have the BOINC libraries ready for ARM64, so there is no app for it yet; I am going to add it later too. If you have them, you can compile it yourself: the source code is in the BitBucket repo on the "additional_optimizations" branch.

If you are curious how I managed to make it even faster, here is the answer. I made the following changes:
- changed how the data is stored, which allowed me to replace unaligned load/store instructions with aligned ones;
- removed unnecessary memory writes;
- changed the calculations a bit: replaced the square root of a product with a product of square roots, so I could calculate these square roots once and then reuse the results multiple times;
- removed some unnecessary code and provided templated versions of the most performance-critical function, so the compiler could optimize it further.
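The square-root change in the list above can be sketched in a few lines of Python. This is only an illustration of the identity sqrt(a*b) = sqrt(a)*sqrt(b) for non-negative values, applied to a correlation-style normalisation; whether the app uses it in exactly this place is an assumption, and the function names and the sample data are hypothetical. The actual app is compiled code, not Python.

```python
import math

def correlations_naive(cov, var):
    """One square root per pair: cov[i][j] / sqrt(var[i] * var[j])."""
    n = len(var)
    return [[cov[i][j] / math.sqrt(var[i] * var[j]) for j in range(n)]
            for i in range(n)]

def correlations_precomputed(cov, var):
    """n square roots computed once, then reused for every pair."""
    sd = [math.sqrt(v) for v in var]  # valid because variances are non-negative
    n = len(var)
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical 3-variable example
var = [4.0, 9.0, 16.0]
cov = [[4.0, 3.0, 2.0],
       [3.0, 9.0, 6.0],
       [2.0, 6.0, 16.0]]
naive = correlations_naive(cov, var)
fast = correlations_precomputed(cov, var)
```

For n variables the first version takes n*n square roots and the second only n. The two results can differ in the last bits because of floating-point rounding, so they should be compared with a tolerance rather than for exact equality.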

No.15
Joined: 2 Feb 16
Posts: 13
Credit: 64,229,764
RAC: 0
United States
Message 781 - Posted: 23 Jan 2017, 3:06:08 UTC

All I can say is wow. This new version is smoking fast.

Thanks for all your work Daniel!

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 782 - Posted: 23 Jan 2017, 9:49:51 UTC

Thanks Daniel,
runtime is down from ~93 minutes to 53 minutes on my i7 3770.

Amazing work!

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 783 - Posted: 23 Jan 2017, 16:50:01 UTC

Thanks Daniel!

I tried the newest fma version and it crashed on a machine that worked with the old fma optimized app.
The new sse2 version worked on all my various CPUs and was over twice as fast as the previous optimized app.
To be clear, the new sse2 app is more than 2x faster on every one of my machines, even the ones that ran the old fma version.

Krümel
Joined: 31 Oct 16
Posts: 19
Credit: 14,099,551
RAC: 0
Germany
Message 784 - Posted: 23 Jan 2017, 17:47:04 UTC

Hi, thank you.

But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working well on my FX 8320 (Win 10 64-bit).

Keep up your great work!

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 785 - Posted: 23 Jan 2017, 19:08:20 UTC - in response to Message 784.
Last modified: 23 Jan 2017, 19:11:26 UTC

But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working well on my FX 8320 (Win 10 64-bit). Keep up your great work!

On my four AMD FX-8320E and 8310 machines the older fma app worked but not the new version. However the newest sse2 version is working fine (error free and validating properly) on all of those boxes and also on my various AMD Phenom II X6 CPUs, the AMD 5350 APU and the Intel Celeron 1037U. Haven't tried the newest avx as the previous avx version didn't test faster for my machines.

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 786 - Posted: 23 Jan 2017, 19:11:15 UTC - in response to Message 784.

Thanks Daniel!

I tried the newest fma version and it crashed on a machine that worked with the old fma optimized app.
The new sse2 version worked on all my various CPUs and was over twice as fast as the previous optimized app.
To be clear, the new sse2 app is more than 2x faster on every one of my machines, even the ones that ran the old fma version.


Hi, thank you.

But I'm getting compute errors with the new SSE2 and FMA apps. Only AVX is working well on my FX 8320 (Win 10 64-bit).

Keep up your great work!


I found out why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, so gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, and the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and send me a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; it no longer uses AVX2, so it should work fine. I also uploaded separate AVX2 versions for 64-bit Windows and Linux. Could someone with a sufficiently new CPU run some benchmarks with the test data on the AVX2 and FMA versions? I wonder if there is any performance difference between them.
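To avoid "Illegal Instruction" crashes like the one described above, it helps to check which instruction sets a CPU reports before picking a build. The following Python sketch shows one way to do that on x86 Linux. The flag names (`sse2`, `avx`, `fma`, `avx2`) are the ones the kernel lists in /proc/cpuinfo; the build names and the mapping from build to required flags are assumptions made for this illustration, not something taken from the app's documentation.

```python
# Hypothetical build names mapped to the CPU flags each one needs.
# The mapping is an assumption made for illustration only.
BUILD_REQUIREMENTS = {
    "sse2": {"sse2"},
    "avx": {"avx"},
    "fma": {"avx", "fma"},
    "avx2": {"avx", "avx2"},
}

def usable_builds(cpu_flags):
    """Return the builds whose required flags all appear in cpu_flags."""
    available = set(cpu_flags.split())
    return [name for name, needed in BUILD_REQUIREMENTS.items()
            if needed <= available]

def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed in /proc/cpuinfo (x86 Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1].strip()
    return ""

# Illustrative subset of flags for a CPU with FMA but without AVX2;
# this is not a real /proc/cpuinfo dump.
sample_flags = "fpu sse sse2 ssse3 sse4_1 sse4_2 avx fma"
print(usable_builds(sample_flags))
```

On a real machine, `usable_builds(read_linux_cpu_flags())` would give the candidate list; which of the usable builds is actually fastest still has to be measured.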

valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,676,744
RAC: 1,154
Italy
Message 787 - Posted: 23 Jan 2017, 19:20:55 UTC - in response to Message 780.

OK, it seems that the application is getting really fast, which is obviously very good for us; thanks again. As usual, I will wait for some time before deploying the new application, and I will make an announcement when it is ready.

BTW, we are still working on the Mac OS version, apologies for the delay...

Looking at the applications page at http://gene.disi.unitn.it/test/apps.php you may also notice that no one is using the 32-bit Linux AVX version (which is good; I don't see any reason to install a 32-bit OS on an AVX-capable CPU). I will probably deprecate that one and provide a plain 32-bit (no SSE2) version for Linux instead.

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 788 - Posted: 23 Jan 2017, 19:27:46 UTC - in response to Message 786.
Last modified: 23 Jan 2017, 19:41:26 UTC

I found out why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, so gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, and the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and send me a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; it no longer uses AVX2, so it should work fine. I also uploaded separate AVX2 versions for 64-bit Windows and Linux. Could someone with a sufficiently new CPU run some benchmarks with the test data on the AVX2 and FMA versions? I wonder if there is any performance difference between them.

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

Krümel
Joined: 31 Oct 16
Posts: 19
Credit: 14,099,551
RAC: 0
Germany
Message 789 - Posted: 23 Jan 2017, 20:07:51 UTC

OK, tried SSE2 once again and it is working. :)
FMA seems to be running too.

Thank You Daniel!

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 790 - Posted: 23 Jan 2017, 20:29:57 UTC - in response to Message 789.

OK, tried SSE2 once again and it is working. :)
FMA seems to be running too.

Thank You Daniel!

Good to hear this :)

I have uploaded an ARM 32-bit version. It turned out that on my Odroid XU4 it is two times faster than the non-NEON app on the Odroid C2, and 1.5 times faster than the NEON one :-O

real 1m10.599s
user 1m9.830s
sys 0m0.345s


Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 791 - Posted: 23 Jan 2017, 20:50:11 UTC - in response to Message 788.
Last modified: 23 Jan 2017, 21:25:18 UTC

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

So far the new fma version is running but seems to be slower on this CPU than the sse2 app.
On the previous optimized app the fma was a little faster than the sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...
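The "about 9%" figure can be checked from the two runtime ranges quoted above. This is a small Python sketch that compares the midpoints of the ranges; using midpoints is an assumption about how the figure was derived, and the helper names are made up for the example.

```python
def to_seconds(mmss):
    """Convert a 'MM:SS' runtime to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def midpoint(low, high):
    """Midpoint, in seconds, of a runtime range given as two 'MM:SS' strings."""
    return (to_seconds(low) + to_seconds(high)) / 2

sse2_mid = midpoint("51:08", "53:32")
fma_mid = midpoint("56:10", "58:12")

# How much longer the fma runs take relative to the sse2 runs
fma_penalty = fma_mid / sse2_mid - 1
print(f"sse2 midpoint {sse2_mid:.0f} s, fma midpoint {fma_mid:.0f} s, "
      f"fma slower by {fma_penalty:.1%}")
```

The midpoints come out at 3140 s and 3431 s, so the fma runs take roughly 9% longer, which matches the figure in the post.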

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 792 - Posted: 23 Jan 2017, 21:12:24 UTC

Odroid C2 1.75GHz, 1104MHz RAM

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run.sh
Running bin/pc_armv7a-vfpv4-v1.1 -
Loading: 0.601
computeStandardDeviations: 0.002
computeCorrelations: 1.436
pcAlgorithm, l 0: 0.031
pcAlgorithm, l 1: 2.451
pcAlgorithm, l 2: 0.894
pcAlgorithm, l 3: 0.096
pcAlgorithm, l 4: 0.041
pcAlgorithm, l 5: 0.013
pcAlgorithm, l 6: 0.003
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

real 0m7.615s
user 0m5.520s
sys 0m0.080s

Running bin/pc_armv8-v0.9 -

real 0m10.489s
user 0m8.430s
sys 0m0.070s



Should complete a WU in a bit over 5 hours. Not bad against the ARMv8 app I was running before (7.5-8h)...

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 793 - Posted: 23 Jan 2017, 22:01:05 UTC
Last modified: 23 Jan 2017, 22:02:12 UTC

Here you can download other applications for Linux on ARM. Same Opti v1.1 code.

armv6
armv7_vfpv3
armv8

@koschi: try the latest one on your C2 :)

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 794 - Posted: 23 Jan 2017, 22:11:03 UTC - in response to Message 791.

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

So far the new fma version is running but seems to be slower on this CPU than the sse2 app.
On the previous optimized app the fma was a little faster than the sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...

This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could only test on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.
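The remark about longer vectors can be made concrete with a little arithmetic. The Python sketch below counts loop iterations for 128-bit SSE and 256-bit AVX registers. The 64-bit element width and the array length are assumptions chosen for the example, since the thread does not say which data type the app uses.

```python
import math

def lanes(register_bits, element_bits=64):
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits

def loop_iterations(n_elements, register_bits, element_bits=64):
    """Iterations needed to process n_elements, one full register per iteration."""
    return math.ceil(n_elements / lanes(register_bits, element_bits))

n = 1000  # hypothetical array length
sse_iters = loop_iterations(n, 128)  # 2 x 64-bit elements per iteration
avx_iters = loop_iterations(n, 256)  # 4 x 64-bit elements per iteration
print(f"SSE: {sse_iters} iterations, AVX: {avx_iters} iterations")
```

Halving the iteration count only helps while the loop is limited by computation. Once memory speed is the limit, as described above, both versions wait for the same amount of data and the wider registers bring little benefit.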

BTW, I have created apps for 32-bit Windows, in SSE and non-SSE versions. Please let me know if they work on Windows XP. Cygwin dropped Windows XP support some time ago, so I am not sure whether the current 32-bit Cygwin version can create binaries for WinXP. The previous Win32 apps were compiled on WinXP with an older Cygwin version, so they worked fine. If the new ones do not work, I will have to download an older 32-bit Cygwin version; fortunately there is a mirror that still hosts it.

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 795 - Posted: 23 Jan 2017, 22:33:23 UTC - in response to Message 792.
Last modified: 23 Jan 2017, 22:35:40 UTC

Odroid C2 1.75GHz, 1104MHz RAM

root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run.sh
Running bin/pc_armv7a-vfpv4-v1.1 -
Loading: 0.601
computeStandardDeviations: 0.002
computeCorrelations: 1.436
pcAlgorithm, l 0: 0.031
pcAlgorithm, l 1: 2.451
pcAlgorithm, l 2: 0.894
pcAlgorithm, l 3: 0.096
pcAlgorithm, l 4: 0.041
pcAlgorithm, l 5: 0.013
pcAlgorithm, l 6: 0.003
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

real 0m7.615s
user 0m5.520s
sys 0m0.080s


Running bin/pc_armv8-v0.9 -

real 0m10.489s
user 0m8.430s
sys 0m0.070s



Should complete a WU in a bit over 5 hours. Not bad against the ARMv8 app I was running before (7.5-8h)...



Running bin/pc_armv8-a -
Loading: 0.376
computeStandardDeviations: 0.003
computeCorrelations: 1.442
pcAlgorithm, l 0: 0.023
pcAlgorithm, l 1: 1.815
pcAlgorithm, l 2: 0.856
pcAlgorithm, l 3: 0.084
pcAlgorithm, l 4: 0.030
pcAlgorithm, l 5: 0.010
pcAlgorithm, l 6: 0.002
pcAlgorithm, l 7: 0.000
pcAlgorithm, l 8: 0.000

real 0m6.667s
user 0m4.600s
sys 0m0.070s


A lovely 37% gain over the previous ARMv8 app :-D

The ARMv7 vfp4 app works on my Rpi 3.

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 796 - Posted: 23 Jan 2017, 22:38:40 UTC - in response to Message 795.
Last modified: 23 Jan 2017, 22:43:10 UTC

Please try running the test_run2.sh script, which works on some real data. The test_run.sh script does not show the performance improvement provided by NEON SIMD instructions. In fact, the NEON app is even a bit slower than the non-NEON 64-bit version when running that script. I posted results from running test_run2.sh on my C2 earlier; take a look at them :)

Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 797 - Posted: 23 Jan 2017, 23:01:36 UTC - in response to Message 794.
Last modified: 23 Jan 2017, 23:02:57 UTC

The new sse2 app is faster than the fma app on this machine (AMD FX-8320E), a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...

This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could only test on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.

Thanks for the great explanation about what's probably going on. I was scratching my head over this one and it was starting to hurt. ;-)

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 798 - Posted: 23 Jan 2017, 23:04:09 UTC
Last modified: 23 Jan 2017, 23:04:38 UTC

Yep thanks, out of laziness I was reusing my test_run.sh that I had adjusted to loop through all pc* in bin. With test_run2.sh the change is quite dramatic!


root@odroidc2-1:~/BOINC_dev/boinc/samples/pc-boinc# ./test_run2.sh
bin/pc_armv7a-vfpv4-v1.1 input/tile2.txt output/output2.txt 0.05 1 2470
Loading: 0.831
computeStandardDeviations: 0.003
computeCorrelations: 0.369
pcAlgorithm, l 0: 0.001
pcAlgorithm, l 1: 0.064
pcAlgorithm, l 2: 0.893
pcAlgorithm, l 3: 4.866
pcAlgorithm, l 4: 16.922
pcAlgorithm, l 5: 23.217
pcAlgorithm, l 6: 22.773
pcAlgorithm, l 7: 17.738
pcAlgorithm, l 8: 16.013
pcAlgorithm, l 9: 10.758
pcAlgorithm, l 10: 6.917
pcAlgorithm, l 11: 3.896
pcAlgorithm, l 12: 2.017
pcAlgorithm, l 13: 0.736
pcAlgorithm, l 14: 0.205
pcAlgorithm, l 15: 0.041
pcAlgorithm, l 16: 0.005
pcAlgorithm, l 17: 0.000
pcAlgorithm, l 18: 0.000

real 2m10.423s
user 2m8.150s
sys 0m0.120s
diff: output/output2.txt: No such file or directory
#######################################################################

bin/pc_armv8-v0.9 input/tile2.txt output/output2.txt 0.05 1 2470

real 3m48.623s
user 3m46.260s
sys 0m0.110s
diff: output/output2.txt: No such file or directory
#######################################################################

bin/pc_armv8-v1.1 input/tile2.txt output/output2.txt 0.05 1 2470
Loading: 0.466
computeStandardDeviations: 0.003
computeCorrelations: 0.384
pcAlgorithm, l 0: 0.001
pcAlgorithm, l 1: 0.047
pcAlgorithm, l 2: 1.054
pcAlgorithm, l 3: 4.910
pcAlgorithm, l 4: 12.164
pcAlgorithm, l 5: 18.240
pcAlgorithm, l 6: 17.246
pcAlgorithm, l 7: 13.092
pcAlgorithm, l 8: 11.164
pcAlgorithm, l 9: 7.474
pcAlgorithm, l 10: 4.813
pcAlgorithm, l 11: 2.743
pcAlgorithm, l 12: 1.423
pcAlgorithm, l 13: 0.520
pcAlgorithm, l 14: 0.146
pcAlgorithm, l 15: 0.030
pcAlgorithm, l 16: 0.004
pcAlgorithm, l 17: 0.000
pcAlgorithm, l 18: 0.000

real 1m37.931s
user 1m35.870s
sys 0m0.060s
diff: output/output2.txt: No such file or directory


A saving of 57%, or the app is 2.33x as fast as the ARMv8-v0.9 app.
Should complete a WU in ~3.5h, amazing!
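The "57% saving" and "2.33x as fast" figures above describe the same measurement in two ways. Here is a short Python sketch of the relationship, using the `real` times from the test_run2.sh output and the 7.5 to 8 hour work unit runtime mentioned earlier for the old ARMv8 app. The projection assumes that full work units scale the same way as the test data, which is not guaranteed.

```python
# 'real' wall-clock times from the test_run2.sh output, in seconds
old_seconds = 3 * 60 + 48.623   # pc_armv8-v0.9
new_seconds = 1 * 60 + 37.931   # pc_armv8-v1.1

saving = 1 - new_seconds / old_seconds   # fraction of runtime removed
speedup = old_seconds / new_seconds      # equals 1 / (1 - saving)

# Project the earlier 7.5 to 8 hour work unit runtimes onto the new app
projected_hours = [hours / speedup for hours in (7.5, 8.0)]

print(f"saving {saving:.0%}, speedup {speedup:.2f}x, "
      f"projected WU time {projected_hours[0]:.1f} to {projected_hours[1]:.1f} h")
```

The projection lands between about 3.2 and 3.4 hours, in line with the rough 3.5 hour estimate in the post.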

hoppisaur
Joined: 20 Nov 16
Posts: 1
Credit: 475,585
RAC: 0
United States
Message 799 - Posted: 24 Jan 2017, 2:03:05 UTC - in response to Message 794.


This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the AVX loops ran fewer iterations because of the longer vectors, so AVX and FMA were faster. Now, with fewer of these slow calculations and with unrolled loops, it may be that SSE is faster. I could only test on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.


I am seeing a little less than 34 min/workunit on Win 7 64-bit with the newest AVX version, on an i3-4330 (Haswell) running two instances.



Copyright © 2024 CNR-TN & UniTN