Optimization
Message boards : Number crunching : Optimization

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 2
Italy
Message 823 - Posted: 25 Jan 2017, 17:28:38 UTC - in response to Message 822.
Last modified: 25 Jan 2017, 17:29:31 UTC

Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)

And might we see another doubling of the 'production' WUs to 200 'blocks' (in the not too distant future), given that Daniel has once again enabled us to process twice as many WUs as we were doing with the last release of his optimised app?

Perhaps I shouldn't be putting such ideas into your head! ;-)

Well, I already had the intention of doing this... The 'problem' is that the size of the output file is almost the same regardless of the number of blocks, so there is no reason to have very short workunits (they just mean more stuff in the database and more network traffic). As for timing, I will wait until the beginning of February to deploy the new apps and increase the workunit size. Before doing this, I also want to deploy a small batch of workunits for another organism, just to check that everything is working well.

Profile Beyond
Joined: 2 Nov 16
Posts: 50
Credit: 44,372,499
RAC: 0
United States
Message 824 - Posted: 25 Jan 2017, 18:17:32 UTC - in response to Message 808.

Impressive work. Thanks! It looks like the sse2+fma is only 0.659% faster. Is that even worth having another version?

I am going to modify the code a bit to use AVX for the div/sqrt calculations and SSE for the rest. This should improve performance a little, so in the end it should be a bit faster than this SSE+FMA version. We will see how much faster once I have it ready. Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect them to become 200 times longer, so the actual time reduction per WU will not be so tiny.

Looking at the user reporting for his AMD X8, his results show that the new fma app is actually running around 11% faster than the sse2 version. This is also what I'm seeing on my four AMD X8 CPUs. A useful increase. Once again, THANKS!

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 2
Italy
Message 826 - Posted: 26 Jan 2017, 19:04:21 UTC - in response to Message 821.

I found a host that is not able to run the Linux x64 version because of missing shared libraries (http://gene.disi.unitn.it/test/show_host_detail.php?hostid=2990). Is the kernel too old (3.2.0-4-amd64)? The error is: version `GLIBC_2.15' not found, version `GLIBC_2.16' not found.

The Makefile doesn't link with 'g++ -static ...', which is the way I know of making a static executable (I checked the result with ldd and it works). I don't know whether this is a good solution, or whether the only way to solve this is to put a minimum kernel version in the plan class of the application.

BTW I made a static Linux x64 sse2 version of the application using the latest source code, if someone would like to play with it: http://gene.disi.unitn.it/test/files/tngrid_expansion_v11_linux64-static__sse2.tar.gz

Hints are welcome.
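
A note on the error itself: the `GLIBC_2.15' and `GLIBC_2.16' strings are versioned symbols of the C library (glibc), so the check that fails is against the host's glibc rather than against the kernel as such. A minimal Python sketch of that comparison follows. The helper names parse_version and missing_glibc_versions are made up for illustration; platform.libc_ver() is the standard-library call that reports the host's C library.

```python
import platform

def parse_version(text):
    """Turn 'GLIBC_2.15' or '2.15' into a comparable tuple such as (2, 15)."""
    number = text.split("_", 1)[-1]
    return tuple(int(part) for part in number.split("."))

def missing_glibc_versions(required, host_version):
    """Return the required symbol versions that are newer than the host's glibc."""
    host = parse_version(host_version)
    return [name for name in required if parse_version(name) > host]

if __name__ == "__main__":
    # Versions named in the error message quoted in the post.
    required = ["GLIBC_2.15", "GLIBC_2.16"]
    lib, version = platform.libc_ver()  # empty strings if it cannot be determined
    if lib == "glibc" and version:
        print(missing_glibc_versions(required, version))
```

An empty list means the dynamically linked build should at least pass the symbol version check on that host; a static build sidesteps the check entirely.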

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 828 - Posted: 26 Jan 2017, 22:04:24 UTC - in response to Message 826.

I found a host that is not able to run the Linux x64 version because of missing shared libraries (http://gene.disi.unitn.it/test/show_host_detail.php?hostid=2990). Is the kernel too old (3.2.0-4-amd64)? The error is: version `GLIBC_2.15' not found, version `GLIBC_2.16' not found.

The Makefile doesn't link with 'g++ -static ...', which is the way I know of making a static executable (I checked the result with ldd and it works). I don't know whether this is a good solution, or whether the only way to solve this is to put a minimum kernel version in the plan class of the application.

BTW I made a static Linux x64 sse2 version of the application using the latest source code, if someone would like to play with it: http://gene.disi.unitn.it/test/files/tngrid_expansion_v11_linux64-static__sse2.tar.gz

Hints are welcome.

If ldd no longer shows these libs, it should be OK. I am a bit reluctant to do this, though: this particular kernel version was used by Debian Wheezy, which is now past its End of Life. This means there are no more updates for this release, and in particular no security updates for newly found holes. If we do not provide an app that works there, the user may be persuaded to upgrade to a newer release that will be supported for the next few years.

I played with the new app a bit, trying to optimize it further. It turned out that using AVX only for calculating square roots was slower than using SSE alone. I also tried using values from only one half of the matrix, but this slowed the app down too. So it does not make sense to apply either of these changes.

I also measured the run time of the app with SSE vectors on a Haswell CPU, compiled for different instruction sets:

SSE2  20,766
AVX   19,933
FMA   20,163
AVX2  20,355


It turned out that the AVX version is faster than SSE2, probably thanks to some SSE3+ or AVX instructions used in code automatically vectorized by gcc. So this app version should be provided by the project. To my surprise, the FMA app is slower than AVX, and I do not have a good explanation for this yet. The AVX2 version is slower too. It would be good if someone with a newer CPU such as Skylake could run some tests and post the results here; maybe it will work better on such CPUs. If not, the existing versions (SSE2, AVX, FMA) would be sufficient.
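
To put the four timings side by side, here is a small Python sketch that expresses each one as a percentage reduction relative to SSE2. The units are not stated in the post, and reading the comma as a decimal separator is an assumption; the percentages come out the same either way. The function name reduction_vs_baseline is made up for illustration.

```python
# Timings as posted (units not stated); the comma is read as a decimal point.
timings = {"SSE2": 20.766, "AVX": 19.933, "FMA": 20.163, "AVX2": 20.355}

def reduction_vs_baseline(timings, baseline="SSE2"):
    """Percentage of run time saved by each build relative to the baseline."""
    base = timings[baseline]
    return {name: 100.0 * (base - t) / base for name, t in timings.items()}

for name, pct in reduction_vs_baseline(timings).items():
    print(f"{name:5s} {pct:5.2f}% less time than SSE2")
```

With these numbers the AVX build comes out roughly 4% ahead of SSE2, FMA roughly 3% and AVX2 roughly 2%.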

I have uploaded new versions of AVX and AVX2 apps for Linux and Windows, feel free to download and run them.

Crystal Pellet
Joined: 1 Jan 17
Posts: 2
Credit: 1,247,672
RAC: 0
Netherlands
Message 829 - Posted: 27 Jan 2017, 15:01:26 UTC - in response to Message 828.

I have uploaded new versions of AVX and AVX2 apps for Linux and Windows, feel free to download and run them.

Because your Windows avx version from the 23rd of January was a bit slower than the sse2 one, I tried your newer avx version from yesterday.

Averages over 8 tasks running concurrently on my i7 2600:

elapsed 1:15:39 - cpu 1:14:34 efficiency 98,579% -- sse2
elapsed 1:13:29 - cpu 1:12:53 efficiency 99,186% -- avx
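
For anyone who wants to redo this comparison with their own numbers, a small Python sketch: efficiency is taken here as CPU time divided by elapsed time, which is an assumption about how the figures above were derived. The helper names are made up for illustration.

```python
def to_seconds(hms):
    """Convert 'h:mm:ss' to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def efficiency(elapsed, cpu):
    """CPU time as a percentage of elapsed (wall-clock) time."""
    return 100.0 * to_seconds(cpu) / to_seconds(elapsed)

def elapsed_reduction(old, new):
    """Percentage of elapsed time saved going from old to new."""
    return 100.0 * (to_seconds(old) - to_seconds(new)) / to_seconds(old)

# (elapsed, cpu) pairs as posted above.
runs = {"sse2": ("1:15:39", "1:14:34"), "avx": ("1:13:29", "1:12:53")}

for name, (elapsed, cpu) in runs.items():
    print(f"{name}: {efficiency(elapsed, cpu):.3f}% efficiency")
print(f"avx saves {elapsed_reduction('1:15:39', '1:13:29'):.2f}% elapsed time")
```

Recomputing from the rounded averages gives values close to, but not exactly, the posted efficiencies (about 98.57% and 99.18%), which would be expected if the posted figures were averaged per task. The elapsed-time saving of avx over sse2 works out to a little under 3%.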

Krümel
Joined: 31 Oct 16
Posts: 19
Credit: 14,099,551
RAC: 0
Germany
Message 830 - Posted: 27 Jan 2017, 17:12:08 UTC
Last modified: 27 Jan 2017, 17:49:27 UTC

i7 6700T @ 3 GHz, HTT on (8 WU at a time)

New AVX2-App: 34 Minutes
FMA-App: 33 Minutes
SSE2-App: 41 Minutes

Profile NxtGenCowboy
Joined: 26 Jan 17
Posts: 5
Credit: 432,072
RAC: 0
United States
Message 841 - Posted: 4 Feb 2017, 1:50:19 UTC - in response to Message 782.
Last modified: 4 Feb 2017, 2:43:40 UTC

Which version did you use for your 3770k?

Currently running SSE2 v1.1 on my i7 3770k @ 4.3 GHz.

Time remaining: 5 hours... I don't think that is correct.

Profile NxtGenCowboy
Joined: 26 Jan 17
Posts: 5
Credit: 432,072
RAC: 0
United States
Message 842 - Posted: 4 Feb 2017, 3:50:22 UTC - in response to Message 841.

Which version did you use for your 3770k?

Currently running SSE2 v1.1 on my i7 3770k @ 4.3 GHz.

Time remaining: 5 hours... I don't think that is correct.

59 minutes it ended up being.

The other variants (FMA and AVX2) crashed.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 843 - Posted: 4 Feb 2017, 10:02:14 UTC - in response to Message 842.
Last modified: 4 Feb 2017, 10:03:01 UTC

Which version did you use for your 3770k?

Currently running SSE2 v1.1 on my i7 3770k @ 4.3 GHz.

Time remaining: 5 hours... I don't think that is correct.

59 minutes it ended up being.

The other variants (FMA and AVX2) crashed.

Your CPU supports instruction sets up to AVX: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-3770K.html. It does not have FMA or AVX2, so those apps will crash on it. You can try the AVX version, which should work for you. You can also use CPU-Z to check this.
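
On Linux the same check can be made without extra tools by reading the flags line of /proc/cpuinfo. A minimal Python sketch, under the simplifying assumption that each build needs only the flag matching its name (sse2, avx, fma, avx2 are the flag names the kernel reports on x86); the function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flag names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

# Build name -> flag name as it appears in /proc/cpuinfo on x86 Linux.
REQUIRED = {"SSE2": "sse2", "AVX": "avx", "FMA": "fma", "AVX2": "avx2"}

def runnable_builds(cpuinfo_text):
    """List the builds whose instruction-set flag is present."""
    flags = cpu_flags(cpuinfo_text)
    return [build for build, flag in REQUIRED.items() if flag in flags]

if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(runnable_builds(f.read()))
```

A build that is missing from the printed list can be expected to crash with an illegal-instruction error on that machine.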

Dj Ninja
Joined: 3 Feb 17
Posts: 13
Credit: 1,013,889
RAC: 0
Germany
Message 846 - Posted: 4 Feb 2017, 17:57:47 UTC

I think he had better try the SSE2 version.

I have an i5-3570 which is nearly an i7-3770 without HT and your AVX (not AVX2) app crashes instantly on this machine.

Profile NxtGenCowboy
Joined: 26 Jan 17
Posts: 5
Credit: 432,072
RAC: 0
United States
Message 849 - Posted: 4 Feb 2017, 19:43:39 UTC
Last modified: 4 Feb 2017, 19:44:36 UTC

It's about 49-55 minutes per WU using SSE2 v1.1.

I haven't tried AVX yet.

However, I did just upgrade my server to 2 5670s, so I have to figure out which one to run there as well.

Profile KPX
Joined: 9 Dec 14
Posts: 4
Credit: 533,268
RAC: 0
Czech Republic
Message 1019 - Posted: 1 Apr 2017, 20:07:33 UTC

Why am I getting SSE2 work units for my AVX-capable CPUs? Isn't that a waste of resources?

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 2
Italy
Message 1020 - Posted: 2 Apr 2017, 10:28:10 UTC - in response to Message 1019.

Why am I getting SSE2 work units for my AVX-capable CPUs? Isn't that a waste of resources?

At the beginning the server will send both apps (sse, avx) and gather statistics. After some time, if there is a clear winner you will get just that one; if not, you will continue to get both.
This means that there is not a big difference between running sse and avx on your computer.

Jim1348
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1021 - Posted: 2 Apr 2017, 17:00:24 UTC

Is a GPU version still under consideration? I get the impression that it would work, with all the programming talent that Daniel (and others) bring to the project, but there may not be enough work to support it.

Where are we on that?

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 1022 - Posted: 2 Apr 2017, 18:34:34 UTC - in response to Message 1021.

Is a GPU version still under consideration? I get the impression that it would work, with all the programming talent that Daniel (and others) bring to the project, but there may not be enough work to support it.

Where are we on that?

Yes, I am still going to create it. But first I would like to release a new version of the CPU app, which is almost ready.

Jim1348
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1023 - Posted: 2 Apr 2017, 19:53:12 UTC - in response to Message 1022.

Outstanding, I will try the new CPU app on both Windows and Ubuntu as a baseline for the GPU app.

Profile KPX
Joined: 9 Dec 14
Posts: 4
Credit: 533,268
RAC: 0
Czech Republic
Message 1024 - Posted: 2 Apr 2017, 21:38:52 UTC - in response to Message 1020.

Why am I getting SSE2 work units for my AVX-capable CPUs? Isn't that a waste of resources?

At the beginning the server will send both apps (sse, avx) and gather statistics. After some time, if there is a clear winner you will get just that one; if not, you will continue to get both.
This means that there is not a big difference between running sse and avx on your computer.

Yes, that is what I thought. However, the reality is that my Core i7-4770K is getting sse2 units exclusively. Nothing else. No choice. In the recorded history of 456 units, it was sent an avx unit only once.
Well, whatever. I just thought that avx units should be faster on this CPU.

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 2
Italy
Message 1025 - Posted: 3 Apr 2017, 9:27:46 UTC - in response to Message 1024.

Why am I getting SSE2 work units for my AVX-capable CPUs? Isn't that a waste of resources?

At the beginning the server will send both apps (sse, avx) and gather statistics. After some time, if there is a clear winner you will get just that one; if not, you will continue to get both.
This means that there is not a big difference between running sse and avx on your computer.

Yes, that is what I thought. However, the reality is that my Core i7-4770K is getting sse2 units exclusively. Nothing else. No choice. In the recorded history of 456 units, it was sent an avx unit only once.
Well, whatever. I just thought that avx units should be faster on this CPU.

This may be something hidden inside the BOINC scheduler's decisions. I also have an i7-4770K running Windows (http://gene.disi.unitn.it/test/results.php?hostid=3241). It got some sse2 and avx work at the beginning; right now it gets only fma work (I have opted to accept beta work in my profile). If I remember correctly, BOINC will repeat the 'performance test' after some time.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 1031 - Posted: 9 Apr 2017, 5:59:33 UTC

New app version is ready! It is available at the same place as usual: https://bitbucket.org/sirzooro/pc-boinc/downloads/. To install it, follow these steps:
- finish or abort all existing tasks (they will be aborted automatically after install);
- stop BOINC;
- unpack the selected version into the project's directory (a path like C:\Users\All Users\BOINC\projects\gene.disi.unitn.it_test\ on Windows, and /var/lib/boinc-client/projects/gene.disi.unitn.it_test on Linux);
- start BOINC again.
After doing this, the app name should change to "Gene Network Application (Opti v1.2)". You should also see the message "Found app_info.xml; using anonymous platform" in the event log for the TN-Grid project.

This time I used Gray code (not Grey!) to optimize the app. This code is a number sequence with a special property: every two consecutive numbers differ by one bit only. This concept can be generalized in various ways. One of them is Gray code combinations, where every two consecutive subsets differ by one element only. Here is an example of the 3-combinations of a 5-element set, generated in Gray code order:

1 2 3
1 2 4
1 3 4
2 3 4
2 3 5
1 3 5
1 2 5
1 4 5
2 4 5
3 4 5
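
Both properties are easy to check mechanically. A short Python sketch: gray() is the usual binary reflected Gray code formula, and the second half verifies that the ten combinations listed above really do change by exactly one element at each step. The helper names are made up for illustration and are not taken from the app's source.

```python
def gray(i):
    """i-th value of the binary reflected Gray code."""
    return i ^ (i >> 1)

def differ_by_one_bit(a, b):
    """True if a and b differ in exactly one bit position."""
    return bin(a ^ b).count("1") == 1

codes = [gray(i) for i in range(8)]  # 3-bit example
assert all(differ_by_one_bit(a, b) for a, b in zip(codes, codes[1:]))

# The 3-combinations of {1..5} in the order given in the post.
listed = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (2, 3, 5),
          (1, 3, 5), (1, 2, 5), (1, 4, 5), (2, 4, 5), (3, 4, 5)]

def one_element_swapped(a, b):
    """True if b is a with exactly one element replaced by another."""
    return len(set(a) ^ set(b)) == 2

assert all(one_element_swapped(a, b) for a, b in zip(listed, listed[1:]))
```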


The TN-Grid Gene app uses a combinations generator, so I decided to replace it with a Gray code combinations generator and exploit its special property to recalculate only the values that depend on the changed element. By doing so I reduced the total calculation time. The savings depend on the maximum L value and increase with it:
- some old organism stored as "test" data, max L=8: time reduced from 0.559s to 0.534s (4.4%);
- current organism (VV), max L=12: time reduced from 2.092s to 1.815s (13.2%);
- other old organism stored as "test2" data (it was probably ECM), max L=18: time reduced from 14.401s to 9.254s (35.7%).
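
The idea of recalculating only what depends on the changed element can be sketched in a few lines of Python. Two caveats: the recursion below is the textbook "revolving door" construction, which is not necessarily the algorithm the app uses and produces a different (but still one-swap-per-step) order than the listing in this post; and the running sum is only a stand-in for whatever per-combination quantity the app actually computes. All names are made up for illustration.

```python
def revolving_door(n, k):
    """All k-subsets of {1..n} as sorted tuples; neighbours differ by one swap."""
    if k == 0:
        return [()]
    if k > n:
        return []
    without_n = revolving_door(n - 1, k)
    # Reversing the sub-list is what keeps the change at the seam to one element.
    with_n = [c + (n,) for c in reversed(revolving_door(n - 1, k - 1))]
    return without_n + with_n

def incremental_sums(values, k):
    """Yield (combination, sum of the chosen values), updating the sum per swap."""
    combos = revolving_door(len(values), k)
    if not combos:
        return
    current = combos[0]
    total = sum(values[i - 1] for i in current)  # full computation only once
    yield current, total
    for nxt in combos[1:]:
        (removed,) = set(current) - set(nxt)  # exactly one element leaves
        (added,) = set(nxt) - set(current)    # exactly one element enters
        total += values[added - 1] - values[removed - 1]
        current = nxt
        yield current, total

if __name__ == "__main__":
    for combo, total in incremental_sums([3, 1, 4, 1, 5], 3):
        print(combo, total)
```

A production generator would report the swapped pair directly instead of recovering it from set differences; the sketch favours brevity. The larger the subsets, the more work each one-element update saves compared with recomputing from scratch, which fits the observation that the savings grow with L.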

If you are interested in algorithm details, you can check "Combinatorial Generation" by Frank Ruskey (page 129, algorithm 5.8), available at http://www.1stworks.com/ref/ruskeycombgen.pdf.

The new app also checks whether the CPU supports the required instruction set, and will exit with an error message like "AVX instructions are not supported by your CPU!" if it does not.

Jim1348
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1033 - Posted: 9 Apr 2017, 17:36:37 UTC - in response to Message 1031.

New app version is ready! It is available at the same place as usual: https://bitbucket.org/sirzooro/pc-boinc/downloads/. To install it, follow these steps:
- finish or abort all existing tasks (they will be aborted automatically after install);
- stop BOINC;
- unpack the selected version into the project's directory (a path like C:\Users\All Users\BOINC\projects\gene.disi.unitn.it_test\ on Windows, and /var/lib/boinc-client/projects/gene.disi.unitn.it_test on Linux);
- start BOINC again.
After doing this, the app name should change to "Gene Network Application (Opti v1.2)". You should also see the message "Found app_info.xml; using anonymous platform" in the event log for the TN-Grid project.

I did all that, using the SSE2 version for Linux on my i7-4770, and got that message on reboot. But I am getting only errors.
http://gene.disi.unitn.it/test/results.php?hostid=6148


Copyright © 2024 CNR-TN & UniTN