Optimization

valterc Project administrator Project tester Send message Joined: 30 Oct 13 Posts: 632 Credit: 34,744,744 RAC: 0	Message 823 - Posted: 25 Jan 2017, 17:28:38 UTC - in response to Message 822. Last modified: 25 Jan 2017, 17:29:31 UTC
	"Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)."
	"And might we see another doubling of the 'production' WUs to 200 'blocks' (in the not too distant future), given that Daniel has enabled us, once again, to process twice as many WUs as we were doing with the last release of his optimised app? Perhaps I shouldn't be putting such ideas into your head! ;-)"
	Well, I already had the intention of doing this... The 'problem' is that the size of the output file is almost the same regardless of the number of blocks, so there is no reason to have very short workunits (they just put more stuff into the database and cause more network traffic). About timing: I will wait until the beginning of February to deploy the new apps and also increase the workunit size. Before doing this, I also want to deploy a small batch of workunits related to another organism, just to check that everything is working well.

Beyond Send message Joined: 2 Nov 16 Posts: 50 Credit: 44,372,499 RAC: 0	Message 824 - Posted: 25 Jan 2017, 18:17:32 UTC - in response to Message 808.
	"Impressive work. Thanks! It looks like the sse2+fma is only 0.659% faster. Is that even worth having another version?"
	"I am going to modify the code a bit to use AVX for div/sqrt calculations and SSE for the rest. This should improve performance a bit, so in the end it should be a bit faster than this SSE+FMA version. We will see how much faster it is when I have it ready. Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny."
	Looking at the user reporting for his AMD X8, his results show that the new fma app is actually running around 11% faster than the sse2 version. This is also what I'm seeing on my four AMD X8 CPUs. A useful increase. Once again, THANKS!

valterc Project administrator Project tester Send message Joined: 30 Oct 13 Posts: 632 Credit: 34,744,744 RAC: 0	Message 826 - Posted: 26 Jan 2017, 19:04:21 UTC - in response to Message 821.
	I found a host that is not able to run the Linux x64 version because of missing shared libraries (http://gene.disi.unitn.it/test/show_host_detail.php?hostid=2990). Is the kernel too old (3.2.0-4-amd64)? The error is: version `GLIBC_2.15' not found, version `GLIBC_2.16' not found. The Makefile doesn't link with 'g++ -static ...', which is the way I know of making a static executable (I checked this with ldd and it works). I don't know if this is a good solution, or whether the only way to solve this is to put a minimum kernel version inside the plan class of the application. BTW, I made a static Linux x64 sse2 version of the application using the latest source code, if someone would like to play with it: http://gene.disi.unitn.it/test/files/tngrid_expansion_v11_linux64-static__sse2.tar.gz Hints are welcome.
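The error text in the message above names glibc symbol versions (GLIBC_2.15, GLIBC_2.16), so one way to diagnose such a host is to compare the C library version it runs with the versions the binary asks for. The following is a minimal sketch of that comparison, not something used by the project. The helper names are hypothetical, and the "glibc 2.13" string in the usage note is only an illustrative value for an older system.

```python
import os
import re

# Symbol versions named in the loader error quoted in the post.
MISSING_VERSIONS = ["GLIBC_2.15", "GLIBC_2.16"]


def parse_glibc_version(text):
    """Extract (major, minor) from strings such as 'glibc 2.13' or 'GLIBC_2.16'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return int(match.group(1)), int(match.group(2))


def glibc_satisfies(runtime, required):
    """True when the runtime glibc is at least as new as the required version."""
    return parse_glibc_version(runtime) >= parse_glibc_version(required)


def runtime_glibc():
    """Return a string like 'glibc 2.31', or None when it cannot be queried."""
    try:
        return os.confstr("CS_GNU_LIBC_VERSION")
    except (ValueError, OSError, AttributeError):
        # Not a glibc system, or confstr is not available on this platform.
        return None


if __name__ == "__main__":
    runtime = runtime_glibc()
    if runtime is None:
        print("glibc version not available on this system")
    else:
        for required in MISSING_VERSIONS:
            verdict = "ok" if glibc_satisfies(runtime, required) else "too old"
            print(f"{runtime} vs {required}: {verdict}")
```

For example, glibc_satisfies("glibc 2.13", "GLIBC_2.16") returns False. Versions are compared as integer pairs rather than as strings, so 2.9 is correctly treated as older than 2.16.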

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 828 - Posted: 26 Jan 2017, 22:04:24 UTC - in response to Message 826.
	If ldd no longer shows these libs, it should be OK. However, I am a bit reluctant about doing this: this particular kernel version was used by Debian Wheezy, which is now past its End of Life. This means that there are no new updates for this system version, and in particular no security updates for new security holes. If we do not provide an app which works there, the user may be convinced to upgrade the system to a newer version which will be supported for the next few years.
	I played with the new app a bit, trying to optimize it more. It turned out that using AVX only for calculating square roots was slower than using SSE only. I also tried to use values from one half of the matrix only, but this slowed down the app too. So it does not make sense to apply any of these changes.
	I also measured the run time of the app with SSE vectors on a Haswell CPU, compiled with different instruction sets:
	SSE2: 20,766
	AVX: 19,933
	FMA: 20,163
	AVX2: 20,355
	It turned out that the AVX version is faster than SSE2, probably thanks to some SSE3+ instructions or to AVX used in code automatically vectorized by gcc. So this app version should be provided by the project.
	The FMA app is, to my surprise, slower than AVX, and I do not have a good explanation for this now. The AVX2 version is also slower. It would be good if someone with a newer CPU like Skylake could perform some tests and post the results here; maybe it will work better on such CPUs. If not, the existing versions (SSE2, AVX, FMA) would be sufficient.
	I have uploaded new versions of the AVX and AVX2 apps for Linux and Windows; feel free to download and run them.
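The Haswell timings in the message above are easier to compare as relative gains over the SSE2 build. The sketch below does that arithmetic. It assumes the comma in the posted numbers is a decimal separator. The post gives no unit, but the ratios do not depend on one. The function name is hypothetical.

```python
# Timings copied from the post, reading the comma as a decimal separator
# (an assumption; the unit is not stated in the post).
TIMINGS = {"SSE2": 20.766, "AVX": 19.933, "FMA": 20.163, "AVX2": 20.355}


def percent_faster(baseline, candidate):
    """Run-time reduction of candidate relative to baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    baseline = TIMINGS["SSE2"]
    for name, value in TIMINGS.items():
        if name != "SSE2":
            print(f"{name}: {percent_faster(baseline, value):.2f}% less time than SSE2")
```

With these inputs the AVX build needs roughly 4.0% less time than SSE2, FMA roughly 2.9% less, and AVX2 roughly 2.0% less, which matches the ordering described in the post.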

Crystal Pellet Send message Joined: 1 Jan 17 Posts: 2 Credit: 1,247,672 RAC: 0	Message 829 - Posted: 27 Jan 2017, 15:01:26 UTC - in response to Message 828.
	"I have uploaded new versions of AVX and AVX2 apps for Linux and Windows, feel free to download and run them."
	Because your Windows avx version of 23 January was a bit slower than the sse2 one, I tried your newer avx version from yesterday. Average numbers for 8 tasks running concurrently on my i7 2600:
	sse2: elapsed 1:15:39, cpu 1:14:34, efficiency 98,579%
	avx: elapsed 1:13:29, cpu 1:12:53, efficiency 99,186%

Krümel Send message Joined: 31 Oct 16 Posts: 22 Credit: 14,099,551 RAC: 0	Message 830 - Posted: 27 Jan 2017, 17:12:08 UTC Last modified: 27 Jan 2017, 17:49:27 UTC
	i7 6700T @ 3 GHz, HTT on (8 WUs at a time):
	New AVX2 app: 34 minutes
	FMA app: 33 minutes
	SSE2 app: 41 minutes

NxtGenCowboy Send message Joined: 26 Jan 17 Posts: 5 Credit: 432,072 RAC: 0	Message 841 - Posted: 4 Feb 2017, 1:50:19 UTC - in response to Message 782. Last modified: 4 Feb 2017, 2:43:40 UTC
	Which version did you use for your 3770K? I am currently running SSE2 v1.1 on my i7 3770K @ 4.3 GHz. Time remaining: 5 hours... I don't think that is correct.

NxtGenCowboy Send message Joined: 26 Jan 17 Posts: 5 Credit: 432,072 RAC: 0	Message 842 - Posted: 4 Feb 2017, 3:50:22 UTC - in response to Message 841.
	It ended up being 59 minutes. The other variations (FMA/AVX2) crashed.

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 843 - Posted: 4 Feb 2017, 10:02:14 UTC - in response to Message 842. Last modified: 4 Feb 2017, 10:03:01 UTC
	Your CPU supports instructions up to AVX: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-3770K.html. It does not have FMA or AVX2, so these apps will crash there. You can try the AVX version; it should work for you. You can also use CPU-Z to check this.
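On Linux, the check suggested above can be approximated without CPU-Z by reading the feature flags that the kernel lists in /proc/cpuinfo. The sketch below is an illustration only. The mapping from build name to a single flag is an assumption, since the real builds may need more than one feature, and SAMPLE_CPUINFO is a shortened, made-up excerpt rather than output from any machine in this thread.

```python
# Shortened, made-up /proc/cpuinfo excerpt, used when the real file is absent.
SAMPLE_CPUINFO = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 avx\n"
)

# Build names from the thread, ordered from least to most demanding.
# Assumption: each build is gated by the kernel flag of the same name.
BUILDS = ("sse2", "avx", "fma", "avx2")


def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in the given text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def runnable_builds(flags, builds=BUILDS):
    """Keep only the builds whose flag the CPU reports."""
    return [build for build in builds if build in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE_CPUINFO
    print("builds this CPU reports support for:", runnable_builds(cpu_flags(text)))
```

For the sample excerpt, runnable_builds(cpu_flags(SAMPLE_CPUINFO)) gives ["sse2", "avx"]. The "flags" key is what x86 Linux kernels use. Other architectures label this line differently, in which case the function returns an empty set.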

Dj Ninja Send message Joined: 3 Feb 17 Posts: 13 Credit: 1,013,889 RAC: 0	Message 846 - Posted: 4 Feb 2017, 17:57:47 UTC
	I think he had better try the SSE2 version. I have an i5-3570, which is nearly an i7-3770 without HT, and your AVX (not AVX2) app crashes instantly on this machine.

NxtGenCowboy Send message Joined: 26 Jan 17 Posts: 5 Credit: 432,072 RAC: 0	Message 849 - Posted: 4 Feb 2017, 19:43:39 UTC Last modified: 4 Feb 2017, 19:44:36 UTC
	It's about 49-55 minutes per WU using SSE2 v1.1. I haven't tried AVX yet. However, I did just upgrade my server to 2 5670s; gotta figure out which one to run there as well.

KPX Send message Joined: 9 Dec 14 Posts: 4 Credit: 533,268 RAC: 0	Message 1019 - Posted: 1 Apr 2017, 20:07:33 UTC
	Why am I getting SSE2 work units for my AVX-capable CPUs? Isn't that a waste of resources?

valterc Project administrator Project tester Send message Joined: 30 Oct 13 Posts: 632 Credit: 34,744,744 RAC: 0	Message 1020 - Posted: 2 Apr 2017, 10:28:10 UTC - in response to Message 1019.
	At the beginning the server will send both apps (sse, avx), gathering statistics. After some time, if there is a clear winner you will just get that; if not, you will continue to get both. This means that there is not a big difference between running sse and avx on your computer.

Jim1348 Send message Joined: 29 Dec 16 Posts: 87 Credit: 21,013,002 RAC: 0	Message 1021 - Posted: 2 Apr 2017, 17:00:24 UTC
	Is a GPU version still under consideration? I get the impression that it would work, with all the programming talent that Daniel (and others) bring to the project, but there may not be enough work to support it. Where are we on that?

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 1022 - Posted: 2 Apr 2017, 18:34:34 UTC - in response to Message 1021.
	Yes, I am still going to create it. But first I would like to release a new version of the CPU app; it is almost ready.

Jim1348 Send message Joined: 29 Dec 16 Posts: 87 Credit: 21,013,002 RAC: 0	Message 1023 - Posted: 2 Apr 2017, 19:53:12 UTC - in response to Message 1022.
	Outstanding, I will try the new CPU app on both Windows and Ubuntu as a baseline for the GPU app.

KPX Send message Joined: 9 Dec 14 Posts: 4 Credit: 533,268 RAC: 0	Message 1024 - Posted: 2 Apr 2017, 21:38:52 UTC - in response to Message 1020.
	Yes, that is what I thought. However, the reality is that my Core i7-4770K is getting exclusively sse2 units. Nothing else. No choice. In the recorded history of 456 units, it was sent an avx unit only once. Well, whatever. I just thought that avx units should be faster on this CPU.

valterc Project administrator Project tester Send message Joined: 30 Oct 13 Posts: 632 Credit: 34,744,744 RAC: 0	Message 1025 - Posted: 3 Apr 2017, 9:27:46 UTC - in response to Message 1024.
	This may be something hidden inside the BOINC scheduler's decisions. I also have one i7-4770K running Windows (http://gene.disi.unitn.it/test/results.php?hostid=3241). It got some sse2 and avx work at the beginning; right now it gets only fma work (having opted to accept beta work in my profile). If I remember correctly, BOINC will repeat the 'performance test' after some time.

[B@P] Daniel Volunteer developer Send message Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 1031 - Posted: 9 Apr 2017, 5:59:33 UTC
	New app version is ready! It is available at the same place as usual: https://bitbucket.org/sirzooro/pc-boinc/downloads/. To install it, do the following steps:
	- finish or abort all existing tasks (they will be aborted automatically after install);
	- stop BOINC;
	- unpack the selected version to the project's directory (a path like C:\Users\All Users\BOINC\projects\gene.disi.unitn.it_test\ on Windows, and /var/lib/boinc-client/projects/gene.disi.unitn.it_test on Linux);
	- start BOINC again.
	After doing this, the app name should change to "Gene Network Application (Opti v1.2)". You should also see the message "Found app_info.xml; using anonymous platform" in the event log for the TN-Grid project.
	This time I used Gray code (not Grey!) to optimize the app. This code is a number sequence with a special property: every two consecutive numbers differ by one bit only. This concept can be generalized in various ways. One of them is Gray code combinations, where every two consecutive subsets differ by one element only. Here is an example of the 3-combinations of a 5-element set, generated in Gray code order:
	1 2 3
	1 2 4
	1 3 4
	2 3 4
	2 3 5
	1 3 5
	1 2 5
	1 4 5
	2 4 5
	3 4 5
	The TN-Grid Gene app uses a combinations generator, so I decided to replace it with the new Gray code combinations and exploit their special property to recalculate only the values which depend on the changed element. By doing so I reduced the total calculation time. The savings depend on the maximum L value and increase with it:
	- some old organism stored as "test" data, max L=8: time reduced from 0.559s to 0.534s (4.4%);
	- current organism (VV), max L=12: time reduced from 2.092s to 1.815s (13.2%);
	- other old organism stored as "test2" data (it was probably ECM), max L=18: time reduced from 14.401s to 9.254s (35.7%).
	If you are interested in the algorithm details, you can check "Combinatorial Generation" by Frank Ruskey (page 129, algorithm 5.8), available at http://www.1stworks.com/ref/ruskeycombgen.pdf.
	The new app also checks whether the CPU supports the required instruction set, and will exit with an error message like "AVX instructions are not supported by your CPU!" if it does not.
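The Gray code idea in the message above can be sketched in a few lines. The recursive generator below was chosen because it reproduces the ten 3-combinations of a 5-element set exactly as listed in the post. Whether it is the same procedure as the one in the app, or as the algorithm cited from Ruskey's book, is not confirmed here. The incremental part is a toy stand-in for the app's real calculation: it keeps a running sum of hypothetical per-element weights and touches only the element that left and the element that entered.

```python
def gray_combinations(n, k):
    """Return the k-subsets of {1..n} as sorted tuples, in an order where
    consecutive subsets differ by exactly one element."""
    if k < 0 or k > n:
        return []
    if k == 0:
        return [()]
    if k == n:
        return [tuple(range(1, n + 1))]
    # Subsets that do not contain n.
    without_n = gray_combinations(n - 1, k)
    # Subsets that contain n but not n - 1, taken in reverse order.
    with_n_only = [c + (n,) for c in reversed(gray_combinations(n - 2, k - 1))]
    # Subsets that contain both n - 1 and n.
    with_both = [c + (n - 1, n) for c in gray_combinations(n - 2, k - 2)]
    return without_n + with_n_only + with_both


def swapped_elements(prev, curr):
    """Return (removed, added); raises ValueError unless exactly one
    element left the subset and exactly one entered it."""
    (removed,) = set(prev) - set(curr)
    (added,) = set(curr) - set(prev)
    return removed, added


def incremental_sums(weights, n, k):
    """Sum the weights of every k-subset (requires 0 <= k <= n), updating the
    previous total instead of recomputing it from scratch at each step."""
    combos = gray_combinations(n, k)
    total = sum(weights[i] for i in combos[0])
    sums = [total]
    for prev, curr in zip(combos, combos[1:]):
        removed, added = swapped_elements(prev, curr)
        total += weights[added] - weights[removed]  # only the change is touched
        sums.append(total)
    return combos, sums


if __name__ == "__main__":
    weights = {i: i * i for i in range(1, 6)}  # hypothetical weights
    combos, sums = incremental_sums(weights, 5, 3)
    for combo, value in zip(combos, sums):
        print(*combo, "->", value)
```

Each step costs one subtraction and one addition instead of a full k-term sum. That is the kind of saving the post describes, and it grows with the subset size, consistent with the larger gains reported for larger L.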

Jim1348 Send message Joined: 29 Dec 16 Posts: 87 Credit: 21,013,002 RAC: 0	Message 1033 - Posted: 9 Apr 2017, 17:36:37 UTC - in response to Message 1031.
	I did all that, using the SSE2 version for Linux on my i7-4770, and got that message on reboot. But I am getting only errors. http://gene.disi.unitn.it/test/results.php?hostid=6148
