Optimization
Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 635 - Posted: 12 Dec 2016, 15:00:46 UTC

Sounds interesting. PrimeGrid already does something like this: they have two kinds of bonuses. One is for apps which take a long time to complete (a few days); the second is a conjecture bonus. Some tasks have both of these bonuses.

BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I think I will be able to release its code and compiled binaries to everyone before Christmas; I am still working on it.
____________

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 636 - Posted: 12 Dec 2016, 15:26:34 UTC - in response to Message 635.

BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I think I will be able to release its code and compiled binaries to everyone before Christmas; I am still working on it.

VERY interesting.
The problem is: are the results of the optimized version "good" for the admin team?
If Valterc or others say that it's OK, we will be ready to crunch!! :-)

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 637 - Posted: 12 Dec 2016, 15:59:33 UTC - in response to Message 635.
Last modified: 12 Dec 2016, 16:05:42 UTC

Sounds interesting. PrimeGrid already does something like this: they have two kinds of bonuses. One is for apps which take a long time to complete (a few days); the second is a conjecture bonus. Some tasks have both of these bonuses.

Yes, I know; the problem is that they don't use CreditNew.

BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I think I will be able to release its code and compiled binaries to everyone before Christmas; I am still working on it.

Are you doing the work on the Windows platform? (We know that the binary we have for Windows, made with Visual C, is way slower than the Linux one.) Anyway, we would be very happy to have a faster version. What I suggest is to keep the new application inside the "anonymous platform" framework for a while. If it validates against the standard version, everything should be fine.

Thanks for your collaboration!

[edit] If working on the Linux version, please take care of the kernel requirements, i.e. use a 3.0 kernel to make the build.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 638 - Posted: 12 Dec 2016, 16:21:21 UTC - in response to Message 637.
Last modified: 12 Dec 2016, 16:39:55 UTC


BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I think I will be able to release its code and compiled binaries to everyone before Christmas; I am still working on it.

Are you doing the work on the Windows platform? (We know that the binary we have for Windows, made with Visual C, is way slower than the Linux one.) Anyway, we would be very happy to have a faster version. What I suggest is to keep the new application inside the "anonymous platform" framework for a while. If it validates against the standard version, everything should be fine.

Thanks for your collaboration!

I am developing it on Windows under Cygwin, but the final version will be for Linux. I plan to compile the Windows version as a standalone app using MinGW and run it via the BOINC wrapper. The code version without SSE/AVX support should compile under MSVC, but the SSE/AVX one most probably will not: MSVC probably does not support the gcc vector extensions which I use for SSE/AVX. I already have some classes which provide a similar API with the help of Intel intrinsics, but this part will need more work.
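
To illustrate what "gcc vector extensions" means here, the following is a minimal sketch and not code from the optimized app. The type name v4d and the helper sub_mul are hypothetical, and the a - b*c expression is only an example workload. The vector_size attribute is accepted by GCC and Clang; MSVC does not accept it, which is why an intrinsics-based wrapper would be needed there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four doubles handled as one value. GCC and Clang lower operations on this
// type to AVX, SSE2 or scalar code depending on the -m flags in use.
typedef double v4d __attribute__((vector_size(32)));

// Hypothetical helper: out[k] = a[k] - b[k] * c[k] for k in [0, n).
void sub_mul(const double* a, const double* b, const double* c,
             double* out, std::size_t n) {
    assert(a && b && c && out);
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        v4d va, vb, vc;
        // memcpy expresses an unaligned load without aliasing problems.
        std::memcpy(&va, a + k, sizeof va);
        std::memcpy(&vb, b + k, sizeof vb);
        std::memcpy(&vc, c + k, sizeof vc);
        v4d vr = va - vb * vc;  // element-wise operators on vector types
        std::memcpy(out + k, &vr, sizeof vr);
    }
    for (; k < n; ++k)          // scalar tail for the remaining elements
        out[k] = a[k] - b[k] * c[k];
}
```

The same source can be built with different -m flags (for example -msse2, -mavx or -mfma) to produce different binaries, and the compiler picks the instructions.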

I have eliminated the two top bottlenecks in the code: testAndRemove is reduced to isnan and a simple range check plus the existing code for edge removal; BoincFile::getLine works on big blocks of data, so file loading time is reduced from 10+ seconds per tile to less than 0.2 seconds. I wonder if the latter was even slower on Windows. I also read that the atof function is very slow on Windows, so you may need to find some faster replacement. I also wonder if MSVC is able to optimize pow(x,2) to x*x like gcc does. We will see how my version performs when you compile it using MSVC.
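
The block-based line reading idea can be sketched as follows. This is an assumption-laden illustration, not the actual BoincFile::getLine: the class BlockLineReader, the helper readAllLines and the 1 MiB default block size are all hypothetical, and a plain std::istream stands in for the BOINC file handle.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a block-based line reader; not the BoincFile API.
class BlockLineReader {
public:
    explicit BlockLineReader(std::istream& in, std::size_t block_size = 1 << 20)
        : in_(in), buf_(block_size), pos_(0), len_(0) {
        assert(block_size > 0);
    }

    // Stores the next line (without '\n') in `line`; returns false at end of input.
    bool getLine(std::string& line) {
        line.clear();
        bool got_any = false;
        for (;;) {
            if (pos_ == len_ && !refill())
                return got_any;  // the last line may lack a trailing '\n'
            got_any = true;
            std::size_t start = pos_;
            while (pos_ < len_ && buf_[pos_] != '\n') ++pos_;
            line.append(&buf_[start], pos_ - start);
            if (pos_ < len_) {   // stopped on '\n': consume it and finish
                ++pos_;
                return true;
            }
        }
    }

private:
    // Reads the next block with a single call instead of one call per character.
    bool refill() {
        in_.read(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        len_ = static_cast<std::size_t>(in_.gcount());
        pos_ = 0;
        return len_ > 0;
    }

    std::istream& in_;
    std::vector<char> buf_;
    std::size_t pos_;
    std::size_t len_;
};

// Convenience for experiments: split an in-memory string using a given block size.
std::vector<std::string> readAllLines(const std::string& text, std::size_t block_size) {
    std::istringstream in(text);
    BlockLineReader reader(in, block_size);
    std::vector<std::string> lines;
    std::string line;
    while (reader.getLine(line)) lines.push_back(line);
    return lines;
}
```

Lines that straddle a block boundary are stitched together by the loop in getLine, so the block size affects only speed, not the result.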
____________

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 639 - Posted: 12 Dec 2016, 17:08:41 UTC - in response to Message 638.

Just some naive questions (I'm not really an expert in cross-compiling): Would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)? Why drop the BOINC API and use the application with a wrapper?

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 640 - Posted: 12 Dec 2016, 17:19:22 UTC - in response to Message 639.

[edit] If working on the Linux version, please take care of the kernel requirements, i.e. use a 3.0 kernel to make the build.

Why this particular version? I use CentOS 7 with kernel 3.10; is that OK too?

Just some naive questions (I'm not really an expert in cross-compiling): Would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)?

Yes, assuming that you have such a cross-compiler and were able to build the BOINC libs for the MinGW target (built using the cross-compiler or natively on Windows).

Why drop the BOINC API and use the application with a wrapper?

Compilation of the BOINC libraries with the MinGW compiler under Cygwin is broken. I tried to fix it, but this was taking too much time, so I decided to give up and stub the BOINC functions used by the TN-Grid app to get a standalone app.

Maybe it would be possible to compile BOINC from the MinGW shell; I did not try to do it.
____________

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 641 - Posted: 12 Dec 2016, 17:36:42 UTC - in response to Message 640.
Last modified: 12 Dec 2016, 17:39:49 UTC

Why this particular version? I use CentOS 7 with kernel 3.10; is that OK too?

I guess it is. There are only a few users with pre-3.1 kernels (sometimes this may cause problems because of the old gcc shared libraries). We built the binary this way using shared libraries; making it static will probably solve all the possible problems.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 643 - Posted: 12 Dec 2016, 22:12:25 UTC - in response to Message 640.

Just some naive questions (I'm not really an expert in cross-compiling): Would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)?

Yes, assuming that you have such a cross-compiler and were able to build the BOINC libs for the MinGW target (built using the cross-compiler or natively on Windows).

I have just spotted that I misunderstood your question and answered it incorrectly. Both MinGW and Cygwin run on Windows, so there is no Linux->Windows cross-compilation. Cygwin is a full-blown POSIX environment for Windows. Apps built for it require a special library which provides the necessary POSIX emulation layer. MinGW is "Minimalist GNU for Windows"; it allows building native Windows apps in a Unix-like development environment.
____________

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 646 - Posted: 13 Dec 2016, 11:06:28 UTC - in response to Message 643.
Last modified: 13 Dec 2016, 11:07:17 UTC

I have just spotted that I misunderstood your question and answered it incorrectly. Both MinGW and Cygwin run on Windows, so there is no Linux->Windows cross-compilation. Cygwin is a full-blown POSIX environment for Windows. Apps built for it require a special library which provides the necessary POSIX emulation layer. MinGW is "Minimalist GNU for Windows"; it allows building native Windows apps in a Unix-like development environment.

Actually I was thinking about the opposite direction, i.e. doing all the testing on Linux and then making the Windows exe there (see http://www.mingw.org/wiki/linuxcrossmingw). I was able to make 32/64-bit Windows exes (hello world) using it, but I didn't try to cross-compile the BOINC API.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 648 - Posted: 14 Dec 2016, 9:07:58 UTC - in response to Message 646.
Last modified: 14 Dec 2016, 9:43:11 UTC

Actually I was thinking about the opposite direction, i.e. doing all the testing on Linux and then making the Windows exe there (see http://www.mingw.org/wiki/linuxcrossmingw). I was able to make 32/64-bit Windows exes (hello world) using it, but I didn't try to cross-compile the BOINC API.

It is worth trying. Yesterday I found a page on how to compile BOINC for Windows, https://boinc.berkeley.edu/trac/wiki/CompileAppWin, and it says that there is a special Makefile created for MinGW. So it looks like the BOINC team decided in the past to do it this way instead of fixing the autoconf scripts. No wonder my attempts to compile it failed: I tried to use the configure script and the Makefiles generated by it. I have not tried this special Makefile yet; I hope it will work as expected.

I found something that may be a problem. You use an undirected graph, so I thought that I could reduce the number of iterations of the loop at pc.cpp:418 by testing (i,j) pairs for j > i only. However, after doing this the output file size changed from 47.8K to 67.6K. The original code, before my changes, also generated the bigger file after applying this change. I checked the code briefly and do not see anything obvious that may cause this. Could you take a look at this?
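
One possible explanation, stated only as an assumption about pc.cpp and not confirmed by the authors: in the textbook PC skeleton search, the conditioning sets for the ordered pair (i,j) are drawn from the neighbours of i other than j. Under that scheme (i,j) and (j,i) are different tests even on an undirected graph, so restricting the loop to j > i would skip some tests and could leave more edges in the output. The sketch below counts the scheduled tests at level l = 1 for a three-node chain; the Adjacency layout and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Symmetric adjacency matrix; a hypothetical layout, not the one used in pc.cpp.
typedef std::vector<std::vector<bool> > Adjacency;

// Neighbours of i other than j: the pool from which the textbook PC algorithm
// draws conditioning sets when it tests the ordered pair (i, j).
std::vector<std::size_t> conditioningPool(const Adjacency& g,
                                          std::size_t i, std::size_t j) {
    assert(i < g.size() && j < g.size());
    std::vector<std::size_t> pool;
    for (std::size_t k = 0; k < g.size(); ++k)
        if (k != i && k != j && g[i][k]) pool.push_back(k);
    return pool;
}

// Number of tests with one conditioning variable scheduled over adjacent pairs.
std::size_t countLevelOneTests(const Adjacency& g, bool upperTriangleOnly) {
    std::size_t tests = 0;
    for (std::size_t i = 0; i < g.size(); ++i) {
        for (std::size_t j = upperTriangleOnly ? i + 1 : 0; j < g.size(); ++j) {
            if (i == j || !g[i][j]) continue;
            tests += conditioningPool(g, i, j).size();
        }
    }
    return tests;
}

// Chain 0 - 1 - 2 used as the example graph.
Adjacency chainGraph() {
    Adjacency g(3, std::vector<bool>(3, false));
    g[0][1] = true;
    g[1][0] = true;
    g[1][2] = true;
    g[2][1] = true;
    return g;
}
```

For the chain, the pool for (0,1) is empty while the pool for (1,0) contains node 2, so the test of edge 0-1 conditioned on node 2 is scheduled only when the pair (1,0) is visited.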

I think that the current app version could be converted to a GPU version quite easily. It seems that for a given l value the calculations for each (i,j) pair are independent of each other; only the final graph edge removal would need special attention. I will try to create a prototype after I finish this AVX app.

BTW, this topic was created for discussing a different thing. Could you move the posts related to my app to a new topic, or rename this one and create a new one for your original question?
____________

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 650 - Posted: 14 Dec 2016, 10:47:18 UTC - in response to Message 648.

I found something that may be a problem. You use an undirected graph, so I thought that I could reduce the number of iterations of the loop at pc.cpp:418 by testing (i,j) pairs for j > i only. However, after doing this the output file size changed from 47.8K to 67.6K. The original code, before my changes, also generated the bigger file after applying this change. I checked the code briefly and do not see anything obvious that may cause this. Could you take a look at this?

I'll try to look at this; I will also contact the original authors of the code (maybe they have some clues).

I think that the current app version could be converted to a GPU version quite easily. It seems that for a given l value the calculations for each (i,j) pair are independent of each other; only the final graph edge removal would need special attention. I will try to create a prototype after I finish this AVX app.

Yes, the edge removal is the problem here. We also have another, slightly different version of the algorithm with the removal done only after each major iteration (I'll talk about this with the authors).

BTW, this topic was created for discussing a different thing. Could you move the posts related to my app to a new topic, or rename this one and create a new one for your original question?

Done. BTW, thank you again for your efforts.

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 651 - Posted: 14 Dec 2016, 11:08:43 UTC - in response to Message 650.

I think that the current app version could be converted to a GPU version quite easily. It seems that for a given l value the calculations for each (i,j) pair are independent of each other; only the final graph edge removal would need special attention. I will try to create a prototype after I finish this AVX app.

Yes, the edge removal is the problem here. We also have another, slightly different version of the algorithm with the removal done only after each major iteration (I'll talk about this with the authors).

This is also one of the approaches which I considered. I also thought about introducing extra synchronization for edge removal (probably with the double-checked locking pattern), but this may have a negative effect on performance.
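
A minimal sketch of what double-checked locking for edge removal could look like, assuming a CPU build with worker threads sharing one adjacency matrix. The class SharedGraph and its members are hypothetical and not taken from the TN-Grid code; the point is only that the cheap atomic check comes first and the mutex is taken solely when an edge still appears to exist.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical shared adjacency matrix; names and layout are illustrative only.
class SharedGraph {
public:
    explicit SharedGraph(std::size_t n)
        : n_(n), edges_(new std::atomic<bool>[n * n]), edgeCount_(0) {
        for (std::size_t k = 0; k < n * n; ++k) edges_[k].store(false);
    }

    // Intended for single-threaded setup before the workers start.
    void addEdge(std::size_t i, std::size_t j) {
        assert(i < n_ && j < n_ && i != j);
        std::lock_guard<std::mutex> lock(mutex_);
        if (edges_[i * n_ + j].load()) return;
        edges_[i * n_ + j].store(true);
        edges_[j * n_ + i].store(true);
        ++edgeCount_;
    }

    bool hasEdge(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_);
        return edges_[i * n_ + j].load(std::memory_order_acquire);
    }

    // Returns true only for the caller that actually removed the edge.
    bool removeEdge(std::size_t i, std::size_t j) {
        if (!hasEdge(i, j)) return false;           // first check, no lock taken
        std::lock_guard<std::mutex> lock(mutex_);
        if (!hasEdge(i, j)) return false;           // second check, under the lock
        edges_[i * n_ + j].store(false, std::memory_order_release);
        edges_[j * n_ + i].store(false, std::memory_order_release);
        --edgeCount_;                               // bookkeeping runs once per edge
        return true;
    }

    std::size_t edgeCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return edgeCount_;
    }

private:
    std::size_t n_;
    std::unique_ptr<std::atomic<bool>[]> edges_;
    std::size_t edgeCount_;
    mutable std::mutex mutex_;
};
```

Readers that call hasEdge without the lock may briefly see the two directions of an edge disagree while a removal is in progress; whether that matters depends on how the independence tests read the graph.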
____________

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 653 - Posted: 19 Dec 2016, 23:56:24 UTC

I have forked your project on Bitbucket and uploaded the modified files (link: https://bitbucket.org/sirzooro/pc-boinc). All my changes are so far on the branch sse_avx_optimizations; you can take a look at them.

Compiled apps for Linux x86_64 are in the Downloads section in the menu. There are 3 versions: SSE2, AVX and FMA. Download the selected one, unpack it into /var/lib/boinc/projects/gene.disi.unitn.it_test/ or /var/lib/boinc-client/projects/gene.disi.unitn.it_test/ (depending on your Linux distribution), and restart BOINC. If you are crunching TN-Grid WUs at this moment, stop BOINC before unpacking these files and start it afterwards.

It turned out that on my Sandy Bridge CPUs the SSE version was the fastest one (I suspect that unaligned loads kill the performance of the AVX version). It needs about 1 hour per WU (the original version needed about 2.5 hours).

On newer CPUs the AVX version may be faster. If you have a CPU with FMA instructions (at least Haswell), also try the FMA version. The app calculates tons of a-b*c expressions, plus AVX instructions work faster there, so you should see a nice speed boost. Unfortunately I do not have access to such a CPU, so I could not perform any tests. The FMA instructions were inserted by the compiler when optimizing SSE/AVX instruction pairs which do the same calculations, so the app should give correct results.
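
For readers unfamiliar with FMA, here is a small sketch of the a-b*c expression written with the standard std::fma function, which computes x*y + z with a single rounding step. The function names are hypothetical and this is not code from the app, where the compiler generates the fused instructions on its own.

```cpp
#include <cassert>
#include <cmath>

// a - b*c as one fused operation: std::fma(x, y, z) computes x*y + z and
// rounds once, so -b*c + a is rounded only at the end.
inline double subMulFused(double a, double b, double c) {
    return std::fma(-b, c, a);
}

// The same expression as a separate multiply and subtract. A compiler that
// targets FMA hardware may still contract this into one fused instruction.
inline double subMulPlain(double a, double b, double c) {
    return a - b * c;
}

// Quick sanity check that callers can run once at start-up.
inline void checkSubMul() {
    assert(subMulFused(10.0, 2.0, 3.0) == 4.0);
    assert(subMulPlain(10.0, 2.0, 3.0) == 4.0);
}
```

Because the fused form skips the intermediate rounding of b*c, its result can differ in the last bits from the two-step form, which is one reason to validate FMA builds against the standard application.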

There is no non-SSE version for x86_64; SSE2 support is an integral part of the 64-bit platform.

At the moment I do not have BOINC libs compiled for 32-bit Linux or for Windows. I will prepare the Windows app later. I wonder if anyone needs a 32-bit app for Windows or Linux; let me know if you need one.

Besides the SSE/AVX/FMA optimizations, I changed the algorithm itself to remove the top performance bottlenecks, so other platforms like ARM will benefit too.
____________

Profile [VENETO] boboviz
Joined: 12 Dec 13
Posts: 183
Credit: 4,641,505
RAC: 0
Italy
Message 654 - Posted: 20 Dec 2016, 9:16:09 UTC - in response to Message 653.

It turned out that on my Sandy Bridge CPUs the SSE version was the fastest one (I suspect that unaligned loads kill the performance of the AVX version). It needs about 1 hour per WU (the original version needed about 2.5 hours).


:-O

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 655 - Posted: 20 Dec 2016, 11:19:45 UTC - in response to Message 654.
Last modified: 20 Dec 2016, 22:37:56 UTC

Amazing work! I just tried a simple benchmark (just two tiles, the output results were obviously the same) on a 4770k (Haswell) and the results are impressive:

time bin/pc 5560_Ec_ecm-b0624-crcB_wu-1.input.twotiles 5560_Ec_ecm-b0624-crcB_wu-1.output.twotiles 0.05 1 2470
real 1m59.675s
user 1m57.815s
sys 0m0.040s

time bin/TN-Grid.linux-x86-64-fma 5560_Ec_ecm-b0624-crcB_wu-1.input.twotiles 5560_Ec_ecm-b0624-crcB_wu-1.output.twotiles.fma 0.05 1 2470
real 1m2.131s
user 1m0.218s
sys 0m0.008s

OK, let's go one step further. Please try the optimized binaries (just Linux x64 for now) using the anonymous platform mechanism, so we can see if there is something wrong (like gcc/kernel dependencies). Please be aware that:
1. Using AVX (FMA?) extensions will push the CPU to its limits (keep an eye on temperatures).
2. The provided app_info contains an explicit reference to the input data we use with the EC experiment (it won't work if we change organism).

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 656 - Posted: 20 Dec 2016, 12:05:14 UTC
Last modified: 20 Dec 2016, 12:07:26 UTC

The apps are linked statically with glibc and libstdc++, so the chance of incompatibilities should be limited.

That explicit reference to input data in app_info is in fact not needed; I will remove it.

The apps are compiled with gcc 4.8.5. After switching to a newer gcc version you can expect some extra speed gain, especially after enabling optimization for your CPU type. BTW, the app code can be optimized further, so it could be even faster. I am going to spend some extra time on it later.

FMA can be considered an extension of AVX, so it probably loads the CPU even more.

@valterc, could you test the other app versions and post the results here? I wonder how they perform on your CPU.
____________

zioriga
Joined: 18 Dec 13
Posts: 10
Credit: 7,239,142
RAC: 0
Italy
Message 657 - Posted: 20 Dec 2016, 14:12:05 UTC

I downloaded the optimized version,
extracted both files (app_config.xml and pc) into the gene.... directory,
and issued a read config command,
but everything ended in errors.

My rig is an Intel i7 5960, Linux Mint 17.3 64-bit, BOINC 7.2.42.

Is there something else to do?

Profile [B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 658 - Posted: 20 Dec 2016, 14:24:40 UTC - in response to Message 657.

I downloaded the optimized version,
extracted both files (app_config.xml and pc) into the gene.... directory,
and issued a read config command,
but everything ended in errors.

My rig is an Intel i7 5960, Linux Mint 17.3 64-bit, BOINC 7.2.42.

Is there something else to do?

Please paste the error message here; without it I can only guess what may be wrong.
____________

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 623
Credit: 34,677,535
RAC: 1
Italy
Message 659 - Posted: 20 Dec 2016, 15:20:09 UTC - in response to Message 657.

I downloaded the optimized version,
extracted both files (app_config.xml and pc) into the gene.... directory,
and issued a read config command,
but everything ended in errors.

My rig is an Intel i7 5960, Linux Mint 17.3 64-bit, BOINC 7.2.42.

Is there something else to do?

Looking at your host, I didn't find any workunits marked as anonymous platform.
Did you copy all the files to the right place? Check app_info (not app_config).

zioriga
Joined: 18 Dec 13
Posts: 10
Credit: 7,239,142
RAC: 0
Italy
Message 660 - Posted: 20 Dec 2016, 15:25:22 UTC - in response to Message 658.

Sorry, my bad.
I misinterpreted the information on the countdown: it looked as if it was running with the stock application, and I tried some bad actions (deleting some files and so on).

Now I have restarted with all the files and it is running faster.
I'll do a test with the 3 versions to find the best solution for my rig.

BTW, will the flops parameter be changed automatically?


Copyright © 2024 CNR-TN & UniTN