Posts by [B@P] Daniel

81) Message boards : Number crunching : Optimization (Message 653)
Posted 19 Dec 2016 by

I have forked your project on Bitbucket and uploaded the modified files (link: https://bitbucket.org/sirzooro/pc-boinc). So far all my changes are on the branch sse_avx_optimizations, so you can take a look at them.

Compiled apps for Linux x86_64 are in the Downloads section of the menu. There are three versions: SSE2, AVX and FMA. Download the one you want, unpack it into /var/lib/boinc/projects/gene.disi.unitn.it_test/ or /var/lib/boinc-client/projects/gene.disi.unitn.it_test/ (depending on your Linux distribution), and restart BOINC. If you are crunching TN-Grid WUs at the moment, stop BOINC before unpacking the files and start it again afterwards.

It turned out that on my Sandy Bridge CPUs the SSE version was the fastest one (I suspect that unaligned loads kill the performance of the AVX version). It needs about 1 hour per WU (the original version needed about 2.5 hours).

On newer CPUs the AVX version may be faster. If you have a CPU with FMA instructions (Haswell or newer), try the FMA version too. The app calculates tons of a-b*c expressions, plus AVX instructions work faster there, so you should see a nice speed boost. Unfortunately I do not have access to such a CPU, so I could not perform any tests. The FMA instructions were inserted by the compiler, which fused SSE/AVX instruction pairs that do the same calculation, so the app should give correct results.
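As a rough illustration of the a-b*c pattern mentioned above (a sketch, not code from the fork): the standard std::fma(x, y, z) computes x*y + z with a single rounding, so a - b*c can be written as fma(-b, c, a). The function name fused_sub_mul is hypothetical. Because the fused form rounds once instead of twice, the last bits of a result may differ slightly from a separate multiply and subtract.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: out[i] = a[i] - b[i] * c[i], written as a fused
// multiply-add. std::fma(x, y, z) returns x*y + z rounded once.
std::vector<float> fused_sub_mul(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = std::fma(-b[i], c[i], a[i]);  // a - b*c
    }
    return out;
}
```

On a CPU with FMA support, a compiler such as gcc can turn this call (or a plain a - b*c, when contraction is allowed) into one hardware FMA instruction if it is told to target that instruction set.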

There is no non-SSE version for x86_64: SSE2 support is an integral part of the 64-bit platform.

I do not have BOINC libs compiled for 32-bit Linux or for Windows at the moment. I will prepare a Windows app later. I wonder whether anyone needs a 32-bit app for Windows or Linux; let me know if you do.

Besides the SSE/AVX/FMA optimizations, I changed the algorithm itself to remove the top performance bottlenecks, so other platforms like ARM will benefit too.

82) Message boards : Number crunching : Optimization (Message 651)
Posted 14 Dec 2016 by

[B@P] Daniel

I think that the current app version could be converted to a GPU version quite easily. It seems that for a given l value the calculations for each (i,j) pair are independent of each other; only the final graph edge removal would need special attention. I will try to create a prototype after I finish this AVX app.

Yes, the edge removal is the problem here. We also have another, slightly different version of the algorithm with the removal done only after each major iteration (I'll talk about this with the authors).

This is also one of the approaches I considered. I also thought about introducing extra synchronization for edge removal (probably with the double-checked locking pattern), but this may have a negative effect on performance.
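A minimal sketch of what double-checked locking around edge removal could look like, assuming a dense adjacency matrix of atomic flags. The class EdgeSet and its members are hypothetical and are not taken from the TN-Grid code; the point is only the check, lock, check again order.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical adjacency matrix for an undirected graph on n nodes.
class EdgeSet {
public:
    explicit EdgeSet(std::size_t n) : n_(n), present_(n * n) {
        for (auto& e : present_) e.store(true, std::memory_order_relaxed);
    }

    bool has(std::size_t i, std::size_t j) const {
        return present_[index(i, j)].load(std::memory_order_acquire);
    }

    // Returns true only for the caller that actually removed the edge.
    bool remove(std::size_t i, std::size_t j) {
        // First check, without the lock: cheap exit if already removed.
        if (!has(i, j)) return false;
        std::lock_guard<std::mutex> guard(mutex_);
        // Second check, under the lock: another thread may have won the race.
        if (!has(i, j)) return false;
        present_[index(i, j)].store(false, std::memory_order_release);
        present_[index(j, i)].store(false, std::memory_order_release);
        return true;
    }

private:
    std::size_t index(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_);
        return i * n_ + j;
    }

    std::size_t n_;
    std::vector<std::atomic<bool>> present_;
    std::mutex mutex_;
};
```

The unlocked first check keeps the common "already removed" case cheap, but every real removal still goes through one mutex, which is where the performance concern would come from.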

83) Message boards : Number crunching : Optimization (Message 648)
Posted 14 Dec 2016 by

[B@P] Daniel

Just a naive question (I'm not really an expert in cross-compiling): would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)?

Yes, assuming that you have such a cross-compiler and were able to build the BOINC libs for the MinGW target (built with the cross-compiler or natively on Windows).

I have just spotted that I misunderstood your question and answered it incorrectly. Both MinGW and Cygwin run on Windows, so there is no Linux->Windows cross-compilation. Cygwin is a full-blown POSIX environment for Windows; apps built for it require a special library which provides the necessary POSIX emulation layer. MinGW is "Minimalist GNU for Windows"; it allows building native Windows apps in a Unix-like development environment.

Actually I was thinking about the opposite direction, i.e. doing all the testing on Linux and then making the Windows exe there (see http://www.mingw.org/wiki/linuxcrossmingw). I was able to make a 32/64-bit Windows exe (hello world) using it, but I didn't try to cross-compile the BOINC API.

It is worth trying. Yesterday I found a page on how to compile BOINC apps for Windows, https://boinc.berkeley.edu/trac/wiki/CompileAppWin, and it says that there is a special Makefile created for MinGW. So it looks like the BOINC team decided in the past to do it this way instead of fixing the autoconf scripts. No wonder my attempts to compile it failed: I tried to use the configure script and the Makefiles it generates. I have not tried this special Makefile yet; I hope it works as expected.

I found something that may be a problem. You use an undirected graph, so I thought that I could reduce the number of iterations of the loop at pc.cpp:418 by testing (i,j) pairs for j > i only. However, after doing this the output file size changed from 47.8K to 67.6K. The original code, before my other changes, also generated the bigger file after applying this change. I checked the code briefly and do not see anything obvious that could cause this. Could you take a look at this?

I think that the current app version could be converted to a GPU version quite easily. It seems that for a given l value the calculations for each (i,j) pair are independent of each other; only the final graph edge removal would need special attention. I will try to create a prototype after I finish this AVX app.

BTW, this topic was created to discuss a different thing. Could you move the posts related to my app to a new topic, or rename this one and create a new one for your original question?

84) Message boards : Number crunching : Optimization (Message 643)
Posted 12 Dec 2016 by

[B@P] Daniel

Just a naive question (I'm not really an expert in cross-compiling): would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)?

Yes, assuming that you have such a cross-compiler and were able to build the BOINC libs for the MinGW target (built with the cross-compiler or natively on Windows).

I have just spotted that I misunderstood your question and answered it incorrectly. Both MinGW and Cygwin run on Windows, so there is no Linux->Windows cross-compilation. Cygwin is a full-blown POSIX environment for Windows; apps built for it require a special library which provides the necessary POSIX emulation layer. MinGW is "Minimalist GNU for Windows"; it allows building native Windows apps in a Unix-like development environment.

85) Message boards : Number crunching : Optimization (Message 640)
Posted 12 Dec 2016 by

[B@P] Daniel

[edit]If working on the Linux version please take care of the kernel requirements, i.e. use a 3.0 kernel to make the build.

Why this particular version? I use CentOS 7 with kernel 3.10, is it OK too?

Just a naive question (I'm not really an expert in cross-compiling): would it be possible to use MinGW-w64 to compile native Windows binaries on Linux (avoiding MSVC altogether)?

Yes, assuming that you have such a cross-compiler and were able to build the BOINC libs for the MinGW target (built with the cross-compiler or natively on Windows).

Why drop the boinc API and use the application with a wrapper?

Compilation of the BOINC libraries with the MinGW compiler under Cygwin is broken. I tried to fix it, but this was taking too much time, so I decided to give up and stub the BOINC functions used by the TN-Grid app to get a standalone app.

Maybe it would be possible to compile BOINC from the MinGW shell; I did not try it.

86) Message boards : Number crunching : Optimization (Message 638)
Posted 12 Dec 2016 by

[B@P] Daniel

BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I am still working on it, but I think I will be able to release its code and compiled binaries to everyone before Christmas.

Are you doing the work on the Windows platform? (We know that the binary we have for Windows, made with Visual C, is way slower than the Linux one.) Anyway, we would be very happy to have a faster version. What I suggest is to keep the new application inside the "anonymous platform" framework for a while. If it validates against the standard version, everything should be fine.

Thanks for your collaboration!

I am developing it on Windows under Cygwin, but the final version is for Linux. I plan to compile the Windows version as a standalone app using MinGW and run it via the BOINC wrapper. The code without SSE/AVX support should compile under MSVC, but the SSE/AVX code most probably will not: MSVC probably does not support the gcc vector extensions which I use for SSE/AVX. I already have some classes which provide a similar API with the help of Intel intrinsics, but this part will need more work.
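For readers who have not seen gcc vector extensions, here is a small sketch of the general idea (not the code from the optimized app). The vector_size attribute is a real GCC/Clang extension; the names v4sf, sub_mul and sum_sub_mul are made up for this example. Ordinary operators then work lane by lane, and the compiler picks SSE or AVX instructions depending on the target flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: four packed floats (16 bytes, one SSE register).
typedef float v4sf __attribute__((vector_size(16)));

// Element-wise a - b * c on four lanes at once.
inline v4sf sub_mul(v4sf a, v4sf b, v4sf c) {
    return a - b * c;
}

// Hypothetical helper: sum of a[i] - b[i] * c[i] over n floats.
// For brevity n must be a multiple of 4.
float sum_sub_mul(const float* a, const float* b, const float* c, std::size_t n) {
    assert(n % 4 == 0);
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4) {
        v4sf va, vb, vc;
        // memcpy makes no alignment assumption about the input pointers.
        std::memcpy(&va, a + i, sizeof va);
        std::memcpy(&vb, b + i, sizeof vb);
        std::memcpy(&vc, c + i, sizeof vc);
        acc += sub_mul(va, vb, vc);
    }
    // Horizontal sum of the four lanes.
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

This syntax is exactly the part that would need replacing with intrinsics-based wrapper classes for a compiler without these extensions.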

I have eliminated the two top bottlenecks in the code: testAndRemove is reduced to an isnan and a simple range check plus the existing code for edge removal; BoincFile::getLine now works on big blocks of data, so file loading time is reduced from 10+ secs per tile to less than 0.2 sec. I wonder if the latter was even slower on Windows. I also read that the atof function is very slow on Windows, so you may need to find a faster replacement. I also wonder whether MSVC is able to optimize pow(x,2) to x*x like gcc does. We will see how my version performs when you compile it using MSVC.
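To show the general idea behind reading lines from big blocks (a sketch only, assuming a plain std::istream rather than the real BoincFile class): the reader pulls a large chunk with one read call and splits it into lines in memory, instead of asking the I/O layer for one character or one short line at a time. BlockLineReader and readAllLines are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical block-buffered line reader over any std::istream.
class BlockLineReader {
public:
    explicit BlockLineReader(std::istream& in, std::size_t block_size = 1 << 16)
        : in_(in), buffer_(block_size), pos_(0), end_(0) {
        assert(block_size > 0);
    }

    // Stores the next line (without '\n') in `line`; returns false at end of input.
    bool getLine(std::string& line) {
        line.clear();
        bool got_any = false;
        for (;;) {
            if (pos_ == end_ && !refill()) return got_any;
            got_any = true;
            const char* begin = buffer_.data() + pos_;
            const char* stop = buffer_.data() + end_;
            const char* nl = std::find(begin, stop, '\n');
            line.append(begin, nl);
            pos_ = static_cast<std::size_t>(nl - buffer_.data());
            if (nl != stop) {  // found the end of the line inside this block
                ++pos_;        // skip the '\n'
                return true;
            }
            // Otherwise the line continues in the next block.
        }
    }

private:
    // Loads the next block; returns false when no more data is available.
    bool refill() {
        in_.read(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        pos_ = 0;
        end_ = static_cast<std::size_t>(in_.gcount());
        return end_ > 0;
    }

    std::istream& in_;
    std::vector<char> buffer_;
    std::size_t pos_;
    std::size_t end_;
};

// Convenience helper for trying the reader on an in-memory string.
std::vector<std::string> readAllLines(const std::string& text, std::size_t block_size) {
    std::istringstream in(text);
    BlockLineReader reader(in, block_size);
    std::vector<std::string> lines;
    std::string line;
    while (reader.getLine(line)) lines.push_back(line);
    return lines;
}
```

Calling readAllLines with a tiny block size (for example 4) is an easy way to check that lines spanning several blocks are reassembled correctly.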

87) Message boards : Number crunching : Optimization (Message 635)
Posted 12 Dec 2016 by

[B@P] Daniel

Sounds interesting. PrimeGrid already does something like this - they have two kinds of bonuses: one is for apps which take a long time to complete (a few days), the second is a conjecture bonus. Some tasks have both of these bonuses.

BTW, I am working on an optimized app for TN-Grid. After applying various code optimizations and adding AVX support, it runs about 30% faster on my Sandy Bridge Xeons, and so far all WUs have passed validation. I am still working on it, but I think I will be able to release its code and compiled binaries to everyone before Christmas.

88) Message boards : Number crunching : Gene application for GNU/Linux on ARM devices (Message 628)
Posted 7 Dec 2016 by

[B@P] Daniel

Yep thanks, shortly after posting, I stumbled over the thread and managed to compile the source. Unfortunately the test run times are in the 8-9 min range. They tend to get worse when specifying matching -march and -mtune for the C2's A53 cores.

I'm no programmer nor advanced compilation pro, so I leave it like this for now.

Did you change any cflags etc? The C2 has only very limited information in /proc/cpuinfo, maybe that's part of the problem, that it can't determine its capabilities and account for them during build...

Have you tried using -march=native -mtune=native? They tell gcc to check the CPU it is running on and enable all supported features. On x86_64 CPUs this sets more flags than simply the correct -march and -mtune, so ARM may also benefit from this.

89) Message boards : Wish List : Remove the Invitation Code (Message 607)
Posted 29 Nov 2016 by

[B@P] Daniel

I see that you added badges, thanks! They look very nice :) Could you add a list of all of them somewhere, with the required credit levels?

90) Message boards : Wish List : Remove the Invitation Code (Message 587)
Posted 31 Oct 2016 by

[B@P] Daniel

badges

yes, I solemnly promise (;) to do something in September/October...

Any update on this? Can we expect them soon?
