Posts by [B@P] Daniel

41) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 862)
Posted 6 Feb 2017 by

That might be. I'll try that out, just to confirm. Some of my machines are used for nothing else than boinc and therefore not updated to prevent problems with the updates (unplanned restarts, machines not coming up again).

This is very bad from a security perspective - new security holes are found every month, and computers connected to the Internet are constantly scanned for these holes in an attempt to turn them into zombies in some botnet. There is even malware which scans the local network and tries to infect computers there. Without an antivirus and a firewall which blocks all incoming traffic, such a computer will sooner or later be infected.

Can the standard application get a check against this before it uses AVX? This will make it more stable.

Yes, it is possible. The app could check this and print a user-friendly error. However, it will still exit with a failure status; an unsupported instruction set is usually not something the user can fix without upgrading the CPU. I will include such a check when I release the new app.
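A minimal sketch of what such a check could look like, written in Python for brevity rather than in the app's own language. It assumes a Linux host where /proc/cpuinfo-style text with a "flags" line is available; the function names and the error text are hypothetical and are not taken from the app. In a gcc-built C++ app the equivalent test would more likely be something like __builtin_cpu_supports("avx").

```python
import sys

def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found

def check_avx(cpuinfo_text, out=sys.stderr):
    """Return 0 if AVX is reported; otherwise print a readable error and return 1."""
    if "avx" in cpu_flags(cpuinfo_text):
        return 0
    print("This app needs AVX, which this CPU/OS does not report. "
          "Please use the SSE2 version instead.", file=out)
    return 1

# Hypothetical sample of a CPU without AVX (flags list shortened).
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1\n"
status = check_avx(sample)  # prints the message, status is 1
```

The returned status is still a failure, as described above; the only gain is that the user sees a readable message instead of an illegal-instruction crash.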

42) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 855)
Posted 6 Feb 2017 by

[B@P] Daniel

Strange. I suspect that my app uses some rarely used AVX instruction, which is not recognized by your CPU

Looking at that host's specs, he's on Win7 but it doesn't declare SP1, which is required for AVX support.

You are right, I missed this detail. SP1 is required for AVX.

43) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 852)
Posted 5 Feb 2017 by

[B@P] Daniel

Sorry to tell you... but it crashes exactly like the previous AVX version.

Strange. I suspect that my app uses some rarely used AVX instruction which is not recognized by your CPU because of some bug in its microcode, so it reports the error "illegal instruction". Other projects apparently do not use it, so they work fine. Please try updating the microcode of your CPU. This update should work for you: https://support.microsoft.com/pl-pl/help/3064209/june-2015-intel-cpu-microcode-update-for-windows. It can also be done from Linux: https://askubuntu.com/questions/545925/how-to-update-intel-microcode-properly/546056. You can also try to update the BIOS; CPU microcode updates may be distributed this way too.

44) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 850)
Posted 4 Feb 2017 by

[B@P] Daniel

Well, I am puzzled. It should work for you, but for some reason it crashes. I checked the compilation options and they should be fine; according to various pages, the enabled instruction sets should be supported by your CPU.

I have created an app specifically tailored for IvyBridge CPUs (compiled with -march=ivybridge -mtune=ivybridge). Please try it and let me know if it works or still crashes.
https://bitbucket.org/sirzooro/pc-boinc/downloads/TN-Grid.windows-x86-64-ivybridge-v1.1.zip

45) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 847)
Posted 4 Feb 2017 by

[B@P] Daniel

http://gene.disi.unitn.it/test/result.php?resultid=6502952
http://gene.disi.unitn.it/test/result.php?resultid=6502909

There you have two of them. The standard app switched randomly between SSE2 and AVX and all the AVX tasks failed, blocking the work unit slot due to a windows error report.

I tried to use the new AVX optimized app on this machine too, which failed the same way. The manually installed new SSE2 optimized app runs without problems.

Thanks. Your CPU is an Intel Ivy Bridge, so it should have working AVX. I checked these WUs. They worked for some time before crashing, so it looks like they were able to execute AVX for some time. Apps on CPUs without AVX usually crash within a few seconds.
I have noticed one thing: the second WU worked for over 11 hours before it finally crashed, which is strange. Do you have similar problems with apps from other projects? I suspect that your CPU may be overheating or you have some other hardware issue, e.g. with memory. Please try to stress-test your PC; here is a list of some software to do this: https://www.raymond.cc/blog/test-system-stability-by-putting-heavy-load-on-system-resources/. And here are memory testers: http://www.howtogeek.com/260813/how-to-test-your-computers-ram-for-problems/

46) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 844)
Posted 4 Feb 2017 by

[B@P] Daniel

Hi!

I got these crashes with the standard app too.

It seems that there are different "versions" of AVX (not AVX2) or some AVX-capable processors don't support the full used instruction set. This problem affects the new optimized AVX (not AVX2) app too, both crashed here on an i5 3570, which should be able to run AVX. The SSE2 app runs great on this machine.

Unfortunately your computers are hidden, so I cannot check details. Please send me a link to some AVX WU which crashed for you.

47) Message boards : Number crunching : Optimization (Message 843)
Posted 4 Feb 2017 by

[B@P] Daniel

Which version did you use for your 3770k?

Currently running SSE2v1.1 on my i7 3770k@4.3Ghz

Time Remaining 5 hours... I dont think that is correct..

59 minutes it ended up being.

The other variations crashed. FMA/AVX2

Your CPU supports instructions up to AVX: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-3770K.html. It does not have FMA or AVX2, so these apps will crash there. You can try the AVX version; it should work for you. You can also use CPU-Z to check this.

48) Message boards : Number crunching : Unknown error number (0xffffffffc000001d) (Message 837)
Posted 30 Jan 2017 by

[B@P] Daniel

Recently I investigated a similar case here: http://gene.disi.unitn.it/test/forum_thread.php?id=135&postid=817#817. Someone tried to run an AVX app on a non-AVX CPU. When I googled for this error code (truncated to 32-bit, 0xc000001d) I found pages where people also had this problem when they tried to run some SSE apps on non-SSE CPUs.

Edit: I found that it is possible to disable AVX support in Windows by executing the command "bcdedit /set xsavedisable 1". Maybe the first person did this for some reason (overheating?).
The 2nd CPU (Xeon X5680) supports up to SSE 4.2, so maybe he/she tried to run an AVX or FMA app on it.
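For reference, the truncation mentioned above can be sketched as follows. 0xC000001D is the Windows NTSTATUS value STATUS_ILLEGAL_INSTRUCTION; the helper names are only illustrative.

```python
STATUS_ILLEGAL_INSTRUCTION = 0xC000001D  # Windows NTSTATUS code

def to_u32(code):
    """Truncate an exit code to its low 32 bits (unsigned)."""
    return code & 0xFFFFFFFF

def to_i32(code):
    """Interpret the low 32 bits of an exit code as a signed integer."""
    value = to_u32(code)
    return value - (1 << 32) if value & 0x80000000 else value

code = 0xFFFFFFFFC000001D          # value from the thread title
print(hex(to_u32(code)))           # 0xc000001d
print(to_i32(code))                # -1073741795
print(to_u32(code) == STATUS_ILLEGAL_INSTRUCTION)
```

The 64-bit value in the thread title, the 32-bit hex code and the negative decimal exit code that BOINC prints are therefore the same error seen through different integer widths.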

49) Message boards : Number crunching : Optimization (Message 828)
Posted 26 Jan 2017 by

[B@P] Daniel

I found a host that is not able to run the linux x64 version because of missing shared libraries (http://gene.disi.unitn.it/test/show_host_detail.php?hostid=2990), too old kernel? (3.2.0-4-amd64). The error is version `GLIBC_2.15' not found, version `GLIBC_2.16' not found.

The Makefile doesn't link with 'g++ -static ...', which is the way I know for making a static exe (checked this with ldd and it works). I don't know if this is a good solution, or the only way to solve this is to put a minimum kernel version inside the plan class of the application.

BTW I made a static Linux x64 sse2 version of the application using the latest source code, if someone would like to play with it: http://gene.disi.unitn.it/test/files/tngrid_expansion_v11_linux64-static__sse2.tar.gz

Hints are welcome.

If ldd no longer shows these libs, it should be OK. However, I am a bit reluctant about doing this: this particular kernel version was used by Debian Wheezy, which is now past its End of Life. This means that there are no new updates for this system version, in particular no security updates for new security holes. By not providing an app which works there, we may convince the user to upgrade the system to a newer version which will be supported for the next few years.

I played with the new app a bit, trying to optimize it more. It turned out that using AVX only for calculating square roots was slower than using SSE only. I also tried to use values from one half of the matrix only, but this slowed the app down too. So it does not make sense to apply either of these changes.

I also tried measuring the run time of the app with SSE vectors on a Haswell CPU, compiled for different instruction sets:

SSE2  20.766
AVX   19.933
FMA   20.163
AVX2  20.355

It turned out that the AVX version is faster than SSE2, probably thanks to some SSE3+ instructions or AVX used in code automatically vectorized by gcc. So this app version should be provided by the project. The FMA app is, to my surprise, slower than AVX, and I do not have a good explanation for this now. The AVX2 version is also slower. It would be good if someone with a newer CPU like Skylake could perform some tests and post the results here; maybe it will work better on such new CPUs. If not, the existing versions (SSE2, AVX, FMA) would be sufficient.

I have uploaded new versions of the AVX and AVX2 apps for Linux and Windows; feel free to download and run them.

50) Message boards : Number crunching : Optimization (Message 821)
Posted 25 Jan 2017 by

[B@P] Daniel

Also keep in mind that WUs sent by server now are 100 times longer and we can expect that they will be 200 times longer, so actual time reduction per WU will not be so tiny.

Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)

I was talking about the test data file; it contains one block only.

51) Message boards : Number crunching : Optimization (Message 817)
Posted 25 Jan 2017 by

[B@P] Daniel

Hmm,

I tried the updated fma v1.1 version on my quad core Q9550-based (Windows 7 build) machine and it threw up immediate computation errors against all the WUs it tried to process. Needless to say, I terminated the BOINC session as soon as I could.

The Stderr output file for all seven of the errored WUs shows the same message:
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741795 (0xc000001d)
</message>
]]>
which, I'm afraid, means nothing to me (ah, Vienna), but may help someone to identify the problem.

For now, I'm sticking with the sse2 version on both of my machines which is halving WU processing times on both machines ... and I can't complain too much about that.

Dave

This CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by your CPU using CPU-Z.

52) Message boards : Number crunching : Optimization (Message 812)
Posted 25 Jan 2017 by

[B@P] Daniel

The new FMA-App is great.
Times dropped from 55 Minutes to 46 Minutes!

Great news :)

Going from the 0.10 AVX version on my dual E5-2620 machine it used to average around 8-10K CPU seconds, with the newer app that is cut to around 5-6K secs.

Interestingly if I lower the number of tasks running at once to something like 4-6 at a time instead of using all 24 cores at a time it drops down to nearly 3-4K seconds.

I wonder if this is related to boost clocks or memory related. I've run into some weirdness with things that are memory bound when running on dual socket motherboards because it may get memory allocated in the wrong space, and while the links between CPUs are quick nowadays, it's still a performance penalty.

Anyways, simply amazing work!

It's memory bandwidth. The app uses about 1.3MB of global data and accesses it in a way which looks random to the memory controller, so it cannot prefetch data efficiently. When fewer apps work in parallel, each of them has more CPU cache available, so there is a bigger chance that the required data will be there.

The new app uses almost twice as much memory as the old one to store some precalculated values. Half of this data is in fact a transposed copy of the other half of the matrix, so it should be possible to optimize this. I will add this to my TODO list :)

53) Message boards : Number crunching : Optimization (Message 808)
Posted 24 Jan 2017 by

[B@P] Daniel

I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

SSE2      20.600
SSE2+FMA  20.465

Impressive work. Thanks! It looks like the sse2+fma is only 0.659% faster. Is that even worth having another version?

I am going to modify the code a bit to use AVX for div/sqrt calculations and SSE for the rest. This should improve performance a bit, so in the end it should be a bit faster than this SSE+FMA version. We will see how much faster it is when I have it ready. Also keep in mind that WUs sent by the server now are 100 times longer and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny.

54) Message boards : Number crunching : Optimization (Message 805)
Posted 24 Jan 2017 by

[B@P] Daniel

I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE) vectors, so it should be faster than the SSE version now.

55) Message boards : Number crunching : Optimization (Message 804)
Posted 24 Jan 2017 by

[B@P] Daniel

I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

SSE2      20.600
AVX       21.441
AVX+FMA   21.361
AVX2+FMA  21.618

As you can see, the SSE2 version is the fastest one. I found the reason for this: instruction-level parallelism, which means that the CPU is able to execute multiple instructions at the same time. The previous app version mixed fast (add/sub/mul) and slow (div/sqrt) operations, so new CPUs were able to get some advantage from this, plus they were able to calculate two div/sqrt ops faster using 1 AVX instruction than using 2 SSE2 instructions. As a result they were able to run the app faster.

The new app version performs the slow calculations (div/sqrt) first, then uses the calculated values multiple times in fast ops (sub/mul). It looks like CPUs are able to execute two SSE2 instructions in the same time as 1 AVX instruction. The application processes the diagonal half of the matrix, so every row has a different length. The app always loads 2 (SSE2 app) or 4 (AVX app) values and processes them. This saves some time, at the cost of calculating values which will be discarded later. Because of this the SSE2 version is better: fewer results are finally discarded, so the app performs better.
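To illustrate the padding effect described above, here is a small sketch that counts how many computed values end up discarded for a given vector width. It assumes a triangular matrix whose row i holds exactly i values (i = 1..n_rows) and that each row is padded up to a multiple of the vector width; the real data layout in the app may differ, and the function name is hypothetical.

```python
def discarded_lanes(n_rows, width):
    """Count vector lanes that are computed and then thrown away.

    Row i (1-based) is assumed to hold i useful values and is processed
    `width` values at a time, so the last vector of a row may be partly unused.
    """
    total = 0
    for length in range(1, n_rows + 1):
        vectors = -(-length // width)        # ceiling division
        total += vectors * width - length    # padded lanes in this row
    return total

n = 8
useful = n * (n + 1) // 2                    # 36 useful values for 8 rows
print("width 2 (SSE2):", discarded_lanes(n, 2), "discarded of", useful, "useful")
print("width 4 (AVX): ", discarded_lanes(n, 4), "discarded of", useful, "useful")
```

With these assumptions, 8 rows waste 4 lanes at width 2 and 12 lanes at width 4, which matches the direction of the explanation: the wider the vector, the more results are calculated only to be discarded.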

I tried to prepare an app version which used AVX for vectors of length 3 or 4, and SSE2 if there were 1 or 2 elements left. However, this app version was even slower:

AVX-new       23.059
AVX+FMA-new   22.923
AVX2+FMA-new  22.512

I also tested one more idea: SSE2+FMA. It turned out that it is faster than plain SSE2:

SSE2+FMA 20.465

So it looks like it makes sense to prepare an app which processes data using SSE vectors to get the most out of instruction-level parallelism, and uses FMA instructions which operate on such vectors to save some time on add/sub operations. There is also one more thing worth exploring: using AVX for div/sqrt calculations only; they are faster than SSE, so the app should be a bit faster.

And one more thing - it looks like the results for the AVX2 app are mixed: for the original app they are worse, but for the modified one they are better. More testing is needed here to determine whether it makes sense to have such an app version too.

56) Message boards : Number crunching : Optimization (Message 796)
Posted 23 Jan 2017 by

[B@P] Daniel

Please try running the test_run2.sh script, which works on some real data. The test_run.sh script does not show the performance improvement provided by NEON SIMD instructions. Actually the NEON app is even a bit slower than the non-NEON 64-bit version when running this script. I posted results from running test_run2.sh on my C2 earlier, take a look at them :)

57) Message boards : Number crunching : Optimization (Message 794)
Posted 23 Jan 2017 by

[B@P] Daniel

I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test.

So far the new fma version is running but seems to be slower on this CPU than the sse2 app.
On the previous optimized app the fma was a little faster than the sse2 on this CPU. Strange.

Edit: The new sse2 app is faster than the fma app on this machine, a reversal from the earlier optimized app.

Newest optimized sse2: 51:08 to 53:32
Newest optimized fma: 56:10 to 58:12

The sse2 app is about 9% faster on this CPU, while the older fma app was faster. Again, strange...

This app is limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), plus the loops for AVX executed fewer times because of the longer vectors, so AVX and FMA were faster. Now, with a reduced number of these slow calculations and with unrolled loops, it may be that SSE is faster. I could test it only on SandyBridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory things may work differently, and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.

BTW, I have created apps for 32-bit Windows, in SSE and non-SSE versions. Please let me know if they work on Windows XP. Cygwin dropped Windows XP support some time ago, so I am not sure if the current 32-bit Cygwin version is able to create binaries for WinXP. The previous Win32 apps were compiled on WinXP with an older Cygwin version, so they worked fine. If the new ones do not work, I will have to download an older 32-bit Cygwin version; fortunately there is a mirror which still holds it.

58) Message boards : Number crunching : Optimization (Message 790)
Posted 23 Jan 2017 by

[B@P] Daniel

OK, tried SSE2 once again and it is working. :)
FMA seems to be running too.

Thank You Daniel!

Good to hear this :)

I have uploaded an ARM 32-bit version. It turned out that on my Odroid XU4 it is two times faster than on the Odroid C2 running the non-NEON app, and 1.5 times faster than the NEON one :-O

real 1m10.599s
user 1m9.830s
sys  0m0.345s

59) Message boards : Number crunching : Optimization (Message 786)
Posted 23 Jan 2017 by

[B@P] Daniel

Thanks Daniel!

I tried the newest fma version and it crashed on a machine that worked with the old fma optimized app.
The new sse2 version worked on all my various CPUs and was over twice as fast as the previous optimized app.
To be clear, the new sse2 app is more than 2x faster on every one of my machines, even the ones that ran the old fma version.

Hi, thank You.

But I'm getting compute errors with the new SSE2 and FMA App. Only AVX is working very well on my FX 8320 (Win 10 64 bit).

Keep on Your great work!

I found why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and provide me with a link to the failed task? I would like to check the error message.

I have recompiled and uploaded the FMA Windows version; now it does not use AVX2, so it should work fine. I also uploaded separate AVX2 versions for Windows and Linux 64-bit. Could someone with a sufficiently new CPU run some benchmarks with test data on the AVX2 and FMA versions? I wonder if there is any performance difference between the AVX2 and FMA versions.

60) Message boards : Number crunching : Optimization (Message 780)
Posted 23 Jan 2017 by

[B@P] Daniel

Surprise! I have just released a new optimized app version (Opti v1.1), 2 times faster than the previous optimized one (now the official one) :). It can be downloaded from the same place as the previous ones: https://bitbucket.org/sirzooro/pc-boinc/downloads. At the moment only 64-bit versions for Windows and Linux are available. I will add a 32-bit Windows version later.

The app_info.xml file provided with the app does not specify a plan class, so make sure you finish or abort your tasks first. Otherwise you will lose them when you install my app! This file also specifies a new app version (10), so make sure you have no tasks if you are still running the previous manually installed app; you will also lose your tasks if you replace that file.

Here are the results for test data from the previous and new SSE Linux versions:

Previous:
real 0m54.472s
user 0m52.358s
sys  0m0.045s

New:
real 0m26.208s
user 0m24.142s
sys  0m0.033s

I was also able to add code which uses NEON instructions on ARM 64-bit (AARCH64). Here are the results of running the non-NEON and NEON apps on test data on my Odroid C2:

non-NEON:
real 2m18.336s
user 2m18.180s
sys  0m0.080s

NEON:
real 1m48.669s
user 1m48.600s
sys  0m0.060s
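As a quick check of the numbers above, the speedups can be computed from the "real" times. This is only an illustrative sketch; the helper names are made up, and it simply parses the minutes/seconds format printed by the shell's time command.

```python
import re

def parse_time(text):
    """Convert a `time`-style value such as '1m48.669s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

def speedup(old, new):
    """How many times faster the new run is than the old one."""
    return parse_time(old) / parse_time(new)

x86 = speedup("0m54.472s", "0m26.208s")   # previous vs new SSE Linux app
arm = speedup("2m18.336s", "1m48.669s")   # non-NEON vs NEON on Odroid C2
print(f"x86-64 SSE: {x86:.2f}x, AArch64 NEON: {arm:.2f}x")
```

This gives a bit more than 2x for the new SSE app and roughly 1.27x for NEON over non-NEON on the C2.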

At the moment I do not have the BOINC libraries ready for ARM64, so there is no app for it yet. I am going to add it later too. If you have them you can compile it yourself; the source code is in the BitBucket repo on the "additional_optimizations" branch.

If you are curious how I managed to make it even faster, here is the answer. I made the following changes:
- changed the way data is stored, which allowed me to replace unaligned load/store instructions with aligned ones;
- removed unnecessary memory writes;
- changed the calculations a bit - replaced the square root of a product with the product of square roots, so I was able to calculate these square roots first and then use the results multiple times;
- removed some unnecessary code and provided templated versions of the most performance-critical function, so the compiler could optimize it further.
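The square-root change in the list above can be illustrated with a small sketch. It is not the app's actual code: the function names and the pairwise shape of the calculation are assumptions chosen only to show the idea, and the identity sqrt(x*y) = sqrt(x)*sqrt(y) holds only for non-negative inputs.

```python
import math

def pairwise_naive(a, b):
    """sqrt(a[i] * b[j]) for every pair: len(a) * len(b) square roots."""
    return [[math.sqrt(x * y) for y in b] for x in a]

def pairwise_precomputed(a, b):
    """Same values via sqrt(x) * sqrt(y): only len(a) + len(b) square roots."""
    roots_a = [math.sqrt(x) for x in a]   # each root computed once...
    roots_b = [math.sqrt(y) for y in b]
    return [[rx * ry for ry in roots_b] for rx in roots_a]  # ...and reused

a = [1.0, 4.0, 9.0]
b = [16.0, 25.0]
print(pairwise_naive(a, b))
print(pairwise_precomputed(a, b))
```

The two variants can differ in the last bits because of floating-point rounding, so results should be compared with a tolerance rather than for exact equality.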
