41)
Message boards :
Number crunching :
Unknown error number (0xffffffffc000001d)
(Message 862)
Posted 6 Feb 2017 by [B@P] Daniel
"That might be. I'll try that out, just to confirm. Some of my machines are used for nothing else than boinc and therefore not updated to prevent problems with the updates (unplanned restarts, machines not coming up again)."
This is very bad from a security perspective. New security holes are found every month, and computers connected to the Internet are constantly scanned for these holes in attempts to turn them into zombies in some botnet. There is even malware which scans the local network and tries to infect computers there. Without an antivirus and a firewall that blocks all incoming traffic, such a computer will sooner or later be infected.
"Can the standard application get a check against this before it uses AVX? This will make it more stable."
Yes, it is possible. The app could check this and print a user-friendly error. However, it will still exit with a failure status; an unsupported instruction set is usually not something the user can fix without upgrading the CPU. I will include such a check when I release a new app. |
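The kind of check discussed in this post can be illustrated with a short sketch. It is not the project's code: the real app is native and would more likely query the CPU directly. The sketch assumes a Linux-style /proc/cpuinfo text with a "flags" line, and the function names `cpu_flags` and `check_required` are made up for the example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 Linux the features are listed on a line starting with "flags".
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def check_required(cpuinfo_text, required=("avx",)):
    """Return a user-friendly error message, or None if nothing is missing."""
    available = cpu_flags(cpuinfo_text)
    missing = [flag for flag in required if flag not in available]
    if missing:
        return "This app needs CPU support for: " + ", ".join(missing)
    return None


# Hypothetical sample input; on a real Linux host the text would come
# from reading /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_1\n"
print(check_required(sample))
```

As the post says, the app would still have to exit with a failure status after printing such a message; the check only replaces the cryptic error code with something readable.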
42)
Unknown error number (0xffffffffc000001d)
(Message 855)
Posted 6 Feb 2017 by [B@P] Daniel
You are right, I missed this detail. SP1 is required for AVX. |
43)
Unknown error number (0xffffffffc000001d)
(Message 852)
Posted 5 Feb 2017 by [B@P] Daniel
"Sorry to tell you... but it crashes exactly like the previous AVX version."
Strange. I suspect that my app uses some rarely used AVX instruction which your CPU does not recognize because of a bug in its microcode, so it reports the error "illegal instruction". Other projects apparently do not use it, so they work fine. Please try updating the microcode in your CPU. This update should work for you: https://support.microsoft.com/pl-pl/help/3064209/june-2015-intel-cpu-microcode-update-for-windows. It can also be done from Linux: https://askubuntu.com/questions/545925/how-to-update-intel-microcode-properly/546056. You can also try updating the BIOS; CPU microcode updates may be distributed this way too. |
44)
Unknown error number (0xffffffffc000001d)
(Message 850)
Posted 4 Feb 2017 by [B@P] Daniel
Well, I am puzzled. It should work for you, but for some reason it crashes. I checked the compilation options and they should be fine; according to various pages, the enabled instruction sets should be supported by your CPU. I have created an app specifically tailored for Ivy Bridge CPUs (compiled with -march=ivybridge -mtune=ivybridge). Please try it and let me know if it works or still crashes. https://bitbucket.org/sirzooro/pc-boinc/downloads/TN-Grid.windows-x86-64-ivybridge-v1.1.zip |
45)
Unknown error number (0xffffffffc000001d)
(Message 847)
Posted 4 Feb 2017 by [B@P] Daniel
"http://gene.disi.unitn.it/test/result.php?resultid=6502952"
Thanks. Your CPU is an Intel Ivy Bridge, so it should have working AVX. I checked these WUs. They worked for some time before crashing, so it looks like they were able to execute AVX instructions for a while. Apps on CPUs without AVX usually crash within a few seconds. I noticed one thing: the second WU ran for over 11 hours before it finally crashed, which is strange. Do you have similar problems with apps from other projects? I suspect that your CPU may be overheating or that you have some other hardware issue, e.g. with memory. Please try to stress-test your PC; here is a list of some software to do this: https://www.raymond.cc/blog/test-system-stability-by-putting-heavy-load-on-system-resources/. And here are memory testers: http://www.howtogeek.com/260813/how-to-test-your-computers-ram-for-problems/ |
46)
Unknown error number (0xffffffffc000001d)
(Message 844)
Posted 4 Feb 2017 by [B@P] Daniel Hi! Unfortunately your computers are hidden, so I cannot check details. Please send me a link to some AVX WU which crashed for you. |
47)
Optimization
(Message 843)
Posted 4 Feb 2017 by [B@P] Daniel
Which version did you use for your 3770k? Your CPU supports instructions up to AVX: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-3770K.html. It does not have FMA or AVX2, so the FMA and AVX2 apps will crash there. You can try the AVX version; it should work for you. You can also use CPU-Z to check this. |
48)
Unknown error number (0xffffffffc000001d)
(Message 837)
Posted 30 Jan 2017 by [B@P] Daniel
Recently I investigated a similar case here: http://gene.disi.unitn.it/test/forum_thread.php?id=135&postid=817#817. Someone tried to run an AVX app on a non-AVX CPU. When I googled this error code (truncated to 32 bits, 0xc000001d), I found pages where people also had this problem when they tried to run SSE apps on non-SSE CPUs.
Edit: I found that it is possible to disable AVX support in Windows by executing the command "bcdedit /set xsavedisable 1". Maybe the first person did this for some reason (overheating?). The second CPU (Xeon X5680) supports up to SSE 4.2, so maybe he/she tried to run an AVX or FMA app on it. |
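The truncation mentioned in this post is a plain bit mask. The sketch below shows it, together with one plausible reason for the leading ff bytes: a 32-bit status code with its top bit set looks like this after being sign-extended to 64 bits. That explanation is an assumption, not something stated in the thread. The value 0xC000001D itself is the Windows status code STATUS_ILLEGAL_INSTRUCTION.

```python
def truncate_to_32_bits(code):
    """Keep only the low 32 bits of an error code."""
    return code & 0xFFFFFFFF


def sign_extend_32_to_64(code32):
    """Interpret a 32-bit value as signed, then view it as unsigned 64-bit."""
    signed = (code32 ^ 0x80000000) - 0x80000000  # two's-complement reinterpretation
    return signed & 0xFFFFFFFFFFFFFFFF


reported = 0xFFFFFFFFC000001D  # the number from the thread title
print(hex(truncate_to_32_bits(reported)))
print(hex(sign_extend_32_to_64(0xC000001D)))
```

Searching for the truncated 32-bit form is what turns up the Windows documentation and other reports of the same crash.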
49)
Optimization
(Message 828)
Posted 26 Jan 2017 by [B@P] Daniel
"I found a host that is not able to run the Linux x64 version because of missing shared libraries (http://gene.disi.unitn.it/test/show_host_detail.php?hostid=2990), too old kernel? (3.2.0-4-amd64). The error is version `GLIBC_2.15' not found, version `GLIBC_2.16' not found."
If ldd no longer shows these libs, it should be OK. However, I am a bit reluctant about doing this: this particular kernel version was used by Debian Wheezy, which is now past its End of Life. This means that there are no new updates for this system version, in particular no security updates for new security holes. Not providing an app which works there may convince the user to upgrade the system to a newer version which will be supported for the next few years.
I played with the new app a bit, trying to optimize it more. It turned out that using AVX only for calculating square roots was slower than using SSE only. I also tried to use values from only one half of the matrix, but this slowed the app down too. So it does not make sense to apply either of these changes.
I also measured the run time of the app with SSE vectors on a Haswell CPU, compiled with different instruction sets:
SSE2  20.766
AVX   19.933
FMA   20.163
AVX2  20.355
It turned out that the AVX version is faster than SSE2, probably thanks to some SSE3+ instructions or to AVX being used in code automatically vectorized by gcc. So this app version should be provided by the project. To my surprise, the FMA app is slower than AVX, and I do not have a good explanation for this now. The AVX2 version is also slower. It would be good if someone with a new CPU like Skylake could perform some tests and post results here; maybe it will work better on such new CPUs. If not, the existing versions (SSE2, AVX, FMA) would be sufficient. I have uploaded new versions of the AVX and AVX2 apps for Linux and Windows; feel free to download and run them. |
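To put the four timings in this post side by side, here is a small sketch that expresses each build relative to the SSE2 build. It takes the timings as decimal numbers (20.766 and so on); the post does not state the unit, and the helper name `percent_faster` is made up for the example.

```python
# Run times from the post, one entry per build of the app.
timings = {"SSE2": 20.766, "AVX": 19.933, "FMA": 20.163, "AVX2": 20.355}


def percent_faster(baseline, candidate):
    """Positive result: candidate needs that many percent less time than baseline."""
    return (baseline - candidate) / baseline * 100.0


for name, run_time in timings.items():
    gain = percent_faster(timings["SSE2"], run_time)
    print(f"{name:5s} {run_time:7.3f}  {gain:+.2f}% vs SSE2")
```

All three newer builds come out a few percent ahead of SSE2 here, with AVX in front, which matches the ordering described in the post.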
50)
Optimization
(Message 821)
Posted 25 Jan 2017 by [B@P] Daniel
Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny. I was talking about the test data file; it contains only one block. |
51)
Optimization
(Message 817)
Posted 25 Jan 2017 by [B@P] Daniel
Hmm, this CPU supports up to SSE4.1, so the AVX and FMA apps will not work there. You can check the features supported by a CPU using CPU-Z. |
52)
Optimization
(Message 812)
Posted 25 Jan 2017 by [B@P] Daniel
"The new FMA-App is great."
Great news :)
"Going from the 0.10 AVX version on my dual E5-2620 machine it used to average around 8-10K CPU seconds, with the newer app that is cut to around 5-6K secs."
It's memory bandwidth. The app uses about 1.3 MB of global data and accesses it in a way which looks random to the memory controller, so it cannot prefetch data efficiently. When fewer apps run in parallel, each of them has more CPU cache available, so there is a bigger chance that the required data will be there. The new app uses almost twice as much memory as the old one to store some precalculated values. Half of this data is in fact a transposed copy of the other half of the matrix, so it should be possible to optimize this. I will add this to my TODO list :) |
53)
Optimization
(Message 808)
Posted 24 Jan 2017 by [B@P] Daniel
"I found Haswell Xeon machine where I could perform some tests. Here are results of running app on test data, averaged over 10 runs:"
I am going to modify the code a bit to use AVX for div/sqrt calculations and SSE for the rest. This should improve performance a bit, so in the end it should be a bit faster than this SSE+FMA version. We will see how much faster it is when I have it ready. Also keep in mind that the WUs sent by the server are now 100 times longer, and we can expect that they will be 200 times longer, so the actual time reduction per WU will not be so tiny. |
54)
Optimization
(Message 805)
Posted 24 Jan 2017 by [B@P] Daniel
I have uploaded a new version of the FMA app for 64-bit Windows and Linux. It uses shorter (SSE) vectors, so it should now be faster than the SSE version. |
55)
Optimization
(Message 804)
Posted 24 Jan 2017 by [B@P] Daniel
I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:
SSE2      20.600
AVX       21.441
AVX+FMA   21.361
AVX2+FMA  21.618
As you can see, the SSE2 version is the fastest one. I found the reason for this: it is instruction-level parallelism, which means that the CPU is able to execute multiple instructions at the same time. The previous app version mixed fast (add/sub/mul) and slow (div/sqrt) operations, so new CPUs were able to get some advantage from this; in addition, they were able to calculate two div/sqrt ops faster using 1 AVX instruction than using 2 SSE2 instructions. As a result they were able to run the app faster. The new app version performs the slow calculations (div/sqrt) first, then uses the calculated values multiple times in fast ops (sub/mul). It looks like CPUs are able to execute two SSE2 instructions in the same time as 1 AVX instruction.
The application processes a diagonal half of the matrix, so every row has a different length. The app always loads 2 (SSE2 app) or 4 (AVX app) values and processes them. This saves some time, at the cost of calculating values which will be discarded later. Because of this the SSE2 version is better: fewer results are finally discarded, so the app performs better. I tried to prepare an app version which used AVX for vectors of length 3 or 4, and SSE2 if there were 1 or 2 elements left. However, this app version was even slower:
AVX-new       23.059
AVX+FMA-new   22.923
AVX2+FMA-new  22.512
I also tested one more idea: SSE2+FMA. It turned out that it is faster than plain SSE2:
SSE2+FMA  20.465
So it looks like it makes sense to prepare an app which processes data using SSE vectors to get the most out of instruction-level parallelism, and uses FMA instructions which operate on such vectors to save some time on add/sub operations. There is also one more thing worth exploring: using AVX for div/sqrt calculations only; they are faster than SSE, so the app should be a bit faster.
And one more thing: the results for the AVX2 app are mixed. For the original app they are worse, but for the modified one they are better. More testing is needed here to determine whether it makes sense to have such an app version too. |
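The remark about discarded results can be made concrete with a little counting. The sketch below assumes a triangular matrix whose rows have lengths 1 to n, and that each row is processed in whole vectors of 2 or 4 values; the real data layout of the app is not described in the post, so both the row lengths and the function names are illustrative only.

```python
def padded_lanes(row_length, width):
    """Number of values computed when a row is processed in full vectors."""
    full_vectors = -(-row_length // width)  # ceiling division
    return full_vectors * width


def wasted_lanes(n_rows, width):
    """Results computed and then discarded, for rows of length 1..n_rows."""
    return sum(padded_lanes(length, width) - length
               for length in range(1, n_rows + 1))


n = 8  # small example size, chosen arbitrarily
useful = n * (n + 1) // 2
for width in (2, 4):
    print(f"width {width}: {useful} useful, {wasted_lanes(n, width)} discarded")
```

Under these assumptions the width-4 variant throws away three times as many results as the width-2 variant, which points the same way as the post's explanation for the SSE2 build doing better.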
56)
Optimization
(Message 796)
Posted 23 Jan 2017 by [B@P] Daniel
Please try running the test_run2.sh script, which works on some real data. The test_run.sh script does not show the performance improvement provided by NEON SIMD instructions. Actually, the NEON app is even a bit slower than the non-NEON 64-bit version when running that script. I posted results from running test_run2.sh on my C2 earlier, take a look at them :) |
57)
Optimization
(Message 794)
Posted 23 Jan 2017 by [B@P] Daniel
"I'm draining the queue on one of my AMD FX-8320E machines now. Will install the fixed fma version and test."
This app was limited by memory speed, so the SSE version may be faster. The older version executed more slow calculations (square roots, divisions), and the loops for AVX ran fewer times because of the longer vectors, so AVX and FMA were faster. Now, with the reduced number of these slow calculations and with unrolled loops, it may be that SSE is faster. I could test it only on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We will see; I hope people will post their results for the new apps here.
BTW, I have created apps for 32-bit Windows, in SSE and non-SSE versions. Please let me know if they work on Windows XP. Cygwin dropped Windows XP support some time ago, so I am not sure whether the current 32-bit Cygwin version is able to create binaries for WinXP. The previous Win32 apps were compiled on WinXP with an older Cygwin version, so they worked fine. If the new ones do not work, I will have to download an older 32-bit Cygwin version; fortunately there is a mirror which still holds it. |
58)
Optimization
(Message 790)
Posted 23 Jan 2017 by [B@P] Daniel
"OK, tried SSE2 once again and it is working. :)"
Good to hear this :) I have uploaded an ARM 32-bit version. It turned out that on my Odroid XU4 it is two times faster than the non-NEON app on the Odroid C2, and 1.5 times faster than the NEON one :-O
real 1m10.599s
user 1m9.830s
sys 0m0.345s
|
59)
Optimization
(Message 786)
Posted 23 Jan 2017 by [B@P] Daniel
Thanks Daniel! Hi, thank You.
I found out why the FMA version crashed. I compiled the Windows FMA version with the target CPU architecture set to Haswell, and gcc enabled AVX2, which is not supported by AMD Bulldozer CPUs, so the app crashed with the message "Illegal Instruction". But the SSE2 version crash is surprising; it is compiled with the same options (target architecture: core2). Could you check again to make sure that it crashes, and provide me a link to the failed task? I would like to check the error message.
I have recompiled and uploaded the FMA Windows version; now it does not use AVX2, so it should work fine. I also uploaded separate AVX2 versions for Windows and Linux 64-bit. Could someone with a sufficiently new CPU run some benchmarks with test data on the AVX2 and FMA versions? I wonder if there is any performance difference between the AVX2 and FMA versions. |
60)
Optimization
(Message 780)
Posted 23 Jan 2017 by [B@P] Daniel
Surprise! I have just released a new optimized app version (Opti v1.1), 2 times faster than the previous optimized one (now the official one) :). It can be downloaded from the same place as the previous ones: https://bitbucket.org/sirzooro/pc-boinc/downloads. At the moment only 64-bit versions for Windows and Linux are available. I will add a 32-bit Windows version later.
The app_info.xml file provided together with the app does not specify a plan class, so make sure you finish or abort your tasks. Otherwise you will lose them when you install my app! This file also specifies a new app version (10), so make sure you have no tasks if you are still running the previous app installed manually; you will also lose your tasks if you replace that file.
Here are the results for test data from the previous and new SSE Linux versions:
Previous version:
real 0m54.472s
user 0m52.358s
sys 0m0.045s
New version:
real 0m26.208s
user 0m24.142s
sys 0m0.033s
I was also able to add code which uses NEON instructions on ARM 64-bit (AARCH64). Here are the results of running the non-NEON and NEON apps on test data on my Odroid C2:
non-NEON:
real 2m18.336s
user 2m18.180s
sys 0m0.080s
NEON:
real 1m48.669s
user 1m48.600s
sys 0m0.060s
At the moment I do not have BOINC libraries ready for ARM64, so there is no app for it yet. I am going to add it later too. If you have them, you can compile it yourself; the source code is in the BitBucket repo on the "additional_optimizations" branch.
If you are curious how I managed to make it even faster, here is the answer. I made the following changes:
- changed the way data is stored, which allowed me to replace unaligned load/store instructions with aligned ones;
- removed unnecessary memory writes;
- changed the calculations a bit: replaced the square root of a product with a product of square roots, so I was able to calculate these square roots first and then use the results multiple times;
- removed some unnecessary code and provided templated versions of the most performance-critical function, so the compiler could optimize it further. |