Optimization

Message boards : Number crunching : Optimization

Previous · 1 . . . 3 · 4 · 5 · 6 · 7 · 8 · 9 . . . 10 · Next
Krümel
Joined: 31 Oct 16
Posts: 15
Credit: 1,268,515
RAC: 590
Germany
Message 800 - Posted: 24 Jan 2017, 8:49:55 UTC
Last modified: 24 Jan 2017, 8:55:04 UTC

Avg. runtime of the new FMA app on an FX 8320 @ 3.5 GHz (1333 MHz RAM) = 55 minutes.
AVX app = 58 minutes.

Now testing SSE2...

Krümel
Joined: 31 Oct 16
Posts: 15
Credit: 1,268,515
RAC: 590
Germany
Message 801 - Posted: 24 Jan 2017, 10:43:30 UTC - in response to Message 800.

SSE2 = 52 minutes

Crystal Pellet
Joined: 1 Jan 17
Posts: 2
Credit: 1,147,073
RAC: 21
Netherlands
Message 802 - Posted: 24 Jan 2017, 10:49:48 UTC - in response to Message 794.

I could only test on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We'll see; I hope people will post their results for the new apps here.

Sandy Bridge i7-2600 with the slow AVX, running 8 threads (hyper-threading on). The SSE2 application (v1.1) is a bit faster.

I tested under the same conditions with 4 tasks running simultaneously (on the other 4 threads: 1 LHC VM and 3 WCG betas).
Average CPU time of the AVX application: 1.22 hours
Average CPU time of the SSE2 application: 1.19 hours

Profile Beyond
Joined: 2 Nov 16
Posts: 24
Credit: 5,422,091
RAC: 24,298
United States
Message 803 - Posted: 24 Jan 2017, 17:52:44 UTC - in response to Message 799.
Last modified: 24 Jan 2017, 17:53:12 UTC

This app was limited by memory speed, so the SSE version may be faster. The older version executed more of the slow calculations (square roots, divisions), and the AVX loops executed fewer times because of the longer vectors, so AVX and FMA were faster. Now, with a reduced number of these slow calculations and with unrolled loops, it may be that SSE is faster. I could only test on Sandy Bridge CPUs, which have slow AVX division and square roots, so the AVX app was slower there too. On newer CPUs with faster AVX and memory, things may work differently and the AVX/FMA versions may be faster. We'll see; I hope people will post their results for the new apps here.

I am seeing a little less than 34 min/workunit on Win 7 64-bit with the newest AVX version, on an i3-4330 (Haswell) running two instances.

Would it be possible to test the newest SSE2 app on your machine? I bet it's faster.

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 804 - Posted: 24 Jan 2017, 18:18:22 UTC
Last modified: 24 Jan 2017, 18:33:15 UTC

I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

SSE2      20.600
AVX       21.441
AVX+FMA   21.361
AVX2+FMA  21.618


As you can see, the SSE2 version is the fastest one. I found the reason for this: instruction-level parallelism, which means the CPU is able to execute multiple instructions at the same time. The previous app version mixed fast (add/sub/mul) and slow (div/sqrt) operations, so newer CPUs could take some advantage of this, and they could also compute two div/sqrt operations faster using one AVX instruction than using two SSE2 instructions. As a result they ran that app faster.

The new app version performs the slow calculations (div/sqrt) first, then uses the calculated values multiple times in fast operations (sub/mul). It looks like CPUs can execute two SSE2 instructions in the same time as one AVX instruction. The application processes the diagonal half of a matrix, so every row has a different length. The app always loads 2 values (SSE2 app) or 4 values (AVX app) and processes them. This saves some time, at the cost of computing values which are discarded later. Because of this the SSE2 version is better: fewer results end up discarded, so the app performs better.

I tried to prepare an app version which used AVX for vectors of length 3 or 4, and SSE2 when only 1 or 2 elements were left. However, this version was even slower:

AVX-new       23.059
AVX+FMA-new   22.923
AVX2+FMA-new  22.512


I also tested one more idea: SSE2+FMA. It turned out to be faster than plain SSE2:

SSE2+FMA 20.465


So it looks like it makes sense to prepare an app which processes data using SSE-width vectors, to get the most out of instruction-level parallelism, and which uses the FMA instructions that operate on such vectors to save some time on add/sub operations. There is also one more thing worth exploring: using AVX for the div/sqrt calculations only, since they are faster than their SSE counterparts, so the app should be a bit faster.

And one more thing: the results for the AVX2 app are mixed. For the original app they are worse, but for the modified one they are better. More testing is needed to determine whether it makes sense to have such an app version too.

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 805 - Posted: 24 Jan 2017, 19:47:43 UTC
Last modified: 24 Jan 2017, 19:49:09 UTC

I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE-width) vectors, so it should now be faster than the SSE version.

Profile Beyond
Joined: 2 Nov 16
Posts: 24
Credit: 5,422,091
RAC: 24,298
United States
Message 806 - Posted: 24 Jan 2017, 20:22:42 UTC - in response to Message 804.

I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

SSE2      20.600
SSE2+FMA  20.465

Impressive work. Thanks! It looks like SSE2+FMA is only 0.659% faster. Is that even worth having another version?

Col323
Joined: 23 Nov 16
Posts: 6
Credit: 1,094,778
RAC: 690
Angola
Message 807 - Posted: 24 Jan 2017, 20:23:45 UTC

I just wanted to say a big thank you! I ran a few WUs using your FMA optimizations released on the 23rd, and they did cut my runtimes in half! Now I'm emptying my queue to see what this latest tweak can shave off of that.

I really appreciate you digging into this. I love seeing us go faster with the same amount of effort.

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 808 - Posted: 24 Jan 2017, 20:33:27 UTC - in response to Message 806.
Last modified: 24 Jan 2017, 20:35:02 UTC

I found a Haswell Xeon machine where I could perform some tests. Here are the results of running the app on test data, averaged over 10 runs:

SSE2      20.600
SSE2+FMA  20.465

Impressive work. Thanks! It looks like SSE2+FMA is only 0.659% faster. Is that even worth having another version?

I am going to modify the code a bit to use AVX for the div/sqrt calculations and SSE for the rest. That should improve performance a bit, so in the end it should be slightly faster than this SSE+FMA version. We'll see how much faster once I have it ready. Also keep in mind that the WUs sent by the server now are 100 times longer, and we can expect them to become 200 times longer, so the actual time reduction per WU will not be so tiny.

Krümel
Joined: 31 Oct 16
Posts: 15
Credit: 1,268,515
RAC: 590
Germany
Message 809 - Posted: 24 Jan 2017, 21:32:35 UTC - in response to Message 805.

I have uploaded a new version of the FMA app for Windows and Linux 64-bit. It uses shorter (SSE-width) vectors, so it should now be faster than the SSE version.


I'll give it a try tomorrow. :)

Woof
Joined: 16 Jan 17
Posts: 3
Credit: 643,947
RAC: 0
Message 810 - Posted: 25 Jan 2017, 4:19:01 UTC

Going from the 0.10 AVX version: on my dual E5-2620 machine it used to average around 8-10K CPU seconds; with the newer app that is cut to around 5-6K seconds.

Interestingly, if I lower the number of tasks running at once to something like 4-6 at a time, instead of using all 24 cores, it drops to nearly 3-4K seconds.

I wonder if this is related to boost clocks or to memory. I've run into some weirdness with memory-bound workloads on dual-socket motherboards, because memory may get allocated on the wrong node, and while the links between CPUs are quick nowadays, it's still a performance penalty.

Anyway, simply amazing work!

Krümel
Joined: 31 Oct 16
Posts: 15
Credit: 1,268,515
RAC: 590
Germany
Message 811 - Posted: 25 Jan 2017, 7:24:13 UTC

The new FMA app is great.
Times dropped from 55 minutes to 46 minutes!

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 812 - Posted: 25 Jan 2017, 7:50:26 UTC - in response to Message 811.
Last modified: 25 Jan 2017, 7:51:58 UTC

The new FMA app is great.
Times dropped from 55 minutes to 46 minutes!

Great news :)

Going from the 0.10 AVX version: on my dual E5-2620 machine it used to average around 8-10K CPU seconds; with the newer app that is cut to around 5-6K seconds.

Interestingly, if I lower the number of tasks running at once to something like 4-6 at a time, instead of using all 24 cores, it drops to nearly 3-4K seconds.

I wonder if this is related to boost clocks or to memory. I've run into some weirdness with memory-bound workloads on dual-socket motherboards, because memory may get allocated on the wrong node, and while the links between CPUs are quick nowadays, it's still a performance penalty.

Anyway, simply amazing work!

It's memory bandwidth. The app uses about 1.3 MB of global data and accesses it in a way that looks random to the memory controller, so it cannot prefetch data efficiently. When fewer app instances run in parallel, each one has more CPU cache available, so there is a bigger chance that the required data will already be there.

The new app uses almost twice as much memory as the old one, to store some precalculated values. Half of this data is in fact a transposed copy of the other half of the matrix, so it should be possible to optimize this. I will add it to my TODO list :)

koschi
Joined: 22 Oct 16
Posts: 24
Credit: 3,053,098
RAC: 44
Germany
Message 813 - Posted: 25 Jan 2017, 8:51:51 UTC

Thanks Daniel, the progress and the explanations of how you achieved it are pretty amazing!

Dave Peachey
Joined: 6 Nov 16
Posts: 7
Credit: 2,364,522
RAC: 21
United Kingdom
Message 814 - Posted: 25 Jan 2017, 10:19:45 UTC
Last modified: 25 Jan 2017, 10:22:09 UTC

Hmm,

I tried the updated FMA v1.1 app on my quad-core Q9550-based machine (Windows 7 build) and it threw immediate computation errors on all the WUs it tried to process. Needless to say, I terminated the BOINC session as soon as I could.

The Stderr output file for all seven of the errored WUs shows the same message:
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741795 (0xc000001d)
</message>
]]>
which, I'm afraid, means nothing to me (ah, Vienna), but may help someone to identify the problem.

For now, I'm sticking with the SSE2 version, which is halving WU processing times on both of my machines ... and I can't complain too much about that.

Dave

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 817 - Posted: 25 Jan 2017, 11:16:18 UTC - in response to Message 814.

Hmm,

I tried the updated FMA v1.1 app on my quad-core Q9550-based machine (Windows 7 build) and it threw immediate computation errors on all the WUs it tried to process. Needless to say, I terminated the BOINC session as soon as I could.

The Stderr output file for all seven of the errored WUs shows the same message:
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741795 (0xc000001d)
</message>
]]>
which, I'm afraid, means nothing to me (ah, Vienna), but may help someone to identify the problem.

For now, I'm sticking with the SSE2 version, which is halving WU processing times on both of my machines ... and I can't complain too much about that.

Dave

This CPU supports only up to SSE4.1, so the AVX and FMA apps will not work on it (exit code 0xC000001D is Windows' STATUS_ILLEGAL_INSTRUCTION). You can check the features supported by your CPU with CPU-Z.

Dave Peachey
Joined: 6 Nov 16
Posts: 7
Credit: 2,364,522
RAC: 21
United Kingdom
Message 819 - Posted: 25 Jan 2017, 11:40:45 UTC - in response to Message 817.
Last modified: 25 Jan 2017, 11:41:05 UTC

This CPU supports only up to SSE4.1, so the AVX and FMA apps will not work on it (exit code 0xC000001D is Windows' STATUS_ILLEGAL_INSTRUCTION). You can check the features supported by your CPU with CPU-Z.

Daniel,

Ah, yes, I see that now - it didn't occur to me that there might be a technology gap! It should, however, work on my larger machine with the i7-6900K CPU, so I'll run down the WU cache there and see what happens.

Cheers

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 320
Credit: 16,278,261
RAC: 4,455
Italy
Message 820 - Posted: 25 Jan 2017, 15:14:21 UTC - in response to Message 808.

Also keep in mind that the WUs sent by the server now are 100 times longer, and we can expect them to become 200 times longer, so the actual time reduction per WU will not be so tiny.

Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)

Profile Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 80
Credit: 2,202,886
RAC: 0
Poland
Message 821 - Posted: 25 Jan 2017, 15:26:15 UTC - in response to Message 820.

Also keep in mind that the WUs sent by the server now are 100 times longer, and we can expect them to become 200 times longer, so the actual time reduction per WU will not be so tiny.

Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)

I was talking about the test data file; it contains only one block.

Dave Peachey
Joined: 6 Nov 16
Posts: 7
Credit: 2,364,522
RAC: 21
United Kingdom
Message 822 - Posted: 25 Jan 2017, 16:24:16 UTC - in response to Message 820.

Well, I only doubled the size of the workunits (starting at 2016-12-30, 100 'blocks' instead of 50)

And might we see another doubling of the 'production' WUs to 200 'blocks' in the not-too-distant future, given that Daniel has enabled us, once again, to process twice as many WUs as we were doing before the last release of his optimised app?

Perhaps I shouldn't be putting such ideas into your head! ;-)






Copyright © 2017 CNR-TN & UniTN