Gene application for GNU/Linux on ARM devices
Message boards : Number crunching : Gene application for GNU/Linux on ARM devices

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 544 - Posted: 13 Feb 2016, 15:59:36 UTC

I have compiled the application for ARM devices running GNU/Linux (Raspberry Pi and co.).

Here are three versions and an app_info.xml to run them in BOINC as an anonymous platform:

vfp
vfpv3
vfpv4

app_info.xml


Keep the app that matches the FP unit of your device's SoC (for example, vfp for the classic ARMv6 Raspberry Pi, vfpv4 for the newer Raspberry/Banana/Orange Pi boards).
I found very little (next to no) performance gain among the three, so you can just take the first one and it will be fine on all devices.
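
A rough sketch of how the matching variant could be picked automatically, assuming a 32-bit ARM kernel that lists vfp, vfpv3 and vfpv4 on the "Features" line of /proc/cpuinfo. The function names cpu_features and pick_variant are made up for illustration and are not part of the repo.

```shell
# Print the "Features" flags of the first CPU listed in /proc/cpuinfo.
cpu_features() {
    sed -n 's/^Features[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo | head -n 1
}

# $1: space-separated CPU feature flags.
# Prints the most capable of the three variants that the flags allow.
pick_variant() {
    case " $1 " in
        *" vfpv4 "*) echo vfpv4 ;;
        *" vfpv3 "*) echo vfpv3 ;;
        *" vfp "*)   echo vfp ;;
        *)           echo unknown ;;   # e.g. a 64-bit kernel reporting "fp asimd"
    esac
}

# Example on the device itself:
#   pick_variant "$(cpu_features)"
```

Since the three builds perform almost the same, falling back to the plain vfp build is a reasonable choice whenever the result is unclear.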

You also need the app_info.xml.
The name of the application file has to match the name inside app_info.xml, so rename it to just "pc" (or modify app_info.xml).

Attach your device to the project, copy both files into the project directory (usually /var/lib/boinc-client/projects/gene.disi.unitn.it_test/) and make the file pc executable.
Restart BOINC and it should start getting work from TN-Grid.
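
The copy, rename and chmod steps above can be sketched as a small helper. The function name install_app and the downloaded file name in the example are hypothetical, and the restart command assumes a systemd unit called boinc-client, which may differ on your distribution.

```shell
# $1: downloaded application binary
# $2: app_info.xml
# $3: BOINC project directory
install_app() {
    cp "$1" "$3/pc" &&                  # rename to "pc" to match app_info.xml
    cp "$2" "$3/app_info.xml" &&
    chmod +x "$3/pc"                    # the client cannot start a non-executable file
}

# Example (run as root or with sudo; file name and service name are assumptions):
#   install_app ./pc_armv6zk_vfp ./app_info.xml \
#       /var/lib/boinc-client/projects/gene.disi.unitn.it_test
#   systemctl restart boinc-client
```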



If you want to test all three, there is a testing script on GitHub with material provided by Francesco. Clone the whole project:

git clone https://github.com/sorcrosc/rpi-boinc-ap

Then go to the gene_pc directory and run the script (stop BOINC computation first):
cd rpi-boinc-ap/gene_pc/
./test_run.sh

This will give you a short timed run of all three apps. Let me know if you see a noticeable difference.
____________

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 619 - Posted: 4 Dec 2016, 11:56:04 UTC - in response to Message 544.

Only small differences on an Odroid C2 @ 1.75GHz...

root@odroidc2-1:~/rpi-boinc-ap/gene_pc# ./test_run.sh
bin/pc_armv6zk_vfp

real 0m24.204s
user 0m22.080s
sys 0m0.120s

bin/pc_armv7_vfpv3

real 0m22.814s
user 0m20.740s
sys 0m0.090s

bin/pc_armv7_vfpv4

real 0m23.793s
user 0m21.690s
sys 0m0.100s

Is the source available? Any chance of compiling it, e.g. for 64-bit?

Thanks!

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 622 - Posted: 4 Dec 2016, 18:40:42 UTC - in response to Message 619.

Thank you for testing, koschi.

The project source code is available as announced here.

I am thinking of getting a 64-bit board, maybe an Odroid C2 or a HiKey, and I'll try to compile for it, but I really don't know whether there will be any gain because I'm not a programmer.
____________

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 623 - Posted: 4 Dec 2016, 20:53:18 UTC

Yep, thanks. Shortly after posting, I stumbled over the thread and managed to compile the source. Unfortunately, the test run times are in the 8-9 min range. They tend to get worse when specifying matching -march and -mtune for the C2's A53 cores.

I'm neither a programmer nor an advanced compilation pro, so I'll leave it like this for now.

Did you change any CFLAGS etc.? The C2 exposes only very limited information in /proc/cpuinfo; maybe that's part of the problem: the build can't determine the CPU's capabilities and account for them...

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 624 - Posted: 4 Dec 2016, 21:30:27 UTC
Last modified: 4 Dec 2016, 21:31:23 UTC

I used some -m flags, but they don't make much difference. See the build scripts in my repo.
See also the Makefile: I added the -fsigned-char flag there. This is very important; without it the code doesn't work. I suspect this is your problem :)
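
Some background on that flag: on ARM GNU/Linux, plain char is unsigned by default, while on x86 it is signed, which is presumably why code written on x86 needs -fsigned-char here. A small sketch to see the effect on your own device follows; the function name char_signedness is made up for illustration, and it assumes gcc (or whatever $CC points to) is installed.

```shell
# Usage: char_signedness [extra compiler flags...]
# Compiles a tiny C program and prints whether plain char behaves as
# "signed" or "unsigned" with the given flags.
char_signedness() (
    tmpdir=$(mktemp -d) || exit 1
    cat > "$tmpdir/t.c" <<'EOF'
#include <stdio.h>
int main(void) {
    char c = (char)0xFF;                 /* -1 if char is signed, 255 if not */
    puts(c < 0 ? "signed" : "unsigned");
    return 0;
}
EOF
    "${CC:-gcc}" "$@" -o "$tmpdir/t" "$tmpdir/t.c" && "$tmpdir/t"
    status=$?
    rm -rf "$tmpdir"
    exit $status
)

# Example:
#   char_signedness                  # compiler default for this platform
#   char_signedness -fsigned-char    # what the Makefile in the repo forces
```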
____________

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 628 - Posted: 7 Dec 2016, 22:47:01 UTC - in response to Message 623.

Yep, thanks. Shortly after posting, I stumbled over the thread and managed to compile the source. Unfortunately, the test run times are in the 8-9 min range. They tend to get worse when specifying matching -march and -mtune for the C2's A53 cores.

I'm neither a programmer nor an advanced compilation pro, so I'll leave it like this for now.

Did you change any CFLAGS etc.? The C2 exposes only very limited information in /proc/cpuinfo; maybe that's part of the problem: the build can't determine the CPU's capabilities and account for them...

Have you tried using -march=native -mtune=native? They tell gcc to check the CPU it is running on and enable all supported features. On x86_64 CPUs this sets more flags than simply specifying the correct -march and -mtune, so ARM may also benefit from it.
____________

[VENETO] sabayonino
Joined: 21 Jan 14
Posts: 13
Credit: 2,932,089
RAC: 0
Italy
Message 629 - Posted: 8 Dec 2016, 0:26:43 UTC
Last modified: 8 Dec 2016, 0:37:27 UTC

Run

gcc -march=native -E -v - </dev/null 2>&1 | grep cc1


to see which flags -march=native enables or disables.

These could differ from the cpuinfo flags.

Don't use -march=native on non-ARMv7-?? models.

Cortex-M, Cortex-R and so on may have some different flags that might not be recognized everywhere.
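
A complementary check, sketched under the assumption that your gcc accepts -march=native on your board: dump the predefined macros instead of the cc1 command line and keep only the ARM-related ones (for example __ARM_NEON or __ARM_FP). The helper name arm_macros is made up for illustration.

```shell
# Reads "#define ..." lines on stdin and keeps only ARM/AArch64 feature macros.
arm_macros() {
    grep -E '^#define __(ARM_[A-Z0-9_]+|aarch64__|arm__)' | sort
}

# Example on the device:
#   gcc -march=native -dM -E - </dev/null | arm_macros
```

Comparing this output with and without -march=native shows whether the option actually changes anything for your CPU.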
____________
Powered by Gentoo Linux
Kernel : 4.4.26-gentoo
KDE 16.04.3

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 630 - Posted: 8 Dec 2016, 20:46:01 UTC

Hi, yes, I was definitely using -march=native; about -mtune=native I'm not sure...

root@odroidc2-1:~# gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
/usr/lib/gcc/aarch64-linux-gnu/4.9/cc1 -E -quiet -v -imultiarch aarch64-linux-gnu - -march=native -mlittle-endian -mabi=lp64 -fstack-protector-strong -Wformat -Wformat-security


I had used the compile script linux64_build.sh included in https://bitbucket.org/francesco-asnicar/pc-boinc/, it completed, but produced slow executables.

I will have to adjust your ./linuxarmv7_build.sh for the new compiler and remove -mfpu (which the aarch64 GCC doesn't understand). Let's see what else; this isn't exactly my field of excellence ;-)
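
Dropping -mfpu from a flag list can be sketched as below. The function name strip_mfpu and the example flags are illustrative only; the actual contents of linuxarmv7_build.sh are not shown in this thread.

```shell
# Prints the given compiler flags with any -mfpu=... option removed,
# since the aarch64 GCC rejects that option.
strip_mfpu() {
    printf '%s\n' "$*" | sed -E 's/(^| )-mfpu=[^ ]*//g; s/^ +//; s/ +/ /g'
}

# Example (hypothetical flag list):
#   CFLAGS=$(strip_mfpu -O3 -mfpu=vfpv4 -fsigned-char)
```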

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 632 - Posted: 10 Dec 2016, 21:51:39 UTC

I have just added an armv8 application. It should work only on a 64-bit OS.
I can't edit my first post, so download it here:

armv8
____________

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 714 - Posted: 3 Jan 2017, 9:56:28 UTC
Last modified: 3 Jan 2017, 10:02:24 UTC

Thanks a ton!

So far I only did the test_run.sh and replaced the binary on one system (no completed WUs yet). The benchmark looks very promising though, wow!


root@odroidc2-1:~/rpi-boinc-ap/pc-boinc# ./test_run.sh
Running test with bin/pcv7:

real 0m23.641s
user 0m21.550s
sys 0m0.100s
Running test with bin/pcv8:

real 0m10.500s
user 0m8.430s
sys 0m0.070s


edit:

the DL link was no longer valid, I used https://raw.githubusercontent.com/sorcrosc/rpi-boinc-ap/master/TN-Grid/bin/pc_armv8-a.tgz to get the v8 binary...

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 715 - Posted: 3 Jan 2017, 11:18:45 UTC - in response to Message 714.

Thanks a ton!

So far I only did the test_run.sh and replaced the binary on one system (no completed WUs yet). The benchmark looks very promising though, wow!


root@odroidc2-1:~/rpi-boinc-ap/pc-boinc# ./test_run.sh
Running test with bin/pcv7:

real 0m23.641s
user 0m21.550s
sys 0m0.100s
Running test with bin/pcv8:

real 0m10.500s
user 0m8.430s
sys 0m0.070s


edit:

the DL link was no longer valid, I used https://raw.githubusercontent.com/sorcrosc/rpi-boinc-ap/master/TN-Grid/bin/pc_armv8-a.tgz to get the v8 binary...

Nice numbers :) BTW, you can get an additional speed boost if you use my optimized code. I have created one binary for ARMv7; it is about 30% faster than the original code.
____________

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 716 - Posted: 3 Jan 2017, 11:42:42 UTC

I'm already giving that a try on another C2, thanks ;-)

The 64-bit WU run times unfortunately don't seem to improve that much. While that host's previous average was 18072 seconds/WU, the final run time with the 64-bit app should still be well over 14000 seconds.
The "benchmark" suggests a 55% run time reduction, though.
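
For reference, the two reductions can be computed from the numbers quoted in this thread (23.641 s vs 10.500 s in the test run, 18072 s vs roughly 14000 s per work unit). The helper name reduction_pct is made up for illustration.

```shell
# $1: old time, $2: new time (same unit); prints the reduction in percent.
reduction_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

reduction_pct 23.641 10.500   # test run, pcv7 vs pcv8: prints 55.6
reduction_pct 18072 14000     # work units: prints 22.5
```

Since the real work units take "well over" 14000 seconds, 22.5% is an upper bound for the gain seen there.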

Will report back once the units complete and validate.

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 717 - Posted: 3 Jan 2017, 11:53:07 UTC

I'm replacing all versions with new ones based on your code, Daniel. Only the armv6 one is missing now.
The armv8 build koschi tested is already based on yours and also seems about 5% faster than armv7 on my device.
I also replaced the test data with Daniel's (ec experiments), which shows more realistic results.
____________

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 718 - Posted: 3 Jan 2017, 13:05:55 UTC
Last modified: 3 Jan 2017, 14:04:08 UTC

Ok good, so I was using the old test data...

edit:
The WUs returned after 1 pm UTC on 3 January were done purely with the 64-bit app:
http://gene.disi.unitn.it/test/results.php?hostid=3074&offset=0&show_names=0&state=4&appid=

Looks like a reduction from 5 to 4 hours...

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 722 - Posted: 4 Jan 2017, 8:43:11 UTC

With the new ARMv7 app run times on my RPi3 dropped from 27k to 23k seconds...

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 725 - Posted: 5 Jan 2017, 16:02:13 UTC
Last modified: 5 Jan 2017, 16:41:45 UTC

Here are all the tarballs with the new application based on Daniel's optimized code, with app_info included:

armv6
armv7_vfpv3
armv7_vfpv4
armv8

If you want to find the fastest one, install git and clone the repo:

git clone https://github.com/sorcrosc/rpi-boinc-ap

Then cd to the bin directory and untar, one by one, the apps you want to test:
cd rpi-boinc-ap/TN-Grid/bin
tar -xzf pc_armv7_vfpv3.tgz
tar -xzf pc_armv7_vfpv4.tgz
tar -xzf .....

Go up one directory and run the test script (stop BOINC computation first). It should take 5-10 minutes per app.
cd ..
./test_run.sh


Further info:
I cross-compiled all the apps with the latest gcc 6.2 release from Linaro here. Fresh and ready to use in case the project admin wants to look into it ;)
armv8 with aarch64-linux-gnu
armv7 with arm-linux-gnueabihf
armv6 also with arm-linux-gnueabihf, but I recompiled it with crosstool-ng because the released binaries are configured for armv7. Maybe it isn't worth the pain, because I am the only one here who still uses the old Raspberry Pi 1 :)
____________

[B@P] Daniel
Volunteer developer
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 726 - Posted: 5 Jan 2017, 17:56:11 UTC

I have just read that AArch64 CPUs have new NEON SIMD instructions with double-precision support, so it should be possible to get an additional speed boost by using them. It is probably time to get an Odroid C2 and play with it a bit :)
____________

sorcrosc
Volunteer developer
Joined: 19 Dec 13
Posts: 26
Credit: 3,866,632
RAC: 0
Italy
Message 727 - Posted: 5 Jan 2017, 18:15:41 UTC - in response to Message 726.

I have just read that AArch64 CPUs have new NEON SIMD instructions with double-precision support, so it should be possible to get an additional speed boost by using them. It is probably time to get an Odroid C2 and play with it a bit :)


I like this :)

koschi
Joined: 22 Oct 16
Posts: 25
Credit: 17,960,768
RAC: 0
Germany
Message 728 - Posted: 6 Jan 2017, 18:55:36 UTC - in response to Message 726.

I have just read that AArch64 CPUs have new NEON SIMD instructions with double-precision support, so it should be possible to get an additional speed boost by using them. It is probably time to get an Odroid C2 and play with it a bit :)


As a C2 fanboy, I approve of this ;-)

If you have trouble obtaining one, I might also be able to grant you access to one of mine...

fractal
Joined: 10 Dec 16
Posts: 2
Credit: 1,008,686
RAC: 0
Message 729 - Posted: 6 Jan 2017, 22:20:09 UTC - in response to Message 726.

I have just read that AArch64 CPUs have new NEON SIMD instructions with double-precision support, so it should be possible to get an additional speed boost by using them. It is probably time to get an Odroid C2 and play with it a bit :)

The Odroid C2 is a fantastic product. I love mine. Well made, well supported, solid performer.

But there may be better AArch64 SBCs if you can only have one. My main objection to the C2 is the lack of AES instructions. It only has the following extensions: fp asimd crc32

As for TN-Grid, here are tests on a C2.

me@odroid-c2:~/TN-Grid$ ./test_run.sh
-> pc_armv6zk_vfp
real 5m2.251s
user 4m51.740s
sys 0m0.080s

-> pc_armv7_vfpv3
real 4m35.022s
user 4m32.840s
sys 0m0.080s

-> pc_armv7_vfpv4
real 4m39.926s
user 4m37.720s
sys 0m0.100s

-> pc_armv8-a
real 5m14.590s
user 5m12.330s
sys 0m0.100s


The version compiled for the armv7 vfpv3 architecture is a bit faster than the armv8 version.

A less expensive alternative that does have the AES instructions is the Pine64 board. It has the following extensions: fp asimd aes pmull sha1 sha2 crc32. My main issue with it is cooling, as it cannot run flat out without overheating. A simple stick-on heat sink from an RPi helps some, but not enough. It is also a bit slower than the C2. Performance on it looks like:

ubuntu@pine64:~/boinc/samples/TN-Grid$ ./test_run.sh
-> pc_armv6zk_vfp
real 6m31.015s
user 6m22.620s
sys 0m0.150s

-> pc_armv7_vfpv3
real 6m8.191s
user 5m59.540s
sys 0m0.440s

-> pc_armv7_vfpv4
real 6m12.538s
user 6m4.400s
sys 0m0.100s

-> pc_armv8-a
real 6m59.002s
user 6m50.040s
sys 0m0.210s

with vfpv3 enjoying a slight advantage here as well.
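
Picking the fastest build from such output can be sketched with a small awk filter. It assumes the "-> name" and "real XmY.YYYs" layout shown in the runs above; the function name fastest_app is made up for illustration and is not part of test_run.sh.

```shell
# Reads test_run.sh output on stdin and prints the app with the lowest
# "real" time, followed by that time in seconds.
fastest_app() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "->") {
                    name = $(i + 1)                 # app name follows the arrow
                } else if ($i == "real") {
                    split($(i + 1), t, /[ms]/)      # "4m35.022s" -> "4", "35.022"
                    secs = t[1] * 60 + t[2]
                    if (best == "" || secs < min) { min = secs; best = name }
                }
            }
        }
        END { if (best != "") printf "%s %.3f\n", best, min }
    '
}

# Example:
#   ./test_run.sh 2>&1 | fastest_app
```

The 2>&1 is there because the shell's time keyword usually reports on stderr; whether test_run.sh needs it depends on how the script invokes time.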


Copyright © 2024 CNR-TN & UniTN