Validation problems (roadmap) application+server

Message boards : Development : Validation problems (roadmap) application+server

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 212 - Posted: 31 Dec 2013, 11:44:13 UTC
Last modified: 31 Dec 2013, 14:14:44 UTC

My guess is that workunits will validate correctly with the bitwise validator if they are checked against workunits crunched in the same architectural environment (operating system, CPU, etc.).

I asked the server group to gather some statistics and check whether the above statement is true. If it is not, we have no other solution than to develop a 'slightly different is considered valid' concept and implement it in the new validator. In that case, assuming this is also acceptable from the scientific point of view (ask Enrico), we have to decide which workunit is the 'correct' one (just pick one at random, or merge the two?).
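A minimal sketch of what a 'slightly different is considered valid' check could look like, assuming each result can be reduced to a list of edges written as "nodeA nodeB" lines. That file format, the function names and the 0.95 threshold are all hypothetical; the acceptable threshold would be a scientific decision, not a technical one.

```python
def parse_edges(lines):
    """Turn 'nodeA nodeB' lines into a set of undirected edges (assumed format)."""
    edges = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # frozenset makes 'g1 g2' and 'g2 g1' the same edge
            edges.add(frozenset(parts[:2]))
    return edges


def jaccard(a, b):
    """Shared edges divided by all edges seen in either result."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_close_enough(a, b, threshold=0.95):
    # threshold is a placeholder value, not a project setting
    return jaccard(a, b) >= threshold


result_1 = parse_edges(["g1 g2", "g2 g3", "g3 g4", "g4 g5"])
result_2 = parse_edges(["g2 g1", "g2 g3", "g3 g4", "g4 g5", "g5 g6"])

print(jaccard(result_1, result_2))          # 0.8 (4 shared edges out of 5)
print(is_close_enough(result_1, result_2))  # False with the 0.95 placeholder
```

A similarity score only says whether two results are close; it does not answer which of the two should become the canonical one.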

But if the first statement is true we may follow other paths:
1- Implement homogeneous redundancy on the BOINC server side, see http://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy
2- Go deeper and deal with the numerical discrepancies by looking at the application code (functions, snippets) and the compiler flags, explicit or default (VC vs gcc), carefully checking the usual "number crunching in the float domain" things (like fp80 rounding). A starting point is http://stackoverflow.com/questions/1961442/different-math-rounding-behaviour-between-linux-mac-os-x-and-windows and another interesting article is http://www.yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html. I personally think that this approach, even if it is not so easy, is the most instructive and elegant.
3- Rewrite the code (or part of it) to avoid double/float, using fixed-point arithmetic or a library such as GNU MPFR.
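To illustrate why options 2 and 3 are on the table, here is a small sketch (not taken from the application code) showing that the order of floating-point additions changes the last bits of a result, while exact or integer arithmetic does not. Reordering on a single machine is only a stand-in here for what different compilers, flags or math libraries may do; the scale factor is an arbitrary choice for the example.

```python
from fractions import Fraction

values = [0.1, 0.2, 0.3]

# Same numbers, different grouping: the double results differ in the last bit.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])
print(left_to_right == right_to_left)  # False

# Exact rational arithmetic is independent of grouping.
exact_a = (Fraction(values[0]) + Fraction(values[1])) + Fraction(values[2])
exact_b = Fraction(values[0]) + (Fraction(values[1]) + Fraction(values[2]))
print(exact_a == exact_b)  # True

# Fixed point: scale to integers once, then only add integers.
SCALE = 10**6  # arbitrary resolution for this example
fixed = [round(v * SCALE) for v in values]
fixed_a = (fixed[0] + fixed[1]) + fixed[2]
fixed_b = fixed[0] + (fixed[1] + fixed[2])
print(fixed_a == fixed_b)  # True
```

A one-bit difference like this is harmless on its own, but a bitwise validator rejects it, and a threshold test inside the application can turn it into an added or missing edge.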

What to do now? Just hold on and stay put until we have answers for the first statement...

Trung
Joined: 28 Nov 13
Posts: 7
Credit: 15
RAC: 0
Italy
Message 236 - Posted: 3 Jan 2014, 17:34:12 UTC

I gathered statistics on the inconclusive workunits, because I think they will also make it easier for the application group to look for bugs.

We have 30 inconclusive workunits for application version 2.

Most of the differing results come from Windows vs. Linux pairs (one result ran on Windows and the other on Linux, both on 64-bit versions with Intel processors): 25 cases. However, these differences are small: a few edges are added to each graph.

4 cases come from a Windows/AMD vs. Linux/Intel pair. These may be caused by the difference between Windows and Linux rather than by the CPU, because in workunit 22345 the results from Windows/AMD and from Windows/Intel are the same.

In 3 cases, results that ran on the same OS differ. All 3 differences are very large and might be caused by an application bug: Wu 23255 - Rs 37169, Wu 23117 - Rs 37229, Wu 23056 - Rs 36786.

Although the main source of differences is Windows vs. Linux pairs, this does not imply that such pairs always differ. Host 21 (Linux/Intel Core i7) and host 39 (Windows 7/Intel Core i7) worked together on 24 workunits. They agree on 6 and disagree on 18, so their results match in 25% of cases (6 of 24).
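A tally like the one for hosts 21 and 39 can be automated. The sketch below assumes each comparison is available as a (platform, platform, same?) tuple; that layout and the function name are hypothetical, and the sample data only reproduces the 6 agreements and 18 disagreements quoted above.

```python
from collections import Counter


def agreement_by_pair(comparisons):
    """Return the fraction of matching results for each platform pair."""
    totals = Counter()
    matches = Counter()
    for platform_a, platform_b, same in comparisons:
        # Sort so (Linux, Windows) and (Windows, Linux) count as one pair.
        pair = tuple(sorted((platform_a, platform_b)))
        totals[pair] += 1
        if same:
            matches[pair] += 1
    return {pair: matches[pair] / totals[pair] for pair in totals}


# Hosts 21 and 39: 24 shared workunits, 6 agree and 18 disagree.
comparisons = (
    [("Linux/Intel", "Windows/Intel", True)] * 6
    + [("Windows/Intel", "Linux/Intel", False)] * 18
)

rates = agreement_by_pair(comparisons)
print(rates)  # {('Linux/Intel', 'Windows/Intel'): 0.25}
```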

For more details: https://docs.google.com/spreadsheet/ccc?key=0AulsPzT-4D4ydFNsOXlGV3ZjY3MwTF9ObXhEcDF3X0E&usp=drive_web#gid=0

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 238 - Posted: 3 Jan 2014, 17:59:18 UTC - in response to Message 236.

Thank you for your survey. I guess we will be able to solve the problem, somehow. I started the work generator again; one of the reasons was to gather more statistics. BTW, did you write a script or something for checking this?

Trung
Joined: 28 Nov 13
Posts: 7
Credit: 15
RAC: 0
Italy
Message 240 - Posted: 3 Jan 2014, 19:39:58 UTC

I queried the database manually and only used a script for extracting and comparing the two files. It took me about 1 hour for 30 workunits with 70 results.

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 263 - Posted: 7 Jan 2014, 11:40:20 UTC - in response to Message 240.
Last modified: 7 Jan 2014, 12:16:09 UTC

I'm also checking the results by hand. So far, every pair of results crunched by two Windows 7 or 8 computers (even XP vs. Win7) has been exactly the same. After checking this manually, I also ran the validator from the command line.

For instance, for Workunit 23555:
bin/gene_network_validator --app gene --one_pass -d 4 --one_pass --mod 23555 0
Here is a snippet of the result:

2014-01-07 12:35:10.1598 [debug] [WU#23555 Expansion_At2_work-1387834088.xml_pn20456] Found 2 viable results
2014-01-07 12:35:10.1600 [debug] [WU#23555 Expansion_At2_work-1387834088.xml_pn20456] Enough for quorum, checking set.
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/263/Expansion_At2_work-1387834088.xml_pn20456_0_0'? y
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/69/Expansion_At2_work-1387834088.xml_pn20456_1_0'? y
query: select * from app_version where id=25
query: select * from host where id=19


However, after the console request to delete the two files (just before the "query" line) the validator seems to hang for 4-5 seconds before outputting things again. What is it doing? (the overall check seems much slower than expected).

Profile paolomorettin
Project developer
Project tester
Project scientist
Joined: 20 Nov 13
Posts: 19
Credit: 13,027
RAC: 0
Message 266 - Posted: 7 Jan 2014, 15:19:22 UTC - in response to Message 263.

However, after the console request to delete the two files (just before the "query" line) the validator seems to hang for 4-5 seconds before outputting things again. What is it doing? (the overall check seems much slower than expected).


Is it the new validator?
____________
Paolo - Application team dev (SSC11)

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" Seymour Cray

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 267 - Posted: 7 Jan 2014, 15:33:25 UTC - in response to Message 266.
Last modified: 7 Jan 2014, 15:33:52 UTC

Dated 27 Dec. The one in the bin directory.

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 269 - Posted: 7 Jan 2014, 17:25:52 UTC - in response to Message 267.

Other output (just for reference)

bin/gene_network_validator --app gene --one_pass -d 3 --one_pass --mod 23690 0
2014-01-07 18:20:22.0198 Starting validator, debug level 3
2014-01-07 18:20:22.0201 Modulus 23690, remainder 0
2014-01-07 18:20:22.0314 [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] handle_wu(): No canonical result yet
2014-01-07 18:20:22.0348 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Found 2 viable results
2014-01-07 18:20:22.0350 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Enough for quorum, checking set.
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/d6/Expansion_At2_work-1387834213.xml_pn20618_0_0'? y
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/163/Expansion_At2_work-1387834213.xml_pn20618_1_0'? y
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
2014-01-07 18:20:36.6669 [debug] [HAV#30] consecutive valid now 4
2014-01-07 18:20:37.2982 [RESULT#38452 Expansion_At2_work-1387834213.xml_pn20618_0] Valid; granted 53.918587 credit [HOST#27]
2014-01-07 18:20:37.2983 [HOST#27 AV#30] [outlier=0] Updating HAV in db. pfc.n=3.000000->4.000000
2014-01-07 18:20:37.7356 [debug] [HAV#29] consecutive valid now 1
2014-01-07 18:20:37.8163 [RESULT#38453 Expansion_At2_work-1387834213.xml_pn20618_1] Valid; granted 53.918587 credit [HOST#8]
2014-01-07 18:20:37.8164 [HOST#8 AV#29] [outlier=0] Updating HAV in db. pfc.n=0.000000->1.000000
2014-01-07 18:20:37.8577 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Found a canonical result: id=38452

Profile valterc
Project administrator
Project tester
Joined: 30 Oct 13
Posts: 616
Credit: 34,514,943
RAC: 395
Italy
Message 270 - Posted: 7 Jan 2014, 17:29:27 UTC - in response to Message 269.

I just wrote a small script in html/ops/check_need_validation.php to figure out which operating systems the results are coming from.

BTW, it would be nice to have a 'read-only'/debug/test-only validation switch, just to check whether the validator works and, in the future, to see whether it grants credit correctly.

chau
Joined: 12 Nov 13
Posts: 15
Credit: 229
RAC: 0
Italy
Message 271 - Posted: 8 Jan 2014, 5:44:57 UTC - in response to Message 266.

Hi,

This is a new version of the validator. It is slow because we need to extract the contents of the zipped files and then compare the two files to find the shared and differing edges. Each file is about 30 MB, which is another reason why it is a little slow.
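For comparison, here is a rough sketch of the same extract-and-compare step done in memory, assuming each upload is a zip archive whose members hold one edge per line. The archive layout, the member name edges.txt and all function names are assumptions for the example, not a description of the actual validator.

```python
import io
import zipfile


def read_edges(zip_bytes):
    """Collect every non-empty line from every member of a zip archive."""
    edges = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                for raw in member:  # streamed line by line, as bytes
                    line = raw.decode("utf-8").strip()
                    if line:
                        edges.add(line)
    return edges


def compare(zip_a, zip_b):
    """Return (shared edges, edges only in a, edges only in b)."""
    a, b = read_edges(zip_a), read_edges(zip_b)
    return a & b, a - b, b - a


def make_zip(lines):
    """Build a small in-memory archive so the example is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("edges.txt", "\n".join(lines))
    return buffer.getvalue()


common, only_a, only_b = compare(
    make_zip(["g2 g3", "g1 g2"]),           # deliberately unsorted
    make_zip(["g1 g2", "g3 g4", "g2 g3"]),
)
print(len(common), sorted(only_a), sorted(only_b))
```

Set operations do not need sorted input, unlike comm, which printed "not in sorted order" warnings in the log for workunit 23690 above. Whether this would be faster than the current approach on 30 MB files would have to be measured.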




Copyright © 2024 CNR-TN & UniTN