valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
My guess is that workunits will validate correctly with the bitwise validator if they are checked against workunits crunched in the same architectural environment (operating system, CPU, etc.).
I asked the people in the server group to gather some statistics and see whether the above statement is true. If it is not, we have no other solution than to develop a 'slightly different is considered valid' concept and implement it in the new validator. In that case, assuming this is also acceptable from a scientific point of view (ask Enrico), we have to decide which workunit is the 'correct' one (just pick one at random? merge the two?).
But if the first statement is true we may follow other paths:
1- Implement, on the BOINC server side, the concept of homogeneous redundancy; see http://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy
2- Go deeper and deal with the numerical discrepancies by looking at the application code (functions, snippets) and the compiler flags, explicit or default (VC vs gcc), carefully checking the usual "number crunching in the float domain" things (like fp80 rounding). For a starting point, look at http://stackoverflow.com/questions/1961442/different-math-rounding-behaviour-between-linux-mac-os-x-and-windows. Another interesting article is http://www.yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html. I personally think that this approach, even if it is not so easy, is the most instructive and elegant.
3- Rewrite the code (or part of it) to avoid double/float, using fixed-point arithmetic or libraries such as GNU MPFR.
What to do now? Just hold on and stay put until we have answers for the first statement...
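To make the 'slightly different is considered valid' idea concrete, here is a minimal sketch in Python. It is not the project's validator: the result format (one undirected edge per line, two whitespace-separated node names), the function names, and the 0.95 threshold are all assumptions made for illustration. It treats two results as equivalent when their edge sets overlap enough, measured by Jaccard similarity.

```python
def parse_edges(text):
    """Parse one undirected edge per line ("nodeA nodeB") into a set.

    The line format is an assumption; the real result files may differ.
    """
    edges = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # frozenset makes "A B" and "B A" the same edge
            edges.add(frozenset(parts[:2]))
    return edges


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_equivalent(text_a, text_b, threshold=0.95):
    """True if the two edge lists are 'close enough'.

    The threshold is hypothetical and would have to be agreed
    with the scientists.
    """
    return jaccard(parse_edges(text_a), parse_edges(text_b)) >= threshold
```

A similarity threshold only answers whether two results agree; it does not say which of the two should become the canonical one, so that question stays open.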
|
|
|
I gathered statistics on the inconclusive workunits, because I think they will also make it easier for the application group to check for bugs.
We have 30 inconclusive workunits for application version 2.
Most of the differing results come from a Windows vs. Linux pair (one result ran on Windows and the other on Linux, both using the 64-bit version and an Intel processor): 25 cases. However, these differences are small; only a few edges are added to each graph.
There are 4 cases that come from a Windows/AMD and Linux/Intel pair. These may be caused by the difference between Windows and Linux, because in Workunit 22345 the results from Windows/AMD and from Windows/Intel are the same.
There are 3 cases in which results that ran on the same OS differ. All 3 differences are very large and might be caused by an application bug: Wu 23255 - Rs 37169, Wu 23117 - Rs 37229, Wu 23056 - Rs 36786.
Although the main source of differences is the Windows/Linux pairing, this does not imply that such pairs always differ. Host 21 (Linux/Intel Core i7) and host 39 (Windows7/Intel Core i7) worked together on 24 workunits. They agree on 6 and disagree on 18, so the results are the same in 25% of cases.
For more details: https://docs.google.com/spreadsheet/ccc?key=0AulsPzT-4D4ydFNsOXlGV3ZjY3MwTF9ObXhEcDF3X0E&usp=drive_web#gid=0
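The per-platform tally above could be automated along these lines. This is only a sketch: the input tuples, the platform labels, and the function name are hypothetical, and in practice the data would come from the project database rather than a hand-written list.

```python
from collections import defaultdict


def agreement_by_platform_pair(comparisons):
    """Tally agreement per unordered platform pair.

    comparisons: iterable of (platform_a, platform_b, results_match).
    Returns {(platform_x, platform_y): (agreed, total, agreed / total)}.
    """
    tally = defaultdict(lambda: [0, 0])  # pair -> [agreed, total]
    for platform_a, platform_b, results_match in comparisons:
        # Sort so ("Linux", "Windows") and ("Windows", "Linux") share a key
        key = tuple(sorted((platform_a, platform_b)))
        if results_match:
            tally[key][0] += 1
        tally[key][1] += 1
    return {
        key: (agreed, total, agreed / total)
        for key, (agreed, total) in tally.items()
    }


# Example using the host 21 / host 39 figures from this post
sample = (
    [("Linux/Intel", "Windows/Intel", True)] * 6
    + [("Windows/Intel", "Linux/Intel", False)] * 18
)
stats = agreement_by_platform_pair(sample)
```

With the sample above, `stats[("Linux/Intel", "Windows/Intel")]` holds 6 agreements out of 24 comparisons.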
|
|
valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
Thank you for your survey. I guess we will be able to solve the problem, somehow. I started the work generator again; one of the reasons was to gather more statistics. BTW, did you write a script or something for checking this?
|
|
|
I queried the database manually and only used a script for extracting and comparing the two files. It took about 1 hour for 30 workunits with 70 results.
|
|
valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
I'm also checking the results by hand. So far, every pair of results crunched by two Windows 7 or 8 computers (even XP vs. Win7) has been exactly the same. After checking this manually, I also run the validator from the command line.
For instance, for Workunit 23555:
bin/gene_network_validator --app gene --one_pass -d 4 --one_pass --mod 23555 0
Here is a snippet of the result:
2014-01-07 12:35:10.1598 [debug] [WU#23555 Expansion_At2_work-1387834088.xml_pn20456] Found 2 viable results
2014-01-07 12:35:10.1600 [debug] [WU#23555 Expansion_At2_work-1387834088.xml_pn20456] Enough for quorum, checking set.
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/263/Expansion_At2_work-1387834088.xml_pn20456_0_0'? y
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/69/Expansion_At2_work-1387834088.xml_pn20456_1_0'? y
query: select * from app_version where id=25
query: select * from host where id=19
However, after the console request to delete the two files (just before the "query" line) the validator seems to hang for 4-5 seconds before outputting things again. What is it doing? (the overall check seems much slower than expected). |
|
|
|
However, after the console request to delete the two files (just before the "query" line) the validator seems to hang for 4-5 seconds before outputting things again. What is it doing? (the overall check seems much slower than expected).
Is it the new validator?
____________
Paolo - Application team dev (SSC11)
"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" Seymour Cray |
|
|
valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
Dated 27 Dec. The one in the bin directory. |
|
|
valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
Other output (just for reference)
bin/gene_network_validator --app gene --one_pass -d 3 --one_pass --mod 23690 0
2014-01-07 18:20:22.0198 Starting validator, debug level 3
2014-01-07 18:20:22.0201 Modulus 23690, remainder 0
2014-01-07 18:20:22.0314 [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] handle_wu(): No canonical result yet
2014-01-07 18:20:22.0348 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Found 2 viable results
2014-01-07 18:20:22.0350 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Enough for quorum, checking set.
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/d6/Expansion_At2_work-1387834213.xml_pn20618_0_0'? y
rm: remove write-protected regular file `/home/boincadm/projects/test/upload/163/Expansion_At2_work-1387834213.xml_pn20618_1_0'? y
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
2014-01-07 18:20:36.6669 [debug] [HAV#30] consecutive valid now 4
2014-01-07 18:20:37.2982 [RESULT#38452 Expansion_At2_work-1387834213.xml_pn20618_0] Valid; granted 53.918587 credit [HOST#27]
2014-01-07 18:20:37.2983 [HOST#27 AV#30] [outlier=0] Updating HAV in db. pfc.n=3.000000->4.000000
2014-01-07 18:20:37.7356 [debug] [HAV#29] consecutive valid now 1
2014-01-07 18:20:37.8163 [RESULT#38453 Expansion_At2_work-1387834213.xml_pn20618_1] Valid; granted 53.918587 credit [HOST#8]
2014-01-07 18:20:37.8164 [HOST#8 AV#29] [outlier=0] Updating HAV in db. pfc.n=0.000000->1.000000
2014-01-07 18:20:37.8577 [debug] [WU#23690 Expansion_At2_work-1387834213.xml_pn20618] Found a canonical result: id=38452
|
|
|
valterc (Project administrator, Project tester)
Joined: 30 Oct 13 Posts: 624 Credit: 34,677,535 RAC: 0
|
I just wrote a small script, html/ops/check_need_validation.php, to figure out which operating systems the results are coming from.
BTW, it would be nice to have a 'read-only'/debug/test-only validation switch, just to check whether the validator works and, in the future, to see whether it grants credit accordingly.
|
|
|
Hi,
This is a new version of the validator. It is slow because we need to extract the contents of the zipped files and then compare the 2 files to find the common and differing edges. Each file is nearly 30 MB, which is another reason why it is a little slow.
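For illustration only, the extract-and-compare step could look roughly like the Python sketch below. It is not the validator's actual code: it assumes each result is a zip archive holding a single text file with one edge per line, and the function names are made up. Using sets means the input does not have to be sorted, unlike `comm`, which expects sorted files (compare the "not in sorted order" warnings in the log above).

```python
import io
import zipfile


def read_edge_lines(zip_source):
    """Return the non-empty lines of the first file in a zip archive.

    zip_source may be a path or a binary file-like object.
    Assumes the archive holds one text file (hypothetical layout).
    """
    with zipfile.ZipFile(zip_source) as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return {line.strip() for line in text if line.strip()}


def compare_results(zip_a, zip_b):
    """Split the edges of two results into common and one-sided sets."""
    edges_a = read_edge_lines(zip_a)
    edges_b = read_edge_lines(zip_b)
    return {
        "common": edges_a & edges_b,
        "only_a": edges_a - edges_b,
        "only_b": edges_b - edges_a,
    }
```

Reading straight from the archive avoids writing the extracted files to disk first, which might help with files of this size, though that has not been measured here.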
|
|