Posts by L.Gilson

1) Message boards : Number crunching : Please fix computer #57018 - only invalids! (Message 1798)
Posted 22 Apr 2020 by L.Gilson

There are more of those:

http://gene.disi.unitn.it/test/results.php?hostid=34120

That's why boinc verifies results by running each WU at least twice. And that's something the project admin should see in the backend. I'm not sure if it is possible to act on those.
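To illustrate the idea of running each WU at least twice, here is a toy sketch of a quorum check. It is not boinc's actual validator code; the function name, the host names and the result strings are all made up, and real validators usually compare numerical results with a tolerance rather than by exact equality.

```python
from collections import Counter

def validate_workunit(results, min_quorum=2):
    """Toy quorum check (hypothetical, not the boinc API).

    results maps a host name to the result that host returned.
    Returns (canonical_result, invalid_hosts); canonical_result is None
    while no result has been returned by at least min_quorum hosts.
    """
    if len(results) < min_quorum:
        return None, []
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes < min_quorum:
        # No agreement yet: another replica of the WU has to be sent out.
        return None, []
    invalid = sorted(h for h, r in results.items() if r != value)
    return value, invalid

# Two hosts agree, one host returns garbage and is flagged as invalid.
canonical, invalid = validate_workunit(
    {"host_a": "r1", "host_b": "r1", "host_c": "garbage"}
)
print(canonical, invalid)  # r1 ['host_c']
```

A host that only ever produces invalids never makes it into the canonical result, but it still wastes the replica it was given.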

2) Message boards : Science : SARS-CoV-2 virus (Message 1783)
Posted 18 Apr 2020 by L.Gilson

The WCG also contains more competition. TN-Grid will not get a big chunk of the processing power just like that. It's not a free, unlimited resource. Even the WCG will set priorities and you might not like them.

Within boinc, it is your choice to do TN-Grid. Within WCG it is not.

Put TN-Grid on the official boinc project list, get a Twitter, Facebook, ... campaign going. Put posters up at local universities or wherever. And first upgrade the backend so that TN-Grid doesn't end up like Rosetta (out of work, and untested, badly failing WUs in the wild) or F@H (haven't seen a WU in the last weeks).

Also: the results from the current batches at F@H, Rosetta, ... will contribute to fighting covid-19 sometime in the future. Research in biology is on the timescale of years, not weeks. Getting all twitchy now doesn't solve the current problem, it just shows that humans as a group cannot plan well in advance.

3) Message boards : Number crunching : TN-Grid on AMD GPUs (Message 1781)
Posted 18 Apr 2020 by L.Gilson

I am the author of the optimized apps used by TN-Grid now. I also tried to port the app to the GPU, but faced the same problem: global memory access was too slow. If I remember correctly, I tried to port the app version which does not use Gray codes (the predecessor of the current app version); it looked more GPU-friendly to me. I was looking for a potential solution for this slow memory access, and asked for an algorithm on StackOverflow. I got an answer, however I never actually tried to implement it, as I was busy with different things at that time. Here is the link to that question, I hope it will be useful for you. Good luck!
https://stackoverflow.com/questions/46635137/how-to-generate-combinations-in-chunks

Looking at the first answer, I will not implement it either.

Yes, I also used the iterative version as a starting point and just started one thread per link. That backfires at l > 2 because the matrix is already sparse and most threads do nothing (80% empty matrix => 80% of threads dead on arrival).
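A small toy illustration of the "dead on arrival" effect, written as plain Python rather than OpenCL. The matrix pattern is invented so that exactly 20% of the entries are still active; it is not real TN-Grid data, and the variable names are only for illustration.

```python
n = 10

# Toy link matrix: True means the link is still alive and needs testing.
# The pattern is made up so that exactly 20% of the entries stay active.
matrix = [[(i * n + j) % 5 == 0 for j in range(n)] for i in range(n)]

# Dense launch: one thread per matrix entry, most of them exit at once.
dense_threads = n * n
busy_threads = sum(sum(row) for row in matrix)
idle_fraction = 1 - busy_threads / dense_threads

# Compacted launch: one thread per surviving link, taken from a flat list.
active_links = [(i, j) for i in range(n) for j in range(n) if matrix[i][j]]

print(dense_threads, busy_threads, idle_fraction)  # 100 20 0.8
print(len(active_links))                           # 20
```

With the flat list, every launched thread has real work, at the price of one extra pass to build the list.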

I want to run it as a 3D grid with the first 2 dimensions scanning the matrix and the last dimension splitting up the combinations to test. Chunking them is actually easy if you allow for some work duplication. The first 3 members of the combination group will be fixed for each thread (two from the matrix row and column, one additional fixed by the thread ID). Long-running combinations (the ones that stay true in the matrix) will distribute better across the cores that way, while warps hitting a false entry in the matrix can quit altogether. Matrix compaction, or just writing the still active combinations into a flat list, is probably a good idea at some point. The more complex schemes detailed in the first SO answer will result in warp divergence. Just accepting work duplication avoids that.
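A rough sketch of how fixing one member per thread ID could split the combinations, again in plain Python instead of OpenCL. This is one possible reading of "work duplication", not the actual kernel: both function names are hypothetical, and `candidates` stands for the variables left over after the row and column of the link are taken out.

```python
from itertools import combinations

def chunk_without_duplication(candidates, k, tid):
    """Size-k combinations whose smallest member is candidates[tid]."""
    fixed = candidates[tid]
    rest = candidates[tid + 1:]
    return [(fixed,) + tail for tail in combinations(rest, k - 1)]

def chunk_with_duplication(candidates, k, tid):
    """Size-k combinations that contain candidates[tid] at any position."""
    fixed = candidates[tid]
    rest = candidates[:tid] + candidates[tid + 1:]
    return [tuple(sorted((fixed,) + tail)) for tail in combinations(rest, k - 1)]

candidates = list(range(6))
k = 3
threads = range(len(candidates))

sizes_exact = [len(chunk_without_duplication(candidates, k, t)) for t in threads]
sizes_dup = [len(chunk_with_duplication(candidates, k, t)) for t in threads]

print(sizes_exact)  # [10, 6, 3, 1, 0, 0]
print(sizes_dup)    # [10, 10, 10, 10, 10, 10]
```

The exact split tests every combination once but leaves the high thread IDs with nothing to do. The duplicating split gives every thread the same amount of work with no per-thread bookkeeping, but in this toy case each combination is tested k times.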

The second idea is related to p, the mini-rho matrix re-calculated for each combination. Applying the gray code idea by just recalculating the changed parts (last element changes => organise the matrix in a way that the last element is also the last one calculated in p) will reduce the number of float ops by a lot. That will be a bit trickier, but is probably needed to get a viable solution.

The gray code would also run on the GPU if it weren't recursive. Turning that into iterative is the other path to a fast GPU implementation.

4) Message boards : Number crunching : TN-Grid on AMD GPUs (Message 1774)
Posted 15 Apr 2020 by L.Gilson

Thanks for the positive feedback. I have good and bad news.

Good news first:
The OpenCL version is running and produces roughly the same results. Roughly means 8% difference on the testing data included in the repo (that's 8% at the level of individual links). I have not yet tested on current production data.

Bad news:
The OpenCL port is in large parts pretty much a 1:1 copy of the CPU code. That implies it is still very, very slow. Memory access is all over the place and warp divergence must be huge. Long story short: it's about as fast as a single CPU-core. I sort of expected that, 1:1 ports never end up looking good. I'm now looking at which factors dominate the slowdown and will eliminate those (if possible).

I will keep you informed ...

5) Message boards : Number crunching : TN-Grid on AMD GPUs (Message 1766)
Posted 10 Apr 2020 by L.Gilson

Will you create an application for AMD GPUs?

I'm actually on it.

The client code is open-source and anyone willing can improve it. Several people from outside the research group already added very clever stuff (see posts by "[B@P] Daniel"), but so far no GPU version made it to stable. I'm now focusing on a new OpenCL-GPU (Nvidia and AMD GPU) implementation.

The current code cannot be ported 1:1, but the idea behind it can be ported. The switch from sequential to parallel will have 2 effects:

1. The raw number of calculations goes up by a factor of 1.5 - 2. That isn't so bad, as most computers have a way, way faster GPU than CPU.

2. The results will change slightly. I contacted the team about this. They seem to be willing to at least consider the option. The change in results implies the CPU and GPU versions cannot crunch the same WU without major changes on the server side. But the scientific output will be comparable.

TN-Grid recovers links between gene expressions. Nobody knows for sure what the true result is. There is the uncertainty related to the measurement itself, statistics, ... But there are also test data sets generated by a human (ground truth data). Even for those datasets the algorithms don't hit the target perfectly. Getting the main parts right is good enough, for various liberal definitions of "main", "right" and "good enough".

Porting the code takes time, don't hold your breath. I'm not part of the team, I'm not related to the university, the scientific team may not like the result in the end and I do this on my own in my free time. It currently looks doable without major gotchas. It will be faster than the CPU version, but not by an exorbitant factor. Expect 25x or maybe 100x faster than a single CPU-core, not some magic 10,000x faster. Given that modern CPUs with 8 cores / 16 threads already run 16 WUs in parallel, the impact will not be a total game changer. The current code is already good and highly optimized.
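The back-of-the-envelope arithmetic behind "not a total game changer" can be written down as follows. It assumes that each of the 16 WUs running in parallel progresses at full single-core speed, which is optimistic for the CPU since two threads share one core.

```python
# Speedup range quoted in the post, relative to ONE CPU core.
gpu_vs_one_core = (25, 100)

# Assumption: a CPU with 8 cores / 16 threads runs 16 WUs at once and
# each WU progresses at full single-core speed (optimistic for the CPU).
cpu_parallel_wus = 16

# Throughput of one GPU compared with the whole CPU.
gpu_vs_whole_cpu = tuple(s / cpu_parallel_wus for s in gpu_vs_one_core)
print(gpu_vs_whole_cpu)  # (1.5625, 6.25)
```

So under that assumption one GPU would be worth roughly 1.5 to 6 such CPUs in throughput.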

ETA-wise, think months, not days. And the results need to be verified, which will also take time.