Message boards : Number crunching : TN-Grid on AMD GPUs
Hi. Will you create an application for AMD GPUs? Mine is currently working on Folding@Home (when work is available) or MilkyWay@Home, but I'd like to help medical research.
ID: 1739 · Reply Quote
I remember that, they said, their algorithm and simulations are sequential and would not benefit from a GPU app, because GPUs are suited to highly parallelized work.
ID: 1741 · Reply Quote
Will you create an application for AMD GPUs?

I'm actually on it. The client code is open source and anyone willing can improve it. Several people from outside the research group have already added very clever stuff (see posts by "[B@P] Daniel"), but so far no GPU version has made it to stable. I'm now focusing on a new OpenCL GPU (Nvidia and AMD) implementation. The current code cannot be ported 1:1, but the idea behind it can be. The switch from sequential to parallel will have two effects:

1. The raw number of calculations goes up by a factor of 1.5 - 2. That isn't so bad, as most computers have a way, way faster GPU than CPU.
2. The results will slightly change. I contacted the team about this and they seem willing to at least consider the option.

The change in results implies the CPU and GPU versions cannot crunch the same WU without major changes on the server side, but the scientific output will be comparable. TN-Grid recovers links between gene expressions. Nobody knows for sure what the true result is: there is uncertainty related to the measurement itself, statistics, and so on. There are also test data sets generated by a human (ground-truth data), and even for those the algorithms don't hit the target perfectly. Getting the main parts right is good enough, for various liberal definitions of "main", "right" and "good enough".

Porting the code takes time, so don't hold your breath. I'm not part of the team, I'm not related to the university, the scientific team may not like the result in the end, and I do this on my own in my free time. It currently looks doable without major gotchas. It will be faster than the CPU version, but not by an exorbitant factor: expect 25x or maybe 100x faster than a single CPU core, not some magic 10,000x. Given that modern CPUs with 8 cores / 16 threads already run 16 WUs in parallel, the impact will not be a total game changer; the current code is already good and highly optimized. ETA-wise, think months, not days. And the results need to be verified, which will also take time.
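To see why a parallel version changes the results at all: a GPU reduction adds the same numbers in a different order than a sequential loop, and float addition is not associative. The snippet below is a generic C sketch of that effect, not TN-Grid code.

```c
#include <stdio.h>

/* Generic illustration: summing the same floats left-to-right vs. in a
   pairwise (tree) order, as a parallel reduction would, can give
   different results because float addition is not associative. */
int main(void) {
    float data[8] = {1e8f, 1.0f, -1e8f, 1.0f, 1e8f, 1.0f, -1e8f, 1.0f};

    /* left-to-right sum, like a single CPU thread */
    float seq = 0.0f;
    for (int i = 0; i < 8; i++)
        seq += data[i];

    /* pairwise tree sum, the natural shape of a GPU reduction */
    float tree[8];
    for (int i = 0; i < 8; i++)
        tree[i] = data[i];
    for (int stride = 4; stride >= 1; stride /= 2)
        for (int i = 0; i < stride; i++)
            tree[i] += tree[i + stride];

    printf("sequential sum: %g\n", seq);     /* prints 1: the small terms were rounded away */
    printf("tree sum:       %g\n", tree[0]); /* prints 4: the exact result for this input */
    return 0;
}
```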
ID: 1766 · Reply Quote
Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's.
ID: 1767 · Reply Quote
Thank you very much, L.Gilson :)
ID: 1768 · Reply Quote
Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's.

Yes, certainly. Most projects I have seen get maybe a 15x - 20x speedup with a GPU. But TN-Grid had better start planning for more work. This will keep them busy.
ID: 1769 · Reply Quote
Thank you, Mr. Gilson.
ID: 1770 · Reply Quote
Thanks for the positive feedback. I have good and bad news.
ID: 1774 · Reply Quote
Thanks for the positive feedback. I have good and bad news.

I am the author of the optimized apps used by TN-Grid now. I also tried to port the app to the GPU, but faced the same problem - global memory access was too slow. If I remember correctly, I tried to port the app version which does not use Gray codes (the predecessor of the current app version); it looked more GPU-friendly to me. I was looking for a potential solution to this slow memory access and asked for an algorithm on StackOverflow. I got an answer, however I never actually tried to implement it - I was busy with different things at that time. Here is a link to that question, I hope it will be useful for you. Good luck!

https://stackoverflow.com/questions/46635137/how-to-generate-combinations-in-chunks
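For anyone wondering what "combinations in chunks" means in practice: one common approach (a sketch only, not the StackOverflow answer and not TN-Grid code) is to "unrank" a k-combination from its lexicographic index using the combinatorial number system, so each chunk or GPU thread can jump straight to its own starting combination without enumerating everything before it.

```c
#include <stdio.h>

/* C(n, k); intermediate values stay exact for the small n, k used here */
static unsigned long long binom(int n, int k) {
    if (k < 0 || k > n) return 0;
    if (k > n - k) k = n - k;
    unsigned long long r = 1;
    for (int i = 1; i <= k; i++)
        r = r * (unsigned long long)(n - k + i) / i;
    return r;
}

/* Write the rank-th (0-based, lexicographic) k-combination of
   {0, ..., n-1} into out[0..k-1]. */
static void unrank_combination(int n, int k, unsigned long long rank, int *out) {
    int x = 0;
    for (int i = 0; i < k; i++) {
        /* skip all combinations whose next element is smaller than x */
        while (binom(n - x - 1, k - i - 1) <= rank) {
            rank -= binom(n - x - 1, k - i - 1);
            x++;
        }
        out[i] = x++;
    }
}

int main(void) {
    /* print one "chunk": combinations 10..14 out of C(8, 3) = 56 */
    int c[3];
    for (unsigned long long r = 10; r < 15; r++) {
        unrank_combination(8, 3, r, c);
        printf("rank %llu: {%d, %d, %d}\n", r, c[0], c[1], c[2]);
    }
    return 0;
}
```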
ID: 1779 · Reply Quote
I am the author of the optimized apps used by TN-Grid now. I also tried to port the app to the GPU, but faced the same problem - global memory access was too slow. If I remember correctly, I tried to port the app version which does not use Gray codes (the predecessor of the current app version); it looked more GPU-friendly to me. I was looking for a potential solution to this slow memory access and asked for an algorithm on StackOverflow. I got an answer, however I never actually tried to implement it - I was busy with different things at that time. Here is a link to that question, I hope it will be useful for you. Good luck!

Looking at the first answer, I will not implement it either. Yes, I also used the iterative version as a starting point and just started one thread per link. That backfires at l > 2 because the matrix is already sparse and most threads do nothing (80% empty matrix => 80% of threads dead on arrival).

I want to run it as a 3D grid, with the first two dimensions scanning the matrix and the last dimension splitting up the combinations to test. Chunking them is actually easy if you allow for some work duplication. The first three members of the combination group will be fixed for each thread (two from the matrix row and column, one additional fixed by the thread ID). Long-running combinations (the ones that stay true in the matrix) will distribute better across the cores that way, while warps hitting a false entry in the matrix can quit altogether. Matrix compaction, or just writing the still-active combinations to a flat list, is probably a good idea at some point. The more complex schemes detailed in the first SO answer will result in warp divergence; just accepting work duplication avoids that.

The second idea is related to p, the mini-rho matrix re-calculated for each combination. Applying the Gray-code idea by recalculating only the changed parts (last element changes => organise the matrix so that the last element is also the last one calculated in p) will reduce the number of float ops by a lot. That will be a bit trickier, but probably needed to get a viable solution. The Gray code would also run on the GPU if it weren't recursive. Turning that into an iterative version is the other path to a fast GPU implementation.
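To make that last point concrete, here is a generic sketch (plain C, not the project's code) of a non-recursive Gray-code walk: at step i exactly one element enters or leaves the subset, and that element is simply the number of trailing zero bits of i, which is the hook an incremental update of p could key off.

```c
#include <stdio.h>

/* Non-recursive walk over all subsets of n elements in reflected
   Gray-code order: exactly one element is added or removed per step,
   and that element is ctz(i), the lowest set bit of the step counter. */
int main(void) {
    const int n = 4;
    unsigned subset = 0;                 /* start from the empty subset */
    for (unsigned i = 1; i < (1u << n); i++) {
        int elem = __builtin_ctz(i);     /* GCC/Clang builtin: count trailing zeros */
        subset ^= 1u << elem;            /* flip exactly that element */
        printf("step %2u: element %d %s, subset = 0x%X\n",
               i, elem, (subset >> elem) & 1 ? "added" : "removed", subset);
    }
    return 0;
}
```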
ID: 1781 · Reply Quote
Will you create an application for AMD GPUs?

L.Gilson, your possible OpenCL porting was mentioned in a thread under the news item "AMD COVID-19 HPC Fund". Bill
ID: 1956 · Reply Quote