TN-Grid on AMD GPUs

Message boards : Number crunching : TN-Grid on AMD GPUs

Author	Message
Alessio Susi Joined: 7 Jun 15 Posts: 15 Credit: 661,981 RAC: 0	Message 1739 - Posted: 3 Apr 2020, 12:42:01 UTC
	Hi. Will you create an application for AMD GPUs? Currently mine is working on Folding@Home (when work is available) or MilkyWay@Home, but I'd like to help medical research.
	____________
	ASUS X570 E-Gaming
	AMD Ryzen 9 3950X, 16 core / 32 thread, 4.4 GHz
	AMD Radeon Sapphire RX 480 4GB Nitro+
	Nvidia GTX 1080 Ti Gaming X Trio
	4x16 GB Corsair Vengeance RGB 3466 MHz

Buro87 [Lombardia] Joined: 23 Nov 16 Posts: 100 Credit: 4,000,541 RAC: 0	Message 1741 - Posted: 3 Apr 2020, 14:42:34 UTC - in response to Message 1739.
	As far as I remember, they said that their algorithm and simulations are sequential and would not benefit from a GPU app, because GPUs are suited to highly parallel work.

L.Gilson Volunteer developer Joined: 25 Mar 20 Posts: 5 Credit: 596,499 RAC: 0	Message 1766 - Posted: 10 Apr 2020, 17:05:02 UTC - in response to Message 1739.
	"Will you create an application for AMD GPUs?"
	I'm actually on it. The client code is open-source and anyone willing can improve it. Several people from outside the research group have already added very clever stuff (see posts by "[B@P] Daniel"), but so far no GPU version has made it to stable.
	I'm now focusing on a new OpenCL GPU implementation (for both Nvidia and AMD GPUs). The current code cannot be ported 1:1, but the idea behind it can be. The switch from sequential to parallel will have 2 effects:
	1. The raw number of calculations goes up by a factor of 1.5 - 2. That isn't so bad, as most computers have a way, way faster GPU than CPU.
	2. The results will change slightly.
	I contacted the team about this. They seem to be willing to at least consider the option. The change in results implies that the CPU and GPU versions cannot crunch the same WU without major changes on the server side. But the scientific output will be comparable.
	TN-Grid recovers links between gene expressions. Nobody knows for sure what the true result is. There is the uncertainty related to the measurement itself, the statistics, and so on. But there are also test data sets generated by a human (ground truth data). Even for those data sets the algorithms don't hit the target perfectly. Getting the main parts right is good enough, for various liberal definitions of "main", "right" and "good enough".
	Porting the code takes time, so don't hold your breath. I'm not part of the team, I'm not affiliated with the university, the scientific team may not like the result in the end, and I do this on my own in my free time.
	It currently looks doable without major gotchas. It will be faster than the CPU version, but not by an exorbitant factor. Expect 25x or maybe 100x faster than a single CPU core, not some magic 10,000x. Given that modern CPUs with 8 cores / 16 threads already run 16 WUs in parallel, the impact will not be a total game changer. The current code is already good and highly optimized.
	ETA-wise, think months, not days. And the results need to be verified, which will also take time.
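	A rough back-of-the-envelope check of the speed-up estimate above, written as a small Python sketch. It assumes that each of the 16 CPU threads crunches a WU as fast as the single core the 25x to 100x figure refers to. The post does not claim that, and SMT threads are usually slower than a full core, so the real ratio would probably be somewhat higher. The function name is made up for illustration.

```python
def gpu_vs_whole_cpu(gpu_vs_one_core: float, cpu_threads: int) -> float:
    """Speed-up of one GPU over a CPU that runs `cpu_threads` WUs at once.

    Assumption (not from the post): every CPU thread is as fast as the
    single core the GPU figure is measured against.
    """
    return gpu_vs_one_core / cpu_threads


# The two estimates from the post, against an 8-core / 16-thread CPU.
for factor in (25, 100):
    ratio = gpu_vs_whole_cpu(factor, 16)
    print(f"{factor}x one core -> {ratio:.2f}x a fully loaded 16-thread CPU")
```

	Under that assumption a GPU would replace somewhere between one and a half and six such CPUs, which is useful but matches the "not a total game changer" expectation.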

Falconet Joined: 21 Dec 16 Posts: 106 Credit: 3,095,506 RAC: 0	Message 1767 - Posted: 10 Apr 2020, 18:38:27 UTC - in response to Message 1766. Last modified: 10 Apr 2020, 18:38:34 UTC
	Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's.

Buro87 [Lombardia] Joined: 23 Nov 16 Posts: 100 Credit: 4,000,541 RAC: 0	Message 1768 - Posted: 10 Apr 2020, 19:59:37 UTC - in response to Message 1767.
	Thank you very much L.Gilson :)

Jim1348 Joined: 29 Dec 16 Posts: 87 Credit: 21,013,002 RAC: 0	Message 1769 - Posted: 10 Apr 2020, 21:07:50 UTC - in response to Message 1767.
	"Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's."
	Yes, certainly. Most projects I have seen get maybe a 15X - 20X speedup with a GPU. But TN-Grid had better start planning for more work. This will keep them busy.

italianpower18 Joined: 18 Oct 18 Posts: 6 Credit: 16,414,273 RAC: 0	Message 1770 - Posted: 12 Apr 2020, 16:54:57 UTC - in response to Message 1766.
	Thank you, Mr. Gilson.

L.Gilson Volunteer developer Joined: 25 Mar 20 Posts: 5 Credit: 596,499 RAC: 0	Message 1774 - Posted: 15 Apr 2020, 20:10:18 UTC
	Thanks for the positive feedback. I have good and bad news.
	Good news first: the OpenCL version is running and produces roughly the same results. "Roughly" means an 8% difference on the testing data included in the repo (that's 8% at the level of individual links). I have not yet tested on current production data.
	Bad news: the OpenCL port is in large parts pretty much a 1:1 copy of the CPU code. That implies it is still very, very slow. Memory access is all over the place and warp divergence must be huge. Long story short: it's about as fast as a single CPU core. I sort of expected that; 1:1 ports never end up looking good.
	I'm now looking at which factors dominate the slowdown so that I can eliminate them (if possible). I will keep you informed ...
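	The post does not say how the 8% link-level difference was measured. As an illustration only, here is one plausible way to compare two sets of recovered links in Python: treat links as undirected and report the share of links that appear in exactly one of the two results. The function names and the gene labels are hypothetical, and the actual metric used for the port may be different.

```python
def link_set(pairs):
    """Normalise undirected links so that (a, b) and (b, a) compare equal."""
    return {frozenset(pair) for pair in pairs}


def link_difference(reference, candidate):
    """Fraction of links found in exactly one of the two results.

    Assumption: the difference is taken relative to the union of both
    link sets. Other definitions (e.g. relative to the reference only)
    are equally possible.
    """
    ref, cand = link_set(reference), link_set(candidate)
    union = ref | cand
    if not union:
        return 0.0
    return len(ref ^ cand) / len(union)


# Hypothetical CPU and GPU outputs with made-up gene labels.
cpu_links = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
gpu_links = [("g2", "g1"), ("g2", "g3"), ("g4", "g5")]
print(f"link-level difference: {link_difference(cpu_links, gpu_links):.0%}")
```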

[B@P] Daniel Volunteer developer Joined: 19 Oct 16 Posts: 90 Credit: 2,205,103 RAC: 0	Message 1779 - Posted: 17 Apr 2020, 20:12:53 UTC - in response to Message 1774. Last modified: 17 Apr 2020, 20:13:51 UTC
	I am the author of the optimized apps used by TN-Grid now. I also tried to port the app to the GPU, but faced the same problem: global memory access was too slow. If I remember correctly, I tried to port the app version that does not use Gray codes (the predecessor of the current app version); it looked more GPU-friendly to me. I was looking for a potential solution to this slow memory access and asked for an algorithm on StackOverflow. I got an answer, but I never actually tried to implement it, as I was busy with other things at that time. Here is the link to that question; I hope it will be useful to you. Good luck!
	https://stackoverflow.com/questions/46635137/how-to-generate-combinations-in-chunks

L.Gilson Volunteer developer Joined: 25 Mar 20 Posts: 5 Credit: 596,499 RAC: 0	Message 1781 - Posted: 18 Apr 2020, 9:48:52 UTC - in response to Message 1779.
	Looking at the first answer, I will not implement it either.
	Yes, I also used the iterative version as a starting point and just started one thread per link. That backfires at l > 2 because the matrix is already sparse and most threads do nothing (80% empty matrix => 80% of threads dead on arrival).
	I want to run it as a 3D grid, with the first 2 dimensions scanning the matrix and the last dimension splitting up the combinations to test. Chunking them is actually easy if you allow for some work duplication. The first 3 members of the combination group will be fixed for each thread (two from the matrix row and column, one additional fixed by the thread ID). Long-running combinations (the ones that stay true in the matrix) will distribute better across the cores that way, while warps hitting a false entry in the matrix can quit altogether. Matrix compaction, or just writing the still-active combinations to a flat list, is probably a good idea at some point. The more complex schemes detailed in the first SO answer will result in warp divergence. Just accepting work duplication avoids that.
	The second idea is related to p, the mini-rho matrix recalculated for each combination. Applying the Gray code idea by recalculating only the changed parts (last element changes => organise the matrix in a way that the last element is also the last one calculated in p) will reduce the number of float ops by a lot. That will be a bit trickier, but is probably needed to get a viable solution.
	The Gray code would also run on the GPU if it weren't recursive. Turning that into an iterative version is the other path to a fast GPU implementation.
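	To make the two ideas above more concrete, here is a minimal Python sketch (plain CPU code, not OpenCL and not the project's implementation). It shows a non-recursive step function for enumerating combinations and a per-work-item chunk that fixes one member by a thread ID. All names are hypothetical. Note one deliberate difference from the post: this variant splits the combinations without any duplicated work, which makes the chunks unequal in size, whereas the post accepts duplicated work in exchange for simpler, more uniform threads. The step function also returns the first position that changed, which is the kind of information a caller could use to recompute only the changed part of a per-combination matrix.

```python
from itertools import combinations


def next_combination(idx, n):
    """Advance `idx` (sorted indices into range(n)) to the next combination.

    Works in place and without recursion. Returns the first position that
    changed, or -1 if `idx` already was the last combination.
    """
    k = len(idx)
    pos = k - 1
    # Find the rightmost index that has not reached its maximum value yet.
    while pos >= 0 and idx[pos] == n - k + pos:
        pos -= 1
    if pos < 0:
        return -1
    idx[pos] += 1
    # Reset everything to the right to the smallest valid values.
    for j in range(pos + 1, k):
        idx[j] = idx[j - 1] + 1
    return pos


def thread_chunk(candidates, size, thread_id):
    """Combinations of `size` candidates handled by one hypothetical work item.

    The work item fixes candidates[thread_id] as the first member and only
    draws the remaining members from later candidates, so chunks never overlap.
    """
    first = candidates[thread_id]
    rest = candidates[thread_id + 1:]
    k = size - 1
    if k > len(rest):
        return []  # This work item has nothing to do.
    idx = list(range(k))
    chunk = []
    while True:
        chunk.append((first,) + tuple(rest[i] for i in idx))
        if next_combination(idx, len(rest)) < 0:
            break
    return chunk


# Self-check: the chunks of all work items together cover every combination.
cands = [3, 5, 8, 13, 21]
all_chunks = [c for t in range(len(cands)) for c in thread_chunk(cands, 3, t)]
print(all_chunks == list(combinations(cands, 3)))
print([len(thread_chunk(cands, 3, t)) for t in range(len(cands))])
```

	The second printed list shows the cost of avoiding duplication: the work items with a low thread ID get most of the combinations. Letting every work item draw from all other candidates would even that out, at the price of testing the same set several times.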

Bill F Joined: 3 Apr 17 Posts: 50 Credit: 1,161,875 RAC: 0	Message 1956 - Posted: 19 Sep 2020, 2:15:01 UTC - in response to Message 1766.
	L.Gilson, your possible OpenCL port was mentioned in the news thread "AMD COVID-19 HPC Fund".
	Bill
