I'd like to hear from people with experience of coding for both. Myself, I only have experience with NVIDIA.
NVIDIA CUDA seems to be a lot more popular than the competition. (Just counting question tags on this forum, 'cuda' outperforms 'opencl' 3:1, and 'nvidia' outperforms 'ati' 15:1, and there's no tag for 'ati-stream' at all).
On the other hand, according to Wikipedia, ATI/AMD cards should have a lot more potential, especially per dollar. The fastest NVIDIA card on the market as of today, GeForce 580 ($500), is rated at 1.6 single-precision TFlops. AMD Radeon 6970 can be had for $370 and it is rated at 2.7 TFlops. The 580 has 512 execution units at 772 MHz. The 6970 has 1536 execution units at 880 MHz.
How realistic is that paper advantage of AMD over NVIDIA, and is it likely to be realized in most GPGPU tasks? What happens with integer tasks?
preguntado el 09 de enero de 11 a las 08:01
Metaphorically speaking ati has a good engine compared to nvidia. But nvidia has a better car :D
This is mostly because nvidia has invested good amount of its resources (in money and people) to develop important libraries required for scientific computing (BLAS, FFT), and then a good job again in promoting it. This may be the reason CUDA dominates the tags over here compared to ati (or OpenCL)
As for the advantage being realized in GPGPU tasks in general, it would end up depending on other issues (depending on the application) such as, memory transfer bandwidth, a good compiler and probably even the driver. nvidia having a more mature compiler, a more stable driver on linux (linux because, its use is widespread in scientific computing), tilt the balance in favor of CUDA (at least for now).
EDITAR Enero 12, 2013
It's been two years since I made this post and it still seems to attract views sometimes. So I have decided to clarify a few things
- AMD has stepped up their game. They now have both BLAS and FFT libraries. Numerous third party libraries are also cropping up around OpenCL.
- Intel has introduced Xeon Phi into the wild supporting both OpenMP and OpenCL. It also has the ability use existing x86 code. as noted in the comments, limited x86 without SSE for now
- NVIDIA and CUDA still have the edge in the range of libraries available. However they may not be focusing on OpenCL as much as they did before.
In short OpenCL has closed the gap in the past two years. There are new players in the field. But CUDA is still a bit ahead of the pack.
I don't have any strong feelings about CUDA vs. OpenCL; presumably OpenCL is the long-term future, just by dint of being an open standard.
But current-day NVIDIA vs ATI cards for GPGPU (not graphics performance, but GPGPU), that I do have a strong opinion about. And to lead into that, I'll point out that on the current Top 500 list of big clusters, NVIDIA leads AMD 4 systems to 1, and on gpgpu.org, search results (papers, links to online resources, etc) for NVIDIA outnumber results for AMD 6:1.
A huge part of this difference is the amount of online information available. Check out the NVIDIA CUDA Zone versus AMD's GPGPU Developer Central. The amount of stuff there for developers starting up doesn't even come close to comparing. On NVIDIAs site you'll find tonnes of papers - and contributed code - from people probably working on problems like yours. You'll find tonnes of online classes, from NVIDIA and elsewhere, and very useful documents like the developers' best practice guide, etc. The availability of free devel tools - the profiler, the cuda-gdb, etc - overwhelmingly tilts NVIDIAs way.
(Editor: the information in this paragraph is no longer accurate.) And some of the difference is also hardware. AMDs cards have better specs in terms of peak flops, but to be able to get a significant fraction of that, you have to not only break your problem up onto many completely independent stream processors, each work item also needs to be vectorized. Given that GPGPUing ones code is hard enough, that extra architectural complexity is enough to make or break some projects.
And the result of all of this is that the NVIDIA user community continues to grow. Of the three or four groups I know thinking of building GPU clusters, none of them are seriously considering AMD cards. And that will mean still more groups writing papers, contributing code, etc on the NVIDIA side.
I'm not an NVIDIA shill; I wish it weren't this way, and that there were two (or more!) equally compelling GPGPU platforms. Competition is good. Maybe AMD will step up its game very soon - and the upcoming fusion products look very compelling. But in giving someone advice about which cards to buy today, and where to spend their time putting effort in right now, I can't in good conscience say that both development environments are equally good.
Editado para agregar: I guess the above is a little elliptical in terms of answering the original question, so let me make it a bit more explicit. The performance you can get from a piece of hardware is, in an ideal world with infinite time available, dependent only on the underlying hardware and the capabilities of the programming language; but in reality, the amount of performance you can get in a fixed amount of time invested is also strongly dependant on devel tools, existing community code bases (eg, publicly available libraries, etc). Those considerations all point strongly to NVIDIA.
(Editor: the information in this paragraph is no longer accurate.) In terms of hardware, the requirement for vectorization within SIMD units in the AMD cards also make achieving paper performance even harder than with NVIDIA hardware.
The main difference between AMD's and NVIDIA's architectures is that AMD is optimized for problems where the behavior of the algorithm can be determined at compile-time while NVIDIA is optimized for problems where the behavior of the algorithm can only be determined at run-time.
AMD has a relatively simple architecture that allows them to spend more transistors on ALU's. As long as the problem can be fully defined at compile-time and be successfully mapped to the architecture in a somewhat static or linear way, there is a good chance that AMD will be able to run the algorithm faster than NVIDIA.
On the other hand, NVIDIA's compiler is doing less analysis at compile time. Instead, NVIDIA has a more advanced architecture where they have spent more transistors on logic that is able to handle dynamic behavior of the algorithm that only emerges at run-time.
I believe the fact that most supercomputers that use GPUs go with NVIDIA is that the type of problem that scientists are interested in running calculations on, in general map better to NVIDIA's architecture than AMD's.
I've done some iterative coding in OpenCL. And the results of running it in NVIDIA and ATI, are pretty much the same. Near the same speed in the same value ($) cards.
In both cases, speeds were ~10x-30x comparing to a CPU.
I didn't test CUDA, but I doubt it could solve my random memory fetch problems magically. Nowadays, CUDA and OpenCL are more or less the same, and I see more future on OpenCL than on CUDA. The main reason is that Intel is launching drivers with OpenCL for their processors. This will be a huge advance in the future (running 16, 32 or 64 threads of OpenCL in CPU is REALLY fast, and really easy to port to GPU).
Having spent some time with OpenCL for GCN cards after a few years of CUDA for Fermi and Kepler, I still prefer CUDA as a programming language and would choose AMD hardware with CUDA if I had an option.
Main differences of NVIDIA and AMD (OpenCL):
Even with Maxwell, NVidia still has longer command latencies and complex algorithms are likely to be 10 faster on AMD(assuming same theoretical Tflops) after easy optimizations for both. The gap was up to 60% for Kepler VS GCN. It's harder to optimize complex kernels for NVidia in this sense.
OpenCL is open standard with other vendors available.
Has the Tesla line of hardware that's suitable for reliable high server loads.
New Maxwell is way more power efficient.
Compiler and tools are way more advanced. AMD still can't get to implement
maxregcoutparameter, so you can easily control occupancy on various hardware and their compiler has a lot of random ideas of what is an optimal code that change with every version, so you may need to revisit old code every half a year because it suddenly became 40% slower.
At this point if GPGPU is your goal, CUDA is the only choice, since opencL with AMD is not ready for server farm and it's significantly harder to write efficient code for AMD due to the fact that the compiler always seems to be "in beta".
I am new to GPGPU but I have some experience in scientific computing (PhD in Physics). I am putting together a research team and I want to go towards using GPGPU for my calculations. I had to choose between the available platforms. I decided on Nvidia, for a couple of reasons: while ATI might be faster on paper, Nvidia has a more mature platform and more documentation so it will be possible to get closer to the peak performance on this platform.
Nvidia also has an academic research support program, one can apply for support, I just received a TESLA 2075 card which I am very happy about. I don't know if ATI or Intel supports research this way.
What I heard about OpenCL is that it's trying to be everything at once, it is true that your OpenCL code will be more portable but it's also likely to not exploit the full capabilities of either platform. I'd rather learn a bit more and write programs that utilize the resources better. With the TESLA K10 that just came out this year Nvidia is in the 4.5 TeraFlops range so it is not clear that Nvidia is behind ... however Intel MICs could prove to be a real competitor, especially if they succeed in moving the GPGPU unit to the motherboard. But for now, I chose Nvidia.
My experience in evaluating OpenCL floating point performance tends to favor NVIDIA cards. I've worked with a couple of floating point benchmarks on NVIDIA cards ranging from the 8600M GT to the GTX 460. NVIDIA cards consistently achieve about half of theoretical single-precisino peak on these benchmarks.
The ATI cards I have worked with rarely achieve better than one third of single-precision peak. Note that my experience with ATI is skewed; I've only been able to work with one 5000 series card. My experience is mostly with HD 4000 series cards, which were never well supported. Support for the HD 5000 series cards is much better.
I would like to add to the debate. For us in the business of software, we can compromise raw single-precision performance to productivity but even that I do not have to compromise since, as already pointed out, you cannot achieve as much performance on on ATI's hardware using OpenCL as you can achieve if you write in CUDA on NVIDIA's hardware.
And yes, with PGI's announcement of x86 compiler for CUDA, there won't be any good reason to spend more time and resources writing in OpenCL :)
P.S: My argument might be biased since we do almost all our GPGPU work on CUDA. We have an Image Processing/Computer Vision library CUVI (CUDA for Vision and Imaging) which accelerates some core IP/CV functionality on CUDA.
Cuda is certainly popular than OpenCL as of today, as it was released 3 or 4 years before OpenCL. Since OpenCL been has released, Nvidia has not contributed much for the language as they concentrate much on CUDA. They have not even released openCL 1.2 version for any driver.
As far as heterogenous computing as well as hand held devices as concerned OpenCl will surely gain more popularity in near future. As of now biggest contributor to OpenCL is AMD, It's visible on their site.
in my experience:
if you want best absolute performance then you need to see who is on the latest hardware iteration, and use their stack (including latest / beta releases).
if you want the best performance for the money you will be aiming at gamer cards rather than "professional" cards and the flexibility of targetting different platforms favors opencl.
if you are starting out, in particular, cuda tends to be more polished and have more tools and libraries.
finally, my personal take, after appalling "support" from nvidia (we got a dead tesla and it wasn't changed for months, while a client was waiting): the flexibility to jump ship with opencl is worth the risk of slightly lower performance when nvidia are ahead in the release cycle.