...
But the compute capacity of the CUDA cores is nearly fully realised on GDDR5 cards.
...
Actually, you can easily test this and show that the cores are not fully utilized with GDDR5 memory: download the ethminer source code from GitHub and modify the kernel so that the DAG sampling loop reads from shared* memory instead of global memory (a rough sketch of the idea is below). That lets you see what the CUDA cores are capable of when they access very low latency shared memory rather than the much higher latency global memory.
For actual figures: a GTX 1060 that normally hashes at around 23 MH/s using global memory reaches around 53 MH/s when it uses shared memory instead. That is what its cores are capable of when they have very low latency memory to work with instead of the slower GDDR5 device memory.
(* for non-CUDA programmers: shared memory is program-controllable cache memory for data sharing within a thread block and has the same latency as the L1/texture caches)
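
To make the experiment more concrete, here is a minimal, self-contained CUDA sketch of the idea. It is not ethminer's actual kernel: the kernel name, the buffer sizes, the ACCESSES count and the USE_SHARED_MEM switch are all illustrative assumptions. It just runs a data-dependent read loop, standing in for the DAG sampling loop, against either a large global-memory buffer or a small block of shared memory, so you can time the two variants against each other.

```
// Illustrative sketch only -- NOT ethminer's real kernel.
// Build twice, toggling USE_SHARED_MEM, and compare the kernel times.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define ACCESSES       64          // reads per "hash"; stand-in value
#define DAG_WORDS      (1u << 24)  // 64 MB stand-in "DAG" in global memory
#define SHARED_WORDS   2048u       // small slice that fits in shared memory
#define USE_SHARED_MEM 1           // 0 = global-memory reads, 1 = shared-memory reads

__global__ void sample_kernel(const uint32_t* __restrict__ dag,
                              uint32_t* __restrict__ out,
                              uint32_t seed)
{
    uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t mix = seed ^ idx;

#if USE_SHARED_MEM
    // Stage a small slice of the buffer into shared memory once per block.
    __shared__ uint32_t cache[SHARED_WORDS];
    for (uint32_t i = threadIdx.x; i < SHARED_WORDS; i += blockDim.x)
        cache[i] = dag[i];
    __syncthreads();
#endif

    // Data-dependent read chain, mimicking the DAG sampling loop:
    // each iteration's address depends on the previous read.
    for (int i = 0; i < ACCESSES; ++i) {
#if USE_SHARED_MEM
        mix = 33u * mix + cache[mix % SHARED_WORDS];  // low-latency path
#else
        mix = 33u * mix + dag[mix % DAG_WORDS];       // high-latency path
#endif
    }
    out[idx] = mix;  // keep the result so the loop is not optimised away
}

int main()
{
    const int threads = 128, blocks = 1024;
    const size_t n = (size_t)threads * blocks;

    uint32_t *dag = nullptr, *out = nullptr;
    cudaMalloc(&dag, DAG_WORDS * sizeof(uint32_t));
    cudaMalloc(&out, n * sizeof(uint32_t));
    cudaMemset(dag, 0x5a, DAG_WORDS * sizeof(uint32_t));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    sample_kernel<<<blocks, threads>>>(dag, out, 0x1234u);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("kernel time: %.3f ms (USE_SHARED_MEM=%d)\n", ms, USE_SHARED_MEM);

    cudaFree(dag);
    cudaFree(out);
    return 0;
}
```

The gap between the two timings gives a rough feel for how much of the runtime is memory latency rather than arithmetic; in ethminer itself the equivalent change would be made inside its ethash search kernel rather than in a toy kernel like this.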