cbuchner1, did you note my earlier post about autotune problems and K kernel performance regression?
Do you need any additional information to diagnose those problems?
Part of the problem is that the loop trip count N=1024 was previously hardcoded, and the kernel always assumed it was operating on a single, linear memory block. Making it flexible enough to work with any N value, and to operate on chunked memory as well, cost a bit of performance.
Because there is now a faster replacement kernel for scrypt called "Y", I will not be addressing the performance drop of the K kernel now (later, maybe...). "K" is still kicking butt in scrypt-jane with high N factors and with lookup gap. That's what you will want to use it for.
About autotune being wonky: part of this can be attributed to the "boost" feature of the GPUs. These decide pretty randomly when to clock down and when to clock up, so the measured values can jump up and down pretty badly, making an accurate assessment very hard.
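To illustrate why a single noisy sample is misleading, here is a minimal Python sketch of one common way to dampen this kind of jitter: take several measurements per launch configuration and compare medians instead of single readings. This is not what cudaminer's autotune does; the function names, configuration names and numbers below are all made up for the example.

```python
import statistics

def robust_rate(samples):
    """Return the median of several hash rate samples (e.g. in kH/s).

    The median ignores isolated spikes or dips, such as a short
    boost clock burst or a momentary clock-down.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return statistics.median(samples)

def pick_best(measurements):
    """Return the configuration name with the highest median rate.

    measurements: dict mapping a (hypothetical) launch configuration
    name to a list of measured rates.
    """
    return max(measurements, key=lambda name: robust_rate(measurements[name]))

# Invented numbers: the second configuration has one lucky spike (310)
# but is slower on a typical run.
measurements = {
    "K14x16": [250.0, 251.0, 249.0, 180.0, 252.0],
    "K28x8": [240.0, 310.0, 238.0, 241.0, 239.0],
}
best = pick_best(measurements)
print(best)
```

Picking the configuration by its single best sample would choose "K28x8" because of the 310 spike; comparing medians picks "K14x16". The cost is a longer autotune run, since every configuration has to be measured several times.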
What I fixed yesterday was measurements showing "infinite" hashing speeds for very fast kernels on Windows, such as N-factor 7 or 8 scrypt-jane coins. These would always win over any correct measurements because apparently infinity is always bigger.
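A minimal Python sketch of that kind of guard follows. It assumes the infinite values came from an elapsed time that rounded down to zero on a coarse timer, which the post does not actually state, and it is not the code that went into cudaminer. All names are hypothetical. In C or C++ a floating point division by zero quietly yields infinity; Python raises an exception instead, so the zero check comes first.

```python
import math

def hash_rate(hashes, elapsed_seconds):
    """Return hashes per second, or None if the measurement is unusable.

    A zero (or negative) elapsed time means the timer could not
    resolve the kernel run, so no meaningful rate exists.
    """
    if elapsed_seconds <= 0.0:
        return None
    rate = hashes / elapsed_seconds
    # Reject inf/nan so a bogus value can never win the comparison.
    return rate if math.isfinite(rate) else None

def best_config(results):
    """Return the name of the fastest configuration with a valid rate.

    results: dict mapping a (hypothetical) configuration name to a
    rate as returned by hash_rate(). Returns None if nothing is valid.
    """
    valid = {name: rate for name, rate in results.items() if rate is not None}
    return max(valid, key=valid.get) if valid else None

results = {
    "too_fast_to_time": hash_rate(4096, 0.0),  # discarded, not "infinite"
    "config_a": hash_rate(4096, 0.008),
    "config_b": hash_rate(4096, 0.010),
}
print(best_config(results))
```

Discarding the unusable sample is the simplest option. For very fast kernels it is usually better to time several launches together so the elapsed time is comfortably above the timer resolution.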

Christian