I gave your kernel a spin on my notebook with an HD6750 and got 25KH/s. My current parallel kernel does 45KH/s there, with a 3x smaller kernel binary. Perhaps your kernel isn't meant to run on this hardware. I'll check with newer video cards later.
A follow-up: I've tested it on an HD7990. It produces 900KH/s vs. 880KH/s for my kernel. No scratch registers are used, but the ISA size is 174Kb vs. 40Kb for my kernel. It stores all data buffers as uints and bitaligns them for processing using conditionals. This is an interesting approach, though definitely not for older GPUs, because they hate branching. That's why I went another way, to optimise for older and newer GPUs at the same time. I guess it's time to split the work into two separate kernels and optimise them independently.
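For anyone following along, here is a rough sketch of what "buffers as uints plus bitalign with conditionals" can look like. It is plain C rather than OpenCL and it is not taken from either kernel: bitalign() is a software stand-in that assumes the semantics of the amd_bitalign builtin (cl_amd_media_ops), and load_u32_unaligned() is a hypothetical helper name. The branch on the byte offset is the kind of conditional that GCN handles well and VLIW4/5 does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for amd_bitalign (assumed semantics):
   low 32 bits of the 64-bit value hi:lo shifted right by (shift & 31). */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (shift & 31u));
}

/* Hypothetical helper: read a little-endian 32-bit value at an arbitrary
   byte offset from a buffer that is stored as 32-bit words. */
static uint32_t load_u32_unaligned(const uint32_t *words, size_t n_words,
                                   size_t byte_off)
{
    size_t idx = byte_off >> 2;                       /* word holding the first byte */
    uint32_t shift = (uint32_t)(byte_off & 3u) * 8u;  /* bit offset inside that word */

    assert(idx < n_words);
    if (shift == 0)
        return words[idx];        /* aligned case: a plain word load is enough */

    assert(idx + 1 < n_words);    /* unaligned case also needs the next word */
    return bitalign(words[idx + 1], words[idx], shift);
}
```

With words = {0x44332211, 0x88776655}, a read at byte offset 1 combines both words and yields 0x55443322, while a read at offset 0 takes the aligned branch and returns the first word unchanged.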
Yeah, my optimizations are only for GCN. Like you said, they don't make any sense for VLIW4/5.
I did a quick run and got 970KH/s on my stock 7990 with Crimson 16.9.2.
I am currently working on an offline OpenCL compiler for GCN with inline assembly support.
If I can pull this off, we will see more interesting stuff.
