Thanks. From a first look, I don't see anything I haven't tried yet :-)
Do you have some hashrate figures?
Sorry, no test results. I'm away from all the crypto stuff; that's a rather abandoned project, collecting virtual dust on my HDD...
I didn't quite grok all your tricks

I only use 3 arrays of 32 integers for intermediate results, so memory usage should already be close to minimal, and such buffer reuse could be an independent optimization. I'm quite sure you have tried the rest.
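To make the buffer-reuse idea concrete, here is a rough sketch in C, not the actual code from the project. The mixing step `mix_step` and the driver `run_rounds` are made-up names with a made-up formula, purely for illustration. The only point is that three fixed arrays of 32 integers can rotate roles (two inputs, one output) so that no extra memory is touched per round.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define WORDS 32

/* Hypothetical mixing step (an assumption, not the real algorithm):
 * each output word combines two input buffers. */
static void mix_step(const uint32_t *a, const uint32_t *b, uint32_t *out) {
    for (size_t i = 0; i < WORDS; i++) {
        uint32_t rot = (a[i] << 7) | (a[i] >> 25);  /* rotate left by 7 */
        out[i] = (a[i] + b[(i + 1) % WORDS]) ^ rot;
    }
}

/* Runs `rounds` steps using only the three buffers in `buf`.
 * After each step the roles rotate: the oldest input becomes the
 * next output, so nothing is allocated or copied inside the loop. */
static void run_rounds(uint32_t buf[3][WORDS], unsigned rounds, uint32_t *result) {
    uint32_t *a = buf[0], *b = buf[1], *out = buf[2];
    for (unsigned r = 0; r < rounds; r++) {
        mix_step(a, b, out);
        uint32_t *tmp = a;  /* oldest buffer is free to be overwritten */
        a = b;
        b = out;
        out = tmp;
    }
    /* `b` always points at the most recently written buffer. */
    memcpy(result, b, WORDS * sizeof(uint32_t));
}
```

Since the output buffer is never one of the two inputs of the same step, the rotation is safe without any temporary copy.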
