POC2
...
If someone has mentioned this already, point me to it. What you just did is PoW with a variable time/space tradeoff, as used in cryptanalysis and password crackers. In the current PoC the split ratio is permanently fixed, heavily disfavouring PoW. This was actually discussed a few years ago in #bitcoin as a PoW alternative. For each bit of nonce storage shaved off, the readout can brute-force that bit instead. Assuming we get 2^22 32-bit nonces in a scoop number/1TB readout, where we hash 2^22 times to get 2^22 candidate deadlines, we could instead store 2^22 31-bit nonces and hash 2^23 times during a readout... with a 1 GH/s GPU you can get rid of 10 bits or so. Yes, it's a logarithmic relation (each bit saved doubles the hashing), but PoW is profitable up to a point (a 1/3 PoW/PoC split). The problem is that one needs faster and faster PoW as PoC gets larger, potentially up to a 50/50 split of the tradeoff (when using ASICs for the PoW part).
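To put rough numbers on that tradeoff, here is a minimal sketch using only the figures quoted above (2^22 nonces per readout, 32-bit nonces, a 1 GH/s GPU). The function name `readout_cost` and the simplified model, in which every dropped bit doubles the hashes needed per readout, are assumptions made for illustration and are not taken from any real miner.

```python
# Back-of-the-envelope model of the PoW/PoC tradeoff described above.
# Assumption: dropping k stored bits per nonce means brute-forcing
# 2^k candidates per nonce at readout time.

NONCES_PER_READOUT = 2**22   # from the post: nonces per scoop/1TB readout
NONCE_BITS = 32              # from the post: full nonce width
GPU_HASHRATE = 1e9           # from the post: 1 GH/s GPU


def readout_cost(bits_dropped, nonces=NONCES_PER_READOUT, hashrate=GPU_HASHRATE):
    """Return (hashes per readout, seconds per readout, bits still stored per nonce)."""
    hashes = nonces * 2**bits_dropped
    seconds = hashes / hashrate
    stored_bits = NONCE_BITS - bits_dropped
    return hashes, seconds, stored_bits


for k in (0, 1, 10):
    hashes, seconds, stored_bits = readout_cost(k)
    print(f"drop {k:2d} bits: store {stored_bits} bits/nonce, "
          f"{hashes} hashes, {seconds:.3f} s per readout")
```

Under these assumptions, dropping 10 bits costs 2^32 hashes per readout, which is a few seconds at 1 GH/s, so it roughly matches the "10 bits or so" estimate.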
Ironically, this solves your point 1, as POC2 is capped by the PoW requirement it introduces.
Bottom line: plotting will be very slow using groestl as it is. Forget about slowing it down even more with that PoW during plotting; the throughput is not up to it. A Salsa kernel or even siphash (1 GByte/s) could bring it up to current plotting speeds with a 6-8 zero-bit plot-PoW (1 GByte/s & 8-bit PoW -> 4 MByte/s). Ideally this step could be avoided altogether by just using a large totalscoops instead, though that brings engineering difficulties in keeping memory usage during plotting sane.
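The throughput figure in parentheses can be checked with a short sketch. It assumes that a plot-PoW of n zero bits lets through one candidate in 2^n on average, so the effective plotting rate is the raw hash throughput divided by 2^n. The helper name `effective_plot_rate` is hypothetical, and 1 GByte/s is treated as 1024 MByte/s.

```python
# Effective plotting rate when only 1 in 2^zero_bits hashed candidates
# passes the plot-PoW (assumption: uniformly distributed hash output).

def effective_plot_rate(raw_mbyte_per_s, zero_bits):
    """Return the expected rate of accepted plot data in MByte/s."""
    return raw_mbyte_per_s / 2**zero_bits


RAW_RATE = 1024  # 1 GByte/s expressed in MByte/s (siphash figure from the post)

for bits in (6, 7, 8):
    rate = effective_plot_rate(RAW_RATE, bits)
    print(f"{bits}-bit plot-PoW: {rate:g} MByte/s")
```

With 8 zero bits this gives 4 MByte/s, the figure quoted above, and 6 bits gives 16 MByte/s.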
I hadn't thought of the PoW/PoC tradeoff for saving space. I'll keep that in mind.
Most of the testing I've done has used a target that lets 50% of nonces through, and I have been finding throughput to be a major issue. I'm not giving up on groestl yet, but I'll take a look at those if I can't get performance to an acceptable level. In the most recent version the sorting thread can't keep up with a 7970 hashing, so I expect there'll be a significant speedup from refactoring that.