For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset + global_worksize
You're doubling the global id, not the worksize.
Global ID without global offset runs from 0 till (global worksize - 1), where the global worksize can be queried in the kernel via get_global_size(0). So for 2-component vectors we need global worksize / 2 (as the argument in the enqueue kernel call), because 2 nonces are processed in each work-item.
Global ID with global offset is: global offset till (global worksize + global offset - 1).
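To make the two ranges concrete, here is a tiny host-side simulation in Python (not OpenCL). The helper name global_ids is made up for illustration; it only lists the values get_global_id(0) would return for a 1-D launch, assuming the usual rule that the ID is the offset plus the work-item index.

```python
def global_ids(global_worksize, global_offset=0):
    """List the values get_global_id(0) would take in a 1-D launch.

    Hypothetical helper for illustration only: it models
    global ID = global offset + work-item index.
    """
    return [global_offset + i for i in range(global_worksize)]


# Without an offset: 0 till (global worksize - 1)
print(global_ids(4))

# With an offset of 10: 10 till (4 + 10 - 1)
print(global_ids(4, 10))
```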
Dia
I still think it should be something like:
u nonce = (uint)get_global_id(0) + (uint)get_global_size(0) * (uint4)(0, 1, 2, 3);
for vectors4 etc.
Thinking out loud again:
get_global_id(0) == ranges from global_offset for the 1st work-item till (global_offset + (global_worksize - 1)) for the last work-item
get_global_size(0) == global_worksize (constant value)
global_offset == nonce-base, so for the 1st work-item that results in:
nonce.x = nonce-base + global_worksize * 0;
nonce.y = nonce-base + global_worksize * 1;
nonce.z = nonce-base + global_worksize * 2;
nonce.w = nonce-base + global_worksize * 3;
Let's consider 10 as nonce-base and 4 as global_worksize. This leads to the following nonces that get checked during 1 kernel execution:
Work-Item 0:
10 + 4 * 0 = 10
10 + 4 * 1 = 14
10 + 4 * 2 = 18
10 + 4 * 3 = 22
Work-Item 1:
11 + 4 * 0 = 11
11 + 4 * 1 = 15
11 + 4 * 2 = 19
11 + 4 * 3 = 23
Work-Item 2:
12 + 4 * 0 = 12
12 + 4 * 1 = 16
12 + 4 * 2 = 20
12 + 4 * 3 = 24
Work-Item 3:
13 + 4 * 0 = 13
13 + 4 * 1 = 17
13 + 4 * 2 = 21
13 + 4 * 3 = 25
So we have nonces from 10 to 25, each checked exactly once.
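The table above can be reproduced with a short Python simulation of the index arithmetic. This is only a sketch: the function names are invented for illustration, and it assumes each work-item computes get_global_id(0) + get_global_size(0) * k for k = 0 .. vec_size - 1, as in the proposed kernel line.

```python
def nonces_for_work_item(global_id, global_size, vec_size):
    """Nonces one work-item checks: global_id + global_size * k per vector lane k."""
    return [global_id + global_size * k for k in range(vec_size)]


def nonces_for_launch(nonce_base, global_worksize, vec_size):
    """Nonces per work-item for one kernel execution (nonce_base == global offset)."""
    return [
        nonces_for_work_item(nonce_base + i, global_worksize, vec_size)
        for i in range(global_worksize)
    ]


per_item = nonces_for_launch(nonce_base=10, global_worksize=4, vec_size=4)
for index, nonces in enumerate(per_item):
    print("Work-Item", index, nonces)

# Flatten and sort to check that every nonce appears exactly once.
covered = sorted(n for nonces in per_item for n in nonces)
print(covered == list(range(10, 26)))
```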
Now if we divide the passed global worksize by 4 (because of the 4-component vector usage in your example) and use 1 instead, this leads to:
Work-Item 0:
10 + 1 * 0 = 10
10 + 1 * 1 = 11
10 + 1 * 2 = 12
10 + 1 * 3 = 13
So I guess your code works if you divide the global worksize by the vec-size before passing that argument to clEnqueueNDRangeKernel.
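As a rough check of that conclusion, the host-side division can be sketched in Python. The names worksize_for and covered_nonces are hypothetical, and the sketch assumes the number of nonces per kernel execution is a multiple of the vec-size; it is not taken from any miner's actual host code.

```python
def worksize_for(total_nonces, vec_size):
    """Global worksize to pass to the enqueue call so that
    total_nonces nonces are checked with vec_size nonces per work-item."""
    if total_nonces % vec_size != 0:
        raise ValueError("total_nonces must be a multiple of vec_size")
    return total_nonces // vec_size


def covered_nonces(nonce_base, global_worksize, vec_size):
    """Sorted nonces checked when each work-item uses global_id + global_size * k."""
    return sorted(
        nonce_base + i + global_worksize * k
        for i in range(global_worksize)
        for k in range(vec_size)
    )


# 4 nonces wanted with 4-component vectors: pass a global worksize of 1.
size = worksize_for(4, 4)
print(size, covered_nonces(10, size, 4))

# 16 nonces wanted with 4-component vectors: pass a global worksize of 4.
size = worksize_for(16, 4)
print(size, covered_nonces(10, size, 4) == list(range(10, 26)))
```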
Dia