### QUESTION: How can you optimize this further? You should know at least one way.
Use shared (local) memory, so a workgroup loads its block once and runs the inner comparison steps without going back to global memory. Also use a better (more coalesced) memory access pattern for the kernel calls that compare elements between blocks.
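As a sketch of the first idea (the kernel name `bitonic_local` and the fixed 512-element block size are assumptions, not the lab's actual code), one workgroup can sort its block entirely in local memory:

```c
// Sketch: one workgroup sorts one 512-element block in local memory.
// Assumes the local work size is 512 and each work-item owns one element.
__kernel void bitonic_local(__global unsigned int *data)
{
    __local unsigned int shared[512];

    unsigned int lid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int len = get_local_size(0);   /* expected to be 512 */

    shared[lid] = data[gid];                /* one global read per element  */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Full bitonic sort of the block, all steps done in local memory. */
    for (unsigned int k = 2; k <= len; k <<= 1) {
        for (unsigned int j = k >> 1; j > 0; j >>= 1) {
            unsigned int ixj = lid ^ j;     /* partner element in this step */
            if (ixj > lid) {
                unsigned int a = shared[lid];
                unsigned int b = shared[ixj];
                /* (lid & k) selects ascending or descending direction. */
                if (((lid & k) == 0 && a > b) || ((lid & k) != 0 && a < b)) {
                    shared[lid] = b;
                    shared[ixj] = a;
                }
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }

    data[gid] = shared[lid];                /* one global write per element */
}
```

Only the first and last lines touch global memory; every intermediate step stays in local memory, which is the point of the optimization.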
### QUESTION: Should each thread produce one output or two? Why?
Each thread swaps two elements, which means that each thread produces two outputs.
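For illustration, one compare-exchange step might look roughly like this (the kernel name `bitonic_step` and its argument layout are assumptions); the work-item at the lower index of each pair does the compare and writes both elements back:

```c
// Sketch of one bitonic step (k, j) working directly on global memory.
// The work-item at the lower index of a pair swaps both elements,
// i.e. it produces two outputs.
__kernel void bitonic_step(__global unsigned int *data,
                           const unsigned int j,
                           const unsigned int k)
{
    unsigned int i   = get_global_id(0);
    unsigned int ixj = i ^ j;               /* partner element */

    if (ixj > i) {
        unsigned int a = data[i];
        unsigned int b = data[ixj];
        /* (i & k) selects ascending or descending direction. */
        if (((i & k) == 0 && a > b) || ((i & k) != 0 && a < b)) {
            data[i]   = b;                  /* output 1 */
            data[ixj] = a;                  /* output 2 */
        }
    }
}
```

The same effect can be achieved by launching only N/2 work-items and computing both indices of the pair, so that no work-item sits idle.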
### QUESTION: How many items can you handle in one workgroup?
512, since the local work size (the number of threads per workgroup) is 512.
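A sketch of how the launch parameters express this (the function and variable names here are assumptions):

```c
#include <CL/cl.h>

/* Sketch: enqueue a kernel with 512 work-items per workgroup, so each
   workgroup covers one 512-element block (n is assumed to be a multiple
   of 512). */
static cl_int enqueue_sort_kernel(cl_command_queue queue, cl_kernel kernel,
                                  size_t n)
{
    size_t localSize  = 512;   /* threads per workgroup     */
    size_t globalSize = n;     /* one work-item per element */

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &globalSize, &localSize, 0, NULL, NULL);
}
```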
### QUESTION: What problem must be solved when you use more than one workgroup? How did you solve it?
Synchronization between workgroups: work-items in different workgroups cannot be synchronized from inside a kernel. We solved it with multiple kernel calls, since the end of a kernel call acts as a global synchronization point (see the sketch below).
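A minimal host-side sketch of this idea, assuming a step kernel like `bitonic_step` above and an in-order command queue (all names are assumptions):

```c
#include <CL/cl.h>

/* Sketch: every (k, j) step of the bitonic sort is its own kernel call.
   With an in-order command queue, each call finishes before the next one
   starts, which gives the global synchronization that workgroups cannot
   do on their own. */
static void bitonic_sort_gpu(cl_command_queue queue, cl_kernel stepKernel,
                             cl_mem buffer, cl_uint n)
{
    size_t globalSize = n;
    size_t localSize  = 512;

    clSetKernelArg(stepKernel, 0, sizeof(cl_mem), &buffer);

    for (cl_uint k = 2; k <= n; k <<= 1) {
        for (cl_uint j = k >> 1; j > 0; j >>= 1) {
            clSetKernelArg(stepKernel, 1, sizeof(cl_uint), &j);
            clSetKernelArg(stepKernel, 2, sizeof(cl_uint), &k);
            clEnqueueNDRangeKernel(queue, stepKernel, 1, NULL,
                                   &globalSize, &localSize, 0, NULL, NULL);
        }
    }
    clFinish(queue);   /* block until the whole sort has finished */
}
```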
### QUESTION: What time do you get? Difference to the CPU? What is the break even size? What can you expect for a parallel CPU version? (Your conclusions here may vary between the labs.)
For 131072 elements:
CPU: 0.082142
GPU: 0.001693
The CPU is faster than the GPU only up to 1024 elements; beyond that the GPU is always faster, so the break-even size is around 1024 elements. A parallel CPU version would run faster than the current one, but the GPU should still win for large inputs, since bitonic sort exposes massive parallelism that the GPU can exploit and the CPU cannot.