# Multicore GPU programming

This is the git repository for the six labs in the course Multicore and GPU programming (TDDD56) at Linköping University.

## Theoretical questions

### Lab 1

- Write a detailed explanation why computation load can be imbalanced and how it affects the global performance. Hint: What is necessary to compute a black pixel, as opposed to a colored pixel?

A colored pixel can require any iteration count between 1 and MAXITER before it escapes, whereas a black pixel always requires the full MAXITER iterations. In the worst case all the black pixels are assigned to one core, so the other cores finish early and have to wait for that particular core. In that case multithreading would hardly make a difference.

- Describe a load-balancing method that would help reducing the performance loss due to load-imbalance. Hint: Observe that the load-balancing method must be valid for any picture computed, not only the default picture.

Each pixel takes an unknown time to compute, so thread tasks cannot be allocated statically. They must be distributed dynamically, e.g. with a shared work pool (protected by a critical section) or a distributed work pool.

*(Figure: graph of load-balancing methods)*

### Lab 2

### Lab 3

Question 1.1: Why does SkePU have a "fused" MapReduce when there already are separate Map and Reduce skeletons? Hint: Think about memory access patterns.

With the fused MapReduce variant the input vector only has to be traversed once: each element is loaded into the processor's local cache, mapped, and immediately folded into the reduction. Separate Map and Reduce instead write an intermediate vector to memory and read it back in a second pass.

Question 1.2: Is there any practical reason to ever use separate Map and Reduce in sequence?

Yes: if the vector produced by Map is needed for anything else in the program, the two skeletons must be used separately so that the intermediate result is actually materialized.

Question 1.3: Is there a SkePU backend which is always more efficient to use, or does this depend on the problem size? Why? Either show with measurements or provide a valid reasoning.

CPU: faster for small problem sizes, because the clock frequency of a CPU core is higher than that of a GPU core and no data transfer is needed.

GPU: faster for large problem sizes, because the GPU has many more cores. The problem must be both large and parallelizable to make up for the time it takes to transfer the data from CPU to GPU.

Question 1.4: Try measuring the parallel back-ends with measureExecTime exchanged for measureExecTimeIdempotent. This measurement does a "cold run" of the lambda expression before running the proper measurement. Do you see a difference for some backends, and if so, why?

The combined version on the GPU shows a big difference when measured with measureExecTimeIdempotent: once the cold run has absorbed the one-time setup cost, it is almost the same speed as the separate version. The separable version can possibly still be parallelized better than the combined one.

Question 2.1: Which version of the averaging filter (unified, separable) is the most efficient? Why?

The separable version: it performs fewer calculations (two 1D passes instead of one 2D pass) and can be parallelized better.

Question 3.1: In data-parallel skeletons like MapOverlap, all elements are processed independently of each other. Is this a good fit for the median filter? Why/why not?

Yes, it is a good fit, because every output pixel can be computed independently of the others. It could be made more efficient if a pixel reused results from its neighbours instead of recomputing its whole window independently, since with median filtering large areas of almost the same colour appear.

Question 3.2: Describe the sequence of instructions executed in your user-function. Is it data dependent? What does this mean for e.g., automatic vectorization, or the GPU backend?

Add the elements of the defined region to an array of predefined size, sort the array with bubble sort, and return the middle element, i.e. the median. The sequence is data dependent: the swap branches in the bubble sort depend on the pixel values, which hinders automatic vectorization and causes branch divergence in the GPU backend.

### Lab 4

### Lab 5

### Lab 6