@@ -24,6 +24,10 @@ with shared (critical section) or distributed work pool.
...
## Lab 2
## Lab 3
#### Question 1.1: Why does SkePU have a "fused" MapReduce when there already are separate Map and Reduce skeletons? Hint: Think about memory access patterns.
With the fused MapReduce, each element of the shared-memory vector is read only once: it is loaded into the processor's local cache, mapped, and immediately folded into the reduction. Separate Map and Reduce skeletons would traverse the data twice and write an intermediate result vector in between, roughly doubling the memory traffic.
...
@@ -41,10 +45,10 @@ Especially for OpenCL, the bottleneck is loading the data from the CPU to the GP
...
#### Question 2.1: Which version of the averaging filter (unified, separable) is the most efficient? Why?
#### Question 3.1: In data-parallel skeletons like MapOverlap, all elements are processed independently of each other. Is this a good fit for the median filter? Why/why not?
It works, but it is not an ideal fit. Median filtering produces large areas of nearly uniform color, so neighbouring pixels often end up computing the same (or a very similar) median over heavily overlapping windows. An implementation where a pixel could reuse results from its already-computed neighbours would avoid this redundant work, but the data-parallel model of MapOverlap, where every element is processed independently, cannot express such reuse.
#### Question 3.2: Describe the sequence of instructions executed in your user-function. Is it data dependent? What does this mean for e.g., automatic vectorization, or the GPU backend?
The user function copies the elements of the defined region into an array of predefined size, sorts that array with bubble sort, and returns the middle element, which is the median. The bubble sort is data dependent: whether each compare-and-swap actually swaps depends on the pixel values themselves. Such data-dependent control flow hinders automatic vectorization, and on the GPU backend it causes threads within a warp to diverge, serializing their execution.
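The instruction sequence above can be sketched as follows (a plain-C++ illustration, not the actual SkePU user-function signature; the function name and fixed window size are assumptions for the example):

```cpp
#include <cstddef>

// Median of a fixed-size neighbourhood via bubble sort. The branch in the
// inner loop depends on the pixel values themselves, so the control flow is
// data dependent -- hard for the compiler to auto-vectorize, and on a GPU
// backend different threads take different branch paths (warp divergence).
float window_median(float* win, int n) {
    for (int i = 0; i < n - 1; ++i)
        for (int j = 0; j < n - 1 - i; ++j)
            if (win[j] > win[j + 1]) {           // data-dependent branch
                float t = win[j];
                win[j] = win[j + 1];
                win[j + 1] = t;
            }
    return win[n / 2];                           // middle element = median
}
```

Note that the loop bounds themselves are fixed by the window size, so only the swap decisions vary per pixel; a branch-free compare-exchange (sorting network) formulation is a common way to remove the divergence at the cost of always doing every comparison.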