This is the git repository for the six labs in the course Multicore and GPU programming.
# Theoretical questions
## Lab 1
### Write a detailed explanation why computation load can be imbalanced and how it affects the global performance. Hint: What is necessary to compute a black pixel, as opposed to a colored pixel?
For a colored pixel any iteration number between 1 and MAXITER is possible, whereas a black pixel always requires MAXITER iterations.
In the worst-case scenario all the black pixels are assigned to one core, forcing the other cores to wait for that particular core.
In that case multithreading would hardly make a difference.
### Describe a load-balancing method that would help reduce the performance loss due to load imbalance. Hint: Observe that the load-balancing method must be valid for any picture computed, not only the default picture.
Each pixel takes an unknown time to compute, so it is impossible to allocate thread tasks statically. The allocation must be done dynamically, e.g. with a shared (critical-section-protected) or distributed work pool.
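As a minimal sketch of the shared-work-pool idea (the names `compute_row`, `HEIGHT` and `NTHREADS` are hypothetical, not the lab code), each C++ thread atomically claims the next unprocessed row, so cheap and expensive rows even out across threads:

```
#include <atomic>
#include <thread>
#include <vector>

const int HEIGHT = 1024;          // hypothetical image height
const int NTHREADS = 8;
std::atomic<int> next_row{0};     // shared work pool: next unclaimed row

void compute_row(int y) { (void)y; /* placeholder for one scanline */ }

// Each worker claims rows until the pool is empty; a thread that got
// cheap (colored) rows simply claims more of them, balancing the load.
void worker() {
    for (int y = next_row.fetch_add(1); y < HEIGHT; y = next_row.fetch_add(1))
        compute_row(y);
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < NTHREADS; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
}
```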
## Lab 2
There are three threads and a shared stack. First, thread 0 pops an element but ...
## Lab 3
### Why does SkePU have a "fused" MapReduce when there already are separate Map and Reduce skeletons? Hint: Think about memory access patterns.
With the fused variant you traverse the input vector only once, loading each element into the processor's local cache a single time, instead of streaming the data through memory twice.
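A plain sequential sketch of the access-pattern argument (this is not SkePU's API, just the memory behaviour it maps to): the separate version streams the data through memory twice and materializes a temporary vector, while the fused version touches each element exactly once.

```
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 2.0f);

    // Separate Map then Reduce: two passes over memory plus a temporary.
    std::vector<float> tmp(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) tmp[i] = v[i] * v[i];  // Map
    float separate = 0.0f;
    for (std::size_t i = 0; i < tmp.size(); ++i) separate += tmp[i];  // Reduce

    // Fused MapReduce: one pass, each element enters the cache once.
    float fused = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) fused += v[i] * v[i];

    std::printf("separate=%f fused=%f\n", separate, fused);
}
```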
### Is there any practical reason to ever use separate Map and Reduce in sequence?
Yes: if the vector produced by the Map is needed elsewhere in the program, the separate skeletons are necessary.
### Is there a SkePU backend which is always more efficient to use, or does this depend on the problem size? Why? Either show with measurements or provide a valid reasoning.
CPU: small problem sizes are faster because the clock frequency of the CPU is higher than that of the GPU.
GPU: large problem sizes are faster because the GPU has many more cores. The problem must be large and parallelizable to make use of the GPU, since it takes time to transfer the data from the CPU to the GPU.
### Try measuring the parallel back-ends with measureExecTime exchanged for measureExecTimeIdempotent. This measurement does a "cold run" of the lambda expression before running the proper measurement. Do you see a difference for some backends, and if so, why?
The time for the combined version on the GPU is faster when measured with measureExecTimeIdempotent; it is almost as fast as the separate version.
Our guess is that the time it takes to move the data to the GPU is then no longer measured, only the execution time of the function itself.
### Which version of the averaging filter (unified, separable) is the most efficient? Why?
Separable: it needs fewer calculations and can be parallelized better.
### In data-parallel skeletons like MapOverlap, all elements are processed independently of each other. Is this a good fit for the median filter? Why/why not?
Yes, it is a good fit because all pixels are independent of each other.
It could be more efficient if the current pixel checked its neighbours instead of being processed independently, since median filtering produces large areas of almost the same color.
### Describe the sequence of instructions executed in your user-function. Is it data dependent? What does this mean for, e.g., automatic vectorization or the GPU backend?
Add the elements of the defined region to an array of predefined size, sort the array, and return the element in the middle, i.e. the median. The bubble sort's compare-and-swap sequence is data dependent, which defeats automatic vectorization and causes branch divergence on the GPU backend.
## Lab 4
### How many cores will simple.cu use, max, as written? How many SMs?
In simple.cu the grid dimension is 1 x 1, which means there is only one block, giving a total of 16 threads. One block runs on one SM and one SM has 8 cores, so at most 8 cores are used.
### Is the calculated square root identical to what the CPU calculates? Should we assume that this is always the case?
Yes, it is on Olympen. But we should not assume that this is always the case on every GPU, since there are single-precision-only GPUs out there.
### How do you calculate the index in the array, using 2-dimensional blocks?
By adding an index for the y-dimension:

```
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
// The row width is the total thread count in x (gridDim.x * blockDim.x);
// idy * blockDim.x alone is only correct when the grid is one block wide.
int index = idy * gridDim.x * blockDim.x + idx;
result[index] = c[index] + d[index];
```
### What happens if you use too many threads per block?
CUDA reports a launch error and the kernel is not run properly; the result is undefined.
### At what data size is the GPU faster than the CPU?
At N = 32 the CPU is faster; at N = 64 the GPU is faster.
### What block size seems like a good choice? Compared to what?
We set the block size to 32*32. The maximum number of threads per block on the Olympen GPUs is 1024, and since we want the blocks to be square we take the square root of 1024, which is 32.
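Assuming a square N x N array (the kernel and names below are illustrative, not the exact lab code), that choice gives the following launch configuration:

```
__global__ void add(float *out, const float *a, const float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int i = idy * gridDim.x * blockDim.x + idx;
    if (i < n) out[i] = a[i] + b[i];
}

void launch(float *d_out, const float *d_a, const float *d_b, int N) {
    dim3 dimBlock(32, 32);          // 32*32 = 1024 threads, the per-block maximum
    dim3 dimGrid(N / 32, N / 32);   // covers N x N; assumes N is a multiple of 32
    add<<<dimGrid, dimBlock>>>(d_out, d_a, d_b, N * N);
}
```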
### Write down your data size, block size and timing data for the best GPU performance you can get.
#### 1. Blocksize = 32, gridsize = 32, N = 1024. Time (ms): 0.058720
#### 2. Blocksize = 16, gridsize = 64, N = 1024. Time (ms): 0.069376
#### 3. Blocksize = 8, gridsize = 128, N = 1024. Time (ms): 0.069440
### How much performance did you lose by making data accesses non-coalesced?
For coalesced data: 0.058720 ms
For non-coalesced data: 0.178624 ms
The non-coalesced version took about 3x the time.
### What were the main changes in order to make the Mandelbrot run in CUDA?
Make computeFractal a __global__ kernel and remove the two for loops in it; each thread computes one pixel instead.
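A sketch of the resulting kernel (the iteration body, plane mapping and color mapping are simplified placeholders, not the exact lab code):

```
__global__ void computeFractal(unsigned char *image, int width, int height,
                               int maxiter) {
    // One thread per pixel replaces the two CPU-side for loops.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel to a point in the complex plane (bounds illustrative).
    float cr = -2.0f + 3.0f * x / width;
    float ci = -1.5f + 3.0f * y / height;
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < maxiter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++iter;
    }
    image[y * width + x] = (unsigned char)(255 * iter / maxiter);
}
```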
### How many blocks and threads did you use?
32*32 = 1024 threads per block and 16*16 = 256 blocks.
### When you use the Complex class, what modifier did you have to use on the methods?
__device__
### What performance did you get? How does that compare to the CPU solution?
float GPU Time elapsed (ms): 0.039488
float CPU Measured time (ms): 73.739000
### What performance did you get with float vs double precision?
double GPU Time elapsed (ms): 0.237120
double CPU Measured time (ms): 75.787000
### In Lab 1, load balancing was an important issue. Is that an issue here? Why/why not?
No, because every pixel gets its own thread. On the CPU this was not possible because we only had 8 cores (16 with hyperthreading).
## Lab 5
### How much data did you put in shared memory?
We put in storage for 32*32 pixels.
### How much data does each thread copy to shared memory?
Each thread copies one pixel with three channels: r, g and b.
### How did you handle the necessary overlap between the blocks?
We split the image into tiles of size 12. These have a padding the size of the filter radius, and for each tile we also load this padding/overlap into shared memory.
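Under those numbers (tile 12, radius 10 for a 21x21 filter, so a 32x32 block covers one padded tile), the loading pattern can be sketched as below; a single grayscale channel is used for brevity, border handling is simplified, and the names are illustrative:

```
#define TILE 12   // output pixels per block side
#define R 10      // filter radius (21x21 filter)

// Launched with blockDim = (TILE + 2*R, TILE + 2*R) = (32, 32) and one block
// per 12x12 output tile: every thread loads one pixel of the padded tile,
// but only the inner TILE x TILE threads write an output pixel.
__global__ void boxFilter(const unsigned char *in, unsigned char *out,
                          int width, int height) {
    __shared__ unsigned char tile[TILE + 2 * R][TILE + 2 * R];

    int gx = blockIdx.x * TILE + threadIdx.x - R;  // global coords incl. halo
    int gy = blockIdx.y * TILE + threadIdx.y - R;
    gx = min(max(gx, 0), width - 1);               // clamp at image borders
    gy = min(max(gy, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];
    __syncthreads();

    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R) {
        int sum = 0;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[gy * width + gx] = sum / ((2 * R + 1) * (2 * R + 1));
    }
}
```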
### If we would like to increase the block size, about how big blocks would be safe to use in this case? Why?
A block on the graphics cards in Olympen can use at most 1024 threads, so the biggest safe block size, 32*32, is already in use.
### How much speedup did you get over the naive version? For what filter size?
Optimized version: average 85 ms with filter 21x21.
Naive version: average 115 ms with filter 21x21.
### Is your access to global memory coalesced? What should you do to get that?
Yes. Coalescing requires consecutive threads in a warp to access consecutive global memory addresses.
### How much speedup did you get over the non-separated? For what filter size?
Separable version: average 70 ms with filter 21x21, i.e. a 15 ms speedup.
### Compare the visual result to that of the box filter. Is the image LP-filtered with the weighted kernel noticeably better?
The Gaussian filter has a nicer look in our opinion. :-)
### What was the difference in time to a box filter of the same size (5x5)?
No noticeable difference in time: 90 ms for both.
### If you want to make a weighted kernel customizable by weights from the host, how would you deliver the weights to the GPU?
Allocate memory on both the CPU and the GPU, then use cudaMemcpy to send the weights to the GPU.
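A minimal sketch of that transfer (the function and array names are hypothetical):

```
#include <cuda_runtime.h>

// Copy host-side filter weights (e.g. a 5x5 kernel, ksize = 25) to the GPU;
// the returned pointer is then passed to the kernel as an argument.
float *uploadWeights(const float *h_weights, int ksize) {
    float *d_weights = nullptr;
    cudaMalloc(&d_weights, ksize * sizeof(float));
    cudaMemcpy(d_weights, h_weights, ksize * sizeof(float),
               cudaMemcpyHostToDevice);
    return d_weights;
}
```

Since the weights are small and read-only, placing them in __constant__ memory with cudaMemcpyToSymbol would be a reasonable alternative.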
### What kind of algorithm did you implement for finding the median?
We used a bubble sort to sort each pixel's surroundings; then we picked out the median from the sorted list by index.
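A sketch of that selection for one pixel (a 9x9 window as in the answer below; border handling omitted, names illustrative):

```
// Gather the 9x9 neighbourhood, bubble-sort it, return the middle element.
__device__ unsigned char median9x9(const unsigned char *in,
                                   int x, int y, int width) {
    const int R = 4;
    const int N = (2 * R + 1) * (2 * R + 1);   // 81 samples
    unsigned char v[N];
    int n = 0;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            v[n++] = in[(y + dy) * width + (x + dx)];

    // Bubble sort: the compare-and-swap sequence is data dependent.
    for (int i = 0; i < N - 1; ++i)
        for (int j = 0; j < N - 1 - i; ++j)
            if (v[j] > v[j + 1]) {
                unsigned char t = v[j]; v[j] = v[j + 1]; v[j + 1] = t;
            }
    return v[N / 2];                           // the median
}
```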
### What filter size was best for reducing noise?
A larger filter size gives a less noisy image, but it also blurs the image, so you have to strike a balance between the two.
We found this balance with a 9 x 9 filter kernel.
## Lab 6
### How is the communication between the host and the graphics card handled?
First we set our context on the GPU, then we allocate/copy memory to the GPU, and then the kernel is run.
### What function executes your kernel?
clEnqueueNDRangeKernel()
### How does the kernel know what element to work on?
Via get_global_id() inside the kernel function kernel void hello().
### What timing did you get for your GPU reduction? Compare it to the CPU version.
1024 elements:
CPU 0.000013
GPU 0.000403
### Try larger data size. On what size does the GPU version get faster, or at least comparable, to the CPU?
Around 100 000 elements the two are comparable. With larger sizes the GPU gets faster.
262 144 elements:
CPU 0.001173
GPU 0.000492
### How can you optimize this further? You should know at least one way.
Use shared (local) memory, and use a better memory access pattern for the kernel calls that compare blocks.
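As an illustration of the shared-memory idea, here is a CUDA-style sketch (the lab itself uses OpenCL, where __local memory and barrier(CLK_LOCAL_MEM_FENCE) play the same roles; a sum reduction stands in for the lab's operator):

```
// Each block reduces blockDim.x elements to one partial result in shared
// memory; the partials are then reduced by a second, much smaller launch
// (or on the host). Launch with blockDim.x * sizeof(float) shared bytes.
__global__ void reduceSum(const float *in, float *partial, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}
```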
### Should each thread produce one output or two? Why?
Each thread swaps two elements, which means that each thread produces two outputs.
### How many items can you handle in one workgroup?
512, since the number of threads in the local work size is 512.
### What problem must be solved when you use more than one workgroup? How did you solve it?
Synchronization between work groups. We solved it with multiple kernel calls, since returning from one kernel launch acts as a global barrier before the next.
### What time do you get? Difference to the CPU? What is the break even size? What can you expect for a parallel CPU version? (Your conclusions here may vary between the labs.)
For 131072 elements:
CPU sorting.
...