This is the git repository for the six labs in the course Multicore and GPU programming.
# Theoretical questions
## Lab 1
### Write a detailed explanation why computation load can be imbalanced and how it affects the global performance. Hint: What is necessary to compute a black pixel, as opposed to a colored pixel?
For a colored pixel any iteration number between 1 and MAXITER is possible, whereas a black pixel always requires MAXITER iterations.
In the worst-case scenario all the black pixels are assigned to one core, forcing the other cores to wait for that particular core.
In that case multithreading would hardly make a difference.
### Describe a load-balancing method that would help reduce the performance loss due to load imbalance. Hint: Observe that the load-balancing method must be valid for any picture computed, not only the default picture.
Each pixel takes an unknown time to compute, so it is impossible to allocate thread tasks statically. The allocation must be done dynamically, e.g. with a shared (critical-section-protected) or distributed work pool.
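As a minimal sketch of the shared-work-pool idea (the names `compute_row`, `HEIGHT` and `NTHREADS` are hypothetical, not the lab code), each C++ thread atomically claims the next unprocessed row, so cheap and expensive rows even out across threads:

```
#include <atomic>
#include <thread>
#include <vector>

const int HEIGHT = 1024;          // hypothetical image height
const int NTHREADS = 8;
std::atomic<int> next_row{0};     // shared work pool: next unclaimed row

void compute_row(int y) { (void)y; /* placeholder for one scanline */ }

// Each worker claims rows until the pool is empty; a thread that got
// cheap (colored) rows simply claims more of them, balancing the load.
void worker() {
    for (int y = next_row.fetch_add(1); y < HEIGHT; y = next_row.fetch_add(1))
        compute_row(y);
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < NTHREADS; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
}
```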
## Lab 2
There are three threads and a shared stack. First, thread 0 pops an element but ...
## Lab 3
### Why does SkePU have a "fused" MapReduce when there already are separate Map and Reduce skeletons? Hint: Think about memory access patterns.
With the fused variant you traverse the input vector only once, loading each element into the processor's local cache a single time, instead of streaming the data through memory twice.
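A plain sequential sketch of the access-pattern argument (this is not SkePU's API, just the memory behaviour it maps to): the separate version streams the data through memory twice and materializes a temporary vector, while the fused version touches each element exactly once.

```
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 2.0f);

    // Separate Map then Reduce: two passes over memory plus a temporary.
    std::vector<float> tmp(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) tmp[i] = v[i] * v[i];  // Map
    float separate = 0.0f;
    for (std::size_t i = 0; i < tmp.size(); ++i) separate += tmp[i];  // Reduce

    // Fused MapReduce: one pass, each element enters the cache once.
    float fused = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) fused += v[i] * v[i];

    std::printf("separate=%f fused=%f\n", separate, fused);
}
```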
### Is there any practical reason to ever use separate Map and Reduce in sequence?
Yes: if the vector produced by the Map is needed elsewhere in the program, the separate skeletons are necessary.
### Is there a SkePU backend which is always more efficient to use, or does this depend on the problem size? Why? Either show with measurements or provide a valid reasoning.
CPU: small problem sizes are faster because the clock frequency of the CPU is higher than that of the GPU.
GPU: large problem sizes are faster because the GPU has many more cores. The problem must be large and parallelizable to make use of the GPU, since it takes time to transfer the data from the CPU to the GPU.
### Try measuring the parallel back-ends with measureExecTime exchanged for measureExecTimeIdempotent. This measurement does a "cold run" of the lambda expression before running the proper measurement. Do you see a difference for some backends, and if so, why?
The time for the combined version on the GPU is faster when measured with measureExecTimeIdempotent; it is almost as fast as the separate version.
Our guess is that the time it takes to move the data to the GPU is then no longer measured, only the execution time of the function itself.
### Which version of the averaging filter (unified, separable) is the most efficient? Why?
Separable: it needs fewer calculations and can be parallelized better.
### In data-parallel skeletons like MapOverlap, all elements are processed independently of each other. Is this a good fit for the median filter? Why/why not?
Yes, it is a good fit because all pixels are independent of each other.
It could be more efficient if the current pixel checked its neighbours instead of being processed independently, since median filtering produces large areas of almost the same color.
### Describe the sequence of instructions executed in your user-function. Is it data dependent? What does this mean for, e.g., automatic vectorization or the GPU backend?
Add the elements of the defined region to an array of predefined size, sort the array, and return the element in the middle, i.e. the median. The bubble sort's compare-and-swap sequence is data dependent, which defeats automatic vectorization and causes branch divergence on the GPU backend.
## Lab 4
### How many cores will simple.cu use, max, as written? How many SMs?
In simple.cu the grid dimension is 1 x 1, which means there is only one block, giving a total of 16 threads. One block runs on one SM and one SM has 8 cores, so at most 8 cores are used.
### Is the calculated square root identical to what the CPU calculates? Should we assume that this is always the case?
Yes, it is on Olympen. But we should not assume that this is always the case on every GPU, since there are single-precision-only GPUs out there.
### How do you calculate the index in the array, using 2-dimensional blocks?
By adding an index for the y-dimension:

```
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
// The row width is the total thread count in x (gridDim.x * blockDim.x);
// idy * blockDim.x alone is only correct when the grid is one block wide.
int index = idy * gridDim.x * blockDim.x + idx;
result[index] = c[index] + d[index];
```
### What happens if you use too many threads per block?
CUDA reports a launch error and the kernel is not run properly; the result is undefined.
### At what data size is the GPU faster than the CPU?
At N = 32 the CPU is faster; at N = 64 the GPU is faster.
### What block size seems like a good choice? Compared to what?
We set the block size to 32*32. The maximum number of threads per block on the Olympen GPUs is 1024, and since we want the blocks to be square we take the square root of 1024, which is 32.
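Assuming a square N x N array (the kernel and names below are illustrative, not the exact lab code), that choice gives the following launch configuration:

```
__global__ void add(float *out, const float *a, const float *b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int i = idy * gridDim.x * blockDim.x + idx;
    if (i < n) out[i] = a[i] + b[i];
}

void launch(float *d_out, const float *d_a, const float *d_b, int N) {
    dim3 dimBlock(32, 32);          // 32*32 = 1024 threads, the per-block maximum
    dim3 dimGrid(N / 32, N / 32);   // covers N x N; assumes N is a multiple of 32
    add<<<dimGrid, dimBlock>>>(d_out, d_a, d_b, N * N);
}
```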
### Write down your data size, block size and timing data for the best GPU performance you can get.
#### 1. Blocksize = 32, gridsize = 32, N = 1024. Time (ms): 0.058720
#### 2. Blocksize = 16, gridsize = 64, N = 1024. Time (ms): 0.069376
#### 3. Blocksize = 8, gridsize = 128, N = 1024. Time (ms): 0.069440
### How much performance did you lose by making data accesses non-coalesced?
For coalesced data: 0.058720 ms
For non-coalesced data: 0.178624 ms
The non-coalesced version took about 3x the time.
### What were the main changes in order to make the Mandelbrot run in CUDA?
Make computeFractal a __global__ kernel and remove the two for loops in it; each thread computes one pixel instead.
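A sketch of the resulting kernel (the iteration body, plane mapping and color mapping are simplified placeholders, not the exact lab code):

```
__global__ void computeFractal(unsigned char *image, int width, int height,
                               int maxiter) {
    // One thread per pixel replaces the two CPU-side for loops.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel to a point in the complex plane (bounds illustrative).
    float cr = -2.0f + 3.0f * x / width;
    float ci = -1.5f + 3.0f * y / height;
    float zr = 0.0f, zi = 0.0f;
    int iter = 0;
    while (zr * zr + zi * zi < 4.0f && iter < maxiter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++iter;
    }
    image[y * width + x] = (unsigned char)(255 * iter / maxiter);
}
```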
### How many blocks and threads did you use?
32*32 = 1024 threads per block and 16*16 = 256 blocks.
### When you use the Complex class, what modifier did you have to use on the methods?
__device__
### What performance did you get? How does that compare to the CPU solution?
float GPU Time elapsed (ms): 0.039488
float CPU Measured time (ms): 73.739000
### What performance did you get with float vs double precision?
double GPU Time elapsed (ms): 0.237120
double CPU Measured time (ms): 75.787000
### In Lab 1, load balancing was an important issue. Is that an issue here? Why/why not?
No, because every pixel gets its own thread. On the CPU this was not possible because we only had 8 cores (16 with hyperthreading).
## Lab 5
### How much data did you put in shared memory?
We put in storage for 32*32 pixels.
### How much data does each thread copy to shared memory?
Each thread copies one pixel with three channels: r, g and b.
### How did you handle the necessary overlap between the blocks?
We split the image into tiles of size 12. These have a padding the size of the filter radius, and for each tile we also load this padding/overlap into shared memory.
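Under those numbers (tile 12, radius 10 for a 21x21 filter, so a 32x32 block covers one padded tile), the loading pattern can be sketched as below; a single grayscale channel is used for brevity, border handling is simplified, and the names are illustrative:

```
#define TILE 12   // output pixels per block side
#define R 10      // filter radius (21x21 filter)

// Launched with blockDim = (TILE + 2*R, TILE + 2*R) = (32, 32) and one block
// per 12x12 output tile: every thread loads one pixel of the padded tile,
// but only the inner TILE x TILE threads write an output pixel.
__global__ void boxFilter(const unsigned char *in, unsigned char *out,
                          int width, int height) {
    __shared__ unsigned char tile[TILE + 2 * R][TILE + 2 * R];

    int gx = blockIdx.x * TILE + threadIdx.x - R;  // global coords incl. halo
    int gy = blockIdx.y * TILE + threadIdx.y - R;
    gx = min(max(gx, 0), width - 1);               // clamp at image borders
    gy = min(max(gy, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];
    __syncthreads();

    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R) {
        int sum = 0;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[gy * width + gx] = sum / ((2 * R + 1) * (2 * R + 1));
    }
}
```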
### If we would like to increase the block size, about how big blocks would be safe to use in this case? Why?
A block on the graphics cards in Olympen can use at most 1024 threads, so the biggest safe block size, 32*32, is already in use.
### How much speedup did you get over the naive version? For what filter size?
Optimized version: average 85 ms with filter 21x21.
Naive version: average 115 ms with filter 21x21.
### Is your access to global memory coalesced? What should you do to get that?
Yes. Coalescing requires consecutive threads in a warp to access consecutive global memory addresses.
### How much speedup did you get over the non-separated? For what filter size?
Separable version: average 70 ms with filter 21x21, i.e. a 15 ms speedup.
### Compare the visual result to that of the box filter. Is the image LP-filtered with the weighted kernel noticeably better?
The Gaussian filter has a nicer look in our opinion. :-)
### What was the difference in time to a box filter of the same size (5x5)?
No noticeable difference in time: 90 ms for both.
### If you want to make a weighted kernel customizable by weights from the host, how would you deliver the weights to the GPU?
Allocate memory on both the CPU and the GPU, then use cudaMemcpy to send the weights to the GPU.
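A minimal sketch of that transfer (the function and array names are hypothetical):

```
#include <cuda_runtime.h>

// Copy host-side filter weights (e.g. a 5x5 kernel, ksize = 25) to the GPU;
// the returned pointer is then passed to the kernel as an argument.
float *uploadWeights(const float *h_weights, int ksize) {
    float *d_weights = nullptr;
    cudaMalloc(&d_weights, ksize * sizeof(float));
    cudaMemcpy(d_weights, h_weights, ksize * sizeof(float),
               cudaMemcpyHostToDevice);
    return d_weights;
}
```

Since the weights are small and read-only, placing them in __constant__ memory with cudaMemcpyToSymbol would be a reasonable alternative.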
### What kind of algorithm did you implement for finding the median?
We used a bubble sort to sort each pixel's surroundings; then we picked out the median from the sorted list by index.
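A sketch of that selection for one pixel (a 9x9 window as in the answer below; border handling omitted, names illustrative):

```
// Gather the 9x9 neighbourhood, bubble-sort it, return the middle element.
__device__ unsigned char median9x9(const unsigned char *in,
                                   int x, int y, int width) {
    const int R = 4;
    const int N = (2 * R + 1) * (2 * R + 1);   // 81 samples
    unsigned char v[N];
    int n = 0;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            v[n++] = in[(y + dy) * width + (x + dx)];

    // Bubble sort: the compare-and-swap sequence is data dependent.
    for (int i = 0; i < N - 1; ++i)
        for (int j = 0; j < N - 1 - i; ++j)
            if (v[j] > v[j + 1]) {
                unsigned char t = v[j]; v[j] = v[j + 1]; v[j + 1] = t;
            }
    return v[N / 2];                           // the median
}
```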
### What filter size was best for reducing noise?
A larger filter size gives a less noisy image, but it also blurs the image, so you have to strike a balance between the two.
We found this balance with a 9 x 9 filter kernel.
## Lab 6
### How is the communication between the host and the graphics card handled?
First we set our context on the GPU, then we allocate/copy memory to the GPU, and then the kernel is run.
### What function executes your kernel?
clEnqueueNDRangeKernel()
### How does the kernel know what element to work on?
Via get_global_id() inside the kernel function kernel void hello().
### What timing did you get for your GPU reduction? Compare it to the CPU version.
1024 elements:
CPU 0.000013
GPU 0.000403
### Try larger data size. On what size does the GPU version get faster, or at least comparable, to the CPU?
Around 100 000 elements the two are comparable. With larger sizes the GPU gets faster.
262 144 elements:
CPU 0.001173
GPU 0.000492
### How can you optimize this further? You should know at least one way.
Use shared (local) memory, and use a better memory access pattern for the kernel calls that compare blocks.
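As an illustration of the shared-memory idea, here is a CUDA-style sketch (the lab itself uses OpenCL, where __local memory and barrier(CLK_LOCAL_MEM_FENCE) play the same roles; a sum reduction stands in for the lab's operator):

```
// Each block reduces blockDim.x elements to one partial result in shared
// memory; the partials are then reduced by a second, much smaller launch
// (or on the host). Launch with blockDim.x * sizeof(float) shared bytes.
__global__ void reduceSum(const float *in, float *partial, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}
```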
### Should each thread produce one output or two? Why?
Each thread swaps two elements, which means that each thread produces two outputs.
### How many items can you handle in one workgroup?
512, since the number of threads in the local work size is 512.
### What problem must be solved when you use more than one workgroup? How did you solve it?
Synchronization between work groups. We solved it with multiple kernel calls, since returning from one kernel launch acts as a global barrier before the next.
### What time do you get? Difference to the CPU? What is the break even size? What can you expect for a parallel CPU version? (Your conclusions here may vary between the labs.)
For 131072 elements:
CPU sorting.
...