refactor: streamline example 2

Commit 0992a448 authored 3 months ago by Rasmus Ringdahl
--- a/2_multi_core_job/multi_core_job.sh
+++ b/2_multi_core_job/multi_core_job.sh
-#! /bin/bash
-#SBATCH --job-name=demo_multi_core
-#SBATCH --time=00:05:00
-#SBATCH --ntasks=1
-#SBATCH --cpus-per-task=2
-#SBATCH --mem-per-cpu=50MB
-#SBATCH --output=multi_core_job.log
-
-# Loading Python into the environment
-module load python/anaconda3-2024.02-3.11.7
-
-# Start job stage
-srun python multi_core_task.py
\ No newline at end of file
--- a/2_multi_core_job/README.md
+++ b/2_multi_core_job/README.md
-# Multi core jobs
-A multi core job is a job that splits the computation to multiple cores. This type of job is the most suitable and most common ones to run on Lundgren. This includes optimization problems and heavy computations.
+# Multiple job steps
+In Slurm, job steps are a way to launch distinct tasks, most commonly in parallel but also sequentially, from within a single job script. Job steps are executed with the Slurm command `srun`.
+
+Multiple job steps are useful when a chain of tasks is needed, for example pre-processing -> calculations -> post-processing. In this example Slurm is instructed to run multiple job steps: input data is copied to local storage, some calculations are done, and lastly the output is compressed and sent back to the home folder.

 ## How to run
 To run the example do the following steps:
 1. Log in to Lundgren
 2. Change directory to the example code
-3. Run `sbatch multi_core_job.sh`
+3. Run `sbatch multiple_job_steps.sh`
 4. Check queue status by running `squeue`
-5. When the job is completed check the file _multi_core_job.log_
-
-Try changing the number of cpus in _multi_core_job.sh_ and see the changes in processing time.
+5. When the job is completed check the file _multiple_job_step.log_.

 ## Detailed description of the example
 The batch script is the main file for the job allocation and preparation. Inside the Python script a few environment variables are fetched and printed out.

 ### The batch script
-The batch script, multi_core_job.sh_, contains three sections. The first section contains input arguments to the Slurm scheduler. The second section loads Python into environment so it is accessible and lastly the a job step is performed.
+The batch script, _multiple_job_steps.sh_, contains three sections. The first section contains input arguments to the Slurm scheduler. The second section loads Python into the environment so it is accessible, and lastly all the job steps are performed.

 The input arguments are defined with a comment beginning with SBATCH followed by the argument key and value. For easier readability the long `--` option form is used.

- __job-name:__ The name of the job is set to demo_multi_core
- __time:__ The requeted time is set to 5 minutes, _00:05:00_
- __ntasks:__ The number of tasks to be performed in this job is set to _1_.
- __cpus-per-task:__ The requested number of cores per task is set to _2_
- __mem:__ The requested memory is set to _50 MB_
- __output:__ The standard output should be sent to the file multi_core_job.log_
+- __job-name:__ The name of the job
+- __time:__ The requested time
+- __ntasks:__ The number of tasks to be performed in this job
+- __cpus-per-task:__ The requested number of CPUs per task
+- __mem-per-cpu:__ The requested memory per allocated CPU
+- __output:__ File name for standard output

 Python needs to be loaded into the environment in order to be accessible; this is done in the next step with the __module__ command.

-The job step with the single task is allocated and performed with the __srun__ command.
+The job steps are allocated and performed with the __srun__ commands.
+1. A folder named after the job ID is created on the local hard drive, in the data folder of Lundgren, _/local/data1/<LiU-ID>_.
+2. Input data files are copied to the newly created folder.
+3. The computational step of the job is run.
+4. The output files are compressed.
+5. The compressed output files are moved to the home folder.
+6. The folder with the data is removed, and if the <LiU-ID> folder in the data folder is then empty it is removed as well.
+
+_In this example only the computational step needs multiple CPUs, therefore every srun except the one for step 3 is set to use 1 CPU per task._

 #### The python script
-The python script represents the taskt to be done. In this case the task is to wait a random time and print the waiting is done.
+The Python script represents the task to be done. In this case the task is to read an input file, wait to simulate a calculation, and afterwards write to an output file.

-The environment variable __SLURM_CPUS_PER_TASK__ is used to restrict the worker pool to the allocated number of cores.
+- The environment variable __SLURM_JOB_ID__ can be used to create temporary files and folders.
+- The environment variable __SLURM_CPUS_PER_TASK__ is used to restrict the worker pool to the allocated number of CPUs when running in parallel.

 ### How to set the number of cores in different programming languages and software
 Most programming languages and software try to make use of all cores that are available. This can lead to oversubscription of resources. On a shared resource one must match the maximum used resources with the allocated ones. This section gives a reference on how to do this in commonly used software.
@@ -41,4 +50,4 @@ Most programming languages and softwares tries to make use of all cores that are
 - __CPLEX:__ Use the parameter _global thread count_. Read more in the [documentation](https://www.ibm.com/docs/en/icos/22.1.2?topic=parameters-global-thread-count)
 - __Gurobi:__ Use the configuration parameter _ThreadLimit_. Read more in the [documentation](https://docs.gurobi.com/projects/optimizer/en/current/reference/parameters.html#threadlimit)
 - __MATLAB:__ Create an instance of the parpool object with the _poolsize_ set to the number of cores and use the pool when running in parallel. Read more in the [documentation](https://se.mathworks.com/help/parallel-computing/parpool.html)
- __Python:__  If the multiprocessing package is used, create an instance of the pool class with the _processes_ set to the number of cores and use the pool when running in parallell. Read more in the [documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool)
+- __Python:__ If the multiprocessing package is used, create an instance of the Pool class with _processes_ set to the number of cores and use the pool when running in parallel. Read more in the [documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool)
\ No newline at end of file
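
As an illustration of the Python entry in the README list above, here is a minimal sketch of sizing a worker pool from the Slurm allocation. The helper name `allocated_cpus` and the fallback to 1 CPU outside Slurm are assumptions made for this sketch; the example script in this commit exits instead when the variable is missing.

```python
import os
from multiprocessing import Pool


def allocated_cpus(environ=os.environ, default=1):
    """Return the number of CPUs Slurm allocated per task.

    Falls back to `default` when SLURM_CPUS_PER_TASK is unset.
    The fallback is an assumption of this sketch, for runs outside Slurm.
    """
    value = environ.get('SLURM_CPUS_PER_TASK')
    return int(value) if value else default


def square(x):
    """Stand-in for a real unit of work."""
    return x * x


if __name__ == '__main__':
    # Size the pool to the allocation instead of every core on the node.
    with Pool(processes=allocated_cpus()) as pool:
        print(pool.map(square, range(8)))
```

Passing the environment as an argument keeps the helper easy to try without a running job, for example `allocated_cpus({'SLURM_CPUS_PER_TASK': '2'})`.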
--- a/2_multiple_job_steps/multiple_job_steps.sh
+++ b/2_multiple_job_steps/multiple_job_steps.sh
+#! /bin/bash
+#SBATCH --job-name=multiple_job_step
+#SBATCH --time=00:02:00
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --mem-per-cpu=50MB
+#SBATCH --output=multiple_job_step.log
+
+# Loading Python into the environment
+module load python/anaconda3-2024.02-3.11.7
+
+# Specify input file
+file=data_4.txt
+temporary_folder=/local/data1/${USER}
+working_folder=${temporary_folder}/${SLURM_JOB_ID}
+
+# Step 1 - Create a temporary folder to store data in.
+srun --cpus-per-task=1 mkdir -v -p ${working_folder}
+
+# Step 2 - Copy indata to the temporary folder.
+srun --cpus-per-task=1 cp -v ${PWD}/../data/${file} ${working_folder}
+
+# Step 3 - Start job stage
+srun python parallel_task.py ${working_folder}/${file} ${working_folder}/output.csv
+
+# Step 4 - Compress all csv files.
+srun --cpus-per-task=1 tar -czvf ${working_folder}/output.tar.gz -C ${working_folder} $(cd ${working_folder} && ls *.csv)
+
+# Step 5 - Move output data to home folder
+srun --cpus-per-task=1 mv -v ${working_folder}/output.tar.gz ${PWD}
+
+# Step 6a - Remove temporary files.
+srun --cpus-per-task=1 rm -rfv ${working_folder}
+
+# Step 6b - Clear folder
+srun --cpus-per-task=1 test -n "$(ls -A "$temporary_folder")" || rmdir -v "$temporary_folder"
\ No newline at end of file
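
The stage-in, compute, stage-out chain of the batch script can also be sketched in plain Python, which may help when trying the pattern outside Slurm. This is not part of the commit: the function names `run_staged` and `demo`, the placeholder computation, and the use of `try`/`finally` for cleanup are all assumptions of this sketch.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def run_staged(input_file: Path, scratch_root: Path, job_id: str, dest: Path) -> Path:
    """Mirror steps 1-6 of the batch script and return the archive path."""
    working = scratch_root / job_id
    working.mkdir(parents=True, exist_ok=True)            # Step 1
    try:
        staged = Path(shutil.copy(input_file, working))   # Step 2
        # Step 3 - placeholder computation: copy the input into a csv file.
        (working / 'output.csv').write_text(staged.read_text())
        # Step 4 - compress all csv files, stored without their directory.
        archive = working / 'output.tar.gz'
        with tarfile.open(archive, 'w:gz') as tar:
            for csv_file in sorted(working.glob('*.csv')):
                tar.add(csv_file, arcname=csv_file.name)
        # Step 5 - move the archive to the destination folder.
        result = Path(shutil.move(str(archive), str(dest)))
    finally:
        shutil.rmtree(working, ignore_errors=True)         # Step 6a
        # Step 6b - remove the per-user folder only if it is now empty.
        if scratch_root.exists() and not any(scratch_root.iterdir()):
            scratch_root.rmdir()
    return result


def demo():
    """Run the chain in a throwaway directory and list the archive members."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / 'data_4.txt'
        source.write_text('{"sleep": [1, 2]}')  # hypothetical input contents
        home = root / 'home'
        home.mkdir()
        archive = run_staged(source, root / 'scratch' / 'user', '1234', home)
        with tarfile.open(archive) as tar:
            return tar.getnames()


if __name__ == '__main__':
    print(demo())
```

Unlike the batch script, this sketch cleans up in a `finally` block, so the working folder is removed even when the computation fails.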
--- a/2_multi_core_job/multi_core_task.py
+++ b/2_multi_core_job/multi_core_task.py
 from datetime import datetime
 from multiprocessing import Pool
+
+import json
 import logging
 import os
-import random
+import sys
 import time

 logger = logging.getLogger(__name__)

-def sleep(input):
+def sleep(input) -> int:
    time.sleep(input[1])
    logger.info('Task %d done.',input[0])

-def main():
+    return input[1]
+
+def main(input_file: str, output_file: str):
    # Read environment variables.
-    NUMBER_OF_CORES = os.environ.get('SLURM_CPUS_PER_TASK','Unknown')
-    if NUMBER_OF_CORES in 'Unknown':
-        logger.error('Unkown number of cores, exiting.')
+    JOB_NAME = os.environ.get('SLURM_JOB_NAME','Unknown')
+    JOB_ID = os.environ.get('SLURM_JOB_ID','Unknown')
+    NUMBER_OF_CPUS = os.environ.get('SLURM_CPUS_PER_TASK','Unknown')
+    if NUMBER_OF_CPUS == 'Unknown':
+        logger.error('Unknown number of CPUs, exiting.')
        return

-    NUMBER_OF_CORES = int(NUMBER_OF_CORES)
-    logger.info('Running program with %d cores.',NUMBER_OF_CORES)
+    NUMBER_OF_CPUS = int(NUMBER_OF_CPUS)
+    logger.info('**** Output for job %s (%s) ****', JOB_NAME, JOB_ID)
+    logger.info('Running program with %d CPUs.',NUMBER_OF_CPUS)

-    # Creating a list of tasks to be performed
-    # This represents the calculations
-    random.seed(1)
+    # Reading configuration file and create a list of tasks
+    # This represents the reading of parameters and calculations
+    logger.info('Reading configuration from %s.',input_file)
+    with open(input_file, 'r') as file:
+        data = json.load(file)
+    
    tasks = []
    total_time = 0
-    for i in range(10):
-        time = random.randrange(1,29)
+    for i in range(len(data['sleep'])):
+        time = data['sleep'][i]
        tasks.append((i, time))
        total_time = total_time + time

    # Creating a multiprocessing pool to perform the tasks
-    pool = Pool(processes=NUMBER_OF_CORES)
+    pool = Pool(processes=NUMBER_OF_CPUS)

    # Submitting the tasks to the worker pool
    tic = datetime.now()
    logger.info('Submitting tasks to pool.')
-    pool.map(sleep, tasks)
+    results = pool.map(sleep, tasks)
    toc = datetime.now()

    logger.info('All tasks are done, took %d seconds, compared to %d seconds with single thread.',
        (toc-tic).seconds, total_time)
    
+    logger.info('Writing result to %s', output_file)
+    with open(output_file, 'w') as file:
+        file.write('time\n')
+        for result in results:
+            file.write('{}\n'.format(result))
+    

 if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
-    main()
+    input_file = sys.argv[1]
+    output_file = sys.argv[2]
+    main(input_file, output_file)
+    sys.exit(0)
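
The script builds its task list from the `sleep` key of a JSON input file. The contents of `data_4.txt` are not part of this commit, so the sample values below are hypothetical, and the helper name `load_tasks` is an assumption made for this sketch.

```python
import json
import tempfile
from pathlib import Path


def load_tasks(input_file):
    """Return (tasks, total_time) from a JSON file holding a 'sleep' list."""
    with open(input_file, 'r') as file:
        data = json.load(file)
    # Pair every duration with its index, as the loop in the script does.
    tasks = list(enumerate(data['sleep']))
    total_time = sum(duration for _, duration in tasks)
    return tasks, total_time


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / 'sample.txt'
        # Hypothetical durations in seconds.
        sample.write_text(json.dumps({'sleep': [3, 1, 2]}))
        print(load_tasks(sample))
```

Using `enumerate` and the name `duration` also avoids reusing the name `time`, which the script imports as a module.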