I was recently asked to conduct a coupled E3SM simulation experiment. This task gave me the opportunity to familiarize myself with the latest best-practice simulation standards. An official step-by-step guide can be found here.
Some Key Takeaways:
- Comprehensive Scripting: Keep every step in the script, including cloning the source code, building the model, and modifying the namelists. This practice helps prevent mistakes later.
- Script Version Control: Ensure that each simulation is associated with a specific script version, enabling better tracking and replication of results (see the git sketch after this list).
- Test the Simulation before Production Runs: Conduct short test runs (a few model days) on small PE layouts to verify the stability of the simulation. This preliminary testing ensures that the simulation does not crash and produces bit-for-bit (b4b) results compared with baseline runs and across different PE layouts. Such testing minimizes the risk of a production run crashing after an extended wait in the queue.
- Post-Processing Version Control: Use version control for the post-processing configuration file (e.g., for zppy) to maintain consistency across the various stages of analysis.
- Archiving Simulation Results: Always archive the simulation results, making it easier to locate past results and restart the simulations if necessary.
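One minimal way to realize the version-control takeaway, assuming the run scripts live in their own git repository (the file name and tag here are hypothetical):
```
cd ~/E3SMv3_dev/scripts
git add run.20230808.v3alpha02.piControl.chrysalis.sh   # hypothetical script name
git commit -m "Production script for v3alpha02 piControl"
# Tag the exact version used for the production run so it can be recovered later
git tag 20230808.v3alpha02.piControl
```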
Steps:
- Start from an existing simulation script: Use a script like this one.
- Reproducing Results:
- To reproduce the same results, submit the job with a short test option activated.
- Ensure `do_fetch_code=true` so that the script automatically clones the code and checks out the correct branch (a sketch of these options follows below).
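As a sketch of where these options live in a run_e3sm-style script (the `run` variable name and test labels follow the E3SM template, but check your own script for the exact options):
```
# Select a short test instead of a production run; 'XS_1x10_ndays'
# matches the 10-day, extra-small-layout test verified below
readonly run='XS_1x10_ndays'   # use 'production' for the real run

# Let the script clone the code and check out the correct branch
do_fetch_code=true
```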
- Short Test and Verification:
- After the short test, verify the test results against the original simulation results using the `md5sum` function:
```
# 10-day test simulations with different layouts
cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/tests
for test in *_*_ndays
do
  zgrep -h '^ nstep, te ' ${test}/run/atm.log.*.gz | sort -n -k 3,3 | uniq > atm_${test}.txt
done

# Reference simulation from original runs (log files extracted using zstash)
zgrep -h '^ nstep, te ' /lcrc/group/e3sm/ac.golaz/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/original/archive/logs/atm.log.347003.230622-141836.gz | sort -n -k 3,3 | uniq | head -n 482 > atm_ref.txt

# Verification
md5sum *.txt
ac18d8e307d5f474659dffbde3ddf0d3  atm_ref.txt
ac18d8e307d5f474659dffbde3ddf0d3  atm_XS_1x10_ndays.txt
```
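If you prefer an explicit comparison over matching checksums, a direct diff works too (identical files produce no output):
```
diff atm_ref.txt atm_XS_1x10_ndays.txt && echo "bit-for-bit match"
```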
- Modifications and Further Testing:
- Modify the simulation case as needed.
- If the source code remains the same, there is no need to rebuild the model for other tests or production runs. Set `do_fetch_code=false` and `do_case_build=false`; the other flags remain `true` (see the sketch below).
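A sketch of the corresponding flag section for such a rerun; the extra flag names are taken from the run_e3sm template and should be confirmed against your own script:
```
do_fetch_code=false       # reuse the existing code checkout
do_create_newcase=true
do_case_setup=true
do_case_build=false       # reuse the existing build
do_case_submit=true
```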
- Production Run and Archiving:
- Once the production run is done, consider archiving the results.
- First, compress the log files from failed runs. Make sure not to compress the log files from an active simulation, as this will cause the model to crash:
```
cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/run
gzip *.log.372164.230811-102057
```
- Repeat for all the log file timestamps except the current one (a loop version is sketched below).
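A loop version of the same cleanup, with `<ACTIVE_TIMESTAMP>` as a placeholder for the job ID/timestamp of the simulation that is still running:
```
cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/run
for f in *.log.*; do
  case "$f" in
    *.gz) ;;                    # already compressed
    *<ACTIVE_TIMESTAMP>*) ;;    # active run: do NOT compress
    *) gzip "$f" ;;
  esac
done
```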
- Short-term archiving for the first 50 years (the simulation ran from year 0001 to 0050):
```
cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
```
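As an optional sanity check, list the short-term archive afterwards; the path below assumes the default archive location under the case directory, which may differ on your setup:
```
ls /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/archive
```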
- Postprocessing using zppy:
- Prepare the zppy configuration file. You can start with an existing one:
```
cd ~/E3SMv3_dev/scripts
cp /home/ac.golaz/E3SMv3_dev/scripts/post.20230802.v3alpha02.1pctCO2_0101.chrysalis.cfg .
```
- Rename and customize the file content.
- Run zppy
```
# first load e3sm_unified
# on Chrysalis
source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
# on Compy
source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified_compy.sh
# on NERSC Perlmutter
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh

# then run zppy
zppy -c <your_file>.cfg
```
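zppy submits its tasks as batch jobs, so the standard Slurm tools are enough to watch their progress:
```
squeue -u $USER
```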
- If everything runs okay, the analysis results will be available to the public like this.
- If you want to compare the zppy-generated plots between simulations using the Interface for InterComparison of E3SM (IICE), you can use these two links: for E3SM_Diags and for MPAS.
- Documenting the simulation:
Additional Notes:
- Quota check for Compy and Chrysalis
- Compy:
```
lfs quota -hu zhou014 /compyfs
```
- Chrysalis:
```
/usr/lpp/mmfs/bin/mmlsquota -u ac.tian.zhou --block-size T fs2
```
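A small convenience wrapper that picks the right command by machine; this is a hypothetical helper, the hostname patterns are assumptions, and the usernames are hard-coded from the examples above:
```
quota_check() {
  case "$(hostname)" in
    compy*) lfs quota -hu zhou014 /compyfs ;;
    chr*)   /usr/lpp/mmfs/bin/mmlsquota -u ac.tian.zhou --block-size T fs2 ;;
    *)      echo "unknown machine: $(hostname)" ;;
  esac
}
```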
- Job management
- From Chris: to boost the Quality of Service (QoS) for a faster job turnaround, one can run `scontrol update jobid=<JOBID> qos=high` on a waiting job. You can also edit `env_batch.xml` in `case_scripts` by adding `<directive> --qos=high </directive>`.
- Chris also shared a command to list priorities for all jobs:
```
alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %.8Q %j" --sort=P,-t,-p'
```
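Once the alias is defined (e.g., in `~/.bashrc`), invoking it lists the queue sorted by partition, job state, and priority:
```
sqa | head
```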
- To check the priority of a job:
```
sprio -l -j <JOBID>
  JOBID PARTITION     USER   PRIORITY       SITE        AGE      ASSOC  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE   TRES
 411168 compute   ac.tian.      53844          0        843          0        560       2441          0      50000          0
```
The high-QoS command above increased my job's QOS factor from 5000 to 50000. The higher the QOS, the higher the PRIORITY. The AGE factor reflects the time spent waiting in the queue, in minutes; the longer the AGE, the higher the PRIORITY as well. In this example the PRIORITY is just the sum of the factor columns: 843 (AGE) + 560 (FAIRSHARE) + 2441 (JOBSIZE) + 50000 (QOS) = 53844.
- To use the priority partition, simply modify the following lines in the run script; of course, you have to be added to the priority partition by the admin first.
```
readonly PROJECT="priority"
./xmlchange CHARGE_ACCOUNT=priority
./xmlchange --force JOB_QUEUE=priority
```