I was recently asked to conduct a coupled E3SM simulation experiment. This task provided me with the opportunity to familiarize myself with the latest best-practice simulation standards. An official step-by-step guide can be found here.

Some Key Takeaways:

  • Comprehensive Scripting: Maintain all aspects within the script, including cloning the source code, building the project, and altering the namelist. This practice helps prevent future mistakes.
  • Script Version Control: Ensure that each simulation is associated with a specific script version, enabling better tracking and replication of results (see the sketch after this list).
  • Test the Simulation before Production Runs: Conduct small PE-layout test runs over a few days to verify the stability of the simulation. This preliminary testing ensures that the simulation does not crash and can produce bit-for-bit (b4b) results in comparison to the baseline runs or across various PE layouts. Such testing minimizes the risk of a production run crashing after an extended wait time in the queue.
  • Post-Processing Version Control: Utilize version control for the post-processing control file, for tools such as zppy, to maintain consistency across various stages of analysis.
  • Archiving Simulation Results: Always archive the simulation results, making it easier to locate past results and restart the simulations if necessary.
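
A minimal sketch of the version-control idea above (the repository location follows the scripts directory used later in this post; the script name and tag are hypothetical):

        # Keep run scripts (and later the zppy config) in a git repository
        cd ~/E3SMv3_dev/scripts
        git init                                   # only needed the first time
        git add run.20230808.v3alpha02.piControl.chrysalis.sh
        git commit -m "Run script for the piControl production simulation"
        git tag piControl-submit-01                # each submission maps to a commit/tag
        ./run.20230808.v3alpha02.piControl.chrysalis.sh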

Steps:

  • Start from an existing simulation script: Use a script like this one.
  • Reproducing Results:
    • To reproduce the same results, submit the job with a short test option activated.
    • Ensure do_fetch_code=true to enable the script to automatically clone the code and check out the correct branch.
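    • For reference, the relevant lines near the top of the run script look roughly like this (the run and toggle names here follow the example run script linked above; confirm them against your own script):
        # Short XS-layout, 10-day test instead of the production configuration
        readonly run='XS_1x10_ndays'
        # Toggles controlling what the script does on this invocation
        do_fetch_code=true
        do_create_newcase=true
        do_case_setup=true
        do_case_build=true
        do_case_submit=true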
  • Short Test and Verification:
    • After the short test, verify the test results against the original simulation results using md5sum:
        # 10-day test simulations with different layouts
        cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/tests
        for test in *_*_ndays
        do
            zgrep -h '^ nstep, te ' ${test}/run/atm.log.*.gz | sort -n -k 3,3 | uniq > atm_${test}.txt
        done
        # Reference simulation from original runs (log files extracted using zstash)
        zgrep -h '^ nstep, te ' /lcrc/group/e3sm/ac.golaz/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/original/archive/logs/atm.log.347003.230622-141836.gz | sort -n -k 3,3 | uniq | head -n 482 > atm_ref.txt
        # Verification
        md5sum *.txt
        ac18d8e307d5f474659dffbde3ddf0d3  atm_ref.txt
        ac18d8e307d5f474659dffbde3ddf0d3  atm_XS_1x10_ndays.txt
      
  • Modifications and Further Testing:
    • Modify the simulation case as needed.
    • If the source code remains the same, there is no need to rebuild the model for other tests or production runs. Set do_fetch_code=false and do_case_build=false; the other flags remain true.
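    • For example, when reusing the existing executable the toggle block could look like this (do_fetch_code and do_case_build are the flags named above; the remaining toggle names follow the example run script and may differ in yours):
        # Reuse the existing checkout and build; only create, set up, and submit the new case
        do_fetch_code=false
        do_create_newcase=true
        do_case_setup=true
        do_case_build=false
        do_case_submit=true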
  • Production Run and Archiving:
    • Once the production run is done, consider archiving the results.
    • First compress the log files from failed runs. Make sure not to compress the log files from an active simulation, as this will cause the model to crash:
        cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/run
        gzip *.log.372164.230811-102057 
      
    • Repeat for all the log file time stamps, except the current one.
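    • A small loop along these lines could automate that step; it assumes the most recently modified uncompressed log belongs to the active run, so double-check before using it:
        cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/run
        # Time stamp (jobid.date-time) of the newest uncompressed log, i.e. the active run
        current=$(ls -t *.log.* | grep -v '\.gz$' | head -n 1 | sed 's/.*\.log\.//')
        # Compress every other uncompressed log file
        for f in $(ls *.log.* | grep -v '\.gz$' | grep -v "${current}"); do
            gzip "$f"
        done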
    • Short-term archiving for the first 50 years (the simulation ran from year 0001 to 0050):
        cd /lcrc/group/e3sm/ac.tian.zhou/E3SMv3_dev/20230808.v3alpha02.piControl.chrysalis/case_scripts
        ./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
      
  • Postprocessing using zppy:
    • Prepare the zppy configuration file. You can start with an existing one:
        cd ~/E3SMv3_dev/scripts
        cp /home/ac.golaz/E3SMv3_dev/scripts/post.20230802.v3alpha02.1pctCO2_0101.chrysalis.cfg .
      
    • Rename and customize the file content.
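    • For instance (the new file name below is made up; whichever name you pick, make sure the case name, paths, and analysis years inside the file match your simulation):
        cd ~/E3SMv3_dev/scripts
        # Rename the copied template for the new case
        mv post.20230802.v3alpha02.1pctCO2_0101.chrysalis.cfg post.20230808.v3alpha02.piControl.chrysalis.cfg
        # Swap the old case name for the new one, then review the rest of the file by hand
        sed -i 's/20230802.v3alpha02.1pctCO2_0101/20230808.v3alpha02.piControl/g' post.20230808.v3alpha02.piControl.chrysalis.cfg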
    • Run zppy
        # first load e3sm_unified 
        # on Chrysalis
        source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
        # on Compy
        source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified_compy.sh
        # on NERSC Perlmutter
        source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh   
        # then run zppy
        zppy -c <your_file>.cfg
      
    • If everything runs okay, the analysis results will be publicly available, like this.
    • If you want to compare the zppy-generated plots between simulations using the Interface for InterComparison of E3SM (IICE), you could use these two links: for E3SM_Diags and for MPAS.
  • Documenting the simulation:
    • Use this template to document the simulation.
    • PACE provides detailed performance information for simulations. Provide the job ID or search by username to find the simulation.

Additional Notes:

  • Quota check for Compy and Chrysalis
    • Compy: lfs quota -hu zhou014 /compyfs
    • Chrysalis: /usr/lpp/mmfs/bin/mmlsquota -u ac.tian.zhou --block-size T fs2
  • Job management
    • From Chris: to boost the Quality of Service (QoS) for a faster job turnaround, one can do scontrol update jobid=<JOBID> qos=high for a waiting job. You can also edit env_batch.xml in case_scripts by adding <directive> --qos=high </directive>
    • Chris also shared a command to list priorities for all jobs:
        alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %.8Q %j" --sort=P,-t,-p'
      
    • To check the priority of a job
        sprio -l -j <JOBID>
         JOBID PARTITION     USER PRIORITY SITE AGE ASSOC FAIRSHARE JOBSIZE PARTITION   QOS NICE TRES
        411168   compute ac.tian.    53844    0 843     0       560    2441         0 50000    0   
      

      The high-QoS command above increased my job's QOS from 5000 to 50000. The higher the QOS, the higher the PRIORITY. AGE is the waiting time spent in the queue, in minutes; the longer the AGE, the higher the PRIORITY as well.

    • To use the priority partition, simply modify the following lines in the run script. Of course, you have to be added to the priority partition by the admin first:
        readonly PROJECT="priority"
        ./xmlchange CHARGE_ACCOUNT=priority
        ./xmlchange --force JOB_QUEUE=priority