Negative Runoff Quick Fix

This is the Technical Note for the Negative Runoff Quick Fix PR. The PR implements a feature to handle negative runoff (qgwl) from ELM, which can cause negative discharge to the ocean. The draft of this document was generated by Claude and then revised by me.

Key Implementation Details

Feature controlled by namelist flag: redirect_negative_qgwl = .true.
Main code in src/riverroute/RtmMod.F90 (lines 2406-3174)
Uses sparse packing approach with reprosum for bit-for-bit reproducibility across PE layouts
Two scenarios based on global net qgwl (positive sum + negative sum)

Scenario A (net_global_qgwl ≥ 0):

Proportionally reduce positive qgwl cells to offset negative cells
Zero out negative qgwl cells
Scaling factor: net_global_qgwl / global_positive_qgwl_sum
No outlet redistribution needed (conservation achieved by proportional reduction)
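
The Scenario A arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the Fortran in RtmMod.F90: the function name redirect_scenario_a and the sample values are hypothetical, and plain comparisons against zero are used for brevity.

```python
def redirect_scenario_a(qgwl):
    """Scale positive cells so their total equals the global net; zero the rest.

    Hypothetical helper for illustration only.
    """
    positive_sum = 0.0
    negative_sum = 0.0
    for q in qgwl:
        if q > 0.0:
            positive_sum += q
        elif q < 0.0:
            negative_sum += q

    net_global_qgwl = positive_sum + negative_sum
    if net_global_qgwl < 0.0:
        raise ValueError("Scenario A requires a non-negative global net qgwl")

    # Scaling factor from the note: net_global_qgwl / global_positive_qgwl_sum
    scale = net_global_qgwl / positive_sum if positive_sum > 0.0 else 0.0

    # Positive cells are reduced proportionally; negative cells become zero.
    return [q * scale if q > 0.0 else 0.0 for q in qgwl]


# Sample values chosen so every operation is exact in floating point.
adjusted = redirect_scenario_a([4.0, -1.0, 2.0, -2.0])
print(adjusted)  # [2.0, 0.0, 1.0, 0.0]
```

The adjusted field sums to the original net (3.0 in this example), which is why no outlet redistribution is required in this scenario.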

Scenario B (net_global_qgwl < 0):

Zero all qgwl input
Redistribute deficit to all outlets weighted by discharge
Each outlet receives: correction = (outlet_discharge / total_outlet_discharge) × |net_global_qgwl|
Warning issued if redistribution changes total outlet discharge by >5%
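
A rough Python sketch of the Scenario B redistribution follows. The function name, the return layout, and the choice to subtract the correction from each outlet are assumptions made for illustration; only the weighting formula and the 5% warning threshold come from the description above.

```python
def redistribute_scenario_b(outlet_discharge, net_global_qgwl, warn_fraction=0.05):
    """Spread |net_global_qgwl| over outlets in proportion to their discharge.

    Hypothetical helper for illustration only.
    """
    deficit = abs(net_global_qgwl)

    total_outlet_discharge = 0.0
    for d in outlet_discharge:
        total_outlet_discharge += d
    if total_outlet_discharge <= 0.0:
        raise ValueError("total outlet discharge must be positive to weight by it")

    # correction = (outlet_discharge / total_outlet_discharge) * |net_global_qgwl|
    corrections = [d / total_outlet_discharge * deficit for d in outlet_discharge]

    # Assumption: the deficit is removed from each outlet's discharge.
    adjusted = [d - c for d, c in zip(outlet_discharge, corrections)]

    # Flag the case where the total outlet discharge changes by more than 5%.
    warn = deficit / total_outlet_discharge > warn_fraction
    return corrections, adjusted, warn


corrections, adjusted, warn = redistribute_scenario_b([100.0, 300.0], -40.0)
print(corrections, adjusted, warn)  # [10.0, 30.0] [90.0, 270.0] True
```

Because the weights sum to one, the corrections add up to the full deficit, so the global budget is preserved.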

Lesson Learned: MPI Sum vs Reprosum: Critical for Bit-for-Bit Reproducibility

The Problem with Standard MPI_Allreduce

Standard MPI reduction operations (MPI_Allreduce, MPI_Reduce) are NOT bit-for-bit reproducible across different PE (processor) layouts because:

Non-associativity of floating-point addition: (a + b) + c ≠ a + (b + c) due to rounding errors
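PE layout affects summation order, because each PE first sums its own cells and the partial sums are then combined:
- 160 PEs: might sum as ((PE0 + PE1) + (PE2 + PE3)) + ..., where each PE holds one subset of the grid cells
- 320 PEs: might sum as ((PE0 + PE1) + (PE2 + PE3)) + ..., but each PE holds a different subset of cells, so the partial sums and their groupings differ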
Result: Same simulation with different PE counts produces slightly different (~1e-13) global sums
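
The effect is easy to reproduce outside MPI. The Python sketch below mimics a PE layout by splitting a list into contiguous chunks, summing each chunk, and then summing the partial results; layout_sum and the sample values are hypothetical and chosen to exaggerate the rounding difference. Explicit loops are used instead of the built-in sum, which may apply compensated summation in recent Python versions.

```python
def naive_sum(values):
    """Left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total


def layout_sum(values, n_pes):
    """Sum contiguous chunks (one per simulated PE), then sum the partials."""
    chunk = (len(values) + n_pes - 1) // n_pes
    partials = [naive_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return naive_sum(partials)


# Floating-point addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Same data, different simulated layouts, different global sums.
values = [1e16, 1.0, -1e16, 1.0]
print(layout_sum(values, 1))  # 1.0
print(layout_sum(values, 2))  # 0.0
```

In a real run the discrepancy is far smaller (on the order of the last few bits), but it is enough to break bit-for-bit comparisons.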

The Reprosum Solution

The shr_reprosum_calc function from CESM’s shared utilities provides bit-for-bit reproducible sums by converting each value to an integer vector representation, so the result does not depend on summation order.
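
The idea behind an integer-based sum can be illustrated with a toy fixed-point scheme in Python. This is not the shr_reprosum_calc algorithm or its interface; SCALE_BITS, to_fixed, and fixed_point_sum are hypothetical names, and a single scaled integer stands in for the integer vector used by the real routine.

```python
SCALE_BITS = 40          # hypothetical resolution: values are kept to 2**-40
SCALE = 2 ** SCALE_BITS


def to_fixed(x):
    """Convert a float to a scaled integer (bits below 2**-40 are rounded off)."""
    return round(x * SCALE)


def fixed_point_sum(values):
    """Sum scaled integers; integer addition is exact and order independent."""
    total = 0
    for x in values:
        total += to_fixed(x)
    return total / SCALE


values = [1e16, 1.0, -1e16, 1.0]
print(fixed_point_sum(values))                  # 2.0
print(fixed_point_sum(list(reversed(values))))  # 2.0
```

Since integer addition is associative, any grouping of the same inputs yields the same total, which is the property a reproducible sum needs across PE layouts.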

Lesson Learned: PEM Test Fix: The Near-Zero Value Problem

Root Cause of PEM Failure:

The negative qgwl redistribution initially failed PEM tests because near-zero values (~±1e-20) were classified inconsistently:

Upstream floating-point operations: Different PE layouts produce slightly different rounding in ELM calculations
Sign flipping: Value could be +1e-20 in 160 PE layout, -1e-20 in 320 PE layout
Classification inconsistency: if (qgwl > 0.0_r8) classified these differently → different positive/negative sums
Non-reproducibility propagation: Different sums → different scaling factors → different outputs across timesteps

Example from actual logs:

160 PEs: Positive sum = 1.2345678901234560E+03
320 PEs: Positive sum = 1.2345678901234567E+03  (differs by ~7e-13)

The negative sum was identical (bit-for-bit), but positive sum differed because near-zero values were classified differently.

Solution: Sparse Packing with Threshold

Following the pattern from the MPAS ocean model, we implemented:

Threshold parameter: TINYVALUE_s = 1.0e-14_r8
Sparse packing: Only include values with abs(qgwl) > TINYVALUE_s in reprosum
Consistent classification: Use threshold in all comparisons (qgwl > TINYVALUE_s instead of qgwl > 0.0_r8)
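
The difference between the two comparisons can be shown with a short Python sketch. The helper names classify and sparse_pack are hypothetical; only the threshold value and the comparison pattern come from the list above.

```python
TINYVALUE_S = 1.0e-14  # threshold value from the note


def classify(q, threshold):
    """Label a qgwl value as positive, negative, or skipped (near zero)."""
    if q > threshold:
        return "positive"
    if q < -threshold:
        return "negative"
    return "skipped"


def sparse_pack(qgwl, threshold=TINYVALUE_S):
    """Keep only values whose magnitude exceeds the threshold."""
    return [q for q in qgwl if abs(q) > threshold]


# With a zero threshold, a sign flip in a near-zero value changes its class.
print(classify(+1e-20, 0.0), classify(-1e-20, 0.0))                  # positive negative

# With the threshold, both layouts agree that the value is negligible.
print(classify(+1e-20, TINYVALUE_S), classify(-1e-20, TINYVALUE_S))  # skipped skipped

print(sparse_pack([3.0, 1e-20, -2.0, -1e-20]))                       # [3.0, -2.0]
```

Values inside the threshold band never enter the positive or negative sums, so a layout-dependent sign flip no longer changes which cells contribute.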
