This is the Technical Note for the Negative Runoff Quick Fix PR. This PR implements a feature to handle negative runoff (qgwl) from ELM, which can cause negative discharge to the ocean. The draft of this document was generated by Claude and then revised by me.
Key Implementation Details
- Feature controlled by namelist flag: redirect_negative_qgwl = .true.
- Main code in src/riverroute/RtmMod.F90 (lines 2406-3174)
- Uses sparse packing approach with reprosum for bit-for-bit reproducibility across PE layouts
- Two scenarios based on global net qgwl (positive sum + negative sum)
Scenario A (net_global_qgwl ≥ 0):
- Proportionally reduce positive qgwl cells to offset negative cells
- Zero out negative qgwl cells
- Scaling factor: net_global_qgwl / global_positive_qgwl_sum
- No outlet redistribution needed (conservation achieved by proportional reduction)
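The Scenario A arithmetic can be sketched in a few lines of Python. This is only an illustration of the scaling described above, not the Fortran code in RtmMod.F90; the function name `redirect_scenario_a` and the sample cell values are hypothetical, and plain sums stand in for the reproducible global sums used in the PR.

```python
def redirect_scenario_a(qgwl):
    """Zero negative cells and scale positive cells so the global total is kept.

    Hypothetical helper for illustration; `qgwl` is a flat list of cell values.
    """
    global_positive_qgwl_sum = sum(q for q in qgwl if q > 0.0)
    global_negative_qgwl_sum = sum(q for q in qgwl if q < 0.0)
    net_global_qgwl = global_positive_qgwl_sum + global_negative_qgwl_sum

    if net_global_qgwl < 0.0:
        raise ValueError("Scenario A applies only when the net qgwl is non-negative")
    if global_positive_qgwl_sum == 0.0:
        # No positive cells, so there is nothing to scale.
        return [0.0 for _ in qgwl]

    # Scaling factor from the note: net / positive sum (between 0 and 1).
    scale = net_global_qgwl / global_positive_qgwl_sum
    return [q * scale if q > 0.0 else 0.0 for q in qgwl]


# Made-up cell values: positive sum 10, negative sum -3, net 7, scale 0.7.
cells = [4.0, -1.0, 6.0, -2.0, 0.0]
adjusted = redirect_scenario_a(cells)
print(adjusted, sum(adjusted))
```

The adjusted field has no negative cells, and its total matches the original net value up to rounding, which is why no outlet redistribution is needed in this scenario.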
Scenario B (net_global_qgwl < 0):
- Zero all qgwl input
- Redistribute deficit to all outlets weighted by discharge
- Each outlet receives: correction = (outlet_discharge / total_outlet_discharge) × |net_global_qgwl|
- Warning issued if redistribution changes total outlet discharge by >5%
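A minimal Python sketch of the Scenario B weighting follows. It assumes the correction is subtracted from each outlet's discharge (the note states the weighting but not the sign convention), and the function name `redistribute_deficit`, the `warn_fraction` parameter, and the outlet values are hypothetical.

```python
def redistribute_deficit(outlet_discharge, net_global_qgwl, warn_fraction=0.05):
    """Spread a negative net qgwl over outlets in proportion to their discharge.

    Returns the adjusted discharges and a flag that is True when the total
    outlet discharge changes by more than `warn_fraction` (5% in the note).
    """
    total_outlet_discharge = sum(outlet_discharge)
    if net_global_qgwl >= 0.0 or total_outlet_discharge <= 0.0:
        # Scenario B does not apply; leave the outlets untouched.
        return list(outlet_discharge), False

    deficit = abs(net_global_qgwl)
    corrections = [
        (d / total_outlet_discharge) * deficit for d in outlet_discharge
    ]
    # Assumption: the deficit is removed from the outlet discharge.
    adjusted = [d - c for d, c in zip(outlet_discharge, corrections)]

    # The corrections sum to the deficit, so the relative change in the
    # total outlet discharge is deficit / total_outlet_discharge.
    warn = (deficit / total_outlet_discharge) > warn_fraction
    return adjusted, warn


outlets = [100.0, 300.0, 600.0]  # made-up outlet discharges
adjusted_small, warn_small = redistribute_deficit(outlets, -20.0)  # 2% change
adjusted_large, warn_large = redistribute_deficit(outlets, -80.0)  # 8% change
print(adjusted_small, warn_small)
print(adjusted_large, warn_large)
```

Because the weights sum to one, the total removed from the outlets equals |net_global_qgwl|, so the global water budget is conserved after the qgwl input is zeroed.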
Lesson Learned: MPI Sum vs Reprosum: Critical for Bit-for-Bit Reproducibility
The Problem with Standard MPI_Allreduce
Standard MPI reduction operations (MPI_Allreduce, MPI_Reduce) are not guaranteed to be bit-for-bit reproducible across different PE (processing element) layouts because:
- Non-associativity of floating-point addition: (a + b) + c ≠ a + (b + c) due to rounding errors
- PE layout affects summation order:
  - 160 PEs: might sum as ((PE0 + PE1) + (PE2 + PE3)) + ...
  - 320 PEs: might sum as ((PE0 + PE1) + (PE2 + PE3)) + ... with different groupings
- Result: Same simulation with different PE counts produces slightly different (~1e-13) global sums
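The effect is easy to reproduce without MPI. The Python sketch below treats each inner list as the values owned by one PE, forms per-PE partial sums, and then adds the partials. The helper name `sum_in_layout` and the three values are illustrative only.

```python
def sum_in_layout(chunks):
    """Add per-PE partial sums; each inner list plays the role of one PE."""
    partial_sums = [sum(chunk) for chunk in chunks]
    return sum(partial_sums)


# The same three values, decomposed across two "PEs" in two different ways.
layout_a = sum_in_layout([[0.1, 0.2], [0.3]])  # (0.1 + 0.2) + 0.3
layout_b = sum_in_layout([[0.1], [0.2, 0.3]])  # 0.1 + (0.2 + 0.3)

print(layout_a == layout_b)  # False
print(layout_a, layout_b)    # 0.6000000000000001 0.6
```

The two totals differ only in the last bit, which is enough to break a bit-for-bit comparison.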
The Reprosum Solution
The shr_reprosum_calc function from CESM’s shared utilities provides bit-for-bit reproducible sums using an integer vector representation.
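The reason an integer representation helps is that integer addition is exact and therefore independent of order. The Python sketch below shows that idea with a simple fixed-point conversion. It is a deliberately simplified illustration, not the algorithm inside shr_reprosum_calc; the function name `fixed_point_sum` and the `scale_bits` parameter are hypothetical.

```python
import random


def fixed_point_sum(values, scale_bits=40):
    """Sum floats by converting each one to a scaled integer first.

    Each value is rounded to a multiple of 2**-scale_bits on its own, so the
    conversion does not depend on summation order, and the integer additions
    that follow are exact in any order.
    """
    scale = 1 << scale_bits
    integer_total = sum(round(v * scale) for v in values)
    return integer_total / scale


rng = random.Random(0)
values = [rng.uniform(-1.0e3, 1.0e3) for _ in range(10_000)]
shuffled = values[:]
rng.shuffle(shuffled)

# Same result bit for bit, whatever the order of the inputs.
print(fixed_point_sum(values) == fixed_point_sum(shuffled))  # True
```

The price of this toy version is a fixed quantization: anything much smaller than 2**-scale_bits rounds to zero, so the scale has to be chosen for the range of the data.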
Lesson Learned: PEM Test Fix: The Near-Zero Value Problem
Root Cause of PEM Failure:
The negative qgwl redistribution initially failed PEM tests because near-zero values (~±1e-20) were classified inconsistently:
- Upstream floating-point operations: Different PE layouts produce slightly different rounding in ELM calculations
- Sign flipping: Value could be +1e-20 in 160 PE layout, -1e-20 in 320 PE layout
- Classification inconsistency: if (qgwl > 0.0_r8) classified these differently → different positive/negative sums
- Non-reproducibility propagation: Different sums → different scaling factors → different outputs across timesteps
Example from actual logs:
160 PEs: Positive sum = 1.2345678901234560E+03
320 PEs: Positive sum = 1.2345678901234567E+03 (differs by ~7e-13)
The negative sum was identical (bit-for-bit), but positive sum differed because near-zero values were classified differently.
Solution: Sparse Packing with Threshold
Following the pattern from the MPAS ocean model, we implemented:
- Threshold parameter: TINYVALUE_s = 1.0e-14_r8
- Sparse packing: Only include values with abs(qgwl) > TINYVALUE_s in reprosum
- Consistent classification: Use threshold in all comparisons (qgwl > TINYVALUE_s instead of qgwl > 0.0_r8)
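The following Python sketch shows why the threshold restores consistent classification. It is an illustration rather than the Fortran in the PR: `pack_by_sign` is a hypothetical helper, the two sample arrays are made up to mimic a sign flip in one near-zero cell, and the packed lists stand for what would be handed to the reproducible sum.

```python
TINYVALUE_S = 1.0e-14  # Python counterpart of TINYVALUE_s = 1.0e-14_r8


def pack_by_sign(qgwl, threshold):
    """Return the positive and negative values that pass the threshold test."""
    positive = [q for q in qgwl if q > threshold]
    negative = [q for q in qgwl if q < -threshold]
    return positive, negative


# Same field on two PE layouts; one cell is rounding noise whose sign flipped.
layout_160 = [2.5, 1.0e-20, -0.75]
layout_320 = [2.5, -1.0e-20, -0.75]

# Comparing against zero: the noisy cell lands in different groups.
print(pack_by_sign(layout_160, 0.0) == pack_by_sign(layout_320, 0.0))  # False

# Comparing against the threshold: the noisy cell is dropped on both layouts.
print(pack_by_sign(layout_160, TINYVALUE_S) == pack_by_sign(layout_320, TINYVALUE_S))  # True
```

With the threshold applied, both layouts contribute exactly the same set of values to the positive and negative sums, so the scaling factor derived from them is identical.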