Performance profiling and optimization of the NCMRWF’s unified model (NCUM) vn13.0 on the MIHIR cray XC40 HPC facility at NCMRWF

Authors

  • Ashish Routray National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India
  • Shivali Gangwar National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India
  • Suryakanti Dutta National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India
  • Preveen Kumar D National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India
  • V. S. Prasad National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India
  • B. Athiyaman National Centre for Medium Range Weather Forecasting (NCMRWF), Ministry of Earth Sciences (MoES), Noida, India

DOI:

https://doi.org/10.54302/mausam.v77i3.7254

Abstract

The National Centre for Medium Range Weather Forecasting (NCMRWF) operated the MIHIR High Performance Computing (HPC) Facility, which delivered up to 2.8 petaflops of processing capacity for running Numerical Weather Prediction (NWP) models and thereby enabled accurate and timely weather forecasting. This study presents a comprehensive performance profiling and optimization analysis of the NCMRWF Unified Model (NCUM) version 13.0 on the MIHIR Cray XC40 HPC Facility. The model requires computations on the order of peta floating point operations per second (PFLOPS). The NCUM was profiled at two horizontal resolutions, n96e (approximately 130 km), taken as the baseline, and n1280e (approximately 10 km), using the Cray Performance Analysis Tool (CrayPAT) to identify runtime bottlenecks and communication overhead. The optimization experiments focused on halo size reduction, domain decomposition strategies, OpenMP threading, and MPI rank reordering. Results demonstrate that reducing the extended halo size from 10 to 5 grid points shortens execution time by approximately 10 seconds, reduces communication costs, and lowers imbalance in key routines from 15.2 percent to 3.8 percent. Changing the domain decomposition from 4 by 8 to 3 by 12 further reduced MPI collective imbalance, while rank reordering improved execution time by up to 12.8 percent. These findings will be valuable in guiding domain and communication optimizations to efficiently scale high resolution NWP models on the Cray XC40 system and on future HPC architectures to be deployed at NCMRWF.
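To illustrate why halo width and decomposition shape affect communication cost, the following minimal Python sketch counts the halo points that surround one rank's subdomain. It is a geometric illustration only, not the profiling method used in the study. The function names are hypothetical, the n96e grid is assumed to have 192 by 144 points, and "4 by 8" is assumed to mean 4 ranks east-west by 8 ranks north-south; none of these details are stated in the abstract.

```python
import math


def local_extent(n_global, n_ranks):
    """Largest per-rank extent when n_global points are split over n_ranks."""
    return math.ceil(n_global / n_ranks)


def halo_points(nx_global, ny_global, px, py, halo):
    """Halo points around the largest subdomain of a px-by-py decomposition.

    The halo is modelled as a frame of width `halo` around the local
    interior, corners included.
    """
    nx = local_extent(nx_global, px)
    ny = local_extent(ny_global, py)
    return (nx + 2 * halo) * (ny + 2 * halo) - nx * ny


# Assumed n96e grid dimensions (not given in the abstract).
NX, NY = 192, 144

for px, py in [(4, 8), (3, 12)]:
    for halo in (10, 5):
        points = halo_points(NX, NY, px, py, halo)
        print(f"{px} x {py} decomposition, halo {halo}: {points} halo points per rank")
```

Under these assumptions, halving the halo width more than halves the number of halo points per rank, because the corner regions shrink quadratically. The sketch does not model message counts, network topology, or collective operations, so it cannot reproduce the imbalance or timing figures reported above.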

Published

2026-07-01

Section

Research Papers

How to Cite

[1]
“Performance profiling and optimization of the NCMRWF’s unified model (NCUM) vn13.0 on the MIHIR cray XC40 HPC facility at NCMRWF”, MAUSAM, vol. 77, no. 3, pp. 1005–1018, Jul. 2026, doi: 10.54302/mausam.v77i3.7254.
