So, you have an imperfect data-driven parameterization for climate modeling. How can you make it better?

Published:

Minah Yang submitted a paper to JAMES on how to combat data imbalance in regression problems. The goal is to improve the performance of data-driven parameterizations, particularly for profiles that are rare but important. This is a data imbalance problem: we need to ensure that the parameterization works well on input-output pairs that are seldom seen in training. Minah proposes a technique based on histogram equalization, visualized below with help from Cece, Minah’s faithful companion! The idea is to oversample or reweight these rare cases during training, so that the model actually learns from them.

The plain language summary does a nice job of explaining the big picture: Subgrid-scale parameterizations are a part of climate models that represent effects of processes that cannot be directly modelled. In recent years, there have been many efforts to improve upon these parameterizations by applying machine learning techniques. Since these methods rely heavily on the dataset they are learning from, it is important to consider the frequency at which important events occur within the dataset because they are adept at learning frequent events at high accuracy but are prone to learning rare but important events at low accuracy. To remedy this data imbalance problem, we developed a resampling methodology that can be easily adjusted by tuning just two parameters. We find that a right combination of those parameters can improve the accuracy of an ML model at the rare event regime while keeping the accuracy high in the frequent regime. However, a “wrong” combination can actually increase the errors at the rare event regime by overfitting to that regime.
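To make the resampling idea concrete, here is a minimal sketch of histogram-equalization-style resampling for a scalar regression target. It is an illustration under assumptions, not the method from the paper: the function names and the two knobs used here, `n_bins` and `alpha`, are hypothetical stand-ins, and the paper's two tuning parameters may be defined differently.

```python
import numpy as np


def equalization_weights(y, n_bins=20, alpha=1.0):
    """Per-sample sampling probabilities that flatten the histogram of y.

    Hypothetical knobs (not taken from the paper):
      n_bins: number of equal-width histogram bins over the range of y.
      alpha:  equalization strength. alpha=0 keeps the original
              distribution; alpha=1 gives every non-empty bin the same
              total probability.
    """
    y = np.asarray(y, dtype=float)
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    # Use interior edges only so bin indices fall in 0..n_bins-1.
    bin_idx = np.digitize(y, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    # Samples in sparsely populated bins get larger weights.
    weights = 1.0 / counts[bin_idx] ** alpha
    return weights / weights.sum()


def resample_indices(y, n_samples, n_bins=20, alpha=1.0, seed=0):
    """Draw training indices with replacement, favoring rare target values."""
    rng = np.random.default_rng(seed)
    p = equalization_weights(y, n_bins=n_bins, alpha=alpha)
    return rng.choice(len(y), size=n_samples, replace=True, p=p)
```

The same probabilities could instead be used as per-sample loss weights, which corresponds to the reweighting option mentioned above. In this sketch, a large `alpha` means the few samples in a rare bin are drawn many times over, which is one way the overfitting risk described in the summary can show up.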