AI-based weather forecasting is an emerging paradigm that differs from numerical weather prediction (NWP), today the default way of making weather forecasts. In some cases, the performance of AI-based models is on par with or even exceeds that of NWP, especially at longer lead times. However, AI-based weather forecasts tend to be smoother than NWP forecasts, and they become even blurrier as lead time increases. This impacts forecast quality: extreme weather events, for example, are captured less precisely. In this article, we explore why this occurs. We then review solutions that allow the quality of AI-based weather forecasts to improve.
Recall from the previous section that the differences between the prediction and the true state (the expected prediction) across all training samples are summarized in a loss value. Since this value must be computed, there must be a function that does so: we call that function the loss function. Choosing a loss function is at the discretion of the model developers and helps optimize models in a desired way.
In AI-based weather models, a customized Mean Squared Error (MSE) loss function (or a close relative, such as Root Mean Squared Error) is often used. In its simplest form, MSE loss looks like the formula below (Versloot, 2021). For each grid cell, the prediction is subtracted from the expected value from the training data and the difference is squared; the sum over all grid cells is then divided by the number of grid cells to compute the average. In other words, an MSE loss computes the average error across the whole prediction (in the case of AI-based weather models, the global weather prediction for a time step).
Figure 2: MSE loss.
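The formula above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not code from any weather model; the function name and the toy grid values are our own.

```python
import numpy as np

def mse_loss(prediction: np.ndarray, target: np.ndarray) -> float:
    """Mean Squared Error over all grid cells: per cell, subtract the
    prediction from the expected value, square the difference, and
    average over all cells."""
    return float(np.mean((target - prediction) ** 2))

# Toy 2x2 "global" grid of one variable (e.g. temperature in degrees C).
target = np.array([[10.0, 12.0], [14.0, 16.0]])
prediction = np.array([[11.0, 12.0], [14.0, 18.0]])
print(mse_loss(prediction, target))  # (1 + 0 + 0 + 4) / 4 = 1.25
```

Note how the square makes the single 2-degree error contribute four times as much as the 1-degree error: large mistakes dominate the loss.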
Recognizing meteorological effects and the varying expressiveness of the NWP models that provide the analysis data for training, for example around the poles, AI weather model developers typically customize their loss functions further to take these effects into account. For example, the loss function below – that of the GraphCast weather model – considers forecast date and time, lead time, spatial location and all variables when computing the loss. Nevertheless, as can be seen on the right of the function, it is still a squared error – and thus an MSE loss.
Figure 3: GraphCast loss function (Lam et al., 2022).
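One ingredient of such a customized loss, spatial weighting, can be illustrated as follows. On a latitude–longitude grid, cells near the poles cover far less of the globe than cells near the equator, so a plain average would over-count polar cells; weighting each latitude row by the cosine of its latitude corrects for this. This sketch shows only that spatial-weighting idea, under our own simplifying assumptions; GraphCast's actual loss additionally weights per variable, per pressure level and per lead time.

```python
import numpy as np

def latitude_weighted_mse(pred, target, latitudes_deg):
    """Squared error averaged with per-latitude area weights
    proportional to cos(latitude), normalised to mean 1."""
    weights = np.cos(np.deg2rad(latitudes_deg))  # one weight per latitude row
    weights = weights / weights.mean()           # normalise to mean 1
    sq_err = (pred - target) ** 2                # shape: (lat, lon)
    return float(np.mean(weights[:, None] * sq_err))

# Toy 3-latitude x 2-longitude grid with an error of 1 everywhere.
lats = np.array([60.0, 0.0, -60.0])
target = np.zeros((3, 2))
pred = np.ones((3, 2))
print(latitude_weighted_mse(pred, target, lats))  # ~1.0: uniform error
```

With a uniform error, the weighted and unweighted averages coincide; the weighting only matters when errors are unevenly distributed over latitudes, e.g. concentrated at the poles.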
Now, recall from the previous section that the loss value guides the direction of model optimization after each step. If the loss function penalizes errors – and especially large errors, given the square – it is no surprise that smoothed, average forecasts tend to produce lower loss values than detailed forecasts that capture both signal and noise. After all, on average, average forecasts produce the lowest errors: a detailed forecast can be spot-on, but produces large MSE values when it is wrong. The figures below visualize this effect. For some expected temperature forecast, a fictional model producing a more average, smoothed prediction achieves a (much) lower MSE than a detailed one. Even though the detailed forecast also captures the larger-scale signal, it produces many small-scale errors when doing so, and these are penalized heavily by the square in the MSE loss function.
Figure 4: a smoothened temperature forecast has an MSE of 338.06.
Figure 5: a more detailed forecast capturing the signal while being noisy has a much larger MSE.
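The effect in figures 4 and 5 is easy to reproduce with toy numbers (ours, not the values behind the figures). The "truth" oscillates around 5 degrees; a flat forecast of the mean misses all detail but is never off by more than 5, while a detailed forecast with the phase wrong is off by 10 everywhere:

```python
import numpy as np

# Truth: a small-scale oscillation around 5 degrees C.
truth = np.array([0.0, 10.0, 0.0, 10.0])

# A smoothed forecast that only predicts the mean...
smooth = np.full(4, 5.0)
# ...versus a detailed forecast that captures the oscillation
# but gets its phase wrong.
detailed = np.array([10.0, 0.0, 10.0, 0.0])

mse = lambda pred: float(np.mean((truth - pred) ** 2))
print(mse(smooth))    # 25.0 -- every cell is off by 5
print(mse(detailed))  # 100.0 -- every cell is off by 10
```

A gradient-descent optimizer comparing these two candidates will always prefer the smooth one, even though the detailed one "looks" more like real weather.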
In many cases, this is not a problem: average weather is well represented in the many years of training data, and weather conditions tend to resemble those averages. More problematic is the case of extreme weather.
First, extreme weather does not happen often (hence the name), so it is underrepresented in the training data, making it difficult to predict. What's more, since models are trained to reduce the average error, and average predictions tend to produce lower average errors (see the figures above), it follows that models tend to produce forecasts in which extreme events are underestimated. A case study of the Pangu-Weather model in extreme weather events demonstrates this effect (Bi et al., 2022). In other words, one of the downsides of AI-based weather models is that they can be quite good – and sometimes even better – in average scenarios, such as large-scale weather systems and their positions, while NWP models still provide more detailed forecasts for small-scale phenomena, such as extreme weather events and local differences in wind speed, as can be seen in figure 6.
Figure 6: the energy spectra plot for 10-meter wind speed at short lead times (+12 hours in this case) indicates that IFS HRES has more power, and thus more detail, for small-scale weather phenomena compared to AI-based models; this includes expressivity for small-scale extreme weather conditions.
Figure 7: Power spectra for various deterministic AI and NWP weather models for 2-meter temperature at 192 hours ahead (8 days). Higher power means that forecasts have more detail. Shorter wavelength means lower scale, thus higher power at shorter wavelengths means that small-scale weather phenomena are captured with higher detail. Image from WeatherBench (2024).
The effect can also be seen when looking at the actual output of AI-based weather models. For example, in the figure below, we compare the forecasted temperature at 2 meters at +0h and +360h for a run of the AIFS weather model. Clearly, the model is less expressive 360 hours ahead than at 0 hours. This is especially apparent over the Atlantic Ocean, where temperature zones appear more 'generic'. This reduction in sharpness with AI-based weather models results in lower power at shorter wavelengths in the energy spectra and, in the end, influences the quality of the predictions these models make for your location.
Figure 8: Output of the AIFS weather model as visible on our professional weather platform I’m Weather. On the left, we see the initialization of the forecast (at timestep +0h); on the right, the forecast for 15 days ahead (at timestep +360h). Observe the differences in detail when comparing smaller-scale differences in temperature between both time steps, especially in the Atlantic.
Like the smoothing effects discussed before, understanding the origin of this reduced sharpness is important for improving AI-based weather modelling. Effectively, it is related to the autoregressive procedure of AI-based weather modelling. Each weather forecast starts with an analysis of the atmosphere: the best possible blend of an older forecast for the starting time and the observations taken since then. Using an AI weather model, this analysis is used to generate a forecast for a future time step, e.g. 6 hours ahead. That forecast is then used to generate the +12 hour forecast; that one, the +18 hour forecast; and so forth. In other words, autoregressive modelling means that predictions by the AI model are used to generate new predictions.
Figure 9: autoregressive AI-based weather modelling: using a previous prediction for generating a new one.
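The loop in figure 9 can be sketched in a few lines. The `rollout` function and the stand-in model below are purely illustrative: a real weather model is a large neural network, not a one-line formula.

```python
def rollout(model, analysis, n_steps):
    """Autoregressive forecasting: feed each prediction back in.
    `model` maps the state at time t to the state 6 hours later, so
    repeated application yields the +6h, +12h, +18h, ... forecasts."""
    states = [analysis]
    for _ in range(n_steps):
        states.append(model(states[-1]))  # previous prediction is the new input
    return states

# Hypothetical stand-in for a trained model: nudges the state each step.
toy_model = lambda state: state * 0.9 + 1.0

forecasts = rollout(toy_model, 0.0, n_steps=3)
print(len(forecasts))  # 4: the analysis plus the +6h, +12h and +18h states
```

Note that only the analysis comes from observations; every later state is built on model output alone, which is exactly what makes the next section's error accumulation possible.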
However, recall from a previous article that all models are wrong, while some are useful. The 'wrong', here, originates from the observation that a prediction will always be somewhat off compared to reality. The fact that predictions of AI weather models are always somewhat smoothed, as discussed in the previous section, means that there is an error in the very first prediction (and besides, predictions can simply be wrong as well). This error can also be seen as information loss, as there is a difference between reality and prediction. If this prediction is then used to generate a new one, the information loss introduced by smoothing and other causes in the first prediction is exacerbated in the second prediction, and again in the third, and so forth. This is known as error accumulation, since additional error is introduced with every forecast step. Its visibility grows slowly: in the first few time steps the effect is not significant, but at longer lead times forecasts quickly get blurrier. Indeed, error accumulation can clearly be observed when comparing the analysis with the +360 hour forecast, as we saw above!
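Error accumulation through smoothing can be demonstrated with a toy experiment of our own (not taken from any weather model): pretend each forecast step is a small moving-average filter, apply it autoregressively, and watch the error against the unchanged "truth" grow with every step.

```python
import numpy as np

def smooth_step(state):
    """Stand-in for one AI forecast step: it carries the large-scale
    signal forward but smooths away small-scale detail, modelling the
    information loss discussed above (a real model also advances the
    weather in time; we skip that to isolate the smoothing)."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(state, kernel, mode="same")

truth = np.sin(np.linspace(0, 8 * np.pi, 128))  # small-scale oscillation

state = truth.copy()
errors = []
for step in range(10):                          # ten autoregressive steps
    state = smooth_step(state)                  # prediction feeds the next step
    errors.append(float(np.mean((state - truth) ** 2)))

# The error grows with every step: accumulation in action.
print(errors[0] < errors[4] < errors[9])  # True
```

Each pass of the filter damps the oscillation a little further, so the mismatch with the detailed truth keeps increasing, just as AIFS output at +360h is blurrier than at +6h.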
The focus of the approaches described previously was mostly on reducing error accumulation at longer lead times. However, we also discussed that AI-based weather models are less expressive than NWP models when it comes to extreme weather conditions, which usually require small-scale phenomena to be predicted precisely. The creators of the FuXi weather model recognize that diffusion models are used successfully in image generation (think of the text-to-image tools widely available today) "due to their remarkable capability for generating highly detailed images" (Zhong et al., 2023). In other words: why not use them to sharpen weather forecasts?
To do so, Zhong et al. (2023) use the outputs of FuXi as conditioning for a diffusion model trained to convert noise into surface-level weather variables. The resulting model, named FuXi-Extreme, shows significant performance improvements on metrics like CSI and SEDI compared to the original FuXi model. In the figure below, predictions are clearly sharper, especially around extreme weather events (such as the cyclones present over Asia).
Figure 13: applying a diffusion model trained on weather forecasts, conditioned by outputs of the FuXi model (above), leads to sharper forecasts (below). Images from Zhong et al. (2023).
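The conditioning idea can be caricatured as follows. This is emphatically not the FuXi-Extreme architecture: a real diffusion model uses a trained neural network to predict noise under a carefully designed noise schedule. Here we fake the noise estimate as the gap to the conditioning forecast, purely to show the shape of the reverse (denoising) loop, starting from pure noise and being steered at every step by the smooth base forecast.

```python
import numpy as np

rng = np.random.default_rng(42)

# Smooth surface-level output of the base model (toy values of our own).
coarse_forecast = np.array([4.0, 6.0, 4.0, 6.0])

def denoise_step(x, condition):
    """Caricature of one reverse-diffusion step: a trained network would
    estimate the noise in `x` given the conditioning forecast; here we
    fake that estimate as the gap to the condition and remove a fraction."""
    estimated_noise = x - condition
    return x - 0.2 * estimated_noise

# Reverse process: start from pure noise and denoise step by step,
# conditioned on the coarse forecast throughout.
x = rng.normal(size=coarse_forecast.shape)
for _ in range(50):
    x = denoise_step(x, coarse_forecast)

print(np.allclose(x, coarse_forecast, atol=0.01))  # True
```

In the real setting the learned denoiser does not collapse onto the conditioning field as this toy does; it generates plausible small-scale detail that is consistent with it, which is precisely where the extra sharpness comes from.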
Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., & Tian, Q. (2022). Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast. arXiv preprint arXiv:2211.02556.
Chen, K., Han, T., Gong, J., Bai, L., Ling, F., Luo, J. J., ... & Ouyang, W. (2023a). Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. arXiv preprint arXiv:2304.02948.
Chen, L., Zhong, X., Zhang, F., Cheng, Y., Xu, Y., Qi, Y., & Li, H. (2023b). FuXi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science, 6(1), 190.
Han, T., Guo, S., Ling, F., Chen, K., Gong, J., Luo, J., ... & Bai, L. (2024). Fengwu-ghr: Learning the kilometer-scale medium-range global weather forecasting. arXiv preprint arXiv:2402.00059.
Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., ... & Battaglia, P. (2022). GraphCast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794.
Versloot, C. (2021, July 19). How to use PyTorch loss functions. MachineCurve.com | Machine Learning Tutorials, Machine Learning Explained. https://machinecurve.com/index.php/2021/07/19/how-to-use-pytorch-loss-functions
Versloot, C. (2023, November 19). A gentle introduction to Lora for fine-tuning large language models. MachineCurve.com | Machine Learning Tutorials, Machine Learning Explained. https://machinecurve.com/index.php/2023/11/19/a-gentle-introduction-to-lora-for-fine-tuning-large-language-models
Versloot, C. (2024, June 27). AI & weather forecasting: Evaluate AI weather models with WeatherBench. Infoplaza - Guiding you to the decision point. https://www.infoplaza.com/en/blog/ai-weather-forecasting-evaluate-ai-weather-models-with-weatherbench
WeatherBench. (2024). Spectra. https://sites.research.google/weatherbench/spectra/
Zhong, X., Chen, L., Liu, J., Lin, C., Qi, Y., & Li, H. (2023). FuXi-Extreme: Improving extreme rainfall and wind forecasts with diffusion model. arXiv preprint arXiv:2310.19822.