Assessing Forecast Accuracy

This is a sample lesson page from the Certificate of Achievement in Weather Forecasting offered by the Penn State Department of Meteorology. Any questions about this program can be directed to: Steve Seman


Upon completion of this page, you should be able to perform simple calculations for absolute error, mean absolute error, and Brier Scores, and interpret them to assess the quality of a particular forecast.


Photograph of darts on a dartboard
Contrary to popular belief, throwing darts has nothing to do with forecasting or measuring forecast accuracy.

You've probably heard the old adage, "weather forecasters are the only people who can be wrong most of the time and still get paid." Hilarious, right? Most folks just don't realize the complex science that is involved in making good forecasts. Throwing darts won't cut it! I don't think most people even realize that weather forecasters have methods of verifying and assessing the accuracy of their forecasts (or, if they do know that such methods exist, they assume that nobody cares enough to use them). Well, methods for verifying and assessing the accuracy of forecasts do exist--many types, in fact. Here, we're just going to cover a few of the most commonly used ones. If you're a bit "math averse," don't worry too much. We'll limit the discussion to simple arithmetic and other basics.

Mean Absolute Error

Perhaps the simplest way to assess the accuracy of a deterministic point forecast is to calculate the absolute error. The absolute error is merely the absolute value of the difference between the observed and the forecast value (absolute value just means that we're ignoring the sign of the difference). So, if you forecast a high temperature of 65 degrees Fahrenheit, and the observed high temperature ends up being 68 degrees Fahrenheit, then the absolute error would be |65°F - 68°F| = |-3°F| = 3°F. Or, if you forecast 0.82 inches of rain to fall in a 24-hour period, and 0.50 inches actually fall, the absolute error would be |0.82 inches - 0.50 inches| = 0.32 inches. Simple enough, right? Then, if you take the absolute errors over a number of forecast periods and average them, you get mean absolute error (which you may see abbreviated as MAE). The mean absolute error is useful for telling us the average difference between our forecasts and the observed values that occur. As you probably realize, more accurate forecasts yield smaller absolute errors (and smaller mean absolute errors over the long haul).
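
If you'd like to see that arithmetic spelled out, here's a minimal Python sketch of absolute error and mean absolute error. The function names are my own choices for illustration (they don't come from any particular forecasting library), and the numbers are the temperature and rainfall examples from the paragraph above.

```python
def absolute_error(forecast, observed):
    """Return the size of the forecast error, ignoring its sign."""
    return abs(forecast - observed)


def mean_absolute_error(forecasts, observations):
    """Average the absolute errors over paired forecasts and observations."""
    if len(forecasts) != len(observations):
        raise ValueError("need exactly one observation per forecast")
    errors = [absolute_error(f, o) for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


# The two examples from the text.
temp_error = absolute_error(65, 68)      # high temperature, degrees F
rain_error = absolute_error(0.82, 0.50)  # 24-hour rainfall, inches

print(temp_error)            # 3 degrees F
print(round(rain_error, 2))  # 0.32 inches (rounded to hide floating-point noise)
```

To get a mean absolute error, pass in lists that hold one forecast and one observation per forecast period.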

You might be asking, "why do we take the absolute value of our forecast errors to eliminate the sign?" If we didn't, then we would get misleading results when we calculate the average of our errors. Take the example from above: You forecast a high temperature of 65 degrees Fahrenheit, and the observed high temperature is 68 degrees Fahrenheit. The error would be 65°F - 68°F = -3°F. The next day, you forecast a high of 65 degrees Fahrenheit again, but the observed high is 62 degrees Fahrenheit. Your forecast error would be 65°F - 62°F = +3°F. If we averaged your two errors, we would get an average error of zero degrees Fahrenheit over the two-day period. That's not an accurate reflection of your error because the positive and negative errors canceled each other out. But, your mean absolute error over the two-day period would be 3 degrees Fahrenheit, which is more telling. Calculating mean error (including the signs) over time doesn't tell us much about the size of the difference between forecasts and observations, but it can tell us whether forecasts exhibit any bias (as in, whether forecasts tend to be lower or higher than the observed values over time).
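
Here's a short Python sketch of that contrast between mean error and mean absolute error. The function names are my own, and the second data set is made up purely to illustrate bias (it isn't from any real forecast record).

```python
def mean_error(forecasts, observations):
    """Average the signed errors (forecast minus observed); hints at bias."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


def mean_absolute_error(forecasts, observations):
    """Average the error sizes, ignoring sign; shows the typical miss."""
    errors = [abs(f - o) for f, o in zip(forecasts, observations)]
    return sum(errors) / len(errors)


# The two-day example from the text: errors of -3 and +3 degrees F.
forecasts = [65, 65]
observed = [68, 62]
print(mean_error(forecasts, observed))           # signed errors cancel out
print(mean_absolute_error(forecasts, observed))  # error sizes do not cancel

# Made-up values (not from the lesson): every forecast runs 2 degrees too low.
low_forecasts = [60, 63, 58]
low_observed = [62, 65, 60]
print(mean_error(low_forecasts, low_observed))   # negative: forecasts too low
```

For the two-day example, the mean error works out to zero while the mean absolute error is 3 degrees Fahrenheit. For the made-up set, the mean error is -2 degrees, which flags a tendency to forecast too low.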

Brier Scores

How would we assess the accuracy of a probabilistic forecast? Mean absolute error doesn't really help us. So, I introduce to you the Brier Score, a common way of assessing the accuracy of probabilistic forecasts. Calculating the Brier Score (BS) for a single forecast event is pretty simple, using this formula:

BS = (p - o)^2

In this formula, p is the forecast probability of an event occurring, and o is the occurrence of the event ("0" if the event does not occur, or "1" if the event does occur). The Brier Score for a single forecast event, then, is just the square of the difference between the forecast probability and "0" or "1" depending on whether the event occurs. Let's think about a simple example. Say you forecast a 70 percent chance of measurable rain at your hometown on a given day, and indeed, measurable rain ends up falling. The Brier Score calculation would be:

BS = (0.7 - 1)^2 = (-0.3)^2 = 0.09

So, your Brier Score for the forecast would be 0.09, which is pretty good! Now, if you had made the same forecast (a 70 percent chance of measurable rain), and no rain fell at all, then your Brier Score would be:

BS = (0.7 - 0)^2 = (0.7)^2 = 0.49

This time, your forecast wasn't as good because you predicted a 70 percent probability of something that ended up not happening. From these two examples, it's clear that lower Brier Scores are more desirable. A perfect Brier Score for a single event would be "0" (from a forecast of 0 percent or 100 percent that ends up being correct), while the worst possible Brier Score for a single event would be "1" (from a forecast of 0 percent or 100 percent that ends up being incorrect...a complete bust). If we want to track Brier Scores for a particular forecast variable (say, precipitation) over a long period of time, we can merely average them. Or, if we want to combine Brier Scores for multiple forecast events on a given day (say, the probability of measurable rain, the probability of the high temperature exceeding 80 degrees Fahrenheit, and the probability of a thunderstorm occurring), we can add up the individual Brier Scores for the events to compute a Total Brier Score.
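
The Brier Score calculations described above can be sketched in a few lines of Python. The function names are my own, and the three-event day at the bottom uses made-up probabilities and outcomes just to show how a Total Brier Score adds up.

```python
def brier_score(probability, occurred):
    """Brier Score for one event: (p - o)^2, where o is 1 if it occurred."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    outcome = 1.0 if occurred else 0.0
    return (probability - outcome) ** 2


def mean_brier_score(forecasts):
    """Average score over many (probability, occurred) pairs for one variable."""
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)


def total_brier_score(forecasts):
    """Sum of the scores for several different events on the same day."""
    return sum(brier_score(p, o) for p, o in forecasts)


# The two examples from the text: a 70 percent chance of measurable rain.
rain_fell = brier_score(0.7, True)
no_rain = brier_score(0.7, False)
print(round(rain_fell, 2))  # 0.09
print(round(no_rain, 2))    # 0.49

# Made-up day (not from the lesson): rain, high above 80 F, thunderstorm.
one_day = [(0.7, True), (0.2, False), (0.4, True)]
print(round(total_brier_score(one_day), 2))  # 0.09 + 0.04 + 0.36 = 0.49
```

I round the printed values only because floating-point arithmetic can leave tiny leftovers (like 0.09000000000000002) that have nothing to do with the forecast.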

Threat Scores

Mean absolute error and Brier Scores help us assess the accuracy of deterministic and probabilistic forecasts at a single point, but what about areal forecasts? That's where threat scores come into play. Forecasters at the Weather Prediction Center (WPC), for example, use threat scores to assess their areal forecasts for heavy precipitation. For a given event, forecasters compute threat scores by comparing the area where they predicted heavy precipitation with the area that ultimately received heavy rain or heavy snow.

You can think of a threat score as the ratio of the area where the forecast was correct to the total area covered by either the forecast or the observations. In the figure below, the forecast area (F) is the region for which WPC forecast heavy precipitation (shaded in red). The observed area (OB) indicates the region where heavy precipitation fell and is shaded in green. The hatched area, C, represents the region where the forecast for heavy precipitation was correct. By definition, the threat score (T) is calculated using the equation: T = C / (F + OB - C). Subtracting C in the denominator keeps the overlapping region from being counted twice.

Schematic showing the components of a threat score calculation
A schematic defining the areas used to compute a threat score (T). The hatched area, C, represents the region where the forecast for heavy precipitation verified. F, the red shaded area, is the region for which WPC forecasters predicted heavy precipitation. OB, the green shaded area, indicates the region where heavy precipitation fell.
Credit: David Babb

Unlike with Brier Scores, higher threat scores indicate better forecasts. A threat score for a perfect forecast is "1", while a completely busted forecast gets a big fat "0" (you can read more about computing threat scores, if you're interested).
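
For a feel of how the equation T = C / (F + OB - C) behaves, here's a minimal Python sketch. The function name and the areas are my own made-up values in arbitrary units; this is not how WPC actually measures the areas, which is the daunting part of the real calculation.

```python
def threat_score(forecast_area, observed_area, correct_area):
    """Threat score T = C / (F + OB - C) from three areas in the same units."""
    if correct_area > min(forecast_area, observed_area):
        raise ValueError("the correct area cannot exceed F or OB")
    # Forecast and observed areas together, with the overlap counted once.
    combined_area = forecast_area + observed_area - correct_area
    if combined_area <= 0:
        raise ValueError("need some forecast or observed area to score")
    return correct_area / combined_area


# Perfect forecast: the forecast and observed areas coincide exactly.
perfect = threat_score(100, 100, 100)

# Complete bust: the forecast and observed areas do not overlap at all.
bust = threat_score(100, 100, 0)

# Made-up in-between case: F = 1000, OB = 800, C = 500 (arbitrary units).
partial = threat_score(1000, 800, 500)

print(perfect)            # 1.0
print(bust)               # 0.0
print(round(partial, 2))  # 500 / 1300, about 0.38
```

Notice that the in-between case scores only about 0.38 even though half of the forecast area verified, because the missed and falsely forecast areas both count against the forecaster.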

The figure below shows the trend in annual threat scores based on predictions for one inch of rain (or a liquid-equivalent of one inch) during each year from 1960 through 2020. To put this graph in proper context, I point out that WPC bases all QPF verifications on the 12Z-to-12Z period. As for the graph itself, please note that Day 1 represents the forecast period from 12 to 36 hours (WPC calculates forecast hours based on the 00Z model runs). Day 2 forecasts cover 36 to 60 hours, while Day 3 spans from 60 to 84 hours.

Graph showing WPC's annual threat scores for 1 inch of precipitation from 1960-2020
The annual threat scores for predicting one inch of rain or liquid equivalent from 1960 to 2020. Threat scores are improving, but lots of room for improvement still exists.
Credit: Weather Prediction Center

These threat scores indicate that WPC typically attains a threat score right around 0.28 for Day-2 forecasts of one inch of rain (or liquid equivalent). Such a score translates to WPC getting only about 45 percent of the predicted area of one inch of rain or liquid equivalent correct 36 to 60 hours in advance. The Day-3 threat scores are a bit lower yet. Keep in mind that the forecasters at WPC are very good!
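
Where might a figure near 45 percent come from? One way to get there, and this is my own assumption rather than something WPC states, is to suppose that the forecast area and the observed area are the same size (F = OB). Then T = C / (2F - C), and solving for the fraction of the forecast area that verified gives C / F = 2T / (1 + T). A quick Python check, with a function name of my own choosing:

```python
def fraction_of_forecast_correct(threat):
    """Return C / F for a given threat score, ASSUMING F equals OB.

    With F = OB, T = C / (2F - C), which rearranges to C / F = 2T / (1 + T).
    The equal-area assumption is illustrative only.
    """
    if not 0.0 <= threat <= 1.0:
        raise ValueError("threat score must be between 0 and 1")
    return 2 * threat / (1 + threat)


day2_fraction = fraction_of_forecast_correct(0.28)
print(round(day2_fraction, 4))  # 0.56 / 1.28 = 0.4375
```

Under that assumption, a threat score of 0.28 corresponds to roughly 44 percent of the forecast area verifying, which is in line with the "about 45 percent" quoted above. If the forecast and observed areas differ in size, the fraction would be different.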

The take-home message here is that threat scores in the field are improving (note the increasing trends on the graph), but our ability to accurately forecast regions of heavy rain or snow 48 hours or more ahead of an event is not very good. Kind of paints extended forecasts for blizzards and daily deterministic point forecasts weeks into the future in a sobering light, doesn't it?

Forecast Tip

I highly recommend keeping a "forecast journal" when making forecasts. Keep notes about key forecast issues, model predictions, the reasoning behind your forecast, the final verification, and what your forecast errors were. These notes can really help you learn from your mistakes, and help build your mental log of forecasting experience!

Before we move on, see how what you learned in this section applies to WxChallenge forecasts in the WxChallenge Application section below.

WxChallenge Application

As I mentioned previously, in this course, our focus is going to be short-range, deterministic point forecasting for WxChallenge. So, you won't need to calculate threat scores in this course (the exact calculations can be quite daunting, and I promised we'd keep to basic math). I just wanted to give you a taste for the various types of forecast assessment schemes that exist. For our purposes, realize that the scoring formulas in WxChallenge are based on daily absolute errors for high and low temperature, maximum sustained wind speed, and amount of precipitation (although they assign an artificial point scheme to the absolute errors for wind and precipitation).

Now that we've covered the basics of forecasting and accuracy assessment, let's think more about how prudent, responsible forecasters go about making forecasts. Applying the basic approach described in the next section will help you create quality forecasts more consistently given any forecast objective, including those in WxChallenge.