M. Efforts Related to Forecast Verification



1.  Background

The RAP verification group continues to search for and develop new and better approaches for verification of aviation weather forecasts. Because verification has a critical role to play in the development of new forecast products, and because verification of these types of forecasts is inherently difficult, this research is an important aspect of the forecast development process. This year the group’s work focused primarily on two areas: (a) examining some properties of skill scores used to compare the quality of forecast systems and (b) initial development of improved methods for verifying convective forecasts, as well as other forecasts that can be defined as fields or objects. These efforts are briefly described in the following sections.

2. Some properties of skill scores

Skill scores are commonly used in forecast verification as a means of comparing the quality of forecasting systems. Because they attempt to quantify a multidimensional problem in a single dimension, they are inherently problematic. They are, however, unlikely to be dropped from use because they offer a simple way of making comparisons.

Skill scores are similar in many ways to Goodness-of-Fit (GOF) tests, which are used in statistics to evaluate, for example, the fit of a parametric distribution to a set of data. Read and Cressie (1988) showed that applying power transformations to GOF tests yields tests with more desirable properties. The work on GOF tests provides motivation for applying similar transformations to skill scores. The impacts of a variety of transformations were examined by T. Fowler and R. Bullock.

The general format of a skill score is a ratio of differences (Stanski et al. 1989; Wilks 1995). The numerator measures the difference between the forecast of interest and some reference forecast, while the denominator measures the difference between a perfect forecast and the reference forecast. Thus, the general formula involves the measure associated with the forecast to be evaluated (A), the measure associated with the reference forecast (Aref), and the measure associated with a perfect forecast (Aperf), combined as (A - Aref)/(Aperf - Aref). The result is interpretable as a “percentage improvement” over the reference forecast (Wilks 1995). However, comparing different skill scores for the same forecasts makes it clear that this “percentage improvement” varies considerably depending on the score chosen. Four skill scores were evaluated: the True Skill Statistic (TSS), the Gilbert Skill Score (GSS), the Heidke Skill Score (HSS), and the Probability of Detection Skill Score (PODSS).
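The ratio-of-differences form can be illustrated with a short Python sketch. It uses the textbook 2x2 contingency-table definitions of TSS, HSS, and GSS; PODSS is omitted because its definition is not given here. The function names and the example counts are hypothetical and chosen only for illustration.

```python
def skill_score(a, a_ref, a_perf):
    """Generic skill score: (A - Aref) / (Aperf - Aref)."""
    return (a - a_ref) / (a_perf - a_ref)


def tss(hits, false_alarms, misses, correct_negs):
    """True Skill Statistic: POD minus probability of false detection."""
    pod = hits / (hits + misses)
    pofd = false_alarms / (false_alarms + correct_negs)
    return pod - pofd


def hss(hits, false_alarms, misses, correct_negs):
    """Heidke Skill Score: proportion correct relative to random chance."""
    n = hits + false_alarms + misses + correct_negs
    pc = (hits + correct_negs) / n
    # Proportion correct expected from random forecasts with the same marginals.
    pc_ref = ((hits + false_alarms) * (hits + misses)
              + (misses + correct_negs) * (false_alarms + correct_negs)) / n ** 2
    return skill_score(pc, pc_ref, 1.0)


def gss(hits, false_alarms, misses, correct_negs):
    """Gilbert Skill Score: CSI adjusted for the hits expected by chance."""
    n = hits + false_alarms + misses + correct_negs
    hits_ref = (hits + false_alarms) * (hits + misses) / n
    return (hits - hits_ref) / (hits + false_alarms + misses - hits_ref)


# Hypothetical counts: hits, false alarms, misses, correct negatives.
counts = (30, 20, 10, 40)
for name, fn in (("TSS", tss), ("HSS", hss), ("GSS", gss)):
    print(f"{name} = {fn(*counts):.3f}")
# prints TSS = 0.417, HSS = 0.400, GSS = 0.250
```

For the same hypothetical forecasts the three scores range from 0.25 to about 0.42, which illustrates how the apparent "percentage improvement" depends on the score chosen.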

R. Bullock and T. Fowler applied several types of transformations to the skill scores. Most yielded little or no improvement over the original score. However, in some cases, square root and log transformations applied to the counts before substituting them into the skill score formula yield better properties. For example, the cell counts are brought closer to equal and the effect of bias on the score is reduced. For some types of forecasts, the power transformations have no effect on the resulting score. However, this method seems to yield more consistent results between the different scores in some cases. It is of interest to note that previous research supports the use of power transformations on count data.

The infamous Finley tornado forecasts (F) were used by R. Bullock and T. Fowler as an example (Murphy 1996). These forecasts are for a rare event and are biased. Transformation of the counts via the square root (R) and logarithm (L) yielded more consistent results across several different skill scores than the original data, as shown in Figure M1. Power transformations such as these, where an exponent is applied to the counts, form a continuous family. Figure M2 shows the effects on the four skill scores of applying an exponent between 0 and 2 to the counts prior to computing the skill score. The results indicate that exponents greater than one yield less consistent skill scores. Restricting the exponent to between 0.3 and 1 seems to yield the most consistent results between the four skill scores.
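A minimal sketch of the transform-then-score procedure is given below. It assumes the Finley counts as they are usually tabulated in the verification literature (28 hits, 72 false alarms, 23 misses, 2680 correct negatives), which this report does not list, and it covers only TSS, HSS, and GSS because the PODSS definition is not given here. The helper names and the "spread" measure (largest score minus smallest score) are illustrative choices, not the authors' method.

```python
import math

# Finley's 2x2 counts as usually tabulated in the literature (an assumption
# here): (hits, false alarms, misses, correct negatives).
FINLEY_COUNTS = (28, 72, 23, 2680)


def scores(a, b, c, d):
    """Return TSS, HSS, and GSS for (possibly transformed) 2x2 cell counts."""
    n = a + b + c + d
    tss = a / (a + c) - b / (b + d)
    hss = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    a_ref = (a + b) * (a + c) / n  # hits expected by chance
    gss = (a - a_ref) / (a + b + c - a_ref)
    return {"TSS": tss, "HSS": hss, "GSS": gss}


def power_transform(counts, lam):
    """Raise every cell count to the exponent lam (lam = 1 changes nothing)."""
    return tuple(x ** lam for x in counts)


def log_transform(counts):
    """Natural log of every cell count (counts must exceed 1 to stay positive)."""
    return tuple(math.log(x) for x in counts)


def spread(score_dict):
    """Illustrative consistency measure: largest score minus smallest score."""
    values = list(score_dict.values())
    return max(values) - min(values)


for lam in (0.3, 0.5, 1.0, 2.0):
    s = scores(*power_transform(FINLEY_COUNTS, lam))
    line = ", ".join(f"{k}={v:.3f}" for k, v in s.items())
    print(f"lambda={lam}: {line}, spread={spread(s):.3f}")

s_log = scores(*log_transform(FINLEY_COUNTS))
print("log:", ", ".join(f"{k}={v:.3f}" for k, v in s_log.items()))
```

With these assumed counts, the spread among the three scores is smaller for the square-root transform than for the untransformed counts, and larger for an exponent of 2, in line with the behavior described above.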

As with GOF tests, no skill score is preferable in all situations. Both GOF tests and skill scores can exhibit undesirable behavior when cell counts are too small, too large, or too different from each other. In the forecast verification situation, these types of cell counts by definition are the only types of interest. (Equal cell counts would correspond to “coin flip” forecasts paired with an event that has a 50% climatological probability.)

All samples in this study were large, so the effect of transformations on smaller samples will be assessed in future work. R. Bullock and T. Fowler will also undertake a more mathematically rigorous examination of the properties of the various transformations on skill scores. Additionally, the effect of using weights on ratios of counts will be investigated.

Figure M1: Skill scores for Finley’s tornado forecasts (F), their square root (R), and natural logarithm (L)

Figure M2: Skill scores for transformed Finley’s tornado forecasts. Lambda is the exponent of the transformation applied to the counts.

 

3.  Improved methods for verification of forecast “objects”

In ongoing work, B. Brown and R. Bullock have begun to develop more diagnostic approaches for the verification of forecasts of convection and precipitation. Standard approaches for the verification of these forecasts and other types of forecasts that can be treated as objects or fields have numerous limitations. Most importantly, these approaches are insensitive to the magnitudes of timing, location, and shape errors. In particular, the penalty for an incorrect forecast is the same regardless of the magnitude or source of the error. Moreover, the approaches and associated measures are not diagnostic.

In general, the standard verification approaches are based on the direct comparison of the points on a forecast grid to the corresponding points on an observation grid (note that objects can easily be transformed into grids). From these comparisons, the counts of matched and mismatched forecast-observation pairs can be determined and then used to compute such standard verification statistics as the Probability of Detection (POD), the False Alarm Ratio (FAR), and the Critical Success Index (CSI) (Stanski et al. 1989).
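The point-by-point comparison can be sketched in Python as follows. The grids are represented as nested lists of 0/1 values, and the function names and the small example grids are hypothetical; the formulas are the standard definitions of POD, FAR, CSI, and Bias.

```python
def contingency_counts(forecast, observed):
    """Count hits, false alarms, and misses over two equal-sized 0/1 grids."""
    hits = false_alarms = misses = 0
    for f_row, o_row in zip(forecast, observed):
        for f, o in zip(f_row, o_row):
            if f and o:
                hits += 1          # forecast yes, observed yes
            elif f:
                false_alarms += 1  # forecast yes, observed no
            elif o:
                misses += 1        # forecast no, observed yes
    return hits, false_alarms, misses


def grid_stats(forecast, observed):
    """Standard statistics from matched and mismatched grid points."""
    hits, false_alarms, misses = contingency_counts(forecast, observed)
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "CSI": hits / (hits + false_alarms + misses),
        "Bias": (hits + false_alarms) / (hits + misses),
    }


# Hypothetical 4x4 grids: the forecast area is displaced one column to the
# west of the observed area.
forecast = [[0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
observed = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

print(grid_stats(forecast, observed))
```

In this example a one-column displacement halves POD and drops CSI to one third, even though the forecast has exactly the right size and shape (Bias of 1).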

A goal of the current work is to improve the approaches used to verify these types of forecasts and to remedy some of the problems associated with them. Brown and Bullock are building on the work of others to develop appropriate methods. One example is work by Hoffman et al. (1995), which separates forecast errors into those that are due to the location, shape, and size of the objects. A related example is the work by Ebert and McBride (2000), in which the root mean squared error (RMSE) associated with forecasts of precipitation is decomposed into various sources, including location and intensity.

Figure M3 shows an example of a prototype implementation of a simple diagnostic verification approach for forecast objects, based on similar ideas. This prototype approach is applied to the Collaborative Convective Forecast Product (CCFP), which is a human-generated 2-6-h forecast of convection expected to impact air traffic, produced by the National Weather Service’s Aviation Weather Center in collaboration with airline and other meteorologists. The verification approach involves systematically moving and rotating the CCFP shapes until an optimal match is achieved with the observations (in this case, a radar mosaic combined with lightning observations). The original three CCFP shapes for the example case are shown in Figure M3a, along with the verifying observations. Figure M3b shows the same shapes after translation and rotation. Impacts of the approach are shown in Table M1, which presents the overall verification statistics associated with the two plots (for all three forecast areas together). Table M2 shows the optimal translations and rotations that were applied to the CCFP shapes.
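The idea of moving a forecast shape until it best matches the observations can be sketched with a brute-force search. This is a simplified illustration, not the prototype itself: it handles translation only (no rotation), works in whole grid cells, uses CSI as the match criterion (an assumption), and all function names and grids are hypothetical.

```python
def csi(forecast, observed):
    """Critical Success Index for two equal-sized 0/1 grids."""
    hits = false_alarms = misses = 0
    for f_row, o_row in zip(forecast, observed):
        for f, o in zip(f_row, o_row):
            if f and o:
                hits += 1
            elif f:
                false_alarms += 1
            elif o:
                misses += 1
    total = hits + false_alarms + misses
    return hits / total if total else 0.0


def shift(grid, dy, dx):
    """Translate a 0/1 grid by (dy, dx) cells.

    Cells shifted off the edge are dropped, which would change the forecast
    area; the example below keeps the shape inside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if grid[y][x] and 0 <= ny < rows and 0 <= nx < cols:
                out[ny][nx] = 1
    return out


def best_translation(forecast, observed, max_shift):
    """Try every shift up to max_shift cells; return (best CSI, dy, dx)."""
    best = (csi(forecast, observed), 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = csi(shift(forecast, dy, dx), observed)
            if score > best[0]:
                best = (score, dy, dx)
    return best


def block(rows, cols, top, left, size):
    """Hypothetical helper: grid with a size x size block of ones."""
    return [[1 if top <= y < top + size and left <= x < left + size else 0
             for x in range(cols)] for y in range(rows)]


forecast = block(6, 6, 0, 0, 2)   # forecast area in the top-left corner
observed = block(6, 6, 2, 3, 2)   # observed area two rows down, three columns right

print(csi(forecast, observed))                  # 0.0: no overlap at all
print(best_translation(forecast, observed, 3))  # (1.0, 2, 3)
```

The returned offsets play the role of the diagnostic translation errors: multiplying them by the grid spacing would give the displacement in kilometers. Extending the search to rotations would require resampling the rotated shape onto the grid, which this sketch does not attempt.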


Table M1. Verification statistics for example case.

Shapes        POD     FAR     CSI     Bias
Original      0.26    0.86    0.08    1.1
Translated    0.42    0.63    0.24    1.1

Table M2. Translation and rotation statistics for example case.

Shape    Translation    Rotation
1        218 km         9°
2        279 km         135°
3        249 km         -8°


The statistics in Table M1 indicate that the translated and rotated shapes match the observations much more closely than the original shapes. In particular, POD increases by more than half (from 0.26 to 0.42), FAR falls by about a quarter (from 0.86 to 0.63), and CSI triples (from 0.08 to 0.24). The Bias (the ratio of the forecast area to the observed area) is unchanged, since the sizes of the forecast shapes were not altered. The statistics in Table M2 indicate that the optimal matches required translations of roughly 220 to 280 km. The optimal rotation was quite small for shapes 1 and 3 (9° and -8°), whereas a large rotation (135°) was applied to shape 2. These results provide an indication of the sensitivity of standard verification statistics to relatively small location and orientation errors. In addition, the diagnostic information that is provided clearly delineates the sizes of the true errors in the forecasts.

Examining the shapes in Figure M3b suggests that the forecasts could easily be improved further. For example, forecast shape 3 would provide a better indication of the convection along the frontal region if it extended further along the line to the southwest. Hence, an additional evaluation would consider changes in the shape and/or size of forecast areas that are needed to improve the quality of the forecasts. A hierarchical approach is envisioned, starting with evaluation of translation and rotation errors, and proceeding to evaluation of size, shape, and intensity errors. Timing errors could be considered in a similar manner. Thus, this simple approach could easily be enhanced and improved. This proof-of-concept suggests that such enhancements are worth pursuing.

Figure M3. Six-hour CCFP forecast, valid at 1900 UTC on 21 June 2000: (a) original forecast areas; (b) optimally translated and rotated shapes. Observations, on a 40-km scale, are shown in gray.


 

References

Ebert, E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. Journal of Hydrology, 239, 179-202.

Hoffman, R. N., Z. Liu, J.-F. Louis, and C. Grassotti, 1995: Distortion representation of forecast errors. Monthly Weather Review, 123, 2758-2770.

Murphy, A. H., 1996: The Finley affair: A signal event in the history of forecast verification. Weather and Forecasting, 11, 3-20.

Read, T. R. C., and N. A. C. Cressie, 1988: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of Common Verification Methods in Meteorology. Atmospheric Environment Service, Forecast Research Division, Ontario, Canada.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego.
