Letter to the Editor

Map Evaluation and “Chance Correction”

Dear Editor,
Recently, Koukoulas and Blackburn (2001) cited my GT-Index (Türk, 1979, 1983) among the map accuracy indices. However, my GT-Index is not an index of “map accuracy” but the first and only index to measure the “diagnostic ability” of the map-maker.

There are others, as well, who consider that the “GT index is a member of the family of agreement coefficient[s],” like Kappa (Rosenfield and Fitzpatrick-Lins, 1986). So this letter is also a belated answer to such misstatements.

More importantly, I will use this occasion to point out an important distinction so far overlooked by the remote sensing (RS) community: The distinction between the accuracy of a map and the diagnostic ability of its maker. This distinction has an important consequence:

“Chance-correction” is necessary to measure the diagnostic ability, but unnecessary for map accuracy evaluation.

Since the issue of “chance-correction” was raised for the first time in my 1979 paper, I wonder whether I have inadvertently triggered this business of “chance-correction” for accuracy (sic!) evaluation. In fact, just after my 1979 paper, Cohen’s (1960) Kappa was transferred from psychology to the RS field both in the U.S. (Congalton, 1980) and the U.K. (Chrisman, 1980). In time, a literature about “chance-corrected” indices flourished; Kappa became a part of RS folklore and even found its way into RS textbooks. An untenable recommendation might have boosted the use of Kappa.

Misleading Recommendation
Rosenfield and Fitzpatrick-Lins (1986) explicitly recommended the use of Kappa and conditional Kappa for map accuracy evaluation. This recommendation, however, was not the result of a rigorous analysis; it has a flimsy basis (i.e., “.. Kappa coefficient is ... defensible as intra-class correlation coefficient”) and was supported only by a single, inappropriate numerical comparison. In this comparison the following five indices were used:

1. PCC (percent correctly classified) with respect to map-unit, which is now known as “User’s Accuracy” (Congalton, 1991) or “purity of map unit” (Lark, 1995),
2. Hellden’s (1980) “mean accuracy,” which is the same as the three-decades-old “similarity coefficient” of Sorensen (1948),
3. Short’s (1982) “mapping accuracy,” which is identical to the “coefficient of community” formulated in 1901 by Jaccard (1912) for phyto-sociological studies in the Alps,
4. Conditional Kappa (Light, 1971), introduced by the reviewers “as a measure of accuracy for a given category,” and
5. GT-Index (Türk, 1979), which is not an accuracy index.
Obviously, only commensurate indices can be compared; namely, the indices should measure the same property. Could these indices be commensurate, as tacitly assumed by the reviewers?
First of all, Hellden’s index, H, and Jaccard’s index, J, are symmetrical (i.e., the same formula is used for the field and for the map), but the rest are asymmetrical (i.e., different formulas for the field and for the map). Obviously, symmetric and asymmetric indices cannot measure the same property.

Secondly, when the reference points are different, even symmetric indices cannot measure the same property. The GT-Index is a “chance-corrected” version of PCC with respect to the field class, i.e., the Ground Truth (GT) category [this PCC is now known as “Producer’s Accuracy” (Congalton, 1991) or “representation of the class” (Lark, 1995)]. The reviewers, however, compared it with PCC with respect to the map-unit (i.e., “User’s Accuracy”). Besides, the GT-Index is not an accuracy index.

Lastly, even if the indices were commensurate, comparing their numerical values directly may not be appropriate. For instance, two thermometers both measure temperature; however, their numeric values cannot be directly compared if one reads in degrees Celsius and the other in degrees Fahrenheit. By comparing degree C values with the corresponding degree F values, one cannot conclude that the former thermometer “tends to underestimate” temperature, as done by the reviewers. Degrees C and F are related and can be converted to each other. Similarly, H and J are related:

H = 2J / (1 + J) or J = H / (2 – H).
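
As a quick numerical check of this relation (a minimal sketch; the starting value of J is arbitrary, chosen only for the arithmetic):

```python
# Round-trip check of the H-J relation: like degrees C and F, the two
# indices are interconvertible, so their values cannot be compared
# directly.

def h_from_j(j):
    """Hellden's H from Jaccard's J."""
    return 2 * j / (1 + j)

def j_from_h(h):
    """Jaccard's J from Hellden's H."""
    return h / (2 - h)

j = 0.5                    # arbitrary illustrative value
h = h_from_j(j)            # 2 * 0.5 / 1.5 = 0.666...
print(h, j_from_h(h))      # 0.666... and 0.5: the round trip recovers J
```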

In short, the reviewers have no evidence to support that recommendation (Türk, 1997a, 2001a).

Their recommendation might have boosted the use of Kappa, but it cannot be the reason for its introduction into the RS field. So, how did all of this come about?

Unfortunate Misunderstanding
My 1979 paper has two simple messages:

1. Not all correct classifications are due to positive diagnosis; some might be the result of good luck; and,
2. For a proper evaluation of the diagnostic skill, the chance contribution to map accuracy should be discounted.
The former message was widely received, but rarely acknowledged, by the RS community. The latter message, it seems, was misunderstood as “chance-correction” for map accuracy (sic) evaluation.

Surely, I did raise the issue of “chance-correction,” perhaps for the first time in the RS literature; I said that PCC values “are inflated by a chance component that is proportional to” the frequency of random guessing (Türk, 1979, p. 68), and I applied “chance-correction” to PCC with respect to the GT category (i.e., “Producer’s Accuracy”). However, I made that correction to measure the skill (i.e., the diagnostic performance) of the map-maker, and not for map accuracy evaluation.

It seems my proposal was somehow misunderstood as “chance-correction” for accuracy evaluation. This misunderstanding, I think, stems from overlooking an important distinction: the accuracy of a map vs. the diagnostic ability of its maker.

The Crucial Distinction
So far, the distinction between accuracy and diagnostic ability has not been sufficiently appreciated. Map accuracy is important for the users of the finished map. On the other hand, the diagnostic ability of a given map-making procedure is important for assessing its potential to produce accurate maps in the future. However, whether the map is finished or yet to be made has no bearing upon this distinction. Conceptually, map accuracy is the same for finished maps and for future maps. Of course, the current accuracy assessment procedure (i.e., forming a Confusion Matrix by sampling) can only be used for finished maps. However, I recently offered a procedure to assess the “expected” map accuracy of future maps (Türk, 2000).

Making a classificatory map is essentially a process of assigning “labels” to the field points. When a point on the map has the same label as the corresponding field point, then it is accurately labeled (i.e., correctly classified). This event can be called a hit. Accurate labeling could be due to positive diagnosis (namely, a sure hit) or the result of good luck (namely, a random hit). Obviously: Hits = Sure hits + Random hits.

If accurate labeling of the points is our main concern, then the reason for achieving accurate labeling is totally irrelevant. The more hits there are, the more accurate the map will be. The chance contribution to map accuracy can be gladly accepted as “a windfall gain.”

On the other hand, that chance contribution becomes “a liability” when trying to figure out what the future performance of the map-maker would be. “Chance-inflated index values preclude any generalization to future applications, since the amount of this windfall gain depends upon the vicissitudes” of each individual occasion (Türk, 1997b, 2001a). Clearly, accurate labeling by chance cannot be considered a diagnosis; hence, a chance-correction is necessary to measure the diagnostic ability. That is why I made the chance-correction in the GT-Index. As previously stated, the “GT index provides an opportunity to study the effects of certain factors on the predictive ability of a given RS procedure” (Türk, 1979, p. 73).

A Thought Experiment
A thought experiment (Türk, 1998, 2001) clarifies why and when “chance-correction” is needed: Imagine a strawman-of-no-diagnosis, which labels field points randomly and hence prepares a random map. A special GT study is made for that random map, and very respectable PCC values are obtained (a rare event, but not impossible). Afterward, we learn how this map was produced. Now we face two distinct questions: 1) Could we use this random map? 2) Should we employ “random labeling” as a map-making procedure in the future?

Obviously, every hit on this map is a random hit; therefore, a proper diagnostic index should have a value of zero. However, the high accuracy of this random map has been verified by a special GT study. In fact, this map is as good as any diligently made map having the same PCC values. Therefore, we can profitably use this highly accurate random map, but we cannot adopt “random labeling” as a map-making procedure.

Map Accuracy Indices
Since accuracy is the extent of correspondence between field reality (indicated by GT categories) and its map representation (given by map-units), PCC is a good measure for map accuracy evaluation. PCC with respect to the whole field (and, necessarily, the whole map) is now known as Overall Accuracy. It is a symmetrical index and indicates the general level of map accuracy, namely, the overall correspondence between the field and its map.

If we want to know what proportion of a given field class is correctly transferred into the corresponding map-unit, then we should know the Reproduction Rate (RR) of that GT category. This RR value is given by PCC with respect to the given GT category (i.e., “Producer’s Accuracy” or “representation of the class”). High RR is important for the map user: when a GT category has a low RR value, its whereabouts will not be displayed by the corresponding map-unit, since most of its points were mislabeled as something else.

If we would like to know how often we will find in the field what is shown on the map, then we should know the Finding Rate (FR) of a given map-unit. This is equal to PCC with respect to the map-unit (i.e., “User’s Accuracy” or “purity of map unit”). High FR is important to the map user, since a low FR means an impure map-unit and, equivalently, a high chance of ending up at a wrong field class when following the guidance of that map-unit.
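
All three indices can be read directly off the Confusion Matrix. The following minimal sketch (in Python, with a hypothetical 3-by-3 matrix; rows are taken as GT categories and columns as map-units) shows the computation:

```python
import numpy as np

# Hypothetical Confusion Matrix: rows = GT categories (field),
# columns = map-units; cell (i, j) counts the field points of
# category i that were labeled as map-unit j.
cm = np.array([[50,  5,  5],
               [10, 60, 10],
               [ 5,  5, 50]])

hits = np.diag(cm)                         # correctly labeled points
overall_accuracy = hits.sum() / cm.sum()   # PCC for the whole map

# Reproduction Rate (RR): PCC with respect to each GT category
# ("Producer's Accuracy"), taken row-wise.
rr = hits / cm.sum(axis=1)

# Finding Rate (FR): PCC with respect to each map-unit
# ("User's Accuracy"), taken column-wise.
fr = hits / cm.sum(axis=0)

print(overall_accuracy)   # 0.8
print(rr)                 # [0.833..., 0.75, 0.833...]
print(fr)                 # [0.769..., 0.857..., 0.769...]
```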

If both questions are considered important, then a composite index can be formed by combining RR and FR. In fact, Hellden’s H is the “harmonic mean” of RR and FR (Türk, 1997a). Since Jaccard’s J is a monotonic function of Hellden’s H, it is also a composite index of RR and FR.

Moreover, Koukoulas and Blackburn’s (2001) indices are also composites of RR and FR. Obviously, (1 – RR) is equal to their “omission percentage,” and (1 – FR) is equal to their “commission percentage”; hence:

ICSI = RR + FR – 1

which is linearly related to the arithmetic mean of RR and FR. Their CSI is the arithmetic mean of the ICSI values over all classes, and GCSI is the sum of the ICSI values of the n classes of particular interest (GCSI should be divided by n to make its maximum value unity).
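
Continuing the same hypothetical matrix as above, Hellden’s H (the harmonic mean of RR and FR) and the ICSI/CSI/GCSI family follow directly from RR and FR; the set of “classes of particular interest” below is likewise hypothetical:

```python
import numpy as np

cm = np.array([[50,  5,  5],
               [10, 60, 10],
               [ 5,  5, 50]])             # same hypothetical matrix
hits = np.diag(cm)
rr = hits / cm.sum(axis=1)                # Reproduction Rate per class
fr = hits / cm.sum(axis=0)                # Finding Rate per class

h = 2 * rr * fr / (rr + fr)               # Hellden's H: harmonic mean of RR and FR

icsi = rr + fr - 1                        # ICSI = RR + FR - 1, per class
csi = icsi.mean()                         # CSI: arithmetic mean of the ICSI values

interest = [0, 2]                         # hypothetical classes of particular interest
gcsi = icsi[interest].sum()               # GCSI: sum of ICSI over those classes
gcsi_unit_max = gcsi / len(interest)      # divided by n so its maximum is unity

print(h, icsi, csi, gcsi_unit_max)
```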

They offer GCSI for cases in which “accurate representation of some classes are of particular interest while other classes are of little concern.” There is a better and more general solution to this type of problem: first, the user forms a “Utility Matrix” by attaching a “utility value” to every cell of the Confusion Matrix; then this Utility Matrix of the user is combined with the Confusion Matrix of the map (Türk, 2001b).
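
Türk (2001b) gives the exact combination procedure; the sketch below shows only one plausible reading, as an assumption, in which the combined score is the utility-weighted sum of the Confusion Matrix proportions, with purely illustrative utility values:

```python
import numpy as np

cm = np.array([[50,  5,  5],
               [10, 60, 10],
               [ 5,  5, 50]])             # hypothetical Confusion Matrix

# Hypothetical Utility Matrix: the user attaches a utility value to
# every cell (here hits are worth 1.0, one particular confusion is
# tolerable at 0.5, and the rest are worthless). Illustrative only.
utility = np.array([[1.0, 0.0, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Assumed combination (not necessarily Türk's published procedure):
# the utility-weighted sum of cell proportions, i.e., the expected
# utility per mapped point.
expected_utility = (utility * cm).sum() / cm.sum()
print(expected_utility)                   # 0.825 for these numbers
```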

A Final Point
All “chance-corrected” indices have the same general formula:

K = (A – R) / (1 – R)

where K is the “chance-corrected” index,
A is the accuracy (i.e., the relevant PCC), and
R is the “chance-correction” term.
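
In code, the general formula and its best-known instance, Cohen’s Kappa (whose correction term R is the agreement expected from the row and column marginals), can be sketched as follows (same hypothetical matrix as above):

```python
import numpy as np

def chance_corrected(a, r):
    """General form of the chance-corrected indices: K = (A - R) / (1 - R)."""
    return (a - r) / (1 - r)

cm = np.array([[50,  5,  5],
               [10, 60, 10],
               [ 5,  5, 50]])             # hypothetical Confusion Matrix
n = cm.sum()

a = np.diag(cm).sum() / n                 # A: Overall Accuracy (PCC)

# Cohen's Kappa takes R to be the agreement expected by chance from
# the row and column marginal totals.
r = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2

print(chance_corrected(a, r))             # Kappa ~ 0.699 for this matrix
```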

Since the GT-Index has the same general formula (Türk, 1979, p. 68), Kappa and similar indices might have been, and could be, considered “easy-to-calculate” alternatives to the GT-Index. It is not so.

First of all, Kappa, Tau, etc. do not have clear conceptual definitions, but the GT-Index does. It is “the proportion of those items that will always be classified correctly” (original emphasis; Türk, 1979, p. 69). That is the “proportion of sure hits,” whereas PCC is the proportion of all hits.

Clearly, even if the same formula is used, different correction terms create different indices. Therefore, the correction term should be properly defined for the problem at hand and the characteristics of the data. Psychologists are interested in measuring the “real agreement” between two independent observers; hence, they have an Agreement Matrix as their data and define their correction term as the “random matches” between the independent declarations made by the two observers. Obviously, if the two observers usually disagree, then their “agreement score” will be less than the “random matches.” This means the correction term can be bigger than the actual agreement.

However, correct classification by chance cannot be more than the actual correct classifications. That is, random hits cannot be more than hits. Therefore, the correction term for a Confusion Matrix should take this fact into account. When Kappa, Tau, etc. are used, such illogical chance-correction (i.e., a correction larger than the actual hits) could, and in fact did, occur.
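
A deliberately contrived 2-by-2 matrix illustrates the point: when the map-maker systematically disagrees with the field, Kappa’s correction term exceeds the actual accuracy, something impossible for a genuine count of random hits:

```python
import numpy as np

# Contrived Confusion Matrix: every point of class 0 is labeled as
# class 1 and vice versa, so there are no hits at all.
cm = np.array([[ 0, 10],
               [10,  0]])
n = cm.sum()

a = np.diag(cm).sum() / n                              # A = 0.0 (no hits)
r = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2     # R = 0.5

print(a, r, (a - r) / (1 - r))   # 0.0 0.5 -1.0: the correction term R
                                 # exceeds the actual accuracy A
```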

The GT-Index is calculated by separating the Confusion Matrix into two matrices: 1) a diagonal matrix for diagnosis (i.e., sure hits on the diagonal), and 2) a completely random matrix for random labeling (i.e., random hits on the diagonal and misclassifications on the off-diagonals). Consequently, random hits can never be more than actual hits. This separation also assures the equality of the conceptual definition and the general formula for the GT-Index (see Annex).
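
The sketch below shows only the structure of this separation; the diagonal split (how many hits are random) is assumed for illustration, since the fitting of the random matrix itself follows the procedure in Türk (1979), which is not reproduced here:

```python
import numpy as np

cm = np.array([[50,  5,  5],
               [10, 60, 10],
               [ 5,  5, 50]])             # hypothetical Confusion Matrix

# Assumed split of each diagonal cell into sure hits and random hits;
# Türk (1979) fits the random matrix itself, these numbers only
# illustrate the structure of the separation.
random_hits = np.array([8.0, 12.0, 8.0])  # must satisfy 0 <= r_i <= cm[i, i]

diagnosis = np.diag(np.diag(cm) - random_hits)  # diagonal matrix: sure hits only
random_part = cm - diagnosis                    # random hits + all misclassifications

# The random matrix's diagonal is carved out of the actual diagonal,
# so random hits can never exceed actual hits.
assert np.all(np.diag(random_part) <= np.diag(cm))

# Per GT category: K = (A - R) / (1 - R), with A the Producer's
# Accuracy and R the random hits as a fraction of the row total.
row_totals = cm.sum(axis=1)
A = np.diag(cm) / row_totals
R = random_hits / row_totals
print((A - R) / (1 - R))                  # "proportion of sure hits" per category
```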

Conclusion
In short:

1. There is a distinction between the accuracy of the map and the diagnostic ability of its maker.
2. Some of the correct classifications might be the result of good luck. However, when evaluating map accuracy, “chance-correction” is unnecessary; in fact, the chance contribution to map accuracy can be gladly accepted as a “windfall gain.” Consequently, all accuracy indices are based upon various PCC values or their combinations.
3. Correct classification by chance cannot be considered a diagnosis; hence, “chance-correction” is absolutely necessary to measure the diagnostic ability of a map-maker.
4. The “chance-correction” should not be more than the actual hits, but such illogical correction did occur with Kappa, etc.
5. My GT-Index is the first and only index for measuring diagnostic ability.
This Letter to the Editor intends to set the record straight regarding my GT-Index by providing an overview. Hopefully, its publication will also initiate a re-evaluation of the widespread practice of “chance-correction” and start a deliberation upon the distinction between “accuracy” and “diagnosis.”

Sincerely,
Göksel TÜRK,
Özler Sitesi, AB Blok D:1, Isparta, TURKEY
g.turk@sdu.edu.tr or gokselturk@mynet.com

References
Chrisman, N.R., 1980. Assessing Landsat accuracy: A geographic application of misclassification analysis, Second Colloquium on Quantitative and Theoretical Geography, Trinity Hall, Cambridge, England. [Cited by James B. Campbell, 1996, Introduction to Remote Sensing, 2nd Ed., Chap. 13, Taylor & Francis Ltd.]
Cohen, J., 1960. A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20:37–46.
Congalton, R.G., 1980. Statistical techniques for analysis of Landsat classification accuracy (11 March 1980). Paper presented at the meeting of the American Society of Photogrammetry, St. Louis, Missouri. [Cited by Rosenfield and Fitzpatrick-Lins, 1986.]