PE&RS March 2016 Public version

Approximating Prediction Uncertainty for Random Forest Regression Models

John W. Coulston, Christine E. Blinn, Valerie A. Thomas, and Randolph H. Wynne

Abstract

The use of machine learning approaches such as random forest has increased for the spatial modeling and mapping of continuous variables. Random forest is a non-parametric ensemble approach, and unlike traditional regression approaches there is no direct quantification of prediction error. Understanding prediction uncertainty is important when using model-based continuous maps as inputs to other modeling applications such as fire modeling. Here we use a Monte Carlo approach to quantify prediction uncertainty for random forest regression models. We test the approach by simulating maps of dependent and independent variables with known characteristics and comparing actual errors with prediction errors. Our approach produced conservative prediction intervals across most of the range of predicted values. However, because the Monte Carlo approach was data driven, prediction intervals were either too wide or too narrow in sparse parts of the prediction distribution. Overall, our approach provides reasonable estimates of prediction uncertainty for random forest regression models.

Introduction

Remote sensing scientists have a rich set of standard methods with which the uncertainty of (inherently categorical) thematic maps derived from remotely sensed data can be estimated (e.g., Congalton and Green, 2008). For the most part, the resulting uncertainty estimates (a) are independent of the analytical method used for the categorical data analysis, and (b) contain information on category-specific accuracy but not pixel-specific accuracy. Methods with which to estimate the uncertainty of mapped continuous fields are, in contrast, much less standardized. Category-specific accuracy, of course, is no longer relevant, but the means by which uncertainty of continuous variables is estimated is often tied to the technique used. Examples abound, including use of RMSE in classical regression-oriented approaches (Fernandes et al., 2004) and cross-validation-derived PRESS (sum of squares of the prediction residuals) RMSE (Popescu et al., 2004). Cross-validation approaches are also widely used in regression tree analyses of remotely sensed data (Baccini et al., 2007). Cross-validation can estimate many prediction error statistics, including the residual sum of squares. However, cross-validation is increasingly used primarily for model selection, and (usually non-parametric) bootstrapping is used once the model is “fixed” (see, e.g., Molinaro, 2005). These methods have been extended to random forest implementations, but the resulting estimates of prediction uncertainty are aggregated (i.e., global) and do not produce the pixel-specific uncertainties required for use in subsequent spatial modeling.
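To make the idea of a global, cross-validation-derived error statistic concrete, the following is a minimal sketch of leave-one-out PRESS and the RMSE derived from it for a simple linear regression. The function names (`fit_line`, `press_rmse`) and the toy data are illustrative assumptions, not taken from the studies cited above. Note that the result is a single number for the whole model, which is exactly the limitation described: it says nothing about any individual prediction.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def press_rmse(xs, ys):
    """Leave-one-out PRESS (sum of squared prediction residuals) and its RMSE."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i, then predict observation i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return press, math.sqrt(press / len(xs))

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
press, rmse = press_rmse(xs, ys)
print(f"PRESS = {press:.4f}, RMSE = {rmse:.4f}")
```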

The use of machine learning techniques has increased substantially in remote sensing and geospatial data development. For example, Homer et al. (2004) used regression trees for the development of a categorical land cover map for the United States, and Coulston et al. (2012) used random forests to develop a continuous field map of percent tree canopy cover. Other techniques that have been proposed and tested include artificial neural networks, support vector machines, stochastic gradient boosting, and k-nearest neighbor (Moisen and Frescino, 2002; Wieland and Pittore, 2014). Machine learning approaches have become particularly attractive because they are well suited to recognizing patterns in high-dimensional data (Cracknell and Reading, 2014). Further, several of these approaches allow for modeling either categorical or continuous response variables (e.g., random forests, support vector machines/support vector regression). However, unlike traditional parametric approaches (e.g., multiple regression), information about prediction error (the standard error of a prediction for a new data point) is not readily available.
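For contrast, the sketch below shows how a parametric model yields a prediction-specific standard error. It uses the textbook formula for simple linear regression, s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx), which is a simplification of the multiple regression case mentioned above. The function name `prediction_se` and the toy data are assumptions made for illustration.

```python
import math

def prediction_se(xs, ys, x0):
    """Standard error of a new prediction at x0 for an OLS fit of y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)  # residual variance estimate
    # The error grows as x0 moves away from the mean of the training data.
    return math.sqrt(s2 * (1 + 1 / n + (x0 - mx) ** 2 / sxx))

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
se_center = prediction_se(xs, ys, 3.5)   # at the mean of x
se_edge = prediction_se(xs, ys, 10.0)    # extrapolating beyond the data
print(f"SE at x=3.5: {se_center:.4f}, SE at x=10: {se_edge:.4f}")
```

A random forest has no closed-form analogue of this quantity, which is the gap the paper sets out to address.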

Broad-scale raster maps of continuous variables have been developed for percent impervious surface (Homer et al., 2007), percent tree canopy (Huang et al., 2001; Coulston et al., 2012), forest biomass (Blackard et al., 2008), and forest carbon (Wilson et al., 2013), among other examples. These efforts all relied on machine learning approaches and used either Landsat or MODIS imagery for predictor variables. Each pixel within these modeled raster maps contains a predicted value, yet per-pixel uncertainty is rarely expressed along with the predictions. Understanding the pixel-level uncertainty is critical to understanding the utility of the data. Furthermore, many geospatial datasets (such as those mentioned above) are used in subsequent modeling applications. For example, the 2001 NLCD tree canopy cover dataset (Huang et al., 2001) was a major component of forest fire behavior and fuel models (Rollins and Frame, 2006). Clearly, the uncertainty around this fire behavior model is related to the uncertainty in the underlying data, such as the 2001 NLCD percent tree canopy cover. Our intent is to provide guidance on quantifying prediction uncertainty at the pixel level.

While there are numerous machine learning techniques, here we focus on random forest because it is straightforward to train, computationally efficient, and provides stable predictions (Cracknell and Reading, 2014). Random forest is an ensemble method that uses bootstrap aggregating (bagging) to develop multiple models to improve prediction (Breiman, 2001). Along with bagging, random forest also relies on random feature selection to develop a forest of independent CART models. This technique has been used by Powell et al. (2010) and Baccini et al. (2008) to predict forest biomass, Evans and Cushman (2009) to predict species occurrence probability, Hernandez et al. (2008) to predict faunal species distributions, and Moisen et al. (2012) to predict percent tree canopy cover.
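The two ingredients named above, bagging and random feature selection, can be illustrated with a small sketch. This is a simplified, pure-Python toy under stated assumptions (shallow depth-limited trees, a hypothetical `n_try` parameter for the number of features tried per split, simulated data); it is not the authors' implementation nor that of any random forest library.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, feat_ids):
    """Return (cost, feature, threshold) minimizing squared error, or None."""
    best = None
    for f in feat_ids:
        # Skipping the smallest value keeps both sides of every split non-empty.
        for t in sorted({x[f] for x, _ in rows})[1:]:
            left = [y for x, y in rows if x[f] < t]
            right = [y for x, y in rows if x[f] >= t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

def grow_tree(rows, n_try, depth, rng):
    """Grow a regression tree; each split considers n_try randomly chosen features."""
    mean = sum(y for _, y in rows) / len(rows)
    if depth == 0 or len(rows) < 4:
        return mean  # leaf: mean response
    n_feat = len(rows[0][0])
    split = best_split(rows, rng.sample(range(n_feat), n_try))
    if split is None:
        return mean
    _, f, t = split
    left = [r for r in rows if r[0][f] < t]
    right = [r for r in rows if r[0][f] >= t]
    return (f, t, grow_tree(left, n_try, depth - 1, rng),
            grow_tree(right, n_try, depth - 1, rng))

def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

def fit_forest(rows, n_trees=30, n_try=2, depth=3, seed=0):
    """Bagging: each tree is grown on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [grow_tree([rng.choice(rows) for _ in rows], n_try, depth, rng)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """The ensemble prediction is the average over all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)

# Simulated toy data (assumption): y depends on features 0 and 1; feature 2 is noise.
data_rng = random.Random(1)
rows = []
for _ in range(120):
    x = (data_rng.random(), data_rng.random(), data_rng.random())
    rows.append((x, 10 * x[0] + 5 * x[1] + data_rng.gauss(0, 0.5)))

forest = fit_forest(rows)
pred_high = predict_forest(forest, (0.9, 0.9, 0.5))
pred_low = predict_forest(forest, (0.1, 0.1, 0.5))
print(f"high: {pred_high:.2f}, low: {pred_low:.2f}")
```

The ensemble returns only an averaged point prediction for each input, so a measure of uncertainty around that prediction has to be obtained separately.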

Though there have been numerous studies describing and using random forests, there is a lack of information regarding

John W. Coulston is with the USDA Forest Service, Southern Research Station, Blacksburg, VA (jcoulston@fs.fed.us).

Christine E. Blinn, Valerie A. Thomas, and Randolph H. Wynne are with Virginia Polytechnic Institute and State University, Department of Forest Resources and Environmental Conservation, Blacksburg, VA.

Photogrammetric Engineering & Remote Sensing

Vol. 82, No. 3, March 2016, pp. 189–197.

0099-1112/16/189–197

© 2016 American Society for Photogrammetry and Remote Sensing

doi: 10.14358/PERS.82.3.189
