Climate Change Data Portal
DOI | 10.1007/s10661-017-6025-0 |
Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology | |
Fox, Eric W.1; Hill, Ryan A.2; Leibowitz, Scott G.1; Olsen, Anthony R.1; Thornbrugh, Darren J.2,3; Weber, Marc H.1 | |
Publication date | 2017-07-01 |
ISSN | 0167-6369 |
Volume | 189; Issue: 7 |
Abstract | Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used or stepwise procedures are employed which iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets. |
Keywords | Random forest modeling; Variable selection; Model selection bias; National rivers and streams assessment; StreamCat dataset; Benthic macroinvertebrates |
Language | English |
WOS accession number | WOS:000404652900013 |
Source journal | ENVIRONMENTAL MONITORING AND ASSESSMENT |
Source institution | US Environmental Protection Agency |
Document type | Journal article |
Item identifier | http://gcip.llas.ac.cn/handle/2XKMVOVA/57721 |
Author affiliations | 1.US EPA, Natl Hlth & Environm Effects Res Lab, Western Ecol Div, 200 SW 35th St, Corvallis, OR 97333 USA; 2.US EPA, Natl Hlth & Environm Effects Res Lab, Western Ecol Div, Oak Ridge Inst Sci & Educ ORISE Postdoctoral Part, 200 SW 35th St, Corvallis, OR 97333 USA; 3.Northern Great Plains Network, Natl Pk Serv, 231 East St Joseph St, Rapid City, SD 55701 USA |
Recommended citation (GB/T 7714) | Fox, Eric W., Hill, Ryan A., Leibowitz, Scott G., et al. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology[J]. US Environmental Protection Agency, 2017, 189(7). |
APA | Fox, Eric W., Hill, Ryan A., Leibowitz, Scott G., Olsen, Anthony R., Thornbrugh, Darren J., & Weber, Marc H. (2017). Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. ENVIRONMENTAL MONITORING AND ASSESSMENT, 189(7). |
MLA | Fox, Eric W., et al. "Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology." ENVIRONMENTAL MONITORING AND ASSESSMENT 189.7 (2017). |
Files in this item | There are no files associated with this item. |