262-10 Influence of Validation Sample Selection In Ecological Modeling.



Tuesday, October 18, 2011: 3:45 PM
Henry Gonzalez Convention Center, Room 211, Concourse Level

Nichola M. Knox(1), Sabine Grunwald(1), Pasicha Chaikaew(1) and David B. Myers(2)
(1) Soil and Water Science Department, University of Florida, Gainesville, FL
(2) Cropping Systems and Water Quality Unit, USDA-ARS, Columbia, MO
Spatially explicit models that capture the distribution of ecological properties (feature space) and their spatial patterns (spatial space) make it possible to address many critical questions in mixed-use agricultural, forest, urban, and wetland ecosystems. Ideally, the quality and uncertainty of ecological model predictions are evaluated with an independent validation set. In practice, the criteria used to select an independent validation set may introduce bias and may even lead to incorrect conclusions. This study investigated how the criteria used to select validation sets behave on large ecological datasets (n = 1000 randomly selected samples). We used three ecological datasets (solar radiation, vegetation, and soil), generated over the state of Florida, to investigate the effect of the size and the distribution (in both spatial and feature space) of validation datasets on the outcomes (RMSE, R², and RPD) of spatial model validation in large-scale studies. These effects were tested with respect to the different population distributions (bimodal, normal, and skewed) displayed by the ecological datasets. The effects of validation set selection on spatially explicit relationships were evaluated using models developed with kriging (ordinary or universal, depending on the properties of the dataset). As the validation dataset was reduced in size from 30% to 5% of the overall dataset, the variation in the validation outcomes increased. Unexpectedly, selecting a validation dataset with a distribution statistically similar to the calibration/overall distribution did not ensure that the validation results (RMSE, R², and RPD) were significantly lower than those from a validation dataset that differed significantly from the calibration/overall distribution.
Our findings indicate that, in large-scale ecological studies, irrespective of the population distribution, a validation dataset should contain 20% to 30% of the total original dataset and should be spread as widely as possible across both spatial and feature space.
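
The size effect reported above can be illustrated with a minimal Python sketch. It is not the authors' workflow: no kriging is performed, the observed and predicted values are synthetic stand-ins, and the function names are hypothetical. The abstract does not define its metrics, so the sketch assumes R² is the coefficient of determination and RPD is the standard deviation of the observations divided by the RMSE. It repeatedly draws random validation subsets of different sizes from 1000 samples and reports how much the validation RMSE varies between draws.

```python
import math
import random
import statistics


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rpd(obs, pred):
    """Ratio of performance to deviation: SD of observations / RMSE (assumed)."""
    return statistics.stdev(obs) / rmse(obs, pred)


def rmse_spread(obs, pred, fraction, n_repeats=200, seed=0):
    """SD of validation RMSE over repeated random validation subsets."""
    rng = random.Random(seed)
    n_val = max(2, round(fraction * len(obs)))
    scores = []
    for _ in range(n_repeats):
        # Draw a validation subset without replacement.
        idx = rng.sample(range(len(obs)), n_val)
        scores.append(rmse([obs[i] for i in idx], [pred[i] for i in idx]))
    return statistics.stdev(scores)


# Synthetic stand-in for a model's output: observations plus prediction error.
data_rng = random.Random(42)
observed = [data_rng.gauss(10.0, 2.0) for _ in range(1000)]
predicted = [o + data_rng.gauss(0.0, 1.0) for o in observed]

spread = {f: rmse_spread(observed, predicted, f) for f in (0.05, 0.10, 0.20, 0.30)}
for fraction, sd in spread.items():
    print(f"validation fraction {fraction:.0%}: SD of RMSE across draws = {sd:.4f}")
```

Purely random subsets are used here for simplicity. Testing the abstract's second recommendation would require selection schemes that control the spread of the validation set across spatial and feature space, which this sketch does not attempt.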
See more from this Division: S05 Pedology
See more from this Session: Spatial Predictions In Soils, Crops and Agro/Forest/Urban/Wetland Ecosystems: II (Includes Graduate Student Competition)