An intro to species distribution modeling for an iconic California woodland species.
cover image courtsey of California Academy of Sciences, 2016
In this post I’ll walk through the steps of species distribution modeling (SDM) using the MaxEnt (Maximum Entropy (Elith 2011) prediction approach. This is a machine learning (algorithm) approach to SDM that uses known occurrence points combined with explanatory variables (i.e. environmental data) to the estimate presence across a study area. MaxEnt is probably the most commonly used species distribution model since it performs well with few input data points, only requires presence points (and samples background for comparison) and is easy to use with a Java graphical user interface (GUI). Check out the rSpatial Species Distribution Modeling website for more background on SDM with R.
SDMs are used to make inferences about the distribution of suitable habitat for a species of interest. There are many SDM approaches out there all with their own pros and cons. Check out this table comparing different SDM techniches. While there are many limitations of SDMs, their ability to gain ecological insight and predict species distribution across a landscape is valuable.
For this exercise we’ll use the blue oak (Quercus douglasii) tree. Blue oaks are drought and flood tolerant trees endemic to California (CNPS, 2022). They occur with a narrow range along the valleys and slopes of the mountain ranges that surround the Central Valley in California (Calflora, 2022).
The steps of this exercise I’ll walk through include; acquiring species occurrence data, gathering underlying climate data, making MaxEnt predictions, and model calibration. The libraries we’ll use to the specific steps include:
spocc
to collect species occurrence datasdmpredictors
to gather climate raster datadismo
to run and evalutate MaxEnt Modelusdm
to calibrate our model 5. rJava
Java GUIIf you are having problems installing the rJava
package like so:
Here are some troubleshooting options for installation on a macOS Monterey with an M1 architecture (ymmv).
Here are the steps I took to get rJava
working:
rJava
with remove.packages()
.R CMD javareconf
in the terminal. This reconfigures JAVA_HOME directories.rJava
in Rstudio and voila.We’ll set up our file structure here.
Let’s download our species data. We’ll limit it to the Global Biodiversity Information Facility (GBIF) data library using the spocc::occ
function. Other data repositories are available (iNaturalist, eBird, Bison, iDigBio), but a lot of that data eventually ends up in GBIF. We’ll filter our results for CA only observations (native range) and remove “PRESERVED_SPECIMEN” which are Herbaria records. in CA To do this we’ll We then take the obs_csv dataframe and covert it to a spatial object using our coordinate data.
The function file.exists()
returns a logical vector. I’ll use that notation to avoid re-downloading/re-creating files each time the document is compiled.
We’ll use the two terrestrial datasets, “WorldClim” and ENVIREM”. Let’s take a look at those.
There are 86 layers to choose from. Not all of them are needed so let’s select a few layers that are most relevant. We’ll use:
Layer Code | name | description |
---|---|---|
WC_alt | Altitude | Altitude |
WC_bio1 | Annual mean temp | Annual mean temp |
WC_bio2 | Mean diurnal temp range | Mean of the monthly (max temp - min temp) |
WC_bio6 | Minimum temp | Minimum temp of the coldest month |
WC_bio12 | Annual precipitation | Annual precipitation |
ER_tri | Terrain roughness index | Terrain roughness index |
ER_topoWet | Topographic wetness | SAGA-GIS topographic wetness index |
With our datasets selected, let’s download them and create our raster stack.
These raster layers are currently global. Let’s crop the rasters to a reasonable study area. There are two primary options:
sf
package (sf::st_convex_hull(st_union())
)We’ll use this second option grabbing boundaries with the USAboundaries
package.
Now let’s clip our environmental layers to this bounding box.
While pseudo-absence points aren’t needed to run the MaxEnt model, they are useful if you want to evaluate model performance. Let’s generate these points now and use them a couple steps from now.
We prepare the data for MaxEnt modeling by extracting the raster values from our raster stack using the presence-only observation points. We start by creating a simple dataframe and then use the raster::extract()
function to extract from our raster stack.
The table above (where present=1) is the input for our MaxEnt model. This table feeds into our SDM (y ~ X), where: y is the present column with values of 1 (present) and X is all other columns: WC_alt, WC_bio1, WC_bio2, WC_bio6, WC_bio12, ER_tri, ER_topoWet, lon, lat
The MaxEnt approach is said to feel like a black box. That’s primarily because it is difficult to compare outputs with other model-based approaches. With MaxEnt, the output provides an environmental suitability rather than a probability of occurrence. MaxEnt works by iteratively building multiple models using two primary input components:
MaxEnt first calculates multiple probability densities using the environmental inputs above. This describes the relative likelihood of all environmental variables in the model. Then, the distribution across each environmental layer is calculated. MaxEnt chooses the distribution that maximizes similarity between the environmental characteristics of the total available environments and those of the locations where the species is known to be present.
To run the model, it’s quite simple. We’ll use the dismo::maxent()
function.
This is MaxEnt version 3.4.3
There are a couple plots that help interpret and evaluate what’s going on under the hood: term plots, variable contribution plots, and response curves. Let’s take a look at all three.
Before looking at our model output, we can look at how obviously differentiated the presence verses absence points for each predictor is. While MaxEnt doesn’t use absence points, we can use them to evaluate our model nonetheless. A plot for a specific predictor and response is called a “term plot”. Here we are looking for predictors where the presence (present=1) occupies a distinct “niche” from the background absence points (present=0). A more pronounced presence peak generally makes for a more confident model.
Some of the predictor variables, where presence=1, with a more pronounced peak include lon, WC_alt, WC_bio1, WC_bio12, and WC_bio6.
The Variable Contribution plot tells you the overall percentage each environmental predictor contribute to the model. We’ll use the raster::plot()
function for that.
We see the top three predictors WC_bio12, WC_bio6, and WC_bio1 contribute approximately 90%.
Response curves show how much each environmental variable affects the MaxEnt prediction and how that prediction changes as each variable is varied. The default in the dismo::response
package is to vary one variable while setting all other environmental variables to their mean. In other words, the curves show the marginal effect of changing exactly one variable, holding all else constant.
Let’s make raw predictions of for species location
While these raw results are a good start, it’s important to calibrate a model. It is critical to check for multicollinearity between independent variables in your model. When you have multiple independent variables (in our case, our X values) that have high intercorrelation, this can lead to skewed or misleading results. This essentially widens your confidence intervals to produce less reliable results.
We first use a pairs plot to identify pairwise correlations between variables to identify pairs or groups of variables that are highly correlated. You can see below we have some strongly correlated variables. For our model, we’ll set our pairwise correlation threshold at 0.7.
We can detect multicollinearity (strong correlation between two or more independent variables) with a Variance Inflaction Factor (VIF) with the usdm::vif()
function. A value with a VIF greater than 10 indicates that our model has a collinearity problem. We can reduce
We see that WC_alt, WC_bio1, and WC_bio6 all have VIF greater than 10. Let’s remove those from our raster stack. We’ll use vifcor()
to to exclude those variables using a threshold correlation of 0.7.
3 variables from the 7 input variables have collinearity problem:
ER_topoWet WC_bio6 WC_bio1
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( WC_bio2 ~ WC_alt ): -0.1452815
max correlation ( ER_tri ~ WC_alt ): 0.6134207
---------- VIFs of the remained variables --------
Variables VIF
1 WC_alt 1.650535
2 WC_bio2 1.145421
3 WC_bio12 1.483896
4 ER_tri 2.192877
Let’s re-compute our model.
With our previous variable contribution plot the top two variables our top WC_bio12, WC_bio6, and WC_bio1 approximately 90%.
For attribution, please cite this work as
Cognac (2022, Aug. 4). Steven Cognac: Species Distribution Modeling: Blue Oak (Quercus douglasii). Retrieved from https://github.com/cognack/2022-08-04-speciesdistmodeling/
BibTeX citation
@misc{cognac2022species, author = {Cognac, Steven}, title = {Steven Cognac: Species Distribution Modeling: Blue Oak (Quercus douglasii)}, url = {https://github.com/cognack/2022-08-04-speciesdistmodeling/}, year = {2022} }