Steven Cognac: Species Distribution Modeling: Blue Oak (Quercus douglasii)

Steven Cognac

cover image courtsey of California Academy of Sciences, 2016

Species Distribution Modeling using MaxEnt

In this post I’ll walk through the steps of species distribution modeling (SDM) using the MaxEnt (Maximum Entropy (Elith 2011) prediction approach. This is a machine learning (algorithm) approach to SDM that uses known occurrence points combined with explanatory variables (i.e. environmental data) to the estimate presence across a study area. MaxEnt is probably the most commonly used species distribution model since it performs well with few input data points, only requires presence points (and samples background for comparison) and is easy to use with a Java graphical user interface (GUI). Check out the rSpatial Species Distribution Modeling website for more background on SDM with R.

SDMs are used to make inferences about the distribution of suitable habitat for a species of interest. There are many SDM approaches out there all with their own pros and cons. Check out this table comparing different SDM techniches. While there are many limitations of SDMs, their ability to gain ecological insight and predict species distribution across a landscape is valuable.

Quercus douglasii; Blue Oak. 2018 Gary McDonald — *Quercus douglasii*; Blue Oak. 2018 Gary McDonald

For this exercise we’ll use the blue oak (Quercus douglasii) tree. Blue oaks are drought and flood tolerant trees endemic to California (CNPS, 2022). They occur with a narrow range along the valleys and slopes of the mountain ranges that surround the Central Valley in California (Calflora, 2022).

The steps of this exercise I’ll walk through include; acquiring species occurrence data, gathering underlying climate data, making MaxEnt predictions, and model calibration. The libraries we’ll use to the specific steps include:

spocc to collect species occurrence data
sdmpredictors to gather climate raster data
dismo to run and evalutate MaxEnt Model
usdm to calibrate our model 5. rJava Java GUI

Installing Packages

If you are having problems installing the rJava package like so:

Here are some troubleshooting options for installation on a macOS Monterey with an M1 architecture (ymmv).

Stack overflow post. This got me 99% of the way there. I had to uninstall the latest JDK (19) and downgrade to Java 8. I have detailed steps below on this.
Check out the R installation guide for Java Support
Helpful background on Java and MacOS Monterey - Wolf Paulus’s blog.
Here’s another helpful run through on [ Zhiyzuo’s blog.

Here are the steps I took to get rJava working:

Uninstall rJava with remove.packages().
Download Java 8 (LTS) x86_64-bit JDK for MacOS from Azul Core Platform .
Run R CMD javareconf in the terminal. This reconfigures JAVA_HOME directories.
Re-install rJava in Rstudio and voila.

Collecting Species Occurrence Data

We’ll set up our file structure here.

Let’s download our species data. We’ll limit it to the Global Biodiversity Information Facility (GBIF) data library using the spocc::occ function. Other data repositories are available (iNaturalist, eBird, Bison, iDigBio), but a lot of that data eventually ends up in GBIF. We’ll filter our results for CA only observations (native range) and remove “PRESERVED_SPECIMEN” which are Herbaria records. in CA To do this we’ll We then take the obs_csv dataframe and covert it to a spatial object using our coordinate data.

The function file.exists() returns a logical vector. I’ll use that notation to avoid re-downloading/re-creating files each time the document is compiled.

Gathering Climate Data

Select Datasets

We’ll use the two terrestrial datasets, “WorldClim” and ENVIREM”. Let’s take a look at those.

There are 86 layers to choose from. Not all of them are needed so let’s select a few layers that are most relevant. We’ll use:

Layer Code	name	description
WC_alt	Altitude	Altitude
WC_bio1	Annual mean temp	Annual mean temp
WC_bio2	Mean diurnal temp range	Mean of the monthly (max temp - min temp)
WC_bio6	Minimum temp	Minimum temp of the coldest month
WC_bio12	Annual precipitation	Annual precipitation
ER_tri	Terrain roughness index	Terrain roughness index
ER_topoWet	Topographic wetness	SAGA-GIS topographic wetness index

Download Data

With our datasets selected, let’s download them and create our raster stack.

Crop Raster Stack

These raster layers are currently global. Let’s crop the rasters to a reasonable study area. There are two primary options:

create a convex hull around our species observations using the sf package (sf::st_convex_hull(st_union()))
use the boundary of California

We’ll use this second option grabbing boundaries with the USAboundaries package.

Now let’s clip our environmental layers to this bounding box.

Generate Pseudo-Absence Points

While pseudo-absence points aren’t needed to run the MaxEnt model, they are useful if you want to evaluate model performance. Let’s generate these points now and use them a couple steps from now.

Extract Raster values from Observations

We prepare the data for MaxEnt modeling by extracting the raster values from our raster stack using the presence-only observation points. We start by creating a simple dataframe and then use the raster::extract() function to extract from our raster stack.

The table above (where present=1) is the input for our MaxEnt model. This table feeds into our SDM (y ~ X), where: y is the present column with values of 1 (present) and X is all other columns: WC_alt, WC_bio1, WC_bio2, WC_bio6, WC_bio12, ER_tri, ER_topoWet, lon, lat

Running a MaxEnt (Maximum Entropy) Model

The MaxEnt approach is said to feel like a black box. That’s primarily because it is difficult to compare outputs with other model-based approaches. With MaxEnt, the output provides an environmental suitability rather than a probability of occurrence. MaxEnt works by iteratively building multiple models using two primary input components:

Entropy: our presence points. A model is calibrated to find the distribution that is most spread out, or closest to a uniform distribution throughout the study region
Constraints: the environmental variables (raster stack) that constrain the predicted distribution. These rules are based the environmental variables where the species was observed, present=1.

MaxEnt first calculates multiple probability densities using the environmental inputs above. This describes the relative likelihood of all environmental variables in the model. Then, the distribution across each environmental layer is calculated. MaxEnt chooses the distribution that maximizes similarity between the environmental characteristics of the total available environments and those of the locations where the species is known to be present.

To run the model, it’s quite simple. We’ll use the dismo::maxent() function.

This is MaxEnt version 3.4.3

Model Evaluation

There are a couple plots that help interpret and evaluate what’s going on under the hood: term plots, variable contribution plots, and response curves. Let’s take a look at all three.

Term Plots

Before looking at our model output, we can look at how obviously differentiated the presence verses absence points for each predictor is. While MaxEnt doesn’t use absence points, we can use them to evaluate our model nonetheless. A plot for a specific predictor and response is called a “term plot”. Here we are looking for predictors where the presence (present=1) occupies a distinct “niche” from the background absence points (present=0). A more pronounced presence peak generally makes for a more confident model.

Some of the predictor variables, where presence=1, with a more pronounced peak include lon, WC_alt, WC_bio1, WC_bio12, and WC_bio6.

Variable Contribution Plot

The Variable Contribution plot tells you the overall percentage each environmental predictor contribute to the model. We’ll use the raster::plot() function for that.

We see the top three predictors WC_bio12, WC_bio6, and WC_bio1 contribute approximately 90%.

Response Curves

Response curves show how much each environmental variable affects the MaxEnt prediction and how that prediction changes as each variable is varied. The default in the dismo::response package is to vary one variable while setting all other environmental variables to their mean. In other words, the curves show the marginal effect of changing exactly one variable, holding all else constant.

Let’s make raw predictions of for species location

Generate Raw Predictions

Model Calibration

While these raw results are a good start, it’s important to calibrate a model. It is critical to check for multicollinearity between independent variables in your model. When you have multiple independent variables (in our case, our X values) that have high intercorrelation, this can lead to skewed or misleading results. This essentially widens your confidence intervals to produce less reliable results.

We first use a pairs plot to identify pairwise correlations between variables to identify pairs or groups of variables that are highly correlated. You can see below we have some strongly correlated variables. For our model, we’ll set our pairwise correlation threshold at 0.7.

We can detect multicollinearity (strong correlation between two or more independent variables) with a Variance Inflaction Factor (VIF) with the usdm::vif() function. A value with a VIF greater than 10 indicates that our model has a collinearity problem. We can reduce

We see that WC_alt, WC_bio1, and WC_bio6 all have VIF greater than 10. Let’s remove those from our raster stack. We’ll use vifcor() to to exclude those variables using a threshold correlation of 0.7.

3 variables from the 7 input variables have collinearity problem: 
 
ER_topoWet WC_bio6 WC_bio1 

After excluding the collinear variables, the linear correlation coefficients ranges between: 
min correlation ( WC_bio2 ~ WC_alt ):  -0.1452815 
max correlation ( ER_tri ~ WC_alt ):  0.6134207 

---------- VIFs of the remained variables -------- 
  Variables      VIF
1    WC_alt 1.650535
2   WC_bio2 1.145421
3  WC_bio12 1.483896
4    ER_tri 2.192877

Let’s re-compute our model.

With our previous variable contribution plot the top two variables our top WC_bio12, WC_bio6, and WC_bio1 approximately 90%.

Species Distribution Modeling: Blue Oak (Quercus douglasii)