Space (doesn’t have to be) the final frontier: Some Tips in Accounting for Spatial Dependence within Your Regression-Type Models in R

Elliot Dovers

School of Mathematics and Statistics

So you’ve collected some spatial data

  pres   x    y       rain soil
1    0 447 6872  0.3567818 fair
2    1 448 6872  0.1267776 fair
3    0 449 6871  0.1267776 fair
4    0 446 6869  0.3567818 good
5    0 447 6869 -0.5445859 good
6    0 445 6868 -0.2026878 good

So you’ve collected some spatial data

  pres   x    y       rain soil
1    0 447 6872  0.3567818 fair
2    1 448 6872  0.1267776 fair
3    0 449 6871  0.1267776 fair
4    0 446 6869  0.3567818 good
5    0 447 6869 -0.5445859 good
6    0 445 6868 -0.2026878 good

So you’ve collected some spatial data

  pres   x    y       rain soil
1    0 447 6872  0.3567818 fair
2    1 448 6872  0.1267776 fair
3    0 449 6871  0.1267776 fair
4    0 446 6869  0.3567818 good
5    0 447 6869 -0.5445859 good
6    0 445 6868 -0.2026878 good

     CRIME NEIG  HOVAL     OPEN
1 15.72598    5 80.467 0.217155
2 18.80175    1 44.567 0.320581
3 30.62678    6 26.350 0.374404
4 32.38776    2 33.200 1.186944
5 50.73151    7 23.225 0.624596
6 26.06666    8 28.750 0.254130
Columbus, Ohio USA

So you’ve collected some spatial data

  pres   x    y       rain soil
1    0 447 6872  0.3567818 fair
2    1 448 6872  0.1267776 fair
3    0 449 6871  0.1267776 fair
4    0 446 6869  0.3567818 good
5    0 447 6869 -0.5445859 good
6    0 445 6868 -0.2026878 good

     CRIME NEIG  HOVAL     OPEN
1 15.72598    5 80.467 0.217155
2 18.80175    1 44.567 0.320581
3 30.62678    6 26.350 0.374404
4 32.38776    2 33.200 1.186944
5 50.73151    7 23.225 0.624596
6 26.06666    8 28.750 0.254130

  TRUMP    REGION      POP      AREA
1    55     South  4712651 133709.27
2    13      West  6246816 295281.25
3    12      West  4887061 269573.06
4     4 Northeast  3545837  12976.59
5    54     South 18511620 151052.01
6    56     South  9468815 152725.21

We can use regression-style models… are they “spatial”?

m1 = glm(pres ~ rain + soil,
  data = tree,
  family = binomial)

predict(m1, newdata = ne_nsw)

m2 = lm(CRIME ~ HOVAL + OPEN,
  data = crime)

predict(m2)

m3 = glmmTMB(TRUMP ~ POP +
  AREA + (1|REGION),
  data = usa,
  family = poisson)

predict(m3)

We can use regression-style models… are they “spatial”?

m1 = glm(pres ~ rain + soil,
  data = tree,
  family = binomial)

predict(m1, newdata = ne_nsw)

m2 = lm(CRIME ~ HOVAL + OPEN,
  data = crime)

predict(m2)

m3 = glmmTMB(TRUMP ~ POP +
  AREA + (1|REGION),
  data = usa,
  family = poisson)

predict(m3)
  • a spatial model incorporates proximity of observational units into the model
  • leverages the fact that observations “closer” together tend to be more similar

So why does including spatial information matter?

  • Regression-style models relate response variable (V) to predictor variables
    • Linear models: V is continuous variable, e.g. lm()
    • Generalised linear models: V can be binary or discrete, e.g. glm()
    • Mixed models: GLM with random effects, e.g. glmmTMB()
  • Common assumption: data are independent conditional on predictor variables
  • Ignoring spatial dependence breaks this assumption
    • incorrect standard errors (usually smaller)
  • Accounting for dependence gives improved confidence intervals

How can you tell if there is spatial dependence?

  • Investigate the residuals! Options:
    • plot residuals spatially: ad-hoc visualisation may reveal broad trends
    • calculate Moran’s I statistic (formal test of spatial autocorrelation)
    • correlogram: plot correlation between residuals at various distances/proximity

Continuous Spatial Units

Discrete Spatial Units

How can you tell if there is spatial dependence?

  • Investigate the residuals! Options:
    • plot residuals spatially: ad-hoc visualisation may reveal broad trends
    • calculate Moran’s I statistic (formal test of spatial autocorrelation)
    • correlogram: plot correlation between residuals at various distances/proximity

Continuous Spatial Units

m1 = glm(pres ~ rain + soil, data = tree,
  family = binomial)

library(statmod)
res1 = qresiduals(m1)

library(ncf)
cl1 = spline.correlog(x = tree$x, y = tree$y,
                      z = res1)

Discrete Spatial Units

library(sf)
crime = st_read("shapes/columbus.shp")

m2 = lm(CRIME ~ HOVAL + OPEN, data = crime)
res2 = m2$residuals

library(spdep)
crime.nb = poly2nb(crime)
names(crime.nb) = crime$NEIG

cl2 = sp.correlogram(crime.nb, res2,
                     order=8, method="I")

So is there spatial dependence?

Tree data: plot(cl1)

Crime data: plot(cl2)

Trump data: plot(cl3)

One solution to unaccounted for spatial dependence:

  • like an unmeasured predictor variable over the 2D space
  • informed by the data like a random effect

Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response



library(mgcv)
m <- gam(response ~ s(x, bs = "cr", k = 5))



Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response



library(mgcv)
m <- gam(response ~ s(x, bs = "tp", k = 5))



Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response



library(mgcv)
m <- gam(response ~ s(x, bs = "gp", k = 5))



Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response



library(mgcv)
m <- gam(response ~ s(x, bs = "gp", k = 7))



Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response



library(mgcv)
m <- gam(response ~ s(x, bs = "gp", k = 10))



Fields are easy in a generalised additve model (GAM)

  • GAMs similar to GLMs but allow wiggly predictor terms
    • s() called a “smooth”, usually on predictors
    • non-linear relationships with response
  • When applied to your spatial domain these are random (unmeasured) fields
  • just need to select the right basis for your spatial units



library(mgcv)
m <- gam(response ~ s(x))



Many other options for spatial models exist

  • R-INLA
  • glmmTMB has spatial covariance structures: exp, gau, mat
  • spatialreg
  • brms has specific functions for spatial autoregressive models
  • and more
  • Using latent fields within GAMs:
    • scales well for large data (control with k)
    • requires little more know-how than any regression-style model

Example 1: continuous spatial units

  • Presence/absence of a tree species (Eucalyptus fastigata) at sites across north-eastern NSW

  • We want to use x and y coordinates to fit a latent field and account for spatial dependence
  • In this case: a Gaussian random field

Example 1: continuous spatial units

  • Recall the non-spatial model:
m1 = glm(pres ~ rain + soil,
  data = tree, family = binomial)

cl1 <- spline.correlog(x = tree$x, y = tree$y,
                       z = qresiduals(m1))
plot(cl1)


Example 1: continuous spatial units

  • Recall the non-spatial model:
m1 = glm(pres ~ rain + soil,
  data = tree, family = binomial)

cl1 <- spline.correlog(x = tree$x, y = tree$y,
                       z = qresiduals(m1))
plot(cl1)


library(mgcv)

sm1 = gam(pres ~ rain + soil +
            s(x, y, bs = "gp", k = 100),
          data = tree,  family = binomial)

scl1 <- spline.correlog(x = tree$x, y = tree$y,
                        z = qresiduals(sm1))
plot(scl1)

Example 1: continuous spatial units

  • Recall the non-spatial model:
m1 = glm(pres ~ rain + soil,
  data = tree, family = binomial)

cl1 <- spline.correlog(x = tree$x, y = tree$y,
                       z = qresiduals(m1))
plot(cl1)


library(mgcv)

sm1 = gam(pres ~ rain + soil +
            s(x, y, bs = "gp", k = 100),
          data = tree,  family = binomial)

scl1 <- spline.correlog(x = tree$x, y = tree$y,
                        z = qresiduals(sm1))
plot(scl1)


Example 1: continuous spatial units

  • Recall the non-spatial model:
m1 = glm(pres ~ rain + soil,
  data = tree, family = binomial)

confint.default(m1,
  parm = c("rain", "soilfair", "soilfgood"))
library(mgcv)

sm1 = gam(pres ~ rain + soil +
            s(x, y, bs = "gp", k = 100),
          data = tree,  family = binomial)

confint.default(sm1,
  parm = c("rain", "soilfair", "soilfgood"))

Example 1: continuous spatial units

Example 2: discrete spatial units

  • Crime in neighbourhoods of Columbus Ohio, USA.

  • We want to use NEIG (neighbourhood) to fit a latent field and account for spatial dependence
  • In this case: a Markov random field

Example 2: discrete spatial units

  • Recall the non-spatial model:
m2 <- lm(CRIME ~ HOVAL + OPEN, data = crime)

cl2 <- sp.correlogram(crime.nb, m2$residuals,
                      order=8, method="I")
plot(cl2)


Example 2: discrete spatial units

  • Recall the non-spatial model:
m2 <- lm(CRIME ~ HOVAL + OPEN, data = crime)

cl2 <- sp.correlogram(crime.nb, m2$residuals,
                      order=8, method="I")
plot(cl2)


library(mgcv)
sm2 <- gam(CRIME ~ HOVAL + OPEN +
  s(NEIG, bs = 'mrf', xt = list(nb = crime.nb)),
  data = crime, method = 'REML')

scl2 <- sp.correlogram(crime.nb, sm2$residuals,
                       order=8, method="I")
plot(scl2)

Example 2: discrete spatial units

  • Recall the non-spatial model:
m2 <- lm(CRIME ~ HOVAL + OPEN, data = crime)

cl2 <- sp.correlogram(crime.nb, m2$residuals,
                      order=8, method="I")
plot(cl2)


library(mgcv)
sm2 <- gam(CRIME ~ HOVAL + OPEN +
  s(NEIG, bs = 'mrf', xt = list(nb = crime.nb)),
  data = crime, method = 'REML')

scl2 <- sp.correlogram(crime.nb, sm2$residuals,
                       order=8, method="I")
plot(scl2)


Example 2: discrete spatial units

Final remarks

  • Fields in GAMs: easy way to account for spatial dependence in regression-style models
    • see Gordana’s seminar on dependence in general
  • Be sure to check other model assumptions
  • Do you have enough (continuous) spatial data to fit a field?
  • Consider how you are including measured predictors
    • check out Nickson’s seminar on model selection
  • Consult with Stats Central if needed!

Thanks and good luck with your adventures in space!