Hands-on Exercise 7: Visualising and Analysing Geographic Data

Author

Michael Djohan

Published

February 23, 2023

Modified

March 9, 2023

1. Install and launching R packages

The code chunk below uses p_load() of pacman package to check if packages are installed in the computer. If they are, then they will be launched into R. The R packages installed are:

  • sf for handling geospatial data
  • tmap to plot functional and truthful choropleth map
pacman::p_load(sf, tmap, tidyverse)

2. Plotting Choropleth Map

Choropleth mapping involves the symbolisation of enumeration units, such as countries, provinces, states, counties or census units, using area patterns or graduated colors. For example, a social scientist may need to use a choropleth map to portray the spatial distribution of aged population of Singapore by Master Plan 2014 Subzone Boundary.

2.1 Importing the data

Two data set will be used to create the choropleth map. They are:

  • Master Plan 2014 Subzone Boundary (Web) (i.e. MP14_SUBZONE_WEB_PL) in ESRI shapefile format. It can be downloaded at data.gov.sg This is a geospatial data. It consists of the geographical boundary of Singapore at the planning subzone level. The data is based on URA Master Plan 2014.

  • Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 in csv format (i.e. respopagesextod2011to2020.csv). This is an aspatial data fie. It can be downloaded at Department of Statistics, Singapore Although it does not contain any coordinates values, but it’s PA and SZ fields can be used as unique identifiers to geocode to MP14_SUBZONE_WEB_PL shapefile.

mpsz <-  sf::st_read(dsn = "data/geospatial",
                     layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\michaeldjo\ISSS608-VAA\Hands-on_Ex\Hands-on_Ex07\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
mpsz
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
First 10 features:
   OBJECTID SUBZONE_NO       SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
1         1          1    MARINA SOUTH    MSSZ01      Y    MARINA SOUTH
2         2          1    PEARL'S HILL    OTSZ01      Y          OUTRAM
3         3          3       BOAT QUAY    SRSZ03      Y SINGAPORE RIVER
4         4          8  HENDERSON HILL    BMSZ08      N     BUKIT MERAH
5         5          3         REDHILL    BMSZ03      N     BUKIT MERAH
6         6          7  ALEXANDRA HILL    BMSZ07      N     BUKIT MERAH
7         7          9   BUKIT HO SWEE    BMSZ09      N     BUKIT MERAH
8         8          2     CLARKE QUAY    SRSZ02      Y SINGAPORE RIVER
9         9         13 PASIR PANJANG 1    QTSZ13      N      QUEENSTOWN
10       10          7       QUEENSWAY    QTSZ07      N      QUEENSTOWN
   PLN_AREA_C       REGION_N REGION_C          INC_CRC FMEL_UPD_D   X_ADDR
1          MS CENTRAL REGION       CR 5ED7EB253F99252E 2014-12-05 31595.84
2          OT CENTRAL REGION       CR 8C7149B9EB32EEFC 2014-12-05 28679.06
3          SR CENTRAL REGION       CR C35FEFF02B13E0E5 2014-12-05 29654.96
4          BM CENTRAL REGION       CR 3775D82C5DDBEFBD 2014-12-05 26782.83
5          BM CENTRAL REGION       CR 85D9ABEF0A40678F 2014-12-05 26201.96
6          BM CENTRAL REGION       CR 9D286521EF5E3B59 2014-12-05 25358.82
7          BM CENTRAL REGION       CR 7839A8577144EFE2 2014-12-05 27680.06
8          SR CENTRAL REGION       CR 48661DC0FBA09F7A 2014-12-05 29253.21
9          QT CENTRAL REGION       CR 1F721290C421BFAB 2014-12-05 22077.34
10         QT CENTRAL REGION       CR 3580D2AFFBEE914C 2014-12-05 24168.31
     Y_ADDR SHAPE_Leng SHAPE_Area                       geometry
1  29220.19   5267.381  1630379.3 MULTIPOLYGON (((31495.56 30...
2  29782.05   3506.107   559816.2 MULTIPOLYGON (((29092.28 30...
3  29974.66   1740.926   160807.5 MULTIPOLYGON (((29932.33 29...
4  29933.77   3313.625   595428.9 MULTIPOLYGON (((27131.28 30...
5  30005.70   2825.594   387429.4 MULTIPOLYGON (((26451.03 30...
6  29991.38   4428.913  1030378.8 MULTIPOLYGON (((25899.7 297...
7  30230.86   3275.312   551732.0 MULTIPOLYGON (((27746.95 30...
8  30222.86   2208.619   290184.7 MULTIPOLYGON (((29351.26 29...
9  29893.78   6571.323  1084792.3 MULTIPOLYGON (((20996.49 30...
10 30104.18   3454.239   631644.3 MULTIPOLYGON (((24472.11 29...
popdata <- read_csv("data/aspatial/respopagesextod2011to2020.csv")

2.2 Data wrangling

Creating new variables called YOUNG, ECONOMY ACTIVE, and AGED

popdata2020 <- popdata %>%
  filter(Time == 2020) %>%
  group_by(PA, SZ, AG) %>%
  summarise(`POP` = sum(`Pop`)) %>%
  ungroup() %>%
  
  pivot_wider(names_from=AG, 
              values_from=POP) %>%
  
  mutate(YOUNG = rowSums(.[3:6])
         +rowSums(.[12])) %>%
  mutate(`ECONOMY ACTIVE` = rowSums(.[7:11])+rowSums(.[13:15]))%>%
  mutate(`AGED`=rowSums(.[16:21])) %>%
  mutate(`TOTAL`=rowSums(.[3:21])) %>%  
  mutate(`DEPENDENCY` = (`YOUNG` + `AGED`)/`ECONOMY ACTIVE`) %>%
  select(`PA`, `SZ`, `YOUNG`, 
       `ECONOMY ACTIVE`, `AGED`, 
       `TOTAL`, `DEPENDENCY`)

Before ioining the attribute and geospatial data, one extra step is required to convert the values in PA and SZ fields to uppercase. This is because the values of PA and SZ fields are made up of upper- and lowercase. On the other, hand the SUBZONE_N and PLN_AREA_N are in uppercase

popdata2020 <- popdata2020 |> 
  mutate_at(.vars = vars(PA, SZ),
            .funs = funs(toupper)) |> 
  
  filter(`ECONOMY ACTIVE` > 0)

Next, left_join() of dplyr is used to join the geographical data and attribute table using planning subzone name e.g. SUBZONE_N and SZ as the common identifier

mpsz_pop2020 <- left_join(mpsz, popdata2020,
                          by = c("SUBZONE_N" = "SZ"))
write_rds(mpsz_pop2020, "data/rds/mpszpop2020.rds")

2.3 Plotting the map

2.3.1 Using qtm()

Plotting standard choropleth map using qtm()

#tmap_mode() with “plot” option is used to produce a static map. For interactive mode, “view” option should be used.
tmap_mode("plot")

#fill argument is used to map the attribute (i.e. DEPENDENCY)
qtm(mpsz_pop2020, 
    fill = "DEPENDENCY")

2.3.2 Using tmap() elements

tmap provides a total ten data classification methods, namely: fixed, sd, equal, pretty (default), quantile, kmeans, hclust, bclust, fisher, and jenks. To specify the method, the style argument of tm_fill() or tm_polygons() should be used. We can also specify n argument to specify the number of classes (i.e., n = 5).

We can also specify the category breaks manually by specifying the breaks argument of the tm_fill(). It is important to note that, in tmap the breaks include a minimum and maximum. As a result, in order to end up with n categories, n+1 elements must be specified in the breaks option (the values must be in increasing order). Example would be breaks = c(0, 0.60, 0.70, 0.80, 0.90, 1.00)

#tm_shape defines the input data. We can use tm_polygons() to draw the planning subzone polygons
tm_shape(mpsz_pop2020)+
  
#We can assign variable to tm_polygons() like below
  #tm_polygons("DEPENDENCY") 
  
  #tm_fill() ONLY shadses the polygons by using a color scheme
  tm_fill("DEPENDENCY", 
          style = "quantile",
          
          #This is using colorbrewer palette. if We want to reverse the color order, we can add '-' in front of the palette (i.e., "-Blues")
          palette = "Blues",
          title = "Dependency ratio") +
  
  #set the layout
  tm_layout(main.title = "Distribution of Dependency Ratio by planning subzone",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE) +
  
  #tm_borders() draw the borders of the shapefil onto the choropleth map
  #lwd is border line width, col is border color, and lty is the line type
  tm_borders(lwd = 0.1, alpha = 0.5) +
  
  #add compass
  tm_compass(type="8star", size = 2) +
  
  #add scale bar
  tm_scale_bar(width = 0.15) +
  
  #add grid lines
  tm_grid(alpha =0.2) +
  
  #default style
  tmap_style("white") +
  
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authorithy (URA)\n and Population data from Department of Statistics DOS", 
             position = c("left", "bottom"))

Specifying map legends and style

tm_shape(mpsz_pop2020)+
  
  tm_fill("DEPENDENCY", 
          style = "quantile", 
          palette = "-Greens", 
          legend.hist = TRUE, 
          legend.is.portrait = TRUE,
          legend.hist.z = 0.1) +
  
  tm_layout(main.title = "Distribution of Dependency Ratio by planning subzone \n(Jenks classification)",
            main.title.position = "center",
            main.title.size = 1,
            legend.height = 0.45, 
            legend.width = 0.35,
            legend.outside = FALSE,
            legend.position = c("right", "bottom"),
            frame = FALSE) +
  
  tm_borders(alpha = 0.5) +
  
  tmap_style("classic")

2.3.3 Creating facet choropleth maps

2.3.3.1 By assigning multiple values to at least one of the aesthetic arguments
tm_shape(mpsz_pop2020)+
  
  #fill by both Young and Aged and specify ncols = 2
  tm_fill(c("YOUNG", "AGED"),
          style = c("equal", "quantile"), 
          palette = list("Blues", "Greens"),
          ncols = 2) +
  tm_layout(legend.position = c("right", "bottom")) +
  tm_borders(alpha = 0.5) +
  tmap_style("white")

2.3.3.2 By defining a group-by variable in tm_facets()
tm_shape(mpsz_pop2020) +
  tm_fill("DEPENDENCY",
          style = "quantile",
          palette = "Blues",
          thres.poly = 0) + 
  
  #use tm_facets() to facet by REGION_N
  tm_facets(by="REGION_N", 
            free.coords=TRUE, 
            drop.shapes=TRUE) +
  
  tm_layout(legend.show = FALSE,
            title.position = c("center", "center"), 
            title.size = 20) +
  
  tm_borders(alpha = 0.5)

2.3.3.3 By creating multiple stand-alone maps with tmap_arrange()
youngmap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("YOUNG", 
              style = "quantile", 
              palette = "Blues")

agedmap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("AGED", 
              style = "quantile", 
              palette = "Blues")

tmap_arrange(youngmap, agedmap, asp=1, ncol=2)

2.3.4 Mappping Spatial Object Meeting a Selection Criterion

Instead of creating small multiple choropleth map, you can also use selection funtion to map spatial objects meeting the selection criterion.

#filter by CENTRAL REGION only
tm_shape(mpsz_pop2020[mpsz_pop2020$REGION_N=="CENTRAL REGION", ])+
  tm_fill("DEPENDENCY", 
          style = "quantile", 
          palette = "Blues", 
          legend.hist = TRUE, 
          legend.is.portrait = TRUE,
          legend.hist.z = 0.1) +
  tm_layout(legend.outside = TRUE,
            legend.height = 0.45, 
            legend.width = 5.0,
            legend.position = c("right", "bottom"),
            frame = FALSE) +
  tm_borders(alpha = 0.5)

3. Visualising Geospatial Point Data

Proportional symbol maps (also known as graduate symbol maps) are a class of maps that use the visual variable of size to represent differences in the magnitude of a discrete, abruptly changing phenomenon, e.g. counts of people. Like choropleth maps, you can create classed or unclassed versions of these maps. The classed ones are known as range-graded or graduated symbols, and the unclassed are called proportional symbols, where the area of the symbols are proportional to the values of the attribute being mapped. In this hands-on exercise, you will learn how to create a proportional symbol map showing the number of wins by Singapore Pools’ outlets using an R package called tmap.

3.1 Importing the data

The data set use for this hands-on exercise is called SGPools_svy21. The data is in csv file format.

Figure below shows the first 15 records of SGPools_svy21.csv. It consists of seven columns. The XCOORD and YCOORD columns are the x-coordinates and y-coordinates of SingPools outlets and branches. They are in Singapore SVY21 Projected Coordinates System.

Notice that sgpools is aspatial (not based on Geographic Coordinates Systems)

sgpools <- read_csv("data/aspatial/SGPools_svy21.csv")
list(sgpools)
[[1]]
# A tibble: 306 × 7
   NAME                            ADDRESS POSTC…¹ XCOORD YCOORD OUTLE…² Gp1Gp…³
   <chr>                           <chr>     <dbl>  <dbl>  <dbl> <chr>     <dbl>
 1 Livewire (Marina Bay Sands)     2 Bayf…   18972 30842. 29599. Branch        5
 2 Livewire (Resorts World Sentos… 26 Sen…   98138 26704. 26526. Branch       11
 3 SportsBuzz (Kranji)             Lotus …  738078 20118. 44888. Branch        0
 4 SportsBuzz (PoMo)               1 Sele…  188306 29777. 31382. Branch       44
 5 Prime Serangoon North           Blk 54…  552542 32239. 39519. Branch        0
 6 Singapore Pools Woodlands Cent… 1A Woo…  731001 21012. 46987. Branch        3
 7 Singapore Pools 64 Circuit Rd … Blk 64…  370064 33990. 34356. Branch       17
 8 Singapore Pools 88 Circuit Rd … Blk 88…  370088 33847. 33976. Branch       16
 9 Singapore Pools Anchorvale Rd … Blk 30…  540308 33910. 41275. Branch       21
10 Singapore Pools Ang Mo Kio N2 … Blk 20…  560202 29246. 38943. Branch       25
# … with 296 more rows, and abbreviated variable names ¹​POSTCODE,
#   ²​`OUTLET TYPE`, ³​`Gp1Gp2 Winnings`

3.2 Creating a sf dataframe from an aspatial data frame

The st_as_sf() function adds a new column called geometry which specify the points

sgpools_sf <- st_as_sf(sgpools, 
                       
                       #provide column name of the x-coordinates first, followed by y-coordinates
                       coords = c("XCOORD", "YCOORD"),
                       
                       #provide the coordinate systems in epsg format
                       crs= 3414)

EPSG: 3414 is Singapore SVY21 Projected Coordinate System. You can search for other country’s epsg code by refering to epsg.io.

3.3 Plotting the map

#make this interactive by setting tmap_mode to "view"
tmap_mode("view")

tm_shape(sgpools_sf)+
  
          #specify colors by the OUTLET TYPE
tm_bubbles(col = "OUTLET TYPE", 
          #make the size proportional to Gp1Gp2 Winnings
          size = "Gp1Gp2 Winnings",
          border.col = "black",
          border.lwd = 1)

We can also add tm_facets() which is in sync (zoom and pan settings)

tm_shape(sgpools_sf) +
  tm_bubbles(col = "OUTLET TYPE", 
          size = "Gp1Gp2 Winnings",
          border.col = "black",
          border.lwd = 1) +
  tm_facets(by= "OUTLET TYPE",
            nrow = 1,
            sync = TRUE)

Switch back to tmap plot mode

tmap_mode("plot")

4. Analytical Mapping

4.1 Importing the data

For the purpose of this hands-on exercise, a prepared data set called NGA_wp.rds will be used. The data set is a polygon feature data.frame providing information on water point of Nigeria at the LGA level. You can find the data set in the rds sub-direct of the hands-on data folder.

NGA_wp <- read_rds("data/rds/NGA_wp.rds")

4.2 Basic Choropleth Mapping

Plotting the distribution of water point by LGA

#functional water
p1 <- tm_shape(NGA_wp) +
  tm_fill("wp_functional",
          n = 10,
          style = "equal",
          palette = "Blues") +
  tm_borders(lwd = 0.1,
             alpha = 1) +
  tm_layout(main.title = "Distribution of functional water point by LGAs",
            legend.outside = FALSE)
#non-functional water
p2 <- tm_shape(NGA_wp) +
  tm_fill("wp_nonfunctional",
          n = 10,
          style = "equal",
          palette = "Blues") +
  tm_borders(lwd = 0.1,
             alpha = 1) +
  tm_layout(main.title = "Distribution of functional water point by LGAs",
            legend.outside = FALSE)
#total water
p3 <- tm_shape(NGA_wp) +
  tm_fill("total_wp",
          n = 10,
          style = "equal",
          palette = "Blues") +
  tm_borders(lwd = 0.1,
             alpha = 1) +
  tm_layout(main.title = "Distribution of total  water point by LGAs",
            legend.outside = FALSE)
tmap_arrange(p3, p1, p2, ncol = 1)

4.3 Choropleth Mapping for Rates

In much of our readings we have now seen the importance to map rates rather than counts of things, and that is for the simple reason that water points are not equally distributed in space. That means that if we do not account for how many water points are somewhere, we end up mapping total water point size rather than our topic of interest.

Deriving proportion of functional water points and non-functional water points

NGA_wp <- NGA_wp |> 
  mutate(pct_functional = wp_functional/total_wp) |> 
  mutate(pct_nonfunctional = wp_nonfunctional/total_wp)

Plotting the map

tm_shape(NGA_wp) +
  tm_fill("pct_functional",
          n = 10,
          style = "equal",
          palette = "Blues",
          legend.hist = TRUE) +
  tm_borders(lwd = 0.1,
             alpha = 1) +
  tm_layout(main.title = "Rate map of functional water point by LGAs",
            legend.outside = TRUE)

4.4 Extreme Value Maps

Extreme value maps are variations of common choropleth maps where the classification is designed to highlight extreme values at the lower and upper end of the scale, with the goal of identifying outliers. These maps were developed in the spirit of spatializing EDA, i.e., adding spatial features to commonly used approaches in non-spatial EDA (Anselin 1994).

4.4.1 Percentile Map

The percentile map is a special type of quantile map with six specific categories: 0-1%,1-10%, 10-50%,50-90%,90-99%, and 99-100%. The corresponding breakpoints can be derived by means of the base R quantile command, passing an explicit vector of cumulative probabilities as c(0,.01,.1,.5,.9,.99,1). Note that the begin and endpoint need to be included.

4.4.1.1 Data Preparation

Remove NA

NGA_wp <- NGA_wp %>%
  drop_na()

Creating customised classification and extracting values

When variables are extracted from an sf data.frame, the geometry is extracted as well. For mapping and spatial manipulation, this is the expected behavior, but many base R functions cannot deal with the geometry. Specifically, the quantile() gives an error. As a result st_set_geomtry(NULL) is used to drop geometry field.

percent <- c(0,.01,.1,.5,.9,.99,1)
var <- NGA_wp["pct_functional"] %>%
  st_set_geometry(NULL)
quantile(var[,1], percent)
       0%        1%       10%       50%       90%       99%      100% 
0.0000000 0.0000000 0.2169811 0.4791667 0.8611111 1.0000000 1.0000000 
4.4.1.2 Creating get.var function

Firstly, we will write an R function as shown below to extract a variable (i.e. wp_nonfunctional) as a vector out of an sf data.frame. This will achieve the same function as 4.4.1.1

  • arguments:

    • vname: variable name (as character, in quotes)

    • df: name of sf data frame

  • returns:

    • v: vector with values (without a column name)
get.var <- function(vname,df) {
  v <- df[vname] %>% 
    st_set_geometry(NULL)
  v <- unname(v[,1])
  return(v)
}
4.4.1.3 Creating function to plot percentile mapping
percentmap <- function(vnam, df, legtitle=NA, mtitle="Percentile Map"){
  percent <- c(0,.01,.1,.5,.9,.99,1)
  
  #Use the get.var function above to get vector 
  var <- get.var(vnam, df)
  bperc <- quantile(var, percent)
  tm_shape(df) +
  tm_polygons() +
  tm_shape(df) +
     tm_fill(vnam,
             title=legtitle,
             breaks=bperc,
             palette="Blues",
          labels=c("< 1%", "1% - 10%", "10% - 50%", "50% - 90%", "90% - 99%", "> 99%"))  +
  tm_borders() +
  tm_layout(main.title = mtitle, 
            title.position = c("right","bottom"))
}
4.4.1.4 Drawing the map using the function
percentmap(vnam = "total_wp", df = NGA_wp)

4.4.2 Box Map

In essence, a box map is an augmented quartile map, with an additional lower and upper category. When there are lower outliers, then the starting point for the breaks is the minimum value, and the second break is the lower fence. In contrast, when there are no lower outliers, then the starting point for the breaks will be the lower fence, and the second break is the minimum value (there will be no observations that fall in the interval between the lower fence and the minimum value).

Let’s examine the boxplot distribution of nonfunctional water points

ggplot(data = NGA_wp,
       aes(x = "",
           y = wp_nonfunctional)) +
  geom_boxplot()

To create a box map, a custom breaks specification will be used. However, there is a complication. The break points for the box map vary depending on whether lower or upper outliers are present.

4.4.2.1 Creating boxbreaks function

The code chunk below is an R function that creating break points for a box map.

  • arguments:

    • v: vector with observations

    • mult: multiplier for IQR (default 1.5)

  • returns:

    • bb: vector with 7 break points compute quartile and fences
boxbreaks <- function(v,mult=1.5) {
  qv <- unname(quantile(v))
  iqr <- qv[4] - qv[2]
  upfence <- qv[4] + mult * iqr
  lofence <- qv[2] - mult * iqr
  # initialize break points vector
  bb <- vector(mode="numeric",length=7)
  # logic for lower and upper fences
  if (lofence < qv[1]) {  # no lower outliers
    bb[1] <- lofence
    bb[2] <- floor(qv[1])
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }
  if (upfence > qv[5]) { # no upper outliers
    bb[7] <- upfence
    bb[6] <- ceiling(qv[5])
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }
  bb[3:5] <- qv[2:4]
  return(bb)
}
4.4.2.2 Creating get.var function

Refer to Section 4.4.1.2

4.4.2.3 Testing the function
var <- get.var("wp_nonfunctional", NGA_wp) 
boxbreaks(var)
[1] -56.5   0.0  14.0  34.0  61.0 131.5 278.0
4.4.2.4 Creating function to plot boxmap

The code chunk below is an R function to create a box map.

  • arguments:

    • vnam: variable name (as character, in quotes)

    • df: simple features polygon layer

    • legtitle: legend title

    • mtitle: map title

    • mult: multiplier for IQR

  • returns: - a tmap-element (plots a map)

boxmap <- function(vnam, df, 
                   legtitle=NA,
                   mtitle="Box Map",
                   mult=1.5){
  var <- get.var(vnam,df)
  bb <- boxbreaks(var)
  tm_shape(df) +
    tm_polygons() +
  tm_shape(df) +
     tm_fill(vnam,title=legtitle,
             breaks=bb,
             palette="Blues",
          labels = c("lower outlier", 
                     "< 25%", 
                     "25% - 50%", 
                     "50% - 75%",
                     "> 75%", 
                     "upper outlier"))  +
  tm_borders() +
  tm_layout(main.title = mtitle, 
            title.position = c("left",
                               "top"))
}
4.4.2.5 Drawing the map using the function
tmap_mode("plot")
boxmap("wp_nonfunctional", NGA_wp)