pacman::p_load(here,
               sf, tidyverse, spdep,
               funModeling, tmap, ggpubr,
               corrplot, heatmaply, cluster, ClustGeo, factoextra, GGally)
Take Home Ex 2
Overview
Files required for Take Home Ex 2:
Water point attribute data (exported as CSV file):
https://www.waterpointdata.org/access-data/
Click on the button “Access WPdx+ Global Data Repository”
Then click export button & download as CSV file
Nigeria Administrative Level 2 Boundary File:
https://www.geoboundaries.org/
Enter “Nigeria” in the name filter column.
Download Nigeria-NGA-ADM2-2022 under the column “geoBoundaries”
Column Description can be found at:
https://www.waterpointdata.org/wp-content/uploads/2021/04/WPDx_Data_Standard.pdf
Step 1: Importing the required packages
The following packages are imported:
here: We use this to generate a path to the file stored at the root directory
sf: We use this for manipulation of simple features
funModeling: This is a useful package for us to do some EDA
tmap: We use for thematic plotting
tidyverse (https://www.tidyverse.org/)
ggplot2: Great for use during EDA
dplyr: Used for data manipulation (e.g. mutate() )
readr: We use this for reading rectangular data like csv
spdep: We use this for spatial related computations such as neighbors and weight matrix
corrplot: For plotting of the correlation object
heatmaply: For plotting of heatmaps and also for normalize() function
cluster: We use this package for performing clustering related computations
ClustGeo: We use this package for clustering with both variable and constraint space
factoextra: Extract and Visualize the Results of Multivariate Data Analyses
GGally: We use this package for multivariate analysis
Step 2: Shrinking the CSV file
Since the CSV file is really huge, we need to shrink it down first so that we can store a copy of it on GitHub. To do this, we will import the CSV file and keep only the columns we are keen on using the select() function. The CSV will initially be imported in tibble format; once we select our preferred columns, we also need to convert the geometry coordinates before finally saving the result as an rds file.
Using the here() function, we will generate a path to where the large csv file is stored:
<- here("data", "dataNigeria_2", "waterpoint", "wpdx.csv")
csv_path csv_path
[1] "D:/f4sared/ISSS624/data/dataNigeria_2/waterpoint/wpdx.csv"
We will use the read_csv() of the readr package to ingest the CSV file while filtering for Nigeria:
wp_nga <- read_csv(csv_path) %>% filter(`#clean_country_name` == "Nigeria")
We will use select() to keep only the columns of interest:

wp_nga_out <- wp_nga %>% select(7:10, 13:14, 22, 46:47, 62)
Using st_as_sfc() of the sf package, we derive a new "Geometry" column based on `New Georeferenced Column`. This is because that column actually holds the point geometry in textual (well-known text) form, e.g. "POINT (5.12 7.98)".
We run the code as follows:
wp_nga_out$Geometry = st_as_sfc(wp_nga_out$`New Georeferenced Column`)
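For intuition, st_as_sfc() parses well-known text (WKT) strings directly; a minimal standalone sketch (the coordinate values are illustrative):

library(sf)
st_as_sfc("POINT (5.12 7.98)") # a one-element sfc_POINT geometry set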
Next, using st_sf() of the sf package, we will convert the current object to an sf dataframe:
wp_nga_out <- st_sf(wp_nga_out, crs=4326)
Since the column `New Georeferenced Column` is no longer needed, we exclude it:
wp_nga_out <- wp_nga_out %>% select(1:9, 11)
To save the file, we need to generate a save path:
<- here("data", "dataNigeria_2", "waterpoint", "wp_nga_v2.rds")
savefile_path savefile_path
[1] "D:/f4sared/ISSS624/data/dataNigeria_2/waterpoint/wp_nga_v2.rds"
We then save the dataframe as an rds file:
write_rds(wp_nga_out, savefile_path)
Step 3: Loading the data
Load the water point data:
Now, using read_rds() of the readr package, we load the data as follows:
- we rename all the columns for ease of work
- we also replace all NA values in the status and technology columns with "Unknown"
wp_nga <- read_rds(savefile_path) %>%
  rename(`status`=`#status_clean`) %>%
  rename(`adm1`=`#clean_adm1`) %>%
  rename(`adm2`=`#clean_adm2`) %>%
  rename(`tech_clean`=`#water_tech_clean`) %>%
  rename(`tech_category`=`#water_tech_category`) %>%
  mutate(status = replace_na(status, "Unknown")) %>%
  mutate(tech_category = replace_na(tech_category, "Unknown")) %>%
  mutate(tech_clean = replace_na(tech_clean, "Unknown"))
We will now use st_geometry() of the sf package to look at our data info:
st_geometry(wp_nga)
Geometry set for 95008 features
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Geodetic CRS: WGS 84
First 5 geometries:
POINT (5.12 7.98)
POINT (3.597668 6.964532)
POINT (7.92972 6.48694)
POINT (7.64867 6.72757)
POINT (7.66485 6.7799)
We will check the CRS of the loaded rds file using st_crs() from sf:
st_crs(wp_nga)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Load the Geo Boundary data:
Once again, we need to generate the path as follows:

boundary_path <- here("data", "dataNigeria_2", "boundary")
boundary_path

[1] "D:/f4sared/ISSS624/data/dataNigeria_2/boundary"
Using the st_read() function of the sf package, we will load the boundary file:
nga <- st_read(dsn = boundary_path,
               layer = "geoBoundaries-NGA-ADM2",
               crs = 4326) %>% select(shapeName)
Reading layer `geoBoundaries-NGA-ADM2' from data source
`D:\f4sared\ISSS624\data\dataNigeria_2\boundary' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
We will check the CRS with st_crs() from sf package:
st_crs(nga)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Handling Repeated Names
We refer to our classmates’ method:
https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#data-wrangling
Get the index of the rows that are duplicated
bool_list <- duplicated(nga$shapeName)
duplicated_names <- nga$shapeName[bool_list]
index_rows <- which(nga$shapeName %in% duplicated_names)
index_rows
[1] 94 95 304 305 355 356 519 520 546 547 693 694
duplicated_names
[1] "Bassa" "Ifelodun" "Irepodun" "Nasarawa" "Obi" "Surulere"
Save the duplicated regions as a new sf dataframe:

dup_regions <- nga %>% filter(nga$shapeName %in% duplicated_names)
Select the rows from the water point data whose adm2 is part of the duplicated list:
dup_points <- wp_nga %>%
  filter(adm2 %in%
           c("Bassa",
             "Ifelodun",
             "Irepodun",
             "Nasarawa",
             "Obi",
             "Surulere"))
Perform a simple intersection between the duplicated regions and the earlier filtered waterpoints:
dup_intersect <- st_intersects(dup_regions, dup_points)
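st_intersects() returns a sparse index list: element i holds the row numbers of dup_points that fall inside polygon i of dup_regions. A self-contained sketch of this behaviour (toy geometries, not the project data):

library(sf)
# two unit squares side by side, and three points
polys <- st_sfc(st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))),
                st_polygon(list(rbind(c(1,0), c(2,0), c(2,1), c(1,1), c(1,0)))))
pts <- st_sfc(st_point(c(0.5, 0.5)), st_point(c(1.5, 0.5)), st_point(c(1.2, 0.2)))
st_intersects(polys, pts)          # [[1]] 1; [[2]] 2 3
lengths(st_intersects(polys, pts)) # 1 2 -> point counts per polygon (we reuse this idea in Step 4)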
Next, we will use the slice() function to select the rows that fall under each of the duplicated regions, then display the column "adm1", which shows the state that each region belongs to.
Bassa 1:
(dup_points %>% slice(unlist(dup_intersect[1])))$adm1 %>% head()
[1] "Kogi" "Kogi" "Kogi" "Kogi" "Kogi" "Kogi"
Bassa 2:
(dup_points %>% slice(unlist(dup_intersect[2])))$adm1 %>% head()
[1] "Plateau" "Plateau" "Plateau" "Plateau" "Plateau" "Plateau"
Ifelodun 1:
(dup_points %>% slice(unlist(dup_intersect[3])))$adm1 %>% head()
[1] "Kwara" "Kwara" "Kwara" "Kwara" "Kwara" "Kwara"
Ifelodun 2:
(dup_points %>% slice(unlist(dup_intersect[4])))$adm1 %>% head()
[1] "Osun" "Osun" "Osun" "Osun" "Osun" "Osun"
Irepodun 1:
(dup_points %>% slice(unlist(dup_intersect[5])))$adm1 %>% head()
[1] "Kwara" "Kwara" "Kwara" "Kwara" "Kwara" "Kwara"
Irepodun 2:
(dup_points %>% slice(unlist(dup_intersect[6])))$adm1 %>% head()
[1] "Osun" "Osun" "Osun" "Osun" "Osun" "Osun"
Nasarawa 1:
(dup_points %>% slice(unlist(dup_intersect[7])))$adm1 %>% head()
character(0)
Nasarawa 2:
(dup_points %>% slice(unlist(dup_intersect[8])))$adm1 %>% head()
[1] "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa"
Obi 1:
(dup_points %>% slice(unlist(dup_intersect[9])))$adm1 %>% head()
[1] "Benue" "Benue" "Benue" "Benue" "Benue" "Benue"
Obi 2:
(dup_points %>% slice(unlist(dup_intersect[10])))$adm1 %>% head()
[1] "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa" "Nasarawa"
Surulere 1:
(dup_points %>% slice(unlist(dup_intersect[11])))$adm1 %>% head()
[1] "Lagos" "Lagos" "Lagos" "Lagos" "Lagos" "Lagos"
Surulere 2:
(dup_points %>% slice(unlist(dup_intersect[12])))$adm1 %>% head()
[1] "Oyo" "Oyo" "Oyo" "Oyo" "Oyo" "Oyo"
Plot the areas that are duplicated:

tmap_mode("view")
tm_shape(dup_regions) + tm_polygons()
Rename the duplicated areas
nga$shapeName[index_rows] <- c("Bassa_Kogi", "Bassa_Plateau",
                               "Ifelodun_Kwara", "Ifelodun_Osun",
                               "Irepodun_Kwara", "Irepodun_Osun",
                               "Nasarawa_0", "Nasarawa_Nasarawa",
                               "Obi_Benue", "Obi_Nasarawa",
                               "Surulere_Lagos", "Surulere_Oyo")
Check to see if there are any repeated rows again:
bool_list <- duplicated(nga$shapeName)
duplicated_names <- nga$shapeName[bool_list]
index_rows <- which(nga$shapeName %in% duplicated_names)
index_rows

integer(0)
We use the rm() function to remove the environment variables to reduce clutter:
rm(dup_intersect)
rm(dup_points)
rm(dup_regions)
rm(bool_list)
rm(duplicated_names)
rm(index_rows)
Step 4: Data Wrangling
Status of water point
Using the freq() from funModeling, we will check the distribution of status of the water points:
freq(data=wp_nga, input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
Please report the issue at <https://github.com/pablo14/funModeling/issues>.
status frequency percentage cumulative_perc
1 Functional 45883 48.29 48.29
2 Non-Functional 29385 30.93 79.22
3 Unknown 10656 11.22 90.44
4 Functional but needs repair 4579 4.82 95.26
5 Non-Functional due to dry season 2403 2.53 97.79
6 Functional but not in use 1686 1.77 99.56
7 Abandoned/Decommissioned 234 0.25 99.81
8 Abandoned 175 0.18 99.99
9 Non functional due to dry season 7 0.01 100.00
Functional water point
wpt_functional <- wp_nga %>%
  filter(status %in%
           c("Functional",
             "Functional but not in use",
             "Functional but needs repair"))

freq(data=wpt_functional,
     input = 'status')
Non Functional water point
wpt_nonfunctional <- wp_nga %>%
  filter(status %in%
           c("Abandoned/Decommissioned",
             "Abandoned",
             "Non-Functional",
             "Non functional due to dry season",
             "Non-Functional due to dry season"))

freq(data=wpt_nonfunctional,
     input = 'status')
Unknown
wpt_unknown <- wp_nga %>% filter(status == "Unknown")
Water point Technology
wpt_hand_pump <- wp_nga %>% filter(tech_category == "Hand Pump")
freq(data=wp_nga, input = 'tech_category')
Usage Capacity < 1000
We first use the freq() of the funModeling package to show the distribution of capacity:
freq(data=wp_nga, input = 'usage_capacity')
usage_capacity frequency percentage cumulative_perc
1 300 68789 72.40 72.40
2 1000 25644 26.99 99.39
3 250 573 0.60 99.99
4 50 2 0.00 100.00
The above plot shows us that the majority of the values are either 300 or 1000.
Thus, for our analysis, we will focus on values that are less than 1000:
wpt_1000 <- wp_nga %>% filter(usage_capacity < 1000)
Rural waterpoints
wpt_rural <- wp_nga %>% filter(is_urban == FALSE)
freq(data=wp_nga, input = 'is_urban')
Point-in-Polygon
Since each water point in wp_nga falls within a specific region, we need to allocate the points to their respective regions. We do this using the st_intersects() function of the sf package along with the mutate() function of the dplyr package; lengths() then counts how many points intersect each polygon.
We perform point-in-polygon as follows:
nga_wp <- nga %>%
  mutate(`total_wpt` = lengths(
    st_intersects(nga, wp_nga))) %>%
  mutate(`wpt_functional` = lengths(
    st_intersects(nga, wpt_functional))) %>%
  mutate(`wpt_non_functional` = lengths(
    st_intersects(nga, wpt_nonfunctional))) %>%
  mutate(`wpt_unknown` = lengths(
    st_intersects(nga, wpt_unknown))) %>%
  mutate(`wpt_hand_pump` = lengths(
    st_intersects(nga, wpt_hand_pump))) %>%
  mutate(`wpt_1000` = lengths(
    st_intersects(nga, wpt_1000))) %>%
  mutate(`wpt_rural` = lengths(
    st_intersects(nga, wpt_rural)))
Calculate Percentage:
nga_wp <- nga_wp %>%
  mutate(pct_functional = `wpt_functional`/`total_wpt`) %>%
  mutate(`pct_non_functional` = `wpt_non_functional`/`total_wpt`) %>%
  mutate(`pct_hand_pump` = `wpt_hand_pump`/`total_wpt`) %>%
  mutate(`pct_1000` = `wpt_1000`/`total_wpt`) %>%
  mutate(`pct_rural` = `wpt_rural`/`total_wpt`)
Replace NaN with 0:

NaN_list <- is.nan(nga_wp$pct_functional)
nga_wp$pct_functional[NaN_list] <- 0

NaN_list <- is.nan(nga_wp$pct_non_functional)
nga_wp$pct_non_functional[NaN_list] <- 0

NaN_list <- is.nan(nga_wp$pct_hand_pump)
nga_wp$pct_hand_pump[NaN_list] <- 0

NaN_list <- is.nan(nga_wp$pct_1000)
nga_wp$pct_1000[NaN_list] <- 0

NaN_list <- is.nan(nga_wp$pct_rural)
nga_wp$pct_rural[NaN_list] <- 0
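A more compact alternative (a sketch using dplyr::across(); equivalent on the assumption that the pct_ columns are the only ones that can hold NaN):

nga_wp <- nga_wp %>%
  mutate(across(starts_with("pct_"), ~ ifelse(is.nan(.x), 0, .x)))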
qtm(nga_wp, fill = "wpt_hand_pump") +
  tm_layout(legend.height = 0.4, legend.width = 0.4)
Once again, we now clean the environment of variables:
rm(wpt_1000)
rm(wpt_functional)
rm(wpt_hand_pump)
rm(wpt_nonfunctional)
rm(wpt_rural)
rm(wpt_unknown)
rm(NaN_list)
rm(wp_nga)
Performing Projection to 26391
We will now use the st_transform() function to project to the Nigerian CRS, EPSG:26391:
nga_wp <- st_transform(nga_wp, crs = 26391)
We will also transform the boundary file:
nga <- st_transform(nga, crs = 26391)
Step 5: EDA
Spatial Mapping
First, we will do a spatial plot of the functional & non-functional variables to check for any spatial patterns.
We will make use of the tmap package and the tmap_arrange() function:
functional.map <- tm_shape(nga_wp) +
  tm_fill(col = "wpt_functional",
          n = 5,
          style = "jenks",
          title = "wpt_functional") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Functional water points distribution",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)

non_functional.map <- tm_shape(nga_wp) +
  tm_fill(col = "wpt_non_functional",
          n = 5,
          style = "jenks",
          title = "wpt_non_functional") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Non-functional water points distribution",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)

pct_functional.map <- tm_shape(nga_wp) +
  tm_fill(col = "pct_functional",
          n = 5,
          style = "jenks",
          title = "pct_functional") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Functional water points %",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)

pct_non_functional.map <- tm_shape(nga_wp) +
  tm_fill(col = "pct_non_functional",
          n = 5,
          style = "jenks",
          title = "pct_non_functional") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Non-Functional water points %",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)
tmap_arrange(functional.map, non_functional.map,
pct_functional.map,pct_non_functional.map,asp=NA, ncol=2)
In the top-left and top-right plots, we have the absolute numbers of functional and non-functional water points. We can see a clear indication of patterns, where certain clusters of regions have higher absolute values.
However, the magnitude of the variable does not tell the full story, so we also plot the percentage of functional and non-functional water points in the lower-left and lower-right plots. From these two plots, we again notice a distinct pattern and clustering behaviour of the variables.
This is good evidence for us to look further.
Next we further turn our attention to the remaining variables:
hand_pump.map <- tm_shape(nga_wp) +
  tm_fill(col = "pct_hand_pump",
          n = 5,
          style = "jenks",
          title = "pct_hand_pump") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Handpump % distribution",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)

pct_1000.map <- tm_shape(nga_wp) +
  tm_fill(col = "pct_1000",
          n = 5,
          style = "jenks",
          title = "pct_1000") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Capacity < 1000 % distribution",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)

pct_rural.map <- tm_shape(nga_wp) +
  tm_fill(col = "pct_rural",
          n = 5,
          style = "jenks",
          title = "pct_rural") +
  tm_borders(alpha = 0.9) +
  tm_layout(main.title = "Rural % distribution",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.6,
            legend.width = 0.6,
            frame = TRUE)
tmap_arrange(hand_pump.map, pct_1000.map, pct_rural.map,
asp=NA, ncol=2)
For the remaining variables, we observe a clustering pattern for hand pump % and capacity < 1000 %. Interestingly, they seem to have similar spatial patterns, which suggests the possibility of correlation.
For rural %, however, the clustering pattern appears to be more fragmented.
Housekeeping:
rm(functional.map)
rm(non_functional.map)
rm(pct_functional.map)
rm(pct_non_functional.map)
rm(hand_pump.map)
rm(pct_1000.map)
rm(pct_rural.map)
Statistical
Next, we will make use of the ggplot2 package to explore the distribution of values.
We will use geom_histogram() of the ggplot2 package to plot histograms of the variables:
wpt_functional <- ggplot(data=nga_wp,
                         aes(x= `wpt_functional`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))

wpt_non_functional <- ggplot(data=nga_wp,
                             aes(x= `wpt_non_functional`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))

pct_functional <- ggplot(data=nga_wp,
                         aes(x= `pct_functional`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))

pct_non_functional <- ggplot(data=nga_wp,
                             aes(x= `pct_non_functional`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))
ggarrange(wpt_functional, wpt_non_functional, pct_functional, pct_non_functional,
ncol = 2,
nrow = 2)
We can see that while the distributions of the functional and non-functional water point counts are skewed, % functional and % non-functional seem to follow a normal distribution.
We will use geom_histogram() of the ggplot2 package to plot histograms of the remaining variables:
pct_handpump <- ggplot(data=nga_wp,
                       aes(x= `pct_hand_pump`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))

pct_1000 <- ggplot(data=nga_wp,
                   aes(x= `pct_1000`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))

pct_rural <- ggplot(data=nga_wp,
                    aes(x= `pct_rural`)) +
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  theme(axis.title=element_text(size=30,face="bold"))
ggarrange(pct_handpump, pct_1000, pct_rural,
ncol = 2,
nrow = 2)
From the above plots, we observe a left skew in the distribution of regions with < 1000 capacity. We observe the same left skew for the distribution of rural waterpoints.
Housekeeping:
rm(wpt_functional)
rm(wpt_non_functional)
rm(pct_functional)
rm(pct_non_functional)
rm(pct_handpump)
rm(pct_1000)
rm(pct_rural)
Step 6: Correlation Analysis
We will now get the correlation variables as follows (we use st_set_geometry(NULL) of the sf package to exclude the geometry column):

corr_data <- nga_wp[,c(4:5, 10:14)] %>% st_set_geometry(NULL)
In order to perform correlation analysis, we need to get the correlation matrix using cor() from the base R package, then we can plot it using the corrplot.mixed() function from corrplot package.
Perform correlation analysis:
cluster_vars.cor = cor(corr_data)
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")
From the correlation plot above, we notice that pct_hand_pump is highly negatively correlated with pct_1000. To avoid feeding two highly correlated variables into the clustering, we will drop pct_1000 from the clustering variables later.
Perform cleanup:
rm(cluster_vars.cor)
rm(corr_data)
Step 7: Hierarchical Cluster Analysis
Extract the clustering variables
To perform the analysis, we need to extract the clustering variables without the geometric data. To do this, we will use st_set_geometry(NULL) from the sf package on our sf dataframe.
We run the code as follows:
cluster_vars <- nga_wp %>%
  st_set_geometry(NULL) %>%
  select("shapeName", "wpt_functional", "wpt_non_functional",
         "pct_functional", "pct_non_functional",
         "pct_hand_pump", "pct_1000", "pct_rural")

head(cluster_vars,3)
shapeName wpt_functional wpt_non_functional pct_functional pct_non_functional
1 Aba North 7 9 0.4117647 0.5294118
2 Aba South 29 35 0.4084507 0.4929577
3 Abadam 0 0 0.0000000 0.0000000
pct_hand_pump pct_1000 pct_rural
1 0.11764706 0.1764706 0.00000000
2 0.09859155 0.1267606 0.05633803
3 0.00000000 0.0000000 0.00000000
Next, we need to change the row names using the row.names() function from base R:
row.names(cluster_vars) <- cluster_vars$"shapeName"
head(cluster_vars,3)
shapeName wpt_functional wpt_non_functional pct_functional
Aba North Aba North 7 9 0.4117647
Aba South Aba South 29 35 0.4084507
Abadam Abadam 0 0 0.0000000
pct_non_functional pct_hand_pump pct_1000 pct_rural
Aba North 0.5294118 0.11764706 0.1764706 0.00000000
Aba South 0.4929577 0.09859155 0.1267606 0.05633803
Abadam 0.0000000 0.00000000 0.0000000 0.00000000
Now, we no longer need the columns shapeName & pct_1000, so we exclude them:

cluster_vars <- select(cluster_vars, c(2:6, 8))
head(cluster_vars, 3)
wpt_functional wpt_non_functional pct_functional pct_non_functional
Aba North 7 9 0.4117647 0.5294118
Aba South 29 35 0.4084507 0.4929577
Abadam 0 0 0.0000000 0.0000000
pct_hand_pump pct_rural
Aba North 0.11764706 0.00000000
Aba South 0.09859155 0.05633803
Abadam 0.00000000 0.00000000
Data Standardization
From our dataframe cluster_vars, we can see that the columns wpt_functional and wpt_non_functional have much wider ranges, since they are counts instead of percentages. Without standardization, the analysis results would be biased towards clustering variables with large values; hence the need for standardization before cluster analysis.
We will therefore perform min-max standardization using normalize() of the heatmaply package (we use summary() of base R to check the outcome):
std_cluster_vars <- normalize(cluster_vars)
summary(std_cluster_vars)
wpt_functional wpt_non_functional pct_functional pct_non_functional
Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.02261 1st Qu.:0.04406 1st Qu.:0.3261 1st Qu.:0.2105
Median :0.06051 Median :0.12230 Median :0.4741 Median :0.3505
Mean :0.08957 Mean :0.14962 Mean :0.4984 Mean :0.3592
3rd Qu.:0.11669 3rd Qu.:0.21853 3rd Qu.:0.6699 3rd Qu.:0.5076
Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
pct_hand_pump pct_rural
Min. :0.0000 Min. :0.0000
1st Qu.:0.1670 1st Qu.:0.5727
Median :0.5099 Median :0.8645
Mean :0.4873 Mean :0.7271
3rd Qu.:0.7778 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
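For intuition, min-max standardization rescales each column to [0, 1]; a minimal base-R sketch of what we assume normalize() does column-wise:

minmax <- function(x) (x - min(x)) / (max(x) - min(x))
std_manual <- as.data.frame(lapply(cluster_vars, minmax))
summary(std_manual$wpt_functional) # each column now runs from 0 to 1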
Compute the proximity matrix
When calculating the proximity matrix, there are a few distance metrics available to us:
-Euclidean
-Maximum
-Manhattan / City Block Distance
-Canberra
-Binary
-Minkowski
We will use Euclidean since it is the default.
Using the dist() of the base R package, we will compute the proximity matrix:
proxmat <- dist(std_cluster_vars, method = 'euclidean')
Compute the Clustering
There are 2 families of hierarchical clustering methods:
Agglomerative (AGNES): a bottom-up approach where each point starts as its own cluster.
Divisive Analysis (DIANA): a top-down approach; at each step, the most heterogeneous cluster is split (see the sketch just below).
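For reference, a divisive run is a one-liner with diana() from the cluster package (a sketch only; we do not use DIANA in this analysis):

# cluster is already loaded via pacman::p_load() above
diana_result <- diana(std_cluster_vars, metric = "euclidean")
diana_result$dc # divisive coefficient, analogous to the agglomerative coefficient used below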
There are a total of eight clustering algorithms for AGNES:
-ward.D
-ward.D2
-single
-complete
-average(UPGMA)
-mcquitty(WPGMA)
-median(WPGMC)
-centroid(UPGMC)
We will start off with “ward.D”
Using hclust() of the base R package, we will compute the cluster as follow:
hclust_ward <- hclust(proxmat, method = 'ward.D')
We then get an object, hclust_ward, which contains the detailed information on the structure of the cluster tree. This allows the later plot() call to draw the tree correctly.
Plotting the Dendrogram:
To view the outcome, we save the output as a png and then load it as a picture:
https://stackoverflow.com/questions/11444743/how-to-adjust-sizes-of-x-axis-in-dendrogram-r
Each time the code is run, the .png file will be updated. For better resolution, we increase the width and height of the .png file. The "cex" value is adjusted such that the names of the regions are just visible when we zoom in on the .png file.
Additional parameters:
lwd: Line width
main: Title
line: Position of title
cex.main: Size of title
We run the code as follows:
png("plotdendogram.png",width=12800,height=3200)
plot(hclust_ward, cex = 1.2, lwd=3.0, main="Dendrogram", line = -20, cex.main=20)
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
dev.off()
png
2
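The warnings arise because "line" is not a graphical parameter of plot.hclust(); a cleaner variant (a sketch) sets the title separately via title(), which does accept line:

png("plotdendogram.png", width=12800, height=3200)
plot(hclust_ward, cex = 1.2, lwd = 3.0, main = "")
title(main = "Dendrogram", line = -20, cex.main = 20) # title() accepts 'line'
dev.off()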
Please open the png in a new tab and then zoom in:
Selecting the optimal clustering algorithm:
In order to measure the strength of the clustering structure, we need to compute the agglomerative coefficient. By comparing this coefficient across the various clustering algorithms, we can select the algorithm that gives us the best clustering structure.
The agnes() function works similarly to hclust(), with the addition of this coefficient:
https://rdrr.io/cran/cluster/man/agnes.html
We now use the agnes() function from the cluster package:
agnes_result <- agnes(std_cluster_vars, method = "ward")
agnes_result$ac
[1] 0.990276
The result is stored in the output data structure under agnes_result$ac. The higher this value, the better the clustering structure. In this case, "ward" gets a score of 0.990276, which indicates a strong clustering structure.
We will need to test all 4 of the clustering algorithms to determine which is the best to apply.
First, we need to create a data structure to hold the names of the clustering algorithms:
<- c( "average", "single", "complete", "ward")
algorithms names(algorithms) <- c( "average", "single", "complete", "ward")
algorithms
average single complete ward
"average" "single" "complete" "ward"
Next, we write a simple function "test" that takes in the name of a clustering algorithm and outputs the score of the clustering structure:

test <- function(x) {
  agnes(std_cluster_vars, method = x)$ac
}
Finally, using the map_dbl() function from the purrr package, we apply the function to each algorithm name:
map_dbl(algorithms, test)
average single complete ward
0.9114067 0.7819154 0.9440637 0.9902760
Of the clustering algorithms, we can see that the "ward" method gives us the best clustering structure. Hence, we will use it for the rest of the analysis.
We perform housekeeping:
rm(cluster_vars)
rm(agnes_result)
rm(algorithms)
# rm(proxmat)
rm(test)
Determine optimal clusters
In order to perform meaningful analysis, we need to select the optimal number of clusters. With too few clusters, we won't be able to draw meaningful insights; with too many, we will also fail to draw meaning.
There are 3 methods to do this:
Elbow method
Average silhouette method
Gap statistic
For this, we will make use of the gap statistic:
https://hastie.su.domains/Papers/gap.pdf
The gap statistic compares the total intra-cluster variation for a given number of clusters with its expected value under a null reference distribution. The larger the gap, the stronger the indication that the clustering structure is far from a random distribution of points. As we increase the number of clusters, we want the highest gap statistic possible. The minimum number of clusters we accept for meaningful analysis is 3.
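Formally (following the Tibshirani et al. paper linked above), with W_k denoting the pooled within-cluster dispersion for k clusters:

Gap_n(k) = E*_n[log(W_k)] - log(W_k)

where E*_n is the expectation under a reference sample of size n drawn from the null distribution; we favour the k whose gap is largest under the chosen selection rule.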
There are a few methods to determine clusters:
firstSEmax
Tibs2001SEmax
globalSEmax
firstmax
globalmax
Useful links:
https://stats.stackexchange.com/questions/95290/how-should-i-interpret-gap-statistic
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html
To do this, we use the clusGap() function from the cluster package, with hcut() of the factoextra package as the clustering function (for reproducibility, we set a specific random seed):
set.seed(2022)
gap_stat <- clusGap(std_cluster_vars,
                    FUN = hcut,
                    nstart = 25,
                    K.max = 25,
                    B = 50)
# Print the result
print(gap_stat, method = "Tibs2001SEmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = std_cluster_vars, FUNcluster = hcut, K.max = 25, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..25; spaceH0="scaledPCA"
--> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 4
logW E.logW gap SE.sim
[1,] 5.000512 5.537541 0.5370297 0.006640580
[2,] 4.838605 5.435522 0.5969167 0.010174498
[3,] 4.675903 5.364424 0.6885215 0.011084303
[4,] 4.585719 5.307609 0.7218895 0.010288092
[5,] 4.533596 5.262031 0.7284345 0.011096225
[6,] 4.487042 5.225343 0.7383019 0.010516894
[7,] 4.452088 5.193566 0.7414780 0.011305457
[8,] 4.411099 5.166658 0.7555587 0.010852306
[9,] 4.372379 5.144079 0.7717004 0.010651529
[10,] 4.342018 5.123315 0.7812978 0.009989005
[11,] 4.305545 5.104442 0.7988971 0.009588659
[12,] 4.274778 5.086947 0.8121689 0.009324876
[13,] 4.254300 5.070183 0.8158832 0.009123792
[14,] 4.228367 5.054393 0.8260260 0.009010597
[15,] 4.205874 5.039250 0.8333759 0.009009316
[16,] 4.184087 5.024765 0.8406782 0.009149193
[17,] 4.160961 5.010952 0.8499907 0.009199555
[18,] 4.136993 4.997904 0.8609109 0.009114803
[19,] 4.114595 4.985161 0.8705663 0.009126652
[20,] 4.095154 4.972999 0.8778457 0.009019012
[21,] 4.074534 4.961102 0.8865674 0.008938229
[22,] 4.058345 4.949750 0.8914045 0.008921074
[23,] 4.044721 4.938825 0.8941047 0.008809354
[24,] 4.026319 4.928110 0.9017906 0.008750836
[25,] 4.015433 4.917798 0.9023651 0.008647470
Next, we will plot the gap statistics as follows:

fviz_gap_stat(gap_stat)

From the graph, we see that the gradient of the gap statistic decreases greatly from the 4th cluster onwards. Therefore, we will opt to keep 5 clusters.
Interpreting the Dendrograms
png("plotdendogram_0.png",width=12800,height=3200)
plot(hclust_ward, cex = 1.2, lwd=3.0, main="Dendrogram", line = -20, cex.main=20)
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
rect.hclust(hclust_ward, k = 5, border = 2:5)
dev.off()
png
2
To zoom in, right click and open picture on a new tab:
Visual Hierarchical clustering Analysis
In order to plot the heatmap, we need to convert the dataframe into a data matrix:
mat_std_cluster_vars <- data.matrix(std_cluster_vars)
Next, using heatmaply() of the heatmaply package, we can plot the interactive heatmap as follows:
heatmaply(mat_std_cluster_vars,
Colv=NA,
showticklabels = c(TRUE, FALSE),
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="ICT indicators",
xlab = "ICT Indicators",
ylab = "Nigeria LGA"
)
Mapping the clusters
We first use the function cutree() of R base to prune the clusters to 5:
groups <- as.factor(cutree(hclust_ward, k=5))
We will create a new sf dataframe nga_final which includes the geometry and the variables:
nga_final <- nga_wp %>%
  select("shapeName", "wpt_functional", "wpt_non_functional",
         "pct_functional", "pct_non_functional",
         "pct_hand_pump", "pct_1000", "pct_rural")
Next, we will use cbind() from base R to save the clustering group as a column:
bind_std_cluster_vars <- cbind(nga_final, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)
Here we do a simple filter to get the cluster number of a specific region:

bind_std_cluster_vars %>% filter(shapeName == "Kanke")
Simple feature collection with 1 feature and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 776977.8 ymin: 576111.2 xmax: 812592.4 ymax: 616431.6
Projected CRS: Minna / Nigeria West Belt
shapeName wpt_functional wpt_non_functional pct_functional pct_non_functional
1 Kanke 217 89 0.484375 0.1986607
pct_hand_pump pct_1000 pct_rural CLUSTER geometry
1 0.6160714 0.9330357 1 2 MULTIPOLYGON (((779281.4 61...
Using the tmap package, we will plot the clusters as follows:
tmap_mode("plot")
tmap mode set to plotting
tm_shape(bind_std_cluster_vars) +
tm_fill("CLUSTER") +
tm_borders(alpha = 0.9) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_layout(main.title = "Cluster Distribution of Nigeria",
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE)
From the above plot, we can see that when we perform clustering without consideration for geographic location, the resulting clusters are highly fragmented.
Housekeeping to remove variables from environment:
rm(gap_stat)
rm(hclust_ward)
rm(mat_std_cluster_vars)
rm(groups)
Step 8: SKATER
Unconstrained clustering will often show some form of regional clustering, but if we want all members of a cluster to be spatially contiguous, we need a specialized algorithm such as SKATER.
A key idea here is that traditional cluster evaluation measures are sub-optimal in practice because of real-world constraints. As shown in the unconstrained cluster evaluation above, our clusters turned out to be quite fragmented.
The advantage of using SKATER is that it imposes a hard requirement that spatial objects in the same cluster are also geographically linked. It is a regionalization approach based on graph partitioning that preserves neighborhood connectivity.
Additional Info for SKATER:
https://www.dshkol.com/post/spatially-constrained-clustering-and-regionalization/
Check for regions without neighbors:
Using the summary function, we will check if all polygons have a neighbor:
summary(poly2nb(nga))
Neighbour list object:
Number of regions: 774
Number of nonzero links: 4440
Percentage nonzero weights: 0.7411414
Average number of links: 5.736434
1 region with no links:
86
Link number distribution:
0 1 2 3 4 5 6 7 8 9 10 11 12 14
1 2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
It appears that the polygon in row 86 does not have any neighbors:
poly2nb(nga)[86]
[[1]]
[1] 0
Remove regions without neighbors:
We remove row 86 from the boundary file:

nga <- nga[c(1:85, 87:774),]
We do the same for the clustering variables:
bind_std_cluster_vars <- bind_std_cluster_vars[c(1:85, 87:774),]
nga_wp <- nga_wp[c(1:85, 87:774),]
std_cluster_vars <- std_cluster_vars[c(1:85, 87:774),]
nga_final <- nga_final[c(1:85, 87:774),]
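As a side note, a more defensive variant (a hypothetical sketch, not the code used here) locates no-neighbour polygons programmatically instead of hard-coding row 86:

nb <- poly2nb(nga)
no_links <- which(card(nb) == 0) # card() of spdep returns each region's neighbour count
if (length(no_links) > 0) nga <- nga[-no_links, ] # then subset the other dataframes the same way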
Compute the neighbors and the cost:
Neighbor
We will convert the boundary file to an sp Spatial object and then feed it to the function poly2nb():

nga_sp <- as_Spatial(nga)
nga.nb <- poly2nb(nga_sp)
summary(nga.nb)
Neighbour list object:
Number of regions: 773
Number of nonzero links: 4440
Percentage nonzero weights: 0.7430602
Average number of links: 5.743855
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
We can then plot the neighbor relationship:
plot(nga_sp,
border=grey(.5), main="Neighbor Map", cex.main=4)
plot(nga.nb,
coordinates(nga_sp),
col="blue",
add=TRUE)
Edge cost and weights
Next, we will use the neighbor list and the clustering variables to compute the edge costs, using nbcosts() of the spdep package:

lcosts <- nbcosts(nga.nb, std_cluster_vars)
Next, we need to construct the weights, taking into account the edge costs and the neighbors. We use style "B" (binary) here so that the supplied edge costs are used as-is, without row standardisation:

nga.w <- nb2listw(nga.nb,
                  lcosts,
                  style="B")
Warning in nb2listw(nga.nb, lcosts, style = "B"): zero sum general weights
summary(nga.w)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 773
Number of nonzero links: 4440
Percentage nonzero weights: 0.7430602
Average number of links: 5.743855
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
Weights style: B
Weights constants summary:
n nn S0 S1 S2
B 773 597529 2109.869 2666.155 28614.43
Construct Minimum Spanning Tree
Now, using mstree() of the spdep package, we will construct the minimum spanning tree:
nga.mst <- mstree(nga.w)
We now check the class:
class(nga.mst)
[1] "mst" "matrix"
We also check the dimension:
dim(nga.mst)
[1] 772 3
We can check the contents of the minimum spanning tree; each row is an edge, giving the two node indices and the edge cost:
head(nga.mst)
[,1] [,2] [,3]
[1,] 296 49 0.2877620
[2,] 296 325 0.3152240
[3,] 325 622 0.1296126
[4,] 622 514 0.1773219
[5,] 514 461 0.1267727
[6,] 514 681 0.1891041
Finally, we can plot the minimum spanning tree as follows (we set the label size small so we can see the tree):
plot(nga_sp, border=gray(.5), main="Minimum Spanning Tree", cex.main=4)
plot.mst(nga.mst,
coordinates(nga_sp),
col="blue",
cex.lab=0.1,
cex.circles=0.005,
add=TRUE)
Housekeeping
rm(lcosts)
rm(nga.w)
rm(nga.nb)
Compute Spatially constrained clusters with SKATER method
Now that we have the minimum spanning tree, we can compute the spatially constrained clusters using the skater() function of the spdep package. We need the minimum spanning tree computed earlier as well as the standardized clustering variables.
Note: the number of cuts is not equal to the number of clusters. For 5 clusters, the number of cuts required is 4.
We run the code as follows:
clust5 <- spdep::skater(edges = nga.mst[,1:2],
                        data = std_cluster_vars,
                        method = "euclidean",
                        ncuts = 4)

ccs5 <- clust5$groups
Next, we will plot out the lines for each individual cluster:
plot(nga_sp, border=gray(.1), main="SKATER Clusters", cex.main=4)
plot(clust5,
coordinates(nga_sp),
cex.lab=0.1,
groups.colors=c("red","green","blue", "brown", "black"),
cex.circles=0.005, lwd=3.0,
add=TRUE)
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Lastly, we can visualize our outcome with tmap:
groups_mat <- as.matrix(clust5$groups)
nga_sf_spatialcluster <- cbind(nga_final, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
tm_shape(nga_sf_spatialcluster) +
tm_fill("SP_CLUSTER") +
tm_borders(alpha = 0.9) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_layout(main.title = "Cluster Distribution of Nigeria",
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE)
After applying the SKATER method, we now see that the clusters become much less fragmented compared to earlier.
We can easily compare the difference using tmap_arrange() from the tmap package:
tmap_mode("plot")
tmap mode set to plotting
normal_cluster.map <- tm_shape(bind_std_cluster_vars) +
  tm_fill("CLUSTER") +
  tm_borders(alpha = 1.0) +
  tm_compass(type="8star", size = 2) +
  tm_scale_bar() +
  tm_grid(alpha =0.2) +
  tm_layout(main.title = "Cluster Distribution of Nigeria",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45,
            legend.width = 0.35,
            frame = TRUE)

skater_cluster.map <- tm_shape(nga_sf_spatialcluster) +
  tm_fill("SP_CLUSTER") +
  tm_borders(alpha = 1.0) +
  tm_compass(type="8star", size = 2) +
  tm_scale_bar() +
  tm_grid(alpha =0.2) +
  tm_layout(main.title = "SKATER Distribution of Nigeria",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45,
            legend.width = 0.35,
            frame = TRUE)
tmap_arrange(normal_cluster.map, skater_cluster.map,
asp=NA, ncol=2)
Housekeeping
rm(clust5)
rm(groups_mat)
rm(normal_cluster.map)
# rm(skater_cluster.map)
Step 9: Spatially Constrained Method with ClustGeo
ClustGeo is a useful package that allows us to find the best balance between maintaining spatial contiguity and the quality of the solution in the space of the variables of interest. All this is controlled by the mixing parameter alpha.
There are two matrices, D0 and D1, involved:
-D0: This matrix represents the dissimilarities in the attribute/clustering variable space.
-D1: This matrix represents the dissimilarities in the spatial domain.
Our goal is thus to find a good balance between these two matrices using the alpha parameter.
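Conceptually (a simplified statement of the ClustGeo criterion; see the package documentation), each merge minimizes a convex combination of the two Ward-like costs:

delta_alpha(Ci, Cj) = (1 - alpha) * delta_D0(Ci, Cj) + alpha * delta_D1(Ci, Cj)

so alpha = 0 ignores geography entirely, while alpha = 1 clusters on spatial distance alone.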
Traditional Ward-Like clustering with ClustGeo
ClustGeo also offers non-spatially constrained, Ward-like clustering with hclustgeo() as follows:

non_geo_cluster <- hclustgeo(proxmat)
png("plotdendogram_1.png",width=12800,height=3200)
plot(non_geo_cluster, cex = 1.2, lwd=3.0, main="Dendrogram",
line = -20, cex.main=20)
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
Warning in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
"line" is not a graphical parameter
rect.hclust(non_geo_cluster, k = 5, border = 2:5)
dev.off()
png
2
Housekeeping:
rm(non_geo_cluster)
Preparing Spatially Constrained Hierarchical Clustering
To perform the clustering, we need 3 ingredients:
-Distance Matrix: Represents the spatial domain
-Proximity Matrix: Represents the variable domain
-Mixing variable Alpha: This is required for denoting the degree of mix
Distance Matrix
We will use st_distance() of the sf package to compute the distance matrix; as.dist() then converts the resulting matrix into a dist object:

dist <- st_distance(nga_final, nga_final)
distmat <- as.dist(dist)
Proximity Matrix
Same as earlier, we will use dist() of the base R package to compute the variable proximity matrix:
proxmat <- dist(std_cluster_vars, method = 'euclidean')
Mixing Variable Alpha
Using the above 2 ingredients, we can now determine the third.
We use the choicealpha() function from the ClustGeo package as follows:

cr <- choicealpha(proxmat, distmat, range.alpha = seq(0, 1, 0.1), K=6,
                  graph = TRUE)
From the graph plotted, we observe a good balance of the normalized criteria (Qnorm) at alpha = 0.5; we will use this as the mixing parameter.
Spatially Constrained Hierarchical Clustering
Once the mixing variable has been determined, we can then compute the clusters using hclustgeo():
clustG <- hclustgeo(proxmat, distmat, alpha = 0.5)
Now that the clusters have been computed, we use cutree() to limit the number of clusters to 5:

groups <- as.factor(cutree(clustG, k=5))
We will then append the cluster ID of each region to the nga_final dataframe and save as a new sf:
nga_sf_Gcluster <- cbind(nga_final, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)
clustgeo_cluster.map <- tm_shape(nga_sf_Gcluster) +
  tm_fill("CLUSTER") +
  tm_borders(alpha = 1.0) +
  tm_compass(type="8star", size = 2) +
  tm_scale_bar() +
  tm_grid(alpha =0.2) +
  tm_layout(main.title = "ClustGeo Distribution of Nigeria",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45,
            legend.width = 0.35,
            frame = TRUE)

clustgeo_cluster.map
Using tmap_arrange(), we will do a comparison:
tmap_arrange(skater_cluster.map, clustgeo_cluster.map,
asp=NA, ncol=2)
Comparing SKATER and ClustGeo, we see that ClustGeo is less strict about geographic contiguity.
Step 10: Interpreting our results
Using ggarrange(), we will combine multiple box plots:
wpt_functional <- ggplot(data = nga_sf_Gcluster,
                         aes(x = CLUSTER, y = wpt_functional)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))

wpt_non_functional <- ggplot(data = nga_sf_Gcluster,
                             aes(x = CLUSTER, y = wpt_non_functional)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))

pct_functional <- ggplot(data = nga_sf_Gcluster,
                         aes(x = CLUSTER, y = pct_functional)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))

pct_non_functional <- ggplot(data = nga_sf_Gcluster,
                             aes(x = CLUSTER, y = pct_non_functional)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))

pct_hand_pump <- ggplot(data = nga_sf_Gcluster,
                        aes(x = CLUSTER, y = pct_hand_pump)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))

pct_rural <- ggplot(data = nga_sf_Gcluster,
                    aes(x = CLUSTER, y = pct_rural)) +
  geom_boxplot() +
  theme(axis.title=element_text(size=6,face="bold"))
ggarrange(wpt_functional, wpt_non_functional, pct_functional, pct_non_functional,
pct_hand_pump, pct_rural,ncol = 2,
nrow = 3)
The above box plots tell us about the distribution of each variable within each cluster. We are able to make the following observations:
Cluster 5 contains the regions with higher numbers of working water points (plot wpt_functional)
Cluster 5 contains regions with a higher percentage of working water points (plot pct_functional)
Clusters 1, 2 & 4 have lower shares of hand pump water points (plot pct_hand_pump)
Clusters 3, 4 & 5 consist mostly of rural water points (plot pct_rural)
We need to normalize wpt_functional & wpt_non_functional again, since their values are counts:

nga_sf_Gcluster$wpt_functional <- normalize(nga_sf_Gcluster$wpt_functional)
nga_sf_Gcluster$wpt_non_functional <- normalize(nga_sf_Gcluster$wpt_non_functional)
Using multi-variate visualization, we can try to explain the characteristic of each cluster:
ggparcoord(data = nga_sf_Gcluster,
columns = c(2:6,8),
scale = "globalminmax",
alphaLines = 0.2,
boxplot = TRUE,
title = "Multiple Parallel Coordinates Plots of ICT Variables by Cluster") +
facet_grid(rows = vars(CLUSTER)) +
theme(axis.text.x = element_text(angle = 10))
From the above, we can further derive the characteristics of each cluster:
Cluster 1: Majority non-rural water points that seem to be in good working condition
Cluster 2: Majority non-rural water points
Cluster 3: Mostly rural water points, largely in good working condition, using hand pump technology
Cluster 4: Highly rural water points, with fewer in working condition, and most not using hand pump technology
Cluster 5: Highly rural water points, the majority in good working condition, using hand pump technology
Finally, we can compute our summary statistics as follows:

nga_sf_Gcluster %>%
  st_set_geometry(NULL) %>%
  group_by(CLUSTER) %>%
  summarise(wpt_functional = mean(wpt_functional),
            wpt_non_functional = mean(wpt_non_functional),
            pct_functional = mean(pct_functional),
            pct_non_functional = mean(pct_non_functional),
            pct_hand_pump = mean(pct_hand_pump),
            pct_rural = mean(pct_rural))
# A tibble: 5 × 7
CLUSTER wpt_functional wpt_non_functional pct_functi…¹ pct_n…² pct_h…³ pct_r…⁴
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.0478 0.0998 0.492 0.305 0.235 0.263
2 2 0.0172 0.0193 0.193 0.139 0.325 0.102
3 3 0.0865 0.183 0.555 0.421 0.646 0.716
4 4 0.0544 0.164 0.349 0.465 0.330 0.869
5 5 0.169 0.161 0.704 0.255 0.774 0.901
# … with abbreviated variable names ¹pct_functional, ²pct_non_functional,
# ³pct_hand_pump, ⁴pct_rural
From our summary statistics, we can see that the mean values for each cluster match our descriptions from the multivariate visualization well.