Depending on the state of your spatial data, they may require some cleaning and modification before interpolation. The issues to address before interpolating fall into a few distinct categories, each of which is discussed below. The following examples assume:
library(areal)
library(dplyr) # data wrangling
library(sf) # spatial data operations
# load data into environment
race <- ar_stl_race
asthma <- ar_stl_asthma
wards <- ar_stl_wards
# create example data - non-spatial data
asthmaTbl <- ar_stl_asthma
st_geometry(asthmaTbl) <- NULL
# create example data - wrong crs
race83 <- st_transform(race, crs = 4269)
areal has a built-in tool for data validation, ar_validate(), that provides an excellent starting place for ensuring your data are ready to interpolate. The validation process covers the first three issues listed in the introduction. It can be run in a simple format:
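A minimal sketch of the simple call, using the asthma and wards objects created above; when verbose is left at its default, ar_validate() returns a single logical value rather than a tibble:
# simple validation - returns a single TRUE/FALSE
# (TRUE here, matching the verbose output below)
ar_validate(source = asthma, target = wards, varList = "ASTHMA", method = "aw")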
If ar_validate() returns a FALSE value, it can also be run in a verbose manner that returns a detailed tibble:
ar_validate(source = asthma, target = wards, varList = "ASTHMA", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source TRUE
#> 6 No Variable Conflicts in Target TRUE
#> 7 Overall Evaluation TRUE
Use the verbose = TRUE output as a checklist to address issues before interpolating your data.
Data need to be loaded as sf objects in R. If ar_validate() returns FALSE for the first condition, data are not stored as sf objects:
ar_validate(source = asthmaTbl, target = wards, varList = "ASTHMA", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects FALSE
#> 2 CRS Match NA
#> 3 CRS is Planar NA
#> 4 Polygon Geometries NA
#> 5 Variables Exist in Source NA
#> 6 No Variable Conflicts in Target NA
#> 7 Overall Evaluation FALSE
If you have external data, using st_read() is the best way to load them:
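A minimal sketch, assuming a hypothetical shapefile named census_tracts.shp in your working directory:
# read an external shapefile into an sf object (file name is hypothetical)
tractsExternal <- st_read("census_tracts.shp")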
The st_read() function supports a variety of the most common data sources.
If you are working with data that is an sp object in R, the sf package has a function, st_as_sf(), that can be used to convert the object to sf:
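A minimal sketch of that conversion, assuming an existing sp object named raceSp:
# convert a hypothetical sp object to an sf object
raceSf <- st_as_sf(raceSp)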
If you are using either the tidycensus or tigris packages to access spatial data, there are options for downloading data as sf objects as well. In tidycensus, the geometry = TRUE option should be used. In tigris, the class = "sf" option should be used.
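A hedged sketch of both approaches; the variable code and county used here are illustrative assumptions, and tidycensus requires a Census API key:
library(tidycensus)
library(tigris)
# tidycensus: request geometry alongside the estimates (B01003_001 is an illustrative choice)
pop <- get_acs(geography = "tract", variables = "B01003_001",
               state = "MO", county = 510, geometry = TRUE)
# tigris: request sf output directly
stlTracts <- tracts(state = "MO", county = 510, class = "sf")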
If you have tabular data, you will need to merge them with a shapefile or other spatial data set that contains the geometry for the corresponding spatial features. For American users, a potentially common scenario will be a table of census geography (such as tracts) that lacks the geometry for those features. The geometry can be downloaded using tigris and then combined with the table using dplyr. The following example assumes the tract data contain an identification number column named GEOID with the appropriate tract identifiers for St. Louis:
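A sketch of that workflow, assuming a tabular data frame named tractData with a GEOID column:
library(tigris)
# download tract geometry for St. Louis City as an sf object
stlTracts <- tracts(state = "MO", county = 510, class = "sf")
# join the tabular data (tractData is an assumed name) to the geometry by GEOID
tractSf <- left_join(stlTracts, tractData, by = "GEOID")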
Two coordinate system rules are enforced by ar_validate(): both the source and the target data must be in the same coordinate system, and that coordinate system must be a projected coordinate system. The sf package contains a function named st_crs() for previewing the current coordinate system of your data:
st_crs(race83)
#> Coordinate Reference System:
#> User input: EPSG:4269
#> wkt:
#> GEOGCRS["NAD83",
#> DATUM["North American Datum 1983",
#> ELLIPSOID["GRS 1980",6378137,298.257222101,
#> LENGTHUNIT["metre",1]]],
#> PRIMEM["Greenwich",0,
#> ANGLEUNIT["degree",0.0174532925199433]],
#> CS[ellipsoidal,2],
#> AXIS["geodetic latitude (Lat)",north,
#> ORDER[1],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> AXIS["geodetic longitude (Lon)",east,
#> ORDER[2],
#> ANGLEUNIT["degree",0.0174532925199433]],
#> USAGE[
#> SCOPE["Geodesy."],
#> AREA["North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands."],
#> BBOX[14.92,167.65,86.45,-40.73]],
#> ID["EPSG",4269]]
The EPSG value 4269 refers to a specific type of coordinate system known as a geographic coordinate system. These data use latitude-longitude values, which vary in length and thus cannot be used to calculate area (a key component of carrying out areal interpolations). These data would fail the ar_validate() process:
ar_validate(source = race83, target = wards, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match FALSE
#> 3 CRS is Planar FALSE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source TRUE
#> 6 No Variable Conflicts in Target TRUE
#> 7 Overall Evaluation FALSE
These data fail both the matching CRS validation and the planar CRS validation because the race83 and wards data are in two different coordinate systems. We’ve already seen the CRS data for race83 above, so we’ll look at the CRS data for wards:
st_crs(wards)
#> Coordinate Reference System:
#> User input: ESRI:102296
#> wkt:
#> PROJCRS["NAD_1983_HARN_StatePlane_Missouri_East_FIPS_2401",
#> BASEGEOGCRS["NAD83(HARN)",
#> DATUM["NAD83 (High Accuracy Reference Network)",
#> ELLIPSOID["GRS 1980",6378137,298.257222101,
#> LENGTHUNIT["metre",1]]],
#> PRIMEM["Greenwich",0,
#> ANGLEUNIT["degree",0.0174532925199433]],
#> ID["EPSG",4152]],
#> CONVERSION["NAD_1983_HARN_StatePlane_Missouri_East_FIPS_2401",
#> METHOD["Transverse Mercator",
#> ID["EPSG",9807]],
#> PARAMETER["Latitude of natural origin",35.8333333333333,
#> ANGLEUNIT["degree",0.0174532925199433],
#> ID["EPSG",8801]],
#> PARAMETER["Longitude of natural origin",-90.5,
#> ANGLEUNIT["degree",0.0174532925199433],
#> ID["EPSG",8802]],
#> PARAMETER["Scale factor at natural origin",0.999933333333333,
#> SCALEUNIT["unity",1],
#> ID["EPSG",8805]],
#> PARAMETER["False easting",250000,
#> LENGTHUNIT["metre",1],
#> ID["EPSG",8806]],
#> PARAMETER["False northing",0,
#> LENGTHUNIT["metre",1],
#> ID["EPSG",8807]]],
#> CS[Cartesian,2],
#> AXIS["(E)",east,
#> ORDER[1],
#> LENGTHUNIT["metre",1]],
#> AXIS["(N)",north,
#> ORDER[2],
#> LENGTHUNIT["metre",1]],
#> USAGE[
#> SCOPE["Not known."],
#> AREA["United States (USA) - Missouri - counties of Bollinger; Butler; Cape Girardeau; Carter; Clark; Crawford; Dent; Dunklin; Franklin; Gasconade; Iron; Jefferson; Lewis; Lincoln; Madison; Marion; Mississippi; Montgomery; New Madrid; Oregon; Pemiscot; Perry; Pike; Ralls; Reynolds; Ripley; Scott; Shannon; St Charles; St Francois; St Louis; Ste. Genevieve; Stoddard; Warren; Washington; Wayne."],
#> BBOX[35.98,-91.97,40.61,-89.1]],
#> ID["ESRI",102296]]
If these data fail that process, st_crs() can be used to identify whether the source or the target (or both) are in an inappropriate coordinate system. The output above shows that the wards data use a projected coordinate system (or “planar” data), identified by the ESRI value 102296, which corresponds to the State Plane Missouri East system. Another widely used family of projected coordinate systems is the Universal Transverse Mercator (UTM) system; for example, the EPSG value 26915 refers to UTM Zone 15N, which covers eastern Missouri. UTM is a good option for using areal because it is available around the world. There are a variety of tools for identifying the appropriate UTM zone, including this excellent interactive map. Once you have a UTM zone, the website epsg.io can be used to search for the appropriate EPSG value.
In this case, the wards data are in an appropriate system but the race83 data need to be transformed. The st_transform() function can be used to modify them to an appropriate coordinate system:
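A sketch of the transformation, assuming UTM Zone 15N (EPSG 26915) as discussed above:
# project the geographic data to a planar coordinate system (UTM 15N is one reasonable choice)
raceFixed <- st_transform(race83, crs = 26915)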
Once this is done, these data pass the planar CRS check. Note, however, that the CRS match check will still fail if the source and target remain in two different projected coordinate systems, as they do here:
ar_validate(source = raceFixed, target = wards, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match FALSE
#> 3 CRS is Planar TRUE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source TRUE
#> 6 No Variable Conflicts in Target TRUE
#> 7 Overall Evaluation FALSE
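To clear the remaining CRS match failure, the source can instead be transformed into the target’s exact coordinate system. A minimal sketch:
# re-project the source so its coordinate system matches the wards data exactly
raceFixed <- st_transform(race83, crs = st_crs(wards))
With both layers in the same planar system, every validation test should now return TRUE.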
If a variable does not exist in the source data, the validation process will fail:
ar_validate(source = race, target = wards, varList = "TOTAL", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source FALSE
#> 6 No Variable Conflicts in Target TRUE
#> 7 Overall Evaluation FALSE
In this case, the fix is as easy as correcting the typo in our ar_validate() call - the variable TOTAL should really be TOTAL_E:
names(race)
#> [1] "GEOID" "STATEFP" "COUNTYFP" "TRACTCE" "NAMELSAD" "ALAND"
#> [7] "AWATER" "TOTAL_E" "TOTAL_M" "WHITE_E" "WHITE_M" "BLACK_E"
#> [13] "BLACK_M" "AIAN_E" "AIAN_M" "ASIAN_E" "ASIAN_M" "NHPI_E"
#> [19] "NHPI_M" "OTHER_E" "OTHER_M" "TWOPLUS_E" "TWOPLUS_M" "geometry"
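A sketch of the corrected call; since TOTAL_E does exist in the source, it should now validate cleanly:
# re-run validation with the correct variable name
ar_validate(source = race, target = wards, varList = "TOTAL_E", method = "aw")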
Another common problem is that a variable exists in both our source and target data. For instance, we could add a TOTAL_E variable to our target data:
wardsVar <- mutate(wards, TOTAL_E = seq(1:28))
ar_validate(source = race, target = wardsVar, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source TRUE
#> 6 No Variable Conflicts in Target FALSE
#> 7 Overall Evaluation FALSE
Now, we fail the validation process because there is a conflict between the source and target data - TOTAL_E exists in both data sets. We can use the select() function from dplyr to remove the offending variable from the target data before proceeding:
wardsFixed <- select(wardsVar, -TOTAL_E)
ar_validate(source = race, target = wardsFixed, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 7 × 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Polygon Geometries TRUE
#> 5 Variables Exist in Source TRUE
#> 6 No Variable Conflicts in Target TRUE
#> 7 Overall Evaluation TRUE
Another option would be to rename the column using the rename() function from dplyr or by using another approach for renaming columns. Either option will work for areal’s purposes.
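A minimal sketch of the rename approach; TOTAL_WARD is a hypothetical new name:
# rename the conflicting column rather than dropping it
wardsRenamed <- rename(wardsVar, TOTAL_WARD = TOTAL_E)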
One cautionary note when using areal is that the validation process will not check to see if too many variables or observations exist in the data. Cleaning the data ahead of time so that only the relevant observations exist can help improve performance - if you have tract data for an entire state but only need a particular city, everything will go faster if you subset these data using the filter() function from dplyr:
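A sketch of that subsetting, assuming a statewide tract object named moTracts with a COUNTYFP column (St. Louis City is county 510):
# keep only the tracts for St. Louis City (moTracts is an assumed name)
stlTracts <- filter(moTracts, COUNTYFP == "510")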
Having too many columns poses an organizational problem more than anything else. We strongly encourage users to limit their interpolated data to only the needed values. For example, the wards data contain an AREA column whose units we don’t know and an OBJECTID column that is an artifact from its creation using an ESRI product.
We can use the select() function from dplyr to remove these before interpolation since we don’t need them:
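A minimal sketch of that clean-up:
# drop the columns we don't need before interpolating
wardsClean <- select(wards, -OBJECTID, -AREA)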