vignettes/gat_tech_notes.Rmd
This package is maintained by the New York State Department of Health (NYSDOH) Environmental Public Health Tracking (EPHT) program. The purpose of this package is to run the Geographic Aggregation Tool (GAT) in R.
Health outcome maps with fine geographic resolution can be misleading due to random fluctuations in disease rates caused by small numbers. In some cases, these maps can also inadvertently disclose confidential data. To overcome these limitations, NYSDOH EPHT developed GAT to join neighboring geographic areas together until a user-defined population and/or number of cases is reached. GAT can be used to produce maps for the public at the finest geographic resolution practicable.
The input object is a shapefile (*.shp). For shapefile requirements, see Preparing your shapefile for GAT. At minimum, GAT requires a polygon shapefile with a character variable that uniquely identifies areas and a numeric variable to sum for aggregation.
A series of dialog boxes allows the user to select:
- a variable to uniquely identify areas,
- one or more aggregation variables,
- optionally, a variable of areas within which merging will be preferred (e.g., county),
- the amount (sum) the aggregation variable(s) should be aggregated to, and
- the preferred aggregation method: closest, least value, or ratio.
The output includes: a crosswalk shapefile, an aggregated shapefile with associated data, a PDF with several maps, a comprehensive log, and, if requested, a KML file. For output regions, most numeric variables will be summed, except variables with the following names, which will be averaged instead: x, y, GATx, GATy, lat, lon, latitude, longitude, and long. Values of character variables will be concatenated. The shapefile can be used in ArcGIS, MapInfo, and QGIS. The KML file can be displayed in Google Earth and other Internet-based mapping programs.
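The combination rules for output regions can be sketched in base R as follows. This is an illustration only, not GAT's implementation: `combine_column()` and the sample data are hypothetical, the name matching is assumed to be exact, the average is assumed to be an unweighted mean, and the separator used for concatenation is assumed to be a comma.

```r
# Variable names that are averaged rather than summed (from the text above)
avg_names <- c("x", "y", "GATx", "GATy", "lat", "lon",
               "latitude", "longitude", "long")

# Hypothetical helper: combine one column's values for a merged region
combine_column <- function(values, name) {
  if (is.character(values)) {
    paste(values, collapse = ", ")  # assumption: comma-separated
  } else if (name %in% avg_names) {
    mean(values)                    # assumption: unweighted mean
  } else {
    sum(values)
  }
}

# Two hypothetical areas that end up in the same region
areas <- data.frame(
  name = c("Town A", "Town B"),
  pop  = c(200, 400),
  lat  = c(43.0, 44.0),
  stringsAsFactors = FALSE
)

# Apply the rule to every column, passing the column name along
region <- Map(combine_column, areas, names(areas))
str(region)
```

Here `pop` is summed, `lat` is averaged because of its name, and `name` is concatenated.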
The package includes two embedded shapefiles to use when testing several of the package functions. For details on these shapefiles, enter `?hftown` or `?hfpop` into the R console.
GAT creates several variables in both the original shapefile, which it resaves as a crosswalk, and the aggregated shapefile. These variables are listed below. For the flag variable, a brief data dictionary is provided both below and in the log produced by GAT, to help identify possible issues in the aggregation.
The crosswalk shapefile will have the following variables added:
The aggregated shapefile will have the following variables added:
The overall goal of the merging is to create a relatively large number of compact regions that meet the specified criteria. To accomplish this, the areas in the input shapefile are aggregated pairwise until all regions have the minimum values specified.
First, a subset of areas is selected that meets the following restrictions:
This results in a subset of areas that all require merging. This subset is reordered from the highest values of the aggregation variable(s) to the lowest. If there is only one aggregation variable, the area with the highest value is merged first. If there is more than one aggregation variable, the process is more complicated. When two aggregation variables are selected, each aggregation variable is divided by the minimum value specified for it. The area with the highest proportion is selected to be aggregated first.
For example, given the following:
- minimum for first aggregation variable: 1000 population
- minimum for second aggregation variable: 100 births
- first area: 200 population, 90 births
- second area: 400 population, 50 births
Calculate proportions of area value over minimum value for the first and second areas as follows:
- first area: \(\frac{200 \text{ population}}{1000 \text{ minimum population}} = 0.2\), \(\frac{90 \text{ births}}{100 \text{ minimum births}} = 0.9\); maximum proportion = 0.9
- second area: \(\frac{400 \text{ population}}{1000 \text{ minimum population}} = 0.4\), \(\frac{50 \text{ births}}{100 \text{ minimum births}} = 0.5\); maximum proportion = 0.5

Of these two areas, the first area has a higher maximum proportion, so it will be merged first.
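The same arithmetic can be reproduced in a few lines of base R. This is only a sketch of the ordering rule using the numbers from the example; the data frame and variable names are hypothetical and are not taken from GAT's code.

```r
# Areas and minimum values from the worked example
areas <- data.frame(
  id         = c("first", "second"),
  population = c(200, 400),
  births     = c(90, 50),
  stringsAsFactors = FALSE
)
min_population <- 1000
min_births     <- 100

# Proportion of each minimum that each area has already reached
prop_population <- areas$population / min_population
prop_births     <- areas$births / min_births

# Each area is represented by its larger proportion
areas$max_prop <- pmax(prop_population, prop_births)

# The area with the highest maximum proportion is merged first
merge_order <- areas$id[order(areas$max_prop, decreasing = TRUE)]
merge_order
```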
After the target area has been identified, GAT generates a list of its neighbors. Areas are considered neighbors if they share at least two points with the selected area. However, not all of these neighbors may be eligible for merging. GAT considers several criteria when selecting eligible areas.
- Adjacent: by default, eligible areas must be adjacent to the target area. You can relax this with `runGATprogram(adjacent = FALSE)`, which will result in adjacent areas being preferred, but not required.
- Inside border: with `adjacent = FALSE`, GAT will select the closest eligible area within the border, even if it is not adjacent. If you select a border and do not check the box to enforce the border, GAT will consider neighbors outside the border last.
- Value below minimum: to prefer neighbors whose aggregation values are still below the minimum value, use `runGATprogram(minfirst = TRUE)`. If there are no neighbors with aggregation values below the minimum value, GAT will recheck for neighbors without this restriction.

When relaxing eligibility restrictions, GAT drops criteria in this order: Value below minimum (if triggered by `minfirst = TRUE`), Adjacent (only if `adjacent = FALSE`), Inside border (only if border not enforced). If no eligible areas are identified after relaxing these restrictions, the target area is flagged as unable to merge further. For each area for which selection criteria were relaxed, a warning is written to the log.
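The relaxation order can be pictured with a small base R sketch. It is not GAT's code: `find_eligible()` and the logical columns `below_min`, `adjacent`, and `inside_border` are hypothetical, and the sketch assumes that a criterion that cannot be dropped stays in force as a hard requirement.

```r
# Hypothetical helper: returns eligible neighbor ids and any dropped criteria
find_eligible <- function(nb, adjacent = TRUE, minfirst = FALSE,
                          enforce_border = TRUE) {
  # Criteria currently in force; below_min only applies with minfirst = TRUE
  active <- c(below_min = minfirst, adjacent = TRUE, inside_border = TRUE)
  # Criteria that may be relaxed, in the order described above
  droppable <- c(if (minfirst) "below_min",
                 if (!adjacent) "adjacent",
                 if (!enforce_border) "inside_border")
  dropped <- character(0)
  repeat {
    keep <- rep(TRUE, nrow(nb))
    for (crit in names(active)[active]) keep <- keep & nb[[crit]]
    # Stop once something is eligible or nothing is left to relax
    if (any(keep) || length(droppable) == 0) break
    active[droppable[1]] <- FALSE
    dropped <- c(dropped, droppable[1])
    droppable <- droppable[-1]
  }
  list(eligible = nb$id[keep], dropped = dropped)
}

# Hypothetical neighbors of one target area
nb <- data.frame(
  id            = c("B", "C", "D"),
  below_min     = c(FALSE, FALSE, TRUE),
  adjacent      = c(TRUE, FALSE, FALSE),
  inside_border = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

res <- find_eligible(nb, adjacent = FALSE, minfirst = TRUE)
res
```

In this example no neighbor meets all three criteria, so the "value below minimum" criterion is dropped first, after which area B becomes eligible. An empty `eligible` result corresponds to the case where the target area is flagged as unable to merge further.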
If the area does not have adjacent neighbors and you used `adjacent = FALSE`, GAT will default to the closest geographic centroid unless you chose population weighting, in which case it will use the closest population-weighted centroid. If the area has more than one eligible adjacent neighbor, the next few sections cover how the neighbor to merge is selected, depending on the desired merge type.
The new region then becomes a candidate for additional merging, and merging continues until all regions contain the minimum values of the aggregation variables specified or have been flagged as being unable to merge further.
The target area is merged with the eligible area whose geographic centroid, based on the shape of the area, is closest in straight-line distance. The centroid of the new region is calculated by averaging the latitudes and longitudes of the two areas’ centroids. This speeds up GAT considerably, but can result in centroids that do not exactly match the region’s center.
In the graphic above, the area with the largest value below 5 is E. The closest neighbor to E using geographic centroids is C, so GAT assigns E to merge with C. The next largest value below 5 is in area A, whose closest neighbor is the new area CE, so GAT merges A to CE. The last area below 5 is D, which is closest to B, so GAT merges D with B.
This method provides the most compact areas and can work well for environmental exposures like air pollution and temperature.
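A minimal sketch of the closest-centroid choice and the averaged centroid described in this section, in base R. The coordinates and labels are hypothetical, plain Euclidean distance is assumed, and the code is not drawn from GAT itself.

```r
# Hypothetical centroid of the target area
target <- c(x = 2, y = 2)

# Hypothetical centroids of its eligible neighbors
neighbors <- data.frame(
  id = c("N1", "N2", "N3"),
  x  = c(6, 3, 2),
  y  = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# Straight-line distance from the target centroid to each neighbor centroid
dist_to_target <- sqrt((neighbors$x - target[["x"]])^2 +
                       (neighbors$y - target[["y"]])^2)

# Merge with the neighbor whose centroid is closest
closest <- neighbors[which.min(dist_to_target), ]

# Approximate the merged region's centroid by averaging the two centroids
new_centroid <- c(
  x = mean(c(target[["x"]], closest$x)),
  y = mean(c(target[["y"]], closest$y))
)
closest$id
new_centroid
```

Averaging two centroids avoids recomputing the centroid from the merged polygon, which is why the result may sit slightly off the true center.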
The population file the user selected is used to determine the area’s centroid based on the underlying population. The target area is merged with the eligible area with the closest population-weighted centroid. There are two methods for calculating the population-weighted centroid for the new area. The faster method, which is the default, is to calculate the proportion of the population each area contributes to the merged area, then multiply these proportions by their respective latitudes and longitudes and sum the results. For example, if area A contributes 70% of the population and area B contributes 30% of the population, the new longitude would be \(0.7 \times \text{longitude}_A + 0.3 \times \text{longitude}_B\). The slower method recalculates the centroid of the new area using the underlying populations of both areas in the population file. To access this method, use `runGATprogram(pwrepeat = TRUE)`.
In the graphic above, the area with the largest value below 5 is E. The closest neighbor to E using population-weighted centroids is D, so GAT assigns E to merge with D. The next largest value below 5 is in the new area DE, whose closest neighbor is C, so GAT merges DE to C. The last area below 5 is A, which is closest to B, so GAT merges A with B.
This method provides areas that may sprawl a bit, but are representative of population centers and can work well for disease rates.
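The faster, default calculation of the merged population-weighted centroid amounts to a weighted average. The base R sketch below uses hypothetical populations and coordinates chosen to match the 70%/30% example; it is an illustration, not GAT's code.

```r
# Hypothetical populations and population-weighted centroids of two areas
pop <- c(A = 700, B = 300)
lon <- c(A = -75.0, B = -74.0)
lat <- c(A = 43.0, B = 44.0)

# Share of the merged population contributed by each area (0.7 and 0.3)
w <- pop / sum(pop)

# Weighted average of the coordinates
new_lon <- sum(w * lon)  # 0.7 * longitude_A + 0.3 * longitude_B
new_lat <- sum(w * lat)  # 0.7 * latitude_A  + 0.3 * latitude_B
c(lon = new_lon, lat = new_lat)
```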
If there is only one minimum value specified, the eligible neighbor with the lowest value is selected. If there are two minimum values specified, the eligible neighbor with the lowest proportion of aggregation variable to minimum value specified is selected.
In the graphic above, the area with the largest value below 5 is E. E’s neighbor with the smallest value is D, so GAT assigns E to merge with D. The next largest value below 5 is in the new area DE, whose neighbor with the smallest value is A, so GAT merges DE to A.
This method provides the largest number of areas and therefore the greatest granularity. However, it can result in irregular, snaking shapes and possibly donuts, where a rural area entirely surrounds an urban one.
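The least-value choice can be sketched in base R as follows. The neighbor data are hypothetical. For the two-variable case, the sketch assumes each neighbor is summarized by its larger proportion, mirroring the ordering step earlier in this document; the text does not state how the two proportions are combined, so treat that as an assumption.

```r
# Hypothetical eligible neighbors of the target area
neighbors <- data.frame(
  id         = c("N1", "N2", "N3"),
  population = c(1500, 300, 600),
  births     = c(40, 80, 20),
  stringsAsFactors = FALSE
)

# One minimum value: pick the neighbor with the lowest value
pick_one <- neighbors$id[which.min(neighbors$population)]

# Two minimum values: compare proportions of value to minimum
min_population <- 1000
min_births     <- 100
prop <- pmax(neighbors$population / min_population,  # assumption: use the
             neighbors$births / min_births)          # larger proportion
pick_two <- neighbors$id[which.min(prop)]

c(one_minimum = pick_one, two_minimums = pick_two)
```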
The selected area is merged with the area with the least absolute difference between the ratios of the variables chosen for similarity comparison. For example, suppose the variables chosen for similarity comparison are counts of persons under poverty and total population. If the area to be merged has a percent under poverty of 10%, and its neighbors have 9%, 13%, and 15% under poverty, then it would be merged with the area with 9% under poverty.
In the graphic above, the area with the largest value below 5 is E. E’s neighbor with the closest ratio is D, so GAT assigns E to merge with D. The next largest value below 5 is in the new area DE, whose neighbor with the closest ratio is B, so GAT merges DE to B. The last value below 5 is in A, whose neighbor with the closest ratio is C, so GAT merges A to C.
This method works well if you want to create areas of similar populations to investigate social determinants of health.
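The poverty example for the ratio method can be written out in base R. The counts below are hypothetical values chosen to reproduce the 10%, 9%, 13%, and 15% figures; the code is a sketch, not GAT's implementation.

```r
# Target area: 100 of 1000 residents under poverty (10%)
target_ratio <- 100 / 1000

# Hypothetical eligible neighbors with 9%, 13%, and 15% under poverty
neighbors <- data.frame(
  id      = c("N1", "N2", "N3"),
  poverty = c(90, 130, 150),
  total   = c(1000, 1000, 1000),
  stringsAsFactors = FALSE
)
neighbors$ratio <- neighbors$poverty / neighbors$total

# Absolute difference between each neighbor's ratio and the target's ratio
ratio_diff <- abs(neighbors$ratio - target_ratio)

# Merge with the neighbor whose ratio is most similar
most_similar <- neighbors$id[which.min(ratio_diff)]
most_similar
```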
The program may merge very slowly if the geographic boundaries are unnecessarily complex. Boundaries can be simplified by a process known as “thinning” (or “simplification” or “integrating” or “generalizing”). Thinning removes nodes based on how far apart they are and/or their collinearity to decrease the complexity of the maps (see figure below). GIS systems like MapInfo and ArcGIS can be used to thin maps. After thinning, areas should merge much more quickly in GAT, but the map may appear crude. If high resolution maps are needed for display, the thinned map used for processing data can be linked back to the high resolution boundaries using unique area identifiers.
The compactness ratio is a measure used to assess how well areas were aggregated. Ideally, areas should be combined with other areas very close to them under the assumption that people tend to live closer to people with whom they are more similar. A circle is the most compact shape, so an area that is relatively round, like St. Lawrence County, New York (below right; compactness ratio = 0.71), will be more compact than an area that is long and narrow, like Cayuga County, New York (below left; compactness ratio = 0.18).
To calculate compactness ratio, the perimeter of the shape is calculated, then a circle with a circumference equal to the perimeter of the shape is drawn and its area is calculated. Next, the area of the shape is divided by the area of the circle. A highly compact shape will have a compactness ratio close to 1.
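The calculation described above reduces to a short formula: a circle with circumference \(P\) has radius \(P / 2\pi\) and area \(P^2 / 4\pi\), so the ratio is \(4 \pi A / P^2\) for a shape with area \(A\) and perimeter \(P\). The base R sketch below applies it to a few simple shapes; `compactness_ratio()` is a hypothetical helper name, not a function confirmed to exist in the package.

```r
# Hypothetical helper: area of the shape divided by the area of a circle
# whose circumference equals the shape's perimeter
compactness_ratio <- function(area, perimeter) {
  circle_area <- perimeter^2 / (4 * pi)
  area / circle_area
}

# A circle of radius 1 is perfectly compact
circle <- compactness_ratio(area = pi, perimeter = 2 * pi)

# A unit square is fairly compact (pi / 4, about 0.79)
square <- compactness_ratio(area = 1, perimeter = 4)

# A long, narrow 10 x 1 rectangle is much less compact (about 0.26)
strip <- compactness_ratio(area = 10, perimeter = 22)

round(c(circle = circle, square = square, strip = strip), 2)
```

Area and perimeter must be measured in consistent units (for example, square meters and meters) for the ratio to be meaningful.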