Gannett Fleming, Inc.
Brigham Young University
This study explores how cluster analysis can be used to categorize a large number of planning districts in a region into a smaller, manageable number of land-use scenarios consisting of planning districts of similar land-use patterns whose mean land-use distributions can be used as future land-use alternatives for those planning districts. We used Utah's Wasatch Front region for the analysis. After applying a family of cluster analysis methods, we were able to group the 343 planning districts in the region into 35 land-use planning scenarios. A combination of the Ward's linkage method, the Squared Euclidean distance measure, and the Z-score standardization of variables produced the most logical clustering of planning districts for the region.
A recent survey on transportation planning issues and needs for planning research conducted by the Transportation Research Board indicated that one quarter of respondents identified research relating to land-use planning as a top-priority topic area (TRB 2000). Land-use and transportation systems interact to form an urban landscape, and the two components must be considered together in transportation planning to create a livable urban area (Vuchic 1999). For many metropolitan planning organizations (MPOs), this ideal has been difficult, if not impossible, to carry out because local governments have jurisdiction over land-use planning, whereas regional transportation planning is often done by state agencies. Often, land-use and transportation planning done separately have resulted in undesirable urban sprawl and traffic congestion in urbanized areas. The development of urban planning procedures that integrate land-use and transportation planning while allowing all participants access to the decisionmaking process is needed for transportation planning in the new century.
In a study funded by the National Science Foundation, Balling and others developed a multi-objective genetic algorithm model to simultaneously optimize a land-use and transportation network within a city (Balling et al. 1999; Taber et al. 1999). This procedure quickly examines an extremely large search set of feasible plans and narrows the number of alternatives to be considered. The model optimizes land-use and transportation network plans with the objective of minimizing travel time on the street network, cost to the city, and change from the current status, since large changes are typically politically infeasible. Other objective functions dealing with more current trends in land-use and transportation planning are currently being considered as additions to the model. The model was first applied to Provo, Utah, and then to the twin cities of Provo and Orem. See figure 1 for their locations in Utah.
This is a new paradigm in urban planning. The model produces a set of optimized land-use and transportation infrastructure plans, and the plans are presented to those involved in the planning effort, such as city council members, planners, and citizen groups. Each plan in the Pareto set is optimal for a different weighting of the competing objectives, allowing participants the opportunity to explore compromise solutions rather than being forced to choose from only a few plans.
The second phase of the model development expands this genetic algorithm model to regional urban planning. The proposed model aims to produce macro-level Pareto plans for a multi-city metropolitan region and optimize land-use and transportation corridors between the cities. These plans will not restrict micro-planning done at the city level. The proposed model will produce optimal land-use plans that give target scenarios for land-use distribution for each planning unit. In this study, the planning units were named "districts." The model might find, for example, that a district in a particular city would best benefit regional objectives if it had a mix of 40% low-density residential land use, 20% medium- and high-density residential use, and 10% each of commercial, industrial, and open-space uses. A city cooperating with the regional planning organization would try to meet such target scenarios but would be free to plan any conceivable layout of land use within the city's planning districts in order to optimize local objectives within the framework of the regional objectives.
The very nature of the genetic algorithm requires potential scenarios of land-use distribution to be discrete rather than continuous variables. Therefore, the objective of this study was to determine the suitability of cluster analysis for the creation of just such a scenario set. The Wasatch Front region of Utah, consisting of Weber, Davis, Salt Lake, and Utah counties, was selected as the study area (see figure 1). Geographic Information Systems (GIS) data defining suitable districts did not exist. Therefore, we gathered and manipulated land-use and other necessary data to create approximately 300 planning districts (modeler's discretion), each with an approximately known land-use distribution. Once the percentage of distribution of land use for each district was found, districts were grouped, or categorized, by cluster analysis to create a set of 20 or 30 land-use distribution scenarios. The cluster means for each scenario will be used by the aforementioned planning model as the discrete values for possible future land-use scenarios.
Data necessary for accomplishing the objective of this study were collected from various sources: the Wasatch Front Regional Council, Mountain Land Association of Governments, State of Utah Automated Geographic Resource Center, Salt Lake County, Utah County, and various city planners and engineers.
Most of the information needed for this study is in GIS format. The GIS files contain map shapes representing parcels (individually owned plots of land), city boundaries, and boundaries for districts created for analysis. Associated with the parcel shapes are codes for various types of land use (for example: residential-low density, residential-high density, industrial, commercial, agricultural), from which percentages for each type of land use in each district were derived. Of particular advantage was the fact that the bulk of the land-use data was based not on zoning but rather on information collected for individual parcels from county recorders' offices. Only where gaps in the land-use data existed was zoning used as a rough approximation.
Land-use data for the Wasatch Front region exist in a variety of formats and data structures. While all the cities and counties maintain some type of data, the systems of classifications used vary in scope, detail, and accuracy. Making these different land-use classification systems congruent for the entire region would be a tremendous task. In order to get the best overall picture of land-use scenarios existing along the Wasatch Front, it was decided to use one source, the State of Utah Automated Geographic Resource Center (AGRC 1997), for all land-use data wherever possible. To draw appropriate district boundaries and find the land-use scenarios that exist in those districts, these data from AGRC needed to be combined with city boundary and parcel information available from the cities, counties, and MPOs.
In order to accomplish the first objective of the study, we followed a 14-step procedure. This procedure is briefly outlined in table 1, and detailed discussions of it can be found in Smith (2000). The GeoProcessing Wizard of ArcView (ESRI 1999) and user-written Avenue scripts were used as aids in constructing district boundaries. District boundaries were constructed using a combination of city boundaries, traffic analysis zone boundaries, and city-provided neighborhood boundaries.
Table 1 - GeoProcessing® Steps Used in Preparing Land-Use Data for Cluster Analysis
An attempt was made to exclude from the districts most lands not considered candidates for future development. Undevelopable lands were defined as those lands covered by water or wetlands, with gradient slopes over 25%, or owned by certain public agencies such as the Forest Service or the Division of Wildlife Resources and not available for future development.
Figure 1 shows the result of the geoprocessing work; in total we created 343 planning districts. Figure 2 (after page 42) shows parcel-level land-use data for the area covered by the 343 planning districts, along with their land-use codes. The land-use categories shown in figure 2 are the ones used by the AGRC and form the basis for this study.
Cluster analysis was chosen to categorize the districts due to the difficulty of creating intuitive groupings for the large number of land uses. If only two variables, such as percentages of residential and commercial uses, were to be considered, it would be a simple enough matter to make a plot of percentage residential versus percentage commercial, showing points for each of the 343 districts. Boundaries could then be drawn around groups of points to separate them into the desired number of categories for different scenarios. With the 13 variables involved in this study, however, such an exercise is impossible.
Cluster analysis is a family of methods that seeks to explore the structure of a data set by defining the relationships between individual observations in the set, such as planning districts in this study. Such analysis is particularly useful when no preconceived idea of the proper manner of data classification exists. The MINITAB software package (MINITAB 1999) was used to perform cluster analysis on the land-use data.
The objective of the cluster analysis was to obtain land-use scenarios categorized by a reasonable system of classification that would include the following:
The usual agglomerative clustering procedure was used for this study. This means that for a given data set, each step in the analysis agglomerates, or groups together, two clusters, which may each be either individual observations or sets of observations grouped together in a previous step. Thus, analysis of a data set with 100 observations would begin by treating each observation as its own cluster. The first step would reduce the number of clusters to 99 by grouping the two clusters closest together into a new cluster. Each step would then group two more clusters together until, after 99 steps, only one cluster of 100 observations remained. The analyst would then look at the various cluster groupings at different steps in the process to decide when the observations are most appropriately categorized. We acknowledge that the subjective nature of this step is a widely held criticism of cluster analysis as a technique. Nonetheless, we feel that the cluster analysis is a useful exploration tool for land-use data since this level of subjectivity is much lower than that usually employed by planners in classifying land-use scenarios.
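The agglomerative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the MINITAB implementation used in the study: the function names, the choice of average linkage as the default, and the five two-variable districts are all assumptions made for the example.

```python
import math
from itertools import combinations


def average_link(points, c1, c2):
    """Illustrative cluster distance: mean Euclidean distance over all member pairs."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)


def agglomerate(points, n_clusters, cluster_distance=average_link):
    """Merge the two closest clusters at each step until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each observation starts as its own cluster
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(points, clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Hypothetical districts described by (% residential, % commercial).
districts = [(80, 10), (78, 12), (20, 70), (22, 68), (50, 50)]
result = agglomerate(districts, 3)
```

Stopping at three clusters here stands in for the analyst's judgment about which step gives the most appropriate grouping; passing a different `cluster_distance` function changes the linkage method without changing the procedure.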
The manner in which closeness of observations is measured is called the distance measure. While the distance between clusters is relatively straightforward if each cluster contains only a single observation, the matter of measuring distances between clusters of many observations becomes more complex. Consequently, various linkage methods exist to determine distances between clusters containing multiple observations. When calculating distances between clusters, each variable is assumed to be on the same scale unless some standardization technique is employed.
The Distance Measure
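In two-dimensional space, the distance measure may be visualized by connecting two points representing two observations, i and j. The most widely used distance measure, the Euclidean distance, is the straight-line distance between the two points, calculated in N-space as

$$ d_{ij} = \sqrt{\sum_{k=1}^{N} (x_{ik} - x_{jk})^2} $$

where x_{ik} and x_{jk} are the values of variable k for observations i and j.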
The Euclidean distance may be squared in order to further reduce the likelihood of very dissimilar observations being clustered together. The Pearson distance, which may also be squared, is similar to the Euclidean distance, but incorporates the variances of each variable (x1, x2,..., xN) in order to reduce the portion of the distance contributed by variables with high variance (MINITAB 1999).
Another accepted measure of distance is the Manhattan distance, measured by summing the absolute values of the distances along axes between observations in N-space.
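The distance measures named in this section can be written compactly as follows. This is a sketch under stated assumptions: the function names and the two sample districts are hypothetical, and the Pearson form shown (each squared difference divided by that variable's variance) follows the description given above rather than any verified software source.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two observations in N-space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Euclidean distance squared; penalizes large differences more heavily."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y, variances):
    """Euclidean-like distance with each squared difference divided by the variable's variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


# Two hypothetical districts described by three land-use percentages.
d1 = (40, 20, 10)
d2 = (43, 16, 10)
assumed_variances = (9, 16, 1)  # illustrative values only

print(euclidean(d1, d2))          # 5.0
print(squared_euclidean(d1, d2))  # 25
print(manhattan(d1, d2))          # 7
print(pearson(d1, d2, assumed_variances))
```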
The Linkage Method
Seven different linkage methods available in the MINITAB software (MINITAB 1999) were used: single linkage, complete linkage, average linkage, centroid linkage, median linkage, McQuitty's linkage, and Ward's linkage. Detailed discussions of the linkage methods can be found in standard statistics textbooks and software user's manuals. To illustrate the issues involved in selecting a proper linkage method, this paper discusses two of them: single linkage and Ward's linkage.
The single linkage method defines the distance between any two clusters as the shortest distance between any observation in the first cluster and any observation in the second cluster. It was found that the chaining of observations tends to produce one very large cluster and other very small clusters. Consequently, the variable space within the large cluster is not very well explored.
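The chaining tendency of single linkage can be seen in a small one-dimensional example. The sketch below is illustrative only; the function names and the sample values are assumptions, and the Manhattan distance is used simply to keep the arithmetic easy to follow.

```python
def manhattan(x, y):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(x, y))


def single_linkage(cluster_a, cluster_b, distance=manhattan):
    """Shortest distance between any member of one cluster and any member of the other."""
    return min(distance(p, q) for p in cluster_a for q in cluster_b)


# A hypothetical cluster that has already grown along a line, and its next neighbor.
chain = [(0,), (1,), (2,), (3,)]
neighbor = [(4,)]

nearest_gap = single_linkage(chain, neighbor)              # distance to the closest member
centroid_of_chain = sum(p[0] for p in chain) / len(chain)  # center of the existing cluster
centroid_gap = abs(neighbor[0][0] - centroid_of_chain)     # distance to that center

print(nearest_gap, centroid_gap)  # 1 2.5
```

Because only the closest pair of members matters, the neighbor joins the chain at a distance of 1 even though it lies 2.5 from the cluster's center, which is how one cluster can keep absorbing observations.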
In Ward's linkage, chosen as the final cluster method for this study, the distance between any two clusters is the sum of the squared deviations between the centroid and the points of the new cluster that would be formed by joining the two clusters. MINITAB uses an approximation to this distance. The objective of this method is to produce clusters with a minimal amount of within-cluster variance. Ward's linkage tends to produce clusters of similar numbers of observations since a disproportionately high number of observations in a cluster would result in a higher number of squared deviations to be summed, thus tending to increase the distance across which the cluster must be formed.
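Ward's criterion as described above can be sketched as the sum of squared deviations about the centroid of the cluster that a merge would create. This is a minimal illustration with hypothetical function names and data; it is not MINITAB's approximation, and many implementations instead use the increase in this sum caused by the merge.

```python
def centroid(points):
    """Mean of each variable over the points in a cluster."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sum_squared_deviations(points):
    """Sum of squared deviations between each point and the cluster centroid."""
    center = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, center))


def ward_distance(cluster_a, cluster_b):
    """Within-cluster sum of squares of the cluster formed by joining a and b."""
    return sum_squared_deviations(list(cluster_a) + list(cluster_b))


# Hypothetical clusters in two variables.
a = [(0, 0), (2, 0)]
b = [(1, 3)]
print(ward_distance(a, b))  # 8.0
```

Every additional member contributes another squared deviation to the total, which is the mechanism behind the tendency toward clusters of similar size noted above.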
Because the variables in this clustering problem have varying distributions, Z-score standardization was employed before calculating the distance matrix. Standardization of variables in a clustering problem can have both advantages and disadvantages. Consider the following simplification of the land-use scenario classification problem for two imaginary districts (refer to figure 2 for the land use codes shown here): (districts table 1)
At first glance, these two districts may seem to be very similar in terms of their distribution of land use. Without any standardization of variables, these two districts would likely be clustered together in the same scenario type due to the low distance between them. Suppose, however, that the total range of percentages for R4, mobile homes, is only 0 to 10% for all of the districts being clustered. Also, suppose the average value for R4 is 0.5%, that its standard deviation is 1.5%, and that all of the other variables are very near their mean values. The 9% value for mobile homes is now clearly an extreme value for that variable. If such scenarios were always clustered together at relatively early stages, no diversity in mobile home land use would be apparent from the final cluster grouping. All of the cluster means for the different scenario types would reflect mobile home land use of about 0.5%. Transforming these percentages into Z-scores remedies this problem. The Z-score for R4 for district 2 would be (9 - 0.5) / 1.5 = 5.67, a relatively high value. Since all other variables are near their means, their Z-scores would all be near zero. The above comparison, now standardized, would look approximately like this: (districts table 2)
Remembering that a standardized value of even one (one standard deviation) is a significant departure from the average, we can see that these districts would now be judged different enough based on the distance between them to remain in separate clusters until much later in the clustering process.
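The Z-score arithmetic for the mobile-home example can be checked with a short sketch. The function names are hypothetical, and the use of the sample standard deviation in `standardize` is an assumption made for illustration rather than a statement about the software used in the study.

```python
import statistics


def z_score(value, mean, std_dev):
    """Number of standard deviations by which a value departs from the mean."""
    return (value - mean) / std_dev


def standardize(values):
    """Z-scores for one variable across all districts (sample standard deviation assumed)."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)
    return [z_score(v, mean, std_dev) for v in values]


# R4 (mobile homes) in district 2: 9% against a mean of 0.5% and a standard deviation of 1.5%.
r4_district_2 = z_score(9.0, 0.5, 1.5)
print(round(r4_district_2, 2))  # 5.67
```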
Consider, though, what might happen with these two hypothetical districts: (districts table 3)
Both would have standardized R4 values of 5.67 and identical, near-zero values for R2, R3, C1, C2, and C3. Standardized values for R1 and AG would be non-zero but not nearly as extreme as 5.67 due to the approximately normal distribution for R1 and AG. These characteristics would likely cause these two districts to cluster together as districts with similarly high percentages of mobile-home use. In many ways, though, this categorization doesn't make much sense because single-family residential and agricultural uses, together accounting for 80% of the land use in the districts, are in opposite proportion to each other. The wide distribution of these variables compared to the narrow distribution for R4 is what makes these districts seem similar when standardization is applied. A primary objective in applying cluster analysis to the 343 districts was to balance these two effects of standardization.
Thirty-eight cluster analyses were applied to the previously mentioned 343 districts using MINITAB statistical software in order to determine the best distance measure, linkage method, and standardization strategy for the data set in question. A table with cluster group assignments for each of the 38 analyses and 343 districts was created and joined to the district coverage's attribute table in ArcView so that results from the 38 analyses could be quickly viewed and compared. Table 2 summarizes the input parameters for the 38 analyses with comments on the results.
In the first 13 cases, different linkage methods and numbers of clusters were tried with the Euclidean distance measure with Ward's method giving the most appropriate distribution of land-use scenarios. Other methods generally sorted the districts into land-use scenario categories of mostly residential, mostly agricultural, mostly industrial, and so forth, without differentiating the distributions of the minor land uses in a district. Ward's method succeeded in partitioning many of these groups into separate scenario categories, particularly among the residential scenarios. The single linkage method was eliminated from consideration due to its tendency toward grouping the majority of the districts in one mega-cluster. With this method, most of the districts are grouped in one land-use scenario (scenario 1) as shown in figure 3 (after page 42). The median and centroid linkages were also judged to give poor enough results to eliminate the need for any further consideration of their use.
For cases 14 to 29, different distance measures were tried with the average, complete, McQuitty, and Ward linkage methods. For some reason, the Pearson and Squared Pearson measures produced a chaining effect similar to that of the single linkage for all but the Ward's linkage. For this reason, a standardization method other than the Pearson distance measure became necessary. The Manhattan and Squared Euclidean measures produced results comparable to the Euclidean.
For cases 30 to 33, Z-score standardization was applied to the variables before clustering for a few of the best combinations of linkage method and distance measure found thus far. With the complete and average linkages, the Z-score standardization also produced a chaining of observations. With Ward's method, the Z-score standardization proved adept in singling out extreme values of oddly distributed variables like mobile homes, apartments, and warehouses. However, doing this broadened the range of the more normally distributed variables like single-family residential and agricultural that could be included in the same scenario. Cases 34 and 35 were created with 30 clusters instead of 20 in an attempt to break up some of the more dissimilar clusters. Still, a few clusters exhibited odd groupings. As an example, the three districts Provo 3, Orem 2, and Utah County 3, were all grouped into the same cluster in case 34, largely based on their similarly high percentages of warehouse use (C3) (see table below).
Clearly, Provo 3 is made up of largely commercial uses; Orem 2 is primarily single-family residential, and Utah County 3 is mostly agricultural. Ideally, the classification scheme chosen would allow for three different scenarios, distinct from those scenarios for residential, commercial, and agricultural that do not include warehouses, to represent these districts.
Case 36 used 30 clusters with no standardization applied to see if the increased number of clusters alone would provide for more distinct groupings without sacrificing representation of the more minor uses like mobile homes, apartments, and warehouses. The range of variables about the means decreased significantly, but most representation of extreme values for the minor variables was lost. For case 37, a compromise standardization procedure was tried before clustering. Instead of using Z-scores to standardize the variables, all variables were scaled such that the minimum value for the variable was 0.0% and the maximum value was 100%. Hence, a value of 5.8% for mobile homes (minimum 0% and maximum 12.1%) in West Valley City was scaled to 48.3% (5.8% ÷ 12.1% × 100%). This method of standardization worked nearly as well as the Z-score method, but the improvement in within-cluster dissimilarity was small enough to make the effort unfruitful.
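The range scaling tried in case 37 can be sketched as below. The function name is hypothetical, and the inputs are the rounded figures printed in the text, so the result (about 48%) may differ slightly from a value computed with unrounded data.

```python
def min_max_scale(value, minimum, maximum):
    """Rescale a value so that the variable's minimum maps to 0% and its maximum to 100%."""
    return (value - minimum) / (maximum - minimum) * 100.0


# Mobile homes in West Valley City: 5.8% on a regional range of 0% to 12.1%.
scaled = min_max_scale(5.8, 0.0, 12.1)
print(round(scaled))  # 48
```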
Consequently, the parameters chosen as best for cluster analysis on the land-use data were Ward's linkage method, the Squared Euclidean distance measure, and Z-score standardization of variables (case 38). Increasing the number of clusters to 35 broke up a few additional odd groupings. The cluster that included the three districts mentioned previously, for example, was broken into two clusters. It was decided that an additional reduction of dissimilarities would have required a greater number of clusters in the final analysis than was desirable; the final number of scenarios became 35.
Table 3 lists the final 35 scenarios with cluster means. Since adding the mean percentages for each scenario did not always result in a total of 100% (sums varied from 92 to 100%), the mean values in table 3 represent the cluster means scaled such that their sums total 100% for each scenario. Figure 4 (after page 42) shows the district map coded for the results of the final cluster analysis with 35 scenarios. This figure shows which planning districts have similar land-use patterns. Table 3 and figure 4 show that the clustering by the Ward's linkage method performed well for the given data set.
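The rescaling of cluster means so that each scenario totals 100% amounts to dividing every mean by the scenario's sum. A minimal sketch follows; the function name and the sample means, chosen to total 92%, are hypothetical and are not taken from table 3.

```python
def rescale_to_100(means):
    """Scale a scenario's mean percentages proportionally so that they sum to 100%."""
    total = sum(means)
    return [m * 100.0 / total for m in means]


# Hypothetical cluster means for one scenario, summing to 92%.
raw_means = [46, 23, 14, 9]
scaled_means = rescale_to_100(raw_means)
print(sum(raw_means), round(sum(scaled_means), 6))  # 92 100.0
```

Proportional scaling preserves the ratios between land uses within a scenario, so the relative mix found by the clustering is unchanged.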
This paper showed cluster analysis to be a viable tool for grouping planning districts into a smaller number of planning scenarios for regional land-use and transportation sketch planning. A comparison of figures 2 and 4 shows that cluster analysis reduces the amount of detail required to represent a wide variety of land-use scenarios but does so without significantly altering the big picture on a regional level. This finding is potentially significant to future land-use and transportation planning projects. Alternative land-use scenarios used in planning models need not be limited to a few hypothetical land-use cases. Rather, multiple scenarios can be generated as warranted to represent accurately all patterns of land use found in a given regional area. Scenarios currently not found in the study region, such as those taken from another region, may then be added to represent a fuller spectrum of possibilities.
Certain challenges must be met in order to apply cluster analysis to land-use data successfully. Chief among these is the proper selection of a distance measure, linkage method, and standardization procedure. The findings of this study do not eliminate the need to reiterate this process for data gathered from areas outside the Wasatch Front region of Utah. Additionally, measures must be taken to ensure that the input land-use data is current and accurate or the investigator risks magnifying the inaccuracies in the final set of derived scenarios. Uniformity among agencies in the categorized description of land use is indispensable in the gathering of timely and accurate data on regional land-use scenarios. Lastly, a good method for drawing district boundaries is needed to ensure that subjective concerns do not influence the process.
This study was funded by the National Science Foundation (contract number CMS 9817690). The authors wish to thank Stuart Challender of the State of Utah Automated Geographic Resource Center (AGRC), Mike Brown of the Wasatch Front Regional Council, Andrew Jackson and Andrew Wooley of the Mountain Land Association of Governments, and many other city employees and engineers who provided us with GIS-based data and other information essential for this study.
Balling, R.J., J.T. Taber, M.R. Brown, and K. Day. 1999. Multiobjective Urban Planning Using Genetic Algorithm. ASCE Journal of Urban Planning and Development 125, no. 2:86-99.
Environmental Systems Research Institute (ESRI). 1999. ArcView GIS 3.2 User Manual. Redlands, CA.
MINITAB. 1999. MINITAB Statistical Software Release 12 for Windows User Manual. State College, PA.
Smith, J. 2000. Cluster Analysis as Part of a Land-Use Classification Scheme for Optimized Land Use and Transportation Planning. Masters Thesis, Department of Civil Engineering, Brigham Young University, Provo, UT.
The State of Utah Automated Geographic Resource Center (AGRC). 1997. Parcel-Level Land-Use Data. Salt Lake City, UT.
Taber, J.T., R.J. Balling, M.R. Brown, K. Day, and G.A. Meyer. 1999. Optimizing Transportation Infrastructure Planning with a Multiobjective Genetic Algorithm Model. Journal of the Transportation Research Board 1685:51-6.
Transportation Research Board (TRB). 2000. Conference II Resource Paper: Summary of Survey Results. In Refocusing Transportation Planning for the 21st Century, Conference Proceedings 20:179-83.
Vuchic, V.R. 1999. Transportation for Livable Cities. Rutgers: Urban Policy Research Press.
Mitsuru Saito, Department of Civil and Environmental Engineering, Brigham Young University, 368 Clyde Building, Provo, UT 84602. Email: firstname.lastname@example.org.