Predicting Subdivision With Machine Learning

At this year’s Regional ESRI Users Group conference in Whangarei, I presented Vision’s work on applying land change modelling with machine learning software to predict where future subdivisions may occur in the Bay of Islands. This is going to be a long post, so for those wanting to skip straight to the results, you can have a play with the following slides. Clicking between the numbers at the top, you can see the transition between:

  1. Parcel Size (m²) in 2003,
  2. Parcel Size (m²) in 2018,
  3. Projected Parcel Size (m²) in 2038 (Inclusive Model Scenario), and
  4. Projected Parcel Size (m²) in 2038 (Exclusive Model Scenario)
[Slideshow: the four parcel size maps listed above]

For those wanting to know more, I should first outline how this model is different from typical land change modelling. Land change models are commonly applied to changes in the physical characteristics of a geographic area. This can be the land cover (e.g. forest converting to plains) or land use (farmland to residential). Our project is trying to measure and predict a non-physical attribute: legal boundaries. More specifically, the transition from larger parcels to smaller parcels. Our study area covers the Bay of Islands; we measure change from 2003 to 2018 and project new subdivisions out to 2038.

We used machine learning software, developed by Clark Labs, to perform the raster analysis.  The process has four primary stages:

  1. Data Preparation
  2. Measuring change
  3. Modelling Susceptibility
  4. Predictive Modelling

Stage 1. Data Preparation

I won’t spend much time reliving this stage, partially because it’s boring, mostly because it turned out to be an arduous process that took about 50% of the time dedicated to the project. What to know:

  • All data was converted to a raster format;
  • To work in the software, all data was converted to share the same CRS, extent, and grid size;
  • This project analyzed subdivision of primary parcels (not land titles);
  • Only parcels which were privately owned were retained for the model; and
  • I did not have access to a historic data set, so the historic (2003) parcels were approximated using a reductive model I built in ArcGIS.  While this will likely introduce some level of error, about 96% of parcels were able to be filtered.

Stage 2. Measuring Change

After all the data is created and imported, the first step for modelling land change is to find the change which happened over the base period (from 2003 to 2018). The subdivision being measured over this period was set at four different thresholds:

  • 5,000 m²
  • 10,000 m²
  • 20,000 m²
  • 40,000 m²

The two layers are compared, and any change is extracted. These areas are what will be used in the later stages to determine the strength of correlation between the change observed and the driver variables. The net change by area and where the change occurred are shown in the table and animation below. Net change indicates how much new area for each category was gained or lost between 2003 and 2018.

Table 1. Net Transition by Parcel Threshold from 2003 to 2018
Threshold          Cells      Hectares
Sub 5,000 m²       44,665     285
Sub 10,000 m²      29,393     188
Sub 20,000 m²      46,783     299
Sub 40,000 m²      104,714    671
Over 40,000 m²     -227,583   -1,456

As you can see, land in the Bay of Islands area has been shifting to smaller parcels, with subdivision activity being strongest in the Kerikeri area.
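For readers who want to see the bookkeeping, here is a minimal sketch of the comparison in plain Python. It is not the Clark Labs software; the function names, the toy six-cell "rasters" and the `cell_area_m2` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Upper bounds (m²) of the four subdivision classes used in the post.
THRESHOLDS = [5_000, 10_000, 20_000, 40_000]
LABELS = ["Sub 5,000", "Sub 10,000", "Sub 20,000", "Sub 40,000", "Over 40,000"]


def classify(parcel_size_m2):
    """Return the label of the smallest threshold class the parcel fits in."""
    for upper, label in zip(THRESHOLDS, LABELS):
        if parcel_size_m2 <= upper:
            return label
    return LABELS[-1]


def net_change(sizes_start, sizes_end):
    """Net gain or loss of cells per class between two parcel-size rasters.

    Both inputs are flat lists holding one parcel size (m²) per cell, in the
    same cell order (same CRS, extent and grid size).
    """
    start = Counter(classify(s) for s in sizes_start)
    end = Counter(classify(s) for s in sizes_end)
    return {label: end[label] - start[label] for label in LABELS}


def cells_to_hectares(cells, cell_area_m2):
    """Convert a cell count to hectares (1 ha = 10,000 m²)."""
    return cells * cell_area_m2 / 10_000


# Toy data: two large parcels and one mid-sized parcel are split by 2018.
sizes_2003 = [50_000, 50_000, 50_000, 15_000, 15_000, 4_000]
sizes_2018 = [50_000, 8_000, 8_000, 15_000, 4_500, 4_000]
change = net_change(sizes_2003, sizes_2018)
print(change)
# {'Sub 5,000': 1, 'Sub 10,000': 2, 'Sub 20,000': -1, 'Sub 40,000': 0, 'Over 40,000': -2}
```

In this toy case every cell lost from one class is gained by another, so the net changes sum to zero. The cell area is left as a parameter because the post does not state the grid size.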

Stage 3.  Modelling Susceptibility

This is the point where we get to the real meat of the process. In short, the software uses a machine learning algorithm, more specifically an artificial neural network, to:

  • take a sample of 20,000 cells (50% transition, 50% persistence),
  • combine the sample with the driver layers to evaluate and weight susceptibility criteria, and
  • perform 5,000 iterations of each model.

The change layer you have already seen in Stage 2. What is important to note here is that the model is trying to understand two things. First is transition, or where cells changed over the base period. Second is persistence, or where cells did not change over the base period. Both are equally important to understanding how subdivision may look in the future.
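The 50/50 split between transition and persistence can be sketched as follows. This is only an illustration of balanced sampling in plain Python; `balanced_sample` and the toy change layer are hypothetical and say nothing about how the software draws its sample internally.

```python
import random


def balanced_sample(changed, sample_size, seed=0):
    """Draw cell indices, half from transition cells and half from persistence.

    `changed` is a flat list of booleans, one per cell: True where the cell
    changed class over the base period, False where it persisted.
    """
    rng = random.Random(seed)
    transition = [i for i, flag in enumerate(changed) if flag]
    persistence = [i for i, flag in enumerate(changed) if not flag]
    half = sample_size // 2
    if half > len(transition) or half > len(persistence):
        raise ValueError("not enough cells in one class for a 50/50 sample")
    return rng.sample(transition, half), rng.sample(persistence, half)


# Toy change layer of 200 cells in which only a minority of cells changed.
changed = [i % 7 == 0 for i in range(200)]
transition_idx, persistence_idx = balanced_sample(changed, sample_size=40)
print(len(transition_idx), len(persistence_idx))  # 20 20
```

Sampling both classes equally stops the far more common persistence cells from dominating what the model learns.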

When the transition and persistence sample is taken, the model begins comparing where change did and did not happen with the driver layers. These include:

  • Distance from:
    • Primary Roads,
    • Secondary Roads,
    • Waste Water Lines,
    • Urban/Town Land cover (Anthropogenic Disturbances),
    • Coast,
    • Kerikeri (Township), and
  • Coverage by:
    • Slope,
    • Elevation,
    • Population Density,
    • Land Cover Data Base, and
    • District Plan Zoning.

[Slideshow: driver layers]

 

This process was performed for each parcel size threshold in two different scenarios. The first scenario was Inclusive. This means that all parcels beneath the size threshold are included in each model (e.g. Sub 20k includes all parcels 20,000 m² and under). In this scenario we would have to assume that there is not much difference in how the driver variables act on parcels of different sizes.

The second scenario was Exclusive. This means that only parcels between two adjacent size thresholds are included in each model (e.g. Sub 20k only includes parcels between 10,000 m² and 20,000 m²). In this scenario we are assuming that the driver variables act differently on parcels of different sizes.
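The difference between the two scenarios comes down to which cells count as belonging to a threshold. A small sketch, with hypothetical function names and the assumption that the lower bound of an exclusive band is itself excluded:

```python
THRESHOLDS = [5_000, 10_000, 20_000, 40_000]


def inclusive_mask(sizes, threshold):
    """Cells whose parcel is at or under the threshold."""
    return [s <= threshold for s in sizes]


def exclusive_mask(sizes, threshold):
    """Cells whose parcel lies between the next-lower threshold and this one."""
    idx = THRESHOLDS.index(threshold)
    lower = THRESHOLDS[idx - 1] if idx > 0 else 0
    # Assumption: a parcel of exactly the lower bound belongs to the band below.
    return [lower < s <= threshold for s in sizes]


sizes = [3_000, 8_000, 15_000, 20_000, 35_000, 60_000]
print(inclusive_mask(sizes, 20_000))  # [True, True, True, True, False, False]
print(exclusive_mask(sizes, 20_000))  # [False, False, True, True, False, False]
```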

The results for each model are shown in Table 2 and Table 3. The model accuracy is based on a blind sample held back from the model. After weighting and calibrating the driver variables against change and persistence, the software predicts where change or persistence will happen in this unseen sample. The percentage is how often the prediction was correct. The skill measures beneath accuracy indicate which of the two, transition or persistence, the model was better at predicting (higher being better). Lastly, the number in each column shows the driver variable’s rank in terms of how much influence it had on predicting transition and persistence.

Table 2. Inclusive Model Results
Parameters            Sub 5K    Sub 10K   Sub 20K   Sub 40K
Model Accuracy        89.2%     84.4%     83.9%     81.7%
Transition Skill      0.7487    0.6850    0.7432    0.7272
Persistence Skill     0.8172    0.6900    0.6138    0.5394
Distance from:
  Town                5         11        8         5
  Anthropogenic       2         9
  Coast               9         5         5         4
  Waste Water         6         10        6         7
  Roads Primary       4         2         3         3
  Roads Secondary                         2         2
Cover by:
  Slope               10        7         11        9
  Zoning              8         8         6         7
  Elevation           7         4         7         8
  LCDB                11        6         10        11
  Pop Density         3         3         9         10

 

Table 3. Exclusive Model Results
Parameters            Sub 5K    Sub 10K   Sub 20K   Sub 40K
Model Accuracy        89.1%     84.1%     84.3%     81.0%
Transition Skill      0.7631    0.7303    0.8142    0.7079
Persistence Skill     0.8019    0.6353    0.5566    0.5321
Distance from:
  Town                5         3         8         8
  Anthropogenic
  Coast               9         6         5         4
  Waste Water         6         4         3         7
  Roads Primary       2         2         2         2
  Roads Secondary     4         8         4         3
Cover by:
  Slope               11        11        10        9
  Zoning              8         7         7         6
  Elevation           7         10        6         5
  LCDB                10        5         9         10
  Pop Density         3         9         11        11
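To make the accuracy and skill rows a little more concrete, here is a sketch of how such figures could be computed from a held-out sample. The skill formula below is an assumption on my part, not something stated in the post: it treats skill as accuracy corrected for chance, (A − E) / (1 − E), with E = 0.5 for two equally sampled classes. All names and the toy sample are hypothetical.

```python
def accuracy(predicted, actual):
    """Share of held-out cells where the prediction matches what happened."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def class_accuracy(predicted, actual, label):
    """Accuracy restricted to held-out cells whose true class is `label`."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a == label]
    return sum(p == a for p, a in pairs) / len(pairs)


def skill(acc, n_classes=2):
    """Chance-corrected accuracy: 0 is no better than guessing, 1 is perfect.

    Assumed formula (not from the post): (A - E) / (1 - E), E = 1 / n_classes.
    """
    expected = 1 / n_classes
    return (acc - expected) / (1 - expected)


# Toy validation sample: "T" = transition, "P" = persistence.
actual = ["T", "T", "T", "T", "P", "P", "P", "P"]
predicted = ["T", "T", "T", "P", "P", "P", "P", "P"]

print(accuracy(predicted, actual))                      # 0.875
print(skill(class_accuracy(predicted, actual, "T")))    # 0.5
print(skill(class_accuracy(predicted, actual, "P")))    # 1.0
```

As a rough sanity check on the assumption, plugging the 89.2% accuracy of the inclusive Sub 5K model into `skill` gives about 0.784, close to the average of its reported transition and persistence skills (about 0.783).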

Now that the model has figured out the weighting for each driver layer, they are combined into a layer that shows the susceptibility to transition for each size threshold. The heat maps below show the results at each threshold. Dark red shows the areas with the highest susceptibility (the combination of drivers where most subdivision occurred) while blue shows areas of low susceptibility (the combination of drivers where little to no subdivision occurred).

[Slideshow: susceptibility heat maps for each threshold]

Comparing the results of the two scenarios indicates that drivers may not act the same at all subdivision thresholds. In the results table for the inclusive model, you can see that the most and least influential drivers are less consistent than in the table for the exclusive model. You can also see the variation in the resulting heat maps, where the exclusive model shows a wider spread of susceptibility for the larger parcel thresholds.

Stage 4. Predicting Future Subdivision

With the susceptibility heat maps generated, the last stage of the process can be completed. The projection uses a variation of Cellular Automata by:

  • Taking the historic rate of change, meaning the observed number of cells (land area) that changed annually between 2003 and 2018,
  • projecting the rate of change forward to the year 2038 (year chosen by me), and
  • flipping (subdividing) the cells, prioritizing those with the highest susceptibility score and adjacent to or joining areas of prior subdivision activity.

This process assumes that:

  • past drivers of change will maintain the same amount of influence on future subdivision,
  • subdivision will happen at a similar rate to historic change, and
  • areas of recent subdivision activity will influence future activity.
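The projection logic above can be sketched in a few lines. This is a deliberately simplified, greedy stand-in for the software's Cellular Automata variant, not a reproduction of it: the function names are hypothetical, adjacency is assumed to mean the four direct neighbours, and cells that do not touch prior subdivision are never flipped.

```python
def projected_cells(changed_cells, base_years, projection_years):
    """Scale the historic annual rate of change to the projection period."""
    return round(changed_cells / base_years * projection_years)


def neighbours(r, c, rows, cols):
    """Yield the 4-connected neighbours of a cell that fall inside the grid."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc


def project(subdivided, susceptibility, n_flips):
    """Flip `n_flips` cells one at a time, always choosing the most
    susceptible unsubdivided cell that touches an already subdivided cell."""
    rows, cols = len(subdivided), len(subdivided[0])
    grid = [row[:] for row in subdivided]  # copy so the input is untouched
    for _ in range(n_flips):
        candidates = [
            (susceptibility[r][c], r, c)
            for r in range(rows)
            for c in range(cols)
            if not grid[r][c]
            and any(grid[nr][nc] for nr, nc in neighbours(r, c, rows, cols))
        ]
        if not candidates:
            break
        _, r, c = max(candidates)
        grid[r][c] = True
    return grid


# Rate example using the Sub 5,000 m² row of Table 1:
# 44,665 cells over 15 years (2003-2018), projected 20 years (2018-2038).
print(projected_cells(44_665, 15, 20))  # 59553

# Toy 3x3 grid with one subdivided cell in the top-left corner.
subdivided = [[True, False, False],
              [False, False, False],
              [False, False, False]]
susceptibility = [[0.90, 0.20, 0.80],
                  [0.70, 0.60, 0.95],
                  [0.10, 0.30, 0.50]]
result = project(subdivided, susceptibility, n_flips=2)
print(result)
# [[True, False, False], [True, True, False], [False, False, False]]
```

Note that the cell scoring 0.95 is not flipped in the toy run: it has the highest susceptibility but does not yet touch any subdivided cell, which is the "areas of recent subdivision activity will influence future activity" assumption at work.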

The scenarios were modeled with five year outputs from 2018 to 2038. The results are shown in the two slideshows below.

[Two slideshows: projected subdivision in five-year steps from 2018 to 2038, one per scenario]

So What?

We are finally at the end of this long post and you may be wondering, “So what have we learned?” Off the bat, it appears that land use change models operate reasonably well with non-physical land cover attributes. Though that may not excite most people, I’m already making a wish list of next projects.

More generally, we gained insights about subdivision trends in the Bay of Islands. First, we know the models had better accuracy for subdivision of smaller parcels, meaning the driver variables chosen had a stronger influence on predicting the outcomes, and additional or different drivers may need to be identified for larger parcel subdivision models.

Second, we know that there was a variation between the inclusive and exclusive model results, with the exclusive model being more stable. This indicates that other research related to subdivision may need to account for differences in parcel size.

Third, the accuracy of the models drops as the parcel size gets larger, and the drop is mostly attributed to lower persistence scores. This could mean (with the variables given) that knowing where the “holdouts” will be is harder to assess, and could be the topic of continued research.

Lastly, we now have layers that show where subdivision has occurred in the past and which areas are most susceptible to subdivision in the future. Interestingly, the “hottest” areas form a V facing out from Kerikeri, with one arm reaching out through Waipapa/Kapiro and the other arm following Kerikeri Road towards Waimate North. I had a look at various layers and found that this formation largely followed the Kerikeri Irrigation scheme. Were I to refine this model, that would likely be a driver layer I’d test.

We could take the results further and start comparing them to layers associated with District and Regional Planning. For instance, we could see where zoning may need to be augmented to promote or restrict growth, and where potential areas of reverse sensitivity may be developing. The Land Use Capability layer for soils could be added, and the amount of highly versatile soils at high risk of subdivision could be assessed. Or we could see what habitats may be at risk of human encroachment.

There is also the option to add nuances to the model such as:

  • testing new variables,
  • incorporating the roading development tool for dynamic (modeled) or scheduled (already planned) road generation, or
  • using the “Incentive and Constraint” layer tool (eg. new zoning, Urban Growth Boundaries, or new conservation areas)

I’ll close by saying this project was only a first blush at modelling subdivision. There is still calibration and validation that should be performed before viewing the results as conclusive. However, it does serve as an interesting starting point for understanding the future of land use in the Bay of Islands.

-Logan Ahmore

Understanding Place by Proportionally Dominant Characteristics

Over the years, I’ve been involved in quite a few projects where the client asks for the “supermap”. It normally goes like this: “Can you show us what a,b,c…z look like, and can it be on maps, and we want the information to be compared on the same map.”

These requests can often feel hodgepodge and cumbersome; to the point you’re not sure what to do with the information. They tend to come as a result of some form of consultation or engagement where “key criteria”, “drivers”, or  “target groups” have been identified. While there are a lot of ways to make a supermap (some Bizarros floating around too), I’ll be sharing one that I find helpful when initially probing the data.

I call these “proportionally dominant characteristic” maps. These are a variation on Predominance Maps, which are typically used to display dominance by sub-groups of single characteristics. Familiar examples would be showing what age or ethnic group is largest, or what level of education attainment is most common in an area.

The proportionally dominant map steps outside of a single characteristic group, and tries to show dominance between different characteristics, as opposed to subgroups.

For my chosen example, I’ll be using Wellington census data that was lying around my hard drive. I (somewhat) randomly chose 9 meshblock indicators for my characteristics (that a made-up stakeholder group decided were important for understanding Place Making and Spatial Planning in Wellington). The characteristics are:

  • People Aged over 65,
  • People with College Education,
  • Families with Children,
  • High Income Households,
  • Owned Homes,
  • Labour Force Participation,
  • One Person Households,
  • Rented Dwelling, and
  • Volunteers

While standard predominance may work to compare something like Home Ownership vs Renting, it will not work out of the box for Home Ownership vs Volunteers. You may wonder, “can’t we just normalize everything as a percentage or rate of population, and compare that?”

Short answer, “maybe.” Long answer, if we can assume the rates/percentages have similar distributions, then perhaps. But typically the distribution of different characteristics can vary widely. For example (user warning, made-up stats follow), home ownership may range between 35% and 95%, averaging 65%, and volunteerism may range from 0% to 50%, averaging 15%. In this case, home ownership would likely drown out volunteerism in most meshblocks. We don’t want that; we want to know where the rate of volunteerism is outstanding compared to the rate of home ownership.

So this finally gets me to the meat of my method; using proportions. First, this method has two basic assumptions:

  • that all the characteristics have a common relationship with an explanatory variable, and
  • in this relationship, the characteristic scales linearly (ish) with that explanatory variable.

I look at my characteristics and think, “something all of these seem to have in common is that they depend on the size of population in an area.” I look at my distributions of populations by meshblock and see that most meshblocks are between about 50 and 300 population in Wellington. (Note, I don’t actually plan to use population for any calculations; I’m just using it to check my assumptions.)

I then chart each characteristic against meshblocks with population between 50 and 300. While the chart on the left shows a lot of variation, the trend for each characteristic seems positive. I confirm this by smoothing the lines in the right chart. The lazy statistician says, “looks linear to me” and moves forward. The chart on the right also highlights the issue of using predominance. Labour Force Participants have a notably high count compared with the other indicators. If we were to select by the actual value or something derivative of actual value (percent, ratios, etc.), labour force would mostly come out on top.

We’re now at the point that takes a little more effort: calculating the proportions of each characteristic for each meshblock. Or in words, for each characteristic we will need to take the observed value for a meshblock and divide by the sum of observed values for all meshblocks.

For example, if the meshblock had 30 volunteers and there were 5000 volunteers in all of Wellington’s meshblocks, the proportion would be 0.006 (0.6%) of all volunteers. The charts below show the proportional results for all characteristics. As you can see, the distributions are now scaled in a way that is more comparable. The smoothed lines in the right chart largely overlap, while the actual values on the left show where characteristics stand out against each other in the meshblocks.
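The same calculation as a minimal Python sketch. The meshblock IDs and counts are made up; only the 30-out-of-5000 case comes from the example above.

```python
def proportions(counts):
    """Divide each meshblock's observed value by the total over all meshblocks.

    `counts` maps a meshblock id to the observed count for one characteristic.
    """
    total = sum(counts.values())
    return {meshblock: n / total for meshblock, n in counts.items()}


# Made-up volunteer counts that sum to 5,000 across four meshblocks.
volunteers = {"MB1": 30, "MB2": 70, "MB3": 100, "MB4": 4_800}
shares = proportions(volunteers)
print(shares["MB1"])  # 0.006
```

Because every characteristic is divided by its own total, each set of shares sums to 1, which is what puts the characteristics on a comparable scale.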

With the proportional values calculated, they can be joined to a map of the meshblocks. The GIF below cycles through what each of the characteristics looks like for meshblocks in the Wellington area.

The last step of this process is to create the supermap. In this case, our supermap is comparing the proportional scores for all 9 characteristics in each meshblock and finding the one which is largest. This can be accomplished by creating and calculating a new field, or by using the Arcade predominance formula in symbology (my chosen method). The results are shown below.
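For anyone who would rather calculate a new field than use symbology, the comparison can be sketched as below. This is a plain Python illustration rather than an Arcade expression; the function name, the two characteristics and the counts are made up.

```python
def dominant_characteristic(table):
    """For each meshblock, name the characteristic with the largest share.

    `table` maps characteristic name -> {meshblock id -> observed count}.
    """
    shares = {}
    for name, counts in table.items():
        total = sum(counts.values())
        shares[name] = {mb: n / total for mb, n in counts.items()}
    meshblocks = next(iter(table.values())).keys()
    return {
        mb: max(shares, key=lambda name: shares[name][mb])
        for mb in meshblocks
    }


# Made-up counts: owned homes outnumber volunteers in every meshblock.
table = {
    "Owned Homes": {"A": 80, "B": 60, "C": 60},
    "Volunteers": {"A": 5, "B": 35, "C": 10},
}
dominant = dominant_characteristic(table)
print(dominant)
# {'A': 'Owned Homes', 'B': 'Volunteers', 'C': 'Owned Homes'}
```

On raw counts, Owned Homes would win in all three meshblocks. On proportions, meshblock B holds 70% of all volunteers but only 30% of all owned homes, so Volunteers comes out as its dominant characteristic.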

Now is the part where we ask “ok, that is cool, but what do we do with it?”

1. Validating Assumption:
Planning and strategies work around a lot of assumptions, assumptions that are founded on experience or professional intuition. We will all say at some point in our careers “I’ve done a lot of work in that area, and I know that ________ is the biggest issue.” Anecdotally we know _______ to be true, but if we are going to do proper project development it needs a bit more of an evidence base. This map can help to validate such assumptions or show where they may need to be re-evaluated.

2. Prioritization and Strategic Intervention:
In some cases this map may be enough to meet the needs of your client or project. Say the characteristics were related to strategic interventions. Areas highest in Characteristic A will use Strategic Intervention 1; areas highest in Characteristic B will use Strategic Intervention 2; and so forth. We now have a map that shows where the most dominant characteristics are located.

3. Foundation for Further Analysis:
With most of my work, a map like this is only a starting point. It may be used to help me conceptualize the issues spatially before moving on to a more detailed analysis.  What is more, the proportional figures can be used to create new formulas to regroup and reclassify different areas.

For instance we could re-class the meshblocks based on the complexity of apportioned scores (areas low in most characteristics to areas high in several characteristics). We could then know which areas may require more complex interventions.

Or, we could group areas that are high in specified shared characteristics. As an example, we may want to have a special strategy for areas that are proportionally high in families with children and rental homes. We can build a formula to create a new score for areas highest in both. We could even take it a step further and start using machine learning techniques to perform cluster analysis to define and classify the sub-areas.
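As one possible shape for such a formula, the sketch below scores each meshblock by the smaller of its two proportional shares, so a meshblock only scores highly when it is high in both. Taking the minimum is purely my assumption for illustration; a product or a weighted sum would be equally defensible. The shares and names are made up.

```python
def combined_score(share_a, share_b):
    """Score each meshblock by the smaller of two proportional shares."""
    return {mb: min(share_a[mb], share_b[mb]) for mb in share_a}


# Made-up proportional shares for three meshblocks.
families_with_children = {"A": 0.50, "B": 0.40, "C": 0.10}
rented_dwellings = {"A": 0.10, "B": 0.45, "C": 0.45}

scores = combined_score(families_with_children, rented_dwellings)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['B', 'A', 'C']
```

Meshblock B ranks first because it is the only one that is high in both characteristics; A and C are each high in just one.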

To close this post off, I’ll say this is just one of many ways to cut data spatially. In my job I am constantly exploring new ways to visualize data to help myself and others understand place. Hopefully I will get a chance to share other methods, maybe even create some samples of a more advanced analysis.