Connor Mitchell (M’20) majored in Computer Sciences, Data Science. He used his Capstone project to research adaptable geographic clustering, a data-driven and dynamic market expansion technique for home installers. This is the executive summary of his project.


You’re an operations manager at a US-based growth company that sells a product requiring at least a day of home installation labor (e.g., a comprehensive home security system, a backyard pool, or a roof-mounted solar array). You’re operating in most major US cities, but your accounting department says profits are slim. Farmer Joe, a man living in a small town in rural West Virginia, has wanted to buy your product for 6 months now. He’s been calling your customer service team every couple of weeks to see if you’ve expanded service operations to include customers in his area. As much as you’d like to sell to Joe and the neighbors he’s convinced to make a product reservation, it just isn’t cost-effective. Setting up permanent operations in remote areas of the country with limited demand would burn through cash faster than many Silicon Valley tech companies do, which is unacceptable for all but the best-capitalized businesses. But the current level of customer demand in Joe’s town is high, and you know they would benefit from your products more than many people in the densely populated urban areas where you currently operate. Unfortunately, the longer you take to decide whether or not to take their business, the more impatient customers in Joe’s town become. Some have even cancelled their reservations. What do you do?

One idea might be to temporarily send an installation crew from a nearby city to Joe’s town for a week, install all currently reserved systems, and then return to business as usual. If demand increases significantly after the initial group of systems is installed, then perhaps it’s enough to justify permanent operations. But what if there are problems with the installations? Sending a crew back to fix the issues will cost your company travel expenses and extra labor hours. You could service the systems in bulk, the same way you installed them, but you don’t know how much that will cost, let alone the cost of the initial installations. And what about the precedent that opening Joe’s town for business will set? Joe’s town is not the only source of rural demand for your product; the sales team has leads from half a dozen others. How will you decide where to expand?


This was the problem I faced while interning on Tesla’s Energy Operations team. Customers across the country had reserved Solar-PV and Powerwall (battery) installations, even in areas Tesla did not currently service. In order to help operations managers make expansion decisions, I built the first version of a Market Expansion Clustering Tool (MECT). 

Watch Part 1 and Part 2 of the tutorial at 2x speed to get a sense of how to use the tool. Optionally, you can review the tool code and comments here or read this discussion of HDBSCAN, the density-based clustering method offered by the tool.

MECT offers users two clustering methods: agglomerative and density-based. Both options identify optimally dense geographical groups of potential customers or leads, but in different ways. The relevance of each method to the market expansion context depends on the hardware constraints of the business. For example, Tesla would need to rent warehouse space in a centralized location between installation sites, since one truck can only carry enough equipment (solar panels, tools, etc.) for one or two installations at a time. On the other hand, a home security system installer like ADT does not need to rent warehouse space in any of the towns it chooses to service; it simply needs to equip the installation crew with a truck and enough camera equipment to service customers along their route. Any additional equipment can be shipped to a destination down the road.

For these structural reasons, agglomerative clustering makes more sense for Tesla and density-based clustering makes more sense for ADT. The next section and this class will evaluate the Tesla use-case of agglomerative clustering using four criteria important to business operations managers: adaptability, simplicity, shape, and comprehensiveness. 

Agglomerative Clustering 

A high-level overview of agglomerative clustering is that the method starts with each point or lead as its own cluster, then iteratively merges clusters based on a certain distance criterion until there is one large cluster. Two of the most common criteria are single-linkage and complete-linkage, where the former merges the clusters with the smallest distance between their closest pair of points and the latter merges the clusters with the smallest distance between their farthest pair of points. Mathematically this is written as follows, where d is a distance matrix containing a distance value (dist) separating two clusters (u and v), where that distance value is calculated between a single pair of points (i and j). The min and max determine which pair-distance is used, either the nearest points in both clusters (single-linkage) or the farthest (complete-linkage).


d(u, v) = min(dist(u[i], v[j]))


d(u, v) = max(dist(u[i], v[j]))
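As a minimal sketch of the two formulas above, the following Python computes both linkage distances between a pair of toy clusters. The function names, the sample points, and the use of flat-plane Euclidean distance are assumptions made for illustration; the tool itself works with geographic distances.

```python
import math

def single_linkage(u, v, dist=math.dist):
    """Smallest distance between any point in u and any point in v."""
    return min(dist(p, q) for p in u for q in v)

def complete_linkage(u, v, dist=math.dist):
    """Largest distance between any point in u and any point in v."""
    return max(dist(p, q) for p in u for q in v)

# Hypothetical toy clusters on a flat plane (arbitrary units).
u = [(0.0, 0.0), (1.0, 0.0)]
v = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(u, v))    # nearest pair: (1, 0) and (4, 0)
print(complete_linkage(u, v))  # farthest pair: (0, 0) and (6, 0)
```

The only difference between the two criteria is whether min or max is taken over the same set of pairwise distances.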

In this use case, where the adaptability of the model is important, businesses need to constrain the maximum size of clusters in order to avoid breaking the hub-and-spoke model. This requires complete-linkage, so that users can set a maximum cluster size and stop the merging process when the next merge would exceed that maximum. This stopping point is also called the “cut point” because the iterative cluster merging process forms a hierarchical binary tree called a dendrogram (Figure 1), and stopping early is equivalent to cutting that tree at a given height. If the tool used single-linkage instead, the cut point would act as a minimum density threshold: it would only bound the gap between the closest pair of points in neighboring clusters, so a cluster could chain outward into a much wider footprint. This is less useful for operations managers, who care about the footprint of the entire cluster of leads the installer would need to service, rather than the distance between just two of them.
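The cut point can be illustrated with a deliberately naive complete-linkage loop that stops as soon as the next merge would exceed a maximum diameter. This is a sketch under stated assumptions, not the tool's implementation: the function names, the toy coordinates, and the flat-plane Euclidean distance are all hypothetical. In practice a library routine such as SciPy's `linkage` with `method='complete'` followed by `fcluster` with `criterion='distance'` does the same job far more efficiently.

```python
import math
from itertools import combinations

def complete_linkage(u, v):
    """Largest distance between any point in u and any point in v."""
    return max(math.dist(p, q) for p in u for q in v)

def cluster_with_max_diameter(points, max_diameter):
    """Naive complete-linkage agglomeration that stops at the cut point."""
    clusters = [[p] for p in points]  # every lead starts as its own cluster
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        (i, j), d = min(
            (((i, j), complete_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > max_diameter:
            break  # any further merge would exceed the allowed diameter
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Hypothetical leads on a flat plane (arbitrary units).
leads = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 30)]
clusters = cluster_with_max_diameter(leads, max_diameter=2)
for c in clusters:
    print(len(c), c)
```

Because complete-linkage merge distances never decrease as merging proceeds, every cluster returned here has its two farthest points no more than `max_diameter` apart, which is the property an operations manager would rely on when bounding a crew's service footprint.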

Figure 1: Dendrogram Example (simulated data from Market Expansion Clustering Tool) 

Using the example from Figure 1, if the user wished to cut the tree to a maximum cluster diameter of 100 miles, they would first convert miles to radians: 
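The conversion itself is not reproduced in this text, so the following is a sketch of the standard calculation: a distance along the Earth's surface divided by the Earth's radius gives the corresponding central angle in radians. The radius constant and function names are assumptions and may differ from what the tool uses. The haversine helper is included only to show why a threshold in radians is convenient: great-circle distances on a unit sphere come out in radians, so they can be compared with the cut point directly.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; the tool's constant may differ

def miles_to_radians(miles):
    """Arc length on the Earth's surface expressed as a central angle."""
    return miles / EARTH_RADIUS_MILES

def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (degrees in, radians out)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))

cut_point = miles_to_radians(100)
print(round(cut_point, 4))  # 0.0253
```

Under that assumed radius, a 100-mile maximum diameter corresponds to a cut point of roughly 0.0253 radians.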

This value is only marginally greater than zero, so according to the dendrogram, the user can expect a large quantity of clusters in this dataset. Indeed, after mapping the results using Kepler we see the following results, where red-blue arcs connect current office locations (red dots) with clusters (blue dots) (Figure 2): 

Figure 2: Results Map Example (simulated data from Market Expansion Clustering Tool) 

In order to produce useful and insightful results, users only need to choose where to cut the dendrogram, which is a decision driven by operational time and cost constraints. A single intuitive parameter like maximum cluster size preserves the simplicity of this method. Additionally, since the dimensions of each cluster are determined by its two farthest points, clusters tend to be compact and roughly globular (circular), keeping a symmetrical shape.

Another characteristic of the map in Figure 2 is the zip codes themselves (pink dots), each of which contains all the lead data for that zip code. Pink dots with an arc leading to them are single-zip-code clusters, which can be either filtered out by the user or referenced for potential second-order expansion (after first-order expansion to larger clusters). This characteristic underlines the comprehensiveness of the agglomerative method.


Depending on your product’s installation timeline and hardware requirements, either agglomerative or density-based clustering may make more sense for your business. Agglomerative complete-linkage is simpler to understand and gives you the flexibility to limit cluster diameter, but it is restricted to identifying roughly spherical clusters even if a more flexible shape could improve the density of installations. Density-based methods produce more flexibly shaped and sized clusters, but can be more difficult to interpret. By evaluating these methods against the proposed four-criteria framework (adaptability, simplicity, shape, and comprehensiveness), operations managers can identify which clustering method makes the most sense for their business, apply it to identify market expansion opportunities, and notify leads like Farmer Joe whether and when they can expect their reserved products to be installed.