Using Machine Learning and Big Data to Understand Micro Markets

Vernon H. Budinger, CFA

November 3, 2023

Overview

This paper focuses on a data/AI toolkit that marketing managers can use to understand the demographics in their market. While many AI projects rely on enormous data sets and intimidating new neural algorithms, Machine Learning can also deliver informative, detailed assessments of consumer markets in economic micro-regions using large pools of free data and free software. Such “humble” AI efforts can dramatically improve the reach and efficiency of marketing campaigns.

Targeted Market Segments.

There are many avenues for creating region-specific content and delivering that message to a specific census tract.

· Facebook, Instagram, Twitter, and LinkedIn allow you to target a small region with posts.

· Direct mail companies can create mail campaigns by census tract.

· Advertisements can be placed in community calendars and local news media.

· Billboards reach specific market segments.

The key is to segment your audience based on demographics, interests, and needs. The Census Bureau now provides APIs (Application Programming Interfaces) to pull Census data by tract or by census block (a unit smaller than a tract). In addition, the American Community Survey (ACS) is a U.S. Census Bureau program that updates the regional information yearly. This data is free and is tagged with geolocation information.
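As a minimal sketch of what such a request can look like, the snippet below assembles a Census Data API URL for tract-level 5-year ACS data. The 2021 vintage, the variable B25077_001E (median home value), and the helper name build_acs_url are assumptions chosen for illustration, not details taken from this study. The FIPS codes are those of California (06) and Sonoma County (097).

```python
from urllib.parse import urlencode

BASE = "https://api.census.gov/data"

def build_acs_url(year, variables, state_fips, county_fips):
    """Return a 5-year ACS request URL covering every tract in one county."""
    query = [
        ("get", ",".join(["NAME"] + list(variables))),
        ("for", "tract:*"),                 # all tracts ...
        ("in", f"state:{state_fips}"),      # ... within this state
        ("in", f"county:{county_fips}"),    # ... and this county
    ]
    # Keep ':', '*' and ',' unescaped so the URL stays readable.
    return f"{BASE}/{year}/acs/acs5?" + urlencode(query, safe=":*,")

# Assumed example: median home value for Sonoma County tracts, 2021 vintage.
url = build_acs_url(2021, ["B25077_001E"], "06", "097")
print(url)
```

The response is a JSON array whose first row holds the column names. For heavier use, a free API key can be appended to the query as a key parameter.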

This paper demonstrates how U.S. Census/ACS data can be used with freeware statistical/graphing packages to develop profiles of the consumers in an area. It applies statistical/Machine Learning analysis to that data to provide deep insights into the demographic characteristics of each census tract.

Section II Market Analysis of North Bay California Counties

These tools were chosen because the data and analytics are accessible to small and medium-sized companies. They offer a better understanding of geo-demographic trends, improve customer experience, and build stronger relationships with the client base. The analysis produces detailed quantitative measures of economic and demographic status as well as consumer behavior for micro-regions.

This analysis was adapted from Chapter 8 of Kyle Walker’s book “Analyzing U.S. Census Data.” This study focuses on median home value from the U.S. Census Bureau’s 5-year American Community Survey (5-year ACS). While this data depends on estimates, it is more current than the Decennial Census and has broader coverage than the 1-year ACS, which only covers areas with a population of 65,000 or more. The more current 1-year ACS would not cover any of the cities in the region except Santa Rosa in Sonoma County.

Five County Demographic Comparison

Sonoma, Lake, Marin, Mendocino, & Napa Counties

The data table for the five counties illustrates the challenges that marketing operations face in this complex region. Sonoma and Marin Counties in the southern part of the North Bay Region are wealthier and more densely populated than Mendocino and Lake Counties to the north.

In reviewing the table, the income disparity is shocking: the median household income for Sonoma County is about 64% higher than that of neighboring Mendocino County. Marin County, with its proximity to San Francisco, is the wealthiest county in terms of median household income and per capita income. The 5-county average ratio of 1.8 between median household income and per capita income suggests that most households have two sources of income. Napa’s land area is slightly bigger than Marin County’s, but its economy is roughly half the size. Mendocino and Lake Counties are clearly rural, with population densities that are a fraction of those of the other counties. Mendocino County has twice as much land as the next biggest county, but its population density is a fraction of the densities of Marin and Sonoma.

Moreover, Mendocino and Lake Counties are poor by California North Bay standards, with 16.1% and 16.5% of their respective populations living below the poverty level, compared to Napa (9%), Marin (7.8%), and Sonoma (9.1%). The United States Department of Agriculture classifies any county with a poverty rate of 20% or more as “high poverty.” The Dissimilarity Index and the GINI Index paint a similar picture. The Dissimilarity Index measures segregation in the counties; the GINI Index measures income disparity.

Dissimilarity Index for White vs Hispanic and GINI Index for Income
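For readers who want to see how these two indices are computed, here is a minimal sketch using their standard textbook formulas. The tract counts and incomes are made-up illustrative numbers, not the study's data, and the function names are hypothetical. The Dissimilarity Index is half the sum of absolute differences between each tract's share of the two groups; the Gini index is computed here from a sorted list of incomes.

```python
def dissimilarity_index(tracts):
    """tracts: list of (group_a_count, group_b_count), one pair per census tract.

    Returns 0.0 when both groups are spread identically across tracts and
    1.0 when they never share a tract.
    """
    total_a = sum(a for a, _ in tracts)
    total_b = sum(b for _, b in tracts)
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in tracts)

def gini(incomes):
    """Gini coefficient of a list of non-negative incomes (0 = perfect equality)."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted sum over the ascending incomes (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical (White, Hispanic) resident counts for four tracts.
example_tracts = [(900, 100), (700, 300), (200, 800), (100, 900)]
print(round(dissimilarity_index(example_tracts), 3))

# Hypothetical household incomes.
print(round(gini([20_000, 35_000, 50_000, 120_000]), 3))
```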

The initial reaction of a marketing professional might be to classify Marin, Sonoma, and Napa Counties as rich and Mendocino and Lake Counties as poor. However, we will see that each of the North Bay counties has pockets of wealth and poverty.

This example uses median home value as a measure of the wealth of the region. A home is often a family’s single largest investment, so its value is a reasonable proxy for wealth. However, we learn about more than wealth; this data set is rich with other variables for study and provides the detail needed to interpret the inferences from Machine Learning tools.

Consumer Market Analysis Using Unsupervised Learning

Unsupervised Learning, which includes Principal Components Analysis, provides statistics for reducing the “dimensions” of the data. This tool is especially adept at identifying common factors in datasets with thousands of variables. Because it does so without using labels, it is considered unsupervised.

County-level data does not really give us a refined picture of the population and smaller regional economies. Are there common factors for each county or are the counties completely different? The American Community Survey provides a detailed breakdown of the social and economic microclimates in the counties. We can see from the two maps of Aggregate Income that the picture is complex. Small businesses can take advantage of this knowledge by marketing to specific microclimates through targeted social media and other marketing channels.

The two maps below provide some insight into data available in Mendocino and Sonoma Counties. The county subdivisions are the U.S. Census Bureau’s Census Tracts for organizing the Decennial Census. These tracts can be further divided into Census Blocks for additional micro-region detail.

Note that the legend for Sonoma tops out at $400 million whereas the maximum for Mendocino County’s legend is $200 million. One of the poorest tracts in Sonoma County borders Mendocino County but has twice the aggregate income of the neighboring Mendocino tract.

When we combine the two counties, we see there is an abrupt change in income levels on the borders of the counties, but there are also many areas of the counties that are similar. This section solves many of these puzzles using unsupervised machine learning to provide detailed insights about microeconomic climates that astute marketing managers use to tailor specific messages that connect with local populations.

Principal Component Analysis (PCA)

Benefit: PCA provides a tool to reduce the number of features (variables) that we need to consider while maintaining most of the information from those features. As will be discussed in the next few pages, the component information provides deep insights into the key items that unite or separate populations.

This Principal Components Analysis identifies the factors that drive the demographics of the area. Principal Components are vectors of numbers used to reduce the number of features (variables) in analysis but still describe a census tract with great mathematical detail. Each component has factor loadings that further break down the variables associated with each factor. This can be useful for micro-economic research, as it can help to identify the key factors that drive economic activity in different tracts.
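The mechanics described above can be sketched in a few lines. This is a minimal illustration with NumPy on synthetic data, not the study's actual workflow or ACS variables: the columns are standardized, a singular value decomposition yields the component directions (unit-length loading vectors, similar to what R's prcomp calls rotation), and projecting each tract onto those directions gives its weight on each component.

```python
import numpy as np

def pca(data):
    """PCA on standardized columns; returns (explained_ratio, loadings, scores)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of vt are the component directions.
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    loadings = vt.T        # one column per component, unit length
    scores = z @ loadings  # one row per tract, one column per component
    return explained_ratio, loadings, scores

# Synthetic stand-in for tract data (NOT the study's ACS variables).
rng = np.random.default_rng(0)
n_tracts = 200
wealth = rng.normal(size=n_tracts)  # hidden common factor
data = np.column_stack([
    wealth + 0.3 * rng.normal(size=n_tracts),  # e.g. median income
    wealth + 0.3 * rng.normal(size=n_tracts),  # e.g. percent owner-occupied
    wealth + 0.3 * rng.normal(size=n_tracts),  # e.g. percent college
    rng.normal(size=n_tracts),                 # an unrelated variable
])

explained_ratio, loadings, scores = pca(data)
print(np.round(explained_ratio, 3))  # share of variance per component
print(np.round(loadings[:, 0], 3))   # how each variable loads on PC1
```

Because three of the four synthetic variables share one hidden factor, the first component absorbs most of the variance, which is the dimension reduction the text describes.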

About 87.5% of the variance in the data can be explained by the first 8 principal components of this dataset (PC1 to PC8). As explained above, each principal component provides the factor loadings for the variables.

Ranking of the most important Principal Components by contribution (the first 10 provide 92.42% of the information):

Component   Contribution   Cumulative Contribution
PC1         34.82%         34.82%
PC2         23.41%         58.23%
PC3         10.23%         68.46%
PC4          5.58%         74.03%
PC5          4.06%         78.09%
PC6          3.42%         81.51%
PC7          3.06%         84.57%
PC8          2.96%         87.52%
PC9          2.64%         90.17%
PC10         2.25%         92.42%
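The cumulative column is simply a running total of the individual contributions. The short sketch below rebuilds it from the percentages listed above and reports how many components are needed to reach a chosen share of the variance. The helper name components_needed is hypothetical, and because the inputs are rounded, a running total may differ from the published cumulative figure by 0.01.

```python
from itertools import accumulate

# Individual contributions (percent) of PC1 to PC10, as listed above.
contribution = [34.82, 23.41, 10.23, 5.58, 4.06, 3.42, 3.06, 2.96, 2.64, 2.25]
cumulative = list(accumulate(contribution))

def components_needed(threshold_pct):
    """Smallest number of leading components whose total reaches the threshold."""
    for k, total in enumerate(cumulative, start=1):
        if total >= threshold_pct:
            return k
    return None  # threshold not reached with the components listed

for k, (share, total) in enumerate(zip(contribution, cumulative), start=1):
    print(f"PC{k:<3} {share:6.2f}% {total:7.2f}%")

print(components_needed(80), components_needed(90))
```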

When we look at the map of the factor loadings for each principal component, we begin to understand how they reduce the dimensions without losing the ability to model volatility.

Principal Components for the North Bay Region

Each principal component has several factor loadings. The factor loading is positive if the green bar juts to the right and negative if it juts to the left. Each component is composed of various combinations of factor loadings or exposure to the variables – examples are:

College Education

Foreign Born

Renter Occupied Housing

Population Density

Median Age of the Structures

Median Age of the Population

Hispanic

Asian

Principal Component 1, which explained 34.82% of the volatility in the data, is heavily positively loaded for the following key factors:

White

Total Population

Living in the same house last year

Owner occupied

Median Income

Higher Aggregate Income for the tract and by household

Principal Component 2 explains 23.41% of the volatility and contrasts strongly with #1:

Populated areas (same as #1)

Renting housing

Low percent white

High foreign-born

Low owner-occupied housing

Low income

Highest Hispanic loading

Principal Component 3 explains 10.23% of the volatility:

Negative exposure to White

Most Positive Wages to Social Security

Foreign Born

High Percent College

Negative weighting on Owning House

Negative on Living in Same House Last Year

Positive loading for Aggregate Income per Person

The Principal Components can then be used to construct a mathematical model of the census tract.

With Principal Components, marketing can develop very precise mathematical descriptions for target neighborhoods. The weights for each Principal Component are assigned to each tract and serve to mathematically characterize the location in detail. For instance, the weight for PC1 for Covelo is -4.381 because it is the site of the Round Valley Indian Reservation, and White residents are a relatively small percentage of the population. However, Covelo’s weighting for PC7, with its strong factor loading for Native Americans, is 6.850. West Novato in Marin County, on the other hand, is a neighborhood with many White residents; its weight for PC1 is 7.105, while its weight for PC7 is 0.295. These two tracts contrast with East San Rafael, which has one of the highest exposures to PC2, the component heavily loaded for renters and Hispanic residents with very few college graduates. This is only the beginning: the possible combinations of weights provide deep insight into the demographics of each tract.
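One way to put these tract weights to work is to treat each tract as a point in component space and measure how far apart two tracts are. The sketch below does that with the PC1 and PC7 weights quoted in the text. Using Euclidean distance as a similarity measure is an assumption made for illustration, and the helper name profile_distance is hypothetical.

```python
import math

# Component weights quoted in the text; all other components are omitted.
tract_weights = {
    "Covelo": {"PC1": -4.381, "PC7": 6.850},
    "West Novato": {"PC1": 7.105, "PC7": 0.295},
}

def profile_distance(tract_a, tract_b, components=("PC1", "PC7")):
    """Euclidean distance between two tracts over the chosen components."""
    a, b = tract_weights[tract_a], tract_weights[tract_b]
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in components))

distance = profile_distance("Covelo", "West Novato")
print(round(distance, 3))
```

A small distance would flag tracts that could receive the same message; a large one, as here, suggests separate campaigns.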

We can now map the importance of the component to each census tract. NOTE: This paper will only look at the top 3 principal components.

PC1 Loads are heavily influenced by factors associated with the white population (see the yellow and light green tracts):

PC2 Loads Hispanic and associated variables (Note that once again, tracts that have a high Hispanic contribution are yellow-green):

Positive:

Percent Foreign Born

Renter Occupied

Population Density

Hispanic

Negative:

White

Percent College Graduate

PC3 Loads Wealth, foreign-born, college education, and negative for receipts of social security: PC3 is heavily influenced by income from wages, is the only factor where race is not a major loading, and tends to be more important in the South.

Principal Components Regression: Supervised Learning Applied to Unsupervised Learning Results

Benefit: Principal Components Regression provides another view of the data, like looking at a house from the front and then walking to view from the side.

The previous PCA focused on component-by-component analysis. The PCA regression gives a tool to incorporate all the components in one equation to evaluate a tract. Note: The PCs can be used as indices and equations can be used to develop a score for each tract.

There is more to the Principal Components story. Principal Components can be used for principal components regression, in which the derived components themselves are used as model predictors. Generally, components should be chosen that account for at least 90 percent of the original variance in the predictors, though this will often be up to the discretion of the analyst. In the example below, we fit a model using the first six principal components, which represent roughly 80% of the variance in the predictors; the outcome variable is once again the log of the median home value.
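A minimal sketch of principal components regression follows, written with NumPy on synthetic data rather than the study's ACS variables. It standardizes the predictors, keeps the first few component scores, and fits ordinary least squares on those scores. The function name pcr_fit and the simulated outcome are assumptions for illustration only.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal component scores of X.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[:n_components].T  # tract weights on each kept component
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return coef, r_squared

# Synthetic predictors and a simulated log home value (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
log_value = 13.0 + 0.4 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * rng.normal(size=150)

coef, r2 = pcr_fit(X, log_value, n_components=6)
print(np.round(coef, 4), round(float(r2), 4))
```

Because the component scores are centered on zero, the fitted intercept equals the mean of the outcome variable, which is why the intercept can be read as the regional average.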

Principal Components Regression

Principal Components Regression Analysis

With an R-squared value of 71.46%, the model fit is close to the first regression model fit of 75.05% earlier in this paper. The PCA model is also statistically significant. We can think of principal components as indices that measure the economic activity in the region. The advantage of this analysis is that we can also examine the contributions of the factors to each census tract based on the factor loadings.

Table of Selected Observations from Map

This regression provides an economic index of the well-being of a census tract. The average is the intercept, 13.42. A tract’s score is the intercept plus the sum, over the Principal Components, of each Estimate multiplied by the tract’s weight on that component. One of the higher scores is 14.81 for Tiburon in Marin County with a Poverty Rate of 5%. One of the lowest scores is 12.35 for Kelseyville in Lake County with a Poverty Rate of 21%.
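To make these scores concrete, the sketch below shows how a score is assembled and what it implies in dollars. It assumes the outcome is the natural logarithm of median home value, which the text does not state explicitly. The function names are hypothetical, and only the intercept (13.42) and the two tract scores come from the text.

```python
import math

def pcr_score(intercept, estimates, weights):
    """Intercept plus the sum of (regression estimate x tract weight) per component."""
    return intercept + sum(e * w for e, w in zip(estimates, weights))

def log_score_to_dollars(score):
    """Convert a log-scale score back to dollars (assumes natural log)."""
    return math.exp(score)

# Scores quoted in the text.
quoted_scores = {
    "Regional baseline (intercept)": 13.42,
    "Tiburon": 14.81,
    "Kelseyville": 12.35,
}
for place, score in quoted_scores.items():
    print(f"{place}: ${log_score_to_dollars(score):,.0f}")
```

Under that assumption, the Tiburon score corresponds to a home value of roughly $2.7 million and the Kelseyville score to roughly $230,000, which shows how a gap of about 2.5 points on the log scale translates into more than a tenfold difference in value.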

The regression scores are based on the following combination of variables.

PC1: A strong positive contribution to the median value of housing for the entire region.

PC2: Negative factor in housing valuation and the second most significant variable.

PC3: As noted above, this component measures the factors associated with high income in a region. It makes sense that this component would have the highest estimate (0.1615) and would be the most statistically significant (t value of 15.864).

PC4: This component loads heavily for a high percentage of owner-occupied houses, low number of renters, low percentage White, high percentage Asian, large number of rooms in the house, and occupied by the same person last year.

PC5: Not significant (small estimate and low t value).

PC6: This factor loads heavily for residents of Pacific Islander descent, and most of the locations are in Napa Valley. It also has a heavy factor loading for the age of the structure and a negative loading for Other Races and Hispanics.

PC7: There are two main positive loadings for this component: percentage Native American and percentage Black.

PC8: There are two main positive loadings for this component: median structure age and other race.

PC9: There are three main loadings for this component, all negative: Pacific Islanders, other races, and Native Americans.

Supervised Learning: Geographically Weighted Regression

Benefit: The linear regressions estimate global relationships between the dependent variable (variable being predicted) and the independent variables (used to predict). Per Walker, “This lends itself to conclusions like ‘In the Dallas-Fort Worth metropolitan area, higher levels of educational attainment are associated with higher median home values.’ However, metropolitan regions like Dallas-Fort Worth are diverse and multifaceted. It is possible that a relationship between a predictor and the outcome variable that is observed for the entire region on average may vary significantly from neighborhood to neighborhood. This type of phenomenon is called spatial non-stationarity, and can be explored with geographically weighted regression, or GWR (Brunsdon, Fotheringham, and Charlton 1996).”

In the following analysis, we map the Median Home Value and then compare that to the local R-squared to find the local variations from the global conclusions that we reached using PCA and PC Regression.
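The idea behind GWR can be sketched briefly: instead of one regression for the whole region, a separate weighted regression is fitted at each location, with nearby observations weighted more heavily than distant ones. The example below is a simplified illustration with a fixed-bandwidth Gaussian kernel on synthetic data. It is not the study's model or software, and the function name and bandwidth value are assumptions.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one distance-weighted least-squares regression per location.

    Returns an array with one row per location: [intercept, slope(s)].
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        weights = np.exp(-0.5 * (dist / bandwidth) ** 2)  # Gaussian kernel
        root_w = np.sqrt(weights)
        # Weighted least squares via ordinary least squares on scaled rows.
        beta, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
        betas[i] = beta
    return betas

# Synthetic region: the predictor's effect is 1.0 in the west and 3.0 in the east.
rng = np.random.default_rng(3)
n = 60
coords = np.column_stack([np.linspace(0.0, 10.0, n), np.zeros(n)])
x = rng.normal(size=n)
true_slope = np.where(coords[:, 0] < 5.0, 1.0, 3.0)
y = 10.0 + true_slope * x

betas = gwr_coefficients(coords, x, y, bandwidth=1.0)
print(round(betas[0, 1], 2), round(betas[-1, 1], 2))  # local slopes, west vs east
```

A single global regression would report one blended slope for this data; the local estimates recover the difference between the two halves of the region, which is the spatial non-stationarity the quotation describes.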

Geographically Weighted Regression Map

The map below shows the predicted values for the log of the Median Home Value; the results are very similar to those of the Principal Component Analysis.

With the base R-squared of 75% on the legend, this map shows how the R-squared deviates by census tract. The model performs well across the region but is better in some of the more rural areas of the region, especially in the very south, the eastern, and northern census tracts where the R-squared ranges from 80% to above 90%. Note: the model deviates most in the rural regions of Mendocino and Lake Counties.

The map below shows the relationship of the Percent of Owner-Occupied Housing in local tracts to the overall model. Recall that the relationship of the percentage of Owner-Occupied Housing (OOH) to home value is negative for the region. The dark purple areas on the map are those areas where the global relationship in the model reflects the local relationship, as local parameter estimates are negative. The areas that stand out include the high-density area of lower Marin County, where median home values are very high. However, in the mostly northern rural tracts of the region, the estimate is zero, indicating that the local percentage of owner-occupied housing does not affect the value.

The population density parameter estimate was positive for the overall model. The tracts in Marin County in the south, in southern Mendocino County in the center, and Lake County in the central east have no local beta. Once again, the key wine-growing regions in Napa and Sonoma go against the overall trend: property values increase with lower population density.

Cluster Analysis: Unsupervised Learning

Benefit: Cluster Analysis identifies economic characteristics that explain the spatial distribution of economic activity and groups the tracts into data sets or clusters with similar characteristics versus clusters that differ significantly. Cluster analysis on PCs provides insights into economic opportunities in micro-regions. Does the tract present qualities for economic growth or is it characterized by a low-income or impoverished economy?

While Cluster Analysis can be run on raw data, many data scientists apply PCA to the data before analyzing it with clustering algorithms. This two-step procedure reduces the noise in the clustering results. While PCA and Cluster Analysis are similar, the two techniques have different goals. If a study has 100 features (variables), PCA tries to condense that information into a smaller number of features that really matter. Cluster Analysis, on the other hand, seeks to group the tracts described by those 100 features into a set of clusters that are internally similar but significantly different from the other clusters.

After several iterations, I found that 6 clusters provided the best fit and separated the PCs into distinct groups.
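The text does not name the clustering algorithm, so the sketch below uses k-means as an assumed, commonly used choice, applied to synthetic component scores rather than the study's data. It returns the cluster labels and the within-cluster sum of squares (inertia); comparing inertia across different numbers of clusters is one way to carry out the trial-and-error search described above. The farthest-point starting rule is a simplification chosen to keep the example deterministic.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point start."""
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = float(np.sum((points - centroids[labels]) ** 2))
    return labels, centroids, inertia

# Synthetic (PC1, PC2) scores forming three groups of 20 tracts each.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

labels, centroids, inertia = kmeans(points, k=3)
for k in (2, 3, 4):
    print(k, round(kmeans(points, k)[2], 2))  # inertia falls as k grows
```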

Recall from the factor loadings in the dot plots in previous sections that PC1 is the component that represents the White population with higher income, some college graduates, and owner-occupied housing. PC2 represents a mostly Hispanic population dominated by renters. The dots represent the tracts, and the color identifies the assigned cluster. Tracts to the right of zero on the horizontal axis are weighted positively toward PC1, as with Clusters 2, 3, and 5. Tracts above zero on the vertical axis are weighted positively toward PC2, as with Clusters 1 and 4.

Plotting PC1 against PC3 (income - no race component) shows that there are several distinct income groups. Cluster 1 (Hispanic, renter in the denser south region) ranks positively in PC3 as does Cluster 3 (Rural, White, Small Towns) and Cluster 5 (Wealthier denser populations in the south). Cluster 2 (Rural population centers with mixed PC1 and PC2), Cluster 4, and Cluster 6 do not rank as high in wealth.

Cluster #1 (Red): Rural areas, firmly Hispanic (PC2) with few Whites (PC1). In general, these regions are in pockets that lie between bigger regions, like the 101 corridor from Cloverdale to Healdsburg, a small tract on the southeast side of Santa Rosa, and in Napa Valley.

Cluster #2 (Blue): Cluster 2 includes heavily commercial areas in the south that have positive exposure to both PC1 and PC2, where both White and Hispanic residents are strongly represented. In the south, this represents the Highway 101 corridor. Most of these areas have the densest populations in the region and are significantly more populated than the surrounding census tracts. The big blue tracts in the north are Willits, Ukiah, Kelseyville, and Cloverdale.

Cluster #3 (Green): Represents agricultural and wine-growing regions with positive exposure to PC1 (White, owner-occupied). The green areas run from northern Santa Rosa to Healdsburg and include some of the Russian River wine region and key wine regions of Napa. Like Cluster 5, this cluster has no exposure to PC2, the Hispanic-dominated Principal Component.

Cluster #4 (Purple): Represents poorer, racially mixed populations in rural and agricultural regions.

Cluster #5 (Orange): This cluster has a positive weighting for PC1 and a negative weighting for PC2, meaning the residents are predominantly White and live in the higher-income areas of Marin and Sonoma Counties.

Cluster #6 (Yellow): Equally PC1 and PC2, but low exposure to PC3 (wealth component). This cluster covers the Covelo tract in the northeastern corner, some of the poorer neighborhoods around Clear Lake in Lake County, and the Point Arena and Navarro/Boonville regions of western Mendocino County. These areas are sparsely populated and either agricultural or heavily forested.

Summary

While this study delivered some deep insights into the demographic breakdown of the North Bay Region, it is a preliminary case study or a first step that small and medium-sized companies can take to understand customers and improve customer experience.

AI and Machine Learning Tools can spearhead an effective defense against bigger companies and competition from new, disruptive technology. Despite the length of this paper, it only addressed a small group of customer preferences, and, in many ways, it raises as many questions as it answers.

Some of the solutions that the analysis highlighted:

1. A financial company might want to advertise the highest interest rates on deposits to the wealthy, older communities in Marin.

2. Send out advertisements in Spanish to the heavily Hispanic Communities.

3. Promote community programs in the poorer tracts of Lake and Mendocino Counties.

4. A finance company might want to promote home equity loans in regions with high home ownership.

5. On the other hand, the same finance company would promote affordable home loan programs in Spanish to tracts with a large percentage of the Hispanic population who rent.

As the paper shows, AI and Machine Learning give Small and Medium-Sized businesses the tools to counter disruptive market developments with a deep understanding of their market and an intense commitment to improving the customer experience, from a seamless delivery of products to attention to customers’ specific needs.

In the bigger picture, the ability to use AI also depends on a company’s corporate structure. The modern firm needs to transform itself into a digital, agile organization that can share AI throughout the firm to survive the coming market disruptions.

Neural Profit Engines provides a suite of chief financial officer services under the brand name Neural Financial Officer.

· Big Data studies to aid planning and financial analysis

· Business strategy based on AI and Machine Learning

· Data cleaning - labeling and preparation

· AI and Machine Learning analysis of big data and trends

· Planning and financial analysis

· Bookkeeping and accounting services

· Company training for ChatGPT and Bard

Vernon H. Budinger, CFA

Chief Executive Officer

vernon@neuralprofitengines.com

www.neuralprofitengines.com

+1(707) 513-0880