This chapter covers
- Identifying which customers are most likely to abandon a service (churn)
- Targeting customers who are most interested in buying more (upselling)
- Using unsupervised learning for data-driven customer segmentation
- Case studies: using AI on electric grid data and mining retail analytics
In the previous chapter, we explored the role of structured data in a variety of business applications. Although sales and marketing can sometimes fit into the core business data category, these functions are so important and distinctive that they deserve their own chapter. We’ll cover various marketing problems and explore how you can use artificial intelligence and data science to strengthen the relationship between your organization and its customers.
3.1 Why AI for sales and marketing
One of the main goals of marketers is finding the best way to offer the right product to the right customer at the right time. But even with billions of dollars at stake, marketers have suffered from various limitations. The first was a lack of data: when the world wasn’t as connected as it is now, the only way to get answers was to talk to people. The internet largely solved this problem: it’s now easier than ever to reach broad audiences, expose them to a message, and measure their reaction. The flip side of the coin is that it’s easy to end up with data so large and granular that it’s too much for humans to digest and extract insights from.
We want to start this chapter by giving you a little insight into why AI changes everything. Every marketer knows that not all customers are alike, and that they respond best when engaged with a personalized message. A common marketing strategy is to divide customers into segments according to demographics or similar attributes. A simple segment might be “wealthy women between 25 and 30 years old who spend more than $1,000 per year on entertainment.” A marketer can craft a custom message for this category of people, different from the message used for other segments. While this technique is as old as the marketplace, it really was the best we could do before AI came on the scene.
The problem with this approach is that no matter how specific you get with your segmentation (marketers talk about microsegmentation), you’ll always end up in a situation where two customers are treated exactly the same even if they’re fundamentally different, just because they fall into the same category. There is a limit to the number of categories a human brain can manage. Just think about how many of your friends have characteristics similar to yours on paper (same age, neighborhood, education) but completely different tastes.
As you can see in figure 3.1, the traditional marketing segmentation approach can’t target Marc for being Marc. It will always target him as a “male between 25 and 30 years old who lives in a large city.” AI changes the rules of the game because it can process much more information. With AI, you can reach personalization at scale, learning about people from their specific actions and characteristics and targeting them for who they really are, and not for the handcrafted bucket they fall into.
Figure 3.1 AI personalization versus traditional marketing segmentation
What does such fine-grained personalization mean for a business? Well, companies specializing in AI for marketing can show some eye-popping metrics that would be every marketer’s dream. An example is Amplero, a US company specializing in AI-driven marketing. Here are some of the results it reports in its marketing material:
- It helped a major European telco increase the first 30-day average revenue per user from 0.32% to 2.8%, an almost 800% increase.
- It reduced the customer acquisition cost (CAC) of one of the top five North American mobile carriers by over 97%: from $40 per customer to just $1.
- It retargeted the unhappiest customers of a major European mobile carrier three weeks before they would have canceled their plans, created a more meaningful customer experience to reengage them, and increased retention rates from 2% to 10%.
These numbers aren’t meant to boast about the results of a specific marketing company. You’ll find many startups and larger organizations that can achieve similar results. If the idea of reaching this kind of performance in your organization gives you goosebumps, you’re not alone. Let’s see how this can be made possible.
Marketing is a complex function, so instead of listing all the possible applications, we’ll focus on three general problems that apply to most businesses:
- Identifying which customers are likely to leave your service (churn)
- Identifying which customers are likely to buy a new service (upselling)
- Identifying similar customer groups (customer segmentation)
3.2 Predicting churning customers
One of the most important marketing metrics is customer churn (also known as attrition or customer turnover). Churn is defined as the percentage of customers leaving a business over a period of time. Wouldn’t it be amazing to know beforehand which customers are unhappiest and most likely to abandon a product or service in the near future? This is exactly where AI can help: using machine learning and the data assets of the organization, we can find the customers who are most likely to leave and reach out to them with personalized messages to bring their engagement back up. Next we’ll show how a churn predictor works, giving you the confidence to spot opportunities for this application in your organization.
In this machine learning problem, we have two classes of customers: those who are likely to churn and those who are not. Therefore, the label our ML model will have to learn to predict is whether a customer belongs to one class or the other (let’s say that customers who are about to churn belong to class 1, and the others belong to class 0). For instance, a telephone company may label as “churned” all the customers who dropped its phone plan, and as “not churned” all the others who are still on their plan.
Now that we have defined a label for our algorithm to predict, let’s look into which features we can use. Remember that features in an ML problem are the parameters the model looks at to discriminate between classes. These can be attributes of the user (for example, demographics) or measures of their interaction with your product (for example, the number of uses of a specific service in the last month).
What we have just described has the form of a supervised learning problem: an ML algorithm is asked to learn a mapping between a set of features (customer characteristics) and a label (churned/not churned) based on historical data. Let’s recap the necessary steps to solve it, as visualized in figure 3.2:
- Define an ML task starting from a business one (identifying customers who are likely to leave our service).
- Clearly identify a label: churned or not churned.
- Identify the features: elements of a customer that are likely to influence the likelihood of churning. You can come up with possible examples by thinking about what you would look at if you had to do this job by yourself:
- Age
- How long the customer has used the service
- Money spent on the service
- Time spent using the service in the last two months
- Gather historical data of churned and active customers.
- Train the model: the ML model will learn how to predict the label, given the features.
- Perform inference: use the model on new data to identify which of your current customers are likely to churn.
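To make steps 4 through 6 concrete, here is a minimal sketch in Python. Everything below is invented for illustration: the synthetic customer data, the rule used to stand in for historical churn records, and the choice of logistic regression as the model. In a real project, these features would come from your own data, and your data science team would pick the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invent a small historical dataset: one row per past customer, with the
# features [age, months_subscribed, logins_last_60_days].
rng = np.random.default_rng(42)
n = 500
age = rng.integers(18, 70, n)
months_subscribed = rng.integers(1, 60, n)
logins = rng.integers(0, 50, n)
X = np.column_stack([age, months_subscribed, logins])

# Toy labeling rule, standing in for real churn records: short-tenure,
# low-usage customers are more likely to have churned (class 1).
score = 0.04 * (60 - months_subscribed) + 0.05 * (50 - logins)
y = (score + rng.normal(0, 0.5, n) > 3.5).astype(int)

# Training: the model learns a mapping between features and label.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inference: score current customers by their churn risk.
current = np.array([
    [25, 2, 1],    # new customer, barely active
    [55, 48, 40],  # long-time, heavy user
])
risk = model.predict_proba(current)[:, 1]  # probability of class 1 (churn)
print(risk)
```

The output is one churn probability per current customer, which marketing can use to decide whom to target with a retention campaign.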
Notice that the label must be found retroactively by looking at past customer records. Let’s consider the easiest situation first. Assume you have a subscription-based business model, like Netflix or Spotify. Subscriptions are usually renewed automatically, so customers have to actively take an action to cancel: call customer service in the case of a phone company, or go to the website and turn off automatic renewal in the case of Netflix or Spotify. In these situations, finding your label is easy: there’s no doubt about whether a customer is still on board, and a clear database table exists that can tell you exactly when a cancellation happened.
Other business models are more complex to deal with. Let’s assume you’re the marketing manager of a supermarket, and you use loyalty cards to track your customers every time they come in and shop. Most likely, a customer who found a better supermarket won’t call you and say, “By the way, I just want to let you know that I won’t be shopping at your supermarket again.” Instead, this person likely won’t show up anymore, and that’s it! No traces left, no Unsubscribed column in your database, no easy label. Can you still find a way to assign labels to these kinds of customers? Sure you can. As you can see in figure 3.3, a common and simple way is to look at purchase patterns and spot when they change suddenly. Let’s assume that a very loyal family comes in to buy groceries every Sunday. However, in the last month, you haven’t seen them. You may assume that they’ve decided not to come back, and therefore label them as “churned.”
Is one month the right threshold? It’s hard to say without additional context, but luckily, that’s not your job: leave the task of figuring it out to your data scientists. What’s important is that you understand that no matter the business, if you have returning customers--and have collected data on their interactions--there’s likely a way to define churn and identify who has left and who is still active.
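As a sketch, the supermarket labeling rule might look like this in code. The customer IDs, the dates, and the one-month threshold are all invented for illustration; as noted above, picking the right threshold is your data scientists’ job.

```python
from datetime import date

# Hypothetical purchase history: customer -> date of most recent purchase.
last_purchase = {
    "fam_rossi": date(2024, 3, 31),
    "fam_smith": date(2024, 2, 10),
    "fam_lee":   date(2024, 1, 5),
}

TODAY = date(2024, 4, 7)
CHURN_THRESHOLD_DAYS = 30  # assumption: one month of silence = churned

# Label 1 (churned) if the gap since the last purchase exceeds the threshold.
labels = {
    customer: int((TODAY - last).days > CHURN_THRESHOLD_DAYS)
    for customer, last in last_purchase.items()
}
print(labels)
```

Only the Rossi family, seen a week ago, keeps the “active” label; the other two cross the 30-day gap and get labeled as churned.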
Once you come up with labels to distinguish happy customers from churned ones, the situation becomes similar to the house-price prediction example we’ve seen before. Luckily, the training data for churn prediction can easily be extracted from a company’s customer relationship management (CRM) system. More specifically, we can extract CRM data from up to, say, 18 months ago, and then label which customers have churned in the past 6 months.
Figure 3.2 The process of creating and using an ML model, from its definition to its usage in the inference phase
By now, you’re already much more confident and effective in defining a label for a churn-prediction project than most business managers. Every data scientist will be thankful for that, but if you really want to help them, you need to put in extra effort: help them select features.
If this sounds to you like a technical detail, you’re missing out on a great opportunity to let your experience and domain knowledge shine. In an ML problem, remember that a feature is an attribute of the phenomenon we’re trying to model that affects its outcome. Assuming you’re a marketing expert, no one in the world has better insights about the relevant features, and your expertise can help your data science team follow a path that leads to successful results.
Figure 3.3 A graphical representation of the buying pattern behavior of churned and active customers
To give you an idea of what your contribution may look like, ask yourself, “If I had to guess the likelihood of churn of just one customer, what parameters would I look at?” This can inform the conversation with an engineer:
Engineer: Do you know what is affecting the customer churn? I need to come up with some relevant features.
Marketer: Sure, we know that the payment setup is highly relevant to churn. Usually, someone who has a contract instead of a prepaid card is less likely to abandon the service because they have more lock-in. It’s also true that when we’re close to the expiration date of a contract, customers start looking at competitors, so that’s another factor.
Engineer: Interesting. For sure, I’ll use a feature in the model that expresses “contract” or “prepaid.” Another feature will be the number of days to the expiration of the contract. Anything else?
Marketer: Sure, we know that age plays a big role. These young millennials change companies all the time, while older people are more loyal. Also, if someone has been our client for a long time, that’s a good indicator of loyalty.
Engineer: Nice; we can look in the CRM and include a feature for “days since sign-up” and one for age. Is age the only interesting demographic attribute?
Marketer: I don’t think gender is; we never noticed any impact. The occupation is important: we know that the self-employed are less eager to change plans.
Engineer: OK, I’ll try to double-check whether gender has any correlation with churn. Regarding the occupation, that’s a good hint. Thanks!
A conversation like this can go on for days, usually with a constant back-and-forth between the engineers and you. You’ll provide your experience and domain knowledge, and the engineer will translate that into something readable by a machine. Eventually, the engineer will come back with some insight or questions that came out of the data analysis and that require your help to interpret. As you can see, it’s not a nerd exercise: it’s a team effort between the business and the nerds.
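To see what “translating domain knowledge into something readable by a machine” can look like, here is a hedged sketch of the engineer’s side of the conversation above. The CRM export, the column names, and the placeholder value for plans with no expiration are all invented.

```python
import pandas as pd

# Hypothetical CRM export mirroring the dialogue: payment setup,
# contract expiration, age, and tenure.
crm = pd.DataFrame({
    "payment":           ["contract", "prepaid", "contract"],
    "days_to_expiry":    [200, None, 15],   # prepaid plans never expire
    "age":               [24, 31, 67],
    "days_since_signup": [400, 30, 2900],
})

# Translate the marketer's hints into machine-readable features:
# a 0/1 flag for the payment setup, plus a filled-in expiry countdown
# (9999 is an arbitrary stand-in for "never expires").
features = crm.assign(
    is_contract=(crm["payment"] == "contract").astype(int),
    days_to_expiry=crm["days_to_expiry"].fillna(9999),
).drop(columns=["payment"])
print(features)
```

Each row is now a purely numeric description of a customer, ready to be fed to a classifier alongside the churned/not-churned label.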
3.3 Using AI to boost conversion rates and upselling
You’ve seen how churn prediction can be a powerful application of classification algorithms. In that case, the classes we label customers with are “churned” or “not churned.” In other situations, you can label customers with any class that is relevant to your marketing department and use ML algorithms to make predictions. A natural one is whether a customer will buy a new service, based on past sales.
Let’s imagine you have a classic marketing funnel: customers subscribe to a free service, and then eventually some of them upgrade to a premium one. You therefore have two classes of customers:
- Converted--Customers who bought the premium service after trying the free version
- Not converted--Customers who kept using the free service
Web companies may invest millions to maximize the number of users who convert to a paid product. This metric is sacred to companies with a Software as a Service (SaaS) business model, in which services are purchased through a subscription. Depending on the conversion rate, a web-based subscription business can live or die.
The most naive way to try to increase the conversion rate to paying users is to massively target the entire user base with marketing activities: newsletters, offers, free trials, and so on. More sophisticated marketers may set up elaborate strategies to assess the likelihood of a conversion and invest the marketing budget more wisely. For instance, we may believe that a user who opened a newsletter is more interested in buying the premium service than one who never opened any, and target them with Facebook ads (have you ever been spammed on Facebook after you visited a website or opened a newsletter?).
Given the importance of the topic and the amount of money that rides on it, let’s see if we can use ML to classify users by their conversion likelihood, optimizing our marketing costs and achieving better results. If you look at the problem, you’ll see it’s a perfect fit for machine learning. You’ve already seen that you have a clearly defined task: identifying users who can upgrade from a free to a paid service. This is a supervised learning classification task, and you have the labels ready: let’s say 1 for users who bought the paid service, and 0 for users who didn’t. You now need to think of what features you would use to train your classifier. Remember that a good starting point for identifying features is to ask yourself this question: “If I had to guess the likelihood of a conversion myself, what information would I need?” This information might include the following:
- Usage of the free product. Remember this has to be an actual number, so you have to come up with a useful way to describe this. If you’re selling a service like Dropbox, usage may be described with a bunch of parameters:
- Number of files stored
- Number of devices the users logged in from (gives a hint about how useful the service is for the user)
- Number of accesses per day/week/month (indicates how frequently the user relies on it)
- Newsletter open rates (How interested is the user in our message?)
- How long ago the user subscribed
- Acquisition channel (Someone who subscribed after a friend’s referral may be more valuable than someone who clicked on a Facebook ad.)
These variables vary based on the kind of business you own, but the concept is generally simple: think of which factors can hint at a likelihood of conversion, and then feed them to your ML algorithm. It’s worth pointing out that some businesses have more data than others: for instance, internet services that use a Facebook login can use their users’ interests as features for such classifiers.
Assuming you have historical data on past customers who converted as well as those who didn’t, you can train your algorithm to identify how the features you selected affect a user’s likelihood to buy your premium service. Once the training phase is done, your algorithm is ready to apply what it has learned from previous customers to your existing ones, ranking them from the most likely to convert to the least likely. As you may recall from the preceding chapter, this phase is called inference (making predictions on new data after the algorithm has been trained on past data). Figure 3.4 illustrates this process of learning from past customers’ behavior and making predictions about new customers.
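The ranking step at the end of inference is worth a tiny sketch. Assume the trained classifier has already produced one conversion probability per free-tier user; the user IDs and probabilities below are made up.

```python
import numpy as np

# Hypothetical output of a trained conversion classifier: one probability
# of upgrading per free-tier customer (the inference step).
customer_ids = np.array(["u01", "u02", "u03", "u04"])
p_convert = np.array([0.12, 0.81, 0.45, 0.67])

# Rank customers from most to least likely to convert, so the marketing
# budget is spent on the top of the list first.
order = np.argsort(p_convert)[::-1]
ranked = list(zip(customer_ids[order], p_convert[order]))
print(ranked)
```

The marketing team can now work down this list, targeting the highest-probability users first and stopping when the expected return no longer justifies the cost.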
Notice that we applied this methodology to the case of an internet-based system that uses a freemium model (a free service and a paid upgrade), but it can be applied to any other case in which you have a group of customers performing one action and another group doing something else (or nothing). This scenario is common, and we’d like to encourage you to look for such situations and think about whether there’s space to build an ML classifier for it.
To give you some inspiration, here are some other cases where you can apply this methodology:
- You have a basic product and some upsells (accessories or additional services, which are common for telco companies). You can label customers with “has bought upsell X” or “hasn’t bought upsell X” and use their basic product usage to assess whether it may be worth proposing the upsell to your customer.
- You have a newsletter and want to optimize its open rates. Your labels are “has opened the newsletter” or “hasn’t opened the newsletter.” The features you use for the classifier may be the time you sent the email (day of the week, hour, and so forth) and some user-related features, and you may also tag emails by their content (for example, “informative,” “product news,” or “whitepaper”).
- You have a physical store with a loyalty card (to track which customer buys what). You can run marketing initiatives (newsletters again, or physical ads) and classify your users based on what brought them into your store and what didn’t.
As you can see, the method we just described--dividing users into two separate classes and building an ML classifier that can recognize them--is flexible and can be applied to a lot of problems. Pretty powerful, isn’t it?
3.4 Performing automated customer segmentation
In this chapter’s introduction, we referenced one of the key activities that marketers have to perform when developing a marketing plan: customer segmentation. Segmenting a market means dividing customers who share similar characteristics and behaviors into groups. The core idea behind this effort is that customers in the same group will be responsive to similar marketing actions. For example, a fashion retailer would likely benefit from having separate market segments for men versus women, and teenagers versus young adults versus professionals.
Figure 3.4 To boost conversion rates, an ML algorithm is trained with data of customers who have purchased a premium service in the past. Later, this algorithm identifies which customers are most likely to do the same.
Segments can be more or less specific, and therefore more or less granular. Here are two examples:
- Broad segment--Young males between 20 and 25 years old
- Highly specific segment--Young males between 20 and 25 years old, studying in college, living in one of the top five largest US cities, and with a passion for first-person-shooter video games
Many marketers can perform this segmentation task intuitively as long as the amount of data is limited, both in the number of examples (customers) and in the number of features. This usually produces generic customer segments like the first one, which can be limiting considering the amount of variation that exists within these groups. A marketer could attempt to define a more specific segment like the second one, but how do they come up with it? Here are questions that could be raised during a typical brainstorming session:
- Is it a good idea to use the 20- to 25-year-old threshold, or is it better to use 20 to 28?
- Are we sure that the college students living in large cities are fundamentally different from the ones living in smaller ones? Can’t we put all of them into a single cluster?
- Is there a fundamental difference between males and females? Do we really need to create two segments, or is this just a cliché?
Answering these questions can be done in three ways:
- Go by your gut feeling. We’re not in 1980, so don’t do that.
- Look at the data, and use the marketer’s instinct to interpret it. This is better than a gut feeling, but marketers will likely project their biases into their analysis and see what they want to see. So avoid this as well.
- Let AI come up with customer segments by itself, keeping a marketer in the loop to use their creativity and context knowledge.
Option 3 is most likely to outperform the others. Let’s see why and how.
3.4.1 Unsupervised learning (or clustering)
Let’s look at the problem we just described:
- What we start with: a pool of customers, with a bunch of elements that characterize them (age, location, interests, and so forth)
- What we want: a certain number of segments we can use to divide our customers
You can imagine this problem as having a bunch of customers and needing to place each one into a bucket, which we’ll call a cluster (figure 3.5).
Figure 3.5 A clustering algorithm divides a group of uniform customers into three clusters.
The customers’ characteristics that we’ll use resemble what we’ve been calling features in previous chapters, so you may think that we’re dealing with the same kind of task and can use the same tools we’ve already described. But the devil is in the details: because we don’t know in advance the groups we want to define, we don’t know which labels to apply. So far in the book, we’ve used a subset of ML called supervised learning. The typical recipe for a supervised learning task is the following:
- We have data on a group of customers characterized by certain features.
- These customers also have a label: a target value we’re interested in predicting (for example, whether they churned or not).
- The supervised learning algorithm goes through the customers’ data, and learns a general mapping between features and labels.
In our new scenario, point 2 is missing: we don’t have a label attached to each user. This is what we want our new algorithm to find. Therefore, this is what our new task looks like:
- As before, we have data on a bunch of customers characterized by certain features.
- We want to divide customers into a certain number of segments (clusters)--let’s say three of them.
- We run some kind of ML algorithm that, looking at the data, determines the best clusters we can come up with and splits the users into them.
This new kind of ML algorithm is called clustering, or unsupervised learning. Unsupervised learning is another form of ML in which an algorithm is fed a set of unlabeled examples (just a set of features) and is asked to divide them into groups that share some similarity. In this sense, unsupervised learning algorithms use the concept of similarity to overcome the lack of a predefined label, dividing the examples they’re fed into groups in an autonomous manner. This is the core difference between supervised and unsupervised learning: supervised learning algorithms learn a mapping between a set of features and labels, whereas unsupervised algorithms look only at the features and group data points into clusters that share a certain similarity, as shown in figure 3.6.
Figure 3.6 The differences in input and output of supervised and unsupervised algorithms
The task of finding similar groups within sets of data is pretty simple when the dimensions that we need to take into account are limited. Take a look at figure 3.7, and you can see that the points naturally condense into two well-separated groups. But what happens when we want to consider a large number of user characteristics? If we want to consider, let’s say, 10 attributes, our minds can’t possibly identify groups that are similar to each other. This is when unsupervised learning algorithms shine: they can scale the concept of similarity even to hundreds of dimensions without any problem, and we can use the results to gain useful insights.
Figure 3.7 The effects of clustering on a simple set of points with two features
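For two-feature data like the points in figure 3.7, a standard clustering algorithm finds the groups in a few lines. The sketch below uses k-means, one common choice (the chapter doesn’t prescribe a specific algorithm), on two synthetic, well-separated blobs of points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of two-feature points, like figure 3.7.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# Ask k-means for two clusters; it assigns each point a cluster label.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(set(labels[:50].tolist()), set(labels[50:].tolist()))
```

With data this clean, each blob lands entirely in its own cluster; the same call works unchanged with 10 or 100 features per point, which is exactly where human intuition gives out.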
Right now, you should have a hunch about why the kind of ML we have been using so far is called supervised: our algorithms were asked to map a given set of features to a given set of labels. In clustering, because we have no labels to begin with, the algorithm has to find them by itself, in an unsupervised way.
You can think about it in this way:
- In supervised learning , you already know what you’re looking for. If you can classify customers into different classes (for example, churned/not churned as we did before), an ML algorithm can learn how to recognize customers belonging to one class or the other.
- In unsupervised learning , you don’t know exactly what you’re looking for: you can’t assign a label to customers. An unsupervised learning algorithm will recognize groups of customers who are similar and assign them a label. It won’t tell you what that label means, though: the algorithm will just tell you that customers with label A are similar to each other and are different from customers with label B; it’s up to you to understand why.
Let’s see how a conversation between an ML expert and a marketer who has read this book would unfold:
Marketer: I’m looking at ways that ML can help our team improve customer segmentation.
ML expert: How did you do customer segmentation before?
Marketer: You know, the “good old way”: using a mix of surveys, experience, and gut feeling. I know that there are some ML techniques that can help with that.
ML expert: Yes, I can use unsupervised learning to automatically generate clusters. Let’s start with something simple: what are the top three elements that are relevant to segment our customers?
Marketer: For sure, demographics like age and gender, and I’d add monthly average spending into the mix. This is a good indicator of their likelihood of purchasing new services from us.
ML expert: Nice. I’ll get an export from our CRM of these dimensions for 1,000 clients and get back to you. I’ll need your help to interpret the results.
ML expert: I’ve done some preliminary clustering, and it looks like we have three well-defined clusters: young low-spender males, high-spender women in their 30s, and one in between.
Marketer: Interesting--we didn’t know that women in their 30s were such a profitable segment for us. I’d like to dig deeper; can we add another dimension to the clustering? I’m interested in their purchase frequency: we know that women tend to buy more often than men, and I wonder whether unsupervised learning can pick up something deeper in it.
ML expert: Sure, let’s add a feature for “average time between orders.” I’ll look into the results.
As a business person, it’s important that you enter these conversations with some knowledge of unsupervised learning, so you can have a constructive discussion with technical people, and with an open mind toward the input they give you back.
A good way to visualize clustering is to picture how stars are scattered across the night sky. Our brain intuitively groups neighboring stars into constellations. Most real-world applications are a bit more complex, for three main reasons:
- Sometimes, data points are distributed homogeneously, making it difficult to decide how many clusters to consider, let alone how to divide customers.
- As humans, we easily perform segmentation along a few dimensions, but we struggle as the number of dimensions increases. Think back to the constellation example: we see the sky as a two-dimensional canvas; performing the same task in a three-dimensional space would be much harder. In four dimensions, it would be impossible. What about with 20 dimensions of customer information?
- From a business standpoint, segmentation is not an idle exercise: it’s most useful when the various segments can be linked to a business outcome, often in terms of customer lifetime value, price sensitivity, or channel preference.
In the next section, we’ll address these issues through an example. For now, let’s dig a bit deeper into the nuts and bolts of clustering. One of the first important decisions to make when tackling a clustering problem is to decide which features (or dimensions) to use for clustering. In our trivial example of a night sky, the choice of dimensions is obvious: the horizontal and vertical position of each star. Real-world applications may be way more complex, though.
3.4.2 Unsupervised learning for customer segmentation
This section will give you more details on unsupervised learning and shed some light on the inner workings of these algorithms. The core concepts you’ve learned so far are enough to envision new applications for unsupervised learning within your organization. If that’s enough for your goals, feel free to skip this section. If you want to know more about how to use these techniques in practice, keep reading.
Let’s use the example of an e-commerce website that sells shoes and has access to the purchase history of its customers. Each data point represents a purchase or a return, and contains information about the shoe such as price, brand, size, color, the date and time of the transaction, and whether it was bought together with other items. We could decide to use all these features for the segmentation or to limit our analysis to a subset. For example, looking at the colors of all the shoes a customer bought might help us better understand that customer’s taste, or looking at the time of the day for those purchases may provide suggestions on the best time of the day to propose discounts.
We could even derive parameters such as the average amount spent per purchase and the number of shoes bought in a month. Together, these two pieces of information would likely help the clustering algorithm find a natural distinction between high-frequency/low-value customers and low-frequency/high-value ones.
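As a sketch of that derivation, assuming a minimal transaction log with invented customer names and amounts, a single grouped aggregation produces exactly those two per-customer parameters:

```python
import pandas as pd

# Hypothetical one-month transaction log for the shoe store example.
orders = pd.DataFrame({
    "customer": ["ann", "ann", "bob", "bob", "bob", "bob"],
    "amount":   [120.0, 80.0, 20.0, 25.0, 15.0, 20.0],
})

# Per-customer profile: average spend per purchase and purchase count.
profile = orders.groupby("customer")["amount"].agg(
    avg_spend="mean", purchases="count"
)
print(profile)
```

In this toy log, ann is the low-frequency/high-value customer (2 purchases averaging $100) and bob the high-frequency/low-value one (4 purchases averaging $20): precisely the distinction we’d want the clustering algorithm to find.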
For the sake of simplicity, let’s assume we’re building a simple clustering algorithm that looks at three features of each customer: age, gender, and average monthly spending.
Keep in mind that the attributes not used for clustering are not thrown away; they can be used for profiling, which means describing the characteristics of each group to inform marketing decisions.
Table 3.1 shows our data for the first five customers.
Table 3.1 Five customer features before being fed to an unsupervised learning algorithm
In some of the most commonly used clustering algorithms, the next step is to decide the number of clusters we’re looking for. This is often counterintuitive: after all, wouldn’t you want the algorithm to tell you how many groups of users there are? If you think about it, though, there are many ways to slice and dice a population, and choosing the number of clusters up front is the simplest way to direct the algorithm. Let’s keep things simple for now and say that we’re looking for three clusters. The clustering algorithm is going to find a way to divide users such that
- Customers within the same cluster are similar to each other.
- Customers in different clusters are different from each other.
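To make this concrete, here’s a minimal sketch of the segmentation step, assuming scikit-learn is available. The customer values and the gender encoding are invented for illustration; in practice they would come from the purchase history described earlier.

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per customer: [age, gender (0 = male, 1 = female), monthly spending in $].
# These values are made up for illustration.
customers = np.array([
    [18, 0, 14.50],
    [19, 0, 16.00],
    [30, 1, 27.80],
    [28, 1, 29.00],
    [22, 0, 18.10],
    [23, 1, 17.50],
])

# As discussed, we choose the number of clusters up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # the cluster assigned to each customer
print(kmeans.cluster_centers_)  # one "stereotype" customer per cluster
```

In a real project, the features would first be put on comparable scales (for example, with scikit-learn’s `StandardScaler`), because k-means is distance-based and a spending range of dollars would otherwise dominate an age range of years.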
In this way, we are mathematically sure that when we’re choosing an action to target customers from a certain cluster, we are maximizing the likelihood that they’ll respond in the same way. Two outputs of the clustering algorithm are interesting to look at:
- The cluster that will be associated with each user, as indicated in table 3.2.
- The cluster centers. Each cluster has a center, which can be considered the “stereotype” of the kind of user who belongs to that cluster. A marketer would call this a buyer persona.
Table 3.2 Adding a Cluster column, identified by an unsupervised learning algorithm
Looking at the cluster centers is crucial, as it gives us quantitative information on what the algorithm has found, which we can then interpret to extract insights. Each center is going to be characterized by the same three features that we used before to describe users (though we may rearrange them to be more meaningful). Typically, we would also add a count of the number of users who fall into each cluster. Table 3.3 shows what the cluster centers may look like, assuming we started with data from 1,000 customers.
Table 3.3 A summary of the characteristics of the three clusters spotted by our algorithm
This apparently innocuous table is packed with useful information. Let’s spend some time trying to extrapolate insights and getting to know our segments:
- Cluster 1 is mainly made up of young (average age: 18.2), predominantly male customers who are not high spenders (average monthly spending of $15.24). This is an average-sized cluster (29% of users are here).
- Cluster 2 skews toward older females (29.3 years on average) who spend much more than any other cluster ($28.15 average spending compared to $15.24 and $17.89 of clusters 1 and 3, respectively). This is a rather small segment, with 12% of users belonging to it.
- Cluster 3 is almost equally split between males and females. They are not as thrifty and young as cluster 1, but definitely spend less than cluster 2 and are much younger (22 years old versus 29.3).
Marketers can make good use of this information and come up with personalized strategies to market products to each group of people. For instance, customers in clusters 1 and 3 can be offered cheaper products than the ones in cluster 2. Adding new features, like the colors of shoes bought, may give us further insights to make more fine-grained decisions.
Notice how the process started with the data and the algorithm’s findings about it, but ended with a human looking at the results and interpreting the cluster centers to extract useful information that is actionable. This aspect is key to any ML algorithm, but is especially critical for clustering: true value is achieved by the symbiosis between what AI can do and what expert and curious humans can build on top of it.
3.5 Measuring performance
Because ML-based predictions have a direct impact on the business outcome, evaluating their performance is an important skill to have. Researchers and engineers will often develop models for you, and report the algorithms’ performance by using different metrics. Although these metrics do a good job of describing the statistical performance of the models, they don’t tell the whole story. In fact, the link between these nerdy numbers and business outcomes can be subtle. It’s your job to understand ML metrics enough to make informed decisions about how the accuracy of the model can affect your business goals.
3.5.1 Classification algorithms
A big part of working with machine learning is getting comfortable with dealing with errors. Even the best-performing algorithm won’t be 100% perfect, and it’s going to misclassify some examples. Remember that the process to build ML applications is to perform some training on historical data first and then use it “in the real world”: it’s important to have a sense of how many errors the algorithm will likely make after it’s deployed, and what kind of mistakes.
The simplest and most naive way to evaluate an algorithm is with a metric called accuracy, representing the percentage of correct guesses over the total predictions:

Accuracy = (number of correct predictions) / (total number of predictions)
However, not all errors are the same. In the case of our churn predictor, there are two possible errors:
- A customer is wrongly labeled by the algorithm as churned but is actually still engaged. This case is called a false positive (FP).
- A customer is wrongly labeled by the algorithm as active but is actually about to cancel their subscription. This case is a false negative (FN).
This distinction isn’t just another fixation for nerds: it has a direct impact on the business, and it’s important for you to understand how it’s going to affect your business decisions. A good data scientist should not report performance as a single accuracy number, and neither should you. A better idea is to use a more informative table like table 3.4.
Table 3.4 Visualizing the possible combinations of algorithms’ predictions and ground truth
Presenting results in this type of table is common for binary classification tasks (in which the label has only one of two outcomes--in this case, churn/not churn), so let’s spend some time explaining how to read it. The goal of this table is to have a measure of the algorithm’s performances in a single snapshot, both in terms of correct guesses and errors.
The top-left and bottom-right cells hold the correct guesses. The top-left cell represents the number of customers who churned and are correctly classified as such by our algorithm (true positives, or TPs). The bottom-right is for active customers (not churned) whom the algorithm identified correctly (true negatives, or TNs).
The bottom-left and top-right cells hold the errors we mentioned previously. Customers who are active but whom the algorithm misclassified as churned are in the bottom left (false positives, or FPs), and customers who churned but have been wrongly classified as active are in the top right (false negatives, or FNs).
Researchers and engineers spend a lot of time looking at tables like table 3.4, because the numbers they contain often give insights as to how an algorithm will perform in the real world. As someone whose business is impacted by these numbers, it’s important that you understand the nuances that can hide behind them.
First, the number of false positives and false negatives are linked together, and it’s easy to trade one for the other without substantial changes to the model or additional training. For example, consider a very naive model that always predicts that customers are churning: whatever inputs it gets, it always outputs “yes, the customer is about to churn.” Metrics for true positives and false negatives will be encouraging (100% the former, and 0% the latter--can you see why?), but false positives and true negatives will be terrible.
Now, false positives, false negatives, true positives, and true negatives are absolute counts. It’s always a good idea to turn them into relative metrics that aren’t sensitive to the number of samples involved. Two metrics can help us with this, called precision and recall. Here’s what they tell us:
- Precision --Out of all the customers whom the algorithm predicted as churned, how many were really going to churn?
- Recall --Out of all the customers who churned, how many did the algorithm predict?
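These definitions translate directly into code. A small sketch, using made-up counts for the four cells of table 3.4:

```python
# Made-up counts for the four cells of table 3.4, out of 1,000 customers.
tp, fn = 80, 20   # churned customers: correctly flagged vs. missed
fp, tn = 30, 870  # active customers: wrongly flagged vs. correctly left alone

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of those flagged as churning, how many really were?
recall = tp / (tp + fn)     # of those who churned, how many did we flag?

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# → accuracy=0.95 precision=0.73 recall=0.80
```

Note how the 870 true negatives inflate accuracy: with these counts, a model that never flagged anyone would still score 90%, which is exactly why precision and recall are more informative for relatively rare events like churn.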
Figure 3.8 shows how precision and recall are calculated based on the algorithm’s predictions.
Figure 3.8 A graphical representation of true positives, true negatives, false positives, and false negatives, and how these metrics combined form the precision and the recall
You can imagine an algorithm with a high precision and low recall as a sniper: it wants to be sure before shooting, so it stays conservative and doesn’t shoot unless it’s 100% sure. This means that it misses some targets, but every time it fires, you’re confident that it’s going to hit the right target. On the other hand, an algorithm with a high recall and low precision is like a machine gun: it shoots a lot and hits many of the targets it should hit, but also hits some wrong ones along the way. Table 3.5 summarizes the difference between precision, recall, and accuracy, and what these metrics tell you.
Table 3.5 Accuracy, precision, and recall
3.5.2 Clustering algorithms
When it comes to unsupervised learning, evaluating performance is tricky because there’s no objective “goodness” metric: because we have no labels to compare against, we can’t say whether the algorithm’s output is “right” or “wrong.” Remember also that most clustering algorithms require you to define the number of clusters you want them to identify, so another question you’ll have to answer is whether your choice of the number of clusters was good. How do we get out of this apparently foggy situation?
First of all, some mathematical tools can tell a data scientist whether a clustering has been done well. Unfortunately, having an algorithm that performs great from a mathematical point of view isn’t necessarily a sign of it being useful for business purposes. If you bought this book, we assume that you’re not using ML to publish a scientific paper but rather to help your organization. If this is true, mathematical stunts won’t be of any interest to you. What you should do instead is look at your results and ask yourself these questions:
- Are the results interpretable? In other words, are the cluster centers interpretable as buyer personas that make logical sense? If the answer is yes, move to question 2.
- Are the results actionable? In other words, are my clusters different enough that I can come up with different strategies to target customers belonging to the different centers?
If the answer to these questions is yes, congrats: you can start testing the results in the real world, collect data, and move forward either by iterating and tweaking your algorithm when you have more data, or by using the new knowledge to redesign your approach. Luckily, underperforming unsupervised algorithms are usually less risky than underperforming supervised learning ones, as the concept of “right” or “wrong” prediction is foggier. For this reason, you don’t have to worry much about metrics, but rather about the testing methodology you should have in place to evaluate the business impact of your project. We’ll talk about the iteration and design process extensively in part 2 of this book.
3.6 Tying ML metrics to business outcomes and risks
Now that you have gained some familiarity with common ML metrics, let’s see what they imply in a business scenario. Let’s assume you have deployed a great churn predictor that has identified a group of customers likely to leave your service, and you want to reach out with a personalized phone call to each of them.
If your data science team has built a model with high precision but low recall , you won’t waste many phone calls: every time you call, you’ll talk with a user who is really considering leaving you. On the other hand, some users are leaving the service, but your algorithm has failed to spot them. A high-recall, low-precision algorithm would instead have you make a lot of phone calls--so you’ll reach out to a large chunk (even all) of the customers planning to abandon your service, but you’ll waste phone calls to other users who weren’t planning to unsubscribe at all.
The trade-off between precision and recall is clearly a function of the business. If each customer has a high value and reaching out is cheap, go with a high-recall model. If your customers are not making expensive purchases, don’t want to be disturbed unnecessarily, and calling them is expensive, go with a high-precision one. An even more sophisticated strategy could be to reserve more-expensive actions like phone calls for customers with higher value or a higher probability of churn, and use cheaper actions like email for others. As you can see in figure 3.9, you can decide whether to focus on recall or on precision based on two parameters: the cost of losing a customer, and the cost of the actions you pursue to try to retain them.
The problem becomes even more serious in safety-critical applications. If we have a classifier used for medical diagnoses, the cost of a false positive and false negative are very different. Suppose our algorithm is detecting lung cancer. In table 3.6, you can see what each error means and its implications.
Table 3.6 Implications of true positives, true negatives, false positives, and false negatives
Figure 3.9 A cheat sheet to decide whether to focus on high-precision, high-recall or just general accuracy based on key business metrics
As you can see, the cost associated with a mistake is very different, depending on whether we’re misclassifying healthy or sick patients. A false positive can imply additional exams, or maybe unnecessary treatment, but a false negative will deprive a patient of the therapy desperately needed to survive.
In such a scenario, you clearly see how dangerous it is to optimize for the wrong metric. Having high accuracy here wouldn’t be indicative of a good algorithm, as it would weight false negatives and false positives in the same way. What we want in this case is a high recall: we want the highest number of patients who are sick to be identified, even if that means having some false positives. False positives may lead to additional unnecessary exams and frightening some families, but it will ensure that the highest number of sick people are spotted and taken care of.
As you can see, every algorithm has an indicator of its performance, and you need to be able to assess which metric is the most important for you, so that your data science team can work on maximizing it. We’ve seen that the best choice is not always obvious. From a business point of view, the most powerful idea is to tie each misclassification to a dollar amount. If your assumptions are correct, this will automatically lead to the best business outcome. Our imaginary telecom company could develop a retention plan for disgruntled customers that gives them a $100 bonus if they stay with the company for the next 12 months, thus retaining a customer that would have otherwise defected to a competitor in the near future.
On the other hand, let’s assume that each lost customer costs the company $500 in profit. Now that we have all these numbers, we can easily compute just how much money false negatives and false positives cost us. Each false negative (a customer who defected to a competitor before we could entice them with a discount) costs us $500 in lost profit. Each false positive (a loyal customer who would not have churned but received a gift anyway) costs us $100 in unnecessary bonuses. We can now use some basic accounting to tie the performance of the model to a monetary value:
Total cost = $500 * FN + $0 * TN + $100 * FP + $100 * TP
Here, FN is the number of false negatives, TN is true negatives, FP is false positives, and TP is true positives.
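The formula translates directly into code. The $500 and $100 figures come from the text; the confusion-matrix counts below are invented to show how two hypothetical models compare:

```python
# Dollar figures from the text; the confusion-matrix counts are invented.
COST_FN, COST_TN, COST_FP, COST_TP = 500, 0, 100, 100

def total_cost(tp, tn, fp, fn):
    """Dollar cost of a model's mistakes (and retention bonuses) on a batch of customers."""
    return COST_FN * fn + COST_TN * tn + COST_FP * fp + COST_TP * tp

# Two hypothetical models evaluated on the same 1,000 customers:
conservative = total_cost(tp=60, tn=880, fp=20, fn=40)  # high precision, low recall
aggressive = total_cost(tp=95, tn=820, fp=80, fn=5)     # high recall, low precision

print(conservative)  # 60*100 + 20*100 + 40*500 = 28000
print(aggressive)    # 95*100 + 80*100 + 5*500 = 20000
```

With these particular costs, the high-recall model turns out cheaper even though it hands out 80 unnecessary bonuses; change the $500 and $100 figures, and the conclusion can flip.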
You can use these same ideas for other situations where you’re using a binary classification model. For example, say that you’re developing an automated quality control system for a manufacturing line. You can tweak the model to either let more items through (increase the false-negative rate) or reject more items right away (increase the false-positive rate). The former might let more faulty items through, thus leading to more issues downstream in the manufacturing process. The latter might result in unneeded waste of material, as perfectly fine products are rejected. In any case, you can adapt the preceding formula to your specific business situation.
Whatever your strategy, you still need to be comfortable with the fact that no machine learning algorithm will be perfect (if it is, something is wrong). You’ll make mistakes: make sure that these mistakes are acceptable and that both you and your algorithm learn from them.
3.7 Case studies
This section presents two case studies from real companies that used AI for their marketing efforts. The first one is Opower, an energy company that used clustering to segment its users based on their energy consumption habits. The second one is a Target initiative: the retail store used the data of its customers to identify signs of a pregnancy early on and start advertising diapers.
3.7.1 AI to refine targeting and positioning: Opower
The relationship that energy companies have with customers is peculiar: although consumer companies normally get paid when the customer makes a purchase, with energy, there’s a lag: people first consume energy for a period of time and get billed later. Also, people don’t buy energy for the sake of it: it’s a necessary expense to enjoy other goods (such as turning on a TV).
These two factors make it hard for every utility to know its customers and interact with them. This isn’t a problem just for the marketing department of energy companies, but also for their overall costs and profitability. In fact, producing energy doesn’t always cost utilities the same amount: it depends on the kind of power plant they use to produce it, a choice that depends on how much energy they need to produce and how far in advance they planned for it.
The best-case scenario is a perfectly flat demand: utilities can start producing X amount of energy using their best power plants and never change. This is, of course, never the case, as daily demand has a duck shape instead, with two peaks usually happening when people wake up and when they come back home from work. To fulfill demand during these peaks, utilities need to use power plants that are easy to turn on and off and modulate, but flexibility comes at a price: these plants are more expensive to run, consume more energy, and burn more-expensive fuels. As represented in figure 3.10, this problem is getting worse: today’s habits are making the energy demand curve less flat, with a bigger valley during the day and sharper peaks in the morning and in the evening. The dream of every utility company is to be able to shape their customers’ habits by moving their peaks to different times of the day, achieving an overall almost flat demand.
Figure 3.10 Typical load curves for the energy consumption of California homes
Opower was founded in 2007 with the mission of providing energy companies the tools to target their users and engage them in changing their consumption habits, both to save energy and to achieve an overall flatter demand. The company was getting smart meter data from utilities’ customers, and could send them reports with details on their energy consumption and tips to improve their habits, using mail, email, internet portals, text messages, and sometimes in-home energy displays.
Figure 3.11 Energy consumption values of a household at different times of the day
The company’s secret sauce is a mixture of behavioral science and data science. Opower turned to one of the core applications of machine learning--unsupervised learning--which it used to identify particular user groups that could be targeted similarly.
The most characteristic element of an energy user is their consumption pattern: how much energy they use during different hours of the day. We can therefore see each user as in figure 3.11: a combination of 24 numbers that represent the amount of energy consumed for each hour of the day. These numbers draw an accurate picture of the habits of the users, just as the number of purchases, average expense, and items bought would be in an e-commerce scenario.
Spotting different groups of users based on their consumption patterns isn’t an easy task. If we represent 1,000 users in this way and plot all of them on a graph, we get what data scientists like to call a hairball: an intricate mix of lines, one on top of the other, impossible for a human eye to untangle. On the other hand, identifying patterns is where unsupervised algorithms are the most useful; using these, Opower was able to spot five user clusters. We gathered data from the internet and ran an unsupervised algorithm to mimic the results that Opower achieved; you can see the hairball and the final clusters in figure 3.12.
Figure 3.12 Turning a “hairball” of 1,000 users into five clusters
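Here’s a rough sketch of this kind of experiment, assuming scikit-learn. The “smart meter” data is synthetic--noisy variations around two invented consumption shapes--so we look for just two clusters; Opower’s real data supported five.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def profiles(peak_hour, n_users):
    """Noisy synthetic 24-hour load curves that peak around a given hour."""
    hours = np.arange(24)
    shape = np.exp(-((hours - peak_hour) ** 2) / 8.0)  # a bump at peak_hour
    return shape + rng.normal(0, 0.05, size=(n_users, 24))

# 1,000 synthetic users: half "morning peakers," half "evening peakers."
users = np.vstack([profiles(7, 500), profiles(19, 500)])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(users)

# Each cluster center is itself a 24-hour curve: the "typical day"
# of the users in that group.
print(kmeans.cluster_centers_.shape)  # (2, 24)
```

Plotting all 1,000 rows of `users` produces the hairball; plotting the two rows of `cluster_centers_` produces the clean, interpretable curves.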
Looking at these clusters, we can try interpreting each of them:
- Twin Peaks --They consume a lot of energy in the early morning, way more than users from other clusters. They also have a peak during the evening at about dinner time, but a bit lower than their morning peak.
- Evening Peakers --They have a high peak during the evening that they reach smoothly, starting from the early morning. The first morning peak is absent.
- Day Timers --They don’t have the classic peaks that all other clusters have. Instead, they have a high consumption during the day that ramps up and slows down smoothly.
- Steady Eddies --They’re the steadiest ones; the two classic peaks are very smooth.
- Night Owls --They have classic peaks during the early morning and late evening but tend to consume energy during the night.
Opower used the results of the clustering to target different users in different ways. Its infrastructure allowed splitting each cluster into groups and testing different messages.
Figure 3.13 Decision tree that Opower used to target its users
From the analysis of the results and using behavioral science, Opower was able to constantly improve its targeting. An example of this segmentation is represented in figure 3.13: Opower could pick a specific cluster (the Night Owls, in this case), and split them into two groups that it targeted with different emails. This approach allowed Opower to optimize its messaging and target people in the best way possible.
Thanks to its approach, Opower was able to help consumers save 10% on their energy bills, while making utilities more profitable, thanks to the better distribution of the total demand. As a result of its performance, Opower was acquired by Oracle in 2016 for $532 million.
Case question
How can you persuade a group of people to change their behavior?
Case discussion
Think for a second about the complexity of Opower’s mission: it has to enable millions of people to change their way of consuming energy by sending them some kind of report. You’d be naïve to think the same message could work for all of them.
Every person responds to different incentives and different messaging, so ideally you’d target everyone with a different message. This is, of course, not practical or scalable, but you can still go a long way by splitting your audience into groups that have something in common. For Opower, this first step was done using unsupervised learning. Remember that unsupervised learning is a class of ML algorithms that looks for patterns in data, finding groups of similar data points based on their features.

In Opower’s case, the most important characteristic was each user’s energy consumption habits, so Opower needed a way to encode them. Its idea was to turn those habits into features (a bunch of numbers) by recording the amount of energy a user consumed in each hour of the day. The result was a representation of each person made up of 24 numbers: simple enough for a human to grasp, and trivial for a computer to work with. Once an unsupervised learning algorithm is fed this data, it can easily compare two users and spot similarities.
Unsupervised learning provided Opower with the first, most important distinction across users: their energy consumption. The second step was measuring the effects of different messages, which Opower did by using behavioral science and A/B testing (sending different messages to different people and measuring the results).
A final implementation note on the number of clusters that the algorithm identified here. Opower had five, but where did this number come from? With some of the most widely used algorithms, you need to decide up front how many clusters the algorithm should look for, and it will find exactly that many.
Mathematical tools give data scientists clues about which number would yield the best results from a mathematical standpoint, but that wouldn’t necessarily be the best solution for your business needs, so there’s no “right” or “wrong” answer. Because there’s no script for deciding the number of clusters, coming up with it can sometimes feel like doing grocery shopping: “Do we want five clusters? Let’s do six; it’s better to have a little more, just in case we run out of them.” We can argue that making a sound decision is actually much simpler.
Think about this: every time you ask the algorithm to find one more cluster, it’s going to see new, subtler patterns in the data. Assuming we have 1,000 users, the upper limit would be to find 1,000 clusters, where each user has a cluster of their own. Before you reach that extreme situation, you’ll sometimes find that you can’t really gain any new knowledge from adding a new cluster. This is a good moment to stop. Adding clusters will make the results of the algorithm more specific, but also less interpretable. Figure 3.14 illustrates this trade-off.
Figure 3.14 The relationship between the number of clusters and the interpretability and specification of the results
The ideal process would be to start with a number of clusters that is rather small; ideally, this number can be provided by your data scientists, who will use mathematical tools to find out what makes sense from a technical standpoint. From there, you can move either down, condensing different clusters, or up, increasing the granularity of the clustering. Every time you increase or decrease the number of clusters, try interpreting the results and ask yourself whether a higher number of clusters helps you better understand the phenomenon. If the answer is no, we can argue that you found the optimal number of clusters. This may not be what the data scientists found as the technical optimum, but it will be what’s optimal for the business.
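This process can be sketched in code by fitting the same (invented) data with several candidate numbers of clusters and watching how much each extra cluster tightens the grouping, a quantity scikit-learn calls inertia:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three invented blobs of customers in a two-feature space.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(100, 2)),
])

inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = model.inertia_  # how tightly points sit around their centers
    print(k, round(model.inertia_, 1))
```

Inertia keeps dropping as k grows, but past the natural number of groups (three here) each extra cluster buys less and less. A data scientist would read this alongside the interpretability check described above: when an extra cluster stops producing a meaningful drop and the centers stop making sense as personas, you’ve gone far enough.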
3.7.2 AI to anticipate customer needs: Target
The birth of a child can be not only a joyful event for a couple, but also the beginning of a new stream of costs. Suddenly, the parents need to include new, expensive products in their usual shopping, such as diapers, baby food, a stroller, and so on.
Such an event is extremely lucrative to retailers, which try to attract new parents to their stores to become their trusted point of reference for purchasing all they need for their newborn. The strategy that retailers usually use is offering discounts and promotions to the newly formed family, and the earlier the retailer can put its coupons in their hands, the more advantage it wins over the competition in acquiring the new customer.
Among all the new expenses that families need to sustain, diapers are an extremely important one. According to a Nielsen report for 2016, American families spend an average of more than $1,000 per year per newborn on diapers. Revenue isn’t the only reason diapers are important to retailers: they also play a strategic role as traffic builders, products that a family can’t live without and that attract young mothers and fathers into a store, where they end up spending money on other goods as well.
The economic and strategic value of acquiring a customer who has just had a baby results in a “coupon war” between retailers: all of them fight to be the first to place offers in the hands of the new mother and father, and attract them to buy diapers and other products at their stores.
The first approach that retailers tried to win these new customers was to get the openly available registry office data for newborns and send targeted offers to the new fathers and mothers, incentivizing them to shop for their first diapers in their shops. However, many retailers quickly started using this strategy and began competing for attention from newborns’ families right after the baby was born.
To get ahead of its competitors, Target, the eighth-largest retailer in the United States, decided to investigate a new, bold strategy: provide families offers before the baby was born. If the retail chain could know that a couple was waiting for a baby, it could outpace its competitors and reach the family first.
The retail giant used its fidelity card data to identify patterns in the shopping habits of pregnant women, and prepare a specific set of coupons and promotions that would entice that category of shoppers to expand the range of items that they choose to buy at Target. In a 2012 Forbes article ( http://mng.bz/PO6g ), we get some indications of the results that Target’s statistician Andrew Pole was able to find:
As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
Target was reportedly able to increase sales thanks to its new strategy. However, increased revenue came at a cost: people started sensing they were being spied on and weren’t happy with that. The first warning sign that the general public might not like such intensive targeting techniques came when a disgruntled father complained about the coupons for baby products that his high-school daughter had been getting from Target. As reported in the same Forbes article, he complained to the store manager, shouting:
My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?
It turns out the girl was actually pregnant, and Target had figured it out before her father did. The Target manager reportedly called the angry father a while later, and got this response, as reported by Forbes:
I had a talk with my daughter. It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.
As reported in the Forbes article, Target ended up mixing personalized recommendations obtained through ML with seemingly random items from the catalog (for instance, adding a coupon for a lawn mower right next to one for diapers), to avoid making customers feel like they were being stalked.
Case question
Are special discounts an effective way to win customers over competitors?
Case discussion
Special offers and coupons are an extremely common strategy that retailers use to attract customers. However, when everybody knows a person is looking for a specific product that everybody can sell, the effect is usually a promotions war that eats into retailers’ margins. This may be good for customers but definitely is not for retailers. We argue that offering discounts to customers can be a valid marketing idea if the following are true:
- The retailer knows who to send the coupons to (it’s effective at targeting its users).
- Other retailers are not targeting that person at the same time, to avoid the price wars mentioned previously.
The initial strategy that retailers adopted was clever: using openly available registry office data for newborns to target families immediately after a baby is born. This strategy is effective at point 1 (targeting the right customer), but fails at point 2: as the name says, open data is open, and every retailer can use it.
Target’s case is a perfect example of how you can use data and machine learning to win over competitors, targeting the perfect customer before anyone else. Target started from its core business sales data: the shopping habits of customers, collected through fidelity cards. It was also able to build a label (“is waiting for a child”/“is not waiting for a child”) for past customers by looking at whether each customer actually started buying diapers later on. We hope you already figured out that this looks like a classic supervised learning classification problem:
- The features are the shopping habits of the customer (let data scientists figure out how to build them exactly).
- The label is whether that customer is expecting a baby, inferred from future purchases of baby-related goods.
- After the model is trained on past data, it can be applied to current customers to identify the ones expecting a baby.
Notice that the label wasn’t immediately available: Target’s customers don’t call the retailer to announce that a baby has been born. However, the label can be assigned to past customers by looking at their purchasing history: if a customer started buying diapers or a baby stroller six months ago, she must have been pregnant during the nine months before and can be labeled as a positive example.
Think about how powerful this approach is. Patterns exist in the purchasing habits of pregnant women: they may stop buying certain foods like sausages or sushi, eat healthier food, or quit cigarettes. The problem is that when there’s a lack of clear indicators (for example, the purchase of prenatal vitamins or diapers) it’s hard for a human to spot patterns by looking at customer purchases. But it’s not hard for an ML algorithm, which can figure out which purchasing habits suggest a pregnancy.
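To make the setup concrete, here is a minimal sketch of this kind of classifier in Python with scikit-learn. Everything in it is an assumption for illustration: the features (weekly purchase counts of a few product categories), the synthetic data, and the choice of logistic regression are ours, not Target’s actual pipeline.

```python
# Illustrative sketch only: features, data, and model choice are assumptions,
# not Target's real system.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features per past customer: weekly purchase counts of
# [prenatal vitamins, unscented lotion, alcohol, cigarettes].
n = 200
# The label comes from *later* behavior: did this customer start buying
# diapers a few months afterward? (Here we just simulate it.)
expecting = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.poisson(2.0 * expecting + 0.2),        # vitamins: more if expecting
    rng.poisson(1.5 * expecting + 0.5),        # unscented lotion: more if expecting
    rng.poisson(1.0 * (1 - expecting) + 0.2),  # alcohol: less if expecting
    rng.poisson(1.0 * (1 - expecting) + 0.1),  # cigarettes: less if expecting
])

# Train on past, labeled customers.
model = LogisticRegression().fit(X, expecting)

# Score a current customer: buys vitamins and lotion, no alcohol or cigarettes.
p = model.predict_proba([[4, 3, 0, 0]])[0, 1]
print(f"estimated probability this customer is expecting: {p:.2f}")
```

The point of the sketch is the workflow, not the model: build labels from later behavior of past customers, train on their earlier purchasing habits, then score current customers for whom the label is still unknown.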
Something else you should keep in mind is that what may sound like a nice story, and a great move by Target to boost sales, can also be a ticking PR time bomb: aggressive targeting in such a sensitive domain can easily backfire.
To summarize, you should keep the following in mind from this case:
- ML can help you predict future purchases, letting you reach customers before your competitors do.
- Core business data has a much higher value than openly available data (more about this in chapter 8).
- Be on the lookout for the creepiness factor: ML can be so powerful that people might feel like you’re spying on them.
Summary
- The core reasons behind the effectiveness of AI in marketing and sales are the ability to target specific customers at scale and make data-driven decisions.
- A churn prediction model is a typical application of AI in marketing, and can spot which customers are most likely to quit your service.
- Likewise, machine learning can help marketers find upselling opportunities by identifying which customers are more inclined to buy more.
- Unsupervised learning is a subset of machine learning that enables computers to find structure in data, such as clusters of similar data points. Clustering also has uses in marketing.
- There are different metrics for evaluating the performance of classification algorithms (we covered accuracy, recall, and precision). Each one tells you something different about the algorithm, and your business objectives can guide you toward the most appropriate one.
- Opower used unsupervised learning to divide energy consumers into clusters and then targeted them individually to optimize their energy consumption patterns.
- Target used supervised learning to identify which of its customers were expecting a baby, and then offered them coupons for diapers and baby products.
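As a quick numeric refresher on the evaluation metrics mentioned in the summary, here is a toy computation of accuracy, precision, and recall for a binary classifier. The labels and predictions are made up for illustration and don’t come from the chapter’s case studies.

```python
# Toy example: accuracy, precision, and recall for a binary classifier
# (1 = positive class, e.g. "churner"; the numbers are invented).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # actual labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # model predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives:  2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives:  4

accuracy = (tp + tn) / len(y_true)   # 6/8 = 0.75: fraction of correct predictions
precision = tp / (tp + fp)           # 2/3: of flagged customers, how many were right
recall = tp / (tp + fn)              # 2/3: of actual positives, how many were caught
print(accuracy, precision, recall)
```

Notice how the same predictions score differently on each metric, which is why the choice between them should follow from the business objective (for example, recall matters most when missing a churner is expensive).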