Swiss insurance companies often take a closer look than others. I like it when customers ask “Is that possible?” questions. That’s when the inventor in me comes out and starts tinkering. It was the same here.
We had just built a causal AI system that explained what drives NPS and how to optimize it strategically. In the dashboard simulator, it was possible to make improvements in certain CX topics and then see how much the insurer’s NPS would improve.
But then came the question: “What good is one more NPS point anyway?” Melanie asked it quite casually. Questions like this stand at the beginning of every innovation.
In our case, we decided to request a data set from the data team that would show the characteristics of the customers surveyed in the past and then also find out whether these customers had churned from their contracts in the following year.
An initial analysis made us suspicious. The NPS value correlated positively with customer churn. The more loyal the customers, the more likely they were to churn? “That can’t be right,” we said to ourselves and built an initial forecasting model that used the NPS rating and some information about the customer to predict whether they would churn. The influence of the NPS was now very small, but still positive.
I looked at the first model and realized that many of the available variables had not been included in the model. In particular, the internal segmentation of customers had not yet been taken into account.
Now you might ask: “What does this have to do with the influence of loyalty on churn?” The answer became more than clear in this project.
The resulting model not only had a better predictive quality, it also showed a clearly negative influence of loyalty (NPS rating) on customer churn.
What had happened?
By integrating additional variables, we had taken so-called confounder effects into account. The confounder in question was the “customer segment”.
The insurance company had a higher-value customer segment that was obviously more selective and documented this attitude with a lower rating. With the same level of loyalty, these people would be more likely to give a lower NPS rating. The causal influence of segment affiliation on the loyalty measure was negative.
At the same time, the tendency to churn was lower in this upmarket segment, as these customers were generally better looked after. The causal influence of segment affiliation on churn was therefore also negative.
If an external variable (segment affiliation) influences two variables at the same time, then these variables correlate for that reason alone. In this case, the resulting correlation is positive, as both the loyalty measure and churn are influenced negatively – i.e. in the same direction.
If this external variable – this so-called confounder – is not included in the model, the model will show a causal relationship that does not actually exist.
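To make this mechanism tangible, here is a minimal simulation sketch in Python (invented numbers, not the insurer’s data): a single segment variable pushes the NPS rating down and the churn probability down at the same time, and that alone produces the paradoxical positive raw correlation.

```python
# Hypothetical illustration of the confounder effect described above (invented numbers).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
segment = rng.binomial(1, 0.3, n)                 # 1 = high-value segment (the confounder)

# High-value customers rate more critically ...
nps = 7 - 2 * segment + rng.normal(0, 1.5, n)
# ... and churn less, independently of their rating.
churn = rng.binomial(1, 0.20 - 0.12 * segment, n)

# Raw correlation: higher NPS appears to go hand in hand with more churn.
print("corr(NPS, churn):", round(np.corrcoef(nps, churn)[0, 1], 3))

# Controlling for the confounder (here: looking within each segment) removes the paradox.
for s in (0, 1):
    m = segment == s
    print(f"segment {s}: corr =", round(np.corrcoef(nps[m], churn[m])[0, 1], 3))
```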
In science and data science practice, a procedure known as “p-value hacking” is common. The p-value stands for the significance of a correlation. In order to obtain more significant correlations in statistical models, it is helpful to remove more and more variables from the model. However, each elimination not only increases the significance of the correlations, but also the probability that the model will produce causally incorrect results.
It has taken a few decades, but now even the American Statistical Association has clarified, in an official statement, the limitations of significance tests as a key quality criterion. But I think it will take another two or three generations before this becomes common practice.
Whenever we cannot fall back on robust theories (which is almost always the case in marketing), causal models are models that take many variables into account – with the aim of reducing the risk of confounders by modeling their effects.
It’s a bit like the hare and the hedgehog. The hedgehog crosses the finish line first and you automatically conclude that he must be faster. The fallacy, however, is the length of the path that the hedgehog has surreptitiously shortened.
Don’t be fooled by the hedgehog. Don’t just look at the data. You can’t measure cause-and-effect relationships, you can’t see them. You can only deduce them indirectly. You have to be very careful not to let the hedgehog get the better of you.
At Success Drivers, we have been analyzing the data from Microsoft’s global B2B customer satisfaction surveys for many years. I still remember the day Angelika, a project manager with us, came to see me. She had developed an AI driver analysis model that explained what drives satisfaction. In line with the guidelines, she had not only incorporated the measured partial satisfaction into the model, but also various other characteristics from the data set to make it more holistic.
She proudly said: “Frank, we now have an explanation quality of 0.88.” That made me sit up and take notice. “What are the main drivers?” I asked. “That’s the IRO,” she said.
Neither she nor I knew what this mysterious IRO was supposed to be. We checked with Microsoft and learned that particularly dissatisfied customers were flagged and then contacted. This variable was not a cause, but a consequence, which we tried to explain.
The AI model didn’t care. It did what it was told: find the function that predicts the outcome from the input variables as well as possible.
The entire model was worthless. IRO contains part of the target variable and therefore renders the model uninterpretable. It is thus part of the standard procedure to use only driver variables that can, in principle, be logical causes.
“Do you always know?” I’m often asked. Of course you don’t always know. But marketing science provides a framework in which most variables in the marketing context can be categorized.
There are variables that relate to the results and status in the marketing funnel. These are, for example, the purchase intention, the consideration/evoked set and the awareness. These variables have a logical sequence. Purchase intention, for example, is a downstream stage of consideration.
The status of the marketing funnel is influenced by the perception of the product and the brand. Do consumers perceive the product as tasting good, healthy or trustworthy? These are product and brand-specific attitudes. And here, too, there are logical causalities. The brand influences the perception of the product. Of course, there is also a reverse causal effect, but this takes place on a longer-term timeline. I will discuss this shortly.
These perceptions and attitudes change as a result of experiences with the product and the brand. This can be the consumption of the product, an advertising contact, a conversation with a friend or a television report. In marketing, these experiences are often referred to as “touchpoints”.
How the touchpoints in turn influence the attitude towards the product and the marketing funnel can vary depending on the target group and situation. It therefore makes sense to include characteristics of the person (e.g. demographics) and the situation (e.g. seasonality) as possible moderating variables.
There are hypotheses about what works and hypotheses about what doesn’t work. According to Nassim Taleb, the latter are easier to set up. It is easier for us humans to know, for example, that overall satisfaction has no influence on service satisfaction than to know that service satisfaction has a decisive influence on overall satisfaction.
Therefore, it is not the aim of the above framework to define which variable influences which causal relationship. Rather, the aim is to specify to the model which causal relationships can be logically excluded with a high degree of probability.
To speak in images: It is relatively safe to say that a pope is a human being. However, a more intensive check is required to identify a person as a pope. We want to leave this check to the AI if we cannot do it ourselves with certainty.
The non-profit organization “Kindernothilfe” wanted to revise its marketing strategy and, as a first step, better understand how donors choose the non-profit organization.
At Success Drivers, we usually approach such a question by designing a questionnaire that surveys the target customers and collects the data we need according to the framework described above. To this end, we held a workshop with the organization to fill in the categories of the framework. This usually happens quite quickly because we find a lot in old questionnaires and documents that simply needs to be assigned.
Nevertheless, it is worth investing a lot of time in brainstorming. In this project, I realized this again through a happy coincidence: at the beginning of the questionnaire, we asked the respondents which aid organizations they knew. From the list of familiar organizations, one other was selected in addition to Kindernothilfe. These two brands were then evaluated in order to understand in retrospect what motivates donors to prefer one aid organization over another.
This question was originally intended only as a control question. When we were building the model, the data set therefore told us which aid organizations each respondent knew. I noticed that some respondents knew a lot of aid organizations and others only very few. My intuition told me that this could be a useful piece of information for the marketing funnel. So the variable was included in the model.
In fact, it turned out that this variable plays a central role. Donors who know many providers proved to be much more selective. It is much more difficult to win them over because they have many points of comparison.
That’s another insight where you say to yourself: “Yes, of course, that’s logical”. But nobody in the workshop had ever said that before.
Such “it’s obvious” aha-experiences have been with me since I started using Causal AI for companies, and we’ll talk about a few more.
The wider we cast the net, the more variables we collect about potentially influential facts, the better the causal model of reality becomes and the more astonishing the “aha” moments.
It was not only useful to realize that donors who only know a few providers are easier to win over. It turned out that younger potential donors naturally know fewer providers. “Also clear” – but unfortunately only in retrospect. In addition, people who know few providers can be found in other places and at other contact points.
In short, the entire marketing strategy was turned on its head, as young target groups suddenly came into focus.
We overestimate the relevance of what we know and underestimate the relevance of what we don’t know. That is why it is so helpful to think “out of the box”. It is precisely this process of taking a holistic approach that will still require human experts for a while yet – or at least a human inspired by LLMs. The expertise required here has nothing to do with data science.
To put it metaphorically, everyone knows the situation: you have a problem, but you can’t find a solution, no matter how hard you try. Then, in the shower, the idea comes. You can’t think of a word, no matter how hard you try – and in a relaxed moment, it appears out of nowhere. When we concentrate too hard, we focus on what we already have access to (what we know) and not on what is associatively further away (what we do not know, or only half know). Then we miss the chance to find solutions that lie outside the current paradigm.
Most scientific breakthroughs were not made possible by a research plan, but by unplanned “coincidences”. Whether penicillin, Post-its, airbags, the microwave oven or Teflon – great inventions are the result of lucky coincidences and thinking “outside the box”.
If you want to lead your company into a new phase of growth, it is helpful not only to focus on what you know, but above all to look for insights where your own knowledge is limited. This is exactly the idea of “Causal AI” – using knowledge to explore the unknown.
When I started experimenting with neural networks in the early 90s, I was full of enthusiasm. Harun and I collected stock market data. We “scraped” the stock prices that were broadcast on German television via teletext, because the Internet was not yet available. We thought about how best to pre-process the data in order to feed it to a neural network.
Performance on test data (i.e. data sets that the system had not yet learned) was terrible. Even linear regression was better. Something was going wrong. What could it be? There were so many variables: network architecture, learning methods, number of parameters, pre-initialization, better preprocessing, and so on. We tried out a lot. Really a lot. I learned that you can waste a lot of time if you try things out without questioning your paradigm. I learned that A/B testing has serious practical limitations. Fortunately, we were students, we had the time, and it took us a year or two to realize that all the methods, even those described so promisingly in the textbooks, are of little use if you ignore so-called “regularization”.
What is that? All machine learning methods have a common goal: they try to minimize the prediction error. And this goal is precisely the problem. The prediction error naturally only relates to the data that is available for learning. However, the goal must be that the prediction based on situations that have not yet been seen (i.e. input data) has a minimal error.
This is a dilemma. I could use the test data for learning too, to improve the model with even more data. But in the live application, there will always be new, unseen input data. You have to accept that a model not only needs to fit well, it also needs to generalize. Regularization achieves this by following the philosophical principle of Occam’s razor: when in doubt, use the simpler model.
Regularization methods attempt to make a model simpler while sacrificing as little as possible of the predictive accuracy of the training data.
The figure shows the learning data as crosses. The thin line is the model that is only aimed at minimizing the prediction error. The thick line is the regularized model.
These methods form the basis of many AI systems today.
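As a minimal sketch of the principle – using scikit-learn’s ridge penalty as a stand-in for whatever regularizer a neural network would use – compare an unregularized polynomial fit with a regularized one on the same noisy data:

```python
# Sketch: regularization trades training fit for generalization (Occam's razor).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + rng.normal(0, 0.3, 20)
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(x_test).ravel()

flexible = make_pipeline(PolynomialFeatures(12), LinearRegression())   # fits the crosses almost perfectly
regularized = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1.0))  # sacrifices a little training fit

for name, model in [("no regularization", flexible), ("ridge", regularized)]:
    model.fit(x_train, y_train)
    print(name,
          "| train MSE:", round(mean_squared_error(y_train, model.predict(x_train)), 3),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(x_test)), 3))
```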
As soon as we had tried out the regularization methods in our student programming circle, the system became usable. However, the results only became really good after a further step towards causality.
Marketing is one of the most complicated fields you can choose. It is often ridiculed by technical professions, who think: “They’re just talking.” I have to admit that I thought the same thing at the beginning of my studies. We looked down on the business economists, who neither had to nor could solve really difficult higher mathematics the way we did.
But over the course of the semester, I realized that the natural sciences are actually comparatively simple. The math is complicated, but the systems are not complex. You can experiment relatively easily and get immediate feedback. That’s why we know relatively much in the natural sciences and relatively little in marketing. Marketing is like open-heart surgery – on millions of hearts at the same time.
There are so many uncontrollable variables. That’s what Causal AI is all about: we want to be able to capture and understand the complexity. Then you realize that most of the variables correlate with each other. And that’s a problem. Because that makes it more difficult to determine which variable is causal.
This was also the case with Kindernothilfe. The older the potential donors are, the more often and the more they donate. This is well known and leads to senior citizens being the focus of fundraising. Is this justified?
Many other variables also correlate. Various components of brand perception correlate strongly. Even wealth and income correlate.
It turns out that although classic machine learning and AI produce precise estimates for the training data, the more variables are involved, the more they have to contend with multicollinearity. As an undesirable side effect, variables are used in the model that have no direct causal influence. This in turn leads to unstable forecasts and distorted attributions of causes.
There are two algorithmic methods in particular that are used by AI systems to measure causal effects more accurately.
1. Double Machine Learning (DML)
Back to the SONOS example. When we predict loyalty from the other data using an AI model, the prediction contains all the information from the explanatory variables that the algorithm was able to use. What remains (the difference between prediction and actual value) is called “noise”, i.e. an unexplained random component. If the explanatory variables (= causes) do not fully explain loyalty (= effect), part of this “noise” is in fact intrinsic information contained in the loyalty variable.
This is also the case when we explain the “service evaluation” with an AI model. Double Machine Learning now attempts to explain the “intrinsic information” of the target variable through that of the other variables by working with adjusted variables.
The method consists of two stages of machine learning – hence the name “double”. In the first stage, a machine learning model, e.g. a neural network, is trained for each variable that is influenced by other variables. This also includes the target variable, such as loyalty.
Then, for each of these models, the difference between the predicted value and the actual value is calculated (this difference is referred to as the residual). The second stage now fits a machine learning model using only the residuals, not the actual values.
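Here is a minimal sketch of the two stages on synthetic data, without the cross-fitting that full DML implementations typically add; the variable names only echo the examples in this chapter and are not the real project data:

```python
# Minimal Double Machine Learning sketch (synthetic data, no cross-fitting).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
segment = rng.normal(size=n)                        # the "other" explanatory variable
service = 0.8 * segment + rng.normal(size=n)        # presumed cause, e.g. service evaluation
loyalty = 0.5 * service - 0.6 * segment + rng.normal(size=n)   # effect, e.g. loyalty

X = segment.reshape(-1, 1)

# Stage 1: predict both the target and the presumed cause from the other variables.
res_loyalty = loyalty - GradientBoostingRegressor().fit(X, loyalty).predict(X)
res_service = service - GradientBoostingRegressor().fit(X, service).predict(X)

# Stage 2: model the residuals only - the "intrinsic information" of each variable.
effect = LinearRegression().fit(res_service.reshape(-1, 1), res_loyalty)
print("estimated causal effect of service on loyalty:", round(effect.coef_[0], 2))  # ~0.5
```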
It’s a bit like tracking. You won’t see the snow fox that only roams around at night. But its tracks in the snow can tell us where it is coming from and where it is going.
2. Automated Relevance Detection (ARD)
There is another method for getting a grip on the interdependencies between the explanatory variables. Instead of using a two-stage approach, the idea is to eliminate explanatory variables during the AI’s iterative learning process without losing accuracy.
The idea is to integrate this goal into the AI’s objective function. What does objective function mean? Neural networks are optimized by setting up a function (= formula) that expresses what you want to achieve. In the simplest case, this is the sum of the absolute differences between the actual and the predicted value across all cases/data records. This sum should become small.
This formula now depends on the weights (=parameters) of the neural network. The learning algorithm, in turn, is a method that knows how the weights must be iteratively changed bit by bit so that the sum of this formula becomes smaller.
Automated Relevance Detection (ARD) changes this objective function so that the goal is not only to minimize the prediction error, but also to drop explanatory variables that contribute little. It’s not about black or white, in or out, but about a trade-off: model fit is weighed against model simplicity. This weighing is a learning process in itself, which the procedure implements using the principles of Bayesian statistics.
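The description above refers to neural networks; as a rough illustration of the same principle, scikit-learn’s ARDRegression (a Bayesian linear cousin of the idea) shows how redundant and irrelevant inputs are pruned automatically:

```python
# Illustration of relevance detection with a Bayesian prior (scikit-learn's ARDRegression).
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=n)      # column 1 is almost a copy of column 0
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)  # only columns 0 and 2 are causal

model = ARDRegression().fit(X, y)
print(np.round(model.coef_, 2))
# The redundant and irrelevant columns end up with weights near zero:
# the Bayesian objective trades prediction error against model simplicity.
```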
It’s a bit like the job of hunters. Their job is to keep the food chain in balance. How many predators does it take? How much prey? How much pasture grass? But what exactly the right balance is, is not a trivial question. The ARD algorithm investigates this question in an iterative search process.
Modern “Causal AI” has therefore typically implemented Automated Relevance Detection (ARD) and / or Double Machine Learning (DML).
“Marketing works a little differently in the pharmaceutical sector,” Daniel explains to me, pointing to a graphic. His company, which was part of Solvay at the time, produced prescription drugs. Marketing is mainly done through channels and campaigns designed to convince doctors to consider a drug.
Of course, there are also advertisements in specialist journals. But the majority of the budget is spent on equipping the sales force to keep doctors well informed and make a good impression. The whole thing is not called marketing, but “commercial excellence”. At its core, however, it is all about the same questions: Which channels and campaigns are effective? How can I sell more?
We helped Daniel to structure the problem and then collect data. In a pilot country, all sales representatives were asked to collect data for their sales territory and the last 24 months. In addition to the target figure “number of prescriptions”, we compiled the most important measures. These included the number of visits, participation in workshops, invitations to conferences, the distribution of brand reminders such as pens, product samples and much more. We collected 480 data records from over 20 sales territories, which enabled us to analyze the data across 14 channels.
We applied our Causal AI software and once again the results were surprising: it was to be expected that the number of sales visits was an important driver. But the overall impact of product samples was zero. “Product samples are important,” I remembered Daniel saying.
I looked at the plots of the non-linear relationships. It looked strange. The plots showed the result of a simulation: how many additional prescriptions can be expected if the number of product samples per doctor is increased or reduced to a certain value.
The graph showed an inverted U-function. There was a value for the number of product samples at which the effect was at its maximum. It took a while for it to click. “That’s logical,” I thought to myself. If the sales force distributes too many product samples, at some point the doctors no longer have enough patients to prescribe the medicine to. The product samples are then handed out in place of prescriptions – they replace prescriptions instead of promoting them.
The software found something that we hadn’t thought of before. In hindsight, it was as clear as day.
This example is intended to show one thing: Reality is often different than we think. It is also usually more complex than we think, because we humans are used to thinking one-dimensionally and linearly. Because this is the case, we need causal AI methods that can detect unknown non-linearities without having to specify them with hypotheses beforehand.
It’s a bit like a small child playing with a shape sorter. It can be so frustrating when the cube doesn’t fit into the round hole. No amount of kicking or hammering will help. A causal AI of the kind we need first looks at what kind of hole we have and can then put the right object into it.
Remember the Mintel case study above? There, Causal AI discovered interaction effects that we hadn’t had on our radar before.
Interactions are similar to non-linearities. Unfortunately, many managers do not intuitively understand what exactly is meant by the term “interaction” in methodological terms. So here is a definition:
Interaction or moderation effect:
We speak of interaction or moderation when the extent or manner in which a causal variable acts depends on another causal variable.
In this case, two (or more) variables “interact” in their effect. The degree of distribution “only promotes sales” if the product looks attractive, and it is only bought again if it tastes good. The effect of each component depends on the strength of the others.
Interaction effects are different from mediation effects:
Intermediation or mediation effect:
Colloquially, the term interaction is often used when one causal variable (e.g. friendliness) influences another causal variable (e.g. service quality), which in turn influences the result (e.g. loyalty). However, this must be distinguished from a genuine interaction. That is why we have a different term for it: (inter)mediation. In this example, the mediator is service quality.
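The difference is easiest to see in two toy equations (purely illustrative coefficients, not fitted models):

```python
# Toy equations (invented coefficients) - not fitted models.

def sales_with_interaction(distribution: float, attractiveness: float) -> float:
    # Interaction: the effect of distribution DEPENDS on how attractive the product looks.
    return 2.0 * distribution * attractiveness

def loyalty_with_mediation(friendliness: float) -> float:
    # Mediation: friendliness -> service quality -> loyalty (a causal chain, no interaction).
    service_quality = 0.8 * friendliness
    return 0.6 * service_quality

print(sales_with_interaction(1.0, 0.0))   # 0.0  - without attractiveness, distribution does nothing
print(loyalty_with_mediation(1.0))        # 0.48 - friendliness acts only via service quality
```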
To find interactions, we again need a machine learning approach that is flexible enough to see and find what is there. Many causal AI approaches, by contrast, work on the basis of so-called “structural equations” or “causal graphs”. Here, the analyst determines which variable may have an effect on which variable. However, this unconsciously makes a fatal assumption: the assumption that the effects of the variables simply add up. Each cause is considered individually and its effects are summed. Unknown interactions are thus excluded by design.
In Step #1, I described how important it is to create a holistic data set and to have knowledge about the real-world topic that the data describes. In Step #2, I described what AI should be able to do in order to build causal models.
Now, in Step #3, it is about how we should use these AI models to derive causally useful knowledge.
Illuminating the black box
The AI finds the formula hidden in the data. As such, it follows the flexible structural logic of a neural network and does not fit directly into the framework of human thinking.
Human thinking consists of logical connections based on categorizations (black and white). Continuous connections can only be understood as “the more – the better/worse”. The requirement to make the findings of AI understandable for humans is the requirement to simplify the findings and translate them into the structure of human language.
A neural network does not tell us how important the input variables are, nor how they are related. The weights in neural networks have no fixed meaning and this is only formed in the context of all the other weights. The first hidden neuron is interchangeable with the second. The position plays no role. Only the result of all neurons together has a meaning. In this respect, an analysis of the weights is only of limited use.
What we can do, however, is explore the properties of this unknown function by simulation.
Let’s stay with Daniel and his Pharma Commercial Excellence success model and run some simulations together with the variable “number of product samples”.
Average Simulated Effect (ASE): For each data set we have (a sales territory in a given month), we simply increase the number of product samples by 1 and then see how many prescriptions the neural network predicts as a result. If providing product samples works, the average number of prescriptions should increase. This was not the case with Daniel. The average simulated effect was close to zero. So did the variable have no effect?
No. To understand this, we carry out these further simulations:
Overall Explain Absolute Deviation (OEAD): For this, we manipulate the “product samples” variable again. This time we replace the actual data of this variable with a constant value – its average. The output of the neural network now produces different values. The predicted values resulting from the real data are close to the actual prescription numbers (small error). The predicted values resulting from the manipulated data are no longer as accurate; they have a larger error. By measuring how much explanation of prescribing behavior we lose when we no longer have the information from the “product samples” variable, we can measure the importance of that variable. In Daniel’s case, this value was quite high. So it was an important variable – but there was no simple (monotonic) explanatory relationship.
But what does this relationship look like?
Non-linear curves: All we have to do is look at the individual values of the OEAD simulation in a diagram. The diagram shows the number of product samples on the horizontal axis (X). We create a point in the diagram for each data set. On the vertical axis (Y), we plot the CHANGE that results for this data set in the target variable (number of prescriptions) if we replace the value of the number of product samples with its mean value.
What we now see in Daniel’s example is a U-shaped relationship. The points do not form a clean line but a cloud, yet a relationship is clearly recognizable. The point cloud is not created by the estimation error of the neural network, because we subtract two forecast values from the same neural network, so the random component cancels out. The point cloud is created by interactions with other variables (and by model inaccuracies, which then appear as interactions).
Interaction plots: We can proceed in a similar way to visualize interaction effects. We simply take the manipulated OEAD setup from above and additionally hold a second variable constant that could interact – in this case, the number of sales visits. Again, we can visualize the result, this time in a 3D diagram. The two horizontal dimensions are the number of product samples and the number of sales visits. The vertical dimension is again the CHANGE that results when the values are replaced by their means. If the change due to the product samples depends on the number of sales visits, we have an interaction. This becomes visually apparent and can then again be condensed into a key figure.
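As a rough sketch of how such simulations can be scripted against any fitted model – here a gradient-boosting model on invented, pharma-style data; the variable names and numbers are mine, not the NEUSREL output:

```python
# Sketch of the simulation logic: ASE, OEAD and the per-record change behind the curves.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 1500
visits = rng.poisson(6, n).astype(float)
samples = rng.poisson(8, n).astype(float)
# Invented ground truth: inverted U for samples, plus an interaction with visits.
prescriptions = (3 * visits + 2 * samples - 0.15 * samples**2
                 + 0.1 * visits * samples + rng.normal(0, 2, n))

X = np.column_stack([visits, samples])
model = GradientBoostingRegressor().fit(X, prescriptions)

# ASE: increase "samples" by 1 for every record and average the predicted change.
X_plus = X.copy(); X_plus[:, 1] += 1
ase = np.mean(model.predict(X_plus) - model.predict(X))

# OEAD: replace "samples" by its mean and measure the average absolute change.
X_mean = X.copy(); X_mean[:, 1] = X[:, 1].mean()
change = model.predict(X) - model.predict(X_mean)
oead = np.mean(np.abs(change))

print("ASE:", round(ase, 2), "| OEAD:", round(oead, 2))
# Plotting `change` against X[:, 1] reproduces the non-linear curve;
# colouring the points by `visits` (or holding it constant as well) reveals interactions.
```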
Metaphorically speaking, these simulations work like the human eye. In reality, the eye only ever sees a small section – the section it focuses on. It perceives what is there by moving minimally; the difference allows us to understand what is in front of us and to filter out the insignificant. If our muscles, including the eye muscles, were paralyzed, we would no longer see anything. These simulations are our eyes; they allow us to see a complex world.
All these simulations can be converted into key figures. With the help of bootstrapping, we can also calculate significance values for these relationships. The procedure is very simple. From a sample of N data records, N data records are drawn at random – with replacement. This means that some data records may occur twice and others not at all. In this way, a bootstrap data set is created. It represents an alternative data set that could just as well have been drawn the next time. You now draw dozens or, if possible, hundreds of these bootstrap data sets. For each of them, the key figures, such as the ASE, are recalculated. If these values lie close to each other, the significance is high.
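The bootstrap loop around such a key figure takes only a few lines. The sketch below uses a plain correlation as the key figure for brevity; in practice, the full causal model would be re-estimated on every resample, which is what makes the exercise computationally expensive:

```python
# Sketch: bootstrapping a key figure to judge its stability.
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)          # invented data

stats = []
for _ in range(500):                       # hundreds of bootstrap data sets
    idx = rng.integers(0, n, n)            # draw n records WITH replacement
    stats.append(np.corrcoef(x[idx], y[idx])[0, 1])

lo, hi = np.percentile(stats, [2.5, 97.5])
print(f"95% bootstrap interval for the key figure: [{lo:.2f}, {hi:.2f}]")
```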
But be careful: “significant” does not mean “important” or “relevant”. It only means “an effect is present, provided that the model is meaningful”. Since many users (even in science) do not understand the true meaning of significance, significance hacking is practiced in both science and business. The smaller (and therefore more unrealistic) a model becomes, the more significant its effects appear. In this way, variables that do not fit the picture are filtered out just to give the appearance of quality.
In fact, “significance” is largely irrelevant to business practice. Even the American Statistical Association recently confirmed my personal findings from practice in a six-point statement.
What we want to know in practice is whether a cause is “relevant”. The OEAD above measures such relevance and is a kind of measure of effect size. In Bayesian statistics, there is the concept of evidence. A correlation is evident if it is statistically relevant and the model makes sense (in accordance with the wisdom of the field). A small, overly simple model makes little sense and can therefore only provide limited evidence, even if we can demonstrate an effect size. In STEP #1 above we ensure meaningfulness, in STEP #2 we model the actual effects and in STEP #3 we measure their RELEVANCE.
To illustrate this, let’s look at weightlifters and bodybuilders. It is highly significant that bodybuilders are strong and powerful. But this strength is not necessarily relevant. It would be relevant if, for example, it allowed them to lift particularly heavy weights. Yet bodybuilders would lose every weightlifting competition. The picture shows what the record holder in weightlifting looks like. You would expect the bodybuilder to lift far more, but his physique is largely for show – just as significance values are often for show only.
We stick with Daniel and his model to explain the prescription figures. Each time the sales force visited a physician, product samples, brand reminders and new information material were distributed. There were times when no product samples were distributed. Typically, however, it was customary to distribute some samples.
Methodologically, this habit is reflected in an indirect causal effect. When the sales managers increased the frequency of their visits, they also increased the number of samples handed out, in line with the usual and expected ritual – and the reverse was equally true. Consequently, the two variables correlated strongly with each other.
In order to determine the overall effect of the sales visits, both the direct effect (= the effect of the visits themselves) and the indirect effect (= the effect of the samples, multiplied by the additional samples handed out during a visit) must be taken into account. This indirect effect arises because more visits also result in more samples being distributed. Management, after all, asks only one question: “What happens if I change cause X?”
The total ASE is the direct ASE plus the indirect ASE. The indirect ASE is the ASE of the visits on the number of samples, multiplied by the ASE of the samples on the number of prescriptions.
The total OEAD can be calculated in the same way. Of course, there are many indirect effects in complex cause-and-effect networks, some of which are interlinked or even circular. However, all these effects can be calculated with the help of software and combined to form an overall effect.
The Total OEAD tells me whether a variable is relevant. The Total ASE tells me how large the (monotonic) effect of an incremental increase in a cause is on average.
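With invented numbers, the arithmetic looks like this (a sketch of the formula above, not output of the real model):

```python
# Invented example numbers for the total effect of one additional sales visit.
ase_visit_direct = 0.30              # extra prescriptions caused directly by one more visit
ase_visit_on_samples = 0.60          # extra samples handed out per additional visit
ase_samples_on_prescriptions = 0.10  # extra prescriptions per additional sample

ase_visit_indirect = ase_visit_on_samples * ase_samples_on_prescriptions   # 0.06
ase_visit_total = ase_visit_direct + ase_visit_indirect
print(ase_visit_total)               # 0.36 prescriptions per additional visit
```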
Orchestra musicians know this best. The room, the orchestra hall, is crucial for the sound. Only some of the sound waves reach the listener’s ear directly. There are many indirect reflections, which then make up the fullness of the sound. The sound measurements on the instruments symbolize the action variables, the measurements on the walls of the hall the intermediate variables and the coughing of the person sitting next to you the situational variables.
It is clear that the impact of an action can only be properly measured if the indirect paths are also taken into account.
When we implemented the first algorithms for causal direction detection in our NEUSREL software in 2012, we came across some astonishing results. The first use case was data from the American Customer Satisfaction Index. In the structural equation model used in marketing, satisfaction influences loyalty and not the other way around. Satisfaction is a short-term, changeable opinion about a brand; loyalty, on the other hand, is an attitude that only changes in the long term. This is what marketing science has established.
However, our algorithms revealed a clear, different picture. Loyalty influences satisfaction. Not the other way around! Was everything really okay with the algorithms?
Then it clicked: both were right – each in its own way.
The data are responses from consumers in a survey. If someone is loyal but dissatisfied, they tend to indicate a higher level of satisfaction than they actually feel because of their loyalty. This is a kind of psychological response bias. In this sense, current loyalty has a causal influence on current reported satisfaction.
Things look different with a different time horizon. If the same people were surveyed again with a time lag, it would be found that a low level of satisfaction – over a certain period of time – leads to a drop in loyalty.
Ergo. Causality is always linked to a time horizon. Understanding and demonstrating this is also the task of marketing managers. A purely data-driven view is blind here.
Another example? If you wade barefoot through the cold November rain, you risk catching a cold. The cold doesn’t make you ill. But it weakens the immune system. If there is a virus in the body, the likelihood of an outbreak increases.
Conversely, if you regularly expose yourself to the cold, you strengthen your immune system, boost the performance of your mitochondria and are less susceptible to colds in the long term. Walking barefoot leads to fewer colds in the long term, not more colds.
There are many such examples. If you wash your hair often, you will have well-groomed, grease-free hair. If you don’t wash your hair, you will soon notice, because your hair quickly becomes greasy. However, if you never wash your hair, your hair roots will not become greasy anymore in the long term because your hair already has a small, healthy greasy film. The hair will also look healthy and well-groomed (if you comb it).
It can therefore be seen that the topic of causality requires human supervision. First of all, we have to define for ourselves which horizon of effect we are interested in.
In addition, most causal directions in marketing can be derived from common sense and specialist knowledge. For the remaining relationships, test procedures can be used to learn the direction from the data. I would like to discuss the two most commonly used concepts here: the PC algorithm and the Additive Noise Model.
The PC algorithm (named after Peter Spirtes and Clark Glymour) is a machine learning method that is used to determine the structure of causal networks in data. The algorithm attempts to discover causal relationships between variables by analyzing so-called “conditional independencies” in the data.
The figure shows how the algorithm examines triples of variables and uses the “conditional (in)dependencies” (black arrows) to triangulate which causal directions logically follow. If A correlates with C and B correlates with C, but A does not correlate with B, then it follows that A and B have an effect on C and not vice versa. If C were the cause of A and B, then A and B would have to correlate.
The method focuses on linear relationships, but can be extended to non-linear ones. However, studies show that the rate of incorrect decisions increases rapidly with the size of the causal network. It is therefore advisable to use this method only for small models with fewer than 20 variables. The example above is an extreme one. In reality, we are dealing less with black and white and more with shades of gray, and it becomes increasingly difficult to clearly establish the causal directions when the effect size of a path (edge) is low.
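The triangulation logic from the figure can be verified in a few lines on synthetic data; real PC implementations automate exactly this check over all triples of variables and all conditioning sets:

```python
# The collider logic behind the PC algorithm: A and B are independent, both drive C.
import numpy as np

rng = np.random.default_rng(5)
n = 5000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.5 * rng.normal(size=n)

print("corr(A, B):", round(np.corrcoef(a, b)[0, 1], 2))   # ~0  -> no edge between A and B
print("corr(A, C):", round(np.corrcoef(a, c)[0, 1], 2))   # clearly > 0
print("corr(B, C):", round(np.corrcoef(b, c)[0, 1], 2))   # clearly > 0
# Only the orientation A -> C <- B is consistent with this pattern:
# if C caused A and B (or A influenced B via C), A and B would have to correlate.
```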
Additive noise modeling is a method to find out whether something (let’s call it A) causes something else (B) or vice versa. It is based on the idea that if A causes B, the change in B cannot be explained by A alone, but that there is also some “noise” (unforeseen influences or chance) that is added. Importantly, this noise has nothing to do with A.
The figure below shows the same data on the left and on the right, except that the variable X is plotted once horizontally and once vertically. A simple linear regression is shown by the red line. The fit (or the error) of the regression (coefficient of determination R²) is exactly the same in both cases. However, the scatter around the regression line is constant in the left-hand case and not in the right-hand case. Assuming independent noise, X must therefore be the cause of Y and not vice versa.
To decide whether A causes B or B causes A, we look at both possibilities and try to determine in which situation the noise is truly independent of its assumed cause. If it appears that the noise is only independent if we assume that A causes B (and not the other way around), then we would say that it is likely that A actually causes B.
The method is based on the assumption that the noise is truly independent and that the relationship between A and B is adequately captured by the model. If this is not the case, the method can lead to incorrect conclusions. Therefore, modeling that is as close to reality as possible is an important prerequisite for this test.
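A hedged sketch of the decision rule: fit a flexible regression in both directions and check in which direction the residuals look independent of the presumed cause. The crude dependence score below (does the spread of the residuals vary with the cause?) stands in for a formal independence test such as HSIC:

```python
# Additive noise sketch: in the true direction, Y = f(X) + noise with noise independent of X.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(11)
n = 2000
x = rng.uniform(-3, 3, n)
y = x**3 + rng.normal(0, 1, n)        # true direction: X -> Y

def residual_dependence(cause, effect):
    model = GradientBoostingRegressor().fit(cause.reshape(-1, 1), effect)
    resid = effect - model.predict(cause.reshape(-1, 1))
    # Crude dependence score: does the spread of the residuals vary with the cause?
    return abs(spearmanr(np.abs(cause), np.abs(resid))[0])

print("assuming X -> Y:", round(residual_dependence(x, y), 2))  # small  -> plausible direction
print("assuming Y -> X:", round(residual_dependence(y, x), 2))  # larger -> implausible direction
```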
It’s a bit like trying to work out which way is up and which way is down, assuming that the gravitational force is coming from the earth. If you turn your head, your hair should hang down.
To summarize, there are several methods for testing the direction of causality. None of them offers a silver bullet. Therefore, a conscious approach and the inclusion of expert knowledge is essential.
A telecommunications company asked us to forecast the risk of customer churn. The company had already implemented a number of customer retention measures and wanted to know how well they were working. The focus of interest was the so-called “cuddle calls”. Here, customers were called as a precaution to ask them about their satisfaction, simply to show appreciation and, if necessary, to hear whether someone was at risk of churning.
The figures were shocking: households that had accepted a cuddle call had twice as high a cancellation rate in the following year as households that had not. The program was on the cancellation list. It was suspected that customers at risk of churning (so-called “sleepers”) were being activated by these calls.
Our first churn model also seemed to confirm this. The flag variable “cuddle call” had a positive influence on the probability of termination.
We then enriched the data with sociodemographic data and expanded the modeling. The result: cuddle calls now reduced the probability of quitting! How did this happen?
We had introduced a confounder into the model. It turned out that most households cannot be reached by phone during the day and that this accessibility represents a strong filter for making calls. Socially disadvantaged households in particular were reachable and these had a significantly higher probability of termination per se.
The cuddle call correlated positively with the probability of quitting because target groups with an affinity for quitting were easier to reach by phone – not because the call was ineffective. On the contrary.
It really is like a puppet show. As children, we see how the puppets move. The robber hits the farmer and he falls down. Only apparently. In reality, there is a cause that we don’t see: the puppeteer.
It is the same with the data. If we do not see the puppeteer (we have no data about him in the model), then we infer false cause-effect relationships from correlations.
So how can we check whether disturbance variables influence our model? We use two methods in the NEUSREL software:
The Hausman Test
This test has similarities with additive noise modeling and also with Double Machine Learning. Its thesis: if the residuals of the target variable – the so-called noise, i.e. the difference between prediction and actual value – can be explained by the presumed causes themselves, then these are not clean, exogenous causes. If the explanatory variables can be used to predict the residuals with the help of an AI model, this is an indicator of confounders.
Following the Hausman logic, the same modeling is therefore carried out a second time – only with the residuals as the target variable. The same AI algorithms are used, and the same simulation algorithms are used to calculate the effect sizes, in particular the OEAD.
If there are clear indications that a confounder is at work, you can do some soul-searching – and sometimes it then becomes clear which data sources you may have forgotten to use.
However, if no other data can be obtained for practical reasons, the question arises as to whether there is any way of avoiding the confounder.
Confounder elimination
From 2011 to 2013, I worked on this topic with Dr. Dominik Janzing (then Max Planck Institute, now Amazon Research), and together we came up with an idea.
If you plot the data of two (or more) explanatory variables on a two-dimensional graph, you obtain a random distribution of some kind, e.g. a Gaussian distribution. This distribution can be elongated if the variables are correlated with each other. Or it can be curved if there is a non-linear correlation.
However, if two (or more) distributions occur, i.e. if the data separate into clusters, then there must be a cause for this. This unknown cause is the confounder. It is the influence of this confounder that splits a previously uniform distribution. The information of the confounder is the vector between the clusters – the vector of differences between the cluster centers.
Our procedure for “confounder elimination” builds on exactly this idea.
Metaphorically speaking, confounders are like an Italian mom who distributes the spaghetti on the plates. If we follow the plates, we can understand that they have been filled by the mom and are not filling each other.
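The individual steps of the procedure are not spelled out here; purely as an assumed illustration of the idea described above – clusters in the explanatory variables betray a hidden confounder, and the vector between the cluster centers carries its information – a sketch could look like this:

```python
# Assumed illustration only: recovering a confounder proxy from clusters in the drivers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
n = 2000
confounder = rng.binomial(1, 0.5, n)                 # hidden cause, e.g. an unobserved segment
x1 = 1.5 * confounder + rng.normal(0, 0.5, n)        # two observed explanatory variables
x2 = -1.0 * confounder + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2])

# The hidden influence splits the joint distribution into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
direction = km.cluster_centers_[1] - km.cluster_centers_[0]    # vector between the cluster centers

# Projecting each record onto this vector yields a proxy for the confounder,
# which could then be added to the causal model as an additional control variable.
proxy = (X - km.cluster_centers_[0]) @ direction / np.dot(direction, direction)
print("corr(proxy, hidden confounder):", round(np.corrcoef(proxy, confounder)[0, 1], 2))
```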
In the first step, we collected and processed data and developed a modeling approach based on existing knowledge and insights. In the second step, we causally modeled each influenced variable using a suitable AI method. In the third step, we opened the black box and conducted some tests. We will learn from the analysis of these results.
We may find that we were wrong – wrong in some assumptions or in the treatment of a variable. Then refinement, optimization and re-modeling are usually required.
The benchmark for optimization is not just the key metrics for model fit, causal direction or confounders. Ultimately, it is the human being who decides whether the model is meaningful and useful. This requires good specialist knowledge, wisdom and a solid foundation in causal machine learning.
What if you don’t have the confidence?
Of course you can get help from external experts, but there is a second strategy: standardization.
Let’s assume you are building a marketing mix modeling. If you have gone through the process cleanly, you can try to define all the steps in a standard. If you prepare the same type of data in the same way, model and simulate it with the same AI methods, then the results will be interpretable in the same way. If this is the case, this interpretation and consulting service can be cast in software. The entire process can then be repeated for other business units, countries or at other times with little effort.
In my view, the future for 80% of marketing issues lies in the development of standardized Causal AI-based solutions.
The term standardization is often equated with a loss of individuality and thus a loss of quality. However, this view is short-sighted and one-sided. A good standard process is crystallized wisdom. It also ensures quality because it avoids errors. Individualization is always possible, but it comes at a considerable cost.