A more appropriate analysis (we will get to know it later) showed that it was above all new customers who, in their initial euphoria, were more willing to recommend the company to others. At the same time, the customer hotline was used primarily by new customers, because that is where most questions about setting up the devices arose. The service contacts were therefore a second consequence of new customer acquisition. Willingness to recommend thus correlated with service satisfaction only because both were consequences of new customer acquisition. There was no causal relationship between them.
These and other examples show one thing: correlation analysis is not suitable for tracking down causes. What methods, then, has statistics, or data science as we call it today, developed for this purpose? Examples are econometric models and structural equation models. Let’s look at a case to see how useful these methods are in practice.
The company Mintel collects data on new products worldwide. In fact, thousands of employees “roam” through supermarkets worldwide to find new products, evaluate them subjectively and send them to the headquarters in London, where they are objectively evaluated and categorized. Sales figures, distribution figures, prices and lots of other data are collected in a huge database for this purpose.
From time to time, the experts ask bold questions like these: Why do only 5% of all new product launches survive the first two years? How can I predict whether my product will be successful? How can I manage it at an early stage?
So the company’s data scientists set about unearthing the treasure trove of data. An econometric model was built. It was tweaked and tinkered with. But the explanatory power was more than sobering:
ZERO
Then one day I received this email asking if we had any better methods.
We had them. Once the work was done, we were able to predict with 81% accuracy whether each new product would be a winner or a loser.
How was that possible? Is classical modeling really that “bad”?
No, “bad” is the wrong word. Classical modeling is simply not practicable. It lacks the methodological properties that AI offers us today and that we need in order to gain useful insights reliably and within a limited time.
Specifically, there are problems in the following three areas in particular.
1. Hypothesis-based
Even in classical research today, it is still considered the gold standard to always proceed on the basis of hypotheses. Most of us learned this in our studies. The reason behind this is that (without using the methods we will get to know) only a good hypothesis actually prevents a spurious correlation from being declared true, i.e. causal.
The only practical problem is that good hypotheses are usually in short supply. The greater the need for useful insights, the fewer hypotheses are available. And the more reliable hypotheses there are, the more likely marketers are to say to themselves: what we already know is enough for us to make decisions. This is often referred to as the “last digit after the decimal point that you don’t need”.
Collecting hypotheses with the help of expert interviews is a lot of work and takes time. Even then, gaps remain, large gaps. Forming hypotheses leads to small statistical explanatory models, because typically only a few reliable hypotheses are left. These small models explain less. What is even worse, however, is that they carry a hidden, higher risk of delivering false results. We will see why this is the case later in the context of “confounders”.
So both in practice and in science, people “cheat” behind the scenes. You simply look at the data you have and see if you can come up with a hypothesis. The proper procedure would be the other way around.
It is not uncommon for hypotheses to be “knitted” after the analysis. The whole thing is then idealized as the fine art of “storytelling”.
It was the same with Mintel. “All variables in” and then “let’s see”. Even the statement of the customers surveyed as to whether they would buy a product or not had no explanatory power for the product’s success.
Does this disprove the hypothesis that a higher purchase intention also leads to purchases?
Yes and no. If all of the model’s assumptions are correct, then yes. But whether they are correct is exactly the question, which brings us to the second point.
2. Linear and independent
For example, many new products are more likely to be considered if they are perceived as unique. However, it turns out that uniqueness can be overdone. “Very unique” becomes “quite strange”.
The standard methods of classical modeling assume that the more pronounced an explanatory variable is (e.g. the more unique), the greater the target variable (e.g. the sales figures of the product). A fixed relationship is assumed. It is a linear relationship. Only the extent of the relationship is determined by the parameters.
The second standard assumption is independence. According to this, a price reduction of 1 euro, for example, has a fixed sales effect of, say, 50 percent, regardless of the brand of the product. Even if this does not seem very realistic in this example, it is the core of all standard methods. Sure, with econometric methods it is possible to make the relationships non-linear. It is also possible to map dependencies between the causes in the model. There is just one catch: it is hypothesis-based. You have to know all of this in advance.
The data scientist needs to know what kind of non-linearity to build in. Is it a saturation function? A U-function? A growth function? An S-function?
He also needs to know what kind of dependency he should “build in”. Do we have an AND link, i.e. sales only increase if the price falls AND the brand is strong? Or an OR link? Or an EITHER link? Or something in between?
The Mintel model had 200 variables. Even if you only have 100 variables, the question arises: who goes through them all to correctly determine the non-linearity? And who goes through all 100 times 100 (= 10,000) combinations to see how they are related and interact?
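To make this concrete, here is a minimal sketch (with invented variable names and an invented data-generating process) of what “knowing it in advance” means in a classical regression: every non-linearity and every interaction has to be written into the formula by hand.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "price": rng.uniform(1, 10, n),
    "uniqueness": rng.uniform(0, 1, n),
    "brand_strength": rng.uniform(0, 1, n),
})
# Assumed data-generating process, purely for illustration:
df["sales"] = (
    100 - 5 * df["price"]
    + 40 * df["uniqueness"] - 30 * df["uniqueness"] ** 2     # inverted U: "very unique" backfires
    + 20 * df["brand_strength"] * (10 - df["price"]) / 10    # price cut helps mainly strong brands
    + rng.normal(0, 5, n)
)

# The analyst has to guess every functional form and interaction up front:
model = smf.ols(
    "sales ~ price + uniqueness + I(uniqueness ** 2) + brand_strength + price:brand_strength",
    data=df,
).fit()
print(model.params.round(2))
# With 200 candidate variables, nobody can write down all such terms correctly by hand.
```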
This makes it plausible how impractical classical statistical modeling is. The methods should help us to learn what we do not yet know and not just validate what we already know.
There are other challenges in business practice where traditional methods fail.
Imagine painting a 10 cm long line on a sheet of paper with a brush. Then paint an area of 10 x 10 cm with it. How much paint do you need? If you have a thick brush, perhaps ten times as much. Now we go from two-dimensional to three-dimensional. How much paint do we need to fill a 10 x 10 x 10 cm box with paint?
The paint is the data we need. The dimensions are the variables we have. The point is this: the more explanatory variables we have, the larger the space of possibilities. This space contains our data. As the number of variables increases, we theoretically need exponentially more data. This phenomenon is known as the “curse of dimensionality”.
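A back-of-the-envelope sketch (assuming each variable is coarsely split into just ten bins) shows how quickly the space grows:

```python
# Number of cells that would need to be covered with data,
# growing exponentially with the number of variables.
bins_per_variable = 10
for n_variables in (1, 2, 3, 10, 100):
    cells = bins_per_variable ** n_variables
    print(f"{n_variables:>3} variables -> {cells:.0e} cells")
```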
The only tools that classical methods use to overcome the curse of dimensionality are hypotheses and assumptions. We have seen that this is not very practical.
In the course of the development of artificial intelligence, intelligent methods have been developed that get to grips with the curse of dimensionality without strict hypotheses and assumptions.
For example, when AI algorithms today identify a cat in an image with 1000 x 1000 pixels, they process 1 million (1000 x 1000) explanatory variables. The space of possibilities here is far larger than the number of elementary particles in the entire universe (around 10^81). Even the millions of cats that the algorithm has seen are a drop in the ocean.
High-dimensional challenges in corporate marketing can be mapped in the same way.
Another limitation of classic modeling is the use of differently scaled variables. There are binary variables such as gender or segment affiliation. And there are continuous variables such as customer satisfaction or turnover. Classical statistics can hardly mix these.
The data sets are therefore split, for example into women and men, estimated separately and then compared. This halves the sample, and with it the significance, and the gender analysis remains purely correlational (rather than causal).
The requirements in business practice are different. But if you only have a hammer, every problem looks like a nail.
This was also the case in the Mintel project. If classical modeling is hypothesis-based and postulates linearity and independence of effects, it is intuitively plausible that the approach has its limitations. This becomes even clearer when we look at what a modern AI-based method has found:
The central insight of the model is that the success levers are mutually dependent. To sell a product, it has to be on the shelf. A good-looking product is useless if distribution is low. High distribution is useless if the product is not good enough for consumers to want to buy it again. A good product is useless if the price is not within an acceptable range. An acceptable price is of no use if the brand is not recognized on the shelf. All these factors are interdependent; they do not simply add up.
Generate 100 random numbers between 0 and 1 for each of 4 variables. In each of these 4 number series, about half of the values are greater than 0.5. If success requires two of the variables to exceed 0.5 at the same time, only about 25% of the cases qualify. Require a third variable as well and it is 12.5%; with the fourth, about 6%. This 6% is pretty much the percentage of new products that survive two years.
Multiplying these shares corresponds logically to an AND link. Success is only achieved if a new product is widely distributed, has an attractive overall appearance and a reasonable price, is easily recognizable, and is so good that customers want to buy it again after their first purchase.
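A short simulation (sample size and random seed chosen arbitrarily) reproduces this arithmetic: each lever clears its bar in about half of the cases, but all four rarely do so at once.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
factors = rng.uniform(0, 1, size=(n, 4))   # four independent success levers

above = factors > 0.5                      # each lever clears its bar in ~50% of cases
for k in (1, 2, 3, 4):
    share = above[:, :k].all(axis=1).mean()
    print(f"first {k} lever(s) above 0.5: {share:.1%}")
# prints roughly 50%, 25%, 12.5% and 6.3%: the AND link at work
```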
A new type of modeling was able to discover this relationship in the data. And this despite the fact that more than 200 variables, including binary and metric variables, were available and, above all, despite the fact that nobody had expected this result in advance.
It is not the case that classic modeling methods are “bad”. Quite the opposite. Within the scope of their assumptions, the methods are extremely good and extremely accurate. It’s like a Formula One car. It is extremely optimized, has a high top speed and the tires can be changed in seconds.
If you order a car like this as a company car, you won’t get 100 meters. There is no trunk, and the filling station doesn’t carry its fuel. But above all, every bump on a normal road shatters the underbody into a thousand pieces.
A modern AI-based analysis system is more like an off-road vehicle. It may not be as fast as a Formula 1 car. But it gets from A to B no matter what the surface looks like, whether there is a river in between or a hill to cross.
What is artificial intelligence and what is machine learning? The answer is quite simple: machine learning is written in Python and artificial intelligence in PowerPoint.
Joking aside.
Artificial intelligence originally referred to all technical systems whose behavior gives the impression of being controlled by human intelligence. For our purposes, a different definition makes more sense, because in many applications AI systems are now far more “intelligent” than humans. In data analysis in particular, the original understanding is unhelpful: what AI can achieve here exceeds even the most ingenious human by orders of magnitude. What we want to do with AI is to gain insights from data and make predictions. In this context, we differentiate between statistical modeling and artificial intelligence:
Statistical modeling finds the parameters of a fixed, predefined formula.
Artificial intelligence finds the formula itself and its parameters.
“Machine learning” is often used as a synonym for AI, but for data scientists in particular, statistical modeling is also part of machine learning. In a sense, the machine learns by finding the parameters. This is why the majority of “AI” start-ups in Europe do not use AI at all, as a study showed a few years ago.
What exactly does the term “formula” mean in this context? Every rational explanatory approach, including a forecasting system, can be expressed as a mathematical function in which the result (the forecast) is calculated from the explanatory variables (numbers that represent certain characteristics of the causes).
The classic linear regression has this formula:
Result = Variable_1 x Coefficient_1 + Variable_2 x Coefficient_2 +… + Variable_N x Coefficient_N + Constant
The formula consists of added terms and a constant. The algorithm calculates the coefficients and the constant in such a way that the estimated values for the sample data in the data set come as close as possible to the actual results.
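As a minimal illustration with simulated data, ordinary least squares does nothing more than fill in the coefficients and the constant of this fixed formula:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                           # three explanatory variables
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(0, 0.1, 200)

# Append a column of ones so the constant is estimated along with the coefficients.
X1 = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("coefficients:", params[:3].round(2), "constant:", params[3].round(2))
```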
In addition to addition, there are other basic arithmetic operations such as multiplication. These operations are basic building blocks from which ANY function can be constructed, and there are other families of building blocks with the same property. A neural network uses an S-shaped function as its basic building block; by adding up such functions, you can likewise build any other function (the mathematician Kolmogorov proved this back in the 1950s).
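And as a hedged sketch of “finding the formula itself”: a small neural network built from S-shaped units approximates a non-linear function that was never written down for it (the data and the network size are invented for illustration).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(x[:, 0]) + 0.1 * x[:, 0] ** 2               # the unknown "formula" to be discovered

# One hidden layer of S-shaped (tanh) units, added up: a universal building block.
net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0).fit(x, y)
print("fit quality (R^2):", round(net.score(x, y), 3))
```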
The following image of mountains has always helped me: the longitude and latitude represent the explanatory variables. The height of a mountain at a certain point (= combination of longitude and latitude) is the result you are looking for. There is a mathematical function that describes every mountain (except for a negligible deviation). The AI can find this unknown function.
To stay with the image of the mountains: AI is like a forestry company that cuts down trees in the mountains and notes the longitude, latitude and altitude on each tree. At the sawmill, the AI can then estimate the shape of the mountain from the data on the trees.
Okay, there are a few gaps. For example, where no trees grow. This is also the case in corporate practice. There were no major crashes in the stock market data for the last ten years (2014 – 2024). New crashes cannot be predicted from this data.
If you use a forecasting system to select and acquire target customers, this system may stop working one day. If you don’t monitor the system, you will quickly go broke.
It is important to be aware of these framework conditions. Otherwise you run the risk of becoming a victim of “Black Swan” phenomena.
But there are even more serious problems.
When I started studying in Berlin in 1993, I was always fascinated by this Reuters terminal that stood in the middle of the canteen. The student stock exchange association had set it up there. The monitor showed the current stock market prices in real time, which were delivered by satellite (remember: there was no internet back then!).
Then one day, Germany’s biggest daily newspaper ran the headline “Artificial intelligence predicts stock market”. I was fascinated. Shortly afterwards, I joined the student stock exchange association and read up on how the professionals make their investment decisions. That’s when I met Harun. He was also studying electrical engineering and had caught wind of the newspaper article.
Over the next few years, we met weekly, spending the nights discussing and programming neural networks, fortified by ready-made spaghetti. Successes and setbacks alternated.
I still remember it well. We had built a system that not only learned the training data with high accuracy, but also predicted the test data with good results. This was data from a shorter time horizon with which the neural network had not yet been trained. I ran the model training for two weeks during my vacation.
But the performance on the live data was disappointing. How could that be?
It turned out that our model was suffering from a phenomenon called “model drift”. Data scientists all over the world are familiar with it. And most of them still don’t have a solution. They simply retrain the model more frequently, which often only masks the problem.
If I want to predict the career success of managers on the basis of shoe size, that works reasonably well at first. For well-known reasons, men still climb the career ladder more often than women, and they have the bigger shoes. When shoe fashion changes and women start wearing longer shoes, the model begins to falter. Why? Because shoe size is not the cause of professional success.
Model drift occurs when the explanatory variables/data no longer explain the target variable in the same way over time.
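A toy simulation of the shoe-size example (all numbers are invented) shows the mechanics: the model learns a proxy instead of the real driver and breaks as soon as the proxy decouples from it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_data(n, shoe_gender_link):
    gender = rng.integers(0, 2, n)                              # the real (and unfair) driver
    shoe_size = 39 + 5 * gender * shoe_gender_link + rng.normal(0, 1, n)
    promoted = (rng.uniform(0, 1, n) < 0.2 + 0.5 * gender).astype(int)
    return shoe_size.reshape(-1, 1), promoted

# Training era: shoe size is a strong proxy for the real driver.
X_train, y_train = make_data(5000, shoe_gender_link=1.0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy while the proxy holds:", round(model.score(X_train, y_train), 2))

# Fashion changes: shoe size decouples from the real driver, the model drifts.
X_new, y_new = make_data(5000, shoe_gender_link=0.0)
print("accuracy after the drift:      ", round(model.score(X_new, y_new), 2))
```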
Many banks use forecasting systems to predict the credit default risk of a loan applicant. There have already been several discrimination scandals in this area. What happened?
These AI systems used all available information about a customer and then tried to predict the probability of credit default from the past. However, this information is highly correlated. People with higher incomes live in different zip code areas and people with darker skin have a lower income on average.
Machine learning – AI or not – traditionally has only one goal: to reproduce the target variable as accurately as possible. If two explanatory variables are highly correlated, the algorithm does not “care” which variable is used to reproduce the result. As a result, skin color has an explanatory contribution to credit default, even if this is not (causally) justified. The causal factors are income and job security, not skin color.
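The mechanism can be reproduced in a few lines (with simulated data and invented variable names): a variable with no causal effect receives substantial importance simply because it is correlated with the causal one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 10_000
income = rng.normal(0, 1, n)
zip_area = income + rng.normal(0, 0.3, n)        # correlated with income, but not causal
default = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(2 * income))).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(np.column_stack([income, zip_area]), default)
importances = dict(zip(["income", "zip_area"], forest.feature_importances_.round(2)))
print(importances)   # zip_area gets real "importance" although it has no causal effect
```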
It was similar with Amazon’s applicant scoring models, which categorically screened out female applicants. The learning data set was not only male-dominated; the women it contained often had less professional experience. The algorithm’s only goal was to predict career success, and the gender variable was useful in identifying the supposed “underperformers”. This characteristic was not merely “politically incorrect”. It was factually incorrect, because, all other things being equal, women were just as successful.
The technical reason for the failure of classical AI and ML is that they are not designed to use only those variables that have a causal influence.
The result is not only unfair models. The result is models that deliver suboptimal or even incorrect predictions and findings in regular operation.
The main criticism of AI systems in recent years has often been their black box character. This is being worked on. Methods called “Explainable AI” have been developed. With SHAP, freely available open source libraries have been created. AI-based driver analyses have been developed, most of which use the Random Forests method and are designed to tell the user which variables have which importance.
But an a-causal random forest (or an a-causal neural network) cannot be repaired by an “Explainable AI” algorithm.
Wrong remains wrong.
Explainable AI therefore offers a dangerous illusion of transparency.
Distilling alcohol in the cellar at home used to be commonplace. In some parts of the world, this is still the case today. It happened time and again that methanol was produced during distillation. The result: at best blindness, at worst instant death.
Methanol is produced when the pectin contained in the cell walls of the grain is split and fermented. If the mash is not filtered properly before distillation, so that plenty of cell-wall material remains in it, the spirit ends up rich in methanol.
Today, AI and machine learning are similar to the unfiltered distillation of schnapps. It often works, sometimes it goes wrong.
What we need is a filter system – also for AI.
Causal AI methods are one such “filter”. They address the core of the problem: the acausal explanatory data.
The challenge posed by causal AI is a tough nut to crack. You can achieve a lot with “filter algorithms”. But it turns out that a good knowledge of the real world, which is described by the data, is also very helpful here.
So what is causal AI? My simplified formula is
Causal AI = artificial intelligence + domain expertise + X
Artificial intelligence algorithms are needed to discover what is difficult or impossible for humans to find. To be useful for causal AI, the AI technologies employed have to meet certain requirements. For example, it is not enough to reproduce a target variable well (i.e. to obtain a good fit).
We also need domain expertise to select and process the right data, and to ultimately turn the results into sound findings. It provides the context that is not visible in the data itself.
We also need algorithms from time to time that check whether the conditions for causality are met. Are all important influences included in the model? What about the direction of causality? Does satisfaction influence customer loyalty or vice versa?
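As a hand-rolled sketch of one such check (variable names and effect sizes are invented): comparing the raw association with the association after adjusting for a suspected confounder shows how much of the apparent effect survives.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
contract_age = rng.normal(0, 1, n)                       # suspected confounder
satisfaction = 0.8 * contract_age + rng.normal(0, 1, n)
loyalty = 0.2 * satisfaction + 0.5 * contract_age + rng.normal(0, 1, n)

naive = sm.OLS(loyalty, sm.add_constant(satisfaction)).fit()
adjusted = sm.OLS(loyalty, sm.add_constant(np.column_stack([satisfaction, contract_age]))).fit()
print("effect without the confounder:", round(naive.params[1], 2))      # inflated
print("effect with the confounder:   ", round(adjusted.params[1], 2))   # close to the true 0.2
```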
I will explore all of this in the next chapter. What process should we use to bring in expert knowledge? What technology should the AI algorithms use? What processes are behind the “X”? In this way, we can better understand what a good filter for AI should look like.