# Causal Inference: Contextual Bandit vs. Randomized Trials

Bandit learning is an approach to causal inference that combines treatment effect estimation with decision making. This post compares the contextual bandit to the conditional mean approach to treatment effect estimation, and shows that policy learning creates a selection effect that biases the treatment effect estimate and is difficult to correct ex post.

## Making the right decision

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LogisticRegression

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 6)
import seaborn as sns

np.random.seed(1234)
```

Imagine the following setting: we observe users with a single characteristic $X \sim N(1,1)$. For each user, we can either do nothing ($t=0$) or play an action, for example pay to display a banner ad ($t=1$). After we decide on an action, we observe an outcome $y$ that depends on the user characteristic and our action in the following way: $y = 0 + x + t \cdot 2 \cdot x - t \cdot 1 + \epsilon$. The treatment effect for a user is therefore $2x - 1$, so playing the action is beneficial exactly when $x > 0.5$.
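This data-generating process can be sketched directly in code. The function name `simulate_outcome` and the use of a local random generator are my choices for illustration; the outcome model itself is the equation from the text:

```python
import numpy as np

rng = np.random.default_rng(1234)

def simulate_outcome(x, t, rng):
    """Outcome model from the text: y = x + 2*t*x - t + noise."""
    eps = rng.normal(0, 1, size=np.shape(x))
    return x + 2 * t * x - t + eps

# Users drawn from X ~ N(1, 1); observe both potential outcomes here
# for illustration (in practice we only see the outcome of the chosen action).
x = rng.normal(1, 1, size=5)
y0 = simulate_outcome(x, 0, rng)
y1 = simulate_outcome(x, 1, rng)
```

Note that the noiseless treatment effect $y_1 - y_0 = 2x - 1$ is positive only for users with $x > 0.5$, which is what makes the choice of action context-dependent.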

## Conclusion

A/B testing and bandit learning address the same problem: which of several actions leads to the best outcome? We have seen that A/B testing assigns actions to users randomly, which yields accurate estimates of each action's outcome after exploration. Bandit learning minimizes the number of bad decisions while exploring, but induces sample selection that in turn leads to less accurate estimates of the outcomes of each action.
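The selection effect can be sketched in a few lines. As a simplifying assumption, the "bandit" below is a greedy policy that already knows the true profitability threshold $x > 0.5$; a real bandit would have to learn it, but the selection problem is the same in kind. Because the greedy policy assigns treatment based on $x$, the naive difference in group means no longer estimates the average treatment effect:

```python
import numpy as np

rng = np.random.default_rng(1234)

def outcome(x, t, rng):
    # Outcome model from the text: y = x + 2*t*x - t + noise
    return x + 2 * t * x - t + rng.normal(0, 1, size=np.shape(x))

n = 10_000
x = rng.normal(1, 1, size=n)

# A/B assignment: t is independent of x.
t_ab = rng.integers(0, 2, size=n)
y_ab = outcome(x, t_ab, rng)

# Greedy "bandit" assignment (illustrative): treat only when profitable.
t_greedy = (x > 0.5).astype(int)
y_greedy = outcome(x, t_greedy, rng)

# Naive difference in group means: unbiased for the average treatment
# effect (here 1) under randomization, inflated under the greedy policy
# because treated users have systematically larger x.
ab_diff = y_ab[t_ab == 1].mean() - y_ab[t_ab == 0].mean()
greedy_diff = y_greedy[t_greedy == 1].mean() - y_greedy[t_greedy == 0].mean()
```

Under randomization, `ab_diff` is close to the true average effect of $E[2x - 1] = 1$; under the greedy policy, the treated and untreated groups differ in $x$, so `greedy_diff` is substantially larger and no longer has a causal interpretation.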