The Receiver-Operating-Characteristic-Curve (ROC) and the area-under-the-ROC-curve (AUC) are popular measures to compare the performance of different models in machine learning.

The AUC is a rank measure, which means it abstracts from the differences in the probability scores and only looks at too which extend the target observations are ranked above non-target observations. We can understand the AUC and its properties much better by understanding its relation to the Mann-Whitney U test a.k.a Wilcoxon rank sum test. In particular, we will get a better intuition why the score distribution plays a role for model comparison using the AUC.

The post has the structure:

- Introduction of the ROC-AUC
- The AUC as a rank-sum test
- The Normal distribution of the AUC statistic
- Confidence intervals for the AUC

*Hernández-Orallo, J., Flach, P., & Ferri, C. (2012). A unified view
of performance metrics: translating threshold choice into expected classification loss. Journal of
Machine Learning Research, 13, 2813-2869.*

Let's see how the ROC is usually used for model comparison.

In [1]:

```
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
```

In [2]:

```
def generate_data(n, beta_0, beta, sigma):
if hasattr(beta, '__len__'):
X = np.random.uniform(-5,5,size=[n, len(beta)])
else:
X = np.random.uniform(-5,5,size=[n,])
#y = beta_0 + np.dot(X, beta) + np.random.normal(0, sigma, size=[n,])
y = np.random.binomial(n=1, p = 1/(1 + np.exp(-(beta_0 + np.dot(X, beta)
+ 0.4*np.dot(X**2, beta)))))
return X, y
```

In [3]:

```
X, y = generate_data(10000, 1, [2,-2], 1)
```

In [4]:

```
print(y[0:5])
print(X[0:3])
```

In [5]:

```
from sklearn.linear_model import LogisticRegression
```

In [6]:

```
logit = LogisticRegression()
logit.fit(X, y)
print(logit.coef_)
```

In [7]:

```
logit_score = logit.predict_proba(X)[:,1]
```

Calculate the AUC score given the model predictions.

In [8]:

```
from sklearn.metrics import roc_auc_score
```

In [9]:

```
roc_auc_score(y, logit_score)
```

Out[9]:

The ROC curve is defined as the sensitivity of the model at each level of (1-specificity) for each probability threshold $t$. Since sensitivity and specificity lie between 0 and 1, the total area between them is 1 and consequently, the area-under-the-ROC-curve lies between 0 and 1.

Why would we do we need the ROC-AUC anyway? We calculate the AUC for different models and pick the model with the highest AUC. Doesn't that like a case for statistical testing?

In [10]:

```
# Transform the logit model scores
logit_score2 = np.log(logit_score)*2
```

In [11]:

```
np.percentile(logit_score2, [0,0.25,0.5,0.75,1])
```

Out[11]:

In [12]:

```
roc_auc_score(y, logit_score2)
```

Out[12]:

The AUC is equivalent to the Wilcoxon or Mann-Whitney U test statistic with the relation:

$ AUC = U / (n_0*n_1)$

where $n_1$ and $n_0$ are the number of observations in the target and non-target group, respectively. See https://rmets.onlinelibrary.wiley.com/doi/epdf/10.1256/003590002320603584 for a proof.

The Mann-Whitney U test is used to determine if two independent samples were selected from populations having the same mean rank. Our samples are the model scores for the non-target group and the target group. The mean rank, and so the AUC, can differ with the location of the distribution but also with its shape.

If but only if the distributions are identically shaped but shifted, the U test can be used to determine if there is a difference in medians. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/

In [13]:

```
# Plot score distributions
plt.hist(logit_score[y==0], bins=50, alpha=0.5)
plt.hist(logit_score[y==1], bins=50, alpha=0.5)
# Plot median
plt.axvline(x= np.median(logit_score[y==0]))
plt.axvline(x= np.median(logit_score[y==1]), color="orange")
plt.show()
```

*not* the same, they look like they are
mirrored. The difference in mean rank is consequently not equivalent to the difference in mean or median.
In the original description by Mann and Whitney, they devise the test (in my words) to check if the scores
for the target group are stochastically greater than the scores for the non-target group. That means we
check if the probability that the test score is above a threshold is always greater for the target than
for the non-target group.

** vvv The core insight is here vvv **

In [14]:

```
def calc_U(y_true, y_score):
n1 = np.sum(y_true==1)
n0 = len(y_score)-n1
## Calculate the rank for each observation
# Get the order: The index of the score at each rank from 0 to n
order = np.argsort(y_score)
# Get the rank: The rank of each score at the indices from 0 to n
rank = np.argsort(order)
# Python starts at 0, but statistical ranks at 1, so add 1 to every rank
rank += 1
# If the rank for target observations is higher than expected for a random model,
# then a possible reason could be that our model ranks target observations higher
U1 = np.sum(rank[y_true == 1]) - n1*(n1+1)/2
U0 = np.sum(rank[y_true == 0]) - n0*(n0+1)/2
# Formula for the relation between AUC and the U statistic
AUC1 = U1/ (n1*n0)
AUC0 = U0/ (n1*n0)
return U1, AUC1, U0, AUC0
```

Calculate the U and AUC for the model. Suprise! It's the AUC calculated from the ROC!

In [15]:

```
calc_U(y, logit_score)
```

Out[15]:

In [16]:

```
# See why many packages flip model predictions if AUC < 0.5?
# Flipping the output classes flips the reported AUC of class 1.
calc_U(y, 1-logit_score)
```

Out[16]:

In order to understand the distribution of the AUC, we can make use of our knowledge about the Mann-Whitney U. Unfortunately, I couldn't find a lot of good information out there on the Mann-Whitney or Wilcoxon rank sum test. I find even the original papers short but not very clear.

Be careful with the general guideline to look at the sample with smaller U rather than fix the
interesting group in advance. For the AUC correspondance to work as expected, the U statistic should be
calculated for the group with higher scores.

The example below shows that the p-value for the Mann-Whitney U test is identical, independent of on which
group we calculate the U statistic. Note that I place the mean and median of the normal distribution and
t-distribution both at 2 so the test is really identifying a difference in the distribution rather than
the mean.

In [17]:

```
# Make two distributions that are different but similar
# The logit scores from above are so different that the p-value goes to 0
distr1 = sp.random.normal(2,1, size=100)
distr2 = sp.random.standard_t(1, size=100)+2
```

In [18]:

```
scipy_U = sp.stats.mannwhitneyu(distr1, distr2, alternative="two-sided", use_continuity=False)
print(scipy_U.statistic)
print(scipy_U.pvalue)
```

In [19]:

```
scipy_U = sp.stats.mannwhitneyu(distr2, distr1, alternative="two-sided", use_continuity=False)
print(scipy_U.statistic)
print(scipy_U.pvalue)
```

*of a group of that size*.

But the sum of ranks clearly also depends on what the highest possible rank is, i.e. on the total number
of observations! To make them comparable over different numbers of total observations, the U value is
standardized using the mean and standard deviation, which depend on the total number of observations.

In [20]:

```
# Possible ranks from 0 to n1+n2=200
x = (np.arange(0,200)+1).tolist()
```

In [21]:

```
# Under the H_0 that the scores are equally distributed
# Randomly sample ranks and calculate U for n_iter trials
import random
n_iter = 5000 # How fine grained our simulation should be
sim_U = np.empty([n_iter])
# Draw U values under the H0 assumption of random ranks,
# e.g. from a model giving random scores
for it in range(n_iter):
n = 100 # number of samples/group size
s = random.sample(x, n)
ranksum = np.sum(s) - n*(n+1)/2
sim_U[it] = ranksum
```

In [22]:

```
# Plot the distribution of the U statistic
import matplotlib.pyplot as plt
plt.hist(sim_U, bins= 40)
plt.show()
```

In [23]:

```
# Plot standard normal
plt.plot(np.arange(-4,4,0.01), sp.stats.norm.pdf(x=np.arange(-4,4,0.01),loc=0,scale=1), color='red', alpha=0.5)
# Plot standardized U statistics from simulation
# We should standardize by the theoretical mean and std, but the empirical approximation is enough for show
plt.hist((sim_U - np.mean(sim_U))/np.std(sim_U), # Standardized U statistics
bins=40, density=True, alpha=0.5)
plt.show()
```

In [24]:

```
n = len(distr1)
scores = np.concatenate([distr1, distr2])
order = np.argsort(scores)
rank = np.argsort(order)
rank += 1
U_distr1 = np.sum(rank[0:n]) - n*(n+1)/2
U_distr1_normalized = (U_distr1 - 100*100/2) / np.sqrt(100*100*(100+100+1)/12)
print(U_distr1)
print(U_distr1_normalized)
```

We can calculate the p-value as the probability of finding a more extreme value under our expectation that the U is approximately Gaussian normal distributed. We take the absolute of the value so that we will only have to look on the positive right side. We can do this, because the normal distribution is symmetric. The cumulative distribution function gives us the probability of observing a value $x$ below $z$, i.e. $P(x \leq z)$. We can turn this around as $1 - P(x \leq z)$ to get the probability of seeing a larger value.

For the two-sided test, since the normal distribution is symmetric, the probability of finding a more extreme value on the negative side is exactly the same, leaving us with $2*(1 - P(x \leq z))$.

In [25]:

```
# Calculate the p-value
# better to use stats.norm.sf() as exact 1-cdf
print("One-sided:", 1 - sp.stats.norm.cdf(abs(U_distr1_normalized)))
print("Two-sided:", 2*(1 - sp.stats.norm.cdf(abs(U_distr1_normalized))))
```

In [26]:

```
# As above:
# Plot standard normal
plt.plot(np.arange(-4,4,0.01), sp.stats.norm.pdf(x=np.arange(-4,4,0.01),loc=0,scale=1), color='red', alpha=0.5)
# Plot standardized U statistics from simulation
plt.hist((sim_U - np.mean(sim_U))/np.std(sim_U), # Standardized U statistics
bins=40, density=True, alpha=0.5)
# Plot U of our (good) classifier
plt.axvline(U_distr1_normalized)
plt.show()
```

We can use the approximate variance to calculate confidence intervals for the AUC. This is the correct intuition, but ideally we would use the exact variance, since our approximation may lead to wrong confidence intervals.

Unfortunately, the exact variance only be calculated based on the distributions of the scores (Cortes
& Mohri 2005), which typically do not follow a known distribution. There seem to be different
suggestions on how to calculate a useful confidence interval, see Cortes & Mohri (2005) for a summary
and their (long) formula in *Corollary 1 ([4])*:

\begin{align} \sigma^2(AUC) &= \frac{(m+n+1)(m+n)(m+n−1)T((m+n−2)Z_4−(2m−n+3k−10)Z_3)} {72m^2n^2} \\ &+ \frac{(m+n+1)(m+n)T(m^2−nm+3km−5m+2k^2−nk+12−9k)Z_2}{48m^2n^2} \\ &− \frac{(m+n+1)^2(m−n)^4 Z_1^2} { 16m^2n^2} − \frac{(m+n+1)Q_1 Z_1}{72m^2n^2} +\frac{kQ_0}{144m^2n^2} \end{align}

with:

\begin{alignat}{2} T &= && 3((m − n)^2 + m + n) + 2 \\ \\ Z_i &= && \frac{ \sum^{k−i}_{x=0} \left(\begin{smallmatrix} m+n+1−i \\ x \end{smallmatrix}\right) } { \sum^k_{x=0} \left(\begin{smallmatrix} m+n+1 \\ x \end{smallmatrix}\right) }\\ \\ Q_0 &= &&(m + n + 1)T k^2 + ((−3n^2 + 3mn + 3m + 1)T − 12(3mn + m + n) − 8)k +\\ & && (−3m^2 +7m + 10n + 3nm + 10)T − 4(3mn + m + n + 1) \\ \\ Q_1 &= && T k^3 + 3(m − 1)T k^2 + ((−3n^2 + 3mn − 3m + 8)T − 6(6mn + m + n))k +\\ & && (−3m^2 +7(m + n) + 3mn)T − 2(6mn + m + n) \end{alignat}

References

*Cortes, C., & Mohri, M. (2005). Confidence intervals for the area under the ROC curve. In
Advances in neural information processing systems (pp. 305-312).* https://cs.nyu.edu/~mohri/pub/area.pdf

Feel free to share!