import pandas as pd # dataframes for python
import plotnine as pn # ggplot2 clone
pn.options.figure_size = (5, 5) # set a default figure size for plotnine plots
pn.options.current_theme = pn.theme_bw() # set simple theme
import seaborn as sns # statistical plotting in python land
sns.set_theme(style="whitegrid") # plot theme
# frequentist modeling
import scipy.stats as stats # classic freq stats for python
import pingouin as pg # alt to scipy.stats
# bayesian modeling
import pymc as pm # write your models explicitly
import bambi as bmb # formula like interface
import arviz as az # plots for MCMC objects
Introduction
This is the second of three posts that will carry out data loading, exploration, filtering and statistical testing (frequentist & Bayesian). In the first post of the series we used R. In this post we’ll use python. Like the previous post there won’t be much exposition - we’ll just move through the process.
If you want to follow along the data are here.
Preliminaries
Python, like R, has a host of extra packages to help with data import, wrangling, plotting & building various kinds of models. The first step is to load the packages we will need. I use the Anaconda python distribution and packages that are not installed by default can be installed with the conda tool. In this post we use the pymc package for Bayesian modeling. The installation notes for pymc recommend installing it into its own conda environment so this is what I did! To run the code in VSCode I set the relevant python interpreter by using Ctrl+Shift+P to bring up the Command Palette and selecting the relevant python environment. The other packages had to be installed into the same environment using conda install.
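A quick way to confirm that VSCode has picked up the right environment is to print the package versions from within python; a minimal sketch:

# sanity check (a sketch): confirm the pymc environment is active
import pymc, arviz, bambi
print(pymc.__version__, arviz.__version__, bambi.__version__)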
Ok, let’s get on with loading the packages we’ll need!
Loading the data
We can use the read_csv() function of the pandas (Reback et al. 2020) package to read in the data. These data are from body composition practicals run as part of the Sport & Exercise Science degree at the University of Stirling. They were collected over a number of years by the students, who carried out various measures on themselves.
# get the data
data_in = pd.read_csv('data/BODY_COMPOSITION_DATA.csv', sep=',', na_values = "NA")
Exploration & tidying
The pandas package also provides some tools for exploring the data.
data_in.head()
# examine summary of types etc
# there are missing values in bia & HW
data_in.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 203 non-null object
1 girths 203 non-null float64
2 bia 202 non-null float64
3 DW 203 non-null float64
4 jackson 203 non-null float64
5 HW 202 non-null float64
6 skinfolds 203 non-null float64
7 BMI 203 non-null float64
8 WHR 203 non-null float64
9 Waist 203 non-null float64
dtypes: float64(9), object(1)
memory usage: 16.0+ KB
We can see that there are some missing values in the BIA and HW variables (these variables have 202 non-null values). There are many ways to deal with missing values but here we will just drop rows with missing values. The dropna() method for pandas dataframes allows us to drop rows (axis 0) or columns (axis 1) with missing values. We also specify the inplace = True argument so that the dataframe we are working on is modified in place.
# drop the rows (axis 0) with missing values; alter dataframe (inplace = True)
data_in.dropna(axis = 0, inplace = True)
# all non-null values
data_in.info()
# summary stats
data_in.describe()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 201
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 201 non-null object
1 girths 201 non-null float64
2 bia 201 non-null float64
3 DW 201 non-null float64
4 jackson 201 non-null float64
5 HW 201 non-null float64
6 skinfolds 201 non-null float64
7 BMI 201 non-null float64
8 WHR 201 non-null float64
9 Waist 201 non-null float64
dtypes: float64(9), object(1)
memory usage: 17.3+ KB
|  | girths | bia | DW | jackson | HW | skinfolds | BMI | WHR | Waist |
|---|---|---|---|---|---|---|---|---|---|
| count | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
| mean | 20.734552 | 16.980100 | 21.614647 | 14.212189 | 21.426408 | 82.684751 | 23.223000 | 0.781529 | 76.756716 |
| std | 8.382091 | 6.792665 | 7.307617 | 7.479344 | 8.006785 | 33.108192 | 3.168286 | 0.057142 | 7.352106 |
| min | 7.150000 | 5.700000 | 4.100000 | 3.000000 | 4.100000 | 27.750000 | 2.900000 | 0.670000 | 61.000000 |
| 25% | 15.080000 | 11.900000 | 16.300000 | 8.000000 | 15.000000 | 59.250000 | 21.170000 | 0.740000 | 72.000000 |
| 50% | 20.120000 | 15.900000 | 21.400000 | 12.600000 | 21.000000 | 76.230000 | 23.000000 | 0.780000 | 76.000000 |
| 75% | 24.400000 | 21.200000 | 28.000000 | 19.000000 | 27.000000 | 100.350000 | 24.800000 | 0.815000 | 81.000000 |
| max | 87.900000 | 39.300000 | 45.900000 | 35.000000 | 43.000000 | 181.000000 | 33.030000 | 0.990000 | 100.800000 |
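As an aside, if we only wanted to drop rows with missing values in particular columns, dropna() also accepts a subset argument. A sketch (equivalent here, since bia and HW are the only columns with missing values):

# a sketch: drop rows only where bia or HW is missing
data_in.dropna(axis = 0, subset = ["bia", "HW"], inplace = True)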
Next we will convert our data from wide format to long format (Wickham 2014) with the pandas.melt() function. Long data makes plotting and statistical analyses easier. In long format data the values for each individual and each measurement technique are identified by rows rather than spread across row & column combinations.
# long data
dataL = pd.melt(data_in, id_vars = "sex", var_name = "method", value_name = "value")
dataL.head()
|  | sex | method | value |
|---|---|---|---|
| 0 | M | girths | 10.85 |
| 1 | M | girths | 14.12 |
| 2 | M | girths | 12.30 |
| 3 | M | girths | 8.50 |
| 4 | M | girths | 11.66 |
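A quick check we might run on the long data (a sketch): each measurement method should now contribute one row per individual.

# a sketch: count values per method; each should have 201 rows
dataL.groupby("method")["value"].count()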
Exploration with plots is an essential step for checking values and the distribution of data. There is an extensive plotting ecosystem in python.
The seaborn (Waskom 2021) package provides a high level interface for plotting data & statistical summaries. If you’re used to e.g. ggplot2 in R then the plotnine package provides very similar functionality. The tabs below demonstrate the same plot using each of these packages.
fg = sns.FacetGrid(dataL, col = 'method', hue = 'sex', col_wrap = 3, sharey = False); # create grid
fg.map(sns.stripplot, 'sex', 'value', jitter = 0.05, size = 10, palette=["firebrick", "cornflowerblue"], alpha = 0.5, order = ["F", "M"]); # map stripplot onto grid
pt = pn.ggplot(dataL, pn.aes('sex', 'value', colour = 'sex')) + pn.geom_jitter(width = 0.1, alpha = 0.5) + pn.facet_wrap("method", scales = "free_y") + pn.scale_colour_manual(values=['firebrick', 'cornflowerblue'])
pt
/home/iain/anaconda3/envs/pymc_env/lib/python3.11/site-packages/plotnine/facets/facet.py:440: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.
There are a couple of mad values in the BMI and girths variables. For the rest of the analysis we’ll concentrate on the BMI variable. First we’ll filter the data to just BMI.
# filter to just bmi data
bmi_data = dataL[dataL.method == "BMI"]
bmi_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 1206 to 1406
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 201 non-null object
1 method 201 non-null object
2 value 201 non-null float64
dtypes: float64(1), object(2)
memory usage: 6.3+ KB
# first few values
bmi_data.head()
|  | sex | method | value |
|---|---|---|---|
| 1206 | M | BMI | 20.70 |
| 1207 | M | BMI | 21.90 |
| 1208 | M | BMI | 21.39 |
| 1209 | M | BMI | 19.26 |
| 1210 | M | BMI | 22.30 |
We’ll re-plot these data.
bmi_pt1 = sns.stripplot(x = "sex", y = "value", data = bmi_data, jitter = 0.05, palette=["firebrick", "cornflowerblue"], alpha = 0.8, order = ["F", "M"]);
bmi_pt1
<Axes: xlabel='sex', ylabel='value'>
bmi_pt2 = pn.ggplot(bmi_data, pn.aes("sex", "value", colour = "sex")) + pn.geom_jitter(width = 0.1, alpha = 0.5) + pn.scale_colour_manual(values = ["firebrick", "cornflowerblue"])
bmi_pt2
We can clearly see the outlier in the male data. Removing outliers is a contentious subject but a BMI of 2 is unrealistic so we’ll remove this value.
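Before dropping it, it’s worth looking at the offending row; a quick sketch:

# a sketch: inspect implausibly low BMI values before removing them
# (here this shows the one very low value in the male data)
bmi_data[bmi_data.value < 15]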
# note very low bmi point in M; let's drop that
bmi_data = bmi_data[bmi_data.value > 15]
# summary
bmi_data.describe()
|  | value |
|---|---|
| count | 200.000000 |
| mean | 23.324615 |
| std | 2.828887 |
| min | 18.080000 |
| 25% | 21.177500 |
| 50% | 23.000000 |
| 75% | 24.802500 |
| max | 33.030000 |
# seaborn plot
bmi_pt3 = sns.stripplot(x = "sex", y = "value", data = bmi_data, jitter = 0.05, palette=["firebrick", "cornflowerblue"], alpha = 0.8, order = ["F", "M"]);
bmi_pt3
<Axes: xlabel='sex', ylabel='value'>
# plotnine plot
bmi_pt4 = pn.ggplot(bmi_data, pn.aes("sex", "value", colour = "sex")) + pn.geom_jitter(width = 0.1, alpha = 0.5) + pn.scale_colour_manual(values = ["firebrick", "cornflowerblue"])
bmi_pt4
Much better!
Frequentist testing
We’re now in a position to undertake some statistical analysis. We’ll start with a simple t-test to examine the mean difference in BMI between males and females. The scipy.stats (Virtanen et al. 2020) library provides functions for one sample, paired & independent t-tests (and other tests). We first extract the data we want to test into separate series and then pass these series to the appropriate function. The stats.ttest_ind() function returns a tuple containing the t-statistic and the p-value for the test, which we can extract and print. The equal_var = False argument means we get Welch’s t-test, which doesn’t assume equal variances in each group.
# test diff between men & women; get data
male_data = bmi_data[bmi_data.sex == "M"]
female_data = bmi_data[bmi_data.sex == "F"]

# do the test
t_res = stats.ttest_ind(male_data.value, female_data.value, equal_var = False) # tuple out, t-stat and p-value
t_res

# print informative result
print("The t-statistic is %.2f with a p-value of %.3f." % (t_res[0], t_res[1]))
The t-statistic is 2.11 with a p-value of 0.036.
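The p-value alone doesn’t tell us the size of the difference. In recent versions of scipy (1.11+, an assumption worth checking against your installed version) the result object also carries a confidence interval for the mean difference; a sketch:

# a sketch (assumes scipy >= 1.11): CI for the mean difference
ci = t_res.confidence_interval(confidence_level = 0.95)
print("95%% CI for the mean difference: %.2f to %.2f" % (ci.low, ci.high))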
The pingouin (Vallat 2018) package also provides functions for statistical testing. Using the ttest() function with correction = 'auto' means pingouin automatically uses Welch’s t-test when the sample sizes are unequal, as they are here.
# pingouin example; correction = 'auto'
pg.ttest(male_data.value, female_data.value, paired = False, correction = 'auto')
|  | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 2.11342 | 192.047574 | two-sided | 0.035856 | [0.05, 1.59] | 0.293662 | 1.242 | 0.529014 |
The pingouin package provides us with much more information - which may or may not be useful to you. The difference between male & female BMI is significant. This means that if there were truly no difference between the sexes, then in a hypothetical long series of repeats of this study with different samples from the same population we would expect to see a difference as big or bigger in fewer than 5% of those repeats. The pingouin package also reports the power of the test here. This is post-hoc power though & post-hoc power is witchcraft e.g. (Gelman 2019).
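If you do want a power calculation, the defensible version is a priori: decide the smallest effect size you care about and solve for the sample size before collecting data. pingouin’s power_ttest() solves for whichever argument is left as None; a sketch:

# a sketch: sample size per group needed to detect d = 0.3
# with 80% power at alpha = 0.05
pg.power_ttest(d = 0.3, power = 0.8, alpha = 0.05, contrast = 'two-samples')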
Bayesian testing
In the previous post with R we used the Stan probabilistic programming language to create a Bayesian model for the BMI data. We could also use Stan here via the pystan interface, but instead we’ll use a native python library called pymc (Salvatier, Wiecki, and Fonnesbeck 2016). The pymc package allows us to write data generating models and then use Markov Chain Monte Carlo (MCMC) sampling with those model definitions to generate posterior distributions. pymc supports a range of MCMC algorithms. In the code below we use the same priors we defined in the post using R.
# bayesian test with pymc
# create dummy variables; F = 0, M = 1
bmi_data_dummy = pd.get_dummies(bmi_data, columns = ["sex"], drop_first = True)

# set up priors & likelihood
# https://docs.pymc.io/en/latest/api/generated/pymc.sample.html
with pm.Model() as model: # model specifications in pymc are wrapped in a `with` statement
    # Define priors
    sigma = pm.HalfNormal("sigma", sigma = 100)
    intercept = pm.Normal("Intercept", mu = 25, sigma = 10)
    x_coeff = pm.Normal("male_diff", mu = 0, sigma = 5)

    # Define likelihood
    likelihood = pm.Normal("value", mu = intercept + x_coeff * bmi_data_dummy.sex_M, sigma = sigma, observed = bmi_data_dummy.value)
Next we run the MCMC sampling on the model we defined above; by default the NUTS algorithm is used. This is the same MCMC algorithm that the Stan probabilistic programming language uses by default. Using return_inferencedata = True means we can easily plot the results (see below).
# MCMC sampling
# 3 MCMC chains
# draw 3000 posterior samples using NUTS sampling; 1000 iter burn-in
with model:
    bayes_bmi = pm.sample(3000, tune = 1000, return_inferencedata = True, chains = 3)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 4 jobs)
NUTS: [sigma, Intercept, male_diff]
Sampling 3 chains for 1_000 tune and 3_000 draw iterations (3_000 + 9_000 draws total) took 2 seconds.
The arviz library (Kumar et al. 2019) provides tools for summarising & plotting data from MCMC chains & posterior distributions.
az.plot_trace(bayes_bmi);
We want the traceplots (on the right) to look like ‘hairy caterpillars’ & they all look fine here. The posterior distributions for each parameter also look healthy. We can plot the posteriors using arviz as well.
az.plot_posterior(bayes_bmi, grid = (2,2), hdi_prob = 0.95);
The posterior distributions all look good. We can extract the intercept posterior and the posterior for the effect of ‘male’ and add these together to get the posterior for male BMI.
# add Intercept & male diff posteriors; keep this new posterior in existing InferenceData object
"male_bmi"] = bayes_bmi.posterior["Intercept"] + bayes_bmi.posterior["male_diff"]
bayes_bmi.posterior[
# replot with only intercept (female BMI), male BMI and sigma
= ["Intercept", "male_bmi", "sigma"] , grid = (2,2), hdi_prob = 0.95); az.plot_posterior(bayes_bmi, var_names
# summary
= ["Intercept", "male_bmi", "sigma"] , kind = "stats", hdi_prob = 0.9) az.summary(bayes_bmi, var_names
|  | mean | sd | hdi_5% | hdi_95% |
|---|---|---|---|---|
| Intercept | 22.835 | 0.316 | 22.292 | 23.317 |
| male_bmi | 23.661 | 0.258 | 23.247 | 24.099 |
| sigma | 2.826 | 0.143 | 2.588 | 3.051 |
The output tells us that the estimated mean for female BMI is 22.8 (females were dummy coded as 0). Given the priors we used we can say that there is a 90% probability that the value for female BMI lies between 22.3 and 23.4. The estimated male BMI is 23.6 with a 90% probability of being between 23.2 & 24.1. Note that the exact values will vary slightly from run to run because MCMC sampling is stochastic.
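A nice side effect of having the full posterior is that we can answer direct questions about the parameters. For example, a sketch of the posterior probability that mean male BMI exceeds mean female BMI - simply the fraction of posterior samples in which male_diff is positive:

# a sketch: P(male mean BMI > female mean BMI | data, priors)
(bayes_bmi.posterior["male_diff"] > 0).mean().item()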
The bambi library (Capretto et al. 2022) can be used to create Bayesian models with a more intuitive formula interface, like brms or rstanarm in R.
# model with bambi
# define priors
prior_spec = {
    "Intercept": bmb.Prior("Normal", mu = 25, sigma = 10),
    "sex_M": bmb.Prior("Normal", mu = 0, sigma = 5),
    "value_sigma": bmb.Prior("HalfNormal", sigma = 100)
}
# define the model; formula syntax
bmb_bayes_model = bmb.Model("value ~ sex", priors = prior_spec, data = bmi_data)
# MCMC sampling; returns InferenceData obj
bmb_bayes_bmi = bmb_bayes_model.fit(draws = 3000, tune = 1000, chains = 3)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 4 jobs)
NUTS: [value_sigma, Intercept, sex]
Sampling 3 chains for 1_000 tune and 3_000 draw iterations (3_000 + 9_000 draws total) took 2 seconds.
The bmb_bayes_bmi object is of type InferenceData, like that returned from pymc (bambi uses pymc under the hood). We can use the bambi result in the same way we used the pymc result with arviz.
First we’ll plot the posterior distributions and plots for each MCMC chain.
# plots and dists
az.plot_trace(bmb_bayes_bmi);
Next we’ll plot the posterior distributions and get summaries of those posteriors.
# plot posteriors
az.plot_posterior(bmb_bayes_bmi, grid = (2,2), hdi_prob = 0.95);
# Key summary and diagnostic info on the model parameters
az.summary(bmb_bayes_bmi)
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 22.840 | 0.309 | 22.255 | 23.417 | 0.003 | 0.002 | 15248.0 | 7273.0 | 1.0 |
| sex[M] | 0.828 | 0.406 | 0.027 | 1.549 | 0.003 | 0.003 | 15087.0 | 7029.0 | 1.0 |
| value_sigma | 2.816 | 0.144 | 2.545 | 3.086 | 0.001 | 0.001 | 12954.0 | 6515.0 | 1.0 |
We get some extra information using the bambi summary.
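The extra columns (r_hat, ess_bulk, ess_tail) are convergence diagnostics; r_hat should be close to 1 for every parameter. If you want the diagnostics on their own, arviz can compute them directly; a sketch:

# a sketch: convergence diagnostics on their own
az.rhat(bmb_bayes_bmi) # should be ~1.0 for all parameters
az.ess(bmb_bayes_bmi) # effective sample sizes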
As we did in the pymc example we can add the Intercept and sex chains together to get a posterior distribution for the male BMI and add this to our existing InferenceData object.
"male_bmi"] = bmb_bayes_bmi.posterior["Intercept"] + bmb_bayes_bmi.posterior["sex"] bmb_bayes_bmi.posterior[
We can easily summarise & plot the parameters we are interested in.
# plot selected posteriors
= ["Intercept", "male_bmi", "value_sigma"], grid = (2,2), hdi_prob = 0.95);
az.plot_posterior(bmb_bayes_bmi, var_names
# posterior summary
= ["Intercept", "male_bmi", "value_sigma"], kind = "stats", hdi_prob = 0.9) az.summary(bmb_bayes_bmi, var_names
|  | mean | sd | hdi_5% | hdi_95% |
|---|---|---|---|---|
| Intercept | 22.840 | 0.309 | 22.322 | 23.335 |
| male_bmi[M] | 23.668 | 0.259 | 23.249 | 24.091 |
| value_sigma | 2.816 | 0.144 | 2.570 | 3.043 |
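As a final check, the two fits should agree closely since they encode the same model and priors. arviz can overlay estimates from multiple InferenceData objects; a sketch comparing the female-BMI intercepts:

# a sketch: compare the pymc & bambi fits side by side
az.plot_forest([bayes_bmi, bmb_bayes_bmi], var_names = ["Intercept"], model_names = ["pymc", "bambi"], hdi_prob = 0.9);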
Summary
This post has been a quick skip through some data loading, exploration, filtering and both frequentist & Bayesian modelling with python.