Context
As a French basketball player and NBA fan, I am really excited by the debut of the French player Victor Wembanyama during the 2023/2024 NBA season.
Wemby is probably the most anticipated player since LeBron James, and for good reasons:
- He was selected as the first overall pick in the 2023 NBA draft by the San Antonio Spurs
- He boasts exceptional qualities, standing at 2.24m while showcasing point-guard skills such as dribbling, driving, and three-point shooting
- He was considered a potential Hall of Famer even before his first game
Performance Overview:
Despite the high anticipation surrounding his NBA debut and a promising start with decent statistical averages, his performance has yet to fully align with the initial expectations.
Notably, his field goal accuracy (the number of shots made divided by the number of attempts) is currently below 50%, a figure that may be surprising given his height advantage on the court. However, recent games indicate a potential improvement in his accuracy.
Domain-related considerations
It is common for rookie players to experience an adjustment period as they familiarize themselves with the intensity of NBA games, team dynamics, and various playing styles. This could be reflected in an improvement of Wemby’s accuracy during his first season. On the other hand, as we will see, this improvement is subtle and could simply stem from the natural variability of this statistic. Because in-game accuracy is quite noisy by nature, it can be difficult to get an accurate picture of its progression. This leads us to the central question of this analysis:
→ Did Wemby’s accuracy really improve over the season?
Preprocess data
The study was conducted in January 2024; therefore, the data stops on January 10th, 2024.
NBA data can be extracted in several ways. Check out my nba-stats repository, which presents a convenient method for retrieving and modeling NBA data. This facilitates the retrieval of a broad range of statistics essential for conducting data analyses like the one performed here.
Here we will explore the data related to Wemby’s first games. The pre-processing is fairly straightforward, as the raw data already contains the main statistics needed for the analysis.
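For illustration, here is a minimal, hypothetical sketch of how the `boxscore_per_game` dataframe used below could be loaded from a previously exported file (the path and file name are assumptions, not part of the original pipeline):
Code
import pandas as pd

# Hypothetical export of Wemby's per-game box scores (path is an assumption)
boxscore_per_game = pd.read_csv("data/wembanyama_boxscores_2023_24.csv")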
Code
import numpy as np

COLS_TO_USE = {
    "field_goals_made": "nb_win",
    "field_goals_attempts": "nb_try",
    "field_goals_pct": "conv_rate",
}

# Sort games chronologically (game_id increases with the game date)
df = boxscore_per_game.sort_values("game_id")

# Keep only the columns needed for the analysis
df = df[list(COLS_TO_USE)].rename(columns=COLS_TO_USE)

# Add derived columns
df["game_nb"] = np.arange(len(df))  # game number since the start of the season
df["nb_miss"] = df["nb_try"] - df["nb_win"]
Initial overview
The hypothesis being that Wemby’s accuracy improved over the games, let’s take a quick look at his accuracy per game. For convenience, we will also display the rolling accuracy averaged over a 5-game window.
Code
import altair as alt

WIDTH = 600
HEIGHT = 400
BAR_SIZE = 15

# Accuracy per game
bar = (
    alt.Chart(df)
    .mark_bar(
        color=Colors.TERRACOTTA,
        size=BAR_SIZE,
        cornerRadiusTopLeft=3,
        cornerRadiusTopRight=3,
    )
    .encode(
        x=alt.X("game_nb").title("Game #").scale(domain=(-1, len(df))),
        y=alt.Y("conv_rate").axis(format="%").title("Accuracy (Conversion rate)"),
        tooltip=["game_nb", "conv_rate"],
    )
)

# Rolling accuracy over a 5-game window (current game + the 4 previous ones)
line = (
    alt.Chart(df)
    .mark_line(color=Colors.BROWN)
    .transform_window(rolling_mean="mean(conv_rate)", frame=[-4, 0])
    .encode(
        x=alt.X("game_nb").scale(domain=(0, len(df))),
        y="rolling_mean:Q",
        tooltip=["game_nb", "rolling_mean:Q"],
    )
)

(bar + line).properties(
    width=WIDTH,
    height=HEIGHT,
    title="Accuracy per game, and averaged over a 5-game window",
).show()
Observations:
Although Wemby’s season started with a high-accuracy game, his accuracy dropped over the following games and seems to have slightly improved since then. As previously hinted, this increase is subtle and could be insignificant from a statistical standpoint.
→ In the following parts, we will try to quantify this evolution and assess its statistical significance.
First estimation: Linear regression and correlation
We will start with a classic and simple approach: fitting an ordinary least squares (OLS) regression and computing the correlation coefficient.
Code
import scipy.stats

# Fit a simple OLS regression of accuracy against the game number
slope, intercept, correlation_coefficient, p_value, std_err = scipy.stats.linregress(
    df["game_nb"], df["conv_rate"]
)

chart = (
    alt.Chart(df)
    .mark_point(color=Colors.TERRACOTTA)
    .encode(
        x=alt.X("game_nb").title("Game #").scale(domain=(-1, len(df))),
        y=alt.Y("conv_rate").axis(format="%").title("Accuracy (Conversion rate)"),
        tooltip=["game_nb", "conv_rate"],
    )
)

linear_reg_chart = chart.transform_regression("game_nb", "conv_rate").mark_line(color=Colors.BROWN)

(chart + linear_reg_chart).properties(
    width=WIDTH,
    height=HEIGHT,
    title={
        "text": "OLS regression between accuracy % and nb of games",
        "subtitle": f"y={slope:.3g}*x + {intercept:.3g} | corr coeff = {correlation_coefficient:.3g}",
    },
).show()
Observations:
The correlation coefficient is quite low and the slope is small. This suggests that the improvement in Wemby’s accuracy is minimal and may not be statistically significant over these games.
Limits:
This approach has some limits: we treat the accuracy of each game with the same weight, although the number of shots varies from game to game.
This matters because, for example, if a player’s true accuracy is 50% (under a binomial assumption):
- If he takes 2 shots during the game, he will make 2/2 25% of the time
- If he takes 10 shots during the game, he will make 10/10 only ~0.1% of the time
In other words, accuracy measured over a few shots is much noisier. This increases the uncertainty of our estimate, especially because the number of games in our sample is low.
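These numbers can be checked quickly with scipy (a minimal sketch, assuming a constant 50% true accuracy and independent shots):
Code
import scipy.stats

# Probability of a perfect shooting game if the true accuracy is 50%
for n_shots in (2, 10):
    p_perfect = scipy.stats.binom.pmf(n_shots, n_shots, 0.5)
    print(f"{n_shots} shots: P(all made) = {p_perfect:.2%}")
# 2 shots: P(all made) = 25.00%
# 10 shots: P(all made) = 0.10%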
The graph below shows the distribution of shots per game: although the first game was a high-accuracy one, Wemby took fewer shots than usual.
Code
(
    alt.Chart(
        df.melt(
            id_vars="game_nb",
            value_vars=[
                "nb_win",
                "nb_miss",
            ],
        ).assign(
            color=lambda x: x["variable"].map(
                {
                    "nb_miss": Colors.BROWN,
                    "nb_win": Colors.TERRACOTTA,
                }
            )
        )
    )
    .mark_bar(
        size=BAR_SIZE,
        cornerRadiusTopLeft=3,
        cornerRadiusTopRight=3,
    )
    .encode(
        x=alt.X("game_nb").title("Game #").scale(domain=(0, len(df))),
        y=alt.Y("value").title("Number of shots"),
        color=alt.Color("color", scale=None, legend=alt.Legend(title="Miss / In")),
        tooltip=["game_nb", "variable", "value"],
    )
).properties(
    width=WIDTH,
    height=HEIGHT,
    title="Number of shots made and missed per game",
).show()
Moreover, our estimation of the uncertainty around the strength of the trend is also limited. We will therefore use another method to get a better estimate of the trend.
Bayesian approach: Modeling games with a binomial distribution
One possibility is to change our model slightly and leverage Bayesian statistics to obtain the full distribution of the estimated parameters.
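Concretely, mirroring the likelihood used in the code below, we model each game \(i\) as a binomial draw whose success probability evolves linearly with the game number:
\[ k_i \sim \mathrm{Binomial}(n_i, p_i), \qquad p_i = a \cdot x_i + b \]
where \(k_i\) is the number of made shots, \(n_i\) the number of attempts, and \(x_i\) the game number.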
→ Under these hypotheses, our goal is to estimate the distribution of the slope \(a\) in order to quantify how Wemby’s accuracy evolved over the season.
Computing the distribution of the model’s parameters
In order to estimate the distribution of the underlying parameters, we will apply Bayes’ formula using the grid method:
- We start by computing the probability of the observed data (the \(likelihood\)) for every combination of parameters (here the slope \(a\) and the intercept \(b\)) over a discrete, close-to-exhaustive grid of the parameter space.
- We then compute the posterior value for each combination of parameters as, following Bayes’ formula, the \(likelihood\) multiplied by the \(prior\) divided by the \(evidence\). We use a uniform distribution over the parameter grid as our prior.
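For our model, this amounts to computing, for every \((a, b)\) of the grid:
\[ P(a, b \mid \text{data}) = \frac{P(\text{data} \mid a, b)\, P(a, b)}{P(\text{data})}, \qquad P(\text{data} \mid a, b) = \prod_{i} \binom{n_i}{k_i}\, p_i^{k_i}\, (1 - p_i)^{\,n_i - k_i} \]
with \(p_i = a \cdot x_i + b\); the \(evidence\) \(P(\text{data})\) is obtained by summing the numerator over the whole grid, which is what the normalization step in the code below does.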
See Think Bayes for more details about this approach.
Code
from itertools import product

import pandas as pd
import scipy.stats


def compute_likelihood(a: float, b: float, k: np.ndarray, n: np.ndarray, x: np.ndarray) -> float:
    """
    Compute the likelihood of the observed games for a given set of parameters.

    a: slope
    b: intercept
    k: number of made shots per game
    n: number of attempts per game
    x: game number
    """
    p = a * x + b  # accuracy assumed for each game
    likelihood_per_game = scipy.stats.binom.pmf(k, n, p)
    return likelihood_per_game.prod()


# Priors: uniform over a discrete grid of (slope, intercept) values
range_slope = np.linspace(-0.01, 0.01, 101)  # slope parameter space
range_intercept = np.linspace(0.27, 0.57, 51)  # intercept parameter space
cols = ["slope", "intercept"]
prior = pd.DataFrame(list(product(range_slope, range_intercept)), columns=cols)
prior["p"] = 1 / len(prior)
prior = prior.set_index(cols).p

# Likelihood of the observed games for each point of the grid
likelihood = prior.copy()
for slope, intercept in prior.index:
    likelihood[slope, intercept] = compute_likelihood(
        a=slope,
        b=intercept,
        k=df.nb_win,
        n=df.nb_try,
        x=df.game_nb,
    )

# Bayes' formula
posterior = prior * likelihood
posterior /= posterior.sum()  # Valid if the grid covers the plausible parameter space closely enough

# Posterior marginals
pdf_posterior_slope = posterior.rename("p").reset_index().groupby("slope").p.sum().sort_index()
pdf_posterior_intercept = posterior.rename("p").reset_index().groupby("intercept").p.sum().sort_index()
cdf_posterior_slope = pdf_posterior_slope.cumsum()
cdf_posterior_intercept = pdf_posterior_intercept.cumsum()
Remark: this method can suffer from numerical instability when the observed probabilities become very small. In that case, one can use statsmodels’ GLM implementation or PyMC’s MCMC implementation instead.
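For reference, here is a minimal sketch of how the same model could be expressed with PyMC (not used in this analysis; the uniform priors simply mirror the grid bounds used above):
Code
import pymc as pm

with pm.Model():
    # Uniform priors mirroring the grid used above
    a = pm.Uniform("slope", lower=-0.01, upper=0.01)
    b = pm.Uniform("intercept", lower=0.27, upper=0.57)
    # Accuracy evolving linearly with the game number
    p = a * df["game_nb"].values + b
    # Binomial likelihood of the made shots given the attempts
    pm.Binomial("nb_win", n=df["nb_try"].values, p=p, observed=df["nb_win"].values)
    # Sample the posterior with MCMC
    trace = pm.sample(draws=2000, tune=1000)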
Observations
We now have the posterior distribution of the \(slope\) and \(intercept\) parameters, which we can use to compute metrics such as credible intervals or the maximum likelihood estimate. Since the \(slope\) directly represents the evolution of accuracy, we will use it to quantify this evolution. Let’s have a look:
Code
slope_pdf_chart = (
    alt.Chart(pdf_posterior_slope.reset_index())
    .mark_area(
        opacity=0.3,
        line={"color": Colors.BROWN},
        color=Colors.TERRACOTTA,
    )
    .encode(
        x="slope",
        y=alt.Y("p").axis(format="%").title("%"),
        tooltip=["p"],
    )
).properties(title="Estimated distribution of the slope parameter")

slope_cdf_chart = (
    alt.Chart(cdf_posterior_slope.reset_index())
    .mark_area(
        opacity=0.3,
        line={"color": Colors.BROWN},
        color=Colors.TERRACOTTA,
    )
    .encode(
        x="slope",
        y=alt.Y("p").axis(format="%").title("%"),
        tooltip=["p", "slope"],
    )
).properties(title="Estimated cumulative distribution of the slope parameter")

# Horizontal rules marking the 5% and 95% quantiles (90% credible interval)
rule_low = (
    alt.Chart(pd.DataFrame([{"threshold": 0.05}])).mark_rule().encode(y="threshold:Q")
)
rule_high = (
    alt.Chart(pd.DataFrame([{"threshold": 0.95}])).mark_rule().encode(y="threshold:Q")
)

intercept_pdf_chart = (
    alt.Chart(pdf_posterior_intercept.reset_index())
    .mark_area(
        opacity=0.3,
        line={"color": Colors.TERRACOTTA},
        color=Colors.YELLOW,
    )
    .encode(
        x="intercept",
        y=alt.Y("p").axis(format="%").title("%"),
        tooltip=["p", "intercept"],
    )
).properties(title="Estimated distribution of the intercept parameter")

intercept_cdf_chart = (
    alt.Chart(cdf_posterior_intercept.reset_index())
    .mark_area(
        opacity=0.3,
        line={"color": Colors.TERRACOTTA},
        color=Colors.YELLOW,
    )
    .encode(
        x=alt.X("intercept").scale(
            domain=(
                cdf_posterior_intercept.index.min(),
                cdf_posterior_intercept.index.max(),
            )
        ),
        y=alt.Y("p").axis(format="%").title("%").scale(domain=(0, 1)),
        tooltip=["p", "intercept"],
    )
).properties(title="Estimated cumulative distribution of the intercept parameter")

(
    (slope_pdf_chart | (slope_cdf_chart + rule_low + rule_high))
    & (intercept_pdf_chart | (intercept_cdf_chart + rule_low + rule_high))
).show()
Code
# Maximum likelihood estimate of the slope (also the posterior mode, given the uniform prior)
post_max_likelihood = pdf_posterior_slope.idxmax()

# Probability that the slope is positive, i.e. that the accuracy improved
prob_improvement_accuracy = pdf_posterior_slope[pdf_posterior_slope.index > 0].sum()

# 90% credible interval for the slope
credible_interval_accuracy_evol_per_game_90_pct = (
    cdf_posterior_slope[cdf_posterior_slope > 0.05].index.min(),
    cdf_posterior_slope[cdf_posterior_slope < 0.95].index.max(),
)
ci_90_low, ci_90_high = credible_interval_accuracy_evol_per_game_90_pct

observations_msg = f"""
- Probability that the slope is greater than 0: {prob_improvement_accuracy:.1%}
- 90% credible interval for the slope: {ci_90_low:.2%} to {ci_90_high:.2%}
- Maximum likelihood estimate of the slope: {post_max_likelihood:.2%}
"""
print(observations_msg)
- Probability that the slope is greater than 0: 69.6%
- 90% credible interval for the slope: -0.26% to 0.52%
- Maximum likelihood estimate of the slope: 0.14%
Conclusion:
We now have new, more precise ways to describe the evolution of Wemby’s accuracy:
While it is likely that his accuracy slightly improved (maximum likelihood estimate: +0.14% per game), the evidence is not very strong (the probability of a positive improvement per game is about 70%). We should wait for more games to see whether this trend is confirmed by our model. Although the conclusion may seem disappointing in our case, being able to quantify uncertainty in this way can be quite valuable to aid decision-making in some business contexts.
Discussion and further improvements
Model limitation - Although our model is an improvement over the initial OLS regression at capturing an underlying trend, it still has limitations. For instance, it overlooks factors such as the defensive strength of the opponent or psychological variables that may make over- or under-performance across a sequence of games more likely.
Testing - To test the model, we could define a test set (e.g., the last x games or the upcoming games) and compute a p-value comparing the observed accuracy with the predicted accuracy. Additionally, we could draw from the posterior distribution to simulate the evolution of accuracy using a Monte Carlo approach, as sketched below.
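As an illustration, here is a minimal sketch of such a Monte Carlo simulation, drawing parameter values from the grid posterior computed above (the number of attempts for the simulated game is an assumption, set to the season average):
Code
rng = np.random.default_rng(0)
n_sims = 10_000

# Draw (slope, intercept) pairs proportionally to their posterior probability
idx = rng.choice(len(posterior), size=n_sims, p=posterior.values)
params = posterior.index.to_frame(index=False).iloc[idx]

# Simulate the accuracy of the next game under each drawn parameter set
next_game_nb = df["game_nb"].max() + 1
nb_try_next = int(df["nb_try"].mean())  # assumption: average number of attempts
p_next = params["slope"].values * next_game_nb + params["intercept"].values
simulated_makes = rng.binomial(n=nb_try_next, p=p_next)

print(f"Simulated accuracy for game #{next_game_nb}: {simulated_makes.mean() / nb_try_next:.1%} on average")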