Availability Is the Best Ability: Methods and Data Sources

Dataset

858 club-seasons across six European top-flight leagues, covering the ten completed seasons from 2015/16 to 2024/25.

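| League | Code | Clubs/season | Seasons | Club-seasons |
| --- | --- | --- | --- | --- |
| Premier League | GB1 | 20 | 10 | 200 |
| La Liga | ES1 | 20 | 10 | 200 |
| Bundesliga | L1 | 18 | 10 | 180 |
| Serie A | IT1 | 20 | 10 | 200 |
| Ligue 1 | FR1 | 18–20 | 10 | ~190 |
| Liga Portugal | PO1 | 18 | 10 | ~180 |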

The 2025/26 season is excluded because it is in progress — partial-season points totals would distort the regression. Club-seasons missing squad market value data are excluded from the regression (858 of 1,156 club-seasons have complete data).

Player availability data

Player availability was scraped from Transfermarkt’s Ausfallzeiten (Periods of Absence) pages. These pages show a matchday-by-matchday grid of each player’s status for a given club, competition, and season.

Each cell in the grid encodes one of the following statuses via CSS classes:

| Status | TM class suffix | In injury burden? |
| --- | --- | --- |
| Starting XI | _s | No (denominator only) |
| Substituted in | _e | No (denominator only) |
| On the bench | _k | No (denominator only) |
| Injured / ill | _v | Yes (numerator) |
| Suspended | Reclassified from _a via detail text | Yes (numerator) |
| National team | Reclassified from _a via detail text | No (excluded from numerator; counted in denominator) |
| Not in matchday squad | _r | No (denominator only) |
| Not at club (transferred/loaned) | _bg_rot_20 | No (excluded entirely) |
| Not included | _ (empty suffix) | No (excluded entirely) |

Injury detail text (e.g. “Hamstring injury — Return expected on 05/01/2026”) is extracted from inner <span> elements within the hidden responsive-duplicate cells. The generic absent status is reclassified into suspended, national_team, or other_absence using this detail text.
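The suffix lookup and the reclassification step can be sketched as follows. The suffix-to-status mapping follows the status table above; the function name and the keyword lists used to recognise suspensions and national-team call-ups are assumptions for illustration, since the exact wording of Transfermarkt's detail text is not documented here.

```python
# Class suffix -> status, following the status table above.
SUFFIX_TO_STATUS = {
    "_s": "starting",
    "_e": "substituted_in",
    "_k": "bench",
    "_v": "injured",
    "_a": "absent",            # generic absence, reclassified below
    "_r": "not_in_squad",
    "_bg_rot_20": "not_at_club",
    "_": "not_included",
}

# Hypothetical keyword hints; the real detail-text wording may differ.
SUSPENSION_HINTS = ("suspension", "suspended", "red card", "yellow card")
NATIONAL_TEAM_HINTS = ("national team", "international duty")


def classify_cell(suffix, detail=""):
    """Return the status for one grid cell, refining generic absences."""
    status = SUFFIX_TO_STATUS.get(suffix, "unknown")
    if status != "absent":
        return status
    text = detail.lower()
    if any(hint in text for hint in SUSPENSION_HINTS):
        return "suspended"
    if any(hint in text for hint in NATIONAL_TEAM_HINTS):
        return "national_team"
    return "other_absence"
```

For example, `classify_cell("_a", "Red card suspension")` yields `suspended`, while an `_a` cell with no recognisable detail text falls back to `other_absence`.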

Injury burden definition

For each club in each season, injury burden is computed from league matches only:

injury_burden = (injured_matchdays + suspended_matchdays) / total_in_squad_matchdays

Where:

  • Numerator: matchdays with status injured or suspended
  • Denominator: matchdays where the player was part of the squad (starting, sub, bench, injured, suspended, national team, not in squad)
  • Excluded from both: not_at_club (transfers) and not_included (not registered)
  • Excluded from numerator: national_team callups (not a club-level health issue)
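As a minimal sketch of this definition, the function below counts statuses for one club-season and applies the formula. The status labels and the function name are illustrative. The treatment of other_absence is not specified above, so this sketch leaves it out of both numerator and denominator, which is an assumption.

```python
from collections import Counter

# Statuses counted as unavailable (numerator).
NUMERATOR = {"injured", "suspended"}

# Statuses counted as "part of the squad" (denominator).
DENOMINATOR = NUMERATOR | {
    "starting", "substituted_in", "bench", "national_team", "not_in_squad",
}
# not_at_club and not_included appear in neither set.
# other_absence is also left out here (assumption, see lead-in).


def injury_burden(statuses):
    """Share of in-squad player-matchdays lost to injury or suspension."""
    counts = Counter(statuses)
    unavailable = sum(counts[s] for s in NUMERATOR)
    in_squad = sum(counts[s] for s in DENOMINATOR)
    return unavailable / in_squad if in_squad else 0.0
```

With six starts, two injured matchdays, one suspension, one national-team call-up and three not_at_club entries, the numerator is 3 and the denominator is 10, giving a burden of 0.3.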

The burden is unweighted — a fringe player’s absence counts the same as a star’s. This is a deliberate choice: weighting by market value or starts would introduce circularity (availability affects starts) and conflate the quality control (squad value) with the treatment variable.

Injury type classification

For the split regression, 149,870 injury entries with detail text were categorised into three buckets based on keyword matching on the injury description:

| Category | Keywords | Share | Preventable? |
| --- | --- | --- | --- |
| Preventable (muscle/soft tissue) | hamstring, calf, groin, adductor, thigh, muscle, strain, quadricep, fitness, fatigue | 34.3% | Largely yes |
| Bad luck (trauma/illness) | broken, fracture, bruise, concussion, knock, dead leg, ill, corona, virus, flu | 11.2% | No |
| Mixed (ligament/joint/other) | cruciate, ligament, achilles, knee, ankle, shoulder, surgery, and all others | 54.5% | Mixed |
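A keyword classifier along these lines could look like the sketch below. The keyword lists come from the table above; the matching rule (case-insensitive, anchored at the start of a word) and the precedence order (preventable, then bad luck, then mixed) are assumptions. Word-start anchoring is used because plain substring matching would let "ill" match inside "achilles".

```python
import re

PREVENTABLE = ("hamstring", "calf", "groin", "adductor", "thigh", "muscle",
               "strain", "quadricep", "fitness", "fatigue")
BAD_LUCK = ("broken", "fracture", "bruise", "concussion", "knock",
            "dead leg", "ill", "corona", "virus", "flu")


def _word_start_pattern(keywords):
    # \b anchors each keyword at a word start, so "ill" matches "illness"
    # but not the "ill" inside "achilles".
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + ")", re.IGNORECASE)


_PREVENTABLE_RE = _word_start_pattern(PREVENTABLE)
_BAD_LUCK_RE = _word_start_pattern(BAD_LUCK)


def classify_injury(description):
    """Assign an injury description to one of the three buckets."""
    if _PREVENTABLE_RE.search(description):
        return "preventable"
    if _BAD_LUCK_RE.search(description):
        return "bad_luck"
    return "mixed"  # ligament/joint keywords and everything else
```

Under these rules "Hamstring injury" is preventable, "Illness" is bad luck, and "Achilles tendon rupture" or "Knee injury" fall into the mixed bucket.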

Model specification

The main model is an OLS regression with league and season fixed effects:

points = β₀ + β₁·log(squad_value) + β₂·injury_burden
       + β₃·in_europe + β₄·promoted + β₅·covid
       + β₆·avg_age_squad + β₇·manager_change
       + league_FE + season_FE + ε

Standard errors are clustered by club to account for within-club serial correlation (the same club observed across multiple seasons shares unobserved characteristics).

Control variables

| Variable | Source | Definition |
| --- | --- | --- |
| log_squad_value | Transfermarkt squad value page | Natural log of average squad market value from two snapshots: September 1 (post-summer window) and February 1 (post-January window) |
| promoted | Derived from standings data | Binary: 1 if the club was not in the top division the previous season |
| covid | Calendar | Binary: 1 for the 2019/20 and 2020/21 seasons (compressed schedules, no crowds) |
| avg_age_squad | Transfermarkt squad age page | Average age of the squad (fractional years) |
| manager_change | Transfermarkt manager changes page | Binary: 1 if the club changed manager mid-season (after August 15). Summer appointments excluded. |
| in_europe | Derived from absence data | Binary: 1 if the club participated in any European competition (Champions League, Europa League, Conference League) that season |
| league_FE | | Dummy variables for each league (5 dummies, ES1 as reference). Absorbs structural differences between leagues. |
| season_FE | | Dummy variables for each season (9 dummies, 2015 as reference). Absorbs temporal trends. |
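Several of these controls are simple derivations, sketched below. All function names and input shapes are hypothetical. Seasons are labelled by starting year (2015 for 2015/16), matching the reference category above, and "average squad market value from two snapshots" is read here as the mean of the September and February values, which is an interpretation rather than something the definition states outright.

```python
import math

COVID_SEASONS = {2019, 2020}  # 2019/20 and 2020/21, labelled by starting year


def log_squad_value(sept_value, feb_value):
    """Natural log of the mean of the September and February snapshots."""
    return math.log((sept_value + feb_value) / 2)


def covid_flag(season):
    """1 for the two seasons affected by compressed schedules and no crowds."""
    return int(season in COVID_SEASONS)


def promoted_flag(club, season, top_flight):
    """1 if the club was absent from the previous season's top division.

    top_flight maps a season's starting year to the set of clubs in it.
    """
    return int(club not in top_flight[season - 1])


def dummies(value, levels, reference):
    """One 0/1 column per level, omitting the reference category."""
    return {level: int(value == level) for level in levels if level != reference}
```

Calling `dummies("GB1", ["GB1", "ES1", "L1", "IT1", "FR1", "PO1"], "ES1")` produces the five league columns, and `dummies(2017, range(2015, 2025), 2015)` produces the nine season columns.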

Full coefficient table

| Variable | Coefficient | SE (clustered) | p-value | 95% CI |
| --- | --- | --- | --- | --- |
| injury_burden | −44.61 | 8.83 | < 0.001 | [−61.9, −27.3] |
| log_squad_value | 15.75 | 0.69 | < 0.001 | [14.4, 17.1] |
| in_europe | 3.23 | 1.01 | 0.001 | [1.3, 5.2] |
| manager_change | −9.04 | 0.66 | < 0.001 | [−10.3, −7.7] |
| covid | −7.28 | 0.85 | < 0.001 | [−8.9, −5.6] |
| promoted | 3.48 | 1.27 | 0.006 | [1.0, 6.0] |
| avg_age_squad | 0.67 | 0.34 | 0.051 | [0.0, 1.3] |
| League fixed effects (reference: La Liga) | | | | |
| Ligue 1 | 4.19 | 1.44 | 0.004 | [1.4, 7.0] |
| Premier League | −11.68 | 1.55 | < 0.001 | [−14.7, −8.6] |
| Serie A | 1.76 | 1.43 | 0.219 | [−1.0, 4.6] |
| Bundesliga | −2.54 | 1.27 | 0.047 | [−5.0, −0.03] |
| Liga Portugal | 21.82 | 2.13 | < 0.001 | [17.7, 26.0] |
| Season fixed effects (reference: 2015/16) | | | | |
| 2016/17 | −0.06 | 1.23 | 0.953 | [−2.5, 2.3] |
| 2017/18 | −2.58 | 1.24 | 0.037 | [−5.0, −0.2] |
| 2018/19 | −7.60 | 1.42 | < 0.001 | [−10.4, −4.8] |
| 2019/20 | −7.09 | 0.85 | < 0.001 | [−8.8, −5.4] |
| 2020/21 | −0.84 | 0.73 | 0.253 | [−2.3, 0.6] |
| 2021/22 | −10.85 | 1.34 | < 0.001 | [−13.5, −8.2] |
| 2022/23 | −10.94 | 1.47 | < 0.001 | [−13.8, −8.1] |
| 2023/24 | −10.70 | 1.53 | < 0.001 | [−13.7, −7.7] |
| 2024/25 | −11.19 | 1.27 | < 0.001 | [−13.7, −8.7] |

R² = 0.745 · Adjusted R² = 0.738 · N = 858 · Clustered by club (110 clusters, robust to within-club serial correlation).
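To read the table, two small calculations help: the 95% interval implied by a coefficient and its standard error, and the points change implied by a 10 percentage point rise in injury burden. The sketch below assumes a normal approximation (z = 1.96), which reproduces the reported intervals to the precision shown; the helper names are illustrative.

```python
def ci95(coef, se, z=1.96):
    """Normal-approximation 95% confidence interval."""
    return (coef - z * se, coef + z * se)


def points_per_10pp(coef):
    """Points change for a 10 percentage point rise in injury burden.

    injury_burden is a share between 0 and 1, so 10pp equals 0.10.
    """
    return coef * 0.10


low, high = ci95(-44.61, 8.83)
print(round(low, 1), round(high, 1))        # -61.9 -27.3
print(round(points_per_10pp(-44.61), 2))    # -4.46
```

On the main estimate, a club whose injury burden is 10 percentage points higher is associated with roughly 4.5 fewer points over the season, all else equal.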

Club fixed effects robustness check

A separate model with club fixed effects (replacing league FE with 110 club dummies) produces β(injury_burden) = −45.97 (SE = 10.06, p < 0.001), R² = 0.805. The coefficient is nearly identical to the main model, confirming the effect is driven by within-club variation rather than between-club differences.

Split regression: preventable vs bad luck injuries

Replacing the single injury_burden with three category-specific burdens:

| Variable | Coefficient | SE | p-value | Interpretation |
| --- | --- | --- | --- | --- |
| Muscle injuries | −73.87 | 17.48 | < 0.001 | 10pp → 7.4 fewer points |
| Bad luck injuries | −15.56 | 24.99 | 0.534 | Not significant |
| Mixed injuries | −31.59 | 10.96 | 0.004 | 10pp → 3.2 fewer points |

All other controls remain the same. R² = 0.741.

Robustness: within-season timing test

To investigate causal direction, we tested whether early-season injuries (matchdays 1–10) predict late-season points (matchdays 20–38), controlling for squad value and early-season form (points at matchday 10).

| Test | Coefficient | SE | p-value | Result |
| --- | --- | --- | --- | --- |
| Early injuries → Late points | −7.03 | 5.43 | 0.196 | Not significant (directionally correct) |
| Late injuries → Early points (placebo) | −5.26 | 3.09 | 0.089 | Not significant at the 5% level (no clear sign of reverse causation) |

The timing test does not provide strong evidence of causation in either direction. The relationship observed in the season-level model is best described as a robust association, plausibly causal but not definitively so.
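The windowing behind this test can be sketched as below: per-matchday club records are aggregated into an early window (matchdays 1–10) and a late window (matchdays 20–38). The record keys and the function name are hypothetical, and the sketch covers only the feature construction, not the regression itself.

```python
EARLY = range(1, 11)   # matchdays 1-10
LATE = range(20, 39)   # matchdays 20-38 (18-club leagues stop at 34)


def window_features(matchdays):
    """Aggregate per-matchday club records into early and late windows.

    Each record is a dict with hypothetical keys: "matchday", "points",
    "unavailable" (injured + suspended players) and "in_squad".
    """
    def burden(window):
        rows = [m for m in matchdays if m["matchday"] in window]
        in_squad = sum(m["in_squad"] for m in rows)
        return sum(m["unavailable"] for m in rows) / in_squad if in_squad else 0.0

    def points(window):
        return sum(m["points"] for m in matchdays if m["matchday"] in window)

    return {
        "early_burden": burden(EARLY),
        "late_burden": burden(LATE),
        "early_points": points(EARLY),  # doubles as form at matchday 10
        "late_points": points(LATE),
    }
```

The main test regresses late_points on early_burden, and the placebo regresses early_points on late_burden. Matchdays 11–19 belong to neither window.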

League-specific results

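| League | β(injury burden) | SE | p-value | Significant? |
| --- | --- | --- | --- | --- |
| Premier League | −51.00 | 12.5 | < 0.001 | Yes |
| Bundesliga | −63.64 | 15.2 | < 0.001 | Yes |
| Serie A | −61.22 | 25.1 | 0.016 | Yes |
| Ligue 1 | −50.10 | 32.1 | 0.122 | Borderline |
| La Liga | 2.62 | 19.3 | 0.892 | No |
| Liga Portugal | 10.40 | 21.4 | 0.628 | No |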

Each league-specific model includes the full set of controls plus season fixed effects. Standard errors are clustered by club within each league.

Limitations

  • Endogeneity: We cannot definitively establish that injuries cause poor performance rather than the reverse (or that both are caused by an unmeasured third factor like poor management). The within-season timing test was inconclusive.
  • Transfermarkt data quality: TM’s absence data is crowd-sourced and may not capture every injury, particularly at smaller clubs. Reporting standards may differ across leagues, which could contribute to the La Liga/Portugal anomaly.
  • Injury classification: The preventable/bad-luck split relies on keyword matching of injury descriptions. Some injuries are ambiguous (e.g. a “knee injury” could be contact or overuse). The mixed category absorbs this ambiguity.
  • Unweighted burden: A fringe player’s absence counts the same as a star’s. This is defensible (avoids circularity) but may understate the effect for clubs that lose key players.
  • Squad value as proxy: Transfermarkt market values are crowd-estimated and may not perfectly reflect true squad quality. They are the best publicly available proxy.
  • Manager change timing: We filter to changes after August 15, but some early-season appointments (September) may be planned rather than crisis-driven.

Data sources

  • Player availability: Transfermarkt, ausfallzeiten pages for 6 leagues and 11 seasons (the ten completed seasons plus the in-progress 2025/26, which is excluded from the analysis; 1,270 club-season files, 2.1M matchday entries). Scraped via scripts/fetch-transfermarkt-absences.py.
  • Squad market values: Transfermarkt — marktwerteverein pages at two snapshots per season (September + February), divisions 1 and 2. Scraped via scripts/fetch-transfermarkt-squad-values.py.
  • League standings: salimt/football-datasets on GitHub — team_competitions_seasons.csv with final standings, points, GD, and manager data.
  • Elo ratings: Club Elo — historical Elo ratings for all clubs. Used to compute Gini coefficient per league-season.
  • Squad ages: Transfermarkt — altersschnitt pages. Scraped via scripts/analysis/fetch-control-variables.py.
  • Manager changes: Transfermarkt — trainerwechsel pages (filtered to mid-season changes only). Scraped via scripts/analysis/fetch-control-variables.py.