Availability Is the Best Ability: Methods and Data Sources
Dataset
858 club-seasons across six European top-flight leagues, covering the ten completed seasons from 2015/16 to 2024/25.
| League | Code | Clubs/season | Seasons | Club-seasons |
|---|---|---|---|---|
| Premier League | GB1 | 20 | 10 | 200 |
| La Liga | ES1 | 20 | 10 | 200 |
| Bundesliga | L1 | 18 | 10 | 180 |
| Serie A | IT1 | 20 | 10 | 200 |
| Ligue 1 | FR1 | 18–20 | 10 | ~190 |
| Liga Portugal | PO1 | 18 | 10 | ~180 |
The 2025/26 season is excluded because it is in progress — partial-season points totals would distort the regression. Club-seasons missing squad market value data are excluded from the regression (858 of 1,156 club-seasons have complete data).
Player availability data
Player availability was scraped from Transfermarkt’s Ausfallzeiten (Periods of Absence) pages. These pages show a matchday-by-matchday grid of each player’s status for a given club, competition, and season.
Each cell in the grid encodes one of the following statuses via CSS classes:
| Status | TM class suffix | In injury burden? |
|---|---|---|
| Starting XI | _s | No (denominator only) |
| Substituted in | _e | No (denominator only) |
| On the bench | _k | No (denominator only) |
| Injured / ill | _v | Yes (numerator) |
| Suspended | Reclassified from _a via detail text | Yes (numerator) |
| National team | Reclassified from _a via detail text | No (excluded) |
| Not in matchday squad | _r | No (denominator only) |
| Not at club (transferred/loaned) | _bg_rot_20 | No (excluded entirely) |
| Not included | _ (empty suffix) | No (excluded entirely) |
Injury detail text (e.g. “Hamstring injury — Return expected on 05/01/2026”)
is extracted from inner <span> elements within the hidden responsive-duplicate
cells. The generic absent status is reclassified into suspended,
national_team, or other_absence using this detail text.
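The status mapping and reclassification step can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the suffixes come from the table above, while the `cell_` class prefix in the usage note, the keyword lists, and the function name `classify_cell` are assumptions made for the example.

```python
# Suffix -> status mapping taken from the table above.
SUFFIX_TO_STATUS = {
    "_s": "starting",
    "_e": "substituted_in",
    "_k": "bench",
    "_v": "injured",
    "_a": "absent",            # generic absence, reclassified below
    "_r": "not_in_squad",
    "_bg_rot_20": "not_at_club",
    "_": "not_included",       # empty suffix
}

# Assumed keyword hints; the real detail-text wording may differ.
SUSPENSION_HINTS = ("suspen", "red card", "yellow card")
NATIONAL_TEAM_HINTS = ("national team", "international duty")


def classify_cell(css_class: str, detail: str = "") -> str:
    """Map one grid cell to a status, reclassifying generic absences."""
    # Try longer suffixes first so the bare "_" only matches as a last resort.
    for suffix in sorted(SUFFIX_TO_STATUS, key=len, reverse=True):
        if css_class.endswith(suffix):
            status = SUFFIX_TO_STATUS[suffix]
            break
    else:
        return "unknown"

    if status != "absent":
        return status

    text = detail.lower()
    if any(hint in text for hint in SUSPENSION_HINTS):
        return "suspended"
    if any(hint in text for hint in NATIONAL_TEAM_HINTS):
        return "national_team"
    return "other_absence"
```

For example, `classify_cell("cell_a", "Yellow card suspension")` would return `"suspended"` under these assumed keywords, while a generic absence without detail text falls through to `"other_absence"`.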
Injury burden definition
For each club in each season, injury burden is computed from league matches only:
injury_burden = (injured_matchdays + suspended_matchdays) / total_in_squad_matchdays
Where:
- Numerator: matchdays with status injured or suspended
- Denominator: matchdays where the player was part of the squad (starting, sub, bench, injured, suspended, national team, not in squad)
- Excluded from both: not_at_club (transfers) and not_included (not registered)
- Excluded from numerator: national_team callups (not a club-level health issue)
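As a rough sketch of the definition above, the burden for one club-season could be computed from a flat list of player-matchday statuses. The status labels and the assumption that entries are pooled across all players (rather than averaged per player) are illustrative choices, not confirmed details of the original pipeline.

```python
from collections import Counter

NUMERATOR = {"injured", "suspended"}
EXCLUDED = {"not_at_club", "not_included"}  # dropped from both sides


def injury_burden(statuses):
    """Share of in-squad player-matchdays lost to injury or suspension."""
    counts = Counter(statuses)
    # Everything except the excluded statuses counts toward the denominator,
    # including national_team and not_in_squad entries.
    denominator = sum(n for status, n in counts.items() if status not in EXCLUDED)
    if denominator == 0:
        return float("nan")
    numerator = sum(counts[status] for status in NUMERATOR)
    return numerator / denominator


# Toy example: 3 unavailable entries out of 10 in-squad entries.
example = (
    ["starting"] * 6
    + ["injured"] * 2
    + ["suspended"]
    + ["national_team"]
    + ["not_at_club"] * 3
)
print(injury_burden(example))
```

In the toy example the three `not_at_club` entries drop out entirely, so the burden is 3 / 10.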
The burden is unweighted — a fringe player’s absence counts the same as a star’s. This is a deliberate choice: weighting by market value or starts would introduce circularity (availability affects starts) and conflate the quality control (squad value) with the treatment variable.
Injury type classification
For the split regression, 149,870 injury entries with detail text were categorised into three buckets based on keyword matching on the injury description:
| Category | Keywords | Share | Preventable? |
|---|---|---|---|
| Preventable (muscle/soft tissue) | hamstring, calf, groin, adductor, thigh, muscle, strain, quadricep, fitness, fatigue | 34.3% | Largely yes |
| Bad luck (trauma/illness) | broken, fracture, bruise, concussion, knock, dead leg, ill, corona, virus, flu | 11.2% | No |
| Mixed (ligament/joint/other) | cruciate, ligament, achilles, knee, ankle, shoulder, surgery, and all others | 54.5% | Mixed |
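A keyword classifier along these lines might look like the sketch below. The keyword lists are copied from the table; the matching rule (keywords anchored at the start of a word, preventable checked before bad luck) is an assumption, chosen so that the short keyword "ill" does not fire on "achilles".

```python
import re

PREVENTABLE = ("hamstring", "calf", "groin", "adductor", "thigh", "muscle",
               "strain", "quadricep", "fitness", "fatigue")
BAD_LUCK = ("broken", "fracture", "bruise", "concussion", "knock", "dead leg",
            "ill", "corona", "virus", "flu")


def _matcher(keywords):
    # \b anchors each keyword at a word start: "ill" matches "illness"
    # but not the "ill" inside "achilles".
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")"
    return re.compile(pattern, re.IGNORECASE)


_PREVENTABLE_RE = _matcher(PREVENTABLE)
_BAD_LUCK_RE = _matcher(BAD_LUCK)


def classify_injury(description: str) -> str:
    """Assign an injury description to one of the three buckets."""
    if _PREVENTABLE_RE.search(description):
        return "preventable"
    if _BAD_LUCK_RE.search(description):
        return "bad_luck"
    return "mixed"  # ligament/joint keywords and everything else
```

Because the mixed bucket is the fallback, its keywords (cruciate, ligament, and so on) do not need to be listed explicitly in this sketch.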
Model specification
The main model is an OLS regression with league and season fixed effects:
points = β₀ + β₁·log(squad_value) + β₂·injury_burden
+ β₃·in_europe + β₄·promoted + β₅·covid
+ β₆·avg_age_squad + β₇·manager_change
+ league_FE + season_FE + ε
Standard errors are clustered by club to account for within-club serial correlation (the same club observed across multiple seasons shares unobserved characteristics).
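One way to fit a specification like this is with the statsmodels formula interface. The sketch below is an assumption about tooling, not the authors' code: the column names, the `"2015/16"` season labels, and the helper `fit_main_model` are hypothetical, and the input is assumed to be a pandas DataFrame with one row per club-season.

```python
# Patsy formula mirroring the specification above; C(...) expands a
# categorical column into dummies with the stated reference level.
FORMULA = (
    "points ~ log_squad_value + injury_burden + in_europe + promoted + covid"
    " + avg_age_squad + manager_change"
    " + C(league, Treatment(reference='ES1'))"
    " + C(season, Treatment(reference='2015/16'))"
)

MODEL_COLUMNS = [
    "points", "log_squad_value", "injury_burden", "in_europe", "promoted",
    "covid", "avg_age_squad", "manager_change", "league", "season", "club",
]


def fit_main_model(df):
    """Fit OLS with league/season dummies and club-clustered standard errors."""
    # Imported lazily so the formula can be inspected without statsmodels.
    import statsmodels.formula.api as smf

    # Drop incomplete rows first so the cluster labels stay aligned with
    # the rows that actually enter the regression.
    complete = df.dropna(subset=MODEL_COLUMNS)
    club_codes = complete["club"].astype("category").cat.codes.to_numpy()
    return smf.ols(FORMULA, data=complete).fit(
        cov_type="cluster", cov_kwds={"groups": club_codes}
    )
```

Calling `fit_main_model(df).summary()` would then print coefficients with clustered standard errors, assuming the DataFrame carries the columns listed in `MODEL_COLUMNS`.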
Control variables
| Variable | Source | Definition |
|---|---|---|
| log_squad_value | Transfermarkt squad value page | Natural log of average squad market value from two snapshots: September 1 (post-summer window) and February 1 (post-January window) |
| promoted | Derived from standings data | Binary: 1 if the club was not in the top division the previous season |
| covid | Calendar | Binary: 1 for the 2019/20 and 2020/21 seasons (compressed schedules, no crowds) |
| avg_age_squad | Transfermarkt squad age page | Average age of the squad (fractional years) |
| manager_change | Transfermarkt manager changes page | Binary: 1 if the club changed manager mid-season (after August 15). Summer appointments excluded. |
| in_europe | Derived from absence data | Binary: 1 if the club participated in any European competition (Champions League, Europa League, Conference League) that season |
| league_FE | — | Dummy variables for each league (5 dummies, ES1 as reference). Absorbs structural differences between leagues. |
| season_FE | — | Dummy variables for each season (9 dummies, 2015 as reference). Absorbs temporal trends. |
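Three of the derived controls can be illustrated with a short sketch. The function signatures and input shapes are hypothetical; in particular, the sketch reads "log of average squad market value" as the log of the mean of the two snapshots, and assumes manager-change dates have already been restricted to the season in question.

```python
import math
from datetime import date


def log_squad_value(sept_value_eur: float, feb_value_eur: float) -> float:
    """Natural log of the mean of the September and February snapshots."""
    return math.log((sept_value_eur + feb_value_eur) / 2)


def promoted(club: str, previous_top_flight: set) -> int:
    """1 if the club was not in the top division the previous season."""
    return int(club not in previous_top_flight)


def manager_change(change_dates, season_start_year: int) -> int:
    """1 if any change happened after August 15 of the season's first year."""
    cutoff = date(season_start_year, 8, 15)
    return int(any(d > cutoff for d in change_dates))


# A July appointment is treated as a summer change and ignored.
print(manager_change([date(2023, 7, 1)], 2023))   # 0
print(manager_change([date(2023, 10, 2)], 2023))  # 1
```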
Full coefficient table
| Variable | Coefficient | SE (clustered) | p-value | 95% CI |
|---|---|---|---|---|
| injury_burden | −44.61 | 8.83 | < 0.001 | [−61.9, −27.3] |
| log_squad_value | 15.75 | 0.69 | < 0.001 | [14.4, 17.1] |
| in_europe | 3.23 | 1.01 | 0.001 | [1.3, 5.2] |
| manager_change | −9.04 | 0.66 | < 0.001 | [−10.3, −7.7] |
| covid | −7.28 | 0.85 | < 0.001 | [−8.9, −5.6] |
| promoted | 3.48 | 1.27 | 0.006 | [1.0, 6.0] |
| avg_age_squad | 0.67 | 0.34 | 0.051 | [0.0, 1.3] |
| League fixed effects (reference: La Liga) | | | | |
| Ligue 1 | 4.19 | 1.44 | 0.004 | [1.4, 7.0] |
| Premier League | −11.68 | 1.55 | < 0.001 | [−14.7, −8.6] |
| Serie A | 1.76 | 1.43 | 0.219 | [−1.0, 4.6] |
| Bundesliga | −2.54 | 1.27 | 0.047 | [−5.0, −0.03] |
| Liga Portugal | 21.82 | 2.13 | < 0.001 | [17.7, 26.0] |
| Season fixed effects (reference: 2015/16) | | | | |
| 2016/17 | −0.06 | 1.23 | 0.953 | [−2.5, 2.3] |
| 2017/18 | −2.58 | 1.24 | 0.037 | [−5.0, −0.2] |
| 2018/19 | −7.60 | 1.42 | < 0.001 | [−10.4, −4.8] |
| 2019/20 | −7.09 | 0.85 | < 0.001 | [−8.8, −5.4] |
| 2020/21 | −0.84 | 0.73 | 0.253 | [−2.3, 0.6] |
| 2021/22 | −10.85 | 1.34 | < 0.001 | [−13.5, −8.2] |
| 2022/23 | −10.94 | 1.47 | < 0.001 | [−13.8, −8.1] |
| 2023/24 | −10.70 | 1.53 | < 0.001 | [−13.7, −7.7] |
| 2024/25 | −11.19 | 1.27 | < 0.001 | [−13.7, −8.7] |
R² = 0.745 · Adjusted R² = 0.738 · N = 858 · Clustered by club (110 clusters, robust to within-club serial correlation).
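As a quick consistency check, the reported interval for injury_burden can be reproduced from its coefficient and clustered standard error. The critical value of 1.96 is a normal-approximation assumption; the table may have been produced with a slightly different critical value.

```latex
\hat{\beta} \pm 1.96 \cdot \mathrm{SE}
  = -44.61 \pm 1.96 \times 8.83
  = -44.61 \pm 17.31
  \;\Rightarrow\; [-61.92,\ -27.30]
```

Rounded to one decimal place this gives [−61.9, −27.3], matching the table.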
Club fixed effects robustness check
A separate model with club fixed effects (replacing league FE with 110 club dummies) produces β(injury_burden) = −45.97 (SE = 10.06, p < 0.001), R² = 0.805. The coefficient is nearly identical to the main model's −44.61, which suggests the effect reflects within-club variation over time rather than stable differences between clubs.
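Including a dummy for every club is numerically equivalent to subtracting each club's own mean from the outcome and from every regressor before fitting (the within transformation). A minimal sketch of that demeaning step, assuming rows are plain dictionaries with a hypothetical `club` key:

```python
from collections import defaultdict


def demean_by_club(rows, column):
    """Return each row's value minus its club's mean for the given column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["club"]] += row[column]
        counts[row["club"]] += 1
    return [row[column] - totals[row["club"]] / counts[row["club"]] for row in rows]


rows = [
    {"club": "A", "injury_burden": 0.10},
    {"club": "A", "injury_burden": 0.30},
    {"club": "B", "injury_burden": 0.20},
]
print(demean_by_club(rows, "injury_burden"))
```

A club observed only once contributes a demeaned value of zero, so it carries no information in the club fixed effects model.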
Split regression: preventable vs bad luck injuries
Replacing the single injury_burden with three category-specific burdens:
| Variable | Coefficient | SE | p-value | Interpretation |
|---|---|---|---|---|
| Muscle injuries | −73.87 | 17.48 | < 0.001 | 10pp → 7.4 fewer points |
| Bad luck injuries | −15.56 | 24.99 | 0.534 | Not significant |
| Mixed injuries | −31.59 | 10.96 | 0.004 | 10pp → 3.2 fewer points |
All other controls remain the same. R² = 0.741.
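The "10pp" interpretations follow directly from the coefficients, since burden is measured as a share between 0 and 1. A small sketch of the conversion (the helper name `points_effect` is made up for this example):

```python
def points_effect(coefficient: float, burden_change_pp: float) -> float:
    """Predicted change in points for a burden change in percentage points."""
    return coefficient * burden_change_pp / 100


for label, coef in [("muscle", -73.87), ("mixed", -31.59), ("overall", -44.61)]:
    print(label, round(points_effect(coef, 10), 1))
```

A 10 percentage point rise in muscle-injury burden therefore corresponds to about 7.4 fewer points, and the same rise in the mixed category to about 3.2 fewer points.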
Robustness: within-season timing test
To investigate causal direction, we tested whether early-season injuries (matchdays 1–10) predict late-season points (matchdays 20–38), controlling for squad value and early-season form (points at matchday 10).
| Test | Coefficient | SE | p-value | Result |
|---|---|---|---|---|
| Early injuries → Late points | −7.03 | 5.43 | 0.196 | Not significant (directionally correct) |
| Late injuries → Early points (placebo) | −5.26 | 3.09 | 0.089 | Not significant at the 5% level (no clear evidence of reverse causation) |
The timing test does not provide strong evidence of causation in either direction. The relationship observed in the season-level model is best described as a robust association, plausibly causal but not definitively so.
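The inputs to a timing test of this kind could be assembled per club-season roughly as below. The record layout (`matchday`, `points`, `unavailable`, `in_squad`) and the function name are assumptions for illustration; the windows follow the matchday ranges stated above.

```python
import math


def timing_features(matches, early=(1, 10), late=(20, 38)):
    """Early-window burden and points, plus late-window points, for one club-season."""
    def window(bounds):
        lo, hi = bounds
        return [m for m in matches if lo <= m["matchday"] <= hi]

    early_matches, late_matches = window(early), window(late)
    squad_matchdays = sum(m["in_squad"] for m in early_matches)
    unavailable = sum(m["unavailable"] for m in early_matches)
    return {
        # Injured + suspended player-matchdays over in-squad player-matchdays.
        "early_burden": unavailable / squad_matchdays if squad_matchdays else math.nan,
        "early_points": sum(m["points"] for m in early_matches),
        "late_points": sum(m["points"] for m in late_matches),
    }
```

The placebo row in the table corresponds to swapping the roles of the two windows, so that late-window burden is used to "predict" early-window points.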
League-specific results
| League | β(injury burden) | SE | p-value | Significant? |
|---|---|---|---|---|
| Premier League | −51.00 | 12.5 | < 0.001 | Yes |
| Bundesliga | −63.64 | 15.2 | < 0.001 | Yes |
| Serie A | −61.22 | 25.1 | 0.016 | Yes |
| Ligue 1 | −50.10 | 32.1 | 0.122 | Borderline |
| La Liga | 2.62 | 19.3 | 0.892 | No |
| Liga Portugal | 10.40 | 21.4 | 0.628 | No |
Each league-specific model includes the full set of controls plus season fixed effects. Standard errors are clustered by club within each league.
Limitations
- Endogeneity: We cannot definitively establish that injuries cause poor performance rather than the reverse (or that both are caused by an unmeasured third factor like poor management). The within-season timing test was inconclusive.
- Transfermarkt data quality: TM’s absence data is crowd-sourced and may not capture every injury, particularly at smaller clubs. Reporting standards may differ across leagues, which could contribute to the La Liga/Portugal anomaly.
- Injury classification: The preventable/bad-luck split relies on keyword matching of injury descriptions. Some injuries are ambiguous (e.g. a “knee injury” could be contact or overuse). The mixed category absorbs this ambiguity.
- Unweighted burden: A fringe player’s absence counts the same as a star’s. This is defensible (avoids circularity) but may understate the effect for clubs that lose key players.
- Squad value as proxy: Transfermarkt market values are crowd-estimated and may not perfectly reflect true squad quality. They are the best publicly available proxy.
- Manager change timing: We filter to changes after August 15, but some early-season appointments (September) may be planned rather than crisis-driven.
Data sources
- Player availability: Transfermarkt ausfallzeiten pages for 6 leagues, 11 seasons (1,270 club-season files, 2.1M matchday entries). Scraped via scripts/fetch-transfermarkt-absences.py.
- Squad market values: Transfermarkt marktwerteverein pages at two snapshots per season (September + February), divisions 1 and 2. Scraped via scripts/fetch-transfermarkt-squad-values.py.
- League standings: salimt/football-datasets on GitHub (team_competitions_seasons.csv with final standings, points, GD, and manager data).
- Elo ratings: Club Elo, historical Elo ratings for all clubs. Used to compute Gini coefficient per league-season.
- Squad ages: Transfermarkt altersschnitt pages. Scraped via scripts/analysis/fetch-control-variables.py.
- Manager changes: Transfermarkt trainerwechsel pages (filtered to mid-season changes only). Scraped via scripts/analysis/fetch-control-variables.py.
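For the Gini coefficient mentioned under Elo ratings, a standard sorted-rank formula is sketched below. Whether the original analysis applied it to raw Elo ratings or to a transformed version is not stated, so treat the input as an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))  # 0.0
print(gini([0, 0, 0, 1]))  # 0.75
```

With four clubs the maximum attainable value is 0.75, reached when one club holds the entire total.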