Hey, everyone, I’m Andrew Weatherman, creator of toRvik and lover of college basketball analytics. The goal of toRvik is to expand access to reliable, high-quality CBB statistics. While analogous packages exist to pull data, like Saiem Gilani’s brilliant hoopR, toRvik requires no paid subscription or set-up and can be used immediately by anyone with just a few lines of code.
Install toRvik
# You can install using {pacman} with the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("andreweatherman/toRvik", dependencies = TRUE, update = TRUE)
Overview of Barttorvik and toRvik
toRvik is a package of scrapers that pull data from Barttorvik, a popular college basketball analytics website, and return it in tidy format. Barttorvik splits its data on a number of variables and hosts detailed player and game statistics, while serving as a reputable, industry-recognized metric rating system. Generally speaking, all data is available back to the 2007-08 season. More information about Barttorvik, its data, and its metric rating system can be found here.
Package functions are syntactically structured to point to their data source (e.g. by ‘player,’ ‘game,’ etc.) and should be considered get functions by nature. As of toRvik version 1.0.1, the package exports more than 20 functions covering the website and its data. Some highlights include:
- Retrieving detailed game-by-game player statistics
- Splitting advanced metrics by game location, type, date range, or opponent strength
- Pulling play-by-play shooting bins for players and teams
- Grabbing composite recruit rankings for top players by coach
- Generating injury-adjusted efficiency measures by team
Quick start with ratings
toRvik requires no set-up and can be run instantly in any session. The T-Rank functions, which pull and split Barttorvik’s metric rating system, are an excellent place to start. Let’s take a glance at the top teams in T-Rank using toRvik:
library(magrittr) # provides the %>% pipe used in the examples below
tictoc::tic()
toRvik::bart_ratings(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 x 19
#> team conf barthag barthag_rk adj_o adj_o_rk adj_d adj_d_rk adj_t adj_t_rk
#> <chr> <chr> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
#> 1 Gonzaga WCC 0.966 1 120. 4 89.9 9 72.6 5
#> 2 Houston Amer 0.959 2 117. 10 88.5 6 63.7 336
#> 3 Kansas B12 0.958 3 120. 5 91.3 13 69.1 71
#> 4 Texas ~ B12 0.951 4 111. 41 85.4 1 66.3 223
#> 5 Baylor B12 0.949 5 118. 8 91.3 14 67.6 149
#> 6 Duke ACC 0.944 6 123. 1 96.0 53 67.4 161
#> 7 Tennes~ SEC 0.944 7 111. 34 87.1 3 67.4 164
#> 8 Villan~ BE 0.935 8 117. 9 93.0 26 62.2 350
#> 9 Arizona P12 0.934 9 118. 7 93.7 35 72.3 9
#> 10 UCLA P12 0.932 10 116. 12 92.2 20 65.4 274
#> # ... with 9 more variables: wab <dbl>, nc_elite_sos <dbl>, nc_fut_sos <dbl>,
#> # nc_cur_sos <dbl>, ov_elite_sos <dbl>, ov_fut_sos <dbl>, ov_cur_sos <dbl>,
#> # seed <int>, year <dbl>
tictoc::toc()
#> 3.45 sec elapsed
Here, the bart_ratings function returned the top ten teams in T-Rank for the 2021-22 season. We are also presented with each team’s adjusted efficiencies, their adjusted tempo, and two forms of strength of schedule (documented in bart_ratings). But what if we want these same measures in home games only? We would use bart_factors and input ‘home’ as venue:
tictoc::tic()
toRvik::bart_factors(venue='home') %>%
  utils::head(10)
#> # A tibble: 10 x 23
#> team conf barthag rec wins games adj_t adj_o off_efg off_to off_or
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Houston Amer 0.978 16–1 16 17 65.8 116. 54.3 16.5 39.1
#> 2 Baylor B12 0.968 15–2 15 17 69.3 118. 55.1 17.6 38.1
#> 3 Gonzaga WCC 0.966 16–0 16 16 73.1 121. 60 15.3 31.8
#> 4 Texas Tech B12 0.965 18–0 18 18 68.2 117. 57 19.4 38.3
#> 5 Auburn SEC 0.961 16–0 16 16 72.5 116. 52.6 16.3 32.5
#> 6 Tennessee SEC 0.959 16–0 16 16 68.9 113. 53 18.3 37.5
#> 7 Villanova BE 0.956 12–1 12 13 62.6 122. 57.2 14.5 29.4
#> 8 UCLA P12 0.952 14–1 14 15 69.4 116. 54.3 13.4 30.7
#> 9 Purdue B10 0.950 16–1 16 17 67.6 125. 58.3 16.8 38.5
#> 10 Texas B12 0.948 16–3 16 19 63.6 110. 51.1 18 33.8
#> # ... with 12 more variables: off_ftr <dbl>, adj_d <dbl>, def_efg <dbl>,
#> # def_to <dbl>, def_or <dbl>, def_ftr <dbl>, wab <dbl>, year <dbl>,
#> # venue <chr>, type <chr>, top <dbl>, quad <chr>
tictoc::toc()
#> 2.39 sec elapsed
And now, we have four-factor data and metric ratings for home games only. The bart_factors function and the analogous bart_conf_factors both take venue, game type, date range, and opponent strength as additional splits. Great, but what if we want to explore rating trends over time? toRvik gives us that ability with bart_archive, a function that pulls adjusted ratings and projected records from the morning of a desired date:
tictoc::tic()
toRvik::bart_archive('20220113') %>%
  utils::head(10)
#> # A tibble: 10 x 16
#> rk team conf rec adj_o adj_o_rk adj_d adj_d_rk barthag proj_rec
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 Gonzaga WCC 13-2 124. 2 93.3 20 0.964 26-3
#> 2 2 Baylor B12 15-1 121. 5 91.3 13 0.962 26-5
#> 3 3 Houston Amer 14-2 120. 6 91 11 0.961 27-4
#> 4 4 Auburn SEC 15-1 115. 20 89.3 8 0.947 27-4
#> 5 5 LSU SEC 15-1 106. 117 82.4 1 0.946 27-4
#> 6 6 Arizona P12 13-1 117. 15 91 12 0.946 27-4
#> 7 7 Villanova BE 12-4 117. 13 91.6 16 0.942 24-6
#> 8 8 Kansas B12 13-2 122. 4 96.5 50 0.938 24-7
#> 9 9 Purdue B10 13-2 125 1 98.9 94 0.937 24-7
#> 10 10 Duke ACC 13-2 117. 11 93.8 23 0.926 26-5
#> # ... with 6 more variables: proj_conf_rec <chr>, wab <dbl>, wab_rk <dbl>,
#> # cur_rk <dbl>, change <dbl>, date <date>
tictoc::toc()
#> 0.73 sec elapsed
At this time, bart_archive only takes a single date, but if you want to track longer periods, I suggest looking into mapping packages such as purrr.
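As a rough sketch of that idea, the snippet below builds a vector of dates in the same YYYYMMDD form used in the call above and maps a fetch function over it. The helpers archive_dates and pull_archive_range are hypothetical names introduced here for illustration and are not part of toRvik. The fetch argument defaults to toRvik::bart_archive but can be swapped for a stub, and the one-second pause between requests is an assumption about polite scraping rather than a documented requirement.

```r
library(magrittr)  # provides %>%

# Hypothetical helper: build YYYYMMDD strings, the format used in
# the bart_archive('20220113') call above.
archive_dates <- function(from, to, by = "week") {
  format(seq(as.Date(from), as.Date(to), by = by), "%Y%m%d")
}

# Hypothetical helper: fetch one snapshot per date and stack the results.
# `fetch` defaults to toRvik::bart_archive; it is an argument so that it can
# be replaced with a stub when working offline.
pull_archive_range <- function(dates, fetch = toRvik::bart_archive, pause = 1) {
  purrr::map(dates, function(d) {
    Sys.sleep(pause)  # assumed courtesy delay between requests
    fetch(d)
  }) %>%
    dplyr::bind_rows()
}

dates <- archive_dates("2022-01-06", "2022-01-27")
# dates is "20220106" "20220113" "20220120" "20220127"

# Not run here (needs toRvik and a network connection):
# trend <- pull_archive_range(dates) %>%
#   dplyr::filter(team == "Duke") %>%
#   dplyr::select(date, rk, adj_o, adj_d, barthag)
```

Because each snapshot carries its own date column, the stacked tibble can be filtered to a single team and plotted or compared across weeks.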
Exploring player and game data
Perhaps the most valuable functions in toRvik concern granular analysis. The package gives us the ability to explore advanced statistics at a game-by-game level for every Division 1 player since the 2007-08 season using bart_player_game.
Please note: This function returns a large tibble with >100,000 rows for each completed season. If you will be performing analyses on this data, it is recommended to store the freshly pulled tibble in its own variable as a safety copy, so that a mistake in later steps does not force another slow pull.
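One way to follow that advice is sketched below: pull once, save the result to disk, and reuse the saved copy in later sessions. The cached_pull helper and the file name are hypothetical and introduced only for illustration; the sketch relies on base R's saveRDS and readRDS and is not a feature of toRvik.

```r
# Hypothetical helper: run `pull()` once, save the result to `path`,
# and read the saved copy on every later call.
cached_pull <- function(path, pull) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  data <- pull()
  saveRDS(data, path)
  data
}

# Not run here (needs toRvik and a network connection):
# pg_raw <- cached_pull(
#   "player_game_2022_adv.rds",
#   function() toRvik::bart_player_game(year = 2022, stat = "adv")
# )
# duke <- dplyr::filter(pg_raw, team == "Duke")  # work on copies; pg_raw stays untouched
```

Keeping pg_raw untouched and filtering into new variables means the slow pull only has to happen once.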
tictoc::tic()
toRvik::bart_player_game(year=2022, stat='adv') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(net)) %>%
  utils::head(10)
#> # A tibble: 10 x 24
#> date year player exp team opp result min pts usg ortg
#> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2021-12-14 2022 AJ Griffin Fr Duke Sout~ W 22 19 16.7 214.
#> 2 2021-11-12 2022 Wendell Mo~ Jr Duke Army W 35 19 22.9 142.
#> 3 2021-11-19 2022 Wendell Mo~ Jr Duke Lafa~ W 29 23 25.2 159.
#> 4 2022-01-15 2022 Mark Willi~ So Duke Nort~ W 27 19 25 144.
#> 5 2022-03-18 2022 Mark Willi~ So Duke Cal ~ W 32 15 19.5 156.
#> 6 2021-11-22 2022 Paolo Banc~ Fr Duke The ~ W 31 28 29.3 157.
#> 7 2022-03-24 2022 Paolo Banc~ Fr Duke Texa~ W 37 22 23.6 146.
#> 8 2022-01-29 2022 AJ Griffin Fr Duke Loui~ W 34 22 17.2 163.
#> 9 2022-03-01 2022 Trevor Kee~ Fr Duke Pitt~ W 34 27 25.9 175.
#> 10 2021-11-19 2022 AJ Griffin Fr Duke Lafa~ W 21 18 16.4 188.
#> # ... with 13 more variables: or_pct <dbl>, dr_pct <dbl>, ast_pct <dbl>,
#> # to_pct <dbl>, stl_pct <dbl>, blk_pct <dbl>, bpm <dbl>, obpm <dbl>,
#> # dbpm <dbl>, net <dbl>, poss <dbl>, id <dbl>, game_id <chr>
tictoc::toc()
#> 16.7 sec elapsed
Here, bart_player_game returned the 10 highest individual net BPMs by a Duke player in the 2021-22 season. The function takes ‘box,’ ‘shooting,’ and ‘adv’ as stat inputs, and I welcome you to explore each one in your own session. But what if we want to investigate similar performance at a season level? Well, bart_player_season gives us that option, also taking ‘box,’ ‘shooting,’ and ‘adv’ as stat inputs.
tictoc::tic()
toRvik::bart_player_season(year=2022, stat='shooting') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(mid_a)) %>%
  utils::head(5)
#> # A tibble: 5 x 32
#> player pos exp team conf g mpg ppg p_per usg ortg efg ts
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Paolo~ Wing~ Fr Duke ACC 39 33.0 17.2 20.9 27.2 111. 52 55.7
#> 2 Wende~ Comb~ Jr Duke ACC 39 33.9 13.4 15.8 20.3 121. 56.9 60.5
#> 3 AJ Gr~ Wing~ Fr Duke ACC 39 24.3 10.4 17.1 16.9 127. 61.3 63.0
#> 4 Jerem~ Comb~ So Duke ACC 39 29 8.62 11.9 17.7 105. 47.7 51.5
#> 5 Trevo~ Comb~ Fr Duke ACC 36 30.2 11.5 15.2 20.1 110. 49.6 52.0
#> # ... with 19 more variables: ftm <dbl>, fta <dbl>, ft_pct <dbl>, two_m <dbl>,
#> # two_a <dbl>, two_pct <dbl>, three_m <dbl>, three_a <dbl>, three_pct <dbl>,
#> # dunk_m <dbl>, dunk_a <dbl>, dunk_pct <dbl>, rim_m <dbl>, rim_a <dbl>,
#> # rim_pct <dbl>, mid_m <dbl>, mid_a <dbl>, mid_pct <dbl>, id <dbl>
tictoc::toc()
#> 1.99 sec elapsed
And now, we have a tibble of season-long shooting data for Duke players, sorted by number of mid-range attempts. Advanced metric data can be pulled by team on a per-game basis using bart_team_schedule, and total team shooting splits can be accessed using bart_team_shooting. Game box data can be pulled with bart_game_total.
Investigating the NCAA tournament
Lastly for this introductory vignette, we will explore toRvik functions for scraping tournament data. Spend any time on social media in college basketball circles in March, and you will undoubtedly hear about ‘team sheets,’ detailed repositories of strength and quality metrics used by the seeding and selection committee. With bart_tourney_sheets, you can pull ‘quick-hit’ team sheets in tidy format with just a single line of code:
tictoc::tic()
toRvik::bart_tourney_sheets(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 x 16
#> team seed net kpi sor res_avg bpi kp sag qual_avg q1a q1
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Gonza~ 1 1 5 7 6 1 1 1 1 5-2 10-3
#> 2 Arizo~ 1 2 3 2 2.5 3 2 2 2.3 4-2 6-3
#> 3 Houst~ 5 3 13 14 13.5 2 4 5 3.7 0-3 1-4
#> 4 Baylor 1 4 2 4 3 6 5 4 5 4-4 10-5
#> 5 Kentu~ 2 5 9 5 7 4 3 6 4.3 3-6 9-7
#> 6 Kansas 1 6 1 1 1 8 6 3 5.7 4-4 12-5
#> 7 Tenne~ 3 7 4 3 3.5 5 7 7 6.3 4-7 11-7
#> 8 Villa~ 2 8 7 8 7.5 7 11 9 9 5-4 7-6
#> 9 Texas~ 3 9 17 12 14.5 13 9 14 12 5-5 8-9
#> 10 UCLA 4 10 11 15 13 9 8 10 9 2-4 5-4
#> # ... with 4 more variables: q2 <chr>, q1_2 <chr>, q3 <chr>, q4 <chr>
tictoc::toc()
#> 0.89 sec elapsed
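The res_avg and qual_avg columns in the output above are not explained in the text. Judging from the first three rows, they appear to be simple averages of the ranks beside them: res_avg of the two results-based ranks (kpi and sor) and qual_avg of the three quality ranks (bpi, kp, and sag), rounded to one decimal. That grouping is an inference from the printed numbers, not a documented definition, so the sketch below is only a sanity check on copied values.

```r
# First three rows of the bart_tourney_sheets(year=2022) output shown above,
# copied by hand (team names written out in full).
sheets <- data.frame(
  team     = c("Gonzaga", "Arizona", "Houston"),
  kpi      = c(5, 3, 13),
  sor      = c(7, 2, 14),
  bpi      = c(1, 3, 2),
  kp       = c(1, 2, 4),
  sag      = c(1, 2, 5),
  res_avg  = c(6, 2.5, 13.5),
  qual_avg = c(1, 2.3, 3.7)
)

# Assumption: results metrics are kpi and sor; quality metrics are bpi, kp, sag.
res_check  <- unname(rowMeans(sheets[, c("kpi", "sor")]))
qual_check <- unname(round(rowMeans(sheets[, c("bpi", "kp", "sag")]), 1))

res_matches  <- isTRUE(all.equal(res_check, sheets$res_avg))
qual_matches <- isTRUE(all.equal(qual_check, sheets$qual_avg))
```

Both checks hold for these three rows, which is consistent with the averaging interpretation but does not prove it for every team or season.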
Returned are sheets of top teams sorted by their NCAA NET ranking. Because this function relies on NET data, it is only available back to the 2018-19 season. In-season performance is valuable, but what if you want to investigate just tournament data? Well, toRvik gives you two options to do so: bart_tourney_odds and bart_tourney_results. The former returns metric-adjusted round probabilities by split. Let’s explore round odds for the 2022 NCAA Tournament:
tictoc::tic()
toRvik::bart_tourney_odds(year=2022, odds='pre') %>%
  dplyr::arrange(desc(s16)) %>%
  utils::head(10)
#> # A tibble: 10 x 11
#> seed region team conf r64 r32 s16 e8 f4 f2 champ
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 West Gonzaga WCC 100 96.6 81.9 69.6 52 38.5 27.5
#> 2 1 Midwest Kansas B12 100 96.3 73.7 48.7 32.5 17.7 8.5
#> 3 1 South Arizona P12 100 94.8 72.7 37.3 21.2 12 5.4
#> 4 1 East Baylor B12 100 94.9 72.5 42.9 25.2 11.1 5.8
#> 5 2 Midwest Auburn SEC 100 91.5 70 48.4 24.8 11.7 4.8
#> 6 2 West Duke ACC 100 94.1 69.8 38.9 15.5 8.2 4
#> 7 3 West Texas Tech B12 100 92.6 68.4 40.9 17.1 9.5 5
#> 8 3 South Tennessee SEC 100 92.3 67.5 41 20.8 11.6 5.2
#> 9 5 Midwest Iowa B10 100 84.3 64.5 32.2 19.3 9.2 3.7
#> 10 2 South Villanova BE 100 90.8 63.6 34.6 16.1 8.4 3.5
tictoc::toc()
#> 0.24 sec elapsed
With the ‘odds’ argument set to ‘pre,’ we returned pre-tournament odds and sorted by likelihood to reach the second weekend (Sweet 16). bart_tourney_odds also takes current odds (‘current’), odds based on recent performance (‘recent’), and odds based on games against strong opponents (‘t100’). This data is similarly available starting with the 2019 tournament. Now, what if we want to explore tournament results?
tictoc::tic()
toRvik::bart_tourney_results(min_year=2011, max_year=2021, type='conf') %>%
  utils::head(5)
#> # A tibble: 5 x 18
#> conf pake pase wins loss w_percent r64 r32 s16 e8 f4 f2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P12 11.2 11.4 55 38 0.591 38 27 18 8 2 0
#> 2 SEC 10.9 15.5 78 48 0.619 49 33 21 14 7 2
#> 3 MVC 4.1 6.1 19 15 0.559 15 11 4 2 2 0
#> 4 ACC 3.6 -0.3 102 61 0.626 64 44 31 15 5 4
#> 5 Horz 2.6 3 5 10 0.333 10 1 1 1 1 1
#> # ... with 6 more variables: champ <dbl>, top2 <dbl>, f4_percent <dbl>,
#> # champ_percent <dbl>, from <dbl>, to <dbl>
tictoc::toc()
#> 0.44 sec elapsed
With bart_tourney_results, we can return raw and adjusted outcomes by split. Here, we returned aggregate conference results from 2011 to 2021, sorted by PAKE, the number of wins attained above or below KenPom expectation. The function splits by team (‘team’), conference (‘conf’), coach (‘coach’), and seed (‘seed’) and includes data starting in 2000.
And you’re off!
toRvik includes several additional functions and capabilities that I did not describe here; take time to explore them and those detailed in this introduction. If you have any questions, feel free to message me on Twitter. If you run into any bugs, please open an issue on GitHub. Happy exploring!