Intro to toRvik

Hey, everyone, I’m Andrew Weatherman, creator of toRvik and lover of college basketball analytics. The goal of toRvik is to expand access to reliable, high-quality CBB statistics. While analogous packages exist to pull data, like Saiem Gilani’s brilliant hoopR, toRvik requires no paid subscription or set-up and can be immediately utilized by anyone with just a few lines of code.

Install `toRvik`

# You can install using {pacman} with the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("andreweatherman/toRvik", dependencies = TRUE, update = TRUE)

Overview of Barttorvik and `toRvik`

toRvik is a package of scrapers that pull data from Barttorvik, a popular college basketball analytics website, and return it in tidy format. Barttorvik splits its data on a number of variables and hosts detailed player and game statistics, while serving as a reputable, industry-recognized metric rating system. Generally speaking, all data is available back to the 2007-08 season. More information about Barttorvik, its data, and its metric rating system can be found here.

Package functions are syntactically structured to point to their data source (e.g. by ‘player,’ ‘game,’ etc.) and should be considered get functions by nature. As of toRvik version 1.0.1, the package exports more than 20 functions covering the website and its data. Some highlights include:

Retrieving detailed game-by-game player statistics
Splitting advanced metrics by game location, type, date range, or opponent strength
Pulling play-by-play shooting bins for players and teams
Grabbing composite recruit rankings for top players by coach
Generating injury-adjusted efficiency measures by team

Quick start with ratings

toRvik requires no set-up and can be instantly executed in any session. To understand the package, the T-Rank functions, pulling and splitting Barttorvik’s metric rating system, are an excellent place to start. Let’s take a glance at the top teams in T-Rank using toRvik:

library(dplyr)  # attaches %>% and desc(), which the examples use unqualified
tictoc::tic()
toRvik::bart_ratings(year=2022) %>% 
  utils::head(10)
#> # A tibble: 10 x 19
#>    team    conf  barthag barthag_rk adj_o adj_o_rk adj_d adj_d_rk adj_t adj_t_rk
#>    <chr>   <chr>   <dbl>      <int> <dbl>    <int> <dbl>    <int> <dbl>    <int>
#>  1 Gonzaga WCC     0.966          1  120.        4  89.9        9  72.6        5
#>  2 Houston Amer    0.959          2  117.       10  88.5        6  63.7      336
#>  3 Kansas  B12     0.958          3  120.        5  91.3       13  69.1       71
#>  4 Texas ~ B12     0.951          4  111.       41  85.4        1  66.3      223
#>  5 Baylor  B12     0.949          5  118.        8  91.3       14  67.6      149
#>  6 Duke    ACC     0.944          6  123.        1  96.0       53  67.4      161
#>  7 Tennes~ SEC     0.944          7  111.       34  87.1        3  67.4      164
#>  8 Villan~ BE      0.935          8  117.        9  93.0       26  62.2      350
#>  9 Arizona P12     0.934          9  118.        7  93.7       35  72.3        9
#> 10 UCLA    P12     0.932         10  116.       12  92.2       20  65.4      274
#> # ... with 9 more variables: wab <dbl>, nc_elite_sos <dbl>, nc_fut_sos <dbl>,
#> #   nc_cur_sos <dbl>, ov_elite_sos <dbl>, ov_fut_sos <dbl>, ov_cur_sos <dbl>,
#> #   seed <int>, year <dbl>
tictoc::toc()
#> 3.45 sec elapsed

Here, the bart_ratings function returned the top ten teams in T-Rank in the current season. We are also presented with each team’s adjusted efficiencies, their adjusted tempo, and two forms of strength of schedule (documented in bart_ratings). But what if we want these same measures in home games only? We would use bart_factors and input ‘home’ as venue:

tictoc::tic()
toRvik::bart_factors(venue='home') %>%
  utils::head(10)
#> # A tibble: 10 x 23
#>    team       conf  barthag rec    wins games adj_t adj_o off_efg off_to off_or
#>    <chr>      <chr>   <dbl> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
#>  1 Houston    Amer    0.978 16–1     16    17  65.8  116.    54.3   16.5   39.1
#>  2 Baylor     B12     0.968 15–2     15    17  69.3  118.    55.1   17.6   38.1
#>  3 Gonzaga    WCC     0.966 16–0     16    16  73.1  121.    60     15.3   31.8
#>  4 Texas Tech B12     0.965 18–0     18    18  68.2  117.    57     19.4   38.3
#>  5 Auburn     SEC     0.961 16–0     16    16  72.5  116.    52.6   16.3   32.5
#>  6 Tennessee  SEC     0.959 16–0     16    16  68.9  113.    53     18.3   37.5
#>  7 Villanova  BE      0.956 12–1     12    13  62.6  122.    57.2   14.5   29.4
#>  8 UCLA       P12     0.952 14–1     14    15  69.4  116.    54.3   13.4   30.7
#>  9 Purdue     B10     0.950 16–1     16    17  67.6  125.    58.3   16.8   38.5
#> 10 Texas      B12     0.948 16–3     16    19  63.6  110.    51.1   18     33.8
#> # ... with 12 more variables: off_ftr <dbl>, adj_d <dbl>, def_efg <dbl>,
#> #   def_to <dbl>, def_or <dbl>, def_ftr <dbl>, wab <dbl>, year <dbl>,
#> #   venue <chr>, type <chr>, top <dbl>, quad <chr>
tictoc::toc()
#> 2.39 sec elapsed

And now, we have four-factor data and metric ratings for home games only. The bart_factors function, and the analogous bart_conf_factors, takes venue, game type, date range, and opponent strength as additional splits. Great, but what if we want to explore rating trends over time? toRvik gives us that ability with bart_archive, a function that pulls adjusted ratings and projected records from the morning of a desired date:

tictoc::tic()
toRvik::bart_archive('20220113') %>%
  utils::head(10)
#> # A tibble: 10 x 16
#>       rk team      conf  rec   adj_o adj_o_rk adj_d adj_d_rk barthag proj_rec
#>    <dbl> <chr>     <chr> <chr> <dbl>    <dbl> <dbl>    <dbl>   <dbl> <chr>   
#>  1     1 Gonzaga   WCC   13-2   124.        2  93.3       20   0.964 26-3    
#>  2     2 Baylor    B12   15-1   121.        5  91.3       13   0.962 26-5    
#>  3     3 Houston   Amer  14-2   120.        6  91         11   0.961 27-4    
#>  4     4 Auburn    SEC   15-1   115.       20  89.3        8   0.947 27-4    
#>  5     5 LSU       SEC   15-1   106.      117  82.4        1   0.946 27-4    
#>  6     6 Arizona   P12   13-1   117.       15  91         12   0.946 27-4    
#>  7     7 Villanova BE    12-4   117.       13  91.6       16   0.942 24-6    
#>  8     8 Kansas    B12   13-2   122.        4  96.5       50   0.938 24-7    
#>  9     9 Purdue    B10   13-2   125         1  98.9       94   0.937 24-7    
#> 10    10 Duke      ACC   13-2   117.       11  93.8       23   0.926 26-5    
#> # ... with 6 more variables: proj_conf_rec <chr>, wab <dbl>, wab_rk <dbl>,
#> #   cur_rk <dbl>, change <dbl>, date <date>
tictoc::toc()
#> 0.73 sec elapsed

At this time, bart_archive only takes a single date, but if you want to track longer periods, I suggest looking into mapping packages such as purrr.
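
One way to follow that suggestion is sketched below. It assumes, as in the call above, that bart_archive accepts a date as a 'YYYYMMDD' string and returns a tibble with the same columns on every call. The helpers archive_dates and pull_archive_range are made up for illustration and are not part of toRvik, and the one-second pause between requests is my own addition.

```r
# Hypothetical helper: 'YYYYMMDD' strings for every `by` days in a date range.
archive_dates <- function(start, end, by = 7) {
  days <- seq(as.Date(start), as.Date(end), by = by)
  format(days, "%Y%m%d")
}

# Hypothetical helper: call bart_archive() once per date and stack the results.
# Assumes each call returns a tibble with identical columns (including `date`).
pull_archive_range <- function(start, end, by = 7) {
  dates <- archive_dates(start, end, by)
  snapshots <- purrr::map(dates, function(d) {
    Sys.sleep(1)  # short pause between requests (an assumption, not a package requirement)
    toRvik::bart_archive(d)
  })
  dplyr::bind_rows(snapshots)
}

# Needs toRvik, purrr, dplyr, and an internet connection.
if (interactive()) {
  jan <- pull_archive_range("2022-01-01", "2022-01-29")
}
```

Because the archive output carries a date column, the stacked tibble can then be filtered by team to trace how a rating moved from week to week.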

Exploring player and game data

Perhaps the most valuable functions in toRvik concern granular analysis. The package gives us the ability to explore advanced statistics at a game-by-game level for every Division 1 player since the 2007-08 season using bart_player_game.

Please note: This function returns a large tibble with >100,000 rows for each completed season. If you will be performing analyses on this data, it is recommended to store a fresh tibble as a safety variable so that you do not have to re-pull it.
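
A minimal sketch of that advice: pull the season once, leave that object untouched, and do all filtering on a second name. The cached_pull helper is hypothetical (not part of toRvik), and saving to disk goes one step beyond the note above so that later sessions can skip the download as well.

```r
# Hypothetical helper: run `pull` once, save the result to `path`,
# and read it back from disk on later calls instead of pulling again.
cached_pull <- function(path, pull) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  out <- pull()
  saveRDS(out, path)
  out
}

# Needs toRvik, dplyr, and an internet connection on the first run.
if (interactive()) {
  # Safety variable: never modify pg_raw after this line.
  pg_raw <- cached_pull(
    "player_game_2022_adv.rds",
    function() toRvik::bart_player_game(year = 2022, stat = "adv")
  )

  # Work on a separate name so a bad filter never forces a re-pull.
  pg <- dplyr::filter(pg_raw, team == "Duke")
}
```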

tictoc::tic()
toRvik::bart_player_game(year=2022, stat='adv') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(net)) %>%
  utils::head(10)
#> # A tibble: 10 x 24
#>    date        year player      exp   team  opp   result   min   pts   usg  ortg
#>    <date>     <dbl> <chr>       <chr> <chr> <chr> <chr>  <dbl> <dbl> <dbl> <dbl>
#>  1 2021-12-14  2022 AJ Griffin  Fr    Duke  Sout~ W         22    19  16.7  214.
#>  2 2021-11-12  2022 Wendell Mo~ Jr    Duke  Army  W         35    19  22.9  142.
#>  3 2021-11-19  2022 Wendell Mo~ Jr    Duke  Lafa~ W         29    23  25.2  159.
#>  4 2022-01-15  2022 Mark Willi~ So    Duke  Nort~ W         27    19  25    144.
#>  5 2022-03-18  2022 Mark Willi~ So    Duke  Cal ~ W         32    15  19.5  156.
#>  6 2021-11-22  2022 Paolo Banc~ Fr    Duke  The ~ W         31    28  29.3  157.
#>  7 2022-03-24  2022 Paolo Banc~ Fr    Duke  Texa~ W         37    22  23.6  146.
#>  8 2022-01-29  2022 AJ Griffin  Fr    Duke  Loui~ W         34    22  17.2  163.
#>  9 2022-03-01  2022 Trevor Kee~ Fr    Duke  Pitt~ W         34    27  25.9  175.
#> 10 2021-11-19  2022 AJ Griffin  Fr    Duke  Lafa~ W         21    18  16.4  188.
#> # ... with 13 more variables: or_pct <dbl>, dr_pct <dbl>, ast_pct <dbl>,
#> #   to_pct <dbl>, stl_pct <dbl>, blk_pct <dbl>, bpm <dbl>, obpm <dbl>,
#> #   dbpm <dbl>, net <dbl>, poss <dbl>, id <dbl>, game_id <chr>
tictoc::toc()
#> 16.7 sec elapsed

Here, bart_player_game returned the 10 highest individual net BPMs by a Duke player this season. The function takes ‘box,’ ‘shooting,’ and ‘adv’ as stat inputs, and I welcome you to explore each one in your own session. But what if we want to investigate similar performance at a season level? Well, bart_player_season gives us that option, also taking ‘box,’ ‘shooting,’ and ‘adv’ as stat inputs.

tictoc::tic()
toRvik::bart_player_season(year=2022, stat='shooting') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(mid_a)) %>%
  utils::head(5)
#> # A tibble: 5 x 32
#>   player pos   exp   team  conf      g   mpg   ppg p_per   usg  ortg   efg    ts
#>   <chr>  <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Paolo~ Wing~ Fr    Duke  ACC      39  33.0 17.2   20.9  27.2  111.  52    55.7
#> 2 Wende~ Comb~ Jr    Duke  ACC      39  33.9 13.4   15.8  20.3  121.  56.9  60.5
#> 3 AJ Gr~ Wing~ Fr    Duke  ACC      39  24.3 10.4   17.1  16.9  127.  61.3  63.0
#> 4 Jerem~ Comb~ So    Duke  ACC      39  29    8.62  11.9  17.7  105.  47.7  51.5
#> 5 Trevo~ Comb~ Fr    Duke  ACC      36  30.2 11.5   15.2  20.1  110.  49.6  52.0
#> # ... with 19 more variables: ftm <dbl>, fta <dbl>, ft_pct <dbl>, two_m <dbl>,
#> #   two_a <dbl>, two_pct <dbl>, three_m <dbl>, three_a <dbl>, three_pct <dbl>,
#> #   dunk_m <dbl>, dunk_a <dbl>, dunk_pct <dbl>, rim_m <dbl>, rim_a <dbl>,
#> #   rim_pct <dbl>, mid_m <dbl>, mid_a <dbl>, mid_pct <dbl>, id <dbl>
tictoc::toc()
#> 1.99 sec elapsed

And now, we have a tibble of season-long shooting data for Duke players, sorted by number of mid-range attempts. Advanced metric data can be pulled by team on a per-game basis using bart_team_schedule, and total team shooting splits can be accessed using bart_team_shooting. Game box data can be pulled with bart_game_total.
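
Returning to the season-long shooting tibble, raw attempt counts favor players who simply shoot more. The sketch below adds the share of each player's field-goal attempts that come from mid-range. It assumes that two_a and three_a together cover all field-goal attempts; add_mid_share is an illustrative helper, not a toRvik function.

```r
# Hypothetical helper: append mid-range attempts as a share of all attempts.
# Assumes two_a + three_a equals total field-goal attempts.
add_mid_share <- function(shooting) {
  shooting$mid_share <- shooting$mid_a / (shooting$two_a + shooting$three_a)
  shooting
}

# Needs toRvik, dplyr, and an internet connection.
if (interactive()) {
  duke <- dplyr::filter(
    toRvik::bart_player_season(year = 2022, stat = "shooting"),
    team == "Duke"
  )
  duke <- add_mid_share(duke)
  duke[order(-duke$mid_share), c("player", "mid_a", "mid_share")]
}
```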

Investigating the NCAA tournament

Lastly for this introductory vignette, we will explore toRvik functions for scraping tournament data. Spend any time on social media in college basketball circles in March, and you will undoubtedly hear about ‘team sheets,’ detailed repositories of strength and quality metrics used by the seeding and selection committee. With bart_tourney_sheets, you can pull ‘quick-hit’ team sheets in tidy format with just a single line of code:

tictoc::tic()
toRvik::bart_tourney_sheets(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 x 16
#>    team    seed   net   kpi   sor res_avg   bpi    kp   sag qual_avg q1a   q1   
#>    <chr>  <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl> <chr> <chr>
#>  1 Gonza~     1     1     5     7     6       1     1     1      1   5-2   10-3 
#>  2 Arizo~     1     2     3     2     2.5     3     2     2      2.3 4-2   6-3  
#>  3 Houst~     5     3    13    14    13.5     2     4     5      3.7 0-3   1-4  
#>  4 Baylor     1     4     2     4     3       6     5     4      5   4-4   10-5 
#>  5 Kentu~     2     5     9     5     7       4     3     6      4.3 3-6   9-7  
#>  6 Kansas     1     6     1     1     1       8     6     3      5.7 4-4   12-5 
#>  7 Tenne~     3     7     4     3     3.5     5     7     7      6.3 4-7   11-7 
#>  8 Villa~     2     8     7     8     7.5     7    11     9      9   5-4   7-6  
#>  9 Texas~     3     9    17    12    14.5    13     9    14     12   5-5   8-9  
#> 10 UCLA       4    10    11    15    13       9     8    10      9   2-4   5-4  
#> # ... with 4 more variables: q2 <chr>, q1_2 <chr>, q3 <chr>, q4 <chr>
tictoc::toc()
#> 0.89 sec elapsed

Returned are sheets of top teams sorted by their NCAA NET ranking. Because this function relies on NET data, it is only available back to the 2018-19 season. In-season performance is valuable, but what if you want to investigate just tournament data? Well, toRvik gives you two options to do so: bart_tourney_odds and bart_tourney_results. The former returns metric-adjusted round probabilities by split. Let’s explore round odds for the 2022 NCAA Tournament:

tictoc::tic()
toRvik::bart_tourney_odds(year=2022, odds='pre') %>%
  dplyr::arrange(desc(s16)) %>%
  utils::head(10)
#> # A tibble: 10 x 11
#>     seed region  team       conf    r64   r32   s16    e8    f4    f2 champ
#>    <dbl> <chr>   <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1 West    Gonzaga    WCC     100  96.6  81.9  69.6  52    38.5  27.5
#>  2     1 Midwest Kansas     B12     100  96.3  73.7  48.7  32.5  17.7   8.5
#>  3     1 South   Arizona    P12     100  94.8  72.7  37.3  21.2  12     5.4
#>  4     1 East    Baylor     B12     100  94.9  72.5  42.9  25.2  11.1   5.8
#>  5     2 Midwest Auburn     SEC     100  91.5  70    48.4  24.8  11.7   4.8
#>  6     2 West    Duke       ACC     100  94.1  69.8  38.9  15.5   8.2   4  
#>  7     3 West    Texas Tech B12     100  92.6  68.4  40.9  17.1   9.5   5  
#>  8     3 South   Tennessee  SEC     100  92.3  67.5  41    20.8  11.6   5.2
#>  9     5 Midwest Iowa       B10     100  84.3  64.5  32.2  19.3   9.2   3.7
#> 10     2 South   Villanova  BE      100  90.8  63.6  34.6  16.1   8.4   3.5
tictoc::toc()
#> 0.24 sec elapsed

With the ‘odds’ argument set to ‘pre,’ we returned pre-tournament odds and sorted by likelihood to reach the second weekend (Sweet 16). bart_tourney_odds also takes current odds (‘current’), odds based on recent performance (‘recent’), and odds based on games against strong opponents (‘t100’). This data is similarly available starting with the 2019 tournament. Now, what if we want to explore tournament results?

tictoc::tic()
toRvik::bart_tourney_results(min_year=2011, max_year=2021, type='conf') %>%
  utils::head(5)
#> # A tibble: 5 x 18
#>   conf   pake  pase  wins  loss w_percent   r64   r32   s16    e8    f4    f2
#>   <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P12    11.2  11.4    55    38     0.591    38    27    18     8     2     0
#> 2 SEC    10.9  15.5    78    48     0.619    49    33    21    14     7     2
#> 3 MVC     4.1   6.1    19    15     0.559    15    11     4     2     2     0
#> 4 ACC     3.6  -0.3   102    61     0.626    64    44    31    15     5     4
#> 5 Horz    2.6   3       5    10     0.333    10     1     1     1     1     1
#> # ... with 6 more variables: champ <dbl>, top2 <dbl>, f4_percent <dbl>,
#> #   champ_percent <dbl>, from <dbl>, to <dbl>
tictoc::toc()
#> 0.44 sec elapsed

With bart_tourney_results, we can return raw and adjusted outcomes by split. Here, we returned aggregate conference results from 2011 to 2021, sorted by PAKE – the number of wins attained above or below KenPom expectation. The function splits by team (‘team’), conference (‘conf’), coach (‘coach’), and seed (‘seed’) and includes data starting in 2000.
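
To compare splits side by side, the sketch below calls bart_tourney_results once per split type listed above and keeps the results in a named list. It reuses only the arguments shown in the earlier call; results_by_type and its pull argument are illustrative and not part of toRvik.

```r
split_types <- c("team", "conf", "coach", "seed")

# Hypothetical helper: one bart_tourney_results() call per split type.
# `pull` defaults to the toRvik function and is only looked up when used,
# so a stand-in function can be passed for offline experimentation.
results_by_type <- function(types, min_year, max_year,
                            pull = toRvik::bart_tourney_results) {
  out <- lapply(types, function(tp) {
    pull(min_year = min_year, max_year = max_year, type = tp)
  })
  stats::setNames(out, types)
}

# Needs toRvik and an internet connection.
if (interactive()) {
  res <- results_by_type(split_types, min_year = 2011, max_year = 2021)
  utils::head(res$seed, 5)
}
```

A named list is used rather than one stacked tibble because each split has a different leading column (conf in the output above), so the results are easier to inspect one at a time.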

And you’re off!

toRvik includes several additional functions and capabilities that I did not describe here; take time to explore them and those detailed in this introduction. If you have any questions, feel free to message me on Twitter. If you run into any bugs, please open an issue on GitHub. Happy exploring!