3 min read

Automatic PITCHf/x database updates with pitchRx

This post is a follow up of sorts to modifying and querying a PITCHf/x database. The query portion of that post is still very relevant, but I didn’t want to impose the hack to obtain dates upon users, so I updated pitchRx to explicitly include dates by default. I also added some functionality to make database updates painless.

Dates are now supported!

Since dates are so vital to PITCHf/x analysis in practice, I decided to officially add support for a ‘date’ column in the ‘atbat’ table. Just to demonstrate, let’s create a new database, add one day’s worth of data to it, and print a few records from the date column:

library(dplyr)
library(pitchRx) 
# Make sure you have pitchRx 1.5 -- as 1.4 has a couple bugs...
# You can `devtools::install_github("cpsievert/pitchRx")` to get newest version
packageVersion("pitchRx") 
## [1] '1.8.2'
db <- src_sqlite("new-db.sqlite3", create = TRUE)
scrape(start = "2014-06-01", end = "2014-06-01", connect = db$con)
select(tbl(db, "atbat"), date)
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0 [new-db.sqlite3]
##    date      
##    <chr>     
##  1 2014_06_01
##  2 2014_06_01
##  3 2014_06_01
##  4 2014_06_01
##  5 2014_06_01
##  6 2014_06_01
##  7 2014_06_01
##  8 2014_06_01
##  9 2014_06_01
## 10 2014_06_01
## # ... with more rows

If you already have a PITCHf/x database without a date column, scrape now knows to create this date column for you (using a process similar to what I describe here).

Introducing update_db

In the past, my recommended solution for updating a PITCHf/x database was admittingly ambiguous/painful, so I’ve decided to automate that job. The new update_db function is my attempt at doing so. Give this function a database connection and it knows how to append new data to all your existing tables. It does this by finding the most recent game from the ‘gameday_link’ column in the ‘atbat’ table, then grabs data for all games after that date. For this reason, if your tables conflict in terms of the most recent date, do not use this function and use scrape to perform an update instead. Now, by default, update_db includes games up to and including yesterday. The reason is that files are updated in real-time on MLBAM’s servers which can lead to missing/duplicate records if you aren’t incredibly careful about adding today’s data to your database. Of course, you can still get your hands on today’s data, but don’t say I didn’t warn you if you append it to your database! :)

dat <- scrape(start = Sys.Date(), end = Sys.Date())

Automatic updates!

I’m a bit embarassed it took me so long to learn about cron jobs considering how easy it is to create scheduled jobs to run R scripts on a unix/linux based machine. Thanks to cron jobs, I no longer have to think about updating my database – it updates by itself. To make this happen, I first save this script under the /Users/cpsievert/pitchfx/ directory. This directory also has my PITCHf/x database saved as pitchRx.sqlite3. Next, I open my terminal and enter

sudo crontab -e

After entering my password, this opens my default text editor where I then have:

0 12 * * * Rscript /Users/cpsievert/pitchfx/update.R

This tells my machine to update my database everyday at noon. I’m not so sure how to schedule automated tasks on windows, but this thread might be a good starting point if your interested.