Retrieve baseball data in Python

Project description

baseball_scraper is a Python package for baseball data analysis. This package scrapes baseball-reference.com and baseballsavant.com so you don’t have to. So far, the package performs four main tasks: retrieving statcast data, pitching stats, batting stats, and division standings/team records. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods.

Status

https://travis-ci.com/spilchen/baseball_scraper.svg?branch=master

Statcast

Pull advanced metrics from Major League Baseball’s Statcast system. Statcast data include pitch-level features such as Perceived Velocity (PV), Spin Rate (SR), Exit Velocity (EV), pitch X, Y, and Z coordinates, and more. The function statcast(start_dt, end_dt) pulls this data from baseballsavant.com.

>>> from baseball_scraper import statcast
>>> data = statcast(start_dt='2017-06-24', end_dt='2017-06-27')
>>> data.head(2)

   index pitch_type  game_date  release_speed  release_pos_x  release_pos_z
0    314         CU 2017-06-27           79.7        -1.3441         5.4075
1    332         FF 2017-06-27           98.1        -1.3547         5.4196

  player_name    batter   pitcher     events     ...      release_pos_y
0   Matt Bush  608070.0  456713.0  field_out     ...            54.8585
1   Matt Bush  429665.0  456713.0  field_out     ...            54.3470

   estimated_ba_using_speedangle  estimated_woba_using_speedangle  woba_value
0                          0.100                            0.137         0.0
1                          0.269                            0.258         0.0

   woba_denom babip_value iso_value launch_speed_angle at_bat_number pitch_number
0         1.0         0.0       0.0                3.0          64.0          1.0
1         1.0         0.0       0.0                3.0          63.0          3.0
[2 rows x 79 columns]

If start_dt and end_dt are supplied, it will return all statcast data between those two dates. If not, it will return yesterday’s data. The argument team may also be supplied with a team’s city abbreviation (e.g. BOS) to obtain only observations for games containing that team. The optional argument verbose controls whether the library updates you on its progress while it pulls the data.
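The defaulting rule described above (use the supplied dates, otherwise fall back to yesterday) can be sketched with the standard library alone. This is an illustration, not the library's own code: `resolve_date_range` and its `today` parameter are hypothetical names introduced here, and treating a single supplied date as a one-day range is an assumption.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"  # the YYYY-MM-DD format used throughout this README


def resolve_date_range(start_dt=None, end_dt=None, today=None):
    """Return (start, end) as YYYY-MM-DD strings.

    Hypothetical helper, not part of baseball_scraper.
    """
    today = today or date.today()
    if start_dt is None and end_dt is None:
        # No dates supplied: fall back to yesterday.
        yesterday = (today - timedelta(days=1)).strftime(DATE_FORMAT)
        return yesterday, yesterday
    # Assumption: a single supplied date means a one-day range.
    start_dt = start_dt or end_dt
    end_dt = end_dt or start_dt
    # strptime raises ValueError for strings that do not parse as a date.
    start = datetime.strptime(start_dt, DATE_FORMAT).date()
    end = datetime.strptime(end_dt, DATE_FORMAT).date()
    if start > end:
        raise ValueError("start_dt must not be later than end_dt")
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


print(resolve_date_range("2017-06-24", "2017-06-27"))
print(resolve_date_range(today=date(2017, 6, 28)))
```

Validating the strings before calling a scraper function surfaces a malformed date locally rather than after a network request.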

For a player-specific statcast query, pull pitching or batting data using the statcast_pitcher and statcast_batter functions. These take the same start_dt and end_dt arguments as the statcast function, as well as a player_id argument. This ID comes from MLB Advanced Media, and can be obtained using the function playerid_lookup. A complete example:

>>> # Find Clayton Kershaw's player id
>>> from baseball_scraper import playerid_lookup
>>> from baseball_scraper import statcast_pitcher
>>> playerid_lookup('kershaw', 'clayton')
Gathering player lookup table. This may take a moment.

  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs
0   kershaw    clayton     477132  kersc001  kershcl01           2036

   mlb_played_first  mlb_played_last
0            2008.0           2017.0

>>> # His MLBAM ID is 477132, so we feed that as the player_id argument to the following function
>>> kershaw_stats = statcast_pitcher('2017-06-01', '2017-07-01', 477132)
>>> kershaw_stats.head(2)
  pitch_type   game_date release_speed release_pos_x release_pos_z
0         SL  2017-06-29          87.2        1.0865        6.4034
1         SL  2017-06-29          86.9        1.0195        6.4324

       player_name  batter  pitcher     events              description
0  Clayton Kershaw  458913   477132  strikeout  swinging_strike_blocked
1  Clayton Kershaw  458913   477132       null                     ball

      ...       release_pos_y  estimated_ba_using_speedangle
0     ...             54.5463                            0.0
1     ...             54.7625                            0.0

   estimated_woba_using_speedangle  woba_value woba_denom babip_value
0                              0.0        0.00          1           0
1                              0.0        null       null        null

  iso_value launch_speed_angle at_bat_number pitch_number
0         0               false            57            6
1      null               null            57            5

[2 rows x 78 columns]
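Pitch-level rows like these are usually summarised before analysis. The sketch below averages release speed per pitch type using plain Python and two rows copied from the sample output; `mean_speed_by_pitch_type` is a hypothetical helper written for this example, not a function of the package.

```python
from collections import defaultdict

# Two rows copied from the sample output above (only the columns used here).
rows = [
    {"pitch_type": "SL", "release_speed": 87.2},
    {"pitch_type": "SL", "release_speed": 86.9},
]


def mean_speed_by_pitch_type(rows):
    """Average release_speed for each pitch_type, skipping missing speeds."""
    speeds = defaultdict(list)
    for row in rows:
        if row["release_speed"] is not None:
            speeds[row["pitch_type"]].append(row["release_speed"])
    return {pitch: round(sum(v) / len(v), 2) for pitch, v in speeds.items()}


print(mean_speed_by_pitch_type(rows))  # {'SL': 87.05}
```

On the returned DataFrame itself, a pandas groupby on pitch_type would likely do the same job, provided the release_speed column is numeric.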

Pitching Stats

Pitching stats for players across multiple seasons, single seasons, or during a specified time period

This package contains two main functions for obtaining pitching data. For league-wide season-level pitching data, use the function pitching_stats(start_season, end_season). This will return one row per player per season, and provide all metrics made available by FanGraphs.

The second is pitching_stats_range(start_dt, end_dt). This allows you to obtain pitching data over a specific time interval, allowing you to get more granular than the FanGraphs function (for example, to see which pitcher had the strongest month of May). This query pulls data from Baseball Reference. Note that all dates should be in YYYY-MM-DD format.
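For a query such as "the month of May", the two YYYY-MM-DD strings can be generated rather than typed by hand. A minimal sketch using only the standard library; `month_range` is a hypothetical helper, not part of the package.

```python
import calendar
from datetime import date


def month_range(year, month):
    """Return the first and last day of a month as YYYY-MM-DD strings."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()


start_dt, end_dt = month_range(2017, 5)
print(start_dt, end_dt)  # 2017-05-01 2017-05-31
# These strings could then be passed on, e.g. pitching_stats_range(start_dt, end_dt).
```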

If you prefer Baseball Reference to FanGraphs, there is actually a third option called pitching_stats_bref(season). This works the same as pitching_stats, but retrieves its data from Baseball Reference instead. This is typically not recommended, however, because the Baseball Reference query currently can only retrieve one season’s worth of data per request.
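Because a one-season-per-request query has to be repeated to cover several seasons, a small loop can gather the results. The sketch below shows the pattern with a stand-in fetch function so it runs without network access; `collect_seasons` and `fake_fetch` are hypothetical names, and the placeholder rows are not real data.

```python
def collect_seasons(fetch_season, start_season, end_season):
    """Call fetch_season once per season and gather the rows in one list."""
    rows = []
    for season in range(start_season, end_season + 1):
        for row in fetch_season(season):
            # Tag each row so the season survives the merge.
            rows.append({"season": season, **row})
    return rows


def fake_fetch(season):
    """Stand-in for a one-season query; returns made-up placeholder rows."""
    return [{"Name": "Pitcher A"}, {"Name": "Pitcher B"}]


combined = collect_seasons(fake_fetch, 2014, 2016)
print(len(combined))  # 6 (3 seasons x 2 placeholder rows)
```

With real DataFrames the same idea would probably use pandas.concat on a list of per-season frames instead of a list of dicts.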

>>> from baseball_scraper import pitching_stats
>>> data = pitching_stats(2012, 2016)
>>> data.head()
     Season             Name     Team   Age     W    L   ERA  WAR     G    GS
336  2015.0  Clayton Kershaw  Dodgers  27.0  16.0  7.0  2.13  8.6  33.0  33.0
236  2014.0  Clayton Kershaw  Dodgers  26.0  21.0  3.0  1.77  7.6  27.0  27.0
472  2014.0     Corey Kluber  Indians  28.0  18.0  9.0  2.44  7.4  34.0  34.0
235  2015.0     Jake Arrieta     Cubs  29.0  22.0  6.0  1.77  7.3  33.0  33.0
256  2013.0  Clayton Kershaw  Dodgers  25.0  16.0  9.0  1.83  7.1  33.0  33.0

       ...      wSL/C (pi)  wXX/C (pi)  O-Swing% (pi)  Z-Swing% (pi)
336    ...            1.76       22.85          0.364          0.665
236    ...            2.62         NaN          0.371          0.670
472    ...            3.92         NaN          0.336          0.598
235    ...            2.42         NaN          0.329          0.618
256    ...            0.74         NaN          0.339          0.635

     Swing% (pi)  O-Contact% (pi)  Z-Contact% (pi)  Contact% (pi)  Zone% (pi)
336        0.511            0.478            0.811          0.689       0.487
236        0.525            0.536            0.831          0.730       0.515
472        0.468            0.485            0.886          0.744       0.505
235        0.468            0.595            0.856          0.762       0.483
256        0.484            0.563            0.873          0.763       0.492

     Pace (pi)
336       23.4
236       23.7
472       24.6
235       23.3
256       23.4

[5 rows x 299 columns]

Batting Stats

Hitting stats for players within seasons or during a specified time period

Batting stats are obtained similarly to pitching stats. The function call for getting season-level stats is batting_stats(start_season, end_season), and for a specific time range it is batting_stats_range(start_dt, end_dt). The Baseball Reference equivalent for season-level data is batting_stats_bref(season).

>>> from baseball_scraper import batting_stats_range
>>> data = batting_stats_range('2017-05-01', '2017-05-08')
>>> data.head()
          Name  Age  #days     Lev          Tm  G  PA  AB  R  H  ...    HBP
1   Jose Abreu   30     69  MLB-AL     Chicago  7  31  30  5  9  ...      0
2   Lane Adams   27     69  MLB-NL     Atlanta  6   6   6  0  2  ...      0
3   Matt Adams   28     68  MLB-NL   St. Louis  6   9   9  2  4  ...      0
4   Jim Adduci   32     69  MLB-AL     Detroit  6  24  21  3  5  ...      0
5  Tim Adleman   29     72  MLB-NL  Cincinnati  1   2   2  0  0  ...      0

   SH  SF  GDP  SB  CS     BA    OBP    SLG    OPS  mlb_ID
1   0   0    1   0   0  0.300  0.323  0.667  0.989  547989
2   0   0    1   1   0  0.333  0.333  0.333  0.667  572669
3   0   0    0   0   0  0.444  0.444  0.778  1.222  571431
4   0   0    0   0   0  0.238  0.333  0.381  0.714  451192
5   0   0    0   0   0  0.000  0.000  0.000  0.000  534947

[5 rows x 28 columns]
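As a sanity check on output like this, the BA column can be recomputed from hits and at-bats. A minimal sketch in plain Python, using hit and at-bat counts from the sample rows above; `batting_average` is a hypothetical helper and the zero-at-bat convention is an assumption.

```python
def batting_average(hits, at_bats):
    """Hits divided by at-bats, rounded to three places; 0.0 with no at-bats."""
    if at_bats == 0:
        return 0.0
    return round(hits / at_bats, 3)


# (Name, hits, at-bats) taken from the sample output above.
sample = [("Jose Abreu", 9, 30), ("Lane Adams", 2, 6), ("Matt Adams", 4, 9)]
for name, hits, at_bats in sample:
    print(name, batting_average(hits, at_bats))
```

The three results agree with the BA values 0.300, 0.333, and 0.444 shown in the table.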

Fangraphs

Various baseball projections are available at fangraphs.com. You can scrape that site using the fangraphs API. You supply it the FanGraphs player ID to look up and the projection system. It will return a DataFrame with the projections.

Note, due to the use of JavaScript on that site, we use Chrome through selenium to scrape the data. Chrome must be installed on your system in order to use these APIs.
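Since these APIs need a local Chrome install, it can help to check for one before scraping. The sketch below looks for a Chrome or Chromium executable on PATH; the candidate names are assumptions that vary by platform (on macOS and Windows Chrome is often not on PATH at all), and `find_chrome` is a hypothetical helper, not part of the package.

```python
import shutil

# Assumption: common executable names for Chrome/Chromium; adjust for your system.
CANDIDATES = [
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
]


def find_chrome(candidates=CANDIDATES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


chrome_path = find_chrome()
if chrome_path is None:
    print("No Chrome executable found on PATH; the fangraphs scraper may not work.")
else:
    print("Found Chrome at", chrome_path)
```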

>>> from baseball_scraper import fangraphs
>>> from baseball_id import Lookup
>>> player_id = Lookup.from_names(['Khris Davis']).iloc[0].fg_id
>>> fangraphs.Scraper.instances()
['Steamer (RoS)', 'Steamer (Update)', 'ZiPS (Update)', 'Steamer600 (Update)', 'Depth Charts (RoS)', 'THE BAT (RoS)']
>>> fg = fangraphs.Scraper("Steamer (RoS)")
>>> df = fg.scrape(player_id, scrape_as=fangraphs.ScrapeType.HITTER)
>>> df.columns
Index(['index', 'Name', 'Team', 'G', 'PA', 'AB', 'H', '2B', '3B', 'HR', 'R',
       'RBI', 'BB', 'SO', 'HBP', 'SB', 'CS', '-1', 'AVG', 'OBP', 'SLG', 'OPS',
       'wOBA', '-1.1', 'wRC+', 'BsR', 'Fld', '-1.2', 'Off', 'Def', 'WAR',
       'playerid'],
     dtype='object')
>>> df
   index         Name       Team   G   PA   AB   H  ...   BsR   Fld  -1.2  Off   Def  WAR  playerid
0     60  Khris Davis  Athletics  56  242  214  53  ...  -0.7  -0.1   NaN  4.8  -5.9  0.7      9112

[1 rows x 32 columns]
>>> player_id = Lookup.from_names(['Max Scherzer']).iloc[0].fg_id
>>> df = fg.scrape(player_id, scrape_as=fangraphs.ScrapeType.PITCHER)
>>> df.columns
Index(['index', 'Name', 'Team', 'W', 'L', 'ERA', 'GS', 'G', 'SV', 'IP', 'H',
       'ER', 'HR', 'SO', 'BB', 'WHIP', 'K/9', 'BB/9', 'FIP', 'WAR', 'RA9-WAR',
       'playerid'],
     dtype='object')
>>> df
   index          Name       Team  W  L   ERA  ...    K/9  BB/9   FIP  WAR  RA9-WAR  playerid
0      5  Max Scherzer  Nationals  6  3  3.04  ...  12.36  2.13  2.93  2.2      2.4      3137

[1 rows x 22 columns]

Game-by-Game Results and Schedule

The baseball_reference team scraper returns a team’s game-by-game results for a given season or date range. The resulting DataFrame includes game date, home and away teams, end result (W/L/Tie), score, winning/losing/saving pitchers, attendance, and division rank at that date.

You create the scraper once and can then reuse it to scrape specific seasons or date ranges. The team is identified by its abbreviation (e.g. NYY for New York Yankees, SEA for Seattle Mariners).

If the season argument is set to the current season, the query returns results for past games and the schedule for those that have not occurred yet.

>>> # Example: Let's take a look at the individual-game results of the 1927 Yankees
>>> from baseball_scraper import baseball_reference
>>> s = baseball_reference.TeamScraper()
>>> s.set_season(1927)
>>> data = s.scrape('NYY')
>>> data.head()
                Date   Tm Home_Away  Opp W/L     R   RA   Inn  W-L  Rank  \
1    Tuesday, Apr 12  NYY      Home  PHA   W   8.0  3.0   9.0  1-0   1.0
2  Wednesday, Apr 13  NYY      Home  PHA   W  10.0  4.0   9.0  2-0   1.0
3   Thursday, Apr 14  NYY      Home  PHA   T   9.0  9.0  10.0  2-0   1.0
4     Friday, Apr 15  NYY      Home  PHA   W   6.0  3.0   9.0  3-0   1.0
5   Saturday, Apr 16  NYY      Home  BOS   W   5.0  2.0   9.0  4-0   1.0

       GB      Win     Loss  Save  Time D/N  Attendance  Streak
1    Tied     Hoyt    Grove  None  2:05   D     72000.0       1
2  up 0.5  Ruether     Gray  None  2:15   D      8000.0       2
3    Tied     None     None  None  2:50   D      9000.0       2
4    Tied  Pennock    Ehmke  None  2:27   D     16000.0       3
5  up 1.0  Shocker  Ruffing  None  2:05   D     25000.0       4

>>> # Let's get the games a team plays in a given week.
>>> import datetime as dt
>>> s.set_date_range(dt.datetime(2019,6,2), dt.datetime(2019,6,8))
>>> df = s.scrape('TOR')
>>> df.head()

         Date   Tm Home_Away  Opp W/L     R  ...     Save  Time D/N  Attendance Streak Orig. Scheduled
59 2019-06-02  TOR         @  COL   L   1.0  ...     None  2:43   D     37861.0   -6.0            None
60 2019-06-04  TOR      Home  NYY   W   4.0  ...    Giles  3:00   N     20671.0    1.0            None
61 2019-06-05  TOR      Home  NYY   W  11.0  ...     None  3:22   N     16609.0    2.0            None
62 2019-06-06  TOR      Home  NYY   L   2.0  ...  Chapman  3:07   N     25657.0   -1.0            None
63 2019-06-07  TOR      Home  ARI   L   2.0  ...     None  2:50   N     16555.0   -2.0            None

[5 rows x 19 columns]
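The Streak column in these results follows a simple rule: positive values count consecutive wins and negative values consecutive losses. The sketch below reproduces it from a sequence of results in plain Python. `running_streak` is a hypothetical helper, and leaving the streak unchanged on a tie is an assumption that happens to match the 1927 sample.

```python
def running_streak(results, start=0):
    """Return the streak value after each game.

    Positive values count consecutive wins, negative values consecutive
    losses. Ties leave the streak unchanged (an assumption).
    """
    streak = start
    history = []
    for result in results:
        if result.startswith("W"):
            streak = streak + 1 if streak > 0 else 1
        elif result.startswith("L"):
            streak = streak - 1 if streak < 0 else -1
        history.append(streak)
    return history


# Results from the 2019 Toronto sample; the first streak of -6 implies a
# five-game losing streak going in.
print(running_streak(["L", "W", "W", "L", "L"], start=-5))  # [-6, 1, 2, -1, -2]
# Results from the 1927 Yankees sample, including one tie.
print(running_streak(["W", "W", "T", "W", "W"]))  # [1, 2, 2, 3, 4]
```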

Team List

Through the TeamListScraper you can pull a list of active teams and a few attributes about each from baseball-reference. This is useful mainly to get a list of abbreviations that baseball-reference uses for each team, as there doesn’t seem to be a standard. This comes in handy when you want to use another baseball-reference scraper such as TeamScraper and need to input the team abbreviation.

>>> from baseball_scraper import baseball_reference
>>> tss = baseball_reference.TeamSummaryScraper()
>>> df = tss.scrape(2019)
>>> df.columns
Index([u'Franchise',        u'Tm',      u'#Bat',    u'BatAge',       u'R/G',
               u'G',        u'PA',        u'AB',         u'R',         u'H',
              u'2B',        u'3B',        u'HR',       u'RBI',        u'SB',
              u'CS',        u'BB',        u'SO',        u'BA',       u'OBP',
             u'SLG',       u'OPS',      u'OPS+',        u'TB',       u'GDP',
             u'HBP',        u'SH',        u'SF',       u'IBB',       u'LOB'],
     dtype='object')
>>> df[df.Franchise.str.endswith("Rays")].abbrev.iloc[0]
'TBR'
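The lookup above matches a franchise name by its ending and returns the abbreviation. The same idea in plain Python, using only the three abbreviations mentioned in this README; `abbreviation_for` is a hypothetical helper and the short list stands in for the scraped table.

```python
# (franchise name, abbreviation) pairs mentioned in this README.
teams = [
    ("New York Yankees", "NYY"),
    ("Seattle Mariners", "SEA"),
    ("Tampa Bay Rays", "TBR"),
]


def abbreviation_for(nickname, teams):
    """Return the abbreviation of the first franchise whose name ends with nickname."""
    for franchise, abbrev in teams:
        if franchise.endswith(nickname):
            return abbrev
    raise KeyError(nickname)


print(abbreviation_for("Rays", teams))  # TBR
```

Raising KeyError for an unknown nickname makes a typo fail loudly instead of silently passing a wrong abbreviation to another scraper.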

Standings

The standings(season) function gives division standings for a given season. If the current season is chosen, it will give the most current set of standings. Otherwise, it will give the end-of-season standings for each division for the chosen season.

This function returns a list of dataframes. Each dataframe is the standings for one of MLB’s six divisions.

>>> from baseball_scraper import standings
>>> data = standings(2016)[4]
>>> print(data)
                    Tm    W   L  W-L%    GB
1         Chicago Cubs  103  58  .640    --
2  St. Louis Cardinals   86  76  .531  17.5
3   Pittsburgh Pirates   78  83  .484  25.0
4    Milwaukee Brewers   73  89  .451  30.5
5      Cincinnati Reds   68  94  .420  35.5
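The W-L% and GB columns can be derived from wins and losses alone, which is handy for checking a scraped table. A minimal sketch using the standard formulas; `win_pct` and `games_behind` are hypothetical helpers, not functions of the package.

```python
def win_pct(wins, losses):
    """Winning percentage rounded to three places."""
    return round(wins / (wins + losses), 3)


def games_behind(leader, team):
    """Games behind the leader; both arguments are (wins, losses) tuples."""
    return ((leader[0] - team[0]) + (team[1] - leader[1])) / 2


leader = (103, 58)     # first-place row in the table above
runner_up = (86, 76)   # second-place row
print(win_pct(*leader))                 # 0.64
print(games_behind(leader, runner_up))  # 17.5
```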

ESPN

With the ESPN scraper you can pull probable starters for a given date range. This can be useful to see the two-start pitchers for a given week and their team matchups.

>>> from baseball_scraper import espn
>>> from datetime import datetime
>>> es = espn.ProbableStartersScraper(datetime(2019,8,5), datetime(2019,8,11))
>>> df = es.scrape()
>>> df.head(10)
        Date              Name  espn_id opponent
0 2019-08-05   Sandy Alcantara    35241      NYM
1 2019-08-05      Jacob deGrom    32796      MIA
2 2019-08-05      Jordan Lyles    31061      PIT
3 2019-08-05     Dario Agrazal    39813      MIL
4 2019-08-05   Masahiro Tanaka    33150      BAL
5 2019-08-05      Gabriel Ynoa    33651      NYY
6 2019-08-05     Lucas Giolito    32697      DET
7 2019-08-05  Spencer Turnbull    33732      CHW
8 2019-08-05   Mike Montgomery    31092      BOS
9 2019-08-05     Rick Porcello    29966       KC
>>> df[df.duplicated(['espn_id'])].Name
127     Masahiro Tanaka
128    Jacob Waguespack
130       Rick Porcello
131     Vince Velasquez
132     Jeff Samardzija
133     Mike Montgomery
134    Spencer Turnbull
135         Mike Soroka
136     Sandy Alcantara
139       Jake Odorizzi
146      Kyle Hendricks
153      Charlie Morton
155      Andrew Cashner
157      Trent Thornton
158         Jakob Junis
159       Daniel Norris
160        Jacob deGrom
161           Max Fried
162     Jordan Yamamoto
163          Jon Lester
164       Luis Castillo
165       Chris Bassitt
166       Lucas Giolito
167          Mike Minor
168        Jordan Lyles
169         Zach Plesac
170        Jose Berrios
171       Dario Agrazal
172       Michael Wacha
173      German Marquez
174      Dinelson Lamet
175       Merrill Kelly
176        Wade LeBlanc
177        Jake Arrieta
Name: Name, dtype: object
>>> df[df.Name == "Charlie Morton"]
          Date            Name  espn_id opponent
15  2019-08-05  Charlie Morton    29155      TOR
153 2019-08-10  Charlie Morton    29155      SEA
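Finding two-start pitchers amounts to grouping the probable starts by pitcher and keeping anyone who appears at least twice. The sketch below does this in plain Python with three rows taken from the sample output; `two_start_pitchers` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# (date, name, espn_id, opponent) rows taken from the sample output above.
starts = [
    ("2019-08-05", "Masahiro Tanaka", 33150, "BAL"),
    ("2019-08-05", "Charlie Morton", 29155, "TOR"),
    ("2019-08-10", "Charlie Morton", 29155, "SEA"),
]


def two_start_pitchers(starts):
    """Map each pitcher with two or more starts to the opponents faced, in order."""
    by_pitcher = defaultdict(list)
    for _date, name, espn_id, opponent in starts:
        # Group on the id as well as the name in case two pitchers share a name.
        by_pitcher[(espn_id, name)].append(opponent)
    return {name: opps for (_id, name), opps in by_pitcher.items() if len(opps) >= 2}


print(two_start_pitchers(starts))  # {'Charlie Morton': ['TOR', 'SEA']}
```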

Complete Documentation

So far this has provided a basic overview of what this package can do and how you can use it. For full documentation on available functions and their arguments, see the [docs](https://github.com/spilchen/baseball_scraper/tree/master/docs) folder.

Installation

To install baseball_scraper, simply run

pip install baseball_scraper

or, for the version currently on the repo (which may at times be more up to date):

git clone https://github.com/spilchen/baseball_scraper
cd baseball_scraper
python setup.py install

Testing

We use pytest for testing this package. It can be invoked as follows:

python setup.py test

Dependencies

This library depends on: Pandas, NumPy, bs4 (beautiful soup), and Requests.

Download files

Source Distribution

baseball_scraper-0.4.10.tar.gz (308.5 kB)
