Retrieve baseball data in Python
Project description
baseball_scraper is a Python package for baseball data analysis. This package scrapes baseball-reference.com and baseballsavant.com so you don’t have to. So far, the package performs four main tasks: retrieving statcast data, pitching stats, batting stats, and division standings/team records. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods.
Statcast
Pull advanced metrics from Major League Baseball’s Statcast system. Statcast data include pitch-level features such as Perceived Velocity (PV), Spin Rate (SR), Exit Velocity (EV), pitch X, Y, and Z coordinates, and more. The function statcast(start_dt, end_dt) pulls this data from baseballsavant.com.
>>> from baseball_scraper import statcast
>>> data = statcast(start_dt='2017-06-24', end_dt='2017-06-27')
>>> data.head(2)
   index pitch_type  game_date  release_speed  release_pos_x  release_pos_z
0    314         CU 2017-06-27           79.7        -1.3441         5.4075
1    332         FF 2017-06-27           98.1        -1.3547         5.4196

  player_name    batter   pitcher     events  ...  release_pos_y
0   Matt Bush  608070.0  456713.0  field_out  ...        54.8585
1   Matt Bush  429665.0  456713.0  field_out  ...        54.3470

   estimated_ba_using_speedangle  estimated_woba_using_speedangle  woba_value
0                          0.100                            0.137         0.0
1                          0.269                            0.258         0.0

   woba_denom  babip_value  iso_value  launch_speed_angle  at_bat_number  pitch_number
0         1.0          0.0        0.0                 3.0           64.0           1.0
1         1.0          0.0        0.0                 3.0           63.0           3.0

[2 rows x 79 columns]
If start_dt and end_dt are supplied, it will return all statcast data between those two dates. If not, it will return yesterday’s data. The argument team may also be supplied with a team’s city abbreviation (e.g. BOS) to obtain only observations for games containing that team. The optional argument verbose controls whether the library updates you on its progress while it pulls the data.
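When a long period is needed, it can be convenient to split it into shorter windows and request them one at a time. The sketch below only builds the date strings. `date_chunks` is a hypothetical helper, not part of baseball_scraper, and the seven-day window is an arbitrary assumption.

```python
from datetime import date, timedelta


def date_chunks(start_dt, end_dt, days=7):
    """Split [start_dt, end_dt] into consecutive windows of at most `days` days.

    Dates are 'YYYY-MM-DD' strings. Hypothetical helper, not part of baseball_scraper.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    start = date.fromisoformat(start_dt)
    end = date.fromisoformat(end_dt)
    if start > end:
        raise ValueError("start_dt must not be after end_dt")
    chunks = []
    while start <= end:
        # Each window covers `days` calendar days, clipped to the overall end date.
        stop = min(start + timedelta(days=days - 1), end)
        chunks.append((start.isoformat(), stop.isoformat()))
        start = stop + timedelta(days=1)
    return chunks


chunks = date_chunks('2017-06-24', '2017-07-10')
for window_start, window_end in chunks:
    print(window_start, window_end)
```

Each pair can then be passed as start_dt and end_dt to statcast.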
For a player-specific statcast query, pull pitching or batting data using the statcast_pitcher and statcast_batter functions. These take the same start_dt and end_dt arguments as the statcast function, as well as a player_id argument. This ID comes from MLB Advanced Media, and can be obtained using the function playerid_lookup. A complete example:
>>> # Find Clayton Kershaw's player id
>>> from baseball_scraper import playerid_lookup
>>> from baseball_scraper import statcast_pitcher
>>> playerid_lookup('kershaw', 'clayton')
Gathering player lookup table. This may take a moment.
  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs
0   kershaw    clayton     477132  kersc001  kershcl01           2036

   mlb_played_first  mlb_played_last
0            2008.0           2017.0
>>> # His MLBAM ID is 477132, so we feed that as the player_id argument to the following function
>>> kershaw_stats = statcast_pitcher('2017-06-01', '2017-07-01', 477132)
>>> kershaw_stats.head(2)
  pitch_type   game_date  release_speed  release_pos_x  release_pos_z
0         SL  2017-06-29           87.2         1.0865         6.4034
1         SL  2017-06-29           86.9         1.0195         6.4324

       player_name  batter  pitcher     events              description
0  Clayton Kershaw  458913   477132  strikeout  swinging_strike_blocked
1  Clayton Kershaw  458913   477132       null                     ball

   ...  release_pos_y  estimated_ba_using_speedangle
0  ...        54.5463                            0.0
1  ...        54.7625                            0.0

   estimated_woba_using_speedangle  woba_value  woba_denom  babip_value
0                              0.0        0.00           1            0
1                              0.0        null        null         null

   iso_value  launch_speed_angle  at_bat_number  pitch_number
0          0               false             57             6
1       null                null             57             5

[2 rows x 78 columns]
Pitching Stats
Pitching stats for players across multiple seasons, single seasons, or during a specified time period
This package contains two main functions for obtaining pitching data. For league-wide season-level pitching data, use the function pitching_stats(start_season, end_season). This will return one row per player per season, and provide all metrics made available by FanGraphs.
The second is pitching_stats_range(start_dt, end_dt). This allows you to obtain pitching data over a specific time interval, so you can get more granular than the FanGraphs function (for example, to see which pitcher had the strongest month of May). This query pulls data from Baseball Reference. Note that all dates should be in YYYY-MM-DD format.
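For the month-of-May case, the two date strings can be built with the standard library rather than typed by hand. This is a minimal sketch; `month_range` is a hypothetical helper and is not provided by baseball_scraper.

```python
import calendar
from datetime import date


def month_range(year, month):
    """Return (first_day, last_day) of a month as 'YYYY-MM-DD' strings.

    Hypothetical helper, not part of baseball_scraper.
    """
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    return date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()


start_dt, end_dt = month_range(2017, 5)
print(start_dt, end_dt)  # 2017-05-01 2017-05-31
```

The resulting strings are already in the YYYY-MM-DD format and can be passed as start_dt and end_dt to pitching_stats_range.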
If you prefer Baseball Reference to FanGraphs, there is actually a third option called pitching_stats_bref(season). This works the same as pitching_stats, but retrieves its data from Baseball Reference instead. This is typically not recommended, however, because the Baseball Reference query currently can only retrieve one season’s worth of data per request.
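Because the Baseball Reference query covers one season per request, a multi-season pull needs one call per year. The sketch below shows that loop with the fetch function passed in as a parameter. `collect_seasons` and `fake_fetch` are hypothetical names used only for illustration, and `fake_fetch` stands in for pitching_stats_bref so the example runs without network access.

```python
def collect_seasons(fetch_season, start_season, end_season):
    """Call fetch_season(year) once per season and return {year: result}.

    Hypothetical helper, not part of baseball_scraper.
    """
    if start_season > end_season:
        raise ValueError("start_season must not be after end_season")
    return {year: fetch_season(year) for year in range(start_season, end_season + 1)}


def fake_fetch(season):
    """Stand-in for pitching_stats_bref; returns placeholder rows, not real data."""
    return [{"Season": season, "Name": "placeholder"}]


results = collect_seasons(fake_fetch, 2014, 2016)
print(sorted(results))  # [2014, 2015, 2016]
```

With the package installed, pitching_stats_bref could be passed as fetch_season. If each result is a DataFrame, pandas.concat(list(results.values())) would combine them into one table.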
>>> from baseball_scraper import pitching_stats
>>> data = pitching_stats(2012, 2016)
>>> data.head()
     Season             Name     Team   Age     W    L   ERA  WAR     G    GS
336  2015.0  Clayton Kershaw  Dodgers  27.0  16.0  7.0  2.13  8.6  33.0  33.0
236  2014.0  Clayton Kershaw  Dodgers  26.0  21.0  3.0  1.77  7.6  27.0  27.0
472  2014.0     Corey Kluber  Indians  28.0  18.0  9.0  2.44  7.4  34.0  34.0
235  2015.0     Jake Arrieta     Cubs  29.0  22.0  6.0  1.77  7.3  33.0  33.0
256  2013.0  Clayton Kershaw  Dodgers  25.0  16.0  9.0  1.83  7.1  33.0  33.0

     ...  wSL/C (pi)  wXX/C (pi)  O-Swing% (pi)  Z-Swing% (pi)
336  ...        1.76       22.85          0.364          0.665
236  ...        2.62         NaN          0.371          0.670
472  ...        3.92         NaN          0.336          0.598
235  ...        2.42         NaN          0.329          0.618
256  ...        0.74         NaN          0.339          0.635

     Swing% (pi)  O-Contact% (pi)  Z-Contact% (pi)  Contact% (pi)  Zone% (pi)
336        0.511            0.478            0.811          0.689       0.487
236        0.525            0.536            0.831          0.730       0.515
472        0.468            0.485            0.886          0.744       0.505
235        0.468            0.595            0.856          0.762       0.483
256        0.484            0.563            0.873          0.763       0.492

     Pace (pi)
336       23.4
236       23.7
472       24.6
235       23.3
256       23.4

[5 rows x 299 columns]
Batting Stats
Hitting stats for players within seasons or during a specified time period
Batting stats are obtained similarly to pitching stats. The function call for getting season-level stats is batting_stats(start_season, end_season), and for a particular time range it is batting_stats_range(start_dt, end_dt). The Baseball Reference equivalent for season-level data is batting_stats_bref(season).
>>> from baseball_scraper import batting_stats_range
>>> data = batting_stats_range('2017-05-01', '2017-05-08')
>>> data.head()
          Name  Age  #days     Lev          Tm  G  PA  AB  R  H  ...  HBP
1   Jose Abreu   30     69  MLB-AL     Chicago  7  31  30  5  9  ...    0
2   Lane Adams   27     69  MLB-NL     Atlanta  6   6   6  0  2  ...    0
3   Matt Adams   28     68  MLB-NL   St. Louis  6   9   9  2  4  ...    0
4   Jim Adduci   32     69  MLB-AL     Detroit  6  24  21  3  5  ...    0
5  Tim Adleman   29     72  MLB-NL  Cincinnati  1   2   2  0  0  ...    0

   SH  SF  GDP  SB  CS     BA    OBP    SLG    OPS  mlb_ID
1   0   0    1   0   0  0.300  0.323  0.667  0.989  547989
2   0   0    1   1   0  0.333  0.333  0.333  0.667  572669
3   0   0    0   0   0  0.444  0.444  0.778  1.222  571431
4   0   0    0   0   0  0.238  0.333  0.381  0.714  451192
5   0   0    0   0   0  0.000  0.000  0.000  0.000  534947

[5 rows x 28 columns]
FanGraphs
Various baseball projections are available at fangraphs.com. You can scrape that site using the fangraphs API. You supply it the FanGraphs player ID to look up and the projection system. It will return a DataFrame with the projections.
Note: due to the use of JavaScript on that site, we use Chrome through selenium to scrape the data. Chrome must be installed on your system in order to use these APIs.
>>> from baseball_scraper import fangraphs
>>> from baseball_id import Lookup
>>> player_id = Lookup.from_names(['Khris Davis']).iloc[0].fg_id
>>> fangraphs.Scraper.instances()
['Steamer (RoS)', 'Steamer (Update)', 'ZiPS (Update)', 'Steamer600 (Update)', 'Depth Charts (RoS)', 'THE BAT (RoS)']
>>> fg = fangraphs.Scraper("Steamer (RoS)")
>>> df = fg.scrape(player_id, scrape_as=fangraphs.ScrapeType.HITTER)
>>> df.columns
Index(['index', 'Name', 'Team', 'G', 'PA', 'AB', 'H', '2B', '3B', 'HR', 'R',
       'RBI', 'BB', 'SO', 'HBP', 'SB', 'CS', '-1', 'AVG', 'OBP', 'SLG', 'OPS',
       'wOBA', '-1.1', 'wRC+', 'BsR', 'Fld', '-1.2', 'Off', 'Def', 'WAR',
       'playerid'],
      dtype='object')
>>> df
   index         Name       Team   G   PA   AB   H  ...  BsR  Fld  -1.2  Off  Def  WAR  playerid
      60  Khris Davis  Athletics  56  242  214  53  ... -0.7 -0.1   NaN  4.8 -5.9  0.7      9112

[1 rows x 32 columns]
>>> player_id = Lookup.from_names(['Max Scherzer']).iloc[0].fg_id
>>> df = fg.scrape(player_id, scrape_as=fangraphs.ScrapeType.PITCHER)
>>> df.columns
Index(['index', 'Name', 'Team', 'W', 'L', 'ERA', 'GS', 'G', 'SV', 'IP', 'H',
       'ER', 'HR', 'SO', 'BB', 'WHIP', 'K/9', 'BB/9', 'FIP', 'WAR', 'RA9-WAR',
       'playerid'],
      dtype='object')
>>> df
   index          Name       Team  W  L   ERA  ...    K/9  BB/9   FIP  WAR  RA9-WAR  playerid
0      5  Max Scherzer  Nationals  6  3  3.04  ...  12.36  2.13  2.93  2.2      2.4      3137

[1 rows x 22 columns]
Game-by-Game Results and Schedule
The baseball_reference team scraper returns a team’s game-by-game results for a given season or date range. The resulting DataFrame includes game date, home and away teams, end result (W/L/Tie), score, winning/losing/saving pitcher, attendance, and division rank at that date.
The example below creates the scraper once, sets a season or date range on it, and passes the team to scrape, so one scraper can be reused across seasons, date ranges, and teams. The team name provided is the abbreviation (e.g. NYY for New York Yankees, SEA for Seattle Mariners).
If the season argument is set to the current season, the query returns results for past games and the schedule for those that have not been played yet.
>>> # Example: Let's take a look at the individual-game results of the 1927 Yankees
>>> from baseball_scraper import baseball_reference
>>> s = baseball_reference.TeamScraper()
>>> s.set_season(1927)
>>> data = s.scrape('NYY')
>>> data.head()
                Date   Tm Home_Away  Opp W/L     R   RA   Inn  W-L  Rank  \
1    Tuesday, Apr 12  NYY      Home  PHA   W   8.0  3.0   9.0  1-0   1.0
2  Wednesday, Apr 13  NYY      Home  PHA   W  10.0  4.0   9.0  2-0   1.0
3   Thursday, Apr 14  NYY      Home  PHA   T   9.0  9.0  10.0  2-0   1.0
4     Friday, Apr 15  NYY      Home  PHA   W   6.0  3.0   9.0  3-0   1.0
5   Saturday, Apr 16  NYY      Home  BOS   W   5.0  2.0   9.0  4-0   1.0

       GB      Win     Loss  Save  Time D/N  Attendance  Streak
1    Tied     Hoyt    Grove  None  2:05   D     72000.0       1
2  up 0.5  Ruether     Gray  None  2:15   D      8000.0       2
3    Tied     None     None  None  2:50   D      9000.0       2
4    Tied  Pennock    Ehmke  None  2:27   D     16000.0       3
5  up 1.0  Shocker  Ruffing  None  2:05   D     25000.0       4
>>> # Let's get the games a team plays in a given week.
>>> import datetime as dt
>>> s.set_date_range(dt.datetime(2019,6,2), dt.datetime(2019,6,8))
>>> df = s.scrape('TOR')
>>> df.head()
          Date   Tm Home_Away  Opp W/L     R  ...     Save  Time D/N  Attendance  Streak  Orig. Scheduled
59  2019-06-02  TOR         @  COL   L   1.0  ...     None  2:43   D     37861.0    -6.0             None
60  2019-06-04  TOR      Home  NYY   W   4.0  ...    Giles  3:00   N     20671.0     1.0             None
61  2019-06-05  TOR      Home  NYY   W  11.0  ...     None  3:22   N     16609.0     2.0             None
62  2019-06-06  TOR      Home  NYY   L   2.0  ...  Chapman  3:07   N     25657.0    -1.0             None
63  2019-06-07  TOR      Home  ARI   L   2.0  ...     None  2:50   N     16555.0    -2.0             None

[5 rows x 19 columns]
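The week used above runs from Sunday 2019-06-02 to Saturday 2019-06-08. A small helper can compute those bounds for any date instead of hard-coding them. `week_bounds` is a hypothetical name, and the Sunday-to-Saturday convention is an assumption taken from that example.

```python
import datetime as dt


def week_bounds(day):
    """Return (sunday, saturday) datetimes for the week containing `day`.

    Hypothetical helper, not part of baseball_scraper.
    """
    days_since_sunday = (day.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    midnight = dt.datetime(day.year, day.month, day.day)
    sunday = midnight - dt.timedelta(days=days_since_sunday)
    return sunday, sunday + dt.timedelta(days=6)


start, end = week_bounds(dt.datetime(2019, 6, 5))
print(start.date(), end.date())  # 2019-06-02 2019-06-08
```

The two values can be passed to set_date_range in place of the literal datetimes.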
Team List
Through the TeamSummaryScraper you can pull a list of active teams and a few attributes about each from baseball-reference. This is useful mainly to get the list of abbreviations that baseball-reference uses for each team, as there doesn’t seem to be a standard. This comes in handy when you want to use another baseball-reference scraper, such as the team scraper above, and need to input the team abbreviation.
>>> from baseball_scraper import baseball_reference
>>> tss = baseball_reference.TeamSummaryScraper()
>>> df = tss.scrape(2019)
>>> df.columns
Index([u'Franchise', u'Tm', u'#Bat', u'BatAge', u'R/G', u'G', u'PA', u'AB',
       u'R', u'H', u'2B', u'3B', u'HR', u'RBI', u'SB', u'CS', u'BB', u'SO',
       u'BA', u'OBP', u'SLG', u'OPS', u'OPS+', u'TB', u'GDP', u'HBP', u'SH',
       u'SF', u'IBB', u'LOB'],
      dtype='object')
>>> df[df.Franchise.str.endswith("Rays")].abbrev.iloc(0)[0]
'TBR'
Standings
The standings(season) function gives division standings for a given season. If the current season is chosen, it will give the most current set of standings. Otherwise, it will give the end-of-season standings for each division for the chosen season.
This function returns a list of dataframes. Each dataframe is the standings for one of MLB’s six divisions.
>>> from baseball_scraper import standings
>>> data = standings(2016)[4]
>>> print(data)
                    Tm    W   L  W-L%    GB
1         Chicago Cubs  103  58  .640    --
2  St. Louis Cardinals   86  76  .531  17.5
3   Pittsburgh Pirates   78  83  .484  25.0
4    Milwaukee Brewers   73  89  .451  30.5
5      Cincinnati Reds   68  94  .420  35.5
ESPN
With the ESPN scraper you can pull probable starters for a given date range. This can be useful to see the two-start pitchers for a given week and their team matchups.
>>> from baseball_scraper import espn
>>> from datetime import datetime
>>> es = espn.ProbableStartersScraper(datetime(2019,8,5), datetime(2019,8,11))
>>> df = es.scrape()
>>> df.head(10)
        Date              Name  espn_id opponent
0 2019-08-05   Sandy Alcantara    35241      NYM
1 2019-08-05      Jacob deGrom    32796      MIA
2 2019-08-05      Jordan Lyles    31061      PIT
3 2019-08-05     Dario Agrazal    39813      MIL
4 2019-08-05   Masahiro Tanaka    33150      BAL
5 2019-08-05      Gabriel Ynoa    33651      NYY
6 2019-08-05     Lucas Giolito    32697      DET
7 2019-08-05  Spencer Turnbull    33732      CHW
8 2019-08-05   Mike Montgomery    31092      BOS
9 2019-08-05     Rick Porcello    29966       KC
>>> df[df.duplicated(['espn_id'])].Name
127     Masahiro Tanaka
128    Jacob Waguespack
130       Rick Porcello
131     Vince Velasquez
132     Jeff Samardzija
133     Mike Montgomery
134    Spencer Turnbull
135         Mike Soroka
136     Sandy Alcantara
139       Jake Odorizzi
146      Kyle Hendricks
153      Charlie Morton
155      Andrew Cashner
157      Trent Thornton
158         Jakob Junis
159       Daniel Norris
160        Jacob deGrom
161           Max Fried
162     Jordan Yamamoto
163          Jon Lester
164       Luis Castillo
165       Chris Bassitt
166       Lucas Giolito
167          Mike Minor
168        Jordan Lyles
169         Zach Plesac
170        Jose Berrios
171       Dario Agrazal
172       Michael Wacha
173      German Marquez
174      Dinelson Lamet
175       Merrill Kelly
176        Wade LeBlanc
177        Jake Arrieta
Name: Name, dtype: object
>>> df[df.Name == "Charlie Morton"]
          Date            Name  espn_id opponent
15  2019-08-05  Charlie Morton    29155      TOR
153 2019-08-10  Charlie Morton    29155      SEA
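The two-start check above amounts to finding every espn_id that appears more than once in the week. The sketch below shows the same idea with the standard library, using a small hand-copied subset of the rows shown above rather than a live scrape. The variable names are illustrative only.

```python
from collections import Counter

# (date, name, espn_id, opponent) rows copied from the sample output above.
starts = [
    ("2019-08-05", "Charlie Morton", 29155, "TOR"),
    ("2019-08-05", "Gabriel Ynoa", 33651, "NYY"),
    ("2019-08-10", "Charlie Morton", 29155, "SEA"),
]

# Count how many starts each pitcher (keyed by espn_id) has in the window.
counts = Counter(espn_id for _, _, espn_id, _ in starts)

# Keep the matchups of pitchers with two or more starts.
two_start = {}
for game_date, name, espn_id, opponent in starts:
    if counts[espn_id] >= 2:
        two_start.setdefault(name, []).append((game_date, opponent))

print(two_start)
# {'Charlie Morton': [('2019-08-05', 'TOR'), ('2019-08-10', 'SEA')]}
```

Keying on espn_id rather than the name avoids merging two different pitchers who share a name.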
Complete Documentation
So far this has provided a basic overview of what this package can do and how you can use it. For full documentation on available functions and their arguments, see the [docs](https://github.com/spilchen/baseball_scraper/tree/master/docs) folder.
Installation
To install baseball_scraper, simply run
pip install baseball_scraper
or, for the version currently in the repo (which may at times be more up to date):
git clone https://github.com/spilchen/baseball_scraper
cd baseball_scraper
python setup.py install
Testing
We use pytest for testing this package. It can be invoked as follows:
python setup.py test
Dependencies
This library depends on: Pandas, NumPy, bs4 (beautiful soup), and Requests.
Download files
Source Distribution
Hashes for baseball_scraper-0.4.10.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ddf432ac4454fd7e3eb3590d68e945ca7c06c204c0e13010129edc5ed56b1da2
MD5 | fff740b0c522f64c9a34daeaa3dc04ff
BLAKE2b-256 | 10b432e0c1d14312477433e28f608b549ae533da6a3169640acea0d229ea06e9