
Web Scraping – Passer Data (Part III)

In Part III I’m reviewing the code to populate a DataFrame with Passer data from the current NFL season.

To start, I use the games DataFrame I created in Part II to create four new DataFrames:

  • reg_season_games – All of the Regular Season Games
  • pre_season_games – All of the Pre Season Games
  • gameshome – The Home Games
  • gamesaway – The Away Games

A cool aspect of DataFrames is that you can treat them kind of like temporary tables (at least, that's how I think about them, since I'm mostly a SQL programmer) and create other "tables" from them based on criteria. In the code below I use the nfl_start_date I defined in Part II to split the games DataFrame into preseason and regular season DataFrames. I then split the regular season DataFrame into home and away DataFrames so I don't double count the statistics for the passers.

#Start Section 3

reg_season_games = games.loc[games['match_date'] >= nfl_start_date]
pre_season_games = games.loc[games['match_date'] < nfl_start_date]

gameshome = reg_season_games.loc[reg_season_games['ha_ind'] == 'vs']
gamesaway = reg_season_games.loc[reg_season_games['ha_ind'] == '@']
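
As a quick sanity check (not part of the scraping itself, just something to confirm the splits behave the way I expect), the row counts should add up:

#The pre/regular season split should cover every game ...
print(len(pre_season_games) + len(reg_season_games) == len(games))

#... and home + away should cover the regular season
print(len(gameshome) + len(gamesaway) == len(reg_season_games))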

Next, I set up some variables to be used later:

BASE_URL = 'http://www.espn.com/nfl/boxscore/_/gameId/{0}'

#Create the lists to hold the values for the games for the passers
player_pass_name = []
player_pass_catch = []
player_pass_attempt = []
player_pass_yds = []
player_pass_avg = []
player_pass_td = []
player_pass_int = []
player_pass_sacks = []
player_pass_sacks_yds_lost = []
player_pass_rtg = []
player_pass_week_id = []
player_pass_result = []
player_pass_team = []
player_pass_ha_ind = []
player_match_id = []
player_id = [] #declare the player_id as a list so it doesn't get set to a str by the loop below

headers_pass = ['match_id', 'id', 'Name', 'CATCHES','ATTEMPTS', 'YDS', 'AVG', 'TD', 'INT', 'SACKS', 'YRDLSTSACKS', 'RTG']

Now it’s time to start populating some of the list variables I created above. I take the week_id, result, team_x, and ha_ind columns from the gamesaway DataFrame (I’m sure there is a better way to do this, and I’ll need to revisit it in the future).

player_pass_week_id.append(gamesaway.week_id)
player_pass_result.append(gamesaway.result)
player_pass_team.append(gamesaway.team_x)
player_pass_ha_ind.append(gamesaway.ha_ind)

Now for the looping (everybody’s favorite part!). Using BeautifulSoup I get the div with class col column-one gamepackage-away-wrap. Once I have that, I get the table rows and loop through the data in each row to get what I need from the table holding the passer data. Some interesting things are happening below:

  • The Catches / Attempts and Sacks / Yards Lost stats are each displayed as a single column (even though each column holds two statistics). To fix this I use the string index() method to find the separator ('/' and '-' respectively), take everything to the left of it and everything to the right of it, and append the resulting two values per column to two separate lists (four lists in total); see the sketch right after this list.
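
For example, here is the slicing on its own, outside the loop (the sample strings are just made-up values in the same format as the box score cells):

#A completions/attempts cell like '22/30' splits on '/'
c_att = '22/30'
catches = int(c_att[0:c_att.index('/')])        # 22
attempts = int(c_att[c_att.index('/')+1:])      # 30

#A sacks cell like '4-14' (sacks-yards lost) splits on '-'
sacks_cell = '4-14'
sacks = int(sacks_cell[0:sacks_cell.index('-')])            # 4
sacks_yds_lost = int(sacks_cell[sacks_cell.index('-')+1:])  # 14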

The last line of code gets the ESPN player_id, just in case I need/want to use it later.

for index, row in gamesaway.iterrows():
    print(index)
    try:
        request = requests.get(BASE_URL.format(index))
        table_pass = BeautifulSoup(request.text, 'lxml').find_all('div', class_='col column-one gamepackage-away-wrap')

        pass_ = table_pass[0]
        player_pass_all = pass_.find_all('tr')


        for tr in player_pass_all:
            for td in tr.find_all('td', class_='sacks'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_sacks.append(int(td.text[0:td.text.index('-')]))
                        player_pass_sacks_yds_lost.append(int(td.text[td.text.index('-')+1:]))
            for td in tr.find_all('td', class_='c-att'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_catch.append(int(td.text[0:td.text.index('/')]))
                        player_pass_attempt.append(int(td.text[td.text.index('/')+1:]))
            for td in tr.find_all('td', class_='name'):
                for t in tr.find_all('td', class_='name'):
                    for s in t.find_all('span', class_=''):
                        if t.text != 'TEAM':
                            player_pass_name.append(s.text)
            for td in tr.find_all('td', class_='yds'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_yds.append(int(td.text))
            for td in tr.find_all('td', class_='avg'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_avg.append(float(td.text))
            for td in tr.find_all('td', class_='td'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_td.append(int(td.text))
            for td in tr.find_all('td', class_='int'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_int.append(int(td.text))
            for td in tr.find_all('td', class_='rtg'):
                for t in tr.find_all('td', class_='name'):
                    if t.text != 'TEAM':
                        player_pass_rtg.append(float(td.text))
                        player_match_id.append(index)
            #The code below cycles through the passers and gets their ESPN Player ID
            for a in tr.find_all('a', href=True):
                #Strip the player URL prefix, then keep everything up to the next '/'
                trimmed_href = a['href'].replace("http://www.espn.com/nfl/player/_/id/", "")
                player_id.append(trimmed_href[0:trimmed_href.index('/')])

    except Exception:
        #If a box score can't be fetched or parsed, skip that game and move on
        pass
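
Before stacking these lists into a DataFrame, it's worth making sure they all ended up the same length, since np.column_stack expects that. This is just a quick check I'd run by hand, not part of the scraping itself:

#Each stat list should be the same length as the list of passer names
print(len(player_pass_name), len(player_pass_catch), len(player_pass_yds),
      len(player_pass_rtg), len(player_match_id), len(player_id))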

With all of the data from above, we now populate our DataFrame using specific headers (that’s why we set the headers_pass variable above):

player_passer_data = pd.DataFrame(np.column_stack((
player_match_id,
player_id,
player_pass_name,
player_pass_catch,
player_pass_attempt,
player_pass_yds,
player_pass_avg,
player_pass_td,
player_pass_int,
player_pass_sacks,
player_pass_sacks_yds_lost,
player_pass_rtg
)), columns=headers_pass)

An issue I ran into while playing with the generated DataFrame: even though I had converted the numbers to int (or float) in the for loop above, any time I did something like a sum() on the DataFrame the numbers were concatenated as though they were strings (because they were!). np.column_stack coerces everything into a single dtype, and with the names in the mix that common dtype is string.

After much Googling I came across a useful answer on StackExchange (where else would I find it, right?)

What it does is convert the data type of the columns from string to numeric (int or float, as appropriate):

cols_to_convert = ['TD', 'CATCHES', 'ATTEMPTS', 'YDS', 'INT', 'SACKS', 'YRDLSTSACKS', 'AVG', 'RTG']
player_passer_data[cols_to_convert] = player_passer_data[cols_to_convert].apply(pd.to_numeric)
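
To confirm the conversion took, a quick look at the dtypes does the trick:

#The converted columns should now show int64/float64 instead of object
print(player_passer_data.dtypes)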

OK, so I’ve got a DataFrame with passer data, I’ve got a DataFrame with away game data, now I need to join them. As expected, pandas has a way to join DataFrame data … with the join method obviously!

I create a new DataFrame called game_passer_data by joining player_passer_data with gamesaway on their common key, match_id. I then use set_index to make sure the index stays set to match_id … if I don’t, the index gets reset to an auto-incremented integer.

game_passer_data = player_passer_data.join(gamesaway, on='match_id').set_index('match_id')
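
As far as I can tell, the same join could also be written with pd.merge, matching the match_id column against the index of gamesaway (a minimal sketch, assuming gamesaway is still indexed by the ESPN game id, which is how it came out of Part II; how='left' matches join's default behavior):

#Equivalent to the join above: match the match_id column against gamesaway's index
game_passer_data = pd.merge(player_passer_data, gamesaway,
                            left_on='match_id', right_index=True, how='left').set_index('match_id')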

This is great, but now game_passer_data has all of these extra columns. Below is the result of running game_passer_data.head() from the terminal:

id          Name  CATCHES  ATTEMPTS  YDS  AVG  TD  INT  SACKS  \
match_id
400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4
400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2
400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2
400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0
400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0

           YRDLSTSACKS                        ...                          \
match_id                                      ...
400874518           14                        ...
400874674           11                        ...
400874733           14                        ...
400874599            0                        ...
400874599            0                        ...

           ha_ind  match_date                  opp result          team_x  \
match_id
400874518       @  2016-09-18  washington-redskins      W  Dallas Cowboys
400874674       @  2016-10-02  san-francisco-49ers      W  Dallas Cowboys
400874733       @  2016-10-16    green-bay-packers      W  Dallas Cowboys
400874599       @  2016-11-06     cleveland-browns      W  Dallas Cowboys
400874599       @  2016-11-06     cleveland-browns      W  Dallas Cowboys

          week_id prefix_1             prefix_2               team_y  \
match_id
400874518       2      wsh  washington-redskins  Washington Redskins
400874674       4       sf  san-francisco-49ers  San Francisco 49ers
400874733       6       gb    green-bay-packers    Green Bay Packers
400874599       9      cle     cleveland-browns     Cleveland Browns
400874599       9      cle     cleveland-browns     Cleveland Browns

                                                         url
match_id
400874518  http://www.espn.com/nfl/team/_/name/wsh/washin...
400874674  http://www.espn.com/nfl/team/_/name/sf/san-fra...
400874733  http://www.espn.com/nfl/team/_/name/gb/green-b...
400874599  http://www.espn.com/nfl/team/_/name/cle/clevel...
400874599  http://www.espn.com/nfl/team/_/name/cle/clevel...

That is nice, but not exactly what I want. In order to remove the extra columns I use the drop method, which takes two arguments:

  • the object (or list of labels) to drop
  • an axis, which determines what type of object to drop (0 = rows, 1 = columns)

Below, the object I pass is a list of columns (I figured that part out on my own, since the documentation didn’t explicitly state I could use a list, but I figured, what’s the worst that could happen?).

game_passer_data = game_passer_data.drop(['opp', 'prefix_1', 'prefix_2', 'url'], 1)
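
As a side note, newer versions of pandas also accept a columns keyword, which lets you skip the axis argument entirely (same result, assuming a reasonably recent pandas):

#Same drop, spelled with the columns keyword instead of axis=1
game_passer_data = game_passer_data.drop(columns=['opp', 'prefix_1', 'prefix_2', 'url'])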

Either way, the result is this:

id          Name  CATCHES  ATTEMPTS  YDS  AVG  TD  INT  SACKS  \
match_id
400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4
400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2
400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2
400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0
400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0

           YRDLSTSACKS    RTG ha_ind  match_date result          team_x  \
match_id
400874518           14  103.8      @  2016-09-18      W  Dallas Cowboys
400874674           11  114.7      @  2016-10-02      W  Dallas Cowboys
400874733           14  117.4      @  2016-10-16      W  Dallas Cowboys
400874599            0  141.8      @  2016-11-06      W  Dallas Cowboys
400874599            0  100.0      @  2016-11-06      W  Dallas Cowboys

          week_id               team_y
match_id
400874518       2  Washington Redskins
400874674       4  San Francisco 49ers
400874733       6    Green Bay Packers
400874599       9     Cleveland Browns
400874599       9     Cleveland Browns

I finally have a DataFrame with the data I care about, BUT all of the column names are wonky!

This is easy enough to fix (and probably should have been handled earlier by having the objects I created contain only the necessary columns, but I can fix that later) by simply renaming the columns as below:

game_passer_data.columns = ['id', 'Name', 'Catches', 'Attempts', 'YDS', 'Avg', 'TD', 'INT', 'Sacks', 'Yards_Lost_Sacks', 'Rating', 'HA_Ind', 'game_date', 'Result', 'Team', 'Week', 'Opponent']
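
One thing to keep in mind with assigning to .columns is that it depends entirely on the columns being in exactly that order. A dict-based rename is a bit more defensive, since it only touches the names you list (just a sketch of the alternative, covering the columns whose names actually change):

#rename only touches the columns named in the dict, regardless of their order
game_passer_data = game_passer_data.rename(columns={
    'CATCHES': 'Catches', 'ATTEMPTS': 'Attempts', 'AVG': 'Avg',
    'SACKS': 'Sacks', 'YRDLSTSACKS': 'Yards_Lost_Sacks', 'RTG': 'Rating',
    'ha_ind': 'HA_Ind', 'match_date': 'game_date', 'result': 'Result',
    'team_x': 'Team', 'week_id': 'Week', 'team_y': 'Opponent'
})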

I now get the data I want, with column names to match!

id          Name  Catches  Attempts  YDS  Avg  TD  INT  Sacks  \
match_id
400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4
400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2
400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2
400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0
400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0

           Yards_Lost_Sacks  Rating HA_Ind   game_date Result            Team  \
match_id
400874518                14   103.8      @  2016-09-18      W  Dallas Cowboys
400874674                11   114.7      @  2016-10-02      W  Dallas Cowboys
400874733                14   117.4      @  2016-10-16      W  Dallas Cowboys
400874599                 0   141.8      @  2016-11-06      W  Dallas Cowboys
400874599                 0   100.0      @  2016-11-06      W  Dallas Cowboys

          Week             Opponent
match_id
400874518    2  Washington Redskins
400874674    4  San Francisco 49ers
400874733    6    Green Bay Packers
400874599    9     Cleveland Browns
400874599    9     Cleveland Browns

I’ve posted the code for all three parts to my GitHub Repo.

Work that I still need to do:

  1. Add code to get the home game data
  2. Add code to get data for the other position players
  3. Add code to get data for the defense

When I started this project on Wednesday I had only a bit of exposure to the very basics of Python, plus my background as a developer. I’m still a long way from considering myself proficient in Python, but I know more now than I did three days ago, and for that I’m pretty excited! It’s also given me an excuse (er, reason) to write some stuff, which is a nice side effect.