Presenting Data – Referee Crew Calls in the NFL

One of the great things about computers is their ability to take tabular data and turn it into pictures that are easier to interpret. I’m always amazed that, when given the opportunity to show data as a picture, more people don’t jump at the chance.

For example, this piece on ESPN has some great data on how different officiating crews call games.

Two things I find a bit disconcerting:

  1. One of the rows is missing data so that row looks ‘odd’ in the context of the story and makes it look like the writer missed a big thing … they didn’t (it’s since been fixed)
  2. This tabular format is just begging to be displayed as a picture.

Perhaps the issue here is that the author didn’t know how to best visualize the data to make his story, but I’m going to help him out.

If we start from the underlying premise that not all officiating crews call games in the same way, we want to see in what ways they differ.

The data below is a reproduction of the table from the article:

Crew                Def. Offside  Encroachment  Offensive  Neutral Zone  Total
Triplette, Jeff               39             2         34             6     81
Anderson, Walt                12             2         39            10     63
Blakeman, Clete               13             2         41             7     63
Hussey, John                  10             3         42             3     58
Cheffers, Cartlon             22             0         31             3     56
Corrente, Tony                14             1         31             8     54
Steratore, Gene               19             1         29             5     54
Torbert, Ronald                9             4         31             7     51
Allen, Brad                   15             1         28             6     50
McAulay, Terry                10             4         23            12     49
Vinovich, Bill                 8             7         29             5     49
Morelli, Peter                12             3         24             9     48
Boger, Jerome                 11             3         27             6     47
Wrolstad, Craig                9             1         31             5     46
Hochuli, Ed                    5             2         33             4     44
Coleman, Walt                  9             2         25             4     40
Parry, John                    7             5         20             6     38

The author points out:

Jeff Triplette’s crew has called a combined 81 such penalties — 18 more than the next-highest crew and more than twice the amount of two others

The author goes on to talk about his interview with Mike Pereira (who happens to be promoting his new book).

While the table above is helpful, it’s not an image that you can look at and ask, “Man, what the heck is going on?” There is a visceral aspect to a picture that says something is wrong here … even if I can’t really be sure about what it is.

Let’s sum up the defensive penalties (Defensive Offsides, Encroachment, and Neutral Zone Infractions) and see what the table looks like:

Crew                Defensive  Offensive  Total
Triplette, Jeff            47         34     81
Anderson, Walt             24         39     63
Blakeman, Clete            22         41     63
Hussey, John               16         42     58
Cheffers, Cartlon          25         31     56
Corrente, Tony             23         31     54
Steratore, Gene            25         29     54
Torbert, Ronald            20         31     51
Allen, Brad                22         28     50
McAulay, Terry             26         23     49
Vinovich, Bill             20         29     49
Morelli, Peter             24         24     48
Boger, Jerome              20         27     47
Wrolstad, Craig            15         31     46
Hochuli, Ed                11         33     44
Coleman, Walt              15         25     40
Parry, John                18         20     38
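The collapsed table can be reproduced from the original counts with a few lines of Python. This is only a minimal sketch, not the code from the repo: the column order is assumed from the way the rows add up, the names `crews` and `collapse` are made up for illustration, and just three crews are shown to keep it short.

```python
# Per-crew counts copied from the first table. Assumed column order:
# (defensive offside, encroachment, offensive, neutral zone infraction, total)
crews = {
    "Triplette, Jeff": (39, 2, 34, 6, 81),
    "Anderson, Walt": (12, 2, 39, 10, 63),
    "Parry, John": (7, 5, 20, 6, 38),
}


def collapse(row):
    """Return (defensive, offensive, total) for one crew's row."""
    offside, encroachment, offensive, neutral_zone, total = row
    defensive = offside + encroachment + neutral_zone
    # Sanity check: the two groups should add back up to the published total.
    assert defensive + offensive == total
    return defensive, offensive, total


collapsed = {name: collapse(row) for name, row in crews.items()}

for name, (defensive, offensive, total) in collapsed.items():
    print(f"{name:<18}{defensive:>4}{offensive:>4}{total:>4}")
```

The built-in check that defensive plus offensive equals the total is a cheap way to catch a mistyped number when copying a table by hand.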

Now we can see what might actually be going on, but it’s still a bit hard for visual people. If we take this data and generate a scatter plot, we might have a picture that shows us the issue. Something like this:

The horizontal dashed blue line represents the average defensive calls per crew while the vertical dashed blue line represents the average offensive calls per crew. The gray box represents the area containing plus/minus 2 standard deviations from the mean for both offensive and defensive penalty calls.
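A plot like the one described above can be sketched roughly as follows. This is an illustration rather than the code from the repo: it assumes the sample standard deviation (the post doesn't say which flavor was used), and `outside_box`, `plot_crews`, and the output file name are hypothetical. The averages and the 2 standard deviation box are computed with the standard library, and matplotlib is only imported inside the plotting function.

```python
import statistics

# (defensive, offensive) calls per crew, copied from the collapsed table.
calls = {
    "Triplette, Jeff": (47, 34), "Anderson, Walt": (24, 39),
    "Blakeman, Clete": (22, 41), "Hussey, John": (16, 42),
    "Cheffers, Cartlon": (25, 31), "Corrente, Tony": (23, 31),
    "Steratore, Gene": (25, 29), "Torbert, Ronald": (20, 31),
    "Allen, Brad": (22, 28), "McAulay, Terry": (26, 23),
    "Vinovich, Bill": (20, 29), "Morelli, Peter": (24, 24),
    "Boger, Jerome": (20, 27), "Wrolstad, Craig": (15, 31),
    "Hochuli, Ed": (11, 33), "Coleman, Walt": (15, 25),
    "Parry, John": (18, 20),
}

defensive = [d for d, _ in calls.values()]
offensive = [o for _, o in calls.values()]

def_mean = statistics.mean(defensive)
off_mean = statistics.mean(offensive)
# Assumption: sample standard deviation (n - 1 in the denominator).
def_sd = statistics.stdev(defensive)
off_sd = statistics.stdev(offensive)


def outside_box(d, o, k=2):
    """True if a crew is more than k standard deviations out on either axis."""
    return abs(d - def_mean) > k * def_sd or abs(o - off_mean) > k * off_sd


outliers = [name for name, (d, o) in calls.items() if outside_box(d, o)]
print(outliers)


def plot_crews(path="crew_calls.png"):
    """Save the scatter plot: offensive calls on x, defensive calls on y."""
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    fig, ax = plt.subplots()
    # Gray box: plus/minus 2 standard deviations around both means.
    ax.add_patch(Rectangle((off_mean - 2 * off_sd, def_mean - 2 * def_sd),
                           4 * off_sd, 4 * def_sd, color="0.85", zorder=0))
    ax.scatter(offensive, defensive, zorder=2)
    ax.axhline(def_mean, color="b", linestyle="--")  # average defensive calls
    ax.axvline(off_mean, color="b", linestyle="--")  # average offensive calls
    ax.set_xlabel("Offensive calls")
    ax.set_ylabel("Defensive calls")
    fig.savefig(path)
```

With these numbers the only crew that lands outside the box is Jeff Triplette's, and only on the defensive axis. Calling `plot_crews()` writes the picture to disk.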

Notice anything? Yeah, me too. Jeff Triplette’s crew is so far out of range for defensive penalties it’s like they’re watching a different game, or reading from a different playbook.

What I’d really like to be able to do is this same analysis but on a game-by-game basis. I don’t think this would really change the way that Jeff Triplette and his crew call games, but it may point out some other inconsistencies that are worth exploring.

Code for this project can be found on my GitHub Repo


Rogue One – A Star Wars Story

Rogue One – A Star Wars Movie: My Thoughts

Today I watched Rogue One and I wanted to jot down my thoughts while they were still fresh.

First, what I didn’t like:

  1. The Rogue One font at the beginning of the movie. There was just something about it that wasn’t as strong as the original franchise’s.
  2. The jumping around done at the beginning of the movie with the planet names (again, with a weak font).
  3. There were no Bothans, either dying or otherwise.

OK, now that that’s out of the way, what did I like?

Every. Thing. Else.

Jyn’s character had the depth needed to be a protagonist you would both believe and want to follow. I think the most surprising thing (maybe … I still need to think about this) is that, from a writing perspective, you know ALL of your characters are going to be throwaway characters. They won’t appear in Episodes 4-6, although the actions they take drive those movies.

As I realized this, I realized that each lead character was going to die. It can’t really be any other way. And while I was sad to see that premonition come to fruition, I was also glad that the writers did what the story called for: make the story a one-off whose characters can’t influence the canon in any way other than how they already had.

Maybe I went into the movie with low expectations, or maybe it was just that good. Either way, I would see this again and again and again.

The story was strong, with dynamic characters. The rebels were a mix of good and bad (as it should be), and the Imperial characters were all bad, but with depth.

I liked this so much I have already pre-purchased it on iTunes.


It’s Science!

I have a 10-year-old daughter in the fifth grade. She has participated in the Science Fair almost every year, but this year was different. This year, participation was required.

dun … dun … dun …

She and her friend had a really interesting idea on what to do. They wanted to ask the question, “Is Soap and Water the Best Cleaning Method?”

The two Scientists decided that they would test how well the following cleaning agents cleaned a white t-shirt (my white t-shirt actually) after it got dirty:

  • Plain Water
  • Soap and Water
  • Milk
  • Almond Milk

While working with them we experimented on how to make the process as scientific as possible. Our first attempt was to just take a picture of the Clean shirt, cut the shirt up and get it dirty. Then we’d try each cleaning agent to see how it went.

It did not go well. It was immediately apparent that there would be no way to test the various cleaning methods’ efficacy.

No problem. In our second trial we decided to approach it more scientifically.

We would draw 12 equally sized squares on the shirt and take a picture:

We needed 12 squares because we had 4 cleaning methods and 3 trials to perform:

4 Cleaning Methods X 3 Trials = 12 Samples

Next, the Scientists would get the shirt dirty. We then cut out the squares so that we could test cleaning the samples.

Here’s an outline of what the Scientists did to test their hypothesis:

  1. Take a picture of each piece BEFORE they get dirty
  2. Get each sample dirty
  3. Take a picture of each dirty sample
  4. Clean each sample
  5. Take a picture of each cleaned sample
  6. Repeat for each trial
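To get a feel for how much bookkeeping the steps above create, here is a small sketch that enumerates every sample and every photo. The trial, method, and state labels are the ones used in the code later in this post; the file-name pattern is hypothetical and only there for illustration.

```python
from itertools import product

trials = ["Trial1", "Trial2", "Trial3"]
methods = ["Water", "SoapAndWater", "Milk", "AlmondMilk"]
states = ["Start", "Dirty", "Clean"]

# One fabric square per (trial, method) pair.
samples = list(product(trials, methods))

# One photo per sample per state. The naming pattern here is an assumption.
photos = [f"{t}_{s}_{m}.png" for (t, m), s in product(samples, states)]

print(len(samples), "samples,", len(photos), "photos")
```

Twelve squares photographed in three states comes to 36 images, which is why a consistent naming scheme matters once the pictures need to be processed automatically.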

For the ‘Clean Each Sample’ step they placed 1/3 of a cup of the cleaning solution into a small sealable Tupperware tub and shook it vigorously for 5 minutes. They had some tired arms at the end.

Once we had performed the experiment we had our raw data:

[Images: photo grids for Trials 1, 2, and 3, with one row per cleaning method (e.g. Soap And Water, Almond Milk) and columns for the Start, Dirty, and Cleaned states]

This is great and all, but now what? We can’t really use subjective measures to determine cleanliness and call it science!

My daughter and her friend aren’t coders, but I did explain to them that we needed a more scientific way to determine cleanliness. I suggested that we use Python to examine each image and determine its brightness.

We could then use some math to compare the brightness. [1]

Now, onto the code!

OK, let’s import some libraries:

from PIL import Image, ImageStat
import math
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

There are 2 functions to determine brightness that I found here. They were super useful for this project. As an aside, I love StackOverflow!

#Convert image to greyscale, return average pixel brightness.
def brightness01( im_file ):
   im = Image.open(im_file).convert('L')
   stat = ImageStat.Stat(im)
   return stat.mean[0]

#Convert image to greyscale, return RMS pixel brightness.
def brightness02( im_file ):
   im = Image.open(im_file).convert('L')
   stat = ImageStat.Stat(im)
   return stat.rms[0]
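The two functions differ only in which statistic they read back: the plain average of the pixel values, or their root mean square. To show what that difference means, here is a tiny sketch on plain lists instead of images. The helper names and the pixel values are made up for illustration.

```python
import math


def mean_brightness(pixels):
    """Plain average of greyscale values (0 = black, 255 = white)."""
    return sum(pixels) / len(pixels)


def rms_brightness(pixels):
    """Root mean square: square each value, average, then take the square root."""
    return math.sqrt(sum(p * p for p in pixels) / len(pixels))


# Two hypothetical four-pixel "images" with the same average brightness.
even = [120, 120, 120, 120]   # uniformly grey
patchy = [240, 240, 0, 0]     # half bright, half black

print(mean_brightness(even), rms_brightness(even))
print(mean_brightness(patchy), rms_brightness(patchy))
```

Both lists have the same mean, but the RMS of the patchy one is higher because squaring gives the bright pixels more weight. That is one reason to record both numbers for each photo.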

The next block of code processes the images to get both brightness levels and collects them in a DataFrame that is then written to a csv file.

I named the files in such a way so that I could automate this. It was a bit tedious (and I did have the scientists help) but they were struggling to understand why we were doing what we were doing. Turns out teaching CS concepts is harder than it looks.

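f = []
img_brightness01 = []
img_brightness02 = []
trial = []
state = []
method = []
# Walk the image folder; recursive=True lets '**' match nested sub-folders.
for filename in glob.glob('/Users/Ryan/Dropbox/Abby/Science project 2016/cropped images/**/*', recursive=True):
    # Assumed reconstruction of the loop body: skip folders and non-image files
    if not filename.endswith(('.png', '.jpg')):
        continue
    f.append(filename.split('/')[-1])
    img_brightness01.append(brightness01(filename))
    img_brightness02.append(brightness02(filename))
# Assumed file naming: <Trial>_<State>_<Method>.png, e.g. Trial1_Start_Water.png
for part in f:
    trial.append(part.split('_')[0])
    state.append(part.split('_')[1])
    method.append(part.split('_')[2].replace('.png', '').replace('.jpg',''))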

dic = {'TrialNumber': trial, 'SampleState': state, 'CleaningMethod': method, 'BrightnessLevel01': img_brightness01, 'BrightnessLevel02': img_brightness02}

results = pd.DataFrame(dic)

I’m writing the output to a csv file here so that the Scientists would have their data to make their graphs. This is where my help with them ended.

#write to a csv file
results.to_csv('/Users/Ryan/Dropbox/Abby/Science project 2016/results.csv')

Something I wanted to do though was to see what our options were in Python for creating graphs. Part of the reason this wasn’t included with the science project is that we were on a time crunch and it was easier for the Scientists to use Google Docs to create their charts, and part of it was that I didn’t want to cheat them out of creating the charts on their own.

The score formula below, a normalized percentage, is the one the Scientists used, but I did the graphing portion after the project was turned in.

Let’s get the setup out of the way:

#Create Bar Charts
trials = ['Trial1','Trial2','Trial3']

n_trials = len(trials)
index = np.arange(n_trials)
bar_width = 0.25
bar_buffer = 0.05
opacity = 0.4

graph_color = ['b', 'r', 'g', 'k']
methods = ['Water', 'SoapAndWater', 'Milk', 'AlmondMilk']

graph_data = []

Now, let’s loop through each cleaning method and generate a list of scores (one score per trial):

for singlemethod in methods:
    score = []
    for trialnumber in trials:
        # Rows for this cleaning method and this trial
        rows = results[(results['CleaningMethod'] == singlemethod) & (results['TrialNumber'] == trialnumber)]
        # Mean-brightness reading for each sample state
        s = rows.loc[rows['SampleState'] == 'Start', 'BrightnessLevel01'].iloc[0]
        d = rows.loc[rows['SampleState'] == 'Dirty', 'BrightnessLevel01'].iloc[0]
        c = rows.loc[rows['SampleState'] == 'Clean', 'BrightnessLevel01'].iloc[0]
        # Share of the brightness lost to dirt that cleaning recovered
        scorepct = float((c - d) / (s - d))
        score.append(scorepct)
    # One list of per-trial scores for each cleaning method
    graph_data.append(score)

This last section was what stumped me for the longest time. I had such a mental block converting from iterating over items in a list to item counts of a list. After much Googling I was finally able to make the breakthrough I needed and found the idea of looping through a range and everything came together:

for i in range(0, len(graph_data)):
    # One set of bars per cleaning method, shifted right by i bar widths
    plt.bar(index + (bar_width)*i, graph_data[i], bar_width-.05, alpha=opacity, color=graph_color[i], label=methods[i])
    plt.xlabel('Trial Number')
    plt.axvline(x=i-.025, color='k', linestyle='--')
    plt.xticks(index+bar_width*2, trials)
    plt.yticks((-1,-.75, -.5, -.25, 0,0.25, 0.5, 0.75, 1))
    plt.ylabel('Brightness Percent Score')
    plt.title('Comparative Brightness Scores')
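For what it’s worth, the same index-plus-item loop can also be written with `enumerate`, which hands back the position and the item together. The sketch below compares the two styles on made-up placeholder scores (not the experiment’s results) and skips the plotting so it runs on its own.

```python
bar_width = 0.25
methods = ['Water', 'SoapAndWater', 'Milk', 'AlmondMilk']
# Placeholder numbers only: one inner list per method, one value per trial.
graph_data = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.2, 0.1, 0.0], [0.3, 0.3, 0.3]]

# Style 1: loop over positions, then look each item up by index.
by_range = []
for i in range(len(graph_data)):
    by_range.append((methods[i], bar_width * i, graph_data[i]))

# Style 2: enumerate yields (position, item) pairs directly.
by_enumerate = []
for i, scores in enumerate(graph_data):
    by_enumerate.append((methods[i], bar_width * i, scores))

print(by_range == by_enumerate)
```

Either way, the position `i` is what drives the horizontal offset of each method’s bars, which is why iterating over the items alone wasn’t enough.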

The final output of this code gives:

From the graph you can see the results are … inconclusive. I’m not sure what the heck happened in Trial 3 but the Scientists were able to make the samples dirtier. Ignoring Trial 3 there is no clear winner in either Trial 1 or Trial 2.

I think it would have been interesting to have 30 – 45 trials and to have tested this with some statistics, but that’s just me wanting to show something to be statistically valid.

I think the best part of all of this was the time I got to spend with my daughter, thinking through the experiment. I think she and her friend learned a bit more about the scientific method (and hey, isn’t that what this type of thing is all about?).

I was also really excited when her friend said, “Science is pretty cool” and then had a big smile on her face.

They didn’t go on to district, or get a blue ribbon, but they won in that they learned how neat science can be.

  1. The score is the ratio of how clean the cleaning method was able to get the sample compared to where it started, i.e. the ratio of the difference of the cleaned sample and the dirty sample to the difference of the starting sample and the dirty sample.
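Written as a function, the ratio in this footnote looks something like the sketch below. The name `cleaning_score` and the brightness readings are hypothetical, chosen to make the arithmetic easy to follow.

```python
def cleaning_score(start, dirty, cleaned):
    """Share of the brightness lost to dirt that cleaning won back."""
    lost = start - dirty
    if lost == 0:
        # The sample never got darker, so there is nothing to normalize by.
        raise ValueError("start and dirty brightness are equal")
    return (cleaned - dirty) / lost


# Hypothetical readings on the 0-255 brightness scale.
print(cleaning_score(200, 100, 150))  # 0.5  (half of the lost brightness recovered)
print(cleaning_score(200, 100, 200))  # 1.0  (back to the starting brightness)
print(cleaning_score(200, 100, 80))   # -0.2 (the sample ended up darker than when dirty)
```

A negative score means the "cleaned" photo was darker than the dirty one, which is the kind of result that showed up in Trial 3.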