My First Python Script that does ‘something’

I’ve been interested in Python as a tool for a while, and today I had the chance to see what I could do with it.

With my 12.9-inch iPad Pro set up at my desk, I started out. I have Ole Zorn’s Pythonista 3 installed, so I started on my first script there.

My first task was to scrape something from a website. I tried to start with a website listing doctors, but for some reason the rendered HTML didn’t include anything useful.

So the next best thing was to find a website with staff listed on it. I used my dad’s company and its [staff listing](http://www.graphtek.com/Our-Team) as a starting point.

I started with a quick Google search for “Pythonista web scraping” and came across this post on the Pythonista forums.

That got me this much of my script:

import bs4, requests

myurl = 'http://www.graphtek.com/Our-Team'

def get_beautiful_soup(url):
    return bs4.BeautifulSoup(requests.get(url).text, "html5lib")

soup = get_beautiful_soup(myurl)

Next, I needed to see how to start traversing the HTML to get the elements that I needed. I recalled something I read a while ago and was (luckily) able to find some help.

That got me this:

tablemgmt = soup.findAll('div', attrs={'id':'our-team'})

This was close, but it would only return 2 of the 3 div tags I cared about (the management team has a different id for some reason … )

I did a search for regular expressions and Python, found this useful Stack Overflow question, and saw that if I updated my imports to include re, I could use regular expressions.

Great. I updated the imports section to this:

import bs4, requests, re

And added re.compile to my findAll to get this:

tablemgmt = soup.findAll('div', attrs={'id':re.compile('our-team')})

Now I had all 3 of the div tags I cared about.
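To see why the regular expression picks up more tags than the plain string, here is a minimal sketch using only the standard re module. The id values below are hypothetical stand-ins (the page's real ids aren't shown in this post), and it assumes Beautiful Soup tests a compiled pattern against the attribute with a search, so any id that merely contains 'our-team' counts as a match.

```python
import re

# Hypothetical id values; the real ids on the page may differ.
ids = ["our-team", "our-team-management", "footer"]

pattern = re.compile("our-team")

# A plain string only matches an id that is exactly equal to it ...
exact = [i for i in ids if i == "our-team"]

# ... while a regex search matches any id that contains the pattern.
matched = [i for i in ids if pattern.search(i)]

print(exact)
print(matched)
```

The trade-off is that a loose pattern can also match ids you did not intend, so anchoring it (for example with ^our-team) may be worth considering on a busier page.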

Of course, the next thing I wanted to do was get the information I cared about out of the structure tablemgmt.

When I printed out the results I noticed leading and trailing square brackets, and every time I tried to do something I’d get an error.

It took an embarrassingly long time to realize that I needed to treat tablemgmt as an array (a list, in Python terms). Whoops!

Once I got through that it was straightforward to loop through the data and output it:

list_of_names = []
for i in tablemgmt:
    for row in i.findAll('span', attrs={'class':'team-name'}):
        text = row.text.replace('<span class="team-name"', '')
        if len(text)>0:
            list_of_names.append(text)

list_of_titles = []
for i in tablemgmt:
    for row in i.findAll('span', attrs={'class':'team-title'}):
        text = row.text.replace('<span class="team-title"', '')
        if len(text)>0:
            list_of_titles.append(text)

The last bit I wanted to do was to add some headers and make the lists into a two column multimarkdown table.

OK, first I needed to see how to ‘combine’ the lists into a multidimensional array. Another Google search and … success. Of course the answer would be on Stack Overflow.

With my knowledge of looping through arrays and the function zip, I was able to get this:

for j, k in zip(list_of_names, list_of_titles):
    print('|'+ j + '|' + k + '|')

Which would output this:

|Mike Cheley|CEO/Creative Director|

|Ozzy|Official Greeter|

|Jay Sant|Vice President|

|Shawn Isaac|Vice President|

|Jason Gurzi|SEM Specialist|

|Yvonne Valles|Director of First Impressions|

|Ed Lowell|Senior Designer|

|Paul Hasas|User Interface Designer|

|Alan Schmidt|Senior Web Developer|

This is close, however, it still needs headers.

No problem, just add some static lines to print out:

print('| Name | Title |')
print('| --- | --- |')

And voila, we have a MultiMarkdown table that was scraped from a web page:

| Name | Title |
| --- | --- |
|Mike Cheley|CEO/Creative Director|
|Ozzy|Official Greeter|
|Jay Sant|Vice President|
|Shawn Isaac|Vice President|
|Jason Gurzi|SEM Specialist|
|Yvonne Valles|Director of First Impressions|
|Ed Lowell|Senior Designer|
|Paul Hasas|User Interface Designer|
|Alan Schmidt|Senior Web Developer|

Which will render to this:

| Name          | Title                         |
| ------------- | ----------------------------- |
| Mike Cheley   | CEO/Creative Director         |
| Ozzy          | Official Greeter              |
| Jay Sant      | Vice President                |
| Shawn Isaac   | Vice President                |
| Jason Gurzi   | SEM Specialist                |
| Yvonne Valles | Director of First Impressions |
| Ed Lowell     | Senior Designer               |
| Paul Hasas    | User Interface Designer       |
| Alan Schmidt  | Senior Web Developer          |
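The zip loop and the two static header lines can be folded into one small helper. This is only a sketch: the function name markdown_table is made up for illustration, and it returns the table as a string instead of printing each row. Note that zip stops at the shorter list, so a name without a matching title would be dropped silently.

```python
def markdown_table(names, titles):
    """Build a two-column Markdown table from two parallel lists."""
    # Header row and separator row, same as the static print lines.
    lines = ["| Name | Title |", "| --- | --- |"]
    # zip pairs the first name with the first title, and so on.
    for name, title in zip(names, titles):
        lines.append("|" + name + "|" + title + "|")
    return "\n".join(lines)


names = ["Mike Cheley", "Ozzy"]
titles = ["CEO/Creative Director", "Official Greeter"]
print(markdown_table(names, titles))
```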

Converting Writing Examples from doc to markdown: My Process

All of my writing examples were written while attending the University of Arizona when I was studying Economics.

These writing examples are from 2004 and were written in either Microsoft Word or OpenOffice Writer.

Before getting the files onto GitHub I wanted to convert them into markdown so that they would be in plain text.

I did this mostly as an exercise to see if I could, but in going through it I’m glad I did. Since the files were written in .doc format, and the .doc format has been replaced with the .docx format, it could be that at some point my work would be inaccessible. Now, I don’t have to worry about that.

So, how did I get from a .doc file written in 2004 to a converted markdown file created in 2016? Here’s how:

Round 1

  1. Downloaded the Doc files from my Google Drive to my local Desktop and saved them into a folder called Summaries
  2. Each week of work had its own directory, so I had to go into each directory individually (not sure how to do recursive work yet)
  3. Each of the files was written in 2004 so I had to change the file types from .doc to .docx. This was accomplished with this command:
    textutil -convert docx *.doc
  4. Once the files were converted from .doc to .docx I ran the following commands:
    1. cd ../yyyymmdd where yyyy = YEAR, mm = Month in 2 digits; dd = day in 2 digits
    2. for f in *\ *; do mv "$f" "${f// /_}"; done ^1 - this replaces the space character with an underscore. This was needed so I could run the next command
    3. for file in $(ls *.docx); do pandoc -s -S "${file}" -o "${file%docx}md"; done ^2 - this uses pandoc to convert the docx files into valid markdown files
    4. mv *.md ../ – used to move the .md files into the next directory up
  5. With that done I just needed to move the files from my Summaries directory to my writing-examples GitHub repo. I’m using the GUI for this so I have a folder on my desktop called writing-examples. To move them I just used mv Summaries/*.md writing-examples/

So that’s it. Nothing too fancy, but I wanted to write it down so I can look back on it later and know what the heck I did.

Round 2

The problem I’m finding is that the bulk conversion using textutil isn’t keeping the footnotes from the original .doc file. These are important though, as they reference the original work. Ugh!

Used this command ^5 to recursively replace the spaces in the file names with underscores:

find . -depth -name '* *' | while IFS= read -r f ; do mv -i "$f" "$(dirname "$f")/$(basename "$f"|tr ' ' _)" ; done
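For comparison, the same recursive rename could be sketched in Python with pathlib. This is not what I ran at the time; the function name underscore_names is made up for illustration, and the choice to skip a rename when the target already exists is an assumption about what is safest.

```python
from pathlib import Path


def underscore_names(root):
    """Rename files and directories under root, replacing spaces with underscores."""
    root = Path(root)
    # Collect everything whose name contains a space, deepest paths first,
    # so children are renamed before their parent directories (like find -depth).
    paths = sorted(root.rglob("* *"), key=lambda p: len(p.parts), reverse=True)
    renamed = []
    for path in paths:
        target = path.with_name(path.name.replace(" ", "_"))
        if target.exists():
            # Assumption: skip rather than overwrite an existing file.
            continue
        path.rename(target)
        renamed.append(target)
    return renamed


# Example call (renames things under the current directory):
# underscore_names(".")
```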

Used this command ^3 to convert all of the .doc to .docx at the same time:

find . -name '*.doc' -exec textutil -convert docx '{}' \;

Used this command ^4 to generate the markdown files recursively:

find . -name "*.docx" | while read i; do pandoc -f docx -t markdown "$i" -o "${i%.*}.md"; done;
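The same pandoc call can be driven from Python, which sidesteps shell quoting because each file name is passed as its own argument. This is a sketch under assumptions: it uses the same pandoc options as the command above, pandoc must be on the PATH for the non-dry-run case, and the names pandoc_command and convert_tree are invented here.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path):
    """Return the pandoc argument list that converts one .docx file to Markdown."""
    docx_path = Path(docx_path)
    # Same idea as "${i%.*}.md": swap the extension for .md.
    md_path = docx_path.with_suffix(".md")
    return ["pandoc", "-f", "docx", "-t", "markdown",
            str(docx_path), "-o", str(md_path)]


def convert_tree(root, dry_run=True):
    """Build a command per .docx under root; run them only when dry_run is False."""
    commands = [pandoc_command(p) for p in sorted(Path(root).rglob("*.docx"))]
    if not dry_run:
        for cmd in commands:
            # check=True raises an error if pandoc exits with a failure code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling convert_tree(".") first, with the default dry_run=True, shows which commands would run before anything is written.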

Used this command to move the markdown files:

Never figured out what to do here 🙁
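One possible way to handle this step, sketched in Python rather than as a shell one-liner: walk the tree, move every .md file into a single folder, and add a numeric suffix when two files share a name. The function name collect_markdown and the _1, _2 suffix scheme are assumptions made for illustration, not something I actually used.

```python
import shutil
from pathlib import Path


def collect_markdown(src_root, dest_dir):
    """Move every .md file under src_root into dest_dir, keeping name clashes apart."""
    src_root = Path(src_root).resolve()
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() reads the whole listing up front, before any file is moved.
    for md_file in sorted(src_root.rglob("*.md")):
        if md_file.parent == dest_dir:
            # Already in the destination folder; leave it alone.
            continue
        target = dest_dir / md_file.name
        counter = 1
        # Assumed naming scheme: Name.md, Name_1.md, Name_2.md, ...
        while target.exists():
            target = dest_dir / f"{md_file.stem}_{counter}{md_file.suffix}"
            counter += 1
        shutil.move(str(md_file), str(target))
        moved.append(target)
    return moved
```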

Round 3

OK, I just gave up on using textutil for the conversion. It wasn’t keeping the footnotes on the conversion from .doc to .docx.

Instead I used the Google Drive add-in Converter for Google Drive Document. It converted the .doc to .docx AND kept the footnotes like I wanted it to.

Of course, it output all of the files to the same directory, so the work I did to get the recursion to work previously can’t be applied here. Sigh.

Now, the only commands to run from the terminal are the following:

1. `for f in *\ *; do mv "$f" "${f// /_}"; done` [^1]- this would replace the space character with an underscore. this was needed so I could run the next command
2. `for file in $(ls *.docx); do pandoc -s -S "${file}" -o "${file%docx}md"; done` [^2] - this uses pandoc to convert the docx file into valid markdown files
3. `mv *.md <directory/path>`

Round 4

Like any good bad lazy programmer, I’ve opted for a brute force method of converting the doc files to docx files. I opened each one in Word on macOS and saved it as docx. Problem solved ¯\_(ツ)_/¯

Step 1: Used the command I found here ^7 to recursively replace the spaces in the file names with underscores (_):

find . -depth -name '* *' \
| while IFS= read -r f ; do mv -i "$f" "$(dirname "$f")/$(basename "$f"|tr ' ' _)" ; done

Step 2: Use the command found here ^6 to generate the markdown files recursively:

find . -name "*.docx" | while read i; do pandoc -f docx -t markdown "$i" -o "${i%.*}.md"; done;

Step 3: Add the files to my GitHub repo graduate-writing-examples

For this I used the GitHub macOS Desktop App to create a repo in my Documents directory, so it lives in ~/Documents/graduate-writing-examples/

I then used the Finder to locate all of the md files in the Summaries folder and then dragged them into the repo. There were 2 files with the same name, Rose_Summary and Libecap_and_Johnson_Summary. While I’m sure that I could have figured out how to accomplish this with the command line, this took less than 1 minute, and I had just spent 5 minutes trying to find a terminal command to do it. Again, the lazy programmer wins.

Once the files were in the local repo I committed the files and boom they were in my GitHub Writing Examples repo.

Vin’s Last Game

Twelve years ago today Steve Finley hit a Grand Slam in the 9th to clinch the NL West title against the Giants. Today, the Dodgers have already won the NL West title so we won’t have anything like that again, but it is Vin Scully’s last game to be called. EVER.

 

I remember watching Dodgers games on KTLA with my grandmother in the ’80s. I thought baseball was slow and boring, but the way that Vin would tell stories kept me interested.

Vin is able to call a game, with no emotion, just tell the story of the game. Dropping tidbits about this player or that player. Always knowing more about the people in the game while also knowing so much about the game.

 

He’s quite literally seen everything. From perfect games to triple plays. He called Hank Aaron’s historic record-breaking home run. He even saw a pitcher throwing a perfect game through 7 get pulled (I’m looking at you, Dave Roberts).

 

In the last game he ever called the Dodgers are in playoff form. This … is not a good thing. The Dodgers are historically an awful performing playoff team, and so far, they have managed to lose 4 of their last 5 and are working on making it 5 of 6.

 

It’s a dark and dreary day in San Francisco. It’s raining in San Francisco. Kenta Maeda is pitching for the Dodgers.

 

The Dodgers’ first out of the game is a strikeout of Hunter Pence … but the Dodgers are down 0-2. Might be a long one today.

 

 

The game ended not with a win, but a whimper as the Dodgers lost to the Giants 7-1.

 

As Vin gave his last call it wasn’t a great call like Charlie Culberson’s Home Run to win the West (and the game) last weekend. It was a pop fly that sent the Giants back to New York to face the Mets.

 

Five years ago I never wanted him to retire. This season, I’m glad he decided to put the microphone up. A little slower in the outfield, not quite as quick with the bat, still the greatest broadcaster that ever lived.

 

Vin was teaching lessons all those years, not just calling games. It was only in his last season, his last game, that I really was able to hear them.

 

Realize that you are blessed to do what you do.

 

Don’t be sad that something has ended, but be happy that it had started.

 

The last one gets me in a way I can’t quite describe. Maybe it’s where I’m at in life right now, maybe it’s that it would resonate with me regardless, but it is a nice reminder, that life is what you make of it. Don’t be sad that a thing has ended, but instead be happy that you had a chance to have it happen at all.

 

Great words, Vin. Thank you.