Ottawa Valley SAGE

Providing a forum since 1998

Jun 17, 2016 - 9 minute read - Comments

June Meeting Recap

Overview

Last night we had the June LOPSA/ovSAGE meeting. Judging by the feedback and the interest level, the talk went well. The slides will be available in the usual location. I’ll also cover the extra bit I did at the end of the talk, with some code and an example of what it would have looked like had we continued.

Off Topic: A Bit Warm

But first, how does that tagging thing work? Oh yeah…

<rant>

The room was just a little warm (read as no AC) and the sun was beating in due to the lack of shades of any kind. This time I was somewhat prepared, having brought a golf umbrella to prop on an unused chair and at least keep me from being roasted. A small USB fan will be the next item to add, just to keep the air moving.

</rant>

Further Scraping

As I mentioned earlier, there is one thing that is not in the slides. I did a quick grab of the text-based forecast for Ottawa, just to show how easy it is when you know how to target your search. This was a trivially simple page to target, as it was designed to produce a preformatted text section. Since there is only one such section on the page, searching for the <pre> tag takes care of getting the data.

After you have that, you can use other Python functions to target what you want. The code is commented to death to make sure there are no questions.

Code Example

#!/usr/bin/env python

#
# Quick and dirty script to illustrate grabbing text data
# I'm not a python programmer by habit, so excuse any lack of optimizations.
#

# get the modules we will need: regular expressions, url requests and beautifulsoup
import re
import requests
from bs4 import BeautifulSoup

# Forecasts for Southern Ontario and the National Capital Region  URL
url = 'https://weather.gc.ca/forecast/public_bulletins_e.html?Bulletin=fpcn11.cwto'

# get the content of the web page
response = requests.get(url)
html = response.content

# cut out everything but the preformatted section
# This is the text you hear read out on the radio in some areas
soup = BeautifulSoup(html,"lxml")
text = soup.find('pre')

# force it to be a string now
new_text = str(text)

# yank the html tags
new_text = new_text.replace("<pre>","")
new_text = new_text.replace("</pre>","")

# A quick examination of the text will show that it has DOS line ending markers,
# so if you are a UNIX person, you might want to remove those as well. 
# To see them, save the output in a file and use: cat -v <filename>

new_text = new_text.replace("\r\n","\n")

# No more bs4 here, we are doing straight text manipulation. We want the header portion
# and the Ottawa section. This text is always formatted this way, so it is easy to match.
# - A header section
# - Paragraphs for each sub region
# - Each section is bounded by two line end markers
# 
# I want the header after the id marker and then the national capital section
# and since every section is separated with two newlines, it provides the easiest
# method of separating the paragraphs. As the forecast can be updated, the section 
# I want may not be in the same position, so I need to find that. I also want to strip 
# out the first two lines of the header

sections = new_text.split('\n\n')

# get rid of the identifier, always in the first section
sections[0] = re.sub(r' \nFPCN.*\n',"",sections[0])

# print() with a single argument works in both Python 2 and Python 3
print(sections[0])

# put a blank line after the header
print("")

# print the paragraph that has the 'City of Ottawa' identifier
print("\n".join(s for s in sections if 'City of Ottawa' in s))
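
If you want to try the text manipulation without hitting the website, here is a minimal sketch of the same steps wrapped in a function. The name extract_forecast and the shortened SAMPLE bulletin are my own illustration, not part of the script above, and the sketch assumes you already have the text from inside the <pre> section.

```python
import re


def extract_forecast(bulletin, region="City of Ottawa"):
    """Return (header, matching paragraphs) from the bulletin text."""
    # normalize DOS line endings, then trim surrounding whitespace
    text = bulletin.replace("\r\n", "\n").strip()
    # paragraphs are separated by one blank line
    sections = text.split("\n\n")
    # drop the identifier line (for example "FPCN11 CWTO 171500") from the header
    header = re.sub(r"^FPCN.*\n", "", sections[0])
    # keep every later paragraph that mentions the region
    matches = [s for s in sections[1:] if region in s]
    return header, matches


# Shortened, hand-made sample in the same layout as the real bulletin
SAMPLE = "\r\n".join([
    "FPCN11 CWTO 171500",
    "Forecasts for Southern Ontario and the National Capital Region issued",
    "by Environment Canada at 11.00 am edt Friday 17 June 2016 for today",
    "and Saturday.",
    "",
    "City of Toronto.",
    "Today..Sunny. High 29 except 23 near Lake Ontario.",
    "",
    "City of Ottawa",
    "Gatineau.",
    "Today..Sunny. High 29. UV index 8 or very high.",
    "Tonight..Partly cloudy. Low 13.",
    "",
    "End",
])

header, matches = extract_forecast(SAMPLE)
print(header)
print("")
print("\n".join(matches))
```

Returning the header and the paragraphs separately, instead of printing them straight away, makes it easier to reuse the result later, for example in an email or a speech file.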

Text from the website

The original text grabbed from the system (example from time of writing)

FPCN11 CWTO 171500
Forecasts for Southern Ontario and the National Capital Region issued
by Environment Canada at 11.00 am edt Friday 17 June 2016 for today
and Saturday.
The next scheduled forecast will be issued at 3.30 pm.

City of Toronto.
Today..Sunny. High 29 except 23 near Lake Ontario. UV index 8 or very
high.
Tonight..Clear. Low 17.
Saturday..Mainly sunny. High 30. Humidex 34.

Windsor - Essex - Chatham-Kent.
Today..Sunny. Wind northeast 20 km/h. High 28. Humidex 31. UV index 8
or very high.
Tonight..Clear. Low 15.
Saturday..Sunny. High 31. Humidex 34.

Sarnia - Lambton.
Today..Mainly sunny. High 28 except 23 near Lake Huron. Humidex 30.
UV index 8 or very high.
Tonight..Clear. Low 13.
Saturday..Sunny. High 29. Humidex 31.

Elgin
London - Middlesex
Simcoe - Delhi - Norfolk
Dunnville - Caledonia - Haldimand
Oxford - Brant.
Today..Sunny. High 28. Humidex 31. UV index 8 or very high.
Tonight..Clear. Low 13.
Saturday..Sunny. High 29. Humidex 30.

Niagara Falls - Welland - Southern Niagara Region.
Today..Sunny. High 28. Humidex 29. UV index 8 or very high.
Tonight..Clear. Low 18.
Saturday..Sunny. Wind becoming west 20 km/h late in the afternoon.
High 30 except 25 near Lake Erie. Humidex 33.

St. Catharines - Grimsby - Northern Niagara Region
City of Hamilton.
Today..Sunny. High 28 except 24 near Lake Ontario. Humidex 31. UV
index 8 or very high.
Tonight..Clear. Low 15.
Saturday..Sunny. High 30. Humidex 34.

Burlington - Oakville
Mississauga - Brampton
Pickering - Oshawa - Southern Durham Region.
Today..Sunny. High 29 except 23 near Lake Ontario. UV index 8 or very
high.
Tonight..Clear. Low 15.
Saturday..Sunny. High 30. Humidex 33.

Halton Hills - Milton
Caledon
Vaughan - Richmond Hill - Markham
Newmarket - Georgina - Northern York Region
Uxbridge - Beaverton - Northern Durham Region.
Today..Sunny. High 29. UV index 8 or very high.
Tonight..Clear. Low 14.
Saturday..Sunny. High 30. Humidex 33.

Huron - Perth.
Today..Sunny. High 28. Humidex 30. UV index 8 or very high.
Tonight..Clear. Low 12.
Saturday..Sunny. High 29. Humidex 31.

Waterloo - Wellington
Dufferin - Innisfil.
Today..Mainly sunny. High 29. Humidex 31. UV index 8 or very high.
Tonight..Clear. Low 13.
Saturday..Mainly sunny. High 30. Humidex 32.

Grey - Bruce.
Today..Sunny. High 28. Humidex 29. UV index 8 or very high.
Tonight..Clear. Low 13.
Saturday..Sunny. High 28.

Barrie - Orillia - Midland.
Today..Sunny. High 29. Humidex 30. UV index 8 or very high.
Tonight..Clear. Low 16.
Saturday..Sunny. High 30. Humidex 31.

Belleville - Quinte - Northumberland
Kingston - Prince Edward.
Today..Mainly sunny. High 27 except 22 near Lake Ontario. Humidex 30.
UV index 7 or high.
Tonight..Clear. Low 13.
Saturday..Sunny. High 29 except 24 near Lake Ontario. Humidex 34.

Lindsay - Southern Kawartha Lakes
Fenelon Falls - Balsam Lake Park - Northern Kawartha Lakes
Peterborough City - Lakefield - Southern Peterborough County
Stirling - Tweed - South Frontenac.
Today..Mainly sunny. High 29. Humidex 30. UV index 8 or very high.
Tonight..Clear. Low 12.
Saturday..Sunny. High 29. Humidex 31.

Apsley - Woodview - Northern Peterborough County
Bancroft - Bon Echo Park
Haliburton.
Today..Mainly sunny. High 29. UV index 7 or high.
Tonight..Clear. Low 9.
Saturday..Sunny. High 30. Humidex 31.

Brockville - Leeds and Grenville.
Today..Sunny. High 28. UV index 8 or very high.
Tonight..Clear. Low 13.
Saturday..Sunny. High 30. Humidex 32.

City of Ottawa
Gatineau
Prescott and Russell
Cornwall - Morrisburg
Smiths Falls - Lanark - Sharbot Lake.
Today..Sunny. Becoming a mix of sun and cloud this afternoon.
High 29. UV index 8 or very high.
Tonight..Partly cloudy. Becoming clear near midnight. Low 13.
Saturday..Sunny. High 31. Humidex 33.

Port Carling - Port Severn
Town of Parry Sound - Rosseau - Killbear Park
Bayfield Inlet - Dunchurch.
Today..Sunny. High 28. UV index 7 or high.
Tonight..Clear. Low 15.
Saturday..Sunny. High 28. Humidex 30.

Bracebridge - Gravenhurst
Huntsville - Baysville
South River - Burk's Falls.
Today..Sunny. High 29. UV index 7 or high.
Tonight..Clear. Low 12.
Saturday..Sunny. High 30.

Renfrew - Pembroke - Barry's Bay.
Today..Mainly sunny. High 29. UV index 7 or high.
Tonight..Clear. Low 11.
Saturday..Mainly sunny. High 31.

Algonquin.
Today..Sunny. High 27. UV index 7 or high.
Tonight..Clear. Low 9.
Saturday..Sunny. High 30.

End
$$$$^^

Text after processing

After massaging the text, we get the header and a specific forecast.

Forecasts for Southern Ontario and the National Capital Region issued
by Environment Canada at 11.00 am edt Friday 17 June 2016 for today
and Saturday.
The next scheduled forecast will be issued at 3.30 pm.

City of Ottawa
Gatineau
Prescott and Russell
Cornwall - Morrisburg
Smiths Falls - Lanark - Sharbot Lake.
Today..Sunny. Becoming a mix of sun and cloud this afternoon.
High 29. UV index 8 or very high.
Tonight..Partly cloudy. Becoming clear near midnight. Low 13.
Saturday..Sunny. High 31. Humidex 33.
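
If you want more than the raw paragraph, it can be split into region names and per-period forecasts. This is only a sketch: split_section is a hypothetical helper, and it assumes (based on nothing more than the sample above) that each period starts with a label followed by two dots, such as "Today..", and that any other line after the first period is a continuation of the previous one.

```python
def split_section(section):
    """Split one forecast paragraph into region names and period forecasts."""
    regions = []
    periods = []
    for line in section.splitlines():
        label, sep, rest = line.partition("..")
        if sep and label.replace(" ", "").isalpha():
            # a new period such as "Today..Sunny."
            periods.append([label, rest.strip()])
        elif periods:
            # a wrapped line that continues the previous period
            periods[-1][1] += " " + line.strip()
        else:
            # still in the list of region names at the top
            regions.append(line.rstrip("."))
    return regions, {label: text for label, text in periods}


SECTION = "\n".join([
    "City of Ottawa",
    "Gatineau",
    "Prescott and Russell",
    "Cornwall - Morrisburg",
    "Smiths Falls - Lanark - Sharbot Lake.",
    "Today..Sunny. Becoming a mix of sun and cloud this afternoon.",
    "High 29. UV index 8 or very high.",
    "Tonight..Partly cloudy. Becoming clear near midnight. Low 13.",
    "Saturday..Sunny. High 31. Humidex 33.",
])

regions, periods = split_section(SECTION)
print(regions)
print(periods["Tonight"])
```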

What you can do with it

This is small enough to include in an email, a daily newsletter, etc. You now have another example of web scraping to use. If you have a Mac, you can convert it to speech easily:

$ python scraper.py > forecast.txt
$ say -v Daniel -o ./forecast.m4a -f forecast.txt

You now have a forecast in a British voice that actually sounds much better than the synthesized one the city uses to broadcast along the Queensway.
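
The speech step can also be driven from Python, which is handy if the whole thing ends up in one script. This is a rough sketch that assumes a Mac, since say is a macOS command. The function names are my own, and the flags are the same ones used in the shell example.

```python
import shutil
import subprocess


def say_command(text_file, audio_file, voice="Daniel"):
    # same flags as the shell example: -v voice, -o output file, -f input file
    return ["say", "-v", voice, "-o", audio_file, "-f", text_file]


def speak_forecast(text_file="forecast.txt", audio_file="./forecast.m4a"):
    """Convert the forecast text to audio, or return None if say is missing."""
    if shutil.which("say") is None:
        # not on a Mac, or say is not on the PATH: do nothing
        return None
    subprocess.run(say_command(text_file, audio_file), check=True)
    return audio_file
```

Calling speak_forecast() after the scraper has written forecast.txt should produce the same forecast.m4a as the two shell commands.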

What prompted this example?

This was an item on one of my TODO lists from a few years ago, and as long as I was playing with BeautifulSoup, I thought I’d close that item. It is part of a longer list that will eventually become a streaming radio station for my house. The forecast will get processed through the speech synthesizer on my Mac (or whatever I’m using at the time) and injected into the playlist. Ultimately, it will be an automated system along the lines of:

  • Set up a random music playlist for the day
  • Grab the weather a few times during the day
  • Scrape some news headlines, convert to speech and inject at the top of the hour
  • Play a podcast from my extensive backlog starting at a specific time in the afternoon
  • Play a chapter or two from an audio book
  • More music
  • Ten most recent tweets in my twitter feed read out loud from the last hour
  • Whatever else comes to mind

While a little silly, it is probably as good as the stuff being broadcast on a lot of stations. It will be music I want to listen to, and I won’t have to listen to the “personalities”. I could even broadcast it at low power on FM using a Raspberry Pi as the broadcast system.

Tags: python beautiful soup reveal web scraping
