In my previous post I pasted in a list of shows on Australian Netflix sorted by IMDB ratings. When I started thinking about how to generate such a list, the first thing I needed to figure out was how to obtain a list of all TV shows available on Australian Netflix. I did some searching and found that finder.com.au provides a regularly updated list. It is formatted as HTML, so I needed to do some web scraping to obtain this information.
I wrote a scraping script in Python, using the requests library to pull down the page and then BeautifulSoup to extract the information I needed.
#!/usr/bin/python
import requests
import re
from bs4 import BeautifulSoup

def getshowlist():
    url = "https://www.finder.com.au/netflix-tv-shows"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # The table of shows sits inside this container div
    div = soup.find('div', class_="ts-table-container")
    tb = div.find('table', class_='luna-table')
    shows = {}  # maps show title -> year of release
    for row in tb.find_all('tr'):
        name = row.find('b')
        year = row.find("td", {"data-title": "Year of release"})
        # Skip rows (such as the header) that lack a title or a year
        if name is None or year is None:
            continue
        # Strip the HTML tags, leaving only the text
        title = re.sub('<[^<]+?>', '', str(name))
        release = re.sub('<[^<]+?>', '', str(year))
        shows[title] = release
    return shows
What is happening here? It is pretty straightforward: requests makes a GET request to the finder.com.au listing page and pulls down the HTML. BeautifulSoup then takes the HTML and builds a parse tree. Next the script pulls out of the tree the DIV that contains the table we are interested in, the DIV with the class ts-table-container, then it extracts the table from the DIV. Finally the script iterates through the rows of the table and extracts the show name and year from each. A dictionary is returned that maps each show name to its year of release. My next post will deal with obtaining the IMDB rating data.
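If you want to try the extraction steps without hitting the live site, here is a minimal sketch that runs the same lookups against a small inline HTML sample. The sample markup, the show titles and the parse_shows name are all made up for illustration; only the ts-table-container and luna-table classes and the "Year of release" data-title come from the script above, and the real page may be laid out differently. It also uses BeautifulSoup's get_text() in place of the regex, which is a swapped-in alternative rather than what my script does.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration; the real page is larger
# and its structure may differ or change over time.
SAMPLE_HTML = """
<div class="ts-table-container">
  <table class="luna-table">
    <tr><th>Title</th><th>Year of release</th></tr>
    <tr><td data-title="Title"><b>Example Show A</b></td><td data-title="Year of release">2016</td></tr>
    <tr><td data-title="Title"><b>Example Show B</b></td><td data-title="Year of release">2019</td></tr>
  </table>
</div>
"""


def parse_shows(html):
    """Return a dict mapping show title to year of release (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the container div, then the table inside it
    div = soup.find("div", class_="ts-table-container")
    if div is None:
        return {}
    table = div.find("table", class_="luna-table")
    if table is None:
        return {}
    shows = {}
    for row in table.find_all("tr"):
        name = row.find("b")
        year = row.find("td", {"data-title": "Year of release"})
        # The header row has neither a <b> title nor a year cell, so skip it
        if name is None or year is None:
            continue
        # get_text() returns the text without tags; strip=True trims whitespace
        shows[name.get_text(strip=True)] = year.get_text(strip=True)
    return shows


if __name__ == "__main__":
    print(parse_shows(SAMPLE_HTML))
```

Returning an empty dictionary when the container or table is missing is just one way to handle a changed page layout; raising an error might suit you better if you would rather know straight away that the scrape broke.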