Netflix and Python, or how I scraped Aussie Netflix TV shows from finder

In my previous post I pasted in a list of shows on Australian Netflix sorted by IMDB ratings. When I started thinking about how to generate such a list, the first thing I needed to figure out was how to obtain a list of all TV shows that are available on Australian Netflix. I did some searching and found that finder.com.au provides a regularly updated list. The list is formatted as HTML, so I needed to do some web scraping to obtain this information.

I wrote a scraping script in Python that uses the requests library to pull down the page and BeautifulSoup to extract the information I needed.

 

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup


def getshowlist():
    """Return a dict mapping each show title to its year of release."""
    url = "https://www.finder.com.au/netflix-tv-shows"

    page = requests.get(url)
    page.raise_for_status()  # fail early on an HTTP error
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_="ts-table-container")

    tb = div.find('table', class_='luna-table')

    shows = {}  # named "shows" rather than "list" to avoid shadowing the builtin

    for row in tb.find_all('tr'):
        name = row.find('b')
        year = row.find("td", {"data-title": "Year of release"})
        # Skip rows (such as the header) that lack a title or a year cell
        if name is None or year is None:
            continue
        # get_text() returns the text content with the tags removed
        shows[name.get_text(strip=True)] = year.get_text(strip=True)
    return shows

What is happening here? It is pretty straightforward. requests sends a GET request to the finder.com.au listing page and pulls down the HTML. BeautifulSoup then takes the HTML and builds a parse tree. Next, the script pulls out of the tree the DIV that contains the table we are interested in (the DIV with the class ts-table-container), then extracts the table with the class luna-table from that DIV. Finally, the script iterates through the rows of the table and extracts the show name and year of release from each one. A dictionary is returned that maps each show title to its year of release. My next post will deal with obtaining the IMDB rating data.
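
If you want to try the parsing steps without hitting the live site, here is a minimal sketch that runs the same DIV, table and row lookups against a small inline HTML snippet. The class names and the data-title attribute are the ones used in the script; the sample rows, the SAMPLE_HTML constant and the parse_shows helper are made up for illustration, and the real page markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the class names and the data-title attribute
# match the scraping script, but the rows are invented for illustration.
SAMPLE_HTML = """
<div class="ts-table-container">
  <table class="luna-table">
    <tr><th>Title</th><th>Year of release</th></tr>
    <tr>
      <td data-title="Title"><b>Example Show</b></td>
      <td data-title="Year of release">2015</td>
    </tr>
    <tr>
      <td data-title="Title"><b>Another Show</b></td>
      <td data-title="Year of release">2012</td>
    </tr>
  </table>
</div>
"""


def parse_shows(html):
    """Return {title: year} for every table row that has both fields."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="ts-table-container")
    table = div.find("table", class_="luna-table")

    shows = {}
    for row in table.find_all("tr"):
        name = row.find("b")
        year = row.find("td", {"data-title": "Year of release"})
        if name is None or year is None:
            continue  # the header row has no <b> title and no year cell
        shows[name.get_text(strip=True)] = year.get_text(strip=True)
    return shows


if __name__ == "__main__":
    # Print the sample shows in alphabetical order of title
    for title, year in sorted(parse_shows(SAMPLE_HTML).items()):
        print(title, year)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against a saved copy of the page, separately from the network request.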

