EPA-1316 Introduction to Urban Data Science

Lab 1 - part 3: Web Scraping¶

TU Delft
Q1 2022
Instructor: Trivik Verma
TAs: Auriane Técourt, Dorukhan Yesilli, Ludovica Bindi, Nicolò Canal, Ruth Nelson, Vaibhavi Srivastava
Centre for Urban Science & Policy


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import time
import os
from IPython.display import Image

Table of Contents¶

  • Learning Goals
  • Introduction to Web Servers and HTTP
  • Download webpages and get basic properties
  • Beautiful Soup
    • Parse the page with Beautiful Soup
    • Takeaway lesson
  • String formatting
  • Walkthrough Example of web scraping
  • How To Utilize APIs
    • What is an API?
    • Making API Requests in Python
    • First API Request

Learning Goals ¶

  • Understand the structure of a web page
  • Understand how to use Beautiful Soup to scrape content from web pages.
  • Feel comfortable storing and manipulating the content in various formats.
  • Understand how to convert structured content into a Pandas DataFrame

In this lab, we'll scrape Goodreads' Best Books list:

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1 .

We'll walk through scraping the list pages for the book names/urls. First, we start with an even simpler example.

NOTE: The contents of this notebook are advanced in nature.

Introduction to Web Servers and HTTP ¶

A web server is just a computer -- usually a powerful one, but ultimately just another computer -- that runs a long-running process which listens for requests on a pre-specified (Internet) port. It responds to those requests via a protocol called HTTP (HyperText Transfer Protocol); HTTPS is the secure version. When we use a web browser and navigate to a web page, our browser sends a request on our behalf to a specific web server. The request essentially says "hey, please give me the contents of this web page", and it is then up to the browser to render that raw content in a coherent manner, depending on the format of the file. For example, HTML is one format, XML is another, and so on.

Ideally (and usually), the web server complies with the request and all is fine. As part of this communication exchange with web servers, the server also sends a status code.

  • If the code starts with a 2, it means the request was successful.
  • If the code starts with a 4, it means there was a client error (you, as the user, are the client). For example, ever receive a 404 File Not Found error because a web page doesn't exist? This is an example of a client error, because you are requesting a bogus item.
  • If the code starts with a 5, it means there was a server error (the server failed while handling your request, even if the request itself was well formed).

Click here for a full list of status codes.

As an analogy, you can think of a web server as being like a server at a restaurant; its goal is to serve you your requests. When you try to order something not on the menu (i.e., ask for a web page at a wrong location), the server says 'sorry, we don't have that' (i.e., 404, client error; your mistake).

IMPORTANT: As humans, we visit pages at a sane, reasonable rate. However, once we start to scrape web pages, our code sends the requests for us, and thus we can make requests at an incredible rate. This is potentially dangerous because it's akin to going to a restaurant and bombarding the server(s) with thousands of food orders. Very often, the website will ban you (i.e., your university's network gets banned from the website, and you are potentially held responsible in some capacity). It is imperative to be responsible and careful. In fact, this act of flooding web pages with requests is one of the most popular, yet archaic, methods for maliciously attacking websites and other Internet-connected computers. In short, be respectful and careful with your decisions and code. It is better to err on the side of caution, which includes using the time.sleep() function to pause your code's execution between subsequent requests. time.sleep(2) should be fine when making just a few dozen requests. Each site has its own rules, which are often visible via the site's robots.txt file.
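For example, a minimal sketch of a polite request loop might look like this (the URLs are placeholders, and the two-second pause follows the rule of thumb above):

import time
import requests

urls = ["https://example.com/page-one", "https://example.com/page-two"]  # placeholder URLs
for url in urls:
    response = requests.get(url)
    if response.status_code == 200:          # 2xx: success
        print(len(response.text), "characters received from", url)
    elif 400 <= response.status_code < 500:  # 4xx: client error (e.g., 404 Not Found)
        print("client error", response.status_code, "for", url)
    else:                                    # 5xx (or other): server error
        print("server error", response.status_code, "for", url)
    time.sleep(2)  # pause between requests so we don't flood the server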

Additional Resources¶

HTML: if you are not familiar with HTML see https://www.w3schools.com/html/ or one of the many tutorials on the internet.

Document Object Model (DOM): for more on this programming interface for HTML and XML documents see https://www.w3schools.com/js/js_htmldom.asp.

Download webpages and get basic properties ¶

Requests is a highly useful Python library that allows us to fetch web pages. BeautifulSoup is a phenomenal Python library that allows us to easily parse web content and perform basic extraction.

If one wishes to scrape webpages, one usually uses requests to fetch the page and BeautifulSoup to parse the page's meaningful components. Webpages can be messy, despite having a structured format, which is why BeautifulSoup is so handy.

Let's get started:

In [2]:
from bs4 import BeautifulSoup
import requests

To fetch a webpage's content, we can simply use the get() function within the requests library:

In [3]:
url = "https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url) # you can use any URL that you wish

The response variable has many highly useful attributes, such as:

  • status_code
  • text
  • content

Let's try each of them!

response.status_code¶

In [4]:
response.status_code
Out[4]:
200

You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka the client/user/you). Again, you can click here for a full list of status codes.

response.text¶

In [5]:
response.text
Out[5]:
'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta name="robots" content="noindex, nofollow">\n    <meta content="text/html;charset=utf-8" http-equiv="Content-Type">\n    <meta content="utf-8" http-equiv="encoding">\n    <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, shrink-to-fit=no" />\n\n    <title>NPR Cookie Consent and Choices</title>\n\n    <link rel="stylesheet" media="screen, print" href="https://s.npr.org/templates/css/fonts/Knockout.css"/>\n    <link rel="stylesheet" media="screen, print" href="https://s.npr.org/templates/css/fonts/GothamSSm.css"/>\n    <link rel="stylesheet" media="screen, print" href="css/choice-stylesheet.css"/>\n    <script type="text/javascript" src="./js/redirects.js"></script>\n    <script type="text/javascript" src="./js/domains.js"></script>\n</head>\n<body>\n<main class="content" id="content">\n    <header role="banner">\n        <img src="https://media.npr.org/chrome_svg/npr-logo.svg" alt="NPR logo" class="npr-logo"/>\n\n        <h1 class="header-txt">Cookie Consent and Choices</h1>\n\n        <div id="npr-rule" role="presentation"><span></span><span></span></div>\n    </header>\n\n    <section class="main-section">\n        <p>\n            NPR&rsquo;s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, &ldquo;cookies&rdquo;) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR&rsquo;s sponsors, provide social media features, and analyze NPR&rsquo;s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.\n            <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.\n        </p>\n\n        <p>\n            You may click on &ldquo;<strong>Your Choices</strong>&rdquo; below to learn about and use cookie management tools to limit use of cookies when you visit NPR&rsquo;s sites. This page will also tell you how you can reject cookies and still obtain access to NPRâ\x80\x99s sites, and you can adjust your cookie choices in those tools at any time. 
If you click &ldquo;<strong>Agree and Continue</strong>&rdquo; below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR&rsquo;s sites.\n        </p>\n\n        <p class="acceptance-date" id="acceptanceDate"></p>\n\n        <div class="user-actions">\n            <button class="user-action user-action--accept" id="accept">Agree and Continue</button>\n\n            <a class="user-action user-action--text" id="textLink" href="https://text.npr.org/s.php?sId=609131973#your-choices">YOUR CHOICES</a>\n        </div>\n\n        <footer class="footer">\n            <p>NPR&rsquo;s <a href="https://text.npr.org/s.php?sId=179876898">Terms of Use</a> and <a\n                    href="https://text.npr.org/s.php?sId=609131973">Privacy Policy</a>.</p>\n        </footer>\n    </section>\n</main>\n\n<script>\n    // self executing function here\n    (function () {\n        var choiceVersion = 1;\n\n        // Return true is the origin param is present in the URL\n        // Make sure origin starts with "https://" in order to avoid cross-site scripting attack\n        var hasOrigin = function () {\n            var searchParam = window.location.search;\n            return searchParam.substr(0, 16) === \'?origin=https://\';\n        };\n\n        // Append choiceRedirect=true to a destination\n        // This will tell use that a user has been already redirected by the choice page\n        // stopping a potential infinite redirect loop\n        var addChoiceRedirectParam = function (url) {\n            var paramControl = \'?\';\n            if (url.includes(\'?\')){\n                paramControl = \'&\';\n            }\n            return url + paramControl + \'t=\' + (new Date()).getTime();\n        }\n\n        // Redirect made from AKAMAI will include the original\n        // destination with the request ex:\n        // https://www.npr.org/choice.html?origin=https://www.npr.org/about-npr/178660742/public-radio-finances\n        var getDestination = function () {\n            var searchParam = window.location.search;\n            if (hasOrigin()) {\n                var destination = searchParam.substr(8);\n                if (checkOrigin(destination)) {\n                    return destination;\n                }\n            }\n            return \'https://www.npr.org\';\n        };\n\n        var getCookie = function (name) {\n            var value = "; " + document.cookie;\n            var parts = value.split("; " + name + "=");\n            if (parts.length == 2) return parts.pop().split(";").shift();\n            return false;\n        };\n\n        var create_cookie = function (name, value) {\n            // Cookies have a tendency to expire, so I arbitrarily set the max age to 10 year\n            document.cookie = name + \'=\' + value + \';secure;path=/;domain=npr.org;max-age=315360000;\';\n        };\n\n        // True is user previously accepted the correct version of the consent page\n        var hasPreviouslyAcceptedChoiceOptions = function () {\n            return getCookie(\'trackingChoice\') && getCookie(\'choiceVersion\') == choiceVersion;\n        }\n\n        // Grab the thing id form the destination\n        var getThingId = function (destination) {\n            var yearMonthDateWithPreFixReg = /https:\\/\\/www\\.npr\\.org\\/([a-z]+\\/){0,2}\\d{4}\\/\\d{2}\\/\\d{2}\\/(\\d+)\\/.*/;\n            var match = yearMonthDateWithPreFixReg.exec(destination);\n            if (match) {\n                return 
match[2];\n            }\n\n            var noDateUrlRegex = /https:\\/\\/www\\.npr\\.org\\/([a-z]+\\/){1,2}(\\d+)\\/.*/;\n            match = noDateUrlRegex.exec(destination);\n            if (match) {\n                return match[2];\n            }\n\n            var thingIdByParam = /https:\\/\\/www\\.npr\\.org\\/.*[iI]d=(\\d{4,}).*/;\n            match = thingIdByParam.exec(destination);\n            if (match) {\n                return match[1];\n            }\n\n            // Check if we have a hard coded page url\n            //Remove https://www.npr.org from the destination\n            var location = destination.substr(19);\n            for (var key in redirectLookup) {\n                // If the first part of the location matches a\n                // hard coded url, then we have a match.\n                if (location.startsWith(key)){\n                    return redirectLookup[key];\n                }\n            }\n\n            return false;\n        }\n\n        document.getElementById(\'accept\').addEventListener(\'click\', function () {\n            var d = new Date();\n            var dateOfChoice = d.getTime();\n\n            create_cookie(\'trackingChoice\', \'true\');\n            create_cookie(\'choiceVersion\', choiceVersion);\n            create_cookie(\'dateOfChoice\', dateOfChoice);\n            window.location = addChoiceRedirectParam(getDestination());\n        });\n\n        var thingId = getThingId(getDestination());\n        if (thingId) {\n            document.getElementById(\'textLink\').href = "https://text.npr.org/r.php?id=" + thingId;\n        }\n\n        if (hasOrigin() && hasPreviouslyAcceptedChoiceOptions()) {\n            // If the user has already accepted the choice options\n            // and has an origin param in his request\n            // We will redirect him to that origin request.\n            // This will solve the issue where applications are caching 307 redirects\n            window.location = addChoiceRedirectParam(getDestination());\n        } else if (hasPreviouslyAcceptedChoiceOptions()) {\n            var lastDateOfChoice = getCookie(\'dateOfChoice\');\n            var d = new Date(parseInt(lastDateOfChoice, 10));\n            var dateString = "On "\n                + (d.getMonth() + 1)\n                + "/"\n                + d.getDate()\n                + "/"\n                + d.getFullYear()\n                + " you agreed to the above.";\n            document.getElementById(\'acceptanceDate\').innerText = dateString;\n            document.getElementById(\'content\').classList.add(\'accepted\');\n        }\n\n    })();\n</script>\n</body>\n</html>\n'

Holy moly! That looks awful. If we use our browser to visit the URL, then right-click the page and click 'View Page Source', we see that it is identical to this chunk of glorious text.

response.content¶

In [6]:
response.content
Out[6]:
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta name="robots" content="noindex, nofollow">\n    <meta content="text/html;charset=utf-8" http-equiv="Content-Type">\n    <meta content="utf-8" http-equiv="encoding">\n    <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, shrink-to-fit=no" />\n\n    <title>NPR Cookie Consent and Choices</title>\n\n    <link rel="stylesheet" media="screen, print" href="https://s.npr.org/templates/css/fonts/Knockout.css"/>\n    <link rel="stylesheet" media="screen, print" href="https://s.npr.org/templates/css/fonts/GothamSSm.css"/>\n    <link rel="stylesheet" media="screen, print" href="css/choice-stylesheet.css"/>\n    <script type="text/javascript" src="./js/redirects.js"></script>\n    <script type="text/javascript" src="./js/domains.js"></script>\n</head>\n<body>\n<main class="content" id="content">\n    <header role="banner">\n        <img src="https://media.npr.org/chrome_svg/npr-logo.svg" alt="NPR logo" class="npr-logo"/>\n\n        <h1 class="header-txt">Cookie Consent and Choices</h1>\n\n        <div id="npr-rule" role="presentation"><span></span><span></span></div>\n    </header>\n\n    <section class="main-section">\n        <p>\n            NPR&rsquo;s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, &ldquo;cookies&rdquo;) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR&rsquo;s sponsors, provide social media features, and analyze NPR&rsquo;s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.\n            <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.\n        </p>\n\n        <p>\n            You may click on &ldquo;<strong>Your Choices</strong>&rdquo; below to learn about and use cookie management tools to limit use of cookies when you visit NPR&rsquo;s sites. This page will also tell you how you can reject cookies and still obtain access to NPR\xe2\x80\x99s sites, and you can adjust your cookie choices in those tools at any time. 
If you click &ldquo;<strong>Agree and Continue</strong>&rdquo; below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR&rsquo;s sites.\n        </p>\n\n        <p class="acceptance-date" id="acceptanceDate"></p>\n\n        <div class="user-actions">\n            <button class="user-action user-action--accept" id="accept">Agree and Continue</button>\n\n            <a class="user-action user-action--text" id="textLink" href="https://text.npr.org/s.php?sId=609131973#your-choices">YOUR CHOICES</a>\n        </div>\n\n        <footer class="footer">\n            <p>NPR&rsquo;s <a href="https://text.npr.org/s.php?sId=179876898">Terms of Use</a> and <a\n                    href="https://text.npr.org/s.php?sId=609131973">Privacy Policy</a>.</p>\n        </footer>\n    </section>\n</main>\n\n<script>\n    // self executing function here\n    (function () {\n        var choiceVersion = 1;\n\n        // Return true is the origin param is present in the URL\n        // Make sure origin starts with "https://" in order to avoid cross-site scripting attack\n        var hasOrigin = function () {\n            var searchParam = window.location.search;\n            return searchParam.substr(0, 16) === \'?origin=https://\';\n        };\n\n        // Append choiceRedirect=true to a destination\n        // This will tell use that a user has been already redirected by the choice page\n        // stopping a potential infinite redirect loop\n        var addChoiceRedirectParam = function (url) {\n            var paramControl = \'?\';\n            if (url.includes(\'?\')){\n                paramControl = \'&\';\n            }\n            return url + paramControl + \'t=\' + (new Date()).getTime();\n        }\n\n        // Redirect made from AKAMAI will include the original\n        // destination with the request ex:\n        // https://www.npr.org/choice.html?origin=https://www.npr.org/about-npr/178660742/public-radio-finances\n        var getDestination = function () {\n            var searchParam = window.location.search;\n            if (hasOrigin()) {\n                var destination = searchParam.substr(8);\n                if (checkOrigin(destination)) {\n                    return destination;\n                }\n            }\n            return \'https://www.npr.org\';\n        };\n\n        var getCookie = function (name) {\n            var value = "; " + document.cookie;\n            var parts = value.split("; " + name + "=");\n            if (parts.length == 2) return parts.pop().split(";").shift();\n            return false;\n        };\n\n        var create_cookie = function (name, value) {\n            // Cookies have a tendency to expire, so I arbitrarily set the max age to 10 year\n            document.cookie = name + \'=\' + value + \';secure;path=/;domain=npr.org;max-age=315360000;\';\n        };\n\n        // True is user previously accepted the correct version of the consent page\n        var hasPreviouslyAcceptedChoiceOptions = function () {\n            return getCookie(\'trackingChoice\') && getCookie(\'choiceVersion\') == choiceVersion;\n        }\n\n        // Grab the thing id form the destination\n        var getThingId = function (destination) {\n            var yearMonthDateWithPreFixReg = /https:\\/\\/www\\.npr\\.org\\/([a-z]+\\/){0,2}\\d{4}\\/\\d{2}\\/\\d{2}\\/(\\d+)\\/.*/;\n            var match = yearMonthDateWithPreFixReg.exec(destination);\n            if (match) {\n                return 
match[2];\n            }\n\n            var noDateUrlRegex = /https:\\/\\/www\\.npr\\.org\\/([a-z]+\\/){1,2}(\\d+)\\/.*/;\n            match = noDateUrlRegex.exec(destination);\n            if (match) {\n                return match[2];\n            }\n\n            var thingIdByParam = /https:\\/\\/www\\.npr\\.org\\/.*[iI]d=(\\d{4,}).*/;\n            match = thingIdByParam.exec(destination);\n            if (match) {\n                return match[1];\n            }\n\n            // Check if we have a hard coded page url\n            //Remove https://www.npr.org from the destination\n            var location = destination.substr(19);\n            for (var key in redirectLookup) {\n                // If the first part of the location matches a\n                // hard coded url, then we have a match.\n                if (location.startsWith(key)){\n                    return redirectLookup[key];\n                }\n            }\n\n            return false;\n        }\n\n        document.getElementById(\'accept\').addEventListener(\'click\', function () {\n            var d = new Date();\n            var dateOfChoice = d.getTime();\n\n            create_cookie(\'trackingChoice\', \'true\');\n            create_cookie(\'choiceVersion\', choiceVersion);\n            create_cookie(\'dateOfChoice\', dateOfChoice);\n            window.location = addChoiceRedirectParam(getDestination());\n        });\n\n        var thingId = getThingId(getDestination());\n        if (thingId) {\n            document.getElementById(\'textLink\').href = "https://text.npr.org/r.php?id=" + thingId;\n        }\n\n        if (hasOrigin() && hasPreviouslyAcceptedChoiceOptions()) {\n            // If the user has already accepted the choice options\n            // and has an origin param in his request\n            // We will redirect him to that origin request.\n            // This will solve the issue where applications are caching 307 redirects\n            window.location = addChoiceRedirectParam(getDestination());\n        } else if (hasPreviouslyAcceptedChoiceOptions()) {\n            var lastDateOfChoice = getCookie(\'dateOfChoice\');\n            var d = new Date(parseInt(lastDateOfChoice, 10));\n            var dateString = "On "\n                + (d.getMonth() + 1)\n                + "/"\n                + d.getDate()\n                + "/"\n                + d.getFullYear()\n                + " you agreed to the above.";\n            document.getElementById(\'acceptanceDate\').innerText = dateString;\n            document.getElementById(\'content\').classList.add(\'accepted\');\n        }\n\n    })();\n</script>\n</body>\n</html>\n'

What?! This seems identical to the .text field. However, a careful eye will notice that the very first characters differ: .content begins with a b' prefix, which in Python syntax denotes that the data type is bytes, whereas .text has no such prefix and is a regular str (decoded text).
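A quick way to inspect the difference (note that requests guesses the encoding used to build .text from the HTTP headers, which can produce garbled characters like the â\x80\x99 sequence above):

print(type(response.content))   # <class 'bytes'>
print(type(response.text))      # <class 'str'>
print(response.encoding)        # the encoding requests guessed for decoding .text

# .text is just .content decoded with that guessed encoding; if the guess is off,
# we can decode the raw bytes ourselves with an encoding of our choosing:
print(response.content.decode("utf-8", errors="replace")[:80])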

Ok, so that's great, but how do we make sense of this text? We could manually parse it, but that's tedious and difficult. As mentioned, BeautifulSoup is specifically designed to parse this exact content (any webpage content).

Beautiful Soup ¶

The documentation for BeautifulSoup is found here.

A BeautifulSoup object can be initialized with the .content from a request and a flag denoting the type of parser to use. For example, we could specify html.parser, lxml, etc. (see the documentation here). Since we are interested in standard webpages that use HTML, let's specify html.parser:

In [7]:
soup = BeautifulSoup(response.content, "html.parser")
soup
Out[7]:
<!DOCTYPE html>

<html lang="en">
<head>
<meta content="noindex, nofollow" name="robots"/>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<meta content="utf-8" http-equiv="encoding"/>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, shrink-to-fit=no" name="viewport">
<title>NPR Cookie Consent and Choices</title>
<link href="https://s.npr.org/templates/css/fonts/Knockout.css" media="screen, print" rel="stylesheet"/>
<link href="https://s.npr.org/templates/css/fonts/GothamSSm.css" media="screen, print" rel="stylesheet"/>
<link href="css/choice-stylesheet.css" media="screen, print" rel="stylesheet"/>
<script src="./js/redirects.js" type="text/javascript"></script>
<script src="./js/domains.js" type="text/javascript"></script>
</meta></head>
<body>
<main class="content" id="content">
<header role="banner">
<img alt="NPR logo" class="npr-logo" src="https://media.npr.org/chrome_svg/npr-logo.svg"/>
<h1 class="header-txt">Cookie Consent and Choices</h1>
<div id="npr-rule" role="presentation"><span></span><span></span></div>
</header>
<section class="main-section">
<p>
            NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.
            <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.
        </p>
<p>
            You may click on “<strong>Your Choices</strong>” below to learn about and use cookie management tools to limit use of cookies when you visit NPR’s sites. This page will also tell you how you can reject cookies and still obtain access to NPR’s sites, and you can adjust your cookie choices in those tools at any time. If you click “<strong>Agree and Continue</strong>” below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR’s sites.
        </p>
<p class="acceptance-date" id="acceptanceDate"></p>
<div class="user-actions">
<button class="user-action user-action--accept" id="accept">Agree and Continue</button>
<a class="user-action user-action--text" href="https://text.npr.org/s.php?sId=609131973#your-choices" id="textLink">YOUR CHOICES</a>
</div>
<footer class="footer">
<p>NPR’s <a href="https://text.npr.org/s.php?sId=179876898">Terms of Use</a> and <a href="https://text.npr.org/s.php?sId=609131973">Privacy Policy</a>.</p>
</footer>
</section>
</main>
<script>
    // self executing function here
    (function () {
        var choiceVersion = 1;

        // Return true is the origin param is present in the URL
        // Make sure origin starts with "https://" in order to avoid cross-site scripting attack
        var hasOrigin = function () {
            var searchParam = window.location.search;
            return searchParam.substr(0, 16) === '?origin=https://';
        };

        // Append choiceRedirect=true to a destination
        // This will tell use that a user has been already redirected by the choice page
        // stopping a potential infinite redirect loop
        var addChoiceRedirectParam = function (url) {
            var paramControl = '?';
            if (url.includes('?')){
                paramControl = '&';
            }
            return url + paramControl + 't=' + (new Date()).getTime();
        }

        // Redirect made from AKAMAI will include the original
        // destination with the request ex:
        // https://www.npr.org/choice.html?origin=https://www.npr.org/about-npr/178660742/public-radio-finances
        var getDestination = function () {
            var searchParam = window.location.search;
            if (hasOrigin()) {
                var destination = searchParam.substr(8);
                if (checkOrigin(destination)) {
                    return destination;
                }
            }
            return 'https://www.npr.org';
        };

        var getCookie = function (name) {
            var value = "; " + document.cookie;
            var parts = value.split("; " + name + "=");
            if (parts.length == 2) return parts.pop().split(";").shift();
            return false;
        };

        var create_cookie = function (name, value) {
            // Cookies have a tendency to expire, so I arbitrarily set the max age to 10 year
            document.cookie = name + '=' + value + ';secure;path=/;domain=npr.org;max-age=315360000;';
        };

        // True is user previously accepted the correct version of the consent page
        var hasPreviouslyAcceptedChoiceOptions = function () {
            return getCookie('trackingChoice') && getCookie('choiceVersion') == choiceVersion;
        }

        // Grab the thing id form the destination
        var getThingId = function (destination) {
            var yearMonthDateWithPreFixReg = /https:\/\/www\.npr\.org\/([a-z]+\/){0,2}\d{4}\/\d{2}\/\d{2}\/(\d+)\/.*/;
            var match = yearMonthDateWithPreFixReg.exec(destination);
            if (match) {
                return match[2];
            }

            var noDateUrlRegex = /https:\/\/www\.npr\.org\/([a-z]+\/){1,2}(\d+)\/.*/;
            match = noDateUrlRegex.exec(destination);
            if (match) {
                return match[2];
            }

            var thingIdByParam = /https:\/\/www\.npr\.org\/.*[iI]d=(\d{4,}).*/;
            match = thingIdByParam.exec(destination);
            if (match) {
                return match[1];
            }

            // Check if we have a hard coded page url
            //Remove https://www.npr.org from the destination
            var location = destination.substr(19);
            for (var key in redirectLookup) {
                // If the first part of the location matches a
                // hard coded url, then we have a match.
                if (location.startsWith(key)){
                    return redirectLookup[key];
                }
            }

            return false;
        }

        document.getElementById('accept').addEventListener('click', function () {
            var d = new Date();
            var dateOfChoice = d.getTime();

            create_cookie('trackingChoice', 'true');
            create_cookie('choiceVersion', choiceVersion);
            create_cookie('dateOfChoice', dateOfChoice);
            window.location = addChoiceRedirectParam(getDestination());
        });

        var thingId = getThingId(getDestination());
        if (thingId) {
            document.getElementById('textLink').href = "https://text.npr.org/r.php?id=" + thingId;
        }

        if (hasOrigin() && hasPreviouslyAcceptedChoiceOptions()) {
            // If the user has already accepted the choice options
            // and has an origin param in his request
            // We will redirect him to that origin request.
            // This will solve the issue where applications are caching 307 redirects
            window.location = addChoiceRedirectParam(getDestination());
        } else if (hasPreviouslyAcceptedChoiceOptions()) {
            var lastDateOfChoice = getCookie('dateOfChoice');
            var d = new Date(parseInt(lastDateOfChoice, 10));
            var dateString = "On "
                + (d.getMonth() + 1)
                + "/"
                + d.getDate()
                + "/"
                + d.getFullYear()
                + " you agreed to the above.";
            document.getElementById('acceptanceDate').innerText = dateString;
            document.getElementById('content').classList.add('accepted');
        }

    })();
</script>
</body>
</html>

Alright! That looks a little better; there's some whitespace formatting, adding some structure to our content! HTML code is structured by <tags>. Every tag has an opening and closing portion, denoted by < > and </ >, respectively. If we want just the text (not the tags), we can use:

In [8]:
soup.get_text()
Out[8]:
'\n\n\n\n\n\n\nNPR Cookie Consent and Choices\n\n\n\n\n\n\n\n\n\n\nCookie Consent and Choices\n\n\n\n\n            NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.\n            See details.\n        \n\n            You may click on “Your Choices” below to learn about and use cookie management tools to limit use of cookies when you visit NPR’s sites. This page will also tell you how you can reject cookies and still obtain access to NPR’s sites, and you can adjust your cookie choices in those tools at any time. If you click “Agree and Continue” below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR’s sites.\n        \n\n\nAgree and Continue\nYOUR CHOICES\n\n\nNPR’s Terms of Use and Privacy Policy.\n\n\n\n\n\n\n'

There's still some tricky JavaScript nested within the page, but the output definitely cleaned up a bit. On other websites, you may find even cleaner text extraction.

As detailed in the BeautifulSoup documentation, the easiest way to navigate through the tags is to simply name the tag you're interested in. For example:

In [9]:
soup.head # fetches the head tag, which encompasses the title tag
Out[9]:
<head>
<meta content="noindex, nofollow" name="robots"/>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<meta content="utf-8" http-equiv="encoding"/>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, shrink-to-fit=no" name="viewport">
<title>NPR Cookie Consent and Choices</title>
<link href="https://s.npr.org/templates/css/fonts/Knockout.css" media="screen, print" rel="stylesheet"/>
<link href="https://s.npr.org/templates/css/fonts/GothamSSm.css" media="screen, print" rel="stylesheet"/>
<link href="css/choice-stylesheet.css" media="screen, print" rel="stylesheet"/>
<script src="./js/redirects.js" type="text/javascript"></script>
<script src="./js/domains.js" type="text/javascript"></script>
</meta></head>

Usually head tags are small and only contain the most important contents; however, here, there's some Javascript code. The title tag resides within the head tag.

In [10]:
soup.title # we can specifically call for the title tag
Out[10]:
<title>NPR Cookie Consent and Choices</title>

This result includes the tag itself. To get just the text within the tags, we can use the .string property.

In [11]:
soup.title.string
Out[11]:
'NPR Cookie Consent and Choices'

We can navigate to the parent tag (the tag that encompasses the current tag) via the .parent attribute. (You might expect the parent of title to be head; because this page contains an unclosed meta tag, html.parser nests the rest of the head, including title, inside that meta element, as the parsed output above shows.)

In [12]:
soup.title.parent.name
Out[12]:
'meta'

Parse the page with Beautiful Soup ¶

In HTML code, paragraphs are denoted with a <p> tag.

In [13]:
soup.p
Out[13]:
<p>
            NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.
            <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.
        </p>

This returns the first paragraph, and we can access properties of the given tag with the same syntax we use for dictionaries and dataframes:

In [14]:
c = BeautifulSoup('<p class="body"></p>', 'html.parser')
c.p['class']
Out[14]:
['body']
In [15]:
c.p.attrs
Out[15]:
{'class': ['body']}

In addition to 'paragraph' (aka p) tags, link tags are also very common; they are denoted by <a> tags.

In [16]:
soup.a
Out[16]:
<a href="https://text.npr.org/s.php?sId=609791368">See details</a>

It is called the a tag because links are also called 'anchors'. Nearly every page has multiple paragraphs and anchors, so how do we access the subsequent tags? There are two common functions, .find() and .find_all().

In [17]:
soup.find('title')
Out[17]:
<title>NPR Cookie Consent and Choices</title>
In [18]:
soup.find_all('title')
Out[18]:
[<title>NPR Cookie Consent and Choices</title>]

Here, the results were seemingly the same, since there is only one title to a webpage. However, you'll notice that .find_all() returned a list, not a single item. Sure, there was only one item in the list, but it returned a list. As the name implies, find_all() returns all items that match the passed-in tag.

In [19]:
soup.find_all('a')
Out[19]:
[<a href="https://text.npr.org/s.php?sId=609791368">See details</a>,
 <a class="user-action user-action--text" href="https://text.npr.org/s.php?sId=609131973#your-choices" id="textLink">YOUR CHOICES</a>,
 <a href="https://text.npr.org/s.php?sId=179876898">Terms of Use</a>,
 <a href="https://text.npr.org/s.php?sId=609131973">Privacy Policy</a>]

Look at all of those links! Amazing. It might be hard to read, but the href attribute of an a tag holds the URL, and we can extract it via the .get() function.

In [20]:
for link in soup.find_all('a'): # we could optionally pass the href=True flag .find_all('a', href=True)
    print(link.get('href'))
https://text.npr.org/s.php?sId=609791368
https://text.npr.org/s.php?sId=609131973#your-choices
https://text.npr.org/s.php?sId=179876898
https://text.npr.org/s.php?sId=609131973

On many pages, some of these links are relative to the current URL (e.g., /section/news/); here they happen to be absolute. Relative links can be resolved against the page's URL, as sketched below.
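A small sketch using urljoin from the standard library (the base URL and relative path here are purely illustrative):

from urllib.parse import urljoin

base_url = "https://www.npr.org/2018/11/05/664395755/some-article"  # hypothetical page URL
relative_href = "/sections/politics/"                               # hypothetical relative link
print(urljoin(base_url, relative_href))   # https://www.npr.org/sections/politics/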

In [21]:
paragraphs = soup.find_all('p')
paragraphs
Out[21]:
[<p>
             NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.
             <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.
         </p>,
 <p>
             You may click on “<strong>Your Choices</strong>” below to learn about and use cookie management tools to limit use of cookies when you visit NPR’s sites. This page will also tell you how you can reject cookies and still obtain access to NPR’s sites, and you can adjust your cookie choices in those tools at any time. If you click “<strong>Agree and Continue</strong>” below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR’s sites.
         </p>,
 <p class="acceptance-date" id="acceptanceDate"></p>,
 <p>NPR’s <a href="https://text.npr.org/s.php?sId=179876898">Terms of Use</a> and <a href="https://text.npr.org/s.php?sId=609131973">Privacy Policy</a>.</p>]

If we want just the paragraph text:

In [22]:
for pa in paragraphs:
    print(pa.get_text())
            NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.
            See details.
        

            You may click on “Your Choices” below to learn about and use cookie management tools to limit use of cookies when you visit NPR’s sites. This page will also tell you how you can reject cookies and still obtain access to NPR’s sites, and you can adjust your cookie choices in those tools at any time. If you click “Agree and Continue” below, you acknowledge that your cookie choices in those tools will be respected and that you otherwise agree to the use of cookies on NPR’s sites.
        

NPR’s Terms of Use and Privacy Policy.

Since there are multiple tags and various attributes, it is useful to check the data type of BeautifulSoup objects:

In [23]:
type(soup.find('p'))
Out[23]:
bs4.element.Tag

Since the .find() function returns a BeautifulSoup element, we can tack on multiple calls that continue to return elements:

In [24]:
soup.find('p')
Out[24]:
<p>
            NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites (together, “cookies”) to enhance your viewing, listening and user experience, personalize content, personalize messages from NPR’s sponsors, provide social media features, and analyze NPR’s traffic. This information is shared with social media, sponsorship, analytics, and other vendors or service providers.
            <a href="https://text.npr.org/s.php?sId=609791368">See details</a>.
        </p>
In [25]:
soup.find('p').find('a')
Out[25]:
<a href="https://text.npr.org/s.php?sId=609791368">See details</a>
In [26]:
soup.find('p').find('a').attrs['href'] # the value of the href attribute
Out[26]:
'https://text.npr.org/s.php?sId=609791368'
In [27]:
soup.find('p').find('a').text
Out[27]:
'See details'

In this case the text was already clean, but extracted text often includes leading and trailing whitespace (look at how much whitespace surrounds the text inside the <p> tag above). We can remove it with Python's built-in .strip() function.

In [28]:
soup.find('p').find('a').text.strip()
Out[28]:
'See details'

NOTE: above, we accessed a link's attributes via the .attrs property, which is simply a dictionary mapping attribute names to their values; we looked up the href key without placing any demand on what its value must be. Alternatively, if you inspect your HTML and spot specific regions you'd like to extract, you can pass an attrs dictionary to .find() so that only tags with matching attribute values are returned.

For example, on a full NPR article page, the HTML contains a line like:

<header class="npr-header" id="globalheader" aria-label="NPR header">

Let's say we know that the information we care about sits within tags matching this template (i.e., class is an attribute, and its value is 'npr-header'). The page we actually received, however, is NPR's cookie-consent page, whose header carries a different attribute, so we match on that instead:

In [29]:
#soup.find('header') has 1 attribute called "role"
soup.find('header').attrs
Out[29]:
{'role': 'banner'}
In [30]:
#That is why the line below won't work 
#soup.find('header', attrs={'id':'globalheader'})

soup.find('header', attrs={'role':'banner'})
Out[30]:
<header role="banner">
<img alt="NPR logo" class="npr-logo" src="https://media.npr.org/chrome_svg/npr-logo.svg"/>
<h1 class="header-txt">Cookie Consent and Choices</h1>
<div id="npr-rule" role="presentation"><span></span><span></span></div>
</header>

This matched it! We could then continue further processing by tacking on other commands:

In [31]:
soup.find('header', attrs={'role':'banner'}).find_all("li") # li stands for list items
Out[31]:
[]

On a full NPR article page, this would return the list items within that header section (the links of the navigation menu). On the cookie-consent page we actually received, the header contains no list items, so the result here is an empty list. Either way, if we wanted to grab just the links within those list items:

In [32]:
menu_links = set()
for list_item in soup.find('header', attrs={'role':'banner'}).find_all("li"):
    for link in list_item.find_all('a', href=True):
        menu_links.add(link)
menu_links # a unique set of all the seemingly important links in the header
Out[32]:
set()

Takeaway lesson ¶

The above tutorial isn't meant to be a study guide to memorize; its point is to show you the most important functionality that exists within BeautifulSoup, and to illustrate how one can access different pieces of content. No two web scraping tasks are identical, so it's useful to play around with code and try different things, while using the above as examples of how you may navigate between different tags and properties of a page. Don't worry; we are always here to help when you get stuck!

String formatting ¶

As we parse webpages, we often want to further adjust and format the text in a certain way.

For example, say we wanted to scrape a political website that lists every US Senator's name and office phone number. We may want to store the information for each senator in a dictionary, and all senators' information in a list. Thus, we'd have a list of dictionaries. Below, we initialize such a list (it has only 3 senators, for illustrative purposes, but imagine it contains many more).

In [33]:
# this is a bit clumsy of an initialization, but we spell it out this way for clarity purposes
# NOTE: imagine the dictionary were constructed in a more organic manner
senator1 = {"name":"Lamar Alexander", "number":"555-229-2812"}
senator2 = {"name":"Tammy Baldwin", "number":"555-922-8393"}
senator3 = {"name":"John Barrasso", "number":"555-827-2281"}
senators = [senator1, senator2, senator3]
print(senators)
[{'name': 'Lamar Alexander', 'number': '555-229-2812'}, {'name': 'Tammy Baldwin', 'number': '555-922-8393'}, {'name': 'John Barrasso', 'number': '555-827-2281'}]

In the real world, we may not want the final form of our information to be a Python dictionary; rather, we may need to send an email to the people on our mailing list, urging them to call their senators. If we have a templated format in mind, we can do the following:

In [34]:
email_template = """Please call {name} at {number}"""
for senator in senators:
    print(email_template.format(**senator))
Please call Lamar Alexander at 555-229-2812
Please call Tammy Baldwin at 555-922-8393
Please call John Barrasso at 555-827-2281

Please visit here for further documentation

Alternatively, one can format text using f-strings. See the documentation here. For example, using the same data structure and goal, one could produce identical results via:

In [35]:
for senator in senators:
    print(f"Please call {senator['name']} at {senator['number']}")
Please call Lamar Alexander at 555-229-2812
Please call Tammy Baldwin at 555-922-8393
Please call John Barrasso at 555-827-2281

Additionally, sometimes we wish to search large strings of text. If we wish to find all occurrences of a substring within a given string, a very mechanical, procedural way to do it is to use Python's .find() string method, repeatedly updating the starting index from which we search, as in the sketch below.
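Here is a toy version of that procedure (the sentence is invented purely for illustration):

text = "the cat sat on the mat; the dog did not"
target = "the"

positions = []
start = 0
while True:
    index = text.find(target, start)   # .find() returns -1 when there are no more matches
    if index == -1:
        break
    positions.append(index)
    start = index + 1                  # keep searching after this match
print(positions)                       # [0, 15, 24]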

Regular Expressions¶

A far more suitable and powerful approach is to use Regular Expressions (aka regex), a pattern-matching mechanism used throughout computer science and programming (it's not specific to Python). A full regex tutorial is beyond this lab, but below are several great resources we recommend if you are interested (they could be very useful for a homework problem); a tiny example follows the list:

  • https://docs.python.org/3.3/library/re.html
  • https://regexone.com
  • https://docs.python.org/3/howto/regex.html.
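For instance, the same toy search from above becomes a one-liner with the re module (a minimal sketch; regex can do far more than literal matching):

import re

text = "the cat sat on the mat; the dog did not"
print(re.findall(r"the", text))                        # ['the', 'the', 'the']
print([m.start() for m in re.finditer(r"the", text)])  # [0, 15, 24]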

Walkthrough Example of web scraping ¶

We're going to examine the structure of Goodreads' Best Books list (NOTE: Goodreads is described a little more in the Lab2_More_Pandas.ipynb notebook). We'll use the Developer Tools in Chrome; Safari and Firefox have similar tools available. To get the page we use the requests module. But first we should check whether the company's policy allows scraping: check the site's robots.txt to find which paths are off-limits. Please read and verify.
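As a quick sketch of such a check using Python's built-in robotparser module (whether a given path is allowed depends on whatever rules the site currently publishes, so treat the output as informative, not definitive):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.goodreads.com/robots.txt")
rp.read()   # fetch and parse the site's robots.txt

# can_fetch(user_agent, url) reports whether that user agent may request that URL
print(rp.can_fetch("*", "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1"))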

In [36]:
url="https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url)
# response.status_code
# response.content

# Beautiful Soup (library) time!
soup = BeautifulSoup(response.content, "html.parser")
    #print(soup)
    # soup.prettify()
soup.find("title")

    # Q1: how do we get the title's text?

    # Q2: how do we get the webpage's entire content?
Out[36]:
<title>NPR Cookie Consent and Choices</title>
In [37]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
url = URLSTART+BESTBOOKS+'1'
print(url)
page = requests.get(url)
https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1

We can inspect properties of the page. The most relevant are status_code and text. The former tells us whether the web page was found and the request succeeded.

In [38]:
page.status_code # 200 is good
Out[38]:
200
In [39]:
page.text[:5000]
Out[39]:
'<!DOCTYPE html>\n<html class="desktop withSiteHeaderTopFullImage\n">\n<head>\n  <title>Best Books Ever (92746 books)</title>\n\n<meta content=\'91,253 books based on 226331 votes: The Hunger Games by Suzanne Collins, Harry Potter and the Order of the Phoenix by J.K. Rowling, Pride and Prejudice b...\' name=\'description\'>\n<meta content=\'telephone=no\' name=\'format-detection\'>\n<link href=\'https://www.goodreads.com/list/show/1.Best_Books_Ever\' rel=\'canonical\'>\n\n\n\n    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();\n </script>\n  <script type="text/javascript">\n    var ue_mid = "A1PQBFHBHS6YH1";\n    var ue_sn = "www.goodreads.com";\n    var ue_furl = "fls-na.amazon.com";\n    var ue_sid = "996-7332168-6077398";\n    var ue_id = "5PPFF1YA973W43QVG877";\n\n    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener("load",f,false)}else{if(e.attachEvent){e.attachEvent("onload",f)}}a.tag=d("tag");a.log=d("log");a.reset=d("rst");c.ue_csm=c;c.ue=a;c.ueLogError=d("err");c.ues=d("ues");c.uet=d("uet");c.uex=d("uex");c.uet("ue")})(window);(function(e,d){var a=e.ue||{};function c(g){if(!g){return}var f=d.head||d.getElementsByTagName("head")[0]||d.documentElement,h=d.createElement("script");h.async="async";h.src=g;f.insertBefore(h,f.firstChild)}function b(){var k=e.ue_cdn||"z-ecx.images-amazon.com",g=e.ue_cdns||"images-na.ssl-images-amazon.com",j="/images/G/01/csminstrumentation/",h=e.ue_file||"ue-full-11e51f253e8ad9d145f4ed644b40f692._V1_.js",f,i;if(h.indexOf("NSTRUMENTATION_FIL")>=0){return}if("ue_https" in e){f=e.ue_https}else{f=e.location&&e.location.protocol=="https:"?1:0}i=f?"https://":"http://";i+=f?g:k;i+=j;i+=h;c(i)}if(!e.ue_inline){if(a.loadUEFull){a.loadUEFull()}else{b()}}a.uels=c;e.ue=a})(window,document);\n\n    if (window.ue && window.ue.tag) { window.ue.tag(\'list:show:signed_out\', ue.main_scope);window.ue.tag(\'list:show:signed_out:desktop\', ue.main_scope); }\n  </script>\n\n  <!-- * Copied from https://info.analytics.a2z.com/#/docs/data_collection/csa/onboard */ -->\n<script>\n  //<![CDATA[\n    !function(){function n(n,t){var r=i(n);return t&&(r=r("instance",t)),r}var r=[],c=0,i=function(t){return function(){var n=c++;return r.push([t,[].slice.call(arguments,0),n,{time:Date.now()}]),i(n)}};n._s=r,this.csa=n}();\n    \n    if (window.csa) {\n      window.csa("Config", {\n        "Application": "GoodreadsMonolith",\n        "Events.SushiEndpoint": "https://unagi.amazon.com/1/events/com.amazon.csm.csa.prod",\n        "Events.Namespace": "csa",\n        "CacheDetection.RequestID": "5PPFF1YA973W43QVG877",\n        "ObfuscatedMarketplaceId": "A1PQBFHBHS6YH1"\n      });\n    \n      window.csa("Events")("setEntity", {\n        session: { id: "996-7332168-6077398" },\n        page: {requestId: "5PPFF1YA973W43QVG877", meaningful: "interactive"}\n      });\n    }\n    \n    var e = document.createElement("script"); e.src = "https://m.media-amazon.com/images/I/41mrkPcyPwL.js"; document.head.appendChild(e);\n  //]]>\n</script>\n\n\n          <script type="text/javascript">\n        if (window.Mobvious === undefined) {\n          window.Mobvious = {};\n        }\n        window.Mobvious.device_type = 
\'desktop\';\n        </script>\n\n\n  \n<script src="https://s.gr-assets.com/assets/webfontloader-a550a17efafeccd666200db5de8ec913.js"></script>\n<script>\n//<![CDATA[\n\n  WebFont.load({\n    classes: false,\n    custom: {\n      families: ["Lato:n4,n7,i4", "Merriweather:n4,n7,i4"],\n      urls: ["https://s.gr-assets.com/assets/gr/fonts-e256f84093cc13b27f5b82343398031a.css"]\n    }\n  });\n\n//]]>\n</script>\n\n  <link rel="stylesheet" media="all" href="https://s.gr-assets.com/assets/goodreads-162177800b408edc4c2767205cb5e35f.css" />\n\n    <style type="text/css" media="screen">\n    .bigTabs {\n      margin-bottom: 10px;\n    }\n\n    .list_read{\n      background-color: #D7D2C4;\n      float: left;\n    }\n  </style>\n\n\n  <link rel="stylesheet" media="screen" href="https://s.gr-assets.com/assets/common_images-670d97636259cafc355c94fc43e871d7.css" />\n\n  <script src="https://s.gr-assets.com/assets/desktop/libraries-c07ee2e4be9ade4a64546b3ec60b523b.js"></script>\n  <script src="https://s.gr-assets.com/assets/application-d83f118fca4171aa093f1e14bf761694.js"></script>\n\n    <script>\n  //<![CDATA[\n    var gptAdSlots = gptAdSlots || [];\n    var googletag = googletag || {};\n    googletag.cmd = googletag.cmd || [];\n    (function() {\n      var gads = document.createElement("script");\n      gads.async = true;\n      gads.type = "text/javascript";\n      var useSSL = "https:" == document.location.protocol;\n      gads.src = (useSSL ? "https:" : "http:") +\n      "//securepubads.g.doubleclick.net/tag/js/gpt.js";\n      var node = document.get'

Let us write a loop to fetch 2 pages of "best books" from Goodreads. Notice the use of a format string; this is an example of old-style Python string formatting (with the % operator).

In [40]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
for i in range(1,3):
    bookpage=str(i)
    stuff=requests.get(URLSTART+BESTBOOKS+bookpage)
    filetowrite="files/page"+ '%02d' % i + ".html"
    os.makedirs(os.path.dirname(filetowrite), exist_ok=True)
    print("FTW", filetowrite)
    fd=open(filetowrite,"w+",encoding="utf-8")
    fd.write(stuff.text)
    fd.close()
    time.sleep(2)
FTW files/page01.html
FTW files/page02.html

Step 1. Parse the page, extract book urls¶

Notice how we do file input/output and use Beautiful Soup in the code below. The with construct ensures that the file being read is closed automatically, something we do explicitly for the file being written. We look for the elements with class bookTitle, extract the URLs, and write them into a file.

In [41]:
bookdict={}
for i in range(1,3):
    books=[]
    stri = '%02d' % i
    filetoread="files/page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread) as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    print(books[:10])
    bookdict[stri]=books
    fd=open("files/list"+stri+".txt","w")
    fd.write("\n".join(books))
    fd.close()
FTW files/page01.html
['/book/show/2767052-the-hunger-games', '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix', '/book/show/1885.Pride_and_Prejudice', '/book/show/2657.To_Kill_a_Mockingbird', '/book/show/19063.The_Book_Thief', '/book/show/41865.Twilight', '/book/show/170448.Animal_Farm', '/book/show/11127.The_Chronicles_of_Narnia', '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set', '/book/show/11870085-the-fault-in-our-stars']
FTW files/page02.html
['/book/show/43763.Interview_with_the_Vampire', '/book/show/4381.Fahrenheit_451', '/book/show/4473.A_Prayer_for_Owen_Meany', '/book/show/153747.Moby_Dick_or_the_Whale', '/book/show/1.Harry_Potter_and_the_Half_Blood_Prince', '/book/show/7171637-clockwork-angel', '/book/show/16299.And_Then_There_Were_None', '/book/show/37435.The_Secret_Life_of_Bees', '/book/show/4989.The_Red_Tent', '/book/show/49552.The_Stranger']

Here is the fourth entry of page 2, Moby-Dick:

In [42]:
bookdict['02'][3]
Out[42]:
'/book/show/153747.Moby_Dick_or_the_Whale'

Step 2. Parse a book page, extract book properties¶

Ok, so now let's dive in, fetch one of these book pages, and parse it.

In [43]:
furl=URLSTART+bookdict['02'][0]
furl
Out[43]:
'https://www.goodreads.com/book/show/43763.Interview_with_the_Vampire'
In [44]:
fstuff=requests.get(furl)
print(fstuff.status_code)
200
In [45]:
#d=BeautifulSoup(fstuff.text, 'html.parser')
# try this to take care of arabic strings
d = BeautifulSoup(fstuff.text, 'html.parser', from_encoding="utf-8")

#ignore warning
/Users/vaiby/opt/anaconda3/lib/python3.9/site-packages/bs4/__init__.py:226: UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.
  warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
In [46]:
d.select("meta[property='og:title']")[0]['content']
Out[46]:
'Interview with the Vampire (The Vampire Chronicles, #1)'
In [47]:
d.select("meta[property='books:isbn']")
Out[47]:
[<meta content="9780345476876" property="books:isbn"/>]

Let's get everything we want...

In [48]:
#d=BeautifulSoup(fstuff.text, 'html.parser', from_encoding="utf-8")
print(
"title", d.select_one("meta[property='og:title']")['content'],"\n",
#"isbn", d.select("meta[property='books:isbn']")[0]['content'],"\n",
"type", d.select("meta[property='og:type']")[0]['content'],"\n",
#"author", d.select("meta[property='books:author']")[0]['content'],"\n",
#"average rating", d.select_one("span.average").text,"\n",
#"ratingCount", d.select("meta[itemprop='ratingCount']")[0]["content"],"\n"
#"reviewCount", d.select_one("span.count")["title"]
)
title Interview with the Vampire (The Vampire Chronicles, #1) 
 type books.book 
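Several of the commented-out fields fail on some pages because the corresponding meta tag or element is simply missing. A small helper that returns None instead of raising an error can make the parsing more robust; this is a sketch with a hypothetical name (get_meta), not a Goodreads or Beautiful Soup function:

def get_meta(d, prop):
    # return the content of a <meta property="..."> tag, or None if the tag is absent
    tag = d.select_one("meta[property='{}']".format(prop))
    return tag['content'] if tag else None

get_meta(d, 'og:title'), get_meta(d, 'books:isbn')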

Ok, now that we know what to do, let's wrap our fetching into a proper script. So that we don't overwhelm their servers, we will only fetch 5 books from each page, but you get the idea...

We'll segue off for a bit to explore new-style format strings. See https://pyformat.info for more info.

In [49]:
"list{:0>2}.txt".format(3)
Out[49]:
'list03.txt'
In [50]:
a = "4"
b = 4
class Four:
    def __str__(self):
        return "Fourteen"
c=Four()
In [51]:
"The hazy cat jumped over the {} and {} and {}".format(a, b, c)
Out[51]:
'The hazy cat jumped over the 4 and 4 and Fourteen'
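For comparison, the same zero-padding can be written in all three styles; a quick sketch (f-strings require Python 3.6+):

i = 3
"list%02d.txt" % i           # old style -> 'list03.txt'
"list{:0>2}.txt".format(i)   # new style -> 'list03.txt'
f"list{i:0>2}.txt"           # f-string  -> 'list03.txt'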

Step 3. Set up a pipeline for fetching and parsing¶

Ok, let's get back to the fetching...

In [52]:
fetched=[]
for i in range(1,3):
    with open("files/list{:0>2}.txt".format(i)) as fd:
        counter=0
        for bookurl_line in fd:
            if counter > 4:
                break
            bookurl=bookurl_line.strip()
            stuff=requests.get(URLSTART+bookurl)
            filetowrite=bookurl.split('/')[-1]
            filetowrite="files/"+str(i)+"_"+filetowrite+".html"
            print("FTW", filetowrite)
            # use a separate handle for the output file so we don't shadow
            # the list file (fd) we are still iterating over
            fdw=open(filetowrite,"w",encoding="utf-8")
            fdw.write(stuff.text)
            fdw.close()
            fetched.append(filetowrite)
            time.sleep(2)
            counter=counter+1
            
print(fetched)
FTW files/1_2767052-the-hunger-games.html
FTW files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html
FTW files/1_1885.Pride_and_Prejudice.html
FTW files/1_2657.To_Kill_a_Mockingbird.html
FTW files/1_19063.The_Book_Thief.html
FTW files/2_43763.Interview_with_the_Vampire.html
FTW files/2_4381.Fahrenheit_451.html
FTW files/2_4473.A_Prayer_for_Owen_Meany.html
FTW files/2_153747.Moby_Dick_or_the_Whale.html
FTW files/2_1.Harry_Potter_and_the_Half_Blood_Prince.html
['files/1_2767052-the-hunger-games.html', 'files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html', 'files/1_1885.Pride_and_Prejudice.html', 'files/1_2657.To_Kill_a_Mockingbird.html', 'files/1_19063.The_Book_Thief.html', 'files/2_43763.Interview_with_the_Vampire.html', 'files/2_4381.Fahrenheit_451.html', 'files/2_4473.A_Prayer_for_Owen_Meany.html', 'files/2_153747.Moby_Dick_or_the_Whale.html', 'files/2_1.Harry_Potter_and_the_Half_Blood_Prince.html']
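If a site ever starts refusing your requests, it can help to identify yourself and back off between retries. Here is a minimal sketch; the helper name, header value, and retry counts are our own choices, not something Goodreads prescribes:

def polite_get(url, tries=3, pause=2):
    # retry a few times, pausing longer after each failed attempt, and send a simple User-Agent
    headers = {"User-Agent": "EPA1316-lab-scraper (student exercise)"}
    for attempt in range(tries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        time.sleep(pause * (attempt + 1))
    return resp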

Ok, we are off to parse each one of the HTML pages we fetched. We have provided the skeleton of the code, plus the code to parse the year, since that part is a bit more complex... see the difference in the screenshots above.

In [53]:
import re
yearre = r'\d{4}'
def get_year(d):
    # most book pages show the first-publication year inside a <nobr class="greyText"> tag;
    # take the last token of that text and drop the trailing ")"
    if d.select_one("nobr.greyText"):
        return d.select_one("nobr.greyText").text.strip().split()[-1][:-1]
    else:
        # otherwise, grab the first 4-digit number from the publication details row
        thetext=d.select("div#details div.row")[1].text.strip()
        rowmatch=re.findall(yearre, thetext)
        if len(rowmatch) > 0:
            rowtext=rowmatch[0].strip()
        else:
            rowtext="NA"
        return rowtext
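You can sanity-check the regular expression on a publication line of the form Goodreads typically uses; for example:

# the details row usually reads something like "Published June 1st 2010 by ... (first published 1976)"
re.findall(yearre, "Published June 1st 2010 by Dial Press (first published 1976)")
# -> ['2010', '1976']; get_year() keeps the first match it finds in that row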

Exercise ¶

Your job is to fill in the code to get the genres. Go to your browser, open the URL of the book page above, and "View Page Source". In the main web page, find the text that contains the year or the genre (or whichever field you are looking for), right-click it, and choose "Inspect Element". Your browser will highlight where this line sits in the page's source code. You can then use the tags associated with this field to search for it automatically in BeautifulSoup.

In [54]:
def get_genres(d):
    # your code here
    # each genre link sits inside a div.elementList > div.left; collect the hrefs (e.g. '/genres/classics')
    genres=d.select("div.elementList div.left a")
    glist=[]
    for g in genres:
        glist.append(g['href'])
    return glist
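The hrefs collected above look like /genres/classics. If you would rather store plain genre names, a small variant (a sketch) keeps only the last path segment:

def get_genre_names(d):
    # same selection as get_genres, but keep only the part after the final '/'
    return [a['href'].split('/')[-1] for a in d.select("div.elementList div.left a")]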
In [55]:
listofdicts=[]
for filetoread in fetched:
    print(filetoread)
    td={}
    with open(filetoread, encoding = "utf-8") as fd:
        datext = fd.read()
    d=BeautifulSoup(datext, 'html.parser')
    td['title']=d.select_one("meta[property='og:title']")['content']
    #td['isbn']=d.select_one("meta[property='books:isbn']")['content']
    td['booktype']=d.select_one("meta[property='og:type']")['content']
    #td['author']=d.select_one("meta[property='books:author']")['content']
    #td['rating']=d.select_one("span.average").text
    #td['year'] = get_year(d)
    td['file']=filetoread
    glist = get_genres(d)
    td['genres']="|".join(glist)
    listofdicts.append(td)
files/1_2767052-the-hunger-games.html
files/1_2.Harry_Potter_and_the_Order_of_the_Phoenix.html
files/1_1885.Pride_and_Prejudice.html
files/1_2657.To_Kill_a_Mockingbird.html
files/1_19063.The_Book_Thief.html
files/2_43763.Interview_with_the_Vampire.html
files/2_4381.Fahrenheit_451.html
files/2_4473.A_Prayer_for_Owen_Meany.html
files/2_153747.Moby_Dick_or_the_Whale.html
files/2_1.Harry_Potter_and_the_Half_Blood_Prince.html
In [56]:
listofdicts[0]
Out[56]:
{'title': 'The Hunger Games (The Hunger Games, #1)',
 'booktype': 'books.book',
 'file': 'files/1_2767052-the-hunger-games.html',
 'genres': ''}

Finally, let's write all this into a CSV file, which we will use to do analysis.

In [57]:
df = pd.DataFrame.from_records(listofdicts)
df
Out[57]:
title booktype file genres
0 The Hunger Games (The Hunger Games, #1) books.book files/1_2767052-the-hunger-games.html
1 Harry Potter and the Order of the Phoenix (Har... books.book files/1_2.Harry_Potter_and_the_Order_of_the_Ph...
2 Pride and Prejudice books.book files/1_1885.Pride_and_Prejudice.html /genres/classics|/genres/fiction|/genres/roman...
3 To Kill a Mockingbird books.book files/1_2657.To_Kill_a_Mockingbird.html /genres/classics|/genres/fiction|/genres/histo...
4 The Book Thief books.book files/1_19063.The_Book_Thief.html
5 Interview with the Vampire (The Vampire Chroni... books.book files/2_43763.Interview_with_the_Vampire.html
6 Fahrenheit 451 books.book files/2_4381.Fahrenheit_451.html /genres/classics|/genres/fiction|/genres/scien...
7 A Prayer for Owen Meany books.book files/2_4473.A_Prayer_for_Owen_Meany.html
8 Moby-Dick or, the Whale books.book files/2_153747.Moby_Dick_or_the_Whale.html
9 Harry Potter and the Half-Blood Prince (Harry ... books.book files/2_1.Harry_Potter_and_the_Half_Blood_Prin...
In [58]:
df.to_csv("files/meta_utf8_EK.csv", index=False, header=True)

Advanced skills: Let's Learn How To Utilize APIs (Optional) ¶

Information below has been cited and edited from Dataquest - Python API Tutorial

APIs can help you retrieve data for your data science projects, and they are the industry-standard way of providing access to data. Most government websites make their data available through APIs. This would be very handy if, say, you were searching for day-by-day COVID statistics in India, right?

However, one might ask: why use an API when I can download my CSVs just fine? That may be true, but APIs are very useful in the following cases:

  • Your data is changing quickly and continuously.
  • You would like to access a certain part of the database rather than the entire thing.
  • There is repeated computation involved.
    • For example, Spotify has an API that can tell you the genre of a piece of music. You could build your own classifier to compute music categories, but you will never have as much data as Spotify does.

So, what is an API? ¶

An API, or Application Programming Interface, is a server that you can use to retrieve data from and send data to with code. APIs are most commonly used to retrieve data, and by the end of this tutorial you will be comfortable doing exactly that.

When we would like to receive data from an API, we need to make a request. Requests are used all over the web. For instance, when you visit a blog post, your web browser makes a request to a web server, which responds with the content of the web page.

In [59]:
Image('figs/api.png')
Out[59]:

An API works in the same way: you make a request to an API server for data, and you get a response back.

Making API Requests in Python ¶

To work with APIs in Python we can use the requests library, the most commonly used option.

Note: as requests is not part of base Python, you will need to install it first with pip install requests.

Our First API Request ¶

There are many different types of requests. The most commonly used one, a GET request, is used to retrieve data. Because we'll just be working with retrieving data, our focus will be on making GET requests. When we make a request, the response from the API comes with a response code, which tells us whether our request was successful. Response codes are important because they immediately tell us if something went wrong.

But first, let us find an endpoint to connect to. We will use the Open Government Data Platform India for this endeavour, which could be a good example for your projects. The Open Government Data Platform India, or data.gov.in, is a platform supporting the Open Data initiative of the Government of India. The portal is a single point of access to datasets, documents, services, tools and applications published by ministries, departments and organisations of the Government of India. You will now be introduced to this website as an example for your future research. Other endpoints are laid out in a similar manner, with different documentation and structure, which you must research and try to use yourself.

Let us start our example. The website itself looks like this:

In [60]:
Image("figs/datagovin.png", width= 1400, height = 1000 )
Out[60]:

To utilize this website let us first find a subject through the catalog. You can also select certain domains (databases) and sectors as seen below.

In [61]:
Image("figs/datagovin2.png")
Out[61]:

I arbitrarily chose COVID as a topic, and the catalog listed all the APIs related to it.

In [62]:
Image("figs/datagovin3.png")
Out[62]:

After I select an API I am interested in, this page pops up and asks for certain parameters.

  • API Key: This is specific to this website in particular; however, expect many other databases to establish their own rules.
    • On this website, you need to create an account in order to receive a member-specific key, which lets you retrieve the entire API output.
    • With the sample key you can only access 10 data points.
  • Format: Many are tempted to request the output as CSV; however, it is easier to just get a JSON file for now and only then turn it into a CSV, which we will do later on.
  • Offset: The number of records to skip.
  • Limit: The maximum number of records to return.
  • Filters: Filters the results with respect to the fields specified.

There is no need to input all the parameters; the critical ones are the API key and the format.

In [63]:
Image("figs/datagovin5.png")
Out[63]:
In [64]:
Image("figs/datagovin6.png")
Out[64]:

To make a GET request, we'll use the requests.get() function, which requires one argument: the URL we want to make the request to. Please note that the URL in the code below is the request URL copied from the image above. When you have your own API key, it is important that you change the part after "?api-key=" accordingly.

In [65]:
import requests
import pandas as pd
import json # lets us work with the JSON response so we can turn it into a pandas dataframe later on
# json_normalize helps when fields in the JSON contain arrays of unequal length;
# recent pandas versions expose it directly as pd.json_normalize
from pandas.io.json import json_normalize


resp = requests.get("https://api.data.gov.in/resource/0bdce1aa-10a9-4dcd-abfd-33ba51a879d6?api-key=579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b&format=json")

The get() function returns a response object. We can use its .status_code attribute (here, resp.status_code) to check the status code of our request:

In [66]:
print(resp.status_code)
200
We received a 200! This means our request was successful.

There are many API status codes. You can find a small snippet of them here if you are interested and want to understand what went wrong in the future. Cited from Dataquest - Python API Tutorial

In [67]:
Image("figs/status.png")
Out[67]:

Now that we have received a response and stored it in our resp object, let us read what is inside as JSON.

In [68]:
print(resp.json())
{'index_name': '0bdce1aa-10a9-4dcd-abfd-33ba51a879d6', 'title': 'State/UT-wise COVID-19 Vaccine Doses Supplied by Government of India till 01 July 2021', 'desc': 'State/UT-wise COVID-19 Vaccine Doses Supplied by Government of India till 01 July 2021', 'org_type': 'Central', 'org': ['Rajya Sabha'], 'sector': ['All'], 'source': 'data.gov.in', 'catalog_uuid': '54c74b5d-74b1-4e9a-b16e-7023c6c0a6fd', 'visualizable': '1', 'active': '1', 'created': 1650970873, 'updated': 1651221646, 'created_date': '2022-04-26T11:01:13Z', 'updated_date': '2022-04-29T14:10:46Z', 'external_ws': 0, 'external_ws_url': '', 'target_bucket': {'index': 'api', 'type': '54c74b5d-74b1-4e9a-b16e-7023c6c0a6fd', 'field': '0bdce1aa-10a9-4dcd-abfd-33ba51a879d6'}, 'field': [{'id': 'sl__no_', 'name': 'Sl. No.', 'type': 'keyword'}, {'id': 'state_ut', 'name': 'State/UT', 'type': 'keyword'}, {'id': 'vaccine_doses_supplied', 'name': 'Vaccine doses supplied', 'type': 'double'}], 'message': 'Resource lists', 'version': '2.2.0', 'status': 'ok', 'total': 39, 'count': 10, 'limit': '10', 'offset': '0', 'records': [{'sl__no_': '1', 'state_ut': 'Andaman & Nicobar Islands', 'vaccine_doses_supplied': 230000}, {'sl__no_': '2', 'state_ut': 'Andhra Pradesh', 'vaccine_doses_supplied': 13804020}, {'sl__no_': '3', 'state_ut': 'Arunachal Pradesh', 'vaccine_doses_supplied': 610360}, {'sl__no_': '4', 'state_ut': 'Assam', 'vaccine_doses_supplied': 6566020}, {'sl__no_': '5', 'state_ut': 'Bihar', 'vaccine_doses_supplied': 12789100}, {'sl__no_': '6', 'state_ut': 'Chandigarh', 'vaccine_doses_supplied': 474480}, {'sl__no_': '7', 'state_ut': 'Chhattisgarh', 'vaccine_doses_supplied': 8722780}, {'sl__no_': '8', 'state_ut': 'Dadra and Nagar Haveli', 'vaccine_doses_supplied': 217800}, {'sl__no_': '9', 'state_ut': 'Daman and Diu', 'vaccine_doses_supplied': 191420}, {'sl__no_': '10', 'state_ut': 'Delhi', 'vaccine_doses_supplied': 6547400}]}

As you can see, the raw output is very hard to work with!

Here you have two options. You can either work with the JSON directly (you can read more via the link provided at the start of Let's Learn How To Utilize APIs) or turn this data into a pandas DataFrame. The rest of this introduction follows the pandas DataFrame route, which was summarized from this post on Stack Overflow.

In [69]:
content = json.loads(resp.content) # parse the raw body of the response into a Python dictionary using json.loads
In [70]:
content # Inspect the parsed content. Be careful here: we are looking for the part that holds the actual records.
# In this example the data itself starts under the 'records' key, which we must keep in mind.
Out[70]:
{'index_name': '0bdce1aa-10a9-4dcd-abfd-33ba51a879d6',
 'title': 'State/UT-wise COVID-19 Vaccine Doses Supplied by Government of India till 01 July 2021',
 'desc': 'State/UT-wise COVID-19 Vaccine Doses Supplied by Government of India till 01 July 2021',
 'org_type': 'Central',
 'org': ['Rajya Sabha'],
 'sector': ['All'],
 'source': 'data.gov.in',
 'catalog_uuid': '54c74b5d-74b1-4e9a-b16e-7023c6c0a6fd',
 'visualizable': '1',
 'active': '1',
 'created': 1650970873,
 'updated': 1651221646,
 'created_date': '2022-04-26T11:01:13Z',
 'updated_date': '2022-04-29T14:10:46Z',
 'external_ws': 0,
 'external_ws_url': '',
 'target_bucket': {'index': 'api',
  'type': '54c74b5d-74b1-4e9a-b16e-7023c6c0a6fd',
  'field': '0bdce1aa-10a9-4dcd-abfd-33ba51a879d6'},
 'field': [{'id': 'sl__no_', 'name': 'Sl. No.', 'type': 'keyword'},
  {'id': 'state_ut', 'name': 'State/UT', 'type': 'keyword'},
  {'id': 'vaccine_doses_supplied',
   'name': 'Vaccine doses supplied',
   'type': 'double'}],
 'message': 'Resource lists',
 'version': '2.2.0',
 'status': 'ok',
 'total': 39,
 'count': 10,
 'limit': '10',
 'offset': '0',
 'records': [{'sl__no_': '1',
   'state_ut': 'Andaman & Nicobar Islands',
   'vaccine_doses_supplied': 230000},
  {'sl__no_': '2',
   'state_ut': 'Andhra Pradesh',
   'vaccine_doses_supplied': 13804020},
  {'sl__no_': '3',
   'state_ut': 'Arunachal Pradesh',
   'vaccine_doses_supplied': 610360},
  {'sl__no_': '4', 'state_ut': 'Assam', 'vaccine_doses_supplied': 6566020},
  {'sl__no_': '5', 'state_ut': 'Bihar', 'vaccine_doses_supplied': 12789100},
  {'sl__no_': '6', 'state_ut': 'Chandigarh', 'vaccine_doses_supplied': 474480},
  {'sl__no_': '7',
   'state_ut': 'Chhattisgarh',
   'vaccine_doses_supplied': 8722780},
  {'sl__no_': '8',
   'state_ut': 'Dadra and Nagar Haveli',
   'vaccine_doses_supplied': 217800},
  {'sl__no_': '9',
   'state_ut': 'Daman and Diu',
   'vaccine_doses_supplied': 191420},
  {'sl__no_': '10', 'state_ut': 'Delhi', 'vaccine_doses_supplied': 6547400}]}
In [71]:
"""
As we realized that the 'records' were where the data was starting to be written, we used that
to create a data frame from the dictionary that was inside the content of the response. 
"""

df = pd.DataFrame.from_dict(content['records'])
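Equivalently, since json_normalize was imported earlier, the same flat table can be built directly from the list of records; a sketch (recent pandas versions expose this as pd.json_normalize):

df_alt = pd.json_normalize(content['records'])
df_alt.head()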
In [72]:
df.head()
Out[72]:
sl__no_ state_ut vaccine_doses_supplied
0 1 Andaman & Nicobar Islands 230000
1 2 Andhra Pradesh 13804020
2 3 Arunachal Pradesh 610360
3 4 Assam 6566020
4 5 Bihar 12789100

There we go! We now have the data as a pandas DataFrame. From this moment on we can do any operations that we deem necessary. Follow the course to learn how :)
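For instance, a small sketch: make sure the doses column is numeric and look at the largest suppliers.

# coerce to numbers in case the column came back as strings, then sort
df['vaccine_doses_supplied'] = pd.to_numeric(df['vaccine_doses_supplied'])
df.sort_values('vaccine_doses_supplied', ascending=False).head()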

You can also save this DataFrame to a CSV file just like this:

In [73]:
df.to_csv("sample.csv")
Remember! Always try to research the web first when you encounter a block, be it a bug or a general sense of not knowing what to do. This will make you into better scientists. Hey, even I looked a few things up on the web to remember how to do this! You can do it too :)

Citations utilized:

ValueError: arrays must all be same length

Dataquest - Python API Tutorial

Best of Luck with your studies!