{"id":721,"date":"2021-04-22T10:28:35","date_gmt":"2021-04-22T10:28:35","guid":{"rendered":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/?p=721"},"modified":"2021-04-22T10:28:36","modified_gmt":"2021-04-22T10:28:36","slug":"a-simple-python-web-scraping-guide-for-journalists","status":"publish","type":"post","link":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/2021\/04\/22\/a-simple-python-web-scraping-guide-for-journalists\/","title":{"rendered":"A simple Python web scraping guide for journalists"},"content":{"rendered":"\n<p>The phrase &#8220;data is all around us&#8221; is thrown around often when you first start data journalism or data science. While data is indeed all around us, from social media statistics to daily weather fluctuations, most of the time, it&#8217;s difficult, if not impossible to download, clean and analyze as a journalist without a strong technical background.<\/p>\n\n\n\n<p>For me, a student journalist who almost failed out of her first-year computer science class, web scraping was a data journalism skill I never thought I would take up. Open source government data in clean spreadsheets seemed to be the extent of the data I would work on. However, over the past few weeks, I learned how to write a small piece of Python code that helped me scrape some data I had been searching for from a government website.<\/p>\n\n\n\n<p>While I still have a lot to learn and do for this project, being able to write this program from scratch and have it actually give me the output I needed was incredibly rewarding. 
This piece will take you step-by-step through the web scraping process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is web scraping?<\/strong><\/h2>\n\n\n\n<p><a href=\"http:\/\/dictionary.com\">Dictionary.com<\/a> defines web scraping as &#8220;the extraction and copying of data from a website into a structured format using a computer program.&#8221;<\/p>\n\n\n\n<p>On web pages, there are text, images, tables, lists, links and other elements that you can scrape using code. Web pages are written in hypertext markup language (HTML), which is what your code scrapes from. For example, you can scrape tables off of Wikipedia pages, which are denoted with the &lt;table&gt;&lt;\/table&gt; tags. <\/p>\n\n\n\n<p>These two web pages are examples of pages that can be scraped for data, relatively easily, using code.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<style>.wp-block-kadence-advancedgallery.kb-gallery-wrap-id-_c0effa-81{overflow:hidden;}.kb-gallery-id-_c0effa-81 .kadence-blocks-gallery-item .kb-gal-image-radius, .kb-gallery-id-_c0effa-81 .kb-slide-item .kb-gal-image-radius img{border-radius:0px 0px 0px 0px;;}.kb-gallery-caption-style-bottom.kb-gallery-id-_c0effa-81 .kadence-blocks-gallery-item .kadence-blocks-gallery-item-inner .kadence-blocks-gallery-item__caption, .kb-gallery-caption-style-bottom-hover.kb-gallery-id-_c0effa-81 .kadence-blocks-gallery-item .kadence-blocks-gallery-item-inner .kadence-blocks-gallery-item__caption{background:linear-gradient(0deg, rgba(0, 0, 0, 0.8) 0, rgba(0, 0, 0, 0) 100%);}.kb-gallery-wrap-id-_c0effa-81.wp-block-kadence-advancedgallery{overflow:visible;}.kb-gallery-wrap-id-_c0effa-81.wp-block-kadence-advancedgallery .kt-blocks-carousel{overflow:visible;}<\/style><div class=\"kb-gallery-wrap-id-_c0effa-81 alignnone wp-block-kadence-advancedgallery\"><div class=\"kb-gallery-ul kb-gallery-non-static kb-gallery-type-carousel kb-gallery-id-_c0effa-81 kb-gallery-caption-style-bottom 
kb-gallery-filter-none\" data-image-filter=\"none\" data-lightbox-caption=\"true\"><div class=\"kt-blocks-carousel splide kt-carousel-container-dotstyle-dark kt-carousel-arrowstyle-whiteondark kt-carousel-dotstyle-dark kb-slider-group-arrow kb-slider-arrow-position-center\" data-columns-xxl=\"2\" data-columns-xl=\"2\" data-columns-md=\"2\" data-columns-sm=\"2\" data-columns-xs=\"1\" data-columns-ss=\"1\" data-slider-anim-speed=\"400\" data-slider-scroll=\"1\" data-slider-arrows=\"true\" data-slider-dots=\"true\" data-slider-hover-pause=\"false\" data-slider-auto=\"\" data-slider-speed=\"7000\" data-slider-gap=\"20px\" data-slider-gap-tablet=\"20px\" data-slider-gap-mobile=\"20px\" data-show-pause-button=\"false\" data-slider-label=\"Photo Gallery Carousel\"><div class=\"splide__track\"><ul class=\"kt-blocks-carousel-init kb-gallery-carousel splide__list\"><li class=\"kb-slide-item kb-gallery-carousel-item splide__slide\"><div class=\"kadence-blocks-gallery-item\"><div class=\"kadence-blocks-gallery-item-inner\"><figure class=\"kb-gallery-figure kadence-blocks-gallery-item-has-caption\"><div class=\"kb-gal-image-radius\"><div class=\"kb-gallery-image-contain kadence-blocks-gallery-intrinsic kb-gallery-image-ratio-land32 kb-has-image-ratio-land32\" ><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM-1024x563.png\" width=\"1024\" height=\"563\" alt=\"\" data-full-image=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM.png\" data-light-image=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM.png\" data-id=\"722\" class=\"wp-image-722 skip-lazy\" 
srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM-1024x563.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM-300x165.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM-768x422.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM-1536x845.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.34.41-PM.png 1722w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/div><div class=\"kadence-blocks-gallery-item__caption\">A table of provincial parks in B.C (Wikipedia)<\/div><\/div><\/figure><\/div><\/div><\/li><li class=\"kb-slide-item kb-gallery-carousel-item splide__slide\"><div class=\"kadence-blocks-gallery-item\"><div class=\"kadence-blocks-gallery-item-inner\"><figure class=\"kb-gallery-figure kadence-blocks-gallery-item-has-caption\"><div class=\"kb-gal-image-radius\"><div class=\"kb-gallery-image-contain kadence-blocks-gallery-intrinsic kb-gallery-image-ratio-land32 kb-has-image-ratio-land32\" ><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM-1024x739.png\" width=\"1024\" height=\"739\" alt=\"\" data-full-image=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM.png\" data-light-image=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM.png\" data-id=\"723\" class=\"wp-image-723 skip-lazy\" 
srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM-1024x739.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM-300x217.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM-768x555.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-5.33.10-PM.png 1313w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/div><div class=\"kadence-blocks-gallery-item__caption\">A list of security incidents on Ryerson campus (Ryerson University)<\/div><\/div><\/figure><\/div><\/div><\/li><\/ul><\/div><\/div><\/div><\/div>\n\n\n<div style=\"height:28px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Web scraping code will identify the page or pages you&#8217;re looking to extract from, grab the data you want and then save it in a &#8220;structured format&#8221; such as a CSV file. The purpose of web scraping is to get data that exists on a website into a file on your computer that you can analyze more easily.<\/p>\n\n\n\n<p>Web scraping is useful when you can find the data you want on a website, but it&#8217;s not available to download, or not available in the format you want. For example, you might want to scrape a simple table that a website doesn&#8217;t offer as a download. 
Or, in my case, you might want to scrape multiple PDF files and save them as a different type of file.<\/p>\n\n\n\n<p>The process of web scraping, as described by <a href=\"https:\/\/www.dataquest.io\/blog\/web-scraping-python-using-beautiful-soup\/\">Dataquest.io<\/a>, is as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Request the content (source code) of a specific URL from the server<\/li><li>Download the content that is returned<\/li><li>Identify the elements of the page that are part of the table we want<\/li><li>Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What I&#8217;m working on<\/strong><\/h2>\n\n\n\n<p>When I was working on my <a href=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/2021\/01\/23\/north-shore-rescue-sees-a-busy-winter-season-with-a-record-breaking-2020\/?_thumbnail_id=477\">previous story on North Shore Rescue&#8217;s record-breaking 2020 year,<\/a> I was trying to find search and rescue statistics from 2020, including which parks or areas saw the most calls.<\/p>\n\n\n\n<p>Unfortunately, they didn&#8217;t have this information published anywhere, but they did have comprehensive data on every search and rescue mission that happened. However, it was all stored in PDFs.<\/p>\n\n\n\n<p>Emergency Management BC releases a weekly incident report on their <a href=\"https:\/\/www2.gov.bc.ca\/gov\/content\/safety\/emergency-preparedness-response-recovery\/emergency-response-and-recovery\/incident-summaries\">website<\/a> that lists all the emergency-related incidents, including search and rescue missions, that occurred in the past week. These reports are kept as PDF files on the website, which anyone can download. 
However, PDFs are virtually useless for data analysis purposes\u2013we want them in a machine-readable format, such as a CSV file.<\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#e1e1e1\"><strong>My web scraping goal became this: Download every PDF on the webpage, convert each one into a CSV file and save it into a folder on my computer.<\/strong><\/p>\n\n\n\n<p>Using Python, a relatively easy-to-learn coding language, some perseverance and lots of Stack Overflow searching, I managed to write a short program that did just that.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Locating the data<\/h2>\n\n\n\n<p>First, let&#8217;s take a look at the webpage to see where these PDFs are stored. Here&#8217;s the link to all the weekly incident reports in 2019 on the B.C. government website.<\/p>\n\n\n\n<p><a href=\"https:\/\/www2.gov.bc.ca\/gov\/content\/safety\/emergency-preparedness-response-recovery\/emergency-response-and-recovery\/incident-summaries\/incident-summaries-2019\">https:\/\/www2.gov.bc.ca\/gov\/content\/safety\/emergency-preparedness-response-recovery\/emergency-response-and-recovery\/incident-summaries\/incident-summaries-2019<\/a><\/p>\n\n\n\n<p>There&#8217;s a separate web page for every year, and each web page holds a list of links to the weekly incident report PDFs. If you right-click on the web page and click &#8220;Inspect,&#8221; you can see how the page is formatted with HTML. <\/p>\n\n\n\n<figure class=\"wp-block-video\"><video controls src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Recording-2021-04-21-at-5.57.43-PM-1.mov\"><\/video><\/figure>\n\n\n\n<p>Each list item is held in a <strong>&lt;li><\/strong> tag. In each &lt;li> tag is a link to the PDF that we want, and the links are held in <strong>&lt;a href><\/strong> tags. 
<\/p>\n\n\n\n<p>Now that we know where our data is living on the web page, we have a better idea of how to go about identifying it and locating it in our code.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Setting up your workspace<\/h2>\n\n\n\n<p>Before we start writing our Python code, we need to have Python installed on our computer. This <a href=\"https:\/\/realpython.com\/installing-python\/\" target=\"_blank\" rel=\"noreferrer noopener\">guide from RealPython<\/a> will tell you how to properly install Python 3 on your computer.<\/p>\n\n\n\n<p>Once you have Python installed, you&#8217;ll need to download a text editor to write your code in. A text editor is basically a program that allows you to write and edit code in a range of programming languages. <\/p>\n\n\n\n<p>I use <a href=\"https:\/\/atom.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Atom<\/a> for my work, but simpler options include:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>SublimeText<\/li><li>Notepad++<\/li><li>Anaconda (technically a Python distribution rather than a text editor, but very popular for using Python for data science)<\/li><\/ul>\n\n\n\n<p>Open up a new file and give it a name. I named my file scraperExample.py. Make sure to include the file extension &#8220;.py&#8221; so that your text editor knows that you&#8217;re writing in Python. <\/p>\n\n\n\n<p>Now we can start writing some code!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Import packages<\/h2>\n\n\n\n<p>Python packages are directories of pre-written modules that bundle useful functions. Using packages eliminates the need to write a lot of code from scratch.<\/p>\n\n\n\n<p>Here&#8217;s a <a href=\"https:\/\/datatofish.com\/install-package-python-using-pip\/\">helpful guide<\/a> from Data to Fish on how to install packages on your computer.<\/p>\n\n\n\n<p>It took me a while to figure out which packages to use and which functions to use within each package. 
I looked up a lot of examples on <a href=\"https:\/\/stackoverflow.com\/\">Stack Overflow<\/a> and other coding resources to see how other programmers use certain packages, and then applied bits of their code in my own project.<\/p>\n\n\n\n<p>That being said, here&#8217;s a brief overview of the libraries we&#8217;re using:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/docs.python-requests.org\/en\/master\/\">requests<\/a> &#8211; allows you to send HTTP\/1.1 requests <\/li><li><a href=\"https:\/\/tabula-py.readthedocs.io\/en\/latest\/\">tabula<\/a> &#8211; can read tables of PDFs and convert a PDF file into CSV\/TSV\/JSON file<\/li><li><a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\">Beautiful Soup<\/a> &#8211; pulls data out of HTML and XML files<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2194\" height=\"632\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM.png\" alt=\"\" class=\"wp-image-732\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM.png 2194w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM-300x86.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM-1024x295.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM-768x221.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM-1536x442.png 1536w, 
https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.29.24-PM-2048x590.png 2048w\" sizes=\"auto, (max-width: 2194px) 100vw, 2194px\" \/><\/figure>\n\n\n\n<p>Once you&#8217;ve installed the packages on your computer, you just need to write &#8220;import (library name)&#8221; in the first lines of your file.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Identify the website to scrape from<\/h2>\n\n\n\n<p>Once your packages are imported into your file, we can start by telling our scraper where we want to scrape from.<\/p>\n\n\n\n<p>I created a variable called <strong><span style=\"color:#283f4a\" class=\"has-inline-color\">link<\/span><\/strong> that stored the URL of the BC government website with the weekly incident reports.<\/p>\n\n\n\n<p> The <strong>headers<\/strong> variable contains the HTTP response headers. Frankly, I still don&#8217;t really know what headers are, but I used <a href=\"https:\/\/websniffer.cc\/\">Websniffer<\/a> to get response headers for the website in question. Just know that you can pass the headers to the <strong>requests.get <\/strong>function as a parameter.<\/p>\n\n\n\n<p>I created another variable called <strong>request<\/strong> which sends an HTTP request to the URL we&#8217;ve identified and downloads the HTML contents of that web page for us. <strong>.get<\/strong> is a function we&#8217;ve pulled from the<strong> requests<\/strong> library that we imported earlier. It takes two parameters: the URL of the website and the headers, both of which we&#8217;ve just identified and assigned to variables. 
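Put together, a minimal sketch of this step might look like the following. (The real headers I used came from Websniffer; the simple User-Agent below is a stand-in, and the URL is the incident summaries page from earlier.)

```python
import requests

# The page listing the 2019 weekly incident report PDFs
link = ("https://www2.gov.bc.ca/gov/content/safety/"
        "emergency-preparedness-response-recovery/"
        "emergency-response-and-recovery/incident-summaries/"
        "incident-summaries-2019")

# Stand-in header; a User-Agent alone is often enough to avoid being blocked
headers = {"User-Agent": "Mozilla/5.0"}

try:
    # Ask the server for the page and download its HTML source code
    request = requests.get(link, headers=headers, timeout=30)
    print(request.status_code)  # 200 means the page downloaded successfully
except requests.RequestException as exc:
    print("Request failed:", exc)
```

If the printed status code is 200, the download worked and the page's source code is sitting in the variable.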
<\/p>\n\n\n\n<p>Our variable <strong>request<\/strong> now holds the source code of our <strong>link<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-1024x311.png\" alt=\"\" class=\"wp-image-734\" width=\"1024\" height=\"311\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-1024x311.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-300x91.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-768x234.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-1536x467.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.32.45-PM-2048x623.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Now, we bring in Beautiful Soup to parse through our HTML data. <strong>Requests<\/strong> and <strong>Beautiful Soup<\/strong> are often used together in this order to request the HTML from the website and then identify certain tags we want from it.<\/p>\n\n\n\n<p>The <strong>soup<\/strong> variable holds the output of the BeautifulSoup() function, which is essentially the HTML code of the web page we requested. The second parameter of the Beautiful Soup function, &#8216;html.parser&#8217;, basically instructs Beautiful Soup to use the appropriate parser.<\/p>\n\n\n\n<p>At this point, it&#8217;s a good idea to check that your code is working so far. 
To see if my <strong>soup <\/strong>variable actually captured the website HTML, I wrote the line &#8220;<strong>print(soup)<\/strong>&#8221; to print out the output of our <strong>soup<\/strong> variable.<\/p>\n\n\n\n<p>Then, I run my code using Command + I on my Mac. Depending on what text editor you use, you might run your code differently. If all goes well, <strong>print(soup)<\/strong> should return all of the HTML content of the web page.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1536\" style=\"aspect-ratio: 2264 \/ 1536;\" width=\"2264\" controls src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Recording-2021-04-21-at-8.58.57-PM.mov\"><\/video><\/figure>\n\n\n\n<p>It&#8217;s not pretty, but it&#8217;s all there. This is the same HTML content we saw when we right-clicked &#8220;Inspect&#8221; on the web page. We can also see the &lt;li> tags with the PDF links that we want.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Getting the PDF links<\/h2>\n\n\n\n<p>Right now, we have all of the HTML inside our <strong>soup <\/strong>variable, but we really only need the links to the PDFs. If you take a look at our printed HTML output, you can see that the link is stored in an &lt;a href&gt; tag inside a &lt;li&gt; tag. 
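In miniature, the request-parse-print check looks something like this. (The short HTML string below is made up to stand in for request.text, so the sketch runs without a network connection.)

```python
from bs4 import BeautifulSoup

# Made-up stand-in for request.text: a stripped-down slice of the page source
page_source = """
<div id="body">
  <ul>
    <li><a href="/assets/weekly_incident_summary-1.pdf" target="_blank">Week 1 (PDF)</a></li>
  </ul>
</div>
"""

# Parse the raw HTML so we can search it by tag later
soup = BeautifulSoup(page_source, "html.parser")

# Printing the soup echoes the HTML back, <li> and <a href> tags included
print(soup)
```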
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"361\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-1024x361.png\" alt=\"\" class=\"wp-image-738\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-1024x361.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-300x106.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-768x271.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-1536x541.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.12.18-PM-2048x721.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Also, notice that the link isn&#8217;t the full link, but only the path of the URL\u2013what comes after the domain name. This means the &lt;a href&gt; value doesn&#8217;t include the domain, &#8220;https:\/\/www2.gov.bc.ca&#8221;. But we&#8217;ll deal with that later.<\/p>\n\n\n\n<p>Next, we&#8217;re going to parse through the HTML content to pick out the &lt;a href> tags we want. As you might&#8217;ve seen, there are other links, and therefore other &lt;a href> tags, besides the links to the PDFs on the web page. To avoid scraping every single &lt;a href> tag on the page, we have to zero in on the specific HTML container that our PDF links are nested in.<\/p>\n\n\n\n<p>First, we have to find the right &lt;div> tag to dig through. 
&lt;Div> tags are basically containers for groups of HTML content, and the one that we want has an id of &#8220;body.&#8221; You can see how the &lt;li> tags are nested within the &#8220;body&#8221; &lt;div> on the website.<\/p>\n\n\n\n<figure class=\"wp-block-video alignleft\"><video height=\"1536\" style=\"aspect-ratio: 2880 \/ 1536;\" width=\"2880\" controls src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Recording-2021-04-21-at-9.19.11-PM.mov\"><\/video><\/figure>\n\n\n\n<p>This makes it easy for us to tell Beautiful Soup to <strong>.find<\/strong> that exact div. The <strong>.find<\/strong> module is another function included in the Beautiful Soup package that helps us locate one specific HTML tag. We attach that to our <strong>soup <\/strong>variable, which stores our HTML content, and set the parameters to the &#8220;body&#8221; &lt;div> item. We store our &#8220;body&#8221; &lt;div> in a variable called <strong>bodyDiv<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignright size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"304\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-1024x304.png\" alt=\"\" class=\"wp-image-736\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-1024x304.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-300x89.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-768x228.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-1536x457.png 1536w, 
https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-8.43.05-PM-2048x609.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n<p>But, we&#8217;re not quite there yet. Within the <strong>bodyDiv<\/strong> variable, we need to grab all the &lt;a href> tags. Beautiful Soup has another function called <strong>.find_all<\/strong> that returns a list of requested elements. This time, we&#8217;re telling Beautiful Soup to find <em>all <\/em>instances of &#8220;a&#8221; within <strong>bodyDiv.<\/strong><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#e5e5e5\"><strong>Important: There are different data types in Python including lists, strings, integers and arrays. A list does not act the same as a string and a string doesn&#8217;t act the same as an integer. <\/strong>This <a href=\"https:\/\/www.w3schools.com\/python\/python_datatypes.asp\">W3School overview<\/a> describes the differences between different Python data types.<\/p>\n\n\n\n<p>Again, I like to print out my most recent variable to see if it&#8217;s captured what I intended it to. The <strong>listItems<\/strong> variable should store a list of all the &lt;a href> tags in our &#8220;body&#8221; &lt;div>.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1126\" style=\"aspect-ratio: 2676 \/ 1126;\" width=\"2676\" controls src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Recording-2021-04-21-at-9.45.52-PM.mov\"><\/video><\/figure>\n\n\n\n<p>Great, now we have a list of all the &lt;a href> tags. But we only really need the bit that sits inside the quotation marks after the &#8220;href&#8221; parameter. This is the first list item we have in our <strong>listItems <\/strong>variable. 
We just need the highlighted part.<\/p>\n\n\n\n<p><strong>&lt;a href=&#8221;<mark>\/assets\/gov\/public-safety-and-emergency-services\/emergency-preparedness-response-recovery\/embc\/ecc-incident-summaries-2020\/weekly_incident_summary-12302019-01052020.pdf<\/mark>&#8221; target=&#8221;_blank&#8221;>December 30, 2019 &#8211; January. 5, 2019 (PDF)&lt;\/a><\/strong><\/p>\n\n\n\n<p>This next part is a bit tricky. What we&#8217;re trying to do is create a new list called <strong>links<\/strong> which will contain the full PDF URLs we need. First, we create an empty list by setting <strong>links<\/strong> = [] as there is nothing contained in the list, yet.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"463\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-1024x463.png\" alt=\"\" class=\"wp-image-737\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-1024x463.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-300x136.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-768x347.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-1536x694.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.10.04-PM-2048x925.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next we use a <strong>for loop<\/strong> (read more about for loops at this <a href=\"https:\/\/www.w3schools.com\/python\/python_for_loops.asp\">W3School 
article<\/a>) to do the following steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Grab only the href value from each &lt;a href&gt; tag in our <strong>listItems<\/strong> list<\/li><li>Append (add) that href value to our empty <strong>links <\/strong>list<\/li><\/ol>\n\n\n\n<p>Our for loop will do those two steps for every item in our <strong>listItems <\/strong>list until completion. The for loop lets us iterate through the list in two lines of code instead of writing the same code for every single list item.<\/p>\n\n\n\n<p>Now, if we print our <strong>links<\/strong> variable, we should get a list with just the PDF links.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"386\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-1024x386.png\" alt=\"\" class=\"wp-image-741\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-1024x386.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-300x113.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-768x289.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-1536x578.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-9.58.36-PM-2048x771.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Great! But as we saw before, these are still missing the domain name\u2013they&#8217;re not full URLs. 
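Pulled together, this narrowing-down stage can be sketched like so. (The HTML string is a made-up stand-in for the real page source, and the exact .find parameters here are my guess at one common way of targeting an id.)

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page: the PDF links sit in <div id="body">
page_source = """
<div id="body">
  <ul>
    <li><a href="/assets/weekly_incident_summary-1.pdf">Week 1 (PDF)</a></li>
    <li><a href="/assets/weekly_incident_summary-2.pdf">Week 2 (PDF)</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")

# .find returns the one <div> whose id is "body"
bodyDiv = soup.find("div", id="body")

# .find_all returns a list of every <a> tag inside that div
listItems = bodyDiv.find_all("a")

# Grab just the href value from each tag and append it to an empty list
links = []
for item in listItems:
    links.append(item["href"])

print(links)
```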
Thankfully, since they all live on the same website (&#8220;https:\/\/www2.gov.bc.ca&#8221;), we just need to prepend that exact string (text) to each of our list items.<\/p>\n\n\n\n<p>In Python, to concatenate means to add two string objects together. We&#8217;re essentially concatenating the domain with every list item in our <strong>links<\/strong> list.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"362\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-1024x362.png\" alt=\"\" class=\"wp-image-743\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-1024x362.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-300x106.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-768x271.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-1536x543.png 1536w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.01.34-PM-2048x723.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Our new variable <strong>PDFLinks<\/strong> stores a list where each item starts with the string &#8220;https:\/\/www2.gov.bc.ca&#8221;; we use <a rel=\"noreferrer noopener\" href=\"https:\/\/www.w3schools.com\/python\/python_lists_comprehension.asp\" target=\"_blank\">list comprehension<\/a> to take every item from our existing <strong>links <\/strong>list and attach it to that string.<\/p>\n\n\n\n<p>This <a 
rel=\"noreferrer noopener\" href=\"https:\/\/stackoverflow.com\/questions\/2050637\/appending-the-same-string-to-a-list-of-strings-in-python\" target=\"_blank\">Stack Overflow discussion<\/a> goes over list comprehension and saved me a lot of time during this step.<\/p>\n\n\n\n<p>If we print <strong>PDFLinks<\/strong>, we finally get a list of our complete PDF URLs. Yay!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Downloading the PDFs and saving as CSVs<\/h2>\n\n\n\n<p>We have a list of the links we need. Now we need to write some code that goes to each URL, downloads the PDF, and then saves it as a CSV. This was by far the trickiest part of the code for me\u2013it took me days to work out how to automatically change the file name for every URL. Without the help of my computer science friend, an extremely helpful CBC data journalist and Stack Overflow, I wouldn&#8217;t have been able to figure this part out (while keeping my sanity).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"578\" src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-1024x578.png\" alt=\"\" class=\"wp-image-745\" srcset=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-1024x578.png 1024w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-300x169.png 300w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-768x433.png 768w, https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-1536x867.png 1536w, 
https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Shot-2021-04-21-at-10.16.29-PM-2048x1156.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>First we set a variable called <strong>i<\/strong> to 0 (as an integer, not a string) that acts as an iterator. Every time the for loop completes, we add 1 to <strong>i <\/strong>until we go through the entire list. This will make more sense later.<\/p>\n\n\n\n<p>Inside our for loop, we&#8217;re looking at every URL list item in our <strong>PDFLinks<\/strong> list. We create another variable called <strong>pdf<\/strong> that stores the contents we request from that URL. (Remember our handy <strong>requests.get<\/strong> function?)<\/p>\n\n\n\n<p>Now that we&#8217;ve requested the contents of the first PDF URL in our <strong>PDFLinks<\/strong> list, we need to save it on our computer. In our <strong>with<\/strong> line, we&#8217;re creating and opening a file (represented as the variable <strong>f<\/strong>) using the <strong>open() <\/strong>function. <\/p>\n\n\n\n<p>The name of our file is the string &#8220;2019-Week&#8221; plus (concatenated with) the STRING version of our current <strong>i <\/strong>value. The str() function converts an object from another data type\u2013in this case, an integer\u2013into a string so that it can be joined onto other strings. <\/p>\n\n\n\n<p>In our first iteration through our for loop, i would be 0. <\/p>\n\n\n\n<p>So far, our file name is &#8220;2019-Week-0&#8221; and we add the string &#8220;.pdf&#8221; to the end, which denotes the file as a PDF file. We&#8217;ve created, named and opened the file, which is represented as <strong>f.<\/strong><\/p>\n\n\n\n<p>Now we use the <strong>.write<\/strong> function to insert the content of our <strong>pdf<\/strong> (which we requested earlier) into this newly made file. 
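<\/p>\n\n\n\n<p>In code, that naming-and-saving step looks something like this. A placeholder byte string stands in for the <strong>requests.get<\/strong> call so the example runs on its own; the real script writes the content of the requested PDF instead.<\/p>\n\n\n\n

```python
i = 0  # our integer iterator

# Stand-in for pdf = requests.get(url); in the real script we would
# write pdf's content rather than this placeholder
pdf_content = b"%PDF-1.4 placeholder"

# str(i) turns the integer 0 into the string "0" so it can be
# concatenated with the other strings
filename = "2019-Week-" + str(i) + ".pdf"
print(filename)  # 2019-Week-0.pdf

# "wb" opens the file for writing in binary mode, since a PDF is not text
with open(filename, "wb") as f:
    f.write(pdf_content)
```

\n\n\n\n<p>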
At this point in our code, we&#8217;ve just downloaded the PDF of the first link in our list!<\/p>\n\n\n\n<p>Now within the same for loop, we have to convert the PDF into a CSV. We do this by using the <strong>tabula<\/strong> library that we imported at the beginning. The parameters of the <strong>tabula.convert_into<\/strong> function are as follows:<\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#e2e2e2\">tabula.convert_into(\"file we want to convert.pdf\", \"output.csv\", output_format=\"csv\", pages=\"all\")<\/p>\n\n\n\n<p>We simply reuse the same naming format we used to name our PDF (&#8220;2019-Week-&#8221; + str(i) + &#8220;.pdf&#8221;) to identify the PDF we want to convert, and name the CSV file the same thing, swapping the &#8220;.pdf&#8221; for &#8220;.csv.&#8221; The last two parameters make sure that our output format is CSV and that we capture all the pages.<\/p>\n\n\n\n<p>Now, we&#8217;ve successfully converted our PDF into CSV! But we have to do this with every list item in our <strong>PDFLinks<\/strong> list, remember? To iterate through the loop again, while making sure every file name is different, we add 1 to our <strong>i <\/strong>index at the very end of our loop. <\/p>\n\n\n\n<p>Now, our for loop will go through the exact same process for every URL, adding 1 to <strong>i <\/strong>as it completes each URL until we&#8217;ve reached the end of our list.<\/p>\n\n\n\n<p>Let&#8217;s see it in action:<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1800\" style=\"aspect-ratio: 2880 \/ 1800;\" width=\"2880\" controls src=\"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-content\/uploads\/sites\/116\/2021\/04\/Screen-Recording-2021-04-21-at-10.35.23-PM.mov\"><\/video><\/figure>\n\n\n\n<p>And that&#8217;s it! Now we have our data in a CSV format that we can do some more analysis on using pivot tables, filters and more spreadsheet functions. 
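<\/p>\n\n\n\n<p>Putting the whole loop together, it looks roughly like this. The URLs are made-up examples, and the <strong>requests<\/strong> and <strong>tabula<\/strong> calls are left as comments so the sketch runs on its own\u2013in the real script, they are live calls, exactly as described above.<\/p>\n\n\n\n

```python
# Hypothetical URLs standing in for the real PDFLinks list
PDFLinks = [
    "https://www2.gov.bc.ca/assets/report-week-1.pdf",
    "https://www2.gov.bc.ca/assets/report-week-2.pdf",
]

i = 0            # iterator that makes every file name unique
csv_names = []   # collected here just to show the naming pattern

for url in PDFLinks:
    pdf_name = "2019-Week-" + str(i) + ".pdf"
    # pdf = requests.get(url)            # fetch the PDF
    # with open(pdf_name, "wb") as f:
    #     f.write(pdf.content)           # save it to disk

    # Reuse the same naming format, swapping ".pdf" for ".csv"
    csv_name = "2019-Week-" + str(i) + ".csv"
    # tabula.convert_into(pdf_name, csv_name, output_format="csv", pages="all")
    csv_names.append(csv_name)

    i = i + 1  # bump the iterator so the next file gets a new name

print(csv_names)
```

\n\n\n\n<p>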
If you see the CSV file in the video, some of the tables are messed up since there were multiple tables of different dimensions in the PDF. It still needs some cleaning up&#8230; which may be a coding project for another time.<\/p>\n\n\n\n<p>I worked on this very short piece of code on and off for almost a month, which sort of defeats the purpose of writing web scraping code.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"550\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">what is learning data journalism if not trying to figure out web scraping code for days when downloading all the files manually would&#39;ve been faster at this point persevering<\/p>&mdash; kayla zhu (\u6731\u6cf3\u8339) (@kylzhu) <a href=\"https:\/\/twitter.com\/kylzhu\/status\/1370507774346465283?ref_src=twsrc%5Etfw\">March 12, 2021<\/a><\/blockquote><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n<\/div><\/figure>\n\n\n\n<p>But, the process of writing and debugging this code really reaffirmed my motivation to become more code-savvy. Working on projects like this, where you&#8217;re learning code as you&#8217;re building your solution, can be a great way to sharpen your skills in a language you might have some basic understanding of.<\/p>\n\n\n\n<p>Web scraping can unlock new worlds of data for journalists if you know where to look. Interesting data can be hidden in lengthy annual reports or slide decks stored as PDFs on a website, or in HTML lists and tables on a random web page of a company website. Learning the basics of Python scraping and HTML can be a useful skill to add to your data journalist toolkit.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The phrase &#8220;data is all around us&#8221; is thrown around often when you first start data journalism or data science. 
While data is indeed all around us, from social media statistics to daily weather fluctuations, most of the time, it&#8217;s difficult, if not impossible to download, clean and analyze as a journalist without a strong [&hellip;]<\/p>\n","protected":false},"author":263,"featured_media":724,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-721","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/posts\/721","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/users\/263"}],"replies":[{"embeddable":true,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/comments?post=721"}],"version-history":[{"count":0,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/posts\/721\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/media\/724"}],"wp:attachment":[{"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/media?parent=721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/categories?post=721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/project.journalism.torontomu.ca\/jrn-305-2021\/wp-json\/wp\/v2\/tags?post=721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}