{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Scraping\n", "\n", "The process of retrieving data displayed on websites automatically with your computer is called web scraping. It means instead of clicking, selecting and copying data yourself, your python script does it for you. You would only need to define how and where to look. Scraping information from the internet can be very time consuming, even when your machine does it for you. However, it may also be very rewarding because of the vast amount of data and possibilities - which come for free.\n", "\n", "Before we can turn to the actual scraping online, we will deepen our understanding of html and how to navigate through a file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Xpath\n", "\n", "The **E**xtensible **M**arkup **L**anguage **XML** provides something like a framework to markup languages to encode structured, hierarchical data for transfer. It is thus a related to HTML, just less for displaying data than storing and transferring it. While the differences are not of interest here (learn about them yourself anytime, though), the similarities are: in the way, we can use them to navigate html documents. \n", "\n", "The **XML Path Language** (**Xpath**) allows to select items from an xml document by addressing tag names, attributes and structural characteristics. We can use the respective commands to retrieve elements from html documents, i.e. websites.\n", "\n", "The two basic operators are\n", "\n", "- a forward slash ```/```: to look into the next generation\n", "\n", "- square brackets ```[]```: select a specific element\n", "\n", "The forward slash works basically the same way as in an URL or path when navigating a directory on your computer. The square brackets on the other hand have a very similar functionality to the usage of square brackets in python.\n", "\n", "Lets have a look at a slightly extended version of the html document from the previous chapter to see how we can use these operators for navigation.\\\n", "\n", "As usual, we first import the package (actually only one class for now) providing the desired functionality: **Scrapy** \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from scrapy import Selector" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "my_html = \\\n", "\"\"\"
This is a brief introduction to HTML.
\n", "Part 2.
Part 3. Homepage
\n", "\"\"\"\n", "\n", "#formatted version to better visualise the structure\n", "\n", "# \n", "# \n", "#\n", "# This is a brief introduction to HTML.\n", "#
\n", "#\n", "# Part 2.\n", "#
\n", "#\n", "# Part 3.\n", "# Homepage\n", "#
\n", "# \n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of the tree like structure elements are labeled in on the basis of a family tree.\\\n", "The html consists of the **body** inside the **html** tags.\\\n", "In the next generation, called *child generation*, we see one **div** element, itself containing two **p** elements, i.e. *children*.\\\n", "Another **p** element appears as a *sibling* to the only **div** element.\n", "\n", "Now let's select single elements from this html using the ```xpath()``` method to which we pass our statement as string. To do so, we must first instantiate an Selector object with ```my_html```. Note that chaining the ```extract()``` method to the selector object gives the desired result." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "both p elements:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
']\n" ] } ], "source": [ "sel = Selector(text=my_html)\n", "\n", "# navigate to p elements inside div\n", "print('both p elements:\\n', sel.xpath('/html/body/div/p').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the special case here, where all child elements are the *p* elements we are looking for, we can use the wildcard character ```*```. It wil select any element, no matter the tag in the next generation (or in all future generations with a double forward slash)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "wildcard *:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
']\n" ] } ], "source": [ "# use wildcard character *\n", "print('\\nwildcard *:\\n', sel.xpath('/html/body/div/*').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select only the first *p* element, index the result with square brackets. \n", ":::{admonition}Caution\n", ":class: caution\n", "Xpath indexing starts with 1 and not 0 like python does!:::\n", "\n", "Another way is to use scrapy's ```extract_first()``` method." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "slelect only first p\n", " ['This is a brief introduction to HTML.
']\n", "\n", "slelect only first p\n", "This is a brief introduction to HTML.
\n", "\n", "slelect only first p\n", "This is a brief introduction to HTML.
\n" ] } ], "source": [ "# select first p with [1]\n", "print('slelect only first p\\n',sel.xpath('/html/body/div/p[1]').extract())\n", "\n", "# indexing the python list starts with 0!\n", "print('\\nslelect only first p\\n',sel.xpath('/html/body/div/p').extract()[0])\n", "\n", "# with extract_first()\n", "print('\\nslelect only first p\\n',sel.xpath('/html/body/div/p').extract_first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see, that a forward slash navigates one generation deeper into the structure, from *html* over *body* and *div* to both(!) *p* elements. Here, we get back a list (first print statement) of the *p* elements, where we can intuitively use the square brackets to select only the first one.\\\n", "What we get back, however, is still a *selector object*. To extract the text only, we need to modify the statement further by ```/text()```." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text in first p:\n", " ['This is a brief introduction to ', '.']\n" ] } ], "source": [ "# select text from first p element\n", "to_print = sel.xpath('/html/body/div/p[1]/text()').extract()\n", "\n", "print('text in first p:\\n',to_print)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now again, we do not see all of the text, just the string before the *b* tags and the dot after it as elements of a list. Even though the *b* tags for bold printing appear in-line, they still define a child generation for the enclosed 'HTML' string. Here, we could navigate further down the tree using a path to the b element for exampe.\\\n", "Instead, we will make use of the double forward slash ```//```, which selects everything '**from future generations**'. Meaning it will select all the text from all generations that follow this first *p* element. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text in first p and future generations with double slash:\n", " ['This is a brief introduction to ', 'HTML', '.']\n", "\n", "complete string:\n", " This is a brief introduction to HTML.\n" ] } ], "source": [ "# select text from first p element and all future generations by a double forward slash\n", "to_print = sel.xpath('/html/body/div/p[1]//text()').extract()\n", "print('text in first p and future generations with double slash:\\n',to_print)\n", "\n", "# for one string, use the join() function\n", "print('\\ncomplete string:\\n', ''.join(to_print))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beside specifying a path explicitly, we can leverage built-in methods to extract our desired elements. Scrapy's ```getall()``` returns all elements of a given tag as list" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scrapy:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
', 'Part 3. Homepage
']\n" ] } ], "source": [ "print('scrapy:\\n', sel.xpath('//p').getall())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned before, elements can be addressed using attributes, like ID or class name. The attribute is addressed in square brackets, with ```@attr = 'attr_name'```. If the same attribute applies to several tags, a list of all results is returned." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID with wildcard:\n", " ['This is a brief introduction to HTML.
\\nPart 2.
This is a brief introduction to HTML.
\\nPart 2.
This is a brief introduction to HTML.
', 'Part 2.
']\n" ] } ], "source": [ "# find all elements where attr contains string\n", "print('contains string:\\n', sel.xpath('//*[contains(@class, \"parag-1\")]').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, an attribute can be returned by addressing it in the path." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "get attribute:\n", " ['intro']\n" ] } ], "source": [ "print('get attribute:\\n', sel.xpath('/html/body/div/p[1]/@id').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CSS\n", "\n", "The **C**ascading **S**tyle **S**heets language is another way to work with markup languages and thus html. Of special interest for us is that it also provides selectors. Even more convenient is that scrapy includes these selectors and the respective language. CSS uses a different syntax, which can offer much simpler and shorter statements for the same element than xpath (or sometime, the opposite) \n", "\n", "The basic syntax changes from xpath like this:\n", "- a single generation forward: ```/``` is replaced by ```>```\n", "\n", "- all future generations: ```//``` is replaced by a blank space (!)\n", "\n", "- not so short anymore: indexing with ```[k]``` becomes ```:nth-of-type(k)```\n", "\n", "For scrapy, nothing changes really, except using ```.css()``` method. The selector object is the same as before.\\\n", "Some examples: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xpath:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
']\n", "\n", "css:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
']\n", "\n", "all p elements\n", "xpath:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
', 'Part 3. Homepage
']\n", "\n", "css:\n", " ['This is a brief introduction to HTML.
', 'Part 2.
', 'Part 3. Homepage
']\n" ] } ], "source": [ "# navigate to p elements inside div\n", "print('xpath:\\n', sel.xpath('/html/body/div/p').extract())\n", "print('\\ncss:\\n', sel.css('html>body>div>p').extract())\n", "\n", "# navigate to all p elements in document\n", "print('\\nall p elements')\n", "print('xpath:\\n', sel.xpath('//p').extract())\n", "print('\\ncss:\\n', sel.css('p').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The css locator provides short notation when working with *id* or *class*:\n", "- to select elements by class: use ```tag.class-name``` (similiar to method chaining)\n", "\n", "- to select elements by id: use ```tag#id-name```\n", "\n", "Examples from the toy html above:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "select by class:\n", " ['This is a brief introduction to HTML.
']\n", "\n", "select by id:\n", " ['This is a brief introduction to HTML.
']\n" ] } ], "source": [ "# select by id\n", "print('select by class:\\n', sel.css('*.parag-1').extract())\n", "\n", "# select by id\n", "print('\\nselect by id:\\n', sel.css('*#intro').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Attributes are in general addressed by a double colon ```::attr(attribute-name)```, for example to extract a link from the ```href``` attribute:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "select by attribute:\n", " ['www.uni-passau.de']\n" ] } ], "source": [ "# select by attribute\n", "print('\\nselect by attribute:\\n', sel.css('*::attr(href)').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The double colon is also used to extract ```::text```, like we used ```/text()``` in xpath:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "extract text:\n", " ['This is a brief introduction to ', 'HTML', '.', 'Part 2.']\n" ] } ], "source": [ "# extract all (blank space!) text descending from p elements inside div elements with class 'div_1' \n", "print('\\nextract text:\\n', sel.css('div#div_1>p ::text').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look for exampe at [this cheatsheet](https://devhints.io/xpath) for more commands with xpath and css.\n", "\n", "Lastly, let's look at BeautifulSoup again. It offers basically the same functionality to address and select any element based on its tag or attributes, but can make life easier with its prewritten methods. Keep in mind however, that BeautifulSoup is a python package, while the xpath and css syntax is standalone and will thus transfer to other software, e.g. **R**." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "by id:\n", " [This is a brief introduction to HTML.
\n", "Part 2.
This is a brief introduction to HTML.
]\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(my_html)\n", "# select by id\n", "print('by id:\\n',soup.find_all('div', id='div_1'))\n", "\n", "# select by class \n", "print('\\n by class:\\n',soup.find_all('p', class_='parag-1')) # note class_ (not class)\n", "\n", "# by s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting a webpage\n", "\n", "Common internet browsers like firefox, chrome, safari, etc. include functionality to inspect webpages, meaning to look at the underlying html. To view the html, right-click on a specific element on the webpage, e.g. a headline, and select \"Inspect Element\" (or something similar), like shown in this screenshot taken from [this page](https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will then see a small window open at the bottom of your browser, showing the html in hierarchical structure where elements can be expanded to show its child generations:\n", "\n", "\n", "\n", "While hovering with the mouse over\n", "an element from the list, your browser should highlight the respective element on the website. So hovering over the ```\n", " | Current Qtr. | \n", "Next Qtr. | \n", "Current Year | \n", "Next Year | \n", "
---|---|---|---|---|
No. of Analysts | \n", "27 | \n", "24 | \n", "35 | \n", "35 | \n", "
Avg. Estimate | \n", "24 | \n", "35 | \n", "35 | \n", "1.39 | \n", "
Low Estimate | \n", "35 | \n", "35 | \n", "1.39 | \n", "2.1 | \n", "
High Estimate | \n", "35 | \n", "1.39 | \n", "2.1 | \n", "6.06 | \n", "
Year Ago EPS | \n", "1.39 | \n", "2.1 | \n", "6.06 | \n", "6.56 | \n", "
\n", " | Earnings Estimate | \n", "Current Qtr. (Sep 2023) | \n", "Next Qtr. (Dec 2023) | \n", "Current Year (2023) | \n", "Next Year (2024) | \n", "
---|---|---|---|---|---|
0 | \n", "No. of Analysts | \n", "27.00 | \n", "24.00 | \n", "35.00 | \n", "35.00 | \n", "
1 | \n", "Avg. Estimate | \n", "1.39 | \n", "2.10 | \n", "6.06 | \n", "6.56 | \n", "
2 | \n", "Low Estimate | \n", "1.35 | \n", "1.72 | \n", "5.82 | \n", "5.60 | \n", "
3 | \n", "High Estimate | \n", "1.45 | \n", "2.41 | \n", "6.17 | \n", "7.09 | \n", "
4 | \n", "Year Ago EPS | \n", "1.29 | \n", "1.88 | \n", "6.11 | \n", "6.06 | \n", "