{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Scraping\n", "\n", "The process of retrieving data displayed on websites automatically with your computer is called web scraping. It means instead of clicking, selecting and copying data yourself, your python script does it for you. You would only need to define how and where to look. Scraping information from the internet can be very time consuming, even when your machine does it for you. However, it may also be very rewarding because of the vast amount of data and possibilities - which come for free.\n", "\n", "Before we can turn to the actual scraping online, we will deepen our understanding of html and how to navigate through a file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Xpath\n", "\n", "The **E**xtensible **M**arkup **L**anguage **XML** provides something like a framework to markup languages to encode structured, hierarchical data for transfer. It is thus a related to HTML, just less for displaying data than storing and transferring it. While the differences are not of interest here (learn about them yourself anytime, though), the similarities are: in the way, we can use them to navigate html documents. \n", "\n", "The **XML Path Language** (**Xpath**) allows to select items from an xml document by addressing tag names, attributes and structural characteristics. We can use the respective commands to retrieve elements from html documents, i.e. websites.\n", "\n", "The two basic operators are\n", "\n", "- a forward slash ```/```: to look into the next generation\n", "\n", "- square brackets ```[]```: select a specific element\n", "\n", "The forward slash works basically the same way as in an URL or path when navigating a directory on your computer. The square brackets on the other hand have a very similar functionality to the usage of square brackets in python.\n", "\n", "Lets have a look at a slightly extended version of the html document from the previous chapter to see how we can use these operators for navigation.\\\n", "\n", "As usual, we first import the package (actually only one class for now) providing the desired functionality: **Scrapy** \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from scrapy import Selector" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "my_html = \\\n", "\"\"\"

This is a brief introduction to HTML.

\n", "

Part 2.

\n", "

Part 3. Homepage

\n", "\"\"\"\n", "\n", "#formatted version to better visualise the structure\n", "\n", "# \n", "# \n", "#

\n", "#

\n", "# This is a brief introduction to HTML.\n", "#

\n", "#

\n", "# Part 2.\n", "#

\n", "#

\n", "# Part 3.\n", "# Homepage\n", "#

\n", "# \n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of the tree like structure elements are labeled in on the basis of a family tree.\\\n", "The html consists of the **body** inside the **html** tags.\\\n", "In the next generation, called *child generation*, we see one **div** element, itself containing two **p** elements, i.e. *children*.\\\n", "Another **p** element appears as a *sibling* to the only **div** element.\n", "\n", "Now let's select single elements from this html using the ```xpath()``` method to which we pass our statement as string. To do so, we must first instantiate an Selector object with ```my_html```. Note that chaining the ```extract()``` method to the selector object gives the desired result." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "both p elements:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

']\n" ] } ], "source": [ "sel = Selector(text=my_html)\n", "\n", "# navigate to p elements inside div\n", "print('both p elements:\\n', sel.xpath('/html/body/div/p').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the special case here, where all child elements are the *p* elements we are looking for, we can use the wildcard character ```*```. It wil select any element, no matter the tag in the next generation (or in all future generations with a double forward slash)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "wildcard *:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

']\n" ] } ], "source": [ "# use wildcard character *\n", "print('\\nwildcard *:\\n', sel.xpath('/html/body/div/*').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select only the first *p* element, index the result with square brackets. \n", ":::{admonition}Caution\n", ":class: caution\n", "Xpath indexing starts with 1 and not 0 like python does!:::\n", "\n", "Another way is to use scrapy's ```extract_first()``` method." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "slelect only first p\n", " ['

This is a brief introduction to HTML.

']\n", "\n", "slelect only first p\n", "

This is a brief introduction to HTML.

\n", "\n", "slelect only first p\n", "

This is a brief introduction to HTML.

\n" ] } ], "source": [ "# select first p with [1]\n", "print('slelect only first p\\n',sel.xpath('/html/body/div/p[1]').extract())\n", "\n", "# indexing the python list starts with 0!\n", "print('\\nslelect only first p\\n',sel.xpath('/html/body/div/p').extract()[0])\n", "\n", "# with extract_first()\n", "print('\\nslelect only first p\\n',sel.xpath('/html/body/div/p').extract_first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see, that a forward slash navigates one generation deeper into the structure, from *html* over *body* and *div* to both(!) *p* elements. Here, we get back a list (first print statement) of the *p* elements, where we can intuitively use the square brackets to select only the first one.\\\n", "What we get back, however, is still a *selector object*. To extract the text only, we need to modify the statement further by ```/text()```." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text in first p:\n", " ['This is a brief introduction to ', '.']\n" ] } ], "source": [ "# select text from first p element\n", "to_print = sel.xpath('/html/body/div/p[1]/text()').extract()\n", "\n", "print('text in first p:\\n',to_print)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now again, we do not see all of the text, just the string before the *b* tags and the dot after it as elements of a list. Even though the *b* tags for bold printing appear in-line, they still define a child generation for the enclosed 'HTML' string. Here, we could navigate further down the tree using a path to the b element for exampe.\\\n", "Instead, we will make use of the double forward slash ```//```, which selects everything '**from future generations**'. Meaning it will select all the text from all generations that follow this first *p* element. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text in first p and future generations with double slash:\n", " ['This is a brief introduction to ', 'HTML', '.']\n", "\n", "complete string:\n", " This is a brief introduction to HTML.\n" ] } ], "source": [ "# select text from first p element and all future generations by a double forward slash\n", "to_print = sel.xpath('/html/body/div/p[1]//text()').extract()\n", "print('text in first p and future generations with double slash:\\n',to_print)\n", "\n", "# for one string, use the join() function\n", "print('\\ncomplete string:\\n', ''.join(to_print))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beside specifying a path explicitly, we can leverage built-in methods to extract our desired elements. Scrapy's ```getall()``` returns all elements of a given tag as list" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scrapy:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

', '

Part 3. Homepage

']\n" ] } ], "source": [ "print('scrapy:\\n', sel.xpath('//p').getall())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned before, elements can be addressed using attributes, like ID or class name. The attribute is addressed in square brackets, with ```@attr = 'attr_name'```. If the same attribute applies to several tags, a list of all results is returned." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID with wildcard:\n", " ['

This is a brief introduction to HTML.

\\n

Part 2.

']\n", "\n", "ID with div:\n", " ['

This is a brief introduction to HTML.

\\n

Part 2.

']\n" ] } ], "source": [ "# search all elements in document (leading //*) by id\n", "print('ID with wildcard:\\n', sel.xpath('//*[@id=\"div_1\"]').extract())\n", "\n", "# same as (tag must be known)\n", "print('\\nID with div:\\n', sel.xpath('//div[@id=\"div_1\"]').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```contains``` function provides access to attributes based on (parts of) the attribute name." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "contains string:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

']\n" ] } ], "source": [ "# find all elements where attr contains string\n", "print('contains string:\\n', sel.xpath('//*[contains(@class, \"parag-1\")]').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, an attribute can be returned by addressing it in the path." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "get attribute:\n", " ['intro']\n" ] } ], "source": [ "print('get attribute:\\n', sel.xpath('/html/body/div/p[1]/@id').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CSS\n", "\n", "The **C**ascading **S**tyle **S**heets language is another way to work with markup languages and thus html. Of special interest for us is that it also provides selectors. Even more convenient is that scrapy includes these selectors and the respective language. CSS uses a different syntax, which can offer much simpler and shorter statements for the same element than xpath (or sometime, the opposite) \n", "\n", "The basic syntax changes from xpath like this:\n", "- a single generation forward: ```/``` is replaced by ```>```\n", "\n", "- all future generations: ```//``` is replaced by a blank space (!)\n", "\n", "- not so short anymore: indexing with ```[k]``` becomes ```:nth-of-type(k)```\n", "\n", "For scrapy, nothing changes really, except using ```.css()``` method. The selector object is the same as before.\\\n", "Some examples: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xpath:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

']\n", "\n", "css:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

']\n", "\n", "all p elements\n", "xpath:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

', '

Part 3. Homepage

']\n", "\n", "css:\n", " ['

This is a brief introduction to HTML.

', '

Part 2.

', '

Part 3. Homepage

']\n" ] } ], "source": [ "# navigate to p elements inside div\n", "print('xpath:\\n', sel.xpath('/html/body/div/p').extract())\n", "print('\\ncss:\\n', sel.css('html>body>div>p').extract())\n", "\n", "# navigate to all p elements in document\n", "print('\\nall p elements')\n", "print('xpath:\\n', sel.xpath('//p').extract())\n", "print('\\ncss:\\n', sel.css('p').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The css locator provides short notation when working with *id* or *class*:\n", "- to select elements by class: use ```tag.class-name``` (similiar to method chaining)\n", "\n", "- to select elements by id: use ```tag#id-name```\n", "\n", "Examples from the toy html above:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "select by class:\n", " ['

This is a brief introduction to HTML.

']\n", "\n", "select by id:\n", " ['

This is a brief introduction to HTML.

']\n" ] } ], "source": [ "# select by id\n", "print('select by class:\\n', sel.css('*.parag-1').extract())\n", "\n", "# select by id\n", "print('\\nselect by id:\\n', sel.css('*#intro').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Attributes are in general addressed by a double colon ```::attr(attribute-name)```, for example to extract a link from the ```href``` attribute:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "select by attribute:\n", " ['www.uni-passau.de']\n" ] } ], "source": [ "# select by attribute\n", "print('\\nselect by attribute:\\n', sel.css('*::attr(href)').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The double colon is also used to extract ```::text```, like we used ```/text()``` in xpath:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "extract text:\n", " ['This is a brief introduction to ', 'HTML', '.', 'Part 2.']\n" ] } ], "source": [ "# extract all (blank space!) text descending from p elements inside div elements with class 'div_1' \n", "print('\\nextract text:\\n', sel.css('div#div_1>p ::text').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look for exampe at [this cheatsheet](https://devhints.io/xpath) for more commands with xpath and css.\n", "\n", "Lastly, let's look at BeautifulSoup again. It offers basically the same functionality to address and select any element based on its tag or attributes, but can make life easier with its prewritten methods. Keep in mind however, that BeautifulSoup is a python package, while the xpath and css syntax is standalone and will thus transfer to other software, e.g. **R**." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "by id:\n", " [

This is a brief introduction to HTML.

\n", "

Part 2.

]\n", "\n", " by class:\n", " [

This is a brief introduction to HTML.

]\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(my_html)\n", "# select by id\n", "print('by id:\\n',soup.find_all('div', id='div_1'))\n", "\n", "# select by class \n", "print('\\n by class:\\n',soup.find_all('p', class_='parag-1')) # note class_ (not class)\n", "\n", "# by s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting a webpage\n", "\n", "Common internet browsers like firefox, chrome, safari, etc. include functionality to inspect webpages, meaning to look at the underlying html. To view the html, right-click on a specific element on the webpage, e.g. a headline, and select \"Inspect Element\" (or something similar), like shown in this screenshot taken from [this page](https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL):\n", "\n", "![title](graphics/inspect.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will then see a small window open at the bottom of your browser, showing the html in hierarchical structure where elements can be expanded to show its child generations:\n", "\n", "![title](graphics/html.png)\n", "\n", "While hovering with the mouse over\n", "an element from the list, your browser should highlight the respective element on the website. So hovering over the `````` element will highlight the complete displayed table, while hovering over a `````` element will only highlight a specific row of the table. This facilitates the setup of your code for selection. Furthermore, a right-click on the element in the list let's one copy several characteristics like the path or the attributes directly to avoid spelling or selection errors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we see that information is stored in a ```

``` and we could select it by specifying the data-reactid' (we would have to check if there are more elements of this class, tough ids should generally be unique)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to collect the displayed information, we use the methods from before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### scrapy selector" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import requests as re\n", "from scrapy import Selector\n", "\n", "headers = {\n", " \"User-Agent\":\n", " \"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36\"\n", "}\n", "url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'\n", "response = re.get(url, headers=headers)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.ok" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "sel = Selector(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "column names:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Earnings Estimate', 'Current Qtr.', ' (', 'Sep', ' 2023)', 'Next Qtr.', ' (', 'Dec', ' 2023)', 'Current Year', ' (2023)', 'Next Year', ' (2024)']\n" ] } ], "source": [ "# select column names based on order, as no id is given for the tables\n", "columns = sel.xpath('//table[1]/thead[1]/*').css(' ::text').extract()\n", "print(columns) # unfortunately gives unsatisfying result with parentheses" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Current Qtr.', 'Next Qtr.', 'Current Year', 'Next Year']\n" ] } ], "source": [ "# filter by hand...\n", "columns = [el for el in columns[1:] if any(['Year' in el, 'Qtr.' in el])]\n", "print(columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "table data:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['No. of Analysts', '27', '24', '35', '35', 'Avg. Estimate', '1.39', '2.1', '6.06', '6.56', 'Low Estimate', '1.35', '1.72', '5.82', '5.6', 'High Estimate', '1.45', '2.41', '6.17', '7.09', 'Year Ago EPS', '1.29', '1.88', '6.11', '6.06']\n" ] } ], "source": [ "# select table entries based on data-reactid\n", "table_body = sel.xpath('//table[1]/tbody/*').css(' ::text').extract()\n", "print(table_body) " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "labels = [el for i, el in enumerate(table_body) if i%5==0]\n", "data = [el for i, el in enumerate(table_body) if i%5!=0]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'No. of Analysts': ['27', '24', '35', '35'],\n", " 'Avg. Estimate': ['24', '35', '35', '1.39'],\n", " 'Low Estimate': ['35', '35', '1.39', '2.1'],\n", " 'High Estimate': ['35', '1.39', '2.1', '6.06'],\n", " 'Year Ago EPS': ['1.39', '2.1', '6.06', '6.56']}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{labels[x]:[el for el in data[x:x+4]] for x in range(len(labels))}" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	Current Qtr.	Next Qtr.	Current Year	Next Year
No. of Analysts	27	24	35	35
Avg. Estimate	24	35	35	1.39
Low Estimate	35	35	1.39	2.1
High Estimate	35	1.39	2.1	6.06
Year Ago EPS	1.39	2.1	6.06	6.56

\n", "" ], "text/plain": [ " Current Qtr. Next Qtr. Current Year Next Year\n", "No. of Analysts 27 24 35 35\n", "Avg. Estimate 24 35 35 1.39\n", "Low Estimate 35 35 1.39 2.1\n", "High Estimate 35 1.39 2.1 6.06\n", "Year Ago EPS 1.39 2.1 6.06 6.56" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame.from_dict({labels[x]:[el for el in data[x:x+4]] for x in range(len(labels))},\\\n", " orient='index', columns = columns)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### pandas\n", "\n", "The ```read_html()```function returns a list of all available tables when we pass the response object as argument." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type: , length: 6\n" ] } ], "source": [ "pd_df = pd.read_html(response.content)\n", "print(f'type: {type(pd_df)}, length: {len(pd_df)}')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	Earnings Estimate	Current Qtr. (Sep 2023)	Next Qtr. (Dec 2023)	Current Year (2023)	Next Year (2024)
0	No. of Analysts	27.00	24.00	35.00	35.00
1	Avg. Estimate	1.39	2.10	6.06	6.56
2	Low Estimate	1.35	1.72	5.82	5.60
3	High Estimate	1.45	2.41	6.17	7.09
4	Year Ago EPS	1.29	1.88	6.11	6.06

\n", "

" ], "text/plain": [ " Earnings Estimate Current Qtr. (Sep 2023) Next Qtr. (Dec 2023) \\\n", "0 No. of Analysts 27.00 24.00 \n", "1 Avg. Estimate 1.39 2.10 \n", "2 Low Estimate 1.35 1.72 \n", "3 High Estimate 1.45 2.41 \n", "4 Year Ago EPS 1.29 1.88 \n", "\n", " Current Year (2023) Next Year (2024) \n", "0 35.00 35.00 \n", "1 6.06 6.56 \n", "2 5.82 5.60 \n", "3 6.17 7.09 \n", "4 6.11 6.06 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd_df[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Webcrawler\n", "\n", "A webcrawler or spider will automatically 'crawl' the web in order to collect data. Once set up, it can be run every day in the morning, every couple of hours, etc.\n", "Such rugular scraping is used to build a database, e..g with newspaper articles. Commonly, for this kind of data collection, the webcrawler is deployed using cloud computing which is not part of this course.\n", "\n", "We will, however, take a look at a very basic spider not diving in too deep.\n", "In scrapy, spiders are defined as **classes** including the respective methods for scraping:\n", "\n", "- providing the seed pages to start\n", "\n", "- parsing and storing the data\n", "\n", "It is important to note that scrapy spiders make use of **callbacks**. A callback is a function, which is run as soon as some criterion is met, usually when some other task is finished. In a scrapy spider, the parsing function is used as callback for the request method: The request must be complete so that the parsing may commence.\n", "\n", "### callback\n", "The following is a basic example for a callback. The function will print the string and the callback will print the length of the string." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "onomatopoeia\n", "12\n" ] } ], "source": [ "def enter_string(string, callback):\n", " print(string)\n", " callback(string)\n", " \n", "def print_len(x):\n", " print(len(x))\n", " \n", "enter_string('onomatopoeia', callback=print_len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the basic structure of a spider, found in the [scrapy docs](https://docs.scrapy.org/en/latest/intro/tutorial.html)." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import scrapy\n", "scrapy.spiders.logging.getLogger('scrapy').propagate = False # suppress logging output in cell\n", "\n", "\n", "class mySpider(scrapy.Spider):\n", " name = \"my_first\"\n", "\n", " custom_settings = {\n", " 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',\n", " }\n", " def start_requests(self):\n", " urls = [\n", " 'http://www.example.com'\n", " ]\n", " for url in urls:\n", " yield scrapy.Request(url=url, callback=self.parse)\n", "\n", " def parse(self, response):\n", " print(response.css('p::text').extract())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `start_requests` method provides the start or url or seed. For every url in this starting list, we send a `Request` using the respective function. We provide to this function the callback to be the `parse` method defined subsequently. It is important to use `yield`, i.e. to create a generator, as this is required by scrapy.\n", "\n", "In `parse()`, we use the already familiar `css` method to extract all text found in p elements from the website.\n", "\n", "In order to let the spider crawl, we need to import `CrawlerProcess` instantiate an according object, link the spider to it and start the process." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This domain is for use in illustrative examples in documents. You may use this\\n domain in literature without prior coordination or asking for permission.']\n" ] } ], "source": [ "from scrapy.crawler import CrawlerProcess\n", "proc = CrawlerProcess()\n", "proc.crawl(mySpider)\n", "proc.start()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After some output (red window), we see the text from the website.\n", "\n", "In order to really crawl the web, we would need to extend the spider:\n", "\n", "1. find all links on the starting page \n", "\n", "2. parse all text on the starting page\n", "\n", "3. find all links on the pages found the step before\n", "\n", "4. parse all text from the pages found the step before\n", "\n", "then repeat steps 3 and 4.\n", "\n", "We will now expand the spider from above to print the text from the starting page, then follow the provided links to the next page and print that text. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from scrapy.crawler import CrawlerProcess\n", "import scrapy\n", "scrapy.spiders.logging.getLogger('scrapy').propagate = False # suppress logging output in cell\n", "\n", "class followSpider(scrapy.Spider):\n", " name = \"my_second\"\n", " custom_settings = {\n", " 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',\n", " }\n", " def start_requests(self):\n", " urls = [\n", " 'http://www.example.com'\n", " ]\n", " for url in urls:\n", " yield scrapy.Request(url=url, callback=self.parse_links)\n", "\n", " def parse_links(self, response):\n", " links = response.css('a::attr(href)').extract()\n", " print(''.join(response.xpath('//p/text()').extract()))\n", " for link in links:\n", " yield response.follow(url=link, callback=self.parse)\n", " \n", " def parse(self, response):\n", " print(response.url)\n", " print(''.join(response.xpath('//p/text()').extract()))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This domain is for use in illustrative examples in documents. You may use this\n", " domain in literature without prior coordination or asking for permission.\n", "http://www.iana.org/help/example-domains\n", "As described in and , a\n", "number of domains such as example.com and example.org are maintained\n", "for documentation purposes. These domains may be used as illustrative\n", "examples in documents without prior coordination with us. They are not\n", "available for registration or transfer.We provide a web service on the example domain hosts to provide basic\n", "information on the purpose of the domain. These web services are\n", "provided as best effort, but are not designed to support production\n", "applications. While incidental traffic for incorrectly configured\n", "applications is expected, please do not design applications that require\n", "the example domains to have operating HTTP service.The IANA functions coordinate the Internet’s globally unique identifiers, and\n", " are provided by , an affiliate of\n", " .\n" ] } ], "source": [ "proc = CrawlerProcess()\n", "proc.crawl(followSpider)\n", "proc.start()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing several parse functions may also be necessary when dealing with 'nested' websites: One parse function may extract a specific block from an overview page, the other from all children of the overview pages.\n", "\n", "We used the `scrapy.follow` method, to go to the next link. Using this method enables us to write the `parse` method recursively, i.e. the spider would not stop until arriving at a page with no links. Thus, in order to set up a functional spider, not only the webpages must be selected accordingly, but one should always include some filters before starting the spider and storing the scraped data. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 4 }