{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Online Data\n", "\n", "The internet is probably the most common place to seek information. In fact, every homepage you browse and every article you read is data and, as long as there is no paywall or other security feature, freely available to you. Most of this data, such as text or video clips, is unstructured and cannot easily be analysed quantitatively, even though text in particular often carries a lot of information. Besides this unstructured data, there are files that can be downloaded for free, e.g. in .csv format. Yet all of this has to be done by hand: either click a 'download' button to acquire a file or, even worse, copy text from a website and paste it into a document.\n", "\n", "In the next sections, we will first briefly introduce some basic elements of online communication and then present different ways to access data online from our computer without downloading and storing it manually (we will write the code manually, though).\n", "\n", "## Website basics \n", "\n", "### HTTP \n", "\n", "The **H**yper**T**ext **T**ransfer **P**rotocol is used for online communication in the world wide web (www). The basic operations are *requests* and *responses*. When you open a website, you are actually sending a request to a server, asking it to provide the information stored under that web address. The server will then hopefully respond by sending the respective information to your machine, where it is rendered in your browser. The response also contains metadata, such as the time of the request and status/error codes, which we will see later. An extension is **https** (**s**ecure), which encrypts the communication.\n", "\n", "### URL\n", "\n", "The **U**niform **R**esource **L**ocator is the address that tells the server where to look for the requested information. 
It consists of several elements, shown here for ```http://www.uni-passau.de```:\n", "\n", "- the protocol: usually ```http``` or ```https``` in the www, followed by a colon and double slash ```://```\n", "\n", "- the hostname: ```www.uni-passau.de``` \n", "\n", "- a path/query: appended with ```/``` or ```?``` to the hostname\n", "\n", "Knowing these three separate parts enables us to automate navigation in the www using Python.\n", "\n", "### HTML\n", "\n", "Information in the www is usually found in the **H**yper**T**ext **M**arkup **L**anguage for web browsers. HTML uses **tags** to structure hierarchical data, telling the browser how to display the individual elements. Tags appear in pairs and are enclosed in angle brackets: ```<tag>``` opens the respective section and ```</tag>``` closes it.\\\n", "Any HTML document is enclosed in the outermost ```<html>``` and ```</html>``` tags. Inside these, the document is structured by various tags. To name just a few: \n", "\n", "- ```<body>```: the main part of a website containing the displayed information\n", "\n", "- ```<div>```: division, denotes a section\n", "\n", "- ```<p>```: paragraph, simple text information\n", "\n", "- ```<table>```: displays data in a table (```<tr>``` for single rows)\n", "\n", "- ```<b>```: bold, does not structure the page, but alters the font \n", "\n", "Besides the tag itself, attributes may be included inside the brackets. With the right tools, we can exploit tags and attributes to navigate to specific elements on webpages in order to extract the element information. Very common attributes are ```id``` and ```class```. \n", "\n", "A simple HTML document may look like this (inside \"\"\" and indented):" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n<html>\\n    <body>\\n        <div>\\n            <p>\\n                This is a brief introduction to HTML.\\n            </p>\\n        </div>\\n    </body>\\n</html>\\n'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "<html>\n", "    <body>\n", "        <div>\n", "            <p>\n", "                This is a brief introduction to HTML.\n", "            </p>\n", "        </div>\n", "    </body>\n", "</html>\n", "\"\"\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Since Jupyter works with HTML, we can render the above HTML directly in a markdown cell: " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "<html>\n", "    <body>\n", "        <div>\n", "            <p>\n", "                This is a brief introduction to HTML. \n", "            </p>\n", "        </div>\n", "    </body>\n", "</html>" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### JSON\n", "\n", "The **J**ava**S**cript **O**bject **N**otation is a data format which you will frequently encounter when working with online data. For one, websites that offer a .csv download often also offer a .json file to download. Moreover, the json format is often used by APIs (see next section) to transfer data. Its structure is very similar to a Python dictionary: it uses curly and square brackets, key-value pairs and nesting. Data is therefore stored in such objects. There are packages which can handle json files, for example pandas, which can transform them into a dataframe. \n", "\n", "Let's see what a dataframe looks like when transformed into the json format:" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
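<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>length</th>\n", "      <th>height</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>obj_1</th>\n", "      <td>19</td>\n", "      <td>55</td>\n", "    </tr>\n", "    <tr>\n", "      <th>obj_2</th>\n", "      <td>22</td>\n", "      <td>64</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>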
" ], "text/plain": [ " length height\n", "obj_1 19 55\n", "obj_2 22 64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([[19, 55], [22, 64]],\n", "index=[\"obj_1\", \"obj_2\"],\n", "columns=[\"length\", \"height\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the respective json format, created with the ```to_json()``` method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"length\":{\"obj_1\":19,\"obj_2\":22},\"height\":{\"obj_1\":55,\"obj_2\":64}}\n" ] } ], "source": [ "print(df.to_json()) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
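<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>length</th>\n", "      <th>height</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>obj_1</th>\n", "      <td>19</td>\n", "      <td>55</td>\n", "    </tr>\n", "    <tr>\n", "      <th>obj_2</th>\n", "      <td>22</td>\n", "      <td>64</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>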
" ], "text/plain": [ " length height\n", "obj_1 19 55\n", "obj_2 22 64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from io import StringIO\n", "\n", "# back to dataframe with read_json(); recent pandas versions expect a file-like object rather than a raw string\n", "pd.read_json(StringIO(df.to_json()))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## API\n", "\n", "Instead of (or in addition to) any downloadable files, data providers may offer an **API** (**A**pplication **P**rogramming **I**nterface) to access data. An API is a rather convenient way to get information, as it allows direct machine-to-machine communication, meaning you can log in, select and download data directly from within Python. With reliable API access, it is thus not necessary to store the data on your hard drive permanently (you do, however, have to hold it temporarily). \\\n", "The two most basic operations when working with an API over http are:\n", "\n", "- GET: send a request and *get* a response from the server, for example the data you asked for\n", "\n", "- POST: send information to the API, for example to add data to a database remotely\n", "\n", "Many large companies offer APIs for programmers. For example, the Google Maps API is often used on web pages that display directions, and Twitter offers an API that provides tweets and metadata. These APIs are usually not free. However, free APIs do exist for many topics, [see here](https://github.com/public-apis/public-apis/blob/master/README.md) for example. \\\n", "Normally, the procedure is to register with an e-mail address to get an **identification key**. This key is needed to use the API from your computer and enables the provider to track your activity. For example, if only *n* downloads are free per day, excessive downloading might even get you blocked or banned. 
In any case, to work properly with an API, one should read the documentation (or at least parts of it).\n", "\n", "In general, we will use the [```requests```](https://requests.readthedocs.io/en/master/) package for working with APIs.\n", "\n", "For a first and simpler example, though, we can make use of a package called [```wikipediaapi```](https://pypi.org/project/Wikipedia-API/), which implements the [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page). Whenever working with APIs, it is advisable to look for trustworthy implementations, since the work might have been done for you already. We will then go on to produce the same result using the requests package.\n", "\n", "As usual, the first step is to import the packages (if already installed)." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import wikipediaapi" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# instantiate object and call page method\n", "# note: recent versions of the package may additionally require a user_agent argument\n", "wiki_en = wikipediaapi.Wikipedia('en')\n", "page = wiki_en.page('University_of_Passau') " ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title:\n", " University_of_Passau\n", "\n", "summary:\n", " The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. 
Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.\n" ] } ], "source": [ "print('title:\\n', page.title)\n", "print('\\nsummary:\\n',page.summary)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The ```wikipediaapi``` package returns preprocessed data from the respective article.\n", "\n", "\n", "Since APIs may come without such handy packages, we will now use the ```requests``` package to perform the same task.\\\n", "First, we import the package under the alias ```re``` (note that this alias shadows Python's built-in ```re``` module for regular expressions, which we do not use here)." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import requests as re" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'requests.models.Response'>\n" ] } ], "source": [ "wiki_url = 'https://en.wikipedia.org/w/api.php'\n", "\n", "# store parameters as dictionary to pass to function, may include key and username for authentication \n", "# the parameters are found in the documentation!\n", "params = {'action': 'query',\n", "          'format': 'json',\n", "          'titles': 'University_of_Passau',\n", "          'prop': 'extracts',\n", "          'exintro': 1,\n", "          'disablelimitreport':1}\n", "\n", "response = re.get(wiki_url, params=params)\n", "print(type(response))" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check whether the request was successful\n", "response.status_code # -> 200 " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With a successful request, we can now take a look at the response object's elements. 
Since the format was specified as json in our request, we can use the response object's ```.json()``` method.\\\n", "To display json data in a more readable manner (indents, line breaks), we can use the ```json``` package and its ```dumps``` function." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import json" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", "    \"batchcomplete\": \"\",\n", "    \"warnings\": {\n", "        \"main\": {\n", "            \"*\": \"Unrecognized parameter: disablelimitreport.\"\n", "        },\n", "        \"extracts\": {\n", "            \"*\": \"HTML may be malformed and/or unbalanced and may omit inline images. Use at your own risk. Known problems are listed at https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:TextExtracts#Caveats.\"\n", "        }\n", "    },\n", "    \"query\": {\n", "        \"normalized\": [\n", "            {\n", "                \"from\": \"University_of_Passau\",\n", "                \"to\": \"University of Passau\"\n", "            }\n", "        ],\n", "        \"pages\": {\n", "            \"409091\": {\n", "                \"pageid\": 409091,\n", "                \"ns\": 0,\n", "                \"title\": \"University of Passau\",\n", "                \"extract\": \"<p>The University of Passau (Universit\\u00e4t Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\\n</p><p>Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.</p>
\"\n", "            }\n", "        }\n", "    }\n", "}\n" ] } ], "source": [ "print(json.dumps(response.json(), indent=4))" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title: University of Passau\n", "\n", "summary: <p>The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "</p><p>Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.</p>\n" ] } ], "source": [ "# we can now navigate through the nested dictionaries\n", "title = response.json()['query']['pages']['409091']['title']\n", "print('title: ',title)\n", "summary = response.json()['query']['pages']['409091']['extract']\n", "print('\\nsummary:', summary)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Since the text is still in HTML, as can be seen from the ```<p>``` tags, we are not yet finished. One way to convert HTML to plain text is [**BeautifulSoup**](https://pypi.org/project/beautifulsoup4/).\\\n", "Let's import it from the ```bs4``` package and transform the HTML." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# first instantiate object, use 'lxml' parser (installation necessary) \n", "my_soup = BeautifulSoup(summary, 'lxml')\n", "\n", "# extract the text, i.e. remove tags\n", "print(my_soup.text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Website content\n", "Besides extracting text from HTML, BeautifulSoup can also print HTML in a more readable form, analogous to the json ```dumps()``` function from before. We will now use BeautifulSoup to extract the same information again, this time from the website directly, i.e. 
without an API.\n" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# first, send request - get response from ordinary url\n", "response = re.get('https://en.wikipedia.org/wiki/University_of_Passau')\n", "\n", "# second, extract content from the response (html)\n", "html = response.content\n", "\n", "# third, instantiate BS object, naming the parser explicitly as before\n", "soup = BeautifulSoup(html, 'lxml')\n", "\n", "# if necessary look at a formatted version of the html\n", "# print(soup.prettify()) # -> very long output" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.\n" ] } ], "source": [ "# finally print the text attribute from the created object\n", "print(soup.text[588:1037]) # only part of document for shorter output " ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\\n'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find('p').getText()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we loaded the whole website as raw text. 
It is not immediately obvious how to extract only the desired summary. Later, we will see ways to extract specific text sections like this one by leveraging the structure that is repeated across a website's pages." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Back to APIs\n", "\n", "Now, the API we will mostly use in this course is from [financialmodelingprep.com](https://financialmodelingprep.com/developer/docs/), providing financial and company data. In contrast to the Wikipedia API, registration by e-mail is required. Please register to get your API key.\n", "\n", "To access the data, the key must be submitted with the request. We can use the ```params``` argument as before, passing it the key in a dictionary.\n", "\n", "(Note that the API key is already saved in \"api_key\")" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# replace the placeholder with your personal API key\n", "api_key = 'my_key'" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# define components of the request\n", "base_url = 'https://financialmodelingprep.com/api/v3/'\n", "filing = 'profile'\n", "stock = 'GOOG'\n", "params = {'apikey': api_key}\n", "\n", "# send request and store response\n", "response = re.get(base_url+filing+'/'+stock, params=params)" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'symbol': 'GOOG',\n", " 'price': 101.38,\n", " 'beta': 1.099484,\n", " 'volAvg': 23226134,\n", " 'mktCap': 1316794269696,\n", " 'lastDiv': 0.0,\n", " 'range': '95.27-152.1',\n", " 'changes': -0.01000214,\n", " 'companyName': 'Alphabet Inc.',\n", " 'currency': 'USD',\n", " 'cik': '0001652044',\n", " 'isin': 'US02079K1079',\n", " 'cusip': '02079K107',\n", " 'exchange': 'NASDAQ Global Select',\n", " 'exchangeShortName': 'NASDAQ',\n", " 'industry': 'Internet Content & 
Information',\n", " 'website': 'https://www.abc.xyz',\n", " 'description': 'Alphabet Inc. provides various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. The Google Services segment offers products and services, including ads, Android, Chrome, hardware, Gmail, Google Drive, Google Maps, Google Photos, Google Play, Search, and YouTube. It is also involved in the sale of apps and in-app purchases and digital content in the Google Play store; and Fitbit wearable devices, Google Nest home products, Pixel phones, and other devices, as well as in the provision of YouTube non-advertising services. The Google Cloud segment offers infrastructure, platform, and other services; Google Workspace that include cloud-based collaboration tools for enterprises, such as Gmail, Docs, Drive, Calendar, and Meet; and other services for enterprise customers. The Other Bets segment sells health technology and internet services. The company was founded in 1998 and is headquartered in Mountain View, California.',\n", " 'ceo': 'Mr. 
Sundar Pichai',\n", " 'sector': 'Communication Services',\n", " 'country': 'US',\n", " 'fullTimeEmployees': '174014',\n", " 'phone': '650-253-0000',\n", " 'address': '1600 Amphitheatre Parkway',\n", " 'city': 'Mountain View',\n", " 'state': 'CA',\n", " 'zip': '94043',\n", " 'dcfDiff': 30.96,\n", " 'dcf': 132.35,\n", " 'image': 'https://financialmodelingprep.com/image-stock/GOOG.png',\n", " 'ipoDate': '2004-08-19',\n", " 'defaultImage': False,\n", " 'isEtf': False,\n", " 'isActivelyTrading': True,\n", " 'isAdr': False,\n", " 'isFund': False}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# without further processing, the json is already quite readable\n", "response.json()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also load the symbols of all companies listed in the Dow Jones index and then use these symbols to load historical stock price data for those companies." ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "200\n" ] } ], "source": [ "filing = 'dowjones_constituent'\n", "response = re.get(base_url+filing, params=params)\n", "print(response.status_code)" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " symbol name sector \\\n", "0 CRM Salesforce.Com Inc Technology \n", "1 WBA Walgreens Boots Alliance Inc Healthcare \n", "2 V Visa Inc Financial Services \n", "\n", " subSector headQuarter dateFirstAdded cik \\\n", "0 Technology San Francisco, CALIFORNIA 2020-08-31 0001108524 \n", "1 Healthcare Deerfield, ILLINOIS 2018-06-26 0001618921 \n", "2 Financial Services San Francisco, CALIFORNIA 2013-09-23 0001403161 \n", "\n", " founded \n", "0 2004-06-23 \n", "1 2014-12-31 \n", "2 2008-03-19 \n" ] } ], "source": [ "dji_df = pd.DataFrame.from_dict(response.json())\n", "print(dji_df.head(3))" ] }, { 
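"cell_type": "markdown", "metadata": {}, "source": [ "The conversion above works because the API returns a list of dictionaries that share the same keys. As a minimal sketch, assuming made-up records (the symbols and company names below are purely hypothetical and not taken from the API), the same pattern can be tried without sending any request:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# hypothetical records, shaped like the list of dictionaries an API might return\n", "records = [{'symbol': 'AAA', 'name': 'Example Corp', 'sector': 'Technology'},\n", "           {'symbol': 'BBB', 'name': 'Sample Inc', 'sector': 'Healthcare'}]\n", "\n", "# every dictionary becomes a row, every key a column\n", "example_df = pd.DataFrame.from_dict(records)\n", "print(example_df)" ] }, { 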
"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['CRM', 'WBA', 'V', 'NKE', 'UNH', 'TRV', 'VZ', 'INTC', 'WMT', 'JNJ', 'DIS', 'MCD', 'JPM', 'CAT', 'BA', 'AMGN', 'DOW', 'AAPL', 'GS', 'CSCO', 'MSFT', 'HD', 'PG', 'MRK', 'IBM', 'HON', 'KO', 'CVX', 'AXP', 'MMM']\n" ] } ], "source": [ "# extract symbol column and convert to list\n", "dji_symbols = dji_df.symbol.tolist()\n", "print(dji_symbols)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# set the filing (from documentation)\n", "filing = 'historical-price-full'\n", "\n", "# create empty dataframe (before loop!) to collect data for every company in loop\n", "dji_hist = pd.DataFrame()\n", "\n", "# loop over first 5 symbols in the list\n", "for symbol in dji_symbols[:5]:\n", " \n", " # request data for every symbol \n", " response = re.get(base_url+filing+'/'+symbol, params=params)\n", " # break the loop if a response error occurs\n", " if response.status_code != 200:\n", " print('Error! Aborted')\n", " break\n", " # convert the response to json to temporary dataframe\n", " temp_df = pd.DataFrame.from_dict(response.json()['historical'], orient='columns')\n", " # add column with respective symbol\n", " temp_df['symbol'] = symbol\n", " # append temporary dataframe to collect dataframe \n", " dji_hist = pd.concat([dji_hist, temp_df], ignore_index=True)\n", " # delete temporary dataframe before next iteration\n", " del temp_df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AxesSubplot(0.125,0.11;0.775x0.77)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# just one example to show that all companies' data made it into the collecting dataframe: variance of opening price \n", "print(dji_hist.groupby('symbol').open.var().plot(kind='bar'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now know how to retrieve data from an API and how to load the whole HTML content of a regular website. The next chapter will look at how we can extract information displayed on websites in a more targeted manner than simply downloading all of a website's content." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }