{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Online Data\n",
"\n",
"The internet is probably the most common place to seek information. In fact, every homepage you browse and every article you read is data and, as long as there is no pay wall or other security feature, free to you. Most of the data in such form is unordered and cannot easily be analysed quantitatively, like text or video clips (especially text usually does carry a lot of information). Beside this unstructured data, there are files for free to download, e.g. in a .csv format. Yet, all this has to be done by hand: either click some 'download' button to acquire a file or, even worse, copy a text online and paste it into a document.\n",
"\n",
"In the next sections, we will at first briefly introduce some basic elements of online communication and then present different ways to access data online from our computer without loading and storing it manually (we will write code manually, though).\n",
"\n",
"## Website basics \n",
"\n",
"### HTTP \n",
"\n",
"The **H**yper**T**ext **T**ransfer **P**rotocol is used for online communication in the world wide web (www). The basic operations are *requests* and *response*. In the www, when trying to open a website, you are actually sending a request to a server asking to provide the information stored under that web address. The server will then hopefully respond to your request by sending the respective information to your machine which then will be rendered in your browser. This response information does contain additional meta data, like the time of the request, status/error codes, etc which we will see later. An extension is **https** (**s**ecure).\n",
"\n",
"### URL\n",
"\n",
"The **U**niform **R**esource **L**ocator is the address your request is telling the server to look for information. It consists of several elements, ```http://www.uni-passau.de```:\n",
"\n",
"- the protocol: usually ```http``` or ```https``` in the www, followed by a colon and doube slash ```://```\n",
"\n",
"- the hostname: ```www.uni-passau.de``` \n",
"\n",
"- a path/query: appended with ```/``` or ```?``` to the hostname\n",
"\n",
"To know about these three separate parts enables us to automate navigation in the www using python.\n",
"\n",
"### HTML\n",
"\n",
"Information in the www is usually found in the **H**yper**T**ext **M**arkup **L**anguage for web browsers. HTML uses **tags** to structure hierarchical data, telling the browser how to display the single elements. Tags appear in pairs and are enclosed in angle brackets ```
```: paragraph, simple text information\n", "\n", "- ```
\n", " | length | \n", "height | \n", "
---|---|---|
obj_1 | \n", "19 | \n", "55 | \n", "
obj_2 | \n", "22 | \n", "64 | \n", "
\n", " | length | \n", "height | \n", "
---|---|---|
obj_1 | \n", "19 | \n", "55 | \n", "
obj_2 | \n", "22 | \n", "64 | \n", "
The University of Passau (Universit\\u00e4t Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\\n
Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.
\"\n", " }\n", " }\n", " }\n", "}\n" ] } ], "source": [ "print(json.dumps(response.json(), indent=4))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title: University of Passau\n", "\n", "summary:The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "
Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.
\n" ] } ], "source": [ "# we can now navigate through the nested dictionaries\n", "title = response.json()['query']['pages']['409091']['title']\n", "print('title: ',title)\n", "summary = response.json()['query']['pages']['409091']['extract']\n", "print('\\nsummary:', summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the text is still in HTML, as can be seen from the tags, we are not yet finished. One way to convert HTML to normal text is to use [**BeautifulSoup**](https://pypi.org/project/beautifulsoup4/).\\\n", "Let's import it from the ```bs4``` package and transform the HTML." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# first instantiate object, use 'lxml' parser (installation necessary) \n", "my_soup = BeautifulSoup(summary, 'lxml')\n", "\n", "# extract the text, i.e. remove tags\n", "print(my_soup.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Website content\n", "BeautifulSoup can not only extract text from HTML but also, in analogy to the json ```dumps()``` method from before, print HTML files in a more readable way. We will now leverage BeautifulSoup's functions to extract the same information again, but this time from the website directly, i.e. 
without an API.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# first, send request - get response from ordinary url\n", "response = re.get('https://en.wikipedia.org/wiki/University_of_Passau')\n", "\n", "# second, extract content from the response (html)\n", "html = response.content\n", "\n", "# third, instantiate BS object, again using the 'lxml' parser\n", "soup = BeautifulSoup(html, 'lxml')\n", "\n", "# if necessary look at a formatted version of the html\n", "# print(soup.prettify()) # -> very long output" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\n", "Today it is home to four faculties and 39 different undergraduate and postgraduate degree programmes.\n" ] } ], "source": [ "# finally print the text attribute from the created object\n", "print(soup.text[588:1037]) # only part of document for shorter output " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The University of Passau (Universität Passau in German) is a public research university located in Passau, Lower Bavaria, Germany. Founded in 1973, it is the youngest university in Bavaria and consequently has the most modern campus in the state. Nevertheless, its roots as the Institute for Catholic Studies dates back to the early 17th century.\\n'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find('p').getText()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we loaded the whole website as raw text. 
One can immediately see that it is not obvious how to extract only the wanted summary. We will later see some ways to extract text sections like this by leveraging the structure that repeats across the different pages of a website." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Back to APIs\n", "\n", "Now, the API we will mostly use in this course is from [financialmodelingprep.com](https://financialmodelingprep.com/developer/docs/), providing financial and company data. In contrast to the Wikipedia API, registration by e-mail is required. Please register to get your API key.\n", "\n", "To access the data, the key must be submitted with the request. We can use the ```params``` argument as before, handing it the respective key-value pairs.\n", "\n", "(Note that the API key is already saved in \"api_key\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "api_key = 'my_key'\n", "api_key = '44c50c7a71efa92e8dac68f5902ea0ec'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# define components of the request\n", "base_url = 'https://financialmodelingprep.com/api/v3/'\n", "filing = 'profile'\n", "stock = 'GOOG'\n", "params = {'apikey': api_key}\n", "\n", "# send request and store response\n", "response = re.get(base_url+filing+'/'+stock, params=params)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'symbol': 'GOOG',\n", " 'price': 101.38,\n", " 'beta': 1.099484,\n", " 'volAvg': 23226134,\n", " 'mktCap': 1316794269696,\n", " 'lastDiv': 0.0,\n", " 'range': '95.27-152.1',\n", " 'changes': -0.01000214,\n", " 'companyName': 'Alphabet Inc.',\n", " 'currency': 'USD',\n", " 'cik': '0001652044',\n", " 'isin': 'US02079K1079',\n", " 'cusip': '02079K107',\n", " 'exchange': 'NASDAQ Global Select',\n", " 'exchangeShortName': 'NASDAQ',\n", " 'industry': 'Internet Content & 
Information',\n", " 'website': 'https://www.abc.xyz',\n", " 'description': 'Alphabet Inc. provides various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. The Google Services segment offers products and services, including ads, Android, Chrome, hardware, Gmail, Google Drive, Google Maps, Google Photos, Google Play, Search, and YouTube. It is also involved in the sale of apps and in-app purchases and digital content in the Google Play store; and Fitbit wearable devices, Google Nest home products, Pixel phones, and other devices, as well as in the provision of YouTube non-advertising services. The Google Cloud segment offers infrastructure, platform, and other services; Google Workspace that include cloud-based collaboration tools for enterprises, such as Gmail, Docs, Drive, Calendar, and Meet; and other services for enterprise customers. The Other Bets segment sells health technology and internet services. The company was founded in 1998 and is headquartered in Mountain View, California.',\n", " 'ceo': 'Mr. 
Sundar Pichai',\n", " 'sector': 'Communication Services',\n", " 'country': 'US',\n", " 'fullTimeEmployees': '174014',\n", " 'phone': '650-253-0000',\n", " 'address': '1600 Amphitheatre Parkway',\n", " 'city': 'Mountain View',\n", " 'state': 'CA',\n", " 'zip': '94043',\n", " 'dcfDiff': 30.96,\n", " 'dcf': 132.35,\n", " 'image': 'https://financialmodelingprep.com/image-stock/GOOG.png',\n", " 'ipoDate': '2004-08-19',\n", " 'defaultImage': False,\n", " 'isEtf': False,\n", " 'isActivelyTrading': True,\n", " 'isAdr': False,\n", " 'isFund': False}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Without further processing, the json is already quite well structured to read\n", "response.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also load the symbols of all companies listed in the Dow Jones Index and then use these symbols to load historical stock price data for those companies." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "200\n" ] } ], "source": [ "filing = 'dowjones_constituent'\n", "response = re.get(base_url+filing, params=params)\n", "print(response.status_code)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " symbol name sector \\\n", "0 CRM Salesforce.Com Inc Technology \n", "1 WBA Walgreens Boots Alliance Inc Healthcare \n", "2 V Visa Inc Financial Services \n", "\n", " subSector headQuarter dateFirstAdded cik \\\n", "0 Technology San Francisco, CALIFORNIA 2020-08-31 0001108524 \n", "1 Healthcare Deerfield, ILLINOIS 2018-06-26 0001618921 \n", "2 Financial Services San Francisco, CALIFORNIA 2013-09-23 0001403161 \n", "\n", " founded \n", "0 2004-06-23 \n", "1 2014-12-31 \n", "2 2008-03-19 \n" ] } ], "source": [ "dji_df = pd.DataFrame.from_dict(response.json())\n", "print(dji_df.head(3))" ] }, { 
"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['CRM', 'WBA', 'V', 'NKE', 'UNH', 'TRV', 'VZ', 'INTC', 'WMT', 'JNJ', 'DIS', 'MCD', 'JPM', 'CAT', 'BA', 'AMGN', 'DOW', 'AAPL', 'GS', 'CSCO', 'MSFT', 'HD', 'PG', 'MRK', 'IBM', 'HON', 'KO', 'CVX', 'AXP', 'MMM']\n" ] } ], "source": [ "# extract symbol column and convert to list\n", "dji_symbols = dji_df.symbol.tolist()\n", "print(dji_symbols)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# set the filing (from documentation)\n", "filing = 'historical-price-full'\n", "\n", "# create empty dataframe (before loop!) to collect data for every company in loop\n", "dji_hist = pd.DataFrame()\n", "\n", "# loop over first 5 symbols in the list\n", "for symbol in dji_symbols[:5]:\n", " \n", " # request data for every symbol \n", " response = re.get(base_url+filing+'/'+symbol, params=params)\n", " # break the loop if a response error occurs\n", " if response.status_code != 200:\n", " print('Error! Aborted')\n", " break\n", " # convert the response to json to temporary dataframe\n", " temp_df = pd.DataFrame.from_dict(response.json()['historical'], orient='columns')\n", " # add column with respective symbol\n", " temp_df['symbol'] = symbol\n", " # append temporary dataframe to collect dataframe \n", " dji_hist = pd.concat([dji_hist, temp_df], ignore_index=True)\n", " # delete temporary dataframe before next iteration\n", " del temp_df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AxesSubplot(0.125,0.11;0.775x0.77)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "