Working with Data#

Several libraries offer useful tools for working with data and enabling meaningful analysis. One of the most popular and powerful is Pandas. Besides efficient ways of cleaning and manipulating data, pandas also includes functions for statistical analysis and graphics.

Usually, pandas is imported under the alias pd.

import pandas as pd

Pandas - DataFrames and Series#

Indexing#

The basic elements for data are DataFrames and Series. A DataFrame is a matrix- or table-like representation of data with column and row names. A Series can be understood as a single column of such a data matrix, standing on its own.
There are corresponding functions to turn other objects, e.g. lists or dicts, into DataFrames or Series. Indexing, similar to lists or dicts, uses square brackets.

my_list = [1,2,3,4,5,6,7]
my_df = pd.DataFrame(my_list, columns=['var1'])
print('df:\n', my_df)

my_series = pd.Series(my_list)
print('series:\n',my_series)

# selecting a single column from a DataFrame
print('select column from df:\n', my_df['var1'])
df:
    var1
0     1
1     2
2     3
3     4
4     5
5     6
6     7
series:
 0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64
select column from df:
 0    1
1    2
2    3
3    4
4    5
5    6
6    7
Name: var1, dtype: int64

To select specific rows or columns, the iloc method (selection by integer position) and the loc method (selection by labels) are recommended, especially when several columns are to be selected. Indexing can also be done by boolean Series (or lists) and thus conditionally.
Another way to select a single column is to chain the column’s name to the DataFrame’s name with a dot (like in method chaining), as shown after the output below.

my_df = pd.DataFrame(
{'age': [20, 34, 56],
 'height': [183, 179, 172]
}, index=['person_a', 'person_b', 'person_c'])
print(my_df)
print('1.:', my_df.loc['person_b','age'], 'is the same as',  my_df.iloc[1,0])

# age > 27
print('indexing by condition/list\n', my_df.loc[my_df.age >27], '\ncorresponds to \n', my_df.loc[[False, True, True]])
print(type(my_df.age >27))
          age  height
person_a   20     183
person_b   34     179
person_c   56     172
1.: 34 is the same as 34
indexing by condition/list
           age  height
person_b   34     179
person_c   56     172 
corresponds to 
           age  height
person_b   34     179
person_c   56     172
<class 'pandas.core.series.Series'>
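
The dot notation works like bracket indexing; a minimal check (attribute access only works if the column name is a valid Python identifier and does not collide with an existing DataFrame attribute):

# attribute access and bracket indexing return the same Series
print(my_df.age.equals(my_df['age']))    # -> True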

Useful Methods#

Pandas includes many useful methods that will help you get to know and manipulate a dataset. Some of these methods are shown in the following; others are introduced later when needed.
More often than not, a dataset will contain missing values, i.e. cells in a data table that contain no value. They are depicted as NaN (Not a Number).

import numpy as np
my_df =  pd.DataFrame(
{'age': [20, 34, 56, np.nan, 44],
 'height': [183, 179, np.nan,  163, np.nan]
})
my_df
    age  height
0  20.0   183.0
1  34.0   179.0
2  56.0     NaN
3   NaN   163.0
4  44.0     NaN
# view the first rows (view last rows with .tail())
print('0.\n', my_df.head(n=5))

# general information
print('\n1.')
my_df.info()

# descriptive statistics on dataset
print('\n2.\n',my_df.describe())

# number of missing values per column
print('\n3.\n',my_df.isnull().sum())

# single statistics are included as methods, also for single columns
print('\n4.\n', my_df.age.mean())

# fill missing values (e.g. with mean of column)
print('\n 5.\n', my_df.fillna(my_df.mean()))    

# note that you must assign this to my_df (or a different variable) in order to impute missing values permanently!
my_df = my_df.fillna(my_df.mean())

# sort values by column(s)
print('\n6.\n', my_df.sort_values(by=['height']))    
0.
     age  height
0  20.0   183.0
1  34.0   179.0
2  56.0     NaN
3   NaN   163.0
4  44.0     NaN

1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     4 non-null      float64
 1   height  3 non-null      float64
dtypes: float64(2)
memory usage: 208.0 bytes

2.
              age      height
count   4.000000    3.000000
mean   38.500000  175.000000
std    15.264338   10.583005
min    20.000000  163.000000
25%    30.500000  171.000000
50%    39.000000  179.000000
75%    47.000000  181.000000
max    56.000000  183.000000

3.
 age       1
height    2
dtype: int64

4.
 38.5

 5.
     age  height
0  20.0   183.0
1  34.0   179.0
2  56.0   175.0
3  38.5   163.0
4  44.0   175.0

6.
     age  height
3  38.5   163.0
2  56.0   175.0
4  44.0   175.0
1  34.0   179.0
0  20.0   183.0
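
fillna() also accepts a dict mapping column names to fill values, in case different columns need different replacements; a sketch with illustrative values (commented out, since my_df has already been filled above):

# fill each column with its own value (illustrative values)
# my_df.fillna({'age': my_df.age.median(), 'height': 170})
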
# get column names (useful for looping)
print('7.\n', my_df.columns)

# drop rows containing missing values
print('8.\n', my_df.dropna()) 

# drop rows or columns
print('9.\n', my_df.drop(['age'], axis=1))

# merge DataFrames (automatically on shared variable if not specified otherwise)
df2 = pd.DataFrame(
{'age': [20, 34, 56, np.nan, 44],
 'weight': [83, 63, 98,  50, 77]
})
print('10.\n', my_df.merge(df2))
my_df = my_df.merge(df2)

# correlation matrix
print('11.\n', my_df.corr())

# adding new columns
my_df = my_df.assign(bmi = my_df.weight/(my_df.height/100)**2)
my_df
7.
 Index(['age', 'height'], dtype='object')
8.
     age  height
0  20.0   183.0
1  34.0   179.0
2  56.0   175.0
3  38.5   163.0
4  44.0   175.0
9.
    height
0   183.0
1   179.0
2   175.0
3   163.0
4   175.0
10.
     age  height  weight
0  20.0   183.0      83
1  34.0   179.0      63
2  56.0   175.0      98
3  44.0   175.0      77
11.
              age    height    weight
age     1.000000 -0.946549  0.481176
height -0.946549  1.000000 -0.282126
weight  0.481176 -0.282126  1.000000
    age  height  weight        bmi
0  20.0   183.0      83  24.784258
1  34.0   179.0      63  19.662308
2  56.0   175.0      98  32.000000
3  44.0   175.0      77  25.142857

As a last tool in this section, we will look at the get_dummies() function. Dummy variables encode categorical variables with zeros and ones, for example in order to calculate the correlation with some other numerical variable.

df3 = pd.DataFrame(
{'hair': ['blonde', 'black', 'red', 'red', 'black']
})

print(pd.get_dummies(df3.hair))
   black  blonde  red
0      0       1    0
1      1       0    0
2      0       0    1
3      0       0    1
4      1       0    0
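
As a sketch of the use case mentioned above, the dummies can be correlated with a numerical variable. The length column here is made up for illustration:

# hypothetical numerical variable for illustration
df3['length'] = [30, 10, 25, 28, 5]
dummies = pd.get_dummies(df3.hair, dtype=int)
# correlation of each hair colour dummy with the numerical variable
print(pd.concat([dummies, df3.length], axis=1).corr())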

Plots#

Methods for standard plot types are available. For a histogram of the data, just use .hist(). Other plot types are available by chaining .plot. and the plot type, e.g. .plot.line().

# histogram
my_df.hist()
array([[<AxesSubplot: title={'center': 'age'}>,
        <AxesSubplot: title={'center': 'height'}>],
       [<AxesSubplot: title={'center': 'weight'}>,
        <AxesSubplot: title={'center': 'bmi'}>]], dtype=object)
[figure: histograms of age, height, weight and bmi]
# lineplot
my_df.sort_values(by='age').plot.line(x='age', y='height')
<AxesSubplot: xlabel='age'>
[figure: line plot of height against age]
# scatter plot
my_df.plot.scatter(x='age', y='weight')
<AxesSubplot: xlabel='age', ylabel='weight'>
[figure: scatter plot of weight against age]
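
These plotting methods are wrappers around matplotlib, so the usual matplotlib keyword arguments are passed through; a minimal sketch (assuming matplotlib is installed, which pandas plotting requires):

import matplotlib.pyplot as plt

# customize via keyword arguments, e.g. number of bins and figure size
my_df.hist(column='age', bins=5, figsize=(4, 3))
plt.savefig('age_hist.png')    # save the current figure to a file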

Importing and Exporting Data#

Your data may come to you in various file formats. Pandas enables you to import data from all common formats. The respective functions are usually called read_ and to_, followed by the respective file type.
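
A few members of this family, as a sketch (the file names are hypothetical):

# reading other formats works analogously (hypothetical file names)
# df = pd.read_excel('data.xlsx')
# df = pd.read_json('data.json')
# exporting mirrors the naming scheme
# df.to_csv('data_out.csv')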

To read a .csv for example, use the read_csv() function. Note that the file need not be stored locally on your computer.

# import from csv
import pandas as pd
dax = pd.read_csv('data/DAX.csv')
print(dax.head(3))
print(dax.tail(3))
         Date          Open          High           Low         Close  \
0  2020-03-10  10724.980469  11032.290039  10423.900391  10475.490234   
1  2020-03-11  10601.849609  10761.429688  10390.509766  10438.679688   
2  2020-03-12   9863.990234   9932.559570   9139.120117   9161.129883   

      Adj Close     Volume  
0  10475.490234  267400800  
1  10438.679688  216708900  
2   9161.129883  390477000  
           Date          Open          High           Low         Close  \
251  2021-03-08  14024.570313  14402.919922  13977.129883  14380.910156   
252  2021-03-09  14345.509766  14475.650391  14309.349609  14437.940430   
253  2021-03-10  14439.450195  14554.490234  14408.519531  14528.570313   

        Adj Close     Volume  
251  14380.910156  109071900  
252  14437.940430  107881800  
253  14528.570313          0  
# save data frame to excel
dax.to_excel('DAX.xlsx')
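
Note that to_excel() writes the row index as an extra column by default; pass index=False to omit it (writing .xlsx files additionally requires an engine such as openpyxl to be installed):

# omit the row index in the exported file
dax.to_excel('DAX.xlsx', index=False)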

Let’s do some exploration and manipulation of the historical data from the DAX index we just imported.

print('shape:', dax.shape)
shape: (254, 7)
dax.info()    # the 'Date' column is of dtype object 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       254 non-null    object 
 1   Open       254 non-null    float64
 2   High       254 non-null    float64
 3   Low        254 non-null    float64
 4   Close      254 non-null    float64
 5   Adj Close  254 non-null    float64
 6   Volume     254 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 14.0+ KB
# check type of first entry in 'Date'
print(type(dax.Date[0]))
<class 'str'>

Transform it to datetime, a special type for dates in Python.

dax['Datetime'] = pd.to_datetime(dax.Date)
print(dax.Datetime.head(3))    # check dtype now
0   2020-03-10
1   2020-03-11
2   2020-03-12
Name: Datetime, dtype: datetime64[ns]
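
The datetime dtype unlocks the dt accessor for date components; a short sketch:

# extract single date components from a datetime column
print(dax.Datetime.dt.year.head(3))
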
print(dax.columns)
Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
       'Datetime'],
      dtype='object')
print(f'of {len(dax)} rows:\n{dax.notna().sum()}')
print('')
print(f'makes a total of {dax.isnull().sum().sum()} missing values')
of 254 rows:
Date         254
Open         254
High         254
Low          254
Close        254
Adj Close    254
Volume       254
Datetime     254
dtype: int64

makes a total of 0 missing values
dax.plot(x='Datetime', y=['Open', 'Close'])    # using Datetime for plotting
<AxesSubplot: xlabel='Datetime'>
[figure: line plot of Open and Close over Datetime]
dax.describe()
               Open          High           Low         Close     Adj Close        Volume
count    254.000000    254.000000    254.000000    254.000000    254.000000  2.540000e+02
mean   12488.785944  12595.082573  12373.931187  12489.142640  12489.142640  9.469272e+07
std     1324.154858   1295.824891   1365.892066   1331.191733   1331.191733  4.941752e+07
min     8495.940430   8668.480469   8255.650391   8441.709961   8441.709961  0.000000e+00
25%    11896.257813  12185.372315  11850.512451  12031.852783  12031.852783  6.409985e+07
50%    12847.964844  12945.129883  12764.479981  12851.169922  12851.169922  7.875350e+07
75%    13314.162353  13362.944824  13224.792481  13292.120362  13292.120362  1.085247e+08
max    14439.450195  14554.490234  14408.519531  14528.570313  14528.570313  3.904770e+08

For statistics on one variable, index the result as usual.

mean_open = dax.describe().loc['mean', 'Open']
print(mean_open)
12488.785944444882

Create a new column with a flag indicating whether the closing price was higher than the opening price.

dax = dax.assign(positive = dax.Close > dax.Open)
print(dax.head(3))

print('')
# fraction of days when this was the case
print('fraction of positive days:', dax.positive.mean())
print('\ncheck: \n', dax.positive.value_counts())
         Date          Open          High           Low         Close  \
0  2020-03-10  10724.980469  11032.290039  10423.900391  10475.490234   
1  2020-03-11  10601.849609  10761.429688  10390.509766  10438.679688   
2  2020-03-12   9863.990234   9932.559570   9139.120117   9161.129883   

      Adj Close     Volume   Datetime  positive  
0  10475.490234  267400800 2020-03-10     False  
1  10438.679688  216708900 2020-03-11     False  
2   9161.129883  390477000 2020-03-12     False  

fraction of positive days: 0.5

check: 
 False    127
True     127
Name: positive, dtype: int64

Extract the same fraction for every day of the week. Days are counted from 0 (Monday) to 6 (Sunday).

for i in range(7):
    print(f'day {i}: ', dax[dax.Datetime.dt.dayofweek == i].positive.mean())
day 0:  0.58
day 1:  0.49056603773584906
day 2:  0.5283018867924528
day 3:  0.44
day 4:  0.4583333333333333
day 5:  nan
day 6:  nan

A more straightforward way using built-in methods.

dax = dax.assign(wday = dax.Datetime.dt.dayofweek)
dax.groupby(['wday']).mean(numeric_only=True)  # rows with nans are not calculated
              Open          High           Low         Close     Adj Close        Volume  positive
wday
0     12510.387852  12640.970996  12400.179629  12546.658457  12546.658457  8.860451e+07  0.580000
1     12523.045843  12621.270969  12403.764906  12520.457547  12520.457547  9.450496e+07  0.490566
2     12522.144310  12628.387014  12414.166403  12527.621886  12527.621886  9.062774e+07  0.528302
3     12422.626602  12512.666523  12286.600566  12394.369766  12394.369766  9.770017e+07  0.440000
4     12460.538106  12567.442179  12360.190674  12450.887695  12450.887695  1.025976e+08  0.458333
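
If the numeric weekday codes are hard to read, the dt accessor also provides day names; a short sketch (note that the resulting groups are sorted alphabetically, not chronologically):

# group by weekday name instead of the numeric code
print(dax.groupby(dax.Datetime.dt.day_name()).positive.mean())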

Database and SQL#

Besides CSV (or Excel) files, another way to work with data is using databases. From there, data can be accessed using a query language. A very common one for relational databases is SQL (Structured Query Language). It allows extracting specific records, i.e. records which meet certain conditions, from a database with single commands.

Store in Database#

First, saving a data frame to a database is conveniently done with a built-in method in Pandas. However, to access a database, a connection must first be established. In the following, we will use the sqlalchemy package to work with a SQLite database engine.

from sqlalchemy import create_engine, Table
# setup SQLite engine and connect
path = 'data/dax_db.sqlite'
engine = create_engine('sqlite:///' + path, echo=False)  # if 'echo = False', information is not printed 
conn = engine.connect()
# save dataframe to database and close connection
dax.to_sql('historical_data', con=conn, if_exists='replace')
conn.close()       # close connection
engine.dispose()   # dispose engine

Load from Database#

To import data from a database, a connection must first be created in the same way as before. Then, SQL statements are used to fetch the data and store it in a Pandas dataframe.

from sqlalchemy import inspect, select, MetaData, text
path = 'data/dax_db.sqlite'
engine = create_engine('sqlite:///' + path, echo=False)  # if 'echo = True', information is printed in every step
conn = engine.connect()
# show tables and columns with inspector
inspector = inspect(engine)
tables = inspector.get_table_names()
print(tables)

# show columns
for col in inspector.get_columns(tables[0]):
    print(f"Col: {col['name']},\t type: {col['type']}")
['historical_data']
Col: index,	 type: BIGINT
Col: Date,	 type: TEXT
Col: Open,	 type: FLOAT
Col: High,	 type: FLOAT
Col: Low,	 type: FLOAT
Col: Close,	 type: FLOAT
Col: Adj Close,	 type: FLOAT
Col: Volume,	 type: BIGINT
Col: Datetime,	 type: DATETIME
Col: positive,	 type: BOOLEAN
Col: wday,	 type: BIGINT

Use a MetaData object, which holds information about the database.

m = MetaData()
m.reflect(engine)
print(m.tables.keys())
dict_keys(['historical_data'])
for table in m.tables.values():
    print(table.name)
    for column in table.c:
        print(f'Col: {column.name},\t Type: {column.type}')
historical_data
Col: index,	 Type: BIGINT
Col: Date,	 Type: TEXT
Col: Open,	 Type: FLOAT
Col: High,	 Type: FLOAT
Col: Low,	 Type: FLOAT
Col: Close,	 Type: FLOAT
Col: Adj Close,	 Type: FLOAT
Col: Volume,	 Type: BIGINT
Col: Datetime,	 Type: DATETIME
Col: positive,	 Type: BOOLEAN
Col: wday,	 Type: BIGINT

Knowing all components, a query can be sent to get the required data from the database into a dataframe. By convention, SQL keywords are written in all caps, even though this is not required for the statement to work.

The basic syntax is SELECT <col1>, <col2> FROM <table>. To select all columns from a table, the wildcard character * is used.

my_query = 'SELECT * FROM historical_data'

# raw query strings must be wrapped in text() to be executable (required since SQLAlchemy 2.0)
results = conn.execute(text(my_query))
df = pd.DataFrame(results.fetchall(), columns=results.keys())
df.head(2)
   index        Date          Open          High           Low         Close     Adj Close     Volume                    Datetime  positive  wday
0      0  2020-03-10  10724.980469  11032.290039  10423.900391  10475.490234  10475.490234  267400800  2020-03-10 00:00:00.000000         0     1
1      1  2020-03-11  10601.849609  10761.429688  10390.509766  10438.679688  10438.679688  216708900  2020-03-11 00:00:00.000000         0     2
# built-in function in Pandas
df2 = pd.read_sql_query(my_query, conn)
df2.head(2)
   index        Date          Open          High           Low         Close     Adj Close     Volume                    Datetime  positive  wday
0      0  2020-03-10  10724.980469  11032.290039  10423.900391  10475.490234  10475.490234  267400800  2020-03-10 00:00:00.000000         0     1
1      1  2020-03-11  10601.849609  10761.429688  10390.509766  10438.679688  10438.679688  216708900  2020-03-11 00:00:00.000000         0     2

For some sanity checks, not only regarding imports, the assert keyword will raise an error if the trailing statement is not True.

assert df.equals(df2)

To select specific columns only, they can be listed, separated by commas.

my_query = 'SELECT open, close FROM historical_data'

df = pd.read_sql_query(my_query, conn)
print(df.head(2))

print('')
# select only first 10 entries with LIMIT
my_query = 'SELECT open, close FROM historical_data LIMIT 10'

df = pd.read_sql_query(my_query, conn)
print('only 10 rows:', df.shape)
           Open         Close
0  10724.980469  10475.490234
1  10601.849609  10438.679688

only 10 rows: (10, 2)

The dataset can be filtered before being loaded into a dataframe. To restrict which records are imported, use WHERE after the table name.

# filter records with WHERE
my_query = 'SELECT open, close FROM historical_data WHERE wday=0'   # note that 'wday' does not need to be imported

df = pd.read_sql_query(my_query, conn)
print('only mondays:\n', df.head())
only mondays:
            Open         Close
0   8728.480469   8742.250000
1   8536.330078   8741.150391
2   9725.879883   9815.969727
3   9889.030273  10075.169922
4  10704.349609  10675.900391

Only some of the other functions and commands will be shown here; for a quick overview, see this collection.

# count occurrences
my_query = 'SELECT Count(*) FROM historical_data WHERE wday=0'   
mon = pd.read_sql_query(my_query, conn).values
print('number of monday records in dataset:\n', mon)
number of monday records in dataset:
 [[50]]
# get distinct values
my_query = 'SELECT DISTINCT(wday) FROM historical_data'   
weekdays = pd.read_sql_query(my_query, conn).values
print('distinct weekdays in dataset:\n', weekdays)
distinct weekdays in dataset:
 [[1]
 [2]
 [3]
 [4]
 [0]]
# calculate mean
my_query = 'SELECT ROUND(AVG(open),2) as mean_open FROM historical_data'   
mean_open = pd.read_sql_query(my_query, conn)
print(mean_open)
   mean_open
0   12488.79
# import variables under different names
my_query = 'SELECT ROUND(open,2) as opening_price, ROUND(close, 2) as closing_price FROM historical_data'
df = pd.read_sql_query(my_query, conn)
print(df.columns.to_list())
['opening_price', 'closing_price']
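
Aggregations can also be computed per group directly in the database. As a sketch, the weekday means from the pandas section can be reproduced with GROUP BY:

# mean opening price per weekday, computed in the database
my_query = 'SELECT wday, ROUND(AVG(open), 2) AS mean_open FROM historical_data GROUP BY wday'
print(pd.read_sql_query(my_query, conn))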

In order to automatically close connections after import or export, Python offers so-called context managers. Usually, a context manager is opened with the keyword with and a variable after as. The connection is only active in the indented block and is closed automatically when leaving the body of the context manager.

# engine is already created
with engine.connect() as conn:
    my_q = 'SELECT * FROM historical_data LIMIT 1'
    df = pd.read_sql_query(my_q, conn)
    
print(df)
   index        Date          Open          High           Low         Close  \
0      0  2020-03-10  10724.980469  11032.290039  10423.900391  10475.490234   

      Adj Close     Volume                    Datetime  positive  wday  
0  10475.490234  267400800  2020-03-10 00:00:00.000000         0     1  

SQL offers a wide range of functionality, which goes far beyond the scope of this course. To name just one more big advantage not shown above: data from several related tables can be joined and loaded at once.
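
A minimal sketch of such a query, assuming a hypothetical second table company_info that shares a key column symbol with the price data:

# hypothetical example: combine two related tables in one query
my_query = '''
SELECT h.Date, h.Close, c.sector
FROM historical_data AS h
JOIN company_info AS c ON h.symbol = c.symbol
'''
# df = pd.read_sql_query(my_query, conn)    # would run if both tables existed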