Graphics#

- A graph is worth a thousand records (and more) -

Graphics provide an easy way to display information extracted from data to the beholder. While graphics are a major tool for reporting purposes, reporting may already mark the endpoint of your journey with the data. Instead of creating graphics only as a final step, they should also be considered at the very beginning of a project: when exploring a dataset for the first time. While creating different plots for a dataset, we will therefor try to explain any oddities we find.

Python offers several packages, which allow a quick look at your data using basic plot types. The respective functions to build these plots in a basic fashion using the default preferences are highly customizable.

Matplotlib#

A popular library for graphics in python is matplotlib. Its functionality is accessed by importing its API (commonly under the alias plt) matplotlib.pyplot as plt. It works seamlessly with numpy arrays and pandas objects, making it the library of choice here.

In general, when creating a plot, we create a figure to which we add axes objects.

pyplot functions then specify which part of the figure is adressed and adds objects to it. Such parts include for example the axes, title or drawing area. In order to customize a plot, you can stack these function calls, since the invoked states will be preserved until the concluding plt.show() statement creates the graph.

In the following, we will look at basic examples of different plots and learn the more specific syntax on the fly. For an overview of plots and how to create them, look at this gallery of plots from matplotlib.

We will use some modified credit card data from this kaggle page.

import pandas as pd

df = pd.read_csv('https://www.dropbox.com/s/sbqezclbr7qkx5r/BankChurners_mod.csv?dl=1', index_col=0)
print(df.head(3))
   CLIENTNUM     Attrition_Flag  Customer_Age Gender  Dependent_count  \
0  768805383  Existing Customer            45      M                3   
1  818770008  Existing Customer            49      F                5   
2  713982108  Existing Customer            51      M                3   

  Education_Level Marital_Status Income_Category Card_Category  \
0     High School        Married     $60K - $80K          Blue   
1        Graduate         Single  Less than $40K          Blue   
2        Graduate        Married    $80K - $120K          Blue   

   Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  \
0              39                         5                       1   
1              44                         6                       1   
2              36                         4                       1   

   Contacts_Count_12_mon  Credit_Limit  Total_Revolving_Bal  Avg_Open_To_Buy  \
0                      3       12691.0                  777          11914.0   
1                      2        8256.0                  864           7392.0   
2                      0        3418.0                    0           3418.0   

   Avg_Utilization_Ratio  
0                  0.061  
1                  0.105  
2                  0.000  

We see that the data consists of numbers and strings. We can verify, that by calling the info() method on the data frame.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           10127 non-null  object 
 6   Marital_Status            10127 non-null  object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(3), int64(8), object(6)
memory usage: 1.4+ MB

In order to create a meaningful plot, the type must be chosen depending on the type of the variables.

Categories#

Such variables are usually (but not necessarily) of type string. After identifying a variable to be categorical, one must decide whether it is of ordinal - the levels can be ranked - or nominal scale - they cannot be ranked. Examples from our dataset are Income_Category for ordinal and Martial_Status for nominal scale.

For categorical variables, bar charts and histograms are maybe the first plot type you should consider. The difference however, is only the order and visualisation: ordinal data uses histograms, which appear column to column in a defined order, nominal data uses barcharts iwth spacing between the columns and no specific order. To build the standard plots, we need not do more than call the hist() or bar function with the respective data.

Bar Charts#

We need to provide the categories and the height of the variable. With the .unique() method, we get all levels present in the variable, with .calue_counts() we get the number of occurrences.

import matplotlib.pyplot as plt

plt.bar(df.Marital_Status.unique(), df.Marital_Status.value_counts())  #create object
<BarContainer object of 4 artists>
_images/44583efc0f0ed81e0bb6f7d8663e16ed0d6ae3a163a1c99bcb6c270e341374a8.png

Warning

Jupyter Notebook will automatically display graphics from a matplotlib object. Using the statement above would not create a plot elsewhere, as this only creates an object for plotting.

Use the plt.show() statement!

plt.bar(df.Marital_Status.unique(), df.Marital_Status.value_counts())
plt.show()    # show image
_images/44583efc0f0ed81e0bb6f7d8663e16ed0d6ae3a163a1c99bcb6c270e341374a8.png

Histograms#

Histograms are used, when the distribution of a numerical variable should be displayed

The default number of bins in matplotlib is 10 with equal spacing. For a more detailed look or further processing, we can use the hist() function to get the number of samples in each bin, the borders of the bins and the patches, i.e.the displayed parts of the histogram.

Let’s change the number of bins to 5 and view these unpacked parts. (Note that jupyter will print the plot any way)

n, bins, patches = plt.hist(df.Credit_Limit, bins=5, rwidth=.85)
print('samples in bin:', n, '\nbin borders:', bins)
print('patches:', patches[0], patches[1])
samples in bin: [6735. 1520.  706.  413.  753.] 
bin borders: [ 1438.3   8053.84 14669.38 21284.92 27900.46 34516.  ]
patches: Rectangle(xy=(1934.47, 0), width=5623.21, height=6735, angle=0) Rectangle(xy=(8550.01, 0), width=5623.21, height=1520, angle=0)
_images/94a4438ea979dd251717fc546cf938efaa8160bc63f5b7b3f6331d1067958b14.png

We see that the ticks are not automatically matching the bars. To solve this, we can use the bin borders, bins from above, to set the ticks and labels on the x axis.

plt.hist(df.Credit_Limit, bins=15)
plt.xticks(ticks=bins[1:], rotation=65)    # set xticks at bin border

plt.show()   
_images/4d02f508448ecd4d99d3c10f4a2be5730a34a7ae48efb5afd7c6d120e15979e8.png

Subplots#

We can compare two results side by side, using the subplot(nrow, ncol, index) function. We specify how many plots to show by the number of rows and columns in the parentheses. To make one cell of the figure the current subplot, call the subplot function before.

  • At first, we specify a larger figure size.

  • We choose 15 bins

  • For colours, matplotlib automatically recognizes RGB and RGBA when a 3- or 4-tuple is provided to color

  • We also add titles and change the colours to better differentiate between the plot from above and the modified one. Nore the state based approach of calling these functions subsequently

  • Since the tick labels in the right plot are too long and would overlap, we rotate them

plt.figure(figsize=(12,5))    # this will stretch/compress the plots to fit the figsize

plt.subplot(1,2,1)    # first subplot of two in one row
plt.hist(df.Credit_Limit, bins=15, color=(0,137/255,147/255))
plt.title('default ticks')    # add title

plt.subplot(1,2,2)    # second subplot
n, bins,_ = plt.hist(df.Credit_Limit, bins=15, color=(.9,.1,0,1))
plt.title('modified ticks')

plt.ylabel('count')
plt.xlabel('Credit Limit ($)')
plt.xticks(ticks=bins[1:], rotation=65)    # set xticks at bin border

plt.show()   # only after all subplots have been defined
_images/fc12b2897d16cfc34cb70365e3f3ab8218b313a132f79e614d51c2fa8d62a172.png

The rightmost bar strikes the eye as it interrupts the decreasing trend seen towards higher values. If we check for the occurence of the maximum value by Python len(df[df.Credit_Limit == max(df.Credit_Limit)]), we find it occurs 508 times. This is in indication, that the dataset has been clipped. Meaning that values higher or lower than some critical values are overwritten by these values.

See this numpy example with the clip() function:

import numpy as np
print(np.clip([-1,2,3,4,11,100], 3, 4))    # the list values are clipped to 3 and 4, respectively
[3 3 3 4 4 4]

A more special case arises for the variable Income_Category which is already binned to categories but can be ranked. We would want to order the bars accordingly.

While the default plt.bar() would only make a random order of the categories, we can specify this order by providing the tick positions and labels. Another way to do this is by mapping the values to integers, according to their rank. Either way, some work by hand is required to reorder the bars.

We will define a dictionary containing k-v pairs for mapping and then use pandas’ map() function.

(df.Income_Category.unique())    # get unique values to biuld dict in next step
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
       '$120K +', 'Unknown'], dtype=object)
map_income = {'Unknown': 0, 'Less than $40K':1, '$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4, '$120K +':5}

print(pd.concat( [df.Income_Category[:3], df.Income_Category[:3].map(map_income)], axis=1))
   Income_Category  Income_Category
0      $60K - $80K                3
1   Less than $40K                1
2     $80K - $120K                4

We will use a different syntax for creating subplots this time, using the subplots() function (plural!). It unpacks into a figure and an axes object. (We could use this syntax for just one plot, too.) Since we want to display 3 subplots, we unpack the axes objects into a tuple of length 3. We call these objects ax1, ax2 and ax3. The respective plot method will then be called on these objects. The commands for ticks etc. change from the case before.

Note that for the second plot, we could have used a list in the desired order. The dict, i.e. the values of the dict, are only used for the histogram.

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(10,6))     # unpacking into figure and 3 axes objects

fig.suptitle('Bar Charts and Histograms', fontsize=18)

ax1.bar(range(len(df.Income_Category.unique())), height=df.Income_Category.value_counts())
ax1.title.set_text('default bar chart')

ax2.bar([k for k,v in map_income.items()], color='red', 
        height=[df.Income_Category.value_counts()[k]
                for k,v in map_income.items()],)    # unpack keys as labels, note that a list would suffice here
ax2.title.set_text('reordered bar chart') 
ax2.set_xticks(range(6))    # warning if not specified before the labels
ax2.set_xticklabels([k for k,v in map_income.items()], rotation=65)

n, bins, _ = ax3.hist(df.Income_Category.map(map_income), bins=6, rwidth=.8, color='grey')
bin_width = (bins[1] - bins[0])/2    # calculate half bin width for centered tickmarks
ax3.set_xticks(bins[:-1]+bin_width)  
ax3.set_xticklabels([k for k,v in map_income.items()], rotation=65) # unpack keys as labels
ax3.title.set_text('histogram with mapping')

plt.tight_layout()    # rearrange subplots for a better look
plt.show()
_images/181300713658fb4e77ea28a8c352f6619e286af990ee052bd06252a9dde73f22.png

We can see that we basically built the same plot (middle and right) by using a bar plot and a histogram.

Now, there are many more options on how to build histograms. One last example is to display two variables in the same plot: a grouped and stacked histogram

It should be noted, that the values should lie in the same range approximately.

For the first plot, we provide the variables as a list to the hist() function. For the second plot, we use the state based approach and create two plots which we display in the same graphic.

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,4))     # unpacking into figure and 2 axes objects

n, bins, _ = ax1.hist([df.Credit_Limit, df.Avg_Open_To_Buy], bins=12, 
                     color=['green', (142/255,142/255, 141/255, .6)])
bin_width = (bins[1] - bins[0])/2    # calculate half bin width for centered tickmarks
ax1.set_xticks(bins[1:])  
ax1.set_xticklabels([int(b/1000) for b in bins[1:]], rotation=65) 
ax1.title.set_text('grouped')
ax1.title.set_fontsize(18)
ax1.legend(['Credit Limit', 'Avg Open to Buy'], fontsize='large')
ax1.set_xlabel('thousands of dollars', fontsize = 13)
ax1.set_ylabel('Count', fontsize = 13)

n, bins, _ = ax2.hist([df.Credit_Limit, df.Avg_Open_To_Buy], bins=12,
                     color=['green', (142/255,142/255, 141/255, .6)], stacked=True)
ax2.set_xticks(bins[1:])  
ax2.set_xticklabels([int(b/1000) for b in bins[1:]], rotation=65) 
ax2.title.set_text('stacked')
ax2.title.set_fontsize(18)
ax2.legend(['Credit Limit', 'Avg Open to Buy'], fontsize='large')
ax2.set_xlabel('thousands of dollars', fontsize = 13)
ax2.set_ylabel('Count', fontsize = 13)


plt.show()
_images/376c26e20c484b3c9ca421c22fa05aba1cbb4d5387995df79ec7354d6c25c59b.png
n, bins, _ = plt.hist([df.Credit_Limit], bins=12, rwidth=.7, color=['green'], alpha=.5, edgecolor='black')
bin_width = (bins[1] - bins[0])/2    # calculate half bin width for centered tickmarks
plt.xticks(bins[1:], [int(b/100) for b in bins[1:]], rotation=65) 

n, bins1, _ = plt.hist([df.Avg_Open_To_Buy], bins=12, rwidth=.7, 
                     color=[(142/255,142/255, 141/255, .5)], edgecolor='black')

plt.title('a title')
plt.show()
_images/7e44bad2cb920b8d074b47fe83b2f2b4849bc8937d65b91785ad10d1618ef79a.png

The basic framework stays the same for all other plots that can be created using matplotlib. We will have a look at two more examples before switching to a different library more specific for statistical plots.

Numerical Data#

Beside categorical data, we often deal with numerical data. We have already used Credit_Limit for a histogram so far.

Scatter Plots#

This kind of plot uses uses variables from the dataset on both axes, unlike before where we saw the count on the y-axis. Scatter plots are particularly useful for checking relationships between two variables. To create a scatterplot, use plt.scatter().

fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(df.Customer_Age, df.Months_on_book, marker='.', alpha=.1)
plt.show()
_images/6d7d925fa3b64c4e056b83e36ac8fc85745ad2a973cde786f08af0a076d49f3f.png

We clearly see a positive linear correlation between the two variables. We also see, that one value of Months_on_Book appears throughout all ages. This is likely from imputing missing values, i.e. if this information is not available, the system instantly or some data engineer later set the value to the mean or some other statistical indicator.

Let’s quantify the high correlation between the variables and then see if we can find out more about the value appearing almost as horizontal line.

print(np.corrcoef(df.Customer_Age, df.Months_on_book)[0][1])
0.788912358993053
print(df.Months_on_book.value_counts().iloc[:3])    # no surprise, it is the most frequent number for this variable
36    2463
37     358
34     353
Name: Months_on_book, dtype: int64
print('mean of whole dataset:', df.Months_on_book.mean())
print('mean without 36 months at all:',df[df.Months_on_book!=36].Months_on_book.mean())
mean of whole dataset: 35.928409203120374
mean without 36 months at all: 35.905401878914404

Another way to make your scatterplots more fancy and more informative, is through grouping by some third variable and display these groups using different colours. This is particularly interesting for categorical variables, where the markers are coloured according to the single levels.

Note the dictionary containing the colour codes for the categories and the for loop used to create axes objects. We could have used the c=df.Card_Categories.map(card_col), but then we would not get a legend.

fig, ax = plt.subplots(figsize=(12,8))

# use only rows in upper half of total revolving balance for less data points
groups = df[df.Total_Revolving_Bal > df.Total_Revolving_Bal.mean()].groupby(df.Card_Category)

# dictionary with 
card_col = {'Blue': '#069AF3', 'Gold': 'gold', 'Platinum': 'black', 'Silver': 'silver'}

# loop over the groups to create axes objects
for name, group in groups:
    ax.scatter(group.Customer_Age, group.Months_on_book, marker='p', alpha=1, label=name, color=card_col[name])
ax.legend(prop={'size': 12})

#show in same plot
plt.show()
_images/832f65e44e292dfc12a0f65a53a07022d3cd53674821cada39694783d27b4394.png
for name, group in groups:
    print(card_col[name])
#069AF3
gold
black
silver

Beside the colour, the marker size can also be customized to display a relative difference between third variables of the observations.

Note that the size parameter s needs an numerical input, which could refer to categories, e.g. through mapping.

fig, ax = plt.subplots(figsize=(8,8))

# use only rows in upper half of total revolving balance for less data points
groups = df[df.Total_Revolving_Bal > df.Total_Revolving_Bal.mean()].groupby(df.Gender)

# loop over the (two) groups to create axes objects
# make marker size according to fourth variable
for name, group in groups:
    ax.scatter(group.Customer_Age, group.Months_on_book, marker='.', alpha=.5, label=name,
               s=(group.Dependent_count+2)**3)   # resizing might be necessary
ax.legend()

plt.show()
_images/7d44e976629f45623b8e5d08e49cca15fc62279752078508ebd7a4178ddc6034.png

Line plots#

The ‘most standard’ type of plot may be the line plot. It is used for example to display the evolution of a variable over time like in stock charts. Another example would be the visualisation of a fit curve or some distribution in a histogram.

To create a lineplot, use the plot() function. We will at first look at how to plot a normal distribution and then simulate random variables accordingly to compare them in a histogram.

# define function for plotting
def my_gauss(x, mu, sig):
    return (1/(sig * np.sqrt(2* np.pi))) * np.exp(-.5 * ((x-mu)/sig)**2)
    
# define x values
x = np.linspace(-5,5, 100)

# y values
y = my_gauss(x, 0 ,1)

# draw random numbers from standard normal
my_rands = np.random.normal(0,1,100000)
 
samples= [30, 200, 1000]

plt.figure(figsize=(16,4))

for i,s in enumerate(samples):
    plt.subplot(1,4,i+1)
    plt.plot(x,y, linestyle='--', lw=2)
    plt.hist(my_rands[:s], bottom=0, bins=20, density=True, alpha=.8, edgecolor='black')
    plt.title(f'{s} samples')
    
plt.subplot(1,4,4)
plt.plot(x,y, marker='^', ls='')
plt.hist(my_rands, bottom=0, bins=20, density=True, alpha=.8, edgecolor='black')
plt.title(f'{s} samples')
plt.show()
_images/f60730e6c166a4e4c426cd1ee3834a695ad73e2d0938963e5155dd1259b8da82.png

Matplotlib includes many useful shortcuts in customizing the graphics: colors (simply ‘blue’ instead of an rgb code), the markers (e.g. ‘^’ for triangles) and the linestyle (’–’ for a dashed line). They can be found in the documentation of the respective function. In general, it is always useful to check the documentation when creating a graph that should be beyond the standard, because of the mutltitude of built-in keywords.

While Matplotlib offers the possibility to create and customize plots almost at will, it sometimes does not quite have the right plot readily availabe. For statistical plots and graphics, another library includes more predefinded functions for a wide variety of plot types.

Seaborn#

Seaborn is a library for statistical plots and graphs. It is based on matplotlib and thus works neatly with pandas and numpy as well. The advantage is the many plot types which are available which may be tiresome to build ourselves using matplotlib. Beside more available default plot types than in matplotlib, seaborn includes composite graphics, where one command automatically renders a set of graphics. This makes exploring of a dataset even quicker. In the following, we will show some of the above plots and new kinds of standard statistical plots in seaborn.

The syntax is slightly different to matplotlib: The data argument specifies the dataframe/array and the variables are selected as strings. Note however, that since seaborn is built upon the matplotlib framework, we can still use the syntax used above. Furthermore, we can utilise some functions we have learned before. We would then just need to import matplotlib as well.

At first, we import seaborn as sns

import seaborn as sns
# and matplotlib
import matplotlib.pyplot as plt

At first we will again look at bar plots. To create the same plot as in the beginning of this chapter in seaborn, we use the countplot() function. To rotate the ticks, we can use different approaches. Another difference to the examples with pyplot is the order keyword.

plt.figure(figsize=(10,4))
plt.subplot(1,3,1)
sns.countplot(x = 'Income_Category', data = df)
plt.xticks(rotation=65)
plt.title('default, rotation 1')

plt.subplot(1,3,2)
ctplot = sns.countplot(x = 'Income_Category', data = df)
for item in ctplot.get_xticklabels():
    item.set_rotation(65)
ctplot.set_xticklabels
plt.title('default, rotation 2')
    
plt.subplot(1,3,3)
order = ['Unknown', 'Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +']
color = (.35, .44 , .6)
new = sns.countplot(x = 'Income_Category', data = df, order=order, color=color)
plt.xticks(rotation=65)
plt.title('reordered')

plt.tight_layout()
_images/8bff805e290afa4c96e45ef458357fef2ced1eee75bf90b385e7d109bfe62c33.png

For histograms additional functionality is readily available. The displot() function builds the basic histogram. In seaborn, we only need single keywords to build more advanced plots:

  • kind defines the kind of plot

  • col/row creates subplots for the levels of the respective variable

  • hue creates one plot with all categories separately coloured

  • rug, if True, creates marignal ticks for every observation

  • kde uses a dernel density estimation, i.e. it displays histogram and estimated density

sns.displot(data=df, x="Credit_Limit", col="Gender", kind='hist', height=3, aspect=1.2)
plt.suptitle('col with "Gender"', fontsize=14)    # use suptitle, as subplots are automatically created

sns.displot(data=df, x="Credit_Limit", hue="Gender", kde=True, height=3, aspect=1.2)
plt.title('hue with "Gender" and kde')

sns.displot(data=df, x="Credit_Limit", hue="Gender", kind='ecdf', height=3, aspect=1.2)
plt.title('ecdf for "Gender"')
plt.show()
_images/afc39cea148905549d5fd720a897193ef9829710dbce04ee1149f49d388644f5.png _images/6ccb786faca1a890c01756b74bb0894bfb2e7548fc613257d808631c94d904b5.png _images/aa60d25389008de3b886be5b7f92695d486e58d587d621d3a1c095cb03c9a282.png

At this point, we will no longer repeat the plot types we have seen in the previous chapter. However, you are advised to check how to create them using seaborn and learn about the additional possibilities.

We will now look at some other plots which come with seaborn by default and may not be built so quickly using pyplot. A more complete set of available plot types can be found here

Heatmaps#

This is a special kind of plot, that is often used to visualize correlation matrices. It shows variables (or categories) on the axes and rectangles in the plot which are usually coloured with respect to some value.

For a correlation matrix, we first have to calculate it using numpy. The resulting matrix can then be passed to seaborn.

To compute the correlation, we need to either omit the categorical variables or use one-hot encoding. For this example, we will only look at numerical variables and use the usual Pearson correlation.

corr_df = df.select_dtypes('int64').drop('CLIENTNUM', axis=1)   # select only numerical variables
                                                                 # exclude system variable
corr_mat = corr_df.corr()    # create correlation matrix
corr_mat    # not using the print statement makes it more pretty thanks to jupyter
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Total_Revolving_Bal
Customer_Age 1.000000 -0.122254 0.788912 -0.010931 0.054361 -0.018452 0.014780
Dependent_count -0.122254 1.000000 -0.103062 -0.039076 -0.010768 -0.040505 -0.002688
Months_on_book 0.788912 -0.103062 1.000000 -0.009203 0.074164 -0.010774 0.008623
Total_Relationship_Count -0.010931 -0.039076 -0.009203 1.000000 -0.003675 0.055203 0.013726
Months_Inactive_12_mon 0.054361 -0.010768 0.074164 -0.003675 1.000000 0.029493 -0.042210
Contacts_Count_12_mon -0.018452 -0.040505 -0.010774 0.055203 0.029493 1.000000 -0.053913
Total_Revolving_Bal 0.014780 -0.002688 0.008623 0.013726 -0.042210 -0.053913 1.000000

We now only have to call seaborn’s heatmap() function:

sns.heatmap(corr_mat)
<AxesSubplot: >
_images/dc25f464d57ae8fdcb3b42f1a940289cac60b56484f7193b1da27bae61ed19d1.png

The high correlation between ‘Months_on_Book’ and ‘Customer_Age’ we found earlier can now clearly be seen by the light field corresponding to a high correlation.

We could further mask the upper half, since the matrix is symmetric and all information is included in one half. We will also add the rounded value of the correlation coefficient. At last, we use a custom colour map to try and get a better contrast for the lower values.

# create mask
mask = np.triu(np.ones_like(corr_mat))

# create custom colour bar
cmap = sns.diverging_palette(250, 390, l=50, s=100, as_cmap=True)

sns.heatmap(corr_mat, annot= np.round(corr_mat,2), mask=mask, cmap=cmap)
plt.title('customized correlation amtrix')
plt.show()
_images/985ac5c338fd0a783ca66fe19d74380b424a21e8392b95d8e6efb7ab951080df.png

With Jupyter notebook, the colours can even be chosen from an interactive interface. (ipywidgets must be installed)

cmap = sns.choose_diverging_palette(as_cmap=False)
sns.heatmap(corr_mat, annot= np.round(corr_mat,2), mask=mask, cmap=cmap)
plt.title('customized correlation amtrix')
plt.show()
_images/19e01b5ad278d1ed4e56cbc5f84f54a8856894c0335e17de41886860f229e7ad.png

Boxplot#

Boxplots are used to depict several descriptive statistics of a numerical variable at once. The usually contain the following elements:

  • a box, stretching from the first to the third quartile

  • a line in the box, indicating the median value

  • two whiskers, one pointing upwards and the other downwards

The whiskers may represent different quantities, like a specific quantile or even the maximum and minimum value. By default, the whisker length is definded by the inter-quartile range: \(a \cdot Q_{0.75} - Q_{0.25}\)

The respective keyword for the factor \(a\) is whis, accepting a float. In this configuration, outliers, i.e. data points above or beyond the respective quantiles, are shown as single dots.

The whiskers are only drawn to the maximum/minimum value in each direction, meaning that one whisker could be shorter than the other or be omitted completely, if no data points fall in this interval.

plt.figure(figsize=(6,6))
x_val = .17
sns.boxplot(y=df.Credit_Limit, whis=1.5, width=.3)
plt.text(x_val,32e3,'outliers', fontsize=14)
# interquartile range
iqr = np.quantile(df.Credit_Limit, .75) - np.quantile(df.Credit_Limit, 0.25)

# upper whisker -> max value inside (Q_75 + 1.5 IQR)
plt.text(x_val,np.quantile(df.Credit_Limit,.75)+1.5*iqr,'$Q_{75}+1.5 \cdot IQR$', fontsize=14)

# Quartile
plt.text(x_val,np.quantile(df.Credit_Limit, .75),'$Q_{75}$', fontsize=14)
plt.text(x_val,np.median(df.Credit_Limit),'$Q_{50}$', fontsize=14)
plt.text(x_val,np.quantile(df.Credit_Limit, .25),'$Q_{25}$', fontsize=14)

# lower whisker -> min value inside (Q_25 - 1.5 IQR)
plt.text(x_val,max(np.quantile(df.Credit_Limit,.25)-1.5*iqr, min(df.Credit_Limit)),'$min$', fontsize=14)
plt.show()
_images/3edfd9e53dad3a71892a7f8b2b1b4221b1705b1d93257f4d1913c65a87f8d584.png

Violin Plot#

A plot showing similar information (and is maybe more aesthetic) is the violin plot. It includes a kernel density estimation which is shown on both sides of the middle axis. This adds visual information beyond the quartiles and outliers, by displaying the distribution of values.

Specifying inner="quartile" introduces respective lines for the 1st, 2nd and 3rd quartile.

sns.violinplot(data=df.Credit_Limit, inner="quartile")
plt.show()
_images/d8e589b6b86a78e85b0329bffe969026089038688ca2f4aff73bee65c4ecb201.png

We can display the distribution of one variable with respect to the (customized order) levels of a second variable. Furthermore, we can also use a ‘hue’ for a third variable. For dichotomous variables, setting split=True makes use of both sides of a violinplot, instead of creating sets of two violinplots.

plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
sns.violinplot(data=df, y='Credit_Limit', x='Card_Category', hue = 'Gender', split = False,
               order=['Blue', 'Silver', 'Gold', 'Platinum'])
plt.legend(loc=2)
plt.title('Split = False')
plt.subplot(1,2,2)
sns.violinplot(data=df, y='Credit_Limit', x='Card_Category', hue = 'Gender', split = True,
               order=['Blue', 'Silver', 'Gold', 'Platinum'])
plt.legend(loc=2)
plt.title('Split = True')

plt.tight_layout()

plt.show()
_images/4ce139f3f9a84c949a58a0b9210c872f85876a753f5df50215619c2cb3d86b20.png