{ "cells": [ { "cell_type": "markdown", "id": "25890abd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# SciKit Learn\n", "\n", "The [sklearn package](https://scikit-learn.org/stable/) provides a broad collection of data analysis and machine learning tools:\n", " - it covers the whole process: data manipulation, model fitting, and evaluation of the results\n", " - sklearn is based on **numpy**: data and results are handled as numpy arrays.\n", "\n", "Classes and functions are provided in a high-level API:\n", " - it can be applied without (too much) knowledge of the algorithm itself\n", " - the API uses the same syntax for very different algorithms" ] }, { "cell_type": "markdown", "id": "99ee232b", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Basic syntax for creating a model:\n", "\n", "- instantiate the respective (algorithm's) object with hyperparameters and options\n", "\n", "- fit the model to the data using this object's built-in methods\n", "\n", "- evaluate the model or use the model for prediction\n", "\n", "\n", "Note: we import specific classes from the `sklearn` package instead of importing the package as a whole." ] }, { "cell_type": "markdown", "id": "fcc68b89", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regression\n", "\n", "### Linear Regression\n", "\n", "- in LR, we model the influence of some independent numerical variables on the value of a dependent numerical variable (the target).\n", "\n", "- widely used in economics\n", "\n", "The ordinary least squares (OLS) regression is found in module `linear_model` as the `LinearRegression` class.\n", "\n", "NOTE: by default, sklearn will fit an intercept. 
To exclude the intercept, set `fit_intercept=False` when instantiating the `LinearRegression()` object.\n", "\n", "To demonstrate the procedure, we will use a [health insurance data set](https://www.kaggle.com/mirichoi0218/insurance/version/1#) and try to explain the insurance charges.\n", "\n", "The data includes some categorical variables, for which we need to create dummy variables." ] }, { "cell_type": "markdown", "id": "4234c7d9", "metadata": {}, "source": [ "- with intercept:\n", "$y = \\alpha + \\beta \\cdot x$\n", "\n", "- without intercept:\n", "$y = \\beta \\cdot x$" ] }, { "cell_type": "code", "execution_count": 1, "id": "c7be2dd7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | age | sex | bmi | children | smoker | region | charges |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 0 | 19 | female | 27.90 | 0 | yes | southwest | 16884.9240 |\n",
"| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |\n",
"| 2 | 28 | male | 33.00 | 3 | no | southeast | 4449.4620 |\n",
"