Lesson 07 - GLMs, Linear Optimization and Clustering
Welcome to lesson 7! We will continue applying what we have learned so far to the statistical analysis of our data.
Today we will cover Generalised Linear Models (in the form of logistic regression), linear optimization using SciPy, and clustering using UPGMA (via SciPy/seaborn) and k-means (via scikit-learn).
Download the notebook here.
Generalised Linear Models
Linear models can be extended to fit certain non-linear data, using GLMs.
To do this, we need a ‘link’ function, which connects the linear predictor to the response. Today we will use logistic regression, but a wide array of link functions is available.
We use logistic regression where we have a binomial outcome - 0 or 1. For example, whether a patient is sick (1) or not (0) relates to their temperature (continuous), but not in a linear fashion.
Let’s do some imports:
In [3]:
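Something like the following covers everything used in this lesson (a sketch - the exact import cell may have differed):

```python
# Core libraries used throughout this lesson
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
```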
We will read in some college admissions data. This data shows whether an applicant was admitted to a college or not, their gre and gpa, and a categorical ranking of the college by ‘prestige’.
We are interested in the relationship between admission, gpa, gre and prestige.
In [167]:
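A sketch of the read-in step, assuming the UCLA admissions data has been saved locally as binary.csv (the file name and location are assumptions):

```python
# Read the admissions data: columns admit, gre, gpa and prestige
df = pd.read_csv("binary.csv")
df.head()
```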
|   | admit | gre | gpa  | prestige |
|---|-------|-----|------|----------|
| 0 | 0     | 380 | 3.61 | 3        |
| 1 | 1     | 660 | 3.67 | 3        |
| 2 | 1     | 800 | 4.00 | 1        |
| 3 | 1     | 640 | 3.19 | 4        |
| 4 | 0     | 520 | 2.93 | 4        |
Pandas has a crosstab function, which is useful for tabulating data based on categorical variables. Here we tabulate admit and prestige, before plotting out our data.
In [3]:
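A crosstab along these lines reproduces the table below:

```python
# Tabulate admissions against prestige
pd.crosstab(df["admit"], df["prestige"])
```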
| prestige | 1  | 2  | 3  | 4  |
|----------|----|----|----|----|
| admit    |    |    |    |    |
| 0        | 28 | 97 | 93 | 55 |
| 1        | 33 | 54 | 28 | 12 |
In [168]:
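One simple way to plot out the data is a histogram per column (a sketch - the original figure may have looked different):

```python
# Quick look at the distribution of each column
df.hist(figsize=(8, 6))
plt.tight_layout()
plt.show()
```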
Now we can carry out our regression - we will model admission as a response of gre, prestige (as a categorical variable) and gpa.
To do this we can use the formula interface, similar to last week, calling the glm function with the family specified as Binomial:
In [169]:
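A sketch of the formula-interface fit that produces the summary below:

```python
# Logistic regression: admit as a function of gre, gpa and
# prestige (treated as categorical via C())
model = smf.glm(
    "admit ~ gre + gpa + C(prestige)",
    data=df,
    family=sm.families.Binomial(),
)
result = model.fit()
print(result.summary())
```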
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: admit No. Observations: 400
Model: GLM Df Residuals: 394
Model Family: Binomial Df Model: 5
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -229.26
Date: Fri, 25 Mar 2016 Deviance: 458.52
Time: 09:57:37 Pearson chi2: 397.
No. Iterations: 6
====================================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept -3.9900 1.140 -3.500 0.000 -6.224 -1.756
C(prestige)[T.2] -0.6754 0.316 -2.134 0.033 -1.296 -0.055
C(prestige)[T.3] -1.3402 0.345 -3.881 0.000 -2.017 -0.663
C(prestige)[T.4] -1.5515 0.418 -3.713 0.000 -2.370 -0.733
gre 0.0023 0.001 2.070 0.038 0.000 0.004
gpa 0.8040 0.332 2.423 0.015 0.154 1.454
====================================================================================
It is difficult to visualise the response against two continuous predictors at once, so here we will use seaborn to plot the effects of gre and gpa separately:
In [7]:
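One way to do this is seaborn's lmplot with logistic=True, plotting each predictor in turn (a sketch, not necessarily the exact plots used originally):

```python
# Logistic fit of admit against each continuous predictor separately
sns.lmplot(x="gre", y="admit", data=df, logistic=True, y_jitter=0.03)
sns.lmplot(x="gpa", y="admit", data=df, logistic=True, y_jitter=0.03)
plt.show()
```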
To use the non-formula interface, we need to create dummy variables. Dummy variables let us treat a categorical variable as 1s and 0s by expanding each category into its own column:
In [170]:
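Creating the dummy variables with pandas might look like this (the prestige_ prefix is taken from the output below):

```python
# Expand the categorical prestige column into 0/1 indicator columns
dummies = pd.get_dummies(df["prestige"], prefix="prestige")
dummies.head()
```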
|   | prestige_1 | prestige_2 | prestige_3 | prestige_4 |
|---|------------|------------|------------|------------|
| 0 | 0.0        | 0.0        | 1.0        | 0.0        |
| 1 | 0.0        | 0.0        | 1.0        | 0.0        |
| 2 | 1.0        | 0.0        | 0.0        | 0.0        |
| 3 | 0.0        | 0.0        | 0.0        | 1.0        |
| 4 | 0.0        | 0.0        | 0.0        | 1.0        |
To avoid multicollinearity, we must remove one of the dummy variables.
We also need to manually add an intercept column if we wish to include an intercept term in our model.
In [9]:
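Dropping prestige_1 and adding the intercept column could be done like this (a sketch - the variable names are assumptions):

```python
# Keep all dummies except prestige_1 to avoid multicollinearity,
# then add a constant column for the intercept
data = df[["admit", "gre", "gpa"]].join(
    dummies[["prestige_2", "prestige_3", "prestige_4"]]
)
data["intercept"] = 1.0
data.head()
```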
|   | admit | gre | gpa  | prestige_2 | prestige_3 | prestige_4 | intercept |
|---|-------|-----|------|------------|------------|------------|-----------|
| 0 | 0     | 380 | 3.61 | 0.0        | 1.0        | 0.0        | 1.0       |
| 1 | 1     | 660 | 3.67 | 0.0        | 1.0        | 0.0        | 1.0       |
| 2 | 1     | 800 | 4.00 | 0.0        | 0.0        | 0.0        | 1.0       |
| 3 | 1     | 640 | 3.19 | 0.0        | 0.0        | 1.0        | 1.0       |
| 4 | 0     | 520 | 2.93 | 0.0        | 0.0        | 1.0        | 1.0       |
In [187]:
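The non-formula fit that produces the summary below could be written as (a sketch):

```python
# Explanatory variables: everything except the response column
train_cols = data.columns.drop("admit")

# Fit the same logistic regression through the array interface
result = sm.GLM(
    data["admit"], data[train_cols], family=sm.families.Binomial()
).fit()
print(result.summary())
```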
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: admit No. Observations: 400
Model: GLM Df Residuals: 394
Model Family: Binomial Df Model: 5
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -229.26
Date: Fri, 25 Mar 2016 Deviance: 458.52
Time: 10:16:29 Pearson chi2: 397.
No. Iterations: 6
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre 0.0023 0.001 2.070 0.038 0.000 0.004
gpa 0.8040 0.332 2.423 0.015 0.154 1.454
prestige_2 -0.6754 0.316 -2.134 0.033 -1.296 -0.055
prestige_3 -1.3402 0.345 -3.881 0.000 -2.017 -0.663
prestige_4 -1.5515 0.418 -3.713 0.000 -2.370 -0.733
intercept -3.9900 1.140 -3.500 0.000 -6.224 -1.756
==============================================================================
We can see that the results are the same. The choice of formula or non-formula interface is up to you - use whichever you feel most comfortable with.
We can get our odds ratios out by taking the exponential of the fitted parameters (.params):
In [198]:
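Exponentiating the coefficients gives the odds ratios shown below:

```python
# Odds ratios: exponentiate the fitted coefficients
np.exp(result.params)
```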
gre 1.002267
gpa 2.234545
prestige_2 0.508931
prestige_3 0.261792
prestige_4 0.211938
intercept 0.018500
dtype: float64
Linear Optimization
Python has a variety of ways of carrying out linear optimization. We can either use the solver built into SciPy, or interface with a commercial or open-source solver through Python.
Gurobi and AMPL both offer Python APIs, while PuLP and Pyomo offer integration with a wide range of solvers. For details on these packages, see their documentation, as they each have their own idiosyncrasies.
The main purpose of these interfaces is to translate our equations into the arrays the solvers expect and then run the solver. We will use the solver built into SciPy, available in the scipy.optimize module:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linprog.html
In [12]:
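This cell presumably just pulls in the solver:

```python
# Linear programming solver from SciPy
from scipy.optimize import linprog
```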
Imagine we are working for French’s and we want to take advantage of the goodwill we have recently got from Canadian consumers.
We want to maximise our profit on mustard and ketchup. We have 650 units of Canadian tomatoes, 1800 units of vinegar and 600 units of mustard seed available. We need 2 units of tomatoes and 4 units of vinegar per pallet of ketchup, and 3 units of mustard seed and 7 units of vinegar per pallet of mustard.
We can sell ketchup at $400 per pallet, and mustard at $300 per pallet.
We also have a limit at our new Canadian ketchup factory - we can produce at most 400 pallets in total across both products.
We also have to produce at least 100 pallets of each product to protect our brand.
First let’s state our objective and constraints mathematically:

maximize:

- profit = 400 * ketchup + 300 * mustard

subject to:

- ketchup * 2 <= 650
- ketchup * 4 + mustard * 7 <= 1800
- mustard * 3 <= 600
- ketchup + mustard <= 400
- ketchup >= 100
- mustard >= 100
Let’s plot out our system using our knowledge of matplotlib from last lesson:
In [201]:
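One way to draw the constraint boundaries with matplotlib (a sketch - the original figure may have looked different):

```python
# Plot each constraint boundary in (ketchup, mustard) space
ketchup = np.linspace(0, 400, 200)

plt.plot(ketchup, (1800 - 4 * ketchup) / 7.0, label="vinegar")    # 4k + 7m <= 1800
plt.axvline(650 / 2.0, color="r", label="tomatoes")               # 2k <= 650
plt.axhline(600 / 3.0, color="g", label="mustard seed")           # 3m <= 600
plt.plot(ketchup, 400 - ketchup, label="factory capacity")        # k + m <= 400
plt.axvline(100, linestyle="--", label="min ketchup")             # k >= 100
plt.axhline(100, linestyle="--", label="min mustard")             # m >= 100

plt.xlim(0, 400)
plt.ylim(0, 400)
plt.xlabel("pallets of ketchup")
plt.ylabel("pallets of mustard")
plt.legend(loc="upper right")
plt.show()
```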
Now we need to specify our system to solve it.
To do this, we give a 1D array of objective coefficients, c, to be minimized (so we negate our profit coefficients in order to maximize), a 2D array, A_ub, whose rows are the constraint coefficients, and a 1D array, b_ub, of the corresponding upper bounds.
We can also optionally supply equality constraints rather than upper bounds.
In [202]:
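A sketch of one way to encode the system for linprog: SciPy minimizes, so the profit coefficients are negated, and the >= constraints are multiplied by -1 to turn them into upper bounds. The original cell evidently arranged the rows somewhat differently (its result has eight slack values), so treat the exact encoding here as an assumption:

```python
# Objective: maximize 400*ketchup + 300*mustard, i.e. minimize the negative
c = [-400, -300]

# Upper-bound constraints A_ub @ x <= b_ub, one row per constraint above
A = [
    [2, 0],    # tomatoes:          2*ketchup             <= 650
    [4, 7],    # vinegar:           4*ketchup + 7*mustard <= 1800
    [0, 3],    # mustard seed:                  3*mustard <= 600
    [1, 1],    # factory capacity:  ketchup + mustard     <= 400
    [-1, 0],   # minimum ketchup:   ketchup               >= 100
    [0, -1],   # minimum mustard:   mustard               >= 100
]
b = [650, 1800, 600, 400, -100, -100]

res = linprog(c, A_ub=A, b_ub=b, options={"disp": True})
```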
Optimization terminated successfully.
Current function value: -130000.000000
Iterations: 3
In [203]:
fun: -130000.0
message: 'Optimization terminated successfully.'
nit: 3
slack: array([ 150., 100., 300., 0., 150., 300., 0., 0.])
status: 0
success: True
x: array([ 250., 100.])
In [165]:
130000
We can generalize this to many more variables and constraints! SciPy only provides a simplex solver, so for faster or alternative solvers, try playing with Pyomo and PuLP. In general, these packages build the same arrays to pass to the solvers from friendlier inputs - the solvers and algorithms are where the real differences lie.
Clustering
Clustering allows us to put multidimensional data into bins by similarity.
Clustering is one of the major methods used in exploratory analysis, and can be carried out either knowing the number of clusters a priori or not.
Today we will work through two examples where we specify the number of clusters, before we move on to more sophisticated methods of clustering in the next lessons.
First, let’s load in some libraries and data:
In [1]:
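A sketch of the loading cell, using scikit-learn's bundled copy of the iris data:

```python
# Load the bundled iris data and print its description
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)
```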
Iris Plants Database
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris
The famous Iris database, first used by Sir R.A Fisher
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Now let’s plot the first two dimensions:
In [4]:
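Plotting the first two measurements, coloured by the true species, might look like this (a sketch):

```python
# Scatter sepal length against sepal width, coloured by species
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```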
Now we can plot these using seaborn. We will use the same dataset, but in a slightly different format:
In [5]:
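Seaborn ships its own tidy copy of iris; loading it and clustering the numeric columns could look like this (the name iris_df is an assumption):

```python
# Seaborn's copy of iris: a tidy dataframe with a species column
iris_df = sns.load_dataset("iris")
print(iris_df.head())

# Hierarchically cluster both rows and columns of the measurements
sns.clustermap(iris_df.drop("species", axis=1))
```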
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
<seaborn.matrix.ClusterGrid at 0x7fe70cbd2ba8>
However, we quickly run into the same problem as last lesson - how do we get the values out?
Under the hood, seaborn is using scipy.cluster.hierarchy.
To recapitulate the results, we can use the function linkage:
In [6]:
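Reproducing the linkage step directly, using the iris_df dataframe from the earlier sketch (method='average' is the UPGMA linkage discussed below):

```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Average (UPGMA) linkage on the four numeric measurements
Z = linkage(iris_df.drop("species", axis=1), method="average")
Z.shape
```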
(149, 4)
Here I am using straight pairwise (Euclidean) distances, but a wide range of distance metrics is available. You should also correct your data for outliers and differing scales - here all the measurements are on a similar scale, so the raw distances are fine to use as they are.
The clustering method used is ‘average’ linkage, also known as the UPGMA method, which is common in ecology (and many other fields).
This works by finding the two most similar items, clustering them and averaging their values. The branch length is then the distance between these objects. The method proceeds recursively until only one group is left. A wide range of linkage methods is available here.
Now we can plot using dendrogram:
In [7]:
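Drawing the tree (a sketch):

```python
# Draw the hierarchical clustering as a dendrogram
plt.figure(figsize=(12, 5))
dendrogram(Z)
plt.show()
```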
From here we can cluster using a variety of methods, with the function fcluster.
We can either specify a number of clusters - which works by cutting the tree at the height that gives that many groups - or specify a distance threshold directly. Let’s specify three clusters:
In [8]:
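Asking for three clusters and colouring a scatter plot by the resulting labels - the PathCollection output below comes from the scatter call (a sketch):

```python
# Cut the tree into three clusters
clusters = fcluster(Z, 3, criterion="maxclust")

# Colour the first two measurements by cluster assignment
plt.scatter(iris_df["sepal_length"], iris_df["sepal_width"], c=clusters)
```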
<matplotlib.collections.PathCollection at 0x7fe70c44ec50>
We can also use k-means, another clustering algorithm. This works by placing k random centroids, assigning each point to its closest centroid, then moving the centroids to minimize the error within each group, repeating until the assignments stabilise.
Let’s do this in scikit-learn, the machine learning library.
We will cover scikit-learn exclusively in the next lesson, but for now we can do a brief overview.
We have already imported the iris dataset from it. Datasets are generally used as (or converted into) NumPy arrays, with data and targets as separate arrays. The data are stored as np.float64 by default, and we need to convert categorical variables into dummy variables, but targets are fine as categorical labels.
There is a wide range of clustering algorithms available. For now we will run a simple kmeans.
Scikit-learn is set up to allow cross-validation, pipelining and much more.
In [9]:
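A sketch of the k-means fit that produces the label array below (three clusters, to match the three species):

```python
from sklearn.cluster import KMeans

# Fit k-means with three clusters on the raw measurements
km = KMeans(n_clusters=3)
km.fit(iris.data)
km.labels_
```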
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1,
2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)
In [10]:
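Colouring the same scatter plot by the k-means labels (a sketch):

```python
# Colour the first two measurements by k-means cluster assignment
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=km.labels_)
```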
<matplotlib.collections.PathCollection at 0x7fe70be615f8>
That’s it for today - we have had a brief tour of GLMs, linear optimisation and clustering. Next week we will focus on scikit-learn and some applied pipelines and examples.
Exercises
These exercises are all a little tough this week: do numbers 3 and 4, and do whichever of 1, 2, 5 and 6 are relevant to your work.
1. Use the awards data from UCLA to carry out a Poisson GLM in statsmodels:
The data show how many awards each student received, with columns for their math score and the program they were in, coded as a categorical variable.
2. Solve the optimization problem (taken from http://www.purplemath.com/modules/linprog3.htm)
P = -2x + 5y, subject to:
100 < x < 200
80 < y < 170
y > -x + 200
3. Change the iris data set from seaborn into the numpy format used by scikit-learn.
4. Change the iris data set from scikit-learn into a pandas dataframe, the same as is used in seaborn.
5. Make some ‘noisy moons’ using scikit-learn’s datasets module:
Fit the data using a k-means algorithm.
6. Fit your noisy moons data using hierarchical clustering