Lesson 04 - NumPy and pandas
So far we have covered the base Python environment - the built-in data types and structures - and looked at how we can use Python as a general-purpose programming language.
Now we will home in on our goal of data science, using the modules developed by the data science community. Download today’s notebook here.
This lesson and the next are based on the book Python for Data Analysis, by Wes McKinney, the primary developer of pandas. Feel free to go without the book - we will cover much of its content in class, and it is a little outdated.
The first module we will examine is NumPy, and we will then move on to pandas. NumPy provides arrays, while pandas provides DataFrames.
NumPy
NumPy stands for Numerical Python. The Python data structures we have used so far work, but they are not tailored for large-scale data analysis.
Think of how Python works under the hood when multiplying every element of a list by 2:
In [1]:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
[2, 4, 6, 8, 10]
[2, 4, 6, 8, 'aa']
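The input for this cell is missing from this copy of the notes. A minimal sketch that would produce output like the above, assuming a plain five-element list (the variable names are placeholders, not taken from the original cell):

```python
numbers = [1, 2, 3, 4, 5]

# The * operator repeats a list rather than multiplying its elements
repeated = numbers * 2
print(repeated)

# A list comprehension multiplies each element in turn
doubled = [x * 2 for x in numbers]
print(doubled)

# Each element is dispatched on its own type, so the string is repeated
mixed = [x * 2 for x in [1, 2, 3, 4, 'a']]
print(mixed)
```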
Python needs to check the type of each element to find the multiplication method associated with it. In small examples like this, the overhead is very low, but when we are dealing with millions of rows, it quickly adds up.
To work better with numeric (or other large-scale) data, NumPy introduces the array, a data structure which may only contain one type of data:
In [9]:
[ 2 4 6 8 10]
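A sketch of the kind of input that gives this output, assuming the same five values as before (the names are placeholders):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Multiplying an array by a scalar acts on every element at once
doubled_arr = arr * 2
print(doubled_arr)
```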
It is also much faster (by a process called vectorisation):
In [3]:
100 loops, best of 3: 2.1 ms per loop
10000 loops, best of 3: 57.6 µs per loop
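The timings above look like the output of IPython's %timeit, but the timed statements are not shown. Outside a notebook, a rough comparison can be sketched with the standard timeit module. The array size and repeat count here are arbitrary assumptions, and the absolute numbers will vary by machine:

```python
import timeit
import numpy as np

n = 10_000  # arbitrary size, chosen for illustration
py_list = list(range(n))
np_arr = np.arange(n)

# Time 100 runs of each approach to doubling every element
list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=100)
array_time = timeit.timeit(lambda: np_arr * 2, number=100)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {array_time:.4f} s")
```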
As well as the array data type, NumPy provides broadcasting, built-in functions that exploit the array structure to run extremely fast (by going through C), linear algebra, random number generation, and good integration with C and Fortran code.
NumPy basics
The array is a new class with many methods of its own. The exact implementation is outside the scope of the class; see the NumPy website for the source code and official documentation.
Technically, we use np.array to create an instance of the class ndarray. I’ll refer to these as arrays in this lesson.
We can access the type of data contained in an array:
In [4]:
int64
{'float': [<class 'numpy.float16'>, <class 'numpy.float32'>, <class 'numpy.float64'>, <class 'numpy.float128'>], 'others': [<class 'bool'>, <class 'object'>, <class 'str'>, <class 'str'>, <class 'numpy.void'>], 'complex': [<class 'numpy.complex64'>, <class 'numpy.complex128'>, <class 'numpy.complex256'>], 'int': [<class 'numpy.int8'>, <class 'numpy.int16'>, <class 'numpy.int32'>, <class 'numpy.int64'>], 'uint': [<class 'numpy.uint8'>, <class 'numpy.uint16'>, <class 'numpy.uint32'>, <class 'numpy.uint64'>]}
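A short sketch of inspecting and setting the element type through the dtype attribute. The arrays are made up for illustration, and the default integer type can differ between platforms and NumPy versions:

```python
import numpy as np

int_arr = np.array([1, 2, 3])
float_arr = np.array([1.0, 2.0, 3.0])

print(int_arr.dtype)    # a platform-dependent integer type
print(float_arr.dtype)

# The type can be requested at creation, or converted afterwards with astype
small = np.array([1, 2, 3], dtype=np.int8)
as_float = int_arr.astype(np.float64)
print(small.dtype, as_float.dtype)
```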
We can initialise arrays in a number of ways:
In [8]:
[[[ 6.90469329e-310 2.13249800e-316]
[ 0.00000000e+000 0.00000000e+000]
[ 0.00000000e+000 8.60952352e-072]]
[[ 4.46535817e-090 1.39938874e-076]
[ 1.55075695e+184 1.43927482e+160]
[ 3.99910963e+252 2.32204073e-056]]]
[[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]]
[1 2 3 5]
[0 1 2 3 4 5 6 7 8 9]
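The shapes of the outputs above suggest calls along the following lines. This is a reconstruction under that assumption, not the original cell:

```python
import numpy as np

# empty allocates memory without initialising it, so the values are arbitrary
uninitialised = np.empty((2, 3, 2))
print(uninitialised)

# zeros fills the requested shape with 0.0
zeros = np.zeros((3, 6))
print(zeros)

# array converts an existing sequence
from_list = np.array([1, 2, 3, 5])
print(from_list)

# arange works like range, but returns an array
sequence = np.arange(10)
print(sequence)
```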
In [205]:
array([1, 2, 3])
In [25]:
[1, 2, 3, 4, 1, 2, 3, 4]
[2 4 6 8]
In [10]:
[ 1 4 9 16]
[[ 1 4]
[ 9 25]]
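The outputs above are consistent with arithmetic being applied element by element. A sketch, assuming the values shown (the variable names are placeholders):

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

print(values * 2)   # list: repetition
print(arr * 2)      # array: element-wise multiplication
print(arr ** 2)     # array: element-wise power

grid = np.array([[1, 2], [3, 5]])
squared = grid * grid  # element-wise, not a matrix product
print(squared)
```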
Subsetting
We can subset arrays much like lists, with the addition of broadcasting for assignment:
In [217]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-217-3d54943f2f76> in <module>()
1 l = list(range(10))
----> 2 l[2:5] = 3
3 print(l)
4 #error!
5 l = np.arange(10)
TypeError: can only assign an iterable
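The traceback shows the list half of the cell failing before the array half runs. A sketch of both halves, with the error caught so that the array assignment is reached (the names differ from the original cell):

```python
import numpy as np

lst = list(range(10))
try:
    lst[2:5] = 3      # lists need an iterable on the right-hand side
except TypeError as err:
    print("list:", err)

arr = np.arange(10)
arr[2:5] = 3          # the scalar is broadcast across the slice
print(arr)
```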
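We need to be careful when assigning to slices: a slice of an array is a view onto the original data, not a copy, so changes to the slice also change the original array: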
In [218]:
[0 1 2 3 4 4 4 4 8 9]
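One way to get this output, assuming the cell took a slice and then wrote to it (a guess at the missing input, with placeholder names):

```python
import numpy as np

arr = np.arange(10)

view = arr[4:8]   # a view onto arr, not a copy
view[:] = 4       # writes through to arr
print(arr)

# Use copy() when an independent array is wanted
independent = arr[4:8].copy()
independent[:] = 0
print(arr)        # unchanged by the line above
```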
We can make 2-D and higher-dimensional (3-D, 4-D, 5-D, ...) arrays using nested lists, and subset them appropriately:
In [13]:
(3, 3)
2
(2, 3, 3)
3
array([[[ 1, 4, 9],
[16, 25, 36],
[49, 64, 81]],
[[ 1, 4, 9],
[16, 25, 36],
[49, 64, 81]]])
In [30]:
array([ 6, 15, 24])
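A sketch of building and subsetting multidimensional arrays, assuming a 3 by 3 array of the numbers 1 to 9, which is consistent with the shapes and row sums shown above:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d.shape, arr2d.ndim)

# Stacking two 2-D arrays in a list gives a 3-D array
arr3d = np.array([arr2d ** 2, arr2d ** 2])
print(arr3d.shape, arr3d.ndim)

element = arr2d[1, 2]       # row 1, column 2
block = arr2d[:2, 1:]       # first two rows, last two columns
row_sums = arr2d.sum(axis=1)
print(element, block, row_sums)
```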
We don’t need filter: we can subset with Boolean arrays, much as in R:
In [17]:
array([[1, 2, 3],
[0, 0, 0],
[0, 0, 0]])
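The output above is what a Boolean mask assignment would give on a 3 by 3 array of 1 to 9. A sketch under that assumption:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mask = arr > 3          # Boolean array with the same shape as arr
selected = arr[mask]    # 1-D array of the values where mask is True
print(selected)

arr[mask] = 0           # assign to only the selected positions
print(arr)
```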
We can subset to rearrange:
In [53]:
array([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.]])
In [244]:
array([[ 7., 7., 7., 7.],
[ 5., 5., 5., 5.]])
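These two outputs match indexing with a list of row numbers (often called fancy indexing). A sketch, assuming an 8 by 4 array in which every row holds its own row number:

```python
import numpy as np

arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i          # fill row i with the value i
print(arr)

# A list of integers picks rows in the order given
picked = arr[[7, 5]]
print(picked)
```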
In [246]:
array([[False, False, True],
[ True, True, True],
[ True, False, True]], dtype=bool)
In [248]:
array([[ 50, 90, 130],
[ 80, 255, 430],
[130, 490, 730]])
In [249]:
array([[ 0.43077853, 0. , 2.18006544, 0. , 0.99805323],
[ 0.62200891, 0. , 1.51734812, 0. , 0.4610735 ],
[ 0.35047453, 0.78169552, 0. , 1.40064949, 0. ],
[ 0.62944507, 0. , 0. , 0. , 0. ]])
Reshape and Matrix methods
Using the built in linear algebra methods, we can carry out matrix operations easily.
We will endeavour to cover more matrix algebra in a lesson including SymPy, SciPy and linear optimisation.
Reshape allows us to reshape our matrices:
In [61]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
In [65]:
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
In [70]:
array([[ 0, 1, 4, 9, 16],
[ 25, 36, 49, 64, 81],
[100, 121, 144, 169, 196]])
In [14]:
array([[ 30, 80, 130],
[ 80, 255, 430],
[130, 430, 730]])
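The four outputs above are reproduced by the following sketch, assuming the array was built with arange and reshape (the variable names are placeholders):

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)   # 15 values laid out as 3 rows of 5
print(arr)

transposed = arr.T                  # swap rows and columns
print(transposed)

squares = arr * arr                 # element-wise product
print(squares)

gram = np.dot(arr, arr.T)           # matrix product, shape (3, 3)
print(gram)
```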
In [18]:
array([[-1724114088, 9679576, 1743473240, ..., -1049853224,
683940440, -1877233192],
[ 9679576, 1016093272, 2022506968, ..., -1152439720,
-146026024, 860387672],
[ 1743473240, 2022506968, -1993426600, ..., -1255026216,
-975992488, -696958760],
...,
[-1049853224, -1152439720, -1255026216, ..., 1884158552,
1781572056, 1678985560],
[ 683940440, -146026024, -975992488, ..., 1781572056,
951605592, 121639128],
[-1877233192, 860387672, -696958760, ..., 1678985560,
121639128, -1435707304]])
NumPy has the expected array of matrix functions, implemented in standard C or Fortran code. See the website documentation for examples. Some of the functions are in the linalg submodule:
In [18]:
[[1 0 0]
[0 2 0]
[0 0 3]]
[ 1. 2. 3.]
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
In [20]:
array([ 17.57575758, 3.12121212])
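A few linalg calls, as a sketch. The diagonal matrix matches the first output above, but the system solved at the end uses made-up values and is not the one that produced the last output:

```python
import numpy as np

diag = np.diag([1, 2, 3])           # diagonal matrix from a list
print(diag)

eigenvalues, eigenvectors = np.linalg.eig(diag)
print(eigenvalues)
print(eigenvectors)

inverse = np.linalg.inv(diag)

# Solve A x = b for x (A and b are hypothetical values)
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print(x)
```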
More on linear algebra later in the course.
NumPy universal functions
NumPy has a number of ‘ufuncs’ (universal functions) built in. These are fast, as they are (mostly) implemented in C, and are a great choice for carrying out element-wise operations. For the full list, see the official docs.
In [253]:
array([ 80., 255., 430.])
There are two main classes of ufuncs: unary, which operate on one array, and binary, which operate on two:
In [19]:
array([[ 50, 90, 130],
[ 80, 255, 430],
[130, 490, 730]])
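A sketch of one unary and two binary ufuncs. The input arrays are invented for illustration and are not the ones behind the output above:

```python
import numpy as np

arr = np.array([[1.0, 4.0], [9.0, 16.0]])
other = np.array([[2.0, 2.0], [10.0, 10.0]])

roots = np.sqrt(arr)               # unary: one input array
largest = np.maximum(arr, other)   # binary: element-wise maximum of two arrays
total = np.add(arr, other)         # binary: the same as arr + other

print(roots)
print(largest)
print(total)
```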
We can write our own ufuncs using frompyfunc. The main benefit of this is to allow broadcasting instead of having to use a loop:
In [27]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-27-423e7f25fa0f> in <module>()
2 return(oct(x))
3
----> 4 print(myfun(l))
<ipython-input-27-423e7f25fa0f> in myfun(x)
1 def myfun(x):
----> 2 return(oct(x))
3
4 print(myfun(l))
TypeError: only integer arrays with one element can be converted to an index
In [28]:
array([['0o36', '0o120', '0o202'],
['0o120', '0o377', '0o656'],
['0o202', '0o656', '0o1332']], dtype=object)
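The traceback shows oct failing when handed a whole array, and the second output shows it applied element by element. A sketch of the wrapping step, assuming the input was the 3 by 3 matrix product from the reshape section (the octal strings above match those values):

```python
import numpy as np

arr = np.arange(15).reshape(3, 5)
gram = np.dot(arr, arr.T)

# oct() expects a single integer, so it cannot take the array directly.
# frompyfunc wraps it as a ufunc with 1 input and 1 output.
oct_ufunc = np.frompyfunc(oct, 1, 1)

result = oct_ufunc(gram)   # applied to every element; dtype is object
print(result)
```

Functions wrapped this way still run Python code for each element, so they give the convenience of broadcasting rather than the speed of the built-in ufuncs.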
Reading in data
The pandas read_csv function is much easier, but as a stopgap, and to keep the numbers in NumPy, we can use NumPy’s built-in CSV reader. You can see the official docs here.
In [25]:
array([[ 4.83900000e-01, 4.53600000e-01, 3.56100000e-01],
[ 1.29200000e-01, 6.87500000e-01, -9.99000000e+02],
[ 1.78100000e-01, 3.04900000e-01, 8.92800000e-01],
[ -9.99000000e+02, 5.80100000e-01, 2.03800000e-01],
[ 5.99300000e-01, 4.35700000e-01, 7.41000000e-01]])
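The file name and the call used in the original cell are not shown. As a sketch, np.genfromtxt can read comma-separated text. The text below is a stand-in built from the values in the output, read from memory rather than from a file, and treating -999 as a missing-value marker is an assumption:

```python
import io
import numpy as np

# Stand-in for the CSV file (hypothetical contents taken from the output above)
csv_text = """0.4839,0.4536,0.3561
0.1292,0.6875,-999
0.1781,0.3049,0.8928
-999,0.5801,0.2038
0.5993,0.4357,0.7410
"""

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(data)

# Assumption: -999 marks missing data, so swap it for NaN
cleaned = np.where(data == -999, np.nan, data)
print(cleaned)
```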
NumPy Summary
NumPy is a large library - we haven’t touched on its sorting, set, or random number generation capabilities. However, as pandas is based on NumPy arrays, we will continue to cover its functionality here.
Here is a quick overview of the example given in the install instructions:
In [104]:
pandas
pandas, short for Python data analysis (or panel data), was created by Wes McKinney while he was working as a financial analyst. Amongst other projects, he is currently working with Apache to get pandas to work with the Apache Arrow format. He initially began it as a port of R into Python for speed, but it quickly diverged into a slightly different model.
It is primarily made for time series and tabular data, and its main features are two new classes, DataFrame and Series, modelled on R’s data frame.
In [2]:
Series
Series are effectively NumPy arrays, but with an added index, which is retained through operations:
In [107]:
0 3
1 6
2 9
3 12
dtype: int64
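The output above is consistent with multiplying a four-element Series by 3. A sketch under that assumption, with a second, made-up Series to show a custom index surviving filtering and arithmetic:

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4])   # default index 0..3
tripled = obj * 3
print(tripled)

# Hypothetical labels, to show that the index follows the values
labelled = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
filtered = labelled[labelled > 2] * 3
print(filtered)
```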
In [256]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-256-1f81be84565f> in <module>()
3 obj.index
4 #indexes are immutable!
----> 5 obj.index[1] = 5
C:\Anaconda3\lib\site-packages\pandas\core\index.py in __setitem__(self, key, value)
1128
1129 def __setitem__(self, key, value):
-> 1130 raise TypeError("Index does not support mutable operations")
1131
1132 def __getitem__(self, key):
TypeError: Index does not support mutable operations
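The traceback comes from assigning to a single position of an index. A sketch that catches the error and then replaces the whole index instead, which is allowed (the new labels are arbitrary):

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4])

raised = False
try:
    obj.index[1] = 5            # individual index entries cannot be changed
except TypeError as err:
    raised = True
    print(err)

# Assigning a complete new index works
obj.index = ["a", "b", "c", "d"]
print(obj)
```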
You can think of a Series as a fixed-length, ordered dict, and we can easily convert a dict to a Series:
In [111]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [259]:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
In [118]:
35000
In [120]:
Ohio 35000
Texas 71000
dtype: int64
In [260]:
Ohio 35000
Texas 71000
Oregon 16000
Ontario NaN
dtype: float64
In [132]:
False
In [261]:
Ohio 70000
Ontario NaN
Oregon 32000
Texas 142000
Utah NaN
dtype: float64
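The outputs above can be reproduced along the following lines. The values come from the outputs, while the variable names and the exact sequence of calls are assumptions:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj = pd.Series(sdata)                   # dict keys become the index
print(obj)

print(obj["Ohio"])                       # look up one label
print(obj[["Ohio", "Texas"]])            # look up several labels

# Passing an index selects and orders the keys; missing keys become NaN
states = ["Ohio", "Texas", "Oregon", "Ontario"]
obj2 = pd.Series(sdata, index=states)
print(obj2)
print("Utah" in obj2)

# Arithmetic aligns on labels; a label missing from either side gives NaN
total = obj + obj2
print(total)
```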
Summary
That’s it for today.
Next lesson we will cover in more detail how to read in and clean data, as well as merging, and the split, apply, combine methods on DataFrames.
Exercises
1. Create an array of size 10, with all 0s
2. Reshape the above vector to have dims of (5,2)
3. Create a 4*4 identity matrix in NumPy (use google or the docs for a function)
4. Create a 10 by 10 matrix of random values (randn, mean = 3), and find the minimum and maximum values. Find the index of these values
5. Normalise the above matrix to have a mean of 0
6. Create a Series, which has an index of NY, SF, TO, and CH, and the values 0.2, 0.9, 3.5 and 2.4
7. Reindex this series to include VA with NA