Lesson 00 - R&D Stream

Welcome to the Python course @ Precima

Hopefully, by now, everyone has read the syllabus, cheat sheets and course outline on the main website.

Moreover, this lesson assumes you have either attended the Applied Stream intro to data types lecture or read through Lesson 00 - Applied Section.

Today we will briefly go over the Anaconda environment and then dive into Python statements and functions.

All lessons are made as iPython/Jupyter notebooks and will be available on this site to download. Grab today's lesson from this link!

Python

Python is a programming language first released in 1991 by Guido van Rossum, and it is one of the most popular languages used in computing today. Python currently ranks as the 5th most common language used on GitHub.

Python is used as a scripting language, as well as in web development and to create applications. Some of the more popular websites and applications running at least partially on Python include Google, YouTube, Facebook, Instagram, Reddit, Dropbox, Civilization IV, EVE Online and BitTorrent.

Python as a language is based on readability, flexibility, simplicity and extensibility.

The extensibility of Python has caused it to be adopted, along with R, as one of the premier data science programming languages.

Many add-on modules (or libraries) have been written for Python to support data science work, and we will cover several of them in this course.

In general, compared to R, Python is faster, more programmer focussed and less restrictive in licensing. The downside is that new statistical methods tend to appear in R before Python, although as the community grows this has become less of a problem.

2.7 vs 3

In 2008, Python version 3.0 was released. Due to the number and nature of changes, Python 2.6 and Python 3.0 were not compatible. This has led to a split in the Python community: many users were unwilling to port existing code to Python 3 and have since continued to develop Python 2.6 into 2.7, while 3.0 has been developed in parallel and currently stands at version 3.5.

In this course, we will use Python 3. This is due to the majority of users having no legacy code to worry about, the better memory management, and the availability of the data science stack in Python 3. If you are a die hard Python 2.7 user, feel free to continue using it, although you will need to fix the code yourself.

At the level we will be coding at, the changes are not too big. The largest differences we will see are print() vs print, range vs xrange, and other generators. Python code found online will often be 2.7, but should be readable by a 3 trained user.
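As a rough sketch of the differences named above, the snippet below shows the Python 3 forms, with the Python 2.7 equivalents mentioned only in comments. The variable names are illustrative.

```python
# Python 3: print is a function, so it needs parentheses.
# (Python 2.7 also accepts the statement form: print "hello")
print("hello")

# Python 3: range is lazy, like xrange in Python 2.7.
lazy = range(5)
as_list = list(lazy)  # force it into a list when one is really needed
print(lazy)           # range(0, 5)
print(as_list)        # [0, 1, 2, 3, 4]

# Python 3: zip is lazy too, so wrap it in list() to see the contents.
pairs = list(zip([1, 2], ['a', 'b']))
print(pairs)          # [(1, 'a'), (2, 'b')]
```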

Anaconda, Spyder and iPython/Jupyter

We will use Anaconda, a distribution of Python by Continuum Analytics, put together for use in data science.

Anaconda comes with most of the modules we need for data analysis. The Anaconda launcher updates and launches these apps, and includes:

(i) Spyder, which is the IDE we will use for the first few lessons. It allows coding and running scripts in an integrated environment.

(ii) The iPython/Jupyter notebook, which is a platform for developing interactive code-driven documents (like these lessons!) in a browser. (iPython notebooks are currently in the process of being rebranded into Jupyter notebooks. The launcher and documentation - and we - will refer to them interchangeably).

(iii) The iPython-QT console, which is an advanced console allowing inline graphing.

These programs all run Python on the backend - the difference is in how you interact with the environment, the possibilities that the environment provides, and how the input and output of your code is displayed and documented.

Course Logistics

We will start the first part of the course by covering the data types and structures, before moving on to loops and statements, then functions and classes. After that, we will introduce the standard data modules, Pandas and NumPy. Then, we will move on to graphing, optimization, machine learning and other fun topics.

Please note that the topics for the second half of the R&D class can be somewhat flexible - they can be adapted to suit the common interests of the class. So please let me know if you are working on something particular, or would like to look at a specific application or problem, and we can try and address these in the later classes.

Finally, note that we are tracking both group and individual progress, to make sure this course is giving value for your time. In this light, all attendees are required to complete all quizzes and assessments we will send out during the course. The time commitment for this section is, at minimum, 5 hours a week (though a few more hours is strongly recommended).

One last thing: coding along with the lesson is encouraged!

Outside Resources

A number of people have asked about textbooks or online resources for the course. Most Python tutorials online will assume a knowledge at around the level we will reach after a few lessons. I’m using Python for Data Analysis for the NumPy and Pandas section - it won’t be necessary to purchase but is a good book (if a little outdated). For the other lessons, links will be provided to the relevant docs.

Basic Data Types Review

Type	Example	Mutable	Ordered	Duplicates	Subsetting
int	2	NA	NA	NA	NA
float	2.5	NA	NA	NA	NA
complex	2.5 + 0.6J	NA	NA	NA	NA
boolean	True	NA	NA	NA	NA
string	'abcd'	No	Yes	Yes	x[1]
list	[1,2,3,'a']	Yes	Yes	Yes	x[1]
tuple	(1,2,3,'a')	No	Yes	Yes	x[1]
set	{1,2,3,4}	Yes	No	No	No
frozenset	frozenset({1,2,3,4})	No	No	No	No
dictionary	{1:1,2:2}	Yes	No	No	x['key']

Numbers - Advanced

Internally, floats are stored as C doubles. Use sys.float_info for more information about tolerances etc.

In [1]:

import sys
sys.float_info

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
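To make the point about tolerances concrete, here is a minimal sketch of what the epsilon value above means in practice, and of comparing floats with math.isclose (available from Python 3.5) rather than ==. The variable names are illustrative.

```python
import math
import sys

total = 0.1 + 0.2
print(total == 0.3)              # False: 0.1 and 0.2 are not stored exactly
print(math.isclose(total, 0.3))  # True: compares within a relative tolerance

# epsilon is the gap between 1.0 and the next representable float
eps = sys.float_info.epsilon
print(1.0 + eps > 1.0)           # True
print(1.0 + eps / 2 == 1.0)      # True: the half step rounds back to 1.0
```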

Integers have unlimited precision

In [2]:

z = 10**1000000 +1
z % 10

Python 2.7 has a separate class, long, for large integers. Python 3 makes the conversion implicitly and keeps the type int:

In [3]:

print(type(z))
del(z)

<class 'int'>

Booleans - Advanced

In addition to the standard comparison operators (<, >, ==, !=) mentioned in the intro class, there are five other tools that produce booleans: in, is, not, any and all. The functions any and all will be introduced in the functional programming section.

In [4]:

l = ['oranges','apples','bananas','bananas','kiwis']
'oranges' in l

True

In [5]:

l = 5
5 is l
#True here only because CPython caches small integers.
#is tests identity, not equality, so use == to compare values

True

In [6]:

#we are checking identity, not equality
l = ['oranges','apples','bananas']
k = ['oranges','apples','bananas']
print(l is k)
k = l
print(l is k)

False
True

In [7]:

not True

False

In Python (and most sensible languages), the boolean operators and and or short circuit: once the result is determined, the remaining conditions are not evaluated. This can give a good speed up if you have one expensive comparison and one cheap one. Therefore, put the cheap calculation first!

In [8]:

#del(l)
x = "bananas"
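#l is still defined from the earlier cell. If it were deleted (uncomment the line above),
#the first print would still work because or short circuits, but the second would give an error when it hits l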
print(x == "bananas" or l == 4)
print(x == "bananas" and l == 4)

True
False
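The effect of ordering can be seen directly by recording which checks actually run. This is only a sketch: cheap_check and expensive_check are hypothetical stand-ins, not functions from the course material.

```python
calls = []

def cheap_check(value):
    # hypothetical inexpensive test
    calls.append("cheap")
    return value == "bananas"

def expensive_check(value):
    # hypothetical stand-in for a slow computation
    calls.append("expensive")
    return len(value) > 3

x = "bananas"
result_or = cheap_check(x) or expensive_check(x)
calls_after_or = list(calls)
print(result_or, calls_after_or)    # True ['cheap']

calls.clear()
result_and = cheap_check("kiwis") and expensive_check("kiwis")
calls_after_and = list(calls)
print(result_and, calls_after_and)  # False ['cheap']
```

In both cases the expensive check never runs, because the cheap one already decides the result.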

Data Structures Review

In [9]:

customer_1 = {'loyaltyids':(1234, 3456),
              'transactions':{1:['oranges','apples','bananas','pears','kiwis'],
                              2:['bread', 'milk','bananas','bananas']},
              'postalcodes':{'M5T1V1', 'M5S3B2'},
              'transactionspends':[100.46, 34.55],
              'usedcoupons':[True, False]}

In [10]:

#subsetting here
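This cell is left open for live coding. A few subsetting patterns that could be tried on customer_1 are sketched below; the dictionary is repeated so the block runs on its own, and the result names are illustrative.

```python
customer_1 = {'loyaltyids': (1234, 3456),
              'transactions': {1: ['oranges', 'apples', 'bananas', 'pears', 'kiwis'],
                               2: ['bread', 'milk', 'bananas', 'bananas']},
              'postalcodes': {'M5T1V1', 'M5S3B2'},
              'transactionspends': [100.46, 34.55],
              'usedcoupons': [True, False]}

first_id = customer_1['loyaltyids'][0]            # tuple: index by position
second_basket = customer_1['transactions'][2]     # nested dict: index by key
last_item = customer_1['transactions'][1][-1]     # list: negative index from the end
first_two = customer_1['transactions'][1][:2]     # list: slice from:to
has_code = 'M5T1V1' in customer_1['postalcodes']  # set: membership only, no indexing
total_spend = sum(customer_1['transactionspends'])

print(first_id, last_item, first_two, has_code)
```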

Python Statements

Python uses indentation, rather than braces, as its primary way of parsing loops and statements. This was originally to force some degree of readability on code.

If, Else, Elif

We can do if and else statements

In [11]:

x = 4.99

if not True:
    print(x)
elif x < 5:
    print('on sale')
else:
    print('regular price')


#if True:
#print(x)

on sale

and nest them as deep as we would like:

In [12]:

pricehistory = [11]

if pricehistory[-1] >= 10 and len(pricehistory) == 1:
    if pricehistory[-1] >= 12:
        pricehistory.append(pricehistory[0] - 1)
        print('new price is ' + str(pricehistory[-1]))
    else:
        pricehistory.append(pricehistory[0] - 0.10 * pricehistory[-1])
        print('new price is ' + str(pricehistory[-1]))
elif len(pricehistory) > 1:
    pricehistory.append(min(pricehistory))
    print('new price is ' + str(pricehistory[-1]))
else:
    pricehistory.append(pricehistory[0]*0.9)
    print('new price is ' + str(pricehistory[-1]))

pricehistory

new price is 9.9





[11, 9.9]

For loops

For loops work similarly in syntax to if and else. We can use them on any of the major data structures we just introduced:

In [13]:

l = ['oranges','bread','bananas','milk']
fruits = ['oranges','apples','bananas','kiwis','strawberries', 'pears']

for item in l:
    if item in fruits:
        print("fruit")
    else:
        print("other")

print("\n")
s = ('oranges','bread','bananas','milk') #a tuple works the same way

for item in s:
    if item in fruits:
        print("fruit")
    else:
        print("other")

fruit
other
fruit
other


fruit
other
fruit
other

On dictionaries, we have to do it a little differently:

In [14]:

d = {'produce':'banana', 'dairy':'milk', 'canned':'chickpeas'}

for k, v in d.items():
    print(k + ": " + v)
#unordered!

#similar for nested lists:

l = [[1,2],[3,4],[5,6],[7,8]]

for one, two in l:
    print(one * two)
    #no modifying! Rebinding one changes only the loop variable, not l
    one = one * 2
l

canned: chickpeas
dairy: milk
produce: banana
2
12
30
56





[[1, 2], [3, 4], [5, 6], [7, 8]]

While loops

Similarly, we can use while loops, and run a loop until a condition is met

In [15]:

x = 5

while x > 1:
    print(x)
    x -= 1
#also +=, *=, /=

Break, Pass, Continue

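We can control how a loop proceeds: break exits the loop entirely, continue skips to the next iteration, and pass does nothing (it is only a placeholder).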

In [16]:

l = [1,2,3,4,"oranges",6]

for num in l:
    if type(num) == int:
        print(num)
    else:
        print(num + " is not a number")
        break

for num in l:
    if type(num) == int:
        print(num)
    else:
        print(num + " is not a number")
        continue
        print("how did you get here?")

for num in l:
    if type(num) == int:
        print(num)
    else:
        print(num + " is not a number")
        pass
        print("how did you get here?")

1
2
3
4
oranges is not a number
1
2
3
4
oranges is not a number
6
1
2
3
4
oranges is not a number
how did you get here?
6

In [17]:

#example
basket = [['apples', 2.50], ["oranges", 3.00], ['razors', 10.99], ['potatoes', 4.99], ['toothbrush', 2.99], ['pumpkin', 2.98],
          [9.99, "strawberries"]]
fruits = ['oranges','apples','bananas','kiwis','strawberries', 'pears']
veges = ['potatoes', 'parsnips', 'cabbage', 'pumpkins']

fruit = 0
veg = 0
other = []

while len(basket) > 0:
    x = basket.pop()
    if x[0] in fruits:
        fruit += x[1]
    elif x[0] in veges:
        veg += x[1]
    else:
        if isinstance(x[0], str) and isinstance(x[1], float):
            other.append(x)
        else:
            print(x, "not proper item")
            pass

#print(fruit, veg, other)

[9.99, 'strawberries'] not proper item

Range

As we saw earlier, we cannot modify a list in place by reassigning the loop variable. We can instead use range to generate indices and use these (NB range in Python 3 is similar to xrange in Python 2.7, not range). Range objects are not instantly enumerated, so we don't have to hold all their values in memory.

In [18]:

#similar to the subset from:to:by
range(0,10,2)
type(range(0,10))

range

In [19]:

#not instantly enumerated! this would fill the memory of nearly all computers
range(100000000000000000)

range(0, 100000000000000000)
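A short sketch of why this works: a range object only needs its start, stop and step, so indexing, membership tests and slicing can be computed without ever building the full sequence. The names big and size_bytes are illustrative.

```python
import sys

big = range(100000000000000000)

# The object itself stays tiny, whatever its length.
size_bytes = sys.getsizeof(big)
print(size_bytes < 1000)  # True

print(big[5])             # 5: computed from start and step
print(big[-1])            # 99999999999999999
print(10**16 in big)      # True: checked arithmetically for ints
print(big[10:20:3])       # range(10, 20, 3): slicing gives another range
```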

In [20]:

l = [1,2,3,4,5,6]
range(len(l))

range(0, 6)

In [21]:

for i in range(len(l)):
    l[i] = l[i]*2
l

[2, 4, 6, 8, 10, 12]

In [22]:

#also useful for generating simple lists:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We also have the built-in function enumerate, which gives us a tuple of the index and value at each position in a sequence:

In [23]:

cheese = ['gouda','cheddar','edam','brie']
print(list(enumerate(cheese)))
for i,j in enumerate(cheese):
    print('item {i} is {j}'.format(i = i, j = j))

[(0, 'gouda'), (1, 'cheddar'), (2, 'edam'), (3, 'brie')]
item 0 is gouda
item 1 is cheddar
item 2 is edam
item 3 is brie
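Two further uses of enumerate, sketched under the assumption that we want human-friendly numbering and an alternative to range(len(l)) for in-place updates. The start argument is part of the built-in; the variable names are illustrative.

```python
cheese = ['gouda', 'cheddar', 'edam', 'brie']

# start changes the first index without touching the data
numbered = list(enumerate(cheese, start=1))
print(numbered)  # [(1, 'gouda'), (2, 'cheddar'), (3, 'edam'), (4, 'brie')]

# enumerate supplies the index needed to modify a list in place
l = [1, 2, 3, 4, 5, 6]
for i, value in enumerate(l):
    l[i] = value * 2
print(l)         # [2, 4, 6, 8, 10, 12]
```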

Zip

zip is a built-in Python function that ‘zips’ sequences into tuples:

In [24]:

item = ['oranges','apples','bananas','kiwis','strawberries']
cost = [7,8,9,10,11]
units = [12,13,14,15,16]
list(zip(item,cost,units))

[('oranges', 7, 12),
 ('apples', 8, 13),
 ('bananas', 9, 14),
 ('kiwis', 10, 15),
 ('strawberries', 11, 16)]
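Two common follow-ups are building a dictionary from paired sequences and reversing the operation with the * unpacking operator. This is a small sketch using shortened versions of the lists above; price_lookup, names and prices are illustrative names.

```python
item = ['oranges', 'apples', 'bananas']
cost = [7, 8, 9]

# pairs of (key, value) feed straight into dict
price_lookup = dict(zip(item, cost))
print(price_lookup['apples'])  # 8

# zip(*pairs) "unzips" a list of tuples back into separate tuples
pairs = list(zip(item, cost))
names, prices = zip(*pairs)
print(names)                   # ('oranges', 'apples', 'bananas')
print(prices)                  # (7, 8, 9)
```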

Comprehensions

Comprehensions are syntactic sugar for for loops. We can make lists (or sets or dicts) using this method. Tuples might seem like they should also have a comprehension syntax, but round brackets are reserved for generators, which we will discuss later on in the course.

In [25]:

#for loop
l = []
for i in range(10):
    l.append(i * 5)
l

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

In [26]:

#comprehension
[i * 5 for i in range(10)]

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

In [27]:

#dict comprehension
{i * 10 : j * 2  for i, j in {1: 'a', 2: 'b'}.items()}

{10: 'aa', 20: 'bb'}

In [28]:

celsius = [0,100,25,37]
[((9/5)*temp + 32) for temp in celsius]

[32.0, 212.0, 77.0, 98.60000000000001]

In [29]:

#nested
matrix = [[1,2,3],[4,5,6],[7,8,9]]
[[el * 2 for el in row] for row in matrix]

[[2, 4, 6], [8, 10, 12], [14, 16, 18]]
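The examples above cover list and dict comprehensions. The sketch below adds the other forms mentioned in this section: a filtering condition, a set comprehension, and the round-bracket form, which produces a generator rather than a tuple. The data and names are illustrative.

```python
basket = ['oranges', 'bread', 'bananas', 'milk', 'bananas']
fruits = ['oranges', 'apples', 'bananas']

# a trailing if keeps only the matching elements
fruit_only = [x for x in basket if x in fruits]
print(fruit_only)          # ['oranges', 'bananas', 'bananas']

# curly braces without key: value pairs give a set, so duplicates collapse
unique_lengths = {len(x) for x in basket}
print(unique_lengths == {4, 5, 7})  # True

# round brackets give a lazy generator, not a tuple
gen = (len(x) for x in basket)
kind = type(gen).__name__
total_letters = sum(gen)   # consumes the generator
print(kind, total_letters) # generator 30
```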

Ternary expressions

Ternary expressions are a fancy way of doing if else and are very pythonic:

In [30]:

a = 12
a if a > 11 else b

In [31]:

a = 10
b = "no"
a if a > 11 else b

'no'

We can chain this with our comprehensions:

In [32]:

[a if a else 2 for a in [0,1,0,3]]

[2, 1, 2, 3]

or our nested comprehensions

In [33]:

matrix = [[1,2,3],[4,5,6],[7,8,9]]
[[el * 2 if el % 2 == 0 else el for el in row] for row in matrix]

[[1, 4, 3], [8, 5, 12], [7, 16, 9]]

Comprehensions are usually only modestly faster than the equivalent for loop, so it is worth switching to a for loop once you can’t follow clearly what the comprehension is doing (I would almost never use a nested comprehension)!
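For comparison, here is the nested comprehension from the previous cell written out as explicit loops. It is longer, but each step is visible; the names nested and looped are illustrative.

```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# the nested comprehension with a ternary expression
nested = [[el * 2 if el % 2 == 0 else el for el in row] for row in matrix]

# the same logic as explicit loops
looped = []
for row in matrix:
    new_row = []
    for el in row:
        if el % 2 == 0:
            new_row.append(el * 2)  # double the even values
        else:
            new_row.append(el)      # leave the odd values alone
    looped.append(new_row)

print(looped)            # [[1, 4, 3], [8, 5, 12], [7, 16, 9]]
print(looped == nested)  # True
```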

In [34]:

cheese = {'gouda': 9.99, 'cheddar': 8.99, 'edam': 6.99, 'brie': 10.99}
cheese.update({i + ' sale' : round(j * 0.9, 2) for i, j in cheese.items()})

Motivation

How much can we actually learn and predict from Data Science using Python?

Recently, a match fixing ring in professional tennis was alleged by a joint investigation between BBC News and BuzzFeed, resulting in a large amount of news coverage.

The data analysis was carried out in Python, and released online as an iPython Notebook. This story, and its continued fallout, was front page news on the Guardian last week (9-Feb-2016).

Have a read through the notebook, take note of the functions and methods used on the data, and see if you believe the analysis.

Exercises

First, run through the exercises from the applied streams lesson 1 if you have not already done this.

in can generalise to strings - 'im' in 'Precima' will give True. Can you think of a way to test if letters are in a string in any order? For now, ignore any duplicates. 'imP' should resolve to True if compared to 'Precima'.
Rewrite your answer above using all, a for loop, and a comprehension (if you didn’t use these). Which one feels the most natural to you?
We have a nested dictionary of transactions - transactions = {'user1':{1:345,2:546.5,3:987}, 'user2':{1:546, 2:744.4, 3:454.4}}. Use a nested list comprehension to get out a list of the values: [[345, 546.5, 987], [546, 744.4, 454.4]]. Can you think of a better way? Can you zip this list to [(345, 546), (546.5, 744.4), (987, 454.4)]?
Why does range not enumerate to a list all at once? How can you force it to?
How does zip behave on lists of different sizes? (l = [1,2,3], k = [1,2,3,4,5]). Make a loop that works until the longest list ends. What should the output look like and why?
We can index previous members of a sequence in a for loop, using a number of methods. Use enumerate and a for loop to test whether each member of list l = [1,1,2,3,5,8,13,21,34,55,88,143] is the sum of the previous two objects of the list. Think about how to treat the first two values. The output should be a list of booleans.
Rewrite the previous loop using range. How will you treat the first two values now? Can you think of a way to do this using only for i in l (nb off the top of my head, I can’t)?
Write a list comprehension giving True for only those values which are perfect cubes - ie x**(1/3) gives an integer. Think about how to avoid floating point errors: use google, or use a function I briefly introduced today. l = [1,8,16,27,64] will give [True, True, False, True, True]
Write the above as a for loop.
Can your solution handle negative values? If not, modify it so that it can.