Lesson 09 - Workflow and working on the Server

Welcome to lesson 10. Today we will cover the basics of working on the server - reading in data, managing packages in a virtual environment, connecting to and running queries with SQL, unit testing, and a brief introduction to version control.

Download todays notebook here - link

Firstly however, we will learn a little more about generators

Iterables and Generators

Let’s go back to when we introduced zip:

In [59]:

#taken from comments on the site
l=[1,2,3]
k=[4,5,6]
a=zip(l,k)
print(list(a))
print(list(a))
[(1, 4), (2, 5), (3, 6)]
[]

Huh, why did’t the second one work?

It turns out, many of the maps, zips etc in python 3 are implemented as iterators. These objects allow us to generate a single part at each step, without storing it all in memory (these are based on range, but work a little differently).

In [2]:

a = zip(l,k)
print(next(a))
print(next(a))
print(next(a))
print(list(a))
(1, 4)
(2, 5)
(3, 6)
[]

We can iterate over any iterable in a for loop:

In [4]:

for i in l:
    print(i)
1
2
3

But to explicitly make it an iterator, we use the iter() function:

In [7]:

j = iter(l)
print(next(j))
print(next(j))
print(next(j))
print(next(j))
1
2
3



---------------------------------------------------------------------------

StopIteration                             Traceback (most recent call last)

<ipython-input-7-31e9e4b4925c> in <module>()
      3 print(next(j))
      4 print(next(j))
----> 5 print(next(j))


StopIteration:

Turning a preexisitng object into an iterator is not very useful however, as we already have it in memory. If we want to create a function to make our output, we can use a generator function Generator functuions work very similar to standard functions, but use the yield keyword, rather than return:

In [9]:

def mygen(n):
    yield n
    yield n + 1

g = mygen(10)
print(g)
print(next(g))
print(next(g))
<generator object mygen at 0x00000167659A1FC0>
10
11

Or, a fibonacci implementation:

In [10]:

def fib(n):
    a, b = 1, 1
    for i in range(n):
        yield a
        a, b = b, a + b

for num in fib(10):
    print(num)
1
1
2
3
5
8
13
21
34
55

This is also why we can’t do tuple comprehensions - the syntax is reserved for making generator expressions:

In [11]:

l=[1,2,3]
g = (i for i in l)
print(next(g))
print(next(g))
print(next(g))
print(next(g))
1
2
3



---------------------------------------------------------------------------

StopIteration                             Traceback (most recent call last)

<ipython-input-11-0f1ad886ef79> in <module>()
      4 print(next(g))
      5 print(next(g))
----> 6 print(next(g))


StopIteration:

In general, we can think of generators as a ‘lazy list’ - a way of storing how to get the next object, without taking up all the memory.

Working with large files

In general Python holds the data we have in memory. We need to come up with ways to handle larger data out of memory in piecemeal (or buy more RAM). Most methods are specific to a certain type of data, but we will cover a general method for now. We can open a file on the disk in Python, as long as we use the correct permissions (read_csv from pandas took care of this for us). Let’s download the test example data - http://jeremy.kiwi.nz/pythoncourse/assets/tests/r&d/test1data.csv

In [13]:

g = open('c:/users/jeremy/downloads/test1data.csv', 'r')
print(g)
<_io.TextIOWrapper name='c:/users/jeremy/downloads/test1data.csv' mode='r' encoding='cp1252'>

We need to specify a ‘mode’ to open our file - I have chosen r for read, we can also use w for writing (this deletes the exsiting file), a for appending, and r+ for writing/andor reading. The file is not read in straight away - we merely have a pointer to the file. We can read the next line as though it was created using a generator:

In [14]:

#nextline
print(g.readline())
print(g.readline())
TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber

999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000

If you want to read all the lines of a file in a list you can also use list(f), f.readlines() or f.read(). Once we are done with a file, we need to close it:

In [15]:

g.close()

But, this doesn’t help use too much - we can imagine reading in enough files to fill our memory, and then carrying out some analysis, then reading in more. Luckily, we have the with statement and generators:

In [16]:

with open('c:/users/jeremy/downloads/test1data.csv', 'r') as file:
    head = [next(file).strip() for x in range(5)]

print(head)

def partread(file):
     with open(file) as myfile:
        for i in myfile:
            yield i

lines = 5
g = partread('c:/users/jeremy/downloads/test1data.csv')
[next(g).strip() for i in range(lines)]
[next(g).strip() for i in range(lines)]
['TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber', '999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000', '30,7,Friday,60538815980,1,SHOES,8931', '30,7,Friday,7410811099,1,PERSONAL CARE,4504', '26,8,Friday,2238403510,2,PAINT AND ACCESSORIES,3565']





['26,8,Friday,2006613744,2,PAINT AND ACCESSORIES,1017',
 '26,8,Friday,2006618783,2,PAINT AND ACCESSORIES,1017',
 '26,8,Friday,2006613743,1,PAINT AND ACCESSORIES,1017',
 '26,8,Friday,7004802737,1,PAINT AND ACCESSORIES,2802',
 '26,8,Friday,2238495318,1,PAINT AND ACCESSORIES,4501']

Pandas also has a built-in methods to generate an interator:

In [17]:

import pandas as pd
x = pd.read_csv('c:/users/jeremy/downloads/test1data.csv', iterator = True)
print(x)
<pandas.io.parsers.TextFileReader object at 0x00000167659E95F8>

In [18]:

x.get_chunk(5)
TripType VisitNumber Weekday Upc ScanCount DepartmentDescription FinelineNumber
0 999 5 Friday 68113152929 -1 FINANCIAL SERVICES 1000
1 30 7 Friday 60538815980 1 SHOES 8931
2 30 7 Friday 7410811099 1 PERSONAL CARE 4504
3 26 8 Friday 2238403510 2 PAINT AND ACCESSORIES 3565
4 26 8 Friday 2006613744 2 PAINT AND ACCESSORIES 1017

What about SQL? To run a SQL/netezza query from python, we have to install a couple of modules…

Virtual Environments

When working on a server, we need to be cognizant that others are also using the server. Even if we have permissions to install packages we probably shouldn’t - what if we want to upgrade pandas for ourselves, while our coworkers are using it?

Similarly, we may be working on projects where we need specifically to not have a certain module installed, or only install a certain version.

To fix these problems, we can install a ‘virtual environment’ and use this as the location to load our modules from.

Anaconda comes with a virtual env manager, the most commonly used non-conda version is virtualenv and virtualenvwrapper. They work similarly to conda.

Let’s check what environments we have available:

conda info --envs

Let’s make a new environment, called sqlproj which has installed pyodbc:

conda create -n sqlproj pyodbc

Now switch to this env:

activate sqlproj

Now that we are in our project, if we open python, we get our new version:

python

In [19]:

import pyodbc
import pandas
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-19-235fb7d14406> in <module>()
----> 1 import pyodbc
      2 import pandas


ImportError: No module named 'pyodbc'

Huh, we don’t have pandas? By default, none of our previous packages come with us. This is to stop us copying our environment to a new computer and have it no longer work.

Let’s install pandas:

conda install pandas

or, if we aren’t in the env:

conda install -n pyodbc pandas

If we want a specific version, we can specify it:

conda install -n pyodbc pandas=0.15.0

We can even specify a python version here….

Deactivate: deactivate

We can export our configuration:

conda env export > env.yml

and use that to share our setup with colleagues:

conda env create -f env.yml

Connecting to the databases

netezza is unfortunately a non-free piece of software, and so has a couple of integration issues with python.

Luckily, it is compliant with ODBC, so we can use the pyodbc module to connect to our database.

sql-alchemy is also a possibility, but is much more involved!

We have two choices - we can run python on the server, and connect to the database locally, or we can run python on our computer and connect to the database remotely.

The exact configuration will depend on your server! We currently can work with the R&D servers - individual servers may require some setup - Please email any errors you get to me and we will endeavour to make sure we can connect!

In [20]:

#import odbc
#in general:
#conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=<myserver>;PORT=<myport>;DATABASE=<mydbschema>;UID=<user>;PWD=<password>;")
#using DSN
#conn = pyodbc.connect(dsn = 'server', UID = 'uname', PWD = 'PWD') #server eg:prcmusf

I don’t have access to your servers, so here is a simple sqlite database:

In [48]:

import sqlite3
import pandas as pd

x = sqlite3.connect('c:/users/jeremy/downloads/survey.db')
#database from software carpentry, http://files.software-carpentry.org/survey.db
#same as the above connection!

Once we have a connection, we need to create a cursor. A cursor is a control structure which allows us to interact with the database

In [40]:

cxn = x.cursor()

cxn.execute('select * from person')
#can do much more complicated....
#declaritive
<sqlite3.Cursor at 0x16767ed1490>

Now we have our cursor ready to get data from:

In [33]:

print(cxn.fetchall())
print(cxn.fetchall())
[]
[]

It looks a lot like a generator! (NB it isn’t…).

We can go line by line, or use it like an iterable:

In [30]:

cxn.execute('select * from person')

print(cxn.fetchone())
('dyer', 'William', 'Dyer')

In [36]:

for row in cxn.execute('SELECT * FROM person'):
    print(row)
('dyer', 'William', 'Dyer')
('pb', 'Frank', 'Pabodie')
('lake', 'Anderson', 'Lake')
('roe', 'Valentina', 'Roerich')
('danforth', 'Frank', 'Danforth')

pandas has built in sql integration, through recently is has been reduced to use the sql-alchemy backend.

We use the database connection, rather than the cursor:

In [49]:

df = pd.read_sql('select * from Person', x)
df
ident personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth

In [46]:

df['newcol'] = [1,2,3,4,5]

In [52]:

#df.to_sql('data1', x)
pd.read_sql('select * from data1', x)
index ident personal family newcol
0 0 dyer William Dyer 1
1 1 pb Frank Pabodie 2
2 2 lake Anderson Lake 3
3 3 roe Valentina Roerich 4
4 4 danforth Frank Danforth 5

In general, we should keep the sql inside python to a minimum - netezza is a little finicky with it’s integration, so run a sql command, and then read in all the data.

Scripts vs Notebooks

The i in iPython stands for interactive. In general, we should use iPython to develop code, test and run it, and then run as scripts for production. The current move is in to using notebooks in production - we can run them on the server, and interact with them from our own computers - This however requires running a web facing server which is not super secure.

We will see if this is possible for our setup…

In the mean time, see how easy it is to convert a notebook to script.py

Testing

Test driven development is a development style where we write tests that our completed code should pass, then attempt to write code to pass them. In this manner, we can ensure that our code works as desired, and gives outputs that we desire.

To a lesser extent, all code should include tests - a lot of time spent debugging and writing code is simply manual testing - why didn’t my code work? Why did this particular data give me an error? What about special edge cases?

This informal testing is often all that code goes through. Code reviews, and Pair Programming have been shown to help reduce bugs, but a good start is unit testing.

Unit testing allows us to test each part (unit) of our code automatically, and can greatly help in refactoring large code bases, or prexisting code bases. We could theoretically completely rewrite entire scripts and keep the same tests, so that our inputs and outputs stay identical.

There are a wide range of testing suites available, here we will use the unittest module from the standard library.

unittest works best with scripts - Let’s make one with our fibonacci functions from the second lesson:

In [53]:

def fibo(x):
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

Then we create a seperate script, which imports unittest, and define our tests as a class which inherits from unittest.TestCase. All of our tests are methods of this class, and must start with test.

We then use the range of assert* methods built in to the class to say what our functions should do:

In [54]:

import unittest
from fibo import fibo

class testfibo(unittest.TestCase):
    def test_one(self):
        self.assertEqual(fibo(1), 1)

    def test_zero(self):
        self.assertEqual(fibo(0), 0)

    def test_negative(self):
        self.assertRaises(ValueError, fibo, -1)

    def test_ten(self):
        self.assertEqual(fibo(10), 55)

if __name__ == '__main__':
    unittest.main()
#saved as tests.py
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-54-097a35bbb889> in <module>()
      1 import unittest
----> 2 from fibo import fibo
      3
      4 class testfibo(unittest.TestCase):
      5     def test_one(self):


ImportError: No module named 'fibo'

python tests.py

``` jeremy@thin:~$ python tests.py F..F ====================================================================== FAIL: test_negative (main.testfibo) ———————————————————————- Traceback (most recent call last): File “tests.py”, line 10, in test_negative self.assertRaises(ValueError, fibo, -1) AssertionError: ValueError not raised by fibo

====================================================================== FAIL: test_zero (main.testfibo) ———————————————————————- Traceback (most recent call last): File “tests.py”, line 8, in test_zero self.assertEqual(fibo(0), 0) AssertionError: 1 != 0


Ran 4 tests in 0.002s

FAILED (failures=2) ```

Then we can fix our function:

In [55]:

def fibo(x):
    if x < 1:
        if x < 0:
            raise ValueError()
        else:
            return(0)
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

And rerun our tests.

We know that this is a slow function, so maybe we would like to refactor it. We can do this, leaving the tests as is:

In [56]:

def memoize(myfunction):
    cache = {}
    def function_to_cache(*args):
        if args in cache:
            return cache[args]
        else:
            cache[args] = myfunction(*args)
            return cache[args]
    return function_to_cache

def fiborecur(x):
    if x < 3:
        return 1
    return fiborecur(x - 1) + fiborecur(x - 2)

#saved as fibo.py

Whoops - our refactor didn’t define fibo - We could do this in our script, but maybe we don’t want to for now.

We have the setUp and tearDown methods - using these we can run code to set up our tests - eg connect to a database or download some data. In general, we should keep any set up inside our class - we don’t want to modify the global environment for any other tests.

In [None]:

import unittest
from fibo import fiborecur
from fibo import memoize

class testfibo(unittest.TestCase):
    def setUp(self):
        self.fibo = memoize(fiborecur)

    def test_one(self):
        self.assertEqual(self.fibo(1), 1)

    def test_zero(self):
        self.assertEqual(self.fibo(0), 0)

    def test_negative(self):
        self.assertRaises(ValueError, self.fibo, -1)

    def test_ten(self):
        self.assertEqual(self.fibo(10), 55)

if __name__ == '__main__':
    unittest.main()
#saved as tests.py

Version Control

Every piece of code should be under version control!!

It seems that there is no company wide version control system set up - so let’s use git. We could also use SVN or mecurial.

Version control allows us to track changes on our code. By making a series of commits, we can see the changes we made, who made them, and roll them back if necessary.

Install git from here.

Let’s play around a small amount as an introduction - you will want to learn a lot more before git is useful for you!

mkdir gittest cd gittest dir

Now we initialise a git repository:

git init dir

Turns out it’s a hidden folder:

dir /a

We can get status using git status: git status

Ok, so let’s get coding:

In [57]:

def fibo(x):
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

Now run git status:

git status

we can see that we have an untracked file!

git add fibo.py #or git add --all git status

Now we commit the change!

git commit -m "initial commit of fibo.py" git status

We can see the history using git log:

git log

Now let’s fix our function!

In [58]:

def fibo(x):
    '''
    fixed version!
    '''
    if x < 1:
        if x < 0:
            raise ValueError()
        else:
            return(0)
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

again git status allows us to see we have changes.

git diff allows us to see what they are: git diff

we can then add and commit the changes:

git add --all git commit -m "passes tests, docstring" git log

Now if we accidently delete our file, we can restore it:

git checkout head fibo.py

or if our changes are no good, we can revert:

git checkout head~1 fibo.py

git is not github! Github is a website built on remote hosting of git repositories. You probably can’t use it for work code, but it is useful to see how it works.

We can clone down from github:

cd .. git clone https://github.com/jeremycg/pythoncourse cd pythoncourse git log #you will probably want to look at the raw brach: git checkout raw

For more information on git, see the online course at Software Carpentry - http://swcarpentry.github.io/git-novice/ and at codeschool : https://try.github.io/