Lesson 09 - Workflow and Working on the Server

Welcome to lesson 9.

Today we will cover the basics of working on the server - managing packages in a virtual environment, connecting to databases and running SQL queries, parallel processing, and a brief introduction to version control.

Download the notebook here

Virtual Environments

When working on a server, we need to be cognizant that others are also using it. Even if we have permission to install packages, we probably shouldn't - what if we want to upgrade pandas for ourselves while our coworkers are still using the old version?

Similarly, we may be working on projects where we specifically need a certain module to not be installed, or need only a particular version of it.

To fix these problems, we can create a 'virtual environment' and use this as the location to load our modules from.

Anaconda comes with its own virtual environment manager, conda. The most commonly used non-conda tools are virtualenv and virtualenvwrapper, which work similarly to conda.
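
For reference, a rough virtualenv equivalent of the conda workflow below would be something like this (names are illustrative; on Windows the activate script lives in Scripts\ rather than bin/):

virtualenv sqlproj
source sqlproj/bin/activate
pip install pyodbc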

Let’s check what environments we have available:

conda info --envs

Let's make a new environment, called sqlproj, which has pyodbc installed:

conda create -n sqlproj pyodbc

Now switch to this environment (on Linux or macOS, use source activate sqlproj):

activate sqlproj

Now that we are in our project environment, if we open python, we get our new setup:

python

In [1]:

import pyodbc
import pandas
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-1-235fb7d14406> in <module>()
      1 import pyodbc
----> 2 import pandas


ImportError: No module named 'pandas'

Huh, we don't have pandas? By default, none of our previously installed packages come with us - each environment starts clean. This stops us copying our environment to a new computer and finding it no longer works.

Let’s install pandas:

conda install pandas

or, if we aren’t in the env:

conda install -n sqlproj pandas

If we want a specific version, we can specify it:

conda install -n sqlproj pandas=0.15.0

We can even specify a python version here….
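
For example, to create the environment pinned to a particular Python alongside our packages (the version number here is purely illustrative):

conda create -n sqlproj python=3.5 pyodbc pandas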

When we are finished, we can leave the environment:

deactivate

We can export our configuration:

conda env export > env.yml

and use that to share our setup with colleagues:

conda env create -f env.yml

Connecting to the databases

Netezza is unfortunately non-free software, and so has a couple of integration issues with Python.

Luckily, it is compliant with ODBC, so we can use the pyodbc module to connect to our database.

SQLAlchemy is also a possibility, but is much more involved!
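
For the curious, a minimal SQLAlchemy sketch looks something like this - the connection string below points at a local sqlite file purely for illustration; a Netezza connection string would look quite different:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///survey.db') #illustrative connection string
df = pd.read_sql('select * from person', engine)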

We have two choices - we can run python on the server, and connect to the database locally, or we can run python on our computer and connect to the database remotely.

The exact configuration will depend on your server! We can currently work with the R&D servers - individual servers may require some setup. Please email me any errors you get and we will endeavour to make sure we can connect!

In [2]:

#import pyodbc
#in general:
#conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=<myserver>;PORT=<myport>;DATABASE=<mydbschema>;UID=<user>;PWD=<password>;")
#using DSN
#conn = pyodbc.connect(dsn = 'server', UID = 'uname', PWD = 'PWD') #server eg:prcmusf

I don’t have access to your servers, so here is a simple sqlite database:

In [3]:

import sqlite3
import pandas as pd

x = sqlite3.connect('c:/users/jeremy/downloads/survey.db')
#database from software carpentry, http://files.software-carpentry.org/survey.db
#same as the above connection!

Once we have a connection, we need to create a cursor. A cursor is a control structure which allows us to interact with the database.

In [4]:

cxn = x.cursor()

cxn.execute('select * from person')
#we can run much more complicated queries here....
#SQL is declarative
<sqlite3.Cursor at 0x1e3fe568e30>

In [None]:

print(cxn.fetchall()) #returns every remaining row
print(cxn.fetchall()) #returns an empty list - the cursor is now exhausted

It looks a lot like a generator! (NB it isn’t…).

We can go line by line, or use it like an iterable:

In [5]:

cxn.execute('select * from person')

print(cxn.fetchone())
('dyer', 'William', 'Dyer')

In [6]:

for row in cxn.execute('SELECT * FROM person'):
    print(row)
('dyer', 'William', 'Dyer')
('pb', 'Frank', 'Pabodie')
('lake', 'Anderson', 'Lake')
('roe', 'Valentina', 'Roerich')
('danforth', 'Frank', 'Danforth')

pandas has built-in SQL integration, though recent versions use the SQLAlchemy backend.

We use the database connection, rather than the cursor:

In [8]:

df = pd.read_sql('select * from Person', x)
df
      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth

In [9]:

df['newcol'] = [1,2,3,4,5]

In [10]:

#df.to_sql('data1', x) #writes df out to a new table called data1
pd.read_sql('select * from data1', x)
   index     ident   personal    family  newcol
0      0      dyer    William      Dyer       1
1      1        pb      Frank   Pabodie       2
2      2      lake   Anderson      Lake       3
3      3       roe  Valentina   Roerich       4
4      4  danforth      Frank  Danforth       5

In general, we should keep the SQL inside Python to a minimum - Netezza is a little finicky with its integration, so run a SQL command, then read all the data into pandas and work on it there.
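
As a minimal sketch using the sqlite connection from above: pull the table into pandas once, then do the rest locally (the groupby is just an illustrative follow-up step):

df = pd.read_sql('select * from person', x)
name_counts = df.groupby('personal').size() #all further manipulation happens in pandas, not in SQL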

Scripts vs Notebooks

The I in IPython stands for interactive. In general, we should use IPython to develop, test and run code, and then run it as scripts in production. There is currently a move towards using notebooks in production - we can run them on the server and interact with them from our own computers. This, however, requires running a web-facing server, which is not particularly secure.

We will see if this is possible for our setup…

In the meantime, see how easy it is to convert a notebook to a .py script.
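
If you have Jupyter installed, nbconvert does the conversion from the command line (the notebook name here is just a placeholder):

jupyter nbconvert --to script mynotebook.ipynb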

Parallel Processing

Without getting into the exact details of the Python source code, Python (or more precisely, the CPython interpreter) uses a Global Interpreter Lock, or GIL, which prevents more than one thread from executing Python bytecode at once. This is so that we don't accidentally modify objects in one thread which we might access in another.

This means that parallel processing in Python is a little complicated - we can't simply run multi-threaded Python code inside a single session. What normally happens is one of a couple of tricks: we either effectively open new Python processes, or we call compiled code from Python which does the parallel work.
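
As a minimal sketch of the 'new process' approach, the standard library multiprocessing module starts separate Python processes, each with its own GIL (the function and pool size are just for illustration):

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == '__main__':
    with Pool(4) as pool: #4 worker processes, each a separate interpreter
        print(pool.map(square, range(10)))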

Because we need multiple sessions, we are not sharing memory: each object has to be reloaded into the memory of each process, which causes memory pressure and, in pandas, a significant slowdown. At the moment there is a concerted effort (dask) to bring parallel computing to pandas, as well as some experimental support for bypassing the GIL.

For now, let's stick with numpy, where parallel processing is well implemented. We will then cover how we can use the IPython parallel processing implementation (ipyparallel) to generalise our parallel work.

We can divide parallel tasks into two categories - embarrassingly parallel and inherently serial problems. Very briefly, a problem is embarrassingly parallel if each step does not depend on previous steps: we can split it into multiple small chunks, pass them to multiple processes, and not worry about cross-talk.
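
A tiny illustration of the difference in pure Python (the numbers are arbitrary) - the squares can be computed in any order, while each running total depends on the previous step:

values = [1, 2, 3, 4]

squares = [v * v for v in values] #each element is independent - embarrassingly parallel

running, total = [], 0
for v in values: #each step needs the previous total - inherently serial
    total += v
    running.append(total)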

Algorithm design, again, is a little beyond this class. So if you have serial problems, think very carefully about whether they can be converted to non-serial problems; otherwise the programming will get very complicated.

Depending on your implementation of BLAS, numpy will be natively parallel:

In [17]:

import numpy as np

a = np.arange(1000000).reshape(1000,1000)
b = np.arange(1000000).reshape(1000,1000)
c = np.dot(a,b)
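
To see which BLAS and LAPACK libraries your numpy build is linked against, we can print the build configuration (this only shows what numpy was compiled with, not how many threads it will actually use):

import numpy as np
np.__config__.show() #prints the BLAS/LAPACK libraries this numpy build links against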

How do we change our BLAS? You will need to build numpy from source on your machine/server, or download a version optimised for your computer (my Linux machine works with the default Anaconda, my Windows one does not).

Numpy also releases the GIL when running matrix operations, so we can take advantage of this using threading.
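
A minimal sketch of that idea (matrix sizes and names are illustrative) - because the GIL is released inside the BLAS call, two plain threads can genuinely overlap their matrix multiplies:

import threading
import numpy as np

a = np.random.random((1000, 1000))
results = [None, None]

def worker(i):
    results[i] = a.dot(a) #the GIL is released inside this call

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()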

More advantageous, however, is using the ipyparallel module. First we need to install it:

conda install ipyparallel

and start an instance:

ipcluster start -n 4

Now we can connect through our notebook (if we are running remotely, we need to play with the Client call):

In [3]:

import ipyparallel as ipp

c = ipp.Client()

In [4]:

c.ids
[0, 1, 2, 3]

In [6]:

ourdview = c[:]

Now we can call methods on ourdview to carry out parallel processing:

In [7]:

x = map(lambda x: x*x, range(100))

y = ourdview.map(lambda x: x*x, range(100))

y
<AsyncMapResult: <lambda>>

In [13]:

%time x = list(ourdview.map(lambda x: x*x, range(1000000)));
Wall time: 1.14 s

In [16]:

%time y = list(map(lambda x: x*x, range(1000000)));
Wall time: 326 ms

Due to overhead, the parallel version is slower than the non-parallel one! As we increase the complexity of each task, however, we increase our savings:

In [18]:

import numpy as np
A = np.random.random((64,48))
C_local = A*A

We can use a decorator to parallelise our functions automatically (or at least try to):

In [19]:

@ourdview.parallel(block=True)
def pmul(A,B):
    return A*B

C_remote = pmul(A,A)

In [21]:

(C_remote == C_local).all()
True

The main method for carrying out functions is the .apply method:

viewobj.apply(function, *args, **kwargs)

In [29]:

def myfunc(a,b):
    return a + b

y = ourdview.apply(myfunc, a = [1,2,3,4], b = [1,2,3,4])
y
<AsyncResult: myfunc>

In [30]:

list(y)
[[1, 2, 3, 4, 1, 2, 3, 4],
 [1, 2, 3, 4, 1, 2, 3, 4],
 [1, 2, 3, 4, 1, 2, 3, 4],
 [1, 2, 3, 4, 1, 2, 3, 4]]

We can push our variables to the remote sessions:

In [41]:

ourdview.push(dict( a = [1,2,3,4], b = [1,2,3,4]))
<AsyncResult: _push>

In [35]:

list(ourdview.pull('a'))
[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]

In [42]:

list(ourdview.apply(lambda x: a + b + x, x = [1])) #a and b are the variables we pushed to each engine
[[1, 2, 3, 4, 1, 2, 3, 4, 1],
 [1, 2, 3, 4, 1, 2, 3, 4, 1],
 [1, 2, 3, 4, 1, 2, 3, 4, 1],
 [1, 2, 3, 4, 1, 2, 3, 4, 1]]

That's a very basic introduction to parallel processing in Python - read the documentation and use Google to figure out how to speed up your own code.

If you are using scikit-learn, we can use the joblib library for parallel computation, or use IPython as noted above.
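
A minimal joblib sketch (joblib ships with scikit-learn; the function is just a stand-in for real work, and n_jobs=-1 means use all available cores):

from joblib import Parallel, delayed

def square(n):
    return n * n

results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(10))
print(results)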

Version Control

Every piece of code should be under version control!!

It seems that there is no company-wide version control system set up - so let's use git. We could also use SVN or Mercurial.

Version control allows us to track changes on our code. By making a series of commits, we can see the changes we made, who made them, and roll them back if necessary.

Install git from here.

Let’s play around a small amount as an introduction - you will want to learn a lot more before git is useful for you!

mkdir gittest
cd gittest
dir

Now we initialise a git repository:

git init
dir

Turns out it’s a hidden folder:

dir /a

We can get status using git status:

git status

Ok, so let’s get coding:

In [11]:

def fibo(x):
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

Now run git status:

git status

We can see that we have an untracked file!

git add fibo.py #or git add --all
git status

Now we commit the change!

git commit -m "initial commit of fibo.py"
git status

We can see the history using git log:

git log

Now let’s fix our function!

In [13]:

def fibo(x):
    '''
    fixed version!
    '''
    if x < 1:
        if x < 0:
            raise ValueError()
        else:
            return(0)
    if x < 3:
        return 1
    a,b,counter = 1,2,3
    while counter < x:
        a,b,counter = b,a+b,counter+1
    return(b)
#saved as fibo.py

Again, git status allows us to see that we have changes.

git diff allows us to see what they are:

git diff

we can then add and commit the changes:

git add --all
git commit -m "passes tests, docstring"
git log

Now if we accidentally delete our file, we can restore it:

git checkout HEAD fibo.py

or if our changes are no good, we can revert:

git checkout HEAD~1 fibo.py

git is not GitHub! GitHub is a website built around remote hosting of git repositories. You probably can't use it for work code, but it is useful to see how it works.

We can clone down from github:

cd ..
git clone https://github.com/jeremycg/pythoncourse
cd pythoncourse
git log
#you probably want the raw branch:
git checkout raw

For more information on git, see the online course at Software Carpentry (http://swcarpentry.github.io/git-novice/) and the interactive tutorial at Code School (https://try.github.io/).