Lesson 08 - Optimization
Welcome to lesson 8. In this lesson we will deepen our understanding of Python, and learn how to optimise our code through profiling, benchmarking and timing.
We will also learn the general methods of error handling and raising, and how to automatically test code to make sure it is carrying out what we want it to do.
Firstly however, we will learn a little more about generators.
Download today's notebook here
Iterables and Generators
Let’s go back to when we introduced zip:
In [1]:
[(1, 4), (2, 5), (3, 6)]
[]
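The input cell has been lost from this page; a sketch consistent with the output above would be something like this. zip returns a one-shot iterator, so consuming it a second time gives an empty list:

```python
z = zip([1, 2, 3], [4, 5, 6])
print(list(z))  # [(1, 4), (2, 5), (3, 6)]
print(list(z))  # [] -- the iterator is already exhausted
```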
Huh, why didn't the second one work?
It turns out, many of the built-in functions such as map and zip in Python 3 return iterators. These objects generate a single item at each step, without storing the whole sequence in memory (they are similar in spirit to range, but work a little differently).
In [119]:
(1, 4)
(2, 5)
(3, 6)
[]
We can iterate over any iterable in a for loop:
In [120]:
1
2
3
But to explicitly make it an iterator, we use the iter() function:
In [121]:
1
2
3
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-121-31e9e4b4925c> in <module>()
3 print(next(j))
4 print(next(j))
----> 5 print(next(j))
StopIteration:
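The missing cell likely resembled this sketch: iter() gives us an iterator over the list, and once it is exhausted a further next() raises StopIteration (caught here so the example runs to completion):

```python
j = iter([1, 2, 3])
print(next(j))  # 1
print(next(j))  # 2
print(next(j))  # 3
# a fourth call raises StopIteration, as in the traceback above
try:
    print(next(j))
except StopIteration:
    print("iterator exhausted")
```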
Turning a pre-existing object into an iterator is not very useful, however, as we already hold the whole object in memory.
If we want a function that produces its output lazily, we can use a generator function.
Generator functions work very similarly to standard functions, but use the yield keyword rather than return:
In [129]:
<generator object mygen at 0x7f18a3ed0830>
10
11
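The original cell is missing; a generator function consistent with the output above (the name mygen matches the generator object printed) might be:

```python
def mygen():
    # execution pauses at each yield, resuming on the next next() call
    yield 10
    yield 11

g = mygen()
print(g)        # <generator object mygen at 0x...>
print(next(g))  # 10
print(next(g))  # 11
```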
Or, a fibonacci implementation:
In [2]:
1
1
2
3
5
8
13
21
34
55
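The code cell itself is missing, but the debugging section later in this lesson shows a fib generator with variables acounter and bcounter; a version consistent with both that and the output above:

```python
def fib(num):
    acounter, bcounter = 1, 1
    for i in range(num):
        yield acounter  # hand back one value, then pause here
        acounter, bcounter = bcounter, acounter + bcounter

for n in fib(10):
    print(n)  # 1 1 2 3 5 8 13 21 34 55
```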
This is also why we can’t do tuple comprehensions - the syntax is reserved for making generator expressions:
In [131]:
1
2
3
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-131-0f1ad886ef79> in <module>()
4 print(next(g))
5 print(next(g))
----> 6 print(next(g))
StopIteration:
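A sketch matching the output above: parentheses around a comprehension build a generator expression, not a tuple, and a fourth next() would raise StopIteration as shown:

```python
g = (i for i in (1, 2, 3))  # a generator expression, not a tuple comprehension
print(next(g))  # 1
print(next(g))  # 2
print(next(g))  # 3
```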
In general, we can think of generators as a ‘lazy list’ - a way of storing how to get the next object, without taking up all the memory.
Working with large files
In general, Python holds our data in memory. We need ways to handle larger-than-memory data piecemeal (or buy more RAM). Most methods are specific to a certain type of data, but we will cover a general method for now.
We can open a file on the disk in Python, as long as we use the correct permissions (read_csv from pandas took care of this for us). Let’s download the test example data - http://jeremy.kiwi.nz/pythoncourse/assets/tests/r&d/test1data.csv
In [132]:
<_io.TextIOWrapper name='/home/jeremy/Downloads/test1data.csv' mode='r' encoding='UTF-8'>
We need to specify a 'mode' to open our file - I have chosen r for read; we can also use w for writing (this truncates the existing file), a for appending, and r+ for reading and writing.
The file is not read in straight away - we merely have a handle on it. We can read it line by line, as though the file object were a generator:
In [133]:
TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
999,5,Friday,68113152929,-1,FINANCIAL SERVICES,1000
If you want to read all the lines of a file into a list, you can use list(f) or f.readlines(); f.read() gives the whole file as a single string.
Once we are done with a file, we need to close it:
In [134]:
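The cells above are missing from this page; here is a self-contained sketch (it writes a tiny stand-in file first, since the downloaded path differs per machine - the path and contents here are assumptions):

```python
import os
import tempfile

# create a tiny stand-in for the downloaded csv
path = os.path.join(tempfile.gettempdir(), 'test1data.csv')
with open(path, 'w') as out:
    out.write('TripType,VisitNumber,Weekday\n999,5,Friday\n')

f = open(path, 'r')            # 'r' = read-only mode
header = f.readline().strip()  # each call advances the file handle one line
print(header)                  # TripType,VisitNumber,Weekday
row = f.readline().strip()
print(row)                     # 999,5,Friday
f.close()                      # release the file handle when done
```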
But this alone doesn't help us much - what we want is to read in enough lines to (nearly) fill our memory, carry out some analysis, then read in more.
Luckily, we have the with statement and generators:
In [4]:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-4-f6f9a1ae0ace> in <module>()
----> 1 with open('/home/jeremy/Downloads/test1data.csv', 'r') as file:
2 head = [next(file).strip() for _ in range(5)]
3
4 print(head)
FileNotFoundError: [Errno 2] No such file or directory: '/home/jeremy/Downloads/test1data.csv'
In [136]:
['26,8,Friday,2006613744,2,PAINT AND ACCESSORIES,1017',
'26,8,Friday,2006618783,2,PAINT AND ACCESSORIES,1017',
'26,8,Friday,2006613743,1,PAINT AND ACCESSORIES,1017',
'26,8,Friday,7004802737,1,PAINT AND ACCESSORIES,2802',
'26,8,Friday,2238495318,1,PAINT AND ACCESSORIES,4501']
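The with statement closes the file for us automatically, even if an error occurs inside the block. A runnable sketch of the head-reading pattern from the traceback above (again writing a stand-in file, since the real csv path varies):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'sample_lines.txt')
with open(path, 'w') as out:
    out.write('\n'.join(str(i) for i in range(10)))

# the file object is an iterator over lines, so next() pulls one line at a time
with open(path, 'r') as file:
    head = [next(file).strip() for _ in range(5)]

print(head)  # ['0', '1', '2', '3', '4']
```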
Pandas also has built-in methods to generate an iterator:
In [137]:
<pandas.io.parsers.TextFileReader object at 0x7f1883699ac8>
In [138]:
| | TripType | VisitNumber | Weekday | Upc | ScanCount | DepartmentDescription | FinelineNumber |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 999 | 5 | Friday | 68113152929 | -1 | FINANCIAL SERVICES | 1000 |
| 1 | 30 | 7 | Friday | 60538815980 | 1 | SHOES | 8931 |
| 2 | 30 | 7 | Friday | 7410811099 | 1 | PERSONAL CARE | 4504 |
| 3 | 26 | 8 | Friday | 2238403510 | 2 | PAINT AND ACCESSORIES | 3565 |
| 4 | 26 | 8 | Friday | 2006613744 | 2 | PAINT AND ACCESSORIES | 1017 |
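The missing cells presumably passed chunksize to read_csv; a self-contained sketch (using an in-memory csv as a stand-in for the downloaded file):

```python
import io

import pandas as pd

csv = io.StringIO('TripType,VisitNumber\n999,5\n30,7\n30,7\n26,8\n')
reader = pd.read_csv(csv, chunksize=2)  # a TextFileReader: an iterator of DataFrames
chunk = next(reader)                    # just the first two rows
print(chunk)
```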
There are more sensible workflows using large data technologies - for now we will move on.
Error and Exception handling
It’s easier to ask forgiveness than it is to get permission. - Grace Hopper
We can often program more easily if we simply try to do something, then handle the failure. Unhandled errors will break our code, however, so we can build in fail-safes to handle them:
In [139]:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-139-9b568190695a> in <module>()
----> 1 f = open('testfile','r')
FileNotFoundError: [Errno 2] No such file or directory: 'testfile'
We can try to do this, using the try statement, and an exception:
In [140]:
file not found
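The input cell is missing; a sketch consistent with the output above, catching the FileNotFoundError from the previous cell:

```python
try:
    f = open('testfile', 'r')  # this file does not exist
except FileNotFoundError:
    print('file not found')   # handle the failure instead of crashing
```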
Now we no longer raise an error serious enough to stop our script (whether this is good or bad is up to you). We can also specify the type of error to catch (more specific is better):
In [141]:
file not found
In [142]:
type error!
We can add a final else clause, which runs only if we did not raise an error:
In [143]:
operation successful
We can use finally to run a piece of code whether or not we were successful, which is useful for cleanup:
In [5]:
type error!
cleaned up
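A sketch covering both the else and finally clauses (the exact expressions in the lost cells are assumptions, chosen to trigger the TypeError shown above):

```python
try:
    result = 1 + '1'  # TypeError: cannot add int and str
except TypeError:
    print('type error!')
else:
    print('operation successful')  # runs only when no exception was raised
finally:
    print('cleaned up')            # runs either way -- good for cleanup
```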
If we want to manually raise an exception, we can use the raise statement (or use an assertion):
In [9]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-ed11983429c5> in <module>()
----> 1 raise TypeError("I'm sorry, Dave. I'm afraid I can't do that.")
TypeError: I'm sorry, Dave. I'm afraid I can't do that.
Debugging
We have an interactive debugger in IPython, invoked after an error with the %debug magic. Using this we can trace back our errors, inspect current values, and step through code:
In [10]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-091615ac5c44> in <module>()
----> 1 thisisnotadefinedvariable
NameError: name 'thisisnotadefinedvariable' is not defined
In [11]:
> <ipython-input-10-091615ac5c44>(1)<module>()
----> 1 thisisnotadefinedvariable
ipdb> x
*** NameError: name 'x' is not defined
ipdb> exit
In the debugger we have a lot of commands - use h or help to see them. Useful ones are c for continue, q for quit, n for next line, s for step into a function, u/d for up/down in the call stack, and l for listing the code.
The help page is available online.
The easiest way to debug problematic code is to manually enter a breakpoint, using pdb.set_trace(). From there you will enter the debugger, with access to all the variables available in the current environment.
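The cell behind the traceback below looked roughly like this sketch (the set_trace call is commented out here so the example runs without stopping; uncomment it to reproduce the interactive session shown):

```python
import pdb

def fib(num):
    acounter, bcounter = 1, 1
    for i in range(num):
        # pdb.set_trace()  # uncomment to stop here and inspect acounter, bcounter, i
        yield acounter
        acounter, bcounter = bcounter, acounter + bcounter

for each in fib(10):
    print(each)
```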
In [13]:
In [15]:
1
> <ipython-input-15-113b0f967fab>(3)fib()
-> for i in range(num):
(Pdb) exit
---------------------------------------------------------------------------
BdbQuit Traceback (most recent call last)
<ipython-input-15-113b0f967fab> in <module>()
6 pdb.set_trace()
7
----> 8 for each in fib(10):
9 print(each)
<ipython-input-15-113b0f967fab> in fib(num)
1 def fib(num):
2 acounter, bcounter = 1, 1
----> 3 for i in range(num):
4 yield acounter
5 acounter, bcounter = bcounter, acounter + bcounter
<ipython-input-15-113b0f967fab> in fib(num)
1 def fib(num):
2 acounter, bcounter = 1, 1
----> 3 for i in range(num):
4 yield acounter
5 acounter, bcounter = bcounter, acounter + bcounter
C:\Anaconda3\lib\bdb.py in trace_dispatch(self, frame, event, arg)
46 return # None
47 if event == 'line':
---> 48 return self.dispatch_line(frame)
49 if event == 'call':
50 return self.dispatch_call(frame, arg)
C:\Anaconda3\lib\bdb.py in dispatch_line(self, frame)
65 if self.stop_here(frame) or self.break_here(frame):
66 self.user_line(frame)
---> 67 if self.quitting: raise BdbQuit
68 return self.trace_dispatch
69
BdbQuit:
Profiling and Timing
We have already briefly covered %timeit: a magic function which runs code and reports the execution time. We also have the very similar magic function %time:
In [149]:
CPU times: user 510 µs, sys: 0 ns, total: 510 µs
Wall time: 581 µs
CPU times: user 27.6 ms, sys: 20.1 ms, total: 47.7 ms
Wall time: 51.2 ms
array([ 0, 1, 2, ..., 99997, 99998, 99999])
%time runs our command once, and reports the CPU time and wall time.
%timeit runs our command multiple times (it aims for about five seconds in total), and reports the best times. One caveat is that timeit turns off the garbage collector - so if we are deleting a lot of objects we might be misled. We can use the timeit module directly for more fine-grained control if needed.
In [150]:
1000 loops, best of 3: 456 µs per loop
10 loops, best of 3: 44.6 ms per loop
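For that finer control, the timeit module lets us choose the statement and the number of runs ourselves (the statement here is a stand-in, not the one timed above):

```python
import timeit

# run the statement 100 times and report total elapsed seconds
elapsed = timeit.timeit('sum(range(100000))', number=100)
print(elapsed / 100)  # average seconds per run
```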
We can also use profiling tools!
First, we have the %prun magic command. This allows us to profile code that makes multiple function calls:
In [151]:
Now this is not super useful on its own - it shines when we have a large script with many function calls; otherwise we could probably just use %time or %%timeit.
If we want to go line by line, we need the line_profiler module (conda install line_profiler). We then load it as an IPython extension, rather than importing it as a module:
In [152]:
The line_profiler extension is already loaded. To reload it, use:
%reload_ext line_profiler
In [153]:
In the same manner, we can do memory profiling using the memory_profiler module:
In [2]:
In [13]:
ERROR: Could not find file <ipython-input-13-4a4111679972>
NOTE: %mprun can only be used on functions defined in physical files, and not in the IPython environment.
In [19]:
In [18]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-18-d2a247059265> in <module>()
1 from mymem import testmem
----> 2 get_ipython().magic('mprun -f testmem testmem()')
/home/jeremy/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in magic(self, arg_s)
2334 magic_name, _, magic_arg_s = arg_s.partition(' ')
2335 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2336 return self.run_line_magic(magic_name, magic_arg_s)
2337
2338 #-------------------------------------------------------------------------
/home/jeremy/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line)
2255 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2256 with self.builtin_trap:
-> 2257 result = fn(*args,**kwargs)
2258 return result
2259
/home/jeremy/anaconda3/lib/python3.5/site-packages/memory_profiler.py in mprun(self, parameter_s, cell)
/home/jeremy/anaconda3/lib/python3.5/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/jeremy/anaconda3/lib/python3.5/site-packages/memory_profiler.py in mprun(self, parameter_s, cell)
724
725 try:
--> 726 profile.runctx(arg_str, global_ns, local_ns)
727 message = ''
728 except SystemExit:
/home/jeremy/anaconda3/lib/python3.5/site-packages/memory_profiler.py in runctx(self, cmd, globals, locals)
513 self.enable_by_count()
514 try:
--> 515 exec(cmd, globals, locals)
516 finally:
517 self.disable_by_count()
<string> in <module>()
/home/jeremy/Downloads/mymem.py in testmem()
3
4 def testmem():
----> 5 a = np.arange(1000000)
6 b = list(range(1000000))
7 del(a)
NameError: name 'numpy' is not defined
We can also use the %memit magic (here I'm showing I was not lying about the range function being efficient):
In [156]:
peak memory: 182.58 MiB, increment: 0.15 MiB
peak memory: 217.49 MiB, increment: 34.91 MiB
Magic Commands
Just as an aside, we can import magic commands from multiple packages, and many are built in. Here are two of my favourites:
In [157]:
1
2
3
4
5
6
7
8
9
10
In [158]:
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}}
\end{align}
Testing
Test-driven development is a development style where we write tests that our completed code should pass, then attempt to write code to pass them. In this manner, we can ensure that our code works as desired, and gives the outputs that we expect.
To a lesser extent, all code should include tests - a lot of time spent debugging and writing code is simply manual testing - why didn’t my code work? Why did this particular data give me an error? What about special edge cases?
This informal testing is often all that code goes through. Code reviews, and Pair Programming have been shown to help reduce bugs, but a good start is unit testing.
Unit testing allows us to test each part (unit) of our code automatically, and can greatly help in refactoring large or pre-existing code bases. We could theoretically completely rewrite entire scripts while keeping the same tests, so that our inputs and outputs stay identical.
There is a wide range of testing suites available; here we will use the unittest module from the standard library.
In [159]:
unittest works best with scripts - let's make one with our fibonacci function from the second lesson:
In [None]:
Then we create a separate script, which imports unittest, and define our tests as a class which inherits from unittest.TestCase. All of our tests are methods of this class, and their names must start with test.
We then use the range of assert* methods built in to the class to say what our functions should do:
In [None]:
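A sketch of what tests.py might contain. In the lesson, fibo is imported from the fibonacci script; here it is defined inline as a stand-in (this version already includes the zero and negative-number fixes, so these tests pass - the lesson's original 1-indexed fibo fails two of them, as shown in the output below):

```python
import unittest

def fibo(n):
    # stand-in fibonacci: fibo(0) == 0, and negative input raises ValueError
    if n < 0:
        raise ValueError('fibonacci is undefined for negative numbers')
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

class testfibo(unittest.TestCase):
    def test_zero(self):
        self.assertEqual(fibo(0), 0)

    def test_ten(self):
        self.assertEqual(fibo(10), 55)

    def test_sequence(self):
        self.assertEqual([fibo(i) for i in range(1, 8)], [1, 1, 2, 3, 5, 8, 13])

    def test_negative(self):
        self.assertRaises(ValueError, fibo, -1)

# in a real tests.py we would call unittest.main(); here we run the suite directly
suite = unittest.defaultTestLoader.loadTestsFromTestCase(testfibo)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```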
python tests.py
```
jeremy@thin:~$ python tests.py
F..F
======================================================================
FAIL: test_negative (__main__.testfibo)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 10, in test_negative
    self.assertRaises(ValueError, fibo, -1)
AssertionError: ValueError not raised by fibo

======================================================================
FAIL: test_zero (__main__.testfibo)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 8, in test_zero
    self.assertEqual(fibo(0), 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 4 tests in 0.002s

FAILED (failures=2)
```
Then we can fix our function:
In [None]:
And rerun our tests.
We know that this is a slow function, so maybe we would like to refactor it. We can do this, leaving the tests as is:
In [None]:
Whoops - our refactor didn't define fibo. We could define it in our script, but maybe we don't want to for now.
We have the setUp and tearDown methods - using these we can run code to prepare our tests, e.g. connect to a database or download some data. In general, we should keep any set-up inside our class - we don't want to modify the global environment for any other tests.
In [None]:
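A minimal sketch of setUp and tearDown (the class and the data it prepares are hypothetical, standing in for e.g. a database connection):

```python
import unittest

class testwithsetup(unittest.TestCase):
    def setUp(self):
        # runs before every test: prepare whatever the test needs
        self.data = list(range(5))

    def tearDown(self):
        # runs after every test, pass or fail: clean up inside the class
        self.data = None

    def test_sum(self):
        self.assertEqual(sum(self.data), 10)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(testwithsetup)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```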
We forgot our initial bug fixes - lucky we had tests!
Summary
Today we covered generators and iterators - ways of compactly handling data. We also covered reading in data in chunks for memory efficiency, error handling, debugging, profiling and testing.
In the last lesson we will cover parallel processing, connecting to your Netezza databases, virtual environments and working on the server.
Exercises
1. Write a generator function which will produce factorials: mygen(10) will generate 1, 2, 6, 24, 120… up to 10!
2. Write a unit test for this generator using the unittest module
3. Write a small script that takes a user input (something like x = input() in the script) and returns that number squared. You should use error handling to handle cases where the input might not be a number.
4. Write unit tests for this script - use different types of input as test cases.
5. Use the pdb module to set_trace inside your factorial generator. Walk through the function and get a feel for the debugger.
6. Install both the memory_profiler and line_profiler plugins. Profile your factorial code for both performance and memory usage. Is there anything you can optimise?
7. Advanced, optional. Read in the test data set csv using pandas. Get any numeric column into numpy using .values. Profile and benchmark finding the sum, mean and cumulative sum of these numbers in numpy and pandas. Which one (if any) performs better? Can you guess why?