Initiation

Research tries to explain the world based on observations. The challenge is to come up with an explanation before anyone else does. You as a researcher are confronted with a volume of data. You need to model it and demonstrate that your model is correct. How likely are you to hit upon a valid model on first try? Not very, unless the problem is trivial. So, the challenge is not just to prove that the correct model works. The even greater challenge is for you to discover that your model is not the solution as quickly as possible. This is precisely where Python is superb.

Though, Python does have some issues

Python works with dynamic types of data. A list can contain any mixture of data types. Typically, a research problem deals with data of one type. The mixture of data types, obviously, allows for the data to be of the same type as well. However, there is a price. The type of the data has to be checked at run time. As a consequence, for massive computation, Python can be virtually unusable.

So, Jim Hugunin, a graduate student at MIT created Numeric in 1995. (He went on to create Jython and IronPython.) Numeric continued to grow in the scientific and engineering community. Many students and scientists built their modules around the Numeric module. The Numeric project's name on sourceforge was numpy. This work has now become a part of the much broader SciPy project. You can find out more about it at http://www.scipy.org/History_of_SciPy/.

Python does not come with the numpy module included by default. It is, however, available in various distributions. You will need to make sure that it is installed. Familiarity with core Python is assumed.

Numpy One-dimensional Arrays or Vectors

The best way to learn is to try a little code. Suppose you need to multiply corresponding elements of two arrays. How would the code and performance be different? Try the following small benchmark.

N=1000000

a=[i for i in range(N)]

b=[0.5*i for i in range(N)]

for t in range(10):

ab = [a[i]*b[i] for i in range(N)]

You have defined 2 arrays of a million elements each. Your code creates a new array which is the product of the corresponding elements. You repeat it 10 times so that the execution times can be compared reasonably.

Now, you can write the same code using numpy arrays.

import numpy as np

N=1000000

a=np.array([i for i in range(N)])

b=np.array([0.5*i for i in range(N)])

for t in range(10):

ab = a*b

The method array in the numpy module converts a Python list into an array object of uniform type elements. There are a number of operations defined on these objects, e.g. multiplication results in a new array which is the product of the corresponding elements.

On my system, the core Python program took 8.9 sec while the numpy version took 1.7 sec.

Now, try the following code:

>>> a = [1,2,3]

>>> b = [3,2,1]

>>> a + b

Before you give your answer, try the following as well:

>>> s = ['a','b','c']

>>> a + s

The only reasonable, consistent answer is concatenation of the two lists. Now, try the same thing with numpy arrays.

>>> a=np.array([1,2,3])

>>> b=np.array([3,2,1])

>>> s=np.array(['a', 'b', 'c'])

>>> a + b

array([4, 4, 4])

>>> a + s

Since the elements in the array of the same type, summation of a and b is very reasonable. Now, what should happen in the case of adding s to a? Understandably, it should raise an exception as addition of a number to a string is not meaningful.

You need to compute the sine of a list of numbers. This can also be good test to benchmark. So, try the following code:

import math

N=1000000

a=[i*2*math.pi/N for i in range(N)]

for t in range(10):

sina = [math.sin(v) for v in a]

You have defined an array of one million equally spaced values between 0 and 2π. As before you create a new array containing the sine of the corresponding value in the first array. You repeat it 10 times so that the run time is large enough.

The corresponding program using numpy will be:

import math

import numpy as np

N=1000000

a=np.array([i*2*math.pi/N for i in range(N)])

for t in range(10):

sina = np.sin(a)

As before, you have converted a Python list into a numpy array. You can pass an array to a function and its meaning is very clear. It creates a new array whose values are the function of each of the corresponding elements of the array passed as a parameter.

In this case, on my system, the core Python version takes 7.3 sec while the numpy version takes 1.8 sec.

Numpy Multi-dimensional Arrays and Matrices

A list of lists would be a two dimensional array. Try another little test.

r = 1280*[1]

m = 1024*[r]

for t in range(10):

m2 = []

for r in m:

m2.append([.5*e for e in r])

You define a list of 1024 rows. Each row having 1280 elements, with value 1. The task is to scale all elements with a constant.

The corresponding program using numpy would be:

import numpy as np

r = 1280*[1]

m = np.array(1024*[r])

print m.shape

for t in range(10):

m2 = .5*m

As before, the array method converts a Python list into an array. Printing the shape of the array confirms that it is a two dimensional array of size 1024x1280. The core Python program takes 5.9 sec while the numpy application takes 0.7 sec.

But notice how easy it is to multiply all elements in an array with a constant. You can use numpy to define the matrix with all elements being 1, as follows

import numpy as np

m = np.ones((1024,1280))

print m.shape

for t in range(10):

mp = .5*m

This version runs in 0.4 sec.

It is reasonable to assume that the addition of 2 arrays of the same shape is possible and corresponding elements will be added. Explore the following:

>>> m = np.array([[1,2,3], [ 4,5,6]])

>>> n = np.array([[1,2,3], [ 4,5,6]])

>>> m.shape

(2, 3)

>>> n.shape

(2, 3)

>>> m + n

array([[ 2, 4, 6],

[ 8, 10, 12]])

>>> n.shape = (6,1)

>>> m + n

You are creating 2 arrays of size 2x3 and adding them. The result is as expected. Numpy allows an array to be reinterpreted by changing the shape of the array. In the example above, the array n is re-interpreted as a one dimensional array. Now, if you try to add m and n, it should fail and it does.

Multiplication of two dimensional arrays could be interpreted as a matrix multiplication. However, that would not be consistent across various dimensions. Hence, in numpy, multiplication of two arrays is allowed if their shapes are identical and corresponds to creation of a new array whose elements are a product of the corresponding elements of the two arrays. You can try the following:

>>> m = np.array([[1,2],[3,4]])

>>> n = np.array([[1,2],[3,4]])

>>> m*n

array([[ 1, 4],

[ 9, 16]])

If you want to work with matrices, you need to define a two dimensional array as a matrix. Try the following example, compare and explore

>>> m = np.matrix([[1,2],[3,4]])

>>> n = np.matrix([[1,2],[3,4]])

>>> m*n

matrix([[ 7, 10],

[15, 22]])

>>> m.shape=4,1

>>> n.shape = 4,1

>>> m*n

>>> n*m

Use the built in help to know more about the possibilities available for arrays, matrices and a lot more.

It should be clear that if an operation can be applied to an array, the performance gain over core Python can be substantial. Numpy provides a lot of functionality which can be very useful when analysing hordes of data. It can provide an edge for researchers to beat the competition!

Comments