A Fresh Approach using Python: Introduction

Fri 01 November 2013

Welcome to the first in a series of tutorials on using Python for introductory statistical analysis. As I put more of these tutorials online, you should be able to access them easily by clicking or searching for the relevant category: “Statistical Modeling for Python”.

This series of tutorials is based on the ‘Computational Technique’ sections of each chapter from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’. The goal of this series of tutorials is to show how all of the R analysis and commands used in the book can be done just as easily using the Python programming language. This has the dual goal of introducing ‘Scientific Python’ to students learning statistics, as well as showcasing the recent advances in statistical computing that have been introduced to Python in recent years. Each tutorial in the series will cover the Computational Technique section of a different chapter from the book, starting with Section 1.4.3 from the introduction (which technically isn’t a Computational Technique section, but is a useful introduction none-the-less), which is available online here. Note that many of these tutorials will require you to have read the corresponding chapter(s) from the book in order to be useful.

All the tutorials assume that Python is installed and running on your computer. To use the notebooks associated with these tutorials, you’ll also need to have IPython (with notebook) installed. There are plenty of resources online with information on IPython, Notebooks, and installation instructions) for both. You’ll also need to have the Scientific Python libraries installed for additional statistical functionality (we’ll also introduce some other statistical libraries in later tutorials), and matplotlib for plotting and visualization.

The IPython Command Console

Once you have IPython installed, you are ready to perform all sorts of statistical and other mathematical and scientific operations. However, to use the powerful range of tools, functions, commands, and statistical methods available in Python, you first need to learn a little bit about the syntax and meaning of Python commands. Once you have learned this, operations become simple to perform. Before staring this tutorial, read section 1.4 from ‘Statistical Modeling: A Fresh Approach’, which outlines some general concepts surrounding computational statistics (in the context of R). In particular, it provides some explanation of a ‘language-based approach’ to statistical computing - which is an important concept throughout the book and this series of tutorials.

Invoking an Operation

People often think of computers as doing things: sending email, playing music, storing files. Your job in using a computer is to tell the computer what to do. There are many different words used to refer to the “what”: a procedure, a task, a function, a routine, and so on. Like in the book, I’ll use the word computation. Admittedly, this is a bit circular, but it is easy to remember: computers perform computations.

Complex computations are built up from simpler computations. This may seem obvious, but it is a powerful idea. An algorithm is just a description of a computation in terms of other computations that you already know how to perform. To help distinguish between the computation as a whole and the simpler parts, it is helpful to introduce a new word: an operator performs a computation.

It’s helpful to think of the computation carried out by an operator as involving four parts:

  1. The name of the operator
  2. The input arguments
  3. The output value
  4. Side effects

A typical operation takes one or more input arguments and uses the information in these to produce an output value. Along the way, the computer might take some action: display a graph, store a file, make a sound, etc. These actions are called side effects.

Because R is a programing language designed specifically for statistical analysis, many of the ‘base’ commands (such as sqrt()) are available ‘out-of-the-box’. However, since Python is a more general-purpose programming language, we usually need to import statistical commands (think of this as adding words to a language) before we can use them. For Scientific Python, the most important library that we need is numpy (Numerical Python), which can be loaded like this:

In [1]:
import numpy as np

It is best to ensure you are using the latest version of `numpy`, which is available [from here][numpy].

To tell the computer to perform a computation - call this invoking an operation or giving a command - you need to provide the name and the input arguments in a specific format. The computer then returns the output value. For example, the command np.sqrt(25) invokes the square root operator (named sqrt from the numpy library) on the argument 25. The output from the computation will, of course, we 5.

The syntax of invoking an operation consists of the operator’s name, followed by round parentheses. The input arguments go inside the parentheses.

The software program that you use to invoke operators is called an interpreter (the interpreter is the program you are running when you start Python). You enter your commands as a ‘dialog’ between you and the interpreter (just like when converting between any two languages!). Commands can be entered as part of a script (a text file with a list of commands to perform) or directly at a ‘command prompt’:

In [2]:
np.sqrt(25)
Out[2]:
5.0

In the above situation, the ‘prompt’ is In [ ]:, and the ‘command’ is np.sqrt(25). When you press ‘Enter’, the interpreter reads your command and performs the computation. For commands such as the one above, the interpreter will print the output value from the computation:

In [3]:
np.sqrt(25)
Out[3]:
5.0

In the above example, the ‘output marker’ is Out[ ]:, and the output value is 5.0. The dialog continues as the interpreter prints another prompt and waits for your further command.

Often, operations involve more than one argument. The various arguments are separated by commas. For example, here is an operation named arange from the numpy library that produces a range of numbers (increasing values between 3 and 10):

In [4]:
np.arange(3, 10)
Out[4]:
array([3, 4, 5, 6, 7, 8, 9])

The first argument tells where to start the range and the second tells where to end it. The order of the arguments is important. For instance, here is the range produced when 10 is the first argument, 3 is the second, and the third is -1 (decreasing values between 10 and 3):

In [5]:
np.arange(10, 3, -1)
Out[5]:
array([10,  9,  8,  7,  6,  5,  4])

For some operators, particularly those that have many input arguments, some of the arguments can be referred to by name rather than position. This is particularly useful when the named argument has a sensible default value. For example, the arange operator from the numpy library can be instructed what type of output values to produce (integers, floats, etc). This is accomplished using an argument named dtype:

In [6]:
np.arange(10, 3, -1, dtype='float')
Out[6]:
array([ 10.,   9.,   8.,   7.,   6.,   5.,   4.])

Note that all the values in the range now have decimal places. Depending on the circumstances, all four parts of an operation need not be present. For example, the ctime operation from the time library returns the current time and date; no input arguments are required:

In [7]:
import time
time.ctime()
Out[7]:
'Mon Sep 23 15:58:25 2013'

In the above example, we first imported the time library, which provides a series of commands that help us work with dates and times. Next, even though there are no arguments, the parentheses are still used when calling the ctime command. Think of the pair of parentheses as meaning, ‘do this‘.

Naming and Storing Values

Often the value returned by an operation will be used later on. Values can be stored for later use with the assignment operator. This has a different syntax that reminds the user that a value is being stored. Here’s an example of a simple assignment:

In [8]:
x = 16

The command has stored the value 16 under the name x. The syntax is always the same: an equal sign (=) with a name on the left side and a value on the right. Such stored values are called objects. Making an assignment to an object defines the object. Once an object has been defined, it can be referred to and used in later computations. Notice that an assignment operation does not return a value or display a value. Its sole purpose is to have the side effects of defining the object and thereby storing a value under the object’s name.

To refer to the value stored in the object, just use the object’s name itself. For instance:

In [9]:
x
Out[9]:
16

Doing a computation on the value store in an object is much the same (and provides and extremely rich syntax for performing complex calculations):

In [10]:
np.sqrt(x)
Out[10]:
4.0

You can create as many objects as you like and give them names that remind you of their purpose. Some examples: wilma, ages, temp, dog_houses, foo3. There are some general rules for object names:

  • Use only letters and numbers and ‘underscores’ (_)
  • Do NOT use spaces anywhere in the name (Python won’t let you)
  • A number cannot be the first character in the name
  • Capital letters are treated as distinct from lower-case letters (i.e., Python is case-sensitive)
    • the objects named wilma, Wilma, and WILMA are all different
  • If possible, use an ‘underscore’ between words (i.e., my_object)

For the sake of readability, keep object names short. But if you really must have an object named something like ages_of_children_from_the _clinical_trial, feel free (it’s just more typing for you later!).

Objects can store all sorts of things, for example a range of numbers:

In [11]:
x = np.arange(1, 7)

When you assign a new value to an existing object, as just done to x above, the former values of that object is erased from the computer memory. The former value of x was 16, but after the new assignment above, it is:

In [12]:
x
Out[12]:
array([1, 2, 3, 4, 5, 6])

The value of an object is changed only via the assignment operator. Using an object in a computation does not change the value. For example, suppose you invoke the square-root operator on x:

In [13]:
np.sqrt(x)
Out[13]:
array([ 1.        ,  1.41421356,  1.73205081,  2.        ,  2.23606798,
        2.44948974])

The square roots have been returned as a value, but this doesn’t change the value of x:

In [14]:
x
Out[14]:
array([1, 2, 3, 4, 5, 6])

An assignment command like x=np.sqrt(x) can be confusing to people who are used to algebraic notation. In algebra, the equal sign describes a relationship between the left and right sides. So, $x = \sqrt{x}$ tells us about how the quantity $x$ and the quantity $\sqrt{x}$ are related. Students are usually trained to ‘solve’ such a relationship, going through a series of algebraic steps to find values for $x$ that are consistent with the mathematical statement (for $x = \sqrt{x}$, the solutions are $x = 0$ and $x = 1$). In contrast, the assignment command x = np.sqrt(x) is a way of replacing the previous values stored in x with new values that are the square-root of the old ones.

If you want to change the value of x, you need to use the assignment operator:

In [15]:
x = np.sqrt(x)

Connecting Computations

The brilliant thing about organizing operators in terms of unput arguments and output values is that the output of one operator can be used as an input to another. This lets complicated computations be built out of simpler ones.

For example, suppose you have a list of 10000 voters in a precinct and you want to select a random sample of 20 of them for a survey. The np.arange operator can be used to generate a set of 10000 choices. The np.random.choice operator can then be used to select a subset of these values at random.

One way to connect the computations is by using objects to store the intermediate outputs:

In [16]:
choices = np.arange(1, 10000)
np.random.choice(choices, 20, replace=False) # sample _without_ replacement
Out[16]:
array([7863, 8378, 9128, 3340, 5674, 9055, 6374, 8668, 3768, 6798, 8066,
       6443, 5154, 5991, 1535, 3580, 8516, 4872, 8618, 7240])

You can also pass the output of an operator directly as an argument to another operator. Here’s another way to accomplish exactly the same thing as the above (note that the values will differ because we are performing a random sample):

In [17]:
np.random.choice(np.arange(1, 10000), 20, replace=False)
Out[17]:
array([5732, 6833, 7705, 4459, 3131, 3515, 4177, 6312, 2820, 2705, 4580,
       9125, 7395, 1927,  728, 4725, 1854, 6147, 4421, 2756])

Numbers and Arithmetic

The Python language has a concise notation for arithmetic that looks very much like the traditional one:

In [18]:
7. + 2.
Out[18]:
9.0
In [19]:
3. * 4.
Out[19]:
12.0
In [20]:
5. / 2.
Out[20]:
2.5
In [21]:
3. - 8.
Out[21]:
-5.0
In [22]:
-3.
Out[22]:
-3.0
In [23]:
5.**2. # same as 5^2 (or 5 to the power of 2)
Out[23]:
25.0

Arithmetic operators, like any other operators, can be connected to form more complicated computations. For instance:

In [24]:
8. + 4. / 2.
Out[24]:
10.0

The a human reader, the command 8+4/2 might seem ambiguous. Is it intended to be (8+4)/2 or 8+(4/2)? The computer uses unambiguous rules to interpret the expression, but it’s a good idea for you to use parentheses so that you can make sure that what you intend is what the computer carries out:

In [25]:
(8. + 4.) / 2.
Out[25]:
6.0

Traditional mathematical notations uses superscripts and radicals to indicate exponentials and roots, e.g. $3^2$ or $\sqrt{3}$ or $\sqrt[3]{8}$. This special typography doesn’t work well with an ordinary keyboard, so Python and most other computer languages uses a different notation:

In [26]:
3.**2.
Out[26]:
9.0
In [27]:
np.sqrt(3.) # or 3.**0.5
Out[27]:
1.7320508075688772
In [28]:
8.**(1./3.)
Out[28]:
2.0

There is a large set of mathematical functions: exponentials, logs, trigonometric and inverse trigonometric functions, etc. Some examples:

Traditional Python
$e^2$ np.exp(2)
$\log_{e}(100)$ np.log(100)
$\log_{10}(100)$ np.log10(100)
$\log_{2}(100)$ np.log2(100)
$\cos(\frac{\pi}{2})$ np.cos(np.pi/2)
$\sin(\frac{\pi}{2})$ np.sin(np.pi/2)
$\tan(\frac{\pi}{2})$ np.tan(np.pi/2)
$\cos^{-1}(-1)$ np.acos(-1)

Numbers can be written in scientific notation. For example, the ‘universal gravitational constant’ that describes the gravitational attraction between masses is $6.67428 \times 10^{11}$ (with units meters-cubed per kilogram per second squared). In the computer notation, this would be written as 6.67428e-11. The Avogadro constant, which gives the number of atoms in a mole, is $6.02214179 \times 10^{23}$ per mole, or 6.02214179e+23.

The computer language does not directly support the recording of units. This is unfortunate, since in the real world numbers often have units and the units matter. For example, in 1999 the Mars Climate Orbiter crashed into Mars because the design engineers specified the engine’s thrust in units of pounds, while the guidance engineers thought the units were newtons.

There are *some* Python packages for handling units, including [pint], [quantities], [units], [sympy.physics.units], [etc].

Computer arithmetic is accurate and reliable, but it often involves very slight rounding of numbers. Ordinarily, this is not noticeable. However, it can become apparent in some calculations that produce results that are (near) zero. For example, mathematically, $sin(\pi) = 0$, however, the computer does not duplicate the mathematical relationship exactly:

In [29]:
np.sin(np.pi)
Out[29]:
1.2246467991473532e-16

Whether a number like this is properly interpreted as ‘close to zero’ depends on the context and, for quantities that have units, on the units themselves. For instance, the unit ‘parsec’ is used in astronomy in reporting distances between stars. The closest start to the Sun is Proxima, at a distance of 1.3 parsecs. A distance of $1.22 \times 10^{-16}$ parsecs is tiny in astronomy but translates to about 2.5 meters - not so small on the human scale. In statistics, many calculations relate to probabilities which are always in the range 0 to 1. On this scale, 1.22e-16 is very close to zero.

There are several ‘special’ numbers in the Python world; two of which are inf, which stands for $\infty$ (infinity), and nan, which stands for ‘not a number’ (nan results when a numerical operation isn’t define), for instance:

Mathematically oriented readers will wonder why Python should have any trouble with a computation like $\sqrt{-9}$; the result is the imaginary number $3\jmath$ (imaginary numbers may be represented by a $\jmath$ or a $\imath$, depending on the field). Python works with complex numbers, but you have to explicitly tell the system that this is what you want to do. To calculate $\sqrt{-9}$ for example, simply use np.sqrt(-9+0j).

In [30]:
np.float64(1.) / 0.
-c:1: RuntimeWarning: divide by zero encountered in double_scalars
Out[30]:
inf
In [31]:
np.float64(0.) / 0.
-c:1: RuntimeWarning: invalid value encountered in double_scalars
Out[31]:
nan

Types of Objects

Most of the examples used so far have dealt with numbers. But computers work with other kinds of information as well: text, photographs, sounds, sets of data, and so on. The word type is used to refer to the kind of information. Modern computer languages support a great variety of types. It’s important to know about the types of data because operators expect their input arguments to be of specific types. When you use the wrong type of input, the computer might not be able to process your command.

In Python, data frames are not ‘built in’ as part of the basic language, but the excellent [‘pandas’][pandas] library provides data frames and a whole slew of other functionality to researchers doing data analysis with Python. We will be learning more about ‘pandas’ in future tutorials.

For the purposes of starting computational statistics with Python, it’s important to distinguish among three basic types:

  • numeric positive and negative numbers, decimal and fractional numbers (floats), and whole numbers (integers) - numbers of the sort encountered so far
  • data frames collections of data (more or less) in the form of a spreadsheet table - the ‘Computational Technique’ section from chapter 2 will introduce data frames and the operators and libraries for working with data frames
  • strings textual data - you indicate string data to the computer by enclosing the text in quotation marks (e.g., name = "python")

A Note on Strings

There is something a bit subtle going on in the previous command involving the string "python", so look at it carefully. The purpose of the command is to create a new object, called name, which stores a little bit of text data. Notice that the name of the object is not put in quotes, but the text characters are.

Whenever you refer to an object name, make sure that you don’t use quotes. For example, in the following, we are first assigning the string "python" to the name object, and then returning (and printing automatically) the name object.

In [32]:
name = "python"
name
Out[32]:
'python'

If you make a command with the object name in quotes, it won’t be treated as referring to an object. Instead, it will merely mean the text itself:

In [33]:
"name"
Out[33]:
'name'

Similarly, if you omit the quotation marks from around the text, the computer will treat it as if it were an object name and will look for the object of that name. For instance, the following command directs the computer to look up the value contained in an object named python and insert that value into the object name:

In [34]:
name = python
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-43a69bc65ba8> in <module>()
----> 1 name = python

NameError: name 'python' is not defined

As it happens, there was no object named python because it had not been defined by any previous assignment command. So, the computer generated an error. For the most part, you will not need to use vary many operators on text data when doing statistical analyses; you just need to remember to include text, such as file names, in quotation marks, "like this".

Reference

As with all ‘Statistical Modeling: A Fresh Approach for Python’ tutorials, this tutorial is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan. This tutorial is based on Chapter 1: Introduction.

Another useful source of information for statistics in Python is this introduction to statistics web-book by Thomas Haslwanter which is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.

Statistical Modeling for Python

Helpful Tip Python Statistical Modeling Teaching

twitter

recent visitors