# Data: Cases, Variables, Samples

Sat 09 November 2013

The second in a series of tutorials on using Python for introductory statistical analysis, this tutorial covers data, including cases, variables, samples, and a whole lot more. As always, the `iPython Notebook` associated with this tutorial is available here on github.

Data used in statistical modeling are usually organized into tables, often created using spreadsheet software. Most people presume that the same software used to create a table of data should be used to display and analyze it. This is part of the reason for the popularity of spreadsheet programs such as ‘Excel’ and ‘Google Spreadsheets’.

For serious statistical work, it’s helpful to take another approach that strictly separates the processes of data collection and of data analysis: use one program to create data files and another program to analyze the data stored in those files. By doing this, one guarantees that the original data are not modified accidentally in the process of analyzing them. This also makes it possible to perform many different analyses of the data; modelers often create and compare many different models of the same data.

## Reading Tabular Data into Python

Data is central to statistics, and the tabular arrangement of data is very common. Accordingly, Python provides a large number of ways to read in tabular data. These vary depending on how the data are stored, where they are located, etc. To help keep things as simple as possible, the ‘pandas’ Python library provides an operator, `read_csv()`, that allows you to access data files in tabular format on your computer as well as data stored in repositories such as the one associated with the ‘Statistical Modeling: A Fresh Approach’ book, or one that a course instructor might set up for his or her students.

The ‘pandas’ library is available here, and you can follow these installation instructions to get it working on your computer (installation via `pip` is the easiest method). Once you have ‘pandas’ installed, you need to `import pandas` in order to use `read_csv()`, as well as a variety of other ‘pandas’ operators that you will encounter later (it is also usually a good idea to `import numpy as np` at the same time that we `import pandas as pd`).

An alternative to writing `pd.xxx` when calling each ‘pandas’ operator is to import all available operators from ‘pandas’ at once: `from pandas import *`. This makes things a bit easier in terms of typing, but can sometimes lead to confusion when operators from different libraries have the same name.

In [1]:
```python
import pandas as pd
import numpy as np
```

You need to do this only once in each session of Python, and on systems such as IPython, the library will sometimes be reloaded automatically. If you get an error message, it’s likely that the ‘pandas’ library has not been installed on your system; follow the installation instructions provided at the link above.

Reading in a data table with `read_csv()` is simply a matter of knowing the name (and location) of the data set. For instance, one data table used in examples in the ‘Statistical Modeling: A Fresh Approach’ book is `"swim100m.csv"`. To read in this data table and create an object in Python that contains the data, use a command like this:

In [2]:
```python
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
```

The csv part of the name in `"swim100m.csv"` indicates that the file has been stored in a particular data format, comma-separated values, that is handled by spreadsheet software as well as many other kinds of software. The part of this command that requires creativity is choosing a name for the Python object that will hold the data. In the above command it is called `swim`, but you might prefer another name (e.g., `s` or `sdata` or even `ralph`). Of course, it’s sensible to choose names that are short, easy to type and remember, and remind you what the contents of the object are about.
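
Because `read_csv()` also accepts file-like objects, the comma-separated format can be explored without a network connection. The following is a minimal sketch rather than part of the original notebook: the names `csv_text` and `demo` are made up for illustration, and the three rows only mimic the layout of the swimming records.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text: a header line, then one case per line.
csv_text = """year,time,sex
1905,65.8,M
1908,65.6,M
1910,62.8,M
"""

# read_csv() accepts a path, a URL, or any file-like object such as StringIO.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo)
print(list(demo.columns))
```

The header line becomes the variable names, and each remaining line becomes one case.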

To help you identify data tables that can be accessed through `read_csv()`, examples from these tutorials will be marked with a flag containing the name of the data file. The files themselves are mostly available automatically through the web site for the ‘Statistical Modeling: A Fresh Approach’ book.

### Data Frames

The type of Python object created by `read_csv()` is called a data frame and is essentially a tabular layout. To illustrate, here are the first several cases of the `swim` data frame created by the previous use of `read_csv()`:

In [3]:
```python
swim.head()
```
Out[3]:

|   | year | time | sex |
|---|------|------|-----|
| 0 | 1905 | 65.8 | M |
| 1 | 1908 | 65.6 | M |
| 2 | 1910 | 62.8 | M |
| 3 | 1912 | 61.6 | M |
| 4 | 1918 | 61.4 | M |

Note that `head()`, one of several functions built into ‘pandas’ data frames, is a method of the Python object (data frame) itself, not a function from the main ‘pandas’ library.

Data frames, like tabular data generally, involve variables and cases. In ‘pandas’ data frames, each of the variables is given a name. You can refer to the variable by name in a couple of different ways. To see the variable names in a data frame, something you might want to do to remind yourself of how names are spelled and capitalized, use the `columns` attribute of the data frame object:

In [4]:
```python
swim.columns
```
Out[4]:
`Index([u'year', u'time', u'sex'], dtype=object)`

Note that we have not used parentheses `()` in the above command. This is because `columns` is not a function; it is an attribute of the data frame. Attributes add ‘extra’ information (or metadata) to objects in the form of additional Python objects. In this case, the attributes describe the names (and data types) of the columns. Another way to get quick information about the variables in a data frame is with `describe()`:

In [5]:
```python
swim.describe()
```
Out[5]:

|       | year | time |
|-------|------|------|
| count | 62.000000 | 62.000000 |
| mean | 1952.145161 | 59.924194 |
| std | 29.472881 | 9.916588 |
| min | 1905.000000 | 47.840000 |
| 25% | 1924.500000 | 53.642500 |
| 50% | 1956.500000 | 56.880000 |
| 75% | 1975.750000 | 65.200000 |
| max | 2004.000000 | 95.000000 |

This provides a numerical summary of each of the variables contained in the data frame. To keep things simple, the output from `describe()` is itself a data frame.

There are lots of different functions and attributes available for data frames (and any other Python objects). For instance, to see how many cases and variables there are in a data frame, you can use the `shape` attribute:

In [6]:
```python
swim.shape
```
Out[6]:
`(62, 3)`
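
The difference between attributes (no parentheses) and methods (called with parentheses) can be seen on a small hand-built data frame. This is only a sketch: the frame `demo` and its values are made up for illustration and are not the full swimming data.

```python
import pandas as pd

# Hypothetical three-case data frame in the same layout as `swim`.
demo = pd.DataFrame({
    "year": [1905, 1908, 1910],
    "time": [65.8, 65.6, 62.8],
    "sex": ["M", "M", "M"],
})

# Attributes are looked up, not called: no parentheses.
print(list(demo.columns))
print(demo.shape)    # (number of cases, number of variables)
print(demo.dtypes)   # data type of each variable

# Methods are called with parentheses and return a new object.
summary = demo.describe()
print(type(summary).__name__)
print(list(summary.columns))  # by default only numeric variables are summarized
```

Note that the non-numeric `sex` variable is left out of the default summary, which matches the `describe()` output shown earlier.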

### Variables in Data Frames

Perhaps the most common operation on a data frame is to refer to the values in a single variable. The two ways you will most commonly use involve referring to a variable by string-quoted name (`swim["year"]`) and as an attribute of a data frame without quotes (`swim.year`).

Each column or variable in a ‘pandas’ data frame is called a ‘series’, and each series can contain one of many different data types. For more information on series, data frames, and other objects in ‘pandas’, [have a look here][intro].
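
The two ways of referring to a variable return the same series, but the attribute form has limits worth knowing about. The sketch below uses a made-up frame whose column names (`count`, `lap time`) are hypothetical and chosen only to show those limits.

```python
import pandas as pd

# Hypothetical frame: "count" collides with a DataFrame method name,
# and "lap time" contains a space.
demo = pd.DataFrame({
    "year": [1905, 1908, 1910],
    "count": [3, 5, 7],
    "lap time": [65.8, 65.6, 62.8],
})

# For an ordinary name, both forms give the same series.
same = demo["year"].equals(demo.year)
print(same)

# demo.count is the DataFrame.count method, not the "count" column,
# so the bracket form is needed here.
print(callable(demo.count))
print(demo["count"].tolist())

# A name with a space cannot be written as an attribute at all.
print(demo["lap time"].tolist())
```

The bracket form always works, which is one reason to prefer it when column names come from an unfamiliar file.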

Most of the statistical modeling functions you will encounter in these tutorials are designed to work with data frames and allow you to refer directly to variables within a data frame. For instance:

In [7]:
```python
swim.year.mean()
```
Out[7]:
`1952.1451612903227`

In [8]:
```python
swim["year"].min()
```
Out[8]:
`1905`

It is also possible to combine ‘numpy’ operators with ‘pandas’ variables:

In [9]:
```python
np.min(swim["year"])
```
Out[9]:
`1905`

In [10]:
```python
np.min(swim.year)
```
Out[10]:
`1905`

The `swim` portion of the above commands tells Python which data frame we want to operate on. Leaving off that argument leads to an error:

In [11]:
```python
year.min()
```
```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-2ef03df1cde8> in <module>()
----> 1 year.min()

NameError: name 'year' is not defined
```

Of course, you know that the variable year is defined within the data frame `swim`, but you have to tell Python which data frame you want to operate on explicitly, otherwise it doesn’t know where to find the variable(s). Think of this notation as referring to the variable by both its family name (the data frame’s name, `"swim"`) and its given name (`"year"`), something like `einstein.albert`.

The advantage of referring to variables by name becomes evident when you construct statements that involve more than one variable within a data frame. For instance, here’s a calculation of the mean year, separately for (grouping by) the different sexes:

In [12]:
```python
swim.groupby('sex')['year'].mean()
```
Out[12]:
```
sex
F      1950.677419
M      1953.612903
Name: year, dtype: float64
```

You will see much more of the `groupby` function, starting in Tutorial 4 (Group-wise Models). It’s the ‘pandas’ way of grouping or aggregating data frames. In subsequent chapters, we will build on this notion to develop more complex ways of “grouping” and “modeling” variables “by” other variables.
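
To see what `groupby` is doing, it can help to compare it with computing one group by hand. The following sketch uses a tiny made-up frame (`demo`, four hypothetical cases) so that the arithmetic is easy to check.

```python
import pandas as pd

# Hypothetical four-case frame: two cases per sex.
demo = pd.DataFrame({
    "year": [1905, 1908, 1912, 1920],
    "sex": ["M", "F", "F", "M"],
})

# Group-wise mean: one value per level of "sex".
by_sex = demo.groupby("sex")["year"].mean()
print(by_sex)

# The same quantity for one group, computed by selecting its cases first.
manual_f = demo[demo["sex"] == "F"]["year"].mean()
print(manual_f)
```

Here the mean for `F` is (1908 + 1912) / 2 = 1910.0 and for `M` is (1905 + 1920) / 2 = 1912.5, whichever way it is computed.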

Both `mean()` and `min()` are functions that the ‘pandas’ library attaches to every variable in a data frame, so they can be called directly on the variable, but not all functions are available this way. For instance:

In [13]:
```python
swim.year.sqrt()
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-e6382fdf6716> in <module>()
----> 1 swim.year.sqrt()

AttributeError: 'Series' object has no attribute 'sqrt'
```

When you encounter a function that isn’t supported by data frames, you can use ‘numpy’ functions and the special `apply` function built into data frames (note that writing the keyword `func=` is optional; `apply(np.sqrt)` works the same way):

In [14]:
```python
swim.year.apply(func=np.sqrt).head() # There are 62 cases in total
```
Out[14]:
```
0    43.646306
1    43.680659
2    43.703547
3    43.726422
4    43.794977
Name: year, dtype: float64
```

Alternatively, since columns are basically just arrays, we can use built-in numpy functions directly on the columns:

In [15]:
```python
np.sqrt(swim.year).head() # Again, there are 62 cases in total
```
Out[15]:
```
0    43.646306
1    43.680659
2    43.703547
3    43.726422
4    43.794977
Name: year, dtype: float64
```
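
That the two approaches agree can be checked directly. This is a small sketch on a stand-alone series (`years` is a made-up name holding the first three years shown above), not a claim about which approach is faster or preferred.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-alone series with the first three years of the data.
years = pd.Series([1905, 1908, 1910], name="year")

via_apply = years.apply(np.sqrt)  # apply a function through the series
via_numpy = np.sqrt(years)        # call the numpy function on the series

# Both routes return a series with matching values.
print(type(via_numpy).__name__)
print(np.allclose(via_apply, via_numpy))
```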

Sometimes you will compute a new quantity from the existing variables and want to treat this as a new variable. Adding a new variable to a data frame can be done similarly to accessing a variable. For instance, here is how to create a new variable in `swim` that holds the `time` converted from seconds to units of minutes:

In [16]:
```python
swim['minutes'] = swim.time/60. # or swim['time']/60.
```

By default, columns get inserted at the end. The `insert` function is available to insert at a particular location in the columns.

In [17]:
```python
swim.insert(1, 'mins', swim.time/60.)
```

You could also, if you want, redefine an existing variable, for instance:

In [18]:
```python
swim['time'] = swim.time/60.
```

As always, we can take a quick look at the results of our operations by using the `head()` function of our data frame:

In [19]:
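```python
swim.head()
```
Out[19]:

|   | year | mins | time | sex | minutes |
|---|------|------|------|-----|---------|
| 0 | 1905 | 1.096667 | 1.096667 | M | 1.096667 |
| 1 | 1908 | 1.093333 | 1.093333 | M | 1.093333 |
| 2 | 1910 | 1.046667 | 1.046667 | M | 1.046667 |
| 3 | 1912 | 1.026667 | 1.026667 | M | 1.026667 |
| 4 | 1918 | 1.023333 | 1.023333 | M | 1.023333 |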

Such assignment operations do not change the original file from which the data were read, only the data frame in the current session of Python. This is an advantage, since it means that your data in the data file stay in their original state and therefore won’t be corrupted by operations made during analysis.
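
The claim that the file on disk is untouched can be verified by reading it a second time. The sketch below writes a small temporary file (the name `demo.csv` and its two rows are hypothetical), modifies the data frame in memory, and then re-reads the file.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical data file with two cases.
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w") as f:
        f.write("year,time\n1905,65.8\n1908,65.6\n")

    demo = pd.read_csv(path)
    demo["time"] = demo["time"] / 60.0  # changes only the in-memory data frame

    reread = pd.read_csv(path)          # the file still holds the original values

print(demo["time"].round(4).tolist())
print(reread["time"].tolist())
```

The data frame now holds times in minutes, while the re-read copy still holds the original seconds. The file would change only if the data frame were explicitly written back out.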

## Sampling from a Sample Frame

Much of statistical analysis is concerned with the consequences of drawing a sample from the population. Ideally, you will have a sampling frame that lists every member of the population from which the sample is to be drawn. With this in hand, you could treat the individual cases in the sampling frame as if they were cards in a deck. To pick your random sample, shuffle the deck and deal out the desired number of cards.

When doing real work in the field, you would use the randomly dealt cards to locate the real-world cases they correspond to. Sometimes in these tutorials, however, in order to let you explore the consequences of sampling, you will select a sample from an existing data set. For example, the `"kidsfeet.csv"` data set has `n=39` cases.

In [20]:
```python
kids = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
kids.shape
```
Out[20]:
`(39, 8)`

There are a number of procedures to draw a random sample of 5 cases from this data frame. The preferred option, however, is to randomly select a subset of case ids (in this case 5) using `np.random.choice`, and return a subsetted data frame using the `ix[]` operator.

The `ix[]` property is a bit tricky to figure out at first. It was also deprecated in ‘pandas’ 0.20 and removed in 1.0; on current versions, the label-based `loc[]` indexer does the same job here. For more information, see [the official docs][selecting].

In [21]:
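```python
rows = np.random.choice(kids.index, 5, replace=False)
# Label-based row selection; .loc replaces .ix, which was removed in pandas 1.0
kids.loc[rows]
```
Out[21]:

|    | name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|----|------|------------|-----------|--------|-------|-----|------------|---------|
| 23 | Erica | 9 | 88 | 24.5 | 9.0 | G | L | R |
| 16 | Caroline | 12 | 87 | 24.0 | 8.7 | G | R | L |
| 4 | Lang | 2 | 88 | 25.1 | 8.9 | B | L | R |
| 32 | Leigh | 3 | 88 | 24.5 | 8.6 | G | L | R |
| 7 | Caitlin | 6 | 88 | 23.0 | 8.8 | G | L | R |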

To make things a bit more concise, you can use `from numpy.random import choice`, which will allow you to simply use `choice()` without including the library *and* module when typing.

This can also be done in a single line:

In [22]:
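```python
# .loc replaces .ix, which was removed in pandas 1.0
kids.loc[np.random.choice(kids.index, 5, replace=False)]
```
Out[22]:

|    | name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|----|------|------------|-----------|--------|-------|-----|------------|---------|
| 19 | Heather | 3 | 88 | 25.5 | 9.5 | G | R | R |
| 4 | Lang | 2 | 88 | 25.1 | 8.9 | B | L | R |
| 3 | Josh | 1 | 88 | 25.2 | 9.8 | B | L | R |
| 31 | Caitlin | 7 | 88 | 22.5 | 8.6 | G | R | R |
| 7 | Caitlin | 6 | 88 | 23.0 | 8.8 | G | L | R |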

The results returned by the above methods will never contain the same case more than once (because we told the function not to sample with replacement), just as if you were dealing cards from a shuffled deck. In contrast, ‘re-sampling with replacement’ replaces each case after it is dealt so that it can appear more than once in the result. You wouldn’t want to do this to select from a sampling frame, but it turns out that there are valuable statistical uses for this sort of sampling with replacement. You’ll make use of re-sampling in Tutorial 5 (Confidence Intervals).

In [23]:
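```python
np.random.seed(1237) # Set seed so results are reproducible
# .loc replaces .ix, which was removed in pandas 1.0
kids.loc[np.random.choice(kids.index, 5, replace=True)]
```
Out[23]:

|    | name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|----|------|------------|-----------|--------|-------|-----|------------|---------|
| 11 | Ray | 3 | 88 | 24.8 | 8.9 | B | L | R |
| 25 | Glen | 7 | 88 | 27.1 | 9.4 | B | L | R |
| 36 | Teshanna | 3 | 88 | 26.0 | 9.0 | G | L | R |
| 7 | Caitlin | 6 | 88 | 23.0 | 8.8 | G | L | R |
| 25 | Glen | 7 | 88 | 27.1 | 9.4 | B | L | R |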

Notice that ‘Glen’ was sampled twice.
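
The contrast between the two kinds of sampling can be checked on a small made-up frame. The sketch below is an illustration under stated assumptions: `demo` and its six cases are hypothetical, and the second draw uses the data frame's own `sample()` method, an alternative that is not used elsewhere in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical sampling frame with six cases.
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "length": [24.5, 24.0, 25.1, 24.5, 23.0, 27.1],
})

# Without replacement: like dealing cards, no case can appear twice.
rows = np.random.choice(demo.index, 4, replace=False)
without = demo.loc[rows]
print(without.index.is_unique)

# With replacement: drawing 10 times from 6 cases must repeat at least one case.
with_repl = demo.sample(n=10, replace=True, random_state=0)
print(with_repl.index.is_unique)
```

Drawing more cases than the frame contains is only possible with replacement, which is why the second sample is guaranteed to contain repeats.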

### Reference

As with all ‘Statistical Modeling: A Fresh Approach for Python’ tutorials, this tutorial is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan. This tutorial is based on Chapter 2: Data: Cases, Variables, Samples.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.