python, jupyter, pandas

Pandas - Opening and Selecting Data

Data assets used in these examples


Reading data from CSV files

Reading CSV files and using one of the columns as the index

Basic loading of data in CSV format using Pandas.

# Import requisite libraries
import pandas as pd
import pytablewriter

# Import training and testing datasets using the PassengerId column as the index
train = pd.read_csv("../Data/Titanic/train.csv", index_col='PassengerId')
test = pd.read_csv("../Data/Titanic/test.csv", index_col='PassengerId')

# Combine the training and testing datasets into a single dataframe
titanic = pd.concat(objs=[train,test], axis=0, sort=False) \
    .reset_index(drop=True)

df[["Sex","Age","Fare","Embarked"]].head(5)
	Sex	Age	Fare	Embarked
0	male	22.0	7.2500	S
1	female	38.0	71.2833	C
2	female	26.0	7.9250	S
3	female	35.0	53.1000	S
4	male	35.0	8.0500	S

Reading CSV files with different encoding and thousands markers

The first thing we need to do is figure out what kind of file we're actually dealing with. On Linux, there's a libmagic shared library which is used for the express purpose of discovering file encoding. It's the library which provides the backend functionality for the file command, and there are multiple Python bindings which can leverage it.

user@server $> file Top5000Population.csv
Top5000Population.csv: ISO-8859 text, with CRLF line terminators
topCities = pd.read_csv("../Data/Training/Top5000Population.csv", thousands=',', encoding="iso-8859-1")

topCities.head(5)
            city	state	population
0	New York	NY	8363710
1	Los Angeles	CA	3833995
2	Chicago	        IL	2853114
3	Houston	        TX	2242193
4	Phoenix	        AZ	1567924

Reading CSV files that have date/time stamps

Pandas has the ability to parse well formed datetime stamps, making it convenient to perform parsing and select date ranges. Notice that since we're using the time column as the index, it doesn't appear in the columns list.

quakes = pd.read_csv("../Data/Training/usgs_earthquakes.csv", parse_dates=True, index_col='time')

quakes.columns.values
array(['latitude', 'longitude', 'depth', 'mag', 'magType', 'nst', 'gap',
       'dmin', 'rms', 'net', 'id', 'updated', 'place', 'type',
       'horizontalError', 'depthError', 'magError', 'magNst', 'status',
       'locationSource', 'magSource'], dtype=object)

Selecting data

Selecting specific columns

Selecting specific columns is fairly simple, just pass a list of column names to the dataframe.

quakes[["latitude","longitude","depth","mag"]].head(1)
time	                latitude	longitude	depth	mag
2016-11-13 11:02:56.340	-42.7373	173.054	        15.11	7.8

Selecting rows based on index

Since we used the time field for the index column when we loaded the data, we can now perform some slicing of the data using that field. This means we can also use portions of the datetime stamp, too.

# Look at the quakes on May 15, 1965
quakes.loc["1965-05-15"][["latitude","longitude","depth","mag"]]
time	                latitude	longitude	depth	mag
1965-05-15 23:58:36	  -4.131	  134.946	 25.0	5.8

# Look at all of the quakes in May, 1965
quakes.loc["1965-05"][["latitude","longitude","depth","mag"]]
time	                latitude	longitude	depth	mag
1965-05-20 00:40:27	 -14.921	  167.340	120.0	7.7
1965-05-15 23:58:36	  -4.131	  134.946	 25.0	5.8

# How many quakes were there in 2000
len(quakes.loc["2000"])
140

# What was the magnitude of the Great Alaskan Earthquake?
# It occured March 27th, 1964 at 5:56PM AST, but the quakes are recorded in UTC
quakes.loc["1964-03-28 03:36"][["latitude","longitude","depth","mag"]]
time	                latitude	longitude	depth	mag
1964-03-28 03:36:16	  60.908	 -147.339	 25.0	9.2

Select rows based on a predicate

Selecting rows based on a predicate requires performing it as a subquery, effectively.

quakes.loc[quakes["mag"] > 9.0][["latitude","longitude","depth","mag"]]
time	                    latitude	longitude	depth	mag
2011-03-11 05:46:24.120	    38.297	142.373	        29.0	9.1
2004-12-26 00:58:53.450	    3.295	95.982	        30.0	9.1
1964-03-28 03:36:16.000	    60.908	-147.339	25.0	9.2
1960-05-22 19:11:20.000	    -38.143	-73.407	        25.0	9.5

Performing a selection with multiple predicates requires performing queries wrapped inside parenthesis and the Pandas boolean operators.

quakes.loc[((quakes["mag"] == 8.8) | (quakes["mag"] == 9.1))][["latitude","longitude","depth","mag"]]
time	                    latitude	longitude	depth	mag
2011-03-11 05:46:24.120	    38.297	142.373	        29.0	9.1
2010-02-27 06:34:11.530	    -36.122	-72.898	        22.9	8.8
2004-12-26 00:58:53.450	    3.295	95.982	        30.0	9.1

quakes.loc[((quakes["depth"] == 29.0) & (quakes["mag"] == 9.1))][["latitude","longitude","depth","mag"]]
time	                    latitude	longitude	depth	mag
2011-03-11 05:46:24.120	    38.297	142.373	        29.0	9.1

Slicing rows

Slicing rows in Pandas is just like slicing strings or lists in standard Python. Add a slicing bracket to the end of the statement. The first position is the start of the slice (beginning at 0), the second position states the end of the slice, and the third position is for how many records to step over.

# Sample from which we'll slice the data
quakes.loc["2000-12"][["latitude","longitude","depth","mag"]]
time	                    latitude	longitude	depth	mag
2000-12-28 04:34:28.450	    -4.050	152.307	        33.0	6.2
2000-12-24 11:21:58.260	    23.971	69.763	        10.0	4.6
2000-12-21 01:01:27.770	    -5.706	151.122	        33.0	6.4
2000-12-15 16:44:47.660	    38.457	31.351	        10.0	6.0
2000-12-13 04:44:22.530	    47.968	86.123	        40.3	4.9
2000-12-07 09:31:19.120	    -4.274	152.786	        33.0	5.8
2000-12-06 22:57:40.040	    -4.218	152.725	        31.0	6.5
2000-12-06 17:11:06.400	    39.566	54.799	        30.0	7.0

# Starting at the 0th record, and ending at the 8th record, select every 2nd record
quakes.loc["2000-12"][["latitude","longitude","depth","mag"]][0:8:2]
time	                    latitude	longitude	depth	mag
2000-12-28 04:34:28.450	    -4.050	152.307	        33.0	6.2
2000-12-21 01:01:27.770	    -5.706	151.122	        33.0	6.4
2000-12-13 04:44:22.530	    47.968	86.123	        40.3	4.9
2000-12-06 22:57:40.040	    -4.218	152.725	        31.0	6.5

# You can also slice directly from within the row selection as well
quakes.loc['2000-12-31':'2000-12-01':2][["latitude","longitude","depth","mag"]]
time	                    latitude	longitude	depth	mag
2000-12-28 04:34:28.450	    -4.050	152.307	        33.0	6.2
2000-12-21 01:01:27.770	    -5.706	151.122	        33.0	6.4
2000-12-13 04:44:22.530	    47.968	86.123	        40.3	4.9
2000-12-06 22:57:40.040	    -4.218	152.725	        31.0	6.5
Author image

About James Conner

Scuba dive master, wildlife photographer, anthropologist, programmer, electronics tinkerer and big data expert.