Data assets used in these examples
Reading data from CSV files
Reading CSV files and using one of the columns as the index
Basic loading of data in CSV format using Pandas.
# Import requisite libraries
import pandas as pd
import pytablewriter
# Import training and testing datasets using the PassengerId column as the index
train = pd.read_csv("../Data/Titanic/train.csv", index_col='PassengerId')
test = pd.read_csv("../Data/Titanic/test.csv", index_col='PassengerId')
# Combine the training and testing datasets into a single dataframe
titanic = pd.concat(objs=[train, test], axis=0, sort=False).reset_index(drop=True)
titanic[["Sex","Age","Fare","Embarked"]].head(5)
Sex Age Fare Embarked
0 male 22.0 7.2500 S
1 female 38.0 71.2833 C
2 female 26.0 7.9250 S
3 female 35.0 53.1000 S
4 male 35.0 8.0500 S
Reading CSV files with different encoding and thousands markers
The first thing we need to do is figure out what kind of file we're actually dealing with. On Linux, the libmagic
shared library exists for exactly this purpose: identifying file types and encodings. It's the library that provides the backend for the file
command, and there are multiple Python bindings that can leverage it.
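If installing a libmagic binding isn't an option, a rough fallback is to try decoding a sample of the file against a short list of candidate encodings. This is a minimal sketch (the `sniff_encoding` helper is hypothetical, not part of any library); note that iso-8859-1 accepts every possible byte sequence, so it should always be tried last.

```python
# Hypothetical helper: try candidate encodings until one decodes cleanly.
def sniff_encoding(raw: bytes, candidates=("utf-8", "iso-8859-1")):
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# The 0xE9 byte ("é" in iso-8859-1) is not valid UTF-8 here,
# so the helper falls through to iso-8859-1
sample = "Montréal,QC".encode("iso-8859-1")
print(sniff_encoding(sample))  # iso-8859-1
```

In practice, reading only the first few kilobytes of the file is usually enough for this kind of check.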
user@server $> file Top5000Population.csv
Top5000Population.csv: ISO-8859 text, with CRLF line terminators
topCities = pd.read_csv("../Data/Training/Top5000Population.csv", thousands=',', encoding="iso-8859-1")
topCities.head(5)
city state population
0 New York NY 8363710
1 Los Angeles CA 3833995
2 Chicago IL 2853114
3 Houston TX 2242193
4 Phoenix AZ 1567924
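To see what the thousands=',' argument actually does, here's a small self-contained check using an in-memory CSV (the values are made up for illustration): without it the quoted population figures stay as strings, with it pandas strips the separators and parses integers.

```python
import io

import pandas as pd

csv_text = 'city,population\nNew York,"8,363,710"\nChicago,"2,853,114"\n'

# Without thousands=',' the embedded commas keep the column as strings
raw = pd.read_csv(io.StringIO(csv_text))
# With it, pandas removes the separators and infers an integer dtype
clean = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["population"].dtype, clean["population"].dtype)  # object int64
```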
Reading CSV files that have date/time stamps
Pandas has the ability to parse well-formed datetime stamps, making it convenient to work with dates and select date ranges. Notice that since we're using the time column as the index, it doesn't appear in the columns list.
quakes = pd.read_csv("../Data/Training/usgs_earthquakes.csv", parse_dates=True, index_col='time')
quakes.columns.values
array(['latitude', 'longitude', 'depth', 'mag', 'magType', 'nst', 'gap',
'dmin', 'rms', 'net', 'id', 'updated', 'place', 'type',
'horizontalError', 'depthError', 'magError', 'magNst', 'status',
'locationSource', 'magSource'], dtype=object)
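It's the combination of parse_dates=True and index_col that turns the time column into a proper DatetimeIndex, which is what enables the date-based selection in the next section. A self-contained sketch with made-up values:

```python
import io

import pandas as pd

csv_text = (
    "time,mag\n"
    "2016-11-13 11:02:56,7.8\n"
    "1964-03-28 03:36:16,9.2\n"
)

# parse_dates=True parses the index column into timestamps
quakes = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="time")
print(type(quakes.index).__name__)  # DatetimeIndex
```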
Selecting data
Selecting specific columns
Selecting specific columns is fairly simple: just pass a list of column names to the dataframe.
quakes[["latitude","longitude","depth","mag"]].head(1)
time latitude longitude depth mag
2016-11-13 11:02:56.340 -42.7373 173.054 15.11 7.8
Selecting rows based on index
Since we used the time field as the index column when we loaded the data, we can now slice the data using that field. We can also use portions of the datetime stamp.
# Look at the quakes on May 15, 1965
quakes.loc["1965-05-15"][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
1965-05-15 23:58:36 -4.131 134.946 25.0 5.8
# Look at all of the quakes in May, 1965
quakes.loc["1965-05"][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
1965-05-20 00:40:27 -14.921 167.340 120.0 7.7
1965-05-15 23:58:36 -4.131 134.946 25.0 5.8
# How many quakes were there in 2000
len(quakes.loc["2000"])
140
# What was the magnitude of the Great Alaskan Earthquake?
# It occurred March 27th, 1964 at 5:56PM AST, but the quakes are recorded in UTC
quakes.loc["1964-03-28 03:36"][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
1964-03-28 03:36:16 60.908 -147.339 25.0 9.2
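This partial-string matching works on any DatetimeIndex: a label's precision determines how much it matches. A self-contained sketch with made-up values:

```python
import pandas as pd

idx = pd.to_datetime(
    ["1965-05-15 23:58:36", "1965-05-20 00:40:27", "2000-01-02 12:00:00"]
)
df = pd.DataFrame({"mag": [5.8, 7.7, 6.0]}, index=idx)

# A day-level label matches that whole day ...
print(len(df.loc["1965-05-15"]))  # 1
# ... a month-level label matches the whole month ...
print(len(df.loc["1965-05"]))     # 2
# ... and a year-level label matches the whole year
print(len(df.loc["2000"]))        # 1
```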
Select rows based on a predicate
Selecting rows based on a predicate is effectively performed as a subquery: the inner expression builds a boolean mask, which is then passed to .loc.
quakes.loc[quakes["mag"] > 9.0][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
2011-03-11 05:46:24.120 38.297 142.373 29.0 9.1
2004-12-26 00:58:53.450 3.295 95.982 30.0 9.1
1964-03-28 03:36:16.000 60.908 -147.339 25.0 9.2
1960-05-22 19:11:20.000 -38.143 -73.407 25.0 9.5
Performing a selection with multiple predicates requires wrapping each query in parentheses and combining them with the Pandas boolean operators (| for OR, & for AND).
quakes.loc[((quakes["mag"] == 8.8) | (quakes["mag"] == 9.1))][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
2011-03-11 05:46:24.120 38.297 142.373 29.0 9.1
2010-02-27 06:34:11.530 -36.122 -72.898 22.9 8.8
2004-12-26 00:58:53.450 3.295 95.982 30.0 9.1
quakes.loc[((quakes["depth"] == 29.0) & (quakes["mag"] == 9.1))][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
2011-03-11 05:46:24.120 38.297 142.373 29.0 9.1
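The parentheses and the bitwise operators matter: Python's plain and/or raise a ValueError on a Series, because a Series has no single truth value. A quick illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"depth": [29.0, 25.0, 30.0], "mag": [9.1, 9.2, 9.1]})

# & combines the two masks element-wise
mask = (df["depth"] == 29.0) & (df["mag"] == 9.1)
print(len(df.loc[mask]))  # 1

try:
    # Plain `and` asks for the truth value of a whole Series, which is ambiguous
    df["depth"] == 29.0 and df["mag"] == 9.1
except ValueError:
    print("and/or don't work element-wise on a Series")
```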
Slicing rows
Slicing rows in Pandas is just like slicing strings or lists in standard Python. Add a slicing bracket to the end of the statement. The first position is the start of the slice (beginning at 0), the second position is the end of the slice (exclusive), and the third position is the step, i.e. how many records to move forward on each selection.
# Sample from which we'll slice the data
quakes.loc["2000-12"][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
2000-12-28 04:34:28.450 -4.050 152.307 33.0 6.2
2000-12-24 11:21:58.260 23.971 69.763 10.0 4.6
2000-12-21 01:01:27.770 -5.706 151.122 33.0 6.4
2000-12-15 16:44:47.660 38.457 31.351 10.0 6.0
2000-12-13 04:44:22.530 47.968 86.123 40.3 4.9
2000-12-07 09:31:19.120 -4.274 152.786 33.0 5.8
2000-12-06 22:57:40.040 -4.218 152.725 31.0 6.5
2000-12-06 17:11:06.400 39.566 54.799 30.0 7.0
# Starting at the 0th record, and ending before the 8th record, select every 2nd record
quakes.loc["2000-12"][["latitude","longitude","depth","mag"]][0:8:2]
time latitude longitude depth mag
2000-12-28 04:34:28.450 -4.050 152.307 33.0 6.2
2000-12-21 01:01:27.770 -5.706 151.122 33.0 6.4
2000-12-13 04:44:22.530 47.968 86.123 40.3 4.9
2000-12-06 22:57:40.040 -4.218 152.725 31.0 6.5
# You can also slice directly within the row selection
quakes.loc['2000-12-31':'2000-12-01':2][["latitude","longitude","depth","mag"]]
time latitude longitude depth mag
2000-12-28 04:34:28.450 -4.050 152.307 33.0 6.2
2000-12-21 01:01:27.770 -5.706 151.122 33.0 6.4
2000-12-13 04:44:22.530 47.968 86.123 40.3 4.9
2000-12-06 22:57:40.040 -4.218 152.725 31.0 6.5