Plotting Data With Seaborn and Pandas

Pandas DataFrames are the most widely used in-memory representation of complex data collections within Python. Whether in finance, scientific fields, or data science, a familiarity with Pandas is essential. This course teaches you to work with real-world data sets containing both string and numeric data, often structured around time series. You will learn powerful analysis, selection, and visualization techniques in this course.

Pandas

Why is Pandas great? It is built on top of NumPy, and it uses NumPy's multi-dimensional arrays and fast operations internally to provide higher-level methods for manipulation and analysis.
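As a minimal sketch of that relationship (the `fares` Series and its values are made up for illustration), a numeric column can hand back its data as a NumPy array, and arithmetic on the column applies to every element at once instead of running a Python loop:

```python
import numpy as np
import pandas as pd

# Hypothetical fares, invented for this illustration
fares = pd.Series([7.25, 71.28, 8.05], name="fare")

# The values of a numeric Series are available as a NumPy array
values = fares.to_numpy()
print(type(values).__name__)  # ndarray
print(values.dtype)           # float64

# Arithmetic is vectorized: one expression operates on every element
doubled = fares * 2
print(doubled.tolist())       # [14.5, 142.56, 16.1]
```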

It is also easy to use. Almost every Pandas method returns a (modified) copy of the data, which allows you to chain transformations, and perform complex modifications in one line. The overview is divided into sections, each one with code examples and an explanation of what is being done.
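A small sketch of such a chain follows. The `passengers` table is a hypothetical stand-in for the Titanic data, and the particular steps (drop missing ages, filter by fare, sort) are only an assumption about what a typical pipeline might look like:

```python
import pandas as pd

# Hypothetical miniature passenger table, invented for this illustration
passengers = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 51.86],
    "survived": [0, 1, 0, 1, 0],
})

# Each method returns a new object, so the calls can be chained
result = (
    passengers
    .dropna(subset=["age"])     # drop rows where age is missing
    .query("fare > 10")         # keep passengers who paid more than 10
    .sort_values("age")         # order the remaining rows by age
    .reset_index(drop=True)     # renumber the rows from 0
)

print(result["age"].tolist())   # [35.0, 38.0, 54.0]
print(len(passengers))          # 5, the original table is unchanged
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.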

First we’ll need to make some imports, which will be needed throughout the examples. We’ll use the well-known Titanic dataset (available in Seaborn), which holds data about the Titanic passengers, such as their age, the fare they paid, and whether they survived.

  import matplotlib.pyplot as plt
  import numpy as np
  import pandas as pd
  import seaborn as sns
  import timeit
   
  # Load dataset
  titanic = sns.load_dataset('titanic')

Basic aspects

These are the must-know basics that will make your life easier when dealing with Pandas for the first time.

Data Structures

Pandas’ data structures can hold values of mixed types as well as labels, and their axes can be given names. The data structures are the following.

The most basic data structure available in Pandas is the Series, a 1-dimensional labeled array. A Series therefore has only one axis (axis == 0), called “index”.

 

  pd.Series([1, 90, 'hey', np.nan], index=['a', 'B', 'C', 'd'])

  a    B     C      d
  1    90    "hey"  NaN
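The labels are what make a Series more than a plain array. As a short sketch, reusing the same Series as above (the variable name `s` is just chosen for this example), values can be looked up by label, and the mixed contents give the Series the generic `object` dtype:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 90, 'hey', np.nan], index=['a', 'B', 'C', 'd'])

print(s['B'])                      # 90, label-based lookup
print(s.loc[['a', 'C']].tolist())  # [1, 'hey'], several labels at once
print(s.index.tolist())            # ['a', 'B', 'C', 'd']
print(s.dtype)                     # object, because the value types are mixed
print(s.isna().sum())              # 1, only the NaN counts as missing
```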

Then, we have DataFrames. These are 2-dimensional structures, with two axes, the “index” axis (axis == 0), and the “columns” axis (axis == 1). DataFrames can be thought of as Python dictionaries where the keys are the column labels, and the values are the column Series.

  pd.DataFrame({'day': [17, 30], 'month': [1, 12], 'year': [2010, 2017]})

 

     day  month  year
  0   17      1  2010
  1   30     12  2017
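The dictionary analogy can be checked directly. In this sketch, which reuses the DataFrame above under the assumed name `df`, the column labels behave like keys and each lookup returns a Series:

```python
import pandas as pd

df = pd.DataFrame({'day': [17, 30], 'month': [1, 12], 'year': [2010, 2017]})

print(list(df.keys()))        # ['day', 'month', 'year'], the column labels
print('month' in df)          # True, membership tests look at the columns

year = df['year']             # selecting a key returns that column
print(type(year).__name__)    # Series
print(year.tolist())          # [2010, 2017]

# items() yields (label, Series) pairs, much like dict.items()
for label, column in df.items():
    print(label, column.tolist())
```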

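Lastly, older versions of Pandas provided Panels. These are 3-dimensional data structures that were rarely used in comparison with DataFrames; the Panel class was deprecated in Pandas 0.20 and removed in 0.25, so it is not available in current releases. Analogously to DataFrames, they can be thought of as Python dictionaries of DataFrames. Instead of “index” and “columns”, Panels’ axes are named as follows: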

  • items (axis == 0)
  • major_axis (axis == 1)
  • minor_axis (axis == 2)

The axes distinction is vital, since a lot of methods need to have the axis specified properly in order to work as expected. We’ll see its usage in the following examples.
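To make the role of the axis argument concrete, here is a minimal sketch on a small DataFrame (the name `df` and the choice of `sum` and `drop` are only illustrative). With `axis=0` an operation runs down the index and yields one result per column; with `axis=1` it runs across the columns and yields one result per row:

```python
import pandas as pd

df = pd.DataFrame({'day': [17, 30], 'month': [1, 12], 'year': [2010, 2017]})

col_totals = df.sum(axis=0)   # collapse the index axis: one value per column
row_totals = df.sum(axis=1)   # collapse the columns axis: one value per row
print(col_totals.tolist())    # [47, 13, 4027]
print(row_totals.tolist())    # [2028, 2059]

# drop() needs the axis to know whether 'month' is a row label or a column label
without_month = df.drop('month', axis=1)   # axis='columns' is equivalent
print(list(without_month.columns))         # ['day', 'year']
```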

From here on, we will use the Series/DataFrame as the data structure of choice in the examples when explaining things. Keep in mind that anything that applies to Series probably applies to DataFrames too, but it may not be the case the other way around.

Seaborn

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures.

Here is some of the functionality that seaborn offers:

  • A dataset-oriented API for examining relationships between multiple variables
  • Specialized support for using categorical variables to show observations or aggregate statistics
  • Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data
  • Automatic estimation and plotting of linear regression models for different kinds of dependent variables
  • Convenient views onto the overall structure of complex datasets
  • High-level abstractions for structuring multi-plot grids that let you easily build complex visualizations
  • Concise control over matplotlib figure styling with several built-in themes
  • Tools for choosing color palettes that faithfully reveal patterns in your data

Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
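As a rough sketch of what "dataset-oriented" means in practice, the example below passes a whole DataFrame to a plotting function and refers to columns by name. The `passengers` table is hypothetical and invented for this illustration, and the Agg backend is selected only so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd
import seaborn as sns

# Hypothetical passenger records, invented for this illustration
passengers = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "survived": ["no", "yes", "yes", "yes", "no", "no"],
})

# Columns are named as strings; seaborn maps them to x, y, and point color
ax = sns.scatterplot(data=passengers, x="age", y="fare", hue="survived")

# The axis labels are taken from the column names automatically
print(ax.get_xlabel(), ax.get_ylabel())  # age fare
```

The function returns a regular matplotlib Axes, so any further styling can be done with the usual matplotlib methods.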

Exploring Seaborn Plots

The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.

Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood) but the Seaborn API is much more convenient.

Pair plots

When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. These are very useful for exploring correlations in multidimensional data, when you'd like to plot all pairs of variables against each other.

We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:

Input:

iris = sns.load_dataset("iris")
iris.head()

Output:

     sepal_length  sepal_width  petal_length  petal_width species
  0           5.1          3.5           1.4          0.2  setosa
  1           4.9          3.0           1.4          0.2  setosa
  2           4.7          3.2           1.3          0.2  setosa
  3           4.6          3.1           1.5          0.2  setosa
  4           5.0          3.6           1.4          0.2  setosa

Visualizing the multidimensional relationships among the samples is as easy as calling sns.pairplot: