Plots With Python
Visualizing Distribution Plots in Python Using Seaborn
Francis Onyango
Data visualization is the graphical or pictorial presentation of data for a better understanding of its correlations and patterns. It is a very important step in data science, so every data scientist or analyst should be able to uncover the patterns hidden in data through visualization. In any machine learning project, a solid understanding of the data gained through EDA (Exploratory Data Analysis) makes the modeling work a lot easier.
Seaborn
Python offers different graphing libraries with lots of features. In this article, we will learn data visualization techniques in Python using Seaborn.
According to the seaborn official page,
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
We will use data from seaborn's built-in datasets. This article will focus on the syntax and not on interpreting the graphs. We will learn how to create the following distribution plots:
- Distplot
- Jointplot
- Pairplot
- Rugplot
- Kdeplot
These plots show the distribution of the dataset. This is the first part of my series on data visualization in Python using seaborn. In the next articles, we will delve into more complex visualizations using seaborn.
Now, let us start by importing seaborn and loading the dataset. We will use the planets dataset from seaborn.
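The import and loading step might look like the sketch below. It assumes the data is fetched with sns.load_dataset("planets"), which downloads seaborn's example data on first use and so needs an internet connection. The helper name load_planets is only illustrative.

```python
import seaborn as sns


def load_planets():
    """Return seaborn's example 'planets' dataset as a pandas DataFrame.

    The data is downloaded on first use and cached locally afterwards.
    """
    return sns.load_dataset("planets")


# Typical use in a notebook or script:
# planets = load_planets()
# planets.head()  # first five rows
```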
From there, let us do a few more checks on the dataset before we start visualizing. This step is important because it gives us a glimpse into the dataset before any plotting. For a fuller explanation, read my previous article on Exploratory Data Analysis in Python for Beginners. We will use the .shape and .columns attributes and the .info() method to get the following results.
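A minimal sketch of these checks is shown below. The helper name describe_dataset is hypothetical, and it is written to accept any DataFrame so that it does not depend on how planets was loaded.

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame) -> tuple:
    """Print basic structural information about a DataFrame and return its shape."""
    print(df.shape)          # (number of rows, number of columns)
    print(list(df.columns))  # column names
    df.info()                # dtype and non-null count for each column
    return df.shape


# With the article's data: describe_dataset(planets)
```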
The dataset has 6 columns and 1,035 rows. It also contains both categorical and numerical columns. We will also check whether there are any missing values in the dataset.
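One common way to run this check is to count the null entries per column, as in the sketch below. The function name count_missing is an assumption made for illustration.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN) values in each column."""
    return df.isnull().sum()


# With the article's data: print(count_missing(planets))
```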
It is evident that there are missing values in orbital_period, mass and distance columns.
Next, let us remove the rows with missing values. Kindly note that when dealing with real-world data, it is important to decide deliberately how you will handle missing values. Deleting them can sometimes be very costly in terms of its impact on the model you plan to create. If you want to understand more about handling missing values, please read my previous article on 4 Techniques for Treating Missing Values in your Data.
The following syntax will help us remove the missing values:
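A sketch of the removal step, assuming pandas' dropna() is used to drop every row that has at least one missing value (the helper name is hypothetical):

```python
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without any row that contains a missing value."""
    return df.dropna()


# With the article's data:
# planets = drop_incomplete_rows(planets)
# print(planets.shape)
```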
Our dataset now has only 498 rows. Let us use this to create the plots.
1. Dist plot
First and foremost, we will create dist plots. A dist plot shows the distribution of a univariate set of observations. Note that sns.distplot() has been deprecated since seaborn 0.11, with histplot and displot as its replacements. Let us plot the distribution of the mass column using distplot. The syntax is quite simple: call sns.distplot() and pass the column we want to plot, as follows:
We can remove the KDE layer (the line on the plot) by passing kde=False, which leaves only the histogram, as follows:
2. Joint plot
After that, we will create a joint plot. A joint plot shows bivariate data, and the kind parameter selects how it is drawn: for example 'scatter', 'hex', 'kde' or 'reg'. The general syntax requires us to specify the x and y variables, the data we want to use and the kind of plot we need. Let us plot the year column against the distance column using kind='scatter'.
Let us repeat the joint plot using kind='hex'.
We can note the difference between the two plots. Let us do the same using 'reg' and 'kde' as follows.
3. Pair plot
The third distribution plot is the pair plot. A pair plot draws pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). Let us plot the whole planets dataframe. The general syntax only requires specifying the data to be used. However, you can also specify the hue and palette, which is especially useful when dealing with categorical data.
When the dataframe contains a categorical column, we can pass it as the hue and choose a palette (color scheme) as follows:
4. Rug plot
The fourth one is the rug plot. A rug plot is a plot of data for a single quantitative variable, displayed as marks along an axis.
5. KDE plot
Last but not least, we will create a KDE plot. KDE stands for Kernel Density Estimation. A KDE plot replaces every single observation with a Gaussian (Normal) distribution centered around that value, then sums these curves to produce a smooth estimate of the density.
A Q-Q plot, short for “quantile-quantile” plot, is often used to assess whether or not a set of data potentially came from some theoretical distribution. In most cases, this type of plot is used to determine whether or not a set of data follows a normal distribution.
This tutorial explains how to create a Q-Q plot for a set of data in Python.
Example: Q-Q Plot in Python
Suppose we have the following dataset of 100 values:
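Such a dataset could be generated as in the sketch below. The mean of 0, the standard deviation of 1 and the seed are assumptions chosen for illustration; only the sample size of 100 and the use of numpy.random.normal() come from the text.

```python
import numpy as np

# Fix the seed so the example is reproducible (seed value is arbitrary)
np.random.seed(0)

# 100 values drawn from a normal distribution with mean 0 and std. dev. 1
data = np.random.normal(0, 1, 100)

print(data[:5])  # first five values
```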
To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:
In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn’t show your actual data, but instead it represents where your data would be if it were normally distributed.
The y-axis displays the quantiles of your actual data. If the points fall roughly along a straight 45-degree line, the sample quantiles match the theoretical ones, which suggests that the data is normally distributed.
We can see in our Q-Q plot above that the data values closely follow the 45-degree line, which means the data is likely normally distributed. This shouldn’t be surprising since we generated the 100 data values by using the numpy.random.normal() function.
Consider instead what happens if we generate a dataset of 100 uniformly distributed values and create a Q-Q plot for that dataset:
The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.
Notes on Q-Q Plots
Keep in mind the following notes about Q-Q plots:
- Although a Q-Q plot isn’t a formal statistical test, it offers an easy way to visually check whether or not a data set is normally distributed.
- Be careful not to confuse Q-Q plots with P-P plots, which are less commonly used and not as useful for analyzing data values that fall on the extreme tails of the distribution.