Data Visualization Contd.

Divij Sharma
5 min read · Dec 29, 2023
Photo by Carlos Muza on Unsplash

In the last article I discussed the best practices of data visualization.

Data visualization is important because it helps us understand the distribution, trend, relationship, comparison and composition of data. It also helps decision makers quickly examine large volumes of data and discover hidden patterns and insights.

Each chart type has its own purpose. In this article I discuss the main chart types, what each one shows, and when to use it.

In Python, the most commonly used packages for generating charts are:

  1. Matplotlib: Matplotlib¹ is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib has complete 2D support and limited 3D support.
  2. Seaborn: Seaborn² is a Python data visualization library that is based on Matplotlib and integrates closely with pandas data structures³. It provides a high-level interface for drawing attractive and informative statistical graphics.

The different types of charts are:

Histogram

Histogram by IkamusumeFan via Wikimedia Commons CC BY-SA 4.0

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.⁴

A histogram should be used to display the distribution of a variable. It is the chart most commonly used to check whether a variable is normally distributed. A histogram is used for numerical data, not categorical data. Its bins do not overlap, which makes it a simple way to see the shape of the distribution and spot outliers in the data. Because the x-axis is a numeric scale, the bars of a histogram cannot be reordered.
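As a minimal sketch of the idea, the snippet below draws a histogram with Matplotlib. The data is synthetic (500 draws from a normal distribution), and the bin count of 20 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only: 500 draws from a normal distribution.
rng = np.random.default_rng(seed=42)
values = rng.normal(loc=50, scale=10, size=500)

fig, ax = plt.subplots()
# 20 equal-width, non-overlapping bins spanning the range of the data.
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of a numeric variable")
```

Call `plt.show()` or `fig.savefig("histogram.png")` to view the result. A roughly bell-shaped outline suggests the variable may be close to normally distributed, while isolated bars far from the rest hint at outliers.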

Box Plot

Box Plot by Schutz via Wikimedia Commons CC BY-SA 4.0

A box plot (also called a boxplot, box-and-whisker plot or box-and-whisker diagram⁵) graphically indicates the spread of data. It summarizes the data with five statistics: minimum, first quartile, median, third quartile and maximum. The box represents the data between the first and third quartiles. The lines extending from the box (called whiskers) indicate variability outside the upper and lower quartiles.

It also indicates outliers in the data, which are drawn as dots beyond the whiskers.
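A rough sketch of a box plot using Matplotlib's `Axes.boxplot` follows. The data is synthetic, with one extreme value planted on purpose, and all variable names are assumptions for illustration. With Matplotlib's default settings the whiskers stop at the farthest points within 1.5 times the interquartile range from the box, and anything beyond that is drawn as an outlier dot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration: 200 normal draws plus one planted extreme value.
rng = np.random.default_rng(seed=0)
values = np.append(rng.normal(loc=50, scale=5, size=200), [95.0])

# The quartiles that define the box and the line inside it.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

fig, ax = plt.subplots()
result = ax.boxplot(values)  # default whisker length is 1.5 * IQR
ax.set_ylabel("Value")
ax.set_title("Box plot with one outlier")

# Points drawn beyond the whiskers (the outlier dots).
fliers = result["fliers"][0].get_ydata()
```

Here `fliers` holds the values that Matplotlib plotted as outlier dots, which makes it easy to check the chart against the raw numbers.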

Violin Plot

Violin Plot by Pejman Mohammadi via Wikimedia Commons CC BY 2.5

A violin plot is a type of distribution plot that shows the density of values in the data. It can answer questions such as: "Are most of the values clustered around the median, or are they clustered around the minimum and maximum with nothing in the middle?"

A violin plot combines a box plot with a kernel density plot, so it also shows the peaks of the distribution. It is used to visualize the distribution of numerical data.

A wider section of the violin indicates a higher probability that the data takes on a given value, and a tapered or skinnier section indicates a lower probability.

A violin plot has a dot (.) or plus (+) marking the median, and the thick bar in the center represents the interquartile range.
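The following is a small sketch with Matplotlib's `Axes.violinplot`, using a synthetic bimodal sample so that the two bulges are easy to see. Note that Matplotlib's built-in violin marks the median with a horizontal line rather than a dot and does not draw the interquartile bar by default. The data and label are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic bimodal sample: two clusters with few values in between.
rng = np.random.default_rng(seed=1)
values = np.concatenate([rng.normal(20, 3, 300), rng.normal(40, 3, 300)])

fig, ax = plt.subplots()
# showmedians draws a line at the median; showextrema marks the minimum and maximum.
parts = ax.violinplot(values, showmedians=True, showextrema=True)
ax.set_xticks([1])
ax.set_xticklabels(["sample"])
ax.set_ylabel("Value")
ax.set_title("Violin plot of a bimodal sample")
```

For this sample the violin is wide near 20 and 40 and narrow around 30, which is exactly the "clustered at both ends with little in the middle" pattern that a box plot alone would hide.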

Bar Chart

Bar Chart by Skies via Wikimedia Commons
Stacked Bar Chart by RCraig09 via Wikimedia Commons CC BY-SA 4.0

Bar charts are usually used to compare categories of data. Each category is rendered as a separate bar and, unlike in a histogram, the categories of a bar chart can be reordered. A bar chart works best when we want to compare values across a small number of categories. It can be used for both static and over-time data. There are two types of bar chart: normal and stacked.
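A minimal sketch of both variants with Matplotlib is shown here. The regions and sales figures are hypothetical values made up for illustration. The normal chart sorts the bars by total to show that categories can be reordered, and the stacked chart uses the `bottom` argument of `Axes.bar` to place one series on top of the other.

```python
import matplotlib.pyplot as plt

# Hypothetical sales figures per region, for illustration only.
categories = ["North", "South", "East", "West"]
online = [120, 95, 140, 80]
in_store = [60, 110, 70, 90]

fig, (ax_plain, ax_stacked) = plt.subplots(1, 2, figsize=(10, 4))

# Normal bar chart: one bar per category, reordered from largest to smallest total.
totals = [a + b for a, b in zip(online, in_store)]
order = sorted(range(len(categories)), key=lambda i: totals[i], reverse=True)
sorted_categories = [categories[i] for i in order]
ax_plain.bar(sorted_categories, [totals[i] for i in order])
ax_plain.set_title("Total sales by region")

# Stacked bar chart: the second series starts where the first one ends.
ax_stacked.bar(categories, online, label="Online")
ax_stacked.bar(categories, in_store, bottom=online, label="In store")
ax_stacked.set_title("Sales by region and channel")
ax_stacked.legend()
```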

Line Chart

Line Chart by Leland McInnes at English Wikipedia CC BY-SA 4.0

A line chart is used to show data collected over a period of time. If there are only a few categories (for example, time periods), a bar chart can also be used, but when there are many, a line chart is preferred.
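As an illustration, the sketch below plots a synthetic monthly series with `Axes.plot`. The 24 months, the upward trend and the noise level are all assumptions chosen to make the example readable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic monthly series: a linear upward trend plus random noise.
months = np.arange(1, 25)  # 24 consecutive months
rng = np.random.default_rng(seed=7)
sales = 100 + 2.5 * months + rng.normal(0, 5, size=months.size)

fig, ax = plt.subplots()
# A single line connects the observations in time order.
(line,) = ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Sales over time")
```

With 24 points, bars would crowd the axis, while the connected line keeps the overall trend easy to follow.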

Scatter Plot

Scatter Plot by IkamusumeFan via Wikimedia Commons CC BY-SA 4.0
Scatter Plot by DanielPenfield — Own work, CC BY-SA 3.0

A scatter plot is one of the basic plots for visualizing the relationship between variables and the distribution of data. It works well with two numeric variables. The color of the dots can be used to distinguish the values of a categorical variable. For example, a scatter plot can show temperature against humidity, with the color of each dot representing the time of day (morning, afternoon, evening, night).

A scatter plot can suggest correlations between variables. If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation.⁶
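The temperature and humidity example can be sketched roughly as follows. The data is synthetic, and the falling relationship between the two variables is an assumption built into the generated numbers, not a claim about real weather. `np.corrcoef` gives the Pearson correlation coefficient, whose sign matches the direction in which the dots slope.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
periods = ["morning", "afternoon", "evening", "night"]
colors = {"morning": "tab:orange", "afternoon": "tab:red",
          "evening": "tab:purple", "night": "tab:blue"}

# Synthetic readings; humidity is generated to fall as temperature rises.
temperature = rng.uniform(10, 35, size=200)
humidity = 90 - 1.5 * temperature + rng.normal(0, 5, size=200)
time_of_day = rng.choice(periods, size=200)

fig, ax = plt.subplots()
# One scatter call per category so each gets its own color and legend entry.
for period in periods:
    mask = time_of_day == period
    ax.scatter(temperature[mask], humidity[mask],
               color=colors[period], label=period, alpha=0.7)
ax.set_xlabel("Temperature")
ax.set_ylabel("Humidity")
ax.legend(title="Time of day")

# Pearson correlation: negative when the dots slope from upper left to lower right.
r = np.corrcoef(temperature, humidity)[0, 1]
```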

Bubble Plot

Bubble Plot by George Huhn via Wikimedia Commons CC BY-SA 4.0

A bubble chart is used to represent the relationship between three numeric variables. Each bubble represents a single data point. The three variables are represented by the x-axis, the y-axis and the bubble size. Additionally, the color of the bubble can represent a further dimension (for example, regions such as APAC or EMEA). Bubbles can also be animated to add another dimension (for example, bubbles shrinking or growing to show change over time).
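In Matplotlib, a bubble chart can be sketched as a scatter plot whose marker size varies per point. The figures below are hypothetical and the scaling factor of 5 is an arbitrary choice for readability. The `s` argument of `Axes.scatter` is the marker area in points squared, so scaling it linearly keeps the bubble area proportional to the third variable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for six countries, for illustration only.
gdp_per_capita = np.array([5, 12, 20, 35, 48, 60])       # x-axis
life_expectancy = np.array([62, 68, 72, 76, 80, 82])     # y-axis
population = np.array([200, 50, 120, 30, 80, 10])        # bubble size
region = ["APAC", "EMEA", "APAC", "EMEA", "APAC", "EMEA"]  # bubble color
region_colors = {"APAC": "tab:blue", "EMEA": "tab:green"}

fig, ax = plt.subplots()
# Marker area scales linearly with population.
sizes = population * 5
bubbles = ax.scatter(gdp_per_capita, life_expectancy, s=sizes,
                     c=[region_colors[name] for name in region],
                     alpha=0.6, edgecolors="black")
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Life expectancy")
ax.set_title("Bubble size shows population")
```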

Pie Chart

Pie chart by M. W. Toews via Wikimedia Commons CC BY-SA 4.0

A pie chart is used to represent the composition of data, specifically the share of each category in the total, i.e. the percentage distribution. The circle is divided into one segment per category, with the area of each segment proportional to that category's percentage of the whole.

A pie chart should be used when there are fewer than 6 categories; otherwise the resulting chart will have many small segments and be too complex to understand. Pie charts are also not useful when the category values are similar to each other, because it is difficult to see the differences between segments.
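A small sketch with `Axes.pie` is given here, using a hypothetical monthly budget split into five categories. The `autopct` format string prints each segment's percentage of the total on the chart.

```python
import matplotlib.pyplot as plt

# Hypothetical budget with five categories, for illustration only.
labels = ["Rent", "Food", "Transport", "Savings", "Other"]
amounts = [40, 25, 10, 15, 10]

# Share of each category in the total, which the segment areas represent.
shares = [amount / sum(amounts) for amount in amounts]

fig, ax = plt.subplots()
# autopct formats each segment's percentage with no decimal places.
ax.pie(amounts, labels=labels, autopct="%1.0f%%", startangle=90)
ax.axis("equal")  # keep the pie circular
ax.set_title("Budget composition")
```

Note how the two 10 percent segments are hard to tell apart by eye, which is the weakness of pie charts when category values are similar.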

Conclusion

Data visualization is the first, and a very important, step in Exploratory Data Analysis. The charts discussed above are a starting point for understanding the data.

References

¹ https://matplotlib.org/stable/index.html

² https://seaborn.pydata.org/

³ https://seaborn.pydata.org/tutorial/introduction.html

⁴ https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch9/histo/5214822-eng.htm

⁵ https://en.wikipedia.org/wiki/Box_plot

⁶ https://en.wikipedia.org/wiki/Scatter_plot#Overview
