Facts with Facets
“Simpson’s Paradox: when a whole body of data displays one trend, yet when broken into subgroups, the opposite trend comes into view for each of those subgroups.”
― Cathy O’Neil, Weapons of Math Destruction
We all know that machine learning models deliver better results when trained on large volumes of data. But volume alone cannot help if the data quality is poor. Data quality is just as important for building well-performing machine learning models.
Data comes in various shapes and forms. It can be wrong or invalid, missing altogether, wrongly labelled, imbalanced, or full of extreme values, and so on. So it is critical that we clean dirty data before we proceed any further. To clean our data correctly, however, we first need to understand where it needs cleaning.
Better machine learning models also come from a better understanding of your data. It is therefore imperative that we understand our data both deeper and wider in order to build better models.
To understand our data better, we need not only statistical analysis but also visualization, which gives us a different view of the data. Visualizing the data shows where it needs correction or cleaning and supports extensive Exploratory Data Analysis.
Google has come up with an amazing library that is open source and can help study patterns on large volumes of data – Facets.
Try before you “buy”
Well, Facets is free. But before you even read the rest of this article to see how to use Facets, we suggest you take a look at Google's demo page, which lets you try Facets live. You can even load your own data to see the outcome. Visit: https://pair-code.github.io/facets/.
Facets comes in two components: Facets Overview and Facets Dive.
To start with, Facets provides you with a broad overview of your data. You can select what types of variables you would like to analyze.
You can choose to sort the display alphabetically or by percentage of missing values. The percentage of missing values is shown for each feature, and if it is high, it is automatically highlighted in red.
While you get to see the summary statistics like mean, median, minimum, maximum and percentage of missing values at a glance, you also get to visualize the distribution of the variables. You may also choose to change the chart type from standard to quantile type.
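To make those summary numbers concrete, here is a minimal sketch of how the missing-value percentage and the basic statistics could be computed by hand. It is not how Facets computes them internally. The sample rows, their values, and the helper names missing_percent and numeric_summary are made up for illustration.

```python
import statistics

# Hypothetical mini-sample; None marks a missing value.
rows = [
    {"species": "Adelie", "bill_length_mm": 39.1, "sex": "Male"},
    {"species": "Adelie", "bill_length_mm": None, "sex": None},
    {"species": "Gentoo", "bill_length_mm": 46.1, "sex": "Female"},
    {"species": "Chinstrap", "bill_length_mm": 49.5, "sex": None},
]


def missing_percent(rows, column):
    """Percentage of rows in which `column` is missing."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


def numeric_summary(rows, column):
    """Mean, median, minimum and maximum of the non-missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


columns = ["species", "bill_length_mm", "sex"]
# Same idea as sorting the Overview display by missing values.
by_missing = sorted(columns, key=lambda c: missing_percent(rows, c), reverse=True)
summary = numeric_summary(rows, "bill_length_mm")
```

Sorting by missing percentage puts the columns that need the most cleaning at the top, which is the main reason to use that sort order in the Overview.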
Another handy feature is the log option: check it to view a log-transformed distribution of the variable and verify whether the log transformation yields a better distribution, and hence one more fit for modelling.
Next, let us take a look at Facets Dive. The dictionary meaning of ‘facet’ is “one part of a subject that has many parts”, and that is exactly what Facets Dive does for us. It can ‘facet’ our data by rows and columns across multiple variables and give us more clarity on our data. Facets Dive provides an interactive interface for exploring relationships between multiple features across our dataset.
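The same kind of check can be done numerically outside Facets. The sketch below compares the skewness of a variable before and after a log transform. It says nothing about how Facets renders its log option; the sample values and the skewness helper are made up for illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample with one very large value.
raw = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

# log1p(x) = log(1 + x), which stays defined when x is 0.
logged = [math.log1p(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness closer to zero after the transform suggests that the logged variable is more symmetric, which is usually what "a better distribution" means in this context.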
Getting into action
We will use Google Colaboratory to demonstrate the Facets library.
|Note: If you use Jupyter in the local system, at the time of writing this article, the Jupyter notebook is not able to display both Facets Overview & Facets Dive visualization at the same time.|
To begin with, start a notebook in Google Colaboratory (https://colab.research.google.com/). Give it a filename and click the Connect button to allocate server resources.
As a next step, install the facets-overview library with pip. In a Colab cell the command is: !pip install facets-overview
Load the required data. In our demonstration we load the penguins data from the seaborn library.
We then create the feature statistics for the dataset and stringify them. To calculate the feature statistics, we use the generator class that ships with the facets-overview library.
|To know more about feature statistic generation read: https://pypi.org/project/facets-overview/|
The Overview visualization is powered by the feature statistics protocol buffer. The feature statistics protocol buffer messages store summary statistics for individual feature columns.
|To know more about Protocol Buffer Basics read: https://developers.google.com/protocol-buffers/docs/pythontutorial|
The feature statistics protocol buffer can be created for datasets by the facets-overview library. To create the proto from a pandas dataframe, use the generator class that the library provides.
Base64 encoding converts arbitrary bytes, whether binary or text, into ASCII characters. This keeps the data from getting corrupted when it is passed through text-based channels such as an HTML page.
The SerializeToString() function serializes the protocol buffer message and returns it as a byte string. The decode() function then converts the Base64-encoded bytes into a regular text string using the given encoding, here UTF-8. UTF-8 is the preferred encoding for email and web pages.
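Put together, the steps above can be sketched roughly as follows. The class GenericFeatureStatisticsGenerator and its ProtoFromDataFrames method are taken from the facets-overview documentation on PyPI; the helper names encode_proto_bytes and build_overview_protostr, and the dataset name "penguins", are our own and only illustrative.

```python
import base64


def encode_proto_bytes(raw: bytes) -> str:
    """Base64-encode serialized bytes and return them as a UTF-8 text string."""
    return base64.b64encode(raw).decode("utf-8")


def build_overview_protostr(df, name="penguins"):
    """Create feature statistics for a pandas dataframe and stringify them.

    Assumes the facets-overview package is installed. It is imported here,
    inside the function, so that the helper above also works without it.
    """
    from facets_overview.generic_feature_statistics_generator import (
        GenericFeatureStatisticsGenerator,
    )

    generator = GenericFeatureStatisticsGenerator()
    # Each entry pairs a display name with the dataframe to summarize.
    proto = generator.ProtoFromDataFrames([{"name": name, "table": df}])
    # SerializeToString() returns bytes, which we then Base64-encode.
    return encode_proto_bytes(proto.SerializeToString())
```

In a Colab notebook this would typically be called as protostr = build_overview_protostr(df), where df is the penguins dataframe loaded earlier, for example with seaborn's load_dataset("penguins").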
Finally, display the Facets Overview visualization for the data. Facets Overview and Facets Dive both use HTML imports, so to use either of them we first need to load the required polyfill.
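A minimal sketch of the display step could look like this. The two URLs (the webcomponents polyfill and the Facets HTML import) follow the Facets README and may change over time, so treat them as assumptions. The function name overview_html is ours.

```python
# Assumed URLs, as given in the Facets README; they may change over time.
POLYFILL_URL = (
    "https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/"
    "webcomponents-lite.js"
)
FACETS_IMPORT_URL = (
    "https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/"
    "facets-dist/facets-jupyter.html"
)

OVERVIEW_TEMPLATE = """
<script src="{polyfill}"></script>
<link rel="import" href="{facets}">
<facets-overview id="elem"></facets-overview>
<script>
  document.querySelector("#elem").protoInput = "{protostr}";
</script>
"""


def overview_html(protostr: str) -> str:
    """Fill the template with the Base64-encoded feature statistics."""
    return OVERVIEW_TEMPLATE.format(
        polyfill=POLYFILL_URL, facets=FACETS_IMPORT_URL, protostr=protostr
    )
```

In a notebook, the resulting string can be rendered with IPython: import display and HTML from IPython.display and call display(HTML(overview_html(protostr))).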
The above code generates the overview as follows:
The Overview visualization shows us the summary statistics for the numeric and categorical variables. It also shows the distribution of the variables, using histograms for numeric variables and bar charts for categorical variables. The missing percentage is displayed in red.
We can sort the display by alphabetical order. We can also choose the types of variables we would like to view.
Chart types present us with different views. With chart type as Quantile we get to see the distribution of the data in quantile plots.
To display the Dive visualization, we run the below code:
In the above code, we convert the dataframe to a JSON object. The Dive visualization, like the Overview visualization, uses HTML imports.
|To know more on dataframe to JSON conversion read: https://www.w3resource.com/pandas/dataframe/dataframe-to_json.php|
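The Dive step can be sketched in the same way. To keep the example runnable without pandas, it builds the JSON from a plain list of records with the standard json module; with a dataframe you would typically use df.to_json(orient="records") instead. The URLs follow the Facets README and are assumptions, and the function name dive_html and the sample record are ours.

```python
import json

# Assumed URLs, as given in the Facets README; they may change over time.
POLYFILL_URL = (
    "https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/"
    "webcomponents-lite.js"
)
FACETS_IMPORT_URL = (
    "https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/"
    "facets-dist/facets-jupyter.html"
)

DIVE_TEMPLATE = """
<script src="{polyfill}"></script>
<link rel="import" href="{facets}">
<facets-dive id="elem" height="600"></facets-dive>
<script>
  var data = {jsonstr};
  document.querySelector("#elem").data = data;
</script>
"""


def dive_html(records) -> str:
    """Embed a list of row dictionaries as JSON in the Dive template."""
    jsonstr = json.dumps(records)  # None becomes null in the JSON output
    return DIVE_TEMPLATE.format(
        polyfill=POLYFILL_URL, facets=FACETS_IMPORT_URL, jsonstr=jsonstr
    )


# Hypothetical single record, only to show the shape of the input.
sample_html = dive_html([{"species": "Adelie", "sex": None}])
```

As with the Overview, the string can be rendered in a notebook with display(HTML(sample_html)) from IPython.display.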
We pass the JSON object to the HTML template that generates the below visualization:
The dataset has 344 observations and each data point belongs to a species. The Dive visualization has automatically applied colors by species to the data points.
With the x-axis set to the sex variable, we can view the data from another dimension.
We add one more dimension to the y-axis and set it to the island variable. We can now study our data from another angle.
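Conceptually, faceting by two variables places every row into a cell of a grid. The sketch below counts rows per (sex, island) cell. It only illustrates the idea and is not how Facets Dive is implemented; the rows and the helper name facet_counts are made up for illustration.

```python
from collections import Counter

# Hypothetical rows; the real dataset has 344 observations.
rows = [
    {"species": "Adelie", "island": "Torgersen", "sex": "Male"},
    {"species": "Adelie", "island": "Torgersen", "sex": "Female"},
    {"species": "Gentoo", "island": "Biscoe", "sex": "Male"},
    {"species": "Adelie", "island": "Biscoe", "sex": "Male"},
    {"species": "Chinstrap", "island": "Dream", "sex": "Female"},
]


def facet_counts(rows, x, y):
    """Count rows in each (x value, y value) cell of the facet grid."""
    return Counter((row[x], row[y]) for row in rows)


grid = facet_counts(rows, "sex", "island")
for (sex, island), count in sorted(grid.items()):
    print(f"sex={sex:<6} island={island:<9} rows={count}")
```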
Further, we can view the relationship between two numeric variables within the selected dimensions. We choose bill_depth_mm as the x-axis and bill_length_mm as the y-axis for the scatter plot. We get to study the relationship between bill_depth_mm and bill_length_mm by sex and by island, with colors clearly denoting the species.
With a few lines of ready code and no additional coding required, exploring data in depth and breadth becomes simple. Facets provides a wide variety of tools for playing around with your data, and lets you explore it far more easily than you could with many lines of hand-written code.
The Python code is available at this URL in case you want to run Facets yourself: