There are many visualization libraries in Python: Matplotlib, Seaborn, Plotly, Bokeh, Lightning and Pandas are some of the options a user has. Every library has its own ups and downs, and the tool we use to visualize largely depends on our needs. All these visualization APIs are inspired by the principles of Wilkinson's Grammar of Graphics. Many users today use different libraries to visualize their data in Jupyter Notebook, and at times we run into problems that are still not addressed by the developers of the individual visualization libraries. So Jake VanderPlas and Brian Granger of the IPython project decided to build a generic API as a solution, and that is how the Altair project started. It is a completely new library, but I believe it has a lot to offer the IPython community.
Let's all accept this fact: I have used Tableau, Excel and R, and visualization with these tools is really easy because they offer drag-and-drop interfaces, whereas we have to understand a lot of syntax and semantics to use a visualization library in Python. I completely agree that we cannot custom-visualize a very large dataset in Excel or Tableau, as it eats up all the RAM and slows down the system, so libraries like these are the only scalable solution for visualizing large datasets. Let's first list the drawbacks of the visualization libraries in Python.
There are three components in statistical visualization: the data, the marks that represent it, and the encodings that map data values to visual properties.
It is very important to have a set of well-defined rules that provide a set of abstractions for visualizing data. It would be very nice to express statistical visualization with a small number of abstractions; these abstractions form a visualization grammar, and they are defined so that a model can be reused. We have different levels of abstraction, but not every visualization library available today follows them strictly.
The developers of Altair claim that most libraries follow most of these rules but fall short on two of them.
Transformation: We believe that transformation of data happens before visualization. But when we do statistical visualization, it is really difficult to do all the data transformation up front; we really have to build a data transformation pipeline into the visualization process itself.
Scaling: Scaling is nothing but mapping the transformed tidy data values to visual properties like shapes, colors and facets (plane or axis). This mapping is incomplete in current libraries, and this is where most of them lack representations. Libraries like Matplotlib can map quantitative columns to visual properties, but they lack representations for categorical and time-series data.
There are generally two types of APIs: declarative and imperative. The following table compares the two.
Let's assume you are building a decision tree. At each node you would like to determine the attribute that best classifies the dataset. To identify it, you have to run a statistical test like information gain on each attribute and pick the best one. But this process is error prone: we humans make lots of mistakes in such calculations, and it is a horror story to debug a multilevel decision tree when something goes wrong in them. You know you are capable of doing this math, but do you want to prove it each and every time?
Given a magic function that does this mundane statistical math correctly every time you pass it the parameters, wouldn't you prefer that over writing the function yourself?
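The decision-tree analogy above is easy to make concrete. Here is a small sketch (pure Python, nothing Altair-specific; the function names are my own) of the "mundane statistical math" of information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is partitioned into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Splitting a perfectly mixed set into two pure halves gains a full bit.
labels = ['yes', 'yes', 'no', 'no']
gain = information_gain(labels, [['yes', 'yes'], ['no', 'no']])
```

Doing this by hand at every node of a multilevel tree is exactly the error-prone bookkeeping a declarative API takes off your plate.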
Now let's concentrate on the space of visualization libraries. The image below compares declarative and imperative visualization libraries along two other important characteristics: ease of use and representational power.
You may be wondering why Altair is not in the declarative API section of the previous image; at least, I have been telling you so far that Altair is a statistical visualization library.
But!
Let me confess a truth: Altair is not fully a visualization library, because it does not do any rendering of visualizations.
We all know D3.js renders rich visualizations, but it is verbose: we need to write a lot of code to render even a simple bar chart. Vega is a visualization grammar, a declarative format for creating and sharing interactive visualization designs, built on D3; it is again verbose, as even a simple bar chart implementation in Vega shows. Vega-Lite is a high-level visualization grammar that provides a concise JSON syntax for rapid generation of visualizations to support analysis; it is strictly for declarative statistical visualizations. Check how well it simplifies the visualization of a bar chart here.
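To give a feel for what this grammar looks like, here is a hand-written sketch of a minimal Vega-Lite bar-chart spec, expressed as a Python dict of the kind Altair ultimately emits as JSON. The `category`/`count` field names are made up for illustration:

```python
import json

# Sketch of a minimal Vega-Lite bar chart spec: one mark, inline data,
# and an encoding mapping fields to the x and y channels.
bar_spec = {
    "mark": "bar",
    "data": {"values": [{"category": "A", "count": 2},
                        {"category": "B", "count": 7}]},
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
    },
}
print(json.dumps(bar_spec, indent=2))
```

Compare this handful of declarative keys with the imperative D3 code needed for the same chart; that gap is what Vega-Lite, and by extension Altair, is exploiting.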
The developers of Altair did not try to reinvent the wheel. They wanted to use this powerful visualization grammar, which abstracts away all the complexity of visualization, so they created a Python API that emits type-checked JSON handed to Vega-Lite, which in turn is interpreted by Vega and rendered by D3. This integration is well appreciated by other visualization library communities such as Matplotlib's, which is also planning to create an API like Altair's so that Vega/Vega-Lite become the lingua franca of data visualization.
Before we start, we should install Altair:
conda install altair --channel conda-forge
To do statistical visualization, you need three distinct components working together. Let's understand the syntax of these components separately.
The data that we generally visualize is a tidy dataframe. This can be your pandas dataframe.
import pandas as pd
data = pd.DataFrame({'a': list('AABCCDDEE'),
                     'b': [2, 7, 4, 1, 2, 6, 8, 4, 7]})
data
 | a | b |
---|---|---|
0 | A | 2 |
1 | A | 7 |
2 | B | 4 |
3 | C | 1 |
4 | C | 2 |
5 | D | 6 |
6 | D | 8 |
7 | E | 4 |
8 | E | 7 |
The visual properties of the data are defined in the Chart object. The Chart object has a variety of functions that can be applied to a dataframe passed to it; it can also directly take in a JSON object. Once you load the data into a Chart object, you define the visual properties. Consider it like a pipeline and follow the instructions one after another.
from altair import Chart
chart = Chart(data)
chart
You can see that the chart object has already visualized the data even before you define its visual properties. What you can infer from this chart is that all the rows of the dataframe overlap one another with a default mark type called mark_point. Now let's add visual properties to it.
from altair import X, Y, Axis

chart.mark_bar(color='#c70d08').encode(
    X('a', axis=Axis(title='Category')),
    Y('b', axis=Axis(title='Count')),
)
chart
That was so simple! The Chart object has many mark functions that let you define the kind of chart to be drawn; play around with the options. You can define properties like the color of the chart inside the 'mark_*' function ('#c70d08' is an example color code; you can pick your own from here). Whatever you define inside encode() are called channels, and they are Python classes themselves with init attributes; you can find more details about encoding channels here. There are different types of encoding channels: the classes X and Y are called position encoding channels, and the last link has an example implementation of most of the channel types.
Notice the links to the Vega editor and View Source above. As you click those links you can see how your chart definition is transformed into a type-checked Vega-Lite spec, which is ultimately rendered by D3.
Another important feature of Altair is automatic data type interpretation. We did not specify the data type of either axis; Altair intelligently infers the type of the data and emits it accordingly. But you can always specify the data type yourself. Altair classifies data into four types.
Data type | Shorthand code | Description |
---|---|---|
quantitative | Q | a continuous real-valued quantity |
ordinal | O | a discrete ordered quantity |
nominal | N | a discrete unordered category |
temporal | T | a time or date value |
So you can also define the Chart this way.
chart.mark_bar(color='#c70d08').encode(
    X('a:N', axis=Axis(title='Category')),
    Y('b:Q', axis=Axis(title='Count')),
)
chart
chart.to_dict(data=False)
{'config': {'mark': {'color': u'#c70d08'}},
'encoding': {'x': {'axis': {'title': u'Category'},
'field': u'a',
'type': 'nominal'},
'y': {'axis': {'title': u'Count'}, 'field': u'b', 'type': 'quantitative'}},
'mark': 'bar'}
The chart you defined can be converted to a dictionary so its visual properties can be inspected. Let's see a slightly fancier chart now. For simplicity I am going to use the datasets that are pre-defined in Altair; the datasets are in JSON format and can be loaded directly, and we can also use the load_dataset function to get the dataframe version of a dataset.
# in-built data sets
from altair.datasets import list_datasets
list_datasets()
[u'flights-3m',
u'weather',
u'crimea',
u'us-10m',
u'driving',
u'sf-temps',
u'flights-5k',
u'flights-20k',
u'barley',
u'miserables',
u'birdstrikes',
u'sp500',
u'world-110m',
u'monarchs',
u'seattle-temps',
u'jobs',
u'cars',
u'weball26',
u'flights-2k',
u'seattle-weather',
u'gapminder-health-income',
u'anscombe',
u'unemployment-across-industries',
u'stocks',
u'population',
u'iris',
u'climate',
u'github',
u'airports',
u'countries',
u'flare',
u'burtin',
u'budget',
u'flights-10k',
u'flights-airport',
u'movies',
u'points',
u'wheat',
u'budgets',
u'gapminder']
from altair import load_dataset, Color, Scale, SortField, Row, Column
car_data = load_dataset('cars')
car_data.head(5)
 | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
---|---|---|---|---|---|---|---|---|---|
0 | 12.0 | 8 | 307.0 | 130.0 | 18.0 | chevrolet chevelle malibu | USA | 3504 | 1970-01-01 |
1 | 11.5 | 8 | 350.0 | 165.0 | 15.0 | buick skylark 320 | USA | 3693 | 1970-01-01 |
2 | 11.0 | 8 | 318.0 | 150.0 | 18.0 | plymouth satellite | USA | 3436 | 1970-01-01 |
3 | 12.0 | 8 | 304.0 | 150.0 | 16.0 | amc rebel sst | USA | 3433 | 1970-01-01 |
4 | 10.5 | 8 | 302.0 | 140.0 | 17.0 | ford torino | USA | 3449 | 1970-01-01 |
car_Chart = Chart(car_data).mark_circle().encode(
    X('Horsepower', axis=Axis(title='Horsepower')),
    Y('Miles_per_Gallon', axis=Axis(title='Miles per Gallon')),
    Color('Cylinders'),
)
car_Chart
We generally know that cars with more horsepower give less mileage. Does increasing the number of cylinders in the engine impact the miles per gallon? See for yourself! Fancy, right? :) Can we do better?
car_Chart.to_dict(data=False)
{'encoding': {'color': {'field': u'Cylinders', 'type': 'quantitative'},
'x': {'axis': {'title': u'Horsepower'},
'field': u'Horsepower',
'type': 'quantitative'},
'y': {'axis': {'title': u'Miles per Gallon'},
'field': u'Miles_per_Gallon',
'type': 'quantitative'}},
'mark': 'circle'}
'''We see that Cylinders is interpreted as a quantitative type.
But from prior knowledge we know that the number of cylinders in a car
takes only a handful of discrete values, so it can be treated as categorical.
Let's try overriding Altair's data type interpretation.'''
car_Chart = Chart(car_data).mark_circle().encode(
    X('Horsepower', axis=Axis(title='Horsepower')),
    Y('Miles_per_Gallon', axis=Axis(title='Miles per Gallon')),
    Color('Cylinders:N'),
)
car_Chart
Let me introduce some shorthand notation before we go over the next example:
Shorthand | Equivalent long-form |
---|---|
x='name' | X('name') |
x='name:Q' | X('name', type='quantitative') |
x='sum(name)' | X('name', aggregate='sum') |
x='sum(name):Q' | X('name', aggregate='sum', type='quantitative') |
If the encoding channel (i.e. X, Y, Row, Column) has a continuous (numerical) scale, we can sort it directly; the default sorting order is ascending. If the encoding channel is ordinal, we can use another encoding channel field as a sort definition object. More about this can be found here.
For a continuous variable, in descending order (ascending shown commented out):
Chart(car_data).mark_tick().encode(
    X('Horsepower', sort='descending'),
)

# try this for ascending:
# Chart(car_data).mark_tick().encode(
#     X('Horsepower'),
# )
# let's try sorting a categorical variable
Chart(car_data).mark_tick().encode(
    X('Origin', sort='descending'),
)
## You can see that it got sorted alphabetically. Now let's sort by
## another field, or by specifying an operation on the same field.
Chart(car_data).mark_tick().encode(
    X('Origin',
      sort=SortField(field='Origin', op='count')),
)
## Sorted by the number of cars from each origin.
## Now let's put things together in the next problem.
barley_Data = load_dataset('barley')
barley_Data.head()
 | site | variety | year | yield |
---|---|---|---|---|
0 | University Farm | Manchuria | 1931 | 27.00000 |
1 | Waseca | Manchuria | 1931 | 48.86667 |
2 | Morris | Manchuria | 1931 | 27.43334 |
3 | Crookston | Manchuria | 1931 | 39.93333 |
4 | Grand Rapids | Manchuria | 1931 | 32.96667 |
# Let me show you how the Row and Column classes in the encoding channels work.
# This graph helps us understand how a response depends on a conditioned explanatory variable.
'''
Question to answer: which site produced which variety of barley, for each year?
'''
Chart(barley_Data).mark_point().encode(
    color='year:N',
    row='site:O',
    x='mean(yield):Q',
    y=Y('variety:O',
        scale=Scale(
            bandSize=12.0,
        ),
        ## variety is an ordinal field; we use another field as the sort definition object here
        sort=SortField(
            field='yield',
            op='mean',
        ),
    ),
)
'''
Question to answer: how much yield was produced
at a particular site for a particular variety each year?
'''
Chart(barley_Data).mark_bar().encode(
    color='site:N',
    column='year:O',
    x='sum(yield):Q',  ## according to the Vega documentation, we cannot change the title
                       ## of field names derived from transformation functions
    y='variety:N',
)
This section introduces the data transformation pipeline that is embedded in the visualization process. Binning is the process of grouping similar data; aggregation is the process of applying a group function to a specific group. It follows the same concepts as pandas. Coming back to the cars dataset:
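Since binning and aggregation follow the same concepts as pandas, a pandas-only sketch of "bin a column, then count per bin" may help build intuition before the Altair version. The small `Horsepower` frame and bin edges below are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the cars data: just a Horsepower column.
df = pd.DataFrame({'Horsepower': [60, 75, 90, 95, 130, 150, 155, 210]})

# Binning: group similar values into intervals (roughly what bin=True does).
df['hp_bin'] = pd.cut(df['Horsepower'], bins=[50, 100, 150, 250])

# Aggregation: apply a group function per bin (here, a count).
counts = df.groupby('hp_bin', observed=True).size()
print(counts)
```

Altair's `bin=True` plus an aggregate channel expresses the same two steps declaratively, inside the chart definition itself.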
'''
Question to answer: How many cars are there with similar (mileage,horsepower)
'''
from altair import Size
car_Chart = Chart(car_data).mark_point().encode(
    X('Horsepower', bin=True),
    Y('Miles_per_Gallon', bin=True),
    Size('count(*):Q'),
)
car_Chart
'''
Question to answer: what is the average acceleration of the vehicles that have similar (mileage, horsepower)?
Does acceleration affect mileage? Do vehicles with high horsepower accelerate faster? Do they consume
more fuel per mile?
'''
car_Chart = Chart(car_data).mark_circle().encode(
    X('Horsepower', bin=True, title='HorsePower'),
    Y('Miles_per_Gallon', bin=True, title='Miles per Gallon'),
    size='count(*):Q',
    color='average(Acceleration):Q',
)
car_Chart
One of the coolest features of Altair is data transformation. Let me brief you on the problem it solves.
We will take a US population dataset. I want to display the demographics of the most recent census year in the data, 2000. With Altair I can visualize this specific slice without pre-processing my dataframe to extract the year-2000 rows.
census_Data = load_dataset('population')
census_Data.head()
# sex 1 = Male and 2= Female
 | age | people | sex | year |
---|---|---|---|---|
0 | 0 | 1483789 | 1 | 1850 |
1 | 0 | 1450376 | 2 | 1850 |
2 | 5 | 1411067 | 1 | 1850 |
3 | 5 | 1359668 | 2 | 1850 |
4 | 10 | 1260099 | 1 | 1850 |
from altair import Formula

color_range = Scale(range=['#B3E5FC', '#0288D1'])
male_or_female = Formula(field='gender',
                         expr='datum.sex == 1 ? "Male" : "Female"')
Chart(census_Data).mark_bar().encode(
    x='age:O',
    y='mean(people):Q',
    color=Color('gender:N', scale=color_range),
).transform_data(
    # inside transform_data we can refer to column names as attributes of datum
    filter='datum.year == 2000',
    calculate=[male_or_female],
)
## Interpret the graph this way: of the ~18.5M people between ages 0 and 5,
## about 10M are male and about 8.5M are female.
## A better representation follows below.
male_or_female = Formula(field='gender',
                         expr='datum.sex == 1 ? "Male" : "Female"')
Chart(census_Data).mark_bar().encode(
    color=Color('gender:N',
                scale=Scale(
                    range=['#B3E5FC', '#0288D1'],
                )),
    column=Column('age:O',
                  axis=Axis(
                      # orient the axis at the bottom
                      orient='bottom',
                  )),
    x=X('gender:N',
        axis=False,
        scale=Scale(
            # if you remove bandSize, the chart will enlarge
            bandSize=6.0,
        )),
    y=Y('sum(people):Q',
        axis=Axis(
            title='population',
        )),
).transform_data(
    calculate=[male_or_female],
    filter='datum.year == 2000',
).configure_facet_cell(
    strokeWidth=0.0,
)
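For readers more comfortable with pandas, the `filter` and `calculate` steps above correspond roughly to ordinary dataframe operations. A sketch with a tiny made-up stand-in for the census frame:

```python
import pandas as pd

# Tiny stand-in for the census data (sex: 1 = male, 2 = female).
census = pd.DataFrame({'year': [1990, 2000, 2000, 2000],
                       'sex': [1, 1, 2, 1],
                       'people': [5, 10, 8, 3]})

# filter='datum.year == 2000' -> a boolean row filter
subset = census[census['year'] == 2000].copy()

# calculate=[male_or_female] -> deriving a 'gender' column
subset['gender'] = subset['sex'].map({1: 'Male', 2: 'Female'})

# the sum(people) aggregation from the y channel
totals = subset.groupby('gender')['people'].sum()
```

The difference is that Altair keeps these steps inside the chart specification, so the original dataframe is never mutated.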
Many visualization libraries are really bad at representing time-series data. Let's see what Altair has to offer.
seattle_data = load_dataset('seattle-weather')
seattle_data.head()
 | date | precipitation | temp_max | temp_min | wind | weather |
---|---|---|---|---|---|---|
0 | 2012/01/01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
1 | 2012/01/02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
2 | 2012/01/03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
3 | 2012/01/04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
4 | 2012/01/05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
'''
Question to answer: which months in Seattle are the warmest and which are the coldest, across the years?
'''
temp_range = Formula(field='temp_range',
                     expr='datum.temp_max - datum.temp_min')
Chart(seattle_data).mark_line().encode(
    X('date:T', timeUnit='month'),
    y='mean(temp_range):Q',
).transform_data(
    calculate=[temp_range],
)
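The `calculate` plus `timeUnit='month'` combination above has a rough pandas equivalent, sketched here with a tiny synthetic stand-in for the seattle-weather frame:

```python
import pandas as pd

# Tiny synthetic stand-in for the seattle-weather data.
weather = pd.DataFrame({
    'date': pd.to_datetime(['2012-01-01', '2012-01-15',
                            '2012-02-01', '2013-01-03']),
    'temp_max': [12.8, 10.0, 11.7, 8.0],
    'temp_min': [5.0, 4.0, 7.2, 2.0],
})

# calculate: temp_range = temp_max - temp_min
weather['temp_range'] = weather['temp_max'] - weather['temp_min']

# timeUnit='month' + mean: average per calendar month, pooled across years
monthly = weather.groupby(weather['date'].dt.month)['temp_range'].mean()
```

Note how `timeUnit='month'` pools the same calendar month across different years, which is exactly what grouping on `dt.month` (rather than on year and month) does here.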
So far, you have seen the most basic functionalities of Altair Viz. The developers are still working on the API and I am sure it is only going to get better.
Altair is definitely a constrained visualization model, but I personally like the way it separates the statistical visualization problem into understandable components. It is most suitable for exploratory data analysis over large datasets, and it is also trying to converge on a single data model, which reduces the learning curve for developers. If you are really interested in learning more, copy and paste this code into a new cell and start playing around!
from altair import tutorial
tutorial()