To get some EDA experience working with non-HEP data, I’m starting this series of exercises, which are meant to take a lot of the tools I’ve been learning and apply them to some popular datasets. I’m not expecting to do anything with this data that hasn’t been done before, but the practice should prep me to work with more novel datasets.

I’m recording everything here as a way of ensuring that I spend time reflecting on the data and what I’ve learned by playing with it.

Dataset

The Auto MPG dataset records basic attributes of about 400 cars produced between 1970 and 1982. The nine features can be split into three categories:

  • Physical features: number of cylinders, displacement, horsepower, weight, acceleration
  • What I’ll call “meta” features: car name, model year, “origin” (From what I’ve gathered…1 = America, 2 = Europe, 3 = Asia)
  • Target feature: miles per gallon (MPG)

I use the phrase “target feature” simply to mean that this is the feature dependent on the rest. The dataset is typically used to try to predict MPG, but we’ll see if that’s what we do; it doesn’t sound terribly interesting compared to finding relationships between the other features.

Initial Expectations

I’m not a huge car buff, but a physics background and the two “Small Engines” courses I took in high school make it easy to reason that many of the physical characteristics of the cars are correlated.

  • Horsepower is a unit of power, which is energy per unit time. The power needed to move the car is given by \(P = \frac{mass \cdot accel. \cdot distance}{time}\). There’s a subtlety here though: the “acceleration” for cars is typically given as the time it takes to get from 0 to 60 mph. Using \(mass = weight / g\), \(a = 60/t_a\), and \(distance = \frac{1}{2} a t_a^2\) (where \(t_a\) is the value for “acceleration” in the dataset; the full substitution is written out just after this list), we can work out

    \[P = \frac{weight}{t_a} \cdot c_1(\vec{\alpha})\]
    • where \(c_1(\vec{\alpha})\) is some factor (possibly a function of other parameters \(\vec{\alpha}\)) that takes care of unit conversions and other effects not modeled by the variables we have access to (weight and \(t_a\)).
  • The horsepower produced by the engine (conceptually different from the previous “power needed to move the car” but eventually equivalent when the car has to drive) is a function of the engine displacement. Energy is derived from the compressed air-fuel mixture being ignited, driving the piston down. The more mixture you can take in before compression (the displacement), the more energy you extract to the drive shaft on ignition. Displacement is the volume swept by all of the pistons - so more cylinders means more displacement; \(P_\text{engine} \propto displacement \propto N_\text{cylinder}\). We can be a bit more specific by saying

    \[P_{\text{per rev.}} = V_{g} \cdot c_2(\vec{\alpha})\]
    • where \(V_{g}\) is the volume of gas ignited per revolution (the displacement) and \(c_2(\vec{\alpha})\) is another factor that deals with unit conversions, the energy density of the gas, efficiency to extract energy from the gas, etc.
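For completeness, here is the substitution behind the first relation. Starting from \(P = \frac{m \cdot a \cdot d}{t_a}\) and plugging in \(a = 60/t_a\) and \(d = \frac{1}{2} a t_a^2\):

\[P = \frac{m \cdot a \cdot \frac{1}{2} a t_a^2}{t_a} = \frac{1}{2} m a^2 t_a = \frac{1}{2} \cdot \frac{weight}{g} \left(\frac{60}{t_a}\right)^2 t_a = \frac{weight}{t_a} \cdot \frac{1800}{g}\]

The factor \(1800/g\) (along with the remaining unit conversions) is exactly what gets absorbed into \(c_1(\vec{\alpha})\).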

Finally, it’s no secret that different regions of the world have different “philosophies” when it comes to producing and owning cars, philosophies that haven’t changed much since the 1970s. Americans notoriously love their big gas-guzzling trucks and muscle cars (ex. Ford Expeditions and Mustangs). Europeans have an affinity for slick but powerful sports cars and saloons (ex. Audi, BMW, Ferrari). Asia has dominated the consumer market with cars that are small, efficient, and long-lasting (ex. Honda Accord, Toyota Prius).

Cars in the 1970s may have been made by a different set of manufacturers and in different styles than today, but these cultural affinities were much the same, so we should expect the “origin” attribute to be an important categorical feature when examining the physical features of the cars.

Exploration

Based on the reasoning above, we already have some ideas about how to build more useful features from this base set but let’s start by simply plotting all of the features as histograms and looking at the correlation matrix.
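As a minimal sketch of this first pass (assuming the UCI copy of the data in a local auto-mpg.data file; the column names are my own choice):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names follow the UCI "Auto MPG" documentation
cols = ["mpg", "cylinders", "displacement", "horsepower",
        "weight", "acceleration", "model_year", "origin", "car_name"]

# The raw file is whitespace-delimited, with '?' marking missing horsepower
df = pd.read_csv("auto-mpg.data", sep=r"\s+", names=cols, na_values="?")

# For plotting, fill missing horsepower with 0 (as discussed below)
df["horsepower"] = df["horsepower"].fillna(0)

# Histogram every numeric feature, then print the correlation matrix
df.hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.show()
print(df.corr(numeric_only=True).round(2))
```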

Immediately, there are a few questions to ask:

  • Are multiple instances of a car name true duplicates (every feature is identical)? If so, duplicates should be dropped.
  • What should be done with the cars having horsepower of 0? These were ‘?’ in the original dataset but I converted ‘?’ to zero for the sake of plotting. What about the 3 and 5 cylinder cars for which there is little data?
  • Given the discussion above and the fact that Origin and Cylinders have only a few possible values each, they seem like good options for grouping the dataset. Does one provide better discrimination than the other?

Let’s handle the first two bullets. For the first, a simple groupby shows that repeated car names correspond to the same make and model in a different year, so these rows should be kept. For the second, I’ve opted to keep the unlisted horsepower and the 3 and 5 cylinder cars for now, but we’ll see how they muddle the EDA.
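A minimal version of that duplicate check, reusing the df from the loading sketch above:

```python
# Find car names that appear more than once
name_counts = df["car_name"].value_counts()
repeats = name_counts[name_counts > 1].index

# Inspect the repeated names: same make/model, different model years
print(df[df["car_name"].isin(repeats)]
        .sort_values(["car_name", "model_year"])
        .head(10))

# A true duplicate would have every feature identical
print("fully identical rows:", df.duplicated().sum())
```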

What then of the categories? Upon a bit more staring, I noticed that the MPG and weight distributions have shapes relatively similar to HEP distributions - a steep rise (“turn on” in HEP lingo) and a smooth “fall”. However, I know from my HEP experience that this can be misleading, especially when distributions are a bit “bumpy”, which could be a result of statistical fluctuations but maybe not. Let’s break each of them down by our two categorical features (four plots total).
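A sketch of those four plots, overlaying one histogram per category value (again using the df from above):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
pairs = [("mpg", "cylinders"), ("mpg", "origin"),
         ("weight", "cylinders"), ("weight", "origin")]
for ax, (col, cat) in zip(axes.ravel(), pairs):
    # One semi-transparent histogram per category value
    for val, grp in df.groupby(cat):
        grp[col].plot.hist(ax=ax, bins=20, alpha=0.5, label=f"{cat}={val}")
    ax.set_title(f"{col} by {cat}")
    ax.legend()
plt.tight_layout()
plt.show()
```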

Aha! “MPG by Cylinders” especially shows that the MPG distribution is a sum of roughly normal distributions. It’s clear the cylinders are important, and that’s not a huge surprise given the reasoning above showing that most of the physical features depend on the number of cylinders.

The region the car comes from also separates the data nicely but only so far as “American” and “not American”.

Is there any correlation to worry about between these categories that could make them redundant? Well, the average American car has 6 cylinders while the average Asian or European car has just 4. In fact, looking at the breakdown of the number of cylinders by region below, one can see that Asian and European manufacturers are dominated by the 4 cylinder car while American manufacturers spread production across 4, 6, and 8 cylinders more evenly - with 8 cylinders being the most common!
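That breakdown is quick to reproduce with a cross-tabulation:

```python
# Rows: origin (1=America, 2=Europe, 3=Asia); columns: cylinder counts
print(pd.crosstab(df["origin"], df["cylinders"]))

# Average cylinder count per region
print(df.groupby("origin")["cylinders"].mean().round(1))
```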

Given the nice separation presented by the cylinders feature, I’m inclined to take that as the categorical feature over the region of origin (ignoring categorization of the product of these two features for now).

Before getting any more into the weeds, it’s clear that many of the features are correlated, so let’s use some of what’s been reasoned to develop intuition about what principal component analysis might reveal.

Selecting PCA features

Immediately, let’s take advantage of the basic mathematical relationships from two sections ago. For both, let’s do a slight rework: assume each \(c_i\) is a constant, add a bias term, and then fit.

\[P = \frac{weight}{t_a} \cdot c_1 + b_1\] \[P_{\text{per rev.}} = V_{g} \cdot c_2 + b_2\]

In other words, fit a line to “horsepower” vs “weight over acceleration” and do the same for “horsepower” vs “displacement”, dropping the empty horsepower values to avoid a bias.
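A sketch of both fits, with ordinary least squares (via np.polyfit) standing in for whatever fitting routine you prefer:

```python
import numpy as np

# Drop the rows where horsepower was originally '?' (stored as 0)
fit_df = df[df["horsepower"] > 0].copy()
fit_df["wt_over_acc"] = fit_df["weight"] / fit_df["acceleration"]

for x in ["wt_over_acc", "displacement"]:
    # Least-squares line: horsepower = c * x + b
    c, b = np.polyfit(fit_df[x], fit_df["horsepower"], 1)
    r = np.corrcoef(fit_df[x], fit_df["horsepower"])[0, 1]
    print(f"horsepower vs {x}: c = {c:.3f}, b = {b:.1f}, r = {r:.2f}")
```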

The first of these looks quite good. It’s only in the case of the 8 cylinder cars that the spread along the line gets bad (hover over the points to see which models are the outliers!).

The second shows a less linear relationship when considering all cylinder counts but one can fit the five different cylinder categories separately as well to see their trends.

It’s a bit hard to say exactly what is going on, but consider the number of mechanical components between the cylinders and the wheels - there are a lot of variables that can break the assumption in the second equation above. In other words, there are parameters (\(\vec{\alpha}\)) that we don’t have but would need in order to specify \(c_2(\vec{\alpha})\) and work the data from this direction.

Here’s what we’ve learned so far though:

  • The number of cylinders and the origin of the car are correlated, mainly because American manufacturers built more cars with > 4 cylinders than the other manufacturing regions combined.
  • A model of the MPG would benefit from segmenting on number of cylinders first and treating the resulting categories separately.
  • Horsepower is highly correlated with the pounds per “acceleration” (time from 0 to 60 mph), and the correlation is better than with acceleration or weight alone (as can be seen in the scatter matrix below).

A quick look at the set of all scatter plots for continuous features above shows that each feature has a dependence on the number of cylinders, with visible cylinder clusters in all scatter plots.
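For reference, the scatter matrix is a single call in pandas; coloring the points by cylinder count is what makes the clusters visible:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

num_cols = ["mpg", "displacement", "horsepower", "weight", "acceleration"]
scatter_matrix(df[num_cols], figsize=(12, 12),
               c=df["cylinders"], alpha=0.5, diagonal="hist")
plt.show()
```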

The equivalent plot of the principal components above (number of components chosen based on an MLE method) shows that the first component mainly identifies the different cylinder counts. Since we already have a feature that does this exactly, let’s perform the PCA per group of cylinders (ignoring the 3 and 5 cylinder cases due to the small sample sizes in those groups).
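A sketch of both steps with scikit-learn - the MLE-selected component count on the full dataset, then a PCA per cylinder group (features standardized first; fit_df comes from the fit sketch above):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["mpg", "displacement", "horsepower", "weight", "acceleration"]

# Full dataset: let PCA pick the number of components via MLE
X_all = StandardScaler().fit_transform(fit_df[features])
print("MLE-selected components:",
      PCA(n_components="mle").fit(X_all).n_components_)

# Per-group PCA, skipping the sparse 3 and 5 cylinder groups
for ncyl, grp in fit_df.groupby("cylinders"):
    if ncyl in (3, 5):
        continue
    X = StandardScaler().fit_transform(grp[features])
    evr = PCA().fit(X).explained_variance_ratio_
    print(f"{ncyl} cylinders: explained variance ratios {evr.round(2)}")
```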

Instead of looking at the scatter matrices for these, let’s look at the breakdown of how each feature contributes to each component.
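Those contributions are the PCA loadings; for example, for the 4 cylinder group (an illustrative choice):

```python
# Rows are components, columns are the original (standardized) features
X4 = StandardScaler().fit_transform(fit_df.loc[fit_df["cylinders"] == 4, features])
pca4 = PCA().fit(X4)
loadings = pd.DataFrame(pca4.components_, columns=features,
                        index=[f"PC{i+1}" for i in range(len(features))])
print(loadings.round(2))
```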

This is good, but since we showed that horsepower is correlated with weight and anti-correlated with acceleration, let’s remove it and see how much the explained variance changes.
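Concretely, rerun the per-group PCA with and without horsepower and compare the cumulative explained variance:

```python
reduced = ["mpg", "displacement", "weight", "acceleration"]

for cols in (features, reduced):
    print(f"\nfeatures: {cols}")
    for ncyl, grp in fit_df.groupby("cylinders"):
        if ncyl in (3, 5):
            continue
        X = StandardScaler().fit_transform(grp[cols])
        evr = PCA().fit(X).explained_variance_ratio_
        print(f"  {ncyl} cyl: cumulative variance {evr.cumsum().round(2)}")
```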

In fact, the explained variance stays the same or improves with fewer features and, as a result, fewer components (which are easier to read too… at least to my eyes)!

Summary

This first exercise has grown a bit larger than I anticipated so let’s summarize.

It’s very easy to show up to a dataset like this and throw any number of ML algorithms at it. My initial inclination was PCA + linear regression (if I wanted to predict MPG) because the features seemed obviously correlated and linear relationships between MPG and the principal components didn’t seem like a terrible assumption.

However, the latter half of the last section showed that knowledge of the data can help you perform PCA more efficiently. The two key modifications were:

  • Using a powerful categorical feature to separate the dataset into groups, solving for principal components in each group.
  • Removing a redundant feature (horsepower) that we already knew could be described using two other features. By removing it, the PCA was able to describe the variance in the data more effectively with fewer features and components.

For an example like this dataset, the differences are probably marginal (depending on what you’re trying to do). But on datasets with many features, evaluating inter-feature dependencies before performing PCA would be valuable for describing the data more effectively with fewer features and components.