Gower's distance for mixed data types
2024-08-29 — 5 min read
The idea of a "distance measure" between data points is useful in many machine learning and data science tasks, where we want some notion of how similar (or dissimilar) pairs of observations are. Such a measure is required for clustering and classification, and can also be used for anomaly detection and for similarity search in information retrieval.
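Suppose we have two observations $x_i$ and $x_j$, each described by several features, where the features may be a mixture of numeric, categorical and binary types.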
Gower's score
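Gower's distance (Gower 1971) is a simple metric appropriate for such a situation. It combines a "score" $s_{ijk}$ between 0 and 1 for each feature $k$, measuring how similar observations $i$ and $j$ are on that feature, into a single total score $S_{ij}$.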
Each feature is given a score according to its data type, as described by Gower:
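Numeric variables (continuous or discrete): the score $s_{ijk}$ for feature $k$ between observations $i$ and $j$ is $s_{ijk} = 1 - |x_{ik} - x_{jk}| / R_k$, where $x_{ik}$ is the value of feature $k$ for observation $i$ and $R_k$ is the range of feature $k$ (its maximum minus its minimum).
Categorical or binary variables: $s_{ijk} = 1$ if $x_{ik} = x_{jk}$, and $s_{ijk} = 0$ otherwise.
Dichotomous variables: $s_{ijk} = 1$ if the feature is present in both observations, and $s_{ijk} = 0$ if it is present in only one. If it is absent from both, the feature is left out of the comparison for that pair.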
These are features that are either true or false, but used in situations where only the presence of a feature ("true") is informative, whereas its absence is uninformative.
An example of a dichotomous variable might be "travels by car": if two people do not travel by car, that does not necessarily mean they are similar.
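The total score is then just the average of all the scores for the available features:
$$S_{ij} = \frac{\sum_k \delta_{ijk} \, s_{ijk}}{\sum_k \delta_{ijk}}$$
where $s_{ijk}$ is the score for feature $k$ between observations $i$ and $j$, the sums run over all features, and $\delta_{ijk} = 1$ if feature $k$ can be compared for the pair and $\delta_{ijk} = 0$ if it cannot (because a value is missing, or because a dichotomous feature is absent from both observations).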
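The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `feature_score` and `gower_score`, the feature-kind labels, and the convention of returning `None` for a feature that cannot be compared are all assumptions made for this example, as are the values in the usage lines.

```python
from typing import Optional, Sequence

# Labels for the feature kinds (hypothetical names chosen for this sketch).
NUMERIC, CATEGORICAL, DICHOTOMOUS = "numeric", "categorical", "dichotomous"


def feature_score(a, b, kind: str, value_range: Optional[float] = None) -> Optional[float]:
    """Score one feature for a pair of observations.

    Returns None when the feature cannot be compared for this pair.
    """
    if a is None or b is None:
        return None  # missing value: skip this feature
    if kind == NUMERIC:
        return 1.0 - abs(a - b) / value_range
    if kind == CATEGORICAL:  # also covers ordinary binary variables
        return 1.0 if a == b else 0.0
    if kind == DICHOTOMOUS:
        if not a and not b:
            return None  # absent from both: uninformative, so skipped
        return 1.0 if (a and b) else 0.0
    raise ValueError(f"unknown feature kind: {kind}")


def gower_score(x: Sequence, y: Sequence, kinds: Sequence[str], ranges: Sequence) -> float:
    """Average the per-feature scores over the features that can be compared."""
    scores = [feature_score(a, b, k, r) for a, b, k, r in zip(x, y, kinds, ranges)]
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no comparable features for this pair")
    return sum(available) / len(available)


# Made-up observations: one numeric feature (range 10), one categorical, one dichotomous.
kinds = [NUMERIC, CATEGORICAL, DICHOTOMOUS]
ranges = [10.0, None, None]
print(gower_score((2.0, "red", True), (7.0, "red", False), kinds, ranges))  # 0.5
```

Skipping a feature by returning `None` plays the role of excluding it from both the numerator and the denominator of the average.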
From score to distance
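So far we have only referred to a total similarity score rather than a distance in the more familiar sense. If observations $i$ and $j$ are identical, their total score is $S_{ij} = 1$, and if they differ as much as possible on every feature it is $S_{ij} = 0$. To turn the score into a distance we take $d_{ij} = \sqrt{1 - S_{ij}}$, so that identical observations are at distance 0.
For this to be a valid measure of distance it must satisfy the triangle inequality: for any three data points $a$, $b$ and $c$, we need $d_{ac} \le d_{ab} + d_{bc}$. Gower (1971) showed that this holds for $d_{ij} = \sqrt{1 - S_{ij}}$ when there are no missing values.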
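As a small sketch of this step, the snippet below converts total scores to distances, assuming the conversion $d = \sqrt{1 - S}$, and checks the triangle inequality by brute force over every ordered triple. The function names and the three-point score matrix are hypothetical and exist only for illustration.

```python
import math
from itertools import permutations


def score_to_distance(score: float) -> float:
    """Convert a total similarity score in [0, 1] to a distance, d = sqrt(1 - S)."""
    # max() guards against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - score))


def satisfies_triangle_inequality(dist, tol: float = 1e-12) -> bool:
    """Check d[a][c] <= d[a][b] + d[b][c] for every ordered triple of distinct points."""
    n = len(dist)
    return all(
        dist[a][c] <= dist[a][b] + dist[b][c] + tol
        for a, b, c in permutations(range(n), 3)
    )


# Made-up symmetric score matrix for three observations.
scores = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
dist = [[score_to_distance(s) for s in row] for row in scores]
print(satisfies_triangle_inequality(dist))  # True
```

A brute-force check like this is only practical for small datasets, since the number of triples grows with the cube of the number of observations.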
Example
Finally, let's consider an example of calculating Gower's distance for data containing mixed-type features. The table below shows information about a number of individuals (identified by "Subject ID") with four attributes:
- Age - a continuous numeric variable
- Handedness - a binary variable, whether the individual is left- or right-handed.
- Eye colour - a categorical variable
- Knows Python - a dichotomous variable (while two people who know Python may have a similar education, both of them not knowing Python does not imply they have similar backgrounds).
Subject ID | Age | Handedness | Eye Colour | Knows Python |
---|---|---|---|---|
001 | 28 | Right | Blue | Yes |
002 | 34 | Left | Blue | No |
003 | 22 | Right | Green | Yes |
004 | 45 | Right | Hazel | No |
005 | 30 | Left | Brown | Yes |
Let's look at individuals 001 and 002, and calculate the score for each feature in turn.
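Age:
The ages in the table run from 22 to 45, a range of 23, so $s = 1 - |28 - 34| / 23 \approx 0.739$.
Handedness:
Since the individuals have different handedness, $s = 0$.
Eye colour:
They have the same eye colour, so $s = 1$.
Knows Python:
Individual 001 knows Python whereas 002 does not, so $s = 0$.
We have no missing data, so all four features contribute to the average, and the total score is $S = (0.739 + 0 + 1 + 0) / 4 \approx 0.435$. The distance is then $d = \sqrt{1 - 0.435} \approx 0.75$.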
Repeating the calculation for all pairs of individuals, we obtain the distance table below, which could then be used for clustering them into groups. We can see that the individuals who are most similar are 001 and 003 (distance 0.56), while the least similar are 004 and 005 (distance 0.96).
Pairing | 001 | 002 | 003 | 004 | 005 |
---|---|---|---|---|---|
001 | - | 0.75 | 0.56 | 0.83 | 0.72 |
002 | 0.75 | - | 0.94 | 0.91 | 0.74 |
003 | 0.56 | 0.94 | - | 0.87 | 0.77 |
004 | 0.83 | 0.91 | 0.87 | - | 0.96 |
005 | 0.72 | 0.74 | 0.77 | 0.96 | - |
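For readers who want to recompute the pairwise distances, here is a minimal pure-Python sketch using the five subjects from the example. It assumes the age range is taken from the sample itself (45 - 22 = 23) and that the distance is $\sqrt{1 - S}$. The function name `gower_distance` and the tuple layout are choices made for this illustration. Note that under the dichotomous rule, the pair 002 and 004 (neither knows Python) is averaged over three features rather than four.

```python
import math
from itertools import combinations

# Subject ID -> (age, handedness, eye colour, knows Python), copied from the example table.
subjects = {
    "001": (28, "Right", "Blue", True),
    "002": (34, "Left", "Blue", False),
    "003": (22, "Right", "Green", True),
    "004": (45, "Right", "Hazel", False),
    "005": (30, "Left", "Brown", True),
}

ages = [row[0] for row in subjects.values()]
age_range = max(ages) - min(ages)  # assumption: range taken from this sample


def gower_distance(a, b) -> float:
    """Distance sqrt(1 - S) between two subjects, with S the average feature score."""
    scores = [
        1.0 - abs(a[0] - b[0]) / age_range,  # age (numeric)
        1.0 if a[1] == b[1] else 0.0,        # handedness (binary)
        1.0 if a[2] == b[2] else 0.0,        # eye colour (categorical)
    ]
    # Knows Python (dichotomous): only counted if at least one subject knows Python.
    if a[3] or b[3]:
        scores.append(1.0 if (a[3] and b[3]) else 0.0)
    total_score = sum(scores) / len(scores)
    return math.sqrt(1.0 - total_score)


distances = {
    (i, j): gower_distance(subjects[i], subjects[j])
    for i, j in combinations(subjects, 2)
}
for (i, j), d in distances.items():
    print(f"{i} vs {j}: {d:.2f}")

# Pair with the smallest distance, i.e. the most similar individuals.
closest = min(distances, key=distances.get)
print("most similar:", closest)
```

A dictionary keyed by pairs keeps the sketch short. For larger datasets, a dedicated implementation that works on whole columns at once would be a more natural fit.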
In summary, Gower's distance is a simple and versatile tool for analysing datasets that contain mixed data types. It is particularly effective at handling combinations of numerical, categorical, and binary variables within a single measure. This makes it very useful in fields where datasets with mixed variable types are common, such as healthcare, marketing, and social sciences.
Further reading
- A General Coefficient of Similarity and Some of Its Properties (Gower 1971)
- Extending Gower's General Coefficient of Similarity to Ordinal Characters (Podani 1999)
- What is Gower's distance? (Statistical Odds & Ends)
- Gower's distance (Wikipedia)