Gower's distance for mixed data types

2024-08-29 — 5 min read

The idea of a "distance measure" between data points is useful in many machine learning and data science tasks where we want some notion of how similar (or dissimilar) pairs of observations are. Such a measure is needed for clustering and classification, and can also be used to detect anomalies or to run similarity searches in information retrieval.

Suppose we have two observations $X_{1} = (x_{1}, y_{1})$ and $X_{2} = (x_{2}, y_{2})$: if the data in these vectors are numeric then we can use familiar geometric measures such as Euclidean distance or Manhattan distance to quantify their similarity. However, what if the observations contain a mix of different data types, with categorical or binary features alongside the numeric ones? How could we compare them then?

Gower's score

Gower's distance (Gower 1971) is a simple metric appropriate for such a situation. It combines a "score" $s_{i j k}$ for each different feature $k$ into a single measure of similarity between pairs of points $i$ and $j$ .

Each feature is given a score according to its data type, as described by Gower:

Numeric variables (continuous or discrete):

$s_{i j k} = 1 - \frac{| x_{i k} - x_{j k} |}{R_{k}}$ where $R_{k}$ is the range of the feature $k$ , either across the sample or the population (if known). This scoring can also be extended to ordinal features, e.g. ranks or scales (Podani 1999).
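As a quick illustration of the numeric score, here is a minimal Python sketch. The function name `numeric_score` and the handling of a zero range are assumptions made for this example rather than part of Gower's definition; the ages are the ones used in the example later in this post.

```python
def numeric_score(x_i, x_j, value_range):
    """Gower similarity for one numeric feature: 1 - |x_i - x_j| / R_k."""
    if value_range == 0:
        # Assumption: a constant feature cannot separate observations,
        # so every pair is treated as fully similar.
        return 1.0
    return 1.0 - abs(x_i - x_j) / value_range


ages = [28, 34, 22, 45, 30]
age_range = max(ages) - min(ages)  # R_k taken from the sample: 45 - 22 = 23

score = numeric_score(28, 34, age_range)
print(round(score, 2))  # 0.74
```

The two observations furthest apart in the sample (ages 22 and 45) get a score of 0, since their difference equals the range.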

Categorical or binary variables:

$s_{i j k} = \begin{cases} 1 & \text{if } x_{i k} = x_{j k} \\ 0 & \text{if } x_{i k} \neq x_{j k} \end{cases}$

Dichotomous variables:

These are features that are either true or false, but used in situations where only the presence of a feature ("true") is informative, whereas its absence is uninformative.

$s_{i j k} = \begin{cases} 1 & \text{if } x_{i k} = \text{true and } x_{j k} = \text{true} \\ 0 & \text{otherwise.} \end{cases}$

An example of a dichotomous variable might be "travels by car": if two people do not travel by car, that does not necessarily mean they are similar.
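The difference between the categorical and dichotomous scores is easiest to see side by side. The sketch below uses hypothetical helper names (`categorical_score`, `dichotomous_score`) and assumes dichotomous features are stored as Python booleans.

```python
def categorical_score(x_i, x_j):
    """1 if the two categories match, 0 otherwise."""
    return 1.0 if x_i == x_j else 0.0


def dichotomous_score(x_i, x_j):
    """1 only if the feature is present (True) in both observations."""
    return 1.0 if (x_i and x_j) else 0.0


print(categorical_score("Blue", "Blue"))   # 1.0
print(categorical_score("Left", "Right"))  # 0.0
print(dichotomous_score(True, True))       # 1.0
print(dichotomous_score(False, False))     # 0.0: shared absence is not counted as similarity
```

A binary feature scored with `categorical_score` would give 1 for two people who both do not travel by car; scoring it as dichotomous gives 0 instead.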

The total score is then the average of the scores over all features that can be compared: $S_{i j} = \frac{\sum_{k = 1}^{N} s_{i j k} \, δ_{i j k}}{\sum_{k = 1}^{N} δ_{i j k}}.$ Here $δ_{i j k}$ is equal to 1 for features that can be compared and 0 for those that cannot, for example if one or both observations are missing that feature. The total score is 1 for pairs of observations that score 1 on every comparable feature, and 0 for those which are as different as it is possible to be given the data.
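To make the role of $δ_{i j k}$ concrete, here is a small sketch of the averaging step. It assumes a missing comparison is represented by `None`, which is a convention chosen for this example; the function name `total_score` is likewise hypothetical.

```python
def total_score(scores):
    """Average per-feature scores, skipping features that cannot be compared.

    `scores` holds one entry per feature: a number in [0, 1], or None when
    the feature is missing for either observation (delta = 0).
    """
    comparable = [s for s in scores if s is not None]  # features with delta = 1
    if not comparable:
        raise ValueError("no comparable features for this pair")
    return sum(comparable) / len(comparable)


print(total_score([0.5, 1.0, 0.0]))        # 0.5
print(total_score([0.5, None, 1.0, 0.0]))  # 0.5: the missing feature is ignored
```

Dropping a missing feature from both the numerator and the denominator means it neither rewards nor penalises the pair.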

From score to distance

So far we have only referred to a total similarity score rather than a distance in the more familiar sense. If observations $i$ and $j$ have a total score $S_{i j}$ then the "distance" between them is given by $D_{i j} = \sqrt{1 - S_{i j}} .$

For this to be a valid measure of distance it must satisfy the triangle inequality: for any three data points $a$, $b$, and $c$, $\sqrt{1 - S_{a b}} + \sqrt{1 - S_{b c}} \geq \sqrt{1 - S_{a c}}.$ Gower (1971) shows that this holds if there are no missing features (i.e. all the $δ_{i j k} = 1$).
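The conversion and the triangle inequality are easy to check numerically. The sketch below is only an illustration: the similarity values in `S` are hypothetical numbers rather than scores computed from real data, and the function names are assumptions made for this example.

```python
import math
from itertools import permutations


def gower_distance(similarity):
    """Convert a total similarity score S in [0, 1] to D = sqrt(1 - S)."""
    # max() guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - similarity))


def satisfies_triangle_inequality(dist, tol=1e-9):
    """Check D[a][b] + D[b][c] >= D[a][c] for every ordered triple of points."""
    n = len(dist)
    return all(
        dist[a][b] + dist[b][c] >= dist[a][c] - tol
        for a, b, c in permutations(range(n), 3)
    )


# Hypothetical similarity scores for three points (illustrative values only).
S = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.6],
    [0.5, 0.6, 1.0],
]
D = [[gower_distance(s) for s in row] for row in S]
print(satisfies_triangle_inequality(D))  # True
```

A brute-force check like this only confirms the property for one particular matrix; it is Gower's proof that covers the general case without missing features.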

Example

Finally, let's consider an example of calculating Gower's distance for data containing mixed-type features. The table below shows information about a number of individuals (identified by "Subject ID") with four attributes:

Age - a continuous numeric variable
Handedness - a binary variable, whether the individual is left- or right-handed.
Eye colour - a categorical variable
Knows Python - a dichotomous variable (while two people who know Python may have a similar education, both of them not knowing Python does not imply they have similar backgrounds).

Subject ID	Age	Handedness	Eye Colour	Knows Python
001	28	Right	Blue	Yes
002	34	Left	Blue	No
003	22	Right	Green	Yes
004	45	Right	Hazel	No
005	30	Left	Brown	Yes

Let's look at individuals 001 and 002, and calculate the score for each feature in turn.

Age:

$s_{\text{age}} = 1 - \frac{| 28 - 34 |}{23} = 0.74$ Here $R_{k} = 23$ is the range of ages in the sample, which has a minimum of 22 and a maximum of 45.

Handedness:

Since the individuals have different handedness, $s_{\text{handedness}} = 0$

Eye colour:

They have the same eye colour, so $s_{\text{eyes}} = 1$

Knows Python:

Individual 001 knows Python whereas 002 does not, so $s_{\text{python}} = 0$

We have no missing data, so all the $δ_{i j k} = 1$. The overall similarity score between 001 and 002 is therefore $\begin{aligned} S & = \frac{0.74 \times 1 + 0 \times 1 + 1 \times 1 + 0 \times 1}{1 + 1 + 1 + 1} \\ & = \frac{1.74}{4} \\ & = 0.435, \end{aligned}$ so the Gower distance between these two people is $D = \sqrt{1 - 0.435} = 0.75.$
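The same arithmetic can be reproduced in a few lines of Python. This is a sketch of the calculation for this one pair, with dictionary keys chosen for the example and "Knows Python" assumed to be stored as a boolean.

```python
import math

# Subjects 001 and 002 from the table above.
subject_001 = {"age": 28, "handedness": "Right", "eye_colour": "Blue", "knows_python": True}
subject_002 = {"age": 34, "handedness": "Left", "eye_colour": "Blue", "knows_python": False}
age_range = 45 - 22  # range of ages in the sample

scores = {
    "age": 1 - abs(subject_001["age"] - subject_002["age"]) / age_range,
    "handedness": float(subject_001["handedness"] == subject_002["handedness"]),
    "eye_colour": float(subject_001["eye_colour"] == subject_002["eye_colour"]),
    # Dichotomous: only a shared "True" counts as similarity.
    "knows_python": float(subject_001["knows_python"] and subject_002["knows_python"]),
}

S = sum(scores.values()) / len(scores)  # no missing data, so every delta is 1
D = math.sqrt(1 - S)
print(f"S = {S:.3f}, D = {D:.2f}")  # S = 0.435, D = 0.75
```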

Repeating the calculation for all pairs of individuals, we obtain the distance table below, which could then be used for clustering them into groups. We can see that the individuals who are most similar are 001 and 003 $(D = 0.56)$ and the most different are 004 and 005 $(D = 0.96)$ .

Pairing	001	002	003	004	005
001	-	0.75	0.56	0.83	0.72
002	0.75	-	0.94	0.93	0.74
003	0.56	0.94	-	0.87	0.77
004	0.83	0.93	0.87	-	0.96
005	0.72	0.74	0.77	0.96	-
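For readers who want to reproduce the table, the following sketch computes every pairwise distance using only the standard library. The data layout (tuples of age, handedness, eye colour, knows Python) and the function name `pair_distance` are choices made for this example, and the age range is taken from the sample as in the worked calculation.

```python
import math
from itertools import combinations

# (age, handedness, eye colour, knows Python) for each subject in the table.
subjects = {
    "001": (28, "Right", "Blue", True),
    "002": (34, "Left", "Blue", False),
    "003": (22, "Right", "Green", True),
    "004": (45, "Right", "Hazel", False),
    "005": (30, "Left", "Brown", True),
}

ages = [row[0] for row in subjects.values()]
age_range = max(ages) - min(ages)  # 23


def pair_distance(a, b):
    """Gower distance between two subjects, assuming no missing values."""
    scores = [
        1 - abs(a[0] - b[0]) / age_range,  # numeric
        1.0 if a[1] == b[1] else 0.0,      # binary
        1.0 if a[2] == b[2] else 0.0,      # categorical
        1.0 if (a[3] and b[3]) else 0.0,   # dichotomous
    ]
    similarity = sum(scores) / len(scores)
    return math.sqrt(1 - similarity)


distances = {
    (i, j): pair_distance(subjects[i], subjects[j])
    for i, j in combinations(subjects, 2)
}

for (i, j), d in distances.items():
    print(f"{i} vs {j}: {d:.2f}")

closest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)
print(closest, farthest)  # ('001', '003') ('004', '005')
```

Only the ten distinct pairs are computed, since the distance is symmetric and the diagonal of the table is left blank.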

In summary, Gower's distance is a simple and versatile tool for analysing datasets that contain mixed data types. It is particularly effective at handling combinations of numerical, categorical, and binary variables within a single measure. This makes it very useful in fields where datasets with mixed variable types are common, such as healthcare, marketing, and social sciences.