
Gower's distance for mixed data types

2024-08-29 — 5 min read


A "distance measure" between data points is useful in many machine learning and data science tasks that need some notion of how similar (or dissimilar) pairs of observations are. Clustering and classification depend on one, and it can also be used to detect anomalies or to run similarity searches in information retrieval.

Suppose we have two observations \(X_1 = (x_1, y_1)\) and \(X_2 = (x_2, y_2)\). If the data in these vectors are numeric, we can use familiar geometric measures such as Euclidean or Manhattan distance to quantify their similarity. But what if the observations contain a mix of data types: not just numeric, but also categorical or binary features? How could we compare them then?

Gower's score

Gower's distance (Gower 1971) is a simple metric appropriate for such a situation. It combines a "score" \(s_{ijk}\) for each different feature \(k\) into a single measure of similarity between pairs of points \(i\) and \(j\).

Each feature is given a score according to its data type, as described by Gower:

Numeric variables (continuous or discrete):

$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}$$ where \(R_k\) is the range of the feature \(k\), either across the sample or the population (if known). This scoring can also be extended to ordinal features, e.g. ranks or scales (Podani 1999).
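As a minimal sketch of the numeric score, the formula can be written as a small Python function. The name `numeric_score` and the handling of a zero range are assumptions made for illustration, not part of Gower's definition, and the ages are only illustrative values.

```python
def numeric_score(x_i, x_j, feature_range):
    """Gower score for one numeric feature: 1 - |x_i - x_j| / R_k."""
    if feature_range == 0:
        # Assumption: a constant feature cannot separate observations,
        # so every pair is treated as identical on it.
        return 1.0
    return 1.0 - abs(x_i - x_j) / feature_range


ages = [28, 34, 22, 45, 30]        # illustrative values
age_range = max(ages) - min(ages)  # sample range R_k
score = numeric_score(28, 34, age_range)
print(round(score, 2))
```

Using the sample range means the score depends on which observations are in the dataset; a known population range would make it stable across samples.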

Categorical or binary variables:

$$s_{ijk} =\left\{\begin{matrix}1 \text{ if }x_{ik} = x_{jk}\\0 \text{ if }x_{ik} \neq x_{jk}\end{matrix}\right.$$

Dichotomous variables:

These are features that are either true or false, but used in situations where only the presence of a feature ("true") is informative, whereas its absence is uninformative.

$$s_{ijk} = \left\{\begin{matrix}1 \text{ if }x_{ik} = \text{ true AND }x_{jk} = \text{ true } \\0 \text{ otherwise.}\end{matrix}\right.$$

An example of a dichotomous variable might be "travels by car": if two people do not travel by car, that does not necessarily mean they are similar.
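The difference between the two rules only shows up when both observations are "false". The sketch below follows the scoring rules as stated in this post; the function names are hypothetical.

```python
def categorical_score(a, b):
    """Score for categorical or binary features: 1 on a match, else 0."""
    return 1.0 if a == b else 0.0


def dichotomous_score(a, b):
    """Score for dichotomous features: 1 only if both are true."""
    return 1.0 if (a and b) else 0.0


# "Travels by car" for two people who both do not travel by car.
person_a, person_b = False, False
as_binary = categorical_score(person_a, person_b)       # shared absence counts as a match
as_dichotomous = dichotomous_score(person_a, person_b)  # shared absence is uninformative
print(as_binary, as_dichotomous)
```

Whether a true/false feature should be scored as binary or dichotomous is a modelling choice that depends on whether shared absence carries information.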

The total score is then just the average of all the scores for the available features: $$ S_{ij} = \frac{\sum_{k=1}^N{(s_{ijk}\cdot\delta_{ijk})}}{\sum_{k=1}^N{\delta_{ijk}}}. $$ \(\delta_{ijk}\) is equal to 1 for all features that can be compared and 0 when they cannot, for example if one or both observations are missing that feature. The total score will be 1 for pairs of identical observations, and 0 for those which are as different as it is possible to be given the data.
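One way to sketch the averaging step in Python is to represent a feature that cannot be compared as `None`, which plays the role of \(\delta_{ijk} = 0\). This representation, and the name `total_score`, are assumptions for illustration.

```python
def total_score(scores):
    """Average the per-feature scores, skipping features that cannot be compared.

    `scores` holds one entry per feature; None marks delta_ijk = 0.
    """
    comparable = [s for s in scores if s is not None]
    if not comparable:
        raise ValueError("no comparable features for this pair")
    return sum(comparable) / len(comparable)


complete = total_score([0.5, 1.0, 0.0])  # all three features compared
partial = total_score([0.5, None, 1.0])  # middle feature missing, so two remain
print(complete, partial)
```

Note that a pair with missing features is averaged over fewer terms, so its score rests on less information than that of a fully observed pair.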

From score to distance

So far we have only referred to a total similarity score rather than a distance in the more familiar sense. If observations \(i\) and \(j\) have a total score \(S_{ij}\) then the "distance" between them is given by $$ D_{ij} = \sqrt{1-S_{ij}}. $$

For this to be a valid measure of distance it must satisfy the triangle inequality. For any three data points \(a\), \(b\), and \(c\): $$ \sqrt{1-S_{ab}} + \sqrt{1-S_{bc}} \geq \sqrt{1-S_{ac}}. $$ Gower (1971) shows that this holds if there are no missing features (i.e. all the \(\delta_{ijk} = 1\)).
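The conversion from score to distance, and a spot check of the triangle inequality, can be sketched as follows. The three points are hypothetical: they sit at 0, 0.5, and 1 on a single numeric feature with range 1, which fixes the pairwise scores used below. A single check like this illustrates the inequality but does not prove it.

```python
import math


def gower_distance(similarity):
    """Convert a total similarity score S_ij into the distance sqrt(1 - S_ij)."""
    return math.sqrt(1.0 - similarity)


# Hypothetical points a = 0, b = 0.5, c = 1 on one numeric feature with range 1.
s_ab, s_bc, s_ac = 0.5, 0.5, 0.0

d_ab = gower_distance(s_ab)
d_bc = gower_distance(s_bc)
d_ac = gower_distance(s_ac)

triangle_holds = d_ab + d_bc >= d_ac
print(triangle_holds)
```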

Example

Finally, let's consider an example of calculating Gower's distance for data containing mixed-type features. The table below shows information about a number of individuals (identified by "Subject ID") with four attributes:

| Subject ID | Age | Handedness | Eye Colour | Knows Python |
|---|---|---|---|---|
| 001 | 28 | Right | Blue | Yes |
| 002 | 34 | Left | Blue | No |
| 003 | 22 | Right | Green | Yes |
| 004 | 45 | Right | Hazel | No |
| 005 | 30 | Left | Brown | Yes |

Let's look at individuals 001 and 002, and calculate the score for each feature in turn.

Age:

$$s_{age} = 1 - \frac{|28 - 34|}{23} \approx 0.74$$ Here \(R_k = 23\) is the range of ages in the sample, which has a minimum of 22 and a maximum of 45.

Handedness:

Since the individuals have different handedness, \(s_{handedness} = 0\).

Eye colour:

They have the same eye colour, so \(s_{eyes} = 1\).

Knows Python:

Individual 001 knows Python whereas 002 does not, so \(s_{python} = 0\).

We have no missing data, so all the \(\delta_{ijk} = 1\). The overall similarity score between 001 and 002 is therefore \begin{align*} S &= \frac{0.74\times1 + 0\times1 + 1\times1 + 0\times1}{1+1+1+1}\\ &= \frac{1.74}{4}\\ &= 0.435, \end{align*} so the Gower distance between these two people is $$ D = \sqrt{1-0.435} = 0.75. $$

Repeating the calculation for all pairs of individuals, we obtain the distance table below, which could then be used to cluster them into groups. "Knows Python" is scored as a dichotomous variable throughout, so a pair who both answer "No" (002 and 004) scores 0 on that feature. The most similar individuals are 001 and 003 \((D = 0.56)\) and the most different are 004 and 005 \((D=0.96)\).

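| Pairing | 001 | 002 | 003 | 004 | 005 |
|---|---|---|---|---|---|
| 001 | - | 0.75 | 0.56 | 0.83 | 0.72 |
| 002 | 0.75 | - | 0.94 | 0.93 | 0.74 |
| 003 | 0.56 | 0.94 | - | 0.87 | 0.77 |
| 004 | 0.83 | 0.93 | 0.87 | - | 0.96 |
| 005 | 0.72 | 0.74 | 0.77 | 0.96 | - |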
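The distances in the table can be reproduced with a short script. This is a sketch rather than a general-purpose implementation: it hard-codes the four features of this example, assumes no missing values (so every \(\delta_{ijk} = 1\)), and scores "Knows Python" as a dichotomous variable. The data layout and function name are choices made for illustration.

```python
import math

# Subject ID -> (age, handedness, eye colour, knows Python)
subjects = {
    "001": (28, "Right", "Blue", True),
    "002": (34, "Left", "Blue", False),
    "003": (22, "Right", "Green", True),
    "004": (45, "Right", "Hazel", False),
    "005": (30, "Left", "Brown", True),
}

ages = [row[0] for row in subjects.values()]
age_range = max(ages) - min(ages)  # sample range R_k = 45 - 22 = 23


def gower_distance(a, b):
    """Gower distance between two subjects with no missing values."""
    scores = [
        1 - abs(a[0] - b[0]) / age_range,  # numeric: age
        1.0 if a[1] == b[1] else 0.0,      # categorical: handedness
        1.0 if a[2] == b[2] else 0.0,      # categorical: eye colour
        1.0 if (a[3] and b[3]) else 0.0,   # dichotomous: knows Python
    ]
    similarity = sum(scores) / len(scores)  # every delta_ijk is 1
    return math.sqrt(1 - similarity)


ids = list(subjects)
distances = {
    (i, j): round(gower_distance(subjects[i], subjects[j]), 2)
    for i in ids
    for j in ids
    if i < j
}

for pair, distance in distances.items():
    print(pair, distance)
```

Scoring "Knows Python" with the categorical rule instead would change only the 002 and 004 pair, since they are the only two individuals who both answer "No".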

In summary, Gower's distance is a simple and versatile tool for analysing datasets that contain mixed data types. It is particularly effective at handling combinations of numerical, categorical, and binary variables within a single measure. This makes it very useful in fields where datasets with mixed variable types are common, such as healthcare, marketing, and social sciences.

Further reading
