What are the characteristics of “true clusters?”


In the digital world where billions of customers are making trillions of visits on a multi-channel marketing environment, big data has drawn researchers’ attention all over the world. Customers leave behind a huge trail of data volumes in digital channels. It is becoming an extremely difficult task finding the right data, given exploding data volumes, that can potentially help make the right decision.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer amount of information or complexity of datasets we accumulate on the web, on social media and other places.

A leading consulting firm, for example, boasts of one of its client having 35 million customers and 9million unique visitors daily on their website, leaving a huge amount of shopper information data every second. Segmenting this large amount of data with the right tools to help target marketing activities is not readily available. To make matters more complicated, this data can be structured and unstructured, making traditional methods of analysing data not suitable.

Tackling market segmentation

Market segmentation is a process by which market researchers identify key attributes about customers and potential customers which can be used to create distinct target market groups. Without a market segmentation base, Advertising and Sales can lose large amount of money targeting the wrong set of customers.

Some known methods of segmenting consumers market include geographical segmentation, demographical segmentation, behavioural segmentation, multi-variable account segmentation and others. Common approaches using statistical methods to segment various markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume the number of clusters to be known, which in reality is never the case. There are several approaches to estimate the number of clusters. However, strong evidence about the quality of this clusters does not exist.

To add to the above issues, clusters could be domain specific, which means they are built to solve certain domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of data bases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

Depending on the application, it may differ a lot what is meant by a “cluster,” and cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out an objective characteristics of what a “true clusters” should possess which includes the following:

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.


  1. Van Mechelen, J. Hampton, R.S. Michalski, P. Theuns (1993). Categories and Concepts—Theoretical Views and Inductive Data Analysis, Academic Press, London

About Author

Larry Orimoloye

Sr Business Solutions Manager

Larry has experience in the design and delivery of analytics solutions in the telecom, technology, oil and gas, pharma, banking and logistics sectors, among others. He has helped clients establish centres of excellence with an analytics remit across the organization, and designed and implemented customer-centric, real-time decision platforms using a combination of statistics, big data and machine learning techniques. Larry holds a master’s degree in applied statistics and data mining from the University of St. Andrews, Scotland.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top