In his book Here’s Looking at Euclid, Alex Bellos recounted an experiment he performed weighing the baguettes he purchased on a daily basis from his local baker.  The first baguette weighed 391 grams.  The second one weighed 398 grams, the third 399 grams, the fourth 403 grams, and the fifth 384 grams.  After 100 baguettes, he stopped his experiment.  By the end every number between 379 grams and 422 grams had been covered at least once with only four exceptions.

When Bellos reexamined the baguette-weighing experiment, he discovered a fundamental flaw in his methodology that had biased the distribution of weights he measured. “I had been storing the uneaten baguettes in my kitchen, and I decided to weigh one that was a few days old. To my surprise, it was only 321 grams, significantly lower than the lowest weight I had measured. It dawned on me then that baguette weight was not fixed, because bread gets lighter as it dries out. I bought another loaf and discovered that a baguette loses about 15 grams between 8 a.m. and noon.”

Bellos concluded that measuring is intrinsically fuzzy, since “many factors contributed to the daily variance in weight: the amount and consistency of the flour used, the length of time in the oven, the journey of the baguettes from the central bakery to my local store, the humidity of the air and so on.”

When we measure the quality of data, or of the information created from data, we often fail to account for random variations in the production process or for what happens after production is completed.

Just as the weight of a baguette is not fixed and is subject to numerous variations despite a high-quality baking process, the quality of data is not fixed and is subject to numerous variations despite a high-quality information management process.

We often look for errors in the process in order to improve quality, but we rarely look for errors in the way we measure quality. Failing to acknowledge that measuring is intrinsically fuzzy would be a fundamental flaw in our methodology.


Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.

1. Dave Chamberlain on

There are times when the measure is absolute on an individual, datum-by-datum basis. An address is either right (for the right person, deliverable, etc.) or wrong (wrong person, undeliverable, etc.). I think one area where fuzziness enters into the equation is in trying to determine duplicate records when the individual attribute values are not exactly, 100% the same. So the trick is to try to approximate the human notion of judging whether or not (let's say just 2) records, despite their differences, are in fact about the same entity.
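To make the duplicate-matching idea in this comment concrete, here is a minimal sketch using Python's standard-library `difflib`. It is only one of many possible approaches, not a description of any particular tool. The field names, the sample records, the equal weighting of fields, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a similarity score in [0, 1] for two strings.

    Case and repeated whitespace are ignored before comparing.
    """
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def record_similarity(rec_a, rec_b, fields):
    """Average the per-field similarity scores (equal weights, an assumption)."""
    scores = [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]
    return sum(scores) / len(scores)


def likely_same_entity(rec_a, rec_b, fields, threshold=0.8):
    """Flag two records as probable duplicates.

    The threshold is arbitrary; in practice it would be tuned against
    pairs that a person has already judged.
    """
    return record_similarity(rec_a, rec_b, fields) >= threshold


# Hypothetical records, invented for this example.
FIELDS = ["name", "street", "city"]
rec_1 = {"name": "Jonathan Smith", "street": "12 Main Street", "city": "Springfield"}
rec_2 = {"name": "Jon Smith", "street": "12 Main St.", "city": "Springfield"}
rec_3 = {"name": "Maria Lopez", "street": "98 Oak Avenue", "city": "Riverton"}

print(likely_same_entity(rec_1, rec_2, FIELDS))  # near-duplicates
print(likely_same_entity(rec_1, rec_3, FIELDS))  # unrelated records
```

The score is a judgment, not a fact: moving the threshold changes which pairs count as duplicates, which is exactly where the fuzziness comes in.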

• Thanks for your comment, Dave.

Instead of saying there are times when the measure is absolute, I would prefer to say there are times when the measure is accurate — at the time the measurement was taken.

Using your example, a postal address that is right (for the right person and deliverable) today might not be right sometime later. In the United States, the USPS estimates that 17% of people move each year, so although the postal address remains deliverable, it will no longer be the right address for the person after they have moved.
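As a rough illustration of how quickly a one-time measurement can go stale, the sketch below compounds the 17% figure over several years. It assumes the same move rate every year and that each person is equally likely to move in any given year. Both are simplifications, so treat the output as a back-of-the-envelope estimate rather than a forecast.

```python
ANNUAL_MOVE_RATE = 0.17  # the USPS estimate quoted above


def fraction_still_current(years, annual_move_rate=ANNUAL_MOVE_RATE):
    """Estimate the share of addresses still right for the person after `years`.

    Assumes a constant, independent yearly move rate (a simplification).
    """
    return (1.0 - annual_move_rate) ** years


# Show how an address list verified once would decay if never rechecked.
for years in range(6):
    print(f"after {years} year(s): {fraction_still_current(years):.1%} still current")
```

Under these assumptions, fewer than half of the addresses verified today would still belong to the right person four years later, even if every one of them remained deliverable.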

Best Regards,

Jim