Five myths about unstructured data and five good reasons you should be analyzing it


“How can we begin to make sense of the unstructured data, when we still don’t make the most of our structured data?” said the exasperated senior manager from a large retail firm.

One of the great pleasures of my job is the relationship with students that continues after class has ended. This call was from one such former student who keeps in touch from time to time.

“Jake” (not his real name, but we’ll call him that) manages a predictive modeling team and has been leading a project to establish an analytical data mart, bringing data together from all aspects of the business: suppliers, customers, transactions, channels, marketing, post-merger business units, etc. Jake has made good progress and the modelers are already seeing the benefits of having a single version of the truth through high-quality data integration. Jake was calling me because he had recently locked horns with a marketing manager who feels that the real future of analytics is in unstructured data (for example, social media, location data, web click-through data, social network links, etc.). While both managers agree that there is value to be found in unstructured data, the marketing manager wants to abandon new work on the structured data predictive modeling database that the company had been investing in, and instead focus efforts exclusively on analyzing the unstructured data. Jake wanted the marketing manager to understand the value of predictive analytics.

Most companies would do well to invest more resources in unstructured data. This is an area that is largely untapped, and there can be value. But unstructured data might not be the panacea that some are led to believe. Below are 5 myths about unstructured data:

  • Myth #1: Unstructured data are easier to manage. By definition, these data do not have a tidy structure and do not lend themselves to easy-to-use algorithms. To summarize what is there, you must create quantitative structure from it somehow.
  • Myth #2: Unstructured data can legally be used in any way we want to. It is unclear in what contexts data from social media sites like Twitter or Facebook can be legally used for business purposes. Just as with structured data, different countries and different industries are subject to differing regulations on how these data can be used. Worse, unstructured data are more likely than structured data to conceal personally identifiable information, and using that information can violate the laws of many countries.
  • Myth #3: The analyses are easier to understand. Most unstructured data are analyzed using truly massive matrices and complex decomposition techniques, sparse matrix algorithms, and other crunchy, delicious-sounding words. If the analyst avoids logit functions because they’re too hard to understand, left- and right-singular vectors will be over the top. The two key things that make analyses of this type easier or harder to understand are 1) the ability of the analyst to communicate findings to others, and 2) the ease-of-use of the software that you use. These two things can make it look easy, even if it is actually pretty heavy computation happening behind the curtain.
  • Myth #4: There is more insight to be found in unstructured data than in structured data. While there can be marvelous and shiny hidden nuggets in unstructured data, most of the bang-for-your-buck in analytics still comes from good old predictive models. It might not be true 10 years from now, but the current state of analytics is such that many companies are still reinventing the wheel on their structured data, trying to home-grow an analytics solution from scratch rather than making use of what is already out there and working for the competition.
  • Myth #5: Unstructured data cannot be quantitatively analyzed. Decades of work in natural language processing paired with Moore’s law make it possible to quantify and summarize the patterns found in unstructured data. This opens up a new world of data for statistical modeling.
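To make Myths #1 and #5 a little more concrete, here is a minimal Python sketch of what "creating quantitative structure" from text can look like: a few short messages are turned into a table of word counts, with one row per message and one column per word. The sample messages, the function name, and the naive whitespace tokenizer are all assumptions made up for illustration; a real project would lean on a proper text-mining tool and sparse storage.

```python
from collections import Counter

# Hypothetical customer messages, invented for this illustration.
docs = [
    "love the new store layout",
    "the checkout line was slow",
    "slow delivery but love the product",
]

def document_term_matrix(documents):
    """Return (sorted vocabulary, list of count rows, one row per document)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, matrix = document_term_matrix(docs)

print(vocab)
for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Even with three short messages the table is mostly zeros, which hints at why the real thing calls for sparse matrix algorithms and serious hardware.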

If you think that I am saying not to make use of your unstructured data, think again. Here are what I consider to be five good reasons to invest in unstructured data:

  • You already make good use of structured data. If you are already enjoying the benefits of statistical modeling, segmentation, experimentation, and other analytical techniques that turn your data into knowledge, then it is time to get more. Unstructured data can tap into new knowledge that you would otherwise miss.
  • You have a seasoned team of analytics professionals. These folks know how to frame a business problem as an analytical problem and answer questions with statistical analysis. They know the difference between an odds ratio and plain old odds and why both are useful. And they can communicate with the stakeholders in terms that everyone can understand. They might also be obsessed with good graphical presentation, and probably have an unnatural hatred of pie charts.
  • You have terrific hardware and software resources available. As I mentioned above, unstructured data analytical techniques usually involve massive computations. Every word of that 140-character tweet has to be converted into some kind of numerical representation. At its simplest, this is word counts. In a more complicated form, this is an arrangement of every tweet (a “document”), word-by-word, into a table. How many tweets do you have? (That is a big table.) Add to this web click-throughs, email traffic, and other sources of unstructured data. It boggles the mind to think how big this stuff is. Of course, SAS can handle this, but you will want some good hardware to do it elegantly. After all, we’re supposed to make this look easy.
  • You already have a melting pot of data happiness. If your structured data are of high quality and live in a well-integrated database, then you might be able to identify connections among customers who interact with one another, suppliers that depend on one another, and so on. By combining what you know about each “node” in social networks, you can transform information about a specific customer’s social network into structured data. Consider a churn example. If I know that Customer C’s best friend, mother, and spouse all have a high predicted probability of cancelling their contracts, then it might be time to send a box of “Please Don’t Go” chocolates to Customer C. These social links are a particularly tricky type of unstructured data but can yield great value, as they already have in areas such as fraud and security.
  • You own the unstructured data. Any time you have in-house data, you have a competitive advantage. Anyone can buy data, but the data you collect yourself, that your company owns, is special. To bury those bits of gold on storage devices never to be used is to miss an opportunity. Just watch out for the rules; how are you allowed to use the data? For example, data owned by the marketing department might be illegal for the credit risk department to use in production scoring.
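The churn example in the list above can be sketched in a few lines of Python. The customer IDs, the links, the scores, and the 0.5 cutoff are all hypothetical; the only point is to show how social links can be collapsed into an ordinary structured column (here, the mean predicted churn probability among a customer's direct contacts) that could sit beside the other fields in an analytical data mart.

```python
from statistics import mean

# Hypothetical predicted churn probabilities from an existing model.
churn_score = {"A": 0.82, "B": 0.75, "C": 0.10, "D": 0.91, "E": 0.05}

# Hypothetical undirected social links (friend, parent, spouse, ...).
links = [("C", "A"), ("C", "B"), ("C", "D"), ("E", "A")]

def neighbor_churn_feature(scores, edges):
    """Mean predicted churn probability of each customer's direct contacts."""
    neighbors = {customer: set() for customer in scores}
    for left, right in edges:
        neighbors[left].add(right)
        neighbors[right].add(left)
    return {
        customer: mean(scores[n] for n in contacts) if contacts else None
        for customer, contacts in neighbors.items()
    }

feature = neighbor_churn_feature(churn_score, links)

# Flag customers who look safe on their own but whose contacts look risky.
THRESHOLD = 0.5  # assumed cutoff, chosen only for this illustration
at_risk = sorted(
    customer
    for customer, value in feature.items()
    if value is not None
    and value > THRESHOLD
    and churn_score[customer] < THRESHOLD
)
print(at_risk)
```

In this toy data, Customer C scores low individually but is surrounded by likely churners, which is exactly the situation where the chocolates might be worth sending.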

Back to Jake… of course, the real answer to Jake’s conundrum is that both projects have merit, and that the modeling team should continue to work on the structured data that has already yielded benefit. The marketing manager should designate a team to work on unstructured data. Eventually the unstructured data will dovetail with the predictive modeling data mart.

Jake’s company is still actively reaping the benefits of their analytical data mart. He convinced the executive leadership that there is still considerably more return to be made from that investment. However, he is now working with the marketing team to extend the analytical data mart to incorporate unstructured data from call centers. To be sure, this could be a slower project than the original data mart was. But if they make smart use of the unstructured data to augment the analytical data mart, then the resulting whole can be greater than the sum of its parts.

What are some good and bad reasons you have for investing in unstructured data?

If you would like to learn more about analytics in business from problem framing to model deployment and management, sign up for the class Advanced Analytics for the Modern Business Analyst. If you live in Australia, come to SAS Forum Sydney 2013. I will be teaching a special four-day offering of the course the week after the conference. If you’re at the conference or the training, find me and say hello! It is always a treat to meet SAS Training Post readers.


About Author

Catherine (Cat) Truxillo

Director of Analytical Education, SAS

Catherine Truxillo, Ph.D. has written or co-written SAS training courses for advanced statistical methods, including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. She also teaches courses on leadership and communication in data science.

  1. Cat Truxillo on

    Michelle- I look forward to seeing you in Sydney, too!

    Jennifer -- I completely agree, and what customers are saying is definitely something hidden in structured data.

    Thanks for reading!

  2. I agree that most companies would do well to invest in unstructured data analysis. You can find out what customers are saying about you and find key insights you didn't know you were looking for.