Data visualization: first, prepare your data

We are in the age of big data. But even though modern software makes it easy to handle large data volumes, it's worth asking: do you always need all that data? In other words, if your analytics software can accommodate and even thrive in this big data environment, does that mean any data source is a candidate for these new big data tools? For me, the answer is a resounding no.

Since these big data tools can handle large amounts of data (some scale to billions of rows and thousands of columns), there is a natural temptation to just drop in all your data and let it fly.  But should you? My answer, again, is no.  To get the most out of these tools, you need to do more than just drop data in and churn out a report.

You should concern yourself first with data quality, design and reporting structures.

Unfortunately, the power of these tools makes it easy to get away with poor data quality and design. If you drop in a relatively small set of data (say, 10 million rows loaded into a system capable of handling billions), a poorly designed data set will still perform acceptably, meaning your reports and dashboards will render in a couple of seconds. But as more and more users consume precious space in these environments, it may be only a matter of time before an IT administrator has to make hard decisions about managing the environment.

Do you scale up the system by buying more hardware to handle the data volume?  Do you limit who can load data and what size data can be loaded?

I was recently involved in a big data reporting project where the data was reaching the upper limits of the environment. Due to time and cost considerations, we had to work within the existing environment, with no option to scale out. To bring the project in on time and under budget, we had to solve the problem by working with the data itself. At the successful conclusion of the project, I decided to document the steps we took to modify the data.

This list is not meant to be all-inclusive, but it should serve as a starting point to help you get the most out of your data and environment, while being a good steward of those precious IT resources:

Data preparation steps to consider

  1. Drop unused columns
    • Sometimes you bring in data just because it is part of the original data source. Data visualization is a great way to gain insights into your data and understand what may or may not be of value to your analysis.
      • Visualizations such as box plots and scatter plots, along with descriptive analytics such as mean, median and standard deviation can give you a better sense of your data elements.
    • If you are not going to use a column in your visualizations or reports, eliminate it (see the first sketch after this list).
  2. Reduce field lengths
    • Sometimes a field length is based on the length of the longest data element. Analyze how you will use this data.  You might be able to reduce the length of the field.
      • In my project there was one field with a character length of 85. Less than 1 percent of the data in this field had text of that length. By renaming or truncating those values, I was able to reduce the length to 40. Given that my data set was over 300 million records, this action alone reduced the file size considerably (see the field-length sketch after this list).
  3. Flatten the data set to combine rows, if possible
    • The sparsity of your data can negatively impact performance and increase the size of your data.
    • Sometimes the nature of your data is that it is extremely sparse. It may be possible to aggregate rows together, which could reduce your data size (see the aggregation sketch after this list).
  4. Transpose the data from rows to columns
    • Solutions like SAS® Visual Analytics not only help you visualize and report on the values in your data, but also integrate sophisticated analytics such as forecasting, decision trees and correlations (among others). Transposing data from rows to columns enables these analytics (see the transpose sketch after this list).
    • As a side benefit, this may also help reduce the sparsity of the data.
  5. Reduce cardinality, if possible
    • High cardinality may degrade performance, so it is a good idea to review your data and reduce cardinality where possible.
      • Some data will naturally have high cardinality because of their unique nature, such as customer ID or employee number.
      • Some data will have high cardinality, but not require it. Focus on these situations.
        • In my project there was a measure that was the result of a calculation. The original data set carried it out to 10 decimal places, but we did not require that precision in reporting, which resulted in extremely high cardinality. Rounding this field to 2 decimal places eliminated over 90 percent of the distinct values for this measure (see the rounding sketch after this list).
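
To make the first step concrete, here is a minimal pandas sketch of dropping unused columns. The DataFrame, column names and values are made up purely for illustration; the same idea applies in whatever data preparation tool you use.

```python
import pandas as pd

# Toy data with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "region": ["East", "West", "East"],
    "sales_amount": [250.0, 125.5, 310.0],
    "legacy_code": ["A1", "B7", "C3"],       # never used in any report
    "internal_note": ["ok", "ok", "check"],  # never used in any report
})

# Keep only the columns the visualizations and reports actually need.
df = df[["customer_id", "region", "sales_amount"]]
print(df.columns.tolist())
```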
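
For the second step, a small sketch of checking how long a character field really needs to be and then truncating it, again with made-up data:

```python
import pandas as pd

# Toy data: one long description among mostly short ones (hypothetical values).
df = pd.DataFrame({
    "product_desc": [
        "Widget",
        "Gadget",
        "Industrial-grade widget with extended warranty and export packaging",
    ]
})

lengths = df["product_desc"].str.len()
print("Longest value:", lengths.max())
print("Share of rows longer than 40 characters:", (lengths > 40).mean())

# If only a tiny fraction of rows exceed the target length, truncate (or recode) them.
df["product_desc"] = df["product_desc"].str.slice(0, 40)
```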
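
For the third step, a sketch of collapsing sparse detail rows down to the grain the reports actually use; the grouping keys and measure are hypothetical:

```python
import pandas as pd

# Toy detail-level data (hypothetical): several sparse rows per customer and month.
detail = pd.DataFrame({
    "customer_id":  [101, 101, 101, 102, 102],
    "month":        ["2024-01", "2024-01", "2024-02", "2024-01", "2024-01"],
    "sales_amount": [100.0, 50.0, 75.0, 20.0, 30.0],
})

# Collapse to one row per customer and month, summing the measure.
summary = detail.groupby(["customer_id", "month"], as_index=False)["sales_amount"].sum()
print(summary)
```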
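
For the fourth step, a sketch of transposing a long, one-row-per-metric layout into one column per measure (hypothetical data):

```python
import pandas as pd

# Toy "long" data (hypothetical): one row per account and metric.
long_df = pd.DataFrame({
    "account_id": [1, 1, 2, 2],
    "metric":     ["revenue", "cost", "revenue", "cost"],
    "value":      [500.0, 320.0, 410.0, 290.0],
})

# Pivot the metric names into columns so each measure becomes its own field.
wide_df = long_df.pivot(index="account_id", columns="metric", values="value").reset_index()
print(wide_df)
```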
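
And for the fifth step, a sketch showing how rounding an over-precise calculated measure collapses its distinct values (hypothetical numbers):

```python
import pandas as pd

# Toy measure carried out to 10 decimal places (hypothetical values).
df = pd.DataFrame({"ratio": [0.1234567891, 0.1234567902, 0.9876543210]})
print("Distinct values before rounding:", df["ratio"].nunique())

# Round to the precision the reports actually need.
df["ratio"] = df["ratio"].round(2)
print("Distinct values after rounding:", df["ratio"].nunique())
```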

The power, speed and agility of SAS Visual Analytics give users the ability to visualize, explore and report on large amounts of data.  While data preparation may not always be necessary, following some basic rules can help you increase your data quality and reduce the size of your data.  Use this not as a complete checklist, but as a starting point to help you manage your data.  And you just may make some friends over in IT.  Happy visualizing!

Demo SAS Visual Analytics with data from multiple industries.

About Author

Steve Mellgren

Sr Solutions Architect

Steve Mellgren is a Senior Solutions Architect with SAS’ Data Ethics Practice (DEP) and a champion of responsible innovation. As a member of the SAS Data for Good team, he collaborates with non-profits and humanitarian organizations to demonstrate how our technology can address their data, analytic and visualization challenges in a trustworthy and ethical manner. He holds a Master of Business Administration.
