Sigurd Hermansen, of Westat, got his idea for breaking big data into small pieces after watching Jim Goodnight dissect a blade server to show the processing power that high-performance analytics brings to big data. Goodnight also demonstrated how Visual Analytics Explorer transfers data from the servers to a workstation and then uses it in analysis through a point-and-click interface.
“For those of us who don’t have a blade server at our disposal, what I’m trying to do is give you some ideas that you can apply that will reduce big data to manageable proportions,” said Hermansen during his presentation at the SouthEast SAS Users Group conference (SESUG).
According to Hermansen, computing platforms now efficiently handle very large data sets. He discussed four of the methods that he believes are the most accessible:
- Compression – Works close to the machine level and in many cases is implemented at the machine level.
- Summarization – Reduces irrelevant detail while retaining the information in a summary table.
- Normalization – Partitions large sets of data into smaller sets, thus minimizing redundancy.
- Sampling – Selects a smaller set of data that represents essentially the same information as the whole.
Compression
Deflation is a straightforward compression method. “Let’s say you have a data set that has a lot of blanks and empty cells. You can reduce that pretty simply by introducing commas and semicolons to take the place of space,” he said.
He said the COMPRESS=YES option will give you an impressive degree of data reduction in some cases, along with better processing time. “In today’s world, CPU time is not very expensive, maybe even free in some sense. This is a great tradeoff of CPU time for less disk space,” said Hermansen.
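A minimal sketch of the two common ways to request compression, using hypothetical data set names:

```
/* Request compression for a single data set via a data set option */
data work.claims_small (compress=yes);  /* RLE compression of character data */
   set work.claims;
run;

/* Or turn compression on for every data set created in the session */
options compress=yes;
```

The SAS log reports the percentage reduction achieved for each compressed data set, which makes it easy to check whether the tradeoff is paying off.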
Summarization
Hermansen provided an example from real data analysis at Westat: In a very large multitudinal data infrastructure, the analysts were reading in electronic health records in the C32 standard, an XML document format. There is one XML document for each patient, and each had to be read in sequentially and then combined.
When the XML documents were loaded into a SQL Server schema, they ended up in a table containing a brand code system version associated with every code value. “So for a code value that might be 10 characters, you’d end up with 700-plus bytes of data that you were storing for each record,” said Hermansen. “And we had many hundreds of thousands of these.”
The descriptions of the coding were spread throughout the data set. Westat needed to learn how many different coding sets there were, so they summarized the data. The DISTINCT keyword in PROC SQL allowed them to limit the data set to the descriptions of the coding, reducing the complexity of the data without losing any information.
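A minimal sketch of that kind of summarization, using hypothetical table and column names rather than Westat’s actual schema:

```
/* Reduce the repeated coding descriptions to one row per distinct coding set */
proc sql;
   create table code_systems as
   select distinct code_system_name, code_system_version, code_description
   from ehr_codes;
quit;
```

Because the summary table still carries every distinct description, no information is lost; the long text simply is not repeated on every record.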
Normalization
According to Hermansen, the normalization method is at a slightly higher level than the preceding methods. “What we are doing there is implementing a relational database design with side effects that reduce big data,” he said. “So you get the benefits from having a better database design - even from the research perspective.”
Using that type of architecture allows you to create any other analytic data sets you might need, so it is very flexible. The advantages include minimizing redundant data, the number of variables and “structural” missing values. For instance, if you have a data set with many addresses that you want to put in a flat file, you will want to reduce the number of repeating variables. Structural missing values in this instance come from people who have, say, four addresses over a lifetime when others have five: the missing fifth address is a structural missing value.
“These structural missing values also get in your way when you are processing. You have to have special programming for handling them or not handling them: Are they supposed to be zeros, blanks or missing? It’s best not to have to deal with them,” said Hermansen. He added that you also reduce the number of variables, and thus the time spent managing them.
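A minimal sketch of that kind of normalization, assuming a hypothetical wide file with repeating variables ADDR1-ADDR5 and one row per person:

```
/* Normalize repeating address variables into one row per person-address.   */
/* Structural missing values (unused address slots) are simply not output.  */
data addresses (keep=person_id addr_seq address);
   length address $200;
   set people_wide;                 /* hypothetical wide file           */
   array addr{5} addr1-addr5;       /* repeating character variables    */
   do addr_seq = 1 to 5;
      if not missing(addr{addr_seq}) then do;
         address = addr{addr_seq};
         output;
      end;
   end;
run;
```

If a particular analysis later needs the wide layout back, it can be rebuilt from the normalized table with PROC TRANSPOSE.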
Sampling
“By using some of the sampling techniques, you can take a look at your data without having to handle enormous sets of data,” he said.
To effectively use sampling, you need to select data that closely resembles the characteristics of the population. In addition, the accuracy of the sample should increase with the size of the sample. Hermansen said this is traditionally done with random sampling, but he believes there are better ways to accomplish this.
This method judiciously ‘loses’ repetitious data. Be careful to make sure your sample size is large enough to retain the characteristics of the data.
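Hermansen believes there are better approaches than plain random sampling, but as a baseline, here is a minimal sketch of a simple random sample drawn with PROC SURVEYSELECT from a hypothetical input table:

```
/* Draw a reproducible 1% simple random sample from a large table */
proc surveyselect data=bigdata.transactions   /* hypothetical source     */
                  out=work.transactions_srs   /* sampled subset          */
                  method=srs                  /* simple random sampling  */
                  samprate=0.01               /* keep about 1% of rows   */
                  seed=12345;                 /* fixed seed: repeatable  */
run;
```

Exploratory analysis can then run against the much smaller sampled data set instead of the full table.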
His conclusion: The SAS system provides a handy reduction tool kit that can reduce big data to manageable proportions.
Read Hermansen’s entire paper, Reducing Big Data to Manageable Proportions, when it becomes available at SESUG.org.