The promise of high-performance analytics, as I understand it, is this: Regardless of how you store your data or how much of it there is, complex analytical procedures can still access that data, run a series of calculations on it, and deliver answers quickly and accurately, using the full potential of the resources in your computing environment.
That promise was simple enough when most organizations were storing data on single machines or in data warehouses with common structures. Today, however, one of the challenges with big data is that it’s often coming at you too quickly to store it all, let alone finesse and structure it in the perfect format for analysis.
For instance, many organizations want to quickly save data that’s coming in from cell phones and inventory transactions, and come back to analyze it later. Others want to analyze data at deeper levels than before, modeling at the customer transaction level and not just the customer account level.
To bring high-end analytics to big data scenarios, SAS developers have been rewriting the company’s analytical procedures to operate in distributed computing environments. The new procedures have two components:
- The data access component scans the storage environment, recognizes how data are stored, and serves the data to the computational component.
- The computational component breaks complex algorithms and models down into a series of calculations that execute in parallel, whether on a single machine or in a distributed environment.
With minimal input from a local programmer, the first component acts as the explorer or cartographer. It takes a peek at the data and the computing environment, understands the lay of the land, and tells the second component how to divide and conquer: analyze various segments of the data simultaneously, then bring those segmented calculations back together to run the final step of the procedure that provides the answer.
Oliver Schabenberger, lead architect of the SAS High-Performance Analytics product, began the high-performance project by asking himself, “How can we develop software that can execute in a distributed environment, a single platform or a single machine from the same code base?”
His answer introduced the new data access component. It is important because it allows SAS to execute in different high-performance modes with minimal input from the user. If your data are on a single machine, the first component recognizes that and tells the second component to run the calculations in concurrent threads on a single machine, or by dividing and conquering on a high-performance computing appliance. If the data are stored in distributed form, for example, in a massively parallel database, the procedure knows how to break it down into steps that take advantage of the data structure and the computing environment.
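That mode-selection idea can be sketched in a few lines. The function below is purely illustrative (the names and rules are my own assumptions, not SAS internals): it inspects where the data live and picks an execution strategy accordingly, which is roughly the decision the data access component makes on the user's behalf.

```python
# Hypothetical sketch of a data-access component choosing an execution
# mode from the data's layout; names and logic are illustrative only,
# not the actual SAS High-Performance Analytics internals.

def choose_mode(data_location, node_count):
    """Pick an execution strategy based on where the data are stored."""
    if data_location == "local" and node_count == 1:
        return "threaded"      # concurrent threads on a single machine
    if data_location == "distributed":
        return "in-database"   # push the calculations out to the data nodes
    return "appliance"         # divide and conquer on an HPC appliance

print(choose_mode("local", 1))        # threaded
print(choose_mode("distributed", 8))  # in-database
```

The point of the sketch is that the same calling code works everywhere; only the strategy chosen behind the scenes changes.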
“What we are doing is taking the data source out of the picture,” says Paul Kent, VP of SAS Platform R&D. “In the future, whether you run on an EMC Greenplum appliance, an Oracle database, or a Teradata database, it won’t matter.”
To describe the divide-and-conquer approach, Paul offers a simple metaphor: What if we wanted to calculate the average age of everyone in a large auditorium? You could send one person around the auditorium to ask everyone their age, add up the ages and then divide by the number of people in the room. Or, you could start on the right side of the room and have each person report their age to the person on their left, who adds it to their own, and so on down the row. Then one person standing at the left counts the total number of people, compiles the age totals by row, and calculates the average.
With complex algorithms and large data sources, the calculations are obviously more involved, but you get the idea. Each computer node (or auditorium row) works on its own piece of the calculation simultaneously and reports its answer back to the procedure for a final calculation.
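The auditorium metaphor is essentially a map-and-combine computation. Here is a toy Python version of it, where a thread pool stands in for the computer nodes: each "row" independently reports a partial sum and head count, and one final step combines them into the average.

```python
# Toy illustration of the auditorium metaphor: each row (worker)
# computes a partial result in parallel; a final step combines them.
from concurrent.futures import ThreadPoolExecutor

def row_summary(ages):
    """Each row reports its own total age and head count."""
    return sum(ages), len(ages)

# Three auditorium rows of people, each with their ages.
rows = [[34, 29, 41], [52, 23], [38, 45, 31, 27]]

# "Map" step: every row works simultaneously on its own slice.
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(row_summary, rows))

# "Combine" step: one final calculation over the partial results.
total_age = sum(total for total, _ in summaries)
head_count = sum(count for _, count in summaries)
average = total_age / head_count
print(average)
```

No row ever needs to see another row's raw ages; only the small partial results travel, which is the same reason moving the work to the data beats moving the data to the work.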
“We’re not inviting data to come to us so we can munch on it anymore,” says Paul. “We’re finding clever ways to go where the data are, and move the work out to all the different slices of data as it exists.”
After all, moving data around to combine only the exact pieces required for specific business problems is not always practical. Plus, faster computers alone can no longer offer the increases in performance needed.
“Our ability to compute outpaces the industry’s ability to move data from disk to disk,” says Paul. With SAS High-Performance Analytics, organizations can take advantage of those fast computing capabilities and the smartest algorithms around, regardless of how the data is stored.