~Contributed by Becky Graebe, SAS Communications Manager~
If there was any doubt in the minds of SAS Global Forum attendees that the computing landscape has changed remarkably in recent years, Vice President of Platform R&D Paul Kent and Research Statistician Developer Oliver Schabenberger put that doubt firmly to rest Wednesday.
Kent and Schabenberger addressed a standing-room-only crowd eager to learn what high-performance computing is all about.
“Nowadays, everyone is building big data farms – racks and racks of very standardized bricks of computers. Racks and blades … just bring them in by the truckload, and we’ll figure out what to do with them later,” said Kent. “At the same time, the data landscape has changed hugely. For the first time, we’ve collected more data than we can keep.”
It is that intersection – where big data and big analytics come together – that creates a major area for tools and solution development. “You need both of those to make it happen. One without the other just isn’t that interesting,” Kent said.
The hardware story
Kent explained the basics of the high-performance computing architecture: multisocket with multicore processing and commodity blades. Blade servers are useful because they are essentially stripped-down computers, and 48 or 96 GB of RAM per blade is not uncommon. The blades are arranged in enclosures called chassis. Chassis are arranged in racks.
“If you can buy three racks of these blades, you’re looking at a terabyte of memory. There are a lot of things that can fit in a terabyte,” Kent said.
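As a rough check on the memory figures above, the short sketch below computes how many blades it takes to reach one terabyte of combined RAM at the two per-blade sizes mentioned. The 1,024 GB-per-TB convention and the function name are assumptions made for illustration; the talk did not state how many blades fit in a rack.

```python
import math

GB_PER_TB = 1024  # binary convention; an assumption, not stated in the talk


def blades_for_memory(target_tb: float, gb_per_blade: int) -> int:
    """Return the smallest number of blades whose combined RAM reaches target_tb."""
    return math.ceil(target_tb * GB_PER_TB / gb_per_blade)


for gb in (48, 96):
    print(f"{gb} GB blades needed for 1 TB: {blades_for_memory(1, gb)}")
```

Under these assumptions, a terabyte takes 22 blades at 48 GB each or 11 blades at 96 GB each.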
The data story
Experts estimate that as much as 1,800 exabytes of data (each exabyte represents one quintillion bytes) will be captured this year. Organizations will have to tackle the fact that their data is going to have to be on more than one computer.
So what’s an organization to do with a data explosion of this magnitude? Kent believes there are three options:
- Continue to buy more storage.
- Store it in multiple places (distributed databases, parallel processing).
- Adopt a classic, shared-nothing Massively Parallel Processing (MPP) on Database Management System (DBMS) architecture (a set of CPU blades, a matching set of disks and a private network that connects it all together).
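The shared-nothing idea in the third option can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular appliance works: the "nodes" are plain lists in one process, rows are assigned by a CRC32 hash of a hypothetical customer key, and all function names are invented for the example. Each node summarizes only its own slice, and a coordinator combines the small per-node results.

```python
import zlib
from collections import defaultdict


def node_for(key: str, n_nodes: int) -> int:
    """Deterministically map a row key to one node (hash distribution)."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def distribute(rows, n_nodes):
    """Give each node its own private slice of the rows; nothing is shared."""
    slices = defaultdict(list)
    for key, value in rows:
        slices[node_for(key, n_nodes)].append((key, value))
    return slices


def local_summary(node_rows):
    """Work a node performs using only the rows it owns."""
    values = [value for _, value in node_rows]
    return len(values), sum(values)


def total_mean(rows, n_nodes=4):
    """Combine per-node (count, sum) pairs into a global mean."""
    slices = distribute(rows, n_nodes)
    partials = [local_summary(slices[i]) for i in range(n_nodes)]
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


# Hypothetical data: 100 rows keyed by customer id.
rows = [(f"customer-{i}", float(i)) for i in range(1, 101)]
print(total_mean(rows))
```

The coordinator only ever sees one count and one sum per node, which is why this layout scales by adding nodes rather than by growing a single machine.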
Most organizations are looking at the third option. In press releases issued earlier this week, SAS announced that it will offer SAS High-Performance Analytics in such fashion on EMC Greenplum and Teradata appliances.
Schabenberger explained what SAS has done to bring mathematics into and alongside databases, using three analytical tiers: hindsight, insight and foresight.
Hindsight is generally what organizations do when they have massive amounts of data: they look back at it and attempt to learn something from it.
Insight involves descriptive modeling and considers relationships among certain variables. It may involve correlation or factor analysis and gets a little more interesting.
Foresight is “the upper echelon of analytics,” said Schabenberger. It includes predictive modeling, random effects, linear effects and optimization. “That’s the good stuff. High-end analytics offers much more than slice-and-dice reporting.”
But custom reporting at this level comes with some requirements, namely co-location of data and analytics. “You have to get your data closer to the analytics,” he said. Co-location, however, has to be done right and has to adjust to the complexity of the task. It is also important to parallelize the work and to keep data in memory rather than reading it from disk.
Schabenberger also discussed three acceleration strategies to access data within the database management system:
- Structured Query Language (SQL) pass-through.
- Inside the database.
- Alongside the database.
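To make the first strategy concrete, the sketch below contrasts pulling every row out of a database with passing the aggregation through as SQL so that only the result comes back. It uses Python's built-in sqlite3 module purely as a stand-in for a real DBMS; the table, column names and function names are hypothetical, and this is not SAS syntax.

```python
import sqlite3

# In-memory SQLite database standing in for a real DBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0)],
)


def mean_by_moving_data(conn):
    """Pull every row to the client, then aggregate on the client."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        count, total = totals.get(region, (0, 0.0))
        totals[region] = (count + 1, total + amount)
    return {region: total / count for region, (count, total) in totals.items()}


def mean_by_pass_through(conn):
    """Send the aggregation as SQL; only one row per region comes back."""
    query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


print(mean_by_moving_data(conn))
print(mean_by_pass_through(conn))
```

Both functions return the same per-region means, but the second moves one row per region across the connection instead of the whole table.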
SAS High-Performance Analytics uses the new strategy of running alongside the database. In this approach, math processes run as peers of the database and use the same hardware as the database. The database remains, but the analytic processes turn on and off dynamically. Because this is a co-location model, data is passed, rather than moved, to the analytic process, and math processes can communicate without having to send raw data.
“We don’t have to send data from node to node to node,” said Kent. “The opportunity to share information before you go to the next step of your mathematics is the breakthrough. In the classic approach, there was no way to communicate between units of work.”
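One way to picture math processes that share information without sending raw data is a regression fitted from per-node summaries. The sketch below is an illustration under assumptions, not the SAS implementation: each hypothetical node reduces its own points to five running sums, the sums are added together, and an ordinary least-squares line is fitted from the combined totals. All names and the sample data are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for simple linear regression of y on x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Merging two nodes' summaries needs only these five numbers.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def local_stats(points):
    """Summarize one node's points; the raw points never leave the node."""
    s = Stats()
    for x, y in points:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit(stats):
    """Least-squares slope and intercept computed from combined summaries."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Hypothetical data spread over three nodes; every point lies on y = 2x + 1.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
combined = sum((local_stats(p) for p in partitions), Stats())
slope, intercept = fit(combined)
print(slope, intercept)
```

However many rows each node holds, it contributes just five numbers to the combined fit, which is the kind of exchange the quote describes.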
The benefits and the PROCS
The database appliance provides replication and failover, so that if a node goes down, the database knows where to find the data. SAS High-Performance Analytics offers high-end performance analytics with a user interface familiar to SAS users. All the new procedures will work on the desktop as well.
The release will include ten new SAS procedures that move existing SAS procedures from single-threaded to multithreaded execution:
- HPREG: linear regression and variable selection.
- HPLOGISTIC: logistic regression and variable selection.
- HPLMIXED: linear mixed models.
- HPNEURAL: neural nets.
- HPNLIN: nonlinear regression and maximum likelihood.
- HPREDUCE: covariance/correlation analysis, variable reduction.
- HPDMDB: summarization.
- HPSUMMARY: descriptive statistics.
- HPFOREST: predictive modeling based on decision trees.
- HPDS2: next-generation DATA step.
“The new PROCS will be aware of the distributed computer model … You won’t even notice that it happens,” said Schabenberger.
“Well, we hope you’ll notice,” Kent said. “It will just be a lot faster.”