Don’t be fooled: there are really only two basic types of distributed processing

Every time I pick up a new article about analytics, I am disappointed to find no specifics about back-end processing. It is no secret that every vendor wishes they had the latest and greatest parallel processing capabilities, but the truth is that many software vendors are still bound by single-threaded processing, as their obvious reticence about discussing the subject suggests. Because they still rely on older approaches to data processing, many competitors toss around terms like ‘in-memory’ and ‘distributed processing’ to sow confusion about how their stuff actually works. I will explain the difference in this post and tell you why you should care.

The truth is that there are really only two basic types of distributed processing, namely multi-processing (essentially grid-enabled networks) and pooled-memory massively parallel processing (MPP). Multi-processing essentially consists of duplicate sets of instructions being sent to an array of interconnected processing nodes. In this scenario, each node has its own allocation of CPU, RAM, and data, and generally cannot communicate or share information with other nodes in the same array. While a large multi-step job can be chopped up into pieces and each piece processed in parallel, the multi-processing configuration is still largely limited by the duplicate, single-threaded sets of instructions that need to run.
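To make the multi-processing idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: worker threads stand in for grid nodes, and the names `node_job` and `run_grid` are made up for this example. Every "node" runs the same instructions against its own private partition, and only the coordinator ever sees more than one result.

```python
from concurrent.futures import ThreadPoolExecutor

def node_job(partition):
    """The same instructions run on every node, each against its own private data."""
    return sum(x * x for x in partition)

def run_grid(data, n_nodes):
    # Split the data so each node owns exactly one partition and never sees the others.
    partitions = [data[i::n_nodes] for i in range(n_nodes)]
    # Threads are an assumption here: they stand in for separate grid nodes.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, partitions))
    # Only the coordinator combines results; the nodes never talk to each other.
    return sum(partials)
```

Calling `run_grid(list(range(10)), 3)` gives the same answer as the single-node sum of squares. Notice that no worker can ask another for help mid-job.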

Contrast multi-processing with a pooled-memory architecture that has inter-node communication and does not require duplicate sets of instructions. Each node in a pooled-resource configuration can work on a different part of a problem, large or small. If any node needs more resources, data, or information from any of the other nodes, it can get what it needs by issuing messages to any of the other nodes. This makes for a truly ‘shared resources environment,’ and as a consequence it runs about ten times faster than the fastest multi-processing array configuration.
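The inter-node messaging described above can be sketched roughly as follows. This is a toy model, not any vendor's implementation: the `Node` class, the `fetch` helper, and the use of in-process queues as the message channel are all assumptions made for illustration.

```python
import queue
import threading

class Node:
    def __init__(self, name, data):
        self.name = name
        self.data = data            # this node's local share of the data
        self.inbox = queue.Queue()  # messages from other nodes arrive here

    def serve(self):
        # Answer requests from peers until a None sentinel says to stop.
        while True:
            msg = self.inbox.get()
            if msg is None:
                break
            key, reply_to = msg
            reply_to.put(self.data.get(key))

def fetch(owner, key):
    """Ask another node for a value by sending a message and waiting for the reply."""
    reply = queue.Queue()
    owner.inbox.put((key, reply))
    return reply.get(timeout=5)

def demo():
    a = Node("a", {"x": 2})
    b = Node("b", {"y": 40})
    server = threading.Thread(target=b.serve)
    server.start()
    # Node a needs y, which lives on node b, so it requests it entirely in memory.
    total = a.data["x"] + fetch(b, "y")
    b.inbox.put(None)  # tell node b to shut down
    server.join()
    return total
```

Here node `a` finishes its work by pulling a value from node `b` on demand, which is exactly what the share-nothing grid cannot do.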

Now, much of the confusion about these two types of distributed processing exists because of misuse of the term ‘in-memory’. The fact is that ALL data processing occurs in-memory at some point in the execution of a set of code instructions. So ‘in-memory’ is really a misnomer for distributed processing. For example, traditional SAS processing has always occurred in-memory as blocks of data are read from disk into RAM. As RAM allocations have gotten larger, more data has been loaded into memory, yet the instructions were still processed using a single-threaded, sequential approach. What was needed was a rewrite of the software to enable multi-threading, namely routing separate tasks to different processors. Combining a multi-threaded program with all data pre-loaded into memory produces phenomenally fast run times compared with what was possible before.

Even though a program is multi-threaded, there is still no guarantee that things will run faster. An obvious example is Mahout, an Apache project that relied on MapReduce to facilitate inter-node communication in a pooled-resource environment. MapReduce is notoriously slow, as nodes take a long time to load data into memory and must write inter-node communication requests to disk before other nodes can access them. As a consequence of its lethargic response time, Mahout has largely been abandoned by large business customers in favor of faster architectures.
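A rough way to picture that disk round trip is the sketch below. It is not Hadoop's or Mahout's API; the function names and the JSON file layout are hypothetical, chosen only to show that every intermediate result is written out and read back before the next step can use it.

```python
import json
import tempfile
from pathlib import Path

def map_to_disk(partitions, workdir):
    """Map step: each node writes its partial result to disk."""
    paths = []
    for i, part in enumerate(partitions):
        path = Path(workdir) / f"part-{i}.json"
        path.write_text(json.dumps({"sum": sum(part), "count": len(part)}))
        paths.append(path)
    return paths

def reduce_from_disk(paths):
    """Reduce step: read every partial back from disk before combining."""
    partials = [json.loads(p.read_text()) for p in paths]
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count

def mean_via_disk(partitions):
    # One pass costs a full write and a full read of the intermediate results.
    with tempfile.TemporaryDirectory() as workdir:
        return reduce_from_disk(map_to_disk(partitions, workdir))
```

An iterative algorithm would repeat this write-then-read cycle on every pass over the data, which is where the slowness described above piles up.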

Message Passing Interface (MPI) is a much faster communication protocol, because it runs in-memory and can handle the repeated passes over the data that are common in predictive analytics work. Currently there are only two MPI initiatives that offer true multi-threading: one based on Spark, an in-memory plug-in to Hadoop, and SAS’ High Performance Analytics. Spark development is still in its infancy, comparatively speaking, and it will likely be years before any push-button applications can make use of its capabilities. By contrast, SAS has products that are production-ready today and can dramatically shorten your analytics lifecycle. So, do not be fooled by claims of in-memory or distributed processing, because MPI-enabled pooled-memory processing is here to stay and is well positioned to become the de facto standard for all future predictive analytics processing.

For many standard analytics jobs, your existing architecture may be sufficient. But these phenomenally fast run times matter when you are trying to process dozens, if not hundreds, of tournaments built on the most advanced machine learning techniques, like random forests and deep-learning neural networks. Statistical professionals are finding that these new techniques are not only more accurate, but also allow us to investigate much lower levels of granularity than ever before. As a result, models are getting more precise and profitability is increasing along with them. So if you want to solve more problems faster and with more accuracy (plus use the same headcount), be sure to investigate claims of “in-memory” and choose the right architecture for your job.