IT folks love SQL (Standard Query Language). Once you know how to program in SQL, you can work with almost any database because it is a standard. However, SQL is NOT a standard for doing analytics. The SAS programming language pre-dates SQL and even though SAS does SQL, SQL does not do SAS.
SQL is a programming language designed for managing data held in a relational database management system. SAS is a programming language designed to access data stored in any format, prepare it for analysis, perform analytics, and deliver insights and results based on this analysis.
Why is this important? Because when you move into the analytics world, SQL alone will just not cut it.
The decision to use "standard" ETL tools (which are SQL based) to support analytic environments and analytically derived data sources causes many headaches for IT, for business and ultimately for the organization as a whole. It can also limit what type of analytics can be used effectively and raises the overall risk of failure for analytic based initiatives or projects. In fact, one of the reasons "shadow IT" programs crop up in business units is because the data scientist types understand analytics preparation and exploration means more than what SQL has to offer.
Why is the typical ETL tool not the best choice for preparing data for analytics exploration and analysis? Let's look at a typical scenario.
IT uses their standard ETL tool to pull data from various sources to surface the data to their analytics team. This provides an initial "development" analytics data warehouse (ADW) for the analytics team. Then the analytics team uses analytical transformations and analytical data prep to stage a sandbox and what I will call analytical base tables (ABTs) that are used as the input for developing new insights, testing, and creating predictive models, forecasting models and optimization models.
Once a new insight has been discovered or a model has been developed, it must then be deployed into production systems for it to provide any real business value. For these analytical assets to work, they need the proper input values that reside in the analytical base tables. In other words, the necessary inputs reside in the ABTs which are located in the analytics environment, NOT in the initial ADW that IT provided.
This is why IT often struggles with getting analytics deployed because they are trying to use analytics with missing pieces.
Then, IT spends time and resources trying to understand what the analytics team did with the ADW data so that IT can duplicate it in, you guessed it, SQL. If IT used an analytics ETL tool instead, the organization would improve its overall data governance, auditability and provide a more streamlined process for deploying analytics into production. Using an analytics ETL to move the data from the analytics environment or the analytical base tables to a test and production "new" analytical datawarehouse, which is now the most efficient way to streamline deploying or operationalizing analytics.
To recap, recall my post describing the difference between business intelligence (BI) and business analytics (BA), and then consider these three data flow diagrams.
- This is a data flow that supports just business intelligence:
ETL from data sources --> dev(EDW) --> typical BI reporting --> test(EDW) --> prod(EDW) --> reports/end user consumption - This is the data flow that IT believes will support business analytics:
ETL from data sources --> dev(EDW) --> ETL --> dev(ADW) --> analytic exploration/discovery --> ETL --> test(ADW) --> prod(ADW) --> reports/end user consumption - This is an ideal data flow that should be used to support business analytics:
ETL from data sources --> dev(EDW) --> ETL to create dev (IT's ADW) --> analytic ETL --> dev(ABTs) analytic exploration/discovery --> analytical ETL --> test (ADW sourced from ABTs) --> prod (ADW)
It is human nature to fall back on what you are comfortable with or fully understand. Standard ETL tools based on SQL are good for moving around data (that already exists) and transforming it into different forms like a star schema (designed to optimize certain BI reporting queries). However, SQL alone does not render new insights that help "predict" what will happen or help drive more profit for an organization.
Metaphorically speaking, you can build a deck with nails (BI) and a hammer (standard ETL) is the right tool to use. If you want a better, stronger, longer-lasting deck then you should use screws (BA). However, if IT still wants to use a hammer (std ETL), it might be a problem. You actually can choose to use a hammer with screws, but why would you want too?
The concept described above is just one reason why SAS placed so high in the most recent report, The Forrester Wave™: Agile Business Intelligence Platforms, Q3 2014. Download it now to learn more.
2 Comments
Great post David, In the concepts we are fully in agreement. I am trying to explain this difference classic ETL and the analytics way also for some time now. Not many are accepting those differences even at ....
Well my comment is on some details using the dev/tst/acpt/prod indications.
This is important at the moment asking for (virtual) machines there are a lot of assumptions done. As the production environment is holy environment many are bothering bout that. The focus in that is on the production data sometimes privacy or for other reasons needed to be well protected. Just a statement: "developers are not allowed in production"
At network and machine level often a well segregated setup is done including firewalls to implement that. Having code objects, data objects, middleware and OS, there you have a N-P-K dimension problem of D T A P, nobody really want to understand/solve in practice.
As the analytics environment only is usefull with using "real life production data" all environment of the analytics process should be part of "production" segments even your analytics-dev.
I avoided the dev/prod for analytics by indicating it as background(dev) opposed to the approved analytics version foreground(prod). Data-reports could be promoted/copied within that aside some code. Feel the aversion in discussions coming in with version management tools release management tools. Any better suggestions for namings?
Jaap,
Thank you for reading, agreeing, and commenting on this topic. BTW: I also agree with your statement:
'As the analytics environment only is usefull with using "real life production data" all environment of the analytics process should be part of "production" segments even your analytics-dev.' so much so I have a previous blog on this very topic as well. See Treating analytic discover and model development as a production environment.
David