SAS high-performance capabilities with Hadoop YARN

For Hadoop to be successful as part of the modern data architecture, it needs to integrate with existing tools. This integration allows you to reuse existing resources (licenses and personnel) and is typically 60% of the evaluation criteria for integration of Hadoop into the data center. Analytics tools are among the most important tools in many data architectures, and SAS is a major leader in this space.

Guided by our joint customers' deployment models for big data and by working with SAS engineering teams since 2013, we identified a few key integration patterns. These patterns center on exploiting the inherent scale-out compute and storage capabilities of Hadoop to enhance the richness of SAS analytics.

So far, we have provided connectivity options that address the movement of data and compute between discrete SAS and Hortonworks Data Platform clusters. These options, such as SAS access capabilities, facilitate widely used analytic workloads that can tolerate the inherent latency overhead of standalone clusters.

Share your cluster – How Apache Hadoop YARN helps SAS

Even though it sounds like something you hear on a Montessori school playground, the theme of “Share your cluster” echoes across many modern Apache Hadoop deployments.

Data architects are plotting to assemble all their big data in one system – something that is now achievable thanks to the economics of modern Apache Hadoop systems. Once assembled, this collection of data now has sufficient gravity to attract the application processing towards it – and people are increasingly becoming intolerant of the idea that we should make another copy (and have to reconcile, secure and govern that copy) to facilitate processing.

Data science versus narrative psychology

My previous post explained how confirmation bias can prevent you from behaving like the natural data scientist you like to imagine you are by driving your decision making toward data that confirms your existing beliefs.

This post tells the story of another cognitive bias that works against data science. Consider the following scenario:

Company-wide sales are significantly down this quarter. Ricky, the organization’s top salesperson for the last six quarters in a row, hasn't closed a single sale.

Soliciting information about enterprise reference data

The first step in establishing governance for reference data is assessing the existing reference data landscape: understanding what reference data sets are used, who is using them, and how they are being employed to support business processes. That suggests a three-pronged approach to identifying organizational business process and application dependencies on reference data domains.

Two of these prongs are empirical, involving analyses to find reference data domains (embodied via code value sets or enumerated inline inside programs) and then figuring out what they represent. The third prong involves engaging the business users to solicit their input.
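One way to picture the empirical side is a simple profiling pass that flags columns with only a handful of distinct values, since those often turn out to be code value sets. The sketch below is only an illustration under stated assumptions: the function name `candidate_reference_columns`, the thresholds `max_distinct` and `min_rows`, and the sample rows are all hypothetical and do not come from the post.

```python
def candidate_reference_columns(rows, max_distinct=10, min_rows=20):
    """Return {column: sorted distinct values} for low-cardinality columns.

    Hypothetical heuristic: a column whose values repeat across many rows
    and take few distinct values is a candidate reference data domain.
    """
    if len(rows) < min_rows:
        # Too few rows to judge cardinality meaningfully.
        return {}
    columns = {key for row in rows for key in row}
    candidates = {}
    for col in sorted(columns):
        distinct = {row.get(col) for row in rows}
        if len(distinct) <= max_distinct:
            candidates[col] = sorted(distinct, key=str)
    return candidates


# Made-up sample data: 30 orders with a status code and a country code.
rows = [
    {
        "order_id": i,
        "status": ("OPEN", "SHIPPED", "CLOSED")[i % 3],
        "country": ("US", "DE")[i % 2],
    }
    for i in range(30)
]
result = candidate_reference_columns(rows)
for column, values in result.items():
    print(column, values)
```

A pass like this only surfaces candidates. Working out what each value set actually represents still takes the follow-up analysis and the business input described above.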

Cracking the code for a successful conversion: establishing security

What kind of security do we need for this conversion?  In fact, where are the security people? 

Including security personnel upfront in any conversion project can sure save some time and heartache later.  It is important to include security for the following:

  1. Source system access – You must be able to profile the source data, check for quality issues, and attach any ETL or conversion programs to the source system.
  2. New platform (target) security for the data – Databases need the right security groups to be set up.  Also, consider the directory security on the server itself.
  3. User interface security – Who are the people that will require access to this application? In the project plan there is probably a task that refers to end user setup and security.  Consider adding to that deliverable a list of business users who will use this application.  Revisit this list as implementation gets closer and closer.

Achieving persistent data governance, pt. 1: link your teams

During client conversations I often hear stories about past efforts to launch data governance that never reached critical mass and ended up being resized and marginalized. I find such outcomes fascinating... in the same vein as a car crash that causes you to tap on the brakes as you drive by.

When I hear stories of these failed projects, I always ask myself similar questions: How did this happen? What mistakes were made? How can I avoid this in my program? In this blog series I want to share a few thoughts on achieving persistent data governance and recommendations for avoiding a roadside emergency while on your governance journey.

Video tutorial: 5 ways to instantly improve your data profiling performance

Data profiling is essential. So why do so many data quality teams fail to get the most out of this crucial technique? In my short video, you’ll discover the answers to unlocking the full potential of your data profiling efforts.

By broadening and deepening your knowledge of data profiling with new approaches to methodology and deployment, you'll realize numerous benefits such as:

  • Greater business impact
  • Faster decisions
  • Simpler profiling workflow

The answer lies with five simple techniques.

A foxier way to search

What are all of the companies in San Francisco trying to make the Internet of Things happen? Google it if you like, but you're only likely to get a simple list of companies, no doubt in an SEO-friendly order.

What if you could see those companies in a more comprehensive way? Better yet, what if you could filter and sort by Alexa Rank, headcount, company status (public vs. private), total revenue and other forms of structured data? And what if you could see how closely those companies' offerings compete with each other in a very visual and interactive way?

Can data change an already made up mind?

Nowadays we hear a lot about how important it is that we are data-driven in our decision-making. We also hear a lot of criticism aimed at those who are driven more by intuition than data. Like most things in life, however, there’s a big difference between theory and practice.

It’s easy to say that we will go where data drives us, but what happens if data is driving us to a destination that we’re uncomfortable with? What happens when data calls into question some of our long-standing beliefs?

We like to think that we are all natural data scientists who are ready, willing and able to be swayed by evidence presented by new data. And in a big data world we certainly do not suffer from a dearth of new data.

However, whether or not we want to admit it (especially to others), our minds are often already made up before we look at data. And big data makes a very good yes-man, amplifying our natural tendency to only search out data that supports our viewpoints so that we find further evidence for what we already believe.

This is known as confirmation bias, which, as Chip and Dan Heath, co-authors of Decisive: How to Make Better Choices in Life and Work, explained, “leads us to hunt for information that flatters our existing beliefs.” They cited a recent meta-analysis of more than 91 psychological studies involving over 8,000 participants that concluded we are twice as likely to favor confirming information as disconfirming information.

The value of reference data governance

In my last post, I shared some thoughts about challenges associated with the lack of management for reference data, such as reinterpretation of semantics and the inconsistencies that crop up when multiple copies are used. All of the challenges I mentioned are indications of a need for improving the enterprisewide governance of reference data.

The first steps in establishing governance involve assessing the current state and putting a management program in place. That program should include a framework for documenting the values and meanings of reference data management in a way that can be aligned with development of policies for governing use and sharing of those reference domains.

In turn, one can envision the benefits that can be derived through policy-driven reference data management, such as: