What skills do you need to become a data miner?


This is a question I am often asked: What skills do I need to become a good analyst or data miner?

To become a good data mining practitioner, one needs to understand statistical concepts and the basic principles of knowledge induction. Knowing inferential stats, t-tests, analysis of variance, regressions, and so on is important, and this is your "bread and butter" knowledge. Then, you need to know some academic or commercial software, since you won't be doing logistic regression on a piece of paper. At this point, as a student, you are at the mercy of your educational institution. If it has access to the software that large organizations in your local market are using - well, then your employability will be far higher than if it uses open source academic software.

So, you know the stats and you know some statistical software. Is that it? Surely not. A few times I have had statisticians with postgraduate degrees phone me to ask, "Where do I start with solving my business problem?" No predictive modeling project ever starts with a predictive model. In fact, no data mining project ever starts with data or with statistical/data mining software. Methodology is not what you learn in most stat classes! Right? So, knowing methodologies and processes is another vitally important ingredient.

Applying your knowledge to the business problem

Now, is knowing statistical methods, some statistical software and methodologies all you need to become a good analyst/data miner? Not quite! You also need knowledge of the applications for which you would use your statistical skills. If you are building a credit-card fraud detection model, as opposed to a policy lapse prediction model, the basic principles of model building are the same but the modeling nuances are not.

I often say that a good data miner understands that everything that happens in the data preparation phase, as well as in the modeling phase, is a function of the business problem and the business objectives. So, one can imagine that there are big differences between credit card fraud and insurance policy lapses. And those differences will translate into how you prepare your data and how you build your model.

Therefore, knowing the business application of the model is very important. And, regardless of which industry you are in, you will encounter some applications more often than others. Segmentation, cross-selling, retention/churn, customer value and profitability are particularly common. Every practitioner should be familiar with at least those.

So, is that all? Still no.

Knowing what works

How about knowledge of best practices? Instead of relying on your intuition and gut feeling when confronted with a specific modeling puzzle, wouldn't it be helpful to know what others have done when confronted with the same problem, and how successful they were? That is where knowing best practices comes in handy. And if the solution to your modeling problem is not in any best practices manual - then relying on your intuition and creativity becomes essential.

So, let's imagine you have all these skills, and you are required to present your results to the board of your multinational client, but you go in there talking about regressions and variance and induction. If you only talk about statistics at this point, you will not be able to convince the board that you will solve their business problem. And that means that even with all your skills, your project has FAILED!

What am I trying to say here? You have to be able to communicate with your business sponsors in a way that they can understand, so that you can convince them that what you have done will have positive business effects. So you have to be able to switch between two different languages: the technical language and the language of business folk. And this is not an easy task! I have seen good modelers blow it all by putting their business audiences to sleep, and I have seen modelers have huge success with mediocre modeling results, just by virtue of being able to talk the language of their business audiences.

So, now you can see why a good data miner is a rare creature. And hopefully, this article can help you identify your own areas for improvement.


About Author

Goran Dragosavac

Goran Dragosavac joined SAS Institute in 2000 and has been with SAS ever since. He has developed a successful track record in deploying Analytical Intelligence across different industries and across a variety of business applications, having completed over a hundred successful projects and assignments for most of the leading companies in South Africa, including work for the public sector and government. Goran is often invited to speak at seminars and conferences in front of business and academic audiences locally and internationally, and some of his work has been published in the academic literature.


  2. Elisabete Miranda on

    It’s a very interesting and useful article. I consider it relevant and important to all organizations that think to start data mining projects. Many thanks. EM

  4. This is a very informative article which helps me a lot. Thanks very much. One quick question. I've heard of a lot of algorithms in the data mining area, such as decision trees, KNN and so on. Why is regression so important, as you said in the first part of the article? Thank you.
