Is big data a big deal?

Maybe… but messy data is a bigger deal.

Big data hit the mainstream over the past year or so. I know this because the BBC has produced several programmes covering it. What I’ve heard is that there is no clear definition of what big data is or why it is important. When I ask people if they have big data, they overwhelmingly say “yes,” whether they have a thousand rows of data or many millions. So who is right? It depends.

Nowadays, statistical software, including desktop packages like JMP Pro that are designed to maximise the power of the desktop, can easily handle data sets with millions of rows. What is more important is the number of columns. Very tall and very wide data sets are truly big; these may require standard statistical techniques such as sampling to bring model building within the power of a desktop computer.

So if big data is easily manageable, what are the real challenges facing today's analysts, engineers and scientists? We surveyed delegates at two model-building seminars held recently in Marlow and Edinburgh and uncovered an interesting finding: every single delegate had messy data.

Make the most of messy data

You have messy data if it contains missing values, empty cells, outliers or wrong entries. Traditional statistical methods, such as linear and logistic regression, discard entire rows that contain any missing cells, resulting in a poorer model. Outliers also throw the model off, making it less useful.

John Sall discussed a new way of dealing with messy data called "Informative Missing" in his blog post. This takes the use of missing data beyond imputation to a new realm: Missing data might actually be informing you of something that is important and so must be included in your model. An example would be a loan applicant leaving part of their application blank in order to hide a poor credit history; this would be a critical finding for a credit analyst to model. If you are working in a manufacturing setting, data might be missing because the result was literally off-the-scale, which could be useful information to capture in the model. If you are modelling the activity of substances based on their chemical properties, you might have missing data for, say, decomposition temperature if the material was not seen to decompose over the measured temperature range; so if you include this information in the model, you would get better predictions of activity.
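
As a rough sketch of the general idea (this is an illustration of the technique, not JMP's actual implementation, and the credit-history numbers are made up), a continuous predictor with missing cells can be encoded as two columns: the values with the mean imputed, plus a 0/1 indicator flagging which rows were missing, so the model can learn from the missingness itself:

```python
def informative_missing(column):
    """Encode a numeric column containing missing values (None) as two
    columns: the values with the mean imputed, plus a 0/1 indicator
    marking which rows were originally missing."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [v if v is not None else mean for v in column]
    indicator = [0 if v is not None else 1 for v in column]
    return imputed, indicator

# Example: a credit-history field where two applicants left it blank
values = [12.0, None, 7.5, None, 10.5]
imputed, missing_flag = informative_missing(values)
print(imputed)       # [12.0, 10.0, 7.5, 10.0, 10.5]
print(missing_flag)  # [0, 1, 0, 1, 0]
```

The indicator column then enters the model as a predictor in its own right, so "left blank" can carry its own effect.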

There is a new class of modelling techniques called shrinkage methods that are designed to give you a model that predicts well using the smallest number of variables, even when input variables are strongly correlated. The Generalised Regression personality lets you use these methods from within the Fit Model platform. Used along with Informative Missing, it has the added benefit of using every row of data when building the model, even when the data is messy.
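
To illustrate what shrinkage does, here is a minimal, pure-Python sketch of the lasso, one of the best-known shrinkage methods (this is a toy implementation for illustration, not JMP's, and the data are invented). With two strongly correlated predictors, the penalty keeps one and shrinks the other to exactly zero:

```python
def soft_threshold(rho, lam):
    """Lasso shrinkage operator: pull rho toward zero by lam, clip at zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def lasso(columns, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on centred data.
    Minimises 0.5 * ||y - X beta||^2 + lam * sum(|beta_j|)."""
    p, n = len(columns), len(y)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the response with every predictor's
            # effect removed except predictor j's
            fit = [sum(columns[k][i] * beta[k] for k in range(p) if k != j)
                   for i in range(n)]
            rho = dot(columns[j], [y[i] - fit[i] for i in range(n)])
            beta[j] = soft_threshold(rho, lam) / dot(columns[j], columns[j])
    return beta

# Two nearly identical (hence strongly correlated) centred predictors
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.99, -1.01, 0.0, 1.01, 1.99]
y = [3.0 * v for v in x1]

beta = lasso([x1, x2], y, lam=1.0)
print(beta)  # the penalty drops one of the correlated pair to exactly 0.0
```

Ordinary regression would split the effect arbitrarily between the two correlated inputs; the lasso picks one, which is what makes shrinkage useful for finding a small, interpretable model.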

Decision tree-based methods are good for dealing with outliers, because the point at which a split occurs depends only on the rank order of the data, not on how extreme the values are. JMP Pro users are telling us that, thanks to this and Informative Missing, they are building useful models without having to clean their data first: with robust modelling techniques, you might be able to skip data cleansing and still produce a good model. Now that is truly revolutionary. Decision trees also have the added advantage of being visual, allowing you to explain your findings to executives.
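
To see why tree splits are robust, here is a toy best-split search (illustrative only, with invented data). The split that minimises within-group variation depends only on the ordering of the input values, so dragging an extreme value even further out does not move the split:

```python
def best_split(x, y):
    """Find the split on x that minimises the total within-group sum of
    squares of y. Returns k: the left group holds the k observations
    with the smallest x values."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]

    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_k, best_cost = None, float("inf")
    for k in range(1, len(ys)):
        cost = sse(ys[:k]) + sse(ys[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
k_before = best_split(x, y)

# Drag the largest x out to an extreme value: the rank order of x,
# and therefore the chosen split, is unchanged
x_outlier = x[:-1] + [1e6]
k_after = best_split(x_outlier, y)
print(k_before, k_after)  # 3 3
```

A regression line fitted to the same data would be pulled badly toward the extreme point; the tree's split is unmoved.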

What do I do if I have messy data?

JMP Pro is the software designed to deal with your messy data.

We will be running an exclusive, hands-on workshop in the UK for new users of this software on 12 June, so if you would like to join us in Marlow, let me know.

If you would like your managers to see how JMP Pro deals with these problems, you can ask them to join the webcast on 3 April when we will be showing two case studies.

tags: Big Data, Decision Trees, Generalised Regression, Informative Missing, JMP Pro, Messy Data

5 Comments

  1. Mike Clayton
    Posted March 28, 2014 at 11:34 am | Permalink

    Thanks very much.

    Students need to learn these more advanced methods, but schools are not supporting JMP Pro, so what is the chance that free webinars will demo newer concepts in a way that high school STEM classes can benefit?

  2. Volker Kraft
    Posted March 28, 2014 at 1:27 pm | Permalink

    Hello Mike, assuming you are in the UK, I have some good news: Right now we are exploring opportunities to support STEM-related classes at high schools with JMP packages for teaching. I would appreciate it if you got in touch (volker.kraft@jmp.com), so I can keep you updated and learn more about your requirements in high school education.

  3. Phil Kay
    Posted March 31, 2014 at 10:13 am | Permalink

    Is there anybody out there who NEVER has messy data?

    From my own experience, small scale trials and design of experiments will usually produce complete data sets without too many problems. However, getting into the realm of data mining and predictive modelling ...there is ALWAYS some messy aspect to the data.

    • Bernard McKeown
      Posted March 31, 2014 at 11:20 am | Permalink

      You are right Phil. The survey of the delegates to the seminar confirms this as well. This is why these techniques are so helpful.
