Imagine that you’ve been asked to provide analytics for an important business initiative. Your data is well-sourced and readily available. You’re ready to start digging in – only to discover that the data you were given doesn’t match up well. It’s like getting driving directions based on kilometers when your car measures in miles. Before beginning your trip, you’ll need to make modifications to the directions so you can use them to get to your destination.
Based on more than 30 years’ experience in data analytics across multiple industries, I know that one of the most consuming tasks required for analysis is data preparation. It’s the old 80/20 rule – yet the topic gets very little attention, despite the huge amount of time it consumes. In fact, many people overlook data preparation altogether at the planning stage.
If you’re like me, you don’t want to dive right in and use the data only to learn it’s not in an advantageous form. I’ve found that the proper amount of preparation and a clear concept of the requirements of your analytic task are tremendously helpful. Establishing a clear approach up front also reduces the chances of complications down the road.
My top data preparation tips
Much of the contrast in data is due to the data coming from disparate sources (LOBs, billing, profile database, product DB, web). We typically can’t control the creators and output of our data sources, which is why we need to make modifications on the front end of our analytic efforts. Here are 6 quick tips that will help.
- Character/numeric fields. Convert any numeric fields that won’t be used for calculations to character values. This includes such fields as Social Security Numbers, phone numbers and zip codes.
- Character field length. Modify the length of character fields to be consistent throughout data sources, avoiding truncation. Data elements such as account numbers may provide inconsistent merging unless compatibly aligned.
- Case assignment. Identify the nature of case assignments throughout sources to be consistent. This is especially important when matching sources on names that may be upper case/lower case combinations. It's often helpful to add an additional field, modified to lower case, for match queries.
- Date/time assignments. Be cognizant of the use of date/time values. I’ve seen many uses of date/time where it is really a date and zeroed out for time. In this scenario, it’s best to convert to just a date value if this is appropriate for the analysis. Time zone and consistent formats may fluctuate. Choose a uniform format such as mm/dd/yyyy or ddmonyyy.
- Special characters in fields. Phone numbers and SSNs may contain special characters such as (, ), -. To be consistent, these fields should be stripped of any special characters and compressed. Formatting for visual context may be done on the back end.
- Formal names. Individual’s names come in a multitude of variations. A full name may be populated all in one field across data sources, but vary in order. (First, Middle, Last), (Last, comma, First), etc. Creating fields that break names into their individual groups (First, Middle, Last) is an appropriate action.
These suggestions are not set in stone. A primary data source may require modifications to accessory data feeds for simplicity’s sake. The point is to be aware of such alignments prior to analysis. Once data preparation is properly applied, the fun part begins. Then you can do your analysis without being bogged down by data issues that interfere with getting accurate and timely results.