As I explained in Part 1 of this series, spelling my name wrong does bother me! However, life changes quickly at health insurance, healthcare and pharmaceutical companies. That said, taking unintegrated or uncleansed data and propagating it to Hadoop may help with only one issue: getting the data into the hands of the business user or consumer quickly, based on the business requirements. The data could very well NOT be cleansed or integrated with other systems – but the objective is speed. So, let’s define this type of user:
- They need the data as close to real-time as possible.
- They probably do some type of analytics on the data.
- They may not need the data cleansed because they are doing some type of fraud analysis, etc.
- They may not need the data integrated because this business group is targeting the data in one source system or just part of the data in multiple source systems.
- The data and analysis are NOT going outside of the company.
- This data usually does not need to be stored for long periods of time, unlike data warehouse data.
So, why integrate or cleanse data?
- The first objective would be to bring in one view of customers, claims, providers, financials, etc. – without duplicates. This requires an understanding of the data in all the source systems, profiling to find data anomalies and outliers, and sometimes the use of data quality tools to improve the data over time.
- Outside of the corporation, those receiving the information probably also want correct information. (Maybe it bugs them to have names spelled incorrectly, too.)
- Corporate reporting requires integrated/cleansed financial information for audit and reporting.
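To make the "one view without duplicates" idea concrete, here is a minimal sketch of how duplicate customers across source systems might be flagged. Everything in it is an assumption made for illustration: the record layout, the field names (`source`, `id`, `name`, `dob`), the sample rows, and the matching rule (normalized name plus date of birth). Real data quality tools typically use fuzzier matching that can also catch misspelled names, which this exact-match sketch cannot.

```python
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def find_duplicate_customers(records):
    """Group records by (normalized name, date of birth).

    Returns only the groups that contain more than one record,
    i.e. the candidates for merging into a single customer view.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (normalize_name(rec["name"]), rec["dob"])
        groups[key].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Hypothetical rows from two hypothetical source systems.
records = [
    {"source": "claims", "id": "C-1", "name": "Pat  Lee", "dob": "1980-05-17"},
    {"source": "billing", "id": "B-9", "name": "pat lee", "dob": "1980-05-17"},
    {"source": "claims", "id": "C-2", "name": "Sam Ortiz", "dob": "1975-11-02"},
]

duplicates = find_duplicate_customers(records)
for key, recs in duplicates.items():
    ids = ", ".join(f"{r['source']}:{r['id']}" for r in recs)
    print(f"Possible duplicate {key}: {ids}")
```

A check like this is only a starting point for profiling. The value comes from running it against every source system and deciding, with the business, which record wins when two disagree.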
Integrating and cleansing data once and using it multiple times is a great way to roll.