Once in a while we run into a business customer who assumes that the number of records in the data warehouse (for customer or product, say) should exactly match the number of records in the master data management data store or the original source system. There are many reasons why the numbers don’t match. Here are a few of my thoughts:
1. Time may be part of the key in the data warehouse. If we store product data with time as part of the primary key, there can be multiple records for each product in the data warehouse. For example, Product AYZ (the widget) has characteristics (dimension attributes) that change with the seasons: in winter it only comes in black, in summer white, in spring blue and in fall brown. The objective in our data warehouse is to track sales by season for Product AYZ. The product dimension would therefore either carry a season (season 1, 2, 3, 4) alongside the product, or assign a different unique product number to each of the four seasonal versions. In the second case we would have to query by product name to see the results for the product as a whole. Either way, the dimension holds four rows for a product that master data may list only once.
2. Another reason the counts may differ is the use of "not applicable" or "unknown" rows for product sales information that is not described properly or does not sync up with a specific product. The product dimension may therefore have a row with a product name of "not applicable" or "unknown," which results in at least one more record than master data.
3. In the data warehouse we may also keep all records for the last seven years, each flagged with a status such as "delete" or "archived," whereas the source system or master data management system may do "physical" deletes.
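To make the three points above concrete, here is a minimal Python sketch of how one might reconcile the two counts. Everything in it is an assumption for illustration only: the `DimProductRow` layout, the `PLACEHOLDER_KEYS` values, the status names, the retired product "OLD1" and the `comparable_product_count` helper are hypothetical, not taken from any particular warehouse or MDM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimProductRow:
    product_number: str  # business key shared with master data (assumed)
    season: str          # time component of the dimension key
    color: str
    status: str          # "active" or "archived" (hypothetical values)


# Hypothetical placeholder keys used for sales that match no real product.
PLACEHOLDER_KEYS = {"UNKNOWN", "N/A"}

# Hypothetical master data: "OLD1" was physically deleted, so only AYZ remains.
master_products = {"AYZ"}

dim_product = [
    DimProductRow("AYZ", "winter", "black", "active"),    # reason 1: one row
    DimProductRow("AYZ", "spring", "blue", "active"),     # per season
    DimProductRow("AYZ", "summer", "white", "active"),
    DimProductRow("AYZ", "fall", "brown", "active"),
    DimProductRow("UNKNOWN", "n/a", "n/a", "active"),     # reason 2: placeholder
    DimProductRow("OLD1", "winter", "black", "archived"), # reason 3: kept history
]


def comparable_product_count(rows):
    """Count distinct real, active products so the result can be compared
    with the number of records in master data."""
    keys = {
        row.product_number
        for row in rows
        if row.product_number not in PLACEHOLDER_KEYS and row.status == "active"
    }
    return len(keys)


raw_count = len(dim_product)
reconciled_count = comparable_product_count(dim_product)
print(raw_count, reconciled_count, len(master_products))
```

The raw row count of the dimension is larger than the master data count, but once the seasonal versions are collapsed to one business key and the placeholder and archived rows are excluded, the two numbers line up. In a real warehouse the same idea would more likely be a `COUNT(DISTINCT ...)` query with the equivalent filters.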
Let me know of any other instances where the data warehouse would have more records than master data.