When people tell me they want the data fast, or as close to real-time as possible, I like to ask “Well, how fast is fast?” Sometimes the streamed data from an application is not very meaningful unless it's joined with other enterprise data. For example, point-of-sale information may require more comprehensive customer data before you can understand purchasing trends.
Combining or joining data with other sources can take a bit of time to accomplish. It usually does not happen instantaneously, and it may take some work to match and combine that data. So, if someone really needs the data fast, consider the following questions:
- What decisions are you making with this data? In insurance, are you trying to decide how many medical visits a client has left? Or are you deciding how many visits a client may have in the next six months?
- Does the streamed data need to be combined with other corporate data to be complete so that it can answer a specific question needed for a corporate decision? If so, where will this take place? How is the combined data used, and by what reporting tools?
- Does the streamed data equal the data stored further downstream, perhaps in a data warehouse? What if they're not equal? What if the data is enhanced along the way and does not equal what originally came into the enterprise? Are you going to get a ding during an audit? Does faster data make a difference in the decision-making process?
- Where is the streamed data stored? Is it in temporary storage, persisted storage or permanent storage?
- How is the streamed data maintained? What other applications (if any) receive this data?
- Does the streamed data include employee information that has to be GDPR-compliant?
- What are you doing with the documentation of the decisions made with this data? (My hope is that it's stored in a data warehouse and will be used for trending.)
- Does this data fall under enterprise data governance rules and/or procedures?
- Would these decisions or requirements be better met in the data warehouse? If that's the case, could the decision wait 24 hours rather than being made right now?
- Is this data used by other enterprise applications?
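Two of the questions above (joining streamed data with other corporate data, and checking whether enhanced data still equals what originally came in) can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the customer table, the event fields, and the function names `enrich` and `matches_source` are hypothetical and not taken from any particular product.

```python
# Hypothetical customer master data, keyed by customer ID (illustrative values only).
CUSTOMERS = {
    "C001": {"name": "Ada", "segment": "loyalty"},
}


def enrich(event, customers):
    """Return a copy of a streamed event joined with customer data, if any."""
    enriched = dict(event)
    # The customer entry is None when the ID has no match in the master data.
    enriched["customer"] = customers.get(event["customer_id"])
    return enriched


def matches_source(original, stored):
    """True when every field of the original event is unchanged downstream."""
    return all(stored.get(key) == value for key, value in original.items())


# A hypothetical point-of-sale event as it arrives on the stream.
event = {"customer_id": "C001", "sku": "A-17", "amount": 19.99}
enriched = enrich(event, CUSTOMERS)

print(enriched["customer"]["segment"])
print(matches_source(event, enriched))
```

Here the enrichment only adds a field, so the stored record still agrees with the source. If a downstream step rewrote `amount`, for example, `matches_source` would return `False`, and that is the kind of difference an auditor may ask about.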
The value of a data flow diagram
When all is said and done, it's a good idea to create a data flow diagram that shows all interactions with the streamed data. Consider including everything going in and out of the process. I would also include any other applications or interfaces that make the diagram complete and easy to understand. If there are processes within your application that enhance or change the data, you may want to consider a deeper dive in that documentation.
Streamed data is not bad data – it just brings complexity that must be addressed prior to implementation. If your corporation has an enterprise architecture group, consider doing a review with this group. Sometimes people in the trenches are not aware of other applications that want the same data, or they may not have the future picture that only the enterprise architect knows.
A good example – the other day I was drawing a data flow diagram, and it showed the master data management (MDM) system feeding data to a streaming application. It turns out (after talking with the enterprise architect) that the MDM system is slated for retirement in early 2019. So, it was back to the drawing board on how to get the required master data to combine with our streaming application.
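If the data flow diagram is also kept in a machine-readable form, this kind of impact question can be checked quickly. The sketch below is one possible way to do that in Python; the system names in `FLOWS` and the `downstream` helper are hypothetical and only loosely modeled on the example above.

```python
from collections import deque

# Hypothetical data flow: each key feeds data to the systems in its list.
FLOWS = {
    "mdm": ["streaming_app"],
    "point_of_sale": ["streaming_app"],
    "streaming_app": ["data_warehouse", "dashboard"],
    "data_warehouse": ["reporting"],
}


def downstream(flows, source):
    """Return every system that directly or indirectly receives data from source."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in flows.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Which systems are affected if the MDM system is retired?
print(sorted(downstream(FLOWS, "mdm")))
```

With these assumed flows, retiring the MDM system touches the streaming application and everything it feeds, which is the conversation you want to have with the enterprise architect before implementation rather than after.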