In my last blog post we defined data scientists – who they are and what they do. In this post, we'll discuss the data engineer, who has such an important role that you'll want this person as your very best friend!
A data engineer transforms and integrates data into the format required for analysis or reporting. To do this, they must act as a data architect with infrastructure knowledge. (I've been doing this for my entire career, by the way.)
The data engineer works side by side with data scientists and business users. Data engineers' skills include:
- Database knowledge, both structured and unstructured.
  - Working with structured databases entails having knowledge of performance and tuning, indexing, data design, where the data came from, frequency of updates and more.
  - Working with unstructured databases requires knowledge of the programming and rules that apply in this environment, for example for data ingestion, data integration and usage.
- Programming and ETL. This requires knowledge of different programming languages, as well as extract, transform and load (ETL) software. SQL is a plus for structured data, and other languages (Java, Python, Scala, etc.) are useful for Hadoop and NoSQL databases.
- Data integration best practices. This calls for a good understanding of how to join data from multiple databases, as well as knowing when to create an integrated store of data for use by the enterprise.
- Front-end tool knowledge involves:
  - Reporting – the enterprise needs reporting tools for standardized data presentation. Typically these tools are used for specific reports. Once standardized, these reports may not change often, but they should be inventoried and monitored over time for effectiveness.
  - Analytics – most organizations also require analytical tools. Some data collected for analytics is only used for a short period of time, so it does not need to be kept for historical purposes. Other analytics may require raw data for analysis. The analytical tool needs to match the requirements and the data collected to meet those requirements.
- Business metadata. The data engineer knows the business rules that surround the data. For example, in an insurance environment the data engineer may have taken part in the data design. Hopefully this information is updated in a metadata repository.
- Technical metadata. The data engineer has an active role in creating technical metadata based on data collection/rules.
- Operational metadata. The data engineer knows the operational requirements for backup/recovery and security on the data used across the enterprise.
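To make the ETL and data integration skills above a little more concrete, here is a minimal sketch in Python. It is only an illustration: the insurance-style sample data, the table names (`policies`, `customers`) and the helper functions are all hypothetical, and an in-memory SQLite database stands in for whatever platforms an enterprise actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a source system.
RAW_POLICIES = """policy_id,customer_id,premium
P-1,C-1, 1200.50
P-2,C-2,980
P-3,C-1,
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, cast premium to float, skip rows with no premium."""
    clean = []
    for row in rows:
        premium = row["premium"].strip()
        if not premium:
            continue  # assumed business rule: a policy without a premium is not loaded
        clean.append(
            (row["policy_id"].strip(), row["customer_id"].strip(), float(premium))
        )
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the (hypothetical) policies table."""
    conn.execute(
        "CREATE TABLE policies ("
        "policy_id TEXT PRIMARY KEY, customer_id TEXT, premium REAL)"
    )
    conn.executemany("INSERT INTO policies VALUES (?, ?, ?)", rows)


def run_pipeline():
    """Run the ETL steps, then integrate the result with a second table via a join."""
    conn = sqlite3.connect(":memory:")
    # A second data set, standing in for data held in another database.
    conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [("C-1", "Avery"), ("C-2", "Blake")]
    )
    load(conn, transform(extract(RAW_POLICIES)))
    # Indexing the join column is one small example of performance tuning.
    conn.execute("CREATE INDEX idx_policies_customer ON policies (customer_id)")
    return conn.execute(
        "SELECT c.name, COUNT(*), SUM(p.premium) "
        "FROM customers c JOIN policies p ON p.customer_id = c.customer_id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()


if __name__ == "__main__":
    for name, policy_count, total_premium in run_pipeline():
        print(name, policy_count, total_premium)
```

The point of the sketch is the shape of the work rather than the tooling: each step is small and testable, and the join at the end is where data from two sources becomes one integrated view.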
The data engineer's skills are just one part of the story. Data engineers also need to know (or ask) the following:
- How the data should be gathered and integrated, how often it is updated, which ETL/programming tools are used (and how), and the quality of the data.
- How the data is transformed, and any business rules that are applied to the data – and they must be prepared to actively update metadata repositories.
- What database is used for the data and how to optimize performance for data usage. This requires infrastructure knowledge of all the enterprise data platforms.
- Data warehousing/business intelligence best practices using industry examples. A soup-to-nuts understanding of data warehousing includes gathering requirements, design, implementation, presentation, etc.
- Database technologies, especially the technology that houses the information required for analysis.
- Presentation tools and other programming languages for structured and unstructured platforms.
- Metadata (business, technical and operational) – the data engineer knows where all the data is, based on the metadata.
- Data integration and processing skills.
- Enough data modeling/design knowledge to understand the joins between the data tables and the integration rules applied to the data.
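One way to picture the "apply a business rule, then update the metadata" habit from the list above is the short Python sketch below. Everything in it is an assumption made for illustration: the `TransformMetadata` record, the `apply_rule` helper and the premium rule are hypothetical, and a real metadata repository would have its own structure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformMetadata:
    """Hypothetical technical/operational metadata for one transformation run."""
    source: str    # where the data came from
    rule: str      # the business rule that was applied
    rows_in: int   # row count before the rule
    rows_out: int  # row count after the rule
    run_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def apply_rule(rows, source, min_premium=0.0):
    """Keep rows whose premium exceeds min_premium and describe what was done."""
    kept = [row for row in rows if row["premium"] > min_premium]
    metadata = TransformMetadata(
        source=source,
        rule=f"premium > {min_premium}",
        rows_in=len(rows),
        rows_out=len(kept),
    )
    return kept, metadata


if __name__ == "__main__":
    sample = [{"premium": 100.0}, {"premium": 0.0}, {"premium": -5.0}]
    kept_rows, meta = apply_rule(sample, source="policies")
    print(kept_rows)
    print(meta)
```

Returning the metadata alongside the data keeps the two from drifting apart, which is the practical reason to record it at the moment the rule runs.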
Considering all of that, it may be safe to say that data engineers really are jacks of all trades. They definitely need to be involved upfront, and throughout any project.