As a career data geek, I enjoy watching how the growing pervasiveness and popularity of data are reshaping industries and mainstream culture. A related trend is the increasing number of jobs that include data in their title. One that's becoming almost as prevalent as data scientist these days is data engineer. Searching a few of the major job posting websites, I found, as expected, some variability in how the role of data engineer was defined. But the most common job responsibilities and skills included:
- Identify and evaluate data sources, both internal and external to the enterprise.
- Design and model data infrastructure solutions for both relational and noSQL data structures.
- Build and maintain extract-transform-load (ETL), data integration and data quality processes.
- Document requirements, data lineage and subject matter in both business and technical terminology.
- Have strong programming skills in various languages (Python and Java were most commonly cited).
- Be proficient with big data technologies, such as Hadoop, MapReduce, Hive, Pig and Apache Spark.
It seems to me that data engineer has become a catch-all term for data-related responsibilities not assigned to data scientists and data analysts. And it encapsulates aspects of job titles I have heard for decades – like data modeler, database administrator (DBA), data steward and ETL developer.
Data engineers: Implementing data protection
Another recurring aspect of many job postings relates to data privacy and the role data engineers play in protecting sensitive data. This is especially relevant in light of regulatory compliance frameworks such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). To address these and other data privacy initiatives, organizations need clearly defined data privacy policies that detail how to identify sensitive personal data that's subject to protection.
Organizations can reduce the risk of unauthorized access to sensitive data by:
- Understanding how data privacy affects different organizational roles.
- Masking sensitive data, or applying privacy-preserving techniques such as differential privacy, so that essential analytics can run without needlessly exposing sensitive information.
Most of the responsibility for implementing data protection falls under the purview of what are now called data engineers.
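To make the masking idea above concrete, here is a minimal sketch of what a data engineer might write. It is only an illustration under stated assumptions: the field names, the `SENSITIVE_FIELDS` policy table and the `mask_record` helper are all hypothetical, and the real rules would come from an organization's own data privacy policy. It shows simple redaction and keyed pseudonymization, which on their own do not provide formal differential privacy guarantees.

```python
import hashlib
import hmac

# Hypothetical policy: which fields are sensitive and how each is treated.
# In practice this mapping would be derived from the data privacy policy.
SENSITIVE_FIELDS = {
    "name": "redact",
    "ssn": "redact",
    "email": "pseudonymize",
}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash so equal inputs still match each other."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict, key: bytes) -> dict:
    """Return a masked copy of the record; the original is left unchanged."""
    masked = {}
    for field, value in record.items():
        rule = SENSITIVE_FIELDS.get(field)
        if rule == "redact":
            masked[field] = "***"
        elif rule == "pseudonymize":
            masked[field] = pseudonymize(str(value), key)
        else:
            # Non-sensitive fields pass through for analytics.
            masked[field] = value
    return masked


if __name__ == "__main__":
    customer = {
        "name": "Ada Example",
        "email": "ada@example.com",
        "ssn": "123-45-6789",
        "region": "EU",
        "spend": 42.5,
    }
    # The key is a placeholder; a real one would live in a secrets manager.
    print(mask_record(customer, key=b"replace-with-a-managed-secret"))
```

Analysts can still group by region, sum spend, and count distinct customers through the pseudonymized email, while the raw identifiers never leave the pipeline.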