In 2012, Harvard Business Review declared the data scientist the sexiest job of the 21st century.

Here’s what we knew at the time: big data was (and still is) an enormous opportunity to make new discoveries. We were in the boom of user-generated content from social platforms, which meant big data was arriving in every variety and at high volume. Back then, data science was considered a “nascent” trade.

Where are we more than a decade later? Big data and data scientists are still a big deal. According to the U.S. Bureau of Labor Statistics, employment of data scientists is projected to grow 36 percent from 2023 to 2033 – much faster than the average for all occupations.

But here’s the elephant in the room: AI. The need for accurate, explainable and trusted data has compounded in the era of AI, shining a shared spotlight on the data engineer, whose core responsibility is to build quality data pipelines that yield trusted AI outputs.

AI ushers in new responsibilities for data management and governance

Data is the gas that fuels AI, and data engineering will continue to evolve to meet the demands of an increasingly complex technology landscape. With the AI evolution, data governance and privacy are critical concerns and will remain imperative for compliance with regulations and standards such as HIPAA, GDPR, the EU AI Act and ISO. Issues like disparate data, inconsistencies and incompatible data types can slow down model development and expose organizations to privacy and governance risks.

Understanding the impact of bad data

Poor-quality data without proper processing can lead to flawed business strategies and unexpected costs. According to Gartner, poor data quality costs organizations an average of $12.9 million every year. Every stage, from acquisition and integration to cleansing, governance, storage and preparation for analysis, must therefore be transparent and explainable to support business decisions.

“The crazy thing about AI is that it’s rarely a bad algorithm or bad learning model that causes AI failures. It’s not the math or the science; more often, it’s the quality of the data being used to answer the question,” said Dan Soceanu, Senior Manager in Technology Product Marketing at SAS.

Data sensitivities and privacy

Among data quality risks is the potential to share confidential information accidentally, especially sensitive data in health care, such as patient data. Data engineers use data masking and anonymization techniques to protect personal and sensitive information. This ensures that data can be used for analysis without exposing sensitive details.
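To make the idea concrete, here is a minimal sketch of masking and anonymization on a toy record. The field names, salt and records are illustrative assumptions, not any specific product’s API; it combines three common techniques: pseudonymization via a salted one-way hash, suppression of direct identifiers, and generalization of quasi-identifiers.

```python
import hashlib

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"patient_id": "P-1001", "name": "Jane Doe", "zip": "27513", "diagnosis": "J45"},
    {"patient_id": "P-1002", "name": "John Roe", "zip": "27601", "diagnosis": "E11"},
]

def mask_record(rec, salt="demo-salt"):
    """Pseudonymize direct identifiers and generalize quasi-identifiers."""
    masked = dict(rec)
    # Replace the identifier with a salted one-way hash (pseudonymization).
    digest = hashlib.sha256((salt + rec["patient_id"]).encode()).hexdigest()
    masked["patient_id"] = digest[:12]
    # Drop the name entirely (suppression).
    masked.pop("name")
    # Generalize ZIP to its first three digits (k-anonymity-style coarsening).
    masked["zip"] = rec["zip"][:3] + "XX"
    return masked

masked = [mask_record(r) for r in records]
```

The diagnosis column stays intact, so the masked batch remains useful for analysis while the direct identifiers are no longer recoverable without the salt.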

However, entrusting data to an AI process means measures must be taken to ensure that sensitive data doesn’t accidentally seep into AI outputs. Data engineers now have a role in ensuring that ethical guidelines are followed and bias is avoided.

“Addressing ethical concerns in AI requires a comprehensive strategy focused on fairness, transparency and accountability,” said Vrushali Sawant, Data Scientist, Data Ethics Practice at SAS. “Without a clear understanding of how AI algorithms reach conclusions, there is a risk of perpetuating societal inequalities and eroding trust in their decisions.”

The emergence of synthetic data

Data engineers will take a lead role with emerging technologies like synthetic data. Regulated industries need to build, train and test models but face challenges related to data privacy and availability. Introducing synthetic data into a data and AI platform can address these concerns and accelerate model development and deployment.

For instance, in health care, synthetic data can advance research into rare diseases by filling data gaps, while in the financial industry, it can ease data privacy restrictions.
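As a simplified illustration of the idea, the sketch below fits a normal distribution to each column of a tiny “real” dataset and samples new rows from it. The dataset and column names are made up for the example, and real synthetic data generators model joint structure (copulas, GANs, diffusion models) rather than per-column marginals.

```python
import random
import statistics

random.seed(42)

# A tiny illustrative "real" dataset: ages and glucose readings.
real = {
    "age": [34, 45, 52, 61, 29, 48, 55, 40],
    "glucose": [90, 110, 130, 150, 85, 120, 140, 100],
}

def synthesize(columns, n):
    """Sample each column from a normal distribution fitted to the real data.

    This per-column sketch only preserves marginal mean and spread;
    production generators also preserve correlations between columns.
    """
    out = {}
    for name, values in columns.items():
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        out[name] = [random.gauss(mu, sigma) for _ in range(n)]
    return out

# "Turn the crank": generate as many rows as needed, on demand.
synthetic = synthesize(real, n=1000)
```

The generated columns track the real data’s mean and variance while containing no actual patient values, which is the core privacy appeal.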

Industry projections back this up: according to Forbes, artificially generated datasets will become the preferred training ground for machine learning models.

“Synthetic data can solve data management issues that have challenged organizations for years. Organizations spend a lot of time acquiring data, preparing data and cleaning data for their AI development efforts,” says Brett Wujek, Senior Manager of Product Strategy at SAS. “It’s not a one-time process. It happens repeatedly. With a reliable synthetic data generation process, organizations can avoid costs associated with data acquisition and preparation and essentially ‘turn the crank’ on the data they need at any given time.”

AI engineers will need to regularly review synthetic datasets to ensure they are of high quality and accurately represent true patterns, a budding responsibility in the era of AI.
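A first pass at such a review can be automated. The sketch below, with made-up data and a hypothetical tolerance, flags a synthetic column whose marginal statistics drift too far from the real column; production fidelity checks would also compare correlations and full distributions.

```python
import statistics

def fidelity_report(real_col, synth_col, tol=0.15):
    """Return True when the synthetic column's mean and standard deviation
    fall within a relative tolerance of the real column's. A naive first
    pass, not a complete fidelity test."""
    checks = []
    for stat in (statistics.mean, statistics.stdev):
        r, s = stat(real_col), stat(synth_col)
        checks.append(abs(s - r) <= tol * abs(r))
    return all(checks)

real_ages = [34, 45, 52, 61, 29, 48, 55, 40]
good_synth = [36, 44, 50, 60, 31, 47, 54, 42]   # tracks real patterns
bad_synth = [90, 91, 92, 93, 94, 95, 96, 97]    # clearly off

print(fidelity_report(real_ages, good_synth))  # True
print(fidelity_report(real_ages, bad_synth))   # False
```

Running a check like this on every regenerated batch is one way the review becomes a routine pipeline gate rather than a manual chore.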

Modern data management and automation

Machine learning and AI capabilities can be used to automate repetitive tasks, allowing data engineers to focus on more strategic work. DataOps is critical to data engineering and maintaining efficient data pipelines with high-quality data.

“The path to successful AI is intrinsically linked to modern data management practices,” says Soceanu. “Data-powered AI is often hindered by unstructured, inaccessible data across the enterprise.”

The highest-quality data needs to be ready and available to inform decisions. Finding novel ways to automate and streamline data tasks will help the data engineer ensure that trusted data is passed to the data science team.
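One common automation pattern is a DataOps-style quality gate: declarative rules run against each batch, and the pipeline blocks the handoff to data science when any rule fails. The rule names, thresholds and records below are illustrative assumptions, not a specific tool’s syntax.

```python
# Hypothetical DataOps-style quality gate for a pipeline stage.

def run_checks(rows, rules):
    """Apply each named rule to the batch; return the names of failing rules."""
    return [name for name, rule in rules.items() if not rule(rows)]

rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "b@example.com", "amount": 25.5},
    {"id": 3, "email": None, "amount": 12.0},  # quality problem
]

rules = {
    "no_null_emails": lambda rs: all(r["email"] is not None for r in rs),
    "unique_ids": lambda rs: len({r["id"] for r in rs}) == len(rs),
    "amounts_positive": lambda rs: all(r["amount"] > 0 for r in rs),
}

failures = run_checks(rows, rules)
if failures:
    # In a real pipeline this would halt the run or quarantine the batch.
    print("blocked:", failures)
```

Because the rules are data, not code scattered through the pipeline, adding a new check is a one-line change, which is what makes this kind of gate cheap to automate and maintain.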

Alignment within the data and AI life cycle

The demand for large volumes of preprocessed data to support AI initiatives has grown exponentially – with no slowdown in sight. As a result, data engineering teams are working more closely with data science teams than ever before. But it doesn’t just stop at data science. AI success is achieved when data and AI platforms support all roles, such as data engineers, data scientists, MLOps engineers and business analysts. Working within a single platform enables teams to complete the end-to-end data and AI life cycle efficiently with transparency.

As data management and governance become increasingly crucial for ensuring trustworthy AI outputs, the significance of every role within the data and AI life cycle grows. Enhanced collaboration among data engineers, data scientists, MLOps engineers, and business analysts will lead to quicker value realization and more reliable AI. Among these, data engineers stand out as the unsung heroes, playing a vital role in the foundational success of data and AI initiatives.

Significant portions of the data and AI life cycle are spent cleaning and preparing data, rather than modeling or utilizing it. The Futurum Group conducted an in-depth analysis of three distinct data and AI platforms to measure their impact on productivity throughout the data and AI life cycle. The study found data engineering tasks, like data upload, data profiling, data sensitivity analysis and data quality analysis were:

  • 16 times more productive versus the commercial platform alternative.

  • 16 times more productive versus the non-commercial platform alternatives.

Read the report, Unlock AI Productivity With SAS Viya


About Author

Lindsey Coombs

Senior Editor, Data and AI

Lindsey Coombs is a Senior Editor for data and AI at SAS. She researches and writes on topics covering advanced analytics and evolving tech like generative AI. Lindsey is a seasoned communicator with more than 18 years of experience writing content for a broad range of industries and audiences. She is passionate about the safe and ethical use of technology that benefits humanity.
