An estimated 402.74 million terabytes of data are generated every day. That’s a staggering number.

As more organizations gravitate toward cloud-native, open-source-friendly architectures integrated with a broader ecosystem of tools, demand mounts for explainability from AI models trained on this data. So does pressure to show ROI from AI investments.

Data is the source of truth that informs business decisions, which are critical to reputation, financial success and regulatory compliance. Today, leaders demand clear, credible views into governance and risk. At the same time, users of these technologies – such as data engineers and data scientists – want ease of use for exploring, preparing and operationalizing data.


When complexity gets in the way of either, it slows decision-making, stalls innovation, and drives up costs. That’s why modern data estates share common goals: rationalize existing data, improve trust and governance, modernize integration and access, and shift heavy compute closer to data without compromising production analytics.

Dan Soceanu is Senior Manager of Product Marketing for platform, data management and enterprise decisioning at SAS. In this Q&A, he shares how modern data management is critical for AI success as it evolves from copilots into agentic systems with varying levels of human oversight.

What are the current market demands for a modern data estate? Why now?

Soceanu: The market is evolving quickly. Data has always been at the core of computing, but even more so with the emergence of AI. As organizations move from pilot programs to production, they realize that their data estate needs to be structured for success.

Customers are looking for flexibility to work where they are in their environments, while data estate modernization is increasingly critical for “AI readiness.” Embedding governance into analytics and data-driven decision flows provides auditability, approvals and traceability. Organizations also need to operationalize models and decisions with monitoring and controls.

So modern data isn’t just stored better; it’s used better. Data scientists and engineers want flexibility and choice in programming languages.

They’re looking to automate business processes to save valuable time on routine tasks and have more time for high-value work. They need a solution that meets data governance requirements while addressing emerging challenges such as data sovereignty.

Organizations not only need to protect data from misuse, but also retain control over how data is stored, shared and used. This is particularly true for global governments. Data management has become high-stakes for private and public sector organizations, especially as they consider moving to agentic workflows with less human oversight.


Speaking of flexibility, there are various data architectures to consider. Can you explain how open data lakehouses can help build better and faster data pipelines for AI?

Soceanu: AI demands faster access to high-quality data across the organization. Many organizations are moving from batch to real-time data processing across all data formats. It’s not one-size-fits-all. Data engineers need to work with their existing data infrastructure and data management technology stack.

The secret ingredient is a data and AI platform, spanning data integration to model output, that is built for flexibility, explainability and auditability, with options to support the specific business need. Modern data estates standardize on open object storage and open formats. Think Parquet, Delta, Iceberg, Hudi, etc.

Standardization supports modernization by working with lakehouse architectures, allowing data engineering teams to standardize storage, while analytics teams remain productive. This enables governed access across domains without forcing every consumer into a single tool or query engine.

For example, a data lakehouse architecture is ideal for AI projects. Analytics can be integrated and applied to this architecture to bring value from data to modeling. Cost management is also a factor with a lakehouse architecture.

Organizations can store massive amounts of data and are typically charged only for data used in production or analytics. This makes open-format data architectures a solid choice for AI, since training models requires massive amounts of data and compute.

We know that preparing structured and unstructured data is a barrier to building AI models for many organizations. How can organizations overcome this challenge?

Soceanu: The reality is that most data is unstructured, coming from sources such as social media and images. This makes it challenging, considering the data constantly changes and the volume intensifies.

Unstructured data can cause several problems, including schema drift, broken lineage and governance gaps. Combining multiple databases into a single, flexible view is essential.
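Schema drift, the first problem mentioned above, is concrete enough to sketch. The example below is hypothetical (the schemas and function name are invented for illustration): an ingest job compares the columns and types it observes today against the schema it was registered with, so drift is caught before it silently breaks lineage downstream.

```python
# Hypothetical schema-drift check: compare a registered schema against
# what an ingest job actually observes in a new data feed.
expected_schema = {"customer_id": "int", "email": "str", "amount": "float"}

def detect_drift(expected: dict, observed: dict) -> dict:
    """Return added, removed and retyped columns between two schemas."""
    added = {c: t for c, t in observed.items() if c not in expected}
    removed = {c: t for c, t in expected.items() if c not in observed}
    retyped = {c: (expected[c], observed[c])
               for c in expected.keys() & observed.keys()
               if expected[c] != observed[c]}
    return {"added": added, "removed": removed, "retyped": retyped}

# A new feed silently renamed `amount` to `amt` and added a free-text field.
observed_schema = {"customer_id": "int", "email": "str",
                   "amt": "float", "note": "str"}
drift = detect_drift(expected_schema, observed_schema)
# drift["removed"] flags the missing `amount` column; a governed pipeline
# could halt the load or alert a data steward here instead of loading bad data.
```

A real platform would pull the expected schema from a catalog rather than a hard-coded dictionary, but the comparison logic is the same.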

Take SAS® SpeedyStore, for example. The architecture brings analytics to the data and not vice versa. Using the power of SingleStore’s hybrid data architecture, combined with SAS analytics operating in-database, creates the ideal data environment for analytics and AI. This reduces data movement and improves data processing, so analysts don’t have a disjointed data store and stagnant data.

It’s also cost-effective. If your goal is analytics and standardizing on a single database, SAS SpeedyStore is the optimal environment.

Posten Bring, which delivers more than 100 million parcels annually across the Nordic region, implemented SAS SpeedyStore to unify high-speed data access with advanced analytics and AI.
Read more.

How is Retrieval-Augmented Generation (RAG) being used and how does it ensure better prompt and response accuracy from LLMs?

Soceanu: RAG pairs a large language model with a domain-specific store of unstructured documentation. At query time, it retrieves the relevant passages from that documentation and supplies them to the LLM alongside the prompt, producing contextually aware responses. Why? So outputs are grounded in the enterprise knowledge base. Because responses are sourced from the uploaded documentation, RAG reduces false LLM information in searches.

RAG frameworks also allow for the latest information to be used in responses and implement security measures for sensitive information. RAG is important for data engineers and data scientists who are responsible for connecting LLMs to governed data and trusted outputs.
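The retrieval step described above can be shown with a toy example. This sketch is deliberately simplified: real RAG systems rank pre-chunked documents by vector-embedding similarity, while plain word overlap stands in for similarity here, and the chunks themselves are invented.

```python
# Toy RAG retrieval: rank document chunks against a query, then ground
# the LLM prompt in the best-matching chunk. Word overlap stands in for
# the embedding similarity a production system would use.
chunks = [
    "Refunds are processed within 5 business days of approval.",
    "The warranty covers manufacturing defects for 24 months.",
    "Support is available weekdays from 9am to 5pm CET.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

context = retrieve("how long is the warranty", chunks)

# The grounded prompt the LLM would actually receive: the model is told to
# answer only from retrieved enterprise content, not from its training data.
prompt = (f"Answer using only this context:\n{context[0]}\n\n"
          f"Q: How long is the warranty?")
```

Everything the model sees is traceable back to a specific source chunk, which is what makes RAG outputs auditable in a way raw LLM answers are not.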

The scope of modeling has broadened with GenAI and large language models (LLMs). How does SAS Retrieval Agent Manager (RAM) solve unstructured data problems inherent in LLMs, beyond RAG frameworks?

Soceanu: LLMs are good at anomaly detection and extracting attributes to enhance data. Pulling data from external sources requires data governance for accuracy and appropriateness.

Think of data management as a “for and with LLMs” proposition. This includes outlier detection, attribute retrieval and augmentation. Human oversight is important with LLMs, especially to monitor for bias and toxicity, and to prevent sensitive and confidential data from entering the model.

Organizations need to improve data management and data quality for LLMs. RAG implementations are difficult to manage. RAG is backend-intensive for ingesting unstructured data, transforming it and implementing an LLM pipeline that incorporates user-uploaded datasets. RAG also tends to be specific to a particular LLM. It can’t be easily used across LLMs without significant work and time.

Previously, working with unstructured data often meant stitching together code-heavy solutions that were complex, time-consuming and hard to scale. RAM changes that.

Built upon the RAG framework, SAS Retrieval Agent Manager (RAM) enables fast, accurate and context-aware GenAI responses from unstructured enterprise data. It turns raw, unstructured content into usable enterprise knowledge in a far more approachable way by ingesting and processing documents, then automatically selecting the right configurations so teams can interact with that information quickly through an API or chatbot.

According to IDC MarketScape: Worldwide Data Integration Software Platforms 2025, SAS is a leader recognized for its broad, integrated data engineering platform.
Read the report.

Many industries are now using synthetic data to solve data problems, such as scarcity, privacy and availability. How is synthetic data being used now and how do you predict it will be used in the future to build better AI models?

Soceanu: It’s important to approach AI in a manner that solves specific use cases. Applying synthetic data is not always the answer, but when it makes sense, it can be an AI accelerator.

Synthetic data helps solve many common data management challenges with training data, like data scarcity and privacy. Creating data that is representative of real-world data with tools like SAS Data Maker can improve model accuracy. However, just like organic data, synthetic data needs to be managed and governed. Automated processes in DataOps can help flag sensitive data and inform a human-in-the-loop to intervene to safeguard analytical and AI outputs.
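The core idea behind synthetic data generation can be illustrated with a minimal sketch (this is not how SAS Data Maker works internally, and the sample values are invented): fit simple statistics on real data, then sample new records that preserve the aggregate shape without reusing any real record.

```python
# Illustrative synthetic-data sketch: fit a distribution to real values,
# then sample fresh records from it. No real record appears in the output,
# which is what gives synthetic data its privacy benefit.
import random
import statistics

random.seed(7)  # reproducible for the example

real_ages = [34, 29, 41, 38, 45, 31, 52, 27, 36, 44]

mu = statistics.mean(real_ages)      # 37.7
sigma = statistics.stdev(real_ages)

# 1,000 synthetic ages drawn from the fitted distribution: individual
# values are new, but the aggregate mean and spread are retained.
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(1000)]
```

Production tools fit far richer models (correlations across columns, categorical distributions, rare-event handling), but the govern-it-like-real-data principle from the paragraph above applies either way: a generated column can still leak or skew if nobody reviews it.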

Read more: Working with synthetic data? Ask these 6 questions first

We know data management is vital to copilots and more autonomous AI systems like agentic AI. How should data and AI engineers think about supporting these systems with human-in-the-loop data governance?

Soceanu: Human-in-the-loop data governance is a check-and-balance for AI. Human oversight needs to be embedded in the data and AI platform at every critical decision point, from data preparation to model deployment.

For example, in data preparation, humans label the training data and control data quality. This human task can identify issues that automated preprocessing may miss. We can’t blindly turn over all tasks to AI without complete explainability, traceability and auditability. Data is central to AI governance and responsible use.
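A human-in-the-loop gate like the one described above can be as simple as an automated flag plus a review queue. The example is hypothetical (the records and the email pattern are invented for illustration): automation catches what looks like PII, and a person decides before the data reaches a model.

```python
# Hypothetical DataOps gate: automation flags records that appear to
# contain PII (here, an email address), routing them to a human reviewer
# instead of letting them flow straight into model training data.
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

records = [
    {"id": 1, "note": "Customer prefers delivery after 6pm"},
    {"id": 2, "note": "Contact jane.doe@example.com for the refund"},
]

# Only flagged records stop for review; clean records flow through,
# so the human effort is spent where automated preprocessing may miss issues.
needs_review = [r for r in records if EMAIL_PATTERN.search(r["note"])]
```

Real platforms combine many such detectors (names, IDs, account numbers) with approval workflows, but the pattern is the same: the automation narrows the field, and the human makes the call.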

Trust in AI systems is deeply connected to data management and AI success. So how can organizations build trust in their data management practices?

Soceanu: We worked with IDC to conduct research on trust and AI. The research indicated that organizations that prioritize trustworthy AI are much more likely to achieve higher ROI from AI initiatives. Respondents were most concerned about data privacy, transparency and explainability, and ethical use. If production data is derived from a “black box” system, trust in AI systems will erode.

To ensure regulatory compliance across the public and private sectors, data use must comply with laws such as GDPR and HIPAA. Further, data must be explainable and traceable, especially when using personally identifiable information (PII), which is common in highly regulated industries like financial services and health care. Data and AI platforms need to govern all aspects of data to ensure compliance. This doesn’t end after building a data pipeline. It is continuous and ongoing.

Read our explainers on data warehouses and data management.

How can leaders guide data estate modernization with a trusted AI and data strategy?

Soceanu: Organizations are placing a strong focus on preparing their data and AI environments for the future. This digital transformation is tedious and challenging. Leaders need to ensure everything runs within strong compliance, security, and governance guardrails so their data stays protected.

Modern data and AI platforms can help organizations become more efficient while remaining nimble enough to integrate agentic AI, automation and built‑in monitoring. This allows for more trustworthy, flexible data pipelines that work across hybrid and multicloud deployments to deliver dependable AI outputs.

Data can’t be thought of in a vacuum. Modern approaches demand a holistic strategy that orchestrates data and AI across the data and AI life cycle for faster, high-quality data access, while ensuring trustworthy data that protects the organization’s reputation, financial success, and regulatory compliance.

Preparing your modern data estate

Modernizing your data estate is foundational to successful AI outputs that deliver measurable, trusted results. As organizations move from AI experimentation to production, success depends on unifying data integration, quality, governance and analytics processes in ways that reduce friction, improve transparency and scale across hybrid and multicloud environments.

With data integrity, security and human oversight built in, modern data management can evolve with the fast-moving market, from copilots to more autonomous, agentic capabilities. All while remaining accountable, compliant and aligned to organizational goals.

Learn more about SAS Data Management


About Author

Lindsey Coombs

Senior Editor, Data and AI

Lindsey Coombs is a Senior Editor for data and AI at SAS. She researches and writes on topics covering advanced analytics and evolving tech like generative AI. Lindsey is a seasoned communicator with more than 18 years of experience writing content for a broad range of industries and audiences. She is passionate about the safe and ethical use of technology that benefits humanity.
