The Data Roundtable
A community of data management experts
At a recent TDWI conference, I was strolling the exhibition floor when I noticed an interesting phenomenon. A surprising percentage of the exhibiting vendors fell into one of two product categories. One group was selling cloud-based or hosted data warehousing and/or analytics services. The other group was selling data integration products. Of
In the past, we've always protected our data to create an integrated environment for reporting and analytics. And we tried to protect people from themselves when using and accessing data, which sometimes could have been considered a bottleneck in the process. We instituted guidelines and procedures around: Certification of the data
When you spend long enough writing and working in any industry, you inevitably see trends emerge and reach varying levels of maturity. Data governance is one such trend, as you can see from the following Google Trends chart:
.@philsimon lists the gravest data-quality errors.
I've been doing some investigation into Apache Spark, and I'm particularly intrigued by the concept of the resilient distributed dataset, or RDD. According to the Apache Spark website, an RDD is “a fault-tolerant collection of elements that can be operated on in parallel.” Two aspects of the RDD are particularly
Data quality has always been relative and variable: relative to a particular business use, and variable from one user to the next. Data of sufficient quality for one business use may be insufficient for another, and data one user considers good may be considered bad by others.
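To make the "relative to a particular business use" point concrete, here is a minimal Python sketch in which the same record passes one use's quality rules and fails another's. The `Customer` fields, the two rule sets, and the `fit_for_use` helper are all hypothetical names invented for illustration; they do not come from the post or from any particular data quality tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Customer:
    # Hypothetical record layout, chosen only for this illustration.
    name: str
    email: Optional[str]
    postal_code: Optional[str]


# A rule takes a record and says whether it is acceptable for one purpose.
Rule = Callable[[Customer], bool]

# Hypothetical rule sets: each business use defines its own notion of "good".
EMAIL_CAMPAIGN_RULES: List[Rule] = [
    lambda c: bool(c.email) and "@" in c.email,  # needs a plausible email
]
DIRECT_MAIL_RULES: List[Rule] = [
    lambda c: bool(c.postal_code),  # needs a postal code
]


def fit_for_use(record: Customer, rules: List[Rule]) -> bool:
    """Return True only if the record satisfies every rule for this use."""
    return all(rule(record) for rule in rules)


record = Customer(name="Pat Example", email="pat@example.com", postal_code=None)

print(fit_for_use(record, EMAIL_CAMPAIGN_RULES))  # True: good enough for email
print(fit_for_use(record, DIRECT_MAIL_RULES))     # False: unusable for mailing
```

The record itself never changes; only the rule set applied to it does, which is why a single "quality score" detached from a use case can mislead.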