Whether it’s neural networks, natural language processing, machine learning or computer vision, artificial intelligence (AI) is increasingly used to improve enterprise solutions. All AI applications are data-driven and therefore dependent on high-quality data. In this post, we’ll look at an example of how data quality improves AI.
Better data, better search
Let’s begin by looking at an AI application that has been around for decades, which most of us still use several times daily – an Internet search engine. Its goal is to help you find websites, images, videos and other online content relevant to your search query. A search engine is driven by, and dependent on, high-quality data in three ways.
Index of keywords and metadata
First, a search engine is driven by an index of keywords and other metadata that its algorithms use to filter and weight search results. This is how it returns a ranked list of links with the most relevant results listed first. As with any machine learning application, the quality of the search engine’s training data is essential. Its training data is the initial search index along with a set of expected search results for the most common and relatively unambiguous search queries. This gets the analytical model the search engine is building off to the right start by establishing the correct correlations between the search index and different search queries.
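To make the idea of a keyword index concrete, here is a minimal sketch in Python. It assumes a toy scoring rule (summed keyword counts per page) and made-up pages and URLs; real search engines use far richer metadata and ranking signals, so treat the names `build_index` and `search` as illustrative only.

```python
from collections import Counter, defaultdict

def build_index(pages):
    """Map each keyword to the pages that contain it, with a per-page count."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, query):
    """Return URLs ranked by the summed keyword counts for the query terms."""
    scores = Counter()
    for word in query.lower().split():
        # .get avoids adding empty entries to the index for unknown words.
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]

# Hypothetical pages, invented for this example.
pages = {
    "a.example/batman": "batman comic book superhero batman movie",
    "b.example/bats": "bat animal control bat removal",
    "c.example/comics": "comic book store superhero toys",
}
index = build_index(pages)
results = search(index, "batman superhero")
```

Here `results` lists the Batman page ahead of the comics store because it matches more of the query terms. If the index contains wrong or stale keywords, the ranking degrades no matter how good the scoring rule is.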
Web crawlers
Up-to-date and frequently refreshed data is necessary for improving any AI application. Web crawlers help search engines obtain this type of data. Web crawlers go from website to website to collect data for updating and optimizing the search index. This is where the search engine encounters an interesting data quality challenge – search engine optimization (SEO).
From an SEO perspective, both individuals and organizations try to game the system to get their websites to rank higher in search results. I remember the early days of the Internet when the footer of fraudulent and malware-ridden websites contained long lists of common keywords. That was an attempt to rank higher and draw more visitors to the page. In this example, search algorithms had to be updated – not because of bad data but because of good data put to bad uses.
SEO requires the search engine to continuously adapt – or, in some cases, be manually adjusted – to overcome new attempts to game its system. Non-AI manipulation of the search engine also occurs when companies pay their way to the top of search results. But in recent years search engines have started clearly marking those search results as advertisements.
Learning from users
The third way a search engine is driven by data is the real key to its success – learning from users. When users don’t like the search results they get, it’s often due to mistakes in their search queries. For example, people make errors in spelling or keyword order, use too few or generic search terms, or include ineffective or unnecessary information in the query.
When users modify their query, especially before clicking on any results, it provides a powerful learning opportunity for the search engine. That's because the search engine retains both the original and the modified query. By comparing the two queries and the results eventually selected, the search engine learns and strengthens correlations between the search index and different search queries.
If you initially searched for “bat man” and then “bat man superhero,” it learns that users are more likely searching for a comic book character than an animal control specialist. This is why the predictive algorithm that helps you complete your search query starts recommending keywords to add to the query as you type. For example, if you start typing “bat man,” the recommended keywords for most search engines include “movie,” “comic book” and “toys.” This is the same reason you get “Did you mean Batman?” – because the search engine has learned that the comic book character is a one-word proper name. Further, the search engine learned how to spell not by using a dictionary but by learning the correlations between misspelled keywords and correctly spelled entries in its index.
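As a rough illustration of learning from modified queries, the sketch below keeps a log of query rewrites and derives both keyword suggestions and a “Did you mean” correction from it. The `QueryLog` class, its method names, the `min_count` threshold and the sample data are all assumptions made for this example, not how any particular search engine works.

```python
from collections import Counter, defaultdict

class QueryLog:
    """Hypothetical log of rewrites users made before clicking any result."""

    def __init__(self):
        # original query -> Counter of the queries users rewrote it to
        self.rewrites = defaultdict(Counter)

    def record(self, original, modified):
        self.rewrites[original.lower()][modified.lower()] += 1

    def suggest_keywords(self, query, limit=3):
        """Phrases users most often appended to this query."""
        query = query.lower()
        added = Counter()
        for modified, count in self.rewrites.get(query, {}).items():
            if modified.startswith(query + " "):
                added[modified[len(query):].strip()] += count
        return [phrase for phrase, _ in added.most_common(limit)]

    def did_you_mean(self, query, min_count=2):
        """Most common rewrite that replaced the query rather than extending it."""
        query = query.lower()
        corrections = Counter({
            modified: count
            for modified, count in self.rewrites.get(query, {}).items()
            if not modified.startswith(query + " ")
        })
        if not corrections:
            return None
        best, count = corrections.most_common(1)[0]
        # Require repeated evidence before offering a correction.
        return best if count >= min_count else None

# Invented sample data for illustration.
log = QueryLog()
for original, modified in [
    ("bat man", "bat man movie"),
    ("bat man", "bat man movie"),
    ("bat man", "bat man movie"),
    ("bat man", "bat man comic book"),
    ("bat man", "bat man comic book"),
    ("bat man", "bat man toys"),
    ("bat man", "batman"),
    ("bat man", "batman"),
    ("bat man", "bat removal"),
]:
    log.record(original, modified)

suggestions = log.suggest_keywords("bat man")
correction = log.did_you_mean("bat man")
```

With this sample log, the suggestions come back in order of how often users added them, and the correction is the one-word spelling that users repeatedly switched to. No dictionary is involved; the behavior comes entirely from the recorded rewrites.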
As this illustrates, users continuously improve the quality of the data a search engine is so highly dependent on. Users perform the time-consuming task that, when neglected, undermines most AI applications: data cleansing.
A search engine demonstrates how an AI application requires massive amounts of high-quality data to achieve its goals. It also exemplifies how an AI application has to incorporate formal processes for continuously ingesting new data and methods for continuously improving data quality.