Teaching the machine a lesson

The forecast is frightening: Robots will take over all manual labor and self-generating code will automatically spin out the algorithms once developed by statisticians and programmers. Humans will become obsolete. What will mere mortals do all day long?

Ride captive as our self-driving cars take us on a sentimental journey to see the parking lots where shopping malls used to be?

Visit the local greenhouse to watch through the window as mechanical gardeners harvest hydroponic vegetables loaded onto drones optimized for doorstep delivery?

Or just sit in front of a giant screen forever and ever while AmaFlix Consolidated serves up an endless stream of Arrested Development re-runs because AmaFlix knows that, no matter what clever new show we might tell our friends we are binge-watching, we are really just watching Arrested Development re-runs over and over again?

What will we do?

Suffice it to say: The standard errors of Y estimates for X-axis values outside the range of observed data tend to be large for a reason. That what-if analysis quickly turns into a what-the-heck analysis. Caveat emptor. Or, as Google automatically translates that from “English” to English: Caveat emptor.
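For anyone who wants to see that widening uncertainty in numbers, here is a minimal sketch using the textbook formula for the standard error of a fitted value in simple linear regression. The data are made up for illustration, and the name se_fit is just a convenient label, not anything from a particular library.

```python
import math

# Hypothetical toy data: X is observed only between 1 and 5.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# Ordinary least-squares slope and intercept.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard error with n - 2 degrees of freedom.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

def se_fit(x0):
    """Standard error of the estimated mean of Y at x0."""
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

# Compare a point in the middle of the data with points further and further out.
for x0 in (3, 5, 10, 50):
    print(x0, round(se_fit(x0), 3))
```

The (x0 - x_bar) term is the culprit: the further the forecast strays from the center of the observed data, the larger the standard error becomes.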

Here's a more realistic prediction: Self-correcting machine learning models and auto-generated code will change the way statisticians, programmers, and data scientists work. But there will be plenty of work for human beings to do from the beginning to the end of the process.

The hand that gathers the data

On the front end of the process, every machine learning model requires training data. These days, you can find data on almost any subject, from food prices in 75+ countries to the infamous Enron e-mails, US police shootings, and classical music, not to mention the projects SAS itself has been involved in, such as sharing clinical trial data and missing migrant reports.

All of these very different kinds of data have one thing in common: Human beings worked to gather the data, structure the data, and provide access to it. That fact gets lost in discussions of automated processes and self-generating systems. Data exist because human beings decided to gather them, often at significant cost, if not in cash then in time and effort.

Humans act on their values, sometimes individually, sometimes collectively. One Ph.D. candidate may decide a topic is of such importance that he or she will spend two or three years gathering new data, not knowing how the analysis will turn out. A nation may agree to gather health care data on opioid use or geological data on fracking or educational data on student achievement. Humans decide which data are worth the effort to gather. Humans decide which data they want to share and how. Humans decide the rules that permit or restrict the gathering and sharing of data.

Make no mistake: The hand that gathers the data rules the machine. When it is young, the machine learns what we teach it based on the data we feed it. Once mature, the machine that is not regularly fed more healthy data becomes a temperamental monster that will not open the pod bay doors. You heard me, HAL. Yes, I am talking about you. HAL. Open the pod bay doors.

Philosophers of probability

In the middle of the process, human beings will be needed to serve as statistical philosophers, reasoning through which models to implement. Someday, that selection may be completely automated, but that won’t happen any time soon. The truth is that the rapid proliferation of models and auto-generated code has led to choice overload, a kind of analytic consumer confusion that makes it hard to choose which of many almost equally good models is best.

I am reminded of how simple things were 20 years ago when I could write an Experimental Methods test question like this: The statistical procedure used to evaluate differences among three experimental treatments with interval or ratio data is the ______: a) t-test; b) Chi-square test; c) Analysis of Variance; d) Mann-Whitney test. Undergraduates were taught that the appropriate statistical test could be determined by the experimental design, level of measurement, and distribution of data. Simple as that.

But the world of machine learning and cognitive computing will never be as simple as that. Cluster analysis is a great example of the challenge. Clustering finds neighbors in data, linking together things that seem more alike or closer together than those that are different or far apart based on a particular definition of similarity. But different methods produce very different results and there is no simple flowchart that directs data scientists to the “right” answer.
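Here is a minimal sketch of that disagreement, using made-up one-dimensional data. Two common definitions of a “good” two-cluster answer are compared: the k-means objective (smallest within-cluster sum of squares) and single linkage (cut at the widest gap). The function names are hypothetical, and the shortcuts used only work because the data sit on a line.

```python
# Toy 1-D data: a chain of evenly spaced points plus one point slightly apart.
points = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11.5]

def sse(cluster):
    """Within-cluster sum of squared distances to the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def two_means_split(data):
    """Best 2-cluster split under the k-means objective.

    In one dimension the optimal clusters are contiguous, so trying
    every cut point of the sorted data is an exhaustive search.
    """
    data = sorted(data)
    best = min(range(1, len(data)), key=lambda i: sse(data[:i]) + sse(data[i:]))
    return data[:best], data[best:]

def single_linkage_split(data):
    """2-cluster single-linkage result in one dimension: cut at the widest gap."""
    data = sorted(data)
    cut = max(range(1, len(data)), key=lambda i: data[i] - data[i - 1])
    return data[:cut], data[cut:]

print("k-means objective:", two_means_split(points))
print("single linkage:   ", single_linkage_split(points))
```

Same data, same number of clusters, two defensible answers: one method splits the chain roughly in half, the other peels off the lone straggler. Neither is wrong by its own definition of similarity.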

Kleinberg (2002) has described the challenge in terms of an “Impossibility Theorem,” which shows that no clustering function can satisfy three seemingly reasonable properties (scale-invariance, richness, and consistency) all at once. Among a “bewildering array of different clustering techniques,” then, it may be impossible for one approach to be unequivocally best. Other researchers have accepted the challenge and begun to present a more systematic approach by classifying clustering models in a taxonomy, but “in spite of the wide use of clustering in many practical applications, currently, there exists no principled method to guide the selection of clustering algorithms” (Ackerman, Ben-David, & Loker, 2010). There are some obvious distinctions for map-based data or time series data, but the choice of a particular clustering model is not going to fit neatly into a multiple-choice question format.

The fact that you can easily point-and-click your way to a batch of alternative results doesn’t make deciding which method is appropriate any easier. It does, on the other hand, make more work for human beings.

Converting binary to logic

On the back end of the process, the machine requires an interpreter. What is the machine actually telling us? Is caveat emptor an English phrase? It is Latin for “buyer beware,” but many English speakers have heard the phrase so often that it may indeed now count as English.

Old words change meaning and new words are invented, sometimes by chance and sometimes by choice. “Truthiness” was a word used as the punchline to a joke, a seemingly inaccurate variation on “truth,” but it is now accepted in the Oxford English Dictionary. Does that make it a real word? Does it help or hurt the case to know that, according to Google Translate, “truthiness” is spelled the same way in Latin?

Google Translates Truthiness into Latin?

Who or what says that a Latin phrase is English now or that any word is a word? Should we all collectively press the feedback link and tell the machine it is wrong… or is it right? Is it smarter than we imagine since it knows that a Latin phrase is so often used in English that it should be treated as English? Should we admire the machine’s ability to think beyond the facts or chide it for making such a silly mistake?

Caveat emptor is English? There’s a ring of truthiness to that.

And that’s the problem. Just as there is uncertainty regarding which data are important to gather and which model is best for a particular purpose, there will always be uncertainty in interpreting the results. Forget clustering and other unsupervised models where there is no fixed target to hit. Even when considering supervised models with well-defined targets, humans will argue among themselves about whether what appears to be the “best” model is maybe “too good,” an over-specified model that overfits the data.
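To make the “too good” worry concrete, here is a minimal sketch with made-up data. The true relationship is a straight line, y = 2x, observed with some noise. A straight-line fit is compared with a polynomial that passes through every training point exactly; all names and numbers are illustrative assumptions.

```python
# Noisy training samples of a truly linear relationship, y = 2x (noise is made up).
train_x = [0, 1, 2, 3, 4, 5]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_y = [2 * x + e for x, e in zip(train_x, noise)]

# Held-out points that follow the true relationship exactly.
test_x = [0.5, 2.5, 4.5]
test_y = [2 * x for x in test_x]

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def fit_interpolant(xs, ys):
    """Degree n-1 Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            term = yj
            for m, xm in enumerate(xs):
                if m != j:
                    term *= (x - xm) / (xj - xm)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a prediction function on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

line = fit_line(train_x, train_y)
wiggly = fit_interpolant(train_x, train_y)

print("train MSE (line, polynomial):", mse(line, train_x, train_y), mse(wiggly, train_x, train_y))
print("test MSE  (line, polynomial):", mse(line, test_x, test_y), mse(wiggly, test_x, test_y))
```

The polynomial wins on the training data by construction and loses on the held-out points, because it has memorized the noise along with the signal. Which of the two counts as “best” depends on which number you look at, and that is exactly the argument humans end up having.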

Put aside for a moment the problem that humans often like to engage in a deliberate misreading of the facts for various reasons. "To err is human," as Shakespeare said. This remains true: Intelligent, sincere, thoughtful human beings disagree about how to interpret facts, and the way they tend to settle such disagreements is to gather even more human beings to join in the discussion, whether sitting at the bar or standing around the podium at a research symposium.

Work for humans. Gathering the data humans think are important, inventing methods to better understand when to use the methods humans have already invented, and serving as interpreters for the output while maintaining the right to veto everything the machine generates. Lots of work for human beings. Robots may indeed replace a significant amount of manual labor, but human intellectual labor is in high demand and will remain so for a long time.

Just when the machines thought they had made us obsolete, we’re back in charge again. It’s almost like we programmed them to need us.

To learn more about the history of artificial intelligence and where we're headed with cognitive computing, read the article, Becoming Cognitive.
