Number Sense & Nonsense

As you will know from the news media, business executives and the techno-sphere,
we are in the age of big data and analytics.
(Disclosure: I too am part of this trend with my forthcoming course on
leading change in the Applied Analytics Master’s program at Columbia University.)

For those of us who have been practitioners of analytics, this attention is long
overdue.  But there is a certain naiveté in the breathless stories we have all
read and in many of the uses – really misuses – of analytics that we see now.

Partly to provide a more mature understanding of analytics, Kaiser Fung,
the director of the Columbia program, has written an insightful book
titled “NumberSense”. 

Filled with compelling examples, the book is a
general call for more sophistication in this age of big data.  I like to
think of it as a warning that a superficial look at the numbers you
first see will not necessarily give the most accurate picture, any
more than the skin of an unpeeled onion shows you what you will
see once it is cut.

Continuing this theme in his recent book, “The End Of Average”, Todd
Rose has popularized the story of the Air Force’s misuse of averages and
rankings after World War II.  He describes how the Air Force was faced
with an inexplicable series of accidents despite nothing being wrong
with the equipment or seemingly with the training of the pilots.  The
Air Force had even gone to the effort of designing the cockpits to fit
the exact dimensions of the average pilot!

As Rose reports in a recent article:

“In the early 1950s, the U.S. air force measured more than 4,000 pilots on
140 dimensions of size, in order to tailor cockpit design to the
‘average’ pilot … [But] Out of 4,063 pilots, not a single airman fit
within the average range on all 10 dimensions.  One pilot might have a
longer-than-average arm length, but a shorter-than-average leg length.
Another pilot might have a big chest but small hips.  Even more
astonishing, Daniels discovered that if you picked out just three of the
ten dimensions of size — say, neck circumference, thigh circumference
and wrist circumference — less than 3.5 per cent of pilots would be
average sized on all three dimensions.  Daniels’s findings were clear
and incontrovertible.  There was no such thing as an average pilot.
If you’ve designed a cockpit to fit the average pilot, you’ve actually
designed it to fit no one.”
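
A quick simulation makes the arithmetic behind Daniels’s finding easier to see.
This is only a sketch under my own assumptions, not his data or method: each
body dimension is drawn as an independent standard normal, and “average” is
taken to mean the middle 30 percent of values on a dimension.  The function
name share_average_on_all is made up for the illustration.

```python
import random
from statistics import NormalDist


def share_average_on_all(n_pilots, n_dims, middle=0.30, seed=0):
    """Fraction of simulated pilots inside the middle band on every dimension.

    Assumption: dimensions are independent standard normals, which real
    body measurements are not.
    """
    rng = random.Random(seed)
    # Half-width, in standard deviations, of the band that holds the
    # middle `middle` share of a normal distribution.
    half_width = NormalDist().inv_cdf(0.5 + middle / 2)
    hits = 0
    for _ in range(n_pilots):
        # A pilot counts only if every one of the dimensions is in the band.
        if all(abs(rng.gauss(0, 1)) <= half_width for _ in range(n_dims)):
            hits += 1
    return hits / n_pilots


for dims in (1, 3, 10):
    simulated = share_average_on_all(4063, dims)
    expected = 0.30 ** dims  # exact share under the independence assumption
    print(f"{dims:2d} dimensions: simulated {simulated:.4f}, expected {expected:.6f}")
```

Under these assumptions the chance of being average on three dimensions is
0.3 x 0.3 x 0.3 = 2.7 percent, in line with the “less than 3.5 per cent” in
the quote.  On ten dimensions it is about 0.0006 percent, so a group of 4,063
would be expected to contain roughly 0.02 such pilots.  Real body measurements
are correlated, so the true shares are somewhat higher, but the conclusion is
the same.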

Rose criticizes the very popular one-dimensional rankings and calls for
an understanding of the full complexity, the multi-dimensional nature,
of human behavior and performance.  As a Harvard Professor of Education,
he puts special emphasis on the misleading rankings that every student faces.

He shows three ways that these averages can mislead by not recognizing that:

  1. The one number used to rank someone actually represents multiple
    dimensions of skills, personality and the like. Two people can have the
    same score, but actually have a very different set of attributes.
  2. Behavior and skill change depending upon context.
  3. The path to even the same endpoint can be different for two people.
    While they may look the same when they get there, watching their
    progress shows a different picture.  He provides, as an example, the
    various patterns of infants learning to walk.  Eventually, they all do
    learn, but many babies do not follow any standard pattern of doing so.

It is not too difficult to take this argument back to Michael Lewis’s
portrayal in Moneyball of the way that the Oakland A’s put together a
successful roster by not selecting those who looked like star baseball
athletes – a uni-dimensional, if very subjective, ranking.

Let’s hope that as big data analytics mature, there are more instances of
Moneyball sophistication and fewer of the academic rankings that Rose
criticizes.

© 2016 Norman Jacknis, All Rights Reserved

[http://njacknis.tumblr.com/post/143113012009/number-sense-nonsense]

Big Data, Big Egos?

By now, lots of people have heard about Big Data, but the message often comes
across as just another corporate marketing phrase, and one with multiple
meanings.  That may be because people also hear from corporate executives who
eagerly anticipate big new revenues from the Big Data world.

However, I suspect that most people don’t know what Big Data experts are talking
about, what they’re doing, what they believe about the world, and the issues
arising from their work.

Although it was originally published in 2013, the book “Big Data: A Revolution
That Will Transform How We Live, Work, And Think” by Viktor Mayer-Schönberger and
Kenneth Cukier is perhaps the best recent in-depth description of the world of
Big Data.

For people like me, with an insatiable curiosity and good analytical skills,
having access to lots of data is a treat.  So I’m very sympathetic to the
movement.  But as with all such movements, the early advocates can get carried
away with their enthusiasm.  After all, it makes you feel so powerful, as I
recall from some bad sci-fi movies.

Here then is a summary of some key elements of Big Data thinking – and some
limits to that thinking.

Causation and Correlation

When presented with the result of some analysis, we’ve often been reminded that
“correlation is not causation”, implying we know less than we think if all we
have is a correlation.

For many Big Data gurus, correlation is better than causation – or at least
finding correlations is quicker and easier than testing a causal model, so it’s
not worth putting the effort into building that model of the world.  They say
that causal models may be an outmoded idea or, as Mayer-Schönberger and Cukier
say, “God is dead”.  They add that “Knowing what, rather than why, is good
enough” – good enough, at least, to try to predict things.

This isn’t the place for a graduate school seminar on the philosophy of science,
but there are strong arguments that models are still needed whether we live in a
world of big data or not.

All The Data, Not Just Samples

Much of traditional statistics dealt with the issue of how to draw conclusions
about the whole world when you could only afford to take a sample.  Big data
experts say that traditional statistics’ focus is a reflection of an outmoded
era of limited data.

Indeed, one example is a 1975 textbook titled “Data Reduction: Analysing and
Interpreting Statistical Data”.  While Big Data provides lots more opportunity
for analysis, it doesn’t overcome all the weaknesses that have been associated
with statistical analysis and sampling.  There can still be measurement error.
Big Data advocates say the sheer volume of data reduces the necessity of being
careful about measurement error, but can’t there still be systematic error?
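
The worry about systematic error can be shown with a small Python sketch.  The
numbers are invented for illustration: a true value of 100, random noise with a
standard deviation of 5, and a hypothetical instrument that reads 2 units too
high every time.

```python
import random
from statistics import mean


def mean_error(n, bias, noise_sd=5.0, true_value=100.0, seed=1):
    """Absolute error of the sample mean when every reading carries the same bias."""
    rng = random.Random(seed)
    # Each reading = truth + a constant (systematic) bias + random noise.
    readings = [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]
    return abs(mean(readings) - true_value)


for n in (100, 10_000, 200_000):
    print(f"n={n:>7,}: unbiased error {mean_error(n, bias=0.0):.3f}, "
          f"biased error {mean_error(n, bias=2.0):.3f}")
```

In runs like this, the error from random noise shrinks as n grows, while the
error from the bias stays near 2 no matter how much data is collected.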

Big Data gurus say that they include all the data, not just a sample.  But, in a
way, that’s clearly an overstatement.  For example, you can gather all the
internal records a company has about the behavior and breakdowns of even
millions of devices it is trying to keep track of.  But, in fact, you may not
have collected all the relevant data.  It may also be a mistake to assume that
what is observed about even all people today will necessarily be the case in the
future – since even the biggest data set today isn’t using tomorrow’s data.

More Perfect Predictions

The Big Data proposition is that massive volumes of data allow for almost
perfect predictions and fine-grained analysis, and can almost automatically
provide new insights.  While these fine-grained predictions may indicate
connections between variables/factors that we hadn’t thought of, some of those
connections may be spurious.  This is an extension of the issue of correlation
versus causation because there is likely an increase in spurious correlations
as the size of the data set increases.
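
One way to see how spurious correlations multiply is to generate columns of
pure random noise and count the pairs that look correlated anyway.  The sketch
below is my own illustration, not something from the Big Data literature cited
here.  It reads “size of the data set” as the number of variables tracked, and
the 50 observations and the 0.3 cutoff are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_spurious(n_vars, n_obs=50, threshold=0.3, seed=2):
    """Count pairs of independent noise columns whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    # Every column is independent noise, so any "correlation" is accidental.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)


for n_vars in (5, 20, 80):
    pairs = n_vars * (n_vars - 1) // 2
    print(f"{n_vars:3d} variables, {pairs:5d} pairs: "
          f"{count_spurious(n_vars)} exceed |r| > 0.3")
```

Every column is independent by construction, so each pair that clears the
cutoff is a false lead, and the number of pairs grows roughly with the square
of the number of variables.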

If Netflix recommends movies you don’t like, this isn’t a big problem.  You just
ignore them.  In the public sector, when this approach to predicting behavior
leads to something like racial profiling, it raises legal issues.

It has actually been hard to find models that achieve even close to perfect
predictions – and that includes the well-known stories about how Farecast
predicted the best time to buy air travel tickets or how Google searches
predicted flu outbreaks.  For a more general review of these imperfections, read
Kaiser Fung’s “Why Websites Still Can’t Predict Exactly What You Want”,
published in Harvard Business Review last year.

Giving It All Away

Much of the Big Data movement depends upon the use of data from millions –
billions? – of people who are making it available unknowingly, unintentionally
or at least without much consideration.

Slowly, but surely, though, there is a developing public policy issue around who
has rights to that data and who owns it.  This past November’s Harvard Business
Review – hardly a radical fringe journal – had an article that noted the
problems if companies continue to assume that they own the information about
consumers’ lives.  In that article, MIT Professor Alex Pentland proposes a “New
Deal On Data”.

So Where Does This Leave Us?

Are we much better off and learning much more with the availability of Big Data,
instead of samples of data, and the related ability of inexpensive computers and
software to handle this data?  Absolutely, yes!

Is Big Data, as some of its big egos claim, perfect enough that we can withhold
skepticism about its results?  Has Big Data become the omniscient god?  Not
quite yet.

© 2015 Norman Jacknis

[http://njacknis.tumblr.com/post/110070952204/big-data-big-egos]

Too Many Metrics?

The New York Times Sunday Style Section – of all places – recently contained a report, titled “The United States of Metrics”, about how every area of life now is dominated by numbers and statistics.  As its author, Bruce Feiler, put it:

“In the last few years, there has been a revolution so profound that it’s sometimes hard to miss its significance. We are awash in numbers. Data is everywhere. Old-fashioned things like words are in retreat; numbers are on the rise. Unquantifiable arenas like history, literature, religion and the arts are receding from public life, replaced by technology, statistics, science and math. Even the most elemental form of communication, the story, is being pushed aside by the list.”

After reviewing the use of analytics in fields as diverse as sports, health and lifestyle, Feiler ends the story with the time-worn warning often attributed to Einstein, “Not everything that can be counted counts and not everything that counts can be counted.”

A couple of months ago, Zachary Karabell’s book, titled “The Leading Indicators: A Short History of the Numbers That Rule Our World”, was published.  Karabell goes into this subject in much more depth and with a lot more historical context. 

(By the way, Karabell is a lively writer and brings all this to life in a more engaging way than the average reader would expect of a book about economic statistics.)

He notes that the statistics we all hear reported – GDP, trade deficits, unemployment rates, etc. – are misleading, inaccurate to varying degrees and mostly fairly new, despite their prominent role in politics and business planning.  New as they are, many are already outdated by changes in the economy and the ways that people make a living.

He discusses various ways that these economic statistics can be updated.  However, he also points out that no single measure alone will be able to provide a good picture of something as large and complex as a changing national economy.  So maybe we need more metrics to round out the picture.

Karabell thinks the metrics are good and useful, but that we need to be more sophisticated in our handling of them.

That’s something that makes sense.  In a world that increasingly needs and demands the kind of data-driven knowledge that all these measurements can provide, our understanding of and literacy in quantitative methods also need to improve.

In a way, this is not all that different from the argument made by those in the visual arts, who call for more visual literacy in a world that is increasingly visual rather than textual.  See my post “Visual Images And Text” from about a year ago at http://njacknis.tumblr.com/post/60268577982/visual-images-and-text .

(Come to think of it, these last two paragraphs do pose an ironic challenge to a blogger who writes using words – as traditional text gets diminished in a world of numbers and images 🙂)

©2014 Norman Jacknis

[http://njacknis.tumblr.com/post/87101098190/too-many-metrics]