What ‘Moneyball’ can teach us about Intelligence Analysis

Calls to improve intelligence analysis often invoke other professions, such as medicine and law. But what about that most quintessentially American of all sports: baseball? In Moneyball we find a modern-day David versus Goliath story that has been used to challenge thinking in fields ranging from business to medicine, and even corrections. As with those fields, the lessons of Moneyball can transform intelligence analysis.

Moneyball is the story of how, in 2002, the cash-strapped Oakland A's management revolutionized baseball. In the film version of the story, A's general manager Billy Beane, played by Brad Pitt, sums up the difficult position of his club: "There are rich teams, and there are poor teams. Then there's 50 feet of crap. And then there's us." And Beane wasn't exaggerating; the A's $30 million budget was a mere fraction of those of powerhouse clubs such as the New York Yankees (roughly $130 million).

2003 Major League Budgets by Team

Graph Source: http://en.wikipedia.org/wiki/Moneyball


Combining an unorthodox perspective on recruitment with a skeptical view of traditional baseball statistics, the A's front office imported sophisticated statistical analysis to build a highly competitive team on a shoestring budget. With their revolutionary approach the A's won 20 consecutive games, an American League record, and to this day they remain among the most efficient teams, paying roughly $631,000 per win in the 2013 season versus the roughly $2.6 million per win paid by big-payroll clubs such as the Yankees.

But if you dig deeper, the real message of Moneyball isn't about statistics, or even money for that matter: it is about improving a profession by challenging orthodoxy with novel ideas.

Experience is useful, but it’s no crystal ball

Consider the role of experience in baseball recruiting. Traditionally, baseball recruiting is a subjective art practiced by experienced scouts. According to established baseball thinking:

    you found a big league ballplayer by driving sixty thousand miles, staying in a hundred crappy motels, and eating god knows how many meals at Denny's all so you could watch 200 high school and college baseball games inside of four months…. Most of your worth derived from your membership in the fraternity of old scouts who did this for a living… (Lewis, p. 37)

After staying in a 'hundred crappy motels' and eating all those Denny's meals, the scouts relied on their experience to select their top picks on the basis of a player's appearance and the anecdotal information they knew about him, a practice that led some players to be vastly overvalued and others undervalued. Players who didn't "look" the part of a major leaguer or who didn't have the right back story were consistently passed up, something the A's capitalized on to buy the best team for their buck. Take, for example, Chad Bradford and his unusual submarine pitch. While Bradford ended up a staple relief pitcher for the A's, he was so overlooked by scouts that he began his pitching career at a community college.

Chad Bradford delivering his unique submarine pitch


But Bradford's case could be a 'black swan,' and you could even reason that it is OK to miss the occasional submarine pitcher. The problem, however, isn't just about 'unique' ballplayers. Consider the common scouting task of selecting the superior hitter. Look at the picture below of Pittsburgh Pirates slugger Neil Walker. Can you guess what his 2013 batting average was?

Guess Neil Walker’s Batting Average


If you know baseball you might come close, because you know roughly how good Walker is relative to other standout hitters. If you never watch baseball, however, I am willing to bet you'd be wildly off (for those of you keeping score at home, Neil Walker batted .280). Now, if you were a scout, you might come a little closer than the casual fan, but you probably wouldn't guess much differently.

A meta-analysis of clinical judgment in medicine suggests that experts (e.g., baseball scouts) slightly outperform well-informed observers (e.g., baseball fans). However, as the complexity of the judgment task increases and the opportunities to learn decrease, as is the case with most intelligence tasks, experts do not do much better than well-informed observers. In foreign affairs forecasting, Tetlock found a similar dynamic: a rapidly diminishing return on expertise.

The Diminishing Returns of Expertise


In one study I asked participants to judge the extent to which the Assad regime would comply with a UN resolution to remove and destroy all declared weapons before June 30th, 2014. The participants included 24 graduate students in a security studies program, representing well-informed observers, and 16 International Association for Intelligence Education (IAFIE) members plus 5 analysts from the IC, representing experts, for a total of 45 participants.

This modest experiment seems to reflect the diminishing returns of experience: the grad students estimated 62% versus 51% for the IAFIE members and IC analysts. Whether one group's estimate is closer to the actual share of weapons destroyed and removed is a more complex matter for another discussion, but there does not appear to be a large difference between the two groups, despite the presumably large gap in experience.

 

Estimates of Percentage of Weapons Destroyed & Removed


Since such estimative judgments are but one part of the analyst's job, I also asked each participant for their rationale. Like the estimates, the rationales were similar but not the same. Below are frequency distributions of each group's cited reasons for their judgments. Both groups identified the reluctance of the Assad regime, interference from the civil war, and the difficult timeline as their main reasons, although they prioritized the first two differently. The one substantial difference I found was that the IAFIE and analyst group noted the importance of the weapons falling into a third party's hands (e.g., a rebel group).

Graduate Student Justifications


IAFIE and Analyst Justifications

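The distributions above are simple tallies of coded rationales. For readers curious about the mechanics, here is a minimal sketch of that tallying step in Python; the labels and lists below are illustrative stand-ins, not the actual coding scheme or responses.

```python
from collections import Counter

# Illustrative coded rationales, one label per participant response.
# These labels and counts are invented; they are not the study's data.
grad_student_codes = [
    "regime reluctance", "civil war interference", "regime reluctance",
    "difficult timeline", "civil war interference",
]
expert_codes = [
    "civil war interference", "regime reluctance", "difficult timeline",
    "weapons falling into third-party hands",
]

# Frequency distribution of cited reasons for each group.
for group, codes in [("Graduate students", grad_student_codes),
                     ("IAFIE members and IC analysts", expert_codes)]:
    print(group)
    for reason, count in Counter(codes).most_common():
        print(f"  {reason}: {count}")
```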

The point of this lesson and of the study's findings is not that experience doesn't matter, but that it can only take us so far: the difference between well-informed observers and experts is not large. In light of this, we need to keep developing new methodologies and techniques that can supplement, not supplant, the experience of analysts. But if new analytic methodologies and techniques are to be created, we would do well to heed the next Moneyball lesson.

Don’t assume the established way of thinking or doing something is right

Realizing the limitations of the scouts, the A's turned to sabermetrics, the statistical analysis of baseball, but they didn't just assume the numbers would save them. In fact, the A's scrutinized many of the standard baseball statistics and found that some were just as biased as traditional scouting.

Take, for example, the fielding error, which occurs when a fielder misplays a ball such that a runner from the opposing team can advance. The fielding error statistic was thought up in the early years of baseball as a way to account for how barehanded players fielded (baseball gloves weren't common until the 1890s). A century later, baseball statisticians began noticing that the error statistic was misleading: an error could only be charged if a player made an attempt on the ball in the first place, thereby punishing those who tried to make the play and rewarding those who either avoided the ball or couldn't reach it in time.

In short, fielding error statistics "weren't just inadequate; they lied" (Lewis, p. 67).

The result of the misleading error statistic was that many players were passed up, and teams relying on errors as a measure misjudged how they appraised their defense.

The Fielding Error made more sense in the time of rough fields & bare hands


Similar mistakes can be made in intelligence analysis. For example, consider the use of social network analysis (SNA), the analysis of links between people, often used in intelligence to study terrorist and criminal groups. With the growth of SNA tools and 'big data,' analysts increasingly rely on SNA and associated statistics such as degree centrality, a simple measure of how well connected a person is in a network. However, relying on this statistic as a measure of influence can be as troublesome as relying on the error statistic to determine fielding skill.

Consider the case, presented by Bienenstock and Salwen (forthcoming), of Abu Zubaida. While Zubaida was identified by U.S. leadership as the number 3 in al Qaeda, he was later found to be a low-level operative. Yet he was heavily connected in the network because of his role as a courier, and therefore would have had a high degree centrality score. Below is a sociogram from Marc Sageman's well-known global violent Salafist data linking al Qaeda to other violent Salafist plots, with Zubaida in the center of the graph near bin Laden.

Al Qaeda-Madrid-Singapore Network (source: Sageman et al)

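To make the degree centrality trap concrete, here is a minimal sketch using the networkx library and an entirely invented toy network (none of these nodes or ties are real data): a courier-like relay node tops the ranking simply because everything passes through him.

```python
import networkx as nx

# Invented toy network: a courier relays messages between a leader and several cells.
G = nx.Graph()
G.add_edges_from([
    ("leader", "courier"),
    ("courier", "cell_A1"), ("courier", "cell_B1"), ("courier", "cell_C1"),
    ("cell_A1", "cell_A2"), ("cell_B1", "cell_B2"), ("cell_C1", "cell_C2"),
])

# Degree centrality counts only direct ties, so the courier far outranks the leader.
for node, score in sorted(nx.degree_centrality(G).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{node:10s} {score:.2f}")
```

Connectivity alone says nothing about authority, which is precisely the trap in the Zubaida case.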

I witnessed firsthand how network statistics can be misleading while working with Mike Kenney on a project using SNA to map al-Muhajiroun, an Islamist extremist group in the UK. To create our network we used a massive dataset of news reports and automated data extraction tools similar to those the IC uses, but what was novel was that we cross-validated our network statistics with in-depth field research. Over the course of two years, Mike visited the UK several times, interviewing 86 people within al-Muhajiroun, from the top leaders down to rank-and-file members.

When we attempted to cross-validate our networks against hours of interview recordings, we found something pretty surprising: the standard network statistics, by themselves, were incredibly misleading. For example, our SNA ranked individuals artificially high who had no operational ties to the network, such as Osama Bin Laden (yellow). Others, such as Salahuddin Amin (green), were ranked high because they were well known in the British media, but they certainly weren't leaders of the network as the SNA would imply.

Betweenness Centrality in al-Muhajiroun

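A cross-check of the kind we performed can be sketched in the same way. The network, names, and "field roles" below are wholly invented; they stand in for the sort of ground truth interviews provide, not for our al-Muhajiroun data.

```python
import networkx as nx

# Invented toy network: a media-facing spokesman bridges journalists to the group.
edges = [
    ("emir", "organizer"),
    ("organizer", "recruit1"), ("organizer", "recruit2"),
    ("organizer", "spokesman"),
    ("spokesman", "journalist1"), ("spokesman", "journalist2"),
    ("spokesman", "journalist3"), ("spokesman", "journalist4"),
]
G = nx.Graph(edges)

# Hypothetical ground truth from field interviews about who actually holds authority.
field_roles = {
    "emir": "leader", "organizer": "mid-level organizer", "spokesman": "media figure",
    "recruit1": "rank-and-file", "recruit2": "rank-and-file",
    "journalist1": "outsider", "journalist2": "outsider",
    "journalist3": "outsider", "journalist4": "outsider",
}

# The spokesman tops the betweenness ranking while the actual leader scores zero.
for node, score in sorted(nx.betweenness_centrality(G).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{node:12s} betweenness={score:.2f}  field role: {field_roles[node]}")
```

Laid side by side like this, the mismatch between the statistical ranking and the field-derived roles is what jumped out of our data.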

The moral of the fielding error, Zubaida, and al-Muhajiroun stories is not that these statistics have no value, but that it is necessary to make a conscious effort to determine whether a particular way of thinking or method actually works. While it might be difficult to evaluate some of these practices, it is entirely possible with additional effort.

Looking to the outside for innovation

If the Moneyball revolution had an ideological father it would certainly be Bill James. It was his mistrust of experienced judgment and concern with the traditional baseball statistics, such as the fielding error, that led the A’s management to storm the Bastille of professional baseball. Yet, James was an ‘outsider’ in the purest sense of the word—he penned his first tract on baseball analysis while working as a night watchman at a bean factory.

Beginning with his self-published books in the 1970s, James attracted a cult following among computer and stats nerds, but even as his circle grew larger and larger, his message fell on the deaf ears of the men who ran professional baseball. This all changed once the A's front office put James' ideas into practice, drawing the ire of the traditional 'baseball men.' Still, even with strong resistance, the A's were able to inaugurate the Moneyball revolution with a combination of outsider innovation and insider know-how.

Bill James on the cover of a 1981 issue of Sports Illustrated 


A similar dynamic is underway in intelligence analysis, as many post-9/11 programs have opened up channels for new ideas, such as the Intelligence Advanced Research Projects Activity's (IARPA) grants and the Centers of Academic Excellence program. In short, it would seem inroads are being made to bring in more outsiders.

Still, there are few places and opportunities for those interested in implementing the core message of Moneyball in intelligence analysis. For my own part I am trying to validate some of the structured analytic techniques promoted after 9/11, but I've faced an endless set of institutional and cultural barriers. As a young intelligence studies researcher, I am caught between a rock and a hard place: my research subject is unfamiliar to an academic audience, and some practitioners are distrustful of applied social science research.

It would seem we need not just an institutional shift but also a cultural shift to bring in new ideas. There will always be resistance at first, but if the objective is to improve the profession, considering the limitations of experience and questioning 'what works' (the lessons of Moneyball) can take us a long way.

[Bracketing] the Black Swan (Part II)

In my last blog post I discussed the possibility of bracketing, or identifying, a 'black swan,' an extremely rare event with significant consequences. Trying to identify a black swan is a tall order since these events, by definition, are highly unlikely. As I discussed last month, the challenge is to 'reach out' onto the statistical distribution toward the unlikely hypotheses.

Research on knowledge systems suggests that the hypotheses most commonly identified by a group of experts sit on the extreme left of the distribution. In most analytic tasks, the most instrumental hypothesis is probably here. For example, there are a few commonly discussed hypotheses for the outcome of the Syrian Civil War (e.g., the Assad regime wins, stalemate, etc.). In the graph below these hypotheses would fall in the green shaded region as H1, H2, and H3. In the case of black swan events, however, the hypothesis (or hypotheses) is suggested less frequently and sits further out on the right. In the Syrian example, this might include Iran invading and achieving victory, out in the yellow shaded region.

[Figure: frequency distribution of hypotheses, with commonly cited hypotheses (H1, H2, H3) in the green shaded region on the left and rarely cited, black swan hypotheses in the yellow shaded region on the right]

 

Imaginative structured analytic techniques assist analysts in reaching further out on this distribution, but some of the techniques have notable limitations. For example, one such technique, brainstorming, assumes equal participation among diverse group members, which defies conventional experience. Further, most of these techniques cannot tell the analyst where they are on the distribution or, more importantly, when they have reached saturation and generated the bulk of plausible hypotheses. In a traditional brainstorming session, the end is usually signaled by a lull in the conversation, when participants are satisfied they have captured the likely hypotheses.

Boundary analysis, developed by William N. Dunn, is another way to generate hypotheses. The technique requires analysts to sample documents containing hypotheses (e.g. news reports) and write down each hypothesis. As an analyst records more hypotheses he should observe the effect of Bradford’s Law: after a point the number of new hypotheses gathered from each document drops precipitously. Since the hypotheses come from the documents rather than the group itself, the technique may ameliorate some of the negative effects of group dynamics on hypothesis generation. Furthermore, one can simply expand the scope of the search for more documents to gain access to rarely cited hypotheses.
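The bookkeeping behind the technique is simple. Here is a minimal sketch in Python, assuming each document has already been read and reduced to the list of hypotheses it mentions (the contents below are invented):

```python
# Boundary analysis bookkeeping: track how many *new* hypotheses each
# successive document contributes. The document contents are invented.
docs = [
    ["regime wins", "stalemate"],
    ["stalemate", "rebels win"],
    ["regime wins", "partition"],
    ["stalemate", "regime wins"],
    ["regime wins", "rebels win"],
]

seen = set()
cumulative = []          # cumulative count of unique hypotheses after each document
for hypotheses in docs:
    new = [h for h in hypotheses if h not in seen]
    seen.update(new)
    cumulative.append(len(seen))

print(cumulative)        # -> [2, 3, 4, 4, 4]: the curve levels off after document 3
```

Plotting the cumulative count against the document number produces the leveling-off curve described above.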

Stopping Point of Bradford’s Law


For most analytic tasks, the set of hypotheses gathered by the time you stop at the "knee of the curve" (where the marginal yield of new hypotheses levels off) will likely include the correct hypothesis. But for "black swan" events we have no such defined rule. By definition, it would seem that a black swan should fall after the stopping rule, but it is also entirely possible that the black swan really was foreseeable.

We simply don’t know.
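What we can do is make the stopping rule explicit. One crude heuristic, sketched below purely for illustration (not a validated rule), is to stop once several consecutive documents add nothing new; note how, in the invented example, a rarer hypothesis appearing later falls past the stopping point, which is exactly the black swan worry.

```python
def stopping_point(cumulative, patience=3):
    """Return the 1-based index of the last document that added a new
    hypothesis before `patience` consecutive documents added nothing,
    or None if that never happens. An illustrative heuristic, not a
    validated stopping rule."""
    flat = 0
    for i in range(1, len(cumulative)):
        flat = flat + 1 if cumulative[i] == cumulative[i - 1] else 0
        if flat >= patience:
            return i + 1 - patience
    return None

# Cumulative unique-hypothesis counts per document (invented).
print(stopping_point([2, 3, 4, 4, 4, 4, 5, 5, 5, 5]))  # -> 3; the rarer fifth
# hypothesis that only appears at document 7 falls past the stopping point.
```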

To address this question I teamed up with my colleague Jay Rickabaugh to apply boundary analysis retrospectively to a ‘real world’ intelligence analysis task: the 2012 University of Pittsburgh bomb threats.

The Pitt Bomb Threats

Over the course of ten weeks in the spring of 2012, the University of Pittsburgh received approximately 140 bomb threats. While the threats took a variety of forms, beginning with scrawled threats in campus restrooms, the most persistent and numerous came as emails sent through a remailer, which masked the location of the perpetrator. Further confounding the investigation were copycat actions, false accusations, and others seeking publicity by capitalizing on the chaos. The swarming nature of these threats made the case different from a traditional bomb scare and thus made black swan explanations seem more possible.

During the multi-agency investigation, several leads were pursued but each led to a dead end. Finally, on April 19th, after weeks of threats had cost the University of Pittsburgh more than $300,000 in direct costs alone, the University met one threatener's demand to rescind a $50,000 reward, and immediately thereafter the emailed threats stopped.

In mid-August, after a months-long investigation, authorities held a press conference to announce that they were charging Adam Busby, a 64-year-old Scottish nationalist involved with the Scottish National Liberation Army (SNLA), in connection with the emailed threats. The result was stunning and is best summed up by Andrew Fournaridis, administrator of a blog developed during the bomb threats, who wrote:

“This is the mind-bending stuff intelligence analysts must deal with on a daily basis, especially in the 21st century cyber-crime era.”

To this day authorities have never divulged Busby’s motivation.

The question is: will boundary analysis find the black swan before the stopping rule?

Using Boundary Analysis & Findings

For our analysis we used open source documents from two local newspapers (the Pittsburgh Post-Gazette and Pittsburgh Tribune-Review) and blog postings from www.stopthepittbombthreats.blogspot.com, a major platform for crowd-sourcing during the threats. After compiling all the sources we had more than 130 news articles and numerous blog posts ranging from January 1, 2012 to August 30, 2012.

Articles that did not contain useful information (e.g. articles about how students coped with threats) were omitted, leaving us with 73 articles that we coded by date in an Excel spreadsheet.  Next, each article was scrutinized for hypotheses, a process that took a single coder approximately 8-10 hours.

Our boundary analysis of the bomb threats yields two findings:

  • Boundary analysis identified the ‘usual suspects’ quickly

In conducting our retrospective boundary analysis we quickly found our stopping rule. In fact, within a time span of roughly one month, from March to April, almost all of our hypotheses had been identified in the documents (see graph). These hypotheses included typical explanations such as students avoiding exams, students in conflict with the university administration, pranksters, etc.

[Figure: hypotheses identified in the documents over time, with nearly all appearing between March and April]

The ability of boundary analysis to locate the main hypotheses quickly may also be helpful when combined with hypothesis-testing techniques. For example, once the analyst extracts the most common hypotheses, he can begin testing each one with a diagnostic technique (for example, analysis of competing hypotheses) and move further out on the distribution as needed.

  • The normal stopping rule did not bracket the black swan hypothesis

After examining our three data sources, the correct hypothesis (a foreign national from the UK pranking the University) was not identified in the documents we reviewed; we stopped our analysis at the stopping rule, or "knee of the curve." We do not have enough information to suggest what a better limit would be, but applying these same principles to more black swan intelligence cases (the DC Sniper, Eric Rudolph, etc.) would give us a better indication. With more research, we can begin to identify how far past the knee one would need to search to be reasonably confident that any black swans have been identified, so that when unanticipated or abnormal events begin to occur we are not applying ordinary methods to unique circumstances.

Implications

While we were unable to bracket the black swan using traditional limits, the two findings have important implications for intelligence analysis. Probably the greatest benefit of boundary analysis is that it can give analysts a list of 'usual suspects' hypotheses. Analysts can then use diagnostic techniques to whittle down the number of plausible hypotheses. If the usual hypotheses are not useful, the analyst can keep moving to the right of the distribution by extending the boundary analysis or employing an imaginative technique. As we note above, an area of future research is applying the method retrospectively to more cases to determine whether there is a stopping rule that will catch most black swans.
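For concreteness, here is a minimal sketch of what that "whittling down" step might look like, in the spirit of Heuer's analysis of competing hypotheses; the hypotheses, evidence items, and consistency scores are all invented for illustration and are not an analysis of the Pitt case.

```python
# Toy ACH-style consistency matrix: rows are evidence items, columns are
# hypotheses, scores are -1 (inconsistent), 0 (neutral), +1 (consistent).
# All hypotheses, evidence items, and scores are invented for illustration.
hypotheses = ["disgruntled student", "prankster", "outside actor"]
evidence = {
    "threats sent via anonymous remailer":   [0, 0, +1],
    "threats continued through exam breaks": [-1, 0, +1],
    "demand to rescind the reward":          [0, +1, +1],
}

# ACH emphasizes disconfirmation: tally the inconsistencies against each hypothesis.
inconsistencies = [sum(1 for scores in evidence.values() if scores[i] < 0)
                   for i in range(len(hypotheses))]

for hypothesis, count in sorted(zip(hypotheses, inconsistencies), key=lambda x: x[1]):
    print(f"{hypothesis:20s} inconsistencies={count}")
```

Hypotheses with the most inconsistent evidence are pruned first, leaving a shorter list to push further out along the distribution.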

[Bracketing] the Black Swan in Intelligence Analysis (Part I)

This article is part of a recurring series by Steve Coulthart, a PhD candidate at the Graduate School of Public and International Affairs at the University of Pittsburgh. If you have any questions or comments, feel free to contact him at SJC62@pitt.edu.

Intelligence analysts have a 'black swan problem' or, if you want to be more academic, you might call it the 'problem of induction.' The problem of induction touches on the conundrum of how we can know the future given the experience we have today. In other words, at what point can we say we know what we know?

An example from Taleb's well-known book on black swans helps to clarify. Take the turkey's dilemma: for the first 1,000 days of its life the turkey is fed and treated well, each day increasing its expectation that the following day will be the same. Yet at day 1,001 that assumption is proven false, and the turkey ends up as Thanksgiving dinner.


Replace the turkey with an analyst trying to forecast the next revolution, coup, or terrorist attack, and the task comes into focus. To avoid surprise, analysts attempt to foresee different possible outcomes. We can think of these different possible outcomes as hypotheses about things that could happen in the future. For example, in the case of the Syrian Civil War, there are several possible hypotheses floating around: the Assad government wins, the stalemate lingers, the rebels win, etc.

The first step for analysts working on a forecasting task is to conjure up hypotheses from their own experience, intelligence reports, experts, etc. Most likely, the analyst will identify a few well-known hypotheses (such as the ones mentioned above). We know this because how well known a hypothesis is, measured by how often it is cited in discussion (e.g., in news articles, amongst experts, etc.), fits a power law distribution. The practical result is that there is a set of core hypotheses almost everyone knows and a long tail of lesser-known ones (a pattern described by Zipf's Law).
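A minimal sketch of what such a distribution looks like, using an idealized Zipf-style rank-frequency rule rather than real citation counts:

```python
# Idealized Zipf-style rank-frequency pattern: the k-th most cited hypothesis
# is cited roughly 1/k as often as the most cited one. Purely illustrative.
most_cited = 1000   # hypothetical citation count for the best-known hypothesis

for rank in range(1, 11):
    citations = most_cited / rank
    bar = "#" * int(citations / 25)
    print(f"H{rank:<2d} {citations:7.0f}  {bar}")
```

The handful of hypotheses at the head of the curve are the ones "almost everyone knows"; the black swans, if they appear at all, sit far down the tail.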

In his pioneering research in the policy analysis field, William Dunn found that the most cited or discussed hypotheses sit on the extreme left of the distribution while the black swans sit on the extreme right. In intelligence analysis, these right-tail hypotheses are often ignored until it is too late (e.g., the 9/11 attack). In our Syria example this could include something seemingly unlikely, like an Iranian invasion of Syria.


What can analysts do to 'reach out' onto the tail? The common answer is to encourage analysts to think creatively and/or consider the complexity of the situation. To do this, analysts are trained in 'imaginative' structured analytic techniques that supposedly open their minds. The U.S. Intelligence Community's tradecraft primer lists a few of these techniques, and Heuer and Pherson's standard text has several hypothesis generation techniques. Unfortunately, these techniques have a crucial weakness: there is no stopping rule.

What is a stopping rule? Well, like the turkey in the example above, the analyst doesn’t know when he or she can stop considering new hypotheses, including a potential black swan waiting in the wings (no pun intended).

Consider a hypothetical group of analysts brainstorming the outcomes of the Syrian Civil War. At what point should the analysts stop generating hypotheses?  Perhaps they have identified our black swan of Iran invading, but what now? Are they done? The common answer is to say when it “feels right,” but as we know, cognitive biases can creep in, and further, what if the black swan is still lurking out on the tail?

One possible answer, yet to be discussed in the intelligence analysis literature, is the use of boundary analysis developed by Dunn. As the name implies, boundary analysis is a method to determine the analytic ‘boundaries’ of a problem, in this case the number of plausible hypotheses. The technique also addresses the stopping rule problem plaguing imaginative structured analytic techniques.

Here’s how it works:

The first step in boundary analysis is the specification of the analytic problem, for example, "What are the likely outcomes of the Syrian Civil War?" Next, analysts sample data sources that hold hypotheses related to the analytic question; a common source is open source documents, such as news reports. Once the data is compiled, it can be mined by coding each unique hypothesis.

At first the list of hypotheses grows rapidly with each document; however, the analyst will soon see something very puzzling: after the initial surge, each successive document yields fewer and fewer new hypotheses. This rapid leveling-off is due to Bradford's Law.


An Example of Bradford’s Law: Citations

In 1934, British mathematician Samuel Bradford was searching physics journals and found that after locating approximately two dozen core journals he had found the bulk of all academic physics citations. Beyond those core journals, each subsequent journal provided a diminishing number of new citations. The leveling-off effect of Bradford's Law also applies to hypotheses and provides a stopping point at which analysts know they have reviewed almost all known hypotheses.

Returning to our power-law distribution of hypotheses, we could imagine that a boundary analysis might get us closer to finding the black swan. But boundary analysis is still no panacea, because at this point we really do not know how well the technique does at identifying possible black swans in intelligence analysis tasks.


Fortunately, the question of how boundary analysis performs on intelligence analysis tasks is an answerable empirical question. In my next blog post I will present results from a research study using boundary analysis on a ‘real world’ intelligence analysis problem.