Ogburn and Goltra's ``indirect'' method of estimating women's votes was to correlate the percent of women voting in each precinct in Portland, Oregon, with the percent of people voting ``no'' in selected referenda in the same precincts. They reasoned that individual women were probably casting ballots against the referenda questions at a higher rate than men ``if precincts with large percentages of women voting, vote in larger percentages against a measure than the precincts with small percentages of women voting.'' But they (correctly) worried that what has come to be known as the ecological inference problem might invalidate their analysis: ``It is also theoretically possible to gerrymander the precincts in such a way that there may be a negative correlative even though men and women each distribute their votes 50 to 50 on a given measure'' (p. 415). The essence of the ecological inference problem is that the true individual-level relationship could even be the reverse of the observed aggregate correlation if it were the men in the heavily female precincts who voted disproportionately against the referenda.
Ogburn and Goltra's data no longer appear to be available, but the
problem they raised can be illustrated by this simple hypothetical
example reconstructed in part from their verbal descriptions.
Consider two equal-sized precincts voting on Proposition 22, an
initiative by the radical ``People's Power League'' to institute
proportional representation in Oregon's Legislative Assembly
elections: 40% of voters in precinct 1 are women and 40% of all
voters in this precinct oppose the referendum. In precinct 2, 60% of
voters are women and 60% of the precinct opposes the referendum.
Precinct 2 has more women and is more opposed to the referendum than
precinct 1, and so it certainly seems that women are opposing
the proportional representation reform. Indeed, it could be the case
that all women were opposed and all men voted in favor in both
precincts, as might have occurred if the reform were uniformly seen as
a way of ensuring men a place in the legislature even though they
formed a (slight) minority in every legislative district. But however
intuitive this inference may appear, simple arithmetic indicates that
it would be equally consistent with the observed aggregate data for
men to have opposed proportional representation at four times the
rate of women.
These higher relative
rates of individual male opposition would occur, given the same
aggregate percentages, if a larger fraction of men in the
female-dominated precinct 2 opposed the reform than men in precinct 1,
as might happen if precinct 2 was a generally more radical area
independent of, or even because of, its gender composition.
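The arithmetic behind this claim can be checked directly. The sketch below is only an illustration and is not part of Ogburn and Goltra's analysis: the function name `opposition_rates` and the two scenarios are assumptions introduced here. It takes the two hypothetical precincts described above, assumes they have equal numbers of voters, and derives the men's opposition rate in each precinct from an assumed women's rate, so that every scenario reproduces the same aggregate percentages.

```python
from fractions import Fraction as F

# Hypothetical precincts from the text: (share of voters who are women,
# share of all voters opposing the measure). Equal precinct sizes assumed.
PRECINCTS = [(F(2, 5), F(2, 5)), (F(3, 5), F(3, 5))]


def opposition_rates(women_opposed):
    """Return pooled (women, men) opposition rates across both precincts.

    women_opposed[i] is the assumed fraction of women in precinct i who
    oppose; the men's fraction in that precinct then follows from the
    accounting identity  total = x * women_rate + (1 - x) * men_rate.
    """
    women_no = men_no = women_all = men_all = F(0)
    for (x, total), bw in zip(PRECINCTS, women_opposed):
        bm = (total - x * bw) / (1 - x)
        if not (0 <= bw <= 1 and 0 <= bm <= 1):
            raise ValueError("rates are inconsistent with the aggregate data")
        women_no += x * bw
        men_no += (1 - x) * bm
        women_all += x
        men_all += 1 - x
    return women_no / women_all, men_no / men_all


scenarios = {
    # Every woman opposes and every man supports, in both precincts.
    "all women opposed": [F(1), F(1)],
    # No women oppose in precinct 1; one third of women oppose in precinct 2.
    "men mostly opposed": [F(0), F(1, 3)],
}
for label, women_opposed in scenarios.items():
    w, m = opposition_rates(women_opposed)
    print(f"{label}: women {float(w):.0%}, men {float(m):.0%}")
```

Both scenarios leave the precinct totals at 40% and 60% opposed. In the second, two thirds of the men in precinct 1 and all of the men in precinct 2 oppose the measure, so that 80% of men but only 20% of women are opposed overall, which is the fourfold difference described above.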
But if Ogburn and Goltra were Leif Ericson, William Robinson was
Christopher Columbus: for not until Robinson's (1950) article was the
problem widely recognized and the quest for a valid method of making
ecological inferences begun in earnest.
Robinson's article remains
one of the most influential works in social science methodology. His
(correct) view was that, with the methods available at the time, valid
ecological inference was impossible. He warned analysts never to use
aggregate data to infer individual relationships, and thus to avoid
what has since come to be known as ``the ecological fallacy.'' His
work sent two shock waves through the social sciences that are still
being felt, causing some scholarly pursuits to end and another to
begin.
First, the use of aggregate data by political scientists, quantitative
historians, sociologists, and others declined relative to use of other forms
of data; scholars began to avoid using aggregate data to address
whole classes of important research questions (King, 1990). In many
countries and fields of study, this ``collapse of aggregate data
analysis
and its replacement by individual survey analysis as
the dominant method of quantitative social research'' (Achen and
Shively, 1995: 5) meant that numerous, often historical and
geographical, issues were put aside, and many still remain unanswered.
What might have become vibrant fields of scholarship withered. The
scholars who continue to work in these fields--such as those in
comparative politics attempting to explain who voted for the Nazi
party, or political historians studying working-class support for
political parties in the antebellum Southern U.S.--do so because
of the lack of an alternative to ecological data, but they toil under
a cloud of great suspicion. The ecological inference problem hinders
substantive work in almost every empirical field of political science,
as well as numerous areas of sociology, education, marketing,
economics, history, geography, epidemiology, and statistics. For
example, historical election statistics have fallen into disuse and
studies based on them into at least some disrepute. Classic studies,
such as V. O. Key's (1949) Southern Politics, have been
succeeded by scholarship based mostly on survey research, often to
great advantage, but necessarily ignoring much of history, focused as
it is on the few recent, mostly national, elections for which surveys are
available.
The literature's nearly exclusive focus on national surveys with random interviews of isolated individuals means that the geographic component to social science data is often neglected. Commercial state-level surveys are available, but their quality varies considerably and the results are widely suspect in the academic community. Even if the address of each survey respondent were available, the usual 1,000-2,000 respondents to national surveys are insufficient for learning much about spatial variation except for the grossest geographic patterns, in which a country would be divided into no more than perhaps a dozen broad regions. For example, some National Election Study polls locate respondents within congressional districts, but only about a dozen interviews are conducted in any district, and no sample is taken from most of the congressional districts for any one survey. The General Social Survey makes available no geographic information to researchers unless they sign a separate confidentiality agreement, and even then only the respondent's state of residence is released. Survey organizations in other countries are even more reticent about releasing local geographic information.
Creative combinations of quantitative and qualitative research are much more difficult when the identity and rich qualitative information about individual communities or respondents cannot be revealed to readers. Indeed, in most cases, respondents' identities are not even known to the data analyst. If ``all politics is local,'' political science is missing much of politics. In contrast, aggregate data are saturated with precise spatial information. For example, the United States can be divided into approximately 190,000 electoral precincts, and detailed aggregate political data are available for each. Only the ecological inference problem stands between the scientific community and this rich source of information.
Whereas the first shock wave from Robinson's article stifled research
in many substantive fields, the second energized the social science
statistics community to try to solve the problem. One partial measure
of the level of effort devoted to solving the ecological inference
problem is that Robinson's article has been cited more than eight hundred
times.
Many other scholars have written on the topic as
well, citing those who originally cited Robinson or approaching the
problem from different perspectives. At one extreme, the literature
includes authors such as Bogue and Bogue (1982), who try,
unsuccessfully, to ``refute'' the ecological fallacy altogether; at
the other extreme are fatalists who liken the seventy-five-year search
for a solution to the ecological inference problem to seeking
``alchemists' gold'' (Flanigan and Zingale, 1985) or to ``a fruitless
quest'' (Achen and Shively,
1995). These scholars, and numerous
others between these extreme positions, have written extensively, and
often very fruitfully, on the topic. Successive generations of young
scholars and methodologists in the making, having been warned off
aggregate data analysis with their teachers' mantra ``thou shalt not
draw conclusions about individual behavior from aggregate data,'' come
away with the conviction that the ecological inference problem
presents an enormous barrier to social science research. This belief
has drawn a steady stream of social science methodologists into the
search for a solution over the years, myself included.
Numerous important advances have been made in the ecological inference
literature, but even the best current methods give incorrect answers a large
fraction of the time, and nonsensical answers very frequently (such as 115% of
blacks voting for the Democrats or
of foreign-born Americans being
illiterate). No proposed method has been scientifically validated. Any that
have been tried on data sets for which the individual-level relationship of
interest is known generally fail to give the right answer. It is a testimony
to the difficulty of the problem that no serious attempts have even been made
to address a variety of basic statistical issues related to the problem. For
example, currently available measures of uncertainty, such as confidence
intervals, standard errors, and others, have never been validated and appear to
be hopelessly inaccurate. Indeed, for some important approaches, no
uncertainty measures have even been proposed.
Unlike the rest of this book, this chapter contains no technical details and
should be readable even by those with little or no statistical background. In
the remainder of this chapter, I summarize some other applications of
ecological inference, define the problem more precisely by way of a leading
example of the failures of the most popular current method, summarize the
nature of the solution offered, provide some brief empirical evidence that
the method works in practice, and outline the statistical method offered.