Table
portrays the issue in this case as an example of the more
general ecological inference problem. This table depicts what is known for the
election to the Ohio State House that occurred in District 42 in 1990. The
black Democratic candidate received 19,896 votes (65% of votes cast) in a race
against a white Republican opponent. African Americans constituted 55,054 of
the 80,760 people of voting age in this district (68%). Because this known
information appears in the margins of the cross-tabulation, it is usually
referred to as the marginals. The ecological inference problem involves
replacing the question marks in the body of this table with inferences based on
information from the marginals. (Ecological inference is traditionally defined
in terms of a table like this and thus in terms of discrete individual-level
variables. Most political scientists, sociologists, and geographers, and some
statisticians, have retained this original definition. Epidemiologists and
some others generalize the term to include any aggregation problem, including
continuous individual-level variables. I use the traditional definition in
this book in order to emphasize the distinctive characteristics of aggregated
discrete data, and discuss aggregation problems involving continuous
individual-level variables in Chapter
.)
For example, the question mark in the upper left corner of the table
represents the (unknown) number of blacks who voted for the Democratic
candidate. Obviously, a wide range of different numbers could be put
in this cell of the table without contradicting its row and column
marginals, in this case any number between 0 and 19,896, a logic
referred to in the literature as the method of
bounds.
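
The range quoted above follows from simple arithmetic on the marginals. As a minimal sketch in Python (the function name `cell_bounds` is illustrative and does not come from the text): a cell can be no larger than either its row total or its column total, and no smaller than whatever part of the column cannot fit into the other rows.

```python
def cell_bounds(row_total, col_total, grand_total):
    """Return the (lowest, highest) cell count consistent with the marginals."""
    # The other rows can absorb at most (grand_total - row_total) of this
    # column; anything beyond that must fall in this cell.
    lower = max(0, col_total - (grand_total - row_total))
    # The cell cannot exceed either of its own marginals.
    upper = min(row_total, col_total)
    return lower, upper


# District 42, 1990: 55,054 blacks of voting age, 19,896 Democratic votes,
# 80,760 people of voting age in total.
print(cell_bounds(55_054, 19_896, 80_760))  # (0, 19896)
```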
Fortunately, somewhat more information is available in this example, since the
parties in the Ohio case had data at the level of precincts (or sometimes
slightly higher levels of aggregation instead, which I also will refer to as
precincts). Ohio State House District 42 is composed of 131 precincts, for
which information analogous to Table
The ecological inference problem does not vanish by having access to the
precinct-level data, such as that in Table
With a few minor exceptions, no method has even been proposed to fill in the
unknown quantities at the precinct level in Table
Unfortunately, even the best available current methods of ecological
inference are often wildly inaccurate. For example, at the federal
trial in Ohio (and in formal sworn deposition and in a prepared
report), the expert witness testifying for the plaintiffs reported
that 109.63% of blacks voted for the Democratic candidate in District
42 in 1990! He also reported in a separate, but obviously related,
statement that a negative number of blacks voted for the Republican
candidate. Lest this seem like one wayward result chosen selectively
from a sea of valid inferences, consider a list of the results from
all districts reported by this witness (every white Republican who
faced a black Democrat since 1986), which I present in Table
What of the analyses in Table
When ridiculous results appear in academic work, as they sometimes do, there
are few practical ramifications. In contrast, inaccurate results used in making
public policy can have far-reaching consequences. Thus, in order to attempt to
avoid this situation, the witness in this case used the best available methods
at the time and had at his disposal far more resources and time than one would
have for almost any academic project. The partisan control of a state
legislature was at stake, and research resources were the last things that
would be spared if the case could be won. (The witness also had extensive
experience testifying in similar cases.) Moreover, he was using a method (a
version of Goodman's "ecological regression") that the U.S. Supreme Court
had previously declared to be appropriate in applications such as this
(Thornburg v. Gingles, 1986). If there was any way of avoiding these
silly conclusions, he certainly would have done so. Yet, even with all this
going for him he was effectively forced by the lack of better methods to
present results that indicated, in over half the districts he studied, that
more African Americans voted for the Democratic candidate than there were
African Americans who voted.
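
To see how a regression of this kind can return an estimate above 100%, consider a minimal sketch in Python. It uses the textbook form of Goodman's regression, which is not necessarily the version the witness applied, and the seven precincts are invented for illustration (they are not the Ohio data). The regression assumes that each group votes Democratic at the same rate in every precinct; in the hypothetical data that assumption fails, because white support for the Democrat rises with the black fraction of the precinct.

```python
# Hypothetical precincts, invented for illustration (not the Ohio data).
# x[i] is the black fraction of voters in precinct i.
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Assumed truth: 90% of blacks vote Democratic in every precinct, while
# white Democratic support rises with the black fraction of the precinct.
TRUE_BLACK_RATE = 0.9


def true_white_rate(black_fraction):
    return 0.1 + 0.5 * black_fraction


# Observed Democratic share per precinct (an accounting identity).
t = [TRUE_BLACK_RATE * xi + true_white_rate(xi) * (1 - xi) for xi in x]


def goodman(x, t):
    """Least-squares line of t on x, read off at x = 1 and x = 0.

    Goodman's regression treats both group rates as constant across
    precincts, so the fitted value at x = 1 is the estimated black rate
    and the fitted value at x = 0 is the estimated white rate.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(t) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxt = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, t))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    return intercept + slope, intercept


est_black, est_white = goodman(x, t)
print(f"estimated black rate: {est_black:.3f}")
print(f"estimated white rate: {est_white:.3f}")
```

In this constructed example the estimated black rate is 1.125, that is, 112.5%, even though the assumed true rate is 90% in every precinct. Nothing in the least-squares fit constrains the estimate to lie between 0 and 1.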
Two types of statistical difficulties cause inaccurate results such as these in
ecological inferences. The first is aggregation bias. This is the
effect of the information loss that occurs when individual-level data are
aggregated into the observed marginals. The problem is that in some aggregate
data collections, the type of information loss may be selective, so that
inferences that do not take this into account will be biased.
The second cause of inaccurate results in ecological inferences is a
variety of basic statistical problems, unrelated to aggregation
bias, that have not been incorporated into existing methods. These
are the kinds of issues that would be resolved first in any other
methodological area, although most have not yet been addressed. For
example, much data used for ecological inferences have massive levels
of "heteroskedasticity" (a basic problem in regression analysis),
but this has never been noted in the literature--and sometimes
explicitly denied--even though it is obviously present even in most
published scatter plots (about which more in Chapter
Race of
Voting Age              Voting Decision
Person        Democrat   Republican   No vote     Total
black             ?           ?           ?      55,054
white             ?           ?           ?      25,706
Total          19,896      10,936      49,928    80,760
As a result, some other information or method must be used to further
narrow the range of results.
is available. For example,
Table
displays the information from Precinct P, which in
District 42 falls between Cascade Valley Park and North High School in the
First Ward in the city of Akron. The sum of any item in the precinct tables,
across all precincts, would equal the number in the same position in the
district table. For example, if the number of blacks voting for the Democratic
candidate in Precinct P were added to the same number from each of the other
130 precincts, we would arrive at the total number of blacks casting ballots
for the Democratic candidate represented as the first cell in Table
.
Race of
Voting Age              Voting Decision
Person        Democrat   Republican   No vote     Total
black             ?           ?           ?         221
white             ?           ?           ?         484
Total            130          92         483        705
, because we
ultimately require individual-level information. Each of the cells in this
table is still unknown. Thus, knowing the parts would tell us about the whole,
but disaggregation to precincts does not appear to reveal much more about the
parts.
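
The same bounds arithmetic can be applied to every cell of a single precinct. The following is a minimal sketch in Python using the marginals reported for Precinct P; the function and variable names are illustrative, not from the text.

```python
# Marginals for Precinct P as reported in the precinct table.
row_totals = {"black": 221, "white": 484}
col_totals = {"Democrat": 130, "Republican": 92, "No vote": 483}
grand_total = 705


def table_bounds(row_totals, col_totals, grand_total):
    """Map each (row, column) cell to its (lower, upper) bound."""
    bounds = {}
    for row, r_tot in row_totals.items():
        for col, c_tot in col_totals.items():
            # Lower bound: what must land in this cell once all other
            # cells in the table are filled to capacity.
            lower = max(0, r_tot + c_tot - grand_total)
            # Upper bound: the smaller of the two marginals.
            upper = min(r_tot, c_tot)
            bounds[(row, col)] = (lower, upper)
    return bounds


precinct_p = table_bounds(row_totals, col_totals, grand_total)
for (row, col), (low, high) in precinct_p.items():
    print(f"{row:5s} {col:10s} between {low} and {high}")
```

For Precinct P, only the cell for whites who did not vote has a lower bound above zero (262); every other cell could be as low as 0. The marginals of a single precinct therefore still leave each cell as a range rather than a number.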
. What scholars have done is to develop methods to use
the observed variation in the marginals over precincts to help narrow the range
of results at the district level in Table
. For example, if the
Democratic candidate receives the most votes in precincts with the largest
fractions of African Americans, then it seems intuitively reasonable to suppose
that blacks are voting disproportionately for the Democrats (and thus the upper
left cell in Table
is probably large). This assumption is often
reasonable, but Robinson showed that it can be dead wrong: the individual-level
relationship is often the opposite sign of this aggregate correlation, as will
occur if, for example, whites in heavily black areas tend to vote more
Democratic than whites living in predominately white neighborhoods.
. A majority of these results are over 100%, and thus
impossible. No one was accusing the Democratic candidates of stuffing
the ballot box; dead voters were not suspected of turning out to vote
more than they usually do. Rather, these results point out the
failure of the general methodological approach. For those familiar
with existing ecological inference methods, these results may be
disheartening, but they will not be surprising: impossible results
occur with regularity.
                 Estimated Percent of Blacks
Year  District   Voting for the Democratic Candidate
1986     12        95.65%
         23       100.06
         29       103.47
         31        98.92
         42       108.41
         45        93.58
1988     12        95.67
         23       102.64
         29       105.00
         31       100.20
         42       111.05
         45        97.49
1990     12        94.79
         14        97.83
         16        94.36
         23       101.09
         25        98.83
         29       103.42
         31       102.17
         36       101.35
         37       101.39
         42       109.63
         45        97.62
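
The statement that a majority of these estimates are impossible can be checked directly from the table. A minimal sketch in Python, with the values transcribed from the table above (the variable names are illustrative):

```python
# Estimated percent of blacks voting for the Democratic candidate,
# keyed by year and then by district, as listed in the table.
estimates = {
    1986: {12: 95.65, 23: 100.06, 29: 103.47, 31: 98.92, 42: 108.41, 45: 93.58},
    1988: {12: 95.67, 23: 102.64, 29: 105.00, 31: 100.20, 42: 111.05, 45: 97.49},
    1990: {12: 94.79, 14: 97.83, 16: 94.36, 23: 101.09, 25: 98.83, 29: 103.42,
           31: 102.17, 36: 101.35, 37: 101.39, 42: 109.63, 45: 97.62},
}

# Flatten to one list of percentages and keep those above 100.
values = [pct for by_district in estimates.values() for pct in by_district.values()]
impossible = [pct for pct in values if pct > 100]

print(f"{len(impossible)} of {len(values)} estimates exceed 100%")  # 13 of 23
```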
that produced results that
were not impossible? For example, in District 25, the application of
this standard method of ecological inference indicated that 99% of
blacks voted for the Democratic candidate in 1990. Is this correct?
Since no external information is available, we have no idea. However,
we do know, from other situations where data do exist with which to
verify the results of ecological analyses, that the methods usually do
not work. The problem, of course, is that when they give results that
are technically possible we might be lulled into believing them. As
Robinson so clearly stated, even technically possible results from
these standard methods are usually wrong.
).