
Measuring Racial Discrimination (2004)

Chapter 6: Experimental Methods for Assessing Discrimination

As we discussed in Chapter 5, at the core of assessing discrimination is a causal inference problem. When racial disparities in life outcomes occur, explicit or subtle prejudice leading to discriminatory behavior and processes is a possible cause, so that the outcomes could represent, at least in part, the effect of discrimination. Accurately determining what constitutes the effect of discrimination, personal choice, and other related and unrelated factors requires the ability to draw clear causal inferences. In this chapter, we review two experimental approaches that have been used by researchers to reach causal conclusions about racial discrimination: laboratory experiments and field experiments (particularly audit studies).

Experimental Design

To permit valid causal inferences about racial discrimination, the design of an experiment and the analytic method used in conjunction with that design must address several issues. First, there are frequently intervening or confounding variables that are not of direct interest but that may affect the outcome. The effects of these variables must be accounted for in the study design and analysis. In controlled laboratory experiments, the investigator manipulates a variable of interest, randomly assigns participants to different conditions of the variable or treatments, and measures their responses to the manipulation while attempting to control for other relevant conditions or attributes. As described in the previous chapter, randomization greatly increases the likelihood of being able to infer that an observed difference between the treatment and control groups is causal. Observing a difference in outcome between the groups of participants can be the basis for a causal inference. In controlled field experiments, researchers analyze the results of a deliberately manipulated factor of interest, such as the race of an interviewer. They attempt to control carefully for any intervening or confounding variables. Random assignment of treatments to participants is frequently used to reduce any doubts about lingering effects of unobserved variables, provided, of course, that one can actually apply the randomization to the variable of interest.

In addition to the problem of credibly designing an experiment that supports a causal inference, a common weakness of experiments is a lack of external validity. That is, the results of the experiment may not generalize to individuals other than those enrolled in the experiment, or to different areas or populations with different economic or sociological environments, or to attributes that differ from those tested in the experiment.

Despite these problems, the strengths of experiments for answering some types of questions are undeniable. Even if their results may not be completely generalizable and even if they do not always capture all the relevant aspects of the issue of interest, experiments provide more credible evidence than other methods for measuring the effects of an attribute (e.g., race) in one location and on one population.

Using Experiments to Measure Racial Discrimination

Use of an experimental design to measure racial discrimination raises important questions because race cannot be directly manipulated or assigned randomly to participants. Researchers who use randomized controlled experiments to measure discrimination therefore either vary the “apparent” race of a target person as the experimental treatment or manipulate “apparent” discrimination by randomly assigning study participants to be treated with different degrees of discrimination.

In the first case, the experimenter varies the treatment, namely, the apparent race, by such means as providing race-related cues on job applications (e.g., name or school attended) or by showing photographs to participants in which the only differences are skin color and facial features. The experimenter then measures whether participants respond differently under one race treatment compared with another (e.g., evaluating black versus white job applicants or associating positive or negative attributes with photographs of blacks versus whites). In such a study, the experimenter elicits responses from the participants to determine the effect of apparent race on their behavior (e.g., whether the participants engage in discriminatory behavior toward black and not white applicants). That is, they measure the behavior of potential discriminators toward targets of different races. If the manipulation succeeds in holding everything else constant, a difference in behavior indicates an effect of race.

In the second instance, experimenters randomly assign participants to be treated differently, that is, either with or without discrimination. This type of experiment attempts to measure the response to discrimination rather than directly measure the expression of discrimination—that is, it measures the behavior of potential targets of discrimination. Because race cannot be experimentally manipulated, an explicit specification of the behavioral process is needed that allows the translation of results from such experiments into causal statements about the actual discrimination mechanism measured in the experiment (i.e., the extent to which the experimenter can manipulate some other factor related to race, such as perception). To our knowledge, no one has attempted to carry out such formal reverse reasoning, and we believe that doing so is especially crucial when arguing for the external validity of experimental results.

One of the few examples of attempts to perform similar inferential reversals is the special case of understanding odds ratios (and adjusted odds ratios) in the context of comparing retrospective and prospective studies on categorical variables. In retrospective studies, the data are collected only after the treatment has taken place, whereas in prospective studies the data are collected on possible covariates before treatment and on outcomes after the treatment. If one has both categorical explanatory and categorical response variables, one can estimate their relationship in the prospective study based on a retrospective sample. If the logistic causal model is correct, the inference about the key causal coefficient from the retrospective study is the same as if one had done a prospective sampling on the explanatory variable. 1 Those results, however, do not generalize to relationships among continuous variables.
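As a minimal sketch of this standard result (in our notation, not the report's): for a binary exposure X and binary outcome Y, Bayes' rule gives

\[
\mathrm{OR} \;=\; \frac{P(Y=1 \mid X=1)\,/\,P(Y=0 \mid X=1)}{P(Y=1 \mid X=0)\,/\,P(Y=0 \mid X=0)}
\;=\; \frac{P(X=1 \mid Y=1)\,/\,P(X=0 \mid Y=1)}{P(X=1 \mid Y=0)\,/\,P(X=0 \mid Y=0)},
\]

so the same odds ratio can be estimated whether one samples prospectively on X or retrospectively on Y. Equivalently, in the logistic model \(\operatorname{logit} P(Y=1 \mid X) = \alpha + \beta X\), retrospective (case-control) sampling shifts only the intercept \(\alpha\); the slope \(\beta = \log(\mathrm{OR})\), the key causal coefficient, is unchanged.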

LABORATORY EXPERIMENTS

Laboratory experiments, like all experiments, include the standard features of (1) an independent variable that researchers can manipulate (i.e., assign conditions or treatments to participants); (2) random assignment to treatment conditions; and (3) control over extraneous variables that otherwise might be confounded with the independent variable of interest, potentially undermining the interpretation of causality. Laboratory experiments occur in a controlled setting, chosen for its ability to minimize confounding variables and other extraneous stimuli.

Laboratory experiments on discrimination would ideally measure reactions to the exact same person while manipulating only that person’s race. As noted above, while strictly speaking one cannot manipulate the actual race of a single person, experimenters do typically either manipulate the apparent race of a target person or randomly assign subjects or study participants to the experimental condition while attempting to hold constant all other attributes of possible relevance. One common method of varying race is for experimenters to train several experimental confederates—both black and white—to interact with study participants according to a prepared script, to dress in comparable style, and to represent comparable levels of baseline physical attractiveness (see, e.g., Cook and Pelfrey, 1985; Dovidio et al., 2002; Henderson-King and Nisbett, 1996; Stephan and Stephan, 1989). Another common method of varying race involves preparing written materials and either incidentally indicating race or attaching a photograph of a black or white person to the materials (e.g., Linville and Jones, 1980).

Effects of race occur in concert with other situational or personal factors, called moderator variables , that may increase or decrease the effect of race on the participants’ responses. In addition to manipulating a person’s apparent race, for example, investigators may manipulate the person’s apparent success or failure, cooperation or competition, helpfulness, friendliness, dialect, or credentials (see, e.g., Cook and Pelfrey, 1985; Dovidio et al., 2002; Henderson-King and Nisbett, 1996; Linville and Jones, 1980; Stephan and Stephan, 1989). Even more often, experimenters will manipulate features of the situation expected to moderate levels of bias toward black and white targets; examples involve anonymity, potential retaliation, norms, motivation, time pressure, and distraction (Crosby et al., 1980). Finally, the study participants frequently are black and white college students (e.g., Crosby et al., 1980; Correll et al., 2002; Judd et al., 1995).

Strengths of Laboratory Experiments

Laboratory experiments, if well designed and executed, can have high levels of internal validity for causal inference—that is, they are designed to measure exactly what causes what. The direction of causality follows from the manipulation of randomly assigned independent variables that control for two kinds of unwanted, extraneous effects: systematic (confounding) variables and random (noise) variables.

Laboratory experiments are the method of choice for isolating a single variable of interest, particularly when fine-tuned manipulation of precisely defined independent variables is required. Laboratory studies also allow precise measurement of dependent variables (such as response time or inches in seating distance). The laboratory setting gives experimenters a great degree of control over the attention of participants, potentially allowing them to maximize the impact of the manipulation in an otherwise bland environment.

Because of these fine-grained methods, laboratory experiments on discrimination are well suited to examining psychological processes. Both face-to-face interactions and processes in which single individuals react to racial stimuli are readily studied in such experiments. The most sophisticated experiments show not only the effect of some variable (e.g., expectancies) on an outcome variable (e.g., discriminatory behavior) but also the mechanism or process that mediates the effect (e.g., biased interpretations, nonverbal hostility, stereotypic associations). That is, when an experiment manipulates the apparent race of two otherwise equivalent job candidates or interaction partners (as in the interracial interaction studies described later in this chapter; see Dovidio et al., 2002; Word et al., 1974), the experiment ideally should also measure some of the proposed explanatory psychological mechanisms (such as emotional prejudices and cognitive stereotypes, either implicit or explicit), as well as the predicted discrimination (either implicit behaviors, such as nonverbal reactions, or more explicit behaviors, such as verbal reactions).

A hallmark of the better laboratory experiments is that they not only test useful theories but also show how important, compelling phenomena (e.g., the automaticity of discrimination) can and do occur. Laboratory studies often show that very small, subtle alterations in a situation can have substantial effects on important outcome variables.

Measuring Racial Discrimination

Experimenters measure varying degrees of discrimination. Laboratory measures of discrimination begin with verbal hostility (e.g., in studies of interracial aggression), which can constitute discrimination when, for example, negative personal comments result in a hostile work environment (see Chapter 3). At the next level are disparaging written ratings of an individual member of a particular group (Talaska et al., 2003). If unjustified, such negative evaluations can constitute discrimination in a school or workplace.

At the subtle behavioral level, laboratory studies measure nonverbal indicators of hostility, such as seating distance or tone of voice (Crosby et al., 1980). Related nonverbal measures include coding of overt facial expressions, as well as measurement of minute nonvisible movements in the facial muscles that constitute the precursors of a frown. Experimenters study these nonverbal behaviors because they, too, could result in a hostile environment.

Moving up a level, laboratory measures of discriminatory avoidance include participants’ choice of whether to associate or work with a member of a racial outgroup, volunteer to help an organization, or provide direct aid to an outgroup member who requests it (Talaska et al., 2003). In a laboratory setting, segregation can be measured by how people constitute small groups or choose leaders in organizational teams (Levine and Moreland, 1998; Pfeffer, 1998). Finally, aggression against outgroups can be measured in laboratory settings by competitive games or teacher–learner scenarios in which one person is allowed to punish another—an outgroup member—with low levels of shock, blasts of noise, or other aversive experiences (Crosby et al., 1980; Talaska et al., 2003).

A review of laboratory studies as of the early 1980s (Crosby et al., 1980) summarized the findings as follows. Experiments on unobtrusive forms of bias and prejudice showed that white bias was more prevalent than indicated by surveys. Experiments on helping, aggressive, and nonverbal behaviors indicated that (1) whites tended to help whites more often than they helped blacks, especially when they did not have to face the person in need of help directly; (2) under sanctioned conditions (e.g., in competitive games or administration of punishment), whites acted aggressively against blacks more than against whites but only when the consequences to the aggressor were low (under conditions of no retaliation, no censure, and anonymity); and (3) white nonverbal behavior displayed a discrepancy between verbal nondiscrimination and nonverbal hostility or discomfort, betrayed in tone of voice, seating distance, and the like. This review sparked the realization, discussed in earlier chapters, that modern forms of discrimination can be subtle, covert, and possibly unconscious, representing a new challenge to careful measurement, both inside and outside the laboratory (survey measures for these forms of discrimination are discussed in Chapter 8 ).

Key Examples

Since the 1980s, laboratory experiments on discrimination have concentrated more on measuring subtle forms of bias and less on examining overt behaviors, such as helping others. This shift occurred precisely because of the discrepancy between some people’s overtly egalitarian responses on surveys and their discriminatory responses when they think no one is looking, or at least when they have a nonprejudiced excuse for their discriminatory behavior. In Boxes 6-1 through 6-3, we describe three of the best examples of controlled laboratory experiments on discrimination, ranging from classic, simpler studies to more recent, sophisticated ones. In a classic example, Word et al. (1974) created working definitions of race and discrimination to investigate subtle yet potentially powerful effects of stereotypical expectations hypothesized to result in discrimination (see Box 6-1). Another famous experiment showed that researchers can study social perception processes hypothesized to underlie discrimination, in which people see what they want to see by interpreting ambiguous evidence to fit their stereotypical biases (Darley and Gross, 1983; see Box 6-2). And in a final experiment, Dovidio et al. (2002) showed that implicit forms of prejudice tended to lead to implicit but potentially important forms of discrimination, whereas explicit forms of prejudice tended to lead to explicit forms of discrimination (see Box 6-3).

Other provocative recent experiments have shown that actual discriminatory behavior can follow from subliminal exposure to racial and other demographic stimuli (Bargh et al., 1996). This work has revealed that exposure to concepts and stereotypes at speeds too fast for conscious recognition primes relevant behavior, even though participants cannot remember or report having seen the priming stimuli. For example, researchers randomly assigned participants to see, at subliminal speeds, words related to rudeness or neutral topics and showed that those participants exposed to rude words responded more rudely to an experimenter. In a parallel experiment, subliminal exposure to photographs of unfamiliar black male faces, as compared with white ones, was followed by more rude, hostile behavior when the white experimenter subsequently made an annoying request. Similar results have been demonstrated for exposure to phenomena related to being elderly, which resulted in participants walking more slowly to the elevator after the experiment. The point is that researchers can manipulate racial cues without participants’ conscious awareness and measure subtle forms of behavior that, if occurring selectively toward members of one racial group or another, could constitute a hostile environment form of discrimination. Other more direct forms of discrimination are also possible to measure in such experiments, such as making negative comments in a job interview.

These examples illustrate the range of aspects of racial discrimination that can be examined in laboratory settings. Such experiments can manipulate racial and moderator variables; test various hypothesized mechanisms of discrimination, such as attitudes; and assess various hypothesized manifestations of discrimination, including verbal, nonverbal, and affiliative responses. They can also simulate pieces of real-world situations of interest, such as job applications and others. Most of the phenomena studied in experiments on race discrimination have been replicated in studies of gender discrimination and sometimes age, disability, class, or other ingroup–outgroup variations. Research indicates that gender, race, and age are the most salient, immediately encoded social categories (Fiske, 1998).

Limitations of Laboratory Experiments

Laboratory experiments usually are limited in time and measurement, so they generally do not aim to answer questions about behavior over long periods of time or behavior related to entire batteries of measures. The purpose of a laboratory experiment may include one or more of the following: (1) to demonstrate that an effect indeed can occur, at least under some conditions, with some people, for some period of time; (2) to create a simulation or microcosm that includes the most important factors; (3) to create a realistic psychological situation that is intrinsically compelling; or (4) to test a theory that has obvious larger importance.

Laboratory experiments are also at risk for various biases related to the settings in which they occur. For example, they may be set up in such a narrow, constraining way that the participants have no choice but to respond as the experimenters expect (Orne, 1962). Crafting more subtle manipulations and providing true choice in response options can sometimes be used to limit the potential biases in such cases. In addition, the experimenter may inadvertently bias presentation of the manipulations and measures, so that participants are equally inadvertently induced to confirm the hypotheses (Rosenthal, 1976). This problem can often be addressed using double-blind methods, in which neither the experimenters nor the participants know which treatment a participant has been assigned. Participants may also worry about whether their behavior is socially acceptable (Marlow and Crowne, 1961) and fail to react spontaneously. Nonreactive, unobtrusive, disguised measurement can avert this problem. It is worth noting that not all of these issues are unique to the laboratory. Many of the potential biases and artifacts of laboratory experiments also occur at least as often in other kinds of experiments (e.g., field experiments, which we turn to next), as well as with nonexperimental methods (natural experiments and observational studies, such as surveys).

Translating Experimental Effects

Laboratory experiments are useful for measuring psychological mechanisms that lead to discriminatory behavior (e.g., implicit or explicit stereotypes), but they do not describe the frequency of occurrence of such behavior in the world. They cannot, by their nature, say how often or how much a particular phenomenon occurs, such as what proportion of a racial disparity is a function of discriminatory behavior. Thus, they can be legitimately criticized on the grounds of low external validity—that is, limited generalizability to other samples, other settings, and other measures. Laboratory experimenters can sometimes make a plausible case for generalizability by varying plausible factors that might limit the applicability of the experiment. For example, if there are theoretically or practically compelling reasons for suspecting that an effect is limited to college sophomores, one might also replicate the study with business executives on campus for a seminar or retirees passing through for an Airstream conference. But laboratory experiments rarely sample participants randomly from the population of interest. Thus by themselves they cannot address external validity, and it is an empirical question whether or how well their findings translate into discrimination occurring in the larger population. In well-designed and well-executed experiments, the effects of confounding variables are randomized, allowing researchers to dismiss competing explanations as unlikely, but they are not entirely eliminated. For this reason, replication is important. In the study of discrimination, many laboratory results do not generalize to field settings; effects may diminish or fail to hold up over time. However, many other effects tested both in the laboratory and in the field have been consistent, some showing even stronger effects in the field (Brewer and Brown, 1998; Crosby et al., 1980; Johnson and Stafford, 1998).

FIELD EXPERIMENTS

Field experiments have many of the standard features commonly found in laboratory experiments. The term field experiment refers to any fully randomized research design in which people or other observational units found in a natural setting are assigned to treatment and control conditions. The typical field experiment uses a two-group, post-test-only control group design (Campbell and Stanley, 1963). In such a design, people are randomly assigned to treatment and control groups. An experimental manipulation is administered to the treatment group, and an outcome measure is obtained for both treatment and control groups. Because of random assignment, differences between the two groups provide some evidence of an effect of the manipulation. However, because no preexperiment measure for the outcome is obtained (which is an option in laboratory experiments), one cannot be altogether sure whether the groups are similar prior to the experiment. Nonetheless, randomization protects against this problem because it ensures that, on average, the two groups are similar except for the treatment.
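The following minimal simulation sketch illustrates the logic of this two-group, post-test-only design; the sample size, effect size, and variable names are illustrative assumptions of ours, not values from any actual study.

```python
# Sketch of a two-group, post-test-only randomized design:
# units are randomly assigned, one outcome is measured afterward,
# and the treatment effect is estimated by the difference in group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Random assignment: each unit has a 50/50 chance of receiving the treatment.
treated = rng.integers(0, 2, size=n).astype(bool)

# Hypothetical post-test outcome: noise plus a true effect of 0.5 for
# treated units (unknown to the analyst).
outcome = rng.normal(0, 1, size=n) + 0.5 * treated

# Difference in means and a two-sample t-test, the standard analysis
# when no pre-test measure is available.
effect = outcome[treated].mean() - outcome[~treated].mean()
t, p = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"estimated effect = {effect:.2f}, p = {p:.3f}")
```

Because assignment is random, the expected difference in means equals the treatment effect even without a pre-test, which is the point the design relies on.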

Field experiments are attractive and often persuasive because, when done well, they can eliminate many of the obstacles to valid statistical inference. They can measure the impact of differential treatment more cleanly than nonexperimental approaches, yet they have the advantage of occurring in a realistic setting and hence are more directly generalizable than laboratory experiments. Furthermore, for measuring discrimination, they appear to reflect the broader public vision of what discrimination means—the treatment of two (nearly) identical people differently.

The social scientific knowledge necessary to design effective field experiments is stronger in some areas than in others. For example, our knowledge of the mechanisms and incentives underlying real estate markets is arguably more advanced than our knowledge of the incentives underlying labor markets (Yinger, 1995). Hence, our ability to use field experiments is correspondingly stronger for measuring behavior in housing markets than in other areas. We therefore focus our discussion below on a common methodology—audit or paired testing—used particularly to assess discrimination in housing markets as well as in other areas. With the exception of a study we describe later (in Box 6-5 ), we do not review other types of field experiments in the domain of racial discrimination.

Audit or Paired-Testing Methodology

Audit or paired-testing methodology is commonly used to measure the level or frequency of discrimination in particular markets, usually in the labor market or in housing (Ross, 2002; for a summary of paired-testing studies in the labor and housing markets, see Bendick et al., 1994; Fix et al., 1993; Neumark, 1996; Riach and Rich, 2002). Auditors or testers are randomly assigned to pairs (one of each race) and matched on equivalent characteristics (e.g., socioeconomic status), credentials (e.g., education), tastes, and market needs. Members of each pair are typically trained to act in a similar fashion and are equipped with identical supporting documents. To avoid research subjects becoming suspicious when they confront duplicate sets of supporting documents, researchers sometimes vary the documents while keeping them similar enough that the two testers have equivalent levels of support.

As part of the study, testers are sent sequentially to a series of relevant locations to obtain goods or services or to apply for employment, housing, or college admission (Dion, 2001; Esmail and Everington, 1993; Fix et al., 1993; National Research Council, 1989; Schuman et al., 1983; Turner et al., 1991a, 1991b; Yinger, 1995). The order of arrival at the location is randomly assigned. For example, in a study of hiring, testers have identical résumés and apply for jobs, whereas in a study of rental housing, they have identical rental histories and apply for housing. Once the study has been completed, researchers use the differences in treatment experienced by the testers as an estimate of discrimination.

To the extent that testers are matched on a relevant set of nonracial characteristics, systematic differences by the race of the testers can be used to measure discrimination on the basis of race. Propensity score matching is sometimes used when there are too many relevant characteristics to match on every one. In propensity score matching, an index of similarity is created by fitting a logistic regression with the outcome variable being race and the explanatory variables being the relevant characteristics on which one wishes to match. Subjects of one race are then paired or matched with subjects of the other race having similar fitted logit values—the propensity score index (see Rosenbaum, 2002, and the references therein for a more complete description).
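The sketch below shows one way the matching step just described can be implemented; the data frame, column names ("race", "income", "education", "age"), and the greedy nearest-neighbor rule are illustrative assumptions of ours, not the procedure used in any particular audit study.

```python
# Minimal sketch of propensity score matching: fit a logistic regression of
# group membership on the matching characteristics, then pair subjects with
# similar fitted logits (the propensity score index).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_match(df, group_col, reference_group, covariates):
    """Greedy nearest-neighbor matching on the fitted logit."""
    X = df[covariates].to_numpy()
    y = (df[group_col] == reference_group).astype(int).to_numpy()

    # Logistic regression of group membership on the characteristics to match on.
    logit = LogisticRegression(max_iter=1000).fit(X, y).decision_function(X)

    ref_idx = np.flatnonzero(y == 1)
    other_idx = np.flatnonzero(y == 0)

    pairs, available = [], set(other_idx)
    for i in ref_idx:
        if not available:
            break
        # Match to the closest not-yet-used subject on the logit scale.
        j = min(available, key=lambda k: abs(logit[i] - logit[k]))
        pairs.append((i, j))
        available.remove(j)
    return pairs

# Hypothetical usage:
# pairs = propensity_match(testers, "race", "black", ["income", "education", "age"])
```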

Paired-testing studies use an experimental design in natural settings to obtain information on apparently real outcomes and to assess the occurrence and prevalence of discrimination. An advantage to using paired tests is that individuals are matched on observed characteristics relevant to a particular market. Effective matching decreases the likelihood that differences are due to chance rather than discrimination because many factors are controlled for.

Paired testing is used in audit studies, such as the U.S. Department of Housing and Urban Development’s (HUD’s) national study of housing discrimination, to estimate overall levels of discrimination against racial and ethnic minorities. Audit studies can be highly effective enforcement tools for assessing treatment or detecting unfavorable treatment of members of disadvantaged groups (see Ross and Yinger, 2002). 4 Studies in the housing market (e.g., Wienk et al., 1979; Yinger, 1995) and in the labor market (e.g., Bendick et al., 1994; Cross et al., 1990; Neumark, 1996; Turner et al., 1991b) using the paired-testing methodology provide evidence of discrimination against racial minorities (see National Research Council, 2002b; Ross and Yinger, 2002). In the case of housing, these studies might involve selecting a random sample of newspaper advertisements and then investigating the behavior of real estate agencies associated with these advertisements (Ross and Yinger, 2002). Employment audits are similarly based on a random sample of advertised jobs. While providing the generality valued by researchers, these studies also make it possible to observe the behavior of individual agencies or firms. This approach has been applied to other areas as well (see the examples in the next section).

Much of the use of audit or paired-testing methodology to study discrimination flows primarily from federal investigations concerning housing discrimination. National results of the 2000 Housing Discrimination Study (2000 HDS), conducted by the Urban Institute for HUD, show that housing discrimination persists, although its incidence has declined since 1989 for African Americans and Hispanics. Non-Hispanic whites are consistently favored over African Americans and Hispanics in metropolitan rental and sales markets (Turner et al., 2002b); similarly, Asians and Pacific Islanders in metropolitan areas nationwide (particularly homebuyers) face significant levels of discrimination (Turner et al., 2003; see Box 6-4 for a brief history of housing audits).

In another example, Yinger (1986) studied the Boston housing rental and sales markets in 1981. In the rental market, whites discussed 17 percent more units with a rental agent and were invited to inspect 57 percent more units than blacks. In the sales market, whites discussed 35 percent more houses and were invited to inspect 34 percent more houses; moreover, the difference in treatment was larger for low-income families and families with children. Yinger also found substantial variation in treatment across neighborhoods. Taken together, these results document significant discrimination in the housing market.

As reported by Ross and Yinger (2002) and by Riach and Rich (2002), although the typical audit study concerns housing (e.g., Donnerstein et al., 1975; Schafer, 1979; Wienk et al., 1979; Yinger, 1986), researchers have used variants of the design described above to examine discrimination in other areas. Areas studied include the labor market (Turner et al., 1991b), entry-level hiring (Cross et al., 1990), automobile purchases (e.g., Ayres and Siegelman, 1995), helping behaviors (Benson et al., 1976), small favors (Gaertner and Bickman, 1971), being reported for shoplifting (Dertke et al., 1974), obtaining a taxicab (Ridley et al., 1989), preapplication behavior by lenders (Smith and Delair, 1999; Turner et al., 2002a), and home insurance (Squires and Velez, 1988; Wissoker et al., 1997).

In an example involving automobile purchases, Ayres and Siegelman (1995) sent 38 testers (19 pairs) to 153 randomly selected Chicago-area new-car dealers to bargain over nine car models. Testers bargained for the same model (a model of their mutual choice) at the same location within a few days of each other. In contrast with the common paired-testing design, pair membership was not limited to a single pair; instead, testers were assigned to multiple pairs. Also, testers did not know that the study was intended to investigate discrimination or that another tester would be sent to the same dealership. Testers were randomly allocated to dealerships, and the order of their visits was also randomly assigned. The testers were trained to follow a bargaining script in which they informed the dealer early on that they would not need financing. They followed two different bargaining strategies: one that depended on the behavior of the seller and another that was independent of seller behavior.

Ayres and Siegelman found that initial offers to white males were approximately $1,000 over dealer cost, whereas initial offers to black males were approximately $1,935 over dealer cost. White and black females received initial offers that were $1,110 and $1,320 above dealer cost, respectively. Final offers were lower, as expected, but the gaps remained largely unchanged. Compared with white males, black males were asked to pay $1,100 more to purchase a car, black females were asked to pay $410 more, and white females were asked to pay $92 more. These examples of evidence gleaned on market discrimination show the value of paired-testing methods for studying discrimination.

In Box 6-5 , we provide an example of a field experiment on job hiring (Bertrand and Mullainathan, 2002) that emulates some of the best features of laboratory and audit studies. This study uses a large sample and avoids many of the problems of audit studies (e.g., auditor heterogeneity) by randomly assigning race to different résumés. It is a particularly good example of the possibilities of field study methodology to investigate racial discrimination.

Limitations of Audit Studies

Ross and Yinger (2002) discuss two main issues raised by researchers concerning the use of paired-testing methodology. They are (1) the accuracy of audit evidence and (2) its validity, particularly with respect to the target population. It is also worth noting that such studies typically require extensive effort to prepare and implement and can be very expensive.

The Accuracy Issue

Many claim that the designs of audit studies are not true between-subjects experiments because research subjects (e.g., employer or housing agent) are not assigned to treatment or control groups but are exposed to both treatment and control (see Chapter 7 for a discussion of issues in repeated-measures designs). Also, although the order of exposure for each subject is randomized so that it should balance out, the time lapse between exposures makes it possible for the difference to be unrelated to the concept of focus (i.e., discrimination). In the time between two visits to an establishment, for example, someone else other than a tester may take the job or apartment of interest.

In the housing market, newspaper advertisements are used as a sampling frame (National Research Council, 2002b), but they may not accurately represent the sample of houses that are available or affordable to members of disadvantaged racial groups. Newspaper advertisements can be limiting because the sampling frame is restricted to members of disadvantaged racial groups who respond to typical advertisements and are qualified for the advertised housing unit or job. This limited sample may lead to a very specific interpretation of discrimination. For example, members of the sample may not be aware of alternative search strategies or know of other available housing units or jobs of interest. The practical difficulties associated with any sampling frame other than newspaper advertisements (and the associated steps of training auditors and assigning characteristics to them) are difficult to overcome.

The Validity Issue

Inferential target: estimating an effect of discrimination. Researchers have also debated the validity of audit studies (see the discussion in Ross and Yinger, 2002). Heckman and colleagues criticize the calculation of measures of discrimination (Heckman, 1998; Heckman and Siegelman, 1993). They argue that an estimate of discrimination at a randomly selected firm (or in an advertisement) does not measure the impact of discrimination in a market. Rather, discrimination should be measured by looking at (1) the average difference in the treatment of disadvantaged racial groups and whites or (2) the actual experience of the average member of a disadvantaged racial group, as opposed to examining the average experience of members of disadvantaged racial groups in a random sample of firms (i.e., the focus should be on the average across the population of applicants rather than the population of firms). Both of these proposed approaches to measuring discrimination are valid, but each has limitations.

Researchers typically determine the incidence of discrimination by measuring (1) the proportion of cases in which a white tester reports more favorable treatment than a nonwhite tester reports (gross adverse treatment) or (2) the difference between the proportion of cases in which a white tester reports favorable treatment and the proportion of cases in which a nonwhite tester reports favorable treatment (net adverse treatment) (for further discussion of these measures, see Fix et al., 1993; Heckman and Siegelman, 1993; Ondrich et al., 2000; Ross, 2002). Because statistical measures are “model-based” aggregates, net measures correctly measure the parameters in those models conditional on important stratifying variables. The gross measure may provide useful supplemental information to the net measure if the balancing disparities are large.
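To make the two measures concrete, the sketch below computes them from paired-test records; the eight audits shown are hypothetical toy data, not results from any study.

```python
# Gross and net adverse-treatment measures from paired-test records.
# Each row is one audit; a value of 1 means that tester received favorable
# treatment (e.g., was shown the unit or called back).
import pandas as pd

audits = pd.DataFrame({
    "white_favorable":    [1, 1, 0, 1, 0, 1, 1, 0],
    "nonwhite_favorable": [0, 1, 0, 0, 1, 1, 0, 0],
})

white_only = (audits["white_favorable"] == 1) & (audits["nonwhite_favorable"] == 0)
nonwhite_only = (audits["nonwhite_favorable"] == 1) & (audits["white_favorable"] == 0)

# Gross adverse treatment: share of audits in which only the white tester was favored.
gross = white_only.mean()

# Net adverse treatment: that share minus the share in which only the
# nonwhite tester was favored.
net = white_only.mean() - nonwhite_only.mean()

print(f"gross adverse treatment = {gross:.2f}")  # 0.38 in this toy example
print(f"net adverse treatment   = {net:.2f}")    # 0.25 in this toy example
```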

Ross and Yinger (2002) note that it would be valuable to know the true experiences of members of disadvantaged racial groups on average, but such information could not reveal the extent to which these individuals change their behavior to avoid experiencing discrimination. As a result, discrimination encountered by averaging over members of a disadvantaged racial group is not a complete measure of the impact of racial discrimination (Holzer and Ludwig, 2003). It is valuable to determine how much discrimination exists before such behavioral responses take place—which is the amount estimated using paired testing—and whether discrimination arises under certain circumstances.

The key observation of Murphy (2002) relates to the inferential target: Are we interested in estimating an overall or a market-level discrimination effect? Several distinct effects might be estimated, and they need to be distinguished because the estimates that result will not necessarily be identical. What is the appropriate population of real estate agents or ads from which to sample? Do we want to use only those agents that minorities actually visit? If past discrimination affects choice of agent, this population may vary from the population of agents selling houses that members of a nonwhite population could reasonably afford. Thus, the estimated effect of discrimination will be different under these alternative sampling strategies. Would it make sense to sample from agents or ads that could not reasonably be expected to be appropriate for most members of the nonwhite population? Murphy recommends ascertaining “discrimination in situations in which Blacks are qualified buyers” (2002:72).

Auditor heterogeneity. Heckman and colleagues (Heckman, 1998; Heckman and Siegelman, 1993) also argue that average differences in treatment by race may be driven by differences in the unobserved characteristics of testers (i.e., auditor heterogeneity) rather than by discrimination. 5 Such characteristics (e.g., accent, height, body language, or physical attractiveness) of one or the other member of the pair may have a significant impact on interpersonal interactions and judgments and thus lead to invalid results (Smith, 2002). The role of these characteristics cannot be eliminated because of the paucity of observations of the research subjects. Ross (2002) addresses the problem by suggesting that, instead of trying to match testers exactly (which is virtually impossible), one can train testers to ensure that their true characteristics, as opposed to their assigned characteristics, have little influence on their behavior during the test.

Murphy (2002) addresses most of the issues raised by Heckman (1998) and discussed above. She lays out a framework showing that “as long as audit pairs are matched on all qualifications that vary in distribution by race, audit results averaged over realtors, circumstances of the visits, and auditors can be viewed as an unbiased estimate of overall-level discrimination” (Murphy, 2002:69). Murphy formally delineates the circumstances under which an estimate of discrimination will be erroneous if the researcher fails to account for individual auditor characteristics that do not vary in distribution by race and therefore were not used in the matching process.

The problem is the effect of the heterogeneity among applicants and agents. The strategy of matching on all characteristics that vary in distribution by race—including observed, unobserved, and unobservable characteristics—substitutes for randomization. The problem, of course, is that we do not know whether we have in fact matched on all characteristics that vary by race. If all unmatched characteristics have the same distribution across racial groups, and if the auditors were selected to be representative of the distribution of these characteristics, we will have managed to balance the covariates across racial groups and can estimate an unbiased effect of race. But as Heckman and others note, there are a variety of reasons to believe that this goal of matching is elusive.
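In our own notation (a sketch of the logic, not Murphy's formal development), let \(Y(r, x, u)\) be the treatment a tester receives as a function of signaled race \(r\), the matched characteristics \(x\), and unmatched tester characteristics \(u\). Averaging audit outcomes over testers estimates

\[
\Delta \;=\; E\big[\,Y(w, x, U_w) - Y(b, x, U_b)\,\big]
\;=\; \underbrace{E\big[\,Y(w, x, U_w) - Y(b, x, U_w)\,\big]}_{\text{discrimination effect}}
\;+\; \underbrace{E\big[\,Y(b, x, U_w) - Y(b, x, U_b)\,\big]}_{\text{bias from unmatched characteristics}},
\]

so the estimate is unbiased when the unmatched characteristics \(U_w\) and \(U_b\) have the same distribution across the white and black testers; when those distributions differ, the second term does not vanish and the audit comparison mixes discrimination with auditor heterogeneity.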

Heckman and Siegelman (1993) make the point that the problem of auditor heterogeneity poses a challenge particularly for employment audits, as well as for studies of wage discrimination, because the determinants of productivity within a firm are not well understood and are difficult to measure. Ross and Yinger (2002:45) note: “Heckman and Siegelman argue that matching may ultimately exacerbate the biases caused by unobserved auditor characteristics because those characteristics are the only ones on which [testers] differ; however, the direction and magnitude of this type of bias [are] not known.” Heckman and his colleague further argue that the factors that employers use to differentiate applicants are not well known; thus, equating testers on those factors can be difficult, if not impossible. This lack of knowledge may make experimental designs particularly problematic for labor market behaviors. However, it does not affect designs in areas with a well-known or identifiable set of legitimate cues to which establishments or authorities may respond (e.g., the rental market).

There are several other problems associated with paired testing. First, paired testing cannot be used to measure discrimination at points beyond the entry level of the housing or labor market. Examples are job assignments, promotions, discharges, or terms of housing agreements and loans. Second, the assignments and training provided to testers may not correspond to qualifications and behaviors of members of racially disadvantaged groups during actual transactions. Third, actual home or job seekers do not randomly assign themselves to housing agents or employers but select them for various reasons. Finally, different employees in the same establishment may behave differently. If a rental office has more than one agent who shows apartments, different experiences of the members of the pair may be traceable to differences in the behavior of the agent with whom they dealt.

Addressing the Limitations of Audit Studies

Ross and Yinger (2002) offer several options for addressing the limitations of audit studies. Three of the approaches they identify to address the problem of accuracy are (1) broaden the sampling frame to encompass methods other than newspaper advertisements (e.g., searching neighborhoods for rental or help-wanted signs); (2) examine whether the characteristics of the specific goods or services involved (e.g., housing unit) instead of the characteristics of the testers affect the probability of discrimination (Yinger, 1995); and (3) use actual characteristics—as opposed to assigned characteristics—of testers and determine whether controlling for these characteristics influences estimates of discrimination.

To address validity concerns, Ross and Yinger (2002) suggest a strategy of sending multiple pairs to each establishment, which would allow researchers to obtain the data needed to reduce the effects of the idiosyncratic characteristics of single pairs of testers. Testers could then be debriefed after each experience to determine the agent with whom they had dealt. Doing so would not remove the potential effect of different agents on the results obtained, but it would allow researchers to assess that effect. Use of additional pairs of testers would also address issues regarding the calculation of outcome measures. Using multiple pairs might help in distinguishing systematic from random behaviors of an establishment and should, at the very least, tighten the bounds one might calculate on the basis of different mathematical formulas. Of course, care would need to be taken to avoid sending so many pairs of confederates that the research would become obvious.

Another approach to addressing the limitations of omitted variables is to collect extensive information on the actual characteristics of testers, as opposed to assigning their characteristics, and to determine whether controlling for these characteristics influences estimates of discrimination. HUD’s national audit study of housing discrimination, conducted in 2000, explicitly collected information on many actual characteristics of testers, such as their income (as opposed to the income assigned to them for the study), their education, and their experience in conducting tests. 6

SUMMARY AND RECOMMENDATIONS

True experiments involve manipulation of the variable hypothesized to be causal, random assignment of participants to the experimental condition, and control of confounding variables. Experimental methods potentially provide the best solution to addressing causal inference (e.g., assigning disparate racial outcomes to discrimination per se) because well-designed and well-executed experiments have high levels of internal validity. In the language of contemporary statistics, experiments come closest to addressing the counterfactual question of how a person would have been treated but for his or her race, although they do not do so in a form that is easily translatable into direct measurement of the discriminatory effect.

The experimental method faces challenges when applied to race, which cannot be randomly assigned to an actual person. Experimental researchers frequently manipulate racial cues (e.g., racial designations or photographs on a résumé) or train black and white confederates to respond in standard ways. In both approaches, an attempt is made to manipulate apparent race, while holding all other variables constant, and to elicit a response from the participants. Although the experimental method has uncovered many subtle yet powerful psychological mechanisms, a laboratory experiment does not address the generalizability or external validity of its effects. Therefore, it is unable to estimate what proportion of observed disparities is actually a function of discrimination.

Over the past two decades, laboratory experiments have focused more on measuring subtle forms of bias and nonverbal forms of discriminatory behavior and less on examining overt behaviors, such as assisting others. If laboratory studies were to be more focused on real-world-type behaviors, they could help analysts who use statistical models for developing causal inferences from observational data (see Chapter 7 ). Thus, the results of real-world-oriented laboratory studies could provide more fully fleshed-out theories of discriminatory mechanisms to guide the modeling work. In turn, real-world studies based on laboratory-developed theories could be usefully conducted to try to replicate, and thereby validate, laboratory results.

Because laboratory experiments have limited external validity, researchers turn to field experiments, which emphasize real-world generalizability but inevitably sacrifice some methodological precision. Field audit studies randomly assign experimental and control treatments (e.g., black and white apartment hunters) to units (e.g., a rental agency) and measure outcomes (e.g., number of apartments shown). Aggregated over many encounters and units of analysis, audit studies come closer than laboratory experiments to assessing levels of discrimination in a particular market. Both the accuracy and the validity of audit studies on discrimination have been questioned, however. Advocates of paired-testing and survey experiments have responded that all these limitations can be remedied.

Although generally limited to particular aspects of housing and labor markets (e.g., showing of apartments or houses and callbacks to job applicants), audit studies to measure racial discrimination in housing and employment have demonstrated useful results. It is likely that audit studies of racial discrimination in other domains (e.g., schooling and health care) could produce useful results as well, even though their use will undoubtedly present methodological challenges specific to each domain.

Recommendation 6.1. To enhance the contribution of laboratory experiments to measuring racial discrimination, public and private funding agencies and researchers should give priority to the following:

Laboratory experiments that examine not only racially discriminatory attitudes but also discriminatory behavior. The results of such experiments could provide the theoretical basis for more accurate and complete statistical models of racial discrimination fit to observational data.

Studies designed to test whether the results of laboratory experiments can be replicated in real-world settings with real-world data. Such studies can help establish the general applicability of laboratory findings.

Recommendation 6.2. Nationwide field audit studies of racially based housing discrimination, such as those implemented by the U.S. Department of Housing and Urban Development in 1977, 1989, and 2000, provide valuable data and should be continued.

Recommendation 6.3. Because properly designed and executed field audit studies can provide an important and useful means of measuring discrimination in various domains, public and private funding agencies should explore appropriately designed experiments for this purpose.



Logo of plosone

Measuring gender attitudes: Developing and testing Implicit Association Tests for adolescents in India

1 Indian Institute of Management Ahmedabad, Ahmedabad, India

2 University of Oxford, Oxford, United Kingdom

Vrinda Kapoor

3 Cornell University, Ithaca, New York, United States of America

Vrinda Kapur

4 Porticus, New Delhi, India

5 University of California San Diego, San Diego, California, United States of America

Associated Data

Data and files required for replication are posted in a public repository: https://bit.ly/3vKZJnB .

We develop and test gender attitude measures conducted with a school-based sample of adolescents aged 14–17 years in India. We test a measure with survey items and vignettes to capture gender-based value and stereotypes, an Implicit Association Test (IAT) capturing gender-based value, and an IAT capturing gender stereotype. All demonstrate good internal reliability, and both IATs are significantly associated with our survey measure suggesting criterion validity, though not confirming it due to the lack of a gold standard measure on gender attitudes. Finally, construct validity is indicated from the measures’ positive significant associations with higher girls’ mobility and education. The gender-related IAT tools developed are consistent and valid, and modestly correlated with gender-related behavior outcomes such as mobility and school enrolment.

1. Introduction

A growing body of research from different settings demonstrates that gender norms and roles—the range of socially constructed behaviors and attitudes expected and even required for people based on their sex—harm the health and well-being of females, males, and those outside of the gender binary [ 1 ]. Among women, for example, these norms can restrict employment opportunity and disproportionately burden them with household responsibilities, affecting reproductive autonomy and health care access. Norms may encourage men to behave harmfully through greater substance use or unsafe occupations (e.g., workers in the mining industry in developing countries often fall under the informal sector where labor and safety laws are either non-existent or lax) that increase their risk for pre-mature mortality. Research demonstrates that constraining people within public and private spheres of life due to norms and roles not only harms individuals, but also compromises economic development, political governance, and climate action at a national scale [ 2 – 4 ]. A growing literature studies attitudes and gender norm change in developing countries. Bandiera et al. (2020) study the effect of livelihood training on female empowerment in Uganda, and Ashraf et al. (2020) study negotiation skills training for girls in Zambia [ 5 , 6 ]. Consequently, there is increasing interest in altering gender norms to improve health and society, with a focus on youth as a population amenable to change [ 7 ]. Further, while gender socialization (environmental training on these norms) starts at birth, early adolescence is when gender attitudes take root and is thus important in understanding and affecting gender attitudes [ 7 ]. Our paper focusses on measuring these gender attitudes, gender-based behavior and norm change amongst adolescents, which could help policymakers devise better gender equal policies and help erode centuries-old regressive gender and cultural norms.

Research on gender attitudes among adolescents in low- and middle-income countries (LMICs) is nascent, despite calls for more focus on these issues in contexts such as South Asia [ 8 ]. Qualitative and quantitative research on early adolescents in India highlights that gender socialization, largely transmitted to youth by parents, restricts mobility, dress, and interaction with the opposite sex more for girls than for boys, with greater negative consequences for girls who do not adhere to gender norms [ 9 , 10 ]. Basu et al. (2017) also found fewer expectations and less support related to education and employment for girls relative to boys, even at this early age [ 9 ]. These findings highlight differences in the perceived acceptability of certain behaviors that over-restrict girls and may foster unhealthy behaviors (e.g., substance use, physical risk taking) among boys as a means of showing masculinity. Evidence from the region suggests that school-based interventions focused on changing gender norms can have positive impacts on norms and school attendance [ 11 – 13 ].

While these findings demonstrate the importance of measuring and understanding these issues, quantitative work is currently limited by the lack of standard measures of these attitudes and of contextual adaptations for India. Prior reviews of the literature found that most development of gender measures comes from North America and Western Europe and focuses on adults and older adolescents rather than early adolescents [ 7 , 14 ]. The only attitudinal survey measure developed for use with early adolescents in LMICs, including India, was limited to attitudes toward romantic engagement, inclusive of sexual expectations and the double standard (acceptability and even social value for males dating, whereas female dating is viewed unfavorably [ 15 ]). Measures of attitudes toward gender stereotypes beyond dating exist for adults, but they are largely self-reported survey items [ 16 – 19 ] and thus more vulnerable to social desirability bias and falsification [ 20 , 21 ]. These findings point to the need for tools to measure more general gender attitudes among early adolescents residing in LMICs.

Novel behavioral science methodologies able to capture gender attitudes while being less vulnerable to bias include vignettes and Implicit Association Tests (IATs). Vignettes are short texts or stories to which participants respond with their thoughts on a topic, or with what they think others might think or do in the circumstance described [ 22 ]. Researchers use this approach to measure attitudes about the acceptability of behaviors as well as perceptions of potential social sanctioning of those behaviors, based on the sex of the individual in the story and the sex of the respondent [ 23 – 25 ]. Asking about what others believe or what is likely to happen in the story, rather than about the respondent’s own views, offers a more indirect approach that is less vulnerable to social desirability bias. The IAT is a computer-based test in which participants must rapidly categorize two target concepts (e.g., “male” and “female”) with a characteristic (e.g., “teacher” or “construction worker”) [ 26 ]. The IAT captures both the association of a given concept with the characteristic and the time the participant requires to make that connection, allowing assessment of the stated association and of the ease or discomfort with which it is made [ 26 , 27 ]. By capturing response times, the IAT can account for discomfort and possibly social desirability in responses [ 26 , 27 ]. Researchers have only recently begun using vignettes to assess gender attitudes among early adolescents in LMICs, with some success, at least for gendered attitudes toward romantic partnering [ 28 ]. Researchers have used IATs in related domains [ 29 – 31 ], including measuring gender attitudes related to women in politics in India [ 32 ], but no published studies have used them to assess gender attitudes among early adolescents, in India or elsewhere.

This study involves psychometric testing of gender attitude measures for early adolescents in India for gender-based value (preference of boys/men over girls/women indicated by positive/negative attributes, prioritization for opportunities and resources) and gender stereotypes (attitudinal beliefs regarding females relative to males on types of employment and domestic responsibilities). We include three new measures for consideration: 1) an IAT capturing gender-based value—a “taste-based” IAT, 2) an IAT capturing gender stereotypes, and 3) a survey measure inclusive of both survey items and vignettes to capture gender-based value and stereotypes. We developed these measures for inclusion at follow-up for an evaluation of an intervention designed to support more equitable gender attitudes among middle school students in India. Findings from this work offer novel gender attitude measures that can be applied for monitoring and evaluation of these attitudes in early adolescents in India, a nation with over 250 million adolescents, as well as for adaptation and use in other LMIC settings.

2. Study design

We analyzed data from 6458 adolescents aged 14 to 17 years who participated in the three-year follow-up survey conducted for a two-arm cluster randomized trial evaluating a year-long gender attitude change program for middle school students in four districts of Haryana, India (Sonipat, Panipat, Rohtak and Jhajjar). Government schools within these districts (N = 314 schools) were randomly assigned, stratified by district, co-ed status of the school, school size, and distance to the district headquarters, to the intervention condition (i.e., the gender attitude-change intervention; N = 150 schools) or the control condition (i.e., no intervention; N = 164 schools). This sample size was selected to be able to detect statistically significant medium and long-term effects of the program on gender attitudes, behaviors, aspirations, as well as educational and fertility outcomes (Power calculations are available on request.). We recruited and consented approximately 46 randomly selected 6th and 7th graders per school to participate in the evaluation study and followed them again at a three-year follow-up, as 9th and 10th graders in 2016–17. Among these, a subset of 8333 students, i.e., approximately 26 students per school, were randomly selected to respond to the IAT measure at baseline (The research team used the software Inquisit by Millisecond Software to code, administer and collect data from the IATs. The team used the statistical software Stata to analyze the data.). Out of these, 6458 students were retained at follow-up when complete measures of gender attitudes (gender-based values and stereotypes) were administered. We found no significant differences between those retained and lost at follow-up on key demographics including sex (S2 Table in S1 Appendix ).

We recruited, consented and conducted behavioral assessments with randomly selected students within each school during the fall of the 2013–14 academic year; follow-up data were collected in 2016–2017. We analyzed follow-up data for the current study, as the full battery of gender attitude measures was only added at follow-up.

At study entry, we obtained informed consent separately from both parents and students prior to surveying students. Male and female research staff members (i.e., enumerators, supervisors and monitors) were hired specifically for this study and trained in both data collection and gender equity over a 10-day training period. Sex-matched enumerators approached selected participants who had provided verbal or written personal assent and written parental consent (collected prior to data collection) and escorted them to a more private setting outside the classroom for assessment. We replicated this same procedure at follow-up.

Follow-up behavioral assessments, which included the survey and vignette measure as well as the IAT, took approximately 60 minutes for most participants and occurred during or after the school day. Prior to conducting the IATs, students completed a practice round focused on matching flowers and insects rather than gender roles; we used this approach to build students’ ability to self-administer the IATs, which the test requires. After the full behavioral assessment, participants were thanked for their time. We provided no incentive payment for participation.

All research protocols and survey instruments received ethics approval from both the Government of Haryana, which formally approved the questionnaires and study protocols and granted us permission to conduct the surveys in schools, as well as the Institutional Review Boards of Northwestern University in the United States and the Institute for Financial Management and Research in India.

Measures development

In the process of developing the evaluation study in 2013, the research team recognized a lack of measures on gender attitudes and stereotypes developed or validated with adolescents in India and other low and middle-income countries. Consequently, the team made efforts to develop new measures for Indian adolescents based on theory, expert input, prior research and testing on gender attitudes and stereotypes measures, as well as formative research with youth, approaches recommended as standard for measures development [ 11 , 33 ].

As noted above, IAT development is considerably more complex than developing measures that can be administered via survey items or vignettes (discussed below). To determine which images, language, and characteristics/attributes (‘good’, ‘bad’, stereotypically male and female jobs) should be used in IAT development for Indian adolescents, we obtained input from school teachers, students, and local and national experts on gender attitudes and adolescent development. The images and language were then constructed and reviewed with these groups again for finalization. Cognitive interviews followed, with students describing their understanding of the images, vocabulary and tasks. These efforts were critical in selecting words and images that respondents could easily understand given their age and environment. For example, the images chosen were of boys and girls who looked like they belonged to a similar age and background as the respondents. The first IAT measure (IAT1) focused on gender-based value and was developed at baseline. The second IAT measure (IAT2) focused on gender stereotypes and was implemented only at follow-up. Both IATs were developed in Hindi for administration to the students and later translated to English for purposes of dissemination. Due to the length of each IAT, we could only administer one IAT per subject, so we randomly assigned each student to one IAT, stratified by school, grade, and gender, resulting in 3,078 students receiving IAT1 and 3,380 students receiving IAT2 at follow-up (the randomization was done prior to rolling out the survey, stratified on school-grade-gender, so completion rates differ).
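To illustrate the assignment step, here is a minimal Python sketch of stratified random assignment of students to one of the two IATs within school-grade-gender cells. The column names, and the use of Python rather than the Stata workflow the team describes, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2016)

def assign_iat(students: pd.DataFrame) -> pd.DataFrame:
    """Randomly split students into IAT1/IAT2 within each
    school-grade-gender stratum (column names are hypothetical)."""
    students = students.copy()
    students["iat_version"] = "IAT2"
    for _, idx in students.groupby(["school_id", "grade", "gender"]).groups.items():
        idx = rng.permutation(np.asarray(idx))
        # first half of the shuffled stratum gets IAT1, the rest keep IAT2
        students.loc[idx[: len(idx) // 2], "iat_version"] = "IAT1"
    return students

# Toy example
roster = pd.DataFrame({
    "student_id": range(8),
    "school_id":  [1, 1, 1, 1, 2, 2, 2, 2],
    "grade":      [9] * 8,
    "gender":     ["F", "F", "M", "M", "F", "F", "M", "M"],
})
print(assign_iat(roster)[["student_id", "iat_version"]])
```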

The survey items and vignette measures were developed in parallel with the IATs. Feedback from social science and adolescent development experts helped establish face validity for the concepts being measured, both gender attitudes and stereotypes reflective of the cultural context. The nine-item measure was tested with sixth and seventh graders via cognitive interviews, in which students received the assessment and then explained back to the interviewer what it was asking of them. This approach ensured that the measure was comprehended correctly by the students. This nine-item measure was then used in the baseline survey and demonstrated low internal reliability (Cronbach alpha<0.60 [ 11 ]).

Based on this poor performance, the team repeated the same process to expand the items on gender attitudes and stereotypes, dropping items with very low variability (indicated by >90% of responses in a given cell). A new measure of 15 survey items was constructed and implemented at follow-up, expanding the items under gender equality in education and employment, gender roles, and female autonomy to improve the quality of the measure. Two vignettes were used to assess gender attitudes and stereotypes related to education and employment, respectively, with one item added for assessment using the education vignette and two items added for assessment using the employment vignette. Given resource and time constraints, it was not possible to administer both vignettes to all students; therefore, each vignette was randomly assigned such that a given student received one vignette or the other. This resulted in N = 3271 students receiving the education vignette and N = 3186 students receiving the employment vignette at follow-up. Current analyses are based on this expanded measure, in two versions: one with the education vignette items and the other with the employment vignette items.

Measurement

As noted above, participants received a behavioral assessment that included survey items on key demographics, gender attitudes and stereotypes survey items and vignettes, and an IAT.

Implicit Association Tests

Gender attitudes and stereotypes IATs

The IAT is a complex measure that requires participants to sort stimulus items, i.e., pictures or text, into two response options. For this study, we developed two unique but related IATs, as noted above: the Gender-Based Values IAT (IAT1) and the Gender Stereotypes IAT (IAT2). For IAT1 on gender-based value, participants sorted stimuli as good versus bad, indicating attitudes of gender bias in valuing women/girls relative to men/boys. For IAT2 on gender stereotypes, participants sorted stimuli depicting labor tasks as domestic or traditionally female work versus other work.

To ensure accurate data capture and reduce subject confusion or error, the IATs include practice tasks to familiarize respondents with the stimulus materials and sorting rules. Hence, the IAT is provided in seven blocks ( Table 1 ), with blocks after the practice block introducing greater complexity. In the initial block B1, participants practice basic keyboard use and sorting using a less complex concept, insects versus flowers. In block B2, participants simply sort stimuli as boys versus girls. In block B3, participants sort the response options as good words or bad words. In blocks B4 and B5, sorting becomes more complex: items representing boys and good (for example, boys’ faces and good words) receive one response, and items representing girls and bad (in this example, girls’ faces and bad words) receive the alternative response. In blocks B7 and B8, items representing girls and good are sorted with one response, and items representing boys and bad are sorted with the alternative response. During each block, if subjects assign a stimulus to the wrong group, a red “X” appears, and subjects must press the correct key to see the next stimulus. The key assumption here is that subjects who hold stronger associations of positive evaluation with boys than with girls will find the first sorting task much easier than the second; likewise, subjects holding stronger associations of positive evaluation with girls than with boys should find the second sorting task easier than the first. (To control for sequence effects from the order in which images and words are grouped, even-numbered subjects got girl/good on one side and boy/bad on the other side as the first sorting task (B1, B3 and B4), and the reverse in the second sorting task (B5, B6 and B7).)

Ease of sorting is indexed both by the speed of responses and by the frequency of errors, with faster responses and fewer errors indicating stronger associations. The time elapsed between each stimulus’ appearance and the pressing of a correct response key is recorded as response latency. In the IATs we administered (structure shown in Tables 2 and 3), the stimuli were randomly drawn with replacement, and in blocks B4, B6, B8 and B10, words and images appeared alternately. While latency was recorded for all trials (except for instruction pages), we followed the standard protocol for the IAT and only use latency from blocks B4, B6, B8 and B10 to measure implicit association. The other trials are included for familiarization and practice with the images and words that appear in our IAT and their correct response keys. IAT2 follows a similar structure ( Table 3 ), although the stimuli are changed to associations of images of men and women with gender-stereotyped jobs and roles. S1-S6 Figs in S2 Appendix show example screenshots of IAT1 and IAT2.

Notes: Blocks 5 and 9 are one-page instructions shown before the next blocks begin.

Following Lane et al. (2007), we deleted extreme outlier trials with response latencies greater than 10,000 milliseconds (msec) and deleted subjects for whom more than 10% of trials had latencies below 300 msec [ 34 ]. To measure the strength of implicit association, we calculated mean response latencies for the blocks girl+good, girl+bad, boy+good and boy+bad for IAT1, and for the blocks domestic task+women, domestic task+men, professional task+women and professional task+men for IAT2. As noted above, this was done only for blocks 4, 6, 8, and 10. We computed an “inclusive” standard deviation over all trials in the compatible and incompatible blocks. To reduce any influence of the order of pairing on response latencies, we counterbalanced the order of pairing by presenting the incompatible blocks before the compatible blocks for half of the students. We used the mean latency from each of the blocks (two compatible and two incompatible) to compute the mean differences (Incompatible 1 − Compatible 1) and (Incompatible 2 − Compatible 2). These were then divided by their associated “inclusive” standard deviations, and the D-measure was calculated as the equal-weight average of the two resulting ratios. A higher D-measure, within its specified range of −1 to 1, represents greater implicit preference for girls over boys. The D-measure for IAT1 indicates that adolescent girls have a higher implicit preference for girls than their male peers do. IAT2, in contrast, indicates a slight preference for girls among both boys and girls in the sample ( Table 4 ). We also calculated mean accuracy and latency for all four blocks in IAT1 and IAT2 ( Table 4 ). Note that despite trimming outliers, the mean latency is higher than in other studies, such as the average latency of 929 msec in Greenwald’s IAT using Bush/Al Gore images and pleasant/unpleasant words, which used an adult sample and did not trim outliers [ 26 ]. Whether response latencies can be improved further is uncertain given this sub-population of young adolescents in a low-technology, low-literacy context. In the future, other researchers working in similar contexts could consider additional practice tests to build respondents’ comfort with the laptop and the IAT.
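For concreteness, the following is a minimal Python sketch of this trimming and D-measure computation from trial-level data. The column names (subject, block_pair, condition, latency_ms) are hypothetical, and the authors’ Stata implementation may differ in detail.

```python
import numpy as np
import pandas as pd

def d_measure(trials: pd.DataFrame) -> pd.Series:
    """Return one D-measure per subject from trial-level latencies.

    Expected (hypothetical) columns: subject, block_pair (1 or 2),
    condition ('compatible' or 'incompatible'), latency_ms. Mirrors the
    steps described in the text: drop trials above 10,000 ms, drop
    subjects with >10% of trials under 300 ms, then average two
    standardized (incompatible minus compatible) mean differences,
    each scaled by an "inclusive" standard deviation.
    """
    # 1. Drop extreme outlier trials
    trials = trials[trials["latency_ms"] <= 10_000]

    # 2. Drop subjects with more than 10% of trials faster than 300 ms
    fast_share = trials.groupby("subject")["latency_ms"].apply(lambda x: (x < 300).mean())
    keep = fast_share[fast_share <= 0.10].index
    trials = trials[trials["subject"].isin(keep)]

    def score(subj: pd.DataFrame) -> float:
        parts = []
        for pair in (1, 2):
            block = subj[subj["block_pair"] == pair]
            inc = block.loc[block["condition"] == "incompatible", "latency_ms"]
            com = block.loc[block["condition"] == "compatible", "latency_ms"]
            inclusive_sd = block["latency_ms"].std(ddof=0)
            parts.append((inc.mean() - com.mean()) / inclusive_sd)
        return float(np.mean(parts))

    return trials.groupby("subject").apply(score)
```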

Compatible tests include the combination ‘boy+good’ and ‘girl+bad’ for the Good versus Bad IAT and ‘boy+professional tasks’ and ‘girl+domestic tasks’ for the Occupation IAT. Incompatible tests include the combination ‘boy+bad’ and ‘girl+good’ for the Good versus Bad IAT and ‘boy+domestic tasks’ and ‘girl+professional tasks’ for the Occupation IAT. The compatible and incompatible tests had two blocks each in the Implicit Association Test.

Gender-based value and stereotypes survey and vignette measure

The Gender-Based Value and Stereotypes Measure is an 18-item measure comprising 15 survey items and 3 additional items that used vignettes. The 15 survey items covered gender-based value and stereotypes in the following areas: gender bias against females/advantaging males in education (2 items), gender bias against females/advantaging males in employment (3 items), gender roles/expectations and female autonomy (9 items), and son preference as indicated by fertility preferences (1 item). The two vignettes included one focused on gender bias versus equality in education and a second focused on gender bias versus equality in employment. One item was used to assess gender bias using the education vignette and two items were used for the employment vignette. For 11 of the 15 survey items and both of the employment vignette items, participants were asked how much they agreed or disagreed with each item on a five-point Likert scale. Four items on gender-based value and stereotypes related to gender bias in employment were based on the differential between a given behavior or attribute ascribed to girls/women and the same behavior or attribute ascribed to boys/men. The item used for the education vignette directly assessed preference for a female or a male for an educational opportunity. Half the participants received the education vignette and corresponding item, and half received the employment vignette and corresponding items. Missing values were imputed using the district-gender-treatment average. All items were scored and summed to yield an overall measurement score per Anderson (2008) [ 35 ], with a higher score indicating more gender-equal attitudes and stereotypes. Details on items and scoring can be seen in S3 Appendix and have been described in Dhar et al. (2022) [ 11 ].
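A simplified sketch of this scoring, in the spirit of Anderson’s (2008) inverse-covariance-weighted index with district-gender-treatment mean imputation, is shown below. The column names are hypothetical, and the exact standardization and weighting used by the authors may differ.

```python
import numpy as np
import pandas as pd

def gender_attitude_index(df: pd.DataFrame, items: list) -> pd.Series:
    """Combine survey/vignette items into a single index.

    Steps (a simplified reading of the text): impute missing item
    responses with the district-gender-treatment cell mean, standardize
    each item, weight items by the row sums of the inverse covariance
    matrix (Anderson 2008 style), and standardize the resulting index.
    Column names (district, gender, treatment) are hypothetical.
    """
    data = df.copy()
    for item in items:
        data[item] = data[item].fillna(
            data.groupby(["district", "gender", "treatment"])[item].transform("mean")
        )

    # Standardize each item
    z = (data[items] - data[items].mean()) / data[items].std(ddof=0)

    # Inverse-covariance weights, then a weighted average of the items
    weights = np.linalg.inv(np.cov(z.T)).sum(axis=1)
    index = z.values @ weights / weights.sum()

    # Standardize the final index (higher = more gender-equal attitudes)
    index = (index - index.mean()) / index.std()
    return pd.Series(index, index=df.index, name="gender_attitude_index")
```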

3. Descriptive analysis—Gender attitudes

We report two sets of descriptive analyses on gender attitudes from the survey and vignette measure as well as the IATs. The first, represented by the D-measure, reports the results on implicit preference for girls for IAT1 and IAT2 ( Table 4 ). The main finding is that for IAT1, which captures gender-based value, boys display an implicit negative preference for girls, given that their D-measure is negative (-0.14). It is striking that boys disproportionately associate the opposite sex with negative attributes. The results are reversed for girls: girls are not gender neutral but disproportionately associate females with good attributes, with a positive D-measure of 0.23. For IAT2, which measures stereotypes of gender roles, both boys and girls surprisingly display a slight positive implicit preference for girls, indicating that IAT2 does not detect strong gender stereotypes related to household work and employment outside the house in this sample.

The second, the Gender Attitude Index from the Gender-Based Value and Stereotype Survey and Vignette Measure in Table 5 shows that girls tend to be more progressive than boys on their gender attitudes, with the overall score -0.07 for boys and 0.05 for girls. The difference is visible in certain survey items. For example, 68% of girls disagree that a man should have the final word about decisions in the home, as opposed to 46% for boys. Almost half of the girls disagree that boys should get more resources and opportunities for education than girls, whereas less than a third of boys disagree with that statement.

4. Empirical analysis—Psychometric properties

In this section, we check the robustness of the IAT measures and test their psychometric properties in comparison to the gender-based value and stereotypes survey and vignette-based measure.

Internal consistency

Internal consistencies are measured by how homogeneous the responses are to all items in a particular measure [ 36 ]. Current conventions suggest that inter-item consistencies of 0.80 (20% error) or higher are acceptable [ 34 ], although many widely used scales remain around 0.70 [ 37 ]. Previous studies on the IAT have found that implicit attitude measures generally have low inter-item consistency and typically do not fare as well as self-reported measures in this regard. Table 6 reports inter-item consistency for the IATs by calculating Cronbach’s alpha using the response latency for compatible and incompatible blocks [ 36 ]. We find that IAT1 and IAT2 have comparable consistencies, with Cronbach alphas of 0.72 for IAT1 and 0.61 for IAT2.

We compare the internal consistency of the IATs with that of the gender attitude index and find that IAT1 is as consistent as the gender attitude index (α = 0.72), but this is not true for IAT2.
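As a rough illustration, Cronbach’s alpha can be computed from per-subject block-level “items” (e.g., mean latencies for the two compatible and two incompatible blocks) as in the sketch below; the data layout is assumed for illustration.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: rows are subjects, columns are 'items'
    (here, e.g., per-block mean response latencies)."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical usage, with one column per critical block:
# alpha_iat1 = cronbach_alpha(block_means[["comp1", "comp2", "incomp1", "incomp2"]])
```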

Using guidance from applied psychology, validity testing is designed to assess whether the measure is in fact measuring the construct of focus by assessing its association with the same construct if a gold standard measure of that construct is available (criterion validity), a similar construct if no gold standard measure exists (criterion-related validity), or a variable or outcome one would theorize to be associated with the measure if it was actually measuring the construct of focus (construct validity) [ 38 ]. To assess criterion-related validity, we compared IAT scores to the gender attitude index, as both measures assess a similar construct and there is no gold standard measure of gender attitudes in the field. We use the following equation to estimate the correlation between IAT responses and the gender attitude index.

Y_ij = β0 + β1·X_ij + γ_gd + Φ_gc + ε_ij

where Y_ij is the outcome of interest (IAT score) for student i in school j, X_ij is the gender attitude index, and γ_gd and Φ_gc denote district-gender and gender-grade fixed effects, respectively. We also run a specification without controls.
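A minimal sketch of estimating this specification with school-clustered standard errors, using statsmodels in Python (variable names are illustrative; the authors report using Stata):

```python
import statsmodels.formula.api as smf

# df is assumed to be a student-level data frame containing the IAT
# D-score, the gender attitude index, stratum identifiers and a school
# id (all variable names are illustrative).
model = smf.ols(
    "iat_score ~ gender_attitude_index + C(district_gender) + C(gender_grade)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

print(model.params["gender_attitude_index"], model.bse["gender_attitude_index"])
```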

Table 7 indicates that IAT1 shows a modest and significant correlation of 0.440 (p<0.01) with the gender attitude index. However, the correlation weakens with the inclusion of gender-grade and district-gender fixed effects in Column 2 (0.089, p<0.05). Relative to IAT1, IAT2 shows a weaker correlation with the gender attitude index without fixed effects (0.118, p<0.05), but a stronger correlation when controlling for fixed effects (0.156, p<0.01). Overall, the results suggest that positive implicit preference captured in the IATs corresponds to more progressive responses in the gender index: explicitly progressive views are positively correlated with implicit preference for girls.

Notes: Basic controls include gender-grade and district-gender fixed effects, and standard errors are clustered at the school level. * p<0.10, ** p<0.05, *** p<0.01.

We also test the construct validity of the IAT by examining whether IAT correlates with girls’ individual behavioral outcomes (such as mobility) and educational achievement (attendance in school, grade completion), given that we would anticipate that girls with more traditional gender attitudes would report more restrictive or poorer behavioral outcomes. We estimate the correlation between IAT responses and self-reported measures of gender related behavior using the following model.

Y_ij = β0 + β1·X_ij + γ_gd + Φ_gc + ε_ij

where Y_ij is the outcome of interest (IAT score for IAT1 or IAT2) for student i in school j. X_ij is the measure of girls’ agency, such as mobility or school enrolment; among boys, X_ij is based on boys’ willingness to intervene when a girl at school is being teased, their contribution to household work, or their discouragement of sisters and girl cousins from meeting friends. As before, γ_gd and Φ_gc are district-gender and gender-grade fixed effects, respectively. We run specifications with and without controls.

Table 8 reports that both IAT scores are modestly and significantly correlated with girls’ mobility (0.128, p<0.01 for IAT1 in Column 1; 0.092, p<0.05 for IAT2 in Column 6) as well as with girls’ school enrolment (0.108, p<0.01 in Column 2 and 0.139, p<0.05 in Column 7). For boys, we find that IAT2, but not IAT1, is significantly correlated with their intervening when a girl is being teased in school (0.108, p<0.05 in Column 10). The IATs were not correlated with other behaviors for boys, such as household work or encouraging their sisters or cousins to meet friends. These results persist with the inclusion of controls (available on request). These correlations suggest that the IATs have some ability to predict self-reported gender-related behavior.

5. Discussion

Measuring gender attitudes well has been a continuing challenge, especially for adolescents in low-income contexts, underlining the need to reassess and improve measurement. Better measures are needed, and vignettes and IATs in particular offer a potentially valuable addition to the mix of existing instruments because they can circumvent problems such as false reporting and social desirability bias that hinder reliable measurement of gender attitudes.

This paper extends our understanding of effective approaches to measuring gender attitudes by assessing the psychometric properties of two novel IATs customized for adolescent boys and girls in India, as well as a survey assessment inclusive of vignettes. Including both allows the performance of the IATs to be benchmarked against a gender attitudes index comprised of responses to self-reported survey questions and vignettes, allowing a criterion-related validity assessment in the absence of a gold standard measure of gender attitudes in this population. To our knowledge, these are the first IAT scales to measure gender attitudes among adolescents in India. IAT1 focuses on general positive and negative attitudes associated with gender, and the second (IAT2) measures attitudes associated with gender-stereotypical employment. These IATs can serve as prototypes for those interested in measuring gender attitudes with adolescents in India or other similar contexts. Importantly, our measures all demonstrated good internal reliability and validity.

IATs can be a useful technique to measure gender attitudes and discrimination among different sub-populations, and the concepts/associations can be adapted based on the needs of the project. The IATs capture interesting results which offer insights into the implicit gender attitudes of youth in India. The gender-values focused IAT1 results signal that male students hold a negative implicit preference for girls, in contrast to their female peers who display a strong positive preference for girls. This pattern is similar to the gender attitudes index, comprised of items from survey questions and vignettes, which shows that female students have more progressive views than do their male counterparts. IAT2, however, shows that both male and female students do not hold strong implicit stereotypes on gender roles within and outside the household, diverging from their explicit views in self-reported questions and vignettes where there is stronger stereotyping.

Despite being administered in a low digital literacy context with a low-income population, we find that the gender IATs are consistent, valid, and moderately correlated with outcomes. Our results show that IATs such as IAT1 can perform as well as direct survey questions in terms of consistency. The IATs display a modest correlation with survey- and vignette-based measures of gender attitudes and also have some predictive power for individual gender-related behaviors and outcomes, such as girls’ mobility and school enrolment or boys’ intervention when a girl is being teased. These findings suggest that despite the cost and resource intensity of customized IATs, they may provide a useful addition and a less bias-prone alternative to survey-based measures relying on direct questions. While the current technology and time required to administer IATs make them difficult to incorporate into at-scale surveys, they expand the mix of novel tools for measuring certain biases and stereotypes.

The limitations of this study include the fact that we rely only on self-reported measures of the behavioral outcomes used to assess the validity of the IAT. Ideally, there would be a gold standard measure of gender attitudes for adolescents in India, but given that no such measure was available, we benchmark the IAT against a customized direct survey and vignette-based measure developed specifically for the study (i.e., testing criterion-related validity). Since the implementation of the study, other measures of gender attitudes and norms have been released for use with adolescents in India (for instance, Blum et al.'s (2019) Global Early Adolescent Study [ 28 ]), providing the opportunity for more robust validity testing of the IAT against these new measures. Further, the measures were only tested with school students in a single state in India, potentially limiting their generalizability to other contexts and sub-populations. Lastly, an important limitation is that we use data from a follow-up survey of an existing evaluation study, which had not been explicitly designed to test these measures. Measures such as the IAT and vignettes were randomized in their administration to individuals, so we did not leverage the entire sample of respondents; however, the samples were large enough and the imputation was done in a technically sound manner.

Our study introduces a new IAT-based instrument to measure adolescent gender attitudes in India and contributes to the set of existing tools for measuring and investigating gender attitudes in developing countries and among adolescents. The IAT tools and the design strategy can be used for further customization or for creating other similar IATs to systematize response-latency measures for gender attitude measurement. Further refinements should be considered, such as additional computer-based practice rounds for respondents to become more comfortable with the test, or a complete switch from words to images to simplify the test further for a lower-literacy population. We hope that this effort will stimulate further research to improve IAT testing and to develop other innovative approaches to measuring gender attitudes. Given that the tool is less prone to social desirability bias than traditional methods, we also hope that our design strategy can facilitate more developing-country research based on IATs in domains such as education, transition to work, domestic violence and women’s agency (see Purkayastha et al. (2003) [ 39 ] for a review of topics).

Supporting information

S1 Appendix
S2 Appendix
S3 Appendix

Funding statement

Tarun Jain received funding for this study through the Evidence-based Measures of Empowerment for Research on Gender Equality (EMERGE) Project at University of California San Diego (BMGF OPP1163682), and the International Growth Centre.

Data Availability

Data and files required for replication are posted in a public repository: https://bit.ly/3vKZJnB .

The good and the bad: Are some attribute words better than others in the Implicit Association Test?

  • Published: 04 May 2021
  • Volume 53 , pages 2512–2527, ( 2021 )


Jordan R. Axt, Tony Y. Feng & Yoav Bar-Anan


The Implicit Association Test (IAT) is one of the most popular measures in psychological research. A lack of standardization across IATs has resulted in significant variability among stimuli used by researchers, including the positive and negative words used in evaluative IATs. Does the variability in attribute words in evaluative IATs produce unwanted variability in measurement quality across studies? The present work investigated the effect of evaluative stimuli across three studies using 13 IATs and over 60,000 participants. The 64 positive and negative words that we tested provided similar measurement quality. Further, measurement was satisfactory even in IATs that used only category labels as stimuli. These results suggest that common sense is probably a sufficient method for selection of evaluative stimuli in the IAT. For reasonable measurement quality, we recommend that researchers using evaluative IATs in English select words randomly from the set we tested in the present research.


Research about implicit social cognition (Greenwald & Banaji, 1995 ) centers on unintentional, uncontrolled, cognitively efficient mental processes that influence behavior and judgment. While much of the interest in the field can be attributed to its theoretical novelty, another contributing factor is the generation of individual difference measures of implicit social cognition that can be easily adapted to many research contexts (see Gawronski & De Houwer, 2014 for a review). The most popular of these measures is the Implicit Association Test (IAT; Greenwald et al., 1998 ). The IAT is an indirect measure of cognitions (e.g., attitudes), inferred from performance in a task of categorizing two pairs of categories (e.g., flowers/insects, good/bad) with only two motor responses. A conservative estimate of the number of published studies using the IAT exceeds 4000, covering nearly all areas of psychology, such as research related to exercise (Forrest et al., 2016 ), eating disorders (Ahern et al., 2008 ), political judgment (Hawkins & Nosek, 2012 ), intergroup relations (Turner & Crisp, 2010 ), and consumer behavior (Gibson, 2008 ). In addition to its influence in the psychological literature, the IAT has also played a significant role in larger discussions related to prejudice and intergroup disparities, as over 30 million people have completed an IAT on the Project Implicit website (Ratliff & Smith, in press ).

Aside from its adaptability to many research contexts, the popularity of the IAT may also be attributed to the large amount of work that has gone into exploring the measure’s validity, including construct validity (Bar-Anan & Nosek, 2014 ; Bar-Anan & Vianello, 2018 ; Nosek & Smyth, 2007 ), internal validity (Dasgupta et al., 2003 ; Nosek, Greenwald & Banaji, 2005 ), and predictive validity (Buttrick et al., 2020 ). Meta-analytic investigations have provided further evidence of the IAT’s utility, such as in predicting variance in intergroup behavior (Kurdi et al., 2019 ). Notably, there is also a skeptical view on the validity of the IAT as a measure of individual differences in implicit social cognition (e.g., Schimmack, 2019 ), and, more broadly, on the implicit construct and the leading theories proposed for the distinction between implicit and explicit constructs (Corneille & Hütter, 2020 ). While the value of the concept of implicit social cognition, and the IAT specifically, will remain a continued topic of discussion, researchers can draw from a wide range of existing studies and resources to justify their use of the IAT.

Within the vast literature about IAT methodology, little published work has studied the effect of individual stimuli on the measurement quality of the IAT. Perhaps as a result of this paucity of empirical research about stimulus selection, there is no consistency in the individual stimuli used in different IATs to represent the attributes (e.g., good and bad) or categories (e.g., Black people and White people) in the test. Indeed, a brief review of recent evaluative IATs (i.e., IATs with categories that reflect valence, such as good/bad) finds that the specific stimuli used to represent each attribute are either not listed (Chevance, Caudroit, Romain & Boiche, 2017 ; Hagiwara et al., 2016 ; Panzone et al., 2016 ; Conroy et al., 2010 ; Haider et al., 2015 ) or, when listed, show little consistency. As of this writing, the ten most recent papers using an evaluative IAT each used a unique combination of words to represent the positive and negative categories (see Table 1 ), with no individual word appearing in more than four of the ten studies.

Such unaccounted variability across IAT stimuli has the potential to impact research outcomes. Most drastically, if specific stimuli or combinations of stimuli significantly improve or degrade measurement quality, then prior findings using the IAT may have limited generalizability. Similarly, if certain stimuli are particularly detrimental, then earlier studies that produced null results may have been misinterpreted; for example, if researchers found that the IAT failed to predict a behavioral outcome, then such a result could have been driven by the IAT’s stimuli rather than an actual lack of relationship between the behavioral measure and the implicit construct purportedly measured by the IAT. Finally, a less severe but still consequential possibility is that even relatively small effects of measurement quality due to variance in stimuli could introduce unnecessary noise, thereby further minimizing the already small to moderate associations found between the IAT and many relevant outcomes (e.g., Buttrick et al., 2020 ; Forscher, Lai, et al., 2019 ). For these reasons, a more systematic investigation of the role of stimuli variability, with a focus on finding the most useful evaluative stimuli, would be valuable both for interpreting past research using the IAT and for future uses of the measure.

Stimulus effects on the IAT

Prior work has produced conflicting evidence concerning the importance of individual stimuli in measurement quality in the IAT. In some studies, variance among stimuli had no impact on IAT performance. For instance, De Houwer ( 2001 ) found that similar IAT effects emerged in a British-foreign evaluative IAT that used positive British names and negative foreign names (e.g., Princess Diana, Adolf Hitler) versus negative British names and positive foreign names (Margaret Thatcher, Albert Einstein), though the small sample size ( N  = 28 in a within-subjects design) suggests the study likely had low statistical power to detect any stimuli effects.

However, similar results were found in two studies totaling over 600 participants (Stieger et al., 2010 ). In a clever design, participants completed IATs or single-target IATs concerning associations between the category of the self and the attributes of anxious versus calm using either (a) a predetermined set of words as stimuli, (b) stimuli that each participant chose from a larger pool of options as particularly representative of either anxiety or calmness, or (c) stimuli generated by each participant individually to represent anxiety and calmness. These variations in stimuli did not impact overall performance, internal reliability, or test-retest reliability. However, even a sample of 600 participants across three conditions provides relatively low statistical power (e.g., only 18% power to detect a small effect of Cohen’s q =  .10 for differences in test-retest reliability). Finally, a very high-powered investigation ( N > 40,000) of the number of stimuli used to represent each category in IATs developed to measure Black-White attitudes, old-young attitudes and the gender-science stereotype showed no decreases in measurement quality (assessed by overall effect sizes and IAT–self-report correlations) when using as few as two stimuli for either the target or attribute categories (Nosek et al., 2005 ), though this study did find that measurement quality decreased slightly when the IATs used a single stimulus that was identical to the category label.

These results contrast with other work that has found more substantive impacts of stimuli on the IAT. Several studies have illustrated the impact of valence when selecting stimuli to represent IAT categories (rather than the evaluative attributes). For example, compared to an IAT using general names associated with Black and White people (e.g., Tyrone, Josh) or an IAT using admired White people and disliked Black people as stimuli, positive associations towards White people were significantly reduced when stimuli were of admired Black people (e.g., Michael Jordan) and disliked White people (e.g., Timothy McVeigh; Govan & Williams, 2004 ; Mitchell et al., 2003 ).

Separate studies have found that the stimuli used to represent the attributes in IATs could impact task performance. For instance, among female participants, gender-power associations (i.e., the degree to which male versus female names were associated with the categories of potency versus weakness) were stronger when the stimuli representing weakness were more positive (e.g., delicate) compared to more negative (e.g., timid; Rudman et al., 2001 ). Other work manipulated the frequency of positive or negative stimuli within the IAT itself, as West German participants showed more ingroup favoritism on the IAT when positive stimuli occurred on 75% of attribute trials than when negative stimuli were shown on 75% of attribute trials (Bluemke & Fiedler, 2009 ). A final study found greater pro-White race IAT scores when using more recognizable White names as stimuli (Dasgupta et al., 2000 ).

Similarly, and most relevant to the present study, IAT performance can be influenced by the use of attribute stimuli that are already associated with the categories used. For example, ingroup preferences in IAT scores were higher among West German participants when the positive stimuli in the IAT drew from pre-existing positive stereotypes about West Germans (e.g., successful) and negative stimuli drew from pre-existing negative stereotypes about East Germans (e.g., xenophobic; Bluemke & Friese, 2006 ). Comparable results have been found in gender associations, as participants showed greater pro-female attitudes when the IAT used positive words that were stereotypically related to females (e.g., beautiful) compared to an IAT where positive words were stereotypically related to males (e.g., independent; Steffens & Plewe, 2001 ).

The current work

Previous research suggests that variance in stimuli can impact IAT performance and measurement quality. However, stimuli effects on IAT performance have only been observed through relatively drastic manipulations, such as when category stimuli representing Black and White people in a race IAT were made to differ strongly in positivity or negativity (Govan & Williams, 2004 ; Mitchell et al., 2003 ), or when attribute stimuli were selected because of potential contaminating effects due to existing associations with the IAT categories (Bluemke & Friese, 2006 ; Steffens & Plewe, 2001 ). In these cases, researchers have sought to “stack the deck” in an attempt to test the boundary conditions wherein stimuli may become so problematic that quality of measurement on the IAT is impacted. What is less clear is the role of stimuli variance when researchers, like in the studies listed in Table 1 , have the opposite goal—selecting stimuli with the hopes of minimizing measurement error. A notable lack of standardization across IATs has led to a large amount of diversity among stimulus sets, but the consequences of this diversity are currently unknown.

The present work explored this question with large samples and a wide range of topics. We focused specifically on the effect of variance in stimuli on measurement quality for evaluative IATs, which are the most popular form of IAT: a recent meta-analysis of interventions to change performance on measures of implicit associations found that 67% of studies that used the IAT assessed implicit evaluations (Forscher, Lai, et al., 2019 ). Our purpose was to test whether even when researchers choose attribute words in order to maximize measurement quality, some evaluative words produce better measurement quality than other words. If that is the case, our research could provide a list of words that are useful for maximizing measurement quality in evaluative IATs. Alternatively, if we find no consistent effects of the choice of words on measurement quality, future research could use the whole set of words tested in the present research.

We used the following indicators of measurement quality (see Bar-Anan & Nosek, 2014 for similar criteria): mean-level effects, known-groups differences, correlations with direct measures, and internal reliability. Next, we provide further justification for each of these criteria.

Mean-level effects

A superior measure should be more sensitive to the assessed construct. Assuming a modal response tendency in the IAT that is interpreted as a preference for one group over another, such as for White versus Black people or straight versus gay people (Nosek et al., 2007 ), measurement error can only weaken the ability to detect such preferences in implicit associations and therefore result in lower overall effect sizes. This assumption that measurement error will only weaken the overall effect size is common in prior research on the IAT’s validity (see Bar-Anan & Nosek, 2014 and Nosek et al., 2005 , for parallel reasoning). As a result, stimuli that produce larger overall IAT effects indicate better measurement.

Known-groups differences

Relatedly, a superior measure of a construct should be better able to detect variance between groups known to differ on that construct. Prior work suggests robust differences in the construct captured by indirect measures across the attitudinal domains included in our studies, such as race (Nosek et al., 2007 ), sexuality (Jost et al., 2004 ), and weight (Sabin et al., 2012 ). Again assuming that measurement error only reduces known effect sizes, the magnitude of such differences will be underestimated with greater measurement error. Therefore, measures that minimize error would increase the size of these known-groups differences, such as between gay and straight participants in sexuality IATs. Past work seeking to validate other IATs or indirect measures has used similar criteria (e.g., Axt et al., 2021 ; Nosek et al., 2014 ).

Correlations with self-report

A better measure of a construct should maximize correlations with related measures due to reduced error, assuming the error between measures is uncorrelated (Nosek, Greenwald & Banaji, 2005 ). Given widespread evidence that the IAT and self-report measures assess distinct but related constructs (Bar-Anan & Vianello, 2018 ; Nosek & Smyth, 2007 ), it is expected that the IATs used here will have a reliable (but not perfect) correlation with parallel measures of self-reported attitudes. Therefore, stronger correlations between the IAT and direct measures signal reductions in measurement error (see Axt, 2018 for another example of using correlations between the IAT and self-report variables as a way of assessing measurement error).

Internal reliability

Higher internal reliability does not guarantee superior measurement of a construct, but all else equal, measures with greater internal reliability minimize error in assessment of the targeted construct (see Sriram & Greenwald, 2009 for a similar approach).

Across three studies, 13 IATs, and more than 60,000 participants, we examined the role of variance in stimuli on IAT measurement quality. Study 1 tested whether, across 64 different words, the presence or absence of any one stimulus was associated with greater or weaker measurement quality. In a more direct test, Study 2 compared the measurement quality of IATs that used the best performing words and worst performing words from Study 1. Finally, after Studies 1 and 2 found no noticeable effect of attribute words, Study 3 examined whether variability and relevance of attribute words to the attribute categories are of any importance to the measurement quality of the IAT by comparing a typical evaluative IAT with an IAT that used either only the attribute names as the evaluative stimuli or an IAT that used nonwords unrelated to the attributes as the evaluative stimuli.

In Study 1, participants completed IATs with randomly selected words from a larger pool of positive and negative words. We compared whether measurement quality varied as a function of the presence or absence of each specific word.

Participants

We analyzed data from visitors completing evaluative IATs at the Project Implicit demonstration site ( https://implicit.harvard.edu/implicit/takeatest.html ). In these evaluative IATs, stimuli were randomly sampled from a pool of possible words. Specifically, participants ( N =  252,670, M Age   =  31.07, SD Age  = 13.34, 64.3% female, 71.4% White) completing attitudinal IATs had the “Good” label populated with eight of a possible 32 positive words and the “Bad” label populated with eight of a possible 32 negative words (see online supplement at https://osf.io/fxe8q/?view_only=68f3dfeb3d6b4015a5487878c722219d for full list). These words were taken from stimuli used in prior research on evaluative IATs (Nosek, 2005 ) and had been originally chosen to be readily categorizable as positive or negative.

For the data analyzed in Study 1, we first selected eight topics involving attitudinal IATs (age, Arab-Muslim, disability, race, religion, sexuality, skin tone, weight). The Religion task randomly assigned participants to complete an IAT measuring implicit associations for either Christianity vs. Judaism, Christianity vs. Islam, or Judaism vs. Islam. In total, there were then ten IATs included in Study 1. For each IAT, we began downloading data starting in July 2018 and added data from prior months until we reached enough completed IAT sessions such that, for each word in the stimuli pool, there were a minimum of 4500 completed IATs with that word. The word that appeared in the smallest number of IATs (i.e., study sessions) in our data appeared in 4868 IATs and was absent from 15,275 IATs. This sample size then allowed for very high-powered tests, such as 95% power to detect a Cohen’s q effect of .06 when comparing correlations between the IAT with self-reported attitudes. In Studies 1–3, we excluded participants who responded faster than 300 ms in more than 10% of trials (Nosek, Greenwald & Banaji, 2005 ).

See Table 2 for the labels used for all categories and attributes, as well as whether the category stimuli for each IAT consisted of images and/or words. The online supplement details category stimuli used in all studies. The procedure of the IATs and our scoring of the IAT scores followed those outlined in Greenwald et al. ( 2003 ). IATs were scored such that higher values indicated more positive associations with the dominant group (the group listed first in Table 2 ).

For each topic, self-reported attitudes were assessed by a single seven-point relative preference item (Axt, 2018 ); for example, self-reported weight attitudes were measured by an item ranging from −3 = “I strongly prefer fat people to thin people” to +3 = “I strongly prefer thin people to fat people” with a neutral midpoint of 0 = “I like fat people and thin people equally”. See the online supplement for full text for each self-report preference item.

Participants completed the IAT and self-report attitude measure in a randomized order. Each topic also included a demographic questionnaire of varying length, and other self-report variables that were not included in analyses.

Internal reliability was calculated using the same method as Bar-Anan and Nosek ( 2014 ). Separate D scores were calculated for (1) IAT blocks 3 and 4 (40 trials), (2) the first half of IAT blocks 6 and 7 (40 trials), and (3) the second half of IAT blocks 6 and 7 (40 trials), and these three scores were used to calculate a Cronbach’s α (Cronbach, 1955 ).
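A sketch of how these three “items” might be constructed from trial-level data is shown below. Here d_score(trials) is an assumed helper returning one D value per subject (e.g., conventional Greenwald et al., 2003 scoring), and the column names (subject, block, trial, latency_ms) are hypothetical.

```python
import pandas as pd

# d_score(trials) is an assumed helper that returns one D value per
# subject; it is not part of the original article.

def reliability_items(trials: pd.DataFrame) -> pd.DataFrame:
    """Three 'items' per subject for the reliability analysis:
    (1) blocks 3-4, (2) first half of blocks 6-7, (3) second half of
    blocks 6-7. Cronbach's alpha over these three columns gives the
    internal reliability described in the text."""
    b34 = trials[trials["block"].isin([3, 4])]
    b67 = trials[trials["block"].isin([6, 7])].sort_values(["subject", "trial"])
    in_first_half = (
        b67.groupby("subject").cumcount()
        < b67.groupby("subject")["trial"].transform("size") // 2
    )
    return pd.DataFrame({
        "d_blocks_3_4": d_score(b34),
        "d_blocks_6_7_first": d_score(b67[in_first_half]),
        "d_blocks_6_7_second": d_score(b67[~in_first_half]),
    })
```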

Table S1 in the online supplement presents the overall IAT effect size (Cohen’s d ) for when each word was or was not included in the IAT. Across all words, the largest average reduction across IATs comparing the presence versus absence of a word was d_diff = −.02 (“selfish”), and the largest average increase was d_diff = .011 (“scorn”). We also coded, for each IAT, whether the presence or absence of each word was associated with greater or weaker effect sizes. No word was associated with either stronger or weaker effect sizes across all tests; one word (“annoy”) was associated with weaker effects in nine of ten IATs, and four words (“cheerful,” “detest,” “poison,” “scorn”) were associated with larger effects in eight of ten IATs.
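A sketch of the presence-versus-absence comparison underlying these tables: for each word, compute the IAT effect size (Cohen’s d of the D scores against zero) among sessions that included the word and among those that did not. The data layout is hypothetical.

```python
import pandas as pd

def cohens_d_vs_zero(scores: pd.Series) -> float:
    """Cohen's d of the mean IAT D score against zero."""
    return scores.mean() / scores.std(ddof=1)

def word_effects(sessions: pd.DataFrame, words: list) -> pd.DataFrame:
    """For each candidate word, compare the overall IAT effect size when
    the word was present vs. absent in that session's stimulus set.

    Expects a 'd_score' column plus one boolean column per word marking
    whether the word was sampled into the session (hypothetical layout).
    """
    rows = []
    for w in words:
        d_present = cohens_d_vs_zero(sessions.loc[sessions[w], "d_score"])
        d_absent = cohens_d_vs_zero(sessions.loc[~sessions[w], "d_score"])
        rows.append({"word": w, "d_present": d_present,
                     "d_absent": d_absent, "d_diff": d_present - d_absent})
    return pd.DataFrame(rows)
```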

Table S2 in the online supplement presents the internal reliability coefficient α for the presence or absence of each word in each IAT. Across all words, the largest average reduction in reliability across IATs that did or did not include each word was α_diff = −.007 (“disgust”), and the largest average increase was α_diff = .005 (“hatred”). Across IATs, no word was associated with either stronger or weaker internal reliability across all tests; four words (“awful”, “disgust”, “humiliate”, “magnificent”) were associated with lower internal reliability in eight of ten IATs, and two words (“attractive”, “beautiful”) were associated with greater internal reliability in nine of ten IATs.

Correlations with self-reported attitudes

Table S3 in the online supplement presents the correlation r between the IAT D score and self-reported attitudes when each word was or was not present in each IAT. Across all words, the largest average reduction in variance explained (R²) when comparing presence versus absence was R²_diff = −.006 (“delightful”), and the largest average increase was R²_diff = .006 (“horrible”). No word was consistently associated with stronger or weaker correlations with self-reported attitudes across all ten tests; four words (“awful”, “delight”, “delightful”, “scorn”) were associated with lower IAT–self-report correlations in eight of ten IATs, and one word (“lovely”) was associated with stronger IAT–self-report correlations in eight of ten IATs.

Consistency across metrics

Across the three metrics used to evaluate measurement quality, we investigated whether any single word was consistently associated with better or worse measurement quality. Specifically, we inspected whether any word was ranked in the top or bottom 25% of words on every criterion. None of the 64 words was ranked in the top or bottom 25% across all three metrics, suggesting that no word was consistently related to better or worse measurement quality among the criteria used in Study 1.

Using three criteria for measurement quality, no single word out of 64 was consistently associated with better or worse measurement across ten IATs. This is suggestive, though not conclusive, evidence that when stimuli are selected with the intent of avoiding any contaminating or problematic influences (e.g., pre-existing associations with the categories; Bluemke & Friese, 2006 ), variation in word choice is unlikely to have substantive effects on measurement quality. However, the design of Study 1 might not lend itself to a strong test of this hypothesis, as the stimuli representing “good” and “bad” were selected from a larger pool of words for each study session. As a result, any single word that could have reduced measurement quality may have been frequently used alongside other words that simultaneously improved measurement quality. In other words, the random nature of selecting stimuli for the IATs in Study 1 could have diluted the positive or negative measurement effects of any single word.

Study 2 sought to test the effect of attribute stimuli with a stronger manipulation. We assigned participants to complete IATs using the words most associated with greater or weaker measurement quality based on Study 1’s results. If the null results of Study 1 were a consequence of the noise introduced by randomly selecting the other stimuli that were used alongside each target word, then combining the best- and worst-performing words should compound any possible effects and create a stronger test of the role of stimulus variation in IAT measurement.

Methods and analyses for Study 2 were preregistered at https://osf.io/wpu6n/?view_only=b469dcebf5be4679819efb92709b6b0b . We targeted a minimum sample size of 1000 eligible participants per IAT and stimulus set. Delays in removing the study led to a slight increase in sample size, though no analyses were completed until all data were collected.

A total of 16,783 eligible IATs were completed through the Project Implicit research pool by 8829 participants (M age = 34.5, SD = 14.6; 72.2% White, 66.1% female). Participants could complete multiple study sessions; the only sessions excluded were those in which a participant completed the same IAT a second (or subsequent) time (11.6% of sessions). For each IAT, this sample size allowed for over 95% power to detect an effect of Cohen’s q = .15 when comparing correlation strength and an effect of Cohen’s f = .085 (Cohen’s d = .17) when comparing the magnitude of known-groups differences. Data, materials, and analysis syntax for Studies 2 and 3 can be accessed at https://osf.io/ezj5t/?view_only=1781c05b04d54b829fd2eff67e0d429c.

Participants were randomly assigned to complete IATs related to race, sexuality, age, weight, skin tone, and Arab Muslims, using the same category labels as in Study 1. The one change was that the race IAT used only the category labels “White people” and “Black people”. A politics IAT using the categories “Democrats” and “Republicans” was also included, with category stimuli consisting of party logos and prominent members (e.g., Joe Biden, Ronald Reagan). In Study 2, all attribute labels were “Positive” and “Negative”.

Within each topic, participants were randomly assigned to complete an IAT with “high-performing” or “low-performing” words based on the results of Study 1. To determine each stimulus set, all 64 Study 1 words were ranked on ability to (1) maximize overall effect sizes, (2) strengthen correlation with self-reported attitudes, and (3) heighten internal reliability. An average ranking was calculated for each word. The eight positive and eight negative words with the highest average ranking were assigned to the “high-performing” set, while the eight positive and eight negative words with the lowest average ranking were assigned to the “low-performing” set.

The high-performing words were: friend, smiling, adore, joyful, pleasure, friendship, happy, attractive, bothersome, poison, pain, nasty, dirty, hatred, rotten, horrific. The low-performing words were: cherish, glad, delightful, fabulous, fantastic, magnificent, terrific, triumph, hurtful, annoy, disgust, despise, horrible, awful, disaster, humiliate.
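One plausible way to implement the ranking procedure described above is sketched below. The data frame, column names, and numeric values are purely illustrative; in the actual procedure, the eight best-ranked positive and eight best-ranked negative words formed the high-performing set, and the eight worst of each valence formed the low-performing set.

```python
import pandas as pd

# Hypothetical per-word summary from Study 1 (illustrative values only):
# each metric is signed so that higher values indicate better measurement.
words = pd.DataFrame({
    "word":          ["friend", "cherish", "poison", "annoy"],
    "effect_size":   [0.010, -0.004, 0.008, -0.012],   # change in overall IAT effect
    "self_report_r": [0.004, -0.003, 0.002, -0.005],   # change in IAT-self-report r
    "alpha":         [0.003, -0.002, 0.004, -0.006],   # change in internal reliability
}).set_index("word")

# Rank each word on each criterion (1 = best) and average the three ranks.
words["avg_rank"] = words.rank(ascending=False).mean(axis=1)

# Take the extremes of the average ranking to form the two stimulus sets.
high_performing = words["avg_rank"].nsmallest(2).index.tolist()
low_performing = words["avg_rank"].nlargest(2).index.tolist()
print(high_performing, low_performing)
```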

Self-reported attitudes

Participants completed five self-reported evaluation items. For each topic, participants completed a single relative preference item as in Study 1 (e.g., −3 = I strongly prefer Black to White people, +3 = I strongly prefer White to Black people), two thermometer items ranging from 1 = strongly dislike to 7 = strongly like that assessed liking of each group separately, and two slider items ranging from 1 = extremely negative to 100 = extremely positive that assessed positivity towards each group separately. A composite measure of self-reported attitudes was calculated by creating separate difference scores from the thermometer (liking) items and the slider (positivity) items, then standardizing those two difference scores and averaging them with the relative preference item (Axt, Bar-Anan & Vianello, 2020).
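A minimal sketch of this composite, assuming hypothetical column names for the two thermometer items, the two slider items, and the relative preference item (whether the preference item is also standardized before averaging is an assumption here):

```python
import pandas as pd

def composite_self_report(df: pd.DataFrame) -> pd.Series:
    """Composite self-reported attitude score (hypothetical column names).
    A difference score is computed from the two thermometer (liking) items
    and from the two slider (positivity) items; both difference scores and
    the relative preference item are standardized and then averaged."""
    therm_diff = df["therm_group_a"] - df["therm_group_b"]
    slider_diff = df["slider_group_a"] - df["slider_group_b"]

    def z(s: pd.Series) -> pd.Series:
        return (s - s.mean()) / s.std(ddof=1)

    return pd.concat([z(therm_diff), z(slider_diff), z(df["preference"])],
                     axis=1).mean(axis=1)
```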

Demographics

Upon registering for the research pool, participants reported a number of demographic details that we used for the known-groups analyses. Depending on the topic, additional demographic variables were added to allow for tests of known-groups differences: a seven-point measure of perceived skin tone (1 = very light, 7 = very dark), a five-point measure of identification as an Arab Muslim (1 = not at all, 5 = very much), an item about sexual orientation that allowed participants to identify as “heterosexual or straight” or “lesbian or gay”, among other options, a seven-point measure of perceived weight status (1 = very underweight, 7 = very overweight), and a seven-point measure of strength of identification with Republicans versus Democrats (1 = identify much more with Republicans, 7 = identify much more with Democrats). See the online supplement for full text of all demographic items.

Participants completed the IAT and self-report items in a random order. All added demographic items were completed immediately after the self-reported attitude items.

Given the large number of analyses included in Study 2, it was likely that several could reach statistical significance (i.e., p < .05) by chance. As a result, our preregistration outlined criteria that we believed would indicate substantive evidence that the stimuli manipulation impacted measurement quality. First, we would conclude that there are differences between the low-performing and high-performing stimuli if at least three of the seven tests found significant differences in the same direction when comparing (1) strength of correlations with self-reported evaluations, (2) degree of internal reliability, or (3) the magnitude of known-groups differences. In addition, we would only conclude that there are significant differences between stimulus sets in measurement quality if (1) the average effect on correlations with self-report exceeded a small effect of Cohen’s q = .10 (Cohen, 1988), (2) the average effect on internal reliability exceeded a difference of α = .05, and (3) the average effect on known-groups differences exceeded a small effect of d = .10 (or η_p² = .0025 in an ANOVA).
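The two effect-size thresholds are related by standard conversions. The sketch below shows Cohen’s q as a difference of Fisher-transformed correlations and the approximate two-group conversion between d and partial eta squared (assuming equal group sizes), which reproduces the .0025 value quoted above.

```python
import numpy as np

def cohens_q(r1: float, r2: float) -> float:
    """Cohen's q: the absolute difference between Fisher z-transformed rs."""
    return abs(np.arctanh(r1) - np.arctanh(r2))

def d_to_partial_eta_sq(d: float) -> float:
    """Approximate two-group conversion (equal group sizes):
    eta_p^2 = d^2 / (d^2 + 4)."""
    return d ** 2 / (d ** 2 + 4)

print(cohens_q(0.30, 0.21))                  # roughly .10, the q threshold
print(round(d_to_partial_eta_sq(0.10), 4))   # 0.0025, the ANOVA threshold above
```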

Correlations

All IATs were positively correlated with the parallel self-report attitude measure (all rs > .128, all ps < .001). Table 3 lists the sample size and the strength of the correlation with self-report for each IAT and stimulus set, as well as the results of a Fisher’s Z test comparing the strength of correlations for each topic. Across the seven tests, there were no reliable differences between the high-performing and low-performing stimuli conditions.
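A minimal sketch of the Fisher’s Z comparison of two independent correlations, with the returned q value corresponding to the Cohen’s q effect size used throughout (the input values in the usage comment are hypothetical):

```python
import numpy as np
from scipy import stats

def fisher_z_test(r1: float, n1: int, r2: float, n2: int):
    """Compare two independent correlations (e.g., the IAT-self-report
    correlation under the high- vs. low-performing stimulus set).
    Returns the z statistic, a two-tailed p value, and Cohen's q."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p, z1 - z2

# Hypothetical usage:
# z, p, q = fisher_z_test(r1=0.35, n1=1200, r2=0.30, n2=1180)
```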

Internal reliability was calculated using the same procedure as in Study 1. Table 4 lists the sample size and internal reliability for each IAT and stimulus set, as well as the results of a Feldt (1969) test comparing internal reliabilities for each topic. There were no reliable differences between the high-performing and low-performing stimuli conditions (Footnote 1).
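For reference, one common form of the Feldt (1969) test for two independently estimated alphas is sketched below; the exact formulation used in the published analyses is not reported here, so treat this as an assumption-laden illustration.

```python
from scipy import stats

def feldt_test(alpha1: float, n1: int, alpha2: float, n2: int):
    """One common form of the Feldt (1969) test for equality of two
    independently estimated Cronbach's alphas: the ratio
    (1 - alpha1) / (1 - alpha2) is referred to an F distribution with
    (n1 - 1, n2 - 1) degrees of freedom."""
    w = (1.0 - alpha1) / (1.0 - alpha2)
    p_upper = stats.f.sf(w, n1 - 1, n2 - 1)
    p_lower = stats.f.cdf(w, n1 - 1, n2 - 1)
    return w, 2 * min(p_upper, p_lower)   # two-tailed p value

# Hypothetical usage:
# w, p = feldt_test(alpha1=0.78, n1=1150, alpha2=0.74, n2=1190)
```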

Our preregistered classifications for known-groups differences compared (1) young (18–30) vs. old (50+) participants on the age IAT, (2) participants who identified at least “a little” as being an Arab Muslim versus those who “did not identify at all” on the Arab Muslim IAT, (3) participants who identified slightly, moderately, or much more with Democrats versus those who identified slightly, moderately, or much more with Republicans on the politics IAT, (4) Black versus White participants on the race IAT, (5) heterosexual or straight versus lesbian or gay participants on the sexuality IAT, (6) participants who identified as very light-, light-, or somewhat light-skinned versus those who identified as very dark-, dark-, or somewhat dark-skinned on the skin tone IAT, and (7) participants who identified as underweight (or neutral) versus overweight on the weight IAT. See the online supplement for descriptive statistics for each social group within each IAT and stimuli condition.

Table 5 presents the results of independent samples t tests of D scores between known groups on each IAT, as well as the results of the interaction term in a 2 (Social group) × 2 (Stimulus set) ANOVA for each topic. Here, the interaction term tested whether the known-groups difference was larger in one stimulus set than in the other. Five topics produced the expected known-groups differences within each stimulus set (e.g., differences in D scores between straight or heterosexual versus lesbian or gay participants). Within these five topics, only one interaction term was reliable; specifically, differences between White and Black participants’ D scores were greater in the high-performing than the low-performing stimuli condition.
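A minimal sketch of this factorial test, assuming hypothetical column names (d_score, social_group, stimulus_set) in a participant-level data frame:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def group_by_stimuli_anova(df: pd.DataFrame) -> pd.DataFrame:
    """2 (Social group) x 2 (Stimulus set) ANOVA on IAT D scores. The
    interaction term tests whether the known-groups difference depends on
    the stimulus set. Column names are hypothetical."""
    model = ols("d_score ~ C(social_group) * C(stimulus_set)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

# Hypothetical usage:
# df = pd.read_csv("race_iat_study2.csv")   # one row per participant
# print(group_by_stimuli_anova(df))
```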

Two IATs—those concerning age and Arab Muslim attitudes—failed to produce any group differences, making the results of the ANOVA interaction term difficult to interpret. In retrospect, these results are compatible with past work that found very weak relationships between participant age and indirectly measured age attitudes (e.g., Axt et al., 2014 ; Chopik & Giasson, 2017 ), and small average effects of more negative indirectly measured attitudes towards Arab Muslims among non-Arab Muslim participants (Buttrick et al., 2020 ).

As in Study 1, we found no consistent effect in Study 2 for the selection of positive and negative words on the measurement quality of the IAT, despite our attempt to use a stronger manipulation of word selection. The “high-performing” stimuli of Study 1 did not reliably produce stronger correlations with self-report, greater internal reliability, or larger differences between social groups known to differ in the measured attitudes. These high-powered null results provide more compelling evidence that variance in individual stimuli selected without the goal of introducing contaminating effects does not impact measurement quality.

It is possible that individual stimuli may matter more for some IATs than others; for instance, the race IAT showed greater known-groups differences when using the high-performing versus low-performing stimuli. Though additional data would be needed to determine whether this finding reflects a false positive or a real effect specific to the race IAT, the totality of evidence from Study 2 suggests that our stimulus set manipulation did not consistently impact measurement quality.

The results of Studies 1 and 2 suggest that the use of most stimuli is unlikely to severely impact IAT measurement quality, but a related question concerns the importance of variation in stimuli at all. Study 3 investigates this issue by comparing measurement quality among evaluative IATs that used multiple stimuli to represent the positive and negative attributes against two clearly inferior alternatives: (1) IATs that had no variation in attribute exemplars (i.e., the exemplars were only the attribute names) and (2) IATs that had attribute exemplars with no pre-existing association with the category (i.e., using totally unrelated letters to represent the attribute categories). This latter condition represents a particularly strong test regarding the importance of attribute stimuli, as it allows participants to easily engage in task recoding (Rothermund & Wentura, 2004 ). Specifically, participants can categorize these stimuli based on visual appearance, and any instructions to treat such stimuli as exemplars of the concepts of positive or negative could be intentionally disregarded, a process that would reduce the effect of associations between valence and the target attitude objects on performance in the IAT.

Methods and analyses for Study 3 were preregistered at https://osf.io/48vz7/?view_only=c47e91f99b58497fb7f460b63509d436 . We again targeted a minimum sample size of 1000 eligible participants per IAT and stimulus set. Delays in removing the study led to a slight increase in sample size, though no analyses were completed until all data were collected.

A total of 27,274 eligible IATs were completed through the Project Implicit research pool from 13,879 participants ( M Age =  36.3, SD =  14.2, 67.1% White, 67.2% female). Only study sessions where a participant completed the same IAT a second time (or more) were excluded (15.0% of sessions). This sample size allowed for a minimum of 95% power to detect an effect of Cohen’s q =  .14 when comparing correlation strength and an effect of Cohen’s f =  .077 ( d =  .15) when comparing the magnitude of known-groups differences.

Participants were randomly assigned to complete IATs related to race, sexuality, politics, weight, food, and the environment. The race, sexuality, politics, and weight IATs had the same category stimuli and labels as in Study 2. The food IAT assessed associations concerning “Meat” and “Vegetables”, with each category using seven color images of different meats or vegetables as stimuli (see online supplement ). The environment IAT assessed attitudes towards the concepts “Urban” (items: busy, noise, city, building, skyscraper) and “Rural” (items: farm, country, fields, slow, quiet). In Study 3, the attribute labels for all IATs were “Good” and “Bad”.

Participants completed IATs using one of three stimulus sets for the “Good” and “Bad” categories. In the Words condition, IATs used the same words as the high-performing condition in Study 2. In the Good-Bad condition, the stimuli were only the words “Good”, “good”, “Bad”, and “bad”. Finally, in the Q-Z condition, the stimuli were only the letters “Q”, “q”, “Z”, and “z”. In this condition, IAT instructions told participants to “pretend that the letter 'Q' means any good word” and to “pretend the letter 'Z' means any bad word.”

We measured self-reported attitudes with the same five-item format as in Study 2.

Demographics items related to known-groups differences in race, sexuality, politics, and weight were the same as in Study 2. In addition, participants who completed the food IAT responded to a single yes/no question about whether they identified as a vegetarian or vegan, and participants who completed the environment IAT responded to an item concerning the area in which they currently lived (“large city”, “suburb of a large city”, “medium-sized city”, “suburb of a medium-sized city”, “small town”, “rural”).

Participants completed the IAT and self-report items in a random order. All additional demographic items were completed immediately after the self-reported attitude items.

As in Study 2, our preregistration outlined several criteria that we believed would indicate substantive evidence that the manipulations of IAT stimuli consistently impacted measurement quality. First, we would conclude that a manipulation impacted measurement quality if at least three of the six tests found reliable differences in the same direction when comparing (1) strength of correlations with self-reported attitudes, (2) level of internal reliability, or (3) the magnitude of known-groups differences. In addition, in order to conclude a substantive effect of our manipulation, results would need to show that (1) the average effect on correlations with self-report exceeded a small effect of Cohen’s q = .10 (Cohen, 1988), (2) the average effect on internal reliability exceeded a difference of α = .05, and (3) the average effect on known-groups differences exceeded a small effect of d = .10 (or η_p² = .0025).

All IATs were positively correlated with the parallel self-reported attitude measure (all rs > .071, all ps < .006). Table 6 lists the sample size and strength of correlation with self-report for each IAT and stimulus manipulation, as well as the results of Fisher’s Z tests comparing the strength of correlations between all conditions.

Relative to the Q-Z condition, the Words condition produced reliably stronger correlations for all six topics, and the Good-Bad condition did so for five topics. The Words condition also showed stronger correlations with self-report than the Good-Bad condition for three topics. Meta-analyzing the Cohen’s q effect sizes of differences in correlations, the Q-Z condition showed weaker correlations with self-report than the Words condition (meta-analytic q = .25, p < .001) and the Good-Bad condition (meta-analytic q = .17, p < .001). Finally, while the Words condition showed, on average, stronger correlations with self-report than the Good-Bad condition (meta-analytic q = .08, p = .001), this effect size fell below our preregistered threshold of q = .10 for indicating substantive differences in correlations.
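The article does not spell out its meta-analytic model, but one plausible implementation is an inverse-variance fixed-effect average of the per-topic q values, sketched below under that assumption.

```python
import numpy as np
from scipy import stats

def meta_analyze_q(qs, n1s, n2s):
    """Inverse-variance (fixed-effect) average of Cohen's q values, one per
    topic. The sampling variance of q for two independent correlations is
    1/(n1 - 3) + 1/(n2 - 3). Returns the weighted mean q, z, and p."""
    qs = np.asarray(qs, dtype=float)
    variances = 1.0 / (np.asarray(n1s) - 3) + 1.0 / (np.asarray(n2s) - 3)
    weights = 1.0 / variances
    mean_q = np.sum(weights * qs) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    z = mean_q / se
    return mean_q, z, 2 * stats.norm.sf(abs(z))
```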

Table 7 presents the IAT internal reliability within each condition and topic, as well as the results of Feldt tests comparing the level of internal reliability between all conditions. Notably, even IATs using only “Q” and “Z” as stimuli showed moderate levels of internal reliability (minimum α = .66, median α = .75), and even showed greater internal reliability than the Good-Bad condition for two of six topics. However, relative to the Good-Bad and Q-Z conditions, the Words condition showed higher internal reliability for all six topics.

Following the method outlined by Feldt and Charter (2006), a weighted average across topics found that the Words condition (α = .81) had higher internal reliability than the Good-Bad (α = .73) and Q-Z conditions (α = .73). This difference exceeded our preregistered criterion of a difference in α greater than .05 for indicating a substantive effect of stimuli on internal reliability (Footnote 2).

Footnote 1 (Study 2): In exploratory analyses, we also estimated internal consistency using the correlation between the D score computed from blocks 3 and 6 and the D score computed from blocks 4 and 7 (applying the Spearman–Brown correction for split-half correlations). This approach gives more weight to trials in IAT blocks 3 and 6, which more closely mirrors how the overall D score is computed. None of the comparisons between conditions reached statistical significance, so overall conclusions were the same as when using α (see the online supplement for full analyses).

Footnote 2 (Study 3): As in Study 2, we also estimated internal consistency in an exploratory analysis using the correlation between the D score computed from blocks 3 and 6 and the D score computed from blocks 4 and 7. Overall conclusions were the same as when using α; the Good-Bad and Q-Z conditions did not consistently differ across domains, but for each domain the Words condition showed substantively greater internal consistency (Cohen’s q > .10) relative to both the Good-Bad and Q-Z conditions. See the online supplement for full analyses.
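A minimal sketch of the split-half approach described in these footnotes (input arrays are per-participant D scores from the two block pairs):

```python
import numpy as np

def split_half_reliability(d_blocks_3_6: np.ndarray, d_blocks_4_7: np.ndarray) -> float:
    """Split-half internal consistency: correlate the D score computed from
    blocks 3 and 6 with the D score computed from blocks 4 and 7, then apply
    the Spearman-Brown correction for doubling the test length."""
    r = np.corrcoef(d_blocks_3_6, d_blocks_4_7)[0, 1]
    return 2 * r / (1 + r)
```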

References

Ahern, A. L., Bennett, K. M., & Hetherington, M. M. (2008). Internalization of the ultra-thin ideal: Positive implicit associations with underweight fashion models are associated with drive for thinness in young women. Eating Disorders, 16, 294–307.


Axt, J. R. (2018). The best way to measure explicit racial attitudes is to ask about them. Social Psychological and Personality Science , 9 , 896–906.


Axt, J.R., Conway, M.C., Westgate, E.C. & Buttrick, N.R. (2021). Implicit attitudes independently predict gender and transgender-related beliefs. Personality and Social Psychology Bulletin,  47 , 257–274.

Axt, J. R., Ebersole, C. R., & Nosek, B. A. (2014). The rules of implicit evaluation by race, religion, and age. Psychological Science , 25 , 1804–1815.

Axt, J. R., Bar-Anan, Y., & Vianello, M. (2020). The relation between evaluation and racial categorization of emotional faces. Social Psychological and Personality Science , 11 , 196–206.

Bar-Anan, Y., & Nosek, B. A. (2014). A comparative investigation of seven indirect attitude measures. Behavior Research Methods , 46 , 668–688.

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General , 147 , 1264–1272.

Bluemke, M., & Fiedler, K. (2009). Base rate effects on the IAT. Consciousness and Cognition , 18 , 1029–1038.

Bluemke, M., & Friese, M. (2006). Do features of stimuli influence IAT effects?. Journal of Experimental Social Psychology , 42 , 163–176.

Brailovskaia, J., & Teichert, T. (2020). “I like it” and “I need it”: Relationship between implicit associations, flow, and addictive social media use. Computers in Human Behavior , 113 , 106509.

Buttrick, N., Axt, J., Ebersole, C. R., & Huband, J. (2020). Re-assessing the incremental predictive validity of Implicit Association Tests. Journal of Experimental Social Psychology , 88 , 103941.

Carnevale, J. J., Fujita, K., Han, H. A., & Amit, E. (2015). Immersion versus transcendence: How pictures and words impact evaluative associations assessed by the Implicit Association Test. Social Psychological and Personality Science , 6 , 92–100.

Chevance, G., Caudroit, J., Romain, A. J., & Boiché, J. (2017). The adoption of physical activity and eating behaviors among persons with obesity and in the general population: the role of implicit attitudes within the Theory of Planned Behavior. Psychology, Health & Medicine , 22 , 319–324.

Chopik, W. J., & Giasson, H. L. (2017). Age differences in explicit and implicit age attitudes across the lifespan. The Gerontologist , 57 , S169–S177.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.


Conroy, D. E., Hyde, A. L., Doerksen, S. E., & Ribeiro, N. F. (2010). Implicit attitudes and explicit motivation prospectively predict physical activity. Annals of Behavioral Medicine , 39 (2), 112–118.

Cooley, E., & Payne, B. K. (2017). Using groups to measure intergroup prejudice. Personality and Social Psychology Bulletin , 43 , 46–59.

Corneille, O., Hütter, M. (2020). Implicit? What do you mean? A comprehensive review of the implicitness construct in attitude research. Personality and Social Psychology Review , 24 , 212–232

Cronbach, L. J. (1955). Processes affecting scores on understanding of others and assumed similarity. Psychological Bulletin, 52 , 177–193. https://doi.org/10.1037/h0044919 .

Dai, J., Gao, H., Zhang, L., & Chen, H. (2020). Attention and memory biases for aggressive information in college students with fragile high self-esteem. International Journal of Psychology .

Dasgupta, N., Greenwald, A. G., & Banaji, M. R. (2003). The first ontological challenge to the IAT: Attitude or mere familiarity?. Psychological Inquiry , 14 , 238–243.

Dasgupta, N., McGhee, D. E., Greenwald, A. G., & Banaji, M. R. (2000). Automatic preference for White Americans: Eliminating the familiarity explanation. Journal of Experimental Social Psychology , 36 , 316–328.

De Houwer, J. (2001). A structural and process analysis of the Implicit Association Test. Journal of Experimental Social Psychology , 37 , 443–451.

Dickter, C. L., Burk, J. A., Anthony, L. G., Robertson, H. A., Verbalis, A., Seese, S., ... & Anthony, B. J. (in press). Assessment of Sesame Street online autism resources: Impacts on parental implicit and explicit attitudes toward children with autism. Autism .

Fazio, R. H., Sanbonmatsu, D. M., Powell, M. C., & Kardes, F. R. (1986). On the automatic activation of attitudes. Journal of Personality and Social Psychology , 50 , 229–238.

Feldt, L. S. (1969). A test of the hypothesis that Cronbach’s alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika , 34 , 363–373.

Feldt, L. S., & Charter, R. A. (2006). Averaging internal consistency reliability coefficients. Educational and Psychological Measurement , 66 , 215–227.

Foroni, F., & Bel-Bahar, T. (2010). Picture-IAT versus Word-IAT: level of stimulus representation influences on the IAT. European Journal of Social Psychology , 40 , 321–337.

Foroni, F., & Semin, G. R. (2012). Not all implicit measures of attitudes are created equal: Evidence from an embodiment perspective. Journal of Experimental Social Psychology , 48 , 424–427.

Forrest, L. N., Smith, A. R., Fussner, L. M., Dodd, D. R., & Clerkin, E. M. (2016). Using implicit attitudes of exercise importance to predict explicit exercise dependence symptoms and exercise behaviors. Psychology of Sport and Exercise , 22 , 91–97.

Forscher, P. S., Lai, C. K., Axt, J. R., Ebersole, C. R., Herman, M., Devine, P. G., & Nosek, B. A. (2019). A meta-analysis of procedures to change implicit measures. Journal of Personality and Social Psychology , 117 , 522–559.

Gawronski, B. (2019). Six lessons for a cogent science of implicit bias and its criticism. Perspectives on Psychological Science , 14 , 574–595.

Gawronski, B., & De Houwer, J. (2014). Implicit measures in social and personality psychology. In H. T. Reis, & C. M. Judd (Eds.). Handbook of research methods in social and personality psychology (pp. 283–310). (2nd ed.). New York: Cambridge University Press.

Gibson, B. (2008). Can evaluative conditioning change attitudes toward mature brands? New evidence from the Implicit Association Test. Journal of Consumer Research , 35 , 178–188.

Goddard, T., McDonald, A. D., Alambeigi, H., Kim, A. J., & Anderson, B. A. (2020). Unsafe bicyclist overtaking behavior in a simulated driving task: The role of implicit and explicit attitudes. Accident Analysis & Prevention , 144 , 105595.

Govan, C. L., & Williams, K. D. (2004). Changing the affective valence of the stimulus items influences the IAT by re-defining the category labels. Journal of Experimental Social Psychology , 40 , 357–365.

Greenwald, A. G., & Banaji, M. R. (1995). Implicit social cognition: Attitudes, self-esteem, and stereotypes. Psychological Review , 102 , 4–27.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the Implicit Association Test: I. An improved scoring algorithm. Journal of Personality and Social Psychology , 85 , 197–216.

Haider, A. H., Schneider, E. B., Sriram, N., Scott, V. K., Swoboda, S. M., Zogg, C. K., ... & Freischlag, J. A. (2015). Unconscious race and class biases among registered nurses: vignette-based study using implicit association testing. Journal of the American College of Surgeons , 220 , 1077–1086.

Hagiwara, N., Dovidio, J. F., Eggly, S., & Penner, L. A. (2016). The effects of racial attitudes on affect and engagement in racially discordant medical interactions between non-Black physicians and Black patients. Group Processes & Intergroup Relations , 19 , 509–527.

Hall, S. S., & Lee, K. H. (2020). Marital attitudes and Implicit Associations Tests (IAT) among young adults. Journal of Family Issues .

Hawkins, C. B., & Nosek, B. A. (2012). Motivated independence? Implicit party identity predicts political judgments among self-proclaimed independents. Personality and Social Psychology Bulletin , 38 , 1437–1452.

Hughes, S., Mattavelli, S., Hussey, I. & De Houwer, J. (2020). The influence of extinction and counterconditioning procedures on operant evaluative conditioning and intersecting regularity effects. Royal Society Open Science, 7 , 192085.

Irving, L. H., & Smith, C. T. (2020). Measure what you are trying to predict: Applying the correspondence principle to the Implicit Association Test. Journal of Experimental Social Psychology , 86 , 103898.

Jost, J. T., Banaji, M. R., & Nosek, B. A. (2004). A decade of system justification theory: Accumulated evidence of conscious and unconscious bolstering of the status quo. Political Psychology , 25 , 881–919.

Karpinski, A., & Steinman, R. B. (2006). The single category implicit association test as a measure of implicit social cognition. Journal of Personality and Social Psychology , 91 , 16–32.

King, D., & Auschaitrakul, S. (2020). Affect-based nonconscious signaling: When do consumers prefer negative branding?. Psychology & Marketing .

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., ... & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist , 74 , 569–586.

Meissner, F., & Rothermund, K. (2015). A thousand words are worth more than a picture? The effects of stimulus modality on the implicit association test. Social Psychological and Personality Science , 6 , 740–748.

Mitchell, J. P., Nosek, B. A., & Banaji, M. R. (2003). Contextual variations in implicit evaluation. Journal of Experimental Psychology: General , 132 , 455–469.

Nosek, B. A. (2005). Moderators of the relationship between implicit and explicit evaluation. Journal of Experimental Psychology: General , 134 , 565–584.

Nosek, B. A., & Banaji, M. R. (2001). The Go/No-Go Association Task. Social Cognition , 19 , 625–666.

Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2005). Understanding and using the Implicit Association Test: II. Method variables and construct validity. Personality and Social Psychology Bulletin , 31 , 166–180.

Nosek, B. A., Smyth, F. L., Hansen, J. J., Devos, T., Lindner, N. M., Ranganath, K. A., ... & Banaji, M. R. (2007). Pervasiveness and correlates of implicit attitudes and stereotypes. European Review of Social Psychology , 18 , 36–88.

Nosek, B. A., Bar-Anan, Y., Sriram, N., Axt, J., & Greenwald, A. G. (2014). Understanding and using the brief implicit association test: Recommended scoring procedures. PloS one , 9 , e110938.

Nosek, B. A., & Smyth, F. L. (2007). A multitrait-multimethod validation of the Implicit Association Test. Experimental Psychology , 54 , 14–29.

Payne, B. K., Burkley, M. A., & Stokes, M. B. (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. Journal of Personality and Social Psychology , 94 (1), 16–31.

Panzone, L., Hilton, D., Sale, L., & Cohen, D. (2016). Socio-demographics, implicit attitudes, explicit attitudes, and sustainable consumption in supermarket shopping. Journal of Economic Psychology , 55 , 77–95.

Piccirillo, M. L., Burke, T. A., Moore-Berg, S. L., Alloy, L. B., & Heimberg, R. G. (2020). Self-stigma toward nonsuicidal self-injury: An examination of implicit and explicit attitudes. Suicide and Life-Threatening Behavior .

Puce, A., Allison, T., Asgari, M., Gore, J. C., & McCarthy, G. (1996). Differential sensitivity of human visual cortex to faces, letterstrings, and textures: A functional MRI study. Journal of Neuroscience , 16 , 5205–5215.

Qiu, Y., & Zhang, G. (2020). Make exercise easier: A brief intervention to influence implicit attitudes towards exercise and physical activity behavior. Learning and Motivation , 72 , 101660.

Ratliff, K. & Smith, C.T. (in press). Lessons from two decades with Project Implicit. In J. Krosnick, T. Stark & A. Scott (Eds.), A Handbook of Research on Implicit Bias and Racism . APA Books.

Rothermund, K., & Wentura, D. (2004). Underlying processes in the implicit association test: dissociating salience from associations. Journal of Experimental Psychology: General , 133 , 139–165.

Rudman, L. A., Greenwald, A. G., & McGhee, D. E. (2001). Implicit self-concept and evaluative implicit gender stereotypes: Self and ingroup share desirable traits. Personality and Social Psychology Bulletin , 27 , 1164–1178.

Sabin, J. A., Marini, M., & Nosek, B. A. (2012). Implicit and explicit anti-fat bias among a large sample of medical doctors by BMI, race/ethnicity and gender. PloS one , 7 , e48448.

Scaife, R., Stafford, T., Bunge, A., & Holroyd, J. (2020). To blame? The effects of moralized feedback on implicit racial bias. Collabra: Psychology .


Schimmack, U. (2019). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science . https://doi.org/10.1177/1745691619863798 .

Sriram, N., & Greenwald, A. G. (2009). The brief implicit association test. Experimental Psychology , 56 , 283–294.

Steffens, M. C., & Plewe, I. (2001). Items’ cross-category associations as a confounding factor in the Implicit Association Test. Zeitschrift für experimentelle Psychologie , 48 , 123–134.

Stieger, S., Göritz, A. S., & Burger, C. (2010). Personalizing the IAT and the SC-IAT: Impact of idiographic stimulus selection in the measurement of implicit anxiety. Personality and Individual Differences , 48 , 940–944.

Turner, R. N., & Crisp, R. J. (2010). Imagining intergroup contact reduces implicit prejudice. British Journal of Social Psychology , 49 , 129–142.

Zitelny, H., Shalom, M., & Bar-Anan, Y. (2017). What is the implicit gender-science stereotype? Exploring correlations between the gender-science IAT and self-report measures. Social Psychological and Personality Science , 8 , 719–735.


Authors’ Note

All data and study materials are available at the project page on the Open Science Framework ( https://osf.io/ezj5t/?view_only=1781c05b04d54b829fd2eff67e0d429c ). All measures, manipulations, and exclusions in the studies are disclosed.

Author information

Authors and Affiliations

Department of Psychology, McGill University, 2001 McGill College Ave, Montreal, Quebec, H3A 1G1, Canada

Jordan R. Axt & Tony Y. Feng

Project Implicit, Washington, DC, USA

Jordan R. Axt

Tel Aviv University, Tel Aviv, Israel

Yoav Bar-Anan


Corresponding author

Correspondence to Jordan R. Axt .

Ethics declarations

Conflict of interest

This research was partly supported by Project Implicit. J. R. Axt is Director of Data and Methodology for Project Implicit, Inc., a nonprofit organization with the mission to “develop and deliver methods for investigating and applying phenomena of implicit social cognition, including especially phenomena of implicit bias based on age, race, gender, or other factors.” There are no other potential conflicts of interest with respect to authorship or the publication of this article.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information (PDF 1132 kb)

For the topics of politics, race, sexuality, and weight, our classifications for known-groups differences were the same as in Study 2. In addition, we compared food IAT performance among participants who did versus did not self-identify as vegetarian or vegan and compared environment IAT performance among participants who reported living in a large city or suburb of a large city versus those who reported living in a small town or rural environment. Each IAT and stimulus condition produced the expected difference between social groups, with the one exception being the Q-Z condition failing to produce differences in weight IAT D scores between participants who identified as underweight versus overweight. See the online supplement for descriptive statistics for each social group within each IAT and stimuli condition.

Table 8 presents the results of independent samples t tests of D scores between known groups on each IAT, as well as the results of the interaction term in a 2 (Social group) × 3 (Stimulus set) ANOVA for each topic. As before, a reliable interaction term would suggest that one stimulus set produced a stronger effect of group membership on IAT scores. For two topics, sexuality and weight, the size of known-groups differences did not reliably differ across the three stimuli conditions. The remaining four topics produced reliable social group by stimulus set interactions. Follow-up analyses revealed that, for race and politics, the Q-Z condition produced weaker group differences than the Good-Bad condition (Race: p = .028, η_p² = .002; Politics: p < .001, η_p² = .013) and the Words condition (Race: p < .001, η_p² = .012; Politics: p < .001, η_p² = .047). In turn, the Good-Bad condition produced weaker differences than the Words condition (Race: p < .001, η_p² = .005; Politics: p < .001, η_p² = .013). For the Rural-Urban IAT, the Words condition produced larger group differences than the Q-Z (p < .001, η_p² = .008) and Good-Bad conditions (p = .012, η_p² = .004), while the Good-Bad and Q-Z conditions did not reliably differ from each other (p = .149, η_p² = .001). Finally, for the meat-vegetables IAT, the Q-Z condition produced weaker group differences than the Words condition (p = .004, η_p² = .003), while the Q-Z and Good-Bad conditions did not reliably differ (p = .076, η_p² = .001), and neither did the Good-Bad and Words conditions (p = .175, η_p² = .001). In total, the Q-Z condition showed weaker known-groups differences than the Words condition in four of six topics, and weaker differences than the Good-Bad condition in three of six topics. The Words condition produced stronger known-groups differences than the Good-Bad condition for three of six topics.

Across all six topics, meta-analyses found that the Good-Bad condition was associated with greater differences between known groups than the Q-Z condition (meta-analytic η_p² = .002, p = .004). In addition, the Words condition produced greater group differences than either the Q-Z (meta-analytic η_p² = .008, p = .003) or the Good-Bad (meta-analytic η_p² = .002, p = .005) condition. However, these effects should be considered quite small, as only the contrast between the Words and Q-Z conditions (η_p² = .008 is equivalent to d = .18) exceeded our preregistered criterion of d = .10 for substantive differences between manipulations. See the online supplement for full reporting of each follow-up ANOVA.

Compared to using stimuli with no variation or no meaningful association with the attribute labels, using varied stimuli improved measurement quality by increasing correlations with self-report, internal reliability, and known-groups differences. The advantage of varied stimuli versus simply using the attribute labels as stimuli was evident but much weaker, with effects that exceeded our prespecified criteria for evidence of a substantive effect on internal reliability but not in maximizing known-groups differences or increasing correlations with self-report. Finally, using the attribute labels as stimuli created superior measurement relative to using novel, unrelated stimuli on correlations with self-report, but the effects did not exceed the prespecified criteria for known-groups differences or internal reliability. The relatively strong performance of the Q-Z condition and its ability to produce a majority of the IAT effects found in the other conditions suggests that many participants followed the instructions to think of “Q” as positive and “Z” as negative, and participants did not naturally adopt a task-recoding strategy (treating the categories as “Q” and “Z” rather than positive and negative) when given the opportunity.

Taken together, Study 3’s results suggest that using varied exemplar stimuli improves IAT measurement quality but is not necessary for achieving satisfactory measurement. Indeed, using meaningless stimuli that had no pre-existing association with the attribute labels still produced satisfactory measurement, evident in outcomes like reliable correlations with self-reported attitudes, known-groups differences for five of six topics, and a median internal reliability of α = .75. In short, given the only modest discrepancies between using individual words and using unvaried stimuli that simply reflected the attribute labels, it is unlikely that differences in stimulus choice are a significant source of variation in IAT measurement.

General discussion

Three studies investigated how measurement quality was affected by variation in the words chosen as stimuli to represent the positive and negative attributes in evaluative IATs. In Study 1, an archival analysis of ten evaluative IATs did not find a consistent effect of the presence or absence of any individual word on the overall IAT D scores, internal reliability, or correlations with self-reported attitudes. Similarly, in Study 2, the best performing set of words (numerically) from Study 1 did not produce better measurement quality than the worst performing set of words. In Study 3, using the attribute labels as the attribute stimuli was inferior to using a set of eight words for each evaluative category, although the decrement in the measurement quality of the IAT was not always substantial. Further, even a condition that used stimuli unrelated to the attribute labels produced acceptable levels of internal reliability, reliable correlations with self-reported attitudes, and expected differences in IAT performance based on participants’ demographics, ideology, or self-perceptions.

Taken together, results from these studies indicate that variation in the words selected as IAT stimuli does not appear to be a strong source of variation in IAT measurement. Based on previous research that found some effects of specific item stimuli on IAT performance (e.g., Bluemke & Friese, 2006; Govan & Williams, 2004; Rudman et al., 2001), we speculated that some evaluative words might be best suited for producing high measurement quality in the IAT. However, we found no evidence that this is the case. The present results are consistent with previous research suggesting that the category labels have a larger effect on measurement quality than the specific items assigned to those categories (Axt et al., 2021; Mitchell et al., 2003). Practically, our results reassure researchers looking to use the IAT that their results are unlikely to be overly influenced by the specific evaluative stimuli in the IAT, so long as those stimuli are unambiguously associated with the relevant attributes and do not have clear confounds with the selected categories (e.g., Steffens & Plewe, 2001). Further reassurance that common sense is probably sufficient for a satisfactory choice of evaluative words comes from the fact that, in Study 3, even a rather unimaginative and restricted choice of a word stimulus identical to the attribute category labels did not result in a drastic decrease in measurement quality. For researchers who wish to use evaluative IATs in English, the present research then offers 64 equally suitable words (see online supplement for full list).

Our failure to find differences between evaluative IATs that used different attribute stimuli decreases the probability that this factor surreptitiously contributed to past findings, such as the modest correlation found between the IAT and measures of relevant behavior (e.g., Gawronski, 2019 ; Kurdi et al., 2019 ). Our results suggest that variation among the word stimuli chosen to represent attribute labels does not introduce a significant source of noise into the quality of IAT measurement and is unlikely to further suppress associations between the IAT and outcomes of interest. As a result, researchers seeking to better understand or maximize the association between the IAT and relevant criterion measures may look towards more structural components of study design, such as the degree of conceptual correspondence between the IAT and the measure of interest (Irving & Smith, 2020 ; Payne et al., 2008 ).

The present results do not suggest that the IAT is insensitive to the effect of stimuli. Past work clearly shows that the IAT can be influenced by manipulations of the stimuli used to represent specific categories or attributes. However, this work required substantial changes to such stimuli, often to the point of deliberately introducing confounds into the measure, such as by using images of White people who were widely detested and images of Black people who were (at the time) widely admired (Govan & Williams, 2004), or by using attribute words intentionally chosen to have pre-existing associations with the categories used (e.g., using “beautiful” as a positive word when assessing implicit gender associations; Steffens & Plewe, 2001). Like these previous studies, our choice of stimuli in Study 3 influenced measurement quality, but that result required the drastic step of using stimuli that had no pre-existing association with the attributes. These prior studies are helpful in illustrating that introducing serious confounds into the stimuli can have serious effects on IAT performance, yet the present work more fully reveals the inverse finding: without major confounds in the selected stimuli, variation in stimuli has no substantial effect on IAT performance.

At the same time, our conclusions are limited to showing that common-sense selection of IAT stimuli is enough to achieve satisfactory measurement. It remains unclear whether more specific, informed selection methods, such as tailoring the positive or negative items to each attitude object, may produce even greater measurement quality. Prior studies on this topic, which used stimuli that had pre-existing associations with the categories used in the IAT, suffered from low statistical power and only included a single attitude domain (e.g., Steffens & Plewe, 2001). For instance, measurement quality may be improved on a race IAT that uses attribute items referring to traits stereotypically associated with White or Black people. However, it is also possible that this approach could degrade measurement quality by changing the associations being measured, as completing an IAT with negative items that are stereotypically Black and positive items that are stereotypically White could temporarily strengthen anti-Black associations. This is a worthy direction for future research that may lead to advances in the validity of the IAT, though the present results suggest that such work is not required for achieving satisfactory measurement. However, the small decrease (if any) in measurement quality that we found when using the category labels as the attribute words might suggest that improving the selection of attribute exemplars would not be easy to accomplish.

Extending prior work on IAT measurement

In addition to the question of variability among IAT stimuli, these studies speak to other issues related to IAT measurement. For one, our results shed light on prior discussions regarding the number of stimuli required per category in order to achieve satisfactory IAT measurement. A previous investigation (Nosek, Greenwald & Banaji, 2005) manipulated the number of stimuli in the target and attribute categories across three IATs measuring racial attitudes, age attitudes, or gender-science stereotypes. One version of the race IAT had six stimuli to represent each racial category (i.e., six images each of Black and White people) and a single stimulus to represent each attribute category (i.e., only the category labels “Good” and “Bad”), a design that is very similar to the Good-Bad condition in Study 3. Relative to versions of the race IAT that included more attribute stimuli, Nosek et al. (2005) found that using only a single stimulus did not produce large changes in the overall IAT effect or in correlations with self-reported racial attitudes.

Though the present results largely replicate these conclusions and extend them to a greater number of IATs, the larger sample sizes used here also detected lower internal reliability when only a single stimulus was used per attribute. These data more fully highlight that, while including multiple stimuli per attribute category should improve measurement quality, doing so is not a requirement for achieving expected IAT effects. This finding might help to simplify the IAT (for example, when participants have low language proficiency) with no serious cost in measurement quality. Further, assuming the present finding generalizes to other IAT categories, it might help researchers who struggle to find more than a couple of stimuli for the attribute categories (e.g., Socialism vs. Capitalism) or the target categories (e.g., word stimuli for two political parties of a similar ideology). As a result, this finding may expand the range of research topics to which the IAT can be effectively applied.

Notably, one shortcoming of this work is its inability to speak to differences in stimulus modality, such as in comparing IAT performance when using words versus images to represent a category. Prior work suggests that stimulus modality may influence IAT performance (e.g., Meissner & Rothermund, 2015 ). For instance, a single-category IAT produced stronger associations between tastiness words and desserts (versus vegetables) when food was represented as pictures versus words, and similar results occurred in an evaluative IAT measuring positive associations for desserts versus vegetables, though these effects were limited to participants who reported being on a diet (Carnevale et al., 2015 ).

One explanation for the impact of stimulus modality on performance concerns the level of representation. More specifically, images may induce lower-level processing than words because images are more concrete representations of the category that activate less extraneous knowledge (e.g., Puce et al., 1996). Indeed, follow-up studies have manipulated level of representation within the same IAT modality; for example, Dutch participants showed more negative associations towards immigrants (versus natives) when IAT stimuli depicted groups of people (invoking higher-level representations) compared to when stimuli depicted only a single person at a time (Foroni & Bel-Bahar, 2010; see also Cooley & Payne, 2017). A similar process may explain why the IATs used in the present work were largely resistant to variation in individual (word) stimuli. The use of words may have facilitated higher-order processing of the stimuli and IAT attribute categories, and because many different words unambiguously fit the attributes used here (e.g., “positive” or “negative”), participants may have had little difficulty processing any of these words as representing each attribute. That is, using words as stimuli may allow participants to take a more expansive approach to the attribute labels, one that allows a greater number of stimuli to fall under that attribute. Follow-up research on this topic may seek to test this account directly, such as by manipulating participants’ perceptions of what words best reflect a certain category, as well as by investigating whether similar effects emerge when using image stimuli.

The results of the Q-Z condition in Study 3 extend this notion of participants’ flexibility in the ability to categorize stimuli. Even when the stimuli had no pre-existing association with the attribute labels, participants were able to incorporate the stimuli into their representation of the attribute and produce IAT performance that had acceptable levels of internal reliability as well as expected patterns of known-groups differences and correlations with self-reported attitudes. These data are strong evidence that IAT performance is much more dependent on the attribute or category labels used to determine how stimuli are categorized than the specific stimuli used to represent the categories or attributes (e.g., Axt et al., 2021 ).

Limitations and future directions

One clear limitation of this work mentioned previously is the somewhat narrow scope of our manipulations, as we did not examine the effects of stimulus variability using image stimuli or other instances of text stimuli, such as when words are used to represent categories (e.g., first names associated more with Black versus White people) or attributes other than positive or negative (e.g., words associated with danger and safety). Though prior work suggests that the choice between representing categories or attributes as images versus words may have an impact on IAT performance (Carnevale et al., 2015 ; Meissner & Rothermund, 2015 ), it is less clear whether variation among the images used in IATs substantively impacts measurement quality. Similarly, the effects found here are specific to evaluative IATs, and stimulus variation may play a role in IATs seeking to measure stereotypic associations, like that between gender and science versus arts (e.g., Zitelny et al., 2017 ). While the current results cannot rule out the possibility that stimulus variation is an important factor in IATs using images or those assessing stereotypic associations, we see no a priori reason to expect this lack of generalizability. Regardless, this line of research will only benefit by extending the question into IATs that use other stimulus modalities or measure other types of associations, as the flexibility shown by participants in adapting the Q-Z labels to an evaluative context may not necessarily extend to IATs seeking to measure more specific associations than a general positive vs. negative distinction.

Another possible concern might be that we tested only 64 words, but there are many more evaluative words. Indeed, it is theoretically possible that we missed some words that would perform better than the words we chose to test in the present research. However, this seems less likely when considering the relatively small effect, found in Study 3, of replacing the words with stimuli identical to the attribute category labels. The modest decrease in the measurement quality of the IAT in that condition suggests that even if we had increased the set of tested words in Studies 1 and 2, no substantial variability in measurement quality would have been found.

More generally, this investigation focused on only one indirect measure, the IAT. The question of how variation among stimuli impacts measurement quality could be extended to other forms of the IAT, such as the single-category IAT (Karpinski & Steinman, 2006), as well as to other indirect measures, such as the Go/No-Go Association Task (Nosek & Banaji, 2001) and evaluative priming (Fazio et al., 1986). Given similarities in performance across these tasks (Bar-Anan & Nosek, 2014), we would anticipate that other indirect measures would also not be overly influenced by variation within individual stimuli. However, it is still possible that variation among stimuli may impact some tasks more than others, especially given preliminary evidence that indirect measures of implicit associations may engage or rely on different psychological processes (Foroni & Semin, 2012).

Despite the IAT’s wide usage within psychological research, the stimuli used in the test vary considerably across researchers. If measurement quality were overly influenced by individual stimuli, then many conclusions from IAT studies might not generalize to other forms of the test, and variation in stimuli could be a significant source of measurement error across IATs. The present work suggests that this scenario is unlikely, as measurement quality across 13 evaluative IATs was not impacted by variation among the words used to represent the positive and negative attributes. In fact, there was evidence of only a small and somewhat inconsistent decrease in measurement quality when using only a single stimulus that was redundant with the attribute label. These results highlight the need for researchers to focus more on conceptual and theoretical explanations of when the associations detected by an IAT develop, change over time, and do or do not predict behavior.


About this article

Axt, J. R., Feng, T. Y., & Bar-Anan, Y. (2021). The good and the bad: Are some attribute words better than others in the Implicit Association Test? Behavior Research Methods, 53, 2512–2527. https://doi.org/10.3758/s13428-021-01592-8


Accepted: 29 March 2021

Published: 04 May 2021

Issue date: December 2021


Keywords: Implicit Association Test; Reliability; Implicit attitudes
