Experimental vs Non-Experimental Research: 15 Key Differences

busayo.longe

There is a general misconception that once research is non-experimental, it is non-scientific, which makes it all the more important to understand what experimental and non-experimental research entail. Experimental research is the most common type of research, and is what many people refer to as scientific research.

Non-experimental research, on the other hand, is an umbrella term for research that is not experimental. It clearly differs from experimental research, and as such has different use cases.

In this article, we explain these differences in detail so that you can properly identify each type during the research process.

What is Experimental Research?  

Experimental research is the type of research that uses a scientific approach to manipulate one or more independent variables of the research subject(s) and measure the effect of this manipulation on the subject. Its defining feature is that it allows variables to be manipulated under controlled conditions.

This research method is widely used across the physical and social sciences, even though it can be quite difficult to execute. Within the information field, it is much more common in information systems research than in library and information management research.

Experimental research is usually undertaken when the goal of the research is to trace cause-and-effect relationships between defined variables. However, the type of experimental research chosen has a significant influence on the results of the experiment.

This brings us to the different types of experimental research. There are three main types, namely pre-experimental, quasi-experimental, and true experimental research.

Pre-experimental Research

Pre-experimental research is the simplest form of research. It is carried out by observing a group or groups of dependent variables after a treatment (the independent variable) that is presumed to cause change in the group(s). It is further divided into three types:

  • One-shot case study research 
  • One-group pretest-posttest research 
  • Static-group comparison

Quasi-experimental Research

Quasi-experimental research is similar to true experimental research, but uses carefully selected rather than randomized subjects. The following are examples of quasi-experimental designs:

  • Time series design
  • Nonequivalent control group design
  • Counterbalanced design

True Experimental Research

True experimental research is the most accurate type, and may simply be called experimental research. It randomly assigns subjects to control and treatment groups, manipulates the independent variable, and records the effect of this manipulation.

True experimental research can be further classified into the following groups:

  • The posttest-only control group design
  • The pretest-posttest control group design
  • The Solomon four-group design

Pros of True Experimental Research

  • Researchers can have control over variables.
  • It can be combined with other research methods.
  • The research process is usually well structured.
  • It provides specific conclusions.
  • The results of experimental research can be easily duplicated.

Cons of True Experimental Research

  • It is highly prone to human error.
  • Exerting control over extraneous variables may introduce the researcher's personal bias.
  • It is time-consuming.
  • It is expensive.
  • Manipulating variables may have ethical implications.
  • Because the setting is controlled, it can produce artificial results.

What is Non-Experimental Research?  

Non-experimental research is the type of research that does not involve the manipulation of an independent variable. In non-experimental research, researchers measure variables as they naturally occur, without any further manipulation.

This type of research is used when the researcher has no specific research question about a causal relationship between two variables, or when manipulation of the independent variable is impossible. It is also used when:

  • subjects cannot be randomly assigned to conditions.
  • the research question is about a causal relationship, but the independent variable cannot be manipulated.
  • the research is broad and exploratory.
  • the research pertains to a non-causal relationship between variables.
  • only limited information can be accessed about the research subject.

There are three main types of non-experimental research, namely cross-sectional research, correlational research, and observational research.

Cross-sectional Research

Cross-sectional research involves the comparison of two or more pre-existing groups of people under the same criteria. This approach is classified as non-experimental because the groups are not randomly selected and the independent variable is not manipulated.

For example, an academic institution may want to reward its first-class students with a scholarship for their academic excellence. Therefore, each faculty places students in the eligible and ineligible group according to their class of degree.

In this case, a student's class of degree cannot be manipulated to qualify them for a scholarship, because doing so would be unethical. Therefore, the placement is cross-sectional.

Correlational Research

Correlational research examines the statistical relationship between two variables. It is classified as non-experimental because it does not manipulate the independent variables.

For example, a researcher may wish to investigate the relationship between the family background of students and their grades in school. A questionnaire may be given to students to find out their family's average income, which is then compared with their CGPAs.

At the end of the research, the researcher will discover whether these two factors are positively correlated, negatively correlated, or not correlated at all.
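As a minimal sketch of how such a correlational analysis might be run, here is a hypothetical Python example; the income and CGPA figures are made up, and the `scipy` library is assumed to be available:

```python
# Minimal correlational-research sketch with hypothetical survey data.
from scipy.stats import pearsonr

# Hypothetical responses: average family income (in thousands) and student CGPA.
family_income = [18, 25, 32, 40, 55, 60, 75, 90]
cgpa = [2.1, 2.8, 2.5, 3.0, 3.4, 3.2, 3.8, 3.9]

# Pearson's r: > 0 suggests a positive correlation, < 0 a negative one,
# and a value near 0 suggests little or no linear correlation.
r, p_value = pearsonr(family_income, cgpa)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```

Note that, as the article stresses, a nonzero r here would only describe a statistical relationship; it would say nothing about what causes what.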

Observational Research

Observational research focuses on observing the behavior of a research subject in a natural or laboratory setting. It is classified as non-experimental because it does not involve the manipulation of independent variables.

A good example of observational research is an investigation of crowd psychology in a particular group of people. Imagine a location with two ATMs, where one ATM has a long queue while the other stands unused.

The crowd effect suggests that the majority of newcomers will also avoid the unused ATM and join the queue.
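Purely as an illustration (this sketch is not part of the original example), the crowd effect could be mimicked with a small simulation, assuming each newcomer follows the crowd with some fixed probability:

```python
# Hypothetical simulation of the ATM "crowd effect" described above.
# Assumption: each newcomer joins the existing queue with probability p_follow.
import random

def simulate_newcomers(n=100, p_follow=0.8, seed=42):
    random.seed(seed)
    queued, idle = 0, 0
    for _ in range(n):
        if random.random() < p_follow:
            queued += 1  # follows the crowd to the busy ATM
        else:
            idle += 1    # uses the unused ATM
    return queued, idle

print(simulate_newcomers())  # roughly 80 join the queue, 20 do not
```

An observational researcher would, of course, record such behavior in the field rather than simulate it; the sketch only makes the expected pattern concrete.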

You will notice that each of these non-experimental research types is descriptive in nature. It therefore suffices to say that descriptive research is an example of non-experimental research.

Pros of Non-Experimental Research

  • The research process is very close to a real-life situation.
  • It avoids the ethical problems that come with manipulating variables.
  • Human characteristics are not subjected to experimental manipulation.

Cons of Non-Experimental Research

  • Because the groups are not randomly selected, they may be dissimilar and non-homogeneous, which affects the authenticity and generalizability of the study results.
  • The results obtained cannot be guaranteed to be clear and error-free.

What Are The Differences Between Experimental and Non-Experimental Research?    

  • Definitions

Experimental research is the type of research that uses a scientific approach to manipulate one or more independent variables and measure their effect on the dependent variables, while non-experimental research is the type of research that does not involve manipulating variables.

The main distinction between these two types of research is their attitude towards the manipulation of variables: experimental research allows it, while non-experimental research does not.

Examples of experimental research are laboratory experiments that mix chemical elements together to see the effect of one element on the other, while an example of non-experimental research is an investigation into the characteristics of different chemical elements.

Consider a researcher carrying out a laboratory test to determine the effect of combining nitrogen gas with hydrogen gas. Using the Haber process, the researcher may discover that the two combine to form ammonia.

Non-experimental research may then be carried out on the ammonia to determine its characteristics, behaviour, and nature.

There are three types of experimental research, namely pre-experimental, quasi-experimental, and true experimental research. Although also three in number, the non-experimental types are cross-sectional research, correlational research, and observational research.

The types of experimental research are further subdivided, while the non-experimental types are not. Clearly, these divisions are not the same in experimental and non-experimental research.

  • Characteristics

Experimental research is usually quantitative, controlled, and multivariable. Non-experimental research can be either quantitative or qualitative, involves uncontrolled variables, and often addresses cross-sectional research problems.

The characteristics of experimental research are the direct opposite of those of non-experimental research. The most distinctive element is the ability to control or manipulate independent variables in experimental research, which non-experimental research lacks.

In experimental research, a level of control is usually exerted over extraneous variables, which tampers with the natural research setting. Non-experimental research settings are usually more natural, with no tampering with the extraneous variables.

  • Data Collection/Tools

The data used in experimental research is collected through observational studies, simulations, and surveys, while non-experimental data is collected through observations, surveys, and case studies. The tools that distinguish the two are simulations and case studies.

Even then, similar tools are used differently. For example, in experimental research an observational study may be used during a laboratory experiment to test how the effect of an independent variable manifests over a period of time.

When used in non-experimental research, however, data is collected at the researcher's discretion rather than through a strictly controlled scientific procedure. In this case, we see a difference in the level of objectivity.

  • Goal of Research

The goal of experimental research is to measure the causes and effects of the variables present in the research, while non-experimental research provides very little to no information about causal agents.

Experimental research answers the question of why something is happening. Non-experimental research is quite different: it is more descriptive in nature, with the end goal being to describe what is happening.

  • Uses

Experimental research is mostly used to make scientific innovations and find major solutions to problems, while non-experimental research is used to define subject characteristics, measure data trends, compare situations, and validate existing conditions.

For example, if experimental research results in an innovative discovery or solution, non-experimental research may then be conducted to validate this discovery. Such validation research is carried out over a period of time in order to properly study the subject.

  • Advantages

The experimental research process is usually well structured and as such produces results with very few errors, while non-experimental research has the advantage of staying close to real-life situations. Each method has many more advantages, and the absence of an advantage in one leaves it at a relative disadvantage.

For example, the lack of a random selection process in non-experimental research makes it hard to arrive at generalizable results. Similarly, the ability to manipulate variables in experimental research may introduce the researcher's personal bias.

  • Disadvantages

Experimental research is highly prone to human error, while the major disadvantage of non-experimental research is that its results cannot be guaranteed to be error-free. In the long run, errors due to human error may compromise the results of experimental research.

Some other disadvantages of experimental research include the following: extraneous variables cannot always be controlled, human responses can be difficult to measure, and participants may also introduce bias.

  • Variable Manipulation

In experimental research, researchers can control and manipulate the independent variables, while in non-experimental research they cannot. This is often for ethical reasons.

For example, when promoting employees based on their annual performance reviews, it would be unethical to manipulate the review results (the independent variable). Leaving them untouched yields impartial results about who deserves a promotion and who does not.

Experimental researchers may also decide to eliminate extraneous variables so as to have enough control over the research process. Once again, this is something that cannot be done in non-experimental research because it relates more to real-life situations.

  • Research Setting

Experimental research is carried out in an unnatural setting, because most of the factors that influence the setting are controlled, while the non-experimental research setting remains natural and uncontrolled. One of the things usually tampered with during research is extraneous variables.

In a bid to get a perfect and well-structured research process and results, researchers sometimes eliminate extraneous variables. Although sometimes seen as insignificant, the elimination of these variables may affect the research results.

Consider an optimization problem whose aim is to minimize the cost of producing a car, with the constraints being the number of workers and the number of hours they spend working per day.

In this problem, extraneous variables like machine failure rates or accidents are eliminated. In the long run, these things may still occur and may invalidate the results.
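To make the example concrete, here is a minimal sketch of such an optimization with entirely made-up numbers; it deliberately leaves out extraneous variables such as machine failure, which is exactly the simplification discussed above. The `scipy` library is assumed to be available:

```python
# Hypothetical cost-minimization sketch; all figures are made up.
# Extraneous variables (machine failure, accidents) are deliberately omitted,
# mirroring the simplification discussed in the text.
from scipy.optimize import linprog

# Decision variables: x0 = number of workers, x1 = working hours per day.
# Hypothetical daily costs: 120 per worker, 80 per hour of line time.
cost = [120, 80]

# Hypothetical production requirement: 2*workers + 5*hours >= 40 labor units.
# linprog expects A_ub @ x <= b_ub, so the >= constraint is negated.
A_ub = [[-2, -5]]
b_ub = [-40]

# Bounds: at most 15 workers and at most 12 working hours per day.
result = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 15), (0, 12)])
print(result.x, result.fun)  # optimal (workers, hours) and the minimal cost
```

If a machine failure later cuts the effective hours, the "optimal" plan computed this way no longer holds, which is the point the paragraph makes about eliminated extraneous variables.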

  • Cause-Effect Relationship

The relationship between cause and effect is established in experimental research while it cannot be established in non-experimental research. Rather than establish a cause-effect relationship, non-experimental research focuses on providing descriptive results.

Although non-experimental research acknowledges the causal variable and its effect on the dependent variables, it does not measure how, or the extent to which, these dependent variables change. It does, however, observe these changes, compare the changes in two variables, and describe them.

Experimental research does not compare variables, while non-experimental research does: it compares two variables and describes the relationship between them.

The relationship between these variables can be positively correlated, negatively correlated, or not correlated at all. For example, consider a case where the subject of research is a drum, and the independent variable is the drumstick.

Experimental research will measure the effect of striking the drum with the drumstick, where the result of this research will be sound. That is, when you hit the drum with a drumstick, it makes a sound.

Non-experimental research, on the other hand, will investigate the correlation between how hard the drum is hit and the loudness of the sound that comes out. That is, whether the sound will be louder with a harder strike, quieter with a harder strike, or the same no matter how hard we hit the drum.

  • Quantitativeness

Experimental research is a quantitative research method, while non-experimental research can be either quantitative or qualitative, depending on the situation in which it is being used. An example of a quantitative non-experimental method is correlational research.

Researchers use it to correlate two or more variables using mathematical analysis methods. The original patterns, relationships, and trends between the variables are observed, and the impact of one variable on the other is recorded, along with how it changes the relationship between the two.

Observational research, another example of non-experimental research, is classified as a qualitative research method.

  • Cross-section

Experimental research is usually single-sectional, while non-experimental research is often cross-sectional. That is, when evaluating research subjects in experimental research, each group is evaluated as a single entity.

For example, let us consider a medical research process investigating the prevalence of breast cancer in a certain community. In this community, we will find people of different ages, ethnicities, and social backgrounds. 

If a significant number of women in a particular age range are found to be more prone to the disease, the researcher can conduct further studies to understand the reason behind it. Such a follow-up study would be experimental, and its subjects would no longer form a cross-sectional group.

Many researchers consider the distinction between experimental and non-experimental research to be an extremely important one, partly because experimental research can accommodate the manipulation of independent variables, which non-experimental research cannot.

Therefore, as a researcher interested in using either experimental or non-experimental research, it is important to understand the distinction between the two. This helps in deciding which method is better suited to a particular piece of research.



Chapter 7: Nonexperimental Research

Overview of Nonexperimental Research

Learning Objectives

  • Define nonexperimental research, distinguish it clearly from experimental research, and give several examples.
  • Explain when a researcher might choose to conduct nonexperimental research as opposed to experimental research.

What Is Nonexperimental Research?

Nonexperimental research  is research that lacks the manipulation of an independent variable, random assignment of participants to conditions or orders of conditions, or both.

In a sense, it is unfair to define this large and diverse set of approaches collectively by what they are not. But doing so reflects the fact that most researchers in psychology consider the distinction between experimental and nonexperimental research to be an extremely important one. This is because although experimental research can provide strong evidence that changes in an independent variable cause differences in a dependent variable, nonexperimental research generally cannot. As we will see, however, this inability does not mean that nonexperimental research is less important than experimental research or inferior to it in any general sense.

When to Use Nonexperimental Research

As we saw in Chapter 6, experimental research is appropriate when the researcher has a specific research question or hypothesis about a causal relationship between two variables—and it is possible, feasible, and ethical to manipulate the independent variable and randomly assign participants to conditions or to orders of conditions. It stands to reason, therefore, that nonexperimental research is appropriate—even necessary—when these conditions are not met. There are many situations in which this is the case.

  • The research question or hypothesis can be about a single variable rather than a statistical relationship between two variables (e.g., How accurate are people’s first impressions?).
  • The research question can be about a noncausal statistical relationship between variables (e.g., Is there a correlation between verbal intelligence and mathematical intelligence?).
  • The research question can be about a causal relationship, but the independent variable cannot be manipulated or participants cannot be randomly assigned to conditions or orders of conditions (e.g., Does damage to a person’s hippocampus impair the formation of long-term memory traces?).
  • The research question can be broad and exploratory, or it can be about what it is like to have a particular experience (e.g., What is it like to be a working mother diagnosed with depression?).

Again, the choice between the experimental and nonexperimental approaches is generally dictated by the nature of the research question. If it is about a causal relationship and involves an independent variable that can be manipulated, the experimental approach is typically preferred. Otherwise, the nonexperimental approach is preferred. But the two approaches can also be used to address the same research question in complementary ways. For example, nonexperimental studies establishing that there is a relationship between watching violent television and aggressive behaviour have been complemented by experimental studies confirming that the relationship is a causal one (Bushman & Huesmann, 2001) [1] . Similarly, after his original study, Milgram conducted experiments to explore the factors that affect obedience. He manipulated several independent variables, such as the distance between the experimenter and the participant, the participant and the confederate, and the location of the study (Milgram, 1974) [2] .

Types of Nonexperimental Research

Nonexperimental research falls into three broad categories: single-variable research, correlational and quasi-experimental research, and qualitative research. First, research can be nonexperimental because it focuses on a single variable rather than a statistical relationship between two variables. Although there is no widely shared term for this kind of research, we will call it single-variable research. Milgram’s original obedience study was nonexperimental in this way. He was primarily interested in one variable—the extent to which participants obeyed the researcher when he told them to shock the confederate—and he observed all participants performing the same task under the same conditions. The study by Loftus and Pickrell described at the beginning of this chapter is also a good example of single-variable research. The variable was whether participants “remembered” having experienced mildly traumatic childhood events (e.g., getting lost in a shopping mall) that they had not actually experienced but that the researchers asked them about repeatedly. In this particular study, nearly a third of the participants “remembered” at least one event. (As with Milgram’s original study, this study inspired several later experiments on the factors that affect false memories.)

As these examples make clear, single-variable research can answer interesting and important questions. What it cannot do, however, is answer questions about statistical relationships between variables. This detail is a point that beginning researchers sometimes miss. Imagine, for example, a group of research methods students interested in the relationship between children’s being the victim of bullying and the children’s self-esteem. The first thing that is likely to occur to these researchers is to obtain a sample of middle-school students who have been bullied and then to measure their self-esteem. But this design would be a single-variable study with self-esteem as the only variable. Although it would tell the researchers something about the self-esteem of children who have been bullied, it would not tell them what they really want to know, which is how the self-esteem of children who have been bullied compares with the self-esteem of children who have not. Is it lower? Is it the same? Could it even be higher? To answer this question, their sample would also have to include middle-school students who have not been bullied, thereby introducing another variable.

Research can also be nonexperimental because it focuses on a statistical relationship between two variables but does not include the manipulation of an independent variable, random assignment of participants to conditions or orders of conditions, or both. This kind of research takes two basic forms: correlational research and quasi-experimental research. In correlational research , the researcher measures the two variables of interest with little or no attempt to control extraneous variables and then assesses the relationship between them. A research methods student who finds out whether each of several middle-school students has been bullied and then measures each student’s self-esteem is conducting correlational research. In  quasi-experimental research , the researcher manipulates an independent variable but does not randomly assign participants to conditions or orders of conditions. For example, a researcher might start an antibullying program (a kind of treatment) at one school and compare the incidence of bullying at that school with the incidence at a similar school that has no antibullying program.

The final way in which research can be nonexperimental is that it can be qualitative. The types of research we have discussed so far are all quantitative, referring to the fact that the data consist of numbers that are analyzed using statistical techniques. In  qualitative research , the data are usually nonnumerical and therefore cannot be analyzed using statistical techniques. Rosenhan’s study of the experience of people in a psychiatric ward was primarily qualitative. The data were the notes taken by the “pseudopatients”—the people pretending to have heard voices—along with their hospital records. Rosenhan’s analysis consists mainly of a written description of the experiences of the pseudopatients, supported by several concrete examples. To illustrate the hospital staff’s tendency to “depersonalize” their patients, he noted, “Upon being admitted, I and other pseudopatients took the initial physical examinations in a semipublic room, where staff members went about their own business as if we were not there” (Rosenhan, 1973, p. 256). [3] Qualitative data has a separate set of analysis tools depending on the research question. For example, thematic analysis would focus on themes that emerge in the data or conversation analysis would focus on the way the words were said in an interview or focus group.

Internal Validity Revisited

Recall that internal validity is the extent to which the design of a study supports the conclusion that changes in the independent variable caused any observed differences in the dependent variable.  Figure 7.1  shows how experimental, quasi-experimental, and correlational research vary in terms of internal validity. Experimental research tends to be highest because it addresses the directionality and third-variable problems through manipulation and the control of extraneous variables through random assignment. If the average score on the dependent variable in an experiment differs across conditions, it is quite likely that the independent variable is responsible for that difference. Correlational research is lowest because it fails to address either problem. If the average score on the dependent variable differs across levels of the independent variable, it  could  be that the independent variable is responsible, but there are other interpretations. In some situations, the direction of causality could be reversed. In others, there could be a third variable that is causing differences in both the independent and dependent variables. Quasi-experimental research is in the middle because the manipulation of the independent variable addresses some problems, but the lack of random assignment and experimental control fails to address others. Imagine, for example, that a researcher finds two similar schools, starts an antibullying program in one, and then finds fewer bullying incidents in that “treatment school” than in the “control school.” There is no directionality problem because clearly the number of bullying incidents did not determine which school got the program. However, the lack of random assignment of children to schools could still mean that students in the treatment school differed from students in the control school in some other way that could explain the difference in bullying.

""

Notice also in  Figure 7.1  that there is some overlap in the internal validity of experiments, quasi-experiments, and correlational studies. For example, a poorly designed experiment that includes many confounding variables can be lower in internal validity than a well designed quasi-experiment with no obvious confounding variables. Internal validity is also only one of several validities that one might consider, as noted in  Chapter 5.

Key Takeaways

  • Nonexperimental research is research that lacks the manipulation of an independent variable, control of extraneous variables through random assignment, or both.
  • There are three broad types of nonexperimental research. Single-variable research focuses on a single variable rather than a relationship between variables. Correlational and quasi-experimental research focus on a statistical relationship but lack manipulation or random assignment. Qualitative research focuses on broader research questions, typically involves collecting large amounts of data from a small number of participants, and analyses the data nonstatistically.
  • In general, experimental research is high in internal validity, correlational research is low in internal validity, and quasi-experimental research is in between.

Discussion: For each of the following studies, decide which type of research design it is and explain why.

  • A researcher conducts detailed interviews with unmarried teenage fathers to learn about how they feel and what they think about their role as fathers and summarizes their feelings in a written narrative.
  • A researcher measures the impulsivity of a large sample of drivers and looks at the statistical relationship between this variable and the number of traffic tickets the drivers have received.
  • A researcher randomly assigns patients with low back pain either to a treatment involving hypnosis or to a treatment involving exercise. She then measures their level of low back pain after 3 months.
  • A college instructor gives weekly quizzes to students in one section of his course but no weekly quizzes to students in another section to see whether this has an effect on their test performance.
  • Bushman, B. J., & Huesmann, L. R. (2001). Effects of televised violence on aggression. In D. Singer & J. Singer (Eds.), Handbook of children and the media (pp. 223–254). Thousand Oaks, CA: Sage. ↵
  • Milgram, S. (1974). Obedience to authority: An experimental view . New York, NY: Harper & Row. ↵
  • Rosenhan, D. L. (1973). On being sane in insane places. Science, 179 , 250–258. ↵


Research Methods in Psychology - 2nd Canadian Edition by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


6.1 Overview of Non-Experimental Research

Learning Objectives

  • Define non-experimental research, distinguish it clearly from experimental research, and give several examples.
  • Explain when a researcher might choose to conduct non-experimental research as opposed to experimental research.

What Is Non-Experimental Research?

Non-experimental research  is research that lacks the manipulation of an independent variable. Rather than manipulating an independent variable, researchers conducting non-experimental research simply measure variables as they naturally occur (in the lab or real world).

Most researchers in psychology consider the distinction between experimental and non-experimental research to be an extremely important one. This is because although experimental research can provide strong evidence that changes in an independent variable cause differences in a dependent variable, non-experimental research generally cannot. As we will see, however, this inability to make causal conclusions does not mean that non-experimental research is less important than experimental research.

When to Use Non-Experimental Research

As we saw in the last chapter , experimental research is appropriate when the researcher has a specific research question or hypothesis about a causal relationship between two variables—and it is possible, feasible, and ethical to manipulate the independent variable. It stands to reason, therefore, that non-experimental research is appropriate—even necessary—when these conditions are not met. There are many times in which non-experimental research is preferred, including when:

  • the research question or hypothesis relates to a single variable rather than a statistical relationship between two variables (e.g., How accurate are people’s first impressions?).
  • the research question pertains to a non-causal statistical relationship between variables (e.g., is there a correlation between verbal intelligence and mathematical intelligence?).
  • the research question is about a causal relationship, but the independent variable cannot be manipulated or participants cannot be randomly assigned to conditions or orders of conditions for practical or ethical reasons (e.g., does damage to a person’s hippocampus impair the formation of long-term memory traces?).
  • the research question is broad and exploratory, or is about what it is like to have a particular experience (e.g., what is it like to be a working mother diagnosed with depression?).

Again, the choice between the experimental and non-experimental approaches is generally dictated by the nature of the research question. Recall the three goals of science are to describe, to predict, and to explain. If the goal is to explain and the research question pertains to causal relationships, then the experimental approach is typically preferred. If the goal is to describe or to predict, a non-experimental approach will suffice. But the two approaches can also be used to address the same research question in complementary ways. For example, after his original study, Milgram conducted experiments to explore the factors that affect obedience. He manipulated several independent variables, such as the distance between the experimenter and the participant, the participant and the confederate, and the location of the study (Milgram, 1974) [1] .

Types of Non-Experimental Research

Non-experimental research falls into three broad categories: cross-sectional research, correlational research, and observational research. 

First, cross-sectional research  involves comparing two or more pre-existing groups of people. What makes this approach non-experimental is that there is no manipulation of an independent variable and no random assignment of participants to groups. Imagine, for example, that a researcher administers the Rosenberg Self-Esteem Scale to 50 American college students and 50 Japanese college students. Although this “feels” like a between-subjects experiment, it is a cross-sectional study because the researcher did not manipulate the students’ nationalities. As another example, if we wanted to compare the memory test performance of a group of cannabis users with a group of non-users, this would be considered a cross-sectional study because for ethical and practical reasons we would not be able to randomly assign participants to the cannabis user and non-user groups. Rather we would need to compare these pre-existing groups which could introduce a selection bias (the groups may differ in other ways that affect their responses on the dependent variable). For instance, cannabis users are more likely to use more alcohol and other drugs and these differences may account for differences in the dependent variable across groups, rather than cannabis use per se.

Cross-sectional designs are commonly used by developmental psychologists who study aging and by researchers interested in sex differences. Using this design, developmental psychologists compare groups of people of different ages (e.g., young adults spanning from 18-25 years of age versus older adults spanning 60-75 years of age) on various dependent variables (e.g., memory, depression, life satisfaction). Of course, the primary limitation of using this design to study the effects of aging is that differences between the groups other than age may account for differences in the dependent variable. For instance, differences between the groups may reflect the generation that people come from (a cohort effect) rather than a direct effect of age. For this reason, longitudinal studies in which one group of people is followed as they age offer a superior means of studying the effects of aging. Cross-sectional designs are also commonly used to study sex differences. Since researchers cannot practically or ethically manipulate the sex of their participants, they must rely on cross-sectional designs to compare groups of men and women on different outcomes (e.g., verbal ability, substance use, depression). Using these designs, researchers have discovered that men are more likely than women to suffer from substance abuse problems, while women are more likely than men to suffer from depression. But, using this design, it is unclear what is causing these differences: are they due to environmental factors like socialization, or biological factors like hormones?

When researchers use a participant characteristic to create groups (nationality, cannabis use, age, sex), the independent variable is usually referred to as an experimenter-selected independent variable (as opposed to the experimenter-manipulated independent variables used in experimental research). Figure 6.1 shows data from a hypothetical study on the relationship between whether people make a daily list of things to do (a “to-do list”) and stress. Notice that it is unclear whether this is an experiment or a cross-sectional study because it is unclear whether the independent variable was manipulated by the researcher or simply selected by the researcher. If the researcher randomly assigned some participants to make daily to-do lists and others not to, then the independent variable was experimenter-manipulated and it is a true experiment. If the researcher simply asked participants whether they made daily to-do lists or not, then the independent variable is experimenter-selected and the study is cross-sectional. The distinction is important because if the study was an experiment, then it could be concluded that making the daily to-do lists reduced participants’ stress. But if it was a cross-sectional study, it could only be concluded that these variables are statistically related. Perhaps being stressed has a negative effect on people’s ability to plan ahead. Or perhaps people who are more conscientious are more likely to make to-do lists and less likely to be stressed. The crucial point is that what defines a study as experimental or cross-sectional is not the variables being studied, nor whether the variables are quantitative or categorical, nor the type of graph or statistics used to analyze the data. It is how the study is conducted.

Figure 6.1  Results of a Hypothetical Study on Whether People Who Make Daily To-Do Lists Experience Less Stress Than People Who Do Not Make Such Lists
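Purely as an illustration of how such a cross-sectional comparison might be analyzed, here is a minimal sketch with hypothetical stress scores; the numbers are made up, and the `scipy` library is assumed to be available:

```python
# Minimal cross-sectional comparison sketch with hypothetical stress scores.
from scipy.stats import ttest_ind

# Hypothetical stress scores (higher = more stressed) for two pre-existing,
# self-selected groups: people who make daily to-do lists and people who don't.
list_makers = [12, 15, 11, 14, 13, 10, 16, 12]
non_list_makers = [18, 16, 20, 17, 15, 19, 21, 18]

t_stat, p_value = ttest_ind(list_makers, non_list_makers)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Because group membership was selected rather than randomly assigned,
# a significant difference shows only a statistical relationship,
# not that making to-do lists causes lower stress.
```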

Second, the most common type of non-experimental research conducted in Psychology is correlational research. Correlational research is considered non-experimental because it focuses on the statistical relationship between two variables but does not include the manipulation of an independent variable.  More specifically, in correlational research , the researcher measures two continuous variables with little or no attempt to control extraneous variables and then assesses the relationship between them. As an example, a researcher interested in the relationship between self-esteem and school achievement could collect data on students’ self-esteem and their GPAs to see if the two variables are statistically related. Correlational research is very similar to cross-sectional research, and sometimes these terms are used interchangeably. The distinction that will be made in this book is that, rather than comparing two or more pre-existing groups of people as is done with cross-sectional research, correlational research involves correlating two continuous variables (groups are not formed and compared).

Third, observational research is non-experimental because it focuses on making observations of behavior in a natural or laboratory setting without manipulating anything. Milgram’s original obedience study was non-experimental in this way. He was primarily interested in the extent to which participants obeyed the researcher when he told them to shock the confederate and he observed all participants performing the same task under the same conditions. The study by Loftus and Pickrell described at the beginning of this chapter is also a good example of observational research. The variable was whether participants “remembered” having experienced mildly traumatic childhood events (e.g., getting lost in a shopping mall) that they had not actually experienced but that the researchers asked them about repeatedly. In this particular study, nearly a third of the participants “remembered” at least one event. (As with Milgram’s original study, this study inspired several later experiments on the factors that affect false memories.)

The types of research we have discussed so far are all quantitative, referring to the fact that the data consist of numbers that are analyzed using statistical techniques. But as you will learn in this chapter, many observational research studies are more qualitative in nature. In  qualitative research , the data are usually nonnumerical and therefore cannot be analyzed using statistical techniques. Rosenhan’s observational study of the experience of people in a psychiatric ward was primarily qualitative. The data were the notes taken by the “pseudopatients”—the people pretending to have heard voices—along with their hospital records. Rosenhan’s analysis consists mainly of a written description of the experiences of the pseudopatients, supported by several concrete examples. To illustrate the hospital staff’s tendency to “depersonalize” their patients, he noted, “Upon being admitted, I and other pseudopatients took the initial physical examinations in a semi-public room, where staff members went about their own business as if we were not there” (Rosenhan, 1973, p. 256) [2] . Qualitative data has a separate set of analysis tools depending on the research question. For example, thematic analysis would focus on themes that emerge in the data or conversation analysis would focus on the way the words were said in an interview or focus group.

Internal Validity Revisited

Recall that internal validity is the extent to which the design of a study supports the conclusion that changes in the independent variable caused any observed differences in the dependent variable.  Figure 6.2  shows how experimental, quasi-experimental, and non-experimental (correlational) research vary in terms of internal validity. Experimental research tends to be highest in internal validity because the use of manipulation (of the independent variable) and control (of extraneous variables) help to rule out alternative explanations for the observed relationships. If the average score on the dependent variable in an experiment differs across conditions, it is quite likely that the independent variable is responsible for that difference. Non-experimental (correlational) research is lowest in internal validity because these designs fail to use manipulation or control. Quasi-experimental research (which will be described in more detail in a subsequent chapter) is in the middle because it contains some, but not all, of the features of a true experiment. For instance, it may fail to use random assignment to assign participants to groups or fail to use counterbalancing to control for potential order effects. Imagine, for example, that a researcher finds two similar schools, starts an anti-bullying program in one, and then finds fewer bullying incidents in that “treatment school” than in the “control school.” While a comparison is being made with a control condition, the lack of random assignment of children to schools could still mean that students in the treatment school differed from students in the control school in some other way that could explain the difference in bullying (e.g., there may be a selection effect).

Figure 6.2 Internal Validity of Correlational, Quasi-Experimental, and Experimental Studies. Experiments are generally high in internal validity, quasi-experiments lower, and correlational studies lower still.

Notice also in  Figure 6.2  that there is some overlap in the internal validity of experiments, quasi-experiments, and correlational studies. For example, a poorly designed experiment that includes many confounding variables can be lower in internal validity than a well-designed quasi-experiment with no obvious confounding variables. Internal validity is also only one of several validities that one might consider, as noted in Chapter 5.

Key Takeaways

  • Non-experimental research is research that lacks the manipulation of an independent variable.
  • There are two broad types of non-experimental research. Correlational research that focuses on statistical relationships between variables that are measured but not manipulated, and observational research in which participants are observed and their behavior is recorded without the researcher interfering or manipulating any variables.
  • In general, experimental research is high in internal validity, correlational research is low in internal validity, and quasi-experimental research is in between.
Discussion: For each of the following studies, decide which type of research design it is and explain why.

  • A researcher conducts detailed interviews with unmarried teenage fathers to learn about how they feel and what they think about their role as fathers and summarizes their feelings in a written narrative.
  • A researcher measures the impulsivity of a large sample of drivers and looks at the statistical relationship between this variable and the number of traffic tickets the drivers have received.
  • A researcher randomly assigns patients with low back pain either to a treatment involving hypnosis or to a treatment involving exercise. She then measures their level of low back pain after 3 months.
  • A college instructor gives weekly quizzes to students in one section of his course but no weekly quizzes to students in another section to see whether this has an effect on their test performance.
  • Milgram, S. (1974). Obedience to authority: An experimental view . New York, NY: Harper & Row. ↵
  • Rosenhan, D. L. (1973). On being sane in insane places. Science, 179 , 250–258. ↵



What Is a Case Study? | Definition, Examples & Methods

Published on May 8, 2019 by Shona McCombes . Revised on November 20, 2023.

A case study is a detailed study of a specific subject, such as a person, group, place, event, organization, or phenomenon. Case studies are commonly used in social, educational, clinical, and business research.

A case study research design usually involves qualitative methods, but quantitative methods are sometimes also used. Case studies are good for describing, comparing, evaluating, and understanding different aspects of a research problem.

Table of contents

  • When to do a case study
  • Step 1: Select a case
  • Step 2: Build a theoretical framework
  • Step 3: Collect your data
  • Step 4: Describe and analyze the case
  • Other interesting articles

When to do a case study

A case study is an appropriate research design when you want to gain concrete, contextual, in-depth knowledge about a specific real-world subject. It allows you to explore the key characteristics, meanings, and implications of the case.

Case studies are often a good choice in a thesis or dissertation. They keep your project focused and manageable when you don’t have the time or resources to do large-scale research.

You might use just one complex case study where you explore a single subject in depth, or conduct multiple case studies to compare and illuminate different aspects of your research problem.


Step 1: Select a case

Once you have developed your problem statement and research questions, you should be ready to choose the specific case that you want to focus on. A good case study should have the potential to:

  • Provide new or unexpected insights into the subject
  • Challenge or complicate existing assumptions and theories
  • Propose practical courses of action to resolve a problem
  • Open up new directions for future research

Tip: If your research is more practical in nature and aims to simultaneously investigate an issue as you solve it, consider conducting action research instead.

Unlike quantitative or experimental research, a strong case study does not require a random or representative sample. In fact, case studies often deliberately focus on unusual, neglected, or outlying cases which may shed new light on the research problem.

Example of an outlying case study: In the 1960s, the town of Roseto, Pennsylvania was discovered to have extremely low rates of heart disease compared to the US average. It became an important case study for understanding previously neglected causes of heart disease.

However, you can also choose a more common or representative case to exemplify a particular category, experience or phenomenon.

Example of a representative case study: In the 1920s, two sociologists used Muncie, Indiana as a case study of a typical American city that supposedly exemplified the changing culture of the US at the time.

Step 2: Build a theoretical framework

While case studies focus more on concrete details than general theories, they should usually have some connection with theory in the field. This way the case study is not just an isolated description, but is integrated into existing knowledge about the topic. It might aim to:

  • Exemplify a theory by showing how it explains the case under investigation
  • Expand on a theory by uncovering new concepts and ideas that need to be incorporated
  • Challenge a theory by exploring an outlier case that doesn’t fit with established assumptions

To ensure that your analysis of the case has a solid academic grounding, you should conduct a literature review of sources related to the topic and develop a theoretical framework. This means identifying key concepts and theories to guide your analysis and interpretation.

Step 3: Collect your data

There are many different research methods you can use to collect data on your subject. Case studies tend to focus on qualitative data using methods such as interviews, observations, and analysis of primary and secondary sources (e.g., newspaper articles, photographs, official records). Sometimes a case study will also collect quantitative data.

Example of a mixed methods case study: For a case study of a wind farm development in a rural area, you could collect quantitative data on employment rates and business revenue, collect qualitative data on local people’s perceptions and experiences, and analyze local and national media coverage of the development.

The aim is to gain as thorough an understanding as possible of the case and its context.


Step 4: Describe and analyze the case

In writing up the case study, you need to bring together all the relevant aspects to give as complete a picture as possible of the subject.

How you report your findings depends on the type of research you are doing. Some case studies are structured like a standard scientific paper or thesis, with separate sections or chapters for the methods, results, and discussion.

Others are written in a more narrative style, aiming to explore the case from various angles and analyze its meanings and implications (for example, by using textual analysis or discourse analysis).

In all cases, though, make sure to give contextual details about the case, connect it back to the literature and theory, and discuss how it fits into wider patterns or debates.

Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Cite this Scribbr article


McCombes, S. (2023, November 20). What Is a Case Study? | Definition, Examples & Methods. Scribbr. Retrieved February 22, 2024, from https://www.scribbr.com/methodology/case-study/



Non-Experimental Research

28 Overview of Non-Experimental Research

Learning objectives.

  • Define non-experimental research, distinguish it clearly from experimental research, and give several examples.
  • Explain when a researcher might choose to conduct non-experimental research as opposed to experimental research.

What Is Non-Experimental Research?

Non-experimental research  is research that lacks the manipulation of an independent variable. Rather than manipulating an independent variable, researchers conducting non-experimental research simply measure variables as they naturally occur (in the lab or real world).

Most researchers in psychology consider the distinction between experimental and non-experimental research to be an extremely important one. This is because although experimental research can provide strong evidence that changes in an independent variable cause differences in a dependent variable, non-experimental research generally cannot. As we will see, however, this inability to make causal conclusions does not mean that non-experimental research is less important than experimental research. It is simply used in cases where experimental research is not able to be carried out.

When to Use Non-Experimental Research

As we saw in the last chapter , experimental research is appropriate when the researcher has a specific research question or hypothesis about a causal relationship between two variables—and it is possible, feasible, and ethical to manipulate the independent variable. It stands to reason, therefore, that non-experimental research is appropriate—even necessary—when these conditions are not met. There are many times in which non-experimental research is preferred, including when:

  • the research question or hypothesis relates to a single variable rather than a statistical relationship between two variables (e.g., how accurate are people’s first impressions?).
  • the research question pertains to a non-causal statistical relationship between variables (e.g., is there a correlation between verbal intelligence and mathematical intelligence?).
  • the research question is about a causal relationship, but the independent variable cannot be manipulated or participants cannot be randomly assigned to conditions or orders of conditions for practical or ethical reasons (e.g., does damage to a person’s hippocampus impair the formation of long-term memory traces?).
  • the research question is broad and exploratory, or is about what it is like to have a particular experience (e.g., what is it like to be a working mother diagnosed with depression?).

Again, the choice between the experimental and non-experimental approaches is generally dictated by the nature of the research question. Recall that the three goals of science are to describe, to predict, and to explain. If the goal is to explain and the research question pertains to causal relationships, then the experimental approach is typically preferred. If the goal is to describe or to predict, a non-experimental approach is appropriate. But the two approaches can also be used to address the same research question in complementary ways. For example, in Milgram’s original (non-experimental) obedience study, he was primarily interested in one variable—the extent to which participants obeyed the researcher when he told them to shock the confederate—and he observed all participants performing the same task under the same conditions. However, Milgram subsequently conducted experiments to explore the factors that affect obedience. He manipulated several independent variables, such as the distance between the experimenter and the participant, the distance between the participant and the confederate, and the location of the study (Milgram, 1974) [1].

Types of Non-Experimental Research

Non-experimental research falls into two broad categories: correlational research and observational research. 

The most common type of non-experimental research conducted in psychology is correlational research. Correlational research is considered non-experimental because it focuses on the statistical relationship between two variables but does not include the manipulation of an independent variable. More specifically, in correlational research , the researcher measures two variables with little or no attempt to control extraneous variables and then assesses the relationship between them. As an example, a researcher interested in the relationship between self-esteem and school achievement could collect data on students’ self-esteem and their GPAs to see if the two variables are statistically related.
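
As a concrete illustration, here is a minimal R sketch of the analysis such a study might run. The data are simulated and entirely hypothetical; only the final call reflects the method described above.

```r
# Hypothetical data: self-esteem scores and GPAs for 100 students.
set.seed(42)
n <- 100
self_esteem <- rnorm(n, mean = 50, sd = 10)                  # invented 0-100 scale
gpa <- 2.8 + 0.02 * (self_esteem - 50) + rnorm(n, sd = 0.4)  # loosely related GPA

# Measure the statistical relationship between the two variables,
# with no manipulation and no control of extraneous variables.
cor.test(self_esteem, gpa)  # Pearson r with a 95% CI and p-value
```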

Observational research  is non-experimental because it focuses on making observations of behavior in a natural or laboratory setting without manipulating anything. Milgram’s original obedience study was non-experimental in this way. He was primarily interested in the extent to which participants obeyed the researcher when he told them to shock the confederate and he observed all participants performing the same task under the same conditions. The study by Loftus and Pickrell described at the beginning of this chapter is also a good example of observational research. The variable was whether participants “remembered” having experienced mildly traumatic childhood events (e.g., getting lost in a shopping mall) that they had not actually experienced but that the researchers asked them about repeatedly. In this particular study, nearly a third of the participants “remembered” at least one event. (As with Milgram’s original study, this study inspired several later experiments on the factors that affect false memories).

Cross-Sectional, Longitudinal, and Cross-Sequential Studies

When psychologists wish to study change over time (for example, when developmental psychologists wish to study aging) they usually take one of three non-experimental approaches: cross-sectional, longitudinal, or cross-sequential. Cross-sectional studies involve comparing two or more pre-existing groups of people (e.g., children at different stages of development). What makes this approach non-experimental is that there is no manipulation of an independent variable and no random assignment of participants to groups. Using this design, developmental psychologists compare groups of people of different ages (e.g., young adults spanning 18-25 years of age versus older adults spanning 60-75 years of age) on various dependent variables (e.g., memory, depression, life satisfaction). Of course, the primary limitation of using this design to study the effects of aging is that differences between the groups other than age may account for differences in the dependent variable. For instance, differences between the groups may reflect the generation that people come from (a cohort effect) rather than a direct effect of age.

For this reason, longitudinal studies, in which one group of people is followed over time as they age, offer a superior means of studying the effects of aging. However, longitudinal studies are by definition more time-consuming and so require a much greater investment on the part of the researcher and the participants.

A third approach, known as cross-sequential studies, combines elements of both cross-sectional and longitudinal studies. Rather than measuring differences between people in different age groups or following the same people over a long period of time, researchers adopting this approach choose a smaller period of time during which they follow people in different age groups. For example, they might measure changes over a ten-year period among participants who at the start of the study fall into the following age groups: 20 years old, 30 years old, 40 years old, 50 years old, and 60 years old. This design is advantageous because the researcher reaps the immediate benefits of being able to compare the age groups after the first assessment. Further, by following the different age groups over time they can subsequently determine whether the original differences they found across the age groups are due to true age effects or cohort effects.
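
The measurement schedule of a cross-sequential design is easy to lay out in code. Below is a small R sketch of the design just described (the cohort ages and two-year assessment interval are assumptions for illustration).

```r
# Five cohorts defined by age at study entry, assessed every two years
# over a ten-year window.
entry_age <- c(20, 30, 40, 50, 60)
years_in  <- seq(0, 10, by = 2)

schedule <- expand.grid(entry_age = entry_age, years_in = years_in)
schedule$age_at_assessment <- schedule$entry_age + schedule$years_in

# Rows with the same age_at_assessment but different entry_age are what
# allow the design to separate true age effects from cohort effects.
head(schedule[order(schedule$entry_age, schedule$years_in), ])
```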

The types of research we have discussed so far are all quantitative, referring to the fact that the data consist of numbers that are analyzed using statistical techniques. But as you will learn in this chapter, many observational research studies are more qualitative in nature. In qualitative research, the data are usually nonnumerical and therefore cannot be analyzed using statistical techniques. Rosenhan’s observational study of the experience of people in psychiatric wards was primarily qualitative. The data were the notes taken by the “pseudopatients”—the people pretending to have heard voices—along with their hospital records. Rosenhan’s analysis consists mainly of a written description of the experiences of the pseudopatients, supported by several concrete examples. To illustrate the hospital staff’s tendency to “depersonalize” their patients, he noted, “Upon being admitted, I and other pseudopatients took the initial physical examinations in a semi-public room, where staff members went about their own business as if we were not there” (Rosenhan, 1973, p. 256) [2]. Qualitative data call for a separate set of analysis tools, chosen according to the research question. For example, thematic analysis focuses on themes that emerge in the data, while conversation analysis focuses on how words are used in an interview or focus group.

Internal Validity Revisited

Recall that internal validity is the extent to which the design of a study supports the conclusion that changes in the independent variable caused any observed differences in the dependent variable.  Figure 6.1 shows how experimental, quasi-experimental, and non-experimental (correlational) research vary in terms of internal validity. Experimental research tends to be highest in internal validity because the use of manipulation (of the independent variable) and control (of extraneous variables) help to rule out alternative explanations for the observed relationships. If the average score on the dependent variable in an experiment differs across conditions, it is quite likely that the independent variable is responsible for that difference. Non-experimental (correlational) research is lowest in internal validity because these designs fail to use manipulation or control. Quasi-experimental research (which will be described in more detail in a subsequent chapter) falls in the middle because it contains some, but not all, of the features of a true experiment. For instance, it may fail to use random assignment to assign participants to groups or fail to use counterbalancing to control for potential order effects. Imagine, for example, that a researcher finds two similar schools, starts an anti-bullying program in one, and then finds fewer bullying incidents in that “treatment school” than in the “control school.” While a comparison is being made with a control condition, the inability to randomly assign children to schools could still mean that students in the treatment school differed from students in the control school in some other way that could explain the difference in bullying (e.g., there may be a selection effect).

Figure 6.1 Internal Validity of Correlational, Quasi-Experimental, and Experimental Studies. Experiments are generally high in internal validity, quasi-experiments lower, and correlational studies lower still.

Notice also in  Figure 6.1 that there is some overlap in the internal validity of experiments, quasi-experiments, and correlational (non-experimental) studies. For example, a poorly designed experiment that includes many confounding variables can be lower in internal validity than a well-designed quasi-experiment with no obvious confounding variables. Internal validity is also only one of several validities that one might consider, as noted in Chapter 5.

  • Milgram, S. (1974). Obedience to authority: An experimental view . New York, NY: Harper & Row. ↵
  • Rosenhan, D. L. (1973). On being sane in insane places. Science, 179 , 250–258. ↵

Research that lacks the manipulation of an independent variable.

Research that is non-experimental because it focuses on the statistical relationship between two variables but does not include the manipulation of an independent variable.

Research that is non-experimental because it focuses on recording systematic observations of behavior in a natural or laboratory setting without manipulating anything.

Studies that involve comparing two or more pre-existing groups of people (e.g., children at different stages of development).

Differences between the groups may reflect the generation that people come from rather than a direct effect of age.

Studies in which one group of people is followed over time as they age.

Studies in which researchers follow people in different age groups over a shorter period of time.

Research Methods in Psychology Copyright © 2019 by Rajiv S. Jhangiani, I-Chant A. Chiang, Carrie Cuttler, & Dana C. Leighton is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Non-experimental research: What it is, overview & advantages

Non-experimental research is the type of research that lacks the manipulation of an independent variable. Instead, the researcher observes the context in which the phenomenon occurs and analyzes it to obtain information.

Unlike experimental research, where variables are manipulated and extraneous factors are controlled, non-experimental research happens when the researcher cannot control, manipulate or alter the subjects, and instead relies on interpretation or observation to reach conclusions.

This means that the method must rely on correlations, surveys, or case studies, and cannot demonstrate a true cause-and-effect relationship.

Characteristics of non-experimental research

Non-experimental research has several essential characteristics. Let’s review them to identify the most critical ones.

  • Most studies are based on events that occurred previously and are analyzed later.
  • In this method, controlled experiments are not performed for reasons such as ethics or morality.
  • No study samples are created; on the contrary, the samples or participants already exist and develop in their environment.
  • The researcher does not intervene directly in the environment of the sample.
  • This method studies the phenomena exactly as they occurred.

Types of non-experimental research

Non-experimental research can take the following forms:

Cross-sectional research: Cross-sectional research observes and analyzes data at a single point in time, and can cover various study groups or samples. This type of research is divided into:

  • Descriptive: Observes and records the values of one or more variables as they present themselves.
  • Causal: Explains the reasons for, and the relationships between, variables at a given point in time.

Longitudinal research: In a longitudinal study, researchers aim to analyze changes in, and the development of, the relationships between variables over time. Longitudinal research can be divided into:

  • Trend: Studies the changes experienced by the study group in general.
  • Group evolution: Studies the changes experienced by a smaller, more specific subgroup.
  • Panel: Analyzes individual and group changes to discover the factors that produce them.

When to use non-experimental research

Non-experimental research can be applied in the following ways:

  • When the research question concerns a single variable rather than a statistical relationship between two variables.
  • When the research question concerns a non-causal statistical relationship between variables.
  • When the research question concerns a causal relationship, but the independent variable cannot be manipulated.
  • When the research is broad and exploratory, or concerns what it is like to have a particular experience.

Advantages and disadvantages

Some advantages of non-experimental research are:

  • It is very flexible during the research process
  • The cause of the phenomenon is known, and the effect it has is investigated.
  • The researcher can define the characteristics of the study group.

Among the disadvantages of non-experimental research are:

  • The groups studied may not be representative of the entire population.
  • Errors in the methodology may occur, leading to research biases .

Non-experimental research is based on the observation of phenomena in their natural environment. In this way, they can be studied later to reach a conclusion.

Difference between experimental and non-experimental research

Experimental research involves changing variables and randomly assigning conditions to participants. As it can determine the cause, experimental research designs are used for research in medicine, biology, and social science. 

Experimental research designs have strict standards for control and establishing validity. Although they may need many resources, they can lead to very interesting results.

Non-experimental research, on the other hand, is usually descriptive or correlational, without any explicit changes made by the researcher. You simply describe the situation as it is, or describe a relationship between variables. Without any control, it is difficult to determine causal effects. Validity remains a concern in this type of research, but it relates more to the measurements than to the effects.

Whether you should choose an experimental or a non-experimental research design depends on your goals and resources.

Social Sci LibreTexts

6: Non-Experimental Research

  • Rajiv S. Jhangiani, I-Chant A. Chiang, Carrie Cuttler, & Dana C. Leighton
  • Kwantlen Polytechnic U., Washington State U., & Texas A&M U.—Texarkana

In this chapter we look more closely at non-experimental research. We begin with a general definition of non-experimental research, along with a discussion of when and why non-experimental research is more appropriate than experimental research. We then look separately at three important types of non-experimental research: cross-sectional research, correlational research and observational research.

  • 6.1: Prelude to Nonexperimental Research What do the following classic studies have in common? Stanley Milgram found that about two thirds of his research participants were willing to administer dangerous shocks to another person just because they were told to by an authority figure (Milgram, 1963). Elizabeth Loftus and Jacqueline Pickrell showed that it is relatively easy to “implant” false memories in people by repeatedly asking them about childhood events that did not actually happen to them (Loftus & Pickrell, 1995).
  • 6.2: Overview of Non-Experimental Research Most researchers in psychology consider the distinction between experimental and non-experimental research to be an extremely important one. This is because although experimental research can provide strong evidence that changes in an independent variable cause differences in a dependent variable, non-experimental research generally cannot. As we will see, however, this inability to make causal conclusions does not mean that non-experimental research is less important than experimental research.
  • 6.3: Correlational Research Correlational research is a type of non-experimental research in which the researcher measures two variables and assesses the statistical relationship (i.e., the correlation) between them with little or no effort to control extraneous variables. There are many reasons that researchers interested in statistical relationships between variables would choose to conduct a correlational study rather than an experiment.
  • 6.4: Complex Correlation As we have already seen, researchers conduct correlational studies rather than experiments when they are interested in noncausal relationships or when they are interested in causal relationships but the independent variable cannot be manipulated for practical or ethical reasons. In this section, we look at some approaches to complex correlational research that involve measuring several variables and assessing the relationships among them.
  • 6.5: Qualitative Research Quantitative researchers typically start with a focused research question or hypothesis, collect a small amount of data from a large number of individuals, describe the resulting data using statistical techniques, and draw general conclusions about some large population. Although this method is by far the most common approach to conducting empirical research in psychology, there is an important alternative called qualitative research.
  • 6.6: Observational Research Observational research is used to refer to several different types of non-experimental studies in which behavior is systematically observed and recorded. The goal of observational research is to describe a variable or set of variables. The goal is to obtain a snapshot of specific characteristics of an individual, group, or setting. Observational research is non-experimental because nothing is manipulated or controlled, and as such we cannot arrive at causal conclusions using this approach.
  • 6.7: Non-Experimental Research (Summary) Key Takeaways and Exercises for the chapter on Non-Experimental Research.

Thumbnail: An example of data produced by data dredging, showing a correlation between the number of letters in a spelling bee's winning word (red curve) and the number of people in the United States killed by venomous spiders (black curve). (CC BY 4.0 International; Tyler Vigen - Spurious Correlations ).​​​​​

Case Study vs. Experiment

What's the difference?

Case studies and experiments are both research methods used in various fields to gather data and draw conclusions. However, they differ in their approach and purpose. A case study involves in-depth analysis of a particular individual, group, or situation, aiming to provide a detailed understanding of a specific phenomenon. On the other hand, an experiment involves manipulating variables and observing the effects on a sample population, aiming to establish cause-and-effect relationships. While case studies provide rich qualitative data, experiments provide quantitative data that can be statistically analyzed. Ultimately, the choice between these methods depends on the research question and the desired outcomes.

Further Detail

Introduction

When conducting research, there are various methods available to gather data and analyze phenomena. Two commonly used approaches are case study and experiment. While both methods aim to provide insights and answers to research questions, they differ in their design, implementation, and the type of data they generate. In this article, we will explore the attributes of case study and experiment, highlighting their strengths and limitations.

A case study is an in-depth investigation of a particular individual, group, or phenomenon. It involves collecting and analyzing detailed information from multiple sources, such as interviews, observations, documents, and archival records. Case studies are often used in social sciences, psychology, and business research to gain a deep understanding of complex and unique situations.

One of the key attributes of a case study is its ability to provide rich and detailed data. Researchers can gather a wide range of information, allowing for a comprehensive analysis of the case. This depth of data enables researchers to explore complex relationships, identify patterns, and generate new hypotheses.

Furthermore, case studies are particularly useful when studying rare or unique phenomena. Since they focus on specific cases, they can provide valuable insights into situations that are not easily replicated or observed in controlled experiments. This attribute makes case studies highly relevant in fields where generalizability is not the primary goal.

However, it is important to note that case studies have limitations. Due to their qualitative nature, the findings may lack generalizability to broader populations or contexts. The small sample size and the subjective interpretation of data can also introduce bias. Additionally, case studies are time-consuming and resource-intensive, requiring extensive data collection and analysis.

An experiment is a research method that involves manipulating variables and measuring their effects on outcomes. It aims to establish cause-and-effect relationships by controlling and manipulating independent variables while keeping other factors constant. Experiments are commonly used in natural sciences, psychology, and medicine to test hypotheses and determine the impact of specific interventions or treatments.

One of the key attributes of an experiment is its ability to establish causal relationships. By controlling variables and randomly assigning participants to different conditions, researchers can confidently attribute any observed effects to the manipulated variables. This attribute allows for strong internal validity, making experiments a powerful tool for drawing causal conclusions.

Moreover, experiments often provide quantitative data, allowing for statistical analysis and objective comparisons. This attribute enhances the precision and replicability of findings, enabling researchers to draw more robust conclusions. The ability to replicate experiments also contributes to the cumulative nature of scientific knowledge.

However, experiments also have limitations. They are often conducted in controlled laboratory settings, which may limit the generalizability of findings to real-world contexts. Ethical considerations may also restrict the manipulation of certain variables or the use of certain interventions. Additionally, experiments can be time-consuming and costly, especially when involving large sample sizes or long-term follow-ups.

While case studies and experiments have distinct attributes, they can complement each other in research. Case studies provide in-depth insights and a rich understanding of complex phenomena, while experiments offer controlled conditions and the ability to establish causal relationships. By combining these methods, researchers can gain a more comprehensive understanding of the research question at hand.

When deciding between case study and experiment, researchers should consider the nature of their research question, the available resources, and the desired level of control and generalizability. Case studies are particularly suitable when exploring unique or rare phenomena, aiming for depth rather than breadth, and when resources allow for extensive data collection and analysis. On the other hand, experiments are ideal for establishing causal relationships, testing specific hypotheses, and when control over variables is crucial.

In conclusion, case study and experiment are two valuable research methods with their own attributes and limitations. Both approaches contribute to the advancement of knowledge in various fields, and their selection depends on the research question, available resources, and desired outcomes. By understanding the strengths and weaknesses of each method, researchers can make informed decisions and conduct rigorous and impactful research.

Statistics LibreTexts

1.11: Experimental and non-experimental research

  • Matthew J. C. Crump
  • Brooklyn College of CUNY

One of the big distinctions that you should be aware of is the distinction between “experimental research” and “non-experimental research”. When we make this distinction, what we’re really talking about is the degree of control that the researcher exercises over the people and events in the study.

Experimental research

The key feature of experimental research is that the researcher controls all aspects of the study, especially what participants experience during the study. In particular, the researcher manipulates or varies something (IVs), and then allows the outcome variable (DV) to vary naturally. The idea here is to deliberately vary something in the world (IVs) to see if it has any causal effects on the outcomes. Moreover, in order to ensure that there’s no chance that something other than the manipulated variable is causing the outcomes, everything else is kept constant or is in some other way “balanced” to ensure that they have no effect on the results. In practice, it’s almost impossible to think of everything else that might have an influence on the outcome of an experiment, much less keep it constant. The standard solution to this is randomization: that is, we randomly assign people to different groups, and then give each group a different treatment (i.e., assign them different values of the predictor variables). We’ll talk more about randomization later in this course, but for now, it’s enough to say that what randomization does is minimize (but not eliminate) the chances that there are any systematic differences between groups.

Let’s consider a very simple, completely unrealistic and grossly unethical example. Suppose you wanted to find out if smoking causes lung cancer. One way to do this would be to find people who smoke and people who don’t smoke, and look to see if smokers have a higher rate of lung cancer. This is not a proper experiment, since the researcher doesn’t have a lot of control over who is and isn’t a smoker. And this really matters: for instance, it might be that people who choose to smoke cigarettes also tend to have poor diets, or maybe they tend to work in asbestos mines, or whatever. The point here is that the groups (smokers and non-smokers) actually differ on lots of things, not just smoking. So it might be that the higher incidence of lung cancer among smokers is caused by something else, not by smoking per se. In technical terms, these other things (e.g. diet) are called “confounds”, and we’ll talk about those in just a moment.

In the meantime, let’s now consider what a proper experiment might look like. Recall that our concern was that smokers and non-smokers might differ in lots of ways. The solution, as long as you have no ethics, is to control who smokes and who doesn’t. Specifically, if we randomly divide participants into two groups, and force half of them to become smokers, then it’s very unlikely that the groups will differ in any respect other than the fact that half of them smoke. That way, if our smoking group gets cancer at a higher rate than the non-smoking group, then we can feel pretty confident that (a) smoking does cause cancer and (b) we’re murderers.
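
The random assignment step in that (thankfully imaginary) experiment is simple to express. A minimal R sketch, with made-up participant IDs:

```r
set.seed(1)
participants <- paste0("P", 1:20)  # hypothetical participant IDs

# Randomly divide participants into two equal groups; on average,
# randomization balances diet, occupation, and everything else we
# didn't think to measure.
group <- sample(rep(c("smoking", "non-smoking"), each = 10))
data.frame(participants, group)
```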

Non-experimental research

Non-experimental research is a broad term that covers “any study in which the researcher doesn’t have quite as much control as they do in an experiment”. Obviously, control is something that scientists like to have, but as the previous example illustrates, there are lots of situations in which you can’t or shouldn’t try to obtain that control. Since it’s grossly unethical (and almost certainly criminal) to force people to smoke in order to find out if they get cancer, this is a good example of a situation in which you really shouldn’t try to obtain experimental control. But there are other reasons too. Even leaving aside the ethical issues, our “smoking experiment” does have a few other issues. For instance, when I suggested that we “force” half of the people to become smokers, I must have been talking about starting with a sample of non-smokers, and then forcing them to become smokers. While this sounds like the kind of solid, evil experimental design that a mad scientist would love, it might not be a very sound way of investigating the effect in the real world. For instance, suppose that smoking only causes lung cancer when people have poor diets, and suppose also that people who normally smoke do tend to have poor diets. However, since the “smokers” in our experiment aren’t “natural” smokers (i.e., we forced non-smokers to become smokers; they didn’t take on all of the other normal, real life characteristics that smokers might tend to possess) they probably have better diets. As such, in this silly example they wouldn’t get lung cancer, and our experiment will fail, because it violates the structure of the “natural” world (the technical name for this is an “artifactual” result; see later).

One distinction worth making between two types of non-experimental research is the difference between quasi-experimental research and case studies . The example I discussed earlier – in which we wanted to examine incidence of lung cancer among smokers and non-smokers, without trying to control who smokes and who doesn’t – is a quasi-experimental design. That is, it’s the same as an experiment, but we don’t control the predictors (IVs). We can still use statistics to analyse the results, it’s just that we have to be a lot more careful.

The alternative approach, case studies, aims to provide a very detailed description of one or a few instances. In general, you can’t use statistics to analyse the results of case studies, and it’s usually very hard to draw any general conclusions about “people in general” from a few isolated examples. However, case studies are very useful in some situations. Firstly, there are situations where you don’t have any alternative: neuropsychology has this issue a lot. Sometimes, you just can’t find a lot of people with brain damage in a specific area, so the only thing you can do is describe those cases that you do have in as much detail and with as much care as you can. However, there are also some genuine advantages to case studies: because you don’t have as many people to study, you have the ability to invest lots of time and effort trying to understand the specific factors at play in each case. This is a very valuable thing to do. As a consequence, case studies can complement the more statistically-oriented approaches that you see in experimental and quasi-experimental designs. We won’t talk much about case studies in these lectures, but they are nevertheless very valuable tools!

(Mostly Clinical) Epidemiology with R

Chapter 6 Non-Experimental Designs

6.0.1 R packages required for this chapter

6.1 Introduction

This chapter will provide only the briefest review of non-experimental study designs, so as to try and assure that the reader has a common baseline knowledge of the advantages and limitations of each design. The excellent introductory (Gordis 2014; Rothman 2012), intermediate (Szklo and Nieto 2019) and advanced (Rothman, Greenland, and Lash 2008) epidemiology textbooks referred to in Chapter 1.2 provide more complete information. The common hierarchy of evidence based medicine (EBM) research designs is presented in this ubiquitous pyramid schema.

Figure 6.1: EBM pyramid of research designs

However, Figure 6.1 is another example of a simple heuristic which enables quick but often erroneous conclusions concerning study design. Unfortunately, as shown in this and subsequent chapters, there are no shortcuts to the evaluation of study designs, which each require individual consideration. In that respect the following pyramid is both more realistic and helpful.

Figure 3.3: Better EBM pyramid of research designs

Although one may debate whether the magnitude of the blue area overestimates the share of good designs, it does underline the importance in clinical epidemiology of assessing individual study quality for both experimental and non-experimental studies. As this book is oriented toward clinical epidemiology, experimental designs (randomized clinical trials (RCTs)) will be emphasized and discussed separately in a later chapter. RCTs have a special emphasis since they are often considered the pinnacle of research designs and greatly influence medical guidelines and consequently clinical practice.

The following is a taxonomy of the different types of study designs.

Figure 6.2: Overview of different research designs

Ecological study designs involve populations or groups of individuals as the unit of analysis, as opposed to the other observational designs where the unit of analysis is the individual. Ecological studies are especially useful for descriptive reports, the analysis of birth cohorts, when exposure is only available at the group level, or to investigate differences between populations when the between-population difference is much greater than within-population differences. For example, ecological studies would be appropriate for aggregate exposures involving air pollution, health care systems, or gun control laws. One must be careful to avoid the ecological bias that can occur because an association observed between variables on the aggregate level does not necessarily represent the association that exists at the individual level. The ecological bias is shown graphically in Figure 6.3, where the discordance of the exposure association between groups and individuals is especially strong.

Figure 6.3: Ecological bias

The real problem is cross-level reference . For an ecological study the level of measurement, level of analysis, and level of inference must all remain at the group level. Ecological studies will not be considered further in this book.

Among non-experimental designs involving individuals, there are essentially three different ways of arriving at conclusions: 1) reference to population follow-up (cohort); 2) joint assessment of exposure among cases and non-cases (case-control); 3) reference to one particular time (cross-sectional).

Since all study designs, including the non-experimental ones (cohort, case-control, cross-sectional), aim, openly or not, to estimate similar causal quantities, it could be argued that emphasizing their distinctions is somewhat artificial. In other words, do not mistake the journey (the particular study type) for the destination (the causal effect of an exposure).

This lack of randomization in observational studies is why we are continually reminded that non-experimental studies can’t provide evidence for causality. Instead we are told to talk of associations. A list of the top 2019 JAMA Internal Medicine articles as determined by Altmetric scores can be found here and is reproduced in Figure 6.4.

Figure 6.4: JAMA Internal Medicine Top 2019 Altmetric Articles

Most interestingly, 11 of the top 14 articles have “association” in their title. But are we really interested in mere associations, of the kind whereby matches in your pocket or yellow fingers are associated with lung cancer? I don’t believe so, and I suspect most people are, for obvious reasons, more interested in causality and, whether they care to admit it or not, are subconsciously interpreting association studies in this light (Hernan 2018). Otherwise there appears little justification to waste one’s time on reading these articles.

Although this is not a book on causality, and although I most definitely lack the expertise to delve deeply into this issue, I do think we should acknowledge that the heuristic that observational studies can never inform on causality is flawed, and we should think more deeply about how we can reach this elusive goal of establishing causality. Please see the following excellent references on causal inference, (Westreich 2019), (Pearl, Glymour, and Jewell 2016), (Hernán and Robins 2020) and (Pearl 2008), ordered in increasing detail and complexity.

Again, more details may be found in the above-mentioned references.

6.2 Cohort studies

A cohort is a designated/defined group of individuals followed through time, often to study the incidence of disease in the study group. Examples of sample cohorts may include occupational cohorts, specific groups at risk of a disease, or convenience samples (e.g. the Nurses or Framingham cohorts). Be careful to distinguish between study, source and target populations.

Cohort studies offer multiple potential advantages:

  • Can study exposures that are difficult or unthinkable to randomize
  • Study population is often more representative of the target population
  • Allows calculation of incidence rates
  • Time sequence is generally clear (exposure before outcome)
  • Efficient, as multiple outcomes/exposures can be assessed as new hypotheses are developed over time

Cohort studies can take two formats:

1. Concurrent (prospective) cohort studies: assembled at the present time. Advantages: measurement of exposure, outcome, and covariates is decided at baseline, and the temporal ordering of exposure and disease can be seen. Disadvantages: expensive and time-consuming.

2. Historical/non-concurrent/mixed (retrospective) cohort studies: incorporate historical time exposed (at least partially). Advantages: less expensive; can link data registries (e.g. exposure and outcome information). Disadvantages: can only use available information; possibly lower-quality measurements; data may not have been collected for research purposes.

In cohort studies, exposed/unexposed groups exist in the source population and are selected by the investigator while in an RCT, a form of closed cohort, treatment/exposure is assigned by the investigator.

Figure 6.5: Observational research design

Figure 6.6: Experimental research design

Practically, the best general approach to achieving valid causal non-experimental designs is to try to emulate the RCT (Hernan and Robins 2016) you would like to do, with special attention to the following:

  • Selection of population
  • Exposure definition (induction, duration, intensity, cumulative exposures)
  • Outcome ascertainment with minimization of loss to follow-up

6.3 Case control

Sometimes it is useful to start with the cases! Although important for all research designs, it is obviously essential for case control designs to have an unambiguous, valid case definition, preferably using objective and established criteria that avoid any misclassification or other biases. Careful distinction between incident and prevalent cases is also of prime importance. Where the cases are found is a function of the particular research question and setting. Potential sources include hospital registers, vital records (births/deaths), national registries (e.g., for cancer, diabetes) and community clinics.

After case identification, the most important and difficult next step is the selection of the controls. Consideration of a counterfactual model can help operationalize the choice of controls. Controls are drawn from a sample of individuals without the disease, selected from the same reference population that originated the cases and who go through the same selection processes as the cases. Moreover, controls must be individuals who, if they had developed the case disease, would have been included in the case group.

Case control studies may be conducted with an open or closed study population. In a dynamic (open) population, there are two options for selecting controls; i) if the population is in a steady-state, sample randomly from the person-time distribution ii) if not, controls may be selected at the same time as cases occur (i.e., “matched on time”). In a closed study population, there are three options for selecting controls; i) at the end of follow-up ii) at the beginning of follow-up iii) throughout follow-up as cases occur (“matched on time”). Analytically, these distinctions lead to different effect measures, each of which (under various assumptions) parallels an equivalent measure from a full-cohort study.

Figure 6.7: Case control sampling times

  • Sample point 1 - This is the classic “ case based ” sampling (AKA “exclusive,” “cumulative”) that occurs at the end of follow-up. In this case, the incident odds ratio (OR) ≈ risk ratio (RR) (under the rare disease assumption)
  • Sample point 2 - This is “ case-cohort ” sampling (AKA “inclusive”) that occurs at the beginning of follow-up. In this case, the incident odds ratio (OR) ≈ risk ratio (RR) (rare disease assumption not required)
  • Sample point 3 - “ Nested ” sampling (AKA “incidence density”) from the distribution of exposed person-time matched during follow-up. In this case, the incident odds ratio (OR) ≈ rate ratio (RR)

The efficiency of the case control design comes from taking a sample, and not all, of the controls. Under that logic, it may be reasonably asked why not take only a sample of the cases? Consider the following example.

The OR is 2.11 with 95% CI 1.65 - 2.71. The OR is fairly close to the incident risk ratio for the full cohort (RR = 2) since the rare disease assumption is approximately true, with about 10% incidence. If we select only 1/10 of the controls, the OR is 2.11 with 95% CI 1.54 - 2.89, a trivial difference. On the other hand, if we take a 1/10 sample of the cases, the OR remains unbiased at 2.11 but the 95% CI 0.96 - 4.63 is much larger. It is this lack of precision that mandates the inclusion of all cases in a case-control design. This is easily understood when it is recalled that the standard error of the estimated (log) OR is \[se(\log\hat{OR}) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}\] so the largest component comes from the smallest cell entries, and the SE is minimized by taking all the cases.
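
As an illustration, the OR, the standard error of its logarithm, and a 95% confidence interval can be computed directly from the four cells. The counts below are hypothetical, not the ones behind the figures quoted above.

```r
# Hypothetical 2x2 table: a/c = exposed/unexposed cases,
#                         b/d = exposed/unexposed controls.
a <- 150; b <- 850; c <- 75; d <- 925

or        <- (a * d) / (b * c)
se_log_or <- sqrt(1/a + 1/b + 1/c + 1/d)  # dominated by the smallest cells
ci        <- exp(log(or) + c(-1.96, 1.96) * se_log_or)

round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
# OR ~ 2.18 with 95% CI ~ 1.62 - 2.92 for these made-up counts
```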

The following figure from ( Knol et al. 2008 ) is a useful summary of the effect measures available from case control studies depending on the nature of the cases (prevalent or incident; level 1), the type of source population (fixed cohort or dynamic population; level 2), the sampling design used to select controls (level 3), and the underlying assumptions (level 4).

Figure 6.8: Effect measures from case control designs

In summary, case control studies have the advantages of being faster to perform and less expensive to conduct than cohort studies, but care must be exercised that they, like all study designs, are carefully performed. Proper control selection is essential, and controls must come from the same target population as the cases (easiest when performed within an established cohort). Controls must be sampled independently of exposure, and precision improves with more controls per case (ratios of 1, 2, 3, 4) but with diminishing returns (SE 0.167, 0.145, 0.138, 0.134). Effect measure precision is improved by taking all the cases. Although case control studies are susceptible to recognized biases (Berkson, recall, incidence/prevalence), these can be avoided with the necessary care. The routine placement of case-control studies below cohort studies in hierarchies of study designs is not well-founded.
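
The diminishing returns from adding controls follow directly from the standard error formula above. A quick R sketch with hypothetical balanced counts (not the data behind the SEs quoted above, so the values differ slightly, but the pattern is the same):

```r
se_log_or <- function(a, b, c, d) sqrt(1/a + 1/b + 1/c + 1/d)

# 144 exposed and 144 unexposed cases, with k controls per case.
k <- 1:4
round(sapply(k, function(m) se_log_or(144, 144 * m, 144, 144 * m)), 3)
# 0.167 0.144 0.136 0.132 -- most of the gain comes from 1 to 2 controls per case
```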

An interesting variant is the case crossover design, where each case serves as its own control, thereby minimizing confounding by time-invariant factors, whether observed or unobserved. Exposures must vary over time but have a short induction time, a transient effect (e.g., triggers for heart attacks, asthma episodes) and no cumulative effects. This design is the observational analogue of the crossover randomized trial.

A final variant is the case series without any controls. While the early identification of cases may prove sentinel for a new disease (see the case series that led to the first identification of the AIDS epidemic), the inferential strength of this design is limited due to the lack of any suitable comparator. Moreover, arbitrary selection of cases and an embellished narrative can lead to an undervaluing of scientific evidence and great public health danger (see the case series (Wakefield et al. 1998), later retracted, at the genesis of the vaccine-autism falsehood).

6.4 Cross sectional

Cross sectional studies are most useful for descriptive epidemiology with a primary goal of estimating disease prevalence. As no follow-up is required, cross sectional studies are fast, efficient and can enroll large numbers of participants. However they have little value for causal inference as they provide no information on timing of outcome relative to exposure (temporality) and include only those individuals alive at the time of the study, thereby introducing a prevalence-incidence bias. Due to these limitations, this study design has little value in clinical epidemiology and will not be discussed further.

6.5 Miscellaneous designs

There are, of course, many variants and other miscellaneous non-experimental designs, including difference-in-differences (DID), regression discontinuity, and quasi-experimental designs, to name but a few.

Conceptually, the DID design can best be thought of as a combination of a before & after comparison and a comparison between treated & untreated individuals; such designs are therefore also known as controlled before and after studies. These studies minimize bias due to pretreatment differences in outcomes, allow for flexible control of time-invariant confounders, and are preferable to an uncontrolled before and after comparison of only treated individuals.
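
A minimal sketch of the DID arithmetic, using hypothetical group means:

```r
# Mean outcome in each group before and after the intervention
# (all values invented for illustration).
treated_before <- 20; treated_after <- 15
control_before <- 22; control_after <- 21

# The control group's change estimates what would have happened to the
# treated group without treatment (the parallel-trends assumption).
did <- (treated_after - treated_before) - (control_after - control_before)
did  # -4: the estimated effect of the intervention
```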

Quasi-experimental designs refer to approaches to effect estimation in which investigators identify (or create) a source of variation in the exposure which is unrelated to the rest of the causal system under study—including the outcome (except through the exposure itself) and the confounders. A classic historical example is John Snow’s cholera work where which household received “dirtier” water from the Southwark and Vauxhall company or “cleaner” water from the Lambeth company was a quasi-random event. The company can thus be seen as an instrumental variable, similar to randomization. Regression discontinuity designs are a special subset of quasi-experimental designs where subjects just above and below a given threshold are essentially identical on all observed and unobserved characteristics yet are arbitrarily assigned different therapies.

Evidence Based Practice: Study Designs & Evidence Levels

Introduction

This section reviews some research definitions and provides commonly used evidence tables.

Levels of Evidence: Johns Hopkins Nursing Evidence-Based Practice

Dang, D., & Dearholt, S. (2017). Johns Hopkins nursing evidence-based practice: model and guidelines. 3rd ed. Indianapolis, IN: Sigma Theta Tau International. www.hopkinsmedicine.org/evidence-based-practice/ijhn_2017_ebp.html

Identifying the Study Design

The type of study can generally be figured out by looking at three issues:

Q1. What was the aim of the study?

  • To simply describe a population (PO questions)  = descriptive
  • To quantify the relationship between factors (PICO questions)  =  analytic.

Q2. If analytic, was the intervention randomly allocated?

  • Yes?  =  RCT 
  • No? = Observational study  

For an observational study, the main type will then depend on the timing of the measurement of outcome, so our third question is:

Q3. When were the outcomes determined?

  • Some time after the exposure or intervention? = Cohort study ('prospective study')
  • At the same time as the exposure or intervention? = Cross sectional study or survey
  • Before the exposure was determined? = Case-control study ('retrospective study' based on recall of the exposure)

Centre for Evidence-Based Medicine (CEBM)
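
Taken together, the three questions form a small decision tree. The toy R helper below is our own illustration; its function name, arguments, and labels are assumptions, not part of the CEBM material.

```r
# Toy decision helper implementing Q1-Q3 above.
identify_design <- function(aim = c("describe", "quantify"),
                            randomized = FALSE,
                            outcome_timing = c("after", "same", "before")) {
  aim <- match.arg(aim)
  outcome_timing <- match.arg(outcome_timing)
  if (aim == "describe") return("descriptive study")      # Q1: PO question
  if (randomized) return("randomized controlled trial")   # Q2
  switch(outcome_timing,                                  # Q3
         after  = "cohort study ('prospective')",
         same   = "cross-sectional study or survey",
         before = "case-control study ('retrospective')")
}

identify_design("quantify", randomized = FALSE, outcome_timing = "before")
# [1] "case-control study ('retrospective')"
```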

Definitions of Study Types

Case report / Case series:  A report on a series of patients with an outcome of interest. No control group is involved.

Case control study:  A study which involves identifying patients who have the outcome of interest (cases) and patients without the same outcome (controls), and looking back to see if they had the exposure of interest.

Cohort study:  Involves identification of two groups (cohorts) of patients, one which received the exposure of interest, and one which did not, and following these cohorts forward for the outcome of interest.

Randomized controlled clinical trial:  Participants are randomly allocated into an experimental group or a control group and followed over time for the variables/outcomes of interest.

Systematic review:  A summary of the medical literature that uses explicit methods to perform a comprehensive literature search and critical appraisal of individual studies and that uses appropriate statistical techniques to combine these valid studies.

Meta-analysis:  A systematic review that uses quantitative methods to synthesize and summarize the results.

Meta-synthesis: A systematic approach to the analysis of data across qualitative studies. -- EJ Erwin, MJ Brotherson, JA Summers. Understanding Qualitative Meta-synthesis. Issues and Opportunities in Early Childhood Intervention Research, 33(3) 186-200 .

Cross sectional study:  The observation of a defined population at a single point in time or time interval. Exposure and outcome are determined simultaneously.

Prospective, blind comparison to a gold standard:  Studies that show the efficacy of a diagnostic test are also called prospective, blind comparison to a gold standard study. This is a controlled trial that looks at patients with varying degrees of an illness and administers both diagnostic tests — the test under investigation and the “gold standard” test — to all of the patients in the study group. The sensitivity and specificity of the new test are compared to that of the gold standard to determine potential usefulness.

Qualitative research:  answers a wide variety of questions related to human responses to actual or potential health problems. The purpose of qualitative research is to describe, explore and explain the health-related phenomena being studied.

Retrospective cohort:  follows the same direction of inquiry as a cohort study.  Subjects begin with the presence or absence of an exposure or risk factor and are followed until the outcome of interest is observed.  However, this study design uses information that has been collected in the past and kept in files or databases.  Patients are identified for exposure or non-exposures and the data is followed forward to an effect or outcome of interest.

(Adapted from CEBM's Glossary and Duke Libraries' Intro to Evidence-Based Practice )

American Association of Critical-Care Nurses (AACN) -- Levels of Evidence

Figure: AACN Evidence Levels Pyramid

Level A   Meta-analysis of multiple controlled studies or meta-synthesis of qualitative studies with results that consistently support a specific action, intervention or treatment

Level B  Well designed controlled studies, both randomized and nonrandomized, with results that consistently support a specific action, intervention, or treatment

Level C   Qualitative studies, descriptive or correlational studies, integrative reviews, systematic reviews, or randomized controlled trials with inconsistent results

Level D Peer-reviewed professional organizational standards, with clinical studies to support recommendations

Level E Theory-based evidence from expert opinion or multiple case reports

Level M  Manufacturers’ recommendations only  

Armola RR, Bourgault AM, Halm MA, Board RM, Bucher L, Harrington L, Heafey CA, Lee R, Shellner PK, Medina J. (2009). AACN levels of evidence: what's new? Crit Care Nurse, 29(4), 70-73.

Flow Chart of Study Designs

Figure: Flow chart of different types of studies. (Q1, Q2, and Q3 refer to the three questions above in the "Identifying the Study Design" box.) Centre for Evidence-Based Medicine (CEBM)

What is a "Confidence Interval (CI)"?

A confidence interval (CI) shows the interval within which the population's mean score will probably fall. Most researchers use a CI of 95%. By using a CI of 95%, researchers accept a 5% chance of drawing the wrong conclusion about the treatment. Therefore, if 0 falls within the agreed CI for the difference between two treatments, it can be concluded that there is no significant difference between them. When 0 lies outside the CI, researchers conclude that there is a statistically significant difference.

Halfens, R. G., & Meijers, J. M. (2013). Back to basics: an introduction to statistics.  Journal Of Wound Care ,  22 (5), 248-251.
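To make the arithmetic concrete, here is a minimal Python sketch (healing-time data invented for illustration; numpy and scipy assumed available) that computes a 95% CI for the difference in mean outcomes between two treatments and checks whether 0 lies inside it.

```python
import numpy as np
from scipy import stats

# Hypothetical healing times (days) under two wound treatments
treatment_a = np.array([21, 25, 19, 30, 24, 22, 27, 26])
treatment_b = np.array([24, 28, 23, 31, 27, 29, 25, 30])

diff = treatment_a.mean() - treatment_b.mean()
se = np.sqrt(treatment_a.var(ddof=1) / len(treatment_a)
             + treatment_b.var(ddof=1) / len(treatment_b))
z = stats.norm.ppf(0.975)                # ~1.96 for a 95% CI
lo, hi = diff - z * se, diff + z * se

print(f"difference = {diff:.2f} days, 95% CI = ({lo:.2f}, {hi:.2f})")
# If 0 lies inside the CI, the difference is not statistically significant
# at the 5% level.
```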

What is a "p-value?"

Categorical (nominal) tests. This category of tests can be used when the dependent, or outcome, variable is categorical (nominal), such as the difference between two wound treatments and the healing of the wound (healed versus non-healed). One of the most used tests in this category is the chi-squared test (χ2). The chi-squared statistic is calculated by comparing the differences between the observed and the expected frequencies. The expected frequencies are the frequencies that would be found if there was no relationship between the two variables.

Based on the calculated χ2 statistic, a probability (p-value) is given, which indicates how likely a difference at least as large as the one observed would be if there were truly no relationship between the variables. Researchers are often satisfied if this probability is 5% or less, so for p < 0.05 they conclude that there is a significant difference. A p-value ≥ 0.05 suggests that there is no significant difference between the groups.

Halfens, R. G., & Meijers, J. M. (2013). Back to basics: an introduction to statistics. Journal Of Wound Care, 22(5), 248-251.
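As an illustration, the following Python sketch runs a chi-squared test on a hypothetical 2×2 table of treatment by healing status (counts invented; scipy assumed available).

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = treatment A / treatment B,
# columns = healed / not healed
table = [[30, 10],
         [20, 20]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
print("Expected frequencies if there were no relationship:")
print(expected)
# p < 0.05 would be read as a significant association between
# treatment and healing.
```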

Nonexperimental research: strengths, weaknesses and issues of precision

European Journal of Training and Development

ISSN : 2046-9012

Article publication date: 6 September 2016

Nonexperimental research, defined as any kind of quantitative or qualitative research that is not an experiment, is the predominant kind of research design used in the social sciences. How to unambiguously and correctly present the results of nonexperimental research, however, remains decidedly unclear and possibly detrimental to applied disciplines such as human resource development. To clarify issues about the accurate reporting and generalization of nonexperimental research results, this paper aims to present information about the relative strength of research designs, followed by the strengths and weaknesses of nonexperimental research. Further, some possible ways to more precisely report nonexperimental findings without using causal language are explored. Next, the researcher takes the position that the results of nonexperimental research can be used cautiously, yet appropriately, for making practice recommendations. Finally, some closing thoughts about nonexperimental research and the appropriate use of causal language are presented.

Design/methodology/approach

A review of the extant social science literature was consulted to inform this paper.

Findings

Nonexperimental research, when reported accurately, makes a tremendous contribution because it can be used for conducting research when experimentation is not feasible or desired. It can also be used to make tentative recommendations for practice.

Originality/value

This article presents useful means to more accurately report nonexperimental findings through avoiding causal language. Ways to link nonexperimental results to making practice recommendations are explored.

Keywords

  • Research design
  • Experimental design
  • Causal inference
  • Nonexperimental
  • Social science research
  • Triangulation

Reio, T.G. (2016), "Nonexperimental research: strengths, weaknesses and issues of precision", European Journal of Training and Development , Vol. 40 No. 8/9, pp. 676-690. https://doi.org/10.1108/EJTD-07-2015-0058




What is non-experimental research: Definition, types & examples

Defne Çobanoğlu

The experimental method is very useful for getting information on a specific subject. However, when experimenting is not possible or practical, there is another way of collecting data: non-experimental research.

In this article, we have gathered information on non-experimental research, clearly defined what it is and when one should use it, and listed the types of non-experimental research. We also gave some useful examples to paint a better picture. Let us get started. 

  • What is non-experimental research?

Non-experimental research is a type of research design that is based on observation and measurement instead of experimentation with randomly assigned participants.

What characterizes this research design is the fact that it lacks the manipulation of independent variables. Because of this, non-experimental research examines naturally occurring conditions, with no external intervention. Researchers using this method therefore rely heavily on interviews, surveys, and case studies.

  • When to use non-experimental research?

An experiment is done when a researcher is investigating the relationship between two phenomena and has a theory or hypothesis about the relationship between the variables involved. The researcher can carry out an experiment when it is ethical, possible, and feasible to do one.

However, when an experiment cannot be done because of a limitation, researchers opt for a non-experimental research design. Non-experimental research is considered preferable in some conditions, including:

  • When the manipulation of the independent variable is not possible because of ethical or practical concerns
  • When the subjects of an experimental design cannot be randomly assigned to treatments
  • When the research question is too extensive or relates to a general experience
  • When researchers want to conduct preliminary research before investing in a more extensive study
  • When the research question is about the statistical relationship between variables, but in a noncausal context
  • Characteristics of non-experimental research

Non-experimental research has some characteristics that clearly define the framework of this research method. They provide a clear distinction between experimental design and non-experimental design. Let us see some of them:

  • Non-experimental research does not involve the manipulation of variables .
  • The aim of this research type is to explore the factors as they naturally occur .
  • This method is used when experimentation is not possible because of ethical or practical reasons .
  • Instead of creating a sample or participant group, the existing groups or natural thresholds are used during the research.
  • This research method is not about finding causality between two variables.
  • Most studies are done on past events or historical occurrences to make sense of specific research questions.
  • Types of non-experimental research

Figure: Non-experimental research types

What makes research non-experimental is that the researcher does not manipulate the factors, does not randomly assign the participants, and observes existing groups. But this research method can be divided into different types:

Correlational research:

In correlation studies, the researcher does not manipulate the variables and is not interested in controlling the extraneous variables. They only observe and assess the relationship between them. For example, a researcher examines students’ daily study hours and their overall academic performance. A positive correlation between study hours and academic performance suggests a statistical association, not a causal effect.
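A minimal Python sketch of such a correlational analysis, with invented data for eight students (scipy assumed available); note that the correlation coefficient describes an association only.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical observations: daily study hours and exam scores
study_hours = np.array([1.0, 2.5, 3.0, 1.5, 4.0, 2.0, 3.5, 0.5])
exam_scores = np.array([55, 64, 78, 60, 85, 62, 80, 50])

r, p = pearsonr(study_hours, exam_scores)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")
# A positive r indicates that higher study hours co-occur with higher
# scores; it does not show that studying causes the higher scores.
```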

Quasi-experimental research:

In quasi-experimental research, the researcher does not randomly assign the participants into two groups. Because one cannot deliberately deprive someone of treatment, the researcher uses natural thresholds or dividing points instead. For example, examining students from two different high schools with different education methods.

Cross-sectional research:

In cross-sectional research, the researcher studies and compares a portion of the population at a single point in time. It does not involve random assignment or any outside manipulation. For example, a study comparing smokers and non-smokers in a specific area.

Observational research:

In observational research, the researcher once again does not manipulate any aspect of the study; their main focus is observation of the participants. For example, a researcher might observe a group of children playing in a playground.

  • Non-experimental research examples

Non-experimental research is a good way of collecting information and exploring relationships between variables. It can be used in numerous fields, including social sciences, economics, psychology, education, and market research. When gathering information through secondary research is not enough and an experiment cannot be done, this method can bring out new information.

Non-experimental research example #1

Imagine a researcher who wants to see the connection between mobile phone usage before bedtime and the amount of sleep adults get in a night. They can gather a group of individuals to observe and ask them about the details of their day: frequency and duration of phone usage, quality of sleep, etc. Then they can analyze the findings to look for patterns.

Non-experimental research example #2

Imagine a researcher who wants to explore the correlation between job satisfaction levels among employees and the factors that affect it. The researcher can gather information about the employees’ ages, sexes, positions in the company, working patterns, demographic details, etc.

This research provides the researcher with the information needed to identify correlations and patterns. Then, it is possible for researchers and administrators to make informed predictions.
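As a sketch, such exploratory correlations across employee records can be computed with pandas; the column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical employee records (all columns illustrative)
df = pd.DataFrame({
    "age":          [25, 32, 41, 29, 50, 38, 45, 27],
    "tenure_years": [1, 4, 10, 2, 20, 8, 12, 1],
    "weekly_hours": [40, 45, 38, 50, 37, 42, 40, 48],
    "satisfaction": [6, 7, 8, 4, 9, 7, 8, 5],   # 1-10 self-report
})

# Pairwise correlations of each factor with reported job satisfaction
print(df.corr()["satisfaction"].drop("satisfaction")
        .sort_values(ascending=False))
```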

  • Frequently asked questions about non-experimental research

When not to use non-experimental research?

There are some situations where non-experimental research is not suitable or the best choice. For example, non-experimental research does not establish causality; therefore, if the researcher wants to test a cause-and-effect relationship between two variables, this method is not for them. Also, if control over the variables is essential to testing a theory, then experimentation is the more appropriate option.

What is the difference between experimental and non-experimental research?

Experimental research is an example of primary research where the researcher takes control of all the variables, randomly assigns the participants into different groups, and studies them in a pre-determined environment to test a hypothesis. 

By contrast, non-experimental research does not intervene in any way; the researcher only observes and studies the participants in their natural environment to make sense of a phenomenon.

What makes a quasi-experiment a non-experiment?

Like true experimentation, quasi-experimental research aims to explore a cause-and-effect relationship between independent and dependent variables. However, in quasi-experimental research, the participants are not randomly selected; they are assigned to groups based on non-random criteria.

Is a survey a non-experimental study?

Yes, as the main purpose of a survey or questionnaire is to collect information from participants without outside interference, a survey is a non-experimental study. Surveys are used by researchers when experimentation is not possible for ethical reasons but first-hand data is needed.

What is non-experimental data?

Non-experimental data is data collected by researchers using non-experimental methods such as observation, interpretation, and interaction. Non-experimental data can be either qualitative or quantitative, depending on the situation.

Advantages of non-experimental research

Non-experimental research has positive sides that a researcher should keep in mind when planning a study. These advantages are:

  • It is used to observe and analyze past events .
  • This method is more affordable than a true experiment .
  • As the researcher can adapt the methods during the study, this research type is more flexible than an experimental study.
  • This method allows the researchers to answer specific questions .

Disadvantages of non-experimental research

Even though non-experimental research has its advantages, it also has some disadvantages a researcher should be mindful of. Here are some of them:

  • The findings of non-experimental research cannot be generalized to the whole population. Therefore, it has low external validity .
  • This research is used to explore only a single variable .
  • Non-experimental research designs are prone to researcher bias and may not produce neutral results.
  • Final words

A non-experimental study differs from an experimental study in that there is no intervention or manipulation of internal or extraneous elements. It is a smart way to collect information without the ethical or practical limitations of experimentation. When you cannot do proper experimentation, your other option is to study existing conditions and groups to draw conclusions. This is a non-experimental design.

In this article, we have gathered information on non-experimental research to shed light on the details of this research method. If you are thinking of doing a study, make sure to have this information in mind. And lastly, do not forget to visit our articles on other research methods and so much more!

Defne is a content writer at forms.app. She is also a translator specializing in literary translation. Defne loves reading, writing, and translating professionally and as a hobby. Her expertise lies in survey research, research methodologies, content writing, and translation.



Nonexperimental Comparative Effectiveness Research Using Linked Healthcare Databases

Comparative Effectiveness Research (CER) has gained a great deal of attention over the past year through the new federal coordinating council, 1 the recent Institute of Medicine (IOM) report, 2 and the American Recovery & Reinvestment Act (ARRA) stimulus funding. 3 CER has a broad scope as defined by the IOM, addressing “…the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policymakers to make informed decisions that will improve health care at both the individual and population levels.” 2

So what’s new? As pharmacoepidemiologists, we could point out that we have generated evidence on health-relevant drug benefits and harms at the population level for over 25 years (the International Society for Pharmacoepidemiology just had its 26th International Conference). And we could point out that we had moved from using untreated patients as a comparison group when assessing drug effects towards using patients treated with a realistic clinical alternative as a comparator group well before CER (e.g., 4 ). And we acknowledge that CER is much broader than pharmacoepidemiology. What is really new for us is the implicit acknowledgement of the need for and value of nonexperimental evidence for benefits. Until recently (e.g., 5 ), the Food and Drug Administration has largely dismissed nonexperimental evidence on drug benefits (e.g., 6 ), mainly because of fear about intractable confounding by indication. 7 , 8 The Food and Drug Administration also insists on comparing treatments with untreated (including placebo) and does not accept the idea of a comparator drug for proof of efficacy. 9 It is interesting to see that other government agencies, including the Agency for Healthcare Research and Quality through their network of DEcIDE (Developing Evidence to Inform Decisions about Effectiveness) Centers, 10 see this differently. It is also encouraging to see that some divisions of the Food and Drug Administration, e.g., the Division of Epidemiology in the Office of Surveillance and Biometrics, Center for Devices and Radiological Health, are pioneers in recognizing the value of nonexperimental research. 5

When we focus on the nonexperimental evaluation of the use and beneficial and harmful effects of drugs in the population (i.e., pharmacoepidemiology), we have to acknowledge some specific aspects of drugs (Greek: pharmacon) that we need to take into account. Drugs are the mainstay of contemporary medicine. They are used for the primary, secondary, and tertiary prevention of disease outcomes. A multibillion-dollar industry constantly screens chemical compounds for physiologic effects. Drugs are marketed only after experimental proof of efficacy (benefits) in humans. They are therefore likely to affect disease outcomes.

Most prescription drugs are dispensed at pharmacies which then submit claims to payors, such as Medicare, Medicaid or private insurance companies. Likewise, physician visits, hospital stays, laboratory test results, procedures, injections, and other encounters with the healthcare system each generate a paper trail (or the electronic equivalent thereof). By linking these data across various sources (insurance claims, electronic health records, clinical records, billing data, laboratory results, vital statistics), an integrated picture of the patient’s health and healthcare emerges. After being stripped of key identifiers, researchers can obtain permission to access these data with appropriate confidentiality safeguards. 11

These databases have unique advantages for epidemiologic research. Most are population-based and therefore less prone to the healthy (given the target) selection almost inherent in recruiting and consenting participants for randomized trials or cohort studies. 12 Linked healthcare databases include continuous service dates rather than the interval assessments (e.g., every two years) that are common in epidemiologic cohort studies and many large trials. Continuous assessment of exposure and outcomes allows us to be specific about timing, an important consideration given that the exposure coming before the outcome is arguably the only sine qua non condition for causality. 13 Linked healthcare databases contain information on almost all drugs prescribed or dispensed in an outpatient setting. And they include codes for outpatient and inpatient diagnoses and procedures that the patient has received. 11 They also have major downsides. Without pretending to be exhaustive (this is neither within the scope of this commentary nor necessary for our argument) these include lack of data on important confounders, lack of data on drugs administered during hospitalization or purchased over the counter, lack of mortality data (in some databases), lack of data on the sensitivity and specificity of various algorithms to define outcomes, and lack of data on events not covered by the corresponding insurance plan. 11

Let us now step back from linked healthcare databases and highlight several important threats to the validity of nonexperimental research on drug effects in general. In addition to other sources of confounding, there is the potential for confounding by indication. Sicker patients are almost always more likely to be treated and are often more likely to have bad outcomes. If the severity of disease is unknown or measured with error, residual confounding will make the drug look bad. Recent work has focused on another kind of unmeasured confounding, confounding by frailty. Frailty is a difficult-to-measure condition near the end of life that is not linked to a specific pathology but rather to an overall (poor) state of health, and it is probably easily recognized by trained physicians. Unlike study populations in many other nonexperimental studies and trials, study populations assembled from large linked healthcare databases include very frail patients. Frailty may reduce the likelihood of a particular treatment if physicians focus on a patient’s main medical problem and do not initiate useful therapies for secondary conditions. 14 The practitioner may determine that in the presence of competing risks a therapy offers little expected benefit. 15 Because frailty is hard to measure and a very strong risk factor for poor outcomes (especially mortality), it will also lead to unmeasured and residual confounding. When comparing the treated with the untreated, however, frailty will often tend to make the drug look good. Frailty is a plausible explanation for paradoxical treatment-outcome associations observed in the elderly. 16 – 18

Besides confounding by unmeasured confounders and residual confounding, selection bias over time on treatment is a major problem when assessing longer-term effects of drugs. Patients who adhere to treatments over prolonged periods of time tend to be healthier. Conversely, patients stopping treatments tend to be sicker. This leads to increasingly healthy users with increasing duration of treatment. 19 , 20 Similar to confounding by frailty, selection bias from healthy users again tends to be most pronounced for all-cause mortality. 21 Those not adherent to placebo in randomized trials have been shown to have twice the mortality rate of those adherent to placebo. 22 And there is the potential for immortal time bias. 23

None of these threats to validity is specific to pharmacoepidemiology or linked healthcare databases. For example, selection bias is a major issue in occupational epidemiology (healthy worker bias, e.g., 24 ). Furthermore, only the potential for major confounding is specific to nonexperimental study designs. The lack of an inherent sampling structure compared with ad hoc studies may, however, increase the risk for flawed designs.

Pharmacoepidemiologic methodologists have made substantial progress in recent years addressing the above-mentioned threats to validity. Recent developments include the new user design, 25 which allows us to focus on treatment initiation decision processes and hypothetical interventions. The new user design allows us to implement propensity scores 26 and instrumental variables 27 , and to assess positivity 28 , 29 and treatment contrary to prediction 18 . All of these can be used to balance cohorts, among patients with some equipoise to initiate any of the treatments compared, with respect to the risk for the disease outcome. The new user design also allows us to address stopping, switching, and augmenting of drug use after baseline separately from the confounding at baseline. We are thus able to apply various methods for censoring, ranging from immediate censoring to ignoring (first treatment carried forward or intention-to-treat analysis). If we are able to predict adherence, we can apply marginal structural models. 30 – 32 Finally, comparing patients initiating a drug to patients who initiate another drug for the same indication reduces the potential for and thus the magnitude of most of the biases outlined above, including confounding, selection bias, and immortal time bias. The “comparative” in CER can therefore help us avoid major biases when making nonexperimental treatment comparisons.
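To make the propensity score step concrete, here is a hedged Python sketch (scikit-learn assumed available; the cohort, covariates, and coefficients are invented) that estimates propensity scores among new users of two comparator drugs and forms stabilized inverse-probability-of-treatment weights; inspecting the score distribution is one simple positivity check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical new-user cohort: covariates measured at treatment initiation
X = pd.DataFrame({"age": rng.normal(65, 10, n),
                  "comorbidity_score": rng.poisson(2, n)})
logit = 0.03 * (X["age"] - 65) + 0.2 * X["comorbidity_score"]
drug_a = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # 1 = drug A, 0 = comparator

# Propensity score: estimated probability of initiating drug A given covariates
ps = LogisticRegression().fit(X, drug_a).predict_proba(X)[:, 1]

# Stabilized inverse-probability-of-treatment weights
p_a = drug_a.mean()
weights = np.where(drug_a == 1, p_a / ps, (1 - p_a) / (1 - ps))

# Crude positivity check: scores piling up near 0 or 1 flag non-overlap
print("propensity score range:", ps.min().round(3), "to", ps.max().round(3))
```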

Based on these recent developments, we propose that study design has a larger influence on validity of pharmacoepidemiologic studies than whether we use linked healthcare databases or data from ad hoc studies. For instance, a reanalysis of the data from the Nurses Health Study on the effects of estrogen and progestin therapy on coronary heart disease in postmenopausal women based on a new user design and dealing with selection bias after initiation 33 showed results compatible with the ones from a large randomized trial. Threats to validity in pharmacoepidemiology are by no means specific to linked healthcare databases.

We need timely and trustworthy answers about both drug benefits and harms in the population. Such answers are essential to safeguard public health, and they can rarely be obtained in a timely fashion by new data collection, be it within randomized controlled trials, large simple trials, or cohort studies. Current examples include insulin glargine and angiotensin receptor blockers, which have been implicated in an increased risk for malignancies. 34 , 35 Fortunately, in both cases we have obvious comparator drugs that will allow us to limit the potential for bias; but we need to be careful to avoid flaws in the study design 36 .

The answer to the question whether we are brave or foolhardy may depend on what we want to do. We can be brave and study treatments with similar indications (e.g., insulin glargine vs. NPH insulin, angiotensin receptor blockers vs. angiotensin converting enzyme inhibitors), unintended effects, and short term effects. We would probably be foolhardy to study intended long term effects without a comparator, e.g., statins vs. no statin. Note that this list is not exhaustive. There will always be gray zones and it may very well be easier for academics to live with these than for industry and regulatory agencies. 37

We conclude that large linked healthcare databases offer major advantages for pharmacoepidemiologic research. 38 While certainly not ideal, the population is often close to ideal compared with ad hoc studies because it is unselected (e.g., Medicare, General Practice Research Database, Scandinavia 39 ). And while again not ideal, the information on drug exposure is almost ideal for prescription drugs in the outpatient setting, i.e., for most of the drugs used. We can study clinically relevant outcomes, and given the large size of linked healthcare databases, we can find timely answers to multiple questions without the need to wait for new data to be collected. Among the many downsides is the lack of information on important covariates but we have design options to limit and reduce confounding by unmeasured confounders and selection bias. Large linked healthcare databases allow us to answer some important questions that could otherwise not be answered. The future will allow us to link electronic medical records, cohort studies, and claims data. The ideal database will remain elusive, however.

While we need to increase the use of large linked healthcare databases to answer appropriate pharmacoepidemiologic and CER questions using state-of-the-art nonexperimental methodology, we also need to acknowledge the need for more head-to-head, simple, large randomized trials comparing the effects of relevant treatment alternatives on clinically relevant outcomes (e.g., based on linked healthcare databases) in unselected populations.

Acknowledgments

The authors are the core faculty of the Pharmacoepidemiology Program within the Department of Epidemiology at the UNC Gillings School of Global Public Health. The core faculty combines backgrounds from medicine (Til Stürmer), psychology (Michele Jonsson Funk), health administration (Charles Poole), and biostatistics (Alan Brookhart) with strong training and expertise in epidemiology methods into a common interest in the development and assessment of innovative research methods, specifically for the nonexperimental evaluation of drug benefits and harms using large linked healthcare databases. Funding for research and training comes from various sources, including the National Institute on Aging (RO1 AG023178 and K25 AG27400); AHRQ (K02 HS17950); the UNC-GSK Center of excellence in Pharmacoepidemiology and Public Health, an innovative academia-industry collaboration; and unrestricted research grants from the pharmaceutical industry (e.g., Merck, Sanofi-Aventis, Amgen). UNC houses an AHRQ DEcIDE Center.

  • Open access
  • Published: 09 September 2023

Using public clinical trial reports to probe non-experimental causal inference methods

Ethan Steinberg, Nikolaos Ignatiadis, Steve Yadlowsky, Yizhe Xu & Nigam Shah

BMC Medical Research Methodology, volume 23, Article number: 204 (2023)


Background

Non-experimental studies (also known as observational studies) are valuable for estimating the effects of various medical interventions, but are notoriously difficult to evaluate because the methods used in non-experimental studies require untestable assumptions. This lack of intrinsic verifiability makes it difficult both to compare different non-experimental study methods and to trust the results of any particular non-experimental study.

Methods

We introduce TrialProbe, a data resource and statistical framework for the evaluation of non-experimental methods. We first collect a dataset of pseudo “ground truths” about the relative effects of drugs by using empirical Bayesian techniques to analyze adverse events recorded in public clinical trial reports. We then develop a framework for evaluating non-experimental methods against that ground truth by measuring concordance between the non-experimental effect estimates and the estimates derived from clinical trials. As a demonstration of our approach, we also perform an example methods evaluation between propensity score matching, inverse propensity score weighting, and an unadjusted approach on a large national insurance claims dataset.

Results

From the 33,701 clinical trial records in our version of the ClinicalTrials.gov dataset, we are able to extract 12,967 unique drug/drug adverse event comparisons to form a ground truth set. During our corresponding methods evaluation, we are able to use that reference set to demonstrate that both propensity score matching and inverse propensity score weighting can produce estimates that have high concordance with clinical trial results and substantially outperform an unadjusted baseline.

Conclusions

We find that TrialProbe is an effective approach for probing non-experimental study methods, being able to generate large ground truth sets that are able to distinguish how well non-experimental methods perform in real world observational data.


Non-experimental studies (which are also known as observational studies) are valuable for estimating causal relationships in medical settings where randomized trials are not feasible due to either ethical or logistical concerns [ 1 ]. In addition, effects from randomized trials might not generalize to real-world use due to limited and non-representative study populations and differing clinical practice environments [ 2 ]. Accurately estimating these causal relationships is important, as learning which treatments are the most effective is a key component of improving health care. However, non-experimental studies are difficult to use in practice due to the absence of randomization, which forces them to rely on difficult-to-verify assumptions, such as the absence of unmeasured confounding and non-informative censoring [ 3 ]. These assumptions make it difficult to evaluate the performance of non-experimental methods, which is an important step for verifying the reliability of these techniques as well as determining the relative merits of different methods. Despite significant recent progress in non-experimental study evaluation (detailed in Section “Related work”), this difficulty with evaluation hampers research, by making it more difficult to develop more effective methods, and hinders practice, as clinicians are hesitant to use evidence generated from non-experimental studies even in situations where clinical trial derived evidence is not available [ 4 , 5 , 6 ].

In this work, we introduce TrialProbe , a new principled approach for the systematic appraisal of non-experimental causal inference methods. Our basic premise is that we can evaluate non-experimental causal inference methods by comparing adverse event effect estimates from non-experimental methods with published experimentally derived estimates from public ClinicalTrials.gov clinical trial reports. Compared to previous approaches for the evaluation of non-experimental methods (more of which below in Section “Related work”), TrialProbe  differs in three regards. First, we explicitly focus on active comparator study designs where one drug is directly compared to another drug as those are easier to connect to potential non-experimental study designs [ 7 ]. Second, we estimate the magnitude of the effects extracted from the public clinical trial reports through an empirical Bayes approach that explicitly accounts for the heterogeneity of odds ratios across the clinical trials, the statistical information content (e.g., sample size) used to estimate each odds ratio, and the fact that most effects are very small. Third, we use those estimated effects to split our reference set into several subsets that contain drug effects of varying strengths, so that users can simultaneously understand the concordance between non-experimental and experimental methods for both stronger and weaker effects.

We then use TrialProbe to evaluate common non-experimental study methods in terms of their ability to identify causal relationships from a large national administrative claims dataset - Optum’s de-identified Clinformatics Data Mart Database. We find that available methods can reproduce a significant fraction of the reported effects and that adjusting for a low-dimensional representation of patient history outperforms a naive analysis that does not adjust for any covariates.

Related work

The importance of evaluating non-experimental methods is well-understood and ubiquitous. The most common approach for evaluation is based on simulation experiments, or more recently, based on semi-synthetic simulations that seek to mimic real observational datasets [ 8 , 9 , 10 , 11 , 12 ]. The upshot of simulation studies is that the ground truth is precisely known, and so non-experimental methods can be compared with respect to any metric of interest. Nevertheless, it is difficult to determine whether or not those simulations provide a realistic confounding structure that is similar to observational data in practice.

Non-experimental methods have also been evaluated in terms of reproducibility by evaluating whether it is possible to independently reproduce previously published non-experimental studies [ 13 ]. Reproducibility is an important and useful feature for non-experimental studies, but measuring reproducibility alone does not necessarily address the issue of whether non-experimental studies provide correct effect estimates.

Closer to our work, several authors have evaluated non-experimental methods by comparing them to results from RCTs. Some authors have used data from RCTs to estimate a causal effect, and then applied a non-experimental method only to the treatment arm of the same RCT [ 14 , 15 ] or to the treated subjects from the RCT along with control subjects drawn from survey datasets [ 16 ]. Furthermore, such approaches require access to patient-level data for each RCT.

Other authors have constructed pairs of published non-experimental studies and RCTs that assess the same intervention in similar populations [ 17 , 18 ]. Such an approach is appealing, as it directly compares non-experimental designs that researchers have pursued (and published). On the other hand, such an approach does not allow the large-scale and systematic exploration of variations in causal inference methods and is typically restricted to the study of dozens of effects. This approach is also subject to publication bias issues, which results in an under-reporting of non-significant effects in both experimental and non-experimental designs.

Another common approach—that most closely aligns with our work—for evaluating non-experimental causal inference methods is through reference sets [ 19 , 20 ]. A reference set is a collection of relationships about the effects of treatments that are independently verified, and treated as ground truth against which the ability of a non-experimental method to identify those effects from available data can be quantified. There have been several proposed approaches to create reference sets, the most prominent of which rely on either FDA labels or expert knowledge to declare known relationships between drugs and outcomes [ 20 ]. However, the actual construction of existing reference sets can be opaque. Instead, in TrialProbe  we generate a sequence of nested reference sets that correspond to increasing levels of evidence for the strength of the causal effect. The construction of the TrialProbe  reference sets is fully data-driven and reproducible. Furthermore, we are not aware of previous reference sets that focus on active comparator study designs.

RCT-Duplicate [ 21 ] is another closely related effort that attempts to quantify the performance of non-experimental methods by carefully reproducing the results of 32 clinical trials using insurance claims databases. This manual emulation of the trial design (to the extent feasible) allows RCT-Duplicate to very closely match the exact clinical trial setup, including details such as inclusion/exclusion criteria that are not possible with fully automated approaches such as ours. In addition, the increased effort per trial limits the number of RCTs that can be feasibly reproduced to just 32. Our work is similar in spirit, but expands on the idea by vastly increasing the number of estimated effects by several orders of magnitude to 12,967 by being fully automated and by taking advantage of the entire ClinicalTrials.gov database.

All the approaches we outlined above for the evaluation of non-experimental methods based on results from RCTs face the following difficulty: Even in an optimal situation, it is not expected that any non-experimental method will reproduce the entire ground truth in the reference set because the observational data usually comes from a different population than the population used to collect the ground truth [ 22 ]. Identification of a known relationship might fail for example because the academic medical center population used in an RCT might differ drastically from the general population available in the non-experimental data resource. Many other study design factors (e.g., whether the estimand is a hazard ratio in the non-experimental study and an odds ratio in the RCT) can further lead to deviations between the non-experimental study and the RCT. A related issue is that experimental studies also have a certain error rate, in that incorrect blinding, randomization, unrealistic usage, or other errors can cause an RCT to return incorrect effect estimates [ 2 ]. Nevertheless, a common assumption is that while the exact effect might differ, the effect identified in the observational data and the original “ground truth” should be correlated and good non-experimental methods should on average have greater correspondence with the provided ground truth [ 23 ]. Here we take this idea to an extreme and only check for concordance between the direction of effects in RCTs and the non-experimental methods [ 12 , 20 ]. A related evaluation approach, where one only seeks to recover the direction of an effect, has appeared in the causal discovery literature [ 24 ].

In this section we describe the TrialProbe  approach. We describe the data source of the clinical trial reports (ClinicalTrials.gov), the processing of the raw data to a curated dataset of \(M={12,967}\) unique drug/drug adverse event comparisons, as well as the statistical approach that we propose for comparing non-experimental causal inference methods.

The primary data source: ClinicalTrials.gov

ClinicalTrials.gov serves as a public repository for clinical trials carried out in the United States and abroad. The database contains pre-registration information, trial status, and results as provided by researchers conducting the trials. Many clinical trials are legally required to report results to ClinicalTrials.gov within 1 year of study completion, with a compliance rate of over 40%  [ 25 ]. In this work we use the June 4, 2020 version of the database, which includes 33,701 clinical trials. Note that we are not using patient level data collected in the trial, but the public report posted at ClinicalTrials.gov.

Extracting trials with an active comparator design

We focus on drug versus drug active comparator clinical trials, which evaluate one drug directly against another. The reason is that such comparisons are easier to conduct in the context of a non-experimental study design. In contrast, placebo or standard of care based trials are more difficult to work with because there is no clear corresponding case-control non-experimental study that can be used to estimate effects. We additionally restrict our analysis to higher quality clinical trials using the study design reported on ClinicalTrials.gov. We implement a quality filter by inspecting the reported randomization and blinding information and explicitly removing trials that are either not randomized or do not use participant blinding.

The results section of each active comparator clinical trial record consists of a set of intervention arms as well as the primary outcomes and adverse events associated with each arm. The primary outcomes and side effects are all specified in natural language and must be mapped to standardized terminologies. We discard the primary outcomes because it is difficult to consistently map them to electronic healthcare data sources due to a wide diversity of measurements and a lack of standardized terminology. We instead focus on the adverse events because they are specified using MedDRA terminology and because mappings to corresponding condition codes are available for healthcare data sources. We obtain a standardized version of these adverse outcomes by mapping them to ICD10 using the dictionary mappings contained within UMLS 2019AB.

The drug mentions in the ClinicalTrials.gov records are specified in an ad-hoc manner in terms of brand names, ingredients, dosages and/or more specialized names. As a preliminary step, we filter out all treatment arms with fewer than 100 patients, as trials of that size frequently do not have enough power to obtain statistical significance. We then use the RxNorm API to transform the text descriptions of drugs into RxNorm ingredient sets. We require at least 50% of the tokens to match in order to avoid false positives. Treatment arms with more than one ingredient (due to either containing multiple drugs or drugs with multiple active ingredients) are also filtered out. As an additional quality control step, we remove intervention arms that contain plus (“+”) signs in their names, as these usually indicate combination treatments that RxNorm is not always able to detect and map to ingredients correctly. Finally, we map those RxNorm ingredient sets to Anatomical Therapeutic Chemical (ATC) codes so that we can find the corresponding drugs more easily in our ATC-code-annotated observational data. We verified that this automated drug name extraction and mapping step did not introduce significant errors by manually inspecting a set of 100 randomly mapped trials and double-checking that all drugs in those trials were resolved to the correct RxNorm ingredients.
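The paper names the RxNorm API without further detail; as an assumption for illustration, the public RxNav approximateTerm endpoint is one way to map free-text drug descriptions to candidate RxNorm concepts (the requests library is assumed available).

```python
import requests

def rxnorm_candidates(drug_text: str, max_entries: int = 3):
    """Return candidate (rxcui, score) pairs for a free-text drug description."""
    resp = requests.get(
        "https://rxnav.nlm.nih.gov/REST/approximateTerm.json",
        params={"term": drug_text, "maxEntries": max_entries},
        timeout=30,
    )
    resp.raise_for_status()
    group = resp.json().get("approximateGroup", {})
    return [(c.get("rxcui"), c.get("score")) for c in group.get("candidate", [])]

# A brand-name description with dosage, as might appear in a trial record
print(rxnorm_candidates("Lipitor 10 mg oral tablet"))
```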

One important feature of ClinicalTrials.gov data is that it often contains records where the same drug-drug comparisons have been tested in multiple trials. We aggregate side effect event counts and participant counts for trials with identical drug combinations and outcome measurements. Similarly, we also aggregate counts across arms where the same drug was evaluated at different dosages. This aggregation procedure has the dual purpose of strengthening the reliability of consistent true effects while helping to down-weight trials with conflicting effects.

We also note that in an active comparator design, there is typically no concrete choice for the baseline arm (in contrast to e.g., placebo or standard of care trials)—the role of the two arms is symmetric . To express this symmetry, we reorder all pairs of drugs under comparison (for each adverse event) in such a way that the sample odds ratio is \(\ge 1\) .
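These two bookkeeping steps (pooling counts across trials of the same comparison and outcome, then orienting each pair so the sample odds ratio is at least 1) might look as follows in pandas; the column names and counts are invented.

```python
import pandas as pd

# Hypothetical per-trial counts for one adverse event and one drug pair
rows = pd.DataFrame({
    "drug_a": ["X", "X", "Y"], "drug_b": ["Y", "Y", "X"],
    "event": ["nausea"] * 3,
    "events_a": [10, 4, 30], "n_a": [200, 150, 500],
    "events_b": [20, 9, 12], "n_b": [210, 140, 480],
})

pairs = [("drug_a", "drug_b"), ("events_a", "events_b"), ("n_a", "n_b")]

# Canonicalize arm order so (X, Y) and (Y, X) rows pool together
flip = rows["drug_a"] > rows["drug_b"]
for a, b in pairs:
    rows.loc[flip, [a, b]] = rows.loc[flip, [b, a]].values

pooled = rows.groupby(["drug_a", "drug_b", "event"], as_index=False).sum()

# Reorder each comparison so the sample odds ratio is >= 1
odds_a = pooled["events_a"] / (pooled["n_a"] - pooled["events_a"])
odds_b = pooled["events_b"] / (pooled["n_b"] - pooled["events_b"])
swap = odds_a < odds_b
for a, b in pairs:
    pooled.loc[swap, [a, b]] = pooled.loc[swap, [b, a]].values

print(pooled)
```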

At the end of this process, we have compiled \(M = 12{,}967\) unique drug versus drug treatment adverse event comparisons. The summarized data for the i-th entry comprises the ICD10 code of the adverse event, the ATC codes of the two drugs being compared, as well as the contingency table \(Z_i\):

$$Z_i = \begin{pmatrix} a_i & n_{Ai} - a_i \\ b_i & n_{Bi} - b_i \end{pmatrix}, \qquad (1)$$

where \(a_i\) and \(b_i\) count the patients who experienced the adverse event in the arms of drug A and drug B, and \(n_{Ai}\) and \(n_{Bi}\) are the corresponding arm sizes.

Below we describe our concrete statistical proposal for leveraging the above dataset to compare non-experimental causal inference methods.

Empirical Bayes effect size estimation

In this section, we develop an approach for estimating the effect sizes of all the drug versus drug treatment adverse event comparisons that adjusts for the following issues: First, most of the drug vs drug effect sizes are very small, close to 1, if not non-existent. Adjusting for this prior is necessary in order to reject spurious, but statistically significant, effects. Second, each drug vs drug comparison contains vastly different amounts of information, with differing event rates, patient counts, etc. for each comparison. Taking into account the differences in information content is important for identifying effects that are weak, but strongly supported due to the quantity of clinical trial evidence.

Our estimation approach follows a tradition of methodological developments based on hierarchical modeling combined with an empirical Bayes analysis [ 26 , 27 , 28 , 29 ]. This approach explicitly learns a prior to take into account how most effects are small and takes advantage of the differing amounts of information in each comparison. We model the likelihood for the log odds ratio \(\omega_i\) of the i-th comparison (with contingency table ( 1 )) through the non-central hypergeometric distribution, that is,

$$L_i(\omega_i) = P\big(Z_i \mid \text{margins of } Z_i;\, \omega_i\big) = \frac{\binom{n_{Ai}}{a_i}\binom{n_{Bi}}{b_i}\, e^{a_i \omega_i}}{\sum_{u} \binom{n_{Ai}}{u}\binom{n_{Bi}}{a_i + b_i - u}\, e^{u \omega_i}}, \qquad (2)$$

with the sum running over all values of u compatible with the margins. The likelihood \(L_i(\omega_i)\) for the analysis of \(2 \times 2\) contingency tables has been proposed by, e.g., [ 30 , 31 , 32 , 33 ], and is derived by conditioning on the margins of the table \(Z_i\) — in entirely the same way as in the derivation of Fisher’s exact test.

In our hierarchical approach, we further model the \(\omega_i\) as exchangeable random effects, independent of the margins of \(Z_i\), with:

$$\omega_i \overset{\text{iid}}{\sim} G, \qquad G \text{ symmetric around } 0. \qquad (3)$$

In contrast to a fully Bayesian approach, we do not posit knowledge of G, but instead follow the empirical Bayes paradigm and estimate G based on the data \(Z_1, \dotsc, Z_M\) as follows:

$$\widehat{G} \in \operatorname*{arg\,max}_{G \text{ symmetric}} \sum_{i=1}^{M} \log \left( \int L_i(\omega) \, dG(\omega) \right). \qquad (4)$$

Equation ( 4 ) is an optimization problem over all symmetric distributions G and the objective is the marginal log-likelihood—each component likelihood \(L_i(\cdot )\)  ( 2 ) is integrated with respect to the unknown G . The estimator \(\widehat{G}\) is the nonparametric maximum likelihood estimator (NPMLE) of Kiefer and Wolfowitz  [ 34 ], and has been used for contingency tables [ 30 ]. We note that in contrast to previous works  [ 30 ], we also enforce symmetry of G around 0 in ( 3 ), ( 4 ). The reason is that, as explained in Section “Extracting trials with an active comparator design”, our active comparator design setting is symmetric with respect to the drugs under comparison.
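As a rough illustration of this estimation strategy (a sketch, not the authors' implementation), the following Python code fits a discretized symmetric prior by EM over a grid of log odds ratios, using scipy's Fisher noncentral hypergeometric distribution for the conditional likelihood ( 2 ), and then computes posterior-mean denoised estimates; the three tables are toy counts.

```python
import numpy as np
from scipy.stats import nchypergeom_fisher

# Toy comparisons: (events_a, n_a, events_b, n_b) per drug-pair / adverse event
tables = [(12, 300, 5, 310), (40, 500, 38, 490), (3, 150, 9, 160)]

# Symmetric grid of candidate log odds ratios; start from a uniform prior
grid = np.linspace(-3, 3, 61)
prior = np.full(grid.size, 1 / grid.size)

# Conditional likelihood of each table at every grid point
lik = np.empty((len(tables), grid.size))
for i, (ea, na, eb, nb) in enumerate(tables):
    lik[i] = nchypergeom_fisher.pmf(ea, na + nb, na, ea + eb, np.exp(grid))

# EM iterations for a discretized (and symmetrized) NPMLE of G
for _ in range(500):
    post = lik * prior
    post /= post.sum(axis=1, keepdims=True)   # posterior over grid per comparison
    prior = post.mean(axis=0)                 # EM update of the mixing weights
    prior = (prior + prior[::-1]) / 2         # heuristic: enforce symmetry around 0

# Empirical Bayes denoising: posterior mean log odds ratio per comparison
post = lik * prior
post /= post.sum(axis=1, keepdims=True)
print(np.exp(post @ grid))                    # denoised odds ratios
```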

Figure 1 a shows the estimated distribution function \(\widehat{G}\) ( 4 ) based on the TrialProbe dataset (in terms of odds ratios \(\textrm{exp}(\omega _i)\), but with a logarithmic x-axis scale), as well as the empirical distribution of sample odds ratios. We observe that even though the sample odds ratios are quite spread out, the NPMLE \(\widehat{G}\) is substantially more concentrated around odds ratios near 1 (log odds ratios near 0). This is consistent with the intuition that for an active comparator design study, side effects will often be similar for the two drugs under comparison (but not always).

Figure 1: (a) Distribution function of drug versus drug adverse event odds ratios in TrialProbe. \(\widehat{G}\) is estimated via nonparametric maximum likelihood as in ( 4 ), while the dashed curve is the empirical distribution of sample odds ratios. (b) Denoised vs. raw odds ratios. Denoising ( 5 ) is done by computing the posterior mean of the log odds ratio given the data for the i-th comparison and the estimated \(\widehat{G}\).

Finally, to create an effect estimate for the drug versus drug treatment adverse event comparisons, we use the plug-in principle: We use the estimated \(\widehat{G}\) to compute denoised point estimates of the log odds ratios via the empirical Bayes rule:

$$\widehat{\omega}_i^{\text{EB}} = \mathbb{E}_{\widehat{G}}\big[\omega_i \mid Z_i\big]. \qquad (5)$$

Figure 1 b plots \(\textrm{exp}(\widehat{\omega }_i^{\text {EB}})\) against the sample odds ratios. We observe that the rule \(\widehat{\omega }_i^{\text {EB}}\) automatically shrinks most sample log odds ratios toward 0 (equivalently: \(\textrm{exp}(\widehat{\omega }_i^{\text {EB}})\) shrinks most sample odds ratios toward 1), while rigorously accounting for the varying effective sample size of each comparison (so that shrinkage toward 1 is heterogeneous). Table 1 gives the first ten entries of TrialProbe with the largest denoised odds ratio \(\textrm{exp}(\widehat{\omega }_i^{\text {EB}})\).

Effect size ranking and subsetting

Given our effect size estimates computed through empirical Bayes, we rank drug vs drug adverse event comparisons by effect size magnitude [ 35 ] and construct subsets of our reference set that only contain effects greater than a chosen magnitude.

There is a challenging trade-off when choosing the effect size threshold required to be included in the reference set. Stronger effects should be more resilient to errors in either the clinical trial or non-experimental study design, but might exclude moderate effects that clinicians and researchers are interested in estimating with non-experimental methods.

Due to that complicated trade-off, we do not choose a static effect size threshold and instead perform all analyses with all possible effect size thresholds. This strategy also allows us to provide some insight into how metrics degrade as weaker effects are allowed in the reference set.

We thus define a family of reference sets \(S_t\), where t is the minimum required denoised odds ratio to be included in the set. Each set \(S_t\) is a subset of TrialProbe, defined as follows:

$$S_t = \big\{\, 1 \le i \le M : \exp\big(\widehat{\omega}_i^{\text{EB}}\big) \ge t \,\big\}. \qquad (6)$$

Evaluation: concordant sign rate

As explained previously, there are many possible reasons why the exact effect size from a non-experimental assessment of a causal effect may not match the results of a clinical trial. We propose to handle this by only looking at the estimated effect direction for those effects which are known to be large. We additionally only compare concordance for cases where the non-experimental method returns a statistically significant result, as this both removes cases where we wouldn’t expect the non-experimental assessment to match and better aligns with how non-experimental assessments are used in practice. The basic premise of our approach is the following.

Consider the comparison of two drugs with respect to an adverse event. Suppose that:

In the clinical trial report, there is strong evidence that \(\omega _A \gg \omega _B\) , that is, there is strong evidence that the adverse event rate under drug A is substantially larger compared to drug B.

The non-experimental causal inference method yields a significant p -value, indicating that the null hypothesis (that both drugs have the same adverse event rate) is probably false.

According to the non-experimental method, drug B leads to a higher adverse event rate compared to drug A, that is, the direction of the effect is the opposite compared to the clinical trial evidence.

Then, we are confident that the non-experimental method yields misleading evidence in this case as it provides statistically significant effects in the wrong direction compared to the ground truth.

We instantiate the above framework as follows. We seek to systematically evaluate a non-experimental causal inference method \(\mathcal {O}\) , which we define as follows (see Section “Case study on Optum’s Clinformatics” for a concrete instantiation): \(\mathcal {O}\) is a mapping from two drugs (drug A and drug B) and an adverse event to a p -value and a predicted causal effect direction (i.e., whether drug A or drug B causes the adverse event more frequently). Specifying the mapping \(\mathcal {O}\) requires specification of the healthcare data resource, the protocol for extracting subjects treated with drug A, resp. drug B, and a statistical method (e.g., an observational method that adjusts for observed covariates) that returns a p -value and the predicted effect direction.

We define \(\mathcal {R}(\mathcal {O}) \subset \textit{TrialProbe}\) as the set of comparisons such that the non-experimental study returns a p -value \(\le 0.05\) . In order to ensure that we only evaluate larger effects, we use the \(S_t\) subsets of TrialProbe defined in the previous section which require each entry in the set to have an empirical Bayes denoised odds ratio greater than t .

We then define the Concordant Sign Rate, as:

Large values of \(\text {CSR}(\mathcal {S}_t, \mathcal {O})\) are preferable. We may define \(1-\text {CSR}(\mathcal {S}_t, \mathcal {O})\) as the discordant sign rate, which is analogous to the notion of false sign rate in multiple testing [ 36 , 37 ] and the type-S (“sign”) error [ 38 ]. In the present setting, however, there is no precise notion of “true” and “false” sign, and instead we evaluate only based on concordance/discordance with the effect derived from the public clinical trial reports.

For every \(\mathcal {S}_t\) and every non-experimental causal inference method \(\mathcal {O}\) , we compute two metrics: the fraction of statistically significant results that have a concordant sign (as in ( 7 )) and the fraction of entries of \(\mathcal {S}_t\) recovered (as in being marked statistically significant with concordant sign). The concordant sign rate gives an indication of how reliable a non-experimental method is and the fraction recovered gives an indication of its power.

Case study on Optum’s Clinformatics

To illustrate how TrialProbe  may be applied, we consider a hypothetical investigator who is interested in comparing two drugs with respect to a specific adverse event and seeks to generate evidence for the comparison. The investigator has access to Optum’s de-identified Clinformatics Data Mart 8.0 medical claims dataset [ 39 ], a large US commercial claims dataset containing over 88 million patients that is frequently used for non-experimental studies.

The investigator proceeds as follows:

Cohorts are constructed systematically using the first drug reimbursement claim for either of the two drugs as the index time. Patients with a prior event or an event at the index time are excluded. At most 100,000 patients are sampled for each drug. Outcomes are measured until each record is censored (as indicated by the end of their healthcare enrollment in the Clinformatics dataset).

For the cohort generated as above, the investigator fits a Cox proportional hazards model with response equal to the first time the adverse event occurs and covariate equal to the indicator of treatment assignment to drug A. Footnote 3

The investigator reports a significant causal effect if the p -value from the Cox fit is \(\le 0.05\) and in that case, declares the direction of the effect according to the estimated hazard ratio.

Steps 1—3 comprise a non-experimental strategy \(\mathcal {O}\) . We also consider two additional non-experimental strategies that replace step 2. by 2.’ or 2.”:

The investigator fits a propensity score matched (PSM) Cox model. The propensity score is estimated using logistic regression on a low-dimensional representation of the patient’s history obtained via a procedure by Steinberg et al. [ 40 ]. When performing matching, the investigator uses a 1:1 greedy matching algorithm on the logit scale with a caliper of 0.1. Once a matched cohort is chosen, the hazard ratio is estimated using a Cox regression by modeling the survival outcome as a function of the treatment status in the cohort. The calculation of the p -value corresponding to the hazard ratio ignores the estimation of the propensity scores.

The investigator fits an inverse propensity score weighted (IPSW) Cox model. As in 2.’, the propensity score is estimated using logistic regression on a low-dimensional representation of the patient’s history obtained via a procedure by Steinberg et al. [ 40 ]. The calculation of the p -value corresponding to the hazard ratio ignores the estimation of the propensity scores.

In what follows, we refer to these two non-experimental methods as “Unadjusted Cox”, “Cox PSM” and “Cox IPSW”. We note that there are many possible criticisms to all three approaches. For example, the first approach is naïve, in that it does not even attempt to adjust for confounding. The second approach adjusts for confounding, but also has caveats, e.g., the computed standard error may be overly conservative [ 41 ]. Finally, the third approach, IPSW, has relatively high variance and can be unstable, especially when there is minimal overlap. Nevertheless, it is plausible that an investigator would proceed using one of these non-experimental approaches (especially Cox PSM and Cox IPSW). With TrialProbe , we can probe some of the properties of these three non-experimental methods.

For a given comparison of interest, it could be the case that any of the methods provides more reliable evidence than the others, or perhaps all methods provide unreliable evidence. There are many reasons why the methods could fail to provide reliable evidence, and these reasons may vary from comparison to comparison (as explained before). Through TrialProbe  we probe operating characteristics of methods in aggregate over many possible comparisons. At the same time, we also encourage researchers to delve in more depth at specific comparisons to identify failure modes of non-experimental strategies.

is a case study non experimental

As an example, the effect in the third row is so strong, so that all three non-experimental methods declare the effect as significant and determine a concordant direction. On the other hand, we do not see good concordance or recovery for the Nicotine vs Bupropion examples (rows one, two, and six), with the covariate-adjusted methods returning three statistically insignificant results and the unadjusted method returning one statistically significant concordant result, one statistically significant discordant result, and one statistically insignificant result. This illustrates some of the tradeoffs when adjusting for confounders in that adjusted methods have an increased Type 1 error rate, but also an increased Type 2 error rate. A likely explanation for the poor performance with nicotine in particular is that nicotine usage is frequently not recorded well in claims data. In this case the potential mismatch between trial results and non-experimental results may be more due to the data source, and not due to the adjustment strategies. This example thus illustrates how TrialProbe  can help identify failure modes of non-experimental studies.

figure 2

a Fraction of significant results with concordant sign as a function of the odds ratio threshold in ( 6 ). b Fraction of recovered entries as a function of the odds ratio threshold

We continue with a more holistic picture of the comparison of the two non-experimental strategies (instead of looking at results for individual comparisons) and proceed as suggested in Section “Evaluation: Concordant sign rate”. One important aspect of our results is that many of the non-experimental effect estimates are not statistically significant, and thus not evaluated by our pipeline. The fraction of non-significant results are in Table 3 . The high frequency of non-significant results, even with the use of a large observational dataset probably reflects the fact that many of these adverse events are rare, especially given the underreporting common in claims data. We compute the fraction of significant results that have concordant signs and the fraction of reference set entries correctly recovered by each method for each subset \(S_t\) of TrialProbe  that only contains effects that have an odds ratio threshold greater than t . Figure 2 provides the performance of each of our three methods on these two metrics. It is reassuring that for the relatively strong effects, all methods perform better than a “coin-flip” based guess of directionality. On the other hand, also as anticipated, the methods that adjust for confounders have better concordance compared to unadjusted Cox-PH and the concordant sign rate is \(\ge 80\%\) for comparisons with strong evidence in ClinicalTrials.gov, say, with (denoised) odds ratio \(\ge 2\) .

We make the following remarks: As the x -axis varies in the plots, we are scanning over less stringent choices of “reference sets”. However, in the spirit of probing methods in an exploratory way, we do not need to make a choice of a specific reference set / cutoff on the x -axis. We also note that as the denoised odds ratios approaches zero, the “reference set” \(\mathcal {S}_t\) becomes increasingly uninformative, and so we would anticipate that any method would have \(\text {CSR} \approx 0.5\) .

Comparison to prior work

In order to better understand how TrialProbe compares to prior work, we perform three other non-experimental method evaluation strategies. First, we perform a direct concordance and recovery rate evaluation using the positive controls (that are presumed to have an effect) from the OMOP and EU-ADR reference sets. We also create an ablated form of TrialProbe that does not use the empirical Bayesian effect estimation and odds ratio magnitude filtering, and instead only identifies significant effects using an exact Fisher test with a 0.05 p -value threshold. Table 4 contains the results of this comparison.

We find that all three of these sets, OMOP, EU-ADR, and the corresponding TrialProbe  subset that only required Fisher statistical significance, were difficult to reproduce, with many non-concordant signs and lost effects. The low concordance and recovery of Fisher exact test based TrialProbe subset in particular helps indicate the importance of our empirical Bayesian estimation and effect size filtering.

Importance of clinical trial filtering

One of the key decisions for constructing TrialProbe is which clinical trials to include for analysis. Our analysis uses an assignment and blinding filter, requiring all candidate clinical trials to use randomized assignment and participant blinding. This filter excludes 6,855 of the 19,822 candidate effects that we could have otherwise studied. In order to understand the effect of this filter, and whether it is worth the lost entries, we perform an ablation experiment where we rerun our analysis without this filter. The resulting concordance and recovery plots are in Fig. 3 .

figure 3

Concordance and recovery rates for an ablated form of TrialProbe that does not use clinical trial quality filters. a Fraction of significant results with concordant sign as a function of the odds ratio threshold in ( 6 ). b Fraction of recovered entries as a function of the odds ratio threshold

The concordance rate and recovery rate without the clinical trial quality filter are distinctly lower, especially at larger odds ratio thresholds. This probably reflects how low-quality clinical trials are less likely to be reproducible due to the inherent increased error rate caused by a lack of participant blinding and incomplete randomization.

In this work, we use clinical trial records from ClinicalTrials.gov to build a source of ground truth to probe the performance non-experimental study methods. We show how such a dataset can be constructed in a systematic statistically sound manner in a way that also allows us to filter by the estimated strength of the effects. We also demonstrate the value of our approach by quantifying the performance of three commonly used non-experimental study methods.

Our approach has three advantages. First, it characterizes the performance of methods on real observational data. Second, our approach provides high quality ground truth based on clinical trials that have varying effect sizes, allowing a read out of the performance of a method for a given effect size (Fig. 2 ). Prior reference sets rely on ground truth sources that might be less reliable or have weaker relationships. Finally, our approach scales better than prior work, because we can create thousands of “known relationships” from published trial reports. This is a significant advantage compared to prior approaches that rely on evaluating methods using patient-level randomized trial datasets that can be difficult to acquire [ 42 ].

The empirical Bayes estimation and odds ratio magnitude subsetting in particular seems to be a key component of how TrialProbe can achieve relatively high measured concordance between the clinical trials and non-experimental methods. As shown in our results section, a TrialProbe subset that only relies on statistical significance achieves very low concordance. Likewise, the OMOP and EU-ADR reference sets (which indirectly rely only on statistical significance through FDA reports) also report similarly poor performance. We believe the most likely hypothesis for explaining this is that there is likely to be significant type 1 error due to the implicit vast multiple hypothesis testing problem when searching for a small number of significant adverse event effects in a sea of thousands of reported minor effects. Empirical Bayes automatically adjusts for this multiple hypothesis testing issue by learning a prior that incorporates the knowledge that most adverse event effects are null (Fig. 1 ), and can thus more effectively discard these invalid effects.

However, our approach has several limitations. The primary limitation of our approach is that we rely on an assumption that the average treatment effect seen in the clinical trials generalizes to the observational data. One way this could be violated is if there is a significant mismatch in the patient population and there is a heterogeneous treatment effect. In that case, it is possible to see different effect directions in the observational data than the randomized trial even if the non-experimental methods are functioning correctly [ 43 , 44 ]. Another probable mismatch between the observational data and the clinical trials is that there is frequent underreporting of outcomes in our observational datasets because they rely on billing records for adverse events. This is especially the case for non-serious outcomes such as nausea or rashes. Such underreporting would cause the estimated rate of adverse events to be lower in our observational data than in clinical trials A third potential cause is that the clinical trial might not provide a correct effect estimate due to poor internal clinical trial quality (such as improper blinding, poor randomization, and publication bias). For all of these potential causes of different effect estimates, our primary mitigation strategy is to focus on the effect directions of hazard ratios. The benefit of effect directions is that they intrinsically require greater error to change, especially when the effect magnitude is large. Hazard ratios additionally increase resilience by making analysis more resilient to changes in the base rate of the event, whether due to population differences or outcome reporting changes. One piece of evidence that this mitigation strategy is somewhat successful is that we observe much greater concordance between non-experimental methods and clinical trials than what could be achieved by random chance. However, we do expect this mitigation strategy to be imperfect, and differences in the underlying effects should cause us to underestimate the performance of non-experimental methods.

Our work also has several secondary limitations. First, our approach is only able to evaluate methods for detecting average treatment effects because our ground truth is in the form of average treatment effects. We are simply unable to evaluate how effective methods can detect heterogeneous treatment effects. A second additional limitation is that our evaluation strategy simultaneously probes both the statistical method and the observational healthcare data resource used, in that we would only expect high concordance when both are of high quality. This is frequently a disadvantage, in that it can be hard to understand the particular cause of poor concordance. However, in some circumstances, this can be an advantage: TrialProbe can help identify potential issues associated with the observational dataset itself (e.g., the underreporting of side effects such as nausea). TrialProbe could also be used to probe and contrast different observational datasets, e.g., one could seek to contrast one statistical method applied to a cohort extracted from Optum’s de-identified Clinformatics Data Mart Database compared to the same statistical method applied to a cohort extracted from an alternative observational data resource. Third, our reference set is a biased sample of true drug effects due to selection bias, caused by a combination of publication bias (in the form of trials not reporting results to clinicaltrials.gov) and our requirement for drug prescriptions in our observational data. In particular, it is probably the case that studies that result in significant quantities of adverse events are halted and those drugs are then infrequently (or not at all) used in clinical practice, resulting in our work underestimating the “true” adverse event rates of various drugs. This would in turn mean that the empirical Bayes based subsets that try to identify effects of a particular strength will incorrectly contain stronger effects than expected. However, this should not affect our estimated concordance between non-experimental methods and clinical trials within a particular subset, as we only compare effect directions and not effect magnitudes. Finally, one other disadvantage of our current approach is that the same prior is learned for all log-odds ratios; this presupposes that the selection of effects we consider are relevant to each other. This may not necessarily be the case; for example, chemotherapy drugs will typically have much stronger side effects than other drugs. Not accounting for these differences might cause us to underestimate the effect sizes for high risk drugs like chemotherapy drugs and underestimate the effect sizes for less risky medications. A refinement of the approach would be to stratify effects into groups [ 45 ] and learn a separate prior for each group, or to apply methods for empirical Bayes estimation in the presence of covariate information [ 46 ].

We propose an approach for evaluating non-experimental methods using clinical trial derived reference sets, and evaluate three commonly used non-experimental study methods in terms of their ability to identify the known relationships in a commonly used claims dataset. We find that adjustment significantly improves the ability to correctly recover known relationships, with propensity score matching performing particularly well for detecting large effects.

We make TrialProbe , i.e., the reference set as well as the procedure to create it, freely available at https://github.com/som-shahlab/TrialProbe . TrialProbe  is useful for benchmarking observational study methods performance by developers of the methods as well as for practitioners interested in knowing the expected performance of a specific method on the dataset available to them.

Availability of data and materials

Our code is available at https://github.com/som-shahlab/TrialProbe . The source clinical trial records can be found at clinicaltrials.gov. The data we used in our case study, Optum’s Clinformatics Data Mart Database, is not publicly available as it is a commercially licensed product. In order to get access to Optum’s Clinformatics Data Mart Database, it is generally necessary to reach out to Optum directly to obtain both a license and the data itself. Contact information and other details about how to get access can be found on the product sheet [ 39 ]. Optum is the primary long term repository for their datasets and we are not allowed to maintain archive copies past our contract dates.

Such comparisons make sense when there is imperfect compliance to treatment and one is not interested in intention-to-treat effects.

Computed with a pseudocount adjustment to deal with zero cell counts, that is, \(\textrm{exp}(\widehat{\omega }^{\text {sample}}_i)= \left( {(X_{A,i}+0.5)/(Y_{A,i}+1)}\right) \big /\left( {(X_{B,i}+0.5)/(Y_{B,i}+1)}\right) .\)

In other words, the investigator does not adjust for any possible confounders.

Grootendorst DC, Jager KJ, Zoccali C, Dekker FW. Observational studies are complementary to randomized controlled trials. Nephron Clin Pract. 2010;114(3):173–7.

Article   Google Scholar  

Gershon AS, Lindenauer PK, Wilson KC, Rose L, Walkey AJ, Sadatsafavi M, et al. Informing Healthcare Decisions with Observational Research Assessing Causal Effect. An Official American Thoracic Society Research Statement. Am J Respir Crit Care Med. 2021;203(1):14–23.

Berger ML, Sox H, Willke RJ, Brixner DL, Eichler HG, Goettsch W, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 2017;26(9):1033–9.

Article   PubMed   PubMed Central   Google Scholar  

Darst JR, Newburger JW, Resch S, Rathod RH, Lock JE. Deciding without data. Congenit Heart Dis. 2010;5(4):339–42.

Hampson G, Towse A, Dreitlein WB, Henshall C, Pearson SD. Real-world evidence for coverage decisions: opportunities and challenges. J Comp Eff Res. 2018;7(12):1133–43.

Article   PubMed   Google Scholar  

Klonoff DC. The Expanding Role of Real-World Evidence Trials in Health Care Decision Making. J Diabetes Sci Technol. 2020;14(1):174–9.

Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. Am J Epidemiol. 2016;183(8):758–64.

Schuler A, Jung K, Tibshirani R, Hastie T, Shah N. Synth-validation: Selecting the best causal inference method for a given dataset. arXiv preprint arXiv:1711.00083 . 2017.

Dorie V, Hill J, Shalit U, Scott M, Cervone D. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. arXiv:1707.02641 . 2017.

Dorie V, Hill J, Shalit U, Scott M, Cervone D. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Stat Sci. 2019;34(1):43–68.

Athey S, Imbens GW, Metzger J, Munro E. Using wasserstein generative adversarial networks for the design of monte carlo simulations. J Econom. 2021:105076. https://doi.org/10.1016/j.jeconom.2020.09.013 .

Schuemie MJ, Cepeda MS, Suchard MA, Yang J, Tian Y, Schuler A, et al. How confident are we about observational findings in health care: a benchmark study. Harvard Data Science Review. 2020;2(1). https://doi.org/10.1162/99608f92.147cc28e .

Wang SV, Sreedhara SK, Schneeweiss S, Franklin JM, Gagne JJ, Huybrechts KF, et al. Reproducibility of real-world evidence studies using clinical practice data to inform regulatory and coverage decisions. Nat Commun. 2022;13(1). https://doi.org/10.1038/s41467-022-32310-3 .

Gordon BR, Zettelmeyer F, Bhargava N, Chapsky D. A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook. Mark Sci. 2019;38(2):193–225.

Gordon BR, Moakler R, Zettelmeyer F. Close enough? a large-scale exploration of non-experimental approaches to advertising measurement. arXiv:2201.07055 . 2022.

LaLonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev. 1986;76(4):604–20. http://www.jstor.org/stable/1806062 . Accessed 5 Sept 2023.

Ioannidis JP, Haidich AB, Pappa M, Pantazis N, Kokori SI, Tektonidou MG, et al. Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA. 2001;286(7):821–30.

Article   CAS   PubMed   Google Scholar  

Dahabreh IJ, Kent DM. Can the learning health care system be educated with observational data? JAMA. 2014;312(2):129–30.

Schuemie MJ, Gini R, Coloma PM, Straatman H, Herings RMC, Pedersen L, et al. Replication of the OMOP experiment in Europe: evaluating methods for risk identification in electronic health record databases. Drug Saf. 2013;36(Suppl 1):159–69.

Ryan PB, Schuemie MJ, Welebob E, Duke J, Valentine S, Hartzema AG. Defining a reference set to support methodological research in drug safety. Drug Saf. 2013;36(Suppl 1):33–47.

Wang SV, Schneeweiss S, Initiative RD. Emulation of Randomized Clinical Trials With Nonrandomized Database Analyses: Results of 32 Clinical Trials. JAMA. 2023;329(16):1376–85. https://doi.org/10.1001/jama.2023.4221 .

Thompson D. Replication of Randomized, Controlled Trials Using Real-World Data: What Could Go Wrong? Value Health. 2021;24(1):112–5.

Camerer CF, Dreber A, Holzmeister F, Ho TH, Huber J, Johannesson M, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637–44.

Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. J Mach Learn Res. 2016;17(1):1103–204.

Google Scholar  

DeVito NJ, Bacon S, Goldacre B. Compliance with legal requirement to report clinical trial results on ClinicalTrials.gov: a cohort study. Lancet. 2020;395(10221):361–9.

Robbins H. An Empirical Bayes Approach to Statistics. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Berkeley: The Regents of the University of California; 1956. p. 157–163.

Efron B, Morris C. Data Analysis Using Stein’s Estimator and Its Generalizations. J Am Stat Assoc. 1975;70(350):311–9.

Efron B. Bayes, oracle Bayes and empirical Bayes. Statist Sci. 2019;34(2):177–201. https://doi.org/10.1214/18-STS674 .

Gu J, Koenker R. Invidious comparisons: Ranking and selection as compound decisions. Econometrica (forthcoming). 2022.

Van Houwelingen HC, Zwinderman KH, Stijnen T. A bivariate approach to meta-analysis. Stat Med. 1993;12(24):2273–84.

Efron B. Empirical Bayes methods for combining likelihoods. J Am Stat Assoc. 1996;91(434):538–50.

Sidik K, Jonkman JN. Estimation using non-central hypergeometric distributions in combining 2 \(\times\) 2 tables. J Stat Plan Infer. 2008;138(12):3993–4005.

Stijnen T, Hamza TH, Özdemir P. Random effects meta-analysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Stat Med. 2010;29(29):3046–67.

Kiefer J, Wolfowitz J. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann Math Statist. 1956;27(4):887–906. https://doi.org/10.1214/aoms/1177728066 .

Aitkin M, Longford N. Statistical modelling issues in school effectiveness studies. J R Stat Soc Ser A Gen. 1986;149(1):1–26.

Stephens M. False discovery rates: a new deal. Biostatistics. 2017;18(2):275–94.

PubMed   Google Scholar  

Ignatiadis N, Wager S. Confidence Intervals for Nonparametric Empirical Bayes Analysis. J Am Stat Assoc. 2022;117(539):1149–66.

Article   CAS   Google Scholar  

Gelman A, Tuerlinckx F. Type S error rates for classical and Bayesian single and multiple comparison procedures. Comput Stat. 2000;15(3):373–90.

Optum. Optum’s de-identified Clinformatics Data Mart Database. 2017. https://www.optum.com/content/dam/optum/resources/productSheets/Clinformatics_for_Data_Mart.pdf . Accessed 5 Sept 2023.

Steinberg E, Jung K, Fries JA, Corbin CK, Pfohl SR, Shah NH. Language models are an effective representation learning technique for electronic health record data. J Biomed Inform. 2021;113:103637.

Austin PC, Small DS. The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Stat Med. 2014;33(24):4306–19.

Powers S, Qian J, Jung K, Schuler A, Shah NH, Hastie T, et al. Some methods for heterogeneous treatment effect estimation in high dimensions. Stat Med. 2018;37(11):1767–87.

Rogers JR, Hripcsak G, Cheung YK, Weng C. Clinical comparison between trial participants and potentially eligible patients using electronic health record data: a generalizability assessment method. J Biomed Inform. 2021;119:103822.

Dahabreh IJ, Robins JM, Hernán MA. Benchmarking Observational Methods by Comparing Randomized Trials and Their Emulations. Epidemiology. 2020;31(5):614–9.

Efron B, Morris C. Combining Possibly Related Estimation Problems. J R Stat Soc Ser B Methodol. 1973;35(3):379–402.

Ignatiadis N, Wager S. Covariate-powered empirical Bayes estimation. Adv Neural Inf Process Syst. 2019;32.

Download references

Acknowledgements

We would like to thank Agata Foryciarz, Stephen R. Pfohl, and Jason A. Fries for providing useful comments on the paper. We would also like to thank the anonymous reviewers who have contributed feedback that has helped us improve this work.

This work was funded under NLM R01-LM011369-05.

Author information

Authors and affiliations.

Center for Biomedical Informatics Research, Stanford University, Stanford, US

Ethan Steinberg, Yizhe Xu & Nigam Shah

Department of Statistics, University of Chicago, Chicago, US

Nikolaos Ignatiadis

Google Research, Google, Cambridge, US

Steve Yadlowsky

You can also search for this author in PubMed   Google Scholar

Contributions

Ethan Steinberg: Conceptualization, Methodology, Software, Writing—original draft. Nikolaos Ignatiadis: Methodology, Software, Writing. Steve Yadlowsky: Methodology, Software, Writing. Yizhe Xu: Software, Writing. Nigam H. Shah: Writing—review & editing, Supervision, Funding acquisition.

Corresponding author

Correspondence to Ethan Steinberg .

Ethics declarations

Ethics approval and consent to participate.

Optum’s Clinformatics Data Mart Database is a de-identified dataset [ 39 ] per HIPAA (Health Insurance Portability and Accountability Act) standards so neither IRB approval nor patient consent is required. As such, we can confirm that all experiments were performed in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Steinberg, E., Ignatiadis, N., Yadlowsky, S. et al. Using public clinical trial reports to probe non-experimental causal inference methods. BMC Med Res Methodol 23 , 204 (2023). https://doi.org/10.1186/s12874-023-02025-0

Download citation

Received : 27 October 2022

Accepted : 24 August 2023

Published : 09 September 2023

DOI : https://doi.org/10.1186/s12874-023-02025-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Causal inference
  • Meta-analysis
  • Clinical trials
  • Method evaluation

BMC Medical Research Methodology

ISSN: 1471-2288

is a case study non experimental

State Of The Union

Biden Admin Pours $1 Million Into Studies Aimed At Denying There Are Only Two Genders

T he National Science Foundation is allocating over $900,000 in taxpayer funds to three universities to conduct a study claiming biology courses inaccurately portray sex and gender as binary, in order to make them more inclusive for transgender and non-binary students.

The study aims to explore how sex and gender topics are taught, their impact on transgender student belonging and interest, and ways to create a more inclusive curriculum acknowledging diversity in sexes and the complex relationship between sex and gender.

While the NSF claims strong theoretical foundations and peer review support the research, critics argue it amounts to denying basic biological realities in order to push an ideological agenda.

“There is a strong theoretical foundation on which the research questions are based,” a National Science Foundation spokesperson said, noting that its “merit review process is recognized as the ‘gold standard’ of scientific review.”

Concerns have been raised over the politicization of federal agencies and wasting of public funds to embed extreme gender ideologies in institutions, as part of broader DEI initiatives, when evidence for related medical interventions remains limited.

Calls have been made to rein in bureaucratic overreach promoting left-wing social policies.

Most Popular:

Cause of Death Released For ‘Rocky’ Movie Star

Supreme Court Audio Suggests Trump Case Is Doomed

Biden Admin Pours $1 Million Into Studies Aimed At Denying There Are Only Two Genders

A Bayesian sampling optimisation strategy for finite element model updating

  • Original Paper
  • Open access
  • Published: 20 February 2024

Cite this article

You have full access to this open access article

  • Davide Raviolo 1 ,
  • Marco Civera   ORCID: orcid.org/0000-0003-0414-7440 2 &
  • Luca Zanotti Fragonara 3  

27 Accesses

Explore all metrics

Model Updating (MU) aims to estimate the unknown properties of a physical system of interest from experimental observations. In Finite Element (FE) models, these unknowns are the elements’ parameters. Typically, besides model calibration purposes, MU and FEMU procedures are employed for the Non-Destructive Evaluation (NDE) and damage assessment of structures. In this framework, damage can be located and quantified by updating the parameters related to stiffness. However, these procedures require the minimisation of a cost function, defined according to the difference between the model and the experimental data. Sophisticated FE models can generate expensive and non-convex cost functions, which minimization is a non-trivial task. To deal with this challenging optimization problem, this work makes use of a Bayesian sampling optimisation technique. This approach consists of generating a statistical surrogate model of the underlying cost function (in this case, a Gaussian Process is used) and applying an acquisition function that drives the intelligent selection of the next sampling point, considering both exploitation and exploration needs. This results in a very efficient yet very powerful optimization technique, necessitating of minimal sampling volume. The performance of this proposed scheme is then compared to three well-established global optimisation algorithms. This investigation is performed on numerical and experimental case studies based on the famous Mirandola bell tower.

Similar content being viewed by others

is a case study non experimental

Sampling Techniques in Bayesian Finite Element Model Updating

is a case study non experimental

A Kriging Approach to Model Updating for Damage Detection

is a case study non experimental

Uncertainty Quantification and Model Identification in a Bayesian and Metaheuristic Framework

Avoid common mistakes on your manuscript.

1 Introduction

Finite Element Model Updating (FEMU) refers to all strategies and algorithms intended for the calibration of an existing FE model based on experimental evidence, especially vibration data [ 1 ]. Data used for such calibration or updating purposes can be acquired from occasional in situ surveys or by an embedded, permanent monitoring system. For the specific case considered here—and as very commonly used in real-life circumstances—modal parameters, extracted from acceleration time histories, are utilised. This is a well-known example of the indirect FEMU method [ 2 ], where the input parameters are varied to match the output results (natural frequencies and mode shapes). At its core, this represents an optimisation problem.

Therefore, the aim is to estimate the mechanical properties to be assigned to the numerical model, given its geometry. Once calibrated, the FE model can be used in several ways. The most obvious application is for predictive analysis, e.g., to estimate the remaining resilience of the structure in case of strong motions or other potentially dangerous events. If a predictive model was already available before a specific damaging event (e.g., a major earthquake), updating that FE model and comparing the estimated stiffnesses before and after the seism can be used for model-based Structural Health Monitoring (SHM) and damage assessment [ 3 , 4 ]. That allows not only basic damage detection but also specific advanced tasks such as damage localisation and severity assessment. A third use is for hybrid simulations [ 5 , 6 ]; in these applications, a target structure is divided into experimental and numerical substructures, due to practical limitations or to save costs. FEMU allows one to match the response of the numerical components to their experimental counterparts. Finally, in the case of continuously monitored structures and infrastructures, constantly re-updating a detailed Finite Element Model represents an enabling technology required for Digital Twins. This can serve the decision maker to evaluate the current and future structural situation of the assets under management. More details about the basic and general concepts of FEMU can be found in the works of Friswell and Mottershead—e.g., [ 7 ] and [ 8 ].

1.1 Efficient Bayesian sampling for finite element model updating

Arguably, one of the major issues about the FEMU procedure described so far is that the optimisation problem can become computationally expensive and very time-consuming, especially when dealing with complex FE models.

On the one hand, numerical models are becoming more and more complex, and so very computationally demanding. The need for very efficient optimisation techniques suitable for potentially highly demanding tasks is therefore clear. Nonetheless, an optimisation algorithm should discern  the global minimum across the function domain, thereby circumventing the risk of encountering local minima. Unfortunately, sampling efficiency and global search capabilities are somewhat conflicting goals. Consequently, global optimization techniques that require high sampling volumes to search the space for the global optimum are frequently employed.

For these reasons, the approach proposed here employs Bayesian Sampling Optimisation [ 9 ] in the framework of FEMU. As will be described in detail in the Methodology section, Bayesian Sampling Optimisation (or simply Bayesian Optimisation, BO, for short) uses the basics of Bayes’ Theorem to infer the best sampling strategy in the search domain. This greatly increases the computational efficiency of the procedure, vastly reducing the sampling volume required to attain a solution, especially when a larger number of parameters needs to be estimated at once, as in the case of damaged structures and infrastructures, where multiple areas can be affected by different levels of damage.

In common practice, also according to the visible crack pattern, the target system is divided into macro-areas, under the assumption that these substructures will have different mechanical properties [ 10 ]. In the most common case, the local parameters of these macro-areas (Young’s moduli, etc.) must be jointly estimated to match the global dynamic response of the structure. Hence, the dimensionality of the search space of the optimisation function, defined by these numerous parameters, easily ten or more [ 11 ]—can become very high. As a note, it is important to remark that this will be the intended use of the term ‘Bayesian’ in this work; other research works, for example [ 12 , 13 , 14 , 15 ], and Ref. [ 16 ] among many others, use the same adjective to refer to the estimated output. Instead, this paper focuses solely on Bayesian sampling and its effectiveness for the optimisation of expensive functions in search spaces characterized by high dimensionality.

1.2 Applications of FEMU to historical architectural heritage and earthquake engineering.

The proposed Bayesian Optimization-based FEMU strategy is validated on a case study of interest for Structural Dynamics purposes, the bell tower of the Santa Maria Maggiore Cathedral in Mirandola. This historical high-rise masonry building suffered extensive damage after the 2012 Emilia Earthquake and has been the subject of several research studies throughout the years—see e.g., [ 1 ]. Both numerically simulated and experimental data were employed, thereby allowing to benchmark the proposed approach with a known ground truth and in a controlled fashion.

Indeed, regarding this specific application, FEMU is especially important for Earthquake Engineering. After major seismic events, reliable and predictive FE models are required as soon as possible to design and evaluate temporary interventions that should be deployed in the immediate aftermath to secure the damaged structures. However, the calibration of such FE models is not trivial. This situation is relatively worse for masonry structures, where even the properties of the original (pristine) structure are more difficult to estimate than homogeneous materials such as structural steel. In architectural and cultural heritage (CH) sites, even the pre-earthquake material properties are often unknown due to the lack of historical records; yet they have notoriously low mechanical resistance, due to their centuries-old ageing. These aspects make these unique and irreplaceable structures strongly vulnerable. Among them, historical bell towers are at particular risk during seismic events, due to various factors such as their relative slenderness, several potential failure mechanisms, and building material (bricks and mortar) [ 17 ]. These further underscore the importance of implementing robust monitoring strategies to detect and track damage development in such structures [ 18 ].

Thus, FEMU represents a precious tool for CH. Moreover, even after the first phase of a post-earthquake emergency, model updating is an important tool for vibration-based continuous monitoring and/or periodic dynamic investigations [ 19 ]. In the short to medium term, seismic aftershocks can cause more damage than the main shock, as strong motions insist on accumulated damage; in the long run, the initial cracks can expose structural vulnerabilities to external environmental factors.

Some noteworthy examples of FEMU applications can be found in [ 17 , 20 , 21 , 22 , 23 ] and [ 24 ]. A broader, up-to-date review of Structural Health Monitoring (SHM) techniques successfully applied to CH structures is given by [ 25 ], while [ 26 ] specifically delves into the historical and contemporary advancements in SHM concerning the Garisenda tower in Bologna, Italy, a heritage structure similar in many aspects to the case study under examination. Similarly, Refs. [ 27 ] and [ 28 ] thoroughly analyses the Civic Tower of Ostra, Italy, and the Civic Clock tower of Rotella, respectively, using detailed numerical models and experimental data to assess the structural condition and establish standards for ongoing maintenance, posing the accent on the use of Genetic Algorithms. The remainder of this paper is organised as follows. In Sect. 2, the theoretical background of Finite Element Model Updating and Bayesian Sampling Optimisation are discussed in detail. In Sect. 3, the specific methodology of the algorithm implemented for this research work is reported. The three optimisation algorithms used for the comparison of the results are also briefly recalled. Section 4 describes the case study. Section 5 comments on the results, comparing the BO estimates with the three benchmark algorithms and the findings retrieved from the published scientific literature on the same case study. Finally, Sect. 6 concludes this paper.

2 Theoretical background

Parametric models (such as finite elements models) are described by a vector of model parameters \({\varvec{\theta}}\) . Thus, being \(M\) the model operator, \({\varvec{y}}={\varvec{M}}({\varvec{x}},{\varvec{\theta}})\) returns the output vector \({\varvec{y}}\) for a given input vector \({\varvec{x}}\) . For obvious reasons, in model updating, it is preferable to adopt outputs that are independent of the input and dependent on the model parameters only (such as modal features). According to this assumption, the \({\varvec{x}}\) vector can be dropped, and the input–output relationship is simply represented by \({\varvec{y}}={\varvec{M}}({\varvec{\theta}})\) .

Finite elements model updating methods fall into two categories, direct methods and iterative methods (the latter also called deterministic). Direct methods try to improve observed data and computed data agreement by directly changing the mass and stiffness matrices; this leads to little physical meaning (no correlation with physical model parameters), problems with elements connectivity, and fully populated stiffness matrices. For these reasons, they are seldom used in common structural engineering applications. The iterative methods attempt to obtain results that fit the observations by iteratively changing the model parameters: this enables retaining good physical understanding of the model and doesn’t present the above-mentioned problems. The degree of correlation is determined by a penalty function (or cost function ): optimising this function requires the problem to be solved iteratively, which means computing the output (i.e., performing a FE analysis) of the numerical model at each iteration. Hence, a higher computational cost is the major drawback of iterative methods.

Many FEMU methods have been proposed and successfully used: sensitivity-based methods, [ 29 , 30 ], and [ 31 ]; eigenstructure-assignment methods, [ 32 ] and [ 33 ]; uncertainty quantification methods [ 34 ]; sensitivity-independent iterative methods, [ 35 ]; and many more [ 36 ].

As described, model updating is an inverse problem, as it aims at inverting the relationship between model parameters and the model output to find the optimal set of parameters \({\varvec{\uptheta}}\) that minimises the difference between computed data and measured data.

In this sense, model updating can be simply considered as the following constrained optimisation problem:

where \({{\varvec{\uptheta}}}^{*}\) is the set of optimal parameters, \(D\) is the parameter space, \(F\) is the cost function and \({\varvec{f}}\) is the measured data.

The whole process of solving \({\text{F}}\left(M\left({\varvec{\uptheta}}\right),{\varvec{f}}\right)\) – the output of the numerical model “post-processed” in some way by a cost function – may be conceived as computing an unknown (non-linear) objective function of the model parameters \({\varvec{\uptheta}}\) , which constitutes the sole input of the numerical model to be updated. Typically, this objective function is non-convex and expensive to evaluate. The output surface of the objective function lies in a \(d-\) dimensional space, where \(d\) is the number of parameters to be optimised. The sampling volume is exponential to \(d\) , thus posing an implicit restriction to the number of parameters that can be optimised.

Many optimisation algorithms have been developed in the last decades, each of them with its peculiar strengths and weaknesses. Among them, three of the better-known and most extensively used are Generalized pattern search (GPS) algorithms, Genetic Algorithms (GA), and simulated annealing (SA) algorithms.

In recent years, BO has proven itself to be a powerful strategy for finding the global minimum of non-linear functions that are expensive to evaluate, non-convex and whose access to the derivatives is burdensome. Furthermore, Bayesian sampling optimisation techniques distinguish themselves as being among the most efficient approaches in terms of a number of objective evaluations [ 37 , 38 , 39 , 40 , 41 ].

The essence of the Bayesian approach lies in the reading of the optimisation problem given by the ‘Bayes’ Theorem’:

which mathematically states that the conditional probability of event \(M\) occurring given the event \(E\) is true is proportional to the conditional probability of event \(E\) occurring if event \(M\) is true multiplied by the probability of \(M\) . Here, \(P(M|E)\) is seen as the posterior probability of the model \(M\) given the evidence (or observations) \(E\) , \(P(E|M)\) as the likelihood of \(E\) given \(M\) and \(P(M)\) as the prior probability of the model \(M\) . Essentially, the prior, \(P(M)\) , represents the extant beliefs about the type of possible objective functions, eventually based on the observations already at disposal. The posterior \(P(M|E)\) , on the other hand, represents the updated beliefs about the objective function, given the new observations. The process basically aims at estimating the objective function by means of a statistical surrogate function, or surrogate model.

Many stochastic regression models can be used as a surrogate: the model must be able to describe a predictive distribution that represents the uncertainty in the reconstruction of the objective function, in practice by providing a mean and a variance.

To efficiently select the next sampling point, the proposed approach makes use of an acquisition function defined over the statistical moments of the posterior distribution given by the surrogate. The role of the acquisition function is crucial since it governs the trade-off between exploration (aptitude for a global search of the minimum) and exploitation (aptitude for sampling regions where the function is expected to be low) of the optimisation process. Probability of Improvement (PI), Expected Improvement (EI) and Upper Confidence Bound (UCB) are among the most used and most popular acquisition functions in Bayesian optimisation applications.

2.1 Finite element model updating

As mentioned, when iterative model updating methods are involved, the solution to the problem described by Eq. ( 1 ) entails the optimisation of a highly non-convex, high-dimensional cost function. In this case, modal features have been chosen to evaluate the degree of correlation between experimental and theoretical results, by employing both natural frequencies and associated mode shapes.

The selection of the parameters to be updated is a crucial step to reduce optimisation complexity, retain good physical understanding and ensure the well-posedness of the problem [ 42 ]. Generally, good practices to avoid ill-conditioning or ill-posedness are (1) choosing updating parameters that adequately affect the model output and (2) reducing the number of parameters to limit the occurrence of under-determinacy issues in the updating problem [ 43 ]. The first task can be accomplished by using sensitivity-based methods to discard non-sensitive parameters, and the second by dividing the structure into sub-parts with the same material properties. Additionally, the richness and the nature of the measured data, in contrast to the degree of discretization of the finite element model, places a limit on the type and number of parameters that can be updated while retaining physical meaningfulness.

Various issues of ill-conditioning or rank-deficiency can arise in relation to the specific optimisation technique used. In the case of the BO approach, the rank of the covariance matrix of the Gaussian Process (i.e., kernel matrix) may be source of some concern. The matrix can become nearly singular if (i) the original function that is being optimised is so smooth and predictable that leads to a high correlation between sampling points, thereby generating columns of near-one values, and/or if (ii) the sampled points are very close one to another (which typically happens towards the end of the optimisation process), thereby generating several columns that are almost identical [ 44 ].

2.2 Bayesian sampling optimisation algorithm

For highly non-convex cost functions and problems denoted by high dimensionality, traditional optimization algorithms may encounter difficulty in identifying the global optimum or fail to converge, even within the framework of well-posed problem sets. In this study, we undertake a comparison between the performance of the proposed Bayesian sampling optimization approach and the outcomes derived from the aforementioned three classical alternatives. These will be discussed later in a dedicated paragraph.

When dealing with expensive and non-convex functions to optimise, both efficiency (in terms of sampling) and global search capabilities are paramount. Indeed, several global optimisation techniques have been developed over the years, but very few perform well when the number of function evaluations is kept to a minimum. One way to deal with expensive functions is by using surrogate optimisation techniques. This approach consists in substituting the objective function with a fast surrogate model, which is then used to search for the optimum and speed up the optimisation process. Of course, the validity of the surrogate model, that is to say, its capability to represent the behaviour of the underlying objective function, is of uttermost importance to obtain good and reliable results. Unfortunately, when a linear regression of the form

is used to fit the data (where \({\mathbf{x}}^{(i)}\) is the i-th sampled point out of a total of \(h\) , \(y\left({\mathbf{x}}^{(i)}\right)\) is the associated objective value, \({f}_{h}(\mathbf{x})\) is a function of \(\mathbf{x}\) , \({\beta }_{h}\) are coefficients to be estimated, and \({\epsilon }^{(i)}\) are the independent errors, normally distributed), it is arduous to determine which functional form should be employed if none or scanty a priori information about the function of interest is available. As such, these strategies are often impracticable for model updating optimisation problems.

The approach of Bayesian sampling optimisation consists of a change of paradigm. Instead of trying to minimise the error \({\epsilon }^{(i)}\) by selecting some functional form that aligns with the data, the focus is placed on modelling the error by means of a stochastic process, so that the surrogate model is of the form:

where \(\mu\) is the regression term (the functional form is just a constant), and the error term \(\epsilon \left({\mathbf{x}}^{(i)}\right)\) is a stochastic process with mean zero (in other words, a set of correlated random variables indexed by space). This change of perspective about the surrogate function is comprehensively described in one of the most interesting papers on modern Bayesian optimisation, [ 38 ], where the proposed method is called Efficient Global Optimisation, EGO. Besides modelling the surrogate as a stochastic process, the Bayesian sampling optimisation method makes use of an acquisition function to perform a utility-based selection of the points to be sampled. These (a stochastic predictive/surrogate model combined with the acquisition function) are in fact the two key elements of Bayesian optimisation.

BO has gained much attention in the last decades only. However, the first works on the topic have been published in the early 60s by [ 45 ]. After some developments by [ 46 ], who used Wiener processes, the concept of Bayesian optimisation using Gaussian Processes as the surrogate model was first used in the EGO formulation, combined with the expected improvement (EI) concept [ 47 ].

In the last years, several research works have proven the advantages of using Bayesian optimisation with expensive non-convex functions [ 48 ], making it a popular and well-known global optimisation technique.

Fitting a surrogate model to the data requires carrying out an additional optimisation process for the determination of hyperparameters. Furthermore, the next point to be sampled is found by searching for the maximum of the acquisition function. Hence, the BO approach entails two secondary (arguably fast-computing) optimisation problems, to be solved at each iteration: this results in a somewhat fancy and potentially heavy algorithm, which is suitable only if the objective function is considerably expensive.

In the following, this notation is often used:

\({\mathcal{D}}_{1:t}=\left\{{\mathbf{x}}_{1:t},f\left({\mathbf{x}}_{1:t}\right)\right\}\)

where \({\mathcal{D}}_{1:t}\) denotes the observation set, or sample, made of \(t\) observations in total, and \({\mathbf{x}}_{i}\) is the input point vector of the \(i\)-th observation; in other words, this vector contains the updating parameters (in the input domain). The length of \({\mathbf{x}}_{i}\) equals \(d\), the dimensionality of the updating problem, i.e., the number of updating parameters. Finally, \(f\left({\mathbf{x}}_{1:t}\right)\), also abbreviated as \({\mathbf{f}}_{t}\), are the observed values of the objective function at \({\mathbf{x}}_{1:t}\), i.e., the outputs of the cost function at each set of updating parameters \({\mathbf{x}}_{i}\).

While any probabilistic model can be adopted to describe the prior and the posterior, it should be (i) relatively light and fast, to provide quick access to predictions and related uncertainties; (ii) able to adequately fit the objective function with a small number of observations, since sampling efficiency is pursued; and (iii) such that the conditional variance vanishes if and only if the distance between an observation and the prediction point is zero, as this is one condition that ensures the convergence of the BO method [49].

Given these requirements, Gaussian Process priors are the probabilistic model of choice in the majority of modern Bayesian optimisation implementations. Among the popular alternatives, [50] worked with random forests, [51] with deep neural networks, [52] made use of Bayesian neural networks, while [53] used Mondrian trees. GPs are well-suited for model updating problems, where the penalty function to be minimised is continuous.
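As an aside, a minimal sketch of such a GP surrogate fit (using scikit-learn, with a hypothetical one-dimensional stand-in for an expensive cost function; all names are illustrative, not the implementation used in this work) might look as follows:

```python
# Illustrative sketch: a GP surrogate fitted to a few samples of an
# expensive objective. The objective below is a hypothetical stand-in.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # hypothetical expensive, non-convex cost function
    return np.sin(3.0 * x) + 0.5 * (x - 0.7) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, size=(8, 1))           # 8 seed observations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

X_star = np.linspace(0.0, 2.0, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_star, return_std=True)  # prediction and uncertainty
```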

Given a Gaussian Process (seen as a continuous collection of random variables, any finite number of which have a consistent joint Gaussian distribution [54]) of the form:

\(f\left(\mathbf{x}\right)\sim \mathcal{GP}\left(m\left(\mathbf{x}\right),k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)\right)\)

where \(m(\mathbf{x})\) is the mean function and \(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)\) is the covariance function (which models the level of correlation between two observations, \({f}_{i}\) and \({f}_{j}\), relative to the distance between the points \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\)), the covariance can be computed for each pair of sampled points and conveniently arranged in matrix form:

\(\mathbf{K}=\left[\begin{array}{ccc}k\left({\mathbf{x}}_{1},{\mathbf{x}}_{1}\right)& \cdots & k\left({\mathbf{x}}_{1},{\mathbf{x}}_{t}\right)\\ \vdots & \ddots & \vdots \\ k\left({\mathbf{x}}_{t},{\mathbf{x}}_{1}\right)& \cdots & k\left({\mathbf{x}}_{t},{\mathbf{x}}_{t}\right)\end{array}\right]\)

Many covariance functions \(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)\) (or kernel functions) can be chosen, as decreasing functions of the distance between points \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\) in the input space.

Considering the joint Gaussian distribution:

\(\left[\begin{array}{c}{\mathbf{f}}_{t}\\ {f}_{*}\end{array}\right]\sim \mathcal{N}\left(\mu \mathbf{1},\left[\begin{array}{cc}\mathbf{K}& {\mathbf{k}}_{*}\\ {\mathbf{k}}_{*}^{T}& k\left({\mathbf{x}}_{*},{\mathbf{x}}_{*}\right)\end{array}\right]\right)\)

where \({f}_{*}\) is the objective output at \({\mathbf{x}}_{*}\), that is \({f}_{*}=f\left({\mathbf{x}}_{*}\right)\), and

\({\mathbf{k}}_{*}={\left[k\left({\mathbf{x}}_{*},{\mathbf{x}}_{1}\right),k\left({\mathbf{x}}_{*},{\mathbf{x}}_{2}\right),\dots ,k\left({\mathbf{x}}_{*},{\mathbf{x}}_{t}\right)\right]}^{T}\)

the following predictive distribution can be derived (for a full analytical derivation, see [55]):

\({\mu }_{t}\left({\mathbf{x}}_{*}\right)=\mu +{\mathbf{k}}_{*}^{T}{\mathbf{K}}^{-1}\left({\mathbf{f}}_{t}-\mu \mathbf{1}\right) \quad (11)\)

\({\sigma }_{t}^{2}\left({\mathbf{x}}_{*}\right)=k\left({\mathbf{x}}_{*},{\mathbf{x}}_{*}\right)-{\mathbf{k}}_{*}^{T}{\mathbf{K}}^{-1}{\mathbf{k}}_{*} \quad (12)\)

In the above set of equations, \({\mu }_{t}\left({\mathbf{x}}_{*}\right)\) is the prediction of the objective function value at any chosen point \({\mathbf{x}}_{*}\), and \({\sigma }_{t}^{2}\left({\mathbf{x}}_{*}\right)\) is the variance of the prediction at \({\mathbf{x}}_{*}\) (the subscripts here denote that the prediction and its variance come from a GP trained with the \({\mathcal{D}}_{1:t}=\left\{{\mathbf{x}}_{1:t},f\left({\mathbf{x}}_{1:t}\right)\right\}\) data sample).
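A minimal numerical sketch of this exact inference step (Eqs. (11) and (12)) is given below; for brevity, the constant mean \(\mu\) is taken as zero, and `kernel` is a placeholder for any covariance function:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, f, x_star, kernel):
    """Exact GP inference (Eqs. 11-12), zero-mean form.
    kernel(a, b) must return the scalar covariance k(a, b)."""
    t = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(t)] for i in range(t)])
    K += 1e-10 * np.eye(t)                      # jitter for numerical stability
    k_star = np.array([kernel(x_star, X[i]) for i in range(t)])
    c, low = cho_factor(K)                      # the O(t^3) step
    mu = k_star @ cho_solve((c, low), f)        # Eq. (11)
    var = kernel(x_star, x_star) - k_star @ cho_solve((c, low), k_star)  # Eq. (12)
    return mu, var
```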

To compute the prediction and the variance at \({\mathbf{x}}_{*}\) from Eqs. (11) and (12) (by means of which exact inference is computed), it is necessary to invert the kernel matrix \(\mathbf{K}\). This operation has a computational complexity of \(\mathcal{O}\left({N}^{3}\right)\), where \(N\) is the size of the (square) kernel matrix (which equals the number of observations, \(t\)). While this operation is relatively fast on its own, it can lead to computationally burdensome workflows, as (i) the BO approach entails the maximisation of the acquisition function, a task that may require computing thousands of predictions (especially in high-dimensional problems), and (ii) the number of observations keeps increasing (and so does the size of \(\mathbf{K}\)) as the optimisation advances and new points are added to the observation set.

Therefore, when using Gaussian Processes, BO scales poorly with the number of observations. One way to mitigate this problem consists of limiting the number of observations used to fit the GP (e.g., by defining an "active set" of a few hundred points), randomly choosing a new set of training points from the sample at each iteration of the algorithm. Indeed, this practice is applied in the implementation used within this work, with an active set size of 300.
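The active-set mitigation can be sketched as a random subsampling step before each refit (300 being the size used in this work; the function name is illustrative):

```python
import numpy as np

def active_set(X, f, max_size=300, rng=None):
    """Randomly subsample the observation set used to refit the GP,
    keeping the cubic cost of exact inference bounded."""
    rng = rng or np.random.default_rng()
    t = X.shape[0]
    if t <= max_size:
        return X, f
    idx = rng.choice(t, size=max_size, replace=False)
    return X[idx], f[idx]
```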

The choice of the kernel function deeply affects the smoothness properties of a GP. This must be coherent with the features of the underlying objective function to obtain a quality surrogate model. Moreover, as each problem has its own specifics, the kernel function must be properly scaled. To this end, kernel functions are generalised by introducing hyperparameters. In the case of a squared exponential kernel function, this results in the equation:

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\,\mathrm{exp}\left(-\frac{{\Vert \mathbf{x}-{\mathbf{x}}^{\prime}\Vert }^{2}}{2{\theta }^{2}}\right)\)

where \({\sigma }_{f}\) is the vertical scale, i.e., the GP's standard deviation (it describes the vertical scaling of the GP's variance), and the hyperparameter \(\theta\) is the characteristic length scale, which defines how far apart the input points \({{\varvec{x}}}_{i}\) can be for the outputs to become uncorrelated. When dealing with anisotropic problems (as is often the case with model updating), it is much more convenient to use separate length scales, one for each parameter. This is typically done with automatic relevance determination (ARD) kernels, which use a vector of hyperparameters \({\varvec{\theta}}\) whose size equals \(d\).

In practical terms, when a specific length scale \({\theta }_{l}\) assumes a significantly higher value than the other length scales, the kernel matrix becomes essentially independent of the \(l\)-th parameter.

The optimal set of hyperparameters \({{\varvec{\theta}}}^{+}\) is computed by maximisation of the marginal log-likelihood of the evidence \({\mathcal{D}}_{1:t}=\left\{{\mathbf{x}}_{1:t},f\left({\mathbf{x}}_{1:t}\right)\right\}\) given \({\varvec{\theta}}\):

\({{\varvec{\theta}}}^{+}=\underset{{{\varvec{\theta}}}^{+}}{\mathrm{argmax}}\left[-\frac{1}{2}{\left({\mathbf{f}}_{t}-{\mu }_{0}\mathbf{1}\right)}^{T}{\mathbf{K}}^{-1}\left({\mathbf{f}}_{t}-{\mu }_{0}\mathbf{1}\right)-\frac{1}{2}\mathrm{log}\left|\mathbf{K}\right|-\frac{t}{2}\mathrm{log}2\pi \right]\)

where the \({{\varvec{\theta}}}^{+}\) vector contains the \(d\) length scales \({{\varvec{\theta}}}_{1:d}\), plus the vertical scale \({\sigma }_{f}\) and the mean \({\mu }_{0}\) (i.e., the constant regression term) of the GP, and therefore all the \(d+2\) hyperparameters, so that \({{\varvec{\theta}}}^{+}:=({{\varvec{\theta}}}_{1:d},{\mu }_{0},{\sigma }_{f})\). In the previous equation, the dependency on \({{\varvec{\theta}}}^{+}\) enters through the kernel matrix \({\varvec{K}}\) and through \({\mu }_{0}\).

By employing this approach, a sort of sensitivity analysis of the parameters over the sampled points is performed. This built-in feature of the Bayesian optimisation technique is certainly useful for structural model updating problems, where the system sensitivity to the updating parameters is often dissimilar across parameters and usually unknown.
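A sketch of this hyperparameter fit, assuming a constant-mean GP with an ARD squared exponential kernel (all names below are illustrative, and the constant \(-\frac{t}{2}\mathrm{log}2\pi\) term is dropped since it does not affect the maximiser):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, f):
    """Negative marginal log-likelihood of a constant-mean GP with an ARD
    squared exponential kernel; length scales and sigma_f live in log space."""
    d = X.shape[1]
    theta = np.exp(params[:d])                   # ARD length scales
    mu0, sigma_f = params[d], np.exp(params[d + 1])
    diff = (X[:, None, :] - X[None, :, :]) / theta
    K = sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
    K += 1e-8 * np.eye(X.shape[0])               # jitter / observation noise
    c, low = cho_factor(K)
    r = f - mu0
    # 0.5 * r^T K^-1 r + 0.5 * log|K|  (constant term dropped)
    return 0.5 * r @ cho_solve((c, low), r) + np.sum(np.log(np.diag(c)))

# Example fit (X: t-by-d observations, f: objective values):
# theta_plus = minimize(neg_log_marginal_likelihood,
#                       x0=np.zeros(X.shape[1] + 2), args=(X, f)).x
```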

Four different ARD kernels will be employed, two of which are in the form of Matérn functions [56]. These functions are defined as

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\frac{{2}^{1-\varsigma }}{\Gamma \left(\varsigma \right)}{\left(\frac{\sqrt{2\varsigma }\,r}{\theta }\right)}^{\varsigma }{H}_{\varsigma }\left(\frac{\sqrt{2\varsigma }\,r}{\theta }\right)\)

where \(\varsigma\) is a smoothness coefficient, \(r=\Vert \mathbf{x}-{\mathbf{x}}^{\prime}\Vert\) is the distance between the points, while \(\Gamma (\cdot )\) and \({H}_{\varsigma }\left(\cdot \right)\) are the Gamma function and the modified Bessel function of the second kind of order \(\varsigma\), respectively. As the smoothness coefficient \(\varsigma\) tends towards infinity, the Matérn function reduces to the squared exponential function; for \(\varsigma =1/2\), it reduces to the unsquared exponential function. The four employed kernels are:

an ARD unsquared exponential kernel:

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\,\mathrm{exp}\left(-{r}_{\theta }\right)\)

an ARD squared exponential kernel:

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\,\mathrm{exp}\left(-\frac{1}{2}{r}_{\theta }^{2}\right)\)

an ARD Matérn 3/2 kernel (\(\varsigma =\frac{3}{2}\)):

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\left(1+\sqrt{3}\,{r}_{\theta }\right)\mathrm{exp}\left(-\sqrt{3}\,{r}_{\theta }\right)\)

an ARD Matérn 5/2 kernel (\(\varsigma =\frac{5}{2}\)):

\(k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma }_{f}^{2}\left(1+\sqrt{5}\,{r}_{\theta }+\frac{5{r}_{\theta }^{2}}{3}\right)\mathrm{exp}\left(-\sqrt{5}\,{r}_{\theta }\right)\)

where, in all four cases, \({r}_{\theta }=\sqrt{{\sum }_{l=1}^{d}{\left({x}_{l}-{x}_{l}^{\prime}\right)}^{2}/{\theta }_{l}^{2}}\) denotes the distance scaled by the ARD length scales.

The difference between the four kernel functions is visible in Fig. 1. The exponential kernel stands out due to its rapid decline in correlation as distance increases. Consequently, function samples drawn from a GP constructed with the exponential kernel exhibit notably rugged features, whereas those generated using the Matérn 3/2, Matérn 5/2, and squared exponential kernel functions display increasingly smoother characteristics.

Fig. 1: Correlation between observations \({{\varvec{x}}}_{i}\) and \({{\varvec{x}}}_{j}\) according to the four different kernel functions, plotted against the distance (\(\theta =0.25\))
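The decay behaviours compared in Fig. 1 can be reproduced directly from the four expressions above; the sketch below uses the length-scale-normalised distance, with \(\theta =0.25\) as in the figure (illustrative only):

```python
import numpy as np

# Correlation (sigma_f = 1) for the four kernels, as functions of the
# length-scale-normalised distance r = dist / theta.
def k_exp(r):      return np.exp(-r)
def k_sqexp(r):    return np.exp(-0.5 * r**2)
def k_matern32(r): return (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)
def k_matern52(r): return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

theta = 0.25                                   # as in Fig. 1
dist = np.linspace(0.0, 1.0, 200)
r = dist / theta
for name, k in [("exponential", k_exp), ("Matern 3/2", k_matern32),
                ("Matern 5/2", k_matern52), ("squared exp.", k_sqexp)]:
    print(name, k(r)[:3])                      # start of each decay curve
```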

As mentioned already, two different approaches can be followed to search for the optimum: the exploitative approach and the explorative approach. The automatic trade-off between exploitation and exploration is handled by the acquisition function.

Historically, Bayesian optimisation has been formulated in the scientific literature as a methodology for the maximisation of objective functions. Consequently, acquisition functions are conventionally designed to yield high values in regions where the objective is deemed to be high. Thus, when seeking the minimum of a function \(f\left({\varvec{x}}\right)\), as in the case of model updating, it is sufficient to consider the equivalent problem:

\(\underset{{\varvec{x}}}{\mathrm{min}}\,f\left({\varvec{x}}\right)=-\underset{{\varvec{x}}}{\mathrm{max}}\left(-f\left({\varvec{x}}\right)\right)\)

The next point \({{\varvec{x}}}_{t+1}\) that will be chosen for sampling is found by maximising the acquisition function \(a({\varvec{x}})\), according to the optimisation problem:

\({{\varvec{x}}}_{t+1}=\underset{{\varvec{x}}}{\mathrm{argmax}}\,a\left({\varvec{x}}\right)\)

Four acquisition functions were tested on a preliminary numerical case study, consisting of the updating of a simple 2D shear-frame (see Fig. 2): Probability of Improvement (PI), Expected Improvement (EI), a modified version of Expected Improvement [57], and Upper Confidence Bound (UCB). Among these, UCB was selected for implementation in the Bayesian sampling algorithm used in this study, as it was found to strike the best balance between exploitation and exploration. In particular, given the same initial seed and the same total sampling volume, UCB was found to be about 40%, 20%, and 70% more accurate than PI, EI, and the modified EI [57], respectively.

Fig. 2: Comparative analysis of four distinct acquisition functions. The plots illustrate the trajectory of the optimal objective function value throughout the optimisation iterations for the updating of a basic 2D shear-frame

Upper Confidence Bound (or Lower Confidence Bound, LCB, if minimisation is involved), first proposed by [58] in the "Sequential Design for Optimisation" (SDO) algorithm, is a very simple yet very effective approach. The UCB function is defined as:

\(UCB\left(\mathbf{x}\right)=\mu \left(\mathbf{x}\right)+\kappa \,\sigma \left(\mathbf{x}\right)\)

where \(\kappa\) is typically a positive coefficient, which controls the width of the bound identified by the standard deviation \(\sigma \left(\mathbf{x}\right)\), and therefore the propensity of the algorithm to explore the search space. Often, \(\kappa\) is taken equal to 2, so that the confidence bound is about 95% (indeed, this is the value used in the following case study).
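As a sketch, the selection rule for minimisation (i.e., the LCB counterpart) amounts to a few lines; `gp` is assumed to expose a scikit-learn-style `predict`, and all names are illustrative:

```python
import numpy as np

def lcb(mu, sigma, kappa=2.0):
    """Lower confidence bound: a low predicted objective and a high
    uncertainty both make a candidate point attractive."""
    return mu - kappa * sigma

# Candidate selection over randomly drawn points X_star:
# mu, sigma = gp.predict(X_star, return_std=True)
# x_next = X_star[np.argmin(lcb(mu, sigma))]
```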

The “intelligent” sampling performed by the BO approach when using UCB is displayed in Fig. 3. Here, a simple numerically-simulated updating problem consisting of a 3-DOF shear-type system is considered. All levels share the same stiffness \(k\), while the lumped masses are \({m}_{1}\), \({m}_{2}\) and \({m}_{3}\). The parameters being updated are the stiffness \(k\) and the mass \({m}_{2}\), whose target values are known in advance (as is the target response of the system). The associated 2D penalty function is sampled at 9 randomly chosen points, a Gaussian Process is fitted to the observations, and the UCB function is computed from \(\mu \left(\mathbf{x}\right)\) and \(\sigma \left(\mathbf{x}\right)\) by means of Eqs. (11) and (12). The minimum according to the surrogate model and the acquisition function maximum (i.e., the succeeding sampling point) are visible. When using UCB (with \(\kappa =2\)), the choice of the next sampling point is heavily influenced by the high level of uncertainty in the predictions: the acquisition maximum is found in an area far from other observations, where uncertainty is very high, while the predicted objective is still reasonably low. Furthermore, UCB tends to explore the optimisation space at the early stages of the procedure, when the number of observations is low and the uncertainty is high, followed by a gradually more exploitative behaviour as the overall uncertainty decreases. This is a remarkable asset for locating the global optimum, as both exploration and exploitation needs are upheld.
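For reference, the modal quantities of such a 3-DOF shear-type system can be computed as below (a sketch; the numerical values of \(k\) and the masses are purely illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def shear_frame_frequencies(k, m1, m2, m3):
    """Natural frequencies (Hz) of a 3-DOF shear-type system with uniform
    inter-storey stiffness k and lumped floor masses m1, m2, m3."""
    K = k * np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  1.0]])
    M = np.diag([m1, m2, m3])
    lam, _ = eigh(K, M)               # generalised eigenproblem K v = lam M v
    return np.sqrt(lam) / (2.0 * np.pi)

# Illustrative values only:
print(shear_frame_frequencies(k=1.2e6, m1=1e3, m2=1e3, m3=1e3))
```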

Fig. 3: Top: the GP mean (surface in red), the nine observations \({{\varvec{f}}}_{1:9}\) used for training, the model minimum (i.e., the lowest function value according to the GP), and the next sampling point selected by the UCB acquisition function. Bottom: the acquisition function surface \(UCB({\varvec{x}})\) and the next chosen sampling point, corresponding to the acquisition function maximum

3 Methodology

In this section, the implementation of the Bayesian optimisation algorithm used in this study is described. For completeness, the main technical details of the optimisation techniques used for comparison are discussed as well. These algorithms are in fact very sensitive to specific implementation choices and initial parameter values, which affect the optimisation outcome both in terms of sampling efficiency and accuracy. This is particularly true for Simulated Annealing and the Genetic Algorithm.

3.1 Bayesian optimisation: the proposed algorithm

Technical details about the implementation of the employed Bayesian sampling optimisation procedure are summarized as follows, according to the flowchart represented in Fig.  4 .

1. The optimisation procedure is initialised by computing the objective function at the seed points, which are randomly chosen within the optimisation domain, defined by the search bounds of each input parameter. The seed size should be sufficiently large to avoid overfitting when selecting the optimal set of kernel hyperparameters through log-likelihood maximisation. As a rule of thumb, [38] suggest setting the initial seed size at \(10\cdot d\) at least, where \(d\) is the number of dimensions of the optimisation problem (i.e., of updating parameters). Indeed, this criterion is followed for the presented case study.

2. The Gaussian Process (i.e., the surrogate model at iteration \(i\)) is fitted by maximising the marginal log-likelihood, which enables the selection of the optimal set of hyperparameters \({{\varvec{\theta}}}^{+}\). Moreover, a small amount of Gaussian noise \({\sigma }^{2}\) is added to the observations (such that the prior distribution has covariance \(\mathbf{K}({\varvec{x}},{\varvec{x}}^{\prime};{\varvec{\theta}})+{\sigma }^{2}{\varvec{I}}\)).

3. To maximise the acquisition function, several thousand predictions \({\mu }_{t}\left({\mathbf{x}}_{*}\right)\) are computed at points \({\mathbf{x}}_{*}\) randomly chosen within the optimisation space. Then, some of the best points are further improved with local search (for this application, the MATLAB® function "fmincon" is used), among which the best point is finally selected.

4. The objective function is computed at the point corresponding to the acquisition function maximum.

Fig. 4: Bayesian sampling optimisation algorithm flowchart

By following this workflow, a newly fitted GP is used at each algorithm iteration. In fact, the objective function value computed at the \(i\)-th iteration is added to the set of observations at iteration \(i+1\), which is then employed to train the GP used to model the objective function, by determining a new set of hyperparameters via log-likelihood maximisation.
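A high-level sketch of this loop is reported below, with scipy's `minimize` standing in for the MATLAB® "fmincon" local search; `fit_gp` is any routine returning a fitted GP with a scikit-learn-style `predict`, and all names and default values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def bo_loop(objective, bounds, fit_gp, seed_size=20, budget=100, kappa=2.0):
    """Skeleton of the described workflow: (1) random seed, (2) GP refit,
    (3) acquisition maximisation via random multistart plus local search,
    (4) evaluation of the true objective. bounds is a (d, 2) array."""
    rng = np.random.default_rng(0)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(seed_size, len(lo)))        # step (1)
    f = np.array([objective(x) for x in X])
    while len(f) < budget:
        gp = fit_gp(X, f)                                     # step (2)

        def acq(x):                                           # LCB, to be minimised
            mu, sd = gp.predict(np.atleast_2d(x), return_std=True)
            return float(mu[0] - kappa * sd[0])

        cand = rng.uniform(lo, hi, size=(2000, len(lo)))
        best = sorted(cand, key=acq)[:5]                      # best random candidates
        polished = [minimize(acq, x0, bounds=list(zip(lo, hi))).x
                    for x0 in best]                           # step (3): local search
        x_next = min(polished, key=acq)
        X = np.vstack([X, x_next])
        f = np.append(f, objective(x_next))                   # step (4)
    return X[np.argmin(f)], f.min()
```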

Before starting the actual procedure (step (2)), the code performs cross-validation tests to determine which configuration of the GP is most suitable for the specific updating problem, according to the following procedure. First, non-exhaustive cross-validation tests are performed to choose whether to log-transform the input variables (i.e., the updating parameters), as this is often found to improve the GP regression quality. To this end, the validation loss is computed for two GPs, one fitted using non-transformed variables and the other fitted using log-transformed variables. Secondly, after choosing whether to transform the input variables or not, the GP is fitted four different times using the four kernel functions previously introduced. Once more, the cross-validation loss is computed to establish which kernel is the fittest for modelling the objective function. Once the input-variable transformation is established and the GP kernel chosen, the algorithm is actually initialised.
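A sketch of this configuration step, using scikit-learn's cross-validation utilities (the kernel objects and the default R² scoring are illustrative stand-ins for the validation loss used in this work):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, RBF
from sklearn.model_selection import cross_val_score

def choose_gp_config(X, f):
    """Pick (i) raw vs log-transformed inputs, then (ii) the best kernel,
    by cross-validated score (higher is better in sklearn's convention)."""
    variants = {"raw": X, "log": np.log(X)}      # inputs assumed positive

    def score(Z, kernel):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        return cross_val_score(gp, Z, f, cv=10).mean()

    transform = max(variants, key=lambda n: score(variants[n], Matern(nu=2.5)))
    Z = variants[transform]
    kernels = {"exponential": Matern(nu=0.5), "matern32": Matern(nu=1.5),
               "matern52": Matern(nu=2.5), "sq_exponential": RBF()}
    kernel = max(kernels, key=lambda n: score(Z, kernels[n]))
    return transform, kernel
```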

3.2 Benchmark algorithms: GPS, SA, and GA

Generalized Pattern Search (GPS) is a relatively simple traditional optimisation algorithm, while Simulated Annealing (SA), the Genetic Algorithm (GA) and BO are generally considered "computational intelligence" optimisation techniques. All four algorithms have in common that no use of derivatives is made, hence the function is not required to be differentiable. Besides, despite the different approaches and backgrounds, SA, GA and Bayesian sampling optimisation share many elements: all are designed to carry out a global search of the minimum, avoiding local minima; they are appropriate for non-linear and non-smooth functions; finally, they are especially suitable for black-box models, where establishing in advance a functional form that effectively aligns with the data is often impossible. The key difference between the BO approach and the other techniques is that the former requires a much smaller sampling volume to achieve comparable optimisation performance, greatly enhancing the computational efficiency when expensive cost functions are involved. Nonetheless, greater sampling efficiency comes at the expense of a more sophisticated algorithm, which requires computationally intensive operations at each iteration. One shared drawback of GA, SA and Bayesian optimisation is that these algorithms tend to give results close to the global minimum, although not very accurate ones. On the other hand, GPS can achieve a highly accurate solution.

The selection of parameters proves to be a critical aspect across all techniques. However, parameter selection for GPS, GA, and SA, as found by the authors, presents greater challenges compared to BO. The optimisation performance exhibits heightened sensitivity to factors such as the initial temperature, cooling schedule, and acceptance probability function for SA, the crossover and mutation operators for GA, and the mesh shape and size in the case of GPS. By contrast, while BO also demands user-defined parameters as described in Sect. 2.2 (most notably, the kernel function), these do not pose significant difficulty and generalise robustly within the optimisation framework of model updating cost functions. Furthermore, the kernel hyperparameters are automatically optimised through maximisation of the marginal log-likelihood, as already mentioned.

Reference can be made to [ 59 ] for a first implementation of a GPS algorithm, to [ 60 ] who initially proposed the GA algorithm, and to [ 61 ] for a first application of the SA concept to optimisation problems.

Given the diverse nature of the optimisation techniques employed, it is essential to exercise particular care in selecting each algorithm's implementation strategy to ensure a fair comparison, necessary to evaluate the performance of the Bayesian sampling optimisation approach in model updating applications. In particular, GPS, SA and GA are allowed to sample the objective function 1000 times, while BO is stopped at 500 function evaluations. This is necessary since the former techniques typically need a much greater sampling volume to achieve sufficient levels of accuracy.

The specific technical details of each alternative are briefly described as follows. The GPS algorithm used in the case studies adheres to the standard procedure first introduced by Hooke & Jeeves and is set according to the following details:

Input parameters are linearly scaled to the interval \([0,100]\), according to the optimisation bounds: this is needed since the employed mesh size is equal in all dimensions.

The mesh size is multiplied by a factor of 2 at every successful poll, and it is divided by the same factor after any unsuccessful poll.

The algorithm stops if the maximum allowed number of objective function evaluations is reached.

Simulated Annealing is implemented in its most common formulation, as proposed by Kirkpatrick et al. Specifically, it is set according to the following strategy:

Input parameters are linearly scaled to the interval \([0,1]\).

The initial temperature \({T}_{0}\) is set at 50.

The temperature gradually decreases at each iteration according to the cooling schedule \(T={T}_{0}/k\) , where \(k\) is a parameter equal to the iteration number.

Each newly sampled point, if its objective is higher than the current one, is accepted according to the acceptance function \(\frac{1}{1+{\text{exp}}\left(\frac{\Delta }{{\text{max}}(T)}\right)}\), where \(\Delta\) is the difference between the objective values (at the newly sampled point and the incumbent one).
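For clarity, this acceptance test can be written compactly (a sketch; \(\Delta >0\) denotes a worse new point):

```python
import numpy as np

rng = np.random.default_rng()

def accept_worse(delta, temperature):
    """Probability 1 / (1 + exp(delta / max(T))) of accepting an
    uphill (worse) move, as in the acceptance function above."""
    p = 1.0 / (1.0 + np.exp(delta / np.max(temperature)))
    return rng.random() < p
```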

Finally, the Genetic Algorithm implementation used is based on the following principles:

Compliance with the optimisation bounds is enforced by ensuring each individual is generated within the constraints at each generation, through proper crossover and mutation operators.

The initial population, necessary to initialise the algorithm, consists of points randomly chosen within the space defined by the optimisation bounds of each parameter. As the population size should increase with the number of dimensions, the initial population size was set to 200.

The choice of parents is made according to their fitness value. In particular, the chances of breeding are higher for higher fitness values.

The crossover fraction is set to 0.8.

The elite size is set at 5% of the population size.

The mutation fraction varies dynamically, according to the genetic diversity of each generation.

3.3 The objective function

A well-known objective function is employed, based on the difference between the estimated and the actual values of the natural frequencies, and on the Modal Assurance Criterion (MAC) [62] between target and computed modes. This can be defined as follows:

\(J=\sum_{i=1}^{N}\left|\frac{{\omega }_{i}^{calc}-{\omega }_{i}^{targ/id}}{{\omega }_{i}^{targ/id}}\right|+\sum_{i=1}^{N}\left[1-MAC\left({\phi }_{i}^{calc},{\phi }_{i}^{targ/id}\right)\right]\)

where \({\omega }_{i}^{targ/id}\) and \({\omega }_{i}^{calc}\) are respectively the \(i\)-th target (or identified) and the \(i\)-th computed natural angular frequency, out of the \(N\) modes used for updating, and \(MAC\left({\phi }_{i}^{calc},{\phi }_{i}^{targ/id}\right)\) is the MAC value between the \(i\)-th computed mode shape \({\phi }_{i}^{calc}\) and the \(i\)-th target (or identified) mode shape \({\phi }_{i}^{targ/id}\). This objective function includes both natural frequencies and mode shapes, with equal weights.
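A sketch of the MAC and of an objective of this kind (real-valued mode shapes are assumed, and the equal weighting follows the reconstructed expression above, itself an assumption):

```python
import numpy as np

def mac(phi_a, phi_b):
    """Modal Assurance Criterion between two real-valued mode-shape vectors."""
    num = np.abs(phi_a @ phi_b) ** 2
    return num / ((phi_a @ phi_a) * (phi_b @ phi_b))

def objective(omega_calc, omega_targ, phis_calc, phis_targ):
    """Frequency relative errors plus (1 - MAC) terms, equally weighted."""
    freq_term = np.sum(np.abs((omega_calc - omega_targ) / omega_targ))
    mac_term = sum(1.0 - mac(pc, pt) for pc, pt in zip(phis_calc, phis_targ))
    return freq_term + mac_term
```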

3.4 Performance metrics

The root mean square relative error (RMSRE) is considered as a global metric for the accuracy of the optimisation procedure, both in the input domain (i.e., for each updating parameter) and in the output domain (i.e., for the natural frequencies and mode shapes). In this latter case, it is computed as:

\(\mathrm{RMSRE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[{f}_{\mathrm{rel},i}^{2}+{\left(1-{MAC}_{i}\right)}^{2}\right]}\)

where \({f}_{\mathrm{rel},i}\) is the relative error of the \(i\)-th natural frequency, \({MAC}_{i}\) is the MAC value of the \(i\)-th mode shape, and \(n\) is the number of modes considered for updating. In the former case, for the parameters estimated in the input space, the RMSRE is instead given by:

\(\mathrm{RMSRE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{X}_{\mathrm{rel},i}^{2}}\)

where in this case \({X}_{\mathrm{rel},i}\) is the relative error between the \(i\)-th updating parameter and its target value, and \(n\) is the number of updating parameters considered. Obviously, this calculation was only possible for the numerical dataset, where the inputs of the numerically-generated results are user-defined and thus known and comparable.
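Both metrics reduce to a few lines (the way frequency errors and MAC values are combined in the output-space variant follows the reconstructed equation above, itself an assumption):

```python
import numpy as np

def rmsre_inputs(x_upd, x_targ):
    """RMSRE over the updating parameters (input space)."""
    rel = (np.asarray(x_upd) - np.asarray(x_targ)) / np.asarray(x_targ)
    return np.sqrt(np.mean(rel**2))

def rmsre_outputs(f_rel, mac_vals):
    """RMSRE over natural frequencies and mode shapes (output space)."""
    f_rel, mac_vals = np.asarray(f_rel), np.asarray(mac_vals)
    return np.sqrt(np.mean(f_rel**2 + (1.0 - mac_vals)**2))
```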

Finally, the total computational time is reported as well, specifically as a way to compare BO to the other algorithms (GPS, SA, and GA). This makes it possible to demonstrate the time-saving advantage of a highly efficient optimisation technique such as Bayesian sampling optimisation.

4 The case study: the Mirandola Bell Tower

The proposed approach has been validated on a well-known CH case study, the Mirandola bell tower, resorting to data collected from on-site surveys [ 63 , 64 ]. Indeed, historical CH structures require very specific monitoring and assessment strategies; a complete overview can be found in [ 65 ].

Both experimental data and numerically-simulated data have been employed. The latter case is intended for assessing the algorithm's capabilities in a more controlled fashion. The experimental validation, on the other hand, proves its feasibility for data collected from operational (output-only) Ambient Vibration (AV) tests.

The masonry bell tower of the Santa Maria Maggiore Cathedral in Mirandola (Emilia-Romagna, Italy) is pictured in Fig. 5a. For this application, model updating is used as a means of structural damage assessment, as mentioned in the Introduction. The results from this procedure, carried out through the use of a Bayesian sampling optimisation approach, will be compared to the damage analysis of the bell tower structure conducted in [63]. The data analysed here originate from the same dataset, which will be discussed later.

Fig. 5: Pictures of the Santa Maria Maggiore Cathedral (a), the bell tower located on its south-east corner (b), the discretisation in five substructures (c), and the sensor layout with the location and orientation of the recording channels (d). Adapted from [63]

The target structure represents an important piece of historical and cultural heritage. Built during the late fourteenth century, it underwent several structural modifications throughout the centuries, especially in the seventeenth century, when the height of the original tower was tripled [66] and the existing portions were reinforced to withstand the additional weight. The octagonal stone roofing was finally added in the eighteenth century. From a geometric perspective, the tower has a square plan (5.90 \(\times\) 5.90 m) and a total height of 48 m, with four levels of openings (Fig. 5b).

Historically, the area of interest was considered to be at only modest seismic risk. However, two events linked to the 2012 Emilia Earthquake (specifically, on the 20th and 29th of May) caused significant damage to the tower, as detailed in [64]. The damage pattern influenced the FE modelling as well, as will be discussed in the next subsections. More details from the visual inspection and in-situ surveys can be found in [64] and [63].

4.1 Acquisition setup

The dynamic testing discussed here was performed by the laboratory of Strength of Materials (LabSCo) of the IUAV (Istituto Universitario di Venezia) in August 2012 [ 64 ], considering the post-earthquake situation before the installation of the provisional safety interventions. An Operational Modal Analysis (OMA) was conducted, relying on the structural response from ambient vibrations to identify the modal properties of the target structure [ 67 ].

Eight uniaxial piezoelectric accelerometers (PCB Piezotronics type 393C) were deployed as portrayed in Fig. 5d, using metal bases to attach them to the masonry walls [1]. All recorded signals had a length of \(\sim\) 300 s and a sampling frequency \({f}_{s}=192\) Hz; this was then reduced to 48 Hz in post-processing, focusing on the first eight modes, to expedite the output-only identification procedure. This was done by employing the Stochastic Subspace Identification (SSI) algorithm [68] on the detrended and filtered data. However, only the first six modes (reported in Table 1) were employed, as done in [63], since the 7th and 8th ones were deemed less reliable.

4.2 Crack pattern and finite element model

The FE model (Fig.  6 ) was organised in substructures (i.e. macro-areas), following the structural partition of the tower structure and reflecting the strong localisation of damage. This macro-zoning procedure reflects what is generally done for similar structures after seismic damage—see the similar case of the Fossano bell tower in [ 69 ]. The rationale is to encompass areas that are expected to show similar and homogeneous mechanical properties.

Fig. 6: Finite Element model of the Mirandola bell tower: (a) the whole structure, (b) detail of the linear spring elements, (c) the idealised connections with the nearby buildings, shown in plan view. Adapted from [63]

In this case, the tower base (subsection 1, highlighted in red in Fig. 5c), i.e., the portion from ground level (0.00 m) to +9.50 m, suffered minimal to no damage, as did the tower top (subsection 4) and the belfry (subsection 5), highlighted respectively in yellow (between +30.5 m and +37.5 m) and green (from +37.5 m to +48.0 m). The most severely and extensively damaged portions were subsection 2 (dark orange, +9.50 m to +21.0 m) and subsection 3 (light orange, +21.0 m to +30.5 m). This latter floor represents a structural peculiarity of this case study, as it includes very large window openings on all façades. The reason is that this portion corresponds to the first belfry, built before the addition of the second one on top of it. In any case, this layout is quite different from the most common cases, where the opening size decreases moving down from the tower top. This affects the local and global dynamic response of the whole structure [70]. In fact, during the seismic events of interest, this locally more flexible portion underwent a twisting rotation, which caused deep diagonal cracks to arise right below this order of openings, extending down to the (less wide) openings of the underlying level on all four sides. Importantly, the second and third portions were also the ones covered by the eight metal tie rods (four per portion) installed as provisional safety interventions.

As for the previous case study, the model was realised in ANSYS Mechanical APDL. SHELL181 elements were used for all façades in all macro-areas, as well as for the stone roof and the masonry vaults at the basement level, for a total of 1897 elements. The interactions with the Cathedral of Santa Maria Maggiore and the rectory were modelled by 104 COMBIN14 linear springs, distributed along the whole contact surface in correspondence with the apse arches, the nave walls, and the rectory wall on the East, West, and North sides (Fig. 6b). These are intended to simulate the in-plane stiffness of the attached masonry walls, thus removing these external elements and replacing their reaction forces with the springs' elastic forces, acting as boundary conditions, as also suggested in [63].

The complete FE model consisted of 2052 nodes.

4.3 Model updating setup

The following eleven parameters were considered for updating (see Table  2 ):

\({E}_{1}, {E}_{2}, {E}_{3}, {E}_{4}\): the Young's moduli of the damaged masonry in the four sub-structures (according to Fig. 6).

\({\nu }_{mas}\) : the Poisson’s ratio of the masonry, assumed as constant everywhere.

\({k}_{1},{k}_{2},{k}_{3},{k}_{4},{k}_{5},{k}_{6}\) : the linear stiffness of the six distributed springs (used to model the connections with the nearby structures, see Fig.  6 ).

These parameters are exactly the same as those considered in [63], thus enabling a direct comparison of the results. No sensors were available on the belfry roof; therefore, for lack of reliable data, the 5th macro-area was not considered in the FE updating.

Table 2 shows these input parameters and the assumed optimisation bounds. Notice that the optimisation range of the link-element parameters spans several orders of magnitude, thus generating an extremely wide optimisation space. This reflects the high uncertainty about the boundary conditions, which might significantly affect the dynamic response of the structure. The optimisation bounds of the elasticity moduli consider the values suggested by the literature and by Italian regulations for brick masonry, while allowing the level of damage suffered by the structure to be captured.

The (arbitrarily chosen) target input parameters and the related system-output parameters (in terms of frequency only) used for the numerically simulated data setup are reported in Table  3 . System-output parameters used for updating the experimental data setup are the identified natural frequencies shown in Table  1 (and the related mode shapes).

5 Results

The results for BO and each benchmark optimisation algorithm are first presented for the numerical dataset, drawing considerations about accuracy and computational time. These are followed by the results for the experimental dataset, which allow the assessment of BO performance in a real application. Furthermore, concerning BO, the use of several kernels is investigated, and the benefits of accessing the parameters' length scales when using ARD kernel functions in the framework of model updating are highlighted.

5.1 Numerically-simulated data results

For the numerical case, the initial seed size is set to 220 points (20 times the number of parameters being updated). A logarithmic transformation is employed on the input parameters, as it was observed to improve the quality of the GP regression. The choice of the kernel function is driven by a cross-validation test, since some kernels may prove more suitable for modelling the underlying objective function specific to this updating problem, resulting in surrogate models with enhanced validity. The outcome of a tenfold cross-validation test of four different Gaussian Processes, fitted using the ARD exponential, ARD Matérn 3/2, ARD Matérn 5/2 and ARD squared exponential kernels, is shown in Table 4. Whilst all kernel functions are seen to return reliable regression models, the ARD Matérn 5/2 kernel is found to be the most suitable, returning excellent validation results. Fitting the GP to the initial seed also allows retrieving information on the system sensitivity through the optimised hyperparameters of the selected ARD kernel (Table 5).

As expected, the eleven length scales differ by several orders of magnitude, owing to the anisotropy of the problem. The parameters which most affect the system response are the elasticity moduli: such behaviour is foreseeable, as material elasticity is known to significantly impact the modal properties. However, the kernel hyperparameters suggest that the elasticity modulus of the fourth sub-structure has instead low sensitivity: this is likely due to the scarcity of sensors at that level of the building, which prevents capturing the necessary vibration information. Therefore, reliable estimations of \({E}_{4}\) should not be expected, neither in the simulated-data setup nor in the experimental one. The problem appears to be scarcely sensitive to changes in the Poisson's ratio as well: this makes sense, as the Poisson's ratio has only a marginal effect on the modal properties. As for the springs, these are generally found to have a lesser effect on the modal response compared to the elasticity moduli of the first three sub-structures. In particular, as the hyperparameters suggest, \({k}_{5}\) and \({k}_{6}\) are found to have a lower impact on the system modal response, while \({k}_{1}\), \({k}_{3}\) and \({k}_{4}\) feature higher sensitivity.

The Bayesian sampling optimisation is carried out using the upper confidence bound (UCB) acquisition function, which was generally found to be the most effective, as it provided a good balance between exploitation and exploration. In Fig. 7 (top), it is clear how the first selected sampling point already represents a massive improvement over the best cost function value, suggesting that the Gaussian Process is able to model the objective function remarkably well. As the GP is updated with newly sampled points, UCB steadily converges towards a minimum, gradually improving the accuracy of the optimisation solution. For additional clarity, the best-computed objective against the iteration number is shown at the bottom. The best objective obtained is 0.0742, which is fairly close to zero, the (known) global optimum in the output space. It is also noticeable how BO struggles to further improve the result after 350 iterations in this case.

Fig. 7: Top: Bayesian optimisation progress over iterations for the numerical dataset; the objective value at the randomly sampled seed points is displayed in red, while the objective at points selected by UCB is displayed in blue. Bottom: best objective function value over iterations

The results relative to the output, that is, the raw (best) cost function value and the estimated modal data, are shown in Table 6 for each optimisation technique. For all modes, the updated value and the target value are reported, as well as the relative error and the MAC value. The RMSRE and the best achieved objective function value are also reported for each algorithm. These results highlight how GPS, SA and GA fail at minimising the cost function under the proposed conditions: all algorithms, and especially SA, return objective values that are far from zero. Bayesian optimisation, on the contrary, achieves quite impressive results with only 500 evaluations. The error in the frequencies is kept to a minimum (here, only the second mode shows a higher divergence), and so is the error in the mode shapes. The relative errors are computed, for each \(n\)-th mode, as \(({f}_{n}^{{\text{UPD}}}-{f}_{n}^{{\text{TAR}}})/{f}_{n}^{{\text{TAR}}}\), where \({f}_{n}^{{\text{UPD}}}\) is the updated value and \({f}_{n}^{{\text{TAR}}}\) is the corresponding target.

The resulting updated parameters (which generate the updated modal features just discussed) are displayed in Table 7. The updated value, the target value and the RMSRE value (over the four elasticity moduli) are reported. Up to 1000 observations are used for the three former algorithms, while BO employs only 500 observations, leading to comparable total optimisation times. Among all parameters, the most interesting are the elasticity moduli of the four sub-structures, as these parameters respond to the main goal of the updating problem, that is, assessing the level of damage to the structure. The stiffnesses of the six springs are of secondary interest, since these are introduced in the updating procedure only due to unawareness of the degree of support provided by the adjacent architectural elements, as well as of their impact on the dynamic response of the building. Moreover, given their extremely wide optimisation range, identifying the right order of magnitude can already be considered a satisfactory result. In light of the above, GPS, SA and GA show quite poor results, failing to attain a good estimation of the first four parameters (and providing even worse estimations for the rest). On the contrary, Bayesian optimisation returns acceptable errors on the estimated values, showing good agreement with the first four in particular. Although a ~25% error in the elasticity moduli may appear significant, these results are in fact remarkable given the intricate nature of the optimisation problem under consideration. Indeed, this level of precision allows for informed assessments regarding which portions of the structure likely experienced the most damage and which sections remain structurally sound. The scale of the springs is in some cases recognised as well, except for \({k}_{2}\) and \({k}_{6}\) (and to a lesser extent \({k}_{1}\) and \({k}_{4}\)), suggesting once more that the algorithm could have run into a local minimum. Potentially, had Bayesian optimisation been able to properly sort out the right scale of \({k}_{2}\) and \({k}_{6}\), it would have then returned even better accuracy for the parameters of interest (i.e., \({E}_{1}\), \({E}_{2}\), \({E}_{3}\) and \({E}_{4}\)).

In the damage assessment study of 2017, the updating procedure was carried out in batches: at first, the springs were calibrated while holding the elasticity moduli constant; afterwards, \({E}_{1}\), \({E}_{2}\), \({E}_{3}\) and \({E}_{4}\) were optimised using the link-element stiffness values previously estimated. As traditional optimisation techniques were employed, this approach aimed at facilitating the optimisation procedure by lowering the dimensionality of the problem. Given the outcome of this numerical test, such an approach can be avoided when using BO, as this technique is powerful enough to allow considering all parameters at once, cutting computational time and enhancing the chances of ending up close enough to the global minimum.

The optimisation time employed by each algorithm is reported in Table 8. As the number of observations is relatively high, the secondary optimisation problems (i.e., the maximisation of the marginal log-likelihood and the maximisation of the acquisition function) are relatively burdensome tasks, significantly extending the total optimisation time (Fig. 8). As computing a prediction has a computational complexity of \(\mathcal{O}\left({N}^{3}\right)\), the modelling and point selection time gradually increases as the number of observations grows. These two tasks are crucial to obtain high sampling efficiency, which in turn allows keeping the total number of observations to a minimum. In fact, with only 500 iterations (actually, in this case, about 350 would have been sufficient to obtain the same final results), BO still enables saving some computational time when compared to the other techniques, even if negatively affected by the increasingly burdensome secondary optimisation problems, while retaining far superior accuracy.

Fig. 8: Total optimisation time of BO, as the sum of the objective evaluation time (the cumulative time employed for evaluating the objective function) and the modelling and point selection time (the cumulative time employed for maximising the marginal log-likelihood and the acquisition function)

5.2 Experimental data results

Differently from the numerically simulated data, the experimental case study is affected by both the implicit limitations of the FE model and the measurement noise of the acquisitions. Thus, the minimum of the cost function is never found at zero when using real data, since a significant misfit between the experimental and computed modal properties will persist.

The same kernel and acquisition function of the numerical case are used, while the seed size is set to 250 points. For a given number of total observations (in this case, 500), using larger seeds leads to shorter computational times (as the actual number of algorithm iterations is reduced, meaning less time is spent on the secondary optimisation problems) and potentially improved surrogate quality, at the expense of a reduced exploitation of the algorithm's "intelligent" sampling capabilities. In a way, we rely more on the surrogate represented by the GP and less on the point-selection process operated by the acquisition function.

Table 9 provides the length scales obtained when fitting the GP to the new 250-point seed. As different observations are employed here, the hyperparameters differ (only) marginally from the ones obtained for the numerically simulated dataset; a substantial correlation between the two is noticeable. As such, the previous considerations about the length scales apply in this case as well.

The results obtained through the optimisation process (shown in Fig. 9) reveal how, in this case, the optimiser cannot be expected to converge to zero, since FE modelling deficiencies, coupled with inaccuracies in the identified modal data, lead to an ineluctable misfit between the computed and measured modal responses. For additional clarity, the best-computed objective against the iteration number is shown as well. The value associated with the function minimum is 0.8343.

Fig. 9: Top: Bayesian optimisation progress over iterations for the experimental dataset; once again, the objective value at the randomly sampled seed points is displayed in red, while the objective value at points selected by UCB is displayed in blue. Bottom: best objective function value over iterations

The results relative to the output, that is, the raw (best) cost function value and the estimated modal data, are shown in Table 10, along with the modal parameters attained in the paper of reference. Once again, for each considered mode, the updated and the identified values are reported, as well as the relative error and the MAC value. Looking at the updated modal parameters, Bayesian optimisation gives results that are consistent with what was obtained in the 2017 study. Generally, the natural frequencies obtained through the Bayesian sampling optimisation show a good correlation with the identified ones. The modes that exhibit the largest errors are the third and the fifth: this was already the case in the former study. The MAC values of the first three modes suggest a good correlation with the identified mode shapes of the tower, while the last three denote some degree of incoherence (especially the fifth and the sixth). This issue is shared with the 2017 study, indicating some problems probably due to the quality of the measurements at the highest modes, or to the inadequacy of the FE model in capturing the dynamic behaviour of the bell tower at higher frequencies. Indeed, for these reasons, fitting the system response with modal features beyond the third mode is often impractical in many FEMU applications [71].

The results relative to the input space (i.e., the estimated parameters) are reported in Table 11, as the estimated values obtained through BO and the estimations of the former damage assessment study. Focusing on the elasticity moduli of the four sub-structures, the results stemming from the Bayesian sampling optimisation approach are mostly compliant with the former study, except for the elasticity modulus of the third sub-structure.

These findings can be used to assess the damage condition of the bell tower. The low values obtained for the two lowest levels suggest that the bell tower has probably endured high levels of damage in these areas, which mostly affect the lower modes of the building. Concerning these two subparts, a slightly higher estimate of the second elasticity modulus is the only marginal difference between the two results. On the contrary, a substantial difference can be seen in the elasticity of the walls at the third level of the structure: the former study highlighted a much higher level of damage. Judging by the value estimated through BO, this specific part of the tower could have been either less damaged by the seismic event, or originally characterised by walls built with stiffer, higher-quality material. This hypothesis is made more plausible by considering that the building, whose construction started in the late fourteenth century, was heavily altered in the seventeenth century, when the height of the bell tower was tripled and the original structure reinforced [63]. Finally, the walls of the fourth level were found to be significantly stiffer than the rest of the structure. However, estimates of the parameters concerning this level of the structure should not be considered reliable, for the reasons stated before (scarcity of sensors).

Regarding the spring elements, which model the degree of constraint enforced by the adjacent structures, the results of the BO approach agree with the estimations of the former analysis, particularly for what concerns the value of \({k}_{5}\), which stands out in both cases with respect to the other link-element parameters by a factor of \({10}^{3}\). The greater stiffness of the fifth spring element suggests that the architectural element having the greatest impact on the dynamic response of the bell tower is the easternmost apse arch. All the other elements (particularly the ones modelled by \({k}_{2}\), \({k}_{4}\) and \({k}_{6}\), hence the nave walls and the rectory wall) seem to provide little resistance to the motion of the bell tower when considering small-amplitude environmental vibrations.

6 Conclusions

When expensive FE models are involved, the optimisation algorithm represents one of the most critical aspects of FEMU applications that make use of iterative methods. The optimisation technique is a key element of the updating procedure, as it should feature good sampling efficiency, global search capabilities and adequate accuracy to cope with non-convex and complex cost functions.

This research presented and validated a Bayesian sampling optimisation (BO) approach for such a task, with an application to a real case study, the Mirandola bell tower, which represents an interesting example of the post-seismic assessment of a historical building of cultural and architectural relevance. Overall, the proposed procedure proved to be well-suited for this challenging task. In particular, BO outperformed the other well-established global optimisation techniques selected for the benchmark (namely Generalized Pattern Search, Simulated Annealing and the Genetic Algorithm), featuring far superior sampling efficiency, greater accuracy, and better capabilities of finding the global function minimum, particularly as the dimensionality of the problem increases. These results were achieved with half the objective function evaluations allowed for GPS, SA, and GA. In practice, this translates into shorter computational times and lower costs.

One major drawback of SA and GA is that both techniques rely on large sampling volumes. Furthermore, these algorithms tend to provide results of poor accuracy. The GPS algorithm, on the contrary, may exhibit poor global search capabilities, as it was found either to converge too quickly or to fail at finding the global function optimum. The proposed BO algorithm is not affected by any of these drawbacks.

Regarding the implementation of the described technique, a logarithmic transformation of the input variables was found to improve the quality of the Gaussian Process regression, albeit marginally. Furthermore, this research revealed that all four investigated kernels can be successfully used, although the ARD Matérn 5/2 kernel provided the best results in terms of validation of the surrogate model for this specific case study. When using automatic relevance determination (ARD) kernels, it is possible to retrieve the hyperparameters (i.e., length scales) of the GP, gathering useful information about the problem's sensitivity to each parameter. Additionally, ARD kernels automatically discard irrelevant dimensions from the optimisation procedure, making this approach well-suited for highly anisotropic problems. This sort of "built-in" sensitivity analysis is particularly advantageous in structural model updating applications, where information about the relevance of parameters (often scarcely known in advance) can be considerably useful for a better understanding of the structural behaviour.

Availability of data and materials

Data available upon reasonable request from the authors.

Boscato G, Dal Cin A, Ientile S, Russo S (2016) Optimized procedures and strategies for the dynamic monitoring of historical structures. J Civ Struct Health Monit 6(2):265–289. https://doi.org/10.1007/S13349-016-0164-9

Friswell MI, Mottershead JE (1995) Finite element model updating in structural dynamics, vol 38. Springer, Netherlands, Dordrecht. https://doi.org/10.1007/978-94-015-8508-8

Friswell MI (2007) Damage identification using inverse methods. Philos Trans R Soc A Math Physi Eng Sci 365(1851):393–410. https://doi.org/10.1098/rsta.2006.1930

Durmazgezer E, Yucel U, Ozcelik O (2019) Damage identification of a reinforced concrete frame at increasing damage levels by sensitivity-based finite element model updating. Bull Earthq Eng 17(11):6041–6060. https://doi.org/10.1007/s10518-019-00690-5

Mohagheghian K, Karami Mohammadi R (2017) Comparison of online model updating methods in pseudo-dynamic hybrid simulations of TADAS frames. Bull Earthq Eng 15(10):4453–4474. https://doi.org/10.1007/s10518-017-0147-1

Miraglia G, Petrovic M, Abbiati G, Mojsilovic N, Stojadinovic B (2020) A model-order reduction framework for hybrid simulation based on component-mode synthesis. Earthq Eng Struct Dyn 49(8):737–753. https://doi.org/10.1002/EQE.3262

Mottershead JE, Friswell MI (1993) Model updating in structural dynamics: a survey. J Sound Vib 167(2):347–375. https://doi.org/10.1006/JSVI.1993.1340

Mottershead JE, Link M, Friswell MI (2011) The sensitivity method in finite element model updating: a tutorial. Mech Syst Signal Process 25(7):2275–2296. https://doi.org/10.1016/J.YMSSP.2010.10.012


Acknowledgements

The authors would like to thank the Laboratory of Strength of Materials (LabSCo) of IUAV University of Venice for the acquisition, processing, and dynamic identification of the case study.

Open access funding provided by Politecnico di Torino within the CRUI-CARE Agreement. This research did not receive external funding.

Author information

Authors and Affiliations

Department of Structural Engineering, Norwegian University of Science and Technology (NTNU), 7491, Trondheim, Norway

Davide Raviolo

Department of Structural, Geotechnical and Building Engineering, Politecnico di Torino, 10129, Turin, Italy

Marco Civera

School of Aerospace, Transport and Manufacturing, Cranfield University, Cranfield, Bedfordshire, MK43, UK

Luca Zanotti Fragonara


Contributions

Conceptualization: M.C. and L.Z.F.; Methodology: D.R., M.C., and L.Z.F.; Software: D.R., M.C., and L.Z.F.; Validation: D.R.; Formal Analysis: D.R. and M.C.; Investigation: D.R., M.C., and L.Z.F.; Resources: L.Z.F.; Data Curation: D.R., M.C., and L.Z.F.; Writing—Original Draft: D.R.; Writing—Review & Editing: M.C. and L.Z.F.; Visualization: D.R. and M.C.; Supervision: M.C. and L.Z.F.; Project administration: L.Z.F.

Corresponding author

Correspondence to Marco Civera.

Ethics declarations

Conflict of interest

There are no conflicts of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Raviolo, D., Civera, M. & Zanotti Fragonara, L. A Bayesian sampling optimisation strategy for finite element model updating. J Civil Struct Health Monit (2024). https://doi.org/10.1007/s13349-023-00759-5


Received: 28 March 2023

Accepted: 27 December 2023

Published: 20 February 2024

DOI: https://doi.org/10.1007/s13349-023-00759-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Keywords:

  • Bayesian optimisation
  • Bayesian expected improvement
  • Finite element model updating
  • Masonry structures

Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment

  • Bashar Hasan 1,2 (http://orcid.org/0000-0001-9531-4990)
  • Samer Saadi 1,2 (http://orcid.org/0000-0001-9225-1197)
  • Noora S Rajjoub 1
  • Moustafa Hegazi 1,2
  • Mohammad Al-Kordi 1,2
  • Farah Fleti 1,2
  • Magdoleen Farah 1,2
  • Irbaz B Riaz 3
  • Imon Banerjee 4,5
  • Zhen Wang 1,6 (http://orcid.org/0000-0002-9368-6149)
  • Mohammad Hassan Murad 1,2 (http://orcid.org/0000-0001-5502-5975)
  • 1 Kern Center for the Science of Healthcare Delivery, Mayo Clinic, Rochester, Minnesota, USA
  • 2 Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, Minnesota, USA
  • 3 Division of Hematology-Oncology, Department of Medicine, Mayo Clinic, Rochester, Minnesota, USA
  • 4 Department of Radiology, Mayo Clinic Arizona, Scottsdale, Arizona, USA
  • 5 School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, USA
  • 6 Health Care Policy and Research, Mayo Clinic Minnesota, Rochester, Minnesota, USA
  • Correspondence to Dr Bashar Hasan, Mayo Clinic, Rochester, MN 55905, USA; Hasan.Bashar@mayo.edu

Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrate LLMs in the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was the highest for the ROBINS-I domain of ‘Classification of Intervention’. Kendall agreement coefficient was highest for the domains of ‘Participant Selection’, ‘Missing Data’ and ‘Measurement of Outcomes’, suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.

  • Evidence-Based Practice
  • Systematic Reviews as Topic

Data availability statement

Data are available upon reasonable request. Search strategy, selection process flowchart, prompts and boxes containing included SRs and studies are available in the appendix. Analysed datasheet is available upon request.

https://doi.org/10.1136/bmjebm-2023-112597


WHAT IS ALREADY KNOWN ON THIS TOPIC

Risk of bias assessment in systematic reviews is a time-consuming task associated with inconsistency. The use of large language models (LLMs) in systematic reviews may be helpful but remains largely unexplored.

WHAT THIS STUDY ADDS

This study introduces a structured framework for integrating LLMs into systematic reviews with four domains: rationale, protocol, execution and reporting.

The framework defines five possible task types for LLMs in systematic reviews: selection, data extraction, judgement, analysis and narration.

A case study about using LLMs for risk of bias assessments using Risk Of Bias In Non-randomised Studies of Interventions demonstrates fair agreement between LLM and human reviewers.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

The proposed framework can serve as a blueprint for future systematic reviewers planning to integrate LLMs into their workflow.

The case study suggests the need to pair LLMs assessing the risk of bias with a human reviewer.

Introduction

Systematic reviews are the key initial step in decision-making in healthcare. However, they are costly, require a long time to complete and become outdated, especially in areas of rapidly evolving evidence. Semi-automating systematic reviews and transitioning to living systematic reviews using the best contemporary available evidence are key priority areas of current evidence synthesis. 1–4 Recent advances in artificial intelligence (AI) have ushered in a new era of possibilities in healthcare practice and medical research, 5–7 including evidence synthesis and living systematic reviews. 8 9 By learning from human data analysis patterns (supervision), AI technologies offer the ability to automate, accelerate and enhance the accuracy of a wide array of research tasks, from data collection to analysis and even interpretation. 10

A recent AI advancement, large language models (LLMs) such as Meta AI's LLaMA2 and OpenAI's GPT-4, 11 are considered foundational models pre-trained in a self-supervised manner by leveraging a tremendous amount of free-text data. The pre-training process allows them to acquire generic knowledge, after which they can be fine-tuned on downstream tasks. With increasing model size, larger training data sets and longer training time, LLMs develop emergent abilities such as zero-shot and few-shot in-context learning and generalisation, and they have demonstrated significant capabilities in understanding and generating human-like text and processing data with minimal supervision, which may lead to meaningful participation in a systematic review. 12 13

Risk of bias (RoB) assessment is a significant step in systematic reviews that requires time, introduces inconsistencies and may be amenable to using AI and LLMs. 14 In this exposition, we propose a framework for incorporating LLMs into systematic reviews and employ GPT-4 for RoB assessment in a case study using the Cochrane Collaboration’s Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool. 15 We chose the ROBINS-I tool for this case study because it is a modern tool that is quite detailed, relatively complicated, and requires a long time to apply, 16 which makes it an ideal candidate to explore whether models such as GPT-4 can improve its consistency and time requirements.

The reporting of this case study adheres to the guidelines of methodological research. 17

Search strategy and study identification

We searched Scopus to identify all systematic reviews (SRs) from the Cochrane Collaboration that cited the original publication of the ROBINS-I tool. 15 We limited our search to SRs conducted by Cochrane in the field of medicine that were fully published. All original non-randomised studies included in the identified SRs were included as long as the ROBINS-I tool was used for their RoB assessment in the SR.

Data entry into ChatGPT

We conducted several pilot tests to determine the most effective method of obtaining RoB assessments using ChatGPT (GPT-4). The initial approach involved directly uploading the study PDFs to GPT-4 via the Code Interpreter tool available to Plus users. However, the tool was unable to interpret the fragmented pieces of text from the PDFs. We then attempted to paste the full text of individual studies into the prompt; however, this was unsuccessful due to the current estimated 2500-word limit for GPT-4 prompts. Finally, we converted each PDF to a Word file and extracted only the Methods and Results sections from each study for RoB assessment, because these are the sections on which human reviewers focus for RoB assessments. Prompts used to instruct ChatGPT are presented in the appendix. The processes of data entry and prompt development were done iteratively until data were appropriately uploaded and a sensible output was obtained (ie, these processes were not prespecified). Foreign-language studies were provided to GPT-4 in their original language.
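For readers who want to script this step rather than work through the ChatGPT interface used in the study, the sketch below shows an equivalent API-based query. It is a minimal illustration only, assuming the OpenAI Python client (v1.x): the prompt wording, the "gpt-4" model string and the assess_risk_of_bias helper are our own stand-ins, not the study's actual prompts (those are in the appendix).

```python
# Minimal sketch of an API-based ROBINS-I query, assuming the OpenAI
# Python client (v1.x). Prompt wording and output format are illustrative
# stand-ins; the study's actual prompts are given in its appendix.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are assessing risk of bias with the ROBINS-I tool. For each of "
    "the seven domains, give a judgement (Low / Moderate / Serious / "
    "Critical / No information) with a one-sentence justification, then "
    "give an overall judgement.\n\nMethods and Results sections:\n{text}"
)

def assess_risk_of_bias(methods_and_results: str) -> str:
    """Send the extracted Methods and Results text; return the raw reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(text=methods_and_results)}],
    )
    return response.choices[0].message.content

# Usage: the text would come from the converted Word file described above.
# print(assess_risk_of_bias(open("study_methods_results.txt").read()))
```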

Statistical analysis

One reviewer extracted RoB judgements from each Cochrane SR and a second reviewer verified the extraction. We measured the agreement between Cochrane reviewers and GPT-4, comparing the ordinal judgements about RoB using raw per cent agreement, weighted Cohen's kappa and Kendall's τ for correlation. The magnitude of agreement based on the value of a correlation or kappa coefficient was considered to be slight (0–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80) and almost perfect (0.81–1.0).

Analysis was conducted using the R software package (R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org).
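Although the analysis was run in R, the same three agreement measures are available in common Python libraries. The short sketch below computes them on made-up ordinal ratings (1 = low ... 4 = critical), purely for illustration; the ratings are not the study's data.

```python
# Illustrative computation of the three agreement measures on ordinal
# RoB ratings (1 = low ... 4 = critical). The ratings are made up;
# the study itself ran this analysis in R.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

human = np.array([1, 2, 2, 3, 1, 2, 4, 3, 2, 1])
gpt4 = np.array([1, 2, 3, 3, 1, 1, 3, 3, 2, 2])

raw_agreement = np.mean(human == gpt4)          # raw per cent agreement
weighted_kappa = cohen_kappa_score(human, gpt4, weights="linear")
tau, p_value = kendalltau(human, gpt4)          # Kendall's tau

print(f"raw agreement: {raw_agreement:.0%}")
print(f"weighted kappa: {weighted_kappa:.2f}")
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")
```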

Initial screening and inclusion

The initial search yielded 98 SRs, from which 36 provided full ROBINS-I assessments. After deduplicating studies that appeared in multiple SRs, we finalised our sample with 307 unique individual studies (online supplemental figure; boxes 1 and 2).


Agreement between Cochrane reviewers and GPT-4

Agreement measures are summarised in table 1 for each ROBINS-I domain and for overall judgements. Raw per cent agreement was the highest for the domain of ‘Classification of Intervention’. Kendall agreement coefficient was highest for the domains of ‘Participant Selection’, ‘Missing Data’ and ‘Measurement of Outcomes’, suggesting moderate agreement in these domains. Kappa coefficient was low across all domains. Agreement about the overall RoB across domains was fair (61% raw agreement, Kendall coefficient 0.35).

Table 1. Performance metrics

Framework for incorporating LLM’s in a systematic review

Figure 1 outlines the proposed framework for integrating LLMs into a systematic review workflow. The framework has four domains: establishing a rationale, incorporating the LLM into the protocol of the systematic review, execution and reporting.

Figure 1. Framework for incorporating large language models in systematic reviews. LLM, large language model; RoB, risk of bias; SR, systematic review.

The first step is to establish the rationale (ie, why LLMs are needed, and whether they are capable of performing the specific task). In the protocol, the LLM should be described with its version and whether it was used off the shelf or via other tools, applications or interfaces. For example, code interpreters or AI agents can be used. An LLM agent, such as a generative pre-trained transformer (GPT) agent, is a specialised system designed to execute complex, multistep tasks; it can adapt to tools that were not included in the general model's training data, including recently published ones.

The prompts for the LLM need to be iteratively tested, refined and described in the protocol to the extent possible, realising that it will not be possible to prespecify or anticipate every step. The method of data entry (copy/paste vs uploading a file) also needs to be tested and described in the protocol. Metrics of success depend on the task type that is assigned to the LLM. We identify five basic task types: selection (eg, of included studies), extraction (eg, of study characteristics and outcomes), judgement (eg, RoB assessment), analysis (quantitative and qualitative) and narration/editing (eg, writing a manuscript, an abstract, or a lay-person or executive summary). The metrics of success and the extent of human interaction and supervision should also be specified in the protocol; a sketch of one way to record these items follows.
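To make these protocol items concrete, one could record them as a structured entry per task. The sketch below is our own rendering of the framework's elements; the field names, the LLMProtocolEntry class and the example values are illustrative, not part of the published framework.

```python
# A sketch of recording the framework's protocol items as structured data.
# Field names and example values are our own illustration of the framework,
# not an artefact published with it.
from dataclasses import dataclass, field
from enum import Enum

class TaskType(Enum):           # the five basic task types
    SELECTION = "selection"
    EXTRACTION = "extraction"
    JUDGEMENT = "judgement"
    ANALYSIS = "analysis"
    NARRATION = "narration"

@dataclass
class LLMProtocolEntry:
    rationale: str              # why an LLM is needed for this task
    task: TaskType
    model: str                  # exact model and version
    interface: str              # off the shelf, API, agent, code interpreter
    prompt: str                 # final prompt after iterative refinement
    data_entry: str             # copy/paste vs uploading a file
    human_role: str             # extent of human interaction and supervision
    success_metrics: list[str] = field(default_factory=list)
    timestamp: str = ""         # when the model was queried (outputs drift)

entry = LLMProtocolEntry(
    rationale="RoB assessment is slow and inconsistent across reviewers",
    task=TaskType.JUDGEMENT,
    model="GPT-4 (ChatGPT Plus)",
    interface="chat interface",
    prompt="ROBINS-I domain-by-domain judgement prompt (see appendix)",
    data_entry="Methods and Results sections pasted as text",
    human_role="independent duplicate human reviewer",
    success_metrics=["raw % agreement", "weighted kappa", "Kendall's tau"],
    timestamp="2023-08",
)
```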

The execution of LLM engagement will likely lead to changes in some of the approaches specified in the protocol, which should be explicitly mentioned as revisions to the protocol. Reporting is the last part of the framework and is vital. The items mentioned above, which are beyond the usual reporting requirements from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement and its extensions, should all be included in the manuscript. 18 19 Importantly, the AI model and interface used need to be explicitly reported along with a timestamp of when AI was used because the output may vary over time for the same input and prompts. The transparency in reporting and informing peer reviewers and journal editors about the details of using LLMs are critical for the credibility of the systematic review process and subsequent decisions made based on the evidence. The proposed framework is applied to the current case study in table 2 .

Applying the proposed framework to the case study

The current case study suggests an overall fair agreement between Cochrane reviewers and GPT-4 in using ROBINS-I for assessing RoB in non-randomised studies of interventions. This work identifies several challenges of using general-purpose LLMs, such as handling file types, word token limits and the quality of prompt engineering. Nonetheless, our study provides an assessment of zero-shot performance and a rationale for training RoB-specific systematic review models. The proposed framework is just a starting point, since this field is very dynamic.

The current study also provides insight into evaluating inter-rater agreement on ordinal variables. We found that the weighted kappa coefficient was low across all domains, which likely reflects the skewed distribution of the ratings. Kappa accounts for agreement occurring by chance, while Kendall's τ measures the strength and direction of the association between two ranked variables. A recent comparison of reliability coefficients for ordinal rating scales suggested that the differences between these measures can vary at different agreement levels. 20 Thus, using more than one measure is helpful to assess the robustness of results. While our findings suggest the potential of LLMs like GPT-4 to be used in systematic reviews, there is clearly a certain rate of error, and duplication of RoB assessment is needed.
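The effect of skewed marginals on kappa is easy to reproduce with a small constructed example (the numbers below are ours, not the study's data): when almost every rating falls in one category, two raters can agree 90% of the time yet obtain a near-zero, even negative, kappa.

```python
# Constructed illustration of the kappa paradox under skewed ratings:
# 18 of 20 paired "low" judgements give 90% raw agreement, yet the
# chance-corrected kappa is slightly negative because the expected
# chance agreement is itself about 0.905.
from sklearn.metrics import cohen_kappa_score

rater_a = ["low"] * 18 + ["moderate", "low"]
rater_b = ["low"] * 18 + ["low", "moderate"]

raw = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"raw agreement: {raw:.0%}")  # 90%
print(f"kappa: {kappa:.2f}")        # about -0.05
```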

Some limitations of the case study should be mentioned. This study was feasible because of the availability of comprehensive systematic reviews from the Cochrane Collaboration that used the ROBINS-I tool and reported detailed judgements. While their RoB assessment is certainly not a reference standard and can be quite poor for some domains such as confounding, 21 the rigorous and multidomain evaluation conducted by pairs of independent reviewers in these reviews makes them a reasonable comparison for a novel LLM application. It is also possible that some systematic reviews used ROBINS-I but did not cite its original paper and were therefore not included in our sample. We also had to use ChatGPT to translate a few studies published in languages other than English, truncate text when it was too lengthy and convert file formats, all of which may have affected RoB judgements.

Practical implications

Given its current capabilities, GPT-4 is arguably a very advanced text-analysing tool. A major advantage is its availability as a universal language model—one model that can perform any language-based extraction, retrieval or even reasoning-based tasks. However, this approach may not be suitable for application in every domain. Sensitive domains like medicine require precise use of language in a consistent manner. LLMs have displayed trends of inconsistency in performance—different output for the same input. LLMs have the propensity to generate favourable answers and to hallucinate. Hallucination is a major threat to the use of LLMs in research. In table 3 , we describe the phenomenon of artificial hallucinations in terms of definition, types and plausible causes. 22–24

Table 3. The phenomenon of artificial hallucinations: definition, types and causes

Additional applications in systematic reviews can extend to other tasks such as aiding in screening studies, translating foreign-language studies in real-time, data extraction, meta-analysis and even generating decision aids or translational products. 25 However, a human reviewer remains needed as a duplicate independent reviewer.

This exploration of LLM application in systematic reviews is a step toward integrating AI as a dynamic adjunct in research. The proposed framework, coupled with a case study on RoB assessment, underscores the potential of LLMs to facilitate research tasks. While GPT-4 is not without limitations, its ability to assist in complex tasks under human supervision makes it a promising tool for assessing RoB in systematic reviews. Considering the agreement level with a human reviewer in the case study, pairing AI with an independent human reviewer remains required at present.

Ethics statements

Patient consent for publication

Not applicable.

Ethics approval


Supplementary materials

Supplementary data

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

  • Data supplement 1

Twitter @BasharHasanMD, @M_Hassan_Murad

Contributors MHM and BH conceived this study. BH, SS, MH, MA-K, FF, MF, ZW, IBR, IB and NSR participated in data identification, extraction and analysis. MHM, SS, IBR and IB wrote the first draft. All authors critically revised the manuscript and approved the final version. BH is the guarantor.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.



Computer Science > Computer Vision and Pattern Recognition

Title: Scalable Methods for Brick Kiln Detection and Compliance Monitoring from Satellite Imagery: A Deployment Case Study in India

Abstract: Air pollution kills 7 million people annually. The brick manufacturing industry is the second largest consumer of coal, contributing 8%-14% of the air pollution in the Indo-Gangetic plain (a highly populated tract of land in the Indian subcontinent). As brick kilns are an unorganized sector and present in large numbers, detecting policy violations such as distance from habitat is non-trivial. Air quality and other domain experts rely on manual human annotation to maintain brick kiln inventories. Previous work used computer vision-based machine learning methods to detect brick kilns from satellite imagery, but these methods are limited to certain geographies, and labeling the data is laborious. In this paper, we propose a framework to deploy a scalable brick kiln detection system for large countries such as India and identify 7477 new brick kilns from 28 districts in 5 states in the Indo-Gangetic plain. We then showcase efficient ways to check policy violations such as high spatial density of kilns and abnormal increases over time in a region. We show that 90% of brick kilns in Delhi-NCR violate a density-based policy. Our framework can be directly adopted by governments across the world to automate policy regulation around brick kilns.
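As a rough sketch of how such a density-based check can work (the 1 km radius and neighbour threshold below are placeholders, not the actual policy parameters, and the coordinates are made up), one can flag kilns that have too many detected neighbours within a given haversine distance:

```python
# Illustrative density-based compliance check: flag kilns that have more
# than `max_neighbours` other kilns within `radius_km`. The radius and
# threshold are placeholders, not the actual policy parameters.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def density_violations(kilns, radius_km=1.0, max_neighbours=1):
    """Return indices of kilns with too many neighbours inside radius_km.

    `kilns` is a list of (lat, lon) tuples, eg from a detection model.
    O(n^2) pairwise scan; at country scale a spatial index would be used.
    """
    flagged = []
    for i, (lat_i, lon_i) in enumerate(kilns):
        neighbours = sum(
            1
            for j, (lat_j, lon_j) in enumerate(kilns)
            if j != i and haversine_km(lat_i, lon_i, lat_j, lon_j) <= radius_km
        )
        if neighbours > max_neighbours:
            flagged.append(i)
    return flagged

# Made-up coordinates: three clustered kilns near Delhi plus one far away.
kilns = [(28.70, 77.10), (28.701, 77.101), (28.702, 77.102), (28.80, 77.20)]
print(density_violations(kilns))  # [0, 1, 2]: the clustered kilns are flagged
```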


