• Open access
  • Published: 30 April 2021

Ethics review of big data research: What should stay and what should be reformed?

  • Agata Ferretti   ORCID: orcid.org/0000-0001-6716-5713 1 ,
  • Marcello Ienca 1 ,
  • Mark Sheehan 2 ,
  • Alessandro Blasimme 1 ,
  • Edward S. Dove 3 ,
  • Bobbie Farsides 4 ,
  • Phoebe Friesen 5 ,
  • Jeff Kahn 6 ,
  • Walter Karlen 7 ,
  • Peter Kleist 8 ,
  • S. Matthew Liao 9 ,
  • Camille Nebeker 10 ,
  • Gabrielle Samuel 11 ,
  • Mahsa Shabani 12 ,
  • Minerva Rivas Velarde 13 &
  • Effy Vayena 1  

BMC Medical Ethics volume 22, Article number: 51 (2021)

Ethics review is the process of assessing the ethics of research involving humans. The Ethics Review Committee (ERC) is the key oversight mechanism designated to ensure ethics review. Whether or not this governance mechanism is still fit for purpose in the data-driven research context remains a debated issue among research ethics experts.

In this article, we seek to address this issue in a twofold manner. First, we review the strengths and weaknesses of ERCs in ensuring ethical oversight. Second, we map these strengths and weaknesses onto specific challenges raised by big data research. We distinguish two categories of potential weakness. The first category concerns persistent weaknesses, i.e., those which are not specific to big data research, but may be exacerbated by it. The second category concerns novel weaknesses, i.e., those which are created by and inherent to big data projects. Within this second category, we further distinguish between purview weaknesses related to the ERC’s scope (e.g., how big data projects may evade ERC review) and functional weaknesses, related to the ERC’s way of operating. Based on this analysis, we propose reforms aimed at improving the oversight capacity of ERCs in the era of big data science.


We believe the oversight mechanism could benefit from these reforms because they will help to overcome data-intensive research challenges and consequently benefit research at large.

The debate about the adequacy of the Ethics Review Committee (ERC) as the chief oversight body for big data studies is partly rooted in the historical evolution of the ERC. Particularly relevant is the ERC’s changing response to new methods and technologies in scientific research. ERCs—also known as Institutional Review Boards (IRBs) or Research Ethics Committees (RECs)—came to existence in the 1950s and 1960s [ 1 ]. Their original mission was to protect the interests of human research participants, particularly through an assessment of potential harms to them (e.g., physical pain or psychological distress) and benefits that might accrue from the proposed research. ERCs expanded in scope during the 1970s, from participant protection towards ensuring valuable and ethical human subject research (e.g., having researchers implement an informed consent process), as well as supporting researchers in exploring their queries [ 2 ].

Fast forward fifty years, and a lot has changed. Today, biomedical projects leverage unconventional data sources (e.g., social media), partially inscrutable data analytics tools (e.g., machine learning), and unprecedented volumes of data [ 3 , 4 , 5 ]. Moreover, the evolution of research practices and new methodologies such as post-hoc data mining have blurred the concept of ‘human subject’ and elicited a shift towards the concept of data subject, as attested in data protection regulations [ 6 , 7 ]. With data protection and privacy concerns being in the spotlight of big data research review, language from data protection laws has worked its way into the vocabulary of research ethics. This terminological shift further reveals that big data, together with modern analytic methods used to interpret the data, creates novel dynamics between researchers and participants [ 8 ]. Research data repositories about individuals and aggregates of individuals are considerably expanding in size. Researchers can remotely access and use large volumes of potentially sensitive data without communicating or actively engaging with study participants. Consequently, participants become more vulnerable and subjected to the research itself [ 9 ]. As such, the nature of risk involved in this new form of research changes too. In particular, it moves from the risk of physical or psychological harm towards the risk of informational harm, such as privacy breaches or algorithmic discrimination [ 10 ]. This is the case, for instance, with projects using data collected through web search engines, mobile and smart devices, entertainment websites, and social media platforms. The fact that health-related research is leaving hospital labs and spreading into online space creates novel opportunities for research, but also raises novel challenges for ERCs. For this reason, it is important to re-examine the fit between new data-driven forms of research and existing oversight mechanisms [ 11 ].

The suitability of ERCs in the context of big data research is not merely a theoretical puzzle but also a practical concern resulting from recent developments in data science. In 2014, for example, the so-called ‘emotional contagion study’ received severe criticism for avoiding ethical oversight by an ERC, failing to obtain research consent, violating privacy, inflicting emotional harm, discriminating against data subjects, and placing vulnerable participants (e.g., children and adolescents) at risk [ 12 , 13 ]. In both public and expert opinion [ 14 ], a responsible ERC would have rejected this study because it contravened the research ethics principles of preventing harm (in this case, emotional distress) and adequately informing data subjects. However, the protocol adopted by the researchers was not required to undergo ethics review under US law [ 15 ] for two reasons. First, the data analyzed were considered non-identifiable, and researchers did not engage directly with subjects, exempting the study from ethics review. Second, the study team included both scientists affiliated with a public university (Cornell) and Facebook employees. The affiliation of the researchers is relevant because—in the US and some other countries—privately funded studies are not subject to the same research protections and ethical regulations as publicly funded research [ 16 ]. An additional example is the 2015 case in which the United Kingdom (UK) National Health Service (NHS) shared 1.6 million pieces of identifiable and sensitive data with Google DeepMind. This data transfer from the public to the private party took place legally, without the need for patient consent or ethics review oversight [ 17 ]. These cases demonstrate how researchers can pursue potentially risky big data studies without falling under the ERC’s purview. The limitations of the regulatory framework for research oversight are evident, in both private and public contexts.

The gaps in the ERC’s regulatory process, together with the increased sophistication of research contexts (which now include a variety of actors such as universities, corporations, funding agencies, public institutes, and citizens’ associations), have led to an increase in the range of oversight bodies. For instance, besides traditional university ethics committees and national oversight committees, funding agencies and national research initiatives have increasingly created internal ethics review boards [ 18 , 19 ]. New participatory models of governance have emerged, largely due to an increase in subjects’ requests to control their own data [ 20 ]. Corporations are creating research ethics committees as well, modelled after the institutional ERC [ 21 ]. In May 2020, for example, Facebook welcomed the first members of its Oversight Board, whose aim is to review the company’s decisions about content moderation [ 22 ]. Whether this increase in oversight models is motivated by the urge to fill the existing regulatory gaps, or whether it is just ‘ethics washing’, is still an open question. However, other types of specialized committees have already found their place alongside ERCs, when research involves international collaboration and data sharing [ 23 ]. Among others, data safety monitoring boards, data access committees, and responsible research and innovation panels serve the purpose of covering research areas left largely unregulated by current oversight [ 24 ].

The data-driven digital transformation challenges the purview and efficacy of ERCs. It also raises fundamental questions concerning the role and scope of ERCs as the oversight body for ethical and methodological soundness in scientific research. Among these questions, this article will explore whether ERCs are still capable of fulfilling their intended purpose, given the range of novel (maybe not categorically new, but at least different in practice) issues that have emerged in this type of research. To answer this question, we explore some of the challenges that the ERC oversight approach faces in the context of big data research and review the main strengths and weaknesses of this oversight mechanism. Based on this analysis, we will outline possible solutions to address current weaknesses and improve ethics review in the era of big data science.

Strengths of the ethics review via ERC

Historically, ERCs have enabled cross-disciplinary exchange and assessment [ 27 ]. ERC members typically come from different backgrounds and bring their perspectives to the debate; when multi-disciplinarity is achieved, the mixture of expertise provides the conditions for a solid assessment of advantages and risks associated with new research. Committees which include members from a variety of backgrounds are also suited to promote projects from a range of fields, and research that cuts across disciplines [ 28 ]. Within these committees, the reviewers’ expertise can be paired with a specific type of content to be reviewed. This one-to-one match can bring timely and, ideally, useful feedback [ 29 ]. In many countries (e.g., European countries, the United States (US), Canada, Australia), ERCs are explicitly mandated by law to review many forms of research involving human participants; moreover, these laws also describe how such a body should be structured and the purview of its review [ 30 , 31 ]. In principle, ERCs also aim to be representative of society and the research enterprise, including members of the public and minorities, as well as researchers and experts [ 32 ]. And in performing a gatekeeping function to the research enterprise, ERCs play an important role: they recognize that both experts and lay people should have a say, with different views to contribute [ 33 ].

Furthermore, the ERC model strives to ensure independent assessment. The fact that ERCs assess projects “from the outside” and maintain a certain degree of objectivity towards what they are reviewing reduces the risk of overlooking research issues and decreases the risk of conflicts of interest. Moreover, being institutionally distinct (for example, being established by an organization that is distinct from the researcher or the research sponsor) brings added value to the research itself, as this lessens the risk of conflict of interest. Conflict of interest is a serious issue in research ethics because it can compromise the judgment of reviewers. Institutionalized review committees might particularly suffer from political interference. This is the case, for example, for universities and health care systems (like the NHS), which tend to engage “in house” experts as ethics board members. However, ERCs that can prove themselves independent are considered more trustworthy by the general public and data subjects; it is reassuring to know that an independent committee is overseeing research projects [ 34 ].

The ex-ante (or pre-emptive) ethical evaluation of research studies is considered by many to be the standard procedural approach of ERCs [ 35 ]. Though the literature is divided on the usefulness and added value provided by this form of review [ 36 , 37 ], ex-ante review is commonly used as a mechanism to ensure the ethical validity of a study design before the research is conducted [ 38 , 39 ]. Early research scrutiny aims at risk-mitigation: the ERC evaluates potential research risks and benefits, in order to protect participants’ physical and psychological well-being, dignity, and data privacy. This practice saves researchers’ resources and valuable time by preventing the pursuit of unethical or illegal paths [ 40 ]. Finally, the ex-ante ethical assessment gives researchers an opportunity to receive feedback from ERCs, whose competence and experience may improve the research quality and increase public trust in the research [ 41 ].

All strengths mentioned in this section are strengths of the ERC model in principle. In practice, there are many ERCs that are not appropriately interdisciplinary or representative of the population and minorities, that lack independence from the research being reviewed, and that fail to improve research quality, and may in fact hinder it. We now turn to consider some of these weaknesses in more detail.

Weaknesses of the ethics review via ERC

In order to assess whether ERCs are adequately equipped to oversee big data research, we must consider the weaknesses of this model. We identify two categories of weaknesses, which are described in the following section and summarized in Fig. 1:

Persistent weaknesses: those existing in the current oversight system, which could be exacerbated by big data research

Novel weaknesses: those brought about by and specific to the nature of big data projects

Within this second category of novel weaknesses, we further differentiate between:

Purview weaknesses: reasons why some big data projects may bypass the ERCs’ purview

Functional weaknesses: reasons why some ERCs may be inadequate to assess big data projects specifically

Figure 1: Weaknesses of the ERCs

We base the conceptual distinction between persistent and novel weaknesses on the fact that big data research diverges from traditional biomedical research in many respects. As previously mentioned, big data projects are often broad in scope, involve new actors, use unprecedented methodologies to analyze data, and require specific expertise. Furthermore, the peculiarities of big data itself (e.g., being large in volume and from a variety of sources) make data-driven research different in practice from traditional research. However, we should not consider the category of “novel weaknesses” a closed category. We do not argue that weaknesses mentioned here do not, at least partially, overlap with others which already exist. In fact, in almost all cases of ‘novelty’, (i) there is some link back to a concept from traditional research ethics, and (ii) some thought has been given to the issue outside of a big data or biomedical context (e.g., the problem of ERCs’ expertise has arisen in other fields [ 42 ]). We believe that by creating conceptual clarity about novel oversight challenges presented by big data research, we can begin to identify tailored reforms.

Persistent weaknesses

As regulation for research oversight varies between countries, ERCs often suffer from a lack of harmonization. This weakness in the current oversight mechanism is compounded by big data research, which often relies on multi-center international consortia. These consortia in turn depend on approval by multiple oversight bodies demanding different types of scrutiny [ 43 ]. Furthermore, big data research may give rise to collaborations between public bodies, universities, corporations, foundations, and citizen science cooperatives. In this network, each stakeholder has different priorities and depends upon its own rules for regulation of the research process [ 44 , 45 , 46 ]. Indeed, this expansion of regulatory bodies and aims does not come with a coordinated effort towards agreed-upon review protocols [ 47 ]. The lack of harmonization is perpetuated by academic journals and funding bodies with diverging views on the ethics of big data. If the review bodies which constitute the “ethics ecosystem” [ 19 ] do not agree to the same ethics review requirements, a big data project deemed acceptable by an ERC in one country may be rejected by another ERC, within or beyond the national borders.

In addition, there is inconsistency in the assessment criteria used within and across committees. Researchers report subjective bias in the evaluation methodology of ERCs, as well as variations in ERC judgements which are not based on morally relevant contextual considerations [ 48 , 49 ]. Some authors have argued that the probability of research acceptance among experts increases if some research peer or same-field expert sits on the evaluation committee [ 50 , 51 ]. The judgement of an ERC can also be influenced by the boundaries of the scientific knowledge of its members. These boundaries can impact the ERC’s approach towards risk taking in unexplored fields of research [ 52 ]. Big data research might worsen this problem since the field is relatively new, with no standardized metric to assess risk within and across countries [ 53 ]. The committees do not necessarily communicate with each other to clarify their specific role in the review process, or try to streamline their approach to the assessment. This results in unclear oversight mandates and inconsistent ethical evaluations [ 27 , 54 ].

Additionally, ERCs may fall short in their efforts to justly redistribute the risks and benefits of research. The current review system is still primarily tilted toward protecting the interests of individual research participants. ERCs do not consistently assess societal benefit, or risks and benefits in light of the overall conduct of research (balancing risks for the individual with collective benefits). Although demands on ERCs vary from country to country [ 55 ], the ERC approach is still generally tailored towards traditional forms of biomedical research, such as clinical trials and longitudinal cohort studies with hospital patients. These studies are usually narrow in scope and carry specific risks only for the participants involved. In contrast, big data projects can impact society more broadly. As an example, computational technologies have shown potential to determine individuals’ sexual orientation by screening facial images [ 56 ]. An inadequate assessment of the common good resulting from this type of study can be socially detrimental [ 57 ]. In this sense, big data projects resemble public health research studies, with an ethical focus on the common good over individual autonomy [ 58 ]. Within this context, ERCs have an even greater responsibility to ensure the just distribution of research benefits across the population. Accurately determining the social value of big data research is challenging, as negative consequences may be difficult to detect before research begins. Nevertheless, this task remains a crucial objective of research oversight.

The literature reports examples of the failure of ERCs to be accountable and transparent [ 59 ]. This might be the result of an already unclear role of ERCs. Indeed, ERCs’ practices are an outcome of different levels of legal, ethical, and professional regulations, which largely vary across jurisdictions. Therefore, some ERCs might function as peer counselors, others as independent advisors, and still others as legal controllers. What seems to be common across countries, though, is that ERCs rarely disclose their procedures, policies, and decision-making process. The ERCs’ “secrecy” can result in an absence of trust in the ethical oversight model [ 60 ]. This is problematic because ERCs rely on public acceptance as accountable and trustworthy entities [ 61 ]. In big data research, as the number of data subjects is exponentially greater, a lack of accountability and an opaque deliberative process on the part of ERCs might bring even more significant public backlash. Ensuring truthfulness of the stated benefits and risks of research is a major determinant of trust in both science and research oversight. Researchers are another category of stakeholders negatively impacted by poor communication and publicity on the part of the ERC. Commentators have shown that ERCs often do not clearly provide guidance about the ethical standards applied in the research review [ 62 ]. For instance, if researchers provide unrealistic expectations of privacy and security to data subjects, ERCs have an institutional responsibility to flag those promises (e.g., about data security and the secondary uses of subject data), especially when the research involves personal and highly sensitive data [ 63 ]. For their part, however, ERCs should make their expectations and decision-making processes clear.

Finally, ERCs face the increasing issue of being overwhelmed by the number of studies to review [ 64 , 65 ]. Whereas ERCs originally reviewed only human subjects research happening in natural sciences and medicine, over time they also became the ethical body of reference for those conducting human research in the social sciences (e.g., in behavioral psychology, educational sciences, etc.). This increase in demand creates pressure on ERC members, who often review research pro bono and on a voluntary basis. The wide range of big data research could exacerbate this existing issue. Having more research to assess and less time to accomplish the task may negatively impact the quality of the ERC’s output, as well as increase the time needed for review [ 66 ]. Consequently, researchers might carry out potentially risky studies because the relevant ethical issues of those studies were overlooked. Furthermore, research itself could be significantly delayed, until it loses its timely scientific value.

Novel weaknesses: purview weaknesses

To determine whether the ERC is still the most fit-for-purpose entity to oversee big data research, it is important to establish under which conditions big data projects fall under the purview of ERCs.

Historically, research oversight has primarily focused on human subject research in the biomedical field, using public funding. In the US, for instance, each review board is responsible for a subtype of research based on content or methodology (for example, there are IRBs dedicated to validating clinical trial protocols, assessing cancer treatments, examining pediatric research, and reviewing qualitative research). This traditional ethics review structure cannot accommodate big data research [ 2 ]. Big data projects often reach beyond a single institution, cut across disciplines, involve data collected from a variety of sources, re-use data not originally collected for research purposes, combine diverse methodologies, orient towards population-level research, rely on large data aggregates, and emerge from collaboration with the private sector. Given this scenario, big data projects are likely to fall beyond the purview of ERCs.

Another case in which big data research does not fall under ERC purview is when it relies on anonymized data. If researchers use data that cannot be traced back to subjects (anonymized or non-personal data), then according to both the US Common Rule and HIPAA regulations, the project is considered safe enough to be granted an ethics review waiver. If instead researchers use pseudonymized (or de-identified) data, they must apply for research ethics review, as in principle the key that links the de-identified data with subjects could be revealed or hacked, causing harm to subjects. In the European Union, it would be left to each Member State (and national laws or policies at local institutions) to define whether research using anonymized data should seek ethical review. This case shows once more that current research ethics regulation is relatively loose and disjointed across jurisdictions, and may leave areas where big data research is unregulated. In particular, the special treatment given to anonymized data comes from an emphasis on risk at the individual level. So far in the big data discourse, the concept of harm has been mainly linked to vulnerability in data protection. Therefore, if privacy laws are respected, and protection is built into the data system, researchers can prevent harmful outcomes [ 40 ]. However, this view is myopic as it does not include other misuses of data aggregates, such as group discrimination and dignitary harm. These types of harm are already emerging in the big data ecosystem, where anonymized data reveal health patterns of a certain sub-group, or computational technologies include strong racial biases [ 67 , 68 ]. Furthermore, studies using anonymized data should not be deemed oversight-free by default, as it is increasingly hard to anonymize data. Technological advancements might soon make it possible to re-identify individuals from aggregate data sets [ 69 ].

The risks associated with big data projects also increase due to the variety of actors involved in research alongside university researchers (e.g., private companies, citizen science associations, bio-citizen groups, community workers cooperatives, foundations, and non-profit organizations) [ 70 , 71 ]. The novel aspect of health-related big data research compared with traditional research is that anyone who can access large amounts of data about individuals and build predictive models based on that data can now determine and infer the health status of a person without directly engaging with that person in a research program [ 72 ]. Facebook, for example, is carrying out a suicide prediction and prevention project, which relies exclusively on the information that users post on the social network [ 18 ]. Because this type of research is now possible, and the available ethics review model exempts many big data projects from ERC appraisal, gaps in oversight are growing [ 17 , 73 ]. Just as corporations can re-use publicly available datasets (such as social media data) to determine life insurance premiums [ 74 ], citizen science projects can be conducted without seeking research oversight [ 75 ]. Indeed, participant-led big data research (despite being increasingly common) is another area where the traditional oversight model is not effective [ 76 ]. In addition, ERCs might consider research conducted outside academia or publicly funded institutions to be not serious. Thus, ERCs may disregard review requests from actors outside the academic environment (e.g., from citizen science groups or health tech start-ups) [ 77 ].

Novel weaknesses: functional weaknesses

Functional weaknesses are those related to the skills, composition, and operational activities of ERCs in relation to big data research.

From this functional perspective, we argue that the ex-ante review model might not be appropriate for big data research. Project assessment at the project design phase or at the data collection level is insufficient to address emerging challenges that characterize big data projects – especially as data, over time, could become useful for other purposes, and therefore be re-used or shared [ 53 ]. Limitations of the ex-ante review model have already become apparent in the field of genetic research [ 78 ]. In this context, biobanks must often undergo a second ethics assessment to authorize the specific research use on exome sequencing of their primary data samples [ 79 ]. Similarly, in a case in which an ERC approved the original collection of sensitive personal data, a data access committee would ensure that the secondary uses are in line with original consent and ethics approval. However, if researchers collect data from publicly accessible platforms, they can potentially use and re-use data for research lawfully, without seeking data subject consent or ERC review. This is often the case in social media research. Social media data, which are collected by researchers or private companies using a form of broad consent, can be re-used by researchers to conduct additional analysis without ERC approval. It is not only the re-use of data that poses unforeseeable risks. The ex-ante approach might not be suitable to assess other stages of the data lifecycle [ 80 ], such as the deployment of machine learning algorithms.

Rather than re-using data, some big data studies build models on existing data (using data mining and machine learning methods), creating new data, which is then used to further feed the algorithms [ 81 ]. Sometimes it is not possible to anticipate which analytic models or tools (e.g., artificial intelligence) will be leveraged in the research. And even then, the nature of computational technologies which extract meaning from big data makes it difficult to anticipate all the correlations that will emerge from the analysis [ 37 ]. This is an additional reason that big data research often has a tentative approach to a research question, instead of growing from a specific research hypothesis [ 82 ]. The difficulty of clearly framing the big data research itself makes it even harder for ERCs to anticipate unforeseeable risks and potential societal consequences. Given the existing regulations and the intrinsic exploratory nature of big data projects, the mandate of ERCs does not appear well placed to guarantee research oversight. It seems even less so if we consider problems that might arise after the publication of big data studies, such as repurposing or dual-use issues [ 83 ].

ERCs also face the challenge of assessing the value of informed consent for big data projects. Re-obtaining consent from research subjects is impractical, particularly when using consumer-generated data (e.g., social media data) for research purposes. In these cases, researchers often rely on broad consent and consent waivers. This leaves the data subjects unaware of their participation in specific studies, and therefore makes them incapable of engaging with the research progress. Therefore, the data subjects and the communities they represent become vulnerable to potential negative research outcomes. The tool of consent has limitations in big data research: it cannot disclose all possible future uses of data, in part because these uses may be unknown at the time of data generation. Moreover, researchers can access existing datasets multiple times and reuse the same data with alternative purposes [ 84 ]. What should be the ERCs’ strategy, given that the current model of informed consent leaves an ethical gap in big data projects? ERCs may be tempted to focus on the consent challenge, neglecting other pressing big data issues [ 53 ]. However, the literature reports an increasing number of authors who are against the idea of a new consent form for big data studies [ 5 ].

A final widely discussed concern is the ERC’s inadequate expertise in the area of big data research [ 85 , 86 ]. In the past, there have been questions about the technical and statistical expertise of ERC members. For example, ERCs have attempted to conform social science research to the clinical trial model, using the same knowledge and approach to review both types of research [ 87 ]. However, big data research poses further challenges to ERCs’ expertise. First, the distinct methodology of big data studies (based on data aggregation and mining) requires a specialized technical expertise (e.g., information systems, self-learning algorithms, and anonymization protocols). Indeed, big data projects have a strong technical component, due to data volume and sources, which brings specific challenges (e.g., collecting data outside traditional protocols on social media) [ 88 , 89 ]. Second, ERCs may be unfamiliar with new actors involved in big data research, such as citizen science actors or private corporations. Because of this lack of relevant expertise, ERCs may require unjustified amendments to research studies, or even reject big data projects tout court [ 36 ]. Finally, ERCs may lose credibility as an oversight body capable of assessing ethical violations and research misconduct. In the past, ERCs solved this challenge by consulting independent experts in a relevant field when reviewing a protocol in that domain. However, this solution is not always practical as it depends upon the availability of an expert. Furthermore, experts may be researchers working and publishing in the field themselves. This scenario would be problematic because researchers would have to define the rules experts must abide by, compromising the concept of independent review [ 19 ]. Nonetheless, this problem does not disqualify the idea of expertise but requires high transparency standards regarding rule development and compliance. Other options include ad-hoc expert committees or provision of relevant training for existing committee members [ 47 , 90 , 91 ]. Given these options, which one is best to address ERCs’ lack of expertise in big data research?

Reforming the ERC

Our analysis shows that ERCs play a critical role in ensuring ethical oversight and risk–benefit evaluation [ 92 ], assessing the scientific validity of a project in its early stages, and offering an independent, critical, and interdisciplinary approach to the review. These strengths demonstrate why the ERC is an oversight model worth holding on to. Nevertheless, ERCs carry persistent and big data-specific weaknesses, which reduce their effectiveness and appropriateness as oversight bodies for data-driven research. To answer our initial research question, we propose that the current oversight mechanism is not as fit for purpose to assess the ethics of big data research as it could be in principle. ERCs should be improved at several levels to be able to adequately address and overcome these challenges. Changes could be introduced at the level of the regulatory framework as well as procedures. Additionally, reforming the ERC model might mean introducing complementary forms of oversight. In this section we explore these possibilities. Figure 2 offers an overview of the reforms that could aid ERCs in improving their process.

Figure 2. Reforms overview for the research oversight mechanism

Regulatory reforms

The regulatory design of research oversight is the first aspect which needs reform. ERCs could benefit from new guidance (e.g., in the form of a flowchart) on the ethics of big data research. This guidance could build upon a deep rethinking of the importance of data for the functioning of societies, the way we use data in society, and our justifications for this use. In the UK, for instance, individuals can generally opt out of having their data (e.g., hospital visit data, health records, prescription drugs) stored by physicians’ offices or by NHS digital services. However, exceptions to this opt-out policy apply when uses of the data are vital to the functioning of society (for example, in the case of official national statistics or overriding public interest, such as the COVID-19 pandemic) [ 93 ].

We imagine this new guidance also re-defining the scope of ERC review, from protection of individual interest to a broader research impact assessment. In other words, it will allow the ERC’s scope to expand and to address purview issues which were previously discussed. For example, less research will be oversight-free because more factors would trigger ERC purview in the first place. The new governance would impose ERC review for research involving anonymized data, or big data research within public–private partnerships. Furthermore, ERC purview could be extended beyond the initial phase of the study to other points in the data lifecycle [ 94 ]. A possible option is to assess a study after its conclusion (as is the case in the pharmaceutical industry): ERCs could then decide if research findings and results should be released and further used by the scientific community. This new ethical guidance would serve ERCs not only in deciding whether a project requires review, but also in learning from past examples and best practices how to best proceed in the assessment. Hence, this guidance could also help increase transparency surrounding assessment criteria used across ERCs. Transparency could be achieved by defining a minimum global standard for ethics assessment that allows international collaboration based on open data and a homogeneous evaluation model. Acceptance of a global standard would also mean that the same oversight procedures will apply to research projects with similar risks and research paths, regardless of whether they are carried out by public or private entities. Increased clarification and transparency might also streamline the review process within and across committees, rendering the entire system more efficient.

Procedural reforms

Procedural reforms might target specific aspects of the ERC model to make it more suitable for the review of big data research. To begin with, ERCs should develop new operational tools to mitigate emerging big data challenges. For example, the AI Now algorithmic impact assessment tool, which appraises the ethics of automated decision systems and informs decisions about whether or not to deploy the systems in society, could be used [ 95 ]. Forms of broad consent [ 96 ] and dynamic consent [ 20 ] can also address some of the issues raised by using, re-using, and sharing big data (publicly available or not). Nonetheless, informed consent should not be considered a panacea for all ethical issues in big data research, especially in the case of publicly available social media data [ 97 ]. If the ethical implications of big data studies affect society and its vulnerable sub-groups, individual consent cannot be relied upon as an effective safeguard. For this reason, ERCs should move towards a more democratic process of review. Possible strategies include engaging research subjects and communities in the decision-making process or promoting a co-governance system. The recent Montreal Declaration for Responsible AI is an example of an ethical oversight process developed out of public involvement [ 98 ]. Furthermore, this inclusive approach could increase the trustworthiness of the ethics review mechanism itself [ 99 ]. In practice, the more that ERCs involve potential data subjects in a transparent conversation about the risks of big data research, the more socially accountable the oversight mechanism will become.

ERCs must also address their lack of big data and general computing expertise. There are several potential ways to bridge this gap. First, ERCs could build capacity with formal training on big data. ERCs are willing to learn from researchers about social media data and computational methodologies used for data mining and analysis [ 85 ]. Second, ERCs could adjust membership to include specific experts from needed fields (e.g., computer scientists, biotechnologists, bioinformaticians, data protection experts). Third, ERCs could engage with external experts for specific consultations. Despite some resistance to accepting help, recent empirical research has shown that ERCs may be inclined to rely upon external experts in case of need [ 86 ].

In the data-driven research context, ERCs must embrace their role as regulatory stewards, and walk researchers through the process of ethics review [ 40 ]. ERCs should establish an open communication channel with researchers to communicate the value of research ethics while clarifying the criteria used to assess research. If ERCs and researchers agree to mutually increase transparency, they create an opportunity to learn from past mistakes and prevent future ones [ 100 ]. Universities might seek to educate researchers on ethical issues that can arise when conducting data-driven research. In general, researchers would benefit from training on identifying issues of ethics or completing ethics self-assessment forms, particularly if they are responsible for submitting projects for review [ 101 ]. As biomedical research is trending away from hospitals and clinical trials, and towards people’s homes and private corporations, researchers should strive towards greater clarity, transparency, and responsibility. Researchers should disclose both envisioned risks and benefits, as well as the anticipated impact at the individual and population level [ 54 ]. ERCs can then more effectively assess the impact of big data research and determine whether the common good is guaranteed. Furthermore, they might examine how research benefits are distributed throughout society. Localized decision making can play a role here [ 55 ]. ERCs may take into account characteristics specific to the social context, to evaluate whether or not the research respects societal values.

Complementary reforms

An additional measure to tackle the novelty of big data research might consist in reforming the current research ethics system through regulatory and procedural tools. However, this strategy may not be sufficient: the current system might require additional support from other forms of oversight to complement its work.

One possibility is the creation of hybrid review mechanisms and norms, merging valuable aspects of the traditional ERC review model with more innovative models, which have been adopted by various partners involved in the research (e.g., corporations, participants, communities) [ 102 ]. This integrated mechanism of oversight would cover all stages of big data research and involve all relevant stakeholders [ 103 ]. Journals and the publishing industry could play a role within this hybrid ecosystem in limiting potential dual use concerns. For instance, in the research publication phase, resources could be assigned to editors so as to assess research integrity standards and promote only those projects which are ethically aligned. However, these implementations can have an impact only when there is a shared understanding of best practice within the oversight ecosystem [ 19 ].

A further option is to include specialized and distinct ethical committees alongside ERCs, whose purpose is to assess big data research and provide sectorial accreditation to researchers. In this model, ERCs would not be overwhelmed by the numbers of study proposals to review and could outsource evaluations requiring specialist knowledge in the field of big data. It is true that specialized committees (data safety monitoring boards, data access committees, and responsible research and innovation panels) already exist and support big data researchers in ensuring data protection (e.g., system security, data storage, data transfer). However, something like a “data review board” could assess research implications both for the individual and society, while reviewing a project’s technical features. Peer review could play a critical role in this model: the research community retains the expertise needed to conduct ethical research and to support each other when the path is unclear [ 101 ].

Despite their promise, these scenarios all suffer from at least one primary limitation. The former might face a backlash when attempting to bring together the priorities and ethical values of various stakeholders, within common research norms. Furthermore, while decentralized oversight approaches might bring creativity over how to tackle hard problems, they may also be very dispersive and inefficient. The latter could suffer from overlapping scope across committees, resulting in confusing procedures, and multiplying efforts while diluting liability. For example, research oversight committees have multiplied within the United States, leading to redundancy and disharmony across committees [ 47 ]. Moreover, specialized big data ethics committees working in parallel with current ERCs could lead to questions over the role of the traditional ERC, when an increasing number of studies will be big data studies.

ERCs face several challenges in the context of big data research. In this article, we sought to bring clarity regarding those which might affect the ERC’s practice, distinguishing between novel and persistent weaknesses which are compounded by big data research. While these flaws are profound and inherent in the current sociotechnical transformation, we argue that the current oversight model is still partially capable of guaranteeing the ethical assessment of research. However, we also advance the notion that introducing reform at several levels of the oversight mechanism could benefit and improve the ERC system itself. Among these reforms, we identify the urgency for new ethical guidelines and new ethical assessment tools to safeguard society from novel risks brought by big data research. Moreover, we recommend that ERCs adapt their membership to include necessary expertise for addressing the research needs of the future. Additionally, ERCs should accept external experts’ consultations and consider training in big data technical features as well as big data ethics. A further reform concerns the need for transparent engagement among stakeholders. Therefore, we recommend that ERCs involve both researchers and data subjects in the assessment of big data research. Finally, we acknowledge the existing space for a coordinated and complementary support action from other forms of oversight. However, the actors involved must share a common understanding of best practice and assessment criteria in order to efficiently complement the existing oversight mechanism. We believe that these adaptive suggestions could render the ERC mechanism sufficiently agile and well-equipped to overcome data-intensive research challenges and benefit research at large.

Availability of data and materials

Not applicable.

There is an unsettled discussion about whether ERCs ought to play a role in evaluating both scientific and ethical aspects of research, or whether these can even come apart, but we will not go into detail here. See Dawson AJ, Yentis SM. Contesting the science/ethics distinction in the review of clinical research. J Med Ethics. 2007;33(3):165–7; and Angell EL, Bryman A, Ashcroft RE, Dixon-Woods M. An analysis of decision letters by research ethics committees: the ethics/scientific quality boundary examined. BMJ Qual Saf. 2008;17(2):131–6.


Abbreviations

ERC: Ethics Review Committee(s)

HIPAA: Health Insurance Portability and Accountability Act

IRB: Institutional Review Board(s)

NHS: National Health Service

REC: Research Ethics Committee(s)

UK: United Kingdom

US: United States

Moon MR. The history and role of institutional review boards: A useful tension. AMA J Ethics. 2009;11(4):311–6.

Friesen P, Kearns L, Redman B, Caplan AL. Rethinking the Belmont report? Am J Bioeth. 2017;17(7):15–21.

Nebeker C, Torous J, Ellis RJB. Building the case for actionable ethics in digital health research supported by artificial intelligence. BMC Med. 2019;17(1):137.

Ienca M, Ferretti A, Hurst S, Puhan M, Lovis C, Vayena E. Considerations for ethics review of big data health research: A scoping review. PloS one. 2018;13(10).

Hibbin RA, Samuel G, Derrick GE. From “a fair game” to “a form of covert research”: Research ethics committee members’ differing notions of consent and potential risk to participants within social media research. J Empir Res Hum Res Ethics. 2018;13(2):149–59.

Maldoff G. How GDPR changes the rules for research: International Association of Privacy Professionals; 2020 [Available from: https://iapp.org/news/a/how-gdpr-changes-the-rules-for-research/].

Samuel G, Buchanan E. Guest Editorial: Ethical Issues in Social Media Research. SAGE Publications Sage CA: Los Angeles, CA; 2020. p. 3–11.

Shmueli G. Research Dilemmas with Behavioral Big Data. Big Data. 2017;5(2).

Sula CA. Research ethics in an age of big data. Bull Assoc Inf Sci Technol. 2016;42(2):17–21.

Metcalf J, Crawford K. Where are human subjects in Big Data research? The emerging ethics divide. Big Data Soc. 2016;3(1):2053951716650211.

Vayena E, Gasser U, Wood AB, O'Brien D, Altman M. Elements of a new ethical framework for big data research. Washington and Lee Law Review Online. 2016;72(3).

Goel V. As Data Overflows Online, Researchers Grapple With Ethics: The New York Times; 2014 [Available from: https://www.nytimes.com/2014/08/13/technology/the-boon-of-online-data-puts-social-science-in-a-quandary.html].

Vitak J, Shilton K, Ashktorab Z, editors. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing; 2016.

BBC World. Facebook emotion experiment sparks criticism. 2014 [Available from: https://www.bbc.com/news/technology-28051930].

Fiske ST, Hauser RM. Protecting human research participants in the age of big data. Proc Natl Acad Sci USA. 2014;111(38):13675.

Klitzman R, Appelbaum PS. Facebook’s emotion experiment: Implications for research ethics: The Hastings Center; 2014 [Available from: https://www.thehastingscenter.org/facebooks-emotion-experiment-implications-for-research-ethics/].

Ballantyne A, Stewart C. Big Data and Public-Private Partnerships in Healthcare and Research. Asian Bioethics Review. 2019;11(3):315–26.

Barnett I, Torous J. Ethics, transparency, and public health at the intersection of innovation and Facebook's suicide prevention efforts. American College of Physicians; 2019.

Samuel G, Derrick GE, van Leeuwen T. The ethics ecosystem: Personal ethics, network governance and regulating actors governing the use of social media research data. Minerva. 2019;57(3):317–43.

Vayena E, Blasimme A. Biomedical big data: new models of control over access, use and governance. Journal of bioethical inquiry. 2017;14(4):501–13.

BBC World. Google announces AI ethics panel: BBC World; 2019 [Available from: https://www.bbc.com/news/technology-47714921].

Clegg N. Welcoming the Oversight Board - About Facebook: Facebook; 2020 [updated 2020-05-06. Available from: https://about.fb.com/news/2020/05/welcoming-the-oversight-board/].

Shabani M, Dove ES, Murtagh M, Knoppers BM, Borry P. Oversight of genomic data sharing: what roles for ethics and data access committees? Biopreservation and biobanking. 2017;15(5):469–74.

Joly Y, Dove ES, Knoppers BM, Bobrow M, Chalmers D. Data sharing in the post-genomic world: the experience of the International Cancer Genome Consortium (ICGC) Data Access Compliance Office (DACO). PLoS Comput Biol. 2012;8(7):e1002549.

Dawson AJ, Yentis SM. Contesting the science/ethics distinction in the review of clinical research. J Med Ethics. 2007;33(3):165–7.

Angell EL, Bryman A, Ashcroft RE, Dixon-Woods M. An analysis of decision letters by research ethics committees: the ethics/scientific quality boundary examined. BMJ Qual Saf. 2008;17(2):131–6.

Nichols AS. Research ethics committees (RECS)/institutional review boards (IRBS) and the globalization of clinical research: Can ethical oversight of human subjects research be standardized. Wash U Global Stud L Rev. 2016;15:351.

Garrard E, Dawson A. What is the role of the research ethics committee? Paternalism, inducements, and harm in research ethics. J Med Ethics. 2005;31(7):419–23.

Page SA, Nyeboer J. Improving the process of research ethics review. Research Integrity and Peer Review. 2017;2(1):14.

Bowen AJ. Models of institutional review board function. 2008.

McGuinness S. Research ethics committees: the role of ethics in a regulatory authority. J Med Ethics. 2008;34(9):695–700.

Kane C, Takechi K, Chuma M, Nokihara H, Takagai T, Yanagawa H. Perspectives of non-specialists on the potential to serve as ethics committee members. J Int Med Res. 2019;47(5):1868–76.

Kirkbride J, George A. Lay REC members: patient and public. J Med Ethics. 2020;39(12):780–2.

Resnik DB. Trust as a Foundation for Research with Human Subjects. The Ethics of Research with Human Subjects: Protecting People, Advancing Science, Promoting Trust. Cham: Springer International Publishing; 2018. p. 87–111.

Kritikos M. Research Ethics Governance: The European Situation. Handbook of Research Ethics and Scientific Integrity. 2020:33–50.

Molina JL, Borgatti SP. Moral bureaucracies and social network research. Social Networks [Internet]. 2019;16(11):2020.

Sheehan M, Dunn M, Sahan K. Reasonable disagreement and the justification of pre-emptive ethics governance in social research: a response to Hammersley. J Med Ethics. 2018;44:719–20.

Mustajoki H. Pre-emptive research ethics: Finnish National Board on Research Integrity TENK; 2018 [Available from: https://vastuullinentiede.fi/en/doing-research/pre-emptive-research-ethics].

Biagetti M, Gedutis A. Towards Ethical Principles of Research Evaluation in SSH. The Third Research Evaluation in SSH Conference, Valencia, 19–20 September 2019; 2019. p. 19–20.

Dove ES. Regulatory Stewardship of Health Research: Edward Elgar Publishing; 2020.

Tene O, Polonetsky J. Beyond IRBs: Ethical guidelines for data research. Washington and Lee Law Review Online. 2016;72(3):458.

Bloss C, Nebeker C, Bietz M, Bae D, Bigby B, Devereaux M, et al. Reimagining human research protections for 21st century science. J Med Internet Res. 2016;18(12):e329.

Dove ES, Garattini C. Expert perspectives on ethics review of international data-intensive research: Working towards mutual recognition. Research Ethics. 2018;14(1):1–25.

van den Broek T, van Veenstra AF. Governance of big data collaborations: How to balance regulatory compliance and disruptive innovation. Technol Forecast Soc Chang. 2018;129:330–8.

Jackman M, Kanerva L. Evolving the IRB: building robust review for industry research. Washington and Lee Law Review Online. 2016;72(3):442.

Someh I, Davern M, Breidbach CF, Shanks G. Ethical issues in big data analytics: A stakeholder perspective. Commun Assoc Inf Syst. 2019;44(1):34.

Friesen P, Redman B, Caplan A. Of Straws, Camels, Research Regulation, and IRBs. Therapeutic innovation & regulatory science. 2019;53(4):526–34.

Kohn T, Shore C. The ethics of university ethics committees. Risk management and the research imagination, in Death of the public university. 2017:229–49.

Friesen P, Yusof ANM, Sheehan M. Should the Decisions of Institutional Review Boards Be Consistent? Ethics & human research. 2019;41(4):2–14.

Binik A, Hey SP. A framework for assessing scientific merit in ethical review of clinical research. Ethics & human research. 2019;41(2):2–13.

Derrick GE, Haynes A, Chapman S, Hall WD. The association between four citation metrics and peer rankings of research influence of Australian researchers in six fields of public health. PLoS ONE. 2011;6(4):e18521.

Luukkonen T. Conservatism and risk-taking in peer review: Emerging ERC practices. Research Evaluation. 2012;21(1):48–60.

Dove ES, Townend D, Meslin EM, Bobrow M, Littler K, Nicol D, et al. Ethics review for international data-intensive research. Science. 2016;351(6280):1399–400.

Abbott L, Grady C. A systematic review of the empirical literature evaluating IRBs: What we know and what we still need to learn. J Empir Res Hum Res Ethics. 2011;6(1):3–19.

Shaw DM, Elger BS. The relevance of relevance in research. Swiss Medical Weekly. 2013;143(1920).

Wang Y, Kosinski M. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. J Pers Soc Psychol. 2018;114(2):246–57.

Levin S. LGBT groups denounce 'dangerous' AI that uses your face to guess sexuality: The Guardian; 2017 [updated 2017-09-09. Available from: http://www.theguardian.com/world/2017/sep/08/ai-gay-gaydar-algorithm-facial-recognition-criticism-stanford].

Tan S, Zhao Y, Huang W. Neighborhood Social Disadvantage and Bicycling Behavior: A Big Data-Spatial Approach Based on Social Indicators. Soc Indic Res. 2019;145(3):985–99.

Lynch HF. Opening closed doors: Promoting IRB transparency. J Law Med Ethics. 2018;46(1):145–58.

Samuel GN, Farsides B. Public trust and ‘ethics review’ as a commodity: the case of Genomics England Limited and the UK’s 100,000 genomes project. Med Health Care Philos. 2018;21(2):159–68.

Nebeker C, Lagare T, Takemoto M, Lewars B, Crist K, Bloss CS, et al. Engaging research participants to inform the ethical conduct of mobile imaging, pervasive sensing, and location tracking research. Translational behavioral medicine. 2016;6(4):577–86.

Clapp JT, Gleason KA, Joffe S. Justification and authority in institutional review board decision letters. Soc Sci Med. 2017;194:25–33.

Sheehan M, Friesen P, Balmer A, Cheeks C, Davidson S, Devereux J, et al. Trust, trustworthiness and sharing patient data for research. Journal of Medical Ethics [Internet]. 2020.

Klitzman R. The ethics police?: The struggle to make human research safe: Oxford University Press; 2015.

Cantonal Ethics Committee Zurich. Annual Report 2019. 2019 [Available from: https://www.zh.ch/content/dam/zhweb/bilder-dokumente/organisation/gesundheitsdirektion/ethikkommission-/jahresberichte-kek/Jahresbericht_KEK%20ZH%202019_09-03-2020_PKL.pdf].

Lynch HF, Abdirisak M, Bogia M, Clapp J. Evaluating the quality of research ethics review and oversight: a systematic analysis of quality assessment instruments. AJOB Empirical Bioethics. 2020:1–15.

Hoffman S. What genetic testing teaches about long-term predictive health analytics regulation. 2019.

Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53.

Yoshiura H. Re-identifying people from anonymous histories of their activities. 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST); 23–25 Oct. 2019; 2019. p. 1–5.

Holm S, Ploug T. Big Data and Health Research—The Governance Challenges in a Mixed Data Economy. Journal of Bioethical Inquiry. 2017;14(4):515–25.

Nebeker C. mHealth Research Applied to Regulated and Unregulated Behavioral Health Sciences. The Journal of Law, Medicine & Ethics. 2020;48(1_suppl):49–59.

Marks M. Emergent Medical Data: Health Information Inferred by Artificial Intelligence. UC Irvine Law Review (2021, Forthcoming). 2020.

Friesen P, Douglas Jones R, Marks M, Pierce R, Fletcher K, Mishra A, et al. Governing AI-driven health research: are IRBs up to the task? Ethics & Human Research. 2020 Forthcoming

Baron J. Life Insurers Can Use Social Media Posts To Determine Premiums, As Long As They Don't Discriminate: Forbes; 2019 [Available from: https://www.forbes.com/sites/jessicabaron/2019/02/04/life-insurers-can-use-social-media-posts-to-determine-premiums/].

Wiggins A, Wilbanks J. The rise of citizen science in health and biomedical research. Am J Bioeth. 2019;19(8):3–14.

Ienca M, Vayena E. “Hunting Down My Son’s Killer”: New Roles of Patients in Treatment Discovery and Ethical Uncertainty. Journal of Bioethical Inquiry. 2020:1–11.

Grant AD, Wolf GI, Nebeker C. Approaches to governance of participant-led research: a qualitative case study. BMJ Open. 2019;9(4):e025633.

Mascalzoni D, Hicks A, Pramstaller P, Wjst M. Informed consent in the genomics era. PLoS Med. 2008;5(9):e192.

McGuire AL, Beskow LM. Informed consent in genomics and genetic research. Annu Rev Genomics Hum Genet. 2010;11:361–81.

Roth S, Luczak-Roesch M. Deconstructing the data life-cycle in digital humanitarianism. Inf Commun Soc. 2020;23(4):555–71.

Gal A, Senderovich A. Process Minding: Closing the Big Data Gap. International Conference on Business Process Management: Springer; 2020. p. 3–16.

Ferretti A, Ienca M, Hurst S, Vayena E. Big Data, Biomedical Research, and Ethics Review: New Challenges for IRBs. Ethics & human research. 2020;42(5):17–28.

Ienca M, Vayena E. Dual use in the 21st century: emerging risks and global governance. Swiss Med Wkly. 2018;148:w14688.

Shabani M, Borry P. Rules for processing genetic data for research purposes in view of the new EU General Data Protection Regulation. Eur J Hum Genet. 2018;26(2):149–56.

Nebeker C, Harlow J, Espinoza Giacinto R, Orozco-Linares R, Bloss CS, Weibel N. Ethical and regulatory challenges of research using pervasive sensing and other emerging technologies: IRB perspectives. AJOB empirical bioethics. 2017;8(4):266–76.

Sellers C, Samuel G, Derrick G. Reasoning, “uncharted territory”: notions of expertise within ethics review panels assessing research use of social media. J Empir Res Hum Res Ethics. 2020;15(1–2):28–39.

Schrag ZM. The case against ethics review in the social sciences. Research Ethics. 2011;7(4):120–31.

Beskow LM, Hammack-Aviran CM, Brelsford KM, O'Rourke PP. Expert Perspectives on Oversight for Unregulated mHealth Research: Empirical Data and Commentary. The Journal of Law, Medicine & Ethics. 2020;48(1_suppl):138–46.

Huh-Yoo J, Rader E. It’s the Wild, Wild West: Lessons Learned From IRB Members’ Risk Perceptions Toward Digital Research Data. Proceedings of the ACM on Human-Computer Interaction. 2020;4(CSCW1):1–22.

NHS Health Research Authority. Gene Therapy Advisory Committee. 2020 [Available from: https://www.hra.nhs.uk/about-us/committees-and-services/res-and-recs/gene-therapy-advisory-committee/].

NHS Health Research Authority. The Social Care Research Ethics Committee (REC). 2020 [Available from: https://www.hra.nhs.uk/planning-and-improving-research/policies-standards-legislation/social-care-research/].

Sheehan M, Dunn M, Sahan K. In defence of governance: ethics review and social research. J Med Ethics. 2017;44(10):710–6.

NHS UK. When your choice does not apply. 2019 [Available from: https://www.nhs.uk/your-nhs-data-matters/where-your-choice-does-not-apply/].

Master Z, Martinson BC, Resnik DB. Expanding the scope of research ethics consultation services in safeguarding research integrity: Moving beyond the ethics of human subjects research. Am J Bioeth. 2018;18(1):55–7.

Reisman D, Schultz J, Crawford K, Whittaker M. Algorithmic impact assessments: A practical framework for public agency accountability. AI Now Institute; 2018. p. 1–22.

Sheehan M. Broad consent is informed consent. BMJ. 2011;343:d6900.

Sheehan M, Thompson R, Fistein J, Davies J, Dunn M, Parker M, et al. Authority and the Future of Consent in Population-Level Biomedical Research. Public Health Ethics. 2019;12(3):225–36.

Université de Montréal. Montréal Declaration for a Responsible Development of Artificial Intelligence. 2019 [Available from: https://www.montrealdeclaration-responsibleai.com].

McCoy MS, Jongsma KR, Friesen P, Dunn M, Neuhaus CP, Rand L, et al. National Standards for Public Involvement in Research: missing the forest for the trees. J Med Ethics. 2018;44(12):801–4.

Brown C, Spiro J, Quinton S. The role of research ethics committees: Friend or foe in educational research? An exploratory study. Br Edu Res J. 2020;46(4):747–69.

Pagoto S, Nebeker C. How scientists can take the lead in establishing ethical practices for social media research. J Am Med Inform Assoc. 2019;26(4):311–3.

Harlow J, Weibel N, Al Kotob R, Chan V, Bloss C, Linares-Orozco R, et al. Using participatory design to inform the Connected and Open Research Ethics (CORE) commons. Sci Eng Ethics. 2020;26(1):183–203.

Vayena E, Blasimme A. Health research with big data: Time for systemic oversight. J Law Med Ethics. 2018;46(1):119–29.

This article reports the ideas and conclusions that emerged during a collaborative and participatory online workshop. All authors participated in the “Big Data Challenges for Ethics Review Committees” workshop, held online on 23–24 April 2020 and organized by the Health Ethics and Policy Lab, ETH Zurich.

This research is supported by the Swiss National Science Foundation under award 407540_167223 (NRP 75 Big Data). MS1 is grateful for funding from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The funding bodies did not take part in designing this research or in writing the manuscript.

Author information

Authors and affiliations.

Health Ethics and Policy Lab, Department of Health Sciences and Technology, ETH Zürich, Hottingerstrasse 10 (HOA), 8092, Zürich, Switzerland

Agata Ferretti, Marcello Ienca, Alessandro Blasimme & Effy Vayena

The Ethox Centre, Department of Population Health, University of Oxford, Oxford, UK

Mark Sheehan

School of Law, University of Edinburgh, Edinburgh, UK

Edward S. Dove

Brighton and Sussex Medical School, Brighton, UK

Bobbie Farsides

Biomedical Ethics Unit, Department of Social Studies of Medicine, McGill University, Montreal, Canada

Phoebe Friesen

Johns Hopkins Berman Institute of Bioethics, Baltimore, USA

Jeff Kahn

Mobile Health Systems Lab, Department of Health Sciences and Technology, ETH Zürich, Zürich, Switzerland

Walter Karlen

Cantonal Ethics Committee Zürich, Zürich, Switzerland

Peter Kleist

Center for Bioethics, Department of Philosophy, New York University, New York, USA

S. Matthew Liao

Research Center for Optimal Digital Ethics in Health (ReCODE Health), Herbert Wertheim School of Public Health and Longevity Science, University of California, San Diego, USA

Camille Nebeker

Department of Global Health and Social Medicine, King’s College London, London, UK

Gabrielle Samuel

Faculty of Law and Criminology, Ghent University, Ghent, Belgium

Mahsa Shabani

Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland

Minerva Rivas Velarde



AF drafted the manuscript; MI, MS1 and EV contributed substantially to the writing. EV is the senior lead on the project from which this article derives. All the authors (AF, MI, MS1, AB, ESD, BF, PF, JK, WK, PK, SML, CN, GS, MS2, MRV, EV) contributed greatly to the intellectual content of this article, edited it, and approved the final version. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Agata Ferretti.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Ferretti, A., Ienca, M., Sheehan, M. et al. Ethics review of big data research: What should stay and what should be reformed? BMC Med Ethics 22, 51 (2021). https://doi.org/10.1186/s12910-021-00616-4


Received: 23 November 2020

Accepted: 15 April 2021

Published: 30 April 2021

DOI: https://doi.org/10.1186/s12910-021-00616-4


  • Research ethics
  • Ethics review
  • Biomedical research

BMC Medical Ethics

ISSN: 1472-6939


Data ethics: What it means and what it takes

Now more than ever, every company is a data company. By 2025, individuals and companies around the world will produce an estimated 463 exabytes of data each day (Jeff Desjardins, “How much data is generated each day?” World Economic Forum, April 17, 2019), compared with less than three exabytes a decade ago (IBM Research Blog, “Dimitri Kanevsky translating big data,” blog entry by IBM Research Editorial Staff, March 5, 2013).

With that in mind, most businesses have begun to address the operational aspects of data management, for instance, determining how to build and maintain a data lake or how to integrate data scientists and other technology experts into existing teams. Fewer companies have systematically considered and started to address the ethical aspects of data management, which could have broad ramifications and responsibilities. If algorithms are trained with biased data sets or data sets are breached, sold without consent, or otherwise mishandled, for instance, companies can incur significant reputational and financial costs. Board members could even be held personally liable (Leah Rizkallah, “Potential board liability for cybersecurity failures under Caremark law,” CPO Magazine, February 22, 2022).

So how should companies begin to think about ethical data management? What measures can they put in place to ensure that they are using consumer, patient, HR, facilities, and other forms of data appropriately across the value chain—from collection to analytics to insights?

We began to explore these questions by speaking with about a dozen global business leaders and data ethics experts. Through these conversations, we learned about some common data management traps that leaders and organizations can fall into, despite their best intentions. These traps include thinking that data ethics does not apply to your organization, that legal and compliance have data ethics covered, and that data scientists have all the answers—to say nothing of chasing short-term ROI at all costs and looking only at the data rather than their sources.

In this article, we explore these traps and suggest some potential ways to avoid them, such as adopting new standards for data management, rethinking governance models, and collaborating across disciplines and organizations. This list of potential challenges and remedies is not exhaustive; our research base was relatively small, and leaders could face many other obstacles, beyond our discussion here, to the ethical use of data. But what’s clear from our research is that data ethics needs both more and sustained attention from all members of the C-suite, including the CEO.

Potential challenges for business leaders

What is data ethics?

We spoke with about a dozen business leaders and data ethics experts. In their eyes, these are some characteristics of ethical data use:

It preserves data security and protects customer information. The practitioners we spoke with tend to view cybersecurity and data privacy as part and parcel of data ethics. They believe companies have an ethical responsibility (as well as legal obligations) to protect customers’ data, defend against breaches, and ensure that personal data are not compromised.

It offers a clear benefit to both consumers and companies. “The consumer’s got to be getting something” from a data-based transaction, explained an executive at a large financial-services company. “If you’re not solving a problem for a consumer, you’ve got to ask yourself why you’re doing what you’re doing.” The benefit to customers should be straightforward and easy to summarize in a single sentence: customers might, for instance, get greater speed, convenience, value, or savings.

It offers customers some measure of agency. “We don’t want consumers to be surprised,” one executive told us. “If a customer receives an offer and says, ‘I think I got this because of how you’re using my data, and that makes me uncomfortable. I don’t think I ever agreed to this,’ another company might say, ‘On page 41, down in the footnote in the four-point font, you did actually agree to this.’ We never want to be that company.”

It is in line with your company’s promises. In data management, organizations must do what they say they will do—or risk losing the trust of customers and other key stakeholders. As one senior executive pointed out, keeping faith with stakeholders may mean turning down certain contracts if they contradict the organization’s stated data values and commitments.

There is a dynamic body of literature on data ethics. Just as the methods companies use to collect, analyze, and access data are evolving, so will definitions of the term itself. In this article, we define data ethics as data-related practices that seek to preserve the trust of users, patients, consumers, clients, employees, and partners. Most of the business leaders we spoke to agreed broadly with that definition, but some have tailored it to the needs of their own sectors or organizations (see sidebar, “What is data ethics?”). Our conversations with these business leaders also revealed the unintended lapses in data ethics that can happen in organizations. These include the following:

Thinking that data ethics doesn’t apply to your organization

While privacy and ethical considerations are essential whenever companies use data (including artificial-intelligence and machine-learning applications), they often aren’t top of mind for some executives. In our experience, business leaders are not intentionally pushing these thoughts away; it’s often just easier for them to focus on things they can “see” (the tools, technologies, and strategic objectives associated with data management) than on the seemingly invisible ways data management can go wrong.

In a 2021 McKinsey Global Survey on the state of AI, for instance, only 27 percent of some 1,000 respondents said that their data professionals actively check for skewed or biased data during data ingestion. Only 17 percent said that their companies have a dedicated data governance committee that includes risk and legal professionals. In that same survey, only 30 percent of respondents said their companies recognized equity and fairness as relevant AI risks. AI-related data risks are only a subset of broader data ethics concerns, of course, but these numbers are striking.

Thinking in silos: Legal, compliance, or data scientists have data ethics covered

Companies may believe that just by hiring a few data scientists, they’ve fulfilled their data management obligations. The truth is data ethics is everyone’s domain, not just the province of data scientists or of legal and compliance teams. At different times, employees across the organization—from the front line to the C-suite—will need to raise, respond to, and think through various ethical issues surrounding data. Business unit leaders will need to vet their data strategies with legal and marketing teams, for example, to ensure that their strategic and commercial objectives are in line with customers’ expectations and with regulatory and legal requirements for data usage.

As executives navigate usage questions, they must acknowledge that although regulatory requirements and ethical obligations are related, adherence to data ethics goes far beyond the question of what’s legal. Indeed, companies must often make decisions before the passage of relevant laws. The European Union’s General Data Protection Regulation (GDPR) went into effect only in May 2018, the California Consumer Privacy Act has been in effect only since January 2020, and federal privacy law is only now pending in the US Congress. Years before these and other statutes and regulations were put in place, leaders had to set the terms for their organizations’ use of data—just as they currently make decisions about matters that will be regulated in years to come.

Laws can show executives what they can do. But a comprehensive data ethics framework can guide executives on whether they should, say, pursue a certain commercial strategy and, if so, how they should go about it. One senior executive we spoke with put the data management task for executives plainly: “The bar here is not regulation. The bar here is setting an expectation with consumers and then meeting that expectation, and doing it in a way that’s additive to your brand.”

Chasing short-term ROI

Prompted by economic volatility, aggressive innovation in some industries, and other disruptive business trends, executives and other employees may be tempted to make unethical data choices—for instance, inappropriately sharing confidential information because it is useful—to chase short-term profits. Boards increasingly want more standards for the use of consumer and business data, but the short-term financial pressures remain. As one tech company president explained: “It’s tempting to collect as much data as possible and to use as much data as possible. Because at the end of the day, my board cares about whether I deliver growth and EBITDA.… If my chief marketing officer can’t target users to create an efficient customer acquisition channel, he will likely get fired at some point—or at least he won’t make his bonus.”

Looking only at the data, not at the sources

Ethical lapses can occur when executives look only at the fidelity and utility of discrete data sets and don’t consider the entire data pipeline. Where did the data come from? Can this vendor ensure that the subjects of the data gave their informed consent for use by third parties? Do any of the market data contain material nonpublic information? Such due diligence is key: one alternative data provider was charged with securities fraud for misrepresenting to trading firms how its data were derived. In that case, companies had provided confidential information about the performance of their apps to the data vendor, which did not aggregate and anonymize the data as promised. Ultimately, the vendor had to settle with the US Securities and Exchange Commission (“SEC charges App Annie and its founder with securities fraud,” US Securities and Exchange Commission, September 14, 2021).

A few important building blocks

These data management challenges are common—and they are by no means the only ones. As organizations generate more data, adopt new tools and technologies to collect and analyze data, and find new ways to apply insights from data, new privacy and ethical challenges and complications will inevitably emerge. Organizations must experiment with ways to build fault-tolerant data management programs. These seven data-related principles, drawn from our research, may provide a helpful starting point.

Set company-specific rules for data usage

Leaders in the business units, functional areas, and legal and compliance teams must come together to create a data usage framework for employees, one that reflects a shared vision and mission for the company’s use of data. As a start, the CEO and other C-suite leaders must also be involved in defining data rules that give employees a clear sense of the company’s threshold for risk and which data-related ventures are OK to pursue and which are not.


Such rules can improve and potentially speed up individual and organizational decision making. They should be tailored to your specific industry, even to the products and services your company offers. They should be accessible to all employees, partners, and other critical stakeholders. And they should be grounded in a core principle—for example, “We do not use data in any way that we cannot link to a better outcome for our customers.” Business leaders should plan to revisit and revise the rules periodically to account for shifts in the business and technology landscape.

Communicate your data values, both inside and outside your organization

Once you’ve established common data usage rules, it’s important to communicate them effectively inside and outside the organization. That might mean featuring the company’s data values on employees’ screen savers, as the company of one of our interview subjects has done. Or it may be as simple as tailoring discussions about data ethics to various business units and functions and speaking to their employees in language they understand. The messaging to the IT group and data scientists, for instance, may be about creating ethical data algorithms or safe and robust data storage protocols. The messaging to marketing and sales teams may focus on transparency and opt-in/opt-out protocols.

Organizations also need to earn the public’s trust. Posting a statement about data ethics on the corporate website worked for one financial-services organization. As an executive explained: “When you’re having a conversation with a government entity, it’s really helpful to be able to say, ‘Go to our website and click on Responsible Data Use, and you’ll see what we think.’ We’re on record in a way that you can’t really walk back.” Indeed, publicizing your company’s data ethics framework may help increase the momentum for powerful joint action, such as the creation of industry-wide data ethics standards.


Build a diverse data-focused team

A strong data ethics program won’t materialize out of the blue. Organizations large and small need people who focus on ethics issues; it cannot be a side activity. The work should be assigned to a specific team or attached to a particular role. Some larger technology and pharmaceutical companies have appointed chief ethics or chief trust officers in recent years. Others have set up interdisciplinary teams, sometimes referred to as data ethics boards, to define and uphold data ethics. Ideally, such boards would include representatives from, for example, the business units, marketing and sales, compliance and legal, audit, IT, and the C-suite. These boards should also have a range of genders, races, ethnicities, classes, and so on: an organization will be more likely to identify issues early on (in algorithm-training data, for example) when people with a range of different backgrounds and experiences sit around the table.

One multinational financial-services corporation has developed an effective structure for its data ethics deliberations and decision making. It has two main data ethics groups. The major decisions are made by a group of senior stakeholders, including the head of security and other senior technology executives, the chief privacy officer, the head of the consulting arm, the head of strategy, and the heads of brand, communications, and digital advertising. These are the people most likely to use the data.

Governance is the province of another group, which is chaired by the chief privacy officer and includes the global head of data, a senior risk executive, and the executive responsible for the company’s brand. Anything new concerning data use gets referred to this council, and teams must explain how proposed products comply with the company’s data use principles. As one senior company executive explains, “It’s important that both of these bodies be cross-functional because in both cases you’re trying to make sure that you have a fairly holistic perspective.”

As we’ve noted, compliance teams and legal counsel should not be the only people thinking about a company’s data ethics, but they do have an important role to play in ensuring that data ethics programs succeed. Legal experts are best positioned to advise on how your company should apply existing and emerging regulations. But teams may also want to bring in outside experts to navigate particularly difficult ethical challenges. For example, a large tech company brought in an academic expert on AI ethics to help it figure out how to navigate gray areas, such as the environmental impact of certain kinds of data use. That expert was a sitting but not voting member of the group because the team “did not want to outsource the decision making.” But the expert participated in every meeting and led the team in the work that preceded the meetings.

Engage champions in the C-suite

Some practitioners and experts we spoke with who had convened data ethics boards pointed to the importance of keeping the CEO and the corporate board apprised of decisions and activities. A senior executive who chaired his organization’s data ethics group explained that while it did not involve the CEO directly in the decision-making process, it brought all data ethics conclusions to him “and made sure he agreed with the stance that we were taking.” All these practitioners and experts agreed that having a champion or two in the C-suite can signal the importance of data ethics to the rest of the organization, put teeth into data rules, and support the case for investment in data-related initiatives.

Indeed, corporate boards and audit committees can provide the checks needed to ensure that data ethics are being upheld, regardless of conflicting incentives. The president of one tech company told us that its board had recently begun asking for a data ethics report as part of the audit committee’s agenda, which had previously focused more narrowly on privacy and security. “You have to provide enough of an incentive—a carrot or a stick to make sure people take this seriously,” the president said.

Consider the impact of your algorithms and overall data use

Organizations should continually assess the effects of the algorithms and data they use—and test for bias throughout the value chain. That means thinking about the problems organizations might create, even unwittingly, in building AI products. For instance, who might be disadvantaged by an algorithm or a particular use of data? One technologist we spoke with advises asking the hard questions: “Start your meetings about AI by asking, ‘Are the algorithms we are building sexist or racist?’”

Certain data applications require far greater scrutiny and consideration. Security is one such area. A tech company executive recalled the extra measures his organization took to prevent its image and video recognition products and services from being misused: “We would insist that if you were going to use our technology for security purposes, we had to get very involved in ensuring that you debiased the data set as much as possible so that particular groups would not be unfairly singled out.” It’s important to consider not only what types of data are being used but also what they are being used for—and what they could potentially be used for down the line.
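One way to make the advice to test for bias concrete is to compare how often each group receives a favorable outcome. The sketch below is only an illustration under stated assumptions: the toy records, the `group` and `approved` field names, the helper functions, and the 0.8 review threshold are all hypothetical, and a real assessment would use measures and cutoffs chosen by the organization's own data ethics board.

```python
from collections import defaultdict


def selection_rates(records, group_key, outcome_key):
    """Return the share of positive outcomes for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a positive outcome
    return min(rates.values()) / highest


# Hypothetical toy data: which applicants a model approved.
decisions = [
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": False},
    {"group": "A", "approved": True},
    {"group": "B", "approved": True},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
]

rates = selection_rates(decisions, "group", "approved")
ratio = disparity_ratio(rates)

# Assumed cutoff for illustration only; set it according to your own policy.
REVIEW_THRESHOLD = 0.8
needs_review = ratio < REVIEW_THRESHOLD

print(rates, round(ratio, 2), needs_review)
```

A low ratio does not prove that an algorithm is unfair, and a high one does not prove that it is fair. The point of a check like this is to prompt the hard questions early and route the case to the people responsible for data ethics.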

Think globally

The ethical use of data requires organizations to consider the interests of people who are not in the room. Anthropologist Mary Gray, the senior principal researcher at Microsoft Research, raises questions about global reach in her 2019 book, Ghost Work. Among them: Who labeled the data? Who tagged these images? Who kept violent videos off this website? Who weighed in when the algorithm needed a steer?

Today’s leaders need to ask these sorts of questions, along with others about how such tech work happens. Broadly, leaders must take a 10,000-foot view of their companies as players in the digital economy, the data ecosystem, and societies everywhere. There may be ways they can support policy initiatives or otherwise help to bridge the digital divide, support the expansion of broadband infrastructure, and create pathways for diversity in the tech industry. Ultimately, data ethics requires leaders to reckon with the ongoing rise in global inequality and the increasing concentration of wealth and value both in geographical tech hubs and among AI-enabled organizations (for more on the concentration of value among AI-enabled firms, see Marco Iansiti and Karim R. Lakhani, Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World, Boston: Harvard Business Review Press, 2020).

Embed your data principles in your operations

It’s one thing to define what constitutes the ethical use of data and to set data usage rules; it’s another to integrate those rules into operations across the organization. Data ethics boards, business unit leaders, and C-suite champions should build a common view (and a common language) about how data usage rules should link up to both the company’s data and corporate strategies and to real-world use cases for data ethics, such as decisions on design processes or M&A. In some cases, there will be obvious places to operationalize data ethics, for instance, data operations teams, secure-development operations teams, and machine-learning operations teams. Trust-building frameworks for machine-learning operations can ensure that data ethics will be considered at every step in the development of AI applications.

Regardless of which part of the organization the leaders target first, they should identify KPIs that can be used to monitor and measure its performance in realizing their data ethics objectives. To ensure that the ethical use of data becomes part of everyone’s daily work, the leadership team also should advocate, help to build, and facilitate formal training programs on data ethics.

Data ethics can’t be put into practice overnight. As many business leaders know firsthand, building teams, establishing practices, and changing organizational culture are all easier said than done. What’s more, upholding your organization’s data ethics principles may mean walking away from potential partnerships and other opportunities to generate short-term revenues. But the stakes for companies could not be higher. Organizations that fail to walk the walk on data ethics risk losing their customers’ trust and destroying value.

Alex Edquist is an alumna of McKinsey’s Atlanta office; Liz Grennan is an associate partner in the Stamford, Connecticut, office; Sian Griffiths is a partner in the Washington, DC, office; and Kayvaun Rowshankish is a senior partner in the New York office.

The authors wish to thank Alyssa Bryan, Kasia Chmielinski, Ilona Logvinova, Keith Otis, Marc Singer, Naomi Sosner, and Eckart Windhagen for their contributions to this article.

This article was edited by Roberta Fusaro, an editorial director in the Waltham, Massachusetts, office.


An Ethics Framework for Big Data in Health and Research


  • 1 Centre for Biomedical Ethics, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
  • 2 Centre for Social Ethics and Policy, School of Law, University of Manchester, Manchester, UK.
  • 3 Department of Primary Health Care & General Practice, University of Otago, Dunedin, New Zealand.
  • 4 Division of Business Law, College of Business, Nanyang Technological University, Singapore.
  • 5 Sydney Health Ethics, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia.
  • 6 The University of Sydney Law School, Sydney, Australia.
  • 7 School of Social Sciences, College of Humanities, Arts, & Social Sciences, Nanyang Technological University, Singapore.
  • 8 School of Law and JK Mason Institute for Medicine, Life Sciences and the Law, University of Edinburgh, Edinburgh, UK.
  • 9 Saw Swee Hock School of Public Health, National University of Singapore, Singapore.
  • 10 Division of Endocrinology, National University Hospital, Singapore.
  • PMID: 33717314
  • PMCID: PMC7747261
  • DOI: 10.1007/s41649-019-00099-x

Ethical decision-making frameworks assist in identifying the issues at stake in a particular setting and thinking through, in a methodical manner, the ethical issues that require consideration as well as the values that need to be considered and promoted. Decisions made about the use, sharing, and re-use of big data are complex and laden with values. This paper sets out an Ethics Framework for Big Data in Health and Research developed by a working group convened by the Science, Health and Policy-relevant Ethics in Singapore (SHAPES) Initiative. It presents the aim and rationale for this framework supported by the underlying ethical concerns that relate to all health and research contexts. It also describes a set of substantive and procedural values that can be weighed up in addressing these concerns, and a step-by-step process for identifying, considering, and resolving the ethical issues arising from big data uses in health and research. This Framework is subsequently applied in the papers published in this Special Issue. These papers each address one of six domains where big data is currently employed: openness in big data and data repositories, precision medicine and big data, real-world data to generate evidence about healthcare interventions, AI-assisted decision-making in healthcare, public-private partnerships in healthcare and research, and cross-sectoral big data.

Keywords: Artificial intelligence; Cross-sectorial data; Data repositories; Ethics framework; Health and research; Open sharing; Precision medicine; Public-private partnership; Real-world evidence.

© The Author(s) 2019.



Research Article

Considerations for ethics review of big data health research: A scoping review

Contributed equally to this work with: Marcello Ienca, Agata Ferretti

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

Affiliation Health Ethics and Policy Laboratory, Department of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland

Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing

Affiliation Institute for Ethics, History and the Humanities, Faculty of Medicine, University of Geneva, Geneva, Switzerland

Roles Conceptualization, Funding acquisition, Methodology, Writing – review & editing

Affiliation Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland

Affiliation Division of Medical Information Sciences, Department of Radiology and Medical Informatics, University Hospital of Geneva, Geneva, Switzerland


  • Marcello Ienca, 
  • Agata Ferretti, 
  • Samia Hurst, 
  • Milo Puhan, 
  • Christian Lovis, 
  • Effy Vayena


  • Published: October 11, 2018
  • https://doi.org/10.1371/journal.pone.0204937

Big data trends in biomedical and health research enable large-scale and multi-dimensional aggregation and analysis of heterogeneous data sources, which could ultimately result in preventive, diagnostic and therapeutic benefit. The methodological novelty and computational complexity of big data health research raise novel challenges for ethics review. In this study, we conducted a scoping review of the literature using five databases to identify and map the major challenges of health-related big data for Ethics Review Committees (ERCs) or analogous institutional review boards. A total of 1093 publications were initially identified, 263 of which were included in the final synthesis after abstract and full-text screening performed independently by two researchers. Both a descriptive numerical summary and a thematic analysis were performed on the full texts of all articles included in the synthesis. Our findings suggest that while big data trends in biomedicine hold the potential for advancing clinical research, improving prevention and optimizing healthcare delivery, several epistemic, scientific and normative challenges need careful consideration. These challenges have relevance for both the composition of ERCs and the evaluation criteria that should be employed by ERC members when assessing the methodological and ethical viability of health-related big data studies. Based on this analysis, we provide some preliminary recommendations on how ERCs could adaptively respond to those challenges. This exploration is designed to synthesize useful information for researchers, ERCs and relevant institutional bodies involved in the conduct and/or assessment of health-related big data research.

Citation: Ienca M, Ferretti A, Hurst S, Puhan M, Lovis C, Vayena E (2018) Considerations for ethics review of big data health research: A scoping review. PLoS ONE 13(10): e0204937. https://doi.org/10.1371/journal.pone.0204937

Editor: Godfrey Biemba, Paediatric Centre of Excellence, ZAMBIA

Received: May 4, 2018; Accepted: September 17, 2018; Published: October 11, 2018

Copyright: © 2018 Ienca et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: This work was supported by Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung, Award Number: 407540_167223 (NRP 75 Big Data). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.


The generation of digital data has drastically increased in recent years due to the ubiquitous deployment of digital technology as well as advanced computational analytics techniques [ 1 , 2 ]. The term big data is still vaguely defined. In general terms, big data involves large sets of data with varying degrees of analysable structure, coming from heterogeneous sources (online data, social media profiles, financial records, self-tracked parameters, etc.), produced with high frequency and which can be further processed and analysed using computational techniques. While the term big data has become nearly ubiquitous, there is controversy over what data volumes are sufficiently large to obtain the big data label. Dumbill, for example, suggested that data should be considered big when they exceed the processing capacity of conventional database systems [ 3 ].

Big data trends characterize various sectors including basic science [ 1 , 4 ], business [ 5 ], government [ 6 ], national security [ 7 ] and transportation [ 8 ]. Big data trends have also increasingly pervaded the healthcare domain, as new health-related data sources have grown in volume and variety and become available for large-scale aggregation and high-speed analysis [ 9 ]. These include Electronic Health Records (EHRs), data from mobile health (mHealth) applications, medical blogs and web-networks [ 10 , 11 ], healthcare robotics [ 12 ], medical internet of things [ 13 ], as well as direct-to-consumer genetic [ 14 ] and screening tests [ 15 ]. Additionally, health-related information can be derived not only from digital health applications, but also from non-strictly-medical data sources [ 16 ] such as online personal dietary programs, fitness club memberships and Twitter hashtags [ 17 ]. Health-related big data is the umbrella term used to describe extremely large and heterogeneous data sets that may be analysed computationally to reveal patterns, trends, and correlations that have relevance for human health [ 18 ].

The availability of health-related big data holds the promise of exerting a positive impact on biomedical research. For example, tailoring diagnostics to automated analyses of high resolution images has become a standard procedure in cancer research [ 19 ]. In parallel, mapping and collecting large-scale data volumes enables the creation of epidemiological models that can inform about an epidemic’s space-time propagation. Finally, novel and patient-tailored therapeutic opportunities might emerge from the possibility of continuously monitoring patient health, tracking pathologic characteristics at specific points in time, and aggregating heterogeneous data sources [ 20 ]. These benefits might occur both in public health and at the individual level. Bates [ 21 ] argued that the use of big data has a valuable impact on public health, since it might help identify and promptly intervene on high-risk and high-cost patients.

While opening the prospect of clinical benefit, the use of health-related big data raises important challenges. In light of their methodological novelty, potentially far-reaching impacts, and computational complexity, big data approaches to human health have been claimed to raise ethical, legal and social implications [ 22 ]. Ethical and legal challenges include the risk of compromising privacy, personal autonomy, and the solidarity-based approach to healthcare funding, as well as effects on public demand for transparency, trust, and fairness while using big data [ 23 ]. Furthermore, authors have listed data heterogeneity, data protection, analytical flaws in analysing data, and the lack of appropriate infrastructures for data storage as critical technical and infrastructural issues that might endanger big-data-driven healthcare [ 24 ]. While some of these challenges have received scientific and institutional attention, others have remained largely unexplored. In 2016, a review identified a number of areas of concern associated with health-related big data that had not received adequate attention among researchers [ 22 ]. These included group-level ethical harms, the intimate link between epistemological and ethical issues, the distinction between harms to data subjects resulting from, respectively, academic and commercial uses of big data, the problematic fiduciary relationship between data custodians and data subjects, the role of data ownership and intellectual property as a mechanism for data control, and, finally, the provision of data access rights to data subjects.

The ethical, legal and social implications of health-related big data also raise novel challenges for Ethics Review Committees (ERCs). ERCs and institutional review boards are increasingly requested to evaluate an ever-growing number of research projects and associated activities involving big data (large data volumes and big data analytics), whose risks and benefits often appear hard to assess. Some authors have called for the development of comprehensive regulatory policies for healthcare entities and new computing safeguards that can address public concerns, such as the protection of individually identifiable information [ 25 ]. However, in the absence of specific guidelines and comprehensive evaluation studies, ERCs might be facing uncertainty on how to review health-related big data projects and according to which evaluative criteria. In fact, researchers have observed that traditional conceptual tools and/or legal requirements for ethics review in clinical research, like informed consent, minimal risk and fair subject selection, might be of limited help, if not ill suited, for the evaluation of big data projects [ 26 , 27 ]. This is because these tools were conceived in the context of conventional clinical research (e.g. clinical trials), not in connection with the evolving applications and innovative research designs of big data research [ 27 ]. For example, informed consent is often not practical to obtain for studies involving retrospective access to data from millions of individuals.

The nature of big data studies also challenges the current mandate and purview of ERCs. For example, studies involving publicly available and anonymized data have traditionally been perceived to be outside of the purview of ERCs. This would include data from Twitter (which are public by default), Facebook or other online platforms. Furthermore, ethical safeguards for human subjects research “are often written with definitions that exclude Internet research” [ 28 ]. This is problematic for two reasons. First, research has shown that big data analytics can reveal sensitive information from seemingly innocuous public data points, including information that the original data generators might reasonably wish to keep private. For example, a recent study has successfully used deep neural networks to predict the sexual orientation of users based on facial images from public profiles posted on a dating website [ 29 ]. Second, several studies have shown that de-identified [ 30 ] and even anonymized data [ 31 ] can be reverse engineered to re-identify individuals, leading experts to the conclusion that “there is no such thing as anonymous data”. This raises the question of whether big data projects should require oversight by an ERC even when the data collected are public and anonymized or de-identified. A recent systematic review has concluded that most normative documents deem the review of an ERC as necessary to address the concerns associated with the use of anonymized data for research [ 32 ]. In contrast, when ERCs waived the review of big data studies involving publicly available and anonymized data repositories because they considered them outside their purview, such as in the case of Facebook’s “emotional contagion” study [ 33 ], experts criticized this narrow interpretation of the ERC’s mandate [ 34 ].

In the present study, we aim to identify the promises and challenges of health-related big data research that have relevance for ERCs. Furthermore, we use these findings to suggest how ERCs could adaptively respond to this methodological transformation. This exploration is designed to synthesize useful information for researchers, ERCs and relevant institutional bodies involved in the conduct and/or assessment of health-related big data research.

On September 18, 2017, we conducted a scoping review of the scientific literature and searched five databases (EMBASE, Web of Science, PubMed, IEEE Xplore, and Scopus) to retrieve eligible publications. We searched title, abstract, and keywords for the terms: ("big data" OR "Artificial Intelligence" OR "data science" OR "digital data") AND ("medical" OR "healthcare" OR "clinical" OR "personalised medicine") AND ("policy" OR "ethics" OR "governance" OR "ethics committee" OR "IRB" OR "review board" OR "assessment"). Query logic was modified to adapt to the language used by each engine or database. The search identified 1093 entries. All entries were imported into the EndNote literature manager software. Three phases of filtering were performed independently by two researchers to minimize subjective bias.
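The Boolean structure of the search string can be made explicit with a short sketch. The term lists below are copied from the query reported above; the helper names `or_group` and `build_query` are illustrative assumptions and not part of the published search strategy, and the exact field tags and syntax would still need adapting to each database.

```python
# Term groups copied from the search string reported in the text.
TOPIC_TERMS = ["big data", "Artificial Intelligence", "data science", "digital data"]
DOMAIN_TERMS = ["medical", "healthcare", "clinical", "personalised medicine"]
GOVERNANCE_TERMS = [
    "policy", "ethics", "governance", "ethics committee",
    "IRB", "review board", "assessment",
]


def or_group(terms):
    """Quote each term and join the terms into one parenthesised OR group."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Combine several OR groups with AND (hypothetical helper)."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(TOPIC_TERMS, DOMAIN_TERMS, GOVERNANCE_TERMS)
print(query)
```

Keeping the three concept groups as separate lists makes it easier to see that a record must match at least one term from each group, and to adjust a single group when a database uses different vocabulary.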

The scoping review is a review method aimed at synthesizing research evidence and mapping the existing literature in a certain field of interest [ 35 ]. Unlike a systematic review, scoping review methods are considered of particular use when the topic has not yet been extensively reviewed or is of a complex or heterogeneous nature [ 35 , 36 ]. Following the recommendations by Pham et al. [ 36 ], the study selection process was conducted and presented using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses ( http://prisma-statement.org/ ) as a guide (see Fig 1 ).




First, duplicates were removed both automatically using the EndNote tool for duplicate detection and manually based on abstract screening. A total of 226 articles were removed at this stage.

Second, eligibility assessment was performed independently by two of the co-authors on the remaining 867 articles through title-abstract screening and, subsequently, full-text screening. Diverging inclusion choices between the two reviewers were discussed with the research group, and the reasons were documented. Studies included in the synthesis had the following features: (i) original articles, book chapters or conference proceedings; (ii) written in English, Italian, French or German (languages spoken by the researchers); (iii) published before September 18, 2017; and (iv) focused on the assessment of big data trends in the biomedical/healthcare context. Reviews, letters to the editors, business reports and dissertations were not included. A total of 263 studies were included in the final synthesis and imported manually into Microsoft Excel 15.40 format based on a shared data-charting form. Following the recommendations to enhance scoping study methodology delineated by Levac et al. [ 37 ], the data-charting form was collectively developed by our research team to determine which variables to extract from the review data.
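The screening flow reported so far can be checked with simple arithmetic. The sketch below uses only the counts stated in the text (1093 identified, 226 duplicates removed, 867 screened, 263 included); the number excluded during eligibility assessment is not reported and is derived here as an assumption that no records left the flow for other reasons.

```python
# Counts reported in the text.
identified = 1093          # records retrieved from the five databases
duplicates_removed = 226   # removed automatically and manually
included = 263             # studies in the final synthesis

# Records remaining for title-abstract and full-text screening.
screened = identified - duplicates_removed

# Derived, not reported: records excluded during eligibility assessment.
excluded_at_screening = screened - included

print(f"screened: {screened}")
print(f"excluded during eligibility assessment (derived): {excluded_at_screening}")
print(f"share of identified records included: {included / identified:.1%}")
```

The derived value for `screened` should match the 867 articles mentioned in the eligibility step, which gives a quick consistency check on the reported flow.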

Third, based on the same recommendations, we performed both a descriptive numerical summary and a thematic analysis. In the former analysis, both relative and cumulative frequencies were extracted and graphically represented using bar charts. Following Arksey and O'Malley [ 36 ], our descriptive numerical summary also included the total number of articles included, types of study design (empirical vs. non-empirical), years of publication etc. In the latter analysis, recurrent thematic patterns were identified through full-text screening and subsequent coding. The coding phase was independently performed by two researchers. Once conceptually stable thematic patterns emerged from the codes, these were grouped together into a system of themes and subthemes. All entries were checked anew through an automated text search for the presence of the emerging themes. Following Braun and Clarke [ 38 ], codes that did not seem to fit into any main theme were temporarily housed in a “miscellaneous” group and subsequently either clustered into a new theme or reallocated to an existing thematic group after consultation. Internal consultation was performed among all members of our research team to integrate and validate our findings.

Our results reveal a large, diverse and rapidly growing body of literature on the impact of big data in the biomedical domain. Data show that the overall number of articles published in the time period 2012–2017 is 131 times higher compared to the period 2001–2005 as represented in Fig 2 .


N.B. The search was performed on September 18, 2017. Therefore, the full number of articles for year 2017 was calculated by projecting the data until September 18.
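The text does not specify how the 2017 figure was projected. One simple possibility, shown here purely as an assumption, is a linear extrapolation that scales the count observed up to the search date to a full calendar year. The function name `project_full_year` and the example count are hypothetical.

```python
from datetime import date


def project_full_year(count_to_cutoff, cutoff):
    """Linearly scale a partial-year count to a full calendar year.

    Assumes publications are spread evenly over the year, which is a
    simplification and not necessarily the method used in the study.
    """
    days_elapsed = cutoff.timetuple().tm_yday  # day of year, 1-based
    days_in_year = (date(cutoff.year + 1, 1, 1) - date(cutoff.year, 1, 1)).days
    return count_to_cutoff * days_in_year / days_elapsed


SEARCH_DATE = date(2017, 9, 18)  # search date reported in the text
observed_2017 = 90               # hypothetical count, for illustration only

projected_2017 = project_full_year(observed_2017, SEARCH_DATE)
print(f"projected full-year count: {projected_2017:.0f}")
```

A linear projection ignores indexing delays and seasonal publication patterns, so any projected value should be read as a rough estimate.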


Data breakdown by medical speciality and field of medical application indicates that big data approaches have been discussed and evaluated in relation to several branches of medicine including neurology and psychiatry (n = 31), oncology (n = 17), cardiology (n = 8), medical genetics (n = 8), immunology and infectious diseases (n = 8), as well as nuclear medicine and radiology (n = 6). Cross-field evaluations of health-related big data appeared highly prevalent (n = 155).

Thematic analysis identified a number of potential opportunities and challenges associated with health-related big data approaches, many of which have relevance for ethics review. Opportunities could be grouped into four main themes: biomedical research, prevention, healthcare delivery and healthcare management. Potential benefits in the research domain include the possibility of collecting real-world data, accelerating the development of new medical technology and facilitating translational research. Big data was also associated with the improvement of preventive measures at both the individual and population level. In relation to care delivery, the following benefits were envisioned: precision and personalized medicine, earlier and more accurate diagnostics, enhanced clinical decision-making, ubiquitous health monitoring, improved patient safety and better therapy. Subsequent numeric analysis of thematic clusters is presented in Table 1 .



Envisioned challenges were of seven major types: technical (n = 125), ethical (n = 81), methodological (n = 66), regulatory (n = 39), social (n = 16), infrastructural (n = 11) and financial (n = 10). Technical challenges relate to issues inherent in the data ecosystem. These include data security, data quality, data storage, data linkage, and tools for data reuse. Methodological challenges relate to the system of methods used in the study and include issues of standardizing data and metadata, integrating and processing data, monitoring resource utilisation and compensating for incomplete data. Regulatory challenges relate to rules or directives such as those regulating data ownership and the accountability of actors in relation to the potential risks associated with using and managing data. Social challenges are those that have relevance for human society and its members. These include, among others, secondary uses of data in relation to participants' consent, sociocultural and ethnic bias and subsequent risk of discrimination, and power asymmetries between data subjects and data controllers. Finally, financial and infrastructural issues included the financial viability of data storage sites and the level of preparedness of existing infrastructures, respectively.
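The counts above can be turned into relative frequencies with a few lines of code. This is a minimal sketch that assumes each count is the number of included studies (out of 263) mentioning that challenge type; the text does not state the denominator explicitly, so the percentages are illustrative.

```python
TOTAL_STUDIES = 263  # studies in the final synthesis (assumed denominator)

# Counts per challenge type as reported in the text.
challenge_counts = {
    "technical": 125,
    "ethical": 81,
    "methodological": 66,
    "regulatory": 39,
    "social": 16,
    "infrastructural": 11,
    "financial": 10,
}

# Share of studies mentioning each challenge type, in percent.
shares = {theme: 100 * n / TOTAL_STUDIES for theme, n in challenge_counts.items()}

for theme, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{theme:<16}{challenge_counts[theme]:>4}  {pct:5.1f}%")

# The counts sum to more than the number of studies, which is consistent
# with a single study being coded under several challenge types.
total_mentions = sum(challenge_counts.values())
print(f"total mentions: {total_mentions}")
```

Because the categories overlap, the shares are not expected to add up to 100%.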

Ethical challenges are those related to moral principles. Our analysis revealed privacy and confidentiality to be by far the dominant concern (n = 146) in the ethical domain, followed by informed consent (n = 49), fairness and justice (n = 34), trust (n = 23), data ownership (n = 18) and others. Fig 3 presents a full overview of ethical considerations associated with health-related big data studies with associated relative frequencies.



While the analysis revealed a number of implications with relevance for ethics review, only 13% of reviewed studies provided specific normative recommendations for ERCs or other institutional review boards. Data breakdown by study methodology revealed that only a small portion of those recommendations (n = 5; 14%) was informed by empirical methods.

A subsequent analysis of thematic co-occurrences shows a strong mutual relationship between different thematic families, especially between technical and ethical issues, as shown in Fig 4 . In particular, technical issues such as data security and data linkage were often presented in coordination with ethical issues such as personal privacy.
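A co-occurrence analysis of this kind can be sketched as a pair count over coded articles. The example below is an illustration under stated assumptions: the coded data are toy values invented for the sketch, not the study dataset, and the function `cooccurrence` is a hypothetical helper rather than the authors' procedure.

```python
from collections import Counter
from itertools import combinations

# Toy data: each set holds the thematic families coded for one article.
coded_articles = [
    {"technical", "ethical"},
    {"technical", "ethical", "regulatory"},
    {"methodological"},
    {"technical", "methodological"},
]


def cooccurrence(articles):
    """Count how many articles mention each unordered pair of themes."""
    pairs = Counter()
    for themes in articles:
        # Sorting gives every pair one canonical (alphabetical) order.
        for first, second in combinations(sorted(themes), 2):
            pairs[(first, second)] += 1
    return pairs


pairs = cooccurrence(coded_articles)
for (first, second), n in pairs.most_common():
    print(f"{first} + {second}: {n}")
```

Pairs with high counts correspond to thematic families that tend to be discussed together, such as technical and ethical issues in the toy data.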




This study presents four main limitations. First, a selection bias might be present since the search retrieved only articles written in languages known by the researchers (English, French, German and Italian), excluding articles written in other languages. A similar limitation affects database selection, as searching other databases might have identified additional relevant studies. While this risk of selection bias applies to any review since the number of databases that can be feasibly searched is always finite, we attempted to minimize selection bias by exploring both domain-general and domain-specific databases, including the major databases in biomedical research and computer science, which represent the primary interdisciplinary intersection when it comes to biomedical big data. Second, as has often been observed in relation to scoping reviews, the explorative nature and broad focus of our search methodology make it ‘unrealistic to retrieve and screen all the relevant literature’ [ 39 ]. However, one advantage of the scoping methodology is the opportunity to also explore the grey literature and secondary sources (e.g. bibliographies of retrieved papers), which is likely to increase comprehensiveness. Third, the breadth of the research focus might have inevitably affected the depth of the analysis. This is because the outcomes of a scoping review, compared to systematic review methods, are “more narrative in nature” [ 40 ] and usually not presented through descriptive statistical analysis. Finally, our review included very heterogeneous studies and did not assess study quality. This is because our main goal was to explore the entire range of challenges that have relevance for ERCs, regardless of how those challenges were originally addressed and discussed.
While these four limitations might prevent generalization, we believe that the scoping methodology was best suited to reflect the explorative nature and broad focus of our research question. In fact, it has often been noted that scoping reviews are not intended to be exhaustive [ 41 , 42 ] or to provide detailed statistical analyses [ 40 ] but to map a heterogeneous body of literature related to a broad and novel topic [ 35 ]. As scoping reviews are usually considered a “richly informed starting point for further investigations” [ 40 ], future studies should consider this work as a preliminary step to a systematic review and associated statistical data analysis. Furthermore, they could use this general mapping of the health-related big data topic to generate empirically testable research hypotheses.

The drastic increase over the past 5 years in the number of studies discussing the implications of health-related big data confirms the research community’s increasing attention to the applicability of big data approaches to the healthcare domain. As the application of big data in healthcare [ 43 ] and the market size forecasts for big data hardware, software and professional services investments in the healthcare and pharmaceutical industry are growing steadily [ 44 ], there will be a parallel need to assess the impact of this expanding sociotechnical trend. This expansion can be seen as a sign of what has been called the “inevitable application of big data to healthcare” [ 10 ] induced by the widespread uptake of electronic health records (EHRs), and the large-scale storing and sharing of genomic, proteomic, imaging and many other biomedical data.

The high prevalence of cross-field evaluations of health-related big data is an indicator of the potential of big data approaches to aggregate data from multiple medical data sources (e.g. combining data about gene expression and brain function in neurogenic studies) and multiple levels of clinical intervention (e.g. linking prevention and diagnostics to therapy and care delivery). In addition, analyses show that clinical outcomes can be produced from novel and non-strictly medical data sources. These include using Twitter to track and even forecast disease activity [ 45 ], exploiting Facebook data for suicide prevention [ 46 ], or using seasonal pollen forecasts to predict asthma [ 47 , 48 ]. In the long term, this meta-specialty nature of big data approaches is likely to blur traditional separations between different medical specialties and levels of clinical intervention, opening more interfaces for inter-specialty exchange in the healthcare and biomedical research domains. This will raise the challenge for ERCs to review big data projects without relying on traditional discrete taxonomies of medical specialization and/or models of clinical application. In parallel, our findings illustrate the potential applicability of big data approaches to an increased variety of medical specialties. While branches of medicine like oncology [ 49 , 50 ], radiology [ 51 ] and clinical genetics [ 52 ] were already known to be particularly suitable for big data approaches, our review revealed a promising outlook associated with using big data in several other medical domains including neurology [ 53 , 54 ], psychiatry [ 55 ], immunology [ 56 ], nephrology [ 57 ], and geriatrics [ 58 ].

The high frequency of technical challenges addressed when assessing health-related big data highlights the persistence of a number of technical weaknesses and limitations, most of which are likely dependent on the historical novelty of this sociotechnical trend. These include problems of data quality, integrity, and security. Developing robust technical solutions that can guarantee the quality, integrity and security of the data, and allow their secure transmission, linkage and storage, was often presented as a priority for any successful deployment of big data for human health. This might require the development of better security-protecting infrastructures, data wrangling and scripting (e.g. batch processing) tools for data cleansing in order to guarantee the quality of data (for example, through automatic detection and removal of corrupt or inaccurate records), as well as techniques that can preserve the integrity of data through the entire data cycle, prevent corruption and enable interoperability. Furthermore, distributed ledger technology, distributed storage and incremental analytics are also believed to hold promise in the health domain [ 59 , 60 ]. From the perspective of ERCs, this implies a more rigorous yet systemic oversight [ 61 ] of technical considerations to guarantee that the aforementioned safeguards are implemented by the researchers.

The relative frequency of methodological issues, however, highlights that fixing technical problems alone might not be sufficient to use big data for good. ERCs are usually required to evaluate the methodological soundness of a study if this has ethical consequences. For example, if an RCT is designed without giving participants an equal chance of being assigned to any group, ERCs are entitled to assess the methodological soundness of the study to preserve the principle of fairness. For the same reason, in the context of big data research, ERCs might be entitled to assess the soundness of studies whose methods may result in algorithmic discrimination or breaches of personal privacy. For example, they may examine whether the researchers have implemented all necessary safeguards to prevent algorithmic bias and comply with data security standards.

Examining the methodological soundness of health-related big data studies will likely require the adoption of different assessment criteria compared to traditional biomedical research. For example, it may require a rethinking of what counts as “public” data and what counts as “harm” in data-driven research. In addition, big data research is usually not based on the formulation and testing of specific research hypotheses, but on the identification of patterns from large volumes of data. This hypothesis-free nature of (some) big data research makes it harder to apply conventional epistemological mechanisms for scientific demarcation and quality control like falsifiability and refutability [ 62 ]. This poses for ERCs the problem of clearly demarcating the explanatory power of big-data-driven research. Researchers have questioned whether big data analytics can speak for themselves [ 63 ] independent of explanatory hypotheses and refuted the idea that they can be used for biomedical purposes in the absence of robust and causally explanatory scientific models or theories [ 64 , 65 ].

Ethical challenges also constitute an important area of consideration for ERCs. Data breakdown by class of ethical consideration reveals that the current ethical debate is being largely monopolized by issues of privacy and data protection ( Fig 3 ). It has already been pointed out that the ethics of big data should not be reduced to a privacy challenge but encompasses a number of positive ethical goals [ 66 ]. Several ethical issues for which Mittelstadt and Floridi [ 22 ] demanded increased ethical attention still appear largely underexplored. For example, our analysis reveals that issues of data ownership, group-level ethical harms, and the distinction between academic and commercial uses of big data do not appear as ethical priorities. Furthermore, we observed that issues of fairness and the risk of discrimination compose a relatively small portion of the current ethical spectrum even though the misuse of big data has demonstrably resulted in various forms of ethnic, gender and class discrimination [ 67 ]. While group-level harms are usually considered outside the purview of ERCs, the dangers of ignoring this type of risk require careful assessment [ 68 ]. Issues of trust, transparency, accountability and dignity compose an even smaller fraction of the current ethical landscape. We suggest that the ethical review of health-related big data research should explore a broader spectrum of ethical issues. In particular, it should scrutinize more carefully (i) whether and how each project attempts to address the social benefits, if any, of research; (ii) how data subjects involved in the study can exercise control over their data (data control problem); (iii) which measures of accountability are being employed by the researchers; and (iv) whether the collected data can be reused for secondary, including malevolent, purposes (dual use problem) and what measures are implemented to prevent that.

These technical, methodological and ethical challenges should not be seen as sealed rooms. Thematic analysis reveals an intimate interconnection between the three thematic families. For example, the technical problem of data security appears strictly connected to the ethical notion of privacy and the regulatory principle of data protection. Similarly, methodological errors like dataset bias might have detrimental ethical consequences such as racial and gender discrimination. This intimate link between technical and ethical issues highlights the importance of cooperative approaches to study design in big data research through strategies like ethical design of data-collecting technologies, proactive ethical assessment of big data studies and ethical requirement analyses for data-sharing platforms, data storage sites and other digital infrastructures. ERCs should be sensitized to this interconnection and examine how weaknesses in one domain affect other domains of evaluation. Similarly, the interdependence of epistemological and ethical issues, which was already highlighted by Mittelstadt and Floridi [ 22 ], requires careful consideration by ERCs to prevent inaccurate study designs or data curation practices from resulting in unintended harms to individuals or groups.

Overall, these findings have three main and direct implications for ERCs. First, the significance and complexity of technical and methodological challenges suggest that members of ERCs need to acquire stronger technical and methodological expertise to adequately review and evaluate health-related big data studies. This might require specific educational courses or other training activities aimed at strengthening ERC members’ ability to identify technical/methodological problems or inaccuracies, especially those that can result in harms to data subjects or society like data security breaches, database corruption and biased algorithm training. Specialized training modules in data science, bioinformatics and cybersecurity might serve this purpose. In parallel, as emerging from the normative suggestions, ERCs need to consider including experts from the aforementioned disciplines within the review board. Since health-related big data is here to stay, new expert profiles are needed during the review process. Data scientists, security experts and bioinformaticians should complement the expertise of clinicians, ethicists and other traditional ERC members. ERC members will need to be equipped with the necessary tools to inspect how the data will be collected, in conformity with which security standards they will be stored and shared, what classification systems will be employed, how uncertainty will be quantified, what cluster models will be adopted during exploratory data mining etc.

In spite of these important challenges, ERCs might still be faced with uncertainty when reviewing health-related big data studies. Review results indicate that only a small fraction of studies (13%) provided specific normative recommendations for ERCs. These are suggestions or proposals for ERCs as to the best course of action. Further thematic analysis reveals a lack of consensus on what codes of conduct should be prioritized, with some authors [ 25 ] favouring the simplification of the ethics review process and others [ 69 ] requiring more stringent scrutiny. Nonetheless, five recurring themes could be identified: (i) preventing the dangers of downstream data linkage and inadvertent individual identification; (ii) expanding the purview and involvement of ERCs; (iii) developing a clearer understanding of the risks and benefits of health-related big data research; (iv) harmonizing ethical standards for big data research; and (v) rethinking the composition of ERCs. The extremely small fraction of studies providing normative recommendations informed by empirical research (i.e. based on studies involving direct observation or experience such as survey questionnaires or focus groups) further underscores how these recommendations are mostly based on individual viewpoints rather than on solid consensus within the research community.

In the debate on what ERCs should do in relation to health-related big data, the opinion of ERC members is missing. Future empirical research is much needed to explore the needs, views and attitudes of ERC members about health-related big data. Empirical research in this domain could methodologically build upon previous studies involving ethics advisors working in big-data-related areas of research such as genomics governance [ 70 ]. Combining empirical and normative ethical research in the health-related big data domain would not only benefit the understanding of the current problems that ERCs are facing when reviewing health-related big data studies, but also favour the development of empirically informed research ethics guidelines [ 71 ], hence resulting in better ethical oversight and governance of the health-related big data phenomenon.

Finally, it is legitimate to raise the question of whether ERCs should be the only governance body responsible for the evaluation of biomedical big data research. Given their traditional mandate, which is deeply rooted in the pre-digital era of biomedical research, it might be reasonably argued that ERCs are ill-suited to exercise exclusive ethical oversight on health-related big data research. Research regulators should consider whether complementary governance mechanisms such as data boards, data security committees or allied bodies are necessary to expand the bandwidth and sensitivity of ethical oversight.

Supporting information

S1 File. Search strategy.


S2 File. Dataset.


S3 File. Compressed article repository.



This manuscript was supported by the Swiss National Science Foundation under award 407540_167223 (NRP 75 Big Data).

  • 3. Dumbill E. Making sense of big data. Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA; 2013.
  • 5. Minelli M, Chambers M, Dhiraj A. Big data, big analytics: emerging business intelligence and analytic trends for today's businesses: John Wiley & Sons; 2012.
  • 16. Vayena E, Gasser U. Strictly biomedical? Sketching the ethics of the big data ecosystem in biomedicine. The ethics of biomedical big data: Springer; 2016. p. 17–39.
  • 25. Balas EA, Vernon M, Magrabi F, Gordon LT, Sexton J, editors. Big Data Clinical Research: Validity, Ethics, and Regulation. MedInfo; 2015.
  • 26. Foster Riley M. Big data, HIPAA and the common rule: time for big change?: Cambridge University Press; 2018.
  • 57. Megherbi DB, Soper B, editors. Analysis of how the choice of machine learning algorithms affects the prediction of a clinical outcome prior to minimally invasive treatments for benign prostatic hyperplasia BPH. CIMSA 2012–2012 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications, Proceedings; 2012.
  • 58. Anderson SL, Anderson M, editors. A Prima Facie duty approach to machine ethics and its application to elder care. AAAI Workshop—Technical Report; 2011.
  • 59. Azaria A, Ekblaw A, Vieira T, Lippman A, editors. Medrec: Using blockchain for medical data access and permission management. Open and Big Data (OBD), International Conference on; 2016: IEEE.
  • 62. Popper K. The logic of scientific discovery: Routledge; 2005.


Scientific Research and Big Data

Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science, which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement, emerging from policy trends such as the push for Open Government and Open Science, has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.

This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:

  • how statistics, formal and computational models help to extrapolate patterns from data, and with which consequences;
  • the role of critical scrutiny (human intelligence) in machine learning, and its relation to the intelligibility of research processes;
  • the nature of data as research components;
  • the relation between data and evidence, and the role of data as source of empirical insight;
  • the view of knowledge as theory-centric;
  • understandings of the relation between prediction and causality;
  • the separation of fact and value; and
  • the risks and ethics of data science.

These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry doesn’t cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature can be found when conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.

  • 1. What Are Big Data?
  • 2. Extrapolating Data Patterns: The Role of Statistics and Software
  • 3. Human and Artificial Intelligence
  • 4. The Nature of (Big) Data
  • 5. Big Data and Evidence
  • 6. Big Data, Knowledge and Inquiry
  • 7. Big Data Between Causation and Prediction
  • 8. The Fact/Value Distinction
  • 9. Big Data Risks and the Ethics of Data Science
  • 10. Conclusion: Big Data and Good Science
  • Other Internet Resources
  • Related Entries

1. What Are Big Data?

We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of different types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damaging other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.

A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with Big Data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.

Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.

An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example, boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim to be able to consult them as a single body of evidence. The examples of transformative “big data research” given above all fit easily into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:

  • Variety in the formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests;
  • Veracity, understood as the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias. In the absence of appropriate validation and quality checks, this could result in a misleading or outright incorrect evidence base for knowledge claims (Floridi & Illari 2014; Cai & Zhu 2015; Leonelli 2017);
  • Validity, which indicates the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context (e.g., Loettgers 2009, Bogen 2010);
  • Volatility, i.e., the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term (Bowker 2006; Edwards 2010; Lagoze 2014; Borgman 2015);
  • Value, i.e., the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances (Leonelli 2016, D’Ignazio and Klein 2020). Alongside scientific value, researchers may impute financial, ethical, reputational and even affective value to data. The institutions involved in governing and funding research also have ways of valuing data, which may not always overlap with the priorities of researchers (Tempini 2017).

This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).

This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Aronova et al. 2017; Porter & Chadarevian 2018; as well as Aronova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research, and particularly subfields such as epidemiology, pharmacology and public health, has an extensive tradition of tackling data of high volume, velocity, variety and volatility, whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurers and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).

This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).

New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the General Data Protection Regulation adopted by the European Union in 2016 and applicable since 2018. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis. [ 1 ] They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.

Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions, a situation described by advocates as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that

the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)

such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.

The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,

the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)

Suppes viewed data models as necessarily statistical: that is, as objects

designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)

His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:

\(Z\) is an \(N\)-fold model of the data for experiment \(\mathcal{Y}\) if and only if there is a set \(Y\) and a probability measure \(P\) on subsets of \(Y\) such that \(\mathcal{Y} = \langle Y, P\rangle\) is a model of the theory of the experiment, \(Z\) is an \(N\)-tuple of elements of \(Y\), and \(Z\) satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)

This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.

The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:

What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)

and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.

When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated with artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
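The gap between performance on training data and performance on other data can be made concrete with a small sketch. The example below is illustrative only and is not drawn from the literature cited here: the synthetic dataset, the nearest-neighbour "memoriser" and all function names are assumptions chosen for brevity. The memoriser reproduces its training data perfectly, yet k-fold cross-validation shows that it does much worse on points it has not seen.

```python
import random

def make_data(n, seed=0):
    """Synthetic data (an assumption): y = 2x plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]

def fit_nearest_neighbour(train):
    """Return a predictor that memorises the training points (1-NN)."""
    def predict(x):
        # Answer with the y-value of the closest training x.
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def mean_squared_error(predict, points):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

def k_fold_error(data, fit, k=5):
    """Average held-out error over k disjoint folds (cross-validation)."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mean_squared_error(fit(train), held_out))
    return sum(errors) / k

data = make_data(200)
train_error = mean_squared_error(fit_nearest_neighbour(data), data)
cv_error = k_fold_error(data, fit_nearest_neighbour)
print("error on the data used for fitting:", train_error)
print("cross-validated error:", cv_error)
```

The training error is zero because each point is its own nearest neighbour, which is exactly the kind of pattern that does not transfer. Cross-validation exposes this only because the held-out fold is kept strictly apart from the data used for fitting.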

Handling these issues, in turn, requires

familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)

For instance, machine learning

aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)

In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric, involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as the problems generated by attempts to map real-world quantities to discrete-state machines, or approximating numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).
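A familiar instance of the errors that arise when real-world quantities are mapped onto discrete-state machines is binary floating-point arithmetic. The short sketch below is an illustration chosen here, not an example taken from Symons and Horner; it uses only the Python standard library.

```python
import math

# 0.1 has no exact binary representation, so repeated addition drifts.
naive_total = sum([0.1] * 10)
print(naive_total == 1.0)               # False

# Addition is no longer associative: grouping changes the result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
print(left_grouped == right_grouped)    # False

# Two common mitigations: compare with a tolerance, or use a
# summation routine that tracks the lost low-order bits.
print(math.isclose(naive_total, 1.0))   # True
print(math.fsum([0.1] * 10) == 1.0)     # True
```

None of these discrepancies reflects anything about the quantities being measured. They are artefacts of the representation, which is why they count as errors specific to software.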

Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.

In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:

very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)

They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset needed for those dimensions to be accurately generalised. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
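A rough statistical analogue of this worry can be simulated. The sketch below is not Calude and Longo's argument, which rests on mathematical results about large structures; it only illustrates, under assumed parameters (20 observations per variable, a threshold of 0.5, hypothetical function names), how the number of strong pairwise correlations among entirely unrelated random variables grows as more variables are added.

```python
import random
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def count_strong_pairs(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count variable pairs with |r| above threshold in pure noise."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
               for _ in range(n_vars)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)

# With a fixed seed, each larger dataset contains the smaller one,
# so the count can only stay the same or grow.
counts = {n_vars: count_strong_pairs(n_vars) for n_vars in (10, 50, 200)}
for n_vars, strong in counts.items():
    print(n_vars, "variables:", strong, "strong correlations")
```

Every correlation counted here is spurious by construction, since the variables are generated independently. The number of candidate pairs grows roughly with the square of the number of variables, so adding variables without adding observations multiplies the opportunities for chance patterns.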

Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
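The scale of the path problem can be shown with a toy program. The sketch below assumes the simplest case, a sequence of independent two-way branches, and the function names are hypothetical; real software has loops and dependent conditions, so this is an illustration of the growth rate rather than a model of any particular system.

```python
from itertools import product

def run(flags):
    """Hypothetical program with one independent branch per flag."""
    trace = []
    for index, flag in enumerate(flags):
        if flag:
            trace.append((index, "taken"))
        else:
            trace.append((index, "skipped"))
    return tuple(trace)

def distinct_paths(n_branches):
    """Enumerate every input and count the distinct execution traces."""
    inputs = product((False, True), repeat=n_branches)
    return len({run(flags) for flags in inputs})

print(distinct_paths(3))    # 8
print(distinct_paths(10))   # 1024
# Exhaustive enumeration is already out of reach for modest programs:
print(2 ** 64)              # paths for 64 independent branches
```

Each added branch doubles the number of paths, so testing every path stops being feasible long before a program reaches the size of typical scientific software.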

Rather than acting as a substitute, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence; but for this to happen, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, cast doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other, and effects when they are used together, are far from understood and may well be untraceable.

This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is construed as an epistemic skill (de Regt 2017). This may not be a problem for those who await the rise of a new species of intelligent machines, who may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics, within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere:

The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)

These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:

A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)

Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated with mechanistic reasoning implemented in computational analysis (Craver & Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.

Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.

One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities, that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them as the lowest step of his hierarchy of models—at the opposite end of its pinnacle, which are models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.

The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view—that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data form a legitimate foundation for empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representational approach, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods. The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.

This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.

Philosophers have long acknowledged that data do not speak for themselves and different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representational view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data is taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representational view of data as objects with fixed and contextually independent meaning is at odds with these observations.

An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.

This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts used to picking and mixing data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena” as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).

The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli, 2018).

Depending on which view on data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how and why.

One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. While there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).

By contrast, within the relational view an object can only be identified as datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence (and thus be viewed as datum) may change; and that should this evidential role stop altogether, the object would revert to an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many may subsequently be discarded as uninteresting or no longer pertinent to the questions being asked.

This view accounts for the mobility and repurposing that characterises big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:

Data x₀ provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)

This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H, at the point in which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).
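One way to make the principle concrete is to simulate how often a given procedure flags a flaw in H when H is in fact false. The following is only an illustrative sketch under stated assumptions, not Mayo and Spanos's own formalism: H is taken to be the claim that a coin is fair, the simulated coin is actually biased, and the two procedures, their sample sizes and their thresholds (`weak_procedure`, `strong_procedure`, `TRUE_BIAS`) are hypothetical choices made for the example.

```python
import random

def tosses(rng, n, p_heads):
    """Simulate n coin tosses and return the number of heads."""
    return sum(rng.random() < p_heads for _ in range(n))

def weak_procedure(rng, p_heads):
    """Flag a flaw in H only if 5 tosses all land the same way."""
    heads = tosses(rng, 5, p_heads)
    return heads in (0, 5)

def strong_procedure(rng, p_heads):
    """Flag a flaw in H if the share of heads in 200 tosses is far from 0.5."""
    heads = tosses(rng, 200, p_heads)
    return abs(heads / 200 - 0.5) > 0.07

def detection_rate(procedure, p_heads, trials=2000, seed=0):
    """Fraction of simulated runs in which the procedure flags a flaw in H."""
    rng = random.Random(seed)
    return sum(procedure(rng, p_heads) for _ in range(trials)) / trials

TRUE_BIAS = 0.7  # assumption: H ("the coin is fair") is false in this simulated world

weak_rate = detection_rate(weak_procedure, TRUE_BIAS)
strong_rate = detection_rate(strong_procedure, TRUE_BIAS)
print(f"weak procedure detects the flaw in {weak_rate:.0%} of runs")
print(f"strong procedure detects the flaw in {strong_rate:.0%} of runs")
```

On this toy reading, data that pass the weak procedure would count as poor evidence for H, since that procedure rarely finds the flaw even though H is false. Note that the sketch concerns only the procedure applied at the point where data are used as evidence for H, whatever the purpose for which the tosses were originally recorded.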

The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus taking attention away from the characteristics of the data objects alone and focusing instead on the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this aim she introduced the notion of “line of evidence”, which she defines as:

a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018:406)

She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes

the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)

As she concludes,

together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)
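Boyd's notion lends itself to a simple data-structure illustration. The sketch below is one hypothetical way of keeping the records of data collection, every subsequent product of processing, and the associated metadata together; the class and field names (`EnrichedLine`, `provenance`, `workflow`) and the sample readings are assumptions introduced for the example and are not drawn from Boyd's text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class EnrichedLine:
    """A line of evidence plus metadata on provenance and processing workflow."""
    provenance: Dict[str, str]   # metadata: where the data records come from
    records: Any                 # the records of data collection
    products: List[Any] = field(default_factory=list)  # products of data processing
    workflow: List[str] = field(default_factory=list)  # metadata: steps applied, in order

    def apply(self, name: str, transform: Callable[[Any], Any]) -> "EnrichedLine":
        """Apply one processing step to the latest product and log it."""
        current = self.products[-1] if self.products else self.records
        self.products.append(transform(current))
        self.workflow.append(name)
        return self

    @property
    def line(self) -> List[Any]:
        """The sequence of empirical results, from raw records to final product."""
        return [self.records, *self.products]

# Hypothetical example: four temperature readings, one of them missing.
evidence = EnrichedLine(
    provenance={"instrument": "thermometer A", "site": "field station"},
    records=[12.1, 11.8, None, 12.4],
)
evidence.apply("drop_missing", lambda xs: [x for x in xs if x is not None])
evidence.apply("mean", lambda xs: sum(xs) / len(xs))

print(evidence.workflow)
print(evidence.line[-1])
```

The design choice worth noting is that no intermediate product is overwritten: the final value can always be traced back through the logged workflow to the original records, which is the kind of auxiliary information the passage describes as necessary for assessing data for interpretation.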

The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):

different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)

Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,

we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)

Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s proposal of a pragmatic theory of evidence similarly offers an account that

takes scientific practice [...] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)

A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.

Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.

The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what counts as an output are claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery. [ 2 ]

Much recent philosophy of science, and particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).

Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).

Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a

“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)
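The risk of spurious correlations in such a “fishing expedition” can be illustrated with a small simulation. This is a minimal sketch, assuming purely random and mutually independent variables; the sample size, number of candidate variables and cut-off (`N_OBS`, `N_VARS`, `THRESHOLD`) are arbitrary choices made for the example rather than figures taken from Elliott and colleagues.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(42)
N_OBS = 20        # observations per variable (assumed)
N_VARS = 1000     # candidate variables screened (assumed)
THRESHOLD = 0.5   # what the analyst treats as a "strong" correlation (assumed)

# The outcome and every candidate variable are independent random noise,
# so any correlation found between them is spurious by construction.
outcome = [rng.gauss(0, 1) for _ in range(N_OBS)]
correlations = [
    pearson([rng.gauss(0, 1) for _ in range(N_OBS)], outcome)
    for _ in range(N_VARS)
]
hits = [r for r in correlations if abs(r) > THRESHOLD]

print(f"{len(hits)} of {N_VARS} noise variables exceed |r| > {THRESHOLD}")
print(f"strongest spurious correlation: {max(abs(r) for r in correlations):.2f}")
```

Screening enough unrelated variables against an outcome will turn up some that look strongly associated with it, which is why searching without theoretical constraints or adequate statistical expertise can yield nonsense results.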

To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetico-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She rather focused on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data—and proposed that such decisions can give rise to a particular form of theorization, which she calls classificatory theory (Leonelli 2016).

These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,

attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881, see also O’Malley et al. 2009, Elliott 2012)

Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.

Big data science is widely seen as revolutionary in the scale and power of predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa argued for big data science as occasioning a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:

answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)

This view differs from simplistic popular discourse on “the death of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger & Cukier 2013) insofar as it does not side-step the constraints associated with knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,

the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)

Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.

This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and the intuition that what researchers are ultimately interested in is

whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)

Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within genome-wide association studies often used in cancer genomics can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.

Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care that database curators put into enabling database users to assess such properties; and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014; Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method, grounded in comparative reasoning, for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).

It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.

At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data is made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.

No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness, with its related practices of widespread data sharing, and scientific rigour, which requires strict monitoring of the credibility and validity of the conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data, and related analytic tools, have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on social, financial and cultural constraints that condition the data pool and its analysis.

This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009), and the existing literature on values in science (such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge) may well apply in this case too. Similarly, it is well established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.

Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good or even appropriate means to pursue the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).

Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.

In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neill 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, it is arguably impossible, however, to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and ascertain how this affects philosophical views on knowledge, truth and method.

In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification and large value attributed to certain kinds of data (e.g., personal data) are associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neill 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can not only access data but also use it, and those who cannot, is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).

Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually only release data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces another distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Even many of the ways in which citizens, researchers included, are encouraged to interact with databases and data interpretation sites tend to encourage participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. When it comes to the commerce of personal data between companies working in analysis, the value of the data as commercial products (which includes the evaluation of the speed and efficiency with which access to certain data can help develop new products) often has priority over scientific issues such as the representativeness and reliability of the data and the ways they were analysed. This can result in decisions that are scientifically problematic, or in a simple lack of interest in investigating the consequences of the assumptions made and the processes used. This lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data considered. This type of ignorance is highly strategic and economically productive since it enables the use of data without concerns over social and scientific implications. In this scenario the evaluation of the quality of data shrinks to an evaluation of their usefulness towards short-term analyses or forecasting required by the client.
There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk here is that the commerce of data is accompanied by an increasing divergence between data and their context. The interest in the history of the transit of data, the plurality of their emotional or scientific value and the re-evaluation of their origins tend to disappear over time, to be substituted by the increasing hold of the financial value of data.

The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that often are not updated in a reliable and regular way. Just to provide an idea of the numbers involved, the prestigious scientific publication Nucleic Acids Research publishes a special issue every year on new databases that are relevant to molecular biology, which included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases that are developed each year in the life sciences sector alone. The fact that these databases rely on short-term funding means that a growing percentage of resources remain available to consult online although they are long dead. This condition is not always visible to users of the databases, who trust them without checking whether they are actively maintained. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing their prerequisite conditions, the theories and scaffolding used to build them? One of these risks is rampant conservatism: the insistence on recycling old data whose features and management elements become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and the circumstances of their users.
In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It is not the case, for example, that data collected on fungi ten, twenty or even a hundred years ago can reliably explain the behaviour of the same species of fungi now or in the future (Leonelli 2018).

Researchers of what Luciano Floridi calls the infosphere (the way in which the introduction of digital technologies is changing the world) are becoming aware of the destructive potential of big data and the urgent need to focus efforts for management and use of data in active and thoughtful ways towards the improvement of the human condition. In Floridi’s own words:

ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)

In light of these findings, it is essential that ethical and social issues are seen as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not obtained exclusively by regulating the commerce of research and the management of personal data, nor by introducing monitoring of research financing, even though these are important strategies. To guarantee that big data are used in the most scientifically and socially forward-thinking way it is necessary to transcend the concept of ethics as something external and alien to research. An analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data and the methods used to view and analyse it. Ethical evaluations and choices are hidden in every aspect of data management, including those choices that may seem purely technical.

This entry stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical implications; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and ultimately supporting, the process of making human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.

Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.

Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.

  • Achinstein, Peter, 2001, The Book of Evidence, Oxford: Oxford University Press. doi:10.1093/0195143892.001.0001
  • Anderson, Chris, 2008, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine, 23 June 2008.
  • Aronova, Elena, Karen S. Baker, and Naomi Oreskes, 2010, “Big science and big data in biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–present”, Historical Studies in the Natural Sciences, 40: 183–224.
  • Aronova, Elena, Christine von Oertzen, and David Sepkoski, 2017, “Introduction: Historicizing Big Data”, Osiris, 32(1): 1–17. doi:10.1086/693399
  • Bauer, Susanne, 2008, “Mining Data, Gathering Variables and Recombining Information: The Flexible Architecture of Epidemiological Studies”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 39(4): 415–428. doi:10.1016/j.shpsc.2008.09.008
  • Bechtel, William, 2016, “Using Computational Models to Discover and Understand Mechanisms”, Studies in History and Philosophy of Science Part A, 56: 113–121. doi:10.1016/j.shpsa.2015.10.004
  • Beisbart, Claus, 2012, “How Can Computer Simulations Produce New Knowledge?”, European Journal for Philosophy of Science, 2(3): 395–434. doi:10.1007/s13194-012-0049-7
  • Bezuidenhout, Louise, Sabina Leonelli, Ann Kelly, and Brian Rappert, 2017, “Beyond the Digital Divide: Towards a Situated Approach to Open Data”, Science and Public Policy, 44(4): 464–475. doi:10.1093/scipol/scw036
  • Bogen, Jim, 2009 [2013], “Theory and Observation in Science”, in The Stanford Encyclopedia of Philosophy (Spring 2013 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/>.
  • –––, 2010, “Noise in the World”, Philosophy of Science, 77(5): 778–791. doi:10.1086/656006
  • Bogen, James and James Woodward, 1988, “Saving the Phenomena”, The Philosophical Review, 97(3): 303. doi:10.2307/2185445
  • Bokulich, Alisa, 2018, “Using Models to Correct Data: Paleodiversity and the Fossil Record”, Synthese, special issue on Abstraction and Idealization in Scientific Modelling, 29 May 2018. doi:10.1007/s11229-018-1820-x
  • Boon, Mieke, 2020, “How Scientists Are Brought Back into Science—The Error of Empiricism”, in A Critical Reflection on Automated Science , Marta Bertolaso and Fabio Sterpetti (eds.), (Human Perspectives in Health Sciences and Technology 1), Cham: Springer International Publishing, 43–65. doi:10.1007/978-3-030-25001-0_4
  • Borgman, Christine L., 2015, Big Data, Little Data, No Data, Cambridge, MA: MIT Press.
  • Boumans, M.J. and Sabina Leonelli, forthcoming, “From Dirty Data to Tidy Facts: Practices of Clustering in Plant Phenomics and Business Cycles”, in Leonelli and Tempini forthcoming.
  • Boyd, Danah and Kate Crawford, 2012, “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon”, Information, Communication & Society, 15(5): 662–679. doi:10.1080/1369118X.2012.678878
  • Boyd, Nora Mills, 2018, “Evidence Enriched”, Philosophy of Science, 85(3): 403–421. doi:10.1086/697747
  • Bowker, Geoffrey C., 2006, Memory Practices in the Sciences, Cambridge, MA: The MIT Press.
  • Bringsjord, Selmer and Naveen Sundar Govindarajulu, 2018, “Artificial Intelligence”, in The Stanford Encyclopedia of Philosophy (Fall 2018 edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/fall2018/entries/artificial-intelligence/>.
  • British Academy & Royal Society, 2017, Data Management and Use: Governance in the 21st Century. A Joint Report of the Royal Society and the British Academy, [British Academy & Royal Society 2017 available online].
  • Cai, Li and Yangyong Zhu, 2015, “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era”, Data Science Journal, 14: 2. doi:10.5334/dsj-2015-002
  • Callebaut, Werner, 2012, “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 43(1): 69–80. doi:10.1016/j.shpsc.2011.10.007
  • Calude, Cristian S. and Giuseppe Longo, 2017, “The Deluge of Spurious Correlations in Big Data”, Foundations of Science, 22(3): 595–612. doi:10.1007/s10699-016-9489-4
  • Canali, Stefano, 2016, “Big Data, Epistemology and Causality: Knowledge in and Knowledge out in EXPOsOMICS”, Big Data & Society, 3(2): 205395171666953. doi:10.1177/2053951716669530
  • –––, 2019, “Evaluating Evidential Pluralism in Epidemiology: Mechanistic Evidence in Exposome Research”, History and Philosophy of the Life Sciences, 41(1): art. 4. doi:10.1007/s40656-019-0241-6
  • Cartwright, Nancy D., 2013, Evidence: For Policy and Wheresoever Rigor Is a Must, London School of Economics and Political Science (LSE), Order Project Discussion Paper Series [Cartwright 2013 available online].
  • –––, 2019, Nature, the Artful Modeler: Lectures on Laws, Science, How Nature Arranges the World and How We Can Arrange It Better (The Paul Carus Lectures), Chicago, IL: Open Court.
  • Chang, Hasok, 2012, Is Water H2O? Evidence, Realism and Pluralism, (Boston Studies in the Philosophy of Science 293), Dordrecht: Springer Netherlands. doi:10.1007/978-94-007-3932-1
  • –––, 2017, “VI—Operational Coherence as the Source of Truth”, Proceedings of the Aristotelian Society , 117(2): 103–122. doi:10.1093/arisoc/aox004
  • Chapman, Robert and Alison Wylie, 2016, Evidential Reasoning in Archaeology, London: Bloomsbury Publishing Plc.
  • Collins, Harry M., 1990, Artificial Experts: Social Knowledge and Intelligent Machines, Cambridge, MA: MIT Press.
  • Craver, Carl F. and Lindley Darden, 2013, In Search of Mechanisms: Discoveries Across the Life Sciences, Chicago: University of Chicago Press.
  • Daston, Lorraine, 2017, Science in the Archives: Pasts, Presents, Futures, Chicago: University of Chicago Press.
  • De Regt, Henk W., 2017, Understanding Scientific Understanding, Oxford: Oxford University Press. doi:10.1093/oso/9780190652913.001.0001
  • D’Ignazio, Catherine and Lauren F. Klein, 2020, Data Feminism, Cambridge, MA: The MIT Press.
  • Douglas, Heather E., 2009, Science, Policy and the Value-Free Ideal, Pittsburgh, PA: University of Pittsburgh Press.
  • Dreyfus, Hubert L., 1992, What Computers Still Can’t Do: A Critique of Artificial Reason, Cambridge, MA: MIT Press.
  • Durán, Juan M. and Nico Formanek, 2018, “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism”, Minds and Machines, 28(4): 645–666. doi:10.1007/s11023-018-9481-6
  • Edwards, Paul N., 2010, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming, Cambridge, MA: The MIT Press.
  • Elliott, Kevin C., 2012, “Epistemic and methodological iteration in scientific research”, Studies in History and Philosophy of Science, 43: 376–382.
  • Elliott, Kevin C., Kendra S. Cheruvelil, Georgina M. Montgomery, and Patricia A. Soranno, 2016, “Conceptions of Good Science in Our Data-Rich World”, BioScience, 66(10): 880–889. doi:10.1093/biosci/biw115
  • Feest, Uljana, 2011, “What Exactly Is Stabilized When Phenomena Are Stabilized?”, Synthese, 182(1): 57–71. doi:10.1007/s11229-009-9616-7
  • Fleming, Lora, Niccolò Tempini, Harriet Gordon-Brown, Gordon L. Nichols, Christophe Sarran, Paolo Vineis, Giovanni Leonardi, Brian Golding, Andy Haines, Anthony Kessel, Virginia Murray, Michael Depledge, and Sabina Leonelli, 2017, “Big Data in Environment and Human Health”, in Oxford Research Encyclopedia of Environmental Science, Oxford: Oxford University Press. doi:10.1093/acrefore/9780199389414.013.541
  • Floridi, Luciano, 2014, The Fourth Revolution: How the Infosphere is Reshaping Human Reality, Oxford: Oxford University Press.
  • Floridi, Luciano and Phyllis Illari (eds.), 2014, The Philosophy of Information Quality, (Synthese Library 358), Cham: Springer International Publishing. doi:10.1007/978-3-319-07121-3
  • Frigg, Roman and Julian Reiss, 2009, “The Philosophy of Simulation: Hot New Issues or Same Old Stew?”, Synthese, 169(3): 593–613. doi:10.1007/s11229-008-9438-z
  • Frigg, Roman and Stephan Hartmann, 2016, “Models in Science”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/win2016/entries/models-science/>.
  • Gooding, David C., 1990, Experiment and the Making of Meaning, Dordrecht & Boston: Kluwer.
  • Giere, Ronald, 2006, Scientific Perspectivism, Chicago: University of Chicago Press.
  • Griesemer, James R., forthcoming, “A Data Journey through Dataset-Centric Population Biology”, in Leonelli and Tempini forthcoming.
  • Hacking, Ian, 1992, “The Self-Vindication of the Laboratory Sciences”, in Science as Practice and Culture, Andrew Pickering (ed.), Chicago, IL: The University of Chicago Press, 29–64.
  • Harris, Todd, 2003, “Data Models and the Acquisition and Manipulation of Data”, Philosophy of Science, 70(5): 1508–1517. doi:10.1086/377426
  • Hey, Tony, Stewart Tansley, and Kristin Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery, Redmond, WA: Microsoft Research.
  • Humphreys, Paul, 2004, Extending Ourselves: Computational Science, Empiricism, and Scientific Method, Oxford: Oxford University Press. doi:10.1093/0195158709.001.0001
  • –––, 2009, “The Philosophical Novelty of Computer Simulation Methods”, Synthese, 169(3): 615–626. doi:10.1007/s11229-008-9435-2
  • Karaca, Koray, 2018, “Lessons from the Large Hadron Collider for Model-Based Experimentation: The Concept of a Model of Data Acquisition and the Scope of the Hierarchy of Models”, Synthese, 195(12): 5431–5452. doi:10.1007/s11229-017-1453-5
  • Kelly, Thomas, 2016, “Evidence”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/win2016/entries/evidence/>.
  • Kitchin, Rob, 2013, The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences, Los Angeles: Sage.
  • –––, 2014, “Big Data, new epistemologies and paradigm shifts”, Big Data & Society, 1(1), April-June. doi:10.1177/2053951714528481
  • Kitchin, Rob and Gavin McArdle, 2016, “What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets”, Big Data & Society, 3(1): 205395171663113. doi:10.1177/2053951716631130
  • Krohs, Ulrich, 2012, “Convenience Experimentation”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 43(1): 52–57. doi:10.1016/j.shpsc.2011.10.005
  • Lagoze, Carl, 2014, “Big Data, data integrity, and the fracturing of the control zone”, Big Data & Society, 1(2), July-December. doi:10.1177/2053951714558281
  • Leonelli, Sabina, 2014, “What Difference Does Quantity Make? On the Epistemology of Big Data in Biology”, Big Data & Society , 1(1): 205395171453439. doi:10.1177/2053951714534395
  • –––, 2016, Data-Centric Biology: A Philosophical Study , Chicago: University of Chicago Press.
  • –––, 2017, “Global Data Quality Assessment and the Situated Nature of ‘Best’ Research Practices in Biology”, Data Science Journal , 16: 32. doi:10.5334/dsj-2017-032
  • –––, 2018, “The Time of Data: Timescales of Data Use in the Life Sciences”, Philosophy of Science , 85(5): 741–754. doi:10.1086/699699
  • –––, 2019a, La Recherche Scientifique à l’Ère des Big Data: Cinq Façons Dont les Données Massives Nuisent à la Science, et Comment la Sauver , Milano: Éditions Mimésis.
  • –––, 2019b, “What Distinguishes Data from Models?”, European Journal for Philosophy of Science , 9(2): 22. doi:10.1007/s13194-018-0246-0
  • Leonelli, Sabina and Niccolò Tempini, 2018, “Where Health and Environment Meet: The Use of Invariant Parameters in Big Data Analysis”, Synthese , special issue on the Philosophy of Epidemiology , Sean Valles and Jonathan Kaplan (eds.). doi:10.1007/s11229-018-1844-2
  • –––, forthcoming, Data Journeys in the Sciences , Cham: Springer International Publishing.
  • Loettgers, Andrea, 2009, “Synthetic Biology and the Emergence of a Dual Meaning of Noise”, Biological Theory , 4(4): 340–356. doi:10.1162/BIOT_a_00009
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
  • Lowrie, Ian, 2017, “Algorithmic Rationality: Epistemology and Efficiency in the Data Sciences”, Big Data & Society , 4(1): 1–13. doi:10.1177/2053951717700925
  • MacLeod, Miles and Nancy J. Nersessian, 2013, “Building Simulations from the Ground Up: Modeling and Theory in Systems Biology”, Philosophy of Science , 80(4): 533–556. doi:10.1086/673209
  • Massimi, Michela, 2011, “From Data to Phenomena: A Kantian Stance”, Synthese , 182(1): 101–116. doi:10.1007/s11229-009-9611-z
  • –––, 2012, “Scientific Perspectivism and Its Foes”, Philosophica , 84: 25–52.
  • –––, 2016, “Three Tales of Scientific Success”, Philosophy of Science , 83(5): 757–767. doi:10.1086/687861
  • Mayer-Schönberger, Viktor and Kenneth Cukier, 2013, Big Data: A Revolution that Will Transform How We Live, Work, and Think , New York: Eamon Dolan/Houghton Mifflin Harcourt.
  • Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
  • Mayo, Deborah G. and Aris Spanos (eds.), 2009a, Error and Inference , Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos, 2009b, “Introduction and Background”, in Mayo and Spanos (eds.) 2009a, pp. 1–27.
  • McAllister, James W., 1997, “Phenomena and Patterns in Data Sets”, Erkenntnis , 47(2): 217–228. doi:10.1023/A:1005387021520
  • –––, 2007, “Model Selection and the Multiplicity of Patterns in Empirical Data”, Philosophy of Science , 74(5): 884–894. doi:10.1086/525630
  • –––, 2011, “What Do Patterns in Empirical Data Tell Us about the Structure of the World?”, Synthese , 182(1): 73–87. doi:10.1007/s11229-009-9613-x
  • McQuillan, Dan, 2018, “Data Science as Machinic Neoplatonism”, Philosophy & Technology , 31(2): 253–272. doi:10.1007/s13347-017-0273-3
  • Mitchell, Sandra D., 2003, Biological Complexity and Integrative Pluralism , Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802683
  • Morgan, Mary S., 2005, “Experiments versus Models: New Phenomena, Inference and Surprise”, Journal of Economic Methodology , 12(2): 317–329. doi:10.1080/13501780500086313
  • –––, forthcoming, “The Datum in Context”, in Leonelli and Tempini forthcoming.
  • Morrison, Margaret, 2015, Reconstructing Reality: Models, Mathematics, and Simulations , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199380275.001.0001
  • Müller-Wille, Staffan and Isabelle Charmantier, 2012, “Natural History and Information Overload: The Case of Linnaeus”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 4–15. doi:10.1016/j.shpsc.2011.10.021
  • Napoletani, Domenico, Marco Panza, and Daniele C. Struppa, 2011, “Agnostic Science. Towards a Philosophy of Data Analysis”, Foundations of Science , 16(1): 1–20. doi:10.1007/s10699-010-9186-7
  • –––, 2014, “Is Big Data Enough? A Reflection on the Changing Role of Mathematics in Applications”, Notices of the American Mathematical Society , 61(5): 485–490. doi:10.1090/noti1102
  • Nickles, Thomas, forthcoming, “Alien Reasoning: Is a Major Change in Scientific Research Underway?”, Topoi , first online: 20 March 2018. doi:10.1007/s11245-018-9557-1
  • Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
  • O’Malley, Maureen A., Kevin C. Elliott, Chris Haufe, and Richard Burian, 2009, “Philosophies of Funding”, Cell , 138: 611–615. doi:10.1016/j.cell.2009.08.008
  • O’Malley, Maureen A. and Orkun S. Soyer, 2012, “The Roles of Integration in Molecular Systems Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 58–68. doi:10.1016/j.shpsc.2011.10.006
  • O’Neil, Cathy, 2016, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy , New York: Crown.
  • Parker, Wendy S., 2009, “Does Matter Really Matter? Computer Simulations, Experiments, and Materiality”, Synthese , 169(3): 483–496. doi:10.1007/s11229-008-9434-3
  • –––, 2017, “Computer Simulation, Measurement, and Data Assimilation”, The British Journal for the Philosophy of Science , 68(1): 273–304. doi:10.1093/bjps/axv037
  • Pasquale, Frank, 2015, The Black Box Society: The Secret Algorithms That Control Money and Information , Cambridge, MA: Harvard University Press.
  • Pietsch, Wolfgang, 2015, “Aspects of Theory-Ladenness in Data-Intensive Science”, Philosophy of Science , 82(5): 905–916. doi:10.1086/683328
  • –––, 2016, “The Causal Nature of Modeling with Big Data”, Philosophy & Technology , 29(2): 137–171. doi:10.1007/s13347-015-0202-2
  • –––, 2017, “Causation, probability and all that: Data science as a novel inductive paradigm”, in Frontiers in Data Science , Matthias Dehmer and Frank Emmert-Streib (eds.), Boca Raton, FL: CRC, 329–353.
  • Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
  • Porter, Theodore M. and Soraya de Chadarevian, 2018, “Introduction: Scrutinizing the Data World”, Historical Studies in the Natural Sciences , 48(5): 549–556. doi:10.1525/hsns.2018.48.5.549
  • Prainsack, Barbara and Buyx, Alena, 2017, Solidarity in Biomedicine and Beyond , Cambridge, UK: Cambridge University Press.
  • Radder, Hans, 2009, “The Philosophy of Scientific Experimentation: A Review”, Automated Experimentation , 1(1): 2. doi:10.1186/1759-4499-1-2
  • Ratti, Emanuele, 2015, “Big Data Biology: Between Eliminative Inferences and Exploratory Experiments”, Philosophy of Science , 82(2): 198–218. doi:10.1086/680332
  • Reichenbach, Hans, 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge , Chicago, IL: The University of Chicago Press.
  • Reiss, Julian, 2015, “A Pragmatist Theory of Evidence”, Philosophy of Science , 82(3): 341–362. doi:10.1086/681643
  • Reiss, Julian, 2015, Causation, Evidence, and Inference , New York: Routledge.
  • Rescher, Nicholas, 1984, The Limits of Science , Berkeley, CA: University of California Press.
  • Rheinberger, Hans-Jörg, 2011, “Infra-Experimentality: From Traces to Data, from Data to Patterning Facts”, History of Science , 49(3): 337–348. doi:10.1177/007327531104900306
  • Romeijn, Jan-Willem, 2017, “Philosophy of Statistics”, in The Stanford Encyclopedia of Philosophy (Spring 2017), Edward N. Zalta (ed.), URL: https://plato.stanford.edu/archives/spr2017/entries/statistics/ .
  • Sepkoski, David, 2013, “Toward ‘a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000”, Journal of the History of Biology , 46: 401–444.
  • Shavit, Ayelet and James Griesemer, 2009, “There and Back Again, or the Problem of Locality in Biodiversity Surveys”, Philosophy of Science , 76(3): 273–294. doi:10.1086/649805
  • Srnicek, Nick, 2017, Platform capitalism , Cambridge, UK and Malden, MA: Polity Press.
  • Sterner, Beckett, 2014, “The Practical Value of Biological Information for Research”, Philosophy of Science , 81(2): 175–194. doi:10.1086/675679
  • Sterner, Beckett and Nico M. Franz, 2017, “Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data”, Biological Theory , 12(2): 99–111. doi:10.1007/s13752-017-0259-5
  • Sterner, Beckett W., Nico M. Franz, and J. Witteveen, 2020, “Coordinating Dissent as an Alternative to Consensus Classification: Insights from Systematics for Bio-Ontologies”, History and Philosophy of the Life Sciences , 42(1): 8. doi:10.1007/s40656-020-0300-z
  • Stevens, Hallam, 2016, “Hadooping the Genome: The Impact of Big Data Tools on Biology”, BioSocieties , 11: 352–371.
  • Strasser, Bruno, 2019, Collecting Experiments: Making Big Data Biology , Chicago: University of Chicago Press.
  • Suppes, Patrick, 1962, “Models of data”, in Logic, Methodology and Philosophy of Science , Ernest Nagel, Patrick Suppes, & Alfred Tarski (eds.), Stanford: Stanford University Press, 252–261.
  • Symons, John and Ramón Alvarado, 2016, “Can We Trust Big Data? Applying Philosophy of Science to Software”, Big Data & Society , 3(2): 1-17. doi:10.1177/2053951716664747
  • Symons, John and Jack Horner, 2014, “Software Intensive Science”, Philosophy & Technology , 27(3): 461–477. doi:10.1007/s13347-014-0163-x
  • Tempini, Niccolò, 2017, “Till Data Do Us Part: Understanding Data-Based Value Creation in Data-Intensive Infrastructures”, Information and Organization , 27(4): 191–210. doi:10.1016/j.infoandorg.2017.08.001
  • Tempini, Niccolò and Sabina Leonelli, 2018, “Concealment and Discovery: The Role of Information Security in Biomedical Data Re-Use”, Social Studies of Science , 48(5): 663–690. doi:10.1177/0306312718804875
  • Toulmin, Stephen, 1958, The Uses of Arguments , Cambridge: Cambridge University Press.
  • Turner, Raymond and Nicola Angius, 2019, “The Philosophy of Computer Science”, in The Stanford Encyclopedia of Philosophy (Spring 2019 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2019/entries/computer-science/ >.
  • Van Fraassen, Bas C., 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001
  • Waters, C. Kenneth, 2007, “The Nature and Context of Exploratory Experimentation: An Introduction to Three Case Studies of Exploratory Research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
  • Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, et al., 2016, “The FAIR Guiding Principles for Scientific Data Management and Stewardship”, Scientific Data , 3(1): 160018. doi:10.1038/sdata.2016.18
  • Williamson, Jon, 2004, “A Dynamic Interaction between Machine Learning and the Philosophy of Science”, Minds and Machines , 14(4): 539–54.
  • Wimsatt, William C., 2007, Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality , Cambridge, MA: Harvard University Press.
  • Winsberg, Eric, 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
  • Woodward, James, 2000, “Data, phenomena and reliability”, Philosophy of Science , 67(supplement): Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Sep., 2000), pp. S163–S179. https://www.jstor.org/stable/188666
  • –––, 2010, “Data, Phenomena, Signal, and Noise”, Philosophy of Science , 77(5): 792–803. doi:10.1086/656554
  • Wright, Jessey, 2017, “The Analysis of Data and the Evidential Scope of Neuroimaging Results”, The British Journal for the Philosophy of Science , 69(4): 1179–1203. doi:10.1093/bjps/axx012
  • Wylie, Alison, 2017, “How Archaeological Evidence Bites Back: Strategies for Putting Old Data to Work in New Ways”, Science, Technology, & Human Values , 42(2): 203–225. doi:10.1177/0162243916671200
  • –––, forthcoming, “Radiocarbon Dating in Archaeology: Triangulation and Traceability”, in Leonelli and Tempini forthcoming.
  • Zuboff, Shoshana, 2017, The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power , New York: Public Affairs.
The research underpinning this entry was funded by the European Research Council (grant award 335925) and the Alan Turing Institute (EPSRC Grant EP/N510129/1).

Copyright © 2020 by Sabina Leonelli <s.leonelli@exeter.ac.uk>

The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab, Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

  • Open access
  • Published: 08 April 2024

Large-scale phenotyping of patients with long COVID post-hospitalization reveals mechanistic subtypes of disease

  • Felicity Liew 1   na1 ,
  • Claudia Efstathiou   ORCID: orcid.org/0000-0001-6125-8126 1   na1 ,
  • Sara Fontanella 1 ,
  • Matthew Richardson 2 ,
  • Ruth Saunders 2 ,
  • Dawid Swieboda 1 ,
  • Jasmin K. Sidhu 1 ,
  • Stephanie Ascough 1 ,
  • Shona C. Moore   ORCID: orcid.org/0000-0001-8610-2806 3 ,
  • Noura Mohamed 4 ,
  • Jose Nunag   ORCID: orcid.org/0000-0002-4218-0500 5 ,
  • Clara King 5 ,
  • Olivia C. Leavy 2 , 6 ,
  • Omer Elneima 2 ,
  • Hamish J. C. McAuley 2 ,
  • Aarti Shikotra 7 ,
  • Amisha Singapuri   ORCID: orcid.org/0009-0002-4711-7516 2 ,
  • Marco Sereno   ORCID: orcid.org/0000-0003-4573-9303 2 ,
  • Victoria C. Harris 2 ,
  • Linzy Houchen-Wolloff   ORCID: orcid.org/0000-0003-4940-8835 8 ,
  • Neil J. Greening   ORCID: orcid.org/0000-0003-0453-7529 2 ,
  • Nazir I. Lone   ORCID: orcid.org/0000-0003-2707-2779 9 ,
  • Matthew Thorpe 10 ,
  • A. A. Roger Thompson   ORCID: orcid.org/0000-0002-0717-4551 11 ,
  • Sarah L. Rowland-Jones 11 ,
  • Annemarie B. Docherty   ORCID: orcid.org/0000-0001-8277-420X 10 ,
  • James D. Chalmers 12 ,
  • Ling-Pei Ho   ORCID: orcid.org/0000-0001-8319-301X 13 ,
  • Alexander Horsley   ORCID: orcid.org/0000-0003-1828-0058 14 ,
  • Betty Raman 15 ,
  • Krisnah Poinasamy 16 ,
  • Michael Marks 17 , 18 , 19 ,
  • Onn Min Kon 1 ,
  • Luke S. Howard   ORCID: orcid.org/0000-0003-2822-210X 1 ,
  • Daniel G. Wootton 3 ,
  • Jennifer K. Quint 1 ,
  • Thushan I. de Silva   ORCID: orcid.org/0000-0002-6498-9212 11 ,
  • Antonia Ho 20 ,
  • Christopher Chiu   ORCID: orcid.org/0000-0003-0914-920X 1 ,
  • Ewen M. Harrison   ORCID: orcid.org/0000-0002-5018-3066 10 ,
  • William Greenhalf 21 ,
  • J. Kenneth Baillie   ORCID: orcid.org/0000-0001-5258-793X 10 , 22 , 23 ,
  • Malcolm G. Semple   ORCID: orcid.org/0000-0001-9700-0418 3 , 24 ,
  • Lance Turtle 3 , 24 ,
  • Rachael A. Evans   ORCID: orcid.org/0000-0002-1667-868X 2 ,
  • Louise V. Wain 2 , 6 ,
  • Christopher Brightling 2 ,
  • Ryan S. Thwaites   ORCID: orcid.org/0000-0003-3052-2793 1   na1 ,
  • Peter J. M. Openshaw   ORCID: orcid.org/0000-0002-7220-2555 1   na1 ,
  • PHOSP-COVID collaborative group &

ISARIC investigators

Nature Immunology volume 25, pages 607–621 (2024)

  • Inflammasome
  • Inflammation
  • Innate immunity

One in ten severe acute respiratory syndrome coronavirus 2 infections results in prolonged symptoms termed long coronavirus disease (COVID), yet disease phenotypes and mechanisms are poorly understood 1 . Here we profiled 368 plasma proteins in 657 participants ≥3 months following hospitalization. Of these, 424 had at least one long COVID symptom and 233 had fully recovered. Elevated markers of myeloid inflammation and complement activation were associated with long COVID. IL-1R2, MATN2 and COLEC12 were associated with cardiorespiratory symptoms, fatigue and anxiety/depression; MATN2, CSF3 and C1QA were elevated in gastrointestinal symptoms and C1QA was elevated in cognitive impairment. Additional markers of alterations in nerve tissue repair (SPON-1 and NFASC) were elevated in those with cognitive impairment and SCG3, suggestive of brain–gut axis disturbance, was elevated in gastrointestinal symptoms. Severe acute respiratory syndrome coronavirus 2-specific immunoglobulin G (IgG) was persistently elevated in some individuals with long COVID, but virus was not detected in sputum. Analysis of inflammatory markers in nasal fluids showed no association with symptoms. Our study aimed to understand inflammatory processes that underlie long COVID and was not designed for biomarker discovery. Our findings suggest that specific inflammatory pathways related to tissue damage are implicated in subtypes of long COVID, which might be targeted in future therapeutic trials.

One in ten severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections results in post-acute sequelae of coronavirus disease 2019 (PASC) or long coronavirus disease (COVID), which affects 65 million people worldwide 1 . Long COVID (LC) remains common, even after mild acute infection with recent variants 2 , and it is likely LC will continue to cause substantial long-term ill health, requiring targeted management based on an understanding of how disease phenotypes relate to underlying mechanisms. Persistent inflammation has been reported in adults with LC 1 , 3 , but studies have been limited in size, timing of samples or breadth of immune mediators measured, leading to inconsistent or absent associations with symptoms. Markers of oxidative stress, metabolic disturbance, vasculoproliferative processes and IFN-, NF-κB- or monocyte-related inflammation have been suggested 3 , 4 , 5 , 6 .

The PHOSP-COVID study, a multicenter United Kingdom study of patients previously hospitalized with COVID-19, has reported inflammatory profiles in 626 adults with health impairment after COVID-19, identified through clustering. Elevated IL-6 and markers of mucosal inflammation were observed in those with severe impairment compared with individuals with milder impairment 7 . However, LC is a heterogeneous condition that may be a distinct form of health impairment after COVID-19, and it remains unclear whether there are inflammatory changes specific to LC symptom subtypes. Determining whether activated inflammatory pathways underlie all cases of LC or if mechanisms differ according to clinical presentation is essential for developing effective therapies and has been highlighted as a top research priority by patients and clinicians 8 .

In this Letter, in a prospective multicenter study, we measured 368 plasma proteins in 657 adults previously hospitalized for COVID-19 (Fig. 1a and Table 1 ). Individuals in our cohort experienced a range of acute COVID-19 severities based on World Health Organization (WHO) progression scores 9 : WHO 3–4 (no oxygen support, n = 133 and median age of 55 years), WHO 5–6 (oxygen support, n = 353 and median age of 59 years) and WHO 7–9 (critical care, n = 171 and median age of 57 years). Participants were hospitalized for COVID-19 ≥3 months before sample collection (median 6.1 months, interquartile range (IQR) 5.1–6.8 months and range 3.0–8.3 months), and infection was confirmed clinically ( n = 36/657) or by PCR ( n = 621/657). Symptom data indicated that 233/657 (35%) felt fully recovered at 6 months (hereafter ‘recovered’) and the remaining 424 (65%) reported symptoms consistent with the WHO definition for LC (symptoms ≥3 months post infection 10 ). Given the diversity of LC presentations, patients were grouped according to symptom type (Fig. 1b ). Groups were defined using symptoms and health deficits that have been commonly reported in the literature 1 ( Methods ). A multivariate penalized logistic regression model (PLR) was used to explore associations of clinical covariates and immune mediators at 6 months between recovered patients ( n = 233) and each LC group (cardiorespiratory symptoms, cardioresp, n = 398, Fig. 1c ; fatigue, n = 384, Fig. 1d ; affective symptoms, anxiety/depression, n = 202, Fig. 1e ; gastrointestinal symptoms, GI, n = 132, Fig. 1f ; and cognitive impairment, cognitive, n = 61, Fig. 1g ). Women ( n = 239) were more likely to experience cardioresp (odds ratio (OR) 1.14), fatigue (OR 1.22), GI (OR 1.13) and cognitive (OR 1.03) outcomes (Fig. 1c,d,f,g ). Repeated cross-validation was used to optimize and assess model performance ( Methods and Extended Data Fig. 1 ).
Pre-existing conditions, such as chronic lung disease, neurological disease and cardiovascular disease (Supplementary Table 1 ), were associated with all LC groups (Fig. 1c–g ). Age, C-reactive protein (CRP) and acute disease severity were not associated with any LC group (Table 1 ).
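As an illustration of how odds ratios such as those above arise from a penalized logistic regression, the following minimal Python sketch fits a small model and exponentiates its coefficients. It is not the authors' pipeline: the penalty type is not stated in this section, so an L2 (ridge) penalty is assumed, and the data, function names (`make_synthetic`, `fit_penalized_logistic`, `odds_ratios`) and tuning values are hypothetical.

```python
import math
import random


def make_synthetic(n=400, seed=0):
    """Synthetic cohort (illustrative only): feature 0 raises the odds of the
    outcome, feature 1 is pure noise. Features are roughly standardized."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p = 1.0 / (1.0 + math.exp(-1.5 * row[0]))
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_penalized_logistic(X, y, l2=0.1, lr=0.1, epochs=1000):
    """Logistic regression with an assumed L2 penalty, fitted by batch
    gradient descent. The intercept is left unpenalized."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = b0 + sum(b * x for b, x in zip(beta, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label  # predicted minus observed
            g0 += err
            for j in range(p):
                grad[j] += err * row[j]
        b0 -= lr * g0 / n
        # The penalty adds l2 * b to each coefficient's gradient (shrinkage).
        beta = [b - lr * (g / n + l2 * b) for b, g in zip(beta, grad)]
    return b0, beta


def odds_ratios(beta):
    """OR per one-unit (here one-SD) increase in each feature."""
    return [math.exp(b) for b in beta]
```

An OR above 1 corresponds to a positive coefficient (higher odds of the symptom group relative to the recovered group) and an OR below 1 to a negative one. In practice the penalty strength would be chosen by cross-validation, as the text describes, rather than fixed by hand as in this sketch.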

figure 1

a , Distribution of time from COVID-19 hospitalization at sample collection. All samples were cross-sectional. The vertical red line indicates the 3 month cutoff used to define our final cohort; samples collected before 3 months were excluded. b , An UpSet plot describing pooled LC groups. The horizontal colored bars represent the number of patients in each symptom group: cardiorespiratory (Cardio_Resp), fatigue, cognitive, GI and anxiety/depression (Anx_Dep). Vertical black bars represent the number of patients in each symptom combination group. To prevent patient identification, where fewer than five patients belong to a combination group, this has been represented as ‘<5’. The recovered group ( n = 233) were used as controls. c – g , Forest plots of Olink protein concentrations (NPX) associated with Cardio_Resp ( n = 365) ( c ), fatigue ( n = 314) ( d ), Anx_Dep ( n = 202) ( e ), GI ( n = 124) ( f ) and cognitive ( n = 60) ( g ). Neuro_Psych, neuropsychiatric. The error bars represent the median accuracy of the model. h , i , Distribution of Olink values (NPX) for IL-1R2 ( h ) and MATN2, neurofascin and sCD58 ( i ) measured between symptomatic and recovered individuals in recovered ( n = 233), Cardio_Resp ( n = 365), fatigue ( n = 314) and Anx_Dep ( n = 202) groups ( h ) and MATN2 in GI ( n = 124), neurofascin in cognitive ( n = 60) and sCD58 in Cardio_Resp and recovered groups ( i ). The box plot center line represents the median, the boundaries represent IQR and the whisker length represents 1.5× IQR. The median values were compared between groups using two-sided Wilcoxon signed-rank test, * P < 0.05, ** P < 0.01, *** P < 0.001 and **** P < 0.0001.

To study the association of peripheral inflammation with symptoms, we analyzed cross-sectional data collected approximately 6 months after hospitalization. We measured 368 immune mediators from plasma collected contemporaneously with symptom data. Mediators suggestive of myeloid inflammation were associated with all symptoms (Fig. 1c–h ). Elevated IL-1R2, an IL-1 receptor expressed by monocytes and macrophages that modulates inflammation 11 , and MATN2, an extracellular matrix protein that modulates tissue inflammation through recruitment of innate immune cells 12 , were associated with cardioresp (IL-1R2 OR 1.14, Fig. 1c,h ), fatigue (IL-1R2 OR 1.45, Fig. 1d,h ), anxiety/depression (IL-1R2 OR 1.34, Fig. 1e,h ) and GI (MATN2 OR 1.08, Fig. 1f ). IL-3RA, an IL-3 receptor, was associated with cardioresp (OR 1.07, Fig. 1c ), fatigue (OR 1.21, Fig. 1d ), anxiety/depression (OR 1.12, Fig. 1e ) and GI (OR 1.06, Fig. 1f ) groups, while CSF3, a cytokine promoting neutrophilic inflammation 13 , was elevated in cardioresp (OR 1.06, Fig. 1c ), fatigue (OR 1.12, Fig. 1d ) and GI (OR 1.08, Fig. 1f ).

Elevated COLEC12, which initiates inflammation in tissues by activating the alternative complement pathway 14 , was associated with cardioresp (OR 1.09, Fig. 1c ), fatigue (OR 1.19, Fig. 1d ) and anxiety/depression (OR 1.11, Fig. 1e ), but not with GI (Fig. 1f ) and only weakly with cognitive (OR 1.02, Fig. 1g ). C1QA, a degradation product released by complement activation 15 , was associated with GI (OR 1.08, Fig. 1f ) and cognitive (OR 1.03, Fig. 1g ). C1QA, which is known to mediate dementia-related neuroinflammation 16 , had the third strongest association with cognitive (Fig. 1g ). These observations indicated that myeloid inflammation and complement activation were associated with LC.

Increased expression of DPP10 and SCG3 was observed in the GI group compared with recovered (DPP10 OR 1.07 and SCG3 OR 1.08, Fig. 1f ). DPP10 is a membrane protein that modulates tissue inflammation, and increased DPP10 expression is associated with inflammatory bowel disease 17 , 18 , suggesting that GI symptoms may result from enteric inflammation. Elevated SCG3, a multifunctional protein that has been associated with irritable bowel syndrome 19 , suggested that noninflammatory disturbance of the brain–gut axis or dysbiosis may occur in the GI group. The cognitive group was associated with elevated CTSO (OR 1.04), NFASC (OR 1.03) and SPON-1 (OR 1.02, Fig. 1g,i ). NFASC and SPON-1 regulate neural growth 20 , 21 , while CTSO is a cysteine proteinase supporting tissue turnover 22 . The increased expression of these three proteins as well as C1QA and DPP10 in the cognitive group (Fig. 1g ) suggested neuroinflammation and alterations in nerve tissue repair, possibly resulting in neurodegeneration. Together, our findings indicated that complement activation and myeloid inflammation were common to all LC groups, but subtle differences were observed in the GI and cognitive groups, which may have mechanistic importance. Acutely elevated fibrinogen during hospitalization has been reported to be predictive of LC cognitive deficits 23 . We found elevated fibrinogen in LC relative to recovered (Extended Data Fig. 2a ; P = 0.0077), although this was not significant when restricted to the cognitive group ( P = 0.074). This finding supports our observation of complement pathway activation in LC and is in keeping with reports that complement dysregulation and thrombosis drive severe COVID-19 (ref. 24 ).

Elevated sCD58 was associated with lower odds of all LC symptoms and was most pronounced in cardioresp (OR 0.85, Fig. 1c,i ), fatigue (OR 0.80, Fig. 1d ) and anxiety/depression (OR 0.83, Fig. 1e ). IL-2 was negatively associated with the cardioresp (Fig. 1c , OR 0.87), fatigue (Fig. 1d , OR 0.80), anxiety/depression (Fig. 1e , OR 0.84) and cognitive (Fig. 1g , OR 0.96) groups. Both IL-2 and sCD58 have immunoregulatory functions 25 , 26 . Specifically, sCD58 suppresses IL-1- or IL-6-dependent interactions between CD2+ monocytes and CD58+ T or natural killer cells 26 . The association of sCD58 with recovered suggests a central role of dysregulated myeloid inflammation in LC. Elevated markers of tissue repair, IDS and DNER 27 , 28 , were also associated with recovered relative to all LC groups (Fig. 1c–g ). Taken together, our data suggest that suppression of myeloid inflammation and enhanced tissue repair were associated with recovered, supporting the use of immunomodulatory agents in therapeutic trials 29 (Supplementary Table 2 ).

We next sought to validate the experimental and analytical approaches used. Although Olink has been validated against other immunoassay platforms, showing superior sensitivity and specificity 30 , 31 , we confirmed the performance of Olink against chemiluminescent immunoassays within our cohort. We performed chemiluminescent immunoassays on plasma from a subgroup of 58 participants (recovered n = 13 and LC n = 45). There were good correlations between results from Olink (normalized protein expression (NPX)) and chemiluminescent immunoassays (pg ml −1 ) for CSF3, IL-1R2, IL-3RA, TNF and TFF2 (Extended Data Fig. 3 ). Most samples did not have concentrations of IL-2 detectable using a mesoscale discovery chemiluminescent assay, limiting this analysis to 14 samples (recovered n = 4, LC n = 10, R = 0.55 and P = 0.053, Extended Data Fig. 3 ). We next repeated our analysis using alternative definitions of LC. The Centers for Disease Control and Prevention and National Institute for Health and Care Excellence definitions for LC include symptoms occurring 1 month post infection 32 , 33 . Applying the 1 month post-infection definition added 62 participants to our analysis (recovered n = 21, 3 females and median age 61 years and LC n = 41, 15 females and median age 60 years, Extended Data Fig. 2c ), and the inflammatory associations with each LC group were consistent with our analysis based on the WHO definition (Extended Data Fig. 2d–h ). Finally, to validate the analytical approach (PLR) we examined the distribution of data, prioritizing proteins that were most strongly associated with each LC/recovered group (IL-1R2, MATN2, NFASC and sCD58). Each protein differed significantly between the LC and recovered groups (Fig. 1h,i and Extended Data Fig. 4 ), consistent with the PLR.
Alternative regression approaches (unadjusted regression models and partial least squares, PLS) reported results consistent with the original analysis of protein associations and LC outcome in the WHO-defined cohort (Fig. 1c–g , Supplementary Table 3 and Extended Data Figs. 5 and 6 ). The standard errors of PLS estimates were wide (Extended Data Fig. 6 ), consistent with previous demonstrations that PLR is the optimal method to analyze high-dimensional data where variables may have combined effects 34 . As inflammatory proteins are often collinear, working in tandem to mediate effects, we prioritized PLR results to draw conclusions.

To explore the relationship between inflammatory mediators associated with different LC symptoms, we performed a network analysis of Olink mediators highlighted by PLR within each LC group. COLEC12 and markers of endothelial and mucosal inflammation (MATN2, PCDH1, ROBO1, ISM1, ANGPTL2, TGF-α and TFF2) were highly correlated within the cardioresp, fatigue and anxiety/depression groups (Fig. 2 and Extended Data Fig. 7 ). Elevated PCDH1, an adhesion protein modulating airway inflammation 35 , was highly correlated with other inflammatory proteins associated with the cardioresp group (Fig. 2 ), suggesting that systemic inflammation may arise from the lung in these individuals. This was supported by increased expression of IL-3RA, which regulates innate immune responses in the lung through interactions with circulating IL-3 (ref. 36 ), in fatigue (Figs. 1d and 2 ), which correlated with markers of tissue inflammation, including PCDH1 (Fig. 2 ). MATN2 and ISM1, mucosal proteins that enhance inflammation 37 , 38 , were highly correlated in the GI group (Fig. 2 ), highlighting the role of tissue-specific inflammation in different LC groups. SCG3 correlated less closely with mediators in the GI group (Fig. 2 ), suggesting that the brain–gut axis may contribute separately to some GI symptoms. SPON-1, which regulates neural growth 21 , was the most highly correlated mediator in the cognitive group (Fig. 2 and Extended Data Fig. 7 ), highlighting that processes within nerve tissue may underlie this group. These observations suggested that inflammation might arise from mucosal tissues and that additional mechanisms may contribute to pathophysiology underlying the GI and cognitive groups.

Fig. 2. Network analysis of Olink mediators associated with cardioresp (n = 365), fatigue (n = 314), anxiety/depression (n = 202), GI (n = 124) and cognitive groups (n = 60). Each node corresponds to a protein mediator identified by PLR. The edges (blue lines) were weighted according to the size of Spearman’s rank correlation coefficient between proteins. All edges represent positive and significant correlations (P < 0.05) after FDR adjustment.

Women were more likely to experience LC (Table 1 ), as found in previous studies 1 . As estrogen can influence immunological responses 39 , we investigated whether hormonal differences between men and women with LC in our cohort explained this trend. We grouped men and women with LC symptoms into two age groups (those younger than 50 years and those 50 years and older, using age as a proxy for menopause status in women) and compared mediator levels between men and women in each age group, prioritizing those identified by PLR to be higher in LC compared with recovered. As we aimed to understand whether women with LC had stronger inflammatory responses than men with LC, we did not assess differences between men and women in the recovered group. IL-1R2 and MATN2 were significantly higher in women ≥50 years than men ≥50 years in the cardioresp group (Fig. 3a ) and the fatigue group (Fig. 3b ). In the GI group, CSF3 was higher in women ≥50 years compared with men ≥50 years (Fig. 3c ), indicating that the inflammatory markers observed in women were not likely to be estrogen-dependent. Women have been reported to have stronger innate immune responses to infection and to be at greater risk of autoimmunity 39 , possibly explaining why some women in the ≥50 years group had higher inflammatory proteins than men in the same group. Proteins associated with the anxiety/depression (IL-1R2 P  = 0.11 and MATN2 P  = 0.61, Extended Data Fig. 8a ) and cognitive groups (CTSO P  = 0.64 and NFASC P  = 0.41, Extended Data Fig. 8b ) were not different between men and women in either age group, consistent with the absent/weak association between sex and these outcomes identified by PLR (Fig. 1e,g ). Though our findings suggested that nonhormonal differences in inflammatory responses may explain why some women are more likely to have LC, they require confirmation in adequately powered studies.

Fig. 3. a–c, Olink-measured plasma protein levels (NPX) of IL-1R2 and MATN2 (a and b) and CSF3 (c) between LC men and LC women divided by age (<50 or ≥50 years) in the cardiorespiratory group (<50 years n = 8 and ≥50 years n = 270) (a), fatigue group (<50 years n = 81 and ≥50 years n = 227) (b) and GI group (<50 years n = 34 and ≥50 years n = 82) (c). The median values were compared between men and women using two-sided Wilcoxon signed-rank test, * P < 0.05, ** P < 0.01, *** P < 0.001 and **** P < 0.0001. The box plot center line represents the median, the boundaries represent IQR and the whisker length represents 1.5× IQR.

To test whether local respiratory tract inflammation persisted after COVID-19, we compared nasosorption samples from 89 participants (recovered, n  = 31; LC, n  = 33; and healthy SARS-CoV-2 naive controls, n  = 25, Supplementary Tables 4 and 5 ). Several inflammatory markers were elevated in the upper respiratory tract post COVID (including IL-1α, CXCL10, CXCL11, TNF, VEGF and TFF2) when compared with naive controls, but similar between recovered and LC (Fig. 4a ). In the cardioresp group ( n  = 29), inflammatory mediators elevated in plasma (for example, IL-6, APO-2, TGF-α and TFF2) were not elevated in the upper respiratory tract (Extended Data Fig. 9a ) and there was no correlation between plasma and nasal mediator levels (Extended Data Fig. 9b ). This exploratory analysis suggested upper respiratory tract inflammation post COVID was not specifically associated with cardiorespiratory symptoms.

Fig. 4. a, Nasal cytokines measured by immunoassay in post-COVID participants (n = 64) compared with healthy SARS-CoV-2 naive controls (n = 25), and between the cardioresp group (n = 29) and the recovered group (n = 31). The red values indicate significantly increased cytokine levels after FDR adjustment (P < 0.05) using two-tailed Wilcoxon signed-rank test. b, SARS-CoV-2 N antigen measured in sputum by electrochemiluminescence from recovered (n = 17) and pooled LC (n = 23) groups, compared with BALF from SARS-CoV-2 naive controls (n = 9). The horizontal dashed line indicates the lower limit of detection of the assay. c, Plasma S- and N-specific IgG responses measured by electrochemiluminescence in the LC (n = 35) and recovered (n = 19) groups. The median values were compared using two-sided Wilcoxon signed-rank tests, NS P > 0.05, * P < 0.05, ** P < 0.01, *** P < 0.001 and **** P < 0.0001. The box plot center lines represent the median, the boundaries represent IQR and the whisker length represents 1.5× IQR.

To explore whether SARS-CoV-2 persistence might explain the inflammatory profiles observed in the cardioresp group, we measured SARS-CoV-2 nucleocapsid (N) antigen in sputum from 40 participants (recovered n  = 17 and LC n  = 23) collected approximately 6 months post hospitalization (Supplementary Table 6 ). All samples were compared with prepandemic bronchoalveolar lavage fluid ( n  = 9, Supplementary Table 4 ). Only four samples (recovered n  = 2 and LC n  = 2) had N antigen above the assay’s lower limit of detection, and there was no difference in N antigen concentrations between LC and recovered (Fig. 4b , P  = 0.78). These observations did not exclude viral persistence, which might require tissue samples for detection 40 , 41 . On the basis of the hypothesis that persistent viral antigen might prevent a decline in antibody levels over time, we examined the titers of SARS-CoV-2-specific antibodies in unvaccinated individuals (recovered n  = 19 and LC n  = 35). SARS-CoV-2 N-specific ( P  = 0.023) and spike (S)-specific ( P  = 0.0040) immunoglobulin G (IgG) levels were elevated in LC compared with recovered (Fig. 4c ).

Overall, we identified myeloid inflammation and complement activation in the cardioresp, fatigue, anxiety/depression, cognitive and GI groups 6 months after hospitalization (Extended Data Fig. 10 ). Our findings build on results of smaller studies 5 , 6 , 42 and are consistent with a genome-wide association study that identified an independent association between LC and FOXP4 , which modulates neutrophilic inflammation and immune cell function 43 , 44 . In addition, we identified tissue-specific inflammatory elements, indicating that myeloid disturbance in different tissues may result in distinct symptoms. Multiple mechanisms for LC have been suggested, including autoimmunity, thrombosis, vascular dysfunction, SARS-CoV-2 persistence and latent virus reactivation 1 . All these processes involve myeloid inflammation and complement activation 45 . Complement activation in LC has been suggested in a proteomic study in 97 mostly nonhospitalized COVID-19 cases 42 and a study of 48 LC patients, of which one-third experienced severe acute disease 46 . As components of the complement system are known to have a short half-life 47 , ongoing complement activation suggests active inflammation rather than past tissue damage from acute infection.

Despite the heterogeneity of LC and the likelihood of coexisting or multiple etiologies, our work suggests some common pathways that might be targeted therapeutically and supports the rationale for several drugs currently under trial. Our finding of increased sCD58 levels (associated with suppression of monocyte–lymphocyte interactions 26 ) in the recovered group strengthens our conclusion that myeloid inflammation is central to the biology of LC and that trials of steroids, IL-1 antagonists, JAK inhibitors, naltrexone and colchicine are justified. Although anticoagulants such as apixaban might prevent thrombosis downstream of complement dysregulation, they can also increase the risk of serious bleeding when given after COVID-19 hospitalization 48 . Thus, clinical trials, already underway, need to carefully assess the risks and benefits of anticoagulants (Supplementary Table 2 ).

Our finding of elevated S- and N-specific IgG in LC could suggest viral persistence, as found in other studies 6 , 42 , 49 . Our network analysis indicated that inflammatory proteins in the cardioresp group interacted strongly with ISM1 and ROBO1, which are expressed during respiratory tract infection and regulate lung inflammation 50 , 51 . Although we were unable to find SARS-CoV-2 antigen in sputum from our LC cases, we did not test for viral persistence in GI tract and lung tissue 40 , 41 or in plasma 52 . Evidence of SARS-CoV-2 persistence would justify trials of antiviral drugs (singly or in combination) in LC. It is also possible that autoimmune processes could result in an innate inflammatory profile in LC. Autoreactive B cells have been identified in LC patients with higher SARS-CoV-2-specific antibody titers in a study of mostly mild acute COVID cases (59% WHO 2–3) 42 , a different population from our study of hospitalized cases.

Our observations of distinct protein profiles in GI and cognitive groups support previous reports on distinct associations between Epstein–Barr virus reactivation and neurological symptoms, or autoantibodies and GI symptoms relative to other forms of LC 49 , 53 . We did not assess autoantibody induction but found evidence of brain–gut axis disturbance (SCG3) in the GI group, which occurs in many autoimmune diseases 54 . We found signatures suggestive of neuroinflammation (C1QA) in the cognitive group, consistent with findings of brain abnormalities on magnetic resonance imaging after COVID-19 hospitalization 55 , as well as findings of microglial activation in mice after COVID-19 (ref. 56 ). Proinflammatory signatures dominated in the cardioresp, fatigue and anxiety/depression groups and were consistent with those seen in non-COVID depression, suggesting shared mechanisms 57 . The association between markers of myeloid inflammation, including IL-3RA, and symptoms was greatest for fatigue. Whilst membrane-bound IL-3RA facilitates IL-3 signaling upstream of myelopoiesis 36 , its soluble form (measured in plasma) can bind IL-3 and act as a decoy receptor, preventing monocyte maturation and enhancing immunopathology 58 . Monocytes from individuals with post-COVID fatigue are reported to have abnormal expression profiles (including reduced CXCR2), suggestive of altered maturation and migration 5 , 59 . Lung-specific inflammation was suggested by the association between PCDH1 (an airway epithelial adhesion molecule 35 ) and cardioresp symptoms.

Our observations do not align with all published observations on LC. One proteomic study of 55 LC cases after generally mild (WHO 2–3) acute disease found that TNF and IFN signatures were elevated in LC 3 . Vasculoproliferative processes and metabolic disturbance have been reported in LC 4 , 60 , but these studies used uninfected healthy individuals for comparison and cannot distinguish between LC-specific phenomena and residual post-COVID inflammation. A study of 63 adults (LC, n  = 50 and recovered, n  = 13) reported no association between immune cell activation and LC 3 months after infection 61 , though myeloid inflammation was not directly measured, and 3 months post infection may be too early to detect subtle differences between LC and recovered cases due to residual acute inflammation.

Our study has limitations. We designed the study to identify inflammatory markers that indicate pathways underlying LC subgroups rather than diagnostic biomarkers. The ORs we report are small, but associations were consistent across alternative methods of analysis and when using different LC definitions. Small effect sizes can be expected when using PLR, which shrinks correlated mediator coefficients to reflect combined effects and prevent collinear inflation 62 , and could also result from measurement of plasma mediators that may underestimate tissue inflammation. Although our LC cohort is large compared with most other published studies, some of our subgroups are small (only 60 cases were designated cognitive). Though the performance of the cognitive PLR model was adequate, our findings should be validated in larger studies. It should be noted that our cohort of hospitalized cases may not represent all types of LC, especially those occurring after mild infection. We looked for an effect of acute disease severity within our study and did not find it, and are reassured that the inflammatory profiles we observed were consistent with those seen in smaller studies including nonhospitalized cases 42 , 46 . Studies of posthospital LC may be confounded by ‘posthospital syndrome’, which encompasses general and nonspecific effects of hospitalization (particularly intensive care) 63 .

In conclusion, we found markers of myeloid inflammation and complement activation in our large prospective posthospital cohort of patients with LC, in addition to distinct inflammatory patterns in patients with cognitive impairment or gastrointestinal symptoms. These findings show the need to consider subphenotypes in managing patients with LC and support the use of antiviral or immunomodulatory agents in controlled therapeutic trials.

Study design and ethics

After hospitalization for COVID-19, adults who had no comorbidity resulting in a prognosis of less than 6 months were recruited to the PHOSP-COVID study ( n  = 719). Patients hospitalized between February 2020 and January 2021 were recruited. Both sexes were recruited and gender was self-reported (female, n  = 257 and male, n  = 462). Written informed consent was obtained from all patients. Ethical approvals for the PHOSP-COVID study were given by Leeds West Research Ethics Committee (20/YH/0225).

Symptom data and samples were prospectively collected from individuals approximately 6 months (IQR 5.1–6.8 months and range 3.0–8.3 months) post hospitalization (Fig. 1a ), via the PHOSP-COVID multicenter United Kingdom study 64 . Data relating to patient demographics and acute admission were collected via the International Severe Acute Respiratory and Emerging Infection Consortium World Health Organization Clinical Characterisation Protocol United Kingdom (ISARIC4C study; IRAS260007/IRAS126600) (ref. 65 ). Adults hospitalized during the SARS-CoV-2 pandemic were systematically recruited into ISARIC4C. Written informed consent was obtained from all patients. Ethical approval was given by the South Central–Oxford C Research Ethics Committee in England (reference 13/SC/0149), Scotland A Research Ethics Committee (20/SS/0028) and WHO Ethics Review Committee (RPC571 and RPC572, 25 April 2013).

Data were collected to account for variables affecting symptom outcome, via hospital records and self-reporting. Acute disease severity was classified according to the WHO clinical progression score: WHO class 3–4: no oxygen therapy; class 5: oxygen therapy; class 6: noninvasive ventilation or high-flow nasal oxygen; and class 7–9: managed in critical care 9 . Clinical data were used to place patients into six categories: ‘recovered’, ‘GI’, ‘cardiorespiratory’, ‘fatigue’, ‘cognitive impairment’ and ‘anxiety/depression’ (Supplementary Table 7 ). Patient-reported symptoms and validated clinical scores were used when feasible, including Medical Research Council (MRC) breathlessness score, dyspnea-12 score, Functional Assessment of Chronic Illness Therapy (FACIT) score, Patient Health Questionnaire (PHQ)-9 and Generalized Anxiety Disorder (GAD)-7. Cognitive impairment was defined as a Montreal Cognitive Assessment score <26. GI symptoms were defined as answering ‘Yes’ to the presence of at least two of the listed symptoms. ‘Recovered’ was defined by self-reporting. Patients were placed in multiple groups if they experienced a combination of symptoms.
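The classification rules stated above can be illustrated with a minimal Python sketch. It covers only the thresholds given in this section (the WHO severity bands, the Montreal Cognitive Assessment cutoff and the GI symptom count); the criteria for the other groups are in Supplementary Table 7 and are not reproduced here. The function and argument names are hypothetical and are not taken from the study code.

```python
def who_severity_band(score):
    """Map a WHO clinical progression score to the severity band used in the text."""
    if score in (3, 4):
        return "class 3-4: no oxygen therapy"
    if score == 5:
        return "class 5: oxygen therapy"
    if score == 6:
        return "class 6: noninvasive ventilation or high-flow nasal oxygen"
    if score in (7, 8, 9):
        return "class 7-9: managed in critical care"
    raise ValueError("score outside the hospitalized range 3-9")


def symptom_flags(moca_score, gi_symptoms_reported):
    """Apply the two thresholds stated in the text.

    moca_score: Montreal Cognitive Assessment total score.
    gi_symptoms_reported: number of listed GI symptoms answered 'Yes'.
    A patient can be flagged for more than one group.
    """
    return {
        "cognitive impairment": moca_score < 26,
        "GI": gi_symptoms_reported >= 2,
    }


print(who_severity_band(6))
print(symptom_flags(moca_score=24, gi_symptoms_reported=1))
```

Returning independent flags rather than a single label mirrors the statement that patients were placed in multiple groups when they experienced a combination of symptoms.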

Matched nasal fluid and sputum samples were prospectively collected from a subgroup of convalescent patients approximately 6 months after hospitalization via the PHOSP-COVID study. Nasal and bronchoalveolar lavage fluid (BALF) collected from healthy volunteers before the COVID-19 pandemic were used as controls (Supplementary Table 4 ). Written consent was obtained for all individuals and ethical approvals were given by London–Harrow Research Ethics Committee (13/LO/1899) for the collection of nasal samples and the Health Research Authority London–Fulham Research Ethics Committee (IRAS project ID 154109; references 14/LO/1023, 10/H0711/94 and 11/LO/1826) for BALF samples.

Ethylenediaminetetraacetic acid plasma was collected from whole blood taken by venepuncture and frozen at −80 °C as previously described 7 , 66 . Nasal fluid was collected using a Nasosorption™ FX·I device (Hunt Developments), which uses a synthetic absorptive matrix to collect concentrated nasal fluid. Samples were eluted and stored as previously described 67 . Sputum samples were collected via passive expectoration and frozen at −80 °C without the addition of buffers. Sputum samples from convalescent individuals were compared with BALF from healthy SARS-CoV-2-naive controls, collected before the pandemic. BALF samples were used as a comparison for lower respiratory tract samples since passively expectorated sputum from healthy SARS-CoV-2-naive individuals was not available. BALF samples were obtained by instillation and recovery of up to 240 ml of normal saline via a fiberoptic bronchoscope. BALF was filtered through 100 µm strainers into sterile 50 ml Falcon tubes, then centrifuged for 10 min at 400 g at 4 °C. The resulting supernatant was transferred into sterile 50 ml Falcon tubes and frozen at −80 °C until use. The full methods for BALF collection and processing have been described previously 68 , 69 .


To determine inflammatory signatures that associated with symptom outcomes, plasma samples were analyzed on an Olink Explore 384 Inflammation panel 70 . Supplementary Table 8 (Appendix 1 ) lists all the analytes measured. To ensure the validity of results, samples were run in a single batch with the use of negative controls, plate controls in triplicate and repeated measurement of patient samples between plates in duplicate. Samples were randomized between plates according to site and sample collection date. Randomization between plates was blind to LC/recovered outcome. Data were first normalized to an internal extension control that was included in each sample well. Plates were standardized by normalizing to interplate controls, run in triplicate on each plate. Each plate contained a minimum of four patient samples, which were duplicates on another plate; these duplicate pairs allowed any plate to be linked to any other through the duplicates. Data were then intensity normalized across all cohort samples. Finally, Olink results underwent quality control processing and samples or analytes that did not reach quality control standards were excluded. Final normalized relative protein quantities were reported as log2 NPX values.
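The three normalization stages described above can be sketched in Python as follows. This is an illustration only: the text does not give the exact arithmetic, so the sketch assumes that all values are already on a log2 scale, that each control is removed by subtraction, and that intensity normalization centers each plate's median on the overall median. All function and variable names are hypothetical.

```python
from statistics import median


def normalize_plate(raw_log2, extension_control_log2, interplate_controls_log2):
    """Normalize one analyte on one plate (all inputs assumed to be log2 values).

    raw_log2: one value per sample well.
    extension_control_log2: the internal extension control for each of those wells.
    interplate_controls_log2: the triplicate interplate control values on this plate,
        assumed to be extension-normalized already.
    """
    # Stage 1: remove well-to-well technical variation using the extension control.
    ext_normalized = [r - e for r, e in zip(raw_log2, extension_control_log2)]
    # Stage 2: standardize the plate against the median of its interplate controls.
    ipc_median = median(interplate_controls_log2)
    return [v - ipc_median for v in ext_normalized]


def intensity_normalize(values_by_plate):
    """Stage 3: shift every plate so that its median equals the overall median."""
    overall = median(v for vals in values_by_plate.values() for v in vals)
    normalized = {}
    for plate_id, vals in values_by_plate.items():
        shift = overall - median(vals)
        normalized[plate_id] = [v + shift for v in vals]
    return normalized


plate_a = normalize_plate([10.0, 12.0], [2.0, 3.0], [5.0, 6.0, 7.0])
print(plate_a)
print(intensity_normalize({"A": [1.0, 2.0, 3.0], "B": [3.0, 4.0, 5.0]}))
```

Working on the log2 scale turns each ratio-based correction into a subtraction, which is why the final quantities are relative (NPX) rather than absolute concentrations.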

To further validate our findings, we performed conventional electrochemiluminescence (ECL) assays and enzyme-linked immunosorbent assay for Olink mediators that were associated with symptom outcome ( Supplementary Methods ). Contemporaneously collected plasma samples were available from 58 individuals. Like most omics platforms, Olink measures relative quantities, so perfect agreement with conventional assays that measure absolute concentrations is not expected.

Sputum samples were thawed before analysis and sputum plugs were extracted with the addition of 0.1% dithiothreitol creating a one in two sample dilution, as previously described 71 . SARS-CoV-2 S and N proteins were measured by ECL S-plex assay at a fixed dilution of one in two (Mesoscale Diagnostics), as per the manufacturer’s protocol 72 . Control BALF samples were thawed and measured on the same plate, neat. The S-plex assay is highly sensitive in detecting viral antigen in respiratory tract samples 73 .

Nasal cytokines were measured by ECL (Mesoscale Discovery) and Luminex bead multiplex assays (Bio-Techne). The full methods and list of analytes are detailed in Supplementary Methods .

Statistics and reproducibility

Clinical data were collected via the PHOSP REDCap database, to which access is available upon reasonable request as per the data sharing statement in the manuscript. All analyses were performed within the Outbreak Data Analysis Platform (ODAP). All data and code can be accessed using information in the ‘Data sharing’ and ‘Code sharing’ statements at the end of the manuscript. No statistical method was used to predetermine sample size. Data distribution was assumed to be normal but this was not formally tested. Olink assays and immunoassays were randomized and investigators were blinded to outcomes.

To determine protein signatures that associated with each symptom outcome, a ridge PLR was used. PLR shrinks coefficients to account for combined effects within high-dimensional data, preventing false discovery while managing multicollinearity 34 . Thus, PLR was chosen a priori as the most appropriate model to assess associations between a large number of explanatory variables (that may work together to mediate effects) and symptom outcome 34 , 62 , 70 , 74 . In keeping with our aim to perform an unbiased exploration of inflammatory processes, the model alpha was set to zero, facilitating regularization without complete penalization of any mediator. This enabled review of all possible mediators that might associate with LC 62 .
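For reference, the penalized objective can be written out explicitly. The form below follows the elastic-net parameterization documented for the glmnet package, which is listed among the packages used; the notation (N individuals, binary outcome y_i, vector of mediators and covariates x_i, intercept β_0, coefficients β, penalty strength λ and mixing parameter α) is introduced here for illustration and does not appear in the original text.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\left\{
 -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i\left(\beta_0 + x_i^{\top}\beta\right)
 - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
 + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2} + \alpha\,\lVert\beta\rVert_1\right]
\right\}
```

With α = 0 the penalty reduces to (λ/2)‖β‖₂², the ridge case. The squared penalty shrinks coefficients toward zero but does not set any of them exactly to zero, so every mediator remains in the model and can be reviewed.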

Nested tenfold cross-validation with 50 repeats was used to select the optimal lambda for each model and to assess its accuracy (Extended Data Fig. 1 ). The performance of the cognitive impairment model was influenced by the imbalance in size of the symptom group ( n  = 60) relative to recovered ( n  = 250). The model was weighted to account for this imbalance, resulting in a sensitivity of 0.98, indicating its validity. We have expanded on the model performance and validation approaches in Supplementary Information .
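The structure of the validation scheme described above can be sketched in Python. The sketch generates the index splits for repeated nested cross-validation and computes class weights for the cognitive model. Two points are assumptions rather than details from the text: the folds are drawn by simple shuffling without stratification, and the weights use inverse class frequency, since the weighting scheme actually applied is not stated. All names are hypothetical.

```python
import random


def kfold_indices(n, k, rng):
    """Shuffle the positions 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(n, outer_k=10, inner_k=10, repeats=50, seed=0):
    """Yield (repeat, train, test, inner_folds) for repeated nested cross-validation.

    The outer test fold is used only to assess accuracy; the inner folds are drawn
    from the outer training set only and would be used to choose lambda.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        for test in kfold_indices(n, outer_k, rng):
            held_out = set(test)
            train = [i for i in range(n) if i not in held_out]
            inner_positions = kfold_indices(len(train), inner_k, rng)
            inner_folds = [[train[j] for j in fold] for fold in inner_positions]
            yield rep, train, test, inner_folds


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (number of classes * class size)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * count) for label, count in class_counts.items()}


# Group sizes for the cognitive model as reported in the text.
weights = inverse_frequency_weights({"cognitive": 60, "recovered": 250})
print(weights)
n_splits = sum(1 for _ in nested_cv_splits(310, repeats=50))
print(n_splits)  # 50 repeats x 10 outer folds
```

Keeping the inner folds strictly inside the outer training set is what prevents the choice of lambda from leaking information about the held-out fold into the accuracy estimate.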

Age, sex, acute disease severity and preexisting comorbidities were included as covariates in the PLR analysis (Supplementary Tables 1 and 3 ). Covariates were selected a priori using features reported to influence the risk of LC and inflammatory responses 1 , 39 , 64 , 75 . Ethnicity was not included since it has been shown not to predict symptom outcome in this cohort 64 . Individuals with missing data were excluded from the regression analysis. Each symptom group was compared with the ‘recovered’ group. The model coefficients of each covariate were converted into ORs for each outcome and visualized in a forest plot, after removing variables associated with regularized OR between 0.98 and 1.02 or, in cases where most variables fell outside of this range, using mediators associated with the highest decile of coefficients either side of this range. This enabled exclusion of mediators with effect sizes that were unlikely to have clinical or mechanistic importance since the ridge PLR shrinks and orders coefficients according to their relative importance rather than making estimates with standard error. Thus, confidence intervals cannot be appropriately derived from PLR, and forest plot error bars were calculated using the median accuracy of the model generated by the nested cross-validation. To verify observations made through PLR analysis, we also performed an unadjusted PLR, an unadjusted logistic regression and a PLS analysis. Univariate analyses using the Wilcoxon signed-rank test were also performed (Supplementary Table 8 , Appendix 1 ). Analyses were performed in R version 4.2.0 using ‘data.table v1.14.2’, ‘EnvStats v2.7.0’, ‘tidyverse v1.3.2’, ‘lme4 v1.1-32’, ‘caret v6.0-93’, ‘glmnet v4.1-6’, ‘mdatools v0.14.0’, ‘ggpubr v0.4.0’ and ‘ggplot2 v3.3.6’ packages.
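The conversion and filtering step described above can be illustrated with a short Python sketch. A logistic regression coefficient β corresponds to an odds ratio of exp(β), and variables whose OR lies between 0.98 and 1.02 are dropped before plotting. The coefficient values and variable names below are placeholders chosen for illustration and are not study results; treating the band limits as inclusive is also an assumption.

```python
import math


def coefficients_to_odds_ratios(coefficients):
    """Convert logistic regression coefficients (log-odds) to odds ratios."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


def drop_near_null(odds_ratios, lower=0.98, upper=1.02):
    """Keep only variables whose OR falls outside the near-null band [lower, upper]."""
    return {name: oratio for name, oratio in odds_ratios.items()
            if oratio < lower or oratio > upper}


# Placeholder coefficients, for illustration only.
example = {"protein_a": 0.05, "protein_b": 0.01, "protein_c": -0.03}
kept = drop_near_null(coefficients_to_odds_ratios(example))
print(sorted(kept))  # protein_b is removed because its OR is about 1.01
```

The alternative rule in the text (taking the highest decile of coefficients on either side of the band when most variables fall outside it) is not shown.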

To further investigate the relationship between proteins elevated in each symptom group, we performed a correlation network analysis using Spearman’s rank correlation coefficient and false discovery rate (FDR) thresholding. The mediators visualized in the PLR forest plots, which were associated with cardiorespiratory symptoms, fatigue, anxiety/depression, GI symptoms and cognitive impairment, were used, respectively. Analyses were performed in R version 4.2.0 using ‘bootnet v1.5.6’ and ‘qgraph v1.9.8’ packages.
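As a minimal illustration of how the network edges are weighted, the following Python sketch computes Spearman's rank correlation as the Pearson correlation of average ranks and builds an edge list from positive correlations. It is not the bootnet/qgraph implementation used in the study: significance testing and FDR thresholding are omitted, and the input values and names are invented for the example.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def positive_edges(levels):
    """Return (protein_1, protein_2, rho) for every pair with rho > 0."""
    edges = []
    for a, b in combinations(sorted(levels), 2):
        rho = spearman(levels[a], levels[b])
        if rho > 0:
            edges.append((a, b, rho))
    return edges


# Invented NPX-like values for three hypothetical proteins in five samples.
toy = {"p1": [1.0, 2.0, 3.0, 4.0, 5.0],
       "p2": [1.1, 1.9, 3.5, 3.9, 6.0],
       "p3": [5.0, 4.0, 3.0, 2.0, 1.0]}
print(positive_edges(toy))
```

Because the coefficient depends only on ranks, it is insensitive to the monotone log2 transformation applied to NPX values.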

To determine whether differences in protein levels between men and women related to hormonal differences, we divided each symptom group into premenopausal and postmenopausal groups using an age cutoff of 50 years old. Differences between sexes in each group were determined using the Wilcoxon signed-rank test. To understand whether antigen persistence contributed to inflammation in adults with LC, the median viral antigen concentration from sputum/BALF samples and cytokine concentrations from nasal samples were compared using the Wilcoxon signed-rank test. All tests were two-tailed and statistical significance was defined as a P value < 0.05 after adjustment for FDR ( q -value of 0.05). Analyses were performed in R version 4.2.0 using ‘bootnet v1.5.6’ and ‘qgraph v1.9.8’ packages.
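The FDR adjustment step can be sketched in Python as follows. The text does not name the procedure; the sketch assumes the Benjamini–Hochberg step-up method, which is a common choice for FDR control. The P values in the example are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order.

    Each P value is multiplied by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest P value downward keeps the adjusted values
    monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented P values for four hypothetical comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw)
significant = [qi < 0.05 for qi in q]
print(q)
print(significant)
```

In this example only the first comparison remains below 0.05 after adjustment, although three of the four raw P values were below 0.05.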

Extended Data Fig. 10 was made using Biorender, accessed at www.biorender.com .

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

This is an open access article under the CC BY 4.0 license.

The PHOSP-COVID protocol, consent form, definition and derivation of clinical characteristics and outcomes, training materials, regulatory documents, information about requests for data access, and other relevant study materials are available online at ref. 76 . Access to these materials can be granted by contacting [email protected] and [email protected].

The ISARIC4C protocol, data sharing and publication policy are available at https://isaric4c.net . ISARIC4C’s Independent Data and Material Access Committee welcomes applications for access to data and materials ( https://isaric4c.net ).

The datasets used in the study contain extensive clinical information at an individual level that prevents them from being deposited in a public repository due to data protection policies of the study. Study data can only be accessed via the ODAP, a protected research environment. All data used in this study are available within ODAP and accessible upon reasonable request. Data access criteria and information about how to request access are available online at ref. 76 . If criteria are met and a request is made, access can be gained by signing the eDRIS user agreement.

Code availability

Code was written within the ODAP, using R v4.2.0 and publicly available packages (‘data.table v1.14.2’, ‘EnvStats v2.7.0’, ‘tidyverse v1.3.2’, ‘lme4 v1.1-32’, ‘caret v6.0-93’, ‘glmnet v4.1-6’, ‘mdatools v0.14.0’, ‘ggpubr v0.4.0’, ‘ggplot2 v3.3.6’, ‘bootnet v1.5.6’ and ‘qgraph v1.9.8’). No new algorithms or functions were created and the code used built-in functions from the listed packages, which are available on CRAN. The code used to generate data and to analyze data is publicly available at https://github.com/isaric4c/wiki/wiki/ISARIC ; https://github.com/SurgicalInformatics/cocin_cc and https://github.com/ClaudiaEfstath/PHOSP_Olink_NatImm .

Davis, H. E., McCorkell, L., Vogel, J. M. & Topol, E. J. Long COVID: major findings, mechanisms and recommendations. Nat. Rev. Microbiol. 21 , 133–146 (2023).

Antonelli, M., Pujol, J. C., Spector, T. D., Ourselin, S. & Steves, C. J. Risk of long COVID associated with delta versus omicron variants of SARS-CoV-2. Lancet 399 , 2263–2264 (2022).

Talla, A. et al. Persistent serum protein signatures define an inflammatory subcategory of long COVID. Nat. Commun. 14 , 3417 (2023).

Captur, G. et al. Plasma proteomic signature predicts who will get persistent symptoms following SARS-CoV-2 infection. EBioMedicine 85 , 104293 (2022).

Scott, N. A. et al. Monocyte migration profiles define disease severity in acute COVID-19 and unique features of long COVID. Eur. Respir. J. https://doi.org/10.1183/13993003.02226-2022 (2023).

Klein, J. et al. Distinguishing features of Long COVID identified through immune profiling. Nature https://doi.org/10.1038/s41586-023-06651-y (2023).

Evans, R. A. et al. Clinical characteristics with inflammation profiling of long COVID and association with 1-year recovery following hospitalisation in the UK: a prospective observational study. Lancet Respir. Med. 10 , 761–775 (2022).

Houchen-Wolloff, L. et al. Joint patient and clinician priority setting to identify 10 key research questions regarding the long-term sequelae of COVID-19. Thorax 77 , 717–720 (2022).

Marshall, J. C. et al. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect. Dis. 20 , e192–e197 (2020).

Post COVID-19 condition (long COVID). World Health Organization https://www.who.int/europe/news-room/fact-sheets/item/post-covid-19-condition#:~:text=Definition,months%20with%20no%20other%20explanation (2022).

Peters, V. A., Joesting, J. J. & Freund, G. G. IL-1 receptor 2 (IL-1R2) and its role in immune regulation. Brain Behav. Immun. 32 , 1–8 (2013).

Luo, Z. et al. Monocytes augment inflammatory responses in human aortic valve interstitial cells via β2-integrin/ICAM-1-mediated signaling. Inflamm. Res. 71 , 681–694 (2022).

Bendall, L. J. & Bradstock, K. F. G-CSF: from granulopoietic stimulant to bone marrow stem cell mobilizing agent. Cytokine Growth Factor Rev. 25 , 355–367 (2014).

Ma, Y. J. et al. Soluble collectin-12 (CL-12) is a pattern recognition molecule initiating complement activation via the alternative pathway. J. Immunol. 195 , 3365–3373 (2015).

Laursen, N. S. et al. Functional and structural characterization of a potent C1q inhibitor targeting the classical pathway of the complement system. Front. Immunol. 11 , 1504 (2020).

Dejanovic, B. et al. Complement C1q-dependent excitatory and inhibitory synapse elimination by astrocytes and microglia in Alzheimer’s disease mouse models. Nat. Aging 2 , 837–850 (2022).

Xue, G., Hua, L., Zhou, N. & Li, J. Characteristics of immune cell infiltration and associated diagnostic biomarkers in ulcerative colitis: results from bioinformatics analysis. Bioengineered 12 , 252–265 (2021).

He, T. et al. Integrative computational approach identifies immune‐relevant biomarkers in ulcerative colitis. FEBS Open Bio. 12 , 500–515 (2022).

Sundin, J. et al. Fecal chromogranins and secretogranins are linked to the fecal and mucosal intestinal bacterial composition of IBS patients and healthy subjects. Sci. Rep. 8 , 16821 (2018).

Kriebel, M., Wuchter, J., Trinks, S. & Volkmer, H. Neurofascin: a switch between neuronal plasticity and stability. Int. J. Biochem. Cell Biol. 44 , 694–697 (2012).

Woo, W.-M. et al. The C. elegans F-spondin family protein SPON-1 maintains cell adhesion in neural and non-neural tissues. Development 135 , 2747–2756 (2008).

Yadati, T., Houben, T., Bitorina, A. & Shiri-Sverdlov, R. The ins and outs of cathepsins: physiological function and role in disease management. Cells 9 , 1679 (2020).

Taquet, M. et al. Acute blood biomarker profiles predict cognitive deficits 6 and 12 months after COVID-19 hospitalization. Nat. Med. https://doi.org/10.1038/s41591-023-02525-y (2023).

Siggins, M. K. et al. Alternative pathway dysregulation in tissues drives sustained complement activation and predicts outcome across the disease course in COVID‐19. Immunology 168 , 473–492 (2023).

Pol, J. G., Caudana, P., Paillet, J., Piaggio, E. & Kroemer, G. Effects of interleukin-2 in immunostimulation and immunosuppression. J. Exp. Med. 217 , e20191247 (2020).

Zhang, Y., Liu, Q., Yang, S. & Liao, Q. CD58 immunobiology at a glance. Front. Immunol. 12 , 705260 (2021).

Demydchuk, M. et al. Insights into Hunter syndrome from the structure of iduronate-2-sulfatase. Nat. Commun. 8 , 15786 (2017).

Wang, Z. et al. DNER promotes epithelial–mesenchymal transition and prevents chemosensitivity through the Wnt/β-catenin pathway in breast cancer. Cell Death Dis. 11 , 642 (2020).

Bonilla, H. et al. Therapeutic trials for long COVID-19: a call to action from the interventions taskforce of the RECOVER initiative. Front. Immunol. 14 , 1129459 (2023).

Wik, L. et al. Proximity extension assay in combination with next-generation sequencing for high-throughput proteome-wide analysis. Mol. Cell. Proteomics 20 , 100168 (2021).

Measuring protein biomarkers with Olink—technical comparisons and orthogonal validation. Olink Proteomics https://www.olink.com/content/uploads/2021/09/olink-technical-comparisons-and-orthogonal-validation-1118-v2.0.pdf (2021).

COVID-19 rapid guideline: managing the long-term effects of COVID-19. National Institute for Health and Care Excellence (NICE), Scottish Intercollegiate Guidelines Network (SIGN) and Royal College of General Practitioners (RCGP) https://www.nice.org.uk/guidance/ng188/resources/covid19-rapid-guideline-managing-the-longterm-effects-of-covid19-pdf-51035515742 (2022).

Long COVID or post-COVID conditions. Centers for Disease Control and Prevention https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html#:~:text=Long%20COVID%20is%20broadly%20defined,after%20acute%20COVID%2D19%20infection (2023).

Firinguetti, L., Kibria, G. & Araya, R. Study of partial least squares and ridge regression methods. Commun. Stat. Simul. Comput. 46 , 6631–6644 (2017).

Mortensen, L. J., Kreiner-Moller, E., Hakonarson, H., Bonnelykke, K. & Bisgaard, H. The PCDH1 gene and asthma in early childhood. Eur. Respir. J. 43 , 792–800 (2014).

Tong, Y. et al. The RNFT2/IL-3Rα axis regulates IL-3 signaling and innate immunity. JCI Insight 5 , e133652 (2020).

Wu, Y. et al. Effect of ISM1 on the immune microenvironment and epithelial-mesenchymal transition in colorectal cancer. Front. Cell Dev. Biol. 9 , 681240 (2021).

Luo, G. G. & Ou, J. J. Oncogenic viruses and cancer. Virol. Sin. 30 , 83–84 (2015).

Klein, S. L. & Flanagan, K. L. Sex differences in immune responses. Nat. Rev. Immunol. 16 , 626–638 (2016).

Gaebler, C. et al. Evolution of antibody immunity to SARS-CoV-2. Nature 591 , 639–644 (2021).

Bussani, R. et al. Persistent SARS‐CoV‐2 infection in patients seemingly recovered from COVID‐19. J. Pathol. 259 , 254–263 (2023).

Woodruff, M. C. et al. Chronic inflammation, neutrophil activity, and autoreactivity splits long COVID. Nat. Commun. 14 , 4201 (2023).

Lammi, V. et al. Genome-wide association study of long COVID. Preprint at medRxiv https://doi.org/10.1101/2023.06.29.23292056 (2023).

Ismailova, A. et al. Identification of a forkhead box protein transcriptional network induced in human neutrophils in response to inflammatory stimuli. Front. Immunol. 14 , 1123344 (2023).

Beurskens, F. J., van Schaarenburg, R. A. & Trouw, L. A. C1q, antibodies and anti-C1q autoantibodies. Mol. Immunol. 68 , 6–13 (2015).

Cervia-Hasler, C. et al. Persistent complement dysregulation with signs of thromboinflammation in active long Covid. Science 383 , eadg7942 (2024).

Morgan, B. P. & Harris, C. L. Complement, a target for therapy in inflammatory and degenerative diseases. Nat. Rev. Drug Discov. 14 , 857–877 (2015).

Toshner, M. R. et al. Apixaban following discharge in hospitalised adults with COVID-19: preliminary results from a multicentre, open-label, randomised controlled platform clinical trial. Preprint at medRxiv https://doi.org/10.1101/2022.12.07.22283175 (2022).

Su, Y. et al. Multiple early factors anticipate post-acute COVID-19 sequelae. Cell 185 , 881–895.e20 (2022).

Branchfield, K. et al. Pulmonary neuroendocrine cells function as airway sensors to control lung immune response. Science 351 , 707–710 (2016).

Rivera-Torruco, G. et al. Isthmin 1 identifies a subset of lung hematopoietic stem cells and it is associated with systemic inflammation. J. Immunol. 202 , 118.18 (2019).

Swank, Z. et al. Persistent circulating severe acute respiratory syndrome coronavirus 2 spike is associated with post-acute coronavirus disease 2019 sequelae. Clin. Infect. Dis. 76 , e487–e490 (2023).

Peluso, M. J. et al. Chronic viral coinfections differentially affect the likelihood of developing long COVID. J. Clin. Invest. 133 , e163669 (2023).

Bellocchi, C. et al. The interplay between autonomic nervous system and inflammation across systemic autoimmune diseases. Int. J. Mol. Sci. 23 , 2449 (2022).

Raman, B. et al. Multiorgan MRI findings after hospitalisation with COVID-19 in the UK (C-MORE): a prospective, multicentre, observational cohort study. Lancet Respir. Med. 11 , 1003–1019 (2023).

Fernández-Castañeda, A. et al. Mild respiratory COVID can cause multi-lineage neural cell and myelin dysregulation. Cell 185 , 2452–2468.e16 (2022).

Dantzer, R., O’Connor, J. C., Freund, G. G., Johnson, R. W. & Kelley, K. W. From inflammation to sickness and depression: when the immune system subjugates the brain. Nat. Rev. Neurosci. 9 , 46–56 (2008).

Broughton, S. E. et al. Dual mechanism of interleukin-3 receptor blockade by an anti-cancer antibody. Cell Rep. 8 , 410–419 (2014).

Ley, K., Miller, Y. I. & Hedrick, C. C. Monocyte and macrophage dynamics during atherogenesis. Arterioscler. Thromb. Vasc. Biol. 31 , 1506–1516 (2011).

Iosef, C. et al. Plasma proteome of long-COVID patients indicates HIF-mediated vasculo-proliferative disease with impact on brain and heart function. J. Transl. Med. 21 , 377 (2023).

Santopaolo, M. et al. Prolonged T-cell activation and long COVID symptoms independently associate with severe COVID-19 at 3 months. eLife 12 , e85009 (2023).

Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 , 1–22 (2010).

Voiriot, G. et al. Chronic critical illness and post-intensive care syndrome: from pathophysiology to clinical challenges. Ann. Intensive Care 12 , 58 (2022).

Evans, R. A. et al. Physical, cognitive, and mental health impacts of COVID-19 after hospitalisation (PHOSP-COVID): a UK multicentre, prospective cohort study. Lancet Respir. Med. 9 , 1275–1287 (2021).

Docherty, A. B. et al. Features of 20,133 UK patients in hospital with covid-19 using the ISARIC WHO clinical characterisation protocol: prospective observational cohort study. BMJ https://doi.org/10.1136/bmj.m1985 (2020).

Elneima, O. et al. Cohort profile: post-hospitalisation COVID-19 study (PHOSP-COVID). Preprint at medRxiv https://doi.org/10.1101/2023.05.08.23289442 (2023).

Liew, F. et al. SARS-CoV-2-specific nasal IgA wanes 9 months after hospitalisation with COVID-19 and is not induced by subsequent vaccination. EBioMedicine 87 , 104402 (2023).

Ascough, S. et al. Divergent age-related humoral correlates of protection against respiratory syncytial virus infection in older and young adults: a pilot, controlled, human infection challenge model. Lancet Healthy Longev. 3 , e405–e416 (2022).

Guvenel, A. et al. Epitope-specific airway-resident CD4+ T cell dynamics during experimental human RSV infection. J. Clin. Invest. 130 , 523–538 (2019).

Greenwood, C. J. et al. A comparison of penalised regression methods for informing the selection of predictive markers. PLoS ONE 15 , e0242730 (2020).

Higham, A. et al. Leukotriene B4 levels in sputum from asthma patients. ERJ Open Res. 2 , 00088–02015 (2016).

SARS-CoV-2 spike kit. MSD https://www.mesoscale.com/~/media/files/product%20inserts/s-plex%20sars-cov-2%20spike%20kit%20product%20insert.pdf (2023).

Ren, A. et al. Ultrasensitive assay for saliva-based SARS-CoV-2 antigen detection. Clin. Chem. Lab. Med. 60 , 771–777 (2022).

Breheny, P. & Huang, J. Penalized methods for bi-level variable selection. Stat. Interface 2 , 369–380 (2009).

Thwaites, R. S. et al. Inflammatory profiles across the spectrum of disease reveal a distinct role for GM-CSF in severe COVID-19. Sci. Immunol. 6 , eabg9873 (2021).

Resources. PHOSP-COVID https://phosp.org/resource/ (2022).


This research used data assets made available by ODAP as part of the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (grant ref. MC_PC_20058). This work is supported by the following grants: the PHOSP-COVID study is jointly funded by UK Research and Innovation and the National Institute for Health and Care Research (NIHR; grant references MR/V027859/1 and COV0319). ISARIC4C is supported by grants from the National Institute for Health and Care Research (award CO-CIN-01) and the MRC (grant MC_PC_19059). Liverpool Experimental Cancer Medicine Centre provided infrastructure support for this research (grant reference C18616/A25153). Other grants that have supported this work include the UK Coronavirus Immunology Consortium (funder reference 1257927), the Imperial Biomedical Research Centre (NIHR Imperial BRC, grant IS-BRC-1215-20013), the Health Protection Research Unit in Respiratory Infections at Imperial College London and NIHR Health Protection Research Unit in Emerging and Zoonotic Infections at University of Liverpool, both in partnership with Public Health England (NIHR award 200907), Wellcome Trust and Department for International Development (215091/Z/18/Z), Health Data Research UK (grant code 2021.0155), MRC (grant code MC_UU_12014/12) and NIHR Clinical Research Network for providing infrastructure support for this research. We also acknowledge the support of the MRC EMINENT Network (MR/R502121/1), which is cofunded by GSK, the Comprehensive Local Research Networks, the MRC HIC-Vac network (MR/R005982/1) and the RSV Consortium in Europe Horizon 2020 Framework Grant 116019. F.L. is supported by an MRC clinical training fellowship (award MR/W000970/1). C.E. is funded by NIHR (grant P91258-4). L.-P.H. is supported by Oxford NIHR Biomedical Research Centre. A.A.R.T. is supported by a British Heart Foundation (BHF) Intermediate Clinical Fellowship (FS/18/13/33281). S.L.R.-J. receives support from UK Research and Innovation (UKRI), Global Challenges Research Fund (GCRF), Rosetrees Trust, British HIV Association (BHIVA), European & Developing Countries Clinical Trials Partnership (EDCTP) and Globvac. J.D.C. has grants from AstraZeneca, Boehringer Ingelheim, GSK, Gilead Sciences, Grifols, Novartis and Insmed. R.A.E. holds an NIHR Clinician Scientist Fellowship (CS-2016-16-020). A. Horsley is currently supported by UK Research and Innovation, NIHR and NIHR Manchester BRC. B.R. receives support from BHF Oxford Centre of Research Excellence, NIHR Oxford BRC and MRC. D.G.W. is supported by an NIHR Advanced Fellowship. A. Ho has received support from MRC and for the Coronavirus Immunology Consortium (MR/V028448/1). L.T. is supported by the US Food and Drug Administration Medical Countermeasures Initiative contract 75F40120C00085 and the National Institute for Health Research Health Protection Research Unit in Emerging and Zoonotic Infections (NIHR200907) at the University of Liverpool in partnership with UK Health Security Agency (UK-HSA), in collaboration with Liverpool School of Tropical Medicine and the University of Oxford. L.V.W. has received support from UKRI, GSK/Asthma and Lung UK and NIHR for this study. M.G.S. has received support from NIHR UK, MRC UK and Health Protection Research Unit in Emerging and Zoonotic Infections, University of Liverpool. J.K.B. is supported by the Wellcome Trust (223164/Z/21/Z) and UKRI (MC_PC_20004, MC_PC_19025, MC_PC_1905, MRNO2995X/1 and MC_PC_20029). The funders were not involved in the study design, interpretation of data or writing of this manuscript.
The views expressed are those of the authors and not necessarily those of the Department of Health and Social Care (DHSC), the Department for International Development (DID), NIHR, MRC, the Wellcome Trust, UK-HSA, the National Health Service or the Department of Health. P.J.M.O. is supported by an NIHR Senior Investigator Award (award 201385). We thank all the participants and their families. We thank the many research administrators, health-care and social-care professionals who contributed to setting up and delivering the PHOSP-COVID study at all of the 65 NHS trusts/health boards and 25 research institutions across the United Kingdom, as well as those who contributed to setting up and delivering the ISARIC4C study at 305 NHS trusts/health boards. We also thank all the supporting staff at the NIHR Clinical Research Network, Health Research Authority, Research Ethics Committee, Department of Health and Social Care, Public Health Scotland and Public Health England. We thank K. Holmes at the NIHR Office for Clinical Research Infrastructure for her support in coordinating the charities group. The PHOSP-COVID industry framework was formed to provide advice and support in commercial discussions, and we thank the Association of the British Pharmaceutical Industry as well as the NIHR Office for Clinical Research Infrastructure for coordinating this. We are very grateful to all the charities that have provided insight to the study: Action Pulmonary Fibrosis, Alzheimer’s Research UK, Asthma and Lung UK, British Heart Foundation, Diabetes UK, Cystic Fibrosis Trust, Kidney Research UK, MQ Mental Health, Muscular Dystrophy UK, Stroke Association, Blood Cancer UK, McPin Foundation and Versus Arthritis. We thank the NIHR Leicester Biomedical Research Centre patient and public involvement group and Long Covid Support. We also thank G. Khandaker and D. C. Newcomb who provided valuable feedback on this work. Extended Data Fig. 10 was created using BioRender.

Author information

These authors contributed equally: Felicity Liew, Claudia Efstathiou, Ryan S. Thwaites, Peter J. M. Openshaw.

Authors and Affiliations

National Heart and Lung Institute, Imperial College London, London, UK

Felicity Liew, Claudia Efstathiou, Sara Fontanella, Dawid Swieboda, Jasmin K. Sidhu, Stephanie Ascough, Onn Min Kon, Luke S. Howard, Jennifer K. Quint, Christopher Chiu, Ryan S. Thwaites, Jake Dunning & Peter J. M. Openshaw

Institute for Lung Health, Leicester NIHR Biomedical Research Centre, University of Leicester, Leicester, UK

Matthew Richardson, Ruth Saunders, Olivia C. Leavy, Omer Elneima, Hamish J. C. McAuley, Amisha Singapuri, Marco Sereno, Victoria C. Harris, Neil J. Greening, Rachael A. Evans, Louise V. Wain, Christopher Brightling & Ananga Singapuri

NIHR Health Protection Research Unit in Emerging and Zoonotic Infections, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool, UK

Shona C. Moore, Daniel G. Wootton, Malcolm G. Semple, Lance Turtle, William A. Paxton & Georgios Pollakis

The Imperial Clinical Respiratory Research Unit, Imperial College NHS Trust, London, UK

Noura Mohamed

Cardiovascular Research Team, Imperial College Healthcare NHS Trust, London, UK

Jose Nunag & Clara King

Department of Population Health Sciences, University of Leicester, Leicester, UK

Olivia C. Leavy, Louise V. Wain & Beatriz Guillen-Guio

NIHR Leicester Biomedical Research Centre, University of Leicester, Leicester, UK

Aarti Shikotra

Centre for Exercise and Rehabilitation Science, NIHR Leicester Biomedical Research Centre-Respiratory, University of Leicester, Leicester, UK

Linzy Houchen-Wolloff

Usher Institute, University of Edinburgh, Edinburgh, UK

Nazir I. Lone, Luke Daines, Annemarie B. Docherty, Matthew Thorpe, Thomas M. Drake, Cameron J. Fairfield, Ewen M. Harrison, Stephen R. Knight, Kenneth A. Mclean, Derek Murphy, Lisa Norman, Riinu Pius & Catherine A. Shaw

Centre for Medical Informatics, The Usher Institute, University of Edinburgh, Edinburgh, UK

Matthew Thorpe, Annemarie B. Docherty, Ewen M. Harrison, J. Kenneth Baillie, Sarah L. Rowland-Jones, A. A. Roger Thompson & Thushan de Silva

Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK

A. A. Roger Thompson, Sarah L. Rowland-Jones, Thushan I. de Silva & James D. Chalmers

University of Dundee, Ninewells Hospital and Medical School, Dundee, UK

James D. Chalmers & Ling-Pei Ho

MRC Human Immunology Unit, University of Oxford, Oxford, UK

Ling-Pei Ho & Alexander Horsley

Division of Infection, Immunity and Respiratory Medicine, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK

Alexander Horsley & Betty Raman

Radcliffe Department of Medicine, University of Oxford, Oxford, UK

Betty Raman & Krisnah Poinasamy

Asthma + Lung UK, London, UK

Krisnah Poinasamy & Michael Marks

Department of Clinical Research, London School of Hygiene and Tropical Medicine, London, UK

Michael Marks

Hospital for Tropical Diseases, University College London Hospital, London, UK

Division of Infection and Immunity, University College London, London, UK

Michael Marks & Mahdad Noursadeghi

MRC Centre for Virus Research, School of Infection and Immunity, University of Glasgow, Glasgow, UK

Antonia Ho & William Greenhalf

Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK

William Greenhalf & J. Kenneth Baillie

The Roslin Institute, University of Edinburgh, Edinburgh, UK

J. Kenneth Baillie, Sara Clohisey, Fiona Griffiths, Ross Hendry, Andrew Law & Wilna Oosthuyzen

Pandemic Science Hub, University of Edinburgh, Edinburgh, UK

J. Kenneth Baillie

The Pandemic Institute, University of Liverpool, Liverpool, UK

Malcolm G. Semple & Lance Turtle

University of Manchester, Manchester, UK

Kathryn Abel, Perdita Barran, H. Chinoy, Bill Deakin, M. Harvie, C. A. Miller, Stefan Stanel & Drupad Trivedi

Intensive Care Unit, Royal Infirmary of Edinburgh, Edinburgh, UK

Kathryn Abel & J. Kenneth Baillie

North Bristol NHS Trust and University of Bristol, Bristol, UK

H. Adamali, David Arnold, Shaney Barratt, A. Dipper, Sarah Dunn, Nick Maskell, Anna Morley, Leigh Morrison, Louise Stadon, Samuel Waterson & H. Welch

University of Edinburgh, Edinburgh, UK

Davies Adeloye, D. E. Newby, Riinu Pius, Igor Rudan, Manu Shankar-Hari, Catherine Sudlow, Sarah Walmsley & Bang Zheng

King’s College Hospital NHS Foundation Trust and King’s College London, London, UK

Oluwaseun Adeyemi, Rita Adrego, Hosanna Assefa-Kebede, Jonathon Breeze, S. Byrne, Pearl Dulawan, Amy Hoare, Caroline Jolley, Abigail Knighton, M. Malim, Sheetal Patale, Ida Peralta, Natassia Powell, Albert Ramos, K. Shevket, Fabio Speranza & Amelie Te

Guy’s and St Thomas’ NHS Foundation Trust, London, UK

Laura Aguilar Jimenez, Gill Arbane, Sarah Betts, Karen Bisnauthsing, A. Dewar, Nicholas Hart, G. Kaltsakas, Helen Kerslake, Murphy Magtoto, Philip Marino, L. M. Martinez, Marlies Ostermann, Jennifer Rossdale & Teresa Solano

Royal Free London NHS Foundation Trust, London, UK

Shanaz Ahmad, Simon Brill, John Hurst, Hannah Jarvis, C. Laing, Lai Lim, S. Mandal, Darwin Matila, Olaoluwa Olaosebikan & Claire Singh

University Hospital Birmingham NHS Foundation Trust and University of Birmingham, Birmingham, UK

N. Ahmad Haider, Catherine Atkin, Rhiannon Baggott, Michelle Bates, A. Botkai, Anna Casey, B. Cooper, Joanne Dasgin, Camilla Dawson, Katharine Draxlbauer, N. Gautam, J. Hazeldine, T. Hiwot, Sophie Holden, Karen Isaacs, T. Jackson, Vicky Kamwa, D. Lewis, Janet Lord, S. Madathil, C. McGee, K. Mcgee, Aoife Neal, Alex Newton-Cox, Joseph Nyaboko, Dhruv Parekh, Z. Peterkin, H. Qureshi, Liz Ratcliffe, Elizabeth Sapey, J. Short, Tracy Soulsby, J. Stockley, Zehra Suleiman, Tamika Thompson, Maximina Ventura, Sinead Walder, Carly Welch, Daisy Wilson, S. Yasmin & Kay Por Yip

Stroke Association, London, UK

Rubina Ahmed & Richard Francis

University College London Hospital and University College London, London, UK

Nyarko Ahwireng, Dongchun Bang, Donna Basire, Jeremy Brown, Rachel Chambers, A. Checkley, R. Evans, M. Heightman, T. Hillman, Joseph Jacob, Roman Jastrub, M. Lipman, S. Logan, D. Lomas, Marta Merida Morillas, Hannah Plant, Joanna Porter, K. Roy & E. Wall

Oxford University Hospitals NHS Foundation Trust and University of Oxford, Oxford, UK

Mark Ainsworth, Asma Alamoudi, Angela Bloss, Penny Carter, M. Cassar, Jin Chen, Florence Conneh, T. Dong, Ranuromanana Evans, V. Ferreira, Emily Fraser, John Geddes, F. Gleeson, Paul Harrison, May Havinden-Williams, P. Jezzard, Ivan Koychev, Prathiba Kurupati, H. McShane, Clare Megson, Stefan Neubauer, Debby Nicoll, C. Nikolaidou, G. Ogg, Edmund Pacpaco, M. Pavlides, Yanchun Peng, Nayia Petousi, John Pimm, Najib Rahman, M. J. Rowland, Kathryn Saunders, Michael Sharpe, Nick Talbot, E. M. Tunnicliffe & C. Xie

St George’s University Hospitals NHS Foundation Trust, London, UK

Mariam Ali, Raminder Aul, A. Dunleavy, D. Forton, Mark Mencias, N. Msimanga, T. Samakomva, Sulman Siddique, Vera Tavoukjian & J. Teixeira

University Hospitals of Leicester NHS Trust and University of Leicester, Leicester, UK

M. Aljaroof, Natalie Armstrong, H. Arnold, Hnin Aung, Majda Bakali, M. Bakau, E. Baldry, Molly Baldwin, Charlotte Bourne, Michelle Bourne, Nigel Brunskill, P. Cairns, Liesel Carr, Amanda Charalambou, C. Christie, Melanie Davies, Enya Daynes, Sarah Diver, Rachael Dowling, Sarah Edwards, C. Edwardson, H. Evans, J. Finch, Sarah Glover, Nicola Goodman, Bibek Gooptu, Kate Hadley, Pranab Haldar, Beverley Hargadon, W. Ibrahim, L. Ingram, Kamlesh Khunti, A. Lea, D. Lee, Gerry McCann, P. McCourt, Teresa Mcnally, George Mills, Will Monteiro, Manish Pareek, S. Parker, Anne Prickett, I. N. Qureshi, A. Rowland, Richard Russell, Salman Siddiqui, Sally Singh, J. Skeemer, M. Soares, E. Stringer, T. Thornton, Martin Tobin, T. J. C. Ward, F. Woodhead, Tom Yates & A. J. Yousuf

University of Exeter, Exeter, UK

Louise Allan, Clive Ballard & Andrew McGovern

University of Leicester, Leicester, UK

Richard Allen, Michelle Bingham, Terry Brugha, Selina Finney, Rob Free, Don Jones, Claire Lawson, Daniel Lozano-Rojas, Gardiner Lucy, Alistair Moss, Elizabeta Mukaetova-Ladinska, Petr Novotny, Kimon Ntotsis, Charlotte Overton, John Pearl, Tatiana Plekhanova, M. Richardson, Nilesh Samani, Jack Sargant, Ruth Saunders, M. Sharma, Mike Steiner, Chris Taylor, Sarah Terry, C. Tong, E. Turner, J. Wormleighton & Bang Zhao

Liverpool University Hospitals NHS Foundation Trust and University of Liverpool, Liverpool, UK

Lisa Allerton, Ann Marie Allt, M. Beadsworth, Anthony Berridge, Jo Brown, Shirley Cooper, Andy Cross, Sylviane Defres, S. L. Dobson, Joanne Earley, N. French, Kera Hainey, Hayley Hardwick, Jenny Hawkes, Victoria Highett, Sabina Kaprowska, Angela Key, Lara Lavelle-Langham, N. Lewis-Burke, Gladys Madzamba, Flora Malein, Sophie Marsh, Chloe Mears, Lucy Melling, Matthew Noonan, L. Poll, James Pratt, Emma Richardson, Anna Rowe, Victoria Shaw, K. A. Tripp, Lilian Wajero, S. A. Williams-Howard, Dan Wootton & J. Wyles

Sherwood Forest Hospitals NHS Foundation Trust, Nottingham, UK

Lynne Allsop, Kaytie Bennett, Phil Buckley, Margaret Flynn, Mandy Gill, Camelia Goodwin, M. Greatorex, Heidi Gregory, Cheryl Heeley, Leah Holloway, Megan Holmes, John Hutchinson, Jill Kirk, Wayne Lovegrove, Terri Ann Sewell, Sarah Shelton, D. Sissons, Katie Slack, Susan Smith, D. Sowter, Sarah Turner, V. Whitworth & Inez Wynter

Nottingham University Hospitals NHS Trust and University of Nottingham, Nottingham, UK

Paula Almeida, Akram Hosseini, Robert Needham & Karen Shaw

Manchester University NHS Foundation Trust and University of Manchester, Manchester, UK

Bashar Al-Sheklly, Cristina Avram, John Blaikely, M. Buch, N. Choudhury, David Faluyi, T. Felton, T. Gorsuch, Neil Hanley, Tracy Hussell, Zunaira Kausar, Natasha Odell, Rebecca Osbourne, Karen Piper Hanley, K. Radhakrishnan & Sue Stockdale

Imperial College London, London, UK

Danny Altmann, Andrew Frankel, Luke S. Howard, Desmond Johnston, Liz Lightstone, Anne Lingford-Hughes, William Man, Steve McAdoo, Jane Mitchell, Philip Molyneaux, Christos Nicolaou, D. P. O’Regan, L. Price, Jennifer K. Quint, David Smith, Jonathon Valabhji, Simon Walsh, Martin Wilkins & Michelle Willicombe

Hampshire Hospitals NHS Foundation Trust, Basingstoke, UK

Maria Alvarez Corral, Ava Maria Arias, Emily Bevan, Denise Griffin, Jane Martin, J. Owen, Sheila Payne, A. Prabhu, Annabel Reed, Will Storrar, Nick Williams & Caroline Wrey Brown

British Heart Foundation, Birmingham, UK

Shannon Amoils

NHS Greater Glasgow and Clyde Health Board and University of Glasgow, Glasgow, UK

David Anderson, Neil Basu, Hannah Bayes, Colin Berry, Ammani Brown, Andrew Dougherty, K. Fallon, L. Gilmour, D. Grieve, K. Mangion, I. B. McInnes, A. Morrow, Kathryn Scott & R. Sykes

University of Oxford, Oxford, UK

Charalambos Antoniades, A. Bates, M. Beggs, Kamaldeep Bhui, Katie Breeze, K. M. Channon, David Clark, X. Fu, Masud Husain, Lucy Kingham, Paul Klenerman, Hanan Lamlum, X. Li, E. Lukaschuk, Celeste McCracken, K. McGlynn, R. Menke, K. Motohashi, T. E. Nichols, Godwin Ogbole, S. Piechnik, I. Propescu, J. Propescu, A. A. Samat, Z. B. Sanders, Louise Sigfrid & M. Webster

Belfast Health and Social Care Trust and Queen’s University Belfast, Belfast, UK

Cherie Armour, Vanessa Brown, John Busby, Bronwen Connolly, Thelma Craig, Stephen Drain, Liam Heaney, Bernie King, Nick Magee, E. Major, Danny McAulay, Lorcan McGarvey, Jade McGinness, Tunde Peto & Roisin Stone

Airedale NHS Foundation Trust, Keighley, UK

Lisa Armstrong, Brigid Hairsine, Helen Henson, Claire Kurasz, Alison Shaw & Liz Shenton

Wrightington Wigan and Leigh NHS Trust, Wigan, UK

A. Ashish, Josh Cooper & Emma Robinson

Leeds Teaching Hospitals and University of Leeds, Leeds, UK

Andrew Ashworth, Paul Beirne, Jude Clarke, C. Coupland, Matthew Dalton, Clair Favager, Jodie Glossop, John Greenwood, Lucy Hall, Tim Hardy, Amy Humphries, Jennifer Murira, Dan Peckham, S. Plein, Jade Rangeley, Gwen Saalmink, Ai Lyn Tan, Elaine Wade, Beverley Whittam, Nicola Window & Janet Woods

University of Liverpool, Liverpool, UK

M. Ashworth, D. Cuthbertson, G. Kemp, Anne McArdle, Benedict Michael, Will Reynolds, Lisa Spencer, Ben Vinson, Katie A. Ahmed, Jane A. Armstrong, Milton Ashworth, Innocent G. Asiimwe, Siddharth Bakshi, Samantha L. Barlow, Laura Booth, Benjamin Brennan, Katie Bullock, Nicola Carlucci, Emily Cass, Benjamin W. A. Catterall, Jordan J. Clark, Emily A. Clarke, Sarah Cole, Louise Cooper, Helen Cox, Christopher Davis, Oslem Dincarslan, Alejandra Doce Carracedo, Chris Dunn, Philip Dyer, Angela Elliott, Anthony Evans, Lorna Finch, Lewis W. S. Fisher, Lisa Flaherty, Terry Foster, Isabel Garcia-Dorival, Philip Gunning, Catherine Hartley, Karl Holden, Anthony Holmes, Rebecca L. Jensen, Christopher B. Jones, Trevor R. Jones, Shadia Khandaker, Katharine King, Robyn T. Kiy, Chrysa Koukorava, Annette Lake, Suzannah Lant, Diane Latawiec, Lara Lavelle-Langham, Daniella Lefteri, Lauren Lett, Lucia A. Livoti, Maria Mancini, Hannah Massey, Nicole Maziere, Sarah McDonald, Laurence McEvoy, John McLauchlan, Soeren Metelmann, Nahida S. Miah, Joanna Middleton, Joyce Mitchell, Ellen G. Murphy, Rebekah Penrice-Randal, Jack Pilgrim, Tessa Prince, P. Matthew Ridley, Debby Sales, Rebecca K. Shears, Benjamin Small, Krishanthi S. Subramaniam, Agnieska Szemiel, Aislynn Taggart, Jolanta Tanianis-Hughes, Jordan Thomas, Erwan Trochu, Libby van Tonder, Eve Wilcock & J. Eunice Zhang

University College London, London, UK

Shahab Aslani, Amita Banerjee, R. Batterham, Gabrielle Baxter, Robert Bell, Anthony David, Emma Denneny, Alun Hughes, W. Lilaonitkul, P. Mehta, Ashkan Pakzad, Bojidar Rangelov, B. Williams, James Willoughby & Moucheng Xu

Hull University Teaching Hospitals NHS Trust and University of Hull, Hull, UK

Paul Atkin, K. Brindle, Michael Crooks, Katie Drury, Nicholas Easom, Rachel Flockton, L. Holdsworth, A. Richards, D. L. Sykes, Susannah Thackray-Nocera & C. Wright

East Kent Hospitals University NHS Foundation Trust, Canterbury, UK

Liam Austin, Eva Beranova, Tracey Cosier, Joanne Deery, Tracy Hazelton, Carly Price, Hazel Ramos, Reanne Solly, Sharon Turney & Heather Weston

Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, The Queen’s Medical Research Institute, University of Edinburgh, Edinburgh, UK

Nikos Avramidis, J. Kenneth Baillie, Erola Pairo-Castineira & Konrad Rawlik

Roslin Institute, University of Edinburgh, Edinburgh, UK

Nikos Avramidis, J. Kenneth Baillie & Erola Pairo-Castineira

Newcastle upon Tyne Hospitals NHS Foundation Trust and University of Newcastle, Newcastle upon Tyne, UK

A. Ayoub, J. Brown, G. Burns, Gareth Davies, Anthony De Soyza, Carlos Echevarria, Helen Fisher, C. Francis, Alan Greenhalgh, Philip Hogarth, Joan Hughes, Kasim Jiwa, G. Jones, G. MacGowan, D. Price, Avan Sayer, John Simpson, H. Tedd, S. Thomas, Sophie West, M. Witham, S. Wright & A. Young

East Cheshire NHS Trust, Macclesfield, UK

Marta Babores, Maureen Holland, Natalie Keenan, Sharlene Shashaa & Helen Wassall

Sheffield Teaching NHS Foundation Trust and University of Sheffield, Sheffield, UK

J. Bagshaw, M. Begum, K. Birchall, Robyn Butcher, H. Carborn, Flora Chan, Kerry Chapman, Yutung Cheng, Luke Chetham, Cameron Clark, Zach Coburn, Joby Cole, Myles Dixon, Alexandra Fairman, J. Finnigan, H. Foot, David Foote, Amber Ford, Rebecca Gregory, Kate Harrington, L. Haslam, L. Hesselden, J. Hockridge, Ailsa Holbourn, B. Holroyd-Hind, L. Holt, Alice Howell, E. Hurditch, F. Ilyas, Claire Jarman, Allan Lawrie, Ju Hee Lee, Elvina Lee, Rebecca Lenagh, Alison Lye, Irene Macharia, M. Marshall, Angeline Mbuyisa, J. McNeill, Sharon Megson, J. Meiring, L. Milner, S. Misra, Helen Newell, Tom Newman, C. Norman, Lorenza Nwafor, Dibya Pattenadk, Megan Plowright, Julie Porter, Phillip Ravencroft, C. Roddis, J. Rodger, Peter Saunders, J. Sidebottom, Jacqui Smith, Laurie Smith, N. Steele, G. Stephens, R. Stimpson, B. Thamu, N. Tinker, Kim Turner, Helena Turton, Phillip Wade, S. Walker, James Watson, Imogen Wilson & Amira Zawia

University of Nottingham, Nottingham, UK

David Baguley, Chris Coleman, E. Cox, Laura Fabbri, Susan Francis, Ian Hall, E. Hufton, Simon Johnson, Fasih Khan, Paaig Kitterick, Richard Morriss, Nick Selby, Iain Stewart & Louise Wright

Wirral University Teaching Hospital, Wirral, UK

Elisabeth Bailey, Anne Reddington & Andrew Wight

MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Western General Hospital, Edinburgh, UK

University of Swansea, Swansea, UK

University of Southampton, Southampton, UK

David Baldwin, P. C. Calder, Nathan Huneke & Gemma Simons

Royal Brompton and Harefield Clinical Group, Guy’s and St Thomas’ NHS Foundation Trust, London, UK

R. E. Barker, Daniele Cristiano, N. Dormand, P. George, Mahitha Gummadi, S. Kon, Kamal Liyanage, C. M. Nolan, B. Patel, Suhani Patel, Oliver Polgar, L. Price, P. Shah, Suver Singh & J. A. Walsh

York and Scarborough NHS Foundation Trust, York, UK

Laura Barman, Claire Brookes, K. Elliott, L. Griffiths, Zoe Guy, Kate Howard, Diana Ionita, Heidi Redfearn, Carol Sarginson & Alison Turnbull

NHS Highland, Inverness, UK

Fiona Barrett, A. Donaldson & Beth Sage

Royal Papworth Hospital NHS Foundation Trust, Cambridge, UK

Helen Baxendale, Lucie Garner, C. Johnson, J. Mackie, Alice Michael, J. Newman, Jamie Pack, K. Paques, H. Parfrey, J. Parmar & A. Reddy

University Hospitals of Derby and Burton, Derby, UK

Paul Beckett, Caroline Dickens & Uttam Nanda

NHS Lanarkshire, Hamilton, UK

Murdina Bell, Angela Brown, M. Brown, R. Hamil, Karen Leitch, L. Macliver, Manish Patel, Jackie Quigley, Andrew Smith & B. Welsh

Cambridge University Hospitals NHS Foundation Trust, NIHR Cambridge Clinical Research Facility and University of Cambridge, Cambridge, UK

Areti Bermperi, Isabel Cruz, K. Dempsey, Anne Elmer, Jonathon Fuld, H. Jones, Sherly Jose, Stefan Marciniak, M. Parkes, Carla Ribeiro, Jessica Taylor, Mark Toshner, L. Watson & J. Worsley

Loughborough University, Loughborough, UK

Lettie Bishop & David Stensel

Betsi Cadwaladr University Health Board, Bangor, UK

Annette Bolger, Ffyon Davies, Ahmed Haggar, Joanne Lewis, Arwel Lloyd, R. Manley, Emma McIvor, Daniel Menzies, K. Roberts, W. Saxon, David Southern, Christian Subbe & Victoria Whitehead

Nottingham University Hospitals NHS Trust and University of Nottingham, Nottingham, UK

Charlotte Bolton, J. Bonnington, Melanie Chrystal, Catherine Dupont, Paul Greenhaff, Ayushman Gupta, W. Jang, S. Linford, Laura Matthews, Athanasios Nikolaidis, Sabrina Prosper & Andrew Thomas

King’s College London, London, UK

Kate Bramham, M. Brown, Khalida Ismail, Tim Nicholson, Carmen Pariante, Claire Sharpe, Simon Wessely & J. Whitney

Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK

Lucy Brear, Karen Regan, Dinesh Saralaya & Kim Storton

South London and Maudsley NHS Foundation Trust and King’s College London, London, UK

G. Breen & M. Hotopf

London School of Hygiene and Tropical Medicine, London, UK

Andrew Briggs

Whittington Health NHS Trust, London, UK

E. Bright, P. Crisp, Ruvini Dharmagunawardena & M. Stern

Cardiff and Vale University Health Board, Cardiff, UK

Lauren Broad, Teriann Evans, Matthew Haynes, L. Jones, Lucy Knibbs, Alison McQueen, Catherine Oliver, Kerry Paradowski, Ramsey Sabit & Jenny Williams

Yeovil District Hospital NHS Foundation Trust, Yeovil, UK

Andrew Broadley

University of Birmingham, Birmingham, UK

Mattew Broome, Paul McArdle, Paul Moss, David Thickett, Rachel Upthegrove, Dan Wilkinson, David Wraith & Erin L. Aldera

BHF Centre for Cardiovascular Science, University of Edinburgh, Edinburgh, UK

Anda Bularga

University of Cambridge, Cambridge, UK

Ed Bullmore, Jonathon Heeney, Claudia Langenberg, William Schwaeble, Charlotte Summers & J. Weir McCall

NIHR Leicester Biomedical Research Centre–Respiratory Patient and Public Involvement Group, Leicester, UK

Jenny Bunker, Rhyan Gill & Rashmita Nathu

Imperial College Healthcare NHS Trust and Imperial College London, London, UK

L. Burden, Ellen Calvelo, Bethany Card, Caitlin Carr, Edwin Chilvers, Donna Copeland, P. Cullinan, Patrick Daly, Lynsey Evison, Tamanah Fayzan, Hussain Gordon, Sulaimaan Haq, Gisli Jenkins, Clara King, Onn Min Kon, Katherine March, Myril Mariveles, Laura McLeavey, Silvia Moriera, Unber Munawar, Uchechi Nwanguma, Lorna Orriss-Dib, Alexandra Ross, Maura Roy, Emily Russell, Katherine Samuel, J. Schronce, Neil Simpson, Lawrence Tarusan, David Thomas, Chloe Wood & Najira Yasmin

Harrogate and District NHS Foundation Trust, Harrogate, UK

Tracy Burdett, James Featherstone, Cathy Lawson, Alison Layton, Clare Mills & Lorraine Stephenson

Newcastle University/Chair of NIHR Dementia TRC, Newcastle, UK

Oxford University Hospitals NHS Foundation Trust, Oxford, UK

A. Burns & N. Kanellakis

Tameside and Glossop Integrated Care NHS Foundation Trust, Ashton-under-Lyne, UK

Al-Tahoor Butt, Martina Coulding, Heather Jones, Susan Kilroy, Jacqueline McCormick, Jerome McIntosh, Heather Savill, Victoria Turner & Joanne Vere

University of Oxford, Nuffield Department of Medicine, Oxford, UK

University of Glasgow, Glasgow, UK

Jonathon Cavanagh, S. MacDonald, Kate O’Donnell, John Petrie, Naveed Sattar & Mark Spears

United Lincolnshire Hospitals NHS Trust, Grantham, UK

Manish Chablani & Lynn Osborne

Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK

Trudie Chalder

University Hospital of South Manchester NHS Foundation Trust, Manchester, UK

N. Chaudhuri

University Hospital Southampton NHS Foundation Trust and University of Southampton, Southampton, UK

Caroline Childs, R. Djukanovic, S. Fletcher, Matt Harvey, Mark Jones, Elizabeth Marouzet, B. Marshall, Reena Samuel, T. Sass, Tim Wallis & Helen Wheeler

King’s College Hospital/Guy’s and St Thomas’ NHS FT, London, UK

A. Chiribiri & C. O’Brien

Barts Health NHS Trust, London, UK

K. Chong-James, C. David, W. Y. James, Paul Pfeffer & O. Zongo

NHS Lothian and University of Edinburgh, Edinburgh, UK

Gaunab Choudhury, S. Clohisey, Andrew Deans, J. Furniss, Ewen Harrison, S. Kelly & Aziz Sheikh

School of Cardiovascular Medicine and Sciences, King’s College London, London, UK

Phillip Chowienczyk

Lewisham and Greenwich NHS Trust, London, UK

Hywel Dda University Health Board, Haverfordwest, UK

S. Coetzee, Kim Davies, Rachel Ann Hughes, Ronda Loosley, Heather McGuinness, Abdelrahman Mohamed, Linda O’Brien, Zohra Omar, Emma Perkins, Janet Phipps, Gavin Ross, Abigail Taylor, Helen Tench & Rebecca Wolf-Roberts

NHS Tayside and University of Dundee, Dundee, UK

David Connell, C. Deas, Anne Elliott, J. George, S. Mohammed, J. Rowland, A. R. Solstice, Debbie Sutherland & Caroline Tee

Swansea Bay University Health Board, Port Talbot, UK

Lynda Connor, Amanda Cook, Gwyneth Davies, Tabitha Rees, Favas Thaivalappil & Caradog Thomas

Faculty of Medicine, Nursing and Health Sciences, School of Biomedical Sciences, Monash University, Melbourne, Victoria, Australia

Eamon Coughlan

Rotherham NHS Foundation Trust, Rotherham, UK

Alison Daniels, Anil Hormis, Julie Ingham & Lisa Zeidan

Salford Royal NHS Foundation Trust, Salford, UK

P. Dark, Nawar Diar-Bakerly, D. Evans, E. Hardy, Alice Harvey, D. Holgate, Sean Knight, N. Mairs, N. Majeed, L. McMorrow, J. Oxton, Jessica Pendlebury, C. Summersgill, R. Ugwuoke & S. Whittaker

Cwm Taf Morgannwg University Health Board, Mountain Ash, UK

Ellie Davies, Cerys Evenden, Alyson Hancock, Kia Hancock, Ceri Lynch, Meryl Rees, Lisa Roche, Natalie Stroud & T. Thomas-Woods

Borders General Hospital, NHS Borders, Melrose, UK

Joy Dawson, Hosni El-Taweel & Leanne Robinson

Aneurin Bevan University Health Board, Caerleon, UK

Amanda Dell, Sara Fairbairn, Nancy Hawkings, Jill Haworth, Michaela Hoare, Victoria Lewis, Alice Lucey, Georgia Mallison, Heeah Nassa, Chris Pennington, Andrea Price, Claire Price, Andrew Storrie, Gemma Willis & Susan Young

University of Exeter Medical School, Exeter, UK

London North West University Healthcare NHS Trust, London, UK

Shalin Diwanji, Sambasivarao Gurram, Padmasayee Papineni, Sheena Quaid, Gerlynn Tiongson & Ekaterina Watson

Alzheimer’s Research UK, Cambridge, UK

Hannah Dobson

Health and Care Research Wales, Cardiff, UK

Yvette Ellis

University of Bristol, Bristol, UK

Jonathon Evans

University of Sheffield, Sheffield, UK

L. Finnigan, Laura Saunders & James Wild

Great Western Hospital Foundation Trust, Swindon, UK

Eva Fraile & Jacinta Ugoji

Royal Devon and Exeter NHS Trust, Barnstaple, UK

Michael Gibbons

Kettering General Hospital NHS Trust, Kettering, UK

Anne-Marie Guerdette, Melanie Hewitt, R. Reddy, Katie Warwick & Sonia White

NIHR Leicester Biomedical Research Centre, Leicester, UK

Beatriz Guillen-Guio

University of Leeds, Leeds, UK

Elspeth Guthrie & Max Henderson

Royal Surrey NHS Foundation Trust, Cranleigh, UK

Mark Halling-Brown & Katherine McCullough

Chesterfield Royal Hospital NHS Trust, Calow, UK

Edward Harris & Claire Sampson

Long Covid Support, London, UK

Claire Hastie, Natalie Rogers & Nikki Smith

King’s College Hospital NHS Foundation Trust and King’s College London, London, UK

Department of Oncology and Metabolism, University of Sheffield, Sheffield, UK

Simon Heller

NIHR Office for Clinical Research Infrastructure, London, UK

Katie Holmes

Asthma UK and British Lung Foundation Partnership, London, UK

Ian Jarrold & Samantha Walker

North Middlesex University Hospital NHS Trust, London, UK

Bhagy Jayaraman & Tessa Light

Action for Pulmonary Fibrosis, Peterborough, UK

Cardiff University, National Centre for Mental Health, Cardiff, UK

McPin Foundation, London, UK

Thomas Kabir

Roslin Institute, The University of Edinburgh, Edinburgh, UK

Steven Kerr

The Hillingdon Hospitals NHS Foundation Trust, London, UK

Samantha Kon, G. Landers, Harpreet Lota, Mariam Nasseri & Sofiya Portukhay

Queen Mary University of London, London, UK

Ania Korszun

Swansea University, Swansea Welsh Network, Hywel Dda University Health Board, Swansea, UK

Royal Infirmary of Edinburgh, NHS Lothian, Edinburgh, UK

Nazir I. Lone

Barts Heart Centre, London, UK

Barts Health NHS Trust and Queen Mary University of London, London, UK

Adrian Martineau

Salisbury NHS Foundation Trust, Salisbury, UK

Wadzanai Matimba-Mupaya & Sophia Strong-Sheldrake

University of Newcastle, Newcastle, UK

Hamish McAllister-Williams, Stella-Maria Paddick, Anthony Rostron & John Paul Taylor

Gateshead NHS Trust, Gateshead, UK

W. McCormick, Lorraine Pearce, S. Pugmire, Wendy Stoker & Ann Wilson

Manchester Centre for Clinical Neurosciences, Salford Royal NHS Foundation Trust, Manchester, UK

Katherine McIvor

Kidney Research UK, Peterborough, UK

Aisling McMahon

NHS Dumfries and Galloway, Dumfries, UK

Michael McMahon & Paula Neill

Swansea University, Swansea, UK

MQ Mental Health Research, London, UK

Lea Milligan

BHF Centre for Cardiovascular Science, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK

Nicholas Mills

Shropshire Community Health NHS Trust, Shropshire, UK

Sharon Painter, Johanne Tomlinson & Louise Warburton

Somerset NHS Foundation Trust, Taunton, UK

Sue Palmer, Dawn Redwood, Jo Tilley, Carinna Vickers & Tania Wainwright

Francis Crick Institute, London, UK

Markus Ralser

Manchester University NHS Foundation Trust, Manchester, UK

Pilar Rivera-Ortega

Diabetes UK, University of Glasgow, Glasgow, UK

Elizabeth Robertson

Barnsley Hospital NHS Foundation Trust, Barnsley, UK

Amy Sanderson

MRC–University of Glasgow Centre for Virus Research, Glasgow, UK

Janet Scott

Diabetes UK, London, UK

Kamini Shah

British Heart Foundation Centre, King’s College London, London, UK

King’s College Hospital NHS Foundation Trust, London, UK

University Hospitals Birmingham NHS Foundation Trust and University of Birmingham, Birmingham, UK

Institute of Cardiovascular and Medical Sciences, BHF Glasgow Cardiovascular Research Centre, University of Glasgow, Glasgow, UK

University College London NHS Foundation Trust, London and Barts Health NHS Trust, London, UK

Northumbria University, Newcastle upon Tyne, UK

Ioannis Vogiatzis

Swansea University and Swansea Welsh Network, Swansea, UK

N. Williams

DUK | NHS Digital, Salford Royal Foundation Trust, Salford, UK

Queen Alexandra Hospital, Portsmouth, UK

Kayode Adeniji

Princess Royal Hospital, Haywards Heath, UK

Daniel Agranoff & Chi Eziefula

Bassetlaw Hospital, Bassetlaw, UK

Darent Valley Hospital, Dartford, UK

Queen Elizabeth the Queen Mother Hospital, Margate, UK

Ana Alegria

School of Informatics, University of Edinburgh, Edinburgh, UK

Beatrice Alex, Benjamin Bach & James Scott-Brown

North East and North Cumbria Integrated, Newcastle upon Tyne, UK

Section of Biomolecular Medicine, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

Petros Andrikopoulos, Kanta Chechi, Marc-Emmanuel Dumas, Julian Griffin, Sonia Liggi & Zoltan Takats

Section of Genomic and Environmental Medicine, Respiratory Division, National Heart and Lung Institute, Imperial College London, London, UK

Petros Andrikopoulos, Marc-Emmanuel Dumas, Michael Olanipekun & Anthonia Osagie

John Radcliffe Hospital, Oxford, UK

Brian Angus

Royal Albert Edward Infirmary, Wigan, UK

Abdul Ashish

Manchester Royal Infirmary, Manchester, UK

Dougal Atkinson

MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, UK

Section of Molecular Virology, Imperial College London, London, UK

Wendy S. Barclay

Furness General Hospital, Barrow-in-Furness, UK

Shahedal Bari

Hull University Teaching Hospital Trust, Kingston upon Hull, UK

Gavin Barlow

Hillingdon Hospital, Hillingdon, UK

Stella Barnass

St Thomas’ Hospital, London, UK

Nicholas Barrett

Coventry and Warwickshire, Coventry, UK

Christopher Bassford

St Michael’s Hospital, Bristol, UK

Sneha Basude

Stepping Hill Hospital, Stockport, UK

David Baxter

Royal Liverpool University Hospital, Liverpool, UK

Michael Beadsworth

Bristol Royal Hospital for Children, Bristol, UK

Jolanta Bernatoniene

Scarborough Hospital, Scarborough, UK

John Berridge

Golden Jubilee National Hospital, Clydebank, UK

Colin Berry

Liverpool Heart and Chest Hospital, Liverpool, UK

Nicola Best

Centre for Inflammation Research, The Queen’s Medical Research Institute, University of Edinburgh, Edinburgh, UK

Debby Bogaert & Clark D. Russell

James Paget University Hospital, Great Yarmouth, UK

Pieter Bothma & Darell Tupper-Carey

Aberdeen Royal Infirmary, Aberdeen, UK

Robin Brittain-Long

Adamson Hospital, Cupar, UK

Naomi Bulteel

Royal Devon and Exeter Hospital, Exeter, UK

Worcestershire Royal Hospital, Worcester, UK

Andrew Burtenshaw

ISARIC Global Support Centre, Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK

Gail Carson, Laura Merson & Louise Sigfrid

Conquest Hospital, Hastings, UK

Vikki Caruth

The James Cook University Hospital, Middlesbrough, UK

David Chadwick

Dorset County Hospital, Dorchester, UK

Duncan Chambler

Antimicrobial Resistance and Hospital Acquired Infection Department, Public Health England, London, UK

Meera Chand

Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, London, UK

Kanta Chechi

Royal Bournemouth General Hospital, Bournemouth, UK

Harrogate Hospital, Harrogate, UK

Jenny Child

Royal Blackburn Teaching Hospital, Blackburn, UK

Srikanth Chukkambotla

Edinburgh Clinical Research Facility, University of Edinburgh, Edinburgh, UK

Richard Clark, Audrey Coutts, Lorna Donelly, Angie Fawkes, Tammy Gilchrist, Katarzyna Hafezi, Louise MacGillivray, Alan Maclean, Sarah McCafferty, Kirstie Morrice, Lee Murphy & Nicola Wrobel

Torbay Hospital, Torquay, UK

Northern General Hospital, Sheffield, UK

Paul Collini, Cariad Evans & Gary Mills

Liverpool Clinical Trials Centre, University of Liverpool, Liverpool, UK

Marie Connor, Jo Dalton, Chloe Donohue, Carrol Gamble, Michelle Girvan, Sophie Halpin, Janet Harrison, Clare Jackson, Laura Marsh, Stephanie Roberts & Egle Saviciute

Department of Infectious Disease, Imperial College London, London, UK

Graham S. Cooke & Shiranee Sriskandan

St George’s Hospital (Tooting), London, UK

Catherine Cosgrove

Blackpool Victoria Hospital, Blackpool, UK

Jason Cupitt & Joanne Howard

The Royal London Hospital, London, UK

Maria-Teresa Cutino-Moguel

MRC-University of Glasgow Centre for Virus Research, Glasgow, UK

Ana da Silva Filipe, Antonia Y. W. Ho, Sarah E. McDonald, Massimo Palmarini, David L. Robertson, Janet T. Scott & Emma C. Thomson

Salford Royal Hospital, Salford, UK

University Hospital of North Durham, Durham, UK

Chris Dawson

Norfolk and Norwich University Hospital, Norwich, UK

Samir Dervisevic

Intensive Care Unit, Royal Infirmary Edinburgh, Edinburgh, UK

Annemarie B. Docherty & Seán Keating

Institute of Infection, Veterinary and Ecological Sciences, Faculty of Health and Life Sciences, University of Liverpool, Liverpool, UK

Cara Donegan & Rebecca G. Spencer

Salisbury District Hospital, Salisbury, UK

Phil Donnison

National Phenome Centre, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

Gonçalo dos Santos Correia, Matthew Lewis, Lynn Maslen, Caroline Sands, Zoltan Takats & Panteleimon Takis

Section of Bioanalytical Chemistry, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

Gonçalo dos Santos Correia, Matthew Lewis, Lynn Maslen, Caroline Sands & Panteleimon Takis

Guy’s and St Thomas’ NHS Foundation Trust, London, UK

Sam Douthwaite, Michael MacMahon, Marlies Ostermann & Manu Shankar-Hari

The Royal Oldham Hospital, Oldham, UK

Andrew Drummond

European Genomic Institute for Diabetes, Institut Pasteur de Lille, Lille University Hospital, University of Lille, Lille, France

Marc-Emmanuel Dumas

McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada

National Infection Service, Public Health England, London, UK

Jake Dunning & Maria Zambon

Hereford County Hospital, Hereford, UK

Ingrid DuRand

Southampton General Hospital, Southampton, UK

Ahilanadan Dushianthan

Northampton General Hospital, Northampton, UK

Tristan Dyer

University Hospital of Wales, Cardiff, UK

Chrisopher Fegan

University Hospitals Bristol NHS Foundation Trust, Bristol, UK

Liverpool School of Tropical Medicine, Liverpool, UK

Tom Fletcher

Leighton Hospital, Crewe, UK

Duncan Fullerton & Elijah Matovu

Manor Hospital, Walsall, UK

Scunthorpe Hospital, Scunthorpe, UK

Sanjeev Garg

Cambridge University Hospital, Cambridge, UK

Effrossyni Gkrania-Klotsas

West Suffolk NHS Foundation Trust, Bury St Edmunds, UK

Basingstoke and North Hampshire Hospital, Basingstoke, UK

Arthur Goldsmith

North Cumberland Infirmary, Carlisle, UK

Clive Graham

Paediatric Liver, GI and Nutrition Centre and MowatLabs, King’s College Hospital, London, UK

Tassos Grammatikopoulos

Institute of Liver Studies, King’s College London, London, UK

Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK

Christopher A. Green

Department of Molecular and Clinical Cancer Medicine, University of Liverpool, Liverpool, UK

William Greenhalf

Institute for Global Health, University College London, London, UK

Rishi K. Gupta

NIHR Health Protection Research Unit, Institute of Infection, Veterinary and Ecological Sciences, Faculty of Health and Life Sciences, University of Liverpool, Liverpool, UK

Hayley Hardwick, Malcolm G. Semple, Tom Solomon & Lance C. W. Turtle

Warwick Hospital, Warwick, UK

Elaine Hardy

Birmingham Children’s Hospital, Birmingham, UK

Stuart Hartshorn

Nottingham City Hospital, Nottingham, UK

Daniel Harvey

Glangwili Hospital Child Health Section, Carmarthen, UK

Peter Havalda

Alder Hey Children’s Hospital, Liverpool, UK

Daniel B. Hawcutt

Department of Infectious Diseases, Queen Elizabeth University Hospital, Glasgow, UK

Antonia Y. W. Ho

Bronglais General Hospital, Aberystwyth, UK

Maria Hobrok

Worthing Hospital, Worthing, UK

Luke Hodgson

Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK

Peter W. Horby

Rotherham District General Hospital, Rotherham, UK

Anil Hormis

Virology Reference Department, National Infection Service, Public Health England, Colindale Avenue, London, UK

Samreen Ijaz

Royal Free Hospital, London, UK

Michael Jacobs & Padmasayee Papineni

Homerton Hospital, London, UK

Airedale Hospital, Airedale, UK

Paul Jennings

Basildon Hospital, Basildon, UK

Agilan Kaliappan

The Christie NHS Foundation Trust, Manchester, UK

Vidya Kasipandian

University Hospital Lewisham, London, UK

Stephen Kegg

The Whittington Hospital, London, UK

Michael Kelsey

Southmead Hospital, Bristol, UK

Jason Kendall

Sheffield Children’s Hospital, Sheffield, UK

Caroline Kerrison

Royal United Hospital, Bath, UK

Ian Kerslake

Department of Pharmacology, University of Liverpool, Liverpool, UK

Nuffield Department of Medicine, Peter Medawar Building for Pathogen Research, University of Oxford, Oxford, UK

Paul Klenerman

Translational Gastroenterology Unit, Nuffield Department of Medicine, University of Oxford, Oxford, UK

Public Health Scotland, Edinburgh, UK

Susan Knight, Eva Lahnsteiner & Sarah Tait

Western General Hospital, Edinburgh, UK

Oliver Koch

Southend University Hospital NHS Foundation Trust, Southend-on-Sea, UK

Gouri Koduri

Hinchingbrooke Hospital, Huntingdon, UK

George Koshy & Tamas Leiner

Royal Preston Hospital, Fulwood, UK

Shondipon Laha

University Hospital (Coventry), Coventry, UK

Steven Laird

The Walton Centre, Liverpool, UK

Susan Larkin

ISARIC, Global Support Centre, COVID-19 Clinical Research Resources, Epidemic diseases Research Group, Oxford (ERGO), University of Oxford, Oxford, UK

James Lee & Daniel Plotkin

Centre for Health Informatics, Division of Informatics, Imaging and Data Science, School of Health Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK

Gary Leeming

Hull Royal Infirmary, Hull, UK

Patrick Lillie

Nottingham University Hospitals NHS Trust, Nottingham, UK

Wei Shen Lim

Darlington Memorial Hospital, Darlington, UK

Queen Elizabeth Hospital (Gateshead), Gateshead, UK

Vanessa Linnett

Warrington Hospital, Warrington, UK

Jeff Little

Bristol Royal Hospital for Children, Bristol, UK

Mark Lyttle

St Mary’s Hospital (Isle of Wight), Isle of Wight, UK

Emily MacNaughton

The Tunbridge Wells Hospital, Royal Tunbridge Wells, UK

Ravish Mankregod

Huddersfield Royal, Huddersfield, UK

Countess of Chester Hospital, Liverpool, UK

Ruth McEwen & Lawrence Wilson

Frimley Park Hospital, Frimley, UK

Manjula Meda

Nuffield Department of Medicine, John Radcliffe Hospital, Oxford, UK

Alexander J. Mentzer

Department of Microbiology/Infectious Diseases, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, UK

MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK

Alison M. Meynert & Murray Wham

St James University Hospital, Leeds, UK

Jane Minton

Arrowe Park Hospital, Birkenhead, UK

Kavya Mohandas

Great Ormond Street Hospital, London, UK

Royal Shrewsbury Hospital, Shrewsbury, UK

Addenbrookes Hospital, Cambridge, UK

Elinoor Moore

Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool, UK

Shona C. Moore, William A. Paxton & Georgios Pollakis

East Surrey Hospital (Redhill), Redhill, UK

Patrick Morgan

Burton Hospital, Burton, UK

Craig Morris & Tim Reynolds

Peterborough City Hospital, Peterborough, UK

Katherine Mortimore

Kent and Canterbury Hospital, Canterbury, UK

Samuel Moses

Weston Area General Trust, Bristol, UK

Mbiye Mpenge

Bedfordshire Hospital, Bedfordshire, UK

Rohinton Mulla

Glasgow Royal Infirmary, Glasgow, UK

Michael Murphy

Macclesfield General Hospital, Macclesfield, UK

Thapas Nagarajan

Derbyshire Healthcare, Derbyshire, UK

Megan Nagel

Chelsea and Westminster Hospital, London, UK

Mark Nelson & Matthew K. O’Shea

Watford General Hospital, Watford, UK

Lillian Norris & Tom Stambach

EPCC, University of Edinburgh, Edinburgh, UK

Lucy Norris

Section of Biomolecular Medicine, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, London, UK

Michael Olanipekun

Imperial College Healthcare NHS Trust, London, UK

Peter J. M. Openshaw

Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK

Anthonia Osagie

Prince Philip Hospital, Llanelli, UK

Igor Otahal & Andrew Workman

George Eliot Hospital – Acute Services, Nuneaton, UK

Molecular and Clinical Cancer Medicine, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK

Carlo Palmieri

Clatterbridge Cancer Centre NHS Foundation Trust, Liverpool, UK

Kettering General Hospital, Kettering, UK

Selva Panchatsharam

University Hospitals of North Midlands NHS Trust, North Midlands, UK

Danai Papakonstantinou

Russells Hall Hospital, Dudley, UK

Hassan Paraiso

Harefield Hospital, Harefield, UK

Lister Hospital, Lister, UK

Natalie Pattison

Musgrove Park Hospital, Taunton, UK

Justin Pepperell

Kingston Hospital, Kingston, UK

Mark Peters

Queen’s Hospital, Romford, UK

Mandeep Phull

Southport and Formby District General Hospital, Southport, UK

Stefania Pintus

St George’s University of London, London, UK

Tim Planche

King’s College Hospital (Denmark Hill), London, UK

Centre for Clinical Infection and Diagnostics Research, Department of Infectious Diseases, School of Immunology and Microbial Sciences, King’s College London, London, UK

Nicholas Price

Department of Infectious Diseases, Guy’s and St Thomas’ NHS Foundation Trust, London, UK

The Clatterbridge Cancer Centre NHS Foundation, Bebington, UK

David Price

The Great Western Hospital, Swindon, UK

Rachel Prout

Ninewells Hospital, Dundee, UK

Nikolas Rae

Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK

Andrew Rambaut

Poole Hospital NHS Trust, Poole, UK

Henrik Reschreiter

William Harvey Hospital, Ashford, UK

Neil Richardson

King’s Mill Hospital, Sutton-in-Ashfield, UK

Mark Roberts

Liverpool Women’s Hospital, Liverpool, UK

Devender Roberts

Pinderfields Hospital, Wakefield, UK

Alistair Rose

North Devon District Hospital, Barnstaple, UK

Guy Rousseau

Queen Elizabeth Hospital, Birmingham, UK

Tameside General Hospital, Ashton-under-Lyne, UK

Brendan Ryan

City Hospital (Birmingham), Birmingham, UK

Taranprit Saluja

Department of Pediatrics and Virology, St Mary’s Medical School Bldg, Imperial College London, London, UK

Vanessa Sancho-Shimizu

The Newcastle Upon Tyne Hospitals NHS Foundation Trust, Newcastle Upon Tyne, UK

Matthias Schmid

NHS Greater Glasgow and Clyde, Glasgow, UK

Janet T. Scott

Respiratory Medicine, Institute in The Park, University of Liverpool, Alder Hey Children’s Hospital, Liverpool, UK

Malcolm G. Semple

Broomfield Hospital, Broomfield, UK

Stoke Mandeville, UK

Prad Shanmuga

University Hospital of North Tees, Stockton-on-Tees, UK

Anil Sharma

Institute of Translational Medicine, University of Liverpool, Merseyside, UK

Victoria E. Shaw

Royal Manchester Children’s Hospital, Manchester, UK

Anna Shawcross

New Cross Hospital, Wolverhampton, UK

Jagtur Singh Pooni

Bedford Hospital, Bedford, UK

Jeremy Sizer

Colchester General Hospital, Colchester, UK

Richard Smith

University Hospital Birmingham NHS Foundation Trust, Birmingham, UK

Catherine Snelson & Tony Whitehouse

Walton Centre NHS Foundation Trust, Liverpool, UK

Tom Solomon

Chesterfield Royal Hospital, Calow, UK

Nick Spittle

MRC Centre for Molecular Bacteriology and Infection, Imperial College London, London, UK

Shiranee Sriskandan

Princess Alexandra Hospital, Harlow, UK

Nikki Staines & Shico Visuvanathan

Milton Keynes Hospital, Eaglestone, UK

Richard Stewart

Division of Structural Biology, The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK

David Stuart

Royal Bolton Hospital, Farnworth, UK

Pradeep Subudhi

Department of Medicine, University of Cambridge, Cambridge, UK

Charlotte Summers

Department of Child Life and Health, University of Edinburgh, Edinburgh, UK

Olivia V. Swann

Royal Gwent (Newport), Newport, UK

Tamas Szakmany

The Royal Marsden Hospital (London), London, UK

Kate Tatham

Blood Borne Virus Unit, Virus Reference Department, National Infection Service, Public Health England, London, UK

Richard S. Tedder

Transfusion Microbiology, National Health Service Blood and Transplant, London, UK

Department of Medicine, Imperial College London, London, UK

Queen Victoria Hospital (East Grinstead), East Grinstead, UK

Leeds Teaching Hospitals NHS Trust, Leeds, UK

Robert Thompson

Royal Stoke University Hospital, Stoke-on-Trent, UK

Chris Thompson

Whiston Hospital, Rainhill, UK

Ascanio Tridente

Tropical and Infectious Disease Unit, Royal Liverpool University Hospital, Liverpool, UK

Lance C. W. Turtle

Croydon University Hospital, Thornton Heath, UK

Mary Twagira

Gloucester Royal, Gloucester, UK

Nick Vallotton

West Hertfordshire Teaching Hospitals NHS Trust, Hertfordshire, UK

Rama Vancheeswaran

North Middlesex Hospital, London, UK

Rachel Vincent

Medway Maritime Hospital, Gillingham, UK

Lisa Vincent-Smith

Royal Papworth Hospital Everard, Cambridge, UK

Alan Vuylsteke

Derriford (Plymouth), Plymouth, UK

St Helier Hospital, Sutton, UK

Rachel Wake

Royal Berkshire Hospital, Reading, UK

Andrew Walden

Royal Liverpool Hospital, Liverpool, UK

Ingeborg Welters

Bradford Royal Infirmary, Bradford, UK

Paul Whittaker

Central Middlesex, London, UK

Ashley Whittington

Royal Cornwall Hospital (Treliske), Truro, UK

Meme Wijesinghe

North Bristol NHS Trust, Bristol, UK

Martin Williams

St. Peter’s Hospital, Runnymede, UK

Stephen Winchester

Leicester Royal Infirmary, Leicester, UK

Martin Wiselka

Grantham and District Hospital, Grantham, UK

Adam Wolverson

Aintree University Hospital, Liverpool, UK

Daniel G. Wootton

North Tyneside General Hospital, North Shields, UK

Bryan Yates

Queen Elizabeth Hospital, King’s Lynn, UK

Peter Young


PHOSP-COVID collaborative group

  • Kathryn Abel
  • H. Adamali
  • Davies Adeloye
  • Oluwaseun Adeyemi
  • Rita Adrego
  • Laura Aguilar Jimenez
  • Shanaz Ahmad
  • N. Ahmad Haider
  • Rubina Ahmed
  • Nyarko Ahwireng
  • Mark Ainsworth
  • Asma Alamoudi
  • Mariam Ali
  • M. Aljaroof
  • Louise Allan
  • Richard Allen
  • Lisa Allerton
  • Lynne Allsop
  • Ann Marie Allt
  • Paula Almeida
  • Bashar Al-Sheklly
  • Danny Altmann
  • Maria Alvarez Corral
  • Shannon Amoils
  • David Anderson
  • Charalambos Antoniades
  • Gill Arbane
  • Ava Maria Arias
  • Cherie Armour
  • Lisa Armstrong
  • Natalie Armstrong
  • David Arnold
  • H. Arnold
  • A. Ashish
  • Andrew Ashworth
  • M. Ashworth
  • Shahab Aslani
  • Hosanna Assefa-Kebede
  • Paul Atkin
  • Catherine Atkin
  • Raminder Aul
  • Hnin Aung
  • Liam Austin
  • Cristina Avram
  • Nikos Avramidis
  • Marta Babores
  • Rhiannon Baggott
  • J. Bagshaw
  • David Baguley
  • Elisabeth Bailey
  • J. Kenneth Baillie
  • Steve Bain
  • Majda Bakali
  • E. Baldry
  • Molly Baldwin
  • David Baldwin
  • Clive Ballard
  • Amita Banerjee
  • Dongchun Bang
  • R. E. Barker
  • Laura Barman
  • Perdita Barran
  • Shaney Barratt
  • Fiona Barrett
  • Donna Basire
  • Neil Basu
  • Michelle Bates
  • R. Batterham
  • Helen Baxendale
  • Gabrielle Baxter
  • Hannah Bayes
  • M. Beadsworth
  • Paul Beckett
  • Paul Beirne
  • Murdina Bell
  • Robert Bell
  • Kaytie Bennett
  • Eva Beranova
  • Areti Bermperi
  • Anthony Berridge
  • Colin Berry
  • Sarah Betts
  • Emily Bevan
  • Kamaldeep Bhui
  • Michelle Bingham
  • K. Birchall
  • Lettie Bishop
  • Karen Bisnauthsing
  • John Blaikely
  • Angela Bloss
  • Annette Bolger
  • Charlotte Bolton
  • J. Bonnington
  • A. Botkai
  • Charlotte Bourne
  • Michelle Bourne
  • Kate Bramham
  • Lucy Brear
  • Jonathon Breeze
  • Katie Breeze
  • Andrew Briggs
  • E. Bright
  • Christopher Brightling
  • Simon Brill
  • K. Brindle
  • Lauren Broad
  • Andrew Broadley
  • Claire Brookes
  • Mattew Broome
  • Vanessa Brown
  • Ammani Brown
  • Angela Brown
  • Jeremy Brown
  • Terry Brugha
  • Nigel Brunskill
  • Phil Buckley
  • Anda Bularga
  • Ed Bullmore
  • Jenny Bunker
  • L. Burden
  • Tracy Burdett
  • David Burn
  • John Busby
  • Robyn Butcher
  • Al-Tahoor Butt
  • P. Cairns
  • P. C. Calder
  • Ellen Calvelo
  • H. Carborn
  • Bethany Card
  • Caitlin Carr
  • Liesel Carr
  • G. Carson
  • Penny Carter
  • Anna Casey
  • M. Cassar
  • Jonathon Cavanagh
  • Manish Chablani
  • Trudie Chalder
  • James D. Chalmers
  • Rachel Chambers
  • Flora Chan
  • K. M. Channon
  • Kerry Chapman
  • Amanda Charalambou
  • N. Chaudhuri
  • A. Checkley
  • Yutung Cheng
  • Luke Chetham
  • Caroline Childs
  • Edwin Chilvers
  • H. Chinoy
  • A. Chiribiri
  • K. Chong-James
  • N. Choudhury
  • Gaunab Choudhury
  • Phillip Chowienczyk
  • C. Christie
  • Melanie Chrystal
  • Cameron Clark
  • David Clark
  • Jude Clarke
  • S. Clohisey
  • G. Coakley
  • Zach Coburn
  • S. Coetzee
  • Joby Cole
  • Chris Coleman
  • Florence Conneh
  • David Connell
  • Bronwen Connolly
  • Lynda Connor
  • Amanda Cook
  • Shirley Cooper
  • B. Cooper
  • Josh Cooper
  • Donna Copeland
  • Tracey Cosier
  • Eamon Coughlan
  • Martina Coulding
  • C. Coupland
  • Thelma Craig
  • Daniele Cristiano
  • Michael Crooks
  • Andy Cross
  • Isabel Cruz
  • P. Cullinan
  • D. Cuthbertson
  • Luke Daines
  • Matthhew Dalton
  • Patrick Daly
  • Alison Daniels
  • Joanne Dasgin
  • Anthony David
  • Ffyon Davies
  • Ellie Davies
  • Kim Davies
  • Gareth Davies
  • Gwyneth Davies
  • Melanie Davies
  • Joy Dawson
  • Camilla Dawson
  • Enya Daynes
  • Anthony De Soyza
  • Bill Deakin
  • Andrew Deans
  • Joanne Deery
  • Sylviane Defres
  • Amanda Dell
  • K. Dempsey
  • Emma Denneny
  • J. Dennis
  • Ruvini Dharmagunawardena
  • Nawar Diar-Bakerly
  • Caroline Dickens
  • A. Dipper
  • Sarah Diver
  • Shalin Diwanji
  • Myles Dixon
  • R. Djukanovic
  • Hannah Dobson
  • S. L. Dobson
  • Annemarie B. Docherty
  • A. Donaldson
  • N. Dormand
  • Andrew Dougherty
  • Rachael Dowling
  • Stephen Drain
  • Katharine Draxlbauer
  • Katie Drury
  • Pearl Dulawan
  • A. Dunleavy
  • Sarah Dunn
  • Catherine Dupont
  • Joanne Earley
  • Nicholas Easom
  • Carlos Echevarria
  • Sarah Edwards
  • C. Edwardson
  • Claudia Efstathiou
  • Anne Elliott
  • K. Elliott
  • Yvette Ellis
  • Anne Elmer
  • Omer Elneima
  • Hosni El-Taweel
  • Teriann Evans
  • Ranuromanana Evans
  • Rachael A. Evans
  • Jonathon Evans
  • Cerys Evenden
  • Lynsey Evison
  • Laura Fabbri
  • Sara Fairbairn
  • Alexandra Fairman
  • K. Fallon
  • David Faluyi
  • Clair Favager
  • , Tamanah Fayzan
  • , James Featherstone
  • , T. Felton
  • , V. Ferreira
  • , Selina Finney
  • , J. Finnigan
  • , L. Finnigan
  • , Helen Fisher
  • , S. Fletcher
  • , Rachel Flockton
  • , Margaret Flynn
  • , David Foote
  • , Amber Ford
  • , D. Forton
  • , Eva Fraile
  • , C. Francis
  • , Richard Francis
  • , Susan Francis
  • , Anew Frankel
  • , Emily Fraser
  • , N. French
  • , Jonathon Fuld
  • , J. Furniss
  • , Lucie Garner
  • , N. Gautam
  • , John Geddes
  • , J. George
  • , P. George
  • , Michael Gibbons
  • , Rhyan Gill
  • , Mandy Gill
  • , L. Gilmour
  • , F. Gleeson
  • , Jodie Glossop
  • , Sarah Glover
  • , Nicola Goodman
  • , Camelia Goodwin
  • , Bibek Gooptu
  • , Hussain Gordon
  • , T. Gorsuch
  • , M. Greatorex
  • , Paul Greenhaff
  • , William Greenhalf
  • , Alan Greenhalgh
  • , Neil J. Greening
  • , John Greenwood
  • , Rebecca Gregory
  • , Heidi Gregory
  • , D. Grieve
  • , Denise Griffin
  • , L. Griffiths
  • , Anne-Marie Guerdette
  • , Beatriz Guillen-Guio
  • , Mahitha Gummadi
  • , Ayushman Gupta
  • , Sambasivarao Gurram
  • , Elspeth Guthrie
  • , Kate Hadley
  • , Ahmed Haggar
  • , Kera Hainey
  • , Brigid Hairsine
  • , Pranab Haldar
  • , Lucy Hall
  • , Mark Halling-Brown
  • , Alyson Hancock
  • , Kia Hancock
  • , Neil Hanley
  • , Sulaimaan Haq
  • , Hayley Hardwick
  • , Tim Hardy
  • , Beverley Hargadon
  • , Kate Harrington
  • , Edward Harris
  • , Victoria C. Harris
  • , Ewen Harrison
  • , Paul Harrison
  • , Nicholas Hart
  • , Alice Harvey
  • , Matt Harvey
  • , M. Harvie
  • , L. Haslam
  • , Claire Hastie
  • , May Havinden-Williams
  • , Jenny Hawkes
  • , Nancy Hawkings
  • , Jill Haworth
  • , A. Hayday
  • , Matthew Haynes
  • , J. Hazeldine
  • , Tracy Hazelton
  • , Liam Heaney
  • , Cheryl Heeley
  • , Jonathon Heeney
  • , M. Heightman
  • , Simon Heller
  • , Max Henderson
  • , Helen Henson
  • , L. Hesselden
  • , Melanie Hewitt
  • , Victoria Highett
  • , T. Hillman
  • , Ling-Pei Ho
  • , Michaela Hoare
  • , Amy Hoare
  • , J. Hockridge
  • , Philip Hogarth
  • , Ailsa Holbourn
  • , Sophie Holden
  • , L. Holdsworth
  • , D. Holgate
  • , Maureen Holland
  • , Leah Holloway
  • , Katie Holmes
  • , Megan Holmes
  • , B. Holroyd-Hind
  • , Anil Hormis
  • , Alexander Horsley
  • , Akram Hosseini
  • , M. Hotopf
  • , Linzy Houchen-Wolloff
  • , Luke S. Howard
  • , Kate Howard
  • , Alice Howell
  • , E. Hufton
  • , Rachel Ann Hughes
  • , Joan Hughes
  • , Alun Hughes
  • , Amy Humphries
  • , Nathan Huneke
  • , E. Hurditch
  • , John Hurst
  • , Masud Husain
  • , Tracy Hussell
  • , John Hutchinson
  • , W. Ibrahim
  • , Julie Ingham
  • , L. Ingram
  • , Diana Ionita
  • , Karen Isaacs
  • , Khalida Ismail
  • , T. Jackson
  • , Joseph Jacob
  • , W. Y. James
  • , Claire Jarman
  • , Ian Jarrold
  • , Hannah Jarvis
  • , Roman Jastrub
  • , Bhagy Jayaraman
  • , Gisli Jenkins
  • , P. Jezzard
  • , Kasim Jiwa
  • , C. Johnson
  • , Simon Johnson
  • , Desmond Johnston
  • , Caroline Jolley
  • , Ian Jones
  • , Heather Jones
  • , Mark Jones
  • , Don Jones
  • , Sherly Jose
  • , Thomas Kabir
  • , G. Kaltsakas
  • , Vicky Kamwa
  • , N. Kanellakis
  • , Sabina Kaprowska
  • , Zunaira Kausar
  • , Natalie Keenan
  • , Steven Kerr
  • , Helen Kerslake
  • , Angela Key
  • , Fasih Khan
  • , Kamlesh Khunti
  • , Susan Kilroy
  • , Bernie King
  • , Clara King
  • , Lucy Kingham
  • , Jill Kirk
  • , Paaig Kitterick
  • , Paul Klenerman
  • , Lucy Knibbs
  • , Sean Knight
  • , Abigail Knighton
  • , Onn Min Kon
  • , Samantha Kon
  • , Ania Korszun
  • , Ivan Koychev
  • , Claire Kurasz
  • , Prathiba Kurupati
  • , Hanan Lamlum
  • , G. Landers
  • , Claudia Langenberg
  • , Lara Lavelle-Langham
  • , Allan Lawrie
  • , Cathy Lawson
  • , Claire Lawson
  • , Alison Layton
  • , Olivia C. Leavy
  • , Ju Hee Lee
  • , Elvina Lee
  • , Karen Leitch
  • , Rebecca Lenagh
  • , Victoria Lewis
  • , Joanne Lewis
  • , Keir Lewis
  • , N. Lewis-Burke
  • , Felicity Liew
  • , Tessa Light
  • , Liz Lightstone
  • , W. Lilaonitkul
  • , S. Linford
  • , Anne Lingford-Hughes
  • , M. Lipman
  • , Kamal Liyanage
  • , Arwel Lloyd
  • , Nazir I. Lone
  • , Ronda Loosley
  • , Janet Lord
  • , Harpreet Lota
  • , Wayne Lovegrove
  • , Daniel Lozano-Rojas
  • , Alice Lucey
  • , Gardiner Lucy
  • , E. Lukaschuk
  • , Alison Lye
  • , Ceri Lynch
  • , S. MacDonald
  • , G. MacGowan
  • , Irene Macharia
  • , J. Mackie
  • , L. Macliver
  • , S. Madathil
  • , Gladys Madzamba
  • , Nick Magee
  • , Murphy Magtoto
  • , N. Majeed
  • , Flora Malein
  • , Georgia Mallison
  • , William Man
  • , S. Mandal
  • , K. Mangion
  • , C. Manisty
  • , R. Manley
  • , Katherine March
  • , Stefan Marciniak
  • , Philip Marino
  • , Myril Mariveles
  • , Michael Marks
  • , Elizabeth Marouzet
  • , Sophie Marsh
  • , M. Marshall
  • , B. Marshall
  • , Jane Martin
  • , Adrian Martineau
  • , L. M. Martinez
  • , Nick Maskell
  • , Darwin Matila
  • , Wadzanai Matimba-Mupaya
  • , Laura Matthews
  • , Angeline Mbuyisa
  • , Steve McAdoo
  • , Hamish McAllister-Williams
  • , Paul McArdle
  • , Anne McArdle
  • , Danny McAulay
  • , Hamish J. C. McAuley
  • , Gerry McCann
  • , W. McCormick
  • , Jacqueline McCormick
  • , P. McCourt
  • , Celeste McCracken
  • , Lorcan McGarvey
  • , Jade McGinness
  • , K. McGlynn
  • , Andrew McGovern
  • , Heather McGuinness
  • , I. B. McInnes
  • , Jerome McIntosh
  • , Emma McIvor
  • , Katherine McIvor
  • , Laura McLeavey
  • , Aisling McMahon
  • , Michael McMahon
  • , L. McMorrow
  • , Teresa Mcnally
  • , M. McNarry
  • , J. McNeill
  • , Alison McQueen
  • , H. McShane
  • , Chloe Mears
  • , Clare Megson
  • , Sharon Megson
  • , J. Meiring
  • , Lucy Melling
  • , Mark Mencias
  • , Daniel Menzies
  • , Marta Merida Morillas
  • , Alice Michael
  • , Benedict Michael
  • , C. A. Miller
  • , Lea Milligan
  • , Nicholas Mills
  • , Clare Mills
  • , George Mills
  • , L. Milner
  • , Jane Mitchell
  • , Abdelrahman Mohamed
  • , Noura Mohamed
  • , S. Mohammed
  • , Philip Molyneaux
  • , Will Monteiro
  • , Silvia Moriera
  • , Anna Morley
  • , Leigh Morrison
  • , Richard Morriss
  • , A. Morrow
  • , Paul Moss
  • , Alistair Moss
  • , K. Motohashi
  • , N. Msimanga
  • , Elizabeta Mukaetova-Ladinska
  • , Unber Munawar
  • , Jennifer Murira
  • , Uttam Nanda
  • , Heeah Nassa
  • , Mariam Nasseri
  • , Rashmita Nathu
  • , Aoife Neal
  • , Robert Needham
  • , Paula Neill
  • , Stefan Neubauer
  • , D. E. Newby
  • , Helen Newell
  • , J. Newman
  • , Tom Newman
  • , Alex Newton-Cox
  • , T. E. Nichols
  • , Tim Nicholson
  • , Christos Nicolaou
  • , Debby Nicoll
  • , Athanasios Nikolaidis
  • , C. Nikolaidou
  • , C. M. Nolan
  • , Matthew Noonan
  • , C. Norman
  • , Petr Novotny
  • , Kimon Ntotsis
  • , Jose Nunag
  • , Lorenza Nwafor
  • , Uchechi Nwanguma
  • , Joseph Nyaboko
  • , Linda O’Brien
  • , C. O’Brien
  • , Natasha Odell
  • , Kate O’Donnell
  • , Godwin Ogbole
  • , Olaoluwa Olaosebikan
  • , Catherine Oliver
  • , Zohra Omar
  • , Peter J. M. Openshaw
  • , D. P. O’Regan
  • , Lorna Orriss-Dib
  • , Lynn Osborne
  • , Rebecca Osbourne
  • , Marlies Ostermann
  • , Charlotte Overton
  • , Jamie Pack
  • , Edmund Pacpaco
  • , Stella-Maria Paddick
  • , Sharon Painter
  • , Erola Pairo-Castineira
  • , Ashkan Pakzad
  • , Sue Palmer
  • , Padmasayee Papineni
  • , K. Paques
  • , Kerry Paradowski
  • , Manish Pareek
  • , Dhruv Parekh
  • , H. Parfrey
  • , Carmen Pariante
  • , S. Parker
  • , M. Parkes
  • , J. Parmar
  • , Sheetal Patale
  • , Manish Patel
  • , Suhani Patel
  • , Dibya Pattenadk
  • , M. Pavlides
  • , Sheila Payne
  • , Lorraine Pearce
  • , John Pearl
  • , Dan Peckham
  • , Jessica Pendlebury
  • , Yanchun Peng
  • , Chris Pennington
  • , Ida Peralta
  • , Emma Perkins
  • , Z. Peterkin
  • , Tunde Peto
  • , Nayia Petousi
  • , John Petrie
  • , Paul Pfeffer
  • , Janet Phipps
  • , S. Piechnik
  • , John Pimm
  • , Karen Piper Hanley
  • , Riinu Pius
  • , Hannah Plant
  • , Tatiana Plekhanova
  • , Megan Plowright
  • , Krisnah Poinasamy
  • , Oliver Polgar
  • , Julie Porter
  • , Joanna Porter
  • , Sofiya Portukhay
  • , Natassia Powell
  • , A. Prabhu
  • , James Pratt
  • , Andrea Price
  • , Claire Price
  • , Carly Price
  • , Anne Prickett
  • , I. Propescu
  • , J. Propescu
  • , Sabrina Prosper
  • , S. Pugmire
  • , Sheena Quaid
  • , Jackie Quigley
  • , Jennifer K. Quint
  • , H. Qureshi
  • , I. N. Qureshi
  • , K. Radhakrishnan
  • , Najib Rahman
  • , Markus Ralser
  • , Betty Raman
  • , Hazel Ramos
  • , Albert Ramos
  • , Jade Rangeley
  • , Bojidar Rangelov
  • , Liz Ratcliffe
  • , Phillip Ravencroft
  • , Konrad Rawlik
  • , Anne Reddington
  • , Heidi Redfearn
  • , Dawn Redwood
  • , Annabel Reed
  • , Meryl Rees
  • , Tabitha Rees
  • , Karen Regan
  • , Will Reynolds
  • , Carla Ribeiro
  • , A. Richards
  • , Emma Richardson
  • , M. Richardson
  • , Pilar Rivera-Ortega
  • , K. Roberts
  • , Elizabeth Robertson
  • , Leanne Robinson
  • , Emma Robinson
  • , Lisa Roche
  • , C. Roddis
  • , J. Rodger
  • , Natalie Rogers
  • , Gavin Ross
  • , Alexandra Ross
  • , Jennifer Rossdale
  • , Anthony Rostron
  • , Anna Rowe
  • , J. Rowland
  • , M. J. Rowland
  • , A. Rowland
  • , Sarah L. Rowland-Jones
  • , Maura Roy
  • , Igor Rudan
  • , Richard Russell
  • , Emily Russell
  • , Gwen Saalmink
  • , Ramsey Sabit
  • , Beth Sage
  • , T. Samakomva
  • , Nilesh Samani
  • , A. A. Samat
  • , Claire Sampson
  • , Katherine Samuel
  • , Reena Samuel
  • , Z. B. Sanders
  • , Amy Sanderson
  • , Elizabeth Sapey
  • , Dinesh Saralaya
  • , Jack Sargant
  • , Carol Sarginson
  • , Naveed Sattar
  • , Kathryn Saunders
  • , Peter Saunders
  • , Ruth Saunders
  • , Laura Saunders
  • , Heather Savill
  • , Avan Sayer
  • , J. Schronce
  • , William Schwaeble
  • , Janet Scott
  • , Kathryn Scott
  • , Nick Selby
  • , Malcolm G. Semple
  • , Marco Sereno
  • , Terri Ann Sewell
  • , Kamini Shah
  • , Ajay Shah
  • , Manu Shankar-Hari
  • , M. Sharma
  • , Claire Sharpe
  • , Michael Sharpe
  • , Sharlene Shashaa
  • , Alison Shaw
  • , Victoria Shaw
  • , Karen Shaw
  • , Aziz Sheikh
  • , Sarah Shelton
  • , Liz Shenton
  • , K. Shevket
  • , Aarti Shikotra
  • , Sulman Siddique
  • , Salman Siddiqui
  • , J. Sidebottom
  • , Louise Sigfrid
  • , Gemma Simons
  • , Neil Simpson
  • , John Simpson
  • , Ananga Singapuri
  • , Suver Singh
  • , Claire Singh
  • , Sally Singh
  • , D. Sissons
  • , J. Skeemer
  • , Katie Slack
  • , David Smith
  • , Nikki Smith
  • , Andrew Smith
  • , Jacqui Smith
  • , Laurie Smith
  • , Susan Smith
  • , M. Soares
  • , Teresa Solano
  • , Reanne Solly
  • , A. R. Solstice
  • , Tracy Soulsby
  • , David Southern
  • , D. Sowter
  • , Mark Spears
  • , Lisa Spencer
  • , Fabio Speranza
  • , Louise Stadon
  • , Stefan Stanel
  • , R. Steeds
  • , N. Steele
  • , Mike Steiner
  • , David Stensel
  • , G. Stephens
  • , Lorraine Stephenson
  • , Iain Stewart
  • , R. Stimpson
  • , Sue Stockdale
  • , J. Stockley
  • , Wendy Stoker
  • , Roisin Stone
  • , Will Storrar
  • , Andrew Storrie
  • , Kim Storton
  • , E. Stringer
  • , Sophia Strong-Sheldrake
  • , Natalie Stroud
  • , Christian Subbe
  • , Catherine Sudlow
  • , Zehra Suleiman
  • , Charlotte Summers
  • , C. Summersgill
  • , Debbie Sutherland
  • , D. L. Sykes
  • , Nick Talbot
  • , Ai Lyn Tan
  • , Lawrence Tarusan
  • , Vera Tavoukjian
  • , Jessica Taylor
  • , Abigail Taylor
  • , Chris Taylor
  • , John Paul Taylor
  • , Amelie Te
  • , Caroline Tee
  • , J. Teixeira
  • , Helen Tench
  • , Sarah Terry
  • , Susannah Thackray-Nocera
  • , Favas Thaivalappil
  • , David Thickett
  • , David Thomas
  • , S. Thomas
  • , Caradog Thomas
  • , Andrew Thomas
  • , T. Thomas-Woods
  • , A. A. Roger Thompson
  • , Tamika Thompson
  • , T. Thornton
  • , Matthew Thorpe
  • , Ryan S. Thwaites
  • , Jo Tilley
  • , N. Tinker
  • , Gerlynn Tiongson
  • , Martin Tobin
  • , Johanne Tomlinson
  • , Mark Toshner
  • , T. Treibel
  • , K. A. Tripp
  • , Drupad Trivedi
  • , E. M. Tunnicliffe
  • , Alison Turnbull
  • , Kim Turner
  • , Sarah Turner
  • , Victoria Turner
  • , E. Turner
  • , Sharon Turney
  • , Lance Turtle
  • , Helena Turton
  • , Jacinta Ugoji
  • , R. Ugwuoke
  • , Rachel Upthegrove
  • , Jonathon Valabhji
  • , Maximina Ventura
  • , Joanne Vere
  • , Carinna Vickers
  • , Ben Vinson
  • , Ioannis Vogiatzis
  • , Elaine Wade
  • , Phillip Wade
  • , Louise V. Wain
  • , Tania Wainwright
  • , Lilian Wajero
  • , Sinead Walder
  • , Samantha Walker
  • , S. Walker
  • , Tim Wallis
  • , Sarah Walmsley
  • , Simon Walsh
  • , J. A. Walsh
  • , Louise Warburton
  • , T. J. C. Ward
  • , Katie Warwick
  • , Helen Wassall
  • , Samuel Waterson
  • , L. Watson
  • , Ekaterina Watson
  • , James Watson
  • , M. Webster
  • , J. Weir McCall
  • , Carly Welch
  • , Simon Wessely
  • , Sophie West
  • , Heather Weston
  • , Helen Wheeler
  • , Sonia White
  • , Victoria Whitehead
  • , J. Whitney
  • , S. Whittaker
  • , Beverley Whittam
  • , V. Whitworth
  • , Andrew Wight
  • , James Wild
  • , Martin Wilkins
  • , Dan Wilkinson
  • , Nick Williams
  • , N. Williams
  • , B. Williams
  • , Jenny Williams
  • , S. A. Williams-Howard
  • , Michelle Willicombe
  • , Gemma Willis
  • , James Willoughby
  • , Ann Wilson
  • , Imogen Wilson
  • , Daisy Wilson
  • , Nicola Window
  • , M. Witham
  • , Rebecca Wolf-Roberts
  • , Chloe Wood
  • , F. Woodhead
  • , Janet Woods
  • , Dan Wootton
  • , J. Wormleighton
  • , J. Worsley
  • , David Wraith
  • , Caroline Wrey Brown
  • , C. Wright
  • , S. Wright
  • , Louise Wright
  • , Inez Wynter
  • , Moucheng Xu
  • , Najira Yasmin
  • , S. Yasmin
  • , Tom Yates
  • , Kay Por Yip
  • , Susan Young
  • , Bob Young
  • , A. J. Yousuf
  • , Amira Zawia
  • , Lisa Zeidan
  • , Bang Zhao
  • , Bang Zheng
  •  & O. Zongo
  • , Daniel Agranoff
  • , Ken Agwuh
  • , Katie A. Ahmed
  • , Dhiraj Ail
  • , Erin L. Aldera
  • , Ana Alegria
  • , Beatrice Alex
  • , Sam Allen
  • , Petros Andrikopoulos
  • , Brian Angus
  • , Jane A. Armstrong
  • , Abdul Ashish
  • , Milton Ashworth
  • , Innocent G. Asiimwe
  • , Dougal Atkinson
  • , Benjamin Bach
  • , Siddharth Bakshi
  • , Wendy S. Barclay
  • , Shahedal Bari
  • , Gavin Barlow
  • , Samantha L. Barlow
  • , Stella Barnass
  • , Nicholas Barrett
  • , Christopher Bassford
  • , Sneha Basude
  • , David Baxter
  • , Michael Beadsworth
  • , Jolanta Bernatoniene
  • , John Berridge
  • , Nicola Best
  • , Debby Bogaert
  • , Laura Booth
  • , Pieter Bothma
  • , Benjamin Brennan
  • , Robin Brittain-Long
  • , Katie Bullock
  • , Naomi Bulteel
  • , Tom Burden
  • , Andrew Burtenshaw
  • , Nicola Carlucci
  • , Gail Carson
  • , Vikki Caruth
  • , Emily Cass
  • , Benjamin W. A. Catterall
  • , David Chadwick
  • , Duncan Chambler
  • , Meera Chand
  • , Kanta Chechi
  • , Nigel Chee
  • , Jenny Child
  • , Srikanth Chukkambotla
  • , Richard Clark
  • , Tom Clark
  • , Jordan J. Clark
  • , Emily A. Clarke
  • , Sara Clohisey
  • , Sarah Cole
  • , Paul Collini
  • , Marie Connor
  • , Graham S. Cooke
  • , Louise Cooper
  • , Catherine Cosgrove
  • , Audrey Coutts
  • , Helen Cox
  • , Jason Cupitt
  • , Maria-Teresa Cutino-Moguel
  • , Ana da Silva Filipe
  • , Jo Dalton
  • , Paul Dark
  • , Christopher Davis
  • , Chris Dawson
  • , Thushan de Silva
  • , Samir Dervisevic
  • , Oslem Dincarslan
  • , Alejandra Doce Carracedo
  • , Cara Donegan
  • , Lorna Donelly
  • , Phil Donnison
  • , Chloe Donohue
  • , Gonçalo dos Santos Correia
  • , Sam Douthwaite
  • , Thomas M. Drake
  • , Andrew Drummond
  • , Marc-Emmanuel Dumas
  • , Chris Dunn
  • , Jake Dunning
  • , Ingrid DuRand
  • , Ahilanadan Dushianthan
  • , Tristan Dyer
  • , Philip Dyer
  • , Angela Elliott
  • , Cariad Evans
  • , Anthony Evans
  • , Chi Eziefula
  • , Cameron J. Fairfield
  • , Angie Fawkes
  • , Chrisopher Fegan
  • , Lorna Finch
  • , Adam Finn
  • , Lewis W. S. Fisher
  • , Lisa Flaherty
  • , Tom Fletcher
  • , Terry Foster
  • , Duncan Fullerton
  • , Carrol Gamble
  • , Isabel Garcia-Dorival
  • , Atul Garg
  • , Sanjeev Garg
  • , Tammy Gilchrist
  • , Michelle Girvan
  • , Effrossyni Gkrania-Klotsas
  • , Jo Godden
  • , Arthur Goldsmith
  • , Clive Graham
  • , Tassos Grammatikopoulos
  • , Christopher A. Green
  • , Julian Griffin
  • , Fiona Griffiths
  • , Philip Gunning
  • , Rishi K. Gupta
  • , Katarzyna Hafezi
  • , Sophie Halpin
  • , Elaine Hardy
  • , Ewen M. Harrison
  • , Janet Harrison
  • , Catherine Hartley
  • , Stuart Hartshorn
  • , Daniel Harvey
  • , Peter Havalda
  • , Daniel B. Hawcutt
  • , Ross Hendry
  • , Antonia Y. W. Ho
  • , Maria Hobrok
  • , Luke Hodgson
  • , Karl Holden
  • , Anthony Holmes
  • , Peter W. Horby
  • , Joanne Howard
  • , Samreen Ijaz
  • , Clare Jackson
  • , Michael Jacobs
  • , Susan Jain
  • , Paul Jennings
  • , Rebecca L. Jensen
  • , Christopher B. Jones
  • , Trevor R. Jones
  • , Agilan Kaliappan
  • , Vidya Kasipandian
  • , Seán Keating
  • , Stephen Kegg
  • , Michael Kelsey
  • , Jason Kendall
  • , Caroline Kerrison
  • , Ian Kerslake
  • , Shadia Khandaker
  • , Katharine King
  • , Robyn T. Kiy
  • , Stephen R. Knight
  • , Susan Knight
  • , Oliver Koch
  • , Gouri Koduri
  • , George Koshy
  • , Chrysa Koukorava
  • , Shondipon Laha
  • , Eva Lahnsteiner
  • , Steven Laird
  • , Annette Lake
  • , Suzannah Lant
  • , Susan Larkin
  • , Diane Latawiec
  • , Andrew Law
  • , James Lee
  • , Gary Leeming
  • , Daniella Lefteri
  • , Tamas Leiner
  • , Lauren Lett
  • , Matthew Lewis
  • , Sonia Liggi
  • , Patrick Lillie
  • , Wei Shen Lim
  • , James Limb
  • , Vanessa Linnett
  • , Jeff Little
  • , Lucia A. Livoti
  • , Mark Lyttle
  • , Louise MacGillivray
  • , Alan Maclean
  • , Michael MacMahon
  • , Emily MacNaughton
  • , Maria Mancini
  • , Ravish Mankregod
  • , Laura Marsh
  • , Lynn Maslen
  • , Hannah Massey
  • , Huw Masson
  • , Elijah Matovu
  • , Nicole Maziere
  • , Sarah McCafferty
  • , Katherine McCullough
  • , Sarah E. McDonald
  • , Sarah McDonald
  • , Laurence McEvoy
  • , Ruth McEwen
  • , John McLauchlan
  • , Kenneth A. Mclean
  • , Manjula Meda
  • , Alexander J. Mentzer
  • , Laura Merson
  • , Soeren Metelmann
  • , Alison M. Meynert
  • , Nahida S. Miah
  • , Joanna Middleton
  • , Gary Mills
  • , Jane Minton
  • , Joyce Mitchell
  • , Kavya Mohandas
  • , James Moon
  • , Elinoor Moore
  • , Shona C. Moore
  • , Patrick Morgan
  • , Kirstie Morrice
  • , Craig Morris
  • , Katherine Mortimore
  • , Samuel Moses
  • , Mbiye Mpenge
  • , Rohinton Mulla
  • , Derek Murphy
  • , Lee Murphy
  • , Michael Murphy
  • , Ellen G. Murphy
  • , Thapas Nagarajan
  • , Megan Nagel
  • , Mark Nelson
  • , Lisa Norman
  • , Lillian Norris
  • , Lucy Norris
  • , Mahdad Noursadeghi
  • , Michael Olanipekun
  • , Wilna Oosthuyzen
  • , Anthonia Osagie
  • , Matthew K. O’Shea
  • , Igor Otahal
  • , Mark Pais
  • , Massimo Palmarini
  • , Carlo Palmieri
  • , Selva Panchatsharam
  • , Danai Papakonstantinou
  • , Hassan Paraiso
  • , Brij Patel
  • , Natalie Pattison
  • , William A. Paxton
  • , Rebekah Penrice-Randal
  • , Justin Pepperell
  • , Mark Peters
  • , Mandeep Phull
  • , Jack Pilgrim
  • , Stefania Pintus
  • , Tim Planche
  • , Daniel Plotkin
  • , Georgios Pollakis
  • , Frank Post
  • , Nicholas Price
  • , David Price
  • , Tessa Prince
  • , Rachel Prout
  • , Nikolas Rae
  • , Andrew Rambaut
  • , Henrik Reschreiter
  • , Tim Reynolds
  • , Neil Richardson
  • , P. Matthew Ridley
  • , Mark Roberts
  • , Stephanie Roberts
  • , Devender Roberts
  • , David L. Robertson
  • , Alistair Rose
  • , Guy Rousseau
  • , Bobby Ruge
  • , Clark D. Russell
  • , Brendan Ryan
  • , Debby Sales
  • , Taranprit Saluja
  • , Vanessa Sancho-Shimizu
  • , Caroline Sands
  • , Egle Saviciute
  • , Matthias Schmid
  • , Janet T. Scott
  • , James Scott-Brown
  • , Aarti Shah
  • , Prad Shanmuga
  • , Anil Sharma
  • , Catherine A. Shaw
  • , Victoria E. Shaw
  • , Anna Shawcross
  • , Rebecca K. Shears
  • , Jagtur Singh Pooni
  • , Jeremy Sizer
  • , Benjamin Small
  • , Richard Smith
  • , Catherine Snelson
  • , Tom Solomon
  • , Rebecca G. Spencer
  • , Nick Spittle
  • , Shiranee Sriskandan
  • , Nikki Staines
  • , Tom Stambach
  • , Richard Stewart
  • , David Stuart
  • , Krishanthi S. Subramaniam
  • , Pradeep Subudhi
  • , Olivia V. Swann
  • , Tamas Szakmany
  • , Agnieska Szemiel
  • , Aislynn Taggart
  • , Sarah Tait
  • , Zoltan Takats
  • , Panteleimon Takis
  • , Jolanta Tanianis-Hughes
  • , Kate Tatham
  • , Richard S. Tedder
  • , Jo Thomas
  • , Jordan Thomas
  • , Robert Thompson
  • , Chris Thompson
  • , Emma C. Thomson
  • , Ascanio Tridente
  • , Erwan Trochu
  • , Darell Tupper-Carey
  • , Lance C. W. Turtle
  • , Mary Twagira
  • , Nick Vallotton
  • , Libby van Tonder
  • , Rama Vancheeswaran
  • , Rachel Vincent
  • , Lisa Vincent-Smith
  • , Shico Visuvanathan
  • , Alan Vuylsteke
  • , Sam Waddy
  • , Rachel Wake
  • , Andrew Walden
  • , Ingeborg Welters
  • , Murray Wham
  • , Tony Whitehouse
  • , Paul Whittaker
  • , Ashley Whittington
  • , Meme Wijesinghe
  • , Eve Wilcock
  • , Martin Williams
  • , Lawrence Wilson
  • , Stephen Winchester
  • , Martin Wiselka
  • , Adam Wolverson
  • , Daniel G. Wootton
  • , Andrew Workman
  • , Nicola Wrobel
  • , Bryan Yates
  • , Peter Young
  • , Maria Zambon
  •  & J. Eunice Zhang


F.L. recruited participants, acquired clinical samples, analyzed and interpreted data and cowrote the manuscript, including all drafting and revisions. C.E. analyzed and interpreted data and cowrote this manuscript, including all drafting and revisions. S.F. and M.R. supported the analysis and interpretation of data as well as drafting and revisions. D.S., J.K.S., S.C.M., S.A., N.M., J.N., C.K., O.C.L., O.E., H.J.C.M., A. Shikotra, A. Singapuri, M.S., V.C.H., M.T., N.J.G., N.I.L. and C.C. contributed to acquisition of data underlying this study. L.H.-W., A.A.R.T., S.L.R.-J., L.S.H., O.M.K., D.G.W., T.I.d.S. and A. Ho made substantial contributions to conception/design and implementation of this work and/or acquisition of clinical samples for this work. They have supported drafting and revisions of the manuscript. E.M.H., J.K.Q. and A.B.D. made substantial contributions to the study design as well as data access, linkage and analysis. They have supported drafting and revisions of this work. J.D.C., L.-P.H., A. Horsley, B.R., K.P., M.M. and W.G. made substantial contributions to the conception and design of this work and have supported drafting and revisions of this work. J.K.B. obtained funding for ISARIC4C, is ISARIC4C consortium co-lead, has made substantial contributions to conception and design of this work and has supported drafting and revisions of this work. M.G.S. obtained funding for ISARIC4C, is ISARIC4C consortium co-lead, sponsor/protocol chief investigator, has made substantial contributions to conception and design of this work and has supported drafting and revisions of this work. R.A.E. and L.V.W. are co-leads of PHOSP-COVID, made substantial contributions to conception and design of this work, the acquisition and analysis of data, and have supported drafting and revisions of this work. C.B. is the chief investigator of PHOSP-COVID and has made substantial contributions to conception and design of this work. R.S.T. and L.T. made substantial contributions to the acquisition, analysis and interpretation of the data underlying this study and have contributed to drafting and revisions of this work. P.J.M.O. obtained funding for ISARIC4C, is ISARIC4C consortium co-lead, sponsor/protocol chief investigator and has made substantial contributions to conception and design of this work. R.S.T. and P.J.M.O. have also made key contributions to interpretation of data and have co-written this manuscript. All authors have read and approved the final version to be published. All authors agree to accountability for all aspects of this work. All investigators within ISARIC4C and the PHOSP-COVID consortia have made substantial contributions to the conception or design of this study and/or acquisition of data for this study. The full list of authors within these groups is available in Supplementary Information.

Corresponding authors

Correspondence to Ryan S. Thwaites or Peter J. M. Openshaw.

Ethics declarations

Competing interests.

F.L., C.E., D.S., J.K.S., S.C.M., C.D., C.K., N.M., L.N., E.M.H., A.B.D., J.K.Q., L.-P.H., K.P., L.S.H., O.M.K., S.F., T.I.d.S., D.G.W., R.S.T. and J.K.B. have no conflicts of interest. A.A.R.T. receives speaker fees and support to attend meetings from Janssen Pharmaceuticals. S.L.R.-J. is on the data safety monitoring board for Bexsero trial in HIV+ adults in Kenya. J.D.C. is the deputy chief editor of the European Respiratory Journal and receives consulting fees from AstraZeneca, Boehringer Ingelheim, Chiesi, GSK, Insmed, Janssen, Novartis, Pfizer and Zambon. A. Horsley is deputy chair of NIHR Translational Research Collaboration (unpaid role). B.R. receives honoraria from Axcella Therapeutics. R.A.E. is co-lead of PHOSP-COVID and receives fees from AstraZeneca/Evidera for consultancy on LC and from AstraZeneca for consultancy on digital health. R.A.E. has received speaker fees from Boehringer in June 2021 and has held a role as European Respiratory Society Assembly 01.02 Pulmonary Rehabilitation secretary. R.A.E. is on the American Thoracic Society Pulmonary Rehabilitation Assembly program committee. L.V.W. also receives funding from Orion Pharma and GSK and holds contracts with Genentech and AstraZeneca. L.V.W. has received consulting fees from Galapagos and Boehringer, is on the data advisory board for Galapagos and is Associate Editor for the European Respiratory Journal. A. Ho is a member of NIHR Urgent Public Health Group (June 2020–March 2021). M.M. is an applicant on the PHOSP study funded by NIHR/DHSC. M.G.S. acts as an independent external and nonremunerated member of Pfizer’s External Data Monitoring Committee for their mRNA vaccine program(s), is Chair of Infectious Disease Scientific Advisory Board of Integrum Scientific LLC, and is director of MedEx Solutions Ltd. and majority owner of MedEx Solutions Ltd. and minority owner of Integrum Scientific LLC. M.G.S.’s institution has been in receipt of gifts from Chiesi Farmaceutici S.p.A. of Clinical Trial Investigational Medicinal Product without encumbrance and distribution of same to trial sites. M.G.S. is a nonremunerated member of HMG UK New Emerging Respiratory Virus Threats Advisory Group and has previously been a nonremunerated member of the Scientific Advisory Group for Emergencies (SAGE). C.B. has received consulting fees and/or grants from GSK, AstraZeneca, Genentech, Roche, Novartis, Sanofi, Regeneron, Chiesi, Mologic and 4DPharma. L.T. has received consulting fees from MHRA, AstraZeneca and Synairgen and speakers’ fees from Eisai Ltd., and support for conference attendance from AstraZeneca. L.T. has a patent pending with ZikaVac. P.J.M.O. reports grants from the EU Innovative Medicines Initiative 2 Joint Undertaking during the submitted work; grants from UK Medical Research Council, GSK, Wellcome Trust, EU Innovative Medicines Initiative, UK National Institute for Health Research and UK Research and Innovation–Department for Business, Energy and Industrial Strategy; and personal fees from Pfizer, Janssen and Seqirus, outside the submitted work.

Peer review

Peer review information.

Nature Immunology thanks Ziyad Al-Aly and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Ioana Staicu was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Penalized logistic regression performance.

Graphs show classification error and area under curve (AUC) from the 50-repeat tenfold nested cross-validation used to optimise and assess the performance of PLR testing associations with each LC outcome relative to Recovered (n = 233): Cardio_Resp (n = 398), Fatigue (n = 384), Anxiety/Depression (n = 202), GI (n = 132), (e) Cognitive (n = 6). The distributions of classification error and AUC from the nested cross-validation are shown. Box plot centre line represents the median and boundaries of the box represent the interquartile range (IQR); the whisker length represents 1.5xIQR.

Extended Data Fig. 2 Associations with long COVID symptoms in full study cohort.

(a) Fibrinogen levels at 6 months were compared between pooled LC cases (n = 295) and Recovered (n = 233) and between the Cognitive group (n = 41) and Recovered (n = 233). Box plot centre line represents the median and boundaries of the box represent the interquartile range (IQR); the whisker length represents 1.5xIQR, and any outliers beyond the whisker range are shown as individual dots. Median differences were compared using two-sided Wilcoxon signed-rank test. * = p < 0·05, ** = p < 0·01, *** = p < 0·001, **** = p < 0·0001. Unadjusted p-values are reported. (b) Distribution of time from COVID-19 hospitalisation at sample collection applying CDC and NICE definitions of LC (n = 719). (c) Upset plot of symptom groups. Horizontal coloured bars represent the number of patients in each symptom group: Cardiorespiratory (Cardio_Resp), Fatigue, Cognitive, Gastrointestinal (GI) and Anxiety/Depression (Anx_Dep). Vertical black bars represent the number of patients in each symptom combination group. To prevent patient identification, where fewer than 5 patients belong to a combination group, this has been represented as ‘<5’. The Recovered group (n = 250) were used as controls. Forest plots show Olink protein concentrations (NPX) associated with (d) Cardio_Resp (n = 398), (e) Fatigue (n = 342), (f) Anx_Dep (n = 219), (g) GI (n = 134), and (h) Cognitive (n = 65). Error bars represent the median accuracy of the model.

Extended Data Fig. 3 Validation of Olink measurements using conventional assays in plasma.

Olink-measured protein levels (NPX) were compared to chemiluminescence assays (ECL or ELISA, log2[pg/mL]) to validate our findings, where contemporaneously collected plasma samples were available (n = 58). Results from key mediators associated with LC groups were validated: CSF3, IL1R2, IL2, IL3RA, TNFa, TFF2. R = Spearman rank correlation coefficient and shaded areas indicate the 95% confidence interval. Samples that fell below the lower limit of detection for a given assay were excluded and the ‘n’ value on each panel indicates the number of samples above this limit.

Extended Data Fig. 4 Univariate analysis of proteins associated with each symptom.

Olink-measured plasma protein levels (NPX) were compared between LC groups (Cardio_Resp, n = 398, Fatigue n = 384, Anxiety/Depression, n = 202, GI, n = 132 and Cognitive, n = 60) and Recovered (n = 233). Proteins identified by PLR were compared between groups. Median differences were compared using two-sided Wilcoxon signed-rank test. * = p < 0·05, ** = p < 0·01, *** = p < 0·001, **** = p < 0·0001 after FDR adjustment. Box plot centre line represents the median and boundaries of the box represent the interquartile range (IQR); the whisker length represents 1.5xIQR, and any outliers beyond the whisker range are shown as individual dots.

Extended Data Fig. 5 Unadjusted Penalised Logistic Regression.

Olink measured proteins (NPX) and their association with Cardio_Resp (n = 398), Fatigue (n = 342), Anx_Dep (n = 219), GI (n = 134), and Cognitive (n = 65). Forest plots show odds of each LC outcome vs Recovered (n = 233), using PLR without adjusting for clinical co-variates. Error bars represent the median accuracy of the model.

Extended Data Fig. 6 Partial Least Squares analysis.

Olink measured proteins (NPX) and their association with Cardio_Resp (n = 398), Fatigue (n = 342), Anx_Dep (n = 219), GI (n = 134), and Cognitive (n = 65) groups. Forest plots show odds of LC outcome vs Recovered (n = 233), using PLS analysis. Error bars represent the standard error of the coefficient estimate.

Extended Data Fig. 7 Network analysis centrality.

Each graph shows the centrality score for each Olink measured protein (NPX) found to have significant associations with other proteins that were elevated in the Cardio_Resp (n = 398), Fatigue (n = 342), Anx_Dep (n = 219), GI (n = 134), and Cognitive (n = 65) groups relative to Recovered (n = 233).

Extended Data Fig. 8 Inflammation in men and women with long COVID.

Olink-measured plasma protein levels (NPX) compared between men and women with symptoms, divided by age (<50 or ≥50 years): ( a ) shows IL1R2 and MATN2 in the Anxiety/Depression group (<50 n = 55, ≥50 n = 133), ( b ) shows CTSO and NFASC in the Cognitive group (<50 n = 11, ≥50 n = 50). Median values were compared between men and women using two-sided Wilcoxon signed-rank test. Box plot centre line represents the median and boundaries represent the interquartile range (IQR); the whisker length represents 1.5xIQR.

Extended Data Fig. 9 Inflammation in the upper respiratory tract.

Nasal cytokines measured by immunoassay in the CardioResp group (n = 29) and Recovered (n = 31): ( a ) shows IL1a, IL1b, IL-6, APO-2, TGFa, TFF2. Median differences were compared using two-sided Wilcoxon signed-rank test. Box plot centre line represents the median and boundaries of the box represent the interquartile range (IQR); the whisker length represents 1.5xIQR. ( b ) shows cytokines measured by immunoassay in paired plasma and nasal samples (n = 70). Correlations between IL1a, IL1b, IL-6, APO-2, TGFa and TFF2 in nasal and plasma samples were assessed using Spearman’s rank correlation coefficient ( R ). Shaded areas indicate the 95% confidence interval of R.

Extended Data Fig. 10 Graphical abstract.

Summary of interpretation of key findings from Olink measured proteins and their association with CardioResp (n = 398), Fatigue (n = 342), Anx/Dep (n = 219), GI (n = 134), and Cognitive (n = 65) groups relative to Recovered (n = 233).

Supplementary information

Supplementary Methods, Statistics and reproducibility statement, Supplementary Results, Supplementary Tables 1–7, Extended data figure legends, Appendix 1 (Supplementary Table 8), Appendix 2 (PHOSP-COVID author list) and Appendix 3 (ISARIC4C author list).

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Liew, F., Efstathiou, C., Fontanella, S. et al. Large-scale phenotyping of patients with long COVID post-hospitalization reveals mechanistic subtypes of disease. Nat Immunol 25 , 607–621 (2024). https://doi.org/10.1038/s41590-024-01778-0

Received : 11 August 2023

Accepted : 06 February 2024

Published : 08 April 2024

Issue Date : April 2024

DOI : https://doi.org/10.1038/s41590-024-01778-0


The role of COVID-19 vaccines in preventing post-COVID-19 thromboembolic and cardiovascular complications

Volume 110, Issue 9

  • Núria Mercadé-Besora 1 , 2 , 3 ,
  • Xintong Li 1 ,
  • Raivo Kolde 4 ,
  • Nhung TH Trinh 5 ,
  • Maria T Sanchez-Santos 1 ,
  • Wai Yi Man 1 ,
  • Elena Roel 3 ,
  • Carlen Reyes 3 ,
  • http://orcid.org/0000-0003-0388-3403 Antonella Delmestri 1 ,
  • Hedvig M E Nordeng 6 , 7 ,
  • http://orcid.org/0000-0002-4036-3856 Anneli Uusküla 8 ,
  • http://orcid.org/0000-0002-8274-0357 Talita Duarte-Salles 3 , 9 ,
  • Clara Prats 2 ,
  • http://orcid.org/0000-0002-3950-6346 Daniel Prieto-Alhambra 1 , 9 ,
  • http://orcid.org/0000-0002-0000-0110 Annika M Jödicke 1 ,
  • Martí Català 1
  • 1 Pharmaco- and Device Epidemiology Group, Health Data Sciences, Botnar Research Centre, NDORMS , University of Oxford , Oxford , UK
  • 2 Department of Physics , Universitat Politècnica de Catalunya , Barcelona , Spain
  • 3 Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol) , IDIAP Jordi Gol , Barcelona , Catalunya , Spain
  • 4 Institute of Computer Science , University of Tartu , Tartu , Estonia
  • 5 Pharmacoepidemiology and Drug Safety Research Group, Department of Pharmacy, Faculty of Mathematics and Natural Sciences , University of Oslo , Oslo , Norway
  • 6 School of Pharmacy , University of Oslo , Oslo , Norway
  • 7 Division of Mental Health , Norwegian Institute of Public Health , Oslo , Norway
  • 8 Department of Family Medicine and Public Health , University of Tartu , Tartu , Estonia
  • 9 Department of Medical Informatics, Erasmus University Medical Center , Erasmus University Rotterdam , Rotterdam , Zuid-Holland , Netherlands
  • Correspondence to Prof Daniel Prieto-Alhambra, Pharmaco- and Device Epidemiology Group, Health Data Sciences, Botnar Research Centre, NDORMS, University of Oxford, Oxford, UK; daniel.prietoalhambra{at}ndorms.ox.ac.uk

Objective To study the association between COVID-19 vaccination and the risk of post-COVID-19 cardiac and thromboembolic complications.

Methods We conducted a staggered cohort study based on national vaccination campaigns using electronic health records from the UK, Spain and Estonia. Vaccine rollout was grouped into four stages with predefined enrolment periods. Each stage included all individuals eligible for vaccination, with no previous SARS-CoV-2 infection or COVID-19 vaccine at the start date. Vaccination status was used as a time-varying exposure. Outcomes included heart failure (HF), venous thromboembolism (VTE) and arterial thrombosis/thromboembolism (ATE) recorded in four time windows after SARS-CoV-2 infection: 0–30, 31–90, 91–180 and 181–365 days. Propensity score overlap weighting and empirical calibration were used to minimise observed and unobserved confounding, respectively.

Fine-Gray models estimated subdistribution hazard ratios (sHR). Random effect meta-analyses were conducted across staggered cohorts and databases.

Results The study included 10.17 million vaccinated and 10.39 million unvaccinated people. Vaccination was associated with reduced risks of acute (30-day) and post-acute COVID-19 VTE, ATE and HF: for example, meta-analytic sHR of 0.22 (95% CI 0.17 to 0.29), 0.53 (0.44 to 0.63) and 0.45 (0.38 to 0.53), respectively, for 0–30 days after SARS-CoV-2 infection, while in the 91–180 days sHR were 0.53 (0.40 to 0.70), 0.72 (0.58 to 0.88) and 0.61 (0.51 to 0.73), respectively.

Conclusions COVID-19 vaccination reduced the risk of post-COVID-19 cardiac and thromboembolic outcomes. These effects were more pronounced for acute COVID-19 outcomes, consistent with known reductions in disease severity following breakthrough versus unvaccinated SARS-CoV-2 infection.

  • Epidemiology
  • Electronic Health Records

Data availability statement

Data may be obtained from a third party and are not publicly available. CPRD: CPRD data were obtained under the CPRD multi-study license held by the University of Oxford after Research Data Governance (RDG) approval. Direct data sharing is not allowed. SIDIAP: In accordance with current European and national law, the data used in this study is only available for the researchers participating in this study. Thus, we are not allowed to distribute or make publicly available the data to other parties. However, researchers from public institutions can request data from SIDIAP if they comply with certain requirements. Further information is available online ( https://www.sidiap.org/index.php/menu-solicitudesen/application-proccedure ) or by contacting SIDIAP ([email protected]). CORIVA: CORIVA data were obtained under the approval of Research Ethics Committee of the University of Tartu and the patient level data sharing is not allowed. All analyses in this study were conducted in a federated manner, where analytical code and aggregated (anonymised) results were shared, but no patient-level data was transferred across the collaborating institutions.

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See:  https://creativecommons.org/licenses/by/4.0/ .



COVID-19 vaccines proved to be highly effective in reducing the severity of acute SARS-CoV-2 infection.

While COVID-19 vaccines were associated with increased risk for cardiac and thromboembolic events, such as myocarditis and thrombosis, the risk of complications was substantially higher due to SARS-CoV-2 infection.


COVID-19 vaccination reduced the risk of heart failure, venous thromboembolism and arterial thrombosis/thromboembolism in the acute (30 days) and post-acute (31 to 365 days) phase following SARS-CoV-2 infection. This effect was stronger in the acute phase.

The overall additive effect of vaccination on the risk of post-vaccine and/or post-COVID thromboembolic and cardiac events needs further research.


COVID-19 vaccines proved to be highly effective in reducing the risk of post-COVID cardiovascular and thromboembolic complications.


COVID-19 vaccines were approved under emergency authorisation in December 2020 and showed high effectiveness against SARS-CoV-2 infection, COVID-19-related hospitalisation and death. 1 2 However, concerns were raised after spontaneous reports of unusual thromboembolic events following adenovirus-based COVID-19 vaccines, an association that was further assessed in observational studies. 3 4 More recently, mRNA-based vaccines were found to be associated with a risk of rare myocarditis events. 5 6

On the other hand, SARS-CoV-2 infection can trigger cardiac and thromboembolic complications. 7 8 Previous studies showed that, while slowly decreasing over time, the risk of serious complications remains high for up to a year after infection. 9 10 Although acute and post-acute cardiac and thromboembolic complications following COVID-19 are rare, they present a substantial burden to the affected patients, and the absolute number of cases globally could become substantial.

Recent studies suggest that COVID-19 vaccination could protect against cardiac and thromboembolic complications attributable to COVID-19. 11 12 However, most studies did not include long-term complications and were conducted among specific populations.

Evidence is still scarce as to whether the combined effects of COVID-19 vaccines, protecting against SARS-CoV-2 infection and reducing post-COVID-19 cardiac and thromboembolic outcomes, outweigh any risks of these complications potentially associated with vaccination.

We therefore used large, representative data sources from three European countries to assess the overall effect of COVID-19 vaccines on the risk of acute and post-acute COVID-19 complications including venous thromboembolism (VTE), arterial thrombosis/thromboembolism (ATE) and other cardiac events. Additionally, we studied the comparative effects of ChAdOx1 versus BNT162b2 on the risk of these same outcomes.

Data sources

We used four routinely collected population-based healthcare datasets from three European countries: the UK, Spain and Estonia.

For the UK, we used data from two primary care databases—namely, Clinical Practice Research Datalink, CPRD Aurum 13 and CPRD Gold. 14 CPRD Aurum currently covers 13 million people from predominantly English practices, while CPRD Gold comprises 3.1 million active participants mostly from GP practices in Wales and Scotland. Spanish data were provided by the Information System for the Development of Research in Primary Care (SIDIAP), 15 which encompasses primary care records from 6 million active patients (around 75% of the population in the region of Catalonia) linked to hospital admissions data (Conjunt Mínim Bàsic de Dades d’Alta Hospitalària). Finally, the CORIVA dataset based on national health claims data from Estonia was used. It contains all COVID-19 cases from the first year of the pandemic and ~440 000 randomly selected controls. CORIVA was linked to the death registry and all COVID-19 testing from the national health information system.

Databases included sociodemographic information, diagnoses, measurements, prescriptions and secondary care referrals and were linked to vaccine registries, including records of all administered vaccines from all healthcare settings. Data availability for CPRD Gold ended in December 2021, CPRD Aurum in January 2022, SIDIAP in June 2022 and CORIVA in December 2022.

All databases were mapped to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) 16 to facilitate federated analytics.

Multinational network staggered cohort study: study design and participants

The study design has been published in detail elsewhere. 17 Briefly, we used a staggered cohort design considering vaccination as a time-varying exposure. Four staggered cohorts were designed with each cohort representing a country-specific vaccination rollout phase (eg, dates when people became eligible for vaccination, and eligibility criteria).

The source population comprised all adults registered in the respective database for at least 180 days at the start of the study (4 January 2021 for CPRD Gold and Aurum, 20 February 2021 for SIDIAP and 28 January 2021 for CORIVA). Subsequently, each staggered cohort corresponded to an enrolment period: all people eligible for vaccination during this time were included in the cohort and people with a history of SARS-CoV-2 infection or COVID-19 vaccination before the start of the enrolment period were excluded. Across countries, cohort 1 comprised older age groups, whereas cohort 2 comprised individuals at risk for severe COVID-19. Cohort 3 included people aged ≥40 and cohort 4 enrolled people aged ≥18.

In each cohort, people receiving a first vaccine dose during the enrolment period were allocated to the vaccinated group, with their index date being the date of vaccination. Individuals who did not receive a vaccine dose comprised the unvaccinated group and their index date was assigned within the enrolment period, based on the distribution of index dates in the vaccinated group. People with COVID-19 before the index date were excluded.
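The index-date assignment described above can be pictured as sampling from the empirical distribution of vaccination dates. The following is a minimal sketch under stated assumptions: the function name `assign_index_dates`, the sampling-with-replacement approach and the example dates are all hypothetical and are not taken from the study's published code.

```python
import random
from datetime import date

def assign_index_dates(vaccinated_dates, n_unvaccinated, seed=0):
    """Draw index dates for unvaccinated people from the empirical
    distribution of index dates in the vaccinated group (assumption:
    simple sampling with replacement)."""
    rng = random.Random(seed)
    return [rng.choice(vaccinated_dates) for _ in range(n_unvaccinated)]

# Hypothetical vaccination dates within one enrolment period
vaccinated_dates = [date(2021, 1, 5), date(2021, 1, 5), date(2021, 1, 12)]
unvaccinated_index = assign_index_dates(vaccinated_dates, n_unvaccinated=4)
```

Sampling from the observed dates keeps the calendar-time distribution of follow-up start comparable between the two groups.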

Follow-up started from the index date until the earliest of end of available data, death, change in exposure status (first vaccine dose for those unvaccinated) or outcome of interest.

COVID-19 vaccination

All vaccines approved within the study period from January 2021 to July 2021, namely ChAdOx1 (Oxford/AstraZeneca), BNT162b2 (BioNTech/Pfizer), Ad26.COV2.S (Janssen) and mRNA-1273 (Moderna), were included in this study.

Post-COVID-19 outcomes of interest

Outcomes of interest were defined as SARS-CoV-2 infection followed by a predefined thromboembolic or cardiac event of interest within a year after infection, and with no record of the same clinical event in the 6 months before COVID-19. Outcome date was set as the corresponding SARS-CoV-2 infection date.

COVID-19 was identified from either a positive SARS-CoV-2 test (polymerase chain reaction (PCR) or antigen), or a clinical COVID-19 diagnosis, with no record of COVID-19 in the previous 6 weeks. This wash-out period was imposed to exclude re-recordings of the same COVID-19 episode.
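The wash-out rule above can be sketched as a simple pass over a person's dated COVID-19 records. This is an illustration only: the function name `covid_episodes` is hypothetical, and treating a record exactly 6 weeks after the previous one as still inside the wash-out period is an assumption.

```python
from datetime import date, timedelta

WASHOUT = timedelta(weeks=6)

def covid_episodes(record_dates, washout=WASHOUT):
    """Keep a record as a new COVID-19 episode only if there is no
    other COVID-19 record in the preceding wash-out period."""
    episodes = []
    previous = None
    for current in sorted(record_dates):
        if previous is None or current - previous > washout:
            episodes.append(current)
        previous = current  # every record, kept or not, resets the clock
    return episodes

# Hypothetical records: a test, a re-recording 19 days later, and a later infection
records = [date(2021, 1, 1), date(2021, 1, 20), date(2021, 3, 15)]
episodes = covid_episodes(records)
```

Here the second record is treated as a re-recording of the first episode, while the third starts a new episode.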

Post-COVID-19 outcome events were selected based on previous studies. 11–13 Events comprised ischaemic stroke (IS), haemorrhagic stroke (HS), transient ischaemic attack (TIA), ventricular arrhythmia/cardiac arrest (VACA), myocarditis/pericarditis (MP), myocardial infarction (MI), heart failure (HF), pulmonary embolism (PE) and deep vein thrombosis (DVT). We used two composite outcomes: (1) VTE, as an aggregate of PE and DVT and (2) ATE, as a composite of IS, TIA and MI. To avoid re-recording of the same complication we imposed a wash-out period of 90 days between records. Phenotypes for these complications were based on previously published studies. 3 4 8 18

All outcomes were ascertained in four different time periods following SARS-CoV-2 infection: the first period described the acute infection phase (0–30 days after COVID-19), whereas the later periods (31–90 days, 91–180 days and 181–365 days) represent the post-acute phase ( figure 1 ).

Figure 1 Study outcome design. Study outcomes of interest are defined as a COVID-19 infection followed by one of the complications in the figure, within a year after infection. Outcomes were ascertained in four different time windows after SARS-CoV-2 infection: 0–30 days (namely the acute phase), 31–90 days, 91–180 days and 181–365 days (these last three comprise the post-acute phase).
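The four ascertainment windows can be expressed as a small lookup. The sketch below is illustrative; the names `WINDOWS` and `outcome_window` are hypothetical, and days are counted from the SARS-CoV-2 infection date (day 0).

```python
# (first day, last day, phase) for each window after infection
WINDOWS = [
    (0, 30, "acute"),
    (31, 90, "post-acute"),
    (91, 180, "post-acute"),
    (181, 365, "post-acute"),
]

def outcome_window(days_since_infection):
    """Return (window label, phase) for an event occurring the given
    number of days after infection, or None if outside one year."""
    for start, end, phase in WINDOWS:
        if start <= days_since_infection <= end:
            return f"{start}-{end} days", phase
    return None
```

For example, an event on day 45 falls in the 31-90 day window of the post-acute phase, and an event on day 400 is not counted.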

Negative control outcomes

Negative control outcomes (NCOs) were used to detect residual confounding. NCOs are outcomes which are not believed to be causally associated with the exposure, but share the same bias structure with the exposure and outcome of interest. Therefore, no significant association between exposure and NCO is to be expected. Our study used 43 different NCOs from previous work assessing vaccine effectiveness. 19

Statistical analysis

Federated network analyses.

A template for an analytical script was developed and subsequently tailored to include the country-specific aspects (eg, dates, priority groups) for the vaccination rollout. Analyses were conducted locally for each database. Only aggregated data were shared and person counts <5 were clouded.
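As an illustration of the small-count rule, aggregated counts below 5 can be masked before results leave the local site. This is a sketch only; the function name `mask_small_counts`, the `"<5"` placeholder and the example counts are assumptions.

```python
def mask_small_counts(counts, threshold=5):
    """Replace any person count below the threshold with a placeholder
    so that small groups cannot be identified in shared results."""
    return {
        key: (f"<{threshold}" if n < threshold else n)
        for key, n in counts.items()
    }

# Hypothetical aggregated outcome counts from one database
local_counts = {"VTE": 132, "MP": 3, "HS": 12}
shared_counts = mask_small_counts(local_counts)
```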

Propensity score weighting

Large-scale propensity scores (PS) were calculated to estimate the likelihood of a person receiving the vaccine based on their demographic and health-related characteristics (eg, conditions, medications) prior to the index date. PS were then used to minimise observed confounding by creating a weighted population (overlap weighting 20 ), in which individuals contributed with a different weight based on their PS and vaccination status.
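Under overlap weighting, a vaccinated person is weighted by the probability of being unvaccinated (1 − PS) and an unvaccinated person by the probability of being vaccinated (PS), so people with PS near 0.5 contribute most. A minimal sketch, with the function name `overlap_weight` being hypothetical:

```python
def overlap_weight(ps, vaccinated):
    """Overlap weight for one person, given the propensity score
    (probability of vaccination) and the observed vaccination status."""
    if not 0.0 <= ps <= 1.0:
        raise ValueError("propensity score must lie in [0, 1]")
    return 1.0 - ps if vaccinated else ps

# A person very likely to be vaccinated (PS = 0.8) gets a small weight
# if vaccinated and a large weight if unvaccinated.
w_vaccinated = overlap_weight(0.8, vaccinated=True)
w_unvaccinated = overlap_weight(0.8, vaccinated=False)
```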

Prespecified key variables included in the PS comprised age, sex, location, index date, prior observation time in the database, number of previous outpatient visits and previous SARS-CoV-2 PCR/antigen tests. Regional vaccination, testing and COVID-19 incidence rates were also forced into the PS equation for the UK databases 21 and SIDIAP. 22 In addition, least absolute shrinkage and selection operator (LASSO) regression, a technique for variable selection, was used to identify additional variables from all recorded conditions and prescriptions within 0–30 days, 31–180 days and 181-any time (conditions only) before the index date that had a prevalence of >0.5% in the study population.

PS were then separately estimated for each staggered cohort and analysis. We considered covariate balance to be achieved if absolute standardised mean differences (ASMDs) were ≤0.1 after weighting. Baseline characteristics such as demographics and comorbidities were reported.
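The balance check can be sketched as a weighted absolute standardised mean difference for a single covariate. The version below is one common convention and an assumption here: it divides the difference in weighted means by the square root of the average of the two weighted variances. The function names and example values are hypothetical.

```python
import math

def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

def weighted_var(x, w):
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)

def asmd(x1, w1, x0, w0):
    """Absolute standardised mean difference between the weighted
    vaccinated (x1, w1) and unvaccinated (x0, w0) groups."""
    diff = abs(weighted_mean(x1, w1) - weighted_mean(x0, w0))
    pooled_sd = math.sqrt((weighted_var(x1, w1) + weighted_var(x0, w0)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else math.inf
    return diff / pooled_sd

def is_balanced(value, threshold=0.1):
    """Balance criterion from the text: ASMD <= 0.1 after weighting."""
    return value <= threshold

# Hypothetical ages with unit weights in each group
example = asmd([2, 3, 4], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```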

Effect estimation

To account for the competing risk of death associated with COVID-19, Fine-Gray models 23 were used to calculate subdistribution hazard ratios (sHRs). Subsequently, sHRs and confidence intervals were empirically calibrated from NCO estimates 24 to account for unmeasured confounding. To calibrate the estimates, the empirical null distribution was derived from NCO estimates and was used to compute calibrated confidence intervals. For each outcome, sHRs from the four staggered cohorts were pooled using random-effect meta-analysis, both separately for each database and across all four databases.
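The pooling step can be illustrated with a random-effects meta-analysis on the log scale. The text does not state which estimator was used; the sketch below assumes the DerSimonian-Laird estimator, recovers standard errors from 95% confidence intervals, and uses hypothetical cohort-level estimates rather than study results.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def pool_random_effects(estimates):
    """Pool hazard ratios given as (hr, ci_lower, ci_upper) tuples.
    Returns (pooled_hr, ci_lower, ci_upper, tau2)."""
    y = [math.log(hr) for hr, _, _ in estimates]
    se = [(math.log(hi) - math.log(lo)) / (2 * Z95) for _, lo, hi in estimates]
    w = [1 / s ** 2 for s in se]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    k = len(y)
    if k > 1:
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    else:
        tau2 = 0.0
    w_re = [1 / (s ** 2 + tau2) for s in se]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    pooled_se = math.sqrt(1 / sum(w_re))
    return (
        math.exp(pooled),
        math.exp(pooled - Z95 * pooled_se),
        math.exp(pooled + Z95 * pooled_se),
        tau2,
    )

# Hypothetical estimates from three cohorts (illustrative values only)
cohort_estimates = [(0.50, 0.40, 0.625), (0.60, 0.45, 0.80), (0.45, 0.30, 0.675)]
pooled_hr, pooled_lo, pooled_hi, tau2 = pool_random_effects(cohort_estimates)
```

When the cohort estimates agree closely, the between-study variance is estimated as zero and the result coincides with a fixed-effect pooled estimate.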

Sensitivity analysis

Sensitivity analyses comprised 1) censoring follow-up for vaccinated people at the time when they received their second vaccine dose and 2) considering only the first post-COVID-19 outcome within the year after infection ( online supplemental figure S1 ). In addition, comparative effectiveness analyses were conducted for BNT162b2 versus ChAdOx1.

Data and code availability.

All analytic code for the study is available in GitHub ( https://github.com/oxford-pharmacoepi/vaccineEffectOnPostCovidCardiacThromboembolicEvents ), including code lists for vaccines, COVID-19 tests and diagnoses, cardiac and thromboembolic events, NCO and health conditions to prioritise patients for vaccination in each country. We used R version 4.2.3 and statistical packages survival (3.5-3), EmpiricalCalibration (3.1.1), glmnet (4.1-7), and Hmisc (5.0-1).

Patient and public involvement

Owing to the nature of the study and the limitations regarding data privacy, the study design, analysis, interpretation of data and revision of the manuscript did not involve any patients or members of the public.

All aggregated results are available in a web application ( https://dpa-pde-oxford.shinyapps.io/PostCovidComplications/ ).

We included over 10.17 million vaccinated individuals (1 618 395 from CPRD Gold; 5 729 800 from CPRD Aurum; 2 744 821 from SIDIAP and 77 603 from CORIVA) and 10.39 million unvaccinated individuals (1 640 371; 5 860 564; 2 588 518 and 302 267, respectively). Online supplemental figures S2-5 illustrate study inclusion for each database.

Adequate covariate balance was achieved after PS weighting in most studies: CORIVA (all cohorts) and SIDIAP (cohorts 1 and 4) did not contribute to ChAdOx1 subanalyses owing to sample size and covariate imbalance. ASMD results are accessible in the web application.

NCO analyses suggested residual bias after PS weighting, with a majority of NCOs associated positively with vaccination. Therefore, calibrated estimates are reported in this manuscript. Uncalibrated effect estimates and NCO analyses are available in the web interface.

Population characteristics

Table 1 presents baseline characteristics for the weighted populations in CPRD Aurum, for illustrative purposes. Online supplemental tables S1-25 summarise baseline characteristics for weighted and unweighted populations for each database and comparison. Across databases and cohorts, populations followed similar patterns: cohort 1 represented an older subpopulation (around 80 years old) with a high proportion of women (57%). Median age was lowest in cohort 4 ranging between 30 and 40 years.

Table 1 Characteristics of weighted populations in CPRD Aurum database, stratified by staggered cohort and exposure status. Exposure is any COVID-19 vaccine.

COVID-19 vaccination and post-COVID-19 complications

Table 2 shows the incidence of post-COVID-19 VTE, ATE and HF, the three most common post-COVID-19 conditions among the studied outcomes. Outcome counts are presented separately for 0–30, 31–90, 91–180 and 181–365 days after SARS-CoV-2 infection. Online supplemental tables S26-36 include all studied complications, also for the sensitivity and subanalyses. Similar patterns of incidence were observed across all databases: higher outcome rates in the older populations (cohort 1) and decreasing frequency with increasing time after infection in all cohorts.

Table 2 Number of records (and risk per 10 000 individuals) for acute and post-acute COVID-19 cardiac and thromboembolic complications, across cohorts and databases for any COVID-19 vaccination.

Figure 2 Forest plots for the effect of COVID-19 vaccines on post-COVID-19 cardiac and thromboembolic complications; meta-analysis across cohorts and databases. Dashed line represents a level of heterogeneity I² > 0.4. ATE, arterial thrombosis/thromboembolism; CD+HS, cardiac diseases and haemorrhagic stroke; VTE, venous thromboembolism.

Results from calibrated estimates pooled in meta-analysis across cohorts and databases are shown in figure 2 .

Reduced risk associated with vaccination is observed for acute and post-acute VTE, DVT, and PE: acute meta-analytic sHR are 0.22 (95% CI, 0.17–0.29); 0.36 (0.28–0.45); and 0.19 (0.15–0.25), respectively. For VTE in the post-acute phase, sHR estimates are 0.43 (0.34–0.53), 0.53 (0.40–0.70) and 0.50 (0.36–0.70) for 31–90, 91–180, and 181–365 days post COVID-19, respectively. Reduced risk of VTE outcomes was observed in vaccinated individuals across databases and cohorts (see online supplemental figures S14–22 ).

Similarly, the risk of ATE, IS and MI in the acute phase after infection was reduced for the vaccinated group, with sHR of 0.53 (0.44–0.63), 0.55 (0.43–0.70) and 0.49 (0.38–0.62), respectively. Reduced risk associated with vaccination persisted for post-acute ATE, with sHR of 0.74 (0.60–0.92), 0.72 (0.58–0.88) and 0.62 (0.48–0.80) for 31–90, 91–180 and 181–365 days post-COVID-19, respectively. Risk of post-acute MI remained lower for vaccinated individuals in the 31–90 and 91–180 days after COVID-19, with sHR of 0.64 (0.46–0.87) and 0.64 (0.45–0.90), respectively. A vaccination effect on post-COVID-19 TIA was seen only in the 181–365 days window, with sHR of 0.51 (0.31–0.82). Online supplemental figures S23-31 show database-specific and cohort-specific estimates for ATE-related complications.

Risk of post-COVID-19 cardiac complications was reduced in vaccinated individuals. Meta-analytic estimates in the acute phase showed sHR of 0.45 (0.38–0.53) for HF, 0.41 (0.26–0.66) for MP and 0.41 (0.27–0.63) for VACA. Reduced risk persisted for post-acute COVID-19 HF: sHR 0.61 (0.51–0.73) for 31–90 days, 0.61 (0.51–0.73) for 91–180 days and 0.52 (0.43–0.63) for 181–365 days. For post-acute MP, risk was only lowered in the first post-acute window (31–90 days), with sHR of 0.43 (0.21–0.85). Vaccination showed no association with post-COVID-19 HS. Database-specific and cohort-specific results for these cardiac diseases are shown in online supplemental figures S32-40 .

Stratified analyses by vaccine showed similar associations, except for ChAdOx1 which was not associated with reduced VTE and ATE risk in the last post-acute window. Sensitivity analyses were consistent with main results ( online supplemental figures S6-13 ).

Figure 3 shows the results of comparative effects of BNT162b2 versus ChAdOx1, based on UK data. Meta-analytic estimates favoured BNT162b2 (sHR of 0.66 (0.46–0.93)) for VTE in the 0–30 days after infection, but no differences were seen for post-acute VTE or for any of the other outcomes. Results from sensitivity analyses, database-specific and cohort-specific estimates were in line with the main findings ( online supplemental figures S41-51 ).

Figure 3 Forest plots for comparative vaccine effect (BNT162b2 vs ChAdOx1); meta-analysis across cohorts and databases. ATE, arterial thrombosis/thromboembolism; CD+HS, cardiac diseases and haemorrhagic stroke; VTE, venous thromboembolism.

Key findings

Our analyses showed a substantial reduction of risk (45–81%) for thromboembolic and cardiac events in the acute phase of COVID-19 associated with vaccination. This finding was consistent across four databases and three different European countries. Risks for post-acute COVID-19 VTE, ATE and HF were reduced to a lesser extent (24–58%), whereas a reduced risk for post-COVID-19 MP and VACA in vaccinated people was seen only in the acute phase.
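The percentage range quoted for the acute phase follows from the meta-analytic sHRs reported in the Results, reading 1 − sHR as the relative reduction in the subdistribution hazard. The short sketch below reproduces that arithmetic; the helper name `percent_reduction` is hypothetical, and the sHR values are the acute-phase point estimates given in the text.

```python
def percent_reduction(shr):
    """Relative reduction implied by a subdistribution hazard ratio,
    as a whole-number percentage."""
    return round((1 - shr) * 100)

# Acute-phase (0-30 days) meta-analytic sHR point estimates from the Results
acute_shr = {
    "VTE": 0.22, "DVT": 0.36, "PE": 0.19,
    "ATE": 0.53, "IS": 0.55, "MI": 0.49,
    "HF": 0.45, "MP": 0.41, "VACA": 0.41,
}
acute_reduction = {name: percent_reduction(v) for name, v in acute_shr.items()}
lowest = min(acute_reduction.values())
highest = max(acute_reduction.values())
```

The smallest reduction corresponds to IS (sHR 0.55) and the largest to PE (sHR 0.19), which matches the 45–81% range stated above.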

Results in context

The relationship between SARS-CoV-2 infection, COVID-19 vaccines and thromboembolic and/or cardiac complications is tangled. Some large studies report an increased risk of VTE and ATE following both ChAdOx1 and BNT162b2 vaccination, 7 whereas other studies have not identified such a risk. 25 Elevated risk of VTE has also been reported among patients with COVID-19 and its occurrence can lead to poor prognosis and mortality. 26 27 Similarly, several observational studies have found an association between COVID-19 mRNA vaccination and a short-term increased risk of myocarditis, particularly among younger male individuals. 5 6 For instance, a self-controlled case series study conducted in England revealed about 30% increased risk of hospital admission due to myocarditis within 28 days following both ChAdOx1 and BNT162b2 vaccines. However, this same study also found a ninefold higher risk for myocarditis following a positive SARS-CoV-2 test, clearly offsetting the observed post-vaccine risk.

COVID-19 vaccines have demonstrated high efficacy and effectiveness in preventing infection and reducing the severity of acute-phase infection. However, with the emergence of newer variants of the virus, such as omicron, and the waning protective effect of the vaccine over time, there is a growing interest in understanding whether the vaccine can also reduce the risk of complications after breakthrough infections. Recent studies suggested that COVID-19 vaccination could potentially protect against acute post-COVID-19 cardiac and thromboembolic events. 11 12 A large prospective cohort study 11 reports risk of VTE after SARS-CoV-2 infection to be substantially reduced in fully vaccinated ambulatory patients. Likewise, Al-Aly et al 12 suggest a reduced risk for post-acute COVID-19 conditions in breakthrough infection versus SARS-CoV-2 infection without prior vaccination. However, the populations were limited to SARS-CoV-2 infected individuals and estimates did not include the effect of the vaccine to prevent COVID-19 in the first place. Other studies on post-acute COVID-19 conditions and symptoms have been conducted, 28 29 but there has been limited reporting on the condition-specific risks associated with COVID-19, even though the prognosis for different complications can vary significantly.

In line with previous studies, our findings suggest a potential benefit of vaccination in reducing the risk of post-COVID-19 thromboembolic and cardiac complications. We included broader populations, estimated the risk in both acute and post-acute infection phases and replicated these using four large independent observational databases. By pooling results across different settings, we provided the most up-to-date and robust evidence on this topic.

Strengths and limitations

The study has several strengths. Our multinational study, covering different healthcare systems and settings, showed consistent results across all databases, which highlights the robustness and replicability of our findings. All databases had complete recording of vaccination status (date and vaccine) and are representative of the respective general population. Algorithms to identify study outcomes had been used in previously published network studies, including regulatory-funded research. 3 4 8 18 Another strength is the staggered cohort design, which minimises confounding by indication and immortal time bias. PS overlap weighting and NCO empirical calibration have been shown to adequately minimise bias in vaccine effectiveness studies. 19 Furthermore, our estimates include vaccine effectiveness against COVID-19 itself, which is a crucial step on the pathway to post-COVID-19 complications.

Our study has some limitations. The use of real-world data comes with inherent limitations, including data quality concerns and risk of confounding. To deal with these limitations, we employed state-of-the-art methods, including large-scale propensity score weighting and calibration of effect estimates using NCO. 19 24 A recent study 30 has demonstrated that methodologically sound observational studies based on routinely collected data can produce results similar to those of clinical trials. We acknowledge that results from NCO were positively associated with vaccination, and estimates might still be influenced by residual bias despite calibration. Another limitation is potential under-reporting of post-COVID-19 complications: some asymptomatic and mild COVID-19 infections might not have been recorded. Additionally, post-COVID-19 outcomes of interest might be under-recorded in primary care databases (CPRD Aurum and Gold) without hospital linkage, which represent a large proportion of the data in the study. However, results in SIDIAP and CORIVA, which include secondary care data, were similar. Also, our study included only a small number of young men and male teenagers, the population of main concern for increased risk of myocarditis/pericarditis following vaccination.


Vaccination against SARS-CoV-2 substantially reduced the risk of acute post-COVID-19 thromboembolic and cardiac complications, probably through a reduction in the risk of SARS-CoV-2 infection and the severity of COVID-19 disease due to vaccine-induced immunity. Reduced risk in vaccinated people lasted for up to 1 year for post-COVID-19 VTE, ATE and HF, but not clearly for other complications. Findings from this study highlight yet another benefit of COVID-19 vaccination. However, further research is needed on the possible waning of the risk reduction over time and on the impact of booster vaccination.

Ethics statements

Patient consent for publication.

Not applicable.

Ethics approval

The study was approved by the CPRD’s Research Data Governance Process, Protocol No 21_000557 and the Clinical Research Ethics committee of Fundació Institut Universitari per a la recerca a l’Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol) (approval number 4R22/133) and the Research Ethics Committee of the University of Tartu (approval No. 330/T-10).


This study is based in part on data from the Clinical Practice Research Datalink (CPRD) obtained under licence from the UK Medicines and Healthcare products Regulatory Agency. We thank the patients who provided these data, and the NHS who collected the data as part of their care and support. All interpretations, conclusions and views expressed in this publication are those of the authors alone and not necessarily those of CPRD. We would also like to thank the healthcare professionals in the Catalan healthcare system involved in the management of COVID-19 during these challenging times, from primary care to intensive care units; the Institut de Català de la Salut and the Program d’Analítica de Dades per a la Recerca i la Innovació en Salut for providing access to the different data sources accessible through The System for the Development of Research in Primary Care (SIDIAP).

AMJ and MC are joint senior authors.

Contributors DPA and AMJ led the conceptualisation of the study with contributions from MC and NM-B. AMJ, TD-S, ER, AU and NTHT adapted the study design with respect to the local vaccine rollouts. AD and WYM mapped and curated CPRD data. MC and NM-B developed code with methodological advice from MTS-S and CP. DPA, MC, NTHT, TD-S, HMEN, XL, CR and AMJ clinically interpreted the results. NM-B, XL, AMJ and DPA wrote the first draft of the manuscript, and all authors read, revised and approved the final version. DPA and AMJ obtained the funding for this research. DPA is responsible for the overall content as guarantor: he accepts full responsibility for the work and the conduct of the study, had access to the data, and controlled the decision to publish.

Funding The research was supported by the National Institute for Health and Care Research (NIHR) Oxford Biomedical Research Centre (BRC). DPA is funded through a NIHR Senior Research Fellowship (Grant number SRF-2018–11-ST2-004). Funding to perform the study in the SIDIAP database was provided by the Real World Epidemiology (RWEpi) research group at IDIAPJGol. Costs of databases mapping to OMOP CDM were covered by the European Health Data and Evidence Network (EHDEN).

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.


The Challenges of Big Data for Research Ethics Committees: A Qualitative Swiss Study

Agata Ferretti

1 Health Ethics and Policy Lab, Department of Health Sciences and Technology, ETH Zürich, Switzerland

Marcello Ienca

2 College of Humanities, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.

Minerva Rivas Velarde

3 Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Switzerland

Samia Hurst

4 Institute for Ethics, History, and the Humanities, Faculty of Medicine, University of Geneva, Switzerland

Effy Vayena

Associated data.

Supplemental material, sj-docx-1-jre-10.1177_15562646211053538 for The Challenges of Big Data for Research Ethics Committees: A Qualitative Swiss Study by Agata Ferretti, Marcello Ienca, Minerva Rivas Velarde, Samia Hurst and Effy Vayena in Journal of Empirical Research on Human Research Ethics

Big data trends in health research challenge the oversight mechanism of the Research Ethics Committees (RECs). The traditional standards of research quality and the mandate of RECs illuminate deficits in facing the computational complexity, methodological novelty, and limited auditability of these approaches. To better understand the challenges facing RECs, we explored the perspectives and attitudes of the members of the seven Swiss Cantonal RECs via semi-structured qualitative interviews. Our interviews reveal limited experience among REC members with the review of big data research, insufficient expertise in data science, and uncertainty about how to mitigate big data research risks. Nonetheless, RECs could strengthen their oversight by training in data science and big data ethics, complementing their role with external experts and ad hoc boards, and introducing precise shared practices.


In recent years, research using large volumes of data has drastically increased across a variety of fields including data science, physics, biomedicine, psychology, and the social sciences ( Leonelli, 2020 ). This type of research, known as big data research, benefits from merging and harnessing data from multiple sources, generating new insights and unexplored scientific perspectives. In this paper, we refer to “big data research” as any research relying on large datasets, made of data heterogeneous in source, processed at high speed, and analyzed through novel computational techniques ( Ienca et al., 2018 ).

In parallel with these changes in research practice, high profile cases of data misuse have emerged, exposing research participants to privacy breaches and risk of harm ( Fuller, 2019 ). In response, debate has increased about the role and effectiveness of the Research Ethics Committee (REC) as the chief ethical research oversight mechanism in research, given the specific challenges presented by research with big data ( Ferretti et al., 2020 ; Rennie et al., 2020 ). RECs, also known as Institutional Review Boards (IRBs) and Research Ethics Boards (REBs), were created in the 20th century to protect the safety and interests of human participants in research ( Friesen et al., 2019 ). Today, the REC’s mandate—the regulation of human subject research and the evaluation of key ethics review principles—might fall behind the demands of data-intensive research ( Vayena et al., 2016 ). In fact, big data research is characterized by novel ethical concerns, which can challenge traditional ethics oversight mechanisms and practices ( Samuel et al., 2021 ).

Particularly in the biomedical and health fields, the increasing availability of digital health technologies enables the collection of an unprecedented amount of data ( Car et al., 2019 ). The possibility of using artificial intelligence (AI) and extraordinary computational capabilities to merge, analyze, and harness these data offers great opportunities to improve individual and public health ( Blasimme & Vayena, 2019 ). The potential of AI in medicine has emerged even more clearly during the COVID-19 pandemic, as differently structured data from heterogeneous sources were collected and processed for public health purposes, such as containment, mitigation, and vaccine development ( Murray et al., 2020 ; Ngan & Kelmenson, 2020 ). A crucial benefit offered by AI technologies is improved prevention and personalized treatment. In fact, AI can extract information related to individual health status by combining data unrelated to health and wellbeing (e.g., location data, blog posts) collected through a variety of tools (e.g., social media, wearable devices) ( Vayena & Gasser, 2016 ). Despite the mentioned benefits, these new research methods and technological developments have numerous downsides. First, they challenge traditional research principles such as data privacy, informed consent, scientific validity of research, risk assessment, and distribution of benefits ( Price & Cohen, 2019 ; Rivas Velarde et al., 2020 ). Second, they introduce new epistemic challenges related to the assessment of scientific validity, technological reliability, accountability, fairness, and transparency ( Friesen et al., 2021 ). Finally, they challenge the very notion of human participants in research, as they enable retrospective data processing without physical interaction with research participants ( Metcalf & Crawford, 2016 ).

Several questions arise about whether existing regulatory and ethical governance tools, and current practices and expertise of RECs, are adequate to protect human participants and enable ethical research ( Ferretti et al., 2021 ). While some authors argue that the ethical principles and frameworks that traditionally govern research need to be adapted considering new research contexts ( Parasidis et al., 2019 ; Vayena & Blasimme, 2018 ), studies investigating the perspectives and needs of the involved stakeholders remain scarce. Recent studies ( Favaretto et al., 2020b ; Samuel et al., 2021 ) analyzing researcher views on the topic revealed both a lack of adequate expertise among REC members and the absence of clear and consistent criteria for evaluation. Similar conclusions were reached by empirical studies conducted in the UK, Canada, and the US, interviewing REC members about the ethics of social media research and research using pervasive sensing technologies ( Hibbin et al. 2018 ; Nebeker et al., 2017 ; Samuel et al., 2018 ). REC members were able to identify emerging ethical challenges related to big data but reported feeling unprepared to address those challenges, and a lack of normative guidance. Although these studies are highly informative, their exploratory and context-dependent nature makes their claims difficult to generalize. Furthermore, it should be noted that ethical oversight practices and research ethics guidelines diverge at the international level because legal requirements differ from state to state ( Vayena, 2021 ).

In Switzerland, research involving human subjects, biomedical data, and biological samples requires the approval of the REC. Most of the research projects conducted in biomedical and health fields are reviewed by Cantonal RECs ( Coordination Office for Human Research, 2019 ). Switzerland counts seven of these committees organized under Swissethics, the association of Cantonal RECs ( Swissethics, 2021 ). RECs apply the legal and ethical rules included in the Human Research Act (HRA), which ensure the dignity, privacy, and health of research participants, as well as the ethical value of the research. Each REC oversees projects in a specific geographical area of Switzerland: two in the French-speaking region, one in the Italian-speaking region, and four in the German-speaking region ( Figure 1 ). While the HRA sets general standards about RECs’ composition, members’ requirements, and review procedures, each REC is organized and managed independently at the Cantonal level. Although the number of members varies across committees, RECs usually include a chair, vice-chair, managing director, and scientific secretary ( Swiss Federal Council, 2021 ).

Figure 1.

Distribution of Swiss Research Ethics Committees (RECs) in the Swiss territory and areas of authority.

Typically, research involving anonymous health data or biological samples is not subject to the HRA. Similarly, studies without direct implications for “the understanding of human diseases; the structure and function of the human body; or public health” ( Swiss Federal Office of Public Health (FOPH), 2011 ) are exempted. As a consequence, human subject research in the fields of psychology, sociology, or marketing is exempted from HRA provisions. Several universities have introduced institutional ethics committees to review research projects falling outside the Cantonal RECs’ purview. Nevertheless, the implementation of such intra-institutional local ethics committees is uneven throughout the country, as federal law only provides for the establishment of Cantonal RECs, and universities have no legal obligation to introduce these committees.

While a recent study looked at the experience of Swiss researchers when submitting big data research for ethical review ( Favaretto et al., 2020a ), no study to date has investigated the opinions and perspectives of Cantonal RECs. Therefore, this study aims to fill this gap, complement existing research, and expand knowledge on the topic by engaging with members of Cantonal RECs. Their direct experience in evaluating big data projects can provide valuable insight into the current primary ethical oversight mechanism in Switzerland, shed light on existing gaps in the mechanism, and pave the way for needed reforms.

Recruitment and Sampling

For each Swiss Cantonal REC, we interviewed the chairperson (or vice-chairperson or managing director) and, whenever possible, one scientific secretary. Committees were identified through the Swissethics website. The invitation sent to each chairperson included the following: the outline of the research and research aims; the interview methodology and a preliminary timeline; the informed consent form and details about safeguards in place for data protection and confidentiality; and the research team contacts. The response rate was 100%. All Cantonal RECs ( n  = 7) responded to our email and participated in our study. Prior to recruitment, we obtained approval to conduct this study from the responsible REC.

Between October 2018 and May 2019, MI, AF, and MRV conducted semi-structured interviews, either face-to-face or via telephone. After written and verbal consent was obtained, each interview was recorded; interviews lasted between 35 min and 1 h. Interviewees could specify their preferred interview language (French, German, Italian, English, or a combination). We completed a total of seven interviews with 13 interviewees. Across RECs, interviewees shared similar disciplinary backgrounds ( Table 1 ).

Table 1.


MI developed the interview guide ( Appendix 1 ), which was vetted by AF and EV and approved by the research team. The interviews aimed to investigate the perspective of the Cantonal RECs on (1) how to define big data research; (2) their experience with reviewing big data projects and with the ethical guidelines used for the assessment of big data research; (3) the peculiarities of big data research, namely, its benefits and challenges; and (4) the needs of RECs in order to adequately address big data research challenges (e.g., high-level recommendations, procedural good practices, education, training).

We transcribed verbatim the audio files in the original language of the interviews with the support of Sonix online software. Three interviews were in English, three in German, and one in Italian. To increase data consistency and reduce selective bias, we translated the non-English transcriptions into English with the assistance of DeepL Pro online software and additional human review. AF, MI, and MRV thematically coded and analyzed the data with NVivo 11 Software. Each interview was coded independently by two researchers using a combination of inductive and deductive reasoning for theme development ( Fereday & Muir-Cochrane, 2006 ). While the deductive analysis traced the themes listed in the interview guide, the inductive analysis allowed the expansion of the list of themes, by adding those that emerged from coding the interview content.

The data analysis was performed in two steps. First, major themes of interest were identified and categorized (please refer to the Results section). This phase was duplicated by two researchers, and any disagreement was resolved with a third researcher. Second, the themes were analyzed in depth through discussion among the researchers, and adjustments to the final thematic map were made to improve logical cohesion. The result of this analysis is detailed in the following section.

Our analysis identified four recurrent themes and several subthemes, which are summarized in Table 2 . These themes mirror the research questions addressed by this research, namely, (1) what is RECs’ understanding of the “big data” concept? (2) What is the ethics review process currently in place for big data research? (3) What are RECs’ perspectives about the benefits and challenges of big data research? (4) What are RECs’ needs in the big data era?

Table 2.

Overview of Interview Themes and Subthemes.

Note. REC = Research Ethics Committee.

Characteristics of Big Data Research

The interviewees displayed variation in interpreting the concept of big data research, often deviating from the definition proposed in the introduction. Consistent with existing literature on the topic ( Ienca et al., 2018 ; Jin et al., 2015 ), Interviewee 10 observed “there is much talking about big data but no unanimous definition.” The majority of interviewees mentioned the three Vs (volume, variety, and velocity) characterizing big data, particularly stressing volume and variety.

To me, what is relevant is data volume… the fact that there is an increasing amount of data in research files or databases. In addition, it is important where and how these data come from (Interviewee 10)

Interviewees seemed aware of the diverse data sources used today for health research purposes: most mentioned data collected through social media, loyalty cards, tracking technologies, and digital health tools (e.g., health insurance mobile apps or fitness devices). Only a minority associated the big data concept with the deployment of novel analytic tools such as algorithms and AI.

I think that what would qualify as big data approach […] is if data are being analyzed using artificial intelligence and other analytic approaches that usually are not used for the regular project that we are evaluating - where normal or ordinary [statistical] methodology is applied. Here, with the big data, you are getting into a new dimension. (Interviewee 12)

Furthermore, while Interviewee 13 stressed the fact that big data projects are often hypothesis-free (“(big data projects) will try to generate the knowledge from the data itself rather than the classical approach with hypotheses and verifications”), Interviewee 3 suggested considering data transfers and the re-uses of existing datasets as signals of big data research.

Although interviewees could formulate definitions of big data research, they were unsure where to draw the line between traditional and big data research (“when does a biomedical project start to be a big data project?” (Interviewee 3)). Interviewee 4 said that medical research has always collected and relied upon voluminous datasets. Therefore, whether traditional research is considered big data research is only a matter of interpretation:

in cancer research it is common to integrate many patients’ pathology data with x-rays or other imaging data, and genetic information. This happened already in small projects; but now these projects are viewed as big data projects (Interviewee 4).

We asked interviewees to describe examples of big data projects they had reviewed or foresaw reviewing. Many referred to projects using data and samples from biobanks. Others spoke about projects focused on improving personalized medicine, using data from tracking devices and wearables and from social media (i.e., Facebook and Twitter).

When you talk to me about big data my idea goes more to databases, or biological sample banks that collect a huge amount of data and for which there is no purpose. […] I think big data means analyzing a huge amount of data from various sources but without a precise purpose in mind. (Interviewee 8)

Current State of Ethical Oversight in Big Data Research

Six of seven Cantonal RECs reported previously reviewing and assessing big data projects. Nevertheless, our respondents emphasized that, so far, this had occurred rarely, only a few times per year. Moreover, none of these studies were explicitly labeled by researchers as a big data project. Interviewees acknowledged their limited experience in reviewing big data research and speculated that this is because few of these projects had taken place in Switzerland so far. However, REC members anticipated that this trend would evolve in the future, especially with the creation of new biobanks and the growing availability of medical data from electronic health records. Furthermore, interviewees highlighted the limits of their oversight power in the big data context. Their precisely defined mandate might be a reason they only rarely reviewed big data projects. In fact, Cantonal RECs’ research purview is restricted to biomedical and clinical projects involving humans, and human biological data and samples. For instance, big data studies collecting social media data or anonymized data in the fields of social sciences and psychology would fall outside Cantonal RECs’ review:

They [the not-strictly biomedical projects] are, so to speak, in the grey area: the conventional ethics committees are not responsible for them, but it is still completely unclear which oversight mechanisms should be applied (Interviewee 1)

As Interviewee 4 pointed out, Cantonal RECs may audit the above-mentioned studies, but only to “give an opinion (not-legally binding) according to article 51 (of the Swiss Human Research Act) about whether or not these types of research applications are ethical.” Such reviews happen only rarely, however, because researchers are not legally required to submit these types of projects for review. Consequently, RECs are unaware of the real state of the art concerning big data research:

I am really wondering actually whether the researchers doing research on big data are willing to come and ask for our opinion. I will not be so surprised to learn that there are researchers that have actually conducted research on big data without coming to us, and I believe that under the legal point of view they may have some arguments. (Interviewee 13)

All interviewees reported the absence of specific standards to assess big data projects. Therefore, REC members rely on traditional research ethics and bioethics criteria (such as those included in the HRA, Belmont Report ( Sims, 2010 ), Emanuel framework ( Emanuel et al. 2004 ), and Beauchamp’s four-principle approach ( Beauchamp, 2007 )), independent of the study type. RECs’ assessment includes the evaluation of data protection safeguards, strategies to respect participant autonomy (i.e., informed consent), risk-benefit assessments, research purposes and data proportionality, and the scientific validity of research methodology and findings:

For now, there is no evaluation grid for analyzing these studies involving big data. […] The purpose of the study, the scientific question to which the study responds, is fundamental, and is one of the factors that we take into account. (Interviewee 8)

Our interviews revealed diverging opinions about whether the lack of specific guidance for big data research is potentially problematic. Interviewee 3 explained that the absence of such guidance should not necessarily be considered a weakness in the oversight mechanism. On the contrary, existing regulations provide tools that can be effectively applied across scientific disciplines and project types:

I think for that what we are seeing at the moment… we have a law, we have data protection rules, we have the Human Research Act here and I think the regulations we have can apply for this kind of research [big data research] as well as for other types of research. So, we should not make any difference at the moment. (Interviewee 3)

Other interviewees agreed and spoke about the HRA, Swiss data protection law, GDPR, and Emanuel framework for biomedical research as sufficient tools to guide their judgment when reviewing projects. Two interviewees openly rejected the concept of big data research exceptionalism:

I mean, of course big data shows that issues are more pressing to answer. But the pending questions…we have identified them, even though from a different point of view. […] for each of those issues I can provide examples in traditional research that are already raising those questions. (Interviewee 12)
I don't want big data to be defined any differently than other requests […] only because it’s called big data… for me it is not fundamentally different than a normal request. (Interviewee 4)
…if you've done this [assessing projects] for so many years now, you have a certain routine. But with big data and AI, if we don't even know what the risks are, how can we assess and approve them? (Interviewee 1)

In addition, Interviewee 10 spoke about the difficulty of assessing data quality in big data research compared with traditional biomedical research. Whereas traditional research data were collected inside hospitals by researchers and health professionals, data are now also collected by tracking devices or social media platforms:

One problem is that there is no control for data quality in self-collected self-tracked data. All those medical apps, all those devices. There is no quality control for that. Who is ensuring, checking the quality of the data they generate? Same for social media… The quality of those data is not, at least not always, identical as in conventional research. (Interviewee 10)

Finally, Interviewee 9 provided an example of how the absence of clear legal guidance can result in inconsistent ethical evaluations across RECs:

If there is not a sufficient legal framework…   projects involving big data are only interpreted from an ethical point of view. And ethical interpretations from one committee to another may vary. The lack of a precise frame allows you to have more interpretations – which are always interesting – but could create problems. (Interviewee 9)

Implications of Big Data Research

Overall, our respondents indicated a variety of benefits and challenges associated with big data research (summarized in Figure 2 ).

Figure 2. Benefits and challenges of big data research discussed by Swiss Cantonal Research Ethics Committees (RECs).

Concerning the former, nearly all REC members flagged the importance of big data studies for increasing scientific knowledge and generating public benefit; “ I certainly believe that the public health dimension and the public benefit of big data has to be stressed and has to be encouraged ” (Interviewee 12). On a similar note, Interviewee 3 viewed big data as a chance for the scientific community to tackle broad research questions:

I think that the most important benefit is moving away from research on small data packages…I think if you merge these data together you will have a much better chance to have a good research. (Interviewee 3)

Many interviewees also spoke about the role of big data research in improving prevention and diagnostics. Furthermore, they commented on its role in boosting precision medicine, in order to find the best treatments for rare diseases and tailor health interventions to specific population sub-groups. Interviewee 4 noted that while research participants and patients might not benefit directly from big data research, its benefits will be available to the whole of society in the future:

Generally, as I have seen the projects so far, the individual does not benefit directly from the research. Data are used to improve prevention and find new therapy….so the benefit is shifted into the future. (Interviewee 4)

When asked about the challenges of big data research and their implications for ethical oversight, the respondents identified a wide range ( Figure 2 ). Our interviews, however, revealed a lack of consensus among Cantonal RECs concerning which challenges are most pressing (“ informed consent ” (Interviewee 13), “ anonymization ” (Interviewee 5), “ results interpretation and generalizability ” (Interviewee 7)). Despite this divergence, the majority of interviewees said that big data research exacerbates privacy and confidentiality risks, potentially resulting in individual and collective harms ( “Huge impact on privacy! Everybody wants you to be under constant surveillance.” (Interviewee 1)). Thus, respondents stressed the importance of the rigorous application of data protection governance and the implementation of precautionary measures (e.g., data encryption and anonymization) to secure sensitive information. However, some respondents questioned the effectiveness of data protection regulations and practices in the context of big data:

…all of a sudden you notice you can find things that you shouldn't have. Big data linkage creates more problems to ensure people’ dignity and data confidentiality. These information are precious to people and should not be put in danger. (Interviewee 13)
Is anonymization possible or not?…Or is it just a word that is not true anymore because it's so easy to identify people behind [the data]? (Interviewee 3)
It's just hard to agree to a declaration of consent online. These terms and conditions just require you to click and accept, but nobody reads them. That is not an informed and good consent. (Interviewee 1)

Nonetheless, interviewees’ opinions varied concerning which solution could best solve the informed consent impasse. Interviewee 3 commented on the need for a dynamic form of consent. Responsible big data research should allow participants to choose for which purposes their data are used:

I think to do research in a responsible manner we should say which data are used for what, and give the owner of data or samples the chance to choose each time. (Interviewee 3)

At the opposite side of the spectrum, Interviewee 12 said that projects using biomedical data and providing clear public benefits should presume participant consent, unless they state otherwise:

The law should be changed to resemble the system they have in Scandinavia where, for research purposes, the access to personal data and samples is guaranteed by a presumed consent. Of course, you need to have a democratic and human rights system in place to allow for presumed consent. (Interviewee 12)

Similarly, Interviewee 10 argued from a pragmatic perspective. In the age of big data, it is simply not feasible to obtain informed consent from participants (due to data volume), let alone for data reuses or retrospectively. Therefore, researchers should focus on obtaining consent only when collecting sensitive data. When using other data types (such as data publicly available online), researchers should rely instead on a consent waiver:

I am not so sure we need consent for all data. […] We should only protect sensitive data, hence make sure we obtain consent for those. […] People freely “leave their traces” around the web, giving their information for free to companies while using apps and online services without being concerned. Why should researchers be more concerned? (Interviewee 10)

Many interviewees explicitly articulated the difficulty of balancing the risks and benefits of big data research. They felt particularly uncertain about how to estimate the risks and justified their concerns with various arguments.

First, the exploratory nature of big data research and the numerous possibilities for data linkage make anticipating the risks very complex (“another issue is the unforeseeable risk…because as of today we can't tell what we're going to find out about that person through big data analysis” (Interviewee 8)). Interviewee 3 spoke about the incremental risk of managing incidental findings (“How are you (researcher) dealing with incidental findings? Is there still a possibility to report them back to the patients or not?”), due to the large volumes of data which are combined and analyzed. This is especially critical since RECs review research intentions, but do not control the outcomes (“we only see a project on paper at the beginning and then actually implement it. That's kind of out of our hands then” (Interviewee 5)).

Second, the use of analytics tools like opaque AI algorithms “ that nobody at the end understands ” (Interviewee 3) increases the chance for unclear and incorrect data processes. In turn, these processes can result in “ wrong conclusions ” (Interviewee 7): “ moving away from, let's say, a kind of research where you are looking for causality we are diving much more into an area where you just look for correlations, which may be coincidences ” (Interviewee 3).

Third, the chances of data hacks and de-identification are hard to anticipate, especially when private companies are involved “ and there are a lot of secrets around these (data protection strategies). (…) how do you evaluate the quality of protection when you don't know how much they are subject to attacks, how many of those attacks are successful, and what are the steps taken against those attacks? ” (Interviewee 13).

Finally, the presence of private actors in big data research makes determining fair distribution of benefits more complex: “we have to be careful about the fact that big data (research) is not a way of monetizing on our data by big companies…you see it already…they take all of our data and so on and make profit out of that while it is a public good” (Interviewee 12).

Given the broad spectrum of potential but unclear risks emerging in big data research, the majority of interviewees were dubious when asked to define the threshold of minimal risk:

I mean…if you see the potential for data abuse which is here…and what has already happened…then you can't even speak of minimal risk! (Interviewee 1)

Needs of RECs in Big Data Research

Overall, committee members agreed about not having sufficient experience or expertise in technical areas, such as big data analytics or computer science. These weaknesses emerged when trying to understand (“We can't understand that at all” (Interviewee 7)) or assess biomedical big data projects (“with big data research we simply do not have the know-how” (Interviewee 6)). Interviewees’ concerns predominantly centered around the speed at which new technology evolves. This constant change makes it virtually impossible to have sufficient experience and insight to judge projects with a high degree of certainty:

New algorithms with artificial intelligence…   I have no experience with this and how to deal with this in the future this is an open question. (Interviewee 3)

Despite this limitation, consensus emerged across RECs regarding their role as key oversight mechanisms for biomedical research, including research relying on big data. While a minority of respondents defended the current way of practicing ethical review and the adequacy of the current laws, the majority acknowledged several limitations of the oversight mechanism. When asked about their needs and envisioned solutions, REC members discussed possibilities at the levels of training, procedures, and regulations.

Regarding the first, almost all respondents expressed an urgent need to fill the REC expertise gap, recognizing expertise as a crucial factor in effectively fulfilling the oversight mandate. Interviewees expressed interest in targeted trainings discussing characteristics, risks, and ethical implications of big data projects and AI applications in biomedicine. Interviewee 4 suggested conducting these trainings in a dynamic format, offering case studies and mock projects to analyze. Meanwhile, other respondents further highlighted that improving REC members’ knowledge and allowing for greater exchange could increase review standardization within and across committees:

it would be really good to show examples of how the projects are built and which algorithms are on the back of the analysis and how to they are put together […] I need concrete examples…   […]The case studies should come from the people who do this…the researchers….to get a proper understanding. […] then, there should be a discussion among the ethics committees on how to deal with these case studies (Interviewee 4)

In addition to these trainings, all respondents confirmed the benefits of consulting specialists (in the fields of big data analytics, computer science, and data management) when needed (“ if I see a problem or so then we get the appropriate expertise. We also do this for quite ordinary applications where the risk cannot be assessed with certainty ” (Interviewee 4)). However, the suggestion of Interviewee 6 to include technical experts into RECs was unpopular among other interviewees:

I think they have to be members, so that we do not have to go and get an expert for an opinion every time (Interviewee 6)
I am not convinced that introducing a technical figure can be a solution. Rather get training for the whole committee. (Interviewee 9)
It is important to define what is good big data research, what must be done when conducting this type of research, and what is optional […], which methodology is acceptable and which unacceptable. (Interviewee 6)

Some respondents emphasized that researchers, too, would benefit from clearer standards about how to ensure data protection, handle unexpected results, certify the validity and quality of the methodology, and clarify the research question:

I think we have to look whereby certain standards are fulfilled and sometimes we get research application where it's not clear what is the research question. (Interviewee 3)

Interviewee 12 further explained that the REC’s attitude toward researchers is not intended to be that of watchdogs seeking to reject research projects simply because they rely on big data. On the contrary, RECs are responsible for promoting ethically aligned research and want to work together with researchers to improve their projects:

We say: OK let's look what these researchers want to do and how can we do it in the best way so that they do not hurt people. (Interviewee 12)

When asked about who should develop these practical guidelines (both for researchers and for RECs), respondents listed a variety of bodies, including Swissethics, the Swiss Academy of Medical Sciences (SAMS), and the Central Ethics Committee (CEC), in collaboration with research institutions and experts in both ethics and science. If ethics review practices were introduced at an international level and made valid across countries, REC members would expect the World Health Organization (WHO) to formulate them.

Finally, at the regulatory level, interviewees discussed whether the scope and mandate of RECs should be expanded to cover those big data projects currently outside their purview. From the respondents’ perspectives, these projects might still carry negative consequences for individual health and wellbeing, as well as for broader society. Although REC members had a favorable view of the option of expanding the REC mandate, they also highlighted two crucial points. First, the aim of RECs should align with society’s expectations and values. Political and health authorities, as well as RECs, should engage with society to define the boundaries of RECs’ scope (what should be reviewed or not) in light of new technological advancements. Only as a result of this democratic debate should the law be adjusted.

Our role is to protect the individual and to decide what is in the interest of a society…Committees should agree with the society about what should be permitted and what not […] we need a clear and harmonized understanding of the role of RECs and what is legally required. (Interviewee 3)

However, although society may identify a number of core values to respect and promote (e.g., “privacy, accountability, transparency, public participation” (Interviewee 12)), Interviewee 1 suggested that societal expectations about what exactly ought to be done with the data might remain vague because “we live in a pluralistic society.”

The second point is of a pragmatic nature. As RECs do not have the capacity and expertise to review highly technical studies or studies outside the biomedical field, most respondents agreed on the idea of introducing specific oversight boards to assess the technical features of projects involving big data and AI. These boards could complement—rather than substitute for—RECs and find their place alongside those already supervising data uses (such as data protection legal offices and data safety monitoring boards).

I can imagine that an external body with certain skills could be useful….to evaluate the technical aspects that we do not consider […] its skills complements our evaluation. (Interviewee 8)
Possibly on the long term we are going to need something like “big data board”…possibly. I do not think they could replace RECs…they will be rather complementary. (Interviewee 10)
This should be carefully considered […]. I always struggle with too many parallel structures […]…in the end we have a forest of ethical institutions and nobody knows anymore what is really well reviewed. (Interviewee 1)

Overall, respondents agreed that Cantonal RECs and their current practices have room for improvement, in order to be truly effective and valuable in the era of big data. To succeed in this task, good will alone will not suffice. Rather, interviewees specified that Swiss regulators and policymakers should consider these gaps and further clarify the role of RECs among other ethical oversight mechanisms in place.

Strengths and Limitations

While the methodology of qualitative interview analysis allows for the detailed exploration of opinions and perspectives, the same study design challenges the generalization of the conclusions. However, although the findings of this study are confined to the Swiss context, the fact that we interviewed members of all seven Cantonal RECs made it possible to represent the full spectrum of cultural variation that exists within the country. Furthermore, since the Swiss ethical oversight mechanism partially resembles those of other European countries (e.g., neighboring Germany, Austria, France, and Italy) and internationally, a certain degree of generalization of results could be justified.

In this study, a selection bias may have arisen from including only the views of Cantonal RECs. Although other ethics committees exist in Switzerland (e.g., the national ethics committee and institutional review committees within universities), this study focused on big data research in the biomedical and health field, which is usually reviewed by Cantonal ethics committees. The fact that only the chairperson/vice-chairperson and one scientific secretary per REC were interviewed may also have introduced a bias into the study. Nevertheless, one must consider that chairpersons, in practice, set the agenda for the committee, and scientific secretaries first review and evaluate research protocols. Therefore, we believe that their perspectives and comments have provided valuable insights into the ethics of research with big data in biomedical and health settings.

Finally, the fact that the interviews were conducted before the COVID-19 pandemic can be interpreted as a limitation. Indeed, the pandemic has increased pressure on RECs, especially for reviewing public health projects that leverage the power of big data and AI. Nonetheless, the results of this study transcend the temporality of current research conditions, as they relate to the complex oversight system of Cantonal RECs, which is not evolving as rapidly. Future research could explore which processes and functions of Cantonal RECs have changed as a result of the COVID-19 pandemic.

Our findings reveal four main areas of ethical significance. First, the lack of specific normative standards for the ethics review of big data studies. Second, epistemic challenges faced by REC members, specifically insufficient experience and expertise. Third, normative ethical challenges related to the scope of ethical reflection on big data, as conceptual tools traditionally used to assess biomedical research appear increasingly inadequate to assess unforeseeable and novel risks generated by big data studies. Finally, proposals for reform emerged from our analysis, including both conservative reforms (e.g., building capacity and promoting data literacy among REC members) and more radical reforms, such as complementing RECs with data-focused oversight bodies. In the following, we provide a detailed analysis of these themes.

Lack of Specific Review Standards

Although REC members share a general idea of what constitutes big data, they lack a precise common definition and clear guidance on how to recognize these studies in practice. As previous studies indicated, REC members’ uncertainty could result in inconsistencies across committees ( Favaretto et al., 2020b ; Vitak et al., 2017 ). Moreover, the way in which interviewees define big data can influence their assessment of the most pressing ethical challenges. By using narrow definitions of big data—namely focusing on one or few characteristics (e.g., data source, data volume)—RECs may be more sensitive to some ethical implications than others. It is relevant to note, however, that disagreement on a definition of big data is secondary to a lack of tailored standards for reviewing big data research. Our results, in line with previous research, highlight the lack of specific ethical guidelines for evaluating big data projects and thus the application of traditional ethical frameworks in the evaluation of all projects without distinction ( Ienca et al., 2018 ). While some interviewees believed that the lack of specific ethical guidance does not negatively impact ethical review practices, others expressed concern about having to interpret and judge big data research on a case-by-case basis without guidance. These diverging opinions might mirror different RECs’ approaches to ethics review, as well as REC members’ confidence levels, experience in managing big data projects, and expertise in the technical disciplines (i.e., data and computer sciences). RECs’ diverging interpretations may result in disharmonious evaluations and decisions across committees, which could negatively affect researchers’ trust in the oversight system, data sharing practices, and research collaborations ( Ballantyne & Schaefer, 2020 ; Dove & Garattini, 2018 ; van den Hoonaard & Hamilton, 2016 ). Although a lack of transparency about evaluation procedures and inconsistencies across RECs’ judgments are not exclusive to big data research ( Lynch et al., 2020 ), our findings show that these limitations in REC practices continue to hamper research.

Limited Experience and Expertise

REC members acknowledged their limited experience in dealing with big data projects and inadequate expertise about fundamental technical aspects characterizing these studies. REC members recognized that their narrow mandate diminishes their oversight function in big data research. In fact, the narrow boundaries of the HRA result in only a portion of big data projects conducted at the national level coming to their attention ( von Elm & Briel, 2019 ). However, unless the law is amended to expand the purview of the ethical oversight mechanism, Cantonal RECs have no choice but to invite researchers to submit their research voluntarily. Some studies have suggested that RECs should engage researchers in a dialogue to make them aware of and accountable for the consequences of their research ( Holland, 2018 ). Concerning RECs’ insufficient expertise around technical features of big data research, our interviewees were aware of their shortcomings. Most REC members expressed both willingness and commitment to implementing strategies to overcome their weaknesses (e.g., involving data specialists in the assessment of big data research, or attending trainings to increase their technical competence). Furthermore, our findings partially align with the results of a recent study focusing on IRB staff’s perspectives in the United States ( Vitak et al., 2017 ). The IRBs surveyed believed that over time they would surmount their shortcomings in assessing the technical aspects of big data research proposals, through cumulative experience. Yet, the rapidity with which AI technology and big data applications evolve further complicates RECs’ and IRBs’ attempts to get up to speed in their subject matter knowledge ( Nebeker et al., 2019 ; Prosperi & Bian, 2019 ).

Scope of Ethical Reflection

Our findings reveal that REC members are overall well informed about the benefits and challenges brought about by the advent of big data and data analytics techniques. However, they disagreed on which challenges are the most pressing and which tools are best suited to address them. These different opinions among interviewees might be explained by their background, personal bias, and the lack of training in big data research. The fact that many interviewees focused on how to adapt and improve the informed consent tool, and on how to implement the existing data protection regulations as rigorously as possible, may signal a problem. Some authors flag the risk of viewing these tools as an ethics panacea ( Babb, 2020 ; Corrigan, 2003 ). While regulating data re-uses and operationalizing informed consent remain unresolved issues, privacy-focused ethical oversight may be insufficient to address other challenges raised by big data, concerning, for example, justice, dignity, and fairness ( Ballantyne, 2019 ; McKeown et al., 2021 ). Our results highlight this gap in current ethical oversight, as respondents expressed concern about how to balance the risks and benefits of projects. The traditional ethics tools used to assess biomedical research are inadequate and ineffective when assessing unforeseeable and novel risks ( Sheehan et al., 2019 ). This concern, which remains unresolved for the time being, underscores the need for a broader conversation in society about the importance of big data research, and its uses in terms of our collective interest ( McMahon et al., 2020 ).

Proposals for Reform

Our results shed light on the limitations of the current mechanism of Cantonal RECs, in terms of skills, practices, and guidelines. REC members—aware of these shortcomings—suggested possible solutions to tackle them. The interviewees’ request for training on big data and AI reveals interest in expanding their knowledge. In addition, the practice of involving experts to fill RECs’ expertise gap can be seen as an attempt to offer assistance to researchers ( Huh-Yoo et al., 2021 ). Interviewees’ desire to improve the status quo of ethical review is further evidenced by their suggestion of creating complementary oversight mechanisms (e.g., big data boards), to review the technical aspects of projects and highlight inherent risks, while keeping pace with the fast-changing nature of research. Some interviewees imagined these boards serving as an accreditation mechanism, to certify the quality of a project’s technical features. These boards could operate across disciplines, to certify research in private and public sectors, regardless of data types and sources. Consequently, fewer big data projects would be left without any sort of oversight. Finally, our interviewees strongly defended the role of RECs as a key mechanism for ethical review in research and spoke against overturning the entire system by introducing new high-level principles or laws. Nevertheless, REC members would welcome more operationalizable guidance on what constitutes a good big data project. Therefore, future research and initiatives should aim to fill this gap by offering RECs practical guidance for orienting their judgment in the field of big data research.

Best Practice

Swiss Cantonal RECs should be reformed if they are to be effective in the big data research context. In this paper, we argue that these reforms should involve not only the practices of REC members, but also their expertise and the regulations that define the mandate of RECs. Ethics oversight mechanisms outside the Swiss context might benefit from similar revisions. In addition, this study suggests that researchers be proactive in reaching out for RECs’ opinions and aware of their responsibilities when conducting research. However, the efforts of researchers must be supported by a system of clear rules and ethics training put in place by a network of actors (such as policymakers, universities, and funding bodies) ( Samuel et al., 2019 ).

Research Agenda

In this paper, we reported the perspectives of Swiss Cantonal RECs on the challenges they face in reviewing big data projects and their needs in order to adequately address these challenges. We believe this analysis contributes significantly to the existing literature as it is the first qualitative study to survey Swiss RECs about their experiences and views on this topic. Interestingly, our results align with the literature at the international level. More research is required to explore the need for globally shared ethical standards for conducting research with big data. In fact, as interdisciplinary and cross-country big data projects increase, the scientific community may need not only clear common data governance, but also a shared vision about what an ethically aligned big data project consists of ( ÓhÉigeartaigh et al., 2020 ). The recent COVID-19 pandemic exemplified how divergent laws governing research, unclear ethical evaluation methods, and insufficiently robust oversight mechanisms can slow down research processes, jeopardize efforts for public health, and reduce public trust in scientific institutions ( Gardner et al., 2020 ).

Educational Implications

Our results emphasize the need for knowledge exchange and a more productive engagement among the various actors involved in big data research. These include, but are not limited to, RECs, researchers, research institutions, private enterprises, citizen science groups, and the public ( Vayena & Gasser, 2016 ). In particular, while REC members should acquire more technical skills in, for example, data analysis methodologies and AI-enabled technologies, researchers should also be better informed about the value of and the necessary steps for conducting research ethically. The dynamics of collaboration between RECs and researchers should not only be aimed at fulfilling the requirements imposed by law (i.e., ensuring compliance), but also at increasing mutual knowledge through an open dialogue and positive attitude towards learning. Scholars have argued that positive (although maybe not perfect) actions and responsible big data research can emerge only by asking difficult questions and through transparent confrontation on diverging perspectives ( Zook et al., 2017 ). Finally, our research findings indicate the crucial importance of informing society about issues related to big data and the use of AI in research. Starting with this democratic engagement, the general public can clarify their expectations regarding research with big data and thus inform the decisions of other actors involved.

Acknowledgments


We are grateful to our interviewees who kindly spent their time to take part in this research. We also would like to thank Dorothee Caminiti who assisted MI and AF during the completion of the interviews. We are grateful to Shannon Hubbs for her editorial suggestions.

Author Biographies

Agata Ferretti is a Postdoctoral Researcher at the Health Ethics & Policy Lab, Department of Health Sciences and Technology at ETH Zurich, Switzerland. Her PhD research focused on the ethics and governance of big data in health research and digital health applications.

Marcello Ienca is a Principal Investigator at the College of Humanities at EPFL where he leads the ERA-NET funded Intelligent Systems Ethics research unit. His research focuses on the ethical, legal, and social implications of neurotechnology and artificial intelligence, with particular focus on big data trends in neuroscience and biomedicine, human–machine interaction, social robotics, digital health, and cognitive assistance for people with intellectual disabilities.

Minerva Rivas Velarde is an SNSF Ambizione Group Leader at the Department of Radiology and Medical Informatics, University of Geneva, Switzerland. Her research focuses on global health, eHealth, disability studies, and bioethics.

Samia Hurst is Professor of Bioethics, Director of the Institute of Ethics, History, Humanities (IEH2) and of the Department of Health and Community Medicine at the Faculty of Medicine of Geneva, Switzerland. She has been working on ethical issues in clinical practice, health policy ethics, particularly issues of equity and protection of the vulnerable, and ethical issues in personalized medicine.

Effy Vayena is Professor of Bioethics and Director of the Health Ethics & Policy Lab, Department of Health Sciences and Technology at ETH Zurich, Switzerland. Her work focuses on the important societal issues of data and technology, as they relate to scientific progress and how it is or should be applied to public and personal health.

Appendix 1: Interview guide

Prior to the interview, a study investigator will provide an overview of the research purpose and will remind the participant of the confidentiality and anonymity measures adopted in the research.

Moreover, the study investigator will ask for permission to tape-record the interview.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Grant No. 407540_167223).

Ethics Statement: Prior to recruitment, we obtained approval (EK 2017-N-74) to conduct this study from ETH Zurich’s Research Ethics Committee.

Authors’ Contributions: MI, AF, and MRV conducted and analyzed the interviews and compiled the results. AF drafted the manuscript. All authors contributed to the study design and the development of the final manuscript and approved the submitted version.

ORCID iD: Agata Ferretti https://orcid.org/0000-0001-6716-5713

Supplemental Material: Supplemental material for this article is available online.

