Introduction to R for Data Science: A LISA 2020 Guidebook

Chapter 7 Network Analysis

In this chapter, we will cover concepts and procedures related to network analysis in R. “Networks enable the visualization of complex, multidimensional data as well as provide diverse statistical indices for interpreting the resultant graphs” (Jones et al., 2018). Put otherwise, network analysis is a collection of techniques that visualize and estimate relationships among agents in a social context. Furthermore, network analysis is used “to analyze the social structures that emerge from the recurrence of these relations” where “[the] basic assumption is that better explanations of social phenomena are yielded by analysis of the relations among entities” (Science Direct; Linked Below).

Networks are made up of nodes (i.e., individual actors, people, or things within the network) and the ties, edges, or links (i.e., relationships or interactions) that connect them. The extent to which nodes are connected informs interpretations of the measured social context.

“By comparison with most other branches of quantitative social science, network analysts have given limited attention to statistical issues. Most techniques and measures examine the structure of specific data sets without addressing sampling variation, measurement error, or other uncertainties. Such issues are complex because of the dependencies inherent in network data, but they are now receiving increased study. The most widely investigated approach to the statistical analysis of networks stresses the detection of formal regularities in local relational structure.

[Figure: relational structures commonly found in analyses of social networks, panels A–D]

The figure above illustrates some of the relational structures commonly found in analyses of social networks.

A: Demonstrates a relationship of reciprocity/mutuality.

B: Demonstrates a directed relationship with a common target.

C: Relationships emerge from a common source.

D: Transitive direct relationships with indirect influences.

Another type is homophily, which is present, for example, when same-sex friendships are more common than between-sex friendships. This involves an interaction between a property of units and the presence of relationships” (Peter V. Marsden, in Encyclopedia of Social Measurement, 2005). This sort of model might reflect the tendency of people to seek out those that are similar to themselves.

7.0.0.1 Measures of Centrality

Measures of centrality provide quantitative context regarding the importance of a node within a network. There are four measures of centrality that we will cover.

Degree Centrality: The degree of a node is the number of other nodes to which it is directly connected. Important nodes tend to have more connections to other nodes, so highly connected nodes have high degree centrality.

Eigenvector Centrality: A node’s importance also depends on how well connected its neighbors are (i.e., ties to important nodes increase a node’s own importance).

Closeness Centrality: Closeness centrality measures how many steps are required to reach every other node from a given node. In other words, important nodes can reach the rest of the network in few steps.

Betweenness Centrality: This ranks nodes by how often they lie on the shortest paths between other pairs of nodes. Nodes with high betweenness tend to serve as bridges between otherwise separate sets of nodes. See this link for a set of journals and books that cover the topic.
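As a minimal sketch of how the four measures can disagree, the code below builds a small hypothetical network (two triangles joined through one bridging node; this is not data from the chapter) and computes each measure with igraph. The object names g_toy and centrality_toy are assumptions made for illustration.

```r
library(igraph)

# Hypothetical toy network: triangles {1,2,3} and {5,6,7} joined through node 4.
g_toy <- make_graph(c(1,2, 1,3, 2,3,  3,4, 4,5,  5,6, 5,7, 6,7), directed = FALSE)

centrality_toy <- data.frame(
  node        = seq_len(vcount(g_toy)),
  degree      = degree(g_toy),                   # number of direct ties
  eigenvector = eigen_centrality(g_toy)$vector,  # scaled so the largest value is 1
  closeness   = closeness(g_toy),                # 1 / sum of shortest-path distances
  betweenness = betweenness(g_toy)               # shortest paths passing through the node
)

print(centrality_toy)
```

In this sketch node 4 has only two ties, yet it has the highest closeness and betweenness because every path between the two triangles passes through it, while nodes 3 and 5 have the highest degree.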

Also, examine this (paid) online tool for text-based network analysis: https://www.infranodus.com

7.1 Zachary’s Karate Club Case Study

We will be working with a dataset called Zachary’s Karate Club, a seminal dataset in the network analysis literature. First we need to install the relevant packages. If you do not have them already, install the tidyverse, igraph, ggnetwork, and intergraph packages. igraph is used for creating, analyzing, and visualizing networks. ggnetwork and intergraph are both used for plotting networks in the ggplot framework.

Zachary’s Karate Club Background

Taken from Wikipedia: “A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting pairwise links between members who interacted outside the club. During the study a conflict arose between the administrator “John A” and instructor “Mr. Hi” (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other group found a new instructor or gave up karate. Based on network analysis Zachary correctly predicted each member’s decision except member #9, who went with Mr. Hi instead of John A.” In this case study, we will try to infer/predict the group splits with network analysis techniques.

7.1.0.1 Load Data and Extract Model Features

Now it’s time to extract the relevant information that we need from the dataset. We need the associations between members (edges), the groupings after the split of the network, and the labels of the nodes.

Extract the groups and labels of the vertices and store them in vectors. Use the str() function to check that the labels are stored as characters and not factors, as igraph requires character data for labels.
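The chapter’s own data file is not shown here, so the sketch below assumes that the copy of the network bundled with igraph, make_graph("Zachary"), can stand in for it. That built-in copy carries the ties but no group labels, so only the edges and node labels are extracted; the names zkc, edges, and labels are illustrative.

```r
library(igraph)

# Assumption: igraph's built-in copy of the network stands in for the course dataset.
zkc <- make_graph("Zachary")

# Edge list: one row per tie, two columns of member numbers.
edges <- as_edgelist(zkc)

# Node labels stored as character strings rather than factors.
labels <- as.character(seq_len(vcount(zkc)))

# Inspect the structure of both objects.
str(edges)
str(labels)
```

If your dataset stores labels as factors, as.character() converts them before they are attached to the network.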

7.1.0.2 Creating Networks From Data

Now that we have extracted the relevant data that we need, let’s construct a network of Zachary’s Karate club.

We can also create vertex attributes. Let’s make a vertex attribute for each group (Mr. Hi and John A).

Create a vertex attribute for node label. Call the attribute ‘label’.
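The three steps above (building the network, adding a group attribute, adding a label attribute) might look like the following sketch. The six-member club, its ties, and its group assignments are invented for illustration and are not the karate club data; G_toy, members, and ties are hypothetical names.

```r
library(igraph)

# Hypothetical six-member club; names, ties, and groups are made up for illustration.
members <- c("a", "b", "c", "d", "e", "f")
ties <- data.frame(
  from = c("a", "a", "b", "c", "d", "d", "e"),
  to   = c("b", "c", "c", "d", "e", "f", "f")
)

# Build an undirected network; the `vertices` argument fixes the node order.
G_toy <- graph_from_data_frame(ties, directed = FALSE,
                               vertices = data.frame(name = members))

# Vertex attributes: one for group membership, one for the plotted label.
V(G_toy)$group <- c("Mr. Hi", "Mr. Hi", "Mr. Hi", "John A", "John A", "John A")
V(G_toy)$label <- as.character(members)

vertex_attr_names(G_toy)
```

Attributes assigned with V(G)$name_of_attribute must be given in the same order as the vertices, which is why the node order is fixed when the network is built.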

7.1.0.3 Visualizing Networks with base R

Now visualize the network by running the plot function on our network ‘G’.

Let’s change some of the plot aesthetics. We can change the vertex colors, edge colors, vertex sizes, etc. Play around with the arguments for plotting a network.

We can also change the color of our vertices according to group.
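A minimal sketch of these plotting steps with base R graphics follows. It uses a hypothetical network of two fully connected groups joined by one tie, with invented group labels, rather than the karate club data; the colors and sizes are arbitrary choices.

```r
library(igraph)

# Hypothetical network: two fully connected groups of four joined by one tie.
G_toy <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))
V(G_toy)$group <- rep(c("Mr. Hi", "John A"), each = 4)

# Map each group to a color; plot() picks up the `color` vertex attribute.
group_colors <- c("Mr. Hi" = "tomato", "John A" = "steelblue")
V(G_toy)$color <- unname(group_colors[V(G_toy)$group])

set.seed(1)  # force-directed layouts are random; fix the seed for a repeatable picture
plot(G_toy,
     vertex.size = 20,
     vertex.label.color = "white",
     edge.color = "grey60",
     layout = layout_with_fr(G_toy))
```

Arguments that start with vertex. or edge. control the appearance of nodes and ties respectively, so the same call can be adapted by swapping in other values.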

7.1.0.4 Visualizing Networks with ggnetwork

You can also use ggplot to visualize igraph objects.

Let’s see if we can make the ggplot version look better.

Using ggnetwork and ggplot, color or shape the nodes by karate group. Also make some other plot aesthetic changes to your liking.
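One possible way to approach this with ggnetwork is sketched below, again on a hypothetical two-group network rather than the karate club data. The attribute name faction and the aesthetic choices are assumptions; depending on the installed ggnetwork version, the intergraph package may also need to be installed for igraph objects to be converted.

```r
library(igraph)
library(ggplot2)
library(ggnetwork)

# Hypothetical network: two fully connected groups of four joined by one tie.
G_toy <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))
V(G_toy)$faction <- rep(c("Mr. Hi", "John A"), each = 4)

set.seed(1)
net_df <- ggnetwork(G_toy)  # data frame with layout coordinates and vertex attributes

p <- ggplot(net_df, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "grey60") +
  geom_nodes(aes(color = faction, shape = faction), size = 5) +
  theme_blank()

print(p)
```

Because the result is an ordinary ggplot object, further layers such as labs() or scale_color_manual() can be added in the usual way.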

7.1.0.5 Measuring Centrality

Finally, let’s put all of the centrality measures in one table so that we can compare the outputs.
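A sketch of such a table is shown below, assuming igraph’s built-in copy of the network (make_graph("Zachary")) in place of the course data. The name degr_cent follows the text; eign_cent, clos_cent, betw_cent, and centrality_table are hypothetical names.

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network

degr_cent <- degree(zkc)
eign_cent <- eigen_centrality(zkc)$vector
clos_cent <- closeness(zkc)
betw_cent <- betweenness(zkc)

centrality_table <- data.frame(
  member      = seq_len(vcount(zkc)),
  degree      = degr_cent,
  eigenvector = eign_cent,
  closeness   = clos_cent,
  betweenness = betw_cent
)

# Most connected members first.
head(centrality_table[order(-centrality_table$degree), ])
```

Sorting by a different column shows whether the same members lead on every measure.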

It makes sense that the most connected members of the network are indeed John A. and Mr. Hi. We can also view the centrality measures on the graph itself. Here, we use the object degr_cent to set the vertex size so that base R displays each node according to its degree centrality.
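A minimal sketch of this plot, assuming igraph’s built-in copy of the network, is given below. The constant added to degr_cent is an arbitrary choice that keeps low-degree nodes visible.

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network
degr_cent <- degree(zkc)

set.seed(1)  # fix the random layout
plot(zkc,
     vertex.size = degr_cent + 4,   # larger nodes have more ties
     vertex.label.cex = 0.7,
     vertex.color = "lightblue",
     edge.color = "grey70")
```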

Now, using the tidyverse! Make a graph of our network with ggnetwork where node sizes are scaled by the degree centrality.

7.1.0.6 Modularity

Modularity is a measure that describes the extent to which community structure is present within a network when the groups are labeled. Scores range from -0.5 to 1. A modularity score close to 1 indicates the presence of strong community structure in the network. In other words, nodes in the same group are more likely to be connected than nodes in different groups. A negative modularity score indicates the opposite of community structure. In other words, nodes in different groups are more likely to be connected than nodes in the same group. A modularity score close to 0 indicates that no community structure (or anti-community structure) is present in the network.

Compute the modularity of the Zachary’s Karate Club network using the modularity() function.
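To show how modularity() responds to good and bad labelings, the sketch below uses a hypothetical network of two fully connected groups joined by a single tie. For the karate club itself you would pass the network and its vector of group labels in the same way; the names G_toy, Q_matched, and Q_mixed are illustrative.

```r
library(igraph)

# Hypothetical network: two fully connected groups of four joined by one tie.
G_toy <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))

# Labels that match the two groups.
matched <- rep(1:2, each = 4)
Q_matched <- modularity(G_toy, matched)

# Labels that cut across the groups (alternating 1, 2, 1, 2, ...).
mixed <- rep(1:2, times = 4)
Q_mixed <- modularity(G_toy, mixed)

c(matched = Q_matched, mixed = Q_mixed)
```

The matched labels give a clearly positive score, while the alternating labels give a negative one because most ties then run between the labeled groups.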

Higher modularity scores indicate stronger community structure. However, modularity should not be used alone to assess the presence of communities in a network. Rather, multiple measures should be used to build an argument for community structure in a network.

7.2 Community Detection

Suppose we no longer have the group labels, but we want to infer the existence of groups in our network. This process is known as community detection. There are many different ways to infer the existence of groups in a network.

7.2.0.1 Via Modularity Maximization

The goal here is to find the groupings of nodes that lead to the highest possible modularity score.
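The text does not say which algorithm was used, so the sketch below assumes igraph’s cluster_fast_greedy(), a common greedy heuristic for modularity maximization, applied to igraph’s built-in copy of the network. Other heuristics, such as cluster_louvain(), can return different groupings.

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network

# Greedy agglomerative search for a high-modularity partition.
fg <- cluster_fast_greedy(zkc)

length(fg)       # number of communities found
sizes(fg)        # members per community
modularity(fg)   # modularity of the partition found
membership(fg)   # community assigned to each member
```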

It turns out that the modularity maximization algorithm finds 3 communities within the Zachary’s Karate Club network. But if we merge those three communities into two, only one node is incorrectly grouped. Let’s try another community detection algorithm.

7.2.0.2 Via Edge Betweenness

Edge betweenness community detection is based on the assumption that edges connecting separate groupings have high edge betweenness, as all the shortest paths from one module to another must traverse them. Practically, this means that if we repeatedly remove the edge with the highest edge betweenness score, our network will separate into communities.
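A sketch of this procedure with igraph’s cluster_edge_betweenness() follows, again assuming the built-in copy of the network. Cutting the resulting hierarchy into exactly two groups with cut_at() is a choice made here to mirror the two-way club split, not something the algorithm does on its own.

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network

# Repeatedly remove the edge with the highest betweenness and record the splits.
eb <- cluster_edge_betweenness(zkc)

length(eb)       # number of communities at the best-modularity cut
modularity(eb)

# Cut the hierarchy so that exactly two groups remain.
two_groups <- cut_at(eb, no = 2)
table(two_groups)
```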

7.3 Network Simulation

Say you want to model a new network with no data. It’s possible to simulate a random network and use it to find out whether an observed network is actually interesting or no different from random. If you are familiar with hypothesis testing, we can view these random networks as our “null models”. We assume that our null model is true until there is enough evidence to suggest that it does not describe the real-life network. If our null model is a good fit, then we have achieved a good representation of our network. If we don’t have a good fit, then there is likely additional structure in the network that is unaccounted for.

Our Question: How can we explain the group structure of our network? Is it random or can we explain it via the degree sequence?

7.3.0.1 Random Network Generation

Erdos-Renyi random networks in R require that we specify a number of nodes \(n\) and an edge construction probability \(p\). Essentially, for every pair of nodes, we flip a biased coin with the probability of “heads” being \(p\). If we get a “heads”, then we draw an edge between that pair of nodes. This process simulates the social connections rather than plotting them from a dataset.
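The coin-flipping process described above corresponds to igraph’s sample_gnp(). In the sketch below the values of n and p are arbitrary illustrative choices.

```r
library(igraph)

set.seed(42)   # make the random draw repeatable
n <- 20        # hypothetical number of nodes
p <- 0.2       # hypothetical edge construction probability

er <- sample_gnp(n, p)

ecount(er)             # edges actually drawn
p * n * (n - 1) / 2    # expected number of edges
plot(er, vertex.size = 10, vertex.label = NA)
```

Rerunning without set.seed() gives a different network each time, which is the point of a random-graph model.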

Is this Erdos-Renyi random network a good representative model of the Zachary’s Karate Club network? Let’s construct the Erdos-Renyi random network that is most similar to our network.

We can set the parameters of the Erdos-Renyi random graph by specifying the number of nodes \(n\) and the edge connection probability \(p\). Considering the Zachary’s Karate Club (ZKC) network, we want to use 34 nodes in our graph. If we change the number of nodes, then we lose the ability to compare our network with the theoretical model. We can estimate a probability value for the simulated network by dividing the mean of degr_cent by the number of nodes minus 1 in the ZKC network, i.e., \(p = \bar{k} / (n - 1)\), where \(\bar{k}\) is the mean degree.

Let’s check out the degree distribution for our random graph and the actual ZKC graph.
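The estimate of p and the degree comparison can be sketched together as follows, assuming igraph’s built-in copy of the network. The names p_hat and er_zkc are hypothetical; degr_cent follows the text.

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network
degr_cent <- degree(zkc)
n <- vcount(zkc)

# Mean degree divided by the number of possible neighbors.
p_hat <- mean(degr_cent) / (n - 1)

set.seed(42)
er_zkc <- sample_gnp(n, p_hat)

# Degree histograms side by side on a common set of bins.
breaks <- seq(-0.5, max(degr_cent, degree(er_zkc)) + 0.5, by = 1)
op <- par(mfrow = c(1, 2))
hist(degr_cent, breaks = breaks, main = "ZKC", xlab = "Degree")
hist(degree(er_zkc), breaks = breaks, main = "Erdos-Renyi", xlab = "Degree")
par(op)
```

The estimate p_hat equals the network’s edge density, so edge_density(zkc) returns the same value.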

7.3.0.2 Configuration Model

For this kind of random-graph model, we specify the exact degree sequence of all the nodes. We then construct a random graph whose degree sequence matches the one given.
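A sketch using igraph’s sample_degseq() is shown below, assuming the built-in copy of the network. The method "vl" is a choice made here: it samples simple, connected graphs, which is slightly more restrictive than the textbook configuration model (that model can produce self-loops and repeated edges).

```r
library(igraph)

zkc <- make_graph("Zachary")  # assumption: igraph's built-in copy of the network

set.seed(42)
# Random simple, connected graph in which every node keeps its ZKC degree.
config <- sample_degseq(degree(zkc), method = "vl")

all(degree(config) == degree(zkc))  # degrees are preserved node by node
plot(config, vertex.size = 10, vertex.label.cex = 0.7)
```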

Is the configuration model random network a good representative model of the Zachary’s Karate Club Network?

Let’s see if the configuration model captures the group structure of the network. We are going to perform a permutation test in which we generate 1000 different configuration models (with the same degree sequence as ZKC), and then see where the actual value of the ZKC modularity falls in the distribution of configuration model modularities.

Now let’s plot a histogram of these values, with a vertical line representing the modularity of the ZKC network that we computed earlier. This value is stored in the object ZCCmod.
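The procedure can be sketched on a small labeled network. The example below uses a hypothetical network of two fully connected groups joined by one tie, with invented group labels, because the faction labels are not reproduced here; for ZKC you would substitute the network, its group labels, and ZCCmod. The names obs_mod, null_mod, and p_value are illustrative, and method = "vl" (simple, connected graphs) is an assumption.

```r
library(igraph)

# Hypothetical labeled network: two groups of four joined by one tie.
G_toy <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))
groups <- rep(1:2, each = 4)
obs_mod <- modularity(G_toy, groups)

# Modularity of the same labels on 1000 random graphs with the same degrees.
set.seed(42)
null_mod <- replicate(1000, {
  sim <- sample_degseq(degree(G_toy), method = "vl")
  modularity(sim, groups)
})

# Share of simulated graphs that are at least as modular as the observed one.
p_value <- mean(null_mod >= obs_mod)

hist(null_mod, xlim = range(c(null_mod, obs_mod)),
     main = "Modularity under the degree-preserving null", xlab = "Modularity")
abline(v = obs_mod, col = "red", lwd = 2)
```

A small p_value suggests that the degree sequence alone does not account for the observed group structure.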

[Figure: histogram of configuration model modularities with a vertical line at the ZKC modularity]

We can see from the above that our computed modularity is extremely improbable. No simulations had a modularity that was as high as the one in ZKC. This tells us that the particular degree sequence of ZKC does not capture the community structure. Put otherwise, the configuration model does a bad job reflecting the community structure captured in the ZKC dataset.

7.3.0.3 Stochastic Block Model

Stochastic block models are similar to the Erdos-Renyi random network but add a group structure to the random graph model. We can specify the group sizes and the edge construction probabilities within groups and between groups.
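A minimal sketch with igraph’s sample_sbm() follows. The group sizes and the within-group and between-group probabilities are hypothetical values chosen for illustration, not estimates from the karate club data.

```r
library(igraph)

block_sizes <- c(17, 17)   # hypothetical group sizes

# Hypothetical edge probabilities: diagonal = within group, off-diagonal = between groups.
pref <- matrix(c(0.25, 0.02,
                 0.02, 0.25), nrow = 2, byrow = TRUE)

set.seed(42)
sbm <- sample_sbm(sum(block_sizes), pref.matrix = pref, block.sizes = block_sizes)

# Nodes are assigned to blocks in order, so the labels can be rebuilt directly.
V(sbm)$group <- rep(1:2, times = block_sizes)

modularity(sbm, V(sbm)$group)
plot(sbm, vertex.color = V(sbm)$group, vertex.size = 10, vertex.label = NA)
```

Raising the off-diagonal probabilities toward the diagonal ones makes the model behave more like an Erdos-Renyi network and lowers the modularity.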

Is the stochastic block model a good representative model of the Zachary’s Karate Club network?

7.4 Advanced Case Study

See this link ( https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01742/ ) to access a paper by Jones, Mair, & McNally (2018), all professors at Harvard University in the Department of Psychology who discuss visualizing psychological networks in R.

See this link ( https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01742/full#supplementary-material ) to access all supplementary material, including the relevant datasets needed for the code below.

Read the paper and run the code alongside the narrative to get the most out of this case study. For a brief overview of the paper see this abstract:

“Networks have emerged as a popular method for studying mental disorders. Psychopathology networks consist of aspects (e.g., symptoms) of mental disorders (nodes) and the connections between those aspects (edges). Unfortunately, the visual presentation of networks can occasionally be misleading. For instance, researchers may be tempted to conclude that nodes that appear close together are highly related, and that nodes that are far apart are less related. Yet this is not always the case. In networks plotted with force-directed algorithms, the most popular approach, the spatial arrangement of nodes is not easily interpretable. However, other plotting approaches can render node positioning interpretable. We provide a brief tutorial on several methods including multidimensional scaling, principal components plotting, and eigenmodel networks. We compare the strengths and weaknesses of each method, noting how to properly interpret each type of plotting approach.”

7.5 Datasets for Network Analysis

There is a package called “igraphdata” that contains many network datasets. Additionally, there are several more datasets at “The Colorado Index of Complex Networks (ICON)”. Here is the link: https://icon.colorado.edu/#!/

In this chapter we introduced network analysis concepts and methods. To make sure you understand this material, there is a practice assessment to go along with this chapter at https://jayholster1.shinyapps.io/NetworksinRAssessment/

7.7 References

Bojanowski, M. (2015). intergraph: Coercion routines for network data objects. R package version 2.0-2. http://mbojan.github.io/intergraph

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org

Jones, P. J., Mair, P., & McNally, R. J. (2018). Visualizing psychological networks: A tutorial in R. Frontiers in Psychology, 9, 1742. https://doi.org/10.3389/fpsyg.2018.01742

Paranyushkin, D. (2019). InfraNodus: Generating insight using text network analysis. In The World Wide Web Conference (WWW ’19). Association for Computing Machinery, New York, NY, USA, 3584–3589. https://doi.org/10.1145/3308558.3314123

Tyner, S., Briatte, F., & Hofmann, H. (2017). Network visualization with ggplot2. The R Journal, 9(1), 27–59. https://briatte.github.io/ggnetwork/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4 (43), 1686. https://doi.org/10.21105/joss.01686 .

7.7.1 R Short Course Series

Video lectures of each guidebook chapter can be found at https://osf.io/6jb9t/. For this chapter, follow the folder path Network Analysis in R -> AY 2021-2022 Spring to access the video files, R Markdown documents, and other materials for each short course.

7.7.2 Acknowledgements

This guidebook was created with support from the Center for Research Data and Digital Scholarship and the Laboratory for Interdisciplinary Statistical Analysis at the University of Colorado Boulder, as well as the U.S. Agency for International Development under cooperative agreement #7200AA18CA00022. Individuals who contributed to materials related to this project include Jacob Holster, Eric Vance, Michael Ramsey, Nicholas Varberg, and Nickoal Eichmann-Kalwara.

Visible Network Labs

Social Network Analysis 101: Ultimate Guide

Comprehensive introduction for beginners.

Social network analysis is a powerful tool for visualizing, understanding, and harnessing the power of networks and relationships. At Visible Network Labs, we use our network science and mapping tools and expertise to track collaborative ecosystems and strengthen systems change initiatives. In this Comprehensive Guide, we’ll introduce key principles, theories, terms, and tools for practitioners framed around social impact, systems change, and community health improvement. Let’s dig in!

Learn more and get started with the tools below in our complete Guide.

Introduction

Let’s start by reviewing the basics, like a definition, why SNA is important, and the history of the practice. If you want a quick intro to this methodology, download our Social Network Analysis Brief.

Definition of Social Network Analysis (SNA)

Social Network Analysis, or SNA, is a research method used to visualize and analyze relationships and connections between entities or individuals within a network. Imagine mapping the relationships between different departments in a corporation. The outcome would be a vivid picture of how each department interacts with others, allowing us to see communication patterns, influential entities, and bottlenecks.

The Importance of SNA

SNA is a powerful tool. It allows us to explore the underlying structure of an organization or network, identifying the formal and informal relationships that drive the formal processes and outcomes. This insight can enable better communication, facilitate change management, and inspire more efficient collaboration.

This methodology also helps demonstrate the impact of relationship-building and systems change efforts by documenting the changes in the quality and quantity of relationships before and after the initiative. The maps and visualizations produced by SNA are an engaging way to share your progress and impact with stakeholders, donors, and the community at large.

Brief Historical Overview of SNA

The concept of SNA emerged in the 1930s within the field of sociology. Its roots, however, trace back to graph theory in mathematics. It was not until the advent of computers and digital data in the 1980s and 1990s that SNA became widely used, revealing new insights about organizational dynamics, community structures, and social phenomena.

While it originated as an academic research tool, it is increasingly used to inform real-world practice. Today, it is used in a broad variety of industries, fields, and sectors, including business, web development, public health, foundations and philanthropy, telecommunications, law enforcement, academia, and systems change initiatives, to name a few.

Fundamentals of SNA

SNA is a broad topic, but these are some of the essential terms, concepts, and theories you need to know to understand how it works.

Nodes and Edges

In SNA, nodes represent individuals or entities while edges symbolize the relationships between them. For example, in an inter-organizational network, nodes might be companies, and edges could represent communication, collaboration, or competition.

Network Types

Different types of networks serve different purposes. ‘Ego Networks’ focus on one node and its direct connections, revealing its immediate network. ‘Whole Networks’, on the other hand, capture a broader picture, encompassing an entire organization or system. Open networks are loosely connected, with many opportunities to build new connections, ideal for innovation and idea generation – while closed networks are densely interconnected, better for refining ideas amongst a group who all know each other.

Network Properties

Properties such as density (the proportion of potential connections that are actual connections), diameter (the longest distance between two nodes), and centrality (the importance of a node within the network) allow us to understand the network’s structure and function. Metrics also can measure relationship quality across the network, like our validated trust and value scores.
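As an illustration of these properties, the R sketch below computes density, diameter, and degree centrality with the igraph package on a hypothetical network of two fully connected groups of four joined by one tie. The network and object names are invented for this example.

```r
library(igraph)

# Hypothetical network: two fully connected groups of four joined by one tie.
g <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))

edge_density(g)  # actual ties divided by possible ties (13 of 28 here)
diameter(g)      # longest shortest path between any two nodes
degree(g)        # a simple centrality measure: ties per node
```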

Dyadic and Triadic Relationships

Dyadic relationships involve two nodes, like a partnership between two companies. Triadic relationships, involving three nodes, are more complex but can offer richer insights. For instance, it might show how a third company influences the relationship between two others, or which members of your network are the best at building new relationships between their peers.

Homophily and Heterophily

Homophily refers to the tendency of similar nodes to connect, while heterophily is the opposite. In a business context, we might see homophily between companies in the same industry and heterophily when seeking diversity in a supply chain. Many networks aim to be diverse but get stuck talking to the same, similar partners. These network concepts underlie many strategies that promote network innovation and avoid groupthink among like-minded partners.
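One way to put a number on homophily is an assortativity coefficient, which is positive when ties mostly join nodes of the same type and negative when they mostly join different types. The R sketch below uses igraph’s assortativity_nominal() on a hypothetical network with an invented sector attribute.

```r
library(igraph)

# Hypothetical network: two fully connected groups of four joined by one tie.
g <- add_edges(make_full_graph(4) %du% make_full_graph(4), c(4, 5))

# Hypothetical attribute: the first four nodes are in sector 1, the rest in sector 2.
sector <- rep(1:2, each = 4)

# Close to 1 = ties mostly within a sector (homophily);
# below 0 = ties mostly across sectors (heterophily).
r_sector <- assortativity_nominal(g, sector, directed = FALSE)
r_sector
```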

Network Topologies

Lastly, the layout or pattern of a network, its topology, can reveal much about its function. For instance, a centralized topology, where one node is connected to all others, may indicate a hierarchical organization, while a decentralized topology suggests a more collaborative and flexible environment. This is also referred to as the structure of the network.

Theoretical Background of SNA

Many different theories have developed to explain how certain network properties, like their topology, centrality, or type, lead to different outcomes. Here are several key theories relevant to SNA.

Strength of Weak Ties Theory

This theory postulates that weak ties or connections often provide more novel information and resources compared to strong ties. These “weak” relationships, which may seem less important, can serve as important bridges between different clusters within a network.

Structural Hole Theory

This theory posits that individuals who span the structural holes, or gaps, in a network (acting as a bridge between different groups) hold a strategic advantage. They can control and manipulate information and resources flowing between the groups, making their position more influential.

Small World Network Theory

This theory emphasizes the interconnectedness of nodes within a network. It suggests that most nodes can be reached from any other node through a relatively short path of connections. This property leads to the famous phenomenon of “six degrees of separation,” indicating efficient information transfer and connectivity in a network.
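The effect of a few long-range ties on path length can be sketched in R with igraph’s sample_smallworld(), which starts from a ring where each node is tied to its nearest neighbors and then rewires a share of the ties at random. The sizes and rewiring probability below are arbitrary illustrative values.

```r
library(igraph)

set.seed(42)

# Ring of 100 nodes, each tied to its 3 nearest neighbors on either side.
ring <- sample_smallworld(dim = 1, size = 100, nei = 3, p = 0)

# Same ring, but each tie is rewired to a random node with probability 0.05.
sw <- sample_smallworld(dim = 1, size = 100, nei = 3, p = 0.05)

# Average number of steps between pairs of nodes in each network.
mean_distance(ring)
mean_distance(sw)
```

Comparing the two averages shows how a handful of shortcut ties changes the typical number of steps between nodes.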

Barabási–Albert (Scale-Free Network) Model

This model suggests that networks evolve over time through the process of preferential attachment, where new nodes are more likely to connect to already well-connected nodes. This results in “scale-free” networks, where a few nodes (“hubs”) have many connections while the majority of nodes have few.
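Preferential attachment can be simulated in R with igraph’s sample_pa(). In the sketch below the network size and the number of ties added per new node are arbitrary illustrative values.

```r
library(igraph)

set.seed(42)

# Grow a 200-node network; each new node attaches to up to 2 existing nodes,
# favoring those that already have many ties.
ba <- sample_pa(n = 200, m = 2, directed = FALSE)

deg <- degree(ba)
summary(deg)                          # most nodes have few ties
head(sort(deg, decreasing = TRUE))    # a few hubs have many
```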

Data Collection and Preparation

Every network mapping project begins by collecting and preparing data for analysis. These data vary widely, but at a basic level they must include data on nodes (the entities in the network) and data on edges (the lines between nodes representing a relationship or connection). Additional data on the attributes of the nodes or edges add more levels of analysis and insight but are not strictly necessary.

Primary Methods for Collecting SNA Data

Primary data collection can be as simple as conducting interviews or surveys within an organization. The more complex the network, the more difficult it is to collect good primary data: if you have more than 5-10 partners, interviews and surveys are hard to conduct by hand.

Network survey tools like PARTNER collect relational data by asking respondents who they are connected to, and then asking them about aspects of their relationships to provide trust, value, and network structure scores. This is impossible to do using most survey software like Google Forms without hours of cleaning by hand.

Response rates are an important consideration if using surveys for data collection. Unlike a typical survey where a small sample is representative, a network survey requires a high response rate – 80% and above are considered the gold standard.

In an inter-organizational context where surveys are impossible, or you cannot achieve a valid response rate, one might gather data through business reports, contracts, or publicly available data on partnerships and affiliations. For example, you could visit an organization’s website to note who they list as a partner – and do the same for others – to generate a basic SNA map.

Secondary Sources of SNA Data

Secondary sources include data that was already collected but can be used again, often to complement your use of primary data you collect yourself. This might include academic databases, industry reports, or social media data. It’s important to ensure the accuracy and reliability of these sources.

You can also conduct interviews or focus groups with network members to add a qualitative perspective to your results. These mixed-method SNA projects provide a great deal more depth to their network maps through their conversations with numerous network representatives to explore deeper themes and perspectives.

Ethical Considerations in Data Collection

When collecting data, it’s crucial to ensure privacy, obtain necessary permissions, and anonymize data where necessary. Respecting these ethical boundaries is critical for maintaining trust and integrity in your work.

Consider also how your SNA results will be used. For example, network analysis can help assess how isolated an individual is to target them for interventions. Still, it could also be abused by insurance companies to charge these individuals a higher rate (loneliness increases your risk of death).

Lastly, consider ways to involve the communities with stake in your SNA using approaches like community-based participatory research. Bring in representatives from target populations to help co-design your initiative or innovation as partners, rather than patients or research subjects.

Preparing Data for Analysis

Data needs to be formatted correctly for analysis, often as adjacency matrices or edgelists. Depending on the size and complexity of your network, this can be a complex process but is crucial for meaningful analysis.

If you are new to SNA, you can start by laying out your data in tables. For example, the table below shows a relational data set for a set of partners within a public health coalition. The first column shows the survey respondent (Partner 1), the second shows who they reported as a partner, the third shows their reported level of trust, and the fourth their reported level of collaboration intensity. This is just one of many ways to lay out and organize network data.
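A table laid out this way can be turned into a network object and an adjacency matrix in R with the igraph package. The partners and scores below are invented for illustration, and the column names are assumptions.

```r
library(igraph)

# Hypothetical survey responses: who named whom, with trust and collaboration scores.
responses <- data.frame(
  respondent    = c("Partner 1", "Partner 1", "Partner 2", "Partner 3"),
  partner       = c("Partner 2", "Partner 3", "Partner 3", "Partner 4"),
  trust         = c(4, 3, 5, 2),
  collaboration = c(3, 2, 4, 1)
)

# The first two columns define the ties; remaining columns become tie attributes.
net <- graph_from_data_frame(responses, directed = TRUE)

# Adjacency matrix: rows are respondents, columns are the partners they named.
adj <- as_adjacency_matrix(net, sparse = FALSE)
adj

E(net)$trust  # trust score attached to each tie
```

The edge list (the table) and the adjacency matrix hold the same ties; the edge list is usually easier to type and to extend with attributes.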

Depending on which analysis tool you choose, a varying degree of data preparation and cleaning will be required. Usually, free tools require the most work, while software with subscriptions does a lot of it for you.

Network Analysis Methods & Techniques

There are many ways to analyze a network or set of entities using SNA. Here are some basic and advanced techniques, along with info on network visualization – a major component and common output of SNA projects.

Basic Technique: Network Centrality

One of the most common ways to analyze a network is to look at the centrality of various nodes to identify key players, information hubs, and gatekeepers across the network. Three common types of centrality (degree, betweenness, and closeness) each measure a different aspect of a node’s importance.

Degree Centrality

Degree centrality can be used to identify the most connected actors in the network. These actors are considered “popular” or “active” and they often have a strong influence within the network due to their numerous direct connections. In a coalition or network, these nodes could be the organizations or individuals that are most active in participating or the most engaged in the network activities. They may be the ‘go-to’ people for information or resources and have a significant impact on shaping the group’s agenda.

Betweenness Centrality

Betweenness centrality is useful for identifying the “brokers” or “gatekeepers” in the network. These actors have a unique position where they connect different parts of the network, facilitating or controlling the flow of information between others. In a coalition context, these could be the organizations or individuals who have influence over how information, resources, or support flow within the network, by virtue of their position between other key actors. These actors could play crucial roles in collaboration, negotiation, and conflict resolution within the network.

Closeness Centrality

Closeness centrality is a measure of how quickly a node can reach every other node in the network via the shortest paths. In a coalition, these nodes can disseminate information or exert influence quickly due to their close proximity to all other nodes. These ‘efficient connectors’ are beneficial for the rapid spread of information, resources, or innovations across the network. They could play a vital role during times of rapid change or when swift collective action is required.

Advanced Techniques: Clusters and Equivalence

Clustering Coefficients

The Clustering Coefficient provides insights into the “cliquishness” or local cohesion of the network around specific nodes. In a coalition or inter-organizational network, a high clustering coefficient may indicate that a node’s connections are also directly connected to each other, forming tight-knit groups or sub-communities within the larger network. These groups often share common interests or objectives, and they might collaborate or share resources more intensively. Understanding these clusters can be crucial for coalition management as it can highlight potential subgroups that may need to be engaged differently, or that might possess different levels of influence or commitment to the coalition’s overarching goals.
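In R, igraph’s transitivity() computes both the local clustering coefficient of each node and a global value for the whole network. The sketch below uses a hypothetical four-node network: a triangle with one extra node attached.

```r
library(igraph)

# Hypothetical network: nodes 1, 2, 3 form a triangle; node 4 is tied only to node 3.
g <- make_graph(c(1,2, 1,3, 2,3, 3,4), directed = FALSE)

# Local coefficient: share of a node's neighbor pairs that are themselves tied.
# Nodes with fewer than two neighbors have no pairs, so the result is NaN.
local_cc <- transitivity(g, type = "local")
local_cc

# Global coefficient: share of connected triples that close into triangles.
global_cc <- transitivity(g, type = "global")
global_cc
```

Node 3 scores lower than nodes 1 and 2 because its extra contact (node 4) is not tied to its other neighbors.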

Structural Equivalence

Structural Equivalence is used to identify nodes that have similar patterns of connections, even if they do not share a direct link. In a coalition context, structurally equivalent organizations or individuals often occupy similar roles or positions within the network, and thus may have similar interests, influence, or responsibilities. They may be competing or collaborating entities within the same sectors or areas of work. Understanding structural equivalence can provide insights into the dynamics of the network, such as potential redundancies, competition, or opportunities for collaboration. It can also reveal how changes in one part of the network may impact other, structurally equivalent parts of the network.
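Exact structural equivalence is rare in real data, so analysts often measure how similar two nodes’ sets of contacts are. The R sketch below uses Jaccard similarity from igraph’s similarity() as one such approximation (a choice made for this example, not the only option) on a hypothetical five-node network.

```r
library(igraph)

# Hypothetical network: A and B are both tied to C and D but not to each other;
# E is tied only to D.
g <- graph_from_literal(A-C, A-D, B-C, B-D, D-E)

# Jaccard similarity: shared contacts divided by all contacts of either node.
jac <- similarity(g, method = "jaccard")
dimnames(jac) <- list(V(g)$name, V(g)$name)
round(jac, 2)
```

A and B have identical sets of contacts, so their similarity is 1 even though they share no direct tie, while A and E overlap on only one of A’s two contacts.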

Visualizing Networks

Network visualization is a key tool in Social Network Analysis (SNA) that allows researchers and stakeholders to see the ‘big picture’ of the network structure, as well as discern patterns and details that may not be immediately evident from numerical data. Here are some key aspects and benefits of network visualization in the context of a coalition or inter-organizational network:

Overview of Network Structure: Visualizations provide a snapshot of the entire network structure, including nodes (individuals or organizations) and edges (relationships or interactions). This helps to comprehend the overall size, density, and complexity of the network. Seeing these relationships mapped out can often make the network’s structure more tangible and easier to understand.

Identification of Key Actors: Centrality measures can be represented visually, making it easier to identify key actors or organizations within the network. High degree nodes, gatekeepers, and efficient connectors will stand out visually, which can assist in identifying who holds influence or power within the network.

Detecting Subgroups and Communities: Visualization can also highlight clusters or subgroups within the network. These might be based on shared interests, common goals, or frequent interaction. Understanding these subgroups is crucial for coalition management and strategic planning, as different groups might have unique needs, concerns, or levels of engagement.

Identifying Outliers and Peripheral Nodes: Network visualizations can also help in identifying outliers or peripheral nodes – those who are less engaged or connected within the network. These actors might represent opportunities for further engagement or potential risks for network cohesion.

Highlighting Network Dynamics: Visualizations can be used to show changes in the network over time, such as the formation or dissolution of ties, the entry or exit of nodes, or changes in nodes’ centrality. These dynamics can provide valuable insights into the evolution of the coalition or network and the impact of various interventions or events.
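Before drawing a map, it can help to compute the numbers a visualization typically encodes. The sketch below, using only the standard library, calculates network density and degree centrality and flags peripheral nodes; the `coalition` graph and the node-size formula are hypothetical choices for illustration, not settings from any particular tool.

```python
def density(graph):
    """Share of all possible undirected ties that are present."""
    n = len(graph)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2  # each tie is stored twice
    return edges / (n * (n - 1) / 2)

def degree_centrality(graph):
    """Number of ties per node, divided by the maximum possible (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

# Hypothetical undirected coalition stored as neighbor sets.
coalition = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

network_density = density(coalition)
degree = degree_centrality(coalition)
# Illustrative mapping from centrality to a drawing size (arbitrary scale).
node_sizes = {node: 10 + 40 * score for node, score in degree.items()}
# Peripheral nodes: those with only one tie.
peripheral = sorted(node for node, nbrs in coalition.items() if len(nbrs) == 1)
```

Scaling node size by a centrality score is one way key actors "stand out visually", and the single-tie filter gives a starting list of peripheral members to review.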

Software and Tools for SNA

SNA software helps you collect, clean, analyze, and visualize network data, simplifying the process of analyzing social networks. Some tools are free but offer limited functionality and support, while others require a subscription but are easier to use and come with support. Here are some popular tools used across many applications.

Introduction to Popular SNA Tools

Tools like UCINet, Gephi, and Pajek are popular for SNA. They offer a variety of functions for analyzing and visualizing networks, accommodating users of varying skill levels. Here are ten tools for use in different contexts and applications.

  • UCINet: A comprehensive software package for the analysis of social network data as well as other 1-mode and 2-mode data.
  • NetDraw: A tool usually used in tandem with UCINet to visualize networks.
  • Gephi: An open-source network analysis and visualization software package written in Java.
  • NodeXL: A free and open-source network analysis and visualization software package for Microsoft Excel.
  • Kumu: A powerful visualization platform for mapping systems and better understanding relationships.
  • Pajek: Software for the analysis and visualization of networks; it is particularly good at handling large network datasets.
  • SocNetV (Social Networks Visualizer): A user-friendly, free and open-source tool.
  • Cytoscape: A bioinformatics software platform for visualizing molecular interaction networks.
  • Graph-tool: An efficient Python module for manipulation and statistical analysis of graphs.
  • Polinode: Tools for network analysis, both for analyzing your own network data and for collecting new network data.

Choosing the Right Tool for Your Analysis:

The right tool depends on your needs. For beginners, a user-friendly interface might be a priority, while experienced analysts may prefer more advanced functions. The size and complexity of your network, as well as your budget, are also important considerations.

PARTNER CPRM: A Community Partner Relationship Management System for Network Mapping


For example, we created PARTNER CPRM, a Community Partner Relationship Management System, to replace the CRMs used by most organizations to manage their relationships with their network of strategic partners. It combines data collection, analysis, and visualization features with CRM tools such as contact management and email tracking, resulting in a powerful and easy-to-use network mapping tool.

SNA Case Studies

Looking for a real-world example of a social network analysis project? Here are three examples from recent projects here at Visible Network Labs.

Case Study 1: Leveraging SNA for Program Evaluation

SNA is increasingly becoming a vital tool for program evaluation across various sectors including public health, psychology, early childhood, education, and philanthropy. Its potency is particularly pronounced in initiatives centered around network-building.

Take for instance the Networks for School Improvement Portfolio by the Gates Foundation. The Foundation employed PARTNER, an SNA tool, to assess the growth and development of their educator communities over time. The SNA revealed robust networks that offer valuable benefits to members by fostering information exchange and relationship development. By repeating the SNA process at different stages, they could verify their ongoing success and evaluate the effectiveness of their actions and adjustments.


Case Study 2: Empowering Coalition-building

In the realm of policy change, building a coalition of partners who share a common goal can be pivotal in overturning the status quo. SNA serves as a strategic tool for developing a coalition structure and optimizing pre-existing relationships among the members.

The Fix CRUS Coalition in Colorado, formed in response to the closure of five major peaks to public access, is a prime example of this. With the aim of strengthening state liability protections for landowners, the coalition employed PARTNER to evaluate their network and identify key players. Their future plans involve mapping connections to important legislators as their bill progresses through the state legislature. Additionally, their network maps and reports will prove instrumental in acquiring grants and funding.

Case Study 3: Boosting Employee Engagement

In the private sector, businesses are increasingly harnessing SNA to optimize their employee networks, both formal and informal, with the goal of enhancing engagement, productivity, and morale.

Consider the case of Acuity Insurance. In response to a transition to a hybrid model amid the COVID-19 pandemic, the company started using PARTNER to gather network data from their employees. Their aim was to maintain their organizational culture and keep employee engagement intact despite the model change. Their ongoing SNA will reveal the level of connectedness within their team, identify employees who are over-networked (and hence at risk of burnout), and pinpoint those who are under-networked and could be missing crucial information or opportunities.


Challenges and Future Directions in Network Analysis

Like all fields and practices, social network analysis faces certain limitations. Practitioners are constantly innovating to find better ways to conduct projects. Here are some barriers in the field and current trends and predictions about the future of SNA.

The Limitations of SNA

SNA is a powerful tool, but it’s not without limitations. It can be time-consuming and complex, particularly with larger networks. Accurate results depend on high response rates, which makes data collection more difficult and time-consuming. SNA also requires quality, validated data, and the interpretation of results can be subjective. Software that helps to address these problems requires a significant investment, but the results are often worth it.

Lastly, SNA is a skill that takes time and effort to learn. If you do not have someone in-house with network analysis skills, you may need to hire someone to carry out the analysis or spend time training an employee to build the capacity internally.

Current Trends and Future Predictions

One emerging trend is the increased application of SNA in mapping inter-organizational networks such as strategic partnerships, community health ecosystems, or policy change coalitions. Organizations are realizing the power of these networks and using SNA to navigate them more strategically. With SNA, they can identify key players, assess the strength of relationships, and strategize on how to optimize their network for maximum benefit.

In line with the rise of data science, another trend is the integration of advanced analytics and machine learning with SNA. This fusion allows for the prediction of network behaviors, identification of influential nodes, and discovery of previously unnoticed patterns, significantly boosting the value derived from network data.

The future of SNA is likely to see a greater emphasis on dynamic networks – those that change and evolve over time. With increasingly sophisticated tools and methods, analysts will be better equipped to track network changes and adapt strategies accordingly.

In addition, there is a growing focus on inter-organizational network resilience. As global challenges such as pandemics and climate change underscore the need for collaborative solutions, understanding how these networks can withstand shocks and adapt becomes crucial. SNA will play an instrumental role in identifying weak spots and strengthening the resilience of these networks.

Conclusion: Social Network Analysis 101

SNA offers a unique way to visualize and analyze relationships within a network, be it within an organization or between organizations. It provides valuable insights that can enhance communication, improve efficiency, and inform strategic decisions.

This guide provides an overview of SNA, but there is much more to learn. Whether you’re interested in the theoretical underpinnings, advanced techniques, or the latest developments, we encourage you to delve deeper into this fascinating field.

Resources and Further Reading

For those who want to build more SNA skills and learn more about network science, check out these recommendations for further reading and exploration from the Visible Network Labs team of network science experts.

Recommended Books on SNA

  • “Network Science” by Albert-László Barabási – A comprehensive introduction to the theory and applications of network science from a leading expert in the field.
  • “Analyzing Social Networks” by Steve Borgatti, Martin Everett, and Jeffrey Johnson – An accessible introduction, complete with software instructions for carrying out analyses.
  • “Social Network Analysis: Methods and Applications” by Stanley Wasserman and Katherine Faust – A more advanced, methodological book for those interested in a deep dive into the methods of SNA.
  • “Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives” by Nicholas Christakis and James Fowler – An engaging exploration of how social networks influence everything from our health to our political views.
  • “The Network Imperative: How to Survive and Grow in the Age of Digital Business Models” by Barry Libert, Megan Beck, and Jerry Wind – An excellent book for those interested in applying network science in a business context.
  • “Networks, Crowds, and Markets: Reasoning About a Highly Connected World” by David Easley and Jon Kleinberg – An interdisciplinary approach to understanding networks in social and economic systems. This book combines graph theory, game theory, and market models.

Online Resources and Courses

Here are some online learning opportunities, including online courses, communities, resources hubs, and other places to learn about social network analysis.

  • Social Network Analysis by Lada Adamic from the University of Michigan
  • Social and Economic Networks: Models and Analysis by Matthew O. Jackson from Stanford University
  • Introduction to Social Network Analysis by Dr. Jennifer Golbeck from the University of Maryland, College Park
  • Statistics.com: Statistics.com offers a free online course called Introduction to SNA taught by Dr. Jennifer Golbeck.
  • The Social Network Analysis Network: This website provides a directory of resources on network methods, including courses, books, articles, and software.
  • The SNA Society: This organization provides a forum for social network analysts to share ideas and collaborate on research. They also offer a number of resources on their website, including a list of online courses.

Journals and Research Papers on SNA

These are a few of the most influential cornerstone research papers in network science and analysis methods:

  • “The Strength of Weak Ties” by Mark Granovetter (1973)
  • “Structural Holes and Good Ideas” by Ronald Burt (2004)
  • “Collective dynamics of ‘small-world’ networks” by Duncan Watts & Steven Strogatz (1998)
  • “The structure and function of complex networks” by M. E. J. Newman (2003)
  • “Emergence of scaling in random networks” by Albert-László Barabási & Réka Albert (1999)

Check out these peer-reviewed journals for lots of network science content and information:

  • Social Networks: This is an interdisciplinary and international quarterly journal dedicated to the development and application of network analysis.
  • Network Science: A cross-disciplinary journal providing a unified platform for both theorists and practitioners working on network-centric problems.
  • Journal of Social Structure (JoSS): An electronic journal dedicated to the publication of network analysis research and theory.
  • Connections: Published by the International Network for Social Network Analysis (INSNA), this journal covers a wide range of social network topics.
  • Journal of Complex Networks: This journal covers theoretical and computational aspects of complex networks across diverse fields, including sociology.

Frequently Asked Questions about SNA

A: SNA is a research method used to visualize and analyze relationships and connections within a network. In an organizational context, SNA can be used to explore the structure and dynamics of an organization, such as the informal connections that drive formal processes. It can reveal patterns of communication, identify influential entities, and detect potential bottlenecks or gaps.

A: The primary purpose of SNA is to uncover and visualize the relationships between entities within a network. By doing so, it allows us to understand the network’s structure and dynamics. This insight can inform strategic decision-making, facilitate change management, and enhance overall efficiency within an organization.

A: SNA allows researchers to examine the relationships between entities, the overall structure of the network, and the roles and importance of individual entities within it. This can involve studying patterns of communication, collaboration, competition, or any other type of relationship that exists within the network.

A: SNA has a wide range of applications across various fields. In business, it’s used to analyze organizational structures, supply chains, and market dynamics. In public health, it can map the spread of diseases. In sociology and anthropology, SNA is used to study social structures and relationships. Online, SNA is used to study social media dynamics and digital marketing strategies.

A: Key concepts in SNA include nodes (entities) and edges (relationships), network properties like density and centrality, and theories such as the Strength of Weak Ties and Structural Hole Theory. It also encompasses concepts like homophily and heterophily, which describe the tendency for similar or dissimilar nodes to connect.

A: An example of SNA could be a study of communication within a corporation. By treating departments as nodes and communication channels as edges, analysts could visualize the communication network, identify key players, detect potential bottlenecks, and suggest improvements.

A: Social Network Analysis refers to the method of studying the relationships and interactions between entities within a network. It involves mapping out these relationships and applying various analytical techniques to understand the structure, dynamics, and implications of the network.

A: In psychology, SNA can be used to study the social relationships between individuals or groups. It might be used to understand the spread of information, the formation of social groups, the dynamics of social influence, or the impact of social networks on individual behavior and well-being.

A: SNA can be conducted at different levels, depending on the focus of the study. The individual level focuses on a single node and its direct connections (ego networks). The dyadic level looks at the relationship between pairs of nodes, while the triadic level involves three nodes. The global level (whole network) considers the entire network.

A: There are several types of networks in SNA, including ego networks (focused on a single node), dyadic and triadic networks (focused on pairs or trios of nodes), and whole networks. Networks can also be categorized by their structure (like centralized or decentralized), by the type of relationships they represent, or by their application domain (such as organizational, social, or online networks).

A: SNA is used to visualize and analyze the relationships within a network. Its insights can inform strategic decisions, identify influential entities, detect potential weaknesses or vulnerabilities, and enhance the efficiency of communication or processes within an organization or system. It’s also an essential tool for research in fields like sociology, anthropology, business, public health, and digital marketing.
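The ego-network level of analysis described in the answers above can be sketched in a few lines: take one focal node, its direct connections, and the ties among them. This is a minimal illustration using the standard library; the `coalition` graph is hypothetical and assumed to be undirected.

```python
def ego_network(graph, ego):
    """Subgraph made of `ego`, its direct neighbors, and the ties among them."""
    members = {ego} | graph[ego]
    return {node: graph[node] & members for node in members}

# Hypothetical undirected network stored as neighbor sets.
coalition = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

ego_d = ego_network(coalition, "D")
```

Here the ego network of D contains only D, A, and E; A's other partners (B and C) are dropped because they are not directly tied to D.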


Using Social Network Analysis to Assess Collaborative Networks: A Case Study from the Genebank Platform Evaluation

What is Social Network Analysis (SNA)?

Social Network Analysis (SNA) has been an established method in sociology since the early 20th century and has gained prominence in recent decades due to technological advances. It is versatile and can be applied in a wide range of fields—including economics, biology, medicine, communications, and more—by identifying key actors within a social framework and decoding their interconnections. The approach offers a systematic methodology that employs graph theory to visually represent the structure of connections among entities.

In impact evaluations, SNA can usefully quantify the collaborative efforts of various stakeholders to achieve shared objectives. It can enhance evaluation processes by capturing and visualizing relationship nuances in interventions or programs, revealing insights into collaboration dynamics, identifying strong connections, and pinpointing gaps where interactions can be improved or established—thus boosting a group's overall effectiveness.

In SNA, networked structures are mapped out, with individual actors—such as people, groups, and organizations—referred to as ‘nodes’, and the relationships or interactions between them considered ‘edges’. These connections can be of any kind, such as family ties, friendships, professional, geographical, institutional, health-related, etc. Several tools can then be employed to study and interpret the mapped structures.

Several CGIAR studies have used SNA as a powerful way to measure the involvement of multiple stakeholders, including an evaluation of the influence and reach of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS), and a systematic portfolio review of the International Maize and Wheat Improvement Center (CIMMYT)’s climate change research portfolio.

This blog shares key learnings from the CGIAR’s recent Genebank Platform evaluation. It demonstrates the as-yet-untapped potential of integrating SNA to generate and visualize evidence based on Quality of Science (QoS) evaluation criteria in process and performance evaluations.

SNA in the CGIAR Genebank Platform evaluation

The 2023 CGIAR Genebank Platform evaluation used SNA to gather evaluative evidence about the Platform (see SNA Report) to support the institutional learning of CGIAR and Crop Trust. SNA helped address key evaluation criteria such as relevance, effectiveness, coherence, and QoS. It also contributed to the assessment of three Genebank Platform modules, on policy, conservation, and use.

During the evaluation, SNA was employed to identify key players and examine their relationships, both within their groups and with other Genebank Platform stakeholders. From the total 186 responses received from an online survey, a refined subset of 122 was selected after data cleaning and processing. These responses were then transformed into matrices, highlighting connections and other network characteristics. The evaluators then used the open-source software program Gephi to depict the network’s spatial layout using force-directed algorithms that brought linked nodes together and separated unrelated ones, thus facilitating the data’s interpretability and clarity. 
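The step of transforming survey responses into matrices can be sketched as follows. This is a minimal illustration, not the evaluators' actual pipeline: the respondent and partner names are hypothetical, and each response is assumed to be a directed (respondent, named partner) pair.

```python
def to_adjacency_matrix(edges):
    """Build a directed 0/1 adjacency matrix from (source, target) pairs."""
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return nodes, matrix

# Hypothetical cleaned responses: each respondent names a partner they work with.
responses = [
    ("CenterA", "PartnerX"),
    ("PartnerX", "CenterA"),
    ("CenterA", "PartnerY"),
    ("CenterB", "PartnerY"),
]

nodes, matrix = to_adjacency_matrix(responses)
```

Row i, column j is 1 when node i named node j. A matrix like this (or the equivalent edge list) is the kind of input that network software then lays out and analyzes.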

The SNA generated four visualizations: a network of professional collaborations, a communication and interaction network encompassing various stakeholders, a leadership network, and a network representing management decision-making and funding needs. Analysis of clusters formed in these networks showed that partners and users interacting through the Genebank Platform were moderately interconnected. 

Key takeaways from using SNA in the Genebank Platform evaluation include: 

  • CGIAR should continue its efforts to strengthen its relationship with National Agricultural Research (and Extension) Systems (NARS) across countries.  
  • Although the professional network of the Genebank Platform includes CGIAR centers and genebanks with important roles in the conservation and policy modules, partners outside the CGIAR system can play a crucial role in meeting the needs of farmers and other user groups.  
  • CGIAR centers have worked hand-in-hand with non-CGIAR partners. The flow of communication between different stakeholders reiterated the significant value of the professional ties these partners developed and positioned the Genebank Platform as a central space where relevant information was exchanged.  
  • Internal CGIAR stakeholders play a crucial role in disseminating information about access to and availability of plant genetic materials and accessions. However, the network of non-CGIAR partners offers an opportunity to extend CGIAR's reach to end users, including farmers and community-based organizations.  
  • Non-CGIAR partners, especially NARS and academic and research institutions, serve as pivotal connectors for region-specific user subgroups, providing access to vital information on plant and crop diversity and conservation. This enables CGIAR to meet its objective of enhancing the reach and timely accessibility of germplasm by the ultimate user groups (i.e. farmers).  
  • Lastly, leveraging the broadcast potential of influential partners ensures the cost-effectiveness of future interventions. Strategies aimed at empowering local and influential partners can motivate them to strengthen their networks and implement projects independently at the ground level, thereby enhancing the overall effectiveness of CGIAR's developmental efforts.  

Figure: Communication Patterns and Interactions Between Nodes or Partners on the Genebank Platform (Source: SNA Report)

Source: Sociographs were created using Gephi (open-source software for SNA). The responses to the ‘partnerships and interactions’ section of the online survey were used to create these maps.

Role and utility of SNA for CGIAR’s organizational learning

In the context of CGIAR's mission to advance global agriculture and food security, SNA emerges as a powerful tool that complements and supplements the data obtained from other quantitative and qualitative tools, thereby contributing to the subsequent set of learnings obtained during any evaluative study. This analytical method delves into the intricacies of network relationships and interactions, answering pivotal questions that can significantly influence CGIAR's strategies and outcomes.

SNA’s ability to identify key interconnected individuals and organizations is particularly relevant for CGIAR, which operates in a complex landscape of stakeholders, including researchers, farmers, governments, and NGOs. Understanding the role of these actors in collaborative networks—as well as the strength of their relationships—can guide CGIAR in forging effective collaborations and partnerships, which are critical for driving agricultural innovation and policy development. By analyzing how strong these connections are and whether they are unidirectional or bidirectional, CGIAR can identify not just the main participants in a network but also those deviating from the norm.
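Whether ties are unidirectional or bidirectional can be summarized with a reciprocity score: the share of directed ties that are returned. The sketch below is a minimal standard-library illustration; the edge list and actor names are hypothetical.

```python
def reciprocity(edges):
    """Share of directed ties whose reverse tie is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for s, t in edge_set if (t, s) in edge_set)
    return mutual / len(edge_set)

def one_way_ties(edges):
    """Directed ties that are not returned, sorted for stable output."""
    edge_set = set(edges)
    return sorted((s, t) for s, t in edge_set if (t, s) not in edge_set)

# Hypothetical directed ties: (who reports the tie, with whom).
ties = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]

share_mutual = reciprocity(ties)
unreturned = one_way_ties(ties)
```

In this toy example half of the ties are reciprocated, and the list of unreturned ties points to relationships that may merit follow-up.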

SNA also provides insights into the dynamics of information diffusion. By enabling the identification of potential gaps in interconnections among actors, the method can point to where CGIAR might strategically direct resources to strengthen those relationships. Such insights are invaluable in designing more inclusive and effective strategies that take into account the diverse range of stakeholders in the agricultural sector.

As SNA centers on actors and their relationships, rather than traditional indicators such as scientific outputs or outcome statements, results from this approach uncover insights that cannot be derived from evaluations that focus only on implementation, monitoring, and reporting mechanisms. SNA provides a methodological framework to understand and optimize the dynamics of the complex networks that CGIAR programs and initiatives operate within—thereby supporting its mission to transform global food and agriculture systems.

  • Report: Social Network Analysis: Evaluation of the CGIAR Genebank Platform
  • Blog: 'Alone We Can do so Little; Together We Can do so Much'


Encyclopedia of Social Network Analysis and Mining, pp 856–861

Fraud Detection Using Social Network Analysis: A Case Study

  • Duen Horng Chau and Christos Faloutsos
  • Reference work entry
  • First Online: 01 January 2018

Auction fraud; Belief propagation; Guilt by association; Money laundering; Pattern mining; Reputation systems; Subgraph matching; User-centered pattern detection

A collection of entities (e.g., people, bank accounts) and the edges connecting them; each edge represents an interaction (e.g., friendship, bank transaction)

A specific type of network where the edges represent social interactions between entities that are people or organizations

The study of networks of social relationships, typically to extract useful information, such as patterns and anomalies

An inference algorithm that finds the marginal distribution of every unobserved variable, conditioned on all observed ones, in a probabilistic graphical model
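The inference algorithm defined above can be illustrated with a small sum-product sketch on a tree-shaped network. This is not the authors' system: the three accounts, the prior probabilities, and the edge potential `psi` (which encodes a "guilt by association" assumption that linked accounts tend to share a label) are all hypothetical values chosen for the example. State 0 stands for honest and state 1 for fraudulent.

```python
def belief_propagation(neighbors, prior, psi):
    """Exact sum-product marginals for a tree-shaped pairwise model."""
    states = range(len(psi))
    cache = {}

    def message(src, dst):
        # Message src -> dst: sum out src's state, combining its prior,
        # the edge potential, and messages from src's other neighbors.
        if (src, dst) not in cache:
            msg = []
            for x_dst in states:
                total = 0.0
                for x_src in states:
                    term = prior[src][x_src] * psi[x_src][x_dst]
                    for other in neighbors[src]:
                        if other != dst:
                            term *= message(other, src)[x_src]
                    total += term
                msg.append(total)
            norm = sum(msg)
            cache[(src, dst)] = [m / norm for m in msg]
        return cache[(src, dst)]

    beliefs = {}
    for node in neighbors:
        belief = [prior[node][x] for x in states]
        for nbr in neighbors[node]:
            incoming = message(nbr, node)
            belief = [belief[x] * incoming[x] for x in states]
        norm = sum(belief)
        beliefs[node] = [b / norm for b in belief]
    return beliefs

# Hypothetical chain of accounts: A - B - C.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
# Prior [P(honest), P(fraud)]: A is strongly suspected, B and C are unknown.
prior = {"A": [0.1, 0.9], "B": [0.5, 0.5], "C": [0.5, 0.5]}
# Symmetric edge potential: linked accounts are more likely to share a label.
psi = [[0.8, 0.2], [0.2, 0.8]]

beliefs = belief_propagation(neighbors, prior, psi)
```

Suspicion about A propagates along the chain and weakens with distance: B ends up with a higher fraud belief than C. On graphs with cycles the recursion above would not terminate, so loopy networks need an iterative variant.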

Committing fraud means deceiving someone for financial or personal gain. Frauds happen online (e.g., electronic auction) and off-line (e.g., check fraud)....


Acknowledgments

Duen Horng (Polo) Chau was supported by Symantec Research Labs Fellowship. This material is based upon the work supported by the National Science Foundation: IIS-0705359, IIS-0326322, CNS-0721736, and IIS-0534205; the Lawrence Livermore National Laboratory: DE-AC52-07NA27344; an IBM Faculty Award; and a Yahoo Research Alliance Gift, with additional funding from Intel, NTT, and Hewlett-Packard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or other parties.

Author information

Authors and affiliations.

College of Computing, Georgia Institute of Technology, Atlanta, GA, USA

Duen Horng Chau

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Christos Faloutsos


Corresponding author

Correspondence to Duen Horng Chau .

Editor information

Editors and affiliations.

Department of Computer Science, University of Calgary, Calgary, AB, Canada

Reda Alhajj

Section Editor information

Universidad Politécnica de Madrid, Madrid, Spain

Rosa M. Benito

Juan Carlos Losada


Copyright information

© 2018 Springer Science+Business Media LLC, part of Springer Nature

About this entry

Cite this entry.

Chau, D.H., Faloutsos, C. (2018). Fraud Detection Using Social Network Analysis: A Case Study. In: Alhajj, R., Rokne, J. (eds) Encyclopedia of Social Network Analysis and Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-7131-2_284


DOI : https://doi.org/10.1007/978-1-4939-7131-2_284

Published : 12 June 2018

Publisher Name : Springer, New York, NY

Print ISBN : 978-1-4939-7130-5

Online ISBN : 978-1-4939-7131-2

eBook Packages : Computer Science Reference Module Computer Science and Engineering


Open Access

Peer-reviewed

Research Article

Network analysis: An indispensable tool for curricula design. A real case-study of the degree on mathematics at the URJC in Spain

Contributed equally to this work with: Clara Simon de Blas, Daniel Gomez Gonzalez, Regino Criado Herrero

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing


Affiliations Area de Estadistica e Investigacion Operativa, ETSII, URJC, Mostoles, Spain, Instituto Universitario de Evaluacion Sanitaria, UCM, Madrid, Spain


Roles Conceptualization, Funding acquisition, Methodology, Software, Validation, Visualization, Writing – original draft

Affiliations Instituto Universitario de Evaluacion Sanitaria, UCM, Madrid, Spain, Departamento de Estadistica y Ciencia de los Datos, Facultad de Estudios Estadisticos, UCM, Madrid, Spain

Roles Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliations Departamento de Matematica Aplicada, Ciencia e Ingenieria de los Materiales y Tecnologia Electronica, ESCET, URJC, Mostoles, Madrid, Spain, Center for Computational Simulation, UPM, Pozuelo de Alarcón, Spain, Data, Complex Networks and Cybersecurity Research Institute, URJC, Madrid, Spain

  • Clara Simon de Blas, 
  • Daniel Gomez Gonzalez, 
  • Regino Criado Herrero

PLOS

  • Published: March 11, 2021
  • https://doi.org/10.1371/journal.pone.0248208

Content addition to courses and its subsequent correct sequencing in a study plan or curricula design context determine the success (and, in some cases, the failure) of such study plan in the acquisition of knowledge by students. In this work, we propose a decision model to guide curricular design committees in the tasks of course selection and sequencing in higher education contexts using a novel methodology based on network analysis. In this work, the local and global properties stemming from complex network analysis tools are studied in detail to facilitate the design of the study plan and to ensure its coherence by detecting the communities within a graph, and the local and global centrality of the courses and their dependencies are analyzed, as well as the overlapping subgroups and the functions and different positions among them. The proposed methodology is applied to the study of a real case at the Universidad Rey Juan Carlos.

Citation: Simon de Blas C, Gomez Gonzalez D, Criado Herrero R (2021) Network analysis: An indispensable tool for curricula design. A real case-study of the degree on mathematics at the URJC in Spain. PLoS ONE 16(3): e0248208. https://doi.org/10.1371/journal.pone.0248208

Editor: Ben Webb, Brigham Young University, UNITED STATES

Received: November 4, 2019; Accepted: February 22, 2021; Published: March 11, 2021

Copyright: © 2021 Simon de Blas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: This work was partially supported by Intelligent management of fuzzy information. Spanish Ministry of Education and Science. Reference: PGC2018-096509-B-I00 and PRODEBAT Spanish Ministry of Education and Science. Reference: PID2019-106254RBI00.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Many real-world systems can be modeled as networks. In these networks, nodes represent the different elements of the system and edges stand for the interactions or relationships between them [ 1 ]. In network analysis, a key point is to devise measures suitable for quantifying the strategic importance of a node, an edge, a set of nodes, or a set of edges, with the goal of identifying the optimal functioning of the system represented by the network [ 2 – 5 ]. Topological measures, such as different versions of vulnerability, efficiency, centrality, and clustering, can be used to quantify, compare, and rank different configurations for a specific system [ 2 , 6 – 13 ]. A task that is especially difficult for university management authorities is the design of curriculum plans. Few general methodologies have been reported in the literature, and most of them appear only as general frameworks in educational laws.

The TUNING Educational Structures [ 14 ] is a European project aimed at implementing the political objectives of the Bologna Process and the later Lisbon Strategy in the higher education sector. It is an approach to redesigning, developing, implementing, evaluating, and enhancing the quality of undergraduate and graduate degree programs. The tools that can be found in TUNING Educational Structures are described in a range of publications which institutions and researchers are invited to test and use in their own settings as points of reference, convergence, and common understanding. Techniques derived from Network Analysis (NA) have been very useful in similar studies that seek to identify the strengths, weaknesses, and inconsistencies of an organizational system (see for example some management cases in [ 15 – 17 ]).

The majority of the studies that combine the use of networks in the design of curricula have been carried out through dependency graphs. Dependency graphs have generally been applied to situations where edges represent a temporal constraint in the planning of a global activity ([ 18 – 21 ]). For example, in a dependency graph, the links (A, B) and (B, C) together imply that activity A (course A) must be carried out before B, and B before C. Activity B cannot start until activity A has been completely finished. In dependency graphs, transitivity is always imposed in this way, so there is no need to represent the link (A, C): it follows as a natural consequence of the transitivity that any dependency graph must have. For this reason, these dependency graphs are usually modeled as Directed Acyclic Graphs (i.e. DAG networks).

In this scenario, the main aim is to establish a temporal plan that identifies the critical activities (those that produce a delay in the global process). Another class of networking approach to curriculum study that is worth mentioning is based on how students move from one course to another throughout their careers [ 22 ]. In this way, it is possible to build a network that shows the real temporal relationships between courses. Combining historical data on students and courses, when available, with network techniques improves the capacity of students and university managers to estimate and predict the courses in which a student will enroll in the future. This makes it possible to forecast course enrollment for the following year. Interesting network analyses of this type of curricular graph can be found in which central courses are determined by means of centrality measures ([ 22 – 24 ]). The main difference from other network models is that the relationships between courses are based on temporality and are not known a priori. In other words, there is a link (i, j) between the courses i and j if many students have taken course j right after having completed course i.
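The construction of such a temporal course network can be sketched as follows. This is a minimal illustration, not the method of [ 22 ]: the input format (one list of per-term course sets for each student), the function names, and the threshold `min_students` that stands in for "many students" are all assumptions, and the toy data are invented.

```python
from collections import Counter


def transition_counts(histories):
    """Count how often course j is taken in the term right after course i.

    histories: one list per student; each list holds the set of courses
    taken in each consecutive term (hypothetical input format).
    """
    counts = Counter()
    for terms in histories:
        # Pair every term with the term that immediately follows it.
        for earlier, later in zip(terms, terms[1:]):
            for i in earlier:
                for j in later:
                    counts[(i, j)] += 1
    return counts


def transition_links(histories, min_students):
    """Keep a link (i, j) only if at least min_students followed it."""
    counts = transition_counts(histories)
    return {arc: n for arc, n in counts.items() if n >= min_students}


# Toy data, invented for illustration only.
histories = [
    [{"Calculus"}, {"Analysis"}],
    [{"Calculus"}, {"Analysis", "Algebra"}],
    [{"Algebra"}, {"Calculus"}],
]
links = transition_links(histories, min_students=2)
```

With this toy input, only the transition from "Calculus" to "Analysis" is followed by two students, so it is the only link kept.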

In this paper, we propose an influence network model for curriculum plan design in which the links among courses reflect the influence that some subjects have over others due to the temporal sequencing of content acquisition. The curriculum model proposed here responds to a system of recommendations which is not necessarily restrictive, allowing the student to take several courses simultaneously ([ 25 – 27 ]). The influence/recommendation graph analyzed here represents a more general scenario than a system of prerequisites. Therefore, all of the analyses carried out in this work can also be applied to the dependency graphs found in the literature, in which transitivity is imposed and the arcs are unweighted. The network of this work is built based on the course contents that are “necessary/recommended” for other courses. The objective of using network analysis for this recommendation/influence structure is to show (in addition to any classical temporal analysis) its robustness, centralization, and cohesion and, in general, to understand the whole plan structure so that it can be modified if appropriate or necessary. The information obtained from an analysis of the network of influence (as proposed in this document) enriches the knowledge of the training itinerary.

The main aim of the research is to present a tool for a better understanding of the complex structure of a curriculum plan. It could be useful to construct a degree training itinerary as an interactive process that makes it possible (by means of a network visualization and some network measures and analyses) to assign courses to semesters, to modify course contents in case of inconsistencies, and to understand in general the whole flow of knowledge. This article is organized as follows. In Section 2, we describe some preliminary concepts in network analysis. In Section 3, we describe our methodology for creating a network representing the relationships between courses in an undergraduate curriculum, as well as the network measures used here for evaluating and interpreting the network’s properties. In Section 4, we describe the process of developing a degree’s curriculum in Spain. In Section 5, we present the results of applying the methodology in the design of the University Rey Juan Carlos (URJC) mathematics degree curriculum. Finally, in Section 6, we present the conclusions derived from this work.

Preliminaries

Network analysis, centrality measures.

One of the most important problems in network analysis is the identification of key nodes and the relationships between nodes. There exist many approaches to ranking the nodes of a network, depending on the definition of “relevance” or “importance” adopted by network analysts. A common approach is to use centrality measures [ 12 ]. The chosen centrality measure reflects the node’s (relative) importance inside the network. Although there are many papers on centrality in the literature [ 28 , 29 ], centrality is a complex notion that requires a clear definition. As pointed out in [ 3 , 29 ], the use of a specific centrality measure implies some assumptions about the network structure and how information flows along the network, so it is very important first to analyze the class of network being modelled and then to use an adequate centrality measure. For example, the shortest-path measures (closeness or betweenness centrality) only take into account the geodesic paths in the communication between each pair of nodes. Thus, it is assumed that information flows through the network only along the shortest feasible paths.

In this sense, it is important to mention that in this paper we focus on centrality measures for dominance (or reference) networks where relationships between nodes are weighted and directed through a relation of precedence. For this type of network, one of the most popular and most frequently used centrality measures is based on the degree of the node in the graph: the more edges incident at a node, the higher the node’s position in the ranking. Generally, however, nodes more central to the structure, in the sense of having higher degree or more connections, tend to be favored in the ranking. In a directed network, the degree can be obtained as the sum of the in-degree and the out-degree. In Program Evaluation and Review Techniques (PERT) [ 30 ] (which can be viewed as a class of dominance networks), the out-degree of the nodes can be used to classify the objects or activities in the project as leading or intermediate.
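As a minimal sketch of degree centrality in a weighted directed network, the following code sums arc weights into in-degree, out-degree, and total degree. The dictionary representation of the arcs, the function name, and the toy weights are assumptions made for illustration.

```python
def weighted_degrees(arcs):
    """Return (in_degree, out_degree, total_degree) dictionaries.

    arcs: dict mapping a directed arc (i, j) to its non-negative weight.
    """
    nodes = {n for arc in arcs for n in arc}
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for (i, j), w in arcs.items():
        out_deg[i] += w  # arc leaves node i
        in_deg[j] += w   # arc enters node j
    # Total degree in a directed network: in-degree plus out-degree.
    total = {n: in_deg[n] + out_deg[n] for n in nodes}
    return in_deg, out_deg, total


# Toy network, invented for illustration only.
arcs = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
in_deg, out_deg, total = weighted_degrees(arcs)
ranking = sorted(total, key=total.get, reverse=True)
```

In this toy network node B has the highest total degree (3 in, 2 out), while node A has the highest out-degree, which is the kind of distinction the PERT classification above relies on.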

In network analysis, the term “influence network” frequently refers to directed networks in which the relations represent some degree of influence between nodes. For example, author citation networks or Twitter networks are often considered influence networks. In those networks, if author j cites (or retweets) author i many times, we could say that author i has some influence over author j. In this sense, there is a significant difference from dependency or permission networks, since in those the direct relation implies a stronger constraint, such as transitivity between nodes. For example, in a permission network author j needs the approval of author i, so this relation is more than just an influence; in a dependency network, a node usually represents the instant at which a task is finished and the link between two nodes is associated with a task. As a consequence, the relations should be transitive. Another term that appears in many network analyses is the recommendation network. Recommendation networks are clearly closer to the concept of influence than dependency/permission networks. They are used to model recommendations from friends (or similar users) about the order in which tasks should be performed. For example, after you read Book X, the system recommends that you read Book Y, because your friends or similar users have also read Book X. These recommendations can be modelled as a directed or undirected network where the nodes are the books and the links are recommendations. From a mathematical point of view, dependency networks impose more conditions than recommendation or influence networks. Dependency networks are usually modelled as DAGs.

Another measure that can be used for directed networks is PageRank, defined by Larry Page and Sergey Brin in [ 31 ], where the rank value indicates the importance of the node in the network.
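A weighted PageRank can be sketched with a plain power iteration. This is a minimal illustration under stated assumptions rather than the exact formulation of [ 31 ]: the damping factor 0.85, the fixed number of iterations, and the uniform redistribution of rank from nodes without outgoing arcs are common conventions chosen here, and the toy network is invented.

```python
def pagerank(arcs, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    arcs: dict mapping a directed arc (i, j) to a non-negative weight.
    Nodes with no outgoing weight spread their rank uniformly (assumption).
    """
    nodes = sorted({n for arc in arcs for n in arc})
    n = len(nodes)
    out_weight = {v: 0.0 for v in nodes}
    for (i, _), w in arcs.items():
        out_weight[i] += w

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes that have nowhere to send it.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for (i, j), w in arcs.items():
            if w > 0:
                # Node i passes rank to j in proportion to the arc weight.
                new_rank[j] += damping * rank[i] * w / out_weight[i]
        rank = new_rank
    return rank


# Toy network, invented for illustration only.
arcs = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}
ranks = pagerank(arcs)
most_important = max(ranks, key=ranks.get)
```

In this toy network node C collects rank from both A and B, so it ends up with the highest value; with this arc orientation PageRank rewards nodes that many other nodes point to.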

Finally, another class of measures that deals with weighted and directed networks is the flow centrality measures [ 3 , 32 ], which regard the network as representing a flow. In the case of the flow betweenness centrality measure, the contribution of a node represents the amount of flow that necessarily passes through this node in all possible maximum flows [ 32 ].

Community detection problems: Clustering nodes in a network.

In addition to the centrality or importance of the nodes in a network, another topic that merits attention is the identification of groups or communities in a network. The identification of groups in a network is very useful for understanding the structure of the problem that we are analyzing. A network is called modular (see [ 33 ]) when its nodes are joined together in tightly knit groups among which there are only looser connections (i.e. if the nodes are naturally grouped into dense communities with few connections between communities). There are many algorithms that try to find communities in a network (see [ 34 ]). Such problems are commonly known as community detection problems.

Nevertheless, as with centrality measures, the choice of a specific community detection method is not straightforward: the suitability of the algorithms proposed in the literature depends on the type of network being analyzed and is still an open problem. Community detection for directed networks has been much less studied than the general problem, since the concept of group or community is not as clear for directed networks as it is for undirected networks [ 35 ].

As pointed out in [ 35 ], the concept of group/community can be extended in different ways when moving from undirected networks to directed networks. In undirected networks, the accepted interpretation of a community/group is a set of members with many relationships among themselves and few relationships with members outside the community. This idea of community is related to the notion of density: communities are dense subgraphs with fewer relations to the outside. In directed networks, this view is not the only way to consider communities. For example, in a directed network (such as Twitter, where it is possible to follow someone who does not follow you), a group can consist of a set of nodes with similar “parents” even if they are not connected. If two people who are not directly connected follow the same accounts (politicians, sports teams, universities, etc.), we could say that they share characteristics and could be in the same community. This community concept is shown in Fig 1 as a citation-based cluster. Another notion of group, not necessarily similar to density, is the flow-based cluster, in which a group is a set of nodes that can communicate with each other. Finally, the idea of density (in the classical sense) is also adopted for the directed case. In this work, we have focused on density clusters.

Fig 1. https://doi.org/10.1371/journal.pone.0248208.g001

Different algorithms [ 35 ] have been proposed for directed graphs depending on the class of clusters being sought. The most common and simple approach (referred to as clustering based on a naive graph transformation ) is to ignore edge directionality and treat the graph as undirected. This methodology cannot capture citation-based clusters, since the clusters could be very small or even disconnected ( Fig 1 ). In other approaches (see [ 35 ]) the directed graph is converted into an undirected one in which the edge direction is meaningfully maintained. There exist other methodologies based on modularity optimization algorithms that can be used to deal with directed networks in order to capture other classes of community concepts.
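The naive graph transformation can be sketched in a few lines. The choice to add the weights of reciprocal arcs is an assumption made for this illustration (other conventions, such as taking the maximum, are equally possible), and the function name and toy weights are invented.

```python
def to_undirected(arcs):
    """Drop arc direction; reciprocal arcs are merged by adding weights.

    arcs: dict mapping a directed arc (i, j) to its weight.
    Returns a dict keyed by the sorted node pair.
    """
    undirected = {}
    for (i, j), w in arcs.items():
        key = tuple(sorted((i, j)))  # (i, j) and (j, i) share one key
        undirected[key] = undirected.get(key, 0) + w
    return undirected


# Toy network, invented for illustration only.
arcs = {("A", "B"): 1, ("B", "A"): 2, ("C", "A"): 1}
edges = to_undirected(arcs)
```

After the transformation, the information about who points to whom is lost, which is why clusters defined by shared "parents" cannot be recovered from the result.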

Global network measures.

Centrality measures are computed for each node of the network and capture the idea of centrality or importance. In this sense, centrality can be viewed as a local measure that does not give a general idea of how the network is structured. An aggregation of the different centrality values (for example, by means of the Gini index or the variance of the centrality vector) could give us an idea of how power is distributed in the network. This concept can be considered a global property, since it refers to a quality of the whole network.
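As a minimal sketch of this aggregation, the Gini index of a centrality vector can be computed from the sorted values. The function name and the two toy vectors are illustrative; the formula is the standard rank-based expression for non-negative values.

```python
def gini(values):
    """Gini index of a vector of non-negative centrality values.

    0 means centrality is spread evenly; values near 1 mean that a few
    nodes concentrate almost all of it.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum((k + 1) * x for k, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


even = gini([1, 1, 1, 1])          # every node equally central
concentrated = gini([0, 0, 0, 4])  # one node holds all the centrality
```

For n nodes the maximum attainable value is (n - 1) / n, so the second toy vector reaches the upper bound for four nodes.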

Another global network measure is the optimal modularity of a network (i.e. the modularity of the best partition of the network in terms of the modularity measure), which tells us whether a network has well-defined groups or communities. If the modularity of this best partition is high, we usually say that the network is modular.
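The modularity of one given partition can be sketched as follows. This illustration assumes the standard Newman formulation for undirected weighted graphs, with each edge listed once and no self-loops; it evaluates a partition supplied by the user and does not search for the optimal one. The function name and the toy graph (two triangles joined by a single edge) are invented.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: dict mapping an edge (u, v), listed once, to its weight.
    community: dict mapping each node to its community label.
    Q = sum over communities of (internal weight / m) - (degree sum / 2m)^2.
    """
    m = float(sum(edges.values()))
    internal = {}    # weight of edges with both ends in the community
    degree_sum = {}  # sum of node degrees in the community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0.0) + w
        degree_sum[cv] = degree_sum.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / m - (d / (2.0 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by one edge, invented for illustration only.
edges = {(1, 2): 1, (2, 3): 1, (1, 3): 1,
         (4, 5): 1, (5, 6): 1, (4, 6): 1,
         (3, 4): 1}
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
q_split = modularity(edges, split)
q_single = modularity(edges, {n: "all" for n in range(1, 7)})
```

Splitting the toy graph into its two triangles gives a positive modularity, whereas putting every node into a single community gives zero.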

Another relevant global measure is homophily. Homophily is an important concept aimed at encapsulating why the nodes in the network are linked. The general hypothesis is that nodes with similar characteristics are more likely to be connected. When this characteristic is the importance of the node (measured as the degree of incidence), it must be tested whether nodes with greater power are more likely to be connected among themselves than they are to be connected with lower power nodes. A coefficient measuring the correlation between the degrees of linked nodes can be used to capture this tendency.

Newman [ 4 ] introduces the assortativity measure, which makes it possible to classify networks as assortative or disassortative. A network is assortative if the correlation coefficient between the degrees of the nodes joined by the arcs is positive and significant: nodes having many relations are connected amongst themselves with greater probability than they are connected with low-degree nodes. In the opposite case of a negative correlation coefficient, nodes with many connections tend to be connected with nodes of low total degree. However, networks with different topologies can present the same assortativity index, and vice versa; these issues are addressed in [ 36 ], whose authors introduce, for undirected and unweighted networks, higher order assortativity based on a suitable choice of the matrix driving the connections. In this work, we have used the assortativity measure proposed by Newman [ 5 ] as a preliminary inspection to discover the network topology.
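A degree correlation of this kind can be sketched as a Pearson coefficient over the two ends of every edge. The sketch below is a simplified illustration that treats the network as undirected and unweighted and counts each edge in both orientations; it is not claimed to reproduce the exact variant used in this work. The function name and the toy star graph are invented.

```python
from math import sqrt


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge.

    edges: list of undirected, unweighted edges (u, v), each listed once.
    Returns 0.0 when all degrees are equal and the correlation is undefined.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Count every edge in both orientations so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Star graph: one hub linked to three leaves (toy example).
star = [("hub", "l1"), ("hub", "l2"), ("hub", "l3")]
r_star = degree_assortativity(star)
```

In a star every edge joins the high-degree hub to a low-degree leaf, so the coefficient takes its most disassortative value.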


To conclude this section, we mention the motifs used to evaluate important substructures in the context of directed graphs. To search for repeated subgraphs having some well-defined structure, Davis and Leinhardt [ 38 ] define a motif as a small connected subgraph having a particular given structure. It is argued that the motif profile (i.e. the number of different motifs in the graph) is characteristic of different types of networks and that network function is related to the motifs identified in the graph.

Methodology: Using network analysis for curriculum design

As mentioned in the introduction, network analysis has been used to understand complex structures and to analyze their strengths and weaknesses. With the main objective of understanding the complex structure associated with a curriculum plan, we propose the following steps:

  • First step (Experimental design): build the network of influences between courses.
  • Second step: apply network analysis tools to the curriculum network:
      • Inconsistencies identification.
      • Course scheduling allocation in semesters.
      • Detection of key courses.
      • Detection of central courses.
      • Course communities.
  • Third step (Results and conclusions): evaluate the network analysis measures in the curriculum plan.
  • Fourth step (iterative): present the information and the associated conclusions to the experts/professors for a better understanding of the complex structure and, if necessary (due to inconsistencies or infeasibility in the semester allocations), change the content of some courses to reduce these inconsistencies and return to step 1.

Fig 2 represents a decision aid model to construct the curriculum network.

Fig 2. https://doi.org/10.1371/journal.pone.0248208.g002

In the first step we build the “Curriculum Network” that represents the relations between courses (based on the expert knowledge of the professors). For each course “A”, we have interviewed professors with more than 5 years of experience teaching lecture “A”. The professors establish the percentage of material in course B necessary to understand the contents of course A appropriately in a selected number of levels.

Once the information is obtained, we aggregate the expert opinions (in our real case application this group was composed of 43 professors with more than 5 years of experience) to obtain the final matrix of relations among the courses in the curriculum plan. The second step is to apply network analysis tools for a better understanding of the complex structure of the curriculum plan. Fig 3 represents the different analyses derived from the different measures that can be applied to the curriculum network.

Fig 3. https://doi.org/10.1371/journal.pone.0248208.g003

Finally, the information and the associated conclusions are presented to the experts/professors for a better understanding of the complex structure and, if necessary (due to inconsistencies or infeasibility in the semester allocations), the content of some courses is changed to reduce these inconsistencies in an iterative way.

In order to represent the courses and the influences between different courses, throughout this paper we consider a weighted directed network G = ( X , E , λ), where X = {1, …, n } is the set of vertices or nodes , E ⊆ X × X is the set of arcs or edges and λ is a function λ: E → [0, + ∞) such that for each arc ( i , j ) ∈ E , the coefficient λ( i , j ) is called weight of ( i , j ). The graph is built as follows:

  • Courses to be considered will be represented as nodes in the network.
  • The network must reflect the relationships between courses. Since relationships between courses are directional, they are represented in the graph by directed arcs. That the contents of course i should precede the contents of course j is represented in the graph as a directed arc ( i , j ) from node i to node j .
  • The weight of the arc λ( i , j ) reflects the degree of influence/dependency between two courses.
  • The set of arcs is E = {( i , j )|λ( i , j ) > 0}.
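The construction rules above can be sketched as a small function that turns a matrix of weights into the triple ( X , E , λ). The matrix layout (row i , column j holds λ( i , j )), the function name, and the placeholder course labels and weights are assumptions made for illustration and are not data from this study.

```python
def build_network(courses, weight_matrix):
    """Build G = (X, E, lam) with E = {(i, j) | lam(i, j) > 0}.

    courses: list of course labels (the nodes X).
    weight_matrix: square list of lists; entry [r][c] is the weight of the
    arc from courses[r] to courses[c] (hypothetical layout).
    """
    nodes = list(courses)
    lam = {}
    for i, row in zip(nodes, weight_matrix):
        for j, w in zip(nodes, row):
            if w < 0:
                raise ValueError("weights must lie in [0, +inf)")
            if w > 0:
                lam[(i, j)] = w  # only positive weights define an arc
    arcs = set(lam)
    return nodes, arcs, lam


# Placeholder labels and weights, invented for illustration only.
courses = ["c1", "c2", "c3"]
weight_matrix = [
    [0, 2, 1],
    [0, 0, 3],
    [0, 0, 0],
]
X, E, lam = build_network(courses, weight_matrix)
```

Storing λ as a dictionary keyed by arc keeps the arc set and the weight function in one structure, since E is exactly the set of keys.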

The network must reflect the flow of knowledge as a sequence of acquisition and improvement of professional skills. Furthermore, isolated nodes in the graph can be allocated to any desirable semester in the curriculum design, whereas nodes with many relations in the structure modeled by the graph should be prioritized in the allocation with respect to the rest of the nodes, since they are subject to more restrictions. An inconsistency is detected if there exist both a directed arc ( i , j ) with λ( i , j ) > 0 and a directed arc ( j , i ) with λ( j , i ) > 0. We consider the length of the path between courses i and j as the minimum number of semesters necessary to allocate the courses while respecting the dependence between courses and avoiding interdependencies in a given semester.
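The inconsistency check described above can be sketched directly from the weight function. The dictionary representation of λ, the function name, and the toy weights are assumptions made for illustration.

```python
def find_inconsistencies(lam):
    """Return the course pairs linked by positive arcs in both directions.

    lam: dict mapping a directed arc (i, j) to its weight.
    """
    found = set()
    for (i, j), w in lam.items():
        # An inconsistency needs lam(i, j) > 0 and lam(j, i) > 0.
        if i != j and w > 0 and lam.get((j, i), 0) > 0:
            found.add(frozenset((i, j)))
    return sorted(tuple(sorted(pair)) for pair in found)


# Toy weights, invented for illustration only.
lam = {("A", "B"): 2, ("B", "A"): 1, ("B", "C"): 3}
conflicts = find_inconsistencies(lam)
```

Each conflicting pair is reported once, regardless of which of its two arcs is visited first.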

In the final phase, we can proceed to the assignment of nodes from the graph to semesters in the study plan, besides other applications derived from Network Analysis. For the semester allocation problem, we impose the following restrictions (based on Spanish law, which could be modified according to the legislation of other countries):

  • The first four semesters of the degree must contain the transversal courses (common to all students of the same university, regardless of the degree they are studying) and the basic training courses (those courses common to most degrees but adapted to the specific content of a given degree).
  • If a course is considered as intermediate (i.e. there is at least one course that requires the competencies acquired by studying it) then it cannot appear in the second semester of the fourth year (see Fig 4 ).
  • The maximum length of a path that joins a course with any other course in the dependency graph associated to the training itinerary determines the earliest semester for this course to be allocated in the degree programme (see Fig 5 ).
  • An arc’s direction reflects the sequential acquisition of competencies so an arc cannot return to a prior semester (see Fig 6 ).
  • It is desirable that the distance between semesters of co-requisite courses be as short as possible in the study plan (see Fig 7 ).
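The restriction on the earliest semester can be sketched as a longest-path computation over an acyclic dependency graph. This is one possible reading of the rule, assumed for illustration: a course with no predecessors may start in semester 1, and every other course is placed one semester after its latest predecessor. The function name and the toy courses are invented, and the other restrictions in the list are not modeled here.

```python
from collections import deque


def earliest_semesters(courses, arcs):
    """Earliest feasible semester for each course in a DAG.

    courses: list of course labels.
    arcs: iterable of (i, j) pairs meaning course i must precede course j.
    Raises ValueError if the arcs contain a cycle.
    """
    pending = {c: 0 for c in courses}       # unprocessed predecessors
    successors = {c: [] for c in courses}
    for i, j in arcs:
        successors[i].append(j)
        pending[j] += 1

    earliest = {c: 1 for c in courses}
    queue = deque(c for c in courses if pending[c] == 0)
    processed = 0
    while queue:
        i = queue.popleft()
        processed += 1
        for j in successors[i]:
            # Course j must come at least one semester after course i.
            earliest[j] = max(earliest[j], earliest[i] + 1)
            pending[j] -= 1
            if pending[j] == 0:
                queue.append(j)
    if processed != len(pending):
        raise ValueError("the dependency graph contains a cycle")
    return earliest


# Toy itinerary, invented for illustration only.
courses = ["A", "B", "C", "D"]
arcs = [("A", "B"), ("B", "C"), ("A", "C")]
semesters = earliest_semesters(courses, arcs)
```

In the toy itinerary, course C depends on A both directly and through B, so the longer chain decides its placement; the isolated course D can be allocated freely and defaults to the first semester.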

Fig 4. https://doi.org/10.1371/journal.pone.0248208.g004

Fig 5. https://doi.org/10.1371/journal.pone.0248208.g005

Fig 6. https://doi.org/10.1371/journal.pone.0248208.g006

Fig 7. https://doi.org/10.1371/journal.pone.0248208.g007

Case study: The Mathematics Curriculum Network (MCN) in the University Rey Juan Carlos

The University Rey Juan Carlos (URJC) offered a mathematics degree for the first time in the academic year 2015-16. Previously, the URJC had offered a mathematics degree only as a part of joint degrees with computer science or software engineering. After these joint degree programs had been initiated, their coordinators held a series of meetings to respond to reports issued by the regional oversight authority madri+d (Fundación para el Conocimiento madri+d, the Madrid agency whose aim is to make the quality of higher education, science, technology and innovation key elements of the competitiveness and well-being of citizens) requiring the implementation of a single degree in mathematics. To take into account student demands for modifications to the degree programs and other suggestions for changes to the existing training itineraries, several teaching coordination and curriculum design committees, made up of qualified teachers and student representatives, met to analyze proposals. Any proposed study plan must meet certain predetermined organizational requirements. Five 6-credit courses must be assigned to each of its first six semesters. While its seventh and eighth semesters may include any admissible course as an elective, they must also include professional internships and the final project. Additionally, the URJC imposes as a graduation requirement passing a foreign language course known as modern language. This requirement must be met by completing a year-long 6 ECTS (European Credit Transfer and Accumulation System) course during the first two years of study. Therefore, a total of 31 courses have to be allocated in the study plan, and the diameter of the final network must be less than 7 to ensure that the legal requirements are satisfied.

These requirements identify 31 nodes in the Mathematical Curriculum Network (MCN) corresponding to the courses appearing in the mathematics degree. In order to determine the dependencies between the courses constituting the degree, personal interviews were conducted with the teachers who lectured in the degree. On the basis of these interviews, the dependency arcs were constructed and their weights were determined so as to indicate an estimated percentage of dependency of the contents of a course on the contents of previously studied courses. The weight of a given arc ( i , j ) represents the percentage of the contents of course i that a student must master before enrolling in course j in order to succeed in course j . For an arc ( i , j ), for which course i must precede course j in the itinerary, three levels of dependency were established:

  • Level 1: The content of course j requires less than 30% of the content taught in course i .
  • Level 2: The content of course j requires between 30% and 75% of the content taught in course i .
  • Level 3: The content of course j requires more than 75% of the content taught in course i .
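The three levels above can be sketched as a simple mapping from an estimated percentage to a level. The function name is hypothetical, and the treatment of the boundary values (exactly 30% and exactly 75% both fall in Level 2) follows the wording of the list as read here.

```python
def dependency_level(percent):
    """Map the share of course i required by course j to a level 1-3.

    percent: estimated percentage in (0, 100]; a value of 0 means that
    there is no arc at all, so it is rejected here.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if percent < 30:
        return 1   # Level 1: less than 30%
    if percent <= 75:
        return 2   # Level 2: between 30% and 75%
    return 3       # Level 3: more than 75%


levels = [dependency_level(p) for p in (10, 30, 75, 76)]
```

The resulting level can then be used as the arc weight λ( i , j ) when the curriculum network is built.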

Table 1 shows the dependencies between the different courses in the mathematics degree at the URJC. The subscript indicates the level of dependency, that is, how much of the content of the preceding course is necessary to understand the contents of the given course.

Table 1. https://doi.org/10.1371/journal.pone.0248208.t001

The transversal and basic training courses are LA, IP, C, PF, E, BF, MH, AVI, CF and ML.

Once this information was obtained, the curriculum network was built (see Fig 8). Each semester is plotted in a different color for clarity. Table 2 gives the color assigned to each semester.

https://doi.org/10.1371/journal.pone.0248208.g008

https://doi.org/10.1371/journal.pone.0248208.t002

The network evaluation allows visualization of the entire network and also of its subnetworks, which include only dependencies of level 1 or of several levels (see Figs 9 and 10).

https://doi.org/10.1371/journal.pone.0248208.g009

https://doi.org/10.1371/journal.pone.0248208.g010

Once the dependency structure of the URJC mathematics degree has been modeled as a network (which we call the MCN), this section presents a network analysis that addresses the objectives set out in the introduction. The section is divided into four subsections according to the analyses presented: centrality measures analysis of the MCN, community detection analysis of the MCN, motif identification, and global network analysis.

Centrality measures analysis of the MCN

First of all, note that the Mathematical Curriculum Network (MCN) is a directed and weighted network. Note also that the direction of a link represents a dominance relation, since a link from course i to course j exists if and only if the knowledge of course i must be acquired first in order to understand or pass course j. As a consequence, some classical centrality measures based on minimal paths, such as closeness or betweenness, are not appropriate in this particular case. Taking all this into consideration, we measure the importance of the courses in this network with two classical centrality measures that handle directed and weighted networks: degree and PageRank. In directed networks, in-degree and out-degree are extremely local measures, although quite informative. In this sense, intermediate measures such as flow betweenness can be understood as more general and robust than in-degree.
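
To make the two measures concrete, here is a minimal Python sketch that computes weighted in-degree, weighted out-degree and PageRank on a toy directed graph. The node names, arc weights, damping factor (0.85) and the uniform redistribution of rank from nodes without outgoing arcs are all assumptions made for illustration. The study does not state here whether rank was propagated along or against the arc direction, so the direction used below is also an assumption.

```python
def weighted_degrees(edges, nodes):
    """Return (in_degree, out_degree) dictionaries of summed arc weights."""
    in_deg = {n: 0.0 for n in nodes}
    out_deg = {n: 0.0 for n in nodes}
    for (src, dst), w in edges.items():
        out_deg[src] += w
        in_deg[dst] += w
    return in_deg, out_deg


def pagerank(edges, nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; rank flows along each arc in proportion to its weight."""
    n = len(nodes)
    _, out_deg = weighted_degrees(edges, nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing arcs is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_deg[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in edges.items():
            new[dst] += damping * rank[src] * w / out_deg[src]
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy network: arc (i, j) means course i is needed for course j.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): 3, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 3}
in_deg, out_deg = weighted_degrees(edges, nodes)
ranks = pagerank(edges, nodes)
```

In this toy graph, node A has the largest out-degree (many courses depend on it), while node D has out-degree zero and collects the highest PageRank under the chosen direction.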

Table 3 shows the importance of each course in the Mathematical Curriculum Network in terms of degree and PageRank value.

https://doi.org/10.1371/journal.pone.0248208.t003

The results highlight the importance of the classes Linear Algebra, Calculus, Vector Analysis I, Ordinary Differential Equations and Complex Variable, and Vector Analysis. Student performance in these courses must be monitored closely since a large number of future courses depend on them. To detect the core courses of the curriculum, the flow betweenness measure on the original network was calculated for minimum paths. This emphasizes the importance of the courses Calculus, Algebraic Structures, Vector Analysis I, Discrete Mathematics, Topology, Data and Information Modelling, Statistical Methods, and Curves and Surfaces, as these courses appear in the minimum dependency path between any pair of courses in the degree. To detect the instrumental and terminal courses of the curriculum, the out-degree measure on the original graph was calculated. This highlights the importance of the courses Linear Algebra, Calculus, and Discrete Mathematics as “key” courses in the degree, which should be positioned at the beginning of the curriculum, preferably in the first semester of study. Courses with a score of zero on this measure can be considered terminal courses and can be placed in the final semesters of the curriculum.

Community detection analysis of the MCN

Community detection in network analysis is beneficial for numerous applications, such as finding common characteristics between nodes or finding sets of nodes with similar interactions.

https://doi.org/10.1371/journal.pone.0248208.t004

The degree of modularity is moderate (0.32), which reflects the fact that we have a modular network with clearly identifiable communities, even though there are relations between the communities, which means that there are also dependencies between the communities.
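
For readers unfamiliar with the modularity score, the sketch below computes it for a given partition of a small graph. It is a simplified illustration: it treats the graph as undirected and unweighted, whereas the MCN is directed and weighted, and it only evaluates a partition rather than searching for one as the Louvain algorithm does. The node names and the partition are hypothetical.

```python
def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # number of edges inside each community
    degree_sum = {}  # sum of node degrees in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by a single edge (hypothetical example).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles gives a positive score, while putting every node in one community gives zero, which is the sense in which a value such as 0.32 indicates identifiable but not isolated communities.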

Fig 11 shows a visualization of the optimal solution found by the Louvain algorithm.

https://doi.org/10.1371/journal.pone.0248208.g011

It is worth noting the presence of four groups of courses (see Fig 11), which could be named as follows:

  • Group 1: Basic Algebra
  • Group 2: Probability and mathematical statistics
  • Group 3: Advanced computing and algebra
  • Group 4: Vector calculus and numerical methods

The courses Biological Foundations, Ethics, Modern Language, and History of Mathematics are separated from the other courses in the degree.

Global network analysis

The rich club in the network, represented by the principal courses in the study plan, refers to the tendency of the dominant elements of the system to form tightly interconnected groups. The maximum value of Φ(k) with minimum k is attained at k = 10, where R(10) is the set of courses Linear Algebra, Calculus, Vector Analysis I, and Ordinary Differential Equations (see Fig 12).
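
As an illustration of the idea, the following Python sketch computes the standard unnormalized rich-club coefficient of an undirected graph: the density of links among nodes whose degree exceeds k. Whether the study used this exact definition of Φ(k), a normalized variant, or a directed version is not stated here, so the formula and the toy graph are assumptions.

```python
def rich_club_coefficient(edges, k):
    """Density of links among nodes with degree > k (None if fewer than 2 such nodes)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    rich = {n for n, d in degree.items() if d > k}
    if len(rich) < 2:
        return None
    links = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * links / (len(rich) * (len(rich) - 1))


# Hypothetical graph: three mutually linked hubs, each with two leaf nodes.
edges = [("h1", "h2"), ("h2", "h3"), ("h1", "h3"),
         ("h1", "a"), ("h1", "b"),
         ("h2", "c"), ("h2", "d"),
         ("h3", "e"), ("h3", "f")]

phi_0 = rich_club_coefficient(edges, 0)  # all nodes
phi_1 = rich_club_coefficient(edges, 1)  # hubs only
```

Here the hubs are fully connected to each other, so the coefficient rises from 0.25 over the whole graph to 1.0 among the hubs, which is the signature of a rich club.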

https://doi.org/10.1371/journal.pone.0248208.g012

The assortativity of the network is 0.12 (p-value 0.228 for the null hypothesis assortativity = 0 against the alternative assortativity ≠ 0). Assortativity is defined here as the correlation coefficient between the degrees of adjacent nodes. Because the coefficient is small and not significantly different from zero, the network shows no clear assortative mixing by degree: high-degree nodes do not preferentially connect to other high-degree nodes.
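
The definition used above (a correlation between the degrees at the two ends of each link) can be sketched in a few lines of Python. This version assumes an undirected, unweighted graph and counts every link in both directions; the MCN itself is directed, and the exact variant used in the study is not specified here. The example graphs are hypothetical.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else float("nan")


# A star: one hub linked to four leaves (strongly disassortative).
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
# A path on four nodes.
path = [("a", "b"), ("b", "c"), ("c", "d")]

r_star = degree_assortativity(star)
r_path = degree_assortativity(path)
```

A negative value, as in the star, means high-degree nodes attach to low-degree nodes; a positive value means nodes of similar degree tend to be linked.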

Cross-dependencies analysis

Some networks require a predefined model to determine the influence of one region on another due to temporal dependencies. Cross-dependency analysis helps to identify the network’s hierarchy of influence.

To detect cross-dependencies between classes, motifs of size three were considered. We compared the expected number of motifs of size 3 in a random graph with the observed number of motifs of size 3 in the study plan graph (see Figs 13 and 14). The random model was generated with the same number of nodes and arcs as the original network, using Pajek 5.1 software. Triad census types are labelled according to [38].
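
The comparison between observed and expected motif counts can be sketched as follows. This Python example is a simplified stand-in, not the Pajek procedure used in the study: it counts only two three-node patterns (transitive triples and directed 3-cycles) rather than the full triad census, and it draws random graphs by sampling a fixed number of arcs uniformly. The toy graph, the seed and the number of samples are assumptions.

```python
import random
from itertools import permutations


def count_transitive_and_cyclic(nodes, arcs):
    """Count transitive triples (i->j, j->k, i->k) and directed 3-cycles."""
    arc_set = set(arcs)
    transitive = cyclic = 0
    for i, j, k in permutations(nodes, 3):
        if (i, j) in arc_set and (j, k) in arc_set:
            if (i, k) in arc_set:
                transitive += 1
            if (k, i) in arc_set:
                cyclic += 1
    # Each 3-cycle is seen once per rotation of (i, j, k).
    return transitive, cyclic // 3


def random_digraph(nodes, n_arcs, rng):
    """Sample n_arcs distinct arcs uniformly, without self-loops."""
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(possible, n_arcs)


# Hypothetical acyclic dependency graph.
nodes = ["A", "B", "C", "D", "E"]
arcs = [("A", "B"), ("A", "C"), ("B", "C"),
        ("B", "D"), ("C", "D"), ("D", "E")]
observed = count_transitive_and_cyclic(nodes, arcs)

rng = random.Random(0)
samples = [
    count_transitive_and_cyclic(nodes, random_digraph(nodes, len(arcs), rng))
    for _ in range(200)
]
expected_transitive = sum(t for t, _ in samples) / len(samples)
expected_cyclic = sum(c for _, c in samples) / len(samples)
```

Comparing `observed` with the two expected values shows whether a pattern is over- or under-represented relative to chance; a formal test would also need the spread of the random counts.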

https://doi.org/10.1371/journal.pone.0248208.g013

https://doi.org/10.1371/journal.pone.0248208.g014

Significant differences are apparent for patterns D, E, F and I (p-value < 2E−16) (see Fig 15). Patterns E and F were significantly less frequent in the study plan than in a random graph. It is desirable to avoid patterns C, G, H, K, L, M, N, O and P in the study plan, as these patterns exhibit significant course interdependency and demand considerable coordination between professors, which is not always easy to achieve. Similarly, pattern J should be avoided because of its cyclic structure, which would require knowledge taught in courses placed in semesters later than the one in which it is needed. Patterns D and I were significantly more frequent in the study plan graph than in random graphs. This is a desirable property, as it reflects that knowledge will flow correctly provided that the corresponding courses are assigned to different semesters.
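
The concern about cyclic structures can be checked mechanically. The sketch below uses Python's standard `graphlib` module (available from Python 3.9) to look for an ordering of courses that respects every dependency arc; if the arcs contain a cycle, no such ordering exists. The function name `prerequisite_order` and the example arcs are hypothetical, and this check is an illustration rather than a step described in the study.

```python
from graphlib import CycleError, TopologicalSorter


def prerequisite_order(arcs):
    """Return courses in an order that respects every arc (i, j), or None if cyclic."""
    predecessors = {}
    for i, j in arcs:
        predecessors.setdefault(j, set()).add(i)
        predecessors.setdefault(i, set())
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


# Acyclic example: A before B before C, with a direct arc from A to C.
valid_order = prerequisite_order([("A", "B"), ("B", "C"), ("A", "C")])
# Cyclic example: each course needs knowledge from a later one.
cyclic_order = prerequisite_order([("A", "B"), ("B", "C"), ("C", "A")])
```

A `None` result signals that at least one course would need knowledge that is only taught later, which is the situation the cyclic pattern represents.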

https://doi.org/10.1371/journal.pone.0248208.g015

In this paper we have proposed a new tool to visualize curriculum design from a network analysis point of view, using the natural tools and concepts of graph theory. Although only a few studies combine network analysis with tools for curriculum design, the methodology proposed here provides a new view of the structure and functionality of different curriculum designs. In particular, we propose to first construct the graph following the recommendations given in section 3, then check for inconsistencies with the help of the graph visualization. If the study plan satisfies all the requirements, proceed with the network analysis to enrich the information on its performance; otherwise, return to the previous stage and reallocate courses among the semesters until all the requirements are satisfied.

From this network analysis, it is possible to detect incongruences or mistakes in the study plan automatically. From a node-level point of view, we can identify the main courses in the study plan, or the courses that require detailed performance monitoring because their influence on other courses is high. It is also important to mention that the relation between nodes in this network is based on what one course needs from the others. The community detection algorithm allows professors who teach related subjects to coordinate. In this sense, the natural groups of courses can be identified by applying a community detection procedure to the course network of a study plan.

Finally, it is also relevant to mention that the general or topological properties of the whole network provide interesting information about the whole study plan. Topological measures such as network density and degree assortativity, among others, indicate for example whether relations between courses in the study plan are more likely among similar courses or among dissimilar ones.

One limitation of the present study concerns the treatment in the model of the prerequisite requirements of mandatory courses. We regard this issue as a very interesting line of future research.

This article has provided a step-by-step procedure to analyze the key courses in a study plan (and to decide whether small modifications are advisable), identify the natural groups of courses that should be coordinated, detect incongruences in the plan, and assess the robustness of the study plan or its level of connection, which allows different plans to be compared from a structural point of view when deciding what to design. Still, meeting all the requirements may be non-trivial, and a future research item is to extend this work with a detailed algorithm to assign the courses to semesters efficiently.

Supporting information

https://doi.org/10.1371/journal.pone.0248208.s001

Acknowledgments

The authors are very grateful to two anonymous referees, whose help considerably improved this paper. This research has been significantly enhanced by the help and advice of the professors who teach in the mathematics degree at URJC.

  • 1. Scott J. Social network analysis. Sage.; 2017.
  • 5. Newman MEJ. Networks: an introduction. Oxford University Press.; 2010.
  • 14. Gonzalez J, Wagenaar R. TUNING Educational Structures. 2000; ISBN: 978-84-9830-642-2
  • 20. Slim A., Kozlick J., Heileman G. L., Wigdahl J., Abdallah C. T. Network analysis of university courses. Proceedings of the 23rd International Conference on World Wide Web. 2014; 713–718.
  • 21. Wong W. Y., Lavrencic M. Using a Risk Management Approach in Analytics for Curriculum and Program Quality Improvement. 6th international conference on learning analytics and knowledge, 1st learning analytics for curriculum and program quality improvement workshop Edinburgh. 2016;10–14.
  • 22. Akba M. İ. Basavaraj P., Georgiopoulos M. Curriculum GPS: an adaptive curriculum generation and planning system. Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC). 2015.
  • 23. Slim A., Heileman G. L., Kozlick J., Abdallah C. T. Employing markov networks on curriculum graphs to predict student performance. 2014 13th International Conference on Machine Learning and Applications. IEEE. 2014; 415–418.
  • 38. Davis JA, Leinhardt S. The Structure of Positive Interpersonal Relations in Small Groups. Berger J. (Ed.), Sociological Theories in Progress.; 1972; 2: 218–251. Boston: Houghton Mifflin.
  • 42. Chung F., Lu L. Complex graphs and networks. CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC. 2006; 107.
  • 45. Azizifard N., Mahdavi, M., Nasersharif, B. Modularity optimization for clustering in social networks. International Conference on Emerging Trends in Computer and Image Processing. 2011; 52–55.
  • 46. Li, L., Du, M., Liu, G., Hu, X., Wu, G. Extremal optimization-based semi-supervised algorithm with conflict pairwise constraints for community detection. IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM). 2014; 180–187.

Original research article: Communication network robust routing optimization in an integrated energy cyber–physical system based on a random denial-of-service attack

  • 1 Electric Power Engineering, Shanghai University of Electric Power, Shanghai, China
  • 2 Research Center on High-Productivity Computing Systems, Zhejiang Lab, Hangzhou, China

The integration of power grids and communication networks in smart grids enhances system safety and reliability but also exposes vulnerabilities to network attacks, such as Denial-of-Service (DoS) attacks targeting communication networks. A multi-index evaluation approach is proposed to optimize routing modes in integrated energy cyber-physical systems (IECPS) considering potential failures from attacks. Security and economic service evaluation indexes are incorporated to quantify the significance of information flow routing. An optimization model for electric, heat, and gas routing in worst-case scenarios is formulated and solved using a column and constraint generation algorithm. The optimized routing method effectively circumvents specified attack areas, reducing the correlation degree of communication links within the attack area. Comparison with single-service optimization methods demonstrates the superiority of the proposed approach in mitigating the impact of network attacks on IECPS. The study highlights the importance of considering security and economic factors in optimizing routing modes to enhance the resilience of integrated energy cyber-physical systems against network attacks, particularly DoS attacks on communication networks. The evaluation index approach presented in this study provides a comprehensive method for assessing the importance of communication links in IECPS and optimizing routing modes to improve system robustness and reliability in the face of network attacks.

1 Introduction

In the process of smart grid development, the communication network plays an increasingly important role. The evolution of cyber–physical power systems (CPPSs) is also moving toward a more integrated direction, enhancing the security and reliability of the overall system ( Popat et al., 2021 ). However, the coupling of communication networks with the grid can introduce additional risks, potentially resulting in serious consequences ( Siu et al., 2022 ; Solanki et al., 2022 ). One such risk is the superposition of risks caused by the interference of a communication network with the transmission of information. This interference can lead to the occupation of routing bandwidth, ultimately affecting the generation of control commands and subsequently impacting the system frequency. This interference, known as a denial-of-service (DoS) attack, poses a significant threat to the stability and functionality of the system ( Hu et al., 2020 ). To mitigate the harm caused by DoS attacks and improve the overall stability of the system, extensive research has been conducted in various countries ( Hu et al., 2020 ; Gupta et al., 2021 ; Kakadiya et al., 2022 ). This research focuses on enhancing the security control of communication systems, implementing routing load-balancing strategies for communication networks, and improving communication performance.

Numerous studies have been conducted on communication network security and network attacks, particularly in the context of CPPSs. Security control is a widely adopted strategy to counter network attacks in CPPSs (Dai et al., 2023; Wang et al., 2023). This strategy can be implemented through three main approaches: the stochastic system approach, the game theory approach (Li et al., 2017), and the resilient control approach (Franze et al., 2020). Security control methods are effective in safeguarding system stability. However, designing control systems that can effectively handle DoS attacks in different scenarios can be challenging, as these attacks often require specific corrections. For instance, in the case of an asynchronous DoS attack affecting two channels, even a lower attack frequency than that in a synchronous DoS attack can result in a prolonged failure of control signal updates (Li et al., 2023). For the integrated energy cyber–physical system (IECPS), state estimation under a DoS attack is also difficult to carry out. In such scenarios, the commonly used brake strategy in control system designs may not be sufficient to ensure timely updates of control signals (Lv et al., 2023). In addition, the impact of a DoS attack on routers is far from negligible. Communication delays and data loss can affect the robustness of security controls (Wang et al., 2023). Consequently, the performance of the system may be compromised. This calls for a redesign of the active security control approach to address specific challenges (Li et al., 2021). Kumar et al. (2020) provided an optimal control technique that mitigates oscillations under steady-state conditions and slow response under dynamically changing conditions. Kumari et al. (2023), Saxena et al. (2021), and Kumar et al. (2019) built on this by taking into account the impact of the dynamic environment. Kumar et al. (2023), on the other hand, employed a model control scheme without voltage sensors to predict the subsequent operating state of the system and improve the response speed in the face of attacks.

Compared to designing security control methods, information flow scheduling from the perspective of communication networks is a more efficient and economical approach. Given the interdependence between the communication network and the power grid, it is crucial to improve the performance of the CPPS when facing chain-coupled failures caused by network attacks. This requires the establishment of a reliable and optimal information delivery path, which can be achieved through an effective routing method that minimizes risk.

In a communication network, each communication path typically exhibits varying communication performance and reliability. The task of selecting the optimal communication path for service information flow, based on specific objectives, is known as a routing optimization problem ( Karamdel et al., 2022 ; Kong and Jiang, 2022 ; Li et al., 2022 ). This problem can be further divided into routing balance optimization ( Hammoudeh and Newman, 2015 ; Cai et al., 2022 ) and communication performance optimization ( Du et al., 2022 ), depending on the optimization goals. To address the issue of load imbalance on certain communication links and nodes resulting from the shortest routing, Zhang et al. (2019) proposed a load-balancing optimization method for power communication networks. This method optimizes the routing approach to reduce the load imbalance on communication links, thus achieving a more reasonable distribution of service information flow and alleviating the communication burden on highly loaded links. Zhao et al. (2021) proposed a decentralized load frequency control method for dealing with the impact of cyberattacks on networked power systems. The method combines game theory and optimization algorithms with an optimization analysis of a high percentage of renewable power systems. Considering the shared risk inherent in the laying of fiber optic links in communication networks, Li et al. (2014) developed a routing optimization model for power communication networks that incorporates risk balance. By integrating the importance of different services, the routing method can be optimized to reduce the average risk associated with each communication service. For enhancing the communication performance of the network, Ti et al. (2022) developed a reliable routing optimization model for power communication networks that takes into account communication delay constraints, routing hop count constraints, and reliability constraints. 
This model aims to reduce the congestion of communication nodes by optimizing routing, thereby achieving a more efficient allocation of node communication resources.

Indeed, the existing approaches mentioned above do not fully consider the interdependence between energy and communication networks, even though initial faults in the communication network can aggravate grid-side faults. To address this issue, a solution was proposed by Kong (2019) that focuses on optimizing power-disjoint communication routes between power nodes and control centers. This approach aims to prevent the propagation of initial faults and mitigate inter-network cascading failures. Based on this work, Kong (2020) further investigated the power supply dependency of routers in the routing optimization process. They developed a communication routing failure probability model and quantified the impact of routing failures in terms of load loss. Zhang et al. (2022) modeled the dependence of communication and physical networks and further analyzed the effect of coupling relationships on chain failures. By optimizing the routing approach, they aimed to minimize the amount of load loss triggered by initial routing failures. However, it is important to note that these methods have only been verified in single-energy power grids. In the context of current multi-energy systems, which exhibit direct interdependence between energy and communication networks, the challenges posed by network risks are even more complex. Multi-energy systems, such as the IECPS, introduce additional complexity due to the presence of multiple energy flow nodes ( Pazouki et al., 2021 ; Ding et al., 2022 ). Using a single control service to accurately capture the importance of such systems becomes challenging ( Soltan et al., 2019 ). Recent security incidents in the IECPS, such as the cyberattack on the Ukrainian power grid in 2015, highlight the potential for substantial losses when control servers of underlying generators and substations are compromised. 
This emphasizes the need for research on cyber security in integrated energy systems, expanding the scope of the CPPS to include the IECPS, and developing modeling and optimization methods specific to the IECPS. With the increasing integration of electric, heat, and gas networks, responsible for the conversion, transmission, and data communication of heterogeneous energy flow across different regions, cyberattacks on the IECPS can have wide-ranging and profound impacts. Therefore, it is crucial to focus on the cyber security of integrated energy systems, conduct research on IECPS modeling and optimization, and prioritize these areas in national energy security strategies.

Currently, there is limited literature available on the study of the IECPS under the context of cyberattacks. Furthermore, there is no research on optimizing IECPS communication networks, specifically considering DoS attacks. Therefore, this paper aims to address the following challenges in establishing optimal routing for the IECPS: first, there is a need for a routing protocol that takes into account the interdependencies between different networks within the IECPS. It should minimize the negative impact that may arise when the optimal routing approach of the communication network does not align with the optimal routing approach of the power system. Second, there is a need for a multi-service importance evaluation method that considers the diverse energy networks present in the IECPS. This method should appropriately assess the importance of different energy networks within the system. Taking these challenges into account, based on previous research ( Ti et al., 2022 ), this paper explores the robust optimization of routing under DoS attack scenarios in the communication network of an integrated energy system that consists of electric, heat, and gas energy. The key contributions of this paper are as follows:

1. Proposing an evaluation index for the importance of a dual-service routing method that takes into account the safety and economy of an integrated energy system that includes three forms of energy.

2. Establishing a robust optimization model for IECPS routing including electric, heat, and gas energy with the objective of minimizing the correlation degree of operations in the high-risk area of stochastic DoS attacks.

3. Utilizing the column and constraint generation algorithm to solve the optimization problem with the objective of minimizing service correlation in high-risk areas. Based on the degree of associated operations, it proves to be superior to traditional single-service optimization methods.

The remainder of this paper is organized as follows: Section 2 introduces the concept of the IECPS and its layers; Section 3 proposes an importance evaluation index of routing methods considering dual services; Section 4 presents an IECPS communication network routing robust optimization method; the simulation results are analyzed in Section 5; and finally, the conclusion of this work is summarized in Section 6.

2 IECPS communication network modeling

2.1 Basic concepts of the IECPS

The overall framework of the IECPS, which is composed of electricity, heat, and gas energy, is shown in Figure 1. At the functional level, it can be divided into the energy layer, transmission layer, and information layer.

Figure 1. Integrated energy cyber–physical system (IECPS) overall architecture.

2.2 Communication network and routing modeling

The transmission layer in the IECPS is mainly composed of a communication network responsible for the production control and information management of electric, heat, and gas energy. The communication network includes communication substations and communication links. For power communication network modeling, Xin et al. (2015) and Li et al. (2020) abstracted the multi-dimensional and multi-level information network as a directed graph composed of data nodes and network branches. A data node represents the dataset of input and output information of the various modules in the power system, while a directed branch represents the processing and transmission of information. Following this approach, the communication network in the IECPS is abstracted into a directed graph $G$ composed of nodes and branches: $G = (v_c \cup v_s \cup v_e \cup v_h \cup v_g,\; e_c \cup e_s \cup e_e \cup e_h \cup e_g)$.

In order to describe the topological relationship between communication substations and communication links in the IECPS communication network, the adjacency matrix of the communication network is defined as $A_G$. $A_G$ is a square matrix of order $M_c + M_s + M_e + M_h + M_g$. Its rows and columns are arranged in the order $c$, $s$, $e$, $h$, $g$. The corresponding element is represented as

$$A_{G,ij} = \begin{cases} 1, & \text{if there is a communication link between nodes } i \text{ and } j, \\ 0, & \text{otherwise,} \end{cases}$$

where an element equal to 1 means that there is a communication link between the two substation nodes, and an element equal to 0 means that there is no communication link between them.
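
As a small illustration of this definition, the Python sketch below builds such a 0/1 adjacency matrix from a list of links. The node names, their ordering and the links are hypothetical placeholders, and the matrix is filled symmetrically on the assumption that a communication link can be used in both directions.

```python
def adjacency_matrix(node_order, links):
    """Build a symmetric 0/1 adjacency matrix with rows/columns in node_order."""
    index = {name: pos for pos, name in enumerate(node_order)}
    size = len(node_order)
    matrix = [[0] * size for _ in range(size)]
    for a, b in links:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


# Hypothetical nodes, listed in the order c, s, e, h, g used in the text.
node_order = ["c1", "s1", "s2", "e1", "h1", "g1"]
links = [("c1", "s1"), ("c1", "s2"), ("s1", "e1"), ("s1", "h1"), ("s2", "g1")]
A = adjacency_matrix(node_order, links)
```

Keeping a fixed `node_order` matters because every other matrix defined on the same network has to use the same row and column arrangement.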

The adjacency matrix $A_G$ reflects the topology of the communication links in the IECPS communication network. When the information layer processes a certain service, the information flow travels along a certain path through the communication links, which constitutes the routing method of the information flow. As shown in Figure 2, only the information flow at the communication network level is considered, and the control center and routers are mapped onto the same plane for analysis. A given service may have multiple information flows (for example, a load control service that reduces the load of multiple nodes simultaneously), and the routing methods are diverse. To represent the multiple routing methods of a service information flow clearly, and to account for both primary and backup routes, the primary and backup routes of the service information flow are defined as matrices $X_{kq}$ and $Y_{kq}$, whose structure is similar to that of the adjacency matrix $A_G$, and the corresponding elements are expressed as

Figure 2. Simple IECPS communication network routing.

$$X_{kq,ij} = \begin{cases} 1, & \text{if information flow } q \text{ of service } k \text{ passes through link } i\text{–}j, \\ 0, & \text{otherwise,} \end{cases}$$

where $k = 1, 2, \ldots, N$ indexes the information layer services and $N$ is the total number of services; $q = 1, 2, \ldots, D$ indexes the information flows of service $k$, and $D$ is the total number of information flows of service $k$. $X_{kq,ij}$ is an element of the primary routing matrix.
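
A routing matrix of this kind can be built directly from a path. The following Python sketch is a hypothetical illustration: the four-node network, the node names and the two paths are invented, and entries are set symmetrically on the assumption that the matrix only records which links a flow uses, not the direction of travel. A simple consistency check confirms that a route uses only links that exist in the adjacency matrix.

```python
def routing_matrix(node_order, path):
    """0/1 matrix marking the links used by one information flow along path."""
    index = {name: pos for pos, name in enumerate(node_order)}
    size = len(node_order)
    matrix = [[0] * size for _ in range(size)]
    for a, b in zip(path, path[1:]):
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


def uses_only_existing_links(route, adjacency):
    """True if every link marked in route is also present in adjacency."""
    size = len(adjacency)
    return all(
        route[i][j] <= adjacency[i][j]
        for i in range(size) for j in range(size)
    )


# Hypothetical ring network: c1 - s1 - s2 - s3 - c1.
node_order = ["c1", "s1", "s2", "s3"]
adjacency = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
primary = routing_matrix(node_order, ["s2", "s1", "c1"])  # plays the role of X
backup = routing_matrix(node_order, ["s2", "s3", "c1"])   # plays the role of Y
```

In this example the primary and backup routes share no link, which is the property that makes a backup useful when part of the network fails.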

3 The importance evaluation index of the routing mode considering dual services

3.1 Importance index of information flow

The state of information flow is closely related to the safety and economy of IECPS operation. However, different information flows have different influences on the safety and economy of IECPS operation, which involves three different importance factors: 1) the importance of different business information flows; 2) the importance of different information flows in the same business; and 3) the importance of different routing methods. There is a one-to-many relationship between the information flow and routing mode. When the information layer optimization decision function is fixed and the communication link is normal and free from damage and interference, the importance of certain information flow is also fixed. However, the same information flow corresponds to many different routing modes, and the way to assign information flows to more important and reliable links or choose a reliable routing method for information flows is the key issue of IECPS routing optimization, while determining the importance of information flows and routing methods is the pre-requisite for routing optimization. In order to evaluate the importance of information flow, the importance index of substation information flow q of service k is defined as

It should be noted that since the substations, except the substations of the energy station, correspond to only one form of energy, the load reduction and nodal operating cost of the other energy forms involved in the calculation of $I_{kq}$ in this case should be 0.

The importance index of information flow $I_{kq}$ quantitatively indicates the importance of information flow q of service k in terms of the loss of IECPS operational safety and economy caused by physical side failures. For the load control service, the load reduction of energy network nodes corresponding to different substation information flows under different failure scenarios is different, and the sum of the load reduction penalty costs of energy network nodes corresponding to information flow q in each fault scenario is taken as the importance index of substation information flow q. For the economic dispatch business, the operating cost of energy network nodes corresponding to different substation information flows in different fault scenarios is different from that of nodes under the load control business. The sum of the absolute differences of the operating costs of energy network nodes corresponding to information flow q in each fault scenario under the two businesses is taken as the importance index of information flow q.

3.2 Importance index of communication links

The route of an information flow is composed of communication links in the communication network, and different service information flows can pass through the same communication link. On the basis of the evaluation index of information flow importance defined above, and considering the joint action of the load control business and the economic dispatch business, the communication link correlation business degree matrix $E$ is defined in Eq. 5 as

In Eq. 5 , α and β are the importance of load control service and economic dispatch service, respectively. Table 1 shows the “DDD” communication network planning report of the State Grid Corporation of China, which shows that α and β can be 0.94 and 0.62, respectively.
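One plausible reading of the associated service degree is a weighted sum, per link, of the importance indices of all flows routed over that link. The sketch below follows that reading; it is an assumption for illustration rather than a transcription of Eq. 5, and the route data, flow names, and the treatment of links as undirected are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Link = Tuple[str, str]
ALPHA, BETA = 0.94, 0.62  # service weights quoted in the text


def link_service_degree(
    load_routes: Dict[str, List[Link]],
    dispatch_routes: Dict[str, List[Link]],
    load_importance: Dict[str, float],
    dispatch_importance: Dict[str, float],
) -> Dict[FrozenSet[str], float]:
    """Accumulate weighted flow importance on every link a flow traverses."""
    degree: Dict[FrozenSet[str], float] = defaultdict(float)
    for flow, links in load_routes.items():
        for link in links:
            degree[frozenset(link)] += ALPHA * load_importance[flow]
    for flow, links in dispatch_routes.items():
        for link in links:
            degree[frozenset(link)] += BETA * dispatch_importance[flow]
    return dict(degree)


# Hypothetical example: one load control flow and one dispatch flow share link (C, E1).
degrees = link_service_degree(
    load_routes={"q1": [("C", "E1"), ("E1", "E2")]},
    dispatch_routes={"q2": [("C", "E1")]},
    load_importance={"q1": 10.0},
    dispatch_importance={"q2": 5.0},
)
print(degrees)
```

Under this reading, summing the values over the links of a high-risk area gives the regional figure that the case study compares before and after optimization.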

www.frontiersin.org

Table 1. Importance of different communication services.

4 IECPS communication network routing robust optimization modeling

4.1 Worst-case DoS attack scenario modeling

The following assumptions are made for the worst-case DoS attack scenario: the attacker tries to make the attack traffic exceed the total bandwidth, causing a service outage; the attack point is the node that carries the most information flows under shortest-path routing; and the duration of the attack is bounded by a linear function of time. That is, for any $t \ge t_0 \ge 0$, there exist $\tau_0$ and $1 > \zeta > 0$ satisfying
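A duration bound of this kind is commonly written in the DoS literature in the form below. It is given here as an assumption consistent with the constants named in the text; the symbol $\Xi(t_0, t)$, denoting the set of instants in $[t_0, t)$ at which the attack is active, is introduced for illustration and is not taken from the paper.

```latex
\left| \Xi(t_0, t) \right| \le \tau_0 + \zeta \, (t - t_0), \qquad \forall \, t \ge t_0 \ge 0
```

With $\zeta < 1$, the attack can be active for at most a fraction $\zeta$ of any sufficiently long interval, so communication is never denied permanently.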

Once attacked by DoS, a substation loses its communication ability, and the links directly connected to it can no longer complete their data transmission tasks. The link fault matrix under the DoS attack fault set $\Phi$ is denoted $F_{\mathrm{DoS},k}$; when substation $k$ is attacked by DoS, $F_{\mathrm{DoS},k}$ can be expressed as

The link fault matrix $F_{\mathrm{DoS},k}$ reflects the operating state of the communication links: if substation $k$ is attacked by DoS, the elements of $F_{\mathrm{DoS},k}$ corresponding to the links connected to substation $k$ are 1, indicating that those links have lost their communication ability.
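As a small illustration of this definition, the sketch below builds a fault matrix from an adjacency matrix by marking every link incident to the attacked substation. The three-node topology and the function name are hypothetical, and links are assumed to be undirected.

```python
from typing import List


def link_fault_matrix(adjacency: List[List[int]], k: int) -> List[List[int]]:
    """Return a 0-1 matrix with 1 on every link incident to attacked node k."""
    n = len(adjacency)
    fault = [[0] * n for _ in range(n)]
    for j in range(n):
        if adjacency[k][j]:
            # An undirected link (k, j) fails in both directions.
            fault[k][j] = 1
            fault[j][k] = 1
    return fault


# Hypothetical chain topology 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
fault_when_0_attacked = link_fault_matrix(adjacency, k=0)
fault_when_1_attacked = link_fault_matrix(adjacency, k=1)
print(fault_when_0_attacked)
print(fault_when_1_attacked)
```

Attacking the middle node disables both links of the chain, which is why nodes that carry many flows are the natural worst-case targets.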

When DoS attacks disable links on both the active and the standby route of a service information flow, the flow is interrupted. To determine whether the flows of the two services are interrupted, $r_{lq}$ and $r_{dq}$ are defined as the interruption discriminant variables of information flow $q$ of the load control service and the economic dispatch service, respectively:

According to Eqs 8 and 9, $r_{lq}$ and $r_{dq}$ are 0–1 discriminant variables: a value of 0 indicates that the service information flow is not interrupted, and a value of 1 indicates that it is interrupted. A DoS attack on communication substation $k$ may affect the two types of service information flows differently, and there are several possibilities, i.e.,
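The interruption rule described above (a flow is lost only when both its active and its standby route cross a failed link) can be sketched as follows. The helper names and the example routes are hypothetical, and links are again treated as undirected.

```python
from typing import FrozenSet, List, Set, Tuple

Link = Tuple[str, str]


def route_hits_fault(route: List[Link], failed: Set[FrozenSet[str]]) -> bool:
    """True if any link of the route is in the failed-link set."""
    return any(frozenset(link) in failed for link in route)


def flow_interrupted(
    active: List[Link], standby: List[Link], failed: Set[FrozenSet[str]]
) -> int:
    """0-1 discriminant: 1 only when both routes are broken."""
    return int(route_hits_fault(active, failed) and route_hits_fault(standby, failed))


# Hypothetical attack on substation E2: both of its links fail.
failed_links = {frozenset(("E1", "E2")), frozenset(("E2", "E3"))}
active_route = [("C", "E1"), ("E1", "E2")]

survives = flow_interrupted(active_route, [("C", "E4"), ("E4", "E5")], failed_links)
lost = flow_interrupted(active_route, [("C", "E3"), ("E3", "E2")], failed_links)
print(survives, lost)
```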

In addition, DoS attacks on communication substation k will cause a communication interruption of multiple links, which may cause the interruption of multiple service information flows. So, the worst DoS attack scenario will occur under the interruption of multiple information flows, and the worst scenario under the fault set Φ of DoS attacks on a certain area of the communication network is

It is worth noting that attack frequency and attack duration are the main factors that characterize DoS attacks.

4.2 Routing optimization modeling

This paper considers the robust routing optimization problem of the IECPS communication network under the worst-case DoS attack scenario, with the aim of improving the resilience of the IECPS in extreme scenarios. The problem optimizes the information flow routing of the load control and economic dispatch services so as to minimize the associated service degree in areas at high risk of DoS attacks. The objective function is as follows:

The decision variables of the objective function are $x_{lq,ij}$, $y_{lq,ij}$, $x_{dq,ij}$, and $y_{dq,ij}$, which determine the routing of information flow $q$ of the load control service and the economic dispatch service. The number of information flows differs between the two services: the economic dispatch service dispatches only the electric, heat, and gas source nodes, whereas the load control service controls every load node as well as the electric, heat, gas, and energy station nodes. The load control service therefore has more information flows than the economic dispatch service and correspondingly more decision variables. In Eq. 12, $\delta$ and $\chi$ are both small positive constants used to set the priority of the optimization objectives. A smaller $\delta$ makes the optimized routing first satisfy communication network resilience under the worst attack scenario and only then reduce the associated service degree of the routing. A smaller $\chi$ makes the optimized primary route satisfy these objectives with higher priority than the standby route.

In order to ensure the rationality and effectiveness of the optimized active and standby routing methods, the above robust routing optimization problem should meet the following constraints.

4.2.1 Communication network topology constraints

Eqs 13 and 14 refer to the topology constraints of the source node, end node, and intermediate node in the main routing method of the load control service and economic dispatching service information flow, respectively. The formulas in brackets represent the restrictions on the information flow in and out of the source node, end node, and intermediate node in the routing method, i.e., the source node can only outflow information, the end node can only receive information, and the intermediate node can both send and receive information. Eqs 15 and 16 are the topology constraints of the source node, terminal node, and intermediate node in the information flow backup routing mode of load control service and economic dispatching service, respectively.
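The node balance that these constraints impose can be checked on a candidate route as in the sketch below. It is an illustrative checker rather than the paper's formulation: a route is given as a list of directed links, and the function name and test topology are hypothetical.

```python
from typing import Iterable, List, Tuple

Link = Tuple[str, str]


def satisfies_topology(
    links: List[Link], source: str, sink: str, nodes: Iterable[str]
) -> bool:
    """Check the per-node in/out balance of a single directed route."""
    out_deg = {n: 0 for n in nodes}
    in_deg = dict(out_deg)
    for i, j in links:
        out_deg[i] += 1
        in_deg[j] += 1
    for n in out_deg:
        if n == source:
            ok = out_deg[n] == 1 and in_deg[n] == 0  # source only sends
        elif n == sink:
            ok = in_deg[n] == 1 and out_deg[n] == 0  # end node only receives
        else:
            ok = in_deg[n] == out_deg[n] <= 1  # relay forwards what it gets
        if not ok:
            return False
    return True


nodes = ["C", "E1", "E2", "E3"]
complete_route = [("C", "E1"), ("E1", "E2")]
broken_route = [("C", "E1")]  # stops at E1 and never reaches E2
print(satisfies_topology(complete_route, "C", "E2", nodes))
print(satisfies_topology(broken_route, "C", "E2", nodes))
```

The same check applies unchanged to backup routes, since Eqs 15 and 16 are described as the analogous constraints for the backup routing mode.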

4.2.2 Active and standby route non-coincidence constraints

In order to prevent the failure of all the active and standby routing methods due to the failure of a communication link, the active and standby routing methods of information flow q are required to have no overlapping links. The restrictions on the active and standby routing methods of two types of traffic flows are as follows:
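In terms of 0-1 link variables, this requirement amounts to the active and standby routes of a flow never selecting the same link. A minimal check, with hypothetical route data and undirected links assumed, is:

```python
from typing import List, Tuple

Link = Tuple[str, str]


def routes_disjoint(active: List[Link], standby: List[Link]) -> bool:
    """True if the active and standby routes share no link."""
    active_links = {frozenset(link) for link in active}
    standby_links = {frozenset(link) for link in standby}
    return active_links.isdisjoint(standby_links)


active = [("C", "E1"), ("E1", "E2")]
disjoint_standby = [("C", "E3"), ("E3", "E2")]
overlapping_standby = [("C", "E3"), ("E3", "E1"), ("E2", "E1")]
print(routes_disjoint(active, disjoint_standby))
print(routes_disjoint(active, overlapping_standby))
```

Note that the check is on links only: the two routes may still meet at a shared node, so a DoS attack on that node can break both routes at once.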

4.2.3 Attack result identification constraints

The discriminant equation for two types of service information flow interruption is shown in Eqs 8 – 10 . However, the discriminant constraint is nonlinear, which is not convenient for solving the subsequent robust optimization problem, so it needs to be linearized into a linear constraint. Rewriting Eqs 8 , 9 yields

Linearizing Eq. 10 together with Eqs 18 and 19 gives

Eqs 20–22 correspond to the load control service information interruption constraint, economic dispatching service information flow interruption constraint, and both service information flow interruption constraints, respectively.
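A standard way to linearize a logical AND of two binary variables, such as "both routes fail" or "both services are interrupted", uses three linear inequalities. The sketch below verifies that textbook construction by enumeration; it is not claimed to be the exact form of Eqs 20–22, and the function names are hypothetical.

```python
from itertools import product
from typing import List


def and_constraints_hold(a: int, b: int, r: int) -> bool:
    """Linear constraints that force r = a AND b for binary a, b, r."""
    return r <= a and r <= b and r >= a + b - 1


def feasible_values(a: int, b: int) -> List[int]:
    """All binary r that satisfy the linear constraints for given a, b."""
    return [r for r in (0, 1) if and_constraints_hold(a, b, r)]


# For every binary pair, exactly one r is feasible and it equals the product a*b.
table = {(a, b): feasible_values(a, b) for a, b in product((0, 1), repeat=2)}
print(table)
```

Because the only feasible value always equals the product, the nonlinear term can be replaced by these inequalities without changing the solution set of the mixed-integer model.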

4.2.4 Communication link bandwidth constraints

The number of optical paths $c_{ij}$ available on a communication optical cable is limited, and the optical paths occupied by the two kinds of service information flows passing through the same communication link must not exceed $c_{ij}$, i.e.,
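The capacity requirement can be illustrated by counting how many routed flows use each link and comparing the count with the link's limit. The sketch assumes, for illustration only, that each flow occupies one optical path per link; the route and capacity values are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Link = Tuple[str, str]


def link_usage(routes: Iterable[List[Link]]) -> Counter:
    """Count how many routes traverse each (undirected) link."""
    usage: Counter = Counter()
    for route in routes:
        for link in route:
            usage[frozenset(link)] += 1
    return usage


def within_capacity(
    routes: Iterable[List[Link]], capacity: Dict[FrozenSet[str], int]
) -> bool:
    """True if no link carries more flows than its available optical paths."""
    usage = link_usage(routes)
    return all(count <= capacity.get(link, 0) for link, count in usage.items())


# Two flows (one per service) share link (C, E1).
routes = [[("C", "E1"), ("E1", "E2")], [("C", "E1")]]
capacity = {frozenset(("C", "E1")): 2, frozenset(("E1", "E2")): 1}
print(within_capacity(routes, capacity))
```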

Thus, the communication network robust routing optimization (CNRRO) model can be expressed as

CNRRO is a bi-level mixed-integer linear programming problem with an uncertainty set, which is a typical non-deterministic polynomial (NP)-hard problem, i.e., every problem in NP can be reduced to it in polynomial time. The logical architecture of the optimization problem is shown in Figure 3. The upper layer is the main routing optimization problem, which reduces the associated service degree of the communication links, and the lower layer is the subproblem that finds the worst-case DoS attack scenario and supplies it to the upper layer.

Figure 3. Communication network robust routing optimization (CNRRO) logical architecture.

4.3 Solving method

As CNRRO is an NP-hard problem, its computational complexity grows with the number of communication network nodes. To reduce the computational burden, the column-and-constraint generation (CCG) algorithm is used to solve the model. For convenience, the model is first expressed in standard form:

Eq. 26 corresponds to the communication network topology constraints (Eqs 13 and 14), Eq. 27 represents the uncertain fault set of the communication substations, and Eq. 28 represents the functional relationship among the inner decision variables, the outer decision variables, and the uncertain variables, corresponding to Eqs 20–22.

In order to solve the problem hierarchically and iteratively, the standard form of the routing robust optimization model is transformed into a main problem and a subproblem, as shown below.

Main problem:

Subproblem:

Solving the main problem yields the decision variable $z^*$ that minimizes the associated service degree of the communication links in the area subject to DoS attacks, and this optimal solution is passed to the subproblem. Solving the subproblem then yields the worst scenario $u_l^*$ in the attacked area. This scenario is passed back to the main problem, and the two problems are solved iteratively until convergence, so that the final main and standby routes are robust and resilience to extreme scenarios is improved. The specific steps are as follows.

Algorithm 1. CCG algorithm flow.
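The main-problem and subproblem iteration described above can be sketched on a toy problem in which both problems are solved by enumeration. This is only a schematic of the CCG loop: in the actual model each step is a mixed-integer program handled by a solver, and the routing names, scenario names, and cost table below are hypothetical.

```python
from typing import Callable, List, Tuple


def ccg(
    routings: List[str],
    scenarios: List[str],
    cost: Callable[[str, str], float],
    sigma: float = 1e-4,
    max_iter: int = 100,
) -> Tuple[str, float]:
    """Min over routings of the max over scenarios, via scenario generation."""
    active = [scenarios[0]]  # scenarios currently known to the main problem
    upper, best = float("inf"), routings[0]
    for _ in range(max_iter):
        # Main problem: best routing against the scenarios found so far.
        z_star = min(routings, key=lambda z: max(cost(z, u) for u in active))
        lower = max(cost(z_star, u) for u in active)
        # Subproblem: worst scenario for that routing.
        u_star = max(scenarios, key=lambda u: cost(z_star, u))
        worst = cost(z_star, u_star)
        if worst < upper:
            upper, best = worst, z_star
        if upper - lower <= sigma:  # bounds have met
            return best, upper
        active.append(u_star)  # add the new worst case and repeat
    raise RuntimeError("CCG did not converge within max_iter iterations")


# Hypothetical associated service degree of each routing under each attack.
COSTS = {
    "shortest": {"attack_E1": 10, "attack_E2": 2, "attack_E3": 1},
    "detour_north": {"attack_E1": 3, "attack_E2": 6, "attack_E3": 2},
    "detour_west": {"attack_E1": 4, "attack_E2": 4, "attack_E3": 5},
}
result = ccg(list(COSTS), ["attack_E1", "attack_E2", "attack_E3"],
             lambda z, u: COSTS[z][u])
print(result)
```

The lower bound comes from the main problem, which sees only a subset of scenarios, and the upper bound from the subproblem, which evaluates a fixed routing against all of them; the loop stops when the two meet within the threshold $\sigma$.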

5 Case study

5.1 Original data

In this case, the IECPS consists of an energy network and a communication network. The energy network combines an IEEE 30-bus power network, a 14-bus heat network, and a 20-bus gas network. The communication network contains 68 communication network nodes, i.e., 30 power grid communication substations, 20 gas network communication substations, 14 heat network communication substations, 4 energy station communication substations, and 1 control center main station. The specific parameters of the example system are given by Shabanpour-Haghighi and Seifi (2015), and its topology is shown in Figure 4. The parameters of the example are set as follows: the rated bandwidth of each fiber link is 200 Mbit/s, and the rated optical path is 20 m. In the physical-side fault set, the outage probability of a transmission line is 2%, and the outage probability of a heating pipeline or a gas supply pipeline is 1%. In CNRRO, the small constants $\delta$ and $\chi$ used to distinguish priorities are both $10^{-4}$, and the convergence threshold $\sigma$ of the CCG algorithm is also $10^{-4}$.

Figure 4. IECPS communication network topology.

5.2 Analysis of routing optimization results of the IECPS communication network

To verify the effectiveness of the proposed IECPS robust routing optimization method, a certain area in each of the power grid, heat network, and gas network parts of the communication network is taken as the target area of the DoS attack. All communication substations in the target area are exposed to the risk of a DoS attack, under which the communication links directly connected to an attacked substation temporarily lose their communication function. In addition, to make the optimization results easier to read, links that are not directly connected to a substation in the attack area but pass through that area are also assumed to be out of operation. With each of the three areas in turn treated as the high-risk area, CNRRO is solved to optimize the routing of the IECPS economic dispatch and load control services. As a baseline for comparison, Figure 5 shows the routing before optimization, in which the economic dispatch and load control information flows of each substation follow the shortest path (the load control flows of the electric, heat, and gas source nodes are routed in the same way as the corresponding economic dispatch flows and are therefore not drawn separately). The attack areas listed in Table 2 are shown as the gray-shaded areas in Figure 5. Shortest-path routing sends a large number of service information flows through the high-risk areas and cannot avoid the risk of service interruption at the substations and links hit by a random DoS attack.
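The shortest-path baseline and its exposure to a high-risk area can be illustrated with a breadth-first search on a toy topology. The graph, node names, and high-risk set below are hypothetical and much smaller than the 68-node network of the case study; hop count is assumed as the path length.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_path(graph: Dict[str, List[str]], source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for a minimum-hop path, or None if unreachable."""
    parent: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


def flows_through_area(graph: Dict[str, List[str]], center: str,
                       substations: List[str], high_risk: Set[str]) -> int:
    """Count shortest-path flows that touch at least one high-risk node."""
    count = 0
    for substation in substations:
        path = shortest_path(graph, center, substation)
        if path and any(node in high_risk for node in path[1:]):
            count += 1
    return count


# Toy topology: C-A-D is the short way to D, C-X-Y-D is the detour.
graph = {"C": ["A", "X"], "A": ["C", "D"], "D": ["A", "Y"],
         "X": ["C", "Y"], "Y": ["X", "D"]}
path_to_d = shortest_path(graph, "C", "D")
exposed = flows_through_area(graph, "C", ["D", "Y"], high_risk={"A"})
print(path_to_d, exposed)
```

In this toy case the flow to D crosses the high-risk node A even though a longer safe detour exists, which is the behavior the robust model is designed to correct.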

Figure 5. Routing method of IECPS service information flow based on the shortest path before optimization.

Table 2. Denial-of-service (DoS) attack area.

1) Analysis of optimization results in scenario 1: The main route optimization results of the IECPS economic dispatch and load control services under a DoS attack on the power grid communication area are shown in Figure 9. After optimization, the information flows of both services for communication substation E8 reach E8 through E2 and E6, bypassing high-risk area $\phi_e$. The load control information flows of substations E30, E29, E26, E24, E23, E20, E19, E18, E15, and E14 in the north of the communication network detour northward over the link (E8, E28) on the east side of the network, avoiding the substations and links that may be attacked. Notably, the load control information of substations E17 and E16 is delivered after optimization over the link (EH3, E17) via the communication networks of the heat and gas networks, because no safe path on the grid side bypasses the high-risk area to reach these substations. Since area $\phi_e$ contains the energy station communication substation EH4, the communication link (EH4, G10) is at high risk, which threatens the delivery of the control center's economic dispatch commands to gas network substation G8. After optimization, the economic dispatch information flow of G8 instead departs to the west and is carried over the link (E1, G12), so the link (EH4, G10) is avoided. The service information flows of communication substations E12, E11, E10, and E21, however, still pass through the high-risk area after optimization, because all of their feasible paths do. The difference from the routing before optimization is that the load control flows of E12, E10, and E21 are now delivered from the north of the high-risk area and traverse fewer high-risk links; the fewer high-risk links a route involves, the lower the associated service degree of the high-risk area.

2) Analysis of optimization results in scenario 2: The main route optimization results of the IECPS economic dispatch and load control services under a DoS attack on the heat network communication area are shown in Figure 10. After optimization, the load control and economic dispatch information flows of EH2 and EH3 no longer pass through the link (H13, H11); instead, they are delivered via E3, G18, G19, H12, and H11, which removes their exposure to the DoS attack. The economic dispatch information sent from the control center to G8 via the energy station communication substation EH4 passes through E3, G18, G17, and G11 after optimization, a detour that avoids the link (H13, EH4) in area $\phi_h$. The load control flow of H7 is delivered from the northwest side after optimization, reducing the number of high-risk links it traverses. Compared with scenario 1, grid area $\phi_e$ is no longer a high-risk area, so the links in this area carry service information flows again after optimization. The robust routing optimization model thus adapts the routing scheme to the designated high-risk area.

3) Analysis of optimization results in scenario 3: The main route optimization results of the IECPS economic dispatch and load control services under a DoS attack on the gas network communication area are shown in Figure 11. After optimization, the economic dispatch information flows to G1, G2, and G14 no longer pass through the high-risk links (G13, G14), (G12, G13), and (E1, G12). However, because G13 and G5 are located inside the high-risk area, the optimized routes of their load control information flows cannot avoid it; the optimization can only reduce the number of high-risk links on those routes. For energy station EH2, the information flow originally delivered over (H9, EH2) is rerouted and delivered over (G1, EH2) after optimization, whereas the routes of the two service information flows of energy station EH3 are the same before and after optimization. This is because EH3 lies in the high-risk area and its route before optimization already contains only one link in that area, leaving no room for further optimization.

In addition to the main routes, the robust routing optimization model also optimizes the backup routes, as shown in Figures 6–8 (only the backup routes of substations affected by high-risk areas are shown). Because of the non-coincidence constraint between active and standby routes and the communication network topology constraints, many backup routes must traverse more links and longer distances to reach the target substation, or unavoidably pass through the high-risk areas. The backup routing results therefore struggle to reduce the degree of service association in high-risk areas and are not described further here.

Figure 6. Optimization result of the IECPS backup routing mode under denial-of-service (DoS) attack on the power grid.

Figure 7. Optimization result of the IECPS backup routing mode under DoS attack on the heat network.

Figure 8. Optimization result of the IECPS backup routing mode under DoS attack on the gas network.

In conclusion, the analysis of the three scenarios shows that the CNRRO model established in this paper adaptively avoids high-risk areas according to their locations and provides more secure routes for both service information flows, thereby avoiding or reducing the risk posed by DoS attacks.

5.3 Comparison of IECPS resilience before and after optimization

The routing optimization results described in the previous section should preserve the resilience of the communication network in the worst scenario. The worst scenario before optimization, determined according to Eq. 10, is shown in black in Figure 5, and the worst scenario after optimization is shown in black in Figures 9–11. The analysis of the worst scenario for each of the three high-risk areas is given in Table 3.

Figure 9. Optimization results of the main routing mode of IECPS services under the DoS attack on the power grid.

Figure 10. Optimization results of the IECPS service main routing mode under DoS attack on the heat network.

Figure 11. Optimization results of the IECPS service main routing mode under DoS attack on the gas network.

Table 3. Comparison of the worst scenarios before and after optimization.

Table 3 shows that, with CNRRO, the number of unreachable service information flows, the affected load, and the associated service degree in the worst scenario of regions $\phi_e$ and $\phi_h$ are significantly lower after optimization than under the shortest-path routing before optimization. However, the affected load in the worst scenario of region C is 0 both before and after optimization, because that worst-case scenario affects only the information flows of the economic dispatch service and not those of the load control service.

Specifically, according to the definition of association service degree in Eq. 5 , the routing method optimized by CNRRO will change the association service degree of communication network links. According to the optimization objective, the associated service degree of the communication link in the high-risk area should be reduced after optimization compared with that before optimization so as to ensure that the service volume of the high-risk area is reduced. According to the route optimization results, the association service degree of the affected links in the above three high-risk areas of the IECPS communication network is calculated, and the comparison with the association service degree of the links in high-risk areas under the shortest path before optimization is shown in Figure 12 . For the actual communication links corresponding to link numbers, see Table 4 .

Figure 12. Comparison of link-associated service degrees in high-risk areas before and after optimization: (A–C) link-associated service degrees of areas $\phi_e$, $\phi_h$, and $\phi_g$ in the main routing mode; (D–F) link-associated service degrees of areas $\phi_e$, $\phi_h$, and $\phi_g$ in the standby routing mode.

Table 4. Shortest route for load control services.

In Figure 12, panel (A) shows that the associated service degree of substations 1 and 2 is much higher than that of the other substations in the region, so a large amount of load would be affected if substation 1 were attacked. After optimization, the associated service degree of substations 1 and 2 is significantly reduced. Substation 5 in panel (B) has a very high associated service degree, indicating that this substation is at very high risk; the optimization significantly reduces this risk. The reason is that the main routes obtained by CNRRO bypass the high-risk areas to avoid the influence of DoS attacks, so the service information flows are redistributed to the remaining secure links. The number of service information flows passing through high-risk areas falls, and with it the associated service degree of the links in those areas, which matches the optimization goal of CNRRO. The backup routes correspond to a suboptimal solution of the optimization. Although the associated service degree of the high-risk areas under backup routing is lower than before optimization, the architecture of the example communication network prevents some backup routes from bypassing the high-risk areas, so the associated service degree of some links in those areas is higher than under the shortest path. In summary, the route optimization method proposed in this paper reduces the degree of service association in high-risk areas and improves the resilience of the IECPS communication network when service information flows are disturbed by DoS attacks.

5.4 Comparison of optimization results of the single-service index and double-service index

The simulation results compare the routing optimization results of only load control services, only economic dispatch services, and both services. The associated service degree of the high-risk area (the sum of the associated service degree of all links in the high-risk area) in the three cases is shown in Table 5 .

Table 5. Comparison of related business degrees in high-risk areas under different business indicators.

Different importance indexes directly affect the route optimization results, and the regional associated service degree differs accordingly. According to Table 5, the associated service degree of the power grid high-risk area $\phi_e$ and the heat network high-risk area $\phi_h$ under robust routing optimization based on the dual-service information flow importance index is significantly lower than under robust routing optimization based on a single-service importance index. Moreover, since the two regions carry more load control information flows than economic dispatch information flows, the regional associated service degree obtained when only the load control service index is considered is lower than that obtained when only the economic dispatch service index is considered. For the gas network high-risk area $\phi_g$, however, the associated service degree when only the load control service is considered is significantly lower than in the other two cases. This is because the gas network carries only four load control information flows, so considering the load control service alone leaves a large number of communication links with an importance index of zero, and the importance of those links cannot be evaluated effectively. In general, the robust routing optimization method that considers both security and economy performs better than the routing optimization method that considers only a single-service index.

6 Conclusion

A routing optimization model with the dual objectives of security and economy is proposed for IECPS communication networks under DoS attacks. The optimization problem is solved using the CCG algorithm. Comparison with a single-service optimization approach leads to the following conclusions.

(1) A robust optimization model for routing in IECPS communication networks is developed, considering the worst scenario of a DoS attack. The results demonstrate that by minimizing the associated business degree in high-risk areas, the proposed optimization method effectively routes economic dispatch and load control information flows to bypass these areas. Furthermore, it adaptively optimizes each business information flow as the high-risk area changes, improving the resilience of the IECPS communication network.

(2) An integrated energy system fiber optic communication network consisting of an improved IEEE 30-node power grid, a 14-node heat network, and a 20-node gas network is used as a numerical example to analyze how the associated service degree of the communication links in the attack area changes under the worst-case DoS attack scenario. The proposed method significantly reduces the associated service degree of high-risk substations and thus helps minimize the loss when the system is attacked.

(3) This study uniquely combines the information flow importance index and communication link importance index to capture the dual objectives of security and economy. A comparison with a single-service optimization method confirms the superiority of the proposed approach.

It should be noted that this study considers only one control center in the information layer, while actual integrated energy systems may have multiple control centers. Future research should explore IECPS routing optimization for multiple control centers, which would involve more complex routing modes for each service information flow and closer coupling between the information and physical layers.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Materials; further inquiries can be directed to the corresponding author.

Author contributions

HF: writing–original draft and writing–review and editing. XH: writing–original draft and writing–review and editing. DW: writing–review and editing. BZ: writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Cai, M., Zhang, E., Lin, J., Wang, K., Jiang, K., and Zhou, M. (2022). Route optimization equalization scheme based on graph theory for liquid metal battery strings. IEEE Trans. Industry Appl. 59, 2502–2508. doi:10.1109/TIA.2022.3221383

Dai, Y., Li, M., Zhang, K., and Shi, Y. (2023). Robust and resilient distributed mpc for cyber-physical systems against dos attacks. IEEE Trans. Industrial Cyber-Physical Syst. 1, 44–55. doi:10.1109/TICPS.2023.3283229

Ding, S., Gu, W., Lu, S., Yu, R., and Sheng, L. (2022). Cyber-attack against heating system in integrated energy systems: model and propagation mechanism. Appl. Energy 311, 118650. doi:10.1016/j.apenergy.2022.118650

Du, H., Zhang, J., Guan, K., Niyato, D., Jiao, H., Wang, Z., et al. (2022). Performance and optimization of reconfigurable intelligent surface aided thz communications. IEEE Trans. Commun. 70, 3575–3593. doi:10.1109/TCOMM.2022.3162645

Franze, G., Famularo, D., Lucia, W., and Tedesco, F. (2020). A resilient control strategy for cyber-physical systems subject to denial of service attacks: a leader-follower set-theoretic approach. IEEE/CAA J. Automatica Sinica 7, 1204–1214. doi:10.1109/JAS.2020.1003189

Gupta, P. K., Singh, N. K., and Mahajan, V. (2021). Intrusion detection in cyber-physical layer of smart grid using intelligent loop based artificial neural network technique. Int. J. Eng. doi:10.5829/IJE.2021.34.05B.18

Hammoudeh, M., and Newman, R. (2015). Adaptive routing in wireless sensor networks: qos optimisation for enhanced application performance. Inf. Fusion 22, 3–15. doi:10.1016/j.inffus.2013.02.005

Hu, S., Yue, D., Han, Q.-L., Xie, X., Chen, X., and Dou, C. (2020). Observer-based event-triggered control for networked linear systems subject to denial-of-service attacks. IEEE Trans. Cybern. 50, 1952–1964. doi:10.1109/TCYB.2019.2903817

Kakadiya, H., Popat, J., Singh, N. K., Tak, L., Majeed, M. A., Mudgal, S., et al. (2022). “Analysis and prevention of denial of service attacks in smart grid using iot,” in Sustainable Technology and advanced computing in electrical engineering. Editors V. Mahajan, A. Chowdhury, N. P. Padhy, and F. Lezama (Singapore: Springer Nature Singapore), 367–378.

Karamdel, S., Liang, X., Faried, S. O., and Mitolo, M. (2022). Optimization models in cyber-physical power systems: a review. IEEE Access 10, 130469–130486. doi:10.1109/ACCESS.2022.3229626

Kong, P.-Y. (2019). Optimal configuration of interdependence between communication network and power grid. IEEE Trans. Industrial Inf. 15, 4054–4065. doi:10.1109/TII.2019.2893132

Kong, P.-Y. (2020). Routing in communication networks with interdependent power grid. IEEE/ACM Trans. Netw. 28, 1899–1911. doi:10.1109/TNET.2020.3001759

Kong, P.-Y., and Jiang, Y. (2022). Vnf orchestration and power-disjoint traffic flow routing for optimal communication robustness in smart grid with cyber-physical interdependence. IEEE Trans. Netw. Serv. Manag. 19, 4479–4490. doi:10.1109/TNSM.2022.3165219

Kumar, N., Singh, B., and Panigrahi, B. K. (2019). Grid synchronisation framework for partially shaded solar pv-based microgrid using intelligent control strategy. IET Generation, Transm. Distribution 13, 829–837. doi:10.1049/iet-gtd.2018.6079

Kumar, N., Singh, B., and Panigrahi, B. K. (2023). Voltage sensorless based model predictive control with battery management system: for solar pv powered on-board ev charging. IEEE Trans. Transp. Electrification 9, 2583–2592. doi:10.1109/TTE.2022.3213253

Kumar, N., Singh, B., Wang, J., and Panigrahi, B. K. (2020). A framework of l-hc and am-mkf for accurate harmonic supportive control schemes. IEEE Trans. Circuits Syst. I Regul. Pap. 67, 5246–5256. doi:10.1109/TCSI.2020.2996775

Kumari, P., Kumar, N., and Panigrahi, B. K. (2023). A framework of reduced sensor rooftop spv system using parabolic curve fitting mppt technology for household consumers. IEEE Trans. Consumer Electron. 69, 29–37. doi:10.1109/TCE.2022.3209974

Li, B., Lu, C., Qi, B., Sun, Y., and Han, J. (2022). Risk and traffic based service routing optimization for electric power communication network. Int. J. Electr. Power Energy Syst. 137, 107782. doi:10.1016/j.ijepes.2021.107782

Li, B., Yang, J., Qi, B., Sun, Y., Yan, H., and Chen, S. (2014). Application of p -cycle protection for the substation communication network under srlg constraints. IEEE Trans. Power Deliv. 29, 2510–2518. doi:10.1109/TPWRD.2014.2358571

Li, M., Xue, Y., Ni, M., and Li, X. (2020). Modeling and hybrid calculation architecture for cyber physical power systems. IEEE Access 8, 138251–138263. doi:10.1109/ACCESS.2020.3011213

Li, T., Chen, B., Yu, L., and Zhang, W.-A. (2021). Active security control approach against dos attacks in cyber-physical systems. IEEE Trans. Automatic Control 66, 4303–4310. doi:10.1109/TAC.2020.3032598

Li, Y., Quevedo, D. E., Dey, S., and Shi, L. (2017). Sinr-based dos attack on remote state estimation: a game-theoretic approach. IEEE Trans. Control Netw. Syst. 4, 632–642. doi:10.1109/TCNS.2016.2549640

Li, Y., Ren, R., Huang, B., Wang, R., Sun, Q., Gao, D. W., et al. (2023). Distributed hybrid-triggering-based secure dispatch approach for smart grid against dos attacks. IEEE Trans. Syst. Man, Cybern. Syst. 53, 3574–3587. doi:10.1109/TSMC.2022.3228780

Lv, M., Lv, Y., Yu, W., and Meng, H. (2023). Finite-time attack detection and secure state estimation for cyber-physical systems. IEEE/CAA J. Automatica Sinica 10, 2032–2034. doi:10.1109/JAS.2023.123351

Pazouki, S., Naderi, E., and Asrari, A. (2021). A remedial action framework against cyberattacks targeting energy hubs integrated with distributed energy resources. Appl. Energy 304, 117895. doi:10.1016/j.apenergy.2021.117895

Popat, J., Kakadiya, H., Tak, L., Singh, N. K., Majeed, M. A., and Mahajan, V. (2021). “Reliability of smart grid including cyber impact: a case study,” in Computational methodologies for electrical and electronics engineers ( IGI Global ), 163–174. Available at: https://api.semanticscholar.org/CorpusID:234186951

Saxena, V., Kumar, N., Singh, B., and Panigrahi, B. K. (2021). An mpc based algorithm for a multipurpose grid integrated solar pv system with enhanced power quality and pcc voltage assist. IEEE Trans. Energy Convers. 36, 1469–1478. doi:10.1109/TEC.2021.3059754

Shabanpour-Haghighi, A., and Seifi, A. R. (2015). Simultaneous integrated optimal energy flow of electricity, gas, and heat. Energy Convers. Manag. 101, 579–591. doi:10.1016/j.enconman.2015.06.002

Siu, J. Y., Kumar, N., and Panda, S. K. (2022). Command authentication using multiagent system for attacks on the economic dispatch problem. IEEE Trans. Industry Appl. 58, 4381–4393. doi:10.1109/TIA.2022.3172240

Solanki, M. G., Patel, K. S., Kanzariya, B. R., Parekh, T. H., Singh, N. K., Yadav, A. K., et al. (2022). “Review on cybersecurity and major cyberthreats of smart meters,” in Sustainable Technology and advanced computing in electrical engineering . Editors V. Mahajan, A. Chowdhury, N. P. Padhy, and F. Lezama (Singapore: Springer Nature Singapore ), 527–541.

Soltan, S., Yannakakis, M., and Zussman, G. (2019). React to cyber attacks on power grids. IEEE Trans. Netw. Sci. Eng. 6, 459–473. doi:10.1109/TNSE.2018.2837894

Ti, B., Wang, J., Li, G., and Zhou, M. (2022). Operational risk-averse routing optimization for cyber-physical power systems. CSEE J. Power Energy Syst. 8, 801–811. doi:10.17775/CSEEJPES.2021.00370

Wang, A., Fei, M., Song, Y., Peng, C., Du, D., and Sun, Q. (2023). Secure adaptive event-triggered control for cyber–physical power systems under denial-of-service attacks. IEEE Trans. Cybern. 54, 1722–1733. doi:10.1109/TCYB.2023.3241179

Xin, S., Guo, Q., Sun, H., Zhang, B., Wang, J., and Chen, C. (2015). Cyber-physical modeling and cyber-contingency assessment of hierarchical control systems. IEEE Trans. Smart Grid 6, 2375–2385. doi:10.1109/TSG.2014.2387381

Zhang, Q., Lin, M., Yang, L. T., Chen, Z., and Li, P. (2019). Energy-efficient scheduling for real-time systems based on deep q-learning model. IEEE Trans. Sustain. Comput. 4, 132–141. doi:10.1109/TSUSC.2017.2743704

Zhang, Y., Jiang, T., Shi, Q., Liu, W., and Huang, S. (2022). Modeling and vulnerability assessment of cyber physical system considering coupling characteristics. Int. J. Electr. Power Energy Syst. 142, 108321. doi:10.1016/j.ijepes.2022.108321

Zhao, X., Zou, S., and Ma, Z. (2021). Decentralized resilient h ∞ load frequency control for cyber-physical power systems under dos attacks. IEEE/CAA J. Automatica Sinica 8, 1737–1751. doi:10.1109/JAS.2021.1004162

Keywords: smart grid, integrated energy cyber–physical system, communication network optimization, denial-of-service attack, routing optimization

Citation: Fan H, Huang X, Wang D and Zhou B (2024) Communication network robust routing optimization in an integrated energy cyber–physical system based on a random denial-of-service attack. Front. Energy Res. 12:1382887. doi: 10.3389/fenrg.2024.1382887

Received: 06 February 2024; Accepted: 07 March 2024; Published: 08 April 2024.

Copyright © 2024 Fan, Huang, Wang and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hong Fan, [email protected]

This article is part of the Research Topic

Optimization and Efficiency Model for Energy Internet

Cross-sectional study of pharmacovigilance knowledge, attitudes, and practices based on structural equation modeling and network analysis: a case study of healthcare personnel and the public in Yunnan Province

Affiliations.

  • 1 School of Pharmaceutical Sciences and Yunnan Provincial Key Laboratory of Pharmacology for Natural Products, Kunming Medical University, Kunming, Yunnan, China.
  • 2 Yunnan Provincial Center for Drug Policy Research, Kunming, Yunnan, China.
  • 3 College of Modern Biomedical Industry, Kunming Medical University, Kunming, Yunnan, China.
  • 4 Incubation Center of Scientific and Technological Achievements, Kunming Medical University, Kunming, Yunnan, China.
  • PMID: 38566786
  • PMCID: PMC10985242
  • DOI: 10.3389/fpubh.2024.1358117

Background: This study focuses on understanding pharmacovigilance knowledge, attitudes, and practices (KAP) in Yunnan Province, employing Structural Equation Modeling (SEM) and network analysis. It aims to evaluate the interplay of these factors among healthcare personnel and the public, assessing the impact of demographic characteristics to inform policy and educational initiatives.

Methods: A cross-sectional survey was conducted in Yunnan, targeting healthcare personnel and the public. Data collection was through questionnaires, with subsequent analysis involving correlation matrices, network visualization, and SEM. The data analysis utilized SPSS 27.0, AMOS 26.0, and Gephi software for network analysis.

Results: This study evaluated pharmacovigilance KAP among 209 public participants and 823 healthcare personnel, uncovering significant differences. Public respondents scored averages of 4.62 ± 2.70 in knowledge, 31.99 ± 4.72 in attitudes, and 12.07 ± 4.96 in practices, while healthcare personnel scored 4.38 ± 3.06, 27.95 ± 3.34, and 7.75 ± 2.77, respectively. Statistically significant correlations across KAP elements were observed in both groups, highlighting the interconnectedness of these factors. Demographic influences were more pronounced among healthcare personnel, emphasizing the role of professional background in pharmacovigilance competency. Network analysis identified knowledge as a key influencer within the pharmacovigilance KAP network, suggesting targeted education as a vital strategy for enhancing pharmacovigilance engagement.

Conclusion: The research reveals a less-than-ideal state of pharmacovigilance KAP among both healthcare personnel and the public in Yunnan, with significant differences between the two groups. SEM and network analysis confirmed a strong positive link among KAP components, moderated by demographics like age, occupation, and education level. These insights emphasize the need to enhance pharmacovigilance education and awareness, thereby promoting safer drug use.

Keywords: adverse drug reactions; healthcare personnel; knowledge attitudes and practices (KAP); network analysis; pharmacovigilance; public; structural equation modeling.

Copyright © 2024 Qin, Li and Yang.

  • Adverse Drug Reaction Reporting Systems
  • Cross-Sectional Studies
  • Health Knowledge, Attitudes, Practice*
  • Latent Class Analysis
  • Pharmacovigilance*

Health Psychol Behav Med, v.6(1); 2018

Network analysis: a brief overview and tutorial

David Hevey

School of Psychology, Trinity College Dublin, Dublin, Ireland

Objective: The present paper presents a brief overview of network analysis as a statistical approach for health psychology researchers. Networks comprise graphical representations of the relationships (edges) between variables (nodes). Network analysis provides the capacity to estimate complex patterns of relationships, and the network structure can be analysed to reveal core features of the network. This paper provides an overview of networks, how they can be visualised and analysed, and presents a simple example of how to conduct network analysis in R using data on the Theory of Planned Behaviour (TPB).

Method: Participants (n = 200) completed a TPB survey on regular exercise. The survey comprised items on attitudes, normative beliefs, perceived behavioural control, and intentions. Data were analysed to examine the network structure of the variables. The EBICglasso was applied to the partial correlation matrix.

Results: The network structure reveals the variation in relationships between the items. The network split into three distinct communities of items. The affective attitude item was the central node in the network. However, replication of the network in larger samples to produce more stable and robust estimates of network indices is required.

Conclusions: The reported network reveals that the affective attitudinal variable was the most important node in the network and therefore interventions could prioritise targeting changing the emotional responses to exercise. Network analysis offers the potential for insight into structural relations among core psychological processes to inform health psychology science and practice.

Introduction

Health psychology research examines how the complex interactions between biological, psychological, and social factors influence health and well-being. For example, the UK Foresight map of obesity (see https://www.gov.uk/government/collections/tackling-obesities-future-choices ) provides a comprehensive representation of the complex system of over 300 relationships between over 100 variables and obesity (Finegood, Merth, & Rutter, 2010 ). The developers of the map assumed that obesity is the result of the interplay between a wide variety of factors, including a person’s physical make-up, eating behaviour, and physical activity pattern. The system reflects the relevant factors and their interdependencies that produce obesity as a behavioural outcome. The variables were classified into various categories of causal factors; for example, social psychological factors (e.g. peer pressure), individual psychological factors (e.g. stress), environmental factors (e.g. the extent to which one’s environment makes it easy to engage in regular walking), and individual physical activity factors (e.g. functional fitness). On the basis of expert academic opinion the Foresight report authors proposed that the variables in the system not only influence obesity, but can also have positive (e.g. high levels of stress cause high levels of alcohol consumption) and negative (e.g. high levels of stress cause low levels of physical activity) effects on each other, some have distal effects whereas others have proximal effects, and effects can be unidirectional (e.g. social attitudes towards fatness causes conceptualisations of obesity as an illness) or reciprocal (e.g. physical activity causes functional fitness, which causes physical activity). Networks are a fundamental characteristic of such complex systems; consequently, health psychological science can benefit from considering the network structure of the phenomena that it seeks to understand. 
It has been argued that networks pervade all aspects of human psychology (Borgatti, Mehra, Brass, & Labianca, 2009), and in the past decade network analysis has become an important conceptual and analytical approach in psychological research. Although network analysis has a long history of being applied in causal attribution research (e.g. Kelly, 1983) and social network analysis (Clifton & Webster, 2017), its broader potential for psychological science was highlighted over a decade ago by van der Maas et al. (2006). The frequently reported patterns of positive correlations between various cognitive tasks (e.g. verbal comprehension and working memory) are typically explained in terms of a dominant latent factor, i.e. the correlations reflect a hypothesised common factor of general intelligence (g). However, van der Maas and colleagues argued that this empirical pattern can also be accounted for by means of a network approach, wherein the patterns of positive relationships can be explained using a mutualism model, i.e. the variables have mutual, reinforcing relationships. From a network analysis perspective, the network of relationships between the variables constitutes the psychological phenomenon (De Schryver, Vindevogel, Rasmussen, & Cramer, 2015), which is a system wherein the constituent variables mutually influence each other without the need to hypothesise the existence of causal latent variables (Schmittmann et al., 2013). In addition to addressing psychometric issues (Epskamp, Maris, Waldorp, & Borsboom, In Press), network perspectives can inform other areas of psychological science.

A key impetus for the current research on networks in psychology derives from Borsboom and colleagues’ influential application of networks in the field of clinical psychology in relation to psychopathology symptoms (e.g. Borsboom, 2017 ; Borsboom & Cramer, 2013 ; Cramer et al., 2016 ; Cramer, Waldorp, van der Maas, & Borsboom, 2010 ). Network models are also increasingly applied in other areas such as health related quality of life (HRQOL) assessment in health psychology (e.g. Kossakowski et al., 2016 ), personality (e.g. Costantini et al., 2015 ; Mõttus & Allerhand, 2017 ), and attitudes (e.g. Dalege et al., 2015 ). The psychosystems research team (i.e. Denny Borsboom, Angélique Cramer, Sacha Epskamp, Eiko Fried, Don Robinaugh, Claudia van Borkulo, Lourens Waldorp, Han van der Maas) are critical innovators for network analysis in psychology and this paper draws extensively from the key papers from the team and their collaborators; the psychosystems.org webpage is an essential resource for anyone interested in network analysis theory, process and applications.

To date, network analysis has not been widely applied in health psychology; however, network models are particularly salient for health psychology because many of the psychological phenomena we seek to understand are theorised to depend upon a large number of variables and interactions between them. The biopsychosocial model (e.g. Engel, 1980) has underpinned health psychology research and theory for the past four decades, and it reflects a complex system of mutually interacting and dynamic biological, psychological, interpersonal, and contextual effects on health (Lehman, David, & Gruber, 2017; Suls & Rothman, 2004). From a network perspective, health behaviours and outcomes can be conceptualised as emergent phenomena from a system of reciprocal interactions: network analysis offers a powerful methodological approach to investigate the complex patterns of such relationships. The overall global structural organisation, or topology, of the phenomenon and the roles played by specific variables in the network can be analysed in a manner that other statistical approaches cannot provide. In general, health psychology research, like many areas of psychology, has studied aspects of systems in isolation: for example, using regression models to examine the relationship between focal beliefs and moods and a specific outcome such as health behaviours or adaptation to illness. Although such research provides important insights, this approach is not suited to examining complex systems of interconnected variables, and it does not help us easily piece the various separate research findings on discrete components/sub-pathways back into the more complex and complete system. As noted above, the complex interplay of physiological, psychological, social and environmental factors has been highlighted in the context of obesity. Comparable exercises for other chronic illnesses will produce similarly complex networks of variables. Network analysis provides a means to understand system-level relationships in a manner that can enhance psychological science and practice.

Health psychology research often focuses on HRQOL as a key outcome variable and HRQOL is frequently understood as being the common effect of observed items in scales, e.g. increased daily pain causes lower mental health. Network analysis has been applied to the SF-36 (Ware & Sherbourne, 1992 ), a widely used HRQOL scale, to examine the patterns of relationships between the items: Kossakowski et al. ( 2016 ) found that the observed covariances between the items may result largely from direct interactions between items. From this perspective, HRQoL emerges from a network of mutually interacting characteristics; the specific nature of the interacting relationships (e.g. causal effect, bidirectional effect, or effects of unmodelled latent variables) requires additional clarification. In addition to offering novel insights into psychometrics, a network approach can be applied to other important health psychology variables (e.g. illness representations, coping strategies) to better understand the nature of the relationships between items used in measurement.

Borsboom’s research on the networks of interconnected relationships between symptoms of various psychiatric disorders has resulted in the development of a novel network theory of mental disorders (Borsboom, 2017). This theory provides new insights into how trigger events can activate pathways in strongly connected networks to produce symptoms that can become self-sustaining, i.e. because the symptoms are strongly connected, feedback relations between them mean that they can activate each other after the triggering event has been removed. The absence of the trigger may not be sufficient to de-activate the symptom network and return the person to a state of health; such insights from a network theory of psychopathology can help inform not only understandings of how and why symptoms are maintained, but also how such networks can be targeted to help transition the network back into a healthy state. Of note, such an approach may be beneficial for health psychology approaches to understanding clusters of symptom presentations over time in conditions such as chronic pain and chronic fatigue syndrome.

The network structures of individuals can be visualised and analysed; consequently we may be able to see how the system of beliefs, emotional states, behaviours and symptoms influence each other over time. Systems might comprise sets of variables that are diverse and only marginally connected, or could consist of variables that are highly interconnected. Understanding an individual’s personalised network may allow insight into when an individual’s specific patterns of beliefs and behaviours reach a tipping point, which then negatively impacts on mood and symptoms. Such system transitions (e.g. moving from a state of wellness to being impaired functionally) may occur gradually in response to changing conditions or they may be triggered by an external perturbation, e.g. a life stressor. An individual may have a very robust network so that it remains stable despite perturbations (e.g. symptom flare-up) and consequently the person can maintain function, whereas other individuals may have less resilient networks wherein it is challenging to restore disturbed equilibrium. How such networks evolve over time and respond to changes in key and peripheral variables cannot be understood using traditional analytical methods: network analysis offers rich potential to further our understanding of complex systems of relationships among variables.

The Causal Attitude Network (CAN) model, which conceptualises attitudes as networks of causally interacting evaluative reactions (i.e. beliefs, feelings, and behaviours towards an attitude object; Dalege et al., 2015 ), is also of particular interest to health psychologists given the centrality of attitudinal variables in many core psychological models (e.g. Theory of Planned Behaviour, Health Belief Model). The capacity to graphically visualise complex patterns of relationships further offers the potential for insight into the salient psychological processes and to highlight theoretical gaps. For example, Langley, Wijn, Epskamp, and Van Bork ( 2015 ) used network analysis to examine the Health Belief Model variables in relation to girls’ intentions to obtain HPV vaccination. They reported that although some aspects of the HBM (e.g. perceived efficacy) were related to intentions, other core constructs such as cues to action were less relevant. In addition, social factors, currently not included in the HBM, were important in the network; such research can inform conceptual developments linking individual beliefs with social context to better understand healthy behaviours. Consequently, the network approach offers the potential to gain novel insights as the network structure can be analysed to reveal both core structural and relational features.

The aim of this paper is to provide an overview of networks, how they can be visualised and analysed, and to present a simple example of how to conduct network analysis on empirical data in R (R Core Team, 2017 ).

What is a network?

At an abstract level, a network refers to various structures comprising variables, which are represented by nodes, and the relationships (formally called edges) between these nodes. For example, in the Foresight Report, variables such as stress, peer pressure, functional fitness, and nutritional quality of food and drink represent nodes in the network, and the positive and negative relationships between those nodes are edges. There are some differences in nomenclature in the network literature: nodes are sometimes referred to as vertices, edges are sometimes referred to as links, and networks are also called graphs. Networks can be estimated based on cross-sectional or longitudinal time-series data; in addition, networks can be analysed at the group or individual level. Cross-sectional data from a group can reveal group-level conditional independence relationships (e.g. Rhemtulla et al., 2016). Individualised networks based on time-series data can provide insights into a specific individual over time (e.g. Kroeze et al., 2017). Furthermore, the networks produced by different populations can be compared. In general, network analysis represents a wide range of analytical techniques to examine different network models.

In psychological networks, nodes represent various psychological variables (e.g. attitudes, cognitions, moods, symptoms, behaviours), while edges represent unknown statistical relationships (e.g. correlations, predictive relationships) that can be estimated from the data. A node can represent a single item from a scale, a sub-scale, or a composite scale: the choice of node depends upon the type of data that provide the most appropriate and useful understanding of the questions to be addressed. Edges can represent different types of relationships, e.g. co-morbidity of psychological symptoms, correlations between attitudes.

Two types of edges can be present in a network: (1) a directed edge: the nodes are connected and one end of the edge has an arrowhead indicating a one-way effect, or (2) an undirected edge: the nodes have a connecting line indicating some mutual relationship but with no arrowheads to indicate direction of effect. Networks can be described as being directed (i.e. all edges are directed) or undirected (i.e. no edges are directed). For example, edge direction has been used in psychology networks particularly for representing cross-lagged relationships among variables (Bringmann et al., 2016). A directed network can be cyclic (i.e. we can follow the directed edges from a given node and end up back at that node) or acyclic (i.e. we cannot start at a node and end up back at that node by following the directed edges).
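The cyclic/acyclic distinction can be made concrete with a short sketch. The Python function below is an illustration only, not part of the paper's R workflow: it stores a directed network as a dictionary mapping each node to the nodes its arrows point to, and uses a depth-first search to check whether following directed edges can lead back to the starting node. The two example networks and their node names are hypothetical.

```python
from typing import Dict, List


def has_cycle(graph: Dict[str, List[str]]) -> bool:
    """Return True if following directed edges can lead from a node back to itself."""
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node: str) -> bool:
        state[node] = ON_PATH
        for target in graph.get(node, []):
            if state.get(target, UNVISITED) == ON_PATH:
                return True  # we returned to a node on the current path
            if state.get(target, UNVISITED) == UNVISITED and visit(target):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNVISITED and visit(node) for node in graph)


# Hypothetical reciprocal effect: attitude -> behaviour -> attitude (cyclic).
reciprocal = {"attitude": ["behaviour"], "behaviour": ["attitude"]}

# Hypothetical one-way structure: no path returns to its starting node (acyclic).
one_way = {"A": ["B", "C"], "B": ["C"], "C": []}

print(has_cycle(reciprocal))
print(has_cycle(one_way))
```

The first network is cyclic and the second is acyclic, so only the second would satisfy the acyclic assumption of a DAG.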

Directed networks can represent causal structures (Pearl, 2000); however, such directed networks can have very strict assumptions, i.e. all the variables that have a causal effect are measured in the network, and the causal chain of cause and effect is not cyclic (i.e. a variable cannot cause itself via any path) (Epskamp, Borsboom, & Fried, 2018a). Although Directed Acyclic Graphs (DAGs) have been frequently reported in the epidemiological research literature in the past two decades (Greenland, Pearl, & Robins, 1999), the acyclic assumption may be untenable in many contexts for psychology. For example, in many psychological phenomena, reciprocal effects may exist between variables: having a positive attitude towards a behaviour results in that behaviour, which then results in a more positive attitude. In addition, directed networks suffer from the problem, similar to that arising in Structural Equation Modelling, that many equivalent models can account for the pattern of relationships found in the data (Bentler & Satorra, 2010; MacCallum, Wegener, Uchino, & Fabrigar, 1993). In their recent review of the challenges for network theory and methodology in psychopathology, Fried and Cramer (2017) note that despite the plausibility of many causal psychopathological symptom pathways in networks, there is a need to build stronger cases for the causal nature of these relationships. They highlight that many network papers have estimated undirected networks in cross-sectional data, and that even those that use directed networks based on time-series data at best show that variables measured at one moment in time can predict another variable at a different measurement time (Granger causality; Granger, 1969), which satisfies the requirement for putative causes preceding their effects (Epskamp et al., 2018b). Although such a temporal relationship may indicate a causal relationship, it is possible that the link may occur for other reasons (e.g. a unidimensional autocorrelated factor model would lead to every variable predicting every other variable over time; Epskamp et al., 2018b). Spirtes, Glymour, and Scheines (2000) developed the PC algorithm, which can be used to examine networks to find candidate causal structures that may have generated the observed patterns of relations present. However, such approaches have not been widely used to date in psychological networks. In general, network analysis can be considered as hypothesis-generating for putative causal structures that require empirical validation.

Edges convey information about the direction and strength of the relationship between the nodes. The edge may be positive (e.g. positive correlation/covariance between variables) or negative (e.g. negative correlation/covariance between variables); the polarity of the relationships is represented graphically using different coloured lines to represent the edges: positive relationships are typically coloured blue or green, and negative relationships are coloured red. Edges can be either weighted or unweighted . A weighted edge reflects the strength of the relationship between nodes by varying the thickness and colour density of the edge connecting the nodes: thicker denser coloured lines indicate stronger relationships. Alternatively, the edge may be unweighted and simply represent the presence vs . absence of a relationship; in such a network, the absence of a relationship results in the nodes not having a connecting edge.

Figure 1 presents a simple network model representing the partial correlation matrix between 5 variables (A–E) below (Table 1). The size and colour density of the lines (edges) vary to reflect the varying strength of relationship between the variables; the edges are non-directional because the data are represented as partial correlations between pairs of variables. The network comprises both positive (green lines) and negative correlations (red lines) between the variables. Some variables are more central and have more connections than others: C relates to all the variables in the network, whereas D only relates to two other variables.

Figure 1. Sample network with 5 nodes and 8 edges. Positive edges are green and negative edges are red. The numbers represent the correlations between the variables.
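A weighted, undirected network of this kind can be stored as a set of node pairs with signed weights. The Python sketch below is illustrative only: the weights are hypothetical and are not the values from Table 1; they are chosen merely to match the description above (5 nodes, 8 edges, C connected to every other node, D connected to only two nodes, and no edge between A and D).

```python
# Hypothetical signed edge weights (NOT the values from Table 1).
# Each key is an unordered pair of nodes; a missing pair means no edge.
edges = {
    ("A", "B"): 0.30,
    ("A", "C"): 0.45,
    ("A", "E"): -0.20,
    ("B", "C"): 0.25,
    ("B", "E"): 0.15,
    ("C", "D"): 0.40,
    ("C", "E"): -0.35,
    ("D", "E"): 0.10,
}

nodes = sorted({node for pair in edges for node in pair})

# Degree: the number of edges attached to each node.
degrees = {node: sum(1 for pair in edges if node in pair) for node in nodes}

# Polarity: positive edges would be drawn green, negative edges red.
positive = [pair for pair, weight in edges.items() if weight > 0]
negative = [pair for pair, weight in edges.items() if weight < 0]

# An unweighted version keeps only the presence vs. absence of each edge.
unweighted = set(edges)

print(degrees)
print(len(positive), "positive edges;", len(negative), "negative edges")
```

Dropping the weights, as in `unweighted`, loses the information about strength and polarity that edge thickness and colour convey in the figure.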

Having briefly outlined the basic features of a network, the next sections will outline the three core analytical steps in network analysis:

  • Estimate the network structure based on a statistical model that reflects the empirical patterns of relationships between the variables
  • Analyse the network structure
  • Assess the accuracy of the network parameters and measures.

1. Estimating the Network

Historically, network science has developed using graphical approaches to represent relationships between nodes. For example, Leonhard Euler’s application of ‘geometry of position’, Gustav Kirchhoff’s work on the algebra of graphs in relation to electrical networks, and Cayley’s contributions to molecular chemistry all utilised graphical approaches to network data (Estrada & Knight, 2015). The network visually represents the pattern of relationships between variables, and a network can be estimated using common statistical parameters that quantify relationships, e.g. correlations, covariances, partial correlations, regression coefficients, odds ratios, or factor loadings. However, as correlation networks can contain spurious edges, for example due to an (unmeasured) confounding variable, the most common approach in psychology uses partial correlations to create the relationships between variables. For example, if we had a network examining the relationship between a risk behaviour (e.g. caffeine consumption) and a health outcome (e.g. cancer), the analysis would show a relationship between the variables; however, such a relationship may simply reflect the fact that an unmeasured confound (e.g. smoking) is associated with both caffeine consumption and cancer. Partial correlations, similar to multiple regression coefficients, provide estimates of the strength of relationships between variables controlling for the effects of the other measured variables in the network model. Thus it is critically important to measure such potential confounding variables to ensure that their effects are controlled for. Two nodes are connected if there is covariance between those nodes that cannot be explained by any other variable in the network. The resulting partial correlations not only provide an estimate of the direct strength of relationships, but can also indicate mediation pathways: in Figure 1, A and D are not directly connected (i.e. no edge between them) but A influences C, which in turn influences D; thus C mediates the relationship between A and D. Partial correlation networks can provide valuable hypothesis-generating structures, which may reflect potential causal effects to be further examined in terms of conditional independence (Pearl, 2000).
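The caffeine, smoking and cancer example can be made concrete with a small simulation. The sketch below is illustrative only: the data are simulated, the effect sizes (0.6) are invented, and the partial correlations are obtained from the inverse of the correlation matrix without any regularisation.

```r
# Simulated illustration (hypothetical data): smoking drives both caffeine
# consumption and cancer risk; caffeine has no direct effect on cancer risk.
set.seed(1)
n <- 5000
smoking  <- rnorm(n)
caffeine <- 0.6 * smoking + rnorm(n)
cancer   <- 0.6 * smoking + rnorm(n)
dat <- cbind(caffeine = caffeine, cancer = cancer, smoking = smoking)

# Marginal (zero-order) correlations
marginal <- cor(dat)

# Partial correlations from the inverse correlation (precision) matrix:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj)
P <- solve(marginal)
partial <- -P / sqrt(outer(diag(P), diag(P)))
diag(partial) <- 1
dimnames(partial) <- dimnames(marginal)

round(marginal["caffeine", "cancer"], 2)  # clearly positive
round(partial["caffeine", "cancer"], 2)   # close to zero once smoking is controlled
```

The marginal correlation suggests a caffeine and cancer edge, whereas the partial correlation shows that the edge largely disappears once the measured confound is included in the network.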

As noted previously, psychological research has typically examined undirected network models, and a frequently used model for estimating such networks is the pairwise Markov Random Field (PMRF), which is a broad class of statistical models. A PMRF model is characterised by undirected edges between nodes that indicate conditional dependence relations between nodes. An absent edge means that two nodes are conditionally independent given all other nodes in the network. An edge indicates conditional dependence given all other nodes in the network. Different PMRF models can be used, depending upon the type of data (continuous, ordinal, binary, or mixtures of these data types) to be modelled. When continuous data are multivariate normally distributed, analysing the partial correlations using the Gaussian graphical model (GGM; Costantini et al., 2015; Lauritzen, 1996) is appropriate. If the continuous data are not normally distributed then a transformation (e.g. nonparanormal transformation, Liu, Lafferty, & Wasserman, 2009) can be applied prior to applying the GGM. The GGM can also be used for ordinal data, wherein the network is based on polychoric correlations instead of Pearson correlations (Epskamp, 2018). If all the research variables are binary, the Ising Model can be used (van Borkulo et al., 2014). When the data comprise a mixture of categorical and continuous variables, the Mixed Graphical Model can be used to estimate the PMRF (Haslbeck & Waldorp, 2016). Thus, networks can be estimated from various types of data in a flexible manner.

Network complexity also requires consideration. The more nodes that are examined, the more edges have to be estimated: in a network with five nodes, 10 unique edges are estimated, whereas in a network with 10 nodes, 45 edges are estimated, and in a network with 20 nodes, 190 edges are estimated. In addition, in the case of an Ising model not only are edge weights estimated but so too are thresholds: in the case of 20 nodes that would mean an additional 20 parameters to be estimated. However, as mentioned above, many of these edges (e.g. correlations) may be spurious, and an increase in the number of nodes can lead to over-fitting and very unstable estimates (Babyak, 2004). Like all statistical techniques that use sample data to estimate parameters, the correlation and partial correlation values will be influenced by sampling variation, and therefore exact zeros will rarely be observed in the matrices. Consequently, correlation networks will nearly always be fully connected networks, possibly with small weights on many of the edges that reflect weak and potentially spurious partial correlations. Such spurious relationships will be problematic in terms of the network interpretation and will compromise the potential for network replication. In order to limit the number of such spurious relationships, a statistical regularisation technique, which takes into account the model complexity, is frequently used.
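The edge counts quoted above follow from the number of unique pairs among p nodes, p(p - 1)/2. A minimal sketch (the function names are ours, chosen for illustration):

```r
# Number of unique undirected edges among p nodes: p * (p - 1) / 2
n_edges <- function(p) p * (p - 1) / 2

# Ising model: one threshold per node in addition to the edge weights
n_ising_params <- function(p) n_edges(p) + p

n_edges(c(5, 10, 20))   # 10 45 190
n_ising_params(20)      # 210
```

The quadratic growth in parameters is why regularisation becomes more important as nodes are added.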

A ‘least absolute shrinkage and selection operator’ (LASSO; Friedman, Hastie, & Tibshirani, 2008) with a tuning parameter set by the researcher is applied to the estimation of the partial correlation networks. The LASSO performs well in the estimation of partial correlation networks (Fan, Feng, & Wu, 2009), and it results in some small weak edge estimates being reduced to exactly zero, resulting in a sparse network (Tibshirani, 1996). The LASSO yields a more parsimonious graph (fewer connections between nodes) that reflects only the most important empirical relationships in the data. Of note, the absence of an edge does not present evidence that the edge is in fact exactly zero (Epskamp, Kruis, Marsman, & Marinazzo, 2017). The goal of the LASSO is to exclude spurious relationships, but in doing so it may omit actual relationships. Although many variants of the LASSO have been developed, the graphical LASSO (glasso, Friedman et al., 2008) is recommended in terms of both its ease of implementation in specific analysis programmes and its flexibility in terms of non-continuous data (Epskamp & Fried, In Press). The edge may be absent from the network if the data are too messy and noisy to detect the true relationship, and quantifying evidence for edge weights being zero is an ongoing research issue (Wetzels & Wagenmakers, 2012). Simulation studies show that the LASSO has a low likelihood of false positives, which provides some confidence that an observed edge is indeed present in the network (Krämer, Schäfer, & Boulesteix, 2009). However, the specific nature of the relationship reflected in the edge is still uncertain, e.g. the edge could represent a direct causal pathway between nodes, or it could reflect the common effect of a (latent) variable not included in the network model.

As mentioned previously, the use of the LASSO requires setting a tuning parameter. The sparseness of the network produced using the LASSO depends upon the value the researcher sets for the tuning parameter (λ): the higher the λ value selected, the more edges are removed from the network, so its value directly influences the structure of the resulting network. The tuning parameter λ therefore needs to be carefully selected to create a network structure that minimises the number of spurious edges while maximising the number of true edges (Foygel & Drton, 2010). In order to ensure that the optimal tuning parameter is selected, a common method involves estimating a number of networks under different λ values. These different networks range from a completely full network, where every node is connected to every other node, to an empty network where no nodes are connected. The LASSO estimates therefore produce a collection of networks rather than a single network; the researcher needs to select the optimal network model, and typically this is achieved by minimising the Extended Bayesian Information Criterion (EBIC; Chen & Chen, 2008), which has been shown to work particularly well in identifying the true network structure (Foygel & Drton, 2010; van Borkulo et al., 2014), especially when the true network is sparse. Model selection using the EBIC works well for both the Ising model (Foygel Barber & Drton, 2015) and the GGM (Foygel & Drton, 2010). The EBIC has been widely used in psychology networks (e.g. Beard et al., 2016; Isvoranu et al., 2017) and it enhances both the accuracy and interpretability of networks produced (Tibshirani, 1996).

The EBIC uses a hyperparameter (γ) that dictates how much the EBIC will prefer sparser models (Chen & Chen, 2008; Foygel & Drton, 2010). The γ value is determined by the researcher and is typically set between 0 and 0.5 (Foygel & Drton, 2010), with higher values indicating that simpler models (more parsimonious models with fewer edges) are preferred. In many ways the choice of γ depends upon the extent to which the researcher is taking a liberal or conservative approach to the network model. A value of 0 results in more edges being estimated, including possibly spurious ones, which can be useful in early exploratory and hypothesis-generating research. Of note, a γ setting of zero will still produce a network that is sparser than a partial correlation network that has not been regularised using a LASSO. Although γ can be set at 1, the default in many situations is 0.5. Foygel and Drton (2010) suggest that setting the γ value to 0.5 will result in fewer edges being retained, which will remove the spurious edges but may also remove some other edges too. A compromise γ value of 0.25 is potentially useful to examine the impact on the network model produced.
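To see how γ shifts the preference towards sparser models, the sketch below implements the EBIC in the form commonly given for regularised partial correlation networks, EBIC = -2L + E log(N) + 4γE log(P), where L is the log-likelihood, E the number of non-zero edges, N the sample size and P the number of nodes. The two candidate networks and their log-likelihoods are invented for illustration; in practice these quantities come from the networks estimated along the λ path.

```r
# EBIC = -2 * logLik + E * log(N) + 4 * gamma * E * log(P)
# E = number of non-zero edges, N = sample size, P = number of nodes
ebic <- function(loglik, n_edges, n_obs, n_nodes, gamma = 0.5) {
  -2 * loglik + n_edges * log(n_obs) + 4 * gamma * n_edges * log(n_nodes)
}

# Hypothetical candidates from a LASSO path (values invented for illustration)
candidates <- data.frame(
  model   = c("dense", "sparse"),
  loglik  = c(-1000, -1050),
  n_edges = c(30, 15),
  stringsAsFactors = FALSE
)

# Return the name of the candidate with the lowest EBIC for a given gamma
pick <- function(gamma) {
  scores <- ebic(candidates$loglik, candidates$n_edges,
                 n_obs = 200, n_nodes = 10, gamma = gamma)
  candidates$model[which.min(scores)]
}

pick(0)    # "dense"
pick(0.5)  # "sparse"
```

With γ = 0 the criterion reduces to the ordinary BIC and the better-fitting dense model is retained; with γ = 0.5 the extra penalty on edges tips the choice towards the sparser model.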

Figure 2 presents the same data (questionnaire items on the big 5 model of personality, with 5 items for each dimension: Openness, Conscientiousness, Agreeableness, Extraversion, and Neuroticism) analysed using γ of 0, 0.5, and 0.99. With γ set to 0, the network contains a dense array of connections as more edges are estimated; as γ increases, the number of edges estimated decreases as the model becomes more sparse. This illustrates that the choices made by researchers in setting the γ level will affect the nature of the network produced. Of note, Epskamp and Fried (In Press) report that a comparison of networks based on simulated data using γ of 0.00, 0.25 and 0.50 revealed that the higher values of γ were able to recover the true network structure, but that the value of 0 included a number of spurious relationships. They caution that γ of .5 may still be conservative and not reflect the true model, and they note that the choice of γ is somewhat arbitrary and up to the researcher. Epskamp (2018) reported recently that increasing the γ to 0.75 or 1.00 did not outperform a γ of 0.5 in a well-established personality dataset.

Figure 2. Partial correlation networks estimated on the same dataset, with increasing levels of the LASSO hyperparameter γ (from left to right: Panel (a) γ = 0, Panel (b) γ = 0.5, Panel (c) γ = 0.99).

In order to plot the network, the nodes and edges need to be positioned in a manner that reflects the patterns of relationships present in the data. The most frequently used approach in psychological networks is the Fruchterman-Reingold algorithm (Fruchterman & Reingold, 1991), which calculates the optimal layout so that nodes with less strength and fewer connections are placed further apart, and those with more and/or stronger connections are placed closer to each other. The development of qgraph as a package to visualise patterns of relationships between nodes in networks was an invaluable contribution to advancing network analysis (Epskamp, Cramer, Waldorp, Schmittmann, & Borsboom, 2012).

2. Network Properties

After a network structure is estimated, the graphical representation of the network reveals the structural relationships between the nodes, and we can then further analyse the network structure in terms of its properties. This analysis provides insight into critically important features of the network. For example, are certain nodes more important (central) than others in the network? Is the global structure dense or sparse? Does it contain strong clusters of nodes (communities) or are the nodes isolated?

Not all nodes in a network are equally important in determining the network’s structure: centrality indices provide insight into the relative importance of a node in the context of the other nodes in the network (Borgatti, 2005 ; Freeman, 1978 ). For example, a central symptom is one that has a large number of connections in a network and its activity can spread activation throughout a symptoms network; in contrast, a peripheral symptom is on the outskirts of a network and has few connections and consequently less impact on the network. Different centrality indices provide insights into different dimensions of centrality. The indices can be presented as standardised z score indices to provide information on the relative importance of the nodes, and judging centrality requires careful consideration of the different dimensions in combination. These indices are based on the pattern of the connections in which the node of interest plays a role and can be used to model or predict several network processes, such as the amount of flow that traverses a node or the tolerance of the network to the removal of selected nodes (Borgatti, 2005 ). The most common aspects of centrality typically examined are as follows.

Degree : degree centrality is defined as the number of connections incident to the node of interest (Freeman, 1978 ).

Node strength : how strongly a node is directly connected to other nodes is based on the sum of the weighted number and strength of all connections of a specific node relative to all other nodes. Whilst degree provides information on the number of connections, strength can provide additional information on the importance of that node; for example, a node with many weak connections (high degree) might not be as central to the network as one that has fewer but stronger connections. However, as noted by Opsahl, Agneessens, and Skvoretz (2010), merely focusing on node strength alone as an index of importance is potentially misleading as it does not take account of the number of other nodes to which the node is connected. Consequently, it is important to incorporate both degree and strength as indicators of the level of involvement of a node in the surrounding network when examining the centrality of a node. Opsahl et al. (2010) proposed the use of a degree centrality measure, which is the product of the number of nodes that a specific node is connected to, and the average weight of the edges to these nodes adjusted by an alpha (α) parameter, which determines the relative importance of the number of edges compared to edge weights. In combining both degree and strength, the tuning α parameter is set by the researcher: if this parameter is between 0 and 1, then having a high degree is regarded as favourable, whereas if it is set above 1, then a low degree is favourable.

Closeness : the closeness index quantifies the node’s relationship to all other nodes in the network by taking into account the indirect connections from that node. A high closeness index indicates a short average distance of a specific node to all other nodes; a central node with high closeness will be affected quickly by changes in any part of the network and can affect changes in other parts of the network quickly (Borgatti, 2005 ).

Betweenness : the betweenness index provides information on how important a node is in the average pathway between other pairs of nodes. A node can play a key role in the network if it frequently lies on the shortest path between two other nodes, and it is important in the connection that the other nodes have between them (Saramäki, Kivelä, Onnela, Kaski, & Kertész, 2007 ; Watts & Strogatz, 1998 ).

Clustering : the extent to which a node is part of a cluster of nodes can be estimated (Saramäki et al., 2007 ). The local clustering coefficient C is the proportion of edges that exist between the neighbours of a particular node relative to the total number of possible edges between neighbours (Bullmore & Sporns, 2009 ). It provides insight into the local redundancy of a node: does removing the node have an impact on the capacity of the neighbouring nodes to still influence each other? An overall global clustering coefficient (also referred to as transitivity) for the entire network can be estimated in both undirected and directed networks. Furthermore, the overall network may comprise communities , i.e. a clustering of nodes that are highly interconnected among themselves and poorly connected with nodes outside that cluster.
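Several of the indices above can be computed directly from a weight matrix. The sketch below uses a hypothetical four-node undirected network with invented edge weights and base R only; it computes degree, strength, the combined degree and strength measure of Opsahl et al. (2010) with α = 0.5, and the unweighted local clustering coefficient. Closeness and betweenness require shortest-path calculations and are omitted here.

```r
# Hypothetical weighted, undirected network with four nodes
nodes <- c("A", "B", "C", "D")
W <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
W["A", "B"] <- W["B", "A"] <- 0.5
W["A", "C"] <- W["C", "A"] <- 0.3
W["B", "C"] <- W["C", "B"] <- 0.2
W["C", "D"] <- W["D", "C"] <- 0.4

adj <- (W != 0) * 1          # unweighted adjacency matrix

degree   <- rowSums(adj)     # number of connections per node
strength <- rowSums(abs(W))  # sum of absolute edge weights per node

# Opsahl et al. (2010): degree * (strength / degree)^alpha
opsahl <- function(degree, strength, alpha) degree * (strength / degree)^alpha
combined <- opsahl(degree, strength, alpha = 0.5)

# Local clustering: edges among a node's neighbours / possible edges among them
# (undefined, here NA, for nodes with fewer than two neighbours)
triangles  <- diag(adj %*% adj %*% adj) / 2
clustering <- ifelse(degree < 2, NA, triangles / choose(degree, 2))

round(cbind(degree, strength, combined, clustering), 2)
```

In this toy network node C has the highest degree and strength but the lowest clustering coefficient, because its neighbour D is connected to no one else.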

Detecting communities requires researchers to not simply interpret the placement of nodes in the visual representation of the data but to examine the patterns present using a formal statistical approach. Fried (2016) highlights a number of approaches to help identify communities. As latent variable models and network models are mathematically equivalent, examining the eigenvalues of components present in data using exploratory factor analysis is one way to identify how many communities might be present, and the factor loadings indicate which nodes belong to which community. More sophisticated approaches include the spinglass algorithm (although this is limited by the fact that it often produces different results every time you run it, and it only allows nodes to be part of one community, whereas nodes may be better described as belonging to several communities at the same time), the walktrap algorithm (which provides more consistent results if you repeat it, but which also only allows nodes to be part of one community), and the Clique Percolation Method (CPM), which allows nodes to belong to more than one community (see Blanken et al., 2018).

Overall network topology

Networks can take on many different shapes; however, some common network shapes have been described in detail in the literature. Random networks comprise nodes with random connections, with each node having approximately the same number of connections to others. The distribution of the nodes’ connections follows a bell-curve. ‘Small world’ networks are characterised by relatively high levels of transitivity and nodes being connected to each other through small average path lengths (Watts & Strogatz, 1998). A classic example of the ‘small-world effect’ is the so-called ‘six degrees of separation’ principle, suggested by Milgram (1967). Letters passed from person to person reached a designated target individual in only a small number of steps (approximately 6); the nodes (individuals) were connected by a short path through the network.

‘Scale free’ networks are characterised by a relatively small number of nodes that are connected to many other nodes (Barabási, 2012). These ‘hub’ nodes have an exceptionally high number of connections to other nodes, whereas the majority of non-hub nodes have very few connections. The distribution of the nodes’ connections follows a power law. Research has found that HIV transmission among men who have sex with men can be modelled as a scale free model (Leigh Brown et al., 2011); identifying individuals who have very high levels of connections and represent ‘superspreaders’ of infections provides an efficient means for targeted vaccinations (Pastor-Satorras & Vespignani, 2001). Within scale free networks, nodes whose centrality is extremely high relative to that of other nodes may be ‘hubs’. However, it is critically important to check the pattern of directed relationships between the node and its neighbours, e.g. in a directed network a node could have a high centrality because it has many directed edges to other nodes (high OutDegree centrality) whilst having no edges from those nodes pointing at it (zero InDegree centrality); in this case the node would not be a hub.

In addition to group-level analysis, networks can be developed at a person-specific level: a time-series network of an individual may be useful for understanding the relationship between nodes (e.g. symptoms) at an individualised level, and could be used for personalised treatment planning (David, Marshall, Evanovich, & Mumma, 2018 ). If network structures are replicated and nodes emerge as hubs, then changing these hub nodes might have downstream effects on other nodes, which might result in an efficient means to change outcomes (Isvoranu et al., 2017 ). For example, network analysis may reveal that a certain belief is a hub and therefore critical in terms of impact on behaviour change: therefore we could focus our efforts on changing that belief rather than attempting to change multiple beliefs. Developing a better understanding of the structural relationships between the nodes in the network can provide important theoretical and practical insights for health psychology.

3. Network accuracy

As the network is based on sample data, the accuracy of the sample-based estimates of the population parameters reflecting the direction, strength and patterns of relationships between nodes should be considered. To date, much of the research on networks has used edge strength and node centrality to make inferences about the phenomenon being modelled. However, as Epskamp et al. (2018a) note, relatively little attention has been paid to examining the accuracy of the edge and centrality estimates. Given the relatively small sample sizes that typically characterise psychological research, edge strengths and node centrality may not be estimated accurately. Therefore, it is recommended that researchers determine the accuracy of both. The accuracy of edge weights is estimated by calculating confidence intervals (e.g. 95% CI) for their estimates. As a CI requires knowledge of the sampling distribution of the estimate, which may be difficult to obtain for the edge weight estimate, Epskamp et al. (2018a) developed a method that uses bootstrapping (Efron, 1979) to repeatedly estimate a model under either sampled or simulated data, and then estimates the required statistic. The more bootstrap samples that are run, the more consistent the results. Either a parametric bootstrap or non-parametric bootstrap can be applied for edge-weights (Bollen & Stine, 1992). For non-parametric bootstrapping, observations in the data are resampled with replacement to create new plausible datasets. Parametric bootstrapping samples new observations from the parametric model that has been estimated from the original data; this creates a series of values that can be used to estimate the sampling distribution. Consequently, the parametric bootstrap requires a parametric model of the data whereas the non-parametric bootstrap can be applied to continuous, categorical and ordinal data. As the non-parametric bootstrap is data-driven and less likely to produce biased estimates with LASSO regularised edges (which tend to dominate in the literature), Epskamp et al. (2018a) emphasise the usefulness and general applicability of the non-parametric bootstrap. If the bootstrapped CIs are wide, it becomes hard to interpret the strength of an edge.
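The logic of the non-parametric bootstrap for an edge weight can be sketched in base R. This is a simplified illustration rather than the procedure implemented in any particular package: the data are simulated, the edge weight is an unregularised partial correlation (no LASSO), and the helper pcor_matrix is our own.

```r
# Partial correlation matrix from a data matrix (unregularised)
pcor_matrix <- function(x) {
  P <- solve(cor(x))
  pc <- -P / sqrt(outer(diag(P), diag(P)))
  diag(pc) <- 0
  dimnames(pc) <- list(colnames(x), colnames(x))
  pc
}

# Simulated data (hypothetical): three items, 200 respondents
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.5 * x2 + rnorm(n)
dat <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Non-parametric bootstrap: resample rows with replacement, re-estimate the edge
n_boot <- 1000
boot_edge <- replicate(n_boot, {
  rows <- sample(n, replace = TRUE)
  pcor_matrix(dat[rows, ])["x1", "x2"]
})

estimate <- pcor_matrix(dat)["x1", "x2"]
ci <- quantile(boot_edge, c(0.025, 0.975))  # percentile 95% interval
estimate
ci
```

A narrow interval that excludes zero supports interpreting the edge; a wide interval signals that the edge weight is poorly determined by the sample.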

The accuracy of the centrality indices can be examined by using a different type of bootstrapping: subsets of the data are used to investigate the stability of the order of centrality indices based on the varying sub-samples (m out of n bootstrap; Chernick, 2011). The focus is on whether the order of centrality indices remains the same after re-estimating the network with fewer cases or nodes. A case-dropping subset bootstrap can be applied, and the correlation stability (CS) coefficient can quantify the stability of centrality indices using subset bootstraps. The centrality indices from the full data are correlated with those obtained from subsets representing different percentages of the overall sample. For example, what is the correlation between the estimates from the entire data and the estimates based on a subset of 70% of the original sample? A series of such correlations can be presented to illustrate how the correlations change as the subset sample gets smaller (95% of the sample, 80%, 70%, …, 25%). If the correlation changes considerably, then the centrality estimate may be problematic. A correlation of .7 or higher between the original full sample estimate and the subset estimates has been suggested as being a useful threshold to examine (Epskamp et al., 2018a). The CS-coefficient (correlation = .7) represents the maximum proportion of cases that can be dropped, such that with 95% probability the correlation between original centrality indices and centrality of networks based on subsets is 0.7 or higher (Epskamp et al., 2018a). It is suggested that the CS-coefficient should not be below 0.25, and preferably it should be above 0.5.
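The case-dropping idea can be sketched as follows. The example is a simplified illustration under stated assumptions: the data are simulated from a one-factor model with invented loadings, strength is computed from unregularised partial correlations, and the output is the average correlation at each retained proportion rather than the CS-coefficient itself.

```r
# Strength centrality from an unregularised partial correlation network
strength_centrality <- function(x) {
  P <- solve(cor(x))
  pc <- -P / sqrt(outer(diag(P), diag(P)))
  diag(pc) <- 0
  rowSums(abs(pc))
}

# Simulated data (hypothetical): six items loading on one common factor
set.seed(123)
n <- 300
f <- rnorm(n)
loadings <- c(0.9, 0.8, 0.6, 0.5, 0.3, 0.2)
dat <- sapply(loadings, function(l) l * f + rnorm(n))

full <- strength_centrality(dat)

# Case-dropping: keep a random subset of cases (no replacement), re-estimate,
# and correlate the subset centralities with the full-sample centralities
keep_props <- c(0.9, 0.7, 0.5, 0.25)
n_boot <- 200
stability <- sapply(keep_props, function(p) {
  mean(replicate(n_boot, {
    rows <- sample(n, size = round(p * n))
    cor(full, strength_centrality(dat[rows, ]))
  }))
})
names(stability) <- paste0(keep_props * 100, "%")
stability
```

If the average correlation falls away quickly as cases are dropped, the ordering of nodes by centrality should be interpreted with caution.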

Other applications of network analysis

The majority of research has examined networks based on cross-sectional data from a single group of participants. However, networks can also be determined for individuals over time as well as for comparing different groups. A network can be created for an individual based on time-series data to provide insights into that specific individual. Nodes that are identified as hubs in such networks could be important targets for interventions (Valente, 2012 ). Networks can be developed that model temporal effects between consecutive data measurements. The graphical VAR model (Wild et al., 2010 ) uses LASSO regularisation based on BIC to select the optimal tuning parameter (Abegaz & Wit, 2013 ). When multiple individuals are measured over time, multi-level VAR can be used and it estimates variation due to both time and to individual differences (Bringmann et al., 2013 ).

Networks can be estimated for different groups. Although the lack of methods comparing networks from different groups has been noted (Fried & Cramer, 2017), joint estimation of different graphical models (Danaher, Wang, & Witten, 2014; Guo, Levina, Michailidis, & Zhu, 2011) may prove useful in this context. For example, the Fused Graphical Lasso (FGL) was recently used to compare the networks of borderline personality disorder patients with those from a community sample (Richetin, Preti, Costantini, De Panfilis, & Mazza, 2017). In addition, van Borkulo and colleagues have developed the Network Comparison Test (NCT) to allow researchers to conduct direct comparisons of two networks as estimated in different subpopulations (Van Borkulo, 2018). The test uses permutation testing in order to compare network structures that involve relationships between variables that are estimated from the data. The test focuses on the extent to which groups may differ in relation to (1) the structure of the network as a whole, (2) a given edge strength, and (3) the overall level of connectivity in the network. For example, research has reported that the network of MDD symptoms for those with persistent depression was more strongly connected than the network of those with remitting depression (van Borkulo et al., 2015).
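The permutation logic behind such a comparison can be illustrated with a simplified sketch for the third question, overall connectivity. This is not the NCT implementation: the groups are simulated, the networks are unregularised partial correlation networks, and global_strength and sim_group are hypothetical helpers written for this example.

```r
# Global strength: sum of absolute partial correlations over unique edges
global_strength <- function(x) {
  P <- solve(cor(x))
  pc <- -P / sqrt(outer(diag(P), diag(P)))
  sum(abs(pc[upper.tri(pc)]))
}

# Simulated groups (hypothetical): items are more strongly related in group 1
set.seed(7)
sim_group <- function(n, loading) {
  f <- rnorm(n)
  sapply(1:5, function(j) loading * f + rnorm(n))
}
g1 <- sim_group(150, loading = 1.0)
g2 <- sim_group(150, loading = 0.3)

observed <- abs(global_strength(g1) - global_strength(g2))

# Permutation: shuffle group membership, re-estimate both networks, and
# recompute the difference to build a reference distribution
pooled <- rbind(g1, g2)
n1 <- nrow(g1)
n_perm <- 500
perm_diff <- replicate(n_perm, {
  idx <- sample(nrow(pooled))
  abs(global_strength(pooled[idx[1:n1], ]) -
      global_strength(pooled[idx[-(1:n1)], ]))
})

# Proportion of permuted differences at least as large as the observed one
p_value <- mean(perm_diff >= observed)
p_value
```

A small proportion indicates that a difference in connectivity as large as the observed one would rarely arise if group membership were unrelated to network structure.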

Network analysis issues

Like all statistical models, the network model represents an idealised version of a real-world phenomenon that we wish to understand. In selecting the variables to be modelled we must decide which variables to include and how they are to be measured: each of these processes introduces error into the modelling process. A general concern for networks concerns their replicability (e.g. see Forbes, Wright, Markon, & Krueger, 2017 ; and responses by Borsboom et al., 2017 ; Steinley, Hoffman, Brusco, & Sher, 2017 ) and research needs to address this issue by estimating the stability of the networks and examining generalizability of the network model. As noted by Fried and Cramer ( 2017 ) the literature in general requires more conceptual and methodological developments for estimating both the accuracy and stability of networks. The identification of useful thresholds for these parameters will also prove critical in the interpretation of the network models. Similar to other methods of analysis (e.g. regression, SEM), network analysis is sensitive to the variables in the model and to the specific estimation methods used. Hence, the challenges regarding replication and generalizability are not unique to network modelling.

The larger the sample size, the more stably and accurately networks are estimated. Given how recent the growth in the use of network analytic approaches in psychology is, it is not easy to hypothesise the expected network structure and edge weights, which means there is little evidence to guide a priori power analyses. Epskamp et al. (2018a) note that as more network research is conducted in psychology, more knowledge will accumulate regarding the nature of network structure and edge-weights that can be expected.

The dominant methods to date used to discover network structures in psychology are based on correlations, partial correlations, and patterns of conditional independencies. Further developments and application of causal model techniques will advance understanding of the relationships present in networks (Borsboom & Cramer, 2013 ). As noted previously, much of the research in psychological networks has been based on exploratory data analyses to generate networks; there is a need to progress towards confirmatory network modelling wherein hypotheses about network structure are formally tested.

How to run network analysis: an example using R

Many network structure analysis methods can be implemented in the generic software MATLAB and Stata, or in specialised network software packages including UCINET (Borgatti, Everett, & Freeman, 2002) or Gephi ( https://gephi.org ). The Stanford Network Analysis Platform (SNAP) provides a network analysis library. R is an open-source statistical programming language that facilitates statistical analysis and data visualisation (R Core Team, 2017); to date much of the research on psychological networks has used the R packages igraph (Csárdi & Nepusz, 2006) or qgraph (Epskamp et al., 2012). Of note, the psychosystems research group has created specific R packages that make network analysis easier to implement (see psychosystems.org). As mentioned at the start of this paper, their website is an essential resource for conducting network analysis in psychology. In this example, we will use the bootnet package as it provides a comprehensive suite of analytical options for network analysis. Data can be entered directly into R or can be imported in various common formats (e.g. a .csv or .txt file) or from other data analysis programmes, e.g. Excel, SPSS, SAS and Stata.

R can be obtained via the https://www.r-project.org/ webpage. To download R, you need to select your preferred CRAN (Comprehensive R Archive Network) mirror ( https://cran.r-project.org/mirrors.html ). On the Mirrors webpage, you will find listings of countries that host identical versions of R, and you should select a location geographically close to your computer’s location. R can be downloaded for Linux, Windows, and Mac OS. The pages are regularly updated and you need to check which releases are supported for your platform. R as a base package can perform many statistical analyses but, most importantly, R’s functionality can be expanded by downloading specific packages.

After installing R ( https://www.r-project.org/ ), it is quite useful to also install R Studio ( https://www.rstudio.com/ ), which provides a convenient interface to R . Once both are installed, opening up R Studio will give a window that is split into 4 panes:

Console/Terminal: this pane is the main graphical interface for the user and this is where the commands are typed in.
Editor: this pane shows the active datasets that you are working on.
Environment/History/Connections: this pane shows the R datasets and allows you to import data from text (e.g. a .csv file), Excel, SPSS, SAS and Stata. The History tab allows you to see the list of your previous commands.
Files/Plots/Packages/Help: this pane and its tabs can open files, view the most current plot (also previous plots), install and load packages, or use the general R help function.

Under the Tools drop-down menu at the top of the R Studio screen, you can select which packages to install for the analyses required. Alternatively, the packages can be installed using the Packages tab or they can be directly installed using a typed command. R is a command line driven programme: you enter commands at the prompt (> by default) and each command is executed one at a time. Note that R is case-sensitive, so commands must be typed exactly as shown. For the current example, you will need to install 2 packages (‘ggplot2’ and ‘bootnet’) and the relevant command lines are:

>install.packages("ggplot2")

>install.packages("bootnet")

Once installed, the packages need to be loaded into R using the library("name of package") command.

>library("ggplot2")

>library("bootnet")

Next we need to tell R to import the data, in this case a .csv file called TPB2018.

The data are taken from a study conducted using the Theory of Planned Behaviour (TPB; Ajzen, 1985, 2011). The TPB assumes that volitional human behaviour is a function of (1) one’s intention to perform a given behaviour and (2) one’s perception of behavioural control (PBC) regarding that behaviour (Figure 3). Furthermore, intentions are influenced by one’s attitudes towards the behaviour (e.g. cognitive attitudes: is the behaviour good or bad?; affective attitudes: is the behaviour pleasant or unpleasant?), one’s subjective norm beliefs (e.g. descriptive norms: do others perform the behaviour?; injunctive norms: do others who are important to me want me to perform the behaviour?), and one’s perceptions of control regarding the behaviour (e.g. self-efficacy: level of confidence to perform the behaviour; perceived control: barriers to stop the behaviour being performed). The extent to which PBC influences behaviour directly, rather than indirectly through intention, depends on the degree of actual control over performing the behaviour (Sniehotta, Presseau, & Araújo-Soares, 2014). The TPB has been a dominant theoretical approach in health behaviour research for a number of decades and has been examined extensively. The vast majority of studies have used correlational designs to investigate cross-sectional and prospective associations between TPB variables and behaviour (Noar & Zimmerman, 2005); systematic reviews indicate that the TPB accounts for approximately 20% of variance in health behaviour, and that intention is the strongest predictor of behaviour (McEachan, Conner, Taylor, & Lawton, 2011).

Figure 3. Theory of planned behaviour.

Following receipt of ethical approval from the local university REC (2014/6/15), students completed a questionnaire regarding regular exercise (Datafile in supplementary material). This cross-sectional dataset is used here to illustrate how to conduct a network analysis and comprises the responses of 200 students to a TPB questionnaire, which included the following items relating to regular exercise (i.e. exercising for at least 20 min, three times per week) for the next two months:

Att1: belief that engaging in regular exercise is healthy
Att2: belief that engaging in regular exercise is useful
Att3: belief that engaging in regular exercise is enjoyable
Dnorm1: descriptive norms for friends regarding engaging in regular exercise
Dnorm2: descriptive norms for other students regarding engaging in regular exercise
Injnorm1: injunctive norms for friends regarding engaging in regular exercise
Injnorm2: injunctive norms for students regarding engaging in regular exercise
Pbc1: perceived control regarding engaging in regular exercise
Pbc2: self-efficacy towards engaging in regular exercise
Intention: intention to engage in regular exercise

In the Environment/History/Connection pane, we can select Import Dataset to import the datafile. Alternatively you can use the command code:

>TPB2018 <- read.csv("filename.extension", header = TRUE)

Here "filename.extension" stands for the location (path and file name) of the relevant .csv file on your computer.

Once it is imported, the data will appear in the Editor pane, and the console window will show a line of code indicating that the data are active:

>View(TPB2018)
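Before estimating anything, it is worth confirming that the import produced the expected rows and columns. The sketch below cannot use the real datafile, so it writes a small simulated stand-in to a temporary folder and reads it back; the 1 to 7 response scale, the file name TPB2018_example.csv and the capitalisation of the column names are assumptions made for illustration only.

```r
# Hypothetical stand-in for the real datafile: 200 simulated respondents and
# ten items scored 1-7 (an assumed scale). The real responses are NOT used.
set.seed(2018)
items <- c("Att1", "Att2", "Att3", "Dnorm1", "Dnorm2",
           "Injnorm1", "Injnorm2", "Pbc1", "Pbc2", "Intention")
fake <- as.data.frame(matrix(sample(1:7, 200 * length(items), replace = TRUE),
                             nrow = 200, dimnames = list(NULL, items)))

csv_path <- file.path(tempdir(), "TPB2018_example.csv")
write.csv(fake, csv_path, row.names = FALSE)

# Import as in the text, then check what arrived.
TPB2018 <- read.csv(csv_path, header = TRUE)
dim(TPB2018)                     # number of respondents and number of items
setdiff(items, names(TPB2018))   # expected items missing from the file, if any
colSums(is.na(TPB2018))          # number of missing values per item
```

With the real file, only the read.csv() line and the three checks are needed.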

The next step is to tell R to estimate the network model using the EBICglasso to produce an interpretable network. The command line below tells R to label the results as ‘Network.’

Network <- estimateNetwork(TPB2018, default = "EBICglasso")
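The edges of a network estimated with EBICglasso are regularised partial correlations: the association between two items after controlling for all other items in the network. To illustrate what a partial correlation captures, the sketch below computes unregularised partial correlations in base R from simulated data. This is a simplification and not what estimateNetwork() does internally (there is no lasso penalty and no EBIC model selection); the variables v1 to v4 are hypothetical.

```r
# Simulated data (hypothetical): a chain v1 -> v2 -> v3 plus an unrelated v4.
set.seed(1)
n  <- 200
v1 <- rnorm(n)
v2 <- 0.6 * v1 + rnorm(n)
v3 <- 0.6 * v2 + rnorm(n)   # related to v1 only through v2
v4 <- rnorm(n)
X  <- cbind(v1, v2, v3, v4)

# Partial correlations: standardise the inverse of the correlation matrix
# and flip the sign of its off-diagonal elements.
precision <- solve(cor(X))
pcor <- -cov2cor(precision)
diag(pcor) <- 0

round(cor(X), 2)   # ordinary correlations: v1 and v3 are correlated
round(pcor, 2)     # partial correlations: the v1-v3 value should shrink towards
                   # zero (up to sampling error) once v2 is controlled for
```

The regularisation applied by EBICglasso additionally shrinks small partial correlations to exactly zero, which is what makes the plotted network sparse.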

Once we have estimated the network, we can ask R to plot it.

>plot(Network, layout = "spring", labels = colnames(TPB2018))

These commands will produce the network plot with the variable names in the plot (Figure 4).

Figure 4. Network analysis of TPB items. The size and density of the edges between the nodes represent the strength of connectedness.

The network shows the strength of relationships between the TPB variables. Some variables have quite strong connections (e.g. att2 and att3; injnorm1 and dnorm1), whereas others have weak relationships (e.g. att1 and pbc1). Visual inspection suggests that the network splits into three communities: (1) the normative beliefs cluster together; (2) the three attitudinal variables and the pbc1 item cluster together; and (3) the pbc2 and intention items cluster together. However, visual inspection of a graphical display of complex relationships requires careful interpretation, especially if there are a large number of nodes in the network. To check for the presence of the three potential communities, a spinglass algorithm was applied to the network using the igraph R package. Of note, this analysis supported the three-community interpretation (interested readers are referred to Eiko Fried’s tutorial on this topic: http://psych-networks.com/r-tutorial-identify-communities-items-networks/).

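Next we can examine the centrality indices in terms of Betweenness, Closeness and Strength (Figure 5). The centralityPlot() function used below comes from the qgraph package, on which bootnet relies; if R reports that the function cannot be found, load qgraph first with library("qgraph").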

Figure 5. Centrality indices.

>centralityPlot(Network)

Att3 had the highest strength value and a high closeness value: it has strong connections to the nodes nearby. It plays an important role in the network, and its activation has the strongest influence on the other nodes in the network. However, pbc1 and injnorm1 had the highest betweenness values: they act as the bridges connecting the communities of nodes.
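Of the three indices, strength is the simplest to compute by hand: it is the sum of the absolute edge weights attached to a node. The sketch below illustrates this on a small, entirely hypothetical four-node weight matrix (nodes A to D and their weights are invented for illustration and are unrelated to the TPB data).

```r
# Hypothetical weight matrix for a four-node undirected network.
nodes <- c("A", "B", "C", "D")
W <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
W["A", "B"] <- 0.4
W["A", "C"] <- 0.1
W["B", "C"] <- -0.3
W["C", "D"] <- 0.2
W <- W + t(W)   # fill in the lower triangle so the matrix is symmetric

# Strength: sum of the absolute edge weights attached to each node.
strength <- rowSums(abs(W))
strength

# Node with the highest strength.
names(which.max(strength))
```

Negative edges count towards strength through their absolute value, so a node with strong negative connections can still be highly central on this index.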

Stability of the centrality indices

As noted previously, the stability of centrality indices can be examined by estimating network models based on subsets of the data. The case-dropping bootstrap (type = "case") is used; in this case, 1000 bootstrapped samples were estimated.

>CentralStability <- bootnet(Network, nBoots = 1000, type = "case")

The CS coefficients for each index can be produced:

>corStability(CentralStability)

A table presenting summary data (e.g. M, SD, CIs) on the bootstrapped indices can be created.

>summary(CentralStability)

However, it may be more useful to plot the stability of centrality indices:

>plot(CentralStability)

Figure 6 shows the resulting plot of the centrality indices. As the percentage of the sample included in the estimates decreases (on the X-axis, the subset samples decrease from 95% to 25% of the original sample), the correlation between the subsample estimate and the estimate from the entire original sample drops. Once the correlation goes below .7, the estimates become unstable. For example, using 90% of the original sample, there is a steep decrease in the accuracy of the betweenness estimate, whilst the stability of the strength and closeness estimates declines at a slower rate. However, with a subset sample of 70% of the original participants, the closeness estimate correlates less than .7 with the full-sample estimate. When the subset sample comprises 50% of the original sample, the strength estimate falls below .7. Overall, the pattern suggests that the centrality indices for closeness and betweenness are not stable enough to be interpreted with confidence: of note, strength tends to be the most precisely estimated centrality index in psychology networks, and betweenness and closeness only reach the threshold for reliable estimation in large samples (Santos, Kossakowski, Schwartz, Beeber, & Fried, 2018).

Figure 6. Stability of central indices.
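The logic of the case-dropping bootstrap can be sketched in base R: drop a proportion of the cases at random, recompute a centrality index, and correlate it with the full-sample index. The sketch below is an illustration under simplifying assumptions and is not bootnet's implementation: it uses simulated data, unregularised partial correlations in place of EBICglasso, and strength only. The names strength_of, keep_props and n_reps are hypothetical.

```r
# Strength centrality from unregularised partial correlations
# (a simplification; bootnet would re-estimate the EBICglasso network).
strength_of <- function(data) {
  pcor <- -cov2cor(solve(cor(data)))
  diag(pcor) <- 0
  rowSums(abs(pcor))
}

# Simulated data (hypothetical): 200 cases, 6 variables sharing a common factor.
set.seed(42)
n <- 200
common <- rnorm(n)
loadings <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
X <- sapply(loadings, function(l) l * common + rnorm(n))
colnames(X) <- paste0("V", 1:6)

full <- strength_of(X)

# For each retained proportion: keep a random subset of cases, re-estimate
# strength, and correlate it with the full-sample strength. Average over reps.
keep_props <- c(0.9, 0.7, 0.5, 0.3)
n_reps <- 100
stability <- sapply(keep_props, function(p) {
  mean(replicate(n_reps, {
    rows <- sample(n, size = round(p * n))
    cor(strength_of(X[rows, ]), full)
  }))
})
names(stability) <- paste0(keep_props * 100, "%")
round(stability, 2)
```

The printed values play the role of one line in the stability plot: the average correlation with the full-sample estimate at each retained proportion of cases.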

Edge weight accuracy

The robustness of the edge weights can be examined using bootstrapped confidence intervals.

> EdgeWgt<- bootnet(Network, nBoots = 2500)

Similar to the centrality indices, a summary table of the results of the edge accuracy analysis can be produced (e.g. M, SD, CIs for estimates):

summary(EdgeWgt)

The plot of the bootstrapped CIs for estimated edge parameters provides a visually informative representation of the estimates.

> plot(EdgeWgt, labels = TRUE, order = "sample")

Figure 7 has been modified to remove most of the names of the edges on the Y-axis, which de-clutters the figure and enhances readability. The red line in Figure 7 shows the edge value estimated in the sample, and the grey bars surrounding the red line indicate the width of the bootstrapped CIs. Of note, many of the edges are estimated as zero (e.g. dnorm2 - att3). Some edges are larger than zero, but the bootstrapped CIs contain zero (e.g. att3 - intention), and for a smaller number of edges, the estimates are larger than zero and the CIs do not include zero (e.g. dnorm1 - injnorm1). Given this pattern of CIs for the edge weights, the network should be interpreted with caution.

Figure 7. Accuracy of the edge-weight estimates (red line) and the 95% confidence intervals (grey bars) for the estimates.
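The idea behind the bootstrapped CIs can be illustrated with a nonparametric bootstrap in base R: resample cases with replacement, re-estimate every edge, and take percentiles of the resampled edge weights. This is a sketch under stated assumptions rather than bootnet's procedure: the data are simulated, the edges are unregularised partial correlations, and pcor_matrix, n_boots and the variables v1 to v3 are hypothetical names. In the simulation the v1-v2 edge is built to be strong, the v2-v3 edge weak, and the v1-v3 edge absent.

```r
# Unregularised partial correlation matrix (stand-in for the estimated network).
pcor_matrix <- function(data) {
  p <- -cov2cor(solve(cor(data)))
  diag(p) <- 0
  p
}

# Simulated data (hypothetical).
set.seed(7)
n  <- 200
v1 <- rnorm(n)
v2 <- 0.7 * v1 + rnorm(n)    # strong v1-v2 relationship
v3 <- 0.15 * v2 + rnorm(n)   # weak v2-v3 relationship
X  <- cbind(v1, v2, v3)

n_boots <- 1000
edges <- t(combn(colnames(X), 2))     # one row per pair of nodes
boot_edges <- replicate(n_boots, {
  rows <- sample(n, replace = TRUE)   # resample cases with replacement
  pcor_matrix(X[rows, ])[edges]
})
# boot_edges has one row per edge and one column per bootstrap sample.

result <- data.frame(
  edge   = paste(edges[, 1], edges[, 2], sep = "-"),
  sample = pcor_matrix(X)[edges],
  lower  = unname(apply(boot_edges, 1, quantile, probs = 0.025)),
  upper  = unname(apply(boot_edges, 1, quantile, probs = 0.975))
)
result$ci_includes_zero <- result$lower <= 0 & result$upper >= 0
result
```

Each row of the result corresponds to one horizontal bar in a plot such as Figure 7: the sample estimate with its lower and upper percentile bounds.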

The data were used to illustrate how to run a network analysis. Typically such data are analysed by combining the items into their higher-order constructs (e.g. Attitudes, Norms, PBC, and Intentions); multiple regression then examines the extent to which variation in Attitudes, Norms and PBC accounts for variation in Intentions, and which variables have significant relationships with Intentions (Noar & Zimmerman, 2005). Network analysis allows us to examine how the items relate to each other and can reveal important structural relationships that regression cannot reveal. If the present network were replicated using larger samples, then we could interpret the network in terms of its structural implications for the TPB.
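For comparison, the conventional composite-score analysis can be sketched as follows. The data are simulated and hypothetical (including the effect sizes used to generate them), the item names follow the questionnaire list with an assumed capitalisation, and averaging items with rowMeans() is one common way of forming composites rather than the procedure of any particular study.

```r
# Simulated item scores (hypothetical): items within a construct share a
# common component, and Intention depends on the three constructs.
set.seed(10)
n <- 200
att <- rnorm(n); nrm <- rnorm(n); pbc <- rnorm(n)
sim_item <- function(base) base + rnorm(n, sd = 0.8)
TPB <- data.frame(
  Att1 = sim_item(att), Att2 = sim_item(att), Att3 = sim_item(att),
  Dnorm1 = sim_item(nrm), Dnorm2 = sim_item(nrm),
  Injnorm1 = sim_item(nrm), Injnorm2 = sim_item(nrm),
  Pbc1 = sim_item(pbc), Pbc2 = sim_item(pbc)
)
TPB$Intention <- 0.5 * att + 0.1 * nrm + 0.4 * pbc + rnorm(n, sd = 0.8)

# Combine the items into their higher-order constructs (mean of the items).
scores <- data.frame(
  Attitudes = rowMeans(TPB[, c("Att1", "Att2", "Att3")]),
  Norms     = rowMeans(TPB[, c("Dnorm1", "Dnorm2", "Injnorm1", "Injnorm2")]),
  PBC       = rowMeans(TPB[, c("Pbc1", "Pbc2")]),
  Intention = TPB$Intention
)

# Multiple regression of Intention on the three composite scores.
fit <- lm(Intention ~ Attitudes + Norms + PBC, data = scores)
summary(fit)$coefficients   # estimates, standard errors, t and p values
summary(fit)$r.squared      # proportion of variance in Intention accounted for
```

The regression yields one coefficient per construct, whereas the network retains every item as its own node, which is why item-level structure is visible only in the latter.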

Contrary to the theory, not all variables were directly related to intentions; for example, the relationship of att2 (belief that exercise is useful) to intention was mediated by its relationships to att1, att3 and pbc1. Indeed, all of the subjective norm items were related to intentions through a mediated pathway with pbc1. In line with the TPB, the normative beliefs are related to each other and form a community (i.e. the normative variables correlate with each other); however, in the current network, contrary to the theory, these normative beliefs have no direct relationship with intentions and only a weak relationship to PBC. This finding would indicate that your intentions to exercise are not strongly influenced by either the exercise behaviours of others or what you believe others would like you to do in terms of regular exercise. Rather, the network suggests that your beliefs about others’ exercise influence only your perceptions of control over exercise, e.g. if others are exercising and want you to exercise, you may feel that you have more control over whether you exercise (‘if others can do it, then so can I’), and by feeling in control, you may then have higher intentions to exercise. A previous meta-analysis similarly reported lower correlations between subjective norms and intentions for physical activity behaviour compared to the strength of relationships between attitudes and intentions, and between PBC and intention (Hagger, Chatzisarantis, & Biddle, 2002).

Among the attitudinal variables, the affective attitude is the central node, as it connects not only to all the other attitude variables but also to both PBC items (in line with theory) and the Intention item. Research has highlighted the role of affective attitudes on behaviour (e.g. Lawton, Conner, & McEachan, 2009), and the present data highlight the value in conceptualising attitudes as comprising affective/experiential and cognitive/instrumental components (Conner, 2015).

The model also found that the self-efficacy variable (pbc2) of PBC had the highest closeness to intentions; the strong relationship between self-efficacy and activity intentions is consistent with previous meta-analyses (Hagger et al., 2002). The fact that the two PBC items had differing patterns of relationships with the other TPB variables further supports the proposed distinction between the self-efficacy and perceived control components of PBC (Conner, 2015). If replicated using within-person networks, the findings may suggest that changes in self-efficacy might directly impact on intentions, and that changes in affective attitude might impact on the other attitudinal variables; given the network model, a change in Att1 provides a route to influence Pbc2, which should further strengthen the intentions. In essence, the network reveals that for regular exercise behaviour among the student population, the affective attitudinal variable is the strongest node, and therefore interventions could prioritise changing the emotional responses to exercise in order to increase intentions to exercise. The network gives little support to intervening to change normative beliefs. This section indicates how network analysis in principle can influence not just how we appraise the pathways proposed in our theories, but also how it may offer guidance for interventions.

The present example aimed to highlight some of the key aspects of conducting network analysis in R and how to make sense of the outputs. Many real-world networks estimated in psychology are likely to be messy, and therefore interpretations require tempering in light of the stability and accuracy of the estimates. As network analysis becomes more prevalent, replication of network structures and properties will give greater confidence in the interpretations of the network patterns.

Of note, the psychosystems group has also developed an online web app (https://jolandakos.shinyapps.io/NetworkApp/) that allows researchers to visualise and analyse networks from data uploaded into the app. The app, based on the R packages described above, can analyse data in different common formats (e.g. ‘.csv’, ‘.xls’ and ‘.sav’), and the data can represent the raw data, the correlation matrix between the variables, an adjacency matrix, or an edge list. The user can inform the app how missing data were coded and can also apply the nonparanormal transformation for data that are not normally distributed. The app provides the various options outlined in this paper for estimating the network structure from the raw data; these include the GLASSO, the graphical VAR, and multilevel VAR. By default, the Fruchterman-Reingold algorithm is used to lay out the network, and the user can decide various visual settings (e.g. size of nodes). The app also calculates the centrality indices (strength, closeness and betweenness) to determine a node’s importance in the network. A clustering analysis can be run on the data, and the networks from two groups can be compared. This resource offers a very user-friendly means to start to examine network structures in data.

Barabási (2012) argued that theories cannot ignore the network effects caused by interconnectedness among variables. Health psychological processes reflect complex systems and to understand such systems, we need to understand the networks that define the interactions between the constituent variables. Many of our core health psychology models comprise networks of interacting constructs. Considering such psychological processes and outcomes from this perspective offers alternate ways of conceptualising and answering important psychological questions. Networks evolve over time due to dynamical processes that add or remove nodes (variables) or change edges (relationships between variables): the power of network science derives from the ability of the network to model systems where the nature of the nodes (e.g. symptoms, behaviours, beliefs, physiological arousal) and the edges (e.g. correlational relationship, causal relationship, social connection) can vary. Network analysis as a technique has been briefly outlined, and a simple analysis in R has been presented. Hopefully this brief paper will encourage health psychologists to think about their data in terms of networks and to start to apply network analysis methods to their research questions. The work of Borsboom and colleagues provides a key foundation for network analyses and, as mentioned at the start of this paper, their invaluable contributions to the applications of network theory to psychology cannot be overstated. Understanding the dynamic patterns of networks may offer unique insights into core psychological processes that impact health and well-being.

1 We wish to thank an anonymous reviewer for highlighting this possibility.

Disclosure statement

No potential conflict of interest was reported by the author.

David Hevey http://orcid.org/0000-0003-2844-0449

  • Abegaz, F., & Wit, E. (2013). Sparse time series chain graphical models for reconstructing genetic networks. Biostatistics (Oxford, England), 14(3), 586–599. doi: 10.1093/biostatistics/kxt005
  • Ajzen, I. (1985). From intentions to actions: A theory of planned behavior. In J. Kuhl & J. Beckman (Eds.), Action-control: From cognition to behavior (pp. 11–39). Heidelberg: Springer.
  • Ajzen, I. (2011). The theory of planned behaviour: Reactions and reflections. Psychology & Health, 26, 1113–1127. doi: 10.1080/08870446.2011.613995
  • Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66(3), 411–421. doi: 10.1097/00006842-200405000-00021
  • Barabási, A. L. (2012). The network takeover. Nature Physics, 8, 14–16. doi: 10.1038/nphys2188
  • Beard, C., Millner, A. J., Forgeard, M. J. C., Fried, E. I., Hsu, K. J., Treadway, M. T., … Björgvinsson, T. (2016). Network analysis of depression and anxiety symptom relationships in a psychiatric sample. Psychological Medicine, 46(16), 3359–3369. doi: 10.1017/S0033291716002300
  • Bentler, P. M., & Satorra, A. (2010). Testing model nesting and equivalence. Psychological Methods, 15(2), 111–123. doi: 10.1037/a0019625
  • Blanken, T. F., Deserno, M. K., Dalege, J., Borsboom, D., Blanken, P., Kerkhof, G. A., & Cramer, A. O. J. (2018). The role of stabilizing and communicating symptoms given overlapping communities in psychopathology networks. Scientific Reports, 8, 59. doi: 10.1038/s41598-018-24224-2
  • Bollen, K. A., & Stine, R. A. (1992). Bootstrapping goodness-of-fit measures in structural equation models. Sociological Methods & Research, 21(2), 205–229. doi: 10.1177/0049124192021002004
  • Borgatti, S. P. (2005). Centrality and network flow. Social Networks, 27, 55–71. doi: 10.1016/j.socnet.2004.11.008
  • Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet for windows: Software for social network analysis. Harvard, MA: Analytic Technologies.
  • Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network analysis in the social sciences. Science, 323, 892–895. doi: 10.1126/science.1165821
  • Borsboom, D. (2017). A network theory of mental disorders. World Psychiatry, 16, 5–13. doi: 10.1002/wps.20375
  • Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91–121. doi: 10.1146/annurev-clinpsy-050212-185608
  • Borsboom, D., Fried, E. I., Epskamp, S., Waldorp, L. J., van Borkulo, C. D., van der Maas, H. L. J., & Cramer, A. O. J. (2017). False alarm? A comprehensive reanalysis of “evidence that psychopathology symptom networks have limited replicability” by Forbes, Wright, Markon, and Krueger (2017). Journal of Abnormal Psychology, 126(7), 989–999. doi: 10.1037/abn0000306
  • Bringmann, L. F., Pe, M. L., Vissers, N., Ceulemans, E., Borsboom, D., Vanpaemel, W., … Kuppens, P. (2016). Assessing temporal emotion dynamics using networks. Assessment, 23(4), 425–435. doi: 10.1177/1073191116645909
  • Bringmann, L. F., Vissers, N., Wichers, M., Geschwind, N., Kuppens, P., Peeters, F., … de Erausquin, G. A. (2013). A network approach to psychopathology: New insights into clinical longitudinal data. PLoS ONE, 8(4), e60188. doi: 10.1371/journal.pone.0060188
  • Bullmore, E., & Sporns, O. (2009). Complex brain networks: Graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3), 186–198. doi: 10.1038/nrn2575
  • Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. doi: 10.1093/biomet/asn034
  • Chernick, M. R. (2011). Bootstrap methods: A guide for practitioners and researchers. New York: Wiley.
  • Clifton, A., & Webster, G. D. (2017). An introduction to social network analysis for personality and social psychologists. Social Psychological and Personality Science, 8(4), 442–453. doi: 10.1177/1948550617709114
  • Conner, M. (2015). Extending not retiring the theory of planned behaviour: A commentary on Sniehotta, Presseau and Araújo-Soares. Health Psychology Review, 9(2), 141–145. doi: 10.1080/17437199.2014.899060
  • Costantini, G., Epskamp, S., Borsboom, D., Perugini, M., Mõttus, R., Waldorp, L. J., & Cramer, A. O. J. (2015). State of the aRt personality research: A tutorial on network analysis of personality data in R. Journal of Research in Personality, 54, 13–29. doi: 10.1016/j.jrp.2014.07.003
  • Cramer, A. O. J., van Borkulo, C. D., Giltay, E. J., van der Maas, H. L. J., Kendler, K. S., Scheffer, M., … Branchi, I. (2016). Major depression as a complex dynamic system. PLoS ONE, 11(12), e0167490. doi: 10.1371/journal.pone.0167490
  • Cramer, A. O. J., Waldorp, L., van der Maas, H., & Borsboom, D. (2010). Comorbidity: A network perspective. Behavioral and Brain Sciences, 33(2–3), 137–150. doi: 10.1017/S0140525X09991567
  • Csárdi, G., & Nepusz, T. (2006). The Igraph Software Package for Complex Network Research. InterJournal, Complex Systems, 1695. Retrieved from http://igraph.org
  • Dalege, J., Borsboom, D., van Harreveld, F., van den Berg, H., Conner, M., & van der Maas, H. L. J. (2015). Toward a formalized account of attitudes: The Causal Attitude Network (CAN) model. Psychological Review, 123(1), 2–22. doi: 10.1037/a0039802
  • Danaher, P., Wang, P., & Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 373–397. doi: 10.1111/rssb.12033
  • David, S. J., Marshall, A. J., Evanovich, E. K., & Mumma, H. (2018). Intraindividual dynamic network analysis – implications for clinical assessment. Journal of Psychopathology and Behavioral Assessment, 40, 235–248. doi: 10.1007/s10862-017-9632-8
  • De Schryver, M., Vindevogel, S., Rasmussen, A. E., & Cramer, A. O. J. (2015). Unpacking constructs: A network approach for studying war exposure, daily stressors and post-traumatic stress disorder. Frontiers in Psychology, 6, 4. doi: 10.3389/fpsyg.2015.01896
  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.
  • Engel, G. L. (1980). The clinical application of the biopsychosocial model. American Journal of Psychiatry, 137, 535–544. doi: 10.1176/ajp.137.5.535
  • Epskamp, S. (2018). Regularized Gaussian psychological networks: Brief report on the performance of extended BIC model selection. Retrieved from https://arxiv.org/abs/1606.05771
  • Epskamp, S., Borsboom, D., & Fried, E. I. (2018a). Estimating psychological networks and their accuracy: A tutorial paper. Behavior Research Methods, 50(1), 195–212. doi: 10.3758/s13428-017-0862-1
  • Epskamp, S., Cramer, A., Waldorp, L., Schmittmann, V. D., & Borsboom, D. (2012). Qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(1), 1–18. doi: 10.18637/jss.v048.i04
  • Epskamp, S., & Fried, E. I. (In Press). A tutorial on estimating regularized psychological networks. Psychological Methods. doi: 10.1037/met0000167
  • Epskamp, S., Kruis, J., Marsman, M., & Marinazzo, D. (2017). Estimating psychopathological networks: Be careful what you wish for. PLOS ONE, 12(6), e0179891.
  • Epskamp, S., Maris, G., Waldorp, L., & Borsboom, D. (In Press). Network psychometrics. In P. Irwing, D. Hughes, & T. Booth (Eds.), Handbook of psychometrics. New York, NY, USA: Wiley.
  • Epskamp, S., van Borkulo, C. D., van der Veen, M. N., Servaas, M. N., Isvoranu, A.-M., Riese, H., & Cramer, A. O. J. (2018b). Personalized network modeling in psychopathology: The importance of contemporaneous and temporal connections. Clinical Psychological Science, 6(3), 416–427. doi: 10.1177/2167702617744325
  • Estrada, E., & Knight, P. A. (2015). A first course in network theory. Oxford: Oxford University Press.
  • Fan, J., Feng, Y., & Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2), 521–541. doi: 10.1214/08-AOAS215
  • Finegood, D. T., Merth, T. D. N., & Rutter, H. (2010). Implications of the foresight obesity system map for solutions to childhood obesity. Obesity, 18(Supplement 1), S13–S16.
  • Forbes, M. K., Wright, A. G. C., Markon, K. E., & Krueger, R. F. (2017). Evidence that psychopathology symptom networks have limited replicability. Journal of Abnormal Psychology, 126(7), 969–988. doi: 10.1037/abn0000276
  • Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. Advances in Neural Information Processing Systems, 23, 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010.
  • Foygel Barber, R., & Drton, M. (2015). High-dimensional ising model selection with Bayesian information criteria. Electronic Journal of Statistics, 9(1), 567–607. doi: 10.1214/154957804100000000
  • Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239. doi: 10.1016/0378-8733(78)90021-7
  • Fried, E. I. (2016). R tutorial: how to identify communities of items in networks. Retrieved from http://psych-networks.com/r-tutorial-identify-communities-items-networks/
  • Fried, E. I., & Cramer, A. O. J. (2017). Moving forward: Challenges and directions for psychopathological network theory and methodology. Perspectives on Psychological Science, 12(6), 999–1020. doi: 10.17605/OSF.IO/BNEK
  • Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics (Oxford, England), 9(3), 432–441. doi: 10.1093/biostatistics/kxm045
  • Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph drawing by force-directed placement. Software: Practice and Experience, 21(11), 1129–1164. doi: 10.1002/spe.4380211102
  • Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3), 424–438. doi: 10.2307/1912791
  • Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10, 37–48.
  • Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika, 98(1), 1–15. doi: 10.1093/biomet/asq060
  • Hagger, M. S., Chatzisarantis, N. L. D., & Biddle, S. J. H. (2002). A meta-analytic review of the theories of reasoned action and planned behavior in physical activity: Predictive validity and the contribution of additional variables. Journal of Sport & Exercise Psychology, 24(1), 3–32.
  • Haslbeck, J. M. B., & Waldorp, L. J. (2016). Structure estimation for mixed graphical models in high dimensional data. Retrieved from https://arxiv.org/abs/1510.05677
  • Isvoranu, A. M., van Borkulo, C. D., Boyette, L., Wigman, J. T. W., Vinkers, C. H., Borsboom, D., … GROUP Investigators. (2017). A network approach to psychosis: Pathways between childhood trauma and psychotic symptoms. Schizophrenia Bulletin, 43(1), 187–196. doi: 10.1093/schbul/sbw055
  • Kelly, H. H. (1983). Perceived causal structures. In J. Jaspars, F. D. Fincham, & M. Hewstone (Eds.), Attribution theory and research: Conceptual, developmental and social dimensions (pp. 343–369). London: Academic Press.
  • Kossakowski, J. J., Epskamp, S., Kieffer, J. M., van Borkulo, C. D., Rhemtulla, M., & Borsboom, D. (2016). The application of a network approach to health-related quality of life (HRQoL): Introducing a new method for assessing hrqol in healthy adults and cancer patient. Quality of Life Research, 25, 781–792. doi: 10.1007/s11136-015-1127-z
  • Krämer, N., Schäfer, J., & Boulesteix, A. L. (2009). Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics, 10, 384. doi: 10.1186/1471-2105-10-384
  • Kroeze, R., van der Veen, D. C., Servaas, M. N., Bastiaansen, J. A., Oude Voshaar, R. C., Borsboom, D., … Riese, H. (2017). Personalized feedback on symptom dynamics of psychopathology: A proof-of-principle study. Journal for Person-Oriented Research, 3, 1–10.
  • Langley, D. J., Wijn, R., Epskamp, S., & Van Bork, R. (2015). Should I get that Jab? Exploring Influence to encourage vaccination via online social media. ECIS 2015 Research-in-Progress Papers , Paper 64.
  • Lauritzen, S. L. (1996). Graphical models . Oxford, UK: Clarendon Press. [ Google Scholar ]
  • Lawton, R., Conner, M., & McEachan, R. (2009). Desire or reason: Predicting health behaviors from affective and cognitive attitudes . Health Psychology , 28 , 56–65. doi: 10.1037/a0013424 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lehman, B. J., David, D. M., & Gruber, J. A. (2017). Rethinking the biopsychosocial model of health: Understanding health as a dynamic system . Social and Personality Psychology Compass , 11 ( 8 ), e12282. doi: 10.1111/spc3.12328 [ CrossRef ] [ Google Scholar ]
  • Leigh Brown, A. J., Lycett, S. J., Weinert, L., Hughes, G. H., Fearnhill, E., & Dunn, D. T. (2011). Transmission network parameters estimated from HIV sequences for a nationwide epidemic . The Journal of Infectious Diseases , 204 , 1463–1469. doi: 10.1093/infdis/jir550 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Liu, H., Lafferty, J. D., & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs . The Journal of Machine Learning Research , 10 , 2295–2328. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis . Psychological Bulletin , 114 ( 1 ), 185–199. doi: 10.1037/0033-2909.114.1.185 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • McEachan, R. R. C., Conner, M., Taylor, N., & Lawton, R. J. (2011). Prospective prediction of health-related behaviors with the theory of planned behavior: A meta-analysis . Health Psychology Review , 5 , 97–144. doi: 10.1080/17437199.2010.521684 [ CrossRef ] [ Google Scholar ]
  • Milgram, S. (1967). The small-world problem . Psychology Today , 2 , 60–67. [ Google Scholar ]
  • Mõttus, R., & Allerhand, M. (2017). Why do traits come together? The underlying trait and network approaches . In Zeigler-Hill V., & Shackelford T. (Eds.), SAGE handbook of personality and individual differences: Volume 1. The science of personality and individual differences (pp. 1–22). London: SAGE. [ Google Scholar ]
  • Noar, S. M., & Zimmerman, R. S. (2005). Health behavior theory and cumulative knowledge regarding health behaviors: Are we moving in the right direction? Health Education Research , 20 , 275–290. doi: 10.1093/her/cyg113 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths . Social Networks , 32 ( 3 ), 245–251. doi: 10.1016/j.socnet.2010.03.006 [ CrossRef ] [ Google Scholar ]
  • Pastor-Satorras, R., & Vespignani, A. (2001). Epidemic spreading in scale-free networks . Physical Review Letters , 86 , 3200–3203. [ PubMed ] [ Google Scholar ]
  • Pearl, J. (2000). Causality: Models, reasoning, and inference . New York, NY: Cambridge University Press. [ Google Scholar ]
  • R Core Team . (2017). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
  • Rhemtulla, M., Fried, E. I., Aggen, S. H., Tuerlinckx, F., Kendler, K. S., & Borsboom, D. (2016). Network analysis of substance abuse and dependence symptoms . Drug and Alcohol Dependence , 161 , 230–237. doi: 10.1016/j.drugalcdep.2016.02.005 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Richetin, J., Preti, E., Costantini, G., De Panfilis, C., & Mazza, M. (2017). The centrality of affective instability and identity in borderline personality disorder: Evidence from network analysis . PLOS one , 12 ( 10 ), e0186695. doi: 10.1371/journal.pone.0186695 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Santos, H. P., Jr., Kossakowski, J. J., Schwartz, T. A., Beeber, L., & Fried, E. I. (2018). Longitudinal network structure of depression symptoms and self-efficacy in low-income mothers . PLoS ONE , 13 ( 1 ), e0191675. doi: 10.1371/journal.pone.0191675 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Saramäki, J., Kivelä, M., Onnela, J., Kaski, K., & Kertész, J. (2007). Generalizations of the clustering coefficient to weighted complex networks . Physical Review E , 75 ( 2 ), 027105. doi: 10.1103/PhysRevE.75.027105 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Schmittmann, V. D., Cramer, A. O. J., Waldorp, L. J., Epskamp, S., Kievit, R. A., & Borsboom, D. (2013). Deconstructing the construct: A network perspective on psychological phenomena . New Ideas in Psychology , 31 , 43–53. doi: 10.1016/j.newideapsych.2011.02.007 [ CrossRef ] [ Google Scholar ]
  • Sniehotta, F. F., Presseau, J., & Araújo-Soares, V. (2014). Time to retire the theory of planned behaviour . Health Psychology Review , 8 ( 1 ), 1–7. doi: 10.1080/17437199.2013.869710 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). Cambridge, MA: MIT Press. [ Google Scholar ]
  • Steinley, D., Hoffman, M., Brusco, M. J., & Sher, K. J. (2017). A method for making inferences in network analysis: Comment on Forbes, Wright, Markon, and Krueger (2017) . Journal of Abnormal Psychology , 126 ( 7 ), 1000–1010. doi: 10.1037/abn0000308 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Suls, J., & Rothman, A. (2004). Evolution of the biopsychosocial model: Prospects and challenges for health psychology . Health Psychology , 23 , 119–125. doi: 10.1037/0278-6133.23.2.119 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso . Journal of the Royal Statistical Society. Series B (Methodological) , 58 , 267–288. [ Google Scholar ]
  • Valente, T. W. (2012). Network interventions . Science , 337 ( 6090 ), 49–53. doi: 10.1126/science.1217330 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Van Borkulo, C. D. (2018). Network comparison test: Permutation-based test of differences in strength of networks. Retrieved from github.com/cvborkulo/NetworkComparisonTest
  • van Borkulo, C. D., Borsboom, D., Epskamp, S., Blanken, T. F., Boschloo, L., Schoevers, R. A., & Waldorp, L. J. (2014). A new method for constructing networks from binary data . Scientific Reports , 4 ( 5918 ), 1–10. doi: 10.1038/srep05918 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • van Borkulo, C. D., Boschloo, L., Borsboom, D., Penninx, B. W. J. H., Waldorp, L. J., & Schoevers, R. A. (2015). Association of symptom network structure with the course of depression . JAMA Psychiatry , 72 ( 12 ), 1219–1226. doi: 10.1001/jamapsychiatry.2015.2079 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • van der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M., & Raijmakers, M. E. (2006). A dynamical model of general intelligence: The positive manifold of intelligence by mutualism . Psychological Review , 113 ( 4 ), 842–861. doi: 10.1037/0033-295X.113.4.842 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ware, J. E., Jr., & Sherbourne, C. D. (1992). The MOS 36-item short-form health survey (SF-36): I. Conceptual framework and item selection . Medical Care , 30 , 473–483. [ PubMed ] [ Google Scholar ]
  • Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of “small-world” networks . Nature , 393 ( 6684 ), 440–442. doi: 10.1038/30918 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wetzels, R., & Wagenmakers, E.-J. (2012). A default Bayesian hypothesis test for correlations and partial correlations . Psychonomic Bulletin & Review , 19 , 1057–1064. doi: 10.3758/s13423-012-0295-x [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wild, B., Eichler, M., Friederich, H.-C., Hartmann, M., Zipfel, S., & Herzog, W. (2010). A graphical vector autoregressive modeling approach to the analysis of electronic diary data . BMC Medical Research Methodology , 10 ( 28 ), 1–13. doi: 10.1186/1471-2288-10-28 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]

  • Open access
  • Published: 08 April 2024

Enhanced accuracy through machine learning-based simultaneous evaluation: a case study of RBS analysis of multinary materials

  • Goele Magchiels 1 ,
  • Niels Claessens 1 , 2 ,
  • Johan Meersschaut 2 &
  • André Vantomme 1  

Scientific Reports volume 14, Article number: 8186 (2024)

  • Characterization and analytical techniques
  • Surfaces, interfaces and thin films

We address the high accuracy and precision demands for analyzing large in situ or in operando spectral data sets. A dual-input artificial neural network (ANN) algorithm enables the compositional and depth-sensitive analysis of multinary materials by simultaneously evaluating spectra collected under multiple experimental conditions. To validate the developed algorithm, a case study was conducted analyzing complex Rutherford backscattering spectrometry (RBS) spectra collected in two scattering geometries. The dual-input ANN analysis excelled in providing a systematic analysis and precise results, showcasing its robustness in handling complex data and minimizing user bias. A comprehensive comparison with human supervision analysis and conventional single-input ANN analysis revealed a reduced susceptibility of the dual-input ANN analysis to inaccurately known setup parameters, a common challenge in material characterization. The developed multi-input approach can be extended to a wide range of analytical techniques, in which the combined analysis of measurements performed under different experimental conditions is beneficial for disentangling details of the material properties.

Introduction

Accurate characterization of complex planar and 3D structures is essential for, among other things, successful device fabrication and implementation in micro- and nanotechnology 1 . State-of-the-art device fabrication involves various annealing processes, including thermal annealing for creating two-dimensional multilayered structures, and thermally activated topography annealing or pulsed laser annealing for shaping three-dimensional structures 2 .

The (heat-induced) change in structural, electrical, and chemical properties throughout the fabrication process is very often probed using in situ and in operando techniques 3 , 4 . In these approaches, a measurement is continuously repeated to discern the evolution of the sample properties during the processing, providing a large data set that allows one to monitor subtle changes between consecutive steps.

Nonetheless, the analysis of the resulting large data set requires a rapid and systematic method for examining each measurement step. Recent studies have demonstrated the successful deployment of machine learning algorithms for augmentation and high-throughput analysis of data generated by a wide variety of experimental techniques, aiming to extract valuable structural and topographical information 5 , 6 , 7 , 8 , 9 , 10 , 11 . These include nuclear resonant scattering of synchrotron radiation, reflective high-energy electron diffraction, X-ray diffraction, and Rutherford backscattering spectrometry (RBS) 12 , 13 , 14 , 15 , 16 , 17 . So far, machine learning algorithms have focused on analyzing data sets obtained by a single experimental technique within a single geometry. However, a more advanced and comprehensive understanding can be achieved by employing machine learning to simultaneously analyze data collected from various measurement geometries or conditions, or by integrating data from different experimental techniques 18 .

As an example, RBS enables multi-geometry data collection through the scattering of incident ions with the atoms of the target material, followed by the measurement of the energy of the backscattered ions at various scattering angles. This technique enables a high throughput of data, absolute yield quantification (without calibration standards), and absolute depth resolution, therefore making RBS propitious for real-time studies and the detection of (minor) structural changes such as layer thickness of multilayered materials, layer roughness and porosity, nucleation of a new stoichiometric phase, elemental diffusion, etc 19 , 20 , 21 , 22 , 23 , 24 . The compositional depth profile of the target can be directly derived from the measured spectrum via three quantities: (1) The atomic mass of the target atoms is obtained via the so-called kinematic factor, i.e., the ratio of the ion energy after and before scattering, and the kinematic factor increases with increasing target atomic mass; (2) The elemental concentration at a particular depth is obtained by the yield of detected backscattered ions at the corresponding energy; (3) The depth information is obtained via the width of the elemental signal and the shift in energy of elemental signals originating from deeper within the sample, caused by (small-angle) collisions of the incoming ions with target electrons (electronic stopping).
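The kinematic factor named in point (1) has a standard closed form for elastic two-body scattering, \(K = \left[ \left( \sqrt{M_2^2 - M_1^2 \sin^2\theta} + M_1 \cos\theta \right) / (M_1 + M_2) \right]^2\), with \(M_1\) the ion mass, \(M_2\) the target atomic mass, and \(\theta\) the laboratory scattering angle. The sketch below evaluates it for the elements discussed in this study. The scattering angle of 170° and the rounded atomic masses are illustrative assumptions, not the setup values of the experiment.

```python
import math

def kinematic_factor(m_ion, m_target, theta_deg):
    """Ratio of ion energy after and before elastic scattering from a
    heavier target atom, at laboratory scattering angle theta_deg."""
    theta = math.radians(theta_deg)
    root = math.sqrt(m_target ** 2 - (m_ion * math.sin(theta)) ** 2)
    return ((root + m_ion * math.cos(theta)) / (m_ion + m_target)) ** 2

M_HE = 4.0026       # He mass in u
E0_MEV = 2.7        # beam energy quoted in the text
THETA_DEG = 170.0   # assumed scattering angle, for illustration only
TARGET_MASSES = {"Si": 28.086, "Ni": 58.693, "Ge": 72.630, "Sn": 118.710}

# Energy of ions backscattered from atoms at the sample surface.
surface_energy = {
    element: E0_MEV * kinematic_factor(M_HE, mass, THETA_DEG)
    for element, mass in TARGET_MASSES.items()
}

for element, energy in surface_energy.items():
    print(f"{element}: {energy:.2f} MeV")
```

Because the factor grows with target mass, the surface signals appear in the order Si, Ni, Ge, Sn from low to high energy, which is the ordering used to identify the elements in the spectra.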

figure 1

Simulated (including Poisson statistics) RBS spectra of ( a ) a 48 nm Ni/Si bilayer (black crosses) and a 137 nm NiSi/Si bilayer (black open circles) and ( b ) a 48 nm Ni/117 nm Ge \(_{1-x}\) Sn \(_x\) /Ge multilayer (black crosses) and a 22 nm Ni/49 nm Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) /40 nm NiGe \(_{1-x}\) Sn \(_x\) /73 nm Ge \(_{1-x}\) Sn \(_x\) /Ge multilayer (black open circles) with \(x = 0.08\), using a 2.7 MeV He \(^{2+}\) beam and detection in the G \(_1\) geometry, as shown in the inset in ( a ). The Ni (green), Si and Ge (purple, in ( a ) and ( b ), resp.) and Sn (red) contributions to the RBS spectra are highlighted. The arrows indicate the respective elemental depth profiles, starting from the sample surface. The inset in ( b ) illustrates the difference in the Sn signal for the two spectra.

RBS is ideally suited for the study of multilayered materials with sufficiently different atomic masses (hence sufficiently different kinematic factors), allowing energy separation of the elemental signals in the spectrum. A textbook example is the RBS analysis of a Ni film on a Si substrate, compared to a NiSi film on Si that is obtained after annealing the sample, as shown for the simulated spectra in Fig.  1 a. On the one hand, in the spectrum emerging from the Ni/Si configuration, the low-energy (below \(\sim\) 1.6 MeV), low-yield signal results from scattering in the Si substrate (1: atomic mass, as described above), whereas the high-energy (around \(\sim\) 2.1 MeV), high-yield peak results from the Ni film (1: atomic mass). The width of the Ni peak, determined by electronic stopping, corresponds to the thickness of the film (3: depth information). On the other hand, in the spectrum emerging from the NiSi/Si configuration, a step in the Si signal can be observed as a result of the adjacent sub-signals arising from scattering with the Si substrate atoms (E < 1.45 MeV) and with the Si atoms in the NiSi film (E > 1.45 MeV). The energy width of the NiSi sub-signal, as well as the leading edge energy of the substrate sub-signal, are determined by the energy loss of the incoming ions in the NiSi film (3: depth information). The ratio of the sub-signal yields directly reflects the ratio of the Si concentrations in the respective layers (1 in substrate, 0.5 in NiSi layer) (2: elemental concentration). Concomitant with the changes in the Si signal, the Ni signal exhibits broadening and decreased yield as a result of the NiSi layer thickness and composition (1:1 ratio of Ni and Si atoms). Notably, the integrated Ni signal in the RBS spectra of both sample configurations remains constant, implying conservation of the number of Ni atoms within the system.

The conventional approach to deducing the compositional depth profile from an RBS measurement is spectrum fitting, in which the sample parameters are varied in a forward simulator until the simulated spectrum matches the measured one 25 , 26 . Whereas the forward simulation of a defined target results in a uniquely defined RBS spectrum, the inverse problem of finding the compositional depth profile from experimental RBS data can be more ambiguous.

Highly reliable solutions (compositional depth profiles) for spectra within a real-time data set can be obtained by employing Butler’s three criteria, even though such solutions may not possess mathematical uniqueness 27 . The first criterion is conservation of mass, which implies conforming to conservation of the total areal density of elements present in the target. The second criterion is adherence to thermodynamic principles, which implies conforming to thermodynamically stable phase stoichiometries formed by annealing. The third and foremost criterion for this study is the combined evaluation of spectra that are collected in multiple scattering geometries. This can be done either in an iterative way, i.e., sequentially, with the analysis of one spectrum starting from an initial assumption derived from the analysis result of another spectrum, or in a simultaneous way, i.e., a direct approach in which a parameter is fitted to multiple spectra at the same time 28 . However, both the iterative and the simultaneous approaches are time-demanding, making them suboptimal for analyzing large quantities of data. Moreover, in both approaches, the operator is free to impose different weights on the contribution of each spectrum to the final compositional depth profile, resulting in a user-biased analysis.

The user bias can be minimized by applying a machine learning approach. The latter mainly involves artificial neural networks (ANN), which relate a single RBS spectrum to the corresponding compositional depth profile 12 , 29 , 30 . Analogously to the transmission and processing of electric signals in a biological network of neurons, the architecture of a multilayer perceptron ANN consists of one layer of input nodes and one layer of output nodes, separated by one or more hidden layers. The information is transmitted from the input to the output layer by a forward, fully interconnected network of nodes. A nonlinear activation function is applied to the weighted sum of the nodes in one layer, resulting in the value of the node in the next layer. The weight value of each interconnection is determined by the learning of the ANN using a training set consisting of established input-output patterns, and the iterative adaption of weights to minimize the mean-square error on the outputs of the test set—an approach known as supervised learning. Thereafter, the trained ANN allows extremely fast analysis of large sets of data within the parameter space defined by the training set. This parameter space encompasses a multidimensional range of values for the target and RBS setup parameters, used for the generation of the ANN training set.
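The forward pass described above can be sketched in a few lines. This is a minimal illustration, not the network used in the study: the layer sizes, the tanh activation, the linear output layer, and the random initial weights are all assumptions chosen to keep the example small.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases
    for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, inputs):
    """Propagate inputs through the network: each node value is an
    activation applied to the weighted sum of the previous layer.
    Hidden layers use tanh; the output layer is left linear (assumption)."""
    values = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * v for w, v in zip(row, values)) + b
                for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        values = sums if is_output else [math.tanh(s) for s in sums]
    return values

def mean_square_error(predicted, target):
    """Mean-square error between network outputs and known targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
layers = [init_layer(6, 4, rng), init_layer(4, 2, rng)]  # 6 inputs, 4 hidden, 2 outputs
spectrum = [0.1, 0.4, 0.9, 0.7, 0.3, 0.05]               # hypothetical normalized counts
prediction = forward(layers, spectrum)
loss = mean_square_error(prediction, [0.5, 0.2])         # hypothetical target outputs
```

Supervised learning then consists of repeatedly adjusting the weights and biases so that this loss decreases over the set of known input-output patterns; the update rule itself is omitted here.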

Unlike forward fitting, this machine learning-based approach lacks any prior knowledge of the physics underlying the experiment (here: Rutherford scattering). Until now, the focus of machine learning analysis approaches has primarily been on large data sets of single geometry RBS data. While a previous study suggested the potential of combining the analysis of multiple RBS and elastic recoil detection analysis measurements within a single ANN, the work did not investigate the actual capabilities of this approach, as it tackled a straightforward problem that did not necessitate simultaneous analysis 31 .

This study establishes a multi-input ANN algorithm for the analysis of complex RBS data sets. This algorithm simultaneously relates spectra measured in multiple scattering geometries to a unique compositional depth profile. We show the advantages, limitations, and pitfalls of the dual-input ANN analysis by applying this newly developed approach to the real-time RBS data set of the thermal reaction of Ni with Ge \(_{1-x}\) Sn \(_x\) 32 . The complexity of this data set surpasses previously studied cases, presenting an exceptional level of challenge that reaches the limits of conventional analysis approaches. Therefore, a simultaneous and non-user-biased analysis is required to minimize ambiguity.

Introduction to the Ni-Ge \(_{1-x}\) Sn \(_{x}\) data set

The Ni-Ge \(_{1-x}\) Sn \(_{x}\) thermal reaction data set was collected in the scope of a study of Ni-germanide formation in the presence of strained Ge channels, which were introduced in microelectronics to enhance the hole mobility. One way to induce strain is by alloying a fraction of Sn to Ge, resulting in a lattice mismatch between Ge \(_{1-x}\) Sn \(_x\) and the Ge substrate 33 , 34 , 35 . For pure Ge, it was demonstrated that the thermally-induced NiGe phase exhibits exceptional contact properties on Ge, surpassing those of other transition metal germanides 36 . Based on this finding, the thermal reaction of a thin Ni film with Ge \(_{1-x}\) Sn \(_x\) was studied by Demeulemeester et al. 32 . To understand the influence of the alloying of Sn on the phase sequence and on the reaction kinetics, real-time RBS was applied, i.e., continuously capturing RBS spectra while the thermal reaction of Ni with Ge \(_{1-x}\) Sn \(_x\) takes place. At annealing temperatures up to 300 °C, referred to as the low-temperature domain, the same phase sequence was observed as for Ni/Ge (Ni/Ge \(\rightarrow\) Ni \(_{5}\) Ge \(_3\) \(\rightarrow\) NiGe), including a constant Sn fraction in the formed germanides. At annealing temperatures exceeding 300 °C, referred to as the high-temperature domain, Sn redistribution in the NiGe \(_{1-x}\) Sn \(_x\) phase occurred.

The RBS measurements were performed on the Ni/Ge \(_{1-x}\) Sn \(_x\) /Ge multilayer using an incident beam of 2.7 MeV He \(^{2+}\) ions that scattered from the sample, which was mounted at a tilt angle \(\alpha\) of 20°. The scattered particles were simultaneously detected at exit angles (i.e., the angle between the surface normal of the sample and the detected outgoing beam) of \(\beta _1~=~5\) ° and \(\beta _2~=~35\) °. These geometries will be referred to as G \(_1\) and G \(_2\) , respectively (see inset in Fig.  2 ). RBS spectra were acquired while the annealing temperature applied to the multilayer was ramped between room temperature and 430 °C, resulting in a data set of 80 spectra per detection geometry at a collection rate of 4 °C per measurement (except for the fast ramp up to 150 °C).

figure 2

Experimental (black data points) and simulated RBS spectra (solid lines) based on the dual-input ANN analysis in ( a ) the G \(_1\) geometry and ( b ) the G \(_2\) geometry, for T = 32 °C (triangles, purple), 246 °C (circles, red), and 402 °C (squares, green).

Figure 1b shows the RBS spectra simulated using SIMNRA for an as-deposited Ni/Ge \(_{1-x}\) Sn \(_x\) /Ge sample and a multilayer of thermodynamically stable Ni germanide phases induced by thermal annealing, both for the G \(_1\) scattering geometry. Several complexities emerge from these spectra. First, as a result of the higher atomic mass of Ge compared to Ni and the electronic stopping of the incoming He \(^{2+}\) ions in the Ni layer, the Ge high-energy edge and the Ni signal are superimposed in the RBS spectrum (purple and green signals). Second, upon thermal reaction, the real-time study comprises spectra with the simultaneous presence of the unreacted Ni, Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) , and NiGe \(_{1-x}\) Sn \(_x\) phases. This coexistence of multiple very thin layers with varying thicknesses and composition, in combination with the superimposed Ni and Ge signal, results in non-unique solutions to the compositional depth profile. Third, the small Sn yield results from the small Sn concentration ( \(x = 8\%\) ), which complicates the probing of the Sn redistribution occurring at elevated temperatures. Considering these complexities and ambiguities, this data set is well suited to explore and push the boundaries of simultaneous multi-detector ANN analysis for highly convoluted RBS spectra. When assessing the network capabilities, the human supervision analysis performed by Demeulemeester et al. 32 serves as a benchmark for our results. This human supervision analysis involves the iterative fitting of the measurements in the G \(_1\) and G \(_2\) scattering geometry by sequentially using the analysis of the RBS spectrum in one geometry as input for the analysis in the other geometry.

In the scope of this study, three ANN analysis algorithms were developed: two single-input ANNs, one analyzing the G \(_1\) scattering geometry (SI-G \(_1\) ) and one analyzing the G \(_2\) geometry (SI-G \(_2\) ), and a dual-input (DI) ANN. A comparative assessment of their analysis accuracy, precision, and reliability is conducted, benchmarking them against each other and against the human supervision analysis.

Multi-detector artificial neural network

Ideally, one dual-input ANN is designed and trained for simultaneous evaluation of the G \(_1\) and G \(_2\) scattering geometry measurements of the entire RBS data set; however, the generality of the parameter space fully covering the phase formation as well as the Sn redistribution results in a decreased accuracy of the ANN 37 , 38 . Because of this generality-precision trade-off, two distinct dual-input ANNs were designed and trained to cover the two physical processes occurring during the thermal reaction. First, the low-temperature domain ANN (DI, SI-G \(_1\) , SI-G \(_2\) ) models the growth and consumption of the unreacted Ni layer, the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) phase and the NiGe \(_{1-y}\) Sn \(_y\) phase, preserving a uniform Sn distribution in each phase, and Sn enrichment at the surface and interfaces. Second, the high-temperature domain DI ANN describes the redistribution of the alloying Sn in the final NiGe \(_{1-y}\) Sn \(_y\) phase and Sn enrichment at the surface and interfaces.

The architecture of both DI ANNs is a multilayer perceptron, whose inputs comprise the counts per channel within the region of interest of two simultaneously measured RBS spectra. For both scattering geometries, the region of interest consists of the channels encompassing the Ni, Ge, and Sn signals after normalization to the Ge substrate counts. The ANN outputs are the areal densities of the elements in the respective layers ( Nt , i.e., the product of the atomic density N and the film thickness t , hence directly related to the number of atoms).
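As a rough illustration of how such a dual input could be assembled, the sketch below cuts a region of interest out of each spectrum, scales it by the summed counts in a Ge substrate window, and concatenates the two results into one input vector. The channel ranges, the choice of substrate window, the function names, and the toy spectra are hypothetical; the source states only that the inputs are normalized to the Ge substrate counts.

```python
def normalize_roi(counts, roi, substrate_window):
    """Return the counts inside roi, divided by the summed counts in the
    substrate window. Both arguments are (start, stop) channel ranges."""
    lo, hi = substrate_window
    substrate_counts = sum(counts[lo:hi])
    if substrate_counts <= 0:
        raise ValueError("substrate window contains no counts")
    start, stop = roi
    return [c / substrate_counts for c in counts[start:stop]]

def dual_input_vector(spectrum_g1, spectrum_g2, roi_g1, roi_g2, sub_g1, sub_g2):
    """Concatenate the normalized regions of interest of the two
    simultaneously measured spectra into a single ANN input vector."""
    return (normalize_roi(spectrum_g1, roi_g1, sub_g1)
            + normalize_roi(spectrum_g2, roi_g2, sub_g2))

# Toy spectra: channels 0-2 play the role of the Ge substrate window,
# channels 3-7 the region of interest with the Ni, Ge, and Sn signals.
g1 = [50, 52, 48, 10, 30, 80, 20, 2]
g2 = [40, 41, 39, 8, 25, 60, 15, 1]
x = dual_input_vector(g1, g2, (3, 8), (3, 8), (0, 3), (0, 3))
```

Normalizing each geometry separately keeps the two halves of the input on a comparable scale even when the detectors collect different total counts.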

Supervised learning of the ANNs (low- and high-temperature domain DI, SI-G \(_1\) , SI-G \(_2\) ) was applied. The training set consisted of patterns of randomly selected compositional depth profiles and the corresponding RBS spectra in scattering geometries G \(_1\) and G \(_2\) , simulated using SIMNRA 39 . The compositional depth profiles were generated by randomly selecting the areal densities of the elements in the respective layers from a normal distribution. The only free setup parameter is the energy calibration offset, allowing the validation of the ANN analysis if a gradual spectrum shift occurs during the long real-time run. Such shifts may originate from various factors, including minor drift in the data acquisition electronics and ion beam-induced carbon deposition 40 .
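The target side of such a training set can be sketched as follows. Each pattern holds a randomly drawn compositional depth profile and a random energy calibration offset; the matching spectra would then come from a forward simulator (SIMNRA in the source), which is not invoked here. The layer names, the mean areal densities, the relative width of 0.5, the redrawing of negative values, and the offset range are all illustrative assumptions.

```python
import random

# Hypothetical mean areal densities per layer, in 1e15 atoms/cm^2.
LAYER_MEANS = {"Ni": 220.0, "Ni5(GeSn)3": 150.0, "NiGeSn": 150.0}

def sample_profile(rng, layer_means, rel_sigma=0.5):
    """Draw one areal density per layer from a normal distribution.
    Negative draws are non-physical and are redrawn (an assumption)."""
    profile = {}
    for layer, mu in layer_means.items():
        value = rng.gauss(mu, rel_sigma * mu)
        while value < 0.0:
            value = rng.gauss(mu, rel_sigma * mu)
        profile[layer] = value
    return profile

def make_training_targets(n_patterns, layer_means,
                          offset_range_kev=(-10.0, 10.0), seed=0):
    """Return n_patterns dicts, each with a random depth profile and a
    random energy calibration offset (the only free setup parameter)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_patterns):
        patterns.append({
            "areal_density": sample_profile(rng, layer_means),
            "energy_offset_kev": rng.uniform(*offset_range_kev),
        })
    return patterns

patterns = make_training_targets(1000, LAYER_MEANS)
```

Fixing the seed makes the set reproducible, so that several networks can be trained on exactly the same patterns.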

Following the supervised learning, the relative contribution of the inputs to the individual output nodes can be understood using Garson’s algorithm or activation maps 41 , 42 . Applying Garson’s algorithm to the low- and high-temperature domain ANN showed that the RBS spectra from both scattering geometries G \(_1\) and G \(_2\) contributed substantially to the ANN output. To obtain the accuracy and precision of the low- and high-temperature domain ANN analysis, ten ANNs are trained independently using the same training set and employed for data analysis. For each dual-spectrum input, this analysis with the independently trained ANNs results in the mean value and standard deviation of each ANN output 30 .
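The ensemble step at the end of this paragraph amounts to a per-output mean and standard deviation over the independently trained networks. A minimal sketch, assuming the sample standard deviation (n − 1 in the denominator) is meant, which the source does not specify; the example predictions are made up:

```python
import statistics

def ensemble_summary(predictions):
    """predictions holds one output vector per independently trained ANN.
    Returns (means, standard deviations), one entry per output node."""
    per_output = list(zip(*predictions))
    means = [statistics.mean(values) for values in per_output]
    stdevs = [statistics.stdev(values) for values in per_output]
    return means, stdevs

# Hypothetical areal densities (1e15 atoms/cm^2) predicted by ten ANNs
# for two output nodes and a single dual-spectrum input.
predictions = [
    [201.0, 48.5], [198.5, 50.2], [203.2, 49.1], [199.8, 51.0], [200.4, 49.7],
    [202.1, 50.5], [197.9, 48.9], [201.6, 50.8], [200.0, 49.4], [199.5, 50.1],
]
means, stdevs = ensemble_summary(predictions)
```

The spread across networks captures only the variability introduced by training; it does not include uncertainty from inaccurately known experimental parameters.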

Feasibility and results of dual-input ANN analysis

The dual-input ANN analysis of the experimental spectra provides the mean areal densities of the elements in the respective layers (i.e., compositional depth profile) at each temperature step. Subsequent simulation of the RBS spectra based on the obtained compositional depth profile is performed using SIMNRA. These simulated spectra are superimposed with the experimental spectra to demonstrate the accuracy of the dual-input ANN analysis. It should be emphasized that the simulations are exclusively normalized to the Ge substrate yield and fitted to the experimental spectra by varying the energy calibration offset. The latter is valid as the offset is a free parameter in the training set. In particular, no further adjustment of the sample parameters is made, in contrast to what is often found in literature where a subsequent fitting step is applied post-ANN analysis. As an example, the experimental data measured at an annealing temperature of T = 32 °C, 246 °C, and 402 °C, superimposed with the corresponding simulations based on the DI ANN analysis are shown in Fig.  2 for both scattering geometries.

figure 3

Evaluation of reduced quadratic deviation of the spectra in the G \(_1\) geometry after dual-input ANN analysis (low-temperature domain: green filled circles, high-temperature domain: green open circles) and after human supervision analysis (black triangles).

The high accuracy is evidenced by the excellent agreement between the experimental and the simulated spectra throughout the entire low- and high-temperature domain. Next, the reduced-square deviation between the experimental and simulated data ( \(\chi ^2_r\) -value, i.e., \(\chi ^2\) divided by number of channels) is calculated for each measurement and plotted as a function of annealing temperature (Fig.  3 ). The \(\chi ^2_r\) -values resulting from the low-temperature domain DI ANN analysis are comparable to the human supervision analysis, confirming the accurate dual-input ANN analysis. The \(\chi ^2_r\) -values of the high-temperature domain DI ANN analysis are considerably smaller than those of the low-temperature domain DI ANN analysis, arising from the reduced level of complexity in the high-temperature domain DI ANN training set.
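The \(\chi ^2_r\) -value can be computed along the following lines. The division by the number of channels follows the definition above; the per-channel variance is taken here as the experimental count (Poisson statistics), floored at 1 to avoid division by zero, which is an assumption about a detail the text does not spell out.

```python
def reduced_chi_square(experimental, simulated):
    """Chi-square between experimental and simulated counts per channel,
    divided by the number of channels."""
    if len(experimental) != len(simulated):
        raise ValueError("spectra must have the same number of channels")
    chi2 = 0.0
    for y_exp, y_sim in zip(experimental, simulated):
        variance = max(y_exp, 1.0)  # Poisson variance, floored (assumption)
        chi2 += (y_exp - y_sim) ** 2 / variance
    return chi2 / len(experimental)

# Two-channel toy example: each channel deviates by one standard deviation.
value = reduced_chi_square([100.0, 400.0], [110.0, 380.0])
```

A value near 1 indicates that the simulation deviates from the data by about as much as counting statistics alone would allow.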

figure 4

Ni areal density as a function of temperature of the unreacted Ni layer ( a , blue), Ni in the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) phase ( b , green), and Ni in the NiGe \(_{1-y}\) Sn \(_y\) phase ( c , purple) obtained by dual-input ANN analysis (data points correspond to the mean areal density, error band to the uncertainty covering 1  \(\sigma\) , acquired through the analysis using ten independently trained ANNs), and the corresponding human supervision analysis (black triangles). The inset in ( c ) compares the low-temperature domain (filled symbols) and high-temperature domain (open symbols) dual-input ANN analysis for NiGe \(_{1-y}\) Sn \(_y\) in the overlapping temperature range.

The comparison of the phase formation and consumption between the analysis by the dual-input ANN and by human supervision is given in Fig.  4 . The DI ANN analysis indicates a higher onset temperature for the growth of the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) phase, concomitant with a higher onset temperature for the consumption of the Ni surface layer (Fig.  4 a,b). This difference in the onset temperature of thermal reaction discloses the necessity for simultaneous analysis (in contrast to iterative human supervision analysis applied before) and the susceptibility of the human supervision analysis to user bias, as will be discussed in Section ‘Iterative versus simultaneous analysis’. In addition, the human supervision and DI ANN analysis approach agree in the prediction of the total phase consumption temperature for both the unreacted Ni layer and the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) layer. In the temperature range between 292 and 304 °C (i.e., the transition from the low- to the high-temperature domain), the two DI ANNs provide consistent results. This consistency affirms the high precision and reliability of the low- and the high-temperature domain DI ANN at the respective high- and low-temperature limits of their parameter space (Fig.  4 c). Moreover, this indicates that the same result is obtained independently of the differently modeled systems (phase growth vs. Sn redistribution).

figure 5

Sn areal density as a function of temperature of Sn in the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) phase ( a , green), the NiGe \(_{1-y}\) Sn \(_y\) phase ( b , purple), the surface precipitation ( c , red), the germanide-Ge \(_{1-z}\) Sn \(_z\) interface layer ( c , teal), and the unreacted Ge \(_{1-z}\) Sn \(_z\) layer ( d , gray) obtained by dual-input ANN analysis, together with the analysis uncertainty covering 1  \(\sigma\) . The low-temperature domain (filled symbols) and high-temperature domain (open symbols) results are shown.

Furthermore, Butler’s three criteria for the ambiguity reduction of RBS analysis are fulfilled by this simultaneous evaluation approach using machine learning: (1) Conservation of the total Ni areal density is obtained for both the low- and high-temperature domain DI ANN, even though the total Ni areal density was varied in the training set within a normal distribution ( \(\mu\)  = 220 \(\times 10^{15}\)  atoms/cm \(^2\) , \(\sigma\)  = 110  \(\times 10^{15}\)  atoms/cm \(^2\) ); (2) The ratio of the Ni (Fig.  4 ), Ge, and Sn (Fig.  5 ) areal densities in each phase agrees with the expected thermodynamically stable stoichiometries, including a homogeneous Sn distribution in the low-temperature domain; (3) The quantification is obtained by simultaneous evaluation of RBS data measured in two scattering geometries.

The reduction in ambiguity that is obtained through the dual-input approach is particularly crucial when analyzing the Sn depth profile, given the limited yield of the Sn signal. The dual-input ANN analysis approach allows the deconvolution of the five contributions to the total Sn signal (Fig.  5 ). Conservation of the total Sn areal density was obtained for T < 300 °C. However, at higher temperatures, an apparent increase in the total Sn areal density was observed. Integrating the Sn-related raw counts in the experimental spectra showed a similar increase with temperature, which would contradict the principle of total areal density conservation. Because the increase is already present in the raw data, it is not related to a breakdown of the DI ANN algorithm but rather to an unexpected artifact, presumably caused by the out-diffusion of mobile n-type dopants (Sb) from the Ge substrate or a similar effect 43 .

Uncertainty of the dual-input ANN analysis

In addition to the statistical and systematic uncertainties, the uncertainty induced by the analysis itself must be included in the total uncertainty budget as well 44 , 45 . A common approach for the uncertainty evaluation of a trained ANN is through the analysis of a simulated test set covering the parameter space of the experiment. The ANN outputs are compared to the target composition, resulting in a prediction error 37 . However, unlike a simulated test set, experimental data are also susceptible to inaccurately known experimental parameters. Therefore, a simulated test set cannot be considered a valid representation to obtain the total uncertainty of the experimental data analysis. An alternative measure of the uncertainty is the \(\chi ^2_r\) -value obtained by comparing the experimental data to the RBS simulation that is based on the ANN output, as shown in Fig.  3 . However, this approach is strongly susceptible to ambiguity (i.e., multiple compositional depth profiles may result in an identical ‘fit’) and may lead to non-physical results.
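For reference, a reduced chi-square of this kind can be computed as in the minimal sketch below. It assumes Poisson counting statistics (variance equal to the measured counts, with a floor of 1 for empty channels) and a user-supplied number of free parameters. These choices, like the function name `reduced_chi_square` and the example numbers, are illustrative assumptions and not necessarily the definition used in this work.

```python
import numpy as np

def reduced_chi_square(measured, simulated, n_free_params=0):
    """Reduced chi-square between a measured and a simulated spectrum.

    Assumption: Poisson statistics, so the variance of each channel is
    approximated by the measured counts (floored at 1 for empty channels).
    """
    measured = np.asarray(measured, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    variance = np.maximum(measured, 1.0)
    dof = measured.size - n_free_params  # degrees of freedom
    if dof <= 0:
        raise ValueError("need more channels than free parameters")
    return float(np.sum((measured - simulated) ** 2 / variance) / dof)

# Hypothetical four-channel example
measured = np.array([100.0, 120.0, 80.0, 0.0])
simulated = np.array([110.0, 120.0, 76.0, 1.0])
chi2_r = reduced_chi_square(measured, simulated)
print(f"reduced chi-square: {chi2_r:.3f}")
```

A low value only indicates that the simulation reproduces the spectrum; as noted above, it does not exclude that a different depth profile would fit equally well.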

Therefore, as proposed for SI ANN analysis of RBS spectra, the uncertainty was evaluated using a set of ten independently trained DI ANNs (identical architecture and training set) 30 . After the DI ANN analyses (ten networks) of the experimental data set, each output’s mean value and standard deviation ( \(\sigma\) ) were calculated. The examination of the standard deviation of the Ni areal density outputs as a function of temperature (indicated by the error band in Fig.  4 ) demonstrated the following characteristics: (1) Before thermal reaction occurs (low-temperature domain, 30 °C to 200 °C), the standard deviation of the unreacted Ni layer is 11  \(\times \; 10^{15}\)  atoms/cm \(^2\) ; (2) During thermal reaction (low-temperature domain, 200 °C to 300 °C), the standard deviation of Ni in the different phases remains constant at approximately 10  \(\times \; 10^{15}\)  atoms/cm \(^2\) ; (3) After completing the final NiGe \(_{1-y}\) Sn \(_y\) phase (high-temperature domain, 300 °C to 430 °C), the standard deviation of Ni within the NiGe \(_{1-y}\) Sn \(_y\) phase drops to 2  \(\times \; 10^{15}\)  atoms/cm \(^2\) . Thus, the high-temperature domain ANN analysis exhibited a pronounced reduction in uncertainty compared to the low-temperature domain analysis, which can be attributed to the reduced complexity of the high-temperature domain training set. Furthermore, it confirms the advantage of dividing the entire experimental parameter space into subspaces (low-temperature domain, high-temperature domain) due to the generality-precision trade-off. Additional uncertainty reduction, at the cost of increased user bias, can be achieved by restricting the training set to a smaller parameter space, for example, by introducing conservation of the total areal density of specific elements in the training set.
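The ensemble statistics described above amount to a mean and a standard deviation taken over the networks. The sketch below illustrates this with NumPy under stated assumptions: the predictions are stored as an array of shape (networks, spectra, outputs), the sample standard deviation (`ddof=1`) is used, and the prediction values are random placeholders rather than real ANN outputs.

```python
import numpy as np

def ensemble_statistics(predictions):
    """Mean and standard deviation over independently trained networks.

    predictions: array of shape (n_networks, n_spectra, n_outputs).
    Assumption: sample standard deviation (ddof=1); the text does not
    state which convention was used.
    """
    predictions = np.asarray(predictions, dtype=float)
    mean = predictions.mean(axis=0)
    sigma = predictions.std(axis=0, ddof=1)
    return mean, sigma

# Placeholder data: 10 networks, 5 spectra, 11 outputs (not real results)
rng = np.random.default_rng(0)
fake_predictions = 220.0 + 10.0 * rng.standard_normal((10, 5, 11))
mean, sigma = ensemble_statistics(fake_predictions)
print(mean.shape, sigma.shape)
```

The 1 \(\sigma\) error bands in the figures correspond to `mean - sigma` and `mean + sigma` for each output and temperature step.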

Iterative versus simultaneous analysis

Comparing the analysis conducted under iterative human supervision with the simultaneous dual-input ANN analysis reveals distinct trends in the evolution of the unreacted Ni areal density as a function of annealing temperature (Fig.  4 a). This difference raises the question of whether the contrast between the iterative and simultaneous approaches, along with potential user bias associated with human supervision analysis, can explain the different temperatures at which the thermal reaction starts (see above).

figure 6

Ni areal density in the unreacted surface layer as a function of temperature, analyzed by the G \(_1\) single-input ANN (SI-G \(_1\) : purple), the G \(_2\) single-input ANN (SI-G \(_2\) : red), the dual-input ANN (DI: green), including the ANN analysis uncertainty covering 1  \(\sigma\) , and the human supervision analysis (black triangles).

To answer this question, a set of single-input ANNs was trained (ten individual cycles of training, identical hidden layer architecture, and identical output nodes to the dual-input ANN) for the G \(_1\) (SI-G \(_1\) ) and G \(_2\) (SI-G \(_2\) ) geometries individually. The region of interest used as input of these ANNs is identical to that of the dual-input ANN. The resulting areal densities of the unreacted Ni layer are shown in Fig.  6 . As the RBS measurements in the G \(_1\) and G \(_2\) scattering geometries were performed simultaneously, they originate from an identical target composition. Therefore, an accurate, unambiguous analysis should lead to identical areal densities independent of the scattering geometry. In contrast, a difference is noticed in the areal densities of the unreacted Ni layer following analysis with the SI-G \(_1\) , SI-G \(_2\) , and DI ANN approaches. Moreover, the standard deviations of the SI-G \(_1\) and SI-G \(_2\)  ANN analyses are smaller than this areal density difference and, hence, cannot explain the difference in the observed unreacted Ni areal density.

This raises questions regarding the accuracy of the setup parameters. Although the setup parameters were determined as accurately as possible, minor deviations from the expected values may persist, which can result in systematic errors. For instance, any deviation in the detector position, and consequently, the scattering angle, changes the RBS spectrum. When the detector positions deviate in different directions (either positively or negatively) for either geometry, they have distinct effects on the spectra. Such inaccurately known detector positions could explain the difference in G \(_1\) and G \(_2\) analysis by SI ANNs. Finding the correct setup parameters through spectrum fitting is intricate due to the high correlation between the sample and setup parameters.

In contrast, when employing dual-input ANN analysis, minor deviations in setup parameters like beam energy, scattering angle, sample tilt, and energy-channel conversion (offset and gain) can be addressed by introducing an additional free setup parameter into the training set. Although each setup parameter affects the spectrum differently, in practice, at least for minor deviations, their effects can, to a first approximation, be modeled as spectrum shifts. Under this approximation, minor variations in the setup parameters are accounted for by including the free energy calibration offset within the training set (see “ Multi-detector artificial neural network ” section). Moreover, although it is possible to leave all setup parameters free, opting to vary only the offset substantially reduces the complexity of the training set and avoids highly ambiguous spectra, resulting in a more accurate analysis. Thus, the simultaneous dual-input ANN analysis provides a more robust approach by introducing a free spectrum shift into the ANN’s parameter space, effectively reducing sensitivity to inaccurately known setup parameters.
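The first-order approximation described above, in which a small setup deviation acts as a shift of the spectrum, can be illustrated as follows. This is a minimal sketch and not the SIMNRA-based procedure used to generate the training set: it assumes a linear calibration E = offset + gain × channel and uses linear interpolation to shift the counts. The function names and the numerical values are hypothetical.

```python
import numpy as np

def offset_shift_in_channels(delta_offset, gain):
    """Channel shift caused by changing the calibration offset.

    With E = offset + gain * channel, a feature at fixed energy E sits at
    channel (E - offset) / gain, so raising the offset by delta_offset
    moves it by -delta_offset / gain channels.
    """
    return -delta_offset / gain

def shift_spectrum(counts, shift_channels):
    """Shift a spectrum by a (possibly fractional) number of channels.

    Positive shifts move features to higher channels. Counts moved in
    from outside the recorded range are set to zero.
    """
    counts = np.asarray(counts, dtype=float)
    channels = np.arange(counts.size, dtype=float)
    return np.interp(channels - shift_channels, channels, counts,
                     left=0.0, right=0.0)

# Hypothetical example: a single peak at channel 5 moved by two channels
spectrum = np.zeros(20)
spectrum[5] = 10.0
shifted = shift_spectrum(spectrum, 2.0)
print(int(np.argmax(shifted)))
```

Including such shifts for each geometry in the training set is what lets the network separate composition-related features from small calibration errors.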

This robustness gives the DI ANN approach an advantage over conventional SI ANN analysis and human supervision analysis. On the one hand, the inability of conventional SI ANNs to handle spectra affected by inaccurately known setup parameters is evidenced by the diverging results for SI-G \(_1\) and SI-G \(_2\) , as shown in Fig.  6 . This indicates that a reduction of the degrees of freedom of the analysis is required to distinguish spectrum features related to sample composition from those arising from slight deviations in setup parameters. One way to decrease the degrees of freedom is the combined analysis of RBS spectra collected in the two scattering geometries, which ensures a unique compositional depth profile that is consistent with both spectra. The integration of this combined evaluation in the DI ANN enhances the reliability of the results compared to SI ANN analysis. On the other hand, human supervision analysis encounters difficulties in fitting complex spectra when attempting to adjust both the target parameters and the inaccurately known setup parameters. This challenge arises from the high correlation between these parameters. It can be addressed by constraining the possible solutions to the analysis (i.e., combinations of target and setup parameters), which reduces the ambiguity arising from their correlation. This constrained parameter space is integrated into the DI ANN analysis through the training set. These considerations underscore the superior performance of the DI ANN analysis compared to both SI ANN analysis and human supervision analysis.

Next, when comparing the areal density curves of the unreacted Ni layer (Fig. 6 ), it can be noted that the human supervision curve aligns with the SI-G \(_2\) curve between room temperature and 147 °C after which it jumps to the SI-G \(_1\) curve, overlapping until T = 226 °C. Above T = 230 °C, the human supervision curve transitions to the DI curve. These transitions in the iterative human supervision approach result from two instances of user bias in the analysis. First, changes in the relative weights imposed by the operator to the G \(_1\) and G \(_2\) contributions in the G \(_1\) -G \(_2\) combined analysis across the temperature range of the real-time experiment lead to discontinuous jumps. Second, the operator may unintentionally be biased in imposing a trend to the areal density evolution. This occurs in two steps. Initially, it is the operator who decides when a new phase appears while the effect of the newly present phase on the spectrum is still minimal. Subsequently, the operator anticipates a systematic behavior in the presence and layer thickness evolution of a phase until the next phase emerges. Consequently, the shift from the SI-G \(_2\) to the SI-G \(_1\) curve during iterative human supervision analysis leads to an unintended underestimation of the onset temperature for phase formation (Fig.  4 b). Moreover, the transitions of the human supervision analysis between the SI ANN curves suggest that the human supervisor could not identify a single, consistent compositional depth profile that aligns with both the G \(_1\) and G \(_2\) scattering geometries, conforming to the divergent SI ANN analysis. It demonstrates the sensitivity of iterative human supervision analysis to minor deviations from the expected setup parameters, leading to an inconsistent analysis. In other words, the human supervision analysis suffers from user bias despite the human supervisor’s effort to perform a self-consistent analysis. 
In contrast, the simultaneous (non-iterative) DI ANN approach enables a systematic analysis throughout the entire temperature domain of the experiment.

A dual-input ANN algorithm has been developed to simultaneously evaluate RBS spectra measured in two scattering geometries. This analysis approach was applied to the large real-time RBS data set acquired during the thermal reaction of a Ni film with Ge \(_{1-x}\) Sn \(_{x}\) , which posed extensive challenges due to superimposed signals, adjacent thin layers containing varying element concentrations, and low concentrations. The accuracy of the dual-input approach was thoroughly examined by comparing the experimental spectra with simulations based on the dual-input ANN output. Remarkably, an excellent agreement was achieved without requiring post-ANN fitting . Additionally, a comprehensive comparison was made between single-input and dual-input ANN analysis algorithms concerning accuracy and precision. This evaluation demonstrated that allowing a free spectrum shift in the dual-input ANN training set offers a systematic and more robust analysis approach by reducing susceptibility to inaccurately known setup parameters, therefore providing more reliable results. This marks a major step towards precise analysis methodologies in the study of complex 3D micro- and nanostructures by simultaneously evaluating measurements taken under multiple experimental conditions using a machine learning-based approach 46 , 47 . Moreover, the multi-input ANN algorithm not only tackles challenges in simultaneous RBS spectrum analysis, as illustrated in this example, but also exhibits great potential for advancing the study of materials across a wide variety of high-throughput experimental techniques probing depth, composition, or chemical properties, whereby the combined analysis of measurements performed under different experimental conditions enhances the accuracy of the results.

An overview of the machine learning analysis workflow is shown in Fig.  7 .

figure 7

Schematic overview illustrating the DI ANN analysis applied to the real-time RBS data set. The blue box encompasses the experimental data acquisition and preprocessing. The green boxes represent the utilization of SIMNRA including the forward simulation of RBS spectra. The red boxes cover the DI ANN analysis approach.

Input preprocessing

First, a normalization region was defined, spanning channels 250 to 300 (Ge substrate) for both the G1 and G2 geometries. The counts in the channels of interest (channels 300 to 500 for both G1 and G2) were then normalized to the mean number of counts per channel in the normalization region of the respective geometry, which corresponds to a normalization to the number of incident ions. Next, input counts with values below 0.015 were regarded as background noise and set to zero. Finally, the spectrum was rebinned from 200 channels to 100 channels. This resulted in a feature vector size of 100 for the SI-G1 and SI-G2 ANN analysis, and a feature vector size of 200 for the DI ANN analysis. This data preprocessing did not decrease the accuracy of the analysis.
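The preprocessing steps above can be sketched in a few lines of NumPy. The following is an illustration under stated assumptions, not the authors' script: channel ranges are treated as half-open (250 up to but not including 300, and 300 up to but not including 500, which gives the stated 200 channels), rebinning is done by summing adjacent channel pairs, and the function names are hypothetical.

```python
import numpy as np

NORM_REGION = slice(250, 300)  # Ge substrate plateau (half-open: assumption)
ROI = slice(300, 500)          # 200 channels of interest (half-open: assumption)
NOISE_THRESHOLD = 0.015        # normalized counts below this are zeroed

def preprocess_spectrum(counts):
    """Normalize, threshold, and rebin one spectrum to 100 features.

    counts: raw counts per channel, with at least 500 channels.
    """
    counts = np.asarray(counts, dtype=float)
    norm = counts[NORM_REGION].mean()  # mean counts per channel
    roi = counts[ROI] / norm           # new array, safe to modify
    roi[roi < NOISE_THRESHOLD] = 0.0   # suppress background noise
    # Rebin 200 -> 100 channels by summing adjacent pairs (assumption)
    return roi.reshape(100, 2).sum(axis=1)

def dual_input_features(counts_g1, counts_g2):
    """Concatenate both geometries into the 200-element DI feature vector."""
    return np.concatenate([preprocess_spectrum(counts_g1),
                           preprocess_spectrum(counts_g2)])

# Hypothetical flat spectra with 512 channels
flat = np.full(512, 100.0)
features = dual_input_features(flat, flat)
print(features.shape)
```

A single-input network would use the output of `preprocess_spectrum` for one geometry directly.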

ANN architecture

The ANNs (low- and high-temperature domain DI, SI-G \(_1\) , SI-G \(_2\) ) each have two fully connected hidden layers of 200 and 50 nodes, with rectified linear unit (ReLU) activation functions 48 .

The low-temperature domain ANN (DI, SI-G \(_1\) , SI-G \(_2\) ) produces 11 outputs: the areal densities of each of the elements present in the unreacted Ni (1 output), Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) (3 outputs), NiGe \(_{1-y}\) Sn \(_y\) (3 outputs), and unreacted Ge \(_{1-z}\) Sn \(_z\) (2 outputs) layers, and the areal density of the Sn enrichment at the surface (1 output) and at the interface between the NiGe \(_{1-y}\) Sn \(_y\) and the unreacted Ge \(_{1-z}\) Sn \(_z\) layer (1 output). The high-temperature domain ANN (DI) produces a total of 8 outputs: the areal densities of the elements in the NiGe \(_{1-y}\) Sn \(_y\) (3 outputs) and unreacted Ge \(_{1-z}\) Sn \(_z\) layers (2 outputs), the areal density of the Sn interface enrichment (1 output), and the areal density and roughness of the Sn surface layer (2 outputs). The latter two outputs allow the modeling of Sn surface precipitation, which was observed after thermal annealing at 550 °C using scanning electron microscopy 32 . A thickness distribution following a Gamma function with a small mean layer thickness \(\bar{d}\) and large standard deviation \(\sigma\) ( \(\sigma ~\ge ~\bar{d}\) ) makes it possible to account for the precipitation-induced changes in the RBS spectrum 49 . In this approximation, correlation effects between film roughness and interface crossings of the incident and scattered ions are neglected. This is valid for non-grazing incidence and large scattering angles and is hence applicable in this particular case.
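To make the roughness model more concrete, the sketch below converts a mean thickness \(\bar{d}\) and standard deviation \(\sigma\) into the shape and scale parameters of a Gamma distribution (shape = \((\bar{d}/\sigma)^2\), scale = \(\sigma^2/\bar{d}\)) and draws random thicknesses from it. The numerical values and the function name are hypothetical and serve only to show that \(\sigma \ge \bar{d}\) gives a shape parameter of at most 1, i.e., a distribution dominated by very thin regions with a long tail of thick ones.

```python
import numpy as np

def gamma_parameters(mean_thickness, sigma):
    """Shape and scale of a Gamma distribution with given mean and std.

    For a Gamma distribution, mean = shape * scale and
    variance = shape * scale**2.
    """
    shape = (mean_thickness / sigma) ** 2
    scale = sigma ** 2 / mean_thickness
    return shape, scale

# Hypothetical values in arbitrary thickness units, with sigma >= mean
shape, scale = gamma_parameters(mean_thickness=5.0, sigma=10.0)

rng = np.random.default_rng(1)
samples = rng.gamma(shape, scale, size=200_000)
print(shape, scale, samples.mean(), samples.std())
```

With these values most of the surface is covered by almost no Sn while a small fraction carries thick deposits, which is the qualitative picture of precipitates on a surface.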

Variance scaling \(Y_i/\sigma\) was applied to all outputs \(Y_i\) , with \(\sigma\) being the standard deviation of the output feature in the normally distributed training set.
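The architecture and output scaling described in this section can be summarized in a short forward-pass sketch. It assumes the 200-element dual-input feature vector, hidden layers of 200 and then 50 nodes, the 11 outputs of the low-temperature domain network, a linear output layer, and randomly initialized weights. The initialization scheme, the linear output layer, and the placeholder scaling factors are assumptions for illustration; the trained network itself was built in Matlab.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_network(rng, n_inputs=200, hidden=(200, 50), n_outputs=11):
    """Random weights and zero biases for a fully connected network."""
    sizes = [n_inputs, *hidden, n_outputs]
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, features):
    """ReLU on the hidden layers, linear output layer (assumption)."""
    activation = np.asarray(features, dtype=float)
    for weights, bias in params[:-1]:
        activation = relu(activation @ weights + bias)
    weights, bias = params[-1]
    return activation @ weights + bias

def unscale(scaled_outputs, sigma):
    """Invert the variance scaling Y / sigma applied to the targets."""
    return scaled_outputs * sigma

rng = np.random.default_rng(0)
params = init_network(rng)
batch = rng.random((4, 200))   # four placeholder feature vectors
scaled = forward(params, batch)
sigma = np.ones(11)            # placeholder per-output standard deviations
areal_densities = unscale(scaled, sigma)
n_params = sum(w.size + b.size for w, b in params)
print(areal_densities.shape, n_params)
```

With these layer sizes the network has 50,811 trainable parameters, which is small compared to the 150,000 training patterns of the low-temperature domain.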

Training set and training process

The training process sets the weights and biases of all interconnections, aiming to minimize the prediction error on the training set. This training set consisted of patterns of randomly selected compositional depth profiles and the corresponding RBS spectra in scattering geometries G \(_1\) and G \(_2\) . The distribution boundaries of the sample structure and setup parameters in the training set define the parameter space of the ANN, which should cover the entire experimental parameter space. The training set of the low-temperature domain DI ANN consisted of 150,000 patterns within the defined parameter space including a variable total Ni areal density, a roughness of the Ni surface layer, a fixed stoichiometry of the Ni \(_{5}\) (Ge \(_{1-x}\) Sn \(_x\) ) \(_3\) , NiGe \(_{1-y}\) Sn \(_y\) , and Ge \(_{1-z}\) Sn \(_z\) layers, and free Sn fractions x ,  y ,  z . All free parameters were randomly selected from a normal distribution. The training set of the high-temperature domain DI ANN consisted of 50,000 patterns within the defined parameter space, including a variable Ni areal density, a fixed stoichiometry of the NiGe \(_{1-y}\) Sn \(_y\) and Ge \(_{1-z}\) Sn \(_z\) layers, a random Sn fraction y ,  z , a Sn surface layer with extreme roughness to resemble the surface precipitation, and a random areal density of the Sn interface layer. In both training sets, the energy calibration offset of each scattering geometry spectrum was a free parameter, aiming to cover spectrum shifts occurring during the real-time run.
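As an illustration of how the free parameters of such a training set might be drawn, the sketch below samples the total Ni areal density from the normal distribution quoted earlier ( \(\mu\)  = 220, \(\sigma\)  = 110, in units of \(10^{15}\)  atoms/cm \(^2\) ) and an energy calibration offset for each geometry. Redrawing non-positive areal densities, the width of the offset distribution, and all names are assumptions made for this example; the actual generation script is not reproduced here.

```python
import numpy as np

def sample_positive_normal(rng, mean, sigma, size):
    """Draw from a normal distribution, redrawing non-positive values.

    Assumption: non-physical (<= 0) areal densities are simply redrawn.
    """
    values = rng.normal(mean, sigma, size)
    bad = values <= 0
    while bad.any():
        values[bad] = rng.normal(mean, sigma, int(bad.sum()))
        bad = values <= 0
    return values

rng = np.random.default_rng(42)
n_patterns = 1000  # small demo; the text quotes 150,000 and 50,000 patterns

# Total Ni areal density in units of 1e15 atoms/cm^2 (values from the text)
total_ni = sample_positive_normal(rng, 220.0, 110.0, n_patterns)

# Energy calibration offset per geometry; the width is hypothetical
offset_g1 = rng.normal(0.0, 2.0, n_patterns)
offset_g2 = rng.normal(0.0, 2.0, n_patterns)

print(total_ni.min() > 0, total_ni.shape, offset_g1.shape)
```

Each sampled parameter set would then be passed to the forward simulation to produce the corresponding pair of G \(_1\) and G \(_2\) spectra.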

The supervised learning of the ANN requires the forward simulation of RBS spectra. Multiple software implementations enable the calculation of the spectra based on the physics of the interaction of an ion beam with matter. A comparative study assessing the quantitative and qualitative aspects of these simulation codes, along with a quantitative comparison of the analysis of experimental spectra was conducted by the International Atomic Energy Agency 25 . From this, it was concluded that the analysis of experimental spectra using the new generation codes SIMNRA 39 and NDF 50 demonstrates excellent agreement amongst the codes. The consistent performance, encompassing spectrum generation time and precision, together with the ability to generate a large number of spectra has led to the frequent use of these software implementations in single-input ANN analysis applications for RBS data 12 , 30 , 38 . It is essential to note that, regardless of the forward simulation software employed, the overall uncertainty of the analysis is influenced by both the code uncertainty and the uncertainties associated with the parameters utilized in the simulation. These parameters include the electronic stopping power (used from the SRIM 2003 stopping power database 51 ) and the scattering cross sections. Given the usability and good documentation of SIMNRA, the decision was made to employ this forward simulation software for the generation of the training spectra, with the subsequent addition of Poisson statistics to mimic experimental spectra.
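Adding Poisson statistics to a simulated spectrum can be done channel by channel, using the simulated counts as the expected value. The sketch below shows one way to do this with NumPy; the function name and the flat test spectrum are hypothetical, and negative simulated values are clipped to zero as a precaution.

```python
import numpy as np

def add_counting_noise(simulated_counts, rng):
    """Replace each simulated count by a Poisson-distributed random draw.

    The simulated value of each channel is used as the Poisson mean.
    """
    simulated_counts = np.asarray(simulated_counts, dtype=float)
    expected = np.clip(simulated_counts, 0.0, None)  # Poisson mean must be >= 0
    return rng.poisson(expected).astype(float)

rng = np.random.default_rng(7)
smooth = np.full(100_000, 50.0)  # hypothetical noise-free spectrum
noisy = add_counting_noise(smooth, rng)
print(noisy.mean(), noisy.var())
```

For a Poisson process the variance equals the mean, so channels with few counts, such as the Sn signal, carry the largest relative noise.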

The training process was executed in Matlab using the Adam optimizer with 1000 epochs. For the adaptive moment estimation, a gradient decay factor of 0.900 and a squared gradient decay factor of 0.999 were used. The initial learning rate was 0.001 followed by a learn rate drop factor of 0.1 for a drop period of 10 epochs. L2 regularization was included, through the addition of a penalty term with a regularization hyperparameter ( \(\lambda\) ) of \(10^{-4}\) to the least-squares loss function to avoid overfitting. Parity plots were generated using a designated test set of 15,000 patterns, selected and excluded from the training set, to compare the actual areal density to the areal density predicted by ANN analysis. The linear correlation between the actual and predicted values, and the comparable root-mean-square error of the training and test set confirm the successful training and predictive capability.
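The optimizer settings quoted above can be written out explicitly. The sketch below implements a piecewise-constant learning-rate schedule (initial rate 0.001, drop factor 0.1, drop period 10 epochs) and a single textbook Adam update with decay factors 0.900 and 0.999 and an L2 term with \(\lambda = 10^{-4}\) . It is an illustration only: the zero-based epoch index, the bias-correction terms, the value of `eps`, and the convention that the penalty contributes `l2 * param` to the gradient are assumptions and may differ in detail from the Matlab implementation.

```python
import numpy as np

def learning_rate(epoch, initial=1e-3, drop_factor=0.1, drop_period=10):
    """Piecewise-constant schedule; epoch is counted from 0 (assumption)."""
    return initial * drop_factor ** (epoch // drop_period)

def adam_step(param, grad, m, v, step, lr,
              beta1=0.900, beta2=0.999, eps=1e-8, l2=1e-4):
    """One Adam update with an L2 penalty added to the gradient.

    step counts updates starting at 1. Returns the new parameter and the
    updated first and second moment estimates.
    """
    grad = grad + l2 * param                 # gradient of the L2 term
    m = beta1 * m + (1.0 - beta1) * grad     # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad**2  # second moment estimate
    m_hat = m / (1.0 - beta1 ** step)        # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical single-parameter example
param = np.array([1.0])
grad = np.array([0.5])
m = np.zeros(1)
v = np.zeros(1)
new_param, m, v = adam_step(param, grad, m, v, step=1,
                            lr=learning_rate(0), l2=0.0)
print(new_param, learning_rate(0), learning_rate(10))
```

Note that the first Adam step has a magnitude close to the learning rate regardless of the size of the gradient, which is why the learning-rate schedule controls the step size directly.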

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

IEEE. International roadmap for devices and systems: metrology (2022).

Schleunitz, A. et al. Novel 3D micro-and nanofabrication method using thermally activated selective topography equilibration (taste) of polymers. Nano Converg. 1 , 1–8 (2014).

Gira, M. J., Tkacz, K. P. & Hampton, J. R. Physical and electrochemical area determination of electrodeposited Ni, Co, and NiCo thin films. Nano Converg. 3 , 6 (2016).

Bauer, S., Rodrigues, A. & Baumbach, T. Real time in situ x-ray diffraction study of the crystalline structure modification of Ba 0.5 Sr 0.5 TiO 3 during the post-annealing. Sci. Rep. 8 , 11969 (2018).

Schmidt, J., Marques, M. R., Botti, S. & Marques, M. A. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5 , 83 (2019).

Bedolla, E., Padierna, L. C. & Castaneda-Priego, R. Machine learning for condensed matter physics. J. Condens. Matter Phys. 33 , 053001 (2020).

Oviedo, F. et al. Fast and interpretable classification of small x-ray diffraction datasets using data augmentation and deep neural networks. npj Comput. Mater. 5 , 60 (2019).

Bridger, A., David, W. I., Wood, T. J., Danaie, M. & Butler, K. T. Versatile domain mapping of scanning electron nanobeam diffraction datasets utilising variational autoencoders. npj Comput. Mater. 9 , 14 (2023).

Munshi, J. et al. Disentangling multiple scattering with deep learning: Application to strain mapping from electron diffraction patterns. npj Comput. Mater. 8 , 254 (2022).

Taherimakhsousi, N. et al. Quantifying defects in thin films using machine vision. npj Comput. Mater. 6 , 111 (2020).

Griffin, L. A., Gaponenko, I., Zhang, S. & Bassiri-Gharb, N. Smart machine learning or discovering meaningful physical and chemical contributions through dimensional stacking. npj Comput. Mater. 5 , 85 (2019).

Demeulemeester, J. et al. Artificial neural networks for instantaneous analysis of real-time Rutherford backscattering spectra. Nucl. Instrum. Methods Phys. Res. B 268 , 1676–1681 (2010).

Planckaert, N. et al. Artificial neural networks applied to the analysis of synchrotron nuclear resonant scattering data. J. Synchrotron Radiat. 17 , 86–92 (2010).

Kim, H. J. et al. Machine-learning-assisted analysis of transition metal dichalcogenide thin-film growth. Nano Converg. 10 , 10 (2023).

Venderley, J. et al. Harnessing interpretable and unsupervised machine learning to address big data from modern x-ray diffraction. PNAS 119 , e2109665119 (2022).

Banko, L., Maffettone, P. M., Naujoks, D., Olds, D. & Ludwig, A. Deep learning for visualization and novelty detection in large x-ray diffraction datasets. npj Comput. Mater. 7 , 104 (2021).

Suzuki, Y. et al. Symmetry prediction and knowledge discovery from x-ray diffraction patterns using an interpretable machine learning approach. Sci. Rep. 10 , 21790 (2020).

Wu, L. et al. Resolution-enhanced x-ray fluorescence microscopy via deep residual networks. npj Comput. Mater. 9 , 43 (2023).

Theron, C., Lombaard, J. & Pretorius, R. Real-time RBS of solid-state reaction in thin films. Nucl. Instrum. Methods Phys. Res. B 161 , 48–55 (2000).

Smeets, D. et al. Simultaneous real-time x-ray diffraction spectroscopy, Rutherford backscattering spectrometry, and sheet resistance measurements to study thin film growth kinetics by Kissinger plots. J. Appl. Phys 104 , 103538 (2008).

Demeulemeester, J. et al. Pt redistribution during Ni (Pt) silicide formation. Appl. Phys. Lett. 93 , 261912 (2008).

Comrie, C. et al. Determination of the dominant diffusing species during nickel and palladium germanide formation. Thin Solid Films 526 , 261–268 (2012).

Schrauwen, A. et al. On the nucleation of PdSi and NiSi \(_2\) during the ternary Ni (Pd)/Si (100) reaction. J. Appl. Phys 114 , 063518 (2013).

van Stiphout, K. et al. Ion beam modification of the Ni-Si solid-phase reaction: The influence of substrate damage and nitrogen impurities introduced by ion implantation. J. Phys. D Appl. Phys. 54 , 015307 (2020).

Barradas, N. et al. International Atomic Energy Agency intercomparison of ion beam analysis software. Nucl. Instrum. Methods Phys. Res. B 262 , 281–303 (2007).

Heller, R., Klingner, N., Claessens, N., Merckling, C. & Meersschaut, J. Differential evolution optimization of Rutherford backscattering spectra. J. Appl. Phys 132 , 165302 (2022).

Butler, J. Criteria for validity of Rutherford scatter analyses. Nucl. Instrum. Methods Phys. Res. B 45 , 160–165 (1990).

Silva, T. F. et al. Self-consistent ion beam analysis: An approach by multi-objective optimization. Nucl. Instrum. Methods Phys. Res. B 506 , 32–40 (2021).

Barradas, N. P. & Vieira, A. Artificial neural network algorithm for analysis of Rutherford backscattering data. Phys. Rev. E 62 , 5818 (2000).

Guimarães, R. D. S. et al. Processing of massive Rutherford back-scattering spectrometry data by artificial neural networks. Nucl. Instrum. Methods Phys. Res. B 493 , 28–34 (2021).

Pinho, H., Vieira, A., Nené, N. & Barradas, N. Artificial neural network analysis of multiple IBA spectra. Nucl. Instrum. Methods Phys. Res. B 228 , 383–387 (2005).

Demeulemeester, J. et al. Sn diffusion during Ni germanide growth on Ge \(_{1- x}\) Sn \(_{x}\) . Appl. Phys. Lett. 99 , 211905 (2011).

Huang, Z.-M. et al. Emission of direct-gap band in germanium with Ge-GeSn layers on one-dimensional structure. Sci. Rep. 6 , 24802 (2016).

Vincent, B. et al. Characterization of GeSn materials for future Ge pMOSFETs source/drain stressors. Microelectron. Eng. 88 , 342–346 (2011).

Liu, Z. et al. Defect-free high Sn-content GeSn on insulator grown by rapid melting growth. Sci. Rep. 6 , 38386 (2016).

Gaudet, S., Detavernier, C., Kellock, A., Desjardins, P. & Lavoie, C. Thin film reaction of transition metals with germanium. J. Vacuum Sci. Technol. A 24 , 474–485 (2006).

Vieira, A., Barradas, N. & Jeynes, C. Error performance analysis of artificial neural networks applied to Rutherford backscattering. Surf. Interface Anal. 31 , 35–38 (2001).

Barradas, N. P., Vieira, A. & Patricio, R. Artificial neural networks for automation of Rutherford backscattering spectroscopy experiments and data analysis. Phys. Rev. E 65 , 066703 (2002).

Mayer, M. Improved physics in SIMNRA 7. Nucl. Instrum. Methods Phys. Res. B 332 , 176–180 (2014).

Healy, M. Minimising carbon contamination during ion beam analysis. Nucl. Instrum. Methods Phys. Res. B 129 , 130–136 (1997).

Garson, G. D. Interpreting neural-network connection weights. AI Expert 6 , 46–51 (1991).

Oliveira, V. & Silva, T. What do artificial neural networks learn? A study for analysis of RBS spectra. J. Phys. Conf. Ser. 2340 , 012003 (2022).

Chroneos, A. & Bracht, H. Diffusion of n-type dopants in germanium. Appl. Phys. Rev. 1 , 011301 (2014).

Sjöland, K., Munnik, F. & Wätjen, U. Uncertainty budget for ion beam analysis. Nucl. Instrum. Methods Phys. Res. B 161 , 275–280 (2000).

Jeynes, C. et al. “Total IBA” - Where are we? Nucl. Instrum. Methods Phys. Res. B 271 , 107–118 (2012).

Claessens, N. et al. Quantification of area-selective deposition on nanometer-scale patterns using Rutherford backscattering spectrometry. Sci. Rep. 12 , 17770 (2022).

Claessens, N. et al. Ensemble RBS: Probing the compositional profile of 3D microscale structures. Surf. Interfaces 32 , 102101 (2022).

Dubey, S. R., Singh, S. K. & Chaudhuri, B. B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 503 , 92–108 (2022).

Mayer, M. Ion beam analysis of rough thin films. Nucl. Instrum. Methods Phys. Res. B 194 , 177–186 (2002).

Barradas, N., Jeynes, C. & Webb, R. Simulated annealing analysis of Rutherford backscattering data. Appl. Phys. Lett. 71 , 291–293 (1997).

Ziegler, J. F. SRIM-2003. Nucl. Instrum. Methods Phys. Res. B 219 , 1027–1036 (2004).

Acknowledgements

This work was supported by FWO (Research Foundation Flanders) and the EU infrastructure network RADIATE (grant agreement 824096). The authors thank Jelle Demeulemeester for the collection and human supervision analysis of the experimental data set.

Author information

Authors and affiliations.

Quantum Solid-State Physics, KU Leuven, Celestijnenlaan 200D, 3001, Leuven, Belgium

Goele Magchiels, Niels Claessens & André Vantomme

IMEC, Kapeldreef 75, 3001, Leuven, Belgium

Niels Claessens & Johan Meersschaut

Contributions

G.M. and A.V. conceived and planned the scientific approach. G.M. developed the machine learning model and performed the data analysis. N.C. and J.M. wrote the script for the training set generation. G.M. and A.V. contributed to the manuscript writing. All authors contributed to discussing the results and reviewed the manuscript.

Corresponding author

Correspondence to Goele Magchiels.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Magchiels, G., Claessens, N., Meersschaut, J. et al. Enhanced accuracy through machine learning-based simultaneous evaluation: a case study of RBS analysis of multinary materials. Sci Rep 14 , 8186 (2024). https://doi.org/10.1038/s41598-024-58265-7

Received : 01 December 2023

Accepted : 27 March 2024

Published : 08 April 2024

DOI : https://doi.org/10.1038/s41598-024-58265-7

network analysis case study

COMMENTS

  1. A Network Analysis Case Study

    A Network Analysis Case Study. I asked Satori Lab founder Esko Reinikainen to talk about the company's first experience with using ONA. While working on new projects, it's important to look at ...

  2. PDF Improving Company Performance with Organizational Network Analysis

    network, as well as digital networks like instant messaging (chat), phone call, or email network. Work gets done through all of these networks and it is the methods and theories of social and organizational network analysis (ONA) that allow us to synthesize and analyze these networks. A network consists of a set of nodes, and links (ties)

  3. Chapter 7 Network Analysis

    Chapter 7. Network Analysis. In this chapter, we will cover concepts and procedures related to network analysis in R. "Networks enable the visualization of complex, multidimensional data as well as provide diverse statistical indices for interpreting the resultant graphs" (Jones et al., 2018). Put otherwise, network analysis is a collection ...

  4. Full article: Event-focused network analysis: a case study of anti

    Research on diffusion and transfer increasingly relies on the concept of policy networks, but often in inductive, descriptive, and anecdotal ways. This article proposes a more robust method for the comparative analysis of policy networks, a method we term 'event-focused network analysis' (EFNA). The method assumes that networks are most ...

  5. A case study of university student networks and the COVID-19 ...

    De Brún, A. & McAuliffe, E. Social network analysis as a methodological approach to explore health systems: A case study exploring support among senior managers/executives in a hospital network. Int.

  6. Social Network Analysis 101: Ultimate Guide

    Learn what you need to know to conduct your first social network analysis project in our Comprehensive SNA 101 Guide. Comprehensive Introduction for Beginners: Social network analysis is a powerful tool for visualizing, understanding, and harnessing the power of networks and relationships. ... Case Study 1: Leveraging SNA for Program Evaluation.

  7. Social Networks: Analysis and Case Studies

    The work covers Social Network Analysis theory and methods with a focus on current applications and case studies applied in various domains such as mobile networks, security, machine learning and health. With the increasing popularity of Web 2.0, social media has become a widely used communication platform.

  8. Social Network Analysis as a Methodological Approach to Explore Health

    This case study, assessing the support relationships between senior leaders in a recently established hospital network, illustrated some of the principal network- and node-level metrics used in social network analysis, and demonstrates the value of these maps and metrics to understand the system.

  9. Using Social Network Analysis to Assess Collaborative Networks: A Case

    Social Network Analysis (SNA) has been an established method in sociology since the early 20th century and has gained prominence in recent decades due to technological advances. It is versatile and can be applied in a wide range of fields (including economics, biology, medicine, communications, and more) by identifying key actors within a social ...

  10. Using social network analysis in community development practice and

    ideas: the analysis of a participant network, before and after a community development project. This analysis is drawn from research undertaken by Gretchen (Ennis, 2011) as part of a larger case study. The case used is a project undertaken by a grassroots, volunteer-based community network called Ludmilla Neighbourhood Connections (LNC). LNC ...

  11. Moving beyond case studies: applying social network analysis to study

    We argue that social network analysis is a useful methodology to study and to extend scholarly knowledge on learning through legitimate peripheral participation in communities of practice. We first review work on legitimate peripheral participation and show that research on this topic currently focusses on the adoption of practices.

  12. PDF CASE STUDY

    how research such as a network analysis could be useful to inform their work. For those interested in conducting a network analysis, the case also provides resources and tools to support researchers and organizations to replicate the study in their program context. Key guiding questions for the case include: 1.

  13. Mixed methods with social network analysis for networked learning

    In this regard, we suggest that future SLA studies apply advanced and multimodal network analysis approaches [46], including understanding the properties of networks in learning settings and ...

  14. Introduction to Social Networks: Analysis and Case Studies

    A social network is a social structure made up of actors called nodes, which are connected by various types of relationships. SNA is used to analyze and measure these relationships between people, groups and other information/knowledge processing entities and provides both a structural and a mathematical analysis.

  15. Full article: Network analysis: a brief overview and tutorial

    Objective: The present paper presents a brief overview on network analysis as a statistical approach for health psychology researchers. Networks comprise graphical representations of the relationships (edges) between variables (nodes). Network analysis provides the capacity to estimate complex patterns of relationships and the network ...

  16. Understanding Classrooms through Social Network Analysis: A Primer for

    INTRODUCTION TO THE CASE STUDY. In introducing network analysis, we draw our example from a subset of a 10-wk introductory biology course with 187 students who saw the course to completion. Each student in this course attended either a morning or afternoon 1-h lecture of ∼90 students four times a week and attended one of eight ...

  17. Fraud Detection Using Social Network Analysis: A Case Study

    The idea of propagating information across a graph and aggregating it to produce high-level conclusion is powerful; it inspired the creation of the generalized Snare system (McGlohon et al. 2009 ...

  18. Guidelines for Experimental Algorithmics: A Case Study in Network Analysis

    The field of network science is a highly interdisciplinary area; for the empirical analysis of network data, it draws algorithmic methodologies from several research fields. Hence, research procedures and descriptions of the technical results often differ, sometimes widely. In this paper we focus on methodologies for the experimental part of algorithm engineering for network analysis—an ...

  19. Network analysis: An indispensable tool for curricula design. A real

    Content addition to courses and its subsequent correct sequencing in a study plan or curricula design context determine the success (and, in some cases, the failure) of such study plan in the acquisition of knowledge by students. In this work, we propose a decision model to guide curricular design committees in the tasks of course selection and sequencing in higher education contexts using a ...

  20. Metro Network Analysis: Case Study

    January 28, 2024. Download the dataset below to solve this Data Science case study on Metro Network Analysis. Metro Network Analysis involves the application of data science techniques to understand and interpret the characteristics and dynamics of metro systems. The provided dataset contains detailed information about the Delhi ...

  21. Frontiers

    The integration of power grids and communication networks in smart grids enhances system safety and reliability but also exposes vulnerabilities to network attacks, such as Denial-of-Service (DoS) attacks targeting communication networks. A multi-index evaluation approach is proposed to optimize routing modes in integrated energy cyber-physical systems (IECPS) considering potential failures ...

  22. Cross-sectional study of pharmacovigilance knowledge ...

    Background: This study focuses on understanding pharmacovigilance knowledge, attitudes, and practices (KAP) in Yunnan Province, employing Structural Equation Modeling (SEM) and network analysis. It aims to evaluate the interplay of these factors among healthcare personnel and the public, assessing the impact of demographic characteristics to inform policy and educational initiatives.

  23. Mathematics

    In the face of the increasing complexity of risk factors in the coal mining transportation system (CMTS) during the process of intelligent transformation, this study proposes a method for analyzing accidents in CMTS based on fault tree analysis (FTA) combined with Bayesian networks (BN) and preliminary hazard analysis (PHA). Firstly, the fault tree model of CMTS was transformed into a risk ...

  24. Network analysis: a brief overview and tutorial

    Objective: The present paper presents a brief overview on network analysis as a statistical approach for health psychology researchers. Networks comprise graphical representations of the relationships (edges) between variables (nodes). Network analysis provides the capacity to estimate complex patterns of relationships and the network structure can be analysed to reveal core features of the ...

  25. Enhanced accuracy through machine learning-based simultaneous ...

    Accurate characterization of complex planar and 3D structures is, amongst others, essential for successful device fabrication and implementation in micro- and nanotechnology [1]. The state-of-the-art ...

  26. The effect of industrial zones on road network performance (case study

    This paper aims to determine the effect of industrial zones on road network performance. The research site is the North Malang industrial area, in the center of Malang City, Indonesia. The first step is to study the existing conditions of industrial land use. The existing condition of the industrial area includes: type of industry, operational status and area of ...