
Survey Paper
Open access
Published: 31 March 2021

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

Laith Alzubaidi (ORCID: orcid.org/0000-0002-7296-5413) 1,5,
Jinglan Zhang 1,
Amjad J. Humaidi 2,
Ayad Al-Dujaili 3,
Ye Duan 4,
Omran Al-Shamma 5,
J. Santamaría 6,
Mohammed A. Fadhel 7,
Muthana Al-Amidie 4 &
Laith Farhan 8

Journal of Big Data volume 8, Article number: 53 (2021)

In the last few years, the deep learning (DL) computing paradigm has come to be regarded as the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is its ability to learn from massive amounts of data. The DL field has grown fast in the last few years and has been used extensively to address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them tackled only one aspect of DL, which leads to an overall lack of knowledge about it. Therefore, in this contribution, we propose a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evaluation metrics, benchmark datasets, and summary and conclusion.

Introduction

Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [ 1 , 2 , 3 , 4 , 5 , 6 ]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [ 7 , 8 , 9 ]. Another name for DL is representation learning (RL). The continuing appearance of novel studies in the fields of deep and distributed learning is due to both the unpredictable growth in the ability to obtain data and the amazing progress made in the hardware technologies, e.g. High Performance Computing (HPC) [ 10 ].

DL is derived from the conventional neural network but considerably outperforms its predecessors. Moreover, DL employs transformations and graph technologies simultaneously in order to build up multi-layer learning models. The most recently developed DL techniques have obtained outstanding performance across a variety of applications, including audio and speech processing, visual data processing, and natural language processing (NLP), among others [ 11 , 12 , 13 , 14 ].

Usually, the effectiveness of an ML algorithm is highly dependent on the integrity of the input-data representation. It has been shown that a suitable data representation provides improved performance when compared to a poor data representation. Thus, a significant research trend in ML for many years has been feature engineering, which has informed numerous research studies. This approach aims at constructing features from raw data. It is extremely field-specific and frequently requires sizable human effort. For instance, several types of features were introduced and compared in the computer vision context, such as histogram of oriented gradients (HOG) [ 15 ], scale-invariant feature transform (SIFT) [ 16 ], and bag of words (BoW) [ 17 ]. As soon as a novel feature is introduced and found to perform well, it becomes a new research direction that is pursued over multiple decades.

Relatively speaking, feature extraction is achieved in an automatic way throughout the DL algorithms. This encourages researchers to extract discriminative features using the smallest possible amount of human effort and field knowledge [ 18 ]. These algorithms have a multi-layer data representation architecture, in which the first layers extract the low-level features while the last layers extract the high-level features. Note that artificial intelligence (AI) originally inspired this type of architecture, which simulates the process that occurs in core sensorial regions within the human brain. Using different scenes, the human brain can automatically extract data representation. More specifically, the output of this process is the classified objects, while the received scene information represents the input. This process simulates the working methodology of the human brain. Thus, it emphasizes the main benefit of DL.

In the field of ML, DL is currently one of the most prominent research trends due to its considerable success. In this paper, an overview of DL is presented that adopts various perspectives, such as the main concepts, architectures, challenges, applications, computational tools, and evaluation metrics. The convolutional neural network (CNN) is one of the most popular and widely used DL networks [ 19 , 20 ], and it is largely because of CNNs that DL is so popular nowadays. The main advantage of the CNN compared to its predecessors is that it automatically detects the significant features without any human supervision, which has made it the most used. Therefore, we have examined the CNN in depth by presenting its main components. Furthermore, we have elaborated in detail on the most common CNN architectures, starting with the AlexNet network and ending with the High-Resolution network (HR.Net).

Several DL review papers have been published in the last few years. However, each of them addresses only one side, focusing on a single application or topic, such as the review of CNN architectures [ 21 ], DL for classification of plant diseases [ 22 ], DL for object detection [ 23 ], and DL applications in medical image analysis [ 24 ]. Although these reviews cover good topics, they do not provide a full understanding of DL topics such as concepts, detailed research gaps, computational tools, and DL applications. It is first necessary to understand the aspects of DL, including concepts, challenges, and applications, before going deep into any one application. Achieving that requires extensive time and a large number of research papers. Therefore, we propose a deep review of DL to provide a more suitable starting point from which to develop a full understanding of DL from one review paper. The motivation behind our review was to cover the most important aspects of DL, including open challenges, applications, and computational tools. Furthermore, our review can be the first step towards other DL topics.

The main aim of this review is to present the most important aspects of DL so that researchers and students can easily form a clear picture of DL from a single review paper. This review will further advance DL research by helping people discover more about recent developments in the field. It should also allow researchers to decide on the most suitable direction of work in order to provide more accurate alternatives to the field. Our contributions are outlined as follows:

To the best of our knowledge, this is the first review that provides a deep survey of nearly all of the most important aspects of deep learning. This review helps researchers and students to gain a good understanding from one paper.

We explain the CNN, the most popular deep learning algorithm, in depth by describing its concepts, theory, and state-of-the-art architectures.

We review current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the proposed solutions tackling these issues.

We provide an exhaustive list of medical imaging applications with deep learning by categorizing them based on the tasks by starting with classification and ending with registration.

We discuss the computational approaches (CPU, GPU, FPGA) by comparing the influence of each tool on deep learning algorithms.

The rest of the paper is organized as follows: “ Survey methodology ” section describes the survey methodology. “ Background ” section presents the background. “ Classification of DL approaches ” section defines the classification of DL approaches. “ Types of DL networks ” section displays types of DL networks. “ CNN architectures ” section shows CNN architectures. “ Challenges (limitations) of deep learning and alternate solutions ” section details the challenges of DL and alternate solutions. “ Applications of deep learning ” section outlines the applications of DL. “ Computational approaches ” section explains the influence of computational approaches (CPU, GPU, FPGA) on DL. “ Evaluation metrics ” section presents the evaluation metrics. “ Frameworks and datasets ” section lists frameworks and datasets. “ Summary and conclusion ” section presents the summary and conclusion.

Survey methodology

We have reviewed the significant research papers in the field published during 2010–2020, mainly from 2019 and 2020, with some papers from 2021. The main focus was on papers from the most reputed publishers, such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer. Some papers have been selected from arXiv. We have reviewed more than 300 papers on various DL topics. There are 108 papers from the year 2020, 76 papers from the year 2019, and 48 papers from the year 2018. This indicates that this review focused on the latest publications in the field of DL. The selected papers were analyzed and reviewed to (1) list and define the DL approaches and network types, (2) list and explain CNN architectures, (3) present the challenges of DL and suggest the alternate solutions, (4) assess the applications of DL, and (5) assess computational approaches. The main keywords used as search criteria for this review paper are (“Deep Learning”), (“Machine Learning”), (“Convolution Neural Network”), (“Deep Learning” AND “Architectures”), ((“Deep Learning”) AND (“Image”) AND (“detection” OR “classification” OR “segmentation” OR “Localization”)), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “CPU” OR “GPU” OR “FPGA”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Imbalanced Data”), (“Deep Learning” AND “Interpretability of data”), (“Deep Learning” AND “Overfitting”), (“Deep Learning” AND “Underspecification”). Figure 1 shows the search structure of the survey paper. Table 1 presents the details of some of the journals that have been cited in this review paper.

Fig. 1 Search framework

Background

This section will present a background of DL. We begin with a quick introduction to DL, followed by the difference between DL and ML. We then show the situations that require DL. Finally, we present the reasons for applying DL.

DL, a subset of ML (Fig. 2 ), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [ 18 , 25 ].

Fig. 2 Deep learning family

Achieving the classification task using conventional ML techniques requires several sequential steps, specifically pre-processing, feature extraction, careful feature selection, learning, and classification. Furthermore, feature selection has a great impact on the performance of ML techniques. Biased feature selection may lead to incorrect discrimination between classes. Conversely, DL has the ability to automate the learning of feature sets for several tasks, unlike conventional ML methods [ 18 , 26 ]. DL enables learning and classification to be achieved in a single shot (Fig. 3 ). DL has become an incredibly popular type of ML algorithm in recent years due to the huge growth and evolution of the field of big data [ 27 , 28 ]. It continues to set new performance levels on several ML tasks [ 22 , 29 , 30 , 31 ] and has simplified the improvement of many learning fields [ 32 , 33 ], such as image super-resolution [ 34 ], object detection [ 35 , 36 ], and image recognition [ 30 , 37 ]. Recently, DL performance has come to exceed human performance on tasks such as image classification (Fig. 4 ).

Fig. 3 The difference between deep learning and traditional machine learning

Fig. 4 Deep learning performance compared to human

Nearly all scientific fields have felt the impact of this technology. Most industries and businesses have already been disrupted and transformed through the use of DL. The leading technology and economy-focused companies around the world are in a race to improve DL. Even now, human-level performance and capability cannot exceed the performance of DL in many areas, such as predicting the time taken to make car deliveries, decisions to certify loan requests, and predicting movie ratings [ 38 ]. The winners of the 2019 “Nobel Prize” in computing, also known as the Turing Award, were three pioneers in the field of DL (Yann LeCun, Geoffrey Hinton, and Yoshua Bengio) [ 39 ]. Although a large number of goals have been achieved, there is further progress to be made in the DL context. In fact, DL has the ability to enhance human lives by providing additional accuracy in tasks including estimating natural disasters [ 40 ], the discovery of new drugs [ 41 ], and cancer diagnosis [ 42 , 43 , 44 ]. Esteva et al. [ 45 ] found that a DL network, using 129,450 images of 2032 diseases, has the same ability to diagnose the disease as twenty-one board-certified dermatologists. Furthermore, in grading prostate cancer, US board-certified general pathologists achieved an average accuracy of 61%, while the Google AI [ 44 ] outperformed these specialists by achieving an average accuracy of 70%. In 2020, DL played an increasingly vital role in early diagnosis of the novel coronavirus (COVID-19) [ 29 , 46 , 47 , 48 ]. DL has become the main tool in many hospitals around the world for automatic COVID-19 classification and detection using chest X-ray images or other types of images. We end this section with the words of AI pioneer Geoffrey Hinton: “Deep learning is going to be able to do everything”.

When to apply deep learning

Machine intelligence is useful in many situations and, in some cases, equals or surpasses human experts [ 49 , 50 , 51 , 52 ], meaning that DL can be a solution to the following problems:

Cases where human experts are not available.

Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).

Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).

Cases where solutions require adaptation based on specific cases (personalization, biometrics).

Cases where the size of the problem is extremely large and exceeds our limited reasoning abilities (sentiment analysis, matching ads on Facebook, calculating webpage ranks).

Why deep learning?

Several performance features may answer this question, e.g.:

Universal Learning Approach: Because DL has the ability to perform in approximately all application domains, it is sometimes referred to as universal learning.

Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.

Generalization: Different data types or different applications can use the same DL technique, an approach frequently referred to as transfer learning (TL), which is explained in a later section. Furthermore, it is a useful approach in problems where data is insufficient.

Scalability: DL is highly scalable. ResNet [ 37 ], which was invented by Microsoft, comprises 1202 layers and is frequently applied at a supercomputing scale. Lawrence Livermore National Laboratory (LLNL), a large enterprise working on evolving frameworks for networks, adopted a similar approach, where thousands of nodes can be implemented [ 53 ].

Classification of DL approaches

DL techniques are classified into three major categories: unsupervised, partially supervised (semi-supervised) and supervised. Furthermore, deep reinforcement learning (DRL), also known as RL, is another type of learning technique, which is mostly considered to fall into the category of partially supervised (and occasionally unsupervised) learning techniques.

Deep supervised learning

Deep semi-supervised learning

In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, which include GRUs and LSTMs, are also employed for partially supervised learning. One advantage of this technique is that it minimizes the amount of labeled data needed. On the other hand, one disadvantage is that irrelevant input features present in the training data could lead to incorrect decisions. Text document classification is one of the most popular applications of semi-supervised learning: because a large amount of labeled text documents is difficult to obtain, semi-supervised learning is ideal for this task.

Deep unsupervised learning

This technique makes it possible to implement the learning process in the absence of available labeled data (i.e. no labels are required). Here, the agent learns the significant features or interior representation required to discover the unidentified structure or relationships in the input data. Techniques of generative networks, dimensionality reduction, and clustering are frequently counted within the category of unsupervised learning. Several members of the DL family have performed well on non-linear dimensionality reduction and clustering tasks; these include restricted Boltzmann machines, auto-encoders, and GANs as the most recently developed techniques. Moreover, RNNs, which include GRU and LSTM approaches, have also been employed for unsupervised learning in a wide range of applications. The main disadvantages of unsupervised learning are that it cannot provide accurate information concerning data sorting and that it is computationally complex. One of the most popular unsupervised learning approaches is clustering [ 54 ].

Deep reinforcement learning

For solving a task, the selection of the type of reinforcement learning to be performed is based on the space or the scope of the problem. For example, DRL is the best choice for problems involving many parameters to be optimized. By contrast, derivative-free reinforcement learning is a technique that performs well for problems with limited parameters. Some of the applications of reinforcement learning are business strategy planning and robotics for industrial automation. The main drawback of reinforcement learning is that parameters may influence the speed of learning. Here are the main motivations for utilizing reinforcement learning:

It assists you to identify which action produces the highest reward over a longer period.

It assists you to discover which situation requires action.

It also enables the agent to figure out the best approach for reaching large rewards.

Reinforcement learning also gives the learning agent a reward function.

Reinforcement learning cannot be used in every situation, for example:

Cases where there is sufficient data to resolve the issue with supervised learning techniques.

Cases where the workspace is large, since reinforcement learning is computing-heavy and time-consuming.

Types of DL networks

The most famous types of deep learning networks are discussed in this section: these include recursive neural networks (RvNNs), RNNs, and CNNs. RvNNs and RNNs are briefly explained in this section, while CNNs are explained in depth due to the importance of this type. Furthermore, the CNN is the most used network across applications.

Recursive neural networks

RvNN can make predictions in a hierarchical structure and also classify the outputs using compositional vectors [ 57 ]. Recursive auto-associative memory (RAAM) [ 58 ] is the primary inspiration for the RvNN development. The RvNN architecture is designed for processing objects that have arbitrarily shaped structures, like graphs or trees. This approach generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using back-propagation through structure (BTS) [ 58 ]. The BTS system follows the same technique as the general back-propagation algorithm and has the ability to support a treelike structure. Auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNN is highly effective in the NLP context. Socher et al. [ 59 ] introduced an RvNN architecture designed to process inputs from a variety of modalities. These authors demonstrate two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural images, where each image is separated into various segments of interest. RvNN computes scores for candidate pairs to merge and constructs a syntactic tree. More precisely, it calculates a score related to the merge plausibility for every pair of units. Next, the pair with the largest score is merged into a composition vector. Following every merge, RvNN generates (a) a larger area of numerous units, (b) a compositional vector of the area, and (c) a label for the class (for instance, a noun phrase will become the class label for the new area if two units are noun words). The compositional vector for the entire area is the root of the RvNN tree structure. An example RvNN tree is shown in Fig. 5 . RvNN has been employed in several applications [ 60 , 61 , 62 ].
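The score-and-merge loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `compose` and `score` functions are hand-written stand-ins (in an actual RvNN both are learned networks), only adjacent pairs are scored, class labels are omitted, and the example vectors are invented.

```python
def compose(left, right):
    """Hypothetical composition: element-wise mean of two child vectors."""
    return [(a + b) / 2.0 for a, b in zip(left, right)]


def score(left, right):
    """Hypothetical merge plausibility: higher when the vectors are closer."""
    return -sum((a - b) ** 2 for a, b in zip(left, right))


def greedy_tree(units):
    """Repeatedly merge the best-scoring adjacent pair until one root remains.

    `units` is a list of (label, vector) leaves. Returns (tree, root_vector),
    where tree is a nested tuple of leaf labels.
    """
    nodes = list(units)
    while len(nodes) > 1:
        # Score every adjacent pair and pick the most plausible merge.
        best = max(range(len(nodes) - 1),
                   key=lambda i: score(nodes[i][1], nodes[i + 1][1]))
        (l_tree, l_vec), (r_tree, r_vec) = nodes[best], nodes[best + 1]
        # Replace the two units by one larger unit with a compositional vector.
        nodes[best:best + 2] = [((l_tree, r_tree), compose(l_vec, r_vec))]
    return nodes[0]


leaves = [("the", [0.0, 0.0]), ("cat", [0.1, 0.0]), ("sat", [1.0, 1.0])]
tree, root = greedy_tree(leaves)
print(tree)
```

The vector returned for the last remaining node plays the role of the root of the tree structure.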

Fig. 5 An example of RvNN tree

Recurrent neural networks

RNNs are a commonly employed and familiar algorithm in the discipline of DL [ 63 , 64 , 65 ]. RNN is mainly applied in speech processing and NLP contexts [ 66 , 67 ]. Unlike conventional networks, RNN uses sequential data in the network. Since the embedded structure in the sequence of the data delivers valuable information, this feature is fundamental to a range of different applications. For instance, it is important to understand the context of the sentence in order to determine the meaning of a specific word in it. Thus, it is possible to consider the RNN as a unit of short-term memory, where x represents the input layer, y is the output layer, and s represents the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig. 6 . Pascanu et al. [ 68 ] introduced three different types of deep RNN techniques, namely “Hidden-to-Hidden”, “Hidden-to-Output”, and “Input-to-Hidden”. Based on these three techniques, a deep RNN is introduced that lessens the learning difficulty in the deep network and brings the benefits of a deeper RNN.
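To make the roles of x, s, and y concrete, here is a minimal sketch of an unfolded RNN. It assumes scalar weights named `U`, `W`, and `V` and a tanh activation; these names and choices are illustrative and are not specified in the text above.

```python
import math


def rnn_forward(xs, U, W, V, s0=0.0):
    """Unfold a scalar RNN over the input sequence xs.

    s_t = tanh(U * x_t + W * s_{t-1})   (state / hidden layer)
    y_t = V * s_t                       (output layer)
    """
    s, states, outputs = s0, [], []
    for x in xs:
        # The new state depends on the current input and the previous state.
        s = math.tanh(U * x + W * s)
        states.append(s)
        outputs.append(V * s)
    return states, outputs


states, outputs = rnn_forward([1.0, 0.0, 0.0], U=1.0, W=0.5, V=2.0)
print(states)
```

Although the second and third inputs are zero, the later states remain positive because the state carries information from the first input forward, which is the short-term memory behaviour described above.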

Fig. 6 Typical unfolded RNN diagram

However, the sensitivity of the RNN to the exploding and vanishing gradient problems represents one of the main issues with this approach [ 69 ]. More specifically, during the training process, the repeated multiplication of several large or small derivatives may cause the gradients to explode or decay exponentially. With the entrance of new inputs, the network stops thinking about the initial ones; therefore, this sensitivity decays over time. This issue can be handled using LSTM [ 70 ]. This approach offers recurrent connections to memory blocks in the network. Every memory block contains a number of memory cells, which have the ability to store the temporal states of the network. In addition, it contains gated units for controlling the flow of information. In very deep networks [ 37 ], residual connections also have the ability to considerably reduce the impact of the vanishing gradient issue, which is explained in later sections. CNN is considered to be more powerful than RNN, as RNN offers less feature compatibility when compared to CNN.
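The effect of repeatedly multiplying derivatives can be seen with a small numerical sketch. It assumes, purely for illustration, that every time step contributes the same derivative; the function name `gradient_factor` is hypothetical.

```python
def gradient_factor(derivative, steps):
    """Product of `steps` identical per-step derivatives (chain rule)."""
    factor = 1.0
    for _ in range(steps):
        factor *= derivative
    return factor


print(gradient_factor(0.5, 50))  # shrinks towards zero (vanishing)
print(gradient_factor(1.5, 50))  # grows very large (exploding)
```

A per-step derivative below one drives the product towards zero over 50 steps, while a derivative above one makes it grow without bound.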

Convolutional neural networks

In the field of DL, the CNN is the most famous and commonly employed algorithm [ 30 , 71 , 72 , 73 , 74 , 75 ]. The main benefit of CNN compared to its predecessors is that it automatically identifies the relevant features without any human supervision [ 76 ]. CNNs have been extensively applied in a range of different fields, including computer vision [ 77 ], speech processing [ 78 ], face recognition [ 79 ], etc. The structure of CNNs was inspired by neurons in human and animal brains, similar to a conventional neural network. More specifically, in a cat’s brain, a complex sequence of cells forms the visual cortex; this sequence is simulated by the CNN [ 80 ]. Goodfellow et al. [ 28 ] identified three key benefits of the CNN: equivariant representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, shared weights and local connections in the CNN are employed to make full use of 2D input-data structures like image signals. This operation utilizes an extremely small number of parameters, which both simplifies the training process and speeds up the network. This is the same as in the visual cortex cells. Notably, only small regions of a scene are sensed by these cells rather than the whole scene (i.e., these cells spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig. 7 .

Fig. 7 An example of CNN architecture for image classification

The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or $m \times m \times r$ , where the height (m) is equal to the width. The depth is also referred to as the channel number. For example, in an RGB image, the depth (r) is equal to three. Each convolutional layer contains several kernels (filters), whose number is denoted by k ; each kernel also has three dimensions ( $n \times n \times q$ ), similar to the input image; here, however, n must be smaller than m , while q is either equal to or smaller than r . In addition, the kernels are the basis of the local connections, which share similar parameters (bias $b^{k}$ and weight $W^{k}$ ) for generating k feature maps $h^{k}$ , each of size $(m-n+1) \times (m-n+1)$ , and are convolved with the input, as mentioned above. The convolution layer calculates a dot product between its input and the weights as in Eq. 1 , similar to an MLP, but the inputs are small regions of the initial image. Next, by applying the nonlinearity or an activation function $f$ to the convolution-layer output, we obtain the following, where $*$ denotes the convolution operation:

$$h^{k} = f\left(W^{k} * x + b^{k}\right) \qquad (1)$$

The next step is down-sampling every feature map in the sub-sampling layers. This leads to a reduction in the network parameters, which accelerates the training process and in turn enables handling of the overfitting issue. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size $p \times p$ , where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction, which represents the last-stage layers as in a typical neural network. The classification scores are generated using the ending layer [e.g. support vector machines (SVMs) or softmax]. For a given instance, every score represents the probability of a specific class.
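As a small illustration of how an ending softmax layer turns class scores into probabilities, consider the following sketch. The input scores are invented for the example, and subtracting the maximum score is a common numerical-stability choice rather than something stated in the text.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The largest score receives the largest probability, and the ordering of the scores is preserved.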

Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.

Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

Large-scale network implementation is much easier with CNN than with other neural networks.

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as N-dimensional matrices, is convolved with these filters to generate the output feature map.

Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called the kernel weight. Random numbers are assigned to act as the weights of the kernel at the beginning of the CNN training process. In addition, there are several different methods used to initialize the weights. Next, these weights are adjusted at each training epoch; thus, the kernel learns to extract significant features.

Convolutional Operation: Initially, the CNN input format is described. The vector format is the input of the traditional neural network, while the multi-channeled image is the input of the CNN. For instance, single-channel is the format of the gray-scale image, while the RGB image format is three-channeled. To understand the convolutional operation, let us take an example of a $4 \times 4$ gray-scale image with a $2 \times 2$ random weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. In addition, the dot product between the input image and the kernel is determined, where their corresponding values are multiplied and then summed up to create a single scalar value, calculated concurrently. The whole process is then repeated until no further sliding is possible. Note that the calculated dot product values represent the feature map of the output. Figure 8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the $2 \times 2$ kernel, while the light blue color represents the similar size area of the input image. Both are multiplied; the end result after summing up the resulting product values (marked in a light orange color) represents an entry value to the output feature map.
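The sliding dot product described above can be sketched in a few lines. The $4 \times 4$ image and $2 \times 2$ kernel values below are invented for illustration and are not the ones shown in Fig. 8; the function name `conv2d_valid` is also an assumption. As in the description, the kernel is applied without flipping, with a stride of one and no padding.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding); return the feature map."""
    m, n = len(image), len(kernel)
    out = m - n + 1  # number of valid kernel positions along each axis
    feature_map = []
    for i in range(out):
        row = []
        for j in range(out):
            # Multiply overlapping values and sum them into one scalar.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(n) for b in range(n)))
        feature_map.append(row)
    return feature_map


image = [[1, 0, -2, 1],
         [-1, 0, 1, 2],
         [0, 2, 1, 0],
         [1, 0, 0, 1]]
kernel = [[0, 1],
          [-1, 2]]
fmap = conv2d_valid(image, kernel)
print(fmap)  # [[1, 0, 4], [4, 1, 1], [1, 1, 2]]
```

The $4 \times 4$ input and $2 \times 2$ kernel yield a $3 \times 3$ feature map, one entry per kernel position.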

The primary calculations executed at each step of convolutional layer

Note that no padding was applied to the input image in the previous example, while a stride of one (the step size of the kernel along the vertical or horizontal direction) was applied to the kernel. It is also possible to use another stride value; increasing the stride value yields a feature map of lower dimensions.
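The sliding dot product described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used by any particular library: the function name `conv2d`, the fixed example kernel (chosen instead of random weights so the result is reproducible), and the image values are all assumptions made for the example.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2-D `image` and return the output feature map.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero-padding on all sides
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return out

# Hypothetical 4x4 gray-scale image and 2x2 kernel.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

fmap_s1 = conv2d(image, kernel, stride=1)              # 3x3 feature map
fmap_s2 = conv2d(image, kernel, stride=2)              # 2x2 feature map
fmap_pad = conv2d(image, kernel, stride=1, padding=1)  # 5x5 feature map
```

With a stride of one and no padding, the $4 \times 4$ input yields a $3 \times 3$ feature map; a stride of two shrinks it to $2 \times 2$, while one pixel of zero-padding enlarges it to $5 \times 5$.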

On the other hand, padding is highly significant for preserving the border information of the input image; without it, the features at the image borders are washed away very quickly. By applying padding, the size of the input image will increase, and in turn, the size of the output feature map will also increase.

Core Benefits of Convolutional Layers

Sparse Connectivity: Each neuron of a layer in FC neural networks links with all neurons in the following layer. By contrast, in CNNs, only a few weights are available between two adjacent layers. Thus, the number of required weights or connections is small, while the memory required to store these weights is also small; hence, this approach is memory-efficient. In addition, the matrix operation of FC layers is computationally much more costly than the dot (.) operation in CNN.

Weight Sharing: In CNN, no weights are dedicated to a single pair of neurons in neighboring layers; instead, the same set of weights operates on all pixels of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.
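To make the saving from sparse connectivity and weight sharing concrete, the following sketch counts the learnable parameters of a fully connected layer and of a convolutional layer that map the same input to the same output size. The layer sizes are hypothetical and chosen only for illustration.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(k, c_in, c_out):
    """Weights plus biases of a convolutional layer with k x k kernels."""
    return k * k * c_in * c_out + c_out

# Hypothetical layer: a 32x32 RGB input mapped to 16 feature maps of size 32x32.
h, w, c_in, c_out = 32, 32, 3, 16
fc_count = fc_params(h * w * c_in, h * w * c_out)
conv_count = conv_params(3, c_in, c_out)
print(fc_count, conv_count)  # prints: 50348032 448
```

The convolutional layer reuses the same 448 parameters at every spatial position, whereas the fully connected layer needs a separate weight for every input-output pair.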

Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps generated by the convolutional operations. In other words, this approach shrinks large-size feature maps to create smaller feature maps, while maintaining the majority of the dominant information (or features) in every step of the pooling stage. In a similar manner to the convolutional operation, the sizes of both the stride and the kernel are assigned before the pooling operation is executed. Several types of pooling methods are available for use in various pooling layers. These methods include tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are max, min, and GAP pooling. Figure 9 illustrates these three pooling operations.

Three types of pooling operations

Sometimes, the overall CNN performance is decreased as a result of pooling; this represents the main shortfall of the pooling layer. This layer helps the CNN to determine whether or not a certain feature is present in the particular input image, but it does not retain the exact location of that feature. Thus, the CNN model can miss relevant information.
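The max, min, and global average pooling operations discussed above can be sketched as follows. The helper names and the example feature map are assumptions made for illustration; the window size and stride are both set to two here.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Sub-sample a 2-D feature map with max, min, or average pooling."""
    reducer = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reducer(fmap[r:r + size, c:c + size])
    return out

def global_average_pool(fmap):
    """Collapse a whole feature map into its mean value (GAP)."""
    return float(fmap.mean())

# Hypothetical 4x4 feature map.
fmap = np.array([[1.0, 3.0, 2.0, 4.0],
                 [5.0, 7.0, 6.0, 8.0],
                 [9.0, 2.0, 1.0, 0.0],
                 [3.0, 4.0, 5.0, 6.0]])

max_pooled = pool2d(fmap, mode="max")   # [[7, 8], [9, 6]]
min_pooled = pool2d(fmap, mode="min")   # [[1, 2], [2, 0]]
gap_value = global_average_pool(fmap)   # 4.125
```

Each output entry only records a summary of its window, which is why the exact position of a feature inside the window is lost.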

Activation Function (non-linearity): Mapping the input to the output is the core function of all types of activation function in all types of neural network. The input value is determined by computing the weighted summation of the neuron input along with its bias (if present). This means that the activation function makes the decision as to whether or not to fire a neuron with reference to a particular input by creating the corresponding output.

Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in CNN architecture. This non-linear behavior of the activation layers means that the mapping of input to output will be non-linear; moreover, these layers give the CNN the ability to learn highly complicated patterns. The activation function must also be differentiable, which is an extremely significant feature, as it allows error back-propagation to be used to train the network. The following types of activation functions are most commonly used in CNN and other deep neural networks.

Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2 .

Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3 .

ReLU: The most commonly used function in the CNN context. It sets all negative input values to zero and leaves positive values unchanged. Lower computational load is the main benefit of ReLU over the others. Its mathematical representation is in Eq. 4 .

Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider an error back-propagation algorithm with a large gradient flowing through it. Passing this gradient through the ReLU function can update the weights in such a way that the neuron is never activated again. This issue is referred to as “Dying ReLU”. Some ReLU alternatives exist to solve such issues. The following discusses some of them.

Leaky ReLU: Instead of zeroing the negative inputs as ReLU does, this activation function scales them down by a small factor, which ensures that these inputs are never ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5 .

Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

Noisy ReLU: This function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6 .

Parametric Linear Units: This is mostly the same as Leaky ReLU. The main difference is that the leak factor in this function is updated through the model training process. The parametric linear unit can be represented mathematically as in Eq. 7 .

Note that the learnable weight is denoted as a.
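The activation functions listed above can be sketched in NumPy as follows. These are the commonly used textbook forms and are offered as an illustration rather than as a transcription of Eqs. 2 to 7. In particular, the fixed noise scale `sigma` in `noisy_relu` is an assumption, and `a` in `parametric_relu` would be updated during training in practice.

```python
import numpy as np

def sigmoid(x):
    """S-shaped curve with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Like sigmoid, but with output in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, m=0.001):
    """Negative inputs are scaled by the small leak factor m."""
    return np.where(x > 0, x, m * x)

def noisy_relu(x, sigma=0.1, rng=None):
    """ReLU applied after adding zero-mean Gaussian noise (sigma is assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.maximum(0.0, x + rng.normal(0.0, sigma, size=np.shape(x)))

def parametric_relu(x, a):
    """Same shape as Leaky ReLU, but the leak factor a is a learnable weight."""
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, 0.0, 3.0])
relu_out = relu(x)         # [0, 0, 3]
leaky_out = leaky_relu(x)  # [-0.002, 0, 3]
```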

Fully Connected Layer: Commonly, this layer is located at the end of each CNN architecture. Inside this layer, each neuron is connected to all neurons of the previous layer, the so-called Fully Connected (FC) approach. It is utilized as the CNN classifier. It follows the basic method of the conventional multiple-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer. This input is in the form of a vector, which is created from the feature maps after flattening. The output of the FC layer represents the final CNN output, as illustrated in Fig. 10 .

Fully connected layer

Loss Functions: The previous section has presented various layer-types of CNN architecture. In addition, the final classification is achieved from the output layer, which represents the last layer of the CNN architecture. Some loss functions are utilized in the output layer to calculate the predicted error created across the training samples in the CNN model. This error reveals the difference between the actual output and the predicted one. Next, it will be optimized through the CNN learning process.

Two parameters are used by the loss function to calculate the error. The CNN estimated output (referred to as the prediction) is the first parameter. The actual output (referred to as the label) is the second parameter. Several types of loss function are employed in various problem types. The following concisely explains some of the loss function types.

Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring the CNN model performance. It is also referred to as the log loss function. Its output is the probability $p \in [0, 1]$. In addition, it is usually employed as a substitute for the square error loss function in multi-class classification problems. In the output layer, it employs the softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8 .

Here, $e^{a_{i}}$ represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of cross-entropy loss function is Eq. 9 .
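As a hedged illustration, the softmax normalization and the cross-entropy loss can be sketched as below, using the standard forms $p_{i} = e^{a_{i}} / \sum_{k=1}^{N} e^{a_{k}}$ and $-\sum_{i} y_{i} \log p_{i}$. The max-subtraction trick and the small constant `eps` are numerical-stability assumptions added for the example, and the logits and label are hypothetical.

```python
import numpy as np

def softmax(a):
    """Turn non-normalized outputs a_i into a probability distribution."""
    shifted = a - np.max(a)  # leaves the result unchanged, avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

def cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between predicted probabilities p and one-hot label y."""
    return -np.sum(y * np.log(p + eps))

logits = np.array([2.0, 1.0, 0.1])  # hypothetical outputs of the preceding layer
label = np.array([1.0, 0.0, 0.0])   # one-hot label: the first class is correct
probs = softmax(logits)
loss = cross_entropy(probs, label)
```

For a one-hot label, the loss reduces to the negative log-probability assigned to the correct class.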

Euclidean Loss Function: This function is widely used in regression problems. It is also known as the mean square error. The mathematical expression of the estimated Euclidean loss is Eq. 10 .

Hinge Loss Function: This function is commonly employed in problems related to binary classification. This problem relates to maximum-margin-based classification; this is mostly important for SVMs, which use the hinge loss function, wherein the optimizer attempts to maximize the margin between the two target classes. Its mathematical formula is Eq. 11 .

The margin m is commonly set to 1. Moreover, the predicted output is denoted as $p_{i}$ , while the desired output is denoted as $y_{i}$ .
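The two losses above can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Eqs. 10 and 11: the Euclidean loss is averaged with a factor of $1/N$ (some texts use $1/(2N)$), and the hinge loss assumes labels $y_{i} \in \{0, 1\}$ that are mapped to $\pm 1$ via $2y_{i} - 1$.

```python
import numpy as np

def euclidean_loss(p, y):
    """Mean square error between predictions p and targets y."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)

def hinge_loss(p, y, m=1.0):
    """Hinge loss with margin m; labels y in {0, 1} are mapped to {-1, +1}."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    signs = 2.0 * y - 1.0
    return np.sum(np.maximum(0.0, m - signs * p))

mse_value = euclidean_loss([2.5, 0.0], [3.0, -1.0])      # (0.25 + 1.0) / 2 = 0.625
hinge_value = hinge_loss([2.0, -0.5, 0.3], [1, 0, 1])    # 0 + 0.5 + 0.7 = 1.2
```

A prediction on the correct side of the margin (the first sample) contributes nothing to the hinge loss, while correct predictions that fall inside the margin are still penalized.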

Regularization to CNN

For CNN models, over-fitting represents the central issue associated with obtaining well-behaved generalization. The model is called over-fitted in cases where it performs especially well on training data but does not succeed on test data (unseen data), which is explained further in a later section. An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data. The model is referred to as “just-fitted” if it performs well on both training and testing data. These three types are illustrated in Fig. 11 . Various intuitive concepts are used to help the regularization to avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

Dropout: This is a widely utilized technique for generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature selection power is distributed equally across the whole group of neurons, as well as forcing the model to learn different independent features. During the training process, the dropped neuron will not be a part of back-propagation or forward-propagation. By contrast, the full-scale network is utilized to perform prediction during the testing process.

Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.
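The difference between the two techniques can be sketched as below. This is a minimal illustration with assumptions that go beyond the text: the random mask is resampled on every call, and the surviving values are rescaled by $1/(1 - p)$ (so-called inverted dropout) so that the full network can be used unchanged at test time.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Randomly zero whole neurons (activations) during training."""
    if not training:
        return activations  # the full-scale network is used for prediction
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

def drop_weights(weights, p_drop, rng, training=True):
    """Randomly zero individual connections (weights) during training."""
    if not training:
        return weights
    keep = rng.random(weights.shape) >= p_drop
    return weights * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
acts = np.ones((4, 5))         # hypothetical layer activations
W = np.ones((3, 5))            # hypothetical weight matrix
acts_train = dropout(acts, 0.5, rng)
acts_test = dropout(acts, 0.5, rng, training=False)
W_train = drop_weights(W, 0.5, rng)
```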

Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used. Several techniques are utilized to artificially expand the size of the training dataset. More details can be found in a later section, which describes the data augmentation techniques.

Batch Normalization: This method normalizes the output activations so that they follow a unit Gaussian distribution [ 81 ]. Subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this as a pre-processing task at each layer in the network, the operation is also differentiable and can be integrated with other networks. In addition, it is employed to reduce the “internal covariate shift” of the activation layers. In each layer, the variation in the activation distribution defines the internal covariate shift. This shift becomes very high due to the continuous weight updating through training, which may occur if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model will consume extra time for convergence, and in turn, the time required for training will also increase. To resolve this issue, a layer representing the operation of batch normalization is applied in the CNN architecture.

The advantages of utilizing batch normalization are as follows:

It prevents the problem of vanishing gradient from arising.

It can effectively mitigate the effect of poor weight initialization.

It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).

It helps to reduce the dependence of training on hyper-parameters.

Chances of over-fitting are reduced, since it has a minor regularization effect.
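The normalization step described above (subtract the mean, divide by the standard deviation) can be sketched for a mini-batch as follows. The learnable scale `gamma` and shift `beta`, as well as the small constant `eps`, follow common practice and are assumptions relative to the text; the running statistics normally kept for inference are omitted for brevity.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of a mini-batch x of shape (batch, features)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # optional learnable rescaling

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # hypothetical shifted activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Whatever the mean and spread of the incoming activations, the output of each feature is centred at zero with unit standard deviation when `gamma` is one and `beta` is zero.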

Over-fitting and under-fitting issues

Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first issue is the learning algorithm selection (optimizer), while the second issue is the use of many enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to enhance the output.

Minimizing the loss function, i.e. the error (difference between actual and predicted output), which depends on numerous learnable parameters (e.g. biases, weights, etc.), is the core purpose of all supervised learning algorithms. Gradient-based learning techniques are the usual choice for a CNN network. The network parameters should be updated through all training epochs, while the network should also look for the locally optimized answer in all training epochs in order to minimize the error.

The learning rate is defined as the step size of the parameter updating. The training epoch represents a complete repetition of the parameter update that involves the complete training dataset at one time. Note that the learning rate, which is a hyper-parameter, needs to be selected wisely so that it does not adversely affect the learning process.

Gradient Descent or Gradient-based learning algorithm: To minimize the training error, this algorithm repetitively updates the network parameters through every training epoch. More specifically, to update the parameters correctly, it needs to compute the objective function gradient (slope) by applying a first-order derivative with respect to the network parameters. Next, the parameter is updated in the reverse direction of the gradient to reduce the error. The parameter updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is given in Eq. 12 .

The final weight in the current training epoch is denoted by $w_{ij}^{t}$ , while the weight in the preceding $(t-1)$ training epoch is denoted $w_{ij}^{t-1}$ . The learning rate is $\eta $ and the prediction error is E . Different alternatives of the gradient-based learning algorithm are available and commonly employed; these include the following:

Batch Gradient Descent: During the execution of this technique [ 82 ], the network parameters are updated only once after the entire training dataset has been passed through the network. In more depth, it calculates the gradient of the whole training set and subsequently uses this gradient to update the parameters. For a small-sized dataset, the CNN model converges faster and creates a more stable gradient using BGD. Since the parameters are changed only once for every training epoch, it requires a substantial amount of resources. By contrast, for a large training dataset, additional time is required for converging, and it could converge to a local optimum (for non-convex instances).

Stochastic Gradient Descent: The parameters are updated at each training sample in this technique [ 83 ]. It is preferable to randomly shuffle the training samples in every epoch before training. For a large-sized training dataset, this technique is both more memory-efficient and much faster than BGD. However, because it is frequently updated, it takes extremely noisy steps in the direction of the answer, which in turn causes the convergence behavior to become highly unstable.

Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, in which every mini-batch can be considered a small collection of samples with no overlap between them [ 84 ]. Next, parameter updating is performed following gradient computation on every mini-batch. The advantage of this method comes from combining the advantages of both BGD and SGD techniques. Thus, it offers steady convergence, higher computational efficiency, and better memory efficiency. The following describes several enhancement techniques in gradient-based learning algorithms (usually in SGD), which further enhance the CNN training process.
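Before turning to those enhancements, the three update schedules above (BGD, SGD, and mini-batch) can be sketched with a single training loop in which only the batch size changes. The linear model, the squared-error objective, the learning rate, and the synthetic data are assumptions chosen to keep the example small; they are not taken from the text.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared error of the linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.02, epochs=500, seed=0):
    """Gradient descent; the batch size selects BGD, SGD, or mini-batch GD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * gradient(w, X[idx], y[idx])  # step against the gradient
    return w

# Hypothetical noise-free regression data.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w

w_bgd = train(X, y, batch_size=64)  # one update per epoch (whole dataset)
w_sgd = train(X, y, batch_size=1)   # one update per training sample
w_mb = train(X, y, batch_size=16)   # one update per mini-batch of 16 samples
```

On this easy convex problem all three variants recover the same weights; they differ in the number of updates per epoch and in how noisy each individual step is.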

Momentum: For neural networks, this technique is employed in the objective function. It enhances both the accuracy and the training speed by adding the update computed at the preceding training step, weighted by a factor $\lambda $ (known as the momentum factor). Without it, gradient-based learning can easily become stuck in a local minimum rather than the global minimum; this represents the main disadvantage of gradient-based learning algorithms. Issues of this kind frequently occur if the problem has a non-convex surface (or solution space).

Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13 .

The weight increment in the current $t$th training epoch is denoted as $\Delta w_{ij}^{t}$ , while $\eta $ is the learning rate, and the weight increment in the preceding $(t-1)$th training epoch is denoted as $\Delta w_{ij}^{t-1}$ . The momentum factor value is maintained within the range 0 to 1; in turn, the step size of the weight updating increases in the direction of the minimum to minimize the error. As the value of the momentum factor becomes very low, the model loses its ability to avoid local minima. By contrast, as the momentum factor value becomes high, the model develops the ability to converge much more rapidly. If a high value of momentum factor is used together with a high learning rate (LR), then the model could miss the global minimum by crossing over it.

However, when the gradient varies its direction continually throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths the variations in the weight updates.
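The momentum update can be sketched as below, assuming the usual form in which the current increment is the learning rate times the gradient plus the momentum factor times the previous increment, and the weight then moves against that increment. The one-dimensional objective $E(w) = w^{2}$ and the hyper-parameter values are hypothetical.

```python
def momentum_step(w, grad, prev_delta, lr=0.1, lam=0.9):
    """One gradient step with momentum; returns the new weight and increment."""
    delta = lr * grad + lam * prev_delta  # current weight increment
    return w - delta, delta               # move against the accumulated increment

# Minimize the hypothetical objective E(w) = w**2, whose gradient is 2 * w.
w, delta = 5.0, 0.0
for _ in range(200):
    w, delta = momentum_step(w, 2.0 * w, delta)
```

With `lam=0.9` the iterate overshoots and oscillates around the minimum before settling, which illustrates why a high momentum factor combined with a large learning rate can cross over the minimum.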

Adaptive Moment Estimation (Adam): This is another widely used optimization technique or learning algorithm. Adam [ 85 ] represents the latest trends in deep learning optimization. Although it uses only first-order gradients, it maintains running estimates of the first and second moments of the gradient, rather than computing the Hessian matrix of second-order derivatives. Adam is a learning strategy that has been designed specifically for training deep neural networks. Higher memory efficiency and lower computational cost are two advantages of Adam. The mechanism of Adam is to calculate an adaptive LR for each parameter in the model. It integrates the pros of both Momentum and RMSprop: like RMSprop, it utilizes the squared gradients to scale the learning rate, and like momentum, it uses the moving average of the gradient. The equation of Adam is represented in Eq. 14 .
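A minimal sketch of one Adam update is given below. It follows the commonly published form of the algorithm, including bias correction, which may differ in notation from Eq. 14; the default values of `beta1`, `beta2`, and `eps` are the ones usually quoted for Adam, and the quadratic objective is hypothetical.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad       # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Minimize the hypothetical objective E(w) = sum(w**2), whose gradient is 2 * w.
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.05)
```

Because the step is divided by the root of the squared-gradient average, the very first update has a magnitude close to the learning rate for every parameter, regardless of the gradient scale.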

Design of algorithms (backpropagation)

Let’s start with a notation that refers to weights in the network unambiguously. We denote $w_{ij}^{h}$ to be the weight for the connection from the $i$th input (or the $i$th neuron in the $(h-1)$th layer) to the $j$th neuron in the $h$th layer. So, Fig. 12 shows the weight on a connection from the neuron in the first layer to another neuron in the next layer in the network.

MLP structure

Here, $w_{11}^{2}$ represents the weight from the first neuron in the first layer to the first neuron in the second layer. Accordingly, the second weight for the same neuron is $w_{21}^{2}$, i.e. the weight from the second neuron in the previous layer to the first neuron in the next layer (which is the second layer in this network). Regarding the bias, since it is not a connection between neurons of different layers, it is easily handled: each neuron may have its own bias, while in some networks each layer has a certain bias. It can be seen from the above network that each layer has its own bias. Each network has parameters such as the number of layers, the number of neurons in each layer, and the number of weights (connections) between the layers. The number of connections can be easily determined from the number of neurons in each layer; for example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections (weights) between them is $10 \times 2 = 20$. To show how the error is defined and how the weights are updated, we consider a neural network with two layers,

where $d$ is the label of the $i$th individual input and $y$ is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on the changes of the cost function (error). Ultimately, this means computing the partial derivatives $\partial E / \partial w_{ij}^{h}$ and $\partial E / \partial b_{j}^{h}$. To compute these, a local variable $\delta _{j}^{h}$ is introduced, which is called the local error of the $j$th neuron in the $h$th layer. Based on that local error, backpropagation gives the procedure to compute $\partial E / \partial w_{ij}^{h}$ and $\partial E / \partial b_{j}^{h}$. The two-layer neural network considered here is shown in Fig. 13 .

Neuron activation functions

Output error $\delta _{j}^{l}$ for each $l = 1:L$, where $L$ is the number of neurons in the output layer

where $e(k)$ is the error of the epoch $k$ as shown in Eq. ( 2 ) and $\vartheta ^{\prime }\left( v_{j}(k)\right) $ is the derivative of the activation function for $v_{j}$ at the output.

Backpropagate the error through all the remaining layers except the output layer

where $\delta _{j}^{l}(k)$ is the output error and $w_{jl}^{h+1}(k)$ represents the weight after the layer at which the error needs to be obtained.

After finding the error at each neuron in each layer, now we can update the weight in each layer based on Eqs. ( 16 ) and ( 17 ).
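The procedure above can be sketched for a small two-layer network as follows. This is an illustration under explicit assumptions rather than the exact formulation of the omitted equations: the activation is a sigmoid, the error is $E = \frac{1}{2}\sum (d - y)^{2}$, and each weight matrix stores the receiving neuron in its row index (so `W[j, i]` plays the role of $w_{ij}$). A finite-difference check confirms one of the computed derivatives.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)  # hidden-layer output
    y = sigmoid(W2 @ h + b2)  # network output
    return h, y

def error(x, d, W1, b1, W2, b2):
    _, y = forward(x, W1, b1, W2, b2)
    return 0.5 * np.sum((d - y) ** 2)

def backprop(x, d, W1, b1, W2, b2):
    """Return dE/dW and dE/db for both layers using local errors (deltas)."""
    h, y = forward(x, W1, b1, W2, b2)
    e = d - y
    delta_out = -e * y * (1.0 - y)                  # output local error
    delta_hid = h * (1.0 - h) * (W2.T @ delta_out)  # error propagated backward
    return {"W1": np.outer(delta_hid, x), "b1": delta_hid,
            "W2": np.outer(delta_out, h), "b2": delta_out}

rng = np.random.default_rng(0)
x, d = rng.normal(size=3), np.array([1.0, 0.0])  # hypothetical input and label
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
grads = backprop(x, d, W1, b1, W2, b2)

# Finite-difference check of a single output-layer weight.
step = 1e-6
W2_plus = W2.copy()
W2_plus[0, 1] += step
numeric = (error(x, d, W1, b1, W2_plus, b2) - error(x, d, W1, b1, W2, b2)) / step

# One gradient-descent update of every parameter with a hypothetical learning rate.
lr = 0.1
W1_new, W2_new = W1 - lr * grads["W1"], W2 - lr * grads["W2"]
```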

Improving performance of CNN

Based on our experiments in different DL applications [ 86 , 87 , 88 ], we can conclude that the most effective solutions that may improve the performance of CNN are:

Expand the dataset with data augmentation or use transfer learning (explained in later sections).

Increase the training time.

Increase the depth (or width) of the model.

Add regularization.

Increase hyperparameter tuning.

CNN architectures

Over the last 10 years, several CNN architectures have been presented [ 21 , 26 ]. Model architecture is a critical factor in improving the performance of different applications. Various modifications have been made to CNN architecture from 1989 until today. Such modifications include structural reformulation, regularization, parameter optimizations, etc. It should be noted, however, that the key upgrade in CNN performance occurred largely due to the processing-unit reorganization, as well as the development of novel blocks. In particular, the most novel developments in CNN architectures concerned the use of network depth. In this section, we review the most popular CNN architectures, beginning with the AlexNet model in 2012 and ending with the High-Resolution (HR) model in 2020. Studying the features of these architectures (such as input size, depth, and robustness) is key to helping researchers choose a suitable architecture for their target task. Table 2 presents a brief overview of CNN architectures.

The history of deep CNNs began with the appearance of LeNet [ 89 ] (Fig. 14 ). At that time, the CNNs were restricted to handwritten digit recognition tasks, which could not be scaled to all image classes. In deep CNN architecture, AlexNet is highly respected [ 30 ], as it achieved innovative results in the fields of image recognition and classification. Krizhevsky et al. [ 30 ] first proposed AlexNet and consequently improved the CNN learning ability by increasing its depth and implementing several parameter optimization strategies. Figure 15 illustrates the basic design of the AlexNet architecture.

The architecture of LeNet

The architecture of AlexNet

The learning ability of the deep CNN was limited at this time due to hardware restrictions. To overcome these hardware limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, in order to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization for several image resolutions, overfitting represented the main drawback related to the depth. Krizhevsky et al. used Hinton’s idea to address this problem [ 90 , 91 ]. To ensure that the features learned by the algorithm were more robust, Krizhevsky et al.’s algorithm randomly skips several transformational units throughout the training stage. Moreover, ReLU [ 92 ] was utilized as a non-saturating activation function to enhance the rate of convergence by reducing the vanishing gradient problem [ 93 ]. Local response normalization and overlapping subsampling were also performed to enhance the generalization by decreasing the overfitting. To improve on the performance of previous networks, other modifications were made by using large-size filters $(5\times 5 \; \text{and}\; 11 \times 11)$ in the earlier layers. AlexNet has considerable significance in the recent CNN generations, as well as beginning an innovative research era in CNN applications.

Network-in-network

This network model, which has some slight differences from the preceding models, introduced two innovative concepts [ 94 ]. The first was employing multiple layers of perceptron convolution. These convolutions are executed using a $1\times 1$ filter, which supports the addition of extra nonlinearity in the networks. Moreover, this supports enlarging the network depth, which may later be regularized using dropout. For DL models, this idea is frequently employed in the bottleneck layer. As a substitute for an FC layer, the GAP is also employed, which represents the second novel concept and enables a significant reduction in the number of model parameters. In addition, GAP considerably alters the network architecture. Generating a final low-dimensional feature vector with no reduction in the feature maps dimension is possible when GAP is used on a large feature map [ 95 , 96 ]. Figure 16 shows the structure of the network.
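The two concepts can be sketched as follows: a $1\times 1$ convolution is simply a linear mix of the channels at each pixel, and GAP collapses each feature map into a single value. The function names, tensor layout (channels first), and sizes are assumptions made for illustration.

```python
import numpy as np

def conv1x1(fmaps, weights):
    """1x1 convolution: fmaps is (C_in, H, W), weights is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, fmaps)

def global_average_pool(fmaps):
    """Average every feature map over its spatial positions."""
    return fmaps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
fmaps = rng.normal(size=(8, 6, 6))  # hypothetical input: 8 channels of 6x6
weights = rng.normal(size=(3, 8))   # 1x1 filters mapping 8 channels to 3

mixed = np.maximum(0.0, conv1x1(fmaps, weights))  # 1x1 conv followed by ReLU
scores = global_average_pool(mixed)               # one value per output map
```

Here GAP replaces an FC layer that would otherwise need $3 \times 6 \times 6$ inputs per output neuron, while adding no parameters of its own.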

The architecture of network-in-network

Before 2013, the CNN learning mechanism was basically constructed on a trial-and-error basis, which precluded an understanding of the precise purpose behind each enhancement. This issue restricted the deep CNN performance on complex images. In response, Zeiler and Fergus introduced DeconvNet (a multilayer de-convolutional neural network) in 2013 [ 97 ]. This method later became known as ZefNet, which was developed in order to quantitatively visualize the network. Monitoring the CNN performance via understanding the neuron activation was the purpose of the network activity visualization. Erhan et al. also utilized this concept to optimize deep belief network (DBN) performance by visualizing the features of the hidden layers [ 98 ]. Moreover, Le et al. assessed the deep unsupervised auto-encoder (AE) performance by visualizing the created classes of the image using the output neurons [ 99 ]. By reversing the operation order of the convolutional and pooling layers, DeconvNet operates like a forward-pass CNN. Reverse mapping of this kind projects the convolutional layer output backward to create visually observable image shapes that accordingly give the neural interpretation of the internal feature representation learned at each layer [ 100 ]. Monitoring the learning schematic through the training stage was the key concept underlying ZefNet. In addition, it utilized the outcomes to recognize capability issues associated with the model. This concept was experimentally proven on AlexNet by applying DeconvNet. This indicated that only certain neurons were working, while the others were out of action in the first two layers of the network. Furthermore, it indicated that the features extracted via the second layer contained aliasing artifacts. Based on these outcomes, Zeiler and Fergus changed the CNN topology. In addition, they executed parameter optimization, and also improved the CNN learning by decreasing the stride and the filter sizes in order to retain all features of the initial two convolutional layers. An improvement in performance was accordingly achieved due to this rearrangement in CNN topology. This rearrangement suggested that the visualization of the features could be employed to identify design weaknesses and conduct appropriate parameter alteration. Figure 17 shows the structure of the network.

The architecture of ZefNet

Visual geometry group (VGG)

After CNN was determined to be effective in the field of image recognition, an easy and efficient design principle for CNN was proposed by Simonyan and Zisserman. This innovative design was called Visual Geometry Group (VGG). A multilayer model [ 101 ], it was made nineteen layers deep, deeper than ZefNet [ 97 ] and AlexNet [ 30 ], to simulate the relation between depth and the representational capacity of the network. ZefNet, the frontier network in the 2013-ILSVRC competition, had suggested that filters with small sizes could enhance the CNN performance. With reference to these results, VGG used a stack of $3\times 3$ filters rather than the $5\times 5$ and $11 \times 11$ filters in ZefNet. This showed experimentally that stacking these small-size filters could produce the same effect as the large-size filters. In other words, the stacked small-size filters cover the receptive field as effectively as the large-size filters $(7 \times 7 \; \text{and}\; 5 \times 5)$ . By decreasing the number of parameters, an extra advantage of reducing computational complexity was achieved by using small-size filters. These outcomes established a novel research trend for working with small-size filters in CNN. In addition, by inserting $1\times 1$ convolutions in the middle of the convolutional layers, VGG regulates the network complexity. It learns a linear combination of the resulting feature maps. With respect to network tuning, a max pooling layer [ 102 ] is inserted following the convolutional layer, while padding is implemented to maintain the spatial resolution. In general, VGG obtained significant results for localization problems and image classification. While it did not achieve first place in the 2014-ILSVRC competition, it acquired a reputation due to its enlarged depth, homogeneous topology, and simplicity. However, VGG’s computational cost was excessive due to its utilization of around 140 million parameters, which represented its main shortcoming. Figure 18 shows the structure of the network.
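The trade-off between stacked small filters and a single large filter can be checked with a short calculation. The sketch below assumes stride-one convolutions with the same number of input and output channels in every layer and ignores biases; the channel count of 64 is hypothetical.

```python
def receptive_field(num_layers, k=3):
    """Receptive field of `num_layers` stacked k x k convolutions (stride 1)."""
    return 1 + num_layers * (k - 1)

def conv_weights(k, channels, num_layers=1):
    """Weight count (no biases) with `channels` inputs and outputs per layer."""
    return num_layers * k * k * channels * channels

C = 64  # hypothetical number of channels
two_3x3, one_5x5 = conv_weights(3, C, 2), conv_weights(5, C)    # 18*C^2 vs 25*C^2
three_3x3, one_7x7 = conv_weights(3, C, 3), conv_weights(7, C)  # 27*C^2 vs 49*C^2

assert receptive_field(2) == 5 and receptive_field(3) == 7
print(two_3x3, one_5x5, three_3x3, one_7x7)  # prints: 73728 102400 110592 200704
```

Two stacked $3\times 3$ layers see the same $5\times 5$ region as one $5\times 5$ filter with fewer weights, and three stacked layers match a $7\times 7$ filter, while also adding extra non-linearities between the layers.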

The architecture of VGG

In the 2014-ILSVRC competition, GoogLeNet (also called Inception-V1) emerged as the winner [ 103 ]. Achieving high accuracy at reduced computational cost is the core aim of the GoogLeNet architecture. It introduced the inception block (module) concept to CNNs, which combines multiple-scale convolutional transformations by employing split, transform, and merge functions for feature extraction. Figure 19 illustrates the inception block architecture. This architecture incorporates filters of different sizes ( $5\times 5, 3\times 3, \; \text{and} \; 1\times 1$ ) to capture channel information together with spatial information at diverse ranges of spatial resolution. The conventional convolutional layer is replaced in GoogLeNet by small blocks, following the concept of the network-in-network (NIN) architecture [ 94 ], which replaced each layer with a micro-neural network. The split, transform, and merge concept was adopted to address the variation in appearance that exists among images of the same class. The motivation of GoogLeNet was to improve the efficiency of CNN parameters, as well as to enhance the learning capacity. In addition, it regulates the computation by inserting a $1\times 1$ convolutional filter, as a bottleneck layer, ahead of the large-size kernels. GoogLeNet employed sparse connections to overcome the redundant information problem: only some of the input channels are connected to some of the output channels, which decreases cost by neglecting the irrelevant channels. By employing a GAP layer as the end layer, rather than a FC layer, the density of connections was decreased. Together, these design choices reduced the number of parameters significantly, from 40 million to 5 million. Additional regularization factors included the use of RMSProp as the optimizer and batch normalization [ 104 ]. Furthermore, GoogLeNet proposed the idea of auxiliary learners to speed up the rate of convergence. The main shortcoming of GoogLeNet was its heterogeneous topology, which requires customization from one module to another. Another shortcoming is the representation bottleneck, which substantially reduces the feature space in the following layer and in turn occasionally leads to the loss of valuable information.
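The saving obtained from the $1\times 1$ bottleneck can be made concrete with a small count of multiply-accumulate operations (MACs). The following is a minimal sketch only: the feature map size and channel counts are hypothetical values chosen for illustration, not figures taken from GoogLeNet, and the helper name `conv_macs` is an assumption.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a k x k convolution with stride 1 that keeps the h x w output size."""
    return h * w * c_in * c_out * k * k

# Hypothetical layer sizes, chosen only for illustration.
H, W = 28, 28
C_IN, C_OUT, C_REDUCED = 192, 32, 16

# A 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# A 1x1 bottleneck that first reduces the channels, followed by the 5x5 convolution.
bottleneck = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

print(f"direct 5x5:           {direct:,} MACs")
print(f"1x1 reduce, then 5x5: {bottleneck:,} MACs")
print(f"reduction factor:     {direct / bottleneck:.1f}x")
```

Under these assumed sizes, most of the cost of the large kernel is removed because it operates on 16 rather than 192 channels.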

The basic structure of Google Block

Highway network

Increasing the network depth enhances its performance, mainly for complicated tasks, but it also makes the network harder to train. With many layers, the back-propagated error gradients can become very small at the lower layers. In 2015, Srivastava et al. [ 105 ] suggested a novel CNN architecture, called Highway Network, to overcome this issue. This approach is based on the cross-connectivity concept. Unhindered information flow in the Highway Network is enabled by introducing two gating units inside each layer. The gate mechanism was inspired by LSTM-based RNNs [ 106 , 107 ]. Information is aggregated by merging the information of the $(l-k)^{\text{th}}$ layer with that of the subsequent $l^{\text{th}}$ layer, which produces a regularizing effect and makes gradient-based training of deeper networks much easier. This enables the training of networks with more than 100 layers, and even of networks as deep as 900 layers, with the SGD algorithm. A Highway Network with a depth of fifty layers showed a better rate of convergence than both thinner and deeper architectures [ 108 ]. By contrast, [ 69 ] empirically demonstrated that the performance of a plain network declines when more than ten hidden layers are inserted. It should be noted that even a Highway Network 900 layers in depth converges much more rapidly than the plain network.
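The gating idea can be illustrated with a single, element-wise highway layer. This is a simplified sketch rather than the implementation of [ 105 ]: it assumes the common formulation $y = T(x)\,H(x) + (1 - T(x))\,x$ , in which the carry gate is tied to the transform gate, and it uses per-unit scalar weights instead of full weight matrices. All names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """Element-wise highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    out = []
    for xi, wh, bh, wt, bt in zip(x, w_h, b_h, w_t, b_t):
        h = math.tanh(wh * xi + bh)          # transformed signal H(x)
        t = sigmoid(wt * xi + bt)            # transform gate T(x), in (0, 1)
        out.append(t * h + (1.0 - t) * xi)   # carry gate assumed to be 1 - T(x)
    return out

x = [0.5, -1.0, 2.0]
ones, zeros = [1.0] * 3, [0.0] * 3

# A strongly negative gate bias closes the transform gate: the input is carried through.
closed = highway_layer(x, ones, zeros, ones, [-20.0] * 3)
# A strongly positive gate bias opens it: the layer behaves like a plain tanh layer.
opened = highway_layer(x, ones, zeros, ones, [20.0] * 3)

print(closed)
print(opened)
```

With the gate closed, the layer passes its input on almost unchanged, which is what allows information and gradients to flow through very deep stacks.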

He et al. [ 37 ] developed ResNet (Residual Network), which was the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected previous networks. Several types of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common type was ResNet50, which comprised 49 convolutional layers plus a single FC layer. The overall number of network weights was 25.5 M, while the overall number of MACs was 3.9 G. The novel idea of ResNet is its use of the bypass pathway concept, which was employed in Highway Nets in 2015 to address the problem of training a deeper network. Figure 20 shows the fundamental ResNet block diagram: a conventional feedforward network plus a residual connection. The input to the residual block is the output of the preceding $(l-1)^{\text{th}}$ layer, $x_{l-1}$ . After executing different operations on $x_{l-1}$ (such as convolution using variable-size filters, or batch normalization followed by an activation function like ReLU), the output is $F(x_{l-1})$ . The final residual output is $x_{l}$ , which can be mathematically represented as in Eq. 18 : $x_{l} = F(x_{l-1}) + x_{l-1}$ .

A residual network comprises numerous basic residual blocks. The operations within the residual block vary depending on the type of residual network architecture [ 37 ].
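The effect of the residual connection on gradient flow can be sketched with a short derivation. It takes the residual relation $x_{l} = F(x_{l-1}) + x_{l-1}$ and assumes a pure identity shortcut with no activation applied after the addition; the symbols $L$ (a deeper layer) and $E$ (the loss) are introduced here only for illustration.

```latex
% Residual relation, assuming an identity shortcut
x_{l} = x_{l-1} + F(x_{l-1})

% Unrolling the recursion from layer l up to a deeper layer L
x_{L} = x_{l} + \sum_{i=l}^{L-1} F(x_{i})

% Gradient of the loss E with respect to x_l
\frac{\partial E}{\partial x_{l}}
  = \frac{\partial E}{\partial x_{L}}
    \left( 1 + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} F(x_{i}) \right)
```

Under these assumptions, the constant term 1 means that part of the gradient reaches layer $l$ directly, without passing through any weight layer, so it is unlikely to vanish even in very deep networks.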

The block diagram for ResNet

In comparison to the highway network, ResNet introduced shortcut connections within layers to enable cross-layer connectivity; these shortcuts are parameter-free and data-independent. Note that in the highway network, the layers represent non-residual functions when a gated shortcut is closed. By contrast, the identity shortcuts in ResNet are never closed, so the residual information is always passed through. Furthermore, ResNet helps to prevent the vanishing gradient problem, and the shortcut connections (residual links) accelerate the convergence of deep networks. ResNet was the winner of the 2015-ILSVRC championship with 152 layers of depth; this represents 8 times the depth of VGG and 20 times the depth of AlexNet. In comparison with VGG, it has lower computational complexity, even with enlarged depth.

Inception: ResNet and Inception-V3/4

Szegedy et al. [ 103 , 109 , 110 ] proposed Inception-ResNet and Inception-V3/4 as upgraded types of Inception-V1/2. The concept behind Inception-V3 was to minimize the computational cost without affecting the generalization of the deeper network. Thus, Szegedy et al. used asymmetric small-size filters ( $1\times 5$ and $1\times 7$ ) rather than large-size filters ( $ 7\times 7$ and $5\times 5$ ); moreover, they utilized a bottleneck of $1\times 1$ convolution prior to the large-size filters [ 110 ]. These changes make the operation of the traditional convolution very similar to cross-channel correlation. Previously, Lin et al. exploited the potential of the $1\times 1$ filter in the NIN architecture [ 94 ]. Subsequently, [ 110 ] utilized the same idea in an intelligent manner. By using the $1\times 1$ convolutional operation in Inception-V3, the input data are mapped into three or four isolated spaces, which are smaller than the initial input space. Next, all of the correlations are mapped in these smaller spaces through common $5\times 5$ or $3\times 3$ convolutions. By contrast, in Inception-ResNet, Szegedy et al. bring together the inception block and the power of residual learning by replacing the filter concatenation with the residual connection [ 111 ]. Szegedy et al. empirically demonstrated that Inception-ResNet (Inception-V4 with residual connections) can achieve a generalization power similar to that of Inception-V4 with enlarged width and depth and without residual connections. This clearly shows that using residual connections significantly accelerates the training of the Inception network. Figure 21 shows the basic block diagram for the Inception residual unit.
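The benefit of asymmetric filters can be illustrated by counting weights. The sketch below assumes that an $n\times n$ convolution is replaced by a $1\times n$ convolution followed by an $n\times 1$ convolution with the same number of channels throughout; the channel count and the function name are hypothetical, and biases are ignored.

```python
def factorization_saving(n, c_in, c_out):
    """Weights of an n x n convolution versus a 1 x n followed by an n x 1 convolution."""
    full = n * n * c_in * c_out
    # Assumption: the intermediate feature map keeps c_out channels.
    factored = 1 * n * c_in * c_out + n * 1 * c_out * c_out
    return full, factored

C = 64  # hypothetical channel count
for n in (3, 5, 7):
    full, factored = factorization_saving(n, C, C)
    print(f"{n}x{n}: {full:,} weights, factorized: {factored:,} "
          f"({full / factored:.1f}x fewer)")
```

When the input and output channel counts are equal, the saving grows as $n/2$ , which is why the factorization pays off most for the larger kernels.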

The basic block diagram for Inception Residual unit

To solve the problem of the vanishing gradient, DenseNet was presented, following the same direction as ResNet and the Highway network [ 105 , 111 , 112 ]. One of the drawbacks of ResNet is that it explicitly preserves information through additive identity transformations, even though many layers contribute very little or no information. In addition, ResNet has a large number of weights, since each layer has its own separate set of weights. DenseNet employed cross-layer connectivity in an improved manner to address this problem [ 112 , 113 , 114 ]. It connects each layer to every subsequent layer in a feed-forward fashion, so the feature maps of all preceding layers are used as inputs to all of the following layers. A traditional CNN with $l$ layers has $l$ connections, one between each layer and the next, while DenseNet has $\frac{l(l+1)}{2}$ direct connections. DenseNet demonstrates the influence of cross-layer depth-wise convolutions. Because DenseNet concatenates the features of the preceding layers rather than adding them, the network gains the ability to discriminate clearly between the added and the preserved information. However, although its layers are narrow, DenseNet becomes parametrically expensive as the number of feature maps increases. The direct access of all layers to the gradients from the loss function enhances the information flow across the network. In addition, this has a regularizing effect, which reduces overfitting on tasks with small training sets. Figure 22 shows the architecture of the DenseNet network.
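The connection count and the growth in input channels caused by concatenation can be checked with a few lines of code. This is an illustrative sketch: the growth rate (the number of feature maps each layer adds) and the initial channel count are assumed values, and the function names are hypothetical.

```python
def dense_connections(num_layers):
    """Direct connections in a densely connected block of l layers: l(l+1)/2."""
    return num_layers * (num_layers + 1) // 2

def input_channels(layer_index, c0, growth_rate):
    """Channels seen by layer `layer_index` (1-based) when the block input and
    the outputs of all earlier layers are concatenated."""
    return c0 + growth_rate * (layer_index - 1)

L = 5  # hypothetical number of layers in one dense block
print("traditional CNN connections:", L)
print("DenseNet connections:       ", dense_connections(L))

# Assumed settings: 64 input channels, each layer adds 32 feature maps.
for i in range(1, L + 1):
    print(f"layer {i} receives {input_channels(i, 64, 32)} channels")
```

The linear growth in input channels is the price of concatenation: each layer is narrow, but later layers must process the accumulated feature maps of all earlier ones.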

(adapted from [ 112 ])

The architecture of DenseNet Network

ResNext is an enhanced version of the Inception Network [ 115 ]. It is also known as the Aggregated Residual Transform Network. It introduced cardinality, a new term presented by [ 115 ], and utilized the split, transform, and merge topology in an easy and effective way. Cardinality denotes the size of the transformation set and acts as an extra dimension [ 116 , 117 , 118 ]. The Inception network manages network resources efficiently and enhances the learning ability of the conventional CNN; however, different spatial embeddings (employing e.g. $5\times 5$ , $3\times 3$ , and $1\times 1$ filters) are used in its transformation branches, so each layer must be customized separately. By contrast, ResNext derives its characteristic features from ResNet, VGG, and Inception. It combined the deep homogeneous topology of VGG with the basic architecture of GoogLeNet by fixing the spatial resolution to $3\times 3$ filters inside the split, transform, and merge blocks. Figure 23 shows the ResNext building blocks. ResNext utilized multiple transformations inside the split, transform, and merge blocks and defined these transformations in terms of cardinality. The performance is significantly improved by increasing the cardinality, as Xie et al. showed. The complexity of ResNext was regulated by employing $1\times 1$ filters (low embeddings) ahead of a $3\times 3$ convolution. In addition, skip connections are used for optimized training [ 115 ].
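How cardinality trades off against the width of each branch can be sketched by counting the weights of one block. The settings below (256 input and output channels, a bottleneck width of 64, and 32 branches of width 4) are assumed for illustration; biases and normalization parameters are ignored, and the function names are hypothetical.

```python
def bottleneck_weights(c_in, width, c_out):
    """1x1 reduce, 3x3 transform, 1x1 expand, as in a residual bottleneck block."""
    return c_in * width + 3 * 3 * width * width + width * c_out

def aggregated_weights(c_in, cardinality, branch_width, c_out):
    """`cardinality` parallel branches, each a narrow 1x1 -> 3x3 -> 1x1 bottleneck."""
    return cardinality * bottleneck_weights(c_in, branch_width, c_out)

single_path = bottleneck_weights(256, 64, 256)
multi_path = aggregated_weights(256, 32, 4, 256)

print(f"single wide path (width 64):    {single_path:,} weights")
print(f"32 narrow paths (width 4 each): {multi_path:,} weights")
```

Under these assumptions the two blocks have nearly the same number of weights, so cardinality can be increased at roughly constant complexity.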

The basic block diagram for the ResNext building blocks

The feature reuse problem is the core shortcoming of deep residual networks, since certain feature blocks or transformations contribute very little to learning. Zagoruyko and Komodakis [ 119 ] accordingly proposed WideResNet to address this problem. These authors argued that the residual units convey the core learning ability of deep residual networks, while depth has only a supplemental influence. WideResNet exploited the power of the residual block by making ResNet wider instead of deeper [ 37 ]. It enlarged the width by introducing an extra factor, k, which controls the network width. In other words, it indicated that layer widening is a more effective method of performance enhancement than deepening the residual network. While deep residual networks achieve enhanced representational capacity, they also have certain drawbacks, such as the exploding and vanishing gradient problems, the feature reuse problem (inactivation of several feature maps), and time-intensive training. He et al. [ 37 ] tackled the feature reuse problem by including a dropout in each residual block to regularize the network in an efficient manner. In a similar manner, utilizing dropouts, Huang et al. [ 120 ] presented the stochastic depth concept to solve the slow learning and gradient vanishing problems. Earlier research was focused on increasing the depth; thus, any small enhancement in performance required the addition of several new layers. An experimental study showed that WideResNet has twice the number of parameters of ResNet; nevertheless, WideResNet can be trained more effectively than deep networks [ 119 ]. Note that most architectures prior to residual networks (including the highly effective VGG and Inception) were wider than ResNet, an observation that motivated the development of wider residual networks. In WideResNet, however, inserting a dropout between the convolutional layers (as opposed to within the residual block) made the learning more effective [ 121 , 122 ].

Pyramidal Net

The depth of the feature map increases in successive layers due to the deep stacking of multiple convolutional layers, as seen in previous deep CNN architectures such as ResNet, VGG, and AlexNet. By contrast, the spatial dimension shrinks, since a sub-sampling follows each convolutional layer. Thus, the richer feature representation is offset by the decreasing size of the feature map. The extreme expansion in the depth of the feature map, alongside the spatial information loss, interferes with the learning ability of deep CNNs. ResNet obtained notable results on image classification. However, deleting a convolutional block in which both the number of channels and the spatial dimension vary (channel depth enlarges, while spatial dimension reduces) commonly results in decreased classifier performance. Accordingly, the stochastic ResNet enhanced the performance by decreasing the information loss accompanying the residual unit drop. Han et al. [ 123 ] proposed Pyramidal Net to address the ResNet learning interference problem. To counter the abrupt depth enlargement and sharp reduction in spatial width in ResNet, Pyramidal Net gradually enlarges the width of the residual units across all positions, rather than keeping the same dimension in all residual blocks until down-sampling occurs. It was referred to as Pyramidal Net because of the gradual, top-down enlargement in the feature map depth. The factor $d_{l}$ , which is determined by Eq. 19 , regulates the depth of the feature map.

Here, the dimension of the $l^{\text{th}}$ residual unit is indicated by $d_{l}$ ; moreover, n indicates the overall number of residual units, the step factor is indicated by $\lambda $ , and the depth increase is regulated by the factor $\frac{\lambda }{n}$ , which distributes the increase uniformly across the dimension of the feature map. Zero-padded identity mapping is used to insert the residual connections among the layers. In comparison to the projection-based shortcut connections, zero-padded identity mapping requires fewer parameters, which in turn leads to enhanced generalization [ 124 ]. Multiplication- and addition-based widening are two different approaches used in Pyramidal Nets for network widening. More specifically, the first approach (multiplication) enlarges the width geometrically, while the second one (addition) enlarges it linearly [ 92 ]. The main problem associated with width enlargement is that the required time and space grow quadratically.
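The two widening schemes can be compared numerically. The sketch below is based only on the definitions above: it assumes that additive widening increases the unrounded width by $\frac{\lambda}{n}$ per residual unit and that multiplicative widening applies a constant ratio chosen to reach the same final width. The initial width of 16, the values of $\lambda$ and n, and the function names are all hypothetical.

```python
import math

def additive_widths(d1, lam, n):
    """Linear widening: the width grows by lam / n per residual unit."""
    return [math.floor(d1 + lam * l / n) for l in range(n + 1)]

def multiplicative_widths(d1, lam, n):
    """Geometric widening with a constant ratio per unit, assumed here to end
    at the same final width d1 + lam as the additive scheme."""
    ratio = (d1 + lam) / d1
    return [math.floor(d1 * ratio ** (l / n)) for l in range(n + 1)]

# Hypothetical settings: initial width 16, step factor 48, six residual units.
add = additive_widths(16, 48, 6)
mul = multiplicative_widths(16, 48, 6)

print("additive:      ", add)
print("multiplicative:", mul)
```

With the same start and end widths, the geometric scheme keeps the early units narrower and concentrates the widening in the later units, whereas the additive scheme spreads it evenly.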

Extreme inception architecture is the main characteristic of Xception. The main idea behind Xception is its depthwise separable convolution [ 125 ]. The Xception model adjusted the original inception block by making it wider and replacing the different spatial filter sizes with a single size ( $3 \times 3$ ) followed by a $1 \times 1$ convolution to reduce computational complexity. Figure 24 shows the Xception block architecture. The Xception network becomes more computationally efficient by decoupling the channel and spatial correspondence. It first maps the convolved output to a low-dimensional embedding by applying $1 \times 1$ convolutions. It then performs k spatial transformations. Note that k here represents the width-defining cardinality, which is given by the number of transformations in Xception. The computations are made simpler in Xception by convolving each channel separately along the spatial axes. The results are subsequently passed to $1 \times 1$ convolutions (pointwise convolution), which perform the cross-channel correspondence. The $1 \times 1$ convolution is utilized in Xception to regulate the depth of the channel. Xception utilizes a number of transformation segments equal to the number of channels, whereas Inception utilizes three transformation segments and the traditional CNN architecture utilizes only a single transformation segment. The suggested Xception transformation approach achieves extra learning efficiency and better performance, but it does not reduce the number of parameters [ 126 , 127 ].
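The decoupling of channel and spatial correspondence can be illustrated for a single layer by counting weights. This sketch follows the order described above (a $1\times 1$ pointwise convolution, then one spatial filter per channel); the channel counts are hypothetical, biases are ignored, and the function names are assumptions. It concerns one isolated layer only and says nothing about the parameter count of the complete Xception network.

```python
def standard_conv_weights(k, c_in, c_out):
    """A k x k convolution that mixes space and channels jointly."""
    return k * k * c_in * c_out

def separable_conv_weights(k, c_in, c_out):
    """A 1x1 pointwise convolution (cross-channel mixing) followed by one
    k x k filter per output channel (spatial mixing only)."""
    pointwise = c_in * c_out
    depthwise = k * k * c_out
    return pointwise + depthwise

# Hypothetical layer: 128 input channels, 256 output channels, 3x3 kernels.
standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)

print(f"standard 3x3 convolution: {standard:,} weights")
print(f"separable convolution:    {separable:,} weights")
```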

The basic block diagram for the Xception block architecture

Residual attention neural network

To improve the network feature representation, Wang et al. [ 128 ] proposed the Residual Attention Network (RAN). Enabling the network to learn object-aware features is the main purpose of incorporating attention into the CNN. The RAN is a feed-forward CNN that consists of stacked residual blocks in addition to the attention module. The attention module is divided into two branches, namely the mask branch and the trunk branch. These branches adopt a top-down and a bottom-up learning strategy, respectively. Encapsulating the two strategies in the attention module supports both top-down attention feedback and fast feed-forward processing within a single feed-forward process. More specifically, the top-down architecture generates dense features to make inferences about every aspect, while the bottom-up feed-forward architecture generates low-resolution feature maps with robust semantic information. A top-down bottom-up strategy was previously employed in restricted Boltzmann machines [ 129 ]. During the training reconstruction phase, Goh et al. [ 130 ] used the mechanism of top-down attention in deep Boltzmann machines (DBMs) as a regularizing factor. Note that the network can be globally optimized using a top-down learning strategy in a similar manner, in which the maps are progressively output to the input throughout the learning process [ 129 , 130 , 131 , 132 ].

The transformation network presented in a previous study [ 133 ] incorporated the attention concept into convolutional blocks in a simple way. Unfortunately, its attention modules are inflexible, which is the main problem, and they cannot adapt to varying surroundings. By contrast, stacking multi-attention modules has made RAN very effective at recognizing noisy, complex, and cluttered images. RAN’s hierarchical organization gives it the capability to adaptively allocate a weight for every feature map depending on its importance within the layers. Furthermore, incorporating three distinct levels of attention (spatial, channel, and mixed) enables the model to capture the object-aware features at these distinct levels.

Convolutional block attention module

The importance of feature map utilization and the attention mechanism has been confirmed by SE-Network and RAN [ 128 , 134 , 135 ]. The convolutional block attention module (CBAM), a novel attention-based CNN component, was first developed by Woo et al. [ 136 ]. This module is similar to SE-Network and simple in design. SE-Network disregards the object’s spatial locality in the image and considers only the channels’ contribution during image classification. For object detection, however, the spatial location of the object plays a significant role. CBAM infers the attention maps sequentially: it applies channel attention before spatial attention to obtain the refined feature maps. Spatial attention is computed using a 1 × 1 convolution and pooling functions, as in the literature. An effective feature descriptor can be generated by pooling the features along the spatial axis. In addition, CBAM generates a robust spatial attention map by concatenating the max pooling and average pooling operations. In a similar manner, a combination of GAP and max pooling operations is used to model the feature map statistics. Woo et al. [ 136 ] demonstrated that utilizing GAP alone returns a sub-optimal inference of channel attention, whereas max pooling provides an indication of the distinguishing object features. Thus, the combined use of max pooling and average pooling enhances the network’s representational power. The refined feature maps improve the representational power and facilitate a focus on the significant portion of the chosen features. Expressing the 3D attention map through a serial learning procedure helps to decrease the computational cost and the number of parameters, as Woo et al. [ 136 ] experimentally proved. Note that CBAM can be easily integrated with any CNN architecture.
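The sequential channel-then-spatial refinement can be sketched on a tiny feature map. This is a simplified illustration and not the implementation of Woo et al. [ 136 ]: it assumes a shared two-layer perceptron for the channel gate and, following the description above, a $1\times 1$ convolution (two scalar weights and a bias) for the spatial gate; a larger spatial kernel could be used instead. All weights and names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(fmap, w1, w2):
    """One gate per channel from average- and max-pooled descriptors,
    both passed through the same two-layer perceptron (w1: R x C, w2: C x R)."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]

    def mlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]

def spatial_attention(fmap, w_avg, w_max, bias):
    """One gate per pixel from the channel-wise average and maximum."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    gate = []
    for i in range(height):
        row = []
        for j in range(width):
            vals = [fmap[c][i][j] for c in range(channels)]
            row.append(sigmoid(w_avg * sum(vals) / channels + w_max * max(vals) + bias))
        gate.append(row)
    return gate

def cbam(fmap, w1, w2, w_avg, w_max, bias):
    """Channel attention first, then spatial attention on the refined map."""
    ch_gate = channel_attention(fmap, w1, w2)
    refined = [[[g * v for v in row] for row in ch] for g, ch in zip(ch_gate, fmap)]
    sp_gate = spatial_attention(refined, w_avg, w_max, bias)
    return [[[sp_gate[i][j] * ch[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in refined]

# Hypothetical 2-channel, 2x2 feature map and weights.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.5], [0.0, 2.0]]]
w1 = [[0.5, -0.5]]
w2 = [[1.0], [-1.0]]
out = cbam(fmap, w1, w2, w_avg=1.0, w_max=1.0, bias=0.0)
print(out)
```

Because both gates lie between 0 and 1, the module can only rescale the existing features, which is consistent with the claim that it adds little computation and can be attached to an arbitrary CNN block.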

Concurrent spatial and channel excitation mechanism

To make the work valid for segmentation tasks, Roy et al. [ 137 , 138 ] extended the work of Hu et al. [ 134 ] by adding the influence of spatial information to the channel information. Roy et al. [ 137 , 138 ] presented three types of modules: (1) concurrent spatial and channel squeeze and excitation (scSE); (2) squeezing channel-wise and exciting spatially (sSE); (3) squeezing spatially and exciting channel-wise (cSE). For segmentation purposes, they employed auto-encoder-based CNNs. In addition, they suggested inserting the modules after the encoder and decoder layers. To specifically highlight the object-specific feature maps, they further allocated attention to every channel by deriving a scaling factor from both the channel and spatial information in the first module (scSE). In the second module (sSE), the feature map information has lower importance than the spatial locality, as the spatial information plays a significant role during the segmentation process. Therefore, several channel collections are spatially divided and developed so that they can be employed in segmentation. In the final module (cSE), a concept similar to the SE-block is used. Furthermore, the scaling factor is derived based on the contribution of the feature maps to object detection [ 137 , 138 ].

CNN is an efficient technique for detecting object features and achieves strong recognition performance in comparison with handcrafted feature detectors. However, CNNs have a number of restrictions: they do not consider certain relations among features, such as orientation, size, and perspective. For instance, when considering a face image, the CNN does not take into account the positions of the various face components (such as mouth, eyes, nose, etc.); its neurons may be activated and recognize a face incorrectly, without taking specific relations (such as size, orientation, etc.) into account. Now consider a neuron that outputs a probability in addition to feature properties such as size, orientation, perspective, etc. A neuron/capsule of this type can effectively detect the face along with these different types of information. Thus, the capsule network is constructed from many layers of capsule nodes. An encoding unit, which contains three layers of capsule nodes, forms the CapsuleNet or CapsNet (the initial version of the capsule networks).

For example, in the MNIST architecture, the input consists of $28\times 28$ images, to which 256 filters of size $9\times 9$ are applied with stride 1. The output size is therefore $28-9+1=20$ , with 256 feature maps. Next, these outputs are fed to the first capsule layer, which produces an 8D vector rather than a scalar; in fact, this is a modified convolution layer. Note that a stride of 2 with $9\times 9$ filters is employed in this layer. Thus, the dimension of the output is $\lfloor (20-9)/2 \rfloor +1=6$ . The initial capsules employ $8\times 32$ filters, which generate 32 × 8 × 6 × 6 outputs (32 for groups, 8 for neurons, while 6 × 6 is the neuron size).
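The layer sizes quoted above can be reproduced with the standard output-size relation for a convolution without padding. The following is a small sketch; the function name and the padding argument are assumptions added for generality.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

conv1 = conv_output_size(28, 9, stride=1)            # first convolution on 28 x 28 input
primary = conv_output_size(conv1, 9, stride=2)       # first capsule layer
num_capsules = 32 * primary * primary                # 32 groups of 8D capsules
num_values = num_capsules * 8                        # 32 x 8 x 6 x 6 scalar outputs

print(conv1, primary, num_capsules, num_values)
```

The integer division reflects that $(20-9)/2$ is not a whole number, so the last partial window is discarded.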

Figure 25 represents the complete CapsNet encoding and decoding processes. In the CNN context, a max-pooling layer is frequently employed to handle translation changes: it can still detect a feature that moves, provided that the feature remains within the max-pooling window. By contrast, the capsule approach has the ability to detect overlapped features, which is highly significant in detection and segmentation operations, since the capsule involves the weighted sum of features from the preceding layer.

The complete CapsNet encoding and decoding processes

In conventional CNNs, a particular cost function is employed to evaluate the global error, which is propagated backward throughout the training process. In such cases, however, the activation of a neuron will not propagate further once the weight between two neurons becomes zero. In iterative dynamic routing with agreement, by contrast, the signal is directed based on the feature parameters rather than by a single quantity derived from the complete cost function. Sabour et al. [ 139 ] provide more details about this architecture. When using MNIST to recognize handwritten digits, this innovative CNN architecture gives superior accuracy. From the application perspective, this architecture is more suitable for segmentation and detection approaches than for classification approaches [ 140 , 141 , 142 ].
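Routing by agreement can be sketched in a few lines. The code below is a simplified illustration in the spirit of Sabour et al. [ 139 ], not their implementation: it assumes that the prediction vectors from the lower-level capsules are already given, uses a squashing non-linearity to keep output lengths below 1, and runs a fixed number of routing iterations. All names and example values are hypothetical.

```python
import math

def squash(s):
    """Shrink vector s so that its length lies in [0, 1), keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of input capsule i for output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]        # routing logits
    for _ in range(iterations):
        c = [softmax(row) for row in b]             # coupling coefficients
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product between prediction and current output.
                b[i][j] += sum(u * w for u, w in zip(u_hat[i][j], v[j]))
    return v, c

# Three input capsules agree on output 0 but disagree on output 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.1]],
]
v, c = dynamic_routing(u_hat, iterations=3)
print("output capsules:", v)
print("couplings:      ", c)
```

In this example the coupling coefficients shift toward the output capsule on which the predictions agree, so the signal is directed by the consistency of the features rather than by the loss alone.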

High-resolution network (HRNet)

High-resolution representations are necessary for position-sensitive vision tasks, such as semantic segmentation, object detection, and human pose estimation. In current state-of-the-art frameworks, the input image is encoded as a low-resolution representation by a subnetwork constructed as a connected series of high-to-low resolution convolutions, such as VGGNet and ResNet. The low-resolution representation is then recovered to become a high-resolution one. Alternatively, high-resolution representations are maintained during the entire process using a novel network, referred to as a High-Resolution Network (HRNet) [ 143 , 144 ]. This network has two principal features. First, the high-to-low resolution convolution series are connected in parallel. Second, the information across the resolutions is repeatedly exchanged. The resulting representation is more accurate in the spatial domain and richer in the semantic domain. Moreover, HRNet has several applications in the fields of object detection, semantic segmentation, and human pose prediction. For computer vision problems, HRNet represents a more robust backbone. Figure 26 illustrates the general architecture of HRNet.

The general architecture of HRNet

Challenges (limitations) of deep learning and alternate solutions

When employing DL, several difficulties must often be taken into consideration. The most challenging of these are listed next, together with several possible alternative solutions.

Training data

DL is extremely data-hungry, given that it also involves representation learning [ 145 , 146 ]. DL demands an extensively large amount of data to achieve a well-performing model; i.e., as the amount of data increases, the model performance improves (Fig. 27 ). In most cases, the available data are sufficient to obtain a good performance model. However, sometimes there is a shortage of data for using DL directly [ 87 ]. To properly address this issue, three suggested methods are available. The first involves employing the transfer-learning concept after data are collected from similar tasks. Note that while the transferred data will not directly augment the actual data, it will help to enhance both the original input representation of the data and its mapping function [ 147 ]. In this way, the model performance is boosted. Another technique involves employing a well-trained model from a similar task and fine-tuning the last two layers, or even only the last layer, on the limited original data. Refer to [ 148 , 149 ] for a review of different transfer-learning techniques applied in the DL approach. In the second method, data augmentation is performed [ 150 ]. This is very helpful for augmenting image data, since image translation, mirroring, and rotation commonly do not change the image label. However, it is important to take care when applying this technique in some cases, such as with bioinformatics data. For instance, when mirroring an enzyme sequence, the output data may not represent the actual enzyme sequence. In the third method, simulated data can be considered for increasing the volume of the training set. It is occasionally possible to create simulators based on the physical process if the issue is well understood, in which case as much data as needed can be simulated. An example of handling the data requirement for DL through simulation is given in Ref. [ 151 ].

The performance of DL regarding the amount of data

Transfer learning

Recent research has revealed a widespread use of deep CNNs, which offer ground-breaking support for answering many classification problems. Generally speaking, deep CNN models require a sizable volume of data to obtain good performance. The common challenge associated with using such models concerns the lack of training data. Indeed, gathering a large volume of data is an exhausting job, and no fully successful solution is available at this time. The undersized dataset problem is therefore currently addressed using the TL technique [ 148 , 149 ], which is highly efficient in addressing the lack of training data. The mechanism of TL involves first training the CNN model on large volumes of data. In the next step, the model is fine-tuned on the small target dataset.

The student-teacher relationship is a suitable analogy for clarifying TL. Gathering detailed knowledge of the subject is the first step [ 152 ]. Next, the teacher provides a “course” by conveying the information within a “lecture series” over time. Put simply, the expert (teacher) transfers the knowledge (information) to the learner (student). Similarly, the DL network is trained using a vast volume of data, and it learns the biases and the weights during the training process. These weights are then transferred to different networks for retraining or testing a similar novel model. Thus, the novel model can start from pre-trained weights rather than requiring training from scratch. Figure 28 illustrates the conceptual diagram of the TL technique.

Pre-trained models: Many CNN models, e.g. AlexNet [ 30 ], GoogLeNet [ 103 ], and ResNet [ 37 ], have been trained on large datasets such as ImageNet for image recognition purposes. These models can then be employed for a different recognition task without the need to train from scratch. Furthermore, the weights remain the same apart from a few learned features. In cases where data samples are lacking, these models are very useful. There are many reasons for employing a pre-trained model. First, training large models on sizeable datasets requires expensive computational power. Second, training large models can be time-consuming, taking up to multiple weeks. Finally, a pre-trained model can assist with network generalization and speed up the convergence.
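The idea of reusing pre-trained weights and retraining only the final layer can be illustrated with a toy model. This is a deliberately small sketch, not a realistic CNN: the "pre-trained" feature extractor is a fixed set of ReLU units with made-up weights, the target dataset has four samples, and only the output head is updated by gradient descent. All names and values are hypothetical.

```python
def features(x, frozen_w):
    """Frozen 'pre-trained' feature extractor: one ReLU unit per weight row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in frozen_w]

def predict(x, frozen_w, head_w):
    return sum(h * f for h, f in zip(head_w, features(x, frozen_w)))

def mse(data, frozen_w, head_w):
    return sum((predict(x, frozen_w, head_w) - y) ** 2 for x, y in data) / len(data)

def fine_tune_head(data, frozen_w, head_w, lr=0.2, epochs=200):
    """Gradient descent on the head weights only; frozen_w is never modified."""
    head_w = list(head_w)
    for _ in range(epochs):
        grad = [0.0] * len(head_w)
        for x, y in data:
            f = features(x, frozen_w)
            err = sum(h * fi for h, fi in zip(head_w, f)) - y
            for k, fi in enumerate(f):
                grad[k] += 2.0 * err * fi / len(data)
        head_w = [h - lr * g for h, g in zip(head_w, grad)]
    return head_w

# Made-up 'pre-trained' weights and a small target dataset (y = 2*x0 + 3*x1).
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
frozen_copy = [row[:] for row in frozen_w]
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0), ([2.0, 1.0], 7.0)]

head0 = [0.0, 0.0]
head = fine_tune_head(data, frozen_w, head0)
print("loss before:", mse(data, frozen_w, head0))
print("loss after: ", mse(data, frozen_w, head))
```

Only two head weights are learned here, which mirrors why fine-tuning needs far less data and computation than training every layer from scratch.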

A research problem using pre-trained models: Training a DL approach requires a massive number of images. Thus, obtaining good performance is a challenge under these circumstances. Achieving excellent outcomes in image classification or recognition applications, with performance occasionally superior to that of a human, becomes possible through the use of deep convolutional neural networks (DCNNs) including several layers if a huge amount of data is available [ 37 , 148 , 153 ]. However, avoiding overfitting problems in such applications requires sizable datasets and properly generalizing DCNN models. When training a DCNN model, the dataset size has no lower limit. However, the accuracy of the model becomes insufficient in the case of the utilized model has fewer layers, or if a small dataset is used for training due to over- or under-fitting problems. Due to they have no ability to utilize the hierarchical features of sizable datasets, models with fewer layers have poor accuracy. It is difficult to acquire sufficient training data for DL models. For example, in medical imaging and environmental science, gathering labelled datasets is very costly [ 148 ]. Moreover, the majority of the crowdsourcing workers are unable to make accurate notes on medical or biological images due to their lack of medical or biological knowledge. Thus, ML researchers often rely on field experts to label such images; however, this process is costly and time consuming. Therefore, producing the large volume of labels required to develop flourishing deep networks turns out to be unfeasible. Recently, TL has been widely employed to address the later issue. Nevertheless, although TL enhances the accuracy of several tasks in the fields of pattern recognition and computer vision [ 154 , 155 ], there is an essential issue related to the source data type used by the TL as compared to the target dataset. 
For instance, CNN models for medical image classification are commonly pre-trained on the ImageNet dataset, which contains natural images [ 153 ]. However, such natural images are completely dissimilar from raw medical images, so the model performance is not necessarily enhanced. It has further been shown that TL from a different domain does not significantly affect performance on medical imaging tasks, as lightweight models trained from scratch perform nearly as well as standard ImageNet-transferred models [ 156 ]. Therefore, there exist scenarios in which using pre-trained models is not a viable solution. In 2020, some researchers utilized same-domain TL and achieved excellent results [ 86 , 87 , 88 , 157 ]. Same-domain TL is an approach that uses images similar to those of the target dataset for pre-training. For example, a model can be trained on X-ray images of different chest diseases and then fine-tuned on chest X-ray images for COVID-19 diagnosis. More details about same-domain TL and how to implement the fine-tuning process can be found in [ 87 ].

The conceptual diagram of the TL technique

Data augmentation techniques

If the goal is to increase the amount of available data and avoid the overfitting issue, data augmentation techniques are one possible solution [ 150 , 158 , 159 ]. These techniques are data-space solutions for any limited-data problem. Data augmentation incorporates a collection of methods that improve the attributes and size of training datasets. Thus, DL networks can perform better when these techniques are employed. Next, we list some data augmentation techniques.

Flipping: Flipping the image vertically is a less common practice than flipping it horizontally. Flipping has been verified as valuable on datasets like ImageNet and CIFAR-10. Moreover, it is very simple to implement. However, it is not a label-preserving transformation on datasets that involve text recognition (such as SVHN and MNIST).
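As a minimal sketch of flipping, assuming images are stored as NumPy arrays of shape (height, width, channels), both flips reduce to reversing one array axis. The function names below are illustrative and do not come from any particular library.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W, C) image left-to-right by reversing the width axis."""
    return image[:, ::-1, :]

def vertical_flip(image):
    """Turn an (H, W, C) image upside down by reversing the height axis."""
    return image[::-1, :, :]

# Tiny synthetic "image" so the effect is easy to inspect.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
mirrored = horizontal_flip(image)
upside_down = vertical_flip(image)
```

Applying the same flip twice returns the original image, which is a convenient sanity check when wiring such a transform into a training pipeline.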

Color space: Digital image data are commonly encoded as a tensor of dimensions ( $\text{height} \times \text{width} \times \text{color channels}$ ). Performing augmentations in the color space of the channels is an alternative technique that is very practical to implement. A very easy color augmentation involves isolating a channel of a particular color, such as Red, Green, or Blue. An image can be rapidly converted to a single-color-channel representation by isolating that channel's matrix and filling the remaining two color channels with zero matrices. Furthermore, increasing or decreasing the image brightness is achieved by using straightforward matrix operations to manipulate the RGB values. By deriving a color histogram that describes the image, more advanced color augmentations can be obtained. Lighting alterations are also made possible by adjusting the intensity values in histograms, similar to the adjustments available in photo-editing applications.
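The two simplest color-space augmentations described above can be sketched as follows, assuming 8-bit RGB images stored as NumPy arrays of shape (height, width, 3). The helper names and the brightness offset are illustrative choices, not part of the reviewed works.

```python
import numpy as np

def isolate_channel(image, channel):
    """Keep one color channel of an (H, W, 3) image and zero the other two."""
    out = np.zeros_like(image)
    out[..., channel] = image[..., channel]
    return out

def adjust_brightness(image, delta):
    """Add `delta` to every RGB value, clipping to the valid 8-bit range."""
    shifted = image.astype(np.int16) + delta  # widen to avoid uint8 wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
red_only = isolate_channel(image, 0)   # channel 0 is assumed to be Red
brighter = adjust_brightness(image, 40)
```

The cast to a wider integer type before adding matters: adding directly to a `uint8` array would wrap around at 255 instead of saturating.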

Cropping: Cropping a central patch of each image is a practical processing step for image data with mixed height and width dimensions. Furthermore, random cropping may be employed to produce an effect similar to translations. The difference between translations and random cropping is that translations preserve the spatial dimensions of the image, while random cropping reduces the input size [for example from (256, 256) to (224, 224)]. Depending on the reduction threshold selected for cropping, the transformation may not be label-preserving.
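A minimal sketch of central and random cropping, assuming NumPy arrays whose first two axes are height and width; the function names are hypothetical. The example reproduces the (256, 256) to (224, 224) reduction mentioned above.

```python
import numpy as np

def center_crop(image, out_h, out_w):
    """Return the central (out_h, out_w) patch of an image."""
    h, w = image.shape[:2]
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]

def random_crop(image, out_h, out_w, rng):
    """Return a randomly positioned (out_h, out_w) patch of an image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - out_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - out_w + 1)
    return image[top:top + out_h, left:left + out_w]

rng = np.random.default_rng(0)
image = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop(image, 224, 224, rng)
centered = center_crop(image, 224, 224)
```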

Rotation: Rotation augmentations are obtained by rotating an image left or right about an axis by an angle between 0 and 360 degrees. The suitability of rotation augmentations is largely determined by the rotation degree parameter. In digit recognition tasks, small rotations (from 0 to 20 degrees) are very helpful. By contrast, the data label may no longer be preserved post-transformation as the rotation degree increases.

Translation: To avoid positional bias within the image data, a very useful transformation is to shift the image up, down, left, or right. For instance, it is common for all the images in a dataset to be centered; in that case, the model would also have to be tested on entirely centered images. Note that when translating the initial images in a particular direction, the residual space should be filled with Gaussian or random noise, or a constant value such as 0 or 255. This padding preserves the spatial dimensions of the image post-augmentation.
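The shift-and-pad behavior can be sketched as below, assuming NumPy arrays and a constant fill value (noise filling would replace the `np.full_like` call). The function name and sign convention (positive `dy` moves content down, positive `dx` moves it right) are assumptions made for this illustration.

```python
import numpy as np

def translate(image, dy, dx, fill=0):
    """Shift an image by (dy, dx) pixels, padding the vacated area with `fill`."""
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the whole image has been shifted out of the frame
    # Source and destination windows that stay inside the frame.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

image = np.arange(16).reshape(4, 4)
shifted = translate(image, dy=1, dx=-1, fill=255)  # one pixel down, one pixel left
```

The output always has the same shape as the input, which is the property that distinguishes translation from random cropping.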

Noise injection: This approach involves injecting a matrix of arbitrary values. Such a matrix is commonly obtained from a Gaussian distribution. Moreno-Barea et al. [ 160 ] employed nine datasets to test the noise injection. These datasets were taken from the UCI repository [ 161 ]. Injecting noise within images enables the CNN to learn more robust features.
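A minimal sketch of Gaussian noise injection, assuming 8-bit images as NumPy arrays; the standard deviation `sigma` is a hypothetical hyperparameter that would need tuning per dataset.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to an 8-bit image and clip to [0, 255]."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((8, 8, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(image, sigma=10.0, rng=rng)
```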

Geometric transformations are very good solutions for positional biases present in the training data. Several potential sources of bias can separate the distribution of the testing data from that of the training data. For instance, when all faces are perfectly centered within the frames (as in facial recognition datasets), the problem of positional bias emerges. Thus, geometric translations are the best solution. Geometric transformations are helpful due to their simplicity of implementation, as well as their effectiveness in removing positional biases. Several image processing libraries are available, which makes it easy to begin with simple operations such as rotation or horizontal flipping. Additional training time, higher computational costs, and additional memory are some shortcomings of geometric transformations. Furthermore, a number of geometric transformations (such as arbitrary cropping or translation) should be manually inspected to ensure that they do not change the image label. Finally, the biases that separate the test data from the training data are often more complicated than translational and positional changes. Hence, deciding when and where geometric transformations are suitable is not trivial.

Imbalanced data

Commonly, biological data tend to be imbalanced, as negative samples are much more numerous than positive ones [ 162 , 163 , 164 ]. For example, compared to COVID-19-positive X-ray images, the volume of normal X-ray images is very large. It should be noted that undesirable results may be produced when training a DL model using imbalanced data. The following techniques are used to address this issue. First, it is necessary to employ the correct criteria for evaluating the loss, as well as the prediction result. With imbalanced data, the model should perform well on small classes as well as larger ones. Thus, the model should employ the area under the curve (AUC) as both the loss and the evaluation criterion [ 165 ]. Second, if the cross-entropy loss is still preferred, the weighted cross-entropy loss should be employed, which ensures that the model performs well on small classes. In addition, during model training, it is possible either to down-sample the large classes or to up-sample the small classes. Finally, to make the data balanced as in Ref. [ 166 ], it is possible to construct models for every hierarchical level, as a biological system frequently has a hierarchical label space. The effect of imbalanced data on the performance of DL models has been comprehensively investigated, and the most frequently used techniques for lessening the problem have also been compared. Nevertheless, note that these techniques are not specific to biological problems.
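The weighted cross-entropy idea can be sketched as follows. This is an illustration under stated assumptions rather than the formulation of any cited work: class weights are taken as inverse class frequencies (one common heuristic), every class is assumed to appear at least once, and the loss is normalized by the sum of the sample weights.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes)
    return len(labels) / (num_classes * counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class) over a batch of predictions."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    w = class_weights[labels]                       # per-sample weight
    return float(np.sum(-w * np.log(picked)) / np.sum(w))

# Nine negative samples and a single positive one.
labels = np.array([0] * 9 + [1])
weights = inverse_frequency_weights(labels, num_classes=2)
probs = np.full((10, 2), 0.5)  # an uninformative classifier
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights, an error on the single positive sample contributes as much to the loss as errors on all nine negative samples combined.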

Interpretability of data

Occasionally, DL techniques are regarded as black boxes. In fact, they are interpretable. A method for interpreting DL, which is used to obtain the valuable motifs and patterns recognized by the network, is needed in many fields, such as bioinformatics [ 167 ]. In the task of disease diagnosis, it is not only necessary to know the diagnosis or prediction results of a trained DL model, but also to increase confidence in the prediction outcomes by understanding the evidence on which the model bases its decisions [ 168 ]. To achieve this, it is possible to assign an importance score to every portion of a particular example. Within this solution, back-propagation-based techniques or perturbation-based approaches are used [ 169 ]. In the perturbation-based approaches, a portion of the input is changed and the effect of this change on the model output is observed [ 170 , 171 , 172 , 173 ]. This concept has high computational complexity, but it is simple to understand. On the other hand, in the back-propagation-based techniques, the signal from the output is propagated back to the input layer to determine the importance score of the various input portions. These techniques have been proven valuable in [ 174 ]. Model interpretability can have different meanings in different scenarios.
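The perturbation-based idea can be sketched as an occlusion test: each patch of the input is replaced with a baseline value and the resulting drop in the model output is recorded as that patch's importance. The function below is an illustrative sketch, not the method of any cited paper; `model_fn` stands for any callable returning a scalar score, and the toy model is hypothetical.

```python
import numpy as np

def occlusion_importance(model_fn, image, patch, baseline=0.0):
    """Score each patch by how much occluding it lowers the model output."""
    h, w = image.shape[:2]
    base_score = model_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            perturbed = image.copy()
            perturbed[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = baseline
            heat[i, j] = base_score - model_fn(perturbed)
    return heat

def toy_model(img):
    """Hypothetical model that only looks at the top-left 2x2 region."""
    return float(img[:2, :2].mean())

image = np.ones((4, 4))
heat = occlusion_importance(toy_model, image, patch=2)
```

One model evaluation is needed per patch, which illustrates why the text describes this family of methods as computationally expensive but simple to understand.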

Uncertainty scaling

Commonly, the final prediction label is not the only output required when employing DL techniques for prediction; a confidence score for every query to the model is also desired. The confidence score is defined as how confident the model is in its prediction [ 175 ]. Since the confidence score guards against trusting unreliable and misleading predictions, it is a significant attribute, regardless of the application scenario. In biology, the confidence score reduces the resources and time expended in verifying the outcomes of misleading predictions. Generally speaking, in healthcare or similar applications, uncertainty scaling is frequently very significant; it helps in evaluating automated clinical decisions and the reliability of machine learning-based disease diagnosis [ 176 , 177 ]. Because different DL models can produce overconfident predictions, the probability score (obtained directly from the softmax output of the DL model) is often not on the correct scale [ 178 ]. Note that the softmax output requires post-scaling to achieve a reliable probability score. For outputting the probability score on the correct scale, several techniques have been introduced, including Bayesian Binning into Quantiles (BBQ) [ 179 ], isotonic regression [ 180 ], histogram binning [ 181 ], and the well-known Platt scaling [ 182 ]. More specifically, for DL techniques, temperature scaling was recently introduced, which achieves superior performance compared to the other techniques.
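Temperature scaling can be sketched as dividing the logits by a single scalar T before the softmax, with T chosen to minimize the negative log-likelihood on held-out data. The grid search below is a simplification assumed for this illustration (gradient-based optimization is the usual choice), and the validation logits are made-up values.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def fit_temperature(logits, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    losses = [nll(logits, labels, t) for t in candidates]
    return float(candidates[int(np.argmin(losses))])

# Overconfident toy model: very large logits, yet one of four predictions is wrong.
val_logits = np.array([[10.0, 0.0], [10.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
val_labels = np.array([0, 0, 1, 1])
best_t = fit_temperature(val_logits, val_labels, np.linspace(0.5, 20.0, 40))
calibrated = softmax(val_logits, best_t)
```

Because all logits are divided by the same positive constant, the predicted class never changes; only the confidence attached to it is rescaled.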

Catastrophic forgetting

Catastrophic forgetting occurs when incorporating new information into a plain DL model interferes with previously learned information. For instance, consider a case where there are 1000 types of flowers and a model is trained to classify these flowers, after which a new type of flower is introduced; if the model is fine-tuned only with this new class, its performance on the older classes will degrade [ 183 , 184 ]. Data are continually collected and renewed, which is in fact a highly typical scenario in many fields, e.g. biology. To address this issue, there is a direct solution that involves employing old and new data to train an entirely new model from scratch. This solution is time-consuming and computationally intensive; furthermore, it leads to an unstable state for the learned representation of the initial data. At this time, three different types of ML techniques that avoid catastrophic forgetting are available; they are founded on neurophysiological theories of how the human brain handles this problem [ 185 , 186 ]. Techniques of the first type are founded on regularization, such as EWC [ 183 ]. Techniques of the second type employ rehearsal training techniques and dynamic neural network architectures, like iCaRL [ 187 , 188 ]. Finally, techniques of the third type are founded on dual-memory learning systems [ 189 ]. Refer to [ 190 , 191 , 192 ] for more details.

Model compression

DL models have intensive memory and computational requirements due to their huge complexity and large numbers of parameters, which makes it challenging to obtain well-trained models that can still be deployed productively [ 193 , 194 ]. Healthcare and environmental science are among the fields characterized as data-intensive. These requirements limit the deployment of DL on machines with limited computational power, mainly in the healthcare field. The numerous methods of assessing human health, together with data heterogeneity, have made the data far more complicated and vastly larger in size [ 195 ]; thus, the issue requires additional computation [ 196 ]. Furthermore, novel hardware-based parallel processing solutions such as FPGAs and GPUs [ 197 , 198 , 199 ] have been developed to solve the computation issues associated with DL. Recently, numerous techniques for compressing DL models, designed to decrease the computational cost of the models from the outset, have also been introduced. These techniques can be classified into four classes. In the first class, the redundant parameters (which have no significant impact on model performance) are reduced. This class, which includes the famous deep compression method, is called parameter pruning [ 200 ]. In the second class, the larger model uses its distilled knowledge to train a more compact model; thus, it is called knowledge distillation [ 201 , 202 ]. In the third class, compact convolution filters are used to reduce the number of parameters [ 203 ]. In the final class, low-rank factorization is used to estimate the informative parameters to be preserved [ 204 ]. These classes cover the most representative techniques for model compression. A more comprehensive discussion of the topic is provided in [ 193 ].
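Two of the four classes lend themselves to a short sketch on a single weight matrix. The code below assumes magnitude-based pruning (one simple pruning criterion, not necessarily the one used in the deep compression method) and a truncated SVD as the low-rank factorization; the function names and the matrix size are illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

def low_rank_factorize(weights, rank):
    """Approximate an (m, n) matrix as A @ B with A of shape (m, rank) and B of shape (rank, n)."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32))
pruned = magnitude_prune(weights, sparsity=0.5)
a, b = low_rank_factorize(weights, rank=8)
```

In this example the two factors hold 64·8 + 8·32 = 768 values instead of the original 2048, at the cost of an approximation error that depends on how quickly the singular values decay.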

Overfitting

DL models have an excessively high likelihood of overfitting the data at the training stage due to the vast number of parameters involved, which are correlated in a complex manner. Such situations reduce the model’s ability to achieve good performance on the tested data [ 90 , 205 ]. This problem is not limited to a specific field, but affects different tasks. Therefore, when proposing DL techniques, this problem should be fully considered and accurately handled. In DL, the implied bias of the training process enables the model to overcome crucial overfitting problems, as recent studies suggest [ 205 , 206 , 207 , 208 ]. Even so, it is still necessary to develop techniques that handle the overfitting problem. The available DL algorithms that ease the overfitting problem can be categorized into three classes. The first class acts on both the model architecture and model parameters and includes the most familiar approaches, such as weight decay [ 209 ], batch normalization [ 210 ], and dropout [ 90 ]. In DL, the default technique is weight decay [ 209 ], which is used extensively in almost all ML algorithms as a universal regularizer. The second class works on model inputs, through techniques such as data corruption and data augmentation [ 150 , 211 ]. One reason for the overfitting problem is the lack of training data, which makes the learned distribution fail to mirror the real distribution. Data augmentation enlarges the training data. By contrast, marginalized data corruption improves the solution without explicitly augmenting the data. The final class works on the model output. A recently proposed technique penalizes over-confident outputs to regularize the model [ 178 ]. This technique has demonstrated the ability to regularize RNNs and CNNs.
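Two techniques from the first class can be sketched in a few lines. The code assumes plain SGD with an L2-style weight decay term and the common "inverted" dropout variant, in which surviving activations are rescaled during training; the hyperparameter values are arbitrary.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr, decay):
    """One SGD update in which weight decay adds `decay * w` to the gradient."""
    return w - lr * (grad + decay * w)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
w = sgd_step_with_weight_decay(np.ones(3), np.zeros(3), lr=0.1, decay=0.5)
activations = dropout(np.ones(10000), rate=0.5, rng=rng)
```

Even with a zero data gradient, the weight decay term shrinks every weight toward zero at each step, which is what discourages overly large parameters.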

Vanishing gradient problem

In general, when using backpropagation and gradient-based learning techniques along with ANNs, a problem called the vanishing gradient problem arises, largely in the training stage [ 212 , 213 , 214 ]. More specifically, in each training iteration, every weight of the neural network is updated in proportion to the partial derivative of the error function with respect to the current weight. However, this weight updating may not occur in some cases due to a vanishingly small gradient, which in the worst case means that no further training is possible and the neural network will stop learning completely. Like some other activation functions, the sigmoid function squashes a large input space into a small output space. Thus, the derivative of the sigmoid function will be small, because a large variation at the input produces only a small variation at the output. In a shallow network, only a few layers use these activations, so this is not a significant issue. However, using more layers causes the gradient to become very small in the training stage, in which case the network cannot be trained efficiently. The back-propagation technique is used to determine the gradients of the neural networks. Initially, this technique determines the network derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first layer. The next step involves multiplying the derivatives of each layer down the network, following the chain rule. For instance, when N hidden layers employ an activation function such as the sigmoid function, N small derivatives are multiplied together. Hence, the gradient declines exponentially while propagating back to the first layer. More specifically, the biases and weights of the first layers cannot be updated efficiently during the training stage because the gradient is small.
Moreover, this condition decreases the overall network accuracy, as these first layers are frequently critical to recognizing the essential elements of the input data. However, such a problem can be avoided by employing activation functions that lack the squashing property, i.e., the tendency to squash the input space into a small range. ReLU [ 91 ], which maps x to max(0, x), is the most popular choice in the field, as it does not yield a small derivative. Another solution involves employing the batch normalization layer [ 81 ]. As mentioned earlier, the problem occurs once a large input space is squashed into a small space, causing the derivative to vanish. Employing batch normalization mitigates this issue by simply normalizing the input, so that | x | does not reach the outer edges of the sigmoid function. The normalization process makes most of the input fall within the green area, i.e., the region around zero where the sigmoid is not saturated, which ensures that the derivative is large enough for further actions. Furthermore, faster hardware can tackle the previous issue, e.g. that provided by GPUs. This makes standard back-propagation feasible for networks many layers deeper than was possible when the vanishing gradient problem was first recognized [ 215 ].
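The exponential decay described above can be made concrete with a small sketch. It assumes, purely for illustration, that the layer weights have unit magnitude, so the back-propagated factor reduces to the product of the activation derivatives; the sigmoid derivative is at most 0.25 (attained at zero), whereas the ReLU derivative is exactly 1 for positive inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU(x) = max(0, x)."""
    return 1.0 if x > 0 else 0.0

def backprop_factor(grad_fn, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor

# Ten layers, each evaluated at the most favorable point for the sigmoid.
sigmoid_factor = backprop_factor(sigmoid_grad, [0.0] * 10)  # 0.25 ** 10
relu_factor = backprop_factor(relu_grad, [1.0] * 10)        # stays at 1.0
```

Even in this best case for the sigmoid, ten layers shrink the gradient by roughly a factor of a million, while the ReLU chain leaves it unchanged for active units.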

Exploding gradient problem

The exploding gradient problem is the opposite of the vanishing gradient problem. Specifically, large error gradients are accumulated during back-propagation [ 216 , 217 , 218 ]. These lead to extremely large updates to the weights of the network, meaning that the system becomes unstable. Thus, the model will lose its ability to learn effectively. Roughly speaking, moving backward in the network during back-propagation, the gradient grows exponentially by repetitively multiplying gradients. The weight values could thus become incredibly large and may overflow to become a not-a-number (NaN) value. Some potential solutions include:

Using different weight regularization techniques.

Redesigning the architecture of the network model.
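The exponential growth and the first remedy listed above can be illustrated with a toy experiment. The sketch assumes a purely linear chain of randomly initialized layers (no activation functions) and uses an L2 penalty as one example of weight regularization; the layer width, depth, and weight scales are arbitrary choices.

```python
import numpy as np

def backprop_norms(weight_scale, depth, width=16, seed=0):
    """Gradient norm after each layer when back-propagating through random linear layers."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width)
    norms = []
    for _ in range(depth):
        w = rng.normal(0.0, weight_scale, size=(width, width))
        grad = w.T @ grad  # chain rule through a linear layer
        norms.append(float(np.linalg.norm(grad)))
    return norms

def l2_penalty(weights, lam):
    """L2 regularization term that discourages large weights."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

large = backprop_norms(weight_scale=1.0, depth=20)   # norms grow layer by layer
small = backprop_norms(weight_scale=0.05, depth=20)  # norms shrink instead
penalty = l2_penalty([np.ones((2, 2))], lam=0.5)
```

The same chain therefore explodes or vanishes depending only on the scale of the weights, which is why keeping weights small through regularization is one of the suggested remedies.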

Underspecification

In 2020, a team of computer scientists at Google identified a new challenge called underspecification [ 219 ]. ML models, including DL models, often show surprisingly poor behavior when they are tested in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics. The reason behind this weak performance is underspecification. It has been shown that small modifications can force a model towards a completely different solution and lead to different predictions in deployment domains. There are different techniques for addressing the underspecification issue. One of them is to design “stress tests” to examine how well a model works on real-world data and to uncover possible issues. Nevertheless, this demands a reliable understanding of the ways in which the model can work inaccurately. The team stated that “Designing stress tests that are well-matched to applied requirements, and that provide good ‘coverage’ of potential failure modes is a major challenge”. Underspecification puts major constraints on the credibility of ML predictions and may require reconsidering certain applications. Since ML affects human lives through applications such as medical imaging and self-driving cars, this issue requires proper attention.

Applications of deep learning

Presently, various DL applications are widespread around the world. These applications include healthcare, social network analysis, audio and speech processing (like recognition and enhancement), visual data processing methods (such as multimedia data analysis and computer vision), and NLP (translation and sentence classification), among others (Fig. 29 ) [ 220 , 221 , 222 , 223 , 224 ]. These applications have been classified into five categories: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own target, there is fundamental overlap in the pipeline implementation of these applications, as shown in Fig. 30 . Classification categorizes a set of data into classes. Detection is used to locate objects of interest in an image with consideration given to the background. In detection, multiple objects, which could be from dissimilar classes, are surrounded by bounding boxes. Localization is used to locate a single object, which is surrounded by a single bounding box. In segmentation (semantic segmentation), the edges of the target objects are surrounded by outlines, which also label them. Finally, registration refers to fitting a single image (which could be 2D or 3D) onto another. One of the most important and wide-ranging areas of DL application is healthcare [ 225 , 226 , 227 , 228 , 229 , 230 ]. This area of research is critical due to its relation to human lives. Moreover, DL has shown tremendous performance in healthcare. Therefore, we take DL applications in the medical image analysis field as an example to describe the DL applications.

Examples of DL applications

Workflow of deep learning tasks

Classification

Computer-Aided Diagnosis (CADx) is another title sometimes used for classification. Bharati et al. [ 231 ] used a chest X-ray dataset for detecting lung diseases based on a CNN. Another study attempted to read X-ray images by employing a CNN [ 232 ]. In this modality, the comparative accessibility of these images has likely enhanced the progress of DL. The authors of [ 233 ] used an improved pre-trained GoogLeNet CNN with more than 150,000 images for the training and testing processes. This dataset was augmented from 1850 chest X-rays. The authors classified the image orientation into lateral and frontal views and achieved approximately 100% accuracy. Although this orientation classification task has limited clinical use, as part of an eventual fully automated diagnosis workflow it demonstrated the effectiveness of data augmentation and pre-training in learning the metadata of relevant images. Chest infection, commonly referred to as pneumonia, is a commonly occurring health problem worldwide that is highly treatable. Rajpurkar et al. [ 234 ] utilized CheXNet, which is an improved version of DenseNet [ 112 ] with 121 convolution layers, for classifying fourteen types of disease. These authors used the ChestX-ray14 dataset [ 235 ], which comprises 112,000 images. This network achieved an excellent performance in recognizing fourteen different diseases. In particular, pneumonia classification achieved a 0.7632 AUC score using receiver operating characteristics (ROC) analysis. In addition, the network performed as well as or better than both a three-radiologist panel and four individual radiologists. Zuo et al. [ 236 ] adopted a CNN for candidate classification of lung nodules. Shen et al. [ 237 ] employed both Random Forest (RF) and SVM classifiers with CNNs to classify lung nodules. They employed two convolutional layers in each of three parallel CNNs.
The LIDC-IDRI (Lung Image Database Consortium) dataset, which contained 1010 labeled CT lung scans, was used to classify the two types of lung nodules (malignant and benign). Each CNN used image patches at a different scale to extract features, and the output feature vector was constructed from the learned features. Next, these vectors were classified as malignant or benign using either the RF classifier or an SVM with a radial basis function (RBF) kernel. The model was robust to various levels of input noise and achieved an accuracy of 86% in nodule classification. In another study, the model of [ 238 ] used 3D CNNs to interpolate the image data missing between PET and MRI images. The Alzheimer Disease Neuroimaging Initiative (ADNI) database, containing 830 PET and MRI patient scans, was utilized in their work. The PET and MRI images are used to train the 3D CNNs, first as input and then as output. Furthermore, for patients who have no PET images, the trained 3D CNNs were used to rebuild the PET images. These rebuilt images approximately fitted the actual disease recognition outcomes. However, this approach did not address the overfitting issues, which in turn restricted the technique in terms of its possible capacity for generalization. Diagnosing normal versus Alzheimer’s disease patients has been achieved by several CNN models [ 239 , 240 ]. Hosseini-Asl et al. [ 241 ] attained a state-of-the-art accuracy of 99% in diagnosing normal versus Alzheimer’s disease patients. These authors applied an auto-encoder architecture using 3D CNNs. The generic brain features were pre-trained on the CADDementia dataset. Subsequently, the outcomes of these learned features became inputs to higher layers to differentiate between patient scans of Alzheimer’s disease, mild cognitive impairment, or normal brains based on the ADNI dataset and using fine-tuned deep supervision techniques.
The architectures of VGGNet and residual neural networks, in that order, were the basis of the VoxCNN and ResNet models developed by Korolev et al. [ 242 ]. They also discriminated between Alzheimer’s disease and normal patients using the ADNI database. Accuracy was 79% for VoxCNN and 80% for ResNet. Compared to Hosseini-Asl’s work, both models achieved lower accuracies. However, as Korolev et al. stated, the implementation of the algorithms was simpler and did not require hand-crafted features. In 2020, Mehmood et al. [ 240 ] trained a CNN-based network called “SCNN” on MRI images for the task of classifying Alzheimer’s disease. They achieved state-of-the-art results by obtaining an accuracy of 99.05%.

Recently, CNNs have taken some medical imaging classification tasks to a different level, from traditional diagnosis to automated diagnosis with tremendous performance. Examples of these tasks are diabetic foot ulcer (DFU) classification (normal and abnormal (DFU) classes) [ 87 , 243 , 244 , 245 , 246 ], sickle cell anemia (SCA) classification (normal, abnormal (SCA), and other blood components) [ 86 , 247 ], breast cancer classification of hematoxylin–eosin-stained breast biopsy images into four classes (invasive carcinoma, in-situ carcinoma, benign tumor, and normal tissue) [ 42 , 88 , 248 , 249 , 250 , 251 , 252 ], and multi-class skin cancer classification [ 253 , 254 , 255 ].

In 2020, CNNs played a vital role in the early diagnosis of the novel coronavirus disease (COVID-19). CNNs have become the primary tool for automatic COVID-19 diagnosis in many hospitals around the world using chest X-ray images [ 256 , 257 , 258 , 259 , 260 ]. More details about the classification of medical imaging applications can be found in [ 226 , 261 , 262 , 263 , 264 , 265 ].

Localization

Although applications in anatomy education could increase, the practicing clinician is more likely to be interested in the localization of normal anatomy. Localization could also be applied in completely automatic end-to-end applications, in which radiological images are examined and described without human intervention [ 266 , 267 , 268 ]. Zhao et al. [ 269 ] introduced a new deep learning-based approach to localize pancreatic tumors in projection X-ray images for image-guided radiation therapy without the need for fiducials. Roth et al. [ 270 ] constructed and trained a CNN using five convolutional layers to classify around 4000 transverse-axial CT images. These authors used five categories for classification: legs, pelvis, liver, lung, and neck. After data augmentation techniques were applied, they achieved an AUC score of 0.998, and the classification error rate of the model was 5.9%. For detecting the positions of the spleen, kidney, heart, and liver, Shin et al. [ 271 ] employed stacked auto-encoders on 78 contrast-enhanced MRI scans of the abdominal area containing the kidneys or liver. Temporal and spatial domains were used to learn the hierarchical features. Depending on the organ, this approach achieved detection accuracies of 62–79%. Sirazitdinov et al. [ 268 ] presented an ensemble of two convolutional neural networks, namely RetinaNet and Mask R-CNN, for pneumonia detection and localization.

Detection

Computer-Aided Detection (CADe) is another title sometimes used for detection. For both the clinician and the patient, overlooking a lesion on a scan may have dire consequences. Thus, detection is a field of study requiring both accuracy and sensitivity [ 272 , 273 , 274 ]. Chouhan et al. [ 275 ] introduced an innovative deep learning framework for the detection of pneumonia by adopting the idea of transfer learning. Their approach obtained an accuracy of 96.4% with a recall of 99.62% on unseen data. In the area of COVID-19 and pulmonary disease, several convolutional neural network approaches have been proposed for automatic detection from X-ray images, showing excellent performance [ 46 , 276 , 277 , 278 , 279 ].

In the area of skin cancer, several applications have been introduced for the detection task [ 280 , 281 , 282 ]. Thurnhofer-Hemsi et al. [ 283 ] introduced a deep learning approach for skin cancer detection by fine-tuning five state-of-the-art convolutional neural network models. They addressed the issue of a lack of training data by adopting the ideas of transfer learning and data augmentation techniques. The DenseNet201 network showed superior results compared to the other models.

Another interesting area is that of histopathological images, which are progressively being digitized. Several papers have been published in this field [ 284 , 285 , 286 , 287 , 288 , 289 , 290 ]. Human pathologists read these images laboriously; they search for malignancy markers, such as a high index of cell proliferation measured using molecular markers (e.g. Ki-67), signs of cellular necrosis, abnormal cellular architecture, increased numbers of mitotic figures denoting increased cell replication, and enlarged nucleus-to-cytoplasm ratios. Note that a histopathological slide may contain a huge number of cells (up to the thousands). Thus, the risk of overlooking abnormal neoplastic regions is high when wading through these cells at high levels of magnification. Ciresan et al. [ 291 ] employed CNNs of 11–13 layers for identifying mitotic figures. Fifty breast histology images from the MITOS dataset were used. Their technique attained recall and precision scores of 0.7 and 0.88 respectively. Sirinukunwattana et al. [ 292 ] utilized 100 histology images of colorectal adenocarcinoma to detect cell nuclei using CNNs. Roughly 30,000 nuclei were hand-labeled for training purposes. The novelty of this approach was in the use of a Spatially Constrained CNN. This CNN detects the center of nuclei using the surrounding spatial context and spatial regression. Instead of a CNN, Xu et al. [ 293 ] employed a stacked sparse auto-encoder (SSAE) to identify nuclei in histological slides of breast cancer, achieving 0.83 and 0.89 recall and precision scores respectively. They thereby showed that unsupervised learning techniques can also be utilized effectively in this field. Albarqouni et al. [ 294 ] investigated the problem of insufficient labeling in medical images. They crowd-sourced the labeling of mitoses in histology images of breast cancer (from amateurs online).
Solving the recurrent issue of inadequate labeling during the analysis of medical images can be achieved by feeding the crowd-sourced input labels into the CNN. This method signifies a remarkable proof-of-concept effort. In 2020, Lei et al. [ 285 ] introduced the employment of deep convolutional neural networks for automatic identification of mitotic candidates from histological sections for mitosis screening. They obtained the state-of-the-art detection results on the dataset of the International Pattern Recognition Conference (ICPR) 2012 Mitosis Detection Competition.

Segmentation

Although MRI and CT image segmentation research covers different organs such as knee cartilage, prostate, and liver, most research work has concentrated on brain segmentation, particularly of tumors [ 295 , 296 , 297 , 298 , 299 , 300 ]. This issue is highly significant in surgical preparation, where precise tumor boundaries are needed to keep the surgical resection as small as possible. During surgery, excessive sacrificing of key brain regions may lead to neurological deficits including cognitive damage, emotionlessness, and limb difficulty. Conventionally, medical anatomical segmentation was done by hand; more specifically, the clinician draws outlines slice by slice through the complete stack of the CT or MRI volume. This painstaking work is therefore an ideal candidate for automation. Wadhwa et al. [ 301 ] presented a brief overview of brain tumor segmentation of MRI images. Akkus et al. [ 302 ] wrote a thorough review of brain MRI segmentation that addressed the different metrics and CNN architectures employed. Moreover, they explain several competitions in detail, as well as their datasets, which included Ischemic Stroke Lesion Segmentation (ISLES), Mild Traumatic brain injury Outcome Prediction (MTOP), and Brain Tumor Segmentation (BRATS).

Chen et al. [ 299 ] proposed convolutional neural networks for precise brain tumor segmentation. Their approach combines several techniques for better feature learning, including the DeepMedic model, a novel dual-force training scheme, a label distribution-based loss function, and Multi-Layer Perceptron-based post-processing. They evaluated their method on two recent brain tumor segmentation datasets, i.e., BRATS 2017 and BRATS 2015. Hu et al. [ 300 ] introduced a brain tumor segmentation method that adopts a multi-cascaded convolutional neural network (MCCNN) and fully connected conditional random fields (CRFs). The achieved results were excellent compared with the state-of-the-art methods.

Moeskops et al. [ 303 ] employed three parallel-running CNNs, each of which had a 2D input patch of a different size, for segmenting and classifying MRI brain images. The images, from 35 adults and 22 pre-term infants, were classified into various tissue categories such as cerebrospinal fluid, grey matter, and white matter. Employing three different input patch sizes lets each patch capture different image aspects; here, the bigger sizes incorporated the spatial features, while the smallest patch sizes concentrated on the local textures. In general, the algorithm has Dice coefficients in the range of 0.82–0.87 and achieved a satisfactory accuracy. Although 2D image slices are employed in the majority of segmentation research, Milletari et al. [ 304 ] implemented a 3D CNN for segmenting MRI prostate images. They used the PROMISE2012 challenge dataset, from which fifty MRI scans were used for training and thirty for testing. The U-Net architecture of Ronneberger et al. [ 305 ] inspired their V-net. This model attained a 0.869 Dice coefficient score, the same as the winning teams in the competition. To reduce overfitting while building a deeper CNN with 11 convolutional layers, Pereira et al. [ 306 ] deliberately used small 3 × 3 filters. Their model used MRI scans of 274 gliomas (a type of brain tumor) for training. They achieved first place in the 2013 BRATS challenge, as well as second place in the 2015 BRATS challenge. Havaei et al. [ 307 ] also considered gliomas using the 2013 BRATS dataset. They investigated different 2D CNN architectures. Compared to the winner of BRATS 2013, their algorithm worked better, as it required only 3 min to execute rather than 100 min. The concept of cascaded architecture formed the basis of their model; thus, it is referred to as InputCascadeCNN. Chen et al. [ 308 ] introduced the use of fully connected Conditional Random Fields (CRFs), atrous spatial pyramid pooling, and up-sampled filters. These authors aimed to enhance the accuracy of localization and enlarge the field of view of every filter at multiple scales. Their model, DeepLab, attained 79.7% mIOU (mean Intersection Over Union) and performed very well on the PASCAL VOC-2012 image segmentation task.
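The Dice coefficient and the Intersection Over Union quoted above are both overlap measures between a predicted mask and a reference mask. The following is a minimal sketch for binary masks, assuming NumPy arrays; the function names and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from any of the cited works.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # convention (assumption): two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / total

def iou(pred, target):
    """Intersection Over Union = |A and B| / |A or B| for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # same convention as above
    return np.logical_and(pred, target).sum() / union

# Toy 2 x 3 masks: 3 predicted pixels, 3 reference pixels, 2 in common.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
dice_value = dice_coefficient(pred, target)  # 2*2 / (3+3)
iou_value = iou(pred, target)                # 2 / 4
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so they rank methods identically per image; a mean IoU such as the one reported for DeepLab averages the per-class IoU values.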

Recently, several deep learning techniques have been employed for the automatic segmentation of COVID-19 lung infection from CT images, which helps to detect the development of the infection [ 309 , 310 , 311 , 312 ].

Registration

Usually, given two input images, the four main stages of the canonical procedure of the image registration task are [ 313 , 314 ]:

Target Selection: it determines the input image onto which the second input image must be accurately superimposed.

Feature Extraction: it computes the set of features extracted from each input image.

Feature Matching: it allows finding similarities between the previously obtained features.

Pose Optimization: it aims to minimize the distance between both input images.

The result of the registration procedure is then the geometric transformation (e.g. translation, rotation, scaling, etc.) that brings both input images into the same coordinate system such that the distance between them is minimal, i.e. their level of superimposition/overlap is optimal. It is beyond the scope of this work to provide an extensive review of this topic. Nevertheless, a short summary is given next.
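To make the Pose Optimization stage concrete, the sketch below estimates a rigid transformation (rotation plus translation) between two sets of already matched points using the classical SVD-based least-squares solution (often called the Kabsch algorithm). This is one textbook option, chosen here for illustration and not the method of any specific work cited in this review; the function name `rigid_align` and the toy data are hypothetical, and the sketch assumes the Feature Matching stage has already produced one-to-one correspondences.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares R, t such that R @ source[i] + t is close to target[i].

    source, target: (N, d) arrays of matched points (row i matches row i).
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (source - src_mean).T @ (target - tgt_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0] * (source.shape[1] - 1) + [sign])
    R = Vt.T @ D @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

# Toy example: rotate four 2D points by 30 degrees and shift them.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.0])
source = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
target = source @ R_true.T + t_true

R_est, t_est = rigid_align(source, target)
residual = np.linalg.norm(source @ R_est.T + t_est - target)
```

With noise-free correspondences the residual is zero up to floating-point error; with noisy or partly wrong matches the same routine is typically wrapped in an iterative or robust scheme.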

Commonly, the input images for a DL-based registration approach can take various forms, e.g. point clouds, voxel grids, and meshes. Additionally, some techniques accept as inputs the result of the Feature Extraction or Feature Matching steps of the canonical scheme. Specifically, the outcome could be data in a particular form as well as the result of the steps from the classical pipeline (feature vector, matching vector, and transformation). Nevertheless, with the newest DL-based methods, a novel conceptual type of ecosystem emerges. It contains acquired characteristics about the target, materials, and their behavior that can be registered with the input data. Such a conceptual ecosystem is formed by a neural network and the way it was trained, and it can be counted as an input to the registration approach. However, it is not an input that can be adopted in every registration situation, since it corresponds to an internal data representation.

From a DL viewpoint, the interpretation of the conceptual design makes it possible to divide the input data of a registration approach into defined and non-defined models. In particular, defined models depict particular spatial data (e.g. 2D or 3D), while a non-defined model is a generalization of a data set created by a learning system. Yumer et al. [ 315 ] developed a framework in which the model acquires characteristics of objects, i.e., it learns to identify what a sportier car or a more comfortable chair looks like, and adjusts a 3D model to fit those characteristics while maintaining the main characteristics of the primary data. Likewise, a fundamental aspect of the unsupervised learning method introduced by Ding et al. [ 316 ] is that there is no target for the registration approach. In this instance, the network is capable of placing each input point cloud in a global space, solving SLAM problems in which many point clouds have to be registered rigidly. On the other hand, Mahadevan [ 317 ] proposed combining two conceptual models, building on the development of Imagination Machines, to provide flexible artificial intelligence systems and relationships between the learned phases through training schemes that are not based on labels and classifications. Another practical application of DL, especially CNNs, to image registration is the 3D reconstruction of objects. Wang et al. [ 318 ] applied an adversarial approach using CNNs to rebuild a 3D model of an object from its 2D image. The network learns many objects and accomplishes the registration between the image and the conceptual model. Similarly, Hermoza et al. [ 319 ] utilized a GAN to predict the missing geometry of damaged archaeological objects, providing the reconstructed object in a voxel grid format together with a label indicating its class.

DL for medical image registration has numerous applications, which have been listed in several review papers [ 320 , 321 , 322 ]. Yang et al. [ 323 ] implemented stacked convolutional layers as an encoder-decoder approach to predict how each input pixel is deformed into its final configuration, using MRI brain scans from the OASIS dataset. They employed a registration model known as Large Deformation Diffeomorphic Metric Mapping (LDDMM) and attained remarkable improvements in computation time. Miao et al. [ 324 ] used synthetic X-ray images to train a five-layer CNN to register 3D models of a trans-esophageal probe, a hand implant, and a knee implant onto 2D X-ray images for pose estimation. They determined that their model achieved an execution time of 0.1 s, an important improvement over conventional intensity-based registration techniques; moreover, it achieved effective registrations 79–99% of the time. Li et al. [ 325 ] introduced a neural network-based approach for the non-rigid 2D–3D registration of the lateral cephalogram and the volumetric cone-beam CT (CBCT) images.

Computational approaches

For computationally intensive applications, complex ML and DL approaches have rapidly been identified as the most significant techniques and are widely used in different fields. The development and enhancement of algorithms, combined with high computational performance and large datasets, make it possible to effectively execute several applications that were previously either impossible or difficult to consider.

Currently, several standard DNN configurations are available. The interconnection patterns between layers and the total number of layers represent the main differences between these configurations. Table 2 illustrates the growth rate of the overall number of layers over time, which seems to be far faster than the “Moore’s Law growth rate”. In a normal DNN, the number of layers grew by around 2.3× each year in the period from 2012 to 2016. Recent investigations of future ResNet versions reveal that the number of layers can be extended up to 1000. An SGD technique is employed to fit the weights (or parameters), while different optimization techniques are employed to update the parameters during the DNN training process. Repeated updates are required to improve network accuracy, with each update yielding only a small improvement. For example, training a ResNet model on ImageNet, a large dataset that contains more than 14 million images, takes around 30K to 40K iterations to converge to a steady solution. In addition, the overall computational load, as an upper-level estimate, may exceed $10^{20}$ FLOPS when both the training set size and the DNN complexity increase.
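The comparison with Moore's Law can be checked with a few lines of arithmetic. The sketch below compounds the 2.3× yearly growth quoted above over the four yearly steps from 2012 to 2016 and contrasts it with a doubling every two years; the latter is the common informal reading of Moore's Law and is an assumption made here, since the text does not state which form it has in mind.

```python
# Figures from the text: layer count grows ~2.3x per year, 2012 to 2016.
annual_layer_growth = 2.3
years = 2016 - 2012                      # four yearly steps

layer_growth = annual_layer_growth ** years   # 2.3 ** 4

# Assumption: Moore's Law read as "doubling every two years".
moore_growth = 2.0 ** (years / 2.0)           # 2 ** 2

ratio = layer_growth / moore_growth
print(f"layers: x{layer_growth:.1f}, Moore's Law: x{moore_growth:.1f}, "
      f"ratio: {ratio:.1f}")
```

Under these assumptions the layer count grows by a factor of about 28 over the period, against a factor of 4 for the Moore's Law baseline.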

Prior to 2008, boosting the training to a satisfactory extent was achieved by using GPUs. Usually, days or weeks are needed for a training session, even with GPU support. Therefore, several optimization strategies were developed to reduce the extensive training time. The computational requirements are expected to increase as DNNs continue to grow in both complexity and size.

In addition to the computational load, the memory bandwidth and capacity have a significant effect on the overall training performance and, to a lesser extent, on inference. More specifically, in several network layers (such as convolutional layers) the parameters are shared across the input data, there is a sizeable amount of reused data, and the computation exhibits a high computation-to-bandwidth ratio. By contrast, in the FC layers there are no shared parameters, the amount of reused data is extremely small, and the computation-to-bandwidth ratio is extremely small. Table 3 presents a comparison between different aspects related to the devices. The table is intended to clarify the tradeoffs involved in choosing the best approach for configuring a system based on FPGA, GPU, or CPU devices. It should be noted that each has its own weaknesses and strengths; accordingly, there is no clear one-size-fits-all solution.
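The contrast in computation-to-bandwidth ratio can be illustrated with a rough estimate of floating-point operations per byte moved. The sketch below compares a convolutional layer with an FC layer under simplifying assumptions that are ours, not the paper's: batch size 1, stride-1 convolution with "same" padding, 32-bit values, and every weight, input, and output value transferred exactly once. The layer shapes are hypothetical examples.

```python
def conv_intensity(h, w, c_in, c_out, k, bytes_per_value=4):
    """FLOPs per byte for a stride-1, same-padded k x k convolution."""
    flops = 2 * h * w * c_in * c_out * k * k           # multiply + add per MAC
    values = (k * k * c_in * c_out                     # weights (reused h*w times)
              + h * w * c_in                           # input feature map
              + h * w * c_out)                         # output feature map
    return flops / (values * bytes_per_value)

def fc_intensity(n_in, n_out, bytes_per_value=4):
    """FLOPs per byte for a fully connected layer at batch size 1."""
    flops = 2 * n_in * n_out                           # each weight used once
    values = n_in * n_out + n_in + n_out               # weights + input + output
    return flops / (values * bytes_per_value)

# Hypothetical layer shapes, chosen only for illustration.
conv_ratio = conv_intensity(h=56, w=56, c_in=64, c_out=64, k=3)
fc_ratio = fc_intensity(n_in=4096, n_out=4096)
print(f"conv: {conv_ratio:.1f} FLOPs/byte, FC: {fc_ratio:.2f} FLOPs/byte")
```

Because every convolution weight is reused at each spatial position, the convolutional layer performs on the order of a hundred operations per byte, whereas the FC layer performs roughly one multiply-add per weight loaded and is therefore limited by memory bandwidth rather than by arithmetic throughput.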

Although GPU processing has enhanced the ability to address the computational challenges related to such networks, the maximum GPU (or CPU) performance is not achieved, and several techniques or models have turned out to be strongly bandwidth-bound. In the worst cases, the GPU efficiency is between 15 and 20% of the maximum theoretical performance. Addressing this issue requires enlarging the memory bandwidth, for example by using high-bandwidth stacked memory. Next, the different approaches based on FPGA, GPU, and CPU are detailed.

CPU-based approach

CPU nodes usually offer robust network connectivity, storage capabilities, and large memory. Although CPU nodes are more general-purpose than FPGA or GPU nodes, they cannot match them in raw computational power, since this requires increased network ability and a larger memory capacity.

GPU-based approach

GPUs are extremely effective for several basic DL primitives, which include highly parallel operations such as activation functions, matrix multiplication, and convolutions [ 326 , 327 , 328 , 329 , 330 ]. Incorporating HBM-stacked memory into recent GPU models significantly enhances the bandwidth. This enhancement allows numerous primitives to efficiently utilize all computational resources of the available GPUs. The improvement in GPU performance over CPU performance is usually 10–20:1 for dense linear algebra operations.

Maximizing parallel processing is the basis of the GPU programming model. For example, a GPU model may involve up to sixty-four computational units. There are four SIMD engines per computational unit, and each SIMD has sixteen floating-point computation lanes. The peak performance is 25 TFLOPS (fp16) and 10 TFLOPS (fp32) as utilization approaches 100%. Additional GPU performance may be achieved if the vector add and multiply functions are combined into inner-product instructions that match the primitives of matrix operations.
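The figures in this example can be tied together with a back-of-the-envelope calculation. The sketch below multiplies out the lane count from the numbers quoted above and then derives the clock frequency that the 10 TFLOPS fp32 peak would imply. The second step rests on an assumption that is not stated in the text: that each lane completes one fused multiply-add, counted as two floating-point operations, per cycle.

```python
# Figures quoted in the text.
compute_units = 64
simd_per_unit = 4
lanes_per_simd = 16
peak_fp32_flops = 10e12          # 10 TFLOPS (fp32)

# Total number of floating-point lanes working in parallel.
total_lanes = compute_units * simd_per_unit * lanes_per_simd

# Assumption: one fused multiply-add (2 FLOPs) per lane per cycle.
flops_per_lane_per_cycle = 2
implied_clock_hz = peak_fp32_flops / (total_lanes * flops_per_lane_per_cycle)

print(f"{total_lanes} lanes, implied clock of about "
      f"{implied_clock_hz / 1e9:.2f} GHz")
```

The result, 4096 lanes and a clock of roughly 1.2 GHz, is only a consistency check on the quoted numbers; reaching the peak in practice requires every lane to be busy on every cycle, which is why the utilization caveat matters.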

For DNN training, the GPU is usually considered to be an optimized design, while for inference operations, it may also offer considerable performance improvements.

FPGA-based approach

FPGA is widely utilized in various tasks including deep learning [ 199 , 247 , 331 , 332 , 333 , 334 ]. Inference accelerators are commonly implemented using FPGA. The FPGA can be effectively configured to remove the unnecessary or overhead functions involved in GPU systems. Compared to the GPU, the FPGA is restricted to weaker floating-point performance and integer inference. The main FPGA feature is the capability to dynamically reconfigure the array characteristics (at run-time), as well as the capability to configure the array by means of an effective design with little or no overhead.

As mentioned earlier, the FPGA offers better performance and latency per watt than the GPU and CPU in DL inference operations. Custom high-performance hardware, pruned networks, and reduced arithmetic precision are three factors that enable the FPGA to implement DL algorithms with this level of efficiency. In addition, the FPGA may be employed to implement CNN overlay engines with over 80% efficiency, eight-bit precision, and over 15 TOPs peak performance; this has been used for a few conventional CNNs, as Xilinx and partners demonstrated recently. By contrast, pruning techniques are mostly employed in the LSTM context. Model sizes can be reduced by up to 20×, which provides an important benefit when implementing the optimal solution, as MLP neural processing demonstrated. A recent study on fixed-point and custom floating-point precision has revealed that going below 8 bits is extremely promising; moreover, it supports further advances toward peak-performance FPGA implementations of DNN models.
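Reduced arithmetic precision can be illustrated with a simple symmetric 8-bit quantization of a weight tensor. The sketch below is a generic per-tensor scheme written with NumPy for illustration; it is not the scheme used by any particular FPGA toolchain, and the function names and the random test tensor are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 weights -> int8 + one scale."""
    weights = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix drawn from a normal distribution.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_error = float(np.max(np.abs(w - w_hat)))   # bounded by about scale / 2
compression = w.nbytes / q.nbytes              # float32 -> int8 storage
```

Storage drops by a factor of four relative to 32-bit floats and the rounding error per weight stays within half a quantization step, which is what allows integer arithmetic units to replace floating-point ones for inference.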

Evaluation metrics

Evaluation metrics adopted within DL tasks play a crucial role in achieving the optimized classifier [ 335 ]. They are utilized within a usual data classification procedure through two main stages: training and testing. During the training stage, the evaluation metric is utilized to optimize the classification algorithm. This means that the evaluation metric acts as a discriminator to select the optimized solution, i.e., the one that can generate a more accurate prediction in future evaluations of a specific classifier. During the testing stage, the evaluation metric acts as an evaluator to measure the effectiveness of the created classifier on unseen data. As given in Eq. 20 , TN and TP are defined as the number of negative and positive instances, respectively, which are correctly classified. In addition, FN and FP are defined as the number of misclassified positive and negative instances, respectively. Some of the most well-known evaluation metrics are listed below.

Accuracy: Calculates the ratio of correctly predicted samples to the total number of samples evaluated (Eq. 20 ).

Sensitivity or Recall: Utilized to calculate the fraction of positive patterns that are correctly classified (Eq. 21 ).

Specificity: Utilized to calculate the fraction of negative patterns that are correctly classified (Eq. 22 ).

Precision: Utilized to calculate the fraction of patterns predicted as positive that are truly positive (Eq. 23 ).

F1-Score: Calculates the harmonic mean of the recall and precision rates (Eq. 24 ).

J Score: This metric is also called Youden's J statistic. Eq. 25 represents the metric.

False Positive Rate (FPR): This metric refers to the probability of a false alarm, as calculated in Eq. 26 .

Area Under the ROC Curve: AUC is a common ranking-type metric. It is utilized to conduct comparisons between learning algorithms [ 336 , 337 , 338 ], as well as to construct an optimal learning model [ 339 , 340 ]. In contrast to probability and threshold metrics, the AUC value exposes the entire ranking performance of the classifier. The following formula is used to calculate the AUC value for a two-class problem [ 341 ] (Eq. 27 ).

Here, $S_{p}$ represents the sum of the ranks of all positive samples. The numbers of negative and positive samples are denoted as $n_{n}$ and $n_{p}$ , respectively. Compared to the accuracy metric, the AUC value has been shown empirically and theoretically to be better, making it very helpful for identifying an optimized solution and evaluating the classifier performance during classification training.
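The two-class AUC can be computed directly from the quantities just defined. The sketch below assumes that Eq. 27 has the usual rank-sum form associated with Hand and Till, $AUC = \left( S_{p} - n_{p}(n_{p}+1)/2\right) / \left( n_{p} n_{n}\right)$, with samples ranked in increasing order of classifier score; the function name and the averaging of tied ranks are illustrative choices.

```python
def auc_from_ranks(scores, labels):
    """Two-class AUC from the rank sum of the positive samples.

    scores: classifier outputs (higher means "more positive").
    labels: 1 for positive samples, 0 for negative samples.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores; ties share their average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0      # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_p = sum(1 for y in labels if y == 1)
    n_n = len(labels) - n_p
    if n_p == 0 or n_n == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    s_p = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2.0) / (n_p * n_n)

# Four samples: one positive is ranked below one negative.
auc_value = auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this example the positives receive ranks 2 and 4, so $S_{p} = 6$ and the AUC is $(6 - 3)/4 = 0.75$. The sort dominates the running time, which is the source of the $n\log n$ factor in the complexities quoted for the multiclass variants.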

When considering the discrimination and evaluation processes, the AUC performance was excellent. However, for multiclass problems, the AUC computation is costly when discriminating between a large number of created solutions. In addition, the time complexity for computing the AUC is $O \left( |C|^{2} \; n\log n\right) $ with respect to the Hand and Till AUC model [ 341 ] and $O \left( |C| \; n\log n\right) $ according to Provost and Domingos' AUC model [ 336 ].
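For completeness, the threshold metrics listed earlier in this section can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, which we assume correspond to Eqs. 20 to 26; the function name and the example counts are hypothetical, and undefined ratios (zero denominators) are returned as NaN.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN when the ratio is undefined."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, tn, fp, fn):
    """Threshold metrics from true/false positive/negative counts."""
    sensitivity = _ratio(tp, tp + fn)           # recall: correct share of positives
    specificity = _ratio(tn, tn + fp)           # correct share of negatives
    precision = _ratio(tp, tp + fp)             # correct share of predicted positives
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "j_score": sensitivity + specificity - 1,   # Youden's J statistic
        "fpr": _ratio(fp, fp + tn),                 # equals 1 - specificity
    }

# Hypothetical test set: 50 positives and 50 negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this balanced example the accuracy is 0.85 and the F1-score is about 0.84; on strongly imbalanced data the two can diverge considerably, which is one reason the ranking-based AUC is often reported alongside them.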

Frameworks and datasets

Several DL frameworks and datasets have been developed in the last few years. Various frameworks and libraries have also been used to expedite the work with good results. Through their use, the training process has become easier. Table 4 lists the most utilized frameworks and libraries.

Based on the star ratings on GitHub, as well as our own background in the field, TensorFlow is deemed the most effective and easy to use. It has the ability to work on several platforms. (GitHub is one of the biggest software hosting sites, while GitHub stars indicate how well-regarded a project is on the site.) Moreover, there are several other benchmark datasets employed for different DL tasks. Some of these are listed in Table 5 .

Summary and conclusion

Finally, a brief discussion that gathers the relevant points made throughout this extensive review is in order. Next, an itemized analysis is presented to conclude our review and outline future directions.

DL already experiences difficulties in simultaneously modeling multi-complex modalities of data. In recent DL developments, another common approach is that of multimodal DL.

DL requires sizeable datasets (labeled data preferred) to predict unseen data and to train the models. This challenge turns out to be particularly difficult when real-time data processing is required or when the provided datasets are limited (such as in the case of healthcare data). To alleviate this issue, TL and data augmentation have been researched over the last few years.

Although ML slowly transitions to semi-supervised and unsupervised learning to manage practical data without the need for manual human labeling, many of the current deep-learning models utilize supervised learning.

The CNN performance is greatly influenced by hyper-parameter selection. Any small change in the hyper-parameter values will affect the general CNN performance. Therefore, careful parameter selection is an extremely significant issue that should be considered during optimization scheme development.

Impressive and robust hardware resources like GPUs are required for effective CNN training. Moreover, they are also required for exploring the efficiency of using CNN in smart and embedded systems.

In the CNN context, ensemble learning [ 342 , 343 ] represents a prospective research area. The collection of different and multiple architectures will support the model in improving its generalizability across different image categories through extracting several levels of semantic image representation. Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.

Exploiting depth and different structural adaptations significantly improves the CNN learning capacity. Substituting the traditional layer configuration with blocks results in significant advances in CNN performance, as has been shown in the recent literature. Currently, developing novel and efficient block architectures is the main trend in new research on CNN architectures. HRNet is only one example that shows there are always ways to improve the architecture.

It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution to handling the enormous amount of data. It also helps to increase efficiency and reduce costs. Furthermore, it offers the flexibility to train DL architectures.

With the recent development in computational tools including a chip for neural networks and a mobile GPU, we will see more DL applications on mobile devices. It will be easier for users to use DL.

Regarding the lack of training data, it is expected that various transfer learning techniques will be considered, such as training the DL model on large unlabeled image datasets and then transferring the knowledge to train the DL model on a small number of labeled images for the same task.

Last, this overview provides a starting point for members of the community interested in the field of DL. Furthermore, it should allow researchers to decide on the most suitable direction of work in order to provide more accurate alternatives to the field.

Availability of data and materials

Not applicable.

Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci. 2020;6:307.

Amrit C, Paauw T, Aly R, Lavric M. Identifying child abuse through text mining and machine learning. Expert Syst Appl. 2017;88:402–18.

Hossain E, Khan I, Un-Noor F, Sikander SS, Sunny MSH. Application of big data and machine learning in smart grid, and associated security concerns: a review. IEEE Access. 2019;7:13960–88.

Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H. Survey of review spam detection using machine learning techniques. J Big Data. 2015;2(1):23.

Deldjoo Y, Elahi M, Cremonesi P, Garzotto F, Piazzolla P, Quadrana M. Content-based video recommendation system based on stylistic visual features. J Data Semant. 2016;5(2):99–113.

Al-Dulaimi K, Chandran V, Nguyen K, Banks J, Tomeo-Reyes I. Benchmarking hep-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape. Pattern Recogn Lett. 2019;125:534–41.

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar S. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR). 2018;51(5):1–36.

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK. A state-of-the-art survey on deep learning theory and architectures. Electronics. 2019;8(3):292.

Potok TE, Schuman C, Young S, Patton R, Spedalieri F, Liu J, Yao KT, Rose G, Chakma G. A study of complex deep learning networks on high-performance, neuromorphic, and quantum computers. ACM J Emerg Technol Comput Syst (JETC). 2018;14(2):1–21.

Adeel A, Gogate M, Hussain A. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf Fusion. 2020;59:163–70.

Tian H, Chen SC, Shyu ML. Evolutionary programming based deep learning feature selection and network construction for visual data classification. Inf Syst Front. 2020;22(5):1053–66.

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag. 2018;13(3):55–75.

Koppe G, Meyer-Lindenberg A, Durstewitz D. Deep learning for small and big data in psychiatry. Neuropsychopharmacology. 2021;46(1):176–90.

Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE; 2005. p. 886–93.

Lowe DG. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE; 1999. p. 1150–7.

Wu L, Hoi SC, Yu N. Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process. 2010;19(7):1908–20.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett. 2019;118:14–22.

Dhillon A, Verma GK. Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell. 2020;9(2):85–112.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hasan RI, Yusuf SM, Alzubaidi L. Review of the state of the art of deep learning for plant diseases: a broad analysis and discussion. Plants. 2020;9(10):1302.

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X. A review of object detection based on deep learning. Multimed Tools Appl. 2020;79(33):23729–91.

Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access. 2017;6:9375–89.

Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981333 .

Shrestha A, Mahmood A. Review of deep learning algorithms and architectures. IEEE Access. 2019;7:53040–65.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Goodfellow I, Bengio Y, Courville A. Deep learning, vol. 1. Cambridge: MIT press; 2016.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021;8(1):1–54.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

Bhowmick S, Nagarajaiah S, Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from uav videos. Sensors. 2020;20(21):6299.

Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291–307.

Li Y, Zhang T, Sun S, Gao X. Accelerating flash calculation through deep learning methods. J Comput Phys. 2019;394:153–65.

Yang W, Zhang X, Tian Y, Wang W, Xue JH, Liao Q. Deep learning for single image super-resolution: a brief review. IEEE Trans Multimed. 2019;21(12):3106–21.

Tang J, Li S, Liu P. A review of lane detection methods based on deep learning. Pattern Recogn. 2020;111:107623.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.

Ng A. Machine learning yearning: technical strategy for AI engineers in the era of deep learning. 2019. https://www.mlyearning.org .

Metz C. Turing award won by 3 pioneers in artificial intelligence. The New York Times. 2019;27.

Nevo S, Anisimov V, Elidan G, El-Yaniv R, Giencke P, Gigi Y, Hassidim A, Moshe Z, Schlesinger M, Shalev G, et al. Ml for flood forecasting at scale; 2019. arXiv preprint arXiv:1901.09583 .

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23(6):1241–50.

Benhammou Y, Achchab B, Herrera F, Tabik S. Breakhis based breast cancer automatic diagnosis using deep learning: taxonomy, survey and insights. Neurocomputing. 2020;375:9–24.

Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE. 2020;15(6):e0233678.

Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Brunese L, Mercaldo F, Reginelli A, Santone A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput Methods Programs Biomed. 2020;196:105608.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

Shorfuzzaman M, Hossain MS. MetaCOVID: a siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recogn. 2020;113:107700.

Carvelli L, Olesen AN, Brink-Kjær A, Leary EB, Peppard PE, Mignot E, Sørensen HB, Jennum P. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 2020;69:109–19.

De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O’Donoghue B, Visentin D, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.

Van Essen B, Kim H, Pearce R, Boakye K, Chen B. LBANN: Livermore big artificial neural network HPC toolkit. In: Proceedings of the workshop on machine learning in high-performance computing environments; 2015. p. 1–6.

Saeed MM, Al Aghbari Z, Alsharidah M. Big data clustering techniques based on spark: a literature review. PeerJ Comput Sci. 2020;6:321.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1631–42.

Goller C, Kuchler A. Learning task-dependent distributed representations by backpropagation through structure. In: Proceedings of international conference on neural networks (ICNN’96), vol 1. IEEE; 1996. p. 347–52.

Socher R, Lin CCY, Ng AY, Manning CD. Parsing natural scenes and natural language with recursive neural networks. In: ICML; 2011.

Louppe G, Cho K, Becot C, Cranmer K. QCD-aware recursive neural networks for jet physics. J High Energy Phys. 2019;2019(1):57.

Sadr H, Pedram MM, Teshnehlab M. A robust sentiment analysis method based on sequential combination of convolutional and recursive neural networks. Neural Process Lett. 2019;50(3):2745–61.

Urban G, Subrahmanya N, Baldi P. Inner and outer recursive neural networks for chemoinformatics applications. J Chem Inf Model. 2018;58(2):207–11.

Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast. 2020;37(1):388–427.

Jiang Y, Kim H, Asnani H, Kannan S, Oh S, Viswanath P. LEARN codes: inventing low-latency codes via recurrent neural networks. IEEE J Sel Areas Inf Theory. 2020;1(1):207–16.

John RA, Acharya J, Zhu C, Surendran A, Bose SK, Chaturvedi A, Tiwari N, Gao Y, He Y, Zhang KK, et al. Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks. Nat Commun. 2020;11(1):1–9.

Batur Dinler Ö, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci. 2020;10(4):1273.

Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing, vol. 2016. NIH Public Access; 2016. p. 856.

Pascanu R, Gulcehre C, Cho K, Bengio Y. How to construct deep recurrent neural networks. In: Proceedings of the second international conference on learning representations (ICLR 2014); 2014.

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 249–56.

Gao C, Yan J, Zhou S, Varshney PK, Liu H. Long short-term memory-based deep recurrent neural networks for target tracking. Inf Sci. 2019;502:279–96.

Zhou DX. Theory of deep convolutional neural networks: downsampling. Neural Netw. 2020;124:319–27.

Jhong SY, Tseng PY, Siriphockpirom N, Hsia CH, Huang MS, Hua KL, Chen YY. An automated biometric identification system using CNN-based palm vein recognition. In: 2020 international conference on advanced robotics and intelligent systems (ARIS). IEEE; 2020. p. 1–6.

Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. DeepCryoPicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinform. 2020;21(1):1–38.

Wang T, Lu C, Yang M, Hong F, Liu C. A hybrid method for heartbeat classification via convolutional neural networks, multilayer perceptrons and focal loss. PeerJ Comput Sci. 2020;6:324.

Li G, Zhang M, Li J, Lv F, Tong G. Efficient densely connected convolutional neural networks. Pattern Recogn. 2021;109:107610.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, et al. Recent advances in convolutional neural networks. Pattern Recogn. 2018;77:354–77.

Fang W, Love PE, Luo H, Ding L. Computer vision for behaviour-based safety in construction: a review and future directions. Adv Eng Inform. 2020;43:100980.

Palaz D, Magimai-Doss M, Collobert R. End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition. Speech Commun. 2019;108:15–32.

Li HC, Deng ZY, Chiang HH. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20(21):6114.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift; 2015. arXiv preprint arXiv:1502.03167 .

Ruder S. An overview of gradient descent optimization algorithms; 2016. arXiv preprint arXiv:1609.04747 .

Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–86.

Hinton G, Srivastava N, Swersky K. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent; 2012.

Zhang Z. Improved Adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS). IEEE; 2018. p. 1–2.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Duan Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics. 2020;9(3):427.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Santamaría J, Duan Y, Oleiwi SR. Towards a better understanding of transfer learning for medical imaging: a case study. Appl Sci. 2020;10(13):4523.

Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995. p. 261–76.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

Dahl GE, Sainath TN, Hinton GE. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8609–13.

Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network; 2015. arXiv preprint arXiv:1505.00853 .

Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst. 1998;6(02):107–16.

Lin M, Chen Q, Yan S. Network in network; 2013. arXiv preprint arXiv:1312.4400 .

Hsiao TY, Chang YC, Chou HH, Chiu CT. Filter-based deep-compression with global average pooling for convolutional networks. J Syst Arch. 2019;95:9–18.

Li Z, Wang SH, Fan RR, Cao G, Zhang YD, Guo T. Teeth category classification via seven-layer deep convolutional neural network with max pooling and global average pooling. Int J Imaging Syst Technol. 2019;29(4):577–83.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.

Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Univ Montreal. 2009;1341(3):1.

Le QV. Building high-level features using large scale unsupervised learning. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8595–8.

Grün F, Rupprecht C, Navab N, Tombari F. A taxonomy and library for visualizing learned features in convolutional neural networks; 2016. arXiv preprint arXiv:1606.07757 .

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv preprint arXiv:1409.1556 .

Ranzato M, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007. p. 1–8.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.

Bengio Y, et al. RMSProp and equilibrated adaptive learning rates for nonconvex optimization; 2015. arXiv preprint arXiv:1502.04390 .

Srivastava RK, Greff K, Schmidhuber J. Highway networks; 2015. arXiv preprint arXiv:1505.00387 .

Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans Smart Grid. 2017;10(1):841–51.

Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016;16(1):115.

Cireşan D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning; 2016. arXiv preprint arXiv:1602.07261 .

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Wu S, Zhong S, Liu Y. Deep residual learning for image steganalysis. Multimed Tools Appl. 2018;77(9):10437–53.

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–08.

Rubin J, Parvaneh S, Rahman A, Conroy B, Babaeizadeh S. Densely connected convolutional networks for detection of atrial fibrillation from short single-lead ECG recordings. J Electrocardiol. 2018;51(6):S18–21.

Kuang P, Ma T, Chen Z, Li F. Image super-resolution with densely connected convolutional networks. Appl Intell. 2019;49(1):125–36.

Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–500.

Su A, He X, Zhao X. JPEG steganalysis based on ResNeXt with Gauss partial derivative filters. Multimed Tools Appl. 2020;80(3):3349–66.

Yadav D, Jalal A, Garlapati D, Hossain K, Goyal A, Pant G. Deep learning-based ResNeXt model in phycological studies for future. Algal Res. 2020;50:102018.

Han W, Feng R, Wang L, Gao L. Adaptive spatial-scale-aware deep convolutional neural network for high-resolution remote sensing imagery scene classification. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. p. 4736–9.

Zagoruyko S, Komodakis N. Wide residual networks; 2016. arXiv preprint arXiv:1605.07146 .

Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ. Deep networks with stochastic depth. In: European conference on computer vision. Springer; 2016. p. 646–61.

Huynh HT, Nguyen H. Joint age estimation and gender classification of Asian faces using wide ResNet. SN Comput Sci. 2020;1(5):1–9.

Takahashi R, Matsubara T, Uehara K. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans Circuits Syst Video Technol. 2019;30(9):2917–31.

Han D, Kim J, Kim J. Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 5927–35.

Wang Y, Wang L, Wang H, Li P. End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access. 2019;7:31959–70.

Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1251–8.

Lo WW, Yang X, Wang Y. An Xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP international conference on new technologies, mobility and security (NTMS). IEEE; 2019. p. 1–5.

Rahimzadeh M, Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2. Inform Med Unlocked. 2020;19:100360.

Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3156–64.

Salakhutdinov R, Larochelle H. Efficient learning of deep Boltzmann machines. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 693–700.

Goh H, Thome N, Cord M, Lim JH. Top-down regularization of deep belief networks. Adv Neural Inf Process Syst. 2013;26:1878–86.

Guan J, Lai R, Xiong A, Liu Z, Gu L. Fixed pattern noise reduction for infrared images based on cascade residual attention CNN. Neurocomputing. 2020;377:301–13.

Bi Q, Qin K, Zhang H, Li Z, Xu K. RADC-Net: a residual attention based convolution network for aerial scene classification. Neurocomputing. 2020;377:345–59.

Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2015. p. 2017–25.

Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

Mou L, Zhu XX. Learning to pay attention on spectral domain: a spectral attention module-based convolutional network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2019;58(1):110–22.

Woo S, Park J, Lee JY, So Kweon I. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.

Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 421–9.

Roy AG, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans Med Imaging. 2018;38(2):540–9.

Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 3856–66.

Arun P, Buddhiraju KM, Porwal A. CapsuleNet-based spatial-spectral classifier for hyperspectral images. IEEE J Sel Topics Appl Earth Obs Remote Sens. 2019;12(6):1849–65.

Xinwei L, Lianghao X, Yi Y. Compact video fingerprinting via an improved capsule net. Syst Sci Control Eng. 2020;9:1–9.

Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: a genetic DCNN designer for image classification. Neurocomputing. 2020;379:152–61.

Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020. https://doi.org/10.1109/TPAMI.2020.2983686 .

Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR 2020; 2020. https://www.microsoft.com/en-us/research/publication/higherhrnet-scale-aware-representation-learning-for-bottom-up-human-pose-estimation/ .

Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460 .

Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385 .

Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 3320–8.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. In: International conference on artificial neural networks. Springer; 2018. p. 270–9.

Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

Wang F, Wang H, Wang H, Li G, Situ G. Learning from simulation: an end-to-end deep-learning approach for computational ghost imaging. Opt Express. 2019;27(18):25560–72.

Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177:447–53.

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.

Cook D, Feuz KD, Krishnan NC. Transfer learning for activity recognition: a survey. Knowl Inf Syst. 2013;36(3):537–56.

Cao X, Wang Z, Yan P, Li X. Transfer learning for pedestrian detection. Neurocomputing. 2013;100:51–7.

Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2019. p. 3347–57.

Pham TN, Van Tran L, Dao SVT. Early disease classification of mango leaves using feed-forward neural network and hybrid metaheuristic feature selection. IEEE Access. 2020;8:189960–73.

Saleh AM, Hamoud T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J Big Data. 2021;8(1):1–20.

Hirahara D, Takaya E, Takahara T, Ueda T. Effects of data count and image scaling on deep learning training. PeerJ Comput Sci. 2020;6:312.

Moreno-Barea FJ, Strazzera F, Jerez JM, Urda D, Franco L. Forward noise adjustment scheme for data augmentation. In: 2018 IEEE symposium series on computational intelligence (SSCI). IEEE; 2018. p. 728–34.

Dua D, Karra Taniskidou E. UCI machine learning repository. Irvine: University of California, School of Information and Computer Science; 2017. http://archive.ics.uci.edu/ml .

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.

Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2011. p. 333–44.

Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Trans Cybern. 2013;44(3):445–55.

Wang S, Sun S, Xu J. AUC-maximized deep convolutional neural fields for sequence labeling; 2015. arXiv preprint arXiv:1511.05265 .

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760–9.

Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.

Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2016. p. 3504–12.

Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170387.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.

Pokuri BSS, Ghosal S, Kokate A, Sarkar S, Ganapathysubramanian B. Interpretable deep learning for guided microstructure-property explorations in photovoltaics. NPJ Comput Mater. 2019;5(1):1–11.

Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

Wang L, Nie R, Yu Z, Xin R, Zheng C, Zhang Z, Zhang J, Cai J. An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data. Nat Mach Intell. 2020;2(11):1–11.

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks; 2017. arXiv preprint arXiv:1703.01365 .

Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999;10(3):61–74.

Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557.

Herzog L, Murina E, Dürr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790.

Pereyra G, Tucker G, Chorowski J, Kaiser Ł, Hinton G. Regularizing neural networks by penalizing confident output distributions; 2017. arXiv preprint arXiv:1701.06548 .

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the... AAAI conference on artificial intelligence, vol. 2015. NIH Public Access; 2015. p. 2901.

Li M, Sethi IK. Confidence-based classifier design. Pattern Recogn. 2006;39(7):1230–40.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: ICML, vol. 1. Citeseer; 2001. p. 609–16.

Steinwart I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans Inf Theory. 2005;51(1):128–42.

Lee K, Lee K, Shin J, Lee H. Overcoming catastrophic forgetting with unlabeled data in the wild. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 312–21.

Shmelkov K, Schmid C, Alahari K. Incremental learning of object detectors without catastrophic forgetting. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 3400–09.

Zenke F, Gerstner W, Ganguli S. The temporal paradox of Hebbian learning and homeostatic plasticity. Curr Opin Neurobiol. 2017;43:166–76.

Andersen N, Krauth N, Nabavi S. Hebbian plasticity in vivo: relevance and induction. Curr Opin Neurobiol. 2017;45:188–92.

Zheng R, Chakraborti S. A phase II nonparametric adaptive exponentially weighted moving average control chart. Qual Eng. 2016;28(4):476–90.

Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH. iCaRL: incremental classifier and representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2001–10.

Hinton GE, Plaut DC. Using fast weights to deblur old memories. In: Proceedings of the ninth annual conference of the cognitive science society; 1987. p. 177–86.

Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Soltoggio A, Stanley KO, Risi S. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Netw. 2018;108:48–67.

Parisi GI, Tani J, Weber C, Wermter S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017;96:137–49.

Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35(1):126–36.

Wiedemann S, Kirchhoffer H, Matlage S, Haase P, Marban A, Marinč T, Neumann D, Nguyen T, Schwarz H, Wiegand T, et al. DeepCABAC: a universal compression algorithm for deep neural networks. IEEE J Sel Topics Signal Process. 2020;14(4):700–14.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.

Shawahna A, Sait SM, El-Maleh A. FPGA-based accelerators of deep learning networks for learning and classification: a review. IEEE Access. 2018;7:7823–59.

Min Z. Public welfare organization management system based on FPGA and deep learning. Microprocess Microsyst. 2020;80:103333.

Al-Shamma O, Fadhel MA, Hameed RA, Alzubaidi L, Zhang J. Boosting convolutional neural networks performance based on FPGA accelerator. In: International conference on intelligent systems design and applications. Springer; 2018. p. 509–17.

Han S, Mao H, Dally WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding; 2015. arXiv preprint arXiv:1510.00149 .

Chen Z, Zhang L, Cao Z, Guo J. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Trans Ind Inform. 2018;14(10):4334–42.

Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network; 2015. arXiv preprint arXiv:1503.02531 .

Lenssen JE, Fey M, Libuschewski P. Group equivariant capsule networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 8844–53.

Denton EL, Zaremba W, Bruna J, LeCun Y, Fergus R. Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 1269–77.

Xu Q, Zhang M, Gu Z, Pan G. Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing. 2019;328:69–74.

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. Commun ACM. 2021;64(3):107–15.

Xu X, Jiang X, Ma C, Du P, Li X, Lv S, Yu L, Ni Q, Chen Y, Su J, et al. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering. 2020;6(10):1122–9.

Sharma K, Alsadoon A, Prasad P, Al-Dala’in T, Nguyen TQV, Pham DTH. A novel solution of using deep learning for left ventricle detection: enhanced feature extraction. Comput Methods Programs Biomed. 2020;197:105751.

Zhang G, Wang C, Xu B, Grosse R. Three mechanisms of weight decay regularization; 2018. arXiv preprint arXiv:1810.12281 .

Laurent C, Pereyra G, Brakel P, Zhang Y, Bengio Y. Batch normalized recurrent neural networks. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2016. p. 2657–61.

Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett. 2017;24(3):279–83.

Wang X, Qin Y, Wang Y, Xiang S, Chen H. ReLTanh: an activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis. Neurocomputing. 2019;363:88–98.

Tan HH, Lim KH. Vanishing gradient mitigation with deep learning neural network optimization. In: 2019 7th international conference on smart computing & communications (ICSCC). IEEE; 2019. p. 1–4.

MacDonald G, Godbout A, Gillcash B, Cairns S. Volume-preserving neural networks: a solution to the vanishing gradient problem; 2019. arXiv preprint arXiv:1911.09576 .

Mittal S, Vaishay S. A survey of techniques for optimizing deep learning on GPUs. J Syst Arch. 2019;99:101635.

Kanai S, Fujiwara Y, Iwamura S. Preventing gradient explosions in gated recurrent units. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 435–44.

Hanin B. Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 582–91.

Ribeiro AH, Tiels K, Aguirre LA, Schön T. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In: International conference on artificial intelligence and statistics. PMLR; 2020. p. 2370–80.

D’Amour A, Heller K, Moldovan D, Adlam B, Alipanahi B, Beutel A, Chen C, Deaton J, Eisenstein J, Hoffman MD, et al. Underspecification presents challenges for credibility in modern machine learning; 2020. arXiv preprint arXiv:2011.03395 .

Chea P, Mandell JC. Current applications and future directions of deep learning in musculoskeletal radiology. Skelet Radiol. 2020;49(2):1–15.

Wu X, Sahoo D, Hoi SC. Recent advances in deep learning for object detection. Neurocomputing. 2020;396:39–64.

Kuutti S, Bowden R, Jin Y, Barber P, Fallah S. A survey of deep learning applications to autonomous vehicle control. IEEE Trans Intell Transp Syst. 2020;22:712–33.

Yolcu G, Oztel I, Kazan S, Oz C, Bunyak F. Deep learning-based face analysis system for monitoring customer interest. J Ambient Intell Humaniz Comput. 2020;11(1):237–48.

Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R. A survey of deep learning-based object detection. IEEE Access. 2019;7:128837–68.

Muhammad K, Khan S, Del Ser J, de Albuquerque VHC. Deep learning for multigrade brain tumor classification in smart healthcare systems: a prospective survey. IEEE Trans Neural Netw Learn Syst. 2020;32:507–22.

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.

Mukherjee D, Mondal R, Singh PK, Sarkar R, Bhattacharjee D. EnsemConvNet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed Tools Appl. 2020;79(41):31663–90.

Zeleznik R, Foldyna B, Eslami P, Weiss J, Alexander I, Taron J, Parmar C, Alvi RM, Banerji D, Uno M, et al. Deep convolutional neural networks to predict cardiovascular risk from computed tomography. Nat Commun. 2021;12(1):1–9.

Wang J, Liu Q, Xie H, Yang Z, Zhou H. Boosted EfficientNet: detection of lymph node metastases in breast cancer using convolutional neural networks. Cancers. 2021;13(4):661.

Yu H, Yang LT, Zhang Q, Armstrong D, Deen MJ. Convolutional neural networks for medical image analysis: state-of-the-art, comparisons, improvement and perspectives. Neurocomputing. 2021. https://doi.org/10.1016/j.neucom.2020.04.157 .

Bharati S, Podder P, Mondal MRH. Hybrid deep learning for detecting lung diseases from X-ray images. Inform Med Unlocked. 2020;20:100391.

Dong Y, Pan Y, Zhang J, Xu W. Learning to read chest X-ray images from 16000+ examples using CNN. In: 2017 IEEE/ACM international conference on connected health: applications, systems and engineering technologies (CHASE). IEEE; 2017. p. 51–7.

Rajkomar A, Lingam S, Taylor AG, Blum M, Mongan J. High-throughput classification of radiographs using deep convolutional neural networks. J Digit Imaging. 2017;30(1):95–101.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, et al. Chexnet: radiologist-level pneumonia detection on chest X-rays with deep learning; 2017. arXiv preprint arXiv:1711.05225 .

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2097–106.

Zuo W, Zhou F, Li Z, Wang L. Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access. 2019;7:32510–21.

Shen W, Zhou M, Yang F, Yang C, Tian J. Multi-scale convolutional neural networks for lung nodule classification. In: International conference on information processing in medical imaging. Springer; 2015. p. 588–99.

Li R, Zhang W, Suk HI, Wang L, Li J, Shen D, Ji S. Deep learning based imaging data completion for improved brain disease diagnosis. In: International conference on medical image computing and computer-assisted intervention. Springer; 2014. p. 305–12.

Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal. 2020;63:101694.

Mehmood A, Maqsood M, Bashir M, Shuyuan Y. A deep siamese convolution neural network for multi-class classification of Alzheimer disease. Brain Sci. 2020;10(2):84.

Hosseini-Asl E, Ghazal M, Mahmoud A, Aslantas A, Shalaby A, Casanova M, Barnes G, Gimel’farb G, Keynton R, El-Baz A. Alzheimer’s disease diagnostics by a 3d deeply supervised adaptable convolutional network. Front Biosci. 2018;23:584–96.

Korolev S, Safiullin A, Belyaev M, Dodonova Y. Residual and plain convolutional neural networks for 3D brain MRI classification. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI 2017). IEEE; 2017. p. 835–8.

Alzubaidi L, Fadhel MA, Oleiwi SR, Al-Shamma O, Zhang J. DFU_QUTNet: diabetic foot ulcer classification using novel deep convolutional neural network. Multimed Tools Appl. 2020;79(21):15655–77.

Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, Yap MH. Dfunet: convolutional neural networks for diabetic foot ulcer classification. IEEE Trans Emerg Topics Comput Intell. 2018;4(5):728–39.

Yap MH., Hachiuma R, Alavi A, Brungel R, Goyal M, Zhu H, Cassidy B, Ruckert J, Olshansky M, Huang X, et al. Deep learning in diabetic foot ulcers detection: a comprehensive evaluation; 2020. arXiv preprint arXiv:2010.03341 .

Tulloch J, Zamani R, Akrami M. Machine learning in the prevention, diagnosis and management of diabetic foot ulcers: a systematic review. IEEE Access. 2020;8:198977–9000.

Fadhel MA, Al-Shamma O, Alzubaidi L, Oleiwi SR. Real-time sickle cell anemia diagnosis based hardware accelerator. In: International conference on new trends in information and communications technology applications, Springer; 2020. p. 189–99.

Debelee TG, Kebede SR, Schwenker F, Shewarega ZM. Deep learning in selected cancers’ image analysis—a survey. J Imaging. 2020;6(11):121.

Khan S, Islam N, Jan Z, Din IU, Rodrigues JJC. A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn Lett. 2019;125:1–6.

Alzubaidi L, Hasan RI, Awad FH, Fadhel MA, Alshamma O, Zhang J. Multi-class breast cancer classification by a novel two-branch deep convolutional neural network architecture. In: 2019 12th international conference on developments in eSystems engineering (DeSE). IEEE; 2019. p. 268–73.

Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Comput Med Imaging Gr. 2019;71:90–103.

Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A. Breast cancer histopathology image classification using an ensemble of deep learning models. Sensors. 2020;20(16):4373.

Hosny KM, Kassem MA, Foaud MM. Skin cancer classification using deep learning and transfer learning. In: 2018 9th Cairo international biomedical engineering conference (CIBEC). IEEE; 2018. p. 90–3.

Dorj UO, Lee KK, Choi JY, Lee M. The skin cancer classification using deep convolutional neural network. Multimed Tools Appl. 2018;77(8):9909–24.

Kassem MA, Hosny KM, Fouad MM. Skin lesions classification into eight classes for ISIC 2019 using deep convolutional neural network and transfer learning. IEEE Access. 2020;8:114822–32.

Heidari M, Mirniaharikandehei S, Khuzani AZ, Danala G, Qiu Y, Zheng B. Improving the performance of CNN to predict the likelihood of COVID-19 using chest X-ray images with preprocessing algorithms. Int J Med Inform. 2020;144:104284.

Al-Timemy AH, Khushaba RN, Mosa ZM, Escudero J. An efficient mixture of deep and machine learning models for COVID-19 and tuberculosis detection using X-ray images in resource limited settings 2020. arXiv preprint arXiv:2007.08223 .

Abraham B, Nair MS. Computer-aided detection of COVID-19 from X-ray images using multi-CNN and Bayesnet classifier. Biocybern Biomed Eng. 2020;40(4):1436–45.

Nour M, Cömert Z, Polat K. A novel medical diagnosis model for COVID-19 infection detection based on deep features and Bayesian optimization. Appl Soft Comput. 2020;97:106580.

Mallio CA, Napolitano A, Castiello G, Giordano FM, D’Alessio P, Iozzino M, Sun Y, Angeletti S, Russano M, Santini D, et al. Deep learning algorithm trained with COVID-19 pneumonia also identifies immune checkpoint inhibitor therapy-related pneumonitis. Cancers. 2021;13(4):652.

Fourcade A, Khonsari R. Deep learning in medical image analysis: a third eye for doctors. J Stomatol Oral Maxillofac Surg. 2019;120(4):279–88.

Guo Z, Li X, Huang H, Guo N, Li Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans Radiat Plasma Med Sci. 2019;3(2):162–9.

Thakur N, Yoon H, Chong Y. Current trends of artificial intelligence for colorectal cancer pathology image analysis: a systematic review. Cancers. 2020;12(7):1884.

Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik. 2019;29(2):102–27.

Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019;6(1):113.

Nehme E, Freedman D, Gordon R, Ferdman B, Weiss LE, Alalouf O, Naor T, Orange R, Michaeli T, Shechtman Y. DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning. Nat Methods. 2020;17(7):734–40.

Zulkifley MA, Abdani SR, Zulkifley NH. Pterygium-Net: a deep learning approach to pterygium detection and localization. Multimed Tools Appl. 2019;78(24):34563–84.

Sirazitdinov I, Kholiavchenko M, Mustafaev T, Yixuan Y, Kuleev R, Ibragimov B. Deep neural network ensemble for pneumonia localization from a large-scale chest X-ray database. Comput Electr Eng. 2019;78:388–99.

Zhao W, Shen L, Han B, Yang Y, Cheng K, Toesca DA, Koong AC, Chang DT, Xing L. Markerless pancreatic tumor target localization enabled by deep learning. Int J Radiat Oncol Biol Phys. 2019;105(2):432–9.

Roth HR, Lee CT, Shin HC, Seff A, Kim L, Yao J, Lu L, Summers RM. Anatomy-specific classification of medical images using deep convolutional nets. In: 2015 IEEE 12th international symposium on biomedical imaging (ISBI). IEEE; 2015. p. 101–4.

Shin HC, Orton MR, Collins DJ, Doran SJ, Leach MO. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans Pattern Anal Mach Intell. 2012;35(8):1930–43.

Li Z, Dong M, Wen S, Hu X, Zhou P, Zeng Z. CLU-CNNs: object detection for medical images. Neurocomputing. 2019;350:53–9.

Gao J, Jiang Q, Zhou B, Chen D. Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: an overview. Math Biosci Eng. 2019;16(6):6536.

Article MathSciNet Google Scholar

Lumini A, Nanni L. Review fair comparison of skin detection approaches on publicly available datasets. Expert Syst Appl. 2020. https://doi.org/10.1016/j.eswa.2020.113677 .

Chouhan V, Singh SK, Khamparia A, Gupta D, Tiwari P, Moreira C, Damaševičius R, De Albuquerque VHC. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Appl Sci. 2020;10(2):559.

Apostolopoulos ID, Mpesiana TA. COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med. 2020;43(2):635–40.

Mahmud T, Rahman MA, Fattah SA. CovXNet: a multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization. Comput Biol Med. 2020;122:103869.

Tayarani-N MH. Applications of artificial intelligence in battling against COVID-19: a literature review. Chaos Solitons Fractals. 2020;142:110338.

Toraman S, Alakus TB, Turkoglu I. Convolutional capsnet: a novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks. Chaos Solitons Fractals. 2020;140:110122.

Dascalu A, David E. Skin cancer detection by deep learning and sound analysis algorithms: a prospective clinical study of an elementary dermoscope. EBioMedicine. 2019;43:107–13.

Adegun A, Viriri S. Deep learning techniques for skin lesion analysis and melanoma cancer detection: a survey of state-of-the-art. Artif Intell Rev. 2020;54:1–31.

Zhang N, Cai YX, Wang YY, Tian YT, Wang XL, Badami B. Skin cancer diagnosis based on optimized convolutional neural network. Artif Intell Med. 2020;102:101756.

Thurnhofer-Hemsi K, Domínguez E. A convolutional neural network framework for accurate skin cancer detection. Neural Process Lett. 2020. https://doi.org/10.1007/s11063-020-10364-y .

Jain MS, Massoud TF. Predicting tumour mutational burden from histopathological images using multiscale deep learning. Nat Mach Intell. 2020;2(6):356–62.

Lei H, Liu S, Elazab A, Lei B. Attention-guided multi-branch convolutional neural network for mitosis detection from histopathological images. IEEE J Biomed Health Inform. 2020;25(2):358–70.

Celik Y, Talo M, Yildirim O, Karabatak M, Acharya UR. Automated invasive ductal carcinoma detection based using deep transfer learning with whole-slide images. Pattern Recogn Lett. 2020;133:232–9.

Sebai M, Wang X, Wang T. Maskmitosis: a deep learning framework for fully supervised, weakly supervised, and unsupervised mitosis detection in histopathology images. Med Biol Eng Comput. 2020;58:1603–23.

Sebai M, Wang T, Al-Fadhli SA. Partmitosis: a partially supervised deep learning framework for mitosis detection in breast cancer histopathology images. IEEE Access. 2020;8:45133–47.

Mahmood T, Arsalan M, Owais M, Lee MB, Park KR. Artificial intelligence-based mitosis detection in breast cancer histopathology images using faster R-CNN and deep CNNs. J Clin Med. 2020;9(3):749.

Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2020;67:101813.

Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2013. p. 411–8.

Sirinukunwattana K, Raza SEA, Tsang YW, Snead DR, Cree IA, Rajpoot NM. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1196–206.

Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, Madabhushi A. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging. 2015;35(1):119–30.

Albarqouni S, Baur C, Achilles F, Belagiannis V, Demirci S, Navab N. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1313–21.

Abd-Ellah MK, Awad AI, Khalaf AA, Hamed HF. Two-phase multi-model automatic brain tumour diagnosis system from magnetic resonance images using convolutional neural networks. EURASIP J Image Video Process. 2018;2018(1):97.

Thaha MM, Kumar KPM, Murugan B, Dhanasekeran S, Vijayakarthick P, Selvi AS. Brain tumor segmentation using convolutional neural networks in MRI images. J Med Syst. 2019;43(9):294.

Talo M, Yildirim O, Baloglu UB, Aydin G, Acharya UR. Convolutional neural networks for multi-class brain disease detection using MRI images. Comput Med Imaging Gr. 2019;78:101673.

Gabr RE, Coronado I, Robinson M, Sujit SJ, Datta S, Sun X, Allen WJ, Lublin FD, Wolinsky JS, Narayana PA. Brain and lesion segmentation in multiple sclerosis using fully convolutional neural networks: a large-scale study. Mult Scler J. 2020;26(10):1217–26.

Chen S, Ding C, Liu M. Dual-force convolutional neural networks for accurate brain tumor segmentation. Pattern Recogn. 2019;88:90–100.

Hu K, Gan Q, Zhang Y, Deng S, Xiao F, Huang W, Cao C, Gao X. Brain tumor segmentation using multi-cascaded convolutional neural networks and conditional random field. IEEE Access. 2019;7:92615–29.

Wadhwa A, Bhardwaj A, Verma VS. A review on brain tumor segmentation of MRI images. Magn Reson Imaging. 2019;61:247–59.

Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ. Deep learning for brain MRI segmentation: state of the art and future directions. J Digit Imaging. 2017;30(4):449–59.

Moeskops P, Viergever MA, Mendrik AM, De Vries LS, Benders MJ, Išgum I. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Trans Med Imaging. 2016;35(5):1252–61.

Milletari F, Navab N, Ahmadi SA. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV). IEEE; 2016. p. 565–71.

Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2015. p. 234–41.

Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging. 2016;35(5):1240–51.

Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, Pal C, Jodoin PM, Larochelle H. Brain tumor segmentation with deep neural networks. Med Image Anal. 2017;35:18–31.

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834–48.

Yan Q, Wang B, Gong D, Luo C, Zhao W, Shen J, Shi Q, Jin S, Zhang L, You Z. COVID-19 chest CT image segmentation—a deep convolutional neural network solution; 2020. arXiv preprint arXiv:2004.10987 .

Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S. A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans Med Imaging. 2020;39(8):2653–63.

Khan SH, Sohail A, Khan A, Lee YS. Classification and region analysis of COVID-19 infection using lung CT images and deep convolutional neural networks; 2020. arXiv preprint arXiv:2009.08864 .

Shi F, Wang J, Shi J, Wu Z, Wang Q, Tang Z, He K, Shi Y, Shen D. Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Rev Biomed Eng. 2020;14:4–5.

Santamaría J, Rivero-Cejudo M, Martos-Fernández M, Roca F. An overview on the latest nature-inspired and metaheuristics-based image registration algorithms. Appl Sci. 2020;10(6):1928.

Santamaría J, Cordón O, Damas S. A comparative study of state-of-the-art evolutionary image registration methods for 3D modeling. Comput Vision Image Underst. 2011;115(9):1340–54.

Yumer ME, Mitra NJ. Learning semantic deformation flows with 3D convolutional networks. In: European conference on computer vision. Springer; 2016. p. 294–311.

Ding L, Feng C. Deepmapping: unsupervised map estimation from multiple point clouds. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 8650–9.

Mahadevan S. Imagination machines: a new challenge for artificial intelligence. AAAI. 2018;2018:7988–93.

Wang L, Fang Y. Unsupervised 3D reconstruction from a single image via adversarial learning; 2017. arXiv preprint arXiv:1711.09312 .

Hermoza R, Sipiran I. 3D reconstruction of incomplete archaeological objects using a generative adversarial network. In: Proceedings of computer graphics international 2018. Association for Computing Machinery; 2018. p. 5–11.

Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. Deep learning in medical image registration: a review. Phys Med Biol. 2020;65(20):20TR01.

Haskins G, Kruger U, Yan P. Deep learning in medical image registration: a survey. Mach Vision Appl. 2020;31(1):8.

de Vos BD, Berendsen FF, Viergever MA, Sokooti H, Staring M, Išgum I. A deep learning framework for unsupervised affine and deformable image registration. Med Image Anal. 2019;52:128–43.

Yang X, Kwitt R, Styner M, Niethammer M. Quicksilver: fast predictive image registration—a deep learning approach. NeuroImage. 2017;158:378–96.

Miao S, Wang ZJ, Liao R. A CNN regression approach for real-time 2D/3D registration. IEEE Trans Med Imaging. 2016;35(5):1352–63.

Li P, Pei Y, Guo Y, Ma G, Xu T, Zha H. Non-rigid 2D–3D registration using convolutional autoencoders. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI). IEEE; 2020. p. 700–4.

Zhang J, Yeung SH, Shu Y, He B, Wang W. Efficient memory management for GPU-based deep learning systems; 2019. arXiv preprint arXiv:1903.06631 .

Zhao H, Han Z, Yang Z, Zhang Q, Yang F, Zhou L, Yang M, Lau FC, Wang Y, Xiong Y, et al. Hived: sharing a {GPU} cluster for deep learning with guarantees. In: 14th {USENIX} symposium on operating systems design and implementation ({OSDI} 20); 2020. p. 515–32.

Lin Y, Jiang Z, Gu J, Li W, Dhar S, Ren H, Khailany B, Pan DZ. DREAMPlace: deep learning toolkit-enabled GPU acceleration for modern VLSI placement. IEEE Trans Comput Aided Des Integr Circuits Syst. 2020;40:748–61.

Hossain S, Lee DJ. Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors. 2019;19(15):3371.

Castro FM, Guil N, Marín-Jiménez MJ, Pérez-Serrano J, Ujaldón M. Energy-based tuning of convolutional neural networks on multi-GPUs. Concurr Comput Pract Exp. 2019;31(21):4786.

Gschwend D. Zynqnet: an fpga-accelerated embedded convolutional neural network; 2020. arXiv preprint arXiv:2005.06892 .

Zhang N, Wei X, Chen H, Liu W. FPGA implementation for CNN-based optical remote sensing object detection. Electronics. 2021;10(3):282.

Zhao M, Hu C, Wei F, Wang K, Wang C, Jiang Y. Real-time underwater image recognition with FPGA embedded system for convolutional neural network. Sensors. 2019;19(2):350.

Liu X, Yang J, Zou C, Chen Q, Yan X, Chen Y, Cai C. Collaborative edge computing with FPGA-based CNN accelerators for energy-efficient and time-aware face tracking system. IEEE Trans Comput Soc Syst. 2021. https://doi.org/10.1109/TCSS.2021.3059318 .

Hossin M, Sulaiman M. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5(2):1.

Provost F, Domingos P. Tree induction for probability-based ranking. Mach Learn. 2003;52(3):199–215.

Rakotomamonyj A. Optimizing area under roc with SVMS. In: Proceedings of the European conference on artificial intelligence workshop on ROC curve and artificial intelligence (ROCAI 2004), 2004. p. 71–80.

Mingote V, Miguel A, Ortega A, Lleida E. Optimization of the area under the roc curve using neural network supervectors for text-dependent speaker verification. Comput Speech Lang. 2020;63:101078.

Fawcett T. An introduction to roc analysis. Pattern Recogn Lett. 2006;27(8):861–74.

Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng. 2005;17(3):299–310.

Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45(2):171–86.

Masoudnia S, Mersa O, Araabi BN, Vahabie AH, Sadeghi MA, Ahmadabadi MN. Multi-representational learning for offline signature verification using multi-loss snapshot ensemble of CNNs. Expert Syst Appl. 2019;133:317–30.

Coupé P, Mansencal B, Clément M, Giraud R, de Senneville BD, Ta VT, Lepetit V, Manjon JV. Assemblynet: a large ensemble of CNNs for 3D whole brain MRI segmentation. NeuroImage. 2020;219:117026.

Download references

Acknowledgements

We would like to thank the professors from the Queensland University of Technology and the University of Information Technology and Communications who gave their feedback on the paper.

This research received no external funding.

Author information

Authors and affiliations.

School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia

Laith Alzubaidi & Jinglan Zhang

Control and Systems Engineering Department, University of Technology, Baghdad, 10001, Iraq

Amjad J. Humaidi

Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001, Iraq

Ayad Al-Dujaili

Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO, 65211, USA

Ye Duan & Muthana Al-Amidie

AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001, Iraq

Laith Alzubaidi & Omran Al-Shamma

Department of Computer Science, University of Jaén, 23071, Jaén, Spain

J. Santamaría

College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005, Iraq

Mohammed A. Fadhel

School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD, UK

Laith Farhan

Contributions

Conceptualization: LA and JZ; methodology: LA, JZ, and JS; software: LA and MAF; validation: LA, JZ, MA, and LF; formal analysis: LA, JZ, YD, and JS; investigation: LA and JZ; resources: LA, JZ, and MAF; data curation: LA and OA; writing–original draft preparation: LA and OA; writing–review and editing: LA, JZ, AJH, AA, YD, OA, JS, MAF, MA, and LF; visualization: LA and MAF; supervision: JZ and YD; project administration: JZ, YD, and JS; funding acquisition: LA, AJH, AA, and YD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laith Alzubaidi.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8, 53 (2021). https://doi.org/10.1186/s40537-021-00444-8

Received: 21 January 2021

Accepted: 22 March 2021

Published: 31 March 2021

DOI: https://doi.org/10.1186/s40537-021-00444-8

Deep learning
Machine learning
Convolution neural network (CNN)
Deep neural network architectures
Deep learning applications
Image classification
Medical image analysis
Supervised learning

Open Access

Peer-reviewed

Research Article

Applications of artificial neural networks in health care organizational decision-making: A scoping review

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliations Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Canada, Toronto Health Economics and Technology Assessment (THETA) Collaborative, University Health Network, Toronto, Canada

Roles Formal analysis, Investigation, Methodology, Writing – review & editing

Affiliation Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Canada

Roles Conceptualization, Formal analysis, Investigation, Methodology, Supervision, Writing – review & editing

Nida Shahid,
Tim Rappon,
Whitney Berta

Published: February 19, 2019
https://doi.org/10.1371/journal.pone.0212356

Health care organizations are leveraging machine-learning techniques, such as artificial neural networks (ANN), to improve delivery of care at a reduced cost. Applications of ANN to diagnosis are well-known; however, ANN are increasingly used to inform health care management decisions. We provide a seminal review of the applications of ANN to health care organizational decision-making. We screened 3,397 articles from six databases with coverage of Health Administration, Computer Science and Business Administration. We extracted study characteristics, aim, methodology and context (including level of analysis) from 80 articles meeting inclusion criteria. Articles were published from 1997 to 2018 and originated from 24 countries, with a plurality of papers (26 articles) published by authors from the United States. Types of ANN used included ANN (36 articles), feed-forward networks (25 articles), and hybrid models (23 articles); reported accuracy varied from 50% to 100%. The majority of ANN informed decision-making at the micro level (61 articles), between patients and health care providers. Fewer ANN were deployed for intra-organizational (meso-level, 29 articles) and system, policy or inter-organizational (macro-level, 10 articles) decision-making. Our review identifies key characteristics and drivers for market uptake of ANN for health care organizational decision-making to guide further adoption of this technique.

Citation: Shahid N, Rappon T, Berta W (2019) Applications of artificial neural networks in health care organizational decision-making: A scoping review. PLoS ONE 14(2): e0212356. https://doi.org/10.1371/journal.pone.0212356

Editor: Olalekan Uthman, The University of Warwick, UNITED KINGDOM

Received: October 4, 2018; Accepted: January 31, 2019; Published: February 19, 2019

Copyright: © 2019 Shahid et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

As health care systems in developed countries transform towards a value-based, patient-centered model of care delivery, we face new complexities relating to improving the structure and management of health care delivery; for example, improving integration of processes in care delivery for patient-centered chronic disease management [ 1 ]. Artificial intelligence lies at the nexus of new technologies with the potential to deliver cost-effective and appropriate care in real time, manage effective and efficient communication among multidisciplinary stakeholders, and address non-traditional care settings, the evolving healthcare workplace and workforce, and the advent of new and disparate health information systems. With the rapid uptake of artificial intelligence to make increasingly complex decisions across different industries, there are a multitude of solutions capable of addressing these health care management challenges; however, there is a paucity of guidance on selecting appropriate methods tailored to the health care industry [ 2 ].

Global health care expenditure is expected to reach $8.7 trillion by 2020, driven by aging populations growing in size and disease complexity, advancements made in medical treatments, rising labour costs and the market expansion of the health care industry. Many health systems reportedly struggle to update aging infrastructure and legacy technologies with already limited capital resources. In moving toward value-based care, decision-makers are reported to be strategically shifting their focus to understanding and better aligning financial incentives for health care providers in order to bear financial risk; to population health management, including analyses of trends in health, quality and cost; and to the adoption of innovative delivery models for improved processes and coordination of care.

Health care organizations are required to be increasingly strategic in their management due to a variety of system interdependencies, such as emerging environmental demands and competing priorities, that can complicate the decision-making process [ 3 ]. According to economic theory, most organizations are risk-averse [ 4 ], and decision-makers in health care can face issues related to culture, technology and risk when making high-risk decisions without the certainty of high return [ 4 , 5 ]. Patient care and operations management require the interaction of multiple stakeholders (for example, clinicians, front-line and middle managers, and senior-level executives) to make both clinical decisions (e.g. diagnosis, treatment and therapy, medication prescription and administration) and non-clinical decisions (e.g. budget, resource allocation, technology acquisition, service additions/reductions, strategic planning) [ 6 ].

A white paper published by IBM suggests that with increasing capture and digitization of health care data (e.g. electronic medical records and DNA sequences), health care organizations are taking advantage of analyzing large sets of routinely collected digital information in order to improve service and reduce costs [ 7 ]. Reported examples include analyzing clinical, financial and operational data to answer questions related to the effectiveness of programs and to make predictions regarding at-risk patients. The global market for health care predictive analytics was valued at USD 1.48 billion in 2015 and is expected to grow at a compound annual growth rate (CAGR) of 29.3% by 2025 [ 8 ]. Similarly, global revenue of $811 million is expected to increase at a CAGR of 40% by 2021 due to the artificial intelligence (AI) market for health care applications. The market for machine learning-as-a-service (MLaaS), a subfield of AI, is expected to reach $5.4 billion by 2022, with the health care sector as a notable key driver [ 9 ].
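
To make the growth figure for predictive analytics concrete, the short Python sketch below compounds the 2015 valuation forward at the quoted rate. It is illustrative only: the helper name `project_value` is hypothetical, and it assumes the 29.3% compound annual growth rate applies annually from the 2015 base through 2025, which the cited market report may define differently.

```python
def project_value(base_value: float, annual_rate: float, years: int) -> float:
    """Compound base_value forward by `years` annual periods at `annual_rate`."""
    return base_value * (1.0 + annual_rate) ** years


# Figures quoted in the text: USD 1.48 billion in 2015, 29.3% annual growth, horizon 2025.
# Assumption (not stated in the source): growth compounds yearly from the 2015 base.
base_2015 = 1.48      # USD billions
annual_rate = 0.293   # 29.3% compound annual growth rate
projected_2025 = project_value(base_2015, annual_rate, 2025 - 2015)

print(f"Implied 2025 market size: about USD {projected_2025:.1f} billion")
```

Under these assumptions the market would grow by roughly an order of magnitude over the decade, which shows how strongly a compound rate near 30% accumulates.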

A recent survey of AI applications in health care reported uses in major disease areas such as cancer or cardiology, and identified artificial neural networks (ANN) as a common machine learning technique [ 10 ]. Applications of ANN in health care include clinical diagnosis, prediction of cancer, speech recognition, prediction of length of stay [ 11 ], image analysis and interpretation [ 12 ] (e.g. automated electrocardiographic (ECG) interpretation used to diagnose myocardial infarction [ 13 ]), and drug development [ 12 ]. Non-clinical applications have included improvement of health care organizational management [ 14 ] and prediction of key indicators such as cost or facility utilization [ 15 ]. ANN has been used as part of decision support models to provide health care providers and the health care system with cost-effective solutions to time and resource management [ 16 ].

Despite its many applications and, more recently, its prominence [ 17 ], there is a lack of coherence regarding ANN’s applications and potential to inform decision making at different levels in health care organizations. This review is motivated by the need for a broad understanding of the various applications of ANN in health care and aims to aid researchers interested in bridging the disciplines of organizational behaviour and computer science. Considering the sheer abundance of reported uses and the complexity of the area, it can be challenging to remain abreast of new advancements and trends in applications of ANN [ 18 ]. Adopters of ANN or researchers new to the field of AI may find the scope and esoteric terminology of neural computing particularly challenging [ 18 ]. Literature suggests that current reviews on applications of ANN are limited in scope and generally focus on a specific disease [ 19 ] or a particular type of neural network [ 20 ], or they are too broad (i.e. data mining or AI techniques that can include ANN but do not offer insights specific to ANN) [ 10 ]. The overarching goal of this scoping review is to provide a much-needed comprehensive review of the various applications of ANN in health care organizational decision-making at the micro-, meso-, and macro-levels. The levels pertain to decisions made on the (micro) level of individual patients; on a (meso) group level (e.g. departmental or organizational level), where patient preference may be important but not essential; and on a wider (macro) level by large groups or public organizations, related to allocation or utilization of resources, where decisions are based on public interest and reflective of society as a whole [ 21 ]. By means of this review, we will identify the nature and extent of relevant literature and describe the methodologies and context used.

According to an overview by Kononenko (2001), as a sub-field of AI, machine learning provides indispensable tools for intelligent data analysis. Three major branches of machine learning have emerged since electronic computers came into use during the 1950s and 1960s: statistical methods, symbolic learning and neural networks [ 22 ]. ANN have been successfully used to solve highly complex problems within the physical sciences and, more recently, by scholars in organizational research as digital tools enabling faster processes of data collection and processing [ 23 ]. As practical and flexible modelling tools, ANN have an ability to generalize pattern information to new data, tolerate noisy inputs, and produce reliable and reasonable estimates [ 23 ]. ANN belong to a wide class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems [ 24 ]. ANN are similar to statistical techniques including generalized linear models, nonparametric regression and discriminant analysis, or cluster analysis [ 24 ]. As a statistical model, its general composition is one made of simple, interconnected processing elements that are configured through iterative exposure to sample data [ 23 ]. Its application is particularly valuable under one or more of several conditions: when sample data show complex interaction effects or do not meet parametric assumptions, when the relationship between independent and dependent variables is not strong, when there is a large unexplained variance in information, or in situations where the theoretical basis of prediction is poorly understood [ 23 ]. ANN architectures are commonly classified as feed-forward neural networks (e.g. single-layer perceptron, multi-layer perceptron, radial basis function networks) or feed-back networks, otherwise referred to as recurrent neural networks (e.g. competitive networks, Kohonen’s self-organizing maps, Hopfield networks) [ 25 ].

Artificial neural networks

Originally developed as mathematical theories of the information-processing activity of biological nerve cells, the structural elements used to describe an ANN are conceptually analogous to those used in neuroscience, despite it belonging to a class of statistical procedures [ 23 ].

ANN can have single or multiple layers [ 23 ] and consist of processing units (nodes or neurons) that are interconnected by a set of adjustable weights that allow signals to travel through the network in parallel and consecutively [ 13 , 26 ]. Generally, ANN can be divided into three layers of neurons: input (receives information), hidden (responsible for extracting patterns and performing most of the internal processing), and output (produces and presents the final network outputs) [ 27 ].

A review by Agatonovic-Kustrin & Beresford (2000) describes neural computation as being powered by the connections between its neurons, where each neuron has a weighted input, a transfer function and a single output. The authors state that the neuron is activated by the weighted sum of the inputs it receives and that the activation signal passes through a transfer function to produce a single output. The transfer functions, the learning rule and the architecture determine the overall behaviour of the neural network [ 26 ].
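The behaviour of a single neuron described above can be illustrated with a minimal sketch. The logistic (sigmoid) transfer function, the optional bias term and the example weights are assumptions chosen for illustration; they are not taken from the cited review.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic transfer function; maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias=0.0, transfer=sigmoid):
    """Return the single output of one neuron.

    The activation is the weighted sum of the inputs (plus an optional
    bias), which is then passed through the transfer function.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return transfer(activation)


# Hypothetical example: three inputs with illustrative weights.
y = neuron_output([1.0, 0.5, -1.0], [0.4, 0.2, 0.1])
```

Swapping the `transfer` argument (for example for `math.tanh`) changes the shape of the neuron's response without changing how the weighted sum is formed.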

Architecture

Sharma & Chopra (2013) describe the two most common types of neural networks applied in management sciences to be feed-forward and recurrent neural networks ( Fig 1 ), in comparison with the feed-forward networks common to medical applications [ 28 , 29 ]. A feed-forward network can be single-layered (e.g. Perceptron, ADALINE) or multi-layered (e.g. Multilayer Perceptron, Radial Basis Function) [ 27 , 30 ]. Sharma & Chopra (2013) describe information flow in feed-forward networks as unidirectional, from the input layer through the hidden layers to the output layer, without any feedback. In contrast, a recurrent or feedback network involves dynamic information processing with at least one feedback loop, using outputs as feedback inputs (e.g. Hopfield) [ 27 , 30 ]. Fig 1 illustrates the two types of networks with three layers (input, hidden and output).


https://doi.org/10.1371/journal.pone.0212356.g001
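The difference in information flow between the two architectures can be sketched as follows. This is an illustrative simplification rather than the networks shown in Fig 1: the tanh transfer function, the layer sizes, the weight values and the choice to feed the previous output back as an extra input are all assumptions.

```python
import math


def layer(inputs, weights, biases):
    """One fully connected layer with a tanh transfer function.

    weights[j][i] is the weight from input i to neuron j.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Unidirectional flow: input -> hidden -> output, with no feedback."""
    hidden = layer(x, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)


def recurrent(sequence, w_hidden, b_hidden, w_out, b_out, n_out):
    """Feedback flow: the previous output is appended to the next input."""
    previous = [0.0] * n_out  # no feedback is available before the first step
    outputs = []
    for x in sequence:
        previous = feed_forward(x + previous, w_hidden, b_hidden, w_out, b_out)
        outputs.append(previous)
    return outputs


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
W_HIDDEN_FF = [[0.5, -0.3], [0.8, 0.2]]
# The recurrent variant has one extra column per hidden neuron for the feedback.
W_HIDDEN_REC = [[0.5, -0.3, 0.7], [0.8, 0.2, -0.4]]
B_HIDDEN = [0.0, 0.1]
W_OUT = [[1.0, -1.0]]
B_OUT = [0.0]

ff_result = feed_forward([1.0, 0.0], W_HIDDEN_FF, B_HIDDEN, W_OUT, B_OUT)
rec_results = recurrent([[1.0, 0.0], [1.0, 0.0]], W_HIDDEN_REC, B_HIDDEN,
                        W_OUT, B_OUT, n_out=1)
```

Presenting the same input twice gives the same output from the feed-forward network each time, whereas the recurrent network's second output differs from its first because it also depends on what the network produced previously.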

In an overview of basic concepts, Agatonovic-Kustrin & Beresford (2000) describe how ANN gather knowledge by detecting patterns and relationships in data and “learn” through experience. The authors state that an artificial neural network learns by optimizing its inner unit connections in order to minimize errors in the predictions that it makes and to reach a desired level of accuracy. New information can be inputted into the model once the model has been trained and tested [ 26 ]. Also referred to as the generalized delta rule, backpropagation refers to how an ANN is trained or ‘learns’ based on data. It uses an iterative process involving six steps: (i) single case data is passed to the input layer, and the output is passed to the hidden layer and multiplied by the first set of connection weights; (ii) incoming signals are summed, transformed to output and passed to the second connection weight matrix; (iii) incoming signals are summed and transformed, and the network output is produced; (iv) the output value is subtracted from the known value for that case, and the error term is passed backward through the network; (v) connection weights are adjusted in proportion to their error contribution; (vi) the modified connection weights are saved for the next cycle, and the next case input set is queued for the next cycle [ 23 ]. Sharma & Chopra (2013) broadly classify training or ‘learning’ methods in ANN into three types: supervised, unsupervised and reinforcement learning. In supervised learning, every input pattern used to train the network is associated with an output pattern. The error between computed and desired outputs can be used to improve model performance. In unsupervised learning, the network learns without knowledge of the desired output, by discovering and adapting to features of the input patterns. In reinforcement learning, the network is provided with feedback on its computation performance without being presented with the desired output [ 30 ].
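The six-step backpropagation cycle described above can be sketched for a small network with one hidden layer and a single output neuron. This is a minimal illustration under stated assumptions: a sigmoid transfer function, a squared-error criterion, no bias weights, and a hypothetical learning rate and set of starting weights. It is not the implementation used in any of the cited works.

```python
import math


def sigmoid(z):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def train_case(x, target, w1, w2, rate=0.5):
    """Run one backpropagation cycle for a single case.

    w1[j][i] is the weight from input i to hidden neuron j.
    w2[j] is the weight from hidden neuron j to the output neuron.
    Both are updated in place. Returns the network output computed
    before the weights were adjusted.
    """
    # (i)-(ii) inputs are multiplied by the first weight set, summed, transformed
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    # (iii) hidden signals are multiplied by the second weight set, summed, transformed
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)))
    # (iv) the output is subtracted from the known value; the error term
    #      is passed backward through the network
    delta_out = (target - output) * output * (1.0 - output)
    delta_hidden = [
        delta_out * w2[j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    # (v) weights are adjusted in proportion to their error contribution
    for j, h in enumerate(hidden):
        w2[j] += rate * delta_out * h
        for i, xi in enumerate(x):
            w1[j][i] += rate * delta_hidden[j] * xi
    # (vi) the modified weights persist in w1 and w2 for the next cycle
    return output


# Hypothetical starting weights and a single training case.
w1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.2, -0.1]
first_output = train_case([1.0, 0.5], 1.0, w1, w2)
for _ in range(200):
    last_output = train_case([1.0, 0.5], 1.0, w1, w2)
```

Repeating the cycle moves the network output for this case toward its known value of 1.0. In practice the cycle is run over many cases, and performance is judged on data not used for training.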

Artificial neural networks and regression models

Neural networks are similar to linear regression models in their nature and use. They comprise input (independent or predictor variable) and output (dependent or outcome variable) nodes, and use connection weights (regression coefficients), bias weights (intercept parameters) and cross-entropy (maximum likelihood estimation) to learn or train (parameter estimation) a model [ 31 ]. ANN learn to perform tasks by using inductive learning algorithms requiring massive data sets [ 18 ]. A working paper on the use of ANN in decision support systems states that the structure, quality and quantity of data used are critical for the learning process and that the chosen attributes must be complete, relevant, measurable and independent [ 18 ]. The authors further observe that in business applications, external data sources (e.g. industry and trade databases) are typically used to supplement internal data sources.
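The correspondence listed above is easiest to see for a network with no hidden layer and a single logistic output node, which has the same form as a logistic regression model. The sketch below is illustrative only: the toy data, learning rate and number of epochs are hypothetical, and stochastic gradient descent stands in for whichever estimation procedure a given study used.

```python
import math


def predict(x, weights, bias):
    """Single output node with a logistic transfer function (no hidden layer)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(data, weights, bias):
    """Mean cross-entropy, i.e. the negative mean Bernoulli log-likelihood."""
    total = 0.0
    for x, y in data:
        p = predict(x, weights, bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fit(data, n_features, rate=0.1, epochs=2000):
    """Estimate weights (coefficients) and bias (intercept) by gradient descent."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            # Gradient of the cross-entropy with respect to the summed input z.
            error = predict(x, weights, bias) - y
            weights = [w - rate * error * xi for w, xi in zip(weights, x)]
            bias -= rate * error
    return weights, bias


# Hypothetical toy data: one predictor and a dichotomous outcome.
data = [([0.5], 0), ([1.0], 0), ([1.5], 1), ([2.0], 0), ([2.5], 1), ([3.0], 1)]
weights, bias = fit(data, n_features=1)
```

Here `weights` plays the role of the regression coefficients and `bias` the intercept; minimizing `cross_entropy` is the same objective as maximizing the likelihood of the observed outcomes.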

Classification and prediction modelling

In the book entitled ‘Data Mining: Concepts and Techniques’, classification is defined as the process of finding a model that describes and distinguishes data classes or concepts based on analysis of a set of training data [ 32 ]. The authors write that models called classifiers predict categorical class labels and can be used to predict the class label of objects for which the class label is unknown. Furthermore, the process is described as consisting of a learning step (when a classification model is constructed) and a classification step (when a model is used to predict class labels for given data). Methods include naïve Bayesian classification, support vector machines, and k-nearest-neighbour classification [ 32 ]. Han et al. (2012) suggest that applications can broadly include fraud detection, target marketing, performance prediction, manufacturing and medical diagnosis.
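The two-step process described above can be illustrated with k-nearest-neighbour classification, one of the methods named. The sketch is a simplified assumption-laden example: the class name, the Euclidean distance measure, the value of k and the toy data are all hypothetical and are not drawn from the cited book.

```python
import math
from collections import Counter


class KNearestNeighbour:
    """Minimal k-nearest-neighbour classifier for numeric feature tuples."""

    def __init__(self, k=3):
        self.k = k
        self.training = []

    def fit(self, points, labels):
        """Learning step: k-NN simply stores the labelled training cases."""
        self.training = list(zip(points, labels))
        return self

    def predict(self, point):
        """Classification step: majority label among the k closest cases."""
        nearest = sorted(
            self.training, key=lambda case: math.dist(case[0], point)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical training data with two well-separated classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]
model = KNearestNeighbour(k=3).fit(points, labels)
```

Once the learning step is complete, `model.predict` assigns a class label to an object whose label is unknown, based on the labelled cases closest to it.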

The available data is divided into two sets for cross-validation: a training set used to develop a model and a test set, used to evaluate the model’s performance [ 33 , 34 ]. Appropriate data splitting is a technique commonly used in machine learning in order to minimize poor generalization (also referred to as over-training or over-fitting) of models [ 34 ]. Using more training data improves the classification model, whereas using more test data contributes to estimating error accurately [ 35 ]. Although a 70:30 ratio can typically be used for training/testing size [ 36 ], various statistical sampling techniques ranging from simple (e.g. simple random sampling, trial-and-error) to more deterministic (e.g. CADEX, DUPLEX) can be used to split the data depending on the goals and complexity of the problem [ 34 ].
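A 70:30 split by simple random sampling can be sketched as follows. The function name, the fixed seed and the use of integer case identifiers are assumptions made for illustration; deterministic methods such as CADEX or DUPLEX are not shown.

```python
import random


def train_test_split(cases, train_fraction=0.7, seed=0):
    """Split cases into a training set and a test set by simple random sampling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(cases)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 100 case identifiers.
train, test = train_test_split(range(100))
```

Every case lands in exactly one of the two sets, so the test set remains unseen during model development. Changing `train_fraction` trades model quality against the precision of the error estimate, as noted above.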

Han and colleagues (2012) write that where classification predicts categorical labels, regression is used to predict missing or unavailable numerical data values (rather than discrete class labels). The authors describe regression analysis as a statistical methodology that is often used for numeric prediction and that encompasses the identification of distribution trends based on available data. An example of numeric prediction is when a model is constructed to predict a continuous-valued function or ordered value (as opposed to a class label). Such a model is called a predictor model and typically uses regression analysis [ 32 ].

ANN can be used to perform nonlinear statistical modeling and provide new alternatives to logistic regression, the most commonly used method for developing predictive models for dichotomous outcomes in medicine [ 31 ]. Users require less formal statistical training and the networks are able to detect complex non-linear relationships and interactions between dependent and independent variables. ANN can combine and incorporate literature-based and experimental data to solve problems [ 26 ]. Other advantages of ANN, relative to traditional predictive modeling techniques, include fast and simple operation due to compact representation of knowledge (e.g., weight and threshold value matrices), the ability to operate with noisy or missing information and generalize to similar unseen data, and the ability to learn inductively from training data and process the non-linear functionality critical to dealing with real-world data [ 37 ].

Although ANN do not require knowledge of the data source, they require large training sets due to the numerous estimated weights involved in computation [ 26 ]. They may require lengthy training times, and the use of random weight initializations may lead to different solutions [ 37 ]. Despite successful applications, ANN remain problematic in that they offer us little or no insight into the process(es) by which they learn or the totality of the knowledge embedded in them [ 38 ]. Several limitations of ANN are identified in the literature: they are limited in their ability to explicitly identify possible causal relationships, they are challenging to use in the field, they are prone to overfitting, model development is empirical and may require several attempts to develop an acceptable model [ 37 ], and there are methodological issues related to model development [ 31 ]. In comparing advantages and disadvantages of using ANN to predict medical outcomes, Tu (1996) suggests that logistic regression models can be disseminated to a wider audience, whereas ANN models are less transparent and therefore can be more difficult to communicate and use. Even if published and made available, the connection weight matrices of a trained ANN may be large and difficult for others to interpret and use, whereas logistic regression coefficients can be published for any end user to be able to calculate [ 31 ].

The Arksey & O’Malley framework (2005) was adopted to (i) identify the research question, (ii) identify relevant studies, (iii) select studies, (iv) chart the data and (v) collate, summarize and present findings.

Search strategy

Due to the cross-disciplinary nature of our query, the search strategy was designed to identify literature from multiple databases according to the key disciplines of Health Administration (Medline and Embase), Computer Science (ACM Digital Library and Advanced Technologies & Aerospace Database), and Business and Management (ABI/Inform Global and JSTOR). The selection of the three disciplines reflects the core concepts embedded in our research question: ‘what are the different applications of ANN (Computer Science) in health care organizational decision-making (Health Administration and Business Management)?’

In consultation with a librarian, a comprehensive search syntax was built on the concepts of ‘artificial neural networks’ applied in ‘health care organizational decision-making’ and tailored for each database for optimum results. The final search syntax was based on search terms refined through an iterative process involving examination of a preliminary set of results to ensure relevance ( S1 Appendix ). The search strategy was limited to peer-reviewed publications in English without limitation to the year of publication up until the time of our search (January 2018). Our background search did not identify seminal paper(s) or advancements related to our research question, thereby justifying the rationale for not limiting the search to a specific start date.

Data collection

Screening of articles occurred in two stages. Identified articles were de-duplicated and imported to EndNote as a reference manager and to Covidence, a web-based platform, for screening. The screening inclusion and exclusion criteria were built iteratively via consensus (NS, TR and WB) ( Table 1 ). Titles and abstracts were first screened to include articles with keywords related to and/or in explicit reference to artificial neural networks. Articles were excluded if there was no explicit reference to artificial neural networks, if the application was not in the health care domain or the context of health care organizational decision-making, or if the publication was not peer-reviewed (i.e. grey literature such as conference abstracts and papers, book reviews, newspaper or magazine articles, teaching courses). Table 1 lists the criteria used to screen, include or exclude articles in the review.

https://doi.org/10.1371/journal.pone.0212356.t001

Subsequently, a full-text review of articles that met the initial screening criteria was conducted on the basis of relevance and availability of information for data extraction. In addition to independent review and extraction of articles, two coders (NS and TR) extracted data from a subset of articles for consensus, minimization of error, and clarity between reviewers regarding the choice of data selected for extraction. Information related to study characteristics, aim, methodology (application, taxonomy, accuracy) and context, including organizational level of analysis (micro-, meso- and macro-), was collected and entered into Microsoft Excel for categorization and descriptive analysis. Applications of ANN to make decisions directly between providers and patients were categorized as ‘micro’, any decisions made by a larger group and not directly related to a patient were categorized as ‘meso’, and decisions beyond an organizational group (i.e. across different institutions, a system or countries) were categorized as the ‘macro’ level of decision-making.

Overall, 3,457 articles were imported for screening, out of which (after removal of duplicates) 3,397 were screened for titles and abstracts to give a total of 306 articles used for full-text review ( Fig 2 ). Articles were excluded from data collection for reasons such as: there being no explicit reference to ANN being used (91 articles), the application of ANN not being in the context of health care organizational decision-making (68 articles), on the basis of study exclusion criteria (53 articles) or the articles being irretrievable (8). In total, 80 articles were used for data collection. Fig 2 illustrates the overall review process including the number of articles excluded at each stage.

*Articles excluded for the following reasons: Not ANN or suitable synonym (n = 93), use of ANN unrelated to healthcare organizational decision-making (n = 70), based on iterated exclusion criteria (n = 45), not based on empirical or theoretical research (n = 9), could not access full-text (n = 9).

https://doi.org/10.1371/journal.pone.0212356.g002

Study characteristics

Publication dates ranged from 1997 to 2018 with the number of studies fluctuating each year ( Fig 3A ). Studies were published across 24 countries with the majority of first authors from the United States (26), the United Kingdom and India (7), Taiwan (6) and Italy (5) ( Fig 3B ). Fig 3A and 3B illustrate the number of articles published over the years and across varying countries.

(A) Number of articles by publication year. (B) Number of articles by country.

https://doi.org/10.1371/journal.pone.0212356.g003

Aim and methodology

Main topics or area of interest based on the article’s overall purpose included Organizational Behaviour (18%), Cardiovascular (14%), Infectious Disease and Telemedicine (7%) ( Table 2 ). Topics categorized under ‘Organizational Behaviour’ include: behaviour and perspectives, crisis or risk management, clinical and non-clinical decision-making, and resource management ( S2 Appendix ). Table 2 lists the main topic areas of articles reviewed.

https://doi.org/10.1371/journal.pone.0212356.t002

Applications of ANN were mainly found to be classification (22), prediction (14), and diagnosis (10) ( Fig 4 ). Examples of applications include classification of data in medical databases (i.e. organizing or distinguishing data by relevant categories or concepts) [ 39 ], using a hybrid learning approach for automatic tissue recognition in wound images for accurate wound evaluations [ 40 ], and comparison of soft-computing techniques for diagnosis of heart conditions by processing digitally recorded heart sound signals to extract time and frequency features related to normal and abnormal heart conditions [ 41 ]. Applications for prediction included developing a risk advisor model to predict the chances of diabetes complication according to changes in risk factors [ 42 ], identifying the optimal subset of attributes from a given set of attributes for diagnosis of heart disease [ 43 ], and modelling daily patient arrivals in the Emergency Department [ 44 ]. ANN was applied for diagnosis of disease based on age, sex, body mass index, average blood pressure and blood serum measurements [ 45 ], comparing predictive accuracies of different types of ANN and statistical models for diagnosis of coronary artery disease [ 46 ], diagnosis and risk group assignment for pulmonary tuberculosis among hospitalized patients [ 47 ], and non-invasive diagnosis of early risk in dengue patients [ 48 ]. Other examples include exploring the potential use of mobile phones as a health promotional tool by tracking daily exercise activities of people and using ANN to estimate a user’s movement [ 49 ], or using ANN to identify factors related to treatment and outcomes potentially impacting patient length of stay [ 50 ]. In addition to S2 Appendix , Fig 4 illustrates the various applications of ANN identified in the literature review.

https://doi.org/10.1371/journal.pone.0212356.g004

With respect to nomenclature or taxonomy, authors mostly reported using artificial neural networks (36 articles), feed-forward networks (25 articles), a hybrid model (23 articles), recurrent feedback networks (6 articles) or other (3 articles) ( S2 Appendix ). Various types of data (e.g. patients, cases, images, and signals) and sample sizes were used. Training/testing sets were in ratios of 50:50, 70:30 or 90:10 and the reported accuracy ranged between 50% and 100%.

Context and key findings

ANN was primarily applied to organizational decision-making at the micro level (61 articles), between patients and health care providers; 48 of these articles referenced micro-level decision-making only, with the remainder also referencing meso- or macro-level applications. In total, 29 articles referenced meso-level applications between patients, health care providers, hospital managers and decision-makers, of which 10 referenced the meso level only. A small portion (10) of studies applied ANN at a macro level of decision-making, mainly between policy and decision-makers across multiple facilities or health care systems, of which 2 referenced the macro level only. Micro-level applications of ANN include diagnosing pulmonary tuberculosis among hospitalized patients by health care providers using models developed for classification and risk group assignment [ 47 ], classifying Crohn’s Disease medical images [ 51 ], and analysing recorded ECG signals to trigger an alarm for patients and allow collection and transmission of patient information to health care providers [ 52 ]. Meso-level applications include decision-making among managers involving classification of cost [ 53 ], developing a forecasting model to support health care management decision-making [ 54 ], evaluating the effect of hospital employee motivation on patient satisfaction among patients, providers and hospital managers [ 55 ], and predicting the adoption of radio frequency identification (RFID) technology in clinical settings [ 56 ]. Macro-level applications of ANN include risk-adjustment models for policy-makers of Taiwan’s National Health Insurance program [ 57 ], a global comparison of the perception of corruption in the health care sector [ 58 ], and modelling revenue generation for decision-makers to determine the best indicators of revenue generation in not-for-profit foundations supporting hospitals of varying sizes [ 59 ].

Authors reported that neural networks reduced computation time in comparison to conventional planning algorithms [ 60 ], thereby enabling users to access model output faster in real time, and that they outperformed linear regression models in prediction [ 44 , 56 , 61 – 63 ] and support vector machines in classification [ 64 , 65 ]. Limitations centered around the use of small data sets [ 42 , 53 , 66 – 72 ], limiting the data set to continuous variables [ 69 ], the inability to examine causal relationships [ 56 ] or to have the network explain the weights applied, the appropriateness of decision-making [ 71 , 73 , 74 ], and difficulty in implementation or understanding of the output [ 75 ]. Authors cautioned that ANN be used as a proof of concept rather than as a successful prediction model [ 66 ].

This scoping review provides a comprehensive overview of the various applications of artificial neural networks in health care organizational decision-making. To our knowledge, this is the first attempt to comprehensively describe the use of ANN in health care, from the time of its origins to current day use, on all levels of organizational decision-making.

Prior efforts have concentrated on a specific domain or aspect of health care and/or limited study findings to a period of time. A systematic review on the use of ANN as decision-making tools in the field of cancer reported trends from 1994 to 2003 in clinical diagnosis, prognosis and therapeutic guidance for cancer, and suggested the need for rigorous methodologies in using neural networks [ 19 ]. Another review reported various applications in areas of accounting and finance, health and medicine, engineering and marketing, but focused on feed-forward neural networks and statistical techniques used in prediction and classification problems [ 20 ]. Outside of medicine and health care, Wong et al. conducted literature reviews of ANN used in business (from 1988–1995) [ 76 ] and finance (1990–1996) [ 77 ], at that time describing the promise of neural networks for increasing integration with other existing or developing technologies [ 76 , 77 ]. Data mining is the mathematical core of a larger process of knowledge discovery from databases, otherwise referred to as the ‘KDD process’ [ 78 ]. The main activities involved in the KDD process include (i) integration and cleaning, (ii) selection and transformation, (iii) data mining and (iv) evaluation and interpretation. Data mining pertains to extraction of significant patterns and knowledge discovery and applies inferring algorithms, such as ANN, to pre-processed data to complete data mining tasks such as classification and cluster analysis [ 79 ]. Data mining and machine learning have produced practical applications in areas of analysing medical outcomes, detecting credit card fraud, predicting customer purchase behaviour or predicting personal interests from internet use [ 80 ].
Although limited in scope to the field of infertility, Durairaj & Ranjani (2013) conducted a comparative study of data mining techniques including ANN, suggesting the promise of combining more than one data mining technique for diagnosing or predicting disease [ 81 ].

Due to the primitive nature of computer technology in the mid-20th century, most of the research in machine learning was theoretical or based on construction of special purpose systems [ 18 ]. We found that the application of ANN in health care decision-making began in the late 1990s, with fluctuating use over the years. A number of breakthroughs in the field of computer science and AI bring insight to reported publication patterns [ 82 ]. ANN gained prominence with the publication of a few seminal works including the publication of the backpropagation learning rule for multilayered feed-forward neural networks [ 22 ]. In 1986, backpropagation was proven as a general purpose and simple procedure, powerful enough for a multi-layered neural network to use and construct appropriate internal representations based on incoming data [ 83 ]. A few years later, the ability of neural networks to learn any type of function was demonstrated [ 84 ], suggesting capabilities of neural networks as universal approximators [ 85 ]. During the 1990s, most of the research was largely experimental and the need for use of ANN as a widely-used computer paradigm remained warranted [ 18 ].

With the digitization of health care [ 86 ], hospitals are increasingly able to collect large amounts of data managed across large information systems [ 22 ]. With its ability to process large datasets, machine learning technology is well-suited for analysing medical data and providing effective algorithms [ 22 ]. Considering the prevalent use of medical information systems and medical databases, ANN have found useful applications in biomedical areas in diagnosis and disease monitoring [ 87 ].

Although the backpropagation learning rule enabled the use of neural networks in many hard medical diagnostic tasks, they have been typically used as black box classifiers lacking the transparency of generating knowledge as well as the ability to explain decision-making [ 22 ]. The lack of transparency or interpretability of neural networks continues to be an important problem since health care providers are often unwilling to accept machine recommendations without clarity regarding the underlying rationale [ 88 ]. Prior to 2006, application of neural networks included processing of biomedical signals, for example image and speech processing [ 89 , 90 ], clinical diagnosis, image analysis and interpretation, and drug development [ 87 ]. In 2006, a critical paper described the ability of a neural network to learn faster [ 91 ]. Six years later, the largest deep neural network to date (i.e. depth pertaining to layers of the network), was trained to classify 1.2 million images in record-breaking time as part of the ImageNet Large Scale Visual Recognition Challenge [ 92 ].

The most successful applications of ANN are found in extremely complex medical situations [ 13 ]. We found ANN to be mainly used for classification, prediction and clinical diagnosis in areas of cardiovascular, telemedicine and organizational behaviour. Use of ANN applies to four general areas of cardiovascular medicine: diagnosis and treatment of coronary artery disease, general interpretation of electrocardiography, cardiac image analysis and cardiovascular drug dosing [ 93 ]. Telemedicine offers health care providers elaborate solutions for remote monitoring designed to prevent, diagnose and manage disease and treatment [ 94 ] and can include machine learning techniques to predict clinical parameters such as blood pressure [ 95 ]. Preliminary diagnosis of high-risk patients (for disease or attributes) using neural networks provides hospital administrators with a cost-effective tool for time and resource management [ 16 ].

Neural networks have been used effectively as a tool in complex decision-making in strategic management, specifically in strategic planning and performance, and in assessing decision-making [ 96 ]. In health care, neural network models have been successfully used to predict quality determinants (responsiveness, security, efficiency) influencing adoption of e-government services [ 97 ]. With its ability to discover hidden knowledge and values, scholars have suggested using ANN to improve care performance and facilitate the adoption of ‘Lean thinking’ or value-based decision making in health care [ 87 ]. An example of ANN facilitating Lean thinking adoption in health care contexts is its application to describe ‘information flow’ among cancer patients by modeling the relationship between quality of life evaluations made by patients, pharmacists and nurses [ 87 ]. ‘Flow’ is a key concept in a Lean System and ‘information flow’ is an essential improvement target to the successful operation of a health care system using a Lean approach [ 87 ]. Key success factors or differentiators that define effective machine learning technology in health care include access to extensive data sources, ease of implementation, interpretability and buy-in as well as conformance with privacy standards [ 9 ]. Support vector machines are used to model high-dimensional data and are considered state-of-the-art solutions to problems otherwise not amenable to traditional statistical analysis. Despite its analytic capabilities, wide-scale adoption remains a challenge, mainly due to methodological complexities and scalability challenges [ 98 ]. For example, a systematic review of deep learning models using electronic health record data recently identified challenges related to the temporality (e.g. hidden relationships among clinical variables occurring at short and long term events) and irregularity of information used, which can reduce model performance if not handled appropriately [ 88 ].
Poor interpretability remains a significant challenge with implementing ANN in health care [ 90 ]. Zhang et al. (2018) report that, in comparison to linear models, ANN are not only difficult to interpret but the identification of predictors (input features) important for the model also seems to be a challenge [ 99 ]. Fisher et al. (2016) developed an ANN-based monitoring method evaluating Parkinson’s disease motor symptoms and reported significant challenges with detecting disease states due to the inherent subjectivity underlying the interpretation of disease state descriptors (i.e. the degree of motor symptoms experienced by each patient would likely vary) [ 100 ]. Despite the evident progress in certain areas (e.g. knowledge and temporal representation, machine learning), the adoption of key standards required for integration and knowledge sharing (e.g. controlled terminologies, semantic structuring, standards representing clinical decision logic) has been slow [ 101 ]. Patel et al. (2009) suggest barriers to progress are related to political, fiscal or cultural reasons and not purely technical. A national study on the implementation of Health Information Technology (HIT) in the United States reported a poor understanding of IT staff, informaticians, health information managers and others playing a significant role in implementation of HIT in health care [ 102 ]. Barriers to adoption of HIT include mismatch of return on investment, challenges to workflow in clinical settings, lack of standards and interoperability, and concerns about privacy and confidentiality [ 102 ].

We found that researchers often adopted a hybrid approach when using neural networks. Hybrid approaches (e.g. combining two or more techniques/soft-computing paradigms) are effective in reducing challenges with neural networks when introducing new items to the system or having insufficient data [ 103 ]. ANN learn (supervised, unsupervised or reinforcement) based on the iterative adjustment of connection weights using optimization algorithms such as the backpropagation rule. Challenges related to such algorithms include the necessity of a previously defined architecture for the model and sensitivity to the initial conditions used in training [ 104 ]. A hybrid model of an ANN and decision tree classifier has been used to predict university admissions using data related to student academic merits, background and university admission criteria. Reported advantages of using a hybrid model included higher prediction accuracy rates (error rate of <2%), flexibility and faster performance (0.1 second) in comparison with a model using neural networks only (20 minutes learning time). Another advantage reported was improved generalizability, e.g. ability to understand rules extracted that can be later coded into another type of system [ 105 ]. Literature suggests extensive use of ANN in business applications, in particular in areas related to financial distress and bankruptcy problems, stock price forecasting and decision support [ 106 ]. Hybrid networks have also been developed in business applications to improve performance of standard models [ 106 ]. The integration of ANN with secondary AI and meta-heuristic methods such as fuzzy logic, genetic and bee colony algorithms, or artificial immune systems has been proposed to reduce or eliminate challenges related to ANN (e.g. selection of network topology, initial weights, choice of control parameters) [ 106 ].
Applications of hybrid intelligent systems include robotics, medical diagnosis, speech/natural language understanding and monitoring of manufacturing processes.
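
The iterative weight adjustment described above can be illustrated with a minimal sketch of backpropagation. The XOR task, the network size, the learning rate and the function name `train_xor` are assumptions chosen for illustration and are not taken from any of the reviewed studies; the `seed` argument makes the dependence on the initial weights explicit.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_xor(hidden=4, epochs=2000, lr=0.5, seed=0):
    """Train a 2-input, one-hidden-layer sigmoid network on XOR.

    Returns the mean squared-error loss recorded for each epoch.
    """
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    # Initial conditions: small random connection weights, zero biases.
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in data:
            # Forward pass through the hidden layer and the output unit.
            h = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j])
                 for j in range(hidden)]
            y = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            total += 0.5 * (y - t) ** 2
            # Backward pass: error signals for the output and hidden units.
            dy = (y - t) * y * (1.0 - y)
            dh = [dy * w2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Gradient-descent update of every connection weight and bias.
            for j in range(hidden):
                w2[j] -= lr * dy * h[j]
                w1[j][0] -= lr * dh[j] * x[0]
                w1[j][1] -= lr * dh[j] * x[1]
                b1[j] -= lr * dh[j]
            b2 -= lr * dy
        losses.append(total / len(data))
    return losses


if __name__ == "__main__":
    history = train_xor()
    print("first epoch loss:", history[0], "last epoch loss:", history[-1])
```

Running the sketch with different `seed` values changes the starting weights and can change how quickly the loss falls, which is one way to see the sensitivity to initial conditions mentioned above. The architecture (here, the number of hidden units) also has to be fixed before training starts.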

Our findings suggest a possible correlation between advancements made in the field of ANN and publication rates related to the application of ANN in health care organizational decision-making. Despite the variety of study contexts and applications, ANN continues to be mainly used for classification, prediction and diagnosis. As suggested by the literature, the most commonly used taxonomy of ANN found was the feed-forward neural network. However, our study showed a significant use of hybrid models. ANN’s application to facilitate more micro- and meso-level decision-making compared to macro-level may be explained by the type and volume of data required and available to build an effective model.

Strengths and limitations

A primary strength of this review is its comprehensive scope and search strategy involving multiple databases. Variables selected for data collection were based on bodies of work with similar inquiry and well aligned with the methods of a scoping review. The complex nature of artificial neural networks required a fundamental understanding for the authors who were otherwise novice to the field. Studies included in this review did not always use standardized reporting measures and may include publications of lower quality.

Implications

Practical implications.

Current and anticipated advancements in the field of AI will play an influential role in decision-making related to adopting novel and innovative machine learning based techniques in health care. Clinical applications of AI include analysis of electronic health records, medical image processing, and physician and hospital error reduction [ 107 ]. AI applications in workflow optimization include payer claim processing, network coordination, staff management, training and education, supply costs and management [ 107 ]. For example, the top three applications of greatest near-term value (based on the impact of application, likelihood of adoption and value to health economy) are reported to be robot-assisted surgery (valued at $40B), virtual nursing assistants ($20B) and administrative workflow assistance ($18B) [ 108 ]. Applications with lowest estimated potential value include preliminary diagnosis ($5B), automated image ($3B) and cyber-security ($2B) [ 108 ]. Our findings warrant the understanding of perspectives and beliefs of those adopting ANN-based solutions in clinical and non-clinical decision-making.

Patients and families are accessing health information in real-time with the array of AI or ANN based health care solutions available to them in an open and unstructured market. Clinical applications of ANN-based solutions can have implications for the changing role of health care providers as well as team dynamics and patterns in workflow. The changing role of physicians has been at the forefront of recent debates on AI, with some anticipating the positive impacts of augmenting clinical service with AI based technologies, e.g., enabling early diagnosis, or improving understanding of a patient’s medical history with genetic sequencing [ 109 ]. Literature suggests a need for bridging disciplines in order to enable clinicians to benefit from rapid advancements in technology [ 101 ]. In addition to the implications for clinical decision-making, interprofessional team dynamics and processes can be expected to change. For example, a US based hospital has collaborated with a game development company to create a virtual world in which surgeons are guided through scenarios in the operating room using rules, conditions and scripts to practice making decisions, team communication, and leadership [ 110 ].

As policy-makers adopt strategies towards a value-based, patient-centred model of care delivery, decision-makers are required to consider the readiness of health care organizations for successful implementation and wide-scale adoption of AI or ANN based decision-support tools. Factors such as easier integration with hospital workflows, patient-centric treatment plans leading to improved patient outcomes, elimination of unnecessary hospital procedures and reduced treatment costs can influence wider adoption of AI-based solutions in the health care industry [ 107 ]. Challenges in uptake include the current inability of AI-based solutions to read unstructured data, the perspectives of health care providers using AI-based solutions, and the lack of supportive infrastructure required for wide-scale implementation [ 107 ]. For improved organizational readiness, the governance and operating model of health care organizations need to enable a workforce and culture that will support the use of AI to enhance efficiency, quality and patient outcomes [ 108 ].

Machine learning from unstructured data (e.g. patient health records, photos, reviews, social media data from mobile applications and devices) remains a critical unmet need for hospitals [ 107 , 111 ]. Currently, most of the data in health care is unstructured and difficult to share [ 107 ]. Wide-scale implementation and adoption of AI service solutions requires strong partnerships between AI technology vendors and health care organizations [ 107 ]. Policies encouraging transparency and sharing of core datasets across public and private sectors can stimulate higher levels of innovation-oriented competition and research productivity [ 112 ].

Theoretical implications

Several theoretical implications emerge from our study findings. Healthcare organizations are complex adaptive systems embedded in larger complex adaptive systems [ 113 ]; health care organizational decision-making can appropriately rely on ANN as an internalized rule set. The change of health care delivery from single to multiple settings and providers has led to new complexities around how health care delivery needs are being structured and managed (e.g., support required for delivering collaborative care or patient participatory medicine) [ 1 ]. Traditional decision-making processes based on stable and predictable systems are no longer relevant, due to the complex and emergent nature of contemporary health care delivery systems [ 1 ]. Yet the health care organizational decision-making literature suggests the focus of decision-making persistently remains on problems that are visible, while the larger system within which health care delivery organizations exist remains unacknowledged [ 1 ]. Using complex adaptive systems (CAS) theory to understand the functionality of AI can provide critical insights: first, AI enhances adaptability to change by strengthening communication among agents, which in turn fosters rapid collective response to change, and further, AI possesses the potential to generate a collective memory for social systems within an organization [ 114 ].

The theory of CAS has been used as an alternative approach to improve our understanding and scaling up of health services; CAS theory shifts decision-making towards embracing uncertainty, non-linear processes, varying context and emergent characteristics [ 115 ]. Interdependent organizational factors such as clinical practice, organization, information management, research, education and professional development are built around multiple self-adjusting interacting systems [ 116 ]. Agents (e.g. users of the system) respond to their environment based on internalized rule sets that are not necessarily explicit or shared and need not be understood by another agent [ 116 ]. Although lacking the ability to explain decision-making, ANN-based decision-support tools enable health care organizational decision-makers to respond to complex and emergent environments using incoming and evolving data.

Our study found artificial neural networks can be applied across all levels of health care organizational decision-making. Influenced by advancements in the field, decision-makers are taking advantage of hybrid models of neural networks in efforts to tailor solutions to a given problem. We found ANN-based solutions applied on the meso- and macro-level of decision-making suggesting the promise of its use in contexts involving complex, unstructured or limited information. Successful implementation and adoption may require an improved understanding of the ethical, societal, and economic implications of applying ANN in health care organizational decision-making.

Supporting information

S1 Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist.

https://doi.org/10.1371/journal.pone.0212356.s001

S1 Appendix. Search strategy and syntax.

https://doi.org/10.1371/journal.pone.0212356.s002

S2 Appendix. Summary of findings.

https://doi.org/10.1371/journal.pone.0212356.s003

S3 Appendix. Glossary of terms.

https://doi.org/10.1371/journal.pone.0212356.s004

S1 Workflow. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart.

https://doi.org/10.1371/journal.pone.0212356.s005

2. Deloitte. Global health care outlook: The evolution of smart health care. 2018.
5. French R, Rayner C, G R, Rumbles S, Schermerhorn J, Hunt J, et al. Organizational Behaviour. 2nd ed. New York: John Wiley & Sons; 2011.
6. Awowale A. Decision Making in Healthcare Systems: Roles and Responsibilities: University of Maryland; 2017.
7. IBM Corporation. Data-driven healthcare organizations use big data analytics for big gains. Somers, NY: IBM Corporation, 2013.
8. Grand View Research. Healthcare Predictive Analytics Market Analysis By Application (Operations Management, Financial, Population Health, Clinical), By End-Use (Payers, Providers), By Region (North America, Europe, Asia Pacific, Latin America, MEA) And Segment Forecasts, 2018–2025. 2016 November 2016. Report No.
9. Tatcher L. The Dangers of Commoditized Machine Learning in Healthcare: 5 Key Differentiators that Lead to Success: Healthcare.ai; 2018 [cited 2018]. Available from: https://healthcare.ai/dangers-of-commoditized-machine-learning-in-healthcare/ .
12. Sordo M. Introduction to Neural Networks in Healthcare. 2002.
18. Schocken S, Ariav G. Neural Networks for Decision Support Systems: Problems and Opportunities. New York: Center for Research on Information Systems, Stern School of Business, New York University, 1991.
23. Scarborough D, Somers MJ. Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behaviour. Washington, D.C.: American Psychological Association; 2006.
24. Sarle W, editor. Neural Networks and Statistical Models. Nineteenth Annual SAS Users Group International Conference; 1994; Cary, NC, USA.
27. da Silva IN, Hernane Spatti S, Andrade Flauzino R. Chapter 2: Artificial Neural Network Architectures and Training Processes. Artificial Neural Networks: A Practical Course. Switzerland: Springer International Publishing 2017.
32. Han J, Pei J, Kamber M. Data Mining: Concepts and Techniques. Waltham, MA: Elsevier Inc.; 2012.
34. Reitermanova Z. Data Splitting. WDS'10 Proceedings of Contributed Papers. 2010;Part 1:31–6. 978-80-7378-139-2.
36. Amazon. Amazon Machine Learning: Developer Guide. 2018.
78. Maimon OZ, Rokach L. Data Mining and Knowledge Discovery Handbook. 2nd ed. New York: Springer; 2010.
87. Moghimi FH, Wickramasinghe N. Chapter 2: Artificial Neural Network for Excellence to Facilitate Lean Thinking Adoption in Healthcare Contexts. Lean Thinking for Healthcare. New York: Springer Science; 2014.
89. Begg R, Kamruzzaman J, Sarker R. Neural Networks in Healthcare: Potential and Challenges. Hershey, United States: Idea Group Publishing; 2006.
90. Ferguson J. Neural Networks in Healthcare 2018 [cited 2018 26 March 2018]. Available from: https://royaljay.com/healthcare/neural-networks-in-healthcare/ .
92. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. 2012.
94. Lymberis A, editor. Smart Wearables for Remote Health Monitoring, from Prevention to Rehabilitation: Current R&D, Future Challenges. 4th Annual IEEE Conference on Information Technology Applications in Biomedicine; 2003; United Kingdom.
98. Milenova B, Yarmus JS, Campos MM, editors. SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines. Proceedings of the 31st VLDB Conference; 2005; Trondheim, Norway.
104. Gutierrez PA, Hervas-Martinez C. Hybrid Artificial Neural Networks: Models, Algorithms and Data. IWANN Berlin Heidelberg: Springer-Verlag; 2011. p. 177–94.
105. Fong S, Si YW, Biuk-Aghai RP. Applying a Hybrid Model of Neural Network and Decision Tree Classifier for Predicting University Admission. IEEE; 2009.
107. Buttar HS, Rajan V. From $600 M to $6 Billion, Artificial Intelligence Systems Poised for Dramatic Market Expansion in Healthcare. Frost & Sullivan, 2016.
108. Collier M, Fu R, Yin L, Christiansen P. Artificial Intelligence: Healthcare's New Nervous System. Accenture, 2017.
109. Reller R. AI's revolutionary role in healthcare 2017 [cited 2017 1 December 2017]. Available from: https://www.elsevier.com/connect/ais-revolutionary-role-in-healthcare .
110. Smith R. Artificial intelligence: coming soon to a hospital near you: STAT; 2017 [24 April 2018]. Available from: https://www.statnews.com/2017/04/13/artificial-intelligence-surgeons-hospital/ .
112. Cockburn IM, Henderson R, Stern S. The Impact of Artificial Intelligence on Innovation: An Exploratory Analysis. NBER Conference on Research Issues in Artificial Intelligence; Toronto; 2017.


Multimodal neurons in artificial neural networks

We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.


Fifteen years ago, Quiroga et al. [^reference-1] discovered that the human brain possesses multimodal neurons. These neurons respond to clusters of abstract concepts centered around a common high-level theme, rather than any specific visual feature. The most famous of these was the “Halle Berry” neuron, a neuron featured in both Scientific American and The New York Times, that responds to photographs, sketches, and the text “Halle Berry” (but not other names).

Two months ago, OpenAI announced CLIP, a general-purpose vision system that matches the performance of a ResNet-50, [^reference-2] but outperforms existing vision systems on some of the most challenging datasets. Each of these challenge datasets, ObjectNet, ImageNet Rendition, and ImageNet Sketch, stress tests the model’s robustness not just to simple distortions or changes in lighting or pose, but also to complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.

Now, we’re releasing our discovery of the presence of multimodal neurons in CLIP. One such neuron, for example, is a “Spider-Man” neuron (bearing a remarkable resemblance to the “Halle Berry” neuron) that responds to an image of a spider, an image of the text “spider,” and the comic book character “Spider-Man” either in costume or illustrated.

Our discovery of multimodal neurons in CLIP gives us a clue as to what may be a common mechanism of both synthetic and natural vision systems—abstraction. We discover that the highest layers of CLIP organize images as a loose semantic collection of ideas, providing a simple explanation for both the model’s versatility and the representation’s compactness.

Using the tools of interpretability, we give an unprecedented look into the rich visual concepts that exist within the weights of CLIP. Within CLIP, we discover high-level concepts that span a large subset of the human visual lexicon—geographical regions, facial expressions, religious iconography, famous people and more. By probing what each neuron affects downstream, we can get a glimpse into how CLIP performs its classification.

Multimodal neurons in CLIP

Our paper builds on nearly a decade of research into interpreting convolutional networks, [^reference-3] [^reference-4] [^reference-5] [^reference-6] [^reference-7] [^reference-8] [^reference-9] [^reference-10] [^reference-11] [^reference-12] beginning with the observation that many of these classical techniques are directly applicable to CLIP. We employ two tools to understand the activations of the model: feature visualization, [^reference-6] [^reference-5] [^reference-12] which maximizes the neuron’s firing by doing gradient-based optimization on the input, and dataset examples, [^reference-4] which looks at the distribution of maximal activating images for a neuron from a dataset.

Using these simple techniques, we’ve found the majority of the neurons in CLIP RN50x4 (a ResNet-50 scaled up 4x using the EfficientNet scaling rule) to be readily interpretable. Indeed, these neurons appear to be extreme examples of “multi-faceted neurons,” [^reference-11] neurons that respond to multiple distinct cases, only at a higher level of abstraction.
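
As a rough illustration of the feature visualization idea, the sketch below runs gradient ascent on the input of a single toy neuron, tanh(w · x), while keeping the input inside the unit ball. The neuron, its weights and the function names are hypothetical. Feature visualization on CLIP optimizes an image through the full network, typically with additional regularization, all of which this toy omits.

```python
import math
import random


def activation(w, x):
    """Toy neuron: tanh of the dot product between weights and input."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def visualize_neuron(w, steps=200, lr=0.1, seed=0):
    """Find an input that strongly excites the toy neuron by gradient ascent."""
    rng = random.Random(seed)
    # Start from a near-zero random input, analogous to starting from noise.
    x = [rng.uniform(-0.01, 0.01) for _ in w]
    for _ in range(steps):
        a = activation(w, x)
        # d tanh(w.x) / dx = (1 - tanh(w.x)^2) * w
        grad = [(1.0 - a * a) * wi for wi in w]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
        # Project back onto the unit ball so the input stays bounded.
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm > 1.0:
            x = [xi / norm for xi in x]
    return x


if __name__ == "__main__":
    weights = [0.5, -1.0, 2.0]  # hypothetical neuron weights
    best_input = visualize_neuron(weights)
    print("optimized input:", best_input)
    print("activation:", activation(weights, best_input))
```

For this linear-plus-tanh neuron the optimized input simply lines up with the weight vector; in a deep network the same procedure yields the much richer images referred to as feature visualizations.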

[Figure: feature visualizations of selected neurons, with labels including “self + relief”, “child’s drawing”, “West Africa” and “Architecture”.]

Selected neurons from the final layer of four CLIP models. Each neuron is represented by a feature visualization with human-chosen concept labels to help quickly provide a sense of each neuron. Labels were picked after looking at hundreds of stimuli that activate the neuron, in addition to feature visualizations. We chose to include some of the examples here to demonstrate the model’s proclivity towards stereotypical depictions of regions, emotions, and other concepts. We also see discrepancies in the level of neuronal resolution: while certain countries like the US and India were associated with well-defined neurons, the same was not true of countries in Africa, where neurons tended to fire for entire regions. We discuss some of these biases and their implications in later sections.

Indeed, we were surprised to find many of these categories appear to mirror neurons in the medial temporal lobe documented in epilepsy patients with intracranial depth electrodes. These include neurons that respond to emotions, [^reference-17] animals, [^reference-18] and famous people. [^reference-1]

But our investigation into CLIP reveals many more such strange and wonderful abstractions, including neurons that appear to count [ 17 , 202 , 310 ], neurons responding to art styles [ 75 , 587 , 122 ], even images with evidence of digital alteration [ 1640 ].

Absent concepts

While this analysis shows a great breadth of concepts, we note that a simple analysis on a neuron level cannot represent a complete documentation of the model’s behavior. The authors of CLIP have demonstrated, for example, that the model is capable of very precise geolocation, [^reference-19] (Appendix E.4, Figure 20) with a granularity that extends down to the level of a city and even a neighborhood. In fact, we offer an anecdote: we have noticed, by running our own personal photos through CLIP, that CLIP can often recognize if a photo was taken in San Francisco, and sometimes even the neighborhood (e.g., “Twin Peaks”).

Despite our best efforts, however, we have not found a “San Francisco” neuron, nor did it seem from attribution that San Francisco decomposes nicely into meaningful unit concepts like “California” and “city.” We believe this information to be encoded within the activations of the model somewhere, but in a more exotic way, either as a direction or as some other more complex manifold. We believe this to be a fruitful direction for further research.

How multimodal neurons compose

These multimodal neurons can give us insight into how CLIP performs classification. With a sparse linear probe, [^reference-19] we can easily inspect CLIP’s weights to see which concepts combine to produce a final ImageNet classification.
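
To make the idea of a sparse linear probe concrete, here is a minimal sketch that assumes the probe is an L1-regularized logistic regression fitted by proximal gradient descent. The post does not specify the fitting procedure, and the feature matrix below is a made-up stand-in for neuron activations, so every name and number here is illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_probe(features, labels, l1=0.2, lr=0.5, steps=300):
    """Fit logistic-regression weights with an L1 penalty (no bias term)."""
    n, d = len(features), len(features[0])
    w = [0.0] * d
    for _ in range(steps):
        # Gradient of the mean logistic loss with respect to each weight.
        grad = [0.0] * d
        for x, y in zip(features, labels):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        # Gradient step followed by soft-thresholding, which zeroes weak weights.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad)]
    return w


# Hypothetical activations of three "neurons" on eight inputs.
# Column 0 determines the label, column 1 is weakly correlated with it,
# and column 2 is unrelated.
FEATURES = [
    (1, 1, 1), (1, 1, -1), (1, 1, 1), (1, -1, -1),
    (-1, -1, 1), (-1, -1, -1), (-1, 1, 1), (-1, 1, -1),
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print(sparse_probe(FEATURES, LABELS))
```

The L1 penalty leaves a nonzero weight only on the feature that actually drives the label, which is what makes such a probe easy to read: the surviving weights name the few units a classification depends on.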

For text classification, a key observation is that these concepts are contained within neurons in a way that, similar to the word2vec objective, [^reference-20] is almost linear. The concepts, therefore, form a simple algebra that behaves similarly to a linear probe. By linearizing the attention, we too can inspect any sentence, much like a linear probe.

Fallacies of abstraction

The degree of abstraction in CLIP surfaces a new vector of attack that we believe has not manifested in previous systems. Like many deep networks, the representations at the highest layers of the model are completely dominated by such high-level abstractions. What distinguishes CLIP, however, is a matter of degree—CLIP’s multimodal neurons generalize across the literal and the iconic, which may be a double-edged sword.

Through a series of carefully constructed experiments, we demonstrate that we can exploit this reductive behavior to fool the model into making absurd classifications. We have observed that the excitations of the neurons in CLIP are often controllable by its response to images of text, providing a simple vector for attacking the model.

The finance neuron [ 1330 ], for example, responds to images of piggy banks, but also responds to the string “$$$”. By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.

Attacks in the wild

We refer to these attacks as typographic attacks. We believe attacks such as those described above are far from simply an academic concern. By exploiting the model’s ability to read text robustly, we find that even photographs of hand-written text can often fool the model. Like the Adversarial Patch, [^reference-21] this attack works in the wild; but unlike such attacks, it requires no more technology than pen and paper.

We also believe that these attacks may take a more subtle, less conspicuous form. An image, given to CLIP, is abstracted in many subtle and sophisticated ways, and these abstractions may over-abstract common patterns, oversimplifying and, by virtue of that, overgeneralizing.

Bias and overgeneralization

Our model, despite being trained on a curated subset of the internet, still inherits its many unchecked biases and associations. Many associations we have discovered appear to be benign, yet we have discovered several cases where CLIP holds associations that could result in representational harm, such as denigration of certain individuals or groups.

We have observed, for example, a “Middle East” neuron [1895] with an association with terrorism; and an “immigration” neuron [395] that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas [ 1257 ], mirroring earlier photo tagging incidents in other models we consider unacceptable. [^reference-22]

These associations present obvious challenges to applications of such powerful visual systems. [^footnote-1] Whether fine-tuned or used zero-shot, it is likely that these biases and associations will remain in the system, with their effects manifesting in both visible and nearly invisible ways during deployment. Many biased behaviors may be difficult to anticipate a priori, making their measurement and correction difficult. We believe that these tools of interpretability may give practitioners the ability to preempt potential problems, by discovering some of these associations and ambiguities ahead of time.

Our own understanding of CLIP is still evolving, and we are still determining if and how we would release large versions of CLIP. We hope that further community exploration of the released versions as well as the tools we are announcing today will help advance general understanding of multimodal systems, as well as inform our own decision-making.

Alongside the publication of “Multimodal Neurons in Artificial Neural Networks,” we are also releasing some of the tools we have ourselves used to understand CLIP—the OpenAI Microscope catalog has been updated with feature visualizations, dataset examples, and text feature visualizations for every neuron in CLIP RN50x4. We are also releasing the weights of CLIP RN50x4 and RN101 to further accommodate such research. We believe these investigations of CLIP only scratch the surface in understanding CLIP’s behavior, and we invite the research community to join in improving our understanding of CLIP and models like it.

Gabriel Goh
Chelsea Voss
Daniela Amodei
Shan Carter
Michael Petrov
Justin Jay Wang
Nick Cammarata

Acknowledgments

Sandhini Agarwal, Greg Brockman, Miles Brundage, Jeff Clune, Steve Dowling, Jonathan Gordon, Gretchen Krueger, Faiz Mandviwalla, Vedant Misra, Reiichiro Nakano, Ashley Pilipiszyn, Alec Radford, Aditya Ramesh, Pranav Shyam, Ilya Sutskever, Martin Wattenberg & Hannah Wong

24 July 2019
Clarification 21 August 2019

How AI and neuroscience drive each other forwards

Neil Savage

Neil Savage is a science and technology journalist in Lowell, Massachusetts.


Chethan Pandarinath wants to enable people with paralysed limbs to reach out and grasp with a robotic arm as naturally as they would their own. To help him meet this goal, he has collected recordings of brain activity in people with paralysis. His hope, which is shared by many other researchers, is that he will be able to identify the patterns of electrical activity in neurons that correspond to a person’s attempts to move their arm in a particular way, so that the instruction can then be fed to a prosthesis. Essentially, he wants to read their minds.


Nature 571 , S15-S17 (2019)

doi: https://doi.org/10.1038/d41586-019-02212-4

This article is part of Nature Outlook: The brain , an editorially independent supplement produced with the financial support of third parties. About this content .

Updates & Corrections

Clarification 21 August 2019 : An earlier version of this article omitted one of Chethan Pandarinath’s affiliations.




Dement Neurocogn Disord
v.17(3); 2018 Sep

Artificial Neural Network: Understanding the Basic Concepts without Mathematics

Su-Hyun Han

1 Department of Neurology, Chung-Ang University College of Medicine, Seoul, Korea.

Ko Woon Kim

2 Department of Neurology, Chonbuk National University Hospital, Jeonju, Korea.

SangYun Kim

3 Department of Neurology, Seoul National University College of Medicine and Seoul National University Bundang Hospital, Seongnam, Korea.

Young Chul Youn

Machine learning is where a machine (i.e., computer) determines for itself how input data is processed and predicts outcomes when provided with new data. An artificial neural network is a machine learning algorithm based on the concept of a human neuron. The purpose of this review is to explain the fundamental concepts of artificial neural networks.

WHAT IS THE DIFFERENCE BETWEEN MACHINE LEARNING AND A COMPUTER PROGRAM?

A computer program takes input data, processes the data, and outputs a result. A programmer stipulates how the input data should be processed. In machine learning, input and output data are provided, and the machine determines the process by which the given input produces the given output data. 1 , 2 , 3 This process can then predict the unknown output when new input data is provided. 1 , 2 , 3

So, how does the machine determine the process? Consider a linear model, y = Wx + b . If we are given the x and y values shown in Table 1 , we know intuitively that W = 2 and b = 0.

However, how does a computer know what W and b are? W and b are randomly generated initially. For example, if we initially say W = 1 and b = 0 and input the x values from Table 1 , the predicted y values do not match the y values in Table 1 . At this point, there is a difference between the correct y values and the predicted y values; the machine gradually adjusts the values of W and b to reduce this difference. The difference between the predicted values and the correct values is measured by the cost function. Minimizing the cost function makes predictions closer to the correct answers. 4 , 5 , 6 , 7

The costs that correspond to given W and b values are shown in Fig. 1 (obtained from https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/ ). 8 , 9 , 10 Given the current values of W and b , the gradient is obtained. If the gradient is positive, the values of W and b are decreased. If the gradient is negative, the values of W and b are increased. In other words, the values of W and b are determined by the derivative of the cost function. 8 , 9 , 10
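The update rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Table 1 is not reproduced here, so the x and y values below are hypothetical data chosen to be consistent with W = 2 and b = 0, and the mean squared error cost, the learning rate, and the number of steps are likewise assumptions rather than the authors' settings.

```python
# Hypothetical data consistent with W = 2, b = 0 (Table 1 is not reproduced here).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def cost(w, b):
    """Mean squared difference between predicted and correct y values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Partial derivatives of the cost with respect to W and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def fit(w=1.0, b=0.0, learning_rate=0.05, steps=5000):
    """Start from W = 1, b = 0 and repeatedly step against the gradient."""
    for _ in range(steps):
        dw, db = gradients(w, b)
        # A positive gradient decreases the parameter; a negative one increases it.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w, b = fit()
print(round(w, 3), round(b, 3), cost(w, b))
```

With these assumed values the parameters move from the starting guess toward W = 2 and b = 0, and the cost shrinks toward zero.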


ARTIFICIAL NEURAL NETWORK

The basic unit by which the brain works is a neuron. Neurons transmit electrical signals (action potentials) from one end to the other. 11 That is, electrical signals are transmitted from the dendrites to the axon terminals through the axon body. In this way, the electrical signals continue to be transmitted across the synapse from one neuron to another. The human brain has approximately 100 billion neurons. 11 , 12 It is difficult to imitate this level of complexity with existing computers. However, Drosophila have approximately 100,000 neurons, and they are able to find food, avoid danger, survive, and reproduce. 13 Nematodes have 302 neurons and they survive well. 14 This is a level of complexity that can be replicated well even with today's computers. However, nematodes can perform much better than our computers.

Let's think of the operative principles of neurons. Neurons receive signals and generate other signals. That is, they receive input data, perform some processing, and give an output. 11 , 12 However, the output is not given at a constant rate; the output is generated when the input exceeds a certain threshold. 15 The function that receives an input signal and produces an output signal after a certain threshold value is called an activation function. 11 , 15 As shown in Fig. 2 , when the input value is small, the output value is 0, and once the input value rises above the threshold value, a non-zero output value suddenly appears. Thus, the responses of biological neurons and artificial neurons (nodes) are similar. However, in reality, artificial neural networks use various activation functions other than this step function, and many of them use sigmoid functions, 4 , 6 which are also called logistic functions. The sigmoid function has the advantage that it is relatively simple to calculate. Currently, artificial neural networks predominantly use a weight modification method in the learning process. 4 , 5 , 7 In the course of modifying the weights, every layer requires an activation function that can be differentiated. This is why the step function cannot be used as it is. The sigmoid function is expressed as the following equation: 6 , 16

$y = \frac{1}{1 + e^{ - x}}$


Biological neurons receive multiple inputs from pre-synaptic neurons. 11 Neurons in artificial neural networks (nodes) also receive multiple inputs, then they add them and process the sum with a sigmoid function. 5 , 7 The value processed by the sigmoid function then becomes the output value. As shown in Fig. 3 , if the sum of inputs A, B, and C exceeds the threshold and the sigmoid function works, this neuron generates an output value. 4


A neuron receives input from multiple neurons and transmits signals to multiple neurons. 4 , 5 The models of biological neurons and an artificial neural network are shown in Fig. 4 . Neurons are located over several layers, and one neuron is considered to be connected to multiple neurons. In an artificial neural network, the first layer (input layer) has input neurons that transfer data via synapses to the second layer (hidden layer), and similarly, the hidden layer transfers this data to the third layer (output layer) via more synapses. 4 , 5 , 7 The hidden layer (node) is called the “black box” because we are unable to interpret how an artificial neural network derived a particular result. 4 , 5 , 6


So, how does a neural network learn in this structure? Some variables must be updated during the learning process in order to increase the accuracy of the output values. 4 , 5 , 7 One could adjust the sum of the input values or modify the sigmoid function, but it is more practical to adjust the connection strength between the neurons (nodes). 4 , 5 , 7 Fig. 5 is a reformulation of Fig. 4 with weights applied to the connections. Low weights weaken the signal, and high weights enhance the signal. 17 Learning is the process by which the network adjusts the weights to improve the output. 4 , 5 , 7 In Fig. 5 , the connections with weights $W_{1,2}$ , $W_{1,3}$ , $W_{2,3}$ , $W_{2,2}$ , and $W_{3,2}$ are emphasized because their high weights strengthen the signal. If a weight W is 0, the signal is not transmitted, and that connection cannot influence the network.
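The role of the weights can be illustrated with a single artificial neuron that multiplies each input by its connection weight, sums the results, and passes the sum through the sigmoid. The sketch below is a minimal illustration; the input signals and weight values are hypothetical and are not taken from Fig. 5.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights):
    """Weighted sum of the inputs passed through the sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(total)


# Hypothetical signals A, B, C arriving at one node.
inputs = [1.0, 0.5, -1.0]

# High weights on A and B strengthen their signals; a weight of 0 blocks C.
strong = neuron_output(inputs, [2.0, 2.0, 0.0])
# Changing C has no effect while its weight is 0.
same = neuron_output([1.0, 0.5, 5.0], [2.0, 2.0, 0.0])
# With all weights at 0, no signal is transmitted and the sigmoid sees a sum of 0.
silent = neuron_output(inputs, [0.0, 0.0, 0.0])

print(strong, same, silent)
```

In this sketch, learning would amount to changing the values in the weight list, which is the quantity the network adjusts.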


Multiple nodes and connections could affect predicted values and their errors. In this case, how do we update the weights to get the correct output, and how does learning work?

The updating of the weights is determined by the error between the predicted output and the correct output. 4 , 5 , 7 The error is divided in proportion to the weights on the links, and the divided errors are back propagated and reassembled ( Fig. 6 ). However, in a hierarchical structure, it is extremely difficult to calculate all the weights directly. As an alternative, we can use the gradient descent method 8 , 9 , 10 to reach the correct answer, even if we do not know the complex mathematical calculations ( Fig. 1 ). The gradient descent method is a technique for finding the lowest point of a cost function (the difference between the answer and the value predicted with an arbitrary starting weight W ). Again, the machine can start with any W value and alter it gradually (so that it goes down the graph) so that the cost is reduced and finally reaches a minimum. Without complicated mathematical calculations, this minimizes the error between the predicted value and the answer (or it shows that the arbitrary starting weight is correct). This is the learning process of artificial neural networks.
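For readers who want to see the idea in code, the following is a minimal sketch of error backpropagation for a very small network (two inputs, two hidden nodes, one output, all with sigmoid activations and no bias terms). It uses the standard chain-rule formulation with a squared-error loss, which is an assumption on our part: the text above describes the procedure only informally, and the network size, starting weights, and learning rate are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in w_hidden]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output


def loss(x, target, w_hidden, w_out):
    """Half the squared error between the predicted and the correct output."""
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * (output - target) ** 2


def backprop(x, target, w_hidden, w_out):
    """Gradients of the loss with respect to every weight."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # The output error is passed back to each hidden node through its weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def train_step(x, target, w_hidden, w_out, learning_rate=0.5):
    """One gradient descent update of all weights."""
    grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
    new_hidden = [[w - learning_rate * g for w, g in zip(row, grad_row)]
                  for row, grad_row in zip(w_hidden, grad_hidden)]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


# Hypothetical training example and starting weights.
x, target = [1.0, 0.5], 1.0
w_hidden = [[0.2, -0.4], [0.7, 0.1]]
w_out = [0.3, -0.6]

grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
print(grad_hidden, grad_out)
```

Repeatedly calling `train_step` with the returned weights is the gradient descent loop; each call moves every weight against its own share of the output error.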


CONCLUSIONS

In conclusion, the learning process of an artificial neural network involves updating the connection strength (weight) of a node (neuron). By using the error between the predicted value and the correct value, the weights in the network are adjusted so that the error is minimized and an output close to the truth is obtained.

Funding: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2017S1A6A3A01078538).

Conflict of Interest: The authors have no financial conflicts of interest.

Author Contributions:

Conceptualization: Kim SY.
Supervision: Kim KW, Youn YC.
Writing - original draft: Han SH, Kim KW.
Writing - review & editing: Youn YC.

A review of convolutional neural networks in computer vision

Open access
Published: 23 March 2024
Volume 57 , article number 99 , ( 2024 )


Xia Zhao 1 na1 ,
Limin Wang 1 na1 ,
Yufei Zhang 2 ,
Xuming Han 3 ,
Muhammet Deveci 4 , 5 , 6 &
Milan Parmar 7

In computer vision, a series of exemplary advances have been made in several areas involving image classification, semantic segmentation, object detection, and image super-resolution reconstruction with the rapid development of deep convolutional neural network (CNN). The CNN has superior features for autonomous learning and expression, and feature extraction from original input data can be realized by means of training CNN models that match practical applications. Due to the rapid progress in deep learning technology, the structure of CNN is becoming more and more complex and diverse. Consequently, it gradually replaces the traditional machine learning methods. This paper presents an elementary understanding of CNN components and their functions, including input layers, convolution layers, pooling layers, activation functions, batch normalization, dropout, fully connected layers, and output layers. On this basis, this paper gives a comprehensive overview of the past and current research status of the applications of CNN models in computer vision fields, e.g., image classification, object detection, and video prediction. In addition, we summarize the challenges and solutions of the deep CNN, and future research directions are also discussed.


1 Introduction

Computer vision is gaining popularity as a buzzword in the field of image processing. Human activity recognition (HAR), an established trend with numerous real-life applications including elderly care monitoring, rehabilitation activity tracking, posture correction analysis, and intrusion detection in security, is a prominent area of research in the field of computer vision (Singh and Vishwakarma 2019 ). Over the years, deep learning advances in computer vision have attracted the attention of many scholars in the field of human action recognition (Vishwakarma and Singh 2019 ; Singh and Vishwakarma 2021 ; Dhiman and Vishwakarma 2020 ). The convolutional neural network (CNN) is used to construct the majority of computer vision algorithms. A convolutional neural network (Li et al. 2021 ), known for local connectivity of neurons, weight sharing, and down-sampling, is a deep feed-forward multilayered hierarchical network inspired by the receptive field mechanism in biology. As one of the deep learning models, a CNN can also achieve “end-to-end” learning. Through multiple layers of feature transformation, the underlying feature representation of the original data is gradually transformed into a higher-level feature representation, and the processed data is fed into a prediction function to settle the final classification or other tasks. The representation learned by the machine itself can generate good features, avoiding “feature engineering”.

In 2006, Hinton et al. proposed several perspectives in their article, which was published in Science (Hinton and Salakhutdinov 2006 ), including (1) that artificial neural networks with multiple hidden layers have a robust feature learning capability and (2) that the difficulty of training deep neural networks can be greatly reduced by the “layer-by-layer initialization” method. Since then, deep learning has become a hot topic in both academia and industry, and it has made a splash in computer vision, speech recognition, machine translation, and other fields. Meanwhile, another boom in artificial neural network research (Yu et al. 2013 ) has kicked off. As a typical neural network model of deep learning, the CNN has also gained wide attention. One of the most widely discussed models is AlexNet (Alom et al. 2018 ), which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 due to its excellent performance. Following AlexNet’s gains in accuracy on computer vision tasks such as image classification, researchers started to remedy the defects of AlexNet-based network models in the expectation of further enhancing their performance. Significant advances have been made in model optimization, and some of the most representative neural network models are Visual Geometry Group (VGG) (Sengupta et al. 2019 ), GoogLeNet (Khan et al. 2019 ), Residual Network (ResNet) (Wightman et al. 2021 ), Squeeze and Excitation Network (SENet) (Jin et al. 2022 ), and MobileNet (Chen et al. 2022 ). With the development of these network architectures, neural network models tend to be deeper, wider, and more complex. Although this evolution helps the networks capture better feature representations, there is no guarantee that they operate efficiently in all cases. Models still suffer from disadvantages: deeper networks are more prone to overfitting, and the training error can increase rather than decrease as the networks become deeper and more complex. To remedy the shortcomings of these models, many scholars have come up with various techniques to optimize the structure of CNN, e.g., network pruning (Yang et al. 2023 ), knowledge distillation (Guo et al. 2023 ), and tensor decomposition (Fernandes et al. 2021 ).

Despite the significant achievements of CNN in computer vision applications such as image classification (Chandra and Bedi 2021 ), object detection (Ma et al. 2023 ), speech recognition (Li et al. 2022 ), sentiment analysis (Chan et al. 2023 ), and video recognition (Yan et al. 2022 ), the field continues to face various challenges and opportunities. As computer vision tasks become increasingly complex, there is a pressing need for CNN models and algorithms that offer higher performance and efficiency. Moreover, current research focuses on addressing key issues such as knowledge sharing across different tasks, domain adaptation, and interpretability. Taking these points into account, this paper aims to comprehensively summarize and analyze the applications of CNN in computer vision, with a particular emphasis on the latest advancements in tasks including image classification, object detection, and video prediction. The contributions of this survey paper are summarized below:

A holistic literature review of CNN in computer vision, including image classification, object detection, and video prediction, is presented in this paper.

A theoretical understanding of the CNN design principles and techniques, such as convolution, filter size, stride, down sampling, optimizer, etc., is explained in detail.

The image classification and object detection performance obtained using the existing algorithms on the dataset of the domain to which they belong are compared, respectively.

Classical architectures for deep learning and CNN-based visual models are highlighted.

The current challenges involved and future research directions for CNN are identified and presented.

The remaining part of the paper proceeds as follows (shown in Fig. 1 ): Section 2 gives a basic introduction to the elementary components of CNN and their corresponding functions. Sections 3 , 4 , and 5 summarize the relevant research models and methods in three application directions, namely, image classification, object detection, and video prediction, respectively. In Sects. 6 and 7 , through synthesizing the current research status, the issues of CNN are analyzed and summarized. In addition, an outlook on future research trends is provided.

Layout of the paper illustrating the overall process (This paper is reviewed in the order of introduction, basic CNN components, image classification, object detection, video prediction, CNN challenges and future directions, and conclusion. Among them, the convolution layer, pooling layer, activation function, batch normalization, dropout, and fully connected layer are introduced in the basic CNN compositions. An introduction to image classification includes AlexNet, VGG, GoogLeNet, ResNet, SENet, and MobileNet. We overview object detection according to two-stage and one-stage. Video prediction is a popular area of research in the field of CNN. This part presents the state-of-the-art models in video prediction. The conclusion summarizes the challenges related to CNN and outlines future research directions.)

2 Basic CNN components

Although there are numerous variations of CNN models, the overall architecture is essentially the same and adheres to a fixed paradigm, consisting of an input layer, alternating convolution and pooling layers, one or more fully connected layers, activation functions, and an output layer at the end. The first half of the network comprises a number of convolution and pooling layers stacked alternately to form a feature extractor, which transforms the preprocessed raw input data into a more abstract and higher-level feature representation. Fully connected layers are used in combination with activation functions to execute tasks such as classification or regression on the extracted features. To maximize CNN performance, various regulatory units like batch normalization and dropout are also included in addition to various mapping functions (Bouvrie 2006 ). Fig. 2 shows various CNN components. The configuration of CNN components is essential to creating new architectures and, ultimately, to obtaining improved performance. It is crucial to comprehend various CNN components and their respective applications in order to learn about the developments in CNN architecture in computer vision (Bhatt et al. 2021 ). The role of these components in a CNN architecture is covered in brief in this section.

Structure of CNN (Suppose this is an n-classification problem. The original data is convolved twice (Convolution 1, Convolution 2), pooled twice (Max Pooling 1, Max Pooling 2), and output to the fully connected layer (Fully connection), and finally the Softmax activation function compresses the output vectors of the full connection layer into (0, 1) and outputs them in the output layer. The Data Cost 1 represents the probability of belonging to the n categories; the larger the value, the greater the possibility of belonging to the category.)

Before being input to the CNN, the raw data needs to be preprocessed. Common processing methods include homogenization (Stepanov et al. 2023 ), normalization (Huang et al. 2023 ), and principal component analysis (PCA) (Uddin et al. 2021 ). Homogenization subtracts the average value calculated across the complete training set so that each dimension of the input data is centered at zero. Normalization scales the data magnitude to the same range. After the input has been normalized, dimension reduction with PCA can lessen the correlation between the data dimensions.
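The first two preprocessing steps can be sketched as follows. This is a minimal illustration on hypothetical data; dividing by the standard deviation is only one possible way to bring features to the same range and is an assumption here, and PCA is omitted to keep the sketch short.

```python
def center(rows):
    """Subtract the per-feature mean computed over the whole training set."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return centered, means


def scale(centered_rows):
    """Divide each centered feature by its standard deviation."""
    n = len(centered_rows)
    stds = [(sum(v * v for v in col) / n) ** 0.5 for col in zip(*centered_rows)]
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, stds)]
            for row in centered_rows]


# Hypothetical samples with two features of very different magnitude.
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]

centered, means = center(data)
normalized = scale(centered)
print(means)
print(normalized)
```

After both steps the two features have zero mean and comparable magnitude, even though the raw values differ by two orders of magnitude.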

2.1 Convolution layer

The convolution layer, which extracts various features from different local regions of the input data, is composed of a collection of convolution kernels, with each neuron acting as a kernel. Each convolution kernel has three dimensions: length ( L ), width ( W ), and depth ( D ). The length and width of the convolution kernel are chosen by the designer, and $L \times W$ is known as the size of the convolution kernel. Commonly used sizes are $3 \times 3$ , $5 \times 5$ , etc. The number of channels, also known as the depth or the number of feature maps, is the number of feature maps output from each layer in the CNN, and the depth of a convolution kernel equals the number of feature maps it receives as input. The number of channels directly affects the feature extraction ability and computational complexity of the CNN. By increasing the number of channels, the feature extraction ability of CNN can be enhanced, but it also increases the computational complexity. A convolution operation is the process of sliding a convolution kernel (filter) over the input image, multiplying the convolution kernel and the pixel values at the corresponding positions of the input image, and summing them to obtain a feature map. The convolution process is depicted in Fig. 3 using a single-channel original image $5 \times 5$ and a convolution kernel $3 \times 3$ . Each pixel value of the feature map is obtained by multiplying the kernel weights by the pixel values of the original image covered by the kernel at that position and summing the products. In Fig. 3 , the − 4 on the blue background in the feature map is calculated as follows: $ - 4 = ( - 1) \times 1 + 0 \times 0 + 1 \times 7 + ( - 1) \times 8 + 0 \times 2 + 1 \times 4 + ( - 1) \times 6 + 0 \times 5 + 1 \times 0 $ .

Convolution procedure

By convolving the original image with the filters and applying a nonlinear activation function to obtain new feature mappings, each feature mapping can be used as a class of extracted image features. To extract higher-level and more complete feature representations, the network model can stack multiple convolution layers. Convolutional operation’s weight-sharing technique allows multiple sets of features within an image to be retrieved by sliding a kernel with the same set of weights on the image, making it more efficient and effective for CNN parameters than fully connected networks. Furthermore, it also allows the network to have fewer neuron connections and a simpler network architecture, which facilitates the training of the network.
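The multiply-and-sum operation described above can be sketched as a small function. The kernel and the $3 \times 3$ image patch below are read off the worked calculation of the − 4 entry in Fig. 3 , assuming the terms are listed row by row; the $5 \times 5$ image is hypothetical, since the full figure is not reproduced here.

```python
def conv2d(image, kernel, stride=(1, 1)):
    """Slide the kernel over the image; multiply element-wise and sum."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    stride_h, stride_w = stride
    output = []
    for top in range(0, len(image) - kernel_h + 1, stride_h):
        row = []
        for left in range(0, len(image[0]) - kernel_w + 1, stride_w):
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(kernel_h) for j in range(kernel_w)))
        output.append(row)
    return output


# Kernel and patch taken from the worked example (row-major order assumed).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
patch = [[1, 0, 7],
         [8, 2, 4],
         [6, 5, 0]]
print(conv2d(patch, kernel))  # [[-4]]

# Hypothetical 5 x 5 single-channel image: the same kernel gives a 3 x 3 map.
image = [[1, 0, 7, 2, 3],
         [8, 2, 4, 1, 0],
         [6, 5, 0, 3, 2],
         [1, 1, 2, 0, 4],
         [0, 3, 1, 2, 1]]
feature_map = conv2d(image, kernel)
print(feature_map)
```

The same nine kernel weights are reused at every position, which is the weight sharing the text refers to.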

Stride is the number of rows and columns that the convolution kernel slides over the input matrix in order from left to right and top to bottom, starting from the top left of the input matrix. For example, in Fig. 3 , the stride is 1 in both the height and width directions. In addition, we can also use a larger stride. Fig. 4 illustrates a convolution operation with a stride of 3 in the vertical direction and 2 in the horizontal direction. At the output of the second element of the first column, the convolution window is slid down 3 rows, and the elements used for the calculation are: $( - 1) \times 3 + 0 \times 3 + ( - 1) \times 0 + 0 \times 1 = - 3$ . The convolution window slides two columns to the right when the second element of the first row is output. The elements that were used in the calculation are: $( - 1) \times 0 + 0 \times 7 + ( - 1) \times 2 + 0 \times 4 = - 2$ . Since the input elements cannot fill the convolution kernel window, no result is produced when the convolution window slides two more columns to the right on the input. The output data size, computational complexity, and feature extraction capability can all be impacted by the stride. The output data size reduces and the ability to extract features weakens as the stride increases, but the computation speed increases.

Convolution procedure (stride=(2,3))

Padding is the process of adding a certain number of pixels to the edges of the input data so that the size of the output data can match that of the input data. As shown in Fig. 5 , values are added on the boundary of the matrix to increase its size, usually zeros or copies of the boundary pixels. Padding is frequently used in CNN to prevent feature map sizes from shrinking at each layer. Furthermore, padding makes it easier for the convolution kernel to learn the information at the borders of the input image. For instance, when the $5 \times 5 \times 1$ image is padded to $7 \times 7 \times 1$ and the $3 \times 3 \times 1$ kernel is applied over it, the convolved matrix has dimensions $5 \times 5 \times 1$ , so the input and output images have the same dimensions. If the same procedure is done without padding, the output is smaller: a $5 \times 5 \times 1$ image is converted to a $3 \times 3 \times 1$ image (Bhatt et al. 2021 ).
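The effect of kernel size, stride, and padding on the output size along one dimension is commonly summarized as $\lfloor (n + 2p - k)/s \rfloor + 1$ , where n is the input size, k the kernel size, s the stride, and p the padding added on each side. The symbols and the helper name below are our own and do not come from the paper; the two calls reproduce the padded and unpadded cases described above.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of kernel positions along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 5 x 5 image, 3 x 3 kernel, one pixel of padding on each side: size is preserved.
print(output_size(5, 3, stride=1, padding=1))  # 5
# Same image and kernel without padding: the feature map shrinks.
print(output_size(5, 3, stride=1, padding=0))  # 3
# A larger stride reduces the output size further.
print(output_size(5, 3, stride=2, padding=0))  # 2
```

The floor division reflects the rule stated earlier that no result is produced when the remaining input elements cannot fill the kernel window.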

2.2 Pooling layer

After the feature maps are acquired, a pooling (down-sampling) layer is usually added. The neurons in the pooling layer are connected to local receptive fields of their input layer, i.e., the convolution layer, and the receptive fields of different neurons do not overlap. The pooling procedure, like the convolution process, slides over the input, but it can be thought of as a function without weights: the input feature maps are divided into many regions, and each region is pooled to yield a single value that summarizes it. Commonly used pooling functions are max pooling and average pooling.

For a region, max pooling selects the maximum activity value of all neurons as the representation of this region and extracts the most significant features from the input feature mapping, which is generally used for low-level feature extraction. In the case of max pooling (stride = 2), as shown in Fig. 6 , a kernel of size $2 \times 2$ is moved across the matrix, and the maximum value is selected and put in the appropriate spot of the output matrix. For example, pooling the four numbers ’0, 1, 4, 8’ in the blue region yields 8, the maximum of these four numbers.

Max pooling

Average pooling takes the arithmetic mean of all elements in the region as the output result of the function, namely, the mean value of the local response of the extracted feature mapping. The average pooling results with filter = $2 \times 2$ and stride 2 are shown in Fig. 7 . It is evident that the green region’s pooling result is (2 + 6 + 3 + 3)/4 = 3.5.

Average pooling

The introduction of a pooling layer not only effectively compresses the amount of data and parameters, reduces the feature map dimension, and minimizes overfitting, but also makes the network invariant to some small local morphological changes while having a larger perceptual field. Applying different pooling techniques also significantly shortens the time needed for model training and improves feature extraction and compression.
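Max pooling and average pooling with a $2 \times 2$ window and stride 2 can be sketched with one helper. The $4 \times 4$ feature map below is hypothetical; it is arranged so that one window contains the values 0, 1, 4, 8 and another contains 2, 6, 3, 3, matching the two worked examples in the text.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Summarize each size x size window by its maximum or its mean."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda values: sum(values) / len(values)
    output = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_window(window))
        output.append(row)
    return output


# Hypothetical feature map (not the one shown in Figs. 6 and 7).
feature_map = [[0, 1, 2, 6],
               [4, 8, 3, 3],
               [5, 2, 1, 0],
               [7, 3, 9, 4]]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[8, 6], [7, 9]]
print(avg_pooled)  # [[3.25, 3.5], [4.25, 3.5]]
```

Both variants halve each spatial dimension and have no trainable weights, which is why pooling reduces the amount of data without adding parameters.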

2.3 Activation function

An activation function is a mathematical function applied to the filter’s output. It plays an important role in neural networks, strengthening the representational and learning capabilities of the network. Without an activation function, each layer computes only a linear summation of its inputs, so the output of every layer is simply a linear transformation of the previous layer’s output, and the whole network remains a linear model. With the introduction of a nonlinear activation function, by contrast, the neural network can approximate nonlinear functions, making it applicable to a wider range of nonlinear models. In this section, we will introduce the most classical and widely used activation functions, including Sigmoid, Tanh, Softmax, ReLU, and Leaky ReLU.

The logistic function, also known as the sigmoid, has values between 0 and 1. As can be seen in Fig. 8 , the sigmoid can be used both to normalize the output of each neuron and in models that output predicted probabilities. This is because the sigmoid maps any real input to the (0,1) interval. The following is the expression for the sigmoid function:

$\sigma (x) = \frac{1}{1 + e^{ - x}} \quad (1)$

Function curves of Sigmoid and Tanh

Figure 8 shows that the sigmoid is smooth, preventing output values from jumping. Nevertheless, there are several issues with using Sigmoid. First, because the output is not zero-centered, the inputs of the next layer’s neurons suffer a bias shift, which slows down the convergence of gradient descent and decreases the efficiency of weight updates. Secondly, the sigmoid function flattens out as its output gets closer to 0 and 1, meaning the sigmoid’s gradient converges to 0. When the neural network is trained by backpropagation with the sigmoid activation function, neurons with outputs near 0 or 1 have their weights barely updated because their gradients are close to 0. Furthermore, the weights of the neurons connected to such neurons are updated slowly, and the network is prone to gradient vanishing. Finally, the sigmoid function involves an exponential operation, which lengthens the model’s computation time.

Tanh, also known as the hyperbolic tangent activation function (HTAF), compresses the received vector into a range of − 1 to 1. Equation ( 2 ) and Fig. 8 show the function expression and curve, respectively.

$\tanh (x) = \frac{e^{x} - e^{ - x}}{e^{x} + e^{ - x}} \quad (2)$

Figure 8 shows that the Tanh and sigmoid function curves are relatively similar and resemble an S-shaped curve. Furthermore, the Tanh function can be thought of as a scaled and shifted sigmoid function. Tanh and Sigmoid have the following relationship:

$\tanh (x) = 2\sigma (2x) - 1$

Tanh is generally preferred over Sigmoid in practice because it solves the problem of Sigmoid functions not centering the output at 0. However, like the sigmoid, when the input is very large or very small, the output saturates and the gradient is small, which hinders weight updating.
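The relationship between the two functions and the saturation of their gradients can be checked numerically. The sketch below uses only the standard definitions of the sigmoid and the hyperbolic tangent; the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    # tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print(x, math.tanh(x), rescaled, sigmoid_grad(x), tanh_grad(x))
```

Both gradients are largest at 0 (0.25 for the sigmoid, 1 for tanh) and become very small at ±10, which is the saturation behaviour discussed above.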

Softmax is an activation function for multi-class classification problems. For any real vector of length K , Softmax activation can compress it into a real vector of length K , with values in the range (0, 1) and vector elements summing to 1. In a K -class classification task, these values obtained by the activation function can be used to represent the predicted probability of each category, with larger values indicating a higher probability of belonging to that category. As shown in Fig. 9 , for a 5-class classification problem, each element of the output layer vector (left column) is mapped to a number (right column) within (0, 1) after Softmax; the probability in the second row is 0.90, indicating that the input belongs to the second category. Softmax is formulated as follows:

$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum\nolimits_{j = 1}^{K} e^{z_j}}$ , where $z_i$ is the i -th element of the input vector.

Softmax schematic

In contrast to the standard max function, which only returns the maximum value, Softmax ensures that smaller values receive smaller probabilities rather than being discarded outright. The denominator of the Softmax function combines all elements of the original output vector, which means that the probabilities obtained by the Softmax function are correlated with each other. Like the sigmoid, however, Softmax saturates: when one input is much larger than the others, the output approaches a one-hot vector and the gradients with respect to the inputs approach zero, which slows weight updates during backpropagation. Fig. 10 depicts the Softmax function curve.

Function curves of Softmax

Softmax and Sigmoid have both similarities and differences. Softmax can be regarded as an extension of sigmoid, and softmax regression degenerates to sigmoid (logistic) regression when the number of categories K = 2. A sigmoid maps a real value to the interval (0, 1) and is used for binary classification. Softmax maps a K-dimensional vector of real values $(a_1, a_2, \ldots , a_K)$ to $(b_1, b_2, \ldots , b_K)$, where each $b_i$ is a value between 0 and 1 and the $b_i$ sum to 1. The multi-class classification task can then be performed based on the magnitude of the probabilities $b_i$. Although multiple sigmoids can also be combined for multi-class classification, the two approaches differ: with softmax regression the classes are mutually exclusive, i.e., an input can only be assigned to one class, whereas with multiple sigmoid regressions the output classes are not mutually exclusive.
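To make the Softmax computation and its relation to Sigmoid concrete, here is a minimal sketch in plain Python. The logits are hypothetical values chosen for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something stated in the text.

```python
import math


def softmax(logits):
    """Map a list of K real values to K probabilities that sum to 1."""
    m = max(logits)  # subtracting the max leaves the result unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical output-layer values for a 5-class problem.
probs = softmax([1.0, 4.0, 0.5, -1.0, 0.0])
predicted_class = probs.index(max(probs))  # index of the largest probability

# With K = 2, the first Softmax output equals sigmoid(z1 - z2).
two_class = softmax([2.0, 0.0])
```

The two-class case shows why softmax regression reduces to sigmoid regression: only the difference between the two inputs matters.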

ReLU, also known as the Rectified Linear Unit, is a piecewise linear function, as shown in Fig. 11. The ReLU function is essentially a ramp function with the following formula: $f(x) = \max (0, x)$

Function curves of ReLU

To some extent, ReLU compensates for the shortcomings of sigmoid and tanh. When the input is positive, the derivative is 1, which alleviates the gradient vanishing problem and speeds up gradient descent convergence. Second, because the ReLU function involves only a comparison and a linear relationship, with no exponential, it is faster to compute than the sigmoid and tanh functions. However, this activation function suffers from the Dead ReLU problem: if the input is negative, the gradient is exactly zero, so a ReLU neuron whose input stays negative stops updating and may “die” during training. Similar to Sigmoid, the output of the ReLU function is not zero-centered, which introduces a bias offset to the neural network in the next layer, affecting the efficiency of gradient descent.

Leaky ReLU is a function that tries to fix the Dead ReLU problem, that is, the zero gradient of ReLU when x < 0. The function expression is as follows: $f(x) = \begin{cases} x, & x > 0 \\ ax, & x \le 0 \end{cases}$

where a is a small positive constant, such as 0.01 or 0.1. Fig. 12 shows the curve for a = 0.01.

Function curves of Leaky ReLU

Leaky ReLU mitigates the Dead ReLU problem to some extent by giving very small linear components to the negative inputs to adjust for the zero gradients of the negatives, extending the range of ReLU. Although Leaky ReLU has all the features of ReLU, such as being computationally efficient, having fast convergence, and not saturating in positive regions, it has not been fully proven in practice that Leaky ReLU is always better than ReLU.
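The difference between ReLU and Leaky ReLU shows up most clearly in their gradients for negative inputs. The sketch below is illustrative only; the gradient value used at exactly x = 0 is a convention assumed here, since the derivative is not defined at that point.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp the rest to 0."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Gradient of ReLU (taken as 0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, a: float = 0.01) -> float:
    """Leaky ReLU: a small slope `a` replaces the flat negative part."""
    return x if x > 0 else a * x


def leaky_relu_grad(x: float, a: float = 0.01) -> float:
    """Gradient of Leaky ReLU (taken as `a` at x = 0 by convention)."""
    return 1.0 if x > 0 else a


# For a negative input, ReLU blocks the gradient entirely,
# while Leaky ReLU still lets a small gradient through.
blocked = relu_grad(-2.0)
leaked = leaky_relu_grad(-2.0)
```

Because the Leaky ReLU gradient is never exactly zero, a neuron that receives only negative inputs can still recover during training.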

The essence of deep learning lies in continuously updating weights to find values that minimize the loss. When dealing with complex tasks, deep networks outperform shallow ones. However, in deep neural networks, gradients can be unstable, either vanishing or exploding, because backpropagation multiplies many per-layer terms together. The backpropagation (BP) algorithm, based on gradient descent, adjusts parameters in the negative gradient direction of the objective, and the gradient calculation involves the derivative of the activation function at every layer. If the per-layer factors are greater than 1, the computed gradient grows exponentially as the number of layers increases, leading to a gradient explosion. This results in very large updates to network weights, making the network unstable. If the factors are less than 1, the gradient decays exponentially with increasing layers, causing the vanishing gradient problem. This prevents the model from learning effectively from training data, even with prolonged training.

Choosing an appropriate activation function can alleviate the issues of gradients vanishing and exploding. If the derivative of the activation function is 1, the activation itself neither shrinks nor amplifies the gradient, so each layer of the network can update at a similar rate. Sigmoid and Tanh are two classic activation functions, but Sigmoid has a drawback: when x is very large or very small, the derivative is close to 0, and the maximum value of the Sigmoid function’s derivative is 0.25. If Sigmoid is used as the activation function, each layer therefore scales the gradient by at most 0.25, and gradient vanishing is likely to occur once the chain rule multiplies these factors in backpropagation. Using Tanh may still lead to gradient vanishing: although its derivative is larger than that of Sigmoid, it reaches 1 only at x = 0 and is smaller everywhere else. Therefore, Sigmoid and Tanh are generally not well suited to the hidden layers of deep neural networks. The derivative of ReLU is constantly 1 in the positive part, so using ReLU as the activation function mitigates gradient vanishing for active neurons: positive inputs pass their gradients through unchanged, while negative inputs contribute zero gradient. ReLU does not by itself bound the gradient, however, because the weights also enter the chain-rule product, so gradient explosion is usually handled with additional measures such as careful weight initialization or normalization. Because their derivatives are bounded, saturating functions such as Sigmoid and Tanh can partially limit gradient explosion. The activation function acts as a decision function and aids in the learning of complex patterns. Choosing an appropriate activation function can speed up the learning process, and different activation functions are appropriate for different application scenarios. In general, ReLU and its variants are preferred because they help overcome the vanishing gradient problem (Nwankpa et al. 2018).
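The compounding effect of the chain rule can be illustrated with a deliberately simplified model: a chain of scalar layers in which every layer contributes the same factor (activation derivative times weight). This uniform-factor assumption and the chosen depth of 20 are illustrative only and are not taken from the paper.

```python
import math


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, which peaks at 0.25 for x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def backprop_factor(local_grad: float, weight: float, depth: int) -> float:
    """Gradient scaling after `depth` identical layers (simplified model)."""
    return (local_grad * weight) ** depth


DEPTH = 20  # hypothetical number of layers

# Best case for Sigmoid: every neuron sits at x = 0, so each factor is 0.25.
best_case_sigmoid = backprop_factor(sigmoid_grad(0.0), 1.0, DEPTH)

# Active ReLU units with unit weights leave the gradient unchanged.
relu_active = backprop_factor(1.0, 1.0, DEPTH)

# With weights of 1.5 the same ReLU chain explodes instead.
relu_large_weights = backprop_factor(1.0, 1.5, DEPTH)
```

Even in its best case, the Sigmoid chain shrinks the gradient below 1e-12 after 20 layers, while the ReLU chain with larger weights grows past 3000, which is why the activation function alone does not rule out gradient explosion.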

2.4 Batch normalization

Gradient descent is a very versatile optimization algorithm that is well suited to solving a range of problems. The idea of gradient descent is to minimize the objective function by iteratively updating the parameters in the opposite direction of the gradient of the objective function. The gradient at a point gives the direction along which the function increases fastest, so moving against it decreases the function fastest. The gradient descent algorithm is shown in Fig. 13, where a random initial value is chosen, the gradient at that point is calculated, and then the independent variables are updated in the direction opposite to the gradient until the value of the function changes very little or the maximum number of iterations is reached. The formula is as follows: $\theta \leftarrow \theta - \alpha \nabla _{\theta } J(\theta )$

where $\theta $ is the parameter to be solved, $\alpha $ is the learning rate, which represents the learning step for each optimization, and $J(\theta )$ is the objective function.

Gradient descent algorithm

The learning step is an important parameter of gradient descent that determines how far each update moves along the objective function in search of the minimum point. Two extremes can occur when setting the learning step, as shown in Fig. 14: (a) If the learning step is too small, the algorithm needs many iterations before it converges, which is very time-consuming. (b) On the other hand, if the learning step is too large, the updates may overshoot the minimum point or even diverge, so that the minimum is never found.

Algorithms for gradient descent with excessively small or large learning steps
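The effect of the learning step can be reproduced on a one-dimensional toy objective. The sketch below assumes the hypothetical objective J(θ) = (θ − 3)², whose minimum is at θ = 3; the objective, starting point, and step sizes are illustrative choices, not values from the paper.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_j(theta):
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# Same start and iteration budget, three different learning steps.
too_small = gradient_descent(grad_j, theta0=0.0, alpha=0.001, steps=100)
suitable = gradient_descent(grad_j, theta0=0.0, alpha=0.1, steps=100)
too_large = gradient_descent(grad_j, theta0=0.0, alpha=1.1, steps=100)
```

After 100 iterations the small step is still far from the minimum, the moderate step has essentially reached it, and the large step has diverged because each update overshoots by more than the previous error.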

However, not all objective functions resemble a standard bowl. They may contain holes, ridges, plateaus, or other irregular terrain that makes convergence difficult. Fig. 15 depicts the two main gradient descent challenges: if the random initial value lies on the left side of the figure, the algorithm converges to a local minimum whose value is greater than the global minimum. If it starts on the right, it takes a long time to cross the plateau, and if training stops too early, it never reaches the global minimum.

Two main gradient descent challenges

To address several of these problems in training with gradient descent, a team at Google proposed batch normalization (BN) (Ioffe and Szegedy 2015). BN is a neural network normalization technique, with a regularizing side effect, that unifies the distribution of feature-map values within each mini-batch by normalizing them to zero mean and unit variance and then applying a learnable scale and shift. Furthermore, the BN layer helps to alleviate the problem of gradient vanishing and gradient explosion, improves the network’s adaptability to different input data, speeds up the neural network’s training process, and improves the network’s generalization. It also reduces the risk of dead neurons with ReLU and makes weight initialization easier.
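The core of the BN computation can be sketched for a single feature over one mini-batch. This is a simplified training-time forward pass only: the running statistics used at inference and the gradient computation are omitted, and the input values are hypothetical.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    gamma and beta stand in for the learnable scale and shift parameters;
    eps is a small constant that avoids division by zero.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical activations of one feature for a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

With the default `gamma = 1` and `beta = 0`, the output has approximately zero mean and unit variance regardless of the scale of the inputs.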

2.5 Dropout

Dropout facilitates regularization in the network by randomly omitting some units or connections with a predetermined probability during training, which eventually enhances generalization. This random dropping results in many thinned network architectures that share weights. At test time, a single network without dropout and with scaled-down weights is used, and it is regarded as an approximation of the average of all the thinned networks (Srivastava et al. 2014). Fig. 16 depicts the distinction between a fully connected layer and a dropout layer.

Distinction between a fully connected layer and a dropout layer
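A minimal sketch of dropout on a list of activations follows. It uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; this is a common implementation choice and an assumption here, whereas the formulation cited above rescales the weights at test time instead.

```python
import random


def dropout(activations, p_drop, training=True, rng=None):
    """Zero each activation with probability p_drop (inverted dropout)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # identity at test time
    rng = rng or random.Random()
    keep = 1.0 - p_drop
    # Survivors are divided by the keep probability so that the expected
    # value of each output matches the original activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Fixed seed so the illustration is reproducible.
train_out = dropout([1.0] * 1000, p_drop=0.5, rng=random.Random(0))
test_out = dropout([1.0, 2.0, 3.0], p_drop=0.5, training=False)
```

Each training call samples a different mask, which corresponds to training a different thinned network on every step.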

2.6 Fully connected layer

A fully connected layer is a global operation, as opposed to convolution and pooling, and is typically employed at the end of the network for classification. As in a multi-layer perceptron (MLP) (Isabona et al. 2022), each neuron in the fully connected layer is connected to every neuron in the preceding layer. Once the feature maps obtained after several convolution and pooling operations are sufficient to recognize the features of the image, the next thing to consider is how to perform the classification. Generally, the CNN flattens the final feature maps into one long vector and sends it to the fully connected layer, followed by the output layer, for classification. For example, for a three-class image classification problem, the output layer of a CNN will have three neurons. In addition, the fully connected layer can integrate local information that is class-distinctive in the convolution or pooling layers (Sainath et al. 2013).
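The flatten-then-classify step can be sketched as follows. The feature maps, weights, and biases are hypothetical values chosen so the result is easy to verify by hand; a real network would learn them.

```python
def flatten(feature_maps):
    """Concatenate all feature maps (lists of rows) into one long vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum over every input plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Two hypothetical 2x2 feature maps from the last pooling layer.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
vector = flatten(feature_maps)  # length 2 * 2 * 2 = 8

# Three output neurons for a three-class problem (illustrative weights).
weights = [[0.1] * 8, [0.0] * 8, [-0.1] * 8]
biases = [0.0, 0.5, 0.0]
scores = fully_connected(vector, weights, biases)
```

In practice these three scores would be passed through Softmax to obtain class probabilities.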

3 Image classification

3.1 Subtask explanation

Image classification (Chandra and Bedi 2021 ), which seeks to differentiate between distinct classes of objects, such as flowers, figures, and vehicles, based on various properties reflected in the image, is one of the fundamental challenges in computer vision. In other words, a computer can identify the class to which the objects in an image or video belong. The main process of image classification includes preprocessing the original image, extracting image features, and classifying the image using a pre-trained classifier, in which the extraction of image features plays a pivotal role. The data flow diagram for image classification is shown in Fig. 17 . Traditional image classification algorithms can achieve the expected results in simple classification tasks. However, their performance in complex classification tasks is not satisfactory. CNN uses convolution kernels to extract features from the original input and automatically learns feature representations from massive sample data, giving the trained models stronger generalization abilities when compared to conventional image classification algorithms that manually extract features.

Data flow diagram for image classification

3.2 AlexNet

LeNet was proposed by LeCun in 1998 (LeCun et al. 1998 ). LeNet is a feed-forward neural network consisting of two fully connected layers after five alternating layers of pooling and convolution. LeNet-5 is a LeNet extension and improvement that adds more convolution and fully connected layers. As shown in Fig. 18 , the LeNet-5 network model has seven layers. LeNet-5 can share convolution kernels, reduce network parameters, perform well on the small-scale MNIST dataset, and achieve more than 98 $\%$ accuracy. CNN was first used for image recognition tasks thanks to the work of LeNet and LeNet-5, which also offered crucial lessons and insights for the later creation of deeper neural networks.

Architecture of LeNet-5

The concepts presented by Hubel and Wiesel in their seminal 1968 paper served as the foundation for the idea that LeCun and his colleagues implemented (Hubel and Wiesel 1968). The study on the striate cortex in monkeys categorized cells as simple, complex, or hypercomplex. It found smaller receptive fields, increased sensitivity to stimulus orientation, and a minority of cells with color-coding abilities. The evidence supports two vertical column systems in the studied cortex. The first type features columns with cells sharing receptive-field orientations, akin to cat orientation columns but likely smaller. The second system organizes cells into columns based on eye preference, with larger ocular dominance columns. The boundaries of the two systems appear to be independent. The cortex exhibits dual organization patterns: a vertical system aligns cells with common features along a line, mapping stimulus dimensions independently in superimposed mosaics. The horizontal system segregates cells hierarchically in layers, with lower orders (monocularly driven simple cells) near layer IV and higher orders in the upper and lower layers. These findings not only address the organizational aspects of receptive fields and functional structure but also provide a crucial foundation for further research into information processing in the cortical region of the brain.

However, due to the low performance of the hardware and the insufficiently rich datasets at that time, LeNet was not suitable for complex problems. In 2012, Krizhevsky et al. proposed AlexNet (Alom et al. 2018), which consists of five convolution layers and three fully connected layers. Each convolution layer contains convolution kernels, a bias term, and a ReLU activation function, and the first two convolution layers are additionally followed by a local response normalization (LRN) module. The first convolution layer convolves the $224 \times 224 \times 3$ input image using 96 convolution kernels of size $11 \times 11 \times 3$ and stride 4. The second convolution layer takes the output of the first convolution layer as input and filters it with 256 kernels of size $5 \times 5 \times 48$. The third, fourth, and fifth convolution layers are connected to each other, with no pooling layer in between. The kernels of the second, fourth, and fifth convolution layers are only connected to those kernel maps of the previous convolution layer that are located on the same GPU. The kernels of the third convolution layer are connected to all the kernel maps of the second convolution layer. The neurons in the fully connected layers are connected to all the neurons in the previous layer. Max pooling layers follow the two response normalization layers and the fifth convolution layer. After the convolution and fully connected layers, the result is fed into a Softmax classifier with 1000 nodes, which converts the output of the network into probabilistic values that can be used to predict the category of the image.

The image classification task of the ILSVRC reflects the most notable breakthrough of deep CNNs in this area. In the 2012 ILSVRC, AlexNet demonstrated the potential of deep learning and won the competition with a Top-5 classification error rate of 16.4 $\%$, surpassing the second-place algorithm, which relied on traditional methods. This competition attracted the attention of many researchers, and since then, improved algorithms based on CNNs have also obtained excellent results in the ImageNet competition. AlexNet thus became the dividing line between traditional and deep learning algorithms and was the first deep CNN model of the modern era. Unlike traditional algorithms, AlexNet adopted many modern techniques of deep convolutional networks for the first time. It ran the convolution operations in parallel on two GPUs during training, which overcame the limitation that hardware resources placed on learning ability and accelerated the training of the model. To address the vanishing gradient problem and speed up the convergence of the network, the output of each convolution layer is passed through the ReLU activation function and then forwarded to the subsequent convolution layer after local response normalization and down-sampling operations. By utilizing dropout and data augmentation, AlexNet also reduces the model’s overfitting.

3.3 Visual geometry group

To examine the impact of a CNN’s depth on its accuracy, Simonyan and Zisserman (Sengupta et al. 2019) conducted a comprehensive evaluation of the performance of network models with increasing depth using small convolution filters ($3 \times 3$) instead of the previous large convolution kernels ($5 \times 5$) and proposed a series of Visual Geometry Group (VGG) models in 2014. With a Top-5 classification error rate of 7.3 $\%$, VGG finished as the second-place network in ILSVRC 2014. Compared to earlier neural network models, VGG reduced the size of the convolution kernels while increasing the number of network layers. The small convolution kernels used in VGG, as opposed to the larger kernels used in AlexNet, lower the computational complexity and the number of training parameters per layer. At the same time, VGG supports the hypothesis that performance can be enhanced by deepening the network topology. To date, VGG-16 is still widely used in various tasks due to its simple structure and its applicability in transfer learning.

3.4 GoogLeNet

The champion model in the 2014 ILSVRC was GoogLeNet (Khan et al. 2019). As shown in Fig. 19, GoogLeNet consists of nine Inception V1 modules, five down-sampling layers, and a number of other convolution and fully connected layers. Though GoogLeNet has more layers, it has fewer parameters than VGG. Consequently, when computer hardware resources are restricted, GoogLeNet is a superior solution for image classification. A GoogLeNet convolution layer applies several convolutions of varying kernel sizes in parallel, allowing for the production of dense data while making good use of processing resources. Additionally, it makes use of sparse connections to eliminate redundant data and cut costs by skipping unneeded feature maps. Last but not least, GoogLeNet reduces the connection density by adopting global average pooling rather than a fully connected layer.

Architecture of GoogLeNet

By adding more hidden layers to a CNN, the recognition accuracy and performance of deep neural networks can be enhanced (Szegedy et al. 2015), but this can lead to several issues. On the one hand, as the number of network layers rises, the network must learn more parameters, which easily leads to the model overfitting the training data set. On the other hand, networks with extra layers require robust hardware resources to supply the required processing power. To overcome these problems, the research team at Google developed the concept of inception (Al Husaini et al. 2022), which aims to build the underlying neurons and a network topology for sparse high-performance computing. The original Inception structure is displayed in Fig. 20a. Based on experimental results, it was concluded that the structure’s $5 \times 5$ convolution is the root cause of the excessive parameter issue. As a result, a new structure called Inception V1 was proposed, shown in Fig. 20b. The main idea of Inception V1 is to extract feature information from the preceding layers with three different-sized convolution kernels, fuse them, and pass them to the succeeding layers. Among them, the $1 \times 1$ convolution kernel is the most commonly used and serves for data dimension reduction: it reduces the convolution computation before passing data to the subsequent $3 \times 3$ and $5 \times 5$ convolution layers, avoiding the huge computation caused by the increase in network size. The following layer can extract more valuable features from various scales by combining the features of the four channels.

Architecture of inception and inception V1
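The saving from placing a 1 × 1 convolution before a 5 × 5 convolution can be estimated by counting multiply-accumulate operations. The sketch below assumes stride 1 and padding that preserves the spatial size; the feature-map size and channel counts are hypothetical values chosen for illustration.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulate count of a stride-1, size-preserving convolution."""
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical sizes: a 28x28 feature map with 192 input channels,
# reduced to 16 channels by the 1x1 convolution, then 32 output channels.
H, W = 28, 28
C_IN, C_MID, C_OUT = 192, 16, 32

# A 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# A 1x1 reduction followed by the 5x5 convolution on fewer channels.
bottleneck = conv_macs(H, W, 1, C_IN, C_MID) + conv_macs(H, W, 5, C_MID, C_OUT)
```

With these sizes the bottleneck version needs roughly a tenth of the operations of the direct 5 × 5 convolution.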

Following Inception V1, Szegedy et al. proposed some optimizations to the Inception V1 structure and released the Inception V2 model in 2015 (Szegedy et al. 2016). To reduce the number of parameters and increase the discriminative nature of feature information, Inception V2 replaces each $5 \times 5$ convolution kernel with two stacked $3 \times 3$ convolution kernels, and each $\text {n} \times \text {n}$ convolution kernel with a $\text {1} \times \text {n}$ convolution kernel followed by an $\text {n} \times \text {1}$ convolution kernel. Second, the pooling layer is optimized using a parallel structure to cut down on computation. Furthermore, overfitting is reduced by smoothing the probability distribution of labels. Inception V3 is an improved version of Inception V1 and V2. The idea of Inception V3 was to reduce the computational cost of deep networks without affecting generalization. For this purpose, Szegedy et al. replaced large-size filters ($5 \times 5$ and $7 \times 7$) with small and asymmetric filters ($1 \times 7$ and $1 \times 5$) and used $1 \times 1$ convolution as a bottleneck before the large filters (Szegedy et al. 2017).
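The parameter savings from these kernel factorizations follow directly from counting weights. The sketch below ignores bias terms and assumes, for illustration only, that the number of channels stays at 64 through every layer.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out):
    """Number of weights in a convolution layer (bias terms ignored)."""
    return kernel_h * kernel_w * c_in * c_out


C = 64  # hypothetical channel count, kept constant for simplicity

# One 5x5 kernel versus two stacked 3x3 kernels (same receptive field).
one_5x5 = conv_params(5, 5, C, C)
two_3x3 = 2 * conv_params(3, 3, C, C)

# One n x n kernel versus a 1 x n kernel followed by an n x 1 kernel.
N = 7
n_by_n = conv_params(N, N, C, C)
asymmetric = conv_params(1, N, C, C) + conv_params(N, 1, C, C)
```

Two 3 × 3 kernels use 18/25 of the weights of one 5 × 5 kernel, and the asymmetric pair uses 2/n of the weights of the full n × n kernel.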

3.5 Residual network

A degradation problem emerges when deeper neural networks start to converge: as network depth increases, accuracy saturates and then rapidly declines. This degradation is not caused by overfitting; rather, adding more layers leads to higher training error. Prior to the residual network (ResNet) (Wightman et al. 2021), networks had relatively low layer counts; for example, the 2014 VGG network had only 19 layers. ResNet, on the other hand, maintains greater accuracy with a depth of 152 layers. ResNet draws on the highway network concept (Srivastava et al. 2015) and is composed of stacked residual blocks. The structure of a residual block is illustrated in Fig. 21. In addition to containing weighted layers, a residual block directly connects the input x to the output through a shortcut connection. The residual mapping is denoted as F(x), and the output is obtained by adding the residual mapping to the input, resulting in F(x) + x, which represents the original mapping. The residual network encourages the stacked weighted layers to fit the residual mapping F(x) rather than the original mapping. Learning the residual mapping is simpler and more easily optimized than learning the original mapping. Furthermore, the shortcut connections enable the exchange of features between different layers, to some extent alleviating the problem of gradient vanishing. The Top-5 error rate of the residual network on the image classification task was reduced to 3.6 $\%$.

A residual block
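The F(x) + x structure can be sketched with small fully connected layers standing in for the convolution layers of a real residual block. This substitution, the omission of biases and normalization, and the weight values are all simplifying assumptions made for illustration.

```python
def linear(x, weights):
    """Matrix-vector product: one weighted sum per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def residual_block(x, w1, w2):
    """Return F(x) + x, where F is two weighted layers with a ReLU between."""
    fx = linear(relu(linear(x, w1)), w2)  # residual mapping F(x)
    return [f + xi for f, xi in zip(fx, x)]  # shortcut connection adds x


x = [1.0, -2.0]

# With all-zero weights, F(x) = 0 and the block is an identity mapping.
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zeros, zeros)

# With non-zero (hypothetical) weights, the block adds a correction to x.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
out = residual_block(x, w1, w2)
```

The zero-weight case shows why extra residual blocks need not hurt: a block can fall back to the identity simply by driving its weights toward zero.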

3.6 Squeeze and excitation network

In recent years, the attention mechanism has been another focus of CNN research. When the human eye scans an image, it first looks at the whole picture and then focuses on a certain detail, concentrating on the valuable part and ignoring the less valuable part. When designing neural network models, we hope that the models can have the same ability. Attention can be understood as selecting a small amount of important information from a large pool of data and focusing on these crucial details, while ignoring the majority of less significant information. The process of focusing is reflected in the calculation of weight coefficients, where larger weights indicate a stronger focus on the corresponding Value. In other words, the weights represent the importance of the information, and the Value is the corresponding piece of information. The attention mechanism can be understood as follows (refer to Fig. 22). Imagine the constituent elements of Source as a series of <Key,Value> data pairs. Then, given an element Query in Target, the similarity or correlation between Query and each Key is calculated to obtain the weight coefficient of the Value corresponding to each Key, and the Values are then weighted and summed to give the final Attention value. So the attention mechanism is essentially a weighted sum of the Values of the elements in Source, and the Query and Key are used to calculate the weight coefficients of the corresponding Values.

The essential idea of attention

Abstracting the specific calculations of the attention mechanism can be summarized into two processes: the first process involves calculating weight coefficients based on Query and Key, and the second process entails weighting and summing the values based on these weight coefficients. The first process can be further divided into two stages: the first stage computes the similarity or correlation between Query and Key, and the second stage normalizes the raw scores obtained in the first stage. Fig. 23 illustrates the three-stage calculation process of attention.

Three-stage process for computing attention

In the first stage, various computation mechanisms can be introduced to calculate the similarity or correlation between Query and a given Key. The most common methods include computing the dot product of their vectors, $Sim(Query, Key_i) = Query \cdot Key_i$, and calculating the cosine similarity, $Sim(Query, Key_i) = \frac{Query \cdot Key_i}{\Vert Query \Vert \, \Vert Key_i \Vert }$.

Due to the different methods used, the values produced in the first stage can have different ranges. In the second stage, a computation method similar to Softmax is introduced to transform the scores obtained in the first stage. On one hand, this normalization ensures that the original computed scores are unified into a probability distribution where the sum of all element weights is equal to 1. On the other hand, the intrinsic mechanism of Softmax helps emphasize the weights of important elements. Typically, the calculation is performed using the following formula, in which $Sim_i$ denotes the first-stage score for the i-th Key: $a_i = \text {Softmax}(Sim_i) = \frac{e^{Sim_i}}{\sum _{j=1}^{L_x} e^{Sim_j}}$

where $L_x = |Source|$ represents the length of the Source. The computed result $a_i$ from the second stage represents the weight coefficient corresponding to $value_i$. Then, by performing a weighted sum, the attention value can be obtained: $Attention(Query, Source) = \sum _{i=1}^{L_x} a_i \cdot value_i$
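The three stages can be sketched in a few lines of plain Python. This minimal example assumes dot-product similarity in the first stage and uses hypothetical two-dimensional Query, Key, and Value vectors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(query, keys, values):
    """Return (weights, attention value) for one Query over a Source."""
    # Stage 1: similarity between the Query and every Key.
    scores = [dot(query, k) for k in keys]
    # Stage 2: Softmax normalization of the raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Stage 3: weighted sum of the Values.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical Source with two <Key, Value> pairs.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(query, keys, values)
```

Because the Query matches the first Key more closely, the first Value receives the larger weight and dominates the attention value.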

Focusing on channel attention research, Hu et al. proposed the squeeze-and-excitation block (SE block). The SE block explicitly models interdependencies between channels to recalibrate the feature responses within channels. This involves selectively enhancing useful channel features while suppressing irrelevant ones. Squeeze-and-excitation networks (SENet) (Jin et al. 2022) won the 2017 ImageNet competition; like ResNet, SENet greatly reduced the error rate compared to previous models while keeping network complexity low. The two primary parts of SENet are squeeze and excitation. A block of squeeze-and-excitation networks is shown in Fig. 24. The ${\textbf {f}}_{tr}$ in the figure is the traditional convolution structure, and ${\textbf {x}}$ and ${\textbf {u}}$ are the input and output of ${\textbf {f}}_{tr}$, which are already present in the previous structures. The part added by SENet is everything after ${\textbf {u}}$. The dimensions of ${\textbf {u}}$ are h, w, and c, where h stands for height, w for width, and c for channel count. The squeeze component is in charge of compressing the $h\times w\times c$ dimension into a $1\times 1\times c$ dimension, which is the same as condensing $h\times w$ into a single value per channel, and this is typically accomplished using global average pooling (${\textbf {f}}_{sq}$(.) in Fig. 24). The $1\times 1\times c$ output is then passed through fully connected layers (${\textbf {f}}_{ex}$(.) in Fig. 24, which is the excitation process), and finally, a self-gating technique is used to learn the excitation of each channel and scale the c channels of ${\textbf {u}}$ by these values to form the next level’s input data. Controlling the scale allows squeeze-and-excitation networks to strengthen critical channel features while weakening unimportant ones, yielding good results and providing a novel idea for future research in this direction.

A block of squeeze-and-excitation networks
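The squeeze, excitation, and scaling steps can be sketched on a tiny feature map. The two-layer bottleneck with a ReLU and a sigmoid gate follows the commonly used SE formulation and is an assumption beyond what the text above spells out; biases are omitted and all weights are hypothetical.

```python
import math


def squeeze(u):
    """Global average pooling: one scalar per channel (h x w x c -> c)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def excitation(z, w1, w2):
    """Two fully connected layers: ReLU in between, sigmoid gate at the end."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def se_block(u, w1, w2):
    """Rescale every channel of u by its learned excitation value."""
    s = excitation(squeeze(u), w1, w2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]


# Hypothetical feature map u with c = 2 channels of size 2 x 2.
u = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[1.0, 0.0]]    # reduces 2 channels to 1 hidden unit
w2 = [[0.0], [1.0]]  # expands back to 2 channel gates
pooled = squeeze(u)
scaled = se_block(u, w1, w2)
```

Here the first channel is gated by sigmoid(0) = 0.5 and the second by sigmoid(1), so the block weakens the first channel relative to the second.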

3.7 MobileNet

In traditional CNNs, the memory requirements and computational demands are substantial, making them impractical to run on mobile and embedded devices. Howard and his colleagues proposed a lightweight network, MobileNetV1 (Howard et al. 2017), tailored for mobile and embedded applications. Compared to traditional CNNs, MobileNetV1 significantly reduces the model’s parameters and computational workload at the cost of a minor decrease in accuracy. MobileNetV1 achieves 0.9 $\%$ lower accuracy than VGG16, but with only 1/32 of the model’s parameters. MobileNetV1 employs depthwise separable convolution layers, as illustrated in Fig. 25. This involves first applying a depthwise convolution to each channel of the feature map separately, followed by a pointwise $1 \times 1$ convolution, aiming to reduce computational load and model parameters. Two shrinking hyperparameters, the width multiplier and the resolution multiplier, are also introduced to trade off computation and model size against accuracy. However, a drawback of this model is its low cost-effectiveness, as many convolution kernel parameters become zero during the training process. Subsequently, Google introduced MobileNetV2 (Sandler et al. 2018), which utilizes an inverted residual structure and a linear bottleneck structure. The inverted residual structure first employs a $1 \times 1$ convolution to increase dimensionality, expanding the channels to capture more feature information. It then applies a $3 \times 3$ depthwise convolution operation and concludes with a $1 \times 1$ convolution for dimensionality reduction, effectively reducing the number of parameters. One drawback of this model is the loss of diversity between layers, which may hurt accuracy.

Depthwise separable convolution layer structure
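The parameter reduction from depthwise separable convolution can be verified by counting weights. The sketch below ignores bias terms, and the kernel size and channel counts are hypothetical values chosen for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard convolution: every filter spans all channels."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise (one k x k filter per channel) plus pointwise 1x1 weights."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernels, 64 input channels, 128 output channels.
K, C_IN, C_OUT = 3, 64, 128
standard = standard_conv_params(K, C_IN, C_OUT)
separable = depthwise_separable_params(K, C_IN, C_OUT)
ratio = separable / standard  # equals 1 / C_OUT + 1 / K**2
```

For this layer the separable version needs about 12% of the weights of the standard convolution, and the ratio approaches 1/9 for 3 × 3 kernels as the number of output channels grows.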

ImageNet (Deng et al. 2009), one of the datasets for image classification tasks, is large in scale and rich in image categories. Models trained on it generalize well, allowing them to obtain effective classification results on other image classification datasets such as CIFAR-10/100 (Krizhevsky et al. 2009), Caltech-101 (Fei-Fei et al. 2004), and SUN (Xiao et al. 2010). The training of deep CNN models has improved thanks to the availability of a wide range of large-scale datasets, and models trained on these datasets have better generalization abilities. These generalization abilities can be used in practical applications to quickly learn the features of new datasets and boost the effectiveness and efficiency of classification tasks. Performance comparisons of different architectures are shown in Table 1.

As shown in Table 1 , from AlexNet to GoogLeNet, the accuracy of image classification increases progressively. This is attributed to the deeper architecture of the networks, which leads to more effective feature extraction.

ResNet has a deeper network architecture compared to VGG, but the introduction of residual learning makes the network more easily optimized, mitigating gradient vanishing issues. Additionally, parameter sharing and reuse, along with a reduced parameter count, contribute to achieving higher performance with lower complexity and error rates.

Deep neural networks incorporating attention mechanisms have achieved remarkable performance, as exemplified by the squeeze-and-excitation (SE) block proposed by Hu et al., which effectively models dependencies among channel features. Such methods show that the core function of attention mechanisms is to emphasize useful components while suppressing those that contribute relatively little to feature extraction. Consequently, integrating attention mechanisms into networks enhances model performance and makes feature extraction more effective.
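A minimal NumPy sketch of channel attention in the spirit of the SE block is shown below: each channel is summarized by global average pooling, passed through a small two-layer bottleneck, and the resulting weights in (0, 1) rescale the channels. The weight shapes, the reduction ratio, and the function name are assumptions for illustration, not the authors’ implementation.

```python
import numpy as np


def se_block(x, w1, w2):
    """Rescale the channels of x (C, H, W) with SE-style channel weights.

    w1: (C // r, C) reduces the channel descriptor, w2: (C, C // r) restores it.
    """
    squeeze = x.mean(axis=(1, 2))                 # (C,) global average pooling
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate per channel
    return x * scale[:, None, None], scale


rng = np.random.default_rng(0)
c, r = 8, 4                                       # hypothetical sizes
x = rng.normal(size=(c, 5, 5))
w1 = rng.normal(size=(c // r, c))
w2 = rng.normal(size=(c, c // r))
y, scale = se_block(x, w1, w2)
print(y.shape, scale.round(2))
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate close to 0 are suppressed.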

Although lightweight networks may not perform as well as classical deep CNNs in image classification on the ImageNet dataset, they significantly reduce the number of parameters. This indicates that lightweight networks effectively utilize model parameters by employing methods like depthwise separable convolution. This advantage is particularly valuable in resource-constrained environments, such as mobile devices or embedded systems, where they can still deliver relatively good performance while reducing model size. This makes them more suitable for practical deployment and operation.

Although several CNN models have achieved outstanding performance in image classification, they have a number of drawbacks. First, advanced CNN models frequently have intricate structures and many parameters, requiring substantial processing power and memory during training and deployment. This issue can be mitigated with lightweight network architectures such as MobileNet and EfficientNet, with model pruning and model compression, and with other strategies that reduce model complexity and storage needs. Second, many CNN models rely heavily on an enormous amount of labeled data to perform at their best, and large-scale annotated data can be expensive and time-consuming to acquire. Transfer learning, semi-supervised learning, and data augmentation can be used to enrich the training data and lessen the reliance on annotated data. Third, typical CNN models may lose fine-grained information when applied to small images. Shallow network designs, pyramid-style network structures, or smaller convolutional kernels can be used to handle this limitation. Recent years have also seen the emergence of new research directions and methodologies, such as using transformer models for image classification. Researchers have looked into replacing convolutional blocks with transformer structures or integrating the self-attention mechanism of transformers directly into CNNs. As evidenced by models such as DeiT, Pyramid Vision Transformer, and Swin Transformer, these efforts have yielded promising outcomes. The combination of deep learning and reinforcement learning in image classification will be a significant area for future research, and this fusion of methodologies may further improve the effectiveness of image classification models.

4 Object detection

4.1 Subtask explanation

As a fundamental task in computer vision, object detection (Ma et al. 2023) is the key to solving more complex vision tasks such as image segmentation (Minaee et al. 2021), object tracking (Luo et al. 2021), and behavior recognition (Hu et al. 2023). The process of object detection and recognition typically consists of two steps: first, the likely position of each target object in an image is localized, and second, the localized objects are classified into categories. Compared with image classification, object detection focuses more on local regions of an image and on specific sets of object classes. CNNs have been used in object detection since the 1990s. However, because of a lack of training data and of hardware resources such as computational power and storage, research on CNN-based object detection received little attention and advanced slowly until 2012. The tremendous breakthrough of CNNs in the ImageNet challenge in 2012 rekindled researchers’ interest in deep CNN-based object detection, which led to a dramatic increase in object detection and recognition rates. At the same time, object detection has been widely applied in real-world scenarios, including autonomous driving (Zablocki et al. 2022), virtual reality (VR) (Xiong et al. 2021), and intelligent video surveillance (Huang et al. 2021).

Before the rise of deep learning, object detection algorithms relied on the traditional sliding window approach and on manually designed features. Commonly used feature descriptors, such as Haar (Papageorgiou et al. 1998), SIFT (Lowe 2004), and SURF (Bay et al. 2006), were used to train a separate shallow classifier for each class of target objects. The traditional object detection process is shown in Fig. 26. However, because of variations in the objects themselves and in the imaging environment, manually designed features suffer from a lack of robustness, poor generalization, and low detection accuracy (Dicong et al. 2021). The bottlenecks of traditional object detection algorithms in practical applications are twofold. On the one hand, because the designer must extract the features of the samples using prior knowledge, only a few parameters can appear in the feature design so that manual tuning remains feasible. On the other hand, because they lack model depth, shallow classifiers require exponentially more parameters and training data when faced with difficult detection tasks. The research boom in deep networks has brought new opportunities for object detection by removing the need for manual parameter tuning. Compared with traditional object detection algorithms, deep CNNs can automatically learn feature representations from massive datasets and do not require separately trained classifiers, which greatly improves the efficiency of the feature learning process.

Traditional object detection process

In this paper, we outline known strategies for object detection from two perspectives: the region-based object detection algorithm (two-stage detectors) and the regression-based object detection algorithm (one-stage detectors). Fig. 27 depicts the basic procedure of two-stage detectors, which scan the whole image using multiple fixed-size sliding windows to generate a series of region proposal boxes, select the region proposal of the image, and then perform regression localization and classification of the targets that may exist in the region proposal to achieve object detection. One-stage detectors, in contrast, do not generate region proposals and combine feature extraction, object classification, and position regression into a single CNN to complete the process, simplifying the object detection process into a form of the end-to-end regression problem, as illustrated in Fig. 28 .

Basic process of two-stage detectors

Basic process of one-stage detectors

4.2 Representative two-stage detectors

Using CNNs and region proposals, Girshick et al. (2014) introduced a deep learning object detection framework called R-CNN in 2014. The model first uses selective search (Ji et al. 2021), a non-deep-learning algorithm, to propose candidate regions to be classified, and then feeds each candidate region into a CNN to extract features. Finally, these features are fed into a linear support vector machine (SVM) for classification. To improve localization accuracy, a linear regression model is trained in R-CNN and used to correct the coordinates of the candidate regions; this process is known as bounding box regression. On the PASCAL VOC object detection dataset, the model achieved a mean average precision (mAP) approximately 20$\%$ higher than that of traditional algorithms, paving the way for two-stage detectors.
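The bounding box regression step can be sketched as follows. The code encodes the offset between a proposal and its ground-truth box using the center/size parameterization commonly associated with R-CNN, and decodes a predicted offset back into a box. The function names and the example boxes are illustrative assumptions.

```python
import numpy as np


def encode(proposal, gt):
    """Regression targets that map a proposal onto a ground-truth box.

    Boxes are (cx, cy, w, h). Center shifts are scaled by the proposal
    size; width and height changes are expressed in log space.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])


def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal."""
    px, py, pw, ph = proposal
    return np.array([px + t[0] * pw, py + t[1] * ph,
                     pw * np.exp(t[2]), ph * np.exp(t[3])])


proposal = (50.0, 50.0, 20.0, 40.0)   # hypothetical candidate region
gt = (54.0, 46.0, 30.0, 40.0)         # hypothetical ground-truth box
targets = encode(proposal, gt)        # what the regressor is trained to predict
corrected = decode(proposal, targets)
print(targets, corrected)
```

Scaling the targets by the proposal size keeps them comparable across small and large candidate regions.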

In R-CNN, approximately 2000 candidate regions are generated for each image, and features must be extracted separately for each candidate region, making feature extraction the bottleneck in total test time. A Microsoft Research team applied SPP-Net (Ma et al. 2021) to object detection and alleviated this shortcoming of R-CNN. For the candidate regions generated by the selective search algorithm, SPP-Net projects the coordinates of these regions onto the corresponding positions of the feature maps output by the last convolution layer and then feeds the features of each candidate region into the spatial pyramid pooling layer to obtain a fixed-length feature representation. The subsequent stages are similar to R-CNN: the fully connected layer receives these feature representations as input, a linear support vector machine classifies the output of the fully connected layer, and bounding box regression corrects the candidate region coordinates. On PASCAL VOC, the network achieved accuracy similar to that of R-CNN, but the total test time was significantly reduced because the time-consuming convolution operation was performed only once for each input image.

Like R-CNN, SPP-Net has certain limitations: the multi-stage training process of region proposal generation, feature extraction, and object classification is cumbersome, and a lot of storage space is needed for the extracted features. Additionally, SPP-Net fine-tunes only the fully connected layers and leaves the parameters of the network’s other layers unchanged. To solve these problems, Fast R-CNN (Girshick 2015) was introduced in 2015; its structure is shown in Fig. 29. Compared with the CNN in R-CNN, Fast R-CNN replaces the last pooling layer with a Region of Interest (RoI) pooling layer. The role of this layer is similar to that of the spatial pyramid pooling layer used in SPP-Net, namely to output a fixed-dimensional feature vector for an input of any size, except that the RoI pooling layer uses only a single level of spatial partitioning. This allows Fast R-CNN, like SPP-Net, to feed the whole image together with the coordinates of the candidate regions generated by selective search into a CNN and then apply RoI pooling to the feature maps output by the last convolution layer to obtain the features of each candidate region, which eliminates the need for a separate convolution computation per candidate region. In addition, Fast R-CNN replaces the last softmax classification layer of the CNN with two sibling fully connected layers: one is still a softmax classification layer, and the other is a bounding box regressor that corrects the coordinates of the candidate regions. During training, Fast R-CNN uses a multi-task loss function to train the two fully connected layers for classification and for correction of candidate region coordinates simultaneously.
This training approach achieves better detection results on the PASCAL VOC dataset than the network obtained from the staged training previously used for R-CNN, thus eliminating the need for additional training of SVM classifiers in Fast R-CNN and realizing the integration of the process from extracting image features to completing detection.

Fast R-CNN architecture
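To make the RoI pooling layer described above more concrete, the following sketch max-pools an arbitrary rectangular region of a feature map into a fixed grid. It is a simplified illustration: RoI coordinates are assumed to be given directly in feature-map cells, and the bin boundaries are rounded in one plausible way; the function name is hypothetical.

```python
import numpy as np


def roi_max_pool(feat, roi, out_size=(2, 2)):
    """Max-pool the region roi = (x0, y0, x1, y1) of feat (C, H, W)
    into a fixed (C, out_h, out_w) output, whatever the RoI size."""
    x0, y0, x1, y1 = roi
    out_h, out_w = out_size
    ys = np.linspace(y0, y1, out_h + 1)   # bin edges along height
    xs = np.linspace(x0, x1, out_w + 1)   # bin edges along width
    out = np.empty((feat.shape[0], out_h, out_w), dtype=feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            xa, xb = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            yb, xb = max(yb, ya + 1), max(xb, xa + 1)   # keep bins non-empty
            out[:, i, j] = feat[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out


feat = np.arange(16, dtype=float).reshape(1, 4, 4)   # toy single-channel map
whole = roi_max_pool(feat, (0, 0, 4, 4))             # RoI covering the full map
part = roi_max_pool(feat, (1, 1, 4, 3))              # smaller, non-square RoI
print(whole[0], part[0], sep="\n")
```

Both regions yield a 2 x 2 output per channel, so the fully connected layers that follow always receive a vector of the same length.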

These models improved the training process and the structure of the CNN. However, they all use traditional algorithms to propose candidate regions, and these algorithms run on CPUs, which makes the computation of candidate regions the bottleneck of the overall running time. Therefore, the Faster R-CNN model (Ren et al. 2015) introduces a region proposal network (RPN) to improve this step; its structure is shown in Fig. 30. Faster R-CNN improves on Fast R-CNN by sliding a window over the feature map output by the last convolution layer and feeding each window into the RPN. For each position that the sliding window passes over, the model defines several anchors with different scales and aspect ratios centered on the window, and the RPN computes a candidate region from each anchor. Because the RPN shares the convolutional features of the Fast R-CNN network used for detection and is also implemented on GPUs, the time overhead of proposing candidate regions is greatly reduced: detection takes about 1/10 of the original time, and accuracy is improved. This suggests that the RPN not only operates more efficiently but also improves the quality of the candidate regions.

Region proposal network (RPN)
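The anchors placed at each sliding-window position can be sketched as below. The scales and aspect ratios are illustrative values, and the function name is hypothetical; each anchor keeps an area of scale squared while its height-to-width ratio varies.

```python
import numpy as np


def make_anchors(cx, cy, scales, ratios):
    """Return anchors (x0, y0, x1, y1) centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)


# Illustrative values: 3 scales x 3 aspect ratios = 9 anchors per position.
anchors = make_anchors(cx=100.0, cy=100.0,
                       scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(anchors.shape)
```

Repeating this at every position of the feature map gives the full set of anchors from which candidate regions are regressed.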

Although various applications have successfully recognized medium-sized and large objects in images with good accuracy, small object detection remains problematic. Small objects are very difficult to recognize because of indistinct features, complicated backgrounds, low resolution, insufficient context information, and so on. As a result, considerable research is being done in this field, and numerous deep learning approaches with promising outcomes have been developed recently. Lin et al. (2017) exploited the pyramidal feature hierarchy of CNNs: through a top-down pathway and lateral connections, high-level features with low resolution and strong semantic information are combined with low-level features with high resolution and weak semantic information to construct Feature Pyramid Networks (FPNs), which carry high-level semantic information at all scales. FPN greatly improved the detection accuracy of the network, achieved state-of-the-art object detection at the time, and has become one of the important techniques for improving the accuracy of mainstream networks. Moreover, compared with other object detection models, FPN has achieved good results in improving accuracy for small object detection.

He et al. presented Mask R-CNN (He et al. 2017) in 2017, which integrates the concepts of Faster R-CNN and the fully convolutional network (FCN). The feature extraction part uses a feature pyramid network (FPN) architecture, the RoI pooling layer is replaced with an RoIAlign layer, and a mask prediction branch is added. The FPN architecture improves the model’s multi-scale feature extraction capacity and the recognition of small objects. However, the detection speed is no faster than that of Faster R-CNN, which is insufficient for real-time applications.

Cao et al. proposed a novel two-stage detector, D2Det (Cao et al. 2020), in 2020 that addresses precise localization and accurate classification at the same time. The model uses dense local regression to predict multiple dense box offsets for an object. Dense local regression is not confined to a fixed set of quantized key points and can regress position-sensitive, real-valued dense offsets, allowing for more accurate localization. To improve classification accuracy, discriminative RoI pooling (DRP) is used to extract accurate object feature regions in the first and second stages, respectively. Table 2 compares the performance of two-stage detectors.

4.3 Representative one-stage detectors

Rather than using pre-defined anchors for object regions, one-stage detectors such as YOLO divide the input image into a number of cells, and each cell is responsible for predicting the objects whose centers fall into it. The class and location of an object are determined in a single stage, which gives a faster detection speed; however, the detection accuracy is lower than that of two-stage detectors. YOLO (you only look once) (Redmon et al. 2016) was the first one-stage object detection algorithm and is a typical example. The fundamental idea behind YOLO is to divide the image into multiple cells, predict for each cell the bounding box coordinates, the classes of the objects inside the boxes, and the corresponding confidence scores, and then remove overlapping boxes with non-maximum suppression (NMS) to obtain the final predicted boxes. For instance, if the center of an object to be recognized falls within one of the cells, that cell is in charge of determining the class and location of the object.
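The step that removes overlapping boxes can be sketched as a greedy non-maximum suppression routine. This is a generic NumPy illustration rather than the YOLO implementation; the threshold of 0.5 and the example boxes are assumptions.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x0, y0, x1, y1)."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # drop heavy overlaps
    return keep


boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.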

Compared with two-stage detectors, the real-time object detector YOLO is remarkably fast. However, it struggles to predict bounding box scales and aspect ratios accurately, especially for small objects, which leads to relatively low localization and classification accuracy. It also performs poorly on objects that lie only partially within a cell. In 2017, Redmon and Farhadi proposed YOLOv2 (Redmon and Farhadi 2017). YOLOv2 adds a batch normalization layer after every convolution layer to accelerate training, employs DarkNet-19 (Al-Haija et al. 2021) as the backbone, and uses a high-resolution classifier (Anuj and Gopalakrishna 2020), which pre-trains the model on high-resolution ImageNet images and then fine-tunes it on the target dataset to improve training stability. Together, these strategies significantly increased detection accuracy while keeping the detector fast.

YOLOv3 (Redmon and Farhadi 2018) uses DarkNet-53 as the backbone for extracting image features and replaces softmax with logistic regression for classification. Predictions are made at multiple scales using an FPN-style structure, and the prior boxes (anchors) are chosen by k-means clustering. In YOLOv3, nine prior boxes are used, and three feature maps with different receptive fields are used to detect objects of different sizes.
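Choosing box priors by k-means clustering can be sketched as follows. The code clusters ground-truth box sizes using the distance 1 - IoU, as described for the YOLO family; the mean-based centroid update, the explicit initial centroids, and the toy data are assumptions made for a deterministic illustration.

```python
import numpy as np


def wh_iou(wh, centroids):
    """IoU between boxes and centroids given as (w, h), aligned at one corner."""
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
             np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    box_area = (wh[:, 0] * wh[:, 1])[:, None]
    cen_area = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (box_area + cen_area - inter)


def kmeans_anchors(wh, init, iters=50):
    """Cluster box sizes with distance 1 - IoU, starting from `init` centroids."""
    centroids = np.asarray(init, dtype=float)
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)   # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids


# Toy box sizes (w, h): two small and two large boxes.
wh = np.array([[10, 10], [12, 12], [100, 100], [110, 110]], dtype=float)
priors = kmeans_anchors(wh, init=wh[[0, 2]])
print(priors)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the distance depends on relative rather than absolute size differences.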

YOLOv4 (Bochkovskiy et al. 2020) introduced mosaic data augmentation on the input images. For feature extraction, YOLOv4 integrated numerous recent techniques, including CSPDarkNet53 and the Mish activation function. In the neck, SPP and PAN were employed in place of FPN to enlarge the receptive field and perform feature fusion. Overall, YOLOv4 is a significant improvement over YOLOv3 and has considerable technical value because it incorporates and validates many recent research methods from deep learning. The network of YOLOv5 can be divided into four parts: input, backbone, neck, and prediction, which makes it quite similar to YOLOv4. On the input images, YOLOv5 applies adaptive image scaling, adaptive anchor box computation, and mosaic data augmentation. The backbone combines a Focus structure, which is new in YOLOv5 and whose key element is a slicing operation, with a CSP structure. Although the neck of YOLOv5 now has the same structure as that of YOLOv4, only the FPN structure was used when it was first released; the PAN structure was introduced later, and other network components were also modified. Although YOLOv4 already achieves high detection accuracy, the multiple network variants of YOLOv5 make it more flexible in practical use.

Accuracy and speed are two critical performance characteristics in object detection, and balancing them is crucial in industrial applications. YOLOv6 (Li et al. 2022), designed for industrial applications, was released in 2022. It supports the entire chain of industrial requirements, including model training, inference, and multi-platform deployment, and makes several improvements and optimizations to the network structure, the training strategy, and other algorithmic aspects. YOLOv6 improves on earlier models in terms of backbone, neck, head, and training approach. Inspired by the idea of hardware-aware neural network design, Li et al. built a re-parameterizable and more efficient backbone based on the RepVGG architecture (Ding et al. 2021). An anchor-free paradigm is adopted, and to further increase detection accuracy, the SimOTA (Ge et al. 2021) label assignment technique and the SIoU (Gevorgyan 2022) bounding box regression loss are included. In terms of accuracy and speed, YOLOv6 surpasses other methods of comparable size on the COCO dataset.

To establish a model that balances detection speed and accuracy, Tan et al. proposed EfficientDet (Tan et al. 2020), which is based on EfficientNet (Tan and Le 2019). The model uses EfficientNet as the backbone and a bi-directional feature pyramid network (BiFPN) as the feature network, which enables fast multi-scale feature fusion with learnable weights for the fused features, and it introduces a compound scaling strategy. Compound scaling jointly and uniformly scales the depth, width, and resolution of the backbone network, the feature network, and the box/class prediction network to obtain the best results.

Dong et al. introduced CentripetalNet (Dong et al. 2020) to address the issue that keypoint-based detectors are prone to matching errors. The model predicts the position and the centripetal shift of each corner point of an object and matches corners whose shifted positions agree, which matches corner points more precisely than the conventional embedding method. In addition, a cross-star deformable convolution is proposed to better learn the cross-star-shaped features in the feature maps produced by the corner pooling layer. On the COCO dataset, the model is reported to outperform other anchor-free object detectors. Datasets are essential for the training and evaluation of supervised algorithms. The two datasets most often used for object detection tasks are PASCAL VOC (Shetty 2016) and Microsoft COCO (Lin et al. 2014). Table 3 compares the performance of one-stage detectors.

Making the model more complicated to optimize the network structure improves accuracy, but it decreases the training and detection speeds, which makes it hard to meet real-time detection requirements. Balancing accuracy and speed will therefore be a direction of future study. One approach is to combine the high accuracy of region-based algorithms with the high speed of regression-based algorithms so that precision and speed together meet practical demands. A detection algorithm may achieve state-of-the-art (SOTA) performance on one task but perform less well on others because of factors such as complex backgrounds with substantial noise interference, low contrast between object and background colors, which makes it hard for the network to extract discriminative features, and small object sizes, which are difficult to detect. Therefore, a specific analysis of the difficulties of each detection task helps in designing techniques that achieve SOTA performance on that task.

In the past several years, object detectors based on CNN have entered the fast track of development, during which certain results have been achieved, but there is still room for further development. The following provides the frontier issues and research directions in this field to promote the research and improvement of subsequent object detectors.

Weakly supervised and small-sample detection: At this stage, object detection models are trained on large-scale instance-labeled data, and data labeling is time-consuming and labor-intensive. Weakly supervised object detection reduces the cost of data annotation and trains the network efficiently with a small quantity of annotated data. Knowledge learned from labeled data in related domains can also be transferred through transfer learning, and the model can then be trained with a small amount of labeled data in the target domain to improve object detection there.

Multi-modal detection: To overcome the limitations of datasets that contain a single type of data, data from multiple modalities, such as RGB images and 3D images, can be fused, which is crucial for fields such as autonomous driving and intelligent robotics. Therefore, how to fuse data from different modalities and how to adapt detection models to multi-modal data will be a focus of future research.

Video detection: Video detection faces many issues, including redundant feature information, out-of-focus frames, and occlusion, which result in high computational redundancy and low detection accuracy. Therefore, the study of object detection algorithms based on video sequences will become one of the future research directions.

5 Video prediction

5.1 Subtask explanation

Transformer (Vaswani et al. 2017 ), with its strong capabilities in long-range modeling and parallelized sequence processing, has gradually attracted the interest of researchers in the fields of image processing and computer vision. It has demonstrated excellent performance in applications such as object tracking, image generation, and image enhancement. Fig. 31 illustrates a simplified architecture diagram of the Transformer model.

Simplified architecture for transformer

The transformer consists of two parts, an encoder and a decoder; the detailed composition of each is depicted in Fig. 32. The encoder employs a multi-head self-attention (MHSA) mechanism, in which the input matrix is linearly mapped into feature subspaces, one per independent attention head, where dot-product attention is computed. The outputs of the heads are then concatenated and linearly mapped to obtain the final output, which captures global information. Following this, a feedforward neural network (FFN), consisting primarily of two linear layers and a non-linear activation layer, transforms dimensions and extracts richer semantic information. The decoder is composed of self-attention, encoder-decoder attention, and feedforward components. For example, in Fig. 31, the input Chinese sentence ‘I have a cat’ passes through six encoders, which produce something similar to a context vector. This can be understood as the encoder’s understanding of the current input sentence. The obtained vector is then fed into the decoder. Each decoder performs self-attention on the output of the previous decoder and simultaneously applies encoder-decoder attention to the vector passed from the encoder; the result is then processed by a feedforward network. By stacking six such decoders, the model learns and produces the final output.

Encoder and decoder
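The multi-head self-attention computation in the encoder can be sketched in NumPy as below. The projection matrices, the token count, and the model width are hypothetical, and masking, layer normalization, and residual connections are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (n, d) token matrix; wq, wk, wv, wo: (d, d) projection matrices."""
    n, d = x.shape
    dh = d // num_heads

    def split(t):                              # (n, d) -> (heads, n, dh)
        return t.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # scaled dot products
    attn = softmax(scores, axis=-1)                   # (heads, n, n)
    heads = attn @ v                                  # (heads, n, dh)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # concatenate the heads
    return concat @ wo, attn                          # final linear mapping


rng = np.random.default_rng(0)
n, d, num_heads = 4, 8, 2                      # hypothetical sizes
x = rng.normal(size=(n, d))
wq, wk, wv, wo = (rng.normal(size=(d, d)) for _ in range(4))
out, attn = multi_head_self_attention(x, wq, wk, wv, wo, num_heads)
print(out.shape, attn.shape)
```

Each row of an attention matrix sums to one, so every output token is a weighted average of the value vectors of all tokens, which is how global information is gathered.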

Deep learning algorithms are mostly trained in a supervised manner, where training is time-consuming and depends on large amounts of labeled data. A key element that is still lacking is predictive or unsupervised learning: the ability of a machine to model its environment, to predict future possibilities, and to understand how the world works through observation and interaction. Video prediction is a technique in which a computer learns spatio-temporal features from video frames and applies the learned features to the analysis and prediction of future frames. Since spatio-temporal information reflects many intrinsic laws of the real world and video prediction models can be trained on vast volumes of unlabeled data, video prediction has attracted a lot of attention in academia, for example in human motion prediction (Liu et al. 2022), climate change (Ankrah et al. 2022), and traffic flow prediction (Gao et al. 2022). The goal of video prediction is to infer future frames from previous ones. Given a video sequence $\textbf{X}_{t,T} = \left\{ \textbf{x}_i \right\} _{t - T + 1}^t$ containing the past $T$ frames at time $t$, our goal is to forecast the future sequence $\textbf{Y}_{t,T'} = \left\{ \textbf{x}_i \right\} _{t + 1}^{t + T'}$ containing the next $T'$ frames, where $\textbf{x}_i \in \mathbb {R}^{C \times H \times W}$ is an image with $C$ channels, height $H$, and width $W$. Formally, the predicting model is a mapping $\Gamma _\theta : \textbf{X}_{t,T} \rightarrow \textbf{Y}_{t,T'}$ with learnable parameters $\theta $, optimized by:

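$$\theta ^{*} = \arg \min _{\theta } \Phi \left( \Gamma _\theta \left( \textbf{X}_{t,T} \right) , \textbf{Y}_{t,T'} \right)$$

where $\Phi $ can represent various loss functions; in our scenario, we specifically utilize Mean Squared Error (MSE) loss.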
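The training objective described above, a loss between predicted and true future frames, can be illustrated with a small NumPy sketch. The predictor here is a trivial stand-in that repeats the last observed frame; it is an assumption used only to show the tensor shapes and the MSE computation, not a learned model.

```python
import numpy as np


def mse_loss(pred, target):
    """Mean squared error over all frames, channels, and pixels."""
    return float(np.mean((pred - target) ** 2))


def repeat_last_frame(past, horizon):
    """Stand-in predictor: past (T, C, H, W) -> prediction (T', C, H, W)."""
    return np.repeat(past[-1:], horizon, axis=0)


rng = np.random.default_rng(0)
video = rng.random((14, 1, 16, 16))       # toy clip: 14 frames, C=1, H=W=16
T, T_next = 10, 4                         # past frames and frames to predict
past, future = video[:T], video[T:T + T_next]
pred = repeat_last_frame(past, T_next)
loss = mse_loss(pred, future)
print(pred.shape, loss)
```

A learned model would replace `repeat_last_frame` and adjust its parameters to reduce this loss over many clips.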

5.2 Deep learning applications

In response to the complexity and inherent uncertainty of video, scholars have achieved impressive results in video prediction in recent years by introducing various new neural operators, such as RNN variants (Wang et al. 2022), transformers (Ren et al. 2022), and refinement structures (Chang et al. 2022), and by applying different training strategies, such as adversarial training (Chan et al. 2022). The design of the video prediction model is also crucial for the accuracy of the forecast and the quality of the predicted frames. Most of the current prominent video prediction models are based on deep learning architectures such as autoencoders (Baldi 2012), recurrent neural networks (Medsker and Jain 2001), and generative adversarial networks (Creswell et al. 2018), which have opened up new possibilities for video prediction.

Most video prediction models use autoencoders for dimensionality reduction and generation of video because they can encode compactly and efficiently. Shi et al. (2015) proposed the Convolutional LSTM (ConvLSTM) model, which addresses the spatio-temporal sequence prediction problem by combining the sequence processing capability of LSTM with the spatial feature representation capability of CNN. Whereas recurrent neural networks applied to tasks such as translation take one-dimensional word vectors as input, ConvLSTM applies convolution operations to the input image sequence and therefore takes two-dimensional images as input; depending on the task, it can also take three-channel color images, i.e., three-dimensional inputs. In the video frame prediction task, ConvLSTM takes a sequence of single-channel 64 $\times $ 64 digit images as input. The ConvLSTM model, as illustrated in Fig. 33, contains the same three gates and hidden layer as the LSTM model, namely an input gate, a forget gate, an output gate, and a hidden layer. The main distinction is that a single convolution layer is computed after merging the current input with the hidden state, and this difference is critical for capturing spatial structural information. Subsequently, Wang et al. (2017) employed ConvLSTM units to develop an encoder that captures the spatio-temporal information contained in video frames and works well in video prediction tasks. Lotter et al., inspired by “predictive coding” in neuroscience (Egner and Summerfield 2013), used ConvLSTM units to construct PredNet (Lotter et al. 2016), a multi-layer recurrent neural network that passes the prediction error of each layer to the next layer to improve the accuracy of the network’s final prediction. ConvLSTM and an optical flow predictor are used in the spatio-temporal video autoencoder (Patraucean et al. 2015) to capture changes over time.
Although ConvLSTM partially resolves the issue of capturing and processing spatio-temporal information from video frame sequences and increases prediction accuracy, it still tends to produce blurry predicted frames. To address the issues of poor dynamic information captured by the model, low prediction accuracy, and poor quality of the generated frames, Wang et al. (2019) presented a three-dimensional CNN architecture combined with LSTM to distinguish various activities in video frames. According to the reported experiments, Wang’s model outperforms others in terms of prediction accuracy. Conditional VRNN (Castrejon et al. 2019) combines a CNN encoder and an RNN decoder in a variational generative framework. CrevNet (Yu et al. 2020) uses a CNN-based normalizing flow module to encode and decode the input, which provides an information-preserving feature transformation.

Fig. 33 Convolutional LSTM structure

Fully CNN-based architectures are less common than the models above, because their simplicity often has to be offset by sophisticated modules and training techniques, such as adversarial training (Yang et al. 2023), knowledge distillation (Feng et al. 2023), and optical flow approaches (Sui et al. 2022), in order to increase novelty and performance. In response, Gao et al. (2022) introduced SimVP, a simple yet effective CNN video prediction model. SimVP achieves SOTA results without sophisticated modules, techniques, or tricks, and its low computational cost makes it easy to scale to various scenarios. It can therefore serve as a robust baseline and provide fresh perspectives for further study. The model is built entirely from CNNs and is trained end-to-end with a mean squared error (MSE) loss. Its encoder, translator, and decoder are all convolutional: the encoder extracts spatial features, the translator learns temporal evolution, and the decoder combines spatial and temporal information to predict future frames.
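The encoder, translator, and decoder roles described above can be made concrete by tracing tensor shapes through such a pipeline. The sketch below is illustrative only and is not the SimVP implementation: the function names, the `hidden` channel width, the `downsample` factor, and the assumption that the model predicts as many frames as it receives are all hypothetical choices made for this example. It also includes a plain MSE loss of the kind used for end-to-end training.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def simvp_shapes(batch: int, frames: int, channels: int, height: int, width: int,
                 hidden: int = 64, downsample: int = 4) -> List[Tuple[str, Shape]]:
    """Trace tensor shapes through an encoder -> translator -> decoder pipeline.

    `hidden` and `downsample` are hypothetical settings chosen for illustration.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by downsample")
    h, w = height // downsample, width // downsample
    return [
        ("input", (batch, frames, channels, height, width)),
        # Encoder: 2D convolutions applied to every frame independently,
        # so the time axis is folded into the batch axis.
        ("encoder", (batch * frames, hidden, h, w)),
        # Translator: frames are stacked along the channel axis so that
        # 2D convolutions can mix information across time steps.
        ("translator", (batch, frames * hidden, h, w)),
        # Decoder: upsample back to the resolution of the input frames.
        ("decoder", (batch, frames, channels, height, width)),
    ]


def mse_loss(pred: List[float], target: List[float]) -> float:
    """Mean squared error between flattened predicted and true frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    for stage, shape in simvp_shapes(batch=2, frames=10, channels=1, height=64, width=64):
        print(f"{stage:<10} {shape}")
    print(mse_loss([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]))
```

Folding time into the batch axis for the encoder and into the channel axis for the translator is what lets ordinary 2D convolutions handle a video clip without any recurrent unit.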

Improved prediction performance and accuracy will open up more applications for video prediction. Video prediction models trained with deep learning have already been used in areas such as action recognition and video understanding. In autonomous driving, which has seen substantial progress in recent decades, accurate prediction of future scenes from real-time observations of the current scene would allow driverless cars to take the necessary precautions and avoid risks as far as possible. Video prediction remains an intriguing and challenging task in computer vision. Most of the preceding models can forecast certain basic scenes successfully. Future studies may therefore turn to more complex scenes, and prediction accuracy should improve if the probability distribution of dynamic scenes can be modeled and predicted.

6 CNN challenges

Deep CNNs have demonstrated strong performance on data with a grid-like topology and on time series data. In real-world applications, however, deep CNN architectures run into additional difficulties, and researchers continue to debate CNN performance across different machine learning tasks. The following list includes some of the difficulties encountered when training deep CNN models:

Because deep CNNs generally behave like black boxes, they can be hard to interpret and explain. As a consequence, it can be difficult to verify them.

The choice of hyper-parameters (for example, learning rate, stride, and filter size) has a substantial influence on CNN performance. Selecting good values requires considerable expertise, and because hyper-parameters are strongly interdependent, even a small change can significantly affect the final training outcome. Careful selection of hyper-parameters is therefore a key design challenge that must be handled with an appropriate optimization technique. In this context, meta-heuristic algorithms may be used to tune hyper-parameters automatically by performing both random and directed searches based on prior results.

Deep CNN models for mobile devices are difficult to implement due to the necessity of maintaining model accuracy while taking into account the model’s size and performance.

Deep neural networks need a lot of data and processing power to be trained. Even when using the same set of datasets for various tasks, the data labeling varies, making the task of manually collecting large-scale and annotated datasets challenging. This results in a significant increase in labor and time costs for labeling the datasets created for a particular task. The success of the model training might also be significantly impacted by the datasets’ annotation quality. Therefore, the creation of extensive and precisely labeled datasets has emerged as a critical issue for computer vision research. By using unsupervised learning approaches to extract hierarchical features, the requirement for a lot of labeled data may be reduced. Simultaneously, further research into how to construct effective and scalable parallel learning algorithms is imperative to accelerate the learning process.

At inference time, deep CNN models need a large amount of memory to hold their many parameters and are time-consuming to run, which makes them impractical to deploy on resource-limited mobile platforms and other portable devices. It is therefore critical to research ways to reduce the complexity of neural networks while producing models that run quickly without sacrificing accuracy.

Despite the outstanding performance of deep CNNs in various applications, their theoretical and mathematical foundations are still lacking. As deep learning technology advances, neural network models are evolving toward more layers and larger parameter counts. Discovering strategies to lower the computational complexity of the model is therefore critical, which requires ongoing optimization in both theory and experiment. Meanwhile, the mathematical theory of deep learning is incomplete, and model optimization currently relies heavily on the designer’s prior knowledge, which hinders the development of a complete theoretical framework for deep learning. As a result, understanding what features deep networks have learned and the principles by which deep CNNs achieve high performance is an increasingly prominent field of study.
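The hyper-parameter challenge listed above mentions meta-heuristic tuning that mixes random and directed search based on prior results. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not a method from the literature reviewed here: all function names are hypothetical, `toy_validation_loss` is a stand-in for actually training a CNN and measuring validation loss, and stride is treated as a continuous value purely to keep the example short.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def sample(bounds: Bounds, rng: random.Random) -> Config:
    """Random search step: draw every hyper-parameter uniformly from its range."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}


def perturb(cfg: Config, bounds: Bounds, rng: random.Random, scale: float = 0.1) -> Config:
    """Directed search step: jitter the best configuration found so far."""
    out = {}
    for k, (lo, hi) in bounds.items():
        step = rng.gauss(0.0, scale * (hi - lo))
        out[k] = min(hi, max(lo, cfg[k] + step))  # clip to the allowed range
    return out


def tune(objective: Callable[[Config], float], bounds: Bounds,
         trials: int = 200, explore: float = 0.3, seed: int = 0) -> Tuple[Config, float]:
    """Minimize `objective` by mixing random exploration with local refinement."""
    rng = random.Random(seed)
    best = sample(bounds, rng)
    best_score = objective(best)
    for _ in range(trials):
        # With probability `explore` try a fresh random point; otherwise
        # search near the best configuration seen in earlier trials.
        cand = sample(bounds, rng) if rng.random() < explore else perturb(best, bounds, rng)
        score = objective(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score


def toy_validation_loss(cfg: Config) -> float:
    """Hypothetical stand-in for training a CNN and returning validation loss."""
    return (cfg["log10_lr"] + 3.0) ** 2 + 0.1 * (cfg["stride"] - 2.0) ** 2


if __name__ == "__main__":
    space = {"log10_lr": (-6.0, 0.0), "stride": (1.0, 4.0)}
    best, loss = tune(toy_validation_loss, space)
    print(best, loss)
```

In practice each call to the objective is a full training run, so the number of trials is the dominant cost, which is why the evaluation time of deep networks makes tuning hard.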

7 Future directions

The incorporation of novel ideas into the design of CNN architectures has shifted the focus of research, particularly in the field of computer vision. Research on CNN architecture is very promising, and the most popular deep learning methods of the future are likely to be related to it.

One of the potential areas of CNN research is ensemble learning. By extracting different levels of semantic representations, the combination of multiple and diverse architectures can help the model improve its robustness and generalization to a variety of image categories.

CNNs and their variations are extensively employed in diverse computer vision applications; however, the majority of CNN architectures are tailored to specific uses. Better-performing generic architectures are always needed (Patel et al. 2022 ).

Attention is the key process by which the human visual system acquires information from images. Attention mechanisms extract key information from images and store it in context with other visual components. Future research could preserve the spatial relevance of objects and their distinguishing features in subsequent learning stages.

CNN learning power is typically increased by increasing network size, which can be accomplished in a reasonable amount of time using the Nvidia DGX-2 supercomputer. However, in terms of memory usage and computational resources, training deep and high-volume architectures continues to be a significant overhead. As a result, numerous advancements in hardware technology are still needed to speed up CNN research.

Deep CNNs have a large number of hyper-parameters, such as learning rate, stride, and filter size. The selection of hyper-parameters and the evaluation time of deep networks make parameter tuning quite difficult. In this context, meta-heuristic algorithms can be used to tune hyper-parameters automatically by performing both random and directed searches based on previous results.

The future of network design is neural architecture search, which has grown in popularity due to the time-intensive and labor-intensive nature of human network design. It does, however, have some requirements for the experimental environment due to its lengthy training period and significant memory resource consumption.

Human activity recognition is a popular area of research in the field of CNN. The various CNN variants for human activity and pose recognition have been described in references (Vishwakarma and Singh 2019 ; Singh and Vishwakarma 2021 ; Dhiman and Vishwakarma 2020 ).
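The ensemble learning direction listed above can be illustrated with soft voting, one common way to combine several classifiers: the class-probability vectors produced by different architectures are averaged before the final decision. The sketch below is a generic illustration, not a method proposed in this review; the function names and the example probabilities are hypothetical.

```python
from typing import List, Optional, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted average of class-probability vectors from several models."""
    if not prob_sets:
        raise ValueError("need at least one model output")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all models must predict the same number of classes")
    if weights is None:
        weights = [1.0] * len(prob_sets)
    if len(weights) != len(prob_sets):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    return [sum(w * p[c] for w, p in zip(weights, prob_sets)) / total
            for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]],
            weights: Optional[Sequence[float]] = None) -> int:
    """Index of the class with the highest averaged probability."""
    avg = soft_vote(prob_sets, weights)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # Hypothetical softmax outputs of three different CNNs for one image.
    outputs = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.3, 0.4, 0.3],
    ]
    print(soft_vote(outputs))
    print(predict(outputs))
```

In this example the first model alone would choose class 0, but the averaged prediction is class 1, which shows how diverse models can correct one another.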

8 Conclusion

CNN has made impressive strides, particularly in image processing and video-related tasks, which has rekindled interest in deep learning among academics. Several studies have been done in this context to enhance CNN’s performance, including activation, optimization, regularization, and innovations in architecture. This paper reviews the research progress of CNN architecture in computer vision, especially in image classification, target detection, and video prediction. In addition, this paper also covers the fundamental elements of CNN, its applications, challenges, and future directions. We have shown that CNN outperforms classical methods when it comes to classification, detection, and prediction. Through exploiting depth and other structural modifications, CNN’s learning performance has dramatically improved over time. According to recent literature, the increase in CNN performance is primarily attributable to the replacement of the conventional layer structure with blocks. The function of an auxiliary learner can be performed by a block in a network. These additional learners leverage spatial or feature-map information or even enhance input channels to increase performance. Additionally, modular learning is supported by CNN’s block-based design, which makes the structure easier to grasp.

As a review, this paper inevitably has the following shortcomings. First, it is limited by the scope of the literature surveyed and by time, so it cannot comprehensively cover all relevant research. Certain emerging areas or specific application scenarios may not be covered, leaving research blind spots. Second, the review may be influenced by the subjective judgment of the authors, which may affect the objectivity of its account of the research area. In future studies, we will therefore need to sift through the relevant literature more thoroughly and handle subjective factors more cautiously in order to examine the application of CNN in computer vision more comprehensively and in greater depth.

Al-Haija QA, Smadi M, Al-Bataineh OM (2021) Identifying phasic dopamine releases using darknet-19 convolutional neural network. In: 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5.

Al Husaini MAS, Habaebi MH, Gunawan TS, Islam MR, Elsheikh EA, Suliman F (2022) Thermal-based early breast cancer detection using inception v3, inception v4 and modified inception mv4. Neural Comput Appl 34(1):333–348

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Van Esesn BC, Awwal AAS, Asari VK (2018) The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164

Ankrah J, Monteiro A, Madureira H (2022) Bibliometric analysis of data sources and tools for shoreline change analysis and detection. Sustainability 14(9):4895

Anuj L, Gopalakrishna M (2020) ResNet50-YOLOv2-convolutional neural network based hybrid deep structural learning for moving vehicle tracking under occlusion. Solid State Technol 63(6):3237–3258

Baldi P (2012) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 37–49. JMLR Workshop and Conference Proceedings

Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust features. Lecture Notes Comput Sci 3951:404–417

Bhatt D, Patel C, Talsania H, Patel J, Vaghela R, Pandya S, Modi K, Ghayvat H (2021) CNN variants for computer vision: history, architecture, application, challenges and future scope. Electronics 10(20):2470

Bochkovskiy A, Wang C-Y, Liao H-YM (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934

Bouvrie J (2006) Introduction notes on convolutional neural networks (1)

Cao J, Cholakkal H, Anwer RM, Khan FS, Pang Y, Shao L (2020) D2det: Towards high quality object detection and instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11485–11494

Castrejon L, Ballas N, Courville A (2019) Improved conditional vrnns for video prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7608–7617

Chan ER, Lin CZ, Chan MA, Nagano K, Pan B, De Mello S, Gallo O, Guibas LJ, Tremblay J, Khamis S (2022) Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16123–16133

Chan JY-L, Bea KT, Leow SMH, Phoong SW, Cheng WK (2023) State of the art: a review of sentiment analysis based on sequential transfer learning. Artif Intell Rev 56(1):749–780

Chandra MA, Bedi S (2021) Survey on SVM and their application in image classification. Int J Inf Technol 13:1–11

Chang Z, Zhang X, Wang S, Ma S, Gao W (2022) Stau: A spatiotemporal-aware unit for video prediction and beyond. arXiv preprint arXiv:2204.09456

Chen Y, Dai X, Chen D, Liu M, Dong X, Yuan L, Liu Z (2022) Mobile-former: Bridging mobilenet and transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5270–5279

Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B, Bharath AA (2018) Generative adversarial networks: an overview. IEEE Signal Process Mag 35(1):53–65

Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

Dhiman C, Vishwakarma DK (2020) View-invariant deep architecture for human action recognition using two-stream motion and shape temporal dynamics. IEEE Trans Image Process 29:3835–3844

Dicong W, Chenshuai B, Kaijun W (2021) Survey of video object detection based on deep learning. J Front Comput Sci Technol 15(9):1563

Ding X, Zhang X, Ma N, Han J, Ding G, Sun J (2021) Repvgg: Making vgg-style convnets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742

Dong Z, Li G, Liao Y, Wang F, Ren P, Qian C (2020) Centripetalnet: Pursuing high-quality keypoint pairs for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10519–10528

Egner T, Summerfield C (2013) Grounding predictive coding models in empirical neuroscience research. Behav Brain Sci 36(3):210–211

Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178.

Feng Z, Guo Y, Sun Y (2023) CEKD: Cross-modal edge-privileged knowledge distillation for semantic scene understanding using only thermal images. IEEE Robot Autom Lett 8(4):2205–2212

Fernandes S, Fanaee-T H, Gama J (2021) Tensor decomposition for analysing time-evolving social networks: an overview. Artif Intell Rev 54:2891–2916

Gao Z, Tan C, Wu L, Li SZ (2022) Simvp: Simpler yet better video prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3170–3180

Ge Z, Liu S, Wang F, Li Z, Sun J (2021) Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430

Gevorgyan Z (2022) Siou loss: More powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740

Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448

Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587

Guo G, Han L, Wang L, Zhang D, Han J (2023) Semantic-aware knowledge distillation with parameter-free feature uniformization. Visual Intell 1(1):6

He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969

Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

Hu K, Jin J, Zheng F, Weng L, Ding Y (2023) Overview of behavior recognition based on deep learning. Artif Intell Rev 56(3):1833–1865

Huang C, Wu Z, Wen J, Xu Y, Jiang Q, Wang Y (2021) Abnormal event detection using deep contrastive learning for intelligent video surveillance system. IEEE Trans Industr Inform 18(8):5171–5179

Huang L, Qin J, Zhou Y, Zhu F, Liu L, Shao L (2023) Normalization techniques in training dnns: Methodology, analysis and application. IEEE Transactions on Pattern Analysis and Machine Intelligence

Hubel DH, Wiesel TN (1968) Receptive fields and functional architecture of monkey striate cortex. J Physiol 195(1):215–243

Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR

Isabona J, Imoize AL, Ojo S, Karunwi O, Kim Y, Lee C-C, Li C-T (2022) Development of a multilayer perceptron neural network for optimal predictive modeling in urban microcellular radio environments. Appl Sci 12(11):5713

Ji X, Yan Q, Huang D, Wu B, Xu X, Zhang A, Liao G, Zhou J, Wu M (2021) Filtered selective search and evenly distributed convolutional neural networks for casting defects recognition. J Mater Process Technol 292:117064

Jin X, Xie Y, Wei X-S, Zhao B-R, Chen Z-M, Tan X (2022) Delving deep into spatial pooling for squeeze-and-excitation networks. Pattern Recognit 121:108159

Khan RU, Zhang X, Kumar R (2019) Analysis of ResNet and GoogleNet models for malware detection. J Comput Virol Hacking Tech 15:29–37

Krizhevsky A, Hinton G, et al (2009) Learning multiple layers of features from tiny images

LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

Li Z, Liu F, Yang W, Peng S, Zhou J (2021) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE transactions on neural networks and learning systems

Li J et al. (2022) Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing 11 (1)

Li C, Li L, Jiang H, Weng K, Geng Y, Li L, Ke Z, Li Q, Cheng M, Nie W, et al (2022) Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976

Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer

Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125

Liu Z, Wu S, Jin S, Ji S, Liu Q, Lu S, Cheng L (2022) Investigating pose representations and motion contexts modeling for 3d motion prediction. IEEE Trans Pattern Anal Mach Intell 45(1):681–697

Lotter W, Kreiman G, Cox D (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104

Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110

Luo W, Xing J, Milan A, Zhang X, Liu W, Kim T-K (2021) Multiple object tracking: a literature review. Artif Intell 293:103448

Ma X, Guo J, Sansom A, McGuire M, Kalaani A, Chen Q, Tang S, Yang Q, Fu S (2021) Spatial pyramid attention for deep convolutional neural networks. IEEE Trans Multimedia 23:3048–3058

Ma P, Li C, Rahaman MM, Yao Y, Zhang J, Zou S, Zhao X, Grzegorzek M (2023) A state-of-the-art survey of object detection techniques in microorganism image analysis: from classical methods to deep learning approaches. Artif Intell Rev 56(2):1627–1698

Medsker LR, Jain L (2001) Recurrent neural networks. Des Appl 5:64–67

Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542

Nwankpa C, Ijomah W, Gachagan A, Marshall S (2018) Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378

Papageorgiou CP, Oren M, Poggio T (1998) A general framework for object detection. In: Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 555–562. IEEE

Patel C, Bhatt D, Sharma U, Patel R, Pandya S, Modi K, Cholli N, Patel A, Bhatt U, Khan MA (2022) DBGC: dimension-based generic convolution block for object recognition. Sensors 22(5):1780

Patraucean V, Handa A, Cipolla R (2015) Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309

Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271

Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767

Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788

Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28

Ren J, Zheng Q, Zhao Y, Xu X, Li C (2022) Dlformer: Discrete latent transformer for video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3511–3520

Sainath TN, Kingsbury B, Mohamed A-r, Dahl GE, Saon G, Soltau H, Beran T, Aravkin AY, Ramabhadran B (2013) Improvements to deep convolutional neural networks for lvcsr. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 315–320. IEEE

Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520

Sengupta A, Ye Y, Wang R, Liu C, Roy K (2019) Going deeper in spiking neural networks: VGG and residual architectures. Front Neurosci 13:95

Shetty S (2016) Application of convolutional neural network for image classification on pascal voc challenge 2012 dataset. arXiv preprint arXiv:1607.03785

Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-c (2015) Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28

Singh T, Vishwakarma DK (2019) Video benchmarks of human action datasets: a review. Artif Intell Rev 52:1107–1154

Singh T, Vishwakarma DK (2021) A deeply coupled convnet for human activity recognition using dynamic and RGB images. Neural Comput Appl 33:469–485

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

Srivastava RK, Greff K, Schmidhuber J (2015) Highway networks. arXiv preprint arXiv:1505.00387

Stepanov S, Spiridonov D, Mai T (2023) Prediction of numerical homogenization using deep learning for the Richards equation. J Comput Appl Math 424:114980

Sui X, Li S, Geng X, Wu Y, Xu X, Liu Y, Goh R, Zhu H (2022) Craft: Cross-attentional flow transformer for robust optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17602–17611

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826

Szegedy C, Ioffe S, Vanhoucke V, Alemi A (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31

Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114.

Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790

Uddin MP, Mamun MA, Hossain MA (2021) PCA-based feature reduction for hyperspectral remote sensing image classification. IETE Tech Rev 38(4):377–396

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems 30

Vishwakarma DK, Singh T (2019) A visual cognizance based multi-resolution descriptor for human action recognition using key pose. AEU-Int J Electron Commun 107:157–169

Wang Y, Long M, Wang J, Gao Z, Yu PS (2017) Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems 30

Wang Y, Jiang L, Yang M-H, Li L-J, Long M, Fei-Fei L (2019) Eidetic 3d lstm: A model for video prediction and beyond. In: International Conference on Learning Representations

Wang Y, Wu H, Zhang J, Gao Z, Wang J, Yu PS, Long M (2022) Predrnn: a recurrent neural network for spatiotemporal predictive learning. IEEE Trans Pattern Anal Mach Intell 45(2):2208–2225

Wightman R, Touvron H, Jégou H (2021) Resnet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476

Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.

Xiong J, Hsiang E-L, He Z, Zhan T, Wu S-T (2021) Augmented reality and virtual reality displays: emerging technologies and future perspectives. Light Sci Appl 10(1):216

Yan S, Xiong X, Arnab A, Lu Z, Zhang M, Sun C, Schmid C (2022) Multiview transformers for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3333–3343

Yang J, Soltan AA, Eyre DW, Yang Y, Clifton DA (2023) An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digit Med 6(1):55

Yang W, Yu H, Cui B, Sui R, Gu T (2023) Deep neural network pruning method based on sensitive layers and reinforcement learning. Artif Intell Rev 56:1897–1917

Yu K, Jia L, Chen Y, Xu W (2013) Deep learning: yesterday, today, and tomorrow. J Comput Res Dev 50(9):1799–1804

Yu W, Lu Y, Easterbrook S, Fidler S (2020) Efficient and information-preserving future frame prediction and beyond

Zablocki É, Ben-Younes H, Pérez P, Cord M (2022) Explainability of deep vision-based autonomous driving systems: review and challenges. Int J Comput Vision 130(10):2425–2452

Acknowledgements

This work was supported by the National Social Science Fund of China under Grant No. 22BTJ057.

Author information

Xia Zhao and Limin Wang have contributed equally to this work.

Authors and Affiliations

School of Information Science, Guangdong University of Finance & Economics, Guangzhou, 510320, China

Xia Zhao & Limin Wang

School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, 130022, China

Yufei Zhang

School of Information Science and Technology, Jinan University, Guangzhou, 510632, China

Department of Industrial Engineering, Turkish Naval Academy, National Defence University, 34942 , Tuzla, Istanbul, Turkey

Muhammet Deveci

The Bartlett School of Sustainable Construction, University College London, 1-19 Torrington Place, London, WC1E 7HB, UK

Department of Electrical and Computer Engineering, Lebanese American University, Byblos, Lebanon

Department of Computer Science and Engineering, Mississippi State University, Starkville, MS, 39762, USA

Milan Parmar

Contributions

Xia Zhao: Conceptualization, Methodology, Investigation, Writing-original draft, Writing-review & editing. Limin Wang: Writing-review & editing, Project administration, Funding acquisition. Yufei Zhang: Writing-review & editing. Xuming Han: Writing-original draft, Investigation, Supervision. Muhammet Deveci: Writing-review & editing, Supervision. Milan Parmar: Conceptualization, Language polish.

Corresponding authors

Correspondence to Limin Wang or Xuming Han .

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Zhao, X., Wang, L., Zhang, Y. et al. A review of convolutional neural networks in computer vision. Artif Intell Rev 57 , 99 (2024). https://doi.org/10.1007/s10462-024-10721-6

Accepted : 04 February 2024

Published : 23 March 2024

DOI : https://doi.org/10.1007/s10462-024-10721-6

Convolutional neural networks
Computer vision
Status quo review
Deep learning

Quantitative Biology > Neurons and Cognition

Title: Neural networks, artificial intelligence and the computational brain

Abstract: In recent years, several studies have provided insight into the functioning of the brain, which consists of neurons that form networks through synaptic interconnections. Neural networks are interconnected systems of neurons and are of two types: Artificial Neural Networks (ANNs) and Biological Neural Networks (interconnected nerve cells). ANNs are computationally inspired by human neurons and are used in modelling neural systems. The reasoning foundations of ANNs have been useful in anomaly detection, in areas of medicine such as instant physician, electronic noses, pattern recognition, and modelling biological systems. Advancing research in artificial intelligence using the architecture of the human brain seeks to model systems by studying the brain rather than looking to technology for brain models. This study explores the concept of ANNs as a simulator of the biological neuron, and its areas of application. It also explores why brain-like intelligence is needed and how it differs from the computational framework by comparing neural networks to contemporary computers and their modern-day implementation.

A view of Artificial Neural Network

‘You Transformed the World,’ NVIDIA CEO Tells Researchers Behind Landmark AI Paper

Of GTC’s 900+ sessions, the most wildly popular was a conversation hosted by NVIDIA founder and CEO Jensen Huang with seven of the authors of the legendary research paper that introduced the aptly named transformer, a neural network architecture that went on to change the deep learning landscape and enable today’s era of generative AI.

“Everything that we’re enjoying today can be traced back to that moment,” Huang said to a packed room with hundreds of attendees, who heard him speak with the authors of “Attention Is All You Need.”

Sharing the stage for the first time, the research luminaries reflected on the factors that led to their original paper, which has been cited more than 100,000 times since it was first published and presented at the NeurIPS AI conference. They also discussed their latest projects and offered insights into future directions for the field of generative AI.

While they started as Google researchers, the collaborators are now spread across the industry, most as founders of their own AI companies.

“We have a whole industry that is grateful for the work that you guys did,” Huang said.

Origins of the Transformer Model

The research team initially sought to overcome the limitations of recurrent neural networks, or RNNs, which were then the state of the art for processing language data.

Noam Shazeer, cofounder and CEO of Character.AI, compared RNNs to the steam engine and transformers to the improved efficiency of internal combustion.

“We could have done the industrial revolution on the steam engine, but it would just have been a pain,” he said. “Things went way, way better with internal combustion.”

“Now we’re just waiting for the fusion,” quipped Illia Polosukhin, cofounder of blockchain company NEAR Protocol.

The paper’s title came from a realization that attention mechanisms, an element of neural networks that enables them to determine the relationship between different parts of input data, were the most critical component of their model’s performance.
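To make "determining the relationship between different parts of input data" concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in the transformer paper. It uses only the Python standard library and leaves out the learned projections, multiple heads and masking of a real model. The function names and the toy `tokens` values are illustrative assumptions, not taken from the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input (hypothetical values): three 2-d token vectors used as Q, K and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted blend of all input positions, with more weight on the positions most similar to the query.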

“We had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better,” said Llion Jones, cofounder and chief technology officer at Sakana AI.

Having a name as general as “transformers” spoke to the team’s ambitions to build AI models that could process and transform every data type — including text, images, audio, tensors and biological data.

“That North Star, it was there on day zero, and so it’s been really exciting and gratifying to watch that come to fruition,” said Aidan Gomez, cofounder and CEO of Cohere. “We’re actually seeing it happen now.”

Envisioning the Road Ahead

Adaptive computation, where a model adjusts how much computing power is used based on the complexity of a given problem, is a key factor the researchers see improving in future AI models.

“It’s really about spending the right amount of effort and ultimately energy on a given problem,” said Jakob Uszkoreit, cofounder and CEO of biological software company Inceptive. “You don’t want to spend too much on a problem that’s easy or too little on a problem that’s hard.”

A math problem like two plus two, for example, shouldn’t be run through a trillion-parameter transformer model — it should run on a basic calculator, the group agreed.
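The calculator remark can be read, very loosely, as a routing decision made before any model is called. The sketch below shows only that crude reading: a pattern check sends simple integer arithmetic to a calculator and everything else to a placeholder. The `large_model` function is a hypothetical stand-in, and adaptive computation as the researchers describe it would more likely live inside the model than in an external router like this one.

```python
import operator
import re

# Matches expressions such as "2 + 2" or "12*3": two integers and one operator.
SIMPLE_ARITHMETIC = re.compile(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*")
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def large_model(prompt):
    """Hypothetical stand-in for an expensive model call."""
    return f"[large model answer to: {prompt!r}]"

def answer(prompt):
    """Route trivial arithmetic to a calculator, everything else to the model."""
    match = SIMPLE_ARITHMETIC.fullmatch(prompt)
    if match:
        left, symbol, right = match.groups()
        # Division by zero is not "easy"; let the fallback handle it.
        if not (symbol == "/" and int(right) == 0):
            return "calculator", OPERATIONS[symbol](int(left), int(right))
    return "large_model", large_model(prompt)

print(answer("2 + 2"))
print(answer("Summarize the transformer paper"))
```

The point of the design is only that the cheap path is tried first and the expensive path is the fallback; how a real system would judge problem difficulty is left open here.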

They’re also looking forward to the next generation of AI models.

“I think the world needs something better than the transformer,” said Gomez. “I think all of us here hope it gets succeeded by something that will carry us to a new plateau of performance.”

“You don’t want to miss these next 10 years,” Huang said. “Unbelievable new capabilities will be invented.”

The conversation concluded with Huang presenting each researcher with a framed cover plate of the NVIDIA DGX-1 AI supercomputer, signed with the message, “You transformed the world.”

There’s still time to catch the session replay by registering for a virtual GTC pass — it’s free.


How do neural networks learn? A mathematical formula explains how they detect relevant patterns

Neural networks have been powering breakthroughs in artificial intelligence, including the large language models that are now being used in a wide range of applications, from finance to human resources to healthcare. But these networks remain a black box whose inner workings engineers and scientists struggle to understand. Now, a team led by data and computer scientists at the University of California San Diego has given neural networks the equivalent of an X-ray to uncover how they actually learn.

The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions.

"We are trying to understand neural networks from first principles," said Daniel Beaglehole, a Ph.D. student in the UC San Diego Department of Computer Science and Engineering and co-first author of the study. "With our formula, one can simply interpret which features the network is using to make predictions."

The team presented their findings in the March 7 issue of the journal Science.

Why does this matter? AI-powered tools are now pervasive in everyday life. Banks use them to approve loans. Hospitals use them to analyze medical data, such as X-rays and MRIs. Companies use them to screen job applicants. But it's currently difficult to understand the mechanism neural networks use to make decisions and the biases in the training data that might impact this.

"If you don't understand how neural networks learn, it's very hard to establish whether neural networks produce reliable, accurate, and appropriate responses," said Mikhail Belkin, the paper's corresponding author and a professor at the UC San Diego Halicioglu Data Science Institute. "This is particularly significant given the rapid recent growth of machine learning and neural net technology."

The study is part of a larger effort in Belkin's research group to develop a mathematical theory that explains how neural networks work. "Technology has outpaced theory by a huge amount," he said. "We need to catch up."

The team also showed that the statistical formula they used to understand how neural networks learn, known as Average Gradient Outer Product (AGOP), could be applied to improve performance and efficiency in other types of machine learning architectures that do not include neural networks.
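As a rough illustration of what the formula computes: for a scalar predictor f and data points x_1, ..., x_n, the AGOP is usually written as the average of the outer products grad f(x_i) grad f(x_i)^T. The sketch below estimates it with finite-difference gradients, using only the Python standard library. The `predictor` function and the sample `points` are made-up assumptions for demonstration, not the models or data from the study.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of scalar f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def agop(f, points):
    """Average over the points of the outer product grad f(x) grad f(x)^T."""
    d = len(points[0])
    total = [[0.0] * d for _ in range(d)]
    for x in points:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                total[i][j] += g[i] * g[j]
    n = len(points)
    return [[value / n for value in row] for row in total]

# Hypothetical predictor: depends on coordinates 0 and 1, ignores coordinate 2.
def predictor(x):
    return 3.0 * x[0] + x[1] ** 2

points = [[0.0, 1.0, 5.0], [1.0, -1.0, 2.0], [2.0, 0.5, -3.0], [-1.0, 2.0, 0.0]]
M = agop(predictor, points)
for row in M:
    print([round(value, 3) for value in row])
```

The diagonal of `M` is large for the coordinates the predictor actually uses and zero for the one it ignores, which is the sense in which such a matrix can indicate the features a model relies on.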

"If we understand the underlying mechanisms that drive neural networks, we should be able to build machine learning models that are simpler, more efficient and more interpretable," Belkin said. "We hope this will help democratize AI."

The machine learning systems that Belkin envisions would need less computational power, and therefore less power from the grid, to function. These systems also would be less complex and so easier to understand.

Illustrating the new findings with an example

(Artificial) neural networks are computational tools for learning relationships between data characteristics (e.g., identifying specific objects or faces in an image). One example task is determining whether a person in a new image is wearing glasses. Machine learning approaches this problem by providing the neural network with many example (training) images labeled either "a person wearing glasses" or "a person not wearing glasses." The neural network learns the relationship between images and their labels, and extracts data patterns, or features, that it needs to focus on to make a determination. One of the reasons AI systems are considered a black box is that it is often difficult to describe mathematically what criteria the systems are actually using to make their predictions, including potential biases. The new work provides a simple mathematical explanation for how the systems are learning these features.

Features are relevant patterns in the data. In the example above, there is a wide range of features that the neural network learns, and then uses, to determine whether a person in a photograph is wearing glasses. One feature it would need to pay attention to for this task is the upper part of the face. Other features could be the eye or the nose area where glasses often rest. The network selectively pays attention to the features that it learns are relevant and then discards the other parts of the image, such as the lower part of the face, the hair and so on.

Feature learning is the ability to recognize relevant patterns in data and then use those patterns to make predictions. In the glasses example, the network learns to pay attention to the upper part of the face. In the new Science paper, the researchers identified a statistical formula that describes how the neural networks are learning features.

Alternative neural network architectures: The researchers went on to show that inserting this formula into computing systems that do not rely on neural networks allowed these systems to learn faster and more efficiently.

"How do I ignore what's not necessary? Humans are good at this," said Belkin. "Machines are doing the same thing. Large Language Models, for example, are implementing this 'selective paying attention' and we haven't known how they do it. In our Science paper, we present a mechanism explaining at least some of how the neural nets are 'selectively paying attention.'"

Study funders included the National Science Foundation and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning. Belkin is part of The Institute for Learning-enabled Optimization at Scale (TILOS), which is funded by NSF and led by UC San Diego.

Story Source:

Materials provided by University of California - San Diego. Original written by Ioana Patringenaru and Daniel Kane. Note: Content may be edited for style and length.

Journal Reference:

Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science, 2024; DOI: 10.1126/science.adi5639
