Face Recognition by Humans and Machines: Three Fundamental Advances from Deep Learning

Alice J. O’Toole

1 School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA;

Carlos D. Castillo

2 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA;

Deep learning models currently achieve human levels of performance on real-world face recognition tasks. We review scientific progress in understanding human face processing using computational approaches based on deep learning. This review is organized around three fundamental advances. First, deep networks trained for face identification generate a representation that retains structured information about the face (e.g., identity, demographics, appearance, social traits, expression) and the input image (e.g., viewpoint, illumination). This forces us to rethink the universe of possible solutions to the problem of inverse optics in vision. Second, deep learning models indicate that high-level visual representations of faces cannot be understood in terms of interpretable features. This has implications for understanding neural tuning and population coding in the high-level visual cortex. Third, learning in deep networks is a multistep process that forces theoretical consideration of diverse categories of learning that can overlap, accumulate over time, and interact. Diverse learning types are needed to model the development of human face processing skills, cross-race effects, and familiarity with individual faces.

1. INTRODUCTION

The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks ( Jacquet & Champod 2020 , Phillips et al. 2018 , Taigman et al. 2014 ). This behavioral parity allows for meaningful comparisons of representations in two successful systems. DCNNs also emulate computational aspects of the ventral visual system ( Fukushima 1988 , Krizhevsky et al. 2012 , LeCun et al. 2015 ) and support surprisingly direct, layer-to-layer comparisons with primate visual areas ( Yamins et al. 2014 ). Nonlinear, local convolutions, executed in cascaded layers of neuron-like units, form the computational engine of both biological and artificial neural networks for human and machine-based face recognition. Enormous numbers of parameters, diverse learning mechanisms, and high-capacity storage in deep networks enable a wide variety of experiments at multiple levels of analysis, from reductionist to abstract. This makes it possible to investigate how systems and subsystems of computations support face processing tasks.

Our goal is to review scientific progress in understanding human face processing with computational approaches based on deep learning. As we proceed, we bear in mind wise words written decades ago in a paper on science and statistics: “All models are wrong, but some are useful” ( Box 1979 , p. 202) (see the sidebar titled Perspective: Theories and Models of Face Processing and the sidebar titled Caveat: Iteration Between Theory and Practice ). Since all models are wrong, in this review, we focus on what is useful. For present purposes, computational models are useful when they give us insight into the human visual and perceptual system. This review is organized around three fundamental advances in understanding human face perception, using knowledge generated from deep learning models. The main elements of these advances are as follows.

PERSPECTIVE: THEORIES AND MODELS OF FACE PROCESSING

Box (1976) reminds us that scientific progress comes from motivated iteration between theory and practice. In understanding human face processing, theories should be used to generate the questions, and machines (as models) should be used to answer the questions. Three elemental concepts are required for scientific progress. The first is flexibility. Effective iteration between theory and practice requires feedback between what the theory predicts and what the model reveals. The second is parsimony. Because all models are wrong, excessive elaboration will not find the correct model. Instead, economical descriptions of a phenomenon should be preferred over complex descriptions that capture less fundamental elements of human perception. Third, Box (1976 , p. 792) cautions us to avoid “worrying selectivity” in model evaluation. As he puts it, “since all models are wrong, the scientist must be alert to what is importantly wrong.”

These principles represent a scientific ideal, rather than a reality in the field of face perception by humans and machines. Applying scientific principles to computational modeling of human face perception is challenging for diverse reasons (see the sidebar titled Caveat: Iteration Between Theory and Practice below). We argue, as Cichy & Kaiser (2019) have, that although the utility of scientific models is usually seen in terms of prediction and explanation, their function for exploration should not be underrated. As scientific models, DCNNs carry out high-level visual tasks in neurally inspired ways. They are at a level of development that is ripe for exploring computational and representational principles that actually work but are not understood. This is a classic problem in reverse engineering—yet the use of deep learning as a model introduces a dilemma. The goal of reverse engineering is to understand how a functional but highly complex system (e.g., the brain and human visual system) solves a problem (e.g., recognizes a face). To accomplish this, a well-understood model is used to test hypotheses about the underlying mechanisms of the complex system. A prerequisite of reverse engineering is that we understand how the model works. Failing that, we risk using one poorly understood system to test hypotheses about another poorly understood system. Although deep networks are not black boxes (every parameter is knowable) ( Hasson et al. 2020 ), we do not fully understand how they recognize faces ( Poggio et al. 2020 ). Therefore, the primary goal should be to understand deep networks for face recognition at a conceptual and representational level.

CAVEAT: ITERATION BETWEEN THEORY AND PRACTICE

Box (1976) noted that scientific progress depends on motivated iteration between theory and practice. Unfortunately, a motivation to iterate between theory and practice is not a reasonable expectation for the field of computer-based face recognition. Automated face recognition is big business, and the best models were not developed to study human face processing. DCNNs provide a neurally inspired, but not copied, solution to face processing tasks. Computer scientists formulated DCNNs at an abstract level, based on neural networks from the 1980s ( Fukushima 1988 ). Current DCNN-based models of human face processing are computationally refined, scaled-up versions of these older networks. Algorithm developers make design and training decisions for performance and computational efficiency. In using DCNNs to model human face perception, researchers must choose between smaller, controlled models and larger-scale, uncontrolled networks (see also Richards et al. 2019 ). Controlled models are easier to analyze but can be limited in computational power and training data diversity. Uncontrolled models better emulate real neural systems but may be intractable. The easy availability of cutting-edge pretrained face recognition models, with a variety of architectures, has been the deciding factor for many research labs with limited resources and expertise to develop networks. Given the widespread use of these models in vision science, brain-similarity metrics for artificial neural networks have been developed ( Schrimpf et al. 2018 ). These produce a Brain-Score made up of a composite of neural and behavioral benchmarks. Some large-scale (uncontrolled) network architectures used in modeling human face processing (see Section 2.1 ) score well on these metrics.

A promising long-term strategy is to increase the neural accuracy of deep networks ( Grill-Spector et al. 2018 ). The ventral visual stream and DCNNs both employ hierarchical, feedforward processing. This offers two computational benefits consistent with DCNNs as models of human face processing. First, the universal approximation theorem ( Hornik et al. 1989 ) ensures that both types of networks can approximate any complex continuous function relating the input (visual image) to the output (face identity). Second, linear and nonlinear feedforward connections enable fast computation consistent with the speed of human facial recognition ( Grill-Spector et al. 2018 , Thorpe et al. 1996 ). Although current DCNNs lack other properties of the ventral visual system, these can be implemented as the field progresses.

  • Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. The face representations that emerge from deep networks trained for identification operate invariantly across changes in image and appearance, but they are not themselves invariant.
  • Computational theory and simulation studies of deep learning call for reconsideration of a long-standing axiom in vision science: that face or object representations can be understood in terms of interpretable features. Instead, in deep learning models, the concept of a nameable deep feature, localized in an output unit of the network or in the latent variables of the space, should be reevaluated.
  • Natural environments provide highly variable training data that can structure the development of face processing systems using a variety of learning mechanisms that overlap, accumulate over time, and interact. It is no longer possible to invoke learning as a generic theoretical account of a behavioral or neural phenomenon.

We focus on deep learning findings that are relevant for understanding human face processing—broadly construed. The human face provides us with diverse information, including identity, gender, race or ethnicity, age, and emotional state. We use the face to make inferences about a person’s social traits ( Oosterhof & Todorov 2008 ). As we discuss below, deep networks trained for identification retain much of this diverse facial information (e.g., Colón et al. 2021 , Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 , Terhörst et al. 2020 ). The use of face recognition algorithms in applied settings (e.g., law enforcement) has spurred detailed performance comparisons between DCNNs and humans (e.g., Phillips et al. 2018 ). For analogous reasons, the problem of human-like race bias in DCNNs has also been studied (e.g., Cavazos et al. 2020 ; El Khiyari & Wechsler 2016 ; Grother et al. 2019 ; Krishnapriya et al. 2019 , 2020 ). Developmental data on infants’ exposure to faces in the first year(s) of life offer insight into how to structure the training of deep networks ( Smith & Slone 2017 ). These topics are within the scope of this review. Although we consider general points of comparison between DCNNs and neural responses in face-selective areas of the primate inferotemporal (IT) cortex, a detailed discussion of this topic is beyond the scope of this review. (For a review of primate face-selective areas that considers computational perspectives, see Hesse & Tsao 2020 ). In this review, we focus on the computational and representational principles of neural coding from a deep learning perspective.

The review is organized as follows. We begin with a brief review of where machine performance on face identification stands relative to humans in quantitative terms. Qualitative performance comparisons on identification and other face processing tasks (e.g., expression classification, social perception, development) are integrated into Sections 2 – 4 . These sections consider advances in understanding human face processing from deep learning approaches. We close with a discussion of where the next steps might lead.

1.1. Where We Are Now: Human Versus Machine Face Recognition

Deep learning models of face identification map widely variable images of a face onto a representation that supports identification accuracy comparable to that of humans. The steady progress of machines over the past 15 years can be summarized in terms of the increasingly challenging face images that they can recognize ( Figure 1 ). By 2007, the best algorithms surpassed humans on a task of identity matching for unfamiliar faces in frontal images taken indoors ( O’Toole et al. 2007 ). By 2012, well-established algorithms exceeded human performance on frontal images with moderate changes in illumination and appearance ( Kumar et al. 2009 , Phillips & O’Toole 2014 ). Machine ability to match identity for in-the-wild images appeared with the advent of DCNNs in 2013–2014. Human face recognition was marginally more accurate than DeepFace ( Taigman et al. 2014 ), an early DCNN, on the Labeled Faces in the Wild (LFW) data set ( Huang et al. 2008 ). LFW contains in-the-wild images taken mostly from the front. DCNNs now fare well on in-the-wild images with significant pose variation (e.g., Maze et al. 2018 , data set). Sengupta et al. (2016) found parity between humans and machines on frontal-to-frontal identity matching but human superiority on frontal-to-profile matching.

Figure 1.

The progress of computer-based face recognition systems can be tracked by their ability to recognize faces with increasing levels of image and appearance variability. In 2006, highly controlled, cropped face images with moderate variability, such as the images of the same person shown, were challenging (images adapted with permission from Sim et al. 2002 ). In 2012, algorithms could tackle moderate image and appearance variability (the top four images are extreme examples adapted with permission from Huang et al. 2012 ; the bottom two images adapted with permission from Phillips et al. 2011 ). By 2018, deep convolutional neural networks (DCNNs) began to tackle wide variation in image and appearance (images adapted with permission from the database in Maze et al. 2018 ). In the 2012 and 2018 images, all side-by-side images show the same person except the bottom pair of 2018 panels.

Identity matching:

process of determining if two or more images show the same identity or different identities; this is the most common task performed by machines

Human face recognition:

the ability to determine whether a face is known

1.2. Expert Humans and State-of-the-Art Machines Work Together

DCNNs can sometimes even surpass normal human performance. Phillips et al. (2018) compared humans and machines matching the identity of faces in high-quality frontal images. Although this is generally considered an easy task, the images tested were chosen to be highly challenging based on previous human and machine studies. Four DCNNs developed between 2015 and 2017 were compared to human participants from five groups: professional forensic face examiners, professional forensic face reviewers, superrecognizers ( Noyes et al. 2017 , Russell et al. 2009 ), professional fingerprint examiners, and students. Face examiners, reviewers, and superrecognizers performed more accurately than fingerprint examiners, and fingerprint examiners performed more accurately than students. Machine performance, from 2015 to 2017, tracked human skill levels. The 2015 algorithm ( Parkhi et al. 2015 ) performed at the level of the students; the 2016 algorithm ( Chen et al. 2016 ) performed at the level of the fingerprint examiners ( Ranjan et al. 2017c ); and the two 2017 algorithms ( Ranjan et al. 2017a , c ) performed at the level of professional face reviewers and examiners, respectively. Notably, combining the judgments of individual professional face examiners with those of the best algorithm ( Ranjan et al. 2017a ) yielded perfect performance. This suggests a degree of strategic diversity for the face examiners and the DCNN and demonstrates the potential for effective human–machine collaboration ( Phillips et al. 2018 ).
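
The fusion itself can be minimal. The sketch below z-scores each source and averages them, assuming hypothetical files of examiner ratings and DCNN similarity scores for the same image pairs; Phillips et al. (2018) describe the actual fusion procedure, which this only approximates:

```python
import numpy as np

# Hypothetical inputs: one judgment per image pair from a human examiner
# (e.g., a -3..+3 same/different rating) and one similarity score per pair
# from a DCNN, in matched order.
human = np.load("examiner_ratings.npy")
machine = np.load("dcnn_similarities.npy")

def zscore(s):
    return (s - s.mean()) / s.std()

# Put both judgment types on a common scale, then average. Higher fused
# scores indicate "same person." This captures the spirit of fusion, not
# the exact procedure of Phillips et al. (2018).
fused = (zscore(human) + zscore(machine)) / 2
```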

Combined, the data indicate that machine performance has improved from a level comparable to that of a person recognizing unfamiliar faces to one comparable to that of a person recognizing more familiar faces ( Burton et al. 1999 , Hancock et al. 2000 , Jenkins et al. 2011 ) (see Section 4.1 ).

2. RETHINKING INVERSE OPTICS AND FACE REPRESENTATIONS

Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. These networks operate with a degree of invariance to image and appearance that was unimaginable by researchers less than a decade ago. Invariance refers to the model’s ability to consistently identify a face when image conditions (e.g., viewpoint, illumination) and appearance (e.g., glasses, facial hair) vary. The nature of the representation that accomplishes this is not well understood. The inscrutability of DCNN codes is due to the enormous number of computations involved in generating a face representation from an image and the uncontrolled training data. To create a face representation, millions of nonlinear, local convolutions are executed over tens (to hundreds) of layers of units. Researchers exert little or no control over the training data, but instead source face images from the web with the goal of finding as much labeled training data as possible. The number of images per identity and the types of images (e.g., viewpoint, expression, illumination, appearance, quality) are left (mostly) to what is found through web scraping. Nevertheless, DCNNs produce a surprisingly structured and rich face representation that we are beginning to understand.

2.1. Mining the Face Identity Code in Deep Networks

The face representation generated by DCNNs for the purpose of identifying a face also retains detailed information about the characteristics of the input image (e.g., viewpoint, illumination) and the person pictured (e.g., gender, age). As shown below, this unified representation can solve multiple face processing tasks in addition to identification.

2.1.1. Image characteristics.

Face representations generated by deep networks both are and are not invariant to image variation. These codes can identify faces invariantly over image change, but they are not themselves invariant. Instead, face representations of a single identity vary systematically as a function of the characteristics of the input image. The representations generated by DCNNs are, in fact, representations of face images.

Work to dissect face identity codes draws on the metaphor of a face space ( Valentine 1991 ) adapted to representations generated by a DCNN. Visualization and simulation analyses demonstrate that identity codes for face images retain ordered information about the input image ( Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 ). Viewpoint (yaw and pitch) can be predicted accurately from the identity code, as can media source (still image or video frame) ( Parde et al. 2017 ). Image quality (blur, usability, occlusion) is also available as the identity code norm (vector length). Poor-quality images produce face representations centered in the face space, creating a DCNN garbage dump. This organizational structure was replicated in two DCNNs with different architectures, one developed by Chen et al. (2016) with seven convolutional layers and three fully connected layers and another developed by Sankaranarayanan et al. (2016) with 11 convolutional layers and one fully connected layer. Image quality estimates can also be optimized directly in a DCNN using human ratings ( Best-Rowden & Jain 2018 ).
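
Probes of this kind are straightforward to run on stored representations. Below is a minimal sketch, assuming precomputed identity codes and yaw labels in hypothetical .npy files; Parde et al. (2017) used their own data and classifiers, so this only illustrates the logic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: identity codes from a face-trained DCNN and the yaw
# (in degrees) of the head in each input image.
embeddings = np.load("face_embeddings.npy")   # (n_images, 512)
yaw = np.load("yaw_labels.npy")               # (n_images,)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, yaw, test_size=0.2)

# If yaw is predictable from the identity code, the code is not invariant,
# even though it supports invariant identification.
probe = LinearRegression().fit(X_tr, y_tr)
print("yaw R^2 from identity codes:", probe.score(X_te, y_te))

# Vector norm as a rough image-quality proxy: poor-quality images tend to
# produce short vectors near the origin of the face space.
quality_proxy = np.linalg.norm(embeddings, axis=1)
```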

Face space:

representation of the similarity of faces in a multidimensional space

For a closer look at the structure of DCNN face representations, Hill et al. (2019) examined the representations of highly controlled face images in a face space generated by a deep network trained with in-the-wild images. The network processed images of three-dimensional laser scans of human heads rendered from five viewpoints under two illumination conditions (ambient, harsh spotlight). Visualization of these representations in the resulting face space showed a highly ordered pattern (see Figure 2 ). Consistent with the network’s high accuracy at face identification, images clustered by identity. Identity clusters separated into regions of male and female faces (see Section 2.1.2 ). Within each identity cluster, the images separated by illumination condition—visible in the face space as chains of images. Within each illumination chain, the image representations were arranged in the space by viewpoint, which varied systematically along the image chain. To further probe the coding of identity, Hill et al. (2019) processed images of caricatures of the 3D heads (see also Blanz & Vetter 1999 ). Caricature representations were centered in each identity cluster, indicating that the network perceived a caricature as a good likeness of the identity.
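
Visualizations of this kind can be approximated with standard tools. The sketch below assumes precomputed embeddings for a controlled image set, with hypothetical label files; t-SNE stands in here for whatever projection method a given study used:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: top-layer DCNN codes for a controlled image set,
# plus identity and illumination labels for each image.
embeddings = np.load("controlled_face_embeddings.npy")
identity = np.load("identity_labels.npy")          # integer id per image
illumination = np.load("illumination_labels.npy")  # 0 = ambient, 1 = spotlight

coords = TSNE(n_components=2, metric="cosine").fit_transform(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(coords[:, 0], coords[:, 1], c=identity, cmap="tab20", s=8)
axes[0].set_title("colored by identity")
axes[1].scatter(coords[:, 0], coords[:, 1], c=illumination, cmap="coolwarm", s=8)
axes[1].set_title("colored by illumination")
plt.show()
```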

Figure 2.

Visualization of the top-level deep convolutional neural network (DCNN) similarity space for all images from Hill et al. (2019) . ( a – f ) Points are colored according to different variables. Grey polygonal borders are for illustration purposes only and show the convex hull of all images of each identity. These convex hulls are expanded by a margin for visibility. The network separates identities accurately. In panels a and d , the space is divided into male and female sections. In panels b and e , illumination conditions subdivide within identity groupings. In panels c and f , the viewpoint varies sequentially within illumination clusters. Dotted-line boxes in panels a – c show areas enlarged in panels d – f . Figure adapted with permission from Hill et al. (2019) .

DCNN face representation:

output vector produced for a face image processed through a deep network trained for faces

All results from Hill et al. (2019) were replicated using two networks with starkly different architectures. The first, developed by Ranjan et al. (2019) , was based on a ResNet-101 with 101 layers and skip connections; the second, developed by Chen et al. (2016) , had 15 convolution and pooling layers, a dropout layer, and one fully connected top layer. As measured using the brain-similarity metrics developed in Brain-Score ( Schrimpf et al. 2018 ), one of these architectures (ResNet-101) was the third most brain-like of the 25 networks tested. The ResNet-101 network scored well on both neural (V4 and IT cortex) and behavioral predictability for object recognition. Hill et al.’s (2019) replication of this face space using a shallower network ( Chen et al. 2016 ), however, suggests that network architecture may be less important than computational capacity in understanding high-level visual codes for faces (see Section 3.2 ).

Brain-Score:

neural and behavioral benchmarks that score an artificial neural network on its similarity to brain mechanisms for object recognition

Returning to the issue of human-like view invariance in a DCNN, Abudarham & Yovel (2020) compared the similarity of face representations computed within and across identities and viewpoints. Consistent with view-invariant performance, same-identity, different-view face pairs were more similar than different-identity, same-view face pairs. Consistent with a noninvariant face representation, correlations between similarity scores across head view decreased monotonically with increasing view disparity. These results support the characterization of DCNN codes as being functionally view invariant but with a view-specific code. Notably, earlier layers in the network showed view specificity, whereas higher layers showed view invariance.
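
The core of this analysis is a comparison of pairwise similarities. A toy version of the logic, with random placeholder vectors where real DCNN codes would go:

```python
import numpy as np
from itertools import combinations

# Placeholder codes: in a real analysis, these come from a face-trained
# DCNN, one embedding per (identity, view) image.
rng = np.random.default_rng(0)
emb = {(i, v): rng.normal(size=128) for i in range(5) for v in range(3)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

same_id_diff_view, diff_id_same_view = [], []
for (ia, va), (ib, vb) in combinations(emb.keys(), 2):
    s = cosine(emb[(ia, va)], emb[(ib, vb)])
    if ia == ib and va != vb:
        same_id_diff_view.append(s)
    elif ia != ib and va == vb:
        diff_id_same_view.append(s)

# Functional view invariance predicts that the first mean exceeds the
# second (it will not for these random placeholder vectors).
print(np.mean(same_id_diff_view), np.mean(diff_id_same_view))
```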

It is worth digressing briefly to consider invariance in the context of neural approaches to face processing. An underlying assumption of neural approaches is that “a major purpose of the face patches is thus to construct a representation of individual identity invariant to view direction” ( Hesse & Tsao 2020 , p. 703). Ideas about how this is accomplished have evolved. Freiwald & Tsao (2010) posited the progressive computation of invariance via the pooling of neurons across face patches, as follows. In early patches, a neuron responds to a specific identity from specific views; in middle face patches, greater invariance is achieved by pooling the responses of mirror-symmetric views of an identity; in later face patches, each neuron pools inputs representing all views of the same individual to create a fully view-invariant representation. More recently, Chang & Tsao (2017) proposed that the brain computes a view-invariant face code using shape and appearance parameters analogous to those used in a computer graphics model of face synthesis ( Cootes et al. 1995 ) (see the sidebar titled Neurons, Neural Tuning, Population Codes, Features, and Perceptual Constancy ). This code retains information about the face, but not about the particular image viewed.

NEURONS, NEURAL TUNING, POPULATION CODES, FEATURES, AND PERCEPTUAL CONSTANCY

Barlow (1972 , p. 371) wrote, “Results obtained by recording from single neurons in sensory pathways…obviously tell us something important about how we sense the world around us; but what exactly have we been told?” In answer, Barlow (1972 , p. 371) proposed that “our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells. The activity of each single cell is thus an important perceptual event and it is thought to be related quite simply to our subjective experience.” Although this proposal is sometimes caricatured as the grandmother cell doctrine (see also Gross 2002 ), Barlow simply asserts that single-unit activity can be interpreted in perceptual terms, and that the responses of small numbers of units, in combination, underlie subjective perceptual experience. This proposal reflects ideas gleaned from studies of early visual areas that have been translated, at least in part, to studies of high-level vision.

Over the past decade, single neurons in face patches have been characterized as selective for facial features (e.g., aspect ratio, hair length, eyebrow height) ( Freiwald et al. 2009 ), face viewpoint and identity ( Freiwald & Tsao 2010 ), eyes ( Issa & DiCarlo 2012 ), and shape or appearance parameters from an active appearance model of facial synthesis ( Chang & Tsao 2017 ). Neurophysiological studies of face and object processing also employ techniques aimed at understanding neural population codes. Using the pattern of neural responses in a population of neurons (e.g., IT), linear classifiers are used often to predict subjective percepts (commonly defined as the image viewed). For example, Chang & Tsao (2017) showed that face images viewed by a macaque could be reconstructed using a linear combination of the activity of just 205 face cells in face patches ML–MF and AM. This classifier provides a real neural network model of the face-selective cortex that can be interpreted in simple terms.

Population code models generated from real neural data (a few hundred units), however, differ substantially in scale from the face- and object-selective cortical regions that they model (1 mm³ of the cerebral cortex contains approximately 50,000 neurons and 300 million adjustable parameters; Azevedo et al. 2009 , Kandel et al. 2000 , Hasson et al. 2020 ). This difference in scale is at the core of a tension between model interpretability and real-world task generalizability ( Hasson et al. 2020 ). It also creates tension between the neural coding hypotheses suggested by deep learning and the limitations of current neuroscience techniques for testing these hypotheses. To model neural function, an electrode gives access to single neurons and (with multi-unit recordings) to relatively small numbers of neurons (a few hundred). Neurocomputational theory based on direct fit models posits that overparameterization (i.e., the extremely high number of parameters available for neural computation) is critical to the brain’s solution to real-world problems (see Section 3.2 ). Bridging the gap between the computational and neural scale of these perspectives remains an ongoing challenge for the field.

Deep networks suggest an alternative that is largely consistent with neurophysiological data but interprets the data in a different light. Neurocomputational theory posits that the ventral visual system untangles face identity information from image parameters ( DiCarlo & Cox 2007 ). The idea is that visual processing starts in the image domain, where identity and viewpoint information are entangled. With successive levels of neural processing, manifolds corresponding to individual identities are untangled from image variation. This creates a representational space where identities can be separated with hyperplanes. Image information is not lost, but rather, is rearranged (for object recognition results, see Hong et al. 2016 ). The retention of image and identity information in DCNN face representations is consistent with this theory. It is also consistent with basic neuroscience findings indicating the emergence of a representation dominated by identity that retains sensitivity to image features (see Section 2.2 ).

2.1.2. Appearance and demographics.

Faces can be described using what computer vision researchers have called attributes or soft biometrics (hairstyle, hair color, facial hair, and accessories such as makeup and glasses). The definition of attributes in the computational literature is vague and can include demographics (e.g., gender, age, race) and even facial expression. Identity codes from deep networks retain a wide variety of face attributes. For example, Terhörst et al. (2020) built a massive attribute classifier (MAC) to test whether 113 attributes could be predicted from the face representations produced by deep networks [ArcFace ( Deng et al. 2019 ) or FaceNet ( Schroff et al. 2015 )] for images from in-the-wild data sets ( Huang et al. 2008 , Liu et al. 2015 ). The MAC learned to map from DCNN-generated face representations to attribute labels. Cross-validated results showed that 39 of the attributes were easily predictable, and 74 of the 113 were predictable at reliable levels. Hairstyle, hair color, beard, and accessories were predicted easily. Attributes such as face geometry (e.g., round), periocular characteristics (e.g., arched eyebrows), and nose were moderately predictable. Skin and mouth attributes were not well predicted.
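
Conceptually, the MAC is a learned map from face representations to attribute labels. The sketch below substitutes a simple cross-validated linear probe per attribute for Terhörst et al.’s (2020) classifier, with hypothetical input files:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: face representations from an identity-trained DCNN
# and binary attribute labels (e.g., beard, glasses) for the same images.
embeddings = np.load("face_embeddings.npy")   # (n_images, d)
attributes = np.load("attribute_labels.npy")  # (n_images, n_attrs)

# One linear probe per attribute: cross-validated accuracy measures how
# predictable each attribute is from the identity code.
for a in range(attributes.shape[1]):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          embeddings, attributes[:, a], cv=5).mean()
    print(f"attribute {a}: accuracy {acc:.2f}")
```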

The continuous shuffling of identity, attribute, and image information across layers of the network was demonstrated by Dhar et al. (2020) . They tracked the expressivity of attributes (identity, sex, age, pose) across layers of a deep network. Expressivity was defined as the degree to which a feature vector, from any given layer of a network, specified an attribute. Dhar et al. (2020) computed expressivity using a second neural network that estimated the mutual information between attributes and DCNN features. Expressivity order in the final fully connected layer of both networks (Resnet-101 and Inception Resnet v2; Ranjan et al. 2019 ) indicated that identity was most expressed, followed by age, sex, and yaw. Identity expressivity increased dramatically from the final pooling layer to the last fully connected layer. This echoes the progressive increase in the detectability of view-invariant face identity representations seen across face patches in the macaque ( Freiwald & Tsao 2010 ). It also raises the computational possibility of undetected viewpoint sensitivity in these neurons (see Section 3.1 ).
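
A rough way to approximate a layer-wise analysis of this kind is to extract activations at several depths and fit one probe per layer. The sketch below uses forward hooks on a torchvision ResNet as a stand-in for a face-trained network, with linear probes suggested as a simpler proxy for the mutual-information estimator of Dhar et al. (2020):

```python
import torch
import torchvision.models as models

# A torchvision ResNet pretrained on ImageNet stands in for an
# identity-trained face network here.
net = models.resnet101(weights="IMAGENET1K_V1").eval()

features = {}
def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.flatten(1).detach()
    return hook

for name in ["layer2", "layer3", "layer4"]:
    getattr(net, name).register_forward_hook(make_hook(name))

images = torch.randn(8, 3, 224, 224)  # placeholder batch of face images
with torch.no_grad():
    net(images)

# features[name] now holds per-layer activations. Fitting one probe per
# layer (e.g., predicting identity, sex, or yaw) and comparing accuracies
# approximates how attribute expressivity shifts through the network.
for name, f in features.items():
    print(name, tuple(f.shape))
```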

Mutual information:

a statistical term from information theory that quantifies the codependence of information between two random variables

2.1.3. Social traits.

People make consistent (albeit invalid) inferences about a person’s social traits based on their face ( Todorov 2017 ). These judgments have profound consequences. For example, competence judgments about faces predict election success at levels far above chance ( Todorov et al. 2005 ). The physical structure of the face supports these trait inferences ( Oosterhof & Todorov 2008 , Walker & Vetter 2009 ), and thus it is not surprising that deep networks retain this information. Using face representations produced by a network trained for face identification ( Sankaranarayanan et al. 2016 ), 11 traits (e.g., shy, warm, impulsive, artistic, lazy), rated by human participants, were predicted at levels well above chance ( Parde et al. 2019 ). Song et al. (2017) found that more than half of 40 attributes were predicted accurately by a network trained for object recognition (VGG-16; Simonyan & Zisserman 2014 ). Human and machine trait ratings were highly correlated.

Other studies show that deep networks can be optimized to predict traits from images. Lewenberg et al. (2016) crowd-sourced large numbers of objective (e.g., hair color) and subjective (e.g., attractiveness) attribute ratings from faces. DCNNs were trained to classify images for the presence or absence of each attribute. They found highly accurate classification for the objective attributes and somewhat less accurate classification for the subjective attributes. McCurrie et al. (2017) trained a DCNN to classify faces according to trustworthiness, dominance, and IQ. They found significant accord with human ratings, with higher agreement for trustworthiness and dominance than for IQ.

2.1.4. Facial expressions.

Facial expressions are also detectable in face representations produced by identity-trained deep networks. Colón et al. (2021) found that expression classification was well above chance for face representations of images from the Karolinska data set ( Lundqvist et al. 1998 ), which includes seven facial expressions (happy, sad, angry, surprised, fearful, disgusted, neutral) seen from five viewpoints (frontal and 90- and 45-degree left and right profiles). Consistent with human data, happiness was classified most accurately, followed by surprise, disgust, anger, neutral, sadness, and fear. Notably, accuracy did not vary across viewpoint. Visualization of the identities in the emergent face space showed a structured ordering of similarity in which viewpoint dominated over expression.

2.2. Functional Invariance, Useful Variability

The emergent code from identity-trained DCNNs can be used to recognize faces robustly, but it also retains extraneous information that is of limited, or no, value for identification. Although demographic and trait information offers weak hints to identity, image characteristics and facial expression are not useful for identification. Attributes such as glasses, hairstyle, and facial hair are, at best, weak identity cues and, at worst, misleading cues that will not remain constant over extended time periods. In purely computational terms, the variability of face representations for different images of an identity can lead to errors. Although this is problematic in security applications, coincidental features and attributes can be diagnostic enough to support acceptably accurate identification performance in day-to-day face recognition ( Yovel & O’Toole 2016 ). (For related arguments based on adversarial images for object recognition, see Ilyas et al. 2019 , Xie et al. 2020 , Yuan et al. 2020 .) A less-than-perfect identification system in computational terms, however, can be a surprisingly efficient, multipurpose face processing system that supports identification and the detection of visually derived semantic information [called attributes by Bruce & Young (1986) ].

What do we learn from these studies that can be useful in understanding human visual processing of faces? First, we learn that it is computationally feasible to accommodate diverse information about faces (identity, demographics, visually derived semantic information), images (viewpoint, illumination, quality), and emotions (expression) in a unified representation. Furthermore, this diverse information can be accessed selectively from the representation. Thus, identity, image parameters, and attributes are all untangled when learning prioritizes the difficult within-category discrimination problem of face identification.

Second, we learn that to understand high-level visual representations for faces, we need to think in terms of categorical codes unbound from a spatial frame of reference. Although remnants of retinotopy and image characteristics remain in high-level visual areas (e.g., Grill-Spector et al. 1999 , Kay et al. 2015 , Kietzmann et al. 2012 , Natu et al. 2010 , Yue et al. 2010 ), the expressivity of spatial layout weakens dramatically from early visual areas to categorically structured areas in the IT cortex. Categorical face representations should capture what cognitive and perceptual psychologists call facial features (e.g., face shape, eye color). Indeed, altering these types of features in a face affects identity perception similarly for humans and deep networks ( Abudarham et al. 2019 ). However, neurocomputational theory suggests that finding these features in the neural code will likely require rethinking the interpretation of neural tuning and population coding (see Section 3.2 ).

Third, if the ventral stream untangles information across layers of computations, then we should expect traces of identity, image data, and attributes at many, if not all, neural network layers. These may variously dominate the strength of the neural signal at different layers (see Section 3.1 ). Thus, various layers in the network will likely succeed in predicting several types of information about the face and/or image, though with differing accuracy. For now, we should not ascribe too much importance to findings about which specific layer(s) of a particular network predict specific attributes. Instead, we should pay attention to the pattern of prediction accuracy across layers. We would expect the following pattern: for the optimized attribute (identity), the output offers the clearest access. For subject-related attributes (e.g., demographics), this may also be the case. For image-related attributes, we would expect every layer in the network to retain some degree of prediction ability. Exactly how, where, and whether the neural system makes use of these attributes for specific tasks remain open questions.

3. RETHINKING VISUAL FEATURES: IMPLICATIONS FOR NEURAL CODES

Deep learning models force us to rethink the definition and interpretation of facial features in high-level representations. Theoretical ideas about the brain’s solution to complex real-world tasks such as face recognition must be reconciled at the level of neural units and representational spaces. Deep learning models can be used to test hypotheses about how faces are stored in the high-dimensional representational space defined by the pattern of responses of large numbers of neurons.

3.1. Units Confound Information that Separates in the Representation Space

Insight into interpreting facial features comes from deep network simulations aimed at understanding the relationship between unit responses and the information retained in the face representation. Parde et al. (2021) compared identification, gender classification, and viewpoint estimation in subspaces of a DCNN face space. Using an identity-trained network capable of all three tasks, they tested performance on the tasks using randomly sampled subsets of output units. Beginning at full dimensionality (512 units) and progressively decreasing sample size, they found no notable decline in identification accuracy for more than 3,000 in-the-wild faces until the sample size reached 16 randomly chosen units (3% of full dimensionality). Correlations between unit responses across representations were near zero, indicating that individual units captured nonredundant identity cues. Statistical power for identification (i.e., separating identities) was uniformly high for all output units, demonstrating that units used their entire response range to separate identities. A unit firing at its maximum provided no more, and no less, information than any other response value. This distinction may seem trivial, but it is not. The data suggest that every output unit acts to separate identities to the maximum degree possible. As such, all units participate in coding all identities. In information theory terms, this is an ideal use of neural resources.
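
The subsampling experiment is easy to reproduce on stored embeddings. A minimal sketch, assuming hypothetical files of matched probe and gallery embeddings (one row per identity, same row order):

```python
import numpy as np

# Hypothetical inputs: one probe and one gallery embedding per identity,
# in the same row order, from an identity-trained DCNN.
probe = np.load("probe_embeddings.npy")      # (n_ids, 512)
gallery = np.load("gallery_embeddings.npy")  # (n_ids, 512)

def identification_rate(p, g):
    # rank-1 identification with cosine similarity
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    return np.mean((p @ g.T).argmax(axis=1) == np.arange(len(p)))

rng = np.random.default_rng(0)
for k in [512, 256, 64, 16, 4]:
    units = rng.choice(probe.shape[1], size=k, replace=False)
    print(f"{k} units:", identification_rate(probe[:, units], gallery[:, units]))
```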

For gender classification and viewpoint estimation, performance declined at a much faster rate than for identification as units were deleted ( Parde et al. 2021 ). Statistical power for predicting gender and viewpoint was strong in the distributed code but weak at the level of the unit. Prediction power for these attributes was again roughly equivalent for all units. Thus, individual units contributed to coding all three attributes, but identity modulated individual unit responses far more strongly than did gender or viewpoint. Notably, a principal component (PC) analysis of representations in the full-dimensional space revealed subspaces aligned with identity, gender, and viewpoint ( Figure 3 ). Consistent with the strength of the categorical identity code in the representation, identity information dominated PCs explaining large amounts of variance, gender dominated the middle range of PCs, and viewpoint dominated PCs explaining small amounts of variation.
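
One way to look for such subspaces is sketched below; the diagnostic direction is estimated here with a linear classifier, which may differ from how Parde et al. (2021) derived their directional vectors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: DCNN face representations and a binary gender label
# for each image.
embeddings = np.load("face_embeddings.npy")  # (n_images, 512)
gender = np.load("gender_labels.npy")        # (n_images,)

pca = PCA().fit(embeddings)

# A gender-diagnostic direction: the weight vector of a linear classifier.
w = LogisticRegression(max_iter=1000).fit(embeddings, gender).coef_[0]
w = w / np.linalg.norm(w)

# Cosine similarity between each PC and the gender direction shows where
# in the variance spectrum gender information concentrates (cf. Figure 3).
alignment = np.abs(pca.components_ @ w)
print("most gender-aligned PCs:", np.argsort(alignment)[::-1][:5])
```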

Figure 3.

Illustration of the separation of the task-relevant information into subspaces for an identity-trained deep convolutional neural network (DCNN). Each plot shows the similarity (cosine) between principal components (PCs) of the face space and directional vectors in the space that are diagnostic of identity ( top ), gender ( middle ), and viewpoint ( bottom ). Figure adapted with permission from Parde et al. (2021) .

The emergence and effectiveness of these codes in DCNNs suggest that caution is needed in ascribing significance only to stimuli that drive a neuron to high rates of response. Small-scale modulations of neural responses can also be meaningful. Let us consider a concrete example. A neurophysiologist probing the network used by Parde et al. (2021) would find some neurons that respond strongly to a few identities. Interpreting this as identity tuning, however, would be an incorrect characterization of a code in which all units participate in coding all identities. Concomitantly, few units in the network would appear responsive to viewpoint or gender variations because unit firing rates would modulate only slightly with changes in viewpoint or gender. Thus, the distributed coding of view and gender across units would likely be missed. The finding that neurons in macaque face patch AM respond selectively (i.e., with high response rates) to identity over variable views ( Freiwald & Tsao 2010 ) is consistent with DCNN face representations. It is possible, however, that these units also encode other face and image attributes, but with differential degrees of expressivity. This would be computationally consistent with the untangling theory and with DCNN codes.

Macaque face patches:

regions of the macaque cortex that respond selectively to faces, including the posterior lateral (PL), middle lateral (ML), middle fundus (MF), anterior lateral (AL), anterior fundus (AF), and anterior medial (AM)

Another example comes from the use of generative adversarial networks and related techniques to characterize the response properties of single (or multiple) neuron(s) in the primate visual cortex ( Bashivan et al. 2019 , Ponce et al. 2019 , Yuan et al. 2020 ). These techniques have examined neurons in areas V4 ( Bashivan et al. 2019 ) and IT ( Ponce et al. 2019 , Yuan et al. 2020 ). The goal is to progressively evolve images that drive neurons to their maximum response or that selectively (in)activate subsets of neurons. Evolved images show complex mosaics of textures, shapes, and colors. They sometimes show animals or people and sometimes reveal spatial patterns that are not semantically interpretable. However, these techniques rely on two strong assumptions. First, they assume that a neuron’s response can be characterized completely in terms of the stimuli that activate it maximally, thereby discounting other response rates as noninformative. The computational utility of a unit’s full response range in DCNNs suggests that reconsideration of this assumption is necessary. Second, these techniques assume that a neuron’s response properties can be visualized accurately as a two-dimensional image. Given the categorical, nonretinotopic nature of representations in high-level visual areas, this seems problematic. If the representation under consideration is not in the image or pixel domain, then image-based visualization may offer limited, and possibly misleading, insight into the underlying nature of the code.
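
At their core, these techniques optimize in stimulus space. A bare-bones gradient-ascent sketch on a pretrained torchvision network captures the idea of evolving a maximally activating image, though the cited studies used generative networks and closed-loop neural recordings rather than anything this simple:

```python
import torch
import torchvision.models as models

# A pretrained torchvision network stands in for recorded neurons; `unit`
# indexes an arbitrary channel in the final convolutional feature map.
net = models.vgg16(weights="IMAGENET1K_V1").eval()
unit = 42

img = (0.1 * torch.randn(1, 3, 224, 224)).requires_grad_(True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    activation = net.features(img).mean(dim=(2, 3))[0, unit]
    (-activation).backward()  # gradient ascent on the unit's response
    opt.step()

# `img` now approximates a stimulus that drives this unit strongly. Note
# the assumption questioned in the text: that maximally activating stimuli
# fully characterize what the unit codes.
```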

3.2. Direct-Fit Models and Deep Learning

In rethinking visual features at a theoretical level, direct-fit models of neural coding appear to best explain deep learning findings in multiple domains (e.g., face recognition, language) ( Hasson et al. 2020 ). These models posit that neural computation fits densely sampled data from the environment. Implementation is accomplished using “overparameterized optimization algorithms that increase predictive (generalization) power, without explicitly modeling the underlying generative structure of the world” ( Hasson et al. 2020 , p. 418). Hasson et al.’s (2020) account begins with an ideal model in a small-parameter space ( Figure 4 ). When the underlying structure of the world is simple, a small-parameter model will find the underlying generative function, thereby supporting generalization via interpolation and extrapolation. Despite decades of effort, small-parameter functions have not solved real-world face recognition with performance anywhere near that of humans.

Figure 4.

( a ) A model with too few parameters fails to fit the data. ( b ) The ideal-fit model fits with a small number of parameters and has generative power that supports interpolation and extrapolation. ( c ) An overfit function can model noise in the training data. ( d ) An overparameterized model generalizes well to new stimuli within the scope of the training samples. Figure adapted with permission from Hasson et al. (2020) .

When the underlying structure of the world is complex and multivariate, direct-fit models offer an alternative to models based on small-parameter functions. With densely sampled real-world training data, each new observation can be placed in the context of past experience. More formally, direct-fit models solve the problem of generalization to new exemplars by experience-scaffolded interpolation ( Hasson et al. 2020 ). This produces face recognition performance in the range of that of humans. A fundamental element of the success of deep networks is that they model the environment with big data, which can be structured in overparameterized spaces. The scale of the parameterization and the requirement to operate on real-world data are pivotal. Once the network is sufficiently parameterized to fit the data, the exact details of its architecture are not important. This may explain why starkly different network architectures arrive at similarly structured representations ( Hill et al. 2019 , Parde et al. 2017 , Storrs et al. 2020 ).
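
A toy illustration of the interpolation/extrapolation asymmetry that direct-fit models predict, with nearest-neighbor interpolation standing in for an overparameterized model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(2 * x) + 0.1 * rng.normal(size=x.size)  # a complex-enough "world"

# Small-parameter model: a line cannot capture the generative structure.
slope, intercept = np.polyfit(x, y, deg=1)

# Overparameterized stand-in: nearest-neighbor lookup over densely sampled
# training data, i.e., pure experience-scaffolded interpolation.
def nn_predict(x_new):
    return y[np.abs(x[:, None] - x_new[None, :]).argmin(axis=0)]

inside = np.linspace(-3, 3, 50)   # interpolation within experience
outside = np.linspace(4, 6, 50)   # extrapolation beyond experience
print("interpolation error:",
      np.mean((nn_predict(inside) - np.sin(2 * inside)) ** 2))
print("extrapolation error:",
      np.mean((nn_predict(outside) - np.sin(2 * outside)) ** 2))
```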

Returning to the issue of features, in neurocomputational terms, the strength of connectivity between neurons at synapses is the primary locus of information, just as weights between units in a deep network comprise information. We expect features, whatever they are, to be housed in the combination of connection strengths among units, not in the units themselves. In a high-dimensional multivariate encoding space, they are hyperplane directions through the space. Thus, features are represented across many computing elements, and each computing element participates in encoding many features ( Hasson et al. 2020 , Parde et al. 2021 ). If features are directions in a high-dimensional coding space ( Goodfellow et al. 2014 ), then units act as an arbitrary projection surface from which this information can be accessed—albeit in a nontransparent form.

A downside of direct-fit models is that they cannot generalize via extrapolation. The other-race effect is an example of how face recognition may fail due to limited experience ( Malpass & Kravitz 1969 ) (see Section 4.3.2 ). The extrapolation limit may be countered, however, by the capacity of direct-fit models to acquire expertise within the confines of experience. For example, in human perception, category experience selectively structures representations as new exemplars are learned. Collins & Behrmann (2020) showed that this occurs in a way that reflects humans’ far greater experience with faces than with computer-generated objects from novel, made-up categories, which the authors call YUFOs. They tracked the perceived similarity of pairs of other-race faces and YUFOs as people learned novel exemplars of each. Experience changed perceived similarities more selectively for faces than for YUFOs, enabling more nuanced discrimination of exemplars from the experienced category of faces.

In summary, direct-fit models offer a framework for thinking about high-level visual codes for faces in a way that unifies disparate data on single units and high-dimensional coding spaces. These models are fueled by the rich experience that we (models) gain from learning (training on) real-world data. They solve complex visual tasks with interpolated solutions that elude transparent semantic interpretation.

4. RETHINKING LEARNING IN HUMANS AND DEEP NETWORKS

Deep network models of human face processing force us to consider learning as a complex and diverse set of mechanisms that can overlap, accumulate over time, and interact. Learning in both humans and artificial neural networks can refer to qualitatively different phenomena. In both cases, learning involves multiple steps. For DCNNs, these steps are fundamental to a network’s ability to recognize faces across image and appearance variation. Human visual learning is likewise diverse and unfolds across the developmental lifespan in a process governed by genetics and environmental input ( Goodman & Shatz 1993 ). The stepwise implementation of learning is one way that DCNNs differ from previous face recognition networks. Considered as manipulable modeling tools, the learning steps in DCNNs force us to think in concrete and nuanced ways about how humans learn faces.

In this section, we outline the learning layers in human face processing ( Section 4.1 ), introduce the layers of learning used in training machines ( Section 4.2 ), and consider the relationship between the two in the context of human behavior ( Section 4.3.1 ). The human learning layers support a complex, biologically realized face processing system. The machine learning layers can be thought of as building blocks that can be combined in a variety of ways to model human behavioral phenomena. At the outset, we note that machine learning is designed to maximize performance—not to model the development of the human face processing system ( Smith & Slone 2017 ). Concomitantly, the sequential presentation of training data in DCNNs differs from the pattern of exposure that infants and young children have with faces and objects ( Jayaraman et al. 2015 ). The machine learning steps, however, can be modified to model human learning more closely. In practical terms, fully trained DCNNs, available on the web, are used (almost exclusively) to model human neural systems (see the sidebar titled Caveat: Iteration Between Theory and Practice ). It is important, therefore, to understand how (and why) these models are configured as they are and to understand the types of learning tools available for modeling human face processing. These steps may provide computational grounding for basic learning mechanisms hypothesized in humans.

4.1. Human Learning for Face Processing

To model human face processing, researchers need to consider the following types of learning. The most specific form of learning is familiar face recognition. People learn the faces of specific familiar individuals (e.g., friends, family, celebrities). Familiar faces are recognized robustly over challenging changes in appearance and image characteristics. The second-most specific is local population tuning. People recognize own-race faces more accurately than other-race faces, a phenomenon referred to as the other-race effect (e.g., Malpass & Kravitz 1969 ). This likely results from tuning to the statistical properties of the faces that we see most frequently—typically faces of our own race. The third-most specific is unfamiliar face recognition. People can differentiate unfamiliar faces perceptually. Unfamiliar refers to faces that a person has not encountered previously or has encountered infrequently. Unfamiliar face recognition is less robust to image and appearance change than is familiar face recognition. The least specific form of learning is object recognition. At a fundamental level of analysis, faces are objects, and both share early visual processing wetware.

4.2. How Deep Convolutional Neural Networks Learn Face Identification

Training DCNNs for face recognition involves a sequence of learning stages, each with a concrete objective. Unlike human learning, machine learning stages are executed in strict sequence. The goal across all stages of training is to build an effective method for converting images of faces into points in a high-dimensional space. The resulting high-dimensional space allows for easy comparison among faces, search, and clustering. In this section, we sketch out the engineering approach to learning, working forward from the most general to the most specific form of learning. This follows the implementation order used by engineers.

4.2.1. Object classification (between-category learning): Stage 1.

Deep networks for face identification are commonly built on top of DCNNs that have been pretrained for object classification. Pretraining is carried out using large data sets of objects, such as those available in ImageNet ( Russakovsky et al. 2015 ), which contains more than 14 million images of over 1,000 classes of objects (e.g., volcanoes, cups, chihuahuas). The object categorization training procedure involves adjusting the weights on all layers of the network. For training to converge, a large training set is required. The loss function optimized in this procedure typically uses the well-understood cross-entropy loss + Softmax combination. Most practitioners do not execute this step because it has been performed already in a pretrained model downloaded from a public repository in a format compatible with DCNN software libraries [e.g., PyTorch ( Paszke et al. 2019 ), TensorFlow ( Abadi et al. 2016 )]. Networks trained for object recognition have proven better for face identification than networks that start with a random configuration ( Liu et al. 2015 , Yi et al. 2014 ).
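
In code, Stage 1 often amounts to a single download. A sketch with torchvision; the choice of backbone is illustrative:

```python
import torchvision.models as models

# Stage 1 in practice: download a backbone already pretrained for object
# classification rather than training it from scratch.
backbone = models.resnet101(weights="IMAGENET1K_V1")

# The pretrained head still maps to 1,000 ImageNet object classes.
print(backbone.fc)  # Linear(in_features=2048, out_features=1000, bias=True)
```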

4.2.2. Face recognition (within-category learning): Stage 2.

Face recognition training is implemented in a second stage of training. In this stage, the last fully connected layer that connects to object-category nodes (e.g., volcanoes, cups) is removed from the results of the Stage 1 training. Next, a fully connected layer that maps to the number of face identities available for face training is connected. Depending on the size of the face training set, the weights of either all layers or all but a few layers at the beginning of the network are updated. The former is common when very large numbers of face identities are available for training. In academic laboratories, data sets include 5–10 million face images of 40,000–100,000 identities. In industry, far larger data sets are often used ( Schroff et al. 2015 ). A technical difficulty encountered in retraining an object classification network to a face recognition network is the large increase in the number of categories involved (approximately 1,000 objects versus 50,000+ faces). Special loss functions can address this issue [e.g., L2-Softmax/crystal loss ( Ranjan et al. 2017 ), NormFace ( Wang et al. 2017 ), angular Softmax ( Li et al. 2018 ), additive Softmax ( Wang et al. 2018 ), additive angular margins ( Deng et al. 2019 )].
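
A sketch of the surgery described above: the object-category head is removed, an identity head is attached, and features are L2-normalized and scaled before classification, in the spirit of L2-Softmax ( Ranjan et al. 2017 ). The identity count, scale constant, and layer choices are illustrative, not a specific published recipe:

```python
import torch.nn as nn
import torchvision.models as models

n_identities = 50_000  # identities in the face training set (illustrative)

backbone = models.resnet101(weights="IMAGENET1K_V1")
embed_dim = backbone.fc.in_features  # 2048 for this backbone

# Remove the object-category head and attach an identity head.
backbone.fc = nn.Linear(embed_dim, n_identities)

class L2Scaled(nn.Module):
    """Normalize and scale features before classification (L2-Softmax-like)."""
    def __init__(self, net, scale=40.0):
        super().__init__()
        self.body = nn.Sequential(*list(net.children())[:-1])
        self.fc = net.fc
        self.scale = scale

    def forward(self, x):
        f = self.body(x).flatten(1)
        f = self.scale * nn.functional.normalize(f, dim=1)
        return self.fc(f)

model = L2Scaled(backbone)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy over identity labels
```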

When the Stage 2 face training is complete, the last fully connected layer that connects to the 50,000+ face identity nodes is removed, leaving below it a relatively low-dimensional (128- to 5,000-unit) layer of output units. This can be thought of as the face representation. This output represents a face image, not a face identity. At this point in training, any arbitrary face image from any identity (known or unknown to the network) can be processed by the DCNN to produce a compact face image descriptor across the units of this layer. If the network functions perfectly, then it will produce identical codes for all images of the same person. This would amount to perfect image and appearance generalization. This is not usually achieved, even when the network is highly accurate (see Section 2 ).
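With the identity layer removed, comparing two face images reduces to comparing their descriptors. A minimal verification sketch (here `net` is assumed to emit one descriptor per image, and the similarity threshold is a data-set-dependent operating point):

```python
import torch
import torch.nn.functional as F

def verify(net, img_a, img_b, threshold=0.5):
    """Return a similarity score and a same/different-identity decision
    for two preprocessed face image tensors."""
    net.eval()
    with torch.no_grad():
        emb_a = F.normalize(net(img_a.unsqueeze(0)), dim=1)
        emb_b = F.normalize(net(img_b.unsqueeze(0)), dim=1)
    score = F.cosine_similarity(emb_a, emb_b).item()
    return score, score > threshold
```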

In this state, the network is commonly employed to recognize faces not seen in training (unfamiliar faces). Stage 2 training supports a surprising degree of generalization (e.g., pose, expression, illumination, and appearance) for images of unfamiliar faces. This general face learning gives the system special knowledge of faces and enables it to perform within-category face discrimination for unfamiliar faces ( O’Toole et al. 2018 ). With or without Stage 3 training, the network is now capable of converting images of faces into points in a high-dimensional space, which, as noted above, is the primary goal of training. In practice, however, Stages 3 and 4 can provide a critical bridge to modeling behavioral characteristics of the human face processing system.

4.2.3. Adapting to local statistics of people and visual environments: Stage 3.

The objective of Stage 3 training is to finalize the modification of the DCNN weights to better adapt to the application domain. The term application domain can refer to faces from a particular race or ethnicity or, as it is commonly used in industry, to the type of images to be processed (e.g., in-the-wild faces, passport photographs). This training is a crucial step in many applications because there will be no further transformation of the weights. Special care is needed in this training to avoid collapsing the representation into a form that is too specific. Training at this stage can improve performance for some faces and decrease it for others.

Whereas Stages 1 and 2 are used in the vast majority of published computational work, at Stage 3, researchers diverge. Although there is no standard implementation for this training, fine-tuning and learning a triplet loss embedding (van der Maaten & Weinberger 2012) are common methods. These methods are conceptually similar but differ in implementation. Both involve some combination of (a) adding new layers to the network, (b) freezing or unfreezing specific subsets of layers, and (c) continuing optimization with an appropriate loss function on a new data set with the desired domain characteristics. Fine-tuning starts from an already-viable network state and updates a nonempty subset of weights, or possibly all weights; it is typically implemented with smaller learning rates and can use smaller training sets than those needed for full training. Triplet loss embedding is implemented by freezing all layers and adding a new fully connected layer, which is then optimized with the triplet loss, again on a new (smaller) data set with the desired domain characteristics.
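A sketch of the triplet loss variant follows, under the assumption that the trained DCNN emits 512-dimensional descriptors: the DCNN is frozen and only a new embedding layer is optimized, so that an anchor image lies closer to a positive (same identity) than to a negative (different identity):

```python
import torch
import torch.nn as nn

embed = nn.Linear(512, 128, bias=False)     # new trainable layer
triplet = nn.TripletMarginLoss(margin=0.2)  # margin is illustrative

def triplet_step(frozen_net, anchor, positive, negative, optimizer):
    # The DCNN weights stay frozen; only `embed` is updated
    # (the optimizer is assumed to cover embed.parameters()).
    with torch.no_grad():
        f_a = frozen_net(anchor)
        f_p = frozen_net(positive)
        f_n = frozen_net(negative)
    loss = triplet(embed(f_a), embed(f_p), embed(f_n))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```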

A natural question is why Stage 2 (general face training) is not considered fine-tuning. The answer, in practice, comes down to viability and volume. When Stage 2 training starts, the network is not in a viable state to perform face recognition and therefore requires a voluminous, diverse data set to reach competence. Stage 3 begins with a functional network and can be tuned effectively with a small, targeted data set.

This stage of the network's face knowledge history provides a tool for adapting to local face statistics (e.g., race) (O'Toole et al. 2018).

4.2.4. Learning individual people: Stage 4.

In psychological terms, learning individual familiar faces involves seeing multiple, diverse images of the individuals to whom the faces belong. As we see more images of a person, we become more familiar with their face and can recognize it from increasingly variable images ( Dowsett et al. 2016 , Murphy et al. 2015 , Ritchie & Burton 2017 ). In computational terms, this translates into the question of how a network can learn to recognize a random set of special (familiar) faces with greater accuracy and robustness than other nonspecial (unfamiliar) faces—assuming, of course, the availability of multiple, variable images of the special faces. This stage of learning is defined, in nearly all cases, outside of the DCNN, with no change to weights within the DCNN.

The problem is as follows. The network starts with multiple images of each familiar identity and can produce a representation for each of these images. But what then? There is no standard familiarization protocol, but several approaches exist. We categorize these approaches here and link them to theoretical accounts of face familiarity in Section 4.3.3.

The first approach is averaging identity codes, or 1-class learning. It is common in machine learning to use an average (or weighted average) of the DCNN-generated face image representations as an identity code (see also Crosswhite et al. 2018 , Su et al. 2015 ). Averaging creates a person-identity prototype ( Noyes et al. 2021 ) for each familiar face.
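In code, prototype formation is essentially an average over image descriptors. A minimal sketch (the normalization choices are assumptions):

```python
import torch
import torch.nn.functional as F

def identity_prototype(embeddings):
    """1-class learning sketch: average the normalized DCNN descriptors
    of all available images of one familiar person."""
    stacked = F.normalize(torch.stack(embeddings), dim=1)
    return F.normalize(stacked.mean(dim=0), dim=0)
```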

The second is individual face contrast, or 2-class learning. This technique employs direct learning of individual identities by contrasting them with all other identities. There are two classes because the model learns what makes each identity (positive class) different from all other identities (negative class). The distinctiveness of each familiar face is thereby enhanced relative to all other known faces (e.g., Noyes et al. 2021).
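A minimal sketch of this contrast scheme, using a binary logistic classifier over frozen DCNN descriptors (the descriptor size is an assumption):

```python
import torch.nn as nn

contrast = nn.Linear(512, 1)   # one familiar identity vs. all others
bce = nn.BCEWithLogitsLoss()

def contrast_loss(embeddings, labels):
    # labels: 1 for images of the target identity, 0 for all others.
    return bce(contrast(embeddings).squeeze(1), labels.float())
```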

The third is multiple face contrast, or K-class learning. This refers to the use of identification training for a random set of (familiar) faces with a simple network (often a one-layer network). The network learns to map DCNN-generated face representations of the available images onto identity nodes.
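A sketch of the mapping network for K-class learning; the descriptor size and the number of familiarized identities K are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

K = 100                       # number of familiarized identities (assumed)
id_layer = nn.Linear(512, K)  # one-layer network over frozen descriptors

def kclass_loss(embeddings, identity_labels):
    # Map DCNN descriptors onto identity nodes; only id_layer is
    # trained, so no weights inside the DCNN change.
    return F.cross_entropy(id_layer(embeddings), identity_labels)
```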

The fourth approach is fine-tuning individual face representations. Fine-tuning has also been used for learning familiar identities ( Blauch et al. 2020a ). It is an unusual method because it alters weights within the DCNN itself. This can improve performance for the familiarized faces but can limit the network’s ability to represent other faces.

These methods create a personal face learning history that supports more accurate and robust face processing for familiar people ( O’Toole et al. 2018 ).

4.3. Mapping Learning Between Humans and Machines

Deep networks rely on multiple types of learning that can be useful in formulating and testing complex, nuanced hypotheses about human face learning. Manipulable variables include order of learning, training data, and network plasticity at different learning stages. We consider a sample of topics in human face processing that can be investigated by manipulating learning in deep networks. Because these investigations are just beginning, we provide an overview of the work in progress and discuss possible next steps in modeling.

4.3.1. Development of face processing.

Infants' early experience with faces is critical for the development of face processing skills (Maurer et al. 2002). The timing of this experience has become increasingly clear with the availability of data sets gathered using head-mounted cameras on infants (1–15 months of age) (e.g., Jayaraman et al. 2015, Yoshida & Smith 2008). Viewing the world from the infant's perspective makes clear that the development of sensorimotor abilities drives visual experience. Infants' experience transitions from seeing only what is made available to them (often faces in the near range), to seeing the world from the perspective of a crawler (objects and environments), to seeing hands and the objects that they manipulate (Fausey et al. 2016, Jayaraman et al. 2015, Smith & Slone 2017, Sugden & Moulson 2017). Between 1 and 3 months of age, faces are frequent, temporally persistent, and viewed frontally at close range. This early experience with faces is limited to a few individuals. Faces become less frequent as the child's first year progresses and attention shifts to the environment, to objects, and later to hands (Jayaraman & Smith 2019).

The prevalence of a few important faces in the infant's visual world suggests that early face learning may have an outsized influence on structuring visual recognition systems. Infants' visual experience of objects, faces, and environments can provide a curriculum for teaching machines (Smith et al. 2018). DCNNs can be used to test hypotheses about the emergence of competence on different face processing tasks. Some basic computational challenges, however, need to be addressed. Training with very large numbers of objects (or faces) is required for deep network learning to converge (see Section 4.2.1). Starting small and building competence on multiple domains (faces, objects, environments) might require basic changes to deep network training. Alternatively, the small number of special faces in an infant's life might be considered familiar faces. Perception and memory of these faces may be better modeled using tools that operate outside the deep network on representations that develop within the network (Stage 4 learning; Section 4.2.4). In this case, the quality of the representation produced at different points in the network's development of more general visual knowledge (Stages 1 and 2 of training; Sections 4.2.1 and 4.2.2) would vary. The learning of these special faces early in development might interact with the learning of objects and scenes at the categorical level (Rosch et al. 1976, Yovel et al. 2012). A promising approach would involve pausing training in Stages 1 and 2 to test face representation quality at various points along the way to convergence, as in the sketch below.
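A sketch of this pause-and-probe idea follows; the checkpoint paths, the validation pairs, and the fixed threshold are assumptions of the illustration:

```python
import torch
import torch.nn.functional as F

def pair_accuracy(net, pairs, same_labels, threshold=0.5):
    """Verification accuracy over same/different image pairs; a proxy
    for face representation quality at one point in training."""
    net.eval()
    hits = 0
    with torch.no_grad():
        for (a, b), same in zip(pairs, same_labels):
            ea = F.normalize(net(a.unsqueeze(0)), dim=1)
            eb = F.normalize(net(b.unsqueeze(0)), dim=1)
            hits += ((F.cosine_similarity(ea, eb).item() > threshold) == same)
    return hits / len(pairs)

# Probe representation quality at checkpoints saved during Stages 1-2.
for step in (10_000, 50_000, 100_000):
    net.load_state_dict(torch.load(f"checkpoints/step_{step}.pt"))
    print(step, pair_accuracy(net, val_pairs, val_same))
```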

4.3.2. Race bias in the performance of humans and deep networks.

People recognize own-race faces more accurately than other-race faces. For humans, this other-race effect begins in infancy ( Kelly et al. 2005 , 2007 ) and is manifest in children ( Pezdek et al. 2003 ). Although it is possible to reverse these effects in childhood ( Sangrigoli et al. 2005 ), training adults to recognize other-race faces yields only modest gains (e.g., Cavazos et al. 2019 , Hayward et al. 2017 , Laurence et al. 2016 , Matthews & Mondloch 2018 , Tanaka & Pierce 2009 ). Concomitantly, evidence for the experience-based contact hypothesis is weak when it is evaluated in adulthood ( Levin 2000 ). Clearly, the timing of experience is critical in the other-race effect. Developmental learning, which results in perceptual narrowing during a critical childhood period, may provide a partial account of the other-race effect ( Kelly et al. 2007 , Sangrigoli et al. 2005 , Scott & Monesson 2010 ).

Perceptual narrowing: sculpting of neural and perceptual processing via experience during a critical period in child development.

Face recognition algorithms, from those of the 1990s to present-day DCNNs, differ in accuracy for faces of different races (for a review, see Cavazos et al. 2020; for a comprehensive test of race bias in DCNNs, see Grother et al. 2019). Although training with faces of different races is often cited as a cause of race effects, it is unclear which training stage(s) contribute to the bias; it is likely that biased learning affects all stages. From the human perspective, for many people, experience favors own-race faces across the lifespan, potentially impacting performance through multiple learning mechanisms (developmental, unfamiliar, and familiar face learning). DCNN training may likewise use race-biased data at all stages. For humans, understanding the role of different types of learning in the other-race effect is challenging because experience with faces cannot be controlled. DCNNs can serve as a tool for studying critical periods and perceptual narrowing: It is possible to compare the face representations that emerge from training regimes that vary in the time course of exposure to faces of different races. The ability to manipulate training stage order, network plasticity, and training set diversity in deep networks offers an opportunity to test hypotheses about how bias emerges. The major challenge for DCNNs is the limited availability of face databases that represent the diversity of humans.
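Measuring the bias itself is straightforward once pair similarity scores are in hand. A sketch of a per-group audit (the inputs are assumed NumPy arrays of pair scores, binary same-identity labels, and demographic group tags):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def verification_auc_by_group(scores, same, group):
    """Verification AUC computed separately for each demographic group;
    systematic gaps across groups indicate biased performance."""
    return {g: roc_auc_score(same[group == g], scores[group == g])
            for g in np.unique(group)}
```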

4.3.3. Familiar versus unfamiliar face recognition.

Face familiarity in a deep network can be modeled in more ways than we can count. The approaches presented in Section 4.2.4 are just a beginning. Researchers should focus first on the big questions. How do familiar and unfamiliar face representations differ—beyond simple accuracy and robustness? This has been much debated recently, and many questions remain ( Blauch et al. 2020a , b ; Young & Burton 2020 ; Yovel & Abudarham 2020 ). One approach is to ask where in the learning process representations for familiar and unfamiliar faces diverge. The methods outlined in Section 4.2.4 make some predictions.

In the individual and multiple face contrast methods, familiar and unfamiliar face representations are not differentiated within the deep network. Instead, familiar face representations generated by the DCNN are enhanced in another, simpler network populated with known faces. A familiar face's representation is affected, therefore, by the other faces that we know well. Contrast techniques have preliminary empirical support. In the work of Noyes et al. (2021), familiarization using individual-face contrast improved identification for both evasion and impersonation disguise. It also produced a pattern of accuracy similar to that seen for people familiar with the disguised individuals (Noyes & Jenkins 2019). For humans who were unfamiliar with the disguised faces, the pattern of accuracy resembled that seen after general face training inside the DCNN. There is also support for multiple-face contrast familiarization: Perceptual expertise findings that emphasize the selective effects of the exemplars experienced during highly skilled learning are consistent with this approach (Collins & Behrmann 2020) (see Section 3.2).

Familiarization by averaging and by fine-tuning both improve performance, but at a cost. For example, averaging the DCNN representations increased performance for evasion disguise by increasing tolerance for appearance variation (Noyes et al. 2021). It decreased performance, however, for impersonation disguise by allowing too much tolerance for appearance variation. Averaging methods highlight the need to balance the perception of identity across variable images with the ability to tell similar faces apart.

Familiarization via fine-tuning was explored by Blauch et al. (2020a) , who varied the number of layers tuned (all layers, fully connected layers, only the fully connected layer mapping the perceptual layer to identity nodes). Fine-tuning applied at lower layers alters the weights within the deep network to produce a perceptual representation potentially affected by familiar faces. Fine-tuning in the mapping layer is equivalent to multiclass face contrast learning ( Blauch et al. 2020b ). Blauch et al. (2020b) show that fine-tuning the perceptual representation, which they consider analogous to perceptual learning, is not necessary for producing a familiarity effect ( Blauch et al. 2020a ).
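The layer-selection manipulation reduces to toggling which parameters are trainable. A sketch in PyTorch (the network variable `net` and the name of its final mapping layer, `fc`, are assumptions about the architecture):

```python
import torch

# Freeze the whole network, then unfreeze only the layer that maps the
# perceptual representation onto identity nodes; unfreezing earlier
# layers moves this scheme toward full perceptual fine-tuning.
for p in net.parameters():
    p.requires_grad = False
for p in net.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3)
```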

These approaches are not (necessarily) mutually exclusive and therefore can be combined to exploit useful features of each.

4.3.4. Objects, faces, both.

The organization of face-, body-, and object-selective areas in the ventral temporal cortex has been studied intensively (cf. Grill-Spector & Weiner 2014 ). Neuroimaging studies in childhood reveal the developmental time course of face selectivity and other high-level visual tasks (e.g., Natu et al. 2016 ; Nordt et al. 2019 , 2020 ). How these systems interact during development in the context of constantly changing input from the environment is an open question. DCNNs can be used to test functional hypotheses about the development of object and face learning (see also Grill-Spector et al. 2018 ).

In the case of machine learning, face recognition networks are more accurate when pretrained to categorize objects (Liu et al. 2015, Yi et al. 2014), and networks trained with only faces are more accurate for face recognition than networks trained with only objects (Abudarham & Yovel 2020, Blauch et al. 2020a). Human-like viewpoint invariance was found in a DCNN trained for face recognition but not in one trained for object recognition (Abudarham & Yovel 2020). In standard machine learning practice, networks are trained first with objects and then with faces. Networks can, however, learn object and face recognition simultaneously (Dobs et al. 2020), with minimal duplication of neural resources.

4.4. New Tools, New Questions, New Data, and a New Look at Old Data

Psychologists have long posited diverse and complex learning mechanisms for faces. Deep networks provide new tools that can be used to model human face learning with greater precision than was possible previously. This is useful because it encourages theoreticians to articulate hypotheses in ways specific enough to model. It may no longer be sufficient to explain a phenomenon in terms of generic learning or contact. Concepts such as perceptual narrowing should include ideas about where and how in the learning process this narrowing occurs. A major challenge ahead is the sheer number of knobs to be set in deep networks. Plasticity, for example, can be dialed up or down and applied to selected network layers; specific face diets can be administered across multiple learning stages (in sequence or simultaneously); and the list goes on. In all of the topics discussed, and others not discussed, theoretical ideas should specify the manipulations thought to be most critical. We should follow the counsel of Box (1976) to worry selectively, focusing on what is most important. New tools succeed when they facilitate the discovery of things that we did not know or had not hypothesized. Testing these hypotheses will require new data and may suggest a reevaluation of existing data.

5. THE PATH FORWARD

In this review, we highlight fundamental advances in thinking brought about by deep learning approaches. These networks solve the inverse optics problem for face identification by untangling image, appearance, and identity over layers of neural-like processing. This demonstrates that robust face identification can be achieved with a representation that includes specific information about the face image(s) actually experienced. These representations retain information about appearance, perceived traits, expressions, and identity.

Direct-fit models posit that deep networks operate by placing new observations into the context of past experience. These models depend on overparameterized networks that create a high-dimensional space from real-world training data. Face representations housed within this space project onto individual units, thereby confounding stimulus features that (may) separate in the high-dimensional space. This raises questions about the transparency and interpretability of information gained by examining the response properties of network units. Deep networks can be studied at both the micro- and macroscale simultaneously and can be used to formulate hypotheses about the underlying neural code for faces. A key to understanding face representations is to reconcile the responses of neurons with the structure of the code in the high-dimensional space. This is a challenging problem best approached by combining psychological, neural, and computational methods.

The process of training a deep network is complex and layered. It draws on learning mechanisms aimed at objects and faces, visual categories of faces (e.g., race), and special familiar faces. Psychological and neural theory considers the many ways in which people and brains learn faces from real-world visual experience. DCNNs offer the potential to implement and test sophisticated hypotheses about how humans learn faces across the lifespan.

We should not lose sight of the fact that a compelling reason to study deep networks is that they actually work, i.e., they perform nearly as well as humans, on face recognition tasks that have stymied computational modelers for decades. This might qualify as a property of deep networks that is importantly right ( Box 1976 ). There is a difference, of course, between working and working like humans. Determining whether a deep network can work like humans, or could be made to do so by manipulating other properties of the network (e.g., architectures, training data, learning rules), is work that is just beginning.

SUMMARY POINTS

  • Face representations generated by DCNNs trained for identification retain information about the face (e.g., identity, demographics, attributes, traits, expression) and the image (e.g., viewpoint).
  • Deep learning face networks generate a surprisingly structured face representation from unstructured training with in-the-wild face images.
  • Individual output units from deep networks are unlikely to signal the presence of interpretable features.
  • Fundamental structural aspects of high-level visual codes for faces in deep networks replicate over a wide variety of network architectures.
  • Diverse learning mechanisms in DCNNs, applied simultaneously or in sequence, can be used to model human face perception across the lifespan.

FUTURE ISSUES

  • Large-scale systematic manipulations of training data (race, ethnicity, image variability) are needed to give insight into the role of experience in structuring face representations.
  • Fundamental challenges remain in understanding how to combine deep networks for face, object, and scene recognition in ways analogous to the human visual system.
  • Deep networks model the ventral visual stream at a generic level, arguably up to the level of the IT cortex. Future work should examine how downstream systems, such as face patches, could be connected into this system.
  • In rethinking the goals of face processing, we argue in this review that some longstanding assumptions about visual representations should be reconsidered. Future work should consider novel experimental questions and employ methods that do not rely on these assumptions.

ACKNOWLEDGMENTS

The authors are supported by funding provided by National Eye Institute grant R01EY029692-03 to A.J.O. and C.D.C.

DISCLOSURE STATEMENT

C.D.C. is an equity holder in Mukh Technologies, which may potentially benefit from research results.

1 This is the case in networks trained with the Softmax objective function.

LITERATURE CITED

  • Abadi M, Barham P, Chen J, Chen Z, Davis A, et al. 2016. TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–83. Berkeley, CA: USENIX
  • Abudarham N, Shkiller L, Yovel G. 2019. Critical features for face recognition. Cognition 182:73–83
  • Abudarham N, Yovel G. 2020. Face recognition depends on specialized mechanisms tuned to view-invariant facial features: insights from deep neural networks optimized for face or object recognition. bioRxiv 2020.01.01.890277. https://doi.org/10.1101/2020.01.01.890277
  • Azevedo FA, Carvalho LR, Grinberg LT, Farfel JM, Ferretti RE, et al. 2009. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. J. Comp. Neurol. 513(5):532–41
  • Barlow HB. 1972. Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1(4):371–94
  • Bashivan P, Kar K, DiCarlo JJ. 2019. Neural population control via deep image synthesis. Science 364(6439):eaav9436
  • Best-Rowden L, Jain AK. 2018. Learning face image quality from human assessments. IEEE Trans. Inform. Forensics Secur. 13(12):3064–77
  • Blanz V, Vetter T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–94. New York: ACM
  • Blauch NM, Behrmann M, Plaut DC. 2020a. Computational insights into human perceptual expertise for familiar and unfamiliar face recognition. Cognition 208:104341
  • Blauch NM, Behrmann M, Plaut DC. 2020b. Deep learning of shared perceptual representations for familiar and unfamiliar faces: reply to commentaries. Cognition 208:104484
  • Box GE. 1976. Science and statistics. J. Am. Stat. Assoc. 71(356):791–99
  • Box GEP. 1979. Robustness in the strategy of scientific model building. In Robustness in Statistics, ed. Launer RL, Wilkinson GN, pp. 201–36. Cambridge, MA: Academic Press
  • Bruce V, Young A. 1986. Understanding face recognition. Br. J. Psychol. 77(3):305–27
  • Burton AM, Bruce V, Hancock PJ. 1999. From pixels to people: a model of familiar face recognition. Cogn. Sci. 23(1):1–31
  • Cavazos JG, Noyes E, O'Toole AJ. 2019. Learning context and the other-race effect: strategies for improving face recognition. Vis. Res. 157:169–83
  • Cavazos JG, Phillips PJ, Castillo CD, O'Toole AJ. 2020. Accuracy comparison across face recognition algorithms: Where are we on measuring race bias? IEEE Trans. Biom. Behav. Identity Sci. 3(1):101–11
  • Chang L, Tsao DY. 2017. The code for facial identity in the primate brain. Cell 169(6):1013–28
  • Chen JC, Patel VM, Chellappa R. 2016. Unconstrained face verification using deep CNN features. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Cichy RM, Kaiser D. 2019. Deep neural networks as scientific models. Trends Cogn. Sci. 23(4):305–17
  • Collins E, Behrmann M. 2020. Exemplar learning reveals the representational origins of expert category perception. PNAS 117(20):11167–77
  • Colón YI, Castillo CD, O'Toole AJ. 2021. Facial expression is retained in deep networks trained for face identification. J. Vis. 21(4):4
  • Cootes TF, Taylor CJ, Cooper DH, Graham J. 1995. Active shape models: their training and application. Comput. Vis. Image Underst. 61(1):38–59
  • Crosswhite N, Byrne J, Stauffer C, Parkhi O, Cao Q, Zisserman A. 2018. Template adaptation for face verification and identification. Image Vis. Comput. 79:35–48
  • Deng J, Guo J, Xue N, Zafeiriou S. 2019. ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–99. Piscataway, NJ: IEEE
  • Dhar P, Bansal A, Castillo CD, Gleason J, Phillips P, Chellappa R. 2020. How are attributes expressed in face DCNNs? In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 61–68. Piscataway, NJ: IEEE
  • DiCarlo JJ, Cox DD. 2007. Untangling invariant object recognition. Trends Cogn. Sci. 11(8):333–41
  • Dobs K, Kell AJ, Martinez J, Cohen M, Kanwisher N. 2020. Using task-optimized neural networks to understand why brains have specialized processing for faces. J. Vis. 20(11):660
  • Dowsett A, Sandford A, Burton AM. 2016. Face learning with multiple images leads to fast acquisition of familiarity for specific individuals. Q. J. Exp. Psychol. 69(1):1–10
  • El Khiyari H, Wechsler H. 2016. Face verification subject to varying (age, ethnicity, and gender) demographics using deep learning. J. Biom. Biostat. 7:323
  • Fausey CM, Jayaraman S, Smith LB. 2016. From faces to hands: changing visual input in the first two years. Cognition 152:101–7
  • Freiwald WA, Tsao DY. 2010. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330(6005):845–51
  • Freiwald WA, Tsao DY, Livingstone MS. 2009. A face feature space in the macaque temporal lobe. Nat. Neurosci. 12(9):1187–96
  • Fukushima K. 1988. Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Netw. 1(2):119–30
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial nets. In NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2672–80. New York: ACM
  • Goodman CS, Shatz CJ. 1993. Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell 72:77–98
  • Grill-Spector K, Kushnir T, Edelman S, Avidan G, Itzchak Y, Malach R. 1999. Differential processing of objects under various viewing conditions in the human lateral occipital complex. Neuron 24(1):187–203
  • Grill-Spector K, Weiner KS. 2014. The functional architecture of the ventral temporal cortex and its role in categorization. Nat. Rev. Neurosci. 15(8):536–48
  • Grill-Spector K, Weiner KS, Gomez J, Stigliani A, Natu VS. 2018. The functional neuroanatomy of face perception: from brain measurements to deep neural networks. Interface Focus 8(4):20180013
  • Gross CG. 2002. Genealogy of the "grandmother cell". Neuroscientist 8(5):512–18
  • Grother P, Ngan M, Hanaoka K. 2019. Face recognition vendor test (FRVT) part 3: demographic effects. Rep., Natl. Inst. Stand. Technol., US Dept. Commerce, Gaithersburg, MD
  • Hancock PJ, Bruce V, Burton AM. 2000. Recognition of unfamiliar faces. Trends Cogn. Sci. 4(9):330–37
  • Hasson U, Nastase SA, Goldstein A. 2020. Direct fit to nature: an evolutionary perspective on biological and artificial neural networks. Neuron 105(3):416–34
  • Hayward WG, Favelle SK, Oxner M, Chu MH, Lam SM. 2017. The other-race effect in face learning: using naturalistic images to investigate face ethnicity effects in a learning paradigm. Q. J. Exp. Psychol. 70(5):890–96
  • Hesse JK, Tsao DY. 2020. The macaque face patch system: a turtle's underbelly for the brain. Nat. Rev. Neurosci. 21(12):695–716
  • Hill MQ, Parde CJ, Castillo CD, Colon YI, Ranjan R, et al. 2019. Deep convolutional neural networks in the face of caricature. Nat. Mach. Intell. 1(11):522–29
  • Hong H, Yamins DL, Majaj NJ, DiCarlo JJ. 2016. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19(4):613–22
  • Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2(5):359–66
  • Huang GB, Lee H, Learned-Miller E. 2012. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2518–25. Piscataway, NJ: IEEE
  • Huang GB, Mattar M, Berg T, Learned-Miller E. 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Paper presented at the Workshop on Faces in "Real-Life" Images: Detection, Alignment, and Recognition, Marseille, France
  • Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. 2019. Adversarial examples are not bugs, they are features. arXiv:1905.02175 [stat.ML]
  • Issa EB, DiCarlo JJ. 2012. Precedence of the eye region in neural processing of faces. J. Neurosci. 32(47):16666–82
  • Jacquet M, Champod C. 2020. Automated face recognition in forensic science: review and perspectives. Forensic Sci. Int. 307:110124
  • Jayaraman S, Fausey CM, Smith LB. 2015. The faces in infant-perspective scenes change over the first year of life. PLOS ONE 10(5):e0123780
  • Jayaraman S, Smith LB. 2019. Faces in early visual environments are persistent not just frequent. Vis. Res. 157:213–21
  • Jenkins R, White D, Van Montfort X, Burton AM. 2011. Variability in photos of the same face. Cognition 121(3):313–23
  • Kandel ER, Schwartz JH, Jessell TM, Siegelbaum S, Hudspeth AJ, Mack S, eds. 2000. Principles of Neural Science, Vol. 4. New York: McGraw-Hill
  • Kay KN, Weiner KS, Grill-Spector K. 2015. Attention reduces spatial uncertainty in human ventral temporal cortex. Curr. Biol. 25(5):595–600
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Ge L, Pascalis O. 2007. The other-race effect develops during infancy: evidence of perceptual narrowing. Psychol. Sci. 18(12):1084–89
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Gibson A, et al. 2005. Three-month-olds, but not newborns, prefer own-race faces. Dev. Sci. 8(6):F31–36
  • Kietzmann TC, Swisher JD, König P, Tong F. 2012. Prevalence of selectivity for mirror-symmetric views of faces in the ventral and dorsal visual pathways. J. Neurosci. 32(34):11763–72
  • Krishnapriya KS, Albiero V, Vangara K, King MC, Bowyer KW. 2020. Issues related to face recognition accuracy varying based on race and skin tone. IEEE Trans. Technol. Soc. 1(1):8–20
  • Krishnapriya K, Vangara K, King MC, Albiero V, Bowyer K. 2019. Characterizing the variability in face recognition accuracy relative to race. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 2278–85. Piscataway, NJ: IEEE
  • Krizhevsky A, Sutskever I, Hinton GE. 2012. ImageNet classification with deep convolutional neural networks. In NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097–105. New York: ACM
  • Kumar N, Berg AC, Belhumeur PN, Nayar SK. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE International Conference on Computer Vision, pp. 365–72. Piscataway, NJ: IEEE
  • Laurence S, Zhou X, Mondloch CJ. 2016. The flip side of the other-race coin: They all look different to me. Br. J. Psychol. 107(2):374–88
  • LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521(7553):436–44
  • Levin DT. 2000. Race as a visual feature: using visual search and perceptual discrimination tasks to understand face categories and the cross-race recognition deficit. J. Exp. Psychol. Gen. 129(4):559–74
  • Lewenberg Y, Bachrach Y, Shankar S, Criminisi A. 2016. Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. arXiv:1605.09062 [cs.CV]
  • Li Y, Gao F, Ou Z, Sun J. 2018. Angular softmax loss for end-to-end speaker verification. In Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 190–94. Baixas, France: ISCA
  • Liu Z, Luo P, Wang X, Tang X. 2015. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 3730–38. Piscataway, NJ: IEEE
  • Lundqvist D, Flykt A, Ohman A. 1998. Karolinska directed emotional faces. Database of standardized facial images, Psychol. Sect., Dept. Clin. Neurosci., Karolinska Hosp., Solna, Swed. https://www.kdef.se/
  • Malpass RS, Kravitz J. 1969. Recognition for faces of own and other race. J. Personal. Soc. Psychol. 13(4):330–34
  • Matthews CM, Mondloch CJ. 2018. Improving identity matching of newly encountered faces: effects of multi-image training. J. Appl. Res. Mem. Cogn. 7(2):280–90
  • Maurer D, Le Grand R, Mondloch CJ. 2002. The many faces of configural processing. Trends Cogn. Sci. 6(6):255–60
  • Maze B, Adams J, Duncan JA, Kalka N, Miller T, et al. 2018. IARPA Janus Benchmark—C: face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), pp. 158–65. Piscataway, NJ: IEEE
  • McCurrie M, Beletti F, Parzianello L, Westendorp A, Anthony S, Scheirer WJ. 2017. Predicting first impressions with deep learning. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 518–25. Piscataway, NJ: IEEE
  • Murphy J, Ipser A, Gaigg SB, Cook R. 2015. Exemplar variance supports robust learning of facial identity. J. Exp. Psychol. Hum. Percept. Perform. 41(3):577–81
  • Natu VS, Barnett MA, Hartley J, Gomez J, Stigliani A, Grill-Spector K. 2016. Development of neural sensitivity to face identity correlates with perceptual discriminability. J. Neurosci. 36(42):10893–907
  • Natu VS, Jiang F, Narvekar A, Keshvari S, Blanz V, O'Toole AJ. 2010. Dissociable neural patterns of facial identity across changes in viewpoint. J. Cogn. Neurosci. 22(7):1570–82
  • Nordt M, Gomez J, Natu V, Jeska B, Barnett M, Grill-Spector K. 2019. Learning to read increases the informativeness of distributed ventral temporal responses. Cereb. Cortex 29(7):3124–39
  • Nordt M, Gomez J, Natu VS, Rezai AA, Finzi D, Grill-Spector K. 2020. Selectivity to limbs in ventral temporal cortex decreases during childhood as selectivity to faces and words increases. J. Vis. 20(11):152
  • Noyes E, Jenkins R. 2019. Deliberate disguise in face identification. J. Exp. Psychol. Appl. 25(2):280–90
  • Noyes E, Parde C, Colon Y, Hill M, Castillo C, et al. 2021. Seeing through disguise: getting to know you with a deep convolutional neural network. Cognition. In press
  • Noyes E, Phillips P, O'Toole A. 2017. What is a super-recogniser? In Face Processing: Systems, Disorders and Cultural Differences, ed. Bindemann M, pp. 173–201. Hauppauge, NY: Nova Sci. Publ.
  • Oosterhof NN, Todorov A. 2008. The functional basis of face evaluation. PNAS 105(32):11087–92
  • O'Toole AJ, Castillo CD, Parde CJ, Hill MQ, Chellappa R. 2018. Face space representations in deep convolutional neural networks. Trends Cogn. Sci. 22(9):794–809
  • O'Toole AJ, Phillips PJ, Jiang F, Ayyad J, Pénard N, Abdi H. 2007. Face recognition algorithms surpass humans matching faces over changes in illumination. IEEE Trans. Pattern Anal. Mach. Intell. 29(9):1642–46
  • Parde CJ, Castillo C, Hill MQ, Colon YI, Sankaranarayanan S, et al. 2017. Face and image representation in deep CNN features. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 673–80. Piscataway, NJ: IEEE
  • Parde CJ, Colón YI, Hill MQ, Castillo CD, Dhar P, O'Toole AJ. 2021. Face recognition by humans and machines: closing the gap between single-unit and neural population codes—insights from deep learning in face recognition. J. Vis. In press
  • Parde CJ, Hu Y, Castillo C, Sankaranarayanan S, O'Toole AJ. 2019. Social trait information in deep convolutional neural networks trained for face identification. Cogn. Sci. 43(6):e12729
  • Parkhi OM, Vedaldi A, Zisserman A. 2015. Deep face recognition. Rep., Vis. Geom. Group, Dept. Eng. Sci., Univ. Oxford, UK
  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, et al. 2019. PyTorch: an imperative style, high-performance deep learning library. In NeurIPS 2019: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8024–35. New York: ACM
  • Pezdek K, Blandon-Gitlin I, Moore C. 2003. Children's face recognition memory: more evidence for the cross-race effect. J. Appl. Psychol. 88(4):760–63
  • Phillips PJ, Beveridge JR, Draper BA, Givens G, O'Toole AJ, et al. 2011. An introduction to the good, the bad, & the ugly face recognition challenge problem. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 346–53. Piscataway, NJ: IEEE
  • Phillips PJ, O'Toole AJ. 2014. Comparison of human and computer performance across face recognition experiments. Image Vis. Comput. 32(1):74–85
  • Phillips PJ, Yates AN, Hu Y, Hahn CA, Noyes E, et al. 2018. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms. PNAS 115(24):6171–76
  • Poggio T, Banburski A, Liao Q. 2020. Theoretical issues in deep networks. PNAS 117(48):30039–45
  • Ponce CR, Xiao W, Schade PF, Hartmann TS, Kreiman G, Livingstone MS. 2019. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177(4):999–1009
  • Ranjan R, Bansal A, Zheng J, Xu H, Gleason J, et al. 2019. A fast and accurate system for face detection, identification, and verification. IEEE Trans. Biom. Behav. Identity Sci. 1(2):82–96
  • Ranjan R, Castillo CD, Chellappa R. 2017. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507 [cs.CV]
  • Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R. 2017. An all-in-one convolutional neural network for face analysis. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 17–24. Piscataway, NJ: IEEE
  • Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, et al. 2019. A deep learning framework for neuroscience. Nat. Neurosci. 22(11):1761–70
  • Ritchie KL, Burton AM. 2017. Learning faces from variability. Q. J. Exp. Psychol. 70(5):897–905
  • Rosch E, Mervis CB, Gray WD, Johnson DM, Boyes-Braem P. 1976. Basic objects in natural categories. Cogn. Psychol. 8(3):382–439
  • Russakovsky O, Deng J, Su H, Krause J, Satheesh S, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115(3):211–52
  • Russell R, Duchaine B, Nakayama K. 2009. Super-recognizers: people with extraordinary face recognition ability. Psychon. Bull. Rev. 16(2):252–57
  • Sangrigoli S, Pallier C, Argenti AM, Ventureyra V, de Schonen S. 2005. Reversibility of the other-race effect in face recognition during childhood. Psychol. Sci. 16(6):440–44
  • Sankaranarayanan S, Alavi A, Castillo C, Chellappa R. 2016. Triplet probabilistic embedding for face verification and clustering. arXiv:1604.05417 [cs.CV]
  • Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, et al. 2018. Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv 407007. https://doi.org/10.1101/407007
  • Schroff F, Kalenichenko D, Philbin J. 2015. FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–23. Piscataway, NJ: IEEE
  • Scott LS, Monesson A. 2010. Experience-dependent neural specialization during infancy. Neuropsychologia 48(6):1857–61
  • Sengupta S, Chen JC, Castillo C, Patel VM, Chellappa R, Jacobs DW. 2016. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Sim T, Baker S, Bsat M. 2002. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face & Gesture Recognition, pp. 53–58. Piscataway, NJ: IEEE
  • Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs.CV]
  • Smith LB, Jayaraman S, Clerkin E, Yu C. 2018. The developing infant creates a curriculum for statistical learning. Trends Cogn. Sci. 22(4):325–36
  • Smith LB, Slone LK. 2017. A developmental approach to machine learning? Front. Psychol. 8:2124
  • Song A, Li L, Atalla C, Cottrell G. 2017. Learning to see people like people: predicting social impressions of faces. Cogn. Sci. 2017:1096–101
  • Storrs KR, Kietzmann TC, Walther A, Mehrer J, Kriegeskorte N. 2020. Diverse deep neural networks all predict human IT well, after training and fitting. bioRxiv 2020.05.07.082743. https://doi.org/10.1101/2020.05.07.082743
  • Su H, Maji S, Kalogerakis E, Learned-Miller E. 2015. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 945–53. Piscataway, NJ: IEEE
  • Sugden NA, Moulson MC. 2017. Hey baby, what's "up"? One- and 3-month-olds experience faces primarily upright but non-upright faces offer the best views. Q. J. Exp. Psychol. 70(5):959–69
  • Taigman Y, Yang M, Ranzato M, Wolf L. 2014. DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–8. Piscataway, NJ: IEEE
  • Tanaka JW, Pierce LJ. 2009. The neural plasticity of other-race face recognition. Cogn. Affect. Behav. Neurosci. 9(1):122–31
  • Terhörst P, Fährmann D, Damer N, Kirchbuchner F, Kuijper A. 2020. Beyond identity: What information is stored in biometric face templates? arXiv:2009.09918 [cs.CV]
  • Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system. Nature 381(6582):520–22
  • Todorov A. 2017. Face Value: The Irresistible Influence of First Impressions. Princeton, NJ: Princeton Univ. Press
  • Todorov A, Mandisodza AN, Goren A, Hall CC. 2005. Inferences of competence from faces predict election outcomes. Science 308(5728):1623–26
  • Valentine T. 1991. A unified account of the effects of distinctiveness, inversion, and race in face recognition. Q. J. Exp. Psychol. A 43(2):161–204
  • van der Maaten L, Weinberger K. 2012. Stochastic triplet embedding. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6. Piscataway, NJ: IEEE
  • Walker M, Vetter T. 2009. Portraits made to measure: manipulating social judgments about individuals with a statistical face model. J. Vis. 9(11):12
  • Wang F, Liu W, Liu H, Cheng J. 2018. Additive margin softmax for face verification. IEEE Signal Process. Lett. 25:926–30
  • Wang F, Xiang X, Cheng J, Yuille AL. 2017. NormFace: L2 hypersphere embedding for face verification. In MM '17: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041–49. New York: ACM
  • Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. 2020. Adversarial examples improve image recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–28. Piscataway, NJ: IEEE
  • Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS 111(23):8619–24
  • Yi D, Lei Z, Liao S, Li SZ. 2014. Learning face representation from scratch. arXiv:1411.7923 [cs.CV]
  • Yoshida H, Smith LB. 2008. What's in view for toddlers? Using a head camera to study visual experience. Infancy 13(3):229–48
  • Young AW, Burton AM. 2020. Insights from computational models of face recognition: a reply to Blauch, Behrmann and Plaut. Cognition 208:104422
  • Yovel G, Abudarham N. 2020. From concepts to percepts in human and machine face recognition: a reply to Blauch, Behrmann & Plaut. Cognition 208:104424
  • Yovel G, Halsband K, Pelleg M, Farkash N, Gal B, Goshen-Gottstein Y. 2012. Can massive but passive exposure to faces contribute to face recognition abilities? J. Exp. Psychol. Hum. Percept. Perform. 38(2):285–89
  • Yovel G, O'Toole AJ. 2016. Recognizing people in motion. Trends Cogn. Sci. 20(5):383–95
  • Yuan L, Xiao W, Kreiman G, Tay FE, Feng J, Livingstone MS. 2020. Adversarial images for the primate brain. arXiv:2011.05623 [q-bio.NC]
  • Yue X, Cassidy BS, Devaney KJ, Holt DJ, Tootell RB. 2010. Lower-level stimulus features strongly influence responses in the fusiform face area. Cereb. Cortex 21(1):35–47

The development of biometric applications, such as facial recognition (FR), has recently become important in smart cities. Many scientists and engineers around the world have focused on establishing increasingly robust and accurate algorithms and methods for these types of systems and their applications in everyday life. FR is developing technology with multiple real-time applications. The goal of this paper is to develop a complete FR system using transfer learning in fog computing and cloud computing. The developed system uses deep convolutional neural networks (DCNN) because of the dominant representation; there are some conditions including occlusions, expressions, illuminations, and pose, which can affect the deep FR performance. DCNN is used to extract relevant facial features. These features allow us to compare faces between them in an efficient way. The system can be trained to recognize a set of people and to learn via an online method, by integrating the new people it processes and improving its predictions on the ones it already has. The proposed recognition method was tested with different three standard machine learning algorithms (Decision Tree (DT), K Nearest Neighbor(KNN), Support Vector Machine (SVM)). The proposed system has been evaluated using three datasets of face images (SDUMLA-HMT, 113, and CASIA) via performance metrics of accuracy, precision, sensitivity, specificity, and time. The experimental results show that the proposed method achieves superiority over other algorithms according to all parameters. The suggested algorithm results in higher accuracy (99.06%), higher precision (99.12%), higher recall (99.07%), and higher specificity (99.10%) than the comparison algorithms.

Citation: Salama AbdELminaam D, Almansori AM, Taha M, Badr E (2020) A deep facial recognition system using computational intelligent algorithms. PLoS ONE 15(12): e0242269. https://doi.org/10.1371/journal.pone.0242269

Editor: Seyedali Mirjalili, Torrens University Australia, AUSTRALIA

Received: May 28, 2020; Accepted: October 25, 2020; Published: December 3, 2020

Copyright: © 2020 Salama AbdELminaam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript.

Funding: This study was funded by a grant from DSA Lab, Faculty of Computers and Artificial Intelligence, Benha University to author DSA (28211231302952).

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

The face is considered the most critical part of the human body. Research shows that even a face can speak, and it has different words for different emotions. It plays a crucial role in interacting with people in society. It conveys people's identity and thus can be used as a key for security solutions in many organizations. The facial recognition (FR) system is increasingly trending across the world as an extraordinarily safe and reliable security technology. It is gaining significant importance and attention from thousands of corporate and government organizations because of its high level of security and reliability [ 1 – 3 ].

Moreover, the FR system provides vast benefits compared to other biometric security solutions such as palmprints and fingerprints. The system captures biometric measurements of a person from a specific distance without interacting with the person. In crime-deterrent applications, this system can help many organizations identify anyone with a criminal record or other legal issues. Thus, this technology is becoming essential for numerous residential buildings and corporate organizations. The technique is based on the ability to detect a human face and then compare its different features with previously recorded faces, which further increases the system's usefulness and enables it to be widely used across the world. It is developed with user-friendly features and operations based on different nodal points of the face; there are approximately 80 to 90 unique nodal points on a face. From these nodal points, the FR system measures significant aspects including the distance between the eyes, the length of the jawline, the shape of the cheekbones, and the depth of the eyes. These measurements form a code called the faceprint, which represents the identity of the face in the computer database. With the introduction of the latest technology, systems once based on 2D imaging have moved to 3D, making them more accurate and reliable.

Biometrics is defined as the science and technology of measuring and statistically analyzing biological data. Biometrics are measurable behavioral and/or physiological characteristics that can be used to verify individual identity; for each individual, a unique biometric can be used for verification. Biometric systems are used in a growing number of fields such as prison security, secured access, and forensics. They recognize individuals by utilizing different biological features such as the face, hand geometry, iris, retina, and fingerprints. The FR system is a more natural biometric information process, with greater variation, than any other method; thus, FR has become a prominent topic in computer science related to biometrics and machine learning [ 4 , 5 ]. Machine learning is a computer science field that gives computers the capability to learn without being explicitly programmed. Its main focus is providing algorithms that can be trained to perform a task, and it is related to the fields of computational statistics and mathematical optimization. Machine learning includes multiple approaches such as reinforcement learning, supervised learning, semi-supervised learning, and unsupervised learning [ 6 ]. Machine learning can be applied to many tasks that people think only they can do, such as playing games, learning subjects, and recognition [ 6 ]. Most machine learning algorithms consume a massive amount of resources, so it is often better to perform their tasks in a distributed environment such as cloud computing, fog computing, or edge computing.

Cloud computing is based on sharing many resources, including services, applications, storage, servers, and networks, to achieve economies of scale and consistency and thus maximize the efficiency of the shared resources. Fog computing provides many services at the network edge, such as data storage, computing, data provision, and application services for end users who can be added at the network edge [ 7 ]. These environments reduce the total amount of resource usage, speed up the completion of tasks, and reduce costs via pay-per-use.

The main goals of this paper are to build a deep FR system using transfer learning in fog computing. This system is based on modern techniques of deep convolutional neural networks (DCNN) and machine learning. The proposed methods will be able to capture the biometric measurements of a person from a specific distance for crime deterrent purposes without interacting with the person. Thus, the proposed methods can help many organizations identify a person with any kind of criminal record or other legal issues.

The remainder of the paper is organized as follows. Section 2 presents related work in FR techniques and applications. Section 3 presents the components of traditional FR: face processing, deep feature extraction and face matching by in-depth features, machine learning, K-nearest neighbors (KNN), support vector machines (SVM), DCNN, the computing framework, fog computing, and cloud computing. Section 4 explains the proposed FR system using transfer learning in fog computing. Section 5 presents the experimental results. Section 6 provides the conclusion with the outcomes of the proposed system.

2. Literature review

Due to the significant development of machine learning, computing environments, and recognition systems, many researchers have worked on pattern recognition and identification with different biometrics using various model-building and mining strategies. Some recent work on FR systems is briefly surveyed here.

Singh et al. [ 8 ] proposed a COVID-19 disease classification model to classify infected patients from chest CT images. A convolutional neural network (CNN) is used to classify COVID-19 patients as infected (+ve) or not (−ve), and the initial parameters of the CNN are tuned using multi-objective differential evolution (MODE). The results show that the proposed CNN model outperforms competitive models (ANN, ANFIS, and CNN) in terms of accuracy, F-measure, sensitivity, specificity, and Kappa statistics by 1.9789%, 2.0928%, 1.8262%, 1.6827%, and 1.9276%, respectively.

Schiller et al. [ 9 ] proposed a novel transfer-learning approach to automatic emotion recognition (AER) across various modalities. Their model, used for facial expression recognition, utilizes saliency maps to transfer knowledge from an arbitrary source to a target network by mostly "hiding" non-relevant information. The method is independent of the employed model, since the knowledge is transferred solely via augmentation of the input data. The evaluation showed that the new model adapted to the new domain faster when forced to focus on the parts of the input considered relevant. Prakash et al. [ 10 ] proposed an automated face recognition method using a convolutional neural network (CNN) with a transfer-learning approach, initializing the CNN with weights learned from the pre-trained VGG-16 model. The extracted features are fed to a fully connected layer with softmax activation for classification. Two publicly available databases of face images, Yale and AT&T, were used to test the performance of the method: face recognition accuracy of 100% was achieved on the AT&T face images and 96.5% on the Yale face images. The results show that face recognition using a CNN with transfer learning gives better classification accuracy than the PCA method.

Deng et al. [ 11 ] proposed an additive angular margin loss (ArcFace) for face recognition. ArcFace has a clear geometric interpretation owing to its exact correspondence to geodesic distance on a hypersphere. They also presented an extensive experimental evaluation against state-of-the-art FR methods on ten FR datasets, showing that ArcFace consistently beats the state of the art and can be easily implemented with negligible computational overhead. The verification performance of their open-sourced FR models reached 99.82%, 95.45%, and 92.08% on the LFW, CALFW, and CPLFW datasets, respectively [ 11 ].

Wang et al. [ 12 ] proposed a large margin cosine loss (LMCL), reformulating the SoftMax loss as a cosine loss by L2-normalizing both the features and the weight vectors to remove radial variations and introducing a cosine margin term to widen the decision margin in angular space. Maximum between-class variance and minimum intra-class variance are achieved via this cosine decision-margin maximization and normalization. They referred to their model, trained with LMCL, as CosFace. Experiments on the Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge datasets confirmed the effectiveness of the approach, which achieved 99.33%, 96.1%, 77.11%, and 89.88% accuracy on the LFW, YTF, MF1 Rank1, and MF1 Veri benchmarks, respectively [ 12 ].

Tran et al. [ 13 ] proposed a disentangled representation learning-generative adversarial network (DR-GAN) with three distinct innovations. First, the encoder-decoder structure of the generator allows DR-GAN to learn a representation that is both discriminative and generative, supporting image synthesis. Second, the representation is disentangled from other face variations, for example, through the pose code given to the decoder and pose estimation in the discriminator. Third, DR-GAN can take one or multiple images as input and produce one unified representation along with an arbitrary number of synthesized images. They tested the network on the Multi-PIE database and compared their approach with face recognition methods on Multi-PIE, CFP, and IJB-A, achieving competitive average face verification accuracy: comparable performance on frontal-frontal verification and an ~1.4% improvement on frontal-profile verification [ 13 ].

Masi et al. [ 14 ] proposed domain-specific data augmentation to increase the training data available to face recognition systems. They presented techniques to enrich existing datasets with important facial variations by manipulating the faces they contain, while matching query images processed by standard convolutional neural networks. They tested their framework on the LFW and IJB-A benchmarks and on Janus CS2 with a large number of downloaded images, following the standard protocol for unrestricted, labeled outside data, and reported mean classification accuracy at a 100% equal error rate [ 14 ].

Ding and Tao [ 15 ] proposed a comprehensive framework based on convolutional neural networks (CNNs) to overcome the difficulties faced in video-based face recognition (VFR). The CNN learns blur-robust features by using training data comprising artificially blurred data and still images. They proposed a trunk-branch ensemble CNN model (TBE-CNN) to make CNN features robust to pose variations and occlusions: TBE-CNN extracts information from full face images and regions selected around facial components, sharing the low- and middle-level convolutional layers between the trunk and branch networks. They also proposed an improved triplet loss function to strengthen the discriminative power of the representations learned by TBE-CNN. TBE-CNN was tested on three video face databases: YouTube Faces, COX Face, and PaSC [ 15 ].

Al-Waisy et al. [ 16 ] proposed a multimodal deep learning framework based on local feature representation for face recognition. They combined the advantages of local handcrafted feature descriptors with a deep belief network (DBN) to address face recognition in unconstrained conditions. They proposed a multimodal local feature extraction approach based on combining the merits of the fractal dimension with the curvelet transform, which they called the curvelet–fractal approach. The principal motivation for this approach is that the curvelet transform can accurately capture the fundamental facial structure, while the fractal dimension captures the texture descriptors of face images. They then proposed a multimodal deep face recognition (MDFR) approach that adds feature representation by training a DBN on the local feature representations. They compared the results of the MDFR and curvelet–fractal approaches on four face datasets: the LFW, CAS-PEAL-R1, FERET, and SDUMLA-HMT databases. Their approaches outperformed other methodologies, including WPCA, DBN, and LBP, achieving new state-of-the-art results on all four datasets [ 16 ].

Sivalingam et al. [ 17 ] proposed an efficient partial face detection method using an AlexNet CNN to detect emotions from images of half-faces. They identified the key focal points and concentrated on textural features. The AlexNet CNN discriminatively matches the two extracted local features, using both the textural and geometrical information of the local features for matching; the similarity of two faces is determined by the distance between the aligned features. They tested the approach on four widely used face datasets and demonstrated both the effectiveness and the limitations of their method [ 17 ].

Jonnathann et al. [ 18 ] presented a comparison between deep learning and conventional machine learning techniques (for example, artificial neural networks, extreme learning machines, SVM, optimum-path forest, and KNN). For facial biometric recognition, they concentrated on CNNs, using three datasets: AR Face, YALE, and SDUMLA-HMT [ 19 ]. Further research on FR can be found in [ 20 – 23 ].

3. Material and methods

  • Ethics Statement

All participants provided written informed consent and an appropriate photographic release. The individuals shown in Fig 1 have given written informed consent (as outlined in the PLOS consent form) to publish their image.

[Fig 1: https://doi.org/10.1371/journal.pone.0242269.g001]

3.1 Traditional facial recognition components

The whole system comprises three modules, as shown in Fig 1 .

  • First, a face detector is applied to videos or images to detect faces.
  • Next, a facial landmark detector aligns each face so that it is normalized before being recognized against the best match.
  • Finally, the aligned face images are fed into the FR module.

Before an image is input to the FR module, it is scanned by face anti-spoofing, after which recognition is performed.

FR can thus be formulated as computing the similarity M[F(P_i(I_i)), F(P_j(I_j))],

  • where M indicates the face matching algorithm, which is used to calculate the degree of similarity;
  • F refers to feature extraction, which encodes the identity information;
  • P is the face-processing stage that handles occlusions, expressions, illumination, and pose; and
  • I_i and I_j are the two face images being compared.
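As a concrete illustration of this formulation, the minimal Python sketch below composes the three modules; the detect_face, align, and embed helpers are hypothetical stand-ins for P and F, and cosine similarity stands in for M.

import numpy as np

def face_similarity(img_i, img_j, detect_face, align, embed):
    """Compute M[F(P(I_i)), F(P(I_j))] for a pair of images.

    detect_face and align play the role of the processing stage P,
    embed is the deep feature extractor F, and cosine similarity
    serves as the matching function M.
    """
    feats = []
    for img in (img_i, img_j):
        face = align(detect_face(img))  # P: detect and normalize the face
        feats.append(embed(face))       # F: extract the identity feature vector
    f_i, f_j = feats
    # M: cosine similarity between the two feature vectors
    return float(np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j)))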

3.1.1 Face processing.

Deep learning approaches are commonly used because of their dominant representational power; however, Ghazi and Ekenel [ 24 ] showed that conditions including occlusion, expression, illumination, and pose can affect deep FR performance. Handling such variation is one of the main challenges in FR applications; here we summarize deep face-processing methods for pose, since similar techniques can address the other variations. Face-processing techniques are categorized as "one-to-many augmentation" and "many-to-one normalization" [ 24 ].

  • "One-to-many augmentation" : Create many images from a single image with the ability to change the situation, which helps increase the ability of deep networks to work and learn.
  • "Many-to-one normalization" : The canonical view of face images is recovered from nonfrontal-view images, after which FR is performed under controlled conditions.

3.1.2 Deep feature extraction: Network architecture.

The architectures can be categorized as backbone and assembled networks, as shown in Table 1, inspired by the success of ImageNet [ 25 ] and typical CNN architectures such as SENet, ResNet, GoogLeNet, and VGGNet. These architectures are also used as baseline models in FR, in full or partial implementations [ 26 – 30 ].

[Table 1: https://doi.org/10.1371/journal.pone.0242269.t001]

In addition to the mainstream methods, architecture design is still used in FR to improve efficiency. Additionally, with backbone networks as basic blocks, FR methods can be implemented as assembled networks, possibly with multiple tasks or multiple inputs, where each network handles one type of input or one type of task. Higher performance is attained when the results of the assembled networks are combined [ 30 ].

Loss Function. SoftMax loss is commonly used as the supervision signal in object recognition, and it encourages the separability of features. For FR, however, where intra-class variation can be larger than inter-class variation, SoftMax loss loses its effectiveness, motivating the following alternatives.

  • Euclidean-distance-based loss:

Intra-class variance is compressed and inter-class variance is enlarged based on the Euclidean distance.

  • Angular/cosine-margin-based loss:

Discriminative learning of facial features is performed according to angular similarity, with prominent and potentially large angular/cosine separability between the features learned.

  • SoftMax loss and its variations:

Performance is enhanced by using SoftMax loss or a modification of it.
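The sketch below illustrates the angular/cosine-margin family with a CosFace-style large-margin cosine loss in the spirit of [ 12 ]. It is a minimal PyTorch sketch; the scale s and margin m are assumed hyperparameter values, not settings taken from this paper.

import torch
import torch.nn.functional as F

def cosine_margin_loss(features, weights, labels, s=30.0, m=0.35):
    # Normalize features and class-weight vectors so the logits are cosines.
    cos = F.normalize(features) @ F.normalize(weights).t()
    # Subtract the margin m from the target-class cosine only.
    onehot = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = s * (cos - m * onehot)
    # Cross-entropy over the margin-adjusted, scaled cosines.
    return F.cross_entropy(logits, labels)

# Toy usage: 8 samples, 128-d features, 10 identities.
feats = torch.randn(8, 128)
W = torch.randn(10, 128)
y = torch.randint(0, 10, (8,))
print(cosine_margin_loss(feats, W, y))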

3.1.3 Face matching by deep features.

After the deep networks are trained on massive data with an appropriate loss function, a deep feature representation is obtained for each test image by passing it through the network. Cosine distance or L2 distance is most commonly used to compute feature similarity; threshold comparison and the nearest neighbor (NN) classifier are then used for the verification and identification tasks. Many other methods can be used to process the deep features and compute facial matching with high accuracy, such as the sparse representation-based classifier (SRC) and metric learning.
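A minimal sketch of this matching step, assuming features have already been extracted as NumPy vectors; the verification threshold of 0.4 is an arbitrary illustrative value that would in practice be calibrated on a validation set.

import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(feat_a, feat_b, threshold=0.4):
    # Verification: same identity if the distance falls below a threshold.
    return cosine_distance(feat_a, feat_b) < threshold

def identify(probe_feat, gallery):
    # Identification: nearest neighbor over an enrolled {identity: feature} dict.
    return min(gallery, key=lambda name: cosine_distance(probe_feat, gallery[name]))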

FR has developed out of object classification, and face-processing methods can also handle variations in pose, expression, and occlusion. There are many new, more complex kinds of FR tied to features of the real world, such as cross-pose FR, cross-age FR, and video FR; sometimes, more realistic datasets are constructed to simulate real-world scenes.

3.2 Machine learning

Machine learning is developed from computational learning theory and pattern recognition. A learning algorithm uses a set of samples called a training set as an input.

In general, there are two main categories of learning: supervised and unsupervised. The objective of supervised learning is to learn to predict the proper output vector for any input vector; classification tasks are applications in which the target label is one of a finite number of discrete categories. Defining the unsupervised learning objective is more challenging; a primary objective is to find sensible clusters of similar samples within the input data, which is called clustering.

3.2.1 K-nearest neighbors.

KNN assigns a new sample to the class most common among its k nearest training samples, typically under the Euclidean distance d(x, x') = ||x − x'||_2.

KNN must store the entire training set, and this storage requirement is one of the limitations that makes KNN challenging to use on large datasets.
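For illustration, a minimal scikit-learn sketch of KNN over deep feature vectors; the synthetic arrays stand in for real FC7 features, and note that fit() essentially memorizes the training set, which is exactly the storage limitation noted above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins: 50 feature vectors (4096-d, like FC7) for 10 subjects.
X_train = np.random.rand(50, 4096)
y_train = np.repeat(np.arange(10), 5)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)  # "training" just stores the samples
print(knn.predict(np.random.rand(1, 4096)))  # prediction scans the stored set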

3.2.2 Support vector machine.

The soft-margin SVM solves: minimize (1/2)||w||^2 + C Σ_{i=1..n} ξ_i, subject to y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, …, n.

Although we use the L1 norm for the penalty term Σ_{i=1..n} ξ_i, other penalty terms such as the L2 norm exist and should be chosen according to the needs of the application. Moreover, the parameter C is a hyperparameter that can be chosen via cross-validation or Bayesian optimization. An important property of SVMs is that the resulting classifier uses only a few training points, known as support vectors, to classify a new data point.

SVMs can perform nonlinear classification, finding a hyperplane that is a nonlinear function of the input variables by mapping the input to a high-dimensional feature space, in addition to performing linear classification. SVMs can also perform multiclass classification in addition to binary classification [ 34 ].

SVMs are among the best off-the-shelf supervised learning models that are capable of effectively working with high-dimensional datasets and are efficient regarding memory usage due to the employment of support vectors for prediction. SVMs are useful in several real-world systems including protein classification, image classification, and handwritten character recognition.
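A minimal scikit-learn sketch of an SVM over the same kind of feature vectors, with C chosen by cross-validation as described above; the data are synthetic placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(50, 4096)     # stand-in deep feature vectors
y_train = np.repeat(np.arange(10), 5)  # 10 subjects, 5 images each

# Choose the penalty parameter C by 5-fold cross-validation.
svm = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
svm.fit(X_train, y_train)              # multiclass is handled one-vs-one internally
print(svm.best_params_)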

3.3 Computing framework

The recognition system has different parts, and the computing framework is one of the essential parts for processing data; the best-known frameworks are cloud and fog computing. The framework an FR application uses depends on the processing location and the application. In some applications, data must be processed immediately after acquisition; in others, instant processing is not required. Fog computing is a network architecture that supports processing data instantly [ 35 ].

3.3.1 Fog computing.

Fog computing works by relaying information from datacenter tasks to servers at the edge of the network. The fog architecture runs on these edge servers and provides networking, storage space, limited computing, logical-intelligence data filtering, and connections to datacenters. This structure is used in fields such as military and e-health applications [ 36 , 37 ].

3.3.2 Cloud computing.

To obtain accessible data, data are sent to the datacenter for analysis and processing. A significant amount of time and effort is expended to transfer and process data in this type of architecture, so it is not sufficient for working with big data, and big data processing increases the cloud server's CPU usage [ 38 ]. There are various types of cloud computing, such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and Mobile Backend as a Service (MBaaS) [ 39 ].

Big data applications such as FR require a method and design that distribute the computation to process big data in a fast, repeatable way [ 40 , 41 ]. Data are divided into packages, and each package is assigned to a different computer for processing. A move from the cloud to fog or distributed computing brings 1) a reduction in network load, 2) an increase in data processing speed, 3) a decrease in CPU usage, 4) a decrease in energy consumption, and 5) higher data-volume processing.

4. Proposed facial recognition system

4.1 Traditional deep convolutional neural networks


Krizhevsky et al. [ 28 ] developed AlexNet for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [ 34 ]. The input image has a width (W), height (H), and depth (D) of 227×227×3, where D = 3 accounts for the colors red, green, and blue. The first convolutional layer filters the input color image with 96 kernels (K) of size 11×11 (F) and a four-pixel stride (S); the stride is the distance between the receptive field centers of neighboring neurons in the kernel map. The formula ((W−F+2P)/S)+1 gives the output size of a convolutional layer, where P is the number of padded pixels, which can be as low as zero. The output size of the first convolutional layer is therefore ((227−11+0)/4)+1 = 55. The input to the second convolutional layer has a size of 55×55×(number of filters), and this layer has 256 filters, each of size 5×5, with a stride of two pixels. Since the work of these layers is distributed over 2 GPUs, the load of each layer is divided by 2: 55/2 × 55/2 × 256/2 ≈ 27×27×128 inputs per GPU. The convolutional layers are followed by pooling layers, which reduce the dimensionality of each feature map while retaining the important features; the pooling can be max, sum, average, etc., and AlexNet employs max pooling. The normalized output of the second convolutional layer is connected to the third layer, which has 384 kernels of size 3×3. The fourth convolutional layer also has 384 kernels of size 3×3, divided over the 2 GPUs, so each GPU carries a 3×3×192 load. The fifth convolutional layer has 256 kernels of size 3×3, divided over the 2 GPUs, so each GPU carries a 3×3×128 load. The last three convolutional layers have no pooling or normalization layers; their outputs are delivered to two fully connected layers, each with 4,096 neurons. Fig 2 illustrates the AlexNet architecture used to classify different classes with ImageNet as the training dataset [ 34 ]. DCNNs learn features hierarchically, and a DCNN increases image classification accuracy, especially on large datasets [ 42 ]. Since implementing a DCNN requires a large number of images to attain high classification rates, an insufficient number of color images per subject creates an extra challenge for recognition systems [ 35 , 36 ]. A DCNN consists of neural networks with convolutional layers that perform feature extraction and classification on images [ 37 ]. The difference between the test data and the original training data is minimized by using a training set with different sizes or scales but the same features; such features are extracted and classified well by a deep network [ 43 ]. Therefore, the DCNN is well suited to the recognition and classification tasks. The AlexNet architecture is shown in Fig 2.

[Fig 2: https://doi.org/10.1371/journal.pone.0242269.g002]
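The output-size formula is easy to check in code; the short sketch below reproduces the 55 computed above for the first convolutional layer, and the 27 that results from 3×3 max pooling with stride 2.

def conv_output_size(W, F, P, S):
    # ((W - F + 2P) / S) + 1, the formula used above
    return (W - F + 2 * P) // S + 1

print(conv_output_size(227, 11, 0, 4))  # 55: first AlexNet conv layer
print(conv_output_size(55, 3, 0, 2))    # 27: 3x3 max pooling, stride 2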

4.2 Fundamentals of transfer learning

The core idea of transfer learning (TL) appears in Fig 3. TL starts from a relatively complex and successful pretrained model trained on an enormous data source, e.g., ImageNet, a large visual database developed for visual object recognition research [ 41 ]. ImageNet contains over 14,000,000 manually annotated pictures, one million of which are furnished with bounding boxes, and more than 20,000 categories [ 42 ]. Ordinarily, pretrained models are trained on a subset of ImageNet with 1,000 classes. The learned knowledge is then "transferred" to comparatively simplified tasks (e.g., classifying alcoholism vs. non-alcoholism) that have only a limited amount of private data. Two attributes make this transfer work [ 44 ]: (i) the success of the pretrained model spares the user the exhausting hyperparameter tuning of new tasks, and (ii) the early layers of pretrained models can serve as feature extractors that capture low-level features, for example, edges, tints, shades, and textures. Conventional TL retrains the new layers [ 13 ]: first the pretrained model is loaded, and then the entire neural network structure is retrained. Critically, the global learning rate is kept small, and the transferred layers are given a low learning-rate factor, while the newly added layers are given a high factor. The core knowledge of TL is shown in Fig 3.

[Fig 3: https://doi.org/10.1371/journal.pone.0242269.g003]
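A minimal PyTorch/torchvision sketch of this feature-extractor view of a pretrained model; the paper's own implementation is in MATLAB, so this is only an illustration, and it assumes a recent torchvision with the weights API.

import torch
from torchvision import models

# Load an ImageNet-pretrained AlexNet and freeze it, so its early layers
# act purely as the low-level feature extractors described above.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()
for p in net.parameters():
    p.requires_grad = False

img = torch.rand(1, 3, 227, 227)            # dummy input image
feats = net.features(img)                   # convolutional feature maps
vec = torch.flatten(net.avgpool(feats), 1)  # fixed-length descriptor
print(vec.shape)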

4.3 Adaptive deep convolutional neural networks (the proposed face recognition system)

The proposed system consists of three essential stages:

  • preprocessing,
  • feature extraction, and
  • recognition and identification.

In preprocessing, the system begins by capturing an image that must contain a human face.

This image is passed to the face detector module, which detects the human face and segments it as the region of interest (ROI). The obtained ROI then continues through the preprocessing steps: it is resized to the predefined size for alignment purposes.

In feature extraction, the preprocessed ROI is processed to extract a feature vector using the modified version of AlexNet. The extracted vector represents the significant details of the associated image.

Finally, recognition and identification determine to which enrolled subject in the system's database a feature vector belongs. Each new feature vector represents either a new subject or an already registered subject: for the feature vector of an already registered subject, the system recognizes the associated ID; for the feature vector of a new subject, the system adds a new record to the connected database.

Fig 4 illustrates the overall structure of the proposed face recognition system.

[Fig 4: https://doi.org/10.1371/journal.pone.0242269.g004]

The system performs the following steps on the face images to obtain the distinctive features of each face:

All participants provided written informed consent and an appropriate photographic release. The individuals shown in Fig 5 have given written informed consent (as outlined in the PLOS consent form) to publish their image.

[Fig 5: https://doi.org/10.1371/journal.pone.0242269.g005]

  • 1. Preprocessing and Face Detection

In the preprocessing step, as shown in Fig 5, the system first ensures that the input is an RGB image and that images are aligned to the same size. Then, the face detection step is performed using the well-known Viola-Jones detection approach, whose popularity stems from its ability to work well in real time and to achieve high accuracy. To detect the faces in a given image, this face detector scans the input image with detection windows of different sizes.

In this phase, the decision is made as to whether a window contains a face. Haar-like filters are used to derive simple local features from the face-window candidates; in Haar-like filters, the feature values are obtained simply as the difference between the total light intensities of groups of pixels. The region of interest is then segmented by cropping, and the face image is resized to 227×227, as shown in Fig 6.

[Fig 6: https://doi.org/10.1371/journal.pone.0242269.g006]

All participants provided written informed consent and an appropriate photographic release. The individuals shown in Fig 6 have given written informed consent (as outlined in the PLOS consent form) to publish their image.
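A minimal OpenCV sketch of this detection-and-cropping step, using the library's bundled pretrained Haar cascade; the file name subject.jpg and the detectMultiScale parameters are illustrative assumptions.

import cv2

# OpenCV ships a pretrained Viola-Jones (Haar cascade) frontal-face detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("subject.jpg")              # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    roi = cv2.resize(img[y:y + h, x:x + w], (227, 227))  # crop and resize the ROI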

  • 2. Features Extraction using Pre-trained Alex Network

The available dataset is too small to train a new deep model from scratch, which would in any case be impractical given the enormous number of training images required. To maintain objectivity in this test, we applied transfer learning to the pretrained AlexNet architecture in three distinct ways. First, we needed to alter the structure: the last fully connected layer (FCL) was updated, since the original FCLs were created to perform 1,000-way classification. Twenty randomly chosen ImageNet classes illustrate the point: scale, barber chair, lorikeet, toy poodle, Maltese dog, tabby cat, beer bottle, workstation, necktie, trombone, crash helmet, cucumber, letterbox, pomegranate, Appenzeller, muzzle, snow leopard, mountain bike, lock, and diamondback. None of them is related to face recognition, so we could not directly apply AlexNet as the feature extractor, and fine-tuning was essential. Since the number of output neurons (1,000) in conventional AlexNet is not equal to the number of classes in our task, we also needed to alter the corresponding softmax layer and classification layer, as indicated in Fig 7.

[Fig 7: https://doi.org/10.1371/journal.pone.0242269.g007]

In our transfer-learning plan, we used a new, randomly initialized fully connected layer sized to the number of subjects enrolled in the dataset(s) used, a softmax layer, and a new classification layer with the same number of candidates. Fig 8 shows the various kinds of available activation functions; we used softmax, since the decision depends on the maximum score across multiple outputs. Next, we set the training options, checking three properties before training. First, the overall number of training epochs ought to be small for transfer learning; we initially set it to 6. Second, the global learning rate was set to a small value of 10−4 to slow learning down, since the early layers of this neural network were pretrained. Third, the learning rate of the new layers was several times that of the transferred layers, since the transferred layers come with pretrained weights while the new layers have randomly initialized weights. Finally, we varied the number of transferred layers and tried various settings. AlexNet comprises five convolutional layers (CL1, CL2, CL3, CL4, and CL5) and three fully connected layers (FCL6, FCL7, and FCL8).

[Fig 8: https://doi.org/10.1371/journal.pone.0242269.g008]
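A minimal PyTorch sketch of the fine-tuning setup just described: replace the 1,000-way final layer with one sized to the enrolled subjects, keep a small rate for the transferred layers, and give the new layer a rate several times higher. The paper's implementation is in MATLAB, and the specific rates here (1e-4 and 1e-3) follow the description above but are otherwise illustrative.

import torch
from torchvision import models

n_subjects = 106  # e.g., the SDUMLA-HMT enrollment used later in the paper
net = models.alexnet(weights="IMAGENET1K_V1")
net.classifier[6] = torch.nn.Linear(4096, n_subjects)  # replace the 1,000-way FC8

# Transferred layers learn slowly; the new layer gets a higher learning rate.
optimizer = torch.optim.SGD([
    {"params": [p for n, p in net.named_parameters()
                if not n.startswith("classifier.6")], "lr": 1e-4},
    {"params": net.classifier[6].parameters(), "lr": 1e-3},
], momentum=0.9)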

The pseudocode of the proposed algorithm is shown in Algorithm 1. It starts from the original AlexNet architecture and the image dataset of the subjects enrolled in the recognition system. For each image in the dataset, the subject's face is detected using Viola-Jones detection. The resulting face dataset is used for transfer learning: we adapt the AlexNet architecture and then train the altered architecture on the face dataset. The trained model is then used for feature extraction.

The corresponding SoftMax and classification layers are updated as indicated in the pseudocode of the proposed algorithm (Algorithm 1).

Algorithm 1: Transfer learning using the AlexNet model

Input ← original AlexNet Net, ImageFaceSet imds

Output ← modified trained AlexNet FNet, features FSet

1.    Begin
2.        // Preprocess face image(s) in imds
3.        For i = 1 : length(imds)
4.            img ← read(imds, i)
5.            face ← detectFace(img)
6.            img ← resize(face, [227, 227])
7.            save(imds, i, img)
8.        End for
9.        // Adapt the AlexNet structure
10.       FLayers ← Net.Layers(1 : END−3)
11.       FLayers.append(new Fully-Connected layer)
12.       FLayers.append(new SoftMax layer)
13.       FLayers.append(new Classification layer)
14.       // Train FNet using options
15.       Options.set(SolverOptimizer ← stochastic gradient descent with momentum)
16.       Options.set(InitialLearnRate ← 1e−3)
17.       Options.set(LearnRateSchedule ← Piecewise)
18.       Options.set(MiniBatchSize ← 32)
19.       Options.set(MaxEpochs ← 6)
20.       FNet ← trainNetwork(FLayers, imds, Options)
21.       // Use FNet to extract features
22.       FSet ← empty
23.       For j = 1 : length(imds)
24.           img ← read(imds, j)
25.           F ← extract(FNet, img, 'FC7')
26.           FSet ← FSet ∪ F
27.       End for

  • 3. Face Recognition Phase using Fog and Cloud Computing

Fig 9 shows the fog computing face recognition framework. Fog systems comprise client devices, fog nodes/servers, and a cloud computing environment. The general differences from the conventional cloud computing process are as follows:

  • A cloud computing center oversees and controls the numerous fog nodes/servers.
  • Fog nodes/servers situated at the edge of the network, between the network center and the client, have specific acquisition devices that can perform preprocessing and feature extraction tasks and can communicate biometric data securely with the client devices and the cloud.
  • User devices are heterogeneous and include smartphones, personal computers (PCs), hubs, and other networkable terminals.

[Fig 9: https://doi.org/10.1371/journal.pone.0242269.g009]

There are multiple reasons behind this communication plan.

  • From the viewpoint of recognition efficiency, if all FR information is sent to a single node, the network communication cost increases, since all information must be sent to and processed by the cloud server; the computational load on the cloud server also increases.
  • From the viewpoint of recognition security, the cloud center, as the focal node of the whole system, becomes a target for attacks; if the focal node is breached, information acquired from the fog nodes/servers becomes vulnerable.
  • Face recognition datasets are required for training if a neural network is used for recognition. Preparing datasets is normally time consuming and would greatly increase the training time if training were carried out only by the nodes, risking the training quality.

Since the connection between a fog node and client devices can be inconsistent, we propose a general engineering plan for cloud-based face recognition frameworks. This plan exploits the processing ability and storage capacity of both fog nodes/servers and cloud servers.

The design incorporates preprocessing, feature extraction, face recognition, and recognition-based security. The plan is partitioned into six layers, according to the data flow of the fog architecture shown in Fig 10:

  • User equipment layer: The FC/MEC client devices are heterogeneous, including PCs and smart terminals. These devices may use various fog nodes/servers through various protocols.
  • Network layer: This layer connects services through the various fog architecture protocols. It obtains the information transmitted from the system and client device layer and compresses and transmits that information.
  • Data processing layer: The essential task of this layer is to preprocess the image(s) sent from the client hardware, including data cleaning, filtering, and preprocessing. The task of this layer is performed on the fog nodes.
  • Extraction layer: After the image(s) are preprocessed, the extraction layer uses the adapted AlexNet to extract the features.
  • Analysis layer: This layer communicates through the cloud. Its primary task is to cluster the extracted feature vectors produced by the fog nodes/servers. It can match data among registered clients and produces responses to requests.
  • Management layer: The management in the cloud server is mainly responsible for (1) the decisions and responses of the face recognition framework and (2) storing the information and logs of the fog nodes/servers to facilitate recognition and authentication.

[Fig 10: https://doi.org/10.1371/journal.pone.0242269.g010]

All participants provided written informed consent and an appropriate photographic release. The individuals shown in Fig 11 and Fig 12 have given written informed consent (as outlined in the PLOS consent form) to publish their image.

[Fig 11: https://doi.org/10.1371/journal.pone.0242269.g011]

[Fig 12: https://doi.org/10.1371/journal.pone.0242269.g012]

As shown in Fig 11, the recognition classifier of the analysis layer is the most significant part of the framework for data processing; it is tied to the subsequent cloud server response that guarantees the legitimacy of the framework. Accordingly, our work centers on recognition and authentication. Classifiers on fog nodes/servers can use their own computation ability and storage capacity for recognition; however, much of the data cannot be handled or stored because of the limited computation and storage capacity of fog nodes/servers. Moreover, as mentioned, deploying classifiers only on fog nodes/servers cannot meet the needs of an individual system. The cloud server has a greater storage capacity than the fog nodes/servers; it can therefore store and process many training sets and send them progressively to the fog nodes/servers, so that different fog nodes/servers receive appropriate training sets.

Fig 12 shows face images of SDUMLA-HMT subjects under different conditions as a dataset example.

5. Experimental results

In this section, we provide the results obtained in the experiments. Some of these results are presented as graphs showing the relation between performance and the parameters mentioned previously.

5.1 Runtime environment

The proposed recognition system was implemented and developed using MatlabR2018a on a PC with an Intel Core i7 CPU running at 2.2 GHz and Windows 10 Professional 64-bit edition. The proposed system is based on the dataset SDUMLA-HMT, which is available online for free.

5.2 Dataset(s)

SDUMLA-HMT is a publicly available database that has been used to evaluate the proposed system. The SDUMLA-HMT database was collected in 2010 by Shandong University, Jinan, China. It consists of five subdatabases—face, iris, finger vein, fingerprint, and gait—and contains 106 subjects (61 males and 45 females) with ages ranging between 17 and 31 years. In this work, we have used the face and iris databases only [ 19 ].

The face database was built using seven digital cameras. Each camera was used to capture the face of every subject with different poses (three images), different expressions (four images), different accessories (one image with a hat and one with glasses), and different illumination conditions (three images). The face database therefore consists of 106×7×(3+4+2+3) = 8,904 images. All face images are 640×480 pixels and are stored in BMP format. Some face images of subject number 69 under different conditions are shown in Fig 12 [ 19 ].

5.3 Performance measure

Researchers have recently focused on enhancing the accuracy of face recognition systems, often regardless of the latest technologies and computing environments. Today, cloud computing and fog computing are available to enhance the performance of face recognition and decrease time complexity; the proposed framework handles these issues and considers them carefully. The classifier performance evaluator carries out various performance measures, classifying FR outcomes as true positive (TP), false negative (FN), false positive (FP), and true negative (TN). Precision is the most interesting and sensitive measure for a wide-ranging comparison of the essential individual classifiers and the proposed system.

Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall (sensitivity) = TP / (TP + FN), and Specificity = TN / (TN + FP), where:

  • True Negative (TN): These are the negative tuples that were correctly labeled by the classifier.
  • True Positive (TP): These are the positive tuples that were correctly labeled by the classifier.
  • False Positive (FP): These are the negative tuples that were incorrectly labeled as positive.
  • False Negative (FN): These are the positive tuples that were mislabeled as negative.
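A minimal sketch of computing these measures from the four counts; the toy counts are arbitrary.

def classification_metrics(tp, tn, fp, fn):
    # Standard measures computed from the four counts defined above.
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }

print(classification_metrics(tp=95, tn=90, fp=5, fn=10))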

5.4 Results & discussion

A set of experiments was performed to evaluate the proposed system against the evaluation criteria. All experiments start by loading the color images from the data source and passing them to the segmentation step. According to the pretrained AlexNet, the input image size cannot exceed 227×227 with an image depth of 3. Therefore, after segmentation, we performed a check step to guarantee the appropriateness of the image size; resizing to 227×227×3 (width, height, depth) is performed whenever an image exceeds this limit. The main parameters and ratios are presented in Table 2.

[Table 2: https://doi.org/10.1371/journal.pone.0242269.t002]

  • The experimental outcomes of the developed FR system and its comparison with various other techniques are presented below. The outcomes of the proposed algorithm outperformed most of its peers, especially in terms of precision.

5.4.1 Recognition time results

Fig 13 compares the four algorithms: decision tree (DT), the KNN classifier, SVM, and the proposed DCNN powered by the pretrained AlexNet classifier. Two parameters are used for the comparison: observations per second and recognition time in seconds per observation.

[Fig 13: https://doi.org/10.1371/journal.pone.0242269.g013]

  • The results show that the proposed DCNN is superior to the other machine learning algorithms in both observations per second and recognition time.

5.4.2 Precision results.

Fig 14 shows the precision of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Fig 14: https://doi.org/10.1371/journal.pone.0242269.g014]

  • The results show that the proposed DCNN is superior to the other machine learning algorithms in precision on the 2nd and 3rd datasets and, together with SVM, obtains the best results on the 1st dataset.

5.4.3 Recall results.

Fig 15 shows the recall of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Fig 15: https://doi.org/10.1371/journal.pone.0242269.g015]

  • The results show that the proposed DCNN is superior to the other machine learning algorithms in terms of recall.

5.4.4 Accuracy results

Fig 16 displays the accuracy of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Fig 16: https://doi.org/10.1371/journal.pone.0242269.g016]

  • The results show that the proposed DCNN is superior to the other machine learning algorithms in terms of accuracy.

5.4.5 Specificity results.

Fig 17 displays the specificity of the proposed system compared with the other algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Fig 17: https://doi.org/10.1371/journal.pone.0242269.g017]

Table 3 shows the average results for precision, recall, accuracy, and specificity of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Table 3: https://doi.org/10.1371/journal.pone.0242269.t003]

Fig 18 displays the data documented in Table 3, representing the average precision, recall, accuracy, and specificity of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

[Fig 18: https://doi.org/10.1371/journal.pone.0242269.g018]

Table 4 compares three of the classifiers with the same classifiers as developed by Jonnathann et al. [ 18 ] on the same dataset, in terms of accuracy rates, without considering feature extraction methods.

[Table 4: https://doi.org/10.1371/journal.pone.0242269.t004]

Fig 19 shows the data documented in Table 4. It is noticeable that the proposed classifiers achieve the highest accuracy with KNN, SVM, and DCNN.

[Fig 19: https://doi.org/10.1371/journal.pone.0242269.g019]

6. Conclusion

FR is a more natural biometric information process than other proposed systems, and it must address more variation than any other method. It can also be cast as a difficult optimization problem, and solving it in a reasonable time requires an efficient optimization method. FR faces many difficulties and challenges in the input image, such as different facial expressions, subjects wearing hats or glasses, and varying brightness levels. This study is based on an adaptive version of a recent DCNN architecture, AlexNet, and proposes a deep FR learning method using TL in fog computing. The proposed DCNN algorithm follows a set of steps to process the face images and obtain the distinctive features of each face; these steps comprise preprocessing, face detection, and feature extraction. The proposed method improves the solution by adjusting the parameters to search for the final optimal solution. In this study, the proposed algorithm and other popular machine learning algorithms, including DT, KNN, and SVM, were tested on three standard benchmark datasets to demonstrate the efficiency and effectiveness of the proposed DCNN in solving the FR problem. These datasets contain various numbers of images of both males and females. The proposed algorithm and the other algorithms were tested on different images from the first dataset, and the results demonstrated the effectiveness of the DCNN algorithm in reaching the best solution (i.e., the best accuracy) with reasonable accuracy, recall, precision, and specificity compared to the other algorithms. At the same time, the proposed DCNN achieved better accuracy than Jonnathann et al. [ 18 ]: the accuracy of the proposed method reached 99.4%, compared with 97.26% for Jonnathann et al. [ 18 ]. Overall, the suggested algorithm yields higher accuracy (99.06%), precision (99.12%), recall (99.07%), and specificity (99.10%) than the comparison algorithms.

Based on the experimental results and performance analysis on various test images (i.e., 30 images), the proposed algorithm can effectively locate an optimal solution within a reasonable time compared with other popular algorithms. In the future, we plan to improve this algorithm in two ways: first, by comparing it with different recent metaheuristic algorithms and testing the methods on the remaining instances from each dataset; and second, by applying it to real-life FR problems in a specific domain.

  • 7. Gamaleldin AM. An introduction to cloud computing concepts. Egypt: Software Engineering Competence Center; 2013.
  • 10. Prakash, R. Meena, N. Thenmoezhi, and M. Gayathri. "Face Recognition with Convolutional Neural Network and Transfer Learning." In 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 861–864. IEEE, 2019.
  • 11. Deng J, Guo J, Xue N, Zafeiriou S, ArcFace: Additive angular margin loss for deep face recognition. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). Long Beach, CA: IEEE; 2019. pp. 4685–4694.
  • 12. Wang H, Wang Y, Zhou Z, Ji X, Gong D, Zhou J, et al., CosFace: Large margin cosine loss for deep face recognition. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. Salt Lake City, UT: IEEE; 2018. pp. 5265–5274.
  • 13. Tran L, Yin X, Liu X, Disentangled representation learning GAN for pose-invariant face recognition. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). Honolulu, HI: IEEE; 2017. pp. 1415–1424.
  • 14. Masi I, Tran AT, Hassner T, Leksut JT, Medioni G. Do we really need to collect millions of faces for effective face recognition? In: Leibe B, Matas J, Sebe N, Welling M, editors. European conference on computer vision (ECCV). Cham, Switzerland: Springer; 2016. pp. 579–596.
  • 19. Yin Y, Liu L, Sun X, SDUMLA-HMT: A multimodal biometric database. In: Chinese conference on biometric recognition. Beijing, China: Springer; 2011. pp. 260–268.
  • 24. Ghazi MM, Ekenel HK, A comprehensive analysis of deep learning based representation for face recognition. In: 2016 IEEE conference on computer vision and pattern recognition workshops (CVPRW). Las Vegas, NV: IEEE; 2016. pp. 102–109.
  • 26. He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, NV: IEEE; 2016. pp. 770–778.
  • 27. Hu J, Shen L, Sun G, Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. Salt Lake City, UT: IEEE; 2018. pp. 7132–7141.
  • 28. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems. Nevada, USA: Curran Associates Inc.; 2012. pp. 1097–1105.
  • 29. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556. 2014.
  • 30. Szegedy C, Wei L, Yangqing J, Sermanet P, Reed S, Anguelov D, et al., Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). Boston, MA: IEEE; 2015. pp. 1–9.
  • 32. Guyon I, Boser BE, Vapnik V. Automatic capacity tuning of very large VC-dimension classifiers. In: Hanson SJ, Cowan JD, Giles CL, editors. Advances in neural information processing systems. San Mateo, CA: Morgan Kaufmann Publishers Inc.; 1993. pp. 147–155.
  • 33. Schölkopf B, Smola AJ. Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press; 2002.
  • 34. Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press; 2000.
  • 40. Nasr-Esfahani E, Samavi S, Karimi N, Soroushmehr SMR, Jafari MH, Ward K, et al., Melanoma detection by analysis of clinical images using convolutional neural network. In: 2016 38th annual international conference of the IEEE engineering in medicine and biology society (EMBC). Orlando, FL: IEEE; 2016. pp. 1373–1376.
  • 41. Pham TC, Luong CM, Visani M, Hoang VD. Deep CNN and data augmentation for skin lesion classification. In: Nguyen NT, Hoang DH, Hong TP, Pham H, Trawiński B, editors. Asian conference on intelligent information and database systems. Dong Hoi City, Vietnam: Springer; 2018. pp. 573–582.
  • 42. Deng J, Dong W, Socher R, Li L, Li K, Li FF, ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Miami, FL: IEEE; 2009. pp. 248–255.
  • 44. Abdul Elminaam DS, Ibrahim SA. Building a robust heart diseases diagnosis intelligent model based on RST using LEM2 and MODLEM2. In: Proceedings of the 32nd International Business Information Management Association Conference (IBIMA 2018): Vision 2020: Sustainable Economic Development and Application of Innovation Management from Regional Expansion to Global Growth. Seville, Spain; 2018. pp. 5733–5744.

Facial Expression Recognition Using Machine Learning and Deep Learning Techniques: A Systematic Review

  • Review Article
  • Published: 13 April 2024
  • Volume 5, article number 432 (2024)


  • M. Mohana   ORCID: orcid.org/0000-0002-3566-0995 1 &
  • P. Subashini 1  


In the contemporary era, Facial Expression Recognition (FER) plays a pivotal role in numerous fields owing to its vast application areas, such as e-learning, healthcare, marketing, and psychology, to name a few examples. Several research studies have been conducted on FER, and many reviews are available. Existing FER review papers have focused on presenting a standard pipeline for FER to predict basic expressions; however, previous studies have not given an adequate amount of importance to FER datasets and their influence on FER system performance. In this systematic review, 105 papers were retrieved from IEEE, ACM, Science Direct, Scopus, Web of Science, and Springer, covering the years 2002 to 2023, following systematic review guidelines. A review protocol and research questions were also developed for the analysis of the study results. The review identified that the accuracy of FER systems is affected by whether datasets contain controlled or spontaneous facial expressions, along with other challenges such as illumination, pose, and scale variation. Furthermore, this paper comparatively analyzes FER models across both machine learning and deep learning techniques, including face detection, pre-processing, handcrafted feature extraction techniques, and emotion classifiers. In addition, we discuss some unresolved issues in FER and suggest solutions to further enhance FER system performance. In the future, multimodal FER systems need to be developed for real-time scenarios, considering the computational efficiency of model performance when integrating more than one model and dataset to achieve promising accuracy and reduce error rates.


Data Availability

No data are available for this article.


Acknowledgements

The authors sincerely thank the ISO Certified (ISO/IEC 20000-1:2018) Centre for Machine Learning and Intelligence (CMLI), funded by the Department of Science and Technology (DST-CURIE), India, for providing the facility to carry out this research study.

This research study received no external funding.

Author information

Authors and affiliations

Centre for Machine Learning and Intelligence, Department of Computer Science, Avinashilingam Institute, Coimbatore, India

M. Mohana & P. Subashini


Corresponding author

Correspondence to M. Mohana.

Ethics declarations

Conflict of interest

The authors declare no potential conflicts of interest concerning the publication of this article.

Declaration of AI and AI-assisted Technologies in the Writing Process

During the preparation of this work, the authors utilized Grammarly assistant tools included in Microsoft Word for grammar checking. After using these tools, the authors reviewed and edited the content as necessary and took full responsibility for the publication's content.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Mohana, M., Subashini, P. Facial Expression Recognition Using Machine Learning and Deep Learning Techniques: A Systematic Review. SN COMPUT. SCI. 5, 432 (2024). https://doi.org/10.1007/s42979-024-02792-7


Received: 02 August 2023

Accepted: 14 March 2024

Published: 13 April 2024

DOI: https://doi.org/10.1007/s42979-024-02792-7

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Keywords

  • Facial Expression Recognition (FER)
  • Machine learning (ML)
  • Deep learning (DL)
  • Face detection
  • Facial emotion


IMAGES

  1. (PDF) Security System using Motion Detection and Face Recognition
  2. PPT
  3. (PDF) Facial Recognition Attendance System Using Python and OpenCv
  4. (PDF) Face Detection and Recognition Using OpenCV
  5. (PDF) A Review of Face Recognition Technology
  6. (PDF) Study of Face Recognition Techniques: A Survey

COMMENTS

  1. (PDF) Face Recognition: A Literature Review

    The task of face recognition has been actively researched in recent years. This paper provides an up-to-date review of major human face recognition research. We first present an overview of face ...

  2. A Review of Face Recognition Technology

Face recognition is a biometric technology based on identifying the facial features of a person: face images are collected, and the recognition equipment processes them automatically. The paper introduces related research on face recognition from different perspectives, describing the development stages and the related technologies of face ...

  3. Human face recognition based on convolutional neural network and

To deal with the issue of human face recognition on a small original dataset, a new approach combining a convolutional neural network (CNN) with an augmented dataset is developed in this paper. The original small dataset is augmented into a large dataset via several transformations of the face images (a minimal sketch of such transformations appears after this list). Based on the augmented face image dataset, the ...

  4. Face recognition: Past, present and future (a review)☆

Below, we give a brief review of the face detection and facial landmarking methods in the literature. Accurate and effective face detection and facial landmarking algorithms increase the accuracy of face recognition systems. Face detection estimates the bounding box of the face in a given image or in the frames of a video.

  5. A review on face recognition systems: recent approaches and ...

Face recognition is an efficient technique and one of the most preferred biometric modalities for the identification and verification of individuals, compared to voice, fingerprint, iris, retina scan, gait, ear, and hand geometry. This has over the years led researchers in both academia and industry to develop several face recognition techniques, making it one of the most ...

  6. Design and Evaluation of a Real-Time Face Recognition System using

In this paper, the design and evaluation of a real-time face recognition system using a Convolutional Neural Network (CNN) are proposed. The initial evaluation of the proposed design is carried out on the standard AT&T dataset, and the design is later extended toward a real-time system.

  7. Face Recognition by Humans and Machines: Three Fundamental Advances

    1. INTRODUCTION. The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks (Jacquet & Champod 2020, Phillips et al. 2018, Taigman et al. 2014).

  8. Past, Present, and Future of Face Recognition: A Review

    Face recognition is one of the most active research fields of computer vision and pattern recognition, with many practical and commercial applications including identification, access control, forensics, and human-computer interactions. However, identifying a face in a crowd raises serious questions about individual freedoms and poses ethical issues. Significant methods, algorithms, approaches ...

  9. A comprehensive study on face recognition: methods and challenges

Pre-processing, Face Detection, Feature Extraction, Optimal Feature Selection, and Classification are the primary steps in any face recognition system. This paper provides a detailed review of each. Feature extraction techniques can be classified as appearance-based methods or geometry-based methods; such methods may be local or global. Feature ...

  10. [2201.02991] A Survey on Face Recognition Systems

    In this paper, some of the most impactful face recognition systems were surveyed. Firstly, the paper gives an overview of a general face recognition system. Secondly, the survey covers various network architectures and training losses that have had a substantial impact. Finally, the paper talks about various databases that are used to evaluate ...

  11. A deep facial recognition system using computational intelligent ...

The whole system comprises three modules, as shown in Fig. 1. First, the face detector is applied to videos or images to detect faces. Then, a landmark detector aligns each face so that it can be normalized and recognized with the best match. Finally, the aligned face images are fed into the FR module.

  12. Sensors

Face recognition is a popular research task in the field of image processing and computer vision, owing to its potentially enormous applications as well as its theoretical value. ... This paper highlights recent research on 2D and 3D face recognition systems, focusing mainly on approaches based on local, holistic (subspace), and hybrid ...

  13. Face Detection Research Paper

Face detectors are trained with 2,500 images of left or right eyes together with snapshots from the negative (non-eye) training sets. Overall, 94 percent true-positive and 13 percent false-positive detections are reported for face detection. Eyes are detected at a rate of 88 percent with only a 1 percent false-positive outcome.

  14. Facial Expression Recognition Using Machine Learning and ...

    In the contemporary era, Facial Expression Recognition (FER) plays a pivotal role in numerous fields due to its vast application areas, such as e-learning, healthcare, marketing, and psychology, to name a few examples. Several research studies have been conducted on FER, and many reviews are available. The existing FER review paper focused on presenting a standard pipeline for FER to predict ...

  15. Face Recognition Smart Attendance System using Deep ...

The face is one of the most broadly used biometrics for human identity authentication. This paper presents a facial recognition attendance system based on deep convolutional neural networks. We utilize transfer learning, taking three pre-trained convolutional neural networks and training them on our data.

  16. Face Detection and Recognition Using OpenCV

Face detection and image or video recognition is a popular subject of biometrics research. Face recognition in a real-time setting is an exciting area and a rapidly growing challenge. The paper proposes a framework for using face recognition for application authentication, based on a PCA (Principal Component Analysis) facial recognition system (see the eigenfaces sketch after this list). Principal component analysis (PCA) is a statistical method ...

  17. PDF AttenFace: A Real Time Attendance System Using Face Recognition

    For example, Truein [13] is a touchless face recognition system used to manage employee attendance in the workplace. iFace [14] provides face recognition capabilities through a mobile app, useful in work-from-home scenarios. There is currently no product in the market aimed at real-time face recognition for attendance capture in schools and ...

  18. PDF Face Recognition System

Face recognition tracks target objects in live video taken with a video camera. In simple words, it is a system for automatically identifying a person from a still image or video frame. In this paper, we propose an automated face recognition system.

  19. Unified Physical-Digital Attack Detection Challenge

    Face Anti-Spoofing (FAS) is crucial to safeguard Face Recognition (FR) Systems. In real-world scenarios, FRs are confronted with both physical and digital attacks. However, existing algorithms often address only one type of attack at a time, which poses significant limitations in real-world scenarios where FR systems face hybrid physical-digital threats. To facilitate the research of Unified ...

  20. (PDF) Face recognition based attendance system using ...

algorithm. Once the system is trained, it can recognize the faces of authorized students in real time. When a student's face is detected by the camera, the system matches the detected face with ...