  • Open access
  • Published: 11 May 2022

Diagnostic accuracy of keystroke dynamics as digital biomarkers for fine motor decline in neuropsychiatric disorders: a systematic review and meta-analysis

  • Hessa Alfalahi 1 , 2 ,
  • Ahsan H. Khandoker 1 , 2 ,
  • Nayeefa Chowdhury 1 ,
  • Dimitrios Iakovakis 3 ,
  • Sofia B. Dias 1 , 2 , 4 ,
  • K. Ray Chaudhuri 5 , 6 &
  • Leontios J. Hadjileontiadis 1 , 2 , 3  

Scientific Reports volume 12, Article number: 7690 (2022)


  • Engineering
  • Neuroscience
  • Signs and symptoms

Timely diagnosis of neuropsychiatric disorders remains an unmet need: it typically takes place years after substantial neural loss and neuroperturbations, affirming the dire need for biomarkers with proven efficacy. In Parkinson’s disease (PD), mild cognitive impairment (MCI), Alzheimer’s disease (AD) and psychiatric disorders, it is difficult to detect early symptoms given their mild nature. We hypothesize that employing fine motor patterns derived from natural interactions with keyboards, also known as keystroke dynamics, could translate classic finger dexterity tests from clinics to populations in-the-wild for timely diagnosis; yet further evidence is required to prove this efficiency. We searched PubMed, Medline, IEEE Xplore, EBSCO and Web of Science for eligible diagnostic accuracy studies employing keystroke dynamics as an index test for the detection of neuropsychiatric disorders as the main target condition. We evaluated the diagnostic performance of keystroke dynamics across 41 studies published between 2014 and March 2022, comprising 3791 PD patients, 254 MCI patients, and 374 psychiatric disease patients. Of these, 25 studies were included in univariate random-effects meta-analysis models for diagnostic performance assessment. Pooled sensitivity and specificity are 0.86 (95% confidence interval (CI) 0.82–0.90, I² = 79.49%) and 0.83 (95% CI 0.79–0.87, I² = 83.45%) for PD, 0.83 (95% CI 0.65–1.00, I² = 79.10%) and 0.87 (95% CI 0.80–0.93, I² = 0%) for psychomotor impairment, and 0.85 (95% CI 0.74–0.96, I² = 50.39%) and 0.82 (95% CI 0.70–0.94, I² = 87.73%) for MCI and early AD, respectively. Our subgroup analyses conveyed the diagnostic efficiency of keystroke dynamics for naturalistic self-reported data, and the promising performance of multimodal analysis of naturalistic behavioral data and deep learning methods in detecting disease-induced phenotypes. 
The meta-regression models showed the increase in diagnostic accuracy and fine motor impairment severity index with age and disease duration for PD and MCI. The risk of bias, based on the QUADAS-2 tool, is deemed low to moderate and overall, we rated the quality of evidence to be moderate. We conveyed the feasibility of keystroke dynamics as digital biomarkers for fine motor decline in naturalistic environments. Future work to evaluate their performance for longitudinal disease monitoring and therapeutic implications is yet to be performed. We eventually propose a partnership strategy based on a “co-creation” approach that stems from mechanistic explanations of patients’ characteristics derived from data obtained in-clinics and under ecologically valid settings. The protocol of this systematic review and meta-analysis is registered in PROSPERO; identifier CRD42021278707. The presented work is supported by the KU-KAIST joint research center.

Introduction

Motor abnormalities, a transdiagnostic domain of an array of neurological and psychiatric disorders that begin years if not decades before clinical diagnosis 1 , stem from perturbed brain networks involving cognitive, emotional and motor domains 2 , 3 . Despite their well-established neurobiological mechanisms and clinical criteria, early diagnosis remains a devastating obstacle against effective, disease-modifying treatment and sustained quality of life. In fact, the progression of motor symptoms to warrant clinical diagnosis usually occurs after substantial neural loss in neurodegenerative disorders, and at advanced stages of psychiatric disorders. In the case of Parkinson’s Disease (PD), for instance, the hallmark symptoms of bradykinesia, rigidity and tremor are detected after a neural loss of at least 50% 4 , rendering clinical diagnosis accuracy unsatisfactory at early stages as per a recent meta-analysis 5 . In addition, Alzheimer’s disease (AD) is preceded by a mild cognitive impairment (MCI) stage, characterized by a decline in memory and executive functions that is hardly distinguishable from normal aging, but with pronounced impact on the activities of daily life 6 . In psychiatric disorders, the descriptive nature of clinical scales lacks sensitivity to subtle psychomotor symptoms, either in early or remission stages, resulting in a median delay in diagnosis of 14 years after disease onset 7 . Generally, these diseases, affecting the frontal cortical and subcortical circuits are characterized with executive dysfunction that begins years before diagnosis 1 , entailing the need for dimensional, fine-grained behavioral measures, thereby alleviating the “floor-ceiling” effect associated with qualitative clinical scales as well as the inter- and intra-rater diagnosis variability.

According to the scientific vision (2025) of the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) of the National Institute of Health (NIH) 8 , and the Research Domain Criteria (RDoC) of National Institute of Mental Health (NIMH) 9 , automated behavioral quantification, analysis and classification are a crucial start to high-throughput readout of brain activity, whose impact is envisioned to facilitate breakthroughs in early identification and disease management in both neurology and psychiatry. Concurrent with the ever-increasing interest in behavioral measures, is the lack of hypothesis-supported behavioral experiments 10 . The latter require not only experimental design, but also robust computational and analysis methodologies, supported by clinical ground truth and neurobiological theories. With the booming of smartphones in recent years, keyboard typing became an everyday habit, reflecting unique behavioral profile for every user 11 . We hypothesize that the kinetic movement of fingers during keyboard/touchscreen typing embeds features related to subtle decline in motor sequencing and force steadiness 12 . These are referred to as Neurological Soft Signs (NSS), sub-clinical motor abnormalities that can serve as early “warning signs” of brain dysfunction, and additional clinical evaluation remains essential for precise diagnosis 13 , 14 , 15 .

Besides the passive acquisition of user-device interactions, the intricate Artificial Intelligence (AI) and Machine Learning (ML) methods allowed the definition of new disease-related features 16 , 17 , resulting in a new class of digital biomarkers, that of keystroke dynamics. We found that the latter provide a rich space of the assessment parameters, similar to that of finger tapping tests that quantitatively score the frequency and speed of tapping in clinical settings, either in single or alternating fashion 13 . Therefore, employing keystroke dynamics for fine motor analysis facilitates a paradigm shift from conventional, subjective diagnosis to objective, in-the-wild assessment. As opposed to other papers in the area of digital phenotyping that provide an overview of an “island of experts”, we hereby concentrate on a specific digital biomarker class with plausible connection to neurobiological mechanisms and clinical workflow. In fact, keystroke dynamics were used for PD and MCI, yet, and to our best knowledge, no systematic reviews and/or meta-analysis attempted to convey their diagnostic potential or their clinical significance for identifying patterns with plausible connections to disease characteristics.

In this systematic review and meta-analysis, we aimed to appraise the diagnostic performance of keystroke dynamics for an array of neurological and psychiatric disorders. Moreover, we sought to assess the impact of data collection settings, labeling methods, and model characteristics on the diagnostic performance, with emphasis on clinical relevance and ecological validity. In the meta-analysis, we provided a quantitative evaluation of the keystroke dynamics diagnosis of PD, MCI and psychiatric disorders independently, to convey their reproducibility and clinical impact. More importantly, we performed regression analysis, to convey the relationship between patients’ demographic and clinical characteristics with the diagnostic potentiality of keystroke dynamics, as well as the derived fine motor impairment index. Lastly, due to the immature progress of this area towards clinical adoption, we cast-in-concrete a detailed, multidisciplinary agenda for all stakeholders involved in the digital biomarker research, and open an avenue to multidisciplinary intervention and care delivery in neurology and psychiatry.

Our search identified 9576 results, of which 4365 were removed as duplicates and 4045 were excluded by automation tools, as illustrated by the PRISMA 2020 flowchart in Fig. 1. We therefore screened the titles and abstracts of 1166 articles and identified 1120 as not meeting our eligibility criteria. Thirty-nine (39) full eligible articles were screened, and from their lists of references we identified seven more articles that met our eligibility criteria. Of the resulting 46 articles, five full articles, listed in the supplementary file (p. 7), were excluded. At the end of our systematic search, we retained 41 full articles, of which 25 reported sufficient data to be included in the meta-analysis. Overall, 25 studies targeted PD, ten targeted mood disorders, and six targeted mild cognitive impairment and AD. The characteristics of the included studies are summarized in Table 1 and are discussed in the following section.

figure 1

PRISMA 2020 flow diagram for study selection.

Characteristics of included studies

Of the 41 included studies, we identified 25 on PD with 3791 patients of whom 33.9% were female, six on MCI and early AD with 254 patients of whom 52.4% were female, and ten on psychiatric disorders with 374 patients of whom 56.0% were female (not all studies reported gender information). Regardless of the target condition and the data collection setting (in-the-clinic, in-the-wild), typing patterns, or keystroke dynamics, are always passively collected as series of time stamps of consecutive key presses and releases. The derived kinematic parameters are then used for motor behavior pattern analysis.

Of the PD studies, 12 were conducted in-the-clinic 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , while 13 were conducted in-the-wild 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 . The earliest studies, that were mainly on PD, collected data in clinical settings, and attempted to correlate the extracted keystroke dynamic features to the Unified Parkinson’s Disease Rating Scale Part III (UPDRS-III) score, which is currently the gold standard for PD diagnosis 43 . On this basis, PD patients were found to have longer inter-key delay, also known as Flight Time (FT), smaller number of total taps (over a fixed tapping duration), and shorter total distance of finger movement compared to controls 22 . Keystroke dynamics analysis also showed that PD patients are characterized by arrhythmokinesia, that is, hastening or freezing in the typing kinetics 44 , as well as heteroscedasticity or dispersion of FT 24 .

Owing to the establishment of reproducible digital biomarkers on the basis of keyboard interaction patterns, the neuroQWERTY index, for example, was estimated using an ensemble regression model that digests variance and histogram features extracted from 90 s windows of the hold time (HT) series obtained from early stage PD patients 18 . The HT, which is the time required for pressing and releasing a key, was particularly employed in early studies given that it is neither affected by the typing skill nor by conscious control. Consequently, the numerical index derived from it, neuroQWERTY, did not only discriminate early-stage PD patients from controls, but also de novo PD patients, reflecting its high sensitivity to subtle motor changes. Besides the HT, the flight time (FT), the latency between releasing a key and pressing the next one was analyzed in 24 to test the hypothesis that PD patients are characterized by higher dispersion and temporal variability compared to controls. The analysis of the typing patterns of PD patients through the neuroQWERTY keyboard revealed their slower fine-motor kinetics as well 40 . Compared to the Alternating Finger Tapping (AFT) test, employing skewness, kurtosis and covariance features of the FT distribution resulted in a higher diagnosis accuracy, meaning that the typing patterns embed specific irregularities of PD motor symptoms, mainly attributed to rigidity and bradykinesia. In an effort to enrich the feature space of keystroke dynamics, Iakovakis et al. 19 developed a two-stage machine learning model based on low- and high-order statistical features derived from the HT, Normalized FT and Normalized Pressure. Their results were consistent with earlier studies, and showed higher and more variable HT, lower pressure and high FT skewness. 
While these features were significantly correlated to the motor sub-scores of the UPDRS-III, correlating the outcome of such standardized clinical scales, which encompass a mixture of symptoms not related to fine motor impairments, to the typing behavior might be misleading. Taking this into consideration and with the objective of enhancing the interpretability of the fine-grained indicators, Iakovakis et al. 32 analyzed keystroke dynamics with single items scores of the UPDRS Part III, in order to create a plausible connection between the typing behavior and fine motor impairment symptoms. Employing typing kinetics features as independent variables, the UPDRS single items that correspond to the severity of Bradykinesia, Tremor, Rigidity, and AFT were estimated. The regression results indicated that dominant hand rigidity and bradykinesia were estimated with lower error compared to tremor, meaning that the effect of the latter is less pronounced from the typing cadence.
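The timing features discussed above can be illustrated with a minimal sketch. It assumes each keystroke is recorded as a (press time, release time) pair in seconds; the function names `hold_and_flight_times` and `moments` are hypothetical and are not taken from neuroQWERTY or any of the reviewed studies. It computes the HT and FT series and the low- and high-order statistics (variance, skewness, kurtosis) that several studies feed to their classifiers.

```python
def hold_and_flight_times(events):
    """Derive hold times (HT) and flight times (FT) from keystroke events.

    events: list of (press_time, release_time) tuples in seconds, in typing
    order. This input layout is an assumption made for illustration.
    """
    # HT: time between pressing and releasing the same key.
    hold_times = [release - press for press, release in events]
    # FT: latency between releasing one key and pressing the next one.
    # It can be negative when the next key is pressed before the release.
    flight_times = [
        next_press - prev_release
        for (_, prev_release), (next_press, _) in zip(events, events[1:])
    ]
    return hold_times, flight_times


def moments(values):
    """Mean, population variance, skewness and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return {"mean": mean, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    skew = sum((v - mean) ** 3 for v in values) / n / var ** 1.5
    kurt = sum((v - mean) ** 4 for v in values) / n / var ** 2
    return {"mean": mean, "variance": var, "skewness": skew, "kurtosis": kurt}


# Hypothetical three-key sequence, not data from any reviewed study.
example_events = [(0.00, 0.10), (0.25, 0.37), (0.60, 0.70)]
ht, ft = hold_and_flight_times(example_events)
ft_features = moments(ft)
```

In practice such features would be computed per time window (for example the 90 s windows mentioned above) before being passed to a regression or classification model.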

Furthermore, the “transferability” of typing patterns-based models developed and tested on clinically validated data to naturalistic, quasi-continuous, self-reported data from daily interaction with keyboards was evaluated. While the models achieved higher diagnostic performance in clinical settings, they still show high potentiality for real-life detection of disease-induced abnormal behavior 32 , 33 , 36 , 37 , 38 . Moreover, exploiting the passive nature of typing data acquisition, data from 970 PD patients, part of the mPower database 45 , facilitated the detection of early motor decline through Support Vector Machine and Random Forests 42 . Similarly, using the mPower database, unsupervised clustering of smartphone tapping data was used to discriminate the severity of motor symptoms in PD 41 .

Given that amalgamating multiple data streams in one model boosts its diagnostic accuracy, multimodal analysis has been uniquely adopted by Papadopolous et al . 36 in order to achieve symptom-specific detection, wherein accelerometer data are used to yield a tremor estimation index, while the typing behavior is leveraged for estimating fine motor impairment. Besides diagnosis, five longitudinal clinical studies investigated medication response 27 , 30 , 31 , 34 , 39 with the longest follow up being 36 months 30 . For instance, typing behavior has been utilized to detect longitudinal disease phenotype to uncover short- and long-term variations in the motor behavior profile of PD patients as in 31 . This was achieved by the definition of reliable parameters, such as the progression ratio and the steady state ratio, derived by comparisons between motor behavior across consecutive time windows. While this aspect is still in its infancy, Matarazzo et al. 34 showed promising results in detecting response to levodopa using recurrent neural networks. This implication, in turn, suggests that deep learning is a robust predictive model in biomarkers research, and is therefore being used in five clinical studies 21 , 34 , 36 , 37 , 38 .

Besides PD, growing evidence from studies targeting imaging biomarkers suggests that the accumulation of amyloid β starts up to 20 years before the manifestation of clinical symptoms of Alzheimer’s disease (AD), and that it is detected in one third of the clinically normal elderly population 46 . Whether this population will convert to AD, and over what time frame, remains elusive, entailing the search for quantitative assessments during this preclinical stage. AD is in fact preceded by mild cognitive impairment (MCI), an intermediate stage characterized by subtle deficits in memory, lexical and information processing, besides sensory and motor abnormalities 47 , 48 . In particular, fine motor impairment has been linked to functional loss at the MCI stage, and specifically compromises the performance of daily life activities. Six studies were identified on MCI and AD 49 , 50 , 51 , 52 , 53 , 54 . The validity of utilizing typing kinetics as biomarkers for early-stage cognitive decline came about after the pioneering experimental trials that attempted to replicate finger dexterity tests in naturalistic environments. Specifically, the inter-keystroke interval, which is the FT, showed promising performance in the cognitive assessment of the elderly population 55 . This is particularly linked to breakdowns in attentional control and short-term memory, which constitute two key domains of time-reproduction tasks, such as typing. On this basis, increased latency variability and slower performance were observed in MCI and dementia patients, as compared to age-matched healthy participants 54 . Similarly, capturing computer-use profiles, including mouse and keyboard interactions, successfully discriminated MCI patients from age-matched healthy controls 52 .

Interestingly, the multi-domain dysfunction of the prefrontal cortex motivated the development of multimodal assessment methods to validate the co-existence of motor and cognitive impairment. The sharp degradation of lexical processing and syntactic complexity is reflected in MCI-specific language characteristics, including increased verb and pronoun rates and a decreased noun rate. To this end, Vizer and colleagues combined keystroke timing features, including the HT and the pause rate, with linguistic features collected in clinical settings to distinguish PreMCI subjects from age-matched healthy controls 49 . Taking the analysis a step further, with the advancement in Natural Language Processing (NLP) and the capability of capturing objective linguistic features that are usually not recognized by human raters, Ntracha et al. employed NLP of Spontaneous Written Speech (SWS), fused with keystroke dynamics features captured in-the-wild, to reinforce the interplay of cognitive and fine motor functions 50 . Furthermore, the pronounced advancement in computational modeling now allows aligning multiple data streams, which facilitated the development of “behaviorgrams” that capture activity levels and physiological and behavioral signals on a longitudinal basis, yielding a more comprehensive overview of individuals’ health, yet without solid interpretability of longitudinal transient behavior 51 .

Of the ten studies targeting psychiatric disorders, seven were on bipolar disorder 56 , 57 , 58 , 59 , 60 , 61 , 62 , one on idiopathic REM sleep behavior disorder 23 , one on depression 63 and one on sleep-induced psychomotor impairment 64 . All these studies were conducted in-the-wild except the one on REM sleep behavior disorder 23 . Mental and psychiatric disorders, with major depression being the most prevalent, are the leading cause of the disease burden worldwide, accounting for 32.4% of years lived with disability 65 , and substantially contributing to health loss across the lifespan 66 . The underlying mechanisms of depression include dopaminergic, noradrenergic and serotonergic disturbances along with inflammatory and psychosocial factors 67 . Depression has therefore been identified as an epiphenomenon in PD, MCI and AD patients, and has been linked to higher prevalence of neurodegeneration. As per the recommendations of the National Institute of Mental Health (NIMH), deep phenotyping of disease mechanisms at multiple analysis levels, including genetic, neural, and behavioral levels, is key for early diagnosis and monitoring 9 .

From pathological and clinical perspectives, psychomotor perturbation is a well-defined criterion of manic and depressive states 68 . Stemming from this, keyboard interaction patterns, along with accelerometer data and backspace and autocorrect rates, were used as predictor variables in a linear mixed-effects model to estimate Hamilton Depression Rating Scale (HDRS) and Young Mania Rating Scale (YMRS) scores of bipolar disorder patients 56 . Besides the psychomotor slowing indicated by the longer FT, the analysis of typing metadata, including autocorrect and backspace rates, reflects the cognitive states associated with depressive and manic states. For instance, the high autocorrect rate associated with depressive states reflects the degree of concentration impairment. In contrast, the high backspace rate during manic states is associated with deteriorated error-response inhibition. Circadian rhythm, depression severity, and age also have a profound impact on the typing kinetics 58 . In this vein, the analysis of typing kinetics, along with the clinical scores, facilitated the prediction of brain age and revealed that the predicted age of bipolar disorder patients is higher than their actual age, compared to healthy controls, reflecting a marker of brain pathology 62 . Moreover, keystroke dynamics predict cognitive decline, diminished visual attention, reduced processing speed and task switching in bipolar disorder patients 61 .

Beside these approaches, employing machine learning methods such as random forests yielded high discriminatory performance between mildly and severely depressed patients, and controls, from typing data collected in-the-wild 63 . Considering the impact of individual’s unique typing style and the circadian rhythm, stacking convolutional neural networks that detect personalized features, along with recurrent neural networks that learn the dynamic patterns, resulted in personalized mood detection 59 , 60 . Taking the analysis a leap forward, leveraging passively acquired keystroke dynamics with day-to-day ecological momentary assessment for mood prediction suggested that higher mood instability, inferred from the self-reports and the typing kinetics are highly predictive of worsening depressive and manic symptoms 57 . They also showed that continuous monitoring for up to seven days is sufficient for accurate symptom prediction using multilevel statistical analysis. The longest follow up period among studies targeting psychiatric disorders was eight weeks 56 .

Diagnostic potentiality of keystroke dynamics

Twenty-five (25) independent studies were included in the meta-analysis, given that the symmetry condition of the funnel plots is respected (Figs. S1–S4). Whenever possible, if one study formulated multiple models, we treated them independently; their specific characteristics are reported in Supplementary Table 7. We identified 29 independent models for the diagnosis of PD on the basis of keystroke dynamics. Pooled AUC and accuracy of keystroke dynamics classification methods for PD were 0.85 (95% confidence interval (CI) 0.83–0.88, I² = 94.04%) and 0.82 (95% CI 0.78–0.86, I² = 71.55%), respectively. In addition, pooled sensitivity and specificity were 0.86 (95% CI 0.82–0.90, I² = 79.49%) and 0.83 (95% CI 0.79–0.87, I² = 83.45%), as shown in Fig. 2 a–d. For MCI and AD (Fig. 3 a–d) we found ten independent classification models, except for the study of 51 that only reported AUC for their three models. The pooled AUC and accuracy were 0.84 (95% CI 0.78–0.90, I² = 87.43%) and 0.82 (95% CI 0.74–0.89, I² = 72.63%), respectively. Pooled sensitivity and specificity for the same category were 0.85 (95% CI 0.74–0.96, I² = 50.39%) and 0.82 (95% CI 0.70–0.94, I² = 87.73%). We identified four independent models for psychiatric diseases, with 59 only reporting accuracy. Pooled AUC and accuracy for psychomotor impairment were 0.90 (95% CI 0.82–0.97, I² = 0%) and 0.89 (95% CI 0.83–0.95, I² = 35.56%). Pooled sensitivity and specificity for psychomotor impairment were 0.83 (95% CI 0.65–1.00, I² = 79.10%) and 0.87 (95% CI 0.80–0.93, I² = 0%), as shown in Fig. 4 a–d. More importantly, the non-significance inferred by the sensitivity analysis for every disease category reveals the consistency of the reported diagnostic accuracy for all pooled measures.
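To make the pooled estimates and the I² heterogeneity statistic concrete, the following is a minimal sketch of univariate random-effects pooling. It assumes the DerSimonian-Laird estimator of between-study variance, which is a common choice but is not stated in this section, so it should not be read as the exact procedure used here. The function name `pool_random_effects` and the example inputs are hypothetical.

```python
import math


def pool_random_effects(estimates, variances):
    """Pool per-study estimates with a DerSimonian-Laird random-effects model.

    estimates: per-study values (e.g. sensitivities).
    variances: their within-study variances (for a proportion p from n
               subjects, p * (1 - p) / n is one common approximation).
    Returns (pooled, ci_low, ci_high, i_squared_percent).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance tau^2, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, i2


# Hypothetical study-level sensitivities and variances, for illustration only.
pooled, low, high, i2 = pool_random_effects(
    [0.90, 0.80, 0.85, 0.75], [0.002, 0.003, 0.001, 0.004]
)
```

A high I², as reported for several of the pooled measures above, indicates that most of the observed spread reflects between-study differences rather than sampling error, which is what motivates the subgroup analyses.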

figure 2

( a ) Pooled AUC with 95% CI for PD studies. ( b ) Pooled accuracy with 95% CI for PD studies. ( c ) Pooled sensitivity with 95% CI for PD studies. ( d ) Pooled specificity with 95% CI for PD studies.

figure 3

( a ) Pooled AUC with 95% CI for MCI studies. ( b ) Pooled accuracy with 95% CI for MCI studies. ( c ) Pooled sensitivity with 95% CI for MCI studies. ( d ) Pooled specificity with 95% CI for MCI studies.

figure 4

( a ) Pooled AUC with 95% CI for psychiatric disorder studies. ( b ) Pooled accuracy with 95% CI for psychiatric disorder studies. ( c ) Pooled sensitivity with 95% CI for psychiatric disorder studies. ( d ) Pooled specificity with 95% CI for psychiatric disorder studies.

Assessment of experimental design on diagnostic performance

To identify the sources of heterogeneity among the included studies, we conducted multiple subgroup analyses. Comparing the performance of the diagnostic models on data captured in-the-clinic with data captured in-the-wild revealed that the AUC (p = 0.007) and the accuracy (p = 0.032) were significantly higher under clinical settings. The AUC and the accuracy for the data captured in-the-clinic were 0.89 (95% CI = 0.86–0.91, I² = 87.15%, n = 21) and 0.87 (95% CI = 0.83–0.90, I² = 62.33%, n = 18), respectively. The same measures for data captured in-the-wild were 0.82 (95% CI = 0.79–0.84, I² = 74.02%, n = 21) and 0.81 (95% CI = 0.77–0.85, I² = 71.82%, n = 17), respectively. In terms of sensitivity and specificity, the pooled sensitivity did not differ significantly between the two settings (p = 0.903), while the specificity was significantly higher for data captured in-the-clinic (p = 0.032). These metrics for data captured in-the-clinic were 0.85 (95% CI = 0.80–0.99, I² = 82.06%, n = 20) and 0.87 (95% CI = 0.83–0.91, I² = 81.81%, n = 18). For data captured in-the-wild, pooled sensitivity and specificity were 0.85 (95% CI = 0.79–0.90, I² = 52.55%, n = 14) and 0.79 (95% CI = 0.73–0.85, I² = 68.96%, n = 16). Similarly, the AUC (p = 0.004), the accuracy (p = 0.013), and the specificity (p = 0.002) were significantly higher for clinically validated databases than for typing data labeled by self-reports. For the former, the AUC, accuracy, pooled sensitivity and specificity were 0.86 (95% CI = 0.83–0.89, I² = 86.41%, n = 31), 0.86 (95% CI = 0.83–0.89, I² = 64.17%, n = 26), 0.86 (95% CI = 0.81–0.90, I² = 78.77%, n = 21) and 0.87 (95% CI = 0.84–0.91, I² = 73.41%, n = 22). On the other hand, these metrics for the self-reported data were 0.78 (95% CI = 0.73–0.84, I² = 0.00%, n = 6), 0.79 (95% CI = 0.74–0.83, I² = 29.24%, n = 8), 0.83 (95% CI = 0.76–0.90, I² = 50.11%, n = 9) and 0.76 (95% CI = 0.69–0.82, I² = 77.43%, n = 11).

From a methodological point of view, we found no statistically significant difference in pooled AUC (p = 0.525) or sensitivity (p = 0.074) between unimodal and multimodal analysis methods. The specificity (p = 0.042) and the accuracy (p = 0.022), however, were significantly higher for multimodal analysis. Pooled AUC, accuracy, sensitivity and specificity for multimodal analysis were 0.83 (95% CI = 0.77–0.90, I² = 90.83%, n = 9), 0.87 (95% CI = 0.83–0.91, I² = 66.24%, n = 11), 0.89 (95% CI = 0.84–0.94, I² = 45.77%, n = 7) and 0.87 (95% CI = 0.79–0.95, I² = 90.19%, n = 8). The same measures for unimodal analysis were 0.86 (95% CI = 0.84–0.89, I² = 76.42%, n = 31), 0.82 (95% CI = 0.78–0.85, I² = 63.98%, n = 25), 0.84 (95% CI = 0.79–0.89, I² = 76.41%, n = 22) and 0.80 (95% CI = 0.76–0.84, I² = 68.23%, n = 17).

Comparing the performance of ML classifiers and deep learning methods, the sensitivity was significantly higher for deep learning classifiers (p = 0.029) than for linear machine learning methods, while the AUC (p = 0.859), accuracy (p = 0.299), and specificity (p = 0.882) showed no significant difference. The pooled AUC, accuracy, sensitivity and specificity for machine learning classifiers were 0.86 (95% CI = 0.83–0.88, I² = 75.29%, n = 33), 0.83 (95% CI = 0.80–0.87, I² = 66.49%, n = 26), 0.82 (95% CI = 0.78–0.86, I² = 71.49%, n = 24) and 0.83 (95% CI = 0.78–0.87, I² = 75.82%, n = 26), respectively. For deep learning, the pooled measures were 0.86 (95% CI = 0.79–0.94, I² = 86.77%, n = 7), 0.86 (95% CI = 0.81–0.91, I² = 51.77%, n = 8), 0.89 (95% CI = 0.83–0.96, I² = 44.25%, n = 9) and 0.83 (95% CI = 0.76–0.91, I² = 80.37%, n = 9). Figure 5 shows scatter-bar plots of the subgroup analysis results; forest plot representations can be found in Figs. S5–S20.
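One simple way to compare two subgroup estimates is a two-sided z-test on their difference, with each standard error recovered from the width of the reported 95% CI. The sketch below illustrates that idea only. It is an assumption, not the test used for the p-values above, and it is not expected to reproduce them. The function names and the example inputs are hypothetical.

```python
import math


def se_from_ci(ci_low, ci_high, z_crit=1.96):
    """Approximate a standard error from a symmetric 95% confidence interval."""
    return (ci_high - ci_low) / (2.0 * z_crit)


def subgroup_difference_p(est_a, ci_a, est_b, ci_b):
    """Two-sided z-test for the difference between two pooled estimates.

    Treats the two subgroups as independent and normally distributed,
    which is a simplifying assumption.
    Returns (z, p_value).
    """
    se_a = se_from_ci(*ci_a)
    se_b = se_from_ci(*ci_b)
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical pooled estimates for two subgroups, for illustration only.
z, p = subgroup_difference_p(0.88, (0.84, 0.92), 0.80, (0.75, 0.85))
```

For two subgroups this is equivalent to a between-subgroup Q test with one degree of freedom; with more than two subgroups a Q-based test would be needed instead.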

figure 5

Scatter–Bar plots for the Subgroup Analysis results for ( a ) data collected in-the-clinic vs. data collected in-the-wild, ( b ) clinically validated data vs. self-reported data, ( c ) multimodal analysis vs. unimodal analysis and ( d ) deep learning vs. other machine learning classifiers. The dots represent the individual studies and the height of the bars corresponds to the outcome of the random effects meta-analysis model with 95% CI. ** denotes p < 0.005 and * denotes p < 0.05.

Association of diagnostic performance with age and disease duration

We hypothesize that patients’ demographics and clinical characteristics affect the diagnostic potentiality of keystroke dynamics. To this end, we performed multiple linear regression analyses to convey the influence of age, disease duration, and medication on the diagnostic potentiality of keystroke dynamics. We also pooled fine motor impairment indexes, mainly related to bradykinesia, to investigate the influence of disease stage on the estimated motor impairment severity. Due to the unavailability of sufficient data for MCI and psychiatric disorder studies, we were mainly able to perform the regression analysis for PD diagnosis. Figure 6 a shows the relationship between PD patients’ age and disease duration (years from diagnosis). The figure intuitively suggests that PD disease duration increases with age, and the relationship between the two is statistically significant (p = 0.013) as inferred from the regression analysis. Accordingly, adjusting for disease duration, we analyzed its relationship with diagnostic AUC as represented in Fig. 6 b. The regression analysis yielded a statistically significant increase in AUC with disease duration (p = 0.005), reflecting the progression of fine motor impairment in PD patients. Next, using the same data, we investigated the relationship between AUC and disease duration when de novo PD patients are compared to early PD patients taking levodopa ( l -Dopa). Interestingly, when a linear fit is applied to each group, the higher slope associated with the de novo PD patients, compared to that of early PD patients on l -Dopa, indicates that although the diagnostic AUC of de novo patients is lower, the evolution of the AUC with disease progression for this patient category is more pronounced, mainly during the first three years after diagnosis, than that of early, medicated PD (Fig. 6 c). This may also suggest a sharper decline in fine motor skills at this stage, resulting in a clear improvement in the diagnostic AUC. This is in line with recent evidence suggesting an exponential neurodegeneration pattern of the Substantia Nigra pars compacta, parallel to a sharper decline in motor skills in early PD 69 . Besides the diagnostic performance, we sought to investigate the association between PD disease duration and the severity of fine motor symptoms. We pooled the fine motor impairment index derived from the HT as an estimation of bradykinesia, as it was reported by multiple studies with sufficient data. However, not all studies reported the fine motor impairment index derived from the HT. Figure 6 d depicts the significant correlation (p = 0.010) between the disease duration and the fine motor impairment index.

figure 6

Evaluation of the impact of patients’ age and disease duration on the diagnostic performance of keystroke dynamics, represented by the AUC. (a) Regression analysis results of PD patients’ age and years from diagnosis (disease duration). (b) Regression analysis results of PD studies reporting diagnostic AUC and disease duration, revealing their significant association. (c) Pooled AUC of de novo PD patients (blue) and early PD patients on L-Dopa (orange), depicting the sharper increase in AUC with disease duration for de novo PD patients compared to early, medicated PD patients. (d) Regression analysis results of the fine motor impairment index derived from the HT and the disease duration. (e) Regression analysis results of MCI patients’ age and diagnosis AUC.

Although the included studies on MCI were few compared to those targeting PD, we were able to perform a regression analysis of the relationship between diagnostic performance and patients’ age. As represented in Fig. 6e, the diagnostic AUC of MCI based on fine motor skills inferred from keystroke dynamics increases significantly with age (p = 0.017). The full regression results are reported in Supplementary Tables 9–12.

Evaluation of between-study heterogeneity and bias risk

The large between-study heterogeneity made it problematic to combine data from multiple studies into a representative effect size for diagnostic performance. For this reason, we decided to pool four diagnostic metrics via univariate random-effect meta-analysis models. Consequently, we assume that the pooled diagnostic metrics of PD, MCI, and psychiatric disorders, as well as the subgroup analysis results, are adequate to convey the diagnostic potential of keystroke dynamics models and the impact of study characteristics, namely data collection settings, labeling methods, and modeling characteristics. Hence, we group the studies based on the desired outcome and assume that, despite the heterogeneity, the pooled outcome contributes to the evidence. For instance, when evaluating the diagnostic performance for each disease category, the heterogeneity stems from between-study differences in experimental design and model characteristics; however, when we group the studies based on experimental characteristics regardless of disease category, we attribute the heterogeneity to patients’ characteristics and other experimental design aspects that are not under investigation. Furthermore, we reinforce our findings from the global performance of studies with evidence from methodological perspectives. Nonetheless, we still caution against overinterpretation.

Figure 7 shows a graphical representation of the risk of bias of the included studies, and the per-study risk of bias assessment is reported in Supplementary Table 6. Given that we target diagnostic accuracy, the included studies are case–control studies with a priori labeled diseased and healthy participants. We consider studies that labeled participants using self-reports without clinical evaluation to be at high risk of bias, because participants’ honesty, recall bias and unawareness of their medical conditions influence the correctness of the labels. Furthermore, most studies did not assess the appropriateness of the sample size; we therefore deemed this item to be of unclear risk of bias for most studies, except for 12 studies that aimed at enlarging the sample pool, mainly by collecting data outside clinics. Besides, we deemed all studies that performed independent clinical evaluation and keystroke dynamics analysis (outcome assessment blindness) to be of low risk of bias, except 30 , 39 , which performed finger dexterity and clinical evaluation of PD without blinding. All studies were characterized by low risk of bias with respect to the timing of ground-truth labeling and data collection. Most studies are of unclear risk of bias in terms of selective reporting. Overall, we deemed the risk of bias to be low to moderate, and the quality of evidence, as inferred by the GRADE tool, to be moderate to high, as illustrated in Supplementary Table 8.

figure 7

Risk of bias assessment.

To the best of our knowledge, this is the first systematic review and meta-analysis that provides a focused overview of the clinically relevant diagnostic performance of keystroke dynamics, their ecological validity and their association with patients’ demographics and clinical characteristics. We found that most studies target PD, given its hallmark motor symptoms; however, there are now multiple studies dedicated to the assessment of motor perturbation in MCI, bipolar disorder, and depression. The diagnostic accuracy revealed by our meta-analysis reflects, for the first time, the reproducibility of keystroke dynamics models in the assessment of multiple disorders with neurologically defined fine motor impairment. Besides the three disease categories reviewed in this paper, researchers are currently employing keystroke dynamics for Multiple Sclerosis 70 , 71 and Huntington’s disease 72 . We therefore conclude that keystroke dynamics obtained passively from natural interactions with keyboards can be relied on to detect fine motor impairments induced by early stage neurological and/or psychiatric disorders. Despite this diagnostic performance, several experimental and analysis deficiencies need to be discussed to mitigate between-study heterogeneity and pave the way for future research. For clinical adoption of this technology, we propose a partnership strategy based on a “co-creation” approach that stems from mechanistic explanations of patients’ characteristics derived from data obtained in clinics and under ecologically valid settings. It is the multi-level analysis of patients’ data at the genetic, organ, and behavioral levels that will be at the center of the translational paradigm towards precision medicine when heterogeneous brain disorders are considered.

While computer/smartphone interaction behavior outperformed clinical gold standards such as the AFT and the single finger tapping tests in detecting specific fine motor symptoms of PD patients, the transition from highly controlled assessment in-the-clinic to naturalistic, real-life assessment models should be approached with caution 73 , 74 , 75 . From the sampling perspective, home-based data collection usually results in highly sparse bursts of unpredictable typing activity that are highly sensitive to real-life contexts, emotional burden and diurnal patterns. This is in line with the higher discriminatory performance of the models on data captured in-the-clinic and labeled by clinical assessment, as shown by our subgroup analysis. Therefore, to establish robust detection models for diagnosis outside clinics, integrating multiple latent domains or confounders and defining multiple predictor parameters, such as emotions, activity levels, and sleep patterns, is a particularly interesting avenue for future research to enhance ecological validity 76 . Such integrated frameworks might therefore capture the heterogeneous neuropsychiatric symptoms in different behavioral disorders, as well as the intra-subject variability that occurs across different time windows. Moreover, because there is no consensus on the optimal assessment duration for detecting meaningful disease trajectories and progression of neurological disorders, nor for detecting episodic relapse in psychiatric disorders, long-term analysis of behavioral profiles is essential. In addition, optimizing the analysis window length, that is, the distribution of the observation period, to precisely detect disease-induced transient behavior is yet to be performed.

The inherent, progressive nature of psychiatric and neurodegenerative disorders makes them amenable to frequent treatment regimen modifications, yet satisfactory symptom control is not achieved given the high economic burden of clinical visits 77 . Besides screening and diagnosis, the concept of remote monitoring is realized thanks to the passive acquisition of high frequency, objective behavioral data. While this undoubtedly constitutes a promising arena, the lack of standardized objective features and the inconsistent analysis methods remain a challenge 78 . A contributing factor to this, according to 76 , is the short assessment time and the infrequent outcome assessment during the study duration. To be more precise, ground truth clinical evaluations that are performed at intermittent intervals during longitudinal data acquisition result in many unlabeled days; therefore, the validity of propagating these labels over long time windows is still unclear. A hybrid labeling approach that combines low frequency clinical assessment with higher frequency Ecological Momentary Assessment via self-reports throughout the study duration might mitigate this dilemma.

As illustrated earlier in our subgroup analysis, the adoption of deep learning methods that efficiently extract meaningful patterns from unstructured data is now on the rise. However, although deep learning methods outperformed the other machine learning models in terms of diagnostic accuracy, they are associated with considerable uncertainty. Interestingly, with the aim of enhancing the efficacy of remote assessment of PD, Iakovakis et al. 37 combined two databases captured in-the-clinic and in-the-wild in a hybrid deep learning model capable of learning fine motor symptoms, thereby overcoming the induced quantization error of the UPDRS-III and enhancing the performance of deep learning. Similarly, our meta-analysis showed that multimodal analysis, although it reinforces the diagnostic accuracy, is characterized by considerable between-study diagnostic uncertainty; therefore, future studies should adopt more transparent and well-conducted study designs to reduce bias. Combining voice analysis techniques with keystroke dynamics may boost the detection of early motor impairment signs, as these are also reflected in the speech characteristics of PD patients 79 . Moreover, from the methodological perspective, several pattern recognition tools have the potential to learn and decipher the nonlinear, dynamic nature of human-keyboard interactions. For example, fuzzy recurrence plots and scalable recurrence networks visually revealed finer texture and more regularity in the hold time series of healthy controls compared to early stage PD patients 20 , 21 .

Psychiatric and neurodegenerative disorders that develop and progress across the lifespan are characterized by a heterogeneous phenotype of motor and non-motor symptoms 80 . Early stage behavioral perturbations constitute an a priori link with a plausible connection to disease likelihood, but the high cross-talk between symptoms obscures accurate diagnosis and understanding of pathogenesis, especially at preclinical stages. This heterogeneity is a central problem in diagnostic research, entailing additional methods for analyzing similarities and differences across disease-induced behavioral disturbances. As opposed to previous reviews that put too much emphasis on specific disorders, we deliberately included studies on PD, MCI and affective disorders to convey that the neurobiological mechanisms differ greatly among disorders that are characterized by similar traits, such as motor slowing. Rather than focusing on specific disorders in isolation from others, we advocate a dimensional approach that places more stress on the symptoms per se, also referred to as comorbidities. We believe that a central challenge in this realm is building databases with a full representation of the population, to expand our understanding of the heterogeneous disease-related traits.

Although recent years have witnessed an increasing lean towards digital health technologies, the premature adoption of these measures by clinics precludes meaningful outcomes 81 . Our work highlights important directions for future research. The definition of clinically meaningful thresholds is yet to be established, and this cannot be attained without a “co-creation” approach, whereby high-level data and clinically validated interpretations are produced. For instance, amalgamating low level behavioral data with high level imaging data has not been explored yet. We think that this will not only yield better health information, but might also generate new knowledge on “brain fitness” and behavior across the generations. Further, the importance of interdisciplinary interactions also extends to ethical implications, for enhanced transparency, informed consent from patients, privacy and accountability 82 . We therefore summarize domain-specific limitations and future research directions in Table 2 .

We acknowledge that our study has several limitations. Among them are the sparsity and the inherent heterogeneity of the included studies. While we were able to perform regression analysis with patients’ demographic and clinical characteristics (i.e., age and disease duration, respectively) for PD, our meta-analysis lacks the investigation of additional covariates, such as gender differences and medication response, especially for MCI and psychiatric disorders. Although promising results have been revealed by leveraging typing patterns for diagnosing and monitoring mood and cognitive decline, the majority of the studies so far disproportionately target PD. While this is understandable given the hallmark motor disturbance in the latter, the need for further validation of this approach in other disorders is still pressing. This will be an important avenue for future studies. The data analyzed in the included studies were collected in either the United States (US) or Europe; therefore, future clinical trials of the diagnostic performance of keystroke dynamics in other populations, with possibly lower education levels and smartphone usage, particularly in ageing populations and low-income countries, are needed. A global consortium on the possibility of translating this technology to populations with limited access to neurological care may be the first step in this context. We could then investigate how the diagnostic potential changes across time, by site and for different populations. Concerning per-patient variability and disease progression, future work should focus more on identifying temporal symptom profiles and behavioral trajectories indicative of conversion to brain disease. More importantly, latent domains such as emotions and sleep patterns should be considered as confounders, given their direct influence on motor behavior and general health status. Estimation of motor impairment severity that correlates with disease stage and subtype is also an important future avenue. Furthermore, future researchers in the field should collaborate with clinicians to make the models more interpretable, thereby enhancing clinical adoption. Research in the area of explainable AI (XAI) is now growing rapidly 83 , but collaborative work between data scientists, engineers and clinicians is not yet established, especially in the mutual exchange of data (i.e., behavioral data, imaging). Finally, we declare, as a limitation, that the protocol of this systematic review and meta-analysis is registered in PROSPERO but has not been published yet.

Lastly, we note the strength of our meta-analysis conclusions, which conveyed the feasibility of using keystroke dynamics derived from natural interaction with the keyboards of connected devices as digital biomarkers for early decline in fine motor skills associated with neuropsychiatric disorders. Based on experimental design comparisons, we showed that keystroke dynamics constitute an ecologically valid diagnostic platform in-the-wild, reflecting their translational potential outside clinics, despite the methodological challenges that arise, including but not limited to the influence of confounders and sampling difficulties. Further, given the influence of data labeling on the diagnosis models, we conclude that even when self-reported data in-the-wild are used for training, keystroke dynamics models still achieve sound discriminatory potential. From the methodological perspective, we show that employing multimodal and advanced deep learning models, which are at the leading edge of contemporary data science methodologies, offers promising opportunities for boosting the diagnostic accuracy, but with considerable heterogeneity across the studies. Consequently, there is a need to establish intricate and generalizable diagnostic models that not only achieve accurate diagnosis but are also sensitive to temporal change and symptom progression. To this end, our regression models showed the evolution of diagnosis AUC and fine motor impairment with age and disease duration for PD. We repeated the regression analysis for MCI and showed how the diagnostic AUC increases with age, reflecting the increasing fine motor impairment severity. In conclusion, the importance of digital technology goes beyond the diagnostic yield: once at-risk cohorts are identified, digital technologies can also be employed to reinforce behavior change and patients’ empowerment, towards a sustained quality of life, as detailed in Table 2 .

Search strategy and selection criteria

In this systematic review and meta-analysis, conducted in accordance with the Diagnostic Test Accuracy extension of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA-2020) 84 , a systematic search of MEDLINE, PubMed, IEEE Xplore, Web of Science, and EBSCO was independently performed by two authors (H.A. and N.C.) for publications between January 1st, 2010 and March 30th, 2022, on pattern recognition and neuropsychiatric disease classification on the basis of natural interactions with keyboards, without language restrictions. These date restrictions were specified a priori, because typing patterns constitute a new class in the fruitful digital phenotyping area. The full search strategy for all databases is reported in Supplementary Tables 1–5 and in the PRISMA 2020 checklist. Eligible studies assessed the influence of motor impairment induced by psychiatric or neurological disorders on typing patterns (i.e., keystroke dynamics). Those deemed eligible were case–control studies comparing the typing behavior of neuropsychiatric disease patients to that of age- and education-matched healthy control subjects. Studies that used statistical analysis without classification were included in the narrative synthesis, while those employing machine learning models for classification were included in a random effects meta-analysis to evaluate the diagnosis performance on the basis of typing behavior. We performed a manual search of the reference lists of the eligible studies, and we searched the grey literature for unpublished data, conference proceedings and dissertations. Prior to writing this paper, we checked whether systematic reviews and meta-analyses on the same topic already existed.

All search results were uploaded to Rayyan web of intelligent systematic reviews 85 for duplicate removal and screening. One author (H.A.) screened the titles and abstracts of the included studies, which were double-screened by a second author (L.H.). Three authors (H.A., A.K. and L.H.) assessed the eligibility of the included full articles. Any disagreement was resolved by discussion.

Protocol registration

The protocol of this systematic review and meta-analysis has been registered in PROSPERO with identifier CRD42021278707.

Data extraction and quality assessment

Two authors (H.A. and L.H.) extracted the following data from the included studies: (1) disease, (2) first author and publication year, (3) experimental protocol of data collection, including collection settings and study duration, (4) number and mean age of participants in diseased and healthy groups, (5) data labeling methodology (self-reported meta-data vs. clinical evaluation), (6) data streams employed by the study, (7) extracted features, (8) analysis and feature extraction level (subject-level vs. typing session-level), (9) problem formulation and validation, whether through statistical analyses or classification, (10) 2 × 2 data (True Positives, True Negatives, False Positives, False Negatives), from which we derived the sensitivity and the specificity, (11) classification accuracy and (12) Area Under the Receiver Operating Characteristic Curve (AUC). Three authors (H.A., A.K. and L.H.) discussed and assessed the quality of the included studies. The studies that did not perform classification were included in the systematic review but not in the meta-analysis.
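
As an illustration of how sensitivity, specificity and accuracy follow from the 2 × 2 data in item (10), the following minimal Python sketch may be helpful. The class and function names are hypothetical and are not taken from the included studies or from any tool used in this review.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """2 x 2 data for one study (hypothetical container, for illustration only)."""
    tp: int  # diseased participants classified as diseased
    tn: int  # healthy participants classified as healthy
    fp: int  # healthy participants classified as diseased
    fn: int  # diseased participants classified as healthy


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Derive sensitivity, specificity and accuracy from the 2 x 2 counts."""
    diseased = c.tp + c.fn
    healthy = c.tn + c.fp
    if diseased == 0 or healthy == 0:
        raise ValueError("both diseased and healthy groups must be non-empty")
    return {
        "sensitivity": c.tp / diseased,
        "specificity": c.tn / healthy,
        "accuracy": (c.tp + c.tn) / (diseased + healthy),
    }


if __name__ == "__main__":
    # Made-up counts, not data from any included study.
    print(diagnostic_metrics(ConfusionCounts(tp=43, tn=40, fp=10, fn=7)))
```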

Statistical analysis and diagnosis evaluation

Our primary outcome is the diagnosis efficiency of machine learning models employing typing features (i.e., keystroke dynamics). Secondary outcomes include longitudinal disease monitoring on the basis of pattern recognition of keystroke dynamics, treatment response, and key features that discriminate diseased from healthy groups.

In particular, the outcomes of the meta-analysis were the Area Under the Receiver Operating Characteristic Curve (AUC), accuracy, sensitivity and specificity. These outcomes were pooled and included in a univariate random effect model independently for three disease categories, namely PD, MCI, and psychiatric disorders. Heterogeneity was assessed using the I² statistic, which reflects between-study differences that are not related to sampling, in addition to the Cochran Q (χ²) test (p < 0.05). Given that in this study we report the validity of keystroke dynamics models as diagnostic tools for different disorders, we accepted high heterogeneity (I² > 50%). Furthermore, to ensure the completeness and transparency of the reported diagnostic accuracy measures, we followed the Standards for the Reporting of Diagnostic Accuracy Studies (STARD) 86 .
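
To make the pooling and heterogeneity statistics concrete, the sketch below shows one common way to fit a univariate random-effects model and to compute Cochran's Q and I². It assumes the DerSimonian–Laird estimator of the between-study variance and a normal-approximation 95% CI; the estimator actually applied by the software used in this review may differ, and the function name and inputs are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study-level effects with a DerSimonian-Laird random-effects model.

    `effects` are per-study estimates (e.g. AUC or sensitivity) and
    `variances` their within-study variances. Both are assumptions of this
    sketch; for a proportion p measured on n participants one might
    approximate the variance as p * (1 - p) / n.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q and the I^2 statistic (percentage of variability
    # attributed to between-study heterogeneity rather than sampling).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights, pooled estimate and 95% CI.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }
```

With two hypothetical studies reporting 0.7 and 0.9, each with variance 0.01, this sketch gives a pooled estimate of 0.8, Q = 2 and I² = 50%.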

After pooling the data, we processed them using the Meta-Essentials tool 87 . For each study, we entered the (per-subject) AUC, the accuracy and the sample size, while for the sensitivity and specificity we entered the number of participants in the diseased and healthy groups, respectively. These measures, along with the 95% confidence interval (CI), were represented by univariate forest plots. All included studies that reported AUC, accuracy, sensitivity and specificity were included in the meta-analysis, given that we maintain symmetry of the funnel plots to minimize publication bias. Studies that did not report any of these measures and/or were associated with high bias risk were included in the systematic review but not in the meta-analysis. Furthermore, for each of the three disorder groups, we performed a leave-one-study-out sensitivity analysis to investigate the impact of individual studies on the diagnostic metrics. Two authors (H.A. and L.H.) performed the statistical analysis and agreed on its outcome.
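
The leave-one-study-out procedure can be sketched as follows. This is an illustrative outline only: the default pooling function is a simple inverse-variance (fixed-effect) average chosen to keep the example short, which is an assumption of the sketch, and a random-effects pooling function could be passed in instead. All names are hypothetical.

```python
def inverse_variance_pool(effects, variances):
    """Inverse-variance weighted mean (a simplified stand-in for the pooling model)."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_study_out(labels, effects, variances, pool=inverse_variance_pool):
    """Re-pool the effect after omitting each study in turn.

    Returns a mapping from the omitted study's label to the pooled estimate
    of the remaining studies, so that influential studies stand out.
    """
    labels, effects, variances = list(labels), list(effects), list(variances)
    if not (len(labels) == len(effects) == len(variances)):
        raise ValueError("labels, effects and variances must have equal length")
    if len(labels) < 3:
        raise ValueError("need at least three studies to leave one out")
    results = {}
    for i, label in enumerate(labels):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        results[label] = pool(rest_effects, rest_variances)
    return results


if __name__ == "__main__":
    # Made-up AUC values, not data from any included study.
    print(leave_one_study_out(["A", "B", "C"], [0.8, 0.9, 0.7], [0.01, 0.01, 0.01]))
```

A large shift in the pooled estimate when a given study is omitted would flag that study as influential.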

Subgroup analysis

Subgroup analyses were conducted to assess the source of heterogeneity between the studies, provided that each subgroup contained more than three studies (n > 3) after subgroup division. We particularly focus on the performance of data acquisition and analysis methods, given that they are the main intellectual challenges of the highly fertile arena of digital phenotyping 73 . Given the increasing interest in real-life diagnosis, we divided the studies based on the data acquisition modality into (1) in-the-clinic and in-the-wild. Furthermore, we compared the attained AUC, accuracy, sensitivity and specificity between (2) clinically validated and self-reported data. In addition, comparisons between (3) multimodal and unimodal studies, as well as (4) deep learning and other machine learning classification methods, were performed.
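
A minimal sketch of such a two-group comparison is given below. It enforces the n > 3 eligibility rule stated above, pools each subgroup with inverse-variance weights, and compares the two pooled estimates with a z-test. Both the simplified pooling and the z-test are assumptions made for illustration; they are not necessarily the procedure behind the reported subgroup p-values, and the dictionary keys are hypothetical.

```python
import math
from collections import defaultdict

MIN_STUDIES = 4  # "more than three studies (n > 3)" per subgroup


def pool(effects, variances):
    """Inverse-variance pooled estimate and its standard error (simplified)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


def compare_subgroups(studies, group_key, effect_key, variance_key):
    """Split studies into two subgroups and test the difference of pooled effects.

    Returns None when a subgroup is too small to be analyzed.
    """
    groups = defaultdict(list)
    for study in studies:
        groups[study[group_key]].append(study)
    if len(groups) != 2:
        raise ValueError("expected exactly two subgroups")
    if any(len(members) < MIN_STUDIES for members in groups.values()):
        return None

    (name_a, a), (name_b, b) = sorted(groups.items(), key=lambda kv: kv[0])
    est_a, se_a = pool([s[effect_key] for s in a], [s[variance_key] for s in a])
    est_b, se_b = pool([s[effect_key] for s in b], [s[variance_key] for s in b])

    # Two-sided z-test on the difference between the pooled estimates.
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "groups": (name_a, name_b),
        "estimates": (est_a, est_b),
        "difference": est_a - est_b,
        "z": z,
        "p": p_value,
    }
```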

Regression analysis

Four linear regression models were fitted for (1) PD patients’ age and years from diagnosis (disease duration), (2) PD diagnosis AUC and disease duration, (3) PD fine motor impairment index and disease duration and (4) MCI diagnosis AUC and patients’ age. These tests were two-sided with a statistical significance threshold of 0.05 and 95% CI.
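
For illustration, one such two-variable linear regression with a two-sided test on the slope can be sketched as below. This is an ordinary least squares fit written with the Python standard library only; the p-value is obtained by numerically integrating the Student's t density, which is adequate for moderate values of the t statistic. The function names are hypothetical, and the sketch is not the software used for the reported analyses.

```python
import math


def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    if df < 1:
        raise ValueError("df must be at least 1")
    if math.isinf(t_stat):
        return 0.0
    upper = abs(t_stat)
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_half = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_half)


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x with a slope test."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_slope = math.sqrt(sse / df / sxx)
    if se_slope > 0:
        t_stat = slope / se_slope
    else:  # perfect fit: the test is degenerate
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    return {"slope": slope, "intercept": intercept, "se_slope": se_slope,
            "t": t_stat, "p": t_two_sided_p(t_stat, df)}
```

A slope would be declared significant at the stated threshold when the returned `p` is below 0.05.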

Publication bias assessment

Publication bias was assessed based on Begg and Mazumdar’s rank correlation test and visualized by funnel plots. Importantly, if one database was used in multiple studies, or if one study employed multiple analysis methods, we treated those as independent studies.
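
A rank correlation test of this kind can be sketched as follows: each study effect is standardized against the inverse-variance pooled estimate, and Kendall's tau is then computed between the standardized effects and the study variances. The sketch uses the normal approximation without a tie or continuity correction, which is an assumption of this illustration, and the function name is hypothetical.

```python
import math


def begg_mazumdar_test(effects, variances):
    """Rank correlation between standardized effects and their variances.

    A strong positive or negative correlation suggests funnel plot asymmetry.
    """
    k = len(effects)
    if k < 3 or k != len(variances):
        raise ValueError("need at least three studies with matching variances")

    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total

    # Variance of (effect_i - pooled) is v_i - 1 / sum(weights).
    deviates = [(y - pooled) / math.sqrt(v - 1.0 / total)
                for y, v in zip(effects, variances)]

    # Kendall's tau: count concordant and discordant pairs.
    concordant = discordant = 0
    for i in range(k):
        for j in range(i + 1, k):
            sign = (deviates[i] - deviates[j]) * (variances[i] - variances[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1

    s_stat = concordant - discordant
    tau = s_stat / (k * (k - 1) / 2)
    z = s_stat / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return {"tau": tau, "z": z, "p": p_value}
```

With only a handful of studies, this test has low power, so a non-significant result should not be read as evidence that publication bias is absent.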

To assess the internal validity of the included studies, quality assessment was performed employing the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool 88 . All discrepancies were resolved by mutual discussion between three authors (H.A., A.K., and L.H.). Moreover, we generated four funnel plots for AUC, accuracy, sensitivity and specificity to visually illustrate the publication bias of the included studies 89 .

Quality of evidence assessment

To convey the clinical value of keystroke dynamics, we used the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) tool 90 to systematically and transparently assess the diagnostic accuracy evidence of keystroke dynamics for neuropsychiatric disorders. The systematic appraisal of the evidence quality is determined by (1) the design of the study, (2) risk of bias, (3) inconsistency of reported results, (4) indirectness of the outcome, (5) imprecision of the reported results, and (6) publication bias.

Data availability

The search strategy and the extracted data contributing to the meta-analysis are available in the appendix; any additional data are available on request from the corresponding author.

Tekin, S. & Cummings, J. L. Frontal–subcortical neuronal circuits and clinical neuropsychiatry. J. Psychosom. Res. 53 , 647–654 (2002).

Peralta, V. & Cuesta, M. J. Motor abnormalities: From neurodevelopmental to neurodegenerative through “functional” (neuro)psychiatric disorders. Schizophr. Bull. 43 , 956–971 (2017).

Bostan, A. C. & Strick, P. L. The basal ganglia and the cerebellum: Nodes in an integrated network. Nat. Rev. Neurosci. 19 , 338–350 (2018).

Schrag, A., Horsfall, L., Walters, K., Noyce, A. & Petersen, I. Prediagnostic presentations of Parkinson’s disease in primary care: A case-control study. Lancet Neurol. 14 , 57–64 (2015).

Rizzo, G. et al. Accuracy of clinical diagnosis of Parkinson disease: A systematic review and meta-analysis. Neurology 86 , 566–576 (2016).

de Paula, J. J. et al. Impairment of fine motor dexterity in mild cognitive impairment and Alzheimer’s disease dementia: Association with activities of daily living. Rev. Bras. Psiquiatr. 38 , 235–238 (2016).

Wang, P. S. et al. Delay and failure in treatment seeking after first onset of mental disorders in the World Health Organization’s World Mental Health Survey Initiative. World Psychiatry Off. J. World Psychiatr. Assoc. WPA 6 , 177–185 (2007).

Bargmann, C. I. & Newsome, W. T. The brain research through advancing innovative neurotechnologies (BRAIN) initiative and neurology. JAMA Neurol. 71 , 675 (2014).

Bernard, J. A. & Mittal, V. A. Updating the research domain criteria: The utility of a motor dimension. Psychol. Med. 45 , 2685–2689 (2015).

Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A. & Poeppel, D. Neuroscience needs behavior: Correcting a reductionist bias. Neuron 93 , 480–490 (2017).

Monrose, F. & Rubin, A. D. Keystroke dynamics as a biometric for authentication. Future Gener. Comput. Syst. 16 , 351–359 (2000).

Wolff, A. L. & O’Driscoll, G. A. Motor deficits and schizophrenia: The evidence from neuroleptic-naïve patients and populations at risk. J. Psychiatry Neurosci. 24 , 304–314 (1999).

Shimoyama, I. The finger-tapping test: A quantitative analysis. Arch. Neurol. 47 , 681 (1990).

Chan, R. C. et al. Neurological abnormalities and neurocognitive functions in healthy elder people: A structural equation modeling analysis. Behav. Brain Funct. 7 , 32 (2011).

Emsley, R. et al. Neurological soft signs in first-episode schizophrenia: State- and trait-related relationships to psychopathology, cognition and antipsychotic medication effects. Schizophr. Res. 188 , 144–150 (2017).

Pentland, A., Lazer, D., Brewer, D. & Heibeck, T. Using reality mining to improve public health and medicine. Stud. Health Technol. Inform. 149 , 93–102 (2009).

Myszczynska, M. A. et al. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nat. Rev. Neurol. 16 , 440–456 (2020).

Giancardo, L. et al. Computer keyboard interaction as an indicator of early Parkinson’s disease. Sci. Rep. 6 , 34468 (2016).

Iakovakis, D. et al. Touchscreen typing-pattern analysis for detecting fine motor skills decline in early-stage Parkinson’s disease. Sci. Rep. 8 , 7663 (2018).

Pham, T. D. Pattern analysis of computer keystroke time series in healthy control and early-stage Parkinson’s disease subjects using fuzzy recurrence and scalable recurrence network features. J. Neurosci. Methods 307 , 194–202 (2018).

Pham, T. D., Wardell, K., Eklund, A. & Salerud, G. Classification of short time series in early Parkinsons disease with deep learning of fuzzy recurrence plots. IEEE/CAA J. Autom. Sin. 6 , 1306–1317 (2019).

Lee, C. Y. et al. A validation study of a smartphone-based finger tapping application for quantitative assessment of bradykinesia in Parkinson’s disease. PLoS ONE 11 , e0158852 (2016).

Arora, S. et al. Smartphone motor testing to distinguish idiopathic REM sleep behavior disorder, controls, and PD. Neurology 91 , e1528–e1538 (2018).

Arroyo-Gallego, T. et al. Detection of motor impairment in Parkinson’s disease via mobile touchscreen typing. IEEE Trans. Biomed. Eng. 64 , 1994–2002 (2017).

Hooman, O. M., Oldfield, J. & Nicolaou, M. A. Detecting early Parkinson’s disease from keystroke dynamics using the tensor-train decomposition. In 2019 27th European Signal Processing Conference (EUSIPCO) 1–5 (IEEE, 2019).

Printy, B. P. et al. Smartphone application for classification of motor impairment severity in Parkinson’s disease. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2686–2689 (IEEE, 2014).

Wissel, B. D. et al. Tablet-based application for objective measurement of motor fluctuations in Parkinson disease. Digit. Biomark. 1 , 126–135 (2017).

Adams, W. R. High-accuracy detection of early Parkinson’s Disease using multiple characteristics of finger movement while typing. PLoS ONE 12 , e0188226 (2017).

Milne, A., Farrahi, K. & Nicolaou, M. A. Less is more: univariate modelling to detect early Parkinson’s disease from keystroke dynamics. In International Conference on Discovery Science 435–446 (Springer, 2018).

Memedi, M., Khan, T., Grenholm, P., Nyholm, D. & Westin, J. Automatic and objective assessment of alternating tapping performance in Parkinson’s disease. Sensors 13 , 16965–16984 (2013).

Prince, J., Arora, S. & de Vos, M. Big data in Parkinson’s disease: using smartphones to remotely detect longitudinal disease phenotypes. Physiol. Meas. 39 , 044005 (2018).

Iakovakis, D. et al. Motor impairment estimates via touchscreen typing dynamics toward Parkinson’s disease detection from data harvested in-the-wild. Front. ICT 5 , 28 (2018).

Arroyo-Gallego, T. et al. Detecting motor impairment in early parkinson’s disease via natural typing interaction with keyboards: Validation of the neuroQWERTY approach in an uncontrolled at-home setting. J. Med. Internet Res. 20 , e89 (2018).

Matarazzo, M. et al. Remote monitoring of treatment response in Parkinson’s disease: The habit of typing on a computer. Mov. Disord. 34 , 1488–1495 (2019).

Lipsmeier, F. et al. Evaluation of smartphone-based testing to generate exploratory outcome measures in a phase 1 Parkinson’s disease clinical trial. Mov. Disord. 33 , 1287–1297 (2018).

Papadopoulos, A. et al. Unobtrusive detection of Parkinson’s disease from multi-modal and in-the-wild sensor data using deep learning techniques. Sci. Rep. 10 , 21370 (2020).

Iakovakis, D. et al. Screening of Parkinsonian subtle fine-motor impairment from touchscreen typing via deep learning. Sci. Rep. 10 , 1–13 (2020).

Iakovakis, D. et al. Early Parkinson’s disease detection via touchscreen typing analysis using convolutional neural networks. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 3535–3538 (IEEE, 2019).

Zhan, A. et al. High frequency remote monitoring of Parkinson’s disease via smartphone: Platform overview and medication response detection. arXiv arXiv:1601.00960 (2016).

Wang, Y. et al. Facilitating text entry on smartphones with QWERTY keyboard for users with Parkinson’s disease. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems 1–12 (ACM, 2021) https://doi.org/10.1145/3411764.3445352 .

Surangsrirat, D., Sri-iesaranusorn, P., Chaiyaroj, A., Vateekul, P. & Bhidayasiri, R. Parkinson’s disease severity clustering based on tapping activity on mobile device. Sci. Rep. 12 , 3142 (2022).

Goñi, M., Eickhoff, S. B., Far, M. S., Patil, K. R. & Dukart, J. Smartphone-Based Digital Biomarkers for Parkinson’s Disease in a Remotely-Administered Setting (2021) https://doi.org/10.1101/2021.01.13.21249660 .

Martínez-Martín, P. et al. Unified Parkinson’s disease rating scale characteristics and structure. Mov. Disord. 9 , 76–83 (1994).

TaylorTavares, A. L. et al. Quantitative measurements of alternating finger tapping in Parkinson’s disease correlate with UPDRS motor disability and reveal the improvement in fine motor control from medication and deep brain stimulation. Mov. Disord. 20 , 1286–1298 (2005).

Bot, B. M. et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci. Data 3 , 160011 (2016).

Sperling, R., Mormino, E. & Johnson, K. The evolution of preclinical Alzheimer’s disease: Implications for prevention trials. Neuron 84 , 608–622 (2014).

Petersen, R. C. et al. Mild cognitive impairment: Clinical characterization and outcome. Arch. Neurol. 56 , 303–308 (1999).

Kourtis, L. C., Regele, O. B., Wright, J. M. & Jones, G. B. Digital biomarkers for Alzheimer’s disease: The mobile/wearable devices opportunity. NPJ Digit. Med. 2 , 1–9 (2019).

Vizer, L. M. & Sears, A. Classifying text-based computer interactions for health monitoring. IEEE Pervasive Comput. 14 , 64–71 (2015).

Ntracha, A. et al. Detection of mild cognitive impairment through natural language and touchscreen typing processing. Front. Digit. Health 2 , 567158 (2020).

Chen, R. et al. Developing measures of cognitive impairment in the real world from consumer-grade multimodal sensor streams. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2145–2155 (2019).

Stringer, G. et al. Can you detect early dementia from an email? A proof of principle study of daily computer use to detect cognitive and functional decline. Int. J. Geriatr. Psychiatry 33 , 867–874 (2018).

Van Waes, L., Leijten, M., Mariën, P. & Engelborghs, S. Typing competencies in Alzheimer’s disease: An exploration of copy tasks. Comput. Hum. Behav. 73 , 311–319 (2017).

Rabinowitz, I. & Lavner, Y. Association between finger tapping, attention, memory, and cognitive diagnosis in elderly patients. Percept. Mot. Skills 119 , 259–278 (2014).

Austin, D. et al. Measuring motor speed through typing: A surrogate for the finger tapping test. Behav. Res. Methods 43 , 903–909 (2011).

Zulueta, J. et al. Predicting mood disturbance severity with mobile phone keystroke metadata: A biaffect digital phenotyping study. J. Med. Internet Res. 20 , e241 (2018).

Stange, J. P. et al. Let your fingers do the talking: Passive typing instability predicts future mood outcomes. Bipolar Disord. 20 , 285–288 (2018).

Vesel, C. et al. Effects of mood and aging on keystroke dynamics metadata and their diurnal patterns in a large open-science sample: A BiAffect iOS study. J. Am. Med. Inform. Assoc. 27 , 1007–1018 (2020).

Cao, B. et al. DeepMood: modeling mobile phone typing dynamics for mood detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 747–755 (ACM, 2017) https://doi.org/10.1145/3097983.3098086 .

Huang, H., Cao, B., Yu, P. S., Wang, C.-D. & Leow, A. D. dpMood: exploiting local and periodic typing dynamics for personalized mood prediction. In 2018 IEEE International Conference on Data Mining (ICDM) 157–166 (IEEE, 2018) https://doi.org/10.1109/ICDM.2018.00031 .

Ross, M. K. et al. Naturalistic smartphone keyboard typing reflects processing speed and executive function. Brain Behav. 11 , e2363 (2021).

Zulueta, J. et al. The effects of bipolar disorder risk on a mobile phone keystroke dynamics based biomarker of brain age. Front. Psychiatry 12 , 739022 (2021).

Mastoras, R.-E. et al. Touchscreen typing pattern analysis for remote detection of the depressive tendency. Sci. Rep. 9 , 1–12 (2019).

Giancardo, L., Sánchez-Ferro, A., Butterworth, I., Mendoza, C. & Hooker, J. M. Psychomotor impairment detection via finger interactions with a computer keyboard during natural typing. Sci. Rep. 5 , 1–8 (2015).

Vigo, D., Thornicroft, G. & Atun, R. Estimating the true global burden of mental illness. Lancet Psychiatry 3 , 171–178 (2016).

Whiteford, H. A., Ferrari, A. J., Degenhardt, L., Feigin, V. & Vos, T. The global burden of mental, neurological and substance use disorders: An analysis from the Global Burden of Disease Study 2010. PLoS ONE 10 , e0116820 (2015).

Aarsland, D., Påhlhagen, S., Ballard, C. G., Ehrt, U. & Svenningsson, P. Depression in Parkinson disease—epidemiology, mechanisms and management. Nat. Rev. Neurol. 8 , 35–47 (2012).

American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders Vol. 3 (American Psychiatric Association, 1980).

Biondetti, E. et al. Spatiotemporal changes in substantia nigra neuromelanin content in Parkinson’s disease. Brain 143 , 2757–2770 (2020).

Lam, K. et al. Real-world keystroke dynamics are a potentially valid biomarker for clinical disability in multiple sclerosis. Mult. Scler. J. 1352458520968797 (2020).

Twose, J., Licitra, G., McConchie, H., Lam, K. & Killestein, J. Early-warning signals for disease activity in patients diagnosed with multiple sclerosis based on keystroke dynamics. Chaos Interdiscip. J. Nonlinear Sci. 30 , 113133 (2020).

Lang, C. et al. Monitoring the motor phenotype in Huntington’s disease by analysis of keyboard typing during real life computer use. J. Huntingt. Dis. 10 (2), 259–268 (2021).

Onnela, J.-P. Opportunities and challenges in the collection and analysis of digital phenotyping data. Neuropsychopharmacology 46 , 45–54 (2021).

Kaye, J. et al. Methodology for establishing a community-wide life laboratory for capturing unobtrusive and continuous remote activity and health data. J. Vis. Exp. (2018).

Sánchez-Ferro, Á. et al. New methods for the assessment of Parkinson’s disease (2005 to 2015): A systematic review. Mov. Disord. 31 , 1283–1292 (2016).

Ebner-Priemer, U. W. et al. Digital phenotyping: Towards replicable findings with comprehensive assessments and integrative models in bipolar disorders. Int. J. Bipolar Disord. 8 , 1–9 (2020).

Marxreiter, F. et al. The use of digital technology and media in German Parkinson’s disease patients. J. Park. Dis. 10 , 717–727 (2020).

Rohani, D. A., Faurholt-Jepsen, M., Kessing, L. V. & Bardram, J. E. Correlations between objective behavioral features collected from mobile and wearable devices and depressive mood symptoms in patients with affective disorders: systematic review. JMIR MHealth UHealth 6 , e165 (2018).

Laganas, C. et al. Parkinson’s disease detection based on running speech data from phone calls. IEEE Trans. Biomed. Eng. (2021) ( in Press ).

Heilbron, K. et al. The Parkinson’s phenome—traits associated with Parkinson’s disease in a broadly phenotyped cohort. NPJ Park. Dis. 5 , 1–8 (2019).

Hilty, D. M., Armstrong, C. M., Luxton, D. D., Gentry, M. T. & Krupinski, E. A. A scoping review of sensors, wearables, and remote monitoring for behavioral health: Uses, outcomes, clinical competencies, and research directions. J. Technol. Behav. Sci. 6 , 278–331 (2021).

Potier, R. The digital phenotyping project: A psychoanalytical and network theory perspective. Front. Psychol. 11 , 1218 (2020).

Gilpin, L. H. et al. Explaining explanations: an overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) 80–89 (IEEE, 2018). https://doi.org/10.1109/DSAA.2018.00018 .

McInnes, M. D. F. et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: The PRISMA-DTA statement. JAMA 319 , 388 (2018).

Ouzzani, M., Hammady, H., Fedorowicz, Z. & Elmagarmid, A. Rayyan—A web and mobile app for systematic reviews. Syst. Rev. 5 , 210 (2016).

Bossuyt, P. M. et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ https://doi.org/10.1136/bmj.h5527 (2015).

Suurmond, R., van Rhee, H. & Hak, T. Introduction, comparison, and validation of Meta-Essentials : A free and simple tool for meta-analysis. Res. Synth. Methods 8 , 537–553 (2017).

Whiting, P. F. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155 , 529 (2011).

Song, F., Khan, K. S., Dinnes, J. & Sutton, A. J. Asymmetric funnel plots and publication bias in meta-analyses of diagnostic accuracy. Int. J. Epidemiol. 31 , 88–95 (2002).

Guyatt, G. et al. GRADE guidelines: 1. Introduction—GRADE evidence profiles and summary of findings tables. J. Clin. Epidemiol. 64 , 383–394 (2011).

Acknowledgements

This work is supported by the Joint Research Center of Khalifa University of Science and Technology and the Korean Advanced Institute of Science and Technology, 8474000221 (KKJRC-2019-Health2) awarded to Ahsan Khandoker and Leontios Hadjileontiadis.

This study is funded by the Joint Research Center of Khalifa University of Science and Technology and the Korean Advanced Institute of Science and Technology, 8474000221 (KKJRC-2019-Health2). The funder did not have any role in the analysis performed.

Author information

Authors and affiliations.

Department of Biomedical Engineering, Khalifa University of Science and Technology, P O Box 127788, Abu Dhabi, United Arab Emirates

Hessa Alfalahi, Ahsan H. Khandoker, Nayeefa Chowdhury, Sofia B. Dias & Leontios J. Hadjileontiadis

Healthcare Engineering Innovation Center (HEIC), Khalifa University of Science and Technology, P O Box 127788, Abu Dhabi, United Arab Emirates

Hessa Alfalahi, Ahsan H. Khandoker, Sofia B. Dias & Leontios J. Hadjileontiadis

Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece

Dimitrios Iakovakis & Leontios J. Hadjileontiadis

CIPER, Faculdade de Motricidade Humana, Universidade de Lisboa, Cruz Quebrada, 1499-002, Lisbon, Portugal

Sofia B. Dias

Parkinson’s Foundation Centre of Excellence, King’s College Hospital NHS Foundation Trust, Denmark Hill, London, SE5 9RS, United Kingdom

K. Ray Chaudhuri

Department of Basic and Clinical Neurosciences, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, De Crespigny Park, London, SE5 8AF, United Kingdom

Contributions

H.A., A.K., and L.H. conceived and designed the study; H.A. and N.C. performed systematic search; H.A. and L.H. performed quality assessment, extracted meta-data and conducted the systematic review and meta-analysis; H.A. wrote the first draft and S.D. and L.H. contributed to the writing and editing. H.A., A.K., D.I., S.D., R.C. and L.H. reviewed the manuscript. All authors discussed and agreed on the submission of this manuscript.

Corresponding author

Correspondence to Hessa Alfalahi .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Alfalahi, H., Khandoker, A.H., Chowdhury, N. et al. Diagnostic accuracy of keystroke dynamics as digital biomarkers for fine motor decline in neuropsychiatric disorders: a systematic review and meta-analysis. Sci Rep 12 , 7690 (2022). https://doi.org/10.1038/s41598-022-11865-7

Received : 24 November 2021

Accepted : 25 April 2022

Published : 11 May 2022

DOI : https://doi.org/10.1038/s41598-022-11865-7

  • Survey Paper
  • Open access
  • Published: 10 July 2013

A systematic review on keystroke dynamics

  • Paulo Henrique Pisani 1 &
  • Ana Carolina Lorena 2  

Journal of the Brazilian Computer Society volume 19, pages 573–587 (2013)

Computing and communication systems have improved our way of life, but have also contributed to increased data exposure and, consequently, to identity theft. A possible way to overcome this issue is the use of biometric technologies for user authentication. Among the possible technologies to be analysed, this work focuses on keystroke dynamics, which attempts to recognize users by their typing rhythm. In order to guide future research in this area, a systematic review on keystroke dynamics was conducted and is presented here. The systematic review method adopts a rigorous procedure with the definition of a formal review protocol. Systematic reviews are not commonly used in artificial intelligence, and this work contributes to their use in the area. This paper discusses the process involved in the review along with the results obtained in order to identify the state of the art of keystroke dynamics. We summarize the main classifiers, performance measures, extracted features and benchmark datasets used in the area.

1 Introduction

The wider dissemination of digital identities has contributed to greater worries regarding information exposure [ 47 ]. Recently, in view of the increased dissemination of the internet in several activities (e.g. online banking, e-commerce, e-mail ), security problems became more evident [ 24 ]. As a result, identity theft has gained new momentum. The term identity theft is commonly used to refer to the crime of using personal information of someone else to illegally pretend to be a certain person [ 38 ].

In view of this scenario, more sophisticated methods for user authentication have been developed. Authentication is the process used to confirm the identity of a user. In the case of workstations, for example, the authentication usually occurs in the system initialization, known as initial authentication . Nevertheless, even more secure authentication methods do not provide an entirely effective security mechanism, as the computer may be vulnerable to intruders when the user leaves the workstation and does not end the session. Consequently, an intruder could use the computer masquerading as the legitimate user, resulting in identity theft [ 38 ]. One of the ways to mitigate this problem is by using intrusion detection systems that act on the workstation ( host-based ).

More recently, the concept of detecting intrusions by the behavioral analysis of the user of the computer [ 39 ] has emerged, also known as Behavioral Intrusion Detection [ 49 ]; several aspects of this method have yet to be explored. This concept is grounded on the fact that, by observing the behavior of a user, it is possible to define models that represent the regular behavior ( profile ) of this user, thus allowing the identification of deviations that are potential intrusions. The process of defining these models is known as user profiling [ 46 ]. There is a great variety of features that can be used to define the model of a user. This work focuses on keystroke dynamics , classified as a behavioral biometric technology.

This paper adopts a rigorous method to perform a review on intrusion detection with keystroke dynamics , known as systematic review . As the name suggests, a systematic review adopts a formal and systematic procedure for the conduction of the bibliographic review , with the definition of explicit protocols for obtaining information. Consequently, by using these protocols, the results attained by the systematic review can be reproduced by other researchers as a way of validation, decreasing the incidence of bias in the review, a problem boosted in non-systematic bibliographic reviews [ 33 ].

Systematic reviews are commonly applied in other areas, mainly in medicine , and have a number of reported benefits [ 33 ]. In the area of computing , this review method is more disseminated in software engineering [ 7 ]. This paper contributes to the use of systematic review in computing , particularly in artificial intelligence . Here, we discuss how the systematic review was applied and the achieved results, which are valuable information for the area of intrusion detection with keystroke dynamics .

This paper presents a systematic review carried out with the aim of identifying the state of the art in keystroke dynamics applied to intrusion detection. Preliminary results of this review are shown in [ 42 ] and [ 41 ]. The remaining sections are organized as follows: in Sect. 2 , basic concepts of keystroke dynamics are introduced; in Sect. 3 , the process of systematic review is presented; Sect. 4 discusses how the systematic review was applied in this work, specifying the review protocol and the steps adopted; in Sect. 5 , the results obtained by the systematic review are summarized; and, finally, Sect. 6 presents our conclusions.

2 Background

In information security, intrusion detection is the process of monitoring events in a computer or network and analysing them to detect signs of possible incidents, which are violations or threats of violation of security policies, acceptable use or security practices [ 45 ]. An intrusion detection system (IDS) automates this process.

As previously discussed, a new concept of detecting intrusions by analysing the behaviour of the user on the computer has recently emerged [ 39 ]; this analysis is performed by a behavioural IDS [ 49 ]. This type of system is grounded on a concept known as user profiling , which consists of observing the behaviour of a user in order to generate models that represent the user's normal behaviour. Observed events are then compared to these models and possible deviations are classified as potential intrusions [ 46 ]. An IDS that applies user profiling is a system based on anomaly detection, as it generates alarms for events that deviate from a behaviour pattern. Figure 1 represents the basic flow of a behavioural IDS, which involves two major steps [ 16 , 21 ]:

Training : obtaining features for the definition of the user behavior pattern;

Recognition : matching observed features against user behavior pattern.

Fig. 1 Behavioural intrusion detection (adapted from [ 42 ])
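The two steps above can be illustrated with a minimal sketch. It assumes each typing sample is a fixed-length vector of timing values, models the profile as a per-feature mean and standard deviation, and uses a mean absolute z-score with an arbitrary threshold of 2.0. These modelling choices, the function names and the sample values are assumptions made for illustration, not the method of any study covered by the review.

```python
from statistics import mean, stdev


def train_profile(samples):
    """Training: summarise each feature column by its mean and standard deviation."""
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]


def anomaly_score(profile, observation):
    """Mean absolute z-score of one observation against the stored profile."""
    eps = 1e-9  # guards against division by zero when a feature never varies
    scores = [abs(x - mu) / (sigma + eps) for (mu, sigma), x in zip(profile, observation)]
    return mean(scores)


def is_intrusion(profile, observation, threshold=2.0):
    """Recognition: flag the observation when it deviates too far from the profile."""
    return anomaly_score(profile, observation) > threshold


# Hypothetical timing vectors (milliseconds) collected from the legitimate user
training = [
    [100, 150, 120],
    [110, 140, 125],
    [105, 155, 118],
    [95, 145, 122],
]
profile = train_profile(training)

genuine = [102, 148, 121]   # close to the training samples
intruder = [180, 60, 200]   # very different typing rhythm

print(is_intrusion(profile, genuine))
print(is_intrusion(profile, intruder))
```

In a real system the threshold would be tuned on held-out data, since it trades off false alarms against missed intrusions.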

A key issue in the application of user profiling is how to define the profile, that is, which aspects will be observed. Ideally, the chosen aspects should allow the identification of a user within a group of users and, at the same time, maintain similar values over time for the same user [ 21 ]. There are a number of aspects that can be used for the definition of the user profile, such as keystroke dynamics , system audit logs, e-mail and command line use [ 46 ].

This work studies keystroke dynamics as an aspect to be analysed by the behavioural intrusion detection system. Keystroke dynamics analyses how users type by monitoring the keyboard input. As a result, models that represent the regular typing rhythm of the user are defined. Afterwards, these models are used for recognition [ 28 ], in such a way that typing rhythms deviating from the model are classified as being from intruders. Here, we have chosen keystroke dynamics instead of other aspects because it may be used either in the initial authentication of a system or as continuous authentication after the initial authentication. This makes the technology more flexible than an analysis of system audit logs or e-mail behaviour.

Keystroke dynamics can be applied in two ways: static text or dynamic text . Static text only analyses fixed expressions such as, for example, a password. In dynamic text , by contrast, the analysis covers any text typed by the user. Keystroke dynamics in static text requires less effort to implement and has also reached lower error rates in the literature [ 11 ].

Two distinctive processes are involved in keystroke dynamics : feature extraction and classification of the extracted features . In the first process, a number of features are extracted for the recognition of a user. These features should represent how the user behaves in terms of keystroke dynamics .
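As a minimal sketch of the feature extraction process, the example below derives two timing features from press and release timestamps: dwell time (how long each key is held) and flight time (the gap between releasing one key and pressing the next). These two features are used here as an assumption for illustration; the `KeyEvent` type and the timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeyEvent:
    key: str
    press_ms: float    # timestamp of key press, in milliseconds
    release_ms: float  # timestamp of key release, in milliseconds


def extract_features(events):
    """Return (dwell, flight) timing lists for a sequence of key events."""
    # Dwell time: how long each key was held down
    dwell = [e.release_ms - e.press_ms for e in events]
    # Flight time: gap between releasing one key and pressing the next
    flight = [nxt.press_ms - cur.release_ms for cur, nxt in zip(events, events[1:])]
    return dwell, flight


# Hypothetical events for the first three characters of a typed password
events = [
    KeyEvent("p", 0, 90),
    KeyEvent("a", 150, 230),
    KeyEvent("s", 300, 385),
]
dwell, flight = extract_features(events)
print(dwell, flight)
```

For a static text such as a password, the two lists can be concatenated into one fixed-length vector and passed to a classifier.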

In the second process, which corresponds to the feature classification, several algorithms can be used. For instance, machine learning algorithms, like neural networks [ 48 ] and support vector machines [ 19 ], have been applied in this classification, which consists of verifying whether or not the typing features belong to a specific user.

3 Systematic review

Systematic literature review (called just systematic review in this paper) is a method for conducting bibliographic reviews in a formal way, following well-defined steps, which allows the results to be reproduced. In addition, the protocol adopted for the conduction of the review must ensure its completeness. This review method is commonly used in other areas, mainly in Medicine  [ 7 ], and has several reported benefits, like less susceptibility to bias [ 33 ]. In the area of Computing , this method of review is more disseminated in Software Engineering .

The application of the systematic review involves three major phases: planning , conduction and presentation of results . In the first phase, a review protocol is defined, in which research questions are specified along with search strategies. After that, in the second phase, the review protocol is applied and the information is extracted from the returned references. References used for the extraction of information are called primary studies , while the review is a secondary study . Finally, the third phase defines the way to present the results, and the final report is written. The items comprised in each of the three phases are [ 33 ]:

3.1 Planning

Identification of the review need : a systematic review has the goal of summarizing all information regarding a specific topic. However, before starting a systematic review , the need for this review has to be checked. This check should verify, for instance, whether previously published systematic reviews deal with the topic under investigation and whether the protocol of these reviews meets the requirements of the research.

Commissioning (optional) : in some cases, due to the lack of time or specific knowledge, one may need to request that other researchers conduct the systematic review .

Specification of the research questions : this is considered to be the most important part of the systematic review , as these questions will guide all the following steps, such as the search for primary studies and the extraction and analysis of information.

Development of the review protocol : this step defines strategies to be used for the search, selection and evaluation of the references. In addition, the information to be extracted from each of the selected references is also defined.

Protocol evaluation (optional) : as the review protocol is an essential part of the systematic review , it is recommended that it be reviewed by other researchers.

3.2 Conduction

Reference search : search for the greatest possible number of references which can answer the research question in order to avoid bias. In the systematic review , the search is performed with increased rigour, with the pre-definition of search expressions and databases, making it different from traditional reviews.

Selection of primary studies : after reference search, the studies that are in fact relevant for the research must be selected, by the use of inclusion/exclusion criteria.

Quality evaluation : each of the selected references undergoes a quality evaluation. This evaluation may be used with diverse aims, like contributing to the inclusion/exclusion criteria or supporting the summary of results, by measuring the importance of each study.

Information extraction : the information extraction from the references must be done with the support of forms defined during the planning phase of the systematic review .

Data synthesis : this step corresponds to summarizing the results attained during the review. This summary may involve qualitative and quantitative aspects. For quantitative aspects, a meta-analysis may also be applied.

3.3 Reporting the review

Specification of the dissemination mechanisms and formulation of the report : dissemination of the results attained by the systematic review . This can be done by publishing in academic journals and conferences or even in web sites.

Report evaluation (optional) : this evaluation can be requested from experts in the area of the research. If the review is submitted to a journal or conference, the review process of the publication can be considered an evaluation of the report.

The explicit definition of the review protocol allows the results to be reproduced. The review presented in this paper was performed by two researchers in the planning phase, but by just one in the conduction phase. Due to that, this review can be called a quasi-systematic review , as it follows the principles of a systematic review , but was not conducted by two researchers in all phases. This term, quasi-systematic review , was also used in previous work [ 35 ]. More details on how to carry out each of the phases are discussed in the next sections, in which the systematic review process is applied to the topic of keystroke dynamics for intrusion detection .

4 How the systematic review was applied

In this work, the application of the systematic review has the goal of studying the state of the art in keystroke dynamics in order to identify:

Advantages and disadvantages of using keystroke dynamics in intrusion detection;

Extracted features;

Classification algorithms applied;

Performance measures commonly adopted;

Benchmarking datasets, which are useful for conducting comparative experiments in the area.

Before presenting details of how the systematic review was applied in this work, it is important to highlight that we only considered references indexed by reference databases available on the Internet and written in English .

4.1 Planning

According to a search carried out by the authors, there are no published systematic reviews that meet the goals of this work. Besides, the most recent review article on keystroke dynamics known to the authors was submitted for publication in 2009 [ 28 ]. Moreover, some of our aims, such as the identification of benchmarking datasets, were not met in that publication. Hence, the conduction of the review in this work is justified.

4.1.1 Research questions

In view of the need of the systematic review , we defined a research question and some respective sub-questions to meet the established goals:

How is keystroke dynamics used for intrusion detection?

What are the advantages and disadvantages of using keystroke dynamics for intrusion detection?

What features are extracted from the typing data?

What classification algorithms are applied? What algorithms are used in the performance comparisons?

What measures were used to evaluate the performance? What was the performance achieved?

What datasets are used to measure the performance of the classifier? How many users took part in the tests performed?

4.1.2 References search

After defining the research question, we enumerated a list of terms related to papers that could answer it: keystroke dynamics, typing dynamics, keystroke biometric(s), keystroke authentication, keystroke pattern(s), typing pattern(s), behaviour intrusion detection, behavior intrusion detection, behavioral IDS, biometric intrusion detection, user profiling, behavioural biometrics, behavioral biometrics, continuous authentication, typing biometric(s), keypress biometric(s), keystroke analysis . The use of various terms for the same topic, sometimes even synonyms, contributes to the completeness of the search [ 1 ]. From this list of terms, we built search expressions for each database of references. The basic search expression is the conjunction of each term in the list using the logical connective \(OR\) .

Nevertheless, after some tests with this search expression, we observed that many of the returned references dealt with topics unrelated to the research question, such as personalization systems and recommender systems. For this reason, we identified some terms that could exclude these unrelated topics: web search, personalized information, personalized content, content delivery, recommendation system, recommendations system, information retrieval, personalizing, personalization, recommender. The basic search expression was then modified to rule out the exclusion terms by combining the logical connectives \(AND\) and \(NOT\): the disjunction of the search terms \(AND\) \(NOT\) the disjunction of the exclusion terms.
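As a minimal sketch of how such an expression can be assembled, the snippet below joins the search terms with OR and appends the exclusion terms with AND NOT. The function name `build_query` and the quoting of phrases are assumptions made for illustration; each real database has its own syntax, and only a few of the terms listed above are used here.

```python
def build_query(include_terms, exclude_terms):
    """Build '(t1 OR t2 ...) AND NOT (e1 OR e2 ...)' from two term lists.

    Hypothetical helper: phrases are wrapped in double quotes, which is a
    common but not universal convention among reference databases.
    """
    def group(terms):
        # Join quoted phrases with OR and wrap them in parentheses.
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

    query = group(include_terms)
    if exclude_terms:
        query += " AND NOT " + group(exclude_terms)
    return query


include_terms = ["keystroke dynamics", "typing dynamics", "continuous authentication"]
exclude_terms = ["recommender", "personalization"]
print(build_query(include_terms, exclude_terms))
```

Keeping the two term lists as data makes it straightforward to generate one adapted expression per database from the same source.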

This search expression was applied to several databases that index references in the computing area. As each database has its own syntax for search expressions, the basic search expression presented here was adapted to each database, as specified in Appendix A. The following databases were considered in this work:

ACM Digital Library ( http://dl.acm.org/ )

IEEE Xplore ( http://ieeexplore.ieee.org/ )

Science Direct ( http://www.sciencedirect.com/ )

Web of Science ( http://isiknowledge.com/ )

Scopus ( http://www.scopus.com/ )

4.1.3 Selection criteria

The last part of the planning phase is the definition of the selection criteria (inclusion and exclusion) that will be applied to the returned references. In this systematic review, all the returned references are included for analysis in the next steps, except the ones that meet the following exclusion criteria:

1. Publications that do not deal with keystroke dynamics for intrusion detection: the aim of this review is to work with intrusion detection, which encompasses authentication systems. Therefore, references that do not meet this requirement were not included.

2. Publications with one page, posters, presentations, abstracts, editorials, texts in magazines or newspapers, and duplicate publications in terms of results (except the most complete version): these references lack enough information to answer the research question. This criterion also avoids unnecessary work in cases where the same study is published in different versions.

3. Publications hosted in restricted-access services that we could not access, or publications not written in English.

In this phase, we also created a quality score to be applied to the returned references. This score was designed to highlight the references that better answer our research question. The quality score is the sum of the scores obtained in each of the assessed items. For each item, a reference scores 1 if it fully meets the item, 0.5 if it partially meets it and 0 if it does not meet it. As there are nine items, the possible scores range between 0 and 9, with higher values indicating better publications according to the established research criteria. The items are:

1. Were the goals clearly presented in the beginning of the work?

2. Were the advantages/disadvantages of keystroke dynamics discussed?

3. Is the dataset available to be reused?

4. Was it detailed how the feature vector is generated?

5. Were the values of the algorithm parameters presented?

6. Were the applied approaches detailed so as to allow them to be replicated?

7. Were experimental tests conducted?

8. Were the results compared to previous research in the area?

9. Were the limitations of the study presented?

The quality criteria were defined considering that research may present problems in the following steps: design, conduction, analysis and conclusion [ 33 ]. Items 1 and 2 refer to the design step, items 3–6 to the conduction step, items 7–8 to the analysis step and item 9 to the conclusion step. Some of the items used to assess the quality were based on the list in [ 33 ], which presents several items to be evaluated in references.
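The scoring rule described above can be sketched in a few lines. This is only an illustration: the function name `quality_score` and the example answers are assumptions, not data from the review.

```python
ALLOWED_SCORES = {0, 0.5, 1}   # does not meet / partially meets / fully meets
N_ITEMS = 9                    # number of quality items in the checklist


def quality_score(item_scores):
    """Sum the per-item scores of one reference (hypothetical helper)."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    for score in item_scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"invalid item score: {score}")
    return sum(item_scores)


# Illustrative answers for items 1 to 9 of an imaginary reference.
example = [1, 1, 0, 1, 0.5, 1, 1, 1, 0.5]
print(quality_score(example))
```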

4.1.4 Information extraction

Still in the planning phase of the systematic review, we defined a set of information to be extracted from each selected reference (after the application of the exclusion criteria), as follows:

Basic information about the publication (title, authors, name and year of publication)

Were performance tests conducted?

Type of device (e.g. PC, mobile)

Best performance achieved: algorithm, measure and performance

Number of users in the tests

Algorithms used in the tests

Extracted features

Is the test dataset available to be reused? Where?

Type of verification: static text or dynamic text?

Observations

These items were defined in line with the research question, in order to answer it and guide the information extraction in the conduction phase of this review.

4.2 Conduction

Following the review protocol defined in the planning phase, we started conducting the systematic review.

4.2.1 Application of the search expressions

The first step was to apply the search expressions in each database of references and save the returned results. Apart from the returned references, we also included a reference previously known by the authors, but not indexed by the databases used in this review: [ 15 ]. This reference is mentioned in several papers as being one of the first publications about keystroke dynamics. Table 1 shows the number of references returned by each database on 18 February 2013.

These results were centralized in order to continue the review, using a tool called Mendeley (available at: http://www.mendeley.com/ ). We used this tool to import the results exported from the databases. Mendeley has a series of features that are useful for systematic reviews, such as the search for duplicates, the organization of references by category and the association of entries with PDF files stored on the computer.

4.2.2 Selection of references

After the centralization of the information returned from the search databases, duplicate references were removed. Duplicate references may appear since databases can have some intersection in the indexed data, as in the case of Scopus and Web of Science.

Once the removal of duplicates was finished, a quick reading of the remaining references was performed. Before starting this step, we needed to download the full text of each publication. However, 27 of them could not be downloaded because they were hosted in services not available from our university (exclusion criterion 3). Consequently, the number of eligible references was reduced again. Finally, another quick reading of the eligible references was performed to revalidate exclusion criteria 1 and 2. A great number of references that do not deal with keystroke dynamics for intrusion detection were eliminated based on title and abstract alone; nevertheless, some references were eliminated only after reading their full text. Once exclusion criteria 1 to 3 had been applied, secondary studies, commonly known as reviews or surveys, were removed. There were only three of them: [ 11 , 28 , 40 ]. Table 2 shows the number of references remaining after the application of each step.

With the application of all exclusion criteria, 200 references (Table 2 ) were left for the next steps: information extraction and quality assessment. Aiming at accelerating these tasks, we created a spreadsheet with all the items for information extraction and quality assessment discussed in the planning phase (Sect. 4.1 ). This spreadsheet was then filled with the information from the references.

This was the most time-consuming part of the systematic review, due to the need to read several texts in detail. In addition, the information to be extracted was sometimes not directly stated in the text. In some publications, tables summarized the tested algorithms and their performance [ 19 ], or almost all the information could be extracted from the abstract [ 22 ]. However, this was not the case for other publications, which had to be read more deeply to find the desired information. This observation may be related to the one made in [ 7 ], which highlights the fact that abstracts in Computing are usually not well structured, making it difficult to get information about a publication from the abstract alone. According to [ 7 ], the scenario is different in medicine, an area in which abstracts are, in general, better structured and usually contain more information about the publication.

4.2.3 Quality assessment

Due to the high number of selected references, they were sorted in descending order of quality score and only the ones with the highest scores are discussed in detail here. For the purpose of this review, only papers with a quality score equal to or higher than 7.5 were considered, resulting in 16 publications. Focusing on the references with higher scores concentrates the effort on the references most relevant to the research question, as the quality scores were specifically designed for this purpose.
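The ranking and cut-off step can be sketched as follows. The reference labels and scores below are invented for illustration, and `select_top` is a hypothetical helper; only the 7.5 threshold comes from the text.

```python
def select_top(scores_by_reference, threshold=7.5):
    """Return (reference, score) pairs at or above the threshold, best first."""
    ranked = sorted(scores_by_reference.items(), key=lambda item: item[1], reverse=True)
    return [(ref, score) for ref, score in ranked if score >= threshold]


# Made-up quality scores for four imaginary references.
scores = {"ref-A": 8.5, "ref-B": 5.0, "ref-C": 7.5, "ref-D": 7.0}
print(select_top(scores))
```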

The graph in Fig. 2 shows the number of publications for each quality score. The average score among those different from zero was 5.54 and, as shown in Fig. 2, the scores follow an approximately normal distribution. The maximum score reached was 8.5.

Publications by quality score

Another aspect analysed was the number of selected publications by year, as shown in the graph in Fig. 3. In this graph, it is important to highlight the growth trend in the number of publications by year in the area of keystroke dynamics. This growth was steeper between 2002 and 2006. Such a trend indicates that the area has been receiving more attention from the scientific community, which may justify additional research efforts in keystroke dynamics.

Publications by year in keystroke dynamics. The growth trend illustrates that the field is gaining new momentum, justifying additional research efforts

Both graphs consider only the references with available texts.

In this section, we focus on the 16 publications with the highest quality scores and on some papers referenced by them. The following subsections are organized so as to answer each of the research sub-questions: advantages and disadvantages of keystroke dynamics, feature extraction, classification algorithms, performance evaluation and benchmarking datasets.

5.1 Advantages and disadvantages

Authentication of users is done by the use of credentials, also known as authentication factors, which can be [ 47 ]:

what the user knows (e.g. password);

what the user has (e.g. access card, token);

what the user is/does (e.g. biometrics: recognition by fingerprint, iris, keystroke dynamics, voice recognition);

some combination of the above items.

The primary method of authentication, be it for e-commerce or for military purposes, is a simple login and password [ 12 ]. This method relies on the assumption that the password is kept secret [ 40 ]. However, this is not always the case, which leads to a number of weaknesses [ 10 ]:

Passwords may be shared by several users, resulting in unauthorized access;

Passwords may be copied without authorization;

Passwords may be guessed, particularly for easy passwords, as when someone uses his/her birthday as a password [ 43 ].

Moreover, even in scenarios in which the user authentication is performed by the use of access cards, the security is compromised. This is because the card ownership can be shared with an unauthorized user and it may also be stolen [ 26 ].

These problems, along with the widespread use of the Web, contributed to the expansion of identity theft, which occurs when a person uses the personal information of someone else to illegally impersonate that person [ 38 ]. In recent years, identity theft has become the fastest-growing crime in the USA [ 6 ]. Furthermore, worldwide losses due to identity theft were estimated at around US$ 221 billion in 2003 [ 25 ]. According to [ 29 ], password weaknesses were the factor most exploited by insiders (users from the same institution that is the victim of the attack).

One way to mitigate this problem is the use of biometric technologies to enhance the security provided by passwords. In the security context, biometrics is a science which studies methods for the determination of user identity based on physiological and behavioral features [ 26 ]. Keystroke dynamics, which is considered a biometric technology, can be used without any additional hardware cost, in contrast to other biometric technologies (e.g., iris, fingerprint), which need specific devices to capture biometric data [ 24 , 37 ]. In addition, the level of transparency in the use of keystroke dynamics is high [ 40 ]. This means that the user does not need to perform any specific operation to be authenticated by keystroke dynamics [ 3 ]. This factor contributes to an increased acceptance of keystroke dynamics among users.

Recognition precision by keystroke dynamics may be affected by the presence of keyboards with different characteristics in the same environment. Nevertheless, it is expected that such differences do not significantly impair the recognition performance and, consequently, still enable proper user identification [ 24 ]. This can be compared to signature recognition biometrics in which, regardless of the pen used, the system is still able to differentiate between legitimate and illegitimate users [ 24 ].

Furthermore, false alarm rates (when a legitimate user is classified as an intruder) in keystroke dynamics are usually high and do not meet standards in some access control systems, such as the European. Additionally, differences among systems, such as the precision with which typing times are captured, may negatively affect the performance of the classifier by introducing noise [ 30 ]. Another issue raised in the area of behavioral biometrics is the adaptation to changing profiles. A person's behavior may change over time as a result of learning, and such a change should be reflected in the profile stored in the security system; otherwise, performance may be impaired. However, this task is far from simple and represents a challenge in the area [ 27 ].

5.2 Extracted features

Apart from the text itself, the keyboard provides the instants in which each key is pressed and released. From these basic data, features are extracted and used as input for the classification algorithm. In this paper, we adopted the following notation to represent the extracted features (Fig. 4 shows these features in a graphical way, in which the down and up arrows represent, respectively, the instants of pressing and releasing of each key):

Typing data and features (adapted from [ 42 ])

DU1: time difference between the instants in which a key is pressed and released. This feature represents the time during which the key is held down and is also called dwell time by some authors [ 38 ].

DU2: time difference between the instants in which a key is pressed and the next key is released.

UD: time difference between the instants in which a key is released and the next is pressed. This feature is also known as flight time [ 38 ].

DD: time difference between the instants in which a key is pressed and the next key is pressed.

UU: time difference between the instants in which a key is released and the next key is released.

The feature vector is then generated based on these features. An example of a feature vector for an expression of four keys is shown in Fig. 5 .

Example of a feature vector (adapted from [ 41 ])
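The five timing features defined above can be computed directly from the press and release instants. The sketch below is illustrative only: the `KeyEvent` class, the function name `timing_features` and the timestamps are assumptions, and the dictionary layout is not meant to reproduce the feature vector of Fig. 5.

```python
from dataclasses import dataclass


@dataclass
class KeyEvent:
    key: str
    down: float  # instant the key is pressed, in ms
    up: float    # instant the key is released, in ms


def timing_features(events):
    """Compute DU1, DU2, UD, DD and UU for a sequence of key events."""
    feats = {"DU1": [], "DU2": [], "UD": [], "DD": [], "UU": []}
    for e in events:
        feats["DU1"].append(e.up - e.down)        # dwell time of each key
    for cur, nxt in zip(events, events[1:]):
        feats["DU2"].append(nxt.up - cur.down)    # press -> next release
        feats["UD"].append(nxt.down - cur.up)     # flight time; negative if keys overlap
        feats["DD"].append(nxt.down - cur.down)   # press -> next press
        feats["UU"].append(nxt.up - cur.up)       # release -> next release
    return feats


# Made-up timestamps for three consecutive keys.
events = [KeyEvent("a", 0, 90), KeyEvent("b", 150, 230), KeyEvent("c", 300, 370)]
print(timing_features(events))
```

For n keys, this yields n values of DU1 and n − 1 values of each of the other four features.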

A summary of the features used in each of the selected references is shown in Table 3. From the data in this table, we generated the histogram shown in Fig. 6. As can be observed, features DU1 (dwell time) and UD (flight time) are the most used.

Number of references that employed each feature

Another feature used in previous research was the pressure applied to the keys [ 8 , 13 ], but the extraction of this feature requires specialized hardware. However, in view of the increasing availability of touch screen devices, the cost of using this feature may decrease over time. In a recent work [ 8 ], the pressure on a touch-screen smartphone was evaluated in a keystroke dynamics scenario. Error rates decreased from 12.2 to 6.9 % when the pressure was also considered.

In [ 37 ], a process of equalization over the feature vector was applied. The authors argue that this transformation may highlight important aspects of the feature vector, as observed in other areas, like digital communications and image processing. According to the reported results, the application of this equalization improved the performance (lower error rate) attained by several algorithms from previous research.

The studies in [ 17 , 19 ] evaluated the use of discretization over the feature vectors. Each value in the feature vector is discretized into five ranges. The discretized data are then classified by a two-class SVM, using both negative and positive samples for training. According to the authors, the application of the SVM together with this discretization obtained lower error rates than other approaches seen in the literature (e.g., neural networks and distance-based classifiers).
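To make the idea of discretizing a timing value into five ranges concrete, the sketch below uses equal-width ranges over a fixed interval. How the ranges are actually chosen in [ 17 , 19 ] is not described here, so the equal-width rule, the interval bounds and the function name `discretize` are all assumptions.

```python
def discretize(value, low, high, n_bins=5):
    """Map a value to a bin index in 0..n_bins-1 (equal-width bins, assumed rule)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)            # values outside [low, high] go to the edge bins
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)                   # value == high falls in the last bin


# Made-up timing values in ms, discretized over an assumed 0-500 ms interval.
timings = [90, 150, 499, 620, 20]
print([discretize(t, 0, 500) for t in timings])
```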

In [ 24 ], the authors performed a comparative analysis of seven feature sets, covering all combinations of DU1, DD and UU. The best performance was achieved by the set DU1, UU. However, the feature UD was not considered in their analysis, even though it is one of the most used features in previous papers according to our review, as shown in Fig. 6.

Another study on extracted features was conducted by [ 3 ]. In addition to considering “character” keys, this study also investigated the Shift key. In passwords containing a mixture of lower case and upper case letters, the Shift key is normally used. Consequently, the analysis of the Shift key may be an additional factor to classify users. According to their tests, analysing the Shift key reduces the error rates of the classifier.

An important factor in keystroke dynamics is the resolution of the captured data. In the MS Windows operating system, for example, the notification of keyboard events, such as key press and release, does not distinguish time differences smaller than 15.625 ms. In [ 30 ], the effect of different resolutions was evaluated using an external device with a resolution of 100 \(\upmu\)s. The high-resolution data were then used to derive lower-resolution samples. As expected, higher-resolution data lead to better classification accuracy. Low resolutions (e.g., 100 ms) resulted in error rates of 50 %, which is very poor performance.
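The effect of a coarse clock can be illustrated by rounding timestamps down to multiples of the clock resolution. This is only a sketch: the exact procedure used in [ 30 ] to derive lower-resolution samples is not described here, so the floor-based rule, the function name `quantize` and the timestamps are assumptions.

```python
import math


def quantize(timestamp_ms, resolution_ms):
    """Round a timestamp down to the nearest multiple of the clock resolution."""
    return math.floor(timestamp_ms / resolution_ms) * resolution_ms


# Made-up press and release instants of one key, in ms.
down, up = 103.2, 187.9

for resolution in (15.625, 100):
    q_down = quantize(down, resolution)
    q_up = quantize(up, resolution)
    # Dwell time as it would be observed with the coarser clock.
    print(resolution, q_down, q_up, q_up - q_down)
```

With a 100 ms clock, both instants fall into the same tick and the observed dwell time collapses to zero, which illustrates how timing information is lost at low resolutions.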

5.3 Classification algorithms

A number of algorithms have been used to classify users in keystroke dynamics. Table 4 shows the algorithms studied in the 16 selected publications. It is important to highlight that, apart from algorithms known from the Machine Learning literature, such as Support Vector Machines (SVM) [ 19 ] and Nearest Neighbour [ 30 ], some authors proposed new algorithms [ 22 , 36 ]. These new algorithms were also used in comparisons performed in later research [ 37 ].

The use of static and dynamic text was tested in [ 36 ]. At the time the work was published, the concept of recognizing users by keystroke dynamics was relatively new. Therefore, the authors carried out experiments to validate the idea of classifying users by their typing rhythm. Their experiments validate the approach, achieving an accuracy rate of 92.14 %.

As discussed in previous works [ 19 , 31 ], the amount of training samples may affect the classifier performance. In general, the more representative the samples, the higher the classification accuracy. In [ 9 ], a method to generate new training samples based on the legitimate user was proposed. The samples are generated using re-sampling in the time domain and the discrete wavelet transform (DWT). Although this method generates more samples, it remains an open question whether these new samples actually increase representativity.

The use of numeric keypads was analysed by [ 43 ]. An advantage of using numeric keypads is that it would be easier to implement keystroke dynamics technology in mobile devices, such as cell phones, which usually only have a numeric keypad. The authors conducted experiments using eight number passwords, obtaining an EER of 3.6 %.

Novelty detectors were tested in [ 48 ], namely an auto-associative multilayer perceptron (AAMLP) and a one-class support vector machine (one-class SVM). According to their experiments, error rates were similar for both novelty detectors. Nevertheless, the one-class SVM was more efficient in terms of computational resource usage.

Several tools were used to carry out the tests of the classification algorithms in these papers. In the case of neural networks, two tools were identified: the library ffnet and the package AMORE, which were employed by [ 19 ] and [ 30 ], respectively. For the other algorithms, we identified the following tools: [ 19 ] applied the library libsvm for an SVM and [ 43 ] applied the Hidden Markov Toolkit (HTK) for training an HMM. Some classification algorithms were implemented by the authors themselves, for example in Java with the NetBeans development environment [ 3 ] and in C++ with the library xview [ 36 ].

5.4 Performance evaluation

With regard to the performance evaluation, through the review, we found four main measures:

FAR and FRR: the false acceptance rate (FAR) measures the percentage of times that an intruder is erroneously accepted as being legitimate and the false rejection rate (FRR) measures the percentage of times that a legitimate user is wrongly rejected [ 40 ]. Hypothetically, these two rates vary according to the graph in Fig. 7 , depending on the sensitivity level of the algorithm: when one rate decreases, the other increases.

EER: the equal error rate (EER) is the error value at the point where FAR and FRR are equal [ 11 ]. In contrast to FAR and FRR, this measure does not depend on the sensitivity level of the classification algorithm.

Accuracy rate: only measures the percentage of correct classifications attained by the algorithm.

Integrated error: the area under the curve plotted with the FAR and FRR rates, as shown in Fig. 8. The value of the shaded area is the integrated error. Smaller areas represent better performance.

FAR, FRR and EER (adapted from [ 11 ])

Example of integrated error (adapted from [ 34 ])
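The relation between FAR, FRR and EER described above can be sketched with a simple threshold scan. The snippet assumes that the classifier outputs a dissimilarity score and accepts a sample when the score is at or below the threshold; this convention, the function names and the scores are illustrative assumptions, and the EER is only approximated by the threshold at which the two rates are closest.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for a given acceptance threshold."""
    # An impostor is falsely accepted when its score is at or below the threshold.
    far = sum(s <= threshold for s in impostor_scores) / len(impostor_scores)
    # A legitimate user is falsely rejected when the score is above the threshold.
    frr = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def approximate_eer(genuine_scores, impostor_scores):
    """Scan observed scores as thresholds; return (threshold, FAR, FRR) where the rates are closest."""
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best


# Made-up dissimilarity scores.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.5, 0.7, 0.9]
print(approximate_eer(genuine, impostor))
```

Raising the threshold lowers the FRR and raises the FAR, which is the trade-off between the two rates described above.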

Several aspects may affect the performance of a biometric system based on keystroke dynamics. In [ 31 ], the authors studied which aspects have the greatest influence on keystroke dynamics performance. Their study showed that the classification algorithm, the amount of training samples and the methods used to update the user model play a key role in the system performance. Other aspects, such as the set of extracted features and the user's typing experience, had minor effects on the overall performance.

Another fundamental issue in performance evaluation is the way keystroke data are collected. For instance, a user may type a predefined text (transcription) or just freely type something (free composition). Most papers in keystroke dynamics adopt the transcription method, as it is easier to apply. However, does it have an impact on the classifier performance? A recent study showed that there is no significant difference between the two methods [ 32 ]. Thus, the authors encourage researchers to continue using transcription.

Tables 5 and 6 summarize the best results reached in the selected references. The first table shows the papers that used EER to measure the performance and the second table shows the papers that evaluated the performance using FAR and FRR. One of the returned papers reported the results in terms of accuracy rates and, therefore, it is not shown in Tables 5 and 6 . Based on a Bayesian classifier, the accuracy rate attained was 92.14 % in a dataset containing 63 users [ 36 ].

Nonetheless, studies cannot be compared directly by their reported performance values alone, as there are a number of differences between them, such as the dataset and the evaluation measures used. According to Tables 5 and 6, the number of users that took part in the tests varied considerably among the selected studies, ranging from 12 to 205. Moreover, even when the same algorithm is applied in several papers, the comparison is still complex, as the parameter values may be different. This difficulty in performing comparisons in the area of keystroke dynamics due to the non-uniformity between studies was also mentioned in [ 40 ]. The use of benchmarking datasets can improve this scenario, as it would allow a more reliable comparison between studies in keystroke dynamics.

5.5 Benchmarking datasets

In view of the fact that performance in keystroke dynamics is highly dependent on the dataset, the identification of benchmarking datasets turns out to be fundamental. Furthermore, the use of readily available datasets saves research time and allows greater focus on the development of the classification algorithm [ 18 ].

As there are few benchmarking datasets in keystroke dynamics, all 200 references were considered to answer this item of the research question. In these references, we identified five datasets (items 1–5), and another one (item 6) was found in [ 44 ].

1. GREYC [ 18 ]: 133 users typed the text “greyc laboratory” on two different keyboards; 100 of the users provided samples in at least five sessions. Samples were collected over a period of two months. Link: http://www.ecole.ensicaen.fr/~rosenber/keystroke.html .

2. Web-GREYC [ 20 ]: 118 users typed imposed and free login/passwords during one year. The authors claim that this dataset has the largest number of different passwords in a public dataset. Link: http://www.epaymentbiometrics.ensicaen.fr/index.php/app/resources/84 .

3. BioChaves [ 37 ]: 47 users formed four datasets: A (10 users), B (8 users), C (14 users) and D (15 users). In datasets A and B, users typed four fixed expressions (“chocolate”, “zebra”, “banana” and “taxi”), while in datasets C and D users typed the expression “computador calcula”. Link: http://www.biochaves.com/en/download.htm .

4. CMU [ 31 ]: 51 users typed the text “.tie5Ronal” in eight sessions. Link: http://www.cs.cmu.edu/keystroke/ .

5. CMU-2 [ 32 ]: 20 users provided keystroke data for free text and transcribed text. Link: http://www.cs.cmu.edu/keystroke/laser-2012/

6. Pressure sensitive [ 2 ]: 104 users typed three different texts: “pr7q1z”, “jeffrey allen” and “drizzle”. Link: http://jdadesign.net/2010/04/pressure-sensitive-keystroke-dynamics-dataset/

All datasets presented here contain the basic data for feature extraction (the instants in which each key is pressed and released), with the exception of dataset 2, which does not provide the instant in which each key is released. Additionally, the last dataset (item 6) also stores the pressure applied to each key.

6 Conclusion

Intrusion detection systems based on user behavior are a promising alternative to curb identity theft. Among the features that can be analysed to define the user behavior, this work considered a biometric technology known as keystroke dynamics.

The quasi-systematic review we conducted here may be used to guide future research in this area. A systematic review involves a formal definition of the review protocol before starting the review. Consequently, the results attained by the review may be reproduced by other researchers as a way of validation.

Here, the main goal was to identify the state of the art in keystroke dynamics. In order to perform this task, this review identified advantages and disadvantages of the use of keystroke dynamics, features extracted from keystroke data, classification algorithms, ways of evaluating the performance and datasets for benchmarking.

A possible trend in keystroke dynamics is its use in touch screen devices due to their increasing availability. These devices may provide additional features to increase accuracy. Although we cite a fair number of datasets, some of them have few samples per user (around 10). Consequently, more public datasets on keystroke dynamics are needed. This would allow studies on specific aspects of keystroke dynamics, such as the influence of age, typing skills, keyboard, etc. on the authentication performance. Additionally, the use of more datasets would increase the confidence in the classifier performance comparisons drawn in the literature.

In addition to summarizing key information in the area of keystroke dynamics, this paper also detailed the process involved in the application of the systematic review. This may lead to an increased dissemination of this review method in Computing, particularly in the area of Artificial Intelligence.

Afzal W, Torkar R (2011) On the application of genetic programming for software engineering predictive modeling: a systematic review. Expert Syst Appl 38(9):11984–11997

Allen JD (2010) An analysis of pressure-based keystroke dynamics algorithms. Master’s thesis, Southern Methodist University, Dallas

Bartlow N, Cukic B (2006) Evaluating the reliability of credential hardening through keystroke dynamics. In: Software Reliability Engineering, ISSRE ’06. 17th International Symposium on IEEE, pp 117–126

Bleha S, Slivinsky C, Hussien B (1990) Computer-access security systems using keystroke dynamics. IEEE Trans Pattern Anal Mach Intell 12(12):1217–1222

Boechat G, Ferreira J, Carvalho Filho E (2007) Authentication personal. In: International conference on intelligent and advanced systems, 2007. ICIAS 2007, pp 254–256

Bose R (2006) Intelligent technologies for managing fraud and identity theft. In: Information technology: new generations, 2006. ITNG 2006. Third International Conference on IEEE, pp 446–451

Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–583

Chang TY, Tsai CJ, Lin JH (2012) A graphical-based password keystroke dynamic authentication system for touch screen handheld mobile devices. J Syst Softw 85(5):1157–1165

Chang W (2006) Reliable keystroke biometric system based on a small number of keystroke samples, 3995th edn. Springer, Berlin / Heidelberg

Conklin A, Dietrich G, Walz D (2004) Password-based authentication: a system perspective. In: Proceedings of the 37th annual Hawaii international conference on system sciences, 2004, IEEE, pp 1–10

Crawford H (2010) Keystroke dynamics: Characteristics and opportunities. In: Eighth annual international conference on privacy security and trust (PST), pp 205–212

Desouza KC, Vanapalli GK (2005) Securing knowledge assets and processes: lessons from the defense and intelligence sectors. Hawaii Int Conf Syst Sci 1:1–11

Elftmann P (2006) Diploma thesis: secure alternatives to password-based authentication mechanisms. Master’s thesis, Laboratory for Dependable Distributed Systems, RWTH Aachen University

Filho JRM, Freire EO (2006) On the equalization of keystroke timing histograms. Pattern Recogn Lett 27(13):1440–1446

Gaines R, Lisowski W, Press S, Shapiro N (1980) Authentication by keystroke timing: some preliminary results, technical report. Rand Corporation

Galassi U (2008) Learning behavior profiles from noisy sequences. In: Intrusion detection systems, 38th edn. Springer, US

Giot R, El-Abed M, Hemery B, Rosenberger C (2011) Unconstrained keystroke dynamics authentication with shared secret. Comput Secur 30(6–7):427–445

Giot R, El-Abed M, Rosenberger C (2009) Greyc keystroke: a benchmark for keystroke dynamics biometric systems. In: IEEE international conference on biometrics: theory, applications and systems (BTAS). IEEE Computer Society, Washington, District of Columbia, USA

Giot R, El-Abed M, Rosenberger C (2009) Keystroke dynamics with low constraints SVM based passphrase enrollment. In: IEEE 3rd International Conference on biometrics: theory, applications, and systems, 2009. BTAS 2009, pp 1–6

Giot R, El-Abed M, Rosenberger C (2012) Web-based benchmark for keystroke dynamics biometric systems: a statistical analysis. In: Intelligent information hiding and multimedia signal processing (IIH-MSP), pp 11–15

Goldring T (2003) User profiling for intrusion detection in windows nt. In: Proceedings of the 35th Symposium on the Interface

Gunetti D, Picardi C (2005) Keystroke analysis of free text. ACM Trans Inf Syst Secur 8:312–347

Hocquet S, Ramel J, Cardot H (2006) Estimation of user specific parameters in one-class problems. In: 18th International Conference on Pattern Recognition, 2006. ICPR 2006. vol 4, pp 449–452

Hosseinzadeh D, Krishnan S (2008) Gaussian mixture modeling of keystroke patterns for biometric applications. IEEE Trans Syst Man Cybernetics Part C: Appl Rev 38(6):816–826

Jain A, Pankanti S (2006) A touch of money [biometric authentication systems]. Spectrum IEEE 43(7):22–27

Jain AK, Flynn P, Ross AA (2007) Handbook of biometrics. Springer, New York

Kang P, Hwang Ss, Cho S (2007) Continual retraining of keystroke dynamics based authenticator, 4642nd edn. Springer, Berlin / Heidelberg

Karnan M, Akila M, Krishnaraj N (2011) Biometric personal authentication using keystroke dynamics: a review. Appl Soft Comput 11:1565–1573

Keeney M, Kowalski E, Cappelli D, Moore A, Shimeall T, Rogers S (2005) Insider threat study: computer system sabotage in critical infrastructure sectors. Carnegie Mellon University, Pittsburgh

Killourhy K, Maxion R (2008) The effect of clock resolution on keystroke dynamics. In: Lippmann R, Kirda E, Trachtenberg A (eds) Recent advances in intrusion detection, lecture notes in computer science, vol 5230. Springer, Berlin/Heidelber, pp 331–350

Chapter   Google Scholar  

Killourhy K, Maxion R (2010) Why did my detector do that?! predicting keystroke-dynamics error rates. In: Jha S, Sommer R, Kreibich C (eds) Recent advances in intrusion detection, lecture notes in computer science, vol 6307. Springer, Berlin/Heidelberg, pp 256–276

Killourhy KS, Maxion RA (2012) Free vs. transcribed text for keystroke-dynamics evaluations. In: Proceedings of the 2012 workshop on learning from authoritative security experiment results, LASER ’12, pp 1–8. ACM, New York

Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering, technical report 2007–001. Keele University and Durham University Joint Report

joo Lee H, Cho S (2007) Retraining a keystroke dynamics-based authenticator with impostor patterns. Comput Security 26(4):300–310

Magdaleno AM, Werner CML, de Araujo RM (2012) Reconciling software development models: a quasi-systematic review. J Syst Softw 85(2):351–369

Monrose F, Rubin AD (2000) Keystroke dynamics as a biometric for authentication. Future Gener Comp Syst 16(4):351–359

Montalvao J, Almeida C, Freire E (2006) Equalization of keystroke timing histograms for improved identification performance. In: Telecommunications symposium, 2006 International, pp 560–565

Moskovitch R, Feher C, Messerman A, Kirschnick N, Mustafic T, Camtepe A, Lohlein B, Heister U, Moller S, Rokach L, Elovici Y (2009) Identity theft, computers and behavioral biometrics. In: IEEE International conference on intelligence and security informatics, 2009. ISI ’09. pp 155–160

Pannell G, Ashman H (2010) User modelling for exclusion and anomaly detection: a behavioural intrusion detection system. In: De Bra P, Kobsa A, Chin D (eds) User modeling, adaptation, and personalization, lecture notes in computer science, vol 6075. Springer, Berlin/Heidelberg, pp 207–218

Peacock A, Ke X, Wilkerson M (2004) Typing patterns: a key to user identification. Secur Privacy IEEE 2(5):40–47

Pisani PH (2012) Algoritmos imunológicos aplicados na detecção de intrusões com dinâmica da digitação (in Portuguese). Master’s thesis, Universidade Federal do ABC

Pisani PH, Lorena AC (2011) Detecção de intrusões com dinâmica da digitação: uma revisão sistemática (in Portuguese). Technical Report 06/2011, Universidade Federal do ABC, Santo André, Brazil

Rodrigues R, Yared G (2005) Biometric access control through numerical keyboards based on keystroke dynamics. In: Zhang D, Jain A (eds) Advances in biometrics, lecture notes in computer science, vol 3832. Springer, Berlin/Heidelberg, pp 640–646

Giot R, El-Abed M, Rosenberger C (2011)) Biometrics, Intech, Ch. Keystroke Dynamics Overview, pp 157–182

Scarfone K, Mell P (2007) Guide to intrusion detection and, prevention systems (IDPS).

Wang L, Geng X (2009) Behavioral biometrics for human identification, medical information science reference, IGI Global. Hershey, New York

Windley PJ (2005) Digital identity. O’Reilly Media, Sebastopol

Yu E, Cho S (2003) Novelty detection approach for keystroke dynamics identity verification. In: Liu J, Cheung YM, Yin H (eds) Intelligent data engineering and automated learning, lecture notes in computer science, vol 2690. Springer, Berlin/Heidelberg, pp 1016–1023

Zanero S (2004) Behavioral intrusion detection. In: Aykanat C, Dayar T, Krpeoglu I (eds) Computer and information sciences, ISCIS 2004, lecture notes in computer science, vol 3280. Springer, Berlin/Heidelberg, pp 657–666

Download references

Acknowledgments

The authors would like to thank Universidade Federal do ABC (UFABC), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for financial support.

Author information

Authors and affiliations.

Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP), São Carlos, SP, Brazil

Paulo Henrique Pisani

Instituto de Ciência e Tecnologia (ICT), Universidade Federal de São Paulo (UNIFESP), São José dos Campos, SP, Brazil

Ana Carolina Lorena


Corresponding author

Correspondence to Paulo Henrique Pisani.

Appendix: search expressions

The search expressions used in each of the databases are shown here.

ACM Digital Library

In the case of ACM Digital Library, the expression had to be split, as the complete version exceeded the size limit.

((Title:(”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns”) AND NOT Title:(”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”)) OR (Abstract:( ”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns”) AND NOT Abstract:(”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”)))

((Title:(”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics”) AND NOT Title:(”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”)) OR (Abstract:(”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics” OR ”keystroke analysis”) AND NOT Abstract:(”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”)))

IEEE Xplore

((”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns” OR ”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics” OR ”keystroke analysis”) AND NOT (”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”))

Science Direct

TITLE-ABSTR-KEY((”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns” OR ”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics” OR ”keystroke analysis”) AND NOT (”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”))

Web of Science

TS=((”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns” OR ”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics” OR ”keystroke analysis”) NOT (”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”))

Scopus

TITLE-ABS-KEY((”behavioural intrusion detection” OR ”behavioral intrusion detection” OR ”behavioral IDS” OR ”behavioural IDS” OR ”biometric intrusion detection” OR ”user profiling” OR ”keystroke dynamics” OR ”typing dynamics” OR ”keystroke biometrics” OR ”keystroke biometric” OR ”continuous authentication” OR ”keystroke authentication” OR ”behavioural biometrics” OR ”behavioral biometrics” OR ”keystroke pattern” OR ”keystroke patterns” OR ”typing pattern” OR ”typing patterns” OR ”typing biometric” OR ”typing biometrics” OR ”keypress biometric” OR ”keypress biometrics” OR ”keystroke analysis”) AND NOT (”web search” OR ”personalized information” OR ”personalized content” OR ”content delivery” OR ”recommendation system” OR ”recommendations system” OR ”information retrieval” OR ”personalizing” OR ”personalization” OR ”recommender”))
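The expressions above share one shape: a group of inclusion phrases joined by OR, negated against a group of exclusion phrases, optionally wrapped in a database-specific field prefix. As a minimal sketch of how such strings could be assembled from two term lists, the Python below uses only a subset of the terms and plain straight quotes; the function names and the shortened lists are illustrative assumptions, not part of the original protocol.

```python
"""Assemble boolean search expressions from two term lists (illustrative sketch)."""

# Subset of the inclusion/exclusion phrases listed in the appendix; extend as needed.
INCLUDE_TERMS = ["keystroke dynamics", "typing dynamics", "keystroke biometrics"]
EXCLUDE_TERMS = ["web search", "recommender"]


def or_group(terms):
    """Join terms into a parenthesised OR group of quoted phrases."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(include, exclude, prefix="", negation="AND NOT"):
    """Return prefix((include group) <negation> (exclude group))."""
    return f"{prefix}({or_group(include)} {negation} {or_group(exclude)})"


# Hypothetical variable names; prefixes and negation keywords follow the expressions above.
ieee_query = build_query(INCLUDE_TERMS, EXCLUDE_TERMS)
sciencedirect_query = build_query(INCLUDE_TERMS, EXCLUDE_TERMS, prefix="TITLE-ABSTR-KEY")
wos_query = build_query(INCLUDE_TERMS, EXCLUDE_TERMS, prefix="TS=", negation="NOT")

print(ieee_query)
print(sciencedirect_query)
print(wos_query)
```

Keeping the term lists in one place means every database receives the same phrases, and only the prefix and the negation keyword vary per database.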

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article.

Pisani, P.H., Lorena, A.C. A systematic review on keystroke dynamics. J Braz Comput Soc 19 , 573–587 (2013). https://doi.org/10.1007/s13173-013-0117-7


Received : 18 March 2013

Accepted : 24 June 2013

Published : 10 July 2013

Issue Date : November 2013

DOI : https://doi.org/10.1007/s13173-013-0117-7


  • Behavioral intrusion detection
  • Keystroke dynamics
  • Systematic review




Diagnostic accuracy of keystroke dynamics as digital biomarkers for fine motor decline in neuropsychiatric disorders: a systematic review and meta-analysis

Hessa Alfalahi 1,2, Ahsan H. Khandoker 1,2, Nayeefa Chowdhury 1, Dimitrios Iakovakis 3, Sofia B. Dias 1,2,4, K. Ray Chaudhuri 5,6 and Leontios J. Hadjileontiadis 1,2,3

1 Department of Biomedical Engineering, Khalifa University of Science and Technology, P O Box 127788, Abu Dhabi, United Arab Emirates

2 Healthcare Engineering Innovation Center (HEIC), Khalifa University of Science and Technology, P O Box 127788, Abu Dhabi, United Arab Emirates

3 Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

4 CIPER, Faculdade de Motricidade Humana, Universidade de Lisboa, Cruz Quebrada, 1499-002 Lisbon, Portugal

5 Parkinson’s Foundation Centre of Excellence, King’s College Hospital NHS Foundation Trust, Denmark Hill, London, SE5 9RS United Kingdom

6 Department of Basic and Clinical Neurosciences, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, De Crespigny Park, London, SE5 8AF United Kingdom

Associated data.

The search strategy and the extracted data contributing to the meta-analysis are available in the appendix; any additional data are available on request from the corresponding author.

Diagnosis of neuropsychiatric disorders typically takes place years after substantial neural loss and neuroperturbations, and this unmet need for timely diagnosis affirms the dire need for biomarkers with proven efficacy. In Parkinson’s disease (PD), mild cognitive impairment (MCI), Alzheimer’s disease (AD) and psychiatric disorders, early symptoms are difficult to detect given their mild nature. We hypothesize that fine motor patterns derived from natural interactions with keyboards, also known as keystroke dynamics, could translate classic finger dexterity tests from clinics to populations in-the-wild for timely diagnosis; further evidence is nevertheless required to prove this efficiency. We searched PubMed, MEDLINE, IEEE Xplore, EBSCO and Web of Science for eligible diagnostic accuracy studies employing keystroke dynamics as an index test for the detection of neuropsychiatric disorders as the main target condition. We evaluated the diagnostic performance of keystroke dynamics across 41 studies published between 2014 and March 2022, comprising 3791 PD patients, 254 MCI patients, and 374 psychiatric disease patients. Of these, 25 studies were included in univariate random-effect meta-analysis models for diagnostic performance assessment. Pooled sensitivity and specificity are 0.86 (95% Confidence Interval (CI) 0.82–0.90, I² = 79.49%) and 0.83 (95% CI 0.79–0.87, I² = 83.45%) for PD, 0.83 (95% CI 0.65–1.00, I² = 79.10%) and 0.87 (95% CI 0.80–0.93, I² = 0%) for psychomotor impairment, and 0.85 (95% CI 0.74–0.96, I² = 50.39%) and 0.82 (95% CI 0.70–0.94, I² = 87.73%) for MCI and early AD, respectively. Our subgroup analyses conveyed the diagnostic efficiency of keystroke dynamics for naturalistic self-reported data, and the promising performance of multimodal analysis of naturalistic behavioral data and deep learning methods in detecting disease-induced phenotypes. The meta-regression models showed that diagnostic accuracy and the fine motor impairment severity index increase with age and disease duration for PD and MCI. The risk of bias, based on the QUADAS-2 tool, is deemed low to moderate, and overall we rated the quality of evidence as moderate. Our results support the feasibility of keystroke dynamics as digital biomarkers for fine motor decline in naturalistic environments. Their performance for longitudinal disease monitoring and their therapeutic implications remain to be evaluated in future work. Finally, we propose a partnership strategy based on a “co-creation” approach that stems from mechanistic explanations of patients’ characteristics derived from data obtained in-clinics and under ecologically valid settings. The protocol of this systematic review and meta-analysis is registered in PROSPERO; identifier CRD42021278707. The presented work is supported by the KU-KAIST joint research center.

Introduction

Motor abnormalities, a transdiagnostic domain of an array of neurological and psychiatric disorders that begin years if not decades before clinical diagnosis [1], stem from perturbed brain networks involving cognitive, emotional and motor domains [2, 3]. Despite their well-established neurobiological mechanisms and clinical criteria, early diagnosis remains elusive, which is a major obstacle to effective, disease-modifying treatment and sustained quality of life. In fact, motor symptoms usually progress far enough to warrant clinical diagnosis only after substantial neural loss in neurodegenerative disorders, and at advanced stages of psychiatric disorders. In the case of Parkinson’s disease (PD), for instance, the hallmark symptoms of bradykinesia, rigidity and tremor are detected after a neural loss of at least 50% [4], rendering the accuracy of clinical diagnosis unsatisfactory at early stages, as per a recent meta-analysis [5]. In addition, Alzheimer’s disease (AD) is preceded by a mild cognitive impairment (MCI) stage, characterized by a decline in memory and executive functions that is hardly distinguishable from normal aging, but with a pronounced impact on the activities of daily life [6]. In psychiatric disorders, descriptive clinical scales lack sensitivity to subtle psychomotor symptoms, in either early or remission stages, resulting in a median delay in diagnosis of 14 years after disease onset [7]. Generally, these diseases, which affect the frontal cortical and subcortical circuits, are characterized by executive dysfunction that begins years before diagnosis [1]. This calls for dimensional, fine-grained behavioral measures that alleviate the “floor-ceiling” effect associated with qualitative clinical scales, as well as the inter- and intra-rater diagnostic variability.

According to the scientific vision (2025) of the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative of the National Institutes of Health (NIH) [8], and the Research Domain Criteria (RDoC) of the National Institute of Mental Health (NIMH) [9], automated behavioral quantification, analysis and classification are a crucial start to high-throughput readout of brain activity, whose impact is envisioned to facilitate breakthroughs in early identification and disease management in both neurology and psychiatry. This ever-increasing interest in behavioral measures is accompanied by a lack of hypothesis-supported behavioral experiments [10]. Such experiments require not only experimental design, but also robust computational and analysis methodologies, supported by clinical ground truth and neurobiological theories. With the rapid spread of smartphones in recent years, keyboard typing has become an everyday habit, reflecting a unique behavioral profile for every user [11]. We hypothesize that the kinetic movement of fingers during keyboard/touchscreen typing embeds features related to subtle decline in motor sequencing and force steadiness [12]. These are referred to as Neurological Soft Signs (NSS), sub-clinical motor abnormalities that can serve as early “warning signs” of brain dysfunction, although additional clinical evaluation remains essential for precise diagnosis [13–15].

Alongside the passive acquisition of user-device interactions, advanced Artificial Intelligence (AI) and Machine Learning (ML) methods have allowed the definition of new disease-related features [16, 17], resulting in a new class of digital biomarkers, that of keystroke dynamics. We found that the latter provide a rich space of assessment parameters, similar to that of finger tapping tests that quantitatively score the frequency and speed of tapping in clinical settings, either in single or alternating fashion [13]. Therefore, employing keystroke dynamics for fine motor analysis facilitates a paradigm shift from conventional, subjective diagnosis to objective, in-the-wild assessment. As opposed to other papers in the area of digital phenotyping that provide an overview of an “island of experts”, we concentrate here on a specific digital biomarker class with a plausible connection to neurobiological mechanisms and clinical workflow. Keystroke dynamics have been used for PD and MCI; yet, to the best of our knowledge, no systematic review or meta-analysis has attempted to convey their diagnostic potential or their clinical significance for identifying patterns with plausible connections to disease characteristics.

In this systematic review and meta-analysis, we aimed to appraise the diagnostic performance of keystroke dynamics for an array of neurological and psychiatric disorders. Moreover, we sought to assess the impact of data collection settings, labeling methods, and model characteristics on the diagnostic performance, with emphasis on clinical relevance and ecological validity. In the meta-analysis, we provided a quantitative evaluation of the keystroke dynamics diagnosis of PD, MCI and psychiatric disorders independently, to convey their reproducibility and clinical impact. More importantly, we performed regression analysis to relate patients’ demographic and clinical characteristics to the diagnostic potential of keystroke dynamics, as well as to the derived fine motor impairment index. Lastly, because this area is still far from clinical adoption, we lay out a concrete, detailed, multidisciplinary agenda for all stakeholders involved in digital biomarker research, and open an avenue to multidisciplinary intervention and care delivery in neurology and psychiatry.
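To make the pooled estimates and the I² heterogeneity statistic more concrete, the following is a minimal sketch of univariate random-effects pooling. It assumes the DerSimonian-Laird estimator and a simple binomial variance on untransformed proportions; the review does not state which estimator or transformation was used, and the per-study values below are hypothetical, not extracted from the included studies.

```python
import math


def random_effects_pool(estimates, variances):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    Returns (pooled estimate, 95% CI lower, 95% CI upper, I^2 in percent).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # heterogeneity, percent
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, i2


def proportion_variance(p, n):
    """Binomial variance of a proportion p observed in n subjects."""
    return p * (1.0 - p) / n


# Hypothetical per-study sensitivities and patient counts (illustration only).
sens = [0.80, 0.90, 0.85]
n_patients = [50, 40, 60]
var = [proportion_variance(p, n) for p, n in zip(sens, n_patients)]

pooled, ci_low, ci_high, i2 = random_effects_pool(sens, var)
print(f"pooled sensitivity = {pooled:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), I^2 = {i2:.1f}%")
```

The same function could be applied to per-study specificities; a meta-regression would additionally model the estimates as a function of covariates such as age or disease duration.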

Our search identified 9576 results, of which 4365 were removed as duplicates and 4045 were excluded by automation tools, as illustrated by the PRISMA 2020 flowchart in Fig. 1 . We therefore screened the title and the abstract of 1166 articles, and we identified 1120 as not meeting our eligibility criteria. Thirty-nine (39) full eligible articles were screened and, from their lists of references, we identified seven more articles that met our eligibility criteria. From the resulting 46 articles, five full articles, listed in the supplementary file (p. 7), were excluded. At the end of our systematic search, we ended up with 41 full articles, of which 25 reported sufficient data to be included in the meta-analysis. Overall, 25 studies targeted PD, ten studies targeted mood disorders, and six studies were on mild cognitive impairment and AD. The characteristics of the included studies are summarized in Table 1 and are discussed in the following section.

PRISMA 2020 flow diagram for study selection.

Characteristics of included studies.

NR—not reported; NA—not applicable.

Characteristics of included studies

Of the 41 included studies, we identified 25 on PD with 3791 patients of whom 33.9% were female, six on MCI and early AD with 254 patients of whom 52.4% were female, and ten on psychiatric disorders with 374 patients of whom 56.0% were female (not all studies reported gender information). Regardless of the target condition and the data collection setting (in-the-clinic, in-the-wild), typing patterns, or keystroke dynamics, are always passively collected as series of time stamps of consecutive key presses and releases. The derived kinematic parameters are then used for motor behavior pattern analysis.

Of the PD studies, 12 were conducted in-the-clinic 18 – 29 , while 13 were conducted in-the-wild 30 – 42 . The earliest studies, which were mainly on PD, collected data in clinical settings and attempted to correlate the extracted keystroke dynamic features to the Unified Parkinson’s Disease Rating Scale Part III (UPDRS-III) score, which is currently the gold standard for PD diagnosis 43 . On this basis, PD patients were found to have a longer inter-key delay, also known as Flight Time (FT), a smaller number of total taps (over a fixed tapping duration), and a shorter total distance of finger movement compared to controls 22 . Keystroke dynamics analysis also showed that PD patients are characterized by arrhythmokinesia, that is, hastening or freezing in the typing kinetics 44 , as well as heteroscedasticity or dispersion of FT 24 .

Owing to the establishment of reproducible digital biomarkers on the basis of keyboard interaction patterns, the neuroQWERTY index, for example, was estimated using an ensemble regression model that digests variance and histogram features extracted from 90 s windows of the hold time (HT) series obtained from early stage PD patients 18 . The HT, which is the time elapsed between pressing and releasing a key, was particularly employed in early studies given that it is affected neither by typing skill nor by conscious control. Consequently, the numerical index derived from it, neuroQWERTY, discriminated not only early-stage PD patients from controls, but also de novo PD patients, reflecting its high sensitivity to subtle motor changes. Besides the HT, the flight time (FT), that is, the latency between releasing a key and pressing the next one, was analyzed in 24 to test the hypothesis that PD patients are characterized by higher dispersion and temporal variability compared to controls. The analysis of the typing patterns of PD patients through the neuroQWERTY keyboard revealed their slower fine-motor kinetics as well 40 . Compared to the Alternating Finger Tapping (AFT) test, employing skewness, kurtosis and covariance features of the FT distribution resulted in a higher diagnostic accuracy, meaning that the typing patterns embed specific irregularities of PD motor symptoms, mainly attributed to rigidity and bradykinesia. In an effort to enrich the feature space of keystroke dynamics, Iakovakis et al. 19 developed a two-stage machine learning model based on low- and high-order statistical features derived from the HT, Normalized FT and Normalized Pressure. Their results were consistent with earlier studies, and showed higher and more variable HT, lower pressure and high FT skewness.
While these features were significantly correlated to the motor sub-scores of the UPDRS-III, correlating the outcome of such standardized clinical scales, which encompass a mixture of symptoms not related to fine motor impairments, to the typing behavior might be misleading. Taking this into consideration and with the objective of enhancing the interpretability of the fine-grained indicators, Iakovakis et al. 32 analyzed keystroke dynamics with single items scores of the UPDRS Part III, in order to create a plausible connection between the typing behavior and fine motor impairment symptoms. Employing typing kinetics features as independent variables, the UPDRS single items that correspond to the severity of Bradykinesia, Tremor, Rigidity, and AFT were estimated. The regression results indicated that dominant hand rigidity and bradykinesia were estimated with lower error compared to tremor, meaning that the effect of the latter is less pronounced from the typing cadence.
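
The hold-time and flight-time definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline of any reviewed study: the input format (a list of press/release timestamp pairs in seconds), the function names, and the sample values are assumptions made for the example. Only the 90 s window length and the use of variance, skewness and kurtosis as features are taken from the text.

```python
from statistics import mean, pvariance


def hold_and_flight_times(events):
    """Return (hold_times, flight_times) for keystrokes in typing order.

    `events` is an assumed format: a list of (press_time, release_time)
    pairs in seconds, one pair per keystroke.
    """
    hold = [release - press for press, release in events]
    # Flight time: release of one key to press of the next. It can be
    # negative when the next key is pressed before the previous is released.
    flight = [events[i + 1][0] - events[i][1] for i in range(len(events) - 1)]
    return hold, flight


def skewness(values):
    """Population skewness (third standardized moment)."""
    mu = mean(values)
    sd = pvariance(values, mu) ** 0.5
    if sd == 0:
        return 0.0
    return mean([((v - mu) / sd) ** 3 for v in values])


def excess_kurtosis(values):
    """Population excess kurtosis (fourth standardized moment minus 3)."""
    mu = mean(values)
    sd = pvariance(values, mu) ** 0.5
    if sd == 0:
        return 0.0
    return mean([((v - mu) / sd) ** 4 for v in values]) - 3.0


def windowed_hold_variance(events, window_s=90.0):
    """Hold-time variance in consecutive, non-overlapping windows."""
    if not events:
        return []
    start = events[0][0]
    windows = {}
    for press, release in events:
        index = int((press - start) // window_s)
        windows.setdefault(index, []).append(release - press)
    # A variance needs at least two keystrokes in the window.
    return [pvariance(h) for _, h in sorted(windows.items()) if len(h) >= 2]


# Hypothetical keystrokes (seconds), for illustration only.
sample = [(0.00, 0.10), (0.30, 0.42), (0.50, 0.58), (95.00, 95.11), (95.40, 95.49)]
hold_times, flight_times = hold_and_flight_times(sample)
```

Windows are keyed by press time here, which is one of several reasonable conventions; sparse in-the-wild typing would likely need a minimum keystroke count per window that is larger than two.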

Furthermore, the “transferability” of typing patterns-based models developed and tested on clinically validated data to naturalistic, quasi-continuous, self-reported data from daily interaction with keyboards was evaluated. While the models achieved higher diagnostic performance in clinical settings, they still show high potentiality for real-life detection of disease-induced abnormal behavior 32 , 33 , 36 – 38 . Moreover, exploiting the passive nature of typing data acquisition, data from 970 PD patients, part of the mPower database 45 , facilitated the detection of early motor decline through Support Vector Machine and Random Forests 42 . Similarly, using the mPower database, unsupervised clustering of smartphone tapping data was used to discriminate the severity of motor symptoms in PD 41 .

Given that amalgamating multiple data streams in one model boosts its diagnostic accuracy, multimodal analysis has been uniquely adopted by Papadopoulos et al. 36 in order to achieve symptom-specific detection, wherein accelerometer data are used to yield a tremor estimation index, while the typing behavior is leveraged for estimating fine motor impairment. Besides diagnosis, five longitudinal clinical studies investigated medication response 27 , 30 , 31 , 34 , 39 with the longest follow up being 36 months 30 . For instance, typing behavior has been utilized to detect the longitudinal disease phenotype and uncover short- and long-term variations in the motor behavior profile of PD patients, as in 31 . This was achieved by the definition of reliable parameters, such as the progression ratio and the steady state ratio, derived by comparisons between motor behavior across consecutive time windows. While this aspect is still in its infancy, Matarazzo et al. 34 showed promising results in detecting response to levodopa using recurrent neural networks. This, in turn, suggests that deep learning is a robust predictive approach in biomarker research; it was used in five clinical studies 21 , 34 , 36 – 38 .

Besides PD, growing evidence from studies targeting imaging biomarkers suggests that the accumulation of Amyloid β starts up to 20 years before the manifestation of clinical symptoms of Alzheimer’s disease (AD) and that this is detected in one third of the clinically normal elderly population 46 . Whether this population will convert to AD, and over what time frame, remains elusive, entailing the search for quantitative assessments during this preclinical stage. AD is in fact preceded by mild cognitive impairment (MCI), which is an intermediate stage characterized by subtle deficits in memory, lexical and information processing, besides sensory and motor abnormalities 47 , 48 . In particular, fine motor impairment has been linked to functional loss at the MCI stage, and specifically compromises the performance of daily life activities. Accordingly, six studies were identified on MCI and AD 49 – 54 . The validity of utilizing the typing kinetics as biomarkers for early stage cognitive decline came about after the pioneering experimental trials that attempted to replicate finger dexterity tests in naturalistic environments. Specifically, the inter-keystroke interval, which is the FT, showed promising cognitive assessment performance in the elderly population 55 . This is particularly linked to breakdowns in attentional control and short-term memory, which constitute two key domains of time-reproduction tasks, such as typing. On this basis, increased latency variability and slower performance were observed in MCI and dementia patients, as compared to age-matched healthy participants 54 . Similarly, capturing computer-use profiles, including mouse and keyboard interactions, successfully discriminated MCI patients from age-matched healthy controls 52 .

Interestingly, the multi-domain dysfunction of the prefrontal cortex motivated the development of multi-modal assessment methods, to validate the co-existence of motor and cognitive impairment. The sharp degradation of lexical processing and syntactic complexity is reflected in MCI-specific language characteristics, including increased verb and pronoun rate and decreased noun rate. To this end, Vizer and colleagues combined keystroke timing features, including the HT and the pause rate, with linguistic features collected in clinical settings to distinguish PreMCI subjects from age-matched healthy controls 49 . Taking the analysis a step further, with the advancement in Natural Language Processing (NLP) and the capability of capturing objective linguistic features usually not recognized by human raters, Ntracha et al. employed NLP of Spontaneous Written Speech (SWS), fused with keystroke dynamics features captured in-the-wild, to reinforce the interplay of cognitive and fine motor functions 50 . Furthermore, the pronounced advancement in computational modeling now allows aligning multiple data lines, which facilitated the development of “behaviorgrams” that capture activity levels, physiological and behavioral signals on a longitudinal basis, yielding a more comprehensive overview of individuals’ health, yet without solid interpretability of longitudinal transient behavior 51 .

Of the ten studies targeting psychiatric disorders, we identified seven studies on bipolar disorder 56 – 62 , one study on idiopathic REM Sleep Behavior disorder 23 , and one study each on depression 63 and sleep-induced psychomotor impairment 64 . All these studies were conducted in-the-wild except the one on REM sleep disorder 23 . Mental and psychiatric disorders, with major depression being the most prevalent, are the leading cause of the disease burden worldwide, accounting for 32.4% of years lived with disability 65 , and substantially contributing to health loss across the lifespan 66 . The underlying mechanisms of depression include dopaminergic, noradrenergic and serotonergic disturbances along with inflammatory and psychosocial factors 67 . Depression has therefore been identified as an epiphenomenon in PD, MCI and AD patients, and has been linked to higher prevalence of neurodegeneration. As per the recommendations of the National Institute of Mental Health (NIMH), deep phenotyping of disease mechanisms at multiple analysis levels, including genetic, neural, and behavioral levels, is key for early diagnosis and monitoring 9 .

From pathological and clinical perspectives, psychomotor perturbation is a well-defined criterion of manic and depressive states 68 . Stemming from this, keyboard interaction patterns, along with accelerometer data and backspace and autocorrect rates, were used as predictor variables in a linear mixed effects model to estimate Hamilton Depression Rating Scale (HDRS) and Young Mania Rating Scale (YMRS) scores of bipolar disorder patients 56 . Besides the psychomotor slowing observed through the longer FT, the analysis of the typing metadata, including autocorrect and backspace rates, reflects the cognitive states associated with depressive and manic states. For instance, the high autocorrect rate associated with depressive states reflects the degree of concentration impairment. In contrast, the high backspace rate during manic states is associated with deteriorated error-response inhibition. Circadian rhythm, depression severity, and age also have a profound impact on the typing kinetics 58 . In this vein, the analysis of typing kinetics, along with the clinical scores, facilitated the prediction of brain age and revealed that the predicted age of bipolar disorder patients is higher than their actual age, compared to healthy controls, reflecting a marker of brain pathology 62 . Moreover, keystroke dynamics predict cognitive decline, diminished visual attention, reduced processing speed and task switching in bipolar disorder patients 61 .

Besides these approaches, employing machine learning methods such as random forests yielded high discriminatory performance between mildly and severely depressed patients, and controls, from typing data collected in-the-wild 63 . Considering the impact of an individual’s unique typing style and the circadian rhythm, stacking convolutional neural networks that detect personalized features, along with recurrent neural networks that learn the dynamic patterns, resulted in personalized mood detection 59 , 60 . Taking the analysis a leap forward, leveraging passively acquired keystroke dynamics with day-to-day ecological momentary assessment for mood prediction suggested that higher mood instability, inferred from the self-reports and the typing kinetics, is highly predictive of worsening depressive and manic symptoms 57 . The same study also showed that continuous monitoring for up to seven days is sufficient for accurate symptom prediction using multilevel statistical analysis. The longest follow up period among studies targeting psychiatric disorders was eight weeks 56 .

Diagnostic potentiality of keystroke dynamics

Twenty-five (25) independent studies were included in the meta-analysis, given that the symmetry condition of the funnel plots is respected (Figs. S1–S4). Whenever possible, if one study formulated multiple models, we treat them independently and their specific characteristics are reported in Supplementary Table 7. We identified 29 independent models for the diagnosis of PD on the basis of keystroke dynamics. Pooled AUC and accuracy of keystroke dynamics classification methods for PD were 0.85 (95% confidence interval (CI): 0.83–0.88; I² = 94.04%) and 0.82 (95% CI 0.78–0.86; I² = 71.55%), respectively. In addition, pooled sensitivity and specificity were 0.86 (95% CI 0.82–0.90, I² = 79.49%) and 0.83 (95% CI 0.79–0.87, I² = 83.45%), as shown in Fig. 2 a–d. For MCI and AD (Fig. 3 a–d) we found ten independent classification models, except for the study in 51 that only reported AUC for their three models. The pooled AUC and accuracy were 0.84 (95% CI 0.78–0.90, I² = 87.43%) and 0.82 (95% CI 0.74–0.89, I² = 72.63%), respectively. Pooled sensitivity and specificity for the same category were also found to be 0.85 (95% CI 0.74–0.96, I² = 50.39%) and 0.82 (95% CI 0.70–0.94, I² = 87.73%). We identified four independent models for psychiatric diseases with 59 only reporting accuracy. Pooled AUC and accuracy for psychomotor impairment were 0.90 (95% CI 0.82–0.97, I² = 0%) and 0.89 (95% CI 0.83–0.95, I² = 35.56%). Pooled sensitivity and specificity for psychomotor impairment were 0.83 (95% CI 0.65–1.00, I² = 79.10%) and 0.87 (95% CI 0.80–0.93, I² = 0%) as shown in Fig. 4 a–d. More importantly, the non-significance, inferred by the sensitivity analysis for every disease category, reveals the consistency of the reported diagnostic accuracy, for all pooled measures.
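
For readers who want to see how a pooled estimate, its 95% CI and I² arise from per-model results, the following is a minimal sketch of univariate random-effects pooling. It assumes the DerSimonian-Laird estimator of between-study variance and pooling on the raw (untransformed) scale; the text does not state which estimator or transformation was used, so both are assumptions. The function name and the input values are hypothetical.

```python
import math


def dersimonian_laird(estimates, std_errors, z=1.96):
    """Random-effects pooling of per-study estimates (assumed DL estimator)."""
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled value.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci": (pooled - z * se_pooled, pooled + z * se_pooled),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical per-model AUCs and standard errors, for illustration only.
aucs = [0.81, 0.88, 0.76, 0.92, 0.85]
ses = [0.03, 0.02, 0.05, 0.02, 0.04]
result = dersimonian_laird(aucs, ses)
```

Pooling on the raw scale can give interval limits at or beyond 1 for proportions such as sensitivity, which is why a logit or similar transformation is often applied first; that step is omitted here for brevity.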

( a ) Pooled AUC with 95% CI for PD studies. ( b ) Pooled accuracy with 95% CI for PD studies. ( c ) Pooled sensitivity with 95% CI for PD studies. ( d ) Pooled specificity with 95% CI for PD studies.

( a ) Pooled AUC with 95% CI for MCI studies. ( b ) Pooled accuracy with 95% CI for MCI studies. ( c ) Pooled sensitivity with 95% CI for MCI studies. ( d ) Pooled specificity with 95% CI for MCI studies.

( a ) Pooled AUC with 95% CI for psychiatric disorder studies. ( b ) Pooled accuracy with 95% CI for psychiatric disorder studies. ( c ) Pooled sensitivity with 95% CI for psychiatric disorder studies. ( d ) Pooled specificity with 95% CI for psychiatric disorder studies.

Assessment of experimental design on diagnostic performance

In order to decipher the heterogeneity sources of the included studies, we have conducted multiple subgroup analyses. Comparing the performance of the diagnostic models when performed on data captured in-the-clinic to data captured in-the-wild revealed that the AUC (p = 0.007) and the accuracy (p = 0.032) were significantly higher under clinical settings. The AUC and the accuracy for the data captured in-the-clinic were 0.89 (95% CI = 0.86–0.91, I² = 87.15%, n = 21) and 0.87 (95% CI = 0.83–0.90, I² = 62.33%, n = 18), respectively. The same measures for data captured in-the-wild were 0.82 (95% CI = 0.79–0.84, I² = 74.02%, n = 21) and 0.81 (95% CI = 0.77–0.85, I² = 71.82%, n = 17), respectively. In terms of the sensitivity and the specificity, we found that the pooled sensitivity was not significantly higher for data captured in-the-clinic (p = 0.903), while the specificity was significantly higher for data captured in-the-clinic (p = 0.032). These metrics for data captured in-the-clinic were 0.85 (95% CI = 0.80–0.99, I² = 82.06%, n = 20) and 0.87 (95% CI = 0.83–0.91, I² = 81.81%, n = 18). For data captured in-the-wild, pooled sensitivity and specificity were 0.85 (95% CI = 0.79–0.90, I² = 52.55%, n = 14) and 0.79 (95% CI = 0.73–0.85, I² = 68.96%, n = 16). Similarly, the AUC (p = 0.004), the accuracy (p = 0.013), and the specificity (p = 0.002) are significantly higher for clinically-validated databases, compared to typing data labeled by self-reports. For the former, the AUC, accuracy, pooled sensitivity and specificity were 0.86 (95% CI = 0.83–0.89, I² = 86.41%, n = 31), 0.86 (95% CI = 0.83–0.89, I² = 64.17%, n = 26), 0.86 (95% CI = 0.81–0.90, I² = 78.77%, n = 21) and 0.87 (95% CI = 0.84–0.91, I² = 73.41%, n = 22). On the other hand, these metrics for the self-reported data were 0.78 (95% CI = 0.73–0.84, I² = 0.00%, n = 6), 0.79 (95% CI = 0.74–0.83, I² = 29.24%, n = 8), 0.83 (95% CI = 0.76–0.90, I² = 50.11%, n = 9) and 0.76 (95% CI = 0.69–0.82, I² = 77.43%, n = 11).
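
A subgroup comparison of this kind can be illustrated with a simple two-sided z-test on the difference between two pooled estimates. The text does not specify which test produced the reported p-values, so this sketch is only one common choice and is not expected to reproduce them. The function names and the input numbers are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error from a symmetric 95% CI."""
    return (upper - lower) / (2.0 * z)


def subgroup_z_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference of two independent pooled estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical pooled estimates with 95% CIs for two subgroups.
clinic_est, clinic_se = 0.90, se_from_ci(0.84, 0.96)
wild_est, wild_se = 0.80, se_from_ci(0.72, 0.88)
z_value, p_value = subgroup_z_test(clinic_est, clinic_se, wild_est, wild_se)
```

The test treats the two subgroups as independent, which may not hold when one study contributes models to both groups.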

From a methodological point of view, we report no statistically significant difference in pooled AUC (p = 0.525) and sensitivity (p = 0.074) when we compare unimodal and multimodal analysis methods. The specificity (p = 0.042) and the accuracy (p = 0.022), however, were significantly higher for multimodal analysis. Pooled AUC, accuracy, sensitivity and specificity for multimodal analysis were as follows: 0.83 (95% CI = 0.77–0.90, I² = 90.83%, n = 9), 0.87 (95% CI = 0.83–0.91, I² = 66.24%, n = 11), 0.89 (95% CI = 0.84–0.94, I² = 45.77%, n = 7) and 0.87 (95% CI = 0.79–0.95, I² = 90.19%, n = 8). The same measures for unimodal analysis were 0.86 (95% CI = 0.84–0.89, I² = 76.42%, n = 31), 0.82 (95% CI = 0.78–0.85, I² = 63.98%, n = 25), 0.84 (95% CI = 0.79–0.89, I² = 76.41%, n = 22) and 0.80 (95% CI = 0.76–0.84, I² = 68.23%, n = 17).

Comparing the performance of ML classifiers and deep learning methods, the sensitivity was significantly higher for deep learning classifiers (p = 0.029), compared to linear machine learning methods, while the differences in AUC (p = 0.859), accuracy (p = 0.299), and specificity (p = 0.882) were not significant. The pooled AUC, accuracy, sensitivity and specificity for machine learning classifiers were 0.86 (95% CI = 0.83–0.88, I² = 75.29%, n = 33), 0.83 (95% CI = 0.80–0.87, I² = 66.49%, n = 26), 0.82 (95% CI = 0.78–0.86, I² = 71.49%, n = 24) and 0.83 (95% CI = 0.78–0.87, I² = 75.82%, n = 26), respectively. On the other hand, in the case of deep learning, the pooled measures are 0.86 (95% CI = 0.79–0.94, I² = 86.77%, n = 7), 0.86 (95% CI = 0.81–0.91, I² = 51.77%, n = 8), 0.89 (95% CI = 0.83–0.96, I² = 44.25%, n = 9) and 0.83 (95% CI = 0.76–0.91, I² = 80.37%, n = 9). Figure 5 presents scatter-bar plots of the subgroup analysis results; forest plot representations can be found in Figs. S5–S20.

Scatter–Bar plots for the Subgroup Analysis results for ( a ) data collected in-the-clinic vs. data collected in-the-wild, ( b ) clinically validated data vs. self-reported data, ( c ) multimodal analysis vs. unimodal analysis and ( d ) deep learning vs. other machine learning classifiers. The dots represent the individual studies and the height of the bars corresponds to the outcome of the random effects meta-analysis model with 95% CI. ** denotes p < 0.005 and * denotes p < 0.05.

Association of diagnostic performance with age and disease duration

We hypothesize that patients’ demographics and clinical characteristics affect the diagnostic potential of keystroke dynamics. To this aim, we performed multiple linear regression analyses to convey the influence of age, disease duration, and medication on the diagnostic potential of keystroke dynamics. We also pooled fine motor impairment indexes, mainly related to bradykinesia, to investigate the influence of disease stage on the estimated motor impairment severity. Due to the unavailability of sufficient data for MCI and psychiatric disorder studies, we were mainly able to perform the regression analysis for PD diagnosis. Figure 6 a shows the relationship between PD patients’ age and disease duration (years from diagnosis). The figure intuitively suggests that PD disease duration increases with age, and the relationship between the two is statistically significant (p = 0.013) as inferred from the regression analysis. Accordingly, adjusting for disease duration, we analyzed its relationship with diagnostic AUC as represented in Fig. 6 b. The regression analysis yielded a statistically significant increase in AUC with disease duration (p = 0.005), reflecting the progression of fine motor impairment in PD patients. Next, using the same data, we investigated the relationship of AUC with disease duration when de novo PD patients are compared to early PD patients taking levodopa (L-Dopa). Interestingly, when we fit a line to each group, the steeper slope associated with the de novo PD patients, compared to that of early PD patients on L-Dopa, indicates that although the diagnostic AUC of de novo patients is lower, the evolution of the AUC with respect to disease progression for this patient category is more pronounced, mainly during the first three years after diagnosis, than that of early, medicated PD (Fig. 6 c).
Perhaps this also suggests a sharper decline in fine motor skills at this stage, resulting in a clear improvement in the diagnostic AUC. This is in line with recent evidence suggesting an exponential neurodegeneration pattern of the Substantia Nigra pars compacta, parallel to a sharper decline in motor skills in early PD 69 . Besides the diagnostic performance, we sought to investigate the association between PD disease duration and the severity of fine motor symptoms. We pooled the fine motor impairment index derived from the HT, as an estimation of bradykinesia, because it was reported by multiple studies with sufficient data. However, not all studies reported the fine motor impairment index derived from the HT. Figure 6 d depicts the significant correlation (p = 0.010) between the disease duration and the fine motor impairment index.
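
The per-group linear fitting described above can be sketched with ordinary least squares. This is an illustration under stated assumptions: the function name and all data points are hypothetical and do not come from the included studies, and the sketch reports only slope, intercept and R², without the significance testing used in the paper.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a constant y the fit is exact, so report R^2 = 1 in that case.
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical (disease duration in years, diagnostic AUC) points per group.
de_novo_years = [0.5, 1.0, 2.0, 3.0]
de_novo_auc = [0.70, 0.74, 0.80, 0.86]
medicated_years = [1.0, 2.0, 3.0, 4.0]
medicated_auc = [0.82, 0.83, 0.85, 0.86]

de_novo_slope, _, _ = ols_fit(de_novo_years, de_novo_auc)
medicated_slope, _, _ = ols_fit(medicated_years, medicated_auc)
# A larger slope means AUC rises faster per year of disease duration.
steeper_in_de_novo = de_novo_slope > medicated_slope
```

With so few points per group, a slope comparison of this kind is descriptive only; a formal comparison would need an interaction term or a test on the slope difference.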

Evaluation of the impact of patients’ age and disease duration on the diagnostic performance of keystroke dynamics represented by the AUC. ( a ) Regression analysis results of PD patients age and years from diagnosis (disease duration). ( b ) Regression analysis results of PD studies reporting diagnostic AUC and disease duration reveals their significant association. ( c ) Pooled AUC of de novo PD patients (blue) and early PD patients on L-Dopa (orange) depicts the sharper increase in AUC with disease duration of de novo PD patients, compared to that of early, medicated PD patients. ( d ) Regression analysis results of Fine motor impairment index derived from the HT and the disease duration. ( e ) Regression analysis results of MCI patients age and diagnosis AUC.

Although the included studies on MCI were generally few compared to those targeting PD, we were able to perform regression analysis to convey the relationship between diagnostic potential and patients’ age. As represented in Fig. 6 e, there is a significant increase with patients’ age in the diagnostic AUC of MCI based on fine motor skills inferred by keystroke dynamics (p = 0.017). The full regression results are reported in Supplementary Tables 9 – 12 .

Evaluation of between-study heterogeneity and bias risk

The large between-study heterogeneity made combining data from multiple studies to generate a representative effect size on the diagnostic performance problematic. For this reason, we decided to pool four diagnostic metrics via univariate random-effect meta-analysis models. Consequently, we assume that pooled diagnostic metrics of PD, MCI, and psychiatric disorders, as well as the subgroup analysis results, are adequate to convey the diagnostic potential of keystroke dynamics models and the impact of study characteristics, namely data collection settings, labeling methods, and modeling characteristics. Hence, we group the studies based on the desired outcome and assume that, despite the heterogeneity, the pooled outcome contributes to the evidence. For instance, when evaluating the diagnostic performance for every disease category, the heterogeneity stems from the between-study differences in experimental design and model characteristics; however, when we group the studies based on experimental characteristics regardless of the disease category, we attribute the heterogeneity to patients’ characteristics and other experimental design aspects that are not under investigation. Furthermore, we reinforce our findings from the global performance of studies by evidence from methodological perspectives. Nonetheless, we still caution against overinterpretation.

Figure 7 shows the graphical representation of the risk of bias of included studies, and the per-study risk of bias assessment is reported in Supplementary Table 6 . Given that we target the diagnostic accuracy, the included studies are case–control studies including a priori labeled diseased and healthy participants. We consider the studies that labeled the participants using self-reports without clinical evaluation to be at high risk of bias, because participants’ honesty, recall bias and unawareness of their medical conditions influence the correctness of the labels. Furthermore, most studies did not assess the appropriateness of the sample size; we therefore deemed this to be of unclear risk of bias for most studies, except 12 studies that aimed at enlarging the sample pool, mainly by collecting data outside clinics. Besides, we deemed all the studies that performed independent clinical evaluation and keystroke dynamics analysis (outcome assessment blindness) to be of low risk of bias, except 30 , 39 , which performed finger dexterity and clinical evaluation of PD without blinding. All studies were characterized by low risk of bias when we consider timing of ground-truth labeling and data collection. Most studies are of unclear risk of bias in terms of selective reporting. Overall, we deemed the risk of bias to be low to moderate, and the quality of evidence, as inferred by the GRADE tool, to be moderate to high, as illustrated in Supplementary Table 8 .

Risk of bias assessment.

To the best of our knowledge, this is the first systematic review and meta-analysis that provides a concentrated overview of the clinically-relevant diagnostic performance of keystroke dynamics, their ecological validity and association with patients’ demographics and clinical characteristics. We found that most studies target PD, given its hallmark motor symptoms; however, there are now multiple studies dedicated to the assessment of motor perturbation in MCI, bipolar disorder, and depression. The diagnostic accuracy revealed by our meta-analysis reflects, for the first time, the reproducibility of keystroke dynamic models in the assessment of multiple disorders with neurologically defined fine motor impairment. Besides the three disease categories reviewed in this paper, researchers are currently employing keystroke dynamics for Multiple Sclerosis 70 , 71 and Huntington’s disease 72 . We therefore conclude that we can rely on keystroke dynamics obtained passively from natural interactions with keyboards to detect fine motor impairments induced by early stage neurological and/or psychiatric disorders. Despite this diagnostic performance, several experimental and analysis deficiencies need to be discussed to mitigate between-study heterogeneity and pave the way for future research. For clinical adoption of this technology, we propose a partnership strategy based on a “co-creation” approach that stems from mechanistic explanations of patients’ characteristics derived from data obtained in clinics and under ecologically valid settings. It is the multi-level analysis of patients’ data at the genetic, organ, and behavior levels that will be at the center of the translational paradigm to precision medicine when the heterogeneous brain disorders are to be considered.

While computer/smartphone interaction behavior outperformed clinical gold standards such as the AFT and the single finger tapping tests in detecting specific fine motor symptoms of PD patients, the transition from highly-controlled assessment in-the-clinic, to naturalistic, real-life assessment models should be approached with caution 73 – 75 . From the sampling perspective, home-based data collection usually results in highly sparse bursts of unpredictable typing activity, that are highly sensitive to real-life contexts, emotional burden and diurnal patterns. This is in line with the higher discriminatory performance of the models on data captured in-the-clinic and labeled by clinical assessment, elucidated by our subgroup analysis. Therefore, to establish robust detection models for diagnosis outside clinics, integrating multiple latent domains, or confounders and defining multiple predictor parameters, such as emotions, activity levels, and sleep patterns, is particularly an interesting avenue for future research to enhance ecological validity 76 . Such integrated frameworks might therefore capture the heterogeneous, neuropsychiatric symptoms in different behavioral disorders, let alone the intra-subject variability that occurs across different time windows. Moreover, because there is neither a consensus on the optimal assessment duration to detect meaningful disease trajectories and progression of neurological disorders, nor for episodic relapse in psychiatric disorders, long-term analysis of behavioral profiles is essential. Moreover, optimizing the analysis window length, that is, the distribution of observation period, to precisely detect disease-induced transient behavior is yet to be performed.

The inherent, progressive nature of psychiatric and neurodegenerative disorders makes them amenable to frequent treatment regimen modifications, yet satisfactory symptom control is not achieved given the high economic burden of clinical visits 77 . Besides screening and diagnosis, the concept of remote monitoring is realized thanks to the passive acquisition of high-frequency, objective behavioral data. While this undoubtedly constitutes a promising arena, the lack of standardized objective features and the inconsistent analysis methods remain a challenge 78 . A contributing factor to this, according to 76 , is the short assessment time and the rare outcome assessment during the study duration. To be more precise, the ground truth clinical evaluations that are performed at intermittent intervals during longitudinal data acquisition result in many unlabeled days; therefore, the validity of propagating these labels over long time windows is still unclear. A hybrid labeling approach that combines low-frequency clinical assessment with higher-frequency Ecological Momentary Assessment via self-reports along the study duration might mitigate this dilemma.

As illustrated earlier in our subgroup analysis, the adoption of deep learning methods that efficiently extract meaningful patterns from unstructured data is now on the rise. However, deep learning methods, which outperformed the rest of the machine learning models in terms of diagnostic accuracy, are associated with considerable uncertainty. Interestingly, with the aim of enhancing the efficacy of remote assessment of PD, Iakovakis et al . 37 combined two databases captured in-the-clinic and in-the-wild in a hybrid deep learning model capable of learning fine motor symptoms, thereby overcoming the induced quantization error of the UPDRS-III and enhancing the performance of deep learning. Similarly, our meta-analysis showed that multimodal analysis, although it reinforces the diagnostic accuracy, is characterized by considerable between-study diagnostic uncertainty; therefore, future studies should adopt more transparent and well-conducted study designs to reduce bias. Perhaps combining voice analysis techniques with keystroke dynamics will boost the detection of early motor impairment signs, as these are also reflected in the speech characteristics of PD patients 79 . Moreover, from the methodological perspective, several pattern recognition tools have the potential to learn and decipher the nonlinear, dynamic nature of human-keyboard interactions. For example, fuzzy recurrence plots and scalable recurrence networks visually revealed finer texture and more regularity in the hold time series of healthy controls compared to early stage PD patients 20 , 21 .

Psychiatric and neurodegenerative disorders that develop and progress across the lifespan are characterized by a heterogeneous phenotype of motor and non-motor symptoms 80 . Early stage behavioral perturbations constitute an a priori link with a plausible connection to disease likelihood, but the high cross-talk between symptoms obscures accurate diagnosis and understanding of pathogenesis, especially at preclinical stages. This heterogeneity is a central problem in diagnostic research, entailing additional methods for analyzing similarities and differences across disease-induced behavioral disturbances. As opposed to previous reviews that put too much emphasis on specific disorders, we deliberately included studies on PD, MCI and affective disorders to convey that the neurobiological mechanisms differ greatly among disorders that are characterized by similar traits, such as motor slowing. Rather than focusing on specific disorders in isolation, we advocate a dimensional approach that places more stress on the symptoms per se, also referred to as comorbidities. We believe that a central challenge in this realm is formulating databases with a full representation of the population, to expand our understanding of the heterogeneous disease-related traits.

Although previous years witnessed an increasing lean towards digital health technologies, the premature adoption of these measures by clinics precludes meaningful outcomes 81 . Our work highlights important directions for future research. The definition of clinically meaningful thresholds is yet to be established, and this cannot be attained without a “co-creation” approach, whereby high-level data and clinically validated interpretations are made. For instance, amalgamating low-level behavioral data with high-level imaging data has not been explored yet. We think that this will not only provide better health information, but might also generate new knowledge on “brain fitness” and behavior across generations. Further, the importance of interdisciplinary interactions also extends to ethical implications, for enhanced transparency, informed consent from patients, privacy and accountability 82 . We therefore summarize domain-specific limitations and future research directions in Table 2 .

Future directions for the digital biomarkers research based on the “co-creation approach”.

We acknowledge that our study has several limitations. Among them are the sparsity and the inherent heterogeneity of the included studies. While we were able to perform regression analysis with patients’ demographic and clinical characteristics (i.e., age and disease duration, respectively) for PD, our meta-analysis lacks the investigation of additional covariates, such as gender differences and medication response, especially for MCI and psychiatric disorders. Although promising results have been revealed by leveraging typing patterns for diagnosing and monitoring mood and cognitive decline, the majority of the studies are, so far, disproportionately targeting PD. While this is understandable given the hallmark motor disturbance in the latter, the need for further validation of this approach in other disorders is still pressing. This will be an important avenue for future studies. The data analyzed in the included studies were collected either in the United States (US) or in Europe; therefore, future clinical trials of the diagnostic performance of keystroke dynamics in other populations, with possibly lower education level and smartphone usage, particularly in ageing populations and low-income countries, are needed. A global consortium on the possibility of translating this technology to populations with limited neurological care access might be the first step in this context. We could then investigate how the diagnostic potential changes across time, by site and for different populations. Concerning per-patient variability and disease progression, future work should focus more on identifying temporal symptom profiles and behavioral trajectories indicative of conversion to brain disease. More importantly, latent domains such as emotions and sleep patterns should be considered as confounders, given their direct influence on motor behavior and general health status.
Estimation of motor impairment severity, which correlates with disease stage and subtype, is also an important future avenue. Furthermore, future researchers in the field should collaborate with clinicians to make the models more interpretable, thereby enhancing clinical adoption. Research in the area of explainable AI (XAI) is now rapidly growing 83 , but collaborative work between data scientists, engineers and clinicians is not yet established, especially in the mutual exchange of data (i.e., behavioral data, imaging). Finally, we declare, as a limitation, that the protocol of this systematic review and meta-analysis is registered in PROSPERO, but has not been published yet.

Lastly, we note the strength of our meta-analysis conclusions, which conveyed the feasibility of using keystroke dynamics derived from natural interaction with the keyboards of connected devices as digital biomarkers for early decline in fine motor skills associated with neuropsychiatric disorders. Based on experimental design comparisons, we showed that keystroke dynamics constitute an ecologically valid diagnostic platform in-the-wild , reflecting their translational potential outside clinics, despite the methodological challenges that arise, including but not limited to the influence of confounders and sampling difficulties. Further, given the influence of data labeling on the diagnosis models, we conclude that even when self-reported data in-the-wild are used for training, keystroke dynamics models still achieve sound discriminatory potential. From the methodological perspective, we show that employing multimodal and advanced deep learning models, which are at the leading edge of contemporary data science methodologies, offers promising opportunities for boosting the diagnostic accuracy, but with considerable heterogeneity across the studies. Consequently, there is a need to establish intricate and generalizable diagnostic models that not only achieve accurate diagnosis, but are also sensitive to temporal change and symptom progression. To this end, our regression models showed the evolution of diagnosis AUC and fine motor impairment with age and disease duration for PD. We repeated the regression analysis for MCI, and showed how the diagnostic AUC increases with age, reflecting the increasing fine motor impairment severity. In conclusion, the importance of digital technology goes beyond the diagnostic yield: once at-risk cohorts are identified, digital technologies can also be employed to reinforce behavior change and patients’ empowerment, towards a sustained quality of life, as detailed in Table 2 .

Search strategy and selection criteria

In this systematic review and meta-analysis, conducted in accordance with the Diagnostic Test Accuracy extension of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA-2020) 84 , a systematic search of MEDLINE, PubMed, IEEE Xplore, Web of Science, and EBSCO was independently performed by two authors (H.A. and N.C.) for publications between January 1st, 2010 and March 30th, 2022, on pattern recognition and neuropsychiatric disease classification on the basis of natural interactions with keyboards, without language restrictions. These date restrictions were specified a priori, because typing patterns constitute a new class in the fruitful digital phenotyping area. The full search strategy for all databases is reported in Supplementary Tables 1 – 5 and in the PRISMA 2020 checklist . Eligible studies assessed the influence of motor impairment induced by psychiatric or neurological disorders on typing patterns (i.e., keystroke dynamics). Those deemed eligible were case–control studies comparing the typing behavior of neuropsychiatric disease patients to age- and education-matched healthy control subjects. Studies that used statistical analysis without classification were included in the narrative synthesis, while those employing machine learning models for classification were included in a random-effects meta-analysis to evaluate the diagnosis performance on the basis of typing behavior. We performed a manual search of the reference lists of the eligible studies, and we searched the grey literature for unpublished data, conference proceedings and dissertations. Prior to writing this paper, we checked whether systematic reviews and meta-analyses on the same topic already existed.

All search results were uploaded to Rayyan web of intelligent systematic reviews 85 for duplicate removal and screening. One author (H.A.) screened titles and abstracts of the included studies, which were double-screened by a second author (L.H.). Three authors (H.A., A.K. and L.H.) assessed the eligibility of the included full articles. Any disagreement was resolved by discussion.

Protocol registration

The protocol of this systematic review and meta-analysis has been registered in PROSPERO with identifier CRD42021278707.

Data extraction and quality assessment

Two authors (H.A. and L.H.) extracted the following data from the included studies: (1) disease, (2) first author and publication year, (3) experimental protocol of data collection, including collection settings and study duration, (4) number and mean age of participants in diseased and healthy groups, (5) data labeling methodology (self-reported meta-data vs. clinical evaluation), (6) data streams employed by the study, (7) extracted features, (8) analysis and feature extraction level (subject-level vs. typing session-level), (9) problem formulation and validation, whether through statistical analyses or classification, (10) 2 × 2 data (True Positives, True Negatives, False Positives, False Negatives), from which we derived the sensitivity and the specificity, (11) classification accuracy and (12) Area Under the Receiver Operating Characteristic Curve (AUC). Three authors (H.A., A.K. and L.H.) discussed and assessed the quality of the included studies. The studies that did not perform classification were included in the systematic review but not in the meta-analysis.
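As an illustration of item (10), the sketch below derives sensitivity, specificity and accuracy from one study's 2 × 2 counts using the standard definitions. The `ConfusionCounts` container and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2 x 2 counts for one study (hypothetical container for illustration)."""
    tp: int  # diseased participants classified as diseased
    tn: int  # healthy participants classified as healthy
    fp: int  # healthy participants classified as diseased
    fn: int  # diseased participants classified as healthy


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of diseased participants that the model detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of healthy participants that the model clears."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all participants classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


if __name__ == "__main__":
    # Made-up counts: 50 patients and 50 healthy controls.
    study = ConfusionCounts(tp=43, tn=40, fp=10, fn=7)
    print(f"sensitivity={sensitivity(study):.2f}")
    print(f"specificity={specificity(study):.2f}")
    print(f"accuracy={accuracy(study):.2f}")
```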

Statistical analysis and diagnosis evaluation

Our primary outcome is the diagnosis efficiency of machine learning models employing typing features (i.e., keystroke dynamics). Secondary outcomes include longitudinal disease monitoring on the basis of pattern recognition of keystroke dynamics, treatment response, and key features that discriminate diseased from healthy groups.

In particular, the outcomes of the meta-analysis were the Area Under the Receiver Operating Characteristic Curve (AUC), accuracy, sensitivity and specificity. These outcomes were pooled in a univariate random-effects model independently for three disease categories, namely PD, MCI, and psychiatric disorders. Heterogeneity was assessed using the I² statistic, which quantifies between-study differences not attributable to sampling, in addition to the Cochran Q (χ²) test (p < 0.05). Given that in this study we report the validity of keystroke dynamics models as diagnostic tools for different disorders, we accepted high heterogeneity (I² > 50%). Furthermore, to ensure the completeness and transparency of the reported diagnostic accuracy measures, we followed the Standards for the Reporting of Diagnostic Accuracy Studies (STARD) 86 .
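To make the pooling and heterogeneity measures concrete, the following is a minimal sketch of a univariate random-effects model. It assumes the DerSimonian-Laird estimator of the between-study variance and a normal-approximation 95% CI. This section does not state which estimator was used, so this choice, the function names and the input values are illustrative assumptions only.

```python
import math


def proportion_variance(p: float, n: int) -> float:
    """Binomial variance of a proportion such as sensitivity (assumed variance model)."""
    return p * (1.0 - p) / n


def random_effects_pool(effects, variances, z=1.96):
    """Pool per-study effects with DerSimonian-Laird weights (an assumed estimator)."""
    w = [1.0 / v for v in variances]                 # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci_low": pooled - z * se,
        "ci_high": pooled + z * se,
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Made-up sensitivities and diseased-group sizes for three studies.
    sens = [0.80, 0.90, 0.85]
    n_diseased = [40, 60, 25]
    var = [proportion_variance(p, n) for p, n in zip(sens, n_diseased)]
    print(random_effects_pool(sens, var))
```

When the estimated between-study variance is zero, the random-effects weights reduce to the inverse-variance weights, so the pooled estimate coincides with the fixed-effect one.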

After pooling the data, we processed them using the Meta-Essentials tool 87 . For each study, we entered the (per-subject) AUC, the accuracy and the sample size, while for the sensitivity and specificity, we entered the number of participants in the diseased and healthy groups, respectively. These measures, along with the 95% confidence interval (CI), were represented by univariate forest plots. All included studies that reported AUC, accuracy, sensitivity and specificity were included in the meta-analysis, provided that the symmetry of the funnel plots was maintained to minimize publication bias. Studies that did not report any of these measures and/or were associated with high bias risk were included in the systematic review but not in the meta-analysis. Furthermore, for each of the three disorder groups, we performed a leave-one-study-out sensitivity analysis to investigate the impact of individual studies on the diagnostic metrics. Two authors (H.A. and L.H.) performed and agreed on the performance and the outcome of the statistical analysis.
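The leave-one-study-out procedure can be sketched as follows: the pooled estimate is recomputed once per study with that study removed, and a large shift flags an influential study. For brevity, the default pooling function here is a plain inverse-variance mean, which is a simplifying assumption rather than the model used in the paper; any pooling function with the same signature can be passed instead. The study labels and values are hypothetical.

```python
def inverse_variance_mean(effects, variances):
    """Fixed-effect pooled estimate (simplified stand-in for the pooling model)."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(labels, effects, variances, pool=inverse_variance_mean):
    """Return the pooled estimate obtained when each study is omitted in turn."""
    results = {}
    for i, label in enumerate(labels):
        kept_effects = [e for j, e in enumerate(effects) if j != i]
        kept_variances = [v for j, v in enumerate(variances) if j != i]
        results[label] = pool(kept_effects, kept_variances)
    return results


if __name__ == "__main__":
    # Made-up AUC values with equal variances for three hypothetical studies.
    labels = ["Study A", "Study B", "Study C"]
    aucs = [0.70, 0.80, 0.90]
    variances = [0.01, 0.01, 0.01]
    for omitted, estimate in leave_one_out(labels, aucs, variances).items():
        print(f"without {omitted}: pooled AUC = {estimate:.3f}")
```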

Subgroup analysis

Subgroup analyses were conducted to assess the source of heterogeneity between the studies, provided that each subgroup contained more than three studies (n > 3) after subgroup division. We particularly focus on the performance of data acquisition and analysis methods, given that they are the main intellectual challenges of the highly fertile arena of digital phenotyping 73 . In view of the increasing interest in real-life diagnosis, we segregated the studies based on the data acquisition modality as (1) in-the-clinic and in-the-wild . Furthermore, we compared the attained AUC, accuracy, sensitivity and specificity between (2) clinically validated and self-reported data. In addition, comparisons between (3) multimodal and unimodal studies, as well as (4) deep learning and other machine learning classification methods were performed.

Regression analysis

Four linear regression models were fitted for (1) PD patients’ age and years from diagnosis (disease duration), (2) PD diagnosis AUC and disease duration, (3) PD fine motor impairment index and disease duration and (4) MCI diagnosis AUC and patients’ age. These tests were two-sided with a statistical significance threshold of 0.05 and 95% CI.
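A minimal sketch of one such fit is given below, assuming ordinary least squares with a single predictor. It returns the slope, the intercept and the t statistic for the null hypothesis of zero slope, which has n − 2 degrees of freedom. Converting that statistic into a two-sided p-value requires a Student t distribution, which the Python standard library does not provide, so that step is left out. The data points are invented for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual sum of squares and standard error of the slope.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    # t statistic for H0: slope == 0, to be compared against t with n - 2 df.
    t_stat = slope / se_slope if se_slope > 0 else math.inf
    return {"slope": slope, "intercept": intercept,
            "se_slope": se_slope, "t": t_stat, "df": n - 2}


if __name__ == "__main__":
    # Invented example: disease duration (years) vs. a made-up outcome.
    duration = [0, 1, 2, 3]
    outcome = [1.0, 3.0, 4.0, 6.0]
    print(fit_line(duration, outcome))
```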

Publication bias assessment

Publication bias was assessed based on Begg and Mazumdar’s rank correlation test and visualized by funnel plots. Importantly, if one database was used in multiple studies, or if one study employed multiple analysis methods, we treated those as independent studies.

To assess the internal validity of the included studies, quality assessment was performed employing the tool for Quality Assessment of Diagnostic Test Accuracy (QUADAS-2) 88 . All discrepancies were resolved by mutual discussions between three authors (H.A., A.K., and L.H.). Moreover, we generated four funnel plots for AUC, accuracy, sensitivity and specificity to visually illustrate the publication bias of the included studies 89 .
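The rank correlation test named above can be sketched as follows. Each effect is standardized around the inverse-variance pooled estimate, and Kendall's tau is then computed between the standardized effects and the study variances. This sketch uses the normal approximation without tie or continuity correction, which is a simplification. The input values are hypothetical.

```python
import math


def begg_rank_correlation(effects, variances):
    """Begg and Mazumdar rank correlation test (no tie or continuity correction)."""
    k = len(effects)
    if k < 3:
        raise ValueError("need at least three studies")
    w = [1.0 / v for v in variances]
    sw = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Variance of (effect - pooled) is v_i - 1 / sum(w), positive for k >= 2.
    std = [(y - pooled) / math.sqrt(v - 1.0 / sw)
           for y, v in zip(effects, variances)]
    concordant = discordant = 0
    for i in range(k):
        for j in range(i + 1, k):
            sign = (std[i] - std[j]) * (variances[i] - variances[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    tau = (concordant - discordant) / (k * (k - 1) / 2)
    z = (concordant - discordant) / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return {"tau": tau, "z": z, "p": p_two_sided}


if __name__ == "__main__":
    # Made-up data in which less precise studies report larger effects.
    effects = [0.5, 0.6, 0.7, 0.8, 0.9]
    variances = [0.01, 0.02, 0.03, 0.04, 0.05]
    print(begg_rank_correlation(effects, variances))
```

A significant correlation between effect size and variance suggests funnel plot asymmetry, which is one possible sign of publication bias.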

Quality of evidence assessment

To convey the clinical value of keystroke dynamics, we have used the Grades of Recommendations, Assessment, Development and Evaluation (GRADE) tool 90 to systematically and transparently assess the diagnostic accuracy evidence of keystroke dynamics for neuropsychiatric disorders. The systematic appraisal of the evidence quality is determined by (1) the design of the study, (2) risk of bias, (3) inconsistency of reported results, (4) indirectness of the outcome, (5) imprecision of the reported results, and (6) publication bias.

Supplementary Information

Acknowledgements

This work is supported by the Joint Research Center of Khalifa University of Science and Technology and the Korean Advanced Institute of Science and Technology, 8474000221 (KKJRC-2019-Health2) awarded to Ahsan Khandoker and Leontios Hadjileontiadis.

Author contributions

H.A., A.K., and L.H. conceived and designed the study; H.A. and N.C. performed systematic search; H.A. and L.H. performed quality assessment, extracted meta-data and conducted the systematic review and meta-analysis; H.A. wrote the first draft and S.D. and L.H. contributed to the writing and editing. H.A., A.K., D.I., S.D., R.C. and L.H. reviewed the manuscript. All authors discussed and agreed on the submission of this manuscript.

This study is funded by the Joint Research Center of Khalifa University of Science and Technology and the Korean Advanced Institute of Science and Technology, 8474000221 (KKJRC-2019-Health2). The funder did not have any role in the analysis performed.

Data availability

Competing interests

The authors declare no competing interests.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The online version contains supplementary material available at 10.1038/s41598-022-11865-7.
