Real-Time Ransomware Detection by Using eBPF and Natural Language Processing and Machine Learning

Natural Language Processing

Special Session: Computational Intelligence for Natural Language Processing

Natural language processing (NLP), a sub-discipline of artificial intelligence, has received much attention in recent years. The introduction of deep learning-based techniques has caused a paradigm shift in the field. Extracting features manually from text data is difficult and time-consuming, whereas supervised deep learning-based techniques are well known for their automatic feature extraction capabilities. However, these systems suffer from interpretability issues and high computational complexity. Moreover, training them requires a huge amount of labeled data. NLP includes many research problems, such as summarization, dialogue systems, machine translation, and question answering, where generating labeled data is a challenge.

Nowadays, there is a trend toward developing unsupervised deep learning-based architectures, and most research articles rely solely on deep learning. Researchers are also developing unsupervised optimization techniques that are widely used to solve real-life problems. Evolutionary algorithms (EAs) are a type of meta-heuristic optimization technique well known for finding near-optimal solutions in a limited amount of time. They are based on the concepts of the natural evolution process. Much current research also focuses on developing new EAs and applying them in various domains.
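As a rough illustration of the evolutionary loop described above, the following Python sketch evolves bit strings through mutation and elitist selection. The `one_max` fitness, the function names, and all parameter values are assumptions chosen for illustration and do not come from the call for papers; in an NLP setting the bit string might, for instance, encode a feature subset.

```python
import random


def one_max(bits):
    """Toy fitness (assumption): number of 1-bits, a stand-in for a real NLP objective."""
    return sum(bits)


def evolve(fitness, n_bits=20, pop_size=10, generations=50,
           mutation_rate=0.05, seed=0):
    """Minimal elitist EA: mutate every parent, keep the best of parents and children."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness after each generation
    for _ in range(generations):
        # Variation: each parent produces one child by independent bit-flip mutation.
        children = [[1 - b if rng.random() < mutation_rate else b for b in parent]
                    for parent in population]
        # Selection: keep the pop_size fittest individuals among parents and children.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(population[0]))
    return population[0], history


best, history = evolve(one_max)
print(one_max(best), history[:5])
```

Because parents compete with their children, the best fitness never decreases between generations; changing `seed` shows the run-to-run variation that randomized search introduces.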

EAs have received considerable attention from academics, researchers, and practitioners for solving NLP problems because, unlike deep learning techniques, these algorithms can be used without large training data. Many NLP problems require optimizing several objective functions simultaneously. For example, in summarization systems, conflicting objectives such as coverage, anti-redundancy, readability, and cohesion must be optimized simultaneously by the search capability of an optimization technique. Concepts of multiobjective optimization are useful in such cases. Moreover, optimization-based approaches are widely used to determine appropriate parameters and architectures for deep learning-based models for different NLP problems. Feature selection and classifier ensemble construction can also be solved successfully using optimization frameworks. Several quality measures, such as accuracy, precision, recall, and F-score, can be optimized simultaneously to select the best combination of classifiers for an ensemble. Similarly, in feature selection, filter- and wrapper-based approaches can be designed by simultaneously optimizing several feature quality measures using the search capability of EAs. However, many open problems remain in applying EAs to NLP:

1) Stability of EAs in solving different real-life problems: in general, EAs are randomized algorithms, and their performance varies from generation to generation. How can stable solutions be generated using EAs?
2) EAs produce several solutions in the final population. How should a single solution be selected for reporting?
3) Researchers are also investigating the effect of adding multimodal information when solving NLP problems. In this scenario, how can an evolutionary framework be developed that handles multimodal NLP data efficiently?
4) How can an evolutionary framework be adapted or developed for multilingual and cross-lingual data?

Answering these questions is a prerequisite for widespread deployment of evolutionary algorithms in NLP applications.
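The multiobjective setting above, and open question 2 on reporting a single solution, can be sketched in Python as follows. The candidate scores are invented (coverage, anti-redundancy) pairs for hypothetical summaries, and picking the front member closest to the ideal point is only one possible selection rule, assumed here for illustration rather than taken from the text.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximised)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def closest_to_ideal(front):
    """Pick the front member nearest (squared Euclidean distance) to the ideal
    point, i.e. the per-objective maximum over the front."""
    ideal = tuple(max(p[i] for p in front) for i in range(len(front[0])))
    return min(front, key=lambda p: sum((i - x) ** 2 for i, x in zip(ideal, p)))


# Hypothetical (coverage, anti-redundancy) scores for five candidate summaries.
candidates = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.9), (0.6, 0.5), (0.3, 0.3)]
front = pareto_front(candidates)
print(front)                    # [(0.9, 0.2), (0.7, 0.6), (0.4, 0.9)]
print(closest_to_ideal(front))  # (0.7, 0.6)
```

The first three candidates trade coverage against anti-redundancy and none dominates another, so all three remain on the front; the last two are dominated by (0.7, 0.6). The distance rule then reports the balanced compromise, although a real system might instead weight the objectives or defer to a human judge.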

This special session aims to bring together current research on developing new EAs for solving different NLP problems. Articles demonstrating applications of existing EAs to real-life problems are also welcome. However, we do NOT encourage submissions that focus only on new theories and algorithms without demonstrating their application to NLP data. High-quality articles based on unsupervised deep learning techniques are also welcome. To the best of our knowledge, no previous special session on this topic has been held, as most of the NLP community focuses on deep learning-based methods.

Articles on applications of EAs to NLP problems are invited for this special session. Topics include, but are not limited to, the following:

1) Architecture selection for deep learning-based techniques using evolutionary algorithms
2) Review papers comparing deep and evolutionary techniques
3) Computer-aided machine translation
4) Evolutionary computing for monolingual and cross-lingual NLP tasks
5) Summarization
6) Natural language inference
7) Evolutionary algorithms for textual entailment
8) Text classification/clustering
9) Entity linking
10) Named entity recognition
11) Knowledge extraction and information analysis
12) Natural language generation
13) Dialogue management
14) Slogan generation
15) Visual question answering
16) Sentiment analysis

Because of the wide scope of NLP, some important topics that fit the scope of the special session may not be listed above. Therefore, if you are unsure whether your work would fit, we encourage you to get in touch with any of the organizers. All papers must comply with the basic requirements of IEEE SSCI 2021. The review process will follow the standard review process of IEEE SSCI. Each paper will receive at least three reviews from experts in the field.

Organizers: Sriparna Saha, Naveen Saini, Jose G Moreno

Natural language processing, representation learning.

Graph representation learning, sentence embeddings.

Network Embedding

Text Classification

Graph Classification

Audio Classification

Medical Image Classification

Language modelling.

Long-range modeling

Protein language model, sentence pair modeling, deep hashing, table retrieval, question answering.

Open-Ended Question Answering

Open-Domain Question Answering

Conversational question answering.

Answer Selection

Translation, image generation.

Image-to-Image Translation

Image Inpainting

Text-to-Image Generation

Conditional Image Generation

Data augmentation.

Image Augmentation

Text Augmentation

Machine translation.

Bilingual lexicon induction.

Multimodal Machine Translation

Unsupervised Machine Translation

Text generation.

Dialogue Generation

Data-to-Text Generation

Multi-Document Summarization

Text style transfer.

Topic Models

Document Classification

Sentence Classification

Emotion Classification

2D semantic segmentation, image segmentation.

Scene Parsing

Reflection Removal

Visual question answering (VQA).

Visual Question Answering

Machine Reading Comprehension

Chart Question Answering

Embodied Question Answering

Named entity recognition (NER).

Nested Named Entity Recognition

Chinese named entity recognition, few-shot NER, sentiment analysis.

Aspect-Based Sentiment Analysis (ABSA)

Multimodal Sentiment Analysis

Aspect Sentiment Triplet Extraction

Twitter Sentiment Analysis

Few-shot learning.

One-Shot Learning

Few-Shot Semantic Segmentation

Cross-domain few-shot.

Unsupervised Few-Shot Learning

Word embeddings.

Learning Word Embeddings

Multilingual Word Embeddings

Embeddings evaluation, contextualised word representations, optical character recognition (OCR).

Active Learning

Handwriting Recognition

Handwritten digit recognition, irregular text recognition, text summarization.

Abstractive Text Summarization

Document summarization, extractive text summarization, continual learning.

Class Incremental Learning

Continual named entity recognition, unsupervised class-incremental learning, information retrieval.

Passage Retrieval

Cross-lingual information retrieval, table search, relation extraction.

Relation Classification

Document-level relation extraction, joint entity and relation extraction, temporal relation extraction, link prediction.

Inductive Link Prediction

Dynamic link prediction, anchor link prediction, calibration for link prediction, natural language inference.

Answer Generation

Visual Entailment

Cross-lingual natural language inference, reading comprehension.

Intent Recognition

Implicit relations, active object detection, large language model, emotion recognition.

Speech Emotion Recognition

Emotion Recognition in Conversation

Multimodal Emotion Recognition

Emotion-cause pair extraction, natural language understanding.

Emotional Dialogue Acts

Image captioning.

3D dense captioning

Controllable image captioning, aesthetic image captioning.

Relational Captioning

Semantic textual similarity.

Paraphrase Identification

Cross-Lingual Semantic Textual Similarity

Event extraction, event causality identification, zero-shot event extraction, dialogue state tracking, task-oriented dialogue systems.

Visual Dialog

Dialogue understanding, semantic parsing.

AMR Parsing

Semantic dependency parsing, DRS parsing, UCCA parsing, coreference resolution, coreference-resolution, cross document coreference resolution, semantic similarity, conformal prediction.

Text Simplification

Music Source Separation

Audio source separation.

Decision Making Under Uncertainty

In-Context Learning

Sentence Embedding

Sentence compression, joint multilingual sentence representations, sentence embeddings for biomedical texts, code generation.

Code Translation

Code Documentation Generation

Library-oriented code generation, class-level code generation, dependency parsing.

Transition-Based Dependency Parsing

Prepositional phrase attachment, unsupervised dependency parsing, cross-lingual zero-shot dependency parsing, specificity, information extraction, extractive summarization, temporal information extraction, low resource named entity recognition, cross-lingual, cross-lingual transfer, cross-lingual document classification.

Cross-Lingual Entity Linking

Cross-language text summarization, response generation, common sense reasoning.

Physical Commonsense Reasoning

Riddle sense, anachronisms, memorization, instruction following, visual instruction following, data integration.

Entity Alignment

Entity Resolution

Table annotation, entity linking.

Question Generation

Poll generation, part-of-speech tagging.

Unsupervised Part-Of-Speech Tagging

Topic coverage

Dynamic topic modeling, prompt engineering.

Visual Prompting

Mathematical reasoning.

Math Word Problem Solving

Formal logic, geometry problem solving, abstract algebra, abuse detection, hate speech detection, open information extraction.

Hope Speech Detection

Hate speech normalization, hate speech detection CrisisHateMM benchmark, data mining.

Argument Mining

Opinion Mining

Subgroup discovery, parallel corpus mining, cognitive diagnosis, word sense disambiguation.

Word Sense Induction

Language identification, dialect identification, native language identification, few-shot relation classification, implicit discourse relation classification, cause-effect relation classification, bias detection, selection bias, fake news detection, relational reasoning.

Semantic Role Labeling

Predicate Detection

Semantic role labeling (predicted predicates).

Textual Analogy Parsing

Slot Filling

Zero-shot Slot Filling

Extracting COVID-19 events from Twitter, grammatical error correction.

Grammatical Error Detection

Text matching, document text classification, learning with noisy labels, multi-label classification of biomedical texts, political salient issue orientation detection, POS tagging, deep clustering, trajectory clustering, deep nonparametric clustering, nonparametric deep clustering, multi-modal entity alignment, spoken language understanding, dialogue safety prediction, intent detection.

Open Intent Detection

Word similarity, stance detection, stance detection (US Election 2020 - Biden), stance detection (US Election 2020 - Trump), text-to-speech synthesis.

Prosody Prediction

Zero-shot multi-speaker TTS, zero-shot cross-lingual transfer, cross-lingual NER, intent classification.

Fact Verification

Constituency parsing.

Constituency Grammar Induction

Entity typing.

Entity Typing on DH-KGs

Language acquisition, grounded language learning, document AI, document understanding, self-learning, ad-hoc information retrieval, document ranking.

Cross-Modal Retrieval

Image-text matching, multilingual cross-modal retrieval.

Zero-shot Composed Person Retrieval

Cross-modal retrieval on RSITMD, word alignment, open-domain dialog, dialogue evaluation, novelty detection, multimodal deep learning, multimodal text and image classification, discourse parsing, discourse segmentation, connective detection.

Text-based Image Editing

Zero-Shot Text-to-Image Generation

Concept alignment, conditional text-to-image synthesis, model editing, knowledge editing.

Multi-Label Text Classification

Shallow syntax, sarcasm detection.

Privacy preserving deep learning, lemmatization, explanation generation, morphological analysis.

Aspect Extraction

Extract aspect, aspect category sentiment analysis.

Aspect-oriented Opinion Extraction

Aspect-Category-Opinion-Sentiment Quadruple Extraction

Session search.

Chinese Word Segmentation

Handwritten Chinese text recognition, Chinese spelling error correction, Chinese zero pronoun resolution, offline handwritten Chinese character recognition, molecular representation, entity disambiguation, conversational search, source code summarization, method name prediction, speech-to-text translation, simultaneous speech-to-text translation, text clustering.

Short Text Clustering

Open Intent Discovery

Authorship attribution, keyphrase extraction, linguistic acceptability.

Column Type Annotation

Cell entity annotation, columns property annotation, row annotation, text-to-video generation, text-to-video editing, subject-driven video generation.

Visual Storytelling

KG-to-Text Generation

Unsupervised KG-to-Text Generation

Abusive language, few-shot text classification, zero-shot out-of-domain detection, term extraction, text2text generation, keyphrase generation, figurative language visualization, sketch-to-text generation, protein folding, phrase grounding, grounded open vocabulary acquisition, deep attention, morphological inflection, word translation, multilingual NLP, spam detection, context-specific spam detection, traditional spam detection, summarization, unsupervised extractive summarization, query-focused summarization.

Knowledge Base Population

Natural language transduction, cross-lingual word embeddings, conversational response selection, text annotation, image-to-text retrieval, passage ranking, news classification, key information extraction, biomedical information retrieval.

SpO2 estimation

Authorship verification.

Sentence summarization, unsupervised sentence summarization, keyword extraction, story generation, temporal processing, timex normalization, document dating, multimodal association, multimodal generation, automated essay scoring, morphological tagging, NLG evaluation, meme classification, hateful meme classification, weakly supervised classification, weakly supervised data denoising, entity extraction using GAN.

Rumour Detection

Key point matching, component classification, argument pair extraction (APE), claim extraction with stance classification (CESC), claim-evidence pair extraction (CEPE), semantic composition.

Sentence Ordering

Lexical simplification, token classification, toxic spans detection.

Blackout Poetry Generation

Semantic retrieval, subjectivity analysis.

Taxonomy Learning

Taxonomy expansion, hypernym discovery, conversational response generation.

Personalized and Emotional Conversation

Comment generation.

Review Generation

Sentence-pair classification, emotional intelligence, dark humor detection, lexical normalization, pronunciation dictionary creation, negation detection, negation scope resolution, question similarity, medical question pair similarity computation, intent discovery, propaganda detection, propaganda span identification, propaganda technique identification, lexical analysis, lexical complexity prediction, goal-oriented dialog, user simulation, passage re-ranking, punctuation restoration, reverse dictionary, question rewriting, humor detection.

Meeting Summarization

Table-based fact verification, pretrained multilingual language models, formality style transfer, semi-supervised formality style transfer, word attribute transfer, attribute value extraction, diachronic word embeddings, legal reasoning, Persian sentiment analysis, clinical concept extraction.

Clinical Information Retrieval

Constrained clustering.

Only Connect Walls Dataset Task 1 (Grouping)

Incremental constrained clustering, aspect category detection, dialog act classification, extreme summarization.

Hallucination Evaluation

Recognizing emotion cause in conversations.

Causal Emotion Entailment

Nested Mention Recognition

Relationship extraction (distant supervised), semantic entity labeling, binary classification, LLM-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicious (BI-RADS 4,5)-no suspicious (BI-RADS 1,2,3) per image classification, cancer-no cancer per view classification, clickbait detection, decipherment, text compression, handwriting verification, Bangla spelling error correction, CCG supertagging, probing language models, toponym resolution.

Timeline Summarization

Multimodal abstractive text summarization, reader-aware summarization, code repair, gender bias detection, linguistic steganography, Thai word segmentation, stock prediction, text-based stock prediction, event-driven trading, pair trading.

Face to Face Translation

Multimodal lexical translation, aggression identification, Arabic text diacritization, commonsense causal reasoning, fact selection, suggestion mining, temporal relation classification, Vietnamese datasets, Vietnamese word segmentation, Arabic sentiment analysis, aspect category polarity, complex word identification, cross-lingual bitext mining, morphological disambiguation, scientific document summarization, lay summarization, text attribute transfer.

Image-guided Story Ending Generation

Speculation detection, speculation scope resolution, abstract argumentation, dialogue rewriting, logical reasoning reading comprehension.

Unsupervised Sentence Compression

Sign language production, stereotypical bias analysis, temporal tagging, anaphora resolution, bridging anaphora resolution.

Abstract Anaphora Resolution

Hope speech detection for English, hope speech detection for Malayalam, hope speech detection for Tamil, hidden aspect detection, latent aspect detection, Chinese spell checking, cognate prediction, Japanese word segmentation, memex question answering, polyphone disambiguation, spelling correction, table-to-text generation.

KB-to-Language Generation

Text anonymization, zero-shot sentiment classification, conditional text generation, contextualized literature-based discovery, multimedia generative script learning, image-sentence alignment, open-world social event classification, personality generation, personality alignment, action parsing, author attribution, binary condescension detection, conversational web navigation, Croatian text diacritization, Czech text diacritization, definition modelling, document-level RE with incomplete labeling, domain labelling, French text diacritization, Hungarian text diacritization, Irish text diacritization, Latvian text diacritization, misogynistic aggression identification, morpheme segmentation, multi-agent integration, multi-label condescension detection, news annotation, open relation modeling, reading order detection, record linking, role-filler entity extraction, Romanian text diacritization, Slovak text diacritization, Spanish text diacritization, syntax representation, text-to-video search, Turkish text diacritization, turning point identification, Twitter event detection.

Vietnamese Text Diacritization

Zero-shot machine translation.

Conversational Sentiment Quadruple Extraction

Attribute extraction, legal outcome extraction, automated writing evaluation, chemical indexing, clinical assertion status detection.

Coding Problem Tagging

Collaborative plan acquisition, commonsense reasoning for RL, context query reformulation.

Variable Disambiguation

Cross-lingual text-to-image generation, crowdsourced text aggregation.

Description-guided molecule generation

Multi-modal Dialogue Generation

Page stream segmentation.

Email Thread Summarization

Emergent communications on relations, emotion detection and trigger summarization, extractive tags summarization.

Hate Intensity Prediction

Hate span identification, job prediction, joint entity and relation extraction on scientific data, joint NER and classification, literature mining, math information retrieval, meme captioning, multi-grained named entity recognition, multilingual machine comprehension in English Hindi, multimodal text prediction, negation and speculation cue detection, negation and speculation scope resolution, Only Connect Walls Dataset Task 2 (Connections), overlapping mention recognition, paraphrase generation, multilingual paraphrase generation, personality recognition in conversation.

Phrase Ranking

Phrase tagging, phrase vector embedding, poem meters classification, query wellformedness.

Question-Answer categorization

Readability optimization, reliable intelligence identification, sentence completion, hurtful sentence completion, speaker attribution in German parliamentary debates (GermEval 2023, subtask 1), text effects transfer, text-variation, Vietnamese aspect-based sentiment analysis, sentiment dependency learning, web page tagging, workflow discovery, incongruity detection, multi-word expression embedding, multi-word expression sememe prediction, trustable and focussed LLM generated content, PCL detection, SemEval-2022 task 4-1 (binary PCL detection), SemEval-2022 task 4-2 (multi-label PCL detection), automatic writing, complaint comment classification, counterspeech detection, face selection, job classification, multi-lingual text-to-image generation, multilingual neural machine translation, optical character recognition, Bangla text detection, question to declarative sentence, relation mention extraction.

Tweet-Reply Sentiment Analysis

Vietnamese parsing.

Natural language processing: state of the art, current trends and challenges

  • Published: 14 July 2022
  • Volume 82, pages 3713–3744 (2023)

  • Diksha Khurana 1,
  • Aditya Koli 1,
  • Kiran Khatter 2 &
  • Sukhdev Singh 3


This article has been updated

Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. Its applications have spread to various fields such as machine translation, email spam detection, information extraction, summarization, medicine, and question answering. In this paper, we first distinguish four phases by discussing different levels of NLP and components of Natural Language Generation, followed by presenting the history and evolution of NLP. We then discuss in detail the state of the art, presenting the various applications of NLP, current trends, and challenges. Finally, we present a discussion on some available datasets, models, and evaluation metrics in NLP.


1 Introduction

A language can be defined as a set of rules or a set of symbols, where the symbols are combined and used to convey or broadcast information. Since not all users are well versed in machine-specific languages, Natural Language Processing (NLP) caters to those users who do not have enough time to learn new languages or to master them. NLP is a branch of Artificial Intelligence and Linguistics devoted to making computers understand statements or words written in human languages. It came into existence to ease the user’s work and to satisfy the wish to communicate with the computer in natural language. It can be classified into two parts, Natural Language Understanding (or Linguistics) and Natural Language Generation, which cover the tasks of understanding and generating text respectively. Linguistics is the science of language and includes Phonology, which refers to sound; Morphology, which refers to word formation; Syntax, which refers to sentence structure; Semantics, which refers to meaning; and Pragmatics, which refers to understanding in context. Noam Chomsky, one of the first linguists of the twentieth century to develop syntactic theories, holds a unique position in theoretical linguistics because he revolutionized the area of syntax (Chomsky, 1965) [ 23 ]. Further, Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences and paragraphs from an internal representation. The first objective of this paper is to give insights into the important terminologies of NLP and NLG.

In the existing literature, most of the work in NLP is conducted by computer scientists, although other professionals such as linguists, psychologists, and philosophers have also shown interest. One of the most interesting aspects of NLP is that it adds to our knowledge of human language. The field of NLP draws on different theories and techniques that deal with the problem of communicating with computers in natural language. Some of the researched tasks of NLP are Automatic Summarization (producing an understandable summary of a set of text, or summaries or detailed information of text of a known type), Co-Reference Resolution (determining, within a sentence or larger body of text, all words that refer to the same object), Discourse Analysis (identifying the discourse structure of connected text, i.e. the study of text in relation to social context), Machine Translation (automatic translation of text from one language to another), Morphological Segmentation (breaking words into individual meaning-bearing morphemes), Named Entity Recognition (NER; recognizing named entities for information extraction and classifying them into different classes), Optical Character Recognition (OCR; automatic text recognition by translating printed and handwritten text into machine-readable format), and Part-of-Speech Tagging (determining the part of speech of each word in a sentence). Some of these tasks, such as machine translation, named entity recognition, and optical character recognition, have direct real-world applications. Although NLP tasks are closely interwoven, they are often treated separately for convenience.
Some tasks, such as automatic summarization and co-reference analysis, act as subtasks that are used in solving larger tasks. NLP is much discussed nowadays because of its various applications and recent developments, although in the late 1940s the term did not even exist. It is therefore interesting to look at the history of NLP, the progress made so far, and some of the ongoing projects that make use of NLP. The second objective of this paper focuses on these aspects. The third objective is to discuss datasets, approaches, evaluation metrics and the challenges involved in NLP. The rest of this paper is organized as follows. Section 2 deals with the first objective, covering the important terminologies of NLP and NLG. Section 3 deals with the history of NLP, applications of NLP and a walkthrough of recent developments. Datasets used in NLP and various approaches are presented in Section 4, and Section 5 covers evaluation metrics and the challenges involved in NLP. Finally, a conclusion is presented in Section 6.

2 Components of NLP

NLP can be classified into two parts, Natural Language Understanding and Natural Language Generation, which cover the tasks of understanding and generating text. Figure 1 presents the broad classification of NLP. The objective of this section is to discuss Natural Language Understanding (Linguistic) (NLU) and Natural Language Generation (NLG).

Fig. 1 Broad classification of NLP

NLU enables machines to understand natural language and analyze it by extracting concepts, entities, emotion, keywords etc. It is used in customer care applications to understand the problems reported by customers either verbally or in writing. Linguistics is the science which involves the meaning of language, language context and various forms of the language. So, it is important to understand various important terminologies of NLP and different levels of NLP. We next discuss some of the commonly used terminologies in different levels of NLP.

Phonology is the part of Linguistics that deals with the systematic arrangement of sound. The term phonology comes from Ancient Greek, in which phono means voice or sound and the suffix –logy refers to word or speech. In 1939 Nikolai Trubetzkoy stated that phonology is “the study of sound pertaining to the system of language”, whereas Lass (1998) [ 66 ] wrote that phonology refers broadly to the sub-discipline of linguistics concerned with the sounds of language, and with their behavior and organization. Phonology includes the systematic use of sound to encode meaning in any human language.

The different parts of a word represent the smallest units of meaning, known as morphemes. Morphology, which concerns the nature of words, is built on morphemes. For example, the word precancellation can be morphologically analyzed into three separate morphemes: the prefix pre , the root cancella , and the suffix -tion . The interpretation of a morpheme stays the same across all words, so humans can break an unknown word into morphemes to understand its meaning. For example, adding the suffix –ed to a verb conveys that the action of the verb took place in the past. Words that cannot be divided and have meaning by themselves are called lexical morphemes (e.g. table, chair). Elements such as -ed, −ing, −est, −ly and −ful that are combined with a lexical morpheme are known as grammatical morphemes (e.g. worked, consulting, smallest, likely, useful). Grammatical morphemes that occur only in combination are called bound morphemes (e.g. -ed, −ing). Bound morphemes can be divided into inflectional morphemes and derivational morphemes. Adding an inflectional morpheme to a word changes grammatical categories such as tense, gender, person, mood, aspect, definiteness and animacy. For example, adding the inflectional morpheme –ed changes the root park to parked . Derivational morphemes change the semantic meaning of the word they are combined with. For example, in the word normalize, adding the bound morpheme –ize to the root normal changes the word from an adjective ( normal ) to a verb ( normalize ).

At the lexical level, humans, as well as NLP systems, interpret the meaning of individual words. Several types of processing contribute to word-level understanding, the first being the assignment of a part-of-speech (PoS) tag to each word. In this processing, words that can act as more than one part of speech are assigned the most probable PoS tag based on the context in which they occur. At the lexical level, words that have only one meaning can be replaced by a semantic representation of that meaning; the nature of the representation varies according to the semantic theory deployed in the NLP system. Assigning the correct PoS tag likewise improves the understanding of the intended meaning of a sentence. Thus, at the lexical level, the structure of words is analyzed with respect to their lexical meaning and PoS. In this analysis, text is divided into paragraphs, sentences, and words. Lexical analysis is also used for cleaning and feature extraction, using techniques such as stop-word removal, stemming, and lemmatization. Stop words such as ‘in’, ‘the’, and ‘and’ are removed because they do not contribute to any meaningful interpretation and their high frequency may increase computation time. Stemming reduces a word to its root form by removing its suffix. For example, consulting and consultant are both converted to consult after stemming, using is converted to us , and driver is reduced to driv . Lemmatization does not simply remove the suffix of a word; instead, it uses a vocabulary to return the base form of the word. For example, for the token drived , stemming results in “driv”, whereas lemmatization attempts to return the correct base form, either drive or drived , depending on the context in which it is used.
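The difference between stop-word removal, stemming, and lemmatization can be illustrated with a minimal sketch. The stop-word set, the suffix list, and the lemma vocabulary below are small hypothetical tables chosen only to reproduce the examples in the text; production systems rely on much larger resources and on more careful rules.

```python
# Toy resources: all three tables are illustrative assumptions, not standard lists.
STOP_WORDS = {"in", "the", "and", "a", "of", "is"}
SUFFIXES = ("ing", "ant", "er", "ed", "s")  # tried in this order
LEMMAS = {"drove": "drive", "driven": "drive", "using": "use"}


def remove_stop_words(tokens):
    """Drop high-frequency function words that carry little meaning."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in a vocabulary; fall back to the word itself."""
    return LEMMAS.get(word, word)


tokens = remove_stop_words("the driver is using the car".split())
stems = [stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]
```

With these tables, stemming maps "using" to "us" because it only strips characters, while the vocabulary lookup maps it to "use". This mirrors the contrast drawn above between a blind suffix rule and a dictionary-backed base form.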

After PoS tagging at the lexical level, words are grouped into phrases, phrases are grouped into clauses, and clauses are combined into sentences at the syntactic level. This level emphasizes the correct formation of a sentence by analyzing its grammatical structure. Its output is a representation of the sentence that reveals the structural dependencies between words. This process is also known as parsing, and it uncovers phrases that convey more meaning than the individual words do. The syntactic level examines word order, stop words, morphology and the PoS of words, which the lexical level does not consider. Changing the word order changes the dependencies among words and may also affect the comprehension of a sentence. For example, the sentences “ram beats shyam in a competition” and “shyam beats ram in a competition” differ only in syntax yet convey different meanings [ 139 ]. The syntactic level retains stop words, because removing them changes the meaning of the sentence. It does not use lemmatization or stemming, because converting words to their base forms changes the grammar of the sentence. It focuses on identifying the correct PoS of each word in a sentence. For example, “frowns” is a noun in the phrase “frowns on his face”, whereas it is a verb in the sentence “he frowns”.

At the semantic level, the most important task is to determine the proper meaning of a sentence. To understand the meaning of a sentence, human beings rely on their knowledge of language and of the concepts present in that sentence, but machines cannot count on this knowledge. Semantic processing determines the possible meanings of a sentence by processing its logical structure to recognize the most relevant words and to understand the interactions among words or concepts in the sentence. For example, it understands that a sentence is about “movies” even if it does not contain that word, provided it contains related concepts such as “actor”, “actress”, “dialogue” or “script”. This level of processing also incorporates the semantic disambiguation of words with multiple senses (Elizabeth D. Liddy, 2001) [ 68 ]. For example, the word “bark” as a noun can mean either the sound that a dog makes or the outer covering of a tree. The semantic level examines words for their dictionary interpretation or for an interpretation derived from the context of the sentence. For example, the sentence “Krishna is good and noble.” may be talking either about Lord Krishna or about a person named “Krishna”. That is why, to get the proper meaning of the sentence, the appropriate interpretation is chosen by looking at the rest of the sentence [ 44 ].
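One simple way to pick a sense from context is to count the words shared between the sentence and a dictionary gloss of each sense, in the spirit of the simplified Lesk approach. The sketch below is an illustration under assumptions: the two glosses for “bark” and the small stop-word set are hypothetical, and the text does not claim that any particular system works this way.

```python
# Hypothetical sense inventory: glosses are paraphrased from the example in the text.
SENSES = {
    "bark": {
        "dog_sound": "the sound that a dog makes",
        "tree_covering": "the outer covering of a tree trunk",
    }
}
STOP = {"the", "a", "of", "that", "is", "on", "at"}


def content_words(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    return {w.strip(".,").lower() for w in text.split()} - STOP


def disambiguate(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


sense_a = disambiguate("bark", "The dog began to bark at the stranger")
sense_b = disambiguate("bark", "The bark of the old tree was rough")
```

Here the first sentence shares “dog” with one gloss and the second shares “tree” with the other, so the two occurrences resolve to different senses. Ties fall back to the first sense listed, which is an arbitrary choice in this sketch.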

While the syntactic and semantic levels deal with sentence-length units, the discourse level of NLP deals with more than one sentence. It analyzes the logical structure of a text by making connections among words and sentences that ensure its coherence. It focuses on the properties of the text that convey meaning by interpreting the relations between sentences and uncovering linguistic structures from texts at several levels (Liddy, 2001) [ 68 ]. Two of the most common tasks at this level are Anaphora Resolution and Coreference Resolution. Anaphora resolution is achieved by recognizing the entity referenced by an anaphor, in order to resolve references used within the text with the same sense. For example: (i) Ram topped in the class. (ii) He was intelligent. Here (i) and (ii) together form a discourse. Human beings can quickly understand that the pronoun “he” in (ii) refers to “Ram” in (i). The interpretation of “He” depends on another word, “Ram”, presented earlier in the text. Without determining the relationship between these two structures, it would not be possible to decide why Ram topped the class and who was intelligent. Coreference resolution is achieved by finding all expressions that refer to the same entity in a text. It is an important step in various NLP applications that involve high-level NLP tasks such as document summarization and information extraction. In fact, anaphora is encoded through one of the processes called co-reference.
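The Ram example can be reproduced with a deliberately naive heuristic: link each pronoun to the most recently mentioned entity. This is only a sketch under strong assumptions. The set of entity names is supplied by hand, and no gender, number, or syntactic constraints are checked, all of which real resolvers take into account.

```python
PRONOUNS = {"he", "she", "it", "they"}


def resolve_pronouns(sentences, entities):
    """Link each pronoun to the most recently mentioned known entity.

    `entities` is a hand-supplied set of names (an assumption of this sketch).
    Returns (sentence_index, pronoun, antecedent) triples.
    """
    last_entity = None
    links = []
    for s_idx, sentence in enumerate(sentences):
        for token in sentence.replace(".", "").split():
            if token in entities:
                last_entity = token
            elif token.lower() in PRONOUNS and last_entity is not None:
                links.append((s_idx, token, last_entity))
    return links


discourse = ["Ram topped in the class.", "He was intelligent."]
links = resolve_pronouns(discourse, {"Ram"})
```

The recency heuristic fails as soon as two candidate entities precede the pronoun, which is where agreement features and discourse structure become necessary.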

The pragmatic level focuses on knowledge that comes from outside the content of the document. It deals with what the speaker implies and what the listener infers. In fact, it analyzes what is not directly stated. Real-world knowledge is used to understand what is being talked about in the text, and a meaningful representation of the text is derived by analyzing its context. Pragmatic ambiguity arises when a sentence is not specific and the context does not provide any specific information about it (Walton, 1996) [ 143 ]. It occurs when different persons derive different interpretations of the text, depending on its context. The context of a text may include references to other sentences of the same document, which influence the understanding of the text, and the background knowledge of the reader or speaker, which gives meaning to the concepts expressed in that text. Semantic analysis focuses on the literal meaning of the words, whereas pragmatic analysis focuses on the inferred meaning that readers perceive based on their background knowledge. For example, the sentence “Do you know what time it is?” is interpreted as asking for the current time in semantic analysis, whereas in pragmatic analysis the same sentence may express resentment toward someone who missed the due time. Thus, semantic analysis is the study of the relationship between linguistic utterances and their meanings, while pragmatic analysis is the study of the context that influences our understanding of linguistic expressions. Pragmatic analysis helps users to uncover the intended meaning of the text by applying contextual background knowledge.

The goal of NLP is to accommodate one or more specialties of an algorithm or system. Assessing an algorithmic system with NLP metrics allows for the integration of language understanding and language generation. NLP is even used in multilingual event detection. Rospocher et al. [ 112 ] proposed a novel modular system for cross-lingual event extraction from English, Dutch, and Italian texts, using different pipelines for different languages. The system incorporates a modular set of leading multilingual NLP tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual named entity linking, semantic role labeling and time normalization. Thus, the cross-lingual framework allows for the interpretation of events, participants, locations, and time, as well as the relations between them. The output of these individual pipelines is intended to be used as input for a system that builds event-centric knowledge graphs. All modules take standard input, perform some annotation, and produce standard output, which in turn becomes the input for the next module in the pipeline. The pipelines are built as a data-centric architecture so that modules can be adapted and replaced. Furthermore, the modular architecture allows for different configurations and for dynamic distribution.

Ambiguity is one of the major problems of natural language; it occurs when one sentence can lead to different interpretations. It is usually encountered at the syntactic, semantic, and lexical levels. In syntactic ambiguity, one sentence can be parsed into multiple syntactic forms. Semantic ambiguity occurs when the meaning of words can be misinterpreted. Lexical ambiguity refers to a single word that can have multiple meanings. Each of these levels can produce ambiguities that can be resolved with knowledge of the complete sentence. Ambiguity can be handled by various methods such as Minimizing Ambiguity, Preserving Ambiguity, Interactive Disambiguation and Weighting Ambiguity [ 125 ]. Some researchers have proposed preserving ambiguity, e.g. (Shemtov 1997; Emele & Dorna 1998; Knight & Langkilde 2000; Tong Gao et al. 2015; Umber & Bajwa 2011) [ 39 , 46 , 65 , 125 , 139 ]. Their objectives are closely in line with removing or minimizing ambiguity. They cover a wide range of ambiguities, and there is a statistical element implicit in their approach.

Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences and paragraphs from an internal representation. It is a part of Natural Language Processing and happens in four phases: identifying the goals, planning how the goals may be achieved by evaluating the situation and the available communicative sources, and realizing the plans as a text (Fig. 2 ). It is the opposite of Natural Language Understanding.

Speaker and Generator

Fig. 2 Components of NLG

To generate a text, we need to have a speaker or an application and a generator or a program that renders the application’s intentions into a fluent phrase relevant to the situation.

Components and Levels of Representation

The process of language generation involves the following interwoven tasks. Content selection: Information should be selected and included in the set. Depending on how this information is parsed into representational units, parts of the units may have to be removed while some others may be added by default. Textual organization: The information must be textually organized according to the grammar; it must be ordered both sequentially and in terms of linguistic relations like modifications. Linguistic resources: To support the information’s realization, linguistic resources must be chosen. In the end these resources will come down to choices of particular words, idioms, syntactic constructs etc. Realization: The selected and organized resources must be realized as an actual text or voice output.
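The four tasks above can be sketched as a small template-based pipeline. Everything in this example is hypothetical: the fact format with `topic`, `time`, and `change` fields, the weather domain, and the sentence template are assumptions made for illustration rather than a description of any particular NLG system.

```python
def select_content(facts, goal):
    """Content selection: keep only facts relevant to the communicative goal."""
    return [f for f in facts if f["topic"] == goal]


def organize(facts):
    """Textual organization: order the facts sequentially (here, by time)."""
    return sorted(facts, key=lambda f: f["time"])


def choose_resources(fact):
    """Linguistic resources: pick the words and the template for one fact."""
    verb = "rose" if fact["change"] > 0 else "fell"
    template = "{subject} {verb} by {amount} degrees at {time}:00"
    slots = {
        "subject": "The temperature",
        "verb": verb,
        "amount": abs(fact["change"]),
        "time": fact["time"],
    }
    return template, slots


def realize(facts):
    """Realization: turn the chosen resources into actual text."""
    sentences = []
    for fact in facts:
        template, slots = choose_resources(fact)
        sentences.append(template.format(**slots) + ".")
    return " ".join(sentences)


def generate(facts, goal):
    return realize(organize(select_content(facts, goal)))


facts = [
    {"topic": "temperature", "time": 15, "change": -2},
    {"topic": "wind", "time": 9, "change": 5},
    {"topic": "temperature", "time": 9, "change": 3},
]
text = generate(facts, "temperature")
```

Keeping each task in its own function makes the interleaving visible: the wind fact is dropped at selection, ordering is decided before any words are chosen, and word choice is separated from the final string assembly.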

Application or Speaker

The application or speaker only maintains the model of the situation. The speaker just initiates the process and does not take part in the language generation itself. It stores the history, structures the content that is potentially relevant, and deploys a representation of what it knows. All of these form the situation, from which the speaker selects a subset of the propositions it has. The only requirement is that the speaker must make sense of the situation [ 91 ].

3 NLP: Then and now

In the late 1940s the term NLP did not yet exist, but work on machine translation (MT) had started. Research in this period was not completely localized. Russian and English were the dominant languages for MT (Andreev, 1967) [ 4 ]. MT/NLP research almost died in 1966 following the ALPAC report, which concluded that MT was going nowhere. Later, however, some MT production systems were providing output to their customers (Hutchins, 1986) [ 60 ]. By this time, work on the use of computers for literary and linguistic studies had also started. As early as 1960, signature work influenced by AI began with the BASEBALL question-answering system (Green et al., 1961) [ 51 ]. LUNAR (Woods, 1978) [ 152 ] and Winograd’s SHRDLU were natural successors of these systems, but they were seen as a step up in sophistication in terms of their linguistic and task-processing capabilities. There was a widespread belief that progress could only be made on two fronts: the ARPA Speech Understanding Research (SUR) project (Lea, 1980) and some major system development projects building database front ends. The front-end projects (Hendrix et al., 1978) [ 55 ] were intended to go beyond LUNAR in interfacing with large databases. In the early 1980s computational grammar theory became a very active area of research, linked with logics for meaning and knowledge that could deal with the user’s beliefs and intentions and with functions like emphasis and themes.

By the end of the decade, powerful general-purpose sentence processors like SRI’s Core Language Engine (Alshawi, 1992) [ 2 ] and Discourse Representation Theory (Kamp and Reyle, 1993) [ 62 ] offered a means of tackling more extended discourse within the grammatico-logical framework. This was a period of growing community. Practical resources, grammars, tools and parsers became available, for example the Alvey Natural Language Tools (Briscoe et al., 1987) [ 18 ]. The (D)ARPA speech recognition and message understanding (information extraction) conferences were notable not only for the tasks they addressed but also for their emphasis on heavy evaluation, starting a trend that became a major feature of the 1990s (Young and Chase, 1998; Sundheim and Chinchor, 1993) [ 131 , 157 ]. Work on user modeling (Wahlster and Kobsa, 1989) [ 142 ] was one strand of research. Cohen et al. (2002) [ 28 ] put forward a first approximation of a compositional theory of tune interpretation, together with the phonological assumptions on which it is based and the evidence from which they drew their proposals. At the same time, McKeown (1985) [ 85 ] demonstrated that rhetorical schemas could be used for producing text that is both linguistically coherent and communicatively effective. Some research in NLP marked important topics for the future, such as word sense disambiguation (Small et al., 1988) [ 126 ] and probabilistic networks; statistically colored NLP and the work on the lexicon also pointed in this direction. Statistical language processing was a major theme in the 1990s (Manning and Schuetze, 1999) [ 75 ], and it involved more than data analysis alone. Information extraction and automatic summarization (Mani and Maybury, 1999) [ 74 ] were also a point of focus. Next, we present a walkthrough of the developments from the early 2000s.

3.1 A walkthrough of recent developments in NLP

The main objectives of NLP include interpretation, analysis, and manipulation of natural language data for the intended purpose with the use of various algorithms, tools, and methods. However, there are many challenges involved which may depend upon the natural language data under consideration, and so makes it difficult to achieve all the objectives with a single approach. Therefore, the development of different tools and methods in the field of NLP and relevant areas of studies have received much attention from several researchers in the recent past. The developments can be seen in the Fig.  3 :

Fig. 3 A walkthrough of recent developments in NLP

In the early 2000s, neural language modeling was introduced, in which the probability of the next word (token) is determined given the n previous words. Bengio et al. [ 12 ] proposed the concept of a feed-forward neural network with a lookup table that represents the n previous words in the sequence. Collobert et al. [ 29 ] proposed the application of multitask learning in the field of NLP, where two convolutional models with max pooling were used to perform part-of-speech and named entity recognition tagging. Mikolov et al. [ 87 ] proposed a word embedding process that addressed the dense vector representation of text. They also reported the challenges faced by the traditional sparse bag-of-words representation. After the advancement of word embeddings, neural networks that take variable-length input for further processing were introduced in the field of NLP. Sutskever et al. [ 132 ] proposed a general framework for sequence-to-sequence mapping in which encoder and decoder networks are used to map from sequence to vector and from vector to sequence respectively. In fact, the use of neural networks has played a very important role in NLP. One can observe from the existing literature that neural networks were not widely used in the early 2000s, but by the year 2013 enough discussion had taken place about their use in NLP to transform the field and pave the way for implementing various neural networks in NLP. Earlier, Convolutional Neural Networks (CNNs) contributed to the field of image classification and the analysis of visual imagery. Later, CNNs were used to tackle NLP tasks such as Sentence Classification [ 127 ], Sentiment Analysis [ 135 ], Text Classification [ 118 ], Text Summarization [ 158 ], Machine Translation [ 70 ] and Answer Relations [ 150 ].
An article by Newatia (2019) [ 93 ] illustrates the general architecture behind any CNN model and how it can be used in the context of NLP. One can also refer to the work of Wang and Gang [ 145 ] for the applications of CNNs in NLP. Further, neural networks that are recurrent in nature because they perform the same function for every element of the data, known as Recurrent Neural Networks (RNNs), have also been used in NLP and found ideal for sequential data such as text, time series, financial data, speech, audio and video, among others; see the article by Thomas (2019) [ 137 ]. One of the modified versions of RNNs is Long Short-Term Memory (LSTM), which is very useful in cases where only the important information needs to be retained for a much longer time while irrelevant information is discarded; see [ 52 , 58 ]. Further development of the LSTM has also led to a slightly simpler variant, called the gated recurrent unit (GRU), which has shown better results than standard LSTMs in many tasks [ 22 , 26 ]. Attention mechanisms [ 7 ], which let a network learn what to pay attention to in accordance with the current hidden state and annotation, together with the use of transformers, have also brought significant development in NLP; see [ 141 ]. It should be noted that Transformers have the potential to learn longer-term dependencies but are limited by a fixed-length context in the setting of language modeling. In this direction, Dai et al. [ 30 ] proposed a novel neural architecture, Transformer-XL (XL for extra-long), which enables learning dependencies beyond a fixed length of words. Further, the work of Rae et al. [ 104 ] on the Compressive Transformer, an attentive sequence model which compresses memories for long-range sequence learning, may be helpful for readers. One may also refer to the recent work by Otter et al. [ 98 ] on uses of Deep Learning for NLP, and relevant references cited therein.
The BERT (Bidirectional Encoder Representations from Transformers) [ 33 ] model and its successors have also played an important role in NLP.
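The language modeling objective described at the start of this discussion, estimating the probability of the next word given the previous words, can be illustrated without a neural network. The sketch below swaps in a count-based n-gram model with maximum-likelihood estimates; it is a stand-in chosen for brevity, not the feed-forward architecture of [ 12 ], and the toy corpus is an assumption.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1 : i])
        counts[context][tokens[i]] += 1
    return counts


def next_word_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 if context unseen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


corpus = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(corpus, n=2)
p_cat = next_word_prob(bigram, ["the"], "cat")
p_sat = next_word_prob(bigram, ["cat"], "sat")
```

In this corpus “the” is followed by “cat” twice and by “mat” once, so the estimate for “cat” is two thirds. A neural model replaces the count table with learned dense representations, which lets it assign non-zero probability to word sequences it has never seen.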

Many researchers have worked on NLP, building the tools and systems that make NLP what it is today. Tools such as Sentiment Analyzers, Part-of-Speech (POS) Taggers, Chunkers, Named Entity Recognition (NER), Emotion Detection, and Semantic Role Labeling have made a huge contribution to NLP and are good topics for research. Sentiment analysis (Nasukawa et al., 2003) [ 156 ] works by extracting sentiments about a given topic, and it consists of topic-specific feature term extraction, sentiment extraction, and association by relationship analysis. It utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. It analyzes documents for positive and negative words and tries to give ratings on a scale of −5 to +5. Most currently used tagsets are derived from English. The most widely used tagsets serving as standard guidelines are designed for Indo-European languages, and tagging is less researched for Asian or Middle Eastern languages. Various authors have worked on part-of-speech taggers for languages such as Arabic (Zeroual et al., 2017) [ 160 ], Sanskrit (Tapswi & Jain, 2012) [ 136 ], and Hindi (Ranjan & Basu, 2003) [ 105 ] to efficiently tag and classify words as nouns, adjectives, verbs etc. The authors of [ 136 ] used a treebank technique to create a rule-based POS tagger for the Sanskrit language. Sanskrit sentences are parsed to assign the appropriate tag to each word using a suffix stripping algorithm, in which the longest suffix is searched for in the suffix table and tags are assigned. Diab et al. (2004) [ 34 ] used a supervised machine learning approach and adopted Support Vector Machines (SVMs), trained on the Arabic Treebank, to automatically tokenize, part-of-speech tag, and annotate base phrases in Arabic text.
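The longest-suffix lookup mentioned for rule-based tagging can be sketched as follows. The suffix table here is a small hypothetical English one invented for illustration; it is not the Sanskrit suffix table used in [ 136 ], and the default tag for unmatched words is likewise an assumption.

```python
# Hypothetical English suffix table; a real tagger would use a curated one.
SUFFIX_TAGS = {
    "ing": "VERB",
    "ed": "VERB",
    "ly": "ADV",
    "ness": "NOUN",
    "ful": "ADJ",
    "s": "NOUN",
}


def tag_by_longest_suffix(word, suffix_tags=SUFFIX_TAGS, default="NOUN"):
    """Return (suffix, tag) for the longest table suffix that ends the word."""
    # Try candidate suffixes from longest to shortest, never the whole word.
    for length in range(len(word) - 1, 0, -1):
        suffix = word[-length:]
        if suffix in suffix_tags:
            return suffix, suffix_tags[suffix]
    return "", default


tags = {w: tag_by_longest_suffix(w) for w in ["kindness", "quickly", "walked", "table"]}
```

Searching from the longest candidate downward matters: “kindness” ends in both “ness” and “s”, and the longer match is the one that decides the tag.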

Chunking is the process of separating phrases from unstructured text. Since simple tokens may not represent the actual meaning of the text, it is advisable to treat phrases such as “North Africa” as a single unit instead of the separate words ‘North’ and ‘Africa’. Chunking, also known as “Shallow Parsing”, labels parts of sentences with syntactically correlated keywords like Noun Phrase (NP) and Verb Phrase (VP). Chunking is often evaluated using the CoNLL 2000 shared task. Various researchers (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008) [ 83 , 122 , 130 ] used the CoNLL test data for chunking, with features composed of words, POS tags, and chunk tags.
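A minimal rule-based noun-phrase chunker shows how chunking builds on POS tags. This is a sketch under assumptions: the input is already tagged, the tag names (`DET`, `ADJ`, `NOUN`, `PROPN`) are chosen for the example, and a single hand-written rule stands in for the statistical models used in the CoNLL 2000 work cited above.

```python
NP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}
HEAD_TAGS = {"NOUN", "PROPN"}


def np_chunks(tagged):
    """Group maximal runs of determiner/adjective/noun tags that end in a noun."""
    chunks, current = [], []
    # A sentinel token flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        # Trim trailing non-nouns so every chunk ends in a noun head.
        while current and current[-1][1] not in HEAD_TAGS:
            current.pop()
        if current:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


sentence = [
    ("North", "PROPN"), ("Africa", "PROPN"), ("has", "VERB"),
    ("a", "DET"), ("long", "ADJ"), ("coastline", "NOUN"),
]
chunks = np_chunks(sentence)
```

The chunker returns “North Africa” as one unit rather than two tokens, which is the behavior motivated at the start of the paragraph above.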

There are particular words in a document that refer to specific entities or real-world objects such as locations, people, and organizations. To find the words that have a unique context and are more informative, noun phrases in the text documents are considered. Named entity recognition (NER) is a technique to recognize and separate the named entities and group them under predefined classes. In the era of the Internet, however, people use slang rather than traditional or standard English, and such text cannot be processed well by standard natural language processing tools. Ritter et al. (2011) [ 111 ] proposed the classification of named entities in tweets because standard NLP tools did not perform well on tweets. They rebuilt the NLP pipeline, starting from PoS tagging and followed by chunking, for NER. This improved performance in comparison with standard NLP tools.

Emotion detection investigates and identifies the types of emotion from speech, facial expressions, gestures, and text. Sharma (2016) [ 124 ] analyzed conversations in Hinglish, a mix of English and Hindi, and identified the usage patterns of PoS. Their work was based on language identification and POS tagging of mixed script. They tried to detect emotions in mixed script by combining machine learning and human knowledge. They categorized sentences into 6 groups based on emotions and used the TLBO technique to help users prioritize their messages based on the emotions attached to each message. Seal et al. (2020) [ 120 ] proposed an efficient emotion detection method that searches for emotional words in a pre-defined emotional keyword database and analyzes the emotion words, phrasal verbs, and negation words. Their proposed approach exhibited better performance than recent approaches.
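The keyword-database idea can be sketched in a few lines. This is only a toy illustration and not the method of [ 120 ]: the keyword table, the negation list, and the rule that a negation word directly before a keyword cancels it are all assumptions made for the example.

```python
# Hypothetical emotion keyword database (illustrative entries only).
EMOTION_KEYWORDS = {"happy": "joy", "glad": "joy", "afraid": "fear",
                    "angry": "anger", "sad": "sadness"}
NEGATIONS = {"not", "never", "no"}

def detect_emotions(sentence):
    """Return emotions whose keywords appear without a negation word directly before them."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    found = []
    for i, tok in enumerate(tokens):
        emotion = EMOTION_KEYWORDS.get(tok)
        if emotion is None:
            continue
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if not negated and emotion not in found:
            found.append(emotion)
    return found

if __name__ == "__main__":
    print(detect_emotions("I am happy but a little afraid."))
    print(detect_emotions("I am not angry, just sad!"))
```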

Semantic Role Labeling (SRL) works by assigning semantic roles to the constituents of a sentence. For example, in the PropBank (Palmer et al., 2005) [ 100 ] formalism, one assigns roles to words that are arguments of a verb in the sentence. The precise arguments depend on the verb frame, and if multiple verbs exist in a sentence, a word might have multiple tags. State-of-the-art SRL systems comprise several stages: creating a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

Event discovery in social media feeds (Benson et al., 2011) [ 13 ] uses a graphical model to analyze social media feeds and determine whether they contain the name of a person, venue, place, time etc. The model operates on noisy feeds of data to extract records of events by aggregating information across multiple messages. Despite the irrelevant noisy messages and very irregular message language, this model was able to extract records with a broader array of features on factors.

We first give insights on some of the mentioned tools and relevant work done before moving to the broad applications of NLP.

3.2 Applications of NLP

Natural Language Processing can be applied in various areas such as Machine Translation, Email Spam Detection, Information Extraction, Summarization, and Question Answering. Next, we discuss some of these areas with the relevant work done in those directions.

Machine Translation

As most of the world is online, the task of making data accessible and available to all is a challenge. A major challenge in making data accessible is the language barrier. There are a multitude of languages with different sentence structures and grammar. Machine translation generally means translating phrases from one language to another with the help of a statistical engine like Google Translate. The challenge with machine translation technologies is not directly translating words but keeping the meaning of sentences intact along with grammar and tenses. Statistical machine translation systems gather as much data as they can find that appears to be parallel between two languages and crunch that data to find the likelihood that something in Language A corresponds to something in Language B. Google, in September 2016, announced a new machine translation system based on artificial neural networks and deep learning. In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate (Tillmann et al., 1997) [ 138 ], generation string accuracy (Bangalore et al., 2000) [ 8 ], multi-reference word error rate (Nießen et al., 2000) [ 95 ], BLEU score (Papineni et al., 2002) [ 101 ], and NIST score (Doddington, 2002) [ 35 ]. All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation to human subjective evaluation of fluency and adequacy (Papineni et al., 2001; Doddington, 2002) [ 35 , 101 ].
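To make two of the evaluation measures above concrete, here is a minimal sketch of word error rate (word-level edit distance divided by reference length) and a position-independent variant that ignores word order. The position-independent formula used here is one common simplified formulation and is an assumption, not necessarily the exact definition in [ 138 ]; both functions assume a single reference and whitespace tokenization.

```python
from collections import Counter

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def position_independent_error_rate(reference, hypothesis):
    """Bag-of-words error rate: word order is ignored (simplified formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(word_error_rate(ref, "the cat sat on mat"))
    print(position_independent_error_rate(ref, "on the mat the cat sat"))
```

A reordered hypothesis is penalized by the word error rate but not by the position-independent variant, which is why the two are usually reported together.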

Text Categorization

Categorization systems take as input a large flow of data such as official documents, military casualty reports, market data, and newswires, and assign them to predefined categories or indices. For example, the Carnegie Group’s Construe system (Hayes, 1991) [ 54 ] takes Reuters articles as input and saves much time by doing work that would otherwise be done by staff or human indexers. Some companies have been using categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks. Another application of text categorization is email spam filtering. Spam filters are becoming important as the first line of defence against unwanted emails. The false negative and false positive problem of spam filters is at the heart of NLP technology, as it comes down to the challenge of extracting meaning from strings of text. A filtering solution that is applied to an email system uses a set of protocols to determine which of the incoming messages are spam and which are not. There are several types of spam filters available. Content filters: review the content within the message to determine whether it is spam or not. Header filters: review the email header looking for fake information. General blacklist filters: stop all emails from blacklisted senders. Rules-based filters: use user-defined criteria, such as stopping mail from a specific person or mail that includes a specific word. Permission filters: require anyone sending a message to be pre-approved by the recipient. Challenge-response filters: require anyone sending a message to enter a code to gain permission to send email.

Spam Filtering

Spam filtering works using text categorization, and in recent times various machine learning techniques have been applied to text categorization or anti-spam filtering, such as Rule Learning (Cohen 1996) [ 27 ], Naïve Bayes (Sahami et al., 1998; Androutsopoulos et al., 2000; Rennie, 2000) [ 5 , 109 , 115 ], Memory-based Learning (Sakkis et al., 2000b) [ 117 ], Support Vector Machines (Drucker et al., 1999) [ 36 ], Decision Trees (Carreras and Marquez, 2001) [ 19 ], the Maximum Entropy Model (Berger et al. 1996) [ 14 ], and Hash Forest with a rule encoding method (T. Xia, 2020) [ 153 ], sometimes combining different learners (Sakkis et al., 2001) [ 116 ]. These approaches are preferable because the classifier is learned from training data rather than built by hand. Naïve Bayes is preferred because of its performance despite its simplicity (Lewis, 1998) [ 67 ]. In text categorization two types of models have been used (McCallum and Nigam, 1998) [ 77 ]. Both models assume that a fixed vocabulary is present. In the first model, a document is generated by first choosing a subset of the vocabulary and then using the selected words any number of times, at least once, irrespective of order. This is called the multi-variate Bernoulli model. It captures which words are used in a document, irrespective of how many times they occur and in what order. In the second model, a document is generated by choosing a set of word occurrences and arranging them in any order. This is called the multinomial model; in addition to what the multi-variate Bernoulli model captures, it also records how many times a word is used in a document. Most text categorization approaches to anti-spam email filtering have used the multi-variate Bernoulli model (Androutsopoulos et al., 2000) [ 5 ] [ 15 ].
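The difference between the two document models shows up directly in how a message is turned into a feature vector. The sketch below is illustrative only; the four-word vocabulary and the function names are assumptions made for the example.

```python
from collections import Counter

def bernoulli_features(tokens, vocabulary):
    """Multi-variate Bernoulli view: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def multinomial_features(tokens, vocabulary):
    """Multinomial view: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

if __name__ == "__main__":
    vocabulary = ["free", "money", "meeting", "now"]  # hypothetical fixed vocabulary
    message = "free money free money now".split()
    print(bernoulli_features(message, vocabulary))
    print(multinomial_features(message, vocabulary))
```

Both representations discard word order; only the multinomial one keeps the fact that "free" and "money" were each repeated.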

Information Extraction

Information extraction is concerned with identifying phrases of interest in textual data. For many applications, extracting entities such as names, places, events, dates, times, and prices is a powerful way of summarizing the information relevant to a user’s needs. In the case of a domain-specific search engine, the automatic identification of important information can increase the accuracy and efficiency of a directed search. Hidden Markov models (HMMs) have been used to extract the relevant fields of research papers. These extracted text segments are used to allow searches over specific fields, to provide effective presentation of search results, and to match references to papers. A familiar example is the pop-up ads on websites that show, with discounts, items you recently viewed in an online store. In Information Retrieval two types of models have been used (McCallum and Nigam, 1998) [ 77 ]. Both models assume that a fixed vocabulary is present. In the first model, a document is generated by first choosing a subset of the vocabulary and then using the selected words any number of times, at least once, without any order. This is called the multi-variate Bernoulli model. It captures which words are used in a document, irrespective of how many times they occur and in what order. In the second model, a document is generated by choosing a set of word occurrences and arranging them in any order. This is called the multinomial model; in addition to what the multi-variate Bernoulli model captures, it also records how many times a word is used in a document.

Knowledge discovery has become an important area of research in recent years. Knowledge discovery research uses a variety of techniques to extract useful information from source documents, such as: Parts of Speech (POS) tagging; chunking or shallow parsing; stop-word removal (stop-words are common words that must be removed before processing documents); stemming (mapping words to some base form; it has two methods, dictionary-based stemming and Porter-style stemming (Porter, 1980) [ 103 ]; the former has higher accuracy but a higher cost of implementation, while the latter has a lower implementation cost and is usually insufficient for IR); compound or statistical phrases (which index multi-token units instead of single tokens); and word sense disambiguation (the task of understanding the correct sense of a word in context; when used for information retrieval, terms are replaced by their senses in the document vector).

The extracted information can be applied for a variety of purposes, for example to prepare a summary, to build databases, to identify keywords, or to classify text items according to some pre-defined categories. For example, CONSTRUE, which was developed for Reuters, is used in classifying news stories (Hayes, 1992) [ 54 ]. It has been suggested that although many IE systems can successfully extract terms from documents, acquiring relations between the terms is still a difficulty. PROMETHEE is a system that extracts lexico-syntactic patterns relative to a specific conceptual relation (Morin, 1999) [ 89 ]. IE systems should work at many levels, from word recognition to discourse analysis at the level of the complete document. The Blank Slate Language Processor (BSLP) approach (Bondale et al., 1999) [ 16 ] has been applied to the analysis of a real-life natural language corpus that consists of responses to open-ended questionnaires in the field of advertising.

There is a system called MITA (MetLife’s Intelligent Text Analyzer) (Glasgow et al., 1998) [ 48 ] that extracts information from life insurance applications. Ahonen et al. (1998) [ 1 ] suggested a mainstream framework for text mining that uses pragmatic and discourse level analyses of text.


Summarization

Information overload is a real problem in this digital age, and already our reach and access to knowledge and information exceeds our capacity to understand it. This trend is not slowing down, so the ability to summarize data while keeping the meaning intact is highly required. This is important not just for recognizing and understanding the important information in a large set of data; it is also used to understand deeper emotional meanings. For example, a company can determine the general sentiment on social media about its latest product offering. This application is useful as a valuable marketing asset.

Text summarization can be categorized by the number of documents, the two important categories being single-document summarization and multi-document summarization (Zajic et al. 2008 [ 159 ]; Fattah and Ren 2009 [ 43 ]). Summaries can also be of two types: generic or query-focused (Gong and Liu 2001 [ 50 ]; Dunlavy et al. 2007 [ 37 ]; Wan 2008 [ 144 ]; Ouyang et al. 2011 [ 99 ]). The summarization task can be either supervised or unsupervised (Mani and Maybury 1999 [ 74 ]; Fattah and Ren 2009 [ 43 ]; Riedhammer et al. 2010 [ 110 ]). Training data is required in a supervised system for selecting relevant material from the documents. A large amount of annotated data is needed for learning techniques. A few techniques are as follows:

Bayesian Sentence-based Topic Model (BSTM) uses both term-sentence and term-document associations for summarizing multiple documents (Wang et al. 2009 [ 146 ]).

Factorization with Given Bases (FGB) is a language model where sentence bases are the given bases, and it utilizes document-term and sentence-term matrices. This approach groups and summarizes the documents simultaneously (Wang et al. 2011 [ 147 ]).

Topic Aspect-Oriented Summarization (TAOS) is based on topic factors. These topic factors are various features that describe topics; for example, capitalized words are used to represent entities. Different topics can have different aspects, and different preferences of features are used to represent different aspects (Fang et al. 2015 [ 42 ]).

Dialogue System

Dialogue systems are very prominent in real-world applications ranging from providing support to performing a particular action. Support dialogue systems require context awareness, whereas systems that perform an action do not require much context awareness. Earlier dialogue systems were focused on small applications such as home theater systems. These dialogue systems utilize the phonemic and lexical levels of language. Habitable dialogue systems offer potential for fully automated dialogue systems by utilizing all levels of a language (Liddy, 2001) [ 68 ]. This leads to systems that can enable robots to interact with humans in natural language, such as Google Assistant, Microsoft’s Cortana, Apple’s Siri, and Amazon’s Alexa.

Medicine

NLP is applied in the medical field as well. The Linguistic String Project-Medical Language Processor (LSP-MLP) is one of the large-scale NLP projects in the field of medicine [ 21 , 53 , 57 , 71 , 114 ]. The LSP-MLP helps physicians extract and summarize information on signs or symptoms, drug dosage, and response data, with the aim of identifying possible side effects of any medicine while highlighting or flagging data items [ 114 ]. The National Library of Medicine is developing The Specialist System [ 78 , 79 , 80 , 82 , 84 ]. It is expected to function as an information extraction tool for biomedical knowledge bases, particularly Medline abstracts. The lexicon was created using MeSH (Medical Subject Headings), Dorland’s Illustrated Medical Dictionary, and general English dictionaries. The Centre d’Informatique Hospitaliere of the Hopital Cantonal de Geneve is working on an electronic archiving environment with NLP features [ 81 , 119 ]. In the first phase, patient records were archived. At a later stage the LSP-MLP was adapted for French [ 10 , 72 , 94 , 113 ], and finally, a proper NLP system called RECIT [ 9 , 11 , 17 , 106 ] was developed using a method called Proximity Processing [ 88 ]. Its task was to implement a robust and multilingual system able to analyze and comprehend medical sentences, and to preserve the knowledge in free text in a language-independent knowledge representation [ 107 , 108 ]. Columbia University in New York has developed an NLP system called MEDLEE (MEDical Language Extraction and Encoding System) that identifies clinical information in narrative reports and transforms the textual information into a structured representation [ 45 ].

3.3 NLP in talk

We next discuss some of the recent NLP projects implemented by various companies:

ACE Powered GDPR Robot Launched by RAVN Systems [ 134 ]

RAVN Systems, a leading expert in Artificial Intelligence (AI), Search and Knowledge Management Solutions, announced the launch of a RAVN ACE (“Applied Cognitive Engine”) powered software Robot to help facilitate GDPR (“General Data Protection Regulation”) compliance. The Robot uses AI techniques to automatically analyze documents and other types of data in any business system which is subject to GDPR rules. It allows users to search, retrieve, flag, classify, and report on data deemed to be sensitive under GDPR quickly and easily. Users can also identify personal data in documents, view feeds on the latest personal data that requires attention, and produce reports on the data suggested to be deleted or secured. RAVN’s GDPR Robot is also able to speed up requests for information (Data Subject Access Requests - “DSAR”) in a simple and efficient way, removing the need for a manual approach to these requests, which tends to be very labor-intensive. Peter Wallqvist, CSO at RAVN Systems, commented that GDPR compliance is of paramount importance, as it will affect any organization that controls and processes data concerning EU citizens.


Eno A Natural Language Chatbot Launched by Capital One [ 56 ]

Capital One announced a chatbot for customers called Eno. Eno is a natural language chatbot that people interact with through texting. Capital One claims that Eno is the first natural language SMS chatbot from a U.S. bank that allows customers to ask questions using natural language. Customers can interact with Eno through a text interface, asking questions about their savings and other matters. Eno creates an environment in which it feels as if a human is interacting. This provides a different platform from other brands that launch chatbots on platforms like Facebook Messenger and Skype. They believed that Facebook has too much access to a person’s private information, which could get them into trouble with the privacy laws U.S. financial institutions work under. For example, a Facebook Page admin can access full transcripts of the bot’s conversations. If that were the case, the admins could easily view customers’ personal banking information, which is not acceptable.


Future of BI in Natural Language Processing [ 140 ]

Several companies in the BI space are trying to keep up with the trend and working hard to ensure that data becomes more friendly and easily accessible, but there is still a long way to go. BI will also become easier to access, as a GUI is not needed. Nowadays queries are made by text or voice command; one of the most common examples is that Google might tell you today what tomorrow’s weather will be. But soon enough, we will be able to ask our personal data chatbot about customer sentiment today, and how customers will feel about the brand next week, all while walking down the street. Today, NLP tends to be based on turning natural language into machine language. But as the technology matures, especially the AI component, the computer will get better at “understanding” the query and start to deliver answers rather than search results. Initially, the data chatbot will probably be asked a question such as ‘how have revenues changed over the last three quarters?’ and then return pages of data for you to analyze. But once it learns the semantic relations and inferences of the question, it will be able to automatically perform the filtering and formulation necessary to provide an intelligible answer, rather than simply showing you data.


Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research [ 97 ]

The article “Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research” describes a theory derivation process that is used to develop a conceptual framework for medication therapy management (MTM) research. The MTM service model and chronic care model are selected as parent theories. Review article abstracts targeting medication therapy management in chronic disease care were retrieved from Ovid Medline (2000–2016). Unique concepts in each abstract are extracted using MetaMap and their pair-wise co-occurrences are determined. Then the information is used to construct a network graph of concept co-occurrence that is further analyzed to identify content for the new conceptual model. 142 abstracts are analyzed. Medication adherence is the most studied drug therapy problem and co-occurred with concepts related to patient-centered interventions targeting self-management. The enhanced model consists of 65 concepts clustered into 14 constructs. The framework requires additional refinement and evaluation to determine its relevance and applicability across a broad audience including underserved settings.
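The pair-wise co-occurrence step can be sketched as counting, for every unordered pair of concepts, the number of abstracts in which both appear; the resulting counts are the weighted edges of the concept network. The concept lists below are hypothetical placeholders, and the concept extraction step itself is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(concepts_per_abstract):
    """Count how many abstracts mention each unordered pair of concepts."""
    edges = Counter()
    for concepts in concepts_per_abstract:
        # Deduplicate within an abstract and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges

if __name__ == "__main__":
    # Hypothetical concepts per abstract, for illustration only.
    abstracts = [
        ["adherence", "self-management", "diabetes"],
        ["adherence", "self-management"],
        ["diabetes", "adherence"],
    ]
    for pair, weight in cooccurrence_counts(abstracts).most_common():
        print(pair, weight)
```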


Meet the Pilot, world’s first language translating earbuds [ 96 ]

The world’s first smart earpiece, Pilot, will soon be able to translate over 15 languages. According to Springwise, Waverly Labs’ Pilot can already translate five spoken languages (English, French, Italian, Portuguese, and Spanish) and seven additional written languages (German, Hindi, Russian, Japanese, Arabic, Korean, and Mandarin Chinese). The Pilot earpiece is connected via Bluetooth to the Pilot speech translation app, which uses speech recognition, machine translation, machine learning, and speech synthesis technology. Simultaneously, the user will hear the translated version of the speech on the second earpiece. Moreover, the conversation need not take place between only two people; other users can join in and discuss as a group. As of now, the user may experience a lag of a few seconds between the speech and the translation, which Waverly Labs aims to reduce. The Pilot earpiece will be available from September but can be pre-ordered now for $249. The earpieces can also be used for streaming music, answering voice calls, and getting audio notifications.


4 Datasets in NLP and state-of-the-art models

The objective of this section is to present the various datasets used in NLP and some state-of-the-art models in NLP.

4.1 Datasets in NLP

A corpus is a collection of linguistic data, either compiled from written texts or transcribed from recorded speech. Corpora are intended primarily for testing linguistic hypotheses, e.g., to determine how a certain sound, word, or syntactic construction is used across a culture or language. There are various types of corpora. In an annotated corpus, the implicit information in the plain text has been made explicit by specific annotations. An un-annotated corpus contains plain text in its raw state. Different languages can be compared using a reference corpus. Monitor corpora are non-finite collections of texts which are mostly used in lexicography. A multilingual corpus refers to a type of corpus that contains small collections of monolingual corpora based on the same sampling procedure and categories for different languages. A parallel corpus contains texts in one language and their translations into other languages, aligned sentence by sentence or phrase by phrase. A reference corpus contains text of spoken (formal and informal) and written (formal and informal) language which represents various social and situational contexts. A speech corpus contains recorded speech, transcriptions of the recordings, and the time at which each word occurred in the recorded speech. There are various datasets available for natural language processing; some of these are listed below for different use cases:

Sentiment Analysis: Sentiment analysis is a rapidly expanding field of natural language processing (NLP) used in a variety of fields such as politics and business. Widely used datasets for sentiment analysis are:

Stanford Sentiment Treebank (SST): Socher et al. introduced SST containing sentiment labels for 215,154 phrases in parse trees for 11,855 sentences from movie reviews posing novel sentiment compositional difficulties [ 127 ].

Sentiment140: It contains 1.6 million tweets annotated with negative, neutral and positive labels.

Paper Reviews: It provides reviews of computing and informatics conferences written in English and Spanish languages. It has 405 reviews which are evaluated on a 5-point scale ranging from very negative to very positive.

IMDB: For natural language processing, text analytics, and sentiment analysis, this dataset offers thousands of movie reviews split into training and test datasets. This dataset was introduced by Maas et al. in 2011 [ 73 ].

Sentiraama: G. Rama Rohit Reddy of the Language Technologies Research Centre, KCIS, IIIT Hyderabad, generated the corpus “Sentiraama.” The corpus is divided into four datasets, each of which is annotated with a two-value scale that distinguishes between positive and negative sentiment at the document level. The corpus contains data from a variety of fields, including book reviews, product reviews, movie reviews, and song lyrics. The annotators meticulously followed the annotation technique for each of them. The folder “Song Lyrics” in the corpus contains 339 Telugu song lyrics written in Telugu script [ 121 ].

Language Modelling: Language models analyse text data to calculate word probabilities. They use an algorithm to interpret the data, which establishes rules for context in natural language. The model then uses these rules to predict or construct new sentences. In essence, the model learns the basic characteristics and features of language and then applies them to new phrases. Widely used datasets for language modeling are as follows:

Salesforce’s WikiText-103 dataset has 103 million tokens collected from 28,475 featured articles from Wikipedia.

WikiText-2 is a scaled-down version of WikiText-103. It contains 2 million tokens with a vocabulary size of 33,278.

The Penn Treebank portion of the Wall Street Journal corpus includes 929,000 tokens for training, 73,000 tokens for validation, and 82,000 tokens for testing purposes. Its context is limited since it comprises sentences rather than paragraphs [ 76 ].

The Ministry of Electronics and Information Technology’s Technology Development Programme for Indian Languages (TDIL) launched its own data distribution portal which has cataloged datasets [ 24 ].
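The idea of calculating word probabilities from text, as described for language models above, can be sketched with a bigram model estimated by maximum likelihood. This is a deliberately small illustration with a toy corpus and hypothetical function names; the neural language models evaluated on the datasets listed above are far more elaborate.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next word | previous word) by maximum likelihood."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, nexts in pair_counts.items():
        total = sum(nexts.values())
        model[prev] = {word: count / total for word, count in nexts.items()}
    return model

def most_likely_next(model, word):
    """Return the most probable continuation of `word`, or None if it was never seen."""
    candidates = model.get(word, {})
    return max(candidates, key=candidates.get) if candidates else None

if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran", "the dog sat"]  # toy corpus
    model = train_bigram_model(corpus)
    print(model["the"])
    print(most_likely_next(model, "the"))
```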

Machine Translation: The task of converting the text of one natural language into another language while keeping the sense of the input text is known as machine translation. Widely used datasets are as follows:

Tatoeba is a collection of multilingual sentence pairings. A tab-delimited pair of an English text sequence and the translated French text sequence appears on each line of the dataset. Each text sequence might be as simple as a single sentence or as complex as a paragraph of many sentences.

The Europarl parallel corpus is derived from the European Parliament’s proceedings. It is available in 21 European languages [ 40 ].

WMT14 provides machine translation pairs for English-German and English-French. These datasets comprise 4.5 million and 35 million sentence pairs, respectively. Byte-Pair Encoding with 32K merge operations is used to encode the phrases.

There are around 160,000 sentence pairs in IWSLT 14. The dataset covers the English-German (En-De) and German-English (De-En) translation directions. There are around 200K training sentence pairs in the IWSLT 13 dataset.

The IIT Bombay English-Hindi corpus comprises parallel corpora for English-Hindi as well as monolingual Hindi corpora gathered from several existing sources and corpora generated over time at IIT Bombay’s Centre for Indian Language Technology.
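Byte-Pair Encoding, mentioned above in connection with WMT14, builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The sketch below learns merge rules from a toy word list; it is a simplified illustration (no end-of-word marker, ties broken by first occurrence) and the function name is an assumption, not the exact preprocessing used for the dataset.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair by one merged symbol.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

if __name__ == "__main__":
    toy_words = ["low", "low", "lower", "newest", "newest", "newest"]
    print(learn_bpe_merges(toy_words, 3))
```

In practice the number of merges (for example tens of thousands) controls the size of the resulting subword vocabulary.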

Question Answering System: Question answering systems provide real-time responses which are widely used in customer care services. The datasets used for dialogue system/question answering system are as follows:

Stanford Question Answering Dataset (SQuAD): it is a reading comprehension dataset made up of questions posed by crowd workers on a collection of Wikipedia articles.

Natural Questions: It is a large-scale corpus presented by Google used for training and assessing open-domain question answering systems. It includes 300,000 naturally occurring queries as well as human-annotated responses from Wikipedia pages for use in QA system training.

Question Answering in Context (QuAC): This dataset is used to describe, comprehend, and participate in information seeking conversation. In this dataset, instances are made up of an interactive discussion between two crowd workers: a student who asks a series of open-ended questions about an unknown Wikipedia text, and a teacher who responds by offering brief extracts from the text.

Neural learning models are overtaking traditional models for NLP [ 64 , 127 ]. In [ 64 ], the authors used a CNN (Convolutional Neural Network) model for sentiment analysis of movie reviews and achieved 81.5% accuracy. The results illustrate that the CNN was an appropriate replacement for state-of-the-art methods. The authors of [ 127 ] combined SST and a Recursive Neural Tensor Network for sentiment analysis of single sentences. This model improves the accuracy by 5.4% for sentence classification compared to traditional NLP models. The authors of [ 135 ] proposed a combined Recurrent Neural Network and Transformer model for sentiment analysis. This hybrid model was tested on three different datasets (Twitter US Airline Sentiment, IMDB, and Sentiment140) and achieved F1 scores of 91%, 93%, and 90%, respectively. This model’s performance surpassed the state-of-the-art methods.

Santoro et al. [ 118 ] introduced a relational recurrent neural network with the capacity to learn to classify information and perform complex reasoning based on the interactions between compartmentalized information. They used the relational memory core (RMC) to handle such interactions. Finally, the model was tested for language modeling on three different datasets (GigaWord, Project Gutenberg, and WikiText-103). Further, they compared the performance of their model to traditional approaches for dealing with relational reasoning on compartmentalized information. The results achieved with RMC show improved performance.

Merity et al. [ 86 ] extended conventional word-level language models based on Quasi-Recurrent Neural Network and LSTM to handle the granularity at character and word level. They tuned the parameters for character-level modeling using the Penn Treebank dataset and word-level modeling using WikiText-103. In both cases, their model surpassed the state-of-the-art methods.

Luong et al. [ 70 ] used neural machine translation on the WMT14 dataset and performed translation of English text to French text. The model demonstrated a significant improvement of up to 2.8 bilingual evaluation understudy (BLEU) points compared to various neural machine translation systems. It outperformed the commonly used MT system on the WMT14 dataset.

Fan et al. [ 41 ] introduced a gradient-based neural architecture search algorithm that automatically finds architectures with better performance than the Transformer and conventional NMT models. They tested their model on WMT14 (English-German translation), IWSLT14 (German-English translation), and WMT18 (Finnish-to-English translation) and achieved 30.1, 36.1, and 26.4 BLEU points, respectively, which shows better performance than Transformer baselines.

Wiese et al. [ 150 ] introduced a deep learning approach based on domain adaptation techniques for handling biomedical question answering tasks. Their model achieved state-of-the-art performance on biomedical question answering, outperforming previous methods in this domain.

Seunghak et al. [ 158 ] designed a Memory-Augmented-Machine-Comprehension-Network (MAMCN) to handle dependencies faced in reading comprehension. The model achieved state-of-the-art performance on document-level using TriviaQA and QUASAR-T datasets, and paragraph-level using SQuAD datasets.

Xie et al. [ 154 ] proposed a neural architecture where candidate answers and their representation learning are constituent centric, guided by a parse tree. Under this architecture, the search space of candidate answers is reduced while preserving the hierarchical, syntactic, and compositional structure among constituents. Using SQuAD, the model delivers state-of-the-art performance.

4.2 State-of-the-art models in NLP

The rationalist or symbolic approach assumes that a crucial part of the knowledge in the human mind is not derived from the senses but is fixed in advance, probably by genetic inheritance. Noam Chomsky was the strongest advocate of this approach. It was believed that machines could be made to function like the human brain by giving them some fundamental knowledge and a reasoning mechanism, with linguistic knowledge directly encoded in rules or other forms of representation. This helps the automatic processing of natural languages [ 92 ]. Statistical and machine learning approaches involve the development of algorithms that allow a program to infer patterns. An iterative process is used to fit a given algorithm’s underlying model: during the learning phase, its numerical parameters are optimized against a numerical measure. Machine-learning models can be broadly categorized as either generative or discriminative. Generative methods create rich models of probability distributions, which is why they can generate synthetic data. Discriminative methods are more functional: they directly estimate posterior probabilities based on observations. Srihari [ 129 ] illustrates the difference with the task of identifying an unknown speaker’s language: a generative approach would draw on deep knowledge of numerous languages to perform the match, whereas a discriminative approach relies on a less knowledge-intensive method that uses the distinctions between languages. Generative models can become troublesome when many features are used, whereas discriminative models allow the use of more features [ 38 ]. Examples of discriminative methods are logistic regression and conditional random fields (CRFs); examples of generative methods are Naive Bayes classifiers and hidden Markov models (HMMs).

Naive Bayes Classifiers

Naive Bayes is a probabilistic algorithm, based on probability theory and Bayes’ Theorem, that predicts the tag of a text such as a news article or a customer review. It calculates the probability of each tag for the given text and returns the tag with the highest probability. Bayes’ Theorem is used to compute the probability of an outcome based on prior knowledge of conditions that might be related to it. In NLP, Naïve Bayes classifiers are applied to usual tasks such as segmentation and translation, but they have also been explored in unusual areas such as segmentation for infant learning and distinguishing documents that express opinions from those that state facts. Anggraeni et al. (2019) [ 61 ] used ML and AI to create a question-and-answer system for retrieving information about hearing loss. They developed I-Chat Bot, which understands the user input, provides an appropriate response, and produces a model that can be used to search for information about hearing impairments. The problem with Naïve Bayes is that we may end up with zero probabilities when we meet words in the test data for a certain class that are not present in the training data.
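
The tagging procedure and the zero-probability problem described above can be illustrated with a minimal sketch. It assumes a bag-of-words model and uses additive (Laplace) smoothing, a common remedy that the text itself does not prescribe; the function names, the `alpha` parameter and the toy reviews are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """Count tags and words from (tokens, tag) pairs; alpha is the smoothing constant."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, tag in docs:
        tag_counts[tag] += 1
        word_counts[tag].update(tokens)
        vocab.update(tokens)
    return {"tag_counts": tag_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(docs)}


def log_posteriors(model, tokens):
    """Return log P(tag) + sum of log P(word | tag) for every tag (unnormalised)."""
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for tag, n_tag in model["tag_counts"].items():
        total_words = sum(model["word_counts"][tag].values())
        score = math.log(n_tag / model["n_docs"])
        for word in tokens:
            if word not in model["vocab"]:
                continue  # words unseen in every class carry no evidence
            # Additive smoothing: every vocabulary word gets alpha pseudo-counts,
            # so a word missing from one class no longer yields probability zero.
            numerator = model["word_counts"][tag][word] + alpha
            denominator = total_words + alpha * vocab_size
            score += math.log(numerator / denominator)
        scores[tag] = score
    return scores


def predict(model, tokens):
    """Return the tag with the highest posterior score."""
    scores = log_posteriors(model, tokens)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
training_docs = [
    ("great product works well".split(), "positive"),
    ("love this great phone".split(), "positive"),
    ("terrible battery poor screen".split(), "negative"),
    ("poor quality terrible support".split(), "negative"),
]
model = train_naive_bayes(training_docs)
predicted_tag = predict(model, "great screen".split())
```

In this toy data "screen" never occurs in a positive review, so without smoothing (`alpha=0`) the positive class would receive a zero probability for the query "great screen"; with `alpha=1` both classes keep a finite score and the stronger evidence from "great" decides the tag.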

Hidden Markov Model (HMM)

An HMM is a system that shifts between several states, generating one of a set of feasible output symbols with each switch. The sets of viable states and unique symbols may be large, but they are finite and known. We can observe the outputs, but the system’s internals are hidden. Three kinds of problems can be solved with it. Inference: given a certain sequence of output symbols, compute the probabilities of one or more candidate state-switch sequences. Pattern matching: find the state-switch sequence most likely to have generated a particular output-symbol sequence. Training: given output-symbol chain data, estimate the state-switch and output probabilities that fit this data best.
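
The pattern-matching problem above, finding the state sequence most likely to have produced an observed output sequence, is commonly solved with the Viterbi algorithm. The following is a minimal sketch of that algorithm, not something taken from the surveyed works; the two-state model, its symbols and all probabilities are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        previous = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor r that maximises the path probability into s.
            column[s] = max(
                (previous[r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)

    # Trace back from the most probable final state.
    last_state = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last_state][0]
    path = [last_state]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return probability, path


# Hypothetical toy model: hidden health states emit observable symptoms.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
probability, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
# path is ["Healthy", "Healthy", "Fever"] with probability 0.01512
```

The sketch multiplies raw probabilities for readability; for long sequences a practical implementation would work with log-probabilities to avoid numerical underflow.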

Hidden Markov Models are extensively used for speech recognition, where the output sequence is matched to the sequence of individual phonemes. HMMs are not restricted to this application; they have several others, such as bioinformatics problems, for example multiple sequence alignment [ 128 ]. Sonnhammer mentioned that Pfam holds multiple alignments and hidden Markov model-based profiles (HMM-profiles) of entire protein domains. The determination of domain boundaries, family members and alignments is done semi-automatically, based on expert knowledge, sequence similarity, other protein family databases and the capability of HMM-profiles to correctly identify and align the members. HMMs may be used for a variety of NLP applications, including word prediction, sentence production, quality assurance, and intrusion detection systems [ 133 ].

Neural Network

Earlier machine learning techniques such as Naïve Bayes and HMMs were widely used for NLP, but by the end of 2010 neural networks had transformed and enhanced NLP tasks by learning multilevel features. A major use of neural networks in NLP is word embedding, where words are represented as vectors. These vectors can be used to recognize similar words by observing their closeness in the vector space. Other uses of neural networks are found in information retrieval, text summarization, text classification, machine translation, sentiment analysis and speech recognition. Initially the focus was on feedforward [ 49 ] and CNN (convolutional neural network) architectures [ 69 ], but later researchers adopted recurrent neural networks to capture the context of a word with respect to the surrounding words of a sentence. LSTM (Long Short-Term Memory), a variant of RNN, is used in various tasks such as word prediction and sentence topic prediction [ 47 ]. To observe the word arrangement in both the forward and backward directions, researchers have explored bi-directional LSTM [ 59 ]. In machine translation, an encoder-decoder architecture is used where the dimensionality of the input and output vectors is not known in advance. Neural networks can be used to anticipate a state that has not yet been seen, such as future states for which predictors exist, whereas an HMM predicts hidden states.
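
The closeness of word vectors mentioned above is often measured with cosine similarity. The sketch below shows one way to do this; the three-dimensional vectors are hypothetical values chosen by hand, whereas real embeddings are learned and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to the given word's vector."""
    return max(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(embeddings[word], embeddings[other]),
    )


# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}
nearest_to_king = most_similar("king", embeddings)
```

With these made-up vectors, "queen" lies much closer to "king" than "apple" does, which is the behaviour one hopes a trained embedding would show for related words.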

Bi-directional Encoder Representations from Transformers (BERT) is a model pre-trained on unlabeled text from BookCorpus and English Wikipedia. It can be fine-tuned to capture context for various NLP tasks such as question answering, sentiment analysis, text classification, sentence embedding, and interpreting ambiguity in text [ 25 , 33 , 90 , 148 ]. Earlier language models examine the text in only one direction, which suits sentence generation by predicting the next word, whereas the BERT model examines the text in both directions simultaneously for better language understanding. BERT provides a contextual embedding for each word in the text, unlike context-free models (word2vec and GloVe). For example, in the sentences “he is going to the river bank for a walk” and “he is going to the bank to withdraw some money”, word2vec will have one vector representation for “bank” in both sentences, whereas BERT will have a different vector representation for “bank” in each. Müller et al. [ 90 ] used the BERT model to analyze tweets about COVID-19. The use of the BERT model in the legal domain was explored by Chalkidis et al. [ 20 ].

Since BERT considers at most 512 tokens, a longer text sequence must be divided into multiple shorter sequences of up to 512 tokens each. This is a limitation of BERT: it cannot handle long text sequences directly.
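
Splitting a long input into pieces that fit the 512-token limit can be sketched as follows. The function name and the optional `stride` overlap are assumptions made for illustration; the sketch operates on an already tokenized sequence and ignores the room that special tokens would need in practice.

```python
def split_into_windows(token_ids, max_len=512, stride=0):
    """Split a token sequence into windows of at most max_len tokens.

    stride is the number of tokens shared by consecutive windows, so that
    context cut at a window boundary also appears in the next window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end of the text
    return windows


# A hypothetical document of 1200 token ids.
document = list(range(1200))
plain_windows = split_into_windows(document)                # lengths 512, 512, 176
overlapping_windows = split_into_windows(document, stride=64)  # lengths 512, 512, 304
```

Each window would then be passed to the model separately and the per-window outputs combined, for example by averaging or voting; how best to combine them depends on the task.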

5 Evaluation metrics and challenges

The objective of this section is to discuss the evaluation metrics used to assess a model’s performance and the challenges involved.

5.1 Evaluation metrics

Since the number of labels in most classification problems is fixed, it is easy to determine the score for each class and, as a result, the loss from the ground truth. In image generation problems, the output resolution and ground truth are both fixed, so we can calculate the loss at the pixel level against the ground truth. In NLP, however, although the output format is predetermined, its dimensions cannot be specified, because a single statement can be expressed in multiple ways without changing its intent and meaning. Evaluation metrics are important for assessing a model’s performance, particularly if we are trying to solve two problems with one model.

BLEU (BiLingual Evaluation Understudy) Score: Each word in the output sentence is scored 1 if it appears in any of the reference sentences and 0 if it does not. The number of words that appeared in one of the reference translations is then divided by the total number of words in the output sentence, normalizing the count so that it always lies between 0 and 1. For example, suppose the ground truth is “He is playing chess in the backyard” and the output sentences are S1: “He is playing tennis in the backyard”, S2: “He is playing badminton in the backyard”, S3: “He is playing movie in the backyard” and S4: “backyard backyard backyard backyard backyard backyard backyard”. The scores of S1, S2 and S3 would be 6/7, 6/7 and 6/7. All three sentences get the same score even though the information in S1 and S3 is not the same. This is because BLEU assumes that all words in a sentence contribute equally to its meaning, which is not the case in real-world scenarios. Under this simple count, S4 would even score 7/7, since every one of its words appears in the reference. Using a combination of uni-grams, bi-grams and higher-order n-grams, we can capture the word order of a sentence. We may also limit how many times each word is counted to the number of times it appears in each reference phrase, which helps prevent excessive repetition from being rewarded.
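
The word-level scoring in the example above, together with the count limit that guards against repetition, can be sketched as below. This covers only the unigram precision component and is not the full BLEU metric, which also combines higher-order n-gram precisions and a brevity penalty; the function name and the `clip` flag are hypothetical.

```python
from collections import Counter


def unigram_precision(candidate, references, clip=False):
    """Fraction of candidate words that also occur in a reference.

    With clip=True each word is credited at most as often as it occurs
    in any single reference, which penalises excessive repetition.
    """
    candidate_words = candidate.lower().split()
    reference_words = [ref.lower().split() for ref in references]
    if not clip:
        reference_vocab = set().union(*reference_words)
        matches = sum(1 for word in candidate_words if word in reference_vocab)
    else:
        max_reference_count = Counter()
        for words in reference_words:
            for word, count in Counter(words).items():
                max_reference_count[word] = max(max_reference_count[word], count)
        matches = sum(
            min(count, max_reference_count[word])
            for word, count in Counter(candidate_words).items()
        )
    return matches / len(candidate_words)


references = ["He is playing chess in the backyard"]
s1 = "He is playing tennis in the backyard"
s4 = "backyard backyard backyard backyard backyard backyard backyard"

score_s1 = unigram_precision(s1, references)                  # 6/7
score_s4_unclipped = unigram_precision(s4, references)        # 7/7
score_s4_clipped = unigram_precision(s4, references, clip=True)  # 1/7
```

Clipping leaves the score of S1 unchanged at 6/7 but reduces the degenerate sentence S4 from 7/7 to 1/7, because "backyard" occurs only once in the reference.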

GLUE (General Language Understanding Evaluation) score: Previously, NLP models were almost always built to perform well on a single task. Various models such as LSTM and Bi-LSTM were trained solely for that task and very rarely generalized to other tasks. A model used for named entity recognition, for instance, could not be expected to perform textual entailment. GLUE is a set of datasets for training, assessing, and comparing NLP models. It includes nine diverse task datasets designed to test a model’s language understanding. To obtain a comprehensive assessment of a model’s performance, GLUE tests the model on a variety of tasks rather than a single one; these include single-sentence tasks, similarity and paraphrase tasks, and inference tasks. For example, in sentiment analysis of customer reviews, we might be interested in analyzing ambiguous reviews and determining which product the client is referring to in the reviews. The model thus obtains a good general “knowledge” of language after some generalized pre-training, and this universal “knowledge” gives us an advantage when the time comes to apply the model to a given task. With GLUE, researchers can evaluate their model and score it on all nine tasks. The model’s final performance score is the average of those nine scores. It makes little difference how the model looks or works, as long as it can analyze inputs and predict outcomes for all the tasks.
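
The averaging step described above can be written down in a few lines. The sketch assumes that each of the nine tasks has already been reduced to a single score on a common 0 to 100 scale; the official benchmark uses task-specific metrics, and the per-task numbers below are invented purely for illustration.

```python
# The nine GLUE task names.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def glue_score(task_scores):
    """Average the per-task scores; every one of the nine tasks must be present."""
    missing = [task for task in GLUE_TASKS if task not in task_scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return sum(task_scores[task] for task in GLUE_TASKS) / len(GLUE_TASKS)


# Hypothetical per-task scores, not results of any real model.
example_scores = {
    "CoLA": 50, "SST-2": 90, "MRPC": 85, "STS-B": 85, "QQP": 70,
    "MNLI": 80, "QNLI": 90, "RTE": 65, "WNLI": 60,
}
overall = glue_score(example_scores)  # 675 / 9 = 75.0
```

Because the final number is an unweighted mean, a model that is strong on a few tasks but weak on others can still end up with a mediocre overall score, which is the intended pressure towards general language understanding.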

Keeping these metrics in mind helps to evaluate the performance of an NLP model for a particular task or a variety of tasks.

5.2 Challenges

The applications of NLP have been growing day by day, and with them new challenges keep arising despite the large amount of work done in the recent past. Some of the common challenges are as follows. Contextual words and phrases: the same words and phrases can have different meanings depending on the sentence, which humans understand easily but which is challenging for machines. Similar difficulties arise with synonyms, because humans use many different words to express the same idea, and different people may use words of different intensity, such as large, huge, and big; processing such language and designing algorithms that accommodate all these variations is challenging. Homonyms, words that are pronounced the same but have different definitions, are also problematic for question answering and speech-to-text applications because the input is not in written form. Sentences using sarcasm and irony may sometimes be understood in the opposite way even by humans, so designing models to deal with such sentences is a really challenging task in NLP. Furthermore, sentences that are ambiguous, in the sense that they can be interpreted in more than one way, are also an area where more accuracy can be achieved. Informal phrases, expressions, idioms, and culture-specific lingo make it difficult to design models intended for broad use. Having a lot of data and training and updating on it regularly may improve the models, but dealing with words that have different meanings in different geographic areas remains really challenging. Such issues also occur across domains: the meaning of words or sentences in the education industry may differ from their meaning in health, law, defense, etc.
So, NLP models may work well for an individual domain or geographic area, but for broad use such challenges need to be tackled. In addition to the above-mentioned challenges, misspelled or misused words can also create problems. Although autocorrect and grammar-correction applications have improved a lot thanks to continuous development in this direction, predicting the intention of a writer, that too from a specific domain or geographic area, while accounting for sarcasm, expressions, informal phrases etc. is really a big challenge. There is no doubt that NLP models for the most widely used languages have been doing very well and are improving day by day, but there is still a need for models that serve all people rather than requiring specific knowledge of a particular language and technology. One may further refer to the work of Sharifirad and Matwin (2019) [ 123 ] for the classification of different online harassment categories and challenges, Baclic (2020) [ 6 ] and Wong et al. (2018) [ 151 ] for challenges and opportunities in public health, Kang (2020) [ 63 ] for a detailed literature survey and technological challenges relevant to management research and NLP, and a recent review by Alshemali and Kalita (2020) [ 3 ], and the references cited therein.

In the recent past, models dealing with Visual Commonsense Reasoning [ 31 ] and NLP have also been getting the attention of several researchers, and this seems a promising and challenging area to work on. These models try to extract information from an image or video using a visual reasoning paradigm, in the way humans can infer from a given image or video things beyond what is visually obvious, such as objects’ functions, people’s intents, and mental states. In this direction, Wen and Peng (2020) [ 149 ] recently suggested a model to capture knowledge from different perspectives and perceive common sense in advance, and the results of their experiments on the visual commonsense reasoning dataset VCR seem very satisfactory and effective. The work of Peng and Chi (2019) [ 102 ], which proposes a Domain Adaptation with Scene Graph approach to transfer knowledge from the source domain with the objective of improving cross-media retrieval in the target domain, and that of Yen et al. (2019) [ 155 ] are also very useful for further exploring the use of NLP in its relevant domains.

6 Conclusion

This paper is written with three objectives. The first is to give insights into the various important terminologies of NLP and NLG, which can be useful for readers starting their careers in NLP and in work relevant to its applications. The second objective focuses on the history, applications, and recent developments in the field of NLP. The third objective is to discuss the datasets, approaches and evaluation metrics used in NLP. The relevant work in the existing literature with its findings, and some of the important applications and projects in NLP, are also discussed in the paper. The last two objectives may serve as a literature survey for readers already working in NLP and relevant fields, and can further provide motivation to explore the fields mentioned in this paper. It should be noted that even though a great number of literature surveys on natural language processing are available (one may refer to [ 15 , 32 , 63 , 98 , 133 , 151 ], each focusing on one domain, such as the usage of deep-learning techniques in NLP, techniques used for email spam filtering, medication safety, management research, intrusion detection, and the Gujarati language), there is still not much work on regional languages, which can be the focus of future research.

Change history

25 July 2022

Affiliation 3 has been added into the online PDF.

References

Ahonen H, Heinonen O, Klemettinen M, Verkamo AI (1998) Applying data mining techniques for descriptive phrase extraction in digital document collections. In research and technology advances in digital libraries, 1998. ADL 98. Proceedings. IEEE international forum on (pp. 2-11). IEEE

Alshawi H (1992) The core language engine. MIT press

Alshemali B, Kalita J (2020) Improving the reliability of deep neural networks in NLP: A review. Knowl-Based Syst 191:105210

Andreev ND (1967) The intermediary language as the focal point of machine translation. In: Booth AD (ed) Machine translation. North Holland Publishing Company, Amsterdam, pp 3–27

Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD, Stamatopoulos P (2000) Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009

Baclic O, Tunis M, Young K, Doan C, Swerdfeger H, Schonfeld J (2020) Artificial intelligence in public health: challenges and opportunities for public health made possible by advances in natural language processing. Can Commun Dis Rep 46(6):161

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In ICLR 2015

Bangalore S, Rambow O, Whittaker S (2000) Evaluation metrics for generation. In proceedings of the first international conference on natural language generation-volume 14 (pp. 1-8). Assoc Comput Linguist

Baud RH, Rassinoux AM, Scherrer JR (1991) Knowledge representation of discharge summaries. In AIME 91 (pp. 173–182). Springer, Berlin Heidelberg

Baud RH, Rassinoux AM, Scherrer JR (1992) Natural language processing and semantical representation of medical texts. Methods Inf Med 31(2):117–125

Baud RH, Alpay L, Lovis C (1994) Let’s meet the users with natural language understanding. Knowledge and Decisions in Health Telematics: The Next Decade 12:103

Bengio Y, Ducharme R, Vincent P (2001) A neural probabilistic language model. Proceedings of NIPS

Benson E, Haghighi A, Barzilay R (2011) Event discovery in social media feeds. In proceedings of the 49th annual meeting of the Association for Computational Linguistics: human language technologies-volume 1 (pp. 389-398). Assoc Comput Linguist

Berger AL, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39–71

Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92

Bondale N, Maloor P, Vaidyanathan A, Sengupta S, Rao PV (1999) Extraction of information from open-ended questionnaires using natural language processing techniques. Computer Science and Informatics 29(2):15–22

Borst F, Sager N, Nhàn NT, Su Y, Lyman M, Tick LJ, ..., Scherrer JR (1989) Analyse automatique de comptes rendus d'hospitalisation. In Degoulet P, Stephan JC, Venot A, Yvon PJ, rédacteurs. Informatique et Santé, Informatique et Gestion des Unités de Soins, Comptes Rendus du Colloque AIM-IF, Paris (pp. 246–56). [5]

Briscoe EJ, Grover C, Boguraev B, Carroll J (1987) A formalism and environment for the development of a large grammar of English. IJCAI 87:703–708

Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015

Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I (2020) LEGAL-BERT: the muppets straight out of law school. arXiv preprint arXiv:2010.02559

Chi EC, Lyman MS, Sager N, Friedman C, Macleod C (1985) A database of computer-structured narrative: methods of computing complex relations. In proceedings of the annual symposium on computer application in medical care (p. 221). Am Med Inform Assoc

Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259

Chomsky N (1965) Aspects of the theory of syntax. MIT Press, Cambridge, Massachusetts

Choudhary N (2021) LDC-IL: the Indian repository of resources for language technology. Lang Resources & Evaluation 55:855–867.

Chouikhi H, Chniter H, Jarray F (2021) Arabic sentiment analysis using BERT model. In international conference on computational collective intelligence (pp. 621-632). Springer, Cham

Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555

Cohen WW (1996) Learning rules that classify e-mail. In AAAI spring symposium on machine learning in information access (Vol. 18, p. 25)

Cohen PR, Morgan J, Ramsay AM (2002) Intention in communication. Am J Psychol 104(4)

Collobert R, Weston J (2008) A unified architecture for natural language processing. In proceedings of the 25th international conference on machine learning (pp. 160–167)

Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

Davis E, Marcus G (2015) Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun ACM 58(9):92–103

Desai NP, Dabhi VK (2022) Resources and components for Gujarati NLP systems: a survey. Artif Intell Rev:1–19

Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Assoc Computat Linguist

Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In proceedings of the second international conference on human language technology research (pp. 138-145). Morgan Kaufmann publishers Inc

Drucker H, Wu D, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054

Dunlavy DM, O’Leary DP, Conroy JM, Schlesinger JD (2007) QCS: A system for querying, clustering and summarizing documents. Inf Process Manag 43(6):1588–1605

Elkan C (2008) Log-Linear Models and Conditional Random Fields. accessed 28 Jun 2017.

Emele MC, Dorna M (1998) Ambiguity preserving machine translation using packed representations. In proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics-volume 1 (pp. 365-371). Association for Computational Linguistics

Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In MT Summit 2005

Fan Y, Tian F, Xia Y, Qin T, Li XY, Liu TY (2020) Searching better architectures for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28:1574–1585

Fang H, Lu W, Wu F, Zhang Y, Shang X, Shao J, Zhuang Y (2015) Topic aspect-oriented summarization via group selection. Neurocomputing 149:1613–1619

Fattah MA, Ren F (2009) GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Comput Speech Lang 23(1):126–144

Feldman S (1999) NLP meets the jabberwocky: natural language processing in information retrieval. Online-Weston Then Wilton 23:62–73

Friedman C, Cimino JJ, Johnson SB (1993) A conceptual model for clinical radiology reports. In proceedings of the annual symposium on computer application in medical care (p. 829). Am Med Inform Assoc

Gao T, Dontcheva M, Adar E, Liu Z, Karahalios K (2015) DataTone: managing ambiguity in natural language interfaces for data visualization. In UIST ‘15: proceedings of the 28th annual ACM symposium on User Interface Software & Technology (pp. 489–500)

Ghosh S, Vinyals O, Strope B, Roy S, Dean T, Heck L (2016) Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint arXiv:1602.06291

Glasgow B, Mandell A, Binney D, Ghemri L, Fisher D (1998) MITA: an information-extraction approach to the analysis of free-form text in life insurance applications. AI Mag 19(1):59

Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies 10(1):1–309

Gong Y, Liu X (2001) Generic text summarization using relevance measure and latent semantic analysis. In proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 19-25). ACM

Green Jr, BF, Wolf AK, Chomsky C, Laughery K (1961) Baseball: an automatic question-answerer. In papers presented at the may 9-11, 1961, western joint IRE-AIEE-ACM computer conference (pp. 219-224). ACM

Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28(10):2222–2232

Grishman R, Sager N, Raze C, Bookchin B (1973) The linguistic string parser. In proceedings of the June 4-8, 1973, national computer conference and exposition (pp. 427-434). ACM

Hayes PJ (1992) Intelligent high-volume text processing using shallow, domain-specific techniques. Text-based intelligent systems: current research and practice in information extraction and retrieval, 227-242.

Hendrix GG, Sacerdoti ED, Sagalowicz D, Slocum J (1978) Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS) 3(2):105–147

"Here’s Why Natural Language Processing is the Future of BI (2017) " SmartData Collective. N.p., n.d. Web. 19

Hirschman L, Grishman R, Sager N (1976) From text to structured information: automatic processing of medical reports. In proceedings of the June 7-10, 1976, national computer conference and exposition (pp. 267-275). ACM

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991

Hutchins WJ (1986) Machine translation: past, present, future (p. 66). Ellis Horwood, Chichester

Jurafsky D, Martin JH (2008) Speech and language processing. 2nd edn. Prentice-Hall, Englewood Cliffs, NJ

Kamp H, Reyle U (1993) Tense and aspect. In from discourse to logic (pp. 483-689). Springer Netherlands

Kang Y, Cai Z, Tan CW, Huang Q, Liu H (2020) Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics 7(2):139–172

Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882

Knight K, Langkilde I (2000) Preserving ambiguities in generation via automata intersection. In AAAI/IAAI (pp. 697-702)

Lass R (1998) Phonology: an introduction to basic concepts. Cambridge University Press, Cambridge, UK; New York; Melbourne, Australia. p. 1. ISBN 978–0–521-23728-4. Retrieved 8 January 2011. Paperback ISBN 0–521–28183-0

Lewis DD (1998) Naive (Bayes) at forty: The independence assumption in information retrieval. In European conference on machine learning (pp. 4–15). Springer, Berlin Heidelberg

Liddy ED (2001) Natural language processing

Lopez MM, Kalita J (2017) Deep learning applied to NLP. arXiv preprint arXiv:1703.03091

Luong MT, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206

Lyman M, Sager N, Friedman C, Chi E (1985) Computer-structured narrative in ambulatory care: its use in longitudinal review of clinical data. In proceedings of the annual symposium on computer application in medical care (p. 82). Am Med Inform Assoc

Lyman M, Sager N, Chi EC, Tick LJ, Nhan NT, Su Y, ..., Scherrer J (1989) Medical language processing for knowledge representation and retrievals. In proceedings. Symposium on computer applications in medical care (pp. 548–553). Am Med Inform Assoc

Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies (pp. 142-150)

Mani I, Maybury MT (eds) (1999) Advances in automatic text summarization, vol 293. MIT press, Cambridge, MA

Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT press, Cambridge

Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of english: the penn treebank. Comput Linguist 19(2):313–330

McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, pp. 41-48)

McCray AT (1991) Natural language processing for intelligent information retrieval. In Engineering in Medicine and Biology Society, 1991. Vol. 13: 1991., Proceedings of the Annual International Conference of the IEEE (pp. 1160–1161). IEEE

McCray AT (1991) Extending a natural language parser with UMLS knowledge. In proceedings of the annual symposium on computer application in medical care (p. 194). Am Med Inform Assoc

McCray AT, Nelson SJ (1995) The representation of meaning in the UMLS. Methods Inf Med 34(1–2):193–201

McCray AT, Razi A (1994) The UMLS knowledge source server. Medinfo MedInfo 8:144–147

McCray AT, Srinivasan S, Browne AC (1994) Lexical methods for managing variation in biomedical terminologies. In proceedings of the annual symposium on computer application in medical care (p. 235). Am Med Inform Assoc

McDonald R, Crammer K, Pereira F (2005) Flexible text segmentation with structured multilabel classification. In proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 987-994). Assoc Comput Linguist

McCray AT, Sponsler JL, Brylawski B, Browne AC (1987) The role of lexical knowledge in biomedical text understanding. In proceedings of the annual symposium on computer application in medical care (p. 103). Am Med Inform Assoc

McKeown KR (1985) Text generation. Cambridge University Press, Cambridge

Merity S, Keskar NS, Socher R (2018) An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240

Mikolov T, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems

Morel-Guillemaz AM, Baud RH, Scherrer JR (1990) Proximity processing of medical text. In medical informatics Europe’90 (pp. 625–630). Springer, Berlin Heidelberg

Morin E (1999) Automatic acquisition of semantic relations between terms from technical corpora. In proc. of the fifth international congress on terminology and knowledge engineering-TKE’99

Müller M, Salathé M, Kummervold PE (2020) Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv preprint arXiv:2005.07503

"Natural Language Processing (2017) " Natural Language Processing RSS. N.p., n.d. Web. 25

"Natural Language Processing" (2017) Natural Language Processing RSS. N.p., n.d. Web. 23

Newatia R (2019) . Accessed 15 Dec 2021

Nhàn NT, Sager N, Lyman M, Tick LJ, Borst F, Su Y (1989) A medical language processor for two indo-European languages. In proceedings. Symposium on computer applications in medical care (pp. 554-558). Am Med Inform Assoc

Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In LREC

Ochoa A (2016) Meet the Pilot: Smart Earpiece Language Translator. . Accessed 10 Apr 2017

Ogallo W, Kanter AS (2017) Using natural language processing and network analysis to develop a conceptual framework for medication therapy management research. . Accessed 10 Apr 2017

Otter DW, Medina JR, Kalita JK (2020) A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32(2):604–624

Ouyang Y, Li W, Li S, Lu Q (2011) Applying regression models to query-focused multi-document summarization. Inf Process Manag 47(2):227–237

Palmer M, Gildea D, Kingsbury P (2005) The proposition bank: an annotated corpus of semantic roles. Computational linguistics 31(1):71–106

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Assoc Comput Linguist

Peng Y, Chi J (2019) Unsupervised cross-media retrieval using domain adaptation with scene graph. IEEE Transactions on Circuits and Systems for Video Technology 30(11):4368–4379

Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

Rae JW, Potapenko A, Jayakumar SM, Lillicrap TP (2019) Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507

Ranjan P, Basu HVSSA (2003) Part of speech tagging and local word grouping techniques for natural language parsing in Hindi. In Proceedings of the 1st International Conference on Natural Language Processing (ICON 2003)

Rassinoux AM, Baud RH, Scherrer JR (1992) Conceptual graphs model extension for knowledge representation of medical texts. MEDINFO 92:1368–1374

Rassinoux AM, Michel PA, Juge C, Baud R, Scherrer JR (1994) Natural language processing of medical texts within the HELIOS environment. Comput Methods Prog Biomed 45:S79–S96

Rassinoux AM, Juge C, Michel PA, Baud RH, Lemaitre D, Jean FC, Scherrer JR (1995) Analysis of medical jargon: The RECIT system. In Conference on Artificial Intelligence in Medicine in Europe (pp. 42–52). Springer, Berlin Heidelberg

Rennie J (2000) ifile: An application of machine learning to e-mail filtering. In Proc. KDD 2000 Workshop on text mining, Boston, MA

Riedhammer K, Favre B, Hakkani-Tür D (2010) Long story short–global unsupervised models for keyphrase based meeting summarization. Speech Comm 52(10):801–815

Ritter A, Clark S, Etzioni O (2011) Named entity recognition in tweets: an experimental study. In proceedings of the conference on empirical methods in natural language processing (pp. 1524-1534). Assoc Comput Linguist

Rospocher M, van Erp M, Vossen P, Fokkens A, Aldabe I, Rigau G, Soroa A, Ploeger T, Bogaard T (2016) Building event-centric knowledge graphs from news. Web Semantics: Science, Services and Agents on the World Wide Web, In Press

Sager N, Lyman M, Tick LJ, Borst F, Nhan NT, Revillard C, … Scherrer JR (1989) Adapting a medical language processor from English to French. Medinfo 89:795–799

Sager N, Lyman M, Nhan NT, Tick LJ (1995) Medical language processing: applications to patient data representation and automatic encoding. Methods Inf Med 34(1–2):140–146

Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In learning for text categorization: papers from the 1998 workshop (Vol. 62, pp. 98-105)

Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. arXiv preprint cs/0106040

Sakkis G, Androutsopoulos I, Paliouras G et al (2003) A memory-based approach to anti-spam filtering for mailing lists. Inf Retr 6:49–73.

Santoro A, Faulkner R, Raposo D, Rae J, Chrzanowski M, Weber T, ..., Lillicrap T (2018) Relational recurrent neural networks. Adv Neural Inf Proces Syst, 31

Scherrer JR, Revillard C, Borst F, Berthoud M, Lovis C (1994) Medical office automation integrated into the distributed architecture of a hospital information system. Methods Inf Med 33(2):174–179

Seal D, Roy UK, Basak R (2020) Sentence-level emotion detection from text based on semantic rules. In: Tuba M, Akashe S, Joshi A (eds) Information and communication Technology for Sustainable Development. Advances in intelligent Systems and computing, vol 933. Springer, Singapore.


Sentiraama Corpus by Gangula Rama Rohit Reddy, Radhika Mamidi. Language Technologies Research Centre, KCIS, IIIT Hyderabad (n.d.)

Sha F, Pereira F (2003) Shallow parsing with conditional random fields. In proceedings of the 2003 conference of the north American chapter of the Association for Computational Linguistics on human language technology-volume 1 (pp. 134-141). Assoc Comput Linguist

Sharifirad S, Matwin S (2019) When a tweet is actually sexist. A more comprehensive classification of different online harassment categories and the challenges in NLP. arXiv preprint arXiv:1902.10584

Sharma S, Srinivas PYKL, Balabantaray RC (2016) Emotion Detection using Online Machine Learning Method and TLBO on Mixed Script. In Proceedings of Language Resources and Evaluation Conference 2016 (pp. 47–51)

Shemtov H (1997) Ambiguity management in natural language generation. Stanford University

Small SL, Cottrell GW, Tanenhaus MK (1988) Lexical Ambiguity Resolution. Morgan Kaufmann, San Mateo, CA

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642)

Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26(1):320–322

Srihari S (2010) Machine Learning: Generative and Discriminative Models. . Accessed 31 May 2017

Sun X, Morency LP, Okanohara D, Tsujii JI (2008) Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In proceedings of the 22nd international conference on computational linguistics-volume 1 (pp. 841-848). Assoc Comput Linguist

Sundheim BM, Chinchor NA (1993) Survey of the message understanding conferences. In proceedings of the workshop on human language technology (pp. 56-60). Assoc Comput Linguist

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems

Sworna ZT, Mousavi Z, Babar MA (2022) NLP methods in host-based intrusion detection Systems: A systematic review and future directions. arXiv preprint arXiv:2201.08066

Systems RAVN (2017) "RAVN Systems Launch the ACE Powered GDPR Robot - Artificial Intelligence to Expedite GDPR Compliance." Stock Market. PR Newswire, n.d. Web. 19

Tan KL, Lee CP, Anbananthen KSM, Lim KM (2022) RoBERTa-LSTM: A hybrid model for sentiment analysis with transformers and recurrent neural network. IEEE Access

Tapaswi N, Jain S (2012) Treebank based deep grammar acquisition and part-of-speech tagging for Sanskrit sentences. In software engineering (CONSEG), 2012 CSI sixth international conference on (pp. 1-4). IEEE

Thomas C (2019) . Accessed 15 Dec 2021

Tillmann C, Vogel S, Ney H, Zubiaga A, Sawaf H (1997) Accelerated DP based search for statistical translation. In Eurospeech

Umber A, Bajwa I (2011) “Minimizing ambiguity in natural language software requirements specification,” in Sixth Int Conf Digit Inf Manag, pp. 102–107

"Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research (2017) " AMIA ... Annual Symposium proceedings. AMIA Symposium. U.S. National Library of Medicine, n.d. Web. 19

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In advances in neural information processing systems (pp. 5998-6008)

Wahlster W, Kobsa A (1989) User models in dialog systems. In user models in dialog systems (pp. 4–34). Springer, Berlin Heidelberg

Walton D (1996) A pragmatic synthesis. In: fallacies arising from ambiguity. Applied logic series, vol 1. Springer, Dordrecht

Wan X (2008) Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Inf Retr 11(1):25–49

Wang W, Gang J (2018) Application of convolutional neural network in natural language processing. In 2018 international conference on information Systems and computer aided education (ICISCAE) (pp. 64-70). IEEE

Wang D, Zhu S, Li T, Gong Y (2009) Multi-document summarization using sentence-based topic models. In proceedings of the ACL-IJCNLP 2009 conference short papers (pp. 297-300). Assoc Comput Linguist

Wang D, Zhu S, Li T, Chi Y, Gong Y (2011) Integrating document clustering and multidocument summarization. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(3):14–26

Wang Z, Ng P, Ma X, Nallapati R, Xiang B (2019) Multi-passage bert: A globally normalized bert model for open-domain question answering. arXiv preprint arXiv:1908.08167

Wen Z, Peng Y (2020) Multi-level knowledge injecting for visual commonsense reasoning. IEEE Transactions on Circuits and Systems for Video Technology 31(3):1042–1054

Wiese G, Weissenborn D, Neves M (2017) Neural domain adaptation for biomedical question answering. arXiv preprint arXiv:1706.03610

Wong A, Plasek JM, Montecalvo SP, Zhou L (2018) Natural language processing and its implications for the future of medication safety: a narrative review of recent advances and challenges. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy 38(8):822–841

Woods WA (1978) Semantics and quantification in natural language question answering. Adv Comput 17:1–87

Xia T (2020) A constant time complexity spam detection algorithm for boosting throughput on rule-based filtering Systems. IEEE Access 8:82653–82661.

Xie P, Xing E (2017) A constituent-centric neural architecture for reading comprehension. In proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers) (pp. 1405-1414)

Yan X, Ye Y, Mao Y, Yu H (2019) Shared-private information bottleneck method for cross-modal clustering. IEEE Access 7:36045–36056

Yi J, Nasukawa T, Bunescu R, Niblack W (2003) Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques. In data mining, 2003. ICDM 2003. Third IEEE international conference on (pp. 427-434). IEEE

Young SJ, Chase LL (1998) Speech recognition evaluation: a review of the US CSR and LVCSR programmes. Comput Speech Lang 12(4):263–279

Yu S, et al. (2018) "A multi-stage memory augmented neural network for machine reading comprehension." Proceedings of the workshop on machine reading for question answering

Zajic DM, Dorr BJ, Lin J (2008) Single-document and multi-document summarization techniques for email threads using sentence compression. Inf Process Manag 44(4):1600–1610

Zeroual I, Lakhouaja A, Belahbib R (2017) Towards a standard part of speech tagset for the Arabic language. J King Saud Univ Comput Inf Sci 29(2):171–178



The authors would like to express their gratitude to the Research Mentors from CL Educate: Accendere Knowledge Management Services Pvt. Ltd. for their comments on earlier versions of the manuscript, although any errors are our own and should not tarnish the reputations of these esteemed persons. We would also like to thank the Editor, Associate Editor, and anonymous referees for their constructive suggestions, which led to many improvements on an earlier version of this manuscript.

Author information

Authors and affiliations.

Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India

Diksha Khurana & Aditya Koli

Department of Computer Science, BML Munjal University, Gurgaon, India

Kiran Khatter

Department of Statistics, Amity University Punjab, Mohali, India

Sukhdev Singh


Corresponding author

Correspondence to Kiran Khatter.

Ethics declarations

Conflict of interest.

The first draft of this paper was written under the supervision of Dr. Kiran Khatter and Dr. Sukhdev Singh, associated with CL- Educate: Accendere Knowledge Management Services Pvt. Ltd. and deputed at the Manav Rachna International University. The draft is also available on at

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023).


Received: 03 February 2021

Revised: 23 March 2022

Accepted: 02 July 2022

Published: 14 July 2022

Issue Date: January 2023


  • Natural language processing
  • Natural language understanding
  • Natural language generation
  • NLP applications
  • NLP evaluation metrics

Natural language processing: A review

Anushka Gangal, Ayush Shrivastava, Nadia Mahmood Hussien, Prabhishek Singh, Manoj Diwakar, Kapil Joshi, Sapna Bisht, Naveen Chandra Joshi; Natural language processing: A review. AIP Conf. Proc. 1 September 2023; 2771 (1): 020010.

Natural language processing (NLP) has received a great deal of attention for its computational representation and analysis of human language. AI, email spam detection, information extraction, summarization, clinical text processing, and question answering are only a few of its applications. The article is divided into four parts: the first discusses the different levels of NLP and the components of Natural Language Generation (NLG), followed by the history and development of NLP, the state of the art, new developments and challenges, and future scope. We also look at the tools and methods used in NLP, how these procedures work when applied, and how the approaches compare with one another in performance. Natural language processing has not yet attained perfection, although continued progress in the field can certainly bring it closer. Today, numerous AI systems recognize and respond to users' voice commands using natural language processing algorithms.

  • Open access
  • Published: 18 March 2024

Natural language instructions induce compositional generalization in networks of neurons

Reidar Riveland & Alexandre Pouget

Nature Neuroscience (2024)

  • Intelligence
  • Network models

A fundamental human cognitive feat is to interpret linguistic instructions in order to perform novel tasks without explicit task experience. Yet, the neural computations that might be used to accomplish this remain poorly understood. We use advances in natural language processing to create a neural model of generalization based on linguistic instructions. Models are trained on a set of common psychophysical tasks, and receive instructions embedded by a pretrained language model. Our best models can perform a previously unseen task with an average performance of 83% correct based solely on linguistic instructions (that is, zero-shot learning). We found that language scaffolds sensorimotor representations such that activity for interrelated tasks shares a common geometry with the semantic representations of instructions, allowing language to cue the proper composition of practiced skills in unseen settings. We show how this model generates a linguistic description of a novel task it has identified using only motor feedback, which can subsequently guide a partner model to perform the task. Our models offer several experimentally testable predictions outlining how linguistic information must be represented to facilitate flexible and general cognition in the human brain.

In a laboratory setting, animals require numerous trials in order to acquire a new behavioral task. This is in part because the only means of communication with nonlinguistic animals is simple positive and negative reinforcement signals. By contrast, it is common to give written or verbal instructions to humans, which allows them to perform new tasks relatively quickly. Further, once humans have learned a task, they can typically describe the solution with natural language. The dual ability to use an instruction to perform a novel task and, conversely, produce a linguistic description of the demands of a task once it has been learned are two unique cornerstones of human communication. Yet, the computational principles that underlie these abilities remain poorly understood.

One influential systems-level explanation posits that flexible interregional connectivity in the prefrontal cortex allows for the reuse of practiced sensorimotor representations in novel settings 1,2. More recently, multiple studies have observed that when subjects are required to flexibly recruit different stimulus-response patterns, neural representations are organized according to the abstract structure of the task set 3,4,5. Lastly, recent modeling work has shown that a multitasking recurrent neural network (RNN) will share dynamical motifs across tasks with similar demands 6. This work forms a strong basis for explanations of flexible cognition in humans but leaves open the question of how linguistic information can reconfigure a sensorimotor network so that it performs a novel task well on the first attempt. Overall, it remains unclear what representational structure we should expect from brain areas that are responsible for integrating linguistic information in order to reorganize sensorimotor mappings on the fly.

These questions become all the more pressing given that recent advances in machine learning have led to artificial systems that exhibit human-like language skills 7,8. Recent works have matched neural data recorded during passive listening and reading tasks to activations in autoregressive language models (that is, GPT 9), arguing that there is a fundamentally predictive component to language comprehension 10,11. Additionally, some high-profile machine learning models do show the ability to use natural language as a prompt to perform a linguistic task or render an image, but the outputs of these models are difficult to interpret in terms of a sensorimotor mapping that we might expect to occur in a biological system 12,13,14. Alternatively, recent work on multimodal interactive agents may be more interpretable in terms of the actions they take, but utilize a perceptual hierarchy that fuses vision and language at early stages of processing, making them difficult to map onto functionally and anatomically distinct language and vision areas in human brains 15,16,17.

We, therefore, seek to leverage the power of language models in a way that results in testable neural predictions detailing how the human brain processes natural language in order to generalize across sensorimotor tasks.

To that end, we train an RNN (sensorimotor-RNN) model on a set of simple psychophysical tasks where models process instructions for each task using a pretrained language model. We find that embedding instructions with models tuned to sentence-level semantics allows sensorimotor-RNNs to perform a novel task at 83% correct, on average. Generalization in our models is supported by a representational geometry that captures task subcomponents and is shared between instruction embeddings and sensorimotor activity, thereby allowing a composition of practiced skills in a novel setting. We also find that individual neurons modulate their tuning based on the semantics of instructions. We demonstrate how a network trained to interpret linguistic instructions can invert this understanding and produce a linguistic description of a previously unseen task based on the information in motor feedback signals. We end by discussing how these results can guide research on the neural basis of language-based generalization in the human brain.

Instructed models and task set

We train sensorimotor-RNNs on a set of 50 interrelated psychophysical tasks that require various cognitive capacities that are well studied in the literature 18. Two example tasks are presented in Fig. 1a,b as they might appear in a laboratory setting. For all tasks, models receive a sensory input and task-identifying information and must output motor response activity (Fig. 1c). Input stimuli are encoded by two one-dimensional maps of neurons, each representing a different input modality, with periodic Gaussian tuning curves to angles (over (0, 2π)). Output responses are encoded in the same way. Inputs also include a single fixation unit. After the input fixation is off, the model can respond to the input stimuli. Our 50 tasks are roughly divided into 5 groups, ‘Go’, ‘Decision-making’, ‘Comparison’, ‘Duration’ and ‘Matching’, where within-group tasks share similar sensory input structures but may require divergent responses. For instance, in the decision-making (DM) task, the network must respond in the direction of the stimulus with the highest contrast, whereas in the anti-decision-making (AntiDM) task, the network responds to the stimulus with the weakest contrast (Fig. 1a). Thus, networks must properly infer the task demands for a given trial from task-identifying information in order to perform all tasks simultaneously (see Methods for task details; see Supplementary Fig. 13 for example trials of all tasks).
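The input encoding described above can be illustrated with a minimal sketch. It assumes a Gaussian tuning curve over circular distance, with evenly spaced preferred angles; the function names, the number of units (`n_units`), and the tuning width (`sigma`) are illustrative choices, not values taken from the paper.

```python
import math

def circular_distance(a, b):
    """Smallest absolute angular difference between two angles, in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def encode_angle(theta, strength=1.0, n_units=32, sigma=0.5):
    """Population response of n_units neurons to a stimulus at angle theta.

    Preferred angles tile [0, 2*pi) evenly and each unit has a Gaussian
    tuning curve over circular distance, so the code is periodic in theta.
    n_units and sigma are assumed values for illustration only.
    """
    preferred = [2 * math.pi * i / n_units for i in range(n_units)]
    return [
        strength * math.exp(-0.5 * (circular_distance(theta, p) / sigma) ** 2)
        for p in preferred
    ]

# A stimulus at pi/2 with contrast 0.8 drives the unit tuned to pi/2 most strongly.
activity = encode_angle(math.pi / 2, strength=0.8)
peak_unit = max(range(len(activity)), key=activity.__getitem__)
```

Using circular distance rather than the raw difference is what makes the tuning curve periodic: angles just below 2π and just above 0 activate neighbouring units.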

Fig. 1. a,b, Illustrations of example trials as they might appear in a laboratory setting. The trial is instructed, then stimuli are presented with different angles and strengths of contrast. The agent must then respond with the proper angle during the response period. a, An example AntiDM trial where the agent must respond to the angle presented with the least intensity. b, An example COMP1 trial where the agent must respond to the first angle if it is presented with higher intensity than the second angle and must otherwise repress its response. c, Diagram of model inputs and outputs. Sensory inputs (fixation unit, modality 1, modality 2) are shown in red and model outputs (fixation output, motor output) are shown in green. Models also receive a rule vector (blue) or the embedding that results from passing task instructions through a pretrained language model (gray). A list of models tested is provided in the inset.
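The response rules for the three tasks named so far (DM, AntiDM and COMP1) can be written down as a small function. This is only a sketch of the rules as stated in the text and caption: the function name and argument layout are hypothetical, and the handling of equal-strength stimuli is an arbitrary assumption because the text does not specify it.

```python
def target_response(task, angle1, strength1, angle2, strength2):
    """Return the angle the agent should respond to, or None to withhold.

    Ties in strength are resolved arbitrarily here (an assumption).
    """
    if task == "DM":
        # Respond toward the stimulus with the highest contrast.
        return angle1 if strength1 > strength2 else angle2
    if task == "AntiDM":
        # Respond toward the stimulus with the weakest contrast.
        return angle1 if strength1 < strength2 else angle2
    if task == "COMP1":
        # Respond to the first angle only if it is the stronger one.
        return angle1 if strength1 > strength2 else None
    raise ValueError(f"unknown task: {task}")

# Same stimuli, different instructed task, different correct response.
stimuli = dict(angle1=0.5, strength1=0.9, angle2=2.0, strength2=0.4)
dm_target = target_response("DM", **stimuli)
anti_target = target_response("AntiDM", **stimuli)
```

The example makes the point of the paragraph concrete: identical sensory input requires divergent responses, so the network has to rely on the task-identifying input.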

In our models, task-identifying input is either nonlinguistic or linguistic. We use two nonlinguistic control models. First, in SIMPLENET, the identity of a task is represented by one of 50 orthogonal rule vectors. Second, STRUCTURENET uses a set of 10 orthogonal structure vectors, each representing a dimension of the task set (that is, respond weakest versus strongest direction), and tasks are encoded using combinations of these vectors (see Supplementary Notes 3 for the full set of structure combinations). As a result, STRUCTURENET fully captures all the relevant relationships among tasks, whereas SIMPLENET encodes none of this structure.
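The difference between the two nonlinguistic encodings can be sketched as follows. The 50-dimensional one-hot vectors follow the description of SIMPLENET; the ten structure dimension names and the task-to-feature map below are hypothetical placeholders, since the actual combinations are given only in the Supplementary Notes.

```python
def one_hot(index, size):
    """Unit vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Four illustrative tasks out of the 50 (indices are arbitrary).
TASKS = ["Go", "AntiGo", "DM", "AntiDM"]

# SIMPLENET-style: one orthogonal rule vector per task.
simple_rule = {task: one_hot(i, 50) for i, task in enumerate(TASKS)}

# STRUCTURENET-style: each task is a sum of shared structure vectors.
# These 10 dimension names and the feature map are hypothetical.
DIMS = ["go", "decision", "comparison", "duration", "matching",
        "anti", "modality1", "modality2", "delay", "multisensory"]
FEATURES = {
    "Go": ["go"], "AntiGo": ["go", "anti"],
    "DM": ["decision"], "AntiDM": ["decision", "anti"],
}

def structure_rule(task):
    """Sum the one-hot structure vectors for every feature of the task."""
    vec = [0.0] * len(DIMS)
    for name in FEATURES[task]:
        vec[DIMS.index(name)] += 1.0
    return vec

# Related tasks are orthogonal under SIMPLENET but overlap under STRUCTURENET.
simple_overlap = dot(simple_rule["DM"], simple_rule["AntiDM"])
structure_overlap = dot(structure_rule("DM"), structure_rule("AntiDM"))
```

Under the one-hot scheme every pair of tasks has zero overlap, whereas the combinatorial scheme gives related tasks a shared component, which is the sense in which it captures task relationships.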

Instructed models use a pretrained transformer architecture 19 to embed natural language instructions for the tasks at hand. For each task, there is a corresponding set of 20 unique instructions (15 training, 5 validation; see Supplementary Notes 2 for the full instruction set). We test various types of language models that share the same basic architecture but differ in their size and also their pretraining objective. We tested two autoregressive models, a standard and a large version of GPT2, which we call GPT and GPT (XL), respectively. Previous work has demonstrated that GPT activations can account for various neural signatures of reading and listening 11. BERT is trained to identify masked words within a piece of text 20, but it also uses an unsupervised sentence-level objective, in which the network is given two sentences and must determine whether they follow each other in the original text. SBERT is trained like BERT but receives additional tuning on the Stanford Natural Language Inference task, a hand-labeled dataset detailing the logical relationship between two candidate sentences (Methods) 21,22. Lastly, we use the language embedder from CLIP, a multimodal model that learns a joint embedding space of images and text captions 23. We call a sensorimotor-RNN using a given language model LANGUAGEMODELNET and append a letter indicating its size. The various sizes of models are given in Fig. 1c. For each language model, we apply a pooling method to the last hidden state of the transformer and pass this fixed-length representation through a set of linear weights that are trained during task learning. This results in a 64-dimensional instruction embedding across all models (Methods). Language model weights are frozen unless otherwise specified. Finally, as a control, we also test a bag-of-words (BoW) embedding scheme that only uses word count statistics to embed each instruction.
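The pooling and projection step can be sketched in plain Python. Mean pooling is an assumption here (the text says only "a pooling method"), the hidden width of 768 is an assumed value, and random numbers stand in for the frozen language model's last hidden state and for the trainable linear weights.

```python
import random

def mean_pool(hidden_states):
    """Average token vectors (seq_len x hidden_dim) into one hidden_dim vector."""
    n_tokens = len(hidden_states)
    return [sum(column) / n_tokens for column in zip(*hidden_states)]

def linear(x, weights, bias):
    """Compute y = W x + b, with W given as a list of output rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]

rng = random.Random(0)
HIDDEN_DIM, EMBED_DIM = 768, 64  # 768 is an assumed transformer width

# Stand-in for the frozen language model's last hidden state
# for a 12-token instruction (random values, for illustration only).
hidden_states = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(12)]

# Stand-in for the linear weights that would be trained during task learning.
W = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_DIM)] for _ in range(EMBED_DIM)]
b = [0.0] * EMBED_DIM

instruction_embedding = linear(mean_pool(hidden_states), W, b)
```

Because pooling removes the sequence dimension, instructions of any length map to the same 64-dimensional space, which is what lets one sensorimotor-RNN input layer serve every language model.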

First, we verify our models can perform all tasks simultaneously. For instructed models to perform well, they must infer the common semantic content between 15 distinct instruction formulations for each task. We find that all our instructed models can learn all tasks simultaneously except for GPTNET, whose performance asymptotes below the 95% threshold for some tasks. Hence, we relax the performance threshold to 85% for models that use GPT (Supplementary Fig. 1; see Methods for training details). We additionally tested all architectures on validation instructions (Supplementary Fig. 2). SBERTNET (L) and SBERTNET are our best-performing models, achieving an average performance of 97% and 94%, respectively, on validation instructions, demonstrating that these networks infer the proper semantic content even for entirely novel instructions.

Generalization to novel tasks

We next examined the extent to which different language models aided generalization to novel tasks. We trained individual networks on 45 tasks and then tested performance when exposed to the five held-out tasks. We use unequal-variance t-tests to make comparisons among the performance of different models. Figure 2 shows results with P values for the most relevant comparisons (a full matrix of comparisons across all models can be found in Supplementary Figs. 3 and 4).
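For reference, the unequal-variance (Welch) t statistic used for these comparisons can be computed as in the sketch below. The function name and the example scores are made up for illustration and are not the paper's data; the degrees of freedom use the standard Welch–Satterthwaite approximation.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for a two-sample unequal-variance (Welch) t-test."""
    # Squared standard errors of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-task scores for two models (not values from the paper).
model_a = [0.82, 0.85, 0.80, 0.84, 0.83]
model_b = [0.70, 0.74, 0.69, 0.73, 0.71]
t_stat, dof = welch_t(model_a, model_b)
```

Turning `t` and `df` into a two-sided P value requires the Student t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.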

figure 2

a , Learning curves for the first 100 exposures to held-out tasks averaged over all tasks. Data are presented as the mean ± s.d. across different n  = 5 random initializations of sensorimotor-RNN weights. For all subplots, asterisks indicate significant differences among performance according to a two-sided unequal-variance t -test. Most relevant comparisons are presented in plots (for all subplots, not significant (NS), P  > 0.05, * P  < 0.05, ** P  < 0.01, *** P  < 0.001; STRUCTURENET versus SBERTNET (L): t  = 3.761, P  = 1.89 × 10 −4 ; SBERTNET (L) versus SBERTNET: t  = 2.19, P  = 0.029; SBERTNET versus CLIPNET: t  = 6.22, P  = 1.02 × 10 −9 ; CLIPNET versus BERTNET: t  = 1.037,  P  = 0.300; BERTNET versus GPTNET (XL): t  = −1.122, P  = 0.262; GPTNET (XL) versus GPTNET: t  = 6.22, P  = 1.04 × 10 −9 ; GPTNET versus BOWNET: t  = −3.346, P  = 8.85 × 10 − 4 ; BOWNET versus SIMPLENET: t  = 10.25, P  = 2.091 × 10 −22 ). A full table of pairwise comparisons can be found in Supplementary Fig. 3 . b , Distribution of generalization performance (that is, first exposure to novel task) across models. c – f , Performance across different test conditions for n  = 5 different random initialization of sensorimotor-RNN weights where each point indicates average performance across tasks for a given initialization. c , Generalization performance for tasks where instructions are swapped at test time (STRUCTURENET versus SBERTNET (L): t  = −0.15, P  = 0.875; SBERTNET (L) versus SBERTNET: t  = −2.102, P  = 0.036; SBERTNET versus CLIPNET: t  = −0.162, P  = 0.871; CLIPNET versus BERTNET: t  = 0.315, P  = 0.752; BERTNET versus GPTNET (XL): t  = 0.781, P  = 0.435; GPTNET (XL) versus GPTNET: t  = 1.071, P  = 0.285; GPTNET versus BOWNET: t  = −2.702, P  = 0.007; BOWNET versus SIMPLENET: t  = −3.471, P  = 5.633 −4 ). A full table of pairwise comparisons can be found in Supplementary Fig. 4 . 
d, Generalization performance for models where tasks from the same family are held out during training (STRUCTURENET versus SBERTNET (L): t = 0.629, P = 0.530; SBERTNET (L) versus SBERTNET: t = −0.668, P = 0.504; SBERTNET versus CLIPNET: t = 8.043, P = 7.757 × 10⁻¹⁵; CLIPNET versus BERTNET: t = −0.306, P = 0.759; BERTNET versus GPTNET (XL): t = 0.163, P = 0.869; GPTNET (XL) versus GPTNET: t = 1.534, P = 0.126; GPTNET versus BOWNET: t = −6.418, P = 3.26 × 10⁻¹⁰; BOWNET versus SIMPLENET: t = 14.23, P = 8.561 × 10⁻³⁹). A full table of pairwise comparisons can be found in Supplementary Fig. 4. e, Generalization performance for models where the last layers of language models are allowed to fine-tune to the loss from sensorimotor tasks (STRUCTURENET versus SBERTNET (L): t = 1.203, P = 0.229; SBERTNET (L) versus SBERTNET: t = 2.399, P = 0.016; SBERTNET versus CLIPNET: t = 5.186, P = 3.251 × 10⁻⁷; CLIPNET versus BERTNET: t = −3.002, P = 0.002; BERTNET versus GPTNET (XL): t = 0.522, P = 0.601; GPTNET (XL) versus GPTNET: t = 2.631, P = 0.009; GPTNET versus BOWNET: t = 4.440, P = 1.134 × 10⁻⁵; BOWNET versus SIMPLENET: t = 10.255, P = 2.091 × 10⁻²²). A full table of pairwise comparisons can be found in Supplementary Fig. 4. f, Average difference in performance between tasks that use standard imperative instructions and those that use instructions with conditional clauses and require a simple deductive reasoning component. Colored asterisks at the bottom of the plot show P values for a two-sided, unequal-variance t-test against the null distribution constructed using random splits of the task set (transparent points represent mean differences for random splits; STRUCTURENET: t = −36.46, P = 4.34 × 10⁻²³; SBERTNET (L): t = −16.38, P = 3.02 × 10⁻⁵; SBERTNET: t = −15.35, P = 3.920 × 10⁻⁵; CLIPNET: t = −44.68, P = 5.32 × 10⁻¹³; BERTNET: t = −25.51, P = 3.14 × 10⁻⁸; GPTNET (XL): t = −16.99, P = 3.61 × 10⁻⁶; GPTNET: t = −9.150, P = 0.0002; BOWNET: t = −70.99, P = 4.566 × 10⁻³⁵; SIMPLENET: t = 19.60, P = 5.82 × 10⁻⁶), and asterisks at the top of the plot indicate P-value results from a t-test comparing differences between STRUCTURENET and our other instructed models (versus SBERTNET (L): t = 3.702, P = 0.0168; versus SBERTNET: t = 6.592, P = 0.002; versus CLIPNET: t = 30.35, P = 2.367 × 10⁻⁷; versus BERTNET: t = 7.234, P = 0.0007; versus GPTNET (XL): t = 5.282, P = 0.004; versus GPTNET: t = −1.745, P = 0.149; versus BOWNET: t = 75.04, P = 9.96 × 10⁻¹¹; versus SIMPLENET: t = −30.95, P = 2.86 × 10⁻⁶; see Methods and Supplementary Fig. 6 for full comparisons).

Our uninstructed control model SIMPLENET performs at 39%, on average, on the first presentation of a novel task (zero-shot generalization). This serves as a baseline for generalization. Note that despite the orthogonality of task rules provided to SIMPLENET, exposure to the task set allows models to learn patterns that are common to all tasks (for example, always repress response during fixation). Therefore, 39% is not chance-level performance per se, but rather performance achieved by a network trained and tested on a task set with some common requirements for responding. GPTNET exhibits a zero-shot generalization of 57%. This is a significant improvement over SIMPLENET (t = 8.32, P = 8.24 × 10⁻¹⁶). Strikingly, increasing the size of GPT by an order of magnitude to the 1.5 billion parameters used by GPT (XL) only resulted in modest gains over BOWNET (64%), with GPTNET (XL) achieving 68% on held-out tasks (t = 2.04, P = 0.047). By contrast, CLIPNET (S), which uses 4% of the number of parameters utilized by GPTNET (XL), is nonetheless able to achieve the same performance (68% correct, t = 0.146, P = 0.88). Likewise, BERTNET achieves a generalization performance that lags only 2% behind GPTNET (XL) in the mean (t = −1.122, P = 0.262). By contrast, models with knowledge of sentence-level semantics show marked improvements in generalization, with SBERTNET performing an unseen task at 79% correct on average. Finally, our best-performing model, SBERTNET (L), can execute a never-before-seen task with a performance of 83% correct, on average, lagging just a few percentage points behind STRUCTURENET (88% correct), which receives the structure of the task set hand-coded in its rule vectors.

Figure 2b shows a histogram of the number of tasks for which each model achieves a given level of performance. Again, SBERTNET (L) manages to perform over 20 tasks in the set nearly perfectly in the zero-shot setting (for individual task performance for all models across tasks, see Supplementary Fig. 3).

To validate that our best-performing models leveraged the semantics of instructions, we presented the sensory input for one held-out task while providing the linguistic instructions for a different held-out task. Models that truly rely on linguistic information should be most penalized by this manipulation and, as predicted, we saw the largest decrease in performance for our best models (Fig. 2c ).

We also tested a more stringent hold-out procedure where we purposefully chose 4–6 tasks from the same family of tasks to hold out during training (Fig. 2d ). Overall, performance decreased in this more difficult setting, although our best-performing models still showed strong generalization, with SBERTNET (L) and SBERTNET achieving 71% and 72% correct on novel tasks, respectively, which was not significantly different from STRUCTURENET at 72% ( t  = 0.629, P  = 0.529; t  = 0.064, P  = 0.948; for SBERTNET (L) and SBERTNET, respectively).

In addition, we tested models in a setting where we allow the weights of language models to tune according to the loss experienced during sensorimotor training (see Methods for tuning details). This manipulation improved the generalization performance across all models, and for our best-performing model, SBERTNET (L), we see that generalization is as strong as for STRUCTURENET (86%, t  = 1.204, P  = 0.229).

Following ref. 18 , we tested models in a setting where task-type information for a given task was represented as a composition of information for related tasks in the training set (that is, AntiDMMod1 = (rule(AntiDMMod2) − rule(DMMod2)) + rule(DMMod1)). In this setting, we did find that the performance of SIMPLENET improved (60% correct). However, when we combined embedded instructions according to the same compositional rules, our linguistic models dramatically outperformed SIMPLENET. This suggests that training in the context of language more readily allows a simple compositional scheme to successfully configure task responses (see Supplementary Fig. 5 for full results and compositional encodings).
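The compositional encoding above can be sketched with plain vector arithmetic. The example assumes one-hot (orthogonal) rule vectors of an arbitrary dimensionality of four; the index assignment and the helper name `compose` are illustrative and not taken from the study.

```python
# One-hot rule vectors for three practiced tasks (orthogonal rule vectors).
# The dimensionality (4) and the index assignment are illustrative assumptions.
rule = {
    "DMMod1":     [1.0, 0.0, 0.0, 0.0],
    "DMMod2":     [0.0, 1.0, 0.0, 0.0],
    "AntiDMMod2": [0.0, 0.0, 1.0, 0.0],
}


def compose(anti_other, pro_other, pro_target):
    """(rule(anti_other) - rule(pro_other)) + rule(pro_target), element-wise."""
    return [a - p + t for a, p, t in
            zip(rule[anti_other], rule[pro_other], rule[pro_target])]


# AntiDMMod1 = (rule(AntiDMMod2) - rule(DMMod2)) + rule(DMMod1)
anti_dm_mod1 = compose("AntiDMMod2", "DMMod2", "DMMod1")
print(anti_dm_mod1)  # [1.0, -1.0, 1.0, 0.0]
```

The same arithmetic applies when the dictionary holds instruction embeddings instead of one-hot rules, which is the comparison the paragraph describes.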

Finally, we tested a version of each model where outputs of language models are passed through a set of nonlinear layers, as opposed to the linear mapping used in the preceding results. We found that this manipulation reduced performance, suggesting that this added power leads to overfitting on training tasks, and that a simpler linear mapping is better suited to generalization (see Methods for details and Supplementary Fig. 4 for full results).

The discrepancy in performance between our instructed models suggests that in order to represent linguistic information such that it can successfully configure sensorimotor networks, it is not sufficient to simply use any very powerful language processing system. Rather, model success can be delineated by the extent to which models are exposed to sentence-level semantics during pretraining. Our best-performing models SBERTNET (L) and SBERTNET are explicitly trained to produce good sentence embeddings, whereas our worst-performing model, GPTNET, is only tuned to the statistics of upcoming words. Both CLIPNET (S) and BERTNET are exposed to some form of sentence-level knowledge. CLIPNET (S) is interested in sentence-level representations, but trains these representations using the statistics of corresponding vision representations. BERTNET performs a two-way classification of whether or not input sentences are adjacent in the training corpus. That the 1.5 billion parameters of GPTNET (XL) do not markedly improve performance relative to these comparatively small models speaks to the fact that model size is not the determining factor. Lastly, although BoW removes key elements of linguistic meaning (that is, syntax), the simple use of word occurrences encodes information primarily about the similarities and differences between the sentences. For instance, simply representing the inclusion or exclusion of the words ‘stronger’ or ‘weaker’ is highly informative about the meaning of the instruction.

We also investigated which features of language make it difficult for our models to generalize. Thirty of our tasks require processing instructions with a conditional clause structure (for example, COMP1) as opposed to a simple imperative (for example, AntiDM). Tasks that are instructed using conditional clauses also require a simple form of deductive reasoning (if p then q else s ). Neuroimaging literature exploring the relationship between such deductive processes and language areas has reached differing conclusions, with some early studies showing that deduction recruits regions that are thought to support syntactic computations 24 , 25 , 26 and follow-up studies claiming that deduction can be reliably dissociated from language areas 27 , 28 , 29 , 30 . One theory for this variation in results is that baseline tasks used to isolate deductive reasoning in earlier studies used linguistic stimuli that required only superficial processing 31 , 32 .

To explore this issue, we calculated the average difference in performance between tasks with and without conditional clauses/deductive reasoning requirements (Fig. 2f ). All our models performed worse on these tasks relative to a set of random shuffles. However, we also saw an additional effect between STRUCTURENET and our instructed models, which performed worse than STRUCTURENET by a statistically significant margin (see Supplementary Fig. 6 for full comparisons). This is a crucial comparison because STRUCTURENET performs deductive tasks without relying on language. Hence, the decrease in performance between STRUCTURENET and instructed models is in part due to the difficulty inherent in parsing syntactically more complicated language. The implication is that we may see engagement of linguistic areas in deductive reasoning tasks, but this may simply be due to the increased syntactic demands of corresponding instructions (rather than processes that recruit linguistic areas to explicitly aid in the deduction). This result largely agrees with two reviews of the deductive reasoning literature, which concluded that the effects in language areas seen in early studies were likely due to the syntactic complexity of test stimuli 31 , 32 .
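The comparison against random shuffles can be sketched as follows: compute the mean performance difference for the real task grouping, then recompute it for random task splits of the same size to obtain a null distribution. Task names and accuracies here are made up for illustration, and the helper names are hypothetical.

```python
import random
from statistics import mean


def mean_difference(perf, group):
    """Mean performance on tasks in `group` minus mean on the remaining tasks."""
    inside = [p for task, p in perf.items() if task in group]
    outside = [p for task, p in perf.items() if task not in group]
    return mean(inside) - mean(outside)


def null_differences(perf, group_size, n_splits, seed=0):
    """Differences for random task splits of the same size as the real group."""
    rng = random.Random(seed)
    tasks = sorted(perf)
    return [mean_difference(perf, set(rng.sample(tasks, group_size)))
            for _ in range(n_splits)]


# Made-up per-task accuracies; task names are placeholders, not the study's tasks
perf = {"t1": 0.95, "t2": 0.90, "t3": 0.92, "t4": 0.60, "t5": 0.55, "t6": 0.65}
conditional = {"t4", "t5", "t6"}  # tasks assumed to use conditional clauses

observed = mean_difference(perf, conditional)
null = null_differences(perf, len(conditional), n_splits=200)
print(round(observed, 3), round(mean(null), 3))
```

An observed difference that sits far in the tail of the null differences indicates that the grouping by instruction type, rather than chance, accounts for the performance gap.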

Shared structure in language and sensorimotor networks

We then turned to an investigation of the representational scheme that supports generalization. First, we note that, as in other multitasking models, units in our sensorimotor-RNNs exhibited functional clustering, where similar subsets of neurons show high variance across similar sets of tasks (Supplementary Fig. 7). Moreover, we found that models can learn unseen tasks by only training sensorimotor-RNN input weights and keeping the recurrent dynamics constant (Supplementary Fig. 8). Past work has shown that these properties are characteristic of networks that can reuse the same set of underlying neural resources across different settings 6 , 18 . We then examined the geometry that exists between the neural representations of related tasks. We plotted the first three principal components (PCs) of sensorimotor-RNN hidden activity at stimulus onset in SIMPLENET, GPTNET (XL), SBERTNET (L) and STRUCTURENET performing modality-specific DM and AntiDM tasks. Here, models receive input for a decision-making task in both modalities but must only attend to the stimuli in the modality relevant for the current task. Importantly, AntiDMMod1 is held out of training in the following examples. In addition, we plotted the PCs of either the rule vectors or the instruction embeddings in each task (Fig. 3).

figure 3

a–d, The first three PCs of sensorimotor hidden activity and task-info representations for models trained with AntiDMMod1 held out. Solid arrows represent an abstract ‘Pro’ versus ‘Anti’ axis, and dashed arrows represent an abstract ‘Mod1’ versus ‘Mod2’ axis. a, STRUCTURENET. b, SBERTNET (L). c, GPTNET (XL). d, SIMPLENET. e, Correlation between held-out task CCGP and zero-shot performance (Pearson’s r = 0.606, P = 1.57 × 10⁻⁴⁶). f, CCGP scores for held-out tasks for each layer in the model hierarchy. Significance scores indicate P-value results from pairwise two-sided unequal-variance t-tests performed among model distributions of CCGP scores on held-out tasks for sensorimotor-RNN (NS P > 0.05, *P < 0.05, **P < 0.01, ***P < 0.001; STRUCTURENET versus SBERTNET (L): t = 13.67, P = 2.44 × 10⁻³⁶; SBERTNET (L) versus SBERTNET: t = 5.061, P = 5.84 × 10⁻⁷; SBERTNET versus CLIPNET: t = 2.809, P = 0.005; CLIPNET versus BERTNET: t = 0.278, P = 0.780; BERTNET versus GPTNET (XL): t = 2.505, P = 0.012; GPTNET (XL) versus GPTNET: t = 3.180, P = 0.001; GPTNET versus BOWNET: t = −4.176, P = 3.50 × 10⁻⁵; BOWNET versus SIMPLENET: t = 23.0.8, P = 1.10 × 10⁻⁸⁰; see Supplementary Fig. 9 for full comparisons as well as t-test results for embedding layer CCGP scores).

For STRUCTURENET, hidden activity is factorized along task-relevant axes, namely a consistent ‘Pro’ versus ‘Anti’ direction in activity space (solid arrows), and a ‘Mod1’ versus ‘Mod2’ direction (dashed arrows). Importantly, this structure is maintained even for AntiDMMod1, which has been held out of training, allowing STRUCTURENET to achieve a performance of 92% correct on this unseen task. This factorization is also reflected in the PCs of rule embeddings. Strikingly, SBERTNET (L) also organizes its representations in a way that captures the essential compositional nature of the task set using only the structure that it has inferred from the semantics of instructions. This is the case for language embeddings, which maintain abstract axes across AntiDMMod1 instructions (again, held out of training). As a result, SBERTNET (L) is able to use these relevant axes for AntiDMMod1 sensorimotor-RNN representations, leading to a generalization performance of 82%. By contrast, GPTNET (XL) fails to properly infer a distinct ‘Pro’ versus ‘Anti’ axis in either sensorimotor-RNN representations or language embeddings, leading to a zero-shot performance of 6% on AntiDMMod1 (Fig. 3c). Finally, we find that the orthogonal rule vectors used by SIMPLENET preclude any structure between practiced and held-out tasks, resulting in a performance of 22%.

To more precisely quantify this structure, we measure the cross-conditional generalization performance (CCGP) of these representations 3 . CCGP measures the ability of a linear decoder trained to differentiate one set of conditions (that is, DMMod2 and AntiDMMod2) to generalize to an analogous set of test conditions (that is, DMMod1 and AntiDMMod1). Intuitively, this captures the extent to which models have learned to place sensorimotor activity along abstract task axes (that is, the ‘Anti’ dimension). Notably, high CCGP scores and related measures have been observed in experiments that required human participants to flexibly switch between different interrelated tasks 4 , 33 .
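The logic of CCGP can be illustrated with a toy example: fit a linear decoder to separate ‘Pro’ from ‘Anti’ activity in one modality, then score it on the other modality without refitting. This sketch swaps in a simple difference-of-means decoder and synthetic three-dimensional activity; the decoder choice, the data generator `toy_reps` and all parameter values are assumptions made for illustration only.

```python
import random


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_linear_decoder(class0, class1):
    """Difference-of-means linear decoder: returns weights w and a bias."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    w = [b - a for a, b in zip(m0, m1)]
    mid = [(a + b) / 2 for a, b in zip(m0, m1)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, bias


def accuracy(w, bias, class0, class1):
    """Fraction of points placed on the correct side of the hyperplane."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + bias
    hits = sum(score(x) < 0 for x in class0) + sum(score(x) > 0 for x in class1)
    return hits / (len(class0) + len(class1))


def toy_reps(anti, mod, rng, n=20, noise=0.1):
    """Toy activity: dimension 0 codes Pro/Anti, dimension 1 codes Mod1/Mod2."""
    return [[anti + rng.gauss(0, noise), mod + rng.gauss(0, noise),
             rng.gauss(0, noise)] for _ in range(n)]


rng = random.Random(0)
dm_mod2, anti_dm_mod2 = toy_reps(0, 1, rng), toy_reps(1, 1, rng)
dm_mod1, anti_dm_mod1 = toy_reps(0, 0, rng), toy_reps(1, 0, rng)

w, bias = fit_linear_decoder(dm_mod2, anti_dm_mod2)  # train on Mod2 conditions
ccgp = accuracy(w, bias, dm_mod1, anti_dm_mod1)       # test on Mod1 conditions
print(f"CCGP-style score: {ccgp:.2f}")
```

Because the toy activity is factorized along a shared ‘Anti’ dimension, the decoder transfers across modalities; representations without such a shared axis would yield scores near chance.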

We measured CCGP scores among representations in sensorimotor-RNNs for tasks that have been held out of training (Methods) and found a strong correlation between CCGP scores and zero-shot performance (Fig. 3e). Additionally, we find that swapping task instructions for held-out tasks dramatically reduces CCGP scores for all our instructed models, indicating that the semantics of instructions are crucial for maintaining structured representations (Supplementary Fig. 9).

We then looked at how structure emerges in the language processing hierarchy. CCGP decoding scores for different layers in our model are shown in Fig. 3f. For each instructed model, scores for 12 transformer layers (or the last 12 layers for SBERTNET (L) and GPTNET (XL)), the 64-dimensional embedding layer and the sensorimotor-RNN task representations are plotted. We also plotted CCGP scores for the rule embeddings used in our nonlinguistic models. Among models, there was a notable discrepancy in how abstract structure emerges. Autoregressive models (GPTNET (XL), GPTNET), BERTNET and CLIPNET (S) showed a low CCGP throughout language model layers followed by a jump in the embedding layer. This is because weights feeding into the embedding layer are tuned during sensorimotor training. The implication of this spike is that most of the useful representational processing in these models actually does not occur in the pretrained language model per se, but rather in the linear readout, which is exposed to task structure via training. By contrast, our best-performing models SBERTNET and SBERTNET (L) use language representations where high CCGP scores emerge gradually in the intermediate layers of their respective language models. Because semantic representations already have such a structure, most of the compositional inference involved in generalization can occur in the comparatively powerful language processing hierarchy. As a result, representations are already well organized in the last layer of language models, and a linear readout in the embedding layer is sufficient for the sensorimotor-RNN to correctly infer the geometry of the task set and generalize well.

This analysis strongly suggests that models exhibiting generalization do so by leveraging structured semantic representations to properly relate practiced and novel tasks in sensorimotor space, thereby allowing a composition of practiced behaviors in an unseen setting.

Semantic modulation of single-unit tuning properties

Next, we examined tuning profiles of individual units in our sensorimotor-RNNs. We found that individual neurons are tuned to a variety of task-relevant variables. Critically, however, we find neurons where this tuning varies predictably within a task group and is modulated by the semantic content of instructions in a way that reflects task demands.

For instance, in the ‘Go’ family of tasks, unit 42 shows direction selectivity that shifts by π between ‘Pro’ and ‘Anti’ tasks, reflecting the relationship of task demands in each context (Fig. 4a ). This flip in selectivity is observed even for the AntiGo task, which was held out during training.

figure 4

a, Tuning curves for an SBERTNET (L) sensorimotor-RNN unit that modulates tuning according to task demands in the ‘Go’ family. b, Tuning curves for an SBERTNET (L) sensorimotor-RNN unit in the ‘matching’ family of tasks plotted in terms of difference in angle between two stimuli. c, Full activity traces for modality-specific ‘DM’ and ‘AntiDM’ tasks for different levels of relative stimulus strength. d, Full activity traces for tasks in the ‘comparison’ family of tasks for different levels of relative stimulus strength.

For the ‘Matching’ family of tasks, unit 14 modulates activity between ‘match’ (DMS, DMC) and ‘non-match’ (DNMS, DNMC) conditions. In ‘non-match’ trials, the activity of this unit increases as the distance between the two stimuli increases. By contrast, for ‘matching’ tasks, this neuron is most active when the relative distance between the two stimuli is small. Hence, in both cases this neuron modulates its activity to represent when the model should respond, changing selectivity to reflect opposing task demands between ‘match’ and ‘non-match’ trials. This is true even for DMS, which has been held out of training.

Figure 4c shows traces of unit 3 activity in modality-specific versions of DM and AntiDM tasks (AntiDMMod1 is held out of training) for different levels of contrast (contrast = str_stim1 − str_stim2). In all tasks, we observed ramping activity where the rate of ramping is relative to the strength of contrast. This motif of activity has been reported in previous studies 34 , 35 . However, in our models, we observe that an evidence-accumulating neuron can swap the sign of its integration in response to a change in linguistic instructions, which allows models to meet opposing demands of ‘Pro’ and ‘Anti’ versions of the task, even for previously unseen tasks.
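The sign swap described above can be pictured with a toy integrator whose ramp rate scales with contrast and whose direction is set by a task-dependent sign. This is a hand-written caricature, not the dynamics of the trained sensorimotor-RNN; the function name, step count and step size are assumptions.

```python
def accumulate(str_stim1, str_stim2, sign, steps=10, dt=0.1):
    """Integrate contrast over time; `sign` is +1 ('Pro') or -1 ('Anti')."""
    contrast = str_stim1 - str_stim2
    activity, trace = 0.0, []
    for _ in range(steps):
        activity += sign * contrast * dt  # ramp rate scales with contrast
        trace.append(activity)
    return trace


pro = accumulate(1.2, 0.8, sign=+1)    # ramps upward for positive contrast
anti = accumulate(1.2, 0.8, sign=-1)   # same input, opposite ramp direction
print(round(pro[-1], 3), round(anti[-1], 3))
```

In the models studied here, the quantity playing the role of `sign` is not supplied by hand but is induced by the instruction embedding.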

Interestingly, we also found that unsuccessful models failed to properly modulate tuning preferences. For example, with GPTNET (XL), which failed to factorize along a ‘Pro’ versus ‘Anti’ axis (Fig. 3c) and had poor generalization on AntiDMMod1, we also find neurons that failed to swap their sign of integration in the held-out setting (Supplementary Fig. 10).

Finally, we see a similar pattern in the time course of activity for trials in the ‘Comparison’ family of tasks (Fig. 4d). In the COMP1 task, the network must respond in the direction of the first stimulus if it has higher intensity than the second stimulus, and must not respond otherwise. In COMP2, it must only respond to the second stimulus if the second stimulus is higher intensity. For ‘Anti’ versions, the demands of stimulus ordering are the same except the model has to choose the stimuli with the weakest contrast. Even with this added complexity, we found individual neurons that modulate their tuning with respect to task demands, even for held-out tasks (in this case COMP2). For example, unit 82 is active when the network should repress response. For COMP1, this unit is highly active with negative contrast (that is, str_stim2 > str_stim1), but flips this sensitivity in COMP2 and is highly active with positive contrast (that is, str_stim1 > str_stim2). Importantly, this relation is reversed when the goal is to select the weakest stimuli. Hence, despite these subtle syntactic differences in instruction sets, the language embedding can reverse the tuning of this unit in a task-appropriate manner.
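The response rules for the ‘Comparison’ family, as stated in the paragraph above, can be written out as a small function. The task labels `AntiCOMP1` and `AntiCOMP2` and the function name are naming assumptions for this sketch, and ties are treated as no response.

```python
def should_respond_to(task, str_stim1, str_stim2):
    """Return 1 or 2 for the stimulus to respond to, or None to withhold."""
    anti = task.startswith("Anti")
    base = task[4:] if anti else task  # "COMP1" or "COMP2"
    if base == "COMP1":
        target, other, index = str_stim1, str_stim2, 1
    else:
        target, other, index = str_stim2, str_stim1, 2
    # 'Pro' tasks respond when the target is stronger, 'Anti' when it is weaker
    wins = target < other if anti else target > other
    return index if wins else None


# Negative contrast (str_stim2 > str_stim1): COMP1 withholds, COMP2 responds
print(should_respond_to("COMP1", 0.5, 1.0))      # None
print(should_respond_to("COMP2", 0.5, 1.0))      # 2
print(should_respond_to("AntiCOMP1", 0.5, 1.0))  # 1
```

A unit that is active whenever the network should repress its response, as described for unit 82, would correspond to the cases where this function returns `None`.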

Linguistic communication between networks

We now seek to model the complementary human ability to describe a particular sensorimotor skill with words once it has been acquired. To do this, we inverted the language-to-sensorimotor mapping our models learn during training so that they can provide a linguistic description of a task based only on the state of sensorimotor units. First, we constructed an output channel (production-RNN; Fig. 5a–c), which is trained to map sensorimotor-RNN states to input instructions. We then presented the network with a series of example trials while withholding instructions for a specific task. During this phase, all model weights are frozen, and models receive motor feedback that is used to update the embedding layer activity so as to reduce the error of the output (Fig. 5b). Once the activity in the embedding layer drove sensorimotor units to achieve a performance criterion, we used the production-RNN to decode a linguistic description of the current task. Finally, to evaluate the quality of these instructions, we input them into a partner model and measured performance across tasks (Fig. 5c). All instructing and partner models used in this section are instances of SBERTNET (L) (Methods).
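The idea of inferring an embedding from motor feedback with all weights frozen can be sketched with a toy linear network: only the embedding vector is updated by gradient descent on the output error. The linear map `W`, the target output, the learning rate and the function name are hypothetical stand-ins for the frozen sensorimotor-RNN and its feedback signal.

```python
def infer_embedding(W, target, steps=500, lr=0.1):
    """Gradient descent on the embedding only; the weights W stay frozen."""
    n_out, n_emb = len(W), len(W[0])
    emb = [0.0] * n_emb
    for _ in range(steps):
        out = [sum(W[i][j] * emb[j] for j in range(n_emb)) for i in range(n_out)]
        err = [o - t for o, t in zip(out, target)]
        # Gradient of 0.5 * ||W emb - target||^2 with respect to emb is W^T err
        grad = [sum(W[i][j] * err[i] for i in range(n_out)) for j in range(n_emb)]
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


W = [[1.0, 0.5], [0.0, 1.0]]  # frozen toy 'network' (hypothetical values)
target = [1.0, -1.0]          # desired motor output for the uninstructed task
emb = infer_embedding(W, target)
print([round(e, 3) for e in emb])  # [1.5, -1.0]
```

In the full procedure described above, the inferred embedding then drives the sensorimotor units whose states are decoded into words by the production-RNN.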

figure 5

a, Illustration of self-supervised training procedure for the language production network (blue). The red dashed line indicates gradient flow. b, Illustration of motor feedback used to drive task performance in the absence of linguistic instructions. c, Illustration of the partner model evaluation procedure used to evaluate the quality of instructions generated from the instructing model. d, Three example instructions produced from sensorimotor activity evoked by embeddings inferred in b for an AntiDMMod1 task. e, Confusion matrix of instructions produced again using the method described in b. The y axis indicates the input–output task used to infer an embedding, and the x axis indicates whether the instruction produced from the resulting sensorimotor activity was included in one of the instruction sets used during self-supervised training or else was a ‘novel’ formulation. f, Performance of partner models in different training regimes given produced instructions or direct input of embedding vectors. Each point represents the average performance of a partner model across tasks using instructions from decoders trained with different random initializations. Dots indicate the partner model was trained on all tasks, whereas diamonds indicate performance on held-out tasks. Axes indicate the training regime of the instructing model. Full statistical comparisons of performance can be found in Supplementary Fig. 12.

Figure 5d shows some example decoded instructions for the AntiDMMod1 task (see Supplementary Notes 4 for all decoded instructions). To visualize decoded instructions across the task set, we plotted a confusion matrix where both sensorimotor-RNN and production-RNN are trained on all tasks (Fig. 5e). Note that many decoded instructions were entirely ‘novel’, that is, they were not included in the training set for the production-RNN (Methods). Novel instructions made up 53% of decoded instructions across all tasks.

To test the quality of these novel instructions, we evaluated a partner model’s performance on instructions generated by the first network (Fig. 5c ; results are shown in Fig. 5f ). When the partner model is trained on all tasks, performance on all decoded instructions was 93% on average across tasks. Communicating instructions to partner models with tasks held out of training also resulted in good performance (78%). Importantly, performance was maintained even for ‘novel’ instructions, where average performance was 88% for partner models trained on all tasks and 75% for partner models with hold-out tasks. Given that the instructing and partner models share the same architecture, one might expect that it is more efficient to forgo the language component of communication and simply copy the embedding inferred by one model into the input of the partner model. This resulted in only 31% correct performance on average and 28% performance when testing partner models on held-out tasks. Although both instructing and partner networks share the same architecture and the same competencies, they nonetheless have different synaptic weights. Hence, using a neural representation tuned for the set of weights within the one agent won’t necessarily produce good performance in the other.

We also tested an instructing model using a sensorimotor-RNN with tasks held out of training. We emphasize here that during training the production-RNN attempts to decode from sensorimotor hidden states induced by instructions for tasks the network has never experienced before (Fig. 5a ), whereas during test time, instructions are produced from sensorimotor states that emerge entirely as a result of minimizing a motor error (Fig. 5b,c ). We nonetheless find that, in this setting, a partner model trained on all tasks performs at 82% correct, while partner models with tasks held out of training perform at 73%. Here, 77% of produced instructions are novel, so we see a very small decrease of 1% when we test the same partner models only on novel instructions. Like above, context representations induce a relatively low performance of 30% and 37% correct for partners trained on all tasks and with tasks held out, respectively.

Lastly, we tested our most extreme setting where tasks have been held out for both sensorimotor-RNNs and production-RNNs (Fig. 5f ). We find that produced instructions induce a performance of 71% and 63% for partner models trained on all tasks and with tasks held out, respectively. Although this is a decrease in performance from our previous set-ups, the fact that models can produce sensible instructions at all in this double held-out setting is striking. The fact that the system succeeds to any extent speaks to strong inductive biases introduced by training in the context of rich, compositionally structured semantic representations.

In this study, we use the latest advances in natural language processing to build tractable models of the ability to interpret instructions to guide actions in novel settings and the ability to produce a description of a task once it has been learned. RNNs can learn to perform a set of psychophysical tasks simultaneously using a pretrained language transformer to embed a natural language instruction for the current task. Our best-performing models can leverage these embeddings to perform a brand-new task with an average performance of 83% correct. Instructed models that generalize performance do so by leveraging the shared compositional structure of instruction embeddings and task representations, such that an inference about the relations between practiced and novel instructions leads to a good inference about what sensorimotor transformation is required for the unseen task. Finally, we show a network can invert this information and provide a linguistic description for a task based only on the sensorimotor contingency it observes.

Our models make several predictions for what neural representations to expect in brain areas that integrate linguistic information in order to exert control over sensorimotor areas. First, the CCGP analysis of our model hierarchy suggests that when humans must generalize across (or switch between) a set of related tasks based on instructions, the neural geometry observed among sensorimotor mappings should also be present in semantic representations of instructions. This prediction is well grounded in the existing experimental literature, where multiple studies have observed that the type of abstract structure we find in our sensorimotor-RNNs also exists in sensorimotor areas of biological brains 3 , 36 , 37 . Our models theorize that the emergence of an equivalent task-related structure in language areas is essential to instructed action in humans. One intriguing candidate for an area that may support such representations is the language-selective subregion of the left inferior frontal gyrus. This area is sensitive to both lexico-semantic and syntactic aspects of sentence comprehension, is implicated in tasks that require semantic control and lies anatomically adjacent to another functional subregion of the left inferior frontal gyrus, which is implicated in flexible cognition 38 , 39 , 40 , 41 . We also predict that individual units involved in implementing sensorimotor mappings should modulate their tuning properties on a trial-by-trial basis according to the semantics of the input instructions, and that failure to modulate tuning in the expected way should lead to poor generalization. This prediction may be especially useful to interpret multiunit recordings in humans. Finally, given that grounding linguistic knowledge in the sensorimotor demands of the task set improved performance across models (Fig. 2e), we predict that during learning the highest level of the language processing hierarchy should likewise be shaped by the embodied processes that accompany linguistic inputs, for example, motor planning or affordance evaluation 42 .

One notable negative result of our study is the relatively poor generalization performance of GPTNET (XL), which used at least an order of magnitude more parameters than other models. This is particularly striking given that activity in these models is predictive of many behavioral and neural signatures of human language processing 10 , 11 . Given this, future imaging studies may be guided by the representations in both autoregressive models and our best-performing models to delineate a full gradient of brain areas involved in each stage of instruction following, from low-level next-word prediction to higher-level structured-sentence representations to the sensorimotor control that language informs.

Our models may guide future work comparing compositional representations in nonlinguistic subjects like nonhuman primates. Comparison of task switching (without linguistic instructions) between humans and nonhuman primates indicates that both use abstract rule representations, although humans can make switches much more rapidly 43 . One intriguing parallel in our analyses is the use of compositional rule vectors (Supplementary Fig. 5 ). Even in the case of nonlinguistic SIMPLENET, using these vectors boosted generalization. Importantly, however, this compositionality is much stronger for our best-performing instructed models. This suggests that language endows agents with a more flexible organization of task subcomponents, which can be recombined in a broader variety of contexts.

Our results also highlight the advantages of linguistic communication. Networks can compress the information they have gained through experience of motor feedback and transfer that knowledge to a partner network via natural language. Although rudimentary in our example, the ability to endogenously produce a description of how to accomplish a task after a period of practice is a hallmark human language skill. The failure to transfer performance by sharing latent representations demonstrates that to communicate information in a group of independent networks of neurons, it needs to pass through a representational medium that is equally interpretable by all members of the group. In humans and for our best-performing instructed models, this medium is language.

A series of works in reinforcement learning has investigated using language and language-like schemes to aid agent performance. Agents receive language information through step-by-step descriptions of action sequences 44 , 45 , or by learning policies conditioned on a language goal 46 , 47 . These studies often deviate from natural language and receive linguistic inputs that are parsed or simply refer directly to environmental objects. Some larger versions of the pretrained language models we use to embed instructions also display instruction-following behavior, that is, GPT-3 (ref. 7 ), PALM 12 , LaMDA 13 and InstructGPT 48 in the modality of language and DALL-E 8 and Stable Diffusion 14 in a language-to-image modality. The semantic and syntactic understanding displayed in these models is impressive. However, the outputs of these models are difficult to interpret in terms of guiding the dynamics of a downstream action plan. Finally, recent work has sought to engineer instruction-following agents that can function in complex or even real-world environments 16 , 17 , 18 . While these models exhibit impressive behavioral repertoires, they rely on perceptual systems that fuse linguistic and visual information, making them difficult to compare to language representations in human brains, which emerge from a set of areas specialized for processing language. In all, none of these models offer a testable representational account of how language might be used to induce generalization over sensorimotor mappings in the brain.

Our models by contrast make tractable predictions for what population and single-unit neural representations are required to support compositional generalization and can guide future experimental work examining the interplay of linguistic and sensorimotor skills in humans. By developing interpretable models that can both understand instructions as guiding a particular sensorimotor response, and communicate the results of sensorimotor learning as an intelligible linguistic instruction, we have begun to explain the power of language in encoding and transferring knowledge in networks of neurons.

Model architecture


The base model architecture and task structure used in this paper follow ref. 18 . All networks of sensorimotor units, denoted sensorimotor-RNN, are gated recurrent units (GRU) 49 using rectified linear unit (ReLU) nonlinearities with 256 hidden units each. Inputs to the networks consist of (1) sensory inputs, \({X}_{t}\), and (2) task-identifying information, \({I}_{t}\). We initialize hidden activity in the GRU as \({h}^{0}\in {{\mathbb{R}}}^{256}\) with values set to 0.1. All networks of sensorimotor units use the same hidden state initialization, so we omit \({h}^{0}\) in network equations. At each time step, a readout layer \({{\rm{Linear}}}_{{\rm{out}}}\) decodes motor activity, \(\hat{{y}_{t}}\) , from the activity of recurrent hidden units, \({h}_{t}\), according to:

\[\hat{{y}_{t}}=\sigma \left({{\rm{Linear}}}_{{\rm{out}}}({h}_{t})\right)\]

where σ denotes the sigmoid function. Sensory inputs X t are made up of three channels, two sensory modalities \({x}_{{{\mathrm{mod}}}\,1,t}\) and \({x}_{{{\mathrm{mod}}}\,2,t}\) , and a fixation channel x fix, t . Both \({x}_{{{\mathrm{mod}}}\,1,t},{x}_{{{\mathrm{mod}}}\,2,t}\in {{\mathbb{R}}}^{32}\) and stimuli in these modalities are represented as hills of activity with peaks determined by units’ preferred directions around a one-dimensional circular variable. For an input at direction θ , the activity of a given input unit u i with preferred direction θ i is

where \(str\) is the coefficient describing stimulus strength. The fixation channel \({x}_{{{{\rm{fix}}}},t}\in {{\mathbb{R}}}^{1}\) is a single unit simulating a fixation cue for the network. In all, sensory input \({X}_{t}=({x}_{mod1,t},{x}_{mod2,t},{x}_{fix,t})\in {{\mathbb{R}}}^{65}\) . Motor output, \({\hat{{y}}_{t}}\) , consists of both a 32-dimensional ring representing directional responses to the input stimulus and a single unit representing model fixation, so that \({\hat{{y}}_{t}}\in {{\mathbb{R}}}^{33}\) .

For all models, task-identifying information \({I}_{t}\in {{\mathbb{R}}}^{64}\) . Task-identifying information is presented throughout the duration of a trial and remains constant such that \({I}_{t}={I}_{t{\prime} }\forall t,t{\prime}\) . For all models, task-identifying info I t and sensory input X t are concatenated as inputs to the sensorimotor-RNN.
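To make the dimensions above concrete, the following is a minimal NumPy sketch of a forward pass through a ReLU-based GRU with a sigmoid readout. It is an illustration under stated assumptions rather than the implementation used in this paper: the gate formulation follows the common GRU convention with only the candidate nonlinearity changed to ReLU, biases are omitted, and the weight initialization and the names `ReluGRU` and `run_trial` are hypothetical.

```python
import numpy as np

SENSORY_DIM = 65   # 32 (modality 1) + 32 (modality 2) + 1 (fixation)
RULE_DIM = 64      # task-identifying information I_t
HIDDEN_DIM = 256
OUTPUT_DIM = 33    # 32-unit response ring + 1 fixation unit


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ReluGRU:
    """Hypothetical GRU cell whose candidate state uses ReLU instead of tanh."""

    def __init__(self, rng):
        in_dim = SENSORY_DIM + RULE_DIM
        scale = 1.0 / np.sqrt(HIDDEN_DIM)

        def init(rows, cols):
            # Assumed uniform initialization; the source does not specify one.
            return rng.uniform(-scale, scale, size=(rows, cols))

        # Weights for the update (z), reset (r) and candidate (n) paths.
        self.W = {k: init(HIDDEN_DIM, in_dim) for k in "zrn"}
        self.U = {k: init(HIDDEN_DIM, HIDDEN_DIM) for k in "zrn"}
        self.W_out = init(OUTPUT_DIM, HIDDEN_DIM)

    def step(self, x_t, i_t, h_prev):
        inp = np.concatenate([x_t, i_t])  # X_t and I_t are concatenated
        z = sigmoid(self.W["z"] @ inp + self.U["z"] @ h_prev)
        r = sigmoid(self.W["r"] @ inp + self.U["r"] @ h_prev)
        n = np.maximum(0.0, self.W["n"] @ inp + self.U["n"] @ (r * h_prev))
        h_t = (1.0 - z) * n + z * h_prev
        y_hat = sigmoid(self.W_out @ h_t)  # sigmoid readout of motor activity
        return h_t, y_hat


def run_trial(cell, X, I):
    """Run one trial; X has shape (T, 65) and I (shape 64) is constant over time."""
    h = np.full(HIDDEN_DIM, 0.1)  # hidden state initialized to 0.1
    outputs = []
    for x_t in X:
        h, y_hat = cell.step(x_t, I, h)
        outputs.append(y_hat)
    return np.stack(outputs)
```

Calling `run_trial` with an input of shape (150, 65) returns motor activity of shape (150, 33), one 33-dimensional output per time step.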

Nonlinguistic models

For SIMPLENET, we generate a set of 64-dimensional orthogonal task rules by constructing an orthogonal matrix using the Python package scipy.stats.ortho_group, and assign rows of this matrix to each task type. For STRUCTURENET, we generate a set of ten orthogonal, 64-dimensional vectors in the same manner, and each of these represents a dimension of the task set (that is, respond weakest versus strongest direction, respond in the same versus opposite direction, pay attention only to stimuli in the first modality, and so on). Rule vectors for tasks are then simple combinations of each of these ten basis vectors. For a full description of structure rule vectors, see Supplementary Note 3 .
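The construction of rule vectors can be sketched as follows. The text above names scipy.stats.ortho_group; to keep the example dependent on NumPy only, this sketch swaps in a QR decomposition of a Gaussian matrix, which also yields an orthogonal matrix. The plain-sum combination rule and the chosen basis indices are illustrative assumptions (the actual combinations are given in Supplementary Note 3), and the function names are hypothetical.

```python
import numpy as np

N_TASKS = 50    # size of the task set (20 + 30 tasks)
RULE_DIM = 64
N_BASIS = 10    # STRUCTURENET task-set dimensions


def random_orthonormal_rows(n_rows, dim, rng):
    """Return n_rows mutually orthogonal unit vectors of length dim."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q[:n_rows]


def structure_rule(basis, active_dims):
    """Combine selected basis vectors into one rule vector (assumed: plain sum)."""
    return basis[list(active_dims)].sum(axis=0)


rng = np.random.default_rng(0)
simplenet_rules = random_orthonormal_rows(N_TASKS, RULE_DIM, rng)
basis = random_orthonormal_rows(N_BASIS, RULE_DIM, rng)
example_rule = structure_rule(basis, active_dims=(0, 3, 7))  # hypothetical task
```

Under this scheme, SIMPLENET rules share no structure across tasks, whereas two STRUCTURENET rules overlap exactly in the basis dimensions the tasks have in common.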

We also test SIMPLENETPLUS and STRUCTURENETPLUS, which use an additional hidden layer with 128 units and ReLU nonlinearities to process orthogonal task rules \({I}_{t}\) into a vector \(\bar{{I}_{t}}\) , which is used by the sensorimotor-RNN as task-identifying information.

Full results for these models are included in Supplementary Fig. 4 .

Pretrained transformers

The main language models we test use pretrained transformer architectures to produce \(I\). Importantly, transformers differ in the type of pretraining objective used to tune the model parameters. GPT is trained to predict the next word given a context of words 9 . GPT (XL) follows the same objective but trains for longer on a larger dataset 50 . Both models are fully autoregressive. BERT, by contrast, takes bidirectional language inputs and is tasked with predicting masked words that appear in the middle of input phrases. Additionally, BERT is trained on a simple sentence prediction task where the model must determine if input sentence 1 is followed by input sentence 2 in the training corpus. Extending this principle, SBERT is explicitly trained to produce fixed-length embeddings of whole sentences 21 . It takes pretrained BERT networks and uses them in a siamese architecture 51 , which allows the weights of the model to be tuned in a supervised fashion according to the Stanford Natural Language Inference dataset 22 . Natural language inference is a three-way categorization task where the network must infer the logical relationship between sentences: whether a premise sentence implies, contradicts or is unrelated to a hypothesis sentence. Finally, CLIP is trained to jointly embed images and language 23 . It uses data from captioned images and is asked to properly categorize which text and image pairs match or are mismatched in the dataset via a contrastive loss.

Importantly, the natural output of a transformer is a matrix of size \({\dim }_{{{{\rm{trans}}}}.}\times {{{\mathcal{T}}}}\) , the inherent dimensionality of the transformer by the length of the input sequence. To create an embedding space for sentences it is standard practice to apply a pooling method to the transformer output, which produces a fixed-length representation for each instruction.

For GPT, GPT (XL), BERT and SBERT, we use an average pooling method. Suppose we have an input instruction \({w}_{1}\ldots {w}_{{{{\mathcal{T}}}}}\) . Following standard practice with pretrained language models, the input to our transformers is tokenized with special ‘cls’ and ‘eos’ tokens at the beginning and end of the input sequence. We then compute I as follows:

We chose this average pooling method primarily because a previous study 21 found that this resulted in the highest-performing SBERT embeddings. Another alternative would be to simply use the final hidden representation of the ‘cls’ token as a summary of the information in the entire sequence (given that BERT architectures are bidirectional, this token will have access to the whole sequence).

where \({h}_{{{{\rm{cls}}}}}^{\rm{tran.}}\) denotes the last hidden representation for the ‘cls’ token. Ref. 21 found that this pooling method performed worse than average pooling, so we do not include this alternative in our results. For GPT and GPT (XL), we also tested a pooling method where the fixed-length representation for a sequence was taken from the transformer output of the ‘eos’ token. In this case:

We found that GPT failed to achieve even a relaxed performance criterion of 85% across tasks using this pooling method, and GPT (XL) performed worse than with average pooling, so we omitted these models from the main results (Supplementary Fig. 11 ). For CLIP models we use the same pooling method as in the original multimodal training procedure, which takes the outputs of the ‘cls’ token as described above.
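The three pooling strategies discussed above can be compared in a short NumPy sketch. It assumes the transformer output is stored token-major, that is, with shape (number of tokens, \(\dim_{\rm{trans.}}\)), which is the transpose of the convention used in the text, and that the ‘cls’ and ‘eos’ tokens occupy the first and last rows. Whether the special tokens are included in the average is also an assumption here, and all function names are hypothetical.

```python
import numpy as np


def average_pool(hidden):
    """Mean over the token axis; hidden has shape (n_tokens, dim)."""
    return hidden.mean(axis=0)


def cls_pool(hidden):
    """Representation of the first ('cls') token."""
    return hidden[0]


def eos_pool(hidden):
    """Representation of the last ('eos') token."""
    return hidden[-1]


def linear_embed(pooled, weight, bias):
    """Project a pooled vector into the 64-dimensional instruction space."""
    return weight @ pooled + bias


# Toy example: 4 tokens, 3-dimensional transformer output.
hidden = np.arange(12, dtype=float).reshape(4, 3)
pooled = average_pool(hidden)  # array([4.5, 5.5, 6.5])
instruction = linear_embed(pooled, np.ones((64, 3)), np.zeros(64))
```

Each strategy maps a variable-length sequence to a vector of fixed length \(\dim_{\rm{trans.}}\), so the same linear projection can be applied regardless of instruction length.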

For all the above models, we also tested a version where the information from the pretrained transformers is passed through a multilayer perceptron with a single hidden layer of 256 hidden units and ReLU nonlinearities. We found that this manipulation reduced performance across all models, verifying that a simple linear embedding is beneficial to generalization performance.

For GPT, BERT and SBERT, \({\dim }_{{{{\rm{trans}}}}.}=768\) and each model uses a total of ~100 million parameters; for SBERT (L), \({\dim }_{{{{\rm{trans}}}}.}=1,024\) and the model uses ~300 million parameters; for GPT (XL), \({\dim }_{{{{\rm{trans}}}}.}=1,600\) and the model uses ~1.5 billion parameters; for CLIP, \({\dim }_{{{{\rm{trans}}}}.}=512\) and the model uses ~60 million parameters. Full PyTorch implementations, including all pretrained weights and model hyperparameters, can be accessed through the Hugging Face library 52 .

For our BoW model, instructions are represented as a vector of binary activations the size of the instruction vocabulary, where each unit indicates the inclusion or exclusion of the associated word in the current instruction. For our instruction set, ∣ vocab ∣  = 181. This vector is then projected through a linear layer into 64-dimensional space.
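A bag-of-words encoding of this kind can be sketched as below. The two example instructions, the whitespace tokenization and the random projection weights are hypothetical stand-ins; the actual instruction set has a vocabulary of 181 words and a learned linear layer.

```python
import numpy as np


def build_vocab(instructions):
    """Map every distinct word in the instruction set to an index."""
    words = sorted({w for s in instructions for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bow_vector(instruction, vocab):
    """Binary vector marking which vocabulary words occur in the instruction."""
    vec = np.zeros(len(vocab))
    for word in instruction.lower().split():
        if word in vocab:
            vec[vocab[word]] = 1.0  # presence only, repeated words count once
    return vec


# Hypothetical instructions, for illustration only.
instructions = [
    "respond in the same direction",
    "respond in the opposite direction",
]
vocab = build_vocab(instructions)
rng = np.random.default_rng(0)
projection = rng.standard_normal((64, len(vocab)))  # stand-in for the linear layer
embedded = projection @ bow_vector(instructions[0], vocab)
```

Because the encoding records only word presence, two instructions containing the same words in a different order map to the same vector.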

Blank slate language models

Given that tuning the last layers of language models resulted in improved performance (Fig. 2e ), we tested two additional models to determine whether a blank slate language model trained exclusively on the loss from sensorimotor tasks would improve performance. These models consist of passing BoW representations through a multilayer perceptron and passing pretrained BERT word embeddings through one layer of a randomly initialized BERT encoder. Both models performed poorly compared to pretrained models (Supplementary Fig. 4.5 ), confirming that language pretraining is essential to generalization.

Tasks were divided into five interrelated subgroups: ‘go’, ‘decision-making’, ‘matching’, ‘comparison’ and ‘duration’. Depending on the task, multiple stimuli may appear during the stimulus epoch. Also, depending on the task, models may be required to respond in a particular direction or repress response altogether. Unless otherwise specified, zero-mean Gaussian noise is added independently at each time step and to each input unit and the variance of this noise is drawn randomly from \({\mathbb{U}}[0.1,0.15]\) . The timing of stimuli differs among task types. However, for all tasks, trials can be divided into preparatory, stimulus and response epochs. The stimulus epoch can be subdivided into three parts (stim1, delay and stim2), although these distinct parts aren’t used by all tasks. A trial lasts for a total of \(T=150\) time steps. Let \(du{r}_{{{{\rm{epoch}}}}}\) denote the duration in simulated time steps of a given epoch. Then

For tasks that don’t utilize a delay structure, stim1, stim2 and delay epochs are grouped together in a single stimulus epoch where \(du{r}_{{{{\rm{stimulus}}}}}=du{r}_{{{{\rm{stim}}}}1}+du{r}_{{{{\rm{stim}}}}2}+du{r}_{{{{\rm{delay}}}}}\) . Unless otherwise specified, a fixation cue with a constant strength \(st{r}_{{{{\rm{fix}}}}}=1\) is activated throughout the preparatory and stimulus epochs. For example trials of each task, see Supplementary Fig. 13 .

The ‘Go’ family of tasks includes ‘Go’, ‘RTGo’, ‘AntiGo’, ‘AntiRTGo’ and modality-specific versions of each task denoted with either ‘Mod1’ or ‘Mod2’. In both the ‘Go’ and ‘AntiGo’ tasks, a single stimulus is presented at the beginning of the stimulus epoch. The direction of the presented stimulus is generated by drawing from a uniform distribution between 0 and 2 π , that is, \({\theta }_{{{{\rm{stim}}}}} \sim {\mathbb{U}}[0,2\pi ]\) . The stimulus will appear in either modality 1 or modality 2 with equal probability. The strength of the stimulus is given by \(st{r}_{{{{\rm{stim}}}}} \sim {\mathbb{U}}[1.0,1.2]\) . In the ‘Go’ task, the target response is in the same direction as the presented stimulus, that is, \({\theta }_{{{{\rm{stim}}}}}={\theta }_{{{{\rm{target}}}}}\) , while in the ‘AntiGo’ task the response should be in the direction opposite to the stimulus, \({\theta }_{{{{\rm{stim}}}}}+\pi ={\theta }_{{{{\rm{target}}}}}\) . For modality-specific versions of each task, a stimulus direction is drawn in each modality \({\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}2} \sim {\mathbb{U}}[0,2\pi ]\) and for modality-specific Go-type tasks

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}1}&\text{for Mod1 tasks}\\ {\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}2}&\text{for Mod2 tasks}\end{cases}\]

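while for modality-specific AntiGo-type tasks

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}1}+\pi &\text{for Mod1 tasks}\\ {\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}2}+\pi &\text{for Mod2 tasks}\end{cases}\]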

For ‘RT’ versions of the ‘Go’ tasks, stimuli are only presented during the response epoch and the fixation cue is never extinguished. Thus, the presence of the stimulus itself serves as the response cue and the model must respond as quickly as possible. Otherwise, stimuli persist through the duration of the stimulus epoch.
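As an illustration of the sampling rules for the basic ‘Go’ and ‘AntiGo’ tasks, the sketch below draws the parameters of a single trial. Wrapping the ‘Anti’ target back into \([0,2\pi )\) is an assumption made for convenience, and the function name and returned dictionary layout are hypothetical.

```python
import math
import random


def draw_go_trial(task, rng):
    """Draw stimulus parameters and the target for 'Go', 'RTGo', 'AntiGo' or 'AntiRTGo'."""
    theta_stim = rng.uniform(0.0, 2.0 * math.pi)  # direction ~ U[0, 2*pi]
    strength = rng.uniform(1.0, 1.2)              # strength ~ U[1.0, 1.2]
    modality = rng.choice([1, 2])                 # either modality, equal probability
    if task.startswith("Anti"):
        # Respond opposite to the stimulus; wrapped to [0, 2*pi) by assumption.
        theta_target = (theta_stim + math.pi) % (2.0 * math.pi)
    else:
        theta_target = theta_stim
    return {
        "theta_stim": theta_stim,
        "strength": strength,
        "modality": modality,
        "theta_target": theta_target,
    }


trial = draw_go_trial("AntiGo", random.Random(0))
```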

‘Decision-making’ tasks

The ‘decision-making’ family of tasks includes ‘DM’ (decision-making), ‘AntiDM’, ‘MultiDM’ (multisensory decision-making), ‘AntiMultiDM,’ modality-specific versions of each of these tasks and, finally, confidence-based versions of ‘DM’ and ‘AntiDM.’ For all tasks in this group, two stimuli are presented simultaneously and persist throughout the duration of the stimulus epoch. They are drawn according to \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}\) \([({\theta }_{{{{\rm{stim}}}}1}-0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}-0.6\pi )\cup ({\theta }_{{{{\rm{stim}}}}1}+0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}+0.6\pi )]\) . A base strength applied to both stimuli is drawn such that \(st{r}_{\rm{base}} \sim {\mathbb{U}}[1.0,1.2]\) . A contrast is drawn from a discrete distribution such that c  ~ {−0.175, −0.15, −0.1, 0.1, 0.15, 0.175} so the stimulus strength associated with each direction in a trial are given by \(st{r}_{{{{\rm{stim}}}}1}=st{r}_{\rm{base}}+c\) and \(st{r}_{{{{\rm{stim}}}}2}=\) \({str}_{\rm{base}}-c\) .
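The stimulus-generation procedure for the basic decision-making tasks can be sketched as follows. The sketch interprets the direction constraint as an angular offset of between \(0.2\pi\) and \(0.6\pi\) on either side of the first stimulus and takes the response rule to be "stronger stimulus for ‘DM’, weaker stimulus for ‘AntiDM’"; the function names are hypothetical.

```python
import math
import random

CONTRASTS = [-0.175, -0.15, -0.1, 0.1, 0.15, 0.175]


def draw_dm_trial(rng):
    """Return (theta1, theta2, str1, str2) for one decision-making trial."""
    theta1 = rng.uniform(0.0, 2.0 * math.pi)
    # Second direction lies 0.2*pi to 0.6*pi away, on either side of the first.
    offset = rng.uniform(0.2 * math.pi, 0.6 * math.pi) * rng.choice([-1, 1])
    theta2 = (theta1 + offset) % (2.0 * math.pi)
    base = rng.uniform(1.0, 1.2)
    c = rng.choice(CONTRASTS)
    return theta1, theta2, base + c, base - c


def dm_target(theta1, theta2, str1, str2, anti=False):
    """Direction of the stronger stimulus ('DM') or the weaker one ('AntiDM')."""
    pick_first = (str1 > str2) != anti
    return theta1 if pick_first else theta2


theta1, theta2, str1, str2 = draw_dm_trial(random.Random(0))
target = dm_target(theta1, theta2, str1, str2)
```

Because the contrast is never zero, the two strengths always differ and the target is well defined on every trial.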

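For the ‘DM’ task,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1} > st{r}_{{{{\rm{stim}}}}2}\\ {\theta }_{{{{\rm{stim}}}}2}&\text{otherwise}\end{cases}\]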

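and for the ‘AntiDM’ task,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1} < st{r}_{{{{\rm{stim}}}}2}\\ {\theta }_{{{{\rm{stim}}}}2}&\text{otherwise}\end{cases}\]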

For these versions of the tasks, the stimuli are presented in either modality 1 or modality 2 with equal probability. For the multisensory versions of each task, stimuli directions are drawn in the same manner and presented across both modalities so that \({\theta }_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}1}={\theta }_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}2}\) and \({\theta }_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}1}={\theta }_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}2}\) . Base strengths are drawn independently for each modality. Contrasts for both modalities are drawn from a discrete distribution such that \({c}_{{{\mathrm{mod}}}\,1},{c}_{{{\mathrm{mod}}}\,2} \sim \{0.2,0.175,0.15,0.125,-0.125,-0.15,-0.175,-0.2\}\) . If \(| {c}_{{{\mathrm{mod}}}\,1}| -| {c}_{{{\mathrm{mod}}}\,2}| =0\) , then contrasts are redrawn to avoid zero-contrast trials during training. If \({c}_{{{\mathrm{mod}}}\,1}\) and \({c}_{{{\mathrm{mod}}}\,2}\) have the same sign, then contrasts are redrawn to ensure that the trial requires integrating over both modalities as opposed to simply performing a ‘DM’ task in a single modality. Criteria for target responses are measured as the strength of a given direction summed over both modalities. So, for ‘MultiDM’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}1}+st{r}_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}2} > st{r}_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}1}+st{r}_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}2}\\ {\theta }_{{{{\rm{stim}}}}2}&\text{otherwise}\end{cases}\]

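and for ‘AntiMultiDM’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}1}+st{r}_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}2} < st{r}_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}1}+st{r}_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}2}\\ {\theta }_{{{{\rm{stim}}}}2}&\text{otherwise}\end{cases}\]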

Stimuli for modality-specific versions of each task are generated in the same way as multisensory versions of the task. Criteria for target response are the same as standard versions of ‘DM’ and ‘AntiDM’ tasks applied only to stimuli in the relevant modality.
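The redraw rules for multisensory contrasts amount to rejection sampling, which the following sketch illustrates. It assumes that, within each modality, the contrast is added to the base strength for the first stimulus and subtracted for the second, mirroring the single-modality case; the function names are hypothetical.

```python
import random

MULTI_CONTRASTS = [0.2, 0.175, 0.15, 0.125, -0.125, -0.15, -0.175, -0.2]


def draw_multi_contrasts(rng):
    """Redraw until the contrasts differ in magnitude and have opposite signs."""
    while True:
        c1 = rng.choice(MULTI_CONTRASTS)
        c2 = rng.choice(MULTI_CONTRASTS)
        same_magnitude = abs(c1) == abs(c2)   # would give zero net contrast
        same_sign = (c1 > 0) == (c2 > 0)      # solvable from one modality alone
        if not same_magnitude and not same_sign:
            return c1, c2


def multi_dm_target(theta1, theta2, bases, contrasts, anti=False):
    """Pick the direction whose strength summed over modalities is greater (or smaller)."""
    total1 = sum(b + c for b, c in zip(bases, contrasts))
    total2 = sum(b - c for b, c in zip(bases, contrasts))
    pick_first = (total1 > total2) != anti
    return theta1 if pick_first else theta2


c_mod1, c_mod2 = draw_multi_contrasts(random.Random(0))
```

With opposite-sign contrasts of unequal magnitude, each modality taken alone favors a different stimulus, so only the sum over modalities determines the correct response.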

In confidence-based decision-making tasks (‘ConDM’ and ‘ConAntiDM’), the stimuli directions are drawn in the same way as above. Stimuli are shown in either modality 1 or modality 2 with equal probability. In each trial, \(st{r}_{\rm{base}}=1\). The contrast and noise for each trial are based on the thresholded performance of a SIMPLENET model trained on all tasks except ‘ConDM’ and ‘ConAntiDM’. Once this model has been trained, we establish a threshold across levels of noise and contrasts for which the model can perform a ‘DM’ or an ‘AntiDM’ task at 95% correct. We then draw contrasts and noises for trials from above and below this threshold with equal probability during training. In trials where the noise and contrast levels fall below the 95% correct threshold, the model must repress response, and otherwise perform the decision-making task (either ‘DM’ or ‘AntiDM’).

‘Comparison’ tasks

Our comparison task group includes ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘Anti’ versions of each of these tasks, as well as modality-specific versions of ‘COMP1’ and ‘COMP2’ tasks. This group of tasks is designed to extend the basic decision-making framework into a setting with more complex control demands. These tasks utilize the delay structure in the stimulus epoch so that stim1 appears only during the stim1 epoch, followed by a delay, and finally stim2. This provides a temporal ordering on the stimuli. In ‘COMP1’, the model must respond to the first stimulus only if it has greater strength than the second and otherwise repress a response, that is,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1} > st{r}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]

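Likewise, in ‘COMP2’, the model must respond to the second direction if it is presented with greater strength than the first and otherwise repress response, that is,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }st{r}_{{{{\rm{stim}}}}2} > st{r}_{{{{\rm{stim}}}}1}\\ \text{repress}&\text{otherwise}\end{cases}\]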

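In ‘Anti’ versions of the task, the ordering criterion is the same but the model responds to the stimulus with the least strength, that is, for ‘AntiCOMP1’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }st{r}_{{{{\rm{stim}}}}1} < st{r}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]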

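and for ‘AntiCOMP2’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }st{r}_{{{{\rm{stim}}}}2} < st{r}_{{{{\rm{stim}}}}1}\\ \text{repress}&\text{otherwise}\end{cases}\]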

In multisensory settings, the criteria for target direction are analogous to the multisensory decision-making tasks where strength is integrated across modalities. Likewise, for modality-specific versions, the criteria are only applied to stimuli in the relevant modality. Stimuli directions and strength for each of these tasks are drawn from the same distributions as the analogous task in the ‘decision-making’ family. However, during training, we make sure to balance trials where responses are required and trials where models must repress response.
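The response rules for the four basic comparison tasks can be summarized in a single function, sketched below. Representing a repressed response as `None` and inferring the rule from the task name are illustrative choices, not part of the original implementation.

```python
def comp_target(task, theta1, theta2, str1, str2):
    """Target direction for 'COMP1', 'COMP2', 'AntiCOMP1' or 'AntiCOMP2'.

    Returns None when the model should repress its response.
    """
    anti = task.startswith("Anti")
    first_is_relevant = task.endswith("1")
    if first_is_relevant:
        relevant_str, other_str, theta = str1, str2, theta1
    else:
        relevant_str, other_str, theta = str2, str1, theta2
    # Standard tasks require the relevant stimulus to be stronger, 'Anti' tasks weaker.
    criterion_met = relevant_str < other_str if anti else relevant_str > other_str
    return theta if criterion_met else None


# Stimulus 1 at 0.3 rad (strength 1.2), stimulus 2 at 1.5 rad (strength 1.0).
print(comp_target("COMP1", 0.3, 1.5, 1.2, 1.0))      # 0.3
print(comp_target("COMP2", 0.3, 1.5, 1.2, 1.0))      # None
```

For any pair of unequal strengths, exactly one of ‘COMP1’ and ‘COMP2’ requires a response, which is why trials must be balanced explicitly during training.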

‘Duration’ tasks

The ‘duration’ family of tasks includes ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘Anti’ versions of each of these tasks and modality-specific versions of ‘Dur1’ and ‘Dur2’ tasks. These tasks require models to perform a time estimation task with the added demand of stimulus ordering determining relevance for response. As in ‘comparison’ tasks, stim1 is presented followed by a delay and then stim2. For ‘Dur1’ trials

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }du{r}_{{{{\rm{stim}}}}1} > du{r}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]

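Likewise, for ‘Dur2’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }du{r}_{{{{\rm{stim}}}}2} > du{r}_{{{{\rm{stim}}}}1}\\ \text{repress}&\text{otherwise}\end{cases}\]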

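In ‘Anti’ versions of these tasks, the correct response is in the direction of the stimulus with the shortest duration, provided the ordering criterion is met. Hence, for ‘AntiDur1’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }du{r}_{{{{\rm{stim}}}}1} < du{r}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]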

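and for ‘AntiDur2’

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }du{r}_{{{{\rm{stim}}}}2} < du{r}_{{{{\rm{stim}}}}1}\\ \text{repress}&\text{otherwise}\end{cases}\]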

Across these tasks, directions are drawn according to \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[({\theta }_{{{{\rm{stim}}}}1}-0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}-0.6\pi )\cup ({\theta }_{{{{\rm{stim}}}}1}+0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}+0.6\pi )]\) . Stimulus strengths are drawn according to \(st{r}_{{{{\rm{stim}}}}1},st{r}_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[0.8,1.2]\) . To set the duration of each stimulus, we first draw \(du{r}_{{{{\rm{long}}}}} \sim \{i| 35 < i\le 50,i\in {\mathbb{N}}\}\) and \(du{r}_{{{{\rm{short}}}}} \sim \{i| 25 < i\le (du{r}_{{{{\rm{long}}}}}-8),i\in {\mathbb{N}}\}\) . During training, we determine which trials for a given task should and should not require a response in order to evenly balance repress and respond trials. We then assign \(du{r}_{{{{\rm{long}}}}}\) and \(du{r}_{{{{\rm{short}}}}}\) to either stim1 or stim2 so that the trial requires the appropriate response given the particular task type.
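The duration sampling and assignment described above can be sketched as follows. The way the task name encodes the rule and the `respond` flag used to balance trial types are illustrative assumptions, as are the function names.

```python
import random


def draw_durations(rng):
    """Draw integer durations: 35 < long <= 50 and 25 < short <= long - 8."""
    dur_long = rng.randint(36, 50)
    dur_short = rng.randint(26, dur_long - 8)
    return dur_long, dur_short


def assign_durations(task, respond, rng):
    """Return (dur_stim1, dur_stim2) so that the trial does or does not require a response.

    task is one of 'Dur1', 'Dur2', 'AntiDur1', 'AntiDur2'.
    """
    dur_long, dur_short = draw_durations(rng)
    anti = task.startswith("Anti")
    first_is_relevant = task.endswith("1")
    # Standard tasks respond when the relevant stimulus is longer, 'Anti' when shorter.
    relevant_gets_long = respond != anti
    relevant, other = (dur_long, dur_short) if relevant_gets_long else (dur_short, dur_long)
    return (relevant, other) if first_is_relevant else (other, relevant)


dur1, dur2 = assign_durations("Dur1", respond=True, rng=random.Random(0))
```

Since the short duration is at most the long duration minus 8 time steps, the two stimuli always differ by a clearly resolvable margin.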

Again, criteria for correct response in the multisensory and modality-specific versions of each task follow analogous tasks in the ‘decision-making’ and ‘comparison’ groups, where multisensory versions of the task require integrating total duration over each modality, and modality-specific tasks require only considering durations in the given task modality. For multisensory tasks, we draw a duration value \(du{r}_{{{{\rm{long}}}}} \sim \{i| 75 < i\le 100,i\in {\mathbb{N}}\}\) and then split this value into \(du{r}_{{{{\rm{long}}}}0}=du{r}_{{{{\rm{long}}}}}\times 0.55\) and \(du{r}_{{{{\rm{long}}}}1}=du{r}_{{{{\rm{long}}}}}\times 0.45\). We also draw a value \(du{r}_{{{{\rm{short}}}}}=du{r}_{{{{\rm{long}}}}}-\Delta dur\) where \(\Delta dur \sim \{i| 15 < i\le 25,i\in {\mathbb{N}}\}\) . This value is then subdivided further into \(du{r}_{{{{\rm{short}}}}0}=du{r}_{{{{\rm{long}}}}1}+\Delta du{r}_{{{{\rm{short}}}}}\) where \(\Delta du{r}_{{{{\rm{short}}}}} \sim \{i| 19 < i\le 15,i\in {\mathbb{N}}\}\) and \(du{r}_{{{{\rm{short}}}}1}=du{r}_{{{{\rm{short}}}}}-du{r}_{{{{\rm{short}}}}0}\). Short and long durations can then be allocated to the ordered stimuli according to task type. Drawing durations in this manner ensures that, as in the ‘decision-making’ and ‘comparison’ groups, correct answers truly require models to integrate durations over both modalities, rather than simply performing the task in a given modality to achieve correct responses.

‘Matching’ tasks

The ‘matching’ family of tasks consists of ‘DMS’ (delay match to stimulus), ‘DNMS’ (delay non-match to stimulus), ‘DMC’ (delay match to category) and ‘DNMC’ (delay non-match to category) tasks. For all tasks, stim1 is presented at the beginning of the stimulus epoch, followed by a delay, and the presentation of stim2. The stimulus strength is drawn according to \(st{r}_{{{{\rm{stim}}}}1},st{r}_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[0.8,1.2]\) . The input modality for any given trial is chosen at random with equal probability. In both ‘DMS’ and ‘DNMS’ tasks, trials are constructed as ‘matching stim’ trials or ‘mismatching stim’ trials with equal probability. In ‘matching stim’ trials \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2}={\theta }_{{{{\rm{stim}}}}1}\) . In ‘mismatching stim’ trials, \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and

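For ‘DMS’, models must respond in the displayed direction if the stimuli match, otherwise repress response,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }{\theta }_{{{{\rm{stim}}}}1}={\theta }_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]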

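and for ‘DNMS’, models must respond to the second direction if both directions are mismatched,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }{\theta }_{{{{\rm{stim}}}}1}\ne {\theta }_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]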

‘DMC’ and ‘DNMC’ tasks are organized in a similar manner. The stimulus input space is divided evenly into two categories such that cat1 = { θ : 0 <  θ ≤ π } and cat2 = { θ :  π  <  θ ≤2 π }. For ‘DMC’ and ‘DNMC’ tasks, trials are constructed as ‘matching cat.’ trials or ‘mismatching cat.’ trials with equal probability. In ‘matching cat.’ trials \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}({{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) , where \({\mathbb{U}}({{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) is a uniform draw from the category of stim1. In ‘mismatching cat.’ trials, \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}(-{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) where \(-{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1}\) is the category opposite to that of stim1. For ‘DMC’, the model must respond in the first direction if both stimuli are presented in the same category, otherwise repress response,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}1}&\text{if }{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1}={{{\mbox{cat}}}}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]

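and for ‘DNMC’, the model should respond to the second direction if both stimuli are presented in opposite categories, otherwise repress response,

\[{\theta }_{{{{\rm{target}}}}}=\begin{cases}{\theta }_{{{{\rm{stim}}}}2}&\text{if }{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1}\ne {{{\mbox{cat}}}}_{{{{\rm{stim}}}}2}\\ \text{repress}&\text{otherwise}\end{cases}\]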

Target output and correct criteria

The target output \(y\in {{\mathbb{R}}}^{33\times T}\) for a trial entails maintaining fixation in \({y}_{1}={y}_{{{{\rm{fix}}}}}\) during the stimulus epoch, and then either responding in the correct direction or repressing activity in the remaining target response units \({y}_{2\ldots 33}\) in the response epoch. Since the model should maintain fixation until response, the target for fixation is set at \({y}_{{{{\rm{fix}}}}}=0.85\) during preparatory and stimulus epochs and \({y}_{{{{\rm{fix}}}}}=0.05\) in the response epoch. When a response is not required, as in the preparatory and stimulus epochs and with repressed activity in the response epoch, unit \(i\) takes on a target activity of \({y}_{i}=0.05\). Alternatively, when there is a target direction for response,

where \({\theta }_{i}\) is the preferred direction for unit \(i\). As with sensory stimuli, preferred directions for target units are evenly spaced values from [0, 2 π ] allocated to the 32 response units.

For a model response to count as correct, it must maintain fixation, that is, \({\hat{y}}_{{{{\rm{fix}}}}} > 0.5\) during preparatory and stimulus epochs. When no response is required \({\hat{y}}_{i} < 0.15\) . When a response is required, response activity is decoded using a population vector method and \({\theta }_{{{{\rm{resp}}}}.}\in ({\theta }_{{{{\rm{target}}}}}-\frac{\pi }{10},{\theta }_{{{{\rm{target}}}}}+\frac{\pi }{10})\) . If the model fails to meet any of these criteria, the trial response is incorrect.
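The correctness criteria can be sketched in NumPy as follows. Several details are assumptions made for illustration: the fixation unit is taken to be the first output, preferred directions are spaced evenly with the endpoint excluded, the response is decoded from the final time step only, and the function names are hypothetical.

```python
import numpy as np

# 32 preferred directions, evenly spaced on the circle (endpoint excluded by assumption).
PREFERRED = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)


def population_vector(activity):
    """Decode a direction from ring activity as the angle of the summed unit vectors."""
    x = np.sum(activity * np.cos(PREFERRED))
    y = np.sum(activity * np.sin(PREFERRED))
    return np.arctan2(y, x) % (2.0 * np.pi)


def angular_distance(a, b):
    """Smallest absolute difference between two angles."""
    d = abs(a - b) % (2.0 * np.pi)
    return min(d, 2.0 * np.pi - d)


def is_correct(y_hat, response_start, theta_target):
    """Check one trial; y_hat has shape (T, 33), theta_target is None for repress trials."""
    fixation = y_hat[:response_start, 0]
    if not np.all(fixation > 0.5):          # fixation must hold before the response epoch
        return False
    final_ring = y_hat[-1, 1:]              # assumed: decode from the last time step
    if theta_target is None:
        return bool(np.all(final_ring < 0.15))
    decoded = population_vector(final_ring)
    return bool(angular_distance(decoded, theta_target) < np.pi / 10)
```

A trial therefore fails if fixation breaks early, if any ring unit exceeds 0.15 when no response is required, or if the decoded direction falls outside the \(\pm \pi /10\) window around the target.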

Model training

Again following ref. 18 , model parameters are updated in a supervised fashion according to a masked mean squared error loss (mMSE) computed between the model motor response, \({\hat{y}}_{1\ldots T}=\hat{y}\) , and the target, \({y}_{1\ldots T}=y\) , for each trial.

\[L={\rm{mMSE}}(y,\hat{y})={\rm{mask}}\times {\left\langle {({y}_{t}-{\hat{y}}_{t})}^{2}\right\rangle }_{t}\]

Here, the multiplication sign denotes element-wise multiplication. Masks weigh the importance of different trial epochs. During preparatory and stimulus epochs, mask weights are set to 1; during the first five time steps of the response epoch, the mask value is set to 0; and during the remainder of the response epoch, the mask weight is set to 5. The mask value for the fixation is twice that of other values at all time steps.
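The mask weights described above can be assembled as in the following sketch. It assumes that the fixation unit is the first output column and that the loss is a mean over all time steps and output units; the function names and the example epoch boundary are hypothetical.

```python
import numpy as np


def build_mask(n_steps, response_start, n_outputs=33):
    """Per-time-step, per-unit loss weights for one trial."""
    mask = np.ones((n_steps, n_outputs))             # preparatory and stimulus epochs
    mask[response_start:response_start + 5] = 0.0    # first five response steps ignored
    mask[response_start + 5:] = 5.0                  # remainder of the response epoch
    mask[:, 0] *= 2.0                                # fixation unit weighted twice as much
    return mask


def masked_mse(y, y_hat, mask):
    """Masked mean squared error using element-wise multiplication."""
    return float(np.mean(mask * (y - y_hat) ** 2))


mask = build_mask(n_steps=150, response_start=130)   # hypothetical epoch boundary
loss = masked_mse(np.ones((150, 33)), np.zeros((150, 33)), mask)
```

Setting the mask to zero for the first five response steps gives the network a short grace period to switch from fixation to response without being penalized.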

For all models, we update Θ = {sensorimotor-RNN, \({{\rm{Linear}}}_{{\rm{out}}}\)} during training on our task set. For instructed models, we additionally update \({{\rm{Linear}}}_{{\rm{embed}}}\) in the process of normal training. We train models using standard PyTorch machinery and an Adam optimizer. An epoch consists of 2,400 mini-batches, with each mini-batch consisting of 64 trials. For all models, we use the same initial learning rate as in ref. 18 , \(lr=0.001\). We found that in the later phases of training, model performance oscillated depending on which task had been presented most recently, so we decayed the learning rate for each epoch by a factor of \(\gamma =0.95\), which allowed performance to converge smoothly. Following ref. 18 , models train until they reach a threshold performance of 95% across all tasks (and train for a minimum of 35 epochs). We found that training for GPTNET tended to asymptote below performance threshold for multisensory versions of comparison tasks. This held true over a variety of training hyperparameters and learning rate scheduler regimes. Hence, we relax the performance threshold of GPTNET to 85%. For each model type, we train five models that start from five different random initializations. Where applicable, results are averaged over these initializations.
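The resulting learning rate schedule is a simple exponential decay, sketched below in plain Python. Applying the decay factor once per completed epoch, starting from epoch 0, is an assumption; in PyTorch, `torch.optim.lr_scheduler.ExponentialLR` provides this kind of schedule, although the source does not state which scheduler was used.

```python
MINI_BATCHES_PER_EPOCH = 2400
TRIALS_PER_MINI_BATCH = 64
TRIALS_PER_EPOCH = MINI_BATCHES_PER_EPOCH * TRIALS_PER_MINI_BATCH  # 153,600 trials


def learning_rate(epoch, initial_lr=0.001, gamma=0.95):
    """Learning rate in effect during the given epoch (epoch 0 uses initial_lr)."""
    return initial_lr * gamma ** epoch


# Learning rate at the start and at the minimum training length of 35 epochs.
schedule = [learning_rate(e) for e in range(35)]
```

After 35 epochs of decay at \(\gamma =0.95\), the learning rate has fallen to roughly one sixth of its initial value.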

Language model fine-tuning

When fine-tuning models, we allow the gradient from the motor loss experienced during sensorimotor training to fine-tune the weights in the final layers of the transformer language models. During normal training, we checkpoint a copy of our instructed models after training for 30 epochs. We then add the last three transformer layers to the set of trainable parameters, and reset the learning rates to \(lr=1\times {10}^{-4}\) for Θ = {sensorimotor-RNN, \({{\rm{Linear}}}_{{\rm{out}}}\)} and \(l{r}_{{\rm{lang}}}=3\times {10}^{-4}\) for \({\Theta }_{{\rm{lang}}}\) = {\({{\rm{Linear}}}_{{\rm{embed}}}\), \({{\rm{transformer}}}_{-3,-2,-1}\)}, where \({{\rm{transformer}}}_{-3,-2,-1}\) denotes the parameters of the last three layers of the relevant transformer architecture. We used these reduced learning rates to avoid completely erasing preexisting linguistic knowledge. Similarly for RNN parameters, we found the above learning rate avoided catastrophic forgetting of sensorimotor knowledge while also allowing the RNN to adapt to updated language embeddings across all models. Autoregressive models were much more sensitive to this procedure, often collapsing at the beginning of fine-tuning. Hence, for GPTNETXL and GPTNET, we used \(l{r}_{{\rm{lang}}}=5\times {10}^{-5}\), which resulted in robust learning. Models train until they reach a threshold performance of 95% across training tasks or 85% correct for GPTNET.

Hold-out testing

During hold-out testing, we present models with 100 batches of one of the tasks that had been held out of training. For the instructed model, the only weights allowed to update during this phase are Θ = {sensorimotor-RNN, \({{\rm{Linear}}}_{{\rm{out}}}\), \({{\rm{Linear}}}_{{\rm{embed}}}\)}. All weights of SIMPLENET and STRUCTURENET are trainable in this context. In this hold-out setting, we found that in more difficult tasks for some of our more poorly performing models, the standard hyperparameters we used during training resulted in unstable learning curves for novel tasks. To stabilize performance and thereby create fair comparisons across models, we used an increased batch size of 256. We then began with the standard learning rate of 0.001 and decreased this by increments of 0.0005 until all models showed robust learning curves. This resulted in a learning rate of \(8\times {10}^{-4}\). All additional results shown in the Supplementary Information section 4 follow this procedure.

CCGP calculation

To calculate CCGP, we trained a linear decoder on a pair of tasks and then tested that decoder on alternative pairs of tasks that have an analogous relationship. We grouped tasks into eight dichotomies: ‘Go’ versus ‘Anti’, ‘Standard’ versus ‘RT’, ‘Weakest’ versus ‘Strongest’, ‘Longest’ versus ‘Shortest’, ‘First Stim.’ versus ‘Second Stim’, ‘Stim Match’ versus ‘Category Match’, ‘Matching’ versus ‘Non-Matching’ and ‘Mod1’ versus ‘Mod2’. As an example, the ‘Go’ versus ‘Anti’ dichotomy includes (‘Go’, ‘AntiGo’), (‘GoMod1’, ‘AntiGoMod1’), (‘GoMod2’, ‘AntiGoMod2’), (‘RTGo’, ‘AntiRTGo’), (‘RTGoMod1’, ‘AntiRTGoMod1’) and (‘RTGoMod2’, ‘AntiRTGoMod2’) task pairs. For ‘RNN’ task representations, we extracted activity at the time of stimulus onset for 250 example trials. For language representations, we input the instruction sets for relevant tasks to our language model and directly analyze activity in the ‘embedding’ layer or take the sequence-averaged activity in each transformer layer. For nonlinguistic models, we simply analyze the space of rule vectors. Train and test conditions for decoders were determined by dichotomies identified across the task set (Supplementary Note 1 ). To train and test decoders, we used sklearn.svm.LinearSVC Python package. The CCGP score for a given task is the average decoding score achieved across all dichotomies where the task in question was part of either the train set or the test set. For model scores reported in the main text, we only calculate CCGP scores for models where the task in question has been held out of training. In Supplementary Fig. 9 , we report scores on tasks where models have been trained on all tasks, and for models where instructions have been switched for the hold-out task.
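The logic of the CCGP calculation, training a decoder on one task pair and testing it on the other pairs of the same dichotomy, can be sketched as follows. To stay dependent on NumPy only, the sketch swaps the linear support vector classifier named above for a simpler linear decoder based on class means; this stand-in, the function names and the data layout are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_decoder(x_pos, x_neg):
    """Stand-in linear decoder: hyperplane halfway between the two class means."""
    mu_pos, mu_neg = x_pos.mean(axis=0), x_neg.mean(axis=0)
    w = mu_pos - mu_neg
    b = -w @ (mu_pos + mu_neg) / 2.0
    return w, b


def decoding_accuracy(w, b, x_pos, x_neg):
    """Fraction of samples falling on the correct side of the hyperplane."""
    correct = np.sum(x_pos @ w + b > 0) + np.sum(x_neg @ w + b <= 0)
    return correct / (len(x_pos) + len(x_neg))


def ccgp(pairs):
    """Average cross-condition accuracy over all train/test pair combinations.

    pairs is a list of (x_pos, x_neg) arrays, one entry per task pair in a
    dichotomy, e.g. representations of ('Go', 'AntiGo') and ('RTGo', 'AntiRTGo').
    """
    scores = []
    for i, train in enumerate(pairs):
        w, b = fit_linear_decoder(*train)
        for j, test in enumerate(pairs):
            if i != j:  # always test on conditions unseen during decoder training
                scores.append(decoding_accuracy(w, b, *test))
    return float(np.mean(scores))
```

A score near 1 indicates that the axis separating, say, ‘Go’ from ‘AntiGo’ also separates the other pairs in the dichotomy, which is the signature of an abstract, shared representation; a score near 0.5 indicates that each pair is encoded along an unrelated axis.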

For Fig. 3e, we calculated Pearson’s r correlation coefficient between performance on held-out tasks and CCGP scores per task, as well as a P-value testing against the null hypothesis that these metrics are uncorrelated and normally distributed (using the scipy.stats.pearsonr function). Full statistical tests for CCGP scores of both RNN and embedding layers from Fig. 3f can be found in Supplementary Fig. 9. Note that transformer language models use the same set of pretrained weights across random initializations of sensorimotor-RNNs; thus, for language model layers, the Fig. 3f plots show the absolute scores of those language models.
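As a rough illustration of the correlation statistic, the sketch below computes Pearson’s r with the standard library only. It does not reproduce the P-value that scipy.stats.pearsonr also returns, and the two score lists are made-up placeholders rather than results from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Placeholder per-task values (not data from the paper).
held_out_performance = [0.2, 0.4, 0.6, 0.8]
ccgp = [0.5, 0.6, 0.7, 0.8]
r = pearson_r(held_out_performance, ccgp)
```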

Conditional clause/deduction task analysis

We first split our task set into two groups (listed below): tasks that included conditional clauses and simple deductive reasoning components (30 tasks) and those where instructions include simple imperatives (20 tasks). For each model, we computed the difference between the mean generalization performance of the two groups across random initializations (Fig. 2f). We compared these differences to a null distribution constructed by performing a set of 50 random shuffles of the task set into groups of 30 and 20 tasks and computing differences in the same way, again using two-sided unequal-variance t-tests. Because STRUCTURENET is a nonlinguistic model, we then compared performance of STRUCTURENET to our instructed models to dissociate the effects of performing tasks with a deductive reasoning component from those of processing instructions with more complicated conditional clause structure. Results of all statistical tests are reported in Supplementary Fig. 6.

Simple imperative tasks include: ‘Go’, ‘AntiGo’, ‘RTGo’, ‘AntiRTGo’, ‘GoMod1’, ‘GoMod2’, ‘AntiGoMod1’, ‘AntiGoMod2’, ‘RTGoMod1’, ‘AntiRTGoMod1’, ‘RTGoMod2’, ‘AntiRTGoMod2’, ‘DM’, ‘AntiDM’, ‘MultiDM’, ‘AntiMultiDM’, ‘DMMod1’, ‘DMMod2’, ‘AntiDMMod1’ and ‘AntiDMMod2’.

Conditional clause/deduction tasks include: ‘ConDM’, ‘ConAntiDM’, ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘AntiDur1’, ‘AntiDur2’, ‘AntiMultiDur1’, ‘AntiMultiDur2’, ‘Dur1Mod1’, ‘Dur1Mod2’, ‘Dur2Mod1’, ‘Dur2Mod2’, ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘AntiCOMP1’, ‘AntiCOMP2’, ‘AntiMultiCOMP1’, ‘AntiMultiCOMP2’, ‘COMP1Mod1’, ‘COMP1Mod2’, ‘COMP2Mod1’, ‘COMP2Mod2’, ‘DMS’, ‘DNMS’, ‘DMC’ and ‘DNMC’.
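The shuffled null distribution used in the conditional clause analysis can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the fixed random seed and the placeholder per-task scores are hypothetical, and the subsequent t-test against the null is not shown.

```python
import random
from statistics import mean

def group_difference(performance, group_a, group_b):
    """Difference between the mean performance of two groups of tasks."""
    return (mean(performance[t] for t in group_a)
            - mean(performance[t] for t in group_b))

def shuffled_null(performance, size_a, n_shuffles=50, seed=0):
    """Null distribution of group differences under random task groupings."""
    rng = random.Random(seed)  # seed fixed only to make the sketch repeatable
    tasks = list(performance)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(tasks)
        null.append(group_difference(performance, tasks[:size_a], tasks[size_a:]))
    return null

# Placeholder generalization scores for 50 tasks (not data from the paper).
performance = {f"task{i}": i / 50 for i in range(50)}
null = shuffled_null(performance, size_a=30)
```

Each shuffle splits the 50 tasks into groups of 30 and 20, mirroring the sizes of the conditional clause and simple imperative groups.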

Language production training

Self-supervised language production network training.

Our language production framework is inspired by classic sequence-to-sequence modeling using RNNs (ref. 53). Our Production-RNN is a GRU with 256 hidden units using ReLU nonlinearities. At each step in the sequence, a set of decoder weights, Linear_words, attempts to decode the next token, \(w_{\tau+1}\), from the hidden state of the recurrent units. The hidden state of the Production-RNN is initialized by concatenating the time average and maximum sensorimotor activity of an SBERTNET (L) and passing that through weights Linear_sm. The linguistic instruction used to drive the initializing sensorimotor activity is in turn used as the target set of tokens for the Production-RNN outputs. The first input to the Production-RNN is always a special start-of-sentence token, and the decoder runs until an end-of-sentence token is decoded or until input reaches a length of 30 tokens. Suppose \(w_{1,k}\ldots w_{\mathcal{T},k}\in \mathrm{Instruct}_{k}^{i}\) is the sequence of tokens in instruction k where k is in the instruction set for task i and \(X^{i}\) is sensory input for a trial of task i. For brevity, we denote the process by which language models embed instructions as Embed() (see ‘Pretrained transformers’). The decoded token at the τth position, \(\hat{w}_{\tau,k}\), is then given by
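A sketch of this decoding step, written to be consistent with the description above rather than copied from the original display: the intermediate symbols \(h^{sm}_{t}\), \(sm_{out}\) and \(h^{prod}_{\tau}\) are names chosen here for illustration, and the softmax and argmax readouts are assumptions suggested by the cross-entropy training objective.

\[
\begin{aligned}
h^{sm}_{1:T} &= \text{sensorimotor-RNN}\left(X^{i},\ \mathrm{Embed}\left(\mathrm{Instruct}_{k}^{i}\right)\right)\\
sm_{out} &= \left[\tfrac{1}{T}\textstyle\sum_{t=1}^{T} h^{sm}_{t}\ ;\ \max_{t} h^{sm}_{t}\right]\\
h^{prod}_{0} &= \mathrm{Linear}_{sm}\left(sm_{out}\right)\\
h^{prod}_{\tau} &= \text{Production-RNN}\left(\hat{w}_{\tau-1,k};\ h^{prod}_{\tau-1}\right)\\
p_{\hat{w}_{\tau,k}} &= \mathrm{softmax}\left(\mathrm{Linear}_{words}\left(h^{prod}_{\tau}\right)\right)\\
\hat{w}_{\tau,k} &= \arg\max_{w}\ p_{\hat{w}_{\tau,k}}(w)
\end{aligned}
\]

Here \(\hat{w}_{0,k}\) stands for the start-of-sentence token.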

The model parameters Θ_production = {Linear_sm, Linear_words, Production-RNN} are trained using cross-entropy loss between \(p_{\hat{w}_{\tau,k}}\) and the instruction token \(w_{\tau,k}\) provided to the sensorimotor-RNN as input. We train for 80 epochs of 2,400 batches with 64 trials per batch and with task type randomly interleaved. We found that an initial learning rate of 0.001 sometimes caused models to diverge in early phases of training, so we opted for a learning rate of 1 × 10⁻⁴, which led to stable early training. To alleviate oscillation problems similar to those detected in sensorimotor training, we also decayed the learning rate by γ = 0.99 per epoch. Additionally, the use of a dropout layer with a dropout rate of 0.05 improved performance. We also used a teacher forcing curriculum where, for some ratio of training batches, we input the ground truth instruction token \(w_{\tau,k}\) at each time step instead of the model’s decoded word \(\hat{w}_{\tau,k}\). At each epoch, \(\mathrm{teacher\_forcing\_ratio}=0.5\times \frac{80-\mathrm{epoch}}{80}\).
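The two per-epoch schedules described above can be written out as a short sketch. It assumes epochs are counted from 0 and that the learning rate is multiplied by γ once per completed epoch; the function names are hypothetical.

```python
def teacher_forcing_ratio(epoch, n_epochs=80, initial=0.5):
    """Fraction of batches trained with ground-truth tokens at a given epoch."""
    return initial * (n_epochs - epoch) / n_epochs

def learning_rate(epoch, initial_lr=1e-4, gamma=0.99):
    """Exponentially decayed learning rate (assumes one decay step per epoch)."""
    return initial_lr * gamma ** epoch

# Schedule values at the start, middle and end of training.
schedule = [(e, teacher_forcing_ratio(e), learning_rate(e)) for e in (0, 40, 79)]
```

Under these assumptions the teacher forcing ratio falls linearly from 0.5 at epoch 0 to 0.25 at epoch 40 and approaches 0 at the end of training.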

Obtaining embedding layer activity using motor feedback

For a task, i, we seek to optimize a set of embedding activity vectors \(E^{i}\in \mathbb{R}^{64}\) such that when they are input as task-identifying information, the model will perform the task in question. Crucially, we freeze all model weights Θ = {sensorimotor-RNN, Linear_out, Linear_embedding} and only update \(E^{i}\) according to the standard supervised loss on the motor output. For notational clarity, GRU dependence on the previous hidden state \(h_{t-1}\) has been made implicit in the following equations.

We optimized a set of 25 embedding vectors for each task, again using an Adam optimizer. Here the optimization space has many suboptimal local minima corresponding to embeddings for related tasks. Hence, we used a high initial learning rate of lr = 0.05, which we decayed by γ = 0.8 for each epoch. This resulted in more robust learning than lower learning rates. An epoch lasts for 800 batches with a batch size of 64, and we train for a minimum of 1 epoch or until we reach a threshold performance of 90% (85% for the ‘DMC’ and ‘DNMC’ tasks).
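One possible reading of this schedule and stopping rule is sketched below. It assumes that training always runs for at least one epoch and then stops once the task-specific threshold is reached; the function names are hypothetical and the rule is an interpretation, not the authors' implementation.

```python
def embedding_lr(epoch, initial_lr=0.05, gamma=0.8):
    """Learning rate for embedding optimization after `epoch` decay steps."""
    return initial_lr * gamma ** epoch

def performance_threshold(task):
    """Target performance: 85% for 'DMC' and 'DNMC', 90% otherwise."""
    return 0.85 if task in {"DMC", "DNMC"} else 0.90

def should_stop(task, epochs_completed, performance):
    """Stop after at least one epoch once the threshold is reached (assumed rule)."""
    return epochs_completed >= 1 and performance >= performance_threshold(task)
```

For example, `should_stop("DMC", 1, 0.86)` is true under this reading, while the same performance on ‘Go’ would not end training.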

Producing task instructions

To produce task instructions, we simply use the set \(E^{i}\) as task-identifying information in the input of the sensorimotor-RNN and use the Production-RNN to output instructions based on the sensorimotor activity driven by \(E^{i}\). For each task, we use the set of embedding vectors to produce 50 instructions. We repeat this process for each of the 5 initializations of the sensorimotor-RNN, resulting in 5 distinct language production networks and 5 distinct sets of learned embedding vectors. Reported results for each task are averaged over these 5 networks. For the confusion matrix (Fig. 5d), we report the average percentage of decoded instructions that either appear in the training instruction set for a given task or are novel instructions. Partner model performance (Fig. 5e) for each network initialization is computed by testing each of the 4 possible partner networks and averaging over these results.
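The partner-averaging step can be illustrated with a small sketch. It assumes a hypothetical callable `score(producer, partner)` that returns the performance of a partner network on instructions produced by another network; the toy scoring function below is a placeholder, not a result from the study.

```python
from statistics import mean

def partner_performance(score, n_networks=5):
    """Average each producer's instructions over all other (partner) networks."""
    results = {}
    for producer in range(n_networks):
        partners = [p for p in range(n_networks) if p != producer]
        results[producer] = mean(score(producer, p) for p in partners)
    return results

def toy_score(producer, partner):
    """Placeholder performance that depends only on the partner index."""
    return 0.5 + 0.1 * partner

per_network = partner_performance(toy_score)
overall = mean(per_network.values())
```

With 5 initializations, each producer is paired with the 4 remaining networks, which matches the 4 possible partner networks mentioned above.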

Sample sizes/randomization

No statistical methods were used to predetermine sample sizes but following ref. 18 we used five different random weight initializations per language model tested. Randomization of weights was carried out automatically in Python and PyTorch software packages. Given this automated randomization of weights, we did not use any blinding procedures in our study. No data were excluded from analyses.

All simulations and data analyses were performed in Python 3.7.11. PyTorch 1.10 was used to implement and train models (this includes the Adam optimizer implementation). Transformers 4.16.2 was used to implement language models, and all pretrained weights for language models were taken from the Huggingface repository. We also used scikit-learn 0.24.1 and scipy 1.7.3 to perform analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All weights for language transformers used in this study were taken from pretrained models available on the Huggingface repository. Training data for simulated psychophysical tasks were generated using code available at . The full set of trained model weights for all results is available upon request.

Code availability

All code used to train models and analyze results can be found at .

References

Cole, M. W. et al. Multi-task connectivity reveals flexible hubs for adaptive task control. Nat. Neurosci. 16, 1348–1355 (2013).

Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24 , 167–202 (2001).

Bernardi, S. et al. The geometry of abstraction in the hippocampus and prefrontal cortex. Cell 183 , 954–967 (2020).

Minxha, J., Adolphs, R., Fusi, S., Mamelak, A. N. & Rutishauser, U. Flexible recruitment of memory-based choice representations by the human medial frontal cortex. Science 368 , eaba3313 (2020).

Ito, T. et al. Compositional generalization through abstract representations in human and artificial neural networks. In Proc. 36th Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 32225–32239 (Curran Associates, Inc., 2022).

Driscoll, L., Shenoy, K. & Sussillo, D. Flexible multitask computation in recurrent networks utilizes shared dynamical motifs. Preprint at bioRxiv (2022).

Brown, T. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems 1877–1901 (Curran Associates Inc., 2020).

Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 8821–8831 (PMLR, 2021).

Radford, A. et al. Language models are unsupervised multitask learners. OpenAI 1 , 9 (2019).

Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA (2021).

Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nat. Neurosci. 25, 369–380 (2022).

Chowdhery, A. et al. Palm: scaling language modeling with pathways. J. Mach. Learn. Res. 24 , 11324–11436 (2023).

Thoppilan, R. et al. Lamda: language models for dialog applications. Preprint at (2022).

Rombach, R. et al. High-resolution image synthesis with latent diffusion models. In Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10674–10685 (IEEE, 2022).

Zitkovich, B. et al. Rt-2: vision-language-action models transfer web knowledge to robotic control. In Proc. 7th Conference on Robot Learning (eds Tan, J. et al.) 2165-2183 (PMLR, 2023).

Abramson, J. et al. Imitating interactive intelligence. Preprint at (2021).

DeepMind Interactive Agents Team. Creating multimodal interactive agents with imitation and self-supervised learning. Preprint at (2022).

Yang, G. R., Joglekar, M. R., Song, H. F., Newsome, W. T. & Wang, X.-J. Task representations in neural networks trained to perform many cognitive tasks. Nat. Neurosci. 22 , 297–306 (2019).

Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., 2017).

Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at (2018).

Reimers, N. & Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. Preprint at (2019).

Bowman, S. R., Angeli, G., Potts, C. & Manning, C. D. A large annotated corpus for learning natural language inference. Preprint at (2015).

Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 8748–8763 (PMLR, 2021).

Goel, V., Gold, B., Kapur, S. & Houle, S. Neuroanatomical correlates of human reasoning. J. Cogn. Neurosci. 10 , 293–302 (1998).

Goel, V., Buchel, C., Frith, C. & Dolan, R. J. Dissociation of mechanisms underlying syllogistic reasoning. Neuroimage 12 , 504–514 (2000).

Reverberi, C. et al. Neural basis of generation of conclusions in elementary deduction. Neuroimage 38 , 752–762 (2007).

Noveck, I. A., Goel, V. & Smith, K. W. The neural basis of conditional reasoning with arbitrary content. Cortex 40 , 613–622 (2004).

Monti, M. M., Osherson, D. N., Martinez, M. J. & Parsons, L. M. Functional neuroanatomy of deductive inference: a language-independent distributed network. Neuroimage 37 , 1005–1016 (2007).

Monti, M. M., Parsons, L. M. & Osherson, D. N. The boundaries of language and thought in deductive inference. Proc. Natl Acad. Sci. USA 106 , 12554–12559 (2009).

Coetzee, J. P. & Monti, M. M. At the core of reasoning: dissociating deductive and non-deductive load. Hum. Brain Mapp. 39 , 1850–1861 (2018).

Monti, M. M. & Osherson, D. N. Logic, language and the brain. Brain Res. 1428 , 33–42 (2012).

Prado, J. The relationship between deductive reasoning and the syntax of language in broca’s area: a review of the neuroimaging literature. L’année Psychol. 118 , 289–315 (2018).

Ito, T., Yang, G. R., Laurent, P., Schultz, D. H. & Cole, M. W. Constructing neural network models from brain data reveals representational transformations linked to adaptive behavior. Nat. Commun. 13 , 673 (2022).

Shadlen, M. N. & Newsome, W. T. Neural basis of a perceptual decision in the parietal cortex (area lip) of the rhesus monkey. J. Neurophysiol. 86 , 1916–1936 (2001).

Huk, A. C. & Shadlen, M. N. Neural activity in macaque parietal cortex reflects temporal integration of visual motion signals during perceptual decision making. J. Neurosci. 25 , 10420–10436 (2005).

Panichello, M. F. & Buschman, T. J. Shared mechanisms underlie the control of working memory and attention. Nature 592 , 601–605 (2021).

Nieh, E. H. et al. Geometry of abstract learned knowledge in the hippocampus. Nature 595 , 80–84 (2021).

Fedorenko, E. & Blank, I. A. Broca’s area is not a natural kind. Trends Cogn. Sci. 24 , 270–284 (2020).

Fedorenko, E., Duncan, J. & Kanwisher, N. Language-selective and domain-general regions lie side by side within broca’s area. Curr. Biol. 22 , 2059–2062 (2012).

Gao, Z. et al. Distinct and common neural coding of semantic and non-semantic control demands. NeuroImage 236 , 118230 (2021).

Duncan, J. The multiple-demand (MD) system of the primate brain: mental programs for intelligent behaviour. Trends Cogn. Sci. 14 , 172–179 (2010).

Buccino, G., Colagé, I., Gobbi, N. & Bonaccorso, G. Grounding meaning in experience: a broad perspective on embodied language. Neurosci. Biobehav. Rev. 69 , 69–78 (2016).

Mansouri, F. A., Freedman, D. J. & Buckley, M. J. Emergence of abstract rules in the primate brain. Nat. Rev. Neurosci. 21 , 595–610 (2020).

Oh, J., Singh, S., Lee, H. & Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. In Proc. 34th International Conference on Machine Learning 2661–2670 (2017).

Chaplot, D. S., Mysore Sathyendra, K., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. Gated-attention architectures for task-oriented language grounding. In Proc. 32nd AAAI Conference on Artificial Intelligence Vol. 32 (AAAI Press, 2018).

Sharma, P., Torralba, A. & Andreas, J. Skill induction and planning with latent language. Preprint at (2021).

Jiang, Y., Gu, S., Murphy, K. & Finn, C. Language as an abstraction for hierarchical deep reinforcement learning. In Proc. 33rd International Conference on Neural Information Processing Systems 9419–9431 (Curran Associates Inc., 2019).

Ouyang, L. et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 27730–27744 (Curran Associates, Inc., 2022).

Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at (2014).

Radford, A. et al. Better language models and their implications. (2019).

Bromley, J. et al. Signature verification using a ‘siamese’ time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 7 , 669–688 (1993).

Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, 2020).

Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. 27th International Conference on Neural Information Processing Systems 3104–3112 (MIT Press, 2014).


Acknowledgements

We thank N. Rungratsameetaweemana, T. Aquino and V. Borghesani as well as N. Patel and P. Tano for their useful discussions during this project. We are also grateful to the University of Geneva for the funding that made this research possible.

Open access funding provided by University of Geneva.

Author information

Authors and affiliations.

Department of Basic Neuroscience, University of Geneva, Geneva, Switzerland

Reidar Riveland & Alexandre Pouget


Contributions

A.P. and R.R. conceived the project. R.R. wrote the code for model simulations and performed analysis of model representations. A.P. and R.R. wrote and revised the paper.

Corresponding author

Correspondence to Reidar Riveland .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Neuroscience thanks Blake Richards and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary Figs. 1–13 and Supplementary Notes 1–4

Reporting Summary

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .

About this article

Cite this article.

Riveland, R., Pouget, A. Natural language instructions induce compositional generalization in networks of neurons. Nat Neurosci (2024).

Received: 13 May 2023

Accepted: 15 February 2024

Published: 18 March 2024


Title: Comprehensive Implementation of TextCNN for Enhanced Collaboration between Natural Language Processing and System Recommendation

Abstract: Natural Language Processing (NLP) is an important branch of artificial intelligence that studies how to enable computers to understand, process, and generate human language. Text classification, which assigns text to predefined categories, is the most basic and classic task in NLP, and most NLP tasks can be regarded as classification tasks. In recent years, deep learning has achieved great success in many research fields and has become a standard technology in NLP, where it is widely integrated into text classification. Unlike numbers and images, text processing emphasizes fine-grained processing ability. Traditional text classification methods generally require preprocessing the text data input to the model. They also need good sample features obtained through manual annotation, which are then classified with classical machine learning algorithms. Therefore, this paper analyzes the application status of deep learning in three core tasks of NLP (text representation, word order modeling, and knowledge representation). It explores the improvement and synergy achieved through natural language processing in the context of text classification, while also taking into account the challenges posed by adversarial techniques in text generation, text classification, and semantic parsing. An empirical study on text classification tasks demonstrates the effectiveness of interactive integration training, particularly in conjunction with TextCNN, highlighting the significance of these advances for augmenting and enhancing text classification.

NeurIPS 2024, the Thirty-eighth Annual Conference on Neural Information Processing Systems, will be held at the Vancouver Convention Center from Monday Dec 9 through Sunday Dec 15. Monday is an industry expo.

Registration details will be posted soon. 

Our Hotel Reservation page is currently under construction and will be released shortly. NeurIPS has contracted Hotel guest rooms for the Conference at group pricing, requiring reservations only through this page. Please do not make room reservations through any other channel, as it only impedes us from putting on the best Conference for you. We thank you for your assistance in helping us protect the NeurIPS conference.


Mission statement

The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.

About the Conference

The conference was founded in 1987 and is now a multi-track interdisciplinary annual meeting that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Along with the conference is a professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas.



  1. Natural Language Processing and Its Applications in ...

    As an essential part of artificial intelligence technology, natural language processing is rooted in multiple disciplines such as linguistics, computer science, and mathematics. The rapid advancements in natural language processing provides strong support for machine translation research. This paper first introduces the key concepts and main content of natural language processing, and briefly ...

  2. Vision, status, and research topics of Natural Language Processing

    The field of Natural Language Processing (NLP) has evolved with, and as well as influenced, recent advances in Artificial Intelligence (AI) and computing technologies, opening up new applications and novel interactions with humans. ... IEEE/ACM Transactions on Audio, Speech, and Language Processing: 2014: 3.919: ... • Research papers should ...

  3. Ieee Transactions on Neural Networks and Learning Systems, Vol. Xx, No

    natural language processing and deep neural networks, and then presents an extensive discussion on how deep learning is being used to solve current problems in NLP. While several other papers and books on the topic have been published [12], [10], none have extensively covered the state-of-the-art in as many areas within it. Furthermore, no ...


    research and development for life science, health management, public health, rehabilitation therapy, etc. Fig. 1 shows the major participants, emerging technologies, and representative scenarios of smart healthcare. Natural language processing (NLP) is a subfield of computer science and artificial intelligence that is concerned with

  5. A systematic review of applications of natural language processing and

    This review has meticulously examined 63 research papers from the IEEE, Science Direct, Scopus, and Web of Science databases to address four primary research questions. ... languages, learning, problem-solving, decision-making, etc. One of the significant contributions of AI has remained in Natural Language Processing (NLP), which glued ...

  6. Real-Time Ransomware Detection by Using eBPF and Natural Language

    This paper introduces a novel real-time ransomware detection system integrating Extended Berkeley Packet Filter (eBPF), Machine Learning (ML), and Natural Language Processing (NLP). The system architecture leverages eBPF for efficient data collection, ML for anomaly detection, and NLP for textual analysis, achieving a high detection accuracy of 94.7% with significantly reduced false positives ...

  7. Natural Language Processing on IEEE Technology Navigator

    Top Conferences on Natural Language Processing. ICASSP 2027 - 2027 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 2022 59th ACM/IEEE Design Automation Conference (DAC) More links.

  8. Special Session: Computational Intelligence for Natural Language ...

    All papers must comply with the basic requirements of IEEE SSCI 2021. The review process will comply with the standard review process of the IEEE SSCI. Each paper will receive at least three reviews from experts in the field. As per our knowledge, there is no previous special session held anywhere as most of the NLP community focuses on using ...

  9. Natural Language Processing

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... Browse SoTA > Natural Language Processing Natural Language Processing. 2328 benchmarks • 660 tasks • 1999 datasets • 27425 papers with code Representation Learning Representation Learning. 16 benchmarks 3620 papers with ...

  10. Deep Learning for Natural Language Processing: A Survey

    Over the last decade, deep learning has revolutionized machine learning. Neural network architectures have become the method of choice for many different applications; in this paper, we survey the applications of deep learning to natural language processing (NLP) problems. We begin by briefly reviewing the basic notions and major architectures of deep learning, including some recent advances ...

  11. Natural Language Processing

    B-NER. Within the Natural Language Processing (NLP) framework, Named Entity Recognition (NER) is regarded as the basis for extracting key information to understand texts in any language. As Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a ...

  12. Natural language processing: state of the art, current trends and

    Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medical, and question answering etc. In this paper, we first distinguish four phases by discussing different levels of NLP ...

  13. Jumping NLP Curves: A Review of Natural Language Processing Research

    This survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics Curves, which will eventually lead NLP research to evolve into natural language understanding. Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human ...

  14. Natural Language Processing: State of The Art, Current Trends and

    The paper distinguishes four phases by discussing different levels of NLP and components of Natural Language Generation (NLG) followed by presenting the history and evolution of NLP, state ...

  15. natural language processing

    Showing 33 posts that have the tag "natural-language-processing". Filter Results. All results Artificial Intelligence Biomedical Computing Consumer Electronics Robotics Telecommunications.

  16. Natural language processing: A review

    Natural language processing (NLP) has received a great deal of attention for its computer representation and evaluation of human language. AI, email spam location, data extraction, once-finished, clinical, and question addressing are only a couple of the applications. The article is broken into four areas, with the first talking about different ...

  17. [2403.15696] MixRED: A Mix-lingual Relation Extraction Dataset

    Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix ...

  18. (PDF) Natural Language Processing: A Review

    Natural language processing (NLP) is a research domain exploring how computers can be used to interpret and manipulate natural language text or speech [68]. With the advance of machine learning ...

  19. Natural language instructions induce compositional generalization in

    We use advances in natural language processing to create a neural model of generalization based on linguistic instructions. ... We end by discussing how these results can guide research on the ...

  20. [2403.09718] Comprehensive Implementation of TextCNN for Enhanced

    Natural Language Processing (NLP) is an important branch of artificial intelligence that studies how to enable computers to understand, process, and generate human language. Text classification is a fundamental task in NLP, which aims to classify text into different predefined categories. Text classification is the most basic and classic task in natural language processing, and most of the ...

  21. NeurIPS 2024

    The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine ... and oral and poster presentations of refereed papers. Along with the conference is a professional exposition focusing on machine learning in practice, a series ...