
Natural language processing: state of the art, current trends and challenges

  • Published: 14 July 2022
  • Volume 82, pages 3713–3744 (2023)


  • Diksha Khurana 1,
  • Aditya Koli 1,
  • Kiran Khatter (ORCID: orcid.org/0000-0002-1000-6102) 2 &
  • Sukhdev Singh 3


Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. Its applications have spread to various fields such as machine translation, email spam detection, information extraction, summarization, medicine, and question answering. In this paper, we first distinguish four phases by discussing different levels of NLP and the components of Natural Language Generation, followed by the history and evolution of NLP. We then discuss the state of the art in detail, presenting the various applications of NLP as well as current trends and challenges. Finally, we discuss some available datasets, models, and evaluation metrics in NLP.


1 Introduction

A language can be defined as a set of rules or symbols that are combined and used for conveying or broadcasting information. Since not all users are well-versed in machine-specific languages, Natural Language Processing (NLP) caters to those users who do not have enough time to learn new languages or attain perfection in them. NLP is a branch of Artificial Intelligence and Linguistics devoted to making computers understand statements or words written in human languages. It came into existence to ease users' work and to satisfy the wish to communicate with computers in natural language. It can be classified into two parts, i.e., Natural Language Understanding (or Linguistics) and Natural Language Generation, which cover the tasks of understanding and generating text. Linguistics is the science of language; it includes Phonology (sound), Morphology (word formation), Syntax (sentence structure), Semantics (meaning), and Pragmatics (understanding in context). Noam Chomsky, one of the pioneering theoretical linguists of the twentieth century, marked a unique position in the field of theoretical linguistics because he revolutionized the area of syntax (Chomsky, 1965) [ 23 ]. Further, Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences, and paragraphs from an internal representation. The first objective of this paper is to give insights into the various important terminologies of NLP and NLG.

In the existing literature, most work in NLP has been conducted by computer scientists, while professionals from other fields, such as linguistics, psychology, and philosophy, have also shown interest. One of the most interesting aspects of NLP is that it adds to our knowledge of human language. The field of NLP draws on different theories and techniques that deal with the problem of communicating with computers in natural language. Some of the researched tasks of NLP are Automatic Summarization (producing an understandable summary of a set of text, providing summaries or detailed information for text of a known type), Co-reference Resolution (determining all words in a sentence or larger span of text that refer to the same object), Discourse Analysis (identifying the discourse structure of connected text, i.e., the study of text in relation to its social context), Machine Translation (automatic translation of text from one language to another), Morphological Segmentation (breaking words into individual meaning-bearing morphemes), Named Entity Recognition (NER; recognizing named entities in text and classifying them into different classes, used for information extraction), Optical Character Recognition (OCR; automatic text recognition that translates printed and handwritten text into a machine-readable format), and Part-of-Speech Tagging (determining the part of speech for each word in a sentence). Some of these tasks have direct real-world applications, such as machine translation, named entity recognition, and optical character recognition. Though NLP tasks are closely interwoven, they are frequently treated separately for convenience. Some tasks, such as automatic summarization and co-reference analysis, act as subtasks in solving larger tasks. Nowadays NLP receives much attention because of its various applications and recent developments, although in the late 1940s the term was not even in existence. So, it is interesting to review the history of NLP, the progress made so far, and some ongoing projects that make use of NLP; the second objective of this paper focuses on these aspects. The third objective concerns datasets, approaches, evaluation metrics, and the challenges involved in NLP. The rest of this paper is organized as follows. Section 2 deals with the first objective, covering the various important terminologies of NLP and NLG. Section 3 deals with the history of NLP, its applications, and a walkthrough of recent developments. Datasets used in NLP and various approaches are presented in Section 4, and Section 5 covers evaluation metrics and the challenges involved in NLP. Finally, a conclusion is presented in Section 6.

2 Components of NLP

NLP can be classified into two parts, i.e., Natural Language Understanding and Natural Language Generation, which cover the tasks of understanding and generating text. Figure 1 presents the broad classification of NLP. The objective of this section is to discuss Natural Language Understanding (Linguistics) (NLU) and Natural Language Generation (NLG).

figure 1

Broad classification of NLP

NLU enables machines to understand natural language and analyze it by extracting concepts, entities, emotions, keywords, etc. It is used in customer-care applications to understand the problems reported by customers, either verbally or in writing. Linguistics is the science that concerns the meaning of language, language context, and the various forms of language. It is therefore important to understand the key terminologies of NLP and its different levels. We next discuss some of the commonly used terminologies at the different levels of NLP.

Phonology is the part of linguistics that refers to the systematic arrangement of sound. The term comes from Ancient Greek, in which phono means voice or sound and the suffix -logy refers to word or speech. Nikolai Trubetzkoy defined phonology as "the study of sound pertaining to the system of language", whereas Lass (1998) [ 66 ] wrote that phonology is concerned broadly with the sounds of language and, as a sub-discipline of linguistics, with the behavior and organization of sounds. Phonology includes the semantic use of sound to encode the meaning of any human language.

The different parts of a word represent the smallest units of meaning, known as morphemes. Morphology, the study of the structure of words, is built on morphemes. For example, the word precancellation can be morphologically analyzed into three separate morphemes: the prefix pre , the root cancella , and the suffix -tion . Because a morpheme keeps the same interpretation across words, humans can break an unknown word into its morphemes to understand its meaning. For example, adding the suffix –ed to a verb conveys that the action of the verb took place in the past. Words that cannot be divided and have meaning by themselves are called lexical morphemes (e.g.: table, chair). Affixes (e.g. -ed, −ing, −est, −ly, −ful) that are combined with lexical morphemes are known as grammatical morphemes (e.g. worked, consulting, smallest, likely, useful). Grammatical morphemes that occur only in combination are called bound morphemes (e.g. -ed, −ing). Bound morphemes can be divided into inflectional morphemes and derivational morphemes. Adding inflectional morphemes to a word changes grammatical categories such as tense, gender, person, mood, aspect, definiteness, and animacy. For example, adding the inflectional morpheme –ed changes the root park to parked . Derivational morphemes change the semantic meaning of the word they are combined with. For example, in the word normalize, the addition of the bound morpheme –ize to the root normal changes the word from an adjective ( normal ) to a verb ( normalize ).

At the lexical level, humans as well as NLP systems interpret the meaning of individual words. Several types of processing contribute to word-level understanding, the first being the assignment of a part-of-speech (PoS) tag to each word: words that can act as more than one part of speech are assigned the most probable PoS tag based on the context in which they occur. At the lexical level, words with a single meaning can be given a semantic representation; the nature of that representation varies with the semantic theory deployed in the NLP system. Therefore, at the lexical level, the structure of words is analyzed with respect to their lexical meaning and PoS, and the text is divided into paragraphs, sentences, and words. Assigning the correct PoS tag improves the understanding of the intended meaning of a sentence. The lexical level is also used for cleaning and feature extraction with techniques such as removal of stop words, stemming, and lemmatization. Stop words such as ‘ in ’, ‘the’, and ‘and’ are removed, as they do not contribute to any meaningful interpretation, and their high frequency may affect the computation time. Stemming reduces the words of the text by removing the suffix of a word to obtain its root form. For example: consulting and consultant are converted to consult after stemming, using gets converted to us , and driver is reduced to driv . Lemmatization does not simply remove the suffix of a word; instead, it returns the source word with the use of a vocabulary. For example, for the token drived , stemming results in “driv”, whereas lemmatization attempts to return the correct base form, either drive or drived , depending on the context in which it is used.
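The suffix-stripping idea behind stemming can be sketched in a few lines. This is a minimal illustration with an invented suffix list, not the full Porter algorithm; it happens to reproduce the examples above (consulting → consult, driver → driv):

```python
# Minimal suffix-stripping stemmer sketch. The suffix list is an
# invented illustration, not the Porter rule set.
SUFFIXES = ["ation", "ing", "ant", "est", "ful", "ed", "er", "ly"]

def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of at least 2 letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word
```

Note how crude truncation produces non-words such as driv: this is the trade-off that motivates lemmatization, which consults a vocabulary instead.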

After PoS tagging at the lexical level, words are grouped into phrases, phrases into clauses, and clauses into sentences at the syntactic level. This level emphasizes the correct formation of a sentence by analyzing its grammatical structure; the output is a representation that reveals the structural dependencies between words. It is also known as parsing, which uncovers phrases that convey more meaning than the individual words alone. The syntactic level examines word order, stop words, morphology, and the PoS of words, which the lexical level does not consider. Changing word order changes the dependencies among words and may also affect the comprehension of sentences. For example, the sentences “ram beats shyam in a competition” and “shyam beats ram in a competition” differ only in word order but convey different meanings [ 139 ]. This level retains stop words, as removing them changes the meaning of the sentence. It does not use lemmatization or stemming, because converting words to their base forms changes the grammar of the sentence. It focuses on identifying the correct PoS of the words in a sentence. For example: in the phrase “frowns on his face”, “frowns” is a noun, whereas it is a verb in the sentence “he frowns”.

At the semantic level, the most important task is to determine the proper meaning of a sentence. To understand the meaning of a sentence, human beings rely on knowledge about language and the concepts present in that sentence, but machines cannot count on these techniques. Semantic processing determines the possible meanings of a sentence by processing its logical structure to recognize the most relevant words and to understand the interactions among words or concepts in the sentence. For example, it can recognize that a sentence is about “movies” even if the sentence does not contain that actual word but includes related concepts such as “actor”, “actress”, “dialogue” or “script”. This level of processing also incorporates the semantic disambiguation of words with multiple senses (Liddy, 2001) [ 68 ]. For example, the word “bark” as a noun can mean either the sound a dog makes or the outer covering of a tree. The semantic level examines words for their dictionary interpretation, or an interpretation derived from the context of the sentence. For example, the sentence “Krishna is good and noble.” may be talking about Lord Krishna or about a person named Krishna. To get the proper meaning of the sentence, the appropriate interpretation is chosen by looking at the rest of the sentence [ 44 ].
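The word-sense disambiguation mentioned above is often illustrated with a Lesk-style overlap heuristic: pick the sense whose dictionary gloss shares the most words with the sentence. The glosses below are hand-written for the "bark" example and stand in for a real dictionary:

```python
# Simplified Lesk-style sense disambiguation sketch. Glosses are
# invented illustrations, not entries from a real lexical resource.
GLOSSES = {
    "bark": {
        "dog_sound": "the sound a dog makes",
        "tree_covering": "the outer covering of a tree trunk",
    }
}

def disambiguate(word: str, context: str) -> str:
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For instance, "the dog let out a loud bark" overlaps more with the dog-sound gloss, while "the bark of the old tree" overlaps more with the tree-covering gloss.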

While the syntax and semantic levels deal with sentence-length units, the discourse level of NLP deals with units longer than one sentence. It analyzes the logical structure of text by making connections among words and sentences to ensure coherence. It focuses on the properties of the text that convey meaning by interpreting the relations between sentences and uncovering linguistic structures from texts at several levels (Liddy, 2001) [ 68 ]. Two of the most common tasks are anaphora resolution and coreference resolution. Anaphora resolution recognizes the entity referenced by an anaphor in order to resolve references used within the text with the same sense. For example: (i) Ram topped in the class. (ii) He was intelligent. Here (i) and (ii) together form a discourse. Human beings quickly understand that the pronoun “he” in (ii) refers to “Ram” in (i); the interpretation of “he” depends on the word “Ram” presented earlier in the text. Without determining the relationship between these two structures, it would not be possible to decide why Ram topped the class and who was intelligent. Coreference resolution finds all expressions that refer to the same entity in a text. It is an important step in various NLP applications that involve high-level tasks such as document summarization and information extraction. In fact, anaphora is encoded through one of the processes called coreference.

The pragmatic level focuses on knowledge or content that comes from outside the content of the document. It deals with what the speaker implies and what the listener infers; in fact, it analyzes what is not directly spoken in the sentences. Real-world knowledge is used to understand what is being talked about in the text, and by analyzing the context a meaningful representation of the text is derived. When a sentence is not specific and the context does not provide specific information about it, pragmatic ambiguity arises (Walton, 1996) [ 143 ]: different persons derive different interpretations of the text depending on its context. The context of a text may include references to other sentences of the same document, which influence the understanding of the text, and the background knowledge of the reader or speaker, which gives meaning to the concepts expressed in the text. Semantic analysis focuses on the literal meaning of the words, whereas pragmatic analysis focuses on the inferred meaning that readers perceive based on their background knowledge. For example, the sentence “Do you know what time it is?” is interpreted as “asking for the current time” in semantic analysis, whereas in pragmatic analysis the same sentence may express resentment to someone who missed the due time. Thus, semantic analysis is the study of the relationship between linguistic utterances and their meanings, while pragmatic analysis is the study of the context that influences our understanding of linguistic expressions. Pragmatic analysis helps users uncover the intended meaning of a text by applying contextual background knowledge.

The goal of NLP is to accommodate one or more specialties of an algorithm or system, and assessing an NLP algorithm or system allows for the integration of language understanding and language generation. NLP is even used in multilingual event detection. Rospocher et al. [ 112 ] proposed a novel modular system for cross-lingual event extraction for English, Dutch, and Italian texts, using different pipelines for different languages. The system incorporates a modular set of multilingual NLP tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual named entity linking, semantic role labeling, and time normalization. The cross-lingual framework thus allows for the interpretation of events, participants, locations, and times, as well as the relations between them. The output of these individual pipelines is intended to be used as input for a system that builds event-centric knowledge graphs. Each module takes standard input, performs some annotation, and produces standard output, which in turn becomes the input for the next module in the pipeline. The pipelines are built in a data-centric architecture so that modules can be adapted and replaced; furthermore, the modular architecture allows for different configurations and for dynamic distribution.
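The "standard input, annotate, standard output" design described above can be sketched as function composition over a shared annotation record. The module internals here are placeholders invented for illustration, not the actual multilingual tools from the cited work:

```python
# Sketch of a data-centric NLP pipeline: every module takes an
# annotation dict, adds its own annotations, and returns it, so
# modules can be reordered or swapped. Module bodies are placeholders.
def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def pos_tag(doc):
    # placeholder tagger: capitalised tokens become proper nouns
    doc["pos"] = ["NNP" if t[0].isupper() else "NN" for t in doc["tokens"]]
    return doc

def run_pipeline(text, modules):
    doc = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc

doc = run_pipeline("Rome hosted the event", [tokenize, pos_tag])
```

Because each module reads and writes the same record format, replacing the placeholder tagger with a real one requires no change to the rest of the pipeline, which is the point of the data-centric design.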

Ambiguity is one of the major problems of natural language, occurring when one sentence can lead to different interpretations. It is usually faced at the syntactic, semantic, and lexical levels. In syntactic-level ambiguity, one sentence can be parsed into multiple syntactic forms. Semantic ambiguity occurs when the meaning of words can be misinterpreted. Lexical-level ambiguity refers to the ambiguity of a single word that can have multiple senses. Each of these levels can produce ambiguities that can be resolved using knowledge of the complete sentence. Ambiguity can be handled by various strategies such as minimizing ambiguity, preserving ambiguity, interactive disambiguation, and weighting ambiguity [ 125 ]. Some of the methods proposed by researchers, e.g. (Shemtov, 1997; Emele & Dorna, 1998; Knight & Langkilde, 2000; Tong Gao et al., 2015; Umber & Bajwa, 2011) [ 39 , 46 , 65 , 125 , 139 ], preserve ambiguity rather than resolving it prematurely; their objectives are nonetheless closely in line with the removal or minimization of ambiguity. They cover a wide range of ambiguities, and there is a statistical element implicit in their approach.

Natural Language Generation (NLG) is the process of producing meaningful phrases, sentences, and paragraphs from an internal representation. It is a part of natural language processing and happens in four phases: identifying the goals, planning how the goals may be achieved, evaluating the situation and the available communicative resources, and realizing the plans as text (Fig. 2 ). It is the reverse of understanding.

Speaker and Generator

figure 2

Components of NLG

To generate text, we need a speaker (or an application) and a generator (or a program) that renders the application’s intentions into a fluent phrase relevant to the situation.

Components and Levels of Representation

The process of language generation involves the following interwoven tasks. Content selection : the information to be communicated must be selected; depending on how this information is parsed into representational units, parts of the units may have to be removed while others may be added by default. Textual organization : the information must be textually organized according to the grammar; it must be ordered both sequentially and in terms of linguistic relations such as modification. Linguistic resources : to support the information’s realization, linguistic resources must be chosen; in the end these resources come down to choices of particular words, idioms, syntactic constructs, etc. Realization : the selected and organized resources must be realized as actual text or voice output.
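The stages above can be sketched with a toy template-based realiser. The weather "facts" and the template are invented for illustration, and the linguistic-resources stage collapses here into a single string template:

```python
# Toy template-based NLG sketch: select content, organise it, realise it.
# Facts and template are invented illustrations.
FACTS = {"city": "Delhi", "temp_c": 41, "sky": "clear"}

def select_content(facts):
    # content selection: keep only what this message needs
    return {k: facts[k] for k in ("city", "temp_c")}

def organise(selected):
    # textual organisation: fix the order of the information units
    return [("city", selected["city"]), ("temp_c", selected["temp_c"])]

def realise(units):
    # linguistic resources + realisation: map units onto words
    parts = dict(units)
    return f"It is {parts['temp_c']} degrees in {parts['city']}."

sentence = realise(organise(select_content(FACTS)))
```

Real NLG systems replace each stage with far richer machinery (discourse planning, lexical choice, grammatical realisation), but the division of labour is the same.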

Application or Speaker

This component only maintains the model of the situation. The speaker just initiates the process and does not take part in the language generation itself. It stores the history, structures the potentially relevant content, and deploys a representation of what it knows. All of this forms the situation, from which a subset of the propositions that the speaker has is selected. The only requirement is that the speaker must make sense of the situation [ 91 ].

3 NLP: Then and now

In the late 1940s the term NLP did not exist, but work on machine translation (MT) had started. Research in this period was not completely localized; Russian and English were the dominant languages for MT (Andreev, 1967) [ 4 ]. MT/NLP research almost died in 1966 following the ALPAC report, which concluded that MT was going nowhere. But later, some MT production systems were providing output to their customers (Hutchins, 1986) [ 60 ]. By this time, work on the use of computers for literary and linguistic studies had also started. As early as 1960, signature work influenced by AI began with the BASEBALL Q-A system (Green et al., 1961) [ 51 ]. LUNAR (Woods, 1978) [ 152 ] and Winograd’s SHRDLU were natural successors of these systems, and were seen as a step up in sophistication in terms of their linguistic and task-processing capabilities. There was a widespread belief that progress could only be made on two fronts: the ARPA Speech Understanding Research (SUR) project (Lea, 1980), and major system-development projects building database front ends. The front-end projects (Hendrix et al., 1978) [ 55 ] were intended to go beyond LUNAR in interfacing with large databases. In the early 1980s, computational grammar theory became a very active area of research, linked with logics for meaning and knowledge that could deal with the user’s beliefs and intentions and with functions like emphasis and themes.

By the end of the decade, powerful general-purpose sentence processors like SRI’s Core Language Engine (Alshawi, 1992) [ 2 ] and Discourse Representation Theory (Kamp and Reyle, 1993) [ 62 ] offered a means of tackling more extended discourse within the grammatico-logical framework. This was a period of growing community: practical resources, grammars, tools, and parsers became available (for example, the Alvey Natural Language Tools) (Briscoe et al., 1987) [ 18 ]. The (D)ARPA speech recognition and message understanding (information extraction) conferences were significant not only for the tasks they addressed but for their emphasis on heavy evaluation, starting a trend that became a major feature of the 1990s (Young and Chase, 1998; Sundheim and Chinchor, 1993) [ 131 , 157 ]. Work on user modeling (Wahlster and Kobsa, 1989) [ 142 ] was another strand of research. Cohen et al. (2002) [ 28 ] put forward a first approximation of a compositional theory of tune interpretation, together with the phonological assumptions on which it is based and the evidence from which they drew their proposals. At the same time, McKeown (1985) [ 85 ] demonstrated that rhetorical schemas could be used for producing text that is both linguistically coherent and communicatively effective. Some research in NLP marked important topics for the future, such as word sense disambiguation (Small et al., 1988) [ 126 ]; probabilistic networks, statistically colored NLP, and work on the lexicon also pointed in this direction. Statistical language processing was a major focus of the 1990s (Manning and Schuetze, 1999) [ 75 ], and information extraction and automatic summarization (Mani and Maybury, 1999) [ 74 ] were also points of focus. Next, we present a walkthrough of the developments from the early 2000s.

3.1 A walkthrough of recent developments in NLP

The main objectives of NLP include the interpretation, analysis, and manipulation of natural-language data for an intended purpose, using various algorithms, tools, and methods. However, many challenges are involved, which may depend on the natural-language data under consideration, making it difficult to achieve all the objectives with a single approach. Therefore, the development of different tools and methods in NLP and related areas of study has received much attention from researchers in the recent past. The developments can be seen in Fig.  3 :

figure 3

A walkthrough of recent developments in NLP

In the early 2000s, neural language modelling was introduced, in which the probability of the next word (token) is determined given the n previous words. Bengio et al. [ 12 ] proposed a feed-forward neural network with a lookup table that represents the n previous words in a sequence. Collobert et al. [ 29 ] proposed the application of multitask learning to NLP, where two convolutional models with max pooling were used to perform part-of-speech and named-entity-recognition tagging. Mikolov et al. [ 87 ] proposed a word-embedding process in which dense vector representations of text are learned; they also reported the challenges faced by the traditional sparse bag-of-words representation. After the advancement of word embeddings, neural networks that take variable-length input for further processing were introduced in NLP. Sutskever et al. [ 132 ] proposed a general framework for sequence-to-sequence mapping, where encoder and decoder networks map from sequence to vector and from vector to sequence respectively. The use of neural networks has thus played a very important role in NLP. One can observe from the existing literature that neural networks were not widely used in the early 2000s, but by 2013 enough discussion had taken place about their use in NLP to transform the field and pave the way for implementing various neural network architectures. Convolutional neural networks (CNNs) first contributed to image classification and the analysis of visual imagery; later, CNNs were applied to NLP tasks such as sentence classification [ 127 ], sentiment analysis [ 135 ], text classification [ 118 ], text summarization [ 158 ], machine translation [ 70 ], and answer relations [ 150 ].
An article by Newatia (2019) [ 93 ] illustrates the general architecture behind any CNN model and how it can be used in the context of NLP; one can also refer to the work of Wang and Gang [ 145 ] for applications of CNNs in NLP. Recurrent Neural Networks (RNNs), which apply the same function at every step of a sequence, have also been used in NLP and are well suited to sequential data such as text, time series, financial data, speech, audio, and video; see the article by Thomas (2019) [ 137 ]. One modified version of the RNN is the Long Short-Term Memory (LSTM) network, which is very useful in cases where only the important information needs to be retained for a long time while irrelevant information is discarded; see [ 52 , 58 ]. Further development of the LSTM led to a slightly simpler variant, the gated recurrent unit (GRU), which has shown better results than standard LSTMs in many tasks [ 22 , 26 ]. Attention mechanisms [ 7 ], which let a network learn what to pay attention to in accordance with the current hidden state and annotations, together with the use of transformers, have also driven significant developments in NLP; see [ 141 ]. Transformers have the potential to learn longer-term dependencies but are limited by a fixed-length context in the setting of language modelling. In this direction, Dai et al. [ 30 ] recently proposed a novel neural architecture, Transformer-XL (XL for extra-long), which enables learning dependencies beyond a fixed length of words. The work of Rae et al. [ 104 ] on the Compressive Transformer, an attentive sequence model that compresses memories for long-range sequence learning, may also be helpful to the reader, as may the recent work by Otter et al. [ 98 ] on uses of deep learning for NLP and the references cited therein.
The BERT (Bidirectional Encoder Representations from Transformers) model [ 33 ] and its successors have also played an important role in NLP.
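The language-modelling objective that opens this subsection, estimating the probability of the next token given the preceding ones, can be illustrated with a count-based bigram sketch; neural language models pursue the same objective but replace raw counts with learned dense representations. The toy corpus below is invented:

```python
from collections import defaultdict

# Count-based bigram language model sketch over a tiny invented corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    """Estimate P(nxt | prev) from bigram counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0
```

Here "sat" is always followed by "on", so that probability is 1, while "the" is followed by four different words with equal probability; a neural model would generalise these estimates to contexts never seen in training.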

Many researchers have worked on NLP, building the tools and systems that make NLP what it is today. Tools like sentiment analysers, parts-of-speech (POS) taggers, chunking, named entity recognition (NER), emotion detection, and semantic role labeling have made huge contributions to NLP and are good topics for research. Sentiment analysis (Nasukawa et al., 2003) [ 156 ] works by extracting sentiments about a given topic; it consists of topic-specific feature-term extraction, sentiment extraction, and association by relationship analysis. It utilizes two linguistic resources for the analysis: a sentiment lexicon and a sentiment pattern database. It analyzes documents for positive and negative words and tries to give ratings on a scale of −5 to +5. The mainstream of currently used tagsets is derived from English: the most widely used standard tagsets are designed for Indo-European languages, while Asian and Middle Eastern languages are less researched. Various authors have worked on part-of-speech taggers for languages such as Arabic (Zeroual et al., 2017) [ 160 ], Sanskrit (Tapswi & Jain, 2012) [ 136 ], and Hindi (Ranjan & Basu, 2003) [ 105 ] to efficiently tag and classify words as nouns, adjectives, verbs, etc. The authors in [ 136 ] used a treebank technique for creating a rule-based POS tagger for Sanskrit. Sanskrit sentences are parsed to assign the appropriate tag to each word using a suffix-stripping algorithm, in which the longest suffix is searched for in the suffix table and tags are assigned accordingly. Diab et al. (2004) [ 34 ] used a supervised machine-learning approach, adopting Support Vector Machines (SVMs) trained on the Arabic Treebank to automatically tokenize, part-of-speech tag, and annotate base phrases in Arabic text.
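The longest-suffix-match tagging idea described for the rule-based Sanskrit tagger can be sketched as follows. The suffix table here is an invented English-like illustration, not the Sanskrit table from the cited work:

```python
# Longest-suffix-match PoS tagging sketch. The suffix table is an
# invented English-like illustration.
SUFFIX_TABLE = {"ing": "VERB", "ed": "VERB", "ly": "ADV",
                "tion": "NOUN", "est": "ADJ"}

def suffix_tag(word, default="NOUN"):
    # try suffixes from longest to shortest; first match wins
    for suffix in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if word.endswith(suffix):
            return SUFFIX_TABLE[suffix]
    return default
```

Searching longest-first matters: "cancellation" must match "tion" as a noun suffix before any shorter ending could misfire. Words with no matching suffix fall back to a default tag.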

Chunking is the process of separating phrases from unstructured text. Since simple tokens may not represent the actual meaning of the text, it is advisable to use phrases such as “North Africa” as a single unit instead of the separate words ‘North’ and ‘Africa’. Chunking, also known as shallow parsing, labels parts of sentences with syntactic constituents such as noun phrases (NP) and verb phrases (VP). Chunking is often evaluated using the CoNLL 2000 shared task; various researchers (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008) [ 83 , 122 , 130 ] used the CoNLL test data for chunking, with features composed of words, POS tags, and chunk tags.
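A minimal noun-phrase chunker over already-tagged text can sketch the idea: group maximal runs of determiners, adjectives, and nouns, and keep the runs that contain at least one noun. The tag inventory and sentence below are illustrative; real chunkers learn these decisions from data:

```python
# Minimal NP chunking sketch over (word, POS) pairs: keep maximal
# determiner/adjective/noun runs containing at least one noun.
NP_TAGS = ("DT", "JJ", "NN", "NNP")
NOUN_TAGS = ("NN", "NNP")

def np_chunk(tagged):
    chunks, run = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            run.append((word, tag))
        else:
            if any(t in NOUN_TAGS for _, t in run):
                chunks.append(" ".join(w for w, _ in run))
            run = []
    if any(t in NOUN_TAGS for _, t in run):
        chunks.append(" ".join(w for w, _ in run))
    return chunks

tagged = [("North", "NNP"), ("Africa", "NNP"), ("borders", "VBZ"),
          ("the", "DT"), ("Mediterranean", "NNP"), ("sea", "NN")]
```

On this sentence the chunker recovers "North Africa" as a single unit, which is exactly the motivation given above for using phrases rather than individual tokens.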

Particular words in a document refer to specific entities or real-world objects such as locations, people, and organizations. To find words that have a unique context and are more informative, noun phrases in the text documents are considered. Named entity recognition (NER) is a technique to recognize and separate such named entities and group them under predefined classes. In the Internet era, however, people often use slang rather than traditional or standard English, which standard natural language processing tools cannot process well. Ritter (2011) [ 111 ] proposed the classification of named entities in tweets because standard NLP tools did not perform well on them, rebuilding the NLP pipeline starting from PoS tagging, then chunking, and then NER. This improved the performance in comparison to standard NLP tools.

Emotion detection investigates and identifies types of emotion from speech, facial expressions, gestures, and text. Sharma (2016) [ 124 ] analyzed conversations in Hinglish (a mix of English and Hindi) and identified usage patterns of PoS. Their work was based on language identification and POS tagging of mixed script. They tried to detect emotions in mixed script by combining machine learning and human knowledge. They categorized sentences into six groups based on emotions and used the TLBO technique to help users prioritize their messages based on the emotions attached to each message. Seal et al. (2020) [ 120 ] proposed an efficient emotion-detection method that searches for emotional words in a pre-defined emotional-keyword database and analyzes the emotion words, phrasal verbs, and negation words. Their proposed approach exhibited better performance than recent approaches.
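The keyword-database approach with negation handling can be sketched in a few lines. The lexicon and negation list below are invented stand-ins for a real emotional-keyword database:

```python
# Keyword-lookup emotion detection sketch with simple negation handling.
# Lexicon and negation list are invented illustrations.
EMOTION_WORDS = {"happy": "joy", "glad": "joy",
                 "angry": "anger", "sad": "sadness"}
NEGATIONS = {"not", "never", "no"}

def detect_emotions(sentence):
    """Return emotions for lexicon words not preceded by a negation."""
    words = sentence.lower().split()
    found = []
    for i, w in enumerate(words):
        if w in EMOTION_WORDS and (i == 0 or words[i - 1] not in NEGATIONS):
            found.append(EMOTION_WORDS[w])
    return found
```

So "I am happy but not sad" yields only joy, because the negation word before "sad" suppresses the sadness match; richer systems extend this window beyond the immediately preceding word.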

Semantic Role Labeling (SRL) assigns semantic roles within a sentence. For example, in the PropBank formalism (Palmer et al., 2005) [ 100 ], roles are assigned to words that are arguments of a verb in the sentence. The precise arguments depend on the verb frame, and if multiple verbs exist in a sentence, a word might have multiple tags. State-of-the-art SRL systems comprise several stages: creating a parse tree, identifying which parse-tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

Benson et al. (2011) [ 13 ] addressed event discovery in social media feeds, using a graphical model to analyze feeds and determine whether they contain the name of a person, the name of a venue, a place, a time, etc. The model operates on noisy feeds of data to extract records of events by aggregating information across multiple messages; despite irrelevant messages and very irregular message language, the model was able to extract records with a broad array of features.

Having given insights on some of the mentioned tools and relevant work, we now move to the broad applications of NLP.

3.2 Applications of NLP

Natural language processing can be applied in various areas like machine translation, email spam detection, information extraction, summarization, question answering, etc. Next, we discuss some of these areas along with relevant work done in those directions.

Machine Translation

As most of the world is online, making data accessible and available to all is a challenge, and the major obstacle is the language barrier: there is a multitude of languages with different sentence structures and grammar. Machine translation translates phrases from one language to another with the help of a statistical engine like Google Translate. The challenge with machine translation technologies is not translating words directly but keeping the meaning of sentences intact along with grammar and tenses. Statistical machine translation systems gather as much data as they can find that seems to be parallel between two languages, and crunch that data to find the likelihood that something in language A corresponds to something in language B. In September 2016, Google announced a new machine translation system based on artificial neural networks and deep learning. In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate (Tillmann et al., 1997) [ 138 ], generation string accuracy (Bangalore et al., 2000) [ 8 ], multi-reference word error rate (Nießen et al., 2000) [ 95 ], the BLEU score (Papineni et al., 2002) [ 101 ], and the NIST score (Doddington, 2002) [ 35 ]. All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation with human subjective evaluations of fluency and adequacy (Papineni et al., 2001; Doddington, 2002) [ 35 , 101 ].
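The flavour of these reference-based metrics can be shown with a simplified, unigram-only BLEU-style score: clipped unigram precision multiplied by a brevity penalty. Real BLEU combines precisions over n-grams up to length 4; this sketch is a deliberate simplification:

```python
import math
from collections import Counter

# Simplified unigram BLEU-style score: clipped unigram precision
# times a brevity penalty. Real BLEU also uses higher-order n-grams.
def unigram_bleu(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    # clip each hypothesis word count by its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    precision = clipped / len(hyp)
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision
```

Clipping prevents a hypothesis from earning credit for repeating a reference word more often than the reference contains it, and the brevity penalty keeps a system from gaming precision by outputting very short translations.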

Text Categorization

Categorization systems take in a large flow of data, such as official documents, military casualty reports, market data, and newswires, and assign them to predefined categories or indices. For example, the Carnegie Group’s Construe system (Hayes, 1991) [ 54 ] takes in Reuters articles and saves much time by doing work that would otherwise be done by staff or human indexers. Some companies have used categorization systems to categorize trouble tickets or complaint requests and route them to the appropriate desks. Another application of text categorization is email spam filtering. Spam filters are becoming important as the first line of defence against unwanted emails, and reducing their false negatives and false positives comes down to the core NLP challenge of extracting meaning from strings of text. A filtering solution applied to an email system uses a set of protocols to determine which incoming messages are spam and which are not. There are several types of spam filters available. Content filters : review the content of a message to determine whether it is spam. Header filters : review the email header looking for fake information. General blacklist filters : stop all emails from blacklisted senders. Rules-based filters : use user-defined criteria, such as stopping mail from a specific person or mail including a specific word. Permission filters : require anyone sending a message to be pre-approved by the recipient. Challenge-response filters : require anyone sending a message to enter a code to gain permission to send email.

Spam Filtering

Spam filtering works via text categorization, and in recent times various machine-learning techniques have been applied to text categorization and anti-spam filtering, such as rule learning (Cohen, 1996) [ 27 ], naïve Bayes (Sahami et al., 1998; Androutsopoulos et al., 2000; Rennie, 2000) [ 5 , 109 , 115 ], memory-based learning (Sakkis et al., 2000b) [ 117 ], support vector machines (Drucker et al., 1999) [ 36 ], decision trees (Carreras and Marquez, 2001) [ 19 ], the maximum entropy model (Berger et al., 1996) [ 14 ], and Hash Forest with a rule-encoding method (T. Xia, 2020) [ 153 ], sometimes combining different learners (Sakkis et al., 2001) [ 116 ]. These approaches are preferable because the classifier is learned from training data rather than built by hand. Naïve Bayes is often preferred because of its performance despite its simplicity (Lewis, 1998) [ 67 ]. In text categorization, two types of models have been used (McCallum and Nigam, 1998) [ 77 ]. Both models assume that a fixed vocabulary is present. In the first model, a document is generated by first choosing a subset of the vocabulary and then using each selected word any number of times, at least once, irrespective of order. This is called the multi-variate Bernoulli model; it records which words are used in a document, irrespective of their counts and order. In the second model, a document is generated by choosing a set of word occurrences and arranging them in any order. This is called the multinomial model; in addition to what the multi-variate Bernoulli model captures, it also records how many times each word is used in a document. Most text-categorization approaches to anti-spam email filtering have used the multi-variate Bernoulli model (Androutsopoulos et al., 2000) [ 5 , 15 ].
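The difference between the two document models comes down to the feature vectors they extract. A sketch with an invented four-word vocabulary makes it concrete: the multi-variate Bernoulli representation records only presence/absence, while the multinomial representation records counts:

```python
# Feature extraction contrasting the two document models above.
# The vocabulary is an invented illustration.
VOCAB = ["free", "money", "meeting", "offer"]

def bernoulli_features(doc):
    """Multi-variate Bernoulli view: which vocabulary words occur."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

def multinomial_features(doc):
    """Multinomial view: how many times each vocabulary word occurs."""
    words = doc.lower().split()
    return [words.count(w) for w in VOCAB]
```

For the document "free money free offer", the Bernoulli vector is [1, 1, 0, 1] while the multinomial vector is [2, 1, 0, 1]; a naïve Bayes classifier built on either vector leads to the two event models McCallum and Nigam compare.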

Information Extraction

Information extraction is concerned with identifying phrases of interest in textual data. For many applications, extracting entities such as names, places, events, dates, times and prices is a powerful way of summarizing the information relevant to a user's needs. In the case of a domain-specific search engine, the automatic identification of important information can increase the accuracy and efficiency of a directed search. Hidden Markov models (HMMs) have been used to extract the relevant fields of research papers; the extracted text segments are used to allow searches over specific fields, to provide effective presentation of search results, and to match references to papers. A familiar everyday example is the pop-up ads on websites showing, at a discount, the items you recently viewed on an online store.
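As a deliberately simple sketch of the field-extraction idea (pattern-based rather than the HMM approach mentioned above; the patterns and the invoice text are invented):

```python
import re

# A minimal sketch of information extraction: pattern-based recognizers for
# dates and prices. Real systems use learned models (e.g. HMMs) rather than
# hand-written patterns, but the input/output shape is the same.

DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

def extract(text):
    return {"dates": DATE.findall(text), "prices": PRICE.findall(text)}

info = extract("Invoice dated 12/03/2021: total $49.99, shipping $5.")
print(info)  # {'dates': ['12/03/2021'], 'prices': ['$49.99', '$5']}
```

A learned extractor generalizes far better than fixed patterns, but the output, a set of typed fields pulled from free text, is what downstream search and presentation components consume in either case.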

Discovery of knowledge has become an important area of research in recent years. Knowledge discovery research uses a variety of techniques to extract useful information from source documents, such as: part-of-speech (POS) tagging; chunking or shallow parsing; stop-word removal (common words that must be removed before processing documents); stemming (mapping words to a base form; it has two methods, dictionary-based stemming and Porter-style stemming (Porter, 1980) [ 103 ], where the former has higher accuracy but a higher implementation cost, while the latter has a lower implementation cost but is often insufficient for IR); compound or statistical phrases (indexing multi-token units instead of single tokens); and word sense disambiguation (the task of identifying the correct sense of a word in context; when used for information retrieval, terms are replaced by their senses in the document vector).
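A minimal sketch of two of these preprocessing steps, stop-word removal and suffix stripping; the stop-word and suffix lists here are toy subsets, not the full Porter (1980) rule set:

```python
# A minimal preprocessing sketch: stop-word removal followed by a crude
# Porter-style suffix stripper. The lists below are tiny illustrative
# subsets; a real stemmer applies ordered rules with measure conditions.

STOP_WORDS = {"the", "is", "a", "of", "and"}
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        # strip the first matching suffix, keeping a stem of at least 3 letters
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def preprocess(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(preprocess("The cats are chasing the mice and jumping"))
# ['cat', 'are', 'chas', 'mice', 'jump']
```

The output "chas" for "chasing" shows why crude suffix stripping is often insufficient for IR, as the text notes: stems need not be dictionary words, and over-stripping conflates unrelated terms.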

The extracted information can be applied for a variety of purposes, for example to prepare a summary, build databases, identify keywords, or classify text items according to pre-defined categories. For example, CONSTRUE was developed for Reuters and is used to classify news stories (Hayes, 1992) [ 54 ]. It has been noted that while many IE systems can successfully extract terms from documents, acquiring the relations between those terms remains difficult. PROMETHEE is a system that extracts lexico-syntactic patterns relative to a specific conceptual relation (Morin, 1999) [ 89 ]. IE systems should work at many levels, from word recognition to discourse analysis at the level of the complete document. Bondale et al. (1999) [ 16 ] applied the Blank Slate Language Processor (BSLP) approach to the analysis of a real-life natural language corpus consisting of responses to open-ended questionnaires in the field of advertising.

MITA (MetLife's Intelligent Text Analyzer) (Glasgow et al., 1998) [ 48 ] is a system that extracts information from life insurance applications. Ahonen et al. (1998) [ 1 ] suggested a mainstream framework for text mining that uses pragmatic and discourse-level analyses of text.

Summarization

Information overload is a real problem in this digital age: our reach and access to knowledge and information already exceed our capacity to understand it. This trend is not slowing down, so the ability to summarize data while keeping the meaning intact is highly valued. Summarization matters not only for recognizing and understanding the important information in a large body of data; it is also used to surface deeper emotional meaning. For example, a company may determine the general sentiment on social media towards its latest product offering and use it as a valuable marketing asset.

The types of text summarization depend on the number of documents; the two important categories are single-document summarization and multi-document summarization (Zajic et al. 2008 [ 159 ]; Fattah and Ren 2009 [ 43 ]). Summaries can also be of two types: generic or query-focused (Gong and Liu 2001 [ 50 ]; Dunlavy et al. 2007 [ 37 ]; Wan 2008 [ 144 ]; Ouyang et al. 2011 [ 99 ]). The summarization task can be either supervised or unsupervised (Mani and Maybury 1999 [ 74 ]; Fattah and Ren 2009 [ 43 ]; Riedhammer et al. 2010 [ 110 ]). Training data is required in a supervised system for selecting relevant material from the documents, and a large amount of annotated data is needed for learning techniques. A few techniques are as follows:

Bayesian Sentence-based Topic Model (BSTM) uses both term-sentence and term-document associations for summarizing multiple documents (Wang et al. 2009 [ 146 ]).

Factorization with Given Bases (FGB) is a language model in which sentence bases are the given bases and which utilizes document-term and sentence-term matrices. This approach groups and summarizes the documents simultaneously (Wang et al. 2011 [ 147 ]).

Topic Aspect-Oriented Summarization (TAOS) is based on topic factors: various features that describe topics, such as capitalized words used to represent entities. Different topics can have different aspects, and different feature preferences are used to represent different aspects (Fang et al. 2015 [ 42 ]).
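A bare-bones unsupervised extractive summarizer in the spirit of the generic, single-document case discussed above (much simpler than the cited models): score each sentence by the document-wide frequency of its words and keep the top-scoring ones. The sentences below are invented:

```python
from collections import Counter

# A minimal unsupervised extractive summarizer: frequent words mark important
# content, so sentences rich in frequent words are selected for the summary.

def summarize(sentences, k=1):
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)
    # score = sum of document-wide frequencies of the sentence's words
    scored = [(sum(freq[w] for w in s.lower().split()), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    # restore original document order among the selected sentences
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

doc = ["The cat sat on the mat.",
       "Cats are popular pets.",
       "The weather is sunny."]
print(summarize(doc, k=1))  # ['The cat sat on the mat.']
```

Real systems normalize for sentence length and remove stop words first; without that, long sentences full of common words dominate the ranking.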

Dialogue System

Dialogue systems are very prominent in real-world applications, ranging from providing support to performing a particular action. Support dialogue systems require context awareness, whereas systems that perform an action do not need much context awareness. Earlier dialogue systems were focused on small applications such as home theater systems and utilized only the phonemic and lexical levels of language. Habitable dialogue systems offer the potential for fully automated dialogue systems by utilizing all levels of a language (Liddy, 2001) [ 68 ]. This leads to systems that enable machines to interact with humans in natural language, such as Google Assistant, Microsoft's Cortana, Apple's Siri and Amazon's Alexa.

NLP is applied in the medical field as well. The Linguistic String Project-Medical Language Processor is one of the large-scale NLP projects in the field of medicine [ 21 , 53 , 57 , 71 , 114 ]. The LSP-MLP helps physicians extract and summarize information on signs, symptoms, drug dosage and response data, with the aim of identifying possible side effects of a medicine while highlighting or flagging relevant data items [ 114 ]. The National Library of Medicine is developing the Specialist System [ 78 , 79 , 80 , 82 , 84 ], which is expected to function as an information extraction tool for biomedical knowledge bases, particularly Medline abstracts. Its lexicon was created using MeSH (Medical Subject Headings), Dorland's Illustrated Medical Dictionary and general English dictionaries. The Centre d'Informatique Hospitaliere of the Hopital Cantonal de Geneve is working on an electronic archiving environment with NLP features [ 81 , 119 ]. In the first phase, patient records were archived. At a later stage the LSP-MLP was adapted for French [ 10 , 72 , 94 , 113 ], and finally a full NLP system called RECIT [ 9 , 11 , 17 , 106 ] was developed using a method called proximity processing [ 88 ]. Its task was to implement a robust, multilingual system able to analyze and comprehend medical sentences and to convert the knowledge in free text into a language-independent knowledge representation [ 107 , 108 ]. Columbia University in New York has developed an NLP system called MEDLEE (MEDical Language Extraction and Encoding System) that identifies clinical information in narrative reports and transforms the textual information into a structured representation [ 45 ].

3.3 NLP in talk

We next discuss some of the recent NLP projects implemented by various companies:

ACE Powered GDPR Robot Launched by RAVN Systems [ 134 ]

RAVN Systems, a leading expert in Artificial Intelligence (AI), search and knowledge management solutions, announced the launch of its RAVN ACE ("Applied Cognitive Engine") powered software robot to help facilitate compliance with the GDPR ("General Data Protection Regulation"). The robot uses AI techniques to automatically analyze documents and other types of data in any business system subject to GDPR rules. It allows users to quickly and easily search, retrieve, flag, classify, and report on data deemed to be sensitive under GDPR. Users can also identify personal data in documents, view feeds on the latest personal data that requires attention, and generate reports on data suggested for deletion or securing. RAVN's GDPR robot can also speed up requests for information (Data Subject Access Requests, "DSARs") in a simple and efficient way, removing the need for a manual approach to these requests, which tends to be very labor intensive. Peter Wallqvist, CSO at RAVN Systems, commented: "GDPR compliance is of universal paramountcy as it will be exploited by any organization that controls and processes data concerning EU citizens."

Link: http://markets.financialcontent.com/stocks/news/read/33888795/RAVN_Systems_Launch_the_ACE_Powered_GDPR_Robot

Eno A Natural Language Chatbot Launched by Capital One [ 56 ]

Capital One announced a chatbot for customers called Eno. Eno is a natural language chatbot that people interact with through texting. Capital One claims that Eno is the first natural language SMS chatbot from a U.S. bank that allows customers to ask questions in natural language. Customers can interact with Eno through a text interface, asking questions about their savings and other accounts, and Eno creates the feeling of interacting with a human. This provides a different platform from brands that launch chatbots on Facebook Messenger or Skype. Capital One believed that Facebook has too much access to a person's private information, which could cause trouble under the privacy laws U.S. financial institutions work under; for example, a Facebook Page admin can access full transcripts of a bot's conversations. If that were the case, admins could easily view customers' personal banking information, which is not acceptable.

Link: https://www.macobserver.com/analysis/capital-one-natural-language-chatbot-eno/

Future of BI in Natural Language Processing [ 140 ]

Several companies in the BI space are trying to get ahead of this trend and working hard to ensure that data becomes more friendly and easily accessible, but there is still a long way to go. BI will also become easier to access, since a GUI will no longer be needed: queries are increasingly made by text or voice command on smartphones. One of the most common examples is that Google can tell you today what tomorrow's weather will be. But soon enough, we will be able to ask our personal data chatbot about customer sentiment today and how customers will feel about our brand next week, all while walking down the street. Today, NLP tends to be based on turning natural language into machine language. But as the technology matures, especially the AI component, the computer will get better at "understanding" the query and start to deliver answers rather than search results. Initially, a user will probably ask the data chatbot "how have revenues changed over the last three quarters?" and get back pages of data to analyze. But once the chatbot learns the semantic relations and inferences behind the question, it will be able to automatically perform the filtering and formulation necessary to provide an intelligible answer, rather than simply showing data.

Link: http://www.smartdatacollective.com/eran-levy/489410/here-s-why-natural-language-processing-future-bi

Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research [ 97 ]

This work describes a theory derivation process used to develop a conceptual framework for medication therapy management (MTM) research. The MTM service model and the chronic care model were selected as parent theories. Abstracts of review articles targeting medication therapy management in chronic disease care were retrieved from Ovid Medline (2000-2016). Unique concepts in each abstract were extracted using MetaMap and their pairwise co-occurrences determined. This information was then used to construct a network graph of concept co-occurrence, which was further analyzed to identify content for the new conceptual model. In total, 142 abstracts were analyzed. Medication adherence was the most studied drug therapy problem and co-occurred with concepts related to patient-centered interventions targeting self-management. The enhanced model consists of 65 concepts clustered into 14 constructs. The framework requires additional refinement and evaluation to determine its relevance and applicability across a broad audience, including underserved settings.

Link: https://www.ncbi.nlm.nih.gov/pubmed/28269895?dopt=Abstract

Meet the Pilot, world’s first language translating earbuds [ 96 ]

The world’s first smart earpiece, Pilot, will soon translate between over 15 languages. According to Springwise, Waverly Labs’ Pilot can already translate five spoken languages (English, French, Italian, Portuguese, and Spanish) and seven additional written languages (German, Hindi, Russian, Japanese, Arabic, Korean and Mandarin Chinese). The Pilot earpiece connects via Bluetooth to the Pilot speech translation app, which uses speech recognition, machine translation, machine learning and speech synthesis technology; the user hears the translated version of the speech on the second earpiece almost simultaneously. Moreover, the conversation need not take place between only two people; other users can join in and discuss as a group. As of now, the user may experience a lag of a few seconds between the speech and its translation, which Waverly Labs is working to reduce. The Pilot earpiece will be available from September but can be pre-ordered now for $249. The earpieces can also be used for streaming music, answering voice calls, and getting audio notifications.

Link: https://www.indiegogo.com/projects/meet-the-pilot-smart-earpiece-language-translator-headphones-travel#/

4 Datasets in NLP and state-of-the-art models

The objective of this section is to present the various datasets used in NLP and some state-of-the-art models in NLP.

4.1 Datasets in NLP

A corpus is a collection of linguistic data, either compiled from written texts or transcribed from recorded speech. Corpora are intended primarily for testing linguistic hypotheses, e.g., to determine how a certain sound, word, or syntactic construction is used across a culture or language. There are various types of corpus. In an annotated corpus, the implicit information in the plain text has been made explicit by specific annotations, while an un-annotated corpus contains plain text in its raw state. Different languages can be compared using a comparable corpus. Monitor corpora are non-finite collections of texts, mostly used in lexicography. A multilingual corpus contains small collections of monolingual corpora based on the same sampling procedure and categories for different languages. A parallel corpus contains texts in one language and their translations into other languages, aligned sentence by sentence or phrase by phrase. A reference corpus contains text of spoken (formal and informal) and written (formal and informal) language representing various social and situational contexts. A speech corpus contains recorded speech along with transcriptions and the time at which each word occurred. There are various datasets available for natural language processing; some are listed below for different use cases.

Sentiment Analysis: Sentiment analysis is a rapidly expanding field of natural language processing (NLP) used in a variety of domains such as politics and business. The most widely used datasets for sentiment analysis are:

Stanford Sentiment Treebank (SST): Socher et al. introduced SST, which contains sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences from movie reviews, posing novel challenges for sentiment compositionality [ 127 ].

Sentiment140: It contains 1.6 million tweets annotated with negative, neutral and positive labels.

Paper Reviews: It provides reviews of computing and informatics conferences written in English and Spanish languages. It has 405 reviews which are evaluated on a 5-point scale ranging from very negative to very positive.

IMDB: For natural language processing, text analytics, and sentiment analysis, this dataset offers thousands of movie reviews split into training and test sets. This dataset was introduced by Maas et al. in 2011 [ 73 ].

G. Rama Rohit Reddy of the Language Technologies Research Centre, KCIS, IIIT Hyderabad, created the corpus "Sentiraama." The corpus is divided into four datasets, each annotated on a two-value scale that distinguishes positive from negative sentiment at the document level. The corpus contains data from a variety of fields, including book reviews, product reviews, movie reviews, and song lyrics, and the annotators meticulously followed the annotation technique for each of them. The folder "Song Lyrics" in the corpus contains 339 Telugu song lyrics written in Telugu script [ 121 ].

Language Modelling: Language models analyse text data to calculate word probabilities. They use an algorithm to interpret the data, which establishes rules for context in natural language, and then use these rules to accurately predict or construct new sentences. The model essentially learns the basic characteristics and features of the language and applies them to new phrases. Widely used datasets for language modelling are as follows:

Salesforce’s WikiText-103 dataset has 103 million tokens collected from 28,475 featured articles from Wikipedia.

WikiText-2 is a scaled-down version of WikiText-103. It contains 2 million tokens with a vocabulary size of 33,278.

The Penn Treebank portion of the Wall Street Journal corpus includes 929,000 tokens for training, 73,000 tokens for validation, and 82,000 tokens for testing. Its context is limited since it comprises sentences rather than paragraphs [ 76 ].

The Ministry of Electronics and Information Technology’s Technology Development Programme for Indian Languages (TDIL) launched its own data distribution portal ( www.tdil-dc.in ) which has cataloged datasets [ 24 ].
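The word-probability idea behind language modelling described above can be sketched as a bigram model estimated from counts; the corpus here is a toy:

```python
from collections import Counter

# A bare-bones bigram language model: estimate P(word | previous word)
# from corpus counts. Real models smooth these estimates and use far
# larger contexts, but the core quantity is the same.

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # counts of words that have a successor

def prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(prob("the", "cat"))  # 2 of the 3 occurrences of "the" are followed by "cat"
```

Sentence probabilities are products of such conditional terms, which is why datasets like WikiText-103 matter: the more text, the better these counts approximate true usage.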

Machine Translation: The task of converting text in one natural language into another language while preserving the meaning of the input is known as machine translation. Widely used datasets are as follows:

Tatoeba is a collection of multilingual sentence pairings. A tab-delimited pair of an English text sequence and the translated French text sequence appears on each line of the dataset. Each text sequence might be as simple as a single sentence or as complex as a paragraph of many sentences.

The Europarl parallel corpus is derived from the European Parliament’s proceedings. It is available in 21 European languages [ 40 ].

WMT14 provides machine translation pairs for English-German and English-French. These datasets comprise 4.5 million and 35 million sentence pairs, respectively. Byte-pair encoding with 32K merge operations is used to encode the sentences.

The IWSLT 14 dataset contains around 160,000 sentence pairs in English-German (En-De) and German-English (De-En). The IWSLT 13 dataset has around 200K training sentence pairs.

The IIT Bombay English-Hindi corpus comprises parallel corpora for English-Hindi as well as monolingual Hindi corpora gathered from several existing sources and corpora generated over time at IIT Bombay’s Centre for Indian Language Technology.

Question Answering System: Question answering systems provide real-time responses which are widely used in customer care services. The datasets used for dialogue system/question answering system are as follows:

Stanford Question Answering Dataset (SQuAD): it is a reading comprehension dataset made up of questions posed by crowd workers on a collection of Wikipedia articles.

Natural Questions: It is a large-scale corpus presented by Google used for training and assessing open-domain question answering systems. It includes 300,000 naturally occurring queries as well as human-annotated responses from Wikipedia pages for use in QA system training.

Question Answering in Context (QuAC): This dataset is used to describe, comprehend, and participate in information seeking conversation. In this dataset, instances are made up of an interactive discussion between two crowd workers: a student who asks a series of open-ended questions about an unknown Wikipedia text, and a teacher who responds by offering brief extracts from the text.

Neural models are overtaking traditional models in NLP [ 64 , 127 ]. In [ 64 ], the authors used a CNN (convolutional neural network) model for sentiment analysis of movie reviews and achieved 81.5% accuracy, illustrating that a CNN was an appropriate replacement for the state-of-the-art methods. The authors of [ 127 ] combined SST with a Recursive Neural Tensor Network for sentiment analysis of single sentences; this model improves accuracy by 5.4% for sentence classification compared to traditional NLP models. The authors of [ 135 ] proposed a combined recurrent neural network and Transformer model for sentiment analysis. This hybrid model was tested on three different datasets (Twitter US Airline Sentiment, IMDB, and Sentiment140) and achieved F1 scores of 91%, 93%, and 90%, respectively, outperforming state-of-the-art methods.

Santoro et al. [ 118 ] introduced a relational recurrent neural network with the capacity to learn to classify information and perform complex reasoning based on the interactions between compartmentalized information, using a relational memory core to handle those interactions. The model was tested for language modeling on three different datasets (GigaWord, Project Gutenberg, and WikiText-103), and its performance was compared with traditional approaches for relational reasoning over compartmentalized information. The results achieved with the RMC show improved performance.

Merity et al. [ 86 ] extended conventional word-level language models based on Quasi-Recurrent Neural Network and LSTM to handle the granularity at character and word level. They tuned the parameters for character-level modeling using Penn Treebank dataset and word-level modeling using WikiText-103. In both cases, their model outshined the state-of-art methods.

Luong et al. [ 70 ] used neural machine translation on the WMT14 dataset and performed translation of English text to French text. The model demonstrated a significant improvement of up to 2.8 bi-lingual evaluation understudy (BLEU) scores compared to various neural machine translation systems. It outperformed the commonly used MT system on a WMT 14 dataset.

Fan et al. [ 41 ] introduced a gradient-based neural architecture search algorithm that automatically finds architectures with better performance than Transformer and conventional NMT models. They tested their model on WMT14 (English-German translation), IWSLT14 (German-English translation), and WMT18 (Finnish-English translation) and achieved 30.1, 36.1, and 26.4 BLEU points, showing better performance than Transformer baselines.

Wiese et al. [ 150 ] introduced a deep learning approach based on domain adaptation techniques for handling biomedical question answering tasks. Their model achieved state-of-the-art performance on biomedical question answering, outperforming prior methods across domains.

Seunghak et al. [ 158 ] designed a Memory-Augmented-Machine-Comprehension-Network (MAMCN) to handle dependencies faced in reading comprehension. The model achieved state-of-the-art performance on document-level using TriviaQA and QUASAR-T datasets, and paragraph-level using SQuAD datasets.

Xie et al. [ 154 ] proposed a neural architecture where candidate answers and their representation learning are constituent centric, guided by a parse tree. Under this architecture, the search space of candidate answers is reduced while preserving the hierarchical, syntactic, and compositional structure among constituents. Using SQuAD, the model delivers state-of-the-art performance.

4.2 State-of-the-art models in NLP

The rationalist or symbolic approach assumes that a crucial part of the knowledge in the human mind is not derived from the senses but is fixed in advance, presumably by genetic inheritance; Noam Chomsky was the strongest advocate of this approach. It was believed that machines could be made to function like the human brain by giving them fundamental knowledge and reasoning mechanisms: linguistic knowledge is directly encoded in rules or other forms of representation, which supports the automatic processing of natural language [ 92 ]. Statistical and machine learning approaches instead involve algorithms that allow a program to infer patterns: an iterative learning phase optimizes the numerical parameters of a model against a numerical measure of fit. Machine-learning models can be predominantly categorized as either generative or discriminative. Generative methods model rich probability distributions and can therefore generate synthetic data; discriminative methods are more direct, estimating posterior probabilities from observations. Srihari [ 129 ] explains generative models with an analogy: identifying an unknown speaker's language resembles drawing on deep knowledge of numerous languages to perform a match, whereas discriminative methods rely on a less knowledge-intensive approach that simply distinguishes between languages. Generative models can become troublesome when many features are used, whereas discriminative models allow the use of more features [ 38 ]. Examples of discriminative methods are logistic regression and conditional random fields (CRFs); examples of generative methods are naive Bayes classifiers and hidden Markov models (HMMs).

Naive Bayes Classifiers

Naive Bayes is a probabilistic algorithm based on probability theory and Bayes' theorem, used to predict the tag of a text such as a news article or customer review. It calculates the probability of each tag for the given text and returns the tag with the highest probability. Bayes' theorem is used to predict the probability of a feature based on prior knowledge of conditions that might be related to that feature. Naive Bayes classifiers appear in NLP not only in common tasks such as segmentation and translation but also in less usual areas like segmentation for infant learning and distinguishing documents expressing opinions from those stating facts. Anggraeni et al. (2019) [ 61 ] used ML and AI to create a question-and-answer system for retrieving information about hearing loss. They developed I-Chat Bot, which understands user input, provides an appropriate response, and produces a model that can be used to search for information about hearing impairments. A problem with naive Bayes is that we may end up with zero probabilities when words in the test data for a certain class are not present in the training data.
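The zero-probability issue has a standard fix, add-one (Laplace) smoothing; a sketch with an invented toy training set, assuming uniform class priors for brevity:

```python
import math
from collections import Counter

# A tiny naive Bayes text classifier illustrating add-one (Laplace)
# smoothing: unseen words get a small non-zero probability instead of
# zeroing out the whole product. Data and labels below are invented.

train = [("win free prize", "spam"), ("free money win", "spam"),
         ("meeting agenda today", "ham"), ("project report due", "ham")]

counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = set(w for c in counts.values() for w in c)

def log_likelihood(text, label):
    c = counts[label]
    total = sum(c.values())
    # (count + 1) / (total + |V|): the add-one smoothed word probability
    return sum(math.log((c[w] + 1) / (total + len(vocab)))
               for w in text.split())

def classify(text):
    # priors are uniform here, so the likelihood alone decides
    return max(counts, key=lambda lbl: log_likelihood(text, lbl))

print(classify("free prize inside"))  # "inside" is unseen but does not break scoring
```

Without the `+ 1`, the unseen word "inside" would give probability zero under both classes and make the comparison meaningless; log-space summation also avoids numerical underflow on long texts.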

Hidden Markov Model (HMM)

An HMM is a system in which transitions take place between several states, generating feasible output symbols with each switch. The sets of possible states and unique symbols may be large, but they are finite and known. We can observe the outputs, but the system's internals are hidden. Several problems can be posed for such a system. Inference: given a sequence of output symbols, compute the probabilities of one or more candidate state sequences. Decoding: find the state-switch sequence most likely to have generated a particular output-symbol sequence. Training: given output-symbol sequences, estimate the state-transition and output probabilities that best fit the data.

Hidden Markov models are extensively used for speech recognition, where the output sequence is matched to the sequence of individual phonemes. HMMs are not restricted to this application; they have several others, such as bioinformatics problems, for example multiple sequence alignment [ 128 ]. Sonnhammer mentioned that Pfam holds multiple alignments and hidden Markov model-based profiles (HMM-profiles) of entire protein domains. The curation of domain boundaries, family membership and alignments is done semi-automatically, based on expert knowledge, sequence similarity, other protein family databases and the capability of HMM-profiles to correctly identify and align the members. HMMs may be used for a variety of NLP applications, including word prediction, sentence production, quality assurance, and intrusion detection systems [ 133 ].
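The decoding problem, recovering the most likely hidden-state sequence, is solved by the Viterbi algorithm; a compact sketch with invented POS-style probabilities:

```python
# A compact Viterbi decoder: dynamic programming over the best path
# ending in each state. All probabilities below are invented for
# illustration (a toy two-state part-of-speech tagger).

def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][o], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dogs": 0.7, "bark": 0.3}, "Verb": {"dogs": 0.1, "bark": 0.9}}

print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['Noun', 'Verb']
```

Production implementations work in log space to avoid underflow on long sequences, but the recurrence is exactly this one.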

Neural Network

Earlier machine learning techniques such as naive Bayes and HMMs were mainly used for NLP, but by the end of 2010 neural networks had transformed and enhanced NLP tasks by learning multilevel features. A major use of neural networks in NLP is word embedding, where words are represented as vectors; similar words can be recognized by their closeness in this vector space. Other uses of neural networks are found in information retrieval, text summarization, text classification, machine translation, sentiment analysis and speech recognition. The initial focus was on feedforward [ 49 ] and CNN (convolutional neural network) architectures [ 69 ], but later researchers adopted recurrent neural networks to capture the context of a word with respect to the surrounding words in a sentence. LSTM (Long Short-Term Memory), a variant of the RNN, is used in tasks such as word prediction and sentence topic prediction [ 47 ]. To observe word arrangements in both the forward and backward directions, researchers have explored bi-directional LSTMs [ 59 ]. For machine translation, an encoder-decoder architecture is used, where the lengths of the input and output sequences are not fixed in advance. Neural networks can also be used to anticipate states that have not yet been seen, such as future states for which predictors exist, whereas HMMs predict hidden states.
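Closeness in the embedding space is usually measured with cosine similarity; a sketch with invented 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

# Word-embedding similarity in miniature: with words represented as vectors,
# closeness in the vector space signals semantic relatedness. The 3-d
# vectors below are invented purely for illustration.

embeddings = {
    "king":  [0.8, 0.65, 0.1],
    "queen": [0.75, 0.7, 0.12],
    "apple": [0.1, 0.05, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much smaller
```

Nearest-neighbour search over such similarities is how "similar words" are actually retrieved from trained embeddings like word2vec or GloVe.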

Bi-directional Encoder Representations from Transformers (BERT) is a model pre-trained on unlabeled text from BookCorpus and English Wikipedia. It can be fine-tuned to capture context for various NLP tasks such as question answering, sentiment analysis, text classification, sentence embedding and interpreting ambiguity in text [ 25 , 33 , 90 , 148 ]. Earlier language models examine text in only one direction, which suits sentence generation by predicting the next word, whereas BERT examines text in both directions simultaneously for better language understanding. BERT provides a contextual embedding for each word in the text, unlike context-free models such as word2vec and GloVe. For example, in the sentences "he is going to the riverbank for a walk" and "he is going to the bank to withdraw some money", word2vec has a single vector representation for "bank" in both sentences, whereas BERT gives "bank" a different vector representation in each. Muller et al. [ 90 ] used the BERT model to analyze tweets with covid-19 content. The use of the BERT model in the legal domain was explored by Chalkidis et al. [ 20 ].

Since BERT accepts at most 512 tokens, a longer text sequence must be divided into multiple shorter sequences of up to 512 tokens each. This is a limitation of BERT: it does not handle long text sequences well.
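The chunking workaround just described can be sketched as follows. This is a minimal illustration using a plain whitespace split in place of a real subword tokenizer; the optional overlap (stride) between chunks, so that context at chunk boundaries is not lost entirely, is a common refinement:

```python
def chunk_tokens(tokens, max_len=512, stride=0):
    """Split a token list into chunks of at most max_len tokens.

    A positive stride makes consecutive chunks overlap, preserving
    some context across chunk boundaries.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    return [tokens[i:i + max_len] for i in range(0, len(tokens), step)]

# A whitespace split stands in for a real subword tokenizer here.
tokens = ("some long document " * 400).split()   # 1200 tokens
chunks = chunk_tokens(tokens, max_len=512, stride=64)
print([len(c) for c in chunks])  # -> [512, 512, 304]
```

Each chunk can then be encoded separately and the per-chunk outputs pooled (e.g. averaged), at the cost of losing long-range dependencies that span chunk boundaries.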

5 Evaluation metrics and challenges

The objective of this section is to discuss the evaluation metrics used to assess a model’s performance and the challenges involved.

5.1 Evaluation metrics

Since the number of labels in most classification problems is fixed, it is easy to determine the score for each class and, as a result, the loss from the ground truth. In image generation problems, the output resolution and the ground truth are both fixed, so the loss can be computed at the pixel level against the ground truth. In NLP, however, even though the output format is predetermined, its dimensions cannot be fixed, because a single statement can be expressed in multiple ways without changing its intent and meaning. Evaluation metrics are therefore important for assessing a model’s performance, especially when a single model is applied to more than one task.

BLEU (BiLingual Evaluation Understudy) Score: Each word in the output sentence scores 1 if it appears in any of the reference sentences and 0 if it does not. The number of matched words is then divided by the total number of words in the output sentence, normalizing the count so that it always lies between 0 and 1. For example, suppose the ground truth is “He is playing chess in the backyard” and the output sentences are S1: “He is playing tennis in the backyard”, S2: “He is playing badminton in the backyard”, S3: “He is playing movie in the backyard” and S4: “backyard backyard backyard backyard backyard backyard backyard”. S1, S2 and S3 each score 6/7, even though the information in S1 and S3 is not the same. This is because BLEU assumes that every word in a sentence contributes equally to its meaning, which is not the case in real-world scenarios. By combining uni-gram, bi-gram and higher-order n-gram precisions, we can capture word order. We may also clip the count of each word to the maximum number of times it appears in any reference sentence, which prevents a sentence such as S4 from being rewarded for excessive repetition.
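The clipped unigram precision described above can be sketched as follows. This is only the unigram component; full BLEU additionally combines several n-gram precisions with a brevity penalty:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Unigram precision with clipping: each candidate word is counted
    at most as many times as it appears in the best-matching reference."""
    cand_counts = Counter(candidate.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

refs = ["He is playing chess in the backyard"]
# S1 from the example above: 6 of 7 words match the reference.
print(clipped_unigram_precision("He is playing tennis in the backyard", refs))  # -> 6/7
# S4: without clipping this would score 7/7; clipping reduces it to 1/7.
print(clipped_unigram_precision("backyard backyard backyard backyard backyard backyard backyard", refs))
```

The clipping step is what stops degenerate outputs like S4 from gaming the metric, as noted in the paragraph above.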

GLUE (General Language Understanding Evaluation) score: Previously, NLP models were almost always built to perform well on a single job. Models such as LSTM and Bi-LSTM were trained solely for one task and very rarely generalized to others; a model trained for named entity recognition, for instance, could seldom be reused for textual entailment. GLUE is a set of datasets for training, assessing, and comparing NLP models. It includes nine diverse task datasets designed to test a model’s language understanding. To acquire a comprehensive assessment of a model’s performance, GLUE evaluates the model on a variety of tasks rather than a single one. These include single-sentence tasks, similarity and paraphrase tasks, and inference tasks. For example, in sentiment analysis of customer reviews, we might be interested in analyzing ambiguous reviews and determining which product a client is referring to. A model thus obtains a good “knowledge” of language in general after some generalized pre-training, and this universal “knowledge” gives us an advantage when the model is adapted to a given task. With GLUE, researchers can evaluate their model and score it on all nine tasks; the final performance score is the average of those nine scores. It makes little difference how the model looks or works internally as long as it can analyze inputs and predict outcomes for all the tasks.
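As a sketch of the scoring just described, the headline GLUE score is the average over the nine task scores. The per-task numbers below are made up for illustration, not real leaderboard results:

```python
# Hypothetical per-task scores for a model on the nine GLUE tasks.
glue_scores = {
    "CoLA": 52.1, "SST-2": 93.5, "MRPC": 88.9,
    "STS-B": 85.8, "QQP": 71.2, "MNLI": 84.6,
    "QNLI": 90.5, "RTE": 66.4, "WNLI": 65.1,
}

# The headline GLUE score is the unweighted average of the task scores.
glue_score = sum(glue_scores.values()) / len(glue_scores)
print(round(glue_score, 1))  # -> 77.6
```

Because the average is unweighted, a model cannot reach a high GLUE score by excelling at only one or two tasks; it must perform reasonably across all nine.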

Keeping these metrics in mind helps in evaluating the performance of an NLP model on a particular task or across a variety of tasks.

5.2 Challenges

The applications of NLP have been growing day by day, and with them new challenges keep arising despite the considerable work done in the recent past. Some of the common challenges are the following. Contextual words and phrases, where the same words and phrases can have different meanings in a sentence, are easy for humans to understand but pose a challenging task for machines. Similar challenges arise with synonyms, because humans use many different words to express the same idea, and different speakers may use words of different intensity (such as large, huge, and big); designing algorithms that handle all these variations is difficult. Homonyms, words that are pronounced the same but have different definitions, are problematic for question answering and speech-to-text applications because such input does not arrive in written form. Sentences using sarcasm and irony may be understood in the opposite sense even by humans, so designing models to deal with such sentences is a genuinely challenging NLP task. Furthermore, sentences with any type of ambiguity, in the sense of allowing more than one interpretation, remain an area where more accuracy can be achieved. Language containing informal phrases, expressions, idioms, and culture-specific lingo makes it difficult to design models intended for broad use; a lot of data for training and regular updating may improve the models, but dealing with words that carry different meanings in different geographic areas remains a real challenge. Similar issues occur across domains: the same word or sentence may mean one thing in the education industry and something different in health, law, defense, etc.
So, NLP models may work well for an individual domain or geographic area, but for broad use such challenges need to be tackled. Together with the challenges mentioned above, misspelled or misused words can also create problems; although autocorrect and grammar-correction applications have improved greatly due to continuous developments in this direction, predicting the intention of a writer, that too from a specific domain or geographic area, while accounting for sarcasm, expressions, and informal phrases, is a really big challenge. There is no doubt that for the most widely used languages NLP models have been performing very well and are improving day by day, but there is still a need for models that serve all people rather than requiring specific knowledge of a particular language and technology. One may further refer to the work of Sharifirad and Matwin (2019) [ 123 ] for the classification of different online harassment categories and their challenges, Baclic et al. (2020) [ 6 ] and Wong et al. (2018) [ 151 ] for challenges and opportunities in public health, Kang et al. (2020) [ 63 ] for a detailed literature survey and the technological challenges relevant to management research and NLP, and a recent review by Alshemali and Kalita (2020) [ 3 ] and the references cited therein.

In the recent past, models combining Visual Commonsense Reasoning [ 31 ] and NLP have also been attracting the attention of several researchers, and this seems a promising and challenging area to work on. These models try to extract information from an image or video using a visual reasoning paradigm, much as humans can infer from a given image or video things beyond what is visually obvious, such as objects’ functions, people’s intents, and mental states. In this direction, Wen and Peng (2020) [ 149 ] recently suggested a model that captures knowledge from different perspectives and perceives common sense in advance, and the results of their experiments on the visual commonsense reasoning dataset VCR seem very satisfactory and effective. The work of Peng and Chi (2019) [ 102 ], which proposes a Domain Adaptation with Scene Graph approach to transfer knowledge from a source domain with the objective of improving cross-media retrieval in the target domain, and that of Yen et al. (2019) [ 155 ], is also very useful for further exploring the use of NLP in its relevant domains.

6 Conclusion

This paper was written with three objectives. The first is to give insights into the various important terminologies of NLP and NLG, which can be useful for readers interested in starting an early career in NLP and in work relevant to its applications. The second focuses on the history, applications, and recent developments in the field of NLP. The third is to discuss the datasets, approaches and evaluation metrics used in NLP. The relevant work in the existing literature, with its findings, and some important applications and projects in NLP are also discussed in the paper. The last two objectives may serve as a literature survey for readers already working in NLP and relevant fields, and can further motivate them to explore the fields mentioned in this paper. It should be noted that although a great amount of work on natural language processing is available in literature surveys (one may refer to [ 15 , 32 , 63 , 98 , 133 , 151 ], each focusing on one domain such as the usage of deep-learning techniques in NLP, techniques for email spam filtering, medication safety, management research, intrusion detection, or the Gujarati language), there is still not much work on regional languages, which can be the focus of future research.

Change history

25 July 2022

Affiliation 3 has been added into the online PDF.

Ahonen H, Heinonen O, Klemettinen M, Verkamo AI (1998) Applying data mining techniques for descriptive phrase extraction in digital document collections. In research and technology advances in digital libraries, 1998. ADL 98. Proceedings. IEEE international forum on (pp. 2-11). IEEE

Alshawi H (1992) The core language engine. MIT press

Alshemali B, Kalita J (2020) Improving the reliability of deep neural networks in NLP: A review. Knowl-Based Syst 191:105210


Andreev ND (1967) The intermediary language as the focal point of machine translation. In: Booth AD (ed) Machine translation. North Holland Publishing Company, Amsterdam, pp 3–27


Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD, Stamatopoulos P (2000) Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009

Baclic O, Tunis M, Young K, Doan C, Swerdfeger H, Schonfeld J (2020) Artificial intelligence in public health: challenges and opportunities for public health made possible by advances in natural language processing. Can Commun Dis Rep 46(6):161

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In ICLR 2015

Bangalore S, Rambow O, Whittaker S (2000) Evaluation metrics for generation. In proceedings of the first international conference on natural language generation-volume 14 (pp. 1-8). Assoc Comput Linguist

Baud RH, Rassinoux AM, Scherrer JR (1991) Knowledge representation of discharge summaries. In AIME 91 (pp. 173–182). Springer, Berlin Heidelberg

Baud RH, Rassinoux AM, Scherrer JR (1992) Natural language processing and semantical representation of medical texts. Methods Inf Med 31(2):117–125

Baud RH, Alpay L, Lovis C (1994) Let’s meet the users with natural language understanding. Knowledge and Decisions in Health Telematics: The Next Decade 12:103

Bengio Y, Ducharme R, Vincent P (2001) A neural probabilistic language model. Proceedings of NIPS

Benson E, Haghighi A, Barzilay R (2011) Event discovery in social media feeds. In proceedings of the 49th annual meeting of the Association for Computational Linguistics: human language technologies-volume 1 (pp. 389-398). Assoc Comput Linguist

Berger AL, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39–71

Blanzieri E, Bryl A (2008) A survey of learning-based techniques of email spam filtering. Artif Intell Rev 29(1):63–92

Bondale N, Maloor P, Vaidyanathan A, Sengupta S, Rao PV (1999) Extraction of information from open-ended questionnaires using natural language processing techniques. Computer Science and Informatics 29(2):15–22

Borst F, Sager N, Nhàn NT, Su Y, Lyman M, Tick LJ, ..., Scherrer JR (1989) Analyse automatique de comptes rendus d'hospitalisation. In Degoulet P, Stephan JC, Venot A, Yvon PJ, rédacteurs. Informatique et Santé, Informatique et Gestion des Unités de Soins, Comptes Rendus du Colloque AIM-IF, Paris (pp. 246–56)

Briscoe EJ, Grover C, Boguraev B, Carroll J (1987) A formalism and environment for the development of a large grammar of English. IJCAI 87:703–708

Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015

Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I (2020) LEGAL-BERT: the muppets straight out of law school. arXiv preprint arXiv:2010.02559

Chi EC, Lyman MS, Sager N, Friedman C, Macleod C (1985) A database of computer-structured narrative: methods of computing complex relations. In proceedings of the annual symposium on computer application in medical care (p. 221). Am Med Inform Assoc

Cho K, Van Merriënboer B, Bahdanau D, Bengio Y, (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259

Chomsky N (1965) Aspects of the theory of syntax. MIT Press, Cambridge, Massachusetts

Choudhary N (2021) LDC-IL: the Indian repository of resources for language technology. Lang Resources & Evaluation 55:855–867. https://doi.org/10.1007/s10579-020-09523-3

Chouikhi H, Chniter H, Jarray F (2021) Arabic sentiment analysis using BERT model. In international conference on computational collective intelligence (pp. 621-632). Springer, Cham

Chung J, Gulcehre C, Cho K, Bengio Y, (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555

Cohen WW (1996) Learning rules that classify e-mail. In AAAI spring symposium on machine learning in information access (Vol. 18, p. 25)

Cohen PR, Morgan J, Ramsay AM (2002) Intention in communication, Am J Psychol 104(4)

Collobert R, Weston J (2008) A unified architecture for natural language processing. In proceedings of the 25th international conference on machine learning (pp. 160–167)

Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R, (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

Davis E, Marcus G (2015) Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun ACM 58(9):92–103

Desai NP, Dabhi VK (2022) Resources and components for Gujarati NLP systems: a survey. Artif Intell Rev:1–19

Devlin J, Chang MW, Lee K, Toutanova K, (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Assoc Computat Linguist

Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In proceedings of the second international conference on human language technology research (pp. 138-145). Morgan Kaufmann publishers Inc

Drucker H, Wu D, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054

Dunlavy DM, O’Leary DP, Conroy JM, Schlesinger JD (2007) QCS: A system for querying, clustering and summarizing documents. Inf Process Manag 43(6):1588–1605

Elkan C (2008) Log-Linear Models and Conditional Random Fields. http://cseweb.ucsd.edu/welkan/250B/cikmtutorial.pdf accessed 28 Jun 2017.

Emele MC, Dorna M (1998) Ambiguity preserving machine translation using packed representations. In proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics-volume 1 (pp. 365-371). Association for Computational Linguistics

Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In MT Summit 2005

Fan Y, Tian F, Xia Y, Qin T, Li XY, Liu TY (2020) Searching better architectures for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28:1574–1585

Fang H, Lu W, Wu F, Zhang Y, Shang X, Shao J, Zhuang Y (2015) Topic aspect-oriented summarization via group selection. Neurocomputing 149:1613–1619

Fattah MA, Ren F (2009) GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Comput Speech Lang 23(1):126–144

Feldman S (1999) NLP meets the jabberwocky: natural language processing in information retrieval. Online-Weston Then Wilton 23:62–73

Friedman C, Cimino JJ, Johnson SB (1993) A conceptual model for clinical radiology reports. In proceedings of the annual symposium on computer application in medical care (p. 829). Am Med Inform Assoc

Gao T, Dontcheva M, Adar E, Liu Z, Karahalios K DataTone: managing ambiguity in natural language interfaces for data visualization, UIST ‘15: proceedings of the 28th annual ACM symposium on User Interface Software & Technology, November 2015, 489–500, https://doi.org/10.1145/2807442.2807478

Ghosh S, Vinyals O, Strope B, Roy S, Dean T, Heck L (2016) Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint arXiv:1602.06291

Glasgow B, Mandell A, Binney D, Ghemri L, Fisher D (1998) MITA: an information-extraction approach to the analysis of free-form text in life insurance applications. AI Mag 19(1):59

Goldberg Y (2017) Neural network methods for natural language processing. Synthesis lectures on human language technologies 10(1):1–309

Gong Y, Liu X (2001) Generic text summarization using relevance measure and latent semantic analysis. In proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 19-25). ACM

Green Jr, BF, Wolf AK, Chomsky C, Laughery K (1961) Baseball: an automatic question-answerer. In papers presented at the may 9-11, 1961, western joint IRE-AIEE-ACM computer conference (pp. 219-224). ACM

Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28(10):2222–2232


Grishman R, Sager N, Raze C, Bookchin B (1973) The linguistic string parser. In proceedings of the June 4-8, 1973, national computer conference and exposition (pp. 427-434). ACM

Hayes PJ (1992) Intelligent high-volume text processing using shallow, domain-specific techniques. Text-based intelligent systems: current research and practice in information extraction and retrieval, 227-242.

Hendrix GG, Sacerdoti ED, Sagalowicz D, Slocum J (1978) Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS) 3(2):105–147

"Here’s Why Natural Language Processing is the Future of BI" (2017) SmartData Collective. N.p., n.d. Web. 19

Hirschman L, Grishman R, Sager N (1976) From text to structured information: automatic processing of medical reports. In proceedings of the June 7-10, 1976, national computer conference and exposition (pp. 267-275). ACM

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991

Hutchins WJ (1986) Machine translation: past, present, future (p. 66). Ellis Horwood, Chichester

Jurafsky D, Martin J (2008) H. Speech and language processing. 2nd edn. Prentice-Hall, Englewood Cliffs, NJ

Kamp H, Reyle U (1993) Tense and aspect. In from discourse to logic (pp. 483-689). Springer Netherlands

Kang Y, Cai Z, Tan CW, Huang Q, Liu H (2020) Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics 7(2):139–172

Kim Y. (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882

Knight K, Langkilde I (2000) Preserving ambiguities in generation via automata intersection. In AAAI/IAAI (pp. 697-702)

Lass R (1998) Phonology: An Introduction to Basic Concepts. Cambridge, UK; New York; Melbourne, Australia: Cambridge University Press. p. 1. ISBN 978–0–521-23728-4. Retrieved 8 January 2011Paperback ISBN 0–521–28183-0

Lewis DD (1998) Naive (Bayes) at forty: The independence assumption in information retrieval. In European conference on machine learning (pp. 4–15). Springer, Berlin Heidelberg

Liddy ED (2001). Natural language processing

Lopez MM, Kalita J (2017) Deep learning applied to NLP. arXiv preprint arXiv:1703.03091

Luong MT, Sutskever I, Le Q V, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206

Lyman M, Sager N, Friedman C, Chi E (1985) Computer-structured narrative in ambulatory care: its use in longitudinal review of clinical data. In proceedings of the annual symposium on computer application in medical care (p. 82). Am Med Inform Assoc

Lyman M, Sager N, Chi EC, Tick LJ, Nhan NT, Su Y, ..., Scherrer, J. (1989) Medical Language Processing for Knowledge Representation and Retrievals. In Proceedings. Symposium on Computer Applications in Medical Care (pp. 548–553). Am Med Inform Assoc

Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies (pp. 142-150)

Mani I, Maybury MT (eds) (1999) Advances in automatic text summarization, vol 293. MIT press, Cambridge, MA

Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT press, Cambridge


Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of english: the penn treebank. Comput Linguist 19(2):313–330

McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, pp. 41-48)

McCray AT (1991) Natural language processing for intelligent information retrieval. In Engineering in Medicine and Biology Society, 1991. Vol. 13: 1991., Proceedings of the Annual International Conference of the IEEE (pp. 1160–1161). IEEE

McCray AT (1991) Extending a natural language parser with UMLS knowledge. In proceedings of the annual symposium on computer application in medical care (p. 194). Am Med Inform Assoc

McCray AT, Nelson SJ (1995) The representation of meaning in the UMLS. Methods Inf Med 34(1–2):193–201

McCray AT, Razi A (1994) The UMLS knowledge source server. Medinfo MedInfo 8:144–147

McCray AT, Srinivasan S, Browne AC (1994) Lexical methods for managing variation in biomedical terminologies. In proceedings of the annual symposium on computer application in medical care (p. 235). Am Med Inform Assoc

McDonald R, Crammer K, Pereira F (2005) Flexible text segmentation with structured multilabel classification. In proceedings of the conference on human language technology and empirical methods in natural language processing (pp. 987-994). Assoc Comput Linguist

McGray AT, Sponsler JL, Brylawski B, Browne AC (1987) The role of lexical knowledge in biomedical text understanding. In proceedings of the annual symposium on computer application in medical care (p. 103). Am Med Inform Assoc

McKeown KR (1985) Text generation. Cambridge University Press, Cambridge


Merity S, Keskar NS, Socher R (2018) An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240

Mikolov T, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems

Morel-Guillemaz AM, Baud RH, Scherrer JR (1990) Proximity processing of medical text. In medical informatics Europe’90 (pp. 625–630). Springer, Berlin Heidelberg

Morin E (1999) Automatic acquisition of semantic relations between terms from technical corpora. In proc. of the fifth international congress on terminology and knowledge engineering-TKE’99

Müller M, Salathé M, Kummervold PE (2020) Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv preprint arXiv:2005.07503

"Natural Language Processing" (2017) Natural Language Processing RSS. N.p., n.d. Web. 25

"Natural Language Processing" (2017) Natural Language Processing RSS. N.p., n.d. Web. 23

Newatia R (2019) https://medium.com/saarthi-ai/sentence-classification-using-convolutional-neural-networks-ddad72c7048c . Accessed 15 Dec 2021

Nhàn NT, Sager N, Lyman M, Tick LJ, Borst F, Su Y (1989) A medical language processor for two indo-European languages. In proceedings. Symposium on computer applications in medical care (pp. 554-558). Am Med Inform Assoc

Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In LREC

Ochoa, A. (2016). Meet the Pilot: Smart Earpiece Language Translator. https://www.indiegogo.com/projects/meet-the-pilot-smart-earpiece-language-translator-headphones-travel . Accessed April 10, 2017

Ogallo, W., & Kanter, A. S. (2017). Using natural language processing and network analysis to develop a conceptual framework for medication therapy management research. https://www.ncbi.nlm.nih.gov/pubmed/28269895?dopt=Abstract . Accessed April 10, 2017

Otter DW, Medina JR, Kalita JK (2020) A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32(2):604–624

Ouyang Y, Li W, Li S, Lu Q (2011) Applying regression models to query-focused multi-document summarization. Inf Process Manag 47(2):227–237

Palmer M, Gildea D, Kingsbury P (2005) The proposition bank: an annotated corpus of semantic roles. Computational linguistics 31(1):71–106

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Assoc Comput Linguist

Peng Y, Chi J (2019) Unsupervised cross-media retrieval using domain adaptation with scene graph. IEEE Transactions on Circuits and Systems for Video Technology 30(11):4368–4379

Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

Rae JW, Potapenko A, Jayakumar SM, Lillicrap TP, (2019) Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507

Ranjan P, Basu HVSSA (2003) Part of speech tagging and local word grouping techniques for natural language parsing in Hindi. In Proceedings of the 1st International Conference on Natural Language Processing (ICON 2003)

Rassinoux AM, Baud RH, Scherrer JR (1992) Conceptual graphs model extension for knowledge representation of medical texts. MEDINFO 92:1368–1374

Rassinoux AM, Michel PA, Juge C, Baud R, Scherrer JR (1994) Natural language processing of medical texts within the HELIOS environment. Comput Methods Prog Biomed 45:S79–S96

Rassinoux AM, Juge C, Michel PA, Baud RH, Lemaitre D, Jean FC, Scherrer JR (1995) Analysis of medical jargon: The RECIT system. In Conference on Artificial Intelligence in Medicine in Europe (pp. 42–52). Springer, Berlin Heidelberg

Rennie J (2000) ifile: An application of machine learning to e-mail filtering. In Proc. KDD 2000 Workshop on text mining, Boston, MA

Riedhammer K, Favre B, Hakkani-Tür D (2010) Long story short–global unsupervised models for keyphrase based meeting summarization. Speech Comm 52(10):801–815

Ritter A, Clark S, Etzioni O (2011) Named entity recognition in tweets: an experimental study. In proceedings of the conference on empirical methods in natural language processing (pp. 1524-1534). Assoc Comput Linguist

Rospocher M, van Erp M, Vossen P, Fokkens A, Aldabe I, Rigau G, Soroa A, Ploeger T, Bogaard T(2016) Building event-centric knowledge graphs from news. Web Semantics: Science, Services and Agents on the World Wide Web, In Press

Sager N, Lyman M, Tick LJ, Borst F, Nhan NT, Revillard C, … Scherrer JR (1989) Adapting a medical language processor from English to French. Medinfo 89:795–799

Sager N, Lyman M, Nhan NT, Tick LJ (1995) Medical language processing: applications to patient data representation and automatic encoding. Methods Inf Med 34(1–2):140–146

Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In learning for text categorization: papers from the 1998 workshop (Vol. 62, pp. 98-105)

Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. arXiv preprint cs/0106040

Sakkis G, Androutsopoulos I, Paliouras G et al (2003) A memory-based approach to anti-spam filtering for mailing lists. Inf Retr 6:49–73. https://doi.org/10.1023/A:1022948414856

Santoro A, Faulkner R, Raposo D, Rae J, Chrzanowski M, Weber T, ..., Lillicrap T (2018) Relational recurrent neural networks. Adv Neural Inf Proces Syst, 31

Scherrer JR, Revillard C, Borst F, Berthoud M, Lovis C (1994) Medical office automation integrated into the distributed architecture of a hospital information system. Methods Inf Med 33(2):174–179

Seal D, Roy UK, Basak R (2020) Sentence-level emotion detection from text based on semantic rules. In: Tuba M, Akashe S, Joshi A (eds) Information and communication Technology for Sustainable Development. Advances in intelligent Systems and computing, vol 933. Springer, Singapore. https://doi.org/10.1007/978-981-13-7166-0_42


Sentiraama Corpus by Gangula Rama Rohit Reddy, Radhika Mamidi. Language Technologies Research Centre, KCIS, IIIT Hyderabad (n.d.) ltrc.iiit.ac.in/showfile.php?filename=downloads/sentiraama/

Sha F, Pereira F (2003) Shallow parsing with conditional random fields. In proceedings of the 2003 conference of the north American chapter of the Association for Computational Linguistics on human language technology-volume 1 (pp. 134-141). Assoc Comput Linguist

Sharifirad S, Matwin S, (2019) When a tweet is actually sexist. A more comprehensive classification of different online harassment categories and the challenges in NLP. arXiv preprint arXiv:1902.10584

Sharma S, Srinivas PYKL, Balabantaray RC (2016) Emotion Detection using Online Machine Learning Method and TLBO on Mixed Script. In Proceedings of Language Resources and Evaluation Conference 2016 (pp. 47–51)

Shemtov H (1997) Ambiguity management in natural language generation. Stanford University

Small SL, Cortell GW, Tanenhaus MK (1988) Lexical Ambiguity Resolutions. Morgan Kauffman, San Mateo, CA

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631-1642)

Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26(1):320–322

Srihari S (2010) Machine Learning: Generative and Discriminative Models. http://www.cedar.buffalo.edu/wsrihari/CSE574/Discriminative-Generative.pdf accessed 31 May 2017

Sun X, Morency LP, Okanohara D, Tsujii JI (2008) Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In proceedings of the 22nd international conference on computational linguistics-volume 1 (pp. 841-848). Assoc Comput Linguist

Sundheim BM, Chinchor NA (1993) Survey of the message understanding conferences. In proceedings of the workshop on human language technology (pp. 56-60). Assoc Comput Linguist

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems

Sworna ZT, Mousavi Z, Babar MA (2022) NLP methods in host-based intrusion detection Systems: A systematic review and future directions. arXiv preprint arXiv:2201.08066

Systems RAVN (2017) "RAVN Systems Launch the ACE Powered GDPR Robot - Artificial Intelligence to Expedite GDPR Compliance." Stock Market. PR Newswire, n.d. Web. 19

Tan KL, Lee CP, Anbananthen KSM, Lim KM (2022) RoBERTa-LSTM: A hybrid model for sentiment analysis with transformers and recurrent neural network. IEEE Access, RoBERTa-LSTM: A Hybrid Model for Sentiment Analysis With Transformer and Recurrent Neural Network

Tapaswi N, Jain S (2012) Treebank based deep grammar acquisition and part-of-speech tagging for Sanskrit sentences. In software engineering (CONSEG), 2012 CSI sixth international conference on (pp. 1-4). IEEE

Thomas C (2019)  https://towardsdatascience.com/recurrent-neural-networks-and-natural-language-processing-73af640c2aa1 . Accessed 15 Dec 2021

Tillmann C, Vogel S, Ney H, Zubiaga A, Sawaf H (1997) Accelerated DP based search for statistical translation. In Eurospeech

Umber A, Bajwa I (2011) “Minimizing ambiguity in natural language software requirements specification,” in Sixth Int Conf Digit Inf Manag, pp. 102–107

"Using Natural Language Processing and Network Analysis to Develop a Conceptual Framework for Medication Therapy Management Research (2017) " AMIA ... Annual Symposium proceedings. AMIA Symposium. U.S. National Library of Medicine, n.d. Web. 19

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, (2017) Attention is all you need. In advances in neural information processing systems (pp. 5998-6008)

Wahlster W, Kobsa A (1989) User models in dialog systems. In user models in dialog systems (pp. 4–34). Springer Berlin Heidelberg, User Models in Dialog Systems

Walton D (1996) A pragmatic synthesis. In: fallacies arising from ambiguity. Applied logic series, vol 1. Springer, Dordrecht)

Wan X (2008) Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Inf Retr 11(1):25–49

Wang W, Gang J, 2018 Application of convolutional neural network in natural language processing. In 2018 international conference on information Systems and computer aided education (ICISCAE) (pp. 64-70). IEEE

Wang D, Zhu S, Li T, Gong Y (2009) Multi-document summarization using sentence-based topic models. In proceedings of the ACL-IJCNLP 2009 conference short papers (pp. 297-300). Assoc Comput Linguist

Wang D, Zhu S, Li T, Chi Y, Gong Y (2011) Integrating document clustering and multidocument summarization. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(3):14–26

Wang Z, Ng P, Ma X, Nallapati R, Xiang B (2019) Multi-passage bert: A globally normalized bert model for open-domain question answering. arXiv preprint arXiv:1908.08167

Wen Z, Peng Y (2020) Multi-level knowledge injecting for visual commonsense reasoning. IEEE Transactions on Circuits and Systems for Video Technology 31(3):1042–1054

Wiese G, Weissenborn D, Neves M (2017) Neural domain adaptation for biomedical question answering. arXiv preprint arXiv:1706.03610

Wong A, Plasek JM, Montecalvo SP, Zhou L (2018) Natural language processing and its implications for the future of medication safety: a narrative review of recent advances and challenges. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy 38(8):822–841

Woods WA (1978) Semantics and quantification in natural language question answering. Adv Comput 17:1–87

Xia T (2020) A constant time complexity spam detection algorithm for boosting throughput on rule-based filtering Systems. IEEE Access 8:82653–82661. https://doi.org/10.1109/ACCESS.2020.2991328

Xie P, Xing E (2017) A constituent-centric neural architecture for reading comprehension. In proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers) (pp. 1405-1414)

Yan X, Ye Y, Mao Y, Yu H (2019) Shared-private information bottleneck method for cross-modal clustering. IEEE Access 7:36045–36056

Yi J, Nasukawa T, Bunescu R, Niblack W (2003) Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques. In data mining, 2003. ICDM 2003. Third IEEE international conference on (pp. 427-434). IEEE

Young SJ, Chase LL (1998) Speech recognition evaluation: a review of the US CSR and LVCSR programmes. Comput Speech Lang 12(4):263–279

Yu S, et al. (2018) "A multi-stage memory augmented neural network for machine reading comprehension." Proceedings of the workshop on machine reading for question answering

Zajic DM, Dorr BJ, Lin J (2008) Single-document and multi-document summarization techniques for email threads using sentence compression. Inf Process Manag 44(4):1600–1610

Zeroual I, Lakhouaja A, Belahbib R (2017) Towards a standard part of speech tagset for the Arabic language. J King Saud Univ Comput Inf Sci 29(2):171–178


Acknowledgements

The authors would like to express their gratitude to the research mentors from CL Educate: Accendere Knowledge Management Services Pvt. Ltd. for their comments on earlier versions of the manuscript, although any errors are our own and should not tarnish the reputations of these esteemed persons. We would also like to thank the Editor, Associate Editor, and anonymous referees for their constructive suggestions, which led to many improvements in an earlier version of this manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India

Diksha Khurana & Aditya Koli

Department of Computer Science, BML Munjal University, Gurgaon, India

Kiran Khatter

Department of Statistics, Amity University Punjab, Mohali, India

Sukhdev Singh


Corresponding author

Correspondence to Kiran Khatter.

Ethics declarations

Conflict of interest

The first draft of this paper was written under the supervision of Dr. Kiran Khatter and Dr. Sukhdev Singh, associated with CL Educate: Accendere Knowledge Management Services Pvt. Ltd. and deputed at Manav Rachna International University. The draft is also available on arXiv at https://arxiv.org/abs/1708.05148

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4


Received: 03 February 2021

Revised: 23 March 2022

Accepted: 02 July 2022

Published: 14 July 2022

Issue Date: January 2023

DOI: https://doi.org/10.1007/s11042-022-13428-4

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Natural language processing
  • Natural language understanding
  • Natural language generation
  • NLP applications
  • NLP evaluation metrics

Natural Language Processing: Recently Published Documents


Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages

Three Indo-Aryan languages, Bengali, Hindi and Nepali, are explored here at the character level to identify similarities and dissimilarities. Sharing the same root, Sanskrit, these Indic languages bear common characteristics, which gives computer and language scientists the opportunity to develop common Natural Language Processing (NLP) techniques and algorithms. With this in mind, we compare and analyze the three languages character by character. As an application of the hypothesis, we also developed a uniform lexicon-based sorting algorithm in two steps: first for Bengali and Nepali only, and then extended to Hindi. Our investigation with more than 30,000 words from each language suggests that the algorithm maintains the full accuracy set by the local language authorities of the respective languages, with good efficiency.
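The two-step sorting described above can be sketched as a custom collation: assign each character a rank in a unified alphabet and sort words by the resulting key. The alphabet fragment below is purely illustrative and is not the ordering the authors propose:

```python
# Sketch of a lexicon-based sort with a custom collation order.
# ALPHABET is a toy Bengali fragment, not the paper's actual ordering.
ALPHABET = ["অ", "আ", "ক", "খ", "গ"]
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def collation_key(word):
    """Map a word to a tuple of character ranks; unknown characters sort last."""
    return tuple(RANK.get(ch, len(ALPHABET)) for ch in word)

def lexicon_sort(words):
    """Sort words according to the custom collation order."""
    return sorted(words, key=collation_key)
```

The same pattern extends to a merged Bengali/Hindi/Nepali alphabet once a common character ordering is fixed.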

Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels: red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, essentially assigning higher importance to specific channels over others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, India's official language, is the fourth most spoken language globally and is widely spoken in India and South Asia. A dataset for image captioning in Hindi was manually created by translating the well-known MSCOCO dataset from English to Hindi. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results obtained illustrate that the proposed method outperforms the other baselines.
The proposed method has attained improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state-of-the-art. Qualities of the generated captions are further assessed manually in terms of adequacy and fluency to illustrate the proposed method’s efficacy.
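The channel attention idea described above can be sketched in a few lines: pool each channel to a single descriptor, pass the descriptors through a small 1D convolution across channels, and gate each channel with a sigmoid. The uniform kernel below is a stand-in for the learned weights in ECA-Net:

```python
import numpy as np

def eca_attention(x, kernel_size=3):
    """Efficient-channel-attention sketch for a feature map x of shape
    (channels, height, width): global average pool per channel, 1D conv
    across channels, sigmoid gate, then rescale each channel."""
    c = x.shape[0]
    pooled = x.mean(axis=(1, 2))              # global average pool -> (c,)
    pad = kernel_size // 2
    padded = np.pad(pooled, pad)
    # Toy uniform kernel; in ECA-Net this is a learned 1D convolution.
    kernel = np.ones(kernel_size) / kernel_size
    conv = np.array([np.dot(padded[i:i + kernel_size], kernel)
                     for i in range(c)])
    weights = 1.0 / (1.0 + np.exp(-conv))     # sigmoid gate per channel
    return x * weights[:, None, None]
```

Because the gate operates only on the pooled channel descriptors, the added cost is negligible compared with the backbone convolutions.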

Model Transformation Development Using Automated Requirements Analysis, Metamodel Matching, and Transformation by Example

In this article, we address how the production of model transformations (MT) can be accelerated by automation of transformation synthesis from requirements, examples, and metamodels. We introduce a synthesis process based on metamodel matching, correspondence patterns between metamodels, and completeness and consistency analysis of matches. We describe how the limitations of metamodel matching can be addressed by combining matching with automated requirements analysis and model transformation by example (MTBE) techniques. We show that in practical examples a large percentage of required transformation functionality can usually be constructed automatically, thus potentially reducing development effort. We also evaluate the efficiency of synthesised transformations. Our novel contributions are:

  • The concept of correspondence patterns between metamodels of a transformation.
  • Requirements analysis of transformations using natural language processing (NLP) and machine learning (ML).
  • Symbolic MTBE using “predictive specification” to infer transformations from examples.
  • Transformation generation in multiple MT languages and in Java, from an abstract intermediate language.

A Computational Look at Oral History Archives

Computational technologies have revolutionized the archival sciences field, prompting new approaches to process the extensive data in these collections. Automatic speech recognition and natural language processing create unique possibilities for analysis of oral history (OH) interviews, where otherwise the transcription and analysis of the full recording would be too time consuming. However, many oral historians note the loss of aural information when converting the speech into text, pointing out the relevance of subjective cues for a full understanding of the interviewee narrative. In this article, we explore various computational technologies for social signal processing and their potential application space in OH archives, as well as neighboring domains where qualitative studies is a frequently used method. We also highlight the latest developments in key technologies for multimedia archiving practices such as natural language processing and automatic speech recognition. We discuss the analysis of both visual (body language and facial expressions), and non-visual cues (paralinguistics, breathing, and heart rate), stating the specific challenges introduced by the characteristics of OH collections. We argue that applying social signal processing to OH archives will have a wider influence than solely OH practices, bringing benefits for various fields from humanities to computer sciences, as well as to archival sciences. Looking at human emotions and somatic reactions on extensive interview collections would give scholars from multiple fields the opportunity to focus on feelings, mood, culture, and subjective experiences expressed in these interviews on a larger scale.

Which environmental features contribute to positive and negative perceptions of urban parks? A cross-cultural comparison using online reviews and Natural Language Processing methods

Natural Language Processing for Smart Construction: Current Status and Future Directions

Attention-Based Unsupervised Keyphrase Extraction and Phrase Graph for COVID-19 Medical Literature Retrieval

Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine does not make it feasible to navigate each article to find critical information or keyphrases, and few tools provide a visualization of the phrases relevant to a query. There is therefore a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks; their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether self-attention can be utilized to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases, constructing a query relevancy phrase graph that visualizes the search corpus phrases by their relevancy and importance. Comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. This unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
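As a rough illustration of the idea, one can rank candidate phrases by how much attention their tokens receive. The scoring rule below is an assumption for illustration; the paper's actual aggregation over BERT heads and layers may differ:

```python
import numpy as np

def rank_phrases_by_attention(tokens, attention, candidates):
    """Rank candidate phrases by the average attention their tokens
    receive from all other tokens. `attention` is a toy
    (n_tokens x n_tokens) matrix standing in for a BERT self-attention
    head; in practice scores would be averaged over heads and layers."""
    received = attention.sum(axis=0)  # total attention flowing into each token
    index = {tok: i for i, tok in enumerate(tokens)}
    scores = {}
    for phrase in candidates:
        ids = [index[w] for w in phrase.split()]
        scores[phrase] = received[ids].mean()
    return sorted(scores, key=scores.get, reverse=True)
```

Phrases whose tokens attract little attention (typically function words) fall to the bottom of the ranking without any supervision.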

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .

An ensemble approach for healthcare application and diagnosis using natural language processing

Machine Learning and Natural Language Processing Enable a Data-Oriented Experimental Design Approach for Producing Biochar and Hydrochar from Biomass


  • Open access
  • Published: 18 March 2024

Natural language instructions induce compositional generalization in networks of neurons

  • Reidar Riveland (ORCID: orcid.org/0000-0003-1510-290X) &
  • Alexandre Pouget (ORCID: orcid.org/0000-0003-3054-6365)

Nature Neuroscience (2024)


  • Intelligence
  • Network models

A fundamental human cognitive feat is to interpret linguistic instructions in order to perform novel tasks without explicit task experience. Yet, the neural computations that might be used to accomplish this remain poorly understood. We use advances in natural language processing to create a neural model of generalization based on linguistic instructions. Models are trained on a set of common psychophysical tasks, and receive instructions embedded by a pretrained language model. Our best models can perform a previously unseen task with an average performance of 83% correct based solely on linguistic instructions (that is, zero-shot learning). We found that language scaffolds sensorimotor representations such that activity for interrelated tasks shares a common geometry with the semantic representations of instructions, allowing language to cue the proper composition of practiced skills in unseen settings. We show how this model generates a linguistic description of a novel task it has identified using only motor feedback, which can subsequently guide a partner model to perform the task. Our models offer several experimentally testable predictions outlining how linguistic information must be represented to facilitate flexible and general cognition in the human brain.


In a laboratory setting, animals require numerous trials to acquire a new behavioral task. This is in part because the only means of communication with nonlinguistic animals is simple positive and negative reinforcement signals. By contrast, it is common to give written or verbal instructions to humans, which allows them to perform new tasks relatively quickly. Further, once humans have learned a task, they can typically describe the solution with natural language. The dual abilities to use an instruction to perform a novel task and, conversely, to produce a linguistic description of the demands of a task once it has been learned are two unique cornerstones of human communication. Yet, the computational principles that underlie these abilities remain poorly understood.

One influential systems-level explanation posits that flexible interregional connectivity in the prefrontal cortex allows for the reuse of practiced sensorimotor representations in novel settings 1 , 2 . More recently, multiple studies have observed that when subjects are required to flexibly recruit different stimulus-response patterns, neural representations are organized according to the abstract structure of the task set 3 , 4 , 5 . Lastly, recent modeling work has shown that a multitasking recurrent neural network (RNN) will share dynamical motifs across tasks with similar demands 6 . This work forms a strong basis for explanations of flexible cognition in humans but leaves open the question of how linguistic information can reconfigure a sensorimotor network so that it performs a novel task well on the first attempt. Overall, it remains unclear what representational structure we should expect from brain areas that are responsible for integrating linguistic information in order to reorganize sensorimotor mappings on the fly.

These questions become all the more pressing given that recent advances in machine learning have led to artificial systems that exhibit human-like language skills 7 , 8 . Recent works have matched neural data recorded during passive listening and reading tasks to activations in autoregressive language models (that is, GPT 9 ), arguing that there is a fundamentally predictive component to language comprehension 10 , 11 . Additionally, some high-profile machine learning models do show the ability to use natural language as a prompt to perform a linguistic task or render an image, but the outputs of these models are difficult to interpret in terms of a sensorimotor mapping that we might expect to occur in a biological system 12 , 13 , 14 . Alternatively, recent work on multimodal interactive agents may be more interpretable in terms of the actions they take, but utilize a perceptual hierarchy that fuses vision and language at early stages of processing, making them difficult to map onto functionally and anatomically distinct language and vision areas in human brains 15 , 16 , 17 .

We, therefore, seek to leverage the power of language models in a way that results in testable neural predictions detailing how the human brain processes natural language in order to generalize across sensorimotor tasks.

To that end, we train an RNN (sensorimotor-RNN) model on a set of simple psychophysical tasks, where models process instructions for each task using a pretrained language model. We find that embedding instructions with models tuned to sentence-level semantics allows sensorimotor-RNNs to perform a novel task at 83% correct, on average. Generalization in our models is supported by a representational geometry that captures task subcomponents and is shared between instruction embeddings and sensorimotor activity, thereby allowing a composition of practiced skills in a novel setting. We also find that individual neurons modulate their tuning based on the semantics of instructions. We demonstrate how a network trained to interpret linguistic instructions can invert this understanding and produce a linguistic description of a previously unseen task based on the information in motor feedback signals. We end by discussing how these results can guide research on the neural basis of language-based generalization in the human brain.

Instructed models and task set

We train sensorimotor-RNNs on a set of 50 interrelated psychophysical tasks that require various cognitive capacities that are well studied in the literature 18 . Two example tasks are presented in Fig. 1a,b as they might appear in a laboratory setting. For all tasks, models receive a sensory input and task-identifying information and must output motor response activity (Fig. 1c ). Input stimuli are encoded by two one-dimensional maps of neurons, each representing a different input modality, with periodic Gaussian tuning curves to angles (over (0, 2π)). Output responses are encoded in the same way. Inputs also include a single fixation unit. After the input fixation is off, the model can respond to the input stimuli. Our 50 tasks are roughly divided into 5 groups, ‘Go’, ‘Decision-making’, ‘Comparison’, ‘Duration’ and ‘Matching’, where within-group tasks share similar sensory input structures but may require divergent responses. For instance, in the decision-making (DM) task, the network must respond in the direction of the stimulus with the highest contrast, whereas in the anti-decision-making (AntiDM) task, the network responds to the stimulus with the weakest contrast (Fig. 1a ). Thus, networks must properly infer the task demands for a given trial from task-identifying information in order to perform all tasks simultaneously (see Methods for task details; see Supplementary Fig. 13 for example trials of all tasks).
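The population encoding of angles described above can be sketched as follows; `n_neurons` and the tuning width `kappa` are illustrative choices, not the paper's exact values:

```python
import numpy as np

def encode_angle(theta, n_neurons=32, kappa=5.0):
    """Encode an angle in [0, 2*pi) as firing rates of a one-dimensional
    map of neurons with periodic (von Mises-style) Gaussian tuning
    curves, in the spirit of the sensorimotor-RNN input scheme.
    Each neuron responds maximally at its preferred angle."""
    preferred = np.linspace(0, 2 * np.pi, n_neurons, endpoint=False)
    # Peak response of 1.0 when theta equals the preferred angle.
    return np.exp(kappa * (np.cos(theta - preferred) - 1))

rates = encode_angle(np.pi)
```

Using the cosine of the angular difference makes the tuning periodic, so stimuli near 0 and 2π are encoded as neighbors.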

Figure 1

a, b, Illustrations of example trials as they might appear in a laboratory setting. The trial is instructed, then stimuli are presented with different angles and strengths of contrast. The agent must then respond with the proper angle during the response period. a, An example AntiDM trial where the agent must respond to the angle presented with the least intensity. b, An example COMP1 trial where the agent must respond to the first angle if it is presented with higher intensity than the second angle, and otherwise repress its response. c, Diagram of model inputs and outputs. Sensory inputs (fixation unit, modality 1, modality 2) are shown in red and model outputs (fixation output, motor output) are shown in green. Models also receive a rule vector (blue) or the embedding that results from passing task instructions through a pretrained language model (gray). A list of models tested is provided in the inset.

In our models, task-identifying input is either nonlinguistic or linguistic. We use two nonlinguistic control models. First, in SIMPLENET, the identity of a task is represented by one of 50 orthogonal rule vectors. Second, STRUCTURENET uses a set of 10 orthogonal structure vectors, each representing a dimension of the task set (that is, respond weakest versus strongest direction), and tasks are encoded using combinations of these vectors (see Supplementary Notes 3 for the full set of structure combinations). As a result, STRUCTURENET fully captures all the relevant relationships among tasks, whereas SIMPLENET encodes none of this structure.
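The contrast between the two controls can be sketched directly: SIMPLENET's rule vectors are orthogonal one-hot codes that share nothing across tasks, while STRUCTURENET composes each task from a smaller basis of structure vectors. The random combination matrix below is a placeholder for the paper's hand-coded task structure:

```python
import numpy as np

n_tasks, n_dims = 50, 10

# SIMPLENET: each task gets one of 50 orthogonal rule vectors,
# encoding no relationships among tasks.
simple_rules = np.eye(n_tasks)

# STRUCTURENET sketch: each task is a combination of 10 orthogonal
# structure vectors, one per task-set dimension. The binary
# combination matrix here is random for illustration only; the paper
# uses hand-coded combinations reflecting true task relationships.
structure_vectors = np.eye(n_dims)
rng = np.random.default_rng(0)
combinations = rng.integers(0, 2, size=(n_tasks, n_dims))
structure_rules = combinations @ structure_vectors
```

Tasks sharing a dimension (say, "respond to weakest stimulus") end up with overlapping rule vectors under the structured scheme, which is exactly the relational information SIMPLENET lacks.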

Instructed models use a pretrained transformer architecture 19 to embed natural language instructions for the tasks at hand. For each task, there is a corresponding set of 20 unique instructions (15 training, 5 validation; see Supplementary Notes 2 for the full instruction set). We test various types of language models that share the same basic architecture but differ in their size and also their pretraining objective. We tested two autoregressive models, a standard and a large version of GPT2, which we call GPT and GPT (XL), respectively. Previous work has demonstrated that GPT activations can account for various neural signatures of reading and listening 11 . BERT is trained to identify masked words within a piece of text 20 , but it also uses an unsupervised sentence-level objective, in which the network is given two sentences and must determine whether they follow each other in the original text. SBERT is trained like BERT but receives additional tuning on the Stanford Natural Language Inference task, a hand-labeled dataset detailing the logical relationship between two candidate sentences ( Methods ) 21 , 22 . Lastly, we use the language embedder from CLIP, a multimodal model that learns a joint embedding space of images and text captions 23 . We call a sensorimotor-RNN using a given language model LANGUAGEMODELNET and append a letter indicating its size. The various sizes of models are given in Fig. 1c . For each language model, we apply a pooling method to the last hidden state of the transformer and pass this fixed-length representation through a set of linear weights that are trained during task learning. This results in a 64-dimensional instruction embedding across all models ( Methods ). Language model weights are frozen unless otherwise specified. Finally, as a control, we also test a bag-of-words (BoW) embedding scheme that only uses word count statistics to embed each instruction.
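A minimal sketch of the instruction-embedding step, assuming mean pooling over tokens (the exact pooling may differ per language model, and the projection weights are trained during task learning):

```python
import numpy as np

def embed_instruction(hidden_states, proj):
    """Pool a language model's last hidden state over tokens, then apply
    linear weights to produce a fixed 64-dimensional instruction
    embedding, as described for the instructed models."""
    pooled = hidden_states.mean(axis=0)   # (hidden_dim,) after mean pooling
    return proj @ pooled                   # (64,) instruction embedding

hidden_dim = 768                           # e.g. a BERT-base hidden width
rng = np.random.default_rng(1)
proj = rng.normal(size=(64, hidden_dim)) / np.sqrt(hidden_dim)
tokens = rng.normal(size=(12, hidden_dim))  # toy 12-token instruction
embedding = embed_instruction(tokens, proj)
```

Fixing the embedding at 64 dimensions for every language model puts all instructed networks on equal footing downstream, regardless of transformer size.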

First, we verify our models can perform all tasks simultaneously. For instructed models to perform well, they must infer the common semantic content between 15 distinct instruction formulations for each task. We find that all our instructed models can learn all tasks simultaneously except for GPTNET, where performance asymptotes are below the 95% threshold for some tasks. Hence, we relax the performance threshold to 85% for models that use GPT (Supplementary Fig. 1 ; see Methods for training details). We additionally tested all architectures on validation instructions (Supplementary Fig. 2 ). SBERTNET (L) and SBERTNET are our best-performing models, achieving an average performance of 97% and 94%, respectively, on validation instructions, demonstrating that these networks infer the proper semantic content even for entirely novel instructions.

Generalization to novel tasks

We next examined the extent to which different language models aided generalization to novel tasks. We trained individual networks on 45 tasks and then tested performance when exposed to the five held-out tasks. We use unequal-variance t-tests to make comparisons among the performance of different models. Figure 2 shows results with P values for the most relevant comparisons (a full matrix of comparisons across all models can be found in Supplementary Figs. 3 and 4).
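For reference, the unequal-variance (Welch's) t statistic used for these comparisons can be computed as:

```python
import math

def welch_t(xs, ys):
    """Two-sided unequal-variance (Welch's) t test statistic.
    Returns the t value and the Welch-Satterthwaite degrees of
    freedom; a P value would then come from the t distribution."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

Unlike the classic Student's t-test, Welch's version does not assume the two models' performance distributions share a variance, which suits comparisons across architectures of very different sizes.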

Figure 2

a , Learning curves for the first 100 exposures to held-out tasks averaged over all tasks. Data are presented as the mean ± s.d. across different n  = 5 random initializations of sensorimotor-RNN weights. For all subplots, asterisks indicate significant differences among performance according to a two-sided unequal-variance t -test. Most relevant comparisons are presented in plots (for all subplots, not significant (NS), P  > 0.05, * P  < 0.05, ** P  < 0.01, *** P  < 0.001; STRUCTURENET versus SBERTNET (L): t  = 3.761, P  = 1.89 × 10 −4 ; SBERTNET (L) versus SBERTNET: t  = 2.19, P  = 0.029; SBERTNET versus CLIPNET: t  = 6.22, P  = 1.02 × 10 −9 ; CLIPNET versus BERTNET: t  = 1.037,  P  = 0.300; BERTNET versus GPTNET (XL): t  = −1.122, P  = 0.262; GPTNET (XL) versus GPTNET: t  = 6.22, P  = 1.04 × 10 −9 ; GPTNET versus BOWNET: t  = −3.346, P  = 8.85 × 10 − 4 ; BOWNET versus SIMPLENET: t  = 10.25, P  = 2.091 × 10 −22 ). A full table of pairwise comparisons can be found in Supplementary Fig. 3 . b , Distribution of generalization performance (that is, first exposure to novel task) across models. c – f , Performance across different test conditions for n  = 5 different random initialization of sensorimotor-RNN weights where each point indicates average performance across tasks for a given initialization. c , Generalization performance for tasks where instructions are swapped at test time (STRUCTURENET versus SBERTNET (L): t  = −0.15, P  = 0.875; SBERTNET (L) versus SBERTNET: t  = −2.102, P  = 0.036; SBERTNET versus CLIPNET: t  = −0.162, P  = 0.871; CLIPNET versus BERTNET: t  = 0.315, P  = 0.752; BERTNET versus GPTNET (XL): t  = 0.781, P  = 0.435; GPTNET (XL) versus GPTNET: t  = 1.071, P  = 0.285; GPTNET versus BOWNET: t  = −2.702, P  = 0.007; BOWNET versus SIMPLENET: t  = −3.471, P  = 5.633 −4 ). A full table of pairwise comparisons can be found in Supplementary Fig. 4 . 
d, Generalization performance for models where tasks from the same family are held out during training (STRUCTURENET versus SBERTNET (L): t = 0.629, P = 0.530; SBERTNET (L) versus SBERTNET: t = −0.668, P = 0.504; SBERTNET versus CLIPNET: t = 8.043, P = 7.757 × 10−15; CLIPNET versus BERTNET: t = −0.306, P = 0.759; BERTNET versus GPTNET (XL): t = 0.163, P = 0.869; GPTNET (XL) versus GPTNET: t = 1.534, P = 0.126; GPTNET versus BOWNET: t = −6.418, P = 3.26 × 10−10; BOWNET versus SIMPLENET: t = 14.23, P = 8.561 × 10−39). A full table of pairwise comparisons can be found in Supplementary Fig. 4. e, Generalization performance for models where the last layers of the language models are allowed to fine-tune to the loss from sensorimotor tasks (STRUCTURENET versus SBERTNET (L): t = 1.203, P = 0.229; SBERTNET (L) versus SBERTNET: t = 2.399, P = 0.016; SBERTNET versus CLIPNET: t = 5.186, P = 3.251 × 10−7; CLIPNET versus BERTNET: t = −3.002, P = 0.002; BERTNET versus GPTNET (XL): t = 0.522, P = 0.601; GPTNET (XL) versus GPTNET: t = 2.631, P = 0.009; GPTNET versus BOWNET: t = 4.440, P = 1.134 × 10−5; BOWNET versus SIMPLENET: t = 10.255, P = 2.091 × 10−22). A full table of pairwise comparisons can be found in Supplementary Fig. 4. f, Average difference in performance between tasks that use standard imperative instructions and those that use instructions with conditional clauses and require a simple deductive reasoning component.
Colored asterisks at the bottom of the plot show P values for a two-sided, unequal-variance t-test against a null distribution constructed using random splits of the task set (transparent points represent mean differences for random splits; STRUCTURENET: t = −36.46, P = 4.34 × 10−23; SBERTNET (L): t = −16.38, P = 3.02 × 10−5; SBERTNET: t = −15.35, P = 3.920 × 10−5; CLIPNET: t = −44.68, P = 5.32 × 10−13; BERTNET: t = −25.51, P = 3.14 × 10−8; GPTNET (XL): t = −16.99, P = 3.61 × 10−6; GPTNET: t = −9.150, P = 0.0002; BOWNET: t = −70.99, P = 4.566 × 10−35; SIMPLENET: t = 19.60, P = 5.82 × 10−6), and asterisks at the top of the plot indicate P-value results from a t-test comparing differences between STRUCTURENET and our other instructed models (versus SBERTNET (L): t = 3.702, P = 0.0168; versus SBERTNET: t = 6.592, P = 0.002; versus CLIPNET: t = 30.35, P = 2.367 × 10−7; versus BERTNET: t = 7.234, P = 0.0007; versus GPTNET (XL): t = 5.282, P = 0.004; versus GPTNET: t = −1.745, P = 0.149; versus BOWNET: t = 75.04, P = 9.96 × 10−11; versus SIMPLENET: t = −30.95, P = 2.86 × 10−6; see Methods and Supplementary Fig. 6 for full comparisons).

Our uninstructed control model, SIMPLENET, performs at 39%, on average, on the first presentation of a novel task (zero-shot generalization). This serves as a baseline for generalization. Note that despite the orthogonality of the task rules provided to SIMPLENET, exposure to the task set allows models to learn patterns that are common to all tasks (for example, always repress response during fixation). Therefore, 39% is not chance-level performance per se, but rather the performance achieved by a network trained and tested on a task set with some common requirements for responding. GPTNET exhibits a zero-shot generalization of 57%, a significant improvement over SIMPLENET (t = 8.32, P = 8.24 × 10−16). Strikingly, increasing the size of GPT by an order of magnitude to the 1.5 billion parameters used by GPTNET (XL) resulted in only modest gains over BOWNET (64%), with GPTNET (XL) achieving 68% on held-out tasks (t = 2.04, P = 0.047). By contrast, CLIPNET (S), which uses 4% of the number of parameters utilized by GPTNET (XL), nonetheless achieves the same performance (68% correct, t = 0.146, P = 0.88). Likewise, BERTNET achieves a generalization performance that lags only 2% behind GPTNET (XL) in the mean (t = −1.122, P = 0.262). Models with knowledge of sentence-level semantics, by contrast, show marked improvements in generalization, with SBERTNET performing an unseen task at 79% correct on average. Finally, our best-performing model, SBERTNET (L), can execute a never-before-seen task at 83% correct, on average, lagging just a few percentage points behind STRUCTURENET (88% correct), which receives the structure of the task set hand-coded in its rule vectors.
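All of the pairwise comparisons above use a two-sided unequal-variance (Welch's) t-test. As a minimal sketch of how the t statistic for such a comparison is computed, the following uses purely illustrative per-initialization scores (not the actual model results):

```python
import math

def welch_t(a, b):
    """Two-sample unequal-variance (Welch's) t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # Welch's standard error does not pool the two variances
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical zero-shot scores for n = 5 random initializations each;
# the real values come from trained networks.
sbertnet_l = [0.84, 0.82, 0.85, 0.81, 0.83]
simplenet = [0.40, 0.37, 0.41, 0.38, 0.39]

print(round(welch_t(sbertnet_l, simplenet), 2))  # → 44.0
```

The P values reported in the text would then come from the t distribution with Welch–Satterthwaite degrees of freedom, which a statistics library computes alongside the statistic.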

Figure 2b shows a histogram of the number of tasks for which each model achieves a given level of performance. Notably, SBERTNET (L) performs over 20 tasks nearly perfectly in the zero-shot setting (for individual task performance for all models, see Supplementary Fig. 3).

To validate that our best-performing models leveraged the semantics of instructions, we presented the sensory input for one held-out task while providing the linguistic instructions for a different held-out task. Models that truly rely on linguistic information should be most penalized by this manipulation and, as predicted, we saw the largest decrease in performance for our best models (Fig. 2c ).

We also tested a more stringent hold-out procedure where we purposefully chose 4–6 tasks from the same family of tasks to hold out during training (Fig. 2d ). Overall, performance decreased in this more difficult setting, although our best-performing models still showed strong generalization, with SBERTNET (L) and SBERTNET achieving 71% and 72% correct on novel tasks, respectively, which was not significantly different from STRUCTURENET at 72% ( t  = 0.629, P  = 0.529; t  = 0.064, P  = 0.948; for SBERTNET (L) and SBERTNET, respectively).

In addition, we tested models in a setting where the weights of the language models were allowed to tune according to the loss experienced during sensorimotor training (see Methods for tuning details). This manipulation improved generalization performance across all models, and for our best-performing model, SBERTNET (L), generalization is as strong as for STRUCTURENET (86%, t = 1.204, P = 0.229).

Following ref. 18 , we tested models in a setting where task-type information for a given task was represented as a composition of information for related tasks in the training set (that is, AntiDMMod1 = (rule(AntiDMMod2) − rule(DMMod2)) + rule(DMMod1)). In this setting, we did find that the performance of SIMPLENET improved (60% correct). However, when we combined embedded instructions according to the same compositional rules, our linguistic models dramatically outperformed SIMPLENET. This suggests that training in the context of language more readily allows a simple compositional scheme to successfully configure task responses (see Supplementary Fig. 5 for full results and compositional encodings).
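The compositional scheme above can be sketched with hypothetical toy rule vectors (the real vectors are those described in Supplementary Fig. 5). With a factorized encoding, subtracting the 'Mod2' decision rule from its 'Anti' counterpart isolates an 'Anti' component, which adding the 'Mod1' rule then re-attaches to modality 1:

```python
# Toy factorized rule vectors: dimensions are (Pro, Anti, Mod1, Mod2).
# These values are illustrative, not the encodings used in the study.
rule = {
    "DMMod1":     [1, 0, 1, 0],
    "DMMod2":     [1, 0, 0, 1],
    "AntiDMMod2": [0, 1, 0, 1],
}

def compose(anti_mod2, dm_mod2, dm_mod1):
    """AntiDMMod1 = (rule(AntiDMMod2) - rule(DMMod2)) + rule(DMMod1)."""
    return [(a - b) + c for a, b, c in zip(anti_mod2, dm_mod2, dm_mod1)]

anti_dm_mod1 = compose(rule["AntiDMMod2"], rule["DMMod2"], rule["DMMod1"])
print(anti_dm_mod1)  # → [0, 1, 1, 0]: 'Anti' swapped in, modality 1 kept
```

The same arithmetic applied to instruction embeddings is what lets the linguistic models outperform SIMPLENET in this setting.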

Finally, we tested a version of each model where outputs of language models are passed through a set of nonlinear layers, as opposed to the linear mapping used in the preceding results. We found that this manipulation reduced performance, suggesting that this added power leads to overfitting on training tasks, and that a simpler linear mapping is better suited to generalization (see Methods for details and Supplementary Fig. 4 for full results).

The discrepancy in performance between our instructed models suggests that, in order to represent linguistic information such that it can successfully configure sensorimotor networks, it is not sufficient to simply use any powerful language processing system. Rather, a model's success is delineated by the extent to which it is exposed to sentence-level semantics during pretraining. Our best-performing models, SBERTNET (L) and SBERTNET, are explicitly trained to produce good sentence embeddings, whereas our worst-performing model, GPTNET, is tuned only to the statistics of upcoming words. Both CLIPNET (S) and BERTNET are exposed to some form of sentence-level knowledge. CLIPNET (S) produces sentence-level representations, but trains them against the statistics of corresponding vision representations. BERTNET performs a two-way classification of whether or not input sentences are adjacent in the training corpus. That the 1.5 billion parameters of GPTNET (XL) do not markedly improve performance relative to these comparatively small models indicates that model size is not the determining factor. Lastly, although the bag-of-words representation used by BOWNET removes key elements of linguistic meaning (that is, syntax), the simple use of word occurrences encodes information primarily about the similarities and differences between sentences. For instance, simply representing the inclusion or exclusion of the words 'stronger' or 'weaker' is highly informative about the meaning of the instruction.

We also investigated which features of language make it difficult for our models to generalize. Thirty of our tasks require processing instructions with a conditional clause structure (for example, COMP1) as opposed to a simple imperative (for example, AntiDM). Tasks that are instructed using conditional clauses also require a simple form of deductive reasoning (if p then q else s ). Neuroimaging literature exploring the relationship between such deductive processes and language areas has reached differing conclusions, with some early studies showing that deduction recruits regions that are thought to support syntactic computations 24 , 25 , 26 and follow-up studies claiming that deduction can be reliably dissociated from language areas 27 , 28 , 29 , 30 . One theory for this variation in results is that baseline tasks used to isolate deductive reasoning in earlier studies used linguistic stimuli that required only superficial processing 31 , 32 .

To explore this issue, we calculated the average difference in performance between tasks with and without conditional clauses/deductive reasoning requirements (Fig. 2f ). All our models performed worse on these tasks relative to a set of random shuffles. However, we also saw an additional effect between STRUCTURENET and our instructed models, which performed worse than STRUCTURENET by a statistically significant margin (see Supplementary Fig. 6 for full comparisons). This is a crucial comparison because STRUCTURENET performs deductive tasks without relying on language. Hence, the decrease in performance between STRUCTURENET and instructed models is in part due to the difficulty inherent in parsing syntactically more complicated language. The implication is that we may see engagement of linguistic areas in deductive reasoning tasks, but this may simply be due to the increased syntactic demands of corresponding instructions (rather than processes that recruit linguistic areas to explicitly aid in the deduction). This result largely agrees with two reviews of the deductive reasoning literature, which concluded that the effects in language areas seen in early studies were likely due to the syntactic complexity of test stimuli 31 , 32 .

Shared structure in language and sensorimotor networks

We then turned to an investigation of the representational scheme that supports generalization. First, we note that, as in other multitasking models, units in our sensorimotor-RNNs exhibited functional clustering, where similar subsets of neurons show high variance across similar sets of tasks (Supplementary Fig. 7). Moreover, we found that models can learn unseen tasks by training only the sensorimotor-RNN input weights while keeping the recurrent dynamics constant (Supplementary Fig. 8). Past work has shown that these properties are characteristic of networks that can reuse the same set of underlying neural resources across different settings 6, 18. We then examined the geometry that exists between the neural representations of related tasks. We plotted the first three principal components (PCs) of sensorimotor-RNN hidden activity at stimulus onset in SIMPLENET, GPTNET (XL), SBERTNET (L) and STRUCTURENET performing modality-specific DM and AntiDM tasks. Here, models receive input for a decision-making task in both modalities but must attend only to the stimuli in the modality relevant for the current task. Importantly, AntiDMMod1 is held out of training in the following examples. In addition, we plotted the PCs of either the rule vectors or the instruction embeddings for each task (Fig. 3).
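The low-dimensional views in Fig. 3 come from projecting hidden activity onto its leading principal components. As a stdlib-only illustration of the idea (a real analysis would use a library PCA routine to extract the first three PCs of the trial-averaged activity), the following recovers the top component of a toy activity matrix by power iteration:

```python
import math
import random

def top_pc(X, iters=200, seed=0):
    """First principal component of the rows of X, via power iteration
    on the covariance matrix. Minimal sketch for illustration only."""
    n, d = len(X), len(X[0])
    mean = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, mean)] for row in X]  # center data
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        # Apply the covariance map as Xc^T (Xc v), then renormalize.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in Xc]
        w = [sum(Xc[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy 'hidden activity' whose condition-dependent variance lies almost
# entirely along the first axis.
X = [[2.0, 0.1], [1.9, -0.1], [-2.0, 0.1], [-1.9, -0.1]]
pc1 = top_pc(X)
print([round(abs(x), 2) for x in pc1])  # → [1.0, 0.0]
```

Plotting each trial's projection onto the first three such components gives the panels shown in Fig. 3a–d.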

Figure 3

a–d, The first three PCs of sensorimotor hidden activity and task-info representations for models trained with AntiDMMod1 held out. Solid arrows represent an abstract 'Pro' versus 'Anti' axis, and dashed arrows represent an abstract 'Mod1' versus 'Mod2' axis. a, STRUCTURENET. b, SBERTNET (L). c, GPTNET (XL). d, SIMPLENET. e, Correlation between held-out task CCGP and zero-shot performance (Pearson's r = 0.606, P = 1.57 × 10−46). f, CCGP scores for held-out tasks for each layer in the model hierarchy. Significance scores indicate P-value results from pairwise two-sided unequal-variance t-tests performed among model distributions of CCGP scores on held-out tasks for the sensorimotor-RNN (NS, P > 0.05; *P < 0.05; **P < 0.01; ***P < 0.001; STRUCTURENET versus SBERTNET (L): t = 13.67, P = 2.44 × 10−36; SBERTNET (L) versus SBERTNET: t = 5.061, P = 5.84 × 10−7; SBERTNET versus CLIPNET: t = 2.809, P = 0.005; CLIPNET versus BERTNET: t = 0.278, P = 0.780; BERTNET versus GPTNET (XL): t = 2.505, P = 0.012; GPTNET (XL) versus GPTNET: t = 3.180, P = 0.001; GPTNET versus BOWNET: t = −4.176, P = 3.50 × 10−5; BOWNET versus SIMPLENET: t = 23.08, P = 1.10 × 10−80; see Supplementary Fig. 9 for full comparisons as well as t-test results for embedding-layer CCGP scores).

For STRUCTURENET, hidden activity is factorized along task-relevant axes, namely a consistent 'Pro' versus 'Anti' direction in activity space (solid arrows) and a 'Mod1' versus 'Mod2' direction (dashed arrows). Importantly, this structure is maintained even for AntiDMMod1, which has been held out of training, allowing STRUCTURENET to achieve a performance of 92% correct on this unseen task. This factorization is also reflected in the PCs of the rule embeddings. Strikingly, SBERTNET (L) also organizes its representations in a way that captures the essential compositional nature of the task set, using only the structure that it has inferred from the semantics of instructions. This is also the case for its language embeddings, which maintain abstract axes across AntiDMMod1 instructions (again, held out of training). As a result, SBERTNET (L) is able to use these relevant axes for AntiDMMod1 sensorimotor-RNN representations, leading to a generalization performance of 82%. By contrast, GPTNET (XL) fails to properly infer a distinct 'Pro' versus 'Anti' axis in either sensorimotor-RNN representations or language embeddings, leading to a zero-shot performance of 6% on AntiDMMod1 (Fig. 3b). Finally, we find that the orthogonal rule vectors used by SIMPLENET preclude any structure between practiced and held-out tasks, resulting in a performance of 22%.

To quantify this structure more precisely, we measured the cross-condition generalization performance (CCGP) of these representations 3. CCGP measures the ability of a linear decoder trained to differentiate one set of conditions (for example, DMMod2 and AntiDMMod2) to generalize to an analogous set of test conditions (for example, DMMod1 and AntiDMMod1). Intuitively, this captures the extent to which models have learned to place sensorimotor activity along abstract task axes (for example, the 'Anti' dimension). Notably, high CCGP scores and related measures have been observed in experiments that required human participants to flexibly switch between different interrelated tasks 4, 33.
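The CCGP logic can be sketched in a few lines. Here a nearest-centroid rule stands in for the trained linear classifier described in Methods, and the activity vectors are toy values: the decoder is fit to separate 'DM' from 'AntiDM' conditions in modality 2, then scored on the analogous pair in modality 1.

```python
import math

def ccgp(train_a, train_b, test_a, test_b):
    """Fit a linear (nearest-centroid) decoder on one condition pair and
    score it on an analogous held-out pair. Simplified stand-in for the
    trained linear classifier used in the study."""
    ca = [sum(x) / len(train_a) for x in zip(*train_a)]  # centroid, cond A
    cb = [sum(x) / len(train_b) for x in zip(*train_b)]  # centroid, cond B
    correct = sum(math.dist(p, ca) < math.dist(p, cb) for p in test_a)
    correct += sum(math.dist(p, cb) < math.dist(p, ca) for p in test_b)
    return correct / (len(test_a) + len(test_b))

# Toy activity: the 'Anti' conditions shift the second coordinate by +1
# in both modalities, i.e. a shared abstract 'Anti' axis.
dm_mod2      = [[0.0, 0.0], [0.1, 0.0]]
anti_dm_mod2 = [[0.0, 1.0], [0.1, 1.0]]
dm_mod1      = [[1.0, 0.1], [1.1, 0.0]]
anti_dm_mod1 = [[1.0, 1.1], [1.1, 1.0]]

# Train on the modality 2 pair, test on the modality 1 pair.
print(ccgp(dm_mod2, anti_dm_mod2, dm_mod1, anti_dm_mod1))  # → 1.0
```

A score near 1.0 indicates an abstract 'Anti' axis that transfers across conditions; chance is 0.5.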

We measured CCGP scores among representations in sensorimotor-RNNs for tasks that had been held out of training (Methods) and found a strong correlation between CCGP scores and zero-shot performance (Fig. 3e). Additionally, we find that swapping task instructions for held-out tasks dramatically reduces CCGP scores for all our instructed models, indicating that the semantics of instructions is crucial for maintaining structured representations (Supplementary Fig. 9).

We then looked at how structure emerges along the language processing hierarchy. CCGP decoding scores for different layers in our models are shown in Fig. 3f. For each instructed model, we plotted scores for 12 transformer layers (or the last 12 layers for SBERTNET (L) and GPTNET (XL)), the 64-dimensional embedding layer and the sensorimotor-RNN task representations. We also plotted CCGP scores for the rule embeddings used in our nonlinguistic models. Among models, there was a notable discrepancy in how abstract structure emerges. The autoregressive models (GPTNET (XL), GPTNET), BERTNET and CLIPNET (S) showed low CCGP throughout the language model layers, followed by a jump in the embedding layer. This is because the weights feeding into the embedding layer are tuned during sensorimotor training. The implication of this spike is that most of the useful representational processing in these models occurs not in the pretrained language model per se, but rather in the linear readout, which is exposed to task structure via training. By contrast, our best-performing models, SBERTNET and SBERTNET (L), use language representations in which high CCGP scores emerge gradually in the intermediate layers of their respective language models. Because semantic representations already have such structure, most of the compositional inference involved in generalization can occur in the comparatively powerful language processing hierarchy. As a result, representations are already well organized in the last layer of the language models, and a linear readout in the embedding layer is sufficient for the sensorimotor-RNN to correctly infer the geometry of the task set and generalize well.

This analysis strongly suggests that models exhibiting generalization do so by leveraging structured semantic representations to properly relate practiced and novel tasks in sensorimotor space, thereby allowing a composition of practiced behaviors in an unseen setting.

Semantic modulation of single-unit tuning properties

Next, we examined tuning profiles of individual units in our sensorimotor-RNNs. We found that individual neurons are tuned to a variety of task-relevant variables. Critically, however, we also found neurons whose tuning varies predictably within a task group and is modulated by the semantic content of instructions in a way that reflects task demands.

For instance, in the ‘Go’ family of tasks, unit 42 shows direction selectivity that shifts by π between ‘Pro’ and ‘Anti’ tasks, reflecting the relationship of task demands in each context (Fig. 4a ). This flip in selectivity is observed even for the AntiGo task, which was held out during training.

Figure 4

a, Tuning curves for a SBERTNET (L) sensorimotor-RNN unit that modulates tuning according to task demands in the 'Go' family. b, Tuning curves for a SBERTNET (L) sensorimotor-RNN unit in the 'matching' family of tasks, plotted in terms of the difference in angle between the two stimuli. c, Full activity traces for modality-specific 'DM' and 'AntiDM' tasks for different levels of relative stimulus strength. d, Full activity traces for tasks in the 'comparison' family of tasks for different levels of relative stimulus strength.

For the ‘Matching’ family of tasks, unit 14 modulates activity between ‘match’ (DMS, DMC) and ‘non-match’ (DNMS, DNMC) conditions. In ‘non-match’ trials, the activity of this unit increases as the distance between the two stimuli increases. By contrast, for ‘matching’ tasks, this neuron is most active when the relative distance between the two stimuli is small. Hence, in both cases this neuron modulates its activity to represent when the model should respond, changing selectivity to reflect opposing task demands between ‘match’ and ‘non-match’ trials. This is true even for DMS, which has been held out of training.

Figure 4c shows traces of unit 3 activity in modality-specific versions of the DM and AntiDM tasks (AntiDMMod1 is held out of training) for different levels of contrast (contrast = str_stim1 − str_stim2). In all tasks, we observed ramping activity where the rate of ramping scales with the strength of the contrast. This motif of activity has been reported in previous studies 34, 35. However, in our models, we observe that an evidence-accumulating neuron can swap the sign of its integration in response to a change in linguistic instructions, which allows models to meet the opposing demands of 'Pro' and 'Anti' versions of the task, even for previously unseen tasks.

Interestingly, we also found that unsuccessful models failed to properly modulate tuning preferences. For example, with GPTNET (XL), which failed to factorize along a ‘Pro’ versus ‘Anti’ axis (Fig. 3b ) and had poor generalization on AntiDMMod1, we also find neurons that failed to swap their sign of integration in the held-out setting (Supplementary Fig. 10 ).

Finally, we see a similar pattern in the time course of activity for trials in the 'Comparison' family of tasks (Fig. 4d). In the COMP1 task, the network must respond in the direction of the first stimulus if it has higher intensity than the second stimulus, and must not respond otherwise. In COMP2, it must respond in the direction of the second stimulus only if the second stimulus has higher intensity. For 'Anti' versions, the demands of stimulus ordering are the same, except that the model has to choose the stimulus with the weakest contrast. Even with this added complexity, we found individual neurons that modulate their tuning with respect to task demands, even for held-out tasks (in this case COMP2). For example, unit 82 is active when the network should repress its response. For COMP1, this unit is highly active with negative contrast (that is, str_stim2 > str_stim1), but flips this sensitivity in COMP2 and is highly active with positive contrast (that is, str_stim1 > str_stim2). Importantly, this relation is reversed when the goal is to select the weakest stimulus. Hence, despite the subtle syntactic differences between instruction sets, the language embedding can reverse the tuning of this unit in a task-appropriate manner.

Linguistic communication between networks

We now seek to model the complementary human ability to describe a particular sensorimotor skill with words once it has been acquired. To do this, we inverted the language-to-sensorimotor mapping our models learn during training so that they can provide a linguistic description of a task based only on the state of sensorimotor units. First, we constructed an output channel (production-RNN; Fig. 5a–c), which is trained to map sensorimotor-RNN states to input instructions. We then presented the network with a series of example trials while withholding instructions for a specific task. During this phase, all model weights are frozen, and models receive motor feedback that is used to update embedding-layer activity so as to reduce output error (Fig. 5b). Once the activity in the embedding layer drives sensorimotor units to achieve a performance criterion, we used the production-RNN to decode a linguistic description of the current task. Finally, to evaluate the quality of these instructions, we input them into a partner model and measured performance across tasks (Fig. 5c). All instructing and partner models used in this section are instances of SBERTNET (L) (Methods).
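The embedding-inference step can be caricatured as gradient descent on motor error with every network weight frozen. In the following toy version, the frozen sensorimotor network is reduced to a fixed linear map W, and only the embedding activity e is updated; all values are illustrative:

```python
# Frozen 'sensorimotor' map and desired motor output (illustrative values).
W = [[2.0, 0.0], [0.0, 0.5]]
target = [1.0, 1.0]

e = [0.0, 0.0]   # embedding-layer activity: the only quantity updated
lr = 0.1

def forward(e):
    """Frozen network: motor output = W e."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]

for _ in range(500):
    out = forward(e)
    err = [o - t for o, t in zip(out, target)]  # motor error
    # Gradient of 0.5 * ||W e - target||^2 with respect to e is W^T err;
    # note the network weights W themselves are never changed.
    grad = [sum(W[i][j] * err[i] for i in range(2)) for j in range(2)]
    e = [x - lr * g for x, g in zip(e, grad)]

print([round(x, 3) for x in e])  # ≈ [0.5, 2.0], since then W e ≈ target
```

In the full model, the converged embedding is then fed to the production-RNN to decode an instruction, rather than being read out directly.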

Figure 5

a, Illustration of the self-supervised training procedure for the language production network (blue). The red dashed line indicates gradient flow. b, Illustration of the motor feedback used to drive task performance in the absence of linguistic instructions. c, Illustration of the partner model evaluation procedure used to evaluate the quality of instructions generated by the instructing model. d, Three example instructions produced from sensorimotor activity evoked by embeddings inferred in b for an AntiDMMod1 task. e, Confusion matrix of instructions produced, again using the method described in b. The y axis indicates the input–output task used to infer an embedding, and the x axis indicates whether the instruction produced from the resulting sensorimotor activity was included in one of the instruction sets used during self-supervised training or was a 'novel' formulation. f, Performance of partner models in different training regimes given produced instructions or direct input of embedding vectors. Each point represents the average performance of a partner model across tasks using instructions from decoders trained with different random initializations. Dots indicate that the partner model was trained on all tasks, whereas diamonds indicate performance on held-out tasks. Axes indicate the training regime of the instructing model. Full statistical comparisons of performance can be found in Supplementary Fig. 12.

Figure 5d shows some example decoded instructions for the AntiDMMod1 task (see Supplementary Notes 4 for all decoded instructions). To visualize decoded instructions across the task set, we plotted a confusion matrix for the case where both the sensorimotor-RNN and the production-RNN are trained on all tasks (Fig. 5e). Note that many decoded instructions were entirely 'novel', that is, they were not included in the training set for the production-RNN (Methods). Novel instructions made up 53% of decoded instructions across all tasks.

To test the quality of these novel instructions, we evaluated a partner model's performance on instructions generated by the first network (Fig. 5c; results are shown in Fig. 5f). When the partner model was trained on all tasks, performance on decoded instructions was 93% on average across tasks. Communicating instructions to partner models with tasks held out of training also resulted in good performance (78%). Importantly, performance was maintained even for 'novel' instructions, where average performance was 88% for partner models trained on all tasks and 75% for partner models with held-out tasks. Given that the instructing and partner models share the same architecture, one might expect it to be more efficient to forgo the language component of communication and simply copy the embedding inferred by one model into the input of the partner model. Doing so resulted in only 31% correct performance on average, and 28% when testing partner models on held-out tasks. Although both instructing and partner networks share the same architecture and the same competencies, they nonetheless have different synaptic weights. Hence, a neural representation tuned to the weights of one agent will not necessarily produce good performance in another.

We also tested an instructing model using a sensorimotor-RNN with tasks held out of training. We emphasize here that during training the production-RNN attempts to decode from sensorimotor hidden states induced by instructions for tasks the network has never experienced before (Fig. 5a), whereas at test time, instructions are produced from sensorimotor states that emerge entirely as a result of minimizing a motor error (Fig. 5b,c). We nonetheless find that, in this setting, a partner model trained on all tasks performs at 82% correct, while partner models with tasks held out of training perform at 73%. Here, 77% of produced instructions are novel, and we see a decrease of only 1% when we test the same partner models only on novel instructions. As above, directly copying the inferred embeddings induces relatively low performance: 30% and 37% correct for partners trained on all tasks and with tasks held out, respectively.

Lastly, we tested our most extreme setting where tasks have been held out for both sensorimotor-RNNs and production-RNNs (Fig. 5f ). We find that produced instructions induce a performance of 71% and 63% for partner models trained on all tasks and with tasks held out, respectively. Although this is a decrease in performance from our previous set-ups, the fact that models can produce sensible instructions at all in this double held-out setting is striking. The fact that the system succeeds to any extent speaks to strong inductive biases introduced by training in the context of rich, compositionally structured semantic representations.

In this study, we used the latest advances in natural language processing to build tractable models of two abilities: interpreting instructions to guide actions in novel settings, and producing a description of a task once it has been learned. RNNs can learn to perform a set of psychophysical tasks simultaneously using a pretrained language transformer to embed a natural language instruction for the current task. Our best-performing models can leverage these embeddings to perform a brand-new task with an average performance of 83% correct. Instructed models that generalize do so by leveraging the shared compositional structure of instruction embeddings and task representations, such that an inference about the relations between practiced and novel instructions leads to a good inference about what sensorimotor transformation is required for the unseen task. Finally, we showed that a network can invert this information and provide a linguistic description for a task based only on the sensorimotor contingency it observes.

Our models make several predictions for what neural representations to expect in brain areas that integrate linguistic information in order to exert control over sensorimotor areas. First, the CCGP analysis of our model hierarchy suggests that when humans must generalize across (or switch between) a set of related tasks based on instructions, the neural geometry observed among sensorimotor mappings should also be present in semantic representations of instructions. This prediction is well grounded in the existing experimental literature, where multiple studies have observed that the type of abstract structure we find in our sensorimotor-RNNs also exists in sensorimotor areas of biological brains 3, 36, 37. Our models theorize that the emergence of an equivalent task-related structure in language areas is essential to instructed action in humans. One intriguing candidate for an area that may support such representations is the language-selective subregion of the left inferior frontal gyrus. This area is sensitive to both lexico-semantic and syntactic aspects of sentence comprehension, is implicated in tasks that require semantic control and lies anatomically adjacent to another functional subregion of the left inferior frontal gyrus that is implicated in flexible cognition 38, 39, 40, 41. We also predict that individual units involved in implementing sensorimotor mappings should modulate their tuning properties on a trial-by-trial basis according to the semantics of the input instructions, and that failure to modulate tuning in the expected way should lead to poor generalization. This prediction may be especially useful for interpreting multiunit recordings in humans. Finally, given that grounding linguistic knowledge in the sensorimotor demands of the task set improved performance across models (Fig. 2e), we predict that during learning the highest levels of the language processing hierarchy should likewise be shaped by the embodied processes that accompany linguistic inputs, for example, motor planning or affordance evaluation 42.

One notable negative result of our study is the relatively poor generalization performance of GPTNET (XL), which used at least an order of magnitude more parameters than other models. This is particularly striking given that activity in these models is predictive of many behavioral and neural signatures of human language processing 10 , 11 . Given this, future imaging studies may be guided by the representations in both autoregressive models and our best-performing models to delineate a full gradient of brain areas involved in each stage of instruction following, from low-level next-word prediction to higher-level structured-sentence representations to the sensorimotor control that language informs.

Our models may guide future work comparing compositional representations in nonlinguistic subjects like nonhuman primates. Comparison of task switching (without linguistic instructions) between humans and nonhuman primates indicates that both use abstract rule representations, although humans can make switches much more rapidly 43 . One intriguing parallel in our analyses is the use of compositional rule vectors (Supplementary Fig. 5 ). Even in the case of nonlinguistic SIMPLENET, using these vectors boosted generalization. Importantly, however, this compositionality is much stronger for our best-performing instructed models. This suggests that language endows agents with a more flexible organization of task subcomponents, which can be recombined in a broader variety of contexts.

Our results also highlight the advantages of linguistic communication. Networks can compress the information they have gained through experience of motor feedback and transfer that knowledge to a partner network via natural language. Although rudimentary in our example, the ability to endogenously produce a description of how to accomplish a task after a period of practice is a hallmark human language skill. The failure to transfer performance by sharing latent representations demonstrates that, to communicate information within a group of independent networks of neurons, that information must pass through a representational medium that is equally interpretable by all members of the group. In humans and for our best-performing instructed models, this medium is language.

A series of works in reinforcement learning has investigated using language and language-like schemes to aid agent performance. Agents receive language information through step-by-step descriptions of action sequences 44 , 45 , or by learning policies conditioned on a language goal 46 , 47 . These studies often deviate from natural language, using linguistic inputs that are pre-parsed or that simply refer directly to environmental objects. Some larger versions of the pretrained language models we use to embed instructions also display instruction-following behavior, for example, GPT-3 (ref. 7 ), PALM 12 , LaMDA 13 and InstructGPT 48 in the modality of language, and DALL-E 8 and Stable Diffusion 14 in a language-to-image modality. The semantic and syntactic understanding displayed in these models is impressive. However, the outputs of these models are difficult to interpret in terms of guiding the dynamics of a downstream action plan. Finally, recent work has sought to engineer instruction-following agents that can function in complex or even real-world environments 16 , 17 , 18 . While these models exhibit impressive behavioral repertoires, they rely on perceptual systems that fuse linguistic and visual information, making them difficult to compare to language representations in human brains, which emerge from a set of areas specialized for processing language. In all, none of these models offers a testable representational account of how language might be used to induce generalization over sensorimotor mappings in the brain.

Our models by contrast make tractable predictions for what population and single-unit neural representations are required to support compositional generalization and can guide future experimental work examining the interplay of linguistic and sensorimotor skills in humans. By developing interpretable models that can both understand instructions as guiding a particular sensorimotor response, and communicate the results of sensorimotor learning as an intelligible linguistic instruction, we have begun to explain the power of language in encoding and transferring knowledge in networks of neurons.

Model architecture

Sensorimotor-RNN

The base model architecture and task structure used in this paper follow ref. 18 . All networks of sensorimotor units denoted sensorimotor-RNN are gated recurrent units (GRU) 49 using rectified linear unit (ReLU) nonlinearities with 256 hidden units each. Inputs to the networks consist of (1) sensory inputs, X t , and (2) task-identifying information, I t . We initialize hidden activity in the GRU as \({h}^{0}\in {{\mathbb{R}}}^{256}\) with values set to 0.1. All networks of sensorimotor units use the same hidden state initialization, so we omit h 0 in network equations. At each time step, a readout layer Linear out decodes motor activity, \(\hat{{y}_{t}}\) , from the activity of recurrent hidden units, h t , according to:

where σ denotes the sigmoid function. Sensory inputs X t are made up of three channels, two sensory modalities \({x}_{{{\mathrm{mod}}}\,1,t}\) and \({x}_{{{\mathrm{mod}}}\,2,t}\) , and a fixation channel x fix, t . Both \({x}_{{{\mathrm{mod}}}\,1,t},{x}_{{{\mathrm{mod}}}\,2,t}\in {{\mathbb{R}}}^{32}\) and stimuli in these modalities are represented as hills of activity with peaks determined by units’ preferred directions around a one-dimensional circular variable. For an input at direction θ , the activity of a given input unit u i with preferred direction θ i is

where \(str\) is the coefficient describing stimulus strength. The fixation channel \({x}_{{{{\rm{fix}}}},t}\in {{\mathbb{R}}}^{1}\) is a single unit simulating a fixation cue for the network. In all, sensory input \({X}_{t}=({x}_{mod1,t},{x}_{mod2,t},{x}_{fix,t})\in {{\mathbb{R}}}^{65}\) . Motor output \({\hat{{y}}_{t}}\) consists of both a 32-dimensional ring representing directional responses to the input stimulus as well as a single unit representing model fixation, so that \({\hat{{y}}_{t}}\in {{\mathbb{R}}}^{33}\) .
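As an illustrative sketch, the sensory encoding for one time step might look as follows. The Gaussian tuning shape and the width `sigma` are assumptions for illustration; the paper's exact tuning equation is given above.

```python
import numpy as np

def encode_direction(theta, str_stim=1.0, n_units=32, sigma=0.5):
    """Encode direction theta as a hill of activity over n_units tuned units.
    sigma (tuning width) is an assumed value, not the paper's parameter."""
    preferred = np.linspace(0, 2 * np.pi, n_units, endpoint=False)
    # signed circular distance between theta and each unit's preferred direction
    d = np.angle(np.exp(1j * (theta - preferred)))
    return str_stim * np.exp(-0.5 * (d / sigma) ** 2)

# one time step of sensory input: two 32-unit modalities plus a fixation unit (65-d)
x_mod1 = encode_direction(np.pi, str_stim=1.1)
x_mod2 = np.zeros(32)
x_fix = np.ones(1)
X_t = np.concatenate([x_mod1, x_mod2, x_fix])
```

The peak of the hill lands on the unit whose preferred direction matches the stimulus, and the peak height scales with stimulus strength.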

For all models, task-identifying information \({I}_{t}\in {{\mathbb{R}}}^{64}\) . Task-identifying information is presented throughout the duration of a trial and remains constant such that \({I}_{t}={I}_{t{\prime} }\forall t,t{\prime}\) . For all models, task-identifying information I t and sensory input X t are concatenated as inputs to the sensorimotor-RNN.

Nonlinguistic models

For SIMPLENET, we generate a set of 64-dimensional orthogonal task rules by constructing an orthogonal matrix using the Python package scipy.stats.ortho_group, and assign rows of this matrix to each task type. For STRUCTURENET, we generate a set of ten orthogonal, 64-dimensional vectors in the same manner, and each of these represents a dimension of the task set (that is, respond weakest versus strongest direction, respond in the same versus opposite direction, pay attention only to stimuli in the first modality, and so on). Rule vectors for tasks are then simple combinations of each of these ten basis vectors. For a full description of structure rule vectors, see Supplementary Note 3 .
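The rule-vector construction can be sketched as follows. The task count and the particular basis-vector combination shown for STRUCTURENET are illustrative assumptions; the paper's exact combinations are given in Supplementary Note 3.

```python
import numpy as np
from scipy.stats import ortho_group

# SIMPLENET: one row of a random 64x64 orthogonal matrix per task
Q = ortho_group.rvs(64, random_state=0)
n_tasks = 50  # illustrative task count
task_rules = {f"task_{i}": Q[i] for i in range(n_tasks)}

# STRUCTURENET (sketch): rules are simple combinations of ten orthogonal
# basis vectors, one per dimension of the task set
basis = ortho_group.rvs(64, random_state=1)[:10]
anti_go_rule = basis[0] + basis[1]  # hypothetical combination, for illustration
```

Because the rows of an orthogonal matrix are mutually orthogonal unit vectors, the SIMPLENET rules carry no shared structure between tasks, whereas STRUCTURENET rules share basis vectors across related tasks.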

We also test SIMPLENETPLUS and STRUCTURENETPLUS, which use an additional hidden layer with 128 units and ReLU nonlinearities to process orthogonal task rules I t into a vector \(\bar{{I}_{t}}\) , which is used by the sensorimotor-RNN as task-identifying information.

Full results for these models are included in Supplementary Fig. 4 .

Pretrained transformers

The main language models we test use pretrained transformer architectures to produce I . Importantly, transformers differ in the type of pretraining objective used to tune the model parameters. GPT is trained to predict the next word given a context of words 9 . GPT (XL) follows the same objective but trains for longer on a larger dataset 50 . Both models are fully autoregressive. BERT, by contrast, takes bidirectional language inputs and is tasked with predicting masked words that appear in the middle of input phrases. Additionally, BERT is trained on a simple sentence prediction task where the model must determine if input sentence 1 is followed by input sentence 2 in the training corpus. Extending this principle, SBERT is explicitly trained to produce fixed-length embeddings of whole sentences 21 . It takes pretrained BERT networks and uses them in a siamese architecture 51 , which allows the weights of the model to be tuned in a supervised fashion according to the Stanford Natural Language Inference dataset 22 . Natural language inference is a three-way categorization task where the network must infer the logical relationship between sentences: whether a premise sentence implies, contradicts or is unrelated to a hypothesis sentence. Finally, CLIP is trained to jointly embed images and language 23 . It is trained on captioned images and must categorize, via a contrastive loss, which text–image pairs in the dataset are matched and which are mismatched.

Importantly, the natural output of a transformer is a matrix of size \({\dim }_{{{{\rm{trans}}}}.}\times {{{\mathcal{T}}}}\) , the inherent dimensionality of the transformer by the length of the input sequence. To create an embedding space for sentences it is standard practice to apply a pooling method to the transformer output, which produces a fixed-length representation for each instruction.

For GPT, GPT (XL), BERT and SBERT, we use an average pooling method. Suppose we have an input instruction \({w}_{1}\ldots {w}_{{{{\mathcal{T}}}}}\) . Following standard practice with pretrained language models, the input to our transformers is tokenized with special ‘cls’ and ‘eos’ tokens at the beginning and end of the input sequence. We then compute I as follows:

We chose this average pooling method primarily because a previous study 21 found that this resulted in the highest-performing SBERT embeddings. Another alternative would be to simply use the final hidden representation of the ‘cls’ token as a summary of the information in the entire sequence (given that BERT architectures are bidirectional, this token will have access to the whole sequence).

where \({h}_{{{{\rm{cls}}}}}^{\rm{tran.}}\) denotes the last hidden representation of the ‘cls’ token. Ref. 21 found this pooling method performed worse than average pooling, so we do not include these alternatives in our results. For GPT and GPT (XL), we also tested a pooling method where the fixed-length representation for a sequence was taken from the transformer output of the ‘eos’ token. In this case:

We found that GPT failed to achieve even a relaxed performance criterion of 85% across tasks using this pooling method, and GPT (XL) performed worse than with average pooling, so we omitted these models from the main results (Supplementary Fig. 11 ). For CLIP models, we use the same pooling method as in the original multimodal training procedure, which takes the output of the ‘cls’ token as described above.
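In code, the pooling schemes amount to the following minimal numpy sketch over a dummy transformer output; the real pipeline operates on Huggingface model outputs, and the token positions assumed here (‘cls’ at index 0, ‘eos’ at the end) are illustrative.

```python
import numpy as np

def average_pool(hidden_states):
    """hidden_states: (T, dim_trans) transformer outputs for a tokenized
    instruction ('cls', w_1, ..., w_T, 'eos'). Average pooling collapses
    the sequence into one fixed-length embedding."""
    return hidden_states.mean(axis=0)

def cls_pool(hidden_states):
    # alternative: summary taken from the 'cls' token, assumed at position 0
    return hidden_states[0]

def eos_pool(hidden_states):
    # alternative tested for GPT models: output of the final ('eos') token
    return hidden_states[-1]

H = np.random.default_rng(0).normal(size=(12, 768))  # 12 tokens, dim_trans = 768
I_avg = average_pool(H)
```

Whichever pooling is used, the resulting fixed-length vector is what gets passed through Linear embed to produce the 64-dimensional task-identifying input.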

For all the above models, we also tested a version where the information from the pretrained transformers is passed through a multilayer perceptron with a single hidden layer of 256 hidden units and ReLU nonlinearities. We found that this manipulation reduced performance across all models, verifying that a simple linear embedding is beneficial to generalization performance.

For GPT, BERT and SBERT, \({\dim }_{{{{\rm{trans}}}}.}=768\) and each model uses a total of ~100 million parameters; for SBERT (L), \({\dim }_{{{{\rm{trans}}}}.}=1,024\) and the model uses ~300 million parameters; for GPT (XL), \({\dim }_{{{{\rm{trans}}}}.}=1,600\) and the model uses ~1.5 billion parameters; for CLIP, \({\dim }_{{{{\rm{trans}}}}.}=512\) and the model uses ~60 million parameters. Full PyTorch implementations, including all pretrained weights and model hyperparameters, can be accessed at the Huggingface library ( https://huggingface.co/docs/transformers/ ) 52 .

For our BoW model, instructions are represented as a vector of binary activations the size of the instruction vocabulary, where each unit indicates the inclusion or exclusion of the associated word in the current instruction. For our instruction set, ∣vocab∣ = 181. This vector is then projected through a linear layer into 64-dimensional space.
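A minimal sketch of the BoW embedding follows; the tiny vocabulary and the random, untrained projection weights are stand-ins for illustration (the paper's vocabulary has 181 words and its projection is learned).

```python
import numpy as np

# tiny stand-in vocabulary; the paper's instruction vocabulary has 181 words
vocab = ["respond", "in", "the", "opposite", "direction", "of", "stimulus"]
word_index = {w: i for i, w in enumerate(vocab)}

def bow_embedding(instruction, W):
    """Binary inclusion/exclusion vector over the vocabulary, projected
    through a linear layer W into 64-dimensional space."""
    x = np.zeros(len(vocab))
    for w in instruction.lower().split():
        if w in word_index:
            x[word_index[w]] = 1.0  # binary indicator, not word counts
    return W @ x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(64, len(vocab)))  # untrained weights, illustrative
I = bow_embedding("respond in the opposite direction", W)
```

Note that the binary encoding discards word order entirely, which is what makes BoW a useful control against the transformer embeddings.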

Blank slate language models

Given that tuning the last layers of language models resulted in improved performance (Fig. 2e ), we tested two additional models to determine whether a blank slate language model trained exclusively on the loss from sensorimotor tasks would improve performance. These models consist of passing BoW representations through a multilayer perceptron and passing pretrained BERT word embeddings through one layer of a randomly initialized BERT encoder. Both models performed poorly compared to pretrained models (Supplementary Fig. 4.5 ), confirming that language pretraining is essential to generalization.

Tasks were divided into five interrelated subgroups: ‘go’, ‘decision-making’, ‘matching’, ‘comparison’ and ‘duration’. Depending on the task, multiple stimuli may appear during the stimulus epoch. Also, depending on the task, models may be required to respond in a particular direction or repress response altogether. Unless otherwise specified, zero-mean Gaussian noise is added independently at each time step and to each input unit, and the variance of this noise is drawn randomly from \({\mathbb{U}}[0.1,0.15]\) . The timing of stimuli differs among task types. However, for all tasks, trials can be divided into preparatory, stimulus and response epochs. The stimulus epoch can be subdivided into three parts—stim1, delay and stim2—although these distinct parts are not used by all tasks. A trial lasts for a total of T  = 150 time steps. Let \(du{r}_{{{{\rm{epoch}}}}}\) denote the duration in simulated time steps of a given epoch. Then

For tasks that do not utilize a delay structure, stim1, stim2 and delay epochs are grouped together in a single stimulus epoch where \(du{r}_{{{{\rm{stimulus}}}}}=du{r}_{{{{\rm{stim}}}}1}+du{r}_{{{{\rm{stim}}}}2}+du{r}_{{{{\rm{delay}}}}}\) . Unless otherwise specified, a fixation cue with a constant strength \(st{r}_{{{{\rm{fix}}}}}=1\) is activated throughout the preparatory and stimulus epochs. For example trials of each task, see Supplementary Fig. 13 .

‘Go’ tasks

The ‘Go’ family of tasks includes ‘Go’, ‘RTGo’, ‘AntiGo’, ‘AntiRTGo’ and modality-specific versions of each task denoted with either ‘Mod1’ or ‘Mod2’. In both the ‘Go’ and ‘AntiGo’ tasks, a single stimulus is presented at the beginning of the stimulus epoch. The direction of the presented stimulus is generated by drawing from a uniform distribution between 0 and 2 π , that is, \({\theta }_{{{{\rm{stim}}}}} \sim {\mathbb{U}}[0,2\pi ]\) . The stimulus will appear in either modality 1 or modality 2 with equal probability. The strength of the stimulus is given by \(st{r}_{{{{\rm{stim}}}}} \sim {\mathbb{U}}[1.0,1.2]\) . In the ‘Go’ task, the target response is in the same direction as the presented stimulus, that is, \({\theta }_{{{{\rm{stim}}}}}={\theta }_{{{{\rm{target}}}}}\) , while in the ‘AntiGo’ task the response should be in the direction opposite the stimulus, \({\theta }_{{{{\rm{stim}}}}}+\pi ={\theta }_{{{{\rm{target}}}}}\) . For modality-specific versions of each task, a stimulus direction is drawn in each modality \({\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}},{{{\rm{mod}}}}2} \sim {\mathbb{U}}[0,2\pi ]\) and for modality-specific Go-type tasks

while for modality-specific AntiGo-type tasks

For ‘RT’ versions of the ‘Go’ tasks, stimuli are only presented during the response epoch and the fixation cue is never extinguished. Thus, the presence of the stimulus itself serves as the response cue and the model must respond as quickly as possible. Otherwise, stimuli persist through the duration of the stimulus epoch.

‘Decision-making’ tasks

The ‘decision-making’ family of tasks includes ‘DM’ (decision-making), ‘AntiDM’, ‘MultiDM’ (multisensory decision-making), ‘AntiMultiDM,’ modality-specific versions of each of these tasks and, finally, confidence-based versions of ‘DM’ and ‘AntiDM.’ For all tasks in this group, two stimuli are presented simultaneously and persist throughout the duration of the stimulus epoch. They are drawn according to \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}\) \([({\theta }_{{{{\rm{stim}}}}1}-0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}-0.6\pi )\cup ({\theta }_{{{{\rm{stim}}}}1}+0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}+0.6\pi )]\) . A base strength applied to both stimuli is drawn such that \(st{r}_{\rm{base}} \sim {\mathbb{U}}[1.0,1.2]\) . A contrast is drawn from a discrete distribution such that c  ~ {−0.175, −0.15, −0.1, 0.1, 0.15, 0.175} so the stimulus strength associated with each direction in a trial are given by \(st{r}_{{{{\rm{stim}}}}1}=st{r}_{\rm{base}}+c\) and \(st{r}_{{{{\rm{stim}}}}2}=\) \({str}_{\rm{base}}-c\) .
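The trial-sampling procedure above can be sketched directly; this is a minimal illustration of the distributions just described, not the paper's code.

```python
import numpy as np

CONTRASTS = np.array([-0.175, -0.15, -0.1, 0.1, 0.15, 0.175])

def draw_dm_trial(rng):
    theta1 = rng.uniform(0, 2 * np.pi)
    # second direction lies 0.2*pi to 0.6*pi away on either side of the first
    offset = rng.uniform(0.2 * np.pi, 0.6 * np.pi) * rng.choice([-1.0, 1.0])
    theta2 = (theta1 + offset) % (2 * np.pi)
    str_base = rng.uniform(1.0, 1.2)
    c = rng.choice(CONTRASTS)
    return theta1, theta2, str_base + c, str_base - c

rng = np.random.default_rng(0)
theta1, theta2, str1, str2 = draw_dm_trial(rng)
# 'DM' targets the stronger direction; 'AntiDM' the weaker one
target_dm = theta1 if str1 > str2 else theta2
target_anti = theta2 if str1 > str2 else theta1
```

Because the contrast set excludes zero, the two stimulus strengths always differ, so every trial has a well-defined target.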

For the ‘DM’ task,

and for the ‘AntiDM’ task,

For these versions of the tasks, the stimuli are presented in either modality 1 or modality 2 with equal probability. For the multisensory versions of each task, stimuli directions are drawn in the same manner and presented across both modalities so that \({\theta }_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}1}={\theta }_{{{{\rm{stim}}}}1,{{{\rm{mod}}}}2}\) and \({\theta }_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}1}={\theta }_{{{{\rm{stim}}}}2,{{{\rm{mod}}}}2}\) . Base strengths are drawn independently for each modality. Contrasts for both modalities are drawn from a discrete distribution such that \({c}_{{{\mathrm{mod}}}\,1},{c}_{{{\mathrm{mod}}}\,2} \sim \left\{0.2,0.175,\right.\) \(\left.0.15,0.125,-0.125,-0.15,-0.175,-0.2\right\}\) . If \(| {c}_{{{\mathrm{mod}}}\,1}| -| {c}_{{{\mathrm{mod}}}\,2}| =0\) , then contrasts are redrawn to avoid zero-contrast trials during training. If both \({c}_{{{\mathrm{mod}}}\,1}\) and \({c}_{{{\mathrm{mod}}}\,2}\) have the same sign, then contrasts are redrawn to ensure that the trial requires integrating over both modalities as opposed to simply performing a ‘DM’ task in a single modality. Criteria for target responses are measured as the strength of a given direction summed over both modalities. So, for ‘MultiDM’

and for ‘AntiMultiDM’

Stimuli for modality-specific versions of each task are generated in the same way as multisensory versions of the task. Criteria for target response are the same as standard versions of ‘DM’ and ‘AntiDM’ tasks applied only to stimuli in the relevant modality.

In confidence-based decision-making tasks (‘ConDM’ and ‘ConAntiDM’), the stimuli directions are drawn in the same way as above. Stimuli are shown in either modality 1 or modality 2 with equal probability. In each trial, \(st{r}_{{{{\rm{base}}}}}=1\) . The contrast and noise for each trial are based on the thresholded performance of a SIMPLENET model trained on all tasks except ‘ConDM’ and ‘ConAntiDM’. Once this model has been trained, we establish a threshold across levels of noise and contrast for which the model can perform a ‘DM’ or an ‘AntiDM’ task at 95% correct. We then draw contrasts and noises for trials from above and below this threshold with equal probability during training. In trials where the noise and contrast levels fall below the 95% correct threshold, the model must repress response, and otherwise perform the decision-making task (either ‘DM’ or ‘AntiDM’).

‘Comparison’ tasks

Our comparison task group includes ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘Anti’ versions of each of these tasks, as well as modality-specific versions of ‘COMP1’ and ‘COMP2’ tasks. This group of tasks is designed to extend the basic decision-making framework into a setting with more complex control demands. These tasks utilize the delay structure in the stimulus epoch so that stim1 appears only during the stim1 epoch, followed by a delay, and finally stim2. This provides a temporal ordering on the stimuli. In ‘COMP1’, the model must respond to the first stimulus only if it has greater strength than the second and otherwise repress response; that is,

Likewise, in ‘COMP2’, the model must respond to the second stimulus if it is presented with greater strength than the first and otherwise repress response; that is,

In ‘Anti’ versions of the task, the ordering criterion is the same except that it applies to the stimulus with the least strength; that is, for ‘AntiCOMP1’

and for ‘AntiCOMP2’

In multisensory settings, the criteria for target direction are analogous to the multisensory decision-making tasks where strength is integrated across modalities. Likewise, for modality-specific versions, the criteria are only applied to stimuli in the relevant modality. Stimuli directions and strength for each of these tasks are drawn from the same distributions as the analogous task in the ‘decision-making’ family. However, during training, we make sure to balance trials where responses are required and trials where models must repress response.

‘Duration’ tasks

The ‘duration’ family of tasks includes ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘Anti’ versions of each of these tasks and modality-specific versions of ‘Dur1’ and ‘Dur2’ tasks. These tasks require models to perform a time estimation task with the added demand of stimulus ordering determining relevance for response. As in ‘comparison’ tasks, stim1 is presented followed by a delay and then stim2. For ‘Dur1’ trials

Likewise, for ‘Dur2’

In ‘Anti’ versions of these tasks, the correct response is in the direction of the stimulus with the shortest duration, given the ordering criterion is met. Hence, for ‘AntiDur1’

and for ‘AntiDur2’

Across these tasks, directions are drawn according to \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[({\theta }_{{{{\rm{stim}}}}1}-0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}-0.6\pi )\cup ({\theta }_{{{{\rm{stim}}}}1}+0.2\pi ,{\theta }_{{{{\rm{stim}}}}1}+0.6\pi )]\) . Stimulus strengths are drawn according to \(st{r}_{{{{\rm{stim}}}}1},st{r}_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[0.8,1.2]\) . To set the duration of each stimulus, we first draw \(du{r}_{{{{\rm{long}}}}} \sim \{i| 35 < i\le 50,i\in {\mathbb{N}}\}\) and \(du{r}_{{{{\rm{short}}}}} \sim \{i| 25 < i\le (du{r}_{{{{\rm{long}}}}}-8),i\in {\mathbb{N}}\}\) . During training, we determine which trials for a given task should and should not require a response in order to evenly balance repress and respond trials. We then assign \(du{r}_{{{{\rm{long}}}}}\) and \(du{r}_{{{{\rm{short}}}}}\) to either stim1 or stim2 so that the trial requires the appropriate response given the particular task type.
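The duration draws above translate directly into code; this sketch covers only the single-modality case just described.

```python
import numpy as np

def draw_durations(rng):
    """Draw long/short stimulus durations (in simulated time steps):
    dur_long in {36..50}, dur_short in {26..dur_long - 8}."""
    dur_long = int(rng.integers(36, 51))             # {i | 35 < i <= 50}
    dur_short = int(rng.integers(26, dur_long - 7))  # {i | 25 < i <= dur_long - 8}
    return dur_long, dur_short

rng = np.random.default_rng(0)
dur_long, dur_short = draw_durations(rng)
# For a respond-trial of 'Dur1', assign the long duration to stim1 so the
# first stimulus is the longer one; swap the assignment for a repress-trial.
stim1_dur, stim2_dur = dur_long, dur_short
```

The gap of at least 8 time steps between the two durations keeps the discrimination well above the network's temporal resolution.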

Again, criteria for correct response in the multisensory and modality-specific versions of each task follow analogous tasks in the ‘decision-making’ and ‘comparison’ groups, where multisensory versions of the task require integrating total duration over each modality, and modality-specific tasks require only considering durations in the given task modality. For multisensory tasks, we draw a duration value \(du{r}_{{{{\rm{long}}}}} \sim \{i| 75 < i\le 100,i\in {\mathbb{N}}\}\) and then split this value into \(du{r}_{{{{\rm{long}}}}0}=du{r}_{{{{\rm{long}}}}}\times 0.55\) and \(du{r}_{{{{\rm{long}}}}1}=du{r}_{{{{\rm{long}}}}}\times 0.45\) . We also draw a value \(du{r}_{{{{\rm{short}}}}}=du{r}_{{{{\rm{long}}}}}-\Delta dur\) where \(\Delta dur \sim \{i| 15 < i\le 25,i\in {\mathbb{N}}\}\) . This value is then subdivided further into \(du{r}_{{{{\rm{short}}}}0}=du{r}_{{{{\rm{long}}}}1}+\Delta du{r}_{{{{\rm{short}}}}}\) where \(\Delta du{r}_{{{{\rm{short}}}}} \sim \{i| 19 < i\le 15,i\in {\mathbb{N}}\}\) and \(du{r}_{{{{\rm{short}}}}1}=du{r}_{{{{\rm{short}}}}}-du{r}_{{{{\rm{short}}}}0}\) . Short and long durations can then be allocated to the ordered stimuli according to task type. Drawing durations in this manner ensures that, like in ‘decision-making’ and ‘comparison’ groups, correct answers truly require models to integrate durations over both modalities, rather than simply performing the task in a given modality to achieve correct responses.

‘Matching’ tasks

The ‘matching’ family of tasks consists of ‘DMS’ (delay match to stimulus), ‘DNMS’ (delay non-match to stimulus), ‘DMC’ (delay match to category) and ‘DNMC’ (delay non-match to category) tasks. For all tasks, stim1 is presented at the beginning of the stimulus epoch, followed by a delay, and the presentation of stim2. The stimulus strength is drawn according to \(st{r}_{{{{\rm{stim}}}}1},st{r}_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}[0.8,1.2]\) . The input modality for any given trial is chosen at random with equal probability. In both ‘DMS’ and ‘DNMS’ tasks, trials are constructed as ‘matching stim’ trials or ‘mismatching stim’ trials with equal probability. In ‘matching stim’ trials \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2}={\theta }_{{{{\rm{stim}}}}1}\) . In ‘mismatching stim’ trials, \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and

For ‘DMS’, models must respond in the displayed direction if the stimuli match and otherwise repress response,

and for ‘DNMS’, models must respond to the second direction if both directions are mismatched,

‘DMC’ and ‘DNMC’ tasks are organized in a similar manner. The stimulus input space is divided evenly into two categories such that cat1 = { θ : 0 <  θ  ≤  π } and cat2 = { θ :  π  <  θ  ≤ 2 π }. For ‘DMC’ and ‘DNMC’ tasks, trials are constructed as ‘matching cat.’ trials or ‘mismatching cat.’ trials with equal probability. In ‘matching cat.’ trials \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}({{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) , where \({\mathbb{U}}({{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) is a uniform draw from the category of stim1. In ‘mismatching cat.’ trials, \({\theta }_{{{{\rm{stim}}}}1} \sim {\mathbb{U}}[0,2\pi ]\) and \({\theta }_{{{{\rm{stim}}}}2} \sim {\mathbb{U}}(-{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1})\) where \(-{{{\mbox{cat}}}}_{{{{\rm{stim}}}}1}\) is the opposite category to stim1. For ‘DMC’, the model must respond in the first direction if both stimuli are presented in the same category and otherwise repress response,

and for ‘DNMC’, the model should respond to the second direction if both stimuli are presented in opposite categories and otherwise repress response,
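The category logic for the two tasks reduces to a few lines; this sketch implements only the respond/repress decision, using the category boundaries defined above.

```python
import numpy as np

def category(theta):
    """cat1 = (0, pi], cat2 = (pi, 2*pi]."""
    return 1 if 0 < theta <= np.pi else 2

def dmc_respond(theta1, theta2):
    # 'DMC': respond (in the first direction) only when categories match
    return category(theta1) == category(theta2)

def dnmc_respond(theta1, theta2):
    # 'DNMC': respond (in the second direction) only when categories differ
    return category(theta1) != category(theta2)
```

The two tasks are exact complements: for any stimulus pair, exactly one of them requires a response.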

Target output and correct criteria

The target output \(y\in {{\mathbb{R}}}^{33\times T}\) for a trial entails maintaining fixation in \({y}_{1}={y}_{{{{\rm{fix}}}}}\) during the stimulus epoch, and then either responding in the correct direction or repressing activity in the remaining target response units \({y}_{2\ldots 33}\) in the response epoch. Since the model should maintain fixation until response, the target for fixation is set at \({y}_{{{{\rm{fix}}}}}=0.85\) during preparatory and stimulus epochs and \({y}_{{{{\rm{fix}}}}}=0.05\) in the response epoch. When a response is not required, as in the preparatory and stimulus epochs and with repressed activity in the response epoch, unit i takes on a target activity of \({y}_{i}=0.05\) . Alternatively, when there is a target direction for response,

where θ i is the preferred direction for unit i . As with the sensory stimuli, preferred directions for target units are evenly spaced values from [0, 2 π ] allocated to the 32 response units.
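A sketch of the target construction follows. The fixation values (0.85/0.05) and baseline (0.05) come from the text; the response-epoch onset and the Gaussian hill shape with width `sigma` are assumptions for illustration.

```python
import numpy as np

T, N_RING = 150, 32
RESP_ONSET = 100  # assumed response-epoch onset, for illustration
preferred = np.linspace(0, 2 * np.pi, N_RING, endpoint=False)

def make_target(theta_target=None, sigma=0.5):
    """Build y in R^{33 x T}: row 0 is the fixation unit, rows 1..32 the
    response ring. theta_target=None encodes a repress-response trial."""
    y = np.full((33, T), 0.05)
    y[0, :RESP_ONSET] = 0.85  # hold fixation until the response epoch
    if theta_target is not None:
        d = np.angle(np.exp(1j * (theta_target - preferred)))
        bump = np.exp(-0.5 * (d / sigma) ** 2)  # assumed hill shape
        y[1:, RESP_ONSET:] = np.maximum(0.05, bump)[:, None]
    return y

y = make_target(np.pi)
```

On repress trials the entire ring stays at the 0.05 baseline, so the loss penalizes any response activity.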

For a model response to count as correct, it must maintain fixation, that is, \({\hat{y}}_{{{{\rm{fix}}}}} > 0.5\) , during preparatory and stimulus epochs. When no response is required, \({\hat{y}}_{i} < 0.15\) for all response units. When a response is required, response activity is decoded using a population vector method and must satisfy \({\theta }_{{{{\rm{resp}}}}.}\in ({\theta }_{{{{\rm{target}}}}}-\frac{\pi }{10},{\theta }_{{{{\rm{target}}}}}+\frac{\pi }{10})\) . If the model fails to meet any of these criteria, the trial response is incorrect.

Model training

Again following ref. 18 , model parameters are updated in a supervised fashion according to a masked mean squared error loss (mMSE) computed between the model motor response, \({\hat{y}}_{1\ldots T}=\hat{y}\) , and the target, y 1… T  =  y , for each trial.

Here, the multiplication sign denotes element-wise multiplication. Masks weigh the importance of different trial epochs. During preparatory and stimulus epochs, mask weights are set to 1; during the first five time steps of the response epoch, the mask value is set to 0; and during the remainder of the response epoch, the mask weight is set to 5. The mask value for the fixation is twice that of other values at all time steps.
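The mask construction and loss can be sketched as follows; the mask weights (1, 0 for the five-step grace period, 5 afterwards, fixation doubled) come from the text, while the response-epoch onset is an assumed value for illustration.

```python
import numpy as np

def make_mask(T=150, resp_onset=100, n_out=33):
    """Mask weights: 1 in preparatory/stimulus epochs, 0 for the first five
    response steps, 5 for the remainder; the fixation row (0) is weighted
    twice as heavily throughout. resp_onset is assumed for illustration."""
    mask = np.ones((n_out, T))
    mask[:, resp_onset:resp_onset + 5] = 0.0
    mask[:, resp_onset + 5:] = 5.0
    mask[0] *= 2.0
    return mask

def masked_mse(y_hat, y, mask):
    # element-wise product, then mean over units and time steps
    return float(np.mean(mask * (y_hat - y) ** 2))

mask = make_mask()
loss = masked_mse(np.zeros((33, 150)), np.full((33, 150), 0.05), mask)
```

The zero-weight grace period means the model is not penalized for its reaction time in the first few steps after the response cue.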

For all models, we update Θ = {sensorimotor-RNN, Linear out } during training on our task set. For instructed models, we additionally update Linear embed in the process of normal training. We train models using standard PyTorch machinery and an Adam optimizer. An epoch consists of 2,400 mini-batches, with each mini-batch consisting of 64 trials. For all models, we use the same initial learning rate as in ref. 18 , lr = 0.001. We found that in the later phases of training, model performance oscillated depending on which task had most recently been presented, so we decayed the learning rate each epoch by a factor of γ  = 0.95, which allowed performance to converge smoothly. Following ref. 18 , models train until they reach a threshold performance of 95% across all tasks (and train for a minimum of 35 epochs). We found that training for GPTNET tended to asymptote below the performance threshold for multisensory versions of comparison tasks. This held true over a variety of training hyperparameters and learning rate scheduler regimes. Hence, we relax the performance threshold of GPTNET to 85%. For each model type, we train five models that start from five different random initializations. Where applicable, results are averaged over these initializations.

Language model fine-tuning

When fine-tuning models, we allow the gradient from the motor loss experienced during sensorimotor training to fine-tune the weights in the final layers of the transformer language models. During normal training, we checkpoint a copy of our instructed models after training for 30 epochs. We then add the last three transformer layers to the set of trainable parameters, and reset the learning rates to lr = 1 × 10 −4 for Θ = {sensorimotor-RNN, Linear out } and lr lang  = 3 × 10 −4 for Θ lang  = {Linear embed , transformer −3,−2,−1 }, where transformer −3,−2,−1 denotes the parameters of the last three layers of the relevant transformer architecture. We used these reduced learning rates to avoid completely erasing preexisting linguistic knowledge. Similarly, for the RNN parameters, we found the above learning rate avoided catastrophic forgetting of sensorimotor knowledge while also allowing the RNN to adapt to updated language embeddings across all models. Autoregressive models were much more sensitive to this procedure, often collapsing at the beginning of fine-tuning. Hence, for GPTNET (XL) and GPTNET, we used lr lang  = 5 × 10 −5 , which resulted in robust learning. Models train until they reach a threshold performance of 95% across training tasks, or 85% correct for GPTNET.

Hold-out testing

During hold-out testing, we present models with 100 batches of one of the tasks that had been held out of training. For instructed models, the only weights allowed to update during this phase are Θ = {sensorimotor-RNN, Linear out , Linear embed }. All weights of SIMPLENET and STRUCTURENET are trainable in this context. In this hold-out setting, we found that for more difficult tasks, the standard hyperparameters used during training resulted in unstable learning curves for some of our more poorly performing models. To stabilize performance and thereby create fair comparisons across models, we used an increased batch size of 256. We then began with the standard learning rate of 0.001 and decreased it in increments of 0.0005 until all models showed robust learning curves, which resulted in a learning rate of 8 × 10 −4 . All additional results shown in the Supplementary Information section 4 follow this procedure.

CCGP calculation

To calculate CCGP, we trained a linear decoder on a pair of tasks and then tested that decoder on alternative pairs of tasks that have an analogous relationship. We grouped tasks into eight dichotomies: ‘Go’ versus ‘Anti’, ‘Standard’ versus ‘RT’, ‘Weakest’ versus ‘Strongest’, ‘Longest’ versus ‘Shortest’, ‘First Stim.’ versus ‘Second Stim.’, ‘Stim Match’ versus ‘Category Match’, ‘Matching’ versus ‘Non-Matching’ and ‘Mod1’ versus ‘Mod2’. As an example, the ‘Go’ versus ‘Anti’ dichotomy includes (‘Go’, ‘AntiGo’), (‘GoMod1’, ‘AntiGoMod1’), (‘GoMod2’, ‘AntiGoMod2’), (‘RTGo’, ‘AntiRTGo’), (‘RTGoMod1’, ‘AntiRTGoMod1’) and (‘RTGoMod2’, ‘AntiRTGoMod2’) task pairs. For ‘RNN’ task representations, we extracted activity at the time of stimulus onset for 250 example trials. For language representations, we input the instruction sets for relevant tasks to our language model and directly analyze activity in the ‘embedding’ layer or take the sequence-averaged activity in each transformer layer. For nonlinguistic models, we simply analyze the space of rule vectors. Train and test conditions for decoders were determined by dichotomies identified across the task set (Supplementary Note 1 ). To train and test decoders, we used the sklearn.svm.LinearSVC class from the scikit-learn Python package. The CCGP score for a given task is the average decoding score achieved across all dichotomies where the task in question was part of either the train set or the test set. For model scores reported in the main text, we only calculate CCGP scores for models where the task in question has been held out of training. In Supplementary Fig. 9 , we report scores on tasks where models have been trained on all tasks, and for models where instructions have been switched for the hold-out task.
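The cross-condition generalization logic can be sketched on toy data. The paper trains sklearn.svm.LinearSVC; here a mean-difference linear decoder keeps the sketch dependency-free, and the toy geometry (dichotomy and task-variant axes made exactly orthogonal, i.e. an abstract representation) is an assumption constructed so that CCGP comes out high.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 250  # representation dimension; 250 example trials per condition

def fit_linear_decoder(X, y):
    """Mean-difference linear decoder (the paper used sklearn.svm.LinearSVC)."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    b = -w @ (X[y == 1].mean(axis=0) + X[y == 0].mean(axis=0)) / 2
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean((X @ w + b > 0) == y))

# toy abstract geometry: dichotomy axis orthogonal to the task-variant axis
axis_dichotomy = rng.normal(size=d)
axis_variant = rng.normal(size=d)
axis_variant -= (axis_variant @ axis_dichotomy) / (axis_dichotomy @ axis_dichotomy) * axis_dichotomy

def sample(is_anti, variant):
    return is_anti * axis_dichotomy + variant * axis_variant + 0.1 * rng.normal(size=(n, d))

# train on ('Go', 'AntiGo'); test on the analogous pair ('GoMod1', 'AntiGoMod1')
X_train = np.vstack([sample(0, 0), sample(1, 0)])
X_test = np.vstack([sample(0, 1), sample(1, 1)])
labels = np.repeat([0, 1], n)
w, b = fit_linear_decoder(X_train, labels)
ccgp = accuracy(w, b, X_test, labels)
```

If the dichotomy and variant axes were entangled instead of orthogonal, the same decoder would fail on the held-out pair, which is exactly what a low CCGP score reflects.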

For Fig. 3e , we calculated Pearson’s r correlation coefficient between performance on held-out tasks and CCGP scores per task, as well as a P value testing against the null hypothesis that these metrics are uncorrelated and normally distributed (using the scipy.stats.pearsonr function). Full statistical tests for CCGP scores of both RNN and embedding layers from Fig. 3f can be found in Supplementary Fig. 9. Note that transformer language models share the same set of pretrained weights across random initializations of sensorimotor-RNNs; thus, for language model layers, the Fig. 3f plots show the absolute scores of those language models.
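The correlation analysis for Fig. 3e reduces to a single scipy call; the per-task values below are invented placeholders standing in for held-out performance and CCGP scores.

```python
from scipy.stats import pearsonr

# Invented per-task values standing in for held-out performance and CCGP scores.
performance = [0.95, 0.80, 0.62, 0.88, 0.55, 0.91, 0.70, 0.84]
ccgp_scores = [0.92, 0.78, 0.60, 0.85, 0.58, 0.90, 0.72, 0.80]

# Pearson's r with a two-sided p-value against the null of no correlation.
r, p = pearsonr(performance, ccgp_scores)
```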

Conditional clause/deduction task analysis

We first split our task set into two groups (listed below): tasks whose instructions included conditional clauses and simple deductive reasoning components (30 tasks) and tasks whose instructions were simple imperatives (20 tasks). We computed the difference between the mean generalization performance of each group, across random initializations, for each model (Fig. 2f). We compared these differences to a null distribution constructed by performing a set of 50 random shuffles of the task set into groups of 30 and 20 tasks and computing differences in the same way, again using two-sided unequal-variance t-tests. Because STRUCTURENET is a nonlinguistic model, we then compared the performance of STRUCTURENET to that of our instructed models to dissociate the effects of performing tasks with a deductive reasoning component from those of processing instructions with more complicated conditional clause structure. Results of all statistical tests are reported in Supplementary Fig. 6.
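The shuffle-based null distribution can be sketched as follows. The per-task scores are invented; the grouping sizes (30 and 20), the 50 shuffles, and the two-sided unequal-variance t-test follow the text.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Invented per-task generalization scores for the 50 tasks.
scores = rng.uniform(0.4, 1.0, 50)
is_conditional = np.array([True] * 30 + [False] * 20)  # 30 clause/deduction, 20 imperative

observed_diff = scores[is_conditional].mean() - scores[~is_conditional].mean()

# Null distribution: 50 random reassignments into groups of 30 and 20 tasks.
null = []
for _ in range(50):
    perm = rng.permutation(is_conditional)
    null.append(scores[perm].mean() - scores[~perm].mean())

# Two-sided unequal-variance (Welch's) t-test between the two observed groups.
t_stat, p_val = ttest_ind(scores[is_conditional], scores[~is_conditional], equal_var=False)
```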

Simple imperative tasks include: ‘Go’, ‘AntiGo’, ‘RTGo’, ‘AntiRTGo’, ‘GoMod1’, ‘GoMod2’, ‘AntiGoMod1’, ‘AntiGoMod2’, ‘RTGoMod1’, ‘AntiRTGoMod1’, ‘RTGoMod2’, ‘AntiRTGoMod2’, ‘DM’, ‘AntiDM’, ‘MultiDM’, ‘AntiMultiDM’, ‘DMMod1’, ‘DMMod2’, ‘AntiDMMod1’ and ‘AntiDMMod2’.

Conditional clause/deduction tasks include: ‘ConDM’, ‘ConAntiDM’, ‘Dur1’, ‘Dur2’, ‘MultiDur1’, ‘MultiDur2’, ‘AntiDur1’, ‘AntiDur2’, ‘AntiMultiDur1’, ‘AntiMultiDur2’, ‘Dur1Mod1’, ‘Dur1Mod2’, ‘Dur2Mod1’, ‘Dur2Mod2’, ‘COMP1’, ‘COMP2’, ‘MultiCOMP1’, ‘MultiCOMP2’, ‘AntiCOMP1’, ‘AntiCOMP2’, ‘AntiMultiCOMP1’, ‘AntiMultiCOMP2’, ‘COMP1Mod1’, ‘COMP1Mod2’, ‘COMP2Mod1’, ‘COMP2Mod2’, ‘DMS’, ‘DNMS’, ‘DMC’ and ‘DMNC’.

Language production training

Self-supervised language production network training.

Our language production framework is inspired by classic sequence-to-sequence modeling using RNNs 53 . Our Production-RNN is a GRU with 256 hidden units using ReLU nonlinearities. At each step in the sequence, a set of decoder weights, Linear words , attempts to decode the next token, w τ +1 , from the hidden state of the recurrent units. The hidden state of the Production-RNN is initialized by concatenating the time-averaged and maximum sensorimotor activity of an SBERTNET (L) and passing that through the weights Linear sm . The linguistic instruction used to drive the initializing sensorimotor activity is in turn used as the target set of tokens for the Production-RNN outputs. The first input to the Production-RNN is always a special start-of-sentence token, and the decoder runs until an end-of-sentence token is decoded or until the output reaches a length of 30 tokens. Suppose \({w}_{1,k}\ldots {w}_{{{{\mathcal{T}}}},k}\in {\rm{Instruc{t}}}_{k}^{i}\) is the sequence of tokens in instruction k , where k is in the instruction set for task i and X i is the sensory input for a trial of task i . For brevity, we denote the process by which language models embed instructions as Embed() (see ‘Pretrained transformers’). The decoded token at the τ th position, \({\hat{w}}_{\tau ,k}\) , is then given by
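The displayed equation did not survive extraction. Under the definitions above, the decoding step can be sketched as follows, writing the Production-RNN hidden state as \(h^{\mathrm{prod}}_{\tau}\) (this notation is an assumption, not necessarily the authors' exact formulation):

```latex
h^{\mathrm{prod}}_{0} = \mathrm{Linear_{sm}}\!\left(\left[\overline{h^{\mathrm{sm}}_{t}};\; \max_{t} h^{\mathrm{sm}}_{t}\right]\right), \qquad
h^{\mathrm{prod}}_{\tau} = \text{Production-RNN}\!\left(\hat{w}_{\tau-1,k},\, h^{\mathrm{prod}}_{\tau-1}\right),

p_{\hat{w}_{\tau,k}} = \mathrm{softmax}\!\left(\mathrm{Linear_{words}}\!\left(h^{\mathrm{prod}}_{\tau}\right)\right), \qquad
\hat{w}_{\tau,k} = \operatorname*{arg\,max}_{w}\; \left(p_{\hat{w}_{\tau,k}}\right)_{w}
```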

The model parameters Θ production  = {Linear sm , Linear words , Production-RNN} are trained using a cross-entropy loss between \({p}_{{\hat{w}}_{\tau ,k}}\) and the instruction token w τ , k provided to the sensorimotor-RNN as input. We train for 80 epochs of 2,400 batches with 64 trials per batch and with task type randomly interleaved. We found that using an initial learning rate of 0.001 sometimes caused models to diverge in early phases of training, so we opted for a learning rate of 1 × 10 −4 , which led to stable early training. To alleviate similar oscillation problems detected in sensorimotor training, we also decayed the learning rate by γ  = 0.99 per epoch. Additionally, the use of a dropout layer with a dropout rate of 0.05 improved performance. We also used a teacher forcing curriculum, where for some ratio of training batches, we input the ground-truth instruction token w τ , k at each time step instead of the model's decoded word \({\hat{w}}_{\tau ,k}\). At each epoch, teacher_forcing_ratio = 0.5 × (80 − epoch)/80.
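The annealing schedule at the end of the paragraph, written out as a function:

```python
def teacher_forcing_ratio(epoch, n_epochs=80):
    """Fraction of batches fed ground-truth tokens: 0.5 at epoch 0, 0 at the end."""
    return 0.5 * (n_epochs - epoch) / n_epochs
```

On average, half of the training batches are teacher-forced at the start of training, decaying linearly to none by the final epoch.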

Obtaining embedding layer activity using motor feedback

For a task, i , we seek to optimize a set of embedding activity vectors \({E}^{i}\in {{\mathbb{R}}}^{64}\) such that, when they are input as task-identifying information, the model will perform the task in question. Crucially, we freeze all model weights Θ = {sensorimotor-RNN, Linear out , Linear embed } and only update E i according to the standard supervised loss on the motor output. For notational clarity, GRU dependence on the previous hidden state h t −1 has been made implicit in the following equations.

We optimized a set of 25 embedding vectors for each task, again using an Adam optimizer. Here the optimization space has many suboptimal local minima corresponding to embeddings for related tasks. Hence, we used a high initial learning rate of l r  = 0.05, which we decayed by γ  = 0.8 each epoch. This resulted in more robust learning than lower learning rates. An epoch lasts for 800 batches with a batch size of 64, and we train for a minimum of 1 epoch, or until we reach a threshold performance of 90% (85% on the ‘DMC’ and ‘DNMC’ tasks).
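A minimal numpy sketch of the frozen-weights setup: a toy fixed linear readout stands in for the frozen sensorimotor network, and only the 64-dimensional embedding vector is updated, with the same initial learning rate (0.05) and per-epoch decay (γ = 0.8) as in the text. The output dimension, loss, and batch counts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the frozen network: a fixed linear readout of the embedding.
W = rng.normal(0.0, 0.5, (33, 64))   # frozen weights (hypothetical output size 33)
target = rng.normal(0.0, 1.0, 33)    # desired motor output for this task
E = rng.normal(0.0, 1.0, 64)         # embedding vector E^i, the only trainable object

lr, gamma = 0.05, 0.8                # high initial learning rate, decayed per epoch
losses = []
for epoch in range(10):
    for _ in range(100):             # batches per epoch (shortened for illustration)
        err = W @ E - target
        E -= lr * 2.0 * (W.T @ err) / err.size   # gradient of the mean squared error
    losses.append(float(np.mean((W @ E - target) ** 2)))
    lr *= gamma                      # per-epoch learning-rate decay
```

The high initial rate lets the embedding jump out of basins belonging to related tasks, while the decay stabilizes the final estimate.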

Producing task instructions

To produce task instructions, we simply use the set E i as task-identifying information in the input of the sensorimotor-RNN and use the Production-RNN to output instructions based on the sensorimotor activity driven by E i . For each task, we use the set of embedding vectors to produce 50 instructions per task. We repeat this process for each of the 5 initializations of sensorimotor-RNN, resulting in 5 distinct language production networks, and 5 distinct sets of learned embedding vectors. Reported results for each task are averaged over these 5 networks. For the confusion matrix (Fig. 5d ), we report the average percentage that decoded instructions are in the training instruction set for a given task or a novel instruction. Partner model performance (Fig. 5e ) for each network initialization is computed by testing each of the 4 possible partner networks and averaging over these results.

Sample sizes/randomization

No statistical methods were used to predetermine sample sizes but, following ref. 18, we used five different random weight initializations per language model tested. Randomization of weights was carried out automatically in the Python and PyTorch software packages. Given this automated randomization of weights, we did not use any blinding procedures in our study. No data were excluded from analyses.

All simulations and data analyses were performed in Python 3.7.11. PyTorch 1.10 was used to implement and train models (this includes the Adam optimizer implementation). Transformers 4.16.2 was used to implement language models, and all pretrained weights for language models were taken from the Huggingface repository ( https://huggingface.co/docs/transformers/ ). We also used scikit-learn 0.24.1 and scipy 1.7.3 to perform analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All weights for language transformers used in this study were taken from pretrained models available on the Huggingface repository ( https://huggingface.co/docs/transformers/ ). Training data for simulated psychophysical tasks were generated using code available at https://github.com/ReidarRiveland/Instruct-RNN/ . The full set of trained model weights for all results is available upon request.

Code availability

All code used to train models and analyze results can be found at https://github.com/ReidarRiveland/Instruct-RNN/ .

Cole, M. W. et al. Multi-task connectivity reveals flexible hubs for adaptive task control. Nature Neurosci. 16 , 1348–1355 (2013).

Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24 , 167–202 (2001).

Bernardi, S. et al. The geometry of abstraction in the hippocampus and prefrontal cortex. Cell 183 , 954–967 (2020).

Minxha, J., Adolphs, R., Fusi, S., Mamelak, A. N. & Rutishauser, U. Flexible recruitment of memory-based choice representations by the human medial frontal cortex. Science 368 , eaba3313 (2020).

Ito, T. et al. Compositional generalization through abstract representations in human and artificial neural networks. In Proc. 36th Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 32225–32239 (Curran Associates, Inc., 2022).

Driscoll, L., Shenoy, K. & Sussillo, D. Flexible multitask computation in recurrent networks utilizes shared dynamical motifs. Preprint at bioRxiv https://doi.org/10.1101/2022.08.15.503870 (2022).

Brown, T. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems 1877–1901 (Curran Associates Inc., 2020).

Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 8821–8831 (PMLR, 2021).

Radford, A. et al. Language models are unsupervised multitask learners. OpenAI 1 , 9 (2019).

Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA https://doi.org/10.1073/pnas.2105646118 (2021).

Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nature Neurosci. 25 , 369–380 (2022).

Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24 , 11324–11436 (2023).

Thoppilan, R. et al. LaMDA: language models for dialog applications. Preprint at https://arxiv.org/abs/2201.08239 (2022).

Rombach, R. et al. High-resolution image synthesis with latent diffusion models. In Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10674–10685 (IEEE, 2022).

Zitkovich, B. et al. RT-2: vision-language-action models transfer web knowledge to robotic control. In Proc. 7th Conference on Robot Learning (eds Tan, J. et al.) 2165–2183 (PMLR, 2023).

Abramson, J. et al. Imitating interactive intelligence. Preprint at https://arxiv.org/abs/2012.05672 (2021).

DeepMind Interactive Agents Team. Creating multimodal interactive agents with imitation and self-supervised learning. Preprint at https://arxiv.org/abs/2112.03763 (2022).

Yang, G. R., Joglekar, M. R., Song, H. F., Newsome, W. T. & Wang, X.-J. Task representations in neural networks trained to perform many cognitive tasks. Nat. Neurosci. 22 , 297–306 (2019).

Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., 2017).

Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at http://arxiv.org/abs/1810.04805 (2018).

Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. Preprint at https://arxiv.org/abs/1908.10084 (2019).

Bowman, S. R., Angeli, G., Potts, C. & Manning, C. D. A large annotated corpus for learning natural language inference. Preprint at http://arxiv.org/abs/1508.05326 (2015).

Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 8748–8763 (PMLR, 2021).

Goel, V., Gold, B., Kapur, S. & Houle, S. Neuroanatomical correlates of human reasoning. J. Cogn. Neurosci. 10 , 293–302 (1998).

Goel, V., Buchel, C., Frith, C. & Dolan, R. J. Dissociation of mechanisms underlying syllogistic reasoning. Neuroimage 12 , 504–514 (2000).

Reverberi, C. et al. Neural basis of generation of conclusions in elementary deduction. Neuroimage 38 , 752–762 (2007).

Noveck, I. A., Goel, V. & Smith, K. W. The neural basis of conditional reasoning with arbitrary content. Cortex 40 , 613–622 (2004).

Monti, M. M., Osherson, D. N., Martinez, M. J. & Parsons, L. M. Functional neuroanatomy of deductive inference: a language-independent distributed network. Neuroimage 37 , 1005–1016 (2007).

Monti, M. M., Parsons, L. M. & Osherson, D. N. The boundaries of language and thought in deductive inference. Proc. Natl Acad. Sci. USA 106 , 12554–12559 (2009).

Coetzee, J. P. & Monti, M. M. At the core of reasoning: dissociating deductive and non-deductive load. Hum. Brain Mapp. 39 , 1850–1861 (2018).

Monti, M. M. & Osherson, D. N. Logic, language and the brain. Brain Res. 1428 , 33–42 (2012).

Prado, J. The relationship between deductive reasoning and the syntax of language in broca’s area: a review of the neuroimaging literature. L’année Psychol. 118 , 289–315 (2018).

Ito, T., Yang, G. R., Laurent, P., Schultz, D. H. & Cole, M. W. Constructing neural network models from brain data reveals representational transformations linked to adaptive behavior. Nat. Commun. 13 , 673 (2022).

Shadlen, M. N. & Newsome, W. T. Neural basis of a perceptual decision in the parietal cortex (area lip) of the rhesus monkey. J. Neurophysiol. 86 , 1916–1936 (2001).

Huk, A. C. & Shadlen, M. N. Neural activity in macaque parietal cortex reflects temporal integration of visual motion signals during perceptual decision making. J. Neurosci. 25 , 10420–10436 (2005).

Panichello, M. F. & Buschman, T. J. Shared mechanisms underlie the control of working memory and attention. Nature 592 , 601–605 (2021).

Nieh, E. H. et al. Geometry of abstract learned knowledge in the hippocampus. Nature 595 , 80–84 (2021).

Fedorenko, E. & Blank, I. A. Broca’s area is not a natural kind. Trends Cogn. Sci. 24 , 270–284 (2020).

Fedorenko, E., Duncan, J. & Kanwisher, N. Language-selective and domain-general regions lie side by side within broca’s area. Curr. Biol. 22 , 2059–2062 (2012).

Gao, Z. et al. Distinct and common neural coding of semantic and non-semantic control demands. NeuroImage 236 , 118230 (2021).

Duncan, J. The multiple-demand (MD) system of the primate brain: mental programs for intelligent behaviour. Trends Cogn. Sci. 14 , 172–179 (2010).

Buccino, G., Colagé, I., Gobbi, N. & Bonaccorso, G. Grounding meaning in experience: a broad perspective on embodied language. Neurosci. Biobehav. Rev. 69 , 69–78 (2016).

Mansouri, F. A., Freedman, D. J. & Buckley, M. J. Emergence of abstract rules in the primate brain. Nat. Rev. Neurosci. 21 , 595–610 (2020).

Oh, J., Singh, S., Lee, H. & Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. In Proc. 34th International Conference on Machine Learning 2661–2670 (JMLR.org, 2017).

Chaplot, D. S., Mysore Sathyendra, K., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. Gated-attention architectures for task-oriented language grounding. In Proc. 32nd AAAI Conference on Artificial Intelligence Vol. 32 (AAAI Press, 2018).

Sharma, P., Torralba, A. & Andreas, J. Skill induction and planning with latent language. Preprint at https://arxiv.org/abs/2110.01517 (2021).

Jiang, Y., Gu, S., Murphy, K. & Finn, C. Language as an abstraction for hierarchical deep reinforcement learning. In Proc. 33rd International Conference on Neural Information Processing Systems 9419–9431 (Curran Associates Inc., 2019).

Ouyang, L. et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 27730–27744 (Curran Associates, Inc., 2022).

Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).

Radford, A. et al. Better language models and their implications. https://openai.com/blog/better-language-models/ (2019).

Bromley, J. et al. Signature verification using a ‘siamese’ time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 7 , 669–688 (1993).

Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, 2020).

Sutskever, I., Vinyals, O. & Le., Q. V. Sequence to sequence learning with neural networks. In Proc. 27th International Conference on Neural Information Processing Systems 3104–3112 (MIT Press, 2014).

Acknowledgements

We thank N. Rungratsameetaweemana, T. Aquino and V. Borghesani as well as N. Patel and P. Tano for their useful discussions during this project. We are also grateful to the University of Geneva for the funding that made this research possible.

Open access funding provided by University of Geneva.

Author information

Authors and affiliations

Department of Basic Neuroscience, University of Geneva, Geneva, Switzerland

Reidar Riveland & Alexandre Pouget

Contributions

A.P. and R.R. conceived the project. R.R. wrote the code for model simulations and performed analysis of model representations. A.P. and R.R. wrote and revised the paper.

Corresponding author

Correspondence to Reidar Riveland .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Neuroscience thanks Blake Richards and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary Figs. 1–13 and Supplementary Notes 1–4

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Riveland, R., Pouget, A. Natural language instructions induce compositional generalization in networks of neurons. Nat Neurosci (2024). https://doi.org/10.1038/s41593-024-01607-5

Download citation

Received : 13 May 2023

Accepted : 15 February 2024

Published : 18 March 2024

DOI : https://doi.org/10.1038/s41593-024-01607-5

Publications

Daniel Jurafsky. 2014. The Language of Food. W. W. Norton.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Barbara A. Fox, Dan Jurafsky, and Laura A. Michaelis (Eds.). 1999. Cognition and Function in Language. Stanford, CA: CSLI Publications.

Avery D. Andrews and Christopher D. Manning. 1999. Complex Predicates and Information Spreading in LFG. Stanford, CA: CSLI Publications.

Natural Language Processing

Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more.

Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems. We are particularly interested in algorithms that scale well and can be run efficiently in a highly distributed environment.

Our syntactic systems predict part-of-speech tags for each word in a given sentence, as well as morphological features such as gender and number. They also label relationships between words, such as subject, object, modification, and others. We focus on efficient algorithms that leverage large amounts of unlabeled data, and recently have incorporated neural net technology.
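As a toy illustration of the syntactic predictions described above (the tags, vocabulary, and plural heuristic below are all invented stand-ins; production systems instead learn these predictions from large amounts of unlabeled data):

```python
# Toy lexicon-based tagger; tags, vocabulary, and the plural rule are invented
# stand-ins for what trained syntactic systems predict.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "cats": "NOUN", "dog": "NOUN",
    "sees": "VERB", "see": "VERB",
}

def tag(sentence):
    """Predict a part-of-speech tag for each word, defaulting unknowns to NOUN."""
    return [(w, LEXICON.get(w, "NOUN")) for w in sentence.lower().split()]

def number_feature(noun):
    """A crude morphological feature: plural if the noun ends in 's'."""
    return "Plur" if noun.endswith("s") else "Sing"

tags = tag("The cat sees a dog")
```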

On the semantic side, we identify entities in free text, label them with types (such as person, location, or organization), cluster mentions of those entities within and across documents (coreference resolution), and resolve the entities to the Knowledge Graph.

Recent work has focused on incorporating multiple sources of knowledge and information to aid with analysis of text, as well as applying frame semantics at the noun phrase, sentence, and document level.

Recent Publications

Natural Language Processing and Its Applications in Machine Translation: A Diachronic Review


Heliyon 7(3), March 2021

Natural language processing for urban research: A systematic review

Associated data.

Data included in article/supplementary material/referenced in article.

Natural language processing (NLP) has shown potential as a promising tool to exploit under-utilized urban data sources. This paper presents a systematic review of urban studies published in peer-reviewed journals and conference proceedings that adopted NLP. The review suggests that the application of NLP in studying cities is still in its infancy. Current applications fall into five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design. NLP demonstrates the advantages of improving the usability of urban big data sources, expanding study scales, and reducing research costs. On the other hand, to take advantage of NLP, urban researchers face the challenges of raising good research questions; overcoming data incompleteness, inaccessibility, and non-representativeness; coping with immature NLP techniques; and meeting computational skill requirements. This review is among the first efforts intended to provide an overview of existing applications and challenges for advancing urban research through the adoption of NLP.

Natural language processing; Urban research; Urban big data; Text mining

1. Introduction

The advancement of technologies is not just changing cities ( Urban Land Institute, 2019 ); it is also transforming the way urban researchers are able to study cities. Gray's notion of the fourth paradigm of science pointed out that the wide availability of data changes the practice of science ( Hey et al., 2009 ). Abundant urban big data is being generated and stored at unprecedented speed and scale; researchers nowadays are able to ask and answer questions in ways that were impossible in the past.

The paradigm shift of scientific research highlights the need for a new generation of scientific tools and methods. Among all existing data, 95% are in unstructured form, which lacks an identifiable tabular organization required by traditional data analysis methods ( Gandomi and Haider, 2015 ). Unstructured data, such as Web pages, emails, and mobile phone records, may contain numerical information (e.g. dates) but is usually text-heavy. Unlike numbers, textual data are inherently inaccurate and vague. According to a conservative estimate by Britton (1978) , at least 32% of the words used in English text are lexically ambiguous. The messy reality of textual data makes it challenging for researchers to take advantage of urban big data.

On the other hand, the large quantity of textual data provides new opportunities for urban researchers to examine people's perceptions, attitudes, and behaviors, so as to advance the knowledge and understanding of urban dynamics. For example, Jang and Kim (2019) have proved that crowd-sourced text data gathered from social media can effectively represent the collective identity of urban space. Conventional data gathering techniques, such as surveys, focus groups, and interviews, are oftentimes expensive and time-consuming. If used wisely, organic text data without pre-specified purposes could be incredibly powerful and complement purposefully designed data collection.

Natural language processing (NLP) has demonstrated tremendous capabilities of harvesting the abundance of textual data. As a form of artificial intelligence, it uses computational algorithms to learn, understand, and produce human language content ( Hirschberg and Manning, 2015 ). It is interrelated with machine learning and deep learning. Basic NLP procedures include processing text data, converting text to features, and identifying semantic relationships ( Ghosh and Gunning, 2019 ). In addition to its ability to structure large volumes of unstructured data, NLP can improve the accuracy of text processing and analysis because it follows rules and criteria in a consistent way. NLP has proven to be useful in many fields. For example, in medical research, Guetterman et al. (2018) conducted an experiment to compare the results from an NLP analysis and a traditional text analysis. They reported that NLP was able to identify major themes that were manually summarized by traditional text analysis.
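The basic NLP procedures named above — processing text, converting text to features, and identifying semantic relationships — can be illustrated with a minimal bag-of-words sketch (the documents and the overlap score are invented for illustration):

```python
from collections import Counter

def to_features(text):
    """Process raw text into normalized tokens and convert them to count features."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(tokens)

def overlap(a, b):
    """A crude relatedness score: total token mass shared by two documents."""
    return sum((a & b).values())

doc1 = to_features("The city is growing, and the city is changing.")
doc2 = to_features("Researchers study how the city is changing.")
score = overlap(doc1, doc2)  # shared tokens: 'the', 'city', 'is', 'changing'
```

Real NLP pipelines replace the count features with learned representations, but the structure — text in, features out, relationships computed over features — is the same.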

Here, a comprehensive review of the ways that researchers have utilized NLP in urban studies is presented. This work is among the first efforts intended to provide a synthesis of opportunities and challenges for advancing urban research through the adoption of NLP.

2. Methodology

2.1. Literature search

The aim of this literature search was to gather all scientific publications in urban studies that utilized the method of NLP. To serve this aim, journal articles and conference papers were searched in four online databases: EBSCO Urban Studies Abstracts, Scopus, ProQuest, and Web of Science. Due to the fact that each database has different searchable fields and filtering options, slightly different search criteria were adopted depending on the different databases used (see Table 1 ). Besides the criteria listed in Table 1 , the language of publications in all four database searches was also constrained so the results only included literature in English. The search timeframe was “all years,” which means the results contained all publications to date (November 2019).

Table 1

Literature search criteria.

The initial search returned 271 publications: 6 from EBSCO Urban Studies Abstracts, 69 from Scopus, 125 from ProQuest, and 71 from Web of Science. After removing 73 duplicates, the titles and abstracts of the remaining articles were reviewed. The publications were further narrowed down by determining that 152 were of irrelevant topics to urban research, such as travel planning, regional linguistic variations, or corpus development; 18 studies did not use the method of NLP; and four articles were without full-text access. The above mentioned 174 articles were removed and the remaining 24 studies were reviewed in full texts. Two articles identified from citations of the included studies were added for review. As a result of reviewing the publications found based on criteria of relevance and full-text access, this study included a total of 26 publications for detailed analysis.
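The screening arithmetic above can be checked end to end:

```python
# Counts reported in the paragraph above.
initial = 6 + 69 + 125 + 71          # EBSCO + Scopus + ProQuest + Web of Science
after_dedup = initial - 73           # duplicates removed
excluded = 152 + 18 + 4              # irrelevant topic, no NLP, no full-text access
reviewed_full_text = after_dedup - excluded
included = reviewed_full_text + 2    # two studies added from citation searches
```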

2.2. Limitations

While the strategy used during the literature search was meant to be a comprehensive and systematic approach, it had several limitations. First, the search had a language bias because it only included studies published in English. Articles in non-English languages with English abstracts were not included either. Second, the method for retrieving publications may have excluded studies that used NLP techniques but had been labeled with other terminology. For example, studies that used latent Dirichlet allocation (LDA), a statistical model in NLP, listed LDA as a keyword rather than NLP and therefore did not match the literature search criteria. Third, this review only included peer-reviewed journal articles and conference papers, which eliminated possible NLP applications documented in dissertations, theses, reports, and working papers. This was a tradeoff between literature quantity and quality.

3. Literature search results

3.1. Amount of publications

The systematic literature search returned a total of 26 urban studies that used NLP, of which 21 were journal articles and five were conference papers. All of those appeared from 2012 onwards and more than half (62%) from 2018 onwards ( Figure 1 ). The exponential increase in the number of publications reflects the growing interest in NLP among urban researchers.

Figure 1

Amount of urban studies using NLP by year.

3.2. NLP application

Urban researchers have explored diverse topics using NLP, as summarized in Table 2 . In general, researchers have applied NLP in five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design. Urban governance and management is the most dominant topic (39% of the reviewed studies), covering citizen engagement, disaster response, crime detection, and construction management. Researchers have also used NLP to study urban health (19% of the reviewed studies), including urban epidemic prediction, air quality monitoring, and assessment of living environments. Land use and functional zones is another popular area of research (19% of the reviewed studies), in which researchers used NLP to model urban spatiotemporal dynamics. Finally, a smaller number of researchers have adopted NLP in mobility (15% of the reviewed studies) and urban design research (8% of the reviewed studies).

Table 2

Summary of included studies.

3.3. Data source

The majority of the reviewed studies used social media as their data source, including Twitter, Instagram, Facebook, Foursquare, Craigslist, Minube, and Yelp ( Table 2 ). Researchers typically extract and analyze the text along with the geolocation information embedded in social media posts. However, data sources are not limited to social media: researchers have also used NLP to process information gathered from interviews, focus groups, phone call records, building permits, online hotel reviews, event listings, and neighborhood reviews. The data size can be large (e.g., millions of tweets) or small (e.g., a dozen interviews). Additionally, researchers have extended the use of NLP from textual data to non-textual data, such as points of interest (POI) data in maps and GPS trajectories generated by cell phones and taxis. It is worth mentioning that when a study's objective was predictive modeling, it was common to check the validity of NLP results against records from official sources.

3.4. Study area

Studies using NLP have covered a wide range of geographic locations ( Table 2 ). Most studies focused on major cities with large populations, such as New York City, US and Beijing, China. Some examined multiple cities for comparison. The scale of analysis ranges from a single city to a continent.

4. Applications of NLP in urban research

Using NLP has the advantages of improving the usability of urban big data sources, expanding study areas and scales, and reducing research costs. In this section, the opportunities shown in the current applications of NLP are discussed in five areas: urban governance and management, public health, land use and functional zones, mobility, and urban design.

4.1. Urban governance and management

NLP adds new opportunities to citizen engagement, the most prominent topic among studies of urban governance challenges ( Cruz et al., 2019 ). NLP techniques combined with online crowd-sourced data open up a communication channel between city managers and the general public. From 2001 to 2004, the Electronic Democracy European Network (EDEN) project launched a real-life pilot to test whether a particular NLP approach could improve communication between citizens and public administrators ( European Commission, 2015 ). Though the EDEN project ran into multiple obstacles, the project managers and engineers concluded that “it seems reasonable to approach e-democracy by seeking a democratic approach to software solutions” ( Carenini, Whyte, Bertorello and Vanocchi, 2007 , p. 27). Computer scientists have also developed NLP applications to function as a citizen feedback gathering tool ( Estévez-Ortiz et al., 2016 ), a citizen concern detector ( Abali et al., 2018 ), and an urban community identifier ( Vargas-Calderón and Camargo, 2019 ), all of which showed promising results. In addition, combining NLP with interviews and focus groups, Bardhan et al. (2019) discovered gender inequality in Indian slum rehabilitation housing management, suggesting the need for a more systematic participatory approach to improve well-being among the rehabilitated occupants.

Additionally, NLP shows potential to support natural disaster responses. According to the US Congress's think tank, government agencies can use social media in emergency and disaster management in two ways: 1) as an outlet for information dissemination, and 2) as a systematic tool for emergency communication, victim assistance, situation monitoring, and damage estimation ( Lindsay, 2011 ). The second category is where NLP has a direct role. An early work by Imran et al. (2013) trained a model that extracts disaster-relevant information from tweets and achieved 40%–80% correctness. More recently, Hong et al. (2018) built an unsupervised NLP topic model that requires minimal human effort in text collection and analysis, which could help government agencies identify citizens' needs and prioritize tasks during natural disasters. Additionally, by integrating NLP with geospatial clustering methods, Hu et al. (2019a, 2019b) collected local place names from housing advertisements, which has implications for disaster response: because such names may not exist in official gazetteers, their absence could lead to miscommunication between local residents and disaster responders.

Furthermore, researchers have completed proof-of-concept studies combining NLP, machine learning, and spatial analysis to spot urban crime. In Brazil, Souza et al. (2016) trained a classification model on emergency phone call records from the state police department, and their model was able to analyze real-time tweets for crime detection. In the US, Helderop et al. (2019) succeeded in detecting prostitution activities by examining hotel reviews, locations, and prices. Though the generalizability of these methods needs further verification, with future improvements they could contribute to urban security.

Finally, scholars demonstrated the power of NLP in the research of construction management. Lai and Kontokosta (2019) conducted an exploratory study analyzing building permit records to uncover building renovation and adaptive reuse patterns in seven major cities in the US. The method they developed may benefit the monitoring of building alterations in urban areas.

4.2. Public health

Urban public health has drawn growing attention among researchers in recent years. Studies have revealed that various economic, social, and environmental factors, including the spread of infectious diseases, poor living conditions, unhealthy lifestyles, and pollution, could negatively affect public health in urban areas ( Moscato and Poscia, 2015 ).

NLP is essential to the large-scale use of social media as sensors for predicting epidemic outbreaks. Traditional epidemic monitoring relies on clinical reports gathered by public health authorities ( Vaughan et al., 1989 ). For instance, health care providers in the US depend on information provided by the Centers for Disease Control and Prevention (CDC) to learn about disease outbreaks ( CDC, 2018 ). However, the time lag between the date a disease starts and the date clinical cases are reported to authorities is a major drawback of official surveillance systems ( CDC, 2018 ). For this reason, many researchers have developed data processing and modeling techniques that use social media as a data source for real-time epidemic analysis ( Al-garadi et al., 2016 ). Though manually filtering and classifying relevant messages eliminates false-positive and false-negative errors, the tradeoff is a slow analysis process ( Nagar et al., 2014 ). NLP classification, on the other hand, can process data relatively quickly with reasonable accuracy, supporting early detection of a disease. For example, in a Japanese nationwide study, Wakamiya et al. (2018) used an NLP module to effectively estimate when and where influenza outbreaks were happening.
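
To make the classification step concrete, the following is a minimal sketch (not the pipeline of any cited study) of a multinomial Naive Bayes classifier with add-one smoothing that separates illness-related posts from other posts using bag-of-words counts. The training posts and labels are invented for illustration:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(labeled_posts):
    """Count word frequencies per class and class frequencies."""
    counts = {"flu": Counter(), "other": Counter()}
    class_totals = Counter()
    for text, label in labeled_posts:
        counts[label].update(tokenize(text))
        class_totals[label] += 1
    vocab = set(counts["flu"]) | set(counts["other"])
    return counts, class_totals, vocab

def classify(text, counts, class_totals, vocab):
    """Pick the class with the highest smoothed log-probability."""
    total_docs = sum(class_totals.values())
    scores = {}
    for label in counts:
        score = math.log(class_totals[label] / total_docs)
        n = sum(counts[label].values())
        for tok in tokenize(text):
            # add-one (Laplace) smoothing keeps unseen words from zeroing out a class
            score += math.log((counts[label][tok] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled posts; real studies train on far larger corpora.
posts = [
    ("got the flu fever and chills all week", "flu"),
    ("coughing sneezing fever staying home sick", "flu"),
    ("great coffee at the new cafe downtown", "other"),
    ("traffic on the bridge is terrible today", "other"),
]
counts, class_totals, vocab = train_nb(posts)
print(classify("home sick with fever and chills", counts, class_totals, vocab))  # prints flu
```

Geotagged posts classified as illness-related can then be aggregated in time and space, which is the basic signal behind the outbreak estimation studies above.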

In a similar vein, researchers view social media users as soft sensors to measure urban air quality. Riga and Karatzas (2014) adopted an NLP bag-of-words model to process social media posts and concluded that users’ reports of their surrounding environmental conditions on social media platforms are highly correlated with the actual observations obtained from official monitoring sites.

Analyzing urban residents' perceptions of living environments and evaluating urban communities' lifestyles is another sphere of public health research in which NLP appears to be useful. Hu et al. (2019a) used NLP to process online neighborhood reviews to assess New Yorkers' satisfaction with their living conditions and their perceived quality of life. Also using NLP, Fu et al. (2018) derived urban citizens' activities through their linguistic patterns. Additionally, Rahimi, Mottahedi, and Liu (2018) used a bag-of-words model to examine different communities’ food consumption behaviors and lifestyles in ten major cities in North America. Findings from these studies could serve as a valuable reference for city policymakers, as they provide multifaceted health-related information complementary to conventional census data.

4.3. Land use and functional zones

Looking back at the history of urban planning, collecting information on land use functions is a critical step before laying out urban plans ( Breheny and Batey, 1981 ). Traditional approaches to examining structures and changes in urban land use include analyzing aerial photographs ( Philipson, 1997 ), field surveys ( Pissourios, 2019 ), and remote sensing ( Bowden, 1975 ).

More recently, researchers have extended the usage of NLP from analyzing textual data to non-textual data, and applied it to urban land use and functionality studies. NLP typically detects underlying correlations between words according to their context. To capture urban spatial structures, researchers consider a region as a text document, a function as a topic, and research entities as words ( Li, Fei and Zhang, 2019 ; Yuan et al., 2015 ). In this way, various NLP modeling methods allow researchers to determine contextual relationships between urban functional regions or different land use types based on the similarities among entities (i.e. geographic space interactions). This method makes use of urban data generated by sensors, vehicle geolocation tracking systems, and location-based services.

Yuan et al. (2015) explained the concept of mobility semantics by arguing that people's socioeconomic activities in a region are strongly correlated with the spatiotemporal patterns of those who visit the region (i.e. mobility semantics). Another key concept, location semantics, refers to urban road networks and the allocation of POIs ( Yuan et al., 2015 ). By leveraging mobility and location semantics, Yuan et al. (2012 , 2015) identified urban functional zones (e.g. residential, business, and educational areas) through topic modeling. Similarly, Yao et al. (2017) classified urban land use at the level of irregular land parcels by integrating a semantic model and deep learning. Huang et al. (2018) quantified industrial land use changes in a bay area in China using POIs data. Based on a Word2Vec model, Li et al. (2019) proposed a regionalization method to cluster similar spatial units in an area and inspect the clusters' socioeconomic patterns by analyzing all mobility trajectories of people in that area.
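
The region-as-document analogy above can be sketched in a few lines: each region is a "document" of POI-category "words", the words are weighted with TF-IDF, and functional similarity between regions is measured by cosine similarity. The region names and POI tokens below are hypothetical, and the cited studies use more sophisticated models (topic models, Word2Vec) rather than this simplified scheme:

```python
import math
from collections import Counter

# Toy "region documents": each region is described by its POI-category tokens.
regions = {
    "A": ["school", "library", "park", "school", "cafe"],
    "B": ["office", "bank", "cafe", "office", "restaurant"],
    "C": ["school", "park", "playground", "library"],
}

def tfidf_vectors(docs):
    """Weight each region's tokens by term frequency times inverse document frequency."""
    n = len(docs)
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))  # document frequency: in how many regions a token appears
    vecs = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        vecs[name] = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(regions)
print(cosine(vecs["A"], vecs["C"]) > cosine(vecs["A"], vecs["B"]))  # prints True
```

Here the school/park region A comes out functionally closer to region C than to the office-dominated region B, mirroring how functional zones can be grouped from the entities they contain.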

These NLP modeling approaches to classifying land use and functional zones are efficient and capable of handling large volumes of data, showing potential as tools to monitor urban landscape dynamics and inform urban planning.

4.4. Mobility

Urban mobility researchers have begun to leverage NLP in their studies as well. Serna, Gerrikagoitia, Bernabé, and Ruiz (2017) demonstrated the feasibility of using NLP to automatically identify sustainable mobility issues from social media data, which could enrich the data of traditional travel surveys. Markou, Kaiser, and Pereira (2019) were able to predict taxi demand hotspots for special events by a tool they developed that scans the internet for time-series data.

Similar to the previously explained usage of NLP in land use and functional zones, researchers also adopted NLP for non-textual data analysis on urban mobility. By analyzing taxi moving paths recorded by the Global Navigation Satellite System, researchers measured spatiotemporal relationships among roads ( Liu et al., 2017 ) and identified the interaction pattern of vehicle movements on road networks ( Liu et al., 2019 ), which, they argued, could be useful in understanding and managing urban traffic.
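
As a toy illustration of this word analogy for trajectories, each trajectory can be treated like a sentence and each road segment like a word; the sketch below only counts windowed co-occurrences of segments, whereas the cited studies train Word2Vec-style embeddings. The trajectory data are invented:

```python
from collections import Counter

# Hypothetical taxi trajectories as ordered sequences of road-segment IDs.
trajectories = [
    ["r1", "r2", "r3", "r4"],
    ["r2", "r3", "r5"],
    ["r1", "r2", "r3"],
]

def cooccurrence(trajs, window=2):
    """Count segment pairs appearing within `window` steps of each other,
    treating each trajectory like a sentence and each segment like a word."""
    pairs = Counter()
    for traj in trajs:
        for i, a in enumerate(traj):
            for b in traj[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs

pairs = cooccurrence(trajectories)
print(pairs.most_common(1)[0][0])  # prints ('r2', 'r3')
```

Frequently co-occurring segments such as r2 and r3 indicate strongly interacting parts of the road network, which is the kind of spatiotemporal relationship the studies above extract at scale.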

4.5. Urban design

NLP can facilitate urban design with imageability analysis. “Urban design is the process of understanding people and place in an urban context, leading to a strategy for the improvement of urban life and the evolution of the built environment...” ( Building Design Partnership, 1991 , p. 14). Introduced by Lynch (1960) , imageability is an important concept in urban design that is still being discussed today. It involves subjective urban identity, emphasizing the quality of the built environment perceived and assessed by observers ( Lynch, 1960 ). NLP enables researchers to evaluate the emotional responses evoked by urban places and visually map urban identity at various scales and times.

Researchers have already used NLP to process hashtags from Instagram photos; combining this information with the photos' geolocations, they created a cognitive map of the Seoul metropolitan area in Korea to represent its residents’ collective perceptions of the place ( Jang and Kim, 2019 ). To explore the emotional dynamics of urban space, Iaconesi (2015) used NLP to connect geographical locations within a city to emotions expressed on social media, thereby establishing urban emotional landmarks. Observing urban identity and emotional landmarks helps urban designers and planners make interventions and shape urban spaces into more positive and imageable places.

5. NLP challenges in urban research

Ultimately, a method or technique alone will never solve any problem. NLP opens up an exciting new direction, but at the same time it brings about more challenges. In this section, four aspects of potential challenges that apply to urban research are discussed: research questions, data, the method itself, and researchers.

The challenge of research questions lies in identifying novel issues that traditional techniques cannot solve well. NLP holds great promise for untangling the complex relationships among urban systems; however, which questions it enables researchers to answer remains to be explored. In fact, the study of cities involves a variety of disciplines in urban contexts ( Ramadier, 2004 ). Existing urban studies have complemented or, sometimes, completely replaced traditional methods with NLP to solve problems in various urban-related fields. While conventional text analysis methods offer high accuracy, NLP has advantages in dealing with massive amounts of data at large scale and fine resolution. Given limited time and resources, NLP can provide insight into questions that are impossible to answer with traditional methods. Looking ahead, more urban studies using NLP to answer otherwise unanswerable questions are sure to emerge.

The data challenge for NLP goes hand in hand with the characteristics of urban big data. As Salganik (2018) pointed out, big data's incompleteness, inaccessibility, and non-representativeness are generally problematic for academic research. Incompleteness refers to the fact that, no matter their size, urban big data are not purposefully designed, structured data and are likely to lack information valuable to research, such as demographic factors. Inaccessibility means that data owned by private companies or government agencies are not always accessible to urban researchers due to legal or ethical barriers. Moreover, big data often fail to represent a given urban population. As a result, studies that use NLP to process big data may not yield generalizable results and risk overlooking certain populations.

Though there have been revolutionary advances in NLP, its mainstream application is still very limited ( Hirschberg and Manning, 2015 ). While the goal of NLP is for algorithms to ultimately determine the relationships between words and grammar in human language and organize meaning by computer logic, current techniques cannot yet resolve natural language exactly as humans do. NLP still needs improvements in “deal [ing] with irony, humor and other linguistic, psychological, anthropologic and cultural issues” ( Iaconesi, 2015 , p. 16), which is a difficult task for human analysts as well. Most recently, significant progress has been made in the field of NLP with the increasing ease of implementing pre-trained models such as ULMFiT ( Howard and Ruder, 2018 ) and BERT ( Devlin et al., 2019 ). While this may trigger wider adoption of NLP among researchers, it is important to validate NLP analysis against results reached by traditional methods.

Furthermore, people who study cities usually do not have professional training or a background in computer science. As a result, the complexity of detecting patterns, fitting models, and training classifiers limits urban researchers’ ability to take full advantage of NLP. This further hinders the transfer of knowledge into practice for urban planners and designers. To harness the new opportunities offered by NLP, urban researchers face the challenge of expanding their skill set. On the other hand, computer scientists who wish to conduct urban research face the challenge of comprehending sophisticated social concepts and theories. Robust collaboration among researchers in different fields is likely to drive NLP applications in the study of cities.

While some of these challenges are common to studies using NLP in all fields, others are more prominent for urban studies. Almost every researcher adopting NLP faces the challenges of acquiring good data and of working with immature NLP techniques. For example, addressing implicit bias is still a daunting task when building NLP models. The success of NLP applications in urban studies and other domains depends highly on the quality of data and modeling. Asking good research questions and meeting skill requirements are challenges more specific to urban researchers who intend to use NLP in their work. The spatial aspect of urban studies further compounds the challenge of adopting NLP. For instance, while NLP is effective in harvesting location data from texts, urban researchers need to be mindful of the massiveness and messiness of such data and assess the accuracy of uncovered geospatial information.

6. Conclusion

This systematic literature review suggests that only a limited number of urban studies have adopted NLP. Current applications fall into five areas of study: urban governance and management, public health, land use and functional zones, mobility, and urban design. Using NLP in urban research demonstrates the advantages of improving the usability of urban big data sources, expanding study areas and scales, and reducing research costs. While recognizing this new opportunity is exciting, it is important for urban researchers not to overestimate what NLP can accomplish and to acknowledge its limitations. To take advantage of NLP, urban researchers face the challenges of raising good research questions; overcoming data incompleteness, inaccessibility, and non-representativeness; working around immature NLP techniques; and meeting computational skill requirements.

Declarations

Author contribution statement.

M. Cai: Developed and wrote this article.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability statement

Declaration of interests statement.

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

Acknowledgements

I am very grateful to Dr. Mark Wilson, Dr. Jason Zhao, and the two anonymous reviewers for their insightful suggestions and comments on this paper.

  • Abali G., Karaarslan E., Hurriyetoglu A., Dalkilic F. 2018 6th International Istanbul Smart Grids and Cities Congress and Fair (ICSG), 30–33. 2018. Detecting citizen problems and their locations using twitter data. [ Google Scholar ]
  • Al-garadi M.A., Khan M.S., Varathan K.D., Mujtaba G., Al-Kabsi A.M. Using online social networks to track a pandemic: a systematic review. J. Biomed. Inf. 2016; 62 :1–11. [ PubMed ] [ Google Scholar ]
  • Bardhan R., Sunikka-Blank M., Haque A.N. Sentiment analysis as tool for gender mainstreaming in slum rehabilitation housing management in Mumbai, India. Habitat Int. 2019; 92 :102040. [ Google Scholar ]
  • Bowden L.W. Vol. 12. American Society of Photogrammetry; 1975. Urban environments: Inventory and analysis; pp. 1815–1880. (Manual of Remote Sensing). [ Google Scholar ]
  • Breheny M.J., Batey P.W.J. The history of planning methodology: a preliminary sketch. Built. Environ. (1978) 1981; 7 (2):109–120. JSTOR. [ Google Scholar ]
  • Britton B.K. Lexical ambiguity of words used in English text. Behav. Res. Methods Instrum. 1978; 10 (1):1–7. [ Google Scholar ]
  • Building Design Partnership Urban design in practice. Urban Design Quarterly. 1991; 40 [ Google Scholar ]
  • Carenini M., Whyte A., Bertorello L., Vanocchi M. Improving communication in E-democracy using natural language processing. IEEE Intell. Syst. 2007; 22 (1):20–27. [ Google Scholar ]
  • CDC. Interpretation of Epidemic (Epi) Curves during Ongoing Outbreak Investigations. Centers for Disease Control and Prevention; 2018, November 16. https://www.cdc.gov/foodsafety/outbreaks/investigating-outbreaks/epi-curves.html [ Google Scholar ]
  • Cruz N. F. da, Rode P., McQuarrie M. New urban governance: a review of current themes and future priorities. J. Urban Aff. 2019; 41 (1):1–19. [ Google Scholar ]
  • Devlin J., Chang M.-W., Lee K., Toutanova K. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [ Google Scholar ]
  • Estévez-Ortiz F.-J., García-Jiménez A., Glösekötter P. An application of people’s sentiment from social media to smart cities. El Prof. Inf. 2016; 25 (6):851. [ Google Scholar ]
  • European Commission . 2015 June 13. Electronic Democracy European Network | EDEN Project. CORDIS | European Commission. https://cordis.europa.eu/project/rcn/57135/factsheet/en [ Google Scholar ]
  • Fu C., McKenzie G., Frias-Martinez V., Stewart K. Identifying spatiotemporal urban activities through linguistic signatures. Comput. Environ. Urban Syst. 2018; 72 :25–37. [ Google Scholar ]
  • Gandomi A., Haider M. Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015; 35 (2):137–144. [ Google Scholar ]
  • Ghosh S., Gunning D. Packt Publishing Ltd; 2019. Natural Language Processing Fundamentals: Build Intelligent Applications that Can Interpret the Human Language to Deliver Impactful Results. [ Google Scholar ]
  • Guetterman T.C., Chang T., DeJonckheere M., Basu T., Scruggs E., Vydiswaran V.V. Augmenting qualitative text analysis with natural language processing: methodological study. J. Med. Internet Res. 2018; 20 (6) [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Helderop E., Huff J., Morstatter F., Grubesic A., Wallace D. Hidden in plain sight: a machine learning approach for detecting prostitution activity in phoenix, Arizona. Appl. Spat. Analy. Pol. 2019; 12 (4):941–963. [ Google Scholar ]
  • Hey T., Tansley S., Tolle K. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/ [ Google Scholar ]
  • Hirschberg J., Manning C.D. Advances in natural language processing. Science. 2015; 349 (6245):261–266. [ PubMed ] [ Google Scholar ]
  • Hong L., Fu C., Wu J., Frias-Martinez V. Information needs and communication gaps between citizens and local governments online during natural disasters. Inf. Syst. Front. New York. 2018; 20 (5):1027–1039. [ Google Scholar ]
  • Howard J., Ruder S. 2018. Universal language model fine-tuning for text classification; pp. 328–339. (Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)). [ Google Scholar ]
  • Hu Y., Deng C., Zhou Z. A semantic and sentiment analysis on online neighborhood reviews for understanding the perceptions of people toward their living environments. Ann. Assoc. Am. Geogr. 2019; 109 (4):1052–1073. [ Google Scholar ]
  • Hu Y., Mao H., McKenzie G. A natural language processing and geospatial clustering framework for harvesting local place names from geotagged housing advertisements. Int. J. Geogr. Inf. Sci. 2019; 33 (4):714–738. [ Google Scholar ]
  • Huang L., Wu Y., Zheng Q., Zheng Q., Zheng X., Gan M., Wang K., Shahtahmassebi A., Deng J., Wang J., Zhang J. Quantifying the spatiotemporal dynamics of industrial land uses through mining free access social datasets in the mega hangzhou bay region, China. Sustainability. 2018; 10 (10):3463. [ Google Scholar ]
  • Iaconesi S. Emotional landmarks in cities. Sociologica. 2015; 9 (3):22. [ Google Scholar ]
  • Imran M., Elbassuoni S., Castillo C., Diaz F., Meier P. Proceedings of the 22nd International Conference on World Wide Web - WWW ’13 Companion, 1021–1024. 2013. Practical extraction of disaster-relevant information from social media. [ Google Scholar ]
  • Jang K.M., Kim Y. Crowd-sourced cognitive mapping: a new way of displaying people’s cognitive perception of urban space. PloS One. 2019; 14 (6) [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lai Y., Kontokosta C.E. Topic modeling to discover the thematic structure and spatial-temporal patterns of building renovation and adaptive reuse in cities. Comput. Environ. Urban Syst. 2019; 78 :101383. [ Google Scholar ]
  • Li Y., Fei T., Zhang F. A regionalization method for clustering and partitioning based on trajectories from NLP perspective. Int. J. Geogr. Inf. Sci. 2019; 33 (12):2385–2405. [ Google Scholar ]
  • Lindsay B.R. 2011. Social Media and Disasters: Current Uses, Future Options, and Policy Considerations; p. 13. [ Google Scholar ]
  • Liu K., Gao S., Lu F. Identifying spatial interaction patterns of vehicle movements on urban road networks by topic modelling. Comput. Environ. Urban Syst. 2019; 74 :50–61. [ Google Scholar ]
  • Liu K., Gao S., Qiu P., Liu X., Yan B., Lu F. Road2Vec: measuring traffic interactions in urban road system from massive travel routes. ISPRS Int. J. Geo Inf. 2017; 6 (11):321. [ Google Scholar ]
  • Lynch K. MIT Press; 1960. The Image of the City. [ Google Scholar ]
  • Markou I., Kaiser K., Pereira F.C. Predicting taxi demand hotspots using automated Internet Search Queries. Transport. Res. C Emerg. Technol. 2019; 102 :73–86. [ Google Scholar ]
  • Moscato U., Poscia A. Urban public health. In: Boccia S., Villari P., Ricciardi W., editors. A Systematic Review of Key Issues in Public Health. Springer International Publishing; 2015. pp. 223–247. [ Google Scholar ]
  • Nagar R., Yuan Q., Freifeld C.C., Santillana M., Nojima A., Chunara R., Brownstein J.S. A case study of the New York city 2012-2013 influenza season with daily geocoded twitter data from temporal and spatiotemporal perspectives. J. Med. Internet Res. 2014; 16 (10) [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Philipson W.R. Vol. 2. American Society of Photogrammetry and Remote Sensing; 1997. Urban analysis and planning; pp. 517–554. (Manual of Photographic Interpretation). [ Google Scholar ]
  • Pissourios I.A. Survey methodologies of urban land uses: an oddment of the past, or a gap in contemporary planning theory? Land Use Pol. 2019; 83 :403–411. [ Google Scholar ]
  • Rahimi S., Mottahedi S., Liu X. The geography of taste: using Yelp to study urban culture. ISPRS Int. J. Geo Inf. Basel. 2018; 7 (9) [ Google Scholar ]
  • Ramadier T. Transdisciplinarity and its challenges: the case of urban studies. Futures. 2004; 36 (4):423–439. [ Google Scholar ]
  • Riga M., Karatzas K. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) - WIMS ’14, 1–7. 2014. Investigating the relationship between social media content and real-time observations for urban air quality and public health. [ Google Scholar ]
  • Salganik M. Princeton University Press; 2018. Bit By Bit: Social Research in the Digital Age (Open Review Edition) https://www.bitbybitbook.com/en/preface/ [ Google Scholar ]
  • Serna A., Gerrikagoitia J.K., Bernabé U., Ruiz T. Sustainability analysis on urban mobility based on social media content. Transp. Res. Proc. 2017; 24 :1–8. [ Google Scholar ]
  • Souza A., Figueredo M., Cacho N., Araujo D., Coelho J., Prolo C.A. 2016 IEEE International Smart Cities Conference (ISC2), 1–6. 2016. Social smart city: a platform to analyze social streams in smart city initiatives. [ Google Scholar ]
  • Urban Land Institute . 2019. Urban Technology Framework. https://ulidigitalmarketing.blob.core.windows.net/ulidcnc/2019/05/ULI-Urban-Technology-Framework-2019.pdf [ Google Scholar ]
  • Vargas-Calderón V., Camargo J.E. Characterization of citizens using word2vec and latent topic analysis in a large set of tweets. Cities. 2019; 92 :187–196. [ Google Scholar ]
  • Vaughan J.P., Morrow R.H., Organization W.H. World Health Organization; 1989. Manual of Epidemiology for District Health Management. http://apps.who.int/iris/handle/10665/37032 [ Google Scholar ]
  • Wakamiya S., Kawai Y., Aramaki E. Twitter-based influenza detection after flu peak via tweets with indirect information: text mining study. JMIR Publ. Health Surv. 2018; 4 (3):e65. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yao Y., Li X., Liu X., Liu P., Liang Z., Zhang J., Mai K. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. Int. J. Geogr. Inf. Sci. 2017; 31 (4):825–848. [ Google Scholar ]
  • Yuan J., Zheng Y., Xie X. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD. Vol. 12. 2012. Discovering regions of different functions in a city using human mobility and POIs; p. 186. [ Google Scholar ]
  • Yuan N.J., Zheng Y., Xie X., Wang Y., Zheng K., Xiong H. Discovering urban functional zones using latent activity trajectories. IEEE Trans. Knowl. Data Eng. 2015; 27 (3):712–725. [ Google Scholar ]

Quantum Physics

Title: Natural Language, AI, and Quantum Computing in 2024: Research Ingredients and Directions in QNLP

Abstract: Language processing is at the heart of current developments in artificial intelligence, and quantum computers are becoming available at the same time. This has led to great interest in quantum natural language processing, and several early proposals and experiments. This paper surveys the state of this area, showing how NLP-related techniques including word embeddings, sequential models, attention, and grammatical parsing have been used in quantum language processing. We introduce a new quantum design for the basic task of text encoding (representing a string of characters in memory), which has not been addressed in detail before. As well as motivating new technologies, quantum theory has made key contributions to the challenging questions of 'What is uncertainty?' and 'What is intelligence?' As these questions are taking on fresh urgency with artificial systems, the paper also considers some of the ways facts are conceptualized and presented in language. In particular, we argue that the problem of 'hallucinations' arises through a basic misunderstanding: language expresses any number of plausible hypotheses, only a few of which become actual, a distinction that is ignored in classical mechanics, but present (albeit confusing) in quantum mechanics.


Center for Security and Emerging Technology

Natural Language Processing Research Clusters

Data Snapshot

Concentrations of AI-Related Topics in Research: Natural Language Processing

Sara Abdulla

Data Snapshots are informative descriptions and quick analyses that dig into CSET’s unique data resources. Our first series of Snapshots introduced CSET’s Map of Science and explored the underlying data and analytic utility of this new tool, which enables users to interact with the Map directly.

In this snapshot, we analyze natural language processing paper concentrations across research clusters in our Map of Science, using the labelling convention described in Defining Computer Vision, Natural Language Processing, and Robotics Research Clusters. 1 We examine the 397 RCs (as of July 29, 2021) in which at least 25 percent of papers are AI-related, at least 25 percent are NLP-related, and the concentration of NLP papers exceeds that of computer vision- or robotics-related papers. We refer to these as NLP RCs. Figure 1 displays these RCs within the Map of Science, color-coded by their broad research area. 

NLP applies to text data across a myriad of domains, including social media analysis, education, and surveillance, with methods such as sentiment analysis (using machine learning to detect the emotions behind text, for example distinguishing positive from negative product and service reviews) and topic modelling (using machine learning to partition text data, such as abstracts or articles, into subgroups). One widespread application of NLP is "bots" on social media sites such as Twitter. These bots potentially contribute to the spread of misinformation, as well as harassment and online trolling. 2
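The sentiment-analysis idea described above can be sketched in a few lines. The word lists and the zero-threshold scoring rule below are invented for illustration; real systems learn polarity from labelled training data rather than a fixed lexicon:

```python
# Toy lexicon-based sentiment classifier. The word lists and the
# zero-threshold scoring rule are illustrative assumptions; production
# systems learn polarity from labelled training data.
POSITIVE = {"great", "excellent", "love", "helpful", "fast"}
NEGATIVE = {"terrible", "slow", "broken", "hate", "poor"}

def sentiment(text: str) -> str:
    # Lowercase, split, and strip trailing punctuation before lookup.
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great product, fast shipping, love it!"))  # positive
print(sentiment("Terrible quality and slow support"))       # negative
```

The same input-to-polarity mapping underlies trained classifiers; only the way the scoring function is obtained differs.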

Figure 1. NLP RCs Highlighted in the Map of Science 


Table 1. Number of Natural Language Processing RCs by Broad Research Area

We found that the overwhelming majority of NLP RCs fall within computer science, as shown in Table 1. The remaining RCs are largely in the social sciences, with a few NLP RCs in the humanities, mathematics, and medicine. Table 2 illustrates the concentration of NLP papers across all NLP RCs. About one-fourth of NLP RCs have a concentration of over 75 percent NLP-related papers, suggesting that these RCs are NLP-focused rather than employing NLP as an accessory field of research. 

Table 2. NLP Concentrations Across RCs 

In order to understand the range of RCs that can be assigned the NLP label, we provide details on four RCs: 

  • The NLP RC with the highest percentage of NLP-related publications
  • The NLP RC with the lowest percentage of NLP-related publications
  • An NLP RC in a non-computer science (CS) STEM field
  • An NLP RC in a non-STEM field 

For each of these RCs, we provide the top five core papers. Core papers are publications with strong citation links within an RC, meaning they are highly cited by the other publications in that cluster. Since RCs do not necessarily represent a homogeneous area of research, we can review the member publications to describe the central areas of research that an RC focuses on. 

NLP RC with the highest percentage of NLP papers 

Of RC 60531’s 727 papers, 95 percent are NLP-related. RC 60531 focuses broadly on machine translation, word embeddings, speech recognition, and NLP methods. Like most NLP RCs, it falls within computer science. Additionally, this RC is forecasted to experience extreme growth, defined as a forecasted mean growth of 8 percent or more over the next three years. 3

RC 60531 Top Five Core Papers:

  • A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
  • How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
  • Learning bilingual word embeddings with (almost) no bilingual data
  • Word Translation Without Parallel Data
  • Offline bilingual word vectors, orthogonal transformations and the inverted softmax

Three-fourths of the papers had country affiliations; of those, two-fifths were affiliated with the United States, while less than one-fourth were affiliated with China. 

NLP RC with the lowest percentage of NLP papers 

RC 47196 falls within computer science, like most NLP RCs, but just over 25 percent of its 670 papers are NLP-related. This RC broadly focuses on data mining, knowledge extraction, and related linguistic analysis methods. Less than one-third of its papers are also AI-related. 

RC 47196 Top Five Core Papers:

  • The Presentation and Dissemination of Urban Tourist Attractions through Short Videos: A Case Study of “Douyin” (短视频对城市旅游景点的呈现与传播——以“抖音”为例)
  • Research on the Classification of Travel Demand Information and the Acquisition of Ontology Concept Based on “We Media”, 2015, Library and Information Service (图书情报工作)
  • Automatic Extraction of Nonlinguistic Representations of Texts to Support Writing
  • Text Classification for Student Data Set using Naive Bayes Classifier and KNN Classifier
  • Development of Game Application for Enhancement of Children’s Cognitive Skills

This RC is led by India and had fewer new papers published last year than the year prior (i.e., had negative growth). 

Examining a non-CS STEM NLP RC 

While most NLP RCs are in the CS research domain, it is useful to assess the makeup of non-CS RCs. Cluster 26223 focuses on mathematics, and more than 70 percent of its papers are NLP-related. This cluster is 14 years old, contains 1,507 papers, and is dominated by the United States. 

While this RC focuses primarily on mathematics, computer science, linguistics, and NLP, its closest neighbors also focus on epistemology, literature, AI, and various humanities fields. RC 26223 grew about 4 percent last year, and extreme growth is not forecasted for this RC. 

RC 26223 Top Five Core Papers:

  • The Grammar of Degree: Gradability Across Languages
  • Modification (book)
  • The semantics of many, much, few, and little
  • Projecting adjectives in Chinese
  • Semantic variation and the grammar of property concepts, 2015, Language

A non-STEM NLP RC

Among the broad fields of study for NLP RCs, social science is a distant second to CS. RC 3916 is an example of a social science-focused RC with a high concentration of NLP papers: more than 50 percent of its 4,749 papers are NLP papers. 

RC 3916 contains psychology, communication, and syntax papers with an evident focus on the study of sign language. This RC grew 47 percent last year and is led by the United States. 

RC 3916 Top Five Core Papers: 

  • Gesture, sign, and language: The coming of age of sign language and gesture studies
  • Interaction of Morphology and Syntax in American Sign Language
  • Modification of indicating verbs in British Sign Language: A corpus-based study
  • The syntax of sign language agreement: Common ingredients, but unusual recipe
  • Visible Meaning: Sign language and the foundations of semantics

If you missed it, find the first part of the series exploring computer vision-related RCs, and other snapshots examining the Map of Science, below. 

In August 2021, CSET updated the Map of Science, linking more data to the research clusters and implementing a more stable clustering method. With this update, research clusters were assigned new IDs, so the cluster IDs reported in this Snapshot will not match IDs in the current Map of Science user interface. If you are interested in knowing which clusters in the updated Map are most similar to those reported here, or have general questions about our methodology or want to discuss this research, you can email  [email protected] .

Download Related Data Brief


  • https://cset.georgetown.edu/publication/defining-computer-vision-natural-language-processing-and-robotics-research-clusters/
  • E.g.: https://www.pewresearch.org/internet/2018/04/09/bots-in-the-twittersphere/ ; https://arxiv.org/pdf/2012.02164.pdf ; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139392/
  • See AI Bin RC average growth vs. forecast growth paper by Autumn Toney: https://cset.georgetown.edu/publication/measuring-ai-rc-growth/

Related Content

Concentrations of AI-Related Topics in Research: Computer Vision

Data Snapshots are informative descriptions and quick analyses that dig into CSET’s unique data resources. Our first series of Snapshots introduced CSET’s Map of Science and explored the underlying data and analytic utility of this… Read More

Defining Computer Vision, Natural Language Processing, and Robotics Research Clusters


Making Use of Natural Language Processing to Better Understand Medical Students' Self-Assessment of Clinical Skills

  • PMID: 37976396
  • PMCID: PMC10922291 (available on 2025-03-01)
  • DOI: 10.1097/ACM.0000000000005527

Problem: Reflective practice is necessary for self-regulated learning, but helping medical students develop these skills can be challenging because the skills are difficult to observe. One common solution is to assign students reflective self-assessments, which produce large quantities of narrative assessment data. Reflective self-assessments also provide feedback to faculty regarding students' understanding of content, reflective abilities, and areas for course improvement. To maximize student learning and feedback to faculty, reflective self-assessments must be reviewed and analyzed, activities that are often difficult for faculty given the time-intensive and cumbersome nature of processing large quantities of narrative assessment data.

Approach: The authors collected narrative assessment data (2,224 reflective self-assessments) from 344 medical students. In academic years 2019-2020 and 2021-2022, students at the University of Cincinnati College of Medicine responded to 2 prompts (aspects that surprised students, areas for student improvement) after reviewing their standardized patient encounters. These free-text entries were analyzed using TopEx, an open-source natural language processing (NLP) tool, to identify common topics and themes, which faculty then reviewed.

Outcomes: TopEx expedited theme identification in students' reflective self-assessments, unveiling 10 themes for prompt 1, such as question organization and history analysis, and 8 for prompt 2, including sensitive histories and exam efficiency. TopEx offered a user-friendly, time-saving analysis method that did not require complex NLP implementations. The authors discerned 4 implications for education enhancement: aggregating themes for future student reflection, revising self-assessments to target common improvement areas, adjusting the curriculum to better guide students, and aiding faculty in providing targeted feedback.
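The kind of theme identification TopEx performs can be loosely sketched in plain Python: responses that share enough content vocabulary are grouped together. The stopword list, overlap threshold, and sample responses below are invented for this sketch; TopEx's actual pipeline is more sophisticated:

```python
# Toy theme grouping: free-text responses sharing at least `min_shared`
# content words land in the same group. Illustrates the general idea
# only; the stopword list, threshold, and sample responses are
# assumptions, not TopEx's real method.
STOPWORDS = {"the", "a", "i", "to", "my", "was", "of", "and", "in"}

def content_words(text):
    return {w.strip(".,") for w in text.lower().split()} - STOPWORDS

def group_by_overlap(responses, min_shared=2):
    groups = []
    for r in responses:
        words = content_words(r)
        for g in groups:
            if len(words & g["words"]) >= min_shared:
                g["items"].append(r)
                g["words"] |= words  # grow the group's vocabulary
                break
        else:  # no group matched: start a new one
            groups.append({"words": set(words), "items": [r]})
    return groups

responses = [
    "I need to improve my history taking organization",
    "My history taking questions lacked organization",
    "The physical exam took too long",
    "I should make the physical exam more efficient",
]
for group in group_by_overlap(responses):
    print(group["items"])
```

Run on the four sample responses, this yields two groups, one around history taking and one around the physical exam, which mirrors how themes like "history analysis" and "exam efficiency" can emerge from raw free text.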

Next steps: The University of Cincinnati College of Medicine aims to refine and expand the utilization of TopEx for deeper narrative assessment analysis, while other institutions may model or extend this approach to uncover broader educational insights and drive curricular advancements.

Copyright © 2023 Written work prepared by employees of the Federal Government as part of their official duties is, under the U.S. Copyright Act, a “work of the United States Government” for which copyright protection under Title 17 of the United States Code is not available. As such, copyright does not extend to the contributions of employees of the Federal Government.

  • Clinical Competence
  • Natural Language Processing
  • Self-Assessment
  • Students, Medical*

Grants and funding

  • UL1 TR002649/TR/NCATS NIH HHS/United States


RESEARCH AREA

Natural Language Processing

Our team advances the state of the art in natural language understanding and generation, and deploys these systems at scale to break down language barriers, enable people to understand and communicate with anyone, and to provide a safe experience—no matter what language they speak.

The opportunities and challenges of this work are immense. Billions of people use our services to connect and communicate in their preferred language, but many of these languages lack traditional NLP resources and our systems need to be robust to the informal tone, slang and typos often found in daily communication.

Our research spans multiple areas across NLP and machine learning, including deep learning/neural networks, machine translation, natural language understanding and generation, low-resource NLP, question answering, dialogue, and cross-lingual and cross-domain transfer learning.

Latest Publications

June 03, 2019

Pay less attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, Michael Auli

October 31, 2018

Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018, best paper)

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato

July 15, 2018

Hierarchical Neural Story Generation (ACL 2018, best paper honorable mention)

Angela Fan, Michael Lewis, Yann Dauphin

October 29, 2018

XNLI: Evaluating Cross-lingual Sentence Representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, Ves Stoyanov

Latest News


Announcing new research awards in NLP and machine translation

Facebook AI is announcing three new research awards in NLP and machine translation.


Open Source

Open-sourcing PyText for faster NLP development

To make it easier to build and deploy natural language processing (NLP) systems, we are open-sourcing PyText, a modeling framework that blurs the boundaries between experimentation and large-scale deployment.



COMMENTS

  1. Vision, status, and research topics of Natural Language Processing

    The field of Natural Language Processing (NLP) has evolved with, and as well as influenced, recent advances in Artificial Intelligence (AI) and computing technologies, opening up new applications and novel interactions with humans. Modern NLP involves machines' interaction with human languages for the study of patterns and obtaining ...

  2. Natural Language Processing

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. Browse SoTA > Natural Language Processing: 2330 benchmarks • 662 tasks • 2006 datasets • 27582 papers with code ... Dynamic Topic Modeling ...

  3. Natural language processing: state of the art, current trends and

    Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medical, and question answering etc. In this paper, we first distinguish four phases by discussing different levels of NLP ...

  4. natural language processing Latest Research Papers

    Hindi Language. Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision, and natural language processing. Computer vision and natural language processing deal with image understanding and language ...

  5. Studies in Natural Language Processing

    Volumes in the Studies in Natural Language Processing series provide comprehensive surveys of current research topics and applications in the field of natural language processing (NLP) that shed light on language technology, language cognition, language and society, and linguistics. The increased availability of language corpora and digital ...

  6. Natural language instructions induce compositional generalization in

    We use advances in natural language processing to create a neural model of generalization based on linguistic instructions. ... We end by discussing how these results can guide research on the ...

  7. Publications

    Performing groundbreaking Natural Language Processing research since 1999.

  8. Natural Language Processing

    Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more. Our work spans the range of traditional NLP tasks, with general-purpose syntax and ...

  9. Current Approaches and Applications in Natural Language Processing

    Staying in natural language understanding tasks, Question and Answering (Q & A) systems still emerge as a continuous topic of research. In this regard, the paper by proposes an attention model to solve question difficulty estimation in Question-Answering tasks. The method first relates question and information components using dual multi-head ...

  10. Methods to Integrate Natural Language Processing Into Qualitative Research

    LDA analysis is a common tool used in natural language processing as a generative probabilistic model of a corpus (Blei et al., 2003). LDA assumes documents are a random mixture of words over latent topics, whereas each topic may be characterized by the distribution over the contained words.

  11. Exploring the Landscape of Natural Language Processing Research

    As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics ...

  12. Natural Language Processing and Its Applications in ...

    As an essential part of artificial intelligence technology, natural language processing is rooted in multiple disciplines such as linguistics, computer science, and mathematics. The rapid advancements in natural language processing provides strong support for machine translation research. This paper first introduces the key concepts and main content of natural language processing, and briefly ...

  13. PDF arXiv:2307.10652v5 [cs.CL] 24 Sep 2023

    As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established ...

  14. Efficient Methods for Natural Language Processing: A Survey

    Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require ...

  15. Natural language processing for urban research: A systematic review

    Abstract. Natural language processing (NLP) has shown potential as a promising tool to exploit under-utilized urban data sources. This paper presents a systematic review of urban studies published in peer-reviewed journals and conference proceedings that adopted NLP.

  16. Natural Language Processing (NLP) in Qualitative Public Health Research

    Qualitative data-analysis methods provide thick, rich descriptions of subjects' thoughts, feelings, and lived experiences but may be time-consuming, labor-intensive, or prone to bias. Natural language processing (NLP) is a machine learning technique from computer science that uses algorithms to analyze textual data.

  17. Natural Language Processing: State of The Art, Current Trends and

    The paper distinguishes four phases by discussing different levels of NLP and components of Natural Language Generation (NLG), followed by presenting the history and evolution of NLP, state ...

  18. Exploring the Landscape of Natural Language Processing Research

    Abstract. As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption ...

  19. Natural Language Processing for Text and Speech Processing: A Review Paper

    Natural language processing is a branch of computer science and artificial intelligence that deals with human communication in speech and text. NLP covers the computational and mathematical modelling of various aspects of language and the development of a wide range of language systems.

  20. Vision, status, and research topics of Natural Language Processing

    SCICERO uses Natural Language Processing techniques to parse the content of scientific papers to discover entities and relationships, exploits state-of-the-art Deep Learning Transformer models to ...

  21. Natural Language, AI, and Quantum Computing in 2024: Research

    Language processing is at the heart of current developments in artificial intelligence, and quantum computers are becoming available at the same time. This has led to great interest in quantum natural language processing, and several early proposals and experiments. This paper surveys the state of this area, showing how NLP-related techniques including word embeddings, sequential models ...

  22. Concentrations of AI-Related Topics in Research: Natural Language

    In this snapshot, we analyze natural language processing paper concentrations across research clusters in our Map of Science using the labelling convention described in Defining Computer Vision, Natural Language Processing, and Robotics Research Clusters. 1 We examine the 397 RCs (as of July 29, 2021) with concentrations of AI papers of at least 25 percent and concentrations of NLP papers of ...

  23. Making Use of Natural Language Processing to Better Understand ...

    Making Use of Natural Language Processing to Better Understand Medical Students' Self-Assessment of Clinical Skills Acad Med. 2024 Mar 1;99(3):285-289. doi: 10.1097/ACM.0000000000005527. ... (NLP) tool, to identify common topics and themes, which faculty then reviewed.

  24. (PDF) Natural Language Processing

    Natural language processing is an integral area of computer science in which machine learning and computational linguistics are broadly used. This field is mainly concerned with making the ...

  25. Meta AI Research Topic

    About. RESEARCH AREA. Natural Language Processing. Our team advances the state of the art in natural language understanding and generation, and deploys these systems at scale to break down language barriers, enable people to understand and communicate with anyone, and to provide a safe experience—no matter what language they speak.