• Open access
  • Published: 26 August 2019

Research paper classification systems based on TF-IDF and LDA schemes

  • Sang-Woon Kim 1 &
  • Joon-Min Gil   ORCID: orcid.org/0000-0001-6774-8476 2  

Human-centric Computing and Information Sciences, volume 9, Article number: 30 (2019)


Abstract

With the increasing advance of computer and information technologies, numerous research papers have been published online as well as offline, and as new research fields continue to be created, users have great difficulty finding and categorizing the research papers they are interested in. To overcome this limitation, this paper proposes a research paper classification system that can cluster research papers into meaningful classes in which the papers are very likely to share similar subjects. The proposed system extracts representative keywords from the abstract of each paper and extracts topics by the Latent Dirichlet allocation (LDA) scheme. Then, the K-means clustering algorithm is applied to group papers with similar subjects, based on the term frequency-inverse document frequency (TF-IDF) values of each paper.

Introduction

Numerous research papers have been published online as well as offline with the increasing advance of computer and information technologies, which makes it difficult for users to search for and categorize the research papers they are interested in for a specific subject [ 1 ]. Therefore, it is desirable that these huge numbers of research papers be systematically classified by similar subjects so that users can find the papers they want easily and conveniently. Typically, finding research papers on a specific topic or subject is a time-consuming activity. For example, researchers usually spend a long time on the Internet finding papers of interest and are often frustrated because the information they are looking for is not retrieved efficiently, owing to the fact that the papers are not grouped by topic or subject for easy and fast access.

The analysis commonly used to classify a huge number of research papers is run on large-scale computing machines without any consideration of big data properties. As time goes on, it becomes difficult to manage and efficiently process research papers that continue to increase quantitatively. Since the relations among the papers to be analyzed and classified are very complex, it is also difficult to quickly grasp the subject of each research paper and, moreover, hard to accurately group research papers with similar subjects in terms of contents. Therefore, there is a need for an automated processing method so that such a huge number of research papers can be classified fast and accurately.

The abstract is one of the most important parts of a research paper, as it describes the gist of the paper. Typically, it is the part users read next after the paper title. Accordingly, users tend to read the abstract first in order to grasp the research direction and summary before reading the body of a paper. In this regard, the core words of a research paper should be written in the abstract concisely and interestingly. Therefore, in this paper, we use the abstract data of research papers as a clue to classify similar papers fast and correctly.

To classify a huge number of papers into groups with similar subjects, we propose a paper classification system based on the term frequency-inverse document frequency (TF-IDF) [ 2 , 3 , 4 ] and Latent Dirichlet allocation (LDA) [ 5 ] schemes. The proposed system first constructs a representative keyword dictionary from the keywords that users input and from the topics extracted by LDA. Second, it uses the TF-IDF scheme to extract subject words from the abstracts of papers based on the keyword dictionary. Then, the K-means clustering algorithm [ 6 , 7 , 8 ] is applied to group the papers with similar subjects, based on the TF-IDF values of each paper.

To extract subject words from a set of massive papers efficiently, in this paper, we use the Hadoop Distributed File System (HDFS) [ 9 , 10 ], which can process big data rapidly and stably with high scalability. We also use the map-reduce programming model [ 11 , 12 ] to calculate the TF-IDF value from the abstract of each paper. Moreover, in order to demonstrate the validity and applicability of the proposed system, this paper evaluates its performance based on actual paper data. As the experimental data for performance evaluation, we use the titles and abstracts of the papers published in the Future Generation Computer Systems (FGCS) journal [ 13 ] from 1984 to 2017. The experimental results indicate that the proposed system can group the whole set of papers into clusters of papers with similar subjects according to the relationship of the keywords extracted from the abstracts.

The remainder of the paper is organized as follows: In “ Related work ” section, we provide related work on research paper classification. “ System flow diagram ” section presents a system flow diagram for our research paper classification system. “ Paper classification system ” section explains the paper classification system based on the TF-IDF and LDA schemes in detail. In “ Experiments ” section, we carry out experiments to evaluate the performance of the proposed paper classification system. In particular, the Elbow scheme is applied to determine the optimal number of clusters in the K-means clustering algorithm, and the Silhouette scheme is introduced to show the validity of the clustering results. Finally, “ Conclusion ” section concludes the paper.

Related work

This section briefly reviews the literature on paper classification methods related to the research subject of this paper.

Document classification is directly related to the paper classification addressed in this paper. It is the problem of assigning a document to one or more predefined classes according to a specific criterion or its contents. The representative application areas of document classification are as follows:

News article classification: News articles are generally massive in number because they are issued in huge volumes daily or even hourly. There have been many works on automatic news article classification [ 14 ].

Opinion mining: It is very important to analyze the information on opinions, sentiment, and subjectivity in documents on a specific topic [ 15 ]. Analysis results can be applied to various areas such as website evaluation, reviews of online news articles, opinions in blogs or social network services, etc. [ 16 ].

Email classification and spam filtering: Email handling can be considered a document classification problem, not only for spam filtering but also for classifying messages and sorting them into specific folders [ 17 ].

A wide variety of classification techniques have been applied to document classification [ 18 ]. Automatic document classification can be divided into two methods: supervised and unsupervised [ 19 , 20 , 21 ]. In supervised classification, documents are classified on the basis of supervised learning methods. These methods generally analyze training data (i.e., pairs of predefined input–output) and produce an inferred function that can be used for mapping other examples. On the other hand, unsupervised classification groups documents based on similarity among them, without any predefined criterion. As automatic document classification algorithms, various types of algorithms have been developed, such as the Naïve Bayes classifier, TF-IDF, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree, and so on [ 22 , 23 ].

Meanwhile, among works related to paper classification, Bravo-Alcobendas et al. [ 24 ] proposed a document clustering algorithm that extracts the characteristics of documents by non-negative matrix factorization (NMF) and groups documents by the K-means clustering algorithm. This work mainly focuses on reducing the high-dimensional vectors formed by word counts in documents, not on a sophisticated classification in terms of a variety of subject words.

In [ 25 ], Taheriyan et al. proposed a paper classification method based on a relation graph that uses interrelationships among papers, such as citations, authors, common references, etc. This method performs better as the links among papers increase. It mainly focuses on interrelationships among papers without any consideration of paper contents or subjects. Thus, papers can be misclassified with respect to their subjects.

In [ 26 ], Hanyurwimfura et al. proposed a paper classification method based on a research paper’s title and common terms. In [ 27 ], Nanbo et al. proposed a paper classification method that extracts keywords from research objectives and background and groups papers on the basis of the extracted keywords. In these works, the results achieved using such important information were promising. However, these methods do not consider frequently occurring keywords in paper classification. Paper titles, research objectives, and research background provide only limited information, leading to inaccurate decisions [ 28 ].

In [ 29 ], Nguyen et al. proposed a paper classification method based on the Bag-of-Words scheme and the KNN algorithm. This method extracts topics from the entire contents of a paper without any consideration of reducing computational complexity. Thus, it suffers from extensive computation time when the data volume sharply increases.

Different from the above-mentioned methods, our method uses three kinds of keywords: keywords that users input, keywords extracted from abstracts, and topics extracted by the LDA scheme. These keywords are used to calculate the TF-IDF of each paper, with the aim of considering the importance of each paper. Then, the K-means clustering algorithm is applied to group the papers with similar subjects, based on the TF-IDF values of each paper. Meanwhile, our classification method is designed and implemented on the Hadoop Distributed File System (HDFS) to efficiently process massive numbers of research papers, which have the characteristics of big data. Moreover, the map-reduce programming model is used for parallel processing of the massive research papers. To the best of our knowledge, our work is the first to use the analysis of paper abstracts based on the TF-IDF and LDA schemes for paper classification.

System flow diagram

The paper classification system proposed in this paper consists of four main processes (Fig.  1 ): (1) Crawling, (2) Data Management and Topic Modeling, (3) TF-IDF, and (4) Classification. This section describes a system flow diagram for our paper classification system.

figure 1

Detailed flows for the system flow diagram shown in Fig. 1 are as follows:

1. It automatically collects the keywords and abstract data of the papers published during a given period. It also preprocesses these data, e.g., removing stop words and extracting only nouns.

2. It constructs a keyword dictionary based on the crawled keywords. Because the total number of keywords over the whole set of papers is huge, this paper uses only the top-N keywords with the highest frequency.

3. It extracts topics from the crawled abstracts by LDA topic modeling.

4. It calculates paper lengths as the number of occurrences of words in the abstract of each paper.

5. It calculates a TF value for both the keywords obtained in Step 2 and the topics obtained in Step 3.

6. It calculates an IDF value for both the keywords obtained in Step 2 and the topics obtained in Step 3.

7. It calculates a TF-IDF value for each keyword using the values obtained in Steps 4, 5, and 6.

8. It groups the whole set of papers into clusters with similar subjects, based on the K-means clustering algorithm.

In the next section, we provide a detailed description for the above mentioned steps.

Paper classification system

Crawling of abstract data

The abstract is one of the most important parts of a paper, as it describes the gist of the paper [ 30 ]. Typically, next to the paper title, the abstract is the part of a paper that users are most likely to read. That is, users tend to read the abstract first in order to grasp the research direction and summary before reading all the contents of the paper. Accordingly, the core words of a paper should be written concisely and interestingly in the abstract. Because of this, this paper classifies similar papers based on abstract data, fast and correctly.

As can be seen in the crawling step of Fig. 1, the data crawler collects the paper abstracts and keywords according to the items checked in the crawling list. It also removes stop words from the crawled abstract data and then extracts only nouns from the data. Since the abstract data are large in volume and produced fast, they have the typical characteristics of big data. Therefore, this paper manages the abstract data on HDFS and calculates the TF-IDF value of each paper using the map-reduce programming model. Figure 2 shows an illustrative example of abstract data before and after the elimination of stop words and the extraction of nouns.

figure 2

Abstract data before and after preprocessing

After the preprocessing (i.e., the removal of stop words and the extraction of only nouns), the amount of abstract data is greatly reduced. This enhances the processing efficiency of the proposed paper classification system.
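As an illustration of this preprocessing step (not the authors' actual implementation), the sketch below tokenizes an abstract and removes stop words; the stop-word list is a toy subset, and a real pipeline would additionally apply a part-of-speech tagger to keep only nouns:

```python
import re

# Toy stop-word list for illustration; a real system would use a full list
# (and a POS tagger for noun extraction, which is omitted here).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "for",
              "and", "to", "with", "this", "that", "we", "by", "as", "it"}

def preprocess(abstract):
    """Lower-case an abstract, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This paper proposes a classification system based on the TF-IDF scheme."))
# ['paper', 'proposes', 'classification', 'system', 'based', 'tf', 'idf', 'scheme']
```

Note that hyphenated terms such as "TF-IDF" split into separate tokens here; the actual system maps such variants onto representative keywords, as described in the next subsection.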

Managing paper data

The save step in Fig. 1 constructs the keyword dictionary using the abstract and keyword data crawled in the crawling step and saves it to the HDFS.

In order to process a large number of keywords simply and efficiently, this paper merges several keywords with similar meanings into one representative keyword. In this paper, we construct 1394 representative keywords from the total keywords of all abstracts and build a keyword dictionary of these representative keywords. However, even these representative keywords incur long computation times if they are used for paper classification without a way of reducing the computation. To alleviate this overhead, we use the keyword sets of top frequency 10, 20, and 30 among these representative keywords, as shown in Table 1.

Topic modeling

Latent Dirichlet allocation (LDA) is a probabilistic model that can extract latent topics from a collection of documents. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [ 31 , 32 ].

The LDA estimates the topic-word distribution \(P(t|z)\) and the document-topic distribution \(P(z|d)\) from an unlabeled corpus, using Dirichlet priors for the distributions with a fixed number of topics [ 31 , 32 ]. As a result, we get \(P(z|d)\) for each document and further build the feature vector as

$$\left( P(z_{1} |d),\; P(z_{2} |d),\; \ldots,\; P(z_{T} |d) \right),$$

where \(T\) is the number of topics.

In this paper, using the LDA scheme, we extract topic sets from the abstract data crawled in the crawling step. Three kinds of topic sets are extracted, consisting of 10, 20, and 30 topics, respectively. Table 2 shows the topic set with 10 topics and the keywords of each topic. The remaining topic sets with 20 and 30 topics are omitted due to space limitations.

TF-IDF

The TF-IDF has been widely used in the fields of information retrieval and text mining to evaluate the importance of each word in a collection of documents. In particular, it is used for extracting core words (i.e., keywords) from documents, calculating degrees of similarity among documents, deciding search rankings, and so on.

The TF in TF-IDF means the occurrence frequency of a specific word in a document. Words with a high TF value are important within the document. On the other hand, the DF indicates how often a specific word appears across the collection of documents. It counts the occurrence of the word over multiple documents, not within a single document. Words with a high DF value are not important because they commonly appear in all documents. Accordingly, the IDF, which is an inverse of the DF, is used to measure the importance of words over all documents. A high IDF value indicates a word that is rare in the whole collection, and hence of increased importance.

Paper length

The paper length step of Fig. 1 calculates the total number of occurrences of words after separating the words in a given abstract using white spaces as delimiters. The objective of this step is to prevent TF values from being skewed by the varying lengths of abstracts. Figure 3 shows the map-reduce algorithm for the calculation of paper length. In this figure, DocName and wc represent a paper title and a paper length, respectively.

figure 3

Map-reduce algorithm for the calculation of paper length
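The same computation can be sketched in plain Python without Hadoop: the map phase emits a (DocName, 1) pair for every word, and the reduce phase sums the pairs per paper to obtain its length (the abstracts below are hypothetical):

```python
from collections import defaultdict

# Hypothetical preprocessed abstracts keyed by paper title (DocName)
abstracts = {
    "paper1": "cloud scheduling reduces energy in data centers",
    "paper2": "iot privacy requires lightweight security",
}

# Map phase: emit a (DocName, 1) pair for every word in the abstract
mapped = [(doc, 1) for doc, text in abstracts.items() for _ in text.split()]

# Reduce phase: sum the counts per DocName to obtain the paper length (wc)
lengths = defaultdict(int)
for doc, one in mapped:
    lengths[doc] += one

print(dict(lengths))  # {'paper1': 7, 'paper2': 5}
```

In the actual system, the map and reduce phases run in parallel over HDFS blocks rather than in a single process.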

Word frequency

The TF calculation step in Fig. 1 counts how many times the keywords defined in the keyword dictionary and the topics extracted by LDA appear in the abstract data. The TF used in this paper is defined as

$$tf_{i,j} = \frac{n_{i,j}}{\sum\limits_{k} n_{k,j}}, \quad 1 \le i \le K, \; 1 \le j \le D, \quad (2)$$

where \(n_{i,j}\) represents the number of occurrences of word \(t_{i}\) in document \(d_{j}\), \(\sum\limits_{k} {n_{k,j} }\) represents the total number of occurrences of words in document \(d_{j}\), and \(K\) and \(D\) are the number of keywords and documents (i.e., papers), respectively.

Figure 4 illustrates the TF calculation for the 10 keywords of top frequency. The abstract data in this figure have a paper length of 64. As we can see in this figure, the keywords ‘Internet of Things’ and ‘Big Data’ have a TF value of 0.015 because of one occurrence each in the abstract data, while the keyword ‘cloud computing’ has a TF value of 0.03 because of two occurrences. Figure 5 shows the map-reduce algorithm to calculate word frequency (i.e., TF). In this figure, n represents the number of occurrences of a keyword in the document with the paper title DocName.

figure 4

An illustrative example of TF calculation

figure 5

Map-reduce algorithm for the calculation of word frequency

Document frequency

While the TF means the number of occurrences of each keyword within a document, the DF reflects how widely each keyword appears across the collection of documents. In the DF calculation step in Fig. 1, the DF is calculated as the number of documents that contain a specific keyword. It is defined as

$$df_{j} = \left| \{ d_{j} \in D : t_{j} \in d_{j} \} \right|, \quad (3)$$

where \(D\) is the collection of documents (with \(\left| D \right|\) its total number) and \(\left| \{ d_{j} \in D : t_{j} \in d_{j} \} \right|\) represents the number of documents in which keyword \(t_{j}\) occurs. Figure 6 shows an illustrative example in which four documents are used to calculate the DF value.

figure 6

An illustrative example of DF calculation

Figure  7 shows the map-reduce algorithm to calculate the DF of each paper.

figure 7

Map-reduce algorithm for the calculation of document frequency

Keywords with a high DF value cannot be important because they commonly appear in most documents. Accordingly, the IDF, which is an inverse of the DF, is used to measure the importance of keywords over the collection of documents. The IDF is defined as

$$idf_{j} = \log \frac{\left| D \right|}{df_{j}} = \log \frac{\left| D \right|}{\left| \{ d_{j} \in D : t_{j} \in d_{j} \} \right|}. \quad (4)$$

Using Eqs. (2) and (4), the TF-IDF is defined as

$$tfidf_{i,j} = tf_{i,j} \times idf_{j}. \quad (5)$$

The TF-IDF value increases when a specific keyword occurs frequently within a document while the fraction of documents containing that keyword in the whole collection is low. This principle can be used to find the keywords that characterize each document. Consequently, using the TF-IDF calculated by Eq. (5), we can find out which keywords are important in each paper.
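Putting Eqs. (2), (4), and (5) together, the TF-IDF of a keyword can be sketched as follows on a toy corpus of preprocessed abstracts (the natural logarithm is assumed, as the paper does not state the base):

```python
import math

# Toy corpus: each "document" is a preprocessed abstract (list of tokens);
# the papers and keywords here are hypothetical.
docs = {
    "paper1": ["cloud", "scheduling", "cloud", "energy"],
    "paper2": ["iot", "privacy", "security", "privacy"],
    "paper3": ["cloud", "storage", "privacy", "encryption"],
}

def tf(term, doc):
    # Eq. (2): occurrences of the term divided by the paper length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Eq. (4): log of total documents over documents containing the term
    df = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Eq. (5): TF multiplied by IDF
    return tf(term, doc) * idf(term, docs)

# 'cloud' occurs twice in paper1 (TF = 0.5) and in 2 of the 3 papers
print(round(tf_idf("cloud", docs["paper1"], docs), 4))  # 0.2027
```

The actual system computes these values over HDFS with the map-reduce algorithms of Figs. 5, 7, and 8 rather than in memory.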

Figure  8 shows the map-reduce algorithm for the TF-IDF calculation of each paper.

figure 8

Map-reduce algorithm for TF-IDF calculation

K-means clustering

Typically, clustering techniques are used to classify a set of data into classes of similar data. To date, they have been applied in many fields, such as marketing, biology, pattern recognition, web mining, and the analysis of social networks [ 33 ]. Among the various clustering techniques, we choose the K-means clustering algorithm, one of the unsupervised learning algorithms, because of its effectiveness and simplicity. More specifically, the algorithm classifies a data set of N items, based on their features, into k disjoint subsets. This is done by minimizing the distances between each data item and the corresponding cluster centroid.

Mathematically, the K-means clustering algorithm can be described as follows:

$$E = \sum_{i=1}^{k} \sum_{x_{j} \in C_{i}} \left\| x_{j} - c_{i} \right\|^{2}, \quad (6)$$

where \(k\) is the number of clusters, \(x_{j}\) is the \(j\)th data point in the \(i\)th cluster \(C_{i}\), and \(c_{i}\) is the centroid of \(C_{i}\). The notation \(\left\| {x_{j} - c_{i} } \right\|^{2}\) stands for the squared distance between \(x_{j}\) and \(c_{i}\), and the Euclidean distance is commonly used as the distance measure. To achieve a representative clustering, the sum of squared error, \(E\), should be as small as possible.

The advantages of the K-means clustering algorithm are that it (1) deals with different types of attributes; (2) discovers clusters of arbitrary shape; (3) has minimal requirements for domain knowledge to determine input parameters; (4) deals with noise and outliers; and (5) minimizes the dissimilarity between data [ 34 ].

The TF-IDF value represents the importance of the keywords that determine the characteristics of each paper. Thus, classifying papers by TF-IDF value leads to finding groups of papers with similar subjects according to the importance of keywords. Because of this, this paper uses the K-means clustering algorithm, one of the most widely used clustering algorithms, to group papers with similar subjects. The K-means clustering algorithm used in this paper calculates the center of each cluster, which represents a group of papers with a specific subject, and allocates a paper to the cluster with the highest similarity, based on the Euclidean distance between the TF-IDF vector of the paper and the center of each cluster.

The K-means clustering algorithm is computationally faster than other clustering algorithms. However, it produces different clustering results for different numbers of clusters. So, the number of clusters (i.e., the K value) must be determined in advance of clustering. To overcome this limitation, we use the Elbow scheme [ 35 ], which can find a proper number of clusters. We also use the Silhouette scheme [ 36 , 37 ] to validate the performance of the clustering results produced by the K-means clustering scheme. Detailed descriptions of the two schemes are provided in the next section along with the performance evaluation.
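For illustration, grouping papers by their TF-IDF vectors with K-means can be sketched using Scikit-learn, which the paper also uses; the TF-IDF matrix below is hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy TF-IDF matrix: rows are papers, columns are dictionary keywords
# (say 'cloud', 'iot', 'privacy'); all values are hypothetical.
tfidf = np.array([
    [0.90, 0.05, 0.00],   # paper 1: dominated by 'cloud'
    [0.85, 0.10, 0.00],   # paper 2: dominated by 'cloud'
    [0.00, 0.80, 0.75],   # paper 3: 'iot' and 'privacy'
    [0.05, 0.85, 0.70],   # paper 4: 'iot' and 'privacy'
])

# Group the papers into K = 2 clusters by Euclidean distance to centroids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
print(km.labels_)  # papers 1-2 share one cluster, papers 3-4 the other
```

The cluster labels play the role of the ‘predict’ column discussed in the experimental results.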

Experiments

Experimental environment

The paper classification system proposed in this paper is based on the HDFS to manage and process massive paper data. Specifically, we build a Hadoop cluster composed of one master node, one sub node, and four data nodes. The TF-IDF calculation module is implemented in Java on Hadoop version 2.6.5. We also implemented the LDA calculation module using Spark MLlib in Python. The K-means clustering algorithm is implemented using the Scikit-learn library [ 38 ].

Meanwhile, as experimental data, we use the actual papers published in the Future Generation Computer Systems (FGCS) journal [ 13 ] from 1984 to 2017. The titles, abstracts, and keywords of a total of 3264 papers are used as the core data for paper classification. Figure 9 shows the overall system architecture of our paper classification system.

figure 9

Overall system architecture for our paper classification system

The keyword dictionaries used for performance evaluation in this paper are constructed with the three methods shown in Table 3. The constructed keyword dictionaries are applied to the Elbow and Silhouette schemes, respectively, to compare and analyze the performance of the proposed system.

Experimental results

Applying the Elbow scheme

When using the K-means clustering algorithm, the number of clusters must be determined before the clustering of a dataset is executed. One method to validate the number of clusters is to use the Elbow scheme [ 35 ]. We apply the Elbow scheme to find an optimal number of clusters, varying the value from 2 to 100.
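A minimal sketch of the Elbow scheme: sweep K, record the within-cluster sum of squared errors (the inertia, i.e., E in Eq. (6)), and look for the point where the curve flattens. Synthetic two-dimensional data stand in for the TF-IDF vectors here:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D data with three well-separated groups standing in
# for TF-IDF vectors; the true number of clusters is 3.
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
                  for c in ([0, 0], [5, 5], [10, 0])])

# Sweep K and record the within-cluster sum of squared errors (inertia);
# the "elbow" is where the curve stops dropping sharply (K = 3 here).
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in range(2, 7)}
for k, e in inertias.items():
    print(k, round(e, 2))
```

The inertia drops steeply up to the true number of clusters and only marginally afterward, which is the bend marked by the arrow in Fig. 10.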

Table 4 shows the number of clusters obtained by the Elbow scheme for the three methods shown in Table 3.

As we can see in the results of Table 4, the number of clusters grows as the number of keywords increases. This is a natural phenomenon, because a larger number of keywords results in more elaborate clustering over the given keywords. However, comparing the numbers of clusters of the three methods, we can see that Method 3 yields a lower number of clusters than the other two methods. This is because Method 3 can complementarily exploit the advantages of the other two methods when it groups papers with similar subjects. That is, Method 1 depends on the keywords input by users, and it cannot be guaranteed that these keywords are always correct for grouping papers with similar subjects, because users can register incorrect keywords for their own papers. Method 2 makes up for this disadvantage of Method 1 by using the topics automatically extracted by the LDA scheme. Figure 10 shows the elbow graph when Method 3 is used. In this figure, the upward arrow marks the optimal number of clusters calculated by the Elbow scheme.

figure 10

Elbow graph for Method 3

Applying the Silhouette scheme

The Silhouette scheme is one of various evaluation methods used as a measure of clustering performance [ 36 , 37 ]. The silhouette value becomes higher as two data points within the same cluster are closer, and also becomes higher as two data points in different clusters are farther apart. Typically, a silhouette value ranges from − 1 to 1, where a high value indicates that data are well matched to their own cluster and poorly matched to neighboring clusters. Generally, a silhouette value greater than 0.5 means that the clustering results are valid [ 36 , 37 ].
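Computing the average silhouette value can be sketched with Scikit-learn's silhouette_score on synthetic data standing in for the TF-IDF vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic groups standing in for TF-IDF vectors
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
                  for c in ([0, 0], [5, 5])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Average silhouette over all points; values above 0.5 are taken as valid
score = silhouette_score(data, labels)
print(round(score, 3))
```

For well-separated clusters like these, the score is close to 1; the per-point silhouette values, sorted by cluster, form the bars of graphs such as Fig. 11.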

Table 5 shows the average silhouette value for each of the three methods shown in Table 3. We can see from the results of this table that the K-means clustering algorithm used in this paper produces good clustering when 10 and 30 keywords are used. It should be noted that silhouette values of more than 0.5 represent valid clustering. Figure 11 shows the silhouette graph for each of 10, 20, and 30 keywords when Method 3 is used. In this figure, the dashed line represents the average silhouette value. We omit the remaining silhouette graphs due to space limitations.

figure 11

Silhouette graph for Method 3

Analysis of classification results

Table 6 shows an illustrative example of classification results. In this table, the papers in cluster 1 are grouped by the two keywords ‘cloud’ and ‘bigdata’ as primary keywords. For cluster 2, the two keywords ‘IoT’ and ‘privacy’ play an important role in grouping the papers; for cluster 3, the three keywords ‘IoT’, ‘security’, and ‘privacy’ do. In particular, depending on whether or not the keyword ‘security’ is used, the papers in cluster 2 and cluster 3 are grouped into different clusters.

Figure 12 shows the TF-IDF values and clustering results for some papers. In this figure, ‘predict’ denotes the number of the cluster containing the paper whose title is given in the first column. In Fig. 12a, we can observe that all papers have the same keyword ‘scheduling’, but they are divided into two clusters according to the TF-IDF value of the keyword. Figure 12b indicates that all papers have the same keyword ‘cloud’, but they are grouped into different clusters (cluster 7 and cluster 8) according to whether or not a TF-IDF value for the keyword ‘cloud storage’ exists.

figure 12

Illustrative examples of clustering results

Figure 13 shows an analysis result for the papers belonging to the same cluster. In this figure, we can see that the three papers in cluster 11 have four common keywords, ‘cloud’, ‘clustering’, ‘hadoop’, and ‘map-reduce’, as primary keywords. Therefore, we can see from this figure that these papers are characterized by the four common keywords.

figure 13

Clustering results by common keywords

Figures 14 and 15 show abstract examples for the first and second papers among those shown in Fig. 13, respectively. From these figures, we can see that the four keywords (‘cloud’, ‘clustering’, ‘hadoop’, and ‘map-reduce’) are indeed included in the abstracts of the two papers.

figure 14

An abstract example for [ 39 ]

figure 15

An abstract example for [ 40 ]

Evaluation of the accuracy of the proposed classification system

The accuracy of the proposed classification system has been evaluated using the well-known F-Score [ 41 ], which measures how good a paper classification is when compared with a reference classification. The F-Score is a combination of the precision and recall values used in information extraction. The precision, recall, and F-Score are defined as follows:

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$

In the above equations, TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. We carried out our experiments on 500 research papers randomly selected from the total of 3264 used in our experiments. This experiment is run 5 times, and the average of the F-Score values is recorded.
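The F-Score computation itself is straightforward; the sketch below uses hypothetical TP/FP/FN counts, not the paper's measured values:

```python
def f_score(tp, fp, fn):
    # Precision, recall, and their harmonic mean (the F-Score)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one clustering vs. a reference classification
print(round(f_score(tp=40, fp=10, fn=10), 3))  # 0.8
```

Note that TN does not enter the F-Score; it is listed above only for completeness of the confusion-matrix terminology.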

Figure 16 shows the F-Score values of the three methods used to construct the keyword dictionaries shown in Table 3.

figure 16

F-score values of three methods (TF-IDF, LDA, TF-IDF + LDA)

As we can see in the results of Fig. 16, the F-Score value of Method 3 (the combination of TF-IDF and LDA) is higher than those of the other methods. The main reason is that Method 3 can complementarily exploit the advantages of the other two methods. That is, TF-IDF can extract only the frequently occurring keywords in research papers, and LDA can extract only the topics that are latent in research papers. In contrast, the combination of TF-IDF and LDA leads to a more detailed classification of research papers, because frequently occurring keywords and the correlations between latent topics are used simultaneously to classify the papers.

Conclusion

We presented a paper classification system to efficiently support paper classification, which is essential for providing users with a fast and efficient search for their desired papers. The proposed system incorporates the TF-IDF and LDA schemes to calculate the importance of each paper and groups the papers with similar subjects using the K-means clustering algorithm. It can thereby achieve correct classification results for the papers users are interested in. For the experiments demonstrating the performance of the proposed system, we used actual data based on the papers published in the FGCS journal. The experimental results showed that the proposed system can group papers with similar subjects according to the keywords extracted from their abstracts. In particular, when a keyword dictionary with both the keywords extracted from the abstracts and the topics extracted by the LDA scheme was used, our classification system showed better clustering performance and higher F-Score values. Therefore, our classification system can classify research papers in advance by both keywords and topics with the support of high-performance computing techniques, and the classified research papers can then be used to search for papers within users’ research areas of interest, quickly and efficiently.

This work has mainly focused on developing and analyzing research paper classification. To become a generic approach, the work needs to be extended to various types of datasets, e.g., documents, tweets, and so on. Therefore, future work involves working on various types of datasets in the field of text mining, as well as developing even more efficient classifiers for research paper datasets.

Availability of data and materials

Not applicable.

References

Bafna P, Pramod D, Vaidya A (2016) Document clustering: TF-IDF approach. In: IEEE int. conf. on electrical, electronics, and optimization techniques (ICEEOT). pp 61–66

Ramos J (2003) Using TF-IDF to determine word relevance in document queries. In: Proc. of the first int. conf. on machine learning

Havrlant L, Kreinovich V (2017) A simple probabilistic explanation of term frequency-inverse document frequency (TF-IDF) heuristic (and variations motivated by this explanation). Int J Gen Syst 46(1):27–36


Trstenjak B, Mikac S, Donko D (2014) KNN with TF-IDF based framework for text categorization. Procedia Eng 69:1356–1364


Yau C-K et al (2014) Clustering scientific documents with topic modeling. Scientometrics 100(3):767–786

Balabantaray RC, Sarma C, Jha M (2013) Document clustering using K-means and K-medoids. Int J Knowl Based Comput Syst 1(1):7–13


Gupta H, Srivastava R (2014) K-means based document clustering with automatic “K” selection and cluster refinement. Int J Comput Sci Mob Appl 2(5):7–13

Gurusamy R, Subramaniam V (2017) A machine learning approach for MRI brain tumor classification. Comput Mater Continua 53(2):91–108

Nagwani NK (2015) Summarizing large text collection using topic modeling and clustering based on MapReduce framework. J Big Data 2(1):1–18

Kim J-J (2017) Hadoop based wavelet histogram for big data in cloud. J Inf Process Syst 13(4):668–676

Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

Cho W, Choi E (2017) DTG big data analysis for fuel consumption estimation. J Inf Process Syst 13(2):285–304

FGCS Journal. https://www.journals.elsevier.com/future-generation-computer-systems . Accessed 15 Aug 2018.

Gui Y, Gao G, Li R, Yang X (2012) Hierarchical text classification for news articles based-on named entities. In: Proc. of int. conf. on advanced data mining and applications. pp 318–329


Singh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning classifiers. Hum-cent Comput Inf Sci 7:32

Mahendran A et al (2013) Opinion mining for text classification. Int J Sci Eng Technol 2(6):589–594

Alsmadi I, Alhami I (2015) Clustering and classification of email contents. J King Saud Univ Comput Inf Sci. 27(1):46–57

Rossi RG, Lopes AA, Rezende SO (2016) Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. Inf Process Manag 52(2):217–257

Barigou F (2018) Impact of instance selection on kNN-based text categorization. J Inf Process Syst 14(2):418–434

Baker K, Bhandari A, Thotakura R (2009) An interactive automatic document classification prototype. In: Proc. of the third workshop on human-computer interaction and information retrieval. pp 30–33

Xuan J et al. (2017) Automatic bug triage using semi-supervised text classification. arXiv preprint arXiv:1704.04769

Aggarwal CC, Zhai CX (2012) A survey of text classification algorithms. In: Mining text data, Springer, Berlin, pp 163–222

Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, Hoboken


Bravo-Alcobendas D, Sorzano COS (2009) Clustering of biomedical scientific papers. In: 2009 IEEE Int. symp. on intelligent signal processing. pp 205–209

Taheriyan M (2011) Subject classification of research papers based on interrelationships analysis. In: ACM proc. of the 2011 workshop on knowledge discovery, modeling and simulation. pp 39–44

Hanyurwimfura D, Bo L, Njagi D, Dukuzumuremyi JP (2014) A centroid and Relationship based clustering for organizing research papers. Int J Multimed Ubiquitous Eng 9(3):219–234

Nanba H, Kando N, Okumura M (2011) Classification of research papers using citation links and citation types: towards automatic review article generation. Adv Classif Res Online 11(1):117–134

Nguyen TH, Shirai K (2013) Text classification of technical papers based on text segmentation. In: Int. conf. on application of natural language to information systems. pp 278–284

Gurung P, Wagh R (2017) A study on topic identification using K means clustering algorithm: big vs. small documents. Adv Comput Sci Technol 10(2):221–233

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Jiang Y, Jia A, Feng Y, Zhao D (2012) Recommending academic papers via users’ reading purposes. In: Proc. of the sixth ACM conf. on recommender systems. pp 241–244

Xu R, Wunsch D (2008) Clustering. Wiley, Hoboken


Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. SIAM, Alexandria


Kodinariya TM, Makwana PR (2013) Review on determining number of cluster in K-means clustering. Int J Adv Res Comput Sci Manag Stud 1(6):90–95

Oliveira GV et al (2017) Improving K-means through distributed scalable metaheuristics. Neurocomputing 246:45–57

Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65


Scikit-Learn. http://scikit-learn.org/stable/modules/classes.html . Accessed 15 Aug 2018.

Veiga J, Exposito RR, Taboada GL, Tounno J (2016) Flame-MR: an event-driven architecture for MapReduce applications. Future Gener Comput Syst 65:46–56

Ibrahim S, Phan T-D, Carpen-Amarie A, Chihoub H-E, Moise D, Antoniu G (2016) Governing energy consumption in Hadoop through CPU frequency scaling: an analysis. Future Gener Comput Syst 54:219–232

Visentini I, Snidaro L, Foresti GL (2016) Diversity-aware classifier ensemble selection via F-score. Inf Fus 28:24–43


Acknowledgements

This work was supported by research grants from Daegu Catholic University in 2017.

Author information

Authors and affiliations

Department of Police Administration, Daegu Catholic University, 13-13 Hayang-ro, Hayang-eup, Gyeongsan, Gyeongbuk, 38430, South Korea

Sang-Woon Kim

School of Information Technology Eng., Daegu Catholic University, 13-13 Hayang-ro, Hayang-eup, Gyeongsan, Gyeongbuk, 38430, South Korea

Joon-Min Gil


Contributions

SWK proposed a main idea for keyword analysis and edited the manuscript. JMG was a major contributor in writing the manuscript and carried out the performance experiments. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Joon-Min Gil .

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Kim, SW., Gil, JM. Research paper classification systems based on TF-IDF and LDA schemes. Hum. Cent. Comput. Inf. Sci. 9 , 30 (2019). https://doi.org/10.1186/s13673-019-0192-7

Download citation

Received : 03 December 2018

Accepted : 12 August 2019

Published : 26 August 2019

DOI : https://doi.org/10.1186/s13673-019-0192-7


Keywords

  • Paper classification

Original Research Article

Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks

  • 1 Computer Science and Engineering, Pennsylvania State University, University Park, PA, United States
  • 2 Information Sciences and Technology, Pennsylvania State University, University Park, PA, United States
  • 3 Computer Science, Old Dominion University, Norfolk, VA, United States

Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 of individual subject categories ranging from 0.50 to 0.95. The results show the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character- and sentence-level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.

1 Introduction

A recent estimate of the total number of English research articles available online was at least 114 million ( Khabsa and Giles, 2014 ). Studies indicate the number of academic papers doubles every 10–15 years ( Larsen and von Ins, 2010 ). The continued growth of scholarly papers increases the challenge of accurately finding relevant research papers, especially when papers in different subject categories (SCs) are mixed in a search engine's collection. Searches based only on keywords may no longer be the most efficient method ( Matsuda and Fukushima, 1999 ). This often happens when the same query terms appear in multiple research areas. For example, querying "neuron" in Google Scholar returns documents in both computer science and neuroscience. Search results can also belong to diverse domains when the query terms contain acronyms. For example, querying "NLP" returns documents in linguistics (meaning "neuro-linguistic programming") and computer science (meaning "natural language processing"). If the SCs of documents are available, users can narrow search results by specifying an SC, which effectively increases the precision of the query results, assuming SCs are accurately assigned to documents. Also, the delineation of scientific domains is a preliminary task of many bibliometric studies at the meso-level. Accurate categorization of research articles is a prerequisite for discovering various dimensions of scientific activity in epistemology ( Collins, 1992 ) and sociology ( Barnes et al., 1996 ), as well as the invisible colleges, which are implicit academic networks ( Zitt et al., 2019 ). To build a web-scale knowledge system, it is necessary to organize scientific publications into a hierarchical concept structure, which further requires categorization of research articles by SCs ( Shen et al., 2018 ).

As such, we believe it is useful to build a classification system that assigns SCs to scholarly papers. Such a system could significantly impact scientific search and facilitate bibliometric evaluation. It can also help with Science of Science research ( Fortunato et al., 2018 ), an area of research that uses scholarly big data to study the choice of scientific problems, scientist career trajectories, research trends, research funding, and other research aspects. Also, many have noted that it is difficult to extract SCs using traditional topic models such as Latent Dirichlet Allocation (LDA), since it only extracts words and phrases present in documents ( Gerlach et al., 2018 ). An example is that a paper in computer science is rarely given its SC in the keyword list.

In this work, we pose the SC problem as one of multiclass classifications in which one SC is assigned to each paper. In a preliminary study, we investigated feature-based machine learning methods to classify research papers into six SCs ( Wu et al., 2018 ). Here, we extend that study and propose a system that classifies scholarly papers into 104 SCs using only abstracts. The core component is a neural network classifier trained on millions of labeled documents that are part of the WoS database. In comparison with our preliminary work, our data is more heterogeneous (more than 100 SCs as opposed to six), imbalanced, and complicated (data labels may overlap). We compare our system against several baselines applying various text representations, machine learning models, and/or neural network architectures.

SC classification is usually based on a universal schema for a specific domain or for all domains. Many schemas for scientific classification systems are publisher domain specific. For example, ACM has its own hierarchical classification system 1 , NLM has medical subject headings 2 , and MSC has a subject classification for mathematics 3 . The most comprehensive and systematic classification schemas seem to be from WoS 4 and the Library of Congress (LOC) 5 . The latter was created in 1897 and was driven by practical needs of the LOC rather than any epistemological considerations and is most likely out of date.

To the best of our knowledge, our work is the first example of using a neural network to classify scholarly papers into a comprehensive set of SCs. Other work focused on unsupervised methods and most were developed for specific category domains. In contrast, our classifier was trained on a large number of high quality abstracts from WoS and can be applied directly to abstracts without any citation information. We also develop a novel representation of scholarly paper abstracts using ranked tokens and their word embedding representations. This significantly reduces the scale of the classic Bag of Word (BoW) model. We also retrained FastText and GloVe word embedding models using WoS abstracts. The subject category classification was then applied to the CiteSeerX collection of documents. However, it could be applied to any similar collection.

2 Related Work

Text classification is a fundamental task in natural language processing. Many complicated tasks use it or include it as a necessary step, such as part-of-speech tagging, e.g., Ratnaparkhi (1996) , sentiment analysis, e.g., Vo and Zhang (2015) , and named entity recognition, e.g., Nadeau and Sekine (2007) . Classification can be performed at many levels: word, phrase, sentence, snippet (e.g., tweets, reviews), articles (e.g., news articles), and others. The number of classes usually ranges from a few to nearly 100. Methodologically, a classification model can be supervised, semi-supervised, and unsupervised. An exhaustive survey is beyond the scope of this paper. Here we briefly review short text classification and highlight work that classifies scientific articles.

Bag of words (BoW) is one of the most commonly used representations for text classification, an example being keyphrase extraction ( Caragea et al., 2016 ; He et al., 2018 ). BoW represents text as a set of unordered word-level tokens, without considering syntactic or sequential information. For example, Nam et al. (2016) combined BoW with linguistic, grammatical, and structural features to classify sentences in biomedical paper abstracts. Li et al. (2010) treated text classification as a sequence tagging problem and proposed a Hidden Markov Model for classifying sentences into mutually exclusive categories, namely background, objective, method, result, and conclusions. The task described in García et al. (2012) classifies abstracts in biomedical databases into 23 categories (OHSUMED dataset) or 26 categories (UVigoMED dataset); the authors proposed a bag-of-concepts representation based on Wikipedia and classified abstracts using an SVM model.

Recently, word embeddings (WE) have been used to build distributed dense vector representations for text. Embedded vectors can be used to measure semantic similarity between words ( Mikolov et al., 2013b ). WE has shown improvements in semantic parsing and similarity analysis, e.g., Prasad et al. (2018) . Other types of embeddings were later developed for character level embedding ( Zhang et al., 2015 ), phrase embedding ( Passos et al., 2014 ), and sentence embedding ( Cer et al., 2018 ). Several WE models have been trained and distributed; examples are word2vec ( Mikolov et al., 2013b ), GloVe ( Pennington et al., 2014 ), FastText ( Grave et al., 2017 ), Universal Sentence Encoder ( Cer et al., 2018 ), ELMo ( Peters et al., 2018 ), and BERT ( Devlin et al., 2019 ). Empirically, Long Short Term Memory [LSTM; Hochreiter and Schmidhuber (1997) ], Gated Recurrent Units [GRU; Cho et al. (2014) ], and convolutional neural networks [CNN; LeCun et al. (1989) ] have achieved improved performance compared to other supervised machine learning models based on shallow features ( Ren et al., 2016 ).

Classifying SCs of scientific documents is usually based on metadata, since full text is not available for most papers and processing a large amount of full text is computationally expensive. Most existing methods for SC classification are unsupervised. For example, the Smart Local Moving Algorithm identified topics in PubMed based on text similarity ( Boyack and Klavans, 2018 ) and citation information ( van Eck and Waltman, 2017 ). K-means was used to cluster articles based on semantic similarity ( Wang and Koopman, 2017 ). The memetic algorithm, a type of evolutionary computing ( Moscato and Cotta, 2003 ), was used to classify astrophysical papers into subdomains using their citation networks. A hybrid clustering method was proposed based on a combination of bibliographic coupling and textual similarities using the Louvain algorithm-a greedy method that extracted communities from large networks ( Glänzel and Thijs, 2017 ). Another study constructed a publication-based classification system of science using the WoS dataset ( Waltman and van Eck, 2012 ). The clustering algorithm, described as a modularity-based clustering, is conceptually similar to k -nearest neighbor ( k NN). It starts with a small set of seed labeled publications and grows by incrementally absorbing similar articles using co-citation and bibliographic coupling. Many methods mentioned above rely on citation relationships. Although such information can be manually obtained from large search engines such as Google Scholar, it is non-trivial to scale this for millions of papers.

Our model classifies papers based only on abstracts, which are often available. Our end-to-end system is trained on a large number of labeled data with no references to external knowledge bases. When compared with citation-based clustering methods, we believe it to be more scalable and portable.

3 Text Representations

For this work, we represent each abstract using a BoW model weighted by TF-IDF. However, instead of building a sparse vector for all tokens in the vocabulary, we choose the word tokens with the highest TF-IDF values and encode them using WE models. We explore both pre-trained and re-trained WE models, and we explore their effect on classification performance based on token order. As evaluation baselines, we compare our best model with off-the-shelf text embedding models, such as the Universal Sentence Encoder [USE; Cer et al. (2018) ]. We show that our model, which uses the traditional and relatively simple BoW representation, is computationally less expensive and can be used to classify scholarly papers at scale, such as those in the CiteSeerX repository ( Giles et al., 1998 ; Wu et al., 2014 ).

3.1 Representing Abstracts

First, an abstract is tokenized on white space; punctuation and stop words are removed. Then a list A of word types (unique words) w_i is generated after lemmatization, which uses the WordNet database ( Fellbaum, 2005 ) for the lemmas.

Next, the list A is sorted in descending order of TF-IDF, giving A_sorted. TF is the term frequency in an abstract, and IDF is the inverse document frequency, calculated from the number of abstracts containing a token in the entire WoS abstract corpus.

Because abstracts may have different numbers of words, we choose the top d elements from A_sorted to represent the abstract. We then re-organize the elements according to their original order in the abstract, forming a sequential input. If the number of words is less than d, we pad the feature list with zeros. The final list is a vector built by concatenating all word-level vectors v'_k, k ∈ {1, …, d}, into a D_WE-dimensional vector. The final semantic feature vector A_f is:

A_f = [v'_1; v'_2; …; v'_d]
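The construction just described (top-d word types by TF-IDF, restored to abstract order, embedded and zero-padded) can be sketched as follows. The three-document corpus, the toy embedding table, and d = 4 are illustrative assumptions, not the paper's retrained models or data.

```python
import math
from collections import Counter

def represent_abstract(abstract, corpus, embeddings, d, dim):
    """Select the top-d word types of an abstract by TF-IDF, restore
    their original order of appearance, replace each with its embedding
    vector, and zero-pad the concatenation to d * dim values."""
    tokens = abstract.lower().split()
    types = list(dict.fromkeys(tokens))           # unique words, first-seen order
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc.lower().split()))
    tfidf = {w: tokens.count(w) * math.log(n / df[w]) for w in types}
    top = set(sorted(types, key=tfidf.get, reverse=True)[:d])
    ordered = [w for w in types if w in top]      # re-impose abstract order
    vec = []
    for w in ordered:
        vec.extend(embeddings[w])
    vec.extend([0.0] * (d * dim - len(vec)))      # pad when fewer than d words
    return vec

# Toy corpus and 2-dimensional embedding table (invented values).
corpus = ["deep neural networks", "neural text classification", "graph networks"]
emb = {"deep": [0.1, 0.2], "neural": [0.3, 0.1], "networks": [0.0, 0.5],
       "text": [0.4, 0.4], "classification": [0.2, 0.6], "graph": [0.5, 0.0]}
v = represent_abstract("deep neural networks", corpus, emb, d=4, dim=2)
```

Here the abstract has only three word types, so the final vector carries three embedded words followed by one zero-padded slot, for a fixed length of d · dim = 8.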

3.2 Word Embedding

To investigate how different word embeddings affect classification results, we apply several widely used models. An exhaustive experiment for all possible models is beyond the scope of this paper. We use some of the more popular ones as now discussed.

GloVe captures semantic correlations between words using global word–word co-occurrence, as opposed to the local information used in word2vec ( Mikolov et al., 2013a ). It learns a word–word co-occurrence matrix and predicts co-occurrence ratios of given words in context ( Pennington et al., 2014 ). GloVe is a context-independent model and outperformed other word embedding models such as word2vec on word analogy, word similarity, and named entity recognition tasks.

FastText is another context-independent model, one which uses sub-word information (e.g., character n-grams) to represent words as vectors ( Bojanowski et al., 2017 ). It improves on log-bilinear models, which ignore morphological forms by assigning a distinct vector to each word. If we consider a word w whose n-grams are denoted by g_w, then a vector z_g is assigned to each n-gram in g_w, and each word is represented by the sum of the vector representations of its character n-grams. This representation is incorporated into a Skip-Gram model ( Goldberg and Levy, 2014 ), which improves vector representations for morphologically rich languages.
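The character n-gram decomposition can be illustrated directly. The boundary markers < and > denote word start and end, as in the FastText papers; the n-gram vector table below is a toy value, not a trained model.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary markers, as used
    by FastText to build sub-word representations."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_vector(word, ngram_vectors, dim, n=3):
    """A word vector as the sum of its character n-gram vectors;
    n-grams absent from the table contribute nothing in this sketch."""
    vec = [0.0] * dim
    for g in char_ngrams(word, n):
        for i, x in enumerate(ngram_vectors.get(g, [0.0] * dim)):
            vec[i] += x
    return vec

grams = char_ngrams("where")
vec = word_vector("where", {"<wh": [1.0, 0.0], "re>": [0.0, 1.0]}, dim=2)
```

Because unseen words still decompose into known n-grams, this scheme can produce vectors for out-of-vocabulary or morphologically varied words.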

SciBERT is a variant of BERT, a context-aware WE model that has improved the performance of many NLP tasks such as question answering and inference ( Devlin et al., 2019 ). The bidirectionally trained model seems to learn a deeper sense of language than single directional transformers. The transformer uses an attention mechanism that learns contextual relationships between words. SciBERT uses the same training method as BERT but is trained on research papers from Semantic Scholar. Since the abstracts from WoS articles mostly contain scientific information, we use SciBERT ( Beltagy et al., 2019 ) instead of BERT. Since it is computationally expensive to train BERT (4 days on 4–16 Cloud TPUs as reported by Google), we use the pre-trained SciBERT.

3.3 Retrained WE Models

Though pretrained WE models represent richer semantic information than traditional one-hot vector methods, a classifier built on them does not perform well when applied to text in scientific articles. This is probably because the corpora used to train these models are mostly from Wikipedia and newswire. The majority of words and phrases in the vocabulary extracted from such articles provide general descriptions of knowledge, which differ significantly from those used in scholarly articles to describe specific domain knowledge. Statistically, the overlap between the vocabulary of pretrained GloVe (six billion tokens) and WoS is only 37% ( Wu et al., 2018 ). Nearly all WE models can be retrained. Thus, we retrained GloVe and FastText using 6.38 million abstracts in WoS (imposing a limit of 150k abstracts per SC; see below for more details). There are 1.13 billion word tokens in total. GloVe generated 1 million unique vectors, and FastText generated 1.2 million unique vectors.

3.4 Universal Sentence Encoder

For baselines, we compared with Google's Universal Sentence Encoder (USE) and the character-level convolutional network (CCNN). USE uses transfer learning to encode sentences into vectors. The architecture consists of a transformer-based sentence encoder ( Vaswani et al., 2017 ) and a deep averaging network [DAN; Iyyer et al. (2015) ]. These two variants have trade-offs between accuracy and compute resources. We chose the transformer model because it performs better than the DAN model on various NLP tasks ( Cer et al., 2018 ). CCNN is a combination of character-level features trained on temporal (1D) convolutional networks [ConvNets; Zhang et al. (2015) ]. It treats input characters in text as a raw signal which is then applied to ConvNets. Each character is encoded using a one-hot vector such that the length l of a character sequence does not exceed a preset length l_0.

4 Classifier Design

The architecture of our proposed classifier is shown in Figure 1 . An abstract representation previously discussed is passed to the neural network for encoding. Then the label of the abstract is determined by the output of the sigmoid function that aggregates all word encodings. Note that this architecture is not applicable for use by CCNN or USE. For comparison, we used these two architectures directly as described from their original publications.


FIGURE 1 . Subject category (SC) classification architecture.

LSTM is known for handling the vanishing gradient that occurs when training recurrent neural networks. A typical LSTM cell consists of three gates: input gate i_t, output gate o_t, and forget gate f_t. The input gate updates the cell state; the output gate decides the next hidden state; and the forget gate decides whether to store or erase particular information in the current state h_t. We use tanh(·) as the activation function and the sigmoid function σ(·) to map output values into a probability distribution. The current hidden state h_t of LSTM cells can be implemented with the following equations:

i_t = σ(W_i x_t + U_i h_{t−1})
f_t = σ(W_f x_t + U_f h_{t−1})
o_t = σ(W_o x_t + U_o h_{t−1})
z_t = tanh(W_z x_t + U_z h_{t−1})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ z_t
h_t = o_t ⊙ tanh(c_t)

At a given time step t, x_t represents the input vector, c_t the cell state vector (memory cell), and z_t a temporary result. W and U are the weights for the input gate i, forget gate f, temporary result z, and output gate o.
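One step of the LSTM update described above can be sketched numerically in pure Python. The zero-weight matrices below are toy values chosen so the result is easy to verify by hand (with zero weights every gate sits at σ(0) = 0.5 and the candidate at tanh(0) = 0); they are not trained parameters, and bias terms are omitted as in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step: gates i, f, o and candidate z update the cell
    state c_t, then h_t = o_t * tanh(c_t). W maps the input x_t and
    U the previous hidden state h_prev, one matrix per gate."""
    i = [sigmoid(v) for v in vadd(matvec(W["i"], x_t), matvec(U["i"], h_prev))]
    f = [sigmoid(v) for v in vadd(matvec(W["f"], x_t), matvec(U["f"], h_prev))]
    o = [sigmoid(v) for v in vadd(matvec(W["o"], x_t), matvec(U["o"], h_prev))]
    z = [math.tanh(v) for v in vadd(matvec(W["z"], x_t), matvec(U["z"], h_prev))]
    c_t = [ft * ct + it * zt for ft, ct, it, zt in zip(f, c_prev, i, z)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t

dim = 2
zeros = {g: [[0.0] * dim for _ in range(dim)] for g in "ifoz"}
h_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, 1.0], zeros, zeros)
# With zero weights: f = i = o = 0.5 and z = 0, so c_t = 0.5 * c_prev
```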

GRU is similar to LSTM, except that it has only a reset gate r_t and an update gate z_t. The current hidden state h_t at a given time step t can be calculated with:

z_t = σ(W_z x_t + U_z h_{t−1})
r_t = σ(W_r x_t + U_r h_{t−1})
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}))
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

with the same defined variables. GRU is less computationally expensive than LSTM and achieves comparable or better performance for many tasks. For a given sequence, we train LSTM and GRU in two directions (BiLSTM and BiGRU) to predict the label for the current position using both historical and future data, which has been shown to outperform a single direction model for many tasks.

Attention Mechanism The attention mechanism is used to weight word tokens differentially when aggregating them into a document-level representation. In our system ( Figure 1 ), embeddings of words are concatenated into a vector with D_WE dimensions. Using the attention mechanism, each word t contributes to the sentence vector, characterized by the factor α_t, such that

u_t = tanh(W h_t + b)
α_t = exp(u_t⊤ v_t) / Σ_k exp(u_k⊤ v_k)

in which h_t = [h→_t; h←_t] is the representation of each word after the BiLSTM or BiGRU layers, v_t is the context vector that is randomly initialized and learned during training, W is the weight, and b is the bias. An abstract vector v is generated by aggregating the word vectors using the weights learned by the attention mechanism, i.e., the weighted sum of the h_t:

v = Σ_t α_t h_t
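The attention aggregation can be sketched numerically: each word representation is scored, the scores are softmax-normalized into the factors α_t, and the abstract vector is the weighted sum. The weight matrix W, bias b, and context vector v below are toy values, not learned parameters.

```python
import math

def attention_pool(H, W, b, v):
    """H is a list of word representations h_t. Each is scored by
    u_t = tanh(W h_t + b); alpha is the softmax of u_t . v; the
    pooled abstract vector is sum_t alpha_t * h_t."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    u = [[math.tanh(s + bi) for s, bi in zip(matvec(W, h), b)] for h in H]
    scores = [sum(ui * vi for ui, vi in zip(ut, v)) for ut in u]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    pooled = [sum(a * h[i] for a, h in zip(alpha, H))
              for i in range(len(H[0]))]
    return alpha, pooled

H = [[1.0, 0.0], [0.0, 1.0]]              # two toy word representations
W_id = [[1.0, 0.0], [0.0, 1.0]]           # identity weight, zero bias
alpha, pooled = attention_pool(H, W_id, [0.0, 0.0], [1.0, 0.0])
```

With the context vector aligned to the first word's representation, the first word receives the larger attention weight, and the pooled vector is the corresponding convex combination.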

5 Experiments

Our training dataset is from the WoS database for the year 2015. The entire dataset contains approximately 45 million records of academic documents, most having titles and abstracts. They are labeled with 235 SCs at the journal level in three broad categories: Science, Social Science, and Art and Literature. A portion of the SCs have subcategories, such as "Physics, Condensed Matter," "Physics, Nuclear," and "Physics, Applied." Here, we collapse these subcategories, which reduces the total number of SCs to 115. We do this because the minor classes decrease the performance of the model (owing to the limited data available for them), and because we need an "Others" class to balance the data samples. We also exclude papers labeled with more than one category and papers labeled "Multidisciplinary." Abstracts with fewer than 10 words are excluded. The final number of singly labeled abstracts is approximately nine million, in 104 SCs. The sample sizes of these SCs range from 15 (Art) to 734k (Physics), with a median of about 86k. We randomly select up to 150k abstracts per SC; this upper limit is based on our preliminary study ( Wu et al., 2018 ). The ratio between the training and testing corpus is 9:1.
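The sampling and splitting just described (cap each SC at 150k abstracts, then split 9:1) can be sketched as follows. The helper and its fixed seed are illustrative assumptions; the paper does not publish its sampling code.

```python
import random

def cap_and_split(by_category, cap=150_000, test_frac=0.1, seed=0):
    """Randomly keep at most `cap` abstracts per subject category,
    then split each category into train/test at (1 - test_frac) : test_frac,
    keeping at least one test item per category."""
    rng = random.Random(seed)
    train, test_split = [], []
    for category, abstracts in by_category.items():
        sample = list(abstracts)
        rng.shuffle(sample)
        sample = sample[:cap]
        n_test = max(1, int(len(sample) * test_frac))
        test_split.extend((category, a) for a in sample[:n_test])
        train.extend((category, a) for a in sample[n_test:])
    return train, test_split

# Toy run: one large and one tiny category, with a small cap.
train, test_split = cap_and_split(
    {"physics": list(range(30)), "art": list(range(5))},
    cap=20, test_frac=0.1,
)
```

In this toy run, "physics" is capped to 20 abstracts (18 train, 2 test) while "art" keeps all 5 (4 train, 1 test), mirroring how the 150k cap limits only the oversized SCs.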

The median number of word types per abstract is approximately 80–90. As such, we choose the top d = 80 elements from A_sorted to represent the abstract. If A_sorted has fewer than d elements, we pad the feature list with zeros. The word vector dimensions of GloVe and FastText are set to 50 and 100, respectively. This falls into the reasonable value range (24–256) for WE dimensions ( Witt and Seifert, 2017 ). When training the BiLSTM and BiGRU models, each layer contains 128 neurons. We investigate the dependency of classification performance on these hyper-parameters by varying the number of layers and neurons. We varied the number of word types per abstract d and set the dropout rate to 20% to mitigate overfitting or underfitting. Due to their relatively large size, we train the neural networks using mini-batch gradient descent with Adam for gradient optimization and cross entropy as the loss function. The learning rate was set to 10^−3.

6 Evaluation and Comparison

6.1 One-Level Classifier

We first classify all abstracts in the testing set into 104 SCs using the retrained GloVe WE model with BiGRU. The model achieves a micro-F1 score of 0.71. The first panel in Figure 2 shows the SCs that achieve the highest F1 scores; the second panel shows SCs with relatively low F1 scores. The results indicate that the classifier performs more poorly on SCs with relatively small sample sizes than on SCs with relatively large sample sizes. This data imbalance is likely to contribute to the significantly different performance across SCs.
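Micro-F1, the metric reported throughout, pools true positives, false positives, and false negatives across all classes before computing precision and recall; for single-label multiclass predictions it coincides with accuracy. A minimal sketch with invented labels:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool per-class TP/FP/FN counts, then
    F1 = 2PR / (P + R). With exactly one label per document, each
    wrong prediction is both an FP (for the predicted class) and
    an FN (for the true class), so micro-F1 equals accuracy."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy evaluation: 2 of 4 predictions correct.
score = micro_f1(["cs", "cs", "bio", "phys"],
                 ["cs", "bio", "bio", "cs"])
```

Per-class (macro) F1, by contrast, averages F1 over classes and is what exposes the weak minor SCs discussed above.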


FIGURE 2 . Number of training documents (blue bars) and the corresponding F1 values (red curves) for the best-performing (top) and worst-performing (bottom) SCs. The green line shows the improved F1 scores produced by the second-level classifier.

6.2 Two-Level Classifier

To mitigate the data imbalance problem of the one-level classifier, we train a two-level classifier. The first level classifies abstracts into 81 SCs: 80 major SCs and an "Others" category, which incorporates the 24 minor SCs with fewer than 10k training abstracts each. Abstracts that fall into "Others" are further classified by a second-level classifier, which is trained on abstracts belonging to the 24 minor SCs.
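The routing logic of the two-level scheme can be sketched as follows; `first_level` and `second_level` stand in for the trained classifiers and are hypothetical callables, as are the stub labels used in the toy run.

```python
def classify_two_level(abstract, first_level, second_level):
    """First-level model chooses among the 80 major SCs plus 'Others';
    abstracts routed to 'Others' are re-classified among the 24 minor
    SCs by the second-level model."""
    label = first_level(abstract)
    if label == "Others":
        label = second_level(abstract)
    return label

# Stub classifiers for illustration only.
first = lambda text: "Others" if "minor" in text else "Physics"
second = lambda text: "Art"

label_a = classify_two_level("a minor topic abstract", first, second)
label_b = classify_two_level("quantum dynamics of solids", first, second)
```

The benefit is that the second-level model trains only on minor-SC abstracts, so the majority classes no longer drown them out.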

6.3 Baseline Methods

For comparison, we trained five supervised machine learning models as baselines: Random Forest (RF), Naïve Bayes (NB, Gaussian), Support Vector Machine (SVM, linear and Radial Basis Function kernels), and Logistic Regression (LR). Documents are represented in the same way as for the DANN except that no word embedding is performed. Because training these models on all of the DANN training data takes an extremely long time, and the implementation does not support batch processing, we downsize the training corpus to 150k documents in total, keeping the training samples in each SC in proportion to those used for the DANN. The performance metrics are calculated on the same testing corpus as for the DANN model.
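The proportional downsizing to 150k documents amounts to scaling each SC's count by a common factor; a sketch (function name and rounding choice ours):

```python
def downsample_counts(sc_counts, target_total=150_000):
    """Scale each SC's training count so the corpus shrinks to roughly
    target_total documents while per-SC proportions are preserved."""
    grand_total = sum(sc_counts.values())
    return {sc: round(n * target_total / grand_total)
            for sc, n in sc_counts.items()}
```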

We used the character-level convolutional neural network (CCNN) architecture ( Zhang et al., 2015 ), which contains six convolutional layers, each with 1,008 neurons, followed by three fully connected layers; each abstract is represented by a 1,014-dimensional vector. Our architecture for USE ( Cer et al., 2018 ) is an MLP with four layers of 1,024 neurons each; each abstract is represented by a 512-dimensional vector.

6.4 Results

The performances of DANN under different settings, and a comparison between the best DANN models and the baseline models, are illustrated in Figure 3 . The numerical values of the performance metrics for the two-level classifier are tabulated in Supplementary Table S1 . Our observations follow.

(1) FastText + BiGRU + Attn and FastText + BiLSTM + Attn achieve the highest micro-F1 of 0.76. Several models achieve similar results: GloVe + BiLSTM + Attn (micro-F1 = 0.75), GloVe + BiGRU + Attn (micro-F1 = 0.74), FastText + LSTM + Attn (micro-F1 = 0.75), and FastText + GRU + Attn (micro-F1 = 0.74). These results indicate that the attention mechanism significantly improves classifier performance.

(2) Retraining FastText and GloVe significantly boosted performance. In contrast, the best micro-F1 achieved by USE is 0.64, which likely results from its relatively low vocabulary overlap. Another reason could be that a single fixed-length vector encodes only the overall semantics of the abstract, whereas the occurrences of individual words are better indicators of sentences in specific domains.

(3) LSTM and GRU and their bidirectional counterparts exhibit very similar performance, which is consistent with a recent systematic survey ( Greff et al., 2017 ).

(4) For FastText + BiGRU + Attn, the F1 measure varies from 0.50 to 0.95 with a median of 0.76. The distribution of F1 values for the 81 SCs is shown in Figure 4 . The micro-F1 achieved by the first-level classifier with 81 categories (0.76) is an improvement over the classifier trained on all 104 SCs (micro-F1 = 0.70).

(5) Under the setting of GloVe + BiGRU with 128 neurons on two layers, increasing the GloVe vector dimension from 50 to 100 did not improve performance (not shown), which is consistent with earlier work ( Witt and Seifert, 2017 ).

(6) Word-level embedding models in general perform better than the character-level embedding model (i.e., CCNN). CCNN treats the text as a raw signal, so the word vectors it constructs are more appropriate for comparing morphological similarities. However, semantically similar words need not be morphologically similar, e.g., “Neural Networks” and “Deep Learning.”

(7) SciBERT’s performance is 3–5% below FastText and GloVe, indicating that retrained WE models have an advantage over pre-trained WE models. This is likely because SciBERT was trained on a PubMed corpus that mostly comprises papers in the biomedical and life sciences. Also, due to its larger dimensions, its training time was greater than FastText’s under the same parameter settings.

(8) The best DANN model beats the best machine learning model (LR) by about 10%.
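The attention mechanism credited in observation (1) can be viewed as softmax-weighted pooling over the recurrent hidden states. A simplified pure-Python sketch follows; the learned token-scoring layer (a small feed-forward network in typical implementations) is abstracted into precomputed scores:

```python
import math

def attention_pool(hidden_states, scores):
    """Attention-style pooling (simplified): softmax the per-token
    scores, then return the weighted sum of the hidden-state vectors.

    `hidden_states` is a list of equal-length vectors (one per token);
    `scores` is one unnormalized relevance score per token.
    """
    m = max(scores)                                   # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states))
            for i in range(dim)]
```

With uniform scores this reduces to mean pooling; as one token's score dominates, the pooled vector approaches that token's hidden state, which is how attention lets the classifier focus on discriminative words.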


FIGURE 3 . Top: micro-F1 scores of our DANN models that classify abstracts into 81 SCs; variants of models within each group are color-coded. Bottom: micro-F1 scores of our best DANN models that classify abstracts into 81 SCs, compared with the baseline models.


FIGURE 4 . Distribution of F1 scores across the 81 SCs obtained by the first-level classifier.

We also investigated the dependency of classification performance on key hyper-parameters, taking GloVe + BiGRU with 128 neurons on two layers as the “reference setting.” Increasing the neuron count by a factor of 10 (1,280 neurons on two layers) improved performance only marginally, by 1%, relative to the reference setting. We also doubled the number of layers (128 neurons on four layers): without attention, this model performs 3% worse than the reference setting; with the attention mechanism, micro-F1 = 0.75, a marginal 1% improvement over the reference setting. We also increased the number of neurons for USE to 2,048 on four layers; micro-F1 improves marginally by 1%, reaching only 0.64. These results indicate that adding more neurons and layers has little impact on performance.

The second-level classifier is trained on the “Others” corpus using the same neural architecture as the first level. Figure 2 (right ordinate) shows that its F1 scores vary from 0.92 to 0.97 with a median of 0.96. The results are significantly improved by classifying the minor classes separately from the major classes.

7 Discussion

7.1 Sampling Strategies

The data imbalance problem is ubiquitous in both multi-class and multi-label classification ( Charte et al., 2015 ). The imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number in the minority class ( García et al., 2012 ), is commonly used to characterize the level of imbalance. Compared with the imbalanced datasets in Table 1 of Charte et al. (2015) , our data has a significantly higher level of imbalance: the highest IR is about 49,000 (#Physics/#Art). One common way to mitigate this problem is data resampling, which rebalances the SC distribution by either deleting instances of major SCs (undersampling) or supplementing artificially generated instances of minor SCs (oversampling). We could undersample the major SCs, but that would reduce the sample sizes of all SCs to about 15 (Art; Section 5), which is far too small for training robust neural network models. Oversampling strategies such as SMOTE ( Chawla et al., 2002 ) work for problems involving continuous numerical quantities, e.g., SalahEldeen and Nelson (2015) . In our case, “abstract” vectors synthesized by SMOTE would not map to any actual words, because word representations are very sparsely distributed in the large WE space. Even if we oversampled the minor SCs using semantically dummy vectors, generating all the samples would take a large amount of time given the high dimensionality of the abstract vectors and the high IR. Therefore, we use only real data.
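The IR is simply the majority/minority count ratio. As an illustration, roughly 735,000 Physics abstracts against 15 Art abstracts reproduce the reported IR of about 49,000 (these counts are chosen to match the reported ratio, not taken from the paper's tables):

```python
def imbalance_ratio(class_counts):
    """IR: size of the largest class divided by the size of the smallest."""
    sizes = class_counts.values()
    return max(sizes) / min(sizes)

# Illustrative counts consistent with the reported IR of ~49,000.
print(imbalance_ratio({"Physics": 735_000, "Art": 15}))  # 49000.0
```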


TABLE 1 . Results of the top 10 SCs of classifying one million research papers in CiteSeerX, using our best model.

7.2 Category Overlapping

We discuss the potential impact of category overlap in the training data on classification results. Our initial classification schema contains 104 SCs, but these are not all mutually exclusive; the vocabularies of some categories overlap with others. For example, papers exclusively labeled “Materials Science” and papers exclusively labeled “Metallurgy” exhibit significant token overlap. In the WE vector space, the semantic vectors carrying either label overlap, making them hard to differentiate. Figure 5 shows the confusion matrices of closely related categories such as “Geology,” “Mineralogy,” and “GeoChemistry GeoPhysics.” Figure 6 is the t -SNE plot of abstracts of closely related SCs. To make the plot less crowded, we randomly select 250 abstracts from each SC shown in Figure 5 . Data points representing “Geology,” “Mineralogy,” and “GeoChemistry GeoPhysics” tend to spread or overlap in ways that are hard to distinguish visually.


FIGURE 5 . Normalized confusion matrices for closely related classes, in which a large fraction of “Geology” and “Mineralogy” papers are classified into “GeoChemistry GeoPhysics” (A) ; a large fraction of “Zoology” papers are classified into “Biology” or “Ecology” (B) ; and a large fraction of “TeleCommunications,” “Mechanics,” and “EnergyFuels” papers are classified into “Engineering” (C) .


FIGURE 6 . t -SNE plot of closely related SCs.

One way to mitigate this problem is to merge overlapping categories. However, special care should be taken as to whether the overlapped SCs are truly strongly related, and merges should be evaluated by domain experts. For example, “Zoology,” “PlantSciences,” and “Ecology” can be merged into a single SC called “Biology” (Gaff, 2019; private communication), and “Geology,” “Mineralogy,” and “GeoChemistry GeoPhysics” can be merged into a single SC called “Geology.” However, “Materials Science” and “Metallurgy” may not be merged into a single SC (Liu, 2019; private communication). With the aforementioned merges, the number of SCs is reduced to 74. As a preliminary study, we classified the merged dataset using our best model (retrained FastText + BiGRU + Attn) and achieved an improved overall micro-F1 score of 0.78; the classification performance for “Geology” improved from 0.83 to 0.88 after merging.
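At evaluation time, the expert-guided merges amount to a simple label-mapping step; a sketch (the mapping dictionary is ours, following the merges named in the text):

```python
# Merges suggested by domain experts, per the text; all other SCs map to themselves.
MERGE_MAP = {
    "Zoology": "Biology",
    "PlantSciences": "Biology",
    "Ecology": "Biology",
    "Mineralogy": "Geology",
    "GeoChemistry GeoPhysics": "Geology",
}

def merge_label(sc):
    """Map a fine-grained SC to its merged category (identity otherwise)."""
    return MERGE_MAP.get(sc, sc)
```

Applying `merge_label` to both gold and predicted labels before scoring reproduces the merged-schema evaluation without retraining on relabeled data, although the preliminary result above was obtained by reclassifying the merged dataset.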

7.3 Limitations

Compared with existing work, our models are trained on a relatively comprehensive, large-scale, and clean dataset from WoS. However, the basic classification of WoS is at the journal level, not the article level. We are also aware that the classification schema of WoS may change over time; for example, in 2018, WoS introduced three new SCs, including Quantum Science and Technology, reflecting emerging research trends and technologies ( Boletta, 2019 ). To mitigate this effect, we excluded papers with multiple SCs and assumed that the SCs of the papers studied are stationary and that journal-level classifications represent paper-level SCs.

Another limitation is the document representation. The BoW model ignores sequential information. Although we experimented with keeping word tokens in the same order as they appear in the original documents, the exclusion of stop words breaks the original sequence, which is the input to the recurrent encoder. We will address this limitation in future research by encoding whole sentences, e.g., Yang et al. (2016) .

8 Application to CiteSeerX

CiteSeerX is a digital library search engine that was the first to use automatic citation indexing ( Giles et al., 1998 ). It is an open source search engine that provides metadata and full-text access for more than 10 million scholarly documents and continues to add new documents ( Wu et al., 2019 ). In the past decade, it has incorporated scholarly documents in diverse SCs, but the distribution of their subject categories is unknown. Using the best neural network model in this work (FastText + BiGRU + Attn), we classified one million papers randomly selected from CiteSeerX into 104 SCs ( Table 1 ). The fraction of Computer Science papers (19.2%) is significantly higher than the 7.58% reported in Wu et al. (2018) . However, the F1 for Computer Science in Wu et al. (2018) was about 0.94, higher than in this work (about 0.80), so the fraction may be overestimated here. Note also that Wu et al. (2018) had only six classes, whereas this model classifies abstracts into 104 SCs; although the finer granularity compromises accuracy (by around 7% on average), our work can still be used as a starting point for a systematic SC classification. The classifier processed the one million abstracts in 1,253 s, implying that it will scale to multi-millions of papers.
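The reported runtime implies a throughput of roughly 800 abstracts per second:

```python
# Back-of-the-envelope throughput from the reported run:
# one million abstracts classified in 1,253 seconds.
abstracts = 1_000_000
seconds = 1_253
rate = abstracts / seconds  # abstracts per second
print(round(rate))  # 798
```

At that rate, ten million documents would take under four hours, consistent with the scalability claim.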

9 Conclusion

We investigated the problem of systematically classifying a large collection of scholarly papers into 104 SCs using neural network methods based only on abstracts. Our methods appear to scale better than existing clustering-based methods that rely on citation networks. Among the neural network methods, retrained FastText or GloVe combined with BiGRU or BiLSTM and the attention mechanism gives the best results; retraining the WE models and using an attention mechanism play important roles in improving classifier performance. A two-level classifier effectively improves performance when the training data has extremely imbalanced categories. The median F1 scores under the best settings are 0.75–0.76.

One bottleneck of our classifier is overlapping categories. Merging closely related SCs is a promising solution, but it should be done under the guidance of domain experts. The TF-IDF representation only considers unigrams; future work could consider n -grams ( n ≥ 2 ) or concepts, and transfer learning to adapt word/sentence embedding models trained on non-scholarly corpora ( Arora et al., 2017 ; Conneau et al., 2017 ). One could investigate models that also take stop words into account, e.g., Yang et al. (2016) , and explore alternative neural network optimizers besides Adam, such as Stochastic Gradient Descent (SGD). Our work falls under multiclass classification, assigning research papers to flat SCs. In the future, we will investigate hierarchical multilabel classification, which assigns multiple SCs at multiple levels to each paper.

Data Availability Statement

The Web of Science (WoS) dataset used for this study is proprietary and can be purchased from Clarivate 6 . The implementation software is open accessible from GitHub 7 . The testing datasets and CiteSeerX classification results are available on figshare 8 .

Author Contributions

BK designed the study and implemented the models. He is responsible for analyzing the results and writing the paper. SR is responsible for model selection, experiment design, and reviewing methods and results. JW was responsible for data management, selection, and curation. He reviewed related works and contributed to the introduction. CG is responsible for project management, editing, and supervision.

Funding

This research is partially funded by the National Science Foundation (Grant No. 1823288).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We gratefully acknowledge partial support from the National Science Foundation. We also acknowledge Adam T. McMillen for technical support, and Holly Gaff, Old Dominion University and Shimin Liu, Pennsylvania State University as domain experts respectively in biology and the earth and mineral sciences.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frma.2020.600382/full#supplementary-material .

1 https://www.acm.org/about-acm/class

2 https://www.ncbi.nlm.nih.gov/mesh

3 http://msc2010.org/mediawiki/index.php?title=Main_Page

4 https://images.webofknowledge.com/images/help/WOS/hp_subject_category_terms_tasca.html

5 https://www.loc.gov/aba/cataloging/classification/

6 https://clarivate.libguides.com/rawdata

7 https://github.com/SeerLabs/sbdsubjectclassifier

8 https://doi.org/10.6084/m9.figshare.12887966.v2

Arora, S., Liang, Y., and Ma, T. (2017). “A simple but tough-to-beat baseline for sentence embeddings,” in ICLR, Toulon, France, April 24-26, 2017 .


Barnes, B., Bloor, D., and Henry, J. (1996). Scientific knowledge: a sociological analysis . Chicago IL: University of Chicago Press .

Beltagy, I., Cohan, A., and Lo, K. (2019). Scibert: pretrained contextualized embeddings for scientific text. arXiv:1903.10676.


Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146. doi:10.1162/tacl_a_00051

Boletta, M. (2019). New web of science categories reflect ever-evolving research. Available at: https://clarivate.com/webofsciencegroup/article/new-web-of-science-categories-reflect-ever-evolving-research/ (Accessed January 24, 2019).

Boyack, K. W., and Klavans, R. (2018). Accurately identifying topics using text: mapping pubmed . Leiden, Netherlands: Centre for Science and Technology Studies (CWTS) , 107–115.

Caragea, C., Wu, J., Gollapalli, S. D., and Giles, C. L. (2016). “Document type classification in online digital libraries,” in Proceedings of the 13th AAAI conference, Phoenix, AZ, USA, February 12-17, 2016 .

Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. S., et al. (2018). “Universal sentence encoder for English,” in Proceedings of EMNLP conference, Brussels, Belgium, October 31-November 4, 2018 .

Charte, F., Rivera, A. J., del Jesús, M. J., and Herrera, F. (2015). Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163, 3–16. doi:10.1016/j.neucom.2014.08.091

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Jair 16, 321–357. doi:10.1613/jair.953

Cho, K., Van Merrienboer, B., Gülçehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, Doha Qatar, October 25-29, 2014 , 1724–1734.

Collins, H. M., and Yearley, S. (1992). “Epistemological chicken,” in Science as practice and culture . Editor A. Pickering ( Chicago, IL: University of Chicago Press ), 301.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the EMNLP conference, Copenhagen, Denmark, September 9-11, 2017 .

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, Minneapolis, MN, USA, June 2-7, 2019 .

Fellbaum, C. (2005). “Wordnet and wordnets,” in Encyclopedia of language and linguistics . Editor A. Barber ( Elsevier ), 2–665.

Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., et al. (2018). Science of science. Science 359, eaao0185. doi:10.1126/science.aao0185

García, V., Sánchez, J. S., and Mollineda, R. A. (2012). On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Base Syst. 25, 13–21. doi:10.1016/j.knosys.2011.06.013

Gerlach, M., Peixoto, T. P., and Altmann, E. G. (2018). A network approach to topic models. Sci. Adv. 4, eaaq1360. doi:10.1126/sciadv.aaq1360


Giles, C. L., Bollacker, K. D., and Lawrence, S. (1998). “CiteSeer: An automatic citation indexing system,” in Proceedings of the 3rd ACM international conference on digital libraries , June 23–26, 1998 , Pittsburgh, PA, United States , 89–98.

Glänzel, W., and Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: the astronomy dataset. Scientometrics 111, 1071–1087. doi:10.1007/s11192-017-2301-6

Goldberg, Y., and Levy, O. (2014). word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Grave, E., Mikolov, T., Joulin, A., and Bojanowski, P. (2017). “Bag of tricks for efficient text classification,” in Proceedings of the 15th EACL, Valencia, Span, April 3-7, 2017 .

Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2017). LSTM: a search space odyssey. IEEE Trans. Neural Networks Learn. Syst. 28, 2222–2232. doi:10.1109/TNNLS.2016.2582924

He, G., Fang, J., Cui, H., Wu, C., and Lu, W. (2018). “Keyphrase extraction based on prior knowledge,” in Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, JCDL, Fort Worth, TX, USA, June 3-7, 2018 .

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Iyyer, M., Manjunatha, V., Boyd-Graber, J., and Daumé, H. (2015). “Deep unordered composition rivals syntactic methods for text classification,” in Proceedings ACL, Beijing, China, July 26-31, 2015 .

Khabsa, M., and Giles, C. L. (2014). The number of scholarly documents on the public web. PloS One 9, e93949. doi:10.1371/journal.pone.0093949

Larsen, P., and von Ins, M. (2010). The rate of growth in scientific publication and the decline in coverage provided by science citation index. Scientometrics 84, 575–603. doi:10.1007/s11192-010-0202-z

LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., et al. (1989). “Handwritten digit recognition with a back-propagation network,” in Advances in neural information processing systems [NIPS conference], Denver, Colorado, USA, November 27-30, 1989.

Li, Y., Lipsky Gorman, S., and Elhadad, N. (2010). “Section classification in clinical notes using supervised hidden markov model,” in Proceedings of the 1st ACM international health informatics symposium, Arlington, VA, USA, November 11-12, 2010 (New York, NY : Association for Computing Machinery ), 744–750.

Matsuda, K., and Fukushima, T. (1999). Task-oriented world wide web retrieval by document type classification. In Proceedings of CIKM, Kansas City, Missouri, USA, November 2-6, 1999 .

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv:1301.3781.

Mikolov, T., Yih, W., and Zweig, G. (2013b). “Linguistic regularities in continuous space word representations,” in Proceedings of NAACL-HLT, Atlanta, GA, USA, June 9-14, 2013 .

Moscato, P., and Cotta, C. (2003). A gentle introduction to memetic algorithms . Boston, MA: Springer US , 105–144.

Nadeau, D., and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticæ Investigationes 30, 3–26. doi:10.1075/li.30.1.03nad

Nam, S., Jeong, S., Kim, S.-K., Kim, H.-G., Ngo, V., and Zong, N. (2016). Structuralizing biomedical abstracts with discriminative linguistic features. Comput. Biol. Med. 79, 276–285. doi:10.1016/j.compbiomed.2016.10.026

Passos, A., Kumar, V., and McCallum, A. (2014). “Lexicon infused phrase embeddings for named entity resolution,” in Proceedings of CoNLL, Baltimore, MD, USA, June, 26-27, 2014 .

Pennington, J., Socher, R., and Manning, C. D. (2014). “Glove: global vectors for word representation,” in Proceedings of the EMNLP conference, Doha, Qatar, October 25-29, 2014 .

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., et al. (2018). “Deep contextualized word representations,” in NAACL-HLT, New Orleans, LA,, USA, June 1-6, 2018 .

Prasad, A., Kaur, M., and Kan, M.-Y. (2018). Neural ParsCit: a deep learning-based reference string parser. Int. J. Digit. Libr. 19, 323–337. doi:10.1007/2Fs00799-018-0242-1

Ratnaparkhi, A. (1996). “A maximum entropy model for part-of-speech tagging,” in The proceedings of the EMNLP conference, Philadelphia, PA, USA, May 17-18, 1996 .

Ren, Y., Zhang, Y., Zhang, M., and Ji, D. (2016). “Improving twitter sentiment classification using topic-enriched multi-prototype word embeddings,” in AAAI, Phoenix, AZ, USA, February 12-17, 2016 .

SalahEldeen, H. M., and Nelson, M. L. (2015). “Predicting temporal intention in resource sharing,” in Proceedings of the 15th JCDL conference, Knoxville, TN, USA, June 21-25, 2015 .

Shen, Z., Ma, H., and Wang, K. (2018). “A web-scale system for scientific knowledge exploration,” in Proceedings of ACL 2018, system demonstrations (Melbourne, Australia: Association for Computational Linguistics ), 87–92.

van Eck, N. J., and Waltman, L. (2017). Citation-based clustering of publications using citnetexplorer and vosviewer. Scientometrics 111, 1053–1070. doi:10.1007/s11192-017-2300-7

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in NIPS, Long Beach, CA, USA, December 4-9, 2017 .

Vo, D.-T., and Zhang, Y. (2015). “Target-dependent twitter sentiment classification with rich automatic features,” In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Editors Q. Yang and M. J. Wooldridge (AAAI Press), 1347–1353.

Waltman, L., and van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST 63, 2378–2392. doi:10.1002/asi.22748

Wang, S., and Koopman, R. (2017). Clustering articles based on semantic similarity. Scientometrics 111, 1017–1031. doi:10.1007/s11192-017-2298-x

Witt, N., and Seifert, C. (2017). “Understanding the influence of hyperparameters on text embeddings for text classification tasks,” in TPDL conference, Thessaloniki, Greece, September 18-21, 2017 .

Wu, J., Kandimalla, B., Rohatgi, S., Sefid, A., Mao, J., and Giles, C. L. (2018). “Citeseerx-2018: a cleansed multidisciplinary scholarly big dataset,” in IEEE big data, Seattle, WA, USA, December 10-13, 2018 .

Wu, J., Kim, K., and Giles, C. L. (2019). “CiteSeerX: 20 years of service to scholarly big data,” in Proceedings of the AIDR conference, Pittsburgh, PA, USA, May 13-15, 2019 .

Wu, J., Williams, K., Chen, H., Khabsa, M., Caragea, C., Ororbia, A., et al. (2014). “CiteSeerX: AI in a digital library search engine,” in Proceedings of the twenty-eighth AAAI conference on artificial intelligence .

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., and Hovy, E. H. (2016). “Hierarchical attention networks for document classification,” in The NAACL-HLT conference .

Zhang, X., Zhao, J., and LeCun, Y. (2015). “Character-level convolutional networks for text classification,” in Proceedings of the NIPS conference, Montreal, Canada, December 7-12, 2015 .

Zitt, M., Lelu, A., Cadot, M., and Cabanac, G. (2019). “Bibliometric delineation of scientific fields,” in Handbook of science and technology indicators . Editors W. Glänzel, H. F. Moed, U. Schmoch, and M. Thelwall ( Springer International Publishing ), 25–68.

Keywords: text classification, text mining, scientific papers, digital library, neural networks, citeseerx, subject category classification

Citation: Kandimalla B, Rohatgi S, Wu J and Giles CL (2021) Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks. Front. Res. Metr. Anal. 5:600382. doi: 10.3389/frma.2020.600382

Received: 29 August 2020; Accepted: 24 December 2020; Published: 10 February 2021.


Copyright © 2021 Kandimalla, Rohatgi, Wu and Giles. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Bharath Kandimalla, [email protected] ; Jian Wu, [email protected]

This article is part of the Research Topic

Mining Scientific Papers, Volume II: Knowledge Discovery and Data Exploitation

Categorization Methodology: an Approach to the Collection and Analysis of Certain Classes of Qualitative Information

  • PMID: 26815730
  • DOI: 10.1207/s15327906mbr2102_1

A new methodology has been developed for the study of certain classes of qualitative information. The methodology is composed of two major techniques: the F-sort task for data collection and latent partition analysis for data summarization. In this paper, similarities and differences of the methodology in relation to existing techniques are discussed. This includes a review of the historical antecedents and development of the methodology as well as a listing of recent applications of the methodology, which have been in several fields of psychology and education. Following that review, a detailed presentation is given of how the methodology was applied to studying teachers' views of facilitating student learning in the classroom. In the data analysis, some previously unpublished enhancements of latent partition analysis are introduced -- for basic categorization statistics, recovery of subgroup information, and linkage to multidimensional scaling. This complete and substantively intriguing example is intended to illustrate the power and utility of the methodology in exploring important research topics. Then, examinations are made of the main components of the F-sort technique and of latent partition analysis. The intent is to lay out detailed frameworks for designing and interpreting research that applies the categorization methodology. The methodology has not previously been given an integrated description. Finally, discussion concerns the uses of the methodology and the importance of standardized application of the procedures.

Research Paper Classification using Supervised Machine Learning Techniques



The origins of social categorization

Zoe Liberman

1 University of Chicago, Department of Psychology

2 University of California Santa Barbara, Department of Psychological & Brain Sciences

Amanda L. Woodward

Katherine D. Kinzler

3 Cornell University, Department of Psychology & Department of Human Development

Forming conceptually rich social categories helps people navigate the complex social world by allowing them to reason about others’ likely thoughts, beliefs, actions, and interactions as guided by group membership. Yet, social categorization often has nefarious consequences. We suggest that the foundation of the human ability to form useful social categories is in place in infancy: social categories guide infants’ inferences about peoples’ shared characteristics and social relationships. We also suggest that the ability to form abstract social categories may be separable from the eventual negative downstream consequences of social categorization, including prejudice, discrimination and stereotyping. Whereas a tendency to form inductively rich social categories appears early in ontogeny, prejudice based on each particular category dimension may not be inevitable.

Social categorization profoundly influences human social life

Despite the salience of individuals in social thinking, a large body of work suggests that the tendency to conceive of people as belonging to social categories is automatic [ 1 – 3 ]. Indeed, the ability to group instances into categories and to use category-based knowledge to generate novel inductive inferences is a powerful aspect of human cognition [ 1 , 4 ]. In particular, the capacity to view category members as sharing important, unchanging, and possibly unobservable similarities allows people to efficiently, and perhaps even spontaneously, learn a property of a category and apply it to novel category members [ 5 – 9 ]. When applied to the social domain, forming conceptually rich categories has obvious functional value – social categories organize people’s vast knowledge about human attributes and about the complex relationship networks that comprise human social life [ 10 ].

Despite the upsides of category formation, much of the research on social categorization focuses on its potential downstream negative consequences. Social categorization differs from other forms of categorization in that people tend to place themselves in a category [ 11 ], leading them to be partial to members of their own group ( ingroup ) relative to those from other groups ( outgroup ) in terms of social preferences, empathic responding, and resource distribution [ 12 – 15 ]. Beyond sheer partiality and greater liking of members of one’s own group, some of the most invidious effects of social categories result from the biased belief systems that social categorization supports, including stereotyping of, essentialist beliefs about, and even dehumanization of, members of certain social groups [ 12 – 13 , 16 – 17 ].

Although prejudice was once assumed to be an inevitable consequence of social categorization [ 12 ], social psychologists have long noted the distinction between explicit prejudice (negative affect towards an outgroup) and endorsement of stereotypes (cognitive representations of culturally held beliefs about a group) [ 1 ]. Although less research has focused on the affective-cognitive distinction in implicit cognition, implicit evaluations of social groups (implicit prejudice) may also be distinct from implicit stereotyping, and these constructs have separable influences on human social behavior [ 18 ]. Nonetheless, there are many important open questions about each of these outcomes of social categorization. For example, what is the time course of affective and categorical thinking about groups [ 19 ], and how does one influence the other? How do social categories work similarly to, and differently from, non-social categories [ 11 ], and how do stereotypes about groups develop in the first place [ 20 ]?

One way to continue to answer important questions about the nature of social categorization is to look to developmental psychology. For example, research with infants and children can ask whether social preferences and inductive inferences based on social group membership always emerge together, or whether they arise separately [ 21 ]. Findings from such studies can demonstrate whether prejudice and stereotyping are inevitable consequences of dividing the world into conceptually meaningful social categories, or whether humans are able to use group divisions in meaningful ways without these negative outcomes. Here, we examine new experiments with children and infants to understand the nature and origins of the human capacity to form social categories. Considering social categories from a developmental perspective does more than merely shed light on when social categorization and its downstream consequences arise; it can also reveal the cognitive processes that shape the human ability to form social categories and provide insight into how negative consequences of social categorization begin and might be mitigated.

Social categorization in childhood

Social preferences for members of one’s own social group, and rich conceptual inferences based on social group membership, are each in place by the time children enter formal schooling. For example, children have both explicit and implicit preferences based on people’s gender, race, and linguistic group [ 22 – 25 ]. Children also look to ingroup, rather than outgroup, members when learning new information [ 26 – 30 ], show partiality towards the ingroup when allocating rewards and punishments [ 31 – 32 ], and acquire negative stereotypes associated with social group membership [ 20 ].

Recent research indicates that children use social categories to make productive social inferences. For example, children expect members of a social group to share deep properties, including preferences, traits, and norms [ 33 – 37 ], and they expect characteristics that mark social category membership to endure over time [ 38 – 39 ]. Indeed, preschoolers expect group members to follow social conventions [ 40 ], and negatively evaluate people who do not follow their social group’s conventions and norms [ 41 – 42 ], suggesting they view conforming to the group as a fundamentally important feature of group membership. In addition to sharing common attributes, members of a social category are typified by rich relational structure, such that social categories support inferences about patterns of interpersonal interaction. For example, by early childhood, people expect members of a social category to be loyal to one another, to engage in prosocial relationships, and to share resources with each other [ 43 – 46 ]. In fact, although children think that people in a social group must refrain from harming one another, this expectation does not always hold between members of different social groups [ 45 ].

Although children hold intuitive theories that social categories are natural kinds and that social categories should mark social obligations [ 47 ], children apply these two intuitive theories differently to different social groups. Children treat gender as a natural kind [ 48 ], but they do not view novel groups [ 49 – 50 ] or race [ 51 ] as marking fundamental similarities between category members. On the other hand, children of the same age do use novel groups and race for predicting patterns of social interaction [ 46 , 51 ]. Children may initially view social categories as marking social obligations, and later come to see these categories as natural kinds [ 47 ]. Or, children may prioritize the significance of some social categories as compared to others, and reason about prioritized categories as natural kinds and as marking patterns of social interactions at earlier ages ( Box 1 ). In either case, existing data suggest that the formation of a social group in and of itself does not inherently lead to stereotyping or prejudice: children can know about a social division, such as race or novel group, without using group boundaries to make inductive inferences [ 46 , 50 ], and without expressing group bias [ 52 ].

Infants may prioritize informative social categorization signals

Rather than varying across communities based on learning which dimensions carry the most functional relevance ( Box 3 ), infants’ earliest social categories may prioritize features that have fundamentally signaled social group membership across human evolutionary history. A prioritization account could help explain potentially counterintuitive research on race. Because people in hunter-gatherer bands likely never traveled far enough to encounter someone of a different “race,” race might not be a prioritized signal of social group [ 101 ]. Indeed, although infants perceive race ( Box 2 ) and children prefer own-race social partners [ 22 ], children do not use race as a conceptually rich category. Children do not automatically encode race [ 3 ], do not make race-based inductive inferences [ 46 ], and do not always expect race to be stable [ 38 , 88 ]. Rather, seeing race as relevant for social categorization depends on social experience: minority race children, who likely think and talk more about race, see race as a defining feature of social identity earlier in development than majority race children [ 38 , 88 ]. Additionally, growing up in racially diverse areas decreases children’s racial essentialism [ 102 ], and racial essentialism leads children to treat ambiguous race faces as outgroup members [ 103 ], suggesting exposure to diversity could decrease prejudice. On the other hand, gender may be a dedicated dimension of social categorization: children automatically encode gender [ 3 ], and make gender-based inductive inferences [ 38 ]. In fact, transgender children express clear gender identities and use gender to carve up the social world [ 104 ], suggesting attention to gender is present across a variety of experiences and backgrounds.

Do visual preferences in infancy reflect social categorization?

Infants show clear visual preferences for people from certain social groups [see 21 , 54 for review]. For example, infants prefer to look at female faces [ 55 , 109 ], and at own-race faces [ 110 ]. These effects are due to familiarity and vary based on contact [ 55 , 110 ]. For example, infants who regularly see faces of diverse races do not show own-race preferences [ 111 ], and the own-race visual preference emerges earlier for female faces than male faces [ 112 ], suggesting the preference is based on liking to look at the type of face that infants encounter most often in their environment (own-race females).

Further, infants are better at recognizing individual novel own-race faces compared to other-race faces, and show the ability to form perceptual categories based on race [ 113 – 114 ]. As with visual preferences, these benefits are likely based on expertise for processing familiar faces: exposing infants to other-race faces in the laboratory can eliminate own-race facial recognition advantages [ 115 – 116 ]. Thus, infants’ visual responses to social categories (in terms of preferences and perceptual categorization) reflect adaptive learning about regularities in their social environment.

Are perceptual categories linked to conceptually-rich social knowledge or social expectations about category members? For example, does better categorization of own-race faces indicate expectations that members of the own-race group will share deep properties, or socially interact? As of yet, the closest evidence comes from a recent paper in which infants associated own-race faces with happier music [ 117 ]. Although this finding could be relevant to early social bias, particularly since infants see music as social [ 118 – 119 ], it could also be due to familiarity without any social grouping or social expectations: infants have more exposure to own-race faces and positive music than to other-race faces and negative music. Thus, more research is needed to ask whether perceptual categorization reflects conceptually rich social expectations.

Even if perceptual categorization alone can’t be taken as evidence for conceptually rich social categorization, it may scaffold infants’ complex social reasoning. For example, a tendency to pay more attention to racial ingroup members could bias children towards learning only from own-race teachers [ 28 ], and seeing ingroup members as more relevant sources of information. And, paying less attention to outgroup members could lead to outgroup homogeneity [ 120 – 121 ], whereby people might view outgroup members as more similar to one another. Future research is needed to explore how early differences in visual attention may relate to later emerging conceptualizations of the social world.

Using functional relevance to form social groups

Because any dimension could theoretically be (or become) meaningful for social categorization in a particular community, infants and children may be ready to detect groups based on any feature, if given the appropriate input [ 89 ]. Indeed, decades of research have indicated that humans show preferential treatment towards people with whom they share only a “minimal” similarity. For example, people prefer ingroup members even when the group is assigned and is based on an arbitrary (and untrue) personality feature, such as being “overestimators” [ 14 ]. Preferring “minimal” ingroup members begins early in childhood. In both classic experiments, like the Robbers Cave study [ 122 ], and more modern research [ 44 , 123 – 125 ], children prefer other people who are in their randomly assigned group.

Preferences for minimal ingroup members likely don’t arise due to people thinking that the groups are random and meaningless, but rather could emerge because participants believe they share important features with others in their group (e.g., thinking that “overestimators” are more similar to one another than they are to “underestimators”), or believe that the groups must be functional, since the groups were labeled and used by a figure in power (e.g., the experimenter). Indeed, drawing attention to a category’s relevance, through labeling, generic statements, and functional use, increases children’s likelihood of forming preferences for their minimal ingroup [ 65 ], increases their propensity to use minimal category membership to make inferences about characters’ behaviors [ 126 ], and heightens their expectations that members of the same minimal group will share essential similarities [ 66 – 67 ].

Together, these studies elegantly demonstrate that “minimal” characteristics, which are not typically seen as critical in our society, can become relevant when attention is drawn to them in a laboratory context. However, although children form social preferences for minimal ingroup members, they show stronger group-level inferences and higher levels of own-group biases when reasoning about less arbitrary categories, like gender [ 31 ]. Future work is needed to determine whether these differences are due to the fact that children likely have more experience seeing these less arbitrary categories, like gender, used functionally in their communities, or due to certain categories being more readily used for social categorization regardless of a child’s particular experiences ( Box 1 ). One type of research that may help resolve this debate would be work that investigates the social cognition of children who attend gender-neutral preschools, and who likely hear less gendered generic language and see less gendered division of labor [ 127 ].

If humans’ system for reasoning about social categories is structured to attend to evolutionarily relevant groups, which features would infants prioritize? Spoken language and food preferences vary across groups, and are constrained by sensitive periods for learning ([ 105 – 108 ] Box 4 ), making them potential candidates for prioritized social categorization. Indeed, infants see shared language and shared food preferences as providing information about social obligation and inductive generalization [ 76 – 77 , 80 ]. Critically, language and food preferences may have a special significance: infants do not use highly similar cues, such as object preferences, to make the same kinds of social inferences [ 80 ]. This account would make the further predictions (not yet tested) that infants’ social inferences would be guided by other fundamental markers of social group membership, such as kinship, or knowledge of group rituals, but not by arbitrary dimensions of similarity that did not mark social group across human evolutionary history.

Language is a potent cue to social structure

Research on intergroup cognition often focuses on race, gender, and age. Indeed, adults and children are sensitive to these social categories, and use membership in them to guide their preferences and learning [ 2 ]. Yet, despite a wealth of evidence from the neighboring social science disciplines of linguistics and anthropology that language and accent serve as particularly reliable signals of social group membership [ 105 – 106 ], and that attention to language can surpass attention to visual cues in social categorization tasks [ 128 – 130 ], language is often overlooked in social psychology research on intergroup cognition [ 131 ].

The sociality of language emerges early: infants prefer native language speakers [ 24 ], and infants and children look to linguistic ingroup members to learn new information [ 26 , 29 , 57 , 60 , 62 ]. In fact, infants use language for more than personal decisions about whom to like or whom to learn from; they create conceptually rich social categories based on language, whereby they use language to make predictions about people’s likely traits and social interactions. For example, they expect same-language speakers to be more likely to affiliate [ 77 ], and expect same-language speakers to share important social similarities, even when the language shared by the speakers is unfamiliar to them [ 80 ].

Language may inherently mark social group, and using language to divide the world into groups continues across development. Indeed, by preschool, children expect native speakers, but not foreign speakers, to follow social conventions [ 40 ], and children acquire linguistic stereotypes [ 132 – 133 ]. Does early language-based categorization feed into xenophobia? If so, which experiences mitigate these biases? In our research, infants raised in bilingual environments generalized information even across different-language speakers [ 80 ], suggesting multilingual exposure may cause people to be less likely to see language as marking concrete social boundaries. More work is needed to determine whether experiencing a diverse linguistic community could reduce stereotyping and prejudice, and, if so, whether the reduction in bias would be specifically in the domain of language (e.g., by improving attitudes towards foreign speakers), or would be broader. These questions about the role of experience in bias formation and reduction should also be asked about other social categories, including those typically studied (gender, race, and age), and those that are less studied, but potentially highly evolutionarily relevant ( Box 1 ).

The relationship between social preferences and social categorization

The growing body of research on children’s social categories brings into focus the distinction between conceptually rich belief systems about members of certain social categories (perhaps relevant to later stereotyping) and social preferences for people who are members of those categories (perhaps relevant to later prejudice). Though both develop by early school years, and they are often coincident, these two processes are not identical: they may emerge and interact in different ways over the course of development. In fact, there are many theoretical reasons to expect that social preferences could diverge from rich knowledge of social groups.

First, preferences can exist in the absence of knowledge about groups. For example, preferential looking time methodologies, which measure infants’ spontaneous looking at a pair of faces, find that infants spend more time looking at attractive compared to unattractive faces [ 53 ], native speakers compared to foreign language speakers [ 24 ], and own-race compared to other-race faces [ 54 ]. Although preferential looking time studies have been taken as evidence for an early-developing own-race bias, preferential looking does not necessarily indicate categorization, or a preference for the “ingroup.” 1 Indeed, few people would take longer looking at symmetrical faces as evidence that infants form a conceptual category of “attractive people” and expect attractive people to share common essentialized properties. Instead, infants may prefer individuals with symmetrical faces for a variety of reasons, including that symmetry may indicate health [ 55 ]. Indeed, infants could look longer at symmetrical faces without grouping these faces into a category at all. Similarly, even social preferences that seem more plausibly relevant to early ingroup bias, such as infants’ tendency to preferentially interact with native language speakers [ 24 ], could emerge based on liking to approach relatively more familiar social partners, without having any abstract categorization of “native speaker” or “foreign speaker,” or even of “like me” and “not like me.” Some looking time data, such as when an infant habituated to one type of face (e.g., by gender or race) subsequently looks longer at a face of someone from a different group, may be better evidence for categorization, though this categorization could nonetheless be perceptual rather than conceptual (see 21 , 54 and Box 2 for review of such literature).

Second, this differentiation could function in the reverse direction: humans may expect social group membership to influence novel individuals’ traits and patterns of social interaction, outside of forming preferences or dispreferences for groups. For example, someone could use the group identity “Italian” to infer properties of a person who belongs to that social group, such as what language she might speak, what foods she might prefer to eat, what religion she might practice, and which other people she might interact with. Someone could also make these inferences about a person holding a different group identity, such as “Japanese.” Although humans may have a tendency to automatically prefer their own group to all other groups [ 14 ], they could nonetheless make productive inferences about people from each of these two “outgroups,” without necessarily preferring one outgroup to the other.

At some point in development, social preferences and social categorization appear to operate in close coordination, and it is from this coordination that negative stereotypes, and other negative consequences of social categorization may be forged. One possibility is that one of these processes gives rise to the other. For example, early preferences, perhaps based on familiarity ( Box 2 ), may set the stage for the later growth of conceptual social categories. Alternatively, children may quickly detect the category structure in the social world, and prejudice and stereotypes may result when category-based knowledge combines with children’s self-categorization and cues from society about the importance of social categories ( Box 3 ). In contrast to each of these views, we propose that social preferences and rich category-based beliefs emerge in parallel early in development, and may not inevitably interact to form prejudice.

The origins of social preferences and social categorization in infancy

The bulk of research on infants’ early social reasoning focuses on early emerging visual and social preferences. As discussed earlier, infants spontaneously look longer at attractive faces, female faces, own-race faces, and faces of native language speakers [ 24 , 53 – 54 , 56 ], and babies show a familiar-race bias in face perception (see Box 2 ). Additionally, infants are more likely to approach, interact with, and imitate people who share their preferences or who speak their native language [ 24 , 57 – 63 ]. Although these visual and social preferences could be signs of early-emerging bias and prejudice, these preferences may instead operate completely differently from adult prejudice. For adults, ingroup favoritism is based on liking someone and rewarding them specifically because of their membership in the ingroup. That is, adults’ partiality towards the ingroup is depersonalized: it applies to all group members and does not depend on the perceiver being known by, or related to, the target [ 64 ]. Infants’ social preferences, on the other hand, could arise solely based on an affinity for more familiar individuals. In fact, infants could prefer particular individuals over others, without grouping preferred individuals into categories at all. Thus, although social preferences may have functional value, by guiding infants towards relevant social partners and information [ 26 ], they do not per se indicate that infants are reasoning about people as members of conceptually rich social categories. Other evidence would be needed to demonstrate that infants could form inductively useful social categories.

Indeed, it is theoretically plausible that the ability to reason in sophisticated ways about people as members of social categories arises slowly and depends on children acquiring information about members of social groups through observation of other people’s behaviors, and through older individuals’ explicit teaching and testimony. That is, children’s social categories could be grounded in the stereotypes and beliefs of adult members of their social and cultural community. Indeed, input from adults clearly influences children’s reasoning about social categories: children are more likely to see minimal social categories as informative when adults consistently label and use the categories functionally [ 65 ]. Hearing generic language about social categories leads children to be more likely to form a novel category [ 66 ], and to reason in essentialist ways about members of the novel group [ 67 ]. Such input could lead infants to move from forming preferences for familiar individuals, to forming adult-like preferences for people based on their identity as members of different social categories.

Alternatively, the ability to form relationally embedded social categories with inductive potential could plausibly be in place very early in life. A recent surge of research, looking beyond infants’ visual and social preferences, provides evidence that infants have the cognitive capacities that may underlie conceptually rich social categorization. Infants can think about individual items as members of conceptual categories [ 68 ], form inductive inferences [ 69 – 70 ], and track complex social relationships [ 71 – 79 ]. Below we review evidence suggesting conceptually rich social categorization emerges early in life.

Looking time studies provide evidence for early conceptually rich social reasoning

Research using violation of expectation looking time methodologies, which assess infants’ responses to others’ actions and interactions, can provide a clearer view than measures of preference into infants’ ability to form conceptually rich social categories. In particular, because looking time studies ask about third-party expectations, these measures can elucidate whether infants reason in abstract ways about people as members of social groups, outside of any familiarity preferences. Violation of expectation studies on infants’ understanding of social groups evaluate whether infants use cues to group membership to form expectations about other people’s attributes and interactions. For example, in one set of studies, infants inferred that characters who moved in synchrony would subsequently perform the same action [ 78 ], suggesting they made the inductive inference that belonging to a group would influence each group member’s likely behavior.

One particularly illustrative test case of using violation of expectation studies to investigate infants’ ability to form abstract social categories comes from research on language as a marker of social group ( Box 4 ). In these studies, we showed infants two native bilingual actors who were presented as members of the same group (either two English speakers or two Spanish speakers) or as members of different groups (one English speaker and one Spanish speaker). We then asked how infants used the information about the actors’ language to inform their expectations about the actors’ interactions and attributes. In one study, we asked whether infants expected the actors to affiliate with one another or to socially disengage from one another. Infants’ responses varied based on the actors’ group membership: infants who saw the actors both speak English looked longer when they disengaged, suggesting they expected affiliation, whereas infants who saw the actors speak different languages looked longer at affiliation, suggesting they found affiliation unexpected [ 76 ]. Thus, like adults and older children [ 45 ], infants expect people who speak a common language to engage, but they do not hold these same social expectations for people who speak different languages. These data suggest that infants view social relationships as embedded in broader shared social categories.

In another series of studies, we asked about 11-month-old infants’ inductive generalizations. After being presented with same-language or different-language speakers, infants were shown one speaker’s food preference. Subsequently, infants were shown the second speaker disagreeing with the first speaker (by disliking the previously liked food), or expressing a negative opinion of a previously uneaten food. Infants selectively generalized information across same-language speakers: they looked longer at the disagreement when the actors spoke the same language, but not when the actors spoke different languages, suggesting they found it unexpected for people from the same social group, but not all people, to disagree ([ 80 ]; Figure 1 ). Infants show a similar pattern of inductive generalization of labels: they expect people who speak the same language to use the same novel labels to refer to the same object [ 81 ], but do not expect people who speak different languages to use the same object labels [ 82 ].

Figure 1. This figure details methods and results from [ 80 ]. Monolingual infants generalized food preferences across same-language speakers, finding it unexpected when the speakers disagreed, but did not generalize food preferences across different-language speakers. Infants from bilingual backgrounds, on the other hand, generalized even across speakers of different languages.

Interestingly, infants’ inductive inferences based on shared language are not limited to speakers of a familiar language: infants are equally likely to generalize information across same-language speakers when the shared language is the infants’ native language (English) and when the shared language is unfamiliar to the infant (Spanish) [ 80 ]. Thus, infants’ expectations did not require any specific information or experience with that linguistic group to infer that people who speak the same language share relevant similarities.

An initial system for social categorization in infancy

Taken together, the findings from violation of expectation studies suggest that infants can generalize information selectively across same-language speakers, and make inferences about social relationships based on language. Therefore, at least in the case of language, infants’ social categorization shares critically important features of older children’s and adults’ social categorization: they use information about group membership to infer whether people will share properties, and how people will interact. Thus, conceptually-rich social categories emerge before verbally provided information can affect social knowledge, suggesting that the ability to form social categories does not depend on explicit learning about the cultural or stereotypic content associated with different groups. The ability to use these categories to draw inferences about social structure likely drives social thinking and learning from early in ontogeny.

Although this recent work on infants’ inductive generalization and inferences about social relationships focuses on language categories, we hypothesize that infants could apply these same abstract features of social categorization to other groups that they think are socially important. Specifically, infants may have a system for thinking about people as members of social groups that is present early in ontogeny, such that infants are ready, early on, to apprehend and generalize information across individuals in a social category. Of course, children’s social category knowledge grows significantly across development, and social partners play an important role in this process. In particular, we expect that social experiences (such as the infants’ typical environment and the information they receive from social partners) would modulate which features infants would see as relevant markers of social categories.

We hypothesize that infants begin by seeing specific features of human behavior as fundamentally relevant to social categorization ( Box 1 ), and based on their social experiences, they learn to update the set of features they use to divide the social world into groups. Under this account, infants could require different experiences to form different social categories. That is, infants may initially expect that shared features that have defined group membership across human evolutionary history, such as language ( Box 4 ), food preferences [ 80 , 83 – 84 ], and engagement in ritualistic actions [ 85 ], would mark people as members of a social category. In contrast, infants would likely not see an arbitrary similarity, such as being randomly assigned to wear the same color mittens [ 61 , 87 ], as defining membership in a conceptually rich social group.

However, with experience, infants and children likely update the list of dimensions that are seen as relevant for social categorization. Thus, although humans might have predispositions to attend to some markers of social division over others, the features that are relevant in each infant’s and each child’s community will certainly impact social categorization across development. For example, experience with group norms can lead children to form social categories based on dimensions that were not relevant in our evolutionary past, such as race ( Box 1 ). Demonstrating the importance of social experiences on social thinking, minority race children, who likely have more experience thinking about race, reason about race as an important social category at earlier ages than majority race children [ 38 , 88 ]. These ideas are consistent with Developmental Intergroup Theory (DIT), which suggests that any dimension that is marked and made salient in a child’s community (e.g., by explicit input from important social partners) can be co-opted into the human system for reasoning about abstract social categories [ 89 ]. Indeed, this process likely underlies minimal group effects: researchers approximate social relevance by highlighting an arbitrary similarity, leading people to use the arbitrary feature as they would use an important social group marker ( Box 3 ).

Malleability in the features that are seen as relevant for social categorization may also work in the reverse direction, such that even categories that served as fundamental social group markers may be abandoned based on early social experiences. For example, though we argue that language reliably marks social groups, and may be prioritized in infants’ early social categorization, differences in infants’ sociolinguistic environments may influence whether they reason about language as marking social categories. Whereas infants from monolingual environments refrain from generalizing information across people who speak different languages, suggesting they may view different-language speakers as members of distinct social groups, infants from multilingual backgrounds generalize even across different-language speakers [ 80 ]. Therefore, infants who regularly see people who speak diverse languages interact may be less likely to use spoken language as a boundary for social groups. Thus, variations in important features of social environments could impact broader reasoning about the social world. Future research should ask how social experiences influence categorization on both potentially prioritized dimensions and on dimensions that humans may learn are important via social transmission (see Outstanding Questions Box ).

Outstanding Questions Box

  • Role of self-categorization: How early do infants and children self-categorize into social groups? Can conceptually-rich social categories exist prior to the development of a sense of self? How does acquiring a sense of self and self-categorization influence social categorization, social preferences, and prejudice? How does the social status of the groups to which the child self-categorizes impact social learning, own-group preferences, and prejudice? What happens when children identify with more than one group?
  • Role of social experience: What types of experiences influence infants’ social categorization? Is the link between experience and social categorization specific (e.g., language experience impacts thinking about language as a social marker) or broad (e.g., language experience leads to more flexible categorization generally)? Are early social experiences more important than later ones? How does experience impact both the tendency to create a category at all, as well as the tendency to use that category to make social inferences?
  • Malleability of prejudice: What types of interventions most successfully reduce prejudice? Do the same interventions work across the lifespan? Does reducing a social preference based on one social category (e.g., race) reduce social preferences or prejudice more broadly (e.g., change gender stereotypes)?
  • Priorities in social categorization: Which social boundaries are infants most likely to attend to? Are infants’ earliest social categories based on dimensions that have fundamentally marked social groups across human evolutionary history? Is reasoning about potentially prioritized social categories less malleable than reasoning about social categories that are acquired later? Are categories that are learned to be important based on social input (e.g., race) used identically to potentially prioritized categories once they are acquired?

Using malleability of social categorization to reduce social prejudice

Although human infants may be ready to form conceptually rich social categories, the fact that forming generative inferences based on category membership can, in theory, be separated from dislike of the outgroup [ 1 ], and the fact that the particular dimensions upon which humans form social categories are malleable [ 38 , 80 ], suggest that prejudice against members of particular social groups is not inevitable. Developmental research sheds light on the relationship between categorization and preference formation, suggesting that studying human reasoning across the lifespan is critical for understanding the emergence and malleability of intergroup bias. Specifically, important future studies will continue to investigate how and when social preferences and conceptual reasoning about social groups come to operate together, leading to prejudice and discrimination. One possibility is that children first form simple preferences for familiar people, and then later generalize these positive associations to personally unfamiliar members of their broader social group, leading them to show ingroup positivity [ 64 ] and eventually outgroup negativity [ 52 ]. Alternatively, children may form conceptually rich social categories, and then come to self-identify with one category, at which point they may begin to show adult-like depersonalized preferences for members of their own group, leading to bias (Outstanding Questions).

One particularly important area for future study involves investigating how parents and educators can limit the transmission of bias. One prominent way that adults transmit information about social categories is through their language. For instance, generic language refers to groups rather than individuals (e.g., “boys like X,” or “Hispanics live in Y”), signifying that groups are enduring, highlighting group differences, and teaching children that the group distinction is meaningful. Indeed, hearing generic statements about a novel social group increases the likelihood that children form a conceptually rich social category, and can lead children to develop essentialist thoughts and stereotypes about the novel group [ 65 – 67 ]. Thus, parents and educators may strive to speak about people as individuals (e.g., “This boy likes X”) rather than speaking about whole categories of people, in order to reduce essentialist tendencies.

As a caveat, it is impossible, and potentially counterproductive, to avoid all conversation that remarks on people’s social group membership. Indeed, research on “colorblind” interventions shows that purposefully refraining from all discussion of a category (in this case, race) can be ineffective and can even lead to increased prejudice [ 90 – 91 ]. Nevertheless, focusing on people as being distinctive individuals, as opposed to members of groups with collective properties, is one area in which language can be used in smart ways to potentially reduce children’s tendency to form a new conceptually rich social category, and to lower the transmission of bias towards members of the highlighted social group. In support of this idea, introducing people to counter-stereotypic individuals from a certain social group has been one of the most effective ways to reduce implicit bias for both adults [ 92 ] and children [ 93 ].

Rather than trying to halt the formation of social categorization in the first place, many interventions have focused on reducing the social significance that people ascribe to the categories to which other people belong. For example, interventions aimed at reducing prejudice based on gender and race have successfully led to less explicit and implicit bias, to smoother cross-group interactions, and even to increased overt actions aimed at promoting equality [ 94 – 96 ]. These interventions probably do not change the likelihood that adults categorize people into social groups, but rather may help participants change the perceptions, beliefs, and stereotypes they ascribe to those social categories. Interesting future questions concern how to leverage the insight gained by studies of bias reduction among adults to create manipulations that are effective with children. Indeed, current research suggests that implicit associations are malleable based on new information [ 97 – 98 ], and that similar interventions could be effective for adults and children [ 92 – 93 ]. Moreover, efforts to change the structure of social categorization among children may be even more impactful, since children have less experience, meaning their stereotypes and biases may be less entrenched and easier to overcome.

Concluding Remarks

Social categorization has vast implications for myriad aspects of human social life. Here we present evidence that developmental psychology can inform scientists’ understanding of the mechanisms underlying social categorization and its downstream negative consequences. To do this, we aimed to provide a review of developmental research on social categorization, and argued that conceptually rich social categorization is functionally different from social preferences for individual members of social groups. Separating these two processes can lead to a better understanding of the mechanisms and implications of each type of data. Although these constructs may act in tandem in adulthood, it is theoretically and empirically possible to separate them: having a social preference for people from a familiar background does not require reasoning about abstract similarities between group members, and the initial formation of social predictions based on social categories does not obligate a preference or dispreference toward a particular group. In fact, although hearing generic language increases essentialist reasoning about novel social groups [ 65 – 67 ], children who hear generic language about a novel social group do not initially show lower levels of liking for members of that group compared to children who heard specific language about members of the novel group [ 99 ]. Even for adults, higher levels of essentialist reasoning are not always related to higher levels of prejudice towards social groups [ 100 ]. Therefore, even if forming social categories and making social inferences based on these categories is a basic part of human cognition, prejudice is not inevitable.

There are still many open questions regarding the origins of social categorization (Outstanding Questions), including questions about which dimensions infants see as fundamentally relevant to social categorization ( Box 1 ), and about how experience across the lifespan shapes the use of these social categories in real interactive contexts ( Box 3 ). However, this growing body of research suggests that an ability to see people as members of social groups, and to use these groups to inform inferences about the social world, emerges in infancy. Clarity in the evidence deemed necessary to demonstrate conceptually rich social categorization in infancy will propel these important inquiries forward.

  • Social preferences for ingroup members emerge in the first year of life.
  • Preferring to look at or to interact with familiar or similar others does not necessarily indicate an ability to form abstract, inductively-rich social categories.
  • Recent studies using violation of expectation looking time methods provide clearer evidence that infants can form conceptually rich social categories.
  • Infants use social group boundaries to guide their inductive generalizations and expectations about social relationships.
  • Social categorization and social preferences are each malleable based on input, experience, and interventions, suggesting prejudice may not be inevitable.


Automatic Documents Categorization Using NLP

  • Conference paper
  • First Online: 08 November 2022


  • Parsa Sai Tejaswi 12 ,
  • Saranam Venkata Amruth 12 ,
  • Prakya Tummala 12 &
  • M. Suneetha 12  

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 520))


Automated text classification has become a critical tool in the legal business in recent years. There is no doubt about the commercial value of being able to automatically categorize documents based on their content. Automatic document categorization divides and organizes text according to a set of predefined categories, allowing for quick and easy retrieval of data during the search phase. Text classification, which aims to assign predetermined categories to a given text sequence, has long been a classic task and a popular research topic in the field of natural language processing (NLP). Bidirectional encoder representations from transformers (BERT), one of the most popular transformer-based general language models, is efficient, scalable, and objective, and through fine-tuning on downstream tasks it yields strong results. As a result, BERT is fine-tuned in this project and applied to documents for automatic document classification. This BERT model is then compared against another model, XLNet, to see whether BERT outperforms it. When tested on test data, the accuracies of the BERT and XLNet models are 96.8% and 97.30%, respectively, indicating that the XLNet model slightly beats the BERT model.
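To make the fine-tuning idea concrete: a transformer encoder produces a contextualized vector for each token, and fine-tuning for classification attaches a small head to the [CLS] (first-token) representation. The NumPy sketch below is not the authors' code; the tiny dimensions, random weights, and single attention head are illustrative stand-ins for a full pre-trained BERT, shown only to illustrate the mechanics of self-attention followed by a [CLS]-based classification head.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: 4 tokens ([CLS] + 3 words), hidden size 8, 2 classes.
seq_len, d, n_classes = 4, 8, 2
X = rng.normal(size=(seq_len, d))          # token embeddings (stand-in for BERT input)

# One self-attention head (BERT stacks many heads and layers).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
H = softmax(Q @ K.T / np.sqrt(d)) @ V      # contextualized token representations

# "Fine-tuning" adds a classification head on the [CLS] (first) token.
W_cls = rng.normal(size=(d, n_classes))
probs = softmax(H[0] @ W_cls)              # class probabilities for the document
print(probs.shape)
```

In a real fine-tuning run, `W_cls` and all encoder weights are updated by gradient descent on labeled documents; only the head architecture shown here is new relative to pre-training.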



Author information

Authors and Affiliations

Information Technology, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, 520007, India

Parsa Sai Tejaswi, Saranam Venkata Amruth, Prakya Tummala & M. Suneetha


Corresponding author

Correspondence to Parsa Sai Tejaswi .

Editor information

Editors and Affiliations

M. Tuba, Singidunum University, Belgrade, Serbia

Shyam Akashe, ITM University, Gwalior, Madhya Pradesh, India

A. Joshi, Global Knowledge Research Foundation, Ahmedabad, India


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Tejaswi, P.S., Amruth, S.V., Tummala, P., Suneetha, M. (2023). Automatic Documents Categorization Using NLP. In: Tuba, M., Akashe, S., Joshi, A. (eds) ICT Infrastructure and Computing. Lecture Notes in Networks and Systems, vol 520. Springer, Singapore. https://doi.org/10.1007/978-981-19-5331-6_23


Print ISBN : 978-981-19-5330-9

Online ISBN : 978-981-19-5331-6



DOI: 10.14569/IJACSA.2023.0140240

Automated Categorization of Research Papers with MONO Supervised Term Weighting in RECApp

Authors: Ivic Jan A. Biol, Rhey Marc A. Depositario, Glenn Geo T. Noangay, Julian Michael F. Melchor, Cristopher C. Abalorio, and James Cloyd M. Bustillo

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 14, Issue 2, 2023.


Abstract: Natural Language Processing, specifically text classification (or text categorization), has become a trend in computer science. Text classification is commonly used to categorize large amounts of data so that less time is needed to retrieve information. Students, as well as research advisers and panelists, spend extra effort and time classifying research documents. To solve this problem, the researchers used state-of-the-art supervised term weighting schemes, namely TF-MONO and SQRTF-MONO, applied to three machine learning algorithms: K-Nearest Neighbor, Linear Support Vector, and Naive Bayes classifiers, creating a total of six classifier models to ascertain which performs optimally in classifying research documents, while utilizing Optical Character Recognition for text extraction. The results showed that among all classification models trained, SQRTF-MONO with Linear SVC outperformed all other models with an F1 score of 0.94 on both the abstract and the background-of-the-study datasets. In conclusion, the developed classification model and application prototype can be a tool to help researchers, advisers, and panelists lessen the time spent classifying research documents.
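The MONO and SQRTF-MONO weighting formulas themselves are defined in the cited work; as a hedged illustration of the general pipeline (weight terms, then classify), the pure-Python sketch below uses plain TF-IDF as a stand-in for the MONO schemes and a nearest-centroid cosine classifier as a simple stand-in for Linear SVC. The toy corpus and its labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus of research-document snippets (invented for illustration).
docs = [
    ("nlp",    "text classification with term weighting schemes"),
    ("nlp",    "natural language processing for document categorization"),
    ("vision", "image segmentation and object detection"),
    ("vision", "convolutional networks for image classification"),
]

# Smoothed IDF over the corpus; TF-IDF here stands in for the MONO weightings.
n = len(docs)
df = Counter(t for _, d in docs for t in set(d.split()))
idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def vectorize(text):
    tf = Counter(text.split())
    return {t: f * idf.get(t, math.log(1 + n) + 1) for t, f in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Nearest-centroid classifier: average the weighted vectors of each class.
centroids, counts = defaultdict(Counter), Counter()
for label, d in docs:
    counts[label] += 1
    for t, v in vectorize(d).items():
        centroids[label][t] += v
for label in centroids:
    for t in centroids[label]:
        centroids[label][t] /= counts[label]

def classify(text):
    v = vectorize(text)
    return max(centroids, key=lambda lab: cosine(v, centroids[lab]))

print(classify("language model for text classification"))  # expected: nlp
```

Swapping in a MONO-style supervised weight would only change the `idf`-like factor, since supervised schemes compute a term's weight from its distribution across class labels rather than across documents alone.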

Ivic Jan A. Biol, Rhey Marc A. Depositario, Glenn Geo T. Noangay, Julian Michael F. Melchor, Cristopher C. Abalorio and James Cloyd M. Bustillo, “Automated Categorization of Research Papers with MONO Supervised Term Weighting in RECApp” International Journal of Advanced Computer Science and Applications(IJACSA), 14(2), 2023. http://dx.doi.org/10.14569/IJACSA.2023.0140240

@article{Biol2023,
  title     = {Automated Categorization of Research Papers with MONO Supervised Term Weighting in RECApp},
  journal   = {International Journal of Advanced Computer Science and Applications},
  doi       = {10.14569/IJACSA.2023.0140240},
  url       = {http://dx.doi.org/10.14569/IJACSA.2023.0140240},
  year      = {2023},
  publisher = {The Science and Information Organization},
  volume    = {14},
  number    = {2},
  author    = {Ivic Jan A. Biol and Rhey Marc A. Depositario and Glenn Geo T. Noangay and Julian Michael F. Melchor and Cristopher C. Abalorio and James Cloyd M. Bustillo}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.


  • Open access
  • Published: 24 April 2024

Trained neural networking framework based skin cancer diagnosis and categorization using grey wolf optimization

  • Amit Kumar K. 1 ,
  • Satheesha T.Y. 2 ,
  • Syed Thouheed Ahmed 3 ,
  • Sandeep Kumar Mathivanan 4 ,
  • Sangeetha Varadhan 5 &
  • Mohd Asif Shah 6 , 7  

Scientific Reports volume 14, Article number: 9388 (2024)


  • Health care
  • Medical research
  • Skin cancer

Skin cancer is caused by mutational differences in epidermis hormones and patch appearances. Many studies have focused on the design and development of effective approaches to the diagnosis and categorization of skin cancer, with decisions made on independent training datasets under limited conditions and scenarios. In this research, Kaggle-based datasets are optimized and categorized into a labeled data array for indexing using federated learning (FL). The technique is built on the grey wolf optimization algorithm to ensure that dataset attribute dependencies are extracted and dimensional mapping is processed. The threshold-value validation of the dimensionally mapped datasets is optimized and trained under the neural networking framework, further expanded via federated learning standards. The technique demonstrated 95.82% accuracy with the GWO technique and 94.9% with the combination of the Trained Neural Networking (TNN) framework and Recessive Learning (RL).

Introduction

Skin cancer is the most common type of cancer recorded in the United States, with an estimated 145% increase over current diagnosis statistics. It is caused by pigment abnormalities in the outer layer of the skin, or epidermis. Skin cancer detection and diagnosis are based on unusual skin textures and predominant symptoms such as bleeding and crusting in the middle of skin wounds and visible telangiectasia (small blood vessels). Typically, melanoma presents common symptoms such as skin color change, itching, and pain. Because these signs are inter-correlated with other abnormalities, patients miss them in many instances. Since the diagnosis and decision classification of skin cancer depend on this primary detection, the cancer is often detected only at a late or complex stage.

Many research approaches have been proposed worldwide to counter the delay in detecting skin cancer. The primary approach comes from the medical community, via improving the ability to track and categorize patients based on family history, current medical treatment, and more. These precautionary measures bind experts to provide a higher positive rate of prediction. The medical community has thereby come to depend on the technological and biomedical scientist communities for the design and development of technological solutions for classifying and predicting skin cancer based on interconnected attributes and parameters. In this research article, an effort is made to propose a trained neural networking framework based skin cancer categorization. The article builds structural connectivity from one processing technique to another by sharing trained datasets for effective communication and optimization. The motivation of the proposed technique is to reorganize and structure the existing techniques of skin cancer detection to produce collective decision support.

The objective of the proposed research is to provide a reliable supporting approach for customizing interdependency parameters across multiple datasets in the processing framework. Since multiple techniques are involved in skin cancer detection, diagnosis, and prediction, the system is complicated to validate and, thereby, to compute decision support. The validated datasets are unstructured, as their occurrences are based on multiple independent processing units and algorithms; hence a dedicated framework for skin cancer diagnosis and categorization is required. The concept of federated learning (FL) is derived from the terminology of distributed computing and centralized decision making. The process and operating principles of FL systems ease the coordination and synchronization of multiple independent datasets across multiple platforms and algorithms.

The research article discusses the probabilities of attribute- and feature-based diagnosis using the grey wolf optimization approach. The Trained Neural Networking (TNN) framework assures the attribute-feature mapping and the prediction of skin cancer in the near future. The article is organized with an introduction and current literature in Sects. “ Introduction ” and “ Proposed methodology ”, followed by the proposed methodology and problem statement in “ Problem Statement ” and “ Prerequisite and data processing ”, with a mathematical proof in section “ Dimensionality mapping ”. The article concludes with results and discussion, highlighting the experimental setup and research findings, along with a conclusion and the scope for future enhancement.

Skin cancer has been validated and studied from a technological perspective with respect to primary image processing techniques. The techniques and improvements of approaches are discussed in Ref. 1 for computer-aided processing and Ref. 2 for image processing techniques. A comparative study on image processing based skin cancer detection is drafted by Ref. 3 . Further, a noninvasive skin cancer detection technique is discussed in Ref. 4 . These approaches are treated as the building blocks of technological techniques for resolving global skin cancer diagnosis. The diagnosis process has been improved with neural networking based frameworks, followed by deep learning and artificial intelligence 5 , 6 . Deep learning and neural networking techniques have successfully outperformed image processing techniques in performance and detection accuracy.

Device-based interference is computed and validated 7 with a dedicated convolutional neural network (CNN) and application interference. Bio-inspired algorithms are a boon for solving skin cancer diagnosis. Particle swarm optimization (PSO) is designed to improve the feature selection process from the attribute pool of the training dataset 8 , 9 . Biomedical datasets are considered for longer validations under multiple attribute detainment ratios 10 . The supporting clinical trials and processing of skin cancer based detection are validated by sound analysis algorithms. Deep learning based attribute matching and attribute validation support mapping and categorizing the most influential feature for providing accurate decision support 11 . The validation of immune system based cancer analysis is reported by Ref. 12 under osteoporosis.

The proposed framework is designed and validated with optimization algorithms for performance improvement 13 , 14 . These algorithms are based on dataset attribute-feature dependency matching; feature dimensions are reduced to upgrade the selected attributes for providing reliable decision support. The convolutional neural network (CNN) approach provides a reliable classification of skin cancer 15 , and that work discusses various cross-domain studies and techniques used in classification. A further evolution of CNN is the deep CNN (DCNN) model 16 . The DCNN process is enhanced using an additional customization approach based on transfer learning; transfer learning based systems are customized with inter- and cross-domain learning of datasets. Bayesian Deep Learning (BDL) models 17 are one such model, with three-way decision validations of skin cancer datasets; BDL is implemented on uncertainty quantification (UQ) methods. The principle of robustness in dynamic datasets is at the peak of computation and relevance.

Transfer learning is extended by Ref. 18 with benchmark settings and out-of-distribution (OOD) testing across a wide range of datasets from multiple sources and archives. Skin cancer classification and decision making are further examined in Ref. 19 with respect to performance efficiency and technological categorization; that survey provides an inclusive overview of categorizing and customizing data-attribute ratios using optimization algorithms for skin cancer classification and diagnosis. In Ref. 20, a novel sliding-window technique is proposed with a recorded accuracy of 97.8% in skin cancer prediction, built on a concatenated Xception-ResNet50 model. Other studies discuss improved approaches such as augmented intelligence 21 and multilevel threshold segmentation 22, and a detailed review is reported by Ref. 23 highlighting the multiple dimensionalities of dataset processing, methodologies, feature extraction techniques, and more.

In recent times, novel and effective optimization techniques such as the Liver Cancer Algorithm (LCA) 24, Harris Hawks Optimization (HHO) 25, and the RIME optimization algorithm 26 have been proposed, illustrating the versatile properties of bio-inspired algorithms for feature optimization. In the proposed technique, by contrast, we use a Grey Wolf Optimization 27 based approach for categorizing and customizing the skin cancer features.

Proposed methodology

The proposed methodology is designed and developed with the objective of providing a reliable solution for early detection and classification of skin cancer. The technique draws on trained dataset repositories from existing sources and algorithms (ISIC 2020 from the Sydney Melanoma Diagnosis Centre and Melanoma Institute Australia, HAM, and SKINL2 2019). The purpose is to extract a threshold-value attenuation ratio from the processed datasets, providing an accurate probability for decision support. The initial pre-processed datasets are represented as schematic records; these records are the primary elements for data standardization and alignment. The schematic records extract attributes from the raw datasets (trained repositories) to provide a dimensional representation matrix of primary attributes. The matrix consists of attribute and feature coordination ratios such as geographical reporting, gender, age, lesion size, area, primary comorbidities, and patient history records; dimensionality mapping is therefore required to assure that attribute-to-feature correlation is achieved, as shown in Fig.  1 . The inter-association of mapped raw attributes with the trained dataset repository provides a stable format of attribute-feature ratios. In Fig.  1 , the methodology transforms information parameters through a trained CNN model to attain decision support in detecting skin cancer. Table 1 presents a detailed representation of the mathematical model for ease of understanding.
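As a rough illustration of the schematic-record step described above, the records can be projected onto a fixed attribute order to obtain the dimensional representation matrix. This is a minimal sketch; the field names (`age`, `lesion_size`, `area`) are hypothetical stand-ins for the paper's attribute set, not taken from the published model.

```python
def to_matrix(records, attributes):
    """Project schematic patient records onto a fixed attribute order,
    producing a dimensional representation matrix of primary attributes.
    Attributes missing from a record default to 0.0."""
    return [[float(r.get(a, 0.0)) for a in attributes] for r in records]

# Hypothetical attribute names and values, for illustration only.
attrs = ["age", "lesion_size", "area"]
records = [
    {"age": 54, "lesion_size": 6.2, "area": 12.4},
    {"age": 47, "lesion_size": 3.1},  # 'area' missing -> filled with 0.0
]
matrix = to_matrix(records, attrs)
```

Fixing the attribute order up front is what makes later attribute-to-feature correlation well defined: every column of the matrix always refers to the same attribute.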

Figure 1: Dataset dimensionality mapping and feature optimization model.

After the attributes have been extracted through inter-relationship mapping and feature extraction from the dataset shown in Table 2, the Grey Wolf Optimization (GWO) algorithm is applied. GWO extracts the most significant features and attributes from the extracted attribute-feature pool, as demonstrated in Fig.  2 . The goal is to obtain the most significant and influential features of the skin pattern, such as lesion size, rate of spread, and area of interest, to enable reliable decision making via a Trained Convolutional Neural Network (TCNN). The TCNN is designed around a feedback layer and update-based training. The attribute and feature mapping, with its impact on decision support and validation, is described in the classification diagram (Fig.  2 ). Decision making and validation of skin cancer classification are based on secondary mapping and synchronization; mapping coordination together with repository analysis assures higher decision-support performance.

Figure 2: Interdependency representation and classification for decision making.

Figure  2 presents the classification model for the interdependency representation of the skin cancer computation outlined in Fig.  1 of the proposed system. It further demonstrates the purpose of extracting features and correlating them with the datasets. The classification follows the phases of the implementation: raw dataset collection, dimensionality mapping of features, the Grey Wolf Optimization technique, and decision support. Each individual unit represents the operations and tasks undertaken.

Problem statement

Validation of skin cancer detection and categorization is currently performed by convolutional neural network (CNN) frameworks. Typically, these neural networks are aligned and calibrated with training datasets. The dataset is static and frozen (pre-loaded) before the process is initialized, and is centralized in nature; this centralization stems from the trivial computational approaches used for processing. It results in data indexing and heap-ordered repository creation, causing a large volume of data deposits, which in turn leads to ineffective computation and false-positive attribution in decision making. Hence, dataset modifications and updates cause the neural network to unlearn its decision support. Consider the dataset \(\left( {D_{X} } \right)\) as the trained repository with \(\left( {x_{i} } \right)\) attributes lined up for computational validation. The approach technically derives \(\left( {D_{X} \Rightarrow \infty } \right)\) if and only if the optimized attributes \(\left( {x_{j} } \right)\) are extracted with \(\left( {x_{j} \subseteq x_{i} } \right)\) and \(\left( {x_{j} \Rightarrow D_{X} \Rightarrow \infty } \right)\) . Hence the orientation of the changing value functions fluctuates over changing time intervals.

The attributes are validated as \(\left[ {\left( {V_{x} } \right) \Rightarrow \left( {x_{j} \cup D_{X} } \right)_{0}^{n} } \right]\) such that \(\left( {\forall x_{i} x_{j} \in D_{X} } \right)\) , with the dataset changing under the training interval \(\left( {\Delta {\rm T}} \right)\) , as shown in Eq. ( 1 ) with a summation over each independent variable associated with and managed by the training datasets \(\left( {\Delta {\rm T}} \right)\) .

The major challenge of dataset processing lies in the training and evaluation phase. The solution is to rectify the training dataset with a dedicated neural networking framework operated at the source of data origin.

Prerequisite and data processing

In the context of skin cancer datasets, data preprocessing plays a vital role in preparing the data for analysis and machine learning applications. Initially, data cleaning techniques are employed to address missing values and outliers, ensuring data integrity. Subsequently, data transformation steps are applied to encode categorical variables, scale numerical features, and handle data normalization. Feature selection and extraction methods are utilized to identify relevant features and reduce dimensionality, enhancing the efficiency of subsequent analyses. Data integration techniques consolidate information from diverse sources, while data reduction methods help manage large datasets. Normalization ensures consistency in feature scales, which is crucial for accurate model training. Finally, the dataset is split into training, validation, and test sets to facilitate model development, parameter tuning, and evaluation. By systematically preprocessing skin cancer datasets, researchers can optimize data quality and facilitate more robust analysis, ultimately advancing our understanding and management of skin cancer. The skin cancer dataset is aligned and contributed by the National Cancer Imaging (NCI) and Cancer Image Archive (CIA) institute archives. The datasets are also aligned with the Kaggle repository for validation using random processing. The CIA datasets are used in a 60:40 training/testing validation, whereas the Kaggle cancer_datasets are used for independent validation. The NCI dataset considered contains 1274 malignant and 1173 benign pre-trained samples, and the performance sources are pre-acquired with the training-data technique. Consider the multisource repository alignment of datasets as \(\left( {D_{X} } \right)\) with \(\left( {D_{X} = D_{1} ,D_{2} ,D_{3} , \ldots ,D_{X} } \right)\) ; the xth value of \(\left( {\Delta D_{X} } \right)\) is reflected and attributed with the associated attributes \(\left( {A_{i} } \right)\) , as shown in Eq. ( 2 ) with weights \(\left( \omega \right)\) ,

where \(\left( \eta \right)\) is the neutralization factor for collective processing and \(\left( \varepsilon \right)\) is the elimination matrix, as shown in Eqs. ( 2 ) and ( 3 ) respectively. The acquired image segments are further retained and processed with weight calibration and fragmentation. The resultant weight matrix \(\left( {\omega \left( {n \times m} \right)_{i} } \right)\) assures that the data streams are interconnected and the attribute mapping network is optimized.
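The normalization and 60:40 split steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the rows are a toy stand-in for real records, and the paper does not specify the exact scaling or shuffling procedure it used.

```python
import random

def min_max_scale(rows):
    """Min-max scale each numeric column to [0, 1], giving consistent
    feature scales across attributes before training."""
    cols = list(zip(*rows))
    scaled = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against constant columns
        scaled.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled)]

def split_60_40(rows, seed=0):
    """Shuffle and split into a 60% training / 40% testing partition,
    matching the split ratio used for the CIA datasets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.6)
    return rows[:cut], rows[cut:]

data = [[v, 100 - v] for v in range(10)]  # toy stand-in for real records
train, test = split_60_40(min_max_scale(data))
```

Scaling before the split keeps the example short; in practice one would fit the scaler on the training rows only, to avoid leaking test-set statistics.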

Dimensionality mapping

The acquired dataset \(\left( {D_{X} } \right)\) with an optimized attribute association \(\left( {\Delta D_{i} } \right)\) is represented as shown in Eq. ( 3 ). The relevance of the weight \(\left( \omega \right)\) association is framed under \(\varepsilon \left( \omega \right)\) with an interdependency mapping \(\left( {\varepsilon \left| \omega \right| = \left\| {\Delta D_{i} } \right\|} \right)\) at the given data intervals. The technical differences between the optimized data and the relevance parameters are mapped toward dimension reduction. The dimensional paradigms are further influential in capturing attributes \(\left( {\Delta D_{i} } \right)\) into a dimensional type mapping, as shown in Eq. ( 4 ),

where the functional representation of data variables under dimensions \(\left\| {D_{P} } \right\|\) is evaluated with the elimination matrix \(\left( \varepsilon \right)\) . The process is aligned with the weight matrix and a data-type prediction ratio, as shown in Eq. ( 4 ). The variation matrix of the skin cancer dataset \(\left( {\Delta D_{j} } \right)_{0}^{S}\) is further aligned to the optimized matrix in Eq. ( 5 ), with an extended representation shown in Eq. ( 6 ).

Retaining the time matrix with reference to Eq. ( 6 ), the weight matrix \(\left\| {\omega_{{\left( {i,j} \right)}} } \right\|\) is a dependency computed in a relatively aligned manner. The elimination matrix is reduced and limited to the operations of \(\left( {\varepsilon \left\| {D_{P} } \right\|_{0}^{S} } \right)\) over the \(\left( {0 \to S} \right)\) iteration. The further dimension reduction is optimized and represented as in Eq. ( 7 ),

where \(\left( L \right)\) is the layer-dimension processing under the defined matrix representation of \(\left\| {D_{P} } \right\|\) . The overall dimensional representation is collectively focused on optimizing the processed datasets.

Trained convolutional neural networking framework for dimensionality mapping

With the dimensional optimization matrix extracted in Eq. ( 7 ), further processing addresses the mapping itself. The dimensionality mapping \(\left( {D\left[ M \right]_{0}^{S} } \right)\) is a representative function of multiple values and framesets with RoI-based extraction \(\left( {{\mathbb{R}}_{Z} } \right)\) . The layering \(\left( L \right)\) with summarization matrix \(\left( {S_{M} } \right)\) is represented as shown in Eq. ( 8 ).

In Eq. ( 9 ), the possibility of the functionality vector \(\left( L \right)\) and the RoI range \(\left( {\mathbb{R}} \right)\) is computed. The factorial representation is layered, and the functional aspect of \(\sum \left( L \right)\) is computed with reference to Eq. ( 10 ). Typically, the processing vector component is aligned using \(\left( {\left\| {S_{m} } \right\|_{k} } \right)\) to an interconnected attribute ratio. The extracted and optimized layer \(\left[ {\sum \left( L \right)_{k} } \right]\) is reflected in the attribute \(\left( A \right)\) , and hence the mapping \(\left. {D\left( m \right)} \right)_{0}^{s}\) is represented in Eq. ( 11 ).

According to Eq. ( 11 ), the coordination of the weight matrix \(\left\| {\Delta \omega } \right\|\) is reflected in association with the \(\left\| {\Delta {\rm T}} \right\|\) values for securing the dataset recommendation. The process is further computed with the layer \(\left[ {\sum \left( L \right)_{k} } \right]\) and \(\left. {D_{P} } \right)_{0}^{s}\) at the instant of operation. The attribute of skin cancer layering and RoI extraction \(\left( {{\mathbb{R}}_{Z} } \right)\) is dependency based. Technically, the formulation of the multiple layers and associations is streamlined in the mapping function represented in Eq. ( 12 ).

The mapping order of the skin cancer dataset \(\left( {D_{X} } \right)\) under \(\left( {\Delta D_{f} } \right)\) is optimized and mapped for representational purposes. The outcome of rational mapping is the resultant trained neural network \(\left( {{\rm T}_{X} } \right)\) , as represented in Eq. ( 13 ).

The trained neural network (TNN) terminology is based on feedback competitive learning models. The approach segments the dataset \(\left( {D_{X} } \right)\) into multiple layers: an input layer, hidden layers, a computational layer, and a feedback layer. Typically, the process includes a structural representation of feedback-based self-learning models. The overall streamlining \(\sum \left( {\left. {D\left( m \right)} \right]_{0}^{S} } \right)\) is evaluated within the scope of the mapping range, as shown in Eq. ( 13 ).

The relevance ratio of the trained-neural-network-based \(\left( {{\rm T}_{X} } \right)\) is optimized and summarized with reference to \(f\left( {D_{S} } \right)\) as the feature set and the occurrence pattern. The representation \(\left[ {D\left( {\rm M} \right)_{0}^{S} } \right]\) is associated with the ith value of the overall nodes (n) involved in TNN processing. The outcome of the differences \(\sum \left[ {D\left( {\rm M} \right)_{0}^{S} } \right]\) is subjected to decision support for categorization.
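The paper does not publish the TCNN architecture itself, but the core building block it relies on, a convolutional layer, can be sketched as a plain "valid" 2D cross-correlation with a fused ReLU activation. The patch and kernel below are toy values chosen for illustration, not data from the study.

```python
def conv2d_relu(img, kernel):
    """'Valid' 2D cross-correlation (the operation CNN layers compute)
    followed by a ReLU activation, on plain nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(img) - kh + 1
    ow = len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            s = sum(img[i + u][j + v] * kernel[u][v]
                    for u in range(kh) for v in range(kw))
            out[i][j] = max(0.0, s)  # ReLU keeps only positive responses
    return out

# A vertical-edge kernel responding to a lesion-like boundary in a toy patch.
patch = [[0, 0, 1, 1]] * 4
edges = conv2d_relu(patch, [[-1, 1]])
```

Stacking such layers, with learned kernels, is what lets a TCNN turn raw pixel intensities into boundary and texture features for the decision-support stage.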

Grey-wolf optimization approach

The extracted layers and dimensions are classified and processed for decision support. The decision is subject to attribute and dataset optimization, and the TNN framework assures the reliability of the feedback-based self-learning environment. The Grey Wolf Optimization \(\left( {O_{G} } \right)\) reduces and ranges the attributes associated with \(\left( {{\rm T}_{X} } \right)\) and \(\left[ {\left. {D_{{\left( {\rm P} \right)}} } \right|_{0}^{S} } \right]\) such that \(\left[ {\forall \left\| {D_{\rm P} } \right\|_{0}^{S} \Rightarrow \sum {\rm T}_{X} } \right]\) and \(\left( {{\rm T}_{X} \notin D_{X} } \right)\) at the initial and final computational stages. The alignment is subject to the variation of parameters in attribute \(\left( {A_{i} } \right)\) such that \(\left( {\forall A_{i} \Rightarrow D_{X} } \right)\) and \(\left( {A_{i} \subseteq D_{X} } \right)\) ; at the outset, the formulation vector is computed as shown in Eq. ( 15 ).

The Grey Wolf Optimization \(\left( {O_{G} } \right)\) is appended to the trained neural networking framework for collective processing. The weight of association is aligned with \(\left( {\left\| {D_{\rm P} } \right\|_{0}^{S} } \right)\) such that \(\left( {\forall \left\| {D_{\rm P} } \right\|_{0}^{S} \Rightarrow \Delta {\rm T}\left( {O_{G} } \right)_{Z} } \right)\) , where \(\left( x \right)\) is the functional variable of the optimized techniques.

According to Eqs. ( 17 ) and ( 18 ), the formulation of the datasets \(\left( {\left\| {D_{\rm P} } \right\|_{0}^{S} } \right)\) is bound with respect to the optimization \(\left( {O_{G} } \right)\) under a constant recurrence format. Typically, the functional representation of \(\left( {\left\| {D_{\rm P} } \right\|_{0}^{S} } \right)\) under a recurrence ratio is evaluated for effective computation. The order of evaluation and the occurrence ratio of the dataset \(\left( {D_{X} } \right)\) are further subjected to inter-common attribute and feature elimination, as shown in Eq. ( 18 ). The buffer factor \(\left( {\Delta {\rm T}} \right)\) is associated to assure saturation of the threshold values in the dataset frame.
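The update rules of the canonical grey wolf optimizer of Mirjalili et al. 27, which this section builds on, can be sketched as follows. This minimizes a toy sphere function rather than the paper's attribute-fitness measure; the population size, bounds, and iteration count are illustrative choices, not the study's settings.

```python
import random

def gwo_minimize(fitness, dim, n_wolves=12, iters=60, lb=-5.0, ub=5.0, seed=42):
    """Canonical Grey Wolf Optimizer: each wolf moves toward the three
    best solutions found so far (alpha, beta, delta)."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        ranked = sorted(wolves, key=fitness)
        alpha, beta, delta = ranked[0][:], ranked[1][:], ranked[2][:]  # snapshot leaders
        a = 2.0 - 2.0 * t / iters  # exploration coefficient decays from 2 to 0
        for w in wolves:
            for d in range(dim):
                x = 0.0
                for leader in (alpha, beta, delta):
                    r1, r2 = rng.random(), rng.random()
                    A = 2.0 * a * r1 - a          # encircling coefficient
                    C = 2.0 * r2                  # leader-emphasis coefficient
                    D = abs(C * leader[d] - w[d])  # distance to the leader
                    x += leader[d] - A * D
                w[d] = min(ub, max(lb, x / 3.0))  # average of the three pulls
    best = min(wolves, key=fitness)
    return best, fitness(best)

sphere = lambda v: sum(x * x for x in v)
best, fit = gwo_minimize(sphere, dim=3)
```

In a feature-selection setting like the one described here, `fitness` would instead score a candidate attribute subset, for example by the validation accuracy of the TCNN trained on it.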

Results and discussions

The proposed skin cancer classification and validation technique is supported on dynamic, pre-processed and trained datasets. The process includes raw-dataset, schematic-record-based feature extraction and mapping with reference to the trained dataset repository. The attribute-ratio extraction under multiple scenarios is demonstrated in Fig.  3 . The alignment ratios of the independent attributes are correlated and combined into a feature matrix to provide an alignment ratio of the interconnected values. This includes the performance efficiency of the attribute ratio with respect to feature selection over the given interval of values. The scenarios are based on the split ratio of the training and testing datasets.

Figure 3: Attribute ratio extraction and evaluation parameter comparison.

Figure  4 compares the performance of the proposed Grey Wolf Optimization (GWO) technique with existing approaches such as feature optimization, KNN optimization, and whale optimization. The proposed GWO-based feature optimization achieves 95.82% accuracy, outperforming the other techniques.

Figure 4: Comparison with optimization algorithms.

According to the values extracted from the multiple comparisons shown in Fig.  5 , the proposed Trained Neural Networking (TNN) framework is evaluated under Recursive Learning (RL). The TNN + RL computation increases the probability of correct decision making and support. RL adds the feedback layer, with structural updates propagated through the TNN model's hidden layers. The ratio of true negatives (TN) to false positives (FP) is bounded with a minimal naive approach for optimizing the prediction and classification ratio.

Figure 5: Observation and study evaluation with reference to the trained neural networking technique.
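For reference, the TN/FP trade-off discussed above reduces to standard confusion-matrix metrics. The counts below are made-up numbers for illustration, not results from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }

# Hypothetical counts: 90 true positives, 5 false positives,
# 85 true negatives, 10 false negatives.
m = confusion_metrics(tp=90, fp=5, tn=85, fn=10)
```

For a screening task such as skin cancer detection, sensitivity (catching malignant cases) is usually weighted more heavily than raw accuracy.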

The anticipated outcomes undergo comparison and validation using various cross-domain methodologies such as K-nearest neighbors (KNN), whale optimization, and Grey Wolf Optimization. This process entails independent dataset processing and decision-making capabilities. Performance varies across instances, transitioning from KNN optimization to whale optimization. These optimization techniques are inherently system-driven, with datasets preserved and centralized. In contrast, the proposed Grey Wolf Optimization is tailored to trained neural network (TNN) recursive learning models within federated systems. The federated learning framework operates primarily in a decentralized manner and exhibits a higher computational ratio compared to existing approaches. The Grey Wolf Optimization mechanism is refined through inter-domain computation utilizing the federated learning (FL) and recursive learning (RL) models of the TNN.

The approach outlined presents several potential limitations that merit consideration. Firstly, the reliance on specific optimization techniques like KNN, whale optimization, and Grey-wolf optimization may restrict the generalizability of results, as these methods might not universally suit all datasets or problem domains. Moreover, while KNN and whale optimization are depicted as centralized methods with centralized datasets, this centralized nature could be problematic for applications involving distributed or privacy-sensitive data. Additionally, the complexity of Grey-wolf optimization, particularly when applied to TNN-based recursive learning models in federated systems, may introduce implementation challenges and hinder its effectiveness. The computational demands of the federated learning framework, compounded by the inclusion of Grey-wolf optimization, could pose significant computational burdens. Furthermore, the efficiency and effectiveness of inter-domain computing leveraging FL and RL models of TNN to enhance Grey-wolf optimization may be limited by factors such as data heterogeneity or communication overhead. Lastly, the generalizability of performance comparisons across different optimization techniques may not extend to all datasets or real-world scenarios, necessitating careful consideration of dataset characteristics and scalability concerns.

The proposed technique is evaluated on diagnosing and classifying skin cancer datasets into categories. The technique achieves evaluation and feature classification based on the trained neural networking framework, and the extracted and trained neural network outperforms existing techniques. The dual validation process is aligned and processed with the Grey Wolf Optimization technique for further RoI categorization, providing reliable decision support. The technique retrieves and classifies datasets under a 60% training and 40% testing split. Overall, the proposed technique achieves an accuracy of 94.9% on the skin cancer images, with classification based on the extracted patterns. In the near future, the technique can be validated on dynamic Artificial Neural Networks (dANN).

Data availability

The datasets used during the current study are available from the corresponding author on reasonable request.

Jain, S. & Pise, N. Computer aided melanoma skin cancer detection using image processing. Proc. Comput. Sci. 48 , 735–740 (2015).


Ansari, U. B. & Sarode, T. Skin cancer detection using image processing. Int. Res. J. Eng. Technol. 4 (4), 2875–2881 (2017).


Sreedhar, B. B. E., Kumar, M. S. & Sunil, M. A comparative study of melanoma skin cancer detection in traditional and current image processing techniques. In 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) (ed. Sreedhar, B. B. E.) 654–658 (IEEE, 2020).


Heibel, H. D., Hooey, L. & Cockerell, C. J. A review of noninvasive techniques for skin cancer detection in dermatology. Am. J. Clin. Dermatol. 21 (4), 513–524 (2020).


Takiddin, A., Schneider, J., Yang, Y., Abd-Alrazaq, A. & Househ, M. Artificial Intelligence for skin cancer detection: Scoping review. J. Med. Internet Res. 23 (11), e22934 (2021).


Kadampur, M. A. & Al Riyaee, S. Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images. Inform. Med. Unlock. 18 , 100282 (2020).

Dai, X., Spasić, I., Meyer, B., Chapman, S. & Andres, F. Machine learning on mobile: An on-device inference app for skin cancer detection. In 2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC) (ed. Dai, X.) 301–305 (IEEE, 2019).

Tan, T. Y., Zhang, L., Neoh, S. C. & Lim, C. P. Intelligent skin cancer detection using enhanced particle swarm optimization. Knowl.-Based Syst. 158 , 118–135 (2018).

Dascalu, A. & David, E. O. Skin cancer detection by deep learning and sound analysis algorithms: A prospective clinical study of an elementary dermoscope. EBioMedicine 43 , 107–113 (2019).


Ahmed, S. T. A study on multi objective optimal clustering techniques for medical datasets. In 2017 International Conference on Intelligent Computing and Control Systems (ICICCS) (ed. Ahmed, S. T.) 174–177 (IEEE, 2017).

Kadampur, M. A. & Al Riyaee, S. Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images. Inform. Med. Unlock. 18 , 100282 (2020).

Periasamy, K. et al. A proactive model to predict osteoporosis: An artificial immune system approach. Expert Syst. 39 (4), e12708 (2022).

Basha, S. M., Poluru, R. K. & Ahmed, S. T. A comprehensive study on learning strategies of optimization algorithms and its applications. In 2022 8th International Conference on Smart Structures and Systems (ICSSS) (ed. Basha, S. M.) 1–4 (IEEE, 2022).

Haggenmüller, S. et al. Skin cancer classification via convolutional neural networks: Systematic review of studies involving human experts. Eur. J. Cancer 156 , 202–216 (2021).

Ali, M. S., Miah, M. S., Haque, J., Rahman, M. M. & Islam, M. K. An enhanced technique of skin cancer classification using deep convolutional neural network with transfer learning models. Mach. Learning Appl. 5 , 100036 (2021).

Abdar, M. et al. Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning. Comput. Biol. Med. 135 , 104418 (2021).

Maron, R. C. et al. A benchmark for neural network robustness in skin cancer classification. Eur. J. Cancer 155 , 191–199 (2021).

Höhn, J. et al. Combining CNN-based histologic whole slide image analysis and patient data to improve skin cancer classification. Eur. J. Cancer 149 , 94–101 (2021).

Pacheco, A. G. & Krohling, R. A. An attention-based mechanism to combine images and metadata in deep learning models applied to skin cancer classification. IEEE J. Biomed. Health Inform. 25 (9), 3554–3563 (2021).

Panthakkan, A., Anzar, S. M., Jamal, S. & Mansoor, W. Concatenated Xception-ResNet50—A novel hybrid approach for accurate skin cancer prediction. Comput. Biol. Med. 150 , 106170 (2022).


Kumar, A., Satheesha, T. Y., Salvador, B. B. L., Mithileysh, S. & Ahmed, S. T. Augmented Intelligence enabled Deep Neural Networking (AuDNN) framework for skin cancer classification and prediction using multi-dimensional datasets on industrial IoT standards. Microprocess. Microsyst. 97 , 104755 (2023).

Ren, L. et al. Multi-level thresholding segmentation for pathological images: Optimal performance design of a new modified differential evolution. Comput. Biol. Med. 148 , 105910 (2022).

Painuli, D. & Bhardwaj, S. Recent advancement in cancer diagnosis using machine learning and deep learning techniques: A comprehensive review. Comput. Biol. Med. 146 , 105580 (2022).

Houssein, E. H., Oliva, D., Samee, N. A., Mahmoud, N. F. & Emam, M. M. Liver cancer algorithm: A novel bio-inspired optimizer. Comput. Biol. Med. 165 , 107389 (2023).

Alabool, H. M., Alarabiat, D., Abualigah, L. & Heidari, A. A. Harris hawks optimization: A comprehensive review of recent variants and applications. Neural Comput. Appl. 33 , 8939–8980 (2021).

Su, H. et al. RIME: A physics-based optimization. Neurocomputing 532 , 183–214 (2023).

Mirjalili, S., Mirjalili, S. M. & Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 69 , 46–61 (2014).


Author information

Authors and affiliations.

School of Engineering, CMR University, Bengaluru, India

Amit Kumar K.

School of Computer Science and Engineering, REVA University, Bengaluru, India

Satheesha T.Y.

Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India

Syed Thouheed Ahmed

School of Computer Science and Engineering, Galgotias University, Greater Noida, 203201, India

Sandeep Kumar Mathivanan

Department of Computer Applications, Dr. MGR Educational and Research Institute, Chennai, 600095, India

Sangeetha Varadhan

Kebri Dehar University, Kebri Dehar, Somali, 250, Ethiopia

Mohd Asif Shah

Division of Research and Development, Lovely Professional University, Phagwara, Punjab, 144001, India


Contributions

Conceptualization, A.K.K and T.Y.S.; methodology, S.K.M.; validation, S.K.M and S.T.A; resources, M.A.S; data curation, S.V; writing—original draft preparation, A.K.K and T.Y.S; writing—review and editing, S.K.M and M.A.S; visualization, S.K.M and S.V; supervision, S.K.M and S.T.A; project administration, S.K.M, S.T.A and M.A.S.

Corresponding authors

Correspondence to Syed Thouheed Ahmed or Mohd Asif Shah .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

K., A.K., T.Y., S., Ahmed, S.T. et al. Trained neural networking framework based skin cancer diagnosis and categorization using grey wolf optimization. Sci Rep 14 , 9388 (2024). https://doi.org/10.1038/s41598-024-59979-4


Received : 16 October 2023

Accepted : 17 April 2024

Published : 24 April 2024

DOI : https://doi.org/10.1038/s41598-024-59979-4


  • Skin cancer detection
  • Trained neural networks
  • Federated learning
  • Feature categorization



Data Science Central


Research paper categorization in Python

AqibSaeed

July 25, 2016 at 11:04 pm

Text classification (a.k.a. text categorization) is one of the most prominent applications of machine learning. The purpose of text classification is to give conceptual organization to a large collection of documents. An interesting application of text classification is to categorize research papers by the most suitable conference. Finding and selecting a suitable academic conference has always been a challenging task, especially for young researchers. We can define a “suitable academic conference” as one that is aligned with the researcher’s work and has a good academic ranking. Usually, researchers have to consult their supervisors and search extensively to find a suitable conference. Among many conferences, only a few are considered relevant targets for a given piece of research work. To fulfil the editorial and content-specific demands of a conference, researchers need to go through its previously published proceedings. Based on the previous proceedings of a conference, the research work is sometimes modified to increase the chances of acceptance and publication. This problem can be solved to some extent using machine learning techniques, e.g. classification algorithms like SVM, Naïve Bayes, etc.
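As a minimal illustration of the idea (not the tutorial's own code), a bag-of-words Naïve Bayes classifier can assign a paper title to a conference. The four titles and the conference labels below are invented placeholders, not a real proceedings dataset:

```python
# Minimal sketch: categorize paper titles by conference using word counts
# and Multinomial Naive Bayes. Titles and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_titles = [
    "deep convolutional networks for image recognition",
    "sentence embeddings for semantic textual similarity",
    "query optimization in distributed relational databases",
    "transactional consistency in cloud data stores",
]
train_confs = ["CVPR", "ACL", "VLDB", "VLDB"]

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),  # bag-of-words features
    ("nb", MultinomialNB()),                            # Naive Bayes classifier
])
clf.fit(train_titles, train_confs)

# Shared database vocabulary ("distributed", "query", "databases")
# pulls this unseen title toward the VLDB class.
print(clf.predict(["distributed query processing over large databases"]))
```

With more realistic data one would swap the toy lists for the conference proceedings corpus and evaluate on a held-out split rather than eyeballing a single prediction.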

Thus, the objective of this tutorial is to provide hands-on experience in performing text classification using a conference proceedings dataset. We will learn how to apply various classification algorithms to categorize research papers by conference, along with feature selection and dimensionality reduction methods, using the popular scikit-learn library in Python.
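A sketch of what such a scikit-learn pipeline might look like, chaining TF-IDF features, chi-squared feature selection, and truncated-SVD dimensionality reduction into a linear SVM. The six documents and three conference labels are hypothetical stand-ins for the real proceedings dataset, and the `k` and `n_components` values are picked only to fit this toy corpus:

```python
# Sketch of the tutorial's workflow: TF-IDF -> feature selection ->
# dimensionality reduction -> linear SVM. Data below is illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

docs = [
    "convolutional networks for object detection in images",
    "image segmentation with deep neural networks",
    "neural machine translation with attention mechanisms",
    "parsing and tagging with recurrent neural models",
    "indexing strategies for distributed database systems",
    "query optimization for large scale data warehouses",
]
labels = ["CVPR", "CVPR", "ACL", "ACL", "VLDB", "VLDB"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),       # text -> TF-IDF matrix
    ("select", SelectKBest(chi2, k=15)),                    # keep 15 most class-informative terms
    ("svd", TruncatedSVD(n_components=3, random_state=0)),  # reduce to 3 latent components
    ("svm", LinearSVC()),                                   # classifier on reduced features
])
model.fit(docs, labels)

# Feature-engineering steps can be inspected by slicing off the classifier:
print(model[:-1].transform(docs).shape)   # (6, 3)
print(model.predict(["relational database query processing"]))
```

Placing chi-squared selection before the SVD matters: chi2 requires non-negative inputs, so it must see the raw TF-IDF matrix rather than the (possibly negative) SVD components.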

Read the full article with source code.

What It Means To Be Asian in America

The lived experiences and perspectives of Asian Americans, in their own words.

Asians are the fastest growing racial and ethnic group in the United States. More than 24 million Americans in the U.S. trace their roots to more than 20 countries in East and Southeast Asia and the Indian subcontinent.

The majority of Asian Americans are immigrants, coming to understand what they left behind and building their lives in the United States. At the same time, there is a fast growing, U.S.-born generation of Asian Americans who are navigating their own connections to familial heritage and their own experiences growing up in the U.S.

In a new Pew Research Center analysis based on dozens of focus groups, Asian American participants described the challenges of navigating their own identity in a nation where the label “Asian” brings expectations about their origins, behavior and physical self. Read on to see, in their own words, what it means to be Asian in America.

Table of Contents

  • Introduction
  • This is how I view my identity
  • This is how others see and treat me
  • This is what it means to be home in America
  • About this project
  • Methodological note
  • Acknowledgments

No single experience defines what it means to be Asian in the United States today. Instead, Asian Americans’ lived experiences are in part shaped by where they were born, how connected they are to their family’s ethnic origins, and how others – both Asians and non-Asians – see and engage with them in their daily lives. Yet despite diverse experiences, backgrounds and origins, shared experiences and common themes emerged when we asked: “What does it mean to be Asian in America?”

In the fall of 2021, Pew Research Center undertook the largest focus group study it had ever conducted – 66 focus groups with 264 total participants – to hear Asian Americans talk about their lived experiences in America. The focus groups were organized into 18 distinct Asian ethnic origin groups, fielded in 18 languages and moderated by members of their own ethnic groups. Because of the pandemic, the focus groups were conducted virtually, allowing us to recruit participants from all parts of the United States. This approach allowed us to hear a diverse set of voices – especially from less populous Asian ethnic groups whose views, attitudes and opinions are seldom presented in traditional polling. The approach also allowed us to explore the reasons behind people’s opinions and choices about what it means to belong in America, beyond the preset response options of a traditional survey.

The terms “Asian,” “Asians living in the United States” and “Asian American” are used interchangeably throughout this essay to refer to U.S. adults who self-identify as Asian, either alone or in combination with other races or Hispanic identity.

“The United States” and “the U.S.” are used interchangeably with “America” for variations in the writing.

Multiracial participants are those who indicate they are of two or more racial backgrounds (one of which is Asian). Multiethnic participants are those who indicate they are of two or more ethnicities, including those identified as Asian with Hispanic background.

U.S. born refers to people born in the 50 U.S. states or the District of Columbia, Puerto Rico, or other U.S. territories.

Immigrant refers to people who were not U.S. citizens at birth – in other words, those born outside the U.S., Puerto Rico or other U.S. territories to parents who were not U.S. citizens. The terms “immigrant,” “first generation” and “foreign born” are used interchangeably in this report.  

Second generation refers to people born in the 50 states or the District of Columbia with at least one first-generation, or immigrant, parent.

The pan-ethnic term “Asian American” describes the population of about 22 million people living in the United States who trace their roots to more than 20 countries in East and Southeast Asia and the Indian subcontinent. The term was popularized by U.S. student activists in the 1960s and was eventually adopted by the U.S. Census Bureau. However, the “Asian” label masks the diverse demographics and wide economic disparities across the largest national origin groups (such as Chinese, Indian, Filipino) and the less populous ones (such as Bhutanese, Hmong and Nepalese) living in America. It also hides the varied circumstances under which groups immigrated to the U.S. and how they started their lives there. The population’s diversity often presents challenges. Conventional survey methods typically reflect the voices of larger groups without fully capturing the broad range of views, attitudes, life starting points and perspectives experienced by Asian Americans. They can also limit understanding of the shared experiences across this diverse population.

A chart listing the 18 ethnic origins included in Pew Research Center's 66 focus groups, and the composition of the focus groups by income and birth place.

Across all focus groups, some common findings emerged. Participants highlighted how the pan-ethnic “Asian” label used in the U.S. represented only one part of how they think of themselves. For example, recently arrived Asian immigrant participants told us they are drawn more to their ethnic identity than to the more general, U.S.-created pan-ethnic Asian American identity. Meanwhile, U.S.-born Asian participants shared how they identified, at times, as Asian but also, at other times, by their ethnic origin and as Americans.

Another common finding among focus group participants is the disconnect they noted between how they see themselves and how others view them. Sometimes this led to maltreatment of them or their families, especially at heightened moments in American history such as during Japanese incarceration during World War II, the aftermath of 9/11 and, more recently, the COVID-19 pandemic. Beyond these specific moments, many in the focus groups offered their own experiences that had revealed other people’s assumptions or misconceptions about their identity.

Another shared finding is the multiple ways in which participants take and express pride in their cultural and ethnic backgrounds while also feeling at home in America, celebrating and blending their unique cultural traditions and practices with those of other Americans.

This focus group project is part of a broader research agenda about Asians living in the United States. The findings presented here offer a small glimpse of what participants told us, in their own words, about how they identify themselves, how others see and treat them, and more generally, what it means to be Asian in America.

Illustrations by Jing Li

Publications from the Being Asian in America project

  • Read the data essay: What It Means to Be Asian in America
  • Watch the documentary: Being Asian in America
  • Explore the interactive: In Their Own Words: The Diverse Perspectives of Being Asian in America
  • View expanded interviews: Extended Interviews: Being Asian in America
  • About this research project: More on the Being Asian in America project
  • Q&A: Why and how Pew Research Center conducted 66 focus groups with Asian Americans

One of the topics covered in each focus group was how participants viewed their own racial or ethnic identity. Moderators asked them how they viewed themselves, and what experiences informed their views about their identity. These discussions not only highlighted differences in how participants thought about their own racial or ethnic background, but they also revealed how different settings can influence how they would choose to identify themselves. Across all focus groups, the general theme emerged that being Asian was only one part of how participants viewed themselves.

The pan-ethnic label ‘Asian’ is often used more in formal settings

“I think when I think of the Asian Americans, I think that we’re all unique and different. We come from different cultures and backgrounds. We come from unique stories, not just as a group, but just as individual humans.” Mali , documentary participant

Many participants described a complicated relationship with the pan-ethnic labels “Asian” or “Asian American.” For some, using the term was less of an active choice and more of an imposed one, with participants discussing the disconnect between how they would like to identify themselves and the available choices often found in formal settings. For example, an immigrant Pakistani woman remarked how she typically sees “Asian American” on forms, but not more specific options. Similarly, an immigrant Burmese woman described her experience of applying for jobs and having to identify as “Asian,” as opposed to identifying by her ethnic background, because no other options were available. These experiences highlight the challenges organizations like government agencies and employers have in developing surveys or forms that ask respondents about their identity. A common sentiment is one like this:

“I guess … I feel like I just kind of check off ‘Asian’ [for] an application or the test forms. That’s the only time I would identify as Asian. But Asian is too broad. Asia is a big continent. Yeah, I feel like it’s just too broad. To specify things, you’re Taiwanese American, that’s exactly where you came from.”

–U.S.-born woman of Taiwanese origin in early 20s

Smaller ethnic groups default to ‘Asian’ since their groups are less recognizable

Other participants shared how their experiences in explaining the geographic location and culture of their origin country led them to prefer “Asian” when talking about themselves with others. This theme was especially prominent among those belonging to smaller origin groups such as Bangladeshis and Bhutanese. A Lao participant remarked she would initially say “Asian American” because people might not be familiar with “Lao.”

“​​[When I fill out] forms, I select ‘Asian American,’ and that’s why I consider myself as an Asian American. [It is difficult to identify as] Nepali American [since] there are no such options in forms. That’s why, Asian American is fine to me.”

–Immigrant woman of Nepalese origin in late 20s

“Coming to a big country like [the United States], when people ask where we are from … there are some people who have no idea about Bhutan, so we end up introducing ourselves as being Asian.”

–Immigrant woman of Bhutanese origin in late 40s

But for many, ‘Asian’ as a label or identity just doesn’t fit

Many participants felt that neither “Asian” nor “Asian American” truly captures how they view themselves and their identity. They argue that these labels are too broad or too ambiguous, as there are so many different groups included within these labels. For example, a U.S.-born Pakistani man remarked on how “Asian” lumps many groups together – that the term is not limited to South Asian groups such as Indian and Pakistani, but also includes East Asian groups. Similarly, an immigrant Nepalese man described how “Asian” often means Chinese for many Americans. A Filipino woman summed it up this way:

“Now I consider myself to be both Filipino and Asian American, but growing up in [Southern California] … I didn’t start to identify as Asian American until college because in [the Los Angeles suburb where I lived], it’s a big mix of everything – Black, Latino, Pacific Islander and Asian … when I would go into spaces where there were a lot of other Asians, especially East Asians, I didn’t feel like I belonged. … In media, right, like people still associate Asian with being East Asian.”

–U.S.-born woman of Filipino origin in mid-20s

Participants also noted they have encountered confusion or the tendency for others to view Asian Americans as people from mostly East Asian countries, such as China, Japan and Korea. For some, this confusion even extends to interactions with other Asian American groups. A Pakistani man remarked on how he rarely finds Pakistani or Indian brands when he visits Asian stores. Instead, he recalled mostly finding Vietnamese, Korean and Chinese items.

Among participants of South Asian descent, some identified with the label “South Asian” more than just “Asian.” There were other nuances, too, when it comes to the labels people choose. Some Indian participants, for example, said people sometimes group them with Native Americans who are also referred to as Indians in the United States. This Indian woman shared her experience at school:

“I love South Asian or ‘Desi’ only because up until recently … it’s fairly new to say South Asian. I’ve always said ‘Desi’ because growing up … I’ve had to say I’m the red dot Indian, not the feather Indian. So annoying, you know? … Always a distinction that I’ve had to make.”

–U.S.-born woman of Indian origin in late 20s

Participants with multiethnic or multiracial backgrounds described their own unique experiences with their identity. Rather than choosing one racial or ethnic group over the other, some participants described identifying with both groups, since this more accurately describes how they see themselves. In some cases, this choice reflected the history of the Asian diaspora. For example, an immigrant Cambodian man described being both Khmer/Cambodian and Chinese, since his grandparents came from China. Some other participants recalled going through an “identity crisis” as they navigated between multiple identities. As one woman explained:

“I would say I went through an identity crisis. … It’s because of being multicultural. … There’s also French in the mix within my family, too. Because I don’t identify, speak or understand the language, I really can’t connect to the French roots … I’m in between like Cambodian and Thai, and then Chinese and then French … I finally lumped it up. I’m just an Asian American and proud of all my roots.”

–U.S.-born woman of Cambodian origin in mid-30s

In other cases, the choice reflected U.S. patterns of intermarriage. Asian newlyweds have the highest intermarriage rate of any racial or ethnic group in the country. One Japanese-origin man with Hispanic roots noted:

“So I would like to see myself as a Hispanic Asian American. I want to say Hispanic first because I have more of my mom’s culture in me than my dad’s culture. In fact, I actually have more American culture than my dad’s culture for what I do normally. So I guess, Hispanic American Asian.”

–U.S.-born man of Hispanic and Japanese origin in early 40s

Other identities beyond race or ethnicity are also important

Focus group participants also talked about their identity beyond the racial or ethnic dimension. For example, one Chinese woman noted that the best term to describe her would be “immigrant.” Faith and religious ties were also important to some. One immigrant participant talked about his love of Pakistani values and how religion is intermingled into Pakistani culture. Another woman explained:

“[Japanese language and culture] are very important to me and ingrained in me because they were always part of my life, and I felt them when I was growing up. Even the word itadakimasu reflects Japanese culture or the tradition. Shinto religion is a part of the culture. They are part of my identity, and they are very important to me.”

–Immigrant woman of Japanese origin in mid-30s

For some, gender is another important aspect of identity. One Korean participant emphasized that being a woman is an important part of her identity. For others, sexual orientation is an essential part of their overall identity. One U.S.-born Filipino participant described herself as “queer Asian American.” Another participant put it this way:

“I belong to the [LGBTQ] community … before, what we only know is gay and lesbian. We don’t know about being queer, nonbinary. [Here], my horizon of knowing what genders and gender roles is also expanded … in the Philippines, if you’ll be with same sex, you’re considered gay or lesbian. But here … what’s happening is so broad, on how you identify yourself.”

–Immigrant woman of Filipino origin in early 20s

Immigrant identity is tied to their ethnic heritage

A chart showing how participants in the focus groups described the differences between race-centered and ethnicity-centered identities.

Participants born outside the United States tended to link their identity with their ethnic heritage. Some felt strongly connected with their ethnic ties due to their citizenship status. For others, the lack of permanent residency or citizenship meant they have stronger ties to their ethnicity and birthplace. And in some cases, participants said they held on to their ethnic identity even after they became U.S. citizens. One woman emphasized that she will always be Taiwanese because she was born there, despite now living in the U.S.

For other participants, family origin played a central role in their identity, regardless of their status in the U.S. According to some of them, this attitude was heavily influenced by their memories and experiences in early childhood when they were still living in their countries of origin. These influences are so profound that even after decades of living in the U.S., some still feel the strong connection to their ethnic roots. And those with U.S.-born children talked about sending their kids to special educational programs in the U.S. to learn about their ethnic heritage.

“Yes, as for me, I hold that I am Khmer because our nationality cannot be deleted, our identity is Khmer as I hold that I am Khmer … so I try, even [with] my children today, I try to learn Khmer through Zoom through the so-called Khmer Parent Association.”

–Immigrant man of Cambodian origin in late 50s

Navigating life in America is an adjustment

Many participants pointed to cultural differences they have noticed between their ethnic culture and U.S. culture. One of the most distinct differences is in food. For some participants, their strong attachment to the unique dishes of their families and their countries of origin helps them maintain strong ties to their ethnic identity. One Sri Lankan participant shared that her roots are still in Sri Lanka, since she still follows Sri Lankan traditions in the U.S. such as preparing kiribath (rice with coconut milk) and celebrating Ramadan.

For other participants, interactions in social settings with those outside their own ethnic group circles highlighted cultural differences. One Bangladeshi woman talked about how Bengalis share personal stories and challenges with each other, while others in the U.S. like to have “small talk” about TV series or clothes.

Many immigrants in the focus groups have found it is easier to socialize when they are around others belonging to their ethnicity. When interacting with others who don’t share the same ethnicity, participants noted they must be more self-aware about cultural differences to avoid making mistakes in social interactions. Here, participants described the importance of learning to “fit in,” to avoid feeling left out or excluded. One Korean woman said:

“Every time I go to a party, I feel unwelcome. … In Korea, when I invite guests to my house and one person sits without talking, I come over and talk and treat them as a host. But in the United States, I have to go and mingle. I hate mingling so much. I have to talk and keep going through unimportant stories. In Korea, I am assigned to a dinner or gathering. I have a party with a sense of security. In America, I have nowhere to sit, and I don’t know where to go and who to talk to.”

–Immigrant woman of Korean origin in mid-40s

And a Bhutanese immigrant explained:

“In my case, I am not an American. I consider myself a Bhutanese. … I am a Bhutanese because I do not know American culture to consider myself as an American. It is very difficult to understand the sense of humor in America. So, we are pure Bhutanese in America.”

–Immigrant man of Bhutanese origin in early 40s

Language was also a key aspect of identity for the participants. Many immigrants in the focus groups said they speak a language other than English at home and in their daily lives. One Vietnamese man considered himself Vietnamese since his Vietnamese is better than his English. Others emphasized their English skills. A Bangladeshi participant felt that she was more accepted in the workplace when she does more “American” things and speaks fluent English, rather than sharing things from Bangladeshi culture. She felt that others in her workplace correlate her English fluency with her ability to do her job. For others born in the U.S., the language they speak at home influences their connection to their ethnic roots.

“Now if I go to my work and do show my Bengali culture and Asian culture, they are not going to take anything out of it. So, basically, I have to show something that they are interested in. I have to show that I am American, [that] I can speak English fluently. I can do whatever you give me as a responsibility. So, in those cases I can’t show anything about my culture.”

–Immigrant woman of Bangladeshi origin in late 20s

“Being bi-ethnic and tri-cultural creates so many unique dynamics, and … one of the dynamics has to do with … what it is to be Americanized. … One of the things that played a role into how I associate the identity is language. Now, my father never spoke Spanish to me … because he wanted me to develop a fluency in English, because for him, he struggled with English. What happened was three out of the four people that raised me were Khmer … they spoke to me in Khmer. We’d eat breakfast, lunch and dinner speaking Khmer. We’d go to the temple in Khmer with the language and we’d also watch videos and movies in Khmer. … Looking into why I strongly identify with the heritage, one of the reasons is [that] speaking that language connects to the home I used to have [as my families have passed away].”

–U.S.-born man of Cambodian origin in early 30s

Balancing between individualistic and collective thinking

For some immigrant participants, the main differences between themselves and others who are seen as “truly American” were less about cultural differences, or how people behave, and more about differences in “mindset,” or how people think. Those who identified strongly with their ethnicity discussed how their way of thinking is different from a “typical American.” To some, the “American mentality” is more individualistic, with less judgment on what one should do or how they should act. One immigrant Japanese man, for example, talked about how other Japanese-origin co-workers in the U.S. would work without taking breaks because it’s culturally inconsiderate to take a break while others continue working. However, he would speak up for himself and other workers when they were not taking any breaks. He attributed this to his “American” way of thinking, which encourages people to stand up for themselves.

Some U.S.-born participants who grew up in an immigrant family described the cultural clashes that happened between themselves and their immigrant parents. Participants talked about how the second generation (children of immigrant parents) struggles to pursue their own dreams while still living up to the traditional expectations of their immigrant parents.

“I feel like one of the biggest things I’ve seen, just like [my] Asian American friends overall, is the kind of family-individualistic clash … like wanting to do your own thing is like, is kind of instilled in you as an American, like go and … follow your dream. But then you just grow up with such a sense of like also wanting to be there for your family and to live up to those expectations, and I feel like that’s something that’s very pronounced in Asian cultures.”

–U.S.-born man of Indian origin in mid-20s

Discussions also highlighted differences about gender roles between growing up in America compared with elsewhere.

“As a woman or being a girl, because of your gender, you have to keep your mouth shut [and] wait so that they call on you for you to speak up. … I do respect our elders and I do respect hearing their guidance but I also want them to learn to hear from the younger person … because we have things to share that they might not know and that [are] important … so I like to challenge gender roles or traditional roles because it is something that [because] I was born and raised here [in America], I learn that we all have the equal rights to be able to speak and share our thoughts and ideas.”

U.S. born have mixed ties to their family’s heritage

“I think being Hmong is somewhat of being free, but being free of others’ perceptions of you or of others’ attempts to assimilate you or attempts to put pressure on you. I feel like being Hmong is to resist, really.” Pa Houa , documentary participant

How U.S.-born participants identify themselves depends on their familiarity with their own heritage, whom they are talking with, where they are when asked about their identity and what the answer is used for. Some mentioned that they have stronger ethnic ties because they are very familiar with their family’s ethnic heritage. Others talked about how their eating habits and preferred dishes made them feel closer to their ethnic identity. For example, one Korean participant shared his journey of getting closer to his Korean heritage because of Korean food and customs. When some participants shared their reasons for feeling closer to their ethnic identity, they also expressed a strong sense of pride with their unique cultural and ethnic heritage.

“I definitely consider myself Japanese American. I mean I’m Japanese and American. Really, ever since I’ve grown up, I’ve really admired Japanese culture. I grew up watching a lot of anime and Japanese black and white films. Just learning about [it], I would hear about Japanese stuff from my grandparents … myself, and my family having blended Japanese culture and American culture together.”

–U.S.-born man of Japanese origin in late 20s

Meanwhile, participants who were not familiar with their family’s heritage showed less connection with their ethnic ties. One U.S.-born woman said she has a hard time calling herself Cambodian, as she is “not close to the Cambodian community.” Participants with stronger ethnic ties talked about relating to their specific ethnic group more than the broader Asian group. Another woman noted that being Vietnamese is “more specific and unique than just being Asian” and said that she didn’t feel she belonged with other Asians. Some participants also disliked being seen as or called “Asian,” in part because they want to distinguish themselves from other Asian groups. For example, one Taiwanese woman introduces herself as Taiwanese when she can, because she had frequently been seen as Chinese.

Some in the focus groups described how their views of their own identities shifted as they grew older. For example, some U.S.-born and immigrant participants who came to the U.S. at younger ages described how their experiences in high school and the need to “fit in” were important in shaping their own identities. A Chinese woman put it this way:

“So basically, all I know is that I was born in the United States. Again, when I came back, I didn’t feel any barrier with my other friends who are White or Black. … Then I got a little confused in high school when I had trouble self-identifying if I am Asian, Chinese American, like who am I. … Should I completely immerse myself in the American culture? Should I also keep my Chinese identity and stuff like that? So yeah, that was like the middle of that mist. Now, I’m pretty clear about myself. I think I am Chinese American, Asian American, whatever people want.”

–U.S.-born woman of Chinese origin in early 20s

Identity is influenced by birthplace

“I identified myself first and foremost as American. Even on the forms that you fill out that says, you know, ‘Asian’ or ‘Chinese’ or ‘other,’ I would check the ‘other’ box, and I would put ‘American Chinese’ instead of ‘Chinese American.’” Brent , documentary participant

When talking about what it means to be “American,” participants offered their own definitions. For some, “American” is associated with acquiring a distinct identity alongside their ethnic or racial backgrounds, rather than replacing them. One Indian participant put it this way:

“I would also say [that I am] Indian American just because I find myself always bouncing between the two … it’s not even like dual identity, it just is one whole identity for me, like there’s not this separation. … I’m doing [both] Indian things [and] American things. … They use that term like ABCD … ‘American Born Confused Desi’ … I don’t feel that way anymore, although there are those moments … but I would say [that I am] Indian American for sure.”

–U.S.-born woman of Indian origin in early 30s

Meanwhile, some U.S.-born participants view being American as central to their identity while also valuing the culture of their family’s heritage.

Many immigrant participants associated the term “American” with immigration status or citizenship. One Taiwanese woman said she can’t call herself American since she doesn’t have a U.S. passport. Notably, U.S. citizenship is an important milestone for many immigrant participants, giving them a stronger sense of belonging and ultimately calling themselves American. A Bangladeshi participant shared that she hasn’t received U.S. citizenship yet, and she would call herself American after she receives her U.S. passport.

Other participants gave an even narrower definition, saying only those born and raised in the United States are truly American. One Taiwanese woman mentioned that her son would be American since he was born, raised and educated in the U.S. She added that while she has U.S. citizenship, she didn’t consider herself American since she didn’t grow up in the U.S. This narrower definition has implications for belonging. Some immigrants in the groups said they could never become truly American since the way they express themselves is so different from those who were born and raised in the U.S. A Japanese woman pointed out that Japanese people “are still very intimidated by authorities,” while those born and raised in America give their opinions without hesitation.

“As soon as I arrived, I called myself a Burmese immigrant. I had a green card, but I still wasn’t an American citizen. … Now I have become a U.S. citizen, so now I am a Burmese American.”

–Immigrant man of Burmese origin in mid-30s

“Since I was born … and raised here, I kind of always view myself as American first who just happened to be Asian or Chinese. So I actually don’t like the term Chinese American or Asian American. I’m American Asian or American Chinese. I view myself as American first.”

–U.S.-born man of Chinese origin in early 60s

“[I used to think of myself as] Filipino, but recently I started saying ‘Filipino American’ because I got [U.S.] citizenship. And it just sounds weird to say Filipino American, but I’m trying to … I want to accept it. I feel like it’s now marry-able to my identity.”

–Immigrant woman of Filipino origin in early 30s

For others, American identity is about the process of ‘becoming’ culturally American

A Venn diagram showing how participants in the focus group study described their racial or ethnic identity overlaps with their American identity

Immigrant participants also emphasized how their experiences and time living in America inform their views of being an “American.” As a result, some started to see themselves as Americans after spending more than a decade in the U.S. One Taiwanese man said he considers himself American because, after living in the U.S. for over 52 years, he knows more about the U.S. than about Taiwan.

But for other immigrant participants, the process of “becoming” American is not about how long they have lived in the U.S., but rather how familiar they are with American culture and their ability to speak English with little to no accent. This is especially true for those whose first language is not English, as learning to speak it without an accent can be a big challenge. One Bangladeshi participant shared that his pronunciation of “hot water” was very different from American English, resulting in confusion. By contrast, those who were more confident in their English skills felt they could better understand American culture and values, leading them to a stronger connection with an American identity.

“[My friends and family tease me for being Americanized when I go back to Japan.] I think I seem a little different to people who live in Japan. I don’t think they mean anything bad, and they [were] just joking, because I already know that I seem a little different to people who live in Japan.”

–Immigrant man of Japanese origin in mid-40s

“I value my Hmong culture, and language, and ethnicity, but I also do acknowledge, again, that I was born here in America and I’m grateful that I was born here, and I was given opportunities that my parents weren’t given opportunities for.”

–U.S.-born woman of Hmong origin in early 30s

During the focus group discussions about identity, a recurring theme emerged about the difference between how participants saw themselves and how others see them. When asked to elaborate on their experiences and their points of view, some participants shared experiences they had with people misidentifying their race or ethnicity. Others talked about their frustration with being labeled the “model minority.” In all these discussions, participants shed light on the negative impacts that mistaken assumptions and labels had on their lives.

All people see is ‘Asian’

For many, interactions with others (non-Asians and Asians alike) often required explaining their backgrounds, reacting to stereotypes, and for those from smaller origin groups in particular, correcting the misconception that being “Asian” means you come from one of the larger Asian ethnic groups. Several participants remarked that in their own experiences, when others think about Asians, they tend to think of someone who is Chinese. As one immigrant Filipino woman put it, “Interacting with [non-Asians in the U.S.], it’s hard. … Well, first, I look Spanish. I mean, I don’t look Asian, so would you guess – it’s like they have a vision of what an Asian [should] look like.” Similarly, an immigrant Indonesian man remarked how Americans tended to see Asians primarily through their physical features, which not all Asian groups share.

Several participants also described how the tendency to view Asians as a monolithic group can be even more common in the wake of the COVID-19 pandemic.

“The first [thing people think of me as] is just Chinese. ‘You guys are just Chinese.’ I’m not the only one who felt [this] after the COVID-19 outbreak. ‘Whether you’re Japanese, Korean, or Southeast Asian, you’re just Chinese [to Americans]. I should avoid you.’ I’ve felt this way before, but I think I’ve felt it a bit more after the COVID-19 outbreak.”

–Immigrant woman of Korean origin in early 30s

At the same time, other participants described their own experiences trying to convince others that they are Asian or Asian American. This was a common experience among Southeast Asian participants.

“I have to convince people I’m Asian, not Middle Eastern. … If you type in Asian or you say Asian, most people associate it with Chinese food, Japanese food, karate, and like all these things but then they don’t associate it with you.”

–U.S.-born man of Pakistani origin in early 30s

The model minority myth and its impact

“I’ve never really done the best academically, compared to all my other Asian peers too. I never really excelled. I wasn’t in honors. … Those stereotypes, I think really [have] taken a toll on my self-esteem.” Diane , documentary participant

Across focus groups, immigrant and U.S.-born participants described the challenges of the seemingly positive stereotypes of Asians as intelligent, gifted in technical roles and hardworking. Participants often referred to this as the “model minority myth.”

The label “model minority” was coined in the 1960s and has been used to characterize Asian Americans as financially and educationally successful and hardworking when compared with other groups. However, for many Asians living in the United States, these characterizations do not align with their lived experiences or reflect their socioeconomic backgrounds. Indeed, among Asian origin groups in the U.S., there are wide differences in economic and social experiences. 

Academic research on the model minority myth has pointed to its impact not only on Asian Americans but also on other racial and ethnic groups in the U.S., especially Black Americans. Some argue that the myth has been used to justify policies that overlook the historical circumstances and impacts of colonialism, slavery, discrimination and segregation on other non-White racial and ethnic groups.

Many participants noted ways in which the model minority myth has been harmful. For some, expectations based on the myth didn’t match their own experiences of coming from impoverished communities. Some also recalled experiences at school when they struggled to meet their teachers’ expectations in math and science.

“As an Asian person, I feel like there’s that stereotype that Asian students are high achievers academically. They’re good at math and science. … I was a pretty mediocre student, and math and science were actually my weakest subjects, so I feel like it’s either way you lose. Teachers expect you to fit a certain stereotype and if you’re not, then you’re a disappointment, but at the same time, even if you are good at math and science, that just means that you’re fitting a stereotype. It’s [actually] your own achievement, but your teachers might think, ‘Oh, it’s because they’re Asian,’ and that diminishes your achievement.”

–U.S.-born woman of Korean origin in late 20s

Some participants felt that even when being Asian worked in their favor in the job market, they encountered stereotypes that “Asians can do quality work with less compensation” or that “Asians would not complain about anything at work.”

“There is a joke from foreigners and even Asian Americans that says, ‘No matter what you do, Asians always do the best.’ You need to get A, not just B-plus. Otherwise, you’ll be a disgrace to the family. … Even Silicon Valley hires Asian because [an] Asian’s wage is cheaper but [they] can work better. When [work] visa overflow happens, they hire Asians like Chinese and Indian to work in IT fields because we are good at this and do not complain about anything.”

–Immigrant man of Thai origin in early 40s

Others expressed frustration that people were placing them in the model minority box. One Indian woman put it this way:

“Indian people and Asian people, like … our parents or grandparents are the ones who immigrated here … against all odds. … A lot of Indian and Asian people have succeeded and have done really well for themselves because they’ve worked themselves to the bone. So now the expectations [of] the newer generations who were born here are incredibly unrealistic and high. And you get that not only from your family and the Indian community, but you’re also getting it from all of the American people around you, expecting you to be … insanely good at math, play an instrument, you know how to do this, you know how to do that, but it’s not true. And it’s just living with those expectations, it’s difficult.”

–U.S.-born woman of Indian origin in early 20s

Whether U.S. born or immigrants, Asians are often seen by others as foreigners

“Being only not quite 10 years old, it was kind of exciting to ride on a bus to go someplace. But when we went to Pomona, the assembly center, we were stuck in one of the stalls they used for the animals.” Tokiko , documentary participant

Across all focus groups, participants highlighted a common question they are asked in America when meeting people for the first time: “Where are you really from?” For participants, this question implied that people think they are “foreigners,” even though they may be longtime residents or citizens of the United States or were born in the country. One man of Vietnamese origin shared his experience with strangers who assumed that he and his friends are North Korean. Perhaps even more hurtful, participants mentioned that this meant people had a preconceived notion of what an “American” is supposed to look like, sound like or act like. One Chinese woman said that White Americans treated people like herself as outsiders based on her skin color and appearance, even though she was raised in the U.S.

Many focus group participants also acknowledged the common stereotype of treating Asians as “forever foreigners.” Some immigrant participants said they felt exhausted from constantly being asked this question by people even when they speak perfect English with no accent. During the discussion, a Korean immigrant man recalled that someone had said to him, “You speak English well, but where are you from?” One Filipino participant shared her experience during the first six months in the U.S.:

“You know, I spoke English fine. But there were certain things that, you know, people constantly questioning you like, oh, where are you from? When did you come here? You know, just asking about your experience to the point where … you become fed up with it after a while.”

–Immigrant woman of Filipino origin in mid-30s

U.S.-born participants also talked about experiences when others asked where they are from. Many shared that they would not talk about their ethnic origin right away when answering such a question because it often led to misunderstandings and assumptions that they are immigrants.

“I always get that question of, you know, ‘Where are you from?’ and I’m like, ‘I’m from America.’ And then they’re like, ‘No. Where are you from-from ?’ and I’m like, ‘Yeah, my family is from Pakistan,’ so it’s like I always had like that dual identity even though it’s never attached to me because I am like, of Pakistani descent.”

–U.S.-born man of Pakistani origin in early 20s

One Korean woman born in the U.S. said that once people know she is Korean, they ask even more offensive questions such as “Are you from North or South Korea?” or “Do you still eat dogs?”

In a similar situation, this U.S.-born Indian woman shared her responses:

“I find that there’s a, ‘So but where are you from?’ Like even in professional settings when they feel comfortable enough to ask you. ‘So – so where are you from?’ ‘Oh, I was born in [names city], Colorado. Like at [the hospital], down the street.’ ‘No, but like where are you from?’ ‘My mother’s womb?’”

–U.S.-born woman of Indian origin in early 40s

Ignorance and misinformation about Asian identity can lead to contentious encounters

“I have dealt with kids who just gave up on their Sikh identity, cut their hair and groomed their beard and everything. They just wanted to fit in and not have to deal with it, especially [those] who are victim or bullied in any incident.” Surinder , documentary participant

In some cases, ignorance and misinformation about Asians in the U.S. lead to inappropriate comments or questions and uncomfortable or dangerous situations. Participants shared their frustration when others asked about their country of origin, and they then had to explain their identity or correct misunderstandings or stereotypes about their background. At other times, some participants faced ignorant comments about their ethnicity, which sometimes led to more contentious encounters. For example, some Indian or Pakistani participants talked about the attacks or verbal abuse they experienced from others blaming them for the 9/11 terrorist attacks. Others discussed the racial slurs directed toward them since the COVID-19 pandemic in 2020. Some Japanese participants recalled their families losing everything and being incarcerated during World War II and the long-term effect it had on their lives.

“I think like right now with the coronavirus, I think we’re just Chinese, Chinese American, well, just Asian American or Asians in general, you’re just going through the same struggles right now. Like everyone is just blaming whoever looks Asian about the virus. You don’t feel safe.”

–U.S.-born man of Chinese origin in early 30s

“At the beginning of the pandemic, a friend and I went to celebrate her birthday at a club and like these guys just kept calling us COVID.”

–U.S.-born woman of Korean origin in early 20s

“There [were] a lot of instances after 9/11. One day, somebody put a poster about 9/11 [in front of] my business. He was wearing a gun. … On the poster, it was written ‘you Arabs, go back to your country.’ And then someone came inside. He pointed his gun at me and said ‘Go back to your country.’”

–Immigrant man of Pakistani origin in mid-60s

“[My parents went through the] internment camps during World War II. And my dad, he was in high school, so he was – they were building the camps and then he was put into the Santa Anita horse track place, the stables there. And then they were sent – all the Japanese Americans were sent to different camps, right, during World War II and – in California. Yeah, and they lost everything, yeah.”

–U.S.-born woman of Japanese origin in mid-60s

As focus group participants contemplated their identity during the discussions, many talked about their sense of belonging in America. Although some felt frustrated with people misunderstanding their ethnic heritage, they didn’t take a negative view of life in America. Instead, many participants – both immigrant and U.S. born – took pride in their unique cultural and ethnic backgrounds. In these discussions, people gave their own definitions of America as a place with a diverse set of cultures, with their ethnic heritage being a part of it.

Taking pride in their unique cultures

“Being a Pakistani American, I’m proud. … Because I work hard, and I make true my dreams from here.” Shahid , documentary participant

Despite the challenges of adapting to life in America for immigrant participants or of navigating their dual cultural identity for U.S.-born ones, focus group participants called America their home. And while participants talked about their identities in different ways – ethnic identity, racial (Asian) identity, and being American – they take pride in their unique cultures. Many also expressed a strong sense of responsibility to give back or support their community, sharing their cultural heritage with others on their own terms.

“Right now it has been a little difficult. I think it has been for all Asians because of the COVID issue … but I’m glad that we’re all here [in America]. I think we should be proud to be here. I’m glad that our families have traveled here, and we can help make life better for communities, our families and ourselves. I think that’s really a wonderful thing. We can be those role models for a lot of the future, the younger folks. I hope that something I did in the last years will have impacted either my family, friends or students that I taught in other community things that I’ve done. So you hope that it helps someplace along the line.”

“I am very proud of my culture. … There is not a single Bengali at my workplace, but people know the name of my country. Maybe many years [later] – educated people know all about the country. So, I don’t have to explain that there is a small country next to India and Nepal. It’s beyond saying. People after all know Bangladesh. And there are so many Bengali present here as well. So, I am very proud to be a Bangladeshi.”

Where home is

When asked about the definition of home, some immigrant participants said home is where their families are located. Immigrants in the focus groups came to the United States by various paths, whether through work opportunities, reuniting with family or seeking a safe haven as refugees. Along their journey, some received support from family members, their local community or other individuals, while others overcame challenges by themselves. Either way, they take pride in establishing their home in America and can feel hurt when someone tells them to “go back to your country.” In response, one Laotian woman in her mid-40s said, “This is my home. My country. Go away.”

“If you ask me personally, I view my home as my house … then I would say my house is with my family because wherever I go, I cannot marry if I do not have my family so that is how I would answer.”

–Immigrant man of Hmong origin in late 30s

“[If somebody yelled at me ‘go back to your country’] I’d feel angry because this is my country! I live here. America is my country. I grew up here and worked here … I’d say, ‘This is my country! You go back to your country! … I will not go anywhere. This is my home. I will live here.’ That’s what I’d say.”

–Immigrant woman of Laotian origin in early 50s

For many, being ‘American’ means blending their unique cultural and ethnic heritage with U.S. culture

“I want to teach my children two traditions – one American and one Vietnamese – so they can compare and choose for themselves the best route in life.” Helen , documentary participant (translated from Vietnamese)

Both U.S.-born and immigrant participants in the focus groups shared their experiences of navigating a dual cultural environment between their ethnic heritage and American culture. A common thread that emerged was that being Asian in America is a process of blending two or more identities as one.

“Yeah, I want to say that’s how I feel – because like thinking about it, I would call my dad Lao but I would call myself Laotian American because I think I’m a little more integrated in the American society and I’ve also been a little more Americanized, compared to my dad. So that’s how I would see it.”

–U.S.-born man of Laotian origin in late 20s

“I mean, Bangladeshi Americans who are here, we are carrying Bangladeshi culture, religion, food. I am also trying to be Americanized like the Americans. Regarding language, eating habits.”

–Immigrant man of Bangladeshi origin in mid-50s

“Just like there is Chinese American, Mexican American, Japanese American, Italian American, so there is Indian American. I don’t want to give up Indianness. I am American by nationality, but I am Indian by birth. So whenever I talk, I try to show both the flags as well, both Indian and American flags. Just because you make new relatives but don’t forget the old relatives.”

–Immigrant man of Indian origin in late 40s

Pew Research Center designed these focus groups to better understand how members of an ethnically diverse Asian population think about their place in America and life here. By including participants of different languages, immigration or refugee experiences, educational backgrounds, and income levels, this focus group study aimed to capture in people’s own words what it means to be Asian in America. The discussions in these groups may or may not resonate with all Asians living in the United States. Browse excerpts from our focus groups with the interactive quote sorter below, view a video documentary focused on the topics discussed in the focus groups, or tell us your story of belonging in America via social media. The focus group project is part of a broader research project studying the diverse experiences of Asians living in the U.S.

Video documentary

Videos throughout the data essay illustrate what focus group participants discussed. Those recorded in these videos did not participate in the focus groups but were sampled to have similar demographic characteristics and thematically relevant stories.

Watch the full video documentary and watch additional shorter video clips related to the themes of this data essay.

This cross-ethnic, comparative qualitative research project explores identity, economic mobility, representation, and experiences of immigration and discrimination among the Asian population in the United States. The analysis is based on 66 focus groups we conducted virtually in the fall of 2021, with 264 participants from across the U.S. More information about the groups and the analysis can be found in this appendix.

Pew Research Center is a subsidiary of The Pew Charitable Trusts, its primary funder. This data essay was funded by The Pew Charitable Trusts, with generous support from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation; the Robert Wood Johnson Foundation; the Henry Luce Foundation; The Wallace H. Coulter Foundation; The Dirk and Charlene Kabcenell Foundation; The Long Family Foundation; Lu-Hebert Fund; Gee Family Foundation; Joseph Cotchett; the Julian Abdey and Sabrina Moyle Charitable Fund; and Nanci Nishimura.

The accompanying video clips and video documentary were made possible by The Pew Charitable Trusts, with generous support from The Sobrato Family Foundation and The Long Family Foundation.

We would also like to thank the Leaders Forum for its thought leadership and valuable assistance in helping make this study possible. This is a collaborative effort based on the input and analysis of a number of individuals at Pew Research Center and outside experts.

Copyright 2024 Pew Research Center

Title: A Review of Deep Learning-Based Information Fusion Techniques for Multimodal Medical Image Classification

Abstract: Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, handling incomplete multimodal data management, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.
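The three fusion schemes named in the abstract can be illustrated with a minimal, framework-free sketch. This is an illustrative toy example only, not code from the paper under review: the encoder, the weight matrices, and the modality names (`mri`, `ct`) are all invented for demonstration, and real fusion networks would use learned deep encoders rather than a single linear layer.

```python
import numpy as np

def encode(x, w):
    """Toy per-modality encoder: a linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)

def input_fusion(mri, ct, w):
    # Input fusion: modalities are concatenated *before* any encoding,
    # and a single shared network processes the combined input.
    return encode(np.concatenate([mri, ct]), w)

def intermediate_fusion(mri, ct, w_mri, w_ct, w_head):
    # Intermediate fusion: each modality is encoded separately and the
    # feature vectors are merged mid-network before a shared head.
    fused = np.concatenate([encode(mri, w_mri), encode(ct, w_ct)])
    return encode(fused, w_head)

def output_fusion(scores_mri, scores_ct):
    # Output (decision-level) fusion: each modality produces its own
    # class scores, which are combined at the end, here by averaging.
    return 0.5 * (scores_mri + scores_ct)
```

The schemes differ only in *where* the modalities meet: before encoding, mid-network, or at the decision stage, which is why architecture choice hinges on how correlated the modalities' features are.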

MIT News | Massachusetts Institute of Technology

Researchers detect a new molecule in space

New research from the group of MIT Professor Brett McGuire has revealed the presence of a previously unknown molecule in space. The team’s open-access paper, “Rotational Spectrum and First Interstellar Detection of 2-Methoxyethanol Using ALMA Observations of NGC 6334I,” appears in the April 12 issue of The Astrophysical Journal Letters.

Zachary T.P. Fried, a graduate student in the McGuire group and the lead author of the publication, worked to assemble a puzzle composed of pieces collected from across the globe, extending beyond MIT to France, Florida, Virginia, and Copenhagen, to achieve this exciting discovery.

“Our group tries to understand what molecules are present in regions of space where stars and solar systems will eventually take shape,” explains Fried. “This allows us to piece together how chemistry evolves alongside the process of star and planet formation. We do this by looking at the rotational spectra of molecules, the unique patterns of light they give off as they tumble end-over-end in space. These patterns are fingerprints (barcodes) for molecules. To detect new molecules in space, we first must have an idea of what molecule we want to look for, then we can record its spectrum in the lab here on Earth, and then finally we look for that spectrum in space using telescopes.”

Searching for molecules in space

The McGuire Group has recently begun to utilize machine learning to suggest good target molecules to search for. In 2023, one of these machine learning models suggested the researchers target a molecule known as 2-methoxyethanol. 

“There are a number of 'methoxy' molecules in space, like dimethyl ether, methoxymethanol, ethyl methyl ether, and methyl formate, but 2-methoxyethanol would be the largest and most complex ever seen,” says Fried. To detect this molecule using radiotelescope observations, the group first needed to measure and analyze its rotational spectrum on Earth. The researchers combined experiments from the University of Lille (Lille, France), the New College of Florida (Sarasota, Florida), and the McGuire lab at MIT to measure this spectrum over a broadband region of frequencies ranging from the microwave to sub-millimeter wave regimes (approximately 8 to 500 gigahertz). 

The data gleaned from these measurements permitted a search for the molecule using Atacama Large Millimeter/submillimeter Array (ALMA) observations toward two separate star-forming regions: NGC 6334I and IRAS 16293-2422B. Members of the McGuire group analyzed these telescope observations alongside researchers at the National Radio Astronomy Observatory (Charlottesville, Virginia) and the University of Copenhagen, Denmark. 

“Ultimately, we observed 25 rotational lines of 2-methoxyethanol that lined up with the molecular signal observed toward NGC 6334I (the barcode matched!), thus resulting in a secure detection of 2-methoxyethanol in this source,” says Fried. “This allowed us to then derive physical parameters of the molecule toward NGC 6334I, such as its abundance and excitation temperature. It also enabled an investigation of the possible chemical formation pathways from known interstellar precursors.”

Looking forward

Molecular discoveries like this one help the researchers better understand the development of molecular complexity in space during the star formation process. 2-Methoxyethanol, which contains 13 atoms, is quite large by interstellar standards: as of 2021, only six species larger than 13 atoms had been detected outside the solar system, many by McGuire’s group, and all of them ringed structures.

“Continued observations of large molecules and subsequent derivations of their abundances allows us to advance our knowledge of how efficiently large molecules can form and by which specific reactions they may be produced,” says Fried. “Additionally, since we detected this molecule in NGC 6334I but not in IRAS 16293-2422B, we were presented with a unique opportunity to look into how the differing physical conditions of these two sources may be affecting the chemistry that can occur.”



