Quantitative text analysis

Nielbo, Kristoffer L.; Karsdorp, Folgert; Wevers, Melvin; Lassche, Alie; Baglini, Rebekah B.; Kestemont, Mike; Tahmasebi, Nina

doi:10.1038/s43586-024-00302-w

Primer
Published: 11 April 2024

Nature Reviews Methods Primers volume 4, Article number: 25 (2024)

Abstract

Text analysis has evolved substantially since its inception, moving from manual qualitative assessment to sophisticated quantitative and computational methods. Beginning in the late twentieth century, a surge in the use of computational techniques reshaped the landscape of text analysis, catalysed by advances in computational power and database technologies. Researchers in fields ranging from history to medicine now use quantitative methodologies, particularly machine learning, to extract insights from massive textual data sets. This transformation can be described in three methodological stages: feature-based models, representation learning models and generative models. Although sequential, these stages are complementary, each addressing analytical challenges in text analysis. The progression from feature-based models, which require manual feature engineering, to contemporary generative models, such as GPT-4 and Llama2, signifies a change in the workflow, scale and computational infrastructure of quantitative text analysis. This Primer presents a detailed introduction to some of these developments, offering insights into the methods, principles and applications pertinent to researchers embarking on quantitative text analysis, especially within the field of machine learning.


**Fig. 1: Schematic representation of three predominant approaches in quantitative text analysis.**

**Fig. 2: Comparative workflows of model training (upper section) and model application or inference (lower section) for a text classification task in the context of quantitative text analysis.**

Acknowledgements

K.L.N. was supported by grants from the Velux Foundation (grant title: FabulaNET) and the Carlsberg Foundation (grant number: CF23-1583). N.T. was supported by the research programme Change is Key!, funded by Riksbankens Jubileumsfond (grant number: M21-0021).

Author information

Authors and Affiliations

Center for Humanities Computing, Aarhus University, Aarhus, Denmark
Kristoffer L. Nielbo
Meertens Institute, Royal Netherlands Academy of Arts and Sciences, Amsterdam, The Netherlands
Folgert Karsdorp
Department of History, University of Amsterdam, Amsterdam, The Netherlands
Melvin Wevers
Institute of History, Leiden University, Leiden, The Netherlands
Alie Lassche
Department of Linguistics, Aarhus University, Aarhus, Denmark
Rebekah B. Baglini
Department of Literature, University of Antwerp, Antwerp, Belgium
Mike Kestemont
Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Gothenburg, Sweden
Nina Tahmasebi

Contributions

Introduction (K.L.N. and F.K.); Experimentation (K.L.N., F.K., M.K. and R.B.B.); Results (F.K., M.K., R.B.B. and N.T.); Applications (K.L.N., M.W. and A.L.); Reproducibility and data deposition (K.L.N. and A.L.); Limitations and optimizations (M.W. and N.T.); Outlook (M.W. and N.T.); overview of the Primer (K.L.N.).

Corresponding author

Correspondence to Kristoffer L. Nielbo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Methods Primers thanks F. Jannidis, L. Nelson, T. Tangherlini and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Application programming interface: A set of rules, protocols and tools for building software and applications, which programs can query to obtain data.
Bag-of-words model: A model that represents text as a numerical vector based on word frequency or presence. Each text is mapped onto a predefined vocabulary, with one vector entry per vocabulary term; word order is disregarded.
Computational linguistics: The intersection of linguistics, computer science and artificial intelligence that is concerned with computational aspects of human language. It involves the development of algorithms and models that enable computers to understand, interpret and generate human language.
Corpus linguistics: The branch of linguistics that studies language as expressed in corpora (samples of real-world text) and uses computational methods to analyse large collections of textual data.
Data augmentation: A technique used to increase the size and diversity of language data sets to train machine-learning models.
Data science: The application of statistical, analytical and computational techniques to extract insights and knowledge from data.
Fleiss’ kappa (κ): A statistical measure used to assess the reliability of agreement between multiple raters when assigning categorical ratings to a number of items.
Frequency bias: A phenomenon in which elements that are over-represented in a data set receive disproportionate attention or influence in the analysis.
Information retrieval: A field of study focused on the science of searching for information within documents and retrieving relevant documents from large databases.
Lemmatization: A text normalization technique used in natural language processing in which words are reduced to their base or dictionary form.
Machine learning: In quantitative text analysis, machine learning refers to the application of algorithms and statistical models to enable computers to identify patterns, trends and relationships in textual data without being explicitly programmed. It involves training these models on large data sets to learn and infer from the structure and nuances of language.
Natural language processing: A field of artificial intelligence using computational methods for analysing and generating natural language and speech.
Recommender system: A type of information filtering system that seeks to predict user preferences and recommend items (such as books, movies and products) that are likely to be of interest to the user.
Representation learning: A set of techniques in machine learning in which the system learns to automatically identify and extract useful features or representations from raw data.
Stemming: A text normalization technique used in natural language processing, in which words are reduced to a stem, typically by stripping affixes; unlike in lemmatization, the result need not be a dictionary word.
Supervised learning: A machine-learning approach in which models are trained on labelled data, such that each training text is paired with an output label. The model learns to predict the output from the input data, with the aim of generalizing from the training set to unseen data.
Transformer: A deep learning model that handles sequential data, such as text, using mechanisms called attention and self-attention, allowing it to weigh the importance of different parts of the input data. In quantitative text analysis, transformers are used for tasks such as sentiment analysis, text classification and language translation, offering superior performance in understanding context and nuances in large data sets.
Unsupervised learning: A type of machine learning in which models are trained on data without output labels. The goal is to discover underlying patterns, groupings or structures within the data, often through clustering or dimensionality reduction techniques.
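
The bag-of-words entry above can be illustrated with a minimal sketch in plain Python. The tokenizer (lowercased alphabetic runs), the function names and the two example sentences are assumptions made for illustration only; they are not taken from the Primer, and practical work would more likely rely on an established library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Collect the sorted set of all tokens seen in the corpus."""
    return sorted({token for text in texts for token in tokenize(text)})


def bag_of_words(text, vocabulary, binary=False):
    """Map a text onto the vocabulary: one count (or 0/1 presence flag) per term."""
    counts = Counter(tokenize(text))
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]


# Hypothetical two-document corpus.
docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each vector has the same length as the vocabulary, so texts of different lengths become directly comparable; the price is that word order is lost.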
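
As a worked illustration of the Fleiss’ kappa entry above, the following sketch computes κ from raw annotations using the standard formula κ = (P̄ − P̄ₑ) / (1 − P̄ₑ), where P̄ is the mean observed agreement per item and P̄ₑ is the agreement expected by chance. The function name and the input layout (one list of labels per item, one label per rater) are assumptions chosen for this example, not something specified in the Primer.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: a list of items, each a list of category
    labels with exactly one label per rater (same number of raters per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement per item: share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.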

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nielbo, K.L., Karsdorp, F., Wevers, M. et al. Quantitative text analysis. Nat Rev Methods Primers 4, 25 (2024). https://doi.org/10.1038/s43586-024-00302-w

Accepted: 21 February 2024
Published: 11 April 2024
DOI: https://doi.org/10.1038/s43586-024-00302-w
