  • Primer

Quantitative text analysis

Abstract

Text analysis has undergone substantial evolution since its inception, moving from manual qualitative assessments to sophisticated quantitative and computational methods. Beginning in the late twentieth century, a surge in the use of computational techniques reshaped the landscape of text analysis, catalysed by advances in computational power and database technologies. Researchers in various fields, from history to medicine, now use quantitative methodologies, particularly machine learning, to extract insights from massive textual data sets. This transformation can be described in three discernible methodological stages: feature-based models, representation learning models and generative models. Although sequential, these stages are complementary, each addressing distinct analytical challenges in text analysis. The progression from feature-based models that require manual feature engineering to contemporary generative models, such as GPT-4 and Llama 2, signifies a change in the workflow, scale and computational infrastructure of quantitative text analysis. This Primer presents a detailed introduction to some of these developments, offering insights into the methods, principles and applications pertinent to researchers embarking on quantitative text analysis, especially within the field of machine learning.

Fig. 1: Schematic representation of three predominant approaches in quantitative text analysis.
Fig. 2: Comparative workflows of model training (upper section) and model application or inference (lower section) for a text classification task in quantitative text analysis.

Acknowledgements

K.L.N. was supported by grants from the Velux Foundation (grant title: FabulaNET) and the Carlsberg Foundation (grant number: CF23-1583). N.T. was supported by the research programme Change is Key!, funded by Riksbankens Jubileumsfond (grant number: M21-0021).

Author information

Contributions

Introduction (K.L.N. and F.K.); Experimentation (K.L.N., F.K., M.K. and R.B.B.); Results (F.K., M.K., R.B.B. and N.T.); Applications (K.L.N., M.W. and A.L.); Reproducibility and data deposition (K.L.N. and A.L.); Limitations and optimizations (M.W. and N.T.); Outlook (M.W. and N.T.); overview of the Primer (K.L.N.).

Corresponding author

Correspondence to Kristoffer L. Nielbo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Methods Primers thanks F. Jannidis, L. Nelson, T. Tangherlini and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Application programming interface

A set of rules, protocols and tools for building software and applications, which programs can query to obtain data.

Bag-of-words model

A model that represents text as a numerical vector based on word frequency or presence. Each text is mapped onto a predefined vocabulary, with each element of the vector corresponding to one vocabulary word.
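
As an illustration, a minimal sketch in Python of how a bag-of-words vector might be built. The function names `build_vocabulary` and `bag_of_words`, the lower-casing and the whitespace tokenization are assumptions made for this example; they are not taken from the Primer.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect the sorted set of lower-cased, whitespace-split tokens."""
    return sorted({token for text in texts for token in text.lower().split()})


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, used only for illustration.
texts = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(texts)
vectors = [bag_of_words(text, vocabulary) for text in texts]
```

Word order is discarded: only the counts per vocabulary word survive, which is what makes the representation simple to compute and compare.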

Computer linguistics

The intersection of linguistics, computer science and artificial intelligence that is concerned with computational aspects of human language. It involves the development of algorithms and models that enable computers to understand, interpret and generate human language.

Corpus linguistics

The branch of linguistics that studies language as expressed in corpora (samples of real-world text) and uses computational methods to analyse large collections of textual data.

Data augmentation

A technique used to increase the size and diversity of language data sets to train machine-learning models.

Data science

The application of statistical, analytical and computational techniques to extract insights and knowledge from data.

Fleiss’ kappa (κ)

A statistical measure used to assess the reliability of agreement between multiple raters when assigning categorical ratings to a number of items.
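
A minimal sketch in Python of the standard Fleiss' kappa computation, κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean observed agreement per item and P̄e is the agreement expected by chance. The function name `fleiss_kappa` and the input layout (one row per item, one column per category, each cell holding the number of raters who chose that category) are assumptions made for this example.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = raters assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(ratings[0])
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1:
        raise ValueError("kappa is undefined when only one category is used")

    return (observed - expected) / (1 - expected)


# Hypothetical annotation counts: 2 items, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate agreement well beyond chance, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.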

Frequency bias

A phenomenon in which elements that are over-represented in a data set receive disproportionate attention or influence in the analysis.

Information retrieval

A field of study focused on the science of searching for information within documents and retrieving relevant documents from large databases.

Lemmatization

A text normalization technique used in natural language processing in which words are reduced to their base or dictionary form.

Machine learning

In quantitative text analysis, machine learning refers to the application of algorithms and statistical models to enable computers to identify patterns, trends and relationships in textual data without being explicitly programmed. It involves training these models on large data sets to learn and infer from the structure and nuances of language.

Natural language processing

A field of artificial intelligence using computational methods for analysing and generating natural language and speech.

Recommender system

A type of information filtering system that seeks to predict user preferences and recommend items (such as books, movies and products) that are likely to be of interest to the user.

Representation learning

A set of techniques in machine learning in which the system learns to automatically identify and extract useful features or representations from raw data.

Stemming

A text normalization technique used in natural language processing, in which words are reduced to their base or root form.

Supervised learning

A machine-learning approach in which models are trained on labelled data, such that each training text is paired with an output label. The model learns to predict the output from the input data, with the aim of generalizing from the training set to unseen data.
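
A minimal sketch in Python of supervised text classification, using a multinomial naive Bayes classifier with add-one smoothing as a stand-in for any supervised model. The choice of classifier, the function names `train_naive_bayes` and `predict`, and the toy training texts are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_texts):
    """Fit word and label counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_texts:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothed log-likelihood of the token.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training set, used only for illustration.
training = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("boring film hated it", "negative"),
    ("terrible acting boring story", "negative"),
]
model = train_naive_bayes(training)
```

Neither `predict(model, "great story")` nor `predict(model, "boring and terrible")` uses a text from the training set; assigning labels to such unseen inputs is the generalization step the definition refers to.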

Transformer

A deep learning model that handles sequential data, such as text, using mechanisms called attention and self-attention, allowing it to weigh the importance of different parts of the input data. In quantitative text analysis, transformers are used for tasks such as sentiment analysis, text classification and language translation, offering superior performance in understanding context and nuances in large data sets.
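
A minimal sketch in Python of scaled dot-product self-attention, the weighting mechanism described above. It is deliberately simplified: queries, keys and values are all set to the raw input vectors, whereas actual transformers apply learned projection matrices and several attention heads in parallel. The function names and the toy token vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output = [
            sum(weight * value[d] for weight, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional token vectors: the first two tokens are identical.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy input the first token attends more strongly to itself and its identical neighbour than to the third token, which shows how attention weights depend on similarity between parts of the input.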

Unsupervised learning

A type of machine learning in which models are trained on data without output labels. The goal is to discover underlying patterns, groupings or structures within the data, often through clustering or dimensionality reduction techniques.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nielbo, K.L., Karsdorp, F., Wevers, M. et al. Quantitative text analysis. Nat Rev Methods Primers 4, 25 (2024). https://doi.org/10.1038/s43586-024-00302-w

  • DOI: https://doi.org/10.1038/s43586-024-00302-w
