Periodical peer-reviewed academic journal of INION RAN

Topic modeling of the corpus of blogs in Russian with respect to author's gender: text and context (Open access)

Litvinova T.A.

Voronezh State Pedagogical University, Russia, Voronezh, centr_rus_yaz@mail.ru

Abstract

Topic modeling analyzes the semantic organization of texts and is widely used both for applied problems and for theoretical research in sociology, psychology, scientometrics, and other fields. In linguistic and sociolinguistic studies, however, topic modeling methods are used less frequently. Moreover, classical topic modeling is based on the occurrence of words within a document and does not take into account the co-occurrence of words within context windows. The paper presents a comparative analysis of Russian-language blog texts using Latent Dirichlet allocation applied to two types of matrices, a term-document matrix (Text model) and a term co-occurrence matrix (Context model), taking into account the gender of the authors. It also reports an experiment on classifying texts by the gender of their authors based on the topic probability distributions of the texts. Higher values of topic modeling quality metrics were obtained for the models built on term co-occurrence matrices ("contexts"). The high accuracy of classification by author gender reveals a clear gender signal in blogs, a text genre that involves active construction of the author's identity, including gender. A comparison of the topics whose distribution probabilities in documents and in contexts contribute most to the classifier showed that topic modeling performed on term co-occurrence matrices makes it possible to identify features of the semantic organization of texts that complement the results obtained with traditional topic modeling.
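
The difference between the two matrix types named in the abstract can be illustrated with a minimal sketch. This is not the author's pipeline: the toy English sentences, the whitespace tokenization, the symmetric window of two tokens, and the function names `term_document_matrix` and `term_cooccurrence_matrix` are all assumptions made for illustration, since the abstract does not state the preprocessing or window settings.

```python
def term_document_matrix(docs, vocab):
    """Rows are documents, columns are vocabulary terms, cells are raw counts."""
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(docs):
        for tok in tokens:
            if tok in index:
                matrix[i][index[tok]] += 1
    return matrix


def term_cooccurrence_matrix(docs, vocab, window=2):
    """Rows are target terms, columns are context terms.

    A cell counts how often the context term appears within `window`
    tokens to the left or right of the target term (window size is an
    assumption, not a value taken from the paper).
    """
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for tokens in docs:
        for pos, target in enumerate(tokens):
            if target not in index:
                continue
            start = max(0, pos - window)
            stop = min(len(tokens), pos + window + 1)
            for ctx_pos in range(start, stop):
                if ctx_pos == pos:
                    continue  # a token is not its own context
                ctx = tokens[ctx_pos]
                if ctx in index:
                    matrix[index[target]][index[ctx]] += 1
    return matrix


# Hypothetical toy corpus, already tokenized.
docs = [
    "my cat sleeps on my sofa".split(),
    "my dog sleeps in the garden".split(),
]
vocab = sorted({tok for doc in docs for tok in doc})

text_matrix = term_document_matrix(docs, vocab)         # documents x terms
context_matrix = term_cooccurrence_matrix(docs, vocab)  # terms x terms
```

In the first matrix each row describes a document, so a topic model fitted to it groups words that share documents. In the second each row describes the contexts of a single term, so the same kind of model would instead group words that share neighbors within a window.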

Keywords

text semantics; computer semantics; topic modeling; gender attribution; blogs; Russian-language text corpora


For citing: Litvinova, T.A. (2022). Topic modeling of the corpus of blogs in Russian with respect to author's gender: text and context. Ethnopsycholinguistics. Moscow: INION RAN. Vol. 2(9), pp. 7-23. DOI: 10.31249/epl/2022.02.01

