Madhurima Nath, PhD
Aug 21, 2023
Learn about the mathematical concepts behind the LDA, NMF, and BERTopic models
Topic modeling is a part of natural language processing (NLP) which enables end-users to identify themes/topics within a collection of documents. It has applications in multiple industries for text mining and gaining relevant insights from textual data.
Most algorithms try to decompose the document-term matrix into two or more matrices in order to obtain a matrix relating terms to topics. This is a schematic I came up with to understand how the well-known algorithms work.
Depending on the algorithm, the entries of the document-term matrix can be calculated using a bag-of-words approach, term frequency-inverse document frequency (TF-IDF), or class-based TF-IDF. Further, the number of lower-rank, lower-dimensional matrices the document-term matrix gets factorized into depends on the specific algorithm. The distribution of topics in the documents and the distribution of terms in each topic can be probabilistic or deterministic.
Note: Matrix factorization is widely used for feature reduction in facial recognition and other machine learning tasks.
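To make the document-term matrix concrete, here is a minimal sketch of the two classic weightings from the paragraph above, using scikit-learn; the toy corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]

# Bag-of-words: each entry is the raw count of a term in a document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)        # shape: (n_documents, n_terms)

# TF-IDF: counts re-weighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```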
The most established go-to techniques for topic modeling are Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). LDA is a generative probabilistic model, while NMF is a non-probabilistic linear-algebraic model that uses matrix factorization. LDA is useful for identifying coherent topics, while NMF does well even when the underlying topics are less coherent.
LDA is a three-level hierarchical Bayesian model; Bayesian inference is used to estimate its parameters, namely the topic-term distribution and the document-topic distribution. LDA assumes that each document is a mix over an underlying set of topics, and each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. Since the results are not deterministic, one might get different results each time the model is run, even on the same dataset.
NMF decomposes a non-negative term-document matrix into two non-negative matrices: one relating terms to topics and one relating topics to documents. NMF assumes that each document is a linear combination of the topics and each topic is a linear combination of the terms. The objective of NMF is dimensionality reduction and feature extraction: the original matrix is decomposed into a feature matrix and a coefficient matrix. NMF takes advantage of the fact that the vectors are non-negative, and it is more suitable for smaller datasets and short texts.
BERTopic is a fairly new technique that uses Google’s BERT (Bidirectional Encoder Representations from Transformers) embeddings and a class-based variation of TF-IDF (term frequency-inverse document frequency). As mentioned in the paper by Grootendorst:
“BERTopic is a topic model that extracts coherent topic representation through the development of a class-based variation of TF-IDF. BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure.”
LDA
LDA is an extension of probabilistic Latent Semantic Analysis (pLSA). Even though the topics in a document are unknown, it is assumed that the text in the document is generated based on these topics. “Latent” refers to what is unknown a priori and hidden in the textual data. As discussed above, LDA models the distribution of topics in the documents and the distribution of terms in each topic; these distributions are assumed to be Dirichlet distributions, hence the name. The graphical model representation of LDA is shown below (re-generated from the LDA paper by Blei et al., 2003). As the paper mentions:
“The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and terms within a document.”
The probability of a corpus is obtained as the product of the marginal probabilities of the individual documents:
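Restated from the Blei et al. (2003) paper, with α and β the Dirichlet and topic-term parameters, θ_d the per-document topic proportions, z_dn the topic assignment of the n-th word in document d, and w_dn the observed word:

```
p(D \mid \alpha, \beta) =
\prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
\left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```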
Please refer to the LDA paper by Blei et al. (2003) for further details.
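For practical use, scikit-learn provides an LDA implementation. A minimal sketch on a toy corpus follows; the documents and the choice of two topics are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (illustrative only)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
]

# LDA operates on raw term counts (bag-of-words), not TF-IDF weights
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# The number of topics must be set beforehand; results vary across runs unless seeded
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)      # document-topic distribution
topic_term = lda.components_          # unnormalized topic-term weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_terms}")
```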
NMF
NMF-based models learn topics by directly decomposing the document-term matrix into two low-rank matrices. NMF can be applied to perform statistical analysis of multivariate data. Given a non-negative matrix V, NMF finds non-negative matrix factors W and H (courtesy: the algorithms-for-NMF paper by Lee and Seung, 2000).
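In the notation of those papers, V is arranged with the n terms as rows and the m documents as columns, and is approximated by the product of an n × r matrix W and an r × m matrix H, where the rank r is chosen to be smaller than n and m:

```
V \approx W H, \qquad
V \in \mathbb{R}^{n \times m}_{\ge 0},\;
W \in \mathbb{R}^{n \times r}_{\ge 0},\;
H \in \mathbb{R}^{r \times m}_{\ge 0}
```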
The columns of W can be interpreted as basis vectors over the bag-of-words vocabulary, i.e., these are the topics, while the columns of H contain the coefficients that encode each document in terms of those topics. The implementation of the NMF algorithm consists of update rules for W and H; iterating through them converges to a local maximum of the objective function (re-generated from the NMF paper by Lee and Seung, 1999), subject to the non-negativity constraints.
“The algorithm starts with non-negative initial conditions for W and H; iteration of the update rules for non-negative V finds an approximate factorization by converging to a local maximum of the objective function. This objective function can be derived by interpreting NMF as an algorithm for constructing a probabilistic model of image generation, and is related to the likelihood of generating the images in V from the basis W and encodings H.”
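Restating the objective function and the multiplicative update rules from the 1999 paper (V_{iμ} denotes an entry of V; the W update is followed by a column normalization):

```
F = \sum_{i=1}^{n} \sum_{\mu=1}^{m}
\left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]

W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu},
\qquad
W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}},
\qquad
H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}
```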
NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attributes, and the coefficients of these linear combinations are non-negative. When applied, an NMF model maps the original data into the new set of attributes or features discovered by the model.
In general, exact NMF is NP-hard, but there are heuristic approximations that work well in many applications. There is also no guarantee of a single unique decomposition; to tackle this, priors are placed on the factors W and H along with regularization terms in the objective function. It is also hard to know how to choose the factorization rank.
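As a minimal sketch, scikit-learn's NMF can be applied to a TF-IDF document-term matrix. Note that scikit-learn puts documents in rows, so W holds document-topic weights and H holds topic-term weights (the transpose of the convention above); the corpus and the rank of 2 are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus (illustrative only)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
]

vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)          # document-term TF-IDF matrix (documents in rows)

# The factorization rank (number of topics) is user-defined;
# nndsvd initialization reduces run-to-run variation
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(V)                  # document-topic weights
H = model.components_                       # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(H):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_terms}")
```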
Please refer to the papers listed in References section for further reading.
BERTopic
BERTopic assumes that documents containing the same topic are semantically similar. Generating topics involves three steps:
- Each document is converted to its embedding representation using the Sentence-BERT (SBERT) framework, which allows sentences and paragraphs to be converted to dense vector representations using pre-trained language models.
- The dimensionality of the resulting embeddings is reduced using the Uniform Manifold Approximation and Projection (UMAP) technique to optimize the clustering process, as UMAP has been shown to preserve more of the local and global features of high-dimensional data in lower projected dimensions. Clusters are then obtained from the reduced embeddings using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).
- The topics are extracted from these clusters using a custom class-based variation of TF-IDF.
The classic TF-IDF weighting is:
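In standard notation (as also used in the BERTopic paper):

```
W_{t,d} = \mathrm{tf}_{t,d} \cdot \log\!\left(\frac{N}{\mathrm{df}_t}\right)
```

where tf_{t,d} is the frequency of term t in document d, N is the total number of documents, and df_t is the number of documents containing term t.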
The above equation is slightly modified for the class-based TF-IDF, where each class c is the collection of documents in a cluster concatenated into a single document:
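```
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{\mathrm{tf}_t}\right)
```

where tf_{t,c} is the frequency of term t in class c, tf_t is the frequency of term t across all classes, and A is the average number of words per class.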
This makes it possible to generate topic-term distributions for each cluster of documents. The number of topics can then be reduced to a user-specified value by iteratively merging the class-based TF-IDF representation of the least common topic with that of its most similar one.
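Putting the three steps together, here is a minimal sketch of the embed, reduce, and cluster pipeline using the sentence-transformers, umap-learn, and hdbscan packages. The tiny corpus and the small parameter values are illustrative only; on real corpora much larger neighbourhood and cluster-size values (such as the library defaults) are appropriate:

```python
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Toy corpus (illustrative only; real corpora should be much larger)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "my puppy loves playing fetch in the park",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
    "the central bank raised rates to curb inflation",
]

# Step 1: embed documents with a pre-trained SBERT model
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Step 2: reduce dimensionality with UMAP
# (small n_neighbors/n_components only because the corpus is tiny)
reduced = umap.UMAP(n_neighbors=3, n_components=2, metric="cosine",
                    random_state=42).fit_transform(embeddings)

# Step 3: cluster the reduced embeddings with HDBSCAN (label -1 means noise)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(labels)
```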
The implementation of BERTopic is freely available as an open-source Python package.
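A minimal usage sketch, assuming the bertopic package is installed; the 20 Newsgroups corpus is used only as an example dataset, and exact defaults may differ across versions:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Example dataset (downloaded by scikit-learn on first use)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Embedding, UMAP, HDBSCAN, and c-TF-IDF are handled internally;
# nr_topics reduces the number of topics after fitting by merging similar ones
topic_model = BERTopic(language="english", nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # overview of the discovered topics
print(topic_model.get_topic(0))              # top terms of topic 0 with c-TF-IDF scores
```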
The well-known pros and cons of these models are listed below.
Pros of LDA:
- can handle large datasets and can be easily parallelized
- can assign a probability to a new document thanks to the document-topic Dirichlet distribution
- topics are open to human interpretation
Cons of LDA:
- computationally expensive
- may not work well for short texts
- number of topics must be known/set beforehand
- bag-of-words approach disregards the semantic representation of words in a corpus, similar to LSA and pLSA
- estimation of the Bayesian parameters rests on the assumption that documents are exchangeable
- requires an extensive pre-processing phase to obtain a significant representation from the textual input data
- studies report LDA may yield too general (Rizvi et al., 2019) or irrelevant (Alnusyan et al., 2020) topics. Results may also be inconsistent across different executions (Egger et al., 2021).
Pros of NMF:
- computationally efficient
- works well for smaller datasets and short texts
Cons of NMF:
- not as effective at identifying complex relationships between topics
- does not consider semantic relationships between terms
Pros of BERTopic:
- preserves semantic relationship between the terms
- scalability: performance improves when state-of-the-art language models are used to create the embeddings
- can be used in a wide range of situations because of its stability across language models
- significant flexibility in usage and fine-tuning because the embedding process is separated from the topic representation
- representing topics as distributions over terms allows BERTopic to model dynamic and evolutionary aspects of topics with few changes to the core algorithm
Cons of BERTopic:
- the assumption that each document contains only a single topic is not realistic
- terms in a topic can be redundant for interpretation of the topic
Both LDA and NMF assume that a document is a mix of latent topics and take a bag-of-words approach to describe the document. This results in a disregard for the semantic relationships among the terms in a document.
There is no single correct answer as to which is the best model. Depending on the requirements of the problem and the resources available, one can choose whichever of these three models is best suited to the scenario.
The paper by Egger and Yu (2022) compares these models on Twitter posts.
References
- Deerwester, Scott, et al. “Indexing by latent semantic analysis.” Journal of the American society for information science 41.6 (1990): 391–407.
- Hofmann, Thomas. “Probabilistic latent semantic indexing.” In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. 1999.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3, no. Jan (2003): 993–1022.
- Lee, Daniel, and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in neural information processing systems, 13 (2000).
- Lee, Daniel, and H. Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization.” Nature 401, no. 6755 (1999): 788–791.
- Egger, Roman, and Joanne Yu. “A topic modeling comparison between lda, nmf, top2vec, and berttopic to demystify twitter posts.” Frontiers in sociology 7 (2022): 886498.
- Grootendorst, Maarten. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794 (2022).
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
- Shi, Tian, Kyeongpil Kang, Jaegul Choo, and Chandan K. Reddy. “Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations.” In Proceedings of the 2018 World Wide Web Conference, pp. 1105–1114, 2018.
- Choo, Jaegul, Changhyun Lee, Chandan K. Reddy, and Haesun Park. “Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization.” IEEE Transactions on Visualization and Computer Graphics 19, no. 12 (2013): 1992–2001.