Madhurima Nath, PhD
Aug 21, 2023
Learn about the mathematical concepts behind the LDA, NMF, and BERTopic models
Topic modeling is a part of natural language processing (NLP) which enables end-users to identify themes/topics within a collection of documents. It has applications in multiple industries for text mining and gaining relevant insights from textual data.
Most algorithms try to decompose the document-term matrix into two or more matrices in order to obtain a matrix relating terms to topics. This is a schematic I came up with to understand how the well-known algorithms work.
Depending on the algorithm, the entries of the document-term matrix can be calculated using a bag-of-words approach, term frequency-inverse document frequency (TF-IDF), or class-based TF-IDF. Further, the number of lower-rank, lower-dimensional matrices the document-term matrix gets factorized into depends on the specific algorithm. The distribution of topics in the documents and the distribution of terms in each topic can be probabilistic or deterministic.
Note: Matrix factorization is widely used for feature reduction in facial recognition and other machine learning tasks.
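To make the document-term matrix concrete, here is a minimal sketch of the two classic weightings from the paragraph above, using scikit-learn; the toy corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]

# Bag-of-words: each entry is the raw count of a term in a document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)        # shape: (n_documents, n_terms)

# TF-IDF: counts re-weighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```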
The most established go-to techniques for topic modeling are Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). LDA is a generative probabilistic model, while NMF is a non-probabilistic linear-algebraic model that uses matrix factorization. LDA is useful for identifying coherent topics, while NMF does well even when the underlying topics are less coherent.
LDA is a three-level hierarchical Bayesian model; Bayesian inference is used to estimate its parameters, namely the topic-term distribution and the document-topic distribution. LDA assumes that each document is a mix over an underlying set of topics, and each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. Since the results are not deterministic, one might get different results each time the model is run, even on the same dataset.
NMF decomposes a non-negative term-document matrix into two non-negative matrices: one relating terms to topics and one relating topics to documents. NMF assumes that each document is a linear combination of the topics and each topic is a linear combination of the terms. The objective of NMF is dimensionality reduction and feature extraction: the original matrix is decomposed into a feature matrix and a coefficient matrix. NMF takes advantage of the fact that the vectors are non-negative, and it is more suitable for smaller datasets and short texts.
BERTopic is a fairly new technique that uses Google’s BERT (Bidirectional Encoder Representations from Transformers) embeddings and a class-based variation of TF-IDF (term frequency-inverse document frequency). As mentioned in the paper by Grootendorst:
“BERTopic is a topic model that extracts coherent topic representation through the development of a class-based variation of TF-IDF. BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure.”
LDA
LDA is an extension of probabilistic Latent Semantic Analysis (pLSA). Even though the topics in a document are unknown, it is assumed that the text in the document is generated based on these topics. “Latent” refers to what is unknown a priori and hidden in the textual data. As discussed above, LDA models the distribution of topics in the documents and the distribution of terms in each topic; these distributions are assumed to be Dirichlet distributions, hence the name. The graphical model representation of LDA is shown below (re-generated from the LDA paper by Blei et al., 2003). As the paper mentions:
“The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and terms within a document.”
The probability of a corpus is obtained as the product of the marginal probabilities of the individual documents:
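Restated from the Blei et al. (2003) paper, with α and β the Dirichlet and topic-term parameters, θ_d the per-document topic proportions, z_dn the topic assignment of the n-th word in document d, and w_dn the observed word:

```
p(D \mid \alpha, \beta) =
\prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
\left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```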
Please refer to the LDA paper by Blei et al. (2003) for further details.
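For practical use, scikit-learn provides an LDA implementation. A minimal sketch on a toy corpus follows; the documents and the choice of two topics are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (illustrative only)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
]

# LDA operates on raw term counts (bag-of-words), not TF-IDF weights
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# The number of topics must be set beforehand; results vary across runs unless seeded
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)      # document-topic distribution
topic_term = lda.components_          # unnormalized topic-term weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_terms}")
```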
NMF
NMF-based models learn topics by directly decomposing the document-term matrix into two low-rank matrices. NMF can be applied to perform statistical analysis of multivariate data. Given a non-negative matrix V, NMF finds non-negative matrix factors W and H (courtesy: the algorithms-for-NMF paper by Lee and Seung, 2000).
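In the notation of those papers, V is arranged with the n terms as rows and the m documents as columns, and is approximated by the product of an n × r matrix W and an r × m matrix H, where the rank r is chosen to be smaller than n and m:

```
V \approx W H, \qquad
V \in \mathbb{R}^{n \times m}_{\ge 0},\;
W \in \mathbb{R}^{n \times r}_{\ge 0},\;
H \in \mathbb{R}^{r \times m}_{\ge 0}
```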
The columns of W can be interpreted as basis vectors over the bag-of-words vocabulary, i.e., these are the topics, while the columns of H contain the coefficients that encode each document in terms of those topics. The implementation of the NMF algorithm consists of update rules for W and H; iterating through them converges to a local maximum of the objective function (re-generated from the NMF paper by Lee and Seung, 1999), subject to the non-negativity constraints.
“The algorithm starts with non-negative initial conditions for W and H; iteration of the update rules for non-negative V finds an approximate factorization by converging to a local maximum of the objective function. This objective function can be derived by interpreting NMF as an algorithm for constructing a probabilistic model of image generation, and is related to the likelihood of generating the images in V from the basis W and encodings H.”
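Restating the objective function and the multiplicative update rules from the 1999 paper (V_{iμ} denotes an entry of V; the W update is followed by a column normalization):

```
F = \sum_{i=1}^{n} \sum_{\mu=1}^{m}
\left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]

W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu},
\qquad
W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}},
\qquad
H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}
```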
NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attributes, and the coefficients of these linear combinations are non-negative. When applied, an NMF model maps the original data into the new set of attributes or features discovered by the model.
In general, exact NMF is NP-hard, but there are heuristic approximations that work well in many applications. There is also no guarantee of a single unique decomposition; to tackle this, priors are placed on the factors W and H along with regularization terms in the objective function. It is also hard to know how to choose the factorization rank.
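As a minimal sketch, scikit-learn's NMF can be applied to a TF-IDF document-term matrix. Note that scikit-learn puts documents in rows, so W holds document-topic weights and H holds topic-term weights (the transpose of the convention above); the corpus and the rank of 2 are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus (illustrative only)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
]

vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)          # document-term TF-IDF matrix (documents in rows)

# The factorization rank (number of topics) is user-defined;
# nndsvd initialization reduces run-to-run variation
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(V)                  # document-topic weights
H = model.components_                       # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(H):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_terms}")
```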
Please refer to the papers listed in References section for further reading.
BERTopic
BERTopic assumes that documents containing the same topic are semantically similar. Generating topics involves three steps:
- Each document is converted to its embedding representation using the Sentence-BERT (SBERT) framework, which allows sentences and paragraphs to be converted to dense vector representations using pre-trained language models.
- The dimensionality of the resulting embeddings is reduced using the Uniform Manifold Approximation and Projection (UMAP) technique to optimize the clustering process, as UMAP has been shown to preserve more of the local and global features of high-dimensional data in lower projected dimensions. Clusters are then obtained from the reduced embeddings using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).
- The topics are extracted from these clusters using a custom class-based variation of TF-IDF.
The classic TF-IDF weighting is:
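In standard notation (as also used in the BERTopic paper):

```
W_{t,d} = \mathrm{tf}_{t,d} \cdot \log\!\left(\frac{N}{\mathrm{df}_t}\right)
```

where tf_{t,d} is the frequency of term t in document d, N is the total number of documents, and df_t is the number of documents containing term t.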
The above equation is slightly modified for the class-based TF-IDF, where each class c is the collection of documents in a cluster concatenated into a single document:
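```
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{\mathrm{tf}_t}\right)
```

where tf_{t,c} is the frequency of term t in class c, tf_t is the frequency of term t across all classes, and A is the average number of words per class.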
This makes it possible to generate topic-term distributions for each cluster of documents. The number of topics can then be reduced to a user-specified value by iteratively merging the class-based TF-IDF representation of the least common topic with that of its most similar one.
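Putting the three steps together, here is a minimal sketch of the embed, reduce, and cluster pipeline using the sentence-transformers, umap-learn, and hdbscan packages. The tiny corpus and the small parameter values are illustrative only; on real corpora much larger neighbourhood and cluster-size values (such as the library defaults) are appropriate:

```python
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Toy corpus (illustrative only; real corpora should be much larger)
docs = [
    "the cat chased the mouse around the house",
    "dogs and cats are popular household pets",
    "my puppy loves playing fetch in the park",
    "the stock market rallied after the earnings report",
    "investors sold shares as interest rates rose",
    "the central bank raised rates to curb inflation",
]

# Step 1: embed documents with a pre-trained SBERT model
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Step 2: reduce dimensionality with UMAP
# (small n_neighbors/n_components only because the corpus is tiny)
reduced = umap.UMAP(n_neighbors=3, n_components=2, metric="cosine",
                    random_state=42).fit_transform(embeddings)

# Step 3: cluster the reduced embeddings with HDBSCAN (label -1 means noise)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(labels)
```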
The implementation of BERTopic is freely available as an open-source Python package.
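A minimal usage sketch, assuming the bertopic package is installed; the 20 Newsgroups corpus is used only as an example dataset, and exact defaults may differ across versions:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Example dataset (downloaded by scikit-learn on first use)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Embedding, UMAP, HDBSCAN, and c-TF-IDF are handled internally;
# nr_topics reduces the number of topics after fitting by merging similar ones
topic_model = BERTopic(language="english", nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # overview of the discovered topics
print(topic_model.get_topic(0))              # top terms of topic 0 with c-TF-IDF scores
```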
The well-known pros and cons of these models are listed below.
Pros of LDA:
- can handle large datasets and can be easily parallelized
- can assign a probability to a new document thanks to the document-topic Dirichlet distribution
- topics are open to human interpretation
Cons of LDA:
- computationally expensive
- may not work well for short texts
- number of topics must be known/set beforehand
- bag-of-words approach disregards the semantic representation of words in a corpus, similar to LSA and pLSA
- estimation of the Bayesian parameters rests on the assumption that documents are exchangeable
- requires an extensive pre-processing phase to obtain a significant representation from the textual input data
- studies report LDA may yield too general (Rizvi et al., 2019) or irrelevant (Alnusyan et al., 2020) topics. Results may also be inconsistent across different executions (Egger et al., 2021).
Pros of NMF:
- computationally efficient
- works well for smaller datasets and short texts
Cons of NMF:
- not as effective at identifying complex relationships between topics
- does not consider semantic relationships between terms
Pros of BERTopic:
- preserves semantic relationship between the terms
- scalability: performance improves when state-of-the-art language models are used to create the embeddings
- can be used in a wide range of situations because of its stability across language models
- significant flexibility in usage and fine-tuning because the embedding process is separated from the topic representation
- representing topics as distributions over terms allows BERTopic to model dynamic and evolutionary aspects of topics with few changes to the core algorithm
Cons of BERTopic:
- the assumption that each document contains only a single topic is not realistic
- terms in a topic can be redundant for interpretation of the topic
Both LDA and NMF assume that a document is a mix of latent topics and take a bag-of-words approach to describe the document. This results in a disregard for the semantic relationships among the terms in a document.
There is no single correct answer as to which is the best model. Depending on the requirements of the problem and the resources available, one can choose whichever of these three models is best suited to the scenario.
The paper by Egger and Yu (2022) compares these models on Twitter posts.
References
- Deerwester, Scott, et al. “Indexing by latent semantic analysis.” Journal of the American society for information science 41.6 (1990): 391–407.
- Hofmann, Thomas. “Probabilistic latent semantic indexing.” In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. 1999.
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3, no. Jan (2003): 993–1022.
- Lee, Daniel, and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in neural information processing systems, 13 (2000).
- Lee, Daniel, and H. Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization.” Nature 401, no. 6755 (1999): 788–791.
- Egger, Roman, and Joanne Yu. “A topic modeling comparison between lda, nmf, top2vec, and berttopic to demystify twitter posts.” Frontiers in sociology 7 (2022): 886498.
- Grootendorst, Maarten. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794 (2022).
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
- Shi, Tian, Kyeongpil Kang, Jaegul Choo, and Chandan K. Reddy. “Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations.” In Proceedings of the 2018 World Wide Web Conference, pp. 1105–1114, 2018.
- Choo, Jaegul, Changhyun Lee, Chandan K. Reddy, and Haesun Park. “Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization.” IEEE Transactions on Visualization and Computer Graphics 19, no. 12 (2013): 1992–2001.