Topic modeling algorithms (2024)

Table of Contents
  • LDA
  • NMF
  • BERTopic
  • FAQs

Learn about the mathematical concepts behind the LDA, NMF, and BERTopic models

Topic modeling is a part of natural language processing (NLP) which enables end-users to identify themes/topics within a collection of documents. It has applications in multiple industries for text mining and gaining relevant insights from textual data.

Most algorithms try to decompose the document-term matrix into two or more matrices in order to obtain a matrix relating terms to topics. This is a schematic I came up with to understand how the well-known algorithms work.

[Figure: schematic of how the well-known topic modeling algorithms factorize the document-term matrix]

Depending on the algorithm, the entries in the document-term matrix can be calculated using either a bag-of-words approach, term frequency–inverse document frequency (TF-IDF), or class-based TF-IDF. Further, the number of lower-rank, lower-dimensional matrices the document-term matrix gets factorized into depends on the specific algorithm. The distribution of the topics in the documents and the distribution of the terms in each topic can be probabilistic or deterministic.
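To make this concrete, here is a minimal sketch (using scikit-learn on a tiny made-up corpus, both of which are illustrative choices rather than anything prescribed above) of how a document-term matrix can be built with either raw counts (bag-of-words) or TF-IDF weights:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, purely for illustration
docs = [
    "cats and dogs are popular pets",
    "deep learning models need large datasets",
    "dogs need training and large yards",
]

# Bag-of-words: entries are raw term counts per document
bow = CountVectorizer()
dtm_counts = bow.fit_transform(docs)       # shape: (n_documents, n_terms)

# TF-IDF: counts are re-weighted by inverse document frequency
tfidf = TfidfVectorizer()
dtm_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())         # the vocabulary (columns of the matrix)
print(dtm_counts.toarray())
print(dtm_tfidf.toarray().round(2))
```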

Note: The matrix factorization technique is widely used for feature reduction in facial recognition as well as in NLP tasks.

The most established go-to techniques for topic modeling are Latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF). LDA is a generative probabilistic model, and NMF is a non-probabilistic linear algebraic model that uses matrix factorization. LDA is useful for identifying coherent topics, while NMF does well at identifying incoherent topics.

LDA is a three-level hierarchical Bayesian inference model, used to estimate the parameters of the model, namely the topic-term distribution and the document-topic distribution. LDA assumes that each document is a mixture over an underlying set of topics, and each topic is modeled as an infinite mixture over an underlying set of topic probabilities. Since the results are not deterministic, one might get different results each time the model is run, even on the same dataset.

NMF decomposes a non-negative matrix into two lower-rank non-negative matrices: one relating documents to topics and one relating topics to terms. NMF assumes that each document is a linear combination of the topics and each topic is a linear combination of the terms. The objective of NMF is dimensionality reduction and feature extraction: the original matrix is decomposed into a feature matrix and a coefficient matrix. NMF takes advantage of the fact that the vectors are non-negative, and it is more suitable for smaller datasets and short texts.

BERTopic is a fairly new technique that uses Google’s BERT (Bidirectional Encoder Representations from Transformers) embeddings and a class-based variation of TF-IDF (term frequency–inverse document frequency). As mentioned in the paper by Grootendorst:

“BERTopic is a topic model that extracts coherent topic representation through the development of a class-based variation of TF-IDF. BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure.”

LDA

LDA is an extension of probabilistic latent semantic analysis (pLSA). Even though the topics in a document are unknown, it is assumed that the text in the document is generated based on these topics. “Latent” refers to things that are unknown a priori and hidden in the textual data. As discussed above, LDA models the distribution of topics in the documents and the distribution of the terms in each topic; these distributions are drawn from Dirichlet priors, which is where “Dirichlet” in the name comes from. The graphical model representation of LDA is shown below (re-generated from the LDA paper by Blei et al. 2003). As the paper mentions:

“The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and terms within a document.”

[Figure: plate-notation graphical model of LDA, re-generated from Blei et al. 2003]

The probability of a corpus is obtained as the product of the marginal probabilities of the individual documents:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d

Please refer to the LDA paper by Blei et al. (2003) for further details.
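For readers who want to try LDA quickly, a minimal sketch with scikit-learn's LatentDirichletAllocation is shown below (the corpus and the number of topics are made-up illustrative values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; a real application would use many more documents
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

# LDA is usually fit on raw counts (bag-of-words), not TF-IDF
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_components is the number of topics and must be chosen up front
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)            # document-topic distribution

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_): # topic-term weights
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top_terms}")
```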

NMF

NMF-based models learn topics by directly decomposing the document-term matrix into two low-rank matrices. NMF can be applied to perform statistical analysis of multivariate data. Given a non-negative matrix V, NMF finds non-negative matrix factors W and H (courtesy: the algorithms for NMF paper by Lee and Seung, 2000).

V ≈ WH

The columns of W can be interpreted as basis vectors over the bag-of-words vocabulary, i.e., these are the topics, while the columns of H contain the coefficients (encodings) that express each document in terms of those topics. The implementation of the NMF algorithm consists of update rules for W and H; iterating through them results in convergence to a local maximum of the objective function (re-generated from the NMF paper by Lee and Seung, 1999), subject to the non-negativity constraints.

W_ia ← W_ia Σ_μ [V_iμ / (WH)_iμ] H_aμ,    W_ia ← W_ia / Σ_j W_ja,    H_aμ ← H_aμ Σ_i W_ia [V_iμ / (WH)_iμ]

with the objective function

F = Σ_i Σ_μ [ V_iμ log (WH)_iμ − (WH)_iμ ]

“The algorithm starts with non-negative initial conditions for W and H; iteration of the update rules for non-negative V finds an approximate factorization by converging to a local maximum of the objective function. It can be derived by interpreting NMF as an algorithm for constructing a probabilistic model of image generation. This objective function is then related to the likelihood of generating the images in V from the basis W and encodings H.”

NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attributes, and the coefficients of these linear combinations are non-negative. When applied, an NMF model maps the original data into the new set of attributes or features discovered by the model.
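To make the update rules concrete, here is a small NumPy sketch of Lee-Seung style multiplicative updates. Note that this version minimizes the squared Frobenius error rather than the divergence objective quoted above, and the toy matrix and rank are made up for illustration:

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for the Frobenius objective ||V - WH||_F^2 (sketch)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative throughout
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative "document-term" matrix (rows: documents, columns: terms)
V = np.array([[3, 0, 1, 0],
              [2, 0, 1, 0],
              [0, 4, 0, 2],
              [0, 3, 0, 1]], dtype=float)

W, H = nmf_multiplicative(V, r=2)
print(np.round(W @ H, 2))   # should approximate V
```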

In general, NMF is NP-hard, but there are heuristic approximations that work well in many applications. There is also no guarantee of a single unique decomposition; to tackle this, priors are placed on the factors W and H, along with regularization terms in the objective function. It is also hard to know how to choose the factorization rank.

Please refer to the papers listed in the References section for further reading.

BERTopic

BERTopic assumes that documents containing the same topic are semantically similar. Generating topics involves three steps (a minimal sketch of this pipeline follows the list):

  1. Each document is converted to its embedding representation using the Sentence-BERT (SBERT) framework, which allows the conversion of sentences and paragraphs to dense vector representations using pre-trained language models.
  2. The dimensionality of the resulting embeddings is reduced to optimize the clustering process using the Uniform Manifold Approximation and Projection (UMAP) technique, as it has been shown to preserve more of the local and global features of high-dimensional data in lower projected dimensions. Clusters are obtained from the reduced embeddings using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).
  3. The topics are extracted from these clusters using a custom class-based variation of TF-IDF.
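The following is a minimal sketch of that three-step pipeline using the sentence-transformers, umap-learn, and hdbscan packages. The embedding model name, the UMAP/HDBSCAN parameters, and the dataset are illustrative assumptions, not values prescribed by the BERTopic paper:

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# A publicly available corpus, used purely for illustration
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:1000]

# Step 1: embed documents with a pre-trained SBERT model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=False)

# Step 2: reduce dimensionality with UMAP, then cluster with HDBSCAN
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean").fit(reduced)

# Step 3 (shown separately below): compute class-based TF-IDF per cluster to get topic terms
print(clusterer.labels_)    # cluster id per document; -1 marks outliers
```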

The classic TF-IDF is

W_{t,d} = tf_{t,d} · log(N / df_t)

where tf_{t,d} is the frequency of term t in document d, N is the total number of documents, and df_t is the number of documents containing term t.

The above equation is slightly modified for the class-based TF-IDF. The class c is the collection of documents concatenated into a single document for each cluster.

W_{t,c} = tf_{t,c} · log(1 + A / tf_t)

where tf_{t,c} is the frequency of term t in class c, tf_t is the frequency of term t across all classes, and A is the average number of words per class.

This makes it possible to generate topic-term distributions for each cluster of documents. The number of topics can then be reduced to a user-specified value by iteratively merging the class-based TF-IDF representation of the least common topic with that of its most similar topic.
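As a quick illustration of the class-based weighting, here is a small NumPy sketch that applies the formula above to a toy term-frequency matrix (the cluster counts are made up):

```python
import numpy as np

# Rows: classes (one concatenated document per cluster); columns: terms
tf = np.array([[10, 0, 2],
               [ 1, 8, 0],
               [ 0, 1, 9]], dtype=float)

A = tf.sum(axis=1).mean()    # average number of words per class
tf_t = tf.sum(axis=0)        # frequency of each term across all classes

# Class-based TF-IDF weight for every (class, term) pair
ctfidf = tf * np.log(1 + A / tf_t)
print(np.round(ctfidf, 2))   # larger values = more representative terms for that class
```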

The implementation of BERTopic is freely available here.
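In practice the whole pipeline is wrapped by the bertopic package; a minimal usage sketch (the dataset and the library defaults are illustrative choices) looks like this:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# A publicly available corpus used purely for illustration
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

topic_model = BERTopic()                    # defaults: SBERT + UMAP + HDBSCAN + c-TF-IDF
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # overview of the discovered topics
print(topic_model.get_topic(0))             # top terms and weights for topic 0
```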

The well-known pros and cons of these models are listed below.

Pros of LDA:

  • can handle large datasets and can be easily parallelized
  • can assign a probability to a new document thanks to the document-topic Dirichlet distribution
  • topics are open to human interpretation

Cons of LDA:

  • computationally expensive
  • may not work well for short texts
  • number of topics must be known/set beforehand
  • bag-of-words approach disregards the semantic representation of words in a corpus, similar to LSA and pLSA
  • estimation of Bayes parameters lies under the assumption of exchangeability for the documents
  • requires an extensive pre-processing phase to obtain a significant representation from the textual input data
  • studies report LDA may yield too general (Rizvi et al., 2019) or irrelevant (Alnusyan et al., 2020) topics. Results may also be inconsistent across different executions (Egger et al., 2021).

Pros of NMF:

  • computationally efficient
  • works well for smaller datasets and short texts

Cons of NMF:

  • not as effective at identifying complex relationships between topics
  • does not consider semantic relationship between the terms

Pros of BERTopic:

  • preserves semantic relationship between the terms
  • scalability — performance increases when state-of-the-art language models are implemented to create embeddings
  • can be used in a wide range of situations because of its stability across language models
  • significant flexibility in usage and fine-tuning of the model because of the separation of embedding process from representing topics
  • the distribution of terms as topics allows BERTopic to model the dynamic and evolutionary aspects of topics with few changes to the core algorithm

Cons of BERTopic:

  • assumption that each document contains a single topic is not realistic
  • terms in a topic can be redundant for interpretation of the topic

Both LDA and NMF assume that a document is a mix of latent topics and take a bag-of-words approach to describe the document. This results in a disregard for the semantic relationships among the terms in the document.

There is no single correct answer as to which model is best. Depending on the requirements of the problem and the resources available, one can choose whichever of these three models is best suited to the scenario.

This paper by Egger R. and Yu J. compares these models on Twitter posts.

References

  1. Deerwester, Scott, et al. “Indexing by latent semantic analysis.” Journal of the American Society for Information Science 41.6 (1990): 391–407.
  2. Hofmann, Thomas. “Probabilistic latent semantic indexing.” Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1999.
  3. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (Jan 2003): 993–1022.
  4. Lee, Daniel D., and H. Sebastian Seung. “Algorithms for non-negative matrix factorization.” Advances in Neural Information Processing Systems 13 (2000).
  5. Lee, Daniel D., and H. Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization.” Nature 401, no. 6755 (1999): 788–791.
  6. Egger, Roman, and Joanne Yu. “A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts.” Frontiers in Sociology 7 (2022): 886498.
  7. Grootendorst, Maarten. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794 (2022).
  8. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  9. Shi, Tian, Kyeongpil Kang, Jaegul Choo, and Chandan K. Reddy. “Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations.” Proceedings of the 2018 World Wide Web Conference, pp. 1105–1114, 2018.
  10. Choo, Jaegul, Changhyun Lee, Chandan K. Reddy, and Haesun Park. “UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization.” IEEE Transactions on Visualization and Computer Graphics 19, no. 12 (2013): 1992–2001.

FAQs

What is the best algorithm for topic modeling?

Latent Dirichlet Allocation (LDA) is one of the most popular and widely used algorithms in topic modeling. It works by representing each document as a probability distribution over topics and each topic as a probability distribution over words. Advantages: it is easy to implement and can handle large datasets.

How much data do you need for topic modeling?

For best results: You should use at least 1,000 documents in each topic modeling job. Each document should be at least 3 sentences long. If a document consists of mostly numeric data, you should remove it from the corpus.

How do you find the optimal number of topics in topic modeling?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
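A sketch of this model-selection loop with scikit-learn is shown below; the corpus, the candidate topic counts, and the train/test split are illustrative assumptions:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]
X = CountVectorizer(stop_words="english", max_features=5000).fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit LDA for several candidate topic counts; lower held-out perplexity is better
for k in (5, 10, 20):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(k, round(lda.perplexity(X_test), 1))
```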

What is the alternative to LDA topic modeling?

Hierarchical latent tree analysis (HLTA) is an alternative to LDA. It models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.

What is the hardest topic in algorithms?

In the realm of algorithms, one of the hardest problems is often considered to be the Traveling Salesman Problem (TSP). This is an optimization problem that revolves around finding the shortest possible route a salesman can take to visit a given number of cities exactly once and return to the starting city.

What are the two algorithms for topic modeling?

The most established go-to techniques for topic modeling are Latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF).

How many documents do you need for topic modeling?

The minimum number of documents needed for topic modeling is 20.

Is topic modelling quantitative or qualitative?

We argue topic modeling is a useful method for the descriptive analysis of unstructured social media data sets, and is best used as part of a mixed-method strategy, with topic model results guiding deeper qualitative analysis.

Can BERT be used for topic modelling?

BERT is a Natural Language Processing Model proposed by Google Researchers [14]. This topic modeling technique uses transformers (BERT embeddings) and class-based TF-IDF to generate dense clusters [16].

How to choose the number of topics for LDA?

The selection of the optimal number of topics for LDA models can be based either on measures of topic quality (similarity or coherence) or on measures of goodness-of-fit and model complexity. The candidate models are determined by their numbers of topics: K ∈ {Kmin, ..., Kmax}.

What is a good coherence score in topic modeling?

In topic modeling, topic coherence measures the quality of the data by comparing the semantic similarity between highly repetitive words in a topic [10]. Coherence score is a scale from 0 to 1 in which a good coherence (high similarity) has a score of 1, and a bad coherence (low similarity) has a score of 0 [11].
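For reference, a coherence score can be computed with gensim's CoherenceModel; the toy corpus and the choice of the 'c_v' measure below are illustrative (c_v values fall roughly in the 0–1 range, while other measures such as u_mass use a different scale):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy tokenized corpus; real use would involve proper preprocessing
texts = [
    ["cat", "dog", "pet", "animal"],
    ["market", "stock", "investor", "trade"],
    ["dog", "training", "pet", "walk"],
    ["stock", "price", "market", "crash"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# 'c_v' coherence: higher values indicate more semantically coherent topics
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(round(cm.get_coherence(), 3))
```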

How to use LDA for topic modelling?

Steps (a minimal gensim-based sketch follows this list):
  1. Set up your environment. ...
  2. Install and import relevant libraries. ...
  3. Load the data. ...
  4. Preprocess the data. ...
  5. Generate the topic models. ...
  6. Evaluate the models. ...
  7. Fine-tune the LDA model. ...
  8. Visualize the topics.
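A compact, hedged sketch of those steps using gensim is given below; the preprocessing is deliberately minimal and the four example documents are made up:

```python
import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

raw_docs = [
    "The stock market fell sharply as investors sold shares.",
    "My dog loves long walks and playing fetch in the park.",
    "Central banks raised interest rates to fight inflation.",
    "Cats and dogs are the most popular household pets.",
]

# Preprocess: lowercase, keep alphabetic tokens longer than three characters
texts = [[w for w in re.findall(r"[a-z]+", d.lower()) if len(w) > 3] for d in raw_docs]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Generate the topic model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Inspect the topics (pyLDAvis could be used for richer visualization)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```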

Is LDA still relevant?

LDA is a powerful tool for topic modeling but its instability is a major, often unacknowledged, stumbling block.

Which is better, LDA or NMF?

NMF's different approach makes up for aspects that LDA lacks. One example of a use case where NMF outperforms LDA is image analysis, since images can be decomposed into meaningful parts for analysis.

Why is BERTopic better than LDA?

The study found that BERTopic excelled in semantic relevance and coherence, while LDA effectively formed distinct clusters. The evaluation combined automatic metrics, such as silhouette and coherence scores, with domain experts' insights.

Why is LDA best for topic modelling?

The significance of LDA in topic modelling is seen in its capacity to identify underlying themes in a document. It helps reveal hidden patterns within documents that are crucial for drawing insights from large datasets.

Is BERTopic better than LDA?

Although BERTopic demonstrates a slightly better quantitative performance, LDA aligns much better with human interpretation, indicating a stronger ability to capture meaningful and coherent topics within a company's customer call data.

What is the most efficient algorithm?

The highest efficiency level in algorithm complexity is the O(1) or constant time complexity. An algorithm with O(1) complexity is highly efficient, as its performance does not depend on the size of the input data. It takes the same amount of time to execute, regardless of the input size.

Which algorithm is best for text classification?

Linear Support Vector Machine is widely regarded as one of the best text classification algorithms.
