Topic models :: Tutorials for quanteda (2024)

Table of Contents
LDA Seeded LDA

Topics models are unsupervised document classification techniques. By modeling distributions of topics over words and words over documents, topic models identify the most discriminatory groups of documents automatically.

require(quanteda)require(quanteda.corpora)require(seededlda)require(lubridate)
corp_news <- download("data_corpus_guardian")

We will select only news articles published in 2016 using corpus_subset() and the year function from the lubridate package.

corp_news_2016 <- corpus_subset(corp_news, year(date) == 2016)ndoc(corp_news_2016)
## [1] 1959

Further, after removal of function words and punctuation in dfm(), we will only keep the top 20% of the most frequent features (min_termfreq = 0.8) that appear in less than 10% of all documents (max_docfreq = 0.1) using dfm_trim() to focus on common but distinguishing features.

toks_news <- tokens(corp_news_2016, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE)toks_news <- tokens_remove(toks_news, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))dfmat_news <- dfm(toks_news) %>% dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile", max_docfreq = 0.1, docfreq_type = "prop")

quanteda does not implement topic models, but you can fit LDA and seeded-LDA with the seededlda package.

LDA

k = 10 specifies the number of topics to be discovered. This is an important parameter and you should try a variety of values and validate the outputs of your topic models thoroughly.

tmod_lda <- textmodel_lda(dfmat_news, k = 10)

You can extract the most important terms for each topic from the model using terms().

terms(tmod_lda, 10)
## topic1 topic2 topic3 topic4 topic5 topic6 ## [1,] "oil" "corbyn" "son" "officers" "clinton" "refugees"## [2,] "markets" "johnson" "parents" "violence" "sanders" "brussels"## [3,] "prices" "leadership" "felt" "prison" "cruz" "talks" ## [4,] "investors" "boris" "park" "victims" "hillary" "french" ## [5,] "shares" "shadow" "room" "sexual" "obama" "summit" ## [6,] "rates" "jeremy" "love" "abuse" "trump's" "refugee" ## [7,] "banks" "tory" "mother" "criminal" "bernie" "migrants"## [8,] "trading" "doctors" "story" "officer" "ted" "turkey" ## [9,] "quarter" "junior" "father" "crime" "rubio" "benefits"## [10,] "sector" "khan" "knew" "incident" "senator" "asylum" ## topic7 topic8 topic9 topic10 ## [1,] "sales" "syria" "australia" "climate" ## [2,] "apple" "isis" "australian" "water" ## [3,] "customers" "military" "labor" "energy" ## [4,] "google" "islamic" "turnbull" "food" ## [5,] "users" "un" "budget" "development"## [6,] "technology" "syrian" "funding" "gas" ## [7,] "games" "forces" "housing" "drug" ## [8,] "game" "muslim" "senate" "medical" ## [9,] "iphone" "peace" "education" "hospital" ## [10,] "app" "china" "coalition" "patients"

You can then obtain the most likely topics using topics() and save them as a document-level variable.

head(topics(tmod_lda), 20)
## text136751 text136585 text139163 text169133 text153451 text163885 text157885 ## topic10 topic2 topic7 topic3 topic4 topic10 topic3 ## text173244 text137394 text169408 text184646 text127410 text134923 text169695 ## topic2 topic9 topic3 topic2 topic4 topic2 topic1 ## text147917 text157535 text177078 text174393 text181782 text143323 ## topic3 topic7 topic10 topic3 topic3 topic2 ## 10 Levels: topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 ... topic10
# assign topic as a new document-level variabledfmat_news$topic <- topics(tmod_lda)# cross-table of the topic frequencytable(dfmat_news$topic)
## ## topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10 ## 200 218 233 236 192 83 204 185 191 210

Seeded LDA

In the seeded LDA, you can pre-define topics in LDA using a dictionary of “seed” words.

# load dictionary containing seed wordsdict_topic <- dictionary(file = "../dictionary/topics.yml")print(dict_topic)
## Dictionary object with 5 key entries.## - [economy]:## - market*, money, bank*, stock*, bond*, industry, company, shop*## - [politics]:## - lawmaker*, politician*, election*, voter*## - [society]:## - police, prison*, school*, hospital*## - [diplomacy]:## - ambassador*, diplomat*, embassy, treaty## - [military]:## - military, soldier*, terrorist*, marine, navy, army

The number of topics is determined by the number of keys in the dictionary. Next, we can fit the seeded LDA model using textmodel_seededlda() and specify the dictionary with our relevant keywords.

tmod_slda <- textmodel_seededlda(dfmat_news, dictionary = dict_topic)

Some of the topic words are seed words, but the seeded LDA identifies many other related words.

terms(tmod_slda, 20)
## economy politics society diplomacy military ## [1,] "markets" "clinton" "hospital" "refugees" "military" ## [2,] "banks" "sanders" "prison" "syria" "labor" ## [3,] "oil" "cruz" "schools" "brussels" "australian"## [4,] "energy" "obama" "violence" "isis" "corbyn" ## [5,] "climate" "hillary" "officers" "talks" "australia" ## [6,] "sales" "trump's" "hospitals" "un" "turnbull" ## [7,] "prices" "bernie" "victims" "syrian" "johnson" ## [8,] "stock" "ted" "cases" "french" "budget" ## [9,] "food" "rubio" "sexual" "turkey" "leadership"## [10,] "rates" "senator" "drug" "islamic" "cabinet" ## [11,] "sector" "gun" "officer" "border" "shadow" ## [12,] "banking" "politicians" "abuse" "aid" "funding" ## [13,] "businesses" "primary" "crime" "summit" "coalition" ## [14,] "investors" "race" "facebook" "refugee" "senate" ## [15,] "housing" "elections" "medical" "migrants" "terrorist" ## [16,] "shares" "kasich" "child" "agreement" "boris" ## [17,] "costs" "candidates" "mental" "forces" "doctors" ## [18,] "average" "delegates" "died" "military" "tory" ## [19,] "development" "photograph" "parents" "russian" "parties" ## [20,] "trading" "supporters" "mother" "asylum" "jeremy"

topics() returns dictionary keys as the most likely topics of documents.

head(topics(tmod_slda), 20)
## text136751 text136585 text139163 text169133 text153451 text163885 text157885 ## economy military economy society society economy politics ## text173244 text137394 text169408 text184646 text127410 text134923 text169695 ## military military society military society military economy ## text147917 text157535 text177078 text174393 text181782 text143323 ## society economy economy politics society military ## Levels: economy politics society diplomacy military
# assign topics from seeded LDA as a document-level variable to the dfmdfmat_news$topic2 <- topics(tmod_slda)# cross-table of the topic frequencytable(dfmat_news$topic2)
## ## economy politics society diplomacy military ## 479 219 633 233 388
Topic models :: Tutorials for quanteda (2024)
Top Articles
how to load dow30 or US 30 to MT5?
Why I Have All My Brokerage Accounts in One Place
Knoxville Tennessee White Pages
Is Sam's Club Plus worth it? What to know about the premium warehouse membership before you sign up
Cold Air Intake - High-flow, Roto-mold Tube - TOYOTA TACOMA V6-4.0
Wizard Build Season 28
Readyset Ochsner.org
Apex Rank Leaderboard
Elden Ring Dex/Int Build
Atrium Shift Select
Skip The Games Norfolk Virginia
Oppenheimer & Co. Inc. Buys Shares of 798,472 AST SpaceMobile, Inc. (NASDAQ:ASTS)
Elizabethtown Mesothelioma Legal Question
Missing 2023 Showtimes Near Landmark Cinemas Peoria
Sony E 18-200mm F3.5-6.3 OSS LE Review
Gino Jennings Live Stream Today
Munich residents spend the most online for food
Tamilrockers Movies 2023 Download
Katherine Croan Ewald
Diamond Piers Menards
The Ultimate Style Guide To Casual Dress Code For Women
Site : Storagealamogordo.com Easy Call
Is Windbound Multiplayer
Filthy Rich Boys (Rich Boys Of Burberry Prep #1) - C.M. Stunich [PDF] | Online Book Share
Integer Division Matlab
Sandals Travel Agent Login
Horn Rank
Ltg Speech Copy Paste
Random Bibleizer
Craigslist Fort Smith Ar Personals
The Clapping Song Lyrics by Belle Stars
Poe T4 Aisling
R/Sandiego
Kempsville Recreation Center Pool Schedule
Rogold Extension
Beaver Saddle Ark
Log in or sign up to view
A Man Called Otto Showtimes Near Amc Muncie 12
Powerspec G512
Saybyebugs At Walmart
2007 Jaguar XK Low Miles for sale - Palm Desert, CA - craigslist
Miami Vice turns 40: A look back at the iconic series
Love Words Starting with P (With Definition)
Tlc Africa Deaths 2021
Youravon Com Mi Cuenta
Nope 123Movies Full
Kushfly Promo Code
Diario Las Americas Rentas Hialeah
Game Akin To Bingo Nyt
Marion City Wide Garage Sale 2023
Latest Posts
Article information

Author: Amb. Frankie Simonis

Last Updated:

Views: 6074

Rating: 4.6 / 5 (56 voted)

Reviews: 87% of readers found this page helpful

Author information

Name: Amb. Frankie Simonis

Birthday: 1998-02-19

Address: 64841 Delmar Isle, North Wiley, OR 74073

Phone: +17844167847676

Job: Forward IT Agent

Hobby: LARPing, Kitesurfing, Sewing, Digital arts, Sand art, Gardening, Dance

Introduction: My name is Amb. Frankie Simonis, I am a hilarious, enchanting, energetic, cooperative, innocent, cute, joyous person who loves writing and wants to share my knowledge and understanding with you.