토픽 모델링 R

티스토리 뷰

빅데이터분석

토픽 모델링 R

yggo 2020. 11. 2. 17:27

titles.xlsx

0.03MB

다수의 문서의 내용을 분석해서 주요한 단어들을 그룹별로 분류하는 분석방법을 토픽 모델링이라고 한다.

개요

데이터를 엑셀 파일에서 입력받아 토픽 모델링 분석을 실시하다. R의 topicmodels 패키지를 사용한다.

입력

Web of science의 검색 결과를 엑셀 포맷으로 다운로드하여 다음과 같이 수정한다.

분석할 데이터 열만 제외하고 모두 삭제(본 예제에서사용할 열은 'Article Title')
열의 이름을 text로 수정(엑셀에서 1번 행의 값에 해당)
titles.xlsx 로 저장(첨부된 파일로 실습 가능)

아래의 코드는 엑셀 파일을 읽어온 후 각 행의 번호에 해당하는 ID열을 추가한다.

documents <- read.xlsx("titles.xlsx",1,header = TRUE, stringsAsFactors=FALSE)
documents$ID <- seq(1,nrow(documents))
glimpse(documents)

토픽 개수, 토픽당 상위 단어 개수, 추가 불용어를 설정한다. 추가 불용어는 1차 분석 후 재추가 해야 할 수도 있다. 토픽의 개수는 분석가가 판단한다. (참조: cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)

#
# 설정 
#
topic_n <- 16  # 토픽 개수 16개
term_top_n <- 10  # 상위 단어 개수 10개
# custom_stop_words <- c("act","bill")  # 추가 불용어
custom_stop_words <- c("accept","health","studi","factor",
                       "technologi","model",
                       "adopt","intent")  # 추가 불용어

출력

단어들의 그룹(토픽)

분석 절차

전처리(데이터 정제화)

여백, 문장부호, 소문 자화, 불용어 제거

document_tokens <- documents %>%
  unnest_tokens(output = word, input = text) %>%  # tidytext
  # remove numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # remove stop words
  anti_join(stop_words) %>%
  # stem the words
  mutate(word = SnowballC::wordStem(word)) %>%
  filter(!word %in% custom_stop_words)

토픽 분석

> head(document_tokens)
  ID     word
1  1  patient
2  1   experi
3  1    digit
4  1       ag
5  1 investig
6  1   effect

LDA의 입력 포맷이 DTM이기 때문에 cast_dtm으로 자료 포맷을 변환한다.

document_dtm <- document_tokens %>%
  # get count of each token in each document
  count(ID, word) %>%
  # create a document-term matrix with all features and tf weighting
  cast_dtm(document = ID, term = word, value = n)  # tidytext

> document_dtm
<<DocumentTermMatrix (documents: 312, terms: 1001)>>
Non-/sparse entries: 2769/309543
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency (tf)

#
# LDA 실행
#
document_lda <- LDA(document_dtm, k = topic_n, control = list(seed = 123))  # topicmodels
document_lda

결과 해석

토픽별 상위 n개의 단어를 출력한다. 출력된 단어 중에서 불용어 처리를 해야 할 단어가 있으면 custom_stop_words에 추가한 다음 분석을 재실시한다.

> get_terms(document_lda,term_top_n)
      Topic 1   Topic 2    Topic 3    Topic 4     Topic 5       Topic 6      Topic 7  
 [1,] "user"    "digit"    "tool"     "patient"   "smart"       "mobil"      "manag"  
 [2,] "develop" "peopl"    "analysi"  "nurs"      "home"        "cross"      "depress"
 [3,] "design"  "perspect" "test"     "portal"    "influenc"    "app"        "effect" 
 [4,] "evid"    "base"     "develop"  "experi"    "prevent"     "section"    "sexual" 
 [5,] "role"    "qualit"   "user"     "physician" "valid"       "adult"      "therapi"
 [6,] "servic"  "mhealth"  "advanc"   "prefer"    "percept"     "smartphon"  "network"
 [7,] "care"    "elderli"  "protect"  "structur"  "telemedicin" "applic"     "mobil"  
 [8,] "applic"  "patient"  "usabl"    "manag"     "scale"       "understand" "care"   
 [9,] "support" "attitud"  "determin" "inform"    "smartphon"   "china"      "adult"  
[10,] "ag"      "influenc" "approach" "applic"    "french"      "survei"     "servic" 
      Topic 8     Topic 9      Topic 10    Topic 11    Topic 12   Topic 13  Topic 14   
 [1,] "social"    "onlin"      "care"      "system"    "investig" "effect"  "manag"    
 [2,] "person"    "educ"       "electron"  "inform"    "clinic"   "evalu"   "method"   
 [3,] "promot"    "telehealth" "record"    "healthcar" "virtual"  "mental"  "implement"
 [4,] "wearabl"   "experi"     "qualit"    "patient"   "empir"    "base"    "intervent"
 [5,] "inform"    "role"       "medic"     "cancer"    "consum"   "care"    "mix"      
 [6,] "intervent" "examin"     "implement" "provid"    "sustain"  "video"   "diabet"   
 [7,] "extend"    "choic"      "perspect"  "nation"    "realiti"  "patient" "patient"  
 [8,] "control"   "util"       "hospit"    "qualiti"   "energi"   "exercis" "type"     
 [9,] "trial"     "emerg"      "provid"    "support"   "food"     "game"    "diseas"   
[10,] "theori"    "base"       "commun"    "wearabl"   "adolesc"  "moder"   "cohort"   
      Topic 15  Topic 16  
 [1,] "percept" "assess"  
 [2,] "impact"  "design"  
 [3,] "genet"   "approach"
 [4,] "cultur"  "elderli" 
 [5,] "modifi"  "internet"
 [6,] "public"  "devic"   
 [7,] "risk"    "integr"  
 [8,] "food"    "behavior"
 [9,] "us"      "social"  
[10,] "consum"  "stigma"

위의 결과는 16개의 토픽과 각 토픽 내에서 출현 빈도가 높은 단어 10개를 나타낸다. R의 토픽 모델링 분석 기능은 분류만 실시할 뿐 각 토픽에 대한 명칭은 분석가가 분야별 지식을 활용해 직접 지정해야 한다.(요인 분석과 같은 개념)

토픽별 상위 단어의 출현 확률은 다음과 같다.

# A tibble: 16,016 x 3
   topic term      beta
   <int> <chr>    <dbl>
 1     1 ag    2.46e- 2
 2     2 ag    1.84e-30
 3     3 ag    6.38e- 3
 4     4 ag    5.33e- 3
 5     5 ag    2.40e-66
 6     6 ag    6.41e-46
 7     7 ag    1.33e-80
 8     8 ag    1.19e- 2
 9     9 ag    8.57e-17
10    10 ag    6.84e- 5
# ... with 16,006 more rows

위의 결과에서 보듯이 토픽 1에 단어 ag가 출현할 확률이 0.02%이다.

실습

#
# 토픽 모델링 
# 패키지: topicmodels
# 설정: 
#       토픽 개수 16개 
#       상위 단어 개수 10개
#       추가 불용어 목록
#

library(topicmodels)
library(tidyverse)
library(tidytext)
library(xlsx)

#
# 설정 
#
topic_n <- 16  # 토픽 개수 16개
term_top_n <- 10  # 상위 단어 개수 10개
# custom_stop_words <- c("act","bill")  # 추가 불용어
custom_stop_words <- c("accept","health","studi","factor",
                       "technologi","model",
                       "adopt","intent")  # 추가 불용어

#
# 자료확인 
#
documents <- read.xlsx("titles.xlsx",1,header = TRUE, stringsAsFactors=FALSE)
documents$ID <- seq(1,nrow(documents))
glimpse(documents)

# 
# 단어 분리/숫자 제거/불용어 제거/어원 통일 
#
document_tokens <- documents %>%
  unnest_tokens(output = word, input = text) %>%  # tidytext
  # remove numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # remove stop words
  anti_join(stop_words) %>%
  # stem the words
  mutate(word = SnowballC::wordStem(word)) %>%
  filter(!word %in% custom_stop_words) 

document_tokens
glimpse(document_tokens)

#
# DTM으로 변환
#
document_dtm <- document_tokens %>%
  # get count of each token in each document
  count(ID, word) %>%
  # create a document-term matrix with all features and tf weighting
  cast_dtm(document = ID, term = word, value = n)  # tidytext
document_dtm

#
# LDA 실행
#
document_lda <- LDA(document_dtm, k = topic_n, control = list(seed = 123))  # topicmodels
document_lda

#
# 토픽별 상위 term_top_n 개 단어 출력(빈도순)
#
get_terms(document_lda,term_top_n)

#
# tidy형으로 변화(분석 편리)
#
document_lda_td <- tidy(document_lda)
document_lda_td

#
# 상위 term_top_n 개 단어 추출(토픽당) 
#
top_terms <- document_lda_td %>%
  group_by(topic) %>%
  top_n(term_top_n, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms

#
# 토픽별 단어 빈도 막대 그래프 
# beta 토픽내에서 단어가 출현할 확률
#
top_terms %>%
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  coord_flip()

R에서 topicmodels 뿐만 아니라 lda 패키지를 사용하여 토픽 모델링 분석을 할 수도 있다.

저작자표시 비영리 변경금지 (새창열림)

'빅데이터분석' 카테고리의 다른 글

토픽 모델링 tf-idf R (0)	2020.11.03

공지사항

최근에 올라온 글

최근에 달린 댓글

Total

Today

Yesterday

링크

TAG more

« 2025/09 »
일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

글 보관함

Better Researcher

티스토리 뷰

토픽 모델링 R

'빅데이터분석' 카테고리의 다른 글

티스토리툴바