토픽 모델링 tf-idf R

티스토리 뷰

빅데이터분석

토픽 모델링 tf-idf R

yggo 2020. 11. 3. 15:10

titles.xlsx

0.03MB

다수의 문서의 내용을 분석해서 주요한 단어들을 그룹별로 분류하는 분석방법을 토픽 모델링이라고 한다.

개요

데이터를 엑셀 파일에서 입력받아 tf-idf 적용 후 토픽 모델링 분석을 실시하다. R의 topicmodels 패키지를 사용한다.

입력

Web of science의 검색 결과를 엑셀 포맷으로 다운로드하여 다음과 같이 수정한다.

분석할 데이터 열만 제외하고 모두 삭제(본 예제에서사용할 열은 'Article Title')
열의 이름을 text로 수정(엑셀에서 1번 행의 값에 해당)
titles.xlsx 로 저장(첨부된 파일로 실습 가능)

아래의 코드는 엑셀 파일을 읽어온 후 각 행의 번호에 해당하는 ID열을 추가한다.

documents <- read.xlsx("titles.xlsx",1,header = TRUE, stringsAsFactors=FALSE)
documents$ID <- seq(1,nrow(documents))
glimpse(documents)

토픽 개수, 토픽당 상위 단어 개수, 추가 불용어를 설정한다. 추가 불용어는 1차 분석 후 재추가 해야 할 수도 있다. 토픽의 개수는 분석가가 한단한다. (참조: cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)

#
# 설정 
#
topic_n <- 16  # 토픽 개수 16개
term_top_n <- 10  # 상위 단어 개수 10개
custom_stop_words <- c("accept","health","studi","factor",
                       "technologi","model",
                       "adopt","intent")  # 추가 불용어

tf-idf는 단어의 빈도뿐만 아니라 단어가 여러 문서에 얼마나 중복적으로 출현하는지까지 고려한다. 즉, 많은 문서에 포함된 단어라면 중요도가 상대적으로 떨어진다고 간주한다. 이러한 단어를 제거한 후에 DTM을 만들고 LDA를 실시한다.

출력

단어들의 그룹(토픽)

분석 절차

전처리(데이터 정제화)

여백, 문장부호, 소문 자화, 불용어 제거

document_tokens <- documents %>%
  unnest_tokens(output = word, input = text) %>%  # tidytext
  # 숫자 제거
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # 불용어 제거
  anti_join(stop_words) %>%
  # 단어를 어원으로 변경(어원이 같은 단어를 동일어로 처리)
  mutate(word = SnowballC::wordStem(word)) %>%
  filter(!word %in% custom_stop_words)

토픽 분석

> head(document_tokens)
  ID     word
1  1  patient
2  1   experi
3  1    digit
4  1       ag
5  1 investig
6  1   effect

tf-idf를 적용한다.

#
# tf_idf 적용(무의미 단어 제거, 속도 향상)
# idf: 문서가 많을수록 값이 작아짐 
#

#
# tf 계산
# 하나의 문서에 동일단어 복수 출현시 행병합
#
document_tokens <- document_tokens %>%
  group_by(ID) %>%
  count(ID,word) %>%
  ungroup()

# idf 계산용
# 문서별 총 단어수 계산
total_tokens <- document_tokens %>% 
  group_by(ID) %>%
  summarize( total = sum(n) ) %>%
  ungroup()
  
total_tokens

document_tokens <- left_join(document_tokens,total_tokens)
document_tokens

#
# tf_idf 계산
# 기존데이터에 tf, idf, tf_idf 행 추가
document_tokens <- document_tokens %>%
  bind_tf_idf(word,ID,n)
document_tokens

#
# tf_idf 높은 단어 확인
#
document_tokens %>% arrange(desc(tf_idf))

#
# tf_idf, 출현회수를 기준으로 필터링
# 
document_tokens <- document_tokens %>%
  filter(tf_idf > tf_idf_minimum) %>%
  filter(n > tokens_n)

document_tokens

LDA의 입력 포맷이 DTM이기 때문에 cast_dtm으로 자료 포맷을 변환한다.

document_dtm <- document_tokens %>%
  cast_dtm(document = ID, term = word, value = n)  # tidytext

> document_dtm
<<DocumentTermMatrix (documents: 312, terms: 1001)>>
Non-/sparse entries: 2769/309543
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency (tf)

#
# LDA 실행
#
document_lda <- LDA(document_dtm, k = topic_n, control = list(seed = 123))  # topicmodels
document_lda

결과 해석

토픽별 상위 n개의 단어를 출력한다. 출력된 단어 중에서 불용어 처리를 해야 할 단어가 있으면 custom_stop_words에 추가한 다음 분석을 재실시한다.

> get_terms(document_lda,term_top_n)
      Topic 1   Topic 2    Topic 3    Topic 4     Topic 5       Topic 6      Topic 7  
 [1,] "user"    "digit"    "tool"     "patient"   "smart"       "mobil"      "manag"  
 [2,] "develop" "peopl"    "analysi"  "nurs"      "home"        "cross"      "depress"
 [3,] "design"  "perspect" "test"     "portal"    "influenc"    "app"        "effect" 
 [4,] "evid"    "base"     "develop"  "experi"    "prevent"     "section"    "sexual" 
 [5,] "role"    "qualit"   "user"     "physician" "valid"       "adult"      "therapi"
 [6,] "servic"  "mhealth"  "advanc"   "prefer"    "percept"     "smartphon"  "network"
 [7,] "care"    "elderli"  "protect"  "structur"  "telemedicin" "applic"     "mobil"  
 [8,] "applic"  "patient"  "usabl"    "manag"     "scale"       "understand" "care"   
 [9,] "support" "attitud"  "determin" "inform"    "smartphon"   "china"      "adult"  
[10,] "ag"      "influenc" "approach" "applic"    "french"      "survei"     "servic" 
      Topic 8     Topic 9      Topic 10    Topic 11    Topic 12   Topic 13  Topic 14   
 [1,] "social"    "onlin"      "care"      "system"    "investig" "effect"  "manag"    
 [2,] "person"    "educ"       "electron"  "inform"    "clinic"   "evalu"   "method"   
 [3,] "promot"    "telehealth" "record"    "healthcar" "virtual"  "mental"  "implement"
 [4,] "wearabl"   "experi"     "qualit"    "patient"   "empir"    "base"    "intervent"
 [5,] "inform"    "role"       "medic"     "cancer"    "consum"   "care"    "mix"      
 [6,] "intervent" "examin"     "implement" "provid"    "sustain"  "video"   "diabet"   
 [7,] "extend"    "choic"      "perspect"  "nation"    "realiti"  "patient" "patient"  
 [8,] "control"   "util"       "hospit"    "qualiti"   "energi"   "exercis" "type"     
 [9,] "trial"     "emerg"      "provid"    "support"   "food"     "game"    "diseas"   
[10,] "theori"    "base"       "commun"    "wearabl"   "adolesc"  "moder"   "cohort"   
      Topic 15  Topic 16  
 [1,] "percept" "assess"  
 [2,] "impact"  "design"  
 [3,] "genet"   "approach"
 [4,] "cultur"  "elderli" 
 [5,] "modifi"  "internet"
 [6,] "public"  "devic"   
 [7,] "risk"    "integr"  
 [8,] "food"    "behavior"
 [9,] "us"      "social"  
[10,] "consum"  "stigma"

위의 결과는 16개의 토픽과 각 토픽 내에서 출현 빈도가 높은 단어 10개를 나타낸다. R의 토픽 모델링 분석 기능은 분류만 실시할 뿐 각 토픽에 대한 명칭은 분석가가 분야별 지식을 활용해 직접 지정해야 한다.(요인 분석과 같은 개념)

토픽별 상위 단어의 출현 확률은 다음과 같다.

# A tibble: 160 x 3
   topic term      beta
   <int> <chr>    <dbl>
 1     1 user    0.0956
 2     1 develop 0.0551
 3     1 design  0.0448
 4     1 evid    0.0352
 5     1 role    0.0332
 6     1 servic  0.0305
 7     1 care    0.0296
 8     1 applic  0.0287
 9     1 support 0.0279
10     1 ag      0.0246
# ... with 150 more rows

위의 결과에서 보듯이 토픽 1에 단어 ag가 출현할 확률이 0.02%이다.

실습

#
# 토픽 모델링 
# 패키지: topicmodels
# 설정: 
#       토픽 개수 16개 
#       상위 단어 개수 10개
#       추가 불용어 목록
#

library(topicmodels)
library(tidyverse)
library(tidytext)
library(xlsx)

#
# 설정 
#
topic_n <- 16  # 토픽 개수 16개
term_top_n <- 10  # 상위 단어 개수 10개
tf_idf_minimum <- 0.1 # tf_idf 최소치
tokens_n <- 0 # 출현회수 
custom_stop_words <- c("accept","health","studi","factor",
                       "technologi","model",
                       "adopt","intent")  # 추가 불용어

#
# 자료확인 
#
documents <- read.xlsx("titles.xlsx",1,header = TRUE, stringsAsFactors=FALSE)
documents$ID <- seq(1,nrow(documents))
glimpse(documents)

# 
# 단어 분리/숫자 제거/불용어 제거/어원 통일 
#
document_tokens <- documents %>%
  unnest_tokens(output = word, input = text) %>%  # tidytext
  # remove numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  # remove stop words
  anti_join(stop_words) %>%
  # stem the words
  mutate(word = SnowballC::wordStem(word)) %>%
  filter(!word %in% custom_stop_words) 

document_tokens <- document_tokens
head(document_tokens)
glimpse(document_tokens)

#
# tf_idf 적용(무의미 단어 제거, 속도 향상)
# idf: 문서가 많을수록 값이 작아짐 
#

#
# tf 계산
# 하나의 문서에 동일단어 복수 출현시 행병합
#
document_tokens <- document_tokens %>%
  group_by(ID) %>%
  count(ID,word) %>%
  ungroup()

# idf 계산용
# 문서별 총 단어수 계산
total_tokens <- document_tokens %>% 
  group_by(ID) %>%
  summarize( total = sum(n) ) %>%
  ungroup()
  
total_tokens

document_tokens <- left_join(document_tokens,total_tokens)
document_tokens

#
# tf_idf 계산
# 기존데이터에 tf, idf, tf_idf 행 추가
document_tokens <- document_tokens %>%
  bind_tf_idf(word,ID,n)
document_tokens

#
# tf_idf 높은 단어 확인
#
document_tokens %>% arrange(desc(tf_idf))

#
# tf_idf, 출현회수를 기준으로 필터링
# 
document_tokens <- document_tokens %>%
  filter(tf_idf > tf_idf_minimum) %>%
  filter(n > tokens_n)

document_tokens

#
# DTM으로 변환(tidytext->DTM)
#
document_dtm <- document_tokens %>%
  cast_dtm(document = ID, term = word, value = n)  # tidytext
document_dtm

#
# LDA 실행
#
document_lda <- LDA(document_dtm, k = topic_n, control = list(seed = 123))  # topicmodels
document_lda

#
# 토픽별 상위 term_top_n 개 단어 출력(빈도순)
#
get_terms(document_lda,term_top_n)

#
# tidy형으로 변화(분석 편리)
#
document_lda_td <- tidy(document_lda)
document_lda_td

#
# 상위 term_top_n 개 단어 추출(토픽당) 
#
top_terms <- document_lda_td %>%
  group_by(topic) %>%
  top_n(term_top_n, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms

#
# 토픽별 단어 빈도 막대 그래프 
# beta 토픽내에서 단어가 출현할 확률
#
top_terms %>%
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  coord_flip()

R에서 topicmodels 뿐만 아니라 lda 패키지를 사용하여 토픽 모델링 분석을 할 수도 있다.

저작자표시 비영리 변경금지 (새창열림)

'빅데이터분석' 카테고리의 다른 글

토픽 모델링 R (0)	2020.11.02

공지사항

최근에 올라온 글

최근에 달린 댓글

Total

Today

Yesterday

링크

TAG more

« 2025/09 »
일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

글 보관함

Better Researcher

티스토리 뷰

토픽 모델링 tf-idf R

'빅데이터분석' 카테고리의 다른 글

티스토리툴바