Natural Language Processing with Deep Learning (1): Text Preprocessing, Word Embeddings

1. Text Preprocessing

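Natural Language Processing (NLP): a field of artificial intelligence that uses computers to analyze and process human language.

Tokenization: splitting a given text into individual words (tokens).

- When tokenizing, special characters need to be handled and the text needs to be lowercased.

- Counting word frequencies in the data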

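Take a look at the text.txt file, which contains the IMDB dataset. Each line in the file corresponds to one review.
While loading the text data, build a dictionary named word_counter whose keys are words and whose values are their frequencies.
- Each line in the file ends with the newline character (\n). Use rstrip() to remove it from the end of each line.
- split() separates each line on whitespace, which gives you the individual words.
Using word_counter, store the sum of the frequencies of all words in text.txt in the variable total.
Using word_counter, store the words that occur at least 100 times in text.txt in the list up_five.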

word_counter = dict()  # empty dictionary mapping word -> frequency

# Build a dictionary with words as keys and frequencies as values.
with open('text.txt', 'r') as f:
    for line in f:
        for word in line.rstrip().split():  # strip the trailing newline, then split on whitespace
            if word not in word_counter:  # first occurrence of this word
                word_counter[word] = 1  # start its count at 1
            else:
                word_counter[word] += 1  # otherwise increment the count

print(word_counter)


# Compute the total frequency of all words in the text file.
total = 0

# Store the words that occur at least 100 times in a list.
up_five = list()

for word, freq in word_counter.items():
    total += freq  # accumulate each word's frequency
    if freq >= 100:  # frequency is 100 or more
        up_five.append(word)  # keep the word (the dictionary key)

print(total)
print(up_five)
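
The same counting can be written more compactly with collections.Counter from the standard library. This is only a minimal sketch, not the exercise's reference solution: the reviews list is a made-up stand-in for the lines of text.txt, and the threshold is lowered to 2 so that the tiny sample produces some output.

```python
from collections import Counter

# Hypothetical sample lines standing in for the contents of text.txt.
reviews = [
    "the movie was great\n",
    "the plot was thin but the cast was great\n",
]

word_counter = Counter()
for line in reviews:
    # rstrip() drops the trailing newline; split() splits on whitespace.
    word_counter.update(line.rstrip().split())

# Total number of word occurrences across all lines.
total = sum(word_counter.values())

# Words that occur at least `threshold` times (the exercise uses 100).
threshold = 2
frequent = [word for word, freq in word_counter.items() if freq >= threshold]

print(total)
print(frequent)
```

Counter behaves like a dictionary, so the loop over word_counter.items() from the exercise works on it unchanged.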

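- Exploring data through text preprocessing: explore the data after preprocessing that lowercases the text and removes special characters.

While loading the movie reviews, lowercase every review and remove all digits and special characters from each word, keeping only letters.
- string.lower(): converts the string to all lowercase.
- regex.sub('', string): removes every character in the string that matches the regular expression stored in regex (replaces it with '').
- Store each preprocessed word and its frequency in the word_counter dictionary.
Store the frequency of the word "the" in text.txt in the variable count.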

import re

word_counter = dict()
regex = re.compile('[^a-z A-Z]')


# Build a dictionary from lowercased text with digits and special characters removed.
with open('text.txt', 'r') as f:  # load the IMDB dataset the same way as in exercise 1
    for line in f:
        words = line.rstrip().lower().split()  # strip the newline, lowercase, split into words
        for word in words:  # visit each token in turn
            processed_word = regex.sub('', word)  # remove every non-letter character
            if not processed_word:  # skip tokens that had no letters at all, e.g. "10/10"
                continue

            if processed_word not in word_counter:
                word_counter[processed_word] = 1  # first occurrence: start the count at 1
            else:
                word_counter[processed_word] += 1

# Look up the frequency of the word "the".
count = word_counter["the"]

print(count)
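
To see what the lowercasing and regex step does to a single line, here is a small sketch that applies it to one string. The preprocess helper and the sample review line are made up for illustration and are not part of the exercise.

```python
import re

# Matches every character that is not an ASCII letter.
non_alpha = re.compile(r"[^a-zA-Z]")

def preprocess(line):
    """Lowercase a line, split on whitespace, and strip non-letters from each token."""
    tokens = []
    for word in line.rstrip().lower().split():
        cleaned = non_alpha.sub("", word)
        if cleaned:  # drop tokens that become empty after cleaning
            tokens.append(cleaned)
    return tokens

# Hypothetical review line, not taken from the dataset.
print(preprocess("The BEST film of 2010, 10/10!!\n"))
```

Tokens made up entirely of digits and punctuation become empty strings after cleaning, which is why the helper filters them out before they reach a counter.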

- Stopword removal and stemming with NLTK

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

test_sentences = [
    "i have looked forward to seeing this since i first saw it amoungst her work",
    "this is a superb movie suitable for all but the very youngest",
    "i first saw this movie when I was a little kid and fell in love with it at once",
    "i am sooo tired but the show must go on",
]

# Store the English stopwords. A separate name is used so that the imported
# `stopwords` module is not overwritten.
english_stopwords = stopwords.words('english')

print(english_stopwords)

# Add extra stopwords and store the updated list.
new_keywords = ['noone', 'sooo', 'thereafter', 'beyond', 'amoungst', 'among']
updated_stopwords = english_stopwords + new_keywords  # list concatenation

print(updated_stopwords)

# Preprocess test_sentences with the updated stopwords and store the result in tokenized_word.
tokenized_word = []

for sentence in test_sentences:
    tokens = word_tokenize(sentence)  # split into word tokens; unlike split(), punctuation becomes separate tokens
    new_sent = []
    for token in tokens:
        if token not in updated_stopwords:
            new_sent.append(token)
    tokenized_word.append(new_sent)

print(tokenized_word)

# Apply stemming: reduce each word to its stem, e.g. looked -> look.
stemmed_sent = []
stemmer = PorterStemmer()

for word in tokenized_word[0]:
    stemmed_sent.append(stemmer.stem(word))
print(stemmed_sent)
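
One detail to watch in the filtering step: NLTK's English stopword list is lowercase, so the capitalized "I" in the third test sentence is not removed by an exact-match check. A possible workaround, sketched here with a small made-up stopword set rather than NLTK's list, is to lowercase each token before comparing. The names sample_stopwords and remove_stopwords are hypothetical.

```python
# Small hypothetical stopword set for illustration; NLTK's English list is much longer.
sample_stopwords = {"i", "a", "was", "when", "this", "and", "in", "with", "it", "at"}

def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens whose lowercase form is not in stopword_set."""
    return [token for token in tokens if token.lower() not in stopword_set]

# str.split() is used here instead of word_tokenize to keep the sketch dependency-free.
tokens = "i first saw this movie when I was a little kid".split()
print(remove_stopwords(tokens, sample_stopwords))
```

Using a set instead of a list also makes each membership check a constant-time lookup, which helps on larger corpora.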

2. Word Embeddings

word2vec learns word vectors by using the context around each word: it is trained to predict which word occurs in a given context.
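
To make "context" concrete, the sketch below lists the (center word, context word) pairs that a window of 2 produces for one short sentence. It only illustrates how the training pairs are formed; it is not how gensim or any other library implements training, and the context_pairs helper and the sample sentence are made up for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "i feel very happy today".split()
pairs = context_pairs(sentence, window=2)
for pair in pairs:
    print(pair)
```

With window=2, each word is paired with up to two words on its left and two on its right, so words near the sentence boundaries get fewer pairs.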

3. word2vec

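Measuring word similarity with word2vec

word2vec learns word embedding vectors with a neural network. In this exercise we train word2vec using gensim, a Python library.

Instructions

The load_data function, which loads the Emotions dataset for NLP, is already written.
Using the text data stored in input_data, train a word2vec model with window (the context length around each word) set to 2 and a vector dimension of 300. (Set epochs to 10.)
Store the 10 words most similar to happy in the variable similar_happy.
Store the 10 words most similar to sad in the variable similar_sad.
Store the similarity between the embedding vectors of good and bad in the variable similar_good_bad.
Store the similarity between the embedding vectors of sad and lonely in the variable similar_sad_lonely.
Store the embedding vector of happy in the variable wv_happy.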

import pandas as pd
from gensim.models import Word2Vec

def load_data(filepath):
    data = pd.read_csv(filepath, delimiter=';', header=None, names=['sentence','emotion'])
    data = data['sentence']

    gensim_input = []
    for text in data:
        gensim_input.append(text.rstrip().split())
    return gensim_input

input_data = load_data("emotions_train.txt")

# Train the word2vec model.
w2v_model = Word2Vec(window=2, vector_size=300)  # up to 2 words on each side as context; 300-dimensional vectors
w2v_model.build_vocab(input_data)  # build the vocabulary from input_data
w2v_model.train(input_data, total_examples=w2v_model.corpus_count, epochs=10)  # corpus_count is the number of sentences seen by build_vocab; 10 passes over the data



# Find the words most similar to happy.
similar_happy = w2v_model.wv.most_similar("happy")  # returns the 10 nearest words by default

print(similar_happy)

# Find the words most similar to sad.
similar_sad = w2v_model.wv.most_similar("sad")  # returns the 10 nearest words by default
print(similar_sad)

# Check the similarity between the embedding vectors of good and bad.
similar_good_bad = w2v_model.wv.similarity("good", "bad")

print(similar_good_bad)

# Check the similarity between the embedding vectors of sad and lonely.
similar_sad_lonely = w2v_model.wv.similarity("sad", "lonely")
# This is cosine similarity: values close to 1 mean very similar; values near 0 or below mean dissimilar.
print(similar_sad_lonely)

# Look up the embedding vector of happy.
wv_happy = w2v_model.wv['happy']

print(wv_happy)
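
The similarity scores above are cosine similarities between embedding vectors. The sketch below computes the same quantity by hand on toy 2-dimensional vectors, to show how the score should be read; the cosine_similarity helper and the vectors are made up for illustration and are unrelated to the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: same direction, perpendicular, and opposite direction.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, perpendicular vectors score close to 0, and vectors pointing in opposite directions score close to -1.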

4. fastText

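word2vec can only produce vectors for words that appeared in the training data, so it suffers from the out-of-vocabulary (OOV) problem.

Generating word embedding vectors with fastText

fastText addresses the out-of-vocabulary problem, which is a weakness of word2vec. In this exercise we train fastText using gensim, a Python library.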

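Instructions

Using the text data stored in input_data, train a fastText model with window (the context length around each word) set to 3, a vector dimension of 100, and min_count (the minimum number of occurrences for a word to be kept) set to 10.
- Set epochs to 10.
Store the 10 words most similar to day in the variable similar_day.
Store the 10 words most similar to night in the variable similar_night.
Store the embedding vector of elllllllice in the variable wv_elice.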

from gensim.models import FastText
import pandas as pd

# load_data() loads the Emotions dataset for NLP.
def load_data(filepath):
    data = pd.read_csv(filepath, delimiter=';', header=None, names=['sentence','emotion'])
    data = data['sentence']

    gensim_input = []
    for text in data:
        gensim_input.append(text.rstrip().split())

    return gensim_input

input_data = load_data("emotions_train.txt")

# Train the fastText model.
ft_model = FastText(min_count=10, window=3, vector_size=100)
ft_model.build_vocab(input_data)
ft_model.train(input_data, total_examples=ft_model.corpus_count, epochs=10)  # epochs=10 as specified in the instructions

# Find the 10 words most similar to day.
similar_day = ft_model.wv.most_similar("day")

print(similar_day)

# Find the 10 words most similar to night.
similar_night = ft_model.wv.most_similar("night")

print(similar_night)

# Look up the embedding vector of elllllllice, a word that is not in the training data.
wv_elice = ft_model.wv['elllllllice']

print(wv_elice)
