notfounduser's diary

Posts

Showing posts with the label machine learning

python machine learning using Doc2Vec (3/3)

- July 18, 2019

This series uses python to train on sentences and then tests whether a given sentence is a question or a plain statement. Continuing from part 2, this post takes the trained model and runs an actual classification check.

1. Environment: Windows 10, python 3.7, konlpy, gensim

2. edit qna_test.py

    from collections import namedtuple
    from gensim.models import doc2vec
    from konlpy.tag import Twitter
    import multiprocessing
    from pprint import pprint
    from gensim.models import Doc2Vec
    from sklearn.linear_model import LogisticRegression
    import numpy
    import pickle

    twitter = Twitter()

    def read_data(filename):
        with open(filename, 'r', encoding='UTF8') as f:
            data = [line.split('\t') for line in f.read().splitlines()]
        return data

    def tokenize(doc):
        # norm and stem are optional
        return ['/'.join(t) for t in twitter.pos(doc, norm=True, stem=True)]

    # Read the actual run data
    run_data = read_data('C:/work/python/knlp/data/qna_run.txt')

    # Tokenize into morphemes
    run_docs = [(tokeniz...
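As a rough illustration of what tokenize() produces, the sketch below applies the same '/'.join step to a few hand-written (token, tag) pairs. The pairs are made up for this example and only stand in for the output of twitter.pos, which needs KoNLPy and a Java runtime; join_pos_pairs is a hypothetical helper name, not part of the original script.

```python
def join_pos_pairs(pairs):
    """Join each (token, tag) pair into a single 'token/tag' string,
    mirroring the list comprehension inside tokenize()."""
    return ['/'.join(t) for t in pairs]


# Hypothetical (token, tag) pairs, written by hand for illustration.
# They are NOT real tagger output.
pairs = [("밥", "Noun"), ("먹다", "Verb"), ("?", "Punctuation")]

tokens = join_pos_pairs(pairs)
print(tokens)
```

Keeping the tag attached to each token means that the same surface form with different tags ends up as different vocabulary entries.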


python machine learning using Doc2Vec (2/3)

- July 17, 2019

This series uses python to train on sentences and then tests whether a given sentence is a question or a plain statement. Continuing from part 1, this post builds a classification model on top of the trained data.

1. Environment: Windows 10, python 3.7, konlpy, gensim

2. edit qna_test.py

    from collections import namedtuple
    from gensim.models import doc2vec
    from konlpy.tag import Twitter
    import multiprocessing
    from pprint import pprint
    from gensim.models import Doc2Vec
    from sklearn.linear_model import LogisticRegression
    import numpy
    import pickle

    twitter = Twitter()

    def read_data(filename):
        with open(filename, 'r', encoding='UTF8') as f:
            data = [line.split('\t') for line in f.read().splitlines()]
        return data

    def tokenize(doc):
        # norm and stem are optional
        return ['/'.join(t) for t in twitter.pos(doc, norm=True, stem=True)]

    # Read the training and test data
    train_data = read_data('C:/work/python/knlp/data/qna_train.txt')
    test_data = read_data('C:/wo...
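To make the shape of the data concrete, here is a minimal sketch of read_data() applied to a tiny tab-separated file. The two sample rows, the file name qna_sample.txt, and the assumption that each line holds a sentence followed by a 1/0 label are mine; the excerpt does not show the real file contents.

```python
import os
import tempfile


def read_data(filename):
    # Same logic as in the post: one row per line, columns split on tabs.
    with open(filename, 'r', encoding='UTF8') as f:
        data = [line.split('\t') for line in f.read().splitlines()]
    return data


# Hypothetical sample: sentence<TAB>label (1 = question, 0 = statement).
sample = "밥 먹었어?\t1\n나는 밥을 먹었다.\t0\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qna_sample.txt")
    with open(path, "w", encoding="UTF8") as f:
        f.write(sample)
    rows = read_data(path)

# Split the rows into parallel lists of sentences and labels.
sentences = [row[0] for row in rows]
labels = [row[1] for row in rows]
print(rows)
```

Note that read_data() returns every column as a string, so under this assumed layout the labels come back as '1' and '0' rather than integers.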


python machine learning using Doc2Vec (1/3)

- July 17, 2019

This series uses python to train on sentences and then tests whether a given sentence is a question or a plain statement.

1. Environment: Windows 10, python 3.7, konlpy, gensim

2. edit qna_train.py

    from collections import namedtuple
    from gensim.models import doc2vec
    from konlpy.tag import Twitter
    import multiprocessing
    from pprint import pprint

    twitter = Twitter()

    def read_data(filename):
        with open(filename, 'r', encoding='UTF8') as f:
            data = [line.split('\t') for line in f.read().splitlines()]
        return data

    def tokenize(doc):
        # norm and stem are optional
        return ['/'.join(t) for t in twitter.pos(doc, norm=True, stem=True)]

    # doc2vec parameters
    cores = multiprocessing.cpu_count()
    vector_size = 300
    window_size = 15
    word_min_count = 2
    sampling_threshold = 1e-5
    negative_size = 5
    train_epoch = 100
    dm = 1
    worker_count = cores

    # Read the training data
    train_data = read_data('C:/work/py...
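The excerpt is cut off before the model is created, so how the author passes these parameters is not shown. As a hedged sketch, the variables could be mapped onto the keyword arguments of gensim's Doc2Vec constructor roughly as follows. The helper doc2vec_kwargs is a hypothetical name, and the pairing of each variable with a keyword is my reading of the variable names, not something the post confirms.

```python
import multiprocessing

# Hyperparameters as listed in the excerpt.
cores = multiprocessing.cpu_count()
vector_size = 300
window_size = 15
word_min_count = 2
sampling_threshold = 1e-5
negative_size = 5
train_epoch = 100
dm = 1
worker_count = cores


def doc2vec_kwargs():
    """Map the excerpt's variable names onto gensim Doc2Vec keyword names
    (hypothetical helper; the pairing is an assumption)."""
    return {
        "vector_size": vector_size,    # dimensionality of document vectors
        "window": window_size,         # context window size
        "min_count": word_min_count,   # ignore words rarer than this
        "sample": sampling_threshold,  # downsampling threshold for frequent words
        "negative": negative_size,     # number of negative samples
        "epochs": train_epoch,         # training passes over the corpus
        "dm": dm,                      # 1 = distributed memory (PV-DM)
        "workers": worker_count,       # worker threads
    }


kwargs = doc2vec_kwargs()
print(kwargs)

# With gensim installed, a model could then be created along these lines:
#   from gensim.models import doc2vec
#   model = doc2vec.Doc2Vec(**kwargs)
```

Collecting the values in one dictionary keeps the settings in a single place, which makes it easier to reuse the same configuration when the model is reloaded later.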
