Natural Language Processing and Text Mining (NLPTM)
Part of the Combined Module: Unstructured Data Analysis (UDA)
Prerequisites:
- NLPTM-01 and NLPTM-02
Outline:
- POS Tag
- WordNet
- Word Sense Disambiguation
Code Lesson NLPTM-03
The code for this lesson can be accessed at the following link (a Google/Gmail login is required): Code NLPTM-03
At that link you can edit the code and run it directly. Further explanation is given in the accompanying video.
It is highly recommended to open the code and the video side-by-side for the best learning experience. Try modifying (experimenting with) things beyond what is shown in the video to deepen your understanding. Of course, also feel free to consult other references to enrich your knowledge, then discuss them in the forum provided.
Video Lesson NLPTM-03
Unstructured Data Analysis (UDA)
NLPTM-03: Fundamentals of Natural Language Processing (NLP) - Part 03
(C) Taufik Sutanto - 2020
tau-data Indonesia ~ https://tau-data.id/uda/
Outline Module NLPTM-03/UDA-03:
- POS Tag
- WordNet and WSD
# Run this cell ONLY if you are using Google Colab.
# If you run the code locally, see NLPTM-02 for the installation steps.
import warnings; warnings.simplefilter('ignore')
import nltk
try:
    import google.colab
    IN_COLAB = True
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/taudataDDGsna.py
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/taudataNlpTm.py
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/contoh.pdf
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/slang.txt
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/stopwords_id.txt
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/stopwords_en.txt
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/kata_dasar.txt
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/wn-ind-def.tab
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/wn-msa-all.tab
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/ind_SA.csv
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/all_indo_man_tag_corpus_model.crf.tagger
    !pip install --upgrade spacy python-crfsuite unidecode textblob sastrawi sklearn-pycrfsuite
    !pip install --upgrade twython tweepy beautifulsoup4 tika
    !python -m spacy download en
    !python -m spacy download xx
    !python -m spacy download en_core_web_sm
    nltk.download('popular')
except ModuleNotFoundError:
    IN_COLAB = False
    print("Running the code locally; please make sure all Python module versions match the Colab environment and that all data/assets have been downloaded.")
Tokenization
Tokenization is the separation of words, symbols, phrases, and other meaningful units (called tokens) from a text for further analysis. In NLP a token is often taken to mean "a word", although tokenization can also be applied to sentences, paragraphs, or other meaningful units (e.g., a DNA string pattern in bioinformatics).
Why do we need tokenization?
- It is an essential preprocessing step that avoids the complexity of working directly on the raw string.
- It avoids (semantic) problems when processing natural-language models.
- It is a systematic step in transforming unstructured (text) data into a structured form that is easier to process. (A minimal example is sketched below.)
[Image Source]: https://www.softwareadvice.com/resources/what-is-text-analytics/
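As a quick illustration, here is a minimal tokenization sketch using NLTK (assuming the `popular` NLTK data from the setup cell above has been downloaded; the sentence is illustrative, not from the lesson):
# Minimal tokenization sketch
import nltk
teks = "Tokenization splits text into tokens. It is a key preprocessing step."
print(nltk.word_tokenize(teks))  # word-level tokens; punctuation becomes its own token
print(nltk.sent_tokenize(teks))  # sentence-level tokenization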
Stemming and Lemmatization
- A stemmer produces a word form agreed upon by a given system, without regard to sentence context. The only requirement is that words with similar meanings are mapped consistently to one canonical form. Stemming is widely used in IR and is computationally cheap; it usually works by stripping affixes (suffixes/prefixes).
- Lemmatization produces an actual dictionary word (the lemma) and depends on context.
- Stemming and lemmatization may well both produce the same root word. Example: Melompat ==> lompat. (A short comparison sketch follows.)
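A minimal sketch contrasting the two, using NLTK for English and Sastrawi (installed in the setup cell) for Indonesian; the example words are illustrative:
# Stemming vs. lemmatization sketch
from nltk.stem import PorterStemmer, WordNetLemmatizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
print(PorterStemmer().stem('studies'))           # 'studi' : not a dictionary word
print(WordNetLemmatizer().lemmatize('studies'))  # 'study' : a dictionary word
stemmer_id = StemmerFactory().create_stemmer()   # Indonesian stemmer
print(stemmer_id.stem('Melompat'))               # 'lompat'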
Part-of-Speech (POS) in Linguistics
# POS tags in NLTK (English)
import nltk
T = 'I am currently learning NLP in English, but if possible I want to know NLP in Indonesian language too'
nltk_tokens = nltk.word_tokenize(T)
print(nltk.pos_tag(nltk_tokens))
# No longer just the 9 kinds of tags traditionally discussed by linguists
[('I', 'PRP'), ('am', 'VBP'), ('currently', 'RB'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('English', 'NNP'), (',', ','), ('but', 'CC'), ('if', 'IN'), ('possible', 'JJ'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('know', 'VB'), ('NLP', 'NNP'), ('in', 'IN'), ('Indonesian', 'JJ'), ('language', 'NN'), ('too', 'RB')]
Z = nltk.pos_tag(nltk_tokens)
Z[0][1]
'PRP'
# Filtering tokens by POS tag
pos = set(['NN','JJ'])
hasil = []
for pt in Z:
if pt[1] in pos:
hasil.append(pt[0])
hasil
['possible', 'Indonesian', 'language']
# The Pythonic way: a flat list comprehension
[t[0] for t in Z if t[1] in pos]
['possible', 'Indonesian', 'language']
# A text-mining use: append the POS tag so the same word is distinguished when its POS differs
hasil = []
for pt in Z:
hasil.append(pt[0]+'_'+pt[1])
print(hasil)
['I_PRP', 'am_VBP', 'currently_RB', 'learning_VBG', 'NLP_NNP', 'in_IN', 'English_NNP', ',_,', 'but_CC', 'if_IN', 'possible_JJ', 'I_PRP', 'want_VBP', 'to_TO', 'know_VB', 'NLP_NNP', 'in_IN', 'Indonesian_JJ', 'language_NN', 'too_RB']
# POS tags in TextBlob (English)
from textblob import TextBlob
for word, pos in TextBlob(T).tags:
print("{}_{}".format(word, pos), end=', ')
I_PRP, am_VBP, currently_RB, learning_VBG, NLP_NNP, in_IN, English_NNP, but_CC, if_IN, possible_JJ, I_PRP, want_VBP, to_TO, know_VB, NLP_NNP, in_IN, Indonesian_JJ, language_NN, too_RB,
# POS tags in spaCy (English)
#from spacy.lang.en import English
#nlp_en = English()
import spacy
nlp_en = spacy.load('en_core_web_sm')
tokens = nlp_en(T)
for tok in tokens:
print("{}_{}".format(tok, tok.tag_), end = ', ')
I_PRP, am_VBP, currently_RB, learning_VBG, NLP_NNP, in_IN, English_NNP, ,_,, but_CC, if_IN, possible_JJ, I_PRP, want_VBP, to_TO, know_VB, NLP_NNP, in_IN, Indonesian_JJ, language_NN, too_RB,
# spaCy does not need a POS-tag lookup table ... just use the "explain" command
spacy.explain('CC')
# Full list: https://spacy.io/api/annotation#pos-tagging
'conjunction, coordinating'
# POS tags in spaCy - Indonesian?
from spacy.lang.id import Indonesian
nlp_id = Indonesian() # Language Model
Ti = "Saat bepergian ke Jogjakarta jangan lupa membeli oleh-oleh"
Teks = nlp_id(Ti)
for token in Teks:
print(token.lemma_, token.tag_)
# POS tagging is not yet available for Indonesian in spaCy .. :(
Saat bepergian ke Jogjakarta jangan lupa membeli oleh-oleh
Ti.split()
['Saat', 'bepergian', 'ke', 'Jogjakarta', 'jangan', 'lupa', 'membeli', 'oleh-oleh']
# POS tags for Indonesian via NLTK's CRFTagger
# https://yudiwbs.wordpress.com/2018/02/20/pos-tagger-bahasa-indonesia-dengan-pytho/
from nltk.tag import CRFTagger
ct = CRFTagger()
ct.set_model_file('data/all_indo_man_tag_corpus_model.crf.tagger')
hasil = ct.tag_sents([Ti.split()])  # Careful: the expected input is a "list of lists" (a list of tokenized sentences)!
hasil = hasil[0]
print(hasil)
# Be careful with the structure of the input data
[('Saat', 'NN'), ('bepergian', 'NN'), ('ke', 'IN'), ('Jogjakarta', 'NNP'), ('jangan', 'NEG'), ('lupa', 'VB'), ('membeli', 'VB'), ('oleh-oleh', 'IN')]
WordNet
- An English lexical database built around the concept of "word senses"
- Initiated by George A. Miller in the mid-1980s
- WordNet encodes semantic relations between words.
WordNet structure:
Several important terms:
- Semantics: the meaning of human expression conveyed through language.
  NLP can only attempt to recover the denotative meaning of a language, and the lexical terms processed are limited to dictionary words.
- Within semantics there are several concepts:
  - Polysemy: a word with more than one meaning. Example: Apple (the fruit and the brand).
  - Synonymy: words with the same meaning.
  - Antonymy: words with opposite meanings.
  - Hyponymy: specific ==> general. Example: "merah" (red) is a hyponym of "warna" (color). This relation is derived from the taxonomy (hierarchical structure) of words.
  - Hypernymy: general ==> specific. Example: "warna" (color) is the hypernym of "merah" (red).
  - Idiom: an expression whose meaning differs significantly from that of its constituent words. Examples: "buah tangan" (souvenir), "meja hijau" (court of law), etc.
  - Meronymy: the part-of semantic relation. Example: "jari" (finger) is a meronym of "tangan" (hand), and "roda" (wheel) is a meronym of "mobil" (car).
  - Holonymy: the inverse of meronymy. Example: "tangan" (hand) is a holonym of "jari" (finger), and "mobil" (car) is a holonym of "roda" (wheel). (A short code sketch of these relations follows this list.)
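A minimal sketch (not from the original lesson) probing a few of these relations through NLTK's WordNet interface; the example words are illustrative:
# Sketch: WordNet semantic relations in NLTK
from nltk.corpus import wordnet as wn
good = wn.synset('good.a.01')
print(good.lemmas()[0].antonyms())              # antonymy, e.g. good <-> bad
car = wn.synset('car.n.01')
print(car.hypernyms())                          # hypernyms: the more general concepts
print(car.part_meronyms()[:3])                  # meronyms: parts of a car
print(wn.synset('wheel.n.01').part_holonyms())  # holonyms: wholes a wheel is part of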
WordNet applications:
- Word Similarity
- Word Sense Disambiguation (WSD)
- Improving Information Retrieval
- Machine Translation, etc.
WordNet Size:
# WordNet interface - synonym sets (synsets)
from nltk.corpus import wordnet as wn # Load English WordNet
print([sinonim.name() for sinonim in wn.synsets("bank")])
# The result is a list of triplets (<lemma>.<pos>.<number>)
# <lemma> : the word's morphological stem (base/dictionary form)
# <pos ~ part-of-speech> : n-NOUN, v-VERB, a-ADJECTIVE, s-ADJECTIVE SATELLITE, r-ADVERB
# <number> : the sense number, an integer index "ordered from the most common usage"
# http://www.nltk.org/api/nltk.corpus.reader.html?highlight=wordnet#nltk.corpus.reader.wordnet.Lemma.synset
['bank.n.01', 'depository_financial_institution.n.01', 'bank.n.03', 'bank.n.04', 'bank.n.05', 'bank.n.06', 'bank.n.07', 'savings_bank.n.02', 'bank.n.09', 'bank.n.10', 'bank.v.01', 'bank.v.02', 'bank.v.03', 'bank.v.04', 'bank.v.05', 'deposit.v.02', 'bank.v.07', 'trust.v.01']
# Synonyms depend on the part of speech: NOUN, VERB, ADJ, ADV
print( [s.name() for s in wn.synsets("run", pos=wn.NOUN)[:5]] )
print( [s.name() for s in wn.synsets("run", pos=wn.VERB)[:5]] )
['run.n.01', 'test.n.05', 'footrace.n.01', 'streak.n.01', 'run.n.05']
['run.v.01', 'scat.v.01', 'run.v.03', 'operate.v.01', 'run.v.05']
print( [s.name() for s in wn.synsets("you", pos=wn.NOUN)[:5]] )
print(wn.synset('i.n.03').definition())
[]
the 9th letter of the Roman alphabet
# The definition of a word (requires a specific synset name as input)
print(wn.synset('bank.n.08').definition())
print(wn.synset('run.n.01').definition())
print(wn.synset('run.v.01').definition())
# Careful: "synset", not "synsets"
# requires the POS attribute (i.e. n, v, a, r, or s) and a sense index (01, 02, ...)
a container (usually with a slot in the top) for keeping money at home
a score in baseball made by a runner touching all four bases safely
move fast by using one's feet, with one foot off the ground at any given time
# Example sentences for a word (takes a triplet name as input)
print(wn.synset('bank.n.08').examples())
print(wn.synset('run.n.01').examples())
print(wn.synset('run.v.01').examples())
['the coin bank was empty']
['the Yankees scored 3 runs in the bottom of the 9th', 'their first tally came in the 3rd inning']
["Don't run--you'll be out of breath", 'The children ran to the store']
WordNet Application: Distance Between Words
WordNet provides several word-similarity measures (their formulas are summarized after this list), for example (thesaurus-based):
- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Thesaurus-based similarity uses the hypernym/hyponym hierarchy (the is-a or subsumption relation), in this case the WordNet structure.
- In WordNet, thesaurus-based similarity can only be computed between senses in the same hierarchy (e.g., noun with noun), because nouns and verbs sit in two different hierarchies.
- Two words are similar if they are "near-synonyms" or at least interchangeable in the same context. Word relatedness (WR) != word similarity (WS). Example: love and hate have high relatedness (both are feelings) but low similarity.
- Similarity != distance.
- WS is useful for applications that need word semantics, e.g., QA systems, IR, summarization, and machine translation.
Information-content-based similarities (not covered here):
- Resnik Similarity
- Jiang-Conrath Similarity
- Lin Similarity
Reference: Dan Jurafsky and James H. Martin's ubiquitous Speech and Language Processing, 2nd Edition, pages 652-667 in Chapter 20 (Computational Lexical Semantics).
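For reference, the three thesaurus-based measures have standard closed forms, where pathlen(s1, s2) is the number of edges on the shortest path between two senses, D is the maximum depth of the taxonomy, and LCS(s1, s2) is the lowest common subsumer (the deepest shared ancestor):
- Path Similarity = 1 / (pathlen(s1, s2) + 1)
- Leacock-Chodorow = -log( pathlen(s1, s2) / (2 * D) )
- Wu-Palmer = 2 * depth(LCS(s1, s2)) / ( depth(s1) + depth(s2) )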
# Path Similarity: shortest path
# Path similarity counts the minimum number of edges from one word sense to another,
# using a hierarchical (graph) structure such as WordNet
man = wn.synset("man.n.01")
boy = wn.synset("boy.n.01")
woman = wn.synset("woman.n.01")
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dime = wn.synset('dime.n.01')  # sense numbers start at 01
a = man.path_similarity(man)
b = man.path_similarity(dog)
c = dog.path_similarity(cat)
d = man.path_similarity(boy)
print(a,b,c,d)
man.path_similarity(dime)
1.0 0.14285714285714285 0.2 0.3333333333333333
0.07692307692307693
# Leacock-Chodorow Similarity (Leacock and Chodorow, 1998): -log( pathlen / (2 * D) )
# Warning: slow!
man = wn.synset("man.n.01")
woman = wn.synset("woman.n.01")
dog = wn.synset('dog.n.01')
tree = wn.synset('tree.n.01')
a = man.lch_similarity(woman)
b = man.lch_similarity(dog)
c = dog.lch_similarity(tree)
print(a,b,c)
2.538973871058276 1.6916760106710724 1.55814461804655
# Word similarity by semantics: wup_similarity
# Note: it can return None (no common subsumer), so guard with a try/except or a None check
man = wn.synset("man.n.01")
woman = wn.synset("woman.n.01")
boy = wn.synset('boy.n.01')
hate = wn.synset('hate.n.01')
love = wn.synset('love.n.01')
a = man.wup_similarity(boy)
b = man.wup_similarity(hate)
c = boy.wup_similarity(love)
d = hate.wup_similarity(love)
w = man.wup_similarity(woman)
print(a,b,c,d,w)
# d (hate vs. love) is close to 1 even though they are antonyms: both are kinds of feelings, so they sit close together in the taxonomy
# Explanation summary of this distance: https://linguistics.stackexchange.com/a/9164
# A more theoretical reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3864022/
# We will discuss word2vec later as an alternative
0.6666666666666666 0.15384615384615385 0.15384615384615385 0.8571428571428571 0.6666666666666666
# WordNet for Indonesian?
print(wn.lemmas("ambisius", lang='ind'))
print(wn.synsets('sombong', lang='ind'))
# This is not the expected behaviour
[Lemma('ambitious.a.01.ambisius')] [Synset('cocky.s.01'), Synset('daredevil.s.01'), Synset('cavalier.s.01'), Synset('grandiloquent.s.02'), Synset('bootless.s.01'), Synset('proud.a.01'), Synset('arrogant.s.01'), Synset('bigheaded.s.01'), Synset('disdainful.s.02'), Synset('conceited.s.01'), Synset('file_allocation_table.n.01')]
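The Open Multilingual Wordnet (OMW) data bundled with NLTK can also be queried in the other direction; a small sketch (not from the original lesson; it may require the omw NLTK data):
# Sketch: OMW lookups between Indonesian and English synsets
from nltk.corpus import wordnet as wn
print(wn.synsets('anjing', lang='ind'))          # Indonesian word -> synsets
print(wn.synset('dog.n.01').lemma_names('ind'))  # synset -> Indonesian lemmas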
# Load manually [very limited & not very accurate, but it can be extended]
import taudataNlpTm as tau
wn_id = tau.WordNet_id()
w1 = 'ambisius'
w2 = 'ambigu'
print(w1, wn_id[w1], '\n', w2, wn_id[w2])
# Still loaded as a plain dictionary, not yet as a class/object
# WordNets for all languages: http://compling.hss.ntu.edu.sg/omw/ - MIT license
# https://stackoverflow.com/questions/31478152/how-to-use-the-language-option-in-synsets-nltk-if-you-load-a-wordnet-manually
# http://nullege.com/codes/search/nltk.corpus.reader.wordnet.WordNetCorpusReader
ambisius {'def': ['mempunyai cita-cita tinggi'], 'pos': ['a']} ambigu {'def': ['ambigu (khususnya dalam hal negatif)'], 'pos': ['a']}
# WordNet from TextBlob
from textblob import Word
word = Word("plant")
word.synsets[:5]
[Synset('plant.n.01'), Synset('plant.n.02'), Synset('plant.n.03'), Synset('plant.n.04'), Synset('plant.v.01')]
# The various definitions of "plant"
word.definitions[:5]
['buildings for carrying on industrial labor', '(botany) a living organism lacking the power of locomotion', 'an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience', 'something planted secretly for discovery by another', 'put or set (seeds, seedlings, or plants) into the ground']
# Related lemmas (names in the same synset)
plant = word.synsets[1]
print(plant.lemma_names())
['plant', 'flora', 'plant_life']
# Hypernyms: the more general concept
# e.g., a plant is a kind of organism
plant.hypernyms()
[Synset('organism.n.01')]
# Hyponyms: the more specific concepts
# e.g., an aquatic plant is a kind of plant
plant.hyponyms()[:5]
[Synset('acrogen.n.01'), Synset('air_plant.n.01'), Synset('annual.n.01'), Synset('apomict.n.01'), Synset('aquatic.n.01')]
# Holonyms: the "member of" semantic relation (a plant is a member of the kingdom Plantae)
plant.member_holonyms()
[Synset('plantae.n.01')]
# Meronyms: the inverse of holonyms (parts of the plant)
plant.part_meronyms()
[Synset('hood.n.02'), Synset('plant_part.n.01')]
# Similarity example using TextBlob
from textblob.wordnet import Synset
octopus = Synset("octopus.n.02")
nautilus = Synset('paper_nautilus.n.01')
shrimp = Synset('shrimp.n.03')
pearl = Synset('pearl.n.01')
hate = wn.synset('hate.n.01')
love = wn.synset('love.n.01')
a = octopus.path_similarity(octopus) # 1.0
b = octopus.path_similarity(nautilus) # 0.33
c = octopus.path_similarity(shrimp) # 0.11
d = octopus.path_similarity(pearl)
e = hate.path_similarity(love)
print(a,b,c,d,e)
1.0 0.3333333333333333 0.1111111111111111 0.06666666666666667 0.3333333333333333
Word Sense Disambiguation
- Aims to determine the correct word sense (meaning) "according to the context".
- Example applications: machine translation, named entity recognition, question-answering systems, IR, text classification, etc.
# Word sense disambiguation: "book" - a book (noun) vs. to book a ticket (verb)
T1 = 'Please book me a ticket to Jogjakarta'
T2 = 'I am going to read this book in the flight'
for token in nlp_en(T1):
print(token,token.tag_, end =', ')
print()
for token in nlp_en(T2):
print(token,token.tag_, end =', ')
# A clear difference between VB and NN for the word "book"
Please UH, book VB, me PRP, a DT, ticket NN, to IN, Jogjakarta NNP,
I PRP, am VBP, going VBG, to TO, read VB, this DT, book NN, in IN, the DT, flight NN,
# A more proper way: the Lesk algorithm
# https://en.wikipedia.org/wiki/Lesk_algorithm
# Minor modification of https://stackoverflow.com/questions/20896278/word-sense-disambiguation-algorithm-in-python
bank_1 = 'I went to the bank to deposit my money'
bank_2 = 'The river bank was full of dead fishes'
# lesk_wsd(sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True)
print("Context:", bank_1)
answer = tau.lesk_wsd(bank_1,'bank')
print("Sense:", answer)
print("Definition:",wn.synset(answer).definition(),'\n')
print("Context:", bank_2)
answer = tau.lesk_wsd(bank_2,'bank')
print("Sense:", answer)
print("Definition:",wn.synset(answer).definition(),'\n')
Context: I went to the bank to deposit my money
Sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities

Context: The river bank was full of dead fishes
Sense: bank.n.01
Definition: sloping land (especially the slope beside a body of water)
print("Context:", T1)
answer = tau.lesk_wsd(T1,'book')
print("Sense:", answer)
print("Definition:",wn.synset(answer).definition(),'\n')
print("Context:", T2)
answer = tau.lesk_wsd(T2,'book')  # bug fix: use the T2 context, not bank_2
print("Sense:", answer)
print("Definition:",wn.synset(answer).definition(),'\n')
Context: Please book me a ticket to Jogjakarta
Sense: book.n.11
Definition: a number of sheets (ticket or stamps etc.) bound together on one edge

Context: I am going to read this book in the flight
Sense: record.n.05
Definition: a compilation of the known facts regarding something or someone
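NLTK also ships a simple built-in Lesk implementation (nltk.wsd.lesk); a minimal sketch, which may pick different senses than tau.lesk_wsd above:
# Sketch: NLTK's built-in (simplified) Lesk algorithm
from nltk import word_tokenize
from nltk.wsd import lesk
sense = lesk(word_tokenize(bank_1), 'bank')
print(sense, ':', sense.definition())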
More alternatives in Python for WSD
http://meta-guide.com/software-meta-guide/100-best-github-word-sense-disambiguation
End of Module
Other references for learning text mining in Python
- Farzindar, A., & Inkpen, D. (2017). Natural language processing for social media. Synthesis Lectures on Human Language Technologies, 10(2), 1-195.
- Kao, A., & Poteet, S. R. (Eds.). (2007). Natural language processing and text mining. Springer Science & Business Media.
- Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Packt Publishing Ltd.
- http://www.nltk.org/book/