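[Learning] Playing with word2vec.
A quick play session with word2vec (natural language processing) in a relaxed setting.
Building the environment takes an awfully long time. Just as people learn at different speeds, machines differ in performance too, so the best approach is to start make and go to sleep.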
===
■ Building the environment.
yukio@dynabook ~/word2vec/word2vec-read-only
$ ./demo-phrases.sh
make: Nothing to be done for ‘all’.
Starting training using file news.2012.en.shuffled-norm0
Words processed: 296900K Vocab size: 33198K
Vocab size (unigrams + bigrams): 18838711
Words in train file: 296901342
Words written: 296900K
real 36m31.377s
user 21m55.781s
sys 7m5.359s
Starting training using file news.2012.en.shuffled-norm0-phrase0
Words processed: 280500K Vocab size: 38761K
Vocab size (unigrams + bigrams): 21728781
Words in train file: 280513979
Words written: 280500K
real 29m38.999s
user 19m26.875s
sys 7m46.811s
Starting training using file news.2012.en.shuffled-norm1-phrase1
Vocab size: 681320
Words in train file: 283545447
Alpha: 0.000005 Progress: 100.00% Words/thread/sec: 84.98k
real 111m20.243s
user 838m48.234s
sys 0m53.780s
That took quite a while: the three steps add up to roughly three hours of wall-clock time.
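In the last training step, the user time (838m) is far larger than the real time (111m), which suggests the training ran on several threads in parallel. A quick arithmetic check on the figures from the log above; the helper name `to_seconds` is my own, not part of word2vec:

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '111m20.243s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Figures copied from the final training step in the log.
real = to_seconds("111m20.243s")
user = to_seconds("838m48.234s")

# user/real approximates how many CPU cores were busy on average.
print(f"user/real = {user / real:.2f}")
```

A ratio of about 7.5 would mean that, on average, the equivalent of seven to eight cores were kept busy during training.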
—
■ What comes out for 'computer'?
Enter word or sentence (EXIT to break): computer
Word: computer Position in vocabulary: 1922
Word Cosine distance
————————————————————————
computers 0.816240
software 0.708589
laptop 0.694821
computer’s 0.676898
keystrokes 0.653019
electronic 0.631960
device 0.622195
computers’ 0.618579
mobile_device 0.609824
flash_drive 0.606466
desktop_computer 0.601706
internet_connection 0.599446
thumb_drive 0.595328
keystroke 0.593948
computerized 0.590586
desktop 0.587194
web 0.586146
malware 0.585932
word_processing 0.585793
debug 0.584467
wi_fi_hotspot 0.584011
error_messages 0.583115
laptop_computer 0.582110
arduino 0.580863
user 0.580396
your_computer’s 0.580019
spyware 0.578177
server 0.577982
handheld_devices 0.577063
automated 0.576914
devices 0.576548
tracking_software 0.576425
web_servers 0.576260
computer’s_hard_drive 0.576223
mobile_phone 0.575662
usb_drive 0.574477
encryption 0.573236
malicious_code 0.572133
remote_server 0.569906
desktop_pc 0.569610
Interestingly, arduino (0.580863) ranks closer than server (0.577982). This probably also depends on when the corpus was created.
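The "Cosine distance" column printed by the tool is, as far as I understand the original word2vec code, a cosine similarity: the dot product of length-normalized vectors, where a larger value means a closer word. The following is a minimal sketch of that ranking, not the tool's actual implementation; the `toy_vectors` values and the helper names `cosine_similarity` and `nearest` are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, top_n=3):
    """Rank every word except the query by similarity to the query word."""
    q = vectors[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy 3-dimensional vectors, invented for illustration only.
# Real word2vec vectors typically have hundreds of dimensions.
toy_vectors = {
    "computer": [0.9, 0.1, 0.0],
    "laptop": [0.8, 0.2, 0.1],
    "server": [0.6, 0.0, 0.5],
    "tokyo": [0.0, 0.9, 0.3],
}

for word, score in nearest("computer", toy_vectors):
    print(f"{word}\t{score:.6f}")
```

With these toy values, laptop comes out ahead of server, and tokyo lands last, which mirrors the shape of the real listing: near-synonyms first, loosely related words further down.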
—
■ And for 'japan'...
Enter word or sentence (EXIT to break): japan
Word: japan Position in vocabulary: 1035
Word Cosine distance
————————————————————————
japan’s 0.783928
south_korea 0.781404
china 0.733827
japanese 0.723014
tokyo 0.704671
asia 0.644378
europe 0.639472
other_asian_nations 0.620738
taiwan 0.619987
germany 0.615176
kuril_islands 0.613550
tokyo_march_upi 0.613329
india 0.612798
korea 0.609041
countries 0.592254
asia_excluding 0.583868
united_states 0.583465
thailand 0.582957
tokyo_april_upi 0.580327
brazil 0.576181
china’s 0.574224
ap_tokyo 0.573809
senkaku_diaoyu_islands 0.573809
territorial_dispute_between 0.573254
over_disputed_islands 0.572992
tokyo_sept_upi 0.570736
territorial_row_between 0.567809
tokyo_nov_upi 0.567369
philippines 0.563850
last_year’s_fukushima_nuclear 0.563020
ryukyu_islands 0.562682
asian_countries 0.562654
japan_south_korea 0.558992
south_korean 0.558505
australia 0.557522
russia 0.556449
chinese 0.556147
tokyo_dec_upi 0.556115
territorial_row 0.553880
beijing 0.552677
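Entries such as last_year's_fukushima_nuclear and the territorial dispute phrases make it clear that the result depends on when the corpus was created.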
—
■ For 'akihabara', a district well known abroad too, the closest matches are sightseeing spots.
Enter word or sentence (EXIT to break): akihabara
Word: akihabara Position in vocabulary: 300750
Word Cosine distance
————————————————————————
ginza 0.630624
shibuya 0.627033
asakusa 0.582663
shinjuku 0.578208
harajuku 0.562503
omotesando 0.562430
shibuya_district 0.545689
aoyama 0.535024
roppongi_district 0.534838
roppongi 0.530840
yoyogi 0.530087
shopping_arcades 0.528585
wako 0.516259
co_jp 0.513500
ginza_district 0.508774
okayama 0.507944
tokyo’s_ginza 0.506670
tokyo 0.505515
tokyo’s 0.498354
yoshinori 0.496685
osaka 0.493471
otaku 0.493339
hiroko_tabuchi_contributed_reporting 0.493252
buynow 0.490066
shopping_district 0.489351
nihon 0.488485
zeniya 0.485182
shimbashi 0.482262
zhongguancun 0.480785
roppongi_hills 0.479378
tetsuo 0.474016
yoyogi_park 0.471967
minami 0.471964
azabu 0.470780
osaka’s 0.469058
laforet 0.468352
yanagi 0.466992
electronics_store 0.465139
electronics 0.464955
nikkei_index_shed 0.463926
—
■ So, what about 'service'?
Enter word or sentence (EXIT to break): service
Word: service Position in vocabulary: 495
Word Cosine distance
————————————————————————
services 0.762807
service_providers 0.570149
service_provider 0.557932
customers 0.547668
network 0.544316
access 0.541908
provider 0.532947
operators 0.526344
customer_service 0.524407
providers 0.523774
service’s 0.508977
facilities 0.508721
lebara 0.508005
employees 0.503374
monthly_subscription 0.502840
functions 0.497853
mobile 0.497117
providing 0.494426
stations 0.493734
helotrac_x 0.492092
delivery 0.492068
maintenance 0.486617
internet_access 0.486330
voip 0.486298
postal_services 0.485677
gametanium 0.484630
online_portal 0.484593
systems 0.482075
broadband_access 0.480736
inaer 0.480549
staff 0.478217
exent’s 0.477309
high_bandwidth 0.476419
subscription_based 0.472000
system 0.471115
call_centers 0.470993
enabling 0.470146
streamwide 0.468853
customer 0.468721
users 0.467595
—
■ Perhaps, even with the 2012 corpus, it can compute proximity for a combination of words entered as a phrase?
How about 'service science'? This will be the last query.
Enter word or sentence (EXIT to break): service science
Word: service Position in vocabulary: 495
Word: science Position in vocabulary: 1655
Word Cosine distance
————————————————————————
services 0.660516
scientific_research 0.626619
scientific 0.623595
technology 0.622529
educational 0.607284
research 0.597329
technologies 0.554766
engineering 0.554162
resource 0.544080
innovation 0.543709
systems 0.542555
programs 0.537852
fully_accredited 0.535299
expertise 0.534186
education 0.532555
science_engineering 0.532547
communication 0.527393
lifelong_learning 0.521500
software_engineering 0.520856
program 0.519150
functions 0.516761
teaching 0.513441
computing 0.512357
applications 0.512339
enterprise 0.511462
communications 0.507896
physical_sciences 0.506098
innovative_technology 0.506033
scientific_discoveries 0.501350
biomedical 0.501021
technology_directorate 0.498348
literacy 0.497534
curriculum 0.496828
solutions 0.496724
software_development 0.495778
information_technology 0.495600
math_science 0.494913
academic_research 0.494798
cutting_edge_research 0.494436
collaborative 0.494172
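In the listing above, 'service' and 'science' are each looked up separately, which suggests the tool combines the individual word vectors rather than looking up a single phrase token. As far as I know, the original distance tool adds up the (normalized) vectors of the query words, renormalizes the sum, and leaves the query words themselves out of the results. A rough sketch under that assumption; the toy vectors and the helper names `normalize`, `query_vector`, and `nearest_to_query` are invented for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def query_vector(words, vectors):
    """Sum the unit vectors of the known query words, then renormalize."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        if word in vectors:  # unknown words are simply skipped
            unit = normalize(vectors[word])
            total = [t + u for t, u in zip(total, unit)]
    return normalize(total)

def nearest_to_query(words, vectors, top_n=3):
    """Rank all non-query words by cosine similarity to the combined query."""
    q = query_vector(words, vectors)
    scored = []
    for word, vec in vectors.items():
        if word in words:
            continue  # the query words themselves are excluded
        unit = normalize(vec)
        scored.append((word, sum(a * b for a, b in zip(q, unit))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "service": [1.0, 0.0, 0.0],
    "science": [0.0, 1.0, 0.0],
    "services": [0.9, 0.1, 0.0],
    "technology": [0.6, 0.7, 0.1],
    "tokyo": [0.0, 0.1, 0.9],
}

for word, score in nearest_to_query(["service", "science"], toy_vectors):
    print(f"{word}\t{score:.6f}")
```

Because the query is a blend of both words, a word that sits between them (technology in the toy data) can outrank a near-synonym of just one of them, which is consistent with the mix of service-like and science-like terms in the real result.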
That's all.