Drkcore

12012010 Python MeCab PCI

feedparserとMeCabでエントリから名詞を抜き出す

Programming Collective Intelligenceを読み始めた。

Programming Collective Intelligence: Building Smart Web 2.0 Applications
Toby Segaran
Oreilly & Associates Inc / 3446円 ( 2007-08 )

英語だとスペースで分割すれば単語に分けられるのだけど、日本語は品詞分解できないと類似度を計れないしクラスタリングもできないので、まずMeCabを使えるようにしておく。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys,re,feedparser
import MeCab

d = feedparser.parse('http://blog.kzfmix.com/rss/')
txt = ''
for entry in d.entries:
    txt += re.compile(r'<[^>]+>').sub('',entry.summary_detail.value)

try:
    t = MeCab.Tagger()
    m = t.parseToNode(txt.encode('utf-8'))
    while m:
        if m.stat < 2:
            if re.match('名詞',m.feature): print m.surface
        m = m.next
except RuntimeError, e:
    print "RuntimeError:", e;

About

もう5年目(wishlistありマス♡)
最近はPythonとDeepLearning
日本酒自粛中
ドラムンベースからミニマルまで
ポケモンGOゆるめ