
[pyahocorasick] Handling banned words in Python

Author: 二营长 Published: 2017-03-23

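Today I had a task: run a banned-word filter over a keyword list with millions of entries. Until now I asked a developer colleague to run it every time, and he probably got fed up, because he simply tossed the program over and told me to run it myself. When I first opened the script I was floored. What the heck is this? I had never even seen the pyahocorasick library. Have a look: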

import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    # Load the banned words: one per line, first tab-separated field.
    with open("D:\\seo-dev\\blackword\\blackword.properties", 'r') as fp:
        for line in fp:
            tok = line.strip("\n").split("\t")
            if not tok[0]:
                # split() always returns at least one field, so the original
                # len(tok) < 1 check never fired; skip blank lines instead.
                continue
            A.add_word(tok[0], (1, tok[0]))
    # Turn the trie into an Aho-Corasick automaton.
    A.make_automaton()
    f1 = open("D:\\seo-dev\\blackword\\back_dangdang.csv", 'w')    # rows that hit a banned word
    g1 = open("D:\\seo-dev\\blackword\\result_dangdang.csv", 'w')  # clean rows
    cnt = 0
    for line in open("D:\\seo-dev\\blackword\\dangdang.csv", 'r'):
        cnt += 1
        if cnt % 10000 == 0:
            print(cnt)  # progress counter
        tok = line.strip("\n").split(",")
        kw = tok[7]  # the keyword sits in the 8th column
        good = True
        for k, (i, t) in A.iter(kw):
            # Presumably tag 2 was meant for "exact match only" words, but
            # every word above is added with tag 1, so this never fires.
            if i == 2 and t != kw:
                continue
            tok.append(t)  # append the banned word that was hit
            f1.write(",".join(tok) + "\n")
            good = False
            break
        if good:
            g1.write(",".join(tok) + "\n")
    f1.close()
    g1.close()
    t2 = time.time()
    print("cost time is %s" % (t2 - t1))


main()
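
One fragile spot in the script above is that it splits each CSV line on bare commas, so a quoted field that itself contains a comma would shift the keyword out of column 8. Below is a minimal sketch of the same flagged/clean split using the standard csv module. The names split_rows, naive_matcher and find_banned are made up for illustration, the sample rows are invented, and the plain substring matcher is only a stand-in for the automaton so that the sketch runs without pyahocorasick.

```python
import csv
import io


def split_rows(rows, find_banned, kw_col=7):
    """Split parsed CSV rows into (flagged, clean) lists.

    find_banned(text) is a hypothetical callback that returns the first
    banned word found in text, or None. Flagged rows get that word appended,
    mirroring what the script above writes to its two output files.
    """
    flagged, clean = [], []
    for row in rows:
        # Rows too short to have a keyword column are treated as clean.
        hit = find_banned(row[kw_col]) if len(row) > kw_col else None
        if hit is None:
            clean.append(row)
        else:
            flagged.append(row + [hit])
    return flagged, clean


def naive_matcher(banned_words):
    """Plain substring scan: a stand-in for the automaton, fine for a demo."""
    def find_banned(text):
        for word in banned_words:
            if word in text:
                return word
        return None
    return find_banned


# Invented sample data; the quoted keyword field contains a comma.
sample = io.StringIO(
    'm1,a,b,c,d,e,f,"cheap, fake watch"\n'
    'm2,a,b,c,d,e,f,leather wallet\n'
)
flagged, clean = split_rows(csv.reader(sample), naive_matcher(["fake"]))
print(flagged)
print(clean)
```

With the real library, find_banned could instead wrap A.iter(text) and return the word from the first match, which keeps the speed of the automaton while the csv module handles the quoting.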

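Good thing I write a bit of code myself now and then, otherwise I would have been completely lost. The script above works on a .properties file and CSV files, and I only really know my way around .txt files... so I reworked it (reference: http://blog.csdn.net/pirage/article/details/51657178):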

import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    # Add the banned words to the trie, one per line.
    with open("blackword.txt", 'r') as fp:
        for line in fp:
            bkw = line.strip()
            if bkw:  # skip blank lines
                A.add_word(bkw, (1, bkw))
    # Turn the trie into an Aho-Corasick automaton.
    A.make_automaton()
    # Append mode: rerunning the script adds to earlier results.
    g1 = open("result_dangdang.txt", 'a+')
    for line in open("dangdang.txt"):
        # Every banned word found in the line produces one output record.
        for k, (i, t) in A.iter(line.strip()):
            print(line.strip() + t)
            g1.write(line.strip() + ":" + t + "\n")
    g1.close()
    t2 = time.time()
    print("cost time is %s" % (t2 - t1))


main()

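It does the same filtering job, except that it only records the lines that hit a banned word (one output line per hit) instead of splitting the input into two files. The main thing is that a script I wrote myself is more comfortable to use.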
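
For anyone as puzzled by the library as I was, here is a rough pure-Python sketch of the idea behind the two calls used above: build a trie from the words, then add failure links so the text can be scanned in a single pass. This is not pyahocorasick's actual implementation (that is a C extension); build_automaton and find_all are names made up for this illustration, and the word list is the usual textbook example rather than real banned words.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links and output lists for words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state of the longest proper suffix in the trie
    out = [[]]    # out[state] -> words that end when this state is reached

    # Step 1: insert every word into the trie (conceptually what add_word does).
    for word in words:
        if not word:
            continue  # ignore empty strings
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: breadth-first pass that fills in the failure links
    # (conceptually what make_automaton does). Depth-1 states fail to the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (end_index, word) for every occurrence, overlapping ones included."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        # On a mismatch, fall back along failure links instead of rescanning.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i, word


automaton = build_automaton(["he", "she", "hers", "his"])
matches = list(find_all("ushers", automaton))
print(matches)
```

The text is read once no matter how many words are in the list, which is why this approach suits millions of keywords. The (end_index, word) pairs mirror the shape of what A.iter() yields in the scripts above, where the first element of each result is the index of the last matched character.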

You may hit a few pitfalls when installing the pyahocorasick library.

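Pitfall 1: pyahocorasick is a C extension package, so installing it on a Windows machine may prompt you to install VCForPython27.msi first (the Microsoft Visual C++ Compiler for Python 2.7). Follow the prompt and download it from the official site; it is quick.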

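Pitfall 2: when you install with pip install pyahocorasick, you may be told to upgrade pip. If python -m pip install --upgrade pip keeps failing, download the whl file manually from the official site and upgrade pip with pip install **.whl.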

I have forgotten which of the two pitfalls I actually hit first...

https://github.com/hzlRises/hzlgithub/blob/master/jd.com/sick.py

Email: techseo.cn@gmail.com. Feel free to get in touch.