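Today 二营长 (that's me) had a task: filter banned words out of a keyword list on the order of a million entries. I used to ask the dev guys to run it every time, and I guess they finally got fed up, because, bam, they tossed the program over to me and told me to run it myself. The moment I saw the script I nearly fell apart. What on earth is this? I had never even seen the pyahocorasick library before. Have a look: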
import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    with open("D:\\seo-dev\\blackword\\blackword.properties", 'r') as fp:
        for line in fp:
            tok = line.strip("\n").split("\t")
            if len(tok) < 1:
                print line.decode('utf-8')
            else:
                A.add_word(tok[0], (1, tok[0]))
    A.make_automaton()
    f1 = open("D:\\seo-dev\\blackword\\back_dangdang.csv", 'w')
    g1 = open("D:\\seo-dev\\blackword\\result_dangdang.csv", 'w')
    cnt = 0
    for line in open("D:\\seo-dev\\blackword\\dangdang.csv", 'r'):
        cnt += 1
        if cnt % 10000 == 0:
            print cnt
        tok = line.strip("\n").split(",")
        kw = tok[7]
        media = tok[0]
        good = True
        for k,(i,t) in A.iter(kw):
            if i == 2 and t != kw:
                continue
            tok.append(t)
            line = ",".join(tok) + "\n"
            f1.write(line)
            good = False
            break
        if good:
            line = ",".join(tok) + "\n"
            g1.write(line)
    f1.close()
    g1.close()
    t2 = time.time()
    print "cost time is ", t2 - t1
    return

main()
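Luckily I write a bit of code myself now and then, otherwise I really would have been lost. The script above works on a properties file and csv files, but plain txt is all I know how to handle... so I reworked it (reference: http://blog.csdn.net/pirage/article/details/51657178):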
import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    # Add the banned words to the trie
    with open("blackword.txt",'r') as fp:
        for line in fp:
            bkw = line.strip()
            A.add_word(bkw,(1,bkw))
    # Convert the trie into an Aho-Corasick automaton
    A.make_automaton()
    g1 = open("result_dangdang.txt", 'a+')
    for line in open("dangdang.txt"):
        # Each hit is (end index, value stored with add_word)
        for k,(i,t) in A.iter(line.strip()):
            print line.strip()+t
            g1.write(line.strip()+":"+t+"\n")
    g1.close()
    t2 = time.time()
    print "cost time is ", t2 - t1

main()
It does the same job; the point is that code I wrote myself is more comfortable to use.
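For anyone curious what the add_word / make_automaton / iter steps are doing, here is a minimal pure-Python sketch of the Aho-Corasick idea: build a trie of banned words, add failure links, then scan each keyword once. This is only an illustration, not how pyahocorasick is actually implemented (that is a C extension). The class name TinyAutomaton and the helper split_keywords are made up for this example, and it is written for Python 3.

```python
from collections import deque


class TinyAutomaton:
    """Illustrative Aho-Corasick matcher (hypothetical, not pyahocorasick)."""

    def __init__(self):
        # Node 0 is the root. Per node: child edges, failure link, matched words.
        self.children = [{}]
        self.fail = [0]
        self.output = [[]]

    def add_word(self, word):
        """Insert one banned word into the trie (empty words are ignored)."""
        if not word:
            return
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.output.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.output[node].append(word)

    def make_automaton(self):
        """Compute failure links breadth-first, turning the trie into an automaton."""
        queue = deque(self.children[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                # Walk failure links until some node has an edge for ch.
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Words ending at the failure target also end here (as suffixes).
                self.output[child] = self.output[child] + self.output[self.fail[child]]
                queue.append(child)

    def iter(self, text):
        """Yield (end_index, word) for every banned word occurring in text."""
        node = 0
        for idx, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for word in self.output[node]:
                yield idx, word


def split_keywords(keywords, banned_words):
    """Split keywords into clean ones and (keyword, first banned word) pairs."""
    automaton = TinyAutomaton()
    for word in banned_words:
        automaton.add_word(word)
    automaton.make_automaton()
    clean, flagged = [], []
    for kw in keywords:
        hit = next(automaton.iter(kw), None)
        if hit is None:
            clean.append(kw)
        else:
            flagged.append((kw, hit[1]))
    return clean, flagged
```

The payoff is that each keyword is scanned once no matter how many banned words there are, which is why this approach holds up on a million keywords where a nested "for each keyword, for each banned word" loop would crawl. Like pyahocorasick's iter, the sketch reports the index where a match ends, not where it starts.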
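You may run into a few pitfalls when installing the pyahocorasick library.
Pitfall 1: pyahocorasick is a C extension package, so installing it on a Windows machine may prompt you to install VCForPython27.msi first. Follow the prompt and download it from the official site; it's quick.
Pitfall 2: when installing with pip install pyahocorasick, you may be told to upgrade pip. If python -m pip install --upgrade pip keeps failing, download the whl file manually from the official site and run pip install **.whl to upgrade pip.
I've forgotten which of the two pitfalls actually came first...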
https://github.com/hzlRises/hzlgithub/blob/master/jd.com/sick.py