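Today our "second battalion commander" had a request: run a banned-word filter over a keyword list with millions of entries. I used to ask the dev guys to run it every time, and they had apparently had enough: bam, they tossed the program over to me and told me to run it myself. When I first saw the script I nearly lost it. What on earth is this? I had never even seen the pyahocorasick library. Have a look: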
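
import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    # Load the banned words: one entry per line, word in the first tab-separated field
    with open("D:\\seo-dev\\blackword\\blackword.properties", 'r') as fp:
        for line in fp:
            tok = line.strip("\n").split("\t")
            if not tok[0]:
                # Blank line: nothing to add
                continue
            A.add_word(tok[0], (1, tok[0]))
    # Turn the trie into an Aho-Corasick automaton
    A.make_automaton()
    f1 = open("D:\\seo-dev\\blackword\\back_dangdang.csv", 'w')    # rows that hit a banned word
    g1 = open("D:\\seo-dev\\blackword\\result_dangdang.csv", 'w')  # clean rows
    cnt = 0
    for line in open("D:\\seo-dev\\blackword\\dangdang.csv", 'r'):
        cnt += 1
        if cnt % 10000 == 0:
            print cnt  # progress indicator
        tok = line.strip("\n").split(",")
        kw = tok[7]  # the keyword sits in the 8th CSV column
        media = tok[0]
        good = True
        for k, (i, t) in A.iter(kw):
            # Every word above is added with tag 1, so this branch never fires here;
            # it looks intended for tag-2 entries that should only block on an exact match.
            if i == 2 and t != kw:
                continue
            # First hit: append the matched banned word and write the row to the blocked file
            tok.append(t)
            line = ",".join(tok) + "\n"
            f1.write(line)
            good = False
            break
        if good:
            line = ",".join(tok) + "\n"
            g1.write(line)
    f1.close()
    g1.close()
    t2 = time.time()
    print "cost time is ", t2 - t1
    return

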
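main()

Good thing I write a bit of code myself, otherwise I really would have been lost. The script above works on a .properties file and CSV files, but plain .txt files are all I know how to handle, so I reworked it (reference: http://blog.csdn.net/pirage/article/details/51657178):
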
import ahocorasick
import time


def main():
    t1 = time.time()
    A = ahocorasick.Automaton()
    # Add the banned words to the trie, one word per line
    with open("blackword.txt", 'r') as fp:
        for line in fp:
            bkw = line.strip()
            if bkw:  # skip blank lines
                A.add_word(bkw, (1, bkw))
    # Convert the trie into an Aho-Corasick automaton
    A.make_automaton()
    g1 = open("result_dangdang.txt", 'a+')
    for line in open("dangdang.txt"):
        # One output line per banned word found in the keyword
        for k, (i, t) in A.iter(line.strip()):
            print line.strip() + t
            g1.write(line.strip() + ":" + t + "\n")
    g1.close()
    t2 = time.time()
    print "cost time is ", t2 - t1


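main()

It does the same job, except that it only records the hits, as keyword:banned-word lines. The main thing is that a script I wrote myself is more comfortable to use.
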
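
For anyone else who has never met Aho-Corasick before, here is a rough pure-Python sketch of what the three calls used above (add_word, make_automaton, iter) do under the hood. It is only a teaching stand-in, not pyahocorasick's actual implementation: the names TinyAutomaton and split_keywords are made up for this example, the sample words are invented, and it is written for Python 3, unlike the Python 2 scripts above. The real library is a C extension and will be far faster on millions of keywords.

```python
from collections import deque


class TinyAutomaton:
    """Hypothetical stand-in for ahocorasick.Automaton, for illustration only."""

    def __init__(self):
        # Parallel lists indexed by node id; node 0 is the root.
        self.children = [{}]   # char -> child node id
        self.fail = [0]        # failure link of each node
        self.out = [[]]        # values of all words ending at this node

    def add_word(self, word, value):
        # Walk/extend the trie along the characters of `word`.
        node = 0
        for ch in word:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.children)
                self.children.append({})
                self.fail.append(0)
                self.out.append([])
                self.children[node][ch] = nxt
            node = nxt
        self.out[node].append(value)

    def make_automaton(self):
        # Breadth-first pass that fills in failure links and merges outputs.
        queue = deque(self.children[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # A word ending at the failure target also ends here.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def iter(self, text):
        # Yield (end_index, value) for every banned word found in `text`.
        node = 0
        for idx, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for value in self.out[node]:
                yield idx, value


def split_keywords(keywords, banned):
    """Return (clean, flagged); flagged holds (keyword, first banned word hit)."""
    A = TinyAutomaton()
    for word in banned:
        if word:
            A.add_word(word, (1, word))
    A.make_automaton()
    clean, flagged = [], []
    for kw in keywords:
        hit = next(A.iter(kw), None)
        if hit is None:
            clean.append(kw)
        else:
            flagged.append((kw, hit[1][1]))
    return clean, flagged


if __name__ == "__main__":
    clean, flagged = split_keywords(
        ["cheap gun online", "red shoes", "buy drugs"], ["gun", "drugs"]
    )
    print(clean)    # ['red shoes']
    print(flagged)  # [('cheap gun online', 'gun'), ('buy drugs', 'drugs')]
```

The point of the automaton is that each keyword is scanned once no matter how many banned words there are, which is why it copes with a list of this size where a loop of "banned word in keyword" checks would crawl.
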
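You may run into a few pitfalls when installing the pyahocorasick library.
Pitfall 1: pyahocorasick is a C extension package, so on a Windows machine the install may tell you to install VCForPython27.msi first. Follow the prompt and download it from the official site; it is quick.
Pitfall 2: when installing with pip install pyahocorasick, you may be told to upgrade pip. If python -m pip install --upgrade pip keeps failing, download the whl file manually from the official site and run pip install **.whl to upgrade pip.
I have forgotten which of the two pitfalls actually came first...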
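
Once the install finally goes through, a quick sanity check saves a failed run later. The snippet below is a small sketch for Python 3 (importlib.util.find_spec is not available on the Python 2.7 setup described above); has_module is a helper name invented for this example.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # pyahocorasick is imported under the module name "ahocorasick".
    if has_module("ahocorasick"):
        print("pyahocorasick is importable")
    else:
        print("pyahocorasick is missing; try: python -m pip install pyahocorasick")
```
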
https://github.com/hzlRises/hzlgithub/blob/master/jd.com/sick.py