A Big Data Search Engine in Python

Published: 2019-07-13 21:11:36  Category: Site Building  Source: 简单艾

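Summary: Search is a common requirement in the big data field. Splunk and ELK are the leaders in the proprietary and open-source camps respectively. This article uses a small amount of Python code to implement basic data search, in the hope of helping readers understand the basic principles behind big data search. Bloom filter: as a first step, we need to implement a Bloom filter. ...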

Here is the code:

# Bloomfilter and segments are the class and function defined earlier in this article.
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)
        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return
        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

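Splunk represents an indexed collection that supports search.
Each collection contains a Bloom filter, an inverted index (a dictionary), and an array that stores all events.
When an event is added to the index, the following happens:
1. A unique ID is generated for the event; here it is simply the event's position in the array.
2. The event is split into terms. Each term is added to the Bloom filter and to the inverted index, which maps a term to the IDs of the events that contain it. Note that one term may appear in many events, so each value in the inverted index is a set. The inverted index is the core structure of most search engines.
When a term is searched, the following happens:
1. Check the Bloom filter; if it reports that the term is absent, return immediately.
2. Check the term dictionary; if the term is not in it, return immediately.
3. Look up all matching event IDs in the inverted index and return the content of those events.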
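
The steps above can be traced in a small standalone sketch. It is not the article's implementation: `TinyBloom`, `build_index`, and `lookup` are hypothetical names introduced here, the salted SHA-256 hashing is only one possible choice, and splitting on whitespace is a simplifying assumption that stands in for the article's `segments` function (so sub-terms such as `ip` are not indexed here).

```python
import hashlib


class TinyBloom:
    """Hypothetical stand-in for the article's Bloomfilter class."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, value):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add_value(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


def build_index(events):
    """Build a Bloom filter and an inverted index (term -> set of event IDs)."""
    bloom = TinyBloom()
    terms = {}
    for event_id, event in enumerate(events):
        for term in event.split():  # assumption: whitespace tokenizer
            bloom.add_value(term)
            terms.setdefault(term, set()).add(event_id)
    return bloom, terms


def lookup(bloom, terms, events, term):
    """Follow the three search steps: Bloom filter, dictionary, inverted index."""
    if not bloom.might_contain(term):
        return []
    if term not in terms:
        return []
    return [events[event_id] for event_id in sorted(terms[term])]


events = ["src_ip = 1.2.3.4", "src_ip = 5.6.7.8", "dst_ip = 1.2.3.4"]
bloom, terms = build_index(events)
print(sorted(terms["1.2.3.4"]))                 # event IDs containing the term
print(lookup(bloom, terms, events, "1.2.3.4"))  # the matching events themselves
```

The dictionary check after the Bloom filter is still needed because a Bloom filter can return false positives; it can never return a false negative, which is what makes the early return safe.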

Let's run it and see:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')
for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)

Output:

src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Pretty neat, isn't it?

More Complex Search

Going one step further, we want to support AND and OR during search so that more complex search logic can be expressed.

Here is the code:

# Bloomfilter and segments are the class and function defined earlier in this article.
class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)
        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))
        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return
            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()
        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue
            # Add these events to our results
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

With the intersection and union operations of Python sets, it is easy to support AND (intersection) and OR (union).
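
As a minimal illustration of that idea, the sketch below applies the two set operations to a hand-written inverted index. The `postings` dictionary and the helper names `match_all` and `match_any` are hypothetical and exist only for this example; the event IDs mirror the three sample events used earlier.

```python
# Hand-written inverted index: term -> set of event IDs (illustrative data).
postings = {
    "src_ip": {0, 1},
    "dst_ip": {2},
    "1.2.3.4": {0, 2},
}


def match_all(postings, terms):
    """AND: event IDs that appear under every term."""
    sets = [postings.get(term, set()) for term in terms]
    # Choice made in this sketch: an empty query matches nothing.
    return set.intersection(*sets) if sets else set()


def match_any(postings, terms):
    """OR: event IDs that appear under at least one term."""
    return set().union(*(postings.get(term, set()) for term in terms))


print(sorted(match_all(postings, ["src_ip", "1.2.3.4"])))  # [0]
print(sorted(match_any(postings, ["src_ip", "dst_ip"])))   # [0, 1, 2]
```

A term missing from the index contributes an empty set, so it empties an AND result but leaves an OR result unchanged, which matches the `return` and `continue` branches in `search_all` and `search_any`.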

