Here is the code:
    class Splunk(object):
        def __init__(self):
            self.bf = Bloomfilter(64)
            self.terms = {}  # Dictionary of term to set of events
            self.events = []

        def add_event(self, event):
            """Adds an event to this object"""
            # Generate a unique ID for the event, and save it
            event_id = len(self.events)
            self.events.append(event)
            # Add each term to the bloomfilter, and track the event by each term
            for term in segments(event):
                self.bf.add_value(term)
                if term not in self.terms:
                    self.terms[term] = set()
                self.terms[term].add(event_id)

        def search(self, term):
            """Search for a single term, and yield all the events that contain it"""
            # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
            if not self.bf.might_contain(term):
                return
            # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
            if term not in self.terms:
                return
            for event_id in sorted(self.terms[term]):
                yield self.events[event_id]
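- Splunk represents an indexed collection with search capability.
- Each collection contains a Bloom filter, an inverted term table (a dictionary), and an array that stores all events.
- When an event is added to the index, the following happens:
  - A unique ID is generated for the event; here it is simply the event's sequence number.
  - The event is tokenized, and each term is added to the Bloom filter and to the inverted term table, that is, the mapping from each term to the IDs of the events that contain it. Note that one term may correspond to several events, so each value in the inverted table is a set. The inverted table is the core feature of the vast majority of search engines.
- When a term is searched, the following happens:
  - Check the Bloom filter; if it answers false, return immediately.
  - Check the term table; if the searched term is not in it, return immediately.
  - Look up all matching event IDs in the inverted table, then return the contents of those events.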
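The add and search steps above can be sketched in a self-contained form. `TinyBloom` and the whitespace tokenizer below are hypothetical stand-ins, not the article's `Bloomfilter` or `segments`; they are only meant to illustrate why a negative Bloom filter answer lets the search return at once, while a positive answer still needs the term table to rule out false positives.

```python
import hashlib
from collections import defaultdict


class TinyBloom:
    """Hypothetical stand-in for a Bloom filter: a bit array plus a few hash probes."""

    def __init__(self, size=64, probes=3):
        self.size = size
        self.probes = probes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, value):
        # Derive `probes` bit positions by salting the value before hashing.
        for salt in range(self.probes):
            digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add_value(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True only means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bloom = TinyBloom()
index = defaultdict(set)  # term -> set of event IDs (the inverted table)
events = []


def add_event(event):
    event_id = len(events)      # the sequence number doubles as the ID
    events.append(event)
    for term in event.split():  # assumption: plain whitespace tokenizer
        bloom.add_value(term)
        index[term].add(event_id)


def search(term):
    if not bloom.might_contain(term):  # stage 1: cheap negative answer
        return []
    if term not in index:              # stage 2: rule out false positives
        return []
    return [events[i] for i in sorted(index[term])]


add_event("error disk full")
add_event("error network down")
print(search("error"))    # ['error disk full', 'error network down']
print(search("disk"))     # ['error disk full']
print(search("missing"))  # []
```

Because a Bloom filter never produces a false negative, every term that was added passes stage 1; a term that was never added is rejected either there or, on a false positive, at stage 2.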
Let's run it and see:
    s = Splunk()
    s.add_event('src_ip = 1.2.3.4')
    s.add_event('src_ip = 5.6.7.8')
    s.add_event('dst_ip = 1.2.3.4')

    # Search for a full IP address
    for event in s.search('1.2.3.4'):
        print(event)
    print('-')

    # Search for a whole field name
    for event in s.search('src_ip'):
        print(event)
    print('-')

    # Search for a sub-term shared by both field names
    for event in s.search('ip'):
        print(event)
Output:

    src_ip = 1.2.3.4
    dst_ip = 1.2.3.4
    -
    src_ip = 1.2.3.4
    src_ip = 5.6.7.8
    -
    src_ip = 1.2.3.4
    src_ip = 5.6.7.8
    dst_ip = 1.2.3.4
Pretty neat, isn't it?
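The last search returns all three events because `ip` is indexed as a term in its own right. The `segments` function is defined outside this section, so the following is only a guess at a tokenizer that would be consistent with the results shown, assuming a space acts as the major break and `_` and `.` act as minor breaks. The name `sketch_segments` and both parameters are hypothetical.

```python
def sketch_segments(event, major_break=" ", minor_breaks="_."):
    """Hypothetical tokenizer: every whole word plus the sub-parts inside it."""
    terms = set()
    for word in event.split(major_break):
        if not word:
            continue
        terms.add(word)  # the whole word, e.g. 'src_ip' or '1.2.3.4'
        part = ""
        for ch in word:
            if ch in minor_breaks:
                if part:
                    terms.add(part)  # a sub-part that ends at a minor break
                part = ""
            else:
                part += ch
        if part:
            terms.add(part)  # the trailing sub-part, e.g. 'ip'
    return terms


print(sorted(sketch_segments("src_ip = 1.2.3.4")))
```

Under these assumptions both `src_ip` and `ip` land in the term set of the first event, which is why a search for either one matches it.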
More complex searches
Going a step further, we want to use AND and OR during a search to express more complex search logic.
Here is the code:
    class SplunkM(object):
        def __init__(self):
            self.bf = Bloomfilter(64)
            self.terms = {}  # Dictionary of term to set of events
            self.events = []

        def add_event(self, event):
            """Adds an event to this object"""
            # Generate a unique ID for the event, and save it
            event_id = len(self.events)
            self.events.append(event)
            # Add each term to the bloomfilter, and track the event by each term
            for term in segments(event):
                self.bf.add_value(term)
                if term not in self.terms:
                    self.terms[term] = set()
                self.terms[term].add(event_id)

        def search_all(self, terms):
            """Search for an AND of all terms"""
            # Start with the universe of all events...
            results = set(range(len(self.events)))
            for term in terms:
                # If a term isn't present at all then we can stop looking
                if not self.bf.might_contain(term):
                    return
                if term not in self.terms:
                    return
                # Drop events that don't match from our results
                results = results.intersection(self.terms[term])
            for event_id in sorted(results):
                yield self.events[event_id]

        def search_any(self, terms):
            """Search for an OR of all terms"""
            results = set()
            for term in terms:
                # If a term isn't present, we skip it, but don't stop
                if not self.bf.might_contain(term):
                    continue
                if term not in self.terms:
                    continue
                # Add these events to our results
                results = results.union(self.terms[term])
            for event_id in sorted(results):
                yield self.events[event_id]
With the intersection and union operations of Python sets, AND (intersection) and OR (union) searches are easy to support.
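As a rough illustration of the set operations involved, here is a stripped-down sketch that leaves out the Bloom filter and works on a hand-written inverted table. The table contents and the standalone functions are assumptions made for the example; they mirror the three events used earlier but are not produced by `segments`.

```python
# Hypothetical inverted table for the three events used earlier
events = ["src_ip = 1.2.3.4", "src_ip = 5.6.7.8", "dst_ip = 1.2.3.4"]
terms = {
    "src_ip": {0, 1},
    "dst_ip": {2},
    "1.2.3.4": {0, 2},
    "5.6.7.8": {1},
}


def search_all(query):
    """AND: keep only the events that contain every term."""
    results = set(range(len(events)))  # start from all events
    for term in query:
        if term not in terms:
            return []                  # one missing term empties an AND
        results &= terms[term]         # set intersection
    return [events[i] for i in sorted(results)]


def search_any(query):
    """OR: collect the events that contain at least one term."""
    results = set()
    for term in query:
        results |= terms.get(term, set())  # set union; missing terms add nothing
    return [events[i] for i in sorted(results)]


print(search_all(["src_ip", "1.2.3.4"]))  # ['src_ip = 1.2.3.4']
print(search_any(["dst_ip", "5.6.7.8"]))  # ['src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4']
print(search_all(["src_ip", "nope"]))     # []
```

The asymmetry matches the class above: a missing term ends an AND search immediately, whereas an OR search just skips it and carries on.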