This module contains “collector” objects. Collectors provide a way to gather “raw” results from a whoosh.matching.Matcher object, implement sorting, filtering, collation, etc., and produce a whoosh.searching.Results object.
The basic collectors are:
Here’s an example of a simple collector that, instead of remembering the matched documents, just counts the number of matches:
from whoosh.collectors import Collector

class CountingCollector(Collector):
    def prepare(self, top_searcher, q, context):
        # Always call super method in prepare
        Collector.prepare(self, top_searcher, q, context)
        self.count = 0

    def collect(self, sub_docnum):
        # Count the match instead of storing the document
        self.count += 1

c = CountingCollector()
mysearcher.search_with_collector(myquery, c)
print(c.count)
There are also several wrapping collectors that extend or modify the functionality of other collectors. The whoosh.searching.Searcher.search() method uses many of these when you specify various parameters.
NOTE: collectors are not designed to be reentrant or thread-safe. It is generally a good idea to create a new collector for each search.
Base class for collectors.
Returns a sequence of docnums matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
This method is called for every matched document. It should do the work of adding a matched document to the results, and it should return an object to use as a “sorting key” for the given document (such as the document’s score, a key generated by a facet, or just None). Subclasses must implement this method.
If you want the score for the current document, use self.matcher.score().
Overriding methods should add the current document offset (self.offset) to the sub_docnum to get the top-level document number for the matching document to add to results.
Parameters: sub_docnum – the document number of the current match within the current sub-searcher. You must add self.offset to this number to get the document’s top-level document number.
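The offset arithmetic described above can be sketched in plain Python. This is a toy model and does not use Whoosh itself: the segment sizes and the helper names `segment_offsets` and `to_top_level` are assumptions made purely for illustration.

```python
from itertools import accumulate


def segment_offsets(segment_sizes):
    """Return the starting top-level docnum of each segment (hypothetical helper)."""
    # The first segment starts at 0; each later one starts where the previous ended.
    return [0] + list(accumulate(segment_sizes))[:-1]


def to_top_level(offset, sub_docnum):
    """Mirror of the `self.offset + sub_docnum` step inside a collect() override."""
    return offset + sub_docnum


# Assumed index with three segments holding 5, 3 and 4 documents.
offsets = segment_offsets([5, 3, 4])

# The third document (sub_docnum 2) of the second segment.
top_docnum = to_top_level(offsets[1], 2)
print(offsets, top_docnum)
```

The point of the sketch is only that a segment-relative number is meaningless across segments until the segment's offset is added.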
This method calls Collector.matches() and then for each matched document calls Collector.collect(). Sub-classes that want to intervene between finding matches and adding them to the collection (for example, to filter out certain documents) can override this method.
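The override point described above follows a template-method pattern, which the following self-contained sketch illustrates. `ToyCollector` and `EvenOnlyCollector` are hypothetical stand-ins, not Whoosh classes; only the method names `matches`, `collect` and `collect_matches` are taken from the text.

```python
class ToyCollector:
    """Stand-in for a collector; NOT the Whoosh base class."""

    def __init__(self, matched):
        self._matched = matched  # sub_docnums a matcher would yield
        self.collected = []

    def matches(self):
        # Yield relative document numbers for the current sub-searcher.
        yield from self._matched

    def collect(self, sub_docnum):
        self.collected.append(sub_docnum)

    def collect_matches(self):
        # Default behavior: collect every match.
        for sub_docnum in self.matches():
            self.collect(sub_docnum)


class EvenOnlyCollector(ToyCollector):
    def collect_matches(self):
        # Intervene between matching and collecting: skip odd docnums.
        for sub_docnum in self.matches():
            if sub_docnum % 2 == 0:
                self.collect(sub_docnum)


plain = ToyCollector([1, 2, 3, 4])
plain.collect_matches()

even = EvenOnlyCollector([1, 2, 3, 4])
even.collect_matches()
print(plain.collected, even.collected)
```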
Returns True if the collector naturally computes the exact number of matching documents. Collectors that use block optimizations will return False since they might skip blocks containing matching documents.
Note that if this method returns False you can still call count(), but count() may have to do more work to calculate the number of matching documents.
Returns the total number of documents matched in this collector. (Only valid after the collector is run.)
The default implementation is based on the docset. If a collector does not maintain the docset, it will need to override this method.
This method is called after a search.
Subclasses can override this to perform clean-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object.
Yields a series of relative document numbers for matches in the current subsearcher.
This method is called before a search.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object.
Removes a document from the collector. Note that this method uses the global document number, as opposed to Collector.collect(), which takes a segment-relative docnum.
Returns a Results object containing the results of the search. Subclasses must implement this method.
This method is called each time the collector starts on a new sub-searcher.
Subclasses can override this to perform set-up work, but they should still call the superclass’s method because it sets several necessary attributes on the collector object.
Returns a sorting key for the current match. This should return the same value returned by Collector.collect(), but without the side effect of adding the current document to the results.
If the collector has been prepared with context.needs_current=True, this method can use self.matcher to get information, for example the score. Otherwise, it should only use the provided sub_docnum, since the matcher may be in an inconsistent state.
Subclasses must implement this method.
Base class for collectors that sort the results based on document score.
Parameters: replace – Number of matches between attempts to replace the matcher with a more efficient version.
Base class for collectors that wrap other collectors.
A collector that only returns the top “N” scored results.
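Keeping only the top N scored results is commonly done with a bounded min-heap. The sketch below shows that general technique in plain Python. It is an assumption about how such a collector could work, not a description of Whoosh's implementation, and `top_n` is a hypothetical helper.

```python
import heapq


def top_n(scored_docs, limit):
    """Return the `limit` highest-scoring (score, docnum) pairs, best first."""
    if limit <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest result currently kept
    for score, docnum in scored_docs:
        if len(heap) < limit:
            heapq.heappush(heap, (score, docnum))
        elif (score, docnum) > heap[0]:
            # The new result beats the weakest kept one, so swap them.
            heapq.heapreplace(heap, (score, docnum))
    return sorted(heap, reverse=True)


# Made-up (score, docnum) pairs for illustration.
hits = [(1.5, 10), (3.2, 11), (0.7, 12), (2.9, 13)]
best = top_n(hits, limit=2)
print(best)
```

The heap never grows beyond `limit` entries, so memory stays bounded no matter how many documents match.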
A collector that returns all scored results.
A collector that returns results sorted by a given whoosh.sorting.Facet object. See Sorting and faceting for more information.
A collector that lets you allow and/or restrict certain document numbers in the results:
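from whoosh import collectors, query

uc = collectors.UnlimitedCollector()
# Only allow documents in the "rendering" chapter
ins = query.Term("chapter", "rendering")
# Exclude documents marked as restricted
outs = query.Term("status", "restricted")
fc = collectors.FilterCollector(uc, allow=ins, restrict=outs)
mysearcher.search_with_collector(myquery, fc)
print(fc.results())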
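This collector discards a document if an allow set is given and the document number is not in it, or if a restrict set is given and the document number is in it.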
(So, if the same document number is in both sets, that document will be discarded.)
If you have a reference to the collector, you can use FilterCollector.filtered_count to get the number of matching documents filtered out of the results by the collector.
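The allow/restrict behavior, including the case where a document number appears in both sets, can be modeled in a few lines of plain Python. This is a sketch under stated assumptions: it uses plain sets of document numbers rather than Whoosh queries, `keep_document` is a hypothetical helper, and `filtered_count` here only imitates the attribute of the same name.

```python
def keep_document(docnum, allow=None, restrict=None):
    """Return True if the document survives both filters (hypothetical helper)."""
    if allow is not None and docnum not in allow:
        return False  # an allow set exists and the document is not in it
    if restrict is not None and docnum in restrict:
        return False  # a restrict set exists and the document is in it
    return True


# Made-up document numbers for illustration.
allow = {1, 2, 3, 4}
restrict = {3, 9}
matched = [1, 3, 5]

kept = [d for d in matched if keep_document(d, allow, restrict)]
filtered_count = len(matched) - len(kept)
print(kept, filtered_count)
```

Document 3 is in both sets and is discarded, and document 5 is discarded because it is missing from the allow set.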
A collector that creates groups of documents based on whoosh.sorting.Facet objects. See Sorting and faceting for more information.
This collector is used if you specify a groupedby parameter in the whoosh.searching.Searcher.search() method. You can use the whoosh.searching.Results.groups() method to access the facet groups.
If you have a reference to the collector, you can also use FacetedCollector.facetmaps to access the groups directly:
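from whoosh import collectors, sorting

uc = collectors.UnlimitedCollector()
# Group matching documents by the value of their "category" field
fc = collectors.FacetedCollector(uc, sorting.FieldFacet("category"))
mysearcher.search_with_collector(myquery, fc)
print(fc.facetmaps)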
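Conceptually, grouping by a facet means mapping each facet key to the documents that share it. The following sketch shows that idea with a plain dictionary. It does not reproduce the structure of Whoosh's facet maps, and `group_by_key` and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def group_by_key(docs, key_fn):
    """Map each facet key to the list of docnums sharing it (hypothetical helper)."""
    groups = defaultdict(list)
    for docnum, fields in docs:
        groups[key_fn(fields)].append(docnum)
    return dict(groups)


# Made-up (docnum, stored fields) pairs.
docs = [
    (0, {"category": "fruit"}),
    (1, {"category": "veg"}),
    (2, {"category": "fruit"}),
]
groups = group_by_key(docs, lambda fields: fields["category"])
print(groups)
```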
A collector that collapses results based on a facet. That is, it eliminates all but the top N results that share the same facet key. Documents with an empty key for the facet are never eliminated.
The “top” results within each group are determined by the result ordering (e.g. highest score in a scored search) or an optional second “ordering” facet.
If you have a reference to the collector you can use CollapseCollector.collapsed_counts to access the number of documents eliminated based on each key:
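from whoosh import collectors

tc = collectors.TopCollector(limit=20)
# Keep at most 3 results for each value of the "group" facet
cc = collectors.CollapseCollector(tc, "group", limit=3)
mysearcher.search_with_collector(myquery, cc)
print(cc.collapsed_counts)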
See Collapsing results for more information.
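The collapsing rule can be sketched in plain Python as follows. This is a simplified model, not Whoosh's algorithm: it assumes the input is already ranked best first, uses None as the empty key, and `collapse` is a hypothetical helper whose second return value only imitates `collapsed_counts`.

```python
from collections import Counter


def collapse(ranked_docs, key_fn, limit=1):
    """Keep at most `limit` docs per key, preserving ranking order.

    Documents whose key is None (the empty key) are never eliminated.
    """
    kept = []
    seen = Counter()              # how many docs were kept for each key
    collapsed_counts = Counter()  # how many docs were eliminated for each key
    for doc in ranked_docs:
        key = key_fn(doc)
        if key is None:
            kept.append(doc)
        elif seen[key] < limit:
            seen[key] += 1
            kept.append(doc)
        else:
            collapsed_counts[key] += 1
    return kept, dict(collapsed_counts)


# Made-up (name, group key) pairs, best result first.
ranked = [("a1", "a"), ("a2", "a"), ("b1", "b"), ("x", None), ("a3", "a"), ("y", None)]
kept, counts = collapse(ranked, key_fn=lambda doc: doc[1], limit=2)
print([name for name, _ in kept], counts)
```

Only "a3" is dropped: it is the third result for key "a", while both documents with an empty key survive.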
A collector that raises a TimeLimit exception if the search does not complete within a certain number of seconds:
from whoosh import collectors

uc = collectors.UnlimitedCollector()
tlc = collectors.TimeLimitedCollector(uc, timelimit=5.8)
try:
    mysearcher.search_with_collector(myquery, tlc)
except collectors.TimeLimit:
    print("The search ran out of time!")

# We can still get partial results from the collector
print(tlc.results())
IMPORTANT: On Unix systems (systems where signal.SIGALRM is defined), the code uses signals to stop searching immediately when the time limit is reached. Windows does not support this functionality, so there the search only checks the time between each found document. If a matcher is slow, the search could therefore exceed the time limit.
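The fallback behavior, checking the clock only between found documents, can be illustrated with a short stand-alone sketch. Everything here is an assumption for illustration: `ToyTimeLimit`, `collect_with_deadline` and the injectable `clock` parameter are hypothetical and are not part of Whoosh.

```python
import time


class ToyTimeLimit(Exception):
    """Hypothetical stand-in for a time-limit exception; carries partial results."""

    def __init__(self, partial):
        super().__init__("time limit reached")
        self.partial = partial


def collect_with_deadline(docnums, timelimit, clock=time.monotonic):
    """Collect docnums, checking the elapsed time before each document."""
    start = clock()
    collected = []
    for docnum in docnums:
        # The check runs only between documents, so a slow step can overshoot.
        if clock() - start > timelimit:
            raise ToyTimeLimit(collected)
        collected.append(docnum)
    return collected


# A fake clock that advances one second per call keeps the example deterministic.
ticks = iter([0.0, 1.0, 2.0, 3.0, 4.0])
try:
    collect_with_deadline([10, 11, 12, 13], timelimit=2.5, clock=lambda: next(ticks))
    partial = None
except ToyTimeLimit as exc:
    partial = exc.partial
print(partial)
```

Because the check happens only at document boundaries, the time actually spent can exceed the limit by however long the slowest single step takes.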
A collector that remembers which terms appeared in each matched document.
This collector is used if you specify terms=True in the whoosh.searching.Searcher.search() method.
If you have a reference to the collector, you can also use TermsCollector.termdocs and TermsCollector.docterms to access the term lists directly:
from whoosh import collectors

uc = collectors.UnlimitedCollector()
tc = collectors.TermsCollector(uc)
mysearcher.search_with_collector(myquery, tc)
# tc.termdocs is a dictionary mapping (fieldname, text) tuples to
# sets of document numbers
print(tc.termdocs)
# tc.docterms is a dictionary mapping docnums to lists of
# (fieldname, text) tuples
print(tc.docterms)
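The two mappings described in the comments above are inverses of each other. A minimal plain-Python sketch of how such a pair could be built from per-document term lists follows. `build_term_maps` and the sample matches are assumptions for illustration and do not reflect how Whoosh populates these attributes.

```python
from collections import defaultdict


def build_term_maps(matches):
    """Build term-to-docs and doc-to-terms mappings (hypothetical helper)."""
    termdocs = defaultdict(set)   # (fieldname, text) -> set of docnums
    docterms = defaultdict(list)  # docnum -> list of (fieldname, text)
    for docnum, terms in matches:
        for term in terms:
            termdocs[term].add(docnum)
            docterms[docnum].append(term)
    return dict(termdocs), dict(docterms)


# Made-up matches: each docnum with the query terms found in it.
matches = [
    (0, [("title", "render")]),
    (1, [("title", "render"), ("body", "shade")]),
]
termdocs, docterms = build_term_maps(matches)
print(termdocs)
print(docterms)
```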