The sloppy frequency contribution of a match depends on the distance:
- highest freq for distance=0 (exact match).
- freq gets lower as distance gets higher.
Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
-
Field Summary
FieldsModifier and TypeFieldDescriptionprivate final DocIdSetIterator
private final boolean
private boolean
private int
private boolean
private boolean
private final ImpactsDISI
private int
private int
private int
private int
private int
private final int
private final PhrasePositions[]
private boolean
private final PhraseQueue
private PhrasePositions[][]
private PhrasePositions[]
private final int
-
Constructor Summary
ConstructorsConstructorDescriptionSloppyPhraseMatcher
(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch) -
Method Summary
Modifier and TypeMethodDescriptionprivate boolean
advance a PhrasePosition and update 'end', return false if exhaustedprivate boolean
At initialization (each doc), each repetition group is sorted by (query) offset.private boolean
pp was just advanced.(package private) DocIdSetIterator
Approximation that only matches documents that have all terms.private void
private int
index of a pp2 colliding with pp, or -1 if noneint
The end offset of the current matchint
The end position of the current matchprivate void
Fill the queue (all pps are already placedprivate ArrayList
<ArrayList<PhrasePositions>> gatherRptGroups
(LinkedHashMap<Term, Integer> rptTerms) Detect repetition groups.(package private) ImpactsDISI
Approximation that is aware of impacts.private boolean
with repeats: not so simple.private boolean
initialize with checking for repeats.private boolean
Initialize PhrasePositions in place.private void
no repeats: simplest case, and most common.private PhrasePositions
lesser
(PhrasePositions pp, PhrasePositions pp2) compare two pps, but only by position and offset(package private) float
maxFreq()
An upper bound on the number of possible matches on this documentboolean
Find the next match on the current document, returningfalse
if there are none.private void
move all PPs to their first positionprivate ArrayList
<FixedBitSet> ppTermsBitSets
(PhrasePositions[] rpp, HashMap<Term, Integer> tord) bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is setprivate PhrasePositions[]
repeatingPPs
(HashMap<Term, Integer> rptTerms) find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRptsprivate LinkedHashMap
<Term, Integer> find repeating terms and assign them ordinal valuesvoid
reset()
Called afterPhraseMatcher.approximation()
has been advanced(package private) float
The slop-adjusted weight of the current matchprivate void
sort each repetition group by (query) offset.int
The start offset of the current matchint
The start position of the current matchtermGroups
(LinkedHashMap<Term, Integer> tord, ArrayList<FixedBitSet> bb) map each term to the single group that contains itprivate final int
tpPos
(PhrasePositions pp) Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)private void
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different termsMethods inherited from class org.apache.lucene.search.PhraseMatcher
getMatchCost
-
Field Details
-
phrasePositions
-
slop
private final int slop -
numPostings
private final int numPostings -
pq
-
captureLeadMatch
private final boolean captureLeadMatch -
approximation
-
impactsApproximation
-
end
private int end -
leadPosition
private int leadPosition -
leadOffset
private int leadOffset -
leadEndOffset
private int leadEndOffset -
leadOrd
private int leadOrd -
hasRpts
private boolean hasRpts -
checkedRpts
private boolean checkedRpts -
hasMultiTermRpts
private boolean hasMultiTermRpts -
rptGroups
-
rptStack
-
positioned
private boolean positioned -
matchLength
private int matchLength
-
-
Constructor Details
-
SloppyPhraseMatcher
public SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
-
Method Details
-
approximation
DocIdSetIterator approximation()Description copied from class:PhraseMatcher
Approximation that only matches documents that have all terms.- Specified by:
approximation
in classPhraseMatcher
-
impactsApproximation
ImpactsDISI impactsApproximation()Description copied from class:PhraseMatcher
Approximation that is aware of impacts.- Specified by:
impactsApproximation
in classPhraseMatcher
-
maxFreq
Description copied from class:PhraseMatcher
An upper bound on the number of possible matches on this document- Specified by:
maxFreq
in classPhraseMatcher
- Throws:
IOException
-
reset
Description copied from class:PhraseMatcher
Called afterPhraseMatcher.approximation()
has been advanced- Specified by:
reset
in classPhraseMatcher
- Throws:
IOException
-
sloppyWeight
float sloppyWeight()Description copied from class:PhraseMatcher
The slop-adjusted weight of the current matchThe sum of the slop-adjusted weights is used as the freq for scoring
- Specified by:
sloppyWeight
in classPhraseMatcher
-
nextMatch
Description copied from class:PhraseMatcher
Find the next match on the current document, returningfalse
if there are none.- Specified by:
nextMatch
in classPhraseMatcher
- Throws:
IOException
-
captureLead
- Throws:
IOException
-
startPosition
public int startPosition()Description copied from class:PhraseMatcher
The start position of the current match- Specified by:
startPosition
in classPhraseMatcher
-
endPosition
public int endPosition()Description copied from class:PhraseMatcher
The end position of the current match- Specified by:
endPosition
in classPhraseMatcher
-
startOffset
Description copied from class:PhraseMatcher
The start offset of the current match- Specified by:
startOffset
in classPhraseMatcher
- Throws:
IOException
-
endOffset
Description copied from class:PhraseMatcher
The end offset of the current match- Specified by:
endOffset
in classPhraseMatcher
- Throws:
IOException
-
advancePP
advance a PhrasePosition and update 'end', return false if exhausted- Throws:
IOException
-
advanceRpts
pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.- Throws:
IOException
-
lesser
compare two pps, but only by position and offset -
collide
index of a pp2 colliding with pp, or -1 if none -
initPhrasePositions
Initialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):- Check if there are repetitions
- If there are, find groups of repetitions.
- no repetitions: "ho my"~2
- repetitions: "ho my my"~2
- repetitions: "my ho my"~2
- Returns:
- false if PPs are exhausted (and so current doc will not be a match)
- Throws:
IOException
-
initSimple
no repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient- Throws:
IOException
-
initComplex
with repeats: not so simple.- Throws:
IOException
-
placeFirstPositions
move all PPs to their first position- Throws:
IOException
-
fillQueue
private void fillQueue()Fill the queue (all pps are already placed -
advanceRepeatGroups
At initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.Case 1: no multi-term repeats
It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.Case 2: multi-term repeats
- Returns:
- false if PPs are exhausted.
- Throws:
IOException
-
initFirstTime
initialize with checking for repeats. Heavy work, but done only for the first candidate doc.If there are repetitions, check if multi-term postings (MTP) are involved.
Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.The more complex initialization has two parts:
(1) identification of repetition groups.
(2) advancing repeat groups at the start of the doc.
For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.- Throws:
IOException
-
sortRptGroups
sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc. -
gatherRptGroups
private ArrayList<ArrayList<PhrasePositions>> gatherRptGroups(LinkedHashMap<Term, Integer> rptTerms) throws IOExceptionDetect repetition groups. Done once - for first doc- Throws:
IOException
-
tpPos
Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset) -
repeatingTerms
find repeating terms and assign them ordinal values -
repeatingPPs
find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts -
ppTermsBitSets
bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set -
unionTermGroups
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms -
termGroups
private HashMap<Term,Integer> termGroups(LinkedHashMap<Term, Integer> tord, ArrayList<FixedBitSet> bb) throws IOExceptionmap each term to the single group that contains it- Throws:
IOException
-