Class MlBreakEngine

java.lang.Object
com.ibm.icu.impl.breakiter.MlBreakEngine

public class MlBreakEngine extends Object
  • Field Details

    • MAX_FEATURE

      private static final int MAX_FEATURE
    • fDigitOrOpenPunctuationOrAlphabetSet

      private UnicodeSet fDigitOrOpenPunctuationOrAlphabetSet
    • fClosePunctuationSet

      private UnicodeSet fClosePunctuationSet
    • fModel

      private List<HashMap<String,Integer>> fModel
    • fNegativeSum

      private int fNegativeSum
  • Constructor Details

    • MlBreakEngine

      public MlBreakEngine(UnicodeSet digitOrOpenPunctuationOrAlphabetSet, UnicodeSet closePunctuationSet)
      Constructor for Chinese and Japanese phrase breaking.
      Parameters:
      digitOrOpenPunctuationOrAlphabetSet - A Unicode set containing digits, open punctuation, and alphabetic characters.
      closePunctuationSet - A Unicode set containing close punctuation.
  • Method Details

    • divideUpRange

      public int divideUpRange(CharacterIterator inText, int startPos, int endPos, CharacterIterator inString, int codePointLength, int[] charPositions, DictionaryBreakEngine.DequeI foundBreaks)
      Divide up a range of characters handled by this break engine.
      Parameters:
      inText - An input text.
      startPos - The start index of the input text.
      endPos - The end index of the input text.
      inString - An input string normalized from inText over the range startPos to endPos.
      codePointLength - The number of code points in inString.
      charPositions - A map from each code point index in inString to the corresponding code unit index.
      foundBreaks - A list in which to store the breakpoints.
      Returns:
      The number of breakpoints
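
      The charPositions parameter is easier to picture with a concrete mapping. The following is a minimal sketch, not ICU's code: the class and method names are hypothetical, and the trailing entry that marks the end of the string is an assumption made for illustration.

```java
// Hypothetical helper, for illustration only; not part of ICU.
final class CharPositionsSketch {

    /**
     * Builds a table whose i-th entry is the UTF-16 code unit index at which
     * the i-th code point of s starts. One extra trailing entry (an assumption
     * of this sketch) holds the total number of code units.
     */
    static int[] codePointToCodeUnit(String s) {
        int codePointLength = s.codePointCount(0, s.length());
        int[] positions = new int[codePointLength + 1];
        int unit = 0;
        for (int i = 0; i < codePointLength; i++) {
            positions[i] = unit;
            // Supplementary code points occupy two code units, all others one.
            unit += Character.charCount(s.codePointAt(unit));
        }
        positions[codePointLength] = unit;
        return positions;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so it takes two code units.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(java.util.Arrays.toString(codePointToCodeUnit(s)));
    }
}
```

      A table like this lets a break found at a code point index be reported as a code unit index in the original text.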
    • transform

      private String transform(CharacterIterator inString)
      Transform a CharacterIterator into a String.
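
      A conversion of this kind can be sketched with the standard java.text API. This is an illustrative version under the assumption that the whole iterator range is copied; the class name is hypothetical and the private ICU method may differ.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

// Hypothetical helper, for illustration only; not part of ICU.
final class TransformSketch {

    /** Copies every char between the iterator's begin and end index into a String. */
    static String transform(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        // first() repositions the iterator at its begin index;
        // next() returns DONE once the end index is reached.
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform(new StringCharacterIterator("abc")));
    }
}
```

      Note that this idiom moves the iterator, so a caller that relies on the current position needs to save and restore it.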
    • evaluateBreakpoint

      private void evaluateBreakpoint(String inputStr, int[] indexList, int startIdx, int numCodeUnits, ArrayList<Integer> boundary)
      Evaluate whether the breakpointIdx is a potential breakpoint.
      Parameters:
      inputStr - An input string to be segmented.
      indexList - A code unit index list of the inputStr.
      startIdx - The start index of the indexList.
      numCodeUnits - The current code unit boundary of the indexList.
      boundary - A list including the index of the breakpoint.
    • initIndexList

      private int initIndexList(CharacterIterator inString, int[] indexList, int codePointLength)
      Initialize the index list from the input string.
      Parameters:
      inString - An input string to be segmented.
      indexList - A code unit index list of the inString.
      codePointLength - The number of code points of the input string
      Returns:
      The number of the code units of the first six characters in inString.
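
      The return value described above can be illustrated with a simplified sketch. It takes a String rather than a CharacterIterator, and the slot layout and the -1 sentinel for unused slots are assumptions of this example, not ICU's actual layout.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of ICU.
final class IndexListSketch {

    /** Size of the window named in the description above. */
    static final int WINDOW = 6;

    /**
     * Records the starting code unit index of each of the first six code points
     * of s (fewer if the string is shorter) and returns the number of code
     * units those code points occupy. indexList must have at least six slots.
     */
    static int initIndexList(String s, int[] indexList, int codePointLength) {
        Arrays.fill(indexList, -1); // -1 marks an unused slot (assumption)
        int unit = 0;
        int limit = Math.min(WINDOW, codePointLength);
        for (int i = 0; i < limit; i++) {
            indexList[i] = unit;
            unit += Character.charCount(s.codePointAt(unit));
        }
        return unit;
    }

    public static void main(String[] args) {
        int[] indexList = new int[WINDOW];
        int units = initIndexList("abcdefgh", indexList, 8);
        System.out.println(units + " " + Arrays.toString(indexList));
    }
}
```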
    • loadMLModel

      private void loadMLModel()
      Load the machine learning model file.
    • initKeyValue

      private void initKeyValue(UResourceBundle rb, String keyName, String valueName, HashMap<String,Integer> map)
      Given the names of a key and a value in the machine learning model file, load the corresponding features and their scores.
      Parameters:
      rb - A ResourceBundle corresponding to the model file.
      keyName - The key name in the model file.
      valueName - The value name in the model file.
      map - A HashMap to store the pairs of the feature and its score.
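
      The pairing of features with scores can be sketched without the resource bundle. The example below assumes the key and value entries arrive as parallel arrays, which is an assumption about the model file's shape; the class name, feature names, and scores are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of ICU.
final class FeatureMapSketch {

    /**
     * Stores each feature with its score, pairing keys[i] with values[i].
     * The real method reads both lists from a UResourceBundle instead.
     */
    static void initKeyValue(String[] keys, int[] values, Map<String, Integer> map) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], values[i]);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        // Invented feature names and scores.
        initKeyValue(new String[] {"featureA", "featureB"}, new int[] {120, -45}, map);
        System.out.println(map);
    }
}
```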