Class DirectoryTaxonomyWriter

java.lang.Object
org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter
All Implemented Interfaces:
Closeable, AutoCloseable, TaxonomyWriter, TwoPhaseCommit
Direct Known Subclasses:
ReindexingEnrichedDirectoryTaxonomyWriter

public class DirectoryTaxonomyWriter extends Object implements TaxonomyWriter
TaxonomyWriter which uses a Directory to store the taxonomy information on disk, and keeps an additional in-memory cache of some or all categories.

In addition to the permanently-stored information in the Directory, efficiency dictates that we also keep an in-memory cache of recently seen or all categories, so that we do not need to go back to disk for every category addition to see which ordinal this category already has, if any. A TaxonomyWriterCache object determines the specific caching algorithm used.

This class offers some hooks for extending classes to control the IndexWriter instance that is used. See openIndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.index.IndexWriterConfig).

  • Field Details

    • INDEX_EPOCH

      public static final String INDEX_EPOCH
      Property name of user commit data that contains the index epoch. The epoch changes whenever the taxonomy is recreated (i.e. opened with IndexWriterConfig.OpenMode.CREATE).

      Applications should not use this property in their commit data because it will be overridden by this taxonomy writer.

    • DEFAULT_CACHE_SIZE

      private static final int DEFAULT_CACHE_SIZE
    • dir

      private final Directory dir
    • indexWriter

      private final IndexWriter indexWriter
    • useOlderFormat

      private final boolean useOlderFormat
    • cache

      private final TaxonomyWriterCache cache
    • cacheMisses

      private final AtomicInteger cacheMisses
    • nextID

      private final AtomicInteger nextID
    • fullPathField

      private final Field fullPathField
    • indexEpoch

      private long indexEpoch
    • parentStream

    • parentStreamField

      private Field parentStreamField
    • cacheMissesUntilFill

      private int cacheMissesUntilFill
    • shouldFillCache

      private boolean shouldFillCache
    • readerManager

      private ReaderManager readerManager
    • initializedReaderManager

      private volatile boolean initializedReaderManager
    • shouldRefreshReaderManager

      private volatile boolean shouldRefreshReaderManager
    • cacheIsComplete

      private volatile boolean cacheIsComplete
      We call the cache "complete" if we know that every category in our taxonomy is in the cache. When the cache is not complete and we cannot find a category in it, we still need to look for that category in the on-disk index. Therefore, when the cache is not complete, we need to open a "reader" to the taxonomy index. The cache becomes incomplete if it was never filled with the existing categories, or if a put() to the cache ever returned true (meaning that some cached data was cleared).
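
      The completeness rule above can be modeled with a small stand-alone sketch. The names ToyCompleteness, add and lookup are hypothetical, the "disk" is just a HashMap, and the eviction policy (drop the oldest entry) is an assumption; this is not the Lucene implementation, only an illustration of when a cache miss must fall through to the on-disk index.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not Lucene code): tracks whether a bounded cache still holds every category. */
final class ToyCompleteness {
  private final int capacity;
  private final Map<String, Integer> cache = new LinkedHashMap<>();
  private final Map<String, Integer> disk = new HashMap<>(); // stands in for the on-disk index
  private boolean cacheIsComplete = true; // an empty taxonomy is trivially fully cached
  int diskLookups = 0;

  ToyCompleteness(int capacity) {
    this.capacity = capacity;
  }

  /** Mimics a cache put() that returns true when it had to clear an older entry. */
  private boolean put(String path, int ordinal) {
    boolean evicted = false;
    if (!cache.containsKey(path) && cache.size() >= capacity) {
      Iterator<String> it = cache.keySet().iterator();
      it.next();
      it.remove(); // drop the oldest entry (assumed policy)
      evicted = true;
    }
    cache.put(path, ordinal);
    return evicted;
  }

  /** Adds a new category to "disk" and to the cache. */
  void add(String path) {
    if (disk.containsKey(path)) {
      return;
    }
    int ordinal = disk.size();
    disk.put(path, ordinal);
    if (put(path, ordinal)) {
      cacheIsComplete = false; // some category now lives only on "disk"
    }
  }

  /** Returns the ordinal, or -1 if the category does not exist. */
  int lookup(String path) {
    Integer cached = cache.get(path);
    if (cached != null) {
      return cached;
    }
    if (cacheIsComplete) {
      return -1; // a miss in a complete cache is authoritative
    }
    diskLookups++; // incomplete cache: the "reader" must be consulted
    return disk.getOrDefault(path, -1);
  }

  boolean isCacheComplete() {
    return cacheIsComplete;
  }
}
```

      In this model, a miss costs nothing while the cache is complete; once a single eviction happens, every miss has to consult the slower store.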
    • isClosed

      private volatile boolean isClosed
    • taxoArrays

      private volatile TaxonomyIndexArrays taxoArrays
  • Constructor Details

  • Method Details

    • getCache

      public TaxonomyWriterCache getCache()
      Returns the TaxonomyWriterCache in use by this writer.
    • openIndexWriter

      protected IndexWriter openIndexWriter(Directory directory, IndexWriterConfig config) throws IOException
      Opens the internal index writer, which contains the taxonomy data.

      Extensions may provide their own IndexWriter implementation or instance.
      NOTE: the instance this method returns will be closed when close() is called.
      NOTE: the merge policy in effect must not merge non-adjacent segments. See the comment in createIndexWriterConfig(IndexWriterConfig.OpenMode) for the logic behind this.

      Parameters:
      directory - the Directory on top of which an IndexWriter should be opened.
      config - configuration for the internal index writer.
      Throws:
      IOException
    • createIndexWriterConfig

      protected IndexWriterConfig createIndexWriterConfig(IndexWriterConfig.OpenMode openMode)
      Creates the IndexWriterConfig that is used for opening the internal index writer.
      Extensions can configure the IndexWriter as they see fit, including setting a merge scheduler, a deletion policy, a different RAM buffer size, etc.

      NOTE: internal docids of the configured index must not be altered. For that reason, categories are never deleted from the taxonomy index. In addition, the merge policy in effect must not merge non-adjacent segments.
      Parameters:
      openMode - see IndexWriterConfig.OpenMode
    • initReaderManager

      private void initReaderManager() throws IOException
      Opens a ReaderManager from the internal IndexWriter.
      Throws:
      IOException
    • defaultTaxonomyWriterCache

      public static TaxonomyWriterCache defaultTaxonomyWriterCache()
      Defines the default TaxonomyWriterCache to use in constructors which do not specify one.

      The current default is LruTaxonomyWriterCache.

    • close

      public void close() throws IOException
      Frees used resources and closes the underlying IndexWriter, which commits any changes made to it to the underlying Directory.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException
    • doClose

      private void doClose() throws IOException
      Throws:
      IOException
    • closeResources

      protected void closeResources() throws IOException
      A hook for extending classes to close additional resources that were used. The default implementation closes the IndexReader as well as the TaxonomyWriterCache instances that were used.
      NOTE: if you override this method, you should include a super.closeResources() call in your implementation.
      Throws:
      IOException
    • findCategory

      protected int findCategory(FacetLabel categoryPath) throws IOException
      Look up the given category in the cache and/or the on-disk storage, returning the category's ordinal, or a negative number in case the category does not yet exist in the taxonomy.
      Throws:
      IOException
    • addCategory

      public int addCategory(FacetLabel categoryPath) throws IOException
      Description copied from interface: TaxonomyWriter
      addCategory() adds a category with a given path name to the taxonomy, and returns its ordinal. If the category was already present in the taxonomy, its existing ordinal is returned.

      Before adding a category, addCategory() makes sure that all its ancestor categories exist in the taxonomy as well. As a result, the ordinal of a category is guaranteed to be smaller than the ordinal of any of its descendants.

      Specified by:
      addCategory in interface TaxonomyWriter
      Throws:
      IOException
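
      The ancestor-first rule described for addCategory() can be illustrated with a JDK-only toy model. ToyTaxonomy and its methods are hypothetical names, paths are plain lists of strings rather than FacetLabel objects, and nothing is written to disk; treat it as a sketch of the ordinal invariant, not as the writer's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Lucene code) of ancestor-first ordinal assignment. */
final class ToyTaxonomy {
  private final Map<List<String>, Integer> ordinals = new HashMap<>();
  private final List<Integer> parents = new ArrayList<>(); // index = ordinal

  ToyTaxonomy() {
    ordinals.put(List.of(), 0); // the root (empty path) always has ordinal 0
    parents.add(-1);            // the root has no parent
  }

  /** Adds the path and any missing ancestors; returns the path's ordinal. */
  int addCategory(String... components) {
    return add(Arrays.asList(components));
  }

  private int add(List<String> path) {
    Integer existing = ordinals.get(path);
    if (existing != null) {
      return existing; // already present: return the existing ordinal
    }
    // Recurse on the parent first, so a parent always gets a smaller ordinal.
    int parent = add(path.subList(0, path.size() - 1));
    int ordinal = parents.size();
    ordinals.put(List.copyOf(path), ordinal);
    parents.add(parent);
    return ordinal;
  }

  int getParent(int ordinal) {
    return parents.get(ordinal);
  }

  int getSize() {
    return parents.size();
  }
}
```

      Adding the single path a/b/c to an empty ToyTaxonomy creates three categories plus the root, and each parent ordinal is smaller than its child's.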
    • internalAddCategory

      private int internalAddCategory(FacetLabel cp) throws IOException
      Add a new category into the index (and the cache), and return its new ordinal.

      Actually, we might also need to add some of the category's ancestors before we can add the category itself (while keeping the invariant that a parent is always added to the taxonomy before its child). We do this by recursion.

      Throws:
      IOException
    • ensureOpen

      protected final void ensureOpen()
      Verifies that this instance has not been closed; throws AlreadyClosedException if it has.
    • enrichOrdinalDocument

      protected void enrichOrdinalDocument(Document d, FacetLabel categoryPath)
      Child classes can implement this method to modify the document corresponding to a category path before indexing it.
    • addCategoryDocument

      private int addCategoryDocument(FacetLabel categoryPath, int parent) throws IOException
      Note that the methods calling addCategoryDocument() are synchronized, so this method is effectively synchronized as well.
      Throws:
      IOException
    • addToCache

      private void addToCache(FacetLabel categoryPath, int id) throws IOException
      Throws:
      IOException
    • refreshReaderManager

      private void refreshReaderManager() throws IOException
      Throws:
      IOException
    • commit

      public long commit() throws IOException
      Description copied from interface: TwoPhaseCommit
      The second phase of a 2-phase commit. Implementations should ideally do very little work in this method (following TwoPhaseCommit.prepareCommit()), and after it returns, the caller can assume that the changes were successfully committed to the underlying storage.
      Specified by:
      commit in interface TwoPhaseCommit
      Throws:
      IOException
    • combinedCommitData

      private Iterable<Map.Entry<String,String>> combinedCommitData(Iterable<Map.Entry<String,String>> commitData)
      Combine original user data with the taxonomy epoch.
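
      A minimal sketch of what combining user commit data with the epoch could look like, using only the JDK. The class ToyCommitData, the key string "index.epoch" and the hexadecimal encoding of the epoch are assumptions made for illustration; the point is only that the writer's epoch entry replaces any user entry stored under the same key, which is why applications should not use that property themselves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not Lucene code) of merging user commit data with a taxonomy epoch. */
final class ToyCommitData {
  // Assumed key name, used here only for illustration.
  static final String INDEX_EPOCH = "index.epoch";

  /** Copies the user's entries, then writes the epoch so that it always wins. */
  static Map<String, String> combine(Map<String, String> userData, long epoch) {
    Map<String, String> combined = new LinkedHashMap<>();
    if (userData != null) {
      combined.putAll(userData);
    }
    // Assumed encoding: the epoch as a base-16 string. Any user value under this key is lost.
    combined.put(INDEX_EPOCH, Long.toString(epoch, 16));
    return combined;
  }
}
```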
    • setLiveCommitData

      public void setLiveCommitData(Iterable<Map.Entry<String,String>> commitUserData)
      Description copied from interface: TaxonomyWriter
      Specified by:
      setLiveCommitData in interface TaxonomyWriter
    • getLiveCommitData

      public Iterable<Map.Entry<String,String>> getLiveCommitData()
      Description copied from interface: TaxonomyWriter
      Returns the commit user data iterable that was set on TaxonomyWriter.setLiveCommitData(Iterable).
      Specified by:
      getLiveCommitData in interface TaxonomyWriter
    • prepareCommit

      public long prepareCommit() throws IOException
      Prepares most of the work needed for a two-phase commit. See IndexWriter.prepareCommit().
      Specified by:
      prepareCommit in interface TwoPhaseCommit
      Throws:
      IOException
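
      To show why prepareCommit(), commit() and rollback() are separate steps, here is a JDK-only sketch of a coordinator that commits several participants together, for example a search index and its taxonomy. ToyParticipant and ToyCommitCoordinator are hypothetical stand-ins, not the Lucene TwoPhaseCommit interface, and error handling is deliberately simplified.

```java
import java.util.List;

/** Hypothetical stand-in for a two-phase-commit participant; records what happened to it. */
final class ToyParticipant {
  final String name;
  final boolean failOnPrepare;
  final List<String> log;

  ToyParticipant(String name, boolean failOnPrepare, List<String> log) {
    this.name = name;
    this.failOnPrepare = failOnPrepare;
    this.log = log;
  }

  void prepareCommit() {
    if (failOnPrepare) {
      throw new IllegalStateException(name + " cannot prepare");
    }
    log.add(name + ":prepare"); // the expensive work belongs in this phase
  }

  void commit() {
    log.add(name + ":commit"); // ideally a cheap final step
  }

  void rollback() {
    log.add(name + ":rollback");
  }
}

final class ToyCommitCoordinator {
  /** Prepares every participant, then commits all; rolls everything back if a prepare fails. */
  static boolean commitAll(List<ToyParticipant> participants) {
    try {
      for (ToyParticipant p : participants) {
        p.prepareCommit();
      }
    } catch (RuntimeException e) {
      for (ToyParticipant p : participants) {
        p.rollback();
      }
      return false;
    }
    for (ToyParticipant p : participants) {
      p.commit();
    }
    return true;
  }
}
```

      Because failures are most likely during the prepare phase, no participant has committed anything yet when the rollback happens in this sketch.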
    • getSize

      public int getSize()
      Description copied from interface: TaxonomyWriter
      getSize() returns the number of categories in the taxonomy.

      Because categories are numbered consecutively starting with 0, it means the taxonomy contains ordinals 0 through getSize()-1.

      Note that the number returned by getSize() is often slightly higher than the number of categories inserted into the taxonomy; this is because when a category is added to the taxonomy, its ancestors are also added automatically (including the root, which always gets ordinal 0).

      Specified by:
      getSize in interface TaxonomyWriter
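
      The gap between the number of inserted categories and getSize() can be worked out by counting distinct path prefixes. The helper below is a hypothetical, JDK-only illustration; it assumes paths are written with "/" as a separator, which is a convention of this sketch and not something the taxonomy requires.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy calculation (not Lucene code) of how many ordinals a set of inserted paths produces. */
final class ToySizeEstimate {
  /** Counts the root plus every distinct prefix of every "/"-separated path. */
  static int expectedSize(List<String> paths) {
    Set<String> categories = new HashSet<>();
    categories.add(""); // the root always exists and has ordinal 0
    for (String path : paths) {
      int slash = -1;
      // Each prefix ending before a separator is an ancestor that gets added automatically.
      while ((slash = path.indexOf('/', slash + 1)) != -1) {
        categories.add(path.substring(0, slash));
      }
      categories.add(path);
    }
    return categories.size();
  }
}
```

      For the two inserted paths Author/Mark Twain and Year/2010/March, the count is six: the root, Author, Author/Mark Twain, Year, Year/2010 and Year/2010/March, so the valid ordinals would be 0 through 5.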
    • setCacheMissesUntilFill

      public void setCacheMissesUntilFill(int i)
      Set the number of cache misses before an attempt is made to read the entire taxonomy into the in-memory cache.

      This taxonomy writer holds an in-memory cache of recently seen categories to speed up operation. On each cache miss, the on-disk index needs to be consulted. When an existing taxonomy is opened, many such slow disk reads are needed until the cache is filled, so it is more efficient to read the entire taxonomy into memory at once. We do this complete read after a certain number of cache misses, which is defined by this method.

      If the number is set to 0, the entire taxonomy is read into the cache on first use, without fetching individual categories first.

      NOTE: it is assumed that this method is called immediately after the taxonomy writer has been created.
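
      The threshold behavior can be sketched as follows. ToyFillOnMiss is a hypothetical, JDK-only model in which a HashMap plays the role of the on-disk index; the counters exist only so the effect of the threshold can be observed, and the exact bookkeeping in the real writer may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Lucene code): bulk-fills the cache after a configurable number of misses. */
final class ToyFillOnMiss {
  private final Map<String, Integer> disk; // stands in for the on-disk taxonomy
  private final Map<String, Integer> cache = new HashMap<>();
  private final int cacheMissesUntilFill;
  private int cacheMisses = 0;
  private boolean filled = false;
  int diskReads = 0; // individual (slow) lookups
  int bulkFills = 0; // complete reads of the taxonomy

  ToyFillOnMiss(Map<String, Integer> disk, int cacheMissesUntilFill) {
    this.disk = disk;
    this.cacheMissesUntilFill = cacheMissesUntilFill;
  }

  /** Returns the ordinal of the path, or -1 if it is unknown. */
  int find(String path) {
    perhapsFillCache();
    Integer cached = cache.get(path);
    if (cached != null) {
      return cached;
    }
    if (filled) {
      return -1; // everything is cached, so the category does not exist
    }
    cacheMisses++;
    diskReads++;
    Integer ordinal = disk.get(path);
    if (ordinal == null) {
      return -1;
    }
    cache.put(path, ordinal);
    return ordinal;
  }

  /** Reads the whole taxonomy once the miss count reaches the threshold (0 means immediately). */
  private void perhapsFillCache() {
    if (filled || cacheMisses < cacheMissesUntilFill) {
      return;
    }
    cache.putAll(disk);
    filled = true;
    bulkFills++;
  }
}
```

      With a threshold of 2, the first two lookups each pay for a disk read and the third triggers one complete read; with a threshold of 0, the complete read happens on first use and no individual disk reads occur.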

    • perhapsFillCache

      private void perhapsFillCache() throws IOException
      Throws:
      IOException
    • getTaxoArrays

      private TaxonomyIndexArrays getTaxoArrays() throws IOException
      Throws:
      IOException
    • getParent

      public int getParent(int ordinal) throws IOException
      Description copied from interface: TaxonomyWriter
      getParent() returns the ordinal of the parent category of the category with the given ordinal.

      When a category is specified as a path name, finding the path of its parent is as trivial as dropping the last component of the path. getParent() is functionally equivalent to calling getPath() on the given ordinal, dropping the last component of the path, and then calling getOrdinal() to get an ordinal back.

      If the given ordinal is the ROOT_ORDINAL, an INVALID_ORDINAL is returned. If the given ordinal is a top-level category, the ROOT_ORDINAL is returned. If an invalid ordinal is given (negative or beyond the last available ordinal), an IndexOutOfBoundsException is thrown. However, it is expected that getParent will only be called for ordinals which are already known to be in the taxonomy.

      TODO (Facet): instead of a getParent(ordinal) method, consider having a getCategory(categorypath, prefixlen) which is similar to addCategory except it doesn't add new categories. This method can be used to get the ordinals of all prefixes of the given category, and it can use exactly the same code and cache used by addCategory(), so it means less code.

      Specified by:
      getParent in interface TaxonomyWriter
      Throws:
      IOException
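
      The "functionally equivalent" definition of getParent() can be written out directly. The sketch below is a hypothetical, JDK-only model in which the taxonomy is a list of paths indexed by ordinal; the constant values 0 and -1 are assumptions of this sketch, and the linear search in getOrdinal() is for brevity only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of getParent() defined via getPath() and getOrdinal(). */
final class ToyParentLookup {
  static final int ROOT_ORDINAL = 0;     // assumed value for this sketch
  static final int INVALID_ORDINAL = -1; // assumed value for this sketch

  private final List<List<String>> paths; // index = ordinal; index 0 is the empty root path

  ToyParentLookup(List<List<String>> pathsByOrdinal) {
    this.paths = new ArrayList<>(pathsByOrdinal);
  }

  /** Throws IndexOutOfBoundsException for a negative or too-large ordinal. */
  List<String> getPath(int ordinal) {
    return paths.get(ordinal);
  }

  /** Linear search, for brevity; returns -1 if the path is unknown. */
  int getOrdinal(List<String> path) {
    return paths.indexOf(path);
  }

  /** Drops the last path component and looks up what remains. */
  int getParent(int ordinal) {
    List<String> path = getPath(ordinal);
    if (path.isEmpty()) {
      return INVALID_ORDINAL; // the root has no parent
    }
    return getOrdinal(path.subList(0, path.size() - 1));
  }
}
```

      A top-level category has a one-component path, so dropping that component leaves the empty root path and the lookup returns ROOT_ORDINAL.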
    • addTaxonomy

      public void addTaxonomy(Directory taxoDir, DirectoryTaxonomyWriter.OrdinalMap map) throws IOException
      Takes the categories from the given taxonomy directory, and adds the missing ones to this taxonomy. Additionally, it fills the given DirectoryTaxonomyWriter.OrdinalMap with a mapping from the original ordinal to the new ordinal.
      Throws:
      IOException
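
      The ordinal mapping that addTaxonomy() fills in can be pictured with a small JDK-only model. ToyTaxonomyMerge is a hypothetical name, each taxonomy is a list of path strings indexed by ordinal (with the root as the empty string at index 0), and the returned int[] plays the role of the OrdinalMap; the real method works on taxonomy directories, not lists.

```java
import java.util.List;

/** Toy model (not Lucene code) of merging one taxonomy into another with an ordinal map. */
final class ToyTaxonomyMerge {
  /**
   * Adds the source categories that dest is missing and returns a map where
   * map[sourceOrdinal] is the ordinal of the same category in dest.
   * Assumes source lists every parent before its children, so dest keeps that property.
   */
  static int[] addTaxonomy(List<String> dest, List<String> source) {
    int[] map = new int[source.size()];
    for (int srcOrdinal = 0; srcOrdinal < source.size(); srcOrdinal++) {
      String path = source.get(srcOrdinal);
      int destOrdinal = dest.indexOf(path); // linear search, for brevity
      if (destOrdinal < 0) {
        destOrdinal = dest.size(); // missing category: append it with the next free ordinal
        dest.add(path);
      }
      map[srcOrdinal] = destOrdinal;
    }
    return map;
  }
}
```

      Such a map is what lets ordinals recorded against the source taxonomy be rewritten to the ordinals of the merged one.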
    • rollback

      public void rollback() throws IOException
      Rolls back changes to the taxonomy writer and closes the instance. After this method is called, the instance is unusable (calling any of its API methods will throw an AlreadyClosedException).
      Specified by:
      rollback in interface TwoPhaseCommit
      Throws:
      IOException
    • replaceTaxonomy

      public void replaceTaxonomy(Directory taxoDir) throws IOException
      Replaces the current taxonomy with the given one. This method should generally be called in conjunction with IndexWriter.addIndexes(Directory...) to replace both the taxonomy and the search index content.
      Throws:
      IOException
    • deleteAll

      void deleteAll() throws IOException
      Delete the taxonomy and reset all state for this writer.

      To keep using the same main index, you would have to regenerate the taxonomy, taking care that ordinals are indexed in the same order as before. An example of this can be found in ReindexingEnrichedDirectoryTaxonomyWriter.reindexWithNewOrdinalData(BiConsumer).

      Throws:
      IOException
    • getDirectory

      public Directory getDirectory()
      Returns the Directory of this taxonomy writer.
    • getInternalIndexWriter

      final IndexWriter getInternalIndexWriter()
      Used by DirectoryTaxonomyReader to support NRT.

      NOTE: do not use the obtained IndexWriter for anything other than opening an IndexReader on it; otherwise, the taxonomy index may become corrupt!

    • getTaxonomyEpoch

      public final long getTaxonomyEpoch()
      Expert: returns current index epoch, if this is a near-real-time reader. Used by DirectoryTaxonomyReader to support NRT.
    • useNumericDocValuesForOrdinals

      public boolean useNumericDocValuesForOrdinals()
      Description copied from interface: TaxonomyWriter
      Determines whether to store taxonomy ordinals for each document using the older binary format or the newer SortedNumericDocValues format (based on the version used to create the index).
      Specified by:
      useNumericDocValuesForOrdinals in interface TaxonomyWriter