This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Update ConversationSearch and RelatedConversation for the new Indexing Structure #303

Open
westei opened this issue Mar 21, 2019 · 2 comments

westei commented Mar 21, 2019

#302 updates the ConversationIndex to better support indexing of public channels. This also requires updates to the conversation search and related conversation search services, because:

  1. the changed index structure requires an update of the query logic
  2. some assumptions made for those searches are built around short conversations and no longer apply when public channels are indexed

westei commented Mar 26, 2019

ConversationMltQueryBuilder

The related conversation query provider (class: ConversationMltQueryBuilder) is the one most affected. The old assumption that the first message in a conversation is the question and the following messages are answers no longer reflects how conversations are actually structured.

At its core, the related conversation query provider does content-based recommendation. It takes the last messages of the current conversation as context and searches for relevant sections in other conversations. To implement this, one needs to ensure:

  1. that for each message the surrounding content is known, so that it can be used as context
  2. that results do not overlap: in the typical case where relevant information is distributed over several adjacent messages, a content-based recommendation algorithm that uses a sliding window around single messages would tend to return multiple messages from this section in one result set. However, users would prefer to get the whole section as a single result. Multiple results for the same conversation (aka channel) are still fine as long as they do not overlap.

After a lot of testing, the best way to implement this turned out to be splitting the conversation into non-overlapping sections and using the text of the messages in a section as context for each of its messages. The query provider can then use the Solr Collapse and Expand feature to ensure that no overlapping sections are part of the result set.
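As a rough illustration of the Collapse and Expand idea, the following sketch builds Solr request parameters that collapse message hits by section, so that each section shows up at most once. The field names `context`, `type` and `section_id` and the helper names are assumptions made for this example and are not taken from the actual index schema or from ConversationMltQueryBuilder.

```python
from urllib.parse import urlencode


def build_related_query(context_text, rows=10, expand_rows=3):
    """Build Solr params that return at most one hit per section.

    Hypothetical schema: every message document carries a single-valued
    'section_id' field and a 'context' field with the text of its section.
    """
    return {
        "q": context_text,
        "df": "context",  # assumed default search field
        "fq": [
            "type:message",                  # assumed document type filter
            "{!collapse field=section_id}",  # keep only the best hit per section
        ],
        "expand": "true",            # also return the other messages of each section
        "expand.rows": expand_rows,  # how many collapsed messages to return per section
        "rows": rows,
    }


def to_query_string(params):
    # doseq=True emits one 'fq=' entry per filter query
    return urlencode(params, doseq=True)
```

Because the collapse happens on the section identifier, two hits can still come from the same conversation, but never from the same section.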

NOTE: required result format and UI changes to the related conversation Widget are described in #305


westei commented Mar 26, 2019

Conversation Segmentation

This is a feature of the ConversationIndexer that splits conversations (aka channels) into sections, taking the requirements of content-based recommendation into account.

NOTE: the goal is NOT to detect individual conversations within a channel, but to split the messages of a channel into sections that are useful for content-based recommendation (e.g. as described in the above comment).

The algorithm uses the following parameters:

  • minMsg = 2 ... the minimum number of messages per section
  • maxMsg = 10 ... the maximum number of messages per section
  • minContextLength = 100 ... the minimum desired length of the context in chars
  • contextLength = 300 ... the desired length of the context in chars

The algorithm considers merged messages. Those are subsequent messages sent by the same user within mergeTimeout = 30sec. Merged messages count only as a single message toward the configured minMsg and maxMsg limits.
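A minimal sketch of how such merging could look is shown below. The `Message` type and the `merge_messages` helper are hypothetical, and the sketch assumes that the timeout is measured against the directly preceding message and that a gap of exactly 30 seconds still merges; the description above does not settle either point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

MERGE_TIMEOUT = timedelta(seconds=30)  # mergeTimeout from the description


@dataclass
class Message:  # hypothetical message type for this example
    user: str
    time: datetime
    text: str


def merge_messages(messages: Iterable[Message],
                   timeout: timedelta = MERGE_TIMEOUT) -> List[List[Message]]:
    """Group subsequent messages of the same user into merged units.

    Each inner list counts as ONE message toward minMsg / maxMsg.
    Assumes the messages are ordered by time.
    """
    merged: List[List[Message]] = []
    for msg in messages:
        if merged:
            prev = merged[-1][-1]
            # assumption: the timeout is relative to the previous message
            if prev.user == msg.user and msg.time - prev.time <= timeout:
                merged[-1].append(msg)
                continue
        merged.append([msg])
    return merged
```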

Other than those parameters, the decision on how to segment a conversation into sections is based on the duration between messages. The algorithm divides the duration since the previous message by the mean duration over the last messages (the gapRatio). A new section is started:

  • if gapRatio > 10, as soon as minMsg is fulfilled
  • if gapRatio > 5, as soon as minMsg and minContextLength are fulfilled
  • if gapRatio > 3, as soon as minMsg and contextLength are fulfilled
  • regardless of the gapRatio, as soon as maxMsg is reached

The mean duration is reset to defaultGap = 3min if a new section was started because of the gapRatio. If a new section is started because maxMsg was reached, the mean is kept from the previous section.
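The rules above can be sketched as a generator over already merged messages. This is not the ConversationIndexer code. It assumes that the context length is the accumulated text length of the current section, that the mean is a running mean seeded with defaultGap as its first sample, that a gap which triggers a split is not added to the mean, and that timestamps are non-decreasing. The names `Unit` and `segment` are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_MSG = 2
MAX_MSG = 10
MIN_CONTEXT_LENGTH = 100
CONTEXT_LENGTH = 300
DEFAULT_GAP = 180.0  # defaultGap = 3min, in seconds

Unit = Tuple[float, str]  # (timestamp in seconds, text) of one merged message


def segment(units: Iterable[Unit]) -> Iterator[List[Unit]]:
    """Yield non-overlapping sections of a conversation."""
    section: List[Unit] = []
    chars = 0               # accumulated text length of the current section
    mean_gap = DEFAULT_GAP  # running mean, seeded with defaultGap (assumption)
    samples = 1             # number of values in the running mean
    prev_time = 0.0
    for time, text in units:
        if section:
            gap = time - prev_time
            ratio = gap / mean_gap  # the gapRatio
            n = len(section)
            by_gap = n >= MIN_MSG and (
                ratio > 10
                or (ratio > 5 and chars >= MIN_CONTEXT_LENGTH)
                or (ratio > 3 and chars >= CONTEXT_LENGTH)
            )
            by_size = n >= MAX_MSG
            if by_gap or by_size:
                yield section
                section, chars = [], 0
            if by_gap:
                # split caused by the gapRatio: reset the mean
                mean_gap, samples = DEFAULT_GAP, 1
            else:
                # no split, or split caused by maxMsg: keep and update the mean
                mean_gap = (mean_gap * samples + gap) / (samples + 1)
                samples += 1
        section.append((time, text))
        chars += len(text)
        prev_time = time
    if section:
        yield section
```

Since the function only keeps the current section and the running mean, the input can be any iterable, including one that lazily loads messages in batches.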

NOTES:

  • This algorithm can be calculated without loading all messages of a conversation. One can e.g. load 100 messages, perform the segmentation, and use the last segmentation index as the starting point for the segmentation of the next 100 messages.
  • This algorithm could be improved by adding a lookahead - meaning calculating the gapRatio of the lookahead messages and deciding on the best segmentation index based on those values. The current implementation represents a lookahead of 1.

westei added a commit that referenced this issue Mar 26, 2019
* Major update to the ConversationMLT component see [Conversation Segmentation](#303 (comment)) and [ConversationMltQueryBuilder](#303 (comment)) for details
* Updated ConversationIndex and -Indexer to include additional information about the context of messages
* Conversations are now indexed in 3 levels: (1) the conversation, (2) sections (see [Conversation Segmentation](#303 (comment))) and (3) single Messages.
* The ConversationIndexer now uses Streams to process messages in preparation of a DataModel change where Messages are stored separately from conversations
* Updates to all ConversationIndex related QueryBuilders to reflect changes in the Index Structure
westei added a commit that referenced this issue Mar 29, 2019
…plementation in Solr does not work with `2.4`