Paragraph Level Search Results with Solr

By Gergely Lekli

Apache Solr is a great alternative to Drupal’s built-in search engine. With the help of the Apache Solr contributed module, it is not even too difficult to get it up and running on any Drupal site. Even though Solr is a very sophisticated search platform with many features, there is one kind of functionality that it does not provide out of the box: paragraph level search results.

Imagine an article with very extensive content, one that would spread across multiple pages if it were printed in a book (like this blog post). Suppose a site visitor searches for a keyword that is present in that article, and a link to the article appears in the search results. After clicking the link, the visitor is presented with a flood of text and will have a hard time finding the part of the article that matched the keyword. It would be much easier if the browser jumped straight to the matching paragraph when the visitor clicks the article in the search results. Let’s explore one way to implement this in Drupal. The idea for this solution stems from the excellent OSCI Toolkit, an open-source toolkit that we are using to develop a web-based online scholarly catalog for the Los Angeles County Museum of Art.

When Drupal indexes content, it sends each node, along with all its content, individually to Solr. Solr stores each node as a ‘document’ in its index. When a search is initiated against Solr, it picks the documents in its index that match the keyword and returns those documents as the search results. However, the documents that Solr returns do not contain any information about where the matched keyword is located inside the content. Therefore, in order to have Solr return the paragraph that matches the search keyword, we need to submit each paragraph as an individual ‘document’ to Solr. (I am calling the search result units that Solr returns ‘Solr documents’. There might be a term for them that I am not aware of.) Once we do that, the search results will be a list of paragraphs instead of a list of nodes.

 

Submitting individual paragraphs to Solr

 

In Drupal, search indexing is done on a node-by-node basis. In order to submit each paragraph as an individual document to Solr, we need to parse the node content and break it into paragraphs before it is sent to Solr. Naturally, there is a hook we can implement that allows us to do this: hook_apachesolr_index_document_build(). In this hook, a regular expression can be used to identify the paragraphs, assuming the text format is HTML. The implementation looks as follows in a custom module called paragraph_search.

function paragraph_search_apachesolr_index_document_build(ApacheSolrDocument $document, $entity, $entity_type, $env_id) {
  // Only nodes have the body field that is split below.
  if ($entity_type != 'node') {
    return;
  }

  $documents = array();

  // Render just the body field. render() takes its argument by reference,
  // so the render array has to be stored in a variable first.
  $body = field_view_field('node', $entity, 'body');
  $content = render($body);
  if (empty($content)) {
    return;
  }

  // Identify the paragraphs in the content. The filter callback has to use
  // the same pattern so that the paragraph numbers agree.
  preg_match_all('/<p[^>]*>(.*?)<\/p>/', $content, $matches);

  // Send each paragraph individually to Solr.
  foreach ($matches[1] as $key => $paragraph) {
    $paragraph_document = clone $document;

    $paragraph_document->id = 'node-' . $entity->nid . '-p-' . $key;
    $paragraph_document->bundle = 'paragraph';
    $paragraph_document->bundle_name = 'paragraph';
    $paragraph_document->content = trim($paragraph);
    $paragraph_document->teaser = $paragraph_document->content;

    $documents[] = $paragraph_document;
  }

  apachesolr_index_send_to_solr($env_id, $documents);
}

The $document argument essentially represents the data that will be submitted to Solr. In the hook implementation, we duplicate this document for each paragraph of the node’s body. In the duplicated document, we change the id and bundle variables so that the paragraph can be identified when we receive it back as a search result. The id needs to be a unique string across all paragraphs; we build it from the node ID and the paragraph’s ordinal number using the pattern ‘node-{node_id}-p-{paragraph_id}’.

The actual purpose of this hook is to modify a document (that is, a node’s data) before it is indexed, not to create new documents. Therefore we need to call apachesolr_index_send_to_solr() ourselves to send the newly created documents to Solr for indexing.
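The ID pattern and the paragraph splitting can be tried outside Drupal before reindexing a whole site. The following is only a minimal sketch: demo_paragraph_documents() is a hypothetical helper, not part of the Apache Solr module, and it returns plain strings keyed by ID rather than ApacheSolrDocument objects.

```php
<?php
// Hypothetical helper for illustration; the name is not part of any module.
// Returns an array keyed by Solr document ID, one entry per <p> element.
function demo_paragraph_documents($nid, $html) {
  // Same pattern as in the hook implementation.
  preg_match_all('/<p[^>]*>(.*?)<\/p>/', $html, $matches);

  $documents = array();
  foreach ($matches[1] as $key => $paragraph) {
    // Same ID pattern as in the hook: node-{node_id}-p-{paragraph_id}.
    $documents['node-' . $nid . '-p-' . $key] = trim($paragraph);
  }
  return $documents;
}

print_r(demo_paragraph_documents(42, '<p>First paragraph.</p><p> Second one. </p>'));
```

For the sample input, the keys of the returned array are node-42-p-0 and node-42-p-1, which is the form of ID that will later come back from Solr with each result.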

At this point, if we reindex all content on the site and initiate a search, we will receive paragraph level results. It might be confusing, because we will see the same node title appearing multiple times in the results, but the teaser text below the title will indicate that the keyword matched at different locations in the same node. The following screenshot illustrates an example.

 

Assigning IDs to paragraphs when the node is viewed

 

We are now assigning an identifier to each paragraph on its way to the Solr index. However, we have no way of locating a certain paragraph on the node view page. Some sort of identifier needs to be embedded in the rendered node as well, so that the IDs we sent to Solr can be matched to a paragraph in the rendered node.

We need a way of modifying the output of the node rendering process. One solution is to develop a filter that we can apply to the body field of the node. Below is our implementation of hook_filter_info(), which is where custom filters can be defined.

function paragraph_search_filter_info() {
  $filters['paragraph_search_identify'] = array(
    'title' => t('Insert paragraph identifiers into <p> tags. This is needed for the paragraph level search.'),
    'process callback' => 'paragraph_search_identify_paragraphs_filter',
  );
  return $filters;
}

In the callback function of the filter, we use the same regular expression to identify paragraphs as we do in hook_apachesolr_index_document_build(). For each <p> tag that we identify in the content, we add an anchor tag with the identifier of the paragraph. In this example, the identifier consists of the letter ‘p’ (indicating paragraph) and the ordinal number of the paragraph, separated by a dash. Note that the same string is contained in the id variable that we assign in hook_apachesolr_index_document_build().

function paragraph_search_identify_paragraphs_filter($text) {
  $index = 0;

  // Insert an anchor before each <p> tag. A callback is used instead of
  // str_replace() so that identical paragraphs each get exactly one anchor.
  // Anonymous functions require PHP 5.3 or later.
  return preg_replace_callback('/<p[^>]*>(.*?)<\/p>/', function ($match) use (&$index) {
    return '<a name="p-' . $index++ . '"></a>' . $match[0];
  }, $text);
}

After enabling our new filter on the text format that is selected in the node’s body, the HTML source code of the node will contain lines like this:

<a name="p-0"></a><p>Adipiscing brevitas caecus gemino mauris melior obruo…
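Because the indexing hook and the filter number the paragraphs independently, it is worth checking that both sides agree on the numbering. The sketch below is one way to do that outside Drupal. All function names and the constant are hypothetical and exist only for this illustration; the anchoring step uses preg_replace_callback(), which is an assumption of this sketch.

```php
<?php
// Hypothetical helpers for illustration; not part of the Apache Solr module.
const DEMO_PARAGRAPH_PATTERN = '/<p[^>]*>(.*?)<\/p>/';

// Indexing side: the paragraph bodies in document order.
function demo_indexed_paragraphs($html) {
  preg_match_all(DEMO_PARAGRAPH_PATTERN, $html, $matches);
  return array_map('trim', $matches[1]);
}

// Display side: prefix every paragraph with a numbered anchor.
function demo_add_anchors($html) {
  $index = 0;
  return preg_replace_callback(DEMO_PARAGRAPH_PATTERN, function ($match) use (&$index) {
    return '<a name="p-' . $index++ . '"></a>' . $match[0];
  }, $html);
}

// Look up the paragraph body that directly follows a given anchor.
function demo_paragraph_at_anchor($anchored_html, $anchor) {
  $pattern = '/<a name="' . preg_quote($anchor, '/') . '"><\/a><p[^>]*>(.*?)<\/p>/';
  return preg_match($pattern, $anchored_html, $match) ? trim($match[1]) : NULL;
}

$html = '<p>First.</p><h2>Heading</h2><p>Second.</p><p>First.</p>';
$anchored = demo_add_anchors($html);
foreach (demo_indexed_paragraphs($html) as $key => $text) {
  // Each indexed paragraph should be found behind the anchor with its number.
  var_dump(demo_paragraph_at_anchor($anchored, 'p-' . $key) === $text);
}
```

The sample deliberately contains two identical paragraphs, since repeated text is the case where a plain string replacement is most likely to put anchors in the wrong place.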

 

Adding the anchor tag to the link in the search results

 

We now get paragraphs as search results, but the title on the results page is still unchanged. That is, it is a link to the node’s page with no reference to the paragraph. In hook_apachesolr_process_results(), a hook provided by the Apache Solr module, we can append a fragment to the link that points at the paragraph’s anchor in the node’s content.

Below is our implementation.

function paragraph_search_apachesolr_process_results(array &$results, DrupalSolrQueryInterface $query) {
  foreach ($results as $key => $result) {
    if ($result['bundle'] == 'paragraph') {
      // Strip the 'node-{node_id}-' prefix from the document ID to get the
      // anchor name, then append it to the link.
      $paragraph_id = str_replace('node-' . $result['fields']['entity_id'] . '-', '', $result['fields']['id']);
      $results[$key]['link'] .= '#' . $paragraph_id;
    }
  }
}

The data available to us in the hook implementation, that is the $results variable, looks as follows when printed out using dsm().

We check the bundle variable to make sure that the result being processed is indeed a paragraph. If so, we extract the identifier returned by Solr, and from that we can construct the ID of the anchor tag on the rendered node page. Remember, we used the pattern ‘node-{node_id}-p-{paragraph_id}’ when assigning IDs to Solr documents, and the pattern ‘p-{paragraph_id}’ when assigning IDs in the rendered node. In case someone is wondering, the two patterns differ because the node ID is not available in the filter processing function.

So by removing the first ‘node-{node_id}-’ part from the document ID, we get the anchor tag ID. We append this to the link variable, which will be used as the link target when the result is rendered. The search results page should now have links similar to those illustrated in the image below.

When we click on one of the paragraph results and get to the node page, the browser should scroll to the paragraph that matched our keyword.
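The link rewriting described above can also be tried in isolation. This is a minimal sketch with a hypothetical helper, demo_paragraph_link(), which is not part of the Apache Solr module. Unlike the hook implementation, it strips the prefix only when it sits at the very start of the document ID, which is an assumption made here to keep unexpected IDs untouched.

```php
<?php
// Hypothetical helper for illustration only.
function demo_paragraph_link($link, $entity_id, $document_id) {
  $prefix = 'node-' . $entity_id . '-';

  // Leave the link alone when the ID does not follow the expected pattern.
  if (strpos($document_id, $prefix) !== 0) {
    return $link;
  }

  // What remains after the prefix is the anchor name, e.g. 'p-3'.
  return $link . '#' . substr($document_id, strlen($prefix));
}

echo demo_paragraph_link('/node/42', 42, 'node-42-p-3'), "\n";
// prints "/node/42#p-3"
```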

Comments

Use hook_apachesolr_index_documents_alter if you want to drastically change the document structure but use the same data more or less.

The use of hook_apachesolr_index_document_build here ... is not the cleanest way out there.

Well done to use the Apache Solr Search module like this, this is one of the reasons we made it so projects like this could use and modify it to their specific needs. Great!
