Indexing Unmanaged Files for Solr Search

By Ki Kim

Apache Solr Attachments is an add-on Drupal module for Apache Solr Search that enables indexing and searching of file attachments. The Apache Tika Java application is the recommended way to extract text from various file formats. Tika can parse not only plain text files and Microsoft Office documents but also the metadata contained in image, audio, and video files.

File indexing, however, only works for files that are attached to content and managed by Drupal. Consider the situation in which files are uploaded but not managed; that is, Drupal neither knows about nor tracks the files, even though the content links to them. The files could have been uploaded through a WYSIWYG editor without Drupal's knowledge, or the content could have been migrated from another system.

Is there a way to index unmanaged files? My colleague Gergely Lekli described a method of splitting a potentially long body of content into multiple Solr documents, one per paragraph, in his blog post, Paragraph Level Search Results with Solr.

The method presented here takes essentially the same approach, creating multiple Solr documents from a single piece of content (a node, in Drupal lingo), only applied to files. The Apache Solr module provides an API that allows just that, and it is what the Apache Solr Attachments module does for managed files. If a Drupal node has three attached files that Tika can parse, a total of four Solr documents are indexed: one for the node plus three for the files. Solr documents derived from files show up in search results as if they were independent entities.

Create Solr documents for each file

As the first step, we implement hook_apachesolr_index_documents_alter() in a custom module. In this example, the custom module is called unmanagedfilesearch. It depends on the Apache Solr Search module and, because the code below calls its functions, on Apache Solr Attachments.

 

<?php
function unmanagedfilesearch_apachesolr_index_documents_alter(&$documents, $entity, $entity_type, $env_id) {
  // Deal with node entity for now, but it could be extended to other entities in general.
  if ($entity_type == 'node') {
    $solr = apachesolr_get_solr($env_id);
    $site_hash = apachesolr_site_hash();

    // To prepare for (re)indexing, remove previously indexed unmanaged files, if any,
    // from Solr Index that have entity_id greater than a billion.
    // Chose high number to distinguish from managed files.
    // Explained later in unmanagedfilesearch_parse_files().
    $solr->deleteByQuery("id:$site_hash/file/*-$entity->nid AND entity_id:[1000000000 TO *]");

    // Parse linked files in the content and create separate solr documents for each.
    // Pattern borrowed from Apache Solr Search module.
    foreach (unmanagedfilesearch_parse_files($entity) as $file) {
      $document = _apachesolr_index_process_entity_get_document($file, 'file');
      $documents = array_merge($documents, unmanagedfilesearch_solr_document($document, $file, 'file', $env_id));
    }
  }
}
?>

 

Before building Solr documents for (re)indexing, we remove the previously indexed files, but only the unmanaged ones. Unmanaged files do not have a file ID in the Drupal database, but each still needs to be identified by a numeric entity_id in the Solr index. To ensure a unique entity_id for each file, a fake ID is generated from the node ID and the file's position, using a large multiplier, a billion in this example. That is, the second unmanaged file of a node with ID 1234 gets 1234 * 1,000,000,000 + 2 = 1234000000002.

This fake file ID does not live in the Drupal database; it exists only in the Solr index, to give each file a unique number. The scheme works as long as the site's actual file IDs do not reach one billion.
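
The arithmetic behind the fake ID can be checked in isolation. The sketch below is plain PHP with no Drupal dependency; the helper name example_fake_fid() and the constant EXAMPLE_FID_OFFSET are made up for illustration and are not part of any module. It assumes a 64-bit PHP build, since the product exceeds the 32-bit integer range.

```php
<?php
// Hypothetical constant: the multiplier that keeps fake IDs above real file IDs.
define('EXAMPLE_FID_OFFSET', 1000000000);

/**
 * Builds a fake file ID from a node ID and the file's 1-based position.
 */
function example_fake_fid($nid, $position) {
  return $nid * EXAMPLE_FID_OFFSET + $position;
}

// Second unmanaged file of node 1234.
echo example_fake_fid(1234, 2), "\n"; // 1234000000002

// The smallest possible fake ID: first file of node 1.
echo example_fake_fid(1, 1), "\n"; // 1000000001
```

Because the smallest fake ID is one billion plus one, the range query entity_id:[1000000000 TO *] used in the delete step matches every fake ID and no real file ID below a billion.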

The function calls two other custom functions: unmanagedfilesearch_parse_files() and unmanagedfilesearch_solr_document().

Identify linked files in the content

The unmanagedfilesearch_parse_files() function parses the body of the node to search for local files and returns an array of files, if any. It looks for href attributes and collects all candidate values from them. Then it checks each URI to confirm that it points to a local file that actually exists in the file system. Finally, it builds an array of file objects.

 

<?php
function unmanagedfilesearch_parse_files($node) {
  $body = $node->body[LANGUAGE_NONE][0]['value'];

  // Parse href attributes in <a> links.
  preg_match_all('/href=[\'"]([^\>\'"]*)[\'"]/', $body, $matches, PREG_SET_ORDER);

  $files = array();
  foreach ($matches as $match) {
    // Determine if the file is local. An absolute URL could be local.
    // A leading double slash implies the current page's protocol, but just apply http.
    if (substr($match[1], 0, 2) == '//') {
      $url = 'http:' . $match[1];
    }
    elseif (substr($match[1], 0, 1) == '/') {
      $url = $GLOBALS['base_root'] . $match[1];
    }
    else {
      $url = $match[1];
    }

    $parse = parse_url($url);
    // Get absolute URL to the file location.
    $path_files = file_create_url('public://');
    if (isset($parse['host']) && $parse['host'] == $_SERVER['HTTP_HOST']) {
      $uri = 'public://' . str_replace($path_files, '', $url);
      // Convert back things (such as %20 back to a space).
      $uri = urldecode($uri);

      if (file_exists($uri)) {
        // TODO: Check that the file is not managed by Drupal already.

        if (apachesolr_attachments_allowed_mime($mimetype = file_get_mimetype($uri))) {
          $file = new stdClass();
          // Since the file is not managed, it has no file id. Just give large number (to avoid potential mix-up with managed files).
          // The lowest possible number to get is 1 billion plus 1. This assumes actual managed file id will not reach 1 billion, which should be okay for most sites.
          $file->fid       = $node->nid * 1000000000 + count($files) + 1;
          $file->uri       = $uri;
          $file->filename  = drupal_basename($uri);
          $file->filemime  = $mimetype;
          $file->uid       = 1;
          $file->status    = 1;
          // Per the reader comments below, some setups also need a bundle here,
          // e.g. $file->type = 'document';
          $file->filesize  = filesize($uri);
          $file->timestamp = REQUEST_TIME;
          // This property is a custom one that's not part of the file object, but is needed in unmanagedfilesearch_solr_document().
          $file->parent_entity_id = $node->nid;

          $files[] = $file;
        }
      }
    }
  }

  return $files;
}
?>
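
To see the href handling on its own, the following sketch runs the same regular expression and the same three-way URL normalization against a sample string. It is plain PHP without Drupal; the function name example_absolute_hrefs() and the sample base root http://example.com are assumptions for illustration, with the second argument standing in for $GLOBALS['base_root'].

```php
<?php
/**
 * Extracts href values from HTML and turns them into absolute URLs.
 *
 * Hypothetical helper; $base_root stands in for $GLOBALS['base_root'].
 */
function example_absolute_hrefs($html, $base_root) {
  preg_match_all('/href=[\'"]([^\>\'"]*)[\'"]/', $html, $matches, PREG_SET_ORDER);

  $urls = array();
  foreach ($matches as $match) {
    $href = $match[1];
    if (substr($href, 0, 2) == '//') {
      // Protocol-relative URL: assume http, as the function above does.
      $urls[] = 'http:' . $href;
    }
    elseif (substr($href, 0, 1) == '/') {
      // Root-relative path: prepend the site's base root.
      $urls[] = $base_root . $href;
    }
    else {
      // Anything else is passed through unchanged.
      $urls[] = $href;
    }
  }
  return $urls;
}

$html = '<a href="/sites/default/files/a.pdf">A</a> '
  . "<a href='//example.com/sites/default/files/b%20c.doc'>B</a> "
  . '<a href="http://other.example.org/d.pdf">D</a>';

print_r(example_absolute_hrefs($html, 'http://example.com'));
```

The first two links resolve to URLs on example.com, while the third points to another host. In unmanagedfilesearch_parse_files(), only URLs whose host matches the current site go on to the file_exists() check, so a link like the third one would be skipped.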

 

Building file info for Solr document

unmanagedfilesearch_solr_document() builds the file-specific information for a Solr document.

The function is basically the same as apachesolr_attachments_solr_document() in the Apache Solr Attachments module. We could have used apachesolr_attachments_solr_document() directly if it did not check the file's status in the Drupal database; that check does not apply here because, as far as Drupal is concerned, the file does not exist.

Our modified function removes the code that looks up the file's parents in the database and instead establishes the current node as the file's sole parent. That is why we added a custom property, parent_entity_id, to the $file object in unmanagedfilesearch_parse_files() above.

 

<?php
function unmanagedfilesearch_solr_document(ApacheSolrDocument $document, $file, $entity_type, $env_id) {
  module_load_include('inc', 'apachesolr_attachments', 'apachesolr_attachments.index');
  $documents = array();

  $text = apachesolr_attachments_get_attachment_text($file);

  if (empty($text)) {
    return $documents;
  }

  // Custom addition
  $parent = (object) array(
    'parent_entity_type' => 'node',
    'parent_entity_id'   => $file->parent_entity_id,
  );

  // Load the parent entity and reset cache
  $parent_entities = entity_load($parent->parent_entity_type, array($parent->parent_entity_id), NULL, TRUE);
  $parent_entity = reset($parent_entities);

  // Skip invalid entities. There is no enclosing loop in this version,
  // so return early instead of using continue.
  if (empty($parent_entity)) {
    return $documents;
  }

  // Retrieve the parent entity id and bundle
  list($parent_entity_id, $parent_entity_vid, $parent_entity_bundle) = entity_extract_ids($parent->parent_entity_type, $parent_entity);
  $parent_entity_type = $parent->parent_entity_type;

  // Proceed with building this document only if the parent entity is not flagged for
  // indexing attachments with parent entity or not indexing attachments
  if (variable_get('apachesolr_attachments_entity_bundle_indexing_' . $parent_entity_bundle, 'seperate') == 'seperate') {
    // Get a clone of the bare minimum document
    $filedocument = clone $document;

    // Get the callback array to add stuff to the document
    $callbacks = apachesolr_entity_get_callback($parent_entity_type, 'document callback');
    $build_documents = array();
    if (is_array($callbacks)) {
      foreach ($callbacks as $callback) {
        // Call a type-specific callback to add stuff to the document.
        if (is_callable($callback)) {
          $build_documents = array_merge($build_documents, $callback($filedocument, $parent_entity, $parent_entity_type, $env_id));
        }
      }
    }

    // Take the top document from the stack
    $filedocument = reset($build_documents);

    // Build our separate document and overwrite basic information
    $filedocument->id = apachesolr_document_id($file->fid . '-' . $parent_entity_id, $entity_type);
    $filedocument->url = file_create_url($file->uri);
    $path = file_stream_wrapper_get_instance_by_uri($file->uri)->getExternalUrl();
    // A path is not a requirement of an entity
    if (!empty($path)) {
      $filedocument->path = $path;
    }

    // Add extra info to our document
    $filedocument->label = apachesolr_clean_text($file->filename);
    $filedocument->content = apachesolr_clean_text($file->filename) . ' ' . $text;

    $filedocument->ds_created = apachesolr_date_iso($file->timestamp);
    $filedocument->ds_changed = $filedocument->ds_created;

    $filedocument->created = apachesolr_date_iso($file->timestamp);
    $filedocument->changed = $filedocument->created;

    // Add Parent information fields. See http://drupal.org/node/1515822 for explanation
    $parent_entity_info = entity_get_info($parent_entity_type);
    $small_parent_entity = new stdClass();
    $small_parent_entity->entity_type = $parent_entity_type;
    $small_parent_entity->{$parent_entity_info['entity keys']['id']} = $parent_entity_id;

    $small_parent_entity->{$parent_entity_info['entity keys']['bundle']} = $parent_entity_bundle;

    // Not all entities have the entity key label set, so it should be checked first to avoid errors.
    if (isset($parent_entity_info['entity keys']['label'])) {
      $small_parent_entity->{$parent_entity_info['entity keys']['label']} = $parent_entity->{$parent_entity_info['entity keys']['label']};
    }

    // Add all to one field because if it is spread out over
    // multiple fields there is no way of knowing which multifield value
    // belongs to which entity
    // It does not load the complete entity in to the index because that
    // would dramatically increase the index size and processing time
    $filedocument->zm_parent_entity = drupal_json_encode($small_parent_entity);
    $filedocument->sm_parent_entity_bundle = $parent_entity_type . "-" . $parent_entity_bundle;
    $filedocument->sm_parent_entity_type = $parent_entity_type;

    // Add Apachesolr Attachments specific fields.
    $filedocument->ss_filemime = $file->filemime;
    $filedocument->ss_filesize = $file->filesize;

    $documents[] = $filedocument;
  }

  return $documents;
}
?>

 

If you compare it with the original apachesolr_attachments_solr_document(), you will also notice that the foreach loop that iterates over multiple possible parents has been removed.

This way, Solr Search can index unmanaged files that Drupal does not know about and present them in search results, where managed and unmanaged files are virtually indistinguishable and treated equally.

Comments

Works like a charm, you should consider contributing or adding this to Solr Attachments module. Thank you!

Hi Ki, thank you for the module. This module is dependent on the solr attachments module, right?

Also, when i enable this module and start indexing content, I get the following error

"An AJAX HTTP error occurred. HTTP Result Code: 500 Debugging information follows. Path: /batch?id=1285&op=do StatusText: Service unavailable (with message) ResponseText: "

Is there anything that you could think of, that might be causing this issue.

Once I disable the unmanagedfilesearch module, it indexes fine.

Thank you.

Hi Ki, I was able to overcome the error that I mentioned in my comment above, but I get a new error when I index.

EntityMalformedException: Missing bundle property on entity of type file. in entity_extract_ids() (line 7721 of /var/www/reit/includes/common.inc).

Was wondering if you could offer any insight into that.

Thank you.

Hi Sarat,

You just have to add the line :
"$file->type = 'document';" (for example)
after the line :
"$file->status = 1;"
in the function "unmanagedfilesearch_parse_files"

Thank you to Ki for this article, it help me a lot.

Benjamin

Snippet

If you don't want to save strings in clear text, PHP (>= 5.3.0) has functions that can help: openssl_encrypt() and openssl_decrypt().

<?php
$string = "This is a readable string.";
$password = "p@ssword";
$method = "aes-256-cbc";

$encrypted = openssl_encrypt($string, $method, $password);

echo "$string => $encrypted";
// Outputs: This is a readable string. => OeOiTWcgIPC1xIZaDJ3XTEaY/D4m1sQmxgPbzjxxlRA=

$decrypted = openssl_decrypt($encrypted, $method, $password);
echo "$encrypted => $decrypted";
// Outputs: OeOiTWcgIPC1xIZaDJ3XTEaY/D4m1sQmxgPbzjxxlRA= => This is a readable string.
?>
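
Note that CBC mode expects an initialization vector (IV), and PHP emits a warning when openssl_encrypt() is called with an empty one. A minimal sketch of passing a random IV follows. It assumes PHP >= 5.4 (for the OPENSSL_RAW_DATA option); the variable names and the choice to store the IV in front of the ciphertext are illustrative, not something the snippet above prescribes.

```php
<?php
$string = "This is a readable string.";
$password = "p@ssword";
$method = "aes-256-cbc";

// Generate a fresh random IV of the length the cipher requires.
$iv_length = openssl_cipher_iv_length($method);
$iv = openssl_random_pseudo_bytes($iv_length);

// Encrypt to raw bytes, then keep the IV in front of the ciphertext.
$ciphertext = openssl_encrypt($string, $method, $password, OPENSSL_RAW_DATA, $iv);
$encrypted = base64_encode($iv . $ciphertext);

// To decrypt, split the IV from the ciphertext again.
$raw = base64_decode($encrypted);
$decrypted = openssl_decrypt(
  substr($raw, $iv_length),
  $method,
  $password,
  OPENSSL_RAW_DATA,
  substr($raw, 0, $iv_length)
);

echo $decrypted, "\n";
```

Because the IV is random, the encrypted value differs on every run even for the same input, which is the point of using one.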

According to http://stackoverflow.com/questions/1391132/two-way-encryption-in-php, these are the possible values for the encryption method:

aes-128-cbc, aes-128-cfb, aes-128-cfb1, aes-128-cfb8, aes-128-ecb, aes-128-ofb, aes-192-cbc, aes-192-cfb, aes-192-cfb1, aes-192-cfb8, aes-192-ecb, aes-192-ofb, aes-256-cbc, aes-256-cfb, aes-256-cfb1, aes-256-cfb8, aes-256-ecb, aes-256-ofb, bf-cbc, bf-cfb, bf-ecb, bf-ofb, camellia-128-cbc, camellia-128-cfb, camellia-128-cfb1, camellia-128-cfb8, camellia-128-ecb, camellia-128-ofb, camellia-192-cbc, camellia-192-cfb, camellia-192-cfb1, camellia-192-cfb8, camellia-192-ecb, camellia-192-ofb, camellia-256-cbc, camellia-256-cfb, camellia-256-cfb1, camellia-256-cfb8, camellia-256-ecb, camellia-256-ofb, cast5-cbc, cast5-cfb, cast5-ecb, cast5-ofb, des-cbc, des-cfb, des-cfb1, des-cfb8, des-ecb, des-ede, des-ede-cbc, des-ede-cfb, des-ede-ofb, des-ede3, des-ede3-cbc, des-ede3-cfb, des-ede3-cfb1, des-ede3-cfb8, des-ede3-ofb, des-ofb, desx-cbc, rc2-40-cbc, rc2-64-cbc, rc2-cbc, rc2-cfb, rc2-ecb, rc2-ofb, rc4, rc4-40, seed-cbc, seed-cfb, seed-ecb, seed-ofb
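
The available methods depend on the OpenSSL build, so the list can differ between servers. PHP's openssl_get_cipher_methods() (PHP >= 5.3.0) reports what the local installation supports; a short check, with illustrative variable names, might look like this:

```php
<?php
// Ask the local OpenSSL extension which cipher methods it supports.
$methods = openssl_get_cipher_methods();

$wanted = 'aes-256-cbc';
if (in_array($wanted, $methods)) {
  echo "$wanted is available\n";
}
else {
  echo "$wanted is not available; pick another from the list:\n";
  echo implode(', ', $methods), "\n";
}
```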

If you don't know which one to choose, try "aes-256-cbc". AES is the encryption standard adopted by the U.S. government.

For alternatives that can be used in older PHP versions, see http://us2.php.net/manual/en/refs.crypto.php.

Don't use two-way encryption on passwords. They should be hashed with a one-way hash function instead.
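
For passwords, PHP 5.5 and later ship password_hash() and password_verify(), which wrap a salted one-way hash. The sketch below assumes PHP >= 5.5; the variable names are illustrative.

```php
<?php
$password = "p@ssword";

// Store only the hash; it embeds the algorithm, cost, and a random salt.
$hash = password_hash($password, PASSWORD_DEFAULT);

// Later, check a login attempt against the stored hash.
var_dump(password_verify("p@ssword", $hash)); // bool(true)
var_dump(password_verify("wrong", $hash));    // bool(false)
```

The original password cannot be recovered from $hash, which is what makes this approach suitable for passwords where two-way encryption is not.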