January, 2014

Indexing Unmanaged Files for Solr Search

Ki H. Kim
Director, Engineering

Apache Solr Attachments, an add-on Drupal module for Apache Solr Search, enables indexing and searching of file attachments. The Tika Java application is the recommended choice for extracting text from various file formats. Tika can parse not only plain text files and Microsoft Office documents but also the metadata contained in image, audio, and video formats. File indexing, however, works only for files that are attached to content and managed by Drupal. Consider the situation in which files are uploaded but not managed; that is, Drupal neither knows about nor tracks the files, even though the content links to them. The files could have been uploaded through a WYSIWYG editor without Drupal's knowledge, or the content could have been migrated from another system. Is there a way to index unmanaged files? My colleague Gergely Lekli shared a method of splitting a potentially long body of content into multiple Solr documents, one per paragraph, in his blog post, Paragraph Level Search Results with Solr.

The method presented here takes essentially the same approach of creating multiple Solr documents from a single piece of content (a node, in Drupal lingo), only applied to files. The Apache Solr Search module provides an API that allows just that, and it is what the Apache Solr Attachments module uses for managed files. If a Drupal node has three attached files that Tika can parse, a total of four Solr documents is indexed: one for the node plus three for the files. Solr documents derived from files show up in search results as if they were independent entities.

Create Solr Documents for Each File

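As the first step, we implement hook_apachesolr_index_documents_alter() in a custom module. In this example, the custom module is called unmanagedfilesearch, and it depends on the Apache Solr Search module. Because the code below also calls functions from Apache Solr Attachments, that module must be enabled as well.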

<?php 
function unmanagedfilesearch_apachesolr_index_documents_alter(&$documents, $entity, $entity_type, $env_id) { 
  // Deal with node entity for now, but it could be extended to other entities in general. 
  if ($entity_type == 'node') { 
    $solr = apachesolr_get_solr($env_id); 
    $site_hash = apachesolr_site_hash(); 

    // To prepare for (re)indexing, remove previously indexed unmanaged files, if any, 
    // from Solr Index that have entity_id greater than a billion. 
    // Chose high number to distinguish from managed files. 
    // Explained later in unmanagedfilesearch_parse_files(). 
    $solr->deleteByQuery("id:$site_hash/file/*-$entity->nid AND entity_id:[1000000000 TO *]"); 

    // Parse linked files in the content and create separate solr documents for each. 
    // Pattern borrowed from Apache Solr Search module. 
    foreach (unmanagedfilesearch_parse_files($entity) as $file) { 
      $document = _apachesolr_index_process_entity_get_document($file, 'file'); 
      $documents = array_merge($documents, unmanagedfilesearch_solr_document($document, $file, 'file', $env_id)); 
    } 
  } 
} 
?>

Before building Solr documents for (re)indexing, we remove the previously indexed files, but only the unmanaged ones. Unmanaged files do not have a file ID in the Drupal database, but each still needs to be identified by a numeric entity_id in the Solr index. To ensure a unique entity_id, a fake file ID is generated from the node ID and the file's position within the node, using a large multiplier, one billion in this example. That is, the second unmanaged file of a node with ID 1234 gets 1234 * 1,000,000,000 + 2 = 1234000000002. This fake file ID does not live in the Drupal database; it exists only in the Solr index to give each file a unique number. The scheme works as long as the site's actual file IDs do not reach one billion. The hook implementation calls two other custom functions: unmanagedfilesearch_parse_files() and unmanagedfilesearch_solr_document().
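
The ID arithmetic can be tried on its own. The following is a minimal sketch, not part of the module: the helper names example_fake_fid() and example_document_id() and the site hash value 'abc123' are made up for illustration, and the document ID layout (site hash, then "file", then "fid-nid") is an assumption inferred from the delete query above. It also assumes a 64-bit PHP build, since these values exceed the 32-bit integer range.

```php
<?php
// Multiplier that keeps fake file IDs well above any real managed file ID.
define('EXAMPLE_FID_OFFSET', 1000000000);

// Hypothetical helper: fake file ID for the Nth (1-based) unmanaged file
// found in a node.
function example_fake_fid($nid, $position) {
  return $nid * EXAMPLE_FID_OFFSET + $position;
}

// Hypothetical helper: Solr document ID in the assumed "hash/file/fid-nid"
// layout, which the delete query matches with "file/*-nid".
function example_document_id($site_hash, $fid, $nid) {
  return $site_hash . '/file/' . $fid . '-' . $nid;
}

// Second unmanaged file of node 1234.
$fid = example_fake_fid(1234, 2);
echo $fid, "\n";
echo example_document_id('abc123', $fid, 1234), "\n";
```

On a 64-bit build this prints 1234000000002 followed by abc123/file/1234000000002-1234. Because the node ID is the trailing part of the document ID, the wildcard in the delete query can match every file document that belongs to one node, while the entity_id range restricts the match to fake IDs.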

 

Identify Linked Files in the Content

The unmanagedfilesearch_parse_files() function parses the body of the node, searching for local files, and returns an array of file objects, if any. It looks for href attributes and collects all candidate values from them. Then it checks each URL to confirm that it points to a local file that actually exists in the file system. Finally, it builds an array of file objects.

<?php 
function unmanagedfilesearch_parse_files($node) { 
  $body = $node->body[LANGUAGE_NONE][0]['value']; 

  // Parse href attributes in <a> links. 
  preg_match_all('/href=[\'"]([^\>\'"]*)[\'"]/', $body, $matches, PREG_SET_ORDER); 

  $files = array(); 
  foreach ($matches as $match) { 
    // Determine if the file is local. An absolute URL could still be local.
    // A leading double slash is a protocol-relative URL; just assume http.
    if (substr($match[1], 0, 2) == '//') { 
      $url = 'http:' . $match[1]; 
    } 
    elseif (substr($match[1], 0, 1) == '/') { 
      $url = $GLOBALS['base_root'] . $match[1]; 
    }
    else { 
      $url = $match[1]; 
    } 

    $parse = parse_url($url); 
    // Get absolute URL to the file location. 
    $path_files = file_create_url('public://'); 
    if (isset($parse['host']) and $parse['host'] == $_SERVER['HTTP_HOST']) { 
      $uri = 'public://' . str_replace($path_files, '', $url); 
      // Convert back things (such as %20 back to a space). 
      $uri = urldecode($uri); 

      if (file_exists($uri)) { 
        // TODO: Check that the file is not managed by Drupal already. 

        if (apachesolr_attachments_allowed_mime($mimetype = file_get_mimetype($uri))) {
          $file = new stdClass(); 
          // Since the file is not managed, it has no file id. Just give large number (to avoid 
          // potential mix-up with managed files). The lowest possible number to get is 1 billion 
          // plus 1. This assumes actual managed file id will not reach 1 billion, which should 
          // be okay for most sites.
          $file->fid = $node->nid * 1000000000 + count($files) + 1; 
          $file->uri = $uri; 
          $file->filename = drupal_basename($uri); 
          $file->filemime = $mimetype; 
          $file->uid = 1; 
          $file->status = 1; 
          $file->filesize = filesize($uri); 
          $file->timestamp = REQUEST_TIME; 
          // This variable is custom one that's not part of file object, 
          // but needed in unmanagedfilesearch_solr_document(). 
          $file->parent_entity_id = $node->nid; 

          $files[] = $file; 
        } 
      } 
    } 
  } 

  return $files; 
} 
?> 
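
To see what the href pattern and the URL handling do in isolation, here is a sketch that runs without Drupal. The two helper names are hypothetical, and the values 'http://example.com', 'http://example.com/sites/default/files/' and 'example.com' are assumed stand-ins for $GLOBALS['base_root'], file_create_url('public://') and $_SERVER['HTTP_HOST']. The file_exists() and MIME type checks are left out.

```php
<?php
// Hypothetical helper: collect href values and make them absolute,
// mirroring the three branches in unmanagedfilesearch_parse_files().
function example_absolute_hrefs($body, $base_root) {
  preg_match_all('/href=[\'"]([^\>\'"]*)[\'"]/', $body, $matches, PREG_SET_ORDER);
  $urls = array();
  foreach ($matches as $match) {
    $href = $match[1];
    if (substr($href, 0, 2) == '//') {
      // Protocol-relative URL: assume http.
      $urls[] = 'http:' . $href;
    }
    elseif (substr($href, 0, 1) == '/') {
      // Root-relative path: prepend the site's base URL.
      $urls[] = $base_root . $href;
    }
    else {
      $urls[] = $href;
    }
  }
  return $urls;
}

// Hypothetical helper: map an absolute URL to a public:// URI, or return
// NULL when the URL points to another host.
function example_public_uri($url, $files_url, $host) {
  $parse = parse_url($url);
  if (!isset($parse['host']) || $parse['host'] != $host) {
    return NULL;
  }
  // Strip the public files base URL and decode things like %20.
  return 'public://' . urldecode(str_replace($files_url, '', $url));
}

$body = '<p><a href="/sites/default/files/report%202013.pdf">Report</a> '
  . "<a href='//example.com/sites/default/files/a.doc'>A</a> "
  . '<a href="http://other.example.org/b.pdf">B</a></p>';

$uris = array();
foreach (example_absolute_hrefs($body, 'http://example.com') as $url) {
  $uris[] = example_public_uri($url, 'http://example.com/sites/default/files/', 'example.com');
}
var_export($uris);
```

The first two links resolve to 'public://report 2013.pdf' and 'public://a.doc'; the third is on another host and yields NULL, so it would be skipped.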

Building File Information for Solr Document

unmanagedfilesearch_solr_document() builds the file-specific information for a Solr document. The function is basically the same as apachesolr_attachments_solr_document() in the Apache Solr Attachments module. We could have used apachesolr_attachments_solr_document() directly if it did not look up the file's status in the Drupal database, which is not applicable here because the file does not exist as far as Drupal is concerned. Our modified function removes the code that searches the database for the file's parents and instead establishes the current node as the file's sole parent. That is why we added a custom property, parent_entity_id, to the $file object in unmanagedfilesearch_parse_files() above.

<?php 
function unmanagedfilesearch_solr_document(ApacheSolrDocument $document, $file, $entity_type, $env_id) {
  module_load_include('inc', 'apachesolr_attachments', 'apachesolr_attachments.index');
  $documents = array();

  $text = apachesolr_attachments_get_attachment_text($file);

  if (empty($text)) {
    return $documents;
  } 

  // Custom addition 
  $parent = (object)array( 
    'parent_entity_type' => 'node', 
    'parent_entity_id' => $file->parent_entity_id, 
  );
 
  // load the parent entity and reset cache. 
  $parent_entities = entity_load($parent->parent_entity_type, array($parent->parent_entity_id), NULL, TRUE);
  $parent_entity = reset($parent_entities);
 
  // Skip invalid entities. There is no enclosing loop here, so return early.
  if (empty($parent_entity)) {
    return $documents;
  }

  // Retrieve the parent entity id and bundle. 
  list($parent_entity_id, $parent_entity_vid, $parent_entity_bundle) = entity_extract_ids($parent->parent_entity_type, $parent_entity);
  $parent_entity_type = $parent->parent_entity_type;
 
  // Proceed with building this document only if the parent bundle is set to
  // index attachments as separate documents (that is, not together with the
  // parent entity, and not disabled).
  if (variable_get('apachesolr_attachments_entity_bundle_indexing_' . $parent_entity_bundle, 'seperate') == 'seperate') {
    // Get a clone of the bare minimum document 
    $filedocument = clone $document;
 
    //Get the callback array to add stuff to the document 
    $callbacks = apachesolr_entity_get_callback($parent_entity_type, 'document callback');
    $build_documents = array();
    if (is_array($callbacks)) {
      foreach ($callbacks as $callback) {
        // Call a type-specific callback to add stuff to the document. 
        if (is_callable($callback)) {
          $build_documents = array_merge($build_documents, $callback($filedocument, $parent_entity, $parent_entity_type, $env_id));
        } 
      } 
    } 

    // Take the top document from the stack 
    $filedocument = reset($build_documents);
   
    // Build our separate document and overwrite basic information 
    $filedocument->id = apachesolr_document_id($file->fid . '-' . $parent_entity_id, $entity_type);
    $filedocument->url = file_create_url($file->uri);
    $path = file_stream_wrapper_get_instance_by_uri($file->uri)->getExternalUrl();
    // A path is not a requirement of an entity. 
    if (!empty($path)) {
      $filedocument->path = $path;
    } 

    // Add extra info to our document 
    $filedocument->label = apachesolr_clean_text($file->filename);
    $filedocument->content = apachesolr_clean_text($file->filename) . ' ' . $text;

    $filedocument->ds_created = apachesolr_date_iso($file->timestamp);
    $filedocument->ds_changed = $filedocument->ds_created;

    $filedocument->created = apachesolr_date_iso($file->timestamp);
    $filedocument->changed = $filedocument->created;
   
    // Add Parent information fields. See http://drupal.org/node/1515822 for explanation. 
    $parent_entity_info = entity_get_info($parent_entity_type);
    $small_parent_entity = new stdClass();
    $small_parent_entity->entity_type = $parent_entity_type;
    $small_parent_entity->{$parent_entity_info['entity keys']['id']} = $parent_entity_id;

    $small_parent_entity->{$parent_entity_info['entity keys']['bundle']} = $parent_entity_bundle;
   
    // Not all entities have the label entity key set, so check first to avoid errors.
    if (isset($parent_entity_info['entity keys']['label'])) {
      $small_parent_entity->{$parent_entity_info['entity keys']['label']} = $parent_entity->{$parent_entity_info['entity keys']['label']};
    } 

    // Add all to one field because if it is spread out over 
    // multiple fields there is no way of knowing which multifield value 
    // belongs to which entity 
    // It does not load the complete entity into the index because that
    // would dramatically increase the index size and processing time.
    $filedocument->zm_parent_entity = drupal_json_encode($small_parent_entity);
    $filedocument->sm_parent_entity_bundle = $parent_entity_type . "-" . $parent_entity_bundle;
    $filedocument->sm_parent_entity_type = $parent_entity_type;
   
    // Add Apachesolr Attachments specific fields. 
    $filedocument->ss_filemime = $file->filemime;
    $filedocument->ss_filesize = $file->filesize;

    $documents[] = $filedocument;
  }

  return $documents;
} 
?>
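
To make the single parent-entity field concrete, the sketch below builds the same kind of small parent object for a node and encodes it. It is illustrative only: the entity keys array is a hand-written stand-in for what entity_get_info('node') would supply, the node values are invented, and plain json_encode() is used in place of drupal_json_encode() so that it runs without Drupal.

```php
<?php
// Stand-in for the 'entity keys' part of entity_get_info('node')
// (assumed values).
$entity_keys = array('id' => 'nid', 'bundle' => 'type', 'label' => 'title');

// Invented parent node data for illustration.
$parent = (object) array(
  'nid' => 1234,
  'type' => 'page',
  'title' => 'Annual report',
  'body' => 'Long text that is deliberately not copied.',
);

// Copy only the identifying properties, as the function above does.
$small_parent_entity = new stdClass();
$small_parent_entity->entity_type = 'node';
$small_parent_entity->{$entity_keys['id']} = $parent->nid;
$small_parent_entity->{$entity_keys['bundle']} = $parent->type;
if (isset($entity_keys['label'])) {
  $small_parent_entity->{$entity_keys['label']} = $parent->{$entity_keys['label']};
}

$zm_parent_entity = json_encode($small_parent_entity);
echo $zm_parent_entity, "\n";
```

This prints {"entity_type":"node","nid":1234,"type":"page","title":"Annual report"}. Keeping the type, ID, bundle, and label together in one encoded value is what lets a consumer of the index tell which values belong to the same parent.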

If you compare this with the original apachesolr_attachments_solr_document(), you will also notice that the foreach loop that iterated over multiple possible parents has been removed. This way, Solr Search can index unmanaged files that Drupal does not know about and present them in search results, where managed and unmanaged files are virtually indistinguishable and treated equally.