Search Considerations for Sharing Content

February 16, 2010 by Alistair Deneys

Late last year I wrote a post exploring some techniques for sharing content inside Sitecore. One thing I didn’t really take into account was searching and indexing of this shared content. Paul recently left a comment on the post bringing this to my attention. Thanks Paul.

So I’d like to address these issues now. The way in which your shared content is indexed will depend on the technique used as well as the search technology you’re using.

If you’re using an external web crawler type indexer such as Google, dtSearch or Funnelback, then each shared article should get indexed in each of the locations it’s shared into. This is because each of the parent pages will link to the shared article as if the article sat below that page, which was the whole point of the techniques I described. Realistically, you have little control over how these articles get indexed unless the search technology you are using gives you the control you need.

If you’re using the built-in Lucene.net indexer that ships with Sitecore, then things work quite differently. This is because Sitecore controls the indexer and works at the data level, not just at the “page” level as the external engines do.

The whole idea of sharing content is to minimise duplication. We only want a single content item which we make appear in multiple locations. Although to the outside world each instance of a shared article appears as a separate page, we only have a single item. So it should come as no surprise that the search indexer will only index the article once, in the location it’s shared from. This can pose a problem, as we probably want our search to treat these items the same way an external search engine does.

There is an exception to this, however: the proxy item approach. Proxy items behave just like real items, so each appears as a separate item to the indexer. Keep in mind the indexer will also index the source article as well as the proxy item definitions (depending on how your index is set up), so you may need to filter the search results before presenting them to the user.

If you’re using any of the other techniques I described (view item, wildcard item or custom ItemHandler), then we’ll have to tweak the Sitecore indexer to treat shared articles differently from other articles and index each instance of the shared article rather than the article itself.

The examples of the sharing techniques I wrote about previously all assumed you wanted to share an entire shared folder of articles into your site. But if we wanted control over exactly which articles to share in, we could simply add a “shared articles” field to the item being shared into and use that data in our presentation to list only the articles in that field. This raises an important consideration for how we index these items.

In the original “share all” examples, as far as Sitecore is concerned, there is no link between the shared article and the location we want to share the article into. This is because we handle the sharing through the presentation logic. If we have defined a list of articles to share into our sharing location, then we do have a link at the data level, which Sitecore detects and gives access to through the Link Database.

Firstly let’s look at the “share all” examples where we have no link. Rather than tweaking the indexer to create multiple Lucene documents for the single shared article, we need to detect when the sharing location is being indexed and add the documents there.


using Lucene.Net.Documents;
using Lucene.Net.Index;
using Sitecore.Data.Items;

namespace SitecoreSearch
{
  public class SharedArticleIndex :
    Sitecore.Data.Indexing.Index
  {
    private string m_sharedFolder = string.Empty;
    private string m_sharingFolder = string.Empty;

    public string SharedFolder
    {
      get { return m_sharedFolder; }
      set { m_sharedFolder = value.ToLower(); }
    }

    public string SharingFolder
    {
      get { return m_sharingFolder; }
      set { m_sharingFolder = value.ToLower(); }
    }

    public SharedArticleIndex(string name) : base(name) { }

    protected override void UpdateVersion(Item version,
      IndexWriter writer)
    {
      base.UpdateVersion(version, writer);
      if (version.Paths.FullPath.ToLower() == m_sharingFolder)
      {
        // Sharing folder being indexed.
        // Add each shared document to the index.
        var sharedRoot =
          version.Database.GetItem(m_sharedFolder);
        // Nothing to add if the shared folder doesn't exist
        // in this database.
        if (sharedRoot == null)
          return;
        var sharedArticles = sharedRoot.GetChildren();
        for (int i = 0; i < sharedArticles.Count; i++)
        {
          var doc = new Document();
          // Custom ID: <sharing folder ID>_<shared article ID>
          var id = GetDocID(version, true) + "_" +
            GetDocID(sharedArticles[i], true);
          doc.Add(new Lucene.Net.Documents.Field(
            Sitecore.Data.Indexing.Index.DocIDFieldName, id,
            Lucene.Net.Documents.Field.Store.COMPRESS,
            Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
          AddFields(sharedArticles[i], doc);
          writer.AddDocument(doc);
        }
      }
    }
  }
}

The SharedFolder parameter specifies the folder in the content tree which holds the shared articles (where they are shared from), and the SharingFolder parameter is the folder the articles are shared into. Note the custom ID format used for the Lucene document ID. It contains both the ID of the folder the article is being shared into and the shared article ID, which allows the search results page to construct a proper “sharing” URL for the article.
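
The composite document ID could also be built and taken apart in one place, rather than by string concatenation in the indexer and string splitting in the results page. Here is a minimal sketch, assuming both parts are plain strings which never contain an underscore. SharedDocId and its methods are hypothetical names for illustration, not part of Sitecore.

```csharp
using System;

// Hypothetical helper (not a Sitecore class): composes and splits
// "<sharing folder ID>_<shared article ID>" document IDs.
public static class SharedDocId
{
    private const char Separator = '_';

    // Joins the two IDs. Assumes neither part contains the separator.
    public static string Compose(string sharingFolderId, string sharedArticleId)
    {
        if (string.IsNullOrEmpty(sharingFolderId) || string.IsNullOrEmpty(sharedArticleId))
            throw new ArgumentException("Both IDs are required.");
        if (sharingFolderId.IndexOf(Separator) >= 0 || sharedArticleId.IndexOf(Separator) >= 0)
            throw new ArgumentException("IDs must not contain the separator.");
        return sharingFolderId + Separator + sharedArticleId;
    }

    // Returns true and the two parts for a composite ID,
    // or false for a normal (single item) document ID.
    public static bool TryParse(string docId, out string sharingFolderId,
        out string sharedArticleId)
    {
        sharingFolderId = null;
        sharedArticleId = null;
        if (string.IsNullOrEmpty(docId))
            return false;
        var parts = docId.Split(Separator);
        if (parts.Length != 2 || parts[0].Length == 0 || parts[1].Length == 0)
            return false;
        sharingFolderId = parts[0];
        sharedArticleId = parts[1];
        return true;
    }
}
```

With something like this, the indexer would call Compose with the two GetDocID values, and the search results page would call TryParse to decide whether a hit is a shared article or a normal item.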

To have Sitecore use our custom indexer when building the index, update the /configuration/sitecore/indexes/index node which corresponds to the database you’re using (my examples use the web database and web index). Here’s a sample config patch file.


<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <indexes>
      <index id="web" singleInstance="true"
        type="SitecoreSearch.SharedArticleIndex, SitecoreSearch">
        <SharedFolder>/sitecore/content/share</SharedFolder>
        <SharingFolder>/sitecore/content/home</SharingFolder>
        <param desc="name">$(id)</param>
        <fields hint="raw:AddField">
          <field target="created">__created</field>
          <field target="updated">__updated</field>
          <field target="author">__updated by</field>
          <field target="published">__published</field>
          <field target="name">@name</field>
          <field storage="unstored">@name</field>
          <field target="template" storage="keyword">@tid</field>
          <field target="id" storage="unstored">@id</field>
          <type storage="unstored">memo</type>
          <type storage="unstored">text</type>
          <type storage="unstored">Single-Line Text</type>
          <type storage="unstored" stripTags="true">html</type>
          <type storage="unstored" stripTags="true">rich text</type>
          <type storage="unstored" stripTags="true">word document</type>
        </fields>
      </index>
    </indexes>
    <databases>
      <database id="web">
        <indexes hint="list:AddIndex">
          <index path="indexes/index[@id='web']" />
        </indexes>
        <Engines.HistoryEngine.Storage>
          <obj type="Sitecore.Data.$(database).$(database)HistoryStorage,
            Sitecore.Kernel">
            <param connectionStringName="$(id)" />
            <EntryLifeTime>30.00:00:00</EntryLifeTime>
          </obj>
        </Engines.HistoryEngine.Storage>
        <Engines.HistoryEngine.SaveDotNetCallStack>
          false
        </Engines.HistoryEngine.SaveDotNetCallStack>
      </database>
    </databases>
  </sitecore>
</configuration>

In addition to the custom indexer, we’ll need to tweak the search results page to deal with the custom Lucene document ID we’ve had to use above. We don’t have a real item with a real ID, so we need some way to link the shared article to the sharing folder.


using System;
using System.Collections.Generic;
using Sitecore.Data;
using Sitecore.Data.Items;

namespace SitecoreSearch.layouts
{
  public partial class Search : System.Web.UI.UserControl
  {
    protected class SearchResult
    {
      public string Url { get; set; }
      public string Title { get; set; }
    }

    protected void PerformSearch(object sender, EventArgs args)
    {
      var hits = Sitecore.Context.Database.Indexes["web"].
        Search(query.Text, Sitecore.Context.Database);
      var length = hits.Length();
      var results = new List<SearchResult>(length);

      for (int i = 0; i < length; i++)
      {
        // Check if item is shared or normal
        var doc = hits.Doc(i);
        var id = doc.Get(Sitecore.Data.Indexing.Index.DocIDFieldName);
        if (id.Contains("_"))
        {
          var ids = id.Split(new string[] { "_" },
            StringSplitOptions.RemoveEmptyEntries);
          var sharingFolder = Sitecore.Context.Database.GetItem(
            ItemPointer.Parse(ids[0]).ItemID);
          var sharedArticle = Sitecore.Context.Database.GetItem(
            ItemPointer.Parse(ids[1]).ItemID);

          results.Add(new SearchResult()
          {
            Url = GenerateUrl(sharingFolder, sharedArticle),
            Title = sharedArticle.Name
          });
        }
        else
        {
          var item = Sitecore.Data.Indexing.Index.GetItem(doc,
            Sitecore.Context.Database);
          results.Add(new SearchResult() { Url =
            item.Paths.GetFriendlyUrl(), Title = item.Name });
        }
      }
      resultList.DataSource = results;
      resultList.DataBind();

      count.Text = hits.Length().ToString();
    }

    private string GenerateUrl(Item sharingFolder, Item sharedArticle)
    {
      return sharingFolder.Paths.GetFriendlyUrl() + "?id=" +
        sharedArticle.ID.ToString();
    }
  }
}

This approach will work for all the other sharing techniques. Of course the GenerateUrl method would have to be updated for each of the techniques. The example above works for the “view item” technique.

If we had a field on the sharing folder to control which items were shared into it, as mentioned above, then we could use the Link Database in the indexer rather than just pulling in all the shared articles. This is basically flipping the indexing around, so instead of indexing from the sharing folder we index from the shared articles. You could also leave the indexer to index from the sharing folder by using the above SharedArticleIndex indexer and, when a sharing folder is indexed, reading the “shared articles” field and creating Lucene documents that way.
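
For the second option (indexing from the sharing folder and reading the field), the indexer first needs the list of article IDs held in the field. Here is a minimal sketch, assuming the “shared articles” field is a multilist-style field whose raw value is a pipe-separated list of item IDs. SharedArticlesField is a hypothetical helper name, not a Sitecore class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits the raw value of a "shared articles"
// field into the individual article IDs it references.
public static class SharedArticlesField
{
    public static List<string> ParseIds(string rawValue)
    {
        var ids = new List<string>();
        if (string.IsNullOrEmpty(rawValue))
            return ids;

        // Assumption: IDs are separated by '|' in the raw field value.
        var parts = rawValue.Split(new[] { '|' },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (var part in parts)
        {
            var id = part.Trim();
            // Skip blanks and duplicates, keeping the original order.
            if (id.Length > 0 && !ids.Contains(id))
                ids.Add(id);
        }
        return ids;
    }
}
```

Inside UpdateVersion, the loop over the shared folder’s children would then become a loop over these IDs, loading each article from version.Database and skipping any that no longer exist.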

This next indexer uses the flipped indexing approach: index from the shared article using the Link Database. This approach is a little more robust as you don’t have to specify every single sharing folder in configuration.


using Lucene.Net.Documents;
using Lucene.Net.Index;
using Sitecore.Data.Items;

namespace SitecoreSearch
{
  public class SelectSharedArticleIndex :
    Sitecore.Data.Indexing.Index
  {
    private string m_sharedFolder = string.Empty;

    public string SharedFolder
    {
      get { return m_sharedFolder; }
      set { m_sharedFolder = value.ToLower(); }
    }

    public SelectSharedArticleIndex(string name) : base(name) { }

    protected override void UpdateVersion(Item version,
      IndexWriter writer)
    {
      if(version.Paths.FullPath.ToLower().StartsWith(m_sharedFolder))
      {
        // Shared article. Only add the links
        var linkDB = Sitecore.Configuration.Factory.GetLinkDatabase();
        var links = linkDB.GetReferrers(version);
        for (int i = 0; i < links.Length; i++)
        {
          // The referrer is the folder the article is shared into.
          // Skip referrers that can't be resolved to an item.
          var sharingFolder = links[i].GetSourceItem();
          if (sharingFolder != null &&
            sharingFolder.Paths.FullPath.StartsWith(
            "/sitecore/content"))
          {
            var doc = new Document();
            var id = GetDocID(sharingFolder, true) + "_" +
              GetDocID(version, true);
            doc.Add(new Lucene.Net.Documents.Field(
              Sitecore.Data.Indexing.Index.DocIDFieldName, id,
              Lucene.Net.Documents.Field.Store.COMPRESS,
              Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
            AddFields(version, doc);
            writer.AddDocument(doc);
          }
        }
      }
      else
        base.UpdateVersion(version, writer);
    }
  }
}

We’ll still need to use the tweaked search results code which handles the custom document ID appropriately as well.
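
This indexer would be registered in the same way as the first one. As a sketch, assuming the same assembly name and shared folder path as in the earlier patch file, only the index node changes: the type points at the new class and the SharingFolder element is no longer needed.

```xml
<index id="web" singleInstance="true"
  type="SitecoreSearch.SelectSharedArticleIndex, SitecoreSearch">
  <SharedFolder>/sitecore/content/share</SharedFolder>
  <param desc="name">$(id)</param>
  <!-- <fields hint="raw:AddField"> element as in the earlier patch file -->
</index>
```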

So I hope that has helped shed some light on potential issues when choosing a sharing strategy for your solution. And just in case you thought we were done…

Another thing to consider is how to link to a shared article from another non-shared article. What if I wanted to link to a shared article from inside a rich text field?

The first and simplest approach is to treat the link as external and just enter the full URL to the article in the location you want it shared into. The problem here is you lose the benefits of link integrity. If the shared article is moved or renamed, your link will break. Another idea is to link directly to the shared article as a normal internal link. You would then have to rewrite the URL to an appropriate shared URL when the item is published, probably using the replacers feature. I’ll leave you to investigate that one on your own.

2 thoughts on “Search Considerations for Sharing Content”

  1. Paul says:

    Clever solution as usual!

    I can’t help but think, though, that this is a lot of tweaking/customization when compared with the Proxies approach.

    The problem with Proxies though is they are somewhat rigid in that they only allow an item or an item hierarchy to be proxied. My team and I have recently decided to investigate the idea of a custom 3rd option: “FilteredProxy”. The idea being that you could specify one or many filters to be applied to a Proxy configuration.

    One variable in all this is the performance of Proxies in general. The documentation is somewhat vague on this (http://sdn.sitecore.net/SDN5/Forum/ShowPost.aspx?PostID=24526).

    Thoughts?

    • Alistair Deneys says:

      Good point Paul. As with most things in Sitecore, you could extend the existing ProxyDataProvider with your own FilteredProxyDataProvider class (configured in web.config) to filter out items which don’t match your given filter. I’d like to collect some hard data on proxy item performance versus normal item performance but I do know you can set the PublishVirtualItems setting on the source DB to true to have Sitecore create real items in the target database, so you wouldn’t have any issue with performance then.

The views expressed on this blog are solely my own and do not necessarily reflect the views of my employer.