Tuesday, September 18, 2012

Finding and Targeting Content in SharePoint 2010


By EPC Group's Sr. SharePoint Architect Timothy Calunod   

In my previous post, we began a configuration process to reproduce a simple targeted Search solution in SharePoint 2010 that was once configured in SharePoint 2007. The differences in configuration were far more numerous than I expected, and thus required a little more work than the original setup. However, the process can still bring us to the solution despite the changes that made the recreation more complex.

We began by setting up the simple content architecture with a Web Application and a four-Site Site Collection, and proceeded to create the Search Service Application (SSA) needed to meet our Search Service requirement. At this point, the Content Sources, the Scope for the target, and the custom Web Pages in the Search Center still need to be configured, all of which we will continue with here in this post.

Content Sources Overview

For any content to be discovered through a Search query, a Content Source must be created and configured. A Content Source defines which network endpoint SharePoint will identify as a location of stored content and uses Hosts, also known as Start Addresses, to define which points in that network source SharePoint will crawl. Additionally, file types must be defined so that SharePoint can identify, by a simple file extension, what both the user and SharePoint consider to be content. Furthermore, SharePoint requires a special interpreter called an Index Filter (or iFilter) to open and read the content in a designated content node so it can create the entries in the Index Partition and other databases. The Content Source is thus heavily used by the Crawl Component to find and index content for search usage.
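
As a quick way to see how these pieces fit together, the existing Content Sources and recognized file extensions can be inspected from the SharePoint 2010 Management Shell. This is only a minimal sketch; it assumes the snap-in is available on the server and that a single SSA exists in the farm.

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    # Grab the farm's Search Service Application (assumes only one exists).
    $ssa = Get-SPEnterpriseSearchServiceApplication

    # List every Content Source the Crawl Component can use.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
        Select-Object Name, Type, StartAddresses

    # List the file-type inclusions SharePoint will treat as crawlable content.
    Get-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa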

Content Sources in SharePoint 2010 are similar to those in SharePoint 2007, but a new Content Source communication system has been added. Previously, all Content Sources used Protocol Handlers to communicate with a network endpoint and traverse the system to identify the content within it, referred to as Content Nodes. Now, some Content Sources, such as those that connect to third-party content systems like Lotus Notes or Documentum, use the newer Connector Framework. This Indexing Connector is built upon Business Data Connectivity Services and allows SharePoint to crawl, enumerate, and create local indexes of the targeted database content.

An Indexing Connector, whether a Protocol Handler or one built on the Connector Framework, is required for the Crawl Component to use a configured Content Source; picking the type of Content Source also chooses the proper Indexing Connector. Content can thus essentially be grouped by its native Indexing Connector, such as File Shares or Exchange Public Folders.

When building a Content Source, once the Indexing Connector type has been chosen (again, by choosing the Content Source type), the host addresses to be crawled must also be determined. These addresses, also known as Start Addresses, include the entry point for the Indexing Connector, via URL, UNC, or another method specific to the connector (such as a Business Data Connectivity Model), and a crawl depth configuration, also specific to the connector type, that allows SharePoint to crawl subfolders, other servers, and the like. And while the software boundary of a Content Source is 500 Start Addresses, the recommended threshold is 100. This means the starting points for crawls should be planned high enough in the target system to cover everything expected to be indexed from that system.
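
Because these thresholds are planning concerns rather than hard stops, it can be worth checking how many Start Addresses each Content Source has accumulated. A short, hypothetical check against the recommended threshold, reusing the $ssa variable from the earlier sketch:

    # Warn when any Content Source approaches the recommended 100-address threshold.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa | ForEach-Object {
        if ($_.StartAddresses.Count -gt 100) {
            Write-Warning "$($_.Name) has $($_.StartAddresses.Count) Start Addresses (recommended max: 100)"
        }
    }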

Each Content Source also can, and should, have a Crawl Schedule, allowing the Crawl Component to regularly revisit the target system to update its indexes and keep the content behind queries as fresh as the organization requires. The key tradeoff is between the amount of content expected to be crawled and how much processing load the Crawl Component and the target system can bear while still maintaining acceptable day-to-day performance on either system.

This planning point usually results in multiple Content Sources with different, staggered schedules to reduce resource consumption during crawls. Scheduling options include Full Crawls and Incremental Crawls. Full Crawls are usually required at the onset of the SSA's deployment, and subsequently when configurations change or corrupted indexes need to be repaired, among other reasons. However, SharePoint will not perform any crawl unless triggered either manually or by schedule.
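
Crawl schedules can be configured per Content Source from Central Administration or scripted. As a hedged example, the following sketch sets an occasional Full Crawl and an hourly Incremental Crawl on a hypothetical Content Source named Workspace Content; the interval values are illustrative only:

    # Full crawl every 30 days, starting at 11:00 PM.
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Workspace Content" -ScheduleType Full `
        -DailyCrawlSchedule -CrawlScheduleRunEveryInterval 30 `
        -CrawlScheduleStartDateTime "11:00 PM"

    # Incremental crawl every 60 minutes, repeating throughout each day.
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Workspace Content" -ScheduleType Incremental `
        -DailyCrawlSchedule -CrawlScheduleRepeatInterval 60 `
        -CrawlScheduleRepeatDuration 1440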

[Screenshot: TargetedSearch-ContentSource-View]

Configuration

In Enterprise Search, SharePoint has always been self-aware of the content it manages with regard to indexing. A standard Content Source called Local SharePoint Sites is created automatically when a new SSA is built, and it includes a Start Address for each Web Application associated with the SSA. This automatic configuration makes crawling SharePoint-native content much easier, and newly added Web Applications are automatically included as host addresses as needed. For our scenario, since the SSA was created after the Web Application, we do not need to set up this host address ourselves. However, to ensure that our targeted application is focused on the particular content in question, a duplicate copy of the content will be created in a different location using the File Shares Content Source.
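
To confirm which Start Addresses the default Content Source picked up, it can be inspected directly. A minimal sketch, again assuming the $ssa variable from above:

    # Inspect the automatically created Content Source and its Start Addresses.
    $local = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Local SharePoint Sites"
    $local.StartAddresses | ForEach-Object { $_.AbsoluteUri }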

1) Create a Share

First, we create a file share named Workspace Content to host the same content as our Web Application. Again, this is only to verify that our targeted solution is actually looking at the source we expect, which will be the SharePoint Site-based content.
[Screenshot: TargetedSearch-Content-FileShareView]
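
The share itself can be created through Explorer or from a command line. A hypothetical example, assuming the content lives in a local folder such as C:\WorkspaceContent; the crawl (default content access) account name is a stand-in for your own:

    # Share the local folder as "Workspace Content"; grant the crawl account read access.
    # CONTOSO\svcSearchCrawl is a hypothetical default content access account.
    net share "Workspace Content=C:\WorkspaceContent" "/GRANT:CONTOSO\svcSearchCrawl,READ"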

2) Create a File Share Content Source

Next, we build a Content Source based on the File Shares Indexing Connector and set the appropriate Start Address using the UNC of the file share.
[Screenshot: TargetedSearch-ContentSource-AltSource]
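
From the Management Shell, the equivalent configuration is a one-liner. A sketch, assuming the share from step 1 lives on a hypothetical server named FILESRV01 and reusing the $ssa variable from earlier:

    # Create a File Shares Content Source pointing at the UNC of the new share.
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Type File -Name "Workspace Content" `
        -StartAddresses "\\FILESRV01\Workspace Content"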

3) Run a Full Crawl of the new Content Source to be sure the File Share content is crawled.


[Screenshot: TargetedSearch-ContentSource-FullCrawlAltSource]
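
Scripted, the crawl can be kicked off and polled until the Crawl Component returns to idle. A rough sketch under the same assumptions as above:

    # Start a Full Crawl of the file share Content Source and wait for it to finish.
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Workspace Content"
    $cs.StartFullCrawl()

    do {
        Start-Sleep -Seconds 15
        # Re-fetch so CrawlState reflects the current crawl status.
        $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
            -Identity "Workspace Content"
    } while ($cs.CrawlState -ne "Idle")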

4) Add content to the Document Center Subsite of the Web Application.

[Screenshot: TargetedSearch-Content-DocCenterView]

The Document Center Site Template now includes an Upload a Document button, which triggers the single-document upload dialog box and also allows multiple content uploads. The same content stored in the Workspace Content File Share is also stored here. By default, the Document Center has only one Document Library, called Documents, which will suffice for our scenario, though it could include additional Document Libraries, Folders, and even Picture Libraries as it did in the original solution. In this case, we do not need to flesh out that level of organization to test the solution.

5) Create web page-only content for the Web Application
[Screenshot: TargetedSearch-Content-PublishingSiteView]

To ensure that the targeted solution is looking only at the Document Center Site, an additional Site was created to emulate the News Site from the previous version of SharePoint. You may notice that the Workspace Content File Share also includes HTML pages that would be used as web pages for the News Site; thus, after the Publishing Site was created as a Subsite, web pages for the Site content were also generated here. The purpose is to test how keywords in Search may surface results from the Document Center, the Publishing Site, or the File Share, since the point of the targeted solution is to focus on only the Document Center Site.

[Screenshot: TargetedSearch-PublishingSite-WebPageView]

6) Perform a Full Crawl for the Local SharePoint Sites Content

Since the content of both Sites, the Document Center and the Publishing Site, should be included in our result set, as would be expected of an Enterprise Search, a Full Crawl against the Web Application needs to be run. Going forward, Incremental Crawls will be adequate, but for testing purposes we will make our index as fresh as possible.
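
The same pattern used for the file share crawl applies here; only the Content Source name changes:

    # Kick off a Full Crawl of the automatically created SharePoint Content Source.
    $local = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Local SharePoint Sites"
    $local.StartFullCrawl()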

Viewing Search

Once our configuration of Site content and crawled content is complete, we can test whether Enterprise Search provides an accurate view of our content. This is important because SharePoint 2010 continues to support Context Search (sometimes referred to as SharePoint Foundation Search), which automatically indexes a Site Collection's content and allows a user to search only within that Site Collection. By adding the File Share Content Source and creating the Enterprise Search Site, we have built an Enterprise Search that goes beyond the Site Collection and behaves as a standard Enterprise Search application should.

[Screenshot: TargetedSearch-SearchResults-Comparison1]
[Screenshot: TargetedSearch-SearchResults-Comparison2]

In the first screenshot, we see content coming from both the Document Center Site and the File Share. We are also alerted to duplicate entries, as denoted by the View Duplicates link that appears under some of the results. In the second screenshot, one of the duplicate entries is expanded to show that content from the Publishing Site is also displayed in results separately from the File Share. Thus we have an expected and very standard Enterprise Search application, emulating what SharePoint 2007 would have delivered with less configuration.
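
For completeness, the same comparison can be made outside the browser with the Search object model. This is only an illustrative sketch: the KeywordQuery class and TrimDuplicates property are part of the SharePoint 2010 Search API, while the site URL and query term are hypothetical stand-ins for our environment:

    # Query the index directly and keep duplicates visible for comparison.
    $site = Get-SPSite "http://intranet.contoso.com"   # hypothetical root Site Collection URL
    $kq = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
    $kq.QueryText = "workspace"                        # hypothetical test keyword
    $kq.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults
    $kq.TrimDuplicates = $false                        # show File Share and Site copies separately

    $results = $kq.Execute()
    $table = $results.Item([Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults)

    # Load the IDataReader-style results into a DataTable for easy viewing.
    $dt = New-Object System.Data.DataTable
    $dt.Load($table, [System.Data.LoadOption]::OverwriteChanges)
    $dt | Select-Object Title, Path
    $site.Dispose()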

Now that we have the groundwork set, we can create the targeted application solution.

Observation

Similarity between versions of SharePoint does not always mean configuration moves quickly and easily from one version to the next. Although many enhancements have improved much of the experience and functionality, there are still additional tasks and planning points that must be considered and performed to arrive at a functional solution. And while some features, such as Context Search, provide a quick, touchless solution for simple search, expanding to deeper or broader levels requires more work, even to deliver something that may have been considered standard in the previous version.

In our scenario, building Enterprise Search from scratch required many additional steps and planning decisions, and there was still plenty of work to perform before reaching the point where the custom solution was originally crafted. There are reasons this approach can be useful, especially when considering a more centralized Enterprise Search experience, but it also means that some processes from previous configurations may need to be well documented and thought through a bit more to move into the new format and structure.

At this point our scenario brings us to the customization of the solution, along with further considerations for the present Enterprise Search system that still need to be weighed. These topics will be examined in upcoming posts.