Follow me on Twitter @AntonioMaio2

Tuesday, January 14, 2014

Putting Metadata to Work: An Introduction

SharePoint has always had fantastic built-in support for metadata, but many organizations have not yet harnessed the power of metadata to build new efficiency, productivity and security into their business.  This article and a few posts that follow will take us on a tour of what metadata is, how organizations can take advantage of its benefits, and the SharePoint features that support it.

An Introduction to Metadata
In simplest terms, metadata is structured information about our data.  If you are looking at a set of documents, the ‘last modified’ date for each document is a form of metadata.  The ‘author’ of each document is another form of metadata.  One way in which we use these two types of metadata is to sort that list of documents by the ‘last modified’ date in order to find the most recently updated document.  We could also sort by author in order to find a document written by a particular person.  We can see these fields in Windows Explorer.
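To make the sorting idea concrete, here is a minimal Python sketch.  The file names, authors and dates are made up for illustration; the point is only that each record holds information about a document rather than the document's content.

```python
from datetime import datetime

# Hypothetical documents: each dict holds metadata about a file, not its content.
documents = [
    {"name": "budget.xlsx", "author": "Dana", "last_modified": datetime(2014, 1, 10)},
    {"name": "plan.docx", "author": "Alex", "last_modified": datetime(2014, 1, 13)},
    {"name": "notes.docx", "author": "Dana", "last_modified": datetime(2013, 12, 2)},
]

# Sort by 'last modified' so the most recently updated document comes first.
by_date = sorted(documents, key=lambda d: d["last_modified"], reverse=True)

# Sort by 'author' so one person's documents sit next to each other.
by_author = sorted(documents, key=lambda d: d["author"])

# Or filter on the author field to list a particular person's documents.
by_dana = [d["name"] for d in documents if d["author"] == "Dana"]
```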

Metadata fields in this form are sometimes referred to as Tags.  The ‘last modified date’ and ‘author’ are typically filled in automatically by the applications we use to edit those documents as is the case with Microsoft Word.  The values for other metadata fields can be selected by end users – for example, a ‘department’ may need to be filled by end users when saving a document to identify the department responsible for the document.

When users interact with documents and their metadata within SharePoint, for example when saving a document or adding a document to SharePoint, they are typically presented with a limited set of values to select from for each metadata field.  This ensures consistency in how metadata fields are specified and simplifies the process for end users, making it more likely that they will select the correct values.
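The idea of a limited set of values can be sketched in a few lines of Python.  This is not how SharePoint implements it; the department names and the set_department helper are hypothetical and only show why a fixed list keeps values consistent.

```python
# Hypothetical allowed values for a 'department' metadata field.
ALLOWED_DEPARTMENTS = {"Finance", "Human Resources", "Legal", "Engineering"}


def set_department(metadata: dict, value: str) -> dict:
    """Return a copy of metadata with 'department' set, rejecting unknown values."""
    if value not in ALLOWED_DEPARTMENTS:
        raise ValueError(f"{value!r} is not an allowed department")
    return {**metadata, "department": value}
```

With a rule like this, a value such as "Fin." or "finance dept" is rejected up front instead of quietly becoming a second spelling of the same department.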

Creating a well-defined set of metadata fields and values for the business is often referred to as creating a Metadata Taxonomy.  Creating a corporate metadata taxonomy can be an extremely important task for many organizations, and it can turn out to be a simple or very complicated process which involves many stakeholders.  We’ll look at this process in a future post.
Other forms of metadata can also serve other purposes, for example to enhance security or business process.

Persistent Metadata
An important concept when working with metadata is the idea of Persistent Metadata.  Persistent metadata refers to metadata which is stored within the files or information objects to which they refer.  For example, if I start a new Microsoft Word document, when I save it my name (as configured on the Windows system) is stored within the document as the author of that document, as is the current date/time as the document’s creation date. 
I can typically see the persistent metadata within a file by right-clicking the file and opening the document’s properties.  This will show me the document’s persistent metadata:

Persistent metadata is important because it allows metadata to travel with the document, so that no matter how I distribute or transmit the document the metadata goes with it.  This has benefits from a security perspective because most network or gateway security systems can scan document metadata as documents pass through the network (via email, HTTP or FTP).  Those systems can perform simple validations to determine if documents are being transmitted inappropriately or against corporate policy.
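As a rough illustration of metadata traveling inside the file, the sketch below reads the author and dates stored in a current Microsoft Office document.  It relies on the fact that .docx, .xlsx and .pptx files are ZIP packages that carry their core properties in a part named docProps/core.xml.  The read_core_properties function name and the choice of returned fields are my own for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_core_properties(source) -> dict:
    """Read author and date metadata from docProps/core.xml.

    `source` is a path or a binary file-like object for a .docx, .xlsx or .pptx.
    Fields that are missing from the file come back as None.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "author": root.findtext("dc:creator", namespaces=NS),
        "created": root.findtext("dcterms:created", namespaces=NS),
        "modified": root.findtext("dcterms:modified", namespaces=NS),
        "keywords": root.findtext("cp:keywords", namespaces=NS),
    }
```

Calling read_core_properties("report.docx") on a copy of the file that was emailed or downloaded returns the same values as on the original, which is what makes this metadata persistent.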

There are several standardized formats or protocols for storing persistent metadata within information objects:
  • OLE document properties (the summary information property sets) within legacy Microsoft Office document formats like .doc, .xls, .ppt.
  • Open XML document properties (docProps) and custom XML parts (customXml) within current Microsoft Office document formats like .docx, .xlsx, .pptx.
  • Keywords field for simple name/value pairs within PDF files
  • XMP section within PDF or PDF/A files
  • Dublin Core for standardized metadata schema elements
  • X-headers (custom header fields) for persistent metadata within emails

There are several more for specialized security and interoperability purposes.
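For the email case, a small Python sketch using the standard library's email package shows a custom header carrying metadata with the message.  The header name X-Classification is an assumption made for this example; organizations and products choose their own header names.

```python
from email import message_from_string
from email.message import EmailMessage

# 'X-Classification' is a hypothetical header name chosen for this sketch.
CLASSIFICATION_HEADER = "X-Classification"


def tag_message(subject: str, body: str, classification: str) -> EmailMessage:
    """Build a message that carries its classification in a custom header."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg[CLASSIFICATION_HEADER] = classification
    msg.set_content(body)
    return msg


def read_classification(raw_message: str):
    """Return the classification carried by a serialized message, or None."""
    return message_from_string(raw_message)[CLASSIFICATION_HEADER]


# The header is part of the serialized message, so it travels with the email.
raw = tag_message("Q4 forecast", "See attached.", "Confidential").as_string()
```

A gateway that receives the raw message could call something like read_classification(raw) and compare the result against policy before letting the message leave the network.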

Metadata which is not persistent is typically stored outside the file in a database of some kind.  For example, document metadata within SharePoint is stored by default in the SQL Server database upon which SharePoint sits.  Metadata stored within a database also has many benefits in terms of making content easily searchable.  This allows end users to more easily find the content they need, and it can assist auditors during eDiscovery processes.
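To contrast with persistent metadata, here is a minimal sketch of metadata held outside the file in a database.  It uses an in-memory SQLite database purely for illustration; the table and column names are invented and are not SharePoint's actual SQL Server schema.

```python
import sqlite3

# In-memory database standing in for a metadata store that lives outside the files.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_metadata ("
    " path TEXT PRIMARY KEY, author TEXT, department TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO document_metadata VALUES (?, ?, ?, ?)",
    [
        ("/finance/q4.xlsx", "Dana", "Finance", "Approved"),
        ("/finance/q1-draft.xlsx", "Dana", "Finance", "Draft"),
        ("/legal/nda.docx", "Alex", "Legal", "Approved"),
    ],
)


def find_documents(department: str, status: str) -> list:
    """Return paths of documents matching the given metadata values."""
    rows = conn.execute(
        "SELECT path FROM document_metadata"
        " WHERE department = ? AND status = ? ORDER BY path",
        (department, status),
    )
    return [path for (path,) in rows]
```

The query never opens a file, which is why database-held metadata is quick to search.  The trade-off is that a copy of the file sent outside the system carries none of these values with it.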

Metadata Improves Information Security
In recent years many organizations have begun to attach one or more ‘classification’ metadata fields to their documents to identify the sensitivity of the information within the document.  Along with the sensitivity you often see a ‘community’ metadata field which helps identify the intended audience for that information.
Often the available values for a ‘classification’ metadata field are limited to a small number of terms that make sense to those end users that will be selecting the sensitivity classification for a particular document – terms like:
  • Public
  • Internal Distribution Only
  • Confidential
  • Highly Confidential
  • Restricted
  • Legal Restricted
  • Secret
  • Top Secret 
Not all of these classification terms may make sense to your business, but some likely will.  It’s important to only use the terms that make sense to you and your end users.  Employees often need training on which term to use in specific cases, and on what the company policy is for classifying corporate information.
From a security perspective, the purpose of using a classification or sensitivity metadata field is typically related to controlling distribution of the information and ensuring corporate information is only viewed or accessed by those that are permitted to access it.  If we look at the metadata term list above, some terms identify the sensitivity of the data (e.g. Confidential, Highly Confidential) while others define who should have access to the data (e.g. Public, Internal Distribution Only, Legal Restricted).  It’s very tempting to combine terms from both sets into the same metadata field because they ultimately serve the same purpose (control distribution or access) but that can often cause confusion for end users – when do I use ‘Internal Distribution Only’ as opposed to ‘Confidential’?  Often, when several classification terms overlap in meaning as in the case here, organizations will use a ‘community’ metadata field to separate the concept of sensitivity of the information from the distribution of the information.

There are 2 main security benefits of applying ‘classification’ metadata to identify sensitivity of your information:
  • When end users access or receive information the ‘classification’ metadata can educate them on how to handle that information or how to control its distribution,
  • Automated policy systems can enforce access control policies based on that metadata
Of course, for these benefits to be realized other systems need to be in place.  End users must look at a document’s metadata, or some other system needs to be in place to add security markings to documents based on that metadata.  As well, one or more policy systems will need to be in place in order to take advantage of that metadata to control access or distribution.  SharePoint 2013 out of the box does not contain systems that would automate these processes, but the first step in securing such sensitive information should be to add metadata fields to SharePoint lists and libraries so that you can start capturing valuable metadata.
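To give a feel for what an automated policy system might do with this metadata, here is a small generic sketch in Python.  It is not a SharePoint feature or API.  The ordering of the classification levels, the User and Document types and the may_access rule are all assumptions made for illustration, and they keep ‘classification’ (sensitivity) separate from ‘community’ (audience) as discussed above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    # Assumed ordering from least to most sensitive, for this sketch only.
    PUBLIC = 0
    CONFIDENTIAL = 1
    HIGHLY_CONFIDENTIAL = 2


@dataclass(frozen=True)
class User:
    name: str
    clearance: Classification
    communities: frozenset  # audiences this user belongs to


@dataclass(frozen=True)
class Document:
    title: str
    classification: Classification  # how sensitive the information is
    community: str  # the intended audience, e.g. "Finance"


def may_access(user: User, doc: Document) -> bool:
    """Allow access only if both the sensitivity and the audience checks pass."""
    cleared = user.clearance >= doc.classification
    in_audience = doc.community in user.communities
    return cleared and in_audience
```

Keeping the two fields separate means each check answers a single question: is this person cleared for this level of sensitivity, and is this person part of the intended audience?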
Metadata Improves Business Productivity
Business Workflows
In many business processes information objects or documents need to move through specific workflows, typically moving from one person to another.  As they move through that process the state of a document can change to show that one stage of the process is complete, and another is ready to begin.  This is pretty basic and already occurs in most businesses in one form or another.  The ‘state’ in this case can be viewed as document metadata.
When working with large numbers of documents, large user communities or many processes, metadata can provide some great benefits to our business by allowing us to streamline processes and target them directly at specific documents, depending on the nature of the information or on its current state.  For example, many organizations will store ‘department’ and ‘status’ metadata fields with documents which must go through specific approvals; then, depending on the values of the ‘department’ and ‘status’ fields, an approval task will be automatically assigned to a specific manager.
This is pretty common, but we can see how it alleviates a user’s need to select the appropriate manager to approve things like expense reports, travel requisitions, budgets, etc.  It can also route tasks more accurately to the appropriate people, helping to avoid user error.  This can get a little more complex by also looking at the amount of an expense report (another piece of metadata): if the total is over a certain amount, the approval task is automatically routed to a more senior manager.
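The routing logic described above can be sketched roughly as follows.  This is plain Python rather than a SharePoint workflow; the manager names, the "Submitted" status value and the 5000 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table and threshold, for illustration only.
DEPARTMENT_MANAGERS = {"Finance": "finance.manager", "Sales": "sales.manager"}
SENIOR_APPROVER = "senior.manager"
SENIOR_APPROVAL_THRESHOLD = 5000.00


@dataclass
class ExpenseReport:
    department: str
    status: str
    amount: float


def assign_approver(report: ExpenseReport) -> Optional[str]:
    """Pick an approver from the report's metadata.

    Returns None when no approval is pending or when the department
    has no manager in the routing table.
    """
    if report.status != "Submitted":
        return None
    if report.amount > SENIOR_APPROVAL_THRESHOLD:
        return SENIOR_APPROVER
    return DEPARTMENT_MANAGERS.get(report.department)
```

The person submitting the report never picks an approver; the ‘department’, ‘status’ and amount values decide where the task goes.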
Another example that’s often seen is the implementation of a workflow which automatically moves a document from one site to another when the document is ready to be archived.  This type of workflow would be based on several pieces of metadata including:
  • Status (if the document is approved, published or in some other state identifying completion)
  • Date Last Modified (if a document has not been modified for a long time)
  • Department (some departments may not ever want content to be archived)
  • Retention Period (depending on the nature of the information and compliance laws it may require being retained for specific periods of time)
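An archiving rule built on the four fields above might look something like the sketch below.  The status values, the exempt department and the one-year inactivity window are invented for the example.  The text does not spell out exactly how the retention period feeds into the decision, so this sketch assumes that documents past their retention date are left to a separate disposition process rather than archived.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy values, for illustration only.
COMPLETED_STATUSES = {"Approved", "Published"}
NEVER_ARCHIVE_DEPARTMENTS = {"Legal"}
INACTIVITY_WINDOW = timedelta(days=365)


@dataclass
class DocumentMetadata:
    status: str
    last_modified: date
    department: str
    retain_until: date  # assumed to mark the end of the retention period


def should_archive(doc: DocumentMetadata, today: date) -> bool:
    """Decide whether a document is ready to be moved to the archive site."""
    if doc.department in NEVER_ARCHIVE_DEPARTMENTS:
        return False  # some departments never want content archived
    if doc.status not in COMPLETED_STATUSES:
        return False  # still in progress
    if today - doc.last_modified < INACTIVITY_WINDOW:
        return False  # modified too recently
    # Assumption: only documents still inside their retention period are archived.
    return today <= doc.retain_until
```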
SharePoint builds in great functionality and flexibility to design and implement workflows.  Those workflows can take advantage of metadata fields within SharePoint to achieve the scenarios described above, as well as many others.  Workflows in SharePoint typically take advantage of metadata through Content Types.  Some great information from Microsoft on planning and implementing content types and workflows can be found here:
Making Content More Searchable
SharePoint 2013 has made great advancements in its ability to search large amounts of content.  It now provides a very flexible and very robust enterprise search application.  Deployment of SharePoint 2013 search can be fairly complex, requiring specific planning and in-depth knowledge.  The built-in search capability can take advantage of SharePoint metadata and provide users with additional structured data to use in finding the content they need.  As well, it can allow end users to refine their search queries and narrow in more quickly on the content they’re looking for.
An example of such a search could be if an auditor wishes to find all content from the Finance department which has been approved by a specific manager that is currently under investigation.  To make this a little more complicated let’s consider looking for all content from the Finance department, which is an expense report, approved by a specific manager, which is over a certain amount of money and which was approved between two specific dates.  If these values were all stored as SharePoint metadata with each document, such a search would be rather trivial.  We can find many such examples where metadata can provide the benefit of making content more searchable.
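The auditor's query can be sketched as a simple filter over metadata.  This is ordinary Python rather than SharePoint search syntax, and the Item fields and the audit_search helper are hypothetical names chosen to mirror the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    title: str
    department: str
    content_type: str
    approved_by: str
    amount: float
    approved_on: date


def audit_search(items, *, department, content_type, approved_by,
                 min_amount, approved_from, approved_to):
    """Return titles of items whose metadata matches every criterion."""
    return [
        item.title
        for item in items
        if item.department == department
        and item.content_type == content_type
        and item.approved_by == approved_by
        and item.amount > min_amount
        and approved_from <= item.approved_on <= approved_to
    ]
```

Every condition is a comparison against a stored field, which is why the search becomes trivial once the values are captured as metadata.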
That said, by default SharePoint 2013 Search does not use metadata as part of its search index.  Some configuration is needed to have metadata included as part of your search query.  When searching for content, SharePoint must first crawl the content to build up an index of the content available - this is done prior to end users performing a search.  When crawling content, several different types of properties within each document are examined by the search engine.  For example, the search engine can extract keywords from content – these are called ‘crawled properties’.  Only those properties that have been pre-configured as ‘managed properties’ are used as part of the search index.  This can be a complex task that could be an article in itself - for more information on configuring SharePoint 2013 search to use managed properties and metadata refer to:
