Office 365 DLP policies currently support many sensitive data types, which represent the kinds of information that an individual or business may want to protect: US social security numbers, various European passport or identity card numbers, credit card numbers, and so on. At the time this post was written, 82 sensitive data types were supported. In many cases a sensitive data type includes a regular expression that is matched, an extensive list of keywords that are searched for within a certain proximity of the sensitive data, and a checksum that is calculated (for example, running the Luhn algorithm on a suspected credit card number). An inventory of the supported sensitive data types, along with exactly what each one looks for, is available in Microsoft's documentation.
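To make the checksum idea concrete, here is a minimal Python sketch of the Luhn check mentioned above. It is only an illustration of the algorithm; how the DLP engine implements its own check is not shown in this post, and the function name `luhn_valid` is my own.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    # Ignore spaces, dashes and any other non-digit separators.
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


# A widely used test card number passes; changing its last digit fails.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A checksum like this is what lets a sensitive data type reject most random 16-digit strings before keywords are even considered.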
You are also able to modify existing sensitive data types or create custom sensitive data types which not only get used by Office 365 DLP, but also by features like Office 365 Labels, and Office 365 Advanced Data Governance. You may want to create a custom sensitive data type if you have a custom piece of data within your organization that follows a particular well-defined pattern and that you need to look for within documents or emails. An example is if you have a custom format for an employee number. Another case is if you live in a country that has an identity card number format or driver's license format, for example, which is not represented by the built-in Office 365 sensitive data types.
There are a few sites and articles out there that share steps on how to do this, but I have found many of them to be incomplete. Recently we had to create some custom sensitive data types for a GDPR project, and I wanted to share my experience creating them.
In our walk-through, the example of a custom sensitive data type I'm going to use is EU Debit Card Number. We're going to envision a scenario where we're looking for this type of sensitive data as part of a GDPR project and we are getting enough false positives that we want to try to make looking for this type more accurate, by introducing new keywords, adjusting the proximity parameter and modifying the confidence level.
Most Practical Approach
The most practical approach when creating or customizing a sensitive data type is to create a new sensitive data type based on an existing one, giving it a unique name and identifiers. For example, if you wish to adjust the parameters of the “EU Debit Card Number” sensitive data type, you could name your copy of that rule “EU Debit Card Enhanced” to distinguish it from the original. In your new sensitive data type, simply modify the values you wish to change to improve its accuracy. Once complete, you will upload your new sensitive data type and create a new DLP rule (or modify an existing one) to use the new sensitive data type you just added. Modifying the accuracy of sensitive data types could require some trial and error, so maintaining a copy of the original type allows you to fall back to it if required in the future.
Customize the Sensitive Data Type
The following is the detailed step-by-step process that is necessary to create a custom sensitive data type or modify an existing one.
1. Export the existing Rule Package of built-in sensitive data types that are available in Office 365
a. At the PowerShell command prompt create a connection to Exchange Online:
$UserCredential = Get-Credential
$Session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://outlook.office365.com/powershell-liveid/ -Credential $UserCredential -Authentication Basic -AllowRedirection
Import-PSSession $Session
Note: At the start of this example we have to use the Exchange Online PowerShell module and at the end of the example we're using the newer Security and Compliance Center PowerShell module. This is intentional.
b. Export the current rule collection to an XML file:
$ruleCollection = Get-ClassificationRuleCollection
Set-Content -Path "C:\exportedSensitiveTypes.xml" -Encoding Byte -Value $ruleCollection.SerializedClassificationRuleCollection
c. Copy the XML file you just exported and give it a different file name – for example: MyNewDLPRule.xml. Keep the original exported rule xml file for reference while you’re constructing your new file.
2. Open the XML file you just copied/renamed in your favorite XML editor
3. You will need to generate three GUIDs and use them to replace the existing ids in the sensitive data type that you are modifying. At the PowerShell command prompt, type the following and record your new GUID. Repeat this twice more and record those GUIDs as well.
New-Guid
4. The XML file is large, so for simplicity we’re going to start with an existing sensitive data type and isolate it in our new rule xml file by removing all other sensitive data types. In this case, we’re going to modify the “EU Debit Card Number” sensitive data type.
Note: the following details are important when modifying the rule’s XML structure.
a. The <RulePack> element of the file contains information about the publisher, in this case Microsoft. It also contains localized versions of all the Publisher strings. The <RulePack> definition contains an id property, which is the first place where you’ll need to replace the existing GUID with a new one you just generated.
b. The <Rules> element contains the definition of your sensitive data type.
c. The <Rules> element is made up of the <Entity> element, which defines the pattern to use; the <Keyword> element, which defines the list of keywords to match; a <Regex> element, which defines the regular expression that’s used in the pattern; and a <LocalizedStrings> element, which defines the default and localized strings displayed in the DLP rule UI.
d. The <Entity> element contains an id property, which is another place where you’ll need to replace the existing GUID with a new one you just generated. This GUID is referenced further down within the <LocalizedStrings> element, as the idRef of a <Resource> element. This <Resource> element identifies both default and localized strings to use when displaying your sensitive data type in the DLP UI. If you are deleting <Entity> and <Resource> elements in order to simplify the XML file you are working with, ensure that you do not delete the <Resource> element that is referenced using the GUID by your sensitive data type’s <Entity> element.
e. The <Entity> element also contains several <Match> elements, which identify the names of the keyword lists that are used by that sensitive data type. These appear as follows:
<Any minMatches="1">
  <Match idRef="Keyword_eu_debit_card" />
  <Match idRef="Keyword_card_terms_dict" />
  <Match idRef="Keyword_card_security_terms_dict" />
  <Match idRef="Keyword_card_expiration_terms_dict" />
  <Match idRef="Func_expiration_date" />
</Any>
f. The keyword list named "Keyword_card_terms_dict" is defined further down in the XML file within a <Keyword> element. In this case, the <Keyword> element contains a <Group> element and multiple <Term> values, each of which defines a keyword that Office 365 DLP uses to identify sensitive information. Keyword lists are often referred to as “corroborative evidence” when defining DLP rules. Keyword lists defined within the XML structure can be used by multiple sensitive data types – for example, the Credit Card Number and EU Debit Card Number sensitive data types share some of the same keyword lists. The same is true for <Regex> definitions. Keyword lists defined in this file can be modified to add additional, more specific keywords or to remove keywords. Once again, if you are deleting <Keyword> lists throughout the original XML file, ensure that you do not delete a <Keyword> element that is referenced within your sensitive data type’s <Entity> element.
g. The <Entity> element contains three elements which are used to define a pattern that the sensitive data type must match: IdMatch, Match and Any. In order to promote reusability of definitions across multiple patterns, the IdMatch and Match elements do not define the details of what content needs to be matched but instead reference it through the idRef attribute.
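Since it is easy to delete one element too many while trimming the file, a small check can help before uploading. The following Python sketch is only an illustration under some assumptions: the sample XML is a simplified stand-in with no XML namespace (a real exported rule package typically declares one on its root element, which these lookups would have to account for), the root element name and `Regex_sample_card_number` are placeholders of mine, and references starting with `Func_` are assumed to be built-in functions that are not defined in the file.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a trimmed rule package. The Keyword_eu_debit_card
# list has been "accidentally" deleted to show what the check reports.
SAMPLE = """
<RulePackage>
  <Rules>
    <Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="300" recommendedConfidence="85">
      <Pattern confidenceLevel="85">
        <IdMatch idRef="Regex_sample_card_number" />
        <Any minMatches="1">
          <Match idRef="Keyword_eu_debit_card" />
          <Match idRef="Keyword_card_terms_dict" />
          <Match idRef="Func_expiration_date" />
        </Any>
      </Pattern>
    </Entity>
    <Regex id="Regex_sample_card_number">\\d{16}</Regex>
    <Keyword id="Keyword_card_terms_dict">
      <Group><Term>acct no</Term></Group>
    </Keyword>
    <LocalizedStrings>
      <Resource idRef="48da7072-821e-4804-9fab-72ffb48f6f78">
        <Name default="true" langcode="en-us">EU Debit Card Number Enhanced</Name>
      </Resource>
    </LocalizedStrings>
  </Rules>
</RulePackage>
"""


def find_dangling_references(root):
    """List idRef values that point at nothing defined in the file."""
    defined = {el.get("id") for tag in ("Keyword", "Regex") for el in root.iter(tag)}
    resources = {el.get("idRef") for el in root.iter("Resource")}
    problems = []
    for entity in root.iter("Entity"):
        for ref in entity.iter():
            if ref.tag not in ("IdMatch", "Match"):
                continue
            id_ref = ref.get("idRef") or ""
            # Assumption: Func_* references are built-in functions.
            if id_ref.startswith("Func_"):
                continue
            if id_ref not in defined:
                problems.append(f"{ref.tag} -> {id_ref}")
        # Every Entity needs a Resource carrying its display strings.
        if entity.get("id") not in resources:
            problems.append(f"Resource -> {entity.get('id')}")
    return problems


root = ET.fromstring(SAMPLE.strip())
for problem in find_dangling_references(root):
    print("Missing definition:", problem)
```

An empty result does not guarantee the upload will succeed, but a non-empty one points at a reference that was broken while removing elements.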
5. Now that we have our sensitive data type XML file ready to edit, we’ll make some basic modifications in order to identify our sensitive data type as a new unique type when configuring a DLP rule.
a. Modify the id property of the <RulePack> element. This id should be replaced with one of the new GUIDs we created.
<RulePack id="bd2568b4-b331-4387-b399-7e46065f6994">
b. Within the <RulePack> element, find the <Publisher> element and replace its id property with the second of the new GUIDs we created.
<Publisher id="ac9a7b29-870f-4810-a96f-6b4080c67e5d" />
c. Modify the id property of the <Entity> element which represents our sensitive data type. This id should be replaced with the third of the new GUIDs we created.
<Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="300" recommendedConfidence="85">
d. Within the <RulePack><Details><LocalizedDetails> element, find the <PublisherName>, <Name> and <Description> elements. Modify the value of these elements to unique values.
<Details defaultLangCode="en">
  <LocalizedDetails langcode="en">
    <PublisherName>Contoso</PublisherName>
    <Name>Contoso Rule Package</Name>
    <Description>Defines the set of classification rules for Contoso</Description>
  </LocalizedDetails>
</Details>
e. Within the <LocalizedStrings> element, find the <Resource> element whose idRef matched the original id of our <Entity> element and change that idRef to the GUID that we assigned to the id of the <Entity> element.
f. Within the <LocalizedStrings><Resource> element we just modified, find the <Name> and <Description> elements and modify their values so that our sensitive data type has a unique name and description. This will help us better select the correct sensitive data type when we configure a DLP rule.
<Resource idRef="48da7072-821e-4804-9fab-72ffb48f6f78">
  <Name default="true" langcode="en-us">EU Debit Card Number Enhanced</Name>
  <Description default="true" langcode="en-us">Detects European Union debit card number with enhanced accuracy.</Description>
</Resource>
g. If you are going to start with an existing sensitive data type and that type contains existing keyword lists in its definition, it is important to note that some keyword lists are in use by Microsoft’s built-in sensitive data types, and the names of those lists may not be reused in other sensitive data type definitions. The “EU Debit Card Number” example makes use of the following Microsoft keyword lists:
<Keyword id="Keyword_card_expiration_terms_dict">
<Keyword id="Keyword_card_security_terms_dict">
<Keyword id="Keyword_card_terms_dict">
Those lists are currently in use by multiple built-in sensitive data types and we are not permitted to reuse those names. Therefore, we must modify those names in the <Keyword> elements such that they are unique, as follows:
<Keyword id="Keyword_card_expiration_terms_dict_enhanced">
<Keyword id="Keyword_card_security_terms_dict_enhanced">
<Keyword id="Keyword_card_terms_dict_enhanced">
Those names must also be modified within the <Entity> element where they are referenced:
<Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="150" recommendedConfidence="85">
  <Pattern confidenceLevel="85">
    …
    <Any minMatches="1">
      …
      <Match idRef="Keyword_card_terms_dict_enhanced" />
      <Match idRef="Keyword_card_security_terms_dict_enhanced" />
      <Match idRef="Keyword_card_expiration_terms_dict_enhanced" />
      …
    </Any>
  </Pattern>
</Entity>
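The renaming in step 5 is mechanical, so it can also be scripted rather than done by hand. Below is a rough Python sketch of that idea, not a replacement for reviewing the file yourself. The sample XML is a simplified stand-in with placeholder GUIDs and no XML namespace (a real rule package would need namespace-aware lookups), and the function name `make_unique` and the `_enhanced` suffix default are my own choices.

```python
import uuid
import xml.etree.ElementTree as ET

# Simplified stand-in for a copied rule package; the GUIDs are placeholders.
SAMPLE = """
<RulePackage>
  <RulePack id="00000000-0000-0000-0000-000000000001">
    <Publisher id="00000000-0000-0000-0000-000000000002" />
  </RulePack>
  <Rules>
    <Entity id="00000000-0000-0000-0000-000000000003" patternsProximity="300" recommendedConfidence="85">
      <Pattern confidenceLevel="85">
        <Any minMatches="1">
          <Match idRef="Keyword_card_terms_dict" />
          <Match idRef="Func_expiration_date" />
        </Any>
      </Pattern>
    </Entity>
    <Keyword id="Keyword_card_terms_dict">
      <Group><Term>acct no</Term></Group>
    </Keyword>
    <LocalizedStrings>
      <Resource idRef="00000000-0000-0000-0000-000000000003">
        <Name default="true" langcode="en-us">EU Debit Card Number</Name>
      </Resource>
    </LocalizedStrings>
  </Rules>
</RulePackage>
"""


def make_unique(root, suffix="_enhanced"):
    """Give the rule package fresh GUIDs and uniquely named keyword lists."""
    # Steps a-c: three new GUIDs for RulePack, Publisher and Entity.
    root.find("RulePack").set("id", str(uuid.uuid4()))
    root.find("RulePack/Publisher").set("id", str(uuid.uuid4()))
    entity = root.find("Rules/Entity")
    old_id, new_id = entity.get("id"), str(uuid.uuid4())
    entity.set("id", new_id)
    # Step e: keep the Resource pointing at the Entity's new id.
    for resource in root.iter("Resource"):
        if resource.get("idRef") == old_id:
            resource.set("idRef", new_id)
    # Step g: rename keyword lists, then update every Match that used them.
    renamed = {}
    for keyword in root.iter("Keyword"):
        old_name = keyword.get("id")
        renamed[old_name] = old_name + suffix
        keyword.set("id", renamed[old_name])
    for match in root.iter("Match"):
        if match.get("idRef") in renamed:
            match.set("idRef", renamed[match.get("idRef")])
    return new_id


root = ET.fromstring(SAMPLE.strip())
new_id = make_unique(root)
print(ET.tostring(root, encoding="unicode"))
```

The display names and descriptions from steps d and f are left out on purpose, since those are wording decisions better made by hand.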
Fine Tune a Sensitive Data Type to Avoid False Positives or Look for Organization-Specific Information
Now we’ll make some modifications to our custom sensitive data type in an attempt to improve its accuracy. Improving the accuracy of DLP rules in any system requires testing against a sample data set, and may require fine tuning through repeated modifications and tests. In our example, an EU Debit Card Number is strictly defined as 16 digits that follow a complex pattern and pass a checksum validation. Because this sensitive data type is so strictly defined, we cannot alter the pattern itself. However, we can make the following adjustments in order to improve the accuracy with which Office 365 DLP finds this sensitive data type in our content within Office 365:
1. Proximity Modifications
We can modify the character pattern proximity to expand or shrink the window in which keywords must be found around the sensitive data type. In our case we’ll shrink the window by modifying the patternsProximity value in our <Entity> element from 300 to 150 characters. This means that our corroborative evidence, or our keywords, must be closer to our sensitive data type in order to signal a match on this rule.
<Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="150" recommendedConfidence="85">
2. Keyword Modifications
We can add keywords to one of our <Keyword> elements in order to give our sensitive data type more specific corroborative evidence to search for when signaling a match on this rule. These keywords could be organization-specific or language-specific. Alternatively, we might find that some keywords are causing false positives, and as a result we may want to remove them. Keywords are added by navigating to the <Keyword> element for one of the three keyword lists provided with the “EU Debit Card Number” definition and adding <Term> elements, with the keywords as values (one <Term> per additional keyword).
<Keyword id="Keyword_card_terms_dict_enhanced">
  <Group>
    <Term>corporate card</Term>
    <Term>organization card</Term>
    <Term>acct nbr</Term>
    <Term>acct num</Term>
    <Term>acct no</Term>
    …
  </Group>
</Keyword>
3. Confidence Modifications
We can modify the confidence with which the sensitive data type must match the criteria specified in its definition before a match is signaled and reported. This is done by modifying the confidenceLevel property on the <Entity><Pattern> element. The more evidence that a pattern requires, the more confidence you have that an actual entity (such as employee ID) has been identified when the pattern is matched. If you remove keywords from the definition, you would typically want to adjust how confident you are that this sensitive data type was found by lowering it from its default level of 85 in the case of the EU Debit Card Number type.
<Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="150" recommendedConfidence="85">
  <Pattern confidenceLevel="85">
    …
  </Pattern>
</Entity>
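Because tuning usually takes several rounds of edit, upload and test, it can be handy to apply the three adjustments from a script. The Python sketch below shows one way this might look. It works on a simplified, namespace-free stand-in for the <Rules> section, the function name `tune` is my own, and the confidence value of 75 is an arbitrary number picked for the example rather than a recommendation.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the Rules section of our custom rule package.
RULES_XML = """
<Rules>
  <Entity id="48da7072-821e-4804-9fab-72ffb48f6f78" patternsProximity="300" recommendedConfidence="85">
    <Pattern confidenceLevel="85">
      <Any minMatches="1">
        <Match idRef="Keyword_card_terms_dict_enhanced" />
      </Any>
    </Pattern>
  </Entity>
  <Keyword id="Keyword_card_terms_dict_enhanced">
    <Group>
      <Term>acct no</Term>
    </Group>
  </Keyword>
</Rules>
"""


def tune(rules, proximity, confidence, keyword_id, new_terms):
    """Apply the proximity, keyword and confidence adjustments in one pass."""
    entity = rules.find("Entity")
    # 1. Proximity: size of the window searched for corroborative evidence.
    entity.set("patternsProximity", str(proximity))
    # 2. Keywords: append one <Term> per new keyword, skipping duplicates.
    group = rules.find(f"Keyword[@id='{keyword_id}']/Group")
    existing = {term.text for term in group.iter("Term")}
    for text in new_terms:
        if text not in existing:
            ET.SubElement(group, "Term").text = text
            existing.add(text)
    # 3. Confidence: this sketch sets the same level on every <Pattern>.
    for pattern in entity.iter("Pattern"):
        pattern.set("confidenceLevel", str(confidence))


rules = ET.fromstring(RULES_XML.strip())
tune(rules, proximity=150, confidence=75,
     keyword_id="Keyword_card_terms_dict_enhanced",
     new_terms=["corporate card", "organization card", "acct no"])
print(ET.tostring(rules, encoding="unicode"))
```

Keeping the values as function arguments makes it easy to generate a few variants of the file and compare their results against the same sample data set.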
The following is a screenshot of our final sensitive data type definition file (with some elements collapsed):
Upload a New Sensitive Data Type
Now that we've defined our sensitive data type in the XML file structure, we are going to upload it to our Office 365 tenant.
1. At the PowerShell command prompt create a connection to the Office 365 Security & Compliance Center:
$UserCredential = Get-Credential
$Session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://ps.compliance.protection.outlook.com/powershell-liveid/ -Credential $UserCredential -Authentication Basic -AllowRedirection
Import-PSSession $Session
Note: At the start of this example we have to use the Exchange Online PowerShell module and at the end of the example we're using the newer Security and Compliance Center PowerShell module. This is intentional.
2. Create a new Classification Rule in Office 365 and upload your sensitive data type XML definition file:
New-DlpSensitiveInformationTypeRulePackage -FileData (Get-Content -Path "C:\EUDebitCardNumberEnhanced.xml" -Encoding Byte)
When the upload has completed successfully, the following output will appear in the PowerShell console:
3. Trigger a re-crawl of the content within the site collections potentially containing the new sensitive data type
4. Log in to Office 365 as an administrator, navigate to the Security & Compliance Center, create a new Data Loss Prevention policy and select the new sensitive data type you just created.
Full Disclosure
You may find that some of this information is reprinted from a Microsoft article titled Office 365 Information Protection for GDPR. This is because I worked with Microsoft to help write and produce that content, and I wanted to re-use some of that material to highlight another scenario. Of course, please refer to both this article and the Microsoft content to get your custom sensitive data type working correctly.
More Information
For more information, refer to the following helpful articles:
Enjoy.
-Antonio