
EMC Documentum

Content Intelligence Services


Version 6.7

Administration Guide

EMC Corporation
Corporate Headquarters:
Hopkinton, MA 01748-9103
1-508-435-1000
www.EMC.com

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
The information in this publication is provided as is. EMC Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness
for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks
used herein are the property of their respective owners.
Copyright 2011 EMC Corporation. All rights reserved.

Table of Contents

Preface

Chapter 1  Introduction
    Taxonomy-based classification
    Entity extraction
    Metadata extraction

Chapter 2  Overview
    Components
    Architecture
    Roles
    Limitations
    CIS processes textual content
    One CIS server can only work with one repository
    CIS processing updates the last modified date
    Text extraction
    Document properties
    Documentum attributes

Part 1  Administration

Chapter 3  Administer the CIS server
    Start/Stop the CIS server
    Manage CIS service on Linux
    Configure the CIS JMX agent
    CIS JMX Agent in JConsole
    CIS JMX agent in Documentum Administrator
    Configure the CIS server
    As an application
    As a Windows service
    As a Java application
    Monitor CIS server processing
    CIS server log files
    Monitor CIS processing in cis-activity.log
    Status files for unprocessed documents
    How to read the unprocessed_docs.txt file

Chapter 4  Troubleshooting
    Modify the level of details in the detailed activity log file
    Most common errors
    Frequently asked questions
    I have changed a taxonomy or a document set, and reprocessing does not take my changes into account
    Stemming is not available for my language
    How can I improve the performance?

Part 2  CIS in Documentum Administrator

Chapter 5  Content Intelligence Services
    Content Intelligence Services
    Providing evidence
    About confidence values and score thresholds
    About stemming and phrase order
    Setting the language used for the stemming
    Activating the stemming
    Retaining the phrase order
    About category links
    Setting up Content Intelligence Services
    Enabling Content Intelligence Services
    Missing Content Intelligence node
    Modifying Content Intelligence Services configuration
    Building taxonomies
    Setting permissions for Content Intelligence
    Defining category classes
    Defining taxonomies
    Creating subtypes for a taxonomy or for a category
    Creating custom tab for the subtype
    Creating subtype instances
    Defining categories
    Displaying object titles
    Setting category rules
    Defining property rules
    Displaying attributes in Property rules
    Defining simple evidence terms
    Managing taxonomies
    Making taxonomies available
    Synchronizing taxonomies
    Deleting taxonomies
    Processing documents
    Test processing and production processing
    Defining document sets
    Submitting documents to CIS server
    Assigning a document manually
    Reviewing categorized documents
    Clearing assignments
    Refining category definitions
    Using compound terms
    Selecting terms
    Using common words as evidence terms
    Modifying category and taxonomy properties
    Defining compound evidence terms

Part 3  Configuration

Chapter 6  Configuring the Type of Content Processed
    Principles
    Configuring attribute processing
    Troubleshooting attribute processing configuration

Chapter 7  Configuring Document Sets
    Document set configuration files
    Configuring a document set for metadata extraction
    Converting the 6.6 document set configuration files

Part 4  Entity Extraction

Chapter 8  Entity Extraction
    Installation of the entity extraction server
    The entity extraction process

Chapter 9  Configuring Entity Extraction
    Manage entity extraction services
    Disable entity extraction
    Set up a multi-node environment
    Customize the cartridge: add named entities
    Blacklist specific entities

Part 5  Classification

Chapter 10  Classification Process
    Data synchronization for classification
    Select documents to process
    Submit documents for processing
    Define submission schedules
    Submit documents on demand
    Resubmit documents
    Conceptual analysis and category score
    Category score computation
    Stemming capability
    Stemming mechanism
    Configure CIS default language
    Auto categorization of the documents
    Pattern analysis
    Patterns as evidence terms
    Use patterns in rules
    Limitations
    Configure pattern analysis
    Classification information
    Category assignments configuration
    Classification roles
    The taxonomy manager
    The category owner

Chapter 11  Configure CIS Standard Classification

Chapter 12  Use the Taxonomy Exchange Format (TEF)
    Import taxonomies in Taxonomy Exchange Format
    Use the tef2repository script
    Use the TefUtil tool
    TEF elements
    Taxonomy Exchange Format action files
    Create TEF action files
    TEF action file elements

Part 6  Metadata Extraction

Chapter 13  Metadata Extraction
    Metadata extraction principles
    Defining metadata extraction rules
    Rules sample

Chapter 14  Configuring Metadata Extraction
    Defining a rules file
    Running the extract_metadata script
    Metadata extraction rules
    Rules principles
    Rules definitions
    Conditions
    Operator rules
    Best practices and tips

Part 7  Exposing Content Intelligence Services Results

Chapter 15  Expose Classification Concepts or Entities in CenterStage Filters
    Extract classification concepts
    Extract new entities
    Add custom filters in CenterStage
    Clear previous entities
    Clear the document status

Chapter 16  Annotation API

Chapter 17  Integrate CIS Classification
    Organize your library
    Workflow and lifecycle processing
    Web Publisher integration
    Retention Policy Services integration

Appendix A  Content Intelligence Services Processing Diagram

Appendix B  Properties Extracted

Appendix C  Document Set Configuration Files
    default.xml
    docset-sample.xml

List of Figures

Figure 1.  CIS architecture overview: Classification
Figure 2.  CIS architecture overview: Entity extraction
Figure 3.  CIS architecture overview: Metadata extraction
Figure 4.  Example of a custom entity based on Luxid TM360 Postal Address
Figure 5.  Example of localization of a custom entity
Figure 6.  CIS processing diagram legend
Figure 7.  CIS processing diagram notes
Figure 8.  CIS processing diagram

List of Tables

Table 1.  CIS service options (Linux)
Table 2.  Configuration parameters in the cis.properties file
Table 3.  Possible errors for unprocessed files
Table 4.  Date formats for property rules
Table 5.  Descriptions of the xml elements in document set configuration files
Table 6.  Customization files for entities
Table 7.  Special characters in customization files
Table 8.  <class> Element Attributes
Table 9.  <details> Element attributes
Table 10.  <impliedKeywordDefaults> Element Attributes
Table 11.  <keywordDefaults> Element Attributes
Table 12.  <evidencePropagation> Element Attributes
Table 13.  <categoryEvidenceDefaults> Element Attributes
Table 14.  <taxonomy> Element Attributes
Table 15.  <category> Element Attributes
Table 16.  <details> Element attributes
Table 17.  <owner> Element Attributes
Table 18.  <operation> Element Attributes
Table 19.  <supportedLanguage> Element Attributes
Table 20.  <attribute> Element Attributes
Table 21.  <definition> Element Attributes
Table 22.  <evidence> Element Attributes
Table 23.  <keyword> Element Attributes
Table 24.  <categoryEvidence> Element Attributes
Table 25.  <qualifier> Element Attributes
Table 26.  <categoryLink> Element Attributes
Table 27.  <add> Element Attributes
Table 28.  <classObject> Element Attributes
Table 29.  <taxonomyObject> Element Attributes
Table 30.  <withinParentReference> Element Attributes
Table 31.  <categoryObject> Element Attributes
Table 32.  <classReference> Element Attributes
Table 33.  <categoryReference> Element Attributes
Table 34.  <export> Element Attributes
Table 35.  <SetMetadata> Element Attributes
Table 36.  <GetMetadata> Element Attributes
Table 37.  <DocProperty> Element Attributes
Table 38.  <DocRepositoryAttribute> Element Attributes
Table 39.  <Block> Element Attributes
Table 40.  <Line> Element Attributes
Table 41.  <Pattern> Element Attributes
Table 42.  <Zone> Element Attributes
Table 43.  <Constant> Element Attributes
Table 44.  <Exists> Element Attributes
Table 45.  <Contains> Element Attributes
Table 46.  <Equals> Element Attributes
Table 47.  <IsPositionBefore> Element Attributes
Table 48.  <Concat> Element Attributes
Table 49.  Internal and public names of default entities
Table 50.  TM360 entities
Table 51.  Extracted properties for MS Office and other documents

Preface

The Content Intelligence Services Administration Guide contains procedures and information for setting
up and managing the server-side components of Content Intelligence Services (CIS). This manual
assumes that you have already installed Content Intelligence Services by following the instructions in
the Content Intelligence Services Installation Guide.

Intended audience
This manual is intended primarily for administrators who are managing Content Intelligence
Services applications.
The CIS server categorizes documents into taxonomies that you build and maintain using
Documentum Administrator. For information about using Documentum Administrator, see the
Documentum Administrator User Guide. The CIS server is also used in the context of a CenterStage
deployment to extract entities displayed as filters in CenterStage clients.

Typographic conventions
The following table describes the typographic conventions used in this guide.
Typographic conventions

Typeface                     Text type

Body Bold                    In procedures: user actions (what the user clicks, presses, selects, or
                             types), interface elements (button names, dialog boxes), key names.
                             In running text: command names, daemons, options, programs, processes,
                             notifications, system calls, man pages, services, applications,
                             utilities, kernels.

Body Italic                  Book titles, emphasis (glossary terms, See also index references).
                             Variables in text (outside of command sample).

Courier                      If shown on separate line: prompts, system output, filenames,
                             pathnames, URLs, syntax examples.

Courier Bold                 User input shown on separate line.

Courier Italic               In procedures: variables in command strings, user input variables.

<Italic in angle brackets>   A variable for which you must provide a value.

Revision history
The following changes have been made to this document.
Revision history

Revision date    Description
April 2011       Initial publication

Chapter 1
Introduction

EMC Documentum Content Intelligence Services (CIS) is the automatic classification and extraction
engine for EMC Documentum. Automatic classification is based on taxonomies and categories and
allows you to organize content in many different ways. Entity extraction collects entities from content
using Natural Language Processing. Entities are exposed in CenterStage deployments and, when
stored as annotations, can also be accessed using the Annotation API. Content Intelligence
Services also allows you to extract metadata from documents.

Taxonomy-based classification
CIS organizes documents into taxonomies. A taxonomy is a hierarchical set of categories used to
organize content in the repository. This organization, often based on the subject matter of the content,
provides one place for users to look for all content related to common topics of interest.
For example, suppose that the folders in repository cabinets organize objects based on which
department created the content or on the document type, such as Press Releases in one folder and
Product Design Specifications in another folder. A user looking for all available information about a
particular product (including documents from multiple departments, and both press releases and
design specifications) needs to look in every folder that could possibly include objects related to that
topic. With product-based categories, the user can look in a single category to find all documents
related to the product, while the documents themselves remain filed in the original folders.
CIS classification is highly configurable. The following features allow you to set up the classification
that fits your content organization needs.
Keyword-based classification: CIS can assign documents to relevant categories based on a
semantic analysis of their content. When you define your taxonomy, you identify keywords, phrases,
and patterns associated with each category. The CIS server uses these words and phrases as evidence
terms: when the server processes a document, it assigns the document to categories based
on the evidence terms it finds in the content.
Property-based classification: You can also configure CIS to classify documents based on their
property values (document metadata). In this case, documents are assigned according to the values
of the repository attributes. You can also make a property match a requirement for a document to
be assigned to a category.
Configurable confidence threshold: As the CIS server processes a document, it determines the
confidence score of a document for each category in the taxonomy. The confidence score reflects how
much evidence the CIS server found to indicate that the document belongs to the category. If the
document score for a category meets or exceeds a predefined threshold, the CIS server assigns the
document to that category. If the confidence score falls short of the threshold, the CIS server can
provisionally assign the document to the category as a Pending candidate. The user who owns the
category must review pending document candidates before they are fully categorized.
Actions based on classification results: When a document is assigned to a category, you can
decide to link this document to the folder associated with the category. You can also choose to add
the category names to an attribute of the document. You can enable or disable these features when
you configure CIS.
Manual categorization: CIS also supports manual categorization, where users (rather than the CIS
server) manually assign documents to categories in DA. As with the automatic CIS server processing,
category assignments can be used to link documents into a searchable hierarchy of category folders,
add the category names to a document attribute, or both.
Classification concepts stored as annotations: Classification concepts are category matches found
by CIS and based on taxonomies. Unlike the results of CIS standard classification processing, they
are stored as annotation objects rather than as category assignments. They can be exposed in
CenterStage as search filters or accessed using the Annotation API.
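The threshold rule described under "Configurable confidence threshold" can be pictured with a small Python sketch. This is an illustration only, not the CIS implementation: the function name, the 0 to 1 score scale, and the `pending_floor` parameter (a lower bound below which a document is not even proposed as a Pending candidate) are assumptions that do not come from this guide.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ASSIGNED = "assigned"   # score meets or exceeds the threshold
    PENDING = "pending"     # provisional assignment, to be reviewed by the category owner


def decide(score: float, threshold: float, pending_floor: float = 0.0) -> Optional[Outcome]:
    """Return the outcome for one document/category pair, or None if nothing is assigned.

    pending_floor is a hypothetical setting used only in this sketch.
    """
    if score >= threshold:
        return Outcome.ASSIGNED
    if score > pending_floor:
        return Outcome.PENDING
    return None


if __name__ == "__main__":
    for s in (0.9, 0.7, 0.5, 0.1):
        print(s, decide(s, threshold=0.7, pending_floor=0.3))
```

The sketch keeps the two outcomes named in the text (assigned, and Pending candidate) separate from the case where nothing is assigned at all.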

Entity extraction
CIS analyzes the content, metadata, and comments of documents to extract information relevant for
the end users. The extracted items are called entities and are presented as filters when users navigate in
CenterStage or run a search. The default entities extracted by CIS are the following:
Place: this filter includes geographical places and groups them by countries and cities. For the
USA, states are provided for information only and not as a group.
People: this filter corresponds to names of individuals.
Company: this filter contains names of organizations such as companies, institutions, or
associations.
Entity extraction is enabled by default for the repository used by CenterStage. It runs
automatically every half hour.
You can configure entity extraction to extract other entities, or to store the entities as annotation objects
and access them using the Annotation API.

Metadata extraction
With CIS, you can extract metadata from the content, properties, or repository attributes of
documents. Metadata is often defined as data about data. In this guide, metadata means the pieces of
information that describe the content of the documents. Metadata extraction relies
only on rules that you define. Like taxonomy-based classification, and unlike entity extraction,
it does not involve any content analysis.
Valuable information is sometimes difficult to capture. Metadata extraction allows you to find
metadata in the content, properties, or repository attributes of your documents and to label it.
Extracted metadata is stored as annotation objects, which you can access using the Annotation
API.

Chapter 2
Overview

This chapter describes the CIS product, and more specifically:
Components
Architecture
Roles
Limitations
Text extraction

Components
Content Intelligence Services includes these key components:
The Content Intelligence Services client (CIS client), such as Documentum Administrator or any
custom application using the Content Intelligence Application Programming Interface (CI API),
is used to create and manage the taxonomy used for categorizing documents. You
can also use Documentum Administrator to configure CIS. The CI API handles communication
between the CIS client, the CIS server, and the Documentum repository.
The Content Intelligence Services server (CIS server) performs the automatic categorization of
documents based on taxonomy and category definitions, and triggers the entity extraction.
The entity extraction server performs entity extraction using cartridges.
A repository is required to store CIS data, such as taxonomy definitions, document set (also called
docset) definitions, configuration files, and extracted entities.
The Annotation API allows access to the information stored as annotations, which can be the result
of entity extraction processing or metadata extraction processing.
When you create a taxonomy using Documentum Administrator or by importing a prebuilt taxonomy,
the objects comprising the taxonomy are saved into the repository containing the documents that CIS
will process. When you have finished creating or modifying the taxonomy in Documentum Administrator,
you synchronize the new definitions to make them available to the CIS server.

Architecture
The CIS server communicates with the Content Server using the Documentum Foundation Classes
(DFC). It is recommended to deploy CIS on a separate machine from the Content Server machine.
The repository of the Content Server stores the documents, whether already categorized or still to be
categorized, as well as the taxonomy definitions and the document set definitions. One CIS server can
only point to one repository. A repository must be enabled for CIS before starting any configuration.
Once the repository is enabled, the taxonomies and the document sets can be created using Documentum
Administrator. It is also possible to import existing taxonomies defined in a Taxonomy Exchange
Format (TEF) file.
When CIS is used for classification, two modes are available: the production mode and the test mode.
You can either use one CIS server for both modes or two CIS servers, one for each mode. One
repository can only use one CIS server for each mode. Using two modes allows the CIS user to
modify and test the taxonomies and document sets while the production server is still running.
There is no test mode for entity extraction or metadata extraction.
Figure 1. CIS architecture overview: Classification
Figure 2. CIS architecture overview: Entity extraction
Figure 3. CIS architecture overview: Metadata extraction

Roles
Implementing and working with Content Intelligence Services involves several distinct
roles. One person can hold several roles. The following list briefly describes each role:
The System administrator installs CIS and enables CIS in the repository. This person also monitors
the CIS server: ensures that the server is up and running, defines the document sets, checks the
logs for errors, and tracks the unprocessed or excluded documents.
When CIS is used for classification:
The Taxonomy Manager creates, tests, and maintains taxonomies. This person also sets category
owners and document sets, and verifies excluded or unprocessed documents.
When necessary, the Category Owner verifies the correct categorization and reviews pending
documents.
The General user can manually submit documents for CIS processing, and consumes the results of
the categorization by browsing the categories or using the attributes created by the categorization.
When CIS is used for entity extraction:
The Terminology manager, such as a librarian, adds named entities to the cartridge.
In CenterStage clients, the General user uses the filters Place, People, and Companies to navigate
or run a search.
All CIS-related tasks are performed in Documentum Administrator, except for those of the General user,
who performs tasks in a CIS client such as Webtop or CenterStage.

Limitations
This section describes some principles about CIS and known limitations.

CIS processes textual content


CIS only performs a textual analysis of the content of documents: it cannot process graphical content.
By default, CIS does not process a document that has no textual content (such as an image). To process
such files, set up category rules or configure processing on document properties only.

One CIS server can only work with one repository


One repository can work with one or two CIS servers (one in test mode and one in production mode)
but a CIS server can only work with a single repository.

CIS processing updates the last modified date


CIS standard classification processing changes the document properties last_modified_date and
last_modifier when one of the options Link to Folders or Assign as Attributes is selected.

Text extraction
This section describes the text extraction step that precedes any processing. Before performing any
analytics processing on a document, the content of the document and its properties are extracted
and processed by Oracle Outside In.

Document properties
When a document is processed, the CIS server automatically recognizes its format. The CIS server
can then extract the properties that are expected to be available for the document. Some property
values can only be extracted if they have been filled in by the document author. For example, the
Title of a PDF document is entered by the author. Appendix B, Properties Extracted, provides
information about the properties that are automatically extracted, depending on the document format.
The property values are automatically added to the content extracted from the documents. They can
then be used to match any category keyword, or to extract entities or metadata.

Documentum attributes
The Documentum attributes attached to documents, also called repository attributes, can be used
in several ways:
They can be used as filters when defining a document set in DA by adding a constraint like
attribute/operator/value.
They can be used in a property rule when defining a category to assign documents, by adding a
constraint like attribute/operator/value.
They can be used in addition to, or instead of, the content of the documents, as described in
Chapter 6, Configuring the Type of Content Processed.
They can be used to extract metadata. In this case, the attributes are accessed directly: they do
not need to be part of the content processed.

Part 1
Administration

This part includes the following chapters:


Chapter 3, Administer the CIS server
Chapter 4, Troubleshooting

Chapter 3
Administer the CIS server

This chapter provides instructions for basic CIS administration activities. It includes the following
activities:
Start/Stop the CIS server
Configure the CIS server
Monitor CIS server processing
The default CIS installation directory is:
C:\Program Files\Documentum\CIS on Windows hosts
$DOCUMENTUM_SHARED/cis on Linux hosts
This directory is referenced in CIS documentation as the variable path <CIS installation directory>.

Start/Stop the CIS server


There are several ways to start and stop the CIS server.
On Windows hosts, CIS is installed as the Windows service "Documentum Content Intelligence
Services". You can go to the Services section of the Computer Management dialog to start and stop
the CIS server.
Another possibility on Windows hosts is to start and stop the CIS server using the JMX server scripts:
<CIS installation directory>/service/startCIS.bat
<CIS installation directory>/service/stopCIS.bat
You can also monitor the CIS server status using the following script: <CIS installation
directory>/service/statusCIS.bat.
Be aware that the startCIS.bat script, unlike stopCIS.bat or statusCIS.bat, is not
compatible with Windows services: the startCIS.bat script does not start CIS as a Windows
service. If the CIS server is already running, the script tries to launch a new server but fails because
the port is already in use. You can use stopCIS.bat to stop CIS when it was started as a Windows
service, and then restart CIS as a Windows service. You can use statusCIS.bat to monitor the status of
the CIS Windows service.
On Linux hosts, use the CIS service as described in Manage CIS service on Linux, page 26.
Finally, on both Windows and Linux hosts, you can access the CIS server using its JMX Agent. You
can access the JMX Agent by URL, using either Documentum Administrator or JConsole.

Manage CIS service on Linux


On Linux hosts, use the CIS service to monitor, start, and stop the CIS server. CIS service is the cis
script located at <CIS installation directory>/service.
Note: The Linux CIS service does not manage the entity extraction server. Manage entity extraction
services, page 97 describes how to start, stop, or monitor the entity extraction server.
The following table describes the options that can be used with the script.
Table 1. CIS service options (Linux)

| Option | Expected output | Description |
|---|---|---|
| status | Checking for service CIS or CIS is not running | Indicates if the CIS server is running or not. Does not indicate processing errors. |
| start | Starting CIS or CIS is already running | Allows you to start the CIS server. |
| stop | Shutting down CIS | Allows you to stop the CIS server. |
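As a hedged illustration of how the status option could be used from a monitoring script, the following Python sketch runs the cis script and interprets its output. It assumes that the messages listed in Table 1 appear verbatim in the script output; the helper names and the way the script path is passed are assumptions made for this example, not part of CIS.

```python
import subprocess


def parse_cis_status(output: str) -> bool:
    """Return True if the status text reports a running server, False otherwise.

    Assumes the messages from Table 1 appear verbatim in the output.
    """
    if "CIS is not running" in output:
        return False
    if "Checking for service CIS" in output:
        return True
    raise ValueError(f"Unrecognized status output: {output!r}")


def query_cis_status(cis_script: str) -> bool:
    """Run '<cis_script> status' and interpret what it prints.

    cis_script is the path to the cis script in <CIS installation directory>/service.
    """
    result = subprocess.run([cis_script, "status"], capture_output=True, text=True)
    return parse_cis_status(result.stdout + result.stderr)
```

Keep in mind that, as the table states, a running server does not imply that processing is free of errors; the log files remain the place to check for those.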

Configure the CIS JMX agent


The JMX agent is accessible from Documentum Administrator and from JConsole.

CIS JMX Agent in JConsole


The JConsole executable is available at <JDK_HOME>/bin, where <JDK_HOME> is the installation
directory for the JDK. To access the CIS JMX agent from JConsole, you can do one of the following:
Provide the process ID of the application, for local monitoring.
Provide the host name and port number, for remote monitoring.
For more information about monitoring a JMX agent from the JConsole, refer to Java documentation
such as: http://java.sun.com/j2se/1.5.0/docs/guide/management/jconsole.html.

CIS JMX agent in Documentum Administrator


In Documentum Administrator, under the Resource Management node, add a new Resource Agent.
The Documentum Administrator User Guide describes this procedure in detail.
To add a new Resource Agent, provide a JMX URL. The CIS server JMX URL is:
service:jmx:rmi:///jndi/rmi://<cishost>:<port>/cisserveragent

where cishost is the fully qualified domain name, that is, including the hostname and domain name,
where CIS resides, and port is an RMI port set in cis.properties by the cis.jmx.agent.port parameter.
By default, the JMX port is 8061.
You must be a member of the ci_taxonomy_manager_role to create a Resource Agent for the CIS
server.
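The URL pattern above can be assembled programmatically, for example when several CIS hosts have to be registered. The following Python sketch only formats the string given in this section; the function name and the example host name are hypothetical.

```python
def cis_jmx_url(cishost: str, port: int = 8061) -> str:
    """Build the CIS server JMX URL for a fully qualified host name.

    The default port is the default value of cis.jmx.agent.port.
    """
    if not 0 < port < 65536:
        raise ValueError(f"Invalid port number: {port}")
    return f"service:jmx:rmi:///jndi/rmi://{cishost}:{port}/cisserveragent"


if __name__ == "__main__":
    # cis01.example.com is a placeholder host name.
    print(cis_jmx_url("cis01.example.com"))
```

If cis.jmx.agent.port was changed in cis.properties, pass the same value as the port argument.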

Configure the CIS server


As with starting and stopping, the CIS server can be configured in several ways:
As an application
As a Windows service
As a Java application

As an application
A first level of configuration can be set by modifying the properties file of the CIS application.
Properties files can be found at <CIS installation directory>/config. They consist of:
cis.properties. Modify this file as described in To modify cis.properties, page 27.
dfc.properties. The Documentum Foundation Classes documentation provides more information
about DFC parameters.
patterns.properties. Use this file to define patterns as described in Patterns as evidence terms,
page 114.
log4j.xml. Use this file to configure Log4j. CIS server log files, page 32, provides more details
about log files.
Note that, to apply the changes made in the properties or configuration file, you need to restart
the CIS server.

To modify cis.properties
1. Stop the CIS server.
2. Locate the cis.properties file in <CIS installation directory>/config.
3. Open the cis.properties file with a text editor.
4. Update the parameter setting as needed. Table 2, page 28, provides details on available
configuration parameters.
5. Restart the CIS server.
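Steps 2 to 4 of this procedure can also be scripted. The following Python sketch is one possible approach, not a tool shipped with CIS: it keeps a backup copy of the file and rewrites a single key=value line. The function name is hypothetical, and the sketch assumes simple one-line key=value entries (no continuation lines). The CIS server must still be stopped before the change and restarted afterwards.

```python
import shutil
from pathlib import Path


def set_property(properties_file, key, value):
    """Set key=value in a .properties file, keeping a .bak copy of the original."""
    path = Path(properties_file)
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    # Java properties files are read as ISO-8859-1 by default.
    lines = path.read_text(encoding="latin-1").splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and lines that are not key=value entries.
        if stripped.startswith(("#", "!")) or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    path.write_text("\n".join(lines) + "\n", encoding="latin-1")
```

For example, `set_property("<CIS installation directory>/config/cis.properties", "cis.server.port", "8080")` would change the server port, with the placeholder replaced by the actual installation path.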

The following table provides details on parameters that can be set in the cis.properties file. The
filepaths are relative to <CIS installation directory>.

Table 2. Configuration parameters in the cis.properties file

| Parameter | Description |
|---|---|
| Server port section | |
| cis.server.port | The server port number used by CIS. The port number defined here must match the one defined in DA. In DA, you can specify the port number with the server host name. Default value (for DA): cis.server.port=8079 |
| Docbase settings section | |
| cis.server.repository | The name of the repository that the CIS server connects to. This repository contains all taxonomy definitions and the documents to classify. |
| Docbase credentials files directory section | |
| cis.server.credentials.dir | The path to the directory where the file containing the login and password for repository access is stored. Default value: cis.server.credentials.dir=repodata/authentication |
| Document exclusion section | |
| cis.server.docexclusion.dir | The path to the directory where the file listing excluded documents and the file listing unprocessed documents are located. This directory must exist. By default: cis.server.docexclusion.dir=repodata/docexclusion |
| File docset Folder section | |
| cis.server.file_docset.dir | The path to the directory where the configuration files for document sets are located. cis.server.file_docset.dir=repodata/file_docset |
| cis.server.centerstage.enabled | This parameter indicates whether the CIS server must extract entities for CenterStage spaces: cis.server.centerstage.enabled=true Note: The extraction of entities must also be enabled on the CenterStage side. This parameter has no impact on the classification processes on regular document sets. |
| cis.server.centerstage.interval | This parameter sets the delay, in seconds, between the end of a processing run on documents in CenterStage spaces and the start of the next one. It is also the frequency at which CIS looks for new spaces. By default, it is set to run every half hour: cis.server.centerstage.interval=1800 |
| Limit settings section | |
| cis.server.limit.max_file_size | The maximum size (in bytes) of files processed by the CIS server. Files whose size is bigger than this limit are not processed and an error is logged in the log file. By default: cis.server.limit.max_file_size=10000000 |
| cis.classification.limit.max_content_size | The maximum size (in bytes) of extracted text content per file processed. When the CIS server processes files whose content size is greater than this limit, the content is partially extracted up to the limit and a warning is logged in the log file. By default: cis.classification.limit.max_content_size=1000000 Note: Archive files are processed as one file. The limit applies to the total content size of all files contained in the archive file, not to the content size of each individual file contained in the archive. |
| Thread pools section | |
| cis.server.execution.threads | The maximum number of threads allocated to the execution of classification tasks. If CIS was installed to perform classification processing, the default value is 5; if CIS was installed for CenterStage, that is for entity extraction, the default value is 1. |
| cis.server.scheduling.threads | The maximum number of threads allocated to the scheduling of document sets. By default: cis.server.scheduling.threads=5 |
| cis.server.scheduling.queue_delay | Time in seconds before the first execution of the processing queue. By default: cis.server.scheduling.queue_delay=60 |
| cis.server.scheduling.queue_interval | Time in seconds between two consecutive queue processing runs. By default: cis.server.scheduling.queue_interval=1800 |
| Classification engine section | |
| cis.classification.matching_window_size | The window size for proximity matching of evidence terms. The size is expressed in the number of word positions in the window, hence it cannot be negative. All evidence terms for a category must match inside the window. Set the window size to 0 to deactivate the proximity checking. By default: cis.classification.matching_window_size=1000 |
| Patterns (regular expressions) section | |
| cis.server.patterns.file | The configuration file that contains pattern definitions. The CIS server loads this file by its name from the server classpath. By default: cis.server.patterns.file=patterns.properties |
| Linguistic configuration section | |
| cis.linguistic.language.default and cis.linguistic.stemming.allowed | These parameters set the default language for the stemming feature and whether it is activated or not. By default: cis.linguistic.language.default=english and cis.linguistic.stemming.allowed=true Stemming capability, page 112 provides more information on the linguistic configuration. |
| Luxid configuration section | |
| cis.entity.luxid.annotation_server.host | The host of the Luxid Annotation Server, the entity extraction server. It is only required for the entity extraction. By default, it is set to localhost because the installer installs the entity extraction server on the same host as the CIS server: cis.entity.luxid.annotation_server.host=localhost |
| cis.entity.luxid.annotation_server.cpu | The number of CPUs used by Luxid Annotation Server. By default, it is set to 1: cis.entity.luxid.annotation_server.cpu=1 |
| cis.entity.luxid.limit.max_text_size | The maximum size (in bytes) of text submitted for the entity detection per file processed. When the CIS server processes files whose content size is greater than this limit, only the content up to the limit is submitted and a warning is logged. By default: cis.entity.luxid.limit.max_text_size=50000 Increasing the limit impacts both the performance and the memory usage. |
| cis.entity.luxid.limit.max_entities_per_type | The maximum number of detected entities stored per entity type and per file processed. By default: cis.entity.luxid.limit.max_entities_per_type=10 |
| cis.entity.luxid.limit.detection_timeout | The timeout for detecting entities in one file, in seconds. By default: cis.entity.luxid.limit.detection_timeout=1800 |
| cis.entity.luxid.resource.dir | The directory containing Luxid resource files. cis.entity.luxid.resource.dir=resources/luxid |
| cis.entity.luxid.tmp.dir | The directory in which to store temporary documents to send to Luxid Annotation Server. This directory must exist. cis.entity.luxid.tmp.dir=repodata/luxid_tmp |
| JMX agent configuration section | |
| cis.jmx.agent.port | The RMI port number used for the JMX Agent URL. By default, the JMX port is 8061: cis.jmx.agent.port=8061 |
| External native libraries section | |
| cis.native.lib.dir | The directory containing DLLs for external native libraries. Dependent native libraries are searched in the global PATH. cis.native.lib.dir=lib |
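To make the effect of cis.classification.matching_window_size more concrete, the following Python sketch checks whether all evidence terms occur within a window of a given number of word positions. It is a simplified illustration under stated assumptions, not the CIS classification engine: terms are single lowercase words compared exactly, with no stemming, phrases, or patterns, and the function name is hypothetical.

```python
from collections import Counter


def terms_match_within_window(tokens, terms, window_size):
    """Return True if every term occurs within window_size consecutive word positions.

    A window_size of 0 disables the proximity check: the terms only need to be
    present somewhere in the token list.
    """
    if window_size < 0:
        raise ValueError("window_size cannot be negative")
    wanted = set(terms)
    if not wanted:
        return False  # no evidence terms, nothing to match

    # Positions of the tokens that are evidence terms, in document order.
    hits = [(pos, tok) for pos, tok in enumerate(tokens) if tok in wanted]
    if not wanted <= {tok for _, tok in hits}:
        return False
    if window_size == 0:
        return True

    counts = Counter()
    left = 0
    for pos, tok in hits:
        counts[tok] += 1
        # Drop hits on the left until the span fits inside the window.
        while pos - hits[left][0] + 1 > window_size:
            old = hits[left][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        if len(counts) == len(wanted):
            return True
    return False
```

With the default window of 1000 word positions, terms that are spread across a long document may fail the proximity check even though each of them appears in the text.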

As a Windows service
Another set of configuration parameters is available when the CIS server runs as a Windows
service. Edit the C:\Program Files\Documentum\CIS\service\wrapper.conf file and modify the
parameters as required, for example, the Wrapper Logging Properties.
You can also modify the recovery parameters of the CIS Windows service. Go to the Services section of
the Computer Management dialog, and open the Properties of the service. In the Recovery tab, you
can modify the default recovery settings. The CIS Windows service is configured to restart as follows:
First restart is immediate.
Second restart occurs after 30 seconds.
Third and subsequent restarts occur after 10 minutes.
The restart count is reset every 60 minutes.
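The default recovery schedule listed above can be summarized as a small lookup, shown here as a Python sketch. It only restates the delays given in this section; the function name is hypothetical and nothing in it interacts with the Windows service manager.

```python
def restart_delay_seconds(failure_number: int) -> int:
    """Return the delay before the restart that follows the n-th failure.

    failure_number counts failures since the restart count was last reset
    (the count is reset every 60 minutes).
    """
    if failure_number < 1:
        raise ValueError("failure_number starts at 1")
    if failure_number == 1:
        return 0        # first restart is immediate
    if failure_number == 2:
        return 30       # second restart after 30 seconds
    return 10 * 60      # third and subsequent restarts after 10 minutes
```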

As a Java application
On Windows hosts, the CIS server can also be configured by editing the startCIS.bat script in <CIS
installation directory>. You can edit the script and modify CIS parameters:
CIS server system setup, such as the CIS_PATH environment variable
CIS server Java options, such as the CLASSPATH variable and the memory configuration
Remember to create a backup before modifying the script.

Example 3-1. startCIS.bat script example


@echo off
set CIS_PATH=C:\Program Files\Documentum\CIS
set JAVA=C:\Program Files\Documentum\java\1.6.0_17\bin\java
set JAVA_OPTS=-Xmx256m "-Djava.library.path=C:\Program Files\Documentum\CIS\lib" "-Djava.security.policy=C:\Program Files\Documentum\CIS\resources\luxid\LAF.policy"
set CLASSPATH=%CIS_PATH%\config;%CIS_PATH%\resources;%CIS_PATH%\repodata\authentication;%CIS_PATH%\lib\cis_server.jar
"%JAVA%" %JAVA_OPTS% -cp "%CLASSPATH%" com.documentum.cis.service.server.CISServerLauncher

CIS_PATH indicates the path where CIS has been installed.

Monitor CIS server processing


You can monitor the CIS server status using Documentum Administrator, where the CIS server can
be accessed as a resource agent, or using JConsole. Configure the CIS JMX agent, page 26, describes
how to configure and access the CIS JMX Agent.
You can also monitor CIS activity using CIS server log files, which contain a record of the actions of
the CIS server.
Additional log files can be configured using the logging utility Log4j.
Two status files indicate which documents have not been processed or have been excluded.

CIS server log files


By default, the following server log files are available:
cis.log contains information about the operations of the CIS server, including normal activity.
cis-error.log contains messages with status ERROR and above.
cis-activity.log contains information related to normal activity.
cis-scripts.log contains messages related to the execution of any CIS script (such as
tef2repository, clear_entities, and so on).
cis-svc.log contains the messages related to CIS server start and stop. This information is
a complement to DmCISService.log (which is only available on Windows hosts for the CIS
Windows service). When troubleshooting CIS server startup issues, look first at cis-svc.log
then at DmCISService.log.
DmCISService.log contains information about dates and time when starting and stopping
the server as a Windows service.
Log files are available at <CIS installation directory>/logs, except the log file for the service, which is
located at: <CIS installation directory>/service.

An additional log file can be created for troubleshooting: cis-activity-detailed.log. This log
file is more verbose and contains information about both normal and detailed activity. Refer to the
procedure Modify the level of details in the detailed activity log file, page 35 to enable it.
The Log4j configuration for CIS server logging is in the file <CIS installation directory>/config/log4j.xml.
The Log4j project website (http://logging.apache.org/log4j/docs/index.html) provides information on
configuring log statements.

Monitor CIS processing in cis-activity.log


The cis-activity.log file gives you detailed information such as the number of documents that
were processed, and whether the processing was successful, canceled, or failed.
The following example is an extract of a cis-activity.log. The cis-activity.log file contains more
information than what is described in the example.
Example 3-2. cis-activity.log sample
-01-21 16:42:22,311 [taskExecutor_0] activity.normal - Docset completed:
Songs_0b110a308000710d. tasks:37 success:37 canceled:0 errors:0

For each document set, you have the following information:


tasks: total number of documents that were processed, successfully or not.
success: number of documents successfully processed.
canceled: number of documents for which the processing was stopped, for example if a
document was processed by another thread.
errors: number of documents for which the processing failed and returned errors.
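The "Docset completed" entries lend themselves to automated checks. The following Python sketch extracts the counters described above from one log line. It assumes that each entry is written on a single line in the actual log file (the sample in Example 3-2 is wrapped for display) and that the field layout matches the sample; the function name is hypothetical.

```python
import re

# Matches the message part of a "Docset completed" entry, as in Example 3-2.
DOCSET_COMPLETED = re.compile(
    r"Docset completed:\s*(?P<docset>.+?)\.\s+"
    r"tasks:(?P<tasks>\d+)\s+success:(?P<success>\d+)\s+"
    r"canceled:(?P<canceled>\d+)\s+errors:(?P<errors>\d+)"
)


def parse_docset_summary(line):
    """Return the docset name and its counters as a dict, or None if the line does not match."""
    match = DOCSET_COMPLETED.search(line)
    if match is None:
        return None
    summary = {"docset": match.group("docset")}
    for counter in ("tasks", "success", "canceled", "errors"):
        summary[counter] = int(match.group(counter))
    return summary
```

A monitoring script could, for instance, raise an alert whenever the errors counter of a document set is greater than zero.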

Status files for unprocessed documents


In addition to the log files, two status files list the documents that have not been
processed or that have been excluded from subsequent processing runs. These files are available in the
docexclusion folder, in <CIS installation directory>/repodata:
unprocessed_docs.txt lists all the documents that have not been processed. Each line entry
corresponds to a document and provides information to identify the document and the cause of
the error. This file can be renamed or deleted while the CIS server is running. It is automatically
created again when needed.
excluded_docs.txt lists the documents that the CIS server could not process or that the CIS server
suspects of having caused a crash or a hang (note that they are only suspected and might not have
actually caused the crash or hang). These documents are skipped during subsequent processing runs.
Note: The unprocessed documents are documents that the CIS server could not process, for example,
when the file format is not supported or when the maximum file size is exceeded. No
classification was performed on them, so it does not mean that these documents would not have
matched any category had their processing been possible.
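The interaction between the two size limits from Table 2 (cis.server.limit.max_file_size and cis.classification.limit.max_content_size) can be sketched as follows. This Python function only restates the documented behavior with the default values; its name and return values are assumptions made for the illustration.

```python
DEFAULT_MAX_FILE_SIZE = 10_000_000     # cis.server.limit.max_file_size, in bytes
DEFAULT_MAX_CONTENT_SIZE = 1_000_000   # cis.classification.limit.max_content_size, in bytes


def plan_processing(file_size, content_size,
                    max_file_size=DEFAULT_MAX_FILE_SIZE,
                    max_content_size=DEFAULT_MAX_CONTENT_SIZE):
    """Return (decision, bytes_of_text_processed) for one file.

    "skipped":   the file exceeds max_file_size, so it is not processed (an error is logged).
    "truncated": the extracted text exceeds max_content_size, so only the text up
                 to the limit is used (a warning is logged).
    "full":      the whole extracted text is processed.
    """
    if file_size > max_file_size:
        return ("skipped", 0)
    if content_size > max_content_size:
        return ("truncated", max_content_size)
    return ("full", content_size)
```

Only the first case leaves the document unprocessed; a truncated document is still classified, on partial content.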

How to read the unprocessed_docs.txt file


The unprocessed_docs.txt file lists the documents that were not processed during the last CIS run
and is in comma-separated values (CSV) format.
Each unprocessed document corresponds to a row that indicates the type of error, an internal
identifier of the reason the document was not processed, the object ID, when the error was
reported, some information that depends on the type of error, and the name of the file.
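Because the file is in CSV format, it can be read with any CSV parser. The following Python sketch assumes that the columns appear in the order given above, with a variable number of error-specific fields between the report date and the file name; this column order, the dictionary keys, and the function name are assumptions for the illustration and should be checked against an actual file.

```python
import csv
import io


def read_unprocessed(text):
    """Parse the content of unprocessed_docs.txt into a list of dictionaries."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 5:
            continue  # blank or incomplete line
        rows.append({
            "error_type": fields[0],
            "reason": fields[1],
            "object_id": fields[2],
            "reported": fields[3],
            "details": fields[4:-1],   # depends on the type of error
            "file_name": fields[-1],
        })
    return rows
```

A typical use is to count rows per error type, for example with `collections.Counter(row["error_type"] for row in rows)`, to see whether size limits or extraction failures dominate.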
Table 3. Possible errors for unprocessed files

| Type of error | Description | Related information available in the unprocessed_docs.txt file |
|---|---|---|
| TOOLARGE | The document is too large to load and process, therefore it is skipped. | The maximum supported content size. The document size. |
| EXTRACTION | The text extraction failed for this document. | The module. The type of issue. The message (often containing the error code). |
| CLASSIFICATION | The text classification failed for this document. This error implies that the extraction was successful. | The module. The type of issue. The message (often containing the error code). |
| SUSPECTEDCRASH | The document is suspected to make the CIS server crash or hang. | The date when the processing of the suspected document started. |
| CONTENTERROR | The document cannot be retrieved from the repository. | The error message from the repository. |

Chapter 4
Troubleshooting

This chapter covers common CIS errors and frequently asked questions about CIS processing. It
can help you troubleshoot issues you encounter with CIS.
Modify the level of details in the detailed activity log file
Most common errors
Frequently asked questions

Modify the level of details in the detailed activity log file
A good practice to troubleshoot issues is to get detailed logs. The cis-activity-detailed.log
file provides information about every operation for a document such as text extraction, classification,
entity extraction, and so on.
The following procedure describes how to enable the detailed activity log.

To enable the detailed activity log:

1. Navigate to <CIS installation directory>/config and edit the configuration file log4j.xml.

2. Uncomment the section for the appender with name "activity-detailed":

<!--appender name="activity-detailed" class="org.apache.log4j.RollingFileAppender">
  <param name="File" value="$H(DFC_DATA_DIR)/logs/cis-activity-detailed.log"/>
  <param name="Threshold" value="all"/>
  <param name="MaxFileSize" value="100MB"/>
  <param name="MaxBackupIndex" value="10"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern"
           value="%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] %c - %m%n"/>
  </layout>
</appender-->

Note: In the log4j configuration, an appender refers to an output destination. For CIS, each
appender creates a log file.
3. Uncomment all appender references to "activity-detailed" in the category elements
"activity.normal" and "activity.detailed":

<category name="activity.normal">
  <level value="on"/>
  <appender-ref ref="file"/>
  <appender-ref ref="console"/>
  <appender-ref ref="activity"/>
  <!--appender-ref ref="activity-detailed"/-->
</category>
<category name="activity.detailed">
  <level value="off"/>
  <!--appender-ref ref="activity-detailed"/-->
</category>

4. Set the level of the category activity.detailed to on:

<category name="activity.detailed">
  <level value="on"/>
  <appender-ref ref="activity-detailed"/>
</category>
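
After editing log4j.xml, a quick check can confirm that the detailed activity log is enabled. The following Python sketch is one possible way to do this with the standard library XML parser. It relies only on the element and attribute names shown in the procedure above; the simplified root element and the two sample documents are assumptions made for the example, not the actual CIS file.

```python
import xml.etree.ElementTree as ET


def detailed_log_enabled(log4j_xml_text):
    """Return True if the appender is defined and activity.detailed uses it."""
    root = ET.fromstring(log4j_xml_text)
    # Commented-out elements are ignored by the parser, so a commented
    # appender counts as "not defined".
    appender_defined = any(
        appender.get("name") == "activity-detailed"
        for appender in root.iter("appender")
    )
    for category in root.iter("category"):
        if category.get("name") != "activity.detailed":
            continue
        level = category.find("level")
        level_on = level is not None and level.get("value") == "on"
        refs = [ref.get("ref") for ref in category.findall("appender-ref")]
        return appender_defined and level_on and "activity-detailed" in refs
    return False


# Simplified, hypothetical configuration fragments for illustration.
ENABLED = """<configuration>
  <appender name="activity-detailed" class="org.apache.log4j.RollingFileAppender"/>
  <category name="activity.detailed">
    <level value="on"/>
    <appender-ref ref="activity-detailed"/>
  </category>
</configuration>"""

DISABLED = """<configuration>
  <!--appender name="activity-detailed" class="org.apache.log4j.RollingFileAppender"/-->
  <category name="activity.detailed">
    <level value="off"/>
    <!--appender-ref ref="activity-detailed"/-->
  </category>
</configuration>"""

print(detailed_log_enabled(ENABLED))
print(detailed_log_enabled(DISABLED))
```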

Most common errors


This section provides examples of the most common error messages. The error messages appear in
the log files (such as cis.log and install.log), in DA, or as traces in the console if you started the
CIS server using the command script.
Example 4-1. DA version and CIS server version are not compatible.

In cis.log:
ERROR 2007-06-18 11:49:58,937
com.documentum.cis.service.internal.communication.CommandReader
[Stream reader (clientId=1)] - IO error when receiving a command
from a remote client (clientId=1)java.io.InvalidClassException:
com.documentum.cis.service.internal.command.SynchronizeTaxonomyCommand;
local class incompatible: stream classdesc serialVersionUID =
231215431745769233, local class serialVersionUID = -5880332928005196941

In DA:
Error in updating content intelligence configurations
Response not received for command=Version negotiation (maxVersion=1) with id=1

or (in DA log)
16:49:02,427 ERROR CISClientManager Error while sending a command to server
(command=Synchronize taxonomy (taxonomyId=0b1109558000111e, execution mode=TEST))
java.io.IOException: Disconnection of the server

Context: This can happen when you have updated CIS but not DA. We recommend using the same
version of the CIS server and DA or any other CI API client.
Solution: Check the Content Intelligence Services Release Notes or the Content Intelligence Services
Installation Guide to understand which versions of DA and CIS are compatible.
Example 4-2. The authentication against the repository failed (cis.log or console)
CIS server starting...
Invalid or no repository credentials. Waiting for credentials before
CIS server can actually start.
CIS server starts listening on port 8079


Context: When the CIS server starts, it checks the user credentials against the repository before
opening a session. If no credentials are found or if they are invalid (for example, after a repository
change or for a repository previously enabled for another CIS server), the CIS server starts in a
restricted mode that only allows receiving new or updated credentials. You cannot launch any
classification run but you can change the credentials in Documentum Administrator. When the CIS
server receives the valid credentials, it tries to connect to the repository. If successful, it switches to
full mode and the following message appears:
Credentials ok, CIS server connected to Repository.

Solution: Set the CIS parameters (login and password) in DA to create an authentication file.
Example 4-3. CIS server cannot connect to the repository (install.log, cis.log)

If the CIS server cannot connect to the repository, it periodically tries to reconnect. Attempts are
made every minute for five minutes, then at increasing intervals, and eventually every hour until
the connection succeeds or the server is shut down. Therefore, the CIS administrator no longer
needs to restart the CIS server manually.
Context: This error can occur if the repository has been modified, if the docbroker has been moved
to another machine, or after a network issue.
21:33:36,429 ERROR [Thread-0] com.documentum.cis.service.internal.
communication.CommandClient - IO error when initializing the command client
(attempting to open a socket connection to server host=CMAQAWIN2K8598 port=8079)
java.net.ConnectException: Connection refused: connect
Example 4-4. DA cannot connect to the CIS server. (error in DA)
Error in updating content intelligence configurations - Connection refused: connect

Solution:
Check if the CIS server is up and running, for example, look at the Windows service status.
Verify that the CIS port is correctly set: on the CIS host, in cis.properties file, and in DA, in the CIS
configuration page.
Look for any network interference: firewall, antivirus application, etc.
Upgrade DA to the same version as the CIS server, or at least update ci.jar on DA. The
procedure to update ci.jar is described in the Content Intelligence Services Installation Guide, in
the Troubleshooting chapter.
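
To verify the first two points of this solution from the DA host, you can test whether a TCP connection to the CIS port can be opened. The following Python sketch is a minimal example under stated assumptions: the function name is hypothetical, and the host name in the usage line is a placeholder. A successful connection only shows that something is listening on the port, not that the CIS server and DA versions are compatible.

```python
import socket

DEFAULT_CIS_PORT = 8079  # default CIS port stated in this guide


def cis_port_reachable(host, port=DEFAULT_CIS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the CIS host configured in DA.
    print(cis_port_reachable("localhost"))
```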
Example 4-5. Connection Broker (docbroker) not reachable from the CIS server. (cis.log)

If the connection broker is not reachable from the CIS server, most likely the CIS server will not
be able to connect to the repository.
ERROR 2009-06-08 10:17:31,641 com.documentum.cis.service.internal.scheduling.
ExecuteQueuedDocumentsCommand [schedulerExecutor_1] - Error creating queue
processing task set. Will try again on next iteration.com.documentum.cis.
service.internal.content.ContentException: DfDocbrokerException:: THREAD:
schedulerExecutor_1; MSG: [DFC_DOCBROKER_REQUEST_FAILED] Request to Docbroker
"mycs:1489" failed; ERRORCODE: ff; NEXT: null

Solution: Check with the network administrator.


Example 4-6. Entity extraction engine not started or not reachable. (cis.log)
ERROR 2009-06-08 12:03:38,516 com.documentum.cis.service.internal.sequencing.Task
[taskExecutor_0] - Error while processing task on [Test personne.ppt]
(com.documentum.cis.shared.config.ConfigurationException: An error occured while
opening a session on the Luxid Annotation Server on host=localhost)


Solution:
Check if the entity extraction server is up and running, for example, look at the Windows services
status.
Verify the configuration of the entity extraction server in cis.properties file.
Example 4-7. CenterStage not installed on repository. (cis.log)
ERROR 2010-04-23 12:06:42,788 com.documentum.cis.service.internal.adapter.dfc.
DfcFileSpaceWatcher [main] - Space watcher is enabled but CenterStage seems not
to be present. Please check property 'cis.server.centerstage.enabled' in the
cis.properties configuration file (CIS server needs to be restarted after this
file is modified).

Solution: Verify the configuration in cis.properties file.

Frequently asked questions


I have changed a taxonomy or a document set, and
reprocessing does not take my changes into account
Context: You are using CIS for standard classification.
Solution: First, remember that only synchronized taxonomies are used for the classification. When
you change a taxonomy definition or a document set definition, the changes are not applied
automatically, so that documents are not reprocessed at unexpected times. To trigger the
reprocessing of the documents, you have to clear the assignments already created for these documents.

Stemming is not available for my language


Context: You are using CIS for taxonomy-based classification for a language other than English,
French, German, Spanish, Portuguese, Italian, Norwegian, Swedish, Danish, Dutch, Romanian,
Russian, Finnish, Hungarian, or Turkish.
Solution: First, disable stemming. When you turn off stemming, the CIS server looks for an exact
match with the defined term. This means that, for example, CIS will not recognize the plural form
of a noun or different forms of the same verb. When stemming is disabled, explicitly add as terms
all of the forms you want the CIS server to recognize. You can also work with a thesaurus and enter
synonyms with all their forms (plural, and so on) as keywords.


How can I improve the performance?


Context: You are using CIS for standard classification.
The following are the main performance factors, in order of decreasing impact:

1. Number of categories assigned per document.

2. Document format. For example, the same document takes more time to process in PDF format
than in Word format.

3. Size of the text content of the document. For better performance, this size is limited by the
cis.classification.limit.max_content_size parameter in the cis.properties configuration file,
described in Table 2, page 28.

4. Binary size of the document. This size is limited by the cis.server.limit.max_file_size
parameter in the cis.properties configuration file, described in Table 2, page 28. Files larger than
this limit are not processed and an error is logged in the log file. By default:
cis.server.limit.max_file_size=10000000.

5. Number of threads. The number of executing threads is optimized for the type of processing
selected during CIS installation. If CIS is used for classification, the default number of
threads is 5. If CIS is installed for CenterStage, that is, for entity extraction, the default number of
threads is set to 1 to optimize the data throughput with the entity extraction server. If you want
to use both types of processing, or if you are using a type of processing different from the one
chosen during installation, you should adjust the number of threads.

6. Classification options. For example, CIS cannot update an attribute of a locked or immutable
object, or link such an object into a folder. This includes lightweight objects (for example,
dm_message_archive). If you try to apply these options to such objects, the result is a large
number of errors and warnings, which in turn lowers the performance. The only processing possible
for these types of objects is category assignment, which creates relations of type
dm_category_assign. This means that you have to deselect the options Link assigned documents into
category folders and Update document attributes with category assignments in Documentum
Administrator.

The size of the document set has no impact on the performance whether you define one large
document set, or ten small ones.
Similarly, the number of taxonomies and categories has no impact on the performance.
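
The binary size limit described in this list can be checked before documents are submitted for processing. The following Python sketch lists the files whose size exceeds the default value of cis.server.limit.max_file_size. It assumes that the limit is expressed in bytes, which this guide does not state explicitly, and the function name is hypothetical.

```python
import os

# Default value of cis.server.limit.max_file_size given in this guide.
# Treating it as a number of bytes is an assumption of this sketch.
DEFAULT_MAX_FILE_SIZE = 10000000


def oversized_files(paths, limit=DEFAULT_MAX_FILE_SIZE):
    """Return the paths whose file size is strictly greater than the limit."""
    return [path for path in paths if os.path.getsize(path) > limit]
```

For example, oversized_files(["report.pdf", "archive.ppt"]) returns the subset of these two hypothetical files that CIS would skip under the default limit.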


Part 2
CIS in Documentum Administrator

For your convenience, this part includes the Content Intelligence chapter from the Documentum
Administrator User Guide. It describes all actions that can be performed in Documentum
Administrator, such as configuring CIS for a repository or defining a taxonomy for classification.


Chapter 5
Content Intelligence Services

This chapter contains the following topics:


Content Intelligence Services
Setting up Content Intelligence Services
Enabling Content Intelligence Services
Missing Content Intelligence node
Modifying Content Intelligence Services configuration
Building taxonomies
Processing documents
Refining category definitions

Content Intelligence Services


Content Intelligence Services (CIS) is an EMC Documentum product that automatically categorizes
documents based on an analysis of their content, their metadata, or their Documentum attributes.
Content Intelligence Services organizes documents into categories that are maintained in a taxonomy.
A taxonomy is a hierarchical set of categories used to organize content in the repository based on a set
of criteria different from the cabinet and folder structure. This alternate organization, often based
on the subject matter of the content, provides a place for users to look for all content related to
common topics of interest. On Content Server, Documentum Administrator is used to configure CIS
and categorize documents.


Providing evidence
The quality of the classification relies on the category definitions. The more accurate the definition,
the better the classification. To define efficient categories, you can act on several aspects:
You can use keywords, that is, evidence terms and their respective confidence values. Evidence
terms can be simple terms or phrases, for which you choose to apply a stemming analysis or keep
the phrase order. You can also define patterns using regular expressions.
You can set property rules that allow you to define category assignments according to the values
of the repository attributes.
You can use evidence from other categories by setting category links.

About confidence values and score thresholds


Each category definition lists the words and phrases that serve as evidence that a document belongs
in the category. These words and phrases are called evidence terms. When CIS server analyzes a
document, it reviews all the content of the document, looks for these terms, and determines whether
to score a hit based on which terms it finds. Based on the number of hits, CIS server calculates a
document score for each category.
Each evidence term in the category definition has a confidence value assigned to it. The confidence
value specifies how certain CIS server can be about scoring a hit for a document when it contains
the term. For example, if a document includes the text IBM, CIS server can be nearly certain that the
document relates to the category International Business Machines. Therefore, the confidence level for
the term IBM is High.
Other pieces of evidence may suggest that the category might be appropriate. For example, if a
document includes the text Big Blue, CIS server cannot be certain that it refers to International
Business Machines. The confidence level is Low, meaning that CIS server should score a hit for the
category International Business Machines only if it encounters the text Big Blue and other evidence of
the same category in the document.
You can also exclude evidence terms. For example, suppose you have a category for the company
Apple Computers. The term Apple is certainly evidence of the category. However, if the term
fruit appears in the same document, you can be fairly sure that Apple refers to the fruit and not
the company. To capture this fact, you would add fruit as an excluded evidence term to the Apple
Computers category.
Finally, you can define terms as required terms. In this case, the document must contain at least one
Required term. If only Required terms are defined for the category, then only one is sufficient to
assign the document to the category. If the evidence terms are not only Required terms, then the
document must contain one Required term and have a confidence score high enough for the category.
The confidence values for evidence terms are integers between 0 and 100.
When you set confidence values in Documentum Administrator, you can choose a predefined
confidence level or enter a number directly. The predefined values are:
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: This evidence by itself does not cause CIS server to score a hit for a document.
However, it increases the confidence level of other evidence found in the same document.
Exclude: If one of the evidence terms found in a document has this confidence level, then the
document will never be assigned to this category.
Required: These terms are must-have terms, but they are not taken into account for the document's
score.
If the resulting score meets or exceeds the category's on-target threshold, CIS server assigns the
document to the category. If the score is lower than the on-target threshold but higher than or equal
to the candidate threshold, CIS server assigns the document to the category as a Pending candidate;
the category owner must review and approve the document before the assignment is complete. If the
score falls below the candidate threshold, CIS server does not assign the document to the category.
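
The threshold rules above can be summarized in a short Python sketch. It takes the document score as an input, because this guide does not give the formula that CIS server uses to compute it. The function name, the parameter names, and the threshold values in the usage lines are hypothetical. The special case in which a category defines only Required terms is not modeled.

```python
def assignment_status(score, on_target, candidate,
                      excluded_found=False,
                      required_defined=False, required_found=False):
    """Return "assigned", "pending", or "not assigned" for one category."""
    # An Exclude term found in the document always blocks the assignment.
    if excluded_found:
        return "not assigned"
    # When Required terms are defined, at least one must be present.
    if required_defined and not required_found:
        return "not assigned"
    # The on-target threshold is inclusive ("meets or exceeds").
    if score >= on_target:
        return "assigned"
    # The candidate threshold is inclusive as well.
    if score >= candidate:
        return "pending"
    return "not assigned"


# Hypothetical thresholds: on-target 80, candidate 20.
print(assignment_status(85, 80, 20))
print(assignment_status(50, 80, 20))
print(assignment_status(10, 80, 20))
```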

About stemming and phrase order


The CIS server linguistic analysis module uses stemming to recognize related words and treat them
as a single evidence term. Stemming means extracting the common root, or stem, from expressions that
differ only in their grammatical form. For example, the words parked, parks, and parking share
the same stem (park), and CIS server recognizes them as three instances of the same evidence term
rather than as three different terms.
You may want certain evidence terms not to be stemmed. For example, if you define the term Explorer
as in Microsoft Internet Explorer, you do not want CIS server to recognize other forms of the word as
the same term. When you turn off stemming, CIS server looks for an exact match with the defined
term. In our example, CIS server would consider the word explorers as a separate term. Turning
off stemming for a term means that CIS server will not recognize even the plural form of a noun or
different forms of the same verb. When you turn off stemming, make sure you explicitly add as terms
all of the forms you want CIS server to recognize.
Another example is when you want CIS server to treat different forms of the same stem as separate
terms; for example, if you want to use provider and provide as evidence of different categories.
CIS provides out-of-the-box stemming capability for the English language. To use the stemming
option for other languages, you need to install language dictionaries, as described in the Content
Intelligence Services Administration Guide.
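
The difference between stemmed and exact matching can be illustrated with a toy example. The suffix-stripping function below is a deliberately naive stand-in written for this illustration; it is not the CIS linguistic analysis module and does not reflect how CIS computes stems. Comparing words without regard to case is also an assumption of the sketch.

```python
def toy_stem(word):
    """Naive stemmer: strip one common English suffix, if any."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so that short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_matches(term, word, stemming=True):
    """Compare an evidence term with a word, with or without stemming."""
    if stemming:
        return toy_stem(term) == toy_stem(word)
    return term.lower() == word.lower()


# With stemming, the three forms collapse to the same stem.
print({toy_stem(w) for w in ["parked", "parks", "parking"]})
# Without stemming, "explorers" is a separate term from "Explorer".
print(term_matches("Explorer", "explorers", stemming=True))
print(term_matches("Explorer", "explorers", stemming=False))
```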

Setting the language used for the stemming


The language used for the stemming can be defined for the documents, for the categories, or
for both.
When you specify the language of a document, the text of the document is analyzed and stemmed
according to this language. The result of the analysis is then compared with the evidence terms of
categories that have the same language or whose language is not defined. Note that defining a
language for a category acts as a filter: a document will never be assigned to a category of a
different language.
To set the language for the documents that you want to classify, you can either set it for every
document or for an entire document set. When a language is set for a document set, it prevails over
the language set for individual documents. This prevents classification errors if the document
language is not correctly set. Note that you can only select one language per document. If the
document set is made of many documents in different languages, then the language must be set at the
document level and not at the document set level. When no language is defined for the documents
or for the document sets, if the stemming is activated, the language used is the one defined in the
CIS server configuration. The Content Intelligence Services Administration Guide describes how to set
the default language for CIS server.
You can also define the language of the categories used for the classification. The language can be
set for every category or for the entire taxonomy. If the language of a category is not specified, then
the language of the taxonomy is used; the category does not inherit the language of its parent
category, if any. When no language is defined, the language used is the one defined in the CIS server
configuration. You can also define the language as "Any language", which means that documents in
different languages can be assigned to this category.
The following languages are available for the stemming option: English, French, German, Italian,
Portuguese, Spanish, Danish, Dutch, Finnish, Swedish, Norwegian Bokmål, and Norwegian Nynorsk.
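
The precedence rules in this section can be expressed as a short Python sketch. The function names, the use of None for "not set", and the language codes are assumptions chosen for the illustration; they do not correspond to CIS attributes or API calls.

```python
ANY_LANGUAGE = "any"  # stands in for the "Any language" choice


def effective_document_language(document_lang, document_set_lang, server_default):
    """A language set on the document set prevails over the document's own."""
    return document_set_lang or document_lang or server_default


def effective_category_language(category_lang, taxonomy_lang, server_default):
    """Category first, then taxonomy, then the CIS server default.

    The language of the parent category is deliberately not consulted.
    """
    return category_lang or taxonomy_lang or server_default


def can_assign(document_lang, category_lang):
    """A category language acts as a filter unless it is "Any language"."""
    return category_lang == ANY_LANGUAGE or category_lang == document_lang


doc_lang = effective_document_language("fr", "de", "en")
cat_lang = effective_category_language(None, "de", "en")
print(doc_lang, cat_lang, can_assign(doc_lang, cat_lang))
```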

Activating the stemming


Stemming can be activated at different levels:
In the category class definition, you can choose to use the stemming on the category names. If
you select Use stemming in the category class definition, then it will be the default value for
all categories created from this category class.
In the category definition, you can choose to override the default option inherited from the
category class. You can either select or deselect the stemming option.
For each evidence term (keywords or phrases), you can choose to use the stemming. This option is
automatically disabled and grayed out if you selected Any language as the category language.

Retaining the phrase order


When you enter a multi-word phrase as evidence, CIS server by default looks for an exact match with
the phrase. If you select the Recognize words in any order checkbox, CIS server looks for all of the
words in the phrase in the same sentence regardless of their order.
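
The two matching modes can be sketched in Python as follows. The sentence splitting and word tokenization are simplistic assumptions made for the illustration (sentences end at ".", "!", or "?", and matching ignores case); the actual linguistic processing in CIS server is not described in this guide.

```python
import re


def split_sentences(text):
    """Very rough sentence splitter used only for this illustration."""
    return [part for part in re.split(r"[.!?]+", text) if part.strip()]


def tokenize(text):
    """Lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def phrase_found(phrase, text, any_order=False):
    """Look for a multi-word phrase, exactly or in any order per sentence."""
    wanted = tokenize(phrase)
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        if any_order:
            # All words of the phrase must occur in the same sentence.
            if all(word in words for word in wanted):
                return True
        else:
            # The words must occur consecutively and in the same order.
            size = len(wanted)
            for start in range(len(words) - size + 1):
                if words[start:start + size] == wanted:
                    return True
    return False


text = "Explorer is an Internet browser. It runs on Windows."
print(phrase_found("Internet Explorer", text))
print(phrase_found("Internet Explorer", text, any_order=True))
```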

About category links


Categories can include other categories as evidence: when a document is assigned to one category,
CIS server can use that assignment as evidence for a related category. Like all evidence, category
link evidence has a confidence value associated with it, telling CIS server how much to add to the
document's overall score for the current category when the document is assigned to the linked
category.


There are three types of category links:
Explicit category links, for which you identify the category to link into the evidence for this
category
Parent links, for which CIS links all of this category's parent categories into its set of evidence terms
Child links, for which CIS links all children of this category into its set of evidence terms
Category classes can specify that CIS include Parent or Child links automatically. If a category
belongs to a class where these options are set, the evidence for the category will include these links
even though they do not appear in the category definition itself.

Setting up Content Intelligence Services


Before you can use CIS on Content Server, you must install and start the CIS server. For details about
installing the CIS server, refer to the Content Intelligence Services Installation Guide. For details about
starting the CIS server, refer to the Content Intelligence Services Administration Guide.
After you have installed and started the CIS server, you can configure CIS using Documentum
Administrator. Configuring CIS includes the following tasks:
Configuring the repository for Content Intelligence Services.
Designing and creating taxonomies.
Synchronizing the taxonomies with CIS server.
Identifying a set of test documents and checking them into a folder in the repository.
The set of documents should include representatives of the various types of documents you are
processing with Content Intelligence Services. The documents help to test and fine-tune the
category definitions.
Creating a document set that selects the test documents.
Testing the document set and reviewing the resulting categorizations.
Adjusting the category definitions as necessary to refine the results.
Synchronizing the taxonomies in production mode, and running the document set in production mode.
Bringing the taxonomies online.

Enabling Content Intelligence Services


Before you can use Content Intelligence Services, you must activate the CIS-related objects in the
repository to which you want to apply CIS processing.
Note: You must be logged in as a user with superuser privileges to enable CIS processing. If you
do not have sufficient privileges, the CIS options do not appear.


To enable CIS functionality for a repository:

1. Navigate to Administration > Content Intelligence for the repository you want to process
documents from.
If the Content Intelligence node is not visible, refer to Missing Content Intelligence node, page
48 to solve this issue.

2. Click the Enable repository for category assignments link.
The Enable Repository for Content Intelligence page appears.
When you create taxonomies and categories, Documentum Administrator creates corresponding
folders, one folder for each taxonomy and category with the same hierarchical relationships.
When the Link to Folders option is active, CIS links categorized documents into the folders
corresponding to their assigned categories.
The default location for these folders is in a cabinet named Categories.
The default path for the Content Intelligence administrative information is
/System/Application/CI.
These two locations cannot be modified.

3. Enter the host names for the production CIS server and the test CIS server. The host name is
made of the IP address or DNS name, optionally followed by the port number, for example:
192.168.1.250:8079
The default port number is 8079.
You can define the host names using IPv6 addresses. An IPv6 address, with or without a
specific port number, must be enclosed in square brackets, for example:
[2001:0db8:0:0:0:0:1428:57ab]
[2001:0db8:0:0:0:0:1428:57ab]:5678
CIS enables you to categorize documents in production mode or test mode; see Test processing
and production processing, page 69 for details. Although you can use the same CIS server for
both production and testing, separate servers are recommended for better performance and
availability.
The specified CIS servers must be running when you enable the repository.

4. Enter the User Name and password for the CIS server to connect to the repository. The
authentication against the repository is required when retrieving documents and assigning
documents to categories.

5. Click OK.

6. Set the CIS processing options for the repository, as described in Modifying Content Intelligence
Services configuration, page 49.
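
The host name format accepted in step 3 can be illustrated with a small Python parser. This is a sketch of the syntax rules described in the procedure, not the parsing code used by Documentum Administrator; the function name and error messages are hypothetical.

```python
DEFAULT_CIS_PORT = 8079  # default port number stated in this guide


def parse_cis_host(value):
    """Split a CIS host name into a (host, port) pair."""
    value = value.strip()
    if value.startswith("["):
        # IPv6 form: [address] or [address]:port
        end = value.find("]")
        if end == -1:
            raise ValueError("IPv6 address is missing its closing bracket")
        host, rest = value[1:end], value[end + 1:]
        if not rest:
            return host, DEFAULT_CIS_PORT
        if not rest.startswith(":"):
            raise ValueError("unexpected text after the closing bracket")
        return host, int(rest[1:])
    if value.count(":") > 1:
        raise ValueError("an IPv6 address must be enclosed in square brackets")
    # IPv4 address or DNS name, with an optional port.
    host, separator, port = value.partition(":")
    return host, int(port) if separator else DEFAULT_CIS_PORT


for example in ("192.168.1.250:8079",
                "[2001:0db8:0:0:0:0:1428:57ab]",
                "[2001:0db8:0:0:0:0:1428:57ab]:5678"):
    print(parse_cis_host(example))
```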

Missing Content Intelligence node


If the Content Intelligence node is not visible, this may be because the CIS DAR is not installed in
the repository. To check this, navigate to the Administration page and verify what is indicated in the
System information page, Repository section. If it indicates: Content Intelligence: CIS DAR not
found in the repository, then install the CIS DAR as described in the following procedure.


To deploy CIS DAR on the repository:

1. Depending on your use of the classification functionality, proceed with one of the following options:
You use CIS server to classify content: it is likely that your version of Documentum Administrator
is more recent than your version of the CIS server. In this case, we recommend that you upgrade
CIS to the same version as Documentum Administrator.
You classify manually, without CIS server: in this case, proceed as follows.

   1. Download the Content Intelligence Services archive file from the Documentum
   download center, for the same version as your Documentum Administrator.

   2. Unzip the archive file and navigate to the DAR folder.

   3. Deploy the module cis_artifact.dar according to the Documentum guidelines for
   module deployment in the Documentum Composer documentation.

Modifying Content Intelligence Services configuration
The Configuration for Content Intelligence page enables you to modify how CIS records category
assignments as well as the host names for the CIS servers that process documents from this repository.
You must be a member of the ci_taxonomy_manager_role to configure CIS.

To modify CIS configuration for a repository:

1. Navigate to the Administration node.

2. In the Content Intelligence Services box on the right, click the link Configure CIS.
The Configuration for Content Intelligence page appears.

3. Update the host names, and optionally the port numbers, of the CIS production and test servers if
necessary.
CIS allows you to categorize documents in production mode or test mode; see Test processing
and production processing, page 69 for details. Although you can use the same CIS server for
both production and testing, separate servers are recommended for better performance and
availability.
The specified CIS servers must be running when you configure the repository.

4. Specify whether CIS links assigned documents into a corresponding category folder. This option
is not selected by default.
If you do not select the Link assigned documents into category folders option, category
assignments are not returned as search results, and Documentum Webtop users can view
assignments only if you assign them as attributes.
Note: Selecting this option affects system performance during document processing and
classification. Do not select it unless you need the functionality it provides.

5. Specify whether CIS adds assigned category names to document attributes by selecting or
deselecting the Update document attributes with category assignments option. This option is
selected by default.
Which attributes CIS updates is determined by the category classes of each category; see Defining
category classes, page 51.

6. Enter the Documentum User Name and password for CIS server to use when connecting to
this repository.
Select a user account that has appropriate permissions for retrieving documents to process and
assigning documents to categories.

7. Click OK to validate.

Building taxonomies
The term taxonomy refers to two related items in Content Intelligence Services. In most situations
it refers to the hierarchy of categories that divide up a particular subject area for content. For
example, the term is used in this sense when you refer to the Human Resources taxonomy or the
Pharmaceutical taxonomy. A taxonomy in this sense has a root level and any number of categories as
direct and indirect children.
Content Intelligence Services also uses the term taxonomy to refer to the Documentum object that
serves as the root level of the hierarchy. Taxonomy objects represent the top level, much as a cabinet
represents the top level of a hierarchy of folders.
The organizational structure of a taxonomy determines the navigation path that users follow to
locate documents in the category as well as certain types of inheritance: a category inherits some
default values from the taxonomy definition and can inherit evidence from its child categories, its
parent category, or any other category.
Taxonomies consist of three types of Documentum objects:
Taxonomy objects represent the root of a hierarchical tree of categories. The definition of
a taxonomy sets default values for its categories and can include property conditions that
documents must meet in order to be assigned to categories in the taxonomy. No documents are
assigned directly to the root of the taxonomy.
Categories are the headings under which documents are categorized. The definition of a category
includes the evidence that CIS server looks for in document content to determine whether
it belongs in the category.
Category classes define general types of categories. Every category is assigned to a class, which
specifies the default behavior of the category.
In addition to building taxonomies using Documentum Administrator, you can import pre-built
taxonomies from XML files in taxonomy exchange format (TEF). The Content Intelligence Services
Administration Guide provides more information about importing taxonomies.
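
The relationship between the three object types can be pictured with a small in-memory model. This is only a sketch for illustration: the class and field names below are hypothetical and do not correspond to the Documentum object model or to any CIS API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CategoryClass:
    """General type of category; supplies default behavior (illustrative only)."""
    name: str


@dataclass
class Category:
    """A heading under which documents are categorized."""
    name: str
    category_class: CategoryClass
    evidence: List[str] = field(default_factory=list)
    children: List["Category"] = field(default_factory=list)

    def add_child(self, child: "Category") -> "Category":
        self.children.append(child)
        return child


@dataclass
class Taxonomy:
    """Root of the hierarchy; it holds defaults and never holds documents."""
    name: str
    default_class: CategoryClass
    categories: List[Category] = field(default_factory=list)

    def paths(self) -> List[str]:
        """Return the navigation path of every category, depth first."""
        result: List[str] = []

        def walk(node: Category, prefix: str) -> None:
            path = f"{prefix}/{node.name}"
            result.append(path)
            for child in node.children:
                walk(child, path)

        for top in self.categories:
            walk(top, self.name)
        return result


generic = CategoryClass("Generic")
hr = Taxonomy("Human Resources", default_class=generic)
benefits = Category("Benefits", generic, evidence=["health plan"])
benefits.add_child(Category("Retirement", generic, evidence=["pension"]))
hr.categories.append(benefits)
```

Calling hr.paths() lists each category under the taxonomy root, which mirrors the navigation path that users follow to reach documents in a category.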

Setting permissions for Content Intelligence


To define taxonomies, categories, category classes, or document sets in Documentum Administrator,
you need Superuser privileges or the ci_taxonomy_manager_role. The following procedure
describes how to restrict access to the Content Intelligence node to specific members using the
ci_taxonomy_manager_role.

To set permissions using ci_taxonomy_manager_role:

1. Navigate to Administration > Administrator Access.

2. Create an Administrator Access Set.

3. In the Administrator Access Set properties, select the Content Intelligence node and the
role ci_taxonomy_manager_role.

4. Navigate to Administration > User Management > Roles.

5. Select the ci_taxonomy_manager_role.

6. Click File > Add Member(s) and select the names of the users or groups you want to add to this
role.

7. Navigate to /System/Applications/CI and select all objects in the CI folder.

8. Open the object properties, and then the Permissions tab.

9. Set the Permission Set to CI Default ACL.

Some information related to the taxonomy manager role:


Taxonomy managers can create taxonomies and add categories only to their own taxonomies.
Taxonomy managers can only edit or delete the categories they have created. If a category was
created by another taxonomy manager, the other taxonomy manager is the owner of this category.
If the owner is not specified during the import of a taxonomy, the taxonomy and its categories are
owned by the CIS user defined for the repository.
It is possible to change the owner of a taxonomy or category but, as for any Documentum object,
the change must be done for each object (taxonomy or category). Edit the owner of the object and
set it to ci_taxonomy_manager_role.

Defining category classes


Each category is part of a category class. The properties of a category class determine the default
behavior of categories belonging to the class. Individual categories can override the default behavior.
If you are using the Assign as Attributes option to write category assignments into each document's
attributes, the category class identifies which attribute CIS writes the category names into.
CIS includes one category class by default, named Generic. In many instances, you can configure this
category class and use it for all of your categories. You need to create additional category classes
only when you need to assign category information to a different attribute or use different rules for
generating category evidence.
You can also delete category classes, but you must first reassign all categories to use another category
class. You can reassign the categories on the page that displays when you delete the class.

To create or modify a category class:


1. Navigate to Administration > Content Intelligence > Category Classes.
A list page appears, showing the available category classes.

2. Select File > New > Category Class to create a new category class, or click the icon next to the
category class whose properties you want to set.
The properties page for category classes appears. It has two tabs, one for general category class
information and the other for default values.

3. Enter a name and description for the category class.
The name appears in the list of category classes that displays when creating categories. If you are
editing an existing category class, the name is read only. The description enables you to enter
more descriptive information about the category class.

4. Identify the document attribute into which CIS writes the names of assigned categories.
The classification attribute must be an existing attribute for the object type of documents that will
be assigned to categories of this class, and it must be a repeating value attribute, for example,
keywords. Category names are written into the attribute only if this option is active; see
Modifying Content Intelligence Services configuration, page 49 for information about setting the
option. Note that the current values of the selected attribute are erased by CIS and replaced by
the result of the new categorization. Therefore, end users should not edit this attribute manually.

5. Click the Default Values tab.
You use this page to set the default behavior for categories of this class. When you assign a
category to this class, the category will use the values from the class unless the user who creates
the category changes the option on the New Category screen.

6. Specify how CIS treats the category name as an evidence term for the category.
a. To have CIS add the category name as an evidence term, select the Include Category
Name as evidence term checkbox. If you deselect this option, the next two options are not
relevant and are grayed out. Skip to step 7.
b. To activate the stemming option on the category name, select the Use stemming checkbox.
c. To enable the words in multi-word category names to appear in any order, select the
Recognize words in any order checkbox. When the checkbox is not selected, CIS server
recognizes the category name only if it appears exactly as entered.

7. Set the default rules for using evidence from child or parent categories.
When a document is assigned to one category, CIS server can use that assignment as evidence
that the document also belongs in a related category. This type of evidence propagation is most
common between categories and their parent or child categories. See About category links,
page 46 for more information.
a. To use evidence from parent or child categories by default, select the Use evidence from
child/parent checkbox. Deselect the checkbox to avoid evidence propagation.
b. From the drop-down list associated with the checkbox, select child to use evidence from
child categories as evidence for the current category or parent to use evidence from parent
categories.
Note: You cannot link to a category with a name that is not unique. If you define links to
categories with a non-unique name, the links will not be taken into account by CIS processing.

8. Click Finish to close the properties page.
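
The difference between exact recognition of a category name and the Recognize words in any order option can be illustrated with a simplified matcher. This is a sketch under stated assumptions, not the CIS matching algorithm: it ignores stemming, treats any run of letters or digits as a word, and in any-order mode only requires each word of the name to occur somewhere in the content. The function names are hypothetical.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def name_matches(category_name: str, content: str, any_order: bool = False) -> bool:
    """Return True if the category name is found in the content.

    any_order=False: the words must appear consecutively, in the order entered.
    any_order=True:  every word must appear somewhere in the content (assumption).
    """
    name_words = _words(category_name)
    content_words = _words(content)
    if not name_words:
        return False
    if any_order:
        return all(word in content_words for word in name_words)
    n = len(name_words)
    return any(
        content_words[i:i + n] == name_words
        for i in range(len(content_words) - n + 1)
    )
```

For example, with the category name Human Resources, the text "resources for human development" matches only when any_order is True.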


To delete category classes:


When you delete a category class already used for existing categories, you are prompted to reassign
the categories to another category class.

1. Navigate to Administration > Content Intelligence > Category Classes.
A list page appears, showing the available category classes.

2. Select the category classes you want to delete.

3. From the File menu, select Delete.
A confirmation message appears, asking you to confirm that you want to delete the category class.

4. For category classes that are assigned to existing categories, select an alternate category class for
the categories.
When a category class is still in use, the confirmation message page enables you to select which
of the remaining category classes is assigned to categories that currently use the deleted class.
Choose the class from the Update categories to use the category class drop-down list.

5. Click OK to delete the category class.
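
The reassign-then-delete behavior of this procedure can be sketched as follows. The data structures and the function name are hypothetical and stand in for repository objects; the sketch only shows that a class still in use cannot be removed until its categories point at a remaining class.

```python
from typing import Dict, Optional, Set


def delete_category_class(
    classes: Set[str],
    category_to_class: Dict[str, str],
    doomed: str,
    replacement: Optional[str] = None,
) -> None:
    """Delete a category class, reassigning its categories first if needed."""
    if doomed not in classes:
        raise KeyError(f"unknown category class: {doomed}")
    in_use = [cat for cat, cls in category_to_class.items() if cls == doomed]
    if in_use:
        # A class that is still in use requires a valid replacement class.
        if replacement is None or replacement == doomed or replacement not in classes:
            raise ValueError("select another existing category class for the categories")
        for cat in in_use:
            category_to_class[cat] = replacement
    classes.remove(doomed)


classes = {"Generic", "Legal"}
assignments = {"Contracts": "Legal", "Benefits": "Generic"}
delete_category_class(classes, assignments, "Legal", replacement="Generic")
```

After the call, the Contracts category uses the Generic class and Legal no longer exists.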

Defining taxonomies
You need to create a taxonomy object before you can create any of the categories in the hierarchy. The
taxonomy object sets certain default values for the categories below it.
Since the taxonomy object is the root of a complete hierarchy of categories, it is the object that you
work with when performing actions that affect the entire hierarchy, such as making the latest
definitions available to CIS server (synchronizing) or making the hierarchy available to users (bringing
the taxonomy online). For information about these operations, see Managing taxonomies, page 67.
Every CIS implementation needs to have at least one taxonomy to use for analyzing and processing
documents. Depending on the types of documents being categorized, you may want to create
multiple taxonomies. Generally you want one taxonomy for each distinct subject area or domain.
One advantage to separate taxonomies is that they can be maintained separately, by different subject
matter experts, for example.
The Properties page for a taxonomy object can have two or three tabs:
The Attributes tab displays the basic information about the taxonomy, most of which was entered
when the taxonomy was created.
The Property Rules tab lists conditions that documents must meet before CIS server will assign
them to any category under this taxonomy.
The Select Taxonomy Type tab is displayed only if the category or taxonomy type has been subtyped.
Use this tab to create a taxonomy of your own subtype.

To create or modify a taxonomy:


1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.

2. To display only taxonomies that you own or only online taxonomies, choose one of the options
from the drop-down list in the upper right corner of the list page.

3. Select File > New > Taxonomy to create a new taxonomy. To modify a taxonomy, select it and
then go to View > Properties > Info.
The properties page for taxonomies appears.

4. In the Select Taxonomy Type tab, select the taxonomy type from the drop-down list to create
a subtype.
Click Next to proceed or click the Attributes tab.
The Attributes page displays the non-editable subtype of the taxonomy.

5. Enter a name, title, and description for the taxonomy. Only the taxonomy name is mandatory
and it must be unique. The title is not mandatory and it is not necessarily unique.
By default, the taxonomy name is the text that appears in the list of taxonomies. However, it is
possible to display the taxonomy title instead of the taxonomy name. The procedure To display
the object titles instead of the object names:, page 61 describes how to switch from the category
and taxonomy names to the category and taxonomy titles.

6. Click the Select owner link and choose the taxonomy owner. The taxonomy owner can be a
person, a list of persons, or groups.

7. Choose the default category class from the drop-down list.
The selected class appears as the default category class when you create categories in this
taxonomy. See Defining category classes, page 51 for information about category classes.

8. Select the taxonomy language. The selected language must match the language of the
documents that you want to classify. If the language is different, the documents will never
be assigned to a category of this taxonomy.
If the language of a category is not defined, the language set for the taxonomy is used. If no
language is set for the taxonomy, the CIS server default language is used.
Select Any language in the drop-down list to match any document's language. For example, you
can use this option if you don't plan to activate stemming and your evidence terms are valid
in any language, such as patterns for social security numbers or acronyms like EMC. If the option
Any language is selected, then it is not possible to use stemming on the evidence terms of
this taxonomy. The Use stemming option in the evidence term definition is then disabled and
grayed out.

9. Specify whether the taxonomy is online or offline.
An online taxonomy is available for users to browse and assign documents to. A new taxonomy
is offline until you explicitly put it online by selecting Online from the State drop-down list.
Typically you want to keep the taxonomy offline until you have completed testing it.

10. Set the default on-target and candidate thresholds.
The on-target and candidate thresholds determine which documents CIS server assigns to a
category during automatic processing. When a document's confidence score for the category
meets or exceeds the on-target threshold, CIS server assigns it to the category. When the score
meets or exceeds the candidate threshold but does not reach the on-target threshold, CIS server
assigns the document to the category as a candidate requiring approval from the category owner.
See About confidence values and score thresholds, page 44 for details.
The threshold values for the taxonomy object set the default threshold values for categories in
this taxonomy. The default values are 80 for the on-target threshold and 20 for the candidate
threshold.

11. For previously saved taxonomies, refresh the synchronization state of the taxonomy.
If the taxonomy has never been synchronized, the status is Unknown. See Synchronizing
taxonomies, page 68 for information about synchronization.
The synchronization state is not displayed when you are creating a new taxonomy.

12. Click OK to close the properties page, or click the Property Rules tab to specify criteria that
all documents in this taxonomy must meet.
Add property rules to the taxonomy if you want to define rules specific to document attributes.
For help using the Property Rules tab, see Defining property rules, page 63.

13. Click OK to close the properties page.

14. To create or modify the categories in the taxonomy, see Defining categories, page 59 for
information about defining categories.

15. To synchronize the taxonomy if you have made any changes to it or its categories, see
Synchronizing taxonomies, page 68 for information about synchronization.
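
The effect of the on-target and candidate thresholds set in step 10 can be expressed as a small decision function. This is a minimal sketch of the rule as described in this guide, using the default values 80 and 20; the function name and the returned labels are hypothetical and are not CIS identifiers.

```python
def classify_score(score: float, on_target: float = 80, candidate: float = 20) -> str:
    """Map a confidence score to an assignment outcome.

    score >= on_target              -> assigned to the category
    candidate <= score < on_target  -> candidate, pending owner approval
    score < candidate               -> not assigned
    """
    if candidate > on_target:
        raise ValueError("candidate threshold must not exceed on-target threshold")
    if score >= on_target:
        return "assigned"
    if score >= candidate:
        return "candidate"
    return "not assigned"
```

With the defaults, a score of 80 is assigned, a score of 79 becomes a candidate for review by the category owner, and a score of 19 is not assigned. Lowering the on-target threshold on a category widens automatic assignment for that category only.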

Creating subtypes for a taxonomy or for a category


The subtyping feature enables you to add your own attributes to an object type (dm_taxonomy for
taxonomies or dm_category for categories) by creating a subtype of that type. You can create
one or more subtypes by adding or editing the attributes of objects with tools such as TEF,
DA, and Web Publisher. The subtypes reside in the repository data dictionary. They inherit
the ACL settings from dm_category and dm_taxonomy.
Using Documentum Administrator and TEF, you can create a custom tab for the subtype. For more
information refer to Creating custom tab for the subtype, page 55.

Creating custom tab for the subtype


You must use Documentum Application Builder to create the custom tab for a category subtype's
attributes. You can configure the Documentum Administrator tab using Documentum Application
Builder. After configuring the Documentum Administrator tab, you can create a custom tab for
its subtypes.
If customization for a subtype is not available, Documentum Administrator will use the closest
super-type settings that are available for a particular subtype.
Example 5-1. Example 1

MyCat1 is a subtype of Category.
If Documentum Administrator is not customized to recognize MyCat1, it reads MyCat1 as a default
category.

Example 5-2. Example 2

MyCat1 is a subtype of Category.
MyCat2 is a subtype of MyCat1.
If Documentum Administrator is not customized to recognize MyCat2, then it reads MyCat2 as
MyCat1. If MyCat1 is also not customized, then DA reads MyCat1 and MyCat2 as individual
categories.
To create custom tabs for a category subtype's attributes, use the Display Configuration tab in
Documentum Application Builder. Using this tab, you can configure the Documentum Administrator tab
as needed.
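
The fallback in the two examples above amounts to walking up the type hierarchy until a customized type is found. The sketch below illustrates that lookup only; the dictionary of supertypes, the set of customized types, and the function name are hypothetical, and the sketch assumes that an uncustomized chain falls back to the base Category settings.

```python
from typing import Dict, Optional, Set


def closest_customized_type(
    type_name: str,
    supertypes: Dict[str, Optional[str]],
    customized: Set[str],
) -> str:
    """Return the nearest type (self or ancestor) that has custom settings.

    If no type in the chain is customized, return the root of the chain.
    """
    seen: Set[str] = set()
    current: Optional[str] = type_name
    last = type_name
    while current is not None and current not in seen:
        if current in customized:
            return current
        seen.add(current)          # guard against a malformed, cyclic hierarchy
        last = current
        current = supertypes.get(current)
    return last


hierarchy = {"MyCat2": "MyCat1", "MyCat1": "Category", "Category": None}
```

With this hierarchy, MyCat2 resolves to MyCat1 when only MyCat1 is customized, and to Category when nothing is customized.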

To create custom tab for the subtype:


1. Open the Documentum Application Builder.

2. In the DocApp Explorer, double-click the object type's name to open the type editor, and select the
Display Configuration tab.
Tip: Each row in the Scope field represents one scope. A scope does not have a name and is
instead identified by its set of scope definitions.
To know more about the Scope field, refer to "Working with Object Types" in the Documentum
Application Builder User Guide.

3. To create and modify tabs on which to display the attributes, perform these actions in the
Display Configuration List:
Note:
The object type's parent's tabs are inherited. Adding, deleting, editing tabs, or changing the
order of the tabs breaks inheritance; that is, changes made to the parent's tabs will not be
reflected in this type's tabs.
Tab names are also localizable.
Web Publisher does not have tabs, so it displays the display configurations as sections on
the same page.
For WDK applications, to display attributes (particularly mandatory ones) on the object
properties Info page, specify the Info category.
To add a new tab:
a. Click Add.
b. Enter a new tab name or choose one of the defaults from the drop-down list.
c. To add the tab to all EMC Documentum clients, check Add to all applications. This tab is
shared between all applications and any changes to it are reflected in all applications.
d. Click OK.
Note: When you create tabs with identical names in different applications, DAB creates new
internal names for the second and subsequent tabs by appending an underscore and integer
(for example, dm_info_0) because the internal names must be unique for a type. The identical
names are still displayed because they are mapped to their corresponding internal names. When
you change locales, DAB displays the internal names, because you have not specified a name
to be displayed in that locale. It is recommended that you change them to more meaningful
names in the new locales.
Using one of the defaults automatically creates a tab with an identical name, because the default
is already used by another application.
Checking Add to all applications results in only one tab being created (not several tabs with
different internal names and identical display names), and all display names are mapped to
that one tab.
To remove a tab, select the tab name and click Remove.
To rename a tab, select the tab name and click Rename.
To change the order in which tabs are displayed, select the tab and click the up and down arrows.
4. To modify the attributes displayed on a tab, perform these actions in the Attributes in Display
Configuration:
a. In the Display Configuration List, select the tab in which the attributes you want to modify
are displayed. The attributes that are currently displayed on the tab are shown in the
Attributes in Display Configuration text box.
b. Click Edit.
c. To specify which attributes are displayed on the tab and how they are displayed, perform
these actions in the Display Configuration dialog box:
To display attributes on the tab, select the attribute in the Available attributes text box
and click Add.
To delete attributes from the tab, select the attribute in the Current attribute list text box
and click Remove.
To change the order in which the attributes are displayed on the tab, select the attribute in
the Current attribute list text box and click the up or down arrows.
To display a separator between two attributes, select the attribute above which you want
to add a separator and click Add Separator.
To delete a separator between two attributes, select the separator and click Remove
Separator.
If you have more attributes than can fit on a tab, you can force some attributes to be displayed on a
secondary page in Webtop: select the attribute and click Make Secondary.
To move a secondary attribute back onto the primary tab, select the attribute and click Make
Primary.
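
The note in step 3 about internal tab names describes a common uniqueness scheme: keep the display name, and append an underscore and an integer to the internal name when it is already taken. The following sketch illustrates that idea only. How DAB actually chooses the integer is not documented here, so starting the counter at 0 (to match the dm_info_0 example) is an assumption, and the function name is hypothetical.

```python
from typing import Set


def next_internal_name(base: str, existing: Set[str]) -> str:
    """Return an internal name that does not clash with the existing ones."""
    if base not in existing:
        return base
    suffix = 0
    while f"{base}_{suffix}" in existing:
        suffix += 1
    return f"{base}_{suffix}"


used: Set[str] = set()
first = next_internal_name("dm_info", used)
used.add(first)
second = next_internal_name("dm_info", used)
used.add(second)
third = next_internal_name("dm_info", used)
```

All three tabs could still share one display name, because the display name is mapped to the internal name rather than stored in its place.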

Creating subtype instances


Use Documentum Administrator to create new instances of taxonomies and categories.

To create subtype instances:


1. Click File > New > Category or File > New > Taxonomy.
When a new instance is created, Documentum Administrator launches the Info screen for the
new object. You can customize Documentum Administrator to create subtype instances, which is
similar to category or taxonomy creation.
The Info screen enables you to view and edit attributes of a particular taxonomy or category.

2. Enter a name, title, and description for the category. Only the category name is mandatory and it
must be unique between categories that have the same parent. The title is not mandatory and it
is not necessarily unique.
By default, the category name is the text that appears in the list of categories and is the name of
the folder created to correspond to this category. However, it is possible to display the category
title instead of the category name. The procedure To display the object titles instead of the object
names:, page 61 describes how to switch from the category and taxonomy names to the category
and taxonomy titles.

3. Click the Select owner link and choose the owner of this category.
The standard page for selecting user(s) or group(s) appears. The category owner is the user who
can approve or reject documents assigned to the category as a candidate requiring approval
from the category owner; see Reviewing categorized documents, page 73 for information about
the document review process. The user you select is added to the ci_category_owner_role
automatically, giving him or her access to the category through Documentum Administrator.
Note: If both a user and a group exist with the same name, the user cannot be selected as
a category owner, only the group.

4. Choose the category class from the drop-down list.
The category class determines default behavior for the new category as well as the document
attribute to which CIS server adds the category name if you are using the Assign as Attributes
option.

5. Enter on-target and candidate thresholds.
The on-target and candidate thresholds determine which documents CIS server assigns to a
category during automatic processing. When a document's confidence score for the category
meets or exceeds the on-target threshold, CIS server assigns it to the category. When the score
meets or exceeds the candidate threshold but does not reach the on-target threshold, CIS server
assigns the document to the category as a candidate requiring approval from the category owner.
The default values come from the definition of the taxonomy you selected in order to navigate
to this category.

6. Click OK to create the subtype instance.

To create an instance of a taxonomy or category subtype:


1. Click the CustomProp tab to create a custom tab for the subtypes.

2. Enter the custom type for the subtype.

3. Click OK to close the properties page, or click the Property Rules tab to specify criteria that
all documents in this taxonomy must meet.
Add property rules to the taxonomy if you want to apply specific criteria to all documents
before they are considered for categorization in this taxonomy. For help using the Property
Rules tab, see Defining property rules, page 63.

4. Click OK to close the properties page.

Defining categories
When you create a category, you define its position in the hierarchy of categories by navigating into
the category that you want to be its parent. The category inherits default values for most of the
required attributes from the taxonomy object at the top of the hierarchy.
The procedure below describes how to create a category and set its basic properties. For information
about providing the evidence that CIS server uses to identify documents that belong in the category,
see Setting category rules, page 62.
Note: When you customize CIS and CenterStage to expose the taxonomy-based classification as
search filters in CenterStage clients, the result of the classification process is to store category names
as annotations (and not as category assignments). The maximum number of categories that can
be assigned to one document for this type of classification is 273. If a document matches more than
273 categories, no category names are stored for that document. Make sure that no document
matches more than 273 categories.
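
The limit in the note above is all-or-nothing rather than a truncation: exceeding it stores no category names at all for the document. The sketch below illustrates that behavior and shows one way to spot offending documents in test results. The function names and the shape of the input data are hypothetical.

```python
from typing import Dict, List

MAX_ANNOTATION_CATEGORIES = 273  # limit stated in this guide


def categories_to_store(matched: List[str]) -> List[str]:
    """Return the category names stored as annotations for one document."""
    if len(matched) > MAX_ANNOTATION_CATEGORIES:
        return []  # nothing is stored when the limit is exceeded
    return list(matched)


def documents_over_limit(matches: Dict[str, List[str]]) -> List[str]:
    """List the documents whose matches exceed the limit, sorted by name."""
    return sorted(
        doc for doc, cats in matches.items()
        if len(cats) > MAX_ANNOTATION_CATEGORIES
    )


sample = {
    "report.doc": [f"cat{i}" for i in range(273)],
    "glossary.doc": [f"cat{i}" for i in range(274)],
}
```

Here report.doc keeps all 273 names, while glossary.doc, with 274 matches, keeps none.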

To create a category:
1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.

2. To display only taxonomies that you own or only online taxonomies, choose one of the options
from the drop-down list in the upper right corner of the list page.

3. Select a taxonomy and navigate to the location where you want the category to appear.
The right pane should display the contents of the category that will be the new category's parent.

4. From the menu, select File > New > Category.
The properties page for categories appears with three tabs.

5. If subtypes have been created, in the Select Category Type tab, select the category type from the
drop-down list to create a subtype.
Click Next to proceed or click the Attributes tab. If no subtypes have been created, go directly
to the Attributes tab.
The Attributes page displays the non-editable subtype of the category.

6. Enter a name, title, and description for the category. Only the category name is mandatory and it
must be unique between categories that have the same parent. The title is not mandatory and it
is not necessarily unique.
By default, the category name is the text that appears in the list of categories and is the name of
the folder created to correspond to this category. However, it is possible to display the category
title instead of the category name. The procedure To display the object titles instead of the object
names:, page 61 describes how to switch from the category and taxonomy names to the category
and taxonomy titles.
The maximum number of characters for the category name is 255 characters. The category
path that includes the category name and the names of the parent categories must not exceed
450 characters.

7. Click the Select owner link and choose the owner of this category.
The standard page for selecting a user appears. The category owner is the user who can approve
or reject documents assigned to the category as a candidate requiring approval from the category
owner; see Reviewing categorized documents, page 73, for information about the document
review process. The user you select is added to the ci_category_owner_role automatically, giving
him or her access to the category through Documentum Administrator.

8. Select the category class from the drop-down list.
The category class determines default behavior for the new category as well as the document
attribute to which CIS server adds the category name if you are using the Assign as Attributes
option.

9. Select the category language. The selected language is used to filter the documents that you want
to classify. If the language is different, the documents will never be assigned to the category.
If the language of a category is not defined, the language set for the taxonomy is used, regardless
of the language of the parent category, if any. If no language is set for the taxonomy, the CIS server
default language is used.
Select Any language in the drop-down list to match any document's language. For example, you
can use this option if you don't plan to activate stemming and your evidence terms are valid
in any language, such as patterns for social security numbers or acronyms like EMC. If the option
Any language is selected, then it is not possible to use stemming on the evidence terms of this
category. The Use stemming option is then disabled and grayed out.

10. Enter on-target and candidate thresholds.
The on-target and candidate thresholds determine which documents CIS server assigns to a
category during automatic processing. When a document's confidence score for the category
meets or exceeds the on-target threshold, CIS server assigns it to the category. When the score
meets or exceeds the candidate threshold but does not reach the on-target threshold, CIS server
assigns the document to the category as a candidate requiring approval from the category owner.
See About confidence values and score thresholds, page 44 for details.
The default values come from the definition of the taxonomy you selected in order to navigate
to this category.

11. Specify how CIS treats the category name as an evidence term for the category.
a. To have CIS add the category name as an evidence term, select the Include Category
Name as evidence term checkbox. If you deselect this option, the next two options are not
relevant and are grayed out.
b. To activate the stemming option on the category name, select the Use stemming checkbox.
This option is automatically disabled and grayed out if you selected Any language as the
category language.
c. To enable the words in multi-word category names to appear in any order, select the
Recognize words in any order checkbox. When the checkbox is not selected, CIS server
recognizes the category name only if it appears exactly as entered.

12. Set the default rules for using evidence from child or parent categories.
When a document is assigned to one category, CIS server can use that assignment as evidence
that the document also belongs in a related category. This type of evidence propagation is most
common between categories and their parent or child categories. See About category links,
page 46 for more information.
a. To use evidence from parent or child categories by default, select the Use evidence from
child/parent checkbox. Deselect the checkbox to avoid evidence propagation.
b. From the drop-down list associated with the checkbox, select child to use evidence from
child categories as evidence for the current category or parent to use evidence from parent
categories.
Note: You cannot link to a category with a name that is not unique. If you define links to
categories with a non-unique name, the links will not be taken into account by CIS processing.

13. Click the CustomProp tab to create a custom tab for the subtypes.

14. If the customization for a subtype is not available, Documentum Administrator will use the
closest supertype settings that are available for a particular subtype. For more information, refer
to Creating custom tab for the subtype, page 55.

15. Enter the custom type for the subtype.

16. Click OK.
The property page closes, and the category appears in the list.

17. Set the category rules.
For details, see Setting category rules, page 62.
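
The language rules from the procedure above (category language, then taxonomy language, then server default, with Any language matching every document but disabling stemming) can be summarized in a few functions. This is an illustrative sketch only: the function names, the "any" marker, and the language codes are hypothetical and are not CIS configuration values.

```python
from typing import Optional

ANY_LANGUAGE = "any"  # hypothetical marker for the Any language option


def effective_language(
    category_lang: Optional[str],
    taxonomy_lang: Optional[str],
    server_default: str,
) -> str:
    """Resolve the language used for a category.

    The parent category's language is deliberately not consulted.
    """
    if category_lang:
        return category_lang
    if taxonomy_lang:
        return taxonomy_lang
    return server_default


def language_matches(document_lang: str, language: str) -> bool:
    """A document is considered only if its language matches, or Any is set."""
    return language == ANY_LANGUAGE or document_lang == language


def stemming_available(language: str) -> bool:
    """Stemming cannot be used when Any language is selected."""
    return language != ANY_LANGUAGE
```

For instance, a category with no language in a French taxonomy is treated as French, so an English document is never assigned to it.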

Displaying object titles


You can display object titles instead of object names for taxonomy, category, and document
objects.
If all titles are defined (for taxonomies, categories, and documents), you can display the title, which
can be more user-friendly, instead of the name, which is used as an identifier.
Note that you cannot choose to display only category titles, or only document titles. The switch works
on all objects at once. If the title is not defined for all objects then the column will be empty. In this
case, you can display both columns, side by side.

To display the object titles instead of the object names:


1. Locate the taxonomies_component.xml file under the
<DA webapp directory>\webcomponent\config\admin\taxonomies directory.
2. Locate the <showobjectname> property.
Set the property to true to display the category name (default option).
Set the property to false to display the category title.
3. Save the file.
4. Restart the Apache Tomcat service to apply the modification.
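
The edit in step 2 can also be scripted. The following Python sketch is an illustration only: it assumes that <showobjectname> is an ordinary element containing the text true or false, and the sample document is a hypothetical, heavily simplified stand-in for taxonomies_component.xml, not its real layout. Note that ElementTree does not preserve comments when parsing this way, so back up the original file before overwriting it.

```python
import xml.etree.ElementTree as ET


def set_show_object_name(xml_text: str, show_name: bool) -> str:
    """Return xml_text with every <showobjectname> element set to true or false."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter("showobjectname"))
    if not elements:
        raise ValueError("no <showobjectname> element found")
    for element in elements:
        element.text = "true" if show_name else "false"
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; the real file contains many more elements.
SAMPLE = "<config><component><showobjectname>true</showobjectname></component></config>"

# Switch from names (true, the default) to titles (false).
updated = set_show_object_name(SAMPLE, show_name=False)
print(updated)
```

After writing the result back to the file, the Tomcat restart in step 4 is still required.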


Setting category rules


A category's rules determine what documents are assigned to it. The rules fall into two major
categories:
Property rules, which set conditions that a document must meet in order to be considered for
assignment to the category
Evidence, which lists the words, phrases, or patterns that CIS server looks for to indicate that a
document belongs in the category
Property rules specify category rules based on attributes of the document; evidence specifies category
rules based on the content of the document. If the category definition contains only evidence terms,
then a document must contain these evidence terms to be assigned to the category. If the category
definition contains only property rules, then the document or its attributes must meet the conditions
set by the property rules. If the category definition contains both evidence terms and property rules,
then both must be satisfied for a document to be assigned.
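
The three cases above can be summarized in a few lines of Python. This is a sketch of the decision logic as described in this section, not CIS server code; the function name is hypothetical, and the behavior for a category with no rules at all is an assumption, since the guide does not describe that case.

```python
from typing import Optional


def is_assignable(evidence_matched: Optional[bool],
                  property_rules_met: Optional[bool]) -> bool:
    """Combine evidence and property rules for one document and one category.

    None means the category definition contains no rules of that kind.
    """
    checks = [c for c in (evidence_matched, property_rules_met) if c is not None]
    if not checks:
        # Assumption: a category with no rules at all assigns nothing.
        return False
    # Every kind of rule that is defined must be satisfied.
    return all(checks)


print(is_assignable(True, None))    # evidence only, evidence found
print(is_assignable(None, True))    # property rules only, conditions met
print(is_assignable(True, False))   # both defined, property rules fail
```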
The evidence terms that can be defined in Documentum Administrator for a category can also be
divided into two categories:
Simple terms are the key words and phrases for the category, each of which by itself is a good
indicator of category membership
Compound terms are groups of words and phrases that work together to indicate category
membership. No one term in the group has a high enough confidence value to assign a document
to the category, but the presence of multiple terms can cause the total confidence score to cross the
on-target threshold.
For many categories, only simple terms are required. As a general practice, we recommend adding
only simple terms when you first define a category. You can add compound terms when you are
refining your categories to make more subtle discriminations as a result of testing.
Note: Patterns cannot be defined in Documentum Administrator. The Content Intelligence Services
Administration Guide describes how to define patterns.

To set the rules for a category:


1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.
2. Navigate to the category whose rules you want to set.
3. Click the icon in the Rules column.
The rules page for the category appears. The right pane of the screen displays property rules for
the category; the left pane displays the evidence for the category.
4. Set any property rules based on document attributes.
See Defining property rules, page 63 for details.
5. Define the evidence for the category.
The evidence for a category is divided into simple terms and compound terms. When defining a
new category, we recommend adding simple terms; see Defining simple evidence terms, page 66
for details.
See Defining compound evidence terms, page 78 for information about creating compound terms.


Defining property rules


The Rules Summary page for a category shows the rules that CIS server uses to determine which
documents it assigns to the category. While evidence terms specify what words and phrases need to
appear in the content of a document, property rules define other property conditions, not related
to the content, that documents must meet in order to be assigned to the category. The property
conditions are based on the repository attributes of the documents. If a document does not meet the
defined property conditions, CIS server does not assign it to the category.
Property rules can be used in conjunction with evidence terms. When defined at the category
level, they can also be used on their own to assign documents. This way, you can assign documents
to categories based on the documents' property values, without even considering the documents'
content.
You can also define property rules for a taxonomy as a whole. In this case, the property rule is used to
filter documents. Unlike property rules set for categories, the property rules for a taxonomy cannot be
used to assign documents. Any property rule associated with the taxonomy applies to every category
within the taxonomy. The taxonomy-level rules appear on the rules page for the category with the
taxonomy name displayed in the title of the box.
If you want the values of Documentum object attributes to be processed as content for the
classification, refer to the Content Intelligence Services Administration Guide, which describes how to
modify the default processing and take into account property values in addition to, or instead of,
the text content of the documents.

To set property rules that documents must meet:


1. From the Property Rules page, click the Edit link in the Category Property Rule box.
The Property Rules page appears.
2. To require assigned documents to come from a specific folder, click the Select folder link next to
Look in: and navigate to the folder.
When you click OK after selecting the folder, the folder appears next to the Look in label.
3. To require assigned documents to have a particular object type, click the Select type link next to
Type: and select the object type. The default object type is dm_sysobject. If you have created
custom object types, To display or hide an attribute:, page 65, describes how to make custom
object types available in the CIS component.
When you click OK after selecting the object type, the type name appears next to the Type label.
4. To assign documents based on their attributes, select the Properties checkbox and enter the
criteria used to qualify documents.
a. Select whether all criteria should be met:
ALL indicates that all rules must be satisfied to assign the document.
ANY means that the document can be assigned when at least one rule is satisfied.
By default, all property rules must be satisfied.
b. Select the repository attribute whose value you want to test. The list of attributes differs
according to the selected object type. If you have created custom attributes, To display or
hide an attribute:, page 65, describes how to display custom attributes.


c. From the drop-down list in the middle, select the operator that will be used to compare the
selected attribute with the test value.
The available operators differ depending on the type of the attribute you selected in the
previous step. For example, for a Boolean attribute, the two operators are equal and not
equal and the possible values are true or false.
The operators contains and does not contain are only available for string attributes.
The operators greater than and less than can be used to select string values alphabetically. For
example, the string ABD is greater than ABC. You can then assign documents using their
title, their author, or any other string attribute by alphabetical order, such as: all documents
with an author name greater than A and less than C (note that in this case, words starting
with C are ignored).
d. Enter the value to test against in the text box on the right. Values are not case sensitive and
accents are ignored.
To define a rule on the Format attribute, you must enter the value as it appears in the
document's Property page. For example, to match documents whose format is Microsoft
Word Office Word Document 8.0-2003 (Windows), enter the value msw8.
To define a rule on any date attribute, the corresponding value must comply with
Documentum date standards. Table 4, page 64 shows possible date formats
(non-exhaustive list).
Table 4. Date formats for property rules

Date format       Example
mm/dd/yy          02/15/1990
mon dd yyyy       Feb 15 1990
mm/yy             02/90
dd/mm/yyyy        15/02/1990
yyyy/mm           1990/02
yy/mm/dd          90/02/15
yyyy-mm-dd        1990-02-15
dd-mon-yy         15-Feb-90
month yyyy        February 1990
month dd yy       February 15 90
month, yyyy       February, 1990
month dd, yyyy    February 15, 1990

Note that property rules on a date attribute do not take into account the time (hours, minutes,
seconds).
e. To add an additional condition, click the Add Property button and repeat steps b through d.
5. Click OK to return to the rules page.
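
The way the ALL/ANY selector and the string operators in step 4 combine can be sketched in Python. This is an illustration under stated assumptions, not the CIS server implementation: the operator names mirror the labels used above, the attribute key author is hypothetical, and the normalization (Unicode decomposition plus case folding) is just one plausible way to obtain comparisons that ignore case and accents.

```python
import unicodedata
from typing import Callable, Dict, List, Tuple


def normalize(value: str) -> str:
    """Lower-case the value and strip accents, so 'Béatrice' compares like 'beatrice'."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


# Operators applied to already-normalized strings (attribute value, test value).
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "equal": lambda a, b: a == b,
    "not equal": lambda a, b: a != b,
    "contains": lambda a, b: b in a,
    "does not contain": lambda a, b: b not in a,
    "greater than": lambda a, b: a > b,
    "less than": lambda a, b: a < b,
}

Rule = Tuple[str, str, str]  # (attribute name, operator, test value)


def matches(attributes: Dict[str, str], rules: List[Rule], mode: str = "ALL") -> bool:
    """Evaluate property rules; ALL requires every rule, ANY requires at least one."""
    results = [
        OPERATORS[op](normalize(attributes.get(attr, "")), normalize(value))
        for attr, op, value in rules
    ]
    return all(results) if mode == "ALL" else any(results)


# Author names greater than A and less than C; "author" is a hypothetical key.
rules = [("author", "greater than", "A"), ("author", "less than", "C")]
print(matches({"author": "Béatrice"}, rules))
print(matches({"author": "Carter"}, rules))
```

With these rules, Béatrice qualifies and Carter does not, which matches the remark in step c that words starting with C are ignored: any string that starts with C and has further characters sorts after the single letter C.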


Displaying attributes in Property rules


You may need to modify the type list or the attribute list for property rules, to select which attributes
you want to display for the property rules.
By default, all the attributes of the selected object type are available, except attributes beginning
with r_, a_, or i_, such as r_modified_date or a_content_type. To hide attributes that are visible by
default, you need to add them to an exclusion list. To make available attributes that are hidden by
default, you need to add them to an inclusion list.
Custom types created from the dm_sysobject or dm_document object type automatically inherit
the same searchable attributes. The attributes available or excluded for the dm_sysobject or
dm_document object types are also available or excluded for the derived object.
The following procedure describes how to display or hide attributes.

To display or hide an attribute:


1. Navigate to C:\Program Files\Apache Software Foundation\Tomcat
5.0\webapps\da\webcomponent\config\admin\category.
2. Open the qualifierrules_component.xml file.
3. Under the <attribute_list> element, you can add an entry for the type whose attribute display
you want to modify.
For example:
<attribute_list>
  <type id='my_custom_type'>
  </type>
</attribute_list>

Two <type id> elements already exist for the dm_sysobject and dm_document object types.
4. Under the <type id> element, add the attributes that should not appear in the drop-down menu
to the <exclusion_attributes> element, and the attributes that should appear to the
<inclusion_attributes> element.
By default, all the attributes of the selected object type are available; to hide them, add them
to the exclusion list.
Attributes that are hidden by default begin with r_, a_, or i_; to make them available, add them
to the inclusion list.
For example:
<attribute_list>
  <type id='my_custom_type'>
    <exclusion_attributes>
      <attribute>my_custom_attribute1</attribute>
      <attribute>my_custom_attribute2</attribute>
    </exclusion_attributes>
    <inclusion_attributes>
      <attribute>my_custom_attribute3</attribute>
    </inclusion_attributes>
  </type>
</attribute_list>
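
The combined effect of the default rule and the two lists can be sketched as follows. This Python function is only an illustration of the behavior described in this section; the function name is hypothetical, and letting the exclusion list win when an attribute appears in both lists is an assumption, since the guide does not cover that case.

```python
from typing import Iterable, List

# Attributes with these prefixes are hidden unless explicitly included.
HIDDEN_PREFIXES = ("r_", "a_", "i_")


def visible_attributes(all_attributes: Iterable[str],
                       inclusion: Iterable[str] = (),
                       exclusion: Iterable[str] = ()) -> List[str]:
    """Return the attributes that would be offered in the drop-down list."""
    included = set(inclusion)
    excluded = set(exclusion)
    visible = []
    for name in all_attributes:
        if name in excluded:
            continue  # assumption: exclusion takes precedence over inclusion
        if name.startswith(HIDDEN_PREFIXES) and name not in included:
            continue  # hidden by default and not explicitly included
        visible.append(name)
    return visible


attributes = ["title", "r_modified_date", "a_content_type", "my_custom_attribute1"]
print(visible_attributes(attributes,
                         inclusion=["r_modified_date"],
                         exclusion=["my_custom_attribute1"]))
```

In this example, title stays visible by default, r_modified_date becomes visible through the inclusion list, a_content_type stays hidden, and my_custom_attribute1 is hidden through the exclusion list.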


Defining simple evidence terms


The Simple Terms box displays words and phrases that are good indicators of category membership
individually. Each term has an associated confidence value, which indicates how certainly CIS server
can infer the appropriateness of the category when the term appears in a document. For simple terms,
the confidence value is generally High. See Providing evidence, page 44 for more information.
A newly defined category may have one simple term already defined: the name of the category. The
category name may appear as text or as the keyword @implied; either option means that CIS server
treats the category name as a simple evidence term. The category name or @implied appears if
the category class for this category has the Generate evidence from category name option set; see
Defining category classes, page 51.
If you find during testing that a particular simple term is causing CIS server to assign too many
documents to the category, you can convert the simple term into a compound term that is more
discriminating. To convert a simple term into a compound term, click the Add additional terms link
next to the term that you want to change and follow the instructions in Defining compound evidence
terms, page 78.

To define the properties of a category evidence term:


1. Click the Add a new simple term link to add a new term, or click the icon next to a term
you want to modify.
The Evidence page appears. For a new term, the Use stemming and Recognize words in any
order checkboxes are set to the default values from the category class for this category.
2. To use a word or phrase as evidence for the category, click the Keyword option button and enter
the word or phrase in the adjacent text box.
A keyword is a text string that CIS server looks for in the documents it processes.
3. To include another category as evidence for this category, click the Category option button and
identify the category to use as evidence for this category.
A category link tells CIS server to use the evidence of another category as part of the definition
of this category.
To use this category's parent category, select Parent from the drop-down list.
To use this category's children categories, select Child.
To link to a selected category, select Category, then click the Select category link that appears
to the right of the drop-down list and select the related category from the page that appears.
Note: You cannot link to a category with a name that is not unique. If you define links to categories
with a non-unique name, the links will not be taken into account by CIS processing.
See About category links, page 46 for more information about the types of category link.
4. Specify whether CIS server uses stemming on the evidence term by selecting or deselecting the
Use stemming checkbox. This option is automatically disabled and grayed out if you selected
Any language as the category language.
5. If the evidence term is a multi-word phrase, specify whether CIS server recognizes the words in
any order by selecting or deselecting the Recognize words in any order checkbox.
If the checkbox is not selected, CIS server recognizes the phrase only when the words appear in
exactly the order they are entered here.


6. Assign a confidence value for the evidence term.
The system assigns High confidence to the term by default, and we recommend this confidence
value for most simple terms. To specify a different value:
a. Deselect the Have the system automatically assign the confidence (HIGH) for me checkbox.
A pair of option buttons appears for setting the confidence level.
b. To select one of the system-defined confidence levels, click the System Defined Confidence
Level button and select a level from the drop-down list box. The system-defined levels are
described in About confidence values and score thresholds, page 44.
c. To set a custom confidence level, click the Custom Confidence Level button and enter a
number between 0 and 100 in the text box.
7. Click OK to close the Evidence page.
The evidence term appears in the Simple Terms box.
8. Repeat steps 1 to 7 for each simple term.
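
To make the keyword options more concrete, the following Python sketch models a simple term and the effect of the Recognize words in any order checkbox. It is an illustration only and does not reproduce CIS server's matching: the class and function names are hypothetical, stemming is left out, and "any order" is interpreted here as the phrase words appearing next to each other in any permutation, which is an assumption. The 0 to 100 range for a custom confidence comes from step 6.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleTerm:
    keyword: str
    any_order: bool = False                  # "Recognize words in any order"
    custom_confidence: Optional[int] = None  # None = system-assigned (High)

    def __post_init__(self) -> None:
        value = self.custom_confidence
        if value is not None and not 0 <= value <= 100:
            raise ValueError("custom confidence must be between 0 and 100")


def words(text: str) -> List[str]:
    """Split text into lower-case word tokens."""
    return re.findall(r"\w+", text.lower())


def term_matches(term: SimpleTerm, document_text: str) -> bool:
    """Check whether the keyword phrase occurs in the document text."""
    phrase = words(term.keyword)
    doc = words(document_text)
    size = len(phrase)
    if size == 0:
        return False
    for start in range(len(doc) - size + 1):
        window = doc[start:start + size]
        if window == phrase:
            return True  # words appear in exactly the order entered
        if term.any_order and sorted(window) == sorted(phrase):
            return True  # same adjacent words, different order
    return False


print(term_matches(SimpleTerm("interest rate"), "The interest rate rose."))
print(term_matches(SimpleTerm("interest rate"), "The rate interest differs."))
print(term_matches(SimpleTerm("interest rate", any_order=True), "The rate interest differs."))
```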

Managing taxonomies
When you create a taxonomy, it is offline by default. Offline taxonomies are available under the
Administration > Content Intelligence node for designing and building, but are not available for
users to see. To make the taxonomy available to users, you bring it online.
When you create or modify any part of a taxonomy, you need to make it available to CIS server
so that CIS server can use the new or updated taxonomy and category definitions to categorize
documents. This process is called synchronization.
Both of these operations are available for complete taxonomies only, not individual categories
or portions of the hierarchy.

Making taxonomies available


When you create a taxonomy, it has an offline status. An offline taxonomy is available through
Documentum Administrator, but is not visible to end-users via Webtop. (You can perform test
categorizations with an offline taxonomy; see Test processing and production processing, page
69.) Offline status enables you to build, test, and revise the taxonomy before making it available to
end-users.
When you bring it online, the taxonomy, its categories, and categorized documents appear to users
under the Categories node.

To make a taxonomy available to users in Webtop:


1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.
2. Select the taxonomy you want to make available then go to View > Properties > Info.
The properties page for the taxonomy appears.
3. Select the Attributes tab.


4. Select Online from the State drop-down list box.
5. Click OK.
The taxonomy now appears to users under the Categories node and is available for categorization.

To make a taxonomy unavailable to end users:


1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.
2. Select the taxonomy you want to take offline then go to View > Properties > Info.
The properties page for the taxonomy appears.
3. Select the Attributes tab.
4. Select Offline from the State drop-down list box.
5. Click OK.
The taxonomy is no longer visible to users. Existing documents remain in the categories.

Synchronizing taxonomies
The taxonomy and category definitions you create are saved in the repository. When you create or
modify any part of a taxonomy, you need to make it available to CIS server so that CIS server can
use the new or updated taxonomy and category definitions to categorize documents. This process
is called synchronization. Updates to the taxonomy are not reflected in automatic processing until
you synchronize them.
Note: If any of the categories in a taxonomy include links to categories in other taxonomies, all
related taxonomies must be synchronized to avoid possible errors.

To synchronize a taxonomy definition:


1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.
2. Select the taxonomy you want to synchronize.
3. Select Tools > Content Intelligence > Synchronize.
The Synchronize page appears. If you selected multiple taxonomies, the page will appear once
for each selected taxonomy.
4. Select which CIS servers you want to synchronize with.
You can categorize documents in production mode or test mode, providing a separate CIS server
host for each mode; see Test processing and production processing, page 69 for details. Select the
checkbox for the production server, the test server, or both. CIS will copy the latest taxonomy
definitions to the selected server(s).
5. Click the OK button to start the synchronization.
If you selected multiple taxonomies at step 2, a Next button appears in place of the OK button
until you have selected servers for each taxonomy. The synchronization for all selected
taxonomies occurs together.


The synchronization process starts, and the list of taxonomies reappears. If you receive any errors
or warnings, refer to the error log on CIS server for details. See the Content Intelligence Services
Administration Guide for information.
6. To check the status of the synchronization process, click the View Jobs button at the bottom of
the page.
When the synchronization is complete, a message indicating its success or failure is sent to
your Documentum Inbox.

Deleting taxonomies
Deleting a taxonomy removes all categories within that taxonomy except for categories
that are linked into other taxonomies. All assignments to those categories are also removed, although
the documents themselves are not.

To delete a taxonomy:
1. Navigate to Administration > Content Intelligence > Taxonomies.
A list of the existing taxonomies appears.
2. Select the taxonomy you want to delete.
3. Select File > Delete.
A message page appears asking you to confirm that you want to delete the taxonomy.
4. Click OK to remove the taxonomy.

Processing documents
When your taxonomies and their category definitions are in place, you are ready to categorize
documents. Content Intelligence Services supports both automatic categorization, where CIS server
analyzes documents and assigns them to appropriate categories, and manual categorization, where a
person assigns documents to categories.
Documentum Administrator enables you to review the results of either type of categorization, and
to manually adjust them if necessary. For documents that CIS server could not definitively assign
to particular categories, category owners use Documentum Administrator to approve or reject the
candidate documents.

Test processing and production processing


You can submit documents to CIS server in test mode or production mode. You choose the mode
when you define the document set.
In test mode, CIS server performs its analysis to categorize the submitted documents, but it does not
make any of the permanent updates that you want it to make when you put Content Intelligence
Services into production. You use test mode to refine and validate your category definitions. After
reviewing the results of a test run, you can clear the proposed categorizations, update the category
definitions, and run the test again. When CIS server is properly categorizing documents, you can
bring the taxonomy online to put it into production.
In production mode, CIS server updates documents and the repository based on the results of its
categorization. The nature of the updates depends on which configuration options are active: if Link
to Folders is active, CIS server links documents into the folders corresponding to the categories,
and if Assign as Attribute is active, CIS server writes the names of the assigned categories into each
document's attributes. Refer to Modifying Content Intelligence Services configuration, page 49 for
details about setting the options.
You can perform test processing on a separate CIS server from your production server. Offloading
test processing from the production server prevents your tests from competing for resources with
the production system. See Modifying Content Intelligence Services configuration, page 49 for
information about specifying the test and production servers.
You can view the documents assigned to a category after either a test run or a production
run.

To switch from production view to test view


1. Navigate to the category for which you want to see the assigned documents. (Do not select the
category.)
2. Select View > Page View > Test view to display the results of the category assignments after
a test run.
3. To go back to the production view, repeat the previous step but select Production view.

Defining document sets


Documents are submitted to CIS server by means of document sets. A document set is a collection
of documents that are sent to CIS server together, and which CIS server processes in the same way.
The document set can retrieve all documents from a specified folder or be automatically applied to
documents that users submit for categorization.
Once you have created and run a document set, the Properties page for the document set includes
status information on the Last Run tab.

To create or modify a document set:


1. Navigate to Administration > Content Intelligence > Document Sets.
A list of the existing document sets appears.
2. Select File > New > Document Set to create a new document set, or select the document set you
want to modify then select View > Properties > Info.
The properties page for document sets appears.
3. Enter a name and description for the document set.
Use a descriptive name that will enable you to distinguish it from other document sets. You may
want the name to reflect the documents included in the set.


4. Select the document set language. The selected language must match the language of the
categories and taxonomies used for the classification. The documents will never be assigned to
a category of a different language.
If the language of the document set is not defined, the language set for the document is used. If
no language is set for the document, the CIS server default language is used.
5. Click the Document Set Builder tab.
You use the controls on this tab to create the query used to retrieve documents for processing.
6. To include documents from a specific folder, click the Select link next to Look in: and navigate to
the folder containing the documents to process.
When you click OK after selecting the folder, the folder appears next to the Look in label.
7. To specify the object type of the documents selected for processing, click the Select link next
to Type: and select the object type.
When you click OK after selecting the object type, the type name appears next to the Type label.
8. The Properties checkbox is already selected to assign documents based on their attributes. Enter
the criteria used to select documents.
a. Select an attribute whose value you want to test.
The drop-down list on the left displays the attributes of the object type you selected at step 7.
b. From the drop-down list in the middle, select the operator to use to compare the selected
attribute to the test value.
The available operators differ depending on the attribute you selected in the previous step.
c. Enter the value to test against in the text box on the right.
d. To add an additional condition, click the Add Property button and repeat steps a through c.
The document set will include only those documents whose attributes meet all of the
conditions.
9. Click the Processing tab.
You use the controls on this tab to specify when the documents in this document set are submitted
to CIS server for processing and whether they are processed in test or production mode.
10. By default, the schedule is set to Inactive. To define a schedule, set the document set schedule
to Active.
An active document set is run according to its defined schedule. An inactive document set is
not run, and the remaining scheduling controls are grayed out.
11. For active document sets, specify when the documents in the set should be submitted to CIS
server for processing.
a. Click the calendar icon next to the Start Date field to select the day on which the documents
will be first submitted to CIS server.
b. Set the time of day for the first run by selecting numbers from the Hour, Minute, and
Second drop-down lists.
The Hour setting uses a 24-hour clock.
c. Specify how often this document set submits documents to CIS server by entering a number
in the Repeat box and picking the units (minutes, hours, days, weeks, or months) from
the drop-down list.


Each time the document set runs, it submits only new or revised documents to CIS server.
12. Click one of the Processing Mode option buttons to indicate whether to run this document set
in production mode or test mode.
See Test processing and production processing, page 69 for information about production and
test modes. Selecting the mode also determines which CIS server processes the document set: the
production server or the test server.
13. If you chose Test at step 12, click Select Taxonomy and select a taxonomy to run the test against.
For a test run, you can have CIS server only consider the categories in the taxonomy you are
testing. The taxonomy does not need to be online. For a production run, all synchronized
taxonomies are used for the classification.
14. Click OK to close the properties page.
15. Synchronize the document set to make it available to CIS server.
a. Select the document set you want to synchronize.
b. Select Tools > Content Intelligence > Synchronize.
The Synchronize page appears. CIS servers to Update shows which CIS server will be
updated based on the processing mode for this document set.
c. Click the OK button to start the synchronization.
If you receive any errors or warnings, refer to the error log on CIS server for details.
16. To check the status of the synchronization process, click the View Jobs button at the bottom of
the page.
When the synchronization is complete, a message indicating its success or failure is sent to
your Documentum Inbox.
17. To view the documents that the document set will submit to CIS server, click the name of the
document set on the list page.
Documentum Administrator runs the query from the Document Set Builder tab and displays the
documents in the result set.
Note: Deleting a document from this page removes it from the repository, not just from the
document set.
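
The schedule defined on the Processing tab (a start date and time plus a repeat interval) can be previewed with a short Python sketch. This only illustrates the arithmetic, not how CIS server schedules its runs. The function name is hypothetical, and the sketch covers only the fixed-length units (minutes, hours, days, weeks); months are left out because their length varies and the guide does not say how month boundaries are handled.

```python
from datetime import datetime, timedelta
from typing import List

# Units with a fixed length that datetime.timedelta accepts directly.
FIXED_UNITS = {"minutes", "hours", "days", "weeks"}


def first_runs(start: datetime, repeat: int, unit: str, count: int) -> List[datetime]:
    """Return the first `count` run times of a schedule that repeats every `repeat` units."""
    if unit not in FIXED_UNITS:
        raise ValueError(f"unsupported unit in this sketch: {unit}")
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    step = timedelta(**{unit: repeat})
    return [start + i * step for i in range(count)]


# First run at 22:30:00 (24-hour clock) on 1 January 2011, repeated every 12 hours.
for run in first_runs(datetime(2011, 1, 1, 22, 30, 0), repeat=12, unit="hours", count=3):
    print(run.isoformat())
```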

Submitting documents to CIS server


There are two ways to submit documents to CIS server for automatic categorization.
When you submit one or more documents for automatic categorization, the documents are added
to a queue awaiting CIS server processing. They are processed as CIS server retrieves documents
from its queue.

To submit a document for CIS server processing:


1. Select the document you want to classify.
2. Select Tools > Submit for Classification.


To submit a set of documents for CIS server processing:


1. Navigate to Administration > Content Intelligence > Document Sets.
A list of the existing document sets appears.
2. Select the document set you want to run.
You can only select one document set at a time. If you select multiple sets, the Start Processing
menu option is grayed out. However, several document sets can be processed at the same time.
3. Select Tools > Content Intelligence > Start Processing.
4. Enter a name for the run.
The name enables you to identify this run in the log files.
5. Click OK to submit the documents for processing.
To review the status of a processing run, open the properties page for the document set and click
the Last Run tab. For a greater level of detail, check the CIS server log files; see the Content
Intelligence Services Administration Guide.

Assigning a document manually


This section describes how to manually assign a document from a cabinet folder to a category.
CIS server must be configured to Production mode. The Assign/Unassign option is not available in
Test mode.

To manually assign a document:


1. Navigate to a cabinet and select the document to assign.
2. Select Edit > Add To Clipboard.
3. Navigate to the category to which you want to assign the document (under Administration >
Content Intelligence > Taxonomies). If not already done, switch the page view to Production view.
The list of documents belonging to the selected category in Production view is displayed.
4. Select Edit > Assign here. The document is assigned to the category, and its status is set to
assigned_manual.
If the option Link assigned documents into category folders is enabled, a relationship is created
between the document and the category folder corresponding to the selected category.
If the option Update document attributes with category assignments is enabled, the name of the
category is added as a value of the keyword attribute for the document.

Reviewing categorized documents


The My Categories page provides direct access to the categories for which you are the owner.
From the My Categories page, you can view all documents assigned to the categories you own, or
you can display just those documents assigned to the category with a status of Pending. As the
category owner, you are responsible for approving or rejecting Pending documents. The review of
Pending documents is only available in Production mode, that is, when CIS server is configured as
the production server and not the test server.
Documents receive Pending status when the confidence score that CIS server assigns to the document
is higher than the category's candidate threshold but less than its on-target threshold. When you
approve or reject a Pending document assignment, CIS server saves this information and does not ask
you to approve or reject it again (unless you clear assignments).
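The threshold rule can be sketched in a few lines of Python. This is an illustration only, not CIS server code: the function name, the status labels, and the numeric scale are assumptions, and the comparisons follow the "meets or exceeds" wording used in the description of the category threshold properties.

```python
def assignment_status(score, candidate_threshold, on_target_threshold):
    """Return a hypothetical status label for one document/category pair.

    The labels and the numeric scale are assumptions made for this sketch.
    """
    if candidate_threshold > on_target_threshold:
        raise ValueError("candidate threshold must not exceed the on-target threshold")
    if score >= on_target_threshold:
        return "assigned"      # assigned automatically, no review needed
    if score >= candidate_threshold:
        return "pending"       # candidate that the category owner must review
    return "not_assigned"      # score too low for this category


# Example with an assumed 0-100 scale: candidate = 50, on-target = 80.
for score in (90, 65, 20):
    print(score, assignment_status(score, 50, 80))
```

Lowering the on-target threshold while leaving the candidate threshold unchanged narrows the "pending" band, which reduces the number of documents the category owner has to review.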

To review candidate documents:


1.

Navigate to Administration > Content Intelligence > My Categories.


A list of the categories for which you are the category owner appears. The total number of
candidate (Pending) documents for the category appears in the right column.
Note: The My Categories list displays all categories at the same level. To view categories in
their proper hierarchical position, navigate to the categories from Administration > Content
Intelligence > Taxonomies rather than choosing My Categories.

2.

Select My Categories with pending documents from the drop-down list in the upper right.
With this option selected, the list displays only categories that have Pending documents.

3.

Click the category Name to display the complete list of documents assigned to the category, or
click the value in the Total Candidates column to display only the Pending documents.
The list of assigned documents and their assignment status appears.

4.

Select the checkbox next to the candidate document to select it.

5.

To approve the document in this category, select Tools > Content Intelligence > Approve and
click OK on the confirmation page that appears.
If you are only viewing the Pending documents, the approved document disappears from the
current view because it is no longer a candidate.

6.

To reject the suggested categorization, select Tools > Content Intelligence > Reject Candidate
and click OK on the confirmation page that appears.
The document disappears from the current view because it is no longer a candidate.

7.

Repeat steps 3 through 6 for each candidate document in categories for which you are the
category owner.

Clearing assignments
You can clear assignments at the taxonomy level or at the category level. At the category level, you
can clear assignments only for the documents in that category, or for the category and all of its children.
You can also clear the assignments for all documents belonging to a document set or for a single
document.
Clearing assignments is most common when running in test mode. If you clear assignments
made in production mode, any record of the category owner's approval or rejection of a proposed
assignment is also lost. As a result, CIS server may ask the category owner to approve or reject
category assignments again.

To remove assignments of all documents in a taxonomy or category:


1.

Navigate to Administration > Content Intelligence > Taxonomies.


A list of the existing taxonomies appears.

2.

Navigate to the category whose assignments you want to clear and select it.

3.

Select Tools > Content Intelligence > Clear Assignments.

4.

Select which types of assignments to clear.


a.

Click one of the Clear assignments with status option buttons to indicate whether to clear all
assignments, only pending assignments, or only complete assignments.

b. Click one of the Clear assignments with type option buttons to indicate whether to clear test
assignments, active assignments, or both.
5.

To clear the assignments in all subcategories, select the Include subcategories? checkbox.
If the checkbox is not selected, only assignments in the current category are cleared.

6.

Click OK.

To remove assignments of all documents in a document set:


1.

Navigate to Administration > Content Intelligence > Document Sets.


A list of the existing document sets appears.

2.

Navigate to the document set whose assignments you want to clear and select it.

3.

Select Tools > Content Intelligence > Clear Assignments.

4.

Select which types of assignments to clear.


a.

Click one of the Clear assignments with status option buttons to indicate whether to clear all
assignments, only pending assignments, or only complete assignments.

b. Click one of the Clear assignments with type option buttons to indicate whether to clear test
assignments, active assignments, or both.
5.

Click OK.

To remove the assignment for a selected document:


1.

Navigate to Administration > Content Intelligence > Taxonomies.


A list of the existing taxonomies appears.

2.

Navigate to the document whose assignment you want to clear and select it by clicking the
checkbox next to its name.

3.

Select Tools > Content Intelligence > Clear Assignments.

Refining category definitions


When you have created your taxonomy and provided evidence terms for each category, the next step
is to test how well the category definitions guide CIS server in categorizing documents.

Compile a set of test documents and submit them to CIS server. The test set should include
representatives of the various types of documents you will be processing with Content Intelligence
Services. When processing is complete, review the resulting categorization. If CIS server does not
assign some documents to the categories you expect it to, you may need to revise the category
thresholds or the evidence associated with the categories.
If a document appears in a category it should not, it means that the evidence for that category is too
broad: consider adding additional terms. If a document does not appear in a category that it should,
it means that the evidence is too restrictive.
The rule of thumb is: Make the category definition simple and test it with your documents. If it works
in most cases, leave it alone. If there are problems recognizing a category and more differentiating
data is necessary, then use compound terms as described in the topics of this section.
It is also possible to define patterns to match specific terms like phone numbers or social security
numbers. The Content Intelligence Services Administration Guide provides the detailed procedure
for defining patterns.

Using compound terms


CIS server determines whether to assign a document to a category by adding together the confidence
values assigned to the individual pieces of evidence in the category definition. For some categories,
there may be multiple, separate collections of evidence that should lead CIS server to assign a
document to the category. You can define categories that have multiple evidence sets, each of which
represents an independent means of recognizing the category.
An evidence set is a collection of terms that CIS server uses together as evidence of a particular
concept. You can create multiple evidence sets in order to define separate sets of terms. Confidence
levels are not combined across evidence sets.
When you define a category in Documentum Administrator, the first evidence set consists of simple
terms, each of which by itself is a good indicator of category membership. A simple term can be a
single word or a multi-word phrase, and is typically assigned a confidence value of High. The list
of simple terms represents the keywords and phrases for the category, and for many categories it is
the only evidence required.
When you are tuning your categories to make more subtle distinctions, you can add compound
terms to the category definition. A compound term is a collection of words and phrases that work
together to indicate category membership. Each word or phrase typically has a confidence value of
Low, Supporting, or Exclude. No one term from the collection has a high enough confidence value to
assign a document to the category, but the presence of multiple terms can cause the total confidence
score to cross the on-target threshold. The main difference between a compound term and a list of
simple terms is the confidence value of each term.
CIS server treats each compound term as an independent evidence set. That is, you can think of each
compound term as an independent definition of the category evidence. A document is assigned to
the category only if its cumulative score from any one compound term (or the list of simple terms)
exceeds the threshold.
See Defining compound evidence terms, page 78 for details about creating compound terms.
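The way evidence sets are scored independently can be sketched as follows. This is a simplified illustration, not the CIS scoring algorithm: the numeric weights standing in for confidence values, the whole-word matching, and the function names are all assumptions, and the Exclude confidence value is not modeled. It only shows that scores are summed inside one evidence set and never combined across sets.

```python
import re


def term_present(term, text):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def category_score(document_text, evidence_sets):
    """Return the best score reached by any single evidence set.

    evidence_sets is a list of dicts mapping a term to a hypothetical
    numeric weight. Scores are summed within a set, never across sets.
    """
    best = 0
    for terms in evidence_sets:
        total = sum(weight for term, weight in terms.items()
                    if term_present(term, document_text))
        best = max(best, total)
    return best


# Hypothetical definition of an "Internet Service Provider" category:
# one list of simple terms and one compound term.
EVIDENCE_SETS = [
    {"ISP": 80},                                       # simple terms
    {"internet": 30, "service": 30, "provider": 30},   # compound term
]

print(category_score("Choosing an Internet service provider", EVIDENCE_SETS))
print(category_score("Our ISP raised its internet prices", EVIDENCE_SETS))
```

In the second call, the simple term and one word of the compound term both match, but the result is the higher of the two set scores rather than their sum.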

Selecting terms
The biggest challenge when defining categories is selecting the proper terms to serve as evidence
for them. If you define a category using only terms that are unique to that category, CIS server will
not recognize the category in documents that relate to it in an indirect way. On the other hand, if
you choose common words as evidence terms, CIS server may recognize the category when the
document does not in fact belong in it.
The challenge is to create category definitions that are just complete enough to trigger category
recognition without introducing ambiguity. It is just as important to keep misleading terms out of
category definitions as it is to make sure that all viable terms are included. You might think that OR is
a viable term as part of the definition of Oregon, but OR crops up in so many other contexts that OR
should not be part of the definition of Oregon.
Note: CIS server is not case sensitive for evidence terms. OR matches OR, Or, and or.

Using common words as evidence terms


The easiest categories to define are those having proper nouns as evidence terms. Defining the
category for International Business Machines Corporation is intuitive: you would naturally include
features such as IBM and variations on the company name.
More complex category definition techniques are required when the proper noun denoting a category
is made up of several commonly occurring words. Defining a category such as Internet Service
Provider means you have to clearly specify what CIS server should not recognize as a valid term as
well as what it should recognize. Internet Service Provider is a name made up of three frequently
encountered words, and CIS server needs to recognize all three words in the correct context to
correctly assign a document to the category.
A correct definition uses both simple terms and a compound term. The list of simple terms contains
obvious and unique synonyms, such as ISP. The compound term includes each word of the phrase
Internet Service Provider as a Supporting term: no evidence is enough until all three terms are
found in a document.

Modifying category and taxonomy properties


The options on this page are the same as those for creating a new taxonomy.
If CIS server is not assigning documents properly to a category, you may need to change the on-target
or candidate thresholds. If documents appear in the category that should not, you may need to
increase the thresholds; if documents that should appear in the category do not, you may need to
lower the thresholds. If the category owner is required to approve too many documents, you can
lower the on-target threshold while leaving the candidate threshold unchanged.

To update category or taxonomy properties:


1.

Navigate to the category whose properties you want to update. Select the category and then
select View > Properties > Info.

2.

When the Properties page appears, select the Attributes tab.

3.

Update the title and description for the category if necessary.

4.

To change the category owner, click the Select owner link and choose the new owner.
The standard page for selecting a user appears. The category owner is the user who can approve
or reject documents assigned to the category as a candidate requiring approval from the category
owner; see Reviewing categorized documents, page 73 for information about the document
review process.

5.

To change the category class, choose the category class from the drop-down list.
The category class determines default behavior for the new category as well as the document
attribute to which CIS server adds the category name if you are using the Assign as Attributes
option.

6.

Update the on-target and candidate thresholds.


The on-target and candidate thresholds determine which documents CIS server assigns to a
category during automatic processing. When a document's confidence score for the category
meets or exceeds the on-target threshold, CIS server assigns it to the category. When the score
meets or exceeds the candidate threshold but does not reach the on-target threshold, CIS server
assigns the document to the category as a candidate requiring approval from the category owner.

7.

Click OK.
The property page closes.

Defining compound evidence terms


A compound term is a collection of words and phrases that work together to indicate category
membership. None of the words by itself is enough for CIS server to confidently assign a
document to the category, but when they appear in combination, they add to the confidence score. See
Using compound terms, page 76 for more information.
When a category definition includes multiple compound terms, each one defines a collection of
evidence used together to set a document's score. Confidence levels are not combined across
compound terms.
If you find during testing that a particular simple term is causing CIS server to assign too many
documents to the category, you can convert the simple term into a compound term that is more
discriminating. To convert a simple term into a compound term, click the Add additional terms link
next to the term that you want to change and follow the instructions in the procedure below.

To create a new compound evidence term:


1.

Navigate to the category whose evidence you want to update and click the icon to display
the rules page.

2.

Click the Add new compound evidence link to add a completely new compound term, or click the
Add additional term link next to a simple term that you want to convert into a compound term.

The Evidence page appears. It looks the same as the Evidence page for a simple term, except
that Prev, Next, and Finish buttons appear in place of the OK button at the bottom of the
page. These buttons enable you to navigate between the Evidence pages for each of the terms
that make up the compound term.
3.

Set the evidence properties for one of the simple terms in the compound term.
Follow steps 1 through 6 of the procedure for defining a simple term. The only difference when
defining part of a compound term is that the default system-assigned confidence level is Low.

4.

Click Next and repeat step 3 to add additional terms, or click Finish (or OK if you are converting
a simple term) to complete the compound term.
When you click Next, another instance of the Evidence page appears. The page title shows
which term you are now defining and the total number of evidence terms in the compound term
(Compound Evidence Term X of Y).
When you click Finish or OK, the individual terms of the compound term appear on a list page.
Click the Back to Rules Summary link to display again the Rules page of the category.

To modify a compound term:


1.

Click the icon next to the compound term you want to modify.
A list page appears with each individual term in the compound in a separate row.

2.

To modify a term in the compound, click the icon next to the term and change its evidence
properties.
Follow the procedure for defining a simple term.


3.

To add an additional term to the compound, select File New Evidence and set the evidence
properties for the new term.
Follow the procedure for defining a simple term. The only difference when defining part of a
compound term is that the default system-assigned confidence level is Low.

4.

To remove one or more terms from the compound, select the checkboxes next to the terms and
select File > Delete.
If removing the selected terms will result in only a single term remaining, a page appears asking
whether you want to convert the remaining term to a simple term or delete it as well.

Part 3
Configuration

This part includes the following chapters:


Chapter 6, Configuring the Type of Content Processed
Chapter 7, Configuring Document Sets

Chapter 6
Configuring the Type of Content
Processed

This chapter describes how to configure CIS to analyze Documentum object attributes in addition to,
or instead of, the content of the documents. This configuration is referred to as attribute processing.

Principles
By default, CIS analyzes the textual content of the documents to find the concepts defined in the
taxonomies or to extract entities. You can change the default behavior and have CIS analyze the
values of Documentum object attributes in addition to, or instead of, the content of the documents.
It is also possible to define a specific behavior for each document set. This configuration can only
be done by configuration files in the Repository and not in Documentum Administrator using the
Content Intelligence node.
There are two main types of configuration files:
The default configuration file: default.properties, enables you to define the type of default
processing you want (text only, attributes only, or both) and contains the list of default attributes
to use if no attributes are defined in the custom configuration file.
Custom configuration files enable you to define the type of processing and the attributes to use if
need be. You can create a custom configuration file for:
A specific document set processing.
Queue-based processing.
Interactive processing.
There is one configuration file per document set, whereas queue-based processing and interactive
processing each use a single file. The name of the properties file depends on the type of processing.
The following section details the file names.

Configuring attribute processing


To configure CIS for attribute processing:
1.

In Documentum Administrator, navigate to:


/System/Applications/CI/AttributeProcessing

If the folder does not exist, create it.


The folder contains sample files: attribute-processing-default-sample.properties
is the sample file for the default configuration file, and
attribute-processing-sample.properties is the sample file for the custom configuration file.
2.

Create the configuration files as needed. To do so, you may reuse the content of the sample files.
If they do not exist, refer to the examples at the end of the procedure.

3.

Give the appropriate name to the file:


Processing                        File name

Default (any processing)          default.properties

Document set processing           set_<docset_type>_<docset_name>.properties
                                  where <docset_type> is repo for repository document sets
                                  or file for file-based document sets (for CenterStage
                                  deployments only), and <docset_name> is the name of the
                                  document set.

Queue-based processing            queueproc.properties

Interactive processing (CI API    interactiveproc.properties
or DFS Web Service)

After you create the configuration files in the repository, you do not need to restart the CIS server.
Changes are taken into account on the next execution. If you modify the default configuration
and want to apply the changes immediately to all document sets, restart the CIS server. If you
want to reprocess all document sets, remember to clear all previous assignments before starting
the new processing.
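The naming rules for the attribute processing configuration files can be summarized in a small helper. This is a sketch for illustration only: the function and the processing labels ("default", "docset", "queue", "interactive") are hypothetical, and only the resulting file names come from this guide.

```python
def attribute_config_file_name(processing, docset_type=None, docset_name=None):
    """Return the properties file name expected for a type of processing.

    The processing labels are hypothetical keys used by this sketch only.
    """
    fixed_names = {
        "default": "default.properties",
        "queue": "queueproc.properties",
        "interactive": "interactiveproc.properties",
    }
    if processing in fixed_names:
        return fixed_names[processing]
    if processing == "docset":
        if docset_type not in ("repo", "file"):
            raise ValueError("docset_type must be 'repo' or 'file'")
        if not docset_name:
            raise ValueError("docset_name is required for document set processing")
        return "set_{}_{}.properties".format(docset_type, docset_name)
    raise ValueError("unknown processing type: {!r}".format(processing))


# "Contracts" is a made-up document set name.
print(attribute_config_file_name("docset", "repo", "Contracts"))
print(attribute_config_file_name("queue"))
```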
When a repeating attribute is defined for attribute-based processing, all instances of that repeating
attribute for the object (if any) are used.
In a custom configuration file, you can define attributes in three ways:
Define attributes in replacement of the attributes defined in the default configuration.
Define attributes in addition to the attributes defined in the default configuration.
Ignore attributes that were defined in the default configuration.
Example 6-1. Default configuration file for attribute-based processing

In this example, the processing is done by default on the attributes subject, object_name, and authors,
and on the content of the documents.
# Possible values: attributes_and_text, text_only and attributes_only
defaultInputSource=attributes_and_text
# List all attribute names to extract when processing with an input
# source of attributes_and_text or attributes_only.
# Attributes that do not exist on an object are ignored (this is not an error)
defaultAttributes=subject, object_name, authors

Example 6-2. Custom configuration file for attribute-based processing

In this example, the processing is done for a specific document set (or the queue, or an API) on the
attributes subject, authors, and product_id, but not on the content of the documents.
# Possible values: attributes_and_text, text_only and attributes_only
specificInputSource=attributes_only
# List of attribute names to extract in replacement of the attributes
# defined by defaultAttributes in the default configuration file.
specificAttributes=subject, authors, keywords
# List of attribute names to extract in addition to the attributes defined
# in defaultAttributes in the default configuration file or in
# specificAttributes above.
addedAttributes=product_id
# List of attribute names not to extract (assuming they were
# previously defined in defaultAttributes in the default configuration
# file, in specificAttributes or in addedAttributes above).
removedAttributes=keywords
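
How the default and custom files combine into the final attribute list can be sketched in Python. This is not CIS code: the parsing is deliberately minimal, the function names are hypothetical, and the order of evaluation (replace, then add, then remove) is an assumption inferred from the comments in Example 6-2.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def split_list(value):
    """Split a comma-separated value into a list of trimmed names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def effective_attributes(default_props, custom_props):
    """Resolve the attribute list: replace, then add, then remove (assumed order)."""
    specific = split_list(custom_props.get("specificAttributes", ""))
    result = specific or split_list(default_props.get("defaultAttributes", ""))
    for name in split_list(custom_props.get("addedAttributes", "")):
        if name not in result:
            result.append(name)
    removed = set(split_list(custom_props.get("removedAttributes", "")))
    return [name for name in result if name not in removed]


DEFAULT_FILE = """
defaultInputSource=attributes_and_text
defaultAttributes=subject, object_name, authors
"""

CUSTOM_FILE = """
specificInputSource=attributes_only
specificAttributes=subject, authors, keywords
addedAttributes=product_id
removedAttributes=keywords
"""

print(effective_attributes(parse_properties(DEFAULT_FILE),
                           parse_properties(CUSTOM_FILE)))
```

With the values of Examples 6-1 and 6-2, the sketch resolves to subject, authors, and product_id, which matches the description of Example 6-2.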

Troubleshooting attribute processing configuration
Note the following possible issues and recommendations:
An invalid configuration file raises an error and prevents processing. An invalid configuration file is
a file that does not comply with the required format as shown in the two previous examples.
An error is also logged if you set attribute-based configuration (such as attributes_only or
attributes_and_text) without defining the required attribute set to use.
A warning is raised when an object is to be processed based on attributes only, and that object
does not have any of the specified attributes.
Do not define the same attribute both as a target attribute for classification attribute update and
as an input attribute for attribute-based processing.
Set the following debug trace in log4j configuration to identify potential issues:
<category name="com.documentum.cis.service.internal.adapter.dfc.DfcAttributeProcessingConfiguration">
<level value="debug" />
</category>

Chapter 7
Configuring Document Sets

This chapter provides information about document set configuration.


There are two types of document sets. Most document sets are defined in DA; these are called repository
document sets. For CenterStage, CIS automatically creates one document set per space; these are
called file-based document sets.
The configuration of the document sets can be modified by configuration files (for both repository
document sets and file-based document sets), either for all document sets or for a specific document
set.
The configuration lets you select a type of processing (classification, entity extraction, or metadata
extraction) and how to store the output of the processing. It is possible to merge the output of several types of
processing and store them as one type of annotation. For example, you can define a taxonomy for
your products, then configure the entity extraction to extract any product name, and finally merge the
output of the two types of processing to store it as one type of annotation.
Note: The document set configuration does not apply to standard classification. With the document
set configuration, you can only store the category matches found by the classification as annotations
and you cannot act on category assignments.

Document set configuration files


CIS configuration can either be modified for all document sets or for a specific document set. The
configuration is done through configuration files stored in the repository, but not under the Content
Intelligence node in DA.
The following procedure describes how to edit the configuration files of the document sets.

To edit the configuration file of the document sets:


1.

In Documentum Administrator, navigate to the folder:


Cabinets/System/Applications/CI/DocsetConfiguration

2.

Do one of the following:


a.

To configure all document sets with the same entities, edit the default configuration:
i.

Locate the default.xml configuration file. This configuration will also apply to future
document sets, that is, to any new space created in CenterStage.

ii. Locate the <docset-default type="file"> element. The changes made in this element apply
to all document sets created for CenterStage spaces.
Or
Locate the <docset-default type="repo"> element. The changes made in this element
apply to all document sets created in Documentum Administrator.
b. To configure only one document set, create a configuration file:
i.

Create a copy (with a copy/paste operation) of the sample file docset-sample.xml.

ii. Locate the file space_docset_list.txt. This file lists all the document sets created
for CenterStage spaces: the first column indicates the space name, the second column
the space ID, and the third column the configuration file name for this document
set, such as <space_name>_<space_ID>.xml.
iii. Rename the configuration file using the file name indicated for this document set / space
in the list.
Caution: The configuration at the document set level overrides the configuration
made in the default.xml file for a specific section in the XML: <analysis-plan>,
<entity-detection>, <classification>, and <storage>. This means that, for example, you
can customize only the extraction of entities (and not the classification) for a given
document set.
For example, in the previous screenshot, the file annette_4_0b1109b680036a88.xml is likely
a configuration file for a space document set called annette.
The files default.xml and docset-sample.xml are available in Appendix C, Document
Set Configuration Files of this guide.
Note that it is not necessary to restart CIS after the modification of a document set file; changes are
applied dynamically.

The following table describes the elements that can be used in the configuration files.
Table 5. Descriptions of the xml elements in document set configuration files

Element and attribute

Description

<docset-defaults>

Root element of the file.

<docset-default type>

Defines the configuration for a type of document set.


Possible values are: repo for document sets created in DA, or file for
document sets created for CenterStage spaces.
Possible children are <analysis-plan>, <entity-detection>, or
<classification-annotation>.

<analysis-plan>

(Child of <annotation>) Contains the elements that define the types of


processing to be executed by the CIS server.
Possible children are: <classification-step/> and <entity-detection-step/>.

<classification-step/>

Enables the classification processing that returns applicable categories.

<entity-detection-step/>

Enables the entity extraction processing based on Luxid cartridges. Use


this processing to expose entities, such as People, Place, and Organization,
in CenterStage filters.

<metadata-extraction-step/>

Enables the metadata extraction processing.

<entity-detection>

Contains the elements that define the entity types to be extracted by the
<entity-detection-step/> processing.
Possible children are one or more <analysis> elements.

<analysis name>

(Child of <entity-detection>, <classification>, or <metadata-extraction>)


Defines the name of the analysis performed whatever the type of
processing.
Possible children are one or more elements: <entity>, <metadata>,
<rule-set>, or <repository-taxonomy> depending on the type of processing.

<entity>

Defines an entity type to extract.


The value of the <entity> element is the name of the entity type in the
cartridge, for example the concept (not the subconcept) in the Temis
cartridge TM360. Refer to Luxid documentation for the exact name of the
entity/concept.

<builtin-entity>

Used to define the default entities extracted by CIS for CenterStage clients.
Table 49, page 231 describes these default entities.

<entity levels>

Defines a concatenation of the various levels of an entity in the source


cartridge that you want to store as one entity. It does not correspond
to sub-entities.

<classification>

Contains the elements that define the concepts generated by the


<classification-step/> processing.
Possible children are one or more <analysis> elements.

<repository-taxonomy>

Defines the taxonomy to use for the <classification-annotation-step/>


processing.
The value is the name of the taxonomy used. Unlike the classification
for category assignments, not all taxonomies in production are used:
specify every taxonomy that you want to expose by adding as many
<repository-taxonomy> elements as you need.

<metadata-extraction>

Contains the elements that define the metadata to be extracted by the


<metadata-extraction-step/> processing.
Possible children are one or more <analysis> elements.

<rule-set>

Specifies the set of extraction rules.

<metadata>

Defines the name of the extracted metadata as it appears in the rule set.

<storage>

Defines how the result of the processing will be stored.


Possible children are one or more <annotation> elements.

<annotation code >

Specifies an index number higher than or equal to 1000 or the name


of an existing entity. The index number should be unique across all
configurations. To map the new entity to an entity filter already existing
in CenterStage, use the entity (internal) name instead of the index; refer to
Table 49, page 231 for the name of existing entity types.

<analysis>

(Child of <annotation>) Defines the name of the analysis processing to


be stored as an annotation. The value must match the <analysis name>
previously set.

Configuring a document set for metadata extraction
This section provides an example of document set configuration for metadata extraction.

To configure a document set for metadata extraction:


1. Create a document set in Documentum Administrator.

2. Create the configuration file for the document set as described in To edit the configuration file of
the document sets:, page 87.

3. In the <analysis-plan> element, indicate the type of processing:


<analysis-plan>
<metadata-extraction-step/>
</analysis-plan>

4. In the <metadata-extraction> element, add an <analysis> element such as:


<metadata-extraction>
<analysis name="metadata_subject">
<rule-set>metadata-rules1</rule-set>
<metadata>subject</metadata>
</analysis>
</metadata-extraction>


Where
The name attribute in the <analysis> element can be any name. It is reused later to define the
way the metadata element values will be stored.
The value of the <rule-set> element is the name of the rules file without the format extension.
The value of the <metadata> element is the value of the name attribute of the <SetMetadata
name=""> element in the rules file.
5. In the <storage> element, indicate the type of storage:


<storage>
<annotation code="1005">
<analysis>metadata_subject</analysis>
</annotation>
</storage>

Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.
Other examples of document set configuration for the other types of processing are available: To
configure the classification for CenterStage spaces:, page 230 and To configure the document sets for
new entity types:, page 232.
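
The link between the analysis name and the storage mapping can be checked before the file is deployed. The following Python sketch is only an illustration, not part of CIS. It assumes the fragments from the procedure above are wrapped in a placeholder <docset-config> root element (a hypothetical name used only so that the sample parses) and verifies that every <analysis> value under <storage> matches a declared analysis name and that numeric codes are at least 1000.

```python
import xml.etree.ElementTree as ET

# Fragments taken from the procedure above. The <docset-config> wrapper is a
# placeholder added only so the fragments parse as one document; it is not
# the real root element of a CIS configuration file.
SAMPLE = """
<docset-config>
  <analysis-plan>
    <metadata-extraction-step/>
  </analysis-plan>
  <metadata-extraction>
    <analysis name="metadata_subject">
      <rule-set>metadata-rules1</rule-set>
      <metadata>subject</metadata>
    </analysis>
  </metadata-extraction>
  <storage>
    <annotation code="1005">
      <analysis>metadata_subject</analysis>
    </annotation>
  </storage>
</docset-config>
"""


def check_storage_references(xml_text):
    """Return a list of problems found in the <storage> mappings."""
    root = ET.fromstring(xml_text)
    # Names declared through <analysis name="..."> elements.
    declared = {a.get("name") for a in root.iter("analysis") if a.get("name")}
    problems = []
    for annotation in root.iter("annotation"):
        code = annotation.get("code", "")
        # A numeric code must be at least 1000. A non-numeric code is
        # assumed to be the name of an existing entity type.
        if code.isdigit() and int(code) < 1000:
            problems.append(f"annotation code {code} is lower than 1000")
        for ref in annotation.findall("analysis"):
            name = (ref.text or "").strip()
            if name not in declared:
                problems.append(f"analysis '{name}' is not defined")
    return problems


print(check_storage_references(SAMPLE))
```

An empty list means that each stored annotation refers to an analysis defined earlier in the file.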

Converting the 6.6 document set configuration files

In CIS 6.7, the XML syntax for the document set configuration files has been improved. As a result,
the 6.6 version of the document set configuration files is not supported in CIS 6.7, and the files
must be converted. To help with this conversion, use the convert_docset_configuration script as
described in the following procedure.

To convert the document set configuration files:


1. On the CIS server machine, locate the convert_docset_configuration.bat file (on Windows hosts,
or convert_docset_configuration on Linux hosts) in <CIS installation directory>/bin.

2. Run the script with one of the following parameters:

To convert the default document set configuration file:
convert_docset_configuration.bat -Default

To convert a specific document set configuration file:
convert_docset_configuration.bat -Docset:<docset_id>

where <docset_id> is the space ID.

To find the space ID, in Documentum Administrator, locate the file space_docset_list.txt in
Cabinets/System/Applications/CI/DocsetConfiguration. This file lists all the document sets
created for CenterStage spaces. The first column indicates the space name, the second column the
space ID, and the third column the configuration file for this document set.


When converted, the configuration file version 6.6 is backed up as old_<docset_config_filename>.xml
and it is replaced with a new file with the same configuration parameters.
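
When many CenterStage spaces exist, the per-space commands can be prepared from the content of space_docset_list.txt. The following Python sketch is a hedged illustration only. It assumes that the three columns of the file are separated by tabs, which this guide does not state, and the space names, IDs, and file names in the sample are invented.

```python
def conversion_commands(listing_text, script="convert_docset_configuration.bat"):
    """Build one conversion command per space listed in space_docset_list.txt.

    Assumption: each line holds three tab-separated columns
    (space name, space ID, configuration file). Adjust the split if
    the actual file uses another separator.
    """
    commands = []
    for line in listing_text.splitlines():
        columns = [c.strip() for c in line.split("\t")]
        if len(columns) < 3 or not columns[1]:
            continue  # skip blank or malformed lines
        space_id = columns[1]
        commands.append(f"{script} -Docset:{space_id}")
    return commands


# Hypothetical content: the names, IDs, and file names are invented.
sample = (
    "Marketing\t0c0000018000a1b2\tmarketing_docset.xml\n"
    "Sales\t0c0000018000a1b3\tsales_docset.xml\n"
)
for command in conversion_commands(sample):
    print(command)
```

On Linux hosts, pass script="convert_docset_configuration" instead. Review the generated commands before running them on the CIS server machine.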


Part 4
Entity Extraction

This part describes the entity extraction processing which is one of the three different types of content
analysis: extraction of entities, extraction of metadata, and classification.
It includes the following chapters:
Chapter 8, Entity Extraction
Chapter 9, Configuring Entity Extraction


Chapter 8
Entity Extraction

Entities are pieces of information identified in context by text analysis. Entities are extracted when the
information they convey is relevant to end users.
The extraction of entities performed by CIS is exposed in CenterStage clients. The entities are
available as filters when navigating or running a search. Entities can also be stored as annotations
and accessed using the Annotation API.
To extract entities, CIS relies on an entity extraction server, currently Temis Luxid, with a text analysis
cartridge. The cartridge contains extraction rules and dictionaries to allow the identification of
entities in various languages. The entity extraction server launches the extraction processes when
triggered by the automatic scheduling of the CIS server. The CIS server collects returned entities, and
stores them for the corresponding documents.
By default, the entity extraction server identifies the following entities:
People: names of people.
Company: companies, including organizations and media.
Place: geographical locations.
You can customize the entity extraction to extract other types of entities.
This chapter describes:
The installation of the entity extraction server
The entity extraction process

Installation of the entity extraction server


By default, the installation folder for the entity extraction server is:
on Windows hosts: C:\Program Files\Documentum\CIS\Temis\Luxid
on Linux hosts: $DOCUMENTUM_SHARED/cis/Temis/Luxid
You can install additional entity extraction servers, as described in Set up a multi-node environment,
page 99.


The entity extraction process


The CIS server extracts the content of the new or modified documents in the document sets; then
the entity extraction server analyzes the content of the documents. The entity extraction server
identifies the text language, then the entities based on the cartridge rules and dictionaries, and
pushes the output to the CIS server. The CIS server filters the raw entities and stores them for the
corresponding documents. Documents are not modified.
The supported file formats for the entity extraction are identical to the supported file formats for the
classification and they are specified in Content Intelligence Services Release Notes.
The supported languages for entity extraction are: English, French, German, Spanish, Italian, and
Dutch.


Chapter 9
Configuring Entity Extraction

In CenterStage deployments, the entity extraction does not require much configuration. There is no
document set configuration. The document sets are automatically defined, based on the spaces in
CenterStage, and the CIS server maintains one document set per space. Every half hour, the CIS
server automatically checks for new spaces in the repository and the entity extraction runs for all
spaces on new and modified objects. It is possible to modify the default scheduling by setting the
property cis.server.centerstage.interval as described in Table 2, page 28.
Apart from CenterStage deployments, you can also extract entities from the documents in the
repository. To do so, perform the following steps:
1. Create a document set in DA, as described in Defining document sets, page 70.

2. Configure the document set to store the entities as annotations, as described in Chapter 7,
Configuring Document Sets.

3. Access the annotations using the Annotation API, as described in Chapter 16, Annotation API.

This chapter describes the following configuration possibilities:


Manage entity extraction services, page 97
Disable entity extraction, page 98
Set up a multi-node environment, page 99
Customize the cartridge: add named entities, page 100
Blacklist specific entities, page 103
You may also look at the following topics to test configuration changes:
Clear previous entities, page 238
Clear the document status, page 239

Manage entity extraction services


The entity extraction server relies on several services. These services are installed in automatic
startup mode.
To make sure that all services are started, check the Temis Luxid Started icon in the notification
area (usually on the right of the taskbar). If all services are started, the tooltip displayed is Temis
Luxid Started. Otherwise, the tooltip indicates the number of services started.


Start or stop the services using the icon in the notification area.
To start entity extraction services:
In the notification area, right-click the Temis Luxid Started icon and select Start Luxid.

To stop entity extraction services:
You can stop Luxid services, for example, to free up resources.
In the notification area, right-click the Temis Luxid Started icon and select Stop Luxid.

If you select Quit instead of Stop, the Temis Luxid Started icon is no longer available in the
notification area.
To restore the Luxid icon:
On Windows hosts, select Start > All programs > Startup > Luxid Starter.
On Linux hosts, navigate to $DOCUMENTUM_SHARED/cis/Temis/Luxid/AnnotationFactory/adminserver/bin/
and run the LuxidStarter script.
If the icon in the notification area does not start the services correctly, on Windows hosts, you can
start the services manually as described in the following procedure.

To start entity extraction services manually (Windows hosts):


1. Select My Computer > Manage > Services and Applications > Services.

2. Start the services listed below, in the order shown:

Documentum CIS Luxid Admin Server
Documentum CIS Luxid Vinci Naming Service
Documentum CIS Luxid SVN Service
Documentum CIS Luxid Annotation Server
Documentum CIS Luxid IDE Server v2
Documentum CIS Luxid Annotation Node
Documentum CIS Luxid Xelda Service
Wait for a service to be displayed as Started before starting the next service.
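
The manual start order can also be scripted. The following Python sketch is an illustration under assumptions, not a tool shipped with CIS. It assumes that the names listed above are the Windows display names and relies on the Windows net start command, which accepts display names and normally returns once the service reports that it has started. The run parameter is a hypothetical hook that makes it possible to check the order without touching real services.

```python
import subprocess

# Service names in the start order given in the procedure above.
LUXID_SERVICES = [
    "Documentum CIS Luxid Admin Server",
    "Documentum CIS Luxid Vinci Naming Service",
    "Documentum CIS Luxid SVN Service",
    "Documentum CIS Luxid Annotation Server",
    "Documentum CIS Luxid IDE Server v2",
    "Documentum CIS Luxid Annotation Node",
    "Documentum CIS Luxid Xelda Service",
]


def start_services(services=LUXID_SERVICES, run=subprocess.run):
    """Start each service in order and stop at the first failure.

    Assumption: the names are Windows display names accepted by
    `net start`. `run` is injectable so the sequence can be verified
    with a stand-in function.
    """
    started = []
    for name in services:
        result = run(["net", "start", name])
        if result.returncode != 0:
            # Do not start later services if an earlier one failed.
            raise RuntimeError(f"could not start {name!r}")
        started.append(name)
    return started
```

To use the sketch, call start_services() from an elevated console on the Windows host. Because the loop stops at the first failure, a service is never started before the one that precedes it in the list.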

Disable entity extraction


The extraction of entities is enabled by default for CenterStage client applications. A quick way to
disable it is to modify the cis.server.centerstage.enabled property in cis.properties, as described
in Table 2, page 28. This property activates the processing of content in CenterStage spaces. When a
space is detected, a document set is created and the default document set configuration applies.
You can also disable the entity extraction for a specific document set by modifying the document set
configuration, as described in Chapter 7, Configuring Document Sets.


Set up a multi-node environment


To increase the performance of the entity extraction, you can set up a multi-node environment.
You have two possibilities:
Allow the use of several CPUs on the same machine.
Install the entity extraction server on several machines.
This section describes the procedures to implement these possibilities.
To use several CPUs for entity extraction, modify the property cis.entity.luxid.annotation_server.cpu
in cis.properties, as described in Table 2, page 28. By default, only one CPU is used; the
maximum allowed is 32 CPUs. If you install additional entity extraction servers, indicate the total
number of CPUs to use across all machines.
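
As an illustration, the following Python sketch updates this property in the text of cis.properties and enforces the documented range of 1 to 32. It is a minimal sketch that assumes the file uses one key=value pair per line. The other property shown in the sample is invented.

```python
import re

CPU_PROPERTY = "cis.entity.luxid.annotation_server.cpu"


def set_cpu_count(properties_text, cpus):
    """Return properties_text with the CPU property set to `cpus`.

    Assumption: cis.properties uses one `key=value` pair per line.
    """
    if not 1 <= cpus <= 32:
        raise ValueError("the number of CPUs must be between 1 and 32")
    new_line = f"{CPU_PROPERTY}={cpus}"
    pattern = re.compile(rf"^{re.escape(CPU_PROPERTY)}\s*=.*$", re.MULTILINE)
    if pattern.search(properties_text):
        # Replace the existing value in place.
        return pattern.sub(new_line, properties_text)
    # Property not present yet: append it on its own line.
    needs_newline = bool(properties_text) and not properties_text.endswith("\n")
    return properties_text + ("\n" if needs_newline else "") + new_line + "\n"


# Hypothetical file content: the first property is invented.
sample = "some.other.property=value\ncis.entity.luxid.annotation_server.cpu=1\n"
print(set_cpu_count(sample, 4))
```

Keep a backup copy of cis.properties before writing the result back to the file.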

To install an additional entity extraction server:


1. On the machine where CIS is installed, navigate to the installation folder for the entity extraction
server. By default, it is:
C:\Program Files\Documentum\CIS\Temis\Luxid on Windows hosts
$DOCUMENTUM_SHARED/cis/Temis/Luxid on Linux hosts

2. Copy the installer setupAnnotationFactory.5.1.x.x.exe (for Windows), or
setupAnnotationFactory.5.1.x.x.bin (for Linux), and the license file, LAFLicense.txt, to the machine
on which you want to install the entity extraction server. Do not run the installer on the machine
where CIS is installed.

3. On the new machine, run the installer.

4. Select Node Only, then click Next.

5. Select all options, then click Next.

6. Click Next to validate the Java heap size in MB settings.

7. Define the installation path, then click Next.

8. Select an option for Luxid Annotation Factory icon location, then click Next.

9. Click Choose and select the license file LAFLicense.txt provided by TEMIS, then click Next.

10. In Luxid Annotation Server address, enter the IP address of the main server host (the machine
on which CIS is installed), then click Next.

11. Check your installation parameters, then click Install.

12. Start the services as described in Manage entity extraction services, page 97. You can configure
them to start automatically on system reboot.

13. On the machine where CIS is installed, modify the property cis.entity.luxid.annotation_server.cpu
in cis.properties, as described in Table 2, page 28, to indicate how many CPUs to use. You cannot
specify the number of CPUs for each machine.

It takes five to ten minutes for the main entity extraction server to detect the new node.


Customize the cartridge: add named entities


It is possible to add user-defined entities to the cartridge used for entity extraction. There is one
customization file for each entity type, in which you can add named entities.
To do so, you have to:
Edit the customization files.
Compile the cartridge.
The following procedures describe these tasks.

To customize the cartridge:


1. Stop Luxid:
On Windows hosts, stop the service Documentum CIS Luxid IDE Server V2.
On Linux hosts, in the notification area, right-click the Temis Luxid Started icon and select
Stop Luxid.

2. Navigate to the file RF360.tma and create a back-up copy:

<installation path for the entity extraction server>/AnnotationFactory/IDE/skillCartridges/TM360/skillUnits/RF360.tma

3. Edit the customization files as follows:

a. Navigate to the template files located at:

<installation path for the entity extraction server>/AnnotationFactory/IDE/skillCartridges/Customization Templates

b. Open the customization file with an XML editor. Select the file depending on the type
of entities that you want to add.

Table 6. Customization files for entities

Entity type    Customization template file name
Company        Company-external-lex.scp
People         Person-external-lex.scp
Place          Location-external-lex.scp

c. To add entities, locate the macro elements corresponding to this entity type and add the
entities in the <e></e> child element following the guidelines provided in this procedure.
Perform the steps corresponding to the type of entities you want to add:
Add a Company entity:
1. Locate the following macro elements:

<macro name="UserDefinedCompany" searchon="form" case="preserveFirst">
<macro name="Company" display="yes">
<macro name="CustomCompany" display="whenreferenced">

2. Add entities in the <e></e> child element, for example:

<e>
EMC
| Epic / Megagames
| Epyx
| Eurocom
| Info&amp;games
</e>

This adds five new company names.


Add a People entity:
Add a full name (first and last names):
1. Locate the following macro element:

<macro name="UserName" display="whenreferenced">

2. Add entities in the <e></e> child element, for example:

<e>
Michael / Man
| Jody / Foster
</e>

Add only the first name:
1. Locate the following macro element:

<macro name="UserDefinedFirstName" searchon="form" case="preserveFirst"
display="no">

2. Add entities in the <e></e> child element, for example:

<e>
John
</e>

Add only the last name:
1. Locate the following macro element:

<macro name="UserDefinedFamilyName" searchon="form" case="preserveFirst"
display="no">

2. Add entities in the <e></e> child element, for example:

<e>
Bush
</e>

When adding only the first name or only the last name, the entity is extracted only if it
matches a pattern used by the extraction engine. For example, if the first name appears
only once in a document and the context does not allow the engine to detect that it is a first
name, then the entity is not extracted.
Add a Place entity:
To add a country:
1. Locate the following macro elements:

<macro name="UserDefinedLocation" searchon="form" case="preserveFirst">
<macro name="Location" display="yes">
<macro name="Geopolitical">

2. Under the macro child element for the corresponding continent, add entities in the
<e></e> child element, for example:

<macro name="Africa">
<e>
Sampleland
| Sampletwoland
</e>
</macro>

To add a city:
1. Locate the following macro elements:

<macro name="UserDefinedLocation" searchon="form" case="preserveFirst">
<macro name="Location" display="yes">
<macro name="Geopolitical">

2. Add a <macro></macro> element for the corresponding country.

3. Add entities in the <e></e> child element, for example:

<macro name="Africa">
<macro name="Algeria" searchon="form" display="whenreferenced">
<e>
Newtown
| Newcity
</e>
</macro>
</macro>

The entities in the <e></e> child element must comply with the following guidelines:
After the first entry, start each entry with a separator | (vertical bar).
In multi-word entries, separate each word with a slash /.
Write the ampersand character or the angle brackets in a protected way:

Table 7. Special characters in customization files

Special character    XML encoding
&                    &amp;
<                    &lt;
>                    &gt;

You can use regular expressions. Therefore, use a backslash as an escape character for the
period (.), the asterisk (*), the question mark (?), the plus sign (+), and the exclamation mark (!).
Do not add the company extensions such as Inc. or Corp. The extensions are automatically
analyzed during the entity extraction.
By default, entity matching is case-sensitive only on the first character of the first word.
4. In your Environment Variables, add the following path in the Path system variable:
<installation path for the entity extraction server>/AnnotationFactory/jre/bin

such as
C:\Program Files\Documentum\CIS\Temis\Luxid\AnnotationFactory\jre\bin

5. Launch a new console to take into account the new path.

6. Compile the cartridge as follows:

a. Locate the script jscc.exe (on Windows hosts, or jscc on Linux hosts) at:
<installation path for the entity extraction server>/AnnotationFactory/IDE/bin

b. Run the following command:

jscc -p <full path for component.scp> -t <full path for RF360.tma>

where component.scp is the name of the template file you modified.

7. Restart Luxid:
On Windows hosts, start the service Documentum CIS Luxid IDE Server V2.
On Linux hosts, in the notification area, right-click the Temis Luxid Started icon and select
Start Luxid.

Example 9-1. Add named entities to the Company customization file

To add the following names:


Eidos Interactive
Epic MegaGames
Epyx
Eurocom
Info&Games
you would modify the customization file as follows:
<macro name="UserDefinedCompany" searchon="form" case="preserveFirst">
<macro name="Company" display="yes">
<macro name="CustomCompany" display="whenreferenced">
<e>
Eidos / Interactive
| Epic / Megagames
| Epyx
| Eurocom
| Info&amp;games
</e>
</macro>
</macro>
</macro>
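
The formatting guidelines for the <e></e> child element can be applied mechanically when a long list of names has to be added. The following Python sketch is an illustration only and is not part of the Luxid tooling. The function names are hypothetical. The sketch separates the words of an entry with a slash, starts each entry after the first one with a vertical bar, escapes the regular expression characters listed in the guidelines with a backslash, and encodes the ampersand and the angle brackets. It keeps the case of the names as typed.

```python
import re
from xml.sax.saxutils import escape

# Characters that the guidelines say to escape with a backslash.
_REGEX_CHARS = re.compile(r"([.*?+!])")


def format_entry(name):
    """Format one entity name for the <e> element."""
    words = []
    for word in name.split():
        word = _REGEX_CHARS.sub(r"\\\1", word)  # . * ? + ! become \. \* \? \+ \!
        words.append(escape(word))              # & < > become &amp; &lt; &gt;
    return " / ".join(words)


def format_entries(names):
    """Join entries, starting every entry after the first with '| '."""
    return "\n| ".join(format_entry(n) for n in names)


print(format_entries(["Eidos Interactive", "Epyx", "Info&Games"]))
```

Paste the printed lines between the <e> and </e> tags of the relevant macro element. Remove company extensions such as Inc. or Corp. from the names before formatting them.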

Blacklist specific entities


If some specific words or phrases are extracted as entities and you do not want them to be displayed
or returned in your client interface, you can blacklist them. To do so, perform the following steps.

To blacklist specific entities:


1. Navigate to the file StopEntities.txt located at:

<CIS installation directory>/resources/luxid

2. Open StopEntities.txt with a text editor.

3. Add each word or phrase on a new line, taking into account the following guidelines:
Words or phrases are case-insensitive.
The spaces at the beginning or at the end of the line are not taken into account.
Regular expressions are not allowed.
Comments are allowed when beginning with a pound sign (#).

4. Save the file. Make sure the encoding is still UTF-8.
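
The effect of these guidelines can be pictured with the following Python sketch. It is not the CIS implementation. It only mirrors the stated rules (case-insensitive literal comparison, surrounding spaces ignored, comment lines starting with a pound sign) and assumes that a comment occupies a whole line. The sample terms are invented.

```python
def load_stop_entities(text):
    """Parse the content of StopEntities.txt into a set of lowercase phrases."""
    stop = set()
    for line in text.splitlines():
        phrase = line.strip()          # leading and trailing spaces are ignored
        if not phrase or phrase.startswith("#"):
            continue                   # skip blank lines and comment lines
        stop.add(phrase.lower())       # comparison is case-insensitive
    return stop


def filter_entities(entities, stop):
    """Drop the entities whose text is blacklisted (literal comparison only)."""
    return [e for e in entities if e.strip().lower() not in stop]


# Hypothetical file content and extracted entities.
sample = "# Terms that should not appear as entities\n  Big Blue  \nacme\n"
stop = load_stop_entities(sample)
print(filter_entities(["ACME", "EMC", "big blue"], stop))
```

When the file is edited by a script rather than in a text editor, write it back with UTF-8 encoding.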


Part 5
Classification

This part describes the classification processing, which is one of the three different types of content
analysis: extraction of entities, extraction of metadata, and classification.
It includes the following chapters:
Chapter 10, Classification Process
Chapter 11, Configure CIS Standard Classification
Chapter 12, Use the Taxonomy Exchange Format (TEF)


Chapter 10
Classification Process

There are two types of classification process:


The standard classification is based on all the taxonomies synchronized in production mode and
returns category assignments. Optionally, it can update attributes or link documents to folders.
The second type of classification relies on the document set configuration. In the document
set configuration, you define the set of taxonomies to use and the name of the annotations that
will be used to store the concepts found by classification. This classification is used to expose
classification concepts in CenterStage filters, as described in Chapter 15, Expose Classification
Concepts or Entities in CenterStage Filters. Another way to access annotations is to use the
Annotation API described in Chapter 16, Annotation API.
This chapter explains the classification process. It includes the following sections:
Data synchronization for classification, page 107
Select documents to process, page 108
Submit documents for processing, page 108
Conceptual analysis and category score, page 109
Stemming capability, page 112
Auto categorization of the documents, page 113
Pattern analysis, page 114
Classification information, page 116
Category assignments configuration, page 116
Classification roles, page 117
Use Documentum Administrator to configure a repository and define taxonomies and document sets.

Data synchronization for classification


Before performing any classification, you have to define taxonomies and document sets. These
definitions need to be synchronized to make them available to the CIS server.
The CIS server uses snapshots of the taxonomies. Taxonomy snapshots are XML files representing
the taxonomies. They are created or updated when the taxonomies are synchronized. So, if a
taxonomy definition has never been synchronized, the CIS server cannot use it. This synchronization
is not automatic but manually triggered by the CIS administrator. When a CIS server is set for
both Test and Production modes, two versions of the taxonomy snapshots are available: one for the
production mode and one for the test mode. The taxonomy snapshots are stored in the repository.
When the CIS server restarts, it checks the validity of the snapshots: if taxonomies have been
removed, the corresponding snapshots are deleted. However, note that if taxonomies have been
modified, the snapshots are only updated by a manual synchronization.
The document set definitions are read by the CIS server on CIS server restart. When a document set
definition is modified, the CIS server reads the updated definition when the document set is restarted
or when it is synchronized.

Select documents to process


The document set definition tells the CIS server which documents to retrieve and provides
instructions on how to handle the documents. The selection of documents based on their location,
their object type, or their attributes can also be done by creating property rules for a taxonomy or a
category. However, defining the documents selection in the document set definition is much more
efficient because it filters documents before analyzing their content whereas filtering by category
rules in a taxonomy or category is done once the document is processed.
In addition to determining which documents the CIS server processes, the document set has various
configuration options that control how the documents are processed. For example, the document
set definition can include a processing schedule that determines how often the CIS server processes
documents.

Submit documents for processing


When the CIS server is running, it continuously checks the document set schedule to determine when
to retrieve documents for processing. When the scheduled time arrives, the CIS server connects to the
repository, runs the DQL query for the document set, and checks for documents added or updated
since the last time it ran the query. Define submission schedules, page 109, provides information
about defining a schedule for a document set.
Defining a schedule for a document set is optional. Typically, if you are using Content Intelligence
Services only for its Web Publisher integration, you do not define schedules. Even in a more
batch-oriented implementation scenario, you have the option to process documents on demand.
This option can be especially useful during the development and testing phases of system
deployment. Submit documents on demand, page 109, provides information on methods to submit
documents on demand.
You can also resubmit documents that the CIS server has already processed. For example, you
may want the CIS server to reanalyze documents after you modify the taxonomy definitions. This
situation is common during the development and testing phases of your implementation. Resubmit
documents, page 109, provides information about reanalyzing a document.


Define submission schedules


The schedule the CIS server follows to poll for submitted documents is set in the configuration of the
document sets. Each document set can have a different schedule.
Each time the document set is polled for documents, the start time is reset automatically. The new
start time is the polling interval after the end of the previous processing. For example, suppose the
start time is January 1, 2002 at 2 a.m. and the interval is one day (1 0:0:0). Suppose also that processing
takes one hour. The next start time will be January 2 at 3 a.m.
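
The example above can be reproduced with the following Python sketch. It assumes that the interval notation 1 0:0:0 reads as days, then hours:minutes:seconds, which is inferred from the example rather than stated in this guide. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_interval(text):
    """Parse an interval written as 'D H:M:S', for example '1 0:0:0'."""
    days, clock = text.split()
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


def next_start(start, processing_time, interval):
    """The next start is one polling interval after the end of processing."""
    return start + processing_time + parse_interval(interval)


# Start on January 1, 2002 at 2 a.m., one hour of processing, one-day interval.
print(next_start(datetime(2002, 1, 1, 2, 0), timedelta(hours=1), "1 0:0:0"))
# prints 2002-01-02 03:00:00
```

Because the interval is counted from the end of processing, the start time drifts later whenever processing takes a noticeable amount of time.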

Submit documents on demand


There are several ways to submit documents to the CIS server on demand:
To submit one document, do either of the following:
Click the See CIS values button in WDK-based applications.
Select Tools > Submit for categorization in WDK-based applications.
To submit a document set, select Tools > Content Intelligence > Start Processing in Documentum
Administrator.
The CIS server regularly processes documents submitted on demand using the online taxonomies.
The documentation for WDK-based applications provides more information about the See
CIS values option.

Resubmit documents
Sometimes you want the CIS server to reanalyze documents, for example, when you have modified
the documents of a document set. You can then resubmit the documents using one of the
procedures described in the previous section, Submit documents on demand, page 109.
If a submission schedule has been defined for the document set containing the modified documents,
the documents are automatically reprocessed the next time the schedule runs. When a scheduled
process starts, it automatically retrieves the last update of the document set and checks whether
the version of the documents has been modified since its last run.
If the documents have not been modified, the CIS server does not start any new process. If you
want to reprocess documents against a new taxonomy, use the Clear assignments function on
the document set first.

Conceptual analysis and category score


During the classification process, the document is analyzed using the taxonomy definition. The
taxonomy definition identifies the evidence the CIS server looks for to determine whether a document includes the
category or concept. The confidence values assigned to the various pieces of evidence determine
when the CIS server signals a hit for the category.
When the CIS server analyzes a document, it automatically computes a score for each keyword and a
score for the whole expression. The expression score is then compared to the category thresholds
to define the document as:
not assigned if the candidate threshold is not reached.
candidate or pending if the candidate threshold is reached but not the on-target threshold.
assigned if the on-target threshold is reached.
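
The comparison with the two thresholds can be summarized by the following Python sketch. It treats a threshold as reached when the score is equal to or higher than it, which is an assumption, and the threshold values in the sample call are invented.

```python
def assignment_status(score, candidate_threshold, on_target_threshold):
    """Map an expression score to the status of the document for a category."""
    if score >= on_target_threshold:
        return "assigned"
    if score >= candidate_threshold:
        return "candidate"      # also called pending
    return "not assigned"


# Hypothetical thresholds: 40 for candidate, 80 for on-target.
for score in (25, 55, 90):
    print(score, assignment_status(score, 40, 80))
```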

Category score computation


Each category definition lists the words and phrases used as evidence that a document belongs
to the category. These words and phrases are called evidence terms. We can define three types of
evidence terms:
simple evidence terms: Words or phrases (words in order or not) are simple evidence terms
and can also be referred to as keywords.
compound evidence terms: A compound evidence term is an expression, a group of words
and phrases that work together to indicate category membership. None of the words or phrases
by themselves can be considered as an evidence term. Only their combination can add to the
confidence score.
patterns: A pattern is also known as a regular expression and describes a set of strings. For
example, patterns can match credit card numbers, phone numbers, or Social Security numbers.
First, when defining keywords, remember that keywords are words. They consist of characters,
mostly letters, where a letter is an element of an alphabet rather than any character in a character map.
However, you can add digits or punctuation marks to your keywords if they are part of the word as,
for example, in merry-go-round. Define a keyword as a pattern in the following cases:
The keyword includes only digits.
The keyword uses a character such as a symbol or a currency sign.
The keyword defines an acronym.
Whereas simple and compound evidence terms can be defined in Documentum Administrator, the
pattern analysis must be configured manually. Pattern analysis, page 114 provides information
on how to define patterns.
Most of the time, the first definition of a category is a list of simple evidence terms, which is
called an evidence set. Each compound term can also be seen as an independent evidence set.
When the CIS server analyzes a document, it looks for these evidence terms. It then computes a
score for each evidence term based on the confidence value defined for the evidence term. All the
evidence terms must be found in the document within a given region of text. By default, the size of
this matching window is set to 1000 words. This proximity matching is useful to prevent accidental
matching of large documents where evidence terms could be spread apart in the document. If need
be, you can modify the size of the window or deactivate it in the CIS configuration file, cis.properties.
The CIS server then calculates a score for the document based on the scores of the evidence terms found.


Finally, the CIS server compares the document score with the category thresholds to determine
whether the document is assigned to the category, not assigned, or left as a pending candidate.
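
The proximity matching described above can be pictured with the following Python sketch. It is a simplified illustration, not the CIS algorithm. It handles single-word terms only, splits text on white space, and ignores scoring. It only answers whether all the terms occur inside some run of consecutive words no longer than the window.

```python
def terms_within_window(words, terms, window=1000):
    """Return True if every term occurs inside some run of `window` words."""
    wanted = {t.lower() for t in terms}
    if not wanted:
        return True
    counts = {}  # occurrences of each wanted term in the current window
    left = 0
    for right, word in enumerate(words):
        w = word.lower()
        if w in wanted:
            counts[w] = counts.get(w, 0) + 1
        # Drop words from the left until the window holds at most `window` words.
        while right - left + 1 > window:
            lw = words[left].lower()
            if lw in wanted:
                counts[lw] -= 1
                if counts[lw] == 0:
                    del counts[lw]
            left += 1
        if len(counts) == len(wanted):
            return True
    return False


text = "apple released a new computer while fruit prices fell".split()
print(terms_within_window(text, ["apple", "computer"], window=5))
print(terms_within_window(text, ["apple", "fruit"], window=5))
```

With a window of five words, the first call finds both terms close together and the second does not, which shows how a window prevents matches between terms that are spread apart in a large document.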
Each evidence term in the category definition has a confidence value assigned to it. The confidence
value specifies how certain the CIS server can be about scoring a hit for a document when it contains
the term. For example, if a document includes the text IBM, the CIS server can be nearly certain that
the document relates to the category International Business Machines. Therefore, the confidence
level for the term IBM is High.
Other pieces of evidence may suggest that the category might be appropriate. For example, if a
document includes the text Big Blue, the CIS server cannot be certain that it refers to International
Business Machines. The confidence level is Low, meaning that the CIS server should score a hit
for the category International Business Machines only if it encounters the text Big Blue and other
evidence of the same category in the document.
You can also exclude evidence terms. For example, suppose you have a category for the company
Apple Computers. The term Apple is certainly evidence of the category. However, if the term
fruit appears in the same document, you can be fairly sure that Apple refers to the fruit and not
the company. To capture this fact, you would add fruit as an excluded evidence term to the Apple
Computers category.
Finally, you can define terms as required terms. In this case, the document must contain at least one
Required term. If only Required terms are defined for the category, then only one is sufficient to
assign the document to the category. If the evidence terms are not only Required terms, then the
document must contain one Required term and have a confidence score high enough for the category.
The confidence values for evidence terms are integers from 0 through 100.
When you set confidence values in Documentum Administrator, you can choose a predefined
confidence level or enter a number directly. The predefined values are:
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: This evidence by itself does not cause the CIS server to score a hit for a document.
However, it increases the confidence level of other evidence found in the same document.
Exclude: If one of the evidence terms found in a document has this confidence level, then the
document will never be assigned to this category.
Required: These terms are must-have terms but they are not taken into account for the score
of a document.
If the resulting score exceeds or meets the on-target threshold of a category, the CIS server assigns the
document to the category. If the score is lower than the on-target threshold but higher than or equal to
the candidate threshold, the CIS server assigns the document to the category as a Pending candidate;
the category owner must review and approve the document before the assignment is complete. If the
score falls below the candidate threshold, the CIS server does not assign the document to the category.
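The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration, not CIS code: the function name and its parameters are hypothetical, and it takes the document score as an input because this section does not state how the CIS server combines evidence scores. The case where a category defines only Required terms is not covered.

```python
# Predefined confidence levels and their numeric equivalents, as listed above.
PREDEFINED_CONFIDENCE = {"High": 75, "Medium": 50, "Low": 15}


def decide_assignment(score, on_target, candidate,
                      excluded_found=False,
                      required_defined=False, required_found=False):
    """Return 'assigned', 'pending', or 'not assigned' for one category.

    Hypothetical helper: 'score' is the document score already computed
    for the category; 'on_target' and 'candidate' are its two thresholds.
    """
    # An Exclude term found in the document always blocks the assignment.
    if excluded_found:
        return "not assigned"
    # When Required terms exist, at least one must be present.
    if required_defined and not required_found:
        return "not assigned"
    # Meeting or exceeding the on-target threshold assigns the document.
    if score >= on_target:
        return "assigned"
    # Between the two thresholds, the document is a Pending candidate.
    if score >= candidate:
        return "pending"
    return "not assigned"
```

With an on-target threshold of 75 and a candidate threshold of 50, a score of 60 leaves the document pending until the category owner reviews it.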


Stemming capability
CIS is fully Unicode-compliant, which enables the classification of documents written in any language.
Stemming is the ability to recognize that words such as fishing, fished, fish, and fisher share the
same root word, fish. Content Intelligence Services supports stemming for documents in English,
French, German, Spanish, Italian, Portuguese, Danish, Dutch, Norwegian, Swedish, Romanian,
Russian, Finnish, Hungarian, or Turkish. The language dictionaries are embedded in CIS; you do
not need to download them separately.
Note: The CIS server processes documents using the stemming capability only in the configured
language.

Stemming mechanism
You need to supply CIS with the initialization data it requires. This initialization data includes the
following:
The language set at the category level and at the document level.
The indication of whether the stemming is activated or not.
Default settings are in the cis.properties file as shown in the following example.
Example 10-1. Extract of the cis.properties file
# The default language of the linguistic engine.
cis.linguistic.language.default=english
# Whether the word stemming feature is allowed globally (true/false).
cis.linguistic.stemming.allowed=true

The language used for the stemming should be defined for the documents and for the categories.
When no language is defined, either for the documents or for the categories, the default language is
used.
When you specify the language of a document, the text of the document is analyzed and stemmed
according to this language. Then the result of the analysis is compared with the evidence terms of
categories that have the same language or no language defined. Defining a language for a category
acts as a filter: a document is never assigned to a category of a different language.
To set the language for the documents that you want to classify, you can either set it for every
document or for an entire document set. When a language is set for a document set, it prevails
over the language set for individual documents. This behavior prevents classification errors
when the document language is not correctly set. You can select only one language per document. If
the document set is made of many documents in different languages, then the language must be
set at the document level and not at the document set level. When no language is defined for the
documents or for the document sets, if the stemming is activated, the language used is the one
defined in the CIS server configuration.


You can also define the language of the categories used for the classification. The language can be
set for every category or for the entire taxonomy. If the language of a category is not specified, then
the language of the taxonomy is used; the category does not inherit the language of its parent category, if any.
When no language is defined, the language used is the one defined in the CIS server configuration.
You can also define the language as Any language, which means that documents in any language
can be assigned to this category.
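The precedence rules in this section can be summarized with a short Python sketch. It is an illustration under assumptions, not the CIS implementation: the function names and the "any" marker are hypothetical, and None stands for a language that has not been set.

```python
ANY_LANGUAGE = "any"  # hypothetical marker for the Any language setting


def document_language(doc_lang, doc_set_lang, default_lang):
    """Language used to stem a document.

    The document set language prevails over the document language;
    the server default applies when neither is set.
    """
    return doc_set_lang or doc_lang or default_lang


def category_language(cat_lang, taxonomy_lang, default_lang):
    """Language of a category.

    Falls back to the taxonomy language (never the parent category),
    then to the server default.
    """
    return cat_lang or taxonomy_lang or default_lang


def languages_compatible(doc_language, cat_language):
    """A document is never assigned to a category of a different language."""
    return cat_language == ANY_LANGUAGE or doc_language == cat_language
```

For example, a document marked french inside a document set marked english is processed as english, because the document set setting prevails.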

Configure CIS default language


When the stemming functionality is enabled, a default language must be set in case the language is not
defined at the taxonomy or category level. You can set the default language in the cis.properties file.
By default, the cis.properties file is available in the directory <CIS installation directory>/config.
However, it can be located in another folder as long as the Java classpath points to the correct folder.
The following procedure describes how you can change the default language to French. Similarly,
you can change the default language to any of the other available languages.

To change the default language to French:


1. Stop the CIS server.
2. Open the cis.properties file.
3. Locate the cis.linguistic.language.default property.
4. Set its value to french or fr.

cis.linguistic.language.default=french

The language value can be either the full language name in lowercase English, such as
french, or the two-letter ISO 639-1 language code, such as fr.
5. Restart the CIS server.
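Before restarting the server, it can help to check the edited values. The following Python sketch parses text in the key=value form used by cis.properties and checks that the default language looks like a lowercase name or a two-letter code. The helper names are hypothetical, the parser handles only simple lines (no continuations or escapes), and the check does not confirm that CIS actually supports the language.

```python
import re

SAMPLE = """\
# The default language of the linguistic engine.
cis.linguistic.language.default=french
# Whether the word stemming feature is allowed globally (true/false).
cis.linguistic.stemming.allowed=true
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def looks_like_language_value(value):
    """True for a lowercase name (french) or a two-letter code (fr)."""
    return re.fullmatch(r"[a-z]{2,}", value) is not None


props = parse_properties(SAMPLE)
language = props["cis.linguistic.language.default"]
stemming = props["cis.linguistic.stemming.allowed"] == "true"
```

A value such as French (capitalized) fails the check, which matches the lowercase requirement stated in step 4.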

Auto categorization of the documents


When the CIS server processes a document, it performs a semantic analysis of the document based
on the taxonomy you have defined. From the analysis, Content Intelligence Services knows what
concepts are covered in the document. With the Auto Categorization feature, Content Intelligence
Services uses this information to link the document to appropriate category folders in your repository
structure.
The CIS server uses a taxonomy to analyze the content of documents. The list of concepts in the
taxonomy identifies which concepts you want the CIS server to look for in each document. The
definition of each concept tells the server how to determine whether that concept is covered in a
document, that is, what evidence to look for.
When the CIS server completes its analysis of a document, it has a list of the concepts discussed in
that document.
To use the Auto Categorization feature, the taxonomy structure must mirror a folder structure in the
repository. To ensure that the two structures remain parallel, a repository folder is automatically


created when you define a category. Auto Categorization occurs automatically when documents
are processed. When a document is assigned to a category, if the option "Link assigned documents
into category folders" is enabled, the document is automatically linked to the repository folder
representing that category.

Pattern analysis
This section describes the Pattern analyzer feature and includes the following topics:
Patterns as evidence terms, page 114
Limitations, page 115
Use patterns in rules, page 114
Configure pattern analysis, page 115

Patterns as evidence terms


Pattern support uses a standard form of regular expressions. A regular expression, often called a
pattern, is an expression that describes a set of strings. The simple and compound evidence terms
used for content analysis and defined with Documentum Administrator are words and phrases. As
an advanced feature, you can define patterns to be used as evidence terms to describe a category.
Patterns can match numbers such as credit card numbers, phone numbers, Social Security numbers,
and so on.
For information about simple and compound evidence terms, see the Content Intelligence chapter in
the Documentum Administrator user guide.

Use patterns in rules


Once defined, patterns can be used as keywords when you define the evidence terms for a category.
CIS processes patterns by first compiling them and then applying them to the document. If the pattern
value is matched, a pattern token is inserted. For example, if the Phone pattern is found in the text,
the token $Phone is automatically inserted. You can treat pattern tokens just like any keywords
when defining the CIS taxonomy. In our Phone example, if you set up a Phone category, simply
add a simple evidence term $Phone to the list of evidence. It is also possible to create compound
evidence terms. For example, you could specify that both $Phone and another token
(such as $CreditCard) must be found to classify the document in the category.
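The compile-then-apply flow described above can be illustrated with a short Python sketch. It is not the CIS pattern analyzer: the two regular expressions are hypothetical examples, Python's re module stands in for the Java regular-expression engine, and the sketch only reports which tokens would be produced rather than inserting them into the text.

```python
import re

# Hypothetical patterns, compiled once and keyed by their tokens.
PATTERNS = {
    "$Phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "$SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pattern_tokens(text):
    """Return the sorted tokens of all patterns found in the text."""
    return sorted(token for token, regex in PATTERNS.items()
                  if regex.search(text))


def has_all_tokens(text, required_tokens):
    """Compound-style check: every listed token must be present."""
    return set(required_tokens) <= set(pattern_tokens(text))
```

A category that lists $Phone as a simple evidence term would then score a hit for any document in which pattern_tokens returns that token.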


Limitations
Patterns are defined using a standard regular expression language. However, this language is not a
natural language. In addition, searching for many patterns can slow down performance significantly.
Two or three patterns do not cause noticeable performance degradation, but the cost adds up as you
define more patterns.
You cannot define patterns using capturing groups; they are not supported and are replaced by
non-capturing groups. Instead, you can use a separator such as (-?|[\\. ]).

Configure pattern analysis


To configure the pattern analysis, define patterns in the property file: patterns.properties. The
following procedure describes the required steps to define a pattern.

To define patterns:
1. Locate the patterns.properties file in the directory <CIS installation directory>/config.
2. In a text editor, open patterns.properties.
3. Since pattern analysis may affect the performance of the content analysis, it can be turned off.
By default, the feature is enabled. To enable or disable pattern analysis, set the property
pattern.processing.enabled.
4. To see additional tracing information in the log file, set the property tracing.enabled to true.
This information only applies to pattern loading and processing.
5. Append the required data for defining patterns.


For each pattern, you can define the following parameters:
Scope: reserved for future use; the default value must be (Global).
Value: value of the actual pattern.
Note: Pattern value syntax corresponds to the REGEXP syntax with one exception: each
backslash character \ must be escaped by placing another backslash character in front of it:
\\. For more information about patterns and regular-expression constructs, you may refer
to http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html.
Also, avoid overly general patterns that match many words in numerous documents,
since they would degrade performance.
Token: string to reference the pattern as evidence in a category definition. A guideline to
define tokens is to write them with the $ character at the beginning, such as $Phone or
$CreditCard.
Sample: this value can be used to run a special unit test: TCisPatternAnalyzer. This unit
test automatically loads and validates all patterns in this file.
Note: You can create a QA utility based on this unit test. The Sample parameter is ignored
by the runtime processing engine.
Description: notes of the pattern creator.


Suffix each parameter name with the sequence number of the pattern. For example, when
defining the third pattern, its parameters must end with 3, such as pattern.scope.3,
pattern.value.3, and so on.
6. Save and close patterns.properties.
7. After you have defined patterns, you can use them as evidence terms for a category in Documentum
Administrator.

Example 10-2. Pattern description for US Social Security Number


pattern.scope=(Global)
pattern.value=^\\d{3}-\\d{2}-\\d{4}
pattern.token=$SSN
pattern.sample=Text 555-55-5555 Text
pattern.description=U.S. Social Security Number regular

In this example, you can see that the pattern is made of three digits: \\d{3}, then two digits: \\d{2},
then four digits: \\d{4}, separated by hyphens: -.
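The following Python sketch shows one way to group numbered pattern parameters and to check a pattern against its sample, in the spirit of the Sample parameter. It rests on several assumptions: the helper names are hypothetical, the keys follow the pattern.<parameter>.<number> form described in the procedure, Python's re module stands in for Java regular expressions, and the leading ^ of Example 10-2 is omitted so that a sample with text before the number matches under re.search.

```python
import re
from collections import defaultdict

# Hypothetical file content; each doubled backslash is one escaped backslash.
SAMPLE = r"""
pattern.processing.enabled=true
pattern.scope.1=(Global)
pattern.value.1=\\d{3}-\\d{2}-\\d{4}
pattern.token.1=$SSN
pattern.sample.1=Text 555-55-5555 Text
"""


def load_patterns(text):
    """Group pattern.<parameter>.<number> entries by their number."""
    patterns = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("pattern.") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = key.split(".")
        # Skip keys without a numeric suffix, such as the enabled flag.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        patterns[int(parts[2])][parts[1]] = value
    return dict(patterns)


def to_regex(value):
    """Undo the properties-file escaping: two backslashes become one."""
    return re.compile(value.replace("\\\\", "\\"))


def sample_matches(entry):
    """Check that the pattern value is found somewhere in its sample."""
    return to_regex(entry["value"]).search(entry["sample"]) is not None


patterns = load_patterns(SAMPLE)
```

Running sample_matches on each loaded entry gives a quick sanity check of a new pattern before it is used as an evidence term.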

Classification information
The classification information indicates which documents have been processed or are being processed.
The classification results are primarily category assignments. Depending on the CIS configuration,
they can also include folder assignments and document metadata updates (Assign as Attributes
option). Classification information and category assignments are stored in the repository.
The classification is done per document tree. Only the CURRENT version of the documents is
categorized but the entire tree is assigned to a category. When clearing the assignments, only the
assignments of the current version of the documents are removed.
If you create a version of a document, remember that it inherits the metadata attributes of the
previous version. So, if you are using the Assign as Attributes option, the attribute values generated
by the classification of the previous version of the document may no longer be relevant.

Category assignments configuration


Category assignments are created as the result of the standard classification process. A category
assignment is a relationship between a document and the categories it is assigned to.
Content Intelligence Services can reflect category assignments in the repository in two ways:
Link to Folders: CIS maintains a set of folders whose names and hierarchy correspond to the
categories in the taxonomy. When a document is categorized, CIS creates a relationship between
the document and the category. When users view the category folder in DA, they see the assigned
documents, but the documents are not linked into the folder in the same way documents are
linked into folders in other parts of the repository. When you select this option, however, CIS


creates a full link between the document and the folder in addition to its normal assignment
relation. This allows users to see the documents in the taxonomy hierarchy.
Assign as Attributes: When a document is categorized, CIS writes the names of assigned
categories in the attributes of the document. The category class definition specifies which
document attribute is updated for each matching category.
You can configure CIS to record category assignments in both of these ways, one of them, or neither.
If neither Link to Folders nor Assign as Attributes is active, Webtop users are not able to see the
category assignments.
You should select these options only when you know you need the functionality they provide.
Default CIS functionality is adequate in most cases.
Note: Category assignments are only exposed in Webtop clients and not in CenterStage clients.

Classification roles
This section describes the two roles involved in the classification process:
The taxonomy manager, page 117
The category owner, page 117

The taxonomy manager


The taxonomy manager is responsible for creating the taxonomy tree by defining the taxonomies, the
category classes, and the categories. As part of the category definition, this person sets the required
evidence terms, any necessary property rule, and the category owner. The taxonomy manager also
defines the thresholds used to assign documents automatically, to keep them as pending documents
or to reject them. Taxonomies and categories can be defined from scratch or imported.
The second job of the taxonomy manager is to define the document sets: which documents will be
categorized and when, and whether the document set is meant to be run automatically.
Before launching any production process, the taxonomy manager must test the taxonomies to verify
that the document sets are correctly categorized. If a category receives too many assignments, it means
that the evidence terms are too general and must be redefined more precisely. The taxonomy
manager can clear assignments for a category, an entire document set, or only one
selected document. Once the tests are successful, the taxonomy manager can bring the taxonomies
online. Only online taxonomies are visible to Webtop end users.

The category owner


The category owner is set by the taxonomy manager. There can be several category owners
in an organization, depending on their expertise. They are responsible for reviewing pending
documents, assigning documents manually and clearing assignments when necessary. However,
these responsibilities can also be taken by the taxonomy manager.


From the My Categories page in Documentum Administrator, the category owners can view all
documents assigned to the categories they own, or they can display just the documents assigned
to the category with a status of Pending. The category owners must approve or reject pending
documents, also called Candidate documents, that did not reach a score high enough to be
automatically categorized. If the threshold for automatic categorization is equal to the threshold for
Candidate documents, then there are no Candidate documents: documents are either automatically
categorized or rejected. Once the documents are categorized, either automatically or after approval,
they become viewable by end users. If a category owner rejects a pending document, this document
is not viewable by end users in the categories. For example, in Webtop, even if the document is
viewable in a cabinet folder, it is not viewable under the Categories node if it is not categorized.
The category owners can also assign any document manually to a category.
Like the taxonomy manager, the category owners can clear assignments, for example, if they mistakenly
approved a pending document.
Note: If both a user and a group exist with the same name in the repository, the user cannot be
selected for category_owner, only the group.


Chapter 11
Configure CIS Standard Classification

This chapter describes how to configure CIS for standard classification.


The following procedure provides an overview of the steps necessary to get Content Intelligence
Services up and running.

To configure Content Intelligence Services:


1. Install Content Intelligence Services by following the instructions in the Content Intelligence
Services Installation Guide.
2. Start the CIS server. See Chapter 3, Administer the CIS server, for details.
3. If the repository was not enabled for CIS use during the installation, enable CIS in the
repository using Documentum Administrator (DA). The Documentum Administrator User Guide
provides information on this procedure. The properties you specify when you enable CIS (such
as the hostname for the test and production servers) can later be modified in the Configure
CIS window in DA.
4. Create or import a taxonomy.
Content Intelligence Services offers two ways to create the taxonomy that it uses to categorize
documents:
Import a taxonomy in taxonomy exchange format (TEF)
Build a taxonomy using Documentum Administrator
Importing a TEF taxonomy is the best approach if you are creating a taxonomy based on an
industry-standard thesaurus. You can download several prebuilt TEF taxonomies for different
subject areas. The Content Intelligence Services data sheet provides a list of these taxonomies;
it is available from the EMC Software site, http://www.emc.com, by searching for CIS datasheet.
You can import these taxonomies as is, or you can customize the TEF files before you import them.
For information about downloading and importing TEF taxonomies, see Import taxonomies in
Taxonomy Exchange Format, page 121.
Documentum Administrator is the preferred tool for creating a custom taxonomy. The
Documentum Administrator User Guide provides more information on how to create taxonomies
with DA.
5. Create the necessary document sets to select the documents that you want to be automatically
categorized.
6. Synchronize the taxonomy definitions in the Documentum repository to make them available to
the CIS server for the classification processing.


The CIS objects you import or create with Documentum Administrator are saved in the repository,
which contains the documents, categories, and evidence terms that CIS uses in its processing. Synchronize
the taxonomies and the document sets so that the CIS server can use them. Each time the CIS
server processes a document set, the CIS server reads the document set definition. However, for
scheduled document sets, you need to synchronize them when the schedule has been modified. If
the document set is not synchronized, the CIS server does not know that the schedule has been
updated until the next time it tries to process the document set.
7. If you want to integrate CIS with Web Publisher, configure Web Publisher so that it can locate
the CIS server.
The Web Publisher documentation provides details on how to integrate CIS with Web Publisher.
8. When using CIS and RPS to apply policies on category folders, set the DFC used by CIS as
Privileged DFC client in Documentum Administrator, for the corresponding repository. The
Documentum Administrator User Guide provides more information on privileged DFC setup in the
Privileged Clients chapter.

The main server configuration file is cis.properties. To modify cis.properties, page 27, describes the
steps to modify the parameters in the cis.properties file.
Note: If the repository was previously enabled for another CIS server, you must reconfigure it in
DA to create an authentication file. Similarly, if you change the repository for a given CIS server,
reconfigure it to create an authentication file. The authentication files are stored in the directory
defined by the property cis.server.credentials.dir in cis.properties on the CIS server file system.
Each authentication file, called user_<repository_name>.properties contains the login and encrypted
password of the CIS administrator for this given repository.


Chapter 12
Use the Taxonomy Exchange Format (TEF)

This chapter describes the XML elements you can use to write a taxonomy in Taxonomy Exchange
Format (TEF).
A TEF file defines the structure of one or more taxonomies. You can import an entire taxonomy
structure or only part of it. Importing an entire taxonomy is simpler: you use the tef2repository
script, which does not require a TEF action file. In the latter case, you must also create a TEF action
file that specifies what actions to take. Exporting a taxonomy takes information from the repository
and creates a TEF XML file. The TEF XML schema accommodates subtypes and their attributes. This
chapter includes the following sections:
Import taxonomies in Taxonomy Exchange Format, page 121
Taxonomy Exchange Format action files, page 169
Note: Before importing taxonomies, make sure that you enabled CIS functionality in Documentum
Administrator. The Enabling Content Intelligence Services section in the Documentum Administrator User
Guide provides more information on this procedure.

Import taxonomies in Taxonomy Exchange Format
Taxonomy Exchange Format (TEF) is an XML format that describes a complete taxonomy: its
categories, its hierarchical structure, and the evidence used to assign documents to categories
during automatic processing. A TEF action file is required to perform any action on TEF files, except
when you import an entire TEF file with the tef2repository script.
TEF consists of two related XML schemas:
The core TEF schema (defined by tefSchema.xsd) defines the structure of one or more taxonomies.
The TEF action schema (defined by tefActionSchema.xsd) defines the actions used to transfer the
taxonomy definitions from a TEF file into a Documentum repository and vice versa.
The two schemas are located in <CIS installation directory>/doc/tef.
Documentum provides a set of taxonomies defined in TEF. They can be downloaded from the
Powerlink site (http://Powerlink.EMC.com). You can also write a TEF taxonomy using an XML or text
editor. The TEF format is described in the section TEF elements, page 124.
Note: The import functionality does not support the update of taxonomies.


There are two ways to import a taxonomy:
Use the tef2repository script, page 122. This tool is much simpler and quicker than the TefUtil tool,
but it only allows you to import an entire taxonomy. You do not need a TEF action file, and
the XML schemas are automatically selected.
Use the TefUtil tool, page 123. Depending on the TEF action file, the TefUtil utility allows you to
add, export, or delete CIS objects in the repository. It also allows you to rearrange the category
hierarchy.
Note: If the taxonomy is deep and category names are long, importing a category can exceed
the database limit for the path property. For example, a category path should not exceed 450 characters
on SQL Server. In this case, edit the taxonomy file before importing it to reduce the taxonomy depth,
the length of the category names, or both.

Use the tef2repository script


You can use the tef2repository script to import the entire content of a TEF file, whether it contains
one or several taxonomies. Since the script does not use any TEF action file, it cannot be used to
export taxonomies or to rearrange the hierarchy of categories in a repository. The TEF schema is
automatically selected and you do not need to specify a path; if a path is specified, it is ignored and
current schemas are used.

To import a TEF taxonomy with tef2repository script:


1. Write or obtain a TEF taxonomy.
2. Locate the tef2repository.bat file (on Windows hosts) or the tef2repository file (on Linux hosts); it can be
found in <CIS installation directory>/bin.
3. Import the taxonomy using the import script with the following parameters:
If CIS is already configured for the repository (that is, enabled in Documentum Administrator):
tef2repository -TefFile:<filename>

where <filename> can be the TEF filename or the relative file path to the TEF file. In this case,
the repository information is retrieved using the settings in cis.properties.
If you need to provide the credentials of the user:
tef2repository -Repository:<repository_name> -Username:<user_name> -Password:<user_password> -TefFile:<filename>

When running the script on a different CIS server machine, you can provide the absolute
directory paths to indicate where the cis.properties file and the credentials file can
be found:
tef2repository -CisConfDir:confdir -CisHomeDir:homedir -TefFile:<filename>

where confdir is the absolute directory path for the cis.properties file,
and homedir is the absolute path to the directory in which the relative directory defined by
property cis.server.credentials.dir can be found. Table 2, page 28, provides more information
about the cis.server.credentials.dir parameter. (If you concatenate the homedir path and the
cis.server.credentials.dir path, you should obtain the entire absolute path to the credential file.)


These parameters are only required when you run the script from a different CIS server
machine.
Note: To increase the memory size for importing large taxonomies, edit the script file and add the
-Xmx argument. Refer to the following procedure for more details about the -Xmx argument.
Parameter values are case sensitive (but parameter names are case insensitive).
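The three invocation forms described in this procedure can be assembled programmatically, for example when the import is driven from a deployment script. The Python sketch below only builds the argument list and does not run anything; the function name and its parameters are hypothetical, while the option names are the ones shown above.

```python
def tef2repository_args(tef_file, repository=None, username=None,
                        password=None, conf_dir=None, home_dir=None):
    """Build the tef2repository argument list (hypothetical helper)."""
    args = ["tef2repository"]
    # Explicit credentials are only needed when they must be provided.
    if repository is not None:
        args += [f"-Repository:{repository}",
                 f"-Username:{username}",
                 f"-Password:{password}"]
    # Absolute directories, for a run from a different CIS server machine.
    if conf_dir is not None and home_dir is not None:
        args += [f"-CisConfDir:{conf_dir}", f"-CisHomeDir:{home_dir}"]
    args.append(f"-TefFile:{tef_file}")
    return args
```

The resulting list could be passed to subprocess.run on a host where the script is on the path; that step is left out because it depends on the installation.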

Use the TefUtil tool


Depending on the TEF action file, the TefUtil utility allows you to add, export, or delete CIS objects in
the repository. It also allows you to rearrange the category hierarchy.
The TEF action file includes references to TEF taxonomy definition files and to the schema definition
file (XSD). TefUtil expects to locate these files in the current working directory; if the filename
includes a relative path specification, it must be the relative path from the current working directory.
To import any of the taxonomies delivered with CIS, run TefUtil from the directory containing the
TEF action file.
The type attribute determines what type of object must be created in the repository. The default for
type is dm_category or dm_taxonomy.

To import a TEF taxonomy with TefUtil:


1. Write or obtain a TEF taxonomy.
2. Create a TEF action file.
The TEF action file instructs the TEF utility what actions to take on the definitions in the TEF
taxonomy file. The TEF action file format is described in Taxonomy Exchange Format action
files, page 169.
3. Import the taxonomy using the TefUtil utility:
a. Open the directory that contains the taxonomy XML files.
b. Enter this command into a Command Prompt window:

java -cp "<CIS installation directory>/config;
<CIS installation directory>/lib/ci.jar;
<CIS installation directory>/lib/cis_server.jar;
<CIS installation directory>/lib/dfc.jar"
com.documentum.ci.tools.TefUtil -Docbase:<docbase>
-UserName:<login> -Password:<password>
-TefActionFile:<TEF action file path>

where <TEF action file path> is the relative file path of the TEF action file,
<docbase> is the name of the repository into which you want to import the taxonomy,
<login> and <password> are the Documentum user name and password for logging into the
repository.
For large taxonomies, you may need to allocate more Java memory. To do so, use the -Xmx argument
to increase the maximum allowed size for the Java heap. Append the letter k for kilobytes, or m for
megabytes. This argument comes before the classpath argument (-cp) in the command line.
To trace errors, add the option -Dlog4j.configuration=<CIS installation directory>/config/log4j-script.xml.
This argument comes before the classpath argument (-cp) in the command line.


Note: TefUtil is a command-line utility, not a script, so you must run it from the directory that
contains the files to import.
With the TEF utility, you cannot set evidence propagation to @parent on a category that has
multiple parents, for example a category that other parents reference through category links.
Doing so causes an error, logged in cis.log, when the taxonomy is synchronized.

TEF elements
The following sections describe the TEF elements:
tef, page 126
class, page 127
details, page 129
description, page 130
categoryDefaults, page 131
impliedKeywordDefaults, page 133
keywordDefaults, page 135
evidencePropagation, page 137
categoryEvidenceDefaults, page 139
taxonomy, page 141
category, page 143
details, page 147
owners, page 149
owner, page 150
operations, page 151
operation, page 152
languageInfo, page 153
supportedLanguage, page 154
extended_attributes, page 155
attribute, page 156
value, page 157
definition, page 158
evidence, page 160
evidenceSet, page 162
keyword, page 163
categoryEvidence, page 164

qualifiers, page 165


qualifier, page 166
categoryLink, page 168

tef
Purpose
Root element of a file in taxonomy exchange format

Children
<class>
<taxonomy>
<category>

Parents
None

Usage notes
The <tef> element must be the first element in the file. All other elements must appear inside of it.
Example of <tef>
<tef>
<class name="Generic">
... [Class definition]
</class>
<taxonomy name="Products" className="Generic" taxonomyVersion="version">
... [Taxonomy definition]
</taxonomy>
</tef>
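The structural rule above (a single <tef> root whose children are <class>, <taxonomy>, or <category> elements) can be checked before import. The following is a minimal sketch, not part of CIS: the function name check_tef_root is hypothetical, and it relies only on the Python standard library.

```python
import xml.etree.ElementTree as ET

# Child elements that this guide lists for <tef>.
ALLOWED_TOP_LEVEL = {"class", "taxonomy", "category"}


def check_tef_root(xml_text: str) -> list:
    """Return a list of problems found at the top level of a TEF document."""
    root = ET.fromstring(xml_text)
    if root.tag != "tef":
        return [f"root element is <{root.tag}>, expected <tef>"]
    problems = []
    for child in root:
        if child.tag not in ALLOWED_TOP_LEVEL:
            problems.append(f"unexpected child <{child.tag}> under <tef>")
    return problems
```

An empty list means that the top level of the file matches the structure described here. The sketch does not validate anything below the first level.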

class
Purpose
Defines a category class

Attributes
Table 8. <class> Element Attributes

Attribute    Description
name         Name of the category class, which must be unique among <class> elements

Children
<details>
<categoryDefaults>

Parents
<tef>

Usage notes
The <class> element defines a category class. Each CIS category is assigned to a class, which
determines the default confidence levels of the category and evidence propagation behavior. The
category class also identifies the document attribute to which category assignments are written if the
Assign as Attributes option is active.

<class> elements appear as children of the <tef> element, outside of the taxonomies and categories. A
<class> element has two subelements:
<details> provides descriptive information about the class and sets the document attribute into
which the CIS server writes category assignments when Assign as Attributes is active.
<categoryDefaults> sets default values for how the CIS server handles evidence for categories
of this category class.

Example of <class>
<class name="Generic">
<details source="Source" targetAttribute="keywords" title="Generic class">
<description>Category class for basic categories</description>
</details>
<categoryDefaults>
<impliedKeywordDefaults confidence="100" stem="true"
phraseOrderExact="false"/>
<keywordDefaults confidence="high" stem="true" phraseOrderExact="false"/>
<evidencePropagation type="@parent" confidence="medium"/>
<categoryEvidenceDefaults confidence="off"/>
</categoryDefaults>
</class>

details
Purpose
Provides descriptive information about its parent category class.

Attributes
Table 9. <details> Element Attributes

Attribute          Description
title              Title for the parent category class
source             (Deprecated) Text indicating the source of the parent category class
targetAttribute    Name of the document attribute into which the CIS server writes the
                   names of the categories to which a document is assigned

Children
<description>

Parents
<class>
Note: A different <details> element is a subelement of <taxonomy> or <category>. See details,
page 147.

Usage notes
The <details> element provides a description of its parent category class. It also sets the document
attribute into which the CIS server writes the names of categories to which a document is assigned.
The <description> subelement contains a text description of the parent category class. The text
appears between the opening tag and the closing tag, not as an attribute as with other TEF elements.
See class, page 127 for an example that uses the <details> element.

description
Purpose
Provides a description of the parent element

Children
Text of the description

Parents
<details>

Usage notes
The <description> element contains plain text, rather than subelements, between its opening and
closing tags. See details, page 147 for an example.

categoryDefaults
Purpose
Provides default values for confidence levels and evidence propagation.
Defined at the category class level for all categories that reference this category class.

Children
<impliedKeywordDefaults>
<keywordDefaults>
<evidencePropagation>
<categoryEvidenceDefaults>

Parents
<class>

Usage notes
Each piece of evidence for a category has three associated attributes that control how the CIS server
handles it:
confidence: Confidence level that determines how much the CIS server adds to the score of a
document when the evidence is found.
stem: True or false setting that determines whether the CIS server uses stemming to recognize
other forms of the word. Corresponds to the Use stemming functionality in Documentum
Administrator.
phraseOrderExact: True or false setting that determines whether the words in a multiple-word
phrase must appear in exact order or can appear in any order. Corresponds to the Recognize words
in any order functionality in Documentum Administrator.
The <categoryDefaults> element specifies the default values for these options. Each subelement
sets the default values for a particular type of evidence (keyword, implied keyword, propagated
evidence, and linked category evidence). You can override a default value by specifying a value
in the <keyword> or <categoryEvidence> element.

Example of <categoryDefaults>
<class name="Class">
<details source="Source" targetAttribute="Target" title="Title">
<description>Class description</description>
</details>
<categoryDefaults>
<impliedKeywordDefaults confidence="100" stem="true" phraseOrderExact="false"/>
<keywordDefaults confidence="high" stem="true" phraseOrderExact="false"/>
<evidencePropagation type="@parent" confidence="medium"/>
<categoryEvidenceDefaults confidence="off"/>
</categoryDefaults>
</class>
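The override rule can be illustrated with a short Python sketch. It is an assumption-laden illustration of the rule as described here, not the CIS server implementation: the function name effective_keyword_settings is hypothetical, and attribute values are kept as the strings that appear in the TEF file.

```python
# The three evidence options described in the usage notes.
SETTINGS = ("confidence", "stem", "phraseOrderExact")


def effective_keyword_settings(keyword_defaults: dict, keyword_attrs: dict) -> dict:
    """Combine class-level <keywordDefaults> with the attributes of one <keyword>.

    A value set on the keyword wins; otherwise the class default applies.
    """
    return {
        name: keyword_attrs.get(name, keyword_defaults.get(name))
        for name in SETTINGS
    }
```

For example, with the defaults confidence="high", stem="true", and phraseOrderExact="false", a keyword that sets only confidence keeps the default stem and phraseOrderExact values.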

impliedKeywordDefaults
Purpose
Provides the defaults for handling implied keywords

Attributes
Table 10. <impliedKeywordDefaults> Element Attributes

Attribute           Description
confidence          Confidence-level value specifying the level of confidence the CIS server
                    applies to implied keywords by default
stem                true or false, instructing the CIS server whether to use stemming for
                    implied keywords by default
phraseOrderExact    true or false, instructing the CIS server whether to recognize the words
                    in a phrase only if they appear in exact order

Children
None

Parents
<categoryDefaults>

Usage notes
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to the confidence level 100.
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default.
-22: Equivalent to the confidence Required in Documentum Administrator. If only Required
terms are defined for the category, the document must contain at least one Required term,
and one is sufficient to assign the document to the category. If the category also has evidence
terms that are not Required, the document must contain one Required term and have a confidence
score high enough for the category.
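The mapping between named and numeric confidence values can be summarized in a short Python sketch. The function name resolve_confidence is hypothetical, and treating the names as case-insensitive is an assumption (the list above capitalizes them, while the TEF examples use lowercase). Values without a numeric equivalent in this guide are returned unchanged.

```python
# Numeric equivalents stated in this guide.
NAMED_LEVELS = {"certain": 100, "high": 75, "medium": 50, "low": 15}
# Values with special behavior and no stated numeric equivalent.
SPECIAL_LEVELS = {"supporting", "exclude", "off"}


def resolve_confidence(value: str):
    """Map a confidence attribute value to a number or a special keyword."""
    text = value.strip().lower()
    if text in NAMED_LEVELS:
        return NAMED_LEVELS[text]
    if text in SPECIAL_LEVELS:
        return text
    # Anything else is expected to be a number entered directly,
    # including -22 for Required. int() raises ValueError otherwise.
    return int(text)
```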

keywordDefaults
Purpose
Provides the defaults for handling evidence keywords

Attributes
Table 11. <keywordDefaults> Element Attributes

Attribute           Description
confidence          Confidence-level value specifying the level of confidence the CIS server
                    applies to keywords by default
stem                true or false, instructing the CIS server whether to use stemming for
                    keywords by default
phraseOrderExact    true or false, instructing the CIS server whether to recognize the words
                    in a phrase only if they appear in exact order

Children
None

Parents
<categoryDefaults>

Usage notes
The corresponding functionality is not exposed in Documentum Administrator. It can be very useful
to define the default values and behaviors globally for all the keywords in the category instead of
setting them for each keyword.
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to the confidence level 100.
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default.
-22: Equivalent to the confidence Required in Documentum Administrator. If only Required
terms are defined for the category, the document must contain at least one Required term,
and one is sufficient to assign the document to the category. If the category also has evidence
terms that are not Required, the document must contain one Required term and have a confidence
score high enough for the category.

evidencePropagation
Purpose
Provides the defaults for propagating evidence

Attributes
Table 12. <evidencePropagation> Element Attributes

Attribute     Description
type          Type of evidence propagation: off, @child, or @parent
confidence    Confidence level to assign to propagated evidence by default

Children
None

Parents
<categoryDefaults>

Usage notes
Evidence of one category or taxonomy can automatically be considered evidence for another category
or taxonomy. Sharing evidence across categories and taxonomies is called propagating evidence.
Evidence can only be propagated between categories and taxonomies that have a direct parent and
child relationship. The propagation direction can be either parent to child or child to parent.
For example, suppose a taxonomy has this hierarchical structure (showing just the top-level elements
for the categories and taxonomy):
<taxonomy name="United States" className="Country"
taxonomyVersion="1.0">
<category name="Missouri" className="State">
<category name="St. Louis" className="City"/>
<category name="Branson" className="City"/>
</category>
</taxonomy>

In this structure, if the direction of propagation is child to parent, the categories St. Louis and Branson
can propagate their evidence to the Missouri category, because St. Louis and Branson are direct
children of the Missouri category. Similarly, Missouri can propagate its evidence to the taxonomy
United States, because Missouri is a direct child of United States. If the direction of propagation is
parent to child, the taxonomy United States can propagate its evidence to the category Missouri, and
Missouri can propagate its evidence to both St. Louis and Branson.
However, you can never automatically propagate evidence directly between the United States
taxonomy and the St. Louis or Branson categories, because these categories are only indirectly
contained within the <taxonomy> element. Additionally, evidence cannot be automatically
propagated between sibling categories. In the preceding example, evidence for St. Louis
cannot automatically be propagated to Branson, nor can evidence for Branson be propagated to
St. Louis.
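The direct parent and child rule can be made concrete with a small Python sketch that lists every permitted propagation pair for the example hierarchy. This is an illustration only: the function name propagation_pairs and the direction labels "child_to_parent" and "parent_to_child" are hypothetical and are not TEF attribute values.

```python
import xml.etree.ElementTree as ET

TAXONOMY = """
<taxonomy name="United States" className="Country" taxonomyVersion="1.0">
  <category name="Missouri" className="State">
    <category name="St. Louis" className="City"/>
    <category name="Branson" className="City"/>
  </category>
</taxonomy>
"""


def propagation_pairs(root, direction):
    """Return (source, target) name pairs between direct parents and children."""
    if direction not in ("child_to_parent", "parent_to_child"):
        raise ValueError(f"unknown direction: {direction}")
    pairs = []
    for parent in root.iter():
        if parent.tag not in ("taxonomy", "category"):
            continue
        # findall("category") returns direct children only, so grandchildren
        # and siblings are never paired with each other.
        for child in parent.findall("category"):
            if direction == "child_to_parent":
                pairs.append((child.get("name"), parent.get("name")))
            else:
                pairs.append((parent.get("name"), child.get("name")))
    return pairs
```

Calling propagation_pairs(ET.fromstring(TAXONOMY), "child_to_parent") pairs Missouri with United States, and St. Louis and Branson with Missouri. No pair links St. Louis or Branson directly to United States or to each other.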
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to the confidence level 100.
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default.
-22: Equivalent to the confidence Required in Documentum Administrator. If only Required
terms are defined for the category, the document must contain at least one Required term,
and one is sufficient to assign the document to the category. If the category also has evidence
terms that are not Required, the document must contain one Required term and have a confidence
score high enough for the category.

categoryEvidenceDefaults
Purpose
Provides the defaults for handling evidence from other categories linked into an evidence set.
Defined at the category class level for all categories that reference this category class.

Attributes
Table 13. <categoryEvidenceDefaults> Element Attributes

Attribute     Description
confidence    Confidence level value to apply to category evidence links by default

Children
None

Parents
<categoryDefaults>

Usage notes
The corresponding functionality is not exposed in Documentum Administrator.
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to the confidence level 100.
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default.
-22: Equivalent to the confidence Required in Documentum Administrator. If only Required
terms are defined for the category, the document must contain at least one Required term,
and one is sufficient to assign the document to the category. If the category also has evidence
terms that are not Required, the document must contain one Required term and have a confidence
score high enough for the category.

taxonomy
Purpose
Defines a taxonomy object

Attributes
Table 14. <taxonomy> Element Attributes

Attribute          Description
name               Name of the taxonomy, which must be unique among <taxonomy> elements
className          Name of the category class used as the default for categories in this
                   taxonomy
taxonomyVersion    Version label for the taxonomy
type               Subtype for taxonomy
internalId         Internal use

Children
<details>
<definition>

<category>
<categoryLink>

Parents
<tef>

Usage notes
The <taxonomy> element represents the root of a hierarchical tree of categories. You can include
multiple <taxonomy> elements in a TEF file.
Some aspects of the <taxonomy> element establish default values for <category> elements that appear
inside of it.
A <taxonomy> element is divided into three main parts:
<details> provides descriptive information about the taxonomy.
<definition> specifies property rules that documents must meet in order for the CIS server to
assign them to categories in the taxonomy and provides default threshold values for categories in
the taxonomy.
<category> and <categoryLink> elements define the hierarchical structure of the taxonomy.

Example of <taxonomy>
<taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
<details title="Products">
<description>Products taxonomy</description>
</details>
<definition candidateThreshold="50" onTargetThreshold="80">
</definition>
<category name="Web Content Management Suite" className="Generic">
... [Category definition]
</category>
<category name="Enterprise Content Management Suite" className="Generic">
... [Category definition]
</category>
</taxonomy>

category
Purpose
Defines a category

Attributes
Table 15. <category> Element Attributes

Attribute                     Description
name                          Name of the category, which must be unique among sister categories
className                     Name of the category class assigned to this category
type (for categorySubtype)    Subtype of the category
internalId                    Internal use

The default for type is dm_category or dm_taxonomy. During importing, the type attribute determines
what type of object needs to be created in the repository. During exporting, all attributes are exported
from the dm_category/dm_taxonomy subtype to <extended_data>.

Children
<details>

<definition>
<category>
<categoryLink>

Parents
<taxonomy>
<category>

Usage notes
The <category> element represents a category. <category> elements are valid within <taxonomy>
elements and within other <category> elements. The nested structure defines the hierarchy of the
taxonomy.
Every category belongs to a category class. The class determines the default confidence levels used
for different types of evidence and the document attribute into which the CIS server writes the
name of the category when it assigns a document.
A <category> element is divided into three main parts:
<details> provides descriptive information about the category and how it is used and managed.
<definition> provides the evidence and property rules that the CIS server uses to determine
which documents to assign to the category.
<category> and <categoryLink> elements define subcategories.

Example of <category>
<category name="Web Content Management Suite">
<details title="Web Content Management Suite">
<description>The suite of Documentum products for managing Web content
</description>
<owners/>
<operations>
<operation type="user_browse"/>
<operation type="manual_assignment"/>
</operations>
<languageInfo>
<supportedLanguage languageCode="es" translatedName="translation 1"/>
<supportedLanguage languageCode="jp" translatedName="translation 2"/>
</languageInfo>
</details>
<definition candidateThreshold="60" onTargetThreshold="90">
<evidence evidencePropagation="low" impliedKeyword="33">
<evidenceSet>
<keyword name="Web Content Management" confidence="high"
phraseOrderExact="true" stem="false"/>
<keyword name="WCM" confidence="high"/>
</evidenceSet>
</evidence>
</definition>
<category name="Web Publisher" className="Generic">
... [Category definition]
</category>

<category name="Site Deployment Services" className="Generic">


... [Category definition]
</category>
...
</category>
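Because a category name must be unique among the categories that share the same parent, it can help to check a TEF file for duplicates before importing it. The following Python sketch shows one way to do so. The function name duplicate_sibling_names is hypothetical, and the check looks only at nested <category> elements, not at <categoryLink> elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_sibling_names(root):
    """Return (parent name, duplicated child name) pairs found under root."""
    problems = []
    for parent in root.iter():
        if parent.tag not in ("taxonomy", "category"):
            continue
        # Count names among direct child categories only.
        counts = Counter(
            child.get("name") for child in parent.findall("category")
        )
        for name, count in counts.items():
            if count > 1:
                problems.append((parent.get("name"), name))
    return problems
```

The same name can appear under two different parents without being reported, because uniqueness is required only among categories with the same parent.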
Example of <CategorySubtype>

In this example, the Brand Loyalty category uses a category subtype called dm_category_subtype.
This subtype has 12 attributes: a single-valued (non-repeating) attribute and a repeating
attribute for each of the six supported types (boolean, integer, string, id, time, and double).
<category name="Brand Loyalty" type="dm_category_subtype">
<details title="Brand Loyalty">
<extended_attributes>
<attribute name="bool single">
<value>true</value>
</attribute>
<attribute name="int single ">
<value>1</value>
</attribute>
<attribute name="str single ">
<value>test string 1</value>
</attribute>
<attribute name="id single ">
<value>98adc6236f527be1</value>
</attribute>
<attribute name="time single ">
<value>Thu Feb 01 11:30:01 PST 2001</value>
</attribute>
<attribute name="double single ">
<value>1.0</value>
</attribute>
<attribute name="bool repeating">
<value>true</value>
<value>false</value>
<value>true</value>
</attribute>
<attribute name="int repeating ">
<value>1</value>
<value>2</value>
<value>3</value>
</attribute>
<attribute name="str repeating ">
<value>test string 1</value>
<value>test string 2</value>
<value>test string 3</value>
</attribute>
<attribute name="id repeating ">
<value>98adc6236f527be1</value>
<value>98adc6236f527be2</value>
<value>98adc6236f527be3</value>
</attribute>
<attribute name="time repeating ">
<value>Thu Feb 01 11:30:01 PST 2001</value>
<value>Sat Mar 02 12:30:02 PST 2002</value>
<value>Thu Apr 03 13:30:03 PST 2003</value>
</attribute>
<attribute name="double repeating ">
<value>1.0</value>
<value>2.2</value>

<value>3.4</value>
</attribute>
</extended_attributes>

</details>
<definition>

</definition>
</category>
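To show how the <extended_attributes> structure maps to attribute values, the following Python sketch reads the attributes of a category into a dictionary. It is an illustration only: the function name read_extended_attributes is hypothetical, and because the TEF file does not mark an attribute as repeating, every attribute is returned as a list of strings.

```python
import xml.etree.ElementTree as ET


def read_extended_attributes(category):
    """Return {attribute name: [value, ...]} for one <category> element."""
    result = {}
    for attribute in category.findall("details/extended_attributes/attribute"):
        # A single-valued attribute yields a one-item list,
        # a repeating attribute yields one item per <value>.
        result[attribute.get("name")] = [
            value.text for value in attribute.findall("value")
        ]
    return result
```

Converting the strings to boolean, integer, time, or other types is left out, because the expected type of each attribute is defined by the subtype in the repository and not by the TEF file.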

details
Purpose
Groups together descriptive information about its parent category or taxonomy

Attributes
Table 16. <details> Element Attributes

Attribute    Description
title        Title for the parent category or taxonomy

Children
<description>
<owners>
<operations>
<languageInfo>
<extended_attributes>

Parents
<taxonomy>

<category>
Note: A different <details> element is a subelement of <class>. See details, page 129.

Usage notes
The <details> element provides a description of its parent category or taxonomy. The <details>
element can contain the following subelements:
<description> contains a text description of the parent category or taxonomy. The text appears
between the opening tag and the closing tag, not as an attribute as with other TEF elements.
<owners> lists the owners of the parent category or taxonomy. The <owners> element groups
together any number of <owner> elements, each of which gives the Documentum user name of an
owner for the parent category or taxonomy.
<operations> lists which user operations are available for the parent category or taxonomy. The
<operations> element groups together any number of <operation> elements, each of which
identifies a type of operation that is valid.
<languageInfo> provides translated names for the parent category or taxonomy. The
<languageInfo> element groups together any number of <supportedLanguage> elements, each of
which identifies a language and provides a translation of the name into that language.
<extended_attributes> populates the attributes of a category subtype. The <extended_attributes>
element groups together any number of <attribute> elements, each of which names an attribute
and provides its values.

Example of <details>
<category name="Web Content Management Suite">
<details title="Web Content Management Suite">
<description>The suite of Documentum products for
managing Web content</description>
<owners>
<owner name="dmadmin"/>
</owners>
<operations>
<operation type="user_browse"/>
<operation type="manual_assignment"/>
</operations>
<languageInfo>
<supportedLanguage languageCode="es" translatedName="translation 1"/>
<supportedLanguage languageCode="jp" translatedName="translation 2"/>
</languageInfo>
</details>
... [The <definition> element and subcategories]
</category>

owners
Purpose
Groups together the owners assigned to the parent category or taxonomy

Children
<owner>

Parents
<details>

Usage notes
The owners of a category are the Documentum users who can review candidate documents and
approve or reject their assignment to the category. Candidate documents are documents whose
confidence score exceeds the candidate threshold of the category but falls short of its on-target
threshold, or documents that are assigned to the category manually with the Manual Workflow
option active.
See details, page 147 for an example that uses the <owners> element.

owner
Purpose
Identifies an owner of a category or taxonomy

Attributes
Table 17. <owner> Element Attributes

Attribute    Description
name         Documentum user name of an owner of the parent category or taxonomy. The
             owner can be a user or a group of users.

Children
None

Parents
<owners>

Usage notes
The owners of a category are the Documentum users who can review candidate documents and
approve or reject their assignment to the category. Candidate documents are documents whose
confidence score exceeds the candidate threshold of the category but falls short of its on-target
threshold, or documents that are assigned to the category manually with the Manual Workflow
option active. For categories created using Documentum Administrator, the user who created the
category is an owner by default. See details, page 147 for an example that uses the <owner> element.

operations
Purpose
Groups together the set of operations available for the parent category or taxonomy

Children
<operation>

Parents
<details>

Usage notes
The intent of the <operations> element is to specify what user operations are available for a category or
taxonomy. For example, you may not want users to see the documents assigned to certain categories.
In this release, the <operations> element does not affect standard CIS processing. The operations are
saved as part of the category definition, but Documentum applications do not refer to them.

operation
Purpose
Identifies an operation that is available for the parent category or taxonomy

Attributes
Table 18. <operation> Element Attributes

Attribute    Description
type         Text describing an available operation for categories

Children
None

Parents
<operations>

Usage notes
The intent of the <operation> element is to identify a user operation that is available for a category
or taxonomy. For example, you may include an operation that makes the category available for
browsing by users.
In this release, the <operation> element does not affect standard CIS processing. Any operations are
saved as part of the category definition, but Documentum applications do not refer to them.

languageInfo
Purpose
Reserved use. Groups together the translated names of the parent category or taxonomy

Children
<supportedLanguage>

Parents
<details>

Usage notes
The subelements of <languageInfo> translate the category or taxonomy name into other languages.
Each <supportedLanguage> element identifies a language (using its Documentum language code) and
provides a translation of the name into that language. When a user views the category or taxonomy,
its name appears in the same language as the Documentum user interface if a translation is available.
See details, page 147 for an example that includes the <languageInfo> element.
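The lookup described above can be sketched in Python as follows. This is an illustration and not the behavior of any Documentum client: the function name display_name is hypothetical, and falling back to the name attribute when no translation exists is an assumption.

```python
import xml.etree.ElementTree as ET


def display_name(element, ui_language):
    """Return the translated name for ui_language, or the untranslated name."""
    for language in element.findall("details/languageInfo/supportedLanguage"):
        if language.get("languageCode") == ui_language:
            return language.get("translatedName")
    # No translation available (assumed fallback): use the name attribute.
    return element.get("name")
```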

supportedLanguage
Purpose
Reserved use. Provides a translated category or taxonomy name for a specified language

Attributes
Table 19. <supportedLanguage> Element Attributes

Attribute         Description
languageCode      Two-letter language code used to identify a language in Documentum
translatedName    Name for the parent category or taxonomy, translated into the language
                  whose code you specify

Children
None

Parents
<languageInfo>

Usage notes
The subelements of <languageInfo> translate the category or taxonomy name into other languages.
Each <supportedLanguage> element identifies a language (using its Documentum language code) and
provides a translation of the name into that language. When a user views the category or taxonomy,
its name appears in the same language as the Documentum user interface if a translation is available.
See details, page 147 for an example that includes the <languageInfo> element.

extended_attributes
Purpose
Attributes of the subtype category

Children
<attribute>

Parents
<details>

Usage notes
The <extended_attributes> element populates the attributes of the subtype. The type attribute of
the category determines what type of object is created in the repository.
Attributes can be single-valued or repeating, and can be of the following types: boolean, integer,
string, id, time, and double. See category, page 143, for an example of a category subtype.

attribute
Purpose
One of the attributes of the subtype category

Attributes
Table 20. <attribute> Element Attributes

Attribute    Description
name         Name of the attribute

Children
<value>

Parents
<extended_attributes>

Usage notes
The <extended_attributes> element populates the attributes of the subtype. The type attribute of
the category determines what type of object is created in the repository.
Attributes can be single-valued or repeating, and can be of the following types: boolean, integer,
string, id, time, and double. See category, page 143, for an example of a category subtype.

value
Purpose
Provides the value for an attribute of the subtype category

Children
Value of the attribute

Parents
<attribute>

Usage notes
See category, page 143, for an example of a category subtype.

definition
Purpose
Identifies the set of documents belonging to a category

Attributes
Table 21. <definition> Element Attributes

Attribute             Description
onTargetThreshold     Number from 0 through 100 indicating the confidence score that a
                      document must have for the CIS server to assign it to the category. If
                      this optional attribute is not included, the category uses the on-target
                      threshold specified for the <taxonomy> it belongs to.
candidateThreshold    Number from 0 through 100 indicating the confidence score that a
                      document must have for the CIS server to assign it to the category as a
                      candidate requiring approval. If this optional attribute is not included,
                      the category uses the candidate threshold specified for the <taxonomy>
                      it belongs to.
                      The candidate threshold must be lower than or equal to the on-target
                      threshold.
keywordLanguage       Two-letter language code used to identify a language in Documentum

Children
<evidence>, only if the parent element is <category>
<qualifiers>

Parents
<taxonomy>
<category>

Usage notes
The <definition> element supplies the criteria that the CIS server uses to determine which documents
to assign to the parent category or taxonomy. The <qualifiers> subelement defines property rules
that a document must meet to be assigned. The <evidence> subelement provides the evidence and
confidence values that the CIS server uses to assign a confidence score to the document.
If a <taxonomy> element includes any <qualifiers>, the specified property rules apply to all categories
in the taxonomy. If a document submitted for processing does not meet the property rules for
the taxonomy, the CIS server does not evaluate it for assignment into any of the categories in the
taxonomy.
The keywordLanguage attribute sets the language that is used when stemming is enabled. The
taxonomy or category language acts as a filter: the language of the document (or of the document
set) must match the category language, or the document cannot be assigned. The section "Setting the
language used for the stemming" in the Documentum Administrator User Guide provides more
information about the stemming functionality.
Note: The <definition> under a <taxonomy> element should not include an <evidence> subelement.
Documents are not assigned to the root of the taxonomy.

Example of <definition>
<category name="Web Content Management Suite">
  <details title="Web Content Management Suite">
    ... [Category details]
  </details>
  <definition candidateThreshold="60" onTargetThreshold="90">
    <evidence evidencePropagation="low" impliedKeyword="33">
      <evidenceSet>
        <keyword name="Web Content Management" confidence="high"
                 phraseOrderExact="true" stem="false"/>
        <keyword name="WCM" confidence="high"/>
      </evidenceSet>
    </evidence>
  </definition>
</category>
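
For comparison, a <definition> placed directly under a <taxonomy> carries thresholds and property rules but no <evidence>, as the note above requires. The following is a minimal sketch assembled from the examples in this reference; the keywordLanguage value "en" and the repository path are assumptions chosen for illustration.

```xml
<taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
  <details title="Products">
    <description>Products taxonomy</description>
  </details>
  <!-- Taxonomy-level definition: default thresholds and property rules only.
       No <evidence> subelement, because documents are not assigned to the root. -->
  <definition candidateThreshold="50" onTargetThreshold="80" keywordLanguage="en">
    <qualifiers>
      <!-- Applies to every category in the taxonomy -->
      <qualifier tag="location" operation="equal" value="/MarketingCabinet"/>
    </qualifiers>
  </definition>
  <category name="Web Content Management Suite" className="Generic">
    <!-- category details and definition omitted -->
  </category>
</taxonomy>
```

A category that omits candidateThreshold or onTargetThreshold in its own <definition> would then fall back to the values 50 and 80 set here.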

evidence
Purpose
Identifies the evidence used to assign documents to the parent category

Diagram

Attributes
Table 22. <evidence> Element Attributes

Attribute              Description
impliedKeyword         Confidence-level value specifying the level of confidence the CIS
                       server applies to implied keywords. If this optional attribute is
                       not included, the category uses the confidence level specified in
                       the category class.
evidencePropagation    Confidence-level value specifying the level of confidence the CIS
                       server applies to evidence propagated from other categories. If
                       this optional attribute is not included, the category uses the
                       confidence level specified in the category class.

Children
<evidenceSet>

Parents
<definition>

Usage notes
The <evidence> element provides the evidence and confidence values that the CIS server uses to
assign a confidence score for the parent category to the document. The evidence for a category is
organized into evidence sets, each of which defines a collection of evidence keywords that the CIS
server considers together when calculating the score of a document, relative to the category.
For the impliedKeyword and evidencePropagation attributes, you can enter a predefined confidence
level or enter a number directly. The predefined values are:
Certain       Equivalent to the confidence level 100.
High          Equivalent to the confidence level 75.
Medium        Equivalent to the confidence level 50.
Low           Equivalent to the confidence level 15.
Supporting    Deprecated. This evidence by itself does not cause the CIS server to assign the
              document to the category. However, it increases the confidence level of other
              evidence found in the same evidence set.
Exclude       This evidence by itself prevents the CIS server from assigning the document to
              the category, regardless of how much other evidence appears.
Off           No weight is added to the document confidence score for this evidence. This
              value is useful when you do not want the default values to apply.
-22           Equivalent to the Required confidence level in Documentum Administrator. If only
              Required terms are defined for the category, a single Required term in the
              document is sufficient to assign the document to the category. If the category
              also defines evidence terms that are not Required, the document must contain a
              Required term and also have a confidence score high enough for the category.

Example of <evidence>
<evidence>
  <evidenceSet>
    <keyword name="Documentum"/>
  </evidenceSet>
  <evidenceSet>
    <keyword name="ECM"/>
    <keyword name="Enterprise Content Management" confidence="high"
             phraseOrderExact="true" stem="false"/>
    <categoryEvidence name="@parent" confidence="low"/>
  </evidenceSet>
</evidence>

evidenceSet
Purpose
Groups together a set of evidence that the CIS server considers together when analyzing documents

Diagram

Children
<keyword>
<categoryEvidence>

Parents
<evidence>

Usage notes
An evidence set is a collection of keywords that the CIS server uses together as evidence of a
particular concept. The keywords are identified using the <keyword> and <categoryEvidence>
subelements. A category can have multiple evidence sets that define separate sets of co-occurring
keywords. Confidence levels are not combined across evidence sets.

keyword
Purpose
Identifies a string for the CIS server to use as evidence for the parent category

Attributes
Table 23. <keyword> Element Attributes

Attribute           Description
name                Text of the keyword, which the CIS server looks for in document
                    content
confidence          Confidence level value added to the document score when the CIS
                    server finds the keyword
stem                true or false, instructing the CIS server whether to use stemming for
                    the keyword
phraseOrderExact    true or false, instructing the CIS server whether to recognize a
                    multi-word keyword phrase only when the words appear exactly in the
                    specified order

Children
None

Parents
<evidenceSet>

Usage notes
The <keyword> element defines a piece of evidence that the CIS server looks for in the content of the
documents it processes. The text of the keyword can be one or more words. When the server finds the
keyword, it adds the confidence value for the keyword to the confidence score of the document for
the parent category.
If you do not include values for one or more of the attributes, their values are inherited from the
<keywordDefaults> element of the category class.
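
The difference between explicit and inherited attribute values can be sketched as follows. The keyword texts are taken from the examples elsewhere in this reference; what the second keyword actually inherits depends on the <keywordDefaults> of your category class, which is not shown here.

```xml
<evidenceSet>
  <!-- All four attributes set explicitly: the phrase must appear in this
       exact word order, and stemming is not applied -->
  <keyword name="Enterprise Content Management" confidence="high"
           phraseOrderExact="true" stem="false"/>
  <!-- Only name is set: confidence, stem, and phraseOrderExact are inherited
       from the <keywordDefaults> element of the category class -->
  <keyword name="ECM"/>
</evidenceSet>
```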

categoryEvidence
Purpose
Includes evidence of another category as part of the evidence for the parent category

Attributes
Table 24. <categoryEvidence> Element Attributes

Attribute     Description
name          Name of the category whose evidence you want to use as evidence for the
              parent category as well
className     Category class assigned to the linked category
confidence    Level of confidence to assign to evidence from the linked category
internalId    Internal use

Children
None

Parents
<evidenceSet>

Usage notes
Categories can include other categories as evidence: when a document is assigned to one category,
the CIS server can use that assignment as evidence for a related category. For example, when a
document is assigned to the category Documentum Content Intelligence Services, you might
want it also assigned to the category Documentum. To accomplish this, you link the category
Documentum Content Intelligence Services into an evidence set for the category Documentum.
Like all evidence, category link evidence has a confidence value associated with it, telling the CIS
server how much to add to the overall score of the document for the current category when the
document is assigned to the linked category.
If you do not include values for one or more of the attributes, their values are inherited from the
<categoryEvidenceDefaults> element of the category class.
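
The scenario described above could look like the following fragment, which would sit inside the definition of the category Documentum. It is a sketch only: the class name Generic and the confidence level high are assumptions chosen for illustration.

```xml
<!-- Definition of the category "Documentum" -->
<definition>
  <evidence>
    <evidenceSet>
      <keyword name="Documentum"/>
      <!-- An assignment to the linked category counts as evidence for
           "Documentum" with the given confidence -->
      <categoryEvidence name="Documentum Content Intelligence Services"
                        className="Generic" confidence="high"/>
    </evidenceSet>
  </evidence>
</definition>
```

If confidence is omitted, the value comes from the <categoryEvidenceDefaults> element of the category class.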

qualifiers
Purpose
Groups together the property rules for the parent category

Diagram

Children
<qualifier>

Parents
<definition>

Usage notes
The definition of a category or taxonomy can include property rules that assigned documents must
meet. For the CIS server to assign a document to a category, the document must meet the property
rules for the category and the property rules for the taxonomy to which the category belongs.

Example of <qualifiers>
<definition candidateThreshold="50" onTargetThreshold="80">
  <evidence>
    ... [Category evidence]
  </evidence>
  <qualifiers>
    <qualifier tag="location" operation="equal" value="/MarketingCabinet"/>
    <qualifier tag="type" operation="not_equal" value="custom_type"/>
  </qualifiers>
</definition>

qualifier
Purpose
Defines a qualifying condition for documents assigned to the parent category or taxonomy

Attributes
Table 25. <qualifier> Element Attributes

Attribute    Description
tag          Aspect of a document to use for comparing. Valid tag values are:
             - location: to assign documents based on their location in the
               repository
             - type: to assign documents based on their Documentum object type
             - Name of a document attribute: to assign documents based on the value
               of one of its attributes
             - qualifiers_evaluation_policy: to indicate if the documents must meet
               all or any property rules to be assigned. In this case, the value of
               the operation attribute must be equal and the value of the value
               attribute must be all or any.
operation    Comparison operator used to compare the tag to the value. Valid
             operators are:
             - equal
             - not_equal
             - include
             - exclude
             - subtype
value        Value to compare the tag to.
             - For tag=location, value is a complete repository path
             - For tag=type, value is the name of a Documentum object type
             - For tag=attribute_name, value is a value for the specified attribute

Children
None

Parents
<qualifiers>

Usage notes
Before the CIS server assigns a document to a category, it verifies that the document meets the
property rules for the category and for the taxonomy. If a document fails to meet any condition, the
CIS server does not assign the document to the category regardless of the strength of the evidence. If
no evidence terms are defined and the document meets the property rule, then it is automatically
assigned.
A special property rule indicates whether documents must satisfy all the conditions or whether a
single condition is enough to assign a document, for example:
<qualifier tag="qualifiers_evaluation_policy" operation="equal" value="all"/>

See qualifiers, page 165 for an example.
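
As a sketch of the alternative policy, the following <qualifiers> block accepts a document that satisfies either of two rules. The path and type name are reused from the <qualifiers> example; listing the policy rule first is a readability choice here, not a documented requirement.

```xml
<qualifiers>
  <!-- Policy rule: one matching rule is enough ("any"), instead of "all" -->
  <qualifier tag="qualifiers_evaluation_policy" operation="equal" value="any"/>
  <!-- Rule 1: the document is located at this repository path -->
  <qualifier tag="location" operation="equal" value="/MarketingCabinet"/>
  <!-- Rule 2: the document has this Documentum object type -->
  <qualifier tag="type" operation="equal" value="custom_type"/>
</qualifiers>
```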

categoryLink
Purpose
Links an existing category into the hierarchy of the taxonomy

Attributes
Table 26. <categoryLink> Element Attributes

Attribute     Description
name          Name of an existing category
className     Name of the category class assigned to the linked category
internalId    Internal use

Children
None

Parents
<taxonomy>
<category>

Usage notes
The <categoryLink> element enables you to include a category in more than one place in the
hierarchy. You use the <category> element to define the category and its evidence structure once, then
use <categoryLink> to link the category into other locations in the taxonomy. Linking a category does
not imply evidence propagation.

Example of <categoryLink>
<taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
  <details title="Products">
    <description>Products taxonomy</description>
  </details>
  <definition candidateThreshold="50" onTargetThreshold="80"/>
  <category name="Web Content Management Suite" className="Generic">
    ... [Category definition]
  </category>
  <categoryLink name="Enterprise Content Management Suite"
                className="Generic"/>
</taxonomy>

Taxonomy Exchange Format action files


A taxonomy exchange format (TEF) action file tells the TEF utility script (TefUtil) what actions to take.

Create TEF action files


A TEF action file can carry out the following actions:
Add CIS objects from a TEF file into a repository
Export CIS objects from a repository into a TEF file
Remove CIS objects from a repository
Rearrange the hierarchy of categories in a repository
A single TEF action file can perform these actions in any combination.
To create a TEF action file, start from the example files delivered in <CIS installation directory>/doc/tef
(addActionExample.xml, exportActionExample.xml). Copy one of these files, then adapt it to your needs
using the element descriptions below.
Example 12-1. Example of a TEF action file to add a CIS object
<actions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="http://emc.com/SYMBOLIC/TefSchema10/">
  <add fileName="tefExample.xml">
    <classObject xPath="/tef/class"/>
    <taxonomyObject xPath="/tef/taxonomy[@name='Products Example']"
                    branch="true"/>
    <withinParentReference name="Web Content Management Suite"
                           className="Example">
      <categoryObject xPath="/tef/category[@name='Web Publisher' and @className='Example']"
                      branch="true"/>
    </withinParentReference>
  </add>
</actions>

For information on how to run a TEF action file, see the step "Run the TefUtil utility" in the
procedure "To import a TEF taxonomy with TefUtil", page 123.

TEF action file elements


The following sections describe the TEF action file elements:
actions, page 171
add, page 174
classObject, page 176
taxonomyObject, page 177
withinParentReference, page 178
categoryObject, page 180
delete, page 181
classReference, page 183
categoryReference, page 184
relink, page 186
absoluteParentList, page 187
addParentList, page 189
removeParentList, page 190
export, page 191

actions
Purpose
Root element of the TEF action file

Diagram

Children
<add>
<delete>
<relink>
<export>
Note: The <update> action, which appears in tefActionSchema.xsd, is not supported in this release.

Parents
None

Usage notes
<actions> is the root element of a TEF action file. Each of its subelements is an action you want
to perform on a taxonomy.

Example of <actions>
<actions>
  <delete>
    <categoryReference name="Maximum Taxonomy" className="Class"/>
    <classReference name="Class"/>
    <classReference name="Alternate Class"/>
  </delete>
  <add fileName="testTef.xml">
    <classObject xPath="/tef/class"/>
    <taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']"
                    branch="true"/>
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
  <export fileName="tefOut1.xml" xsdFileName="tefSchema.xsd">
    <classReference name="Class"/>
    <classReference name="Alternate Class"/>
  </export>
  <add fileName="testTef.xml">
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
  <export fileName="tefOut2.xml" xsdFileName="tefSchema.xsd">
    <categoryReference name="Maximum Taxonomy" className="Class"
                       branchLevels="all" details="true" definitions="false"/>
    <categoryReference name="Maximum Category" className="Class"
                       branchLevels="all" details="false" definitions="false"/>
    <categoryReference name="Maximum Category" className="Class"
                       branchLevels="all" details="true" definitions="false"/>
  </export>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category">
      <absoluteParentList>
        <categoryReference className="Class" name="Maximum Taxonomy"/>
        <categoryReference className="Class" name="Maximum Category"/>
      </absoluteParentList>
    </categoryReference>
  </relink>
  <export fileName="tefOut3.xml" xsdFileName="tefSchema.xsd">
    <categoryReference name="Maximum Taxonomy" className="Class"
                       branchLevels="all" details="true" definitions="false"/>
  </export>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category2">
      <addParentList>
        <categoryReference className="Class" name="Minimum Category"/>
      </addParentList>
      <removeParentList>
        <categoryReference className="Class" name="Maximum Category"/>
      </removeParentList>
    </categoryReference>
  </relink>
  <export fileName="tefOut4.xml" xsdFileName="tefSchema.xsd">
    <categoryReference name="Maximum Taxonomy" className="Class"
                       branchLevels="all" details="true" definitions="false"/>
  </export>
  <delete>
    <categoryReference name="Maximum Category" className="Class"/>
  </delete>
  <export fileName="tefOut5.xml" xsdFileName="tefSchema.xsd">
    <categoryReference name="Maximum Taxonomy" className="Class"
                       branchLevels="all" details="true" definitions="false"/>
  </export>
</actions>

add
Purpose
Adds categories, taxonomies, or category classes to the repository

Diagram

Attributes
Table 27. <add> Element Attributes

Attribute    Description
fileName     Name of the TEF file containing the definitions of the CIS objects to add

Children
<classObject>
<taxonomyObject>
<withinParentReference>

Parents
<actions>

Usage notes
The <add> action adds new CIS objects to the repository based on definitions stored in a TEF file.
Each subelement identifies one or more objects to add. The <classObject> and <taxonomyObject>
elements identify category classes and taxonomies respectively using an XPath reference to elements
in the TEF file. The <withinParentReference> element identifies a position in the hierarchy where
the categories referred to inside of it are added.
If an object that appears inside of the <add> action already exists in the repository, the TEF utility
neither adds the object again nor updates it. It ignores the existing object and continues.

Example of <add>
<actions>
  <add fileName="testTef.xml">
    <classObject xPath="/tef/class"/>
    <taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']"
                    branch="true"/>
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
</actions>

classObject
Purpose
Identifies a category class in a TEF file to add to a repository

Attributes
Table 28. <classObject> Element Attributes

Attribute    Description
xPath        XPath reference identifying a category class element to add

Children
None

Parents
<add>

Usage notes
The <classObject> element identifies a category class object from a TEF file using an XPath reference.
If the XPath reference selects multiple category classes, the TEF utility adds each class.

Example of <classObject>
<actions>
  <add fileName="testTef.xml">
    <classObject xPath="/tef/class"/>
  </add>
</actions>

taxonomyObject
Purpose
Identifies a taxonomy element in a TEF file

Attributes
Table 29. <taxonomyObject> Element Attributes

Attribute    Description
xPath        XPath instruction identifying a taxonomy element
branch       true or false, specifying whether to add all children of this taxonomy or
             just the specified taxonomy object

Children
None

Parents
<add>

Usage notes
The <taxonomyObject> element identifies a taxonomy object using an XPath reference. If the XPath
reference matches more than one taxonomy, the TEF utility adds them all.
Example of <taxonomyObject>
<actions>
  <add fileName="testTef.xml">
    <taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']" branch="true"/>
  </add>
</actions>

withinParentReference
Purpose
Identifies where in the hierarchy to add new categories

Diagram

Attributes
Table 30. <withinParentReference> Element Attributes

Attribute    Description
name         Name of the parent category for the categories that appear as subelements
className    Name of the category class assigned to the category identified by the name
             attribute

Children
<categoryObject>

Parents
<add>

Usage notes
The <withinParentReference> element identifies the parent category for one or more new categories
being added from a TEF file.

Example of <withinParentReference>
<actions>
  <add fileName="testTef.xml">
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
</actions>

categoryObject
Purpose
Identifies a category element in a TEF file

Attributes
Table 31. <categoryObject> Element Attributes

Attribute    Description
xPath        XPath reference identifying one or more category elements in a TEF file
branch       true or false, specifying whether to add all children of the selected
             categories or only the selected categories themselves

Children
None

Parents
<withinParentReference>

Usage notes
The <categoryObject> element appears within an <add> action to identify one or more categories to
add from the TEF file to the repository. It identifies categories using an XPath reference. If the XPath
reference selects more than one category, the TEF utility adds them all.
<categoryObject> appears as a subelement of the <withinParentReference> element, which determines
where in the hierarchy the categories are added.
Example of <categoryObject>
<actions>
  <add fileName="testTef.xml">
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
</actions>

delete
Purpose
Removes category classes, categories, or taxonomies from the repository

Diagram

Children
<categoryReference>
<classReference>

Parents
<actions>

Usage notes
The <delete> action removes CIS objects from the repository.
To delete a category, the object referred to by <categoryReference> must not have any children. To
delete a category class, no existing categories or taxonomies can use the category class. For this
reason, the branch attribute is ignored when used with the <delete> action.

Example of <delete>
<actions>
  <delete>
    <categoryReference name="Maximum Taxonomy" className="Class"/>
    <classReference name="Class"/>
    <classReference name="Alternate Class"/>
  </delete>
</actions>
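
Because a category with children cannot be deleted and a category class in use cannot be deleted, removing a whole branch means working from the bottom up. The following sketch assumes that the TEF utility carries out actions in document order (this reference does not state the ordering explicitly) and that Maximum Category is the only child of Maximum Taxonomy.

```xml
<actions>
  <!-- 1. Remove the child category, which itself has no children -->
  <delete>
    <categoryReference name="Maximum Category" className="Class"/>
  </delete>
  <!-- 2. Remove the parent, which no longer has any children -->
  <delete>
    <categoryReference name="Maximum Taxonomy" className="Class"/>
  </delete>
  <!-- 3. Remove the category class once no category or taxonomy uses it -->
  <delete>
    <classReference name="Class"/>
  </delete>
</actions>
```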

classReference
Purpose
Identifies a category class element in the repository

Attributes
Table 32. <classReference> Element Attributes

Attribute    Description
name         Name of the category class

Children
None

Parents
<delete>
<export>

Usage notes
The <classReference> element identifies an existing category class using its name.

Example of <classReference>
<actions>
  <delete>
    <classReference name="Class"/>
  </delete>
</actions>

categoryReference
Purpose
Identifies a category in the repository to perform an action on.

Diagram

Note: This diagram applies only to <categoryReference> elements that appear within a <relink>
action. <categoryReference> has no child elements in the context of other actions.

Attributes
Table 33. <categoryReference> Element Attributes

Attribute       Description
name            Name of the category
className       Name of the category class assigned to the category
branch          true or false, specifying whether to perform the action on children of
                the specified category or only on the specified category
                Not available with the <delete> action.
branchLevels    (Used with the <export> action only) Number of levels of children to
                export, or all to export all children
details         (Used with the <export> action only) true or false, indicating whether
                to include the <details> element for the exported category
definition      (Used with the <export> action only) true or false, indicating whether
                to include the <definition> element for the exported category

Children
<absoluteParentList>
<addParentList>
<removeParentList>
Note: These child elements apply only to <categoryReference> elements that appear within a <relink>
action. <categoryReference> has no child elements in the context of other actions.

Parents
<delete>
<relink>
<export>

Usage notes
<categoryReference> identifies an existing category or taxonomy in the repository. Its attributes
specify whether the action applies only to the specified category or to the category and its children.
Example of <categoryReference>
<actions>
  <delete>
    <categoryReference name="Maximum Taxonomy" className="Class"/>
  </delete>
</actions>

relink
Purpose
Links existing categories into new hierarchical locations

Diagram

Children
<categoryReference>

Parents
<actions>

Usage notes
The <relink> action links an existing category into an additional location in the hierarchy. The
<categoryReference> identifies the existing category and the new locations you want to link it to. The
<relink> action can also be used to remove category links; see removeParentList, page 190. However,
every category must be linked to at least one parent category, and the TEF utility gives an error if
you attempt to remove the final link.
Example of <relink>
<actions>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category">
      <absoluteParentList>
        <categoryReference className="Class" name="Maximum Taxonomy"/>
      </absoluteParentList>
    </categoryReference>
  </relink>
</actions>

absoluteParentList
Purpose
Provides a fixed list of parent categories for a linked category

Diagram

Children
<categoryReference>

Parents
<categoryReference>

Usage notes
<absoluteParentList> appears inside of a <relink> action, as a subelement of the <categoryReference>
that identifies the category being relinked. Its subelements identify the complete list of parent
categories for the relinked category.
The alternative to <absoluteParentList> is <addParentList> and <removeParentList>. These elements
identify which parents to add and remove for the relinked category rather than listing the complete
set of parent categories.

Example of <absoluteParentList>
<actions>
  <relink>
    <categoryReference className="Alternate Class"
                       name="Alternate Category">
      <absoluteParentList>
        <categoryReference className="Class" name="Maximum Taxonomy"/>
      </absoluteParentList>
    </categoryReference>
  </relink>
</actions>

addParentList
Purpose
Provides a list of new parent categories for a linked category in the repository

Diagram

Children
<categoryReference>

Parents
<categoryReference>

Usage notes
<addParentList> appears inside of a relink action, as a subelement of the <categoryReference> that
identifies the category being relinked. Its subelements identify new parent categories for the relinked
category. The category is not unlinked from any of its current positions; only the new parents are
added. Use <removeParentList> to remove existing links for the category.
The alternative to <addParentList> is <absoluteParentList>. This element identifies the complete set of
parent categories for the relinked category rather than identifying only new parents.

Example of <addParentList>
<actions>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category2">
      <addParentList>
        <categoryReference className="Class" name="Minimum Category"/>
      </addParentList>
      <removeParentList>
        <categoryReference className="Class" name="Maximum Category"/>
      </removeParentList>
    </categoryReference>
  </relink>
</actions>

removeParentList
Purpose
Provides a list of parent categories to remove from a category

Diagram

Children
<categoryReference>

Parents
<categoryReference>

Usage notes
<removeParentList> appears inside of a relink action, as a subelement of the <categoryReference>
that identifies the category being relinked. Its subelements identify current parent categories to
remove from the relinked category.
Since every category must be linked to at least one parent category, the TEF utility gives an error if
you attempt to remove the final link.

Example of <removeParentList>
<actions>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category2">
      <addParentList>
        <categoryReference className="Class" name="Minimum Category"/>
      </addParentList>
      <removeParentList>
        <categoryReference className="Class" name="Maximum Category"/>
      </removeParentList>
    </categoryReference>
  </relink>
</actions>

export
Purpose
Creates a TEF file containing the definitions of specified categories and category classes

Diagram

Attributes
Table 34. <export> Element Attributes

Attribute      Description
fileName       Name of the TEF file to create
xsdFileName    Name of the XSD schema file to use for validating the file, including the
               path

Children
<classReference>
<categoryReference>

Parents
<actions>

Usage notes
The <export> action creates a TEF file containing elements for the selected classes and categories. The
<classReference> and <categoryReference> elements identify existing classes and categories from the

EMC Documentum Content Intelligence Services Version 6.7 Administration Guide

191

export

repository. The attributes of the <categoryReference> element determine what aspects of the category
definition are included in the TEF file; see categoryReference, page 184.
In most cases, the value of the xsdFileName attribute should be the standard Documentum TEF
schema file tefSchema.xsd. To create a TEF file without validation, set the xsdFileName attribute to
an empty string.
The category type or taxonomy type is always exported to the type attribute. The export action
automatically picks up all attributes of the dm_category or dm_taxonomy subtype and exports them
to <extended_data>.

Example of <export>
<actions>
<export fileName="tefOut1.xml" xsdFileName="tefSchema.xsd">
<classReference name="Class"/>
<classReference name="Alternate Class"/>
</export>
</actions>

Part 6
Metadata Extraction

This part describes metadata extraction, which is one of the three types of content analysis:
entity extraction, metadata extraction, and classification.
It includes the following chapters:
Chapter 13, Metadata Extraction
Chapter 14, Configuring Metadata Extraction

Chapter 13
Metadata Extraction

This chapter describes metadata extraction processing.


Metadata is often defined as data about data. In this guide, metadata refers to the pieces of
information that describe the content of documents. Metadata extraction relies only on rules
that you define. Like taxonomy-based classification, and unlike entity extraction, it does not involve
any content analysis.

Metadata extraction principles


You can extract metadata from the content, properties, or repository attributes of documents.
The customization to extract metadata requires the following tasks:
Create a document set under the Content Intelligence node in Documentum Administrator.
Configure the document set to indicate which metadata you want, which rule set to use, and how
to store the metadata. The preferred storage for metadata is annotations. The annotations can then
be accessed using the Annotation API. Chapter 16, Annotation API provides more information
about the Annotation API.
Define the rules to extract the metadata.
The last two tasks can be performed in any order. Just make sure they reference the same metadata.
The complexity of the rules depends on the structure of the documents. If the structure is the same
for all the documents in the document set, the rules may be simple to define. If the structure varies
between documents, the rules can be more complex. For example, the subject can span one or several
lines, and the author can appear before or after the title.
The rules are evaluated against the text extracted from the document, not against the document itself.
The formatting or layout of a document therefore cannot be used to define the rules. The rules are set
using regular expressions, phrases, and character indexes. You can also define rules to delimit a zone
of text in which another rule will look for the metadata elements.

Defining metadata extraction rules


This section guides you through the creation of metadata extraction rules. The main steps are:
1. Analyze the structure of the documents. Identify the similarities and differences in the document
structure.

2. Define one or several rules.

3. Use the extract_metadata script to test the rules on one document.

4. Refine the rules as needed.

5. Use the extract_metadata script to test the rules on several documents.

Rules sample
In this example, we see various ways to define rules to extract some metadata elements from a
document.
We want to extract the date, the reference number, and the subject.

Looking at the structure of the document, we can see that:


The date is on the first line of the document. The logo is an image and therefore not processed
by CIS.
The reference is identified by the text Ref.: and it is the only piece of information on this line.
The subject spans several lines; it starts after the text Subject: and ends before the greeting
Sir/Madam,.
A first set of simple rules to extract these elements could be the following:
<?xml version="1.0" encoding="UTF-8"?>
<MetadataExtractionRules>
<!--To extract the date: extract the first line of the document and set it
as the value of the metadata element 'date'.-->
<SetMetadata name="date">
<Line occurrence="1"/>
</SetMetadata>
<!--To extract the reference: extract the text after Ref.: and until
the end of the line.-->
<SetMetadata name="reference">
<Line start="Ref.:"/>
</SetMetadata>
<!--To extract the subject: extract the text after Subject: and before
Sir/Madam,.-->
<SetMetadata name="subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</MetadataExtractionRules>

These simple rules can work for other similar documents. However, if documents in the document set
include small variations (the date is not always on the first line, the greeting is not always Sir/Madam),
these rules will fail to extract the metadata elements. Then you have to define more robust rules.
Regarding the date element for example, you can use a regular expression to match the date. The
date follows the pattern day / month / year, or 2 digits / 2 digits / 4 digits. The corresponding regular
expression is \d{2}/\d{2}/\d{4} where \d means digit and \d{2} means 2 digits.
In this case, the rule is the following:
<SetMetadata name="date">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</SetMetadata>
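
The pattern itself can be checked outside CIS before it goes into a rules file. The following is a
minimal sketch in Python, used here only for illustration. It assumes that the regular expression
engine used by CIS treats \d and {n} the same way as Python's re module; the first_date helper and
the sample text are hypothetical.

```python
import re

# Same expression as in the Pattern rule: 2 digits / 2 digits / 4 digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def first_date(text):
    """Return the first fragment shaped like dd/mm/yyyy, or None if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical extracted text of a letter.
sample = "ACME Corp\n12/03/2011\nRef.: AB-1234\nSubject: Annual report"
found = first_date(sample)
```

Note that the expression only checks the shape of the text: a fragment such as 99/99/9999 also
matches, because each \d accepts any digit.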

To make sure we match the date located between the beginning of the document and the Subject:,
we can modify the rule as follows:
<SetMetadata name="date">
<Block end="Subject:">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</Block>
</SetMetadata>

This rule can be read as "Put in the metadata element date the value returned by the sub-rule Block." The
sub-rule Block first reduces the target text from the beginning of the document to Subject:, then
processes its own sub-rule Pattern, and then returns the values returned by Pattern. The sub-rule
Pattern finds the first text matching the regular expression for the date and returns it (or returns no
value if not found).
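
The evaluation order described above can be approximated in a few lines of Python. This is only a
sketch of the described behavior, not the CIS implementation: the block, pattern, and extract_date
functions are hypothetical, and the block markers are treated as plain text only (CIS also accepts
regular expressions).

```python
import re


def block(text, start=None, end=None):
    """Reduce text to the zone between two plain-text markers (markers excluded).

    Returns None when a requested marker is not found, mirroring the
    'returns no value' behavior described for the Block rule.
    """
    begin = 0
    if start is not None:
        index = text.find(start)
        if index < 0:
            return None
        begin = index + len(start)
    stop = len(text)
    if end is not None:
        index = text.find(end, begin)
        if index < 0:
            return None
        stop = index
    return text[begin:stop]


def pattern(text, regex):
    """Return the first fragment of text matching regex, or None."""
    match = re.search(regex, text)
    return match.group(0) if match else None


def extract_date(document_text):
    # Block first reduces the target text, then its sub-rule Pattern runs on it.
    zone = block(document_text, end="Subject:")
    if zone is None:
        return None
    return pattern(zone, r"\d{2}/\d{2}/\d{4}")
```

With this sketch, a date that appears only after Subject: is not returned, because Pattern never sees
that part of the text.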

To better extract the reference element, we can define a block with start and end elements instead of
a line.
<SetMetadata name="reference">
<Block start="Ref.:" end="Subject:"/>
</SetMetadata>

The extraction of the subject element is trickier because it depends on the greeting Sir/Madam,
which may be different. We can first try to extract the block between Subject: and Sir/Madam,
and, if it is not found, take the first 3 lines of text after Subject:.
<SetMetadata name="subject">
<First>
<Block start="Subject:" end="Sir/Madam,"/>
<Block start="Subject:">
<Line fromOccurrence="1" toOccurrence="3"/>
</Block>
</First>
</SetMetadata>
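
The fallback logic of this rule can be sketched in Python as follows. It is an illustration under
assumptions, not the CIS implementation: the subject function is hypothetical, markers are treated as
plain text, lines are split on line breaks, and surrounding whitespace is trimmed (the source does not
state whether CIS trims values).

```python
def subject(document_text, marker="Subject:", greeting="Sir/Madam,", max_lines=3):
    """Return the subject text, or None if the Subject: marker is missing."""
    start = document_text.find(marker)
    if start < 0:
        return None
    after = document_text[start + len(marker):]
    # First alternative: the block between the marker and the greeting.
    end = after.find(greeting)
    if end >= 0:
        return after[:end].strip()
    # Second alternative: the first lines of text after the marker.
    return "\n".join(after.strip().splitlines()[:max_lines])
```

The second alternative is only a fallback: it may return text that is not part of the subject, so it
is worth testing on several documents.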

You can find some sample files at <CIS installation directory>/doc/metadata.

Chapter 14
Configuring Metadata Extraction

This chapter describes how to configure the extraction of metadata from the content, properties, or
repository attributes of documents including extraction rule definition and testing.

Defining a rules file


The rules for metadata extraction are defined in an XML configuration file. The following
procedure describes how to create this file.

To define a rules file:


1. In Documentum Administrator, navigate to the folder:
Cabinets/System/Applications/CI/MetadataExtractionRules

2. Do one of the following:
Create a new XML file and edit it.
Import an existing rules file.

3. Add rules to the configuration file. The rules are added between the <MetadataExtractionRules>
and </MetadataExtractionRules> elements. Metadata extraction rules, page 201 describes the
rules available with their usage and provides some examples.

Example 14-1. Sample rules file


<?xml version="1.0" encoding="UTF-8"?>
<MetadataExtractionRules>
<SetMetadata name="date">
<Line occurrence="1"/>
</SetMetadata>
<SetMetadata name="reference">
<Line start="Ref.:"/>
</SetMetadata>
<SetMetadata name="subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</MetadataExtractionRules>

Running the extract_metadata script


The extract_metadata script can be used to test one document or to test several documents. If one
document is tested, the output of the script is a text file. If several documents are tested, the output is
a report in CSV format.
A generic -help parameter is available to display the list and usage of all parameters.
You can add the -trackRules parameter to facilitate rules debugging but be aware that it is verbose
and slows down the processing.

To run the extract_metadata script on one document:


1. Locate the extract_metadata.bat script (on Windows hosts, or extract_metadata on Linux hosts); it
can be found at <CIS installation directory>/bin.

2. Run the script with the following parameters:

extract_metadata -doc "<test_document>" -rules "<rule_file>" -output
"<results_file>" -extractedText "<extracted_text_file>" -extractedProperties
"<extracted_properties_file>"

where
<test_document> is the filepath for the document to be tested.
<rule_file> is the filepath for the XML rule file.
<results_file> is the filepath for the generated text file that contains the extraction results.
<extracted_text_file> is the filepath for a file containing only the extracted text as returned by
Oracle text extractor.
<extracted_properties_file> is the filepath for a file containing only the extracted properties
with the specified rules.
such as:
extract_metadata -doc "..\doc\metadata\document_sample.doc" -rules
"..\doc\metadata\rules_sample1.xml" -output "extracted_metadata.txt"
-extractedText "extracted_text.txt" -extractedProperties "extracted_properties.txt"

To run the extract_metadata script on several documents:


1. Locate the extract_metadata.bat script (on Windows hosts, or extract_metadata on Linux hosts); it
can be found at <CIS installation directory>/bin.

2. Run the script with the following parameters:


extract_metadata -docDir "<test_directory>" -rules "<rule_file>"
-output "<results_file>"

where
<test_directory> is the filepath for the directory that contains the documents to be tested.
<rule_file> is the filepath for the XML rule file.
<results_file> is the filepath for the CSV file generated for the extraction results.
such as:
extract_metadata -docDir "..\docs" -rules "..\doc\metadata\rules_sample2.xml"
-output "extracted_metadata_report.csv"
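
When the script is run repeatedly while refining rules, the command line can be assembled by a small
helper. The following Python sketch only builds the argument list from the parameters documented
above; the build_extract_metadata_args function and its argument names are hypothetical, and the
helper restricts -extractedText and -extractedProperties to single-document runs because that is the
only usage shown in this guide.

```python
def build_extract_metadata_args(rules, output, doc=None, doc_dir=None,
                                extracted_text=None, extracted_properties=None,
                                script="extract_metadata"):
    """Return the argument list for one run of the extract_metadata script."""
    if (doc is None) == (doc_dir is None):
        raise ValueError("Specify exactly one of doc or doc_dir")
    if doc_dir is not None and (extracted_text or extracted_properties):
        # Assumption: these outputs are documented for single-document runs only.
        raise ValueError("extracted_text and extracted_properties require doc")
    args = [script]
    args += ["-doc", doc] if doc is not None else ["-docDir", doc_dir]
    args += ["-rules", rules, "-output", output]
    if extracted_text is not None:
        args += ["-extractedText", extracted_text]
    if extracted_properties is not None:
        args += ["-extractedProperties", extracted_properties]
    return args
```

The resulting list can be passed to a process launcher such as Python's subprocess.run from the
<CIS installation directory>/bin directory; on Windows hosts, set script to extract_metadata.bat.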

You can use the extract_metadata script on a machine other than the one on which CIS is installed.
The following procedure describes the required steps to do so.

To run the extract_metadata script on another machine:


1. Locate and run the build_metadata_extractor script in <CIS installation directory>/bin directory.
It creates a new directory <CIS installation directory>/metadata_extractor with all necessary
resources inside.

2. Copy the <CIS installation directory>/metadata_extractor directory to another machine.

3. Locate the metadata_extractor/bin/extract_metadata script.

4. Update the following lines in the extract_metadata script with the correct paths for the current
machine:
set CIS_CONF_DIR=C:\Program Files\Documentum\CIS\config
set CIS_HOME_DIR=C:\Program Files\Documentum\CIS
set CIS_LIB_DIR=C:\Program Files\Documentum\CIS\lib
set JH=C:\Program Files\Documentum\java\1.6.0_17

5. Run the extract_metadata script.

Metadata extraction rules


This section describes the rules and their usage, and provides some examples.

Rules principles
Rules are evaluated in order.
The rules are evaluated in the reading order. Make sure you write them in the order they have to
be evaluated. This also has an impact when you define a target text. The text zone of the target
text can be reduced but not enlarged.
A rule can contain zero, one, or several sub-rules.
If a rule contains sub-rules, the sub-rules are processed first. Then the rule processes itself with the
result of the sub-rule and returns zero, one, or several values. A rule usually has zero or one sub-rule;
only operator rules can have several sub-rules.
A rule applies to a text zone.
The root rule applies to the entire document text, then rules can reduce the target text (text zone
on which the rule applies).
A rule returns zero, one, or several values.
The values returned, if any, are stored according to the document set configuration. Values are
available during the processing to evaluate other rules.

Rules definitions
For all rules, the names of the metadata elements, document properties, or repository attributes
are case sensitive. They must comply with XML standards, which means that characters such as
underscores or spaces are allowed.
The target text could be the entire document or any part of the document defined by a rule.

SetMetadata
Purpose
This rule allows you to define a metadata element.

Attributes
Table 35. <SetMetadata> Element Attributes

Attribute

Description

name

Specifies the name of the metadata element to set.

Usage notes
This rule must have a sub-rule. The metadata element is set with the values returned by the sub-rule.
Once the metadata element is created, it can be accessed by other rules (such as GetMetadata) and it
can be stored as specified in the document set configuration. Make sure the name of the metadata
element is exactly the same in the document set configuration. The name is case-sensitive.

Example of <SetMetadata>
<SetMetadata name="reference">
<Block start="Ref.:" end="Subject:"/>
</SetMetadata>

GetMetadata
Purpose
This rule allows you to get the values of a metadata element.

Attributes
Table 36. <GetMetadata> Element Attributes

Attribute

Description

name

Specifies the name of the metadata element to get.

Usage notes
This rule has no sub-rule. It returns the values set for a metadata element, or no value if the metadata
element is not set. Refer to the SetMetadata rule to know how to set a metadata element.
The GetMetadata rule allows you to verify the existence of a metadata element and to start a new
rule to retrieve the value of this metadata element. The example of the Concat rule also provides
an example of GetMetadata usage. It is different from the Exists condition, which only verifies the
existence of the metadata element.

DocProperty
Purpose
This rule gets the value of a document property.

Attributes
Table 37. <DocProperty> Element Attributes

Attribute

Description

name

Specifies the name of the property to get.

Usage notes
This rule has no sub-rule. It returns the value of a specific property extracted from the document,
such as the title of a PDF document, or no value if the property is not set in the document.
To make sure you set the exact name of the property, run the extract_metadata script on one document
with the -extractedProperties parameter. Look at the text as it is extracted by Oracle text extractor.
This allows you to know the name of the property as it is seen by the extractor. For example, you
may have an Author property in the application interface that appears as primaryauthor in the
extracted text.
Note: System properties may not be extracted depending on the file format. For example, if you
use Windows Explorer, the properties set in the Summary tab of the document Properties are not
extracted for the PDF documents.

Example of <DocProperty>
<SetMetadata name="author">
<DocProperty name="primaryauthor"/>
</SetMetadata>

DocRepositoryAttribute
Purpose
This rule gets the values of an attribute associated with the document in the Documentum repository.

Attributes
Table 38. <DocRepositoryAttribute> Element Attributes

Attribute

Description

name

Specifies the name of the repository attribute to get.

Usage notes
This rule has no sub-rule. It returns the values of an attribute associated with the document in the
Documentum repository, such as the attributes title or keywords, or no value if the attribute is not set.

Example of <DocRepositoryAttribute>
<SetMetadata name="DocTitle">
<First>
<DocRepositoryAttribute name="title"/>
<DocProperty name="title"/>
<Line occurrence="1"/>
</First>
</SetMetadata>

Block
Purpose
This rule looks for a text block delimited by a start element and an end element.

Attributes
Table 39. <Block> Element Attributes

Attribute

Description

start

Specifies a start element for the block that can be plain text or regular
expression.
Optional, by default the start element is the beginning of the current
target text.

fromMetadataPosition

Specifies a start element for the block that is the name of a metadata
element. The block starts at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the start element is the beginning of the current
target text.

includeStart

Specifies whether to include the block start.
False by default.

end

Specifies an end element for the block that can be plain text or regular
expression.
Optional, by default the end element is the end of the current target text.

toMetadataPosition

Specifies an end element for the block that is the name of a metadata
element. The block ends at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the end element is the end of the current target text.

includeEnd

Specifies whether to include the block end.
False by default.

ignoreCase

Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.

occurrence

Keeps only a specific occurrence. For example, to keep only the
second occurrence, set occurrence=2. It is equivalent to having
fromOccurrence=2 and toOccurrence=2.
Optional, only the first occurrence is kept by default.

fromOccurrence

Defines the first occurrence to keep. For example, fromOccurrence=3
keeps all occurrences starting from the third and up to the last.
Optional, fromOccurrence=1 by default.

toOccurrence

Defines the last occurrence to keep. For example, fromOccurrence=2
toOccurrence=3 keeps only the second and the third occurrence.
Optional, if not specified, all occurrences up to the last one are kept, unless
the occurrence attribute is set.

allOccurences

Specifies that all occurrences are kept. Equivalent to having
fromOccurrence=1.
Optional, false by default.

Usage notes
Both start and end elements can be plain text, a regular expression, or a metadata element defined in
a previous rule.
If either the start or end element is not found, this rule returns no value and does not process
its sub-rule (if any). If there is no sub-rule, this rule returns one value: the text block matched.
If there is a sub-rule, it is invoked with a target text reduced to the text block matched by this rule,
and the values returned by the sub-rule are returned by this rule.
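
The way the occurrence attributes combine can be summarized with a short Python sketch. It follows
the defaults listed in the attribute table, but it is an approximation: the select_occurrences
function is hypothetical, and giving precedence to occurrence when several attributes are set is an
assumption, since the table does not cover that case.

```python
def select_occurrences(matches, occurrence=None, from_occurrence=None,
                       to_occurrence=None, all_occurrences=False):
    """Return the matches kept by the occurrence attributes (1-based positions)."""
    if occurrence is not None:
        # occurrence=N is equivalent to fromOccurrence=N and toOccurrence=N.
        first = last = occurrence
    elif from_occurrence is not None or to_occurrence is not None or all_occurrences:
        first = from_occurrence if from_occurrence is not None else 1
        last = to_occurrence if to_occurrence is not None else len(matches)
    else:
        # No attribute set: only the first occurrence is kept.
        first = last = 1
    return matches[first - 1:last]
```

For example, with four matches, fromOccurrence=3 keeps the third and fourth, and an occurrence
number larger than the number of matches keeps nothing.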

Examples of <Block>
<Block start="Ref.:" end="Subject:"/>
<Block end="Subject:"/>
<Block fromMetadataPosition="Author" end="Date:"/>
<SetMetadata name="phones">
<Block start="Tel" end="Fax" allOccurrences="true">
<Pattern regex="\d{5}-\d{4}"/>
</Block>
</SetMetadata>

Line
Purpose
This rule looks for one or several lines.

Attributes
Table 40. <Line> Element Attributes

Attribute

Description

start

Specifies a start element for the block that can be plain text or regular
expression.
Optional, by default the start element is the beginning of the current
target text.

fromMetadataPosition

Specifies a start element for the block that is the name of a metadata
element. The block starts at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the start element is the beginning of the current
target text.

includeStart

Specifies whether to include the block start.
False by default.

end

Specifies an end element for the block that can be plain text or regular
expression.
Optional, by default the end element is the end of the current target text.

toMetadataPosition

Specifies an end element for the block that is the name of a metadata
element. The block ends at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the end element is the end of the current target text.

includeEnd

Specifies whether to include the block end.
False by default.

ignoreCase

Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.

occurrence

Keeps only a specific occurrence. For example, to keep only the
second occurrence, set occurrence=2. It is equivalent to having
fromOccurrence=2 and toOccurrence=2.
Optional, only the first occurrence is kept by default.

fromOccurrence

Defines the first occurrence to keep. For example, fromOccurrence=3
keeps all occurrences starting from the third and up to the last.
Optional, fromOccurrence=1 by default.

toOccurrence

Defines the last occurrence to keep. For example, fromOccurrence=2
toOccurrence=3 keeps only the second and the third occurrence.
Optional, if not specified, all occurrences up to the last one are kept, unless
the occurrence attribute is set.

allOccurences

Specifies that all occurrences are kept. Equivalent to having
fromOccurrence=1.
Optional, false by default.

Usage notes
The Line rule is similar to the Block rule for which the default start element is the beginning of a line
and the default end element is the end of a line. This rule has the same attributes as the Block rule.

Examples of <Line>
<SetMetadata name="date">
<Line occurrence="1"/>
</SetMetadata>
<SetMetadata name="reference">
<Line start="Ref.:"/>
</SetMetadata>
<SetMetadata name="subject">
<Block start="Subject:">
<Line fromOccurrence="1" toOccurrence="3"/>
</Block>
</SetMetadata>

Pattern
Purpose
This rule looks for a text fragment matching a pattern.

Attributes
Table 41. <Pattern> Element Attributes

Attribute

Description

regex

Specifies the phrase in plain text or the regular expression to look for.

ignoreCase

Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.

occurrence

Keeps only a specific occurrence. For example, to keep only the
second occurrence, set occurrence=2. It is equivalent to having
fromOccurrence=2 and toOccurrence=2.
Optional, only the first occurrence is kept by default.

fromOccurrence

Defines the first occurrence to keep. For example, fromOccurrence=3
keeps all occurrences starting from the third and up to the last.
Optional, fromOccurrence=1 by default.

toOccurrence

Defines the last occurrence to keep. For example, fromOccurrence=2
toOccurrence=3 keeps only the second and the third occurrence.
Optional, if not specified, all occurrences up to the last one are kept, unless
the occurrence attribute is set.

allOccurences

Specifies that all occurrences are kept. Equivalent to having
fromOccurrence=1.
Optional, false by default.

Usage notes
The pattern can be either a phrase in plain text or a regular expression.
If the pattern is not found, this rule returns no value and does not process its sub-rule
(if any). If there is no sub-rule, this rule returns one value: the text fragment matched. If there is a
sub-rule, it is invoked with a target text reduced to the text fragment matched by this rule, and the
values returned by the sub-rule are returned by this rule.

Examples of <Pattern>
<SetMetadata name="date">
<Block end="Subject:">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</Block>
</SetMetadata>
<SetMetadata name="emails">
<Pattern regex="[a-z\._]+@[a-z\.]+" ignoreCase="true" allOccurrences="true"/>
</SetMetadata>

Zone
Purpose
This rule reduces the target text based on character indexes.

Attributes
Table 42. <Zone> Element Attributes

Attribute

Description

startIndex

Specifies the index of the first character (included) of the text zone to keep.

endIndex

Specifies the index of the character following (excluded) the last character
of the text zone.

Usage notes
Unlike Block or Pattern, this rule does not look for a text fragment; it directly reduces the target
text based on character indexes. For example, it may be necessary to limit the target text to the
first page when there is no visible text marker identifying the end of the first page. In that case,
you can limit the target text to the first 500 characters (roughly the first page).
If the start index is greater than the length of the current target text, this rule returns no
value and does not process its sub-rule (if any). If there is no sub-rule, this rule returns one
value: the delimited text zone. If there is a sub-rule, it is invoked with a target text reduced to the
delimited text zone, and the values returned by the sub-rule are returned by this rule.
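
The index convention (start included, end excluded) is the same as string slicing in many
programming languages. The following Python sketch illustrates it; the zone function is hypothetical,
and clipping an endIndex that lies beyond the end of the text is Python slicing behavior that is only
assumed, not confirmed, for CIS.

```python
def zone(target_text, start_index, end_index):
    """Return the text zone [start_index, end_index), or None when start is out of range."""
    if start_index > len(target_text):
        # Mirrors the documented case: the rule returns no value.
        return None
    return target_text[start_index:end_index]
```

With startIndex="0" and endIndex="5", the zone contains the characters at indexes 0 to 4.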

Example of <Zone>
<Zone startIndex="0" endIndex="500">
<SetMetadata name="InventionTitle">
<Block start="Title of invention:" end="Name of Program"/>
</SetMetadata>
</Zone>

Constant
Purpose
This rule always returns the same constant value.

Attributes
Table 43. <Constant> Element Attributes

Attribute

Description

value

Specifies the constant value to return; it can be any string.

Usage notes
This rule has no sub-rule. It can be used to concatenate a constant value with another value to set a
metadata element, such as adding the symbol for a unit of measurement.

Example of <Constant>
<SetMetadata name="price">
<Concat>
<Pattern regex="\d+"/>
<Constant value="$"/>
</Concat>
</SetMetadata>

If
Purpose
This conditional rule evaluates conditions and, depending on the evaluation, processes a sub-rule.

Usage notes
This rule must have at least one condition and exactly one <Then> child element. It can optionally
have one <Else> child element. Each child element (<Then> or <Else>) must have one sub-rule. If all
the conditions are satisfied (implicit AND), the sub-rule of the <Then> tag is processed. Otherwise, the
sub-rule of the <Else> tag is processed (if any). The <If> rule returns the values returned by the
<Then> sub-rule or the <Else> sub-rule, or no value if the conditions are not satisfied and there is
no <Else>. The general structure is:
<If>
Conditions
<Then>
Sub-rules
</Then>
<Else>
Sub-rules
</Else>
</If>
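
The control flow of the <If> rule can be sketched in Python as follows. This is an illustration of
the behavior described in the usage notes, not the CIS implementation; the evaluate_if function is
hypothetical, conditions are modeled as functions returning a Boolean, and sub-rules as functions
returning a list of values.

```python
def evaluate_if(conditions, then_rule, else_rule=None):
    """Return the values of the Then sub-rule, the Else sub-rule, or no value."""
    if not conditions:
        raise ValueError("An If rule must have at least one condition")
    # Several conditions are combined with an implicit AND.
    if all(condition() for condition in conditions):
        return then_rule()
    if else_rule is not None:
        return else_rule()
    # Conditions not satisfied and no Else: the rule returns no value.
    return []
```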

Example of <If>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>

Conditions
Conditions are not rules; they are used inside the <If> rule. Their evaluation always returns a Boolean.

And
Purpose
This operator evaluates all its sub-conditions with an AND.

Usage notes
The And operator must have at least one sub-condition. It is satisfied if and only if all its
sub-conditions are satisfied. If a sub-condition is not satisfied, the next sub-conditions are not
evaluated and the And operator is not satisfied.
The And operator has no attributes.

Example of <And>
<If>
<And>
<Contains name="subject" source="metadata" value="report"/>
<Equals name="format" source="docProperty" value="pdf"/>
</And>
<Then>
<SetMetadata name="authors">
<Line start="Authors:"/>
</SetMetadata>
</Then>
</If>

Or
Purpose
This operator evaluates all its sub-conditions with an OR.

Usage notes
The Or operator must have at least one sub-condition. It is satisfied if and only if at least one of its
sub-conditions is satisfied. If a sub-condition is satisfied, the next sub-conditions are not evaluated
and the Or operator is satisfied.
The Or operator has no attributes.
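
Both the And and the Or operators stop evaluating as soon as the result is known. The following
Python sketch illustrates this short-circuit behavior; the evaluate_and, evaluate_or, and traced
helpers are hypothetical and only model sub-conditions as functions returning a Boolean.

```python
def evaluate_and(conditions):
    """Satisfied only if every sub-condition is satisfied; stops at the first failure."""
    for condition in conditions:
        if not condition():
            return False
    return True


def evaluate_or(conditions):
    """Satisfied if at least one sub-condition is satisfied; stops at the first success."""
    for condition in conditions:
        if condition():
            return True
    return False


calls = []


def traced(name, result):
    """Build a sub-condition that records its name when it is evaluated."""
    def condition():
        calls.append(name)
        return result
    return condition


and_result = evaluate_and([traced("a", False), traced("b", True)])
and_calls = list(calls)  # only "a" was evaluated
calls.clear()
or_result = evaluate_or([traced("c", True), traced("d", False)])
or_calls = list(calls)  # only "c" was evaluated
```

One practical consequence is that the order of sub-conditions determines which of them are actually
evaluated.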

Example of <Or>
<If>
<Or>
<Contains name="subject" source="metadata" value="report"/>
<Equals name="format" source="docProperty" value="pdf"/>
</Or>
<Then>
<SetMetadata name="authors">
<Line start="Authors:"/>
</SetMetadata>
</Then>
</If>

Not
Purpose
This operator evaluates its single sub-condition and inverts the resulting Boolean.

Usage notes
The Not operator must have one sub-condition. It is satisfied if and only if the sub-condition is not
satisfied.
The Not operator has no attributes.

Example of <Not>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>

Exists
Purpose
This condition is satisfied if a metadata element is defined and has a non-empty value.

Attributes
Table 44. <Exists> Element Attributes

Attribute

Description

name

Specifies the name of the metadata element to check.

source

Specifies the source of the metadata element. Possible values are:
metadata, docProperty, docRepositoryAttribute.

Usage notes
If the metadata element is not found or has an empty value, then the Exists condition is not satisfied.

Example of <Exists>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>

Contains
Purpose
This condition is satisfied if the value of a metadata element contains a specific string.

Attributes
Table 45. <Contains> Element Attributes

Attribute

Description

name

Specifies the name of the metadata element to check.

source

Specifies the source of the metadata element to check. Possible values are:
metadata, docProperty, docRepositoryAttribute.

value

Specifies the string to look for.

ignoreCase

Specifies whether to ignore the letter case of the string to look for.
False by default.

Usage notes
The Contains condition is satisfied if and only if the metadata value contains this substring.

Example of <Contains>
<SetMetadata name="authors">
<If>
<Contains name="subject" source="metadata" value="report"/>
<Then>
<Line start="Authors:"/>
</Then>
<Else>
<Block fromMetadataPosition="title" end="Date:"/>
</Else>
</If>
</SetMetadata>

Equals
Purpose
This condition is satisfied if the value of a metadata element is equal to a specific string.

Attributes
Table 46. <Equals> Element Attributes

Attribute     Description
name          Specifies the name of the metadata element to check.
source        Specifies the source of the metadata element to check. Possible values are: metadata, docProperty, docRepositoryAttribute.
value         Specifies the string to compare with.
ignoreCase    Specifies whether to ignore the letter case of the string to compare with. False by default.

Usage notes
The Equals condition is satisfied if and only if the value of the metadata element is equal to the string specified by the value attribute.

Example of <Equals>
<SetMetadata name="authors">
<If>
<Equals name="subject" source="metadata" value="report"/>
<Then>
<Line start="Authors:"/>
</Then>
<Else>
<Block fromMetadataPosition="title" end="Date:"/>
</Else>
</If>
</SetMetadata>
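
The following Python sketch models how the If/Then/Else in the example selects a sub-rule. It is illustrative only: the function names equals and choose_branch are hypothetical, and the strings "Line" and "Block" simply name the sub-rule that would set the authors element.

```python
def equals(metadata_value, value, ignore_case=False):
    """Model of <Equals>: the whole value must match, not just a part of it."""
    if metadata_value is None:
        return False
    if ignore_case:
        return metadata_value.casefold() == value.casefold()
    return metadata_value == value


def choose_branch(subject):
    """Model of the If/Then/Else above: name the sub-rule that sets authors."""
    return "Line" if equals(subject, "report") else "Block"


print(choose_branch("report"))         # exact match: Then branch
print(choose_branch("annual report"))  # only contains the string: Else branch
```

Unlike Contains, a value that merely includes the string does not satisfy Equals, so "annual report" falls through to the Else branch.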

IsPositionBefore
Purpose
This condition is satisfied if a metadata element is positioned before another metadata element.

Attributes
Table 47. <IsPositionBefore> Element Attributes

Attribute    Description
metadata1    Specifies the name of the first metadata element with which to compare the position.
metadata2    Specifies the name of the second metadata element with which to compare the position.

Usage notes
This condition compares the position of two metadata elements already defined by previous rules.
It is satisfied if and only if both metadata elements are defined, have values, and the first metadata
element is positioned before the second metadata element. If one of the metadata elements is not
found, then the IsPositionBefore condition is not satisfied.

Example of <IsPositionBefore>
<SetMetadata name="title">
<If>
<IsPositionBefore metadata1="document_type" metadata2="date"/>
<Then>
<Block fromMetadataPosition="date" end="Version"/>
</Then>
<Else>
<Block fromMetadataPosition="document_type" end="Version"/>
</Else>
</If>
</SetMetadata>
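
The usage notes for IsPositionBefore can be summarized by the following Python model. It is a sketch, not the CIS implementation: representing a position as a character offset, with None for an element that was not found, is an assumption made for illustration.

```python
def is_position_before(positions, metadata1, metadata2):
    """Model of <IsPositionBefore> over a name-to-offset mapping."""
    first = positions.get(metadata1)
    second = positions.get(metadata2)
    if first is None or second is None:
        # If either element is not found, the condition is not satisfied.
        return False
    return first < second


# Hypothetical offsets of elements defined by previous rules.
positions = {"document_type": 12, "date": 48}

print(is_position_before(positions, "document_type", "date"))
print(is_position_before(positions, "date", "document_type"))
print(is_position_before(positions, "document_type", "author"))
```

Note that the condition is not symmetric with respect to missing elements: when either element is missing, both orderings evaluate to not satisfied, so the Else branch is taken.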

Operator rules
Operator rules are special rules that may have multiple sub-rules. Operator rules define the way
all results returned by their sub-rules are processed so that the operator itself returns a single list
of values.

First
Purpose
This operator processes its sub-rules in order and returns the results of the first sub-rule
that returns non-empty values.

Usage notes
Once a sub-rule returns non-empty values, the remaining sub-rules are not processed, and the result of
that sub-rule is returned by the First operator. If no sub-rule returns a value, then the First operator
returns no value.
The First rule has no attributes.

Example of <First>
<SetMetadata name="DocTitle">
<First>
<DocRepositoryAttribute name="title"/>
<DocProperty name="title"/>
<Line occurrence="1"/>
</First>
</SetMetadata>
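
The short-circuit behavior of First can be sketched in Python as follows. This is an illustrative model only: sub-rules are represented as hypothetical zero-argument functions that return a list of values, and the evaluated list exists solely to show which sub-rules were run.

```python
def first(sub_rules):
    """Model of <First>: return the values of the first sub-rule that yields any."""
    for sub_rule in sub_rules:
        values = sub_rule()
        if values:
            return values  # remaining sub-rules are not processed
    return []


evaluated = []  # records which sub-rules were actually run


def rule(name, values):
    """Build a stand-in sub-rule that logs its name and returns fixed values."""
    def run():
        evaluated.append(name)
        return values
    return run


result = first([
    rule("DocRepositoryAttribute", []),       # no title attribute in this case
    rule("DocProperty", ["Annual Report"]),   # first sub-rule with a value
    rule("Line", ["ACME Corp."]),             # never evaluated
])
print(result)
print(evaluated)
```

In the DocTitle example above, this means the first line of the document is used as the title only when neither the repository attribute nor the document property provides one.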

All
Purpose
This operator processes all its sub-rules in order and returns all the results in a list of values.

Usage notes
The All operator returns a single list of values in which all the non-null values returned by all the
sub-rules are appended, in order. If all sub-rules return no value then the All operator returns no
value.
The All rule has no attributes.

Example of <All>
<Block start="Ricorso in Appello" end="Sentenza">
<All>
<SetMetadata name="AppealNumber">
<Pattern regex="\d{7}-[A-Z]{3}"/>
</SetMetadata>
<SetMetadata name="AppealDate">
<Pattern regex="\d{2}-\d{2}-\d{4}"/>
</SetMetadata>
</All>
</Block>
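
The following Python sketch models how All appends the values of its sub-rules, using the two regular expressions from the example. It rests on assumptions: Python's re module stands in for the CIS pattern engine, each pattern is assumed to return only its first match, and the sample text is invented.

```python
import re


def pattern(regex):
    """Stand-in for a <Pattern> sub-rule returning its first match, if any."""
    def run(text):
        match = re.search(regex, text)
        return [match.group(0)] if match else []
    return run


def all_of(sub_rules, text):
    """Model of <All>: append the values of every sub-rule, in order."""
    values = []
    for sub_rule in sub_rules:
        values.extend(sub_rule(text))
    return values


# Hypothetical text found between "Ricorso in Appello" and "Sentenza".
block_text = "n. 0012345-RMA, depositato il 12-03-2011"

values = all_of(
    [pattern(r"\d{7}-[A-Z]{3}"), pattern(r"\d{2}-\d{2}-\d{4}")],
    block_text,
)
print(values)
```

A sub-rule that finds nothing contributes no value, so the list returned by the model is simply shorter rather than padded with empty entries.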

Concat
Purpose
This operator processes all its sub-rules in order and returns a single value with all the values of
the sub-rules concatenated.

Attributes
Table 48. <Concat> Element Attributes

Attribute    Description
separator    Specifies the separator to insert between each value. Optional; by default, the separator is a space.

Usage notes
The Concat operator returns a single value that is the concatenation of all the non-null values returned
by all the sub-rules, in order. If all sub-rules return no value then the Concat operator returns no value.

Example of <Concat>
<SetMetadata name="person_in_charge">
<Concat separator=", ">
<GetMetadata name="last_name"/>
<GetMetadata name="first_name"/>
</Concat>
</SetMetadata>
<SetMetadata name="subject">
<Concat>
<Block start="Subject:">
<Line fromOccurrence="1" toOccurrence="3"/>
</Block>
</Concat>
</SetMetadata>
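
The two uses of Concat shown above can be modeled with a short Python function. This is a sketch for illustration only: the function name concat and the sample values are assumptions, and null values are represented by None.

```python
def concat(values, separator=" "):
    """Model of <Concat>: join the non-null values, in order, into one value."""
    present = [value for value in values if value is not None]
    if not present:
        return None  # all sub-rules returned no value
    return separator.join(present)


# Hypothetical values returned by the sub-rules of the two examples.
print(concat(["Smith", "John"], separator=", "))
print(concat(["Status of the", "quarterly", "review"]))
```

The second call relies on the default separator, which is how the subject example turns three extracted lines into a single space-separated value.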

MostFrequent
Purpose
This operator processes all its sub-rules in order and returns only the most frequent value.

Usage notes
The MostFrequent operator returns a single value: the most frequent value of all the non-null values
returned by all the sub-rules. If all sub-rules return no value, then the MostFrequent operator returns
no value. If all values have the same number of occurrences, then the first one is kept.
The MostFrequent rule has no attributes.

Example of <MostFrequent>
<SetMetadata name="most_frequent_email">
<MostFrequent>
<Pattern regex="[a-z\._]+@[a-z\.]+" allOccurrences="true"/>
</MostFrequent>
</SetMetadata>
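
The selection made by MostFrequent can be sketched in Python with the regular expression from the example. The sketch makes several assumptions: Python's re module stands in for the CIS pattern engine, the sample text is invented, and any tie between values is resolved in favor of the value seen first.

```python
import re


def most_frequent(values):
    """Model of <MostFrequent>: most frequent non-null value, first one on ties."""
    counts = {}
    for value in values:
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    if not counts:
        return None
    # max() returns the first key with the highest count; dicts keep insertion order.
    return max(counts, key=counts.get)


# Hypothetical document text.
text = "From: anna.rossi@example.com\nCc: sales@example.com, anna.rossi@example.com\n"

emails = re.findall(r"[a-z\._]+@[a-z\.]+", text)  # models allOccurrences="true"
print(most_frequent(emails))
```

Note that the pattern from the example also accepts a period after the at sign, so an address immediately followed by a sentence-ending period would be captured with that period attached.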

Best practices and tips


In some cases, the rules may return values that are too long to be stored as annotations. To avoid this
kind of issue, you can nest several rules to limit the length of the returned value.
In the following example, the abstract metadata element is defined by the Block rule as anything
between the word Abstract and the word Introduction, but to limit the length of the returned
value it only takes the first two lines.
<SetMetadata name="abstract">
<Block start="Abstract" end="Introduction">
<Line fromOccurrence="1" toOccurrence="2"/>
</Block>
</SetMetadata>

In the following example, the abstract metadata element is also defined by the Block rule as anything
between the word Abstract and the word Introduction, but to limit the length of the returned
value it only takes the first 50 characters.
<SetMetadata name="abstract">
<Block start="Abstract" end="Introduction">
<Zone startIndex="0" endIndex="50"/>
</Block>
</SetMetadata>
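
The effect of the two nesting techniques can be previewed with the following Python sketch. It is only a model under stated assumptions: the block function assumes that the start and end markers are excluded from the value and that no value is returned when a marker is missing, and the sample document is invented.

```python
def block(text, start, end):
    """Model of <Block>: text between the start and end markers."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish].strip()


def first_lines(value, count):
    """Model of <Line fromOccurrence="1" toOccurrence="count"/>."""
    return "\n".join(value.splitlines()[:count])


def zone(value, start_index, end_index):
    """Model of <Zone>: characters from start_index up to end_index."""
    return value[start_index:end_index]


# Hypothetical document content.
document = (
    "Abstract\n"
    "This guide explains how to administer the product.\n"
    "It covers installation and configuration.\n"
    "It also covers troubleshooting.\n"
    "Introduction\n"
    "The product has three components."
)

abstract = block(document, "Abstract", "Introduction")
print(first_lines(abstract, 2))
print(zone(abstract, 0, 50))
```

Limiting by lines keeps whole sentences when the source text is line-oriented, whereas limiting by characters gives a strict upper bound on the stored length.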

Part 7
Exposing Content Intelligence Services Results

This part describes various ways to expose the results of CIS processing:
Expose classification concepts in CenterStage: Classification concepts are category matches
found by Content Intelligence Services (CIS) and based on predefined taxonomies. Unlike the results
of standard CIS classification processing, they are not stored as category assignments. They can be
exposed in CenterStage as search filters.
Expose more entities in CenterStage: In addition to the People, Place, and Organization entities that
are available out-of-the-box in CenterStage, CIS allows you to extract other entities that are relevant to
your company using Temis cartridges and expose them in CenterStage.
Access annotations: Annotations are a unique way to store entities, classification concepts, and
extracted metadata in the repository. The Annotation API allows you to access these annotations
and use them according to your needs.
Integrating CIS classification: There are several integration scenarios for CIS standard
classification processing.

Chapter 15
Expose Classification Concepts or Entities in CenterStage Filters

This chapter describes the steps required for two customizations: exposing classification concepts in
CenterStage navigation filters and exposing additional entities in CenterStage navigation filters.
Extract classification concepts, page 229
Extract new entities, page 231
Add custom filters in CenterStage, page 233
Optional steps to test the customizations:
Clear previous entities, page 238
Clear the document status, page 239
The customizations described in this document require CenterStage version 1.1 and CIS version 6.6
installed for CenterStage. CenterStage and CIS must use the CIS DAR file (cis_artifacts.dar) version
6.6. It is assumed that these products are installed and running.

Extract classification concepts


It is possible to expose the concepts found by the taxonomy-based classification as a filter in
CenterStage clients. The names of the categories are then displayed as filter values in CenterStage for
the categorized documents.
The processing used to obtain classification concepts for CenterStage is different from the standard
CIS classification processing:
No category assignments are made. As a consequence, the options related to category assignments
such as Link assigned documents into category folders or Update document attributes with
category assignments do not apply.
Unlike the classification for category assignments, not all taxonomies in production are
automatically used; you have to specify every taxonomy that you want to expose.
The concepts found by the classification are stored in the repository and exposed in CenterStage
search filters. The hierarchy of the taxonomy is not kept in the results. Each concept will be
displayed at the same level, that is, as a flat list, regardless of its position in the taxonomy.
Note: The standard CIS classification processing is not compatible with the classification processing
for CenterStage. Do not try to perform the two types of processing on the same documents.

To expose classification concepts:


1. Configure the taxonomies: create or import taxonomies in TEF format, then synchronize them in
Production mode. Refer to the Configure CIS for Classification chapter in the CIS Administration Guide and
the Content Intelligence Services chapter in the Documentum Administrator User Guide.

2. Configure the document sets for the classification as described in To configure the classification
for CenterStage spaces:, page 230, either for all spaces or only for a specific space. Each time a
space is created in CenterStage, a document set is automatically created for this space.

3. Define the new filters in CenterStage as described in To define new filters in CenterStage:, page
233. The filters are also mapped to the full-text indices.

4. (Optional) Reprocess the documents. If you decide not to reprocess the documents, the values in
the filter appear only when the documents are modified, which automatically triggers new
processing. To force reprocessing, clear the document status table, as described in Clear the
document status, page 239. To test the customization, you can also clear the previously extracted
entities as described in To clear extracted entities with the clear_entities script:, page 238.

To configure the classification for CenterStage spaces:


1. Edit the configuration file of the space of your choice as described in To edit the configuration
file of the document sets:, page 87.

2. In the <analysis-plan> element, indicate the type of processing:

<analysis-plan>
<classification-step/>
</analysis-plan>

<classification-step/> is the taxonomy-based classification processing that stores concepts but
does not create category assignments.

3. Add or edit the <classification> element, for example:


<classification>
<analysis name="custom_classif">
<repository-taxonomy>MyCompany products</repository-taxonomy>
<repository-taxonomy>MyCompany projects</repository-taxonomy>
</analysis>
</classification>

Where
The name attribute in the <analysis> element is any name; it is reused later to define the
way the entity values are stored.
The value of the <repository-taxonomy> element is the name of the taxonomy used. In the
example, two custom taxonomies are mapped. All values will be displayed in the same filter.

4. In the <storage> element, indicate the type of storage:


<storage>
<annotation code="1001">
<analysis>custom_classif</analysis>
</annotation>
</storage>

Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.

Process, or reprocess, the documents to store the results of the classification so that they can be
exposed in the new filter in CenterStage.
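
Before reprocessing documents, it can be useful to check that the <storage> element refers to an analysis that is actually defined and that numeric codes follow the rule given above. The following Python sketch shows one way to do so. It is not a CIS tool: the function name check_storage is hypothetical, and the <docset-config> root element is an invented wrapper used only so that the fragments from the procedure parse as a single XML document.

```python
import xml.etree.ElementTree as ET

# Fragments from the procedure above, wrapped in a hypothetical root element.
CONFIG = """<docset-config>
  <classification>
    <analysis name="custom_classif">
      <repository-taxonomy>MyCompany products</repository-taxonomy>
      <repository-taxonomy>MyCompany projects</repository-taxonomy>
    </analysis>
  </classification>
  <storage>
    <annotation code="1001">
      <analysis>custom_classif</analysis>
    </annotation>
  </storage>
</docset-config>"""


def check_storage(config_xml):
    """Return a list of inconsistencies between <classification> and <storage>."""
    root = ET.fromstring(config_xml)
    defined = {a.get("name") for a in root.findall("./classification/analysis")}
    problems = []
    for annotation in root.findall("./storage/annotation"):
        code = annotation.get("code", "")
        # A non-numeric code is taken to be the name of an existing entity type.
        if code.isdigit() and int(code) < 1000:
            problems.append(f"code {code} is lower than 1000")
        for analysis in annotation.findall("analysis"):
            name = (analysis.text or "").strip()
            if name not in defined:
                problems.append(f"analysis {name} is not defined")
    return problems


print(check_storage(CONFIG))
```

An empty list means that, in this model, every stored analysis has a matching definition and every numeric code is 1000 or higher.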

Extract new entities


You can extract and expose entity types other than People in text, Place in text, and
Organization in text, such as:
Other entities from the TM360 cartridge, which is available out-of-the-box with CIS. Refer to Table
50, page 232 for the list of additional entities available in the TM360 cartridge.
Entities from other Temis Luxid cartridges. EMC only distributes TM360; you need a separate
license agreement with Temis to use other cartridges.
It is not possible to expose entities of cartridges from other providers.

To extract new entities:


1. For a cartridge other than TM360, set up the cartridge and the annotation plan as described in
the Luxid documentation.

2. Define the new entity types in the configuration file, either for all spaces or only for a specific
space, as described in To configure the document sets for new entity types:, page 232. Each time a
space is created in CenterStage, a document set is automatically created for this space.

3. To expose the new entities in CenterStage clients, define the new filters as described in To define
new filters in CenterStage:, page 233. The filters are also mapped to the full-text indices.

4. (Optional) Reprocess the documents. If you decide not to reprocess the documents, the values in
the filter appear only when the documents are modified, which automatically triggers new
processing. To force reprocessing, clear the document status table, as described in Clear the
document status, page 239. To test the customization, you can also clear the previously extracted
entities as described in To clear extracted entities with the clear_entities script:, page 238.

The following table specifies the internal and public names of the default CenterStage entities.
Table 49. Internal and public names of default entities

Public name (as visible in CenterStage)    Internal name or description
People in text                             CISPerson
Organization in text                       CISCompany. The CISCompany entity includes values from the Company, Organization, and Media entities of the TM360 cartridge.
Place in text                              CISLocation. You cannot map a taxonomy or a custom entity to the Place in text entity.

The following table specifies the names and descriptions of other TM360 entities that you can use.

Table 50. TM360 entities

Name                Description
StockIndex          Stock index names such as Dow Jones or CAC40.
Function            Job functions (high positions) such as Chief executive or Financial advisor.
Postal Address      Postal addresses such as 23 Rue des Dames, 75 017 Paris.
Fax Number          Fax numbers such as Fax: (917) 765-865.
Phone Number        Phone numbers such as: +33 (1) 35 78 18 28 or phone : (917) 765-865.
URL                 Web addresses (URLs) such as www.temis.com.
Email               Email addresses such as caestill@runner.com.
UserDefined[0-9]    Entities customized in TM360, such as UserDefined0, UserDefined1, etc.

Time Expression, Money Expression, Measurements, and Relationships are not available in CIS
extraction.

To configure the document sets for new entity types:


1. Edit the configuration file of your choice as described in To edit the configuration file of the
document sets:, page 87.

2. In the <analysis-plan> element, indicate the type of processing:

<analysis-plan>
<entity-detection-step/>
</analysis-plan>

<entity-detection-step/> is the entity extraction processing based on cartridges.

3. In the <entity-detection> element, add an <analysis> element such as:


<entity-detection>
<analysis name="custom_entity_1">
<entity>Postal Address</entity>
</analysis>
</entity-detection>

Where
The name attribute in the <analysis> element is any name; it is reused later to define the
way the entity values are stored.
The value of the <entity> element is the name of the entity in the cartridge, for example the
concept (not the subconcept) in the Temis cartridge TM360. Refer to Table 50, page 232
for the exact name of an entity from TM360 or refer to the Luxid documentation for entities
from other cartridges. If you want to use one of the default entities, use the <builtin-entity>
element instead of the <entity> element. Refer to Table 49, page 231 for the exact name of
default entities.

4. In the <storage> element, indicate the type of storage:

<storage>
<annotation code="1002">
<analysis>custom_entity_1</analysis>
</annotation>
</storage>

Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.

5. If the cartridge is not TM360, add the new annotation plan to the
cis.entity.luxid.annotation_plan.names property in the cis.properties file:
cis.entity.luxid.annotation_plan.names=TM360

By default, only the TM360 cartridge is defined. Separate cartridge names with a comma.
Process, or reprocess, the documents to store the results of the entity extraction so that they can be
exposed in the new filter in CenterStage.

Add custom filters in CenterStage


Once you have configured CIS to process CenterStage documents using a taxonomy or a new entity,
you can add a filter to the CenterStage user interface.
Note: In the context of CenterStage customization, the term facet refers to a search filter.

To define new filters in CenterStage:


1. In Documentum Administrator, edit the file facet_definitions.xml in the folder:

Cabinets/System/Applications/CenterStage Pro/config

If the file does not exist, create it by performing the following steps:
a. Navigate to facet_definitions.xml in the folder:

Cabinets/System/Applications/CenterStage/config

b. Export the file facet_definitions.xml to your local file system.

c. Open your local copy of facet_definitions.xml for editing in a text or XML editor.

Chapter 9, Set CenterStage Application Options, of the CenterStage 1.2 Administration
Guide describes the customization of the app.xml file. The customization mechanism for
facet_definitions.xml is similar.

2. In each <facetdisplay> element, add a <facet> element for each new filter.

3. Set a value for the id parameter. This id will be used later to define the filter.
Here we defined _facet_custom_project and _facet_custom_postal_address:
<facetdisplay id="facets">
<facet id="_kw_location" visible="true"/>
<facet id="_kw_format" visible="true"/>
<facet id="r_modify_date" visible="true"/>
<facet id="r_modifier"/>
<facet id="r_full_content_size"/>
<facet id="kw_topic" visible="true"/>
<facet id="_facet_person"/>
<facet id="_facet_place" visible="true"/>
<facet id="_facet_company" visible="true"/>
<facet id="_facet_custom_project" visible="true"/>
<facet id="_facet_custom_postal_address" visible="true"/>
</facetdisplay>

4. Add the definition of each new filter as follows:


<facet id="_facet_custom_project">
<nlsbundle> </nlsbundle>
<label>Project</label>
<desc>Projects and products of MyCompany</desc>
<maxvalues>8</maxvalues>
<sort>FREQUENCY</sort>
<strategies>
<strategy type="groupby">
<required>
<attribute name="r_object_id"/>
</required>
</strategy>
<strategy type="dsearch">
<required>
<attribute name="dmftcustom/entities/custom_1001"/>
</required>
</strategy>
</strategies>
<entity>_custom_entity_project</entity>
<handler>com.emc.documentum.kw.data.facet.entities.FacetCustomHandler</handler>
<queryhandler>com.emc.documentum.kw.data.facet.entities.PropertyExpressionHandler</queryhandler>
</facet>

where
the value of <label> is the display label of the filter, here Project;
the value of <attribute name> must be dmftcustom/entities/custom_<index> where
<index> is the index of the taxonomy or custom entity that you set in the configuration for the
document sets, here 1001 (Refer to the <annotation> element in Step 4 of the procedure To
configure the classification for CenterStage spaces or Step 4 of the procedure To configure the
document sets for new entities);
the value of <entity> is an arbitrary value that is reused to map the index used to identify the
taxonomy with the filter, here _custom_entity_project.
Similarly, for the Postal Address entity, the definition would be:
<facet id="_facet_custom_postal_address">
<nlsbundle></nlsbundle>
<label>Postal Address</label>
<desc>Postal address entities</desc>
<sort>FREQUENCY</sort>
<maxvalues>8</maxvalues>
<strategies>
<strategy type="groupby">
<required>
<attribute name="r_object_id"/>
</required>
</strategy>
<strategy type="dsearch">
<required>
<attribute name="dmftcustom/entities/custom_1002"/>
</required>
</strategy>
</strategies>
<entity>_custom_entity_postal_address</entity>
<handler>com.emc.documentum.kw.data.facet.entities.FacetCustomHandler</handler>
<queryhandler>com.emc.documentum.kw.data.facet.entities.PropertyExpressionHandler</queryhandler>
</facet>

5. Set the mapping between the new filter and the index you previously set for the taxonomy
or custom entity as follows:
<entities>
...
<entity id="_custom_entity_project">
<code>1001</code>
<prefix>X6US70M1001:</prefix>
<alias>dmftcustom/entities/custom_1001</alias>
</entity>
...
</entities>

where
the value of <entity id> is the value that you set in the <entity> element of the filter definition;
the value of <code> is the index of the taxonomy or custom entity that you set in the
configuration for the document sets;
the value of <prefix> is X6US70M<index>: (the trailing colon is part of the value);
the value of <alias> is dmftcustom/entities/custom_<index>.
Similarly, for the Postal Address entity, the mapping would be:
<entities>
...
<entity id="_custom_entity_postal_address">
<code>1002</code>
<prefix>X6US70M1002:</prefix>
<alias>dmftcustom/entities/custom_1002</alias>
</entity>
...
</entities>

6. Save your changes and, if you were editing the file on your local file system, import the file to:
Cabinets/System/Applications/CenterStage Pro/config

After you add the new filter to CenterStage, it is populated by the custom entity values or by the
classification concepts. The following figure shows the result of the customization example used
in the previous procedure.
Figure 4. Example of a custom entity based on Luxid TM360 Postal Address
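
Because the <code>, <prefix>, and <alias> values in the mapping step all derive from the same index, the <entity> element can be generated rather than typed by hand. The following Python sketch illustrates this. It is not part of CIS or CenterStage: the function name entity_mapping is hypothetical, and the check on the index only reflects the rule that custom indexes are higher than or equal to 1000.

```python
import xml.etree.ElementTree as ET


def entity_mapping(entity_id, index):
    """Build the <entity> mapping element for a custom index."""
    if index < 1000:
        raise ValueError("custom indexes must be higher than or equal to 1000")
    entity = ET.Element("entity", {"id": entity_id})
    ET.SubElement(entity, "code").text = str(index)
    ET.SubElement(entity, "prefix").text = f"X6US70M{index}:"
    ET.SubElement(entity, "alias").text = f"dmftcustom/entities/custom_{index}"
    return ET.tostring(entity, encoding="unicode")


print(entity_mapping("_custom_entity_postal_address", 1002))
```

Generating the three values from one number avoids the most likely error in this step, which is a mismatch between the code, the prefix, and the alias.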

With the xPlore indexer, you need to modify the index to add the name of the attribute in which
entities are stored. More details about xPlore indexing configuration are provided in the Documentum
xPlore Administration Guide.

To configure the indexing of custom entities by xPlore 1.1:


1. In the IndexAgent web page: http://<hostmachine>:9200/IndexAgent, stop the IndexAgent.

2. Log in to the xPlore machine.

3. Stop the xPlore index agent. To do so, run stopIndexagent.cmd located at
<dsearch_home>/jboss4.3.0/server.
<dsearch_home> is the directory in which xPlore was installed.

4. Stop the xPlore search service. To do so, run stopPrimaryDsearch.cmd located at
<dsearch_home>/jboss4.3.0/server.

5. Edit the file indexserverconfig.xml located at <dsearch_home>/config.

6. In the section <category-definition>/<indexes>, add a new line for each new entity, such as:
<sub-path leading-wildcard="false" compress="true" boost-value="1.0"
description="Used by CenterStage to compute the custom facet 1001"
include-descendants="false" returning-contents="true" value-comparison="true"
full-text-search="true" enumerate-repeating-elements="false" type="string"
path="dmftcustom/entities/custom_1001"/>

You only have to update the path parameter, and optionally the description parameter. The path
value must be the value of the <alias> element set previously in the mapping configuration in
facet_definitions.xml.
7. Restart the xPlore search service. To do so, run startPrimaryDsearch.cmd located at
<dsearch_home>/jboss4.3.0/server.

8. Restart the xPlore index agent. To do so, run startIndexAgent.cmd located at
<dsearch_home>/jboss4.3.0/server.

9. In the IndexAgent web page: http://<hostmachine>:9200/IndexAgent, start the IndexAgent in
Normal mode.

10. Rebuild the index. To do so, perform the following steps:

a. In the dsearchadmin web page http://<hostmachine>:9300/dsearchadmin,
navigate to the collection "Default" (that is, the collection at: Home>Data
Management>your_repository>default).

b. Click Rebuild Indexes. A message indicating the progress of the rebuild is displayed.
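
Since each <alias> in facet_definitions.xml needs a matching path in indexserverconfig.xml, a quick comparison of the two files can catch an entity that was mapped but never indexed. The following Python sketch shows the idea on abbreviated, hand-written fragments. It is an assumption-laden illustration, not an xPlore utility: the function name unindexed_aliases is hypothetical, and the real files contain many more elements and attributes than shown here.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the <entities> section of facet_definitions.xml.
FACET_FRAGMENT = """<entities>
  <entity id="_custom_entity_project">
    <code>1001</code>
    <prefix>X6US70M1001:</prefix>
    <alias>dmftcustom/entities/custom_1001</alias>
  </entity>
  <entity id="_custom_entity_postal_address">
    <code>1002</code>
    <prefix>X6US70M1002:</prefix>
    <alias>dmftcustom/entities/custom_1002</alias>
  </entity>
</entities>"""

# Abbreviated stand-in for the <indexes> section of indexserverconfig.xml.
INDEX_FRAGMENT = """<indexes>
  <sub-path type="string" full-text-search="true"
            path="dmftcustom/entities/custom_1001"/>
</indexes>"""


def unindexed_aliases(facet_xml, index_xml):
    """Return the aliases that have no <sub-path> with the same path."""
    aliases = [a.text.strip() for a in ET.fromstring(facet_xml).iter("alias")]
    paths = {s.get("path") for s in ET.fromstring(index_xml).iter("sub-path")}
    return [alias for alias in aliases if alias not in paths]


print(unindexed_aliases(FACET_FRAGMENT, INDEX_FRAGMENT))
```

In this example the second entity is reported because no sub-path line was added for custom_1002.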

To localize the filter label:


1. On the local system where you deployed the CenterStage WAR, navigate to <CenterStagePro installation
directory>/WEB-INF/classes.

2. In this directory, create the following properties files:

MyCustomLocalization.properties         (default properties file)
MyCustomLocalization_en.properties      (English properties file)
MyCustomLocalization_<xx>.properties    (properties file for your language)

where <xx> is the two-letter language code for your language. The root name of the files,
MyCustomLocalization, is arbitrary in this example but must be the same for all files. The
properties files are text files with the .properties file extension.

3. In every properties file, add one line for each filter label and for each filter description to define a
mapping between the label or description and the translation for each language. The description
is the phrase displayed in the More... view in CenterStage.
MyCustomLocalization.properties:
FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1001_DISPLAY_DESC=Projects and products of MyCompany
FACET_CUSTOM_1002_DISPLAY_LABEL=Postal Address
FACET_CUSTOM_1002_DISPLAY_DESC=Postal address entities

These definitions are usually not visible in the graphical user interface.
MyCustomLocalization_en.properties:
FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1001_DISPLAY_DESC=Projects and products of MyCompany
FACET_CUSTOM_1002_DISPLAY_LABEL=Postal Address
FACET_CUSTOM_1002_DISPLAY_DESC=Entities based on postal addresses

These properties are automatically loaded when the locale is English.


MyCustomLocalization_fr.properties:
FACET_CUSTOM_1001_DISPLAY_LABEL=Projets
FACET_CUSTOM_1001_DISPLAY_DESC=Projets et produits de MyCompany
FACET_CUSTOM_1002_DISPLAY_LABEL=Adresse postale
FACET_CUSTOM_1002_DISPLAY_DESC=Entités contenant des adresses postales

These properties are automatically loaded when the locale is French.


In the properties file, the name of the label, here FACET_CUSTOM_100x_DISPLAY_LABEL, is
arbitrary but it must be unique and easily identifiable.
4. Remove the hard-coded value of the label in the facet_definitions.xml file located at:
Cabinets/System/Applications/CenterStage Pro/config

a. Set the value of <nlsbundle> to the filename of the default properties file, without the .properties
extension.

b. In each <label> and <desc> element, insert an <nlsid></nlsid> element.

c. Set the value of the <nlsid> elements to the label name and description name that you set in
the properties file.

<facet id="_facet_custom_project">
<nlsbundle>MyCustomLocalization</nlsbundle>
<label><nlsid>FACET_CUSTOM_1001_DISPLAY_LABEL</nlsid></label>
<desc><nlsid>FACET_CUSTOM_1001_DISPLAY_DESC</nlsid></desc>
<strategies>
...
<facet id="_facet_custom_postal_address">
<nlsbundle>MyCustomLocalization</nlsbundle>
<label><nlsid>FACET_CUSTOM_1002_DISPLAY_LABEL</nlsid></label>
<desc><nlsid>FACET_CUSTOM_1002_DISPLAY_DESC</nlsid></desc>
<strategies>
...

5. Restart the web application server for CenterStage.

The following figure shows the result of the localization example used in the previous procedure.

Figure 5. Example of localization of a custom entity
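
When several locale files are maintained, a label that is defined in the default properties file but missing from a localized file is easy to overlook. The following Python sketch shows one way to detect such gaps. It is only an illustration: the function names are hypothetical, the sample file contents are shortened, and the parser handles simple key=value lines only, whereas the real .properties syntax is richer.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def missing_keys(default_text, localized_text):
    """Return the keys of the default file that the localized file lacks."""
    default_keys = parse_properties(default_text)
    localized_keys = parse_properties(localized_text)
    return sorted(key for key in default_keys if key not in localized_keys)


# Shortened, hypothetical contents of two properties files.
DEFAULT = """FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1001_DISPLAY_DESC=Projects and products of MyCompany
"""
FRENCH = """FACET_CUSTOM_1001_DISPLAY_LABEL=Projets
"""

print(missing_keys(DEFAULT, FRENCH))
```

Running such a check before restarting the web application server helps ensure that every locale provides both a label and a description for each custom filter.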

Clear previous entities


During the test deployment, you may want to delete previously extracted entities to start your
customization test from a blank state. A script is available to clear the entities stored in the repository
for a given space or for all spaces. Note that it clears all entities. It also clears the dm_docstatus table
that stores the status for all documents belonging to a document set. The dm_docstatus table keeps
track of which documents have already been processed, so that only new or modified documents are
(re)processed. Clearing the dm_docstatus table, as described in Clear the document status, page 239,
triggers the reprocessing of documents that CIS has already processed and that have not been modified.
Clearing the status of all documents may therefore cause all CenterStage content to be reprocessed.

To clear extracted entities with the clear_entities script:


1. On the CIS host machine, locate the clear_entities script (clear_entities.bat on Windows hosts,
clear_entities on Linux hosts) in <CIS installation directory>/bin.

2. Run the script with one of the following parameters:

To remove all entities for all spaces:
clear_entities -All

To remove all entities for one space (that is, for one document set):
clear_entities -Docset:<docset_id>

where <docset_id> is the space ID.

To find the space ID, in Documentum Administrator, locate the file space_docset_list.txt in
Cabinets/System/Applications/CI/DocsetConfiguration. This file lists all the document sets
created for CenterStage spaces: the first column indicates the space name, the second column the
space ID, and the third column the configuration file for this document set.

Clear the document status


CIS automatically processes documents that have been modified. However, to force CIS to process
the documents again after a configuration change, you can manually clear the dm_docstatus table
that stores the status (processed or not processed) for all documents belonging to a document set.
In Documentum Administrator, open the DQL editor and run one of the following DQL queries:
To delete the document set status for all document sets (all CenterStage spaces):
DELETE FROM dm_docstatus

To delete the document set status for one document set (one CenterStage space):
DELETE FROM dm_docstatus WHERE st_docset_id=<docset_id>

where <docset_id> is the space ID.


Note: For CIS classification processing, the preferred method to clear the document status is to clear
the assignments in Documentum Administrator, as described in Clearing assignments, page 74.

Chapter 16
Annotation API

Annotations are a dedicated way to store entities, classification concepts, and extracted metadata
in the repository. The Annotation API allows you to access these annotations and use them
according to your needs.
Storing the results of analytics processing as annotations and accessing them through the API
has several benefits over storing the results as attribute values, because attribute storage has
the following drawbacks:
When CIS modifies an attribute value, the last_modifier and last_modified_date properties of the
document are updated.
Attribute values that are also updated by CIS cannot be edited manually: the edit triggers CIS
reprocessing of the document, which then overwrites the modification.
You can find the Javadoc documentation at <CIS installation directory>/doc/cis_client_api. Refer to the
package com.documentum.cis.annotation. You do not need the packages com.documentum.ci and
com.documentum.services.classification to access annotations.


Chapter 17
Integrate CIS Classification

This chapter describes the most common integration scenarios for classification.
The CIS server analyzes documents from a Documentum repository and extracts relevant information
about them. CIS can then use the results of the classification to do the following:
Automatically set values of document attributes (Auto Tagging).
Link the documents into appropriate repository folders (Auto Categorization).
Suggest attribute values to WDK-based application users (Web Publisher integration).
Any combination of these tasks.
Content Intelligence Services is just one piece of your broader Documentum content management
solution. There are four common integration scenarios for classification:
Organize your library, page 244
Workflow and lifecycle processing, page 244
Web Publisher integration, page 244
Retention Policy Services integration, page 245
Once CIS processing is complete, you can use the results of the analysis to:
Improve searching: By extracting information from the document content and adding it to the
document attributes, you transform unstructured data into searchable structured data. Because
CIS adds attributes programmatically, you can be sure that they make consistent use of a standard
vocabulary.
Organize documents for easy navigation: Auto Categorization enables you to automatically link
documents into a repository folder structure that makes sense to users.
Support personalization: Personalization server platforms use document attributes to tailor
the content displayed to different users. Using CIS enriches the attributes of a document with
information based on the content of the document, making the subject matter available as a basis
for personalization.


Organize your library


A key challenge when implementing a content management system is organizing your repository
so that users can easily navigate through the document library and find the documents they need.
Content Intelligence Services can help address this challenge by:
Adding information about the content of a document to its attribute set.
Providing consistency in attribute assignment.
Organizing documents into a repository folder structure that makes sense to users.
The CIS server periodically checks the repository for the documents you want it to process. The
server retrieves any new or revised documents, analyzes them, and uses the results to set attribute
values programmatically or link the documents into appropriate folders. This option is referred
to as batch processing.
Adding content-related attributes enables users to search for documents based on their subject
matter. It also helps support proper categorization of the document by personalization servers.
Personalization server platforms use document attributes to tailor the content displayed to different
users. Using CIS makes the subject matter available as a basis for personalization. Because CIS
sets attribute values programmatically, you avoid the issues that result when users vary in how
they enter values.

Workflow and lifecycle processing


Content Intelligence Services can participate in a Documentum workflow. The CIS server can analyze
documents that have reached a particular stage in their lifecycle, then advance them to the next stage.
To do so, implement the xCP CIS Activity Template xCelerator. The xCelerator integrates the
classification function of CIS with xCP by adding an activity template: Classify and
Categorize. This activity template adds an automatic classification step to Documentum
Process Builder, so that content intelligence can drive a workflow step.

Web Publisher integration


Web authors using Web Publisher have to enter document attributes when checking the documents
into the repository. When you integrate Content Intelligence Services with Web Publisher, the CIS
server can suggest attribute values. The web author can choose to use the CIS-suggested attributes
rather than manually entering values. This option not only saves time for authors, but also ensures
that all authors use a consistent vocabulary for attribute values.
Web Publisher integration takes advantage of the Auto Tagging features of Content Intelligence
Services. Auto Tagging links resource properties to repository attributes. When a Web Publisher user
clicks the See CIS Values button, the CIS server analyzes the document and provides suggested
values for all attributes that have been mapped to resource properties. This option is referred to as
online processing or on-demand processing.


When a new folder or cabinet is created under the templates section in Web Publisher, the default
value of the CIS node is EMPTY. Any value specified at the folder level overrides the name of the
CIS Node that is defined globally at the "Web Publisher Admin Settings" level.

Retention Policy Services integration


Integrating CIS with Retention Policy Services (RPS) allows you to classify documents according
to a policy.
The policy is defined in the categories. The taxonomy tree maps to the repository folder tree. The
classification implements the policy and CIS links the documents into appropriate repository folders.
For example, you can set a category rule for documents that contain the keyword confidential.


Appendix A
Content Intelligence Services Processing Diagram

This section provides a diagram of CIS processing and describes the two main flows:
Classification, based on repository document sets and taxonomies.
Entity extraction, only available in CenterStage deployments and based on file document sets.
Note that error conditions do not appear in the diagram.
The following figures describe the diagram legend and provide notes related to the diagram.
Figure 6. CIS processing diagram legend

Figure 7. CIS processing diagram notes


Figure 8. CIS processing diagram


Appendix B
Properties Extracted

This appendix identifies the properties extracted from documents. The list of properties differs
depending on the file format of the document. If no value can be extracted for a given property,
that property is not created for the document.

Table 51. Extracted properties for MS Office and other documents

abstract         disposition    lastsavedby      receivedfrom
address          division       manager          revisiondate
attachments      doccomment     office           section
authorization    doctype        owner            source
category         editminutes    primaryauthor    subject
company*         editor         project          title
countpages       group          publisher        versionnotes
creationdate     keyword        purpose          versionnumber
department       language       reference

* These properties can only be used for metadata extraction and not for classification.


Appendix C
Document Set Configuration Files

This appendix describes the following configuration files:

default.xml: defines the default configuration for all document sets.
docset-sample.xml: a sample that you copy and modify to define a specific configuration for one
document set. This specific configuration replaces the default configuration.
These files are available in Documentum Administrator in the folder:
Cabinets/System/Applications/CI/DocsetConfiguration

default.xml
Modify this configuration file to define the default configuration for all document sets.
Example C-1. default.xml configuration file
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) 1998-2010 EMC Corporation.
All Rights Reserved. -->
<docset-defaults>
  <!-- ===================================================== -->
  <!-- Configuration for classic repository docsets.         -->
  <!-- By default, only the classification is activated.     -->
  <!-- ===================================================== -->
  <docset-default type="repo">
    <analysis-plan>
      <classification-step/>
    </analysis-plan>
    <!-- By default, we enable only classic category assignments -->
    <storage>
      <category-assignments>
        <all-repository-taxonomies />
      </category-assignments>
    </storage>
  </docset-default>
  <!-- ===================================================== -->
  <!-- Configuration for file docsets (CenterStage)          -->
  <!-- By default, only the entity detection is enabled      -->
  <!-- ===================================================== -->
  <docset-default type="file">
    <analysis-plan>
      <entity-detection-step/>
    </analysis-plan>
    <entity-detection>
      <analysis name="person">
        <!-- This builtin entity has a post-processing filter
        to remove some wrong values. -->
        <builtin-entity>CISPerson</builtin-entity>
      </analysis>
      <analysis name="company">
        <!-- This builtin entity aggregates the Company, Organization
        and Media default entities, and does post-processing to remove
        some wrong values. -->
        <builtin-entity>CISCompany</builtin-entity>
      </analysis>
      <analysis name="location">
        <!-- This builtin entity is all Geopolitical values in
        default Location entity. -->
        <builtin-entity>CISLocation</builtin-entity>
      </analysis>
    </entity-detection>
    <storage>
      <annotation code="Person">
        <analysis>person</analysis>
      </annotation>
      <annotation code="Company">
        <analysis>company</analysis>
      </annotation>
      <annotation code="Location">
        <analysis>location</analysis>
      </annotation>
      <!-- Classification is not enabled by default, but we enable assigner
      to reduce configuration rework if using classification with file docset. -->
      <category-assignments>
        <all-repository-taxonomies />
      </category-assignments>
    </storage>
  </docset-default>
</docset-defaults>
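
To check at a glance what the defaults enable for each document set type, the file can be read with any XML parser. The following Python sketch is an illustration only, not a CIS tool: the function name summarize_defaults is hypothetical, and the embedded XML is a trimmed copy of the listing above with the comments removed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of default.xml (comments removed) used as test input.
DEFAULTS_XML = """<docset-defaults>
  <docset-default type="repo">
    <analysis-plan><classification-step/></analysis-plan>
    <storage>
      <category-assignments><all-repository-taxonomies/></category-assignments>
    </storage>
  </docset-default>
  <docset-default type="file">
    <analysis-plan><entity-detection-step/></analysis-plan>
    <entity-detection>
      <analysis name="person"><builtin-entity>CISPerson</builtin-entity></analysis>
      <analysis name="company"><builtin-entity>CISCompany</builtin-entity></analysis>
      <analysis name="location"><builtin-entity>CISLocation</builtin-entity></analysis>
    </entity-detection>
    <storage>
      <annotation code="Person"><analysis>person</analysis></annotation>
      <annotation code="Company"><analysis>company</analysis></annotation>
      <annotation code="Location"><analysis>location</analysis></annotation>
      <category-assignments><all-repository-taxonomies/></category-assignments>
    </storage>
  </docset-default>
</docset-defaults>"""


def summarize_defaults(xml_text):
    """Map each docset type to its analysis steps and annotation storage."""
    root = ET.fromstring(xml_text)
    summary = {}
    for docset in root.findall("docset-default"):
        plan = docset.find("analysis-plan")
        # Each child element of <analysis-plan> is one processing step.
        steps = [step.tag for step in plan] if plan is not None else []
        # Annotation code -> name of the analysis stored under that code.
        annotations = {
            ann.get("code"): ann.findtext("analysis")
            for ann in docset.findall("storage/annotation")
        }
        summary[docset.get("type")] = {"steps": steps, "annotations": annotations}
    return summary


if __name__ == "__main__":
    for docset_type, info in summarize_defaults(DEFAULTS_XML).items():
        print(docset_type, info)
```

For the defaults shown above, the summary confirms that repository document sets run only the classification step, while file document sets run only entity detection and store three annotations.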

docset-sample.xml
Create a copy of this file and modify it to configure a document set.
Example C-2. docset-sample.xml configuration file
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) 1998-2010 EMC Corporation. All Rights Reserved. -->
<!-- This is a sample configuration file for a specific docset.
To configure a docset, copy the content of this file as a base into a new file.
The file names that must be used to configure CenterStage docsets are available
in the file 'space_docset_list.txt' -->
<docset>
  <!-- This section defines the list of processing that must be executed
  on all documents of the docset. -->
  <analysis-plan>
    <classification-step/>
    <entity-detection-step/>
  </analysis-plan>
  <!-- This section defines some taxonomies that must be used in analysis 'foo'. -->
  <classification>
    <analysis name="foo">
      <repository-taxonomy>my-da-taxo</repository-taxonomy>
      <tef-taxonomy>my-direct-taxo</tef-taxonomy>
    </analysis>
  </classification>
  <!-- This section customizes the entity detection. It is possible
  to add entities that will be stored in addition to default entities. -->
  <entity-detection>
    <analysis name="bar">
      <entity>Function</entity>
    </analysis>
  </entity-detection>
  <!-- This section defines how the analysis results are persisted. -->
  <storage>
    <!-- Stores the analysis foo into the annotation with the code 1001. -->
    <annotation code="1001">
      <analysis>foo</analysis>
    </annotation>
    <!-- Stores the analysis bar into the documentum attribute keywords. -->
    <attribute name="keywords">
      <analysis>bar</analysis>
    </attribute>
  </storage>
</docset>
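
In the sample, each <analysis> entry under <storage> refers by name to an analysis declared under <classification> or <entity-detection>. Before uploading a modified copy, a quick consistency check can catch a misspelled name. The following Python sketch rests on the assumption that every stored analysis must be declared in the same file; this guide does not document how CIS itself validates the file, and the function name undefined_storage_analyses is hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of docset-sample.xml with the comments removed.
SAMPLE_XML = """<docset>
  <analysis-plan><classification-step/><entity-detection-step/></analysis-plan>
  <classification>
    <analysis name="foo">
      <repository-taxonomy>my-da-taxo</repository-taxonomy>
      <tef-taxonomy>my-direct-taxo</tef-taxonomy>
    </analysis>
  </classification>
  <entity-detection>
    <analysis name="bar"><entity>Function</entity></analysis>
  </entity-detection>
  <storage>
    <annotation code="1001"><analysis>foo</analysis></annotation>
    <attribute name="keywords"><analysis>bar</analysis></attribute>
  </storage>
</docset>"""


def undefined_storage_analyses(xml_text):
    """Return, sorted, the analysis names used in <storage> but never declared."""
    root = ET.fromstring(xml_text)
    # Names declared under <classification> and <entity-detection>.
    defined = {
        analysis.get("name")
        for section in ("classification", "entity-detection")
        for analysis in root.findall(section + "/analysis")
    }
    # Names referenced by any storage target (annotation, attribute, ...).
    referenced = {
        ref.text.strip()
        for ref in root.findall("storage/*/analysis")
        if ref.text
    }
    return sorted(referenced - defined)


if __name__ == "__main__":
    print(undefined_storage_analyses(SAMPLE_XML))
```

An empty list means that every stored analysis is declared. Any name in the result points to a storage entry that refers to an analysis missing from the file.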


Index

A
Action files
Taxonomy Exchange Format, 169
architecture, 18
Assign as Attributes option, 117
authentication, failed, 36
auto categorization, 113

C
candidate threshold, 109
cartridge
additional entities, 231
customized, 100
CenterStage, 95
classification, 229
ci.jar, 36
CIS, 43
administration, 25
architecture, 18
bringing taxonomies online, 67
category classes, 51
category rules, 62
clearing assignments, 74
components, 17
compound terms, 76
configuration settings, 49
creating document sets, 70
creating taxonomies, 50, 53
defining categories, 59
deleting taxonomies, 69
enabling repository, 47
overview, 17
property rules, 63
reviewing documents, 73
submitting documents, 72
synchronizing taxonomies, 68
taking taxonomies offline, 67
testing, 69
user roles, 20

CIS server
configuring, 27
log files, 32
monitoring, 32
starting, 25
stopping, 25
cis.log, 32
cis.properties, 120
classification information, 116
clear
document status, 239
entities, 238
compatibility error, 36
components, 17
confidence values, 44, 110
configuration
document set, 87
configuration steps, 119
connection, failed, 37
Content Intelligence Services
introduction, 43
setting up, 47
custom entities, 231
custom filters, 229, 233
for classification concepts, 229
for extracted entities, 231
localization, 236

D
docset, 108
document confidence scores, 44
document processing, 107
document properties, 21
document set, 108
configuration, 87
documents
excluded, 33
unprocessed, 33
Documentum attributes, 21


E
entity
add, 100
blacklist, 103
entity extraction, 95
additional servers, 99
customized cartridge, 100
disable, 98
multi-node environment, 99
process, 96
server, 95
services, 97
errors
authentication, 36
compatibility, 36
connection, 37
installation, 36
log files, 32
evidence
propagating, 46
evidence terms, 110
excluded documents, 33

I
import
taxonomy, 122 to 123
integration, 243

L
library organization, 244
lifecycle processing, 244
Link to Folders option, 116
log files, 32
Log4j setup, 32

M
multi-node environment, 99
multilingual capability, 112

O
on demand, 109

P
patterns, 110, 114
analysis, 114
definition, 115
evidence terms, 114
limitations, 115
pending documents, 109
phrase order, 46
proximity matching, 28, 110

R
regular expressions, 114
repository
enabling for CIS, 47
repository attributes, 21
reprocessing, 239
for classification, 109
for entity extraction, 238
result of the classification, 116

S
schedule, 109
scores, 110
stemming, 45, 112
synchronization, 107

T
taxonomies
importing, 121
taxonomy exchange format (TEF), 121
TEF
Action files, 169
Tef2repository script, 122
TefUtil tool, 123
TM360, 231
additional entities, 232

U
unprocessed documents, 33
user roles
category owner, 117
taxonomy manager, 117

W
Web Publisher, 244
workflow processing, 244
