Administration Guide
EMC Corporation
Corporate Headquarters:
Hopkinton, MA 01748-9103
1-508-435-1000
www.EMC.com
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
The information in this publication is provided as is. EMC Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness
for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks
used herein are the property of their respective owners.
Copyright 2011 EMC Corporation. All rights reserved.
Table of Contents
Preface
Chapter 1 Introduction
    Taxonomy-based classification
    Entity extraction
    Metadata extraction
Chapter 2 Overview
    Components
    Architecture
    Roles
    Limitations
        CIS processes textual content
        One CIS server can only work with one repository
        CIS processing updates the last modified date
    Text extraction
        Document properties
        Documentum attributes
Part 1 Administration
Chapter 3 Administer the CIS server
Chapter 4 Troubleshooting
    Modify the level of details in the detailed activity log file
    Most common errors
    Frequently asked questions
        I have changed a taxonomy or a document set, and reprocessing does not take my changes into account
Part 2 CIS in Documentum Administrator
Chapter 5 Content Intelligence Services
Part 3
Chapter 6 Configuring the Type of Content Processed
Chapter 7
Part 4 Entity Extraction
Chapter 8
Chapter 9
Part 5 Classification
Chapter 10
Chapter 11
Chapter 12
Metadata Extraction
Chapter 13
Chapter 14
Part 7
Chapter 15
Chapter 16
Annotation API
Chapter 17
Appendix A
Appendix B Properties Extracted
Appendix C
Preface
The Content Intelligence Services Administration Guide contains procedures and information for setting
up and managing the server-side components of Content Intelligence Services (CIS). This manual
assumes that you have already installed Content Intelligence Services by following the instructions in
the Content Intelligence Services Installation Guide.
Intended audience
This manual is intended primarily for administrators who are managing Content Intelligence
Services applications.
The CIS server categorizes documents into taxonomies that you build and maintain using
Documentum Administrator. For information about using Documentum Administrator, see the
Documentum Administrator User Guide. The CIS server is also used in the context of a CenterStage
deployment to extract entities displayed as filters in CenterStage clients.
Typographic conventions
The following table describes the typographic conventions used in this guide.
Typographic conventions
Typeface
Text type
Body Italic
Body Bold
In procedures:
User actions (what the user clicks, presses, selects, or types) in
procedures
Interface elements (button names, dialog boxes)
Key names
In running text:
Command names, daemons, options, programs, processes,
notifications, system calls, man pages, services, applications,
utilities, kernels
Body Italic
Courier
Courier Bold
Courier Italic
In procedures:
Variables in command strings
User input variables
<Italic in angle
brackets>
Revision history
The following changes have been made to this document.
Revision history
Revision Date    Description
April 2011       Initial publication
Chapter 1
Introduction
EMC Documentum Content Intelligence Services (CIS) is the automatic classification and extraction
engine for EMC Documentum. Automatic classification is based on taxonomies and categories and
allows you to organize content in many different ways. Entity extraction collects entities from content
using Natural Language Processing. Entities are exposed in CenterStage deployments and, when
stored as annotations, they can also be accessed using the Annotation API. Content Intelligence
Services also allows you to extract metadata from the documents.
Taxonomy-based classification
CIS organizes documents into taxonomies. A taxonomy is a hierarchical set of categories used to
organize content in the repository. This organization, often based on the subject matter of the content,
provides one place for users to look for all content related to common topics of interest.
For example, suppose that the folders in repository cabinets organize objects based on which
department created the content or on the document type, such as Press Releases in one folder and
Product Design Specifications in another folder. A user looking for all available information about a
particular product, including documents from multiple departments and both press releases and
design specifications, needs to look in all folders that could possibly include objects related to that
topic. With product-based categories, the user can look in a single category to find all documents
related to the product, while the documents themselves remain filed in the original folders.
CIS classification is highly configurable. The following features let you tailor classification to the
way you want to organize content.
Keyword-based classification: CIS can assign documents to relevant categories based on a
semantic analysis of their content. When you define your taxonomy, you identify keywords, phrases,
and patterns associated with each category. The CIS server uses these words and phrases as evidence
terms: when the server processes a document, it assigns the document to categories based
on the evidence terms it finds in the content.
Property-based classification: You can also configure CIS to classify documents based on their
property values (document metadata). In this case, documents are assigned according to the values
of the repository attributes. You can make a property match a requirement that documents must
meet to be assigned to a category.
Configurable confidence threshold: As the CIS server processes a document, it determines the
confidence score of a document for each category in the taxonomy. The confidence score reflects how
much evidence the CIS server found to indicate that the document belongs to the category. If the
document score for a category meets or exceeds a predefined threshold, the CIS server assigns the
document to that category. If the confidence score falls short of the threshold, the CIS server can
provisionally assign the document to the category as a Pending candidate. The user who owns the
category must review pending document candidates before they are fully categorized.
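The threshold logic described above can be sketched as follows. This is a minimal illustration, not the CIS implementation: the function name, the 0 to 100 score scale, and the `candidate_floor` parameter (the lowest score that still produces a Pending candidate) are assumptions made for the example.

```python
def categorize(score, threshold, candidate_floor):
    """Return the assignment status for one document/category pair.

    All three arguments are hypothetical confidence values on the same
    numeric scale; candidate_floor must not exceed threshold.
    """
    if candidate_floor > threshold:
        raise ValueError("candidate_floor must not exceed threshold")
    if score >= threshold:
        return "assigned"      # score meets or exceeds the threshold
    if score >= candidate_floor:
        return "pending"       # provisional; the category owner must review it
    return "unassigned"        # not enough evidence for this category


# Made-up scores of one document against three categories.
scores = {"Products": 85, "Press": 60, "Legal": 10}
statuses = {name: categorize(value, threshold=80, candidate_floor=50)
            for name, value in scores.items()}
print(statuses)
```

Keeping the decision in one small function makes it easy to see how raising the threshold moves documents from assigned to pending, which increases the review workload of the category owner.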
Actions based on classification results: When a document is assigned to a category, you can
decide to link this document to the folder associated with the category. You can also select to add
the category names to an attribute of the document. You can enable or disable these features when
you configure CIS.
Manual categorization: CIS also supports manual categorization, where users (rather than the CIS
server) manually assign documents to categories in DA. As with the automatic CIS server processing,
category assignments can be used to link documents into a searchable hierarchy of category folders,
add the category names to a document attribute, or both.
Classification concepts stored as annotations: Classification concepts are category matches found
by CIS and based on taxonomies. Unlike the results of standard CIS classification processing, they
are stored as annotation objects rather than as category assignments. They can be exposed in
CenterStage as search filters or accessed using the Annotation API.
Entity extraction
CIS analyzes the content, metadata, and comments of documents to extract information relevant for
the end users. The extracted items are called entities and are presented as filters when navigating in
CenterStage or when running a search. CIS extracts the following entities by default:
Place: this filter includes geographical places and groups them by countries and cities. For the
USA, states are provided for information only and not as a group.
People: this filter corresponds to names of individuals.
Company: this filter contains names of organizations such as companies, institutions, or
associations.
The entity extraction is enabled by default for the repository used by CenterStage. It runs
automatically every half hour.
You can configure entity extraction to extract other entities or to store the entities as annotation objects
and to access them using the Annotation API.
Metadata extraction
With CIS, you can extract metadata from the content, properties, or repository attributes of
documents. Metadata is often defined as data about data. In this guide, metadata means the pieces of
information that describe the content of the documents. Metadata extraction relies
only on rules that you define. Like taxonomy-based classification, and unlike entity extraction, it does
not involve any content analysis.
Valuable information is sometimes difficult to capture. Metadata extraction allows you to find
metadata in the content, properties, or repository attributes of your documents and label these
metadata.
Extracted metadata are stored as annotation objects and you can access them using the Annotation
API.
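To make the idea of rule-based extraction concrete, here is a minimal sketch. It assumes that a rule can be expressed as a label plus a regular expression; the rule format, the labels, and the sample text are all hypothetical and do not reflect the actual CIS rule syntax.

```python
import re

# Hypothetical rules: each label maps to a regular expression whose first
# capturing group is the value to extract.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.\s*(\w+)"),
    "contract_date": re.compile(r"Dated:\s*(\d{4}-\d{2}-\d{2})"),
}


def extract_metadata(text, rules=RULES):
    """Return a list of (label, value) pairs found in the text."""
    found = []
    for label, pattern in rules.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(1)))
    return found


sample = "Invoice No. A1234\nDated: 2011-04-15\n"
print(extract_metadata(sample))
```

Because the rules only match fixed patterns, no linguistic analysis is involved: a value is found only if it appears in exactly the form that a rule describes.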
Chapter 2
Overview
Components
Content Intelligence Services includes these key components:
The Content Intelligence Services client (CIS client), such as Documentum Administrator or any
custom application using the Content Intelligence Application Programming Interface (CI API),
can be used for creating and managing the taxonomy used for categorizing documents. You
can also use Documentum Administrator to configure CIS. The CI API handles communication
between the CIS client, the CIS server, and the Documentum repository.
The Content Intelligence Services server (CIS server) performs the automatic categorization of
documents based on taxonomy and category definitions, and triggers the entity extraction.
The entity extraction server performs entity extraction using cartridges.
A repository is required to store CIS data, such as taxonomy definitions, document set (also called
docset) definitions, configuration files, and extracted entities.
The Annotation API allows access to the information stored as annotations, which can be the result
of entity extraction processing or metadata extraction processing.
When you create a taxonomy using Documentum Administrator or by importing a prebuilt taxonomy,
the objects comprising the taxonomy are saved into the repository containing the documents that CIS
will process. When you are done creating or modifying the taxonomy in Documentum Administrator,
you synchronize the new definitions to make them available to the CIS server.
Architecture
The CIS server communicates with the Content Server using the Documentum Foundation Classes
(DFC). It is recommended to deploy CIS on a separate machine from the Content Server machine.
The repository of the Content Server stores the documents, whether already categorized or still to be
categorized, as well as the taxonomy definitions and the document set definitions. One CIS server can
only point to one repository. A repository must be enabled for CIS before starting any configuration.
Once enabled, the taxonomies and the document sets can be created using Documentum
Administrator. It is also possible to import existing taxonomies defined in a Taxonomy Exchange
Format (TEF) file.
When CIS is used for classification, two modes are available: production mode and test mode.
You can use either one CIS server for both modes or two CIS servers, one for each mode. One
repository can only use one CIS server for each mode. Using two modes allows the CIS user to
modify and test the taxonomies and document sets while the production server is still running.
There is no test mode for the entity extraction and the metadata extraction.
Figure 1. CIS architecture overview: Classification
Roles
Implementing and working with Content Intelligence Services requires the action of several distinct
roles. One person can hold several roles. The following list briefly describes each role:
The System administrator installs CIS and enables CIS in the repository. This person also monitors
the CIS server: ensures that the server is up and running, defines the document sets, checks the
logs for errors, and tracks the unprocessed or excluded documents.
When CIS is used for classification:
The Taxonomy Manager creates, tests, and maintains taxonomies. This person also sets category
owners and document sets, and verifies excluded or unprocessed documents.
When necessary, the Category Owner verifies the correct categorization and reviews pending
documents.
The General user can manually submit documents for CIS processing, and consumes the results of
the categorization by browsing the categories or using the attributes created by the categorization.
When CIS is used for entity extraction:
The Terminology manager, such as a librarian, adds named entities to the cartridge.
In CenterStage clients, the General user uses the Place, People, and Companies filters to navigate
or run a search.
All CIS-related tasks are performed in Documentum Administrator, except those of the General user,
who performs tasks in a CIS client such as Webtop or CenterStage.
Limitations
This section describes some principles about CIS and known limitations.
Text extraction
This section describes the text extraction step that precedes any processing. Before performing any
analytics processing on a document, the content of the document and its properties are extracted
and processed by Oracle Outside In.
Document properties
When a document is processed, the CIS server automatically recognizes its format. The CIS server
can then extract the properties that are expected to be available for the document. Some property
values can only be extracted if they have been filled in by the document author. For example, the
Title of a PDF document is entered by the author. Appendix B, Properties Extracted, provides
information about the properties that are automatically extracted for each document format.
The property values are automatically added to the content extracted from the documents. They can
then be used to match any category keyword, or to extract entities or metadata.
Documentum attributes
The Documentum attributes attached to documents, also called repository attributes, can be used
in several ways:
They can be used as filters when defining a document set in DA by adding a constraint like
attribute/operator/value.
They can be used in a property rule when defining a category to assign documents, by adding a
constraint like attribute/operator/value.
They can be used in addition to, or instead of, the content of the documents, as described in
Chapter 6, Configuring the Type of Content Processed.
They can be used to extract metadata. In this case, the attributes are accessed directly; they do
not need to be part of the content processed.
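The attribute/operator/value constraints mentioned in the list above can be pictured with the following sketch. It is only an illustration of the idea: the operator names and the `matches` function are assumptions, not the constraint syntax used by Documentum Administrator, and the attribute values are made up.

```python
import operator

# Hypothetical operator table; the operators offered by the product may differ.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "contains": lambda actual, expected: expected in actual,
}


def matches(attributes, constraint):
    """Check one (attribute, operator, value) constraint against a document."""
    name, op, value = constraint
    if name not in attributes:
        return False  # a missing attribute never satisfies a constraint
    return OPERATORS[op](attributes[name], value)


# Made-up repository attribute values for one document.
doc = {"object_name": "Q2 press release", "a_content_type": "pdf", "r_content_size": 20480}

print(matches(doc, ("a_content_type", "=", "pdf")))
print(matches(doc, ("object_name", "contains", "press")))
print(matches(doc, ("r_content_size", ">", 1000000)))
```

The same triple form serves both uses described above: as a filter that decides whether a document belongs to a document set, and as a property rule that decides whether a document matches a category.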
Part 1
Administration
Chapter 3
Administer the CIS server
This chapter provides instructions for basic CIS administration activities. It includes the following
activities:
Start/Stop the CIS server, page 25
Configure the CIS server, page 27
Monitor CIS server processing, page 32
The default CIS installation directory is:
C:\Program Files\Documentum\CIS on Windows hosts
$DOCUMENTUM_SHARED/cis on Linux hosts
This directory is referenced in CIS documentation as the variable path <CIS installation directory>.
Finally, on both Windows and Linux hosts, you can access the CIS server using its JMX Agent. You
can access the JMX Agent through its URL, using either Documentum Administrator or JConsole.
Option
Expected output
Description
status
start
stop
To add a new Resource agent, provide a JMX URL. The CIS server JMX URL is:
service:jmx:rmi:///jndi/rmi://<cishost>:<port>/cisserveragent
where cishost is the fully qualified domain name (hostname plus domain name) of the machine
where CIS resides, and port is the RMI port set in cis.properties by the cis.jmx.agent.port parameter.
By default, the JMX port is 8061.
You must be a member of the ci_taxonomy_manager_role to create a Resource Agent for the CIS
server.
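As a small illustration of the URL format above, the following sketch assembles the JMX URL from a host name and a port. The function name and the example host `cis.example.com` are hypothetical; the URL template and the default port 8061 are taken from the text.

```python
def cis_jmx_url(cishost, port=8061):
    """Build the CIS server JMX URL for a fully qualified host name."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"service:jmx:rmi:///jndi/rmi://{cishost}:{port}/cisserveragent"


# Hypothetical host name, default port.
print(cis_jmx_url("cis.example.com"))
```

The resulting string is what you would paste as the JMX URL when adding a Resource Agent or when connecting with JConsole.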
As an application
A first level of configuration can be set by modifying the properties file of the CIS application.
Properties files can be found at <CIS installation directory>/config. They consist of:
cis.properties. Modify this file as described in To modify cis.properties, page 27.
dfc.properties. The Documentum Foundation Classes documentation provides more information
about DFC parameters.
patterns.properties. Use this file to define patterns as described in Patterns as evidence terms,
page 114.
log4j.xml. Use this file to configure Log4j. CIS server log files, page 32, provides more details
about log files.
Note that, to apply the changes made in the properties or configuration file, you need to restart
the CIS server.
To modify cis.properties
1.
2.
3.
4. Update the parameter settings as needed. Table 2, page 28, provides details on the available
configuration parameters.
5.
The following table provides details on parameters that can be set in the cis.properties file. The
filepaths are relative to <CIS installation directory>.
Parameter                                    Description
cis.server.centerstage.enabled
cis.classification.limit.max_content_size
cis.server.scheduling.threads
cis.server.scheduling.queue_delay
cis.server.scheduling.queue_interval
cis.server.patterns.file
cis.entity.luxid.annotation_server.cpu
cis.entity.luxid.limit.max_text_size
cis.entity.luxid.limit.detection_timeout
cis.entity.luxid.resource.dir
cis.entity.luxid.tmp.dir
cis.jmx.agent.port                           The RMI port number used for the JMX Agent URL. By
                                             default, the JMX port is 8061:
                                             cis.jmx.agent.port=8061
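When checking a configuration, it can help to read the effective values out of cis.properties. The sketch below is a deliberately simple reader that handles only plain `key=value` lines and comment lines; it is not a full implementation of the Java properties format (no line continuations or escapes). The sample content is made up, apart from the two parameter names and default values quoted in this guide.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an equals sign
            props[key.strip()] = value.strip()
    return props


# Made-up excerpt of a cis.properties file.
sample = """\
# JMX agent port
cis.jmx.agent.port=8061
cis.server.limit.max_file_size=10000000
"""

props = parse_properties(sample)
jmx_port = int(props.get("cis.jmx.agent.port", "8061"))
print(jmx_port)
```

Remember that editing the file has no effect on a running server: the CIS server must be restarted for changes to apply.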
As a Windows service
Another set of configuration parameters can be set when considering the CIS server as a Windows
service. Edit the C:\Program Files\Documentum\CIS\service\wrapper.conf file and modify the
parameters as required, for example, the Wrapper Logging Properties.
You can also modify the recovery parameters of the CIS Windows service. Go to the Services section of
the Computer Management dialog, and open the Properties of the service. In the Recovery tab, you
can modify the default recovery settings. The CIS Windows service is configured to restart as follows:
First restart is immediate.
Second restart occurs after 30 seconds.
Third and subsequent restarts occur after 10 minutes.
The restart count is reset every 60 minutes.
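The restart schedule above can be written down as a small lookup, shown here as a sketch. The function name is hypothetical, and the sketch does not model the 60-minute reset of the restart count; it only maps the number of a restart to the delay that precedes it.

```python
def restart_delay_seconds(restart_number):
    """Delay before the given restart (1 = first restart), in seconds."""
    if restart_number < 1:
        raise ValueError("restart_number starts at 1")
    if restart_number == 1:
        return 0           # first restart is immediate
    if restart_number == 2:
        return 30          # second restart after 30 seconds
    return 10 * 60         # third and subsequent restarts after 10 minutes


print([restart_delay_seconds(n) for n in range(1, 5)])
```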
As a Java application
On Windows hosts, the CIS server can also be configured using the startCIS.bat script in <CIS
installation directory>. You can edit the script and modify CIS parameters:
CIS server system setup, such as the CIS_PATH environment variable
CIS server Java options, such as the CLASSPATH variable and memory configuration
Remember to create a backup before modifying the script.
An additional log file can be created for troubleshooting: cis-activity-detailed.log. This log
file is more verbose and contains information about both normal and detailed activity. Refer to the
procedure Modify the level of details in the detailed activity log file, page 35 to enable it.
The Log4j setup for CIS server logging is the file <CIS installation directory>/config/log4j.xml.
The Log4j project website (http://logging.apache.org/log4j/docs/index.html) provides information on
configuring log statements.
Type of error      Description
TOOLARGE
EXTRACTION         The module
                   The type of issue
                   The message (often containing the error code)
CLASSIFICATION     The module
                   The type of issue
                   The message (often containing the error code)
SUSPECTEDCRASH
CONTENTERROR
Chapter 4
Troubleshooting
This chapter covers common CIS errors and frequently asked questions about CIS processing. This
can help you troubleshoot any issue you may face with CIS.
Modify the level of details in the detailed activity log file, page 35
Most common errors, page 36
Frequently asked questions, page 38
1. Navigate to <CIS installation directory>/config and edit the configuration file log4j.xml.
2.
Note: In the log4j configuration, an appender refers to an output destination. For CIS, each
appender creates a log file.
3.
<appender-ref ref="file"/>
<appender-ref ref="console"/>
<appender-ref ref="activity"/>
<!--appender-ref ref="activity-detailed"/-->
</category>
<category name="activity.detailed">
<level value="off"/>
<!--appender-ref ref="activity-detailed"/-->
</category>
4.
In cis.log:
ERROR 2007-06-18 11:49:58,937
com.documentum.cis.service.internal.communication.CommandReader
[Stream reader (clientId=1)] - IO error when receiving a command
from a remote client (clientId=1)java.io.InvalidClassException:
com.documentum.cis.service.internal.command.SynchronizeTaxonomyCommand;
local class incompatible: stream classdesc serialVersionUID =
231215431745769233, local class serialVersionUID = -5880332928005196941
In DA:
Error in updating content intelligence configurations
Response not received for command=Version negotiation (maxVersion=1) with id=1
or (in DA log)
16:49:02,427 ERROR CISClientManager Error while sending a command to server
(command=Synchronize taxonomy (taxonomyId=0b1109558000111e, execution mode=TEST))
java.io.IOException: Disconnection of the server
Context: This may happen when you have updated CIS but not DA. We recommend using the same
version of the CIS server and DA or any other CI API client.
Solution: Check the Content Intelligence Services Release Notes or the Content Intelligence Services
Installation Guide to understand which versions of DA and CIS are compatible.
Example 4-2. The authentication against the repository failed (cis.log or console)
CIS server starting...
Invalid or no repository credentials. Waiting for credentials before
CIS server can actually start.
CIS server starts listening on port 8079
Context: When the CIS server starts, it checks the user credentials against the repository before
opening a session. If no credentials are found or if they are invalid (for example, after a repository
change or for a repository previously enabled for another CIS server), the CIS server starts in a
restricted mode that only allows receiving new or updated credentials. You cannot launch any
classification run but you can change the credentials in Documentum Administrator. When the CIS
server receives the valid credentials, it tries to connect to the repository. If successful, it switches to
full mode and the following message appears:
Credentials ok, CIS server connected to Repository.
Solution: Set the CIS parameters (login and password) in DA to create an authentication file.
Example 4-3. CIS server cannot connect to the repository (install.log, cis.log)
If the CIS server cannot connect to the repository, the server periodically tries to reconnect to the
repository. Attempts are made every minute for five minutes, then at increasing intervals, and
eventually every hour, until the connection succeeds or the server is shut down. Therefore, the CIS
administrator no longer needs to manually restart the CIS server.
Context: This error can occur if the repository has been modified, if the docbroker has been moved
to another machine, or after a network issue.
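The reconnection schedule described in this example can be pictured with the sketch below. Only the one-minute interval for the first five attempts and the one-hour ceiling come from the text; the doubling factor used for the intermediate delays is an assumption, as is the function name.

```python
import itertools


def reconnect_delays(initial=60, fast_attempts=5, cap=3600, factor=2):
    """Yield the delay in seconds before each reconnection attempt.

    The first attempts use a fixed one-minute interval; afterwards the
    delay grows by `factor` (an assumed value) until it reaches the cap.
    """
    for _ in range(fast_attempts):
        yield initial
    delay = initial
    while True:
        delay = min(delay * factor, cap)
        yield delay


# First twelve delays of the assumed schedule.
first = list(itertools.islice(reconnect_delays(), 12))
print(first)
```

Whatever the exact growth rate, the practical consequence is the same: after a long outage the server may wait up to an hour before its next attempt, so a repository that has just come back may not be picked up immediately.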
21:33:36,429 ERROR [Thread-0] com.documentum.cis.service.internal.
communication.CommandClient - IO error when initializing the command client
(attempting to open a socket connection to server host=CMAQAWIN2K8598 port=8079)
java.net.ConnectException: Connection refused: connect
Example 4-4. DA cannot connect to the CIS server. (error in DA)
Error in updating content intelligence configurations - Connection refused: connect
Solution:
Check if the CIS server is up and running, for example, look at the Windows service status.
Verify that the CIS port is correctly set: on the CIS host, in cis.properties file, and in DA, in the CIS
configuration page.
Look for any network interference: firewall, antivirus application, etc.
Upgrade DA to the same version as the CIS server, or at least update ci.jar on DA. The
procedure to update ci.jar is described in the Content Intelligence Services Installation Guide, in
the Troubleshooting chapter.
Example 4-5. Connection Broker (docbroker) not reachable from the CIS server. (cis.log)
If the connection broker is not reachable from the CIS server, most likely the CIS server will not
be able to connect to the repository.
ERROR 2009-06-08 10:17:31,641 com.documentum.cis.service.internal.scheduling.
ExecuteQueuedDocumentsCommand [schedulerExecutor_1] - Error creating queue
processing task set. Will try again on next iteration.com.documentum.cis.
service.internal.content.ContentException: DfDocbrokerException:: THREAD:
schedulerExecutor_1; MSG: [DFC_DOCBROKER_REQUEST_FAILED] Request to Docbroker
"mycs:1489" failed; ERRORCODE: ff; NEXT: null
Solution:
Check if the entity extraction server is up and running, for example, look at the Windows services
status.
Verify the configuration of the entity extraction server in cis.properties file.
Example 4-7. CenterStage not installed on repository. (cis.log)
ERROR 2010-04-23 12:06:42,788 com.documentum.cis.service.internal.adapter.dfc.
DfcFileSpaceWatcher [main] - Space watcher is enabled but CenterStage seems not
to be present. Please check property 'cis.server.centerstage.enabled' in the
cis.properties configuration file (CIS server needs to be restarted after this
file is modified).
2. Document format. The same document takes more time to process in PDF format than in
Word format, for example.
3. Size of the text content of the document. For better performance, this size is limited by the
cis.classification.limit.max_content_size parameter in the cis.properties configuration file, described
in Table 2, page 28.
4. Binary size of the document. Files larger than the limit set by the cis.server.limit.max_file_size
parameter are not processed, and an error is logged in the log file. By default:
cis.server.limit.max_file_size=10000000. This parameter is in the cis.properties configuration file
and described in Table 2, page 28.
5. Number of threads. The number of executing threads is optimized for the type of processing
selected during CIS installation. If CIS is used for classification, then the default number of
threads is 5. If CIS is installed for CenterStage, that is, for entity extraction, the default number of
threads is set to 1 to optimize the data throughput with the entity extraction server. If you want
to use both types of processing, or if you are using a type of processing different from the one
chosen during installation, you should adjust the number of threads.
6. Classification options. For example, CIS cannot update an attribute of, or link into a folder,
any locked or immutable object, such as lightweight objects (for example,
dm_message_archive). If you apply these options to such objects, the resulting errors and
warnings lower performance. The only processing possible for these types of objects is
category assignment, stored as relations of type dm_category_assign.
You therefore have to deselect the options Link assigned documents into category folders and
Update document attributes with category assignments in Documentum Administrator.
The size of the document set has no impact on the performance whether you define one large
document set, or ten small ones.
Similarly, the number of taxonomies and categories has no impact on the performance.
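The binary size limit from the list above lends itself to a quick pre-check before submitting documents. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the limit is expressed in bytes and that a file exactly at the limit is still processed. The default value is the one quoted for cis.server.limit.max_file_size.

```python
DEFAULT_MAX_FILE_SIZE = 10000000  # default of cis.server.limit.max_file_size


def exceeds_file_limit(size_in_bytes, max_file_size=DEFAULT_MAX_FILE_SIZE):
    """Return True if a file of this size would be skipped (assumed unit: bytes)."""
    return size_in_bytes > max_file_size


def partition_by_size(sizes, max_file_size=DEFAULT_MAX_FILE_SIZE):
    """Split a {name: size} mapping into processable and oversized names."""
    accepted, rejected = [], []
    for name, size in sizes.items():
        (rejected if exceeds_file_limit(size, max_file_size) else accepted).append(name)
    return accepted, rejected


# Made-up file sizes in bytes.
files = {"spec.doc": 250000, "scan.pdf": 48000000, "notes.txt": 10000000}
print(partition_by_size(files))
```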
Part 2
CIS in Documentum Administrator
For your convenience, this part includes the Content Intelligence chapter from the Documentum
Administrator User Guide. It describes all actions that can be performed in Documentum
Administrator, such as configuring CIS for a repository or defining a taxonomy for classification.
Chapter 5
Content Intelligence Services
Providing evidence
The quality of the classification relies on the category definitions. The more accurate the definition,
the better the classification. To define efficient categories, you can act on several aspects:
You can use keywords, that is, evidence terms and their respective confidence values. Evidence
terms can be simple terms or phrases, for which you can choose to apply stemming analysis or keep
the phrase order. You can also define patterns using regular expressions.
You can set property rules that allow you to define category assignments according to the values
of the repository attributes.
You can use evidence from other categories by setting category links.
the language set for individual documents. This prevents classification errors if the document
language is not correctly set. Note that you can only select one language per document. If the
document set is made of many documents in different languages, then the language must be set at the
document level and not at the document set level. When no language is defined for the documents
or for the document sets and stemming is activated, the language used is the one defined in the
CIS server configuration. The Content Intelligence Services Administration Guide describes how to set
the default language for CIS server.
You can also define the language of the categories used for the classification. The language can be
set for every category or for the entire taxonomy. If the language of a category is not specified, then
the language of the taxonomy is used; the category does not inherit the language of its parent category, if any.
When no language is defined, the language used is the one defined in the CIS server configuration.
You can also define the language as "Any language", which means that documents in
different languages can be assigned to this category.
The following languages are available for the stemming option: English, French, German, Italian,
Portuguese, Spanish, Danish, Dutch, Finnish, Swedish, Norwegian Bokmal, and Norwegian Nynorsk.
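The fallback order for category languages described above can be sketched in Python as follows. This is only an illustration of the rules stated in this section: the function names are hypothetical, and the language names are simply the ones listed above.

```python
# Languages listed in this guide as available for the stemming option.
STEMMING_LANGUAGES = {
    "English", "French", "German", "Italian", "Portuguese", "Spanish",
    "Danish", "Dutch", "Finnish", "Swedish",
    "Norwegian Bokmal", "Norwegian Nynorsk",
}


def effective_category_language(category_lang, taxonomy_lang, server_default):
    """Resolve the language used for a category.

    The category's own setting wins, then the taxonomy's, then the CIS
    server default. The parent category is deliberately not consulted,
    because a category does not inherit its parent's language.
    """
    if category_lang:
        return category_lang
    if taxonomy_lang:
        return taxonomy_lang
    return server_default


def supports_stemming(language):
    """Return True when the language is in the documented stemming list."""
    return language in STEMMING_LANGUAGES


print(effective_category_language(None, "French", "English"))
print(effective_category_language(None, None, "English"))
```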
Navigate to Administration > Content Intelligence for the repository you want to process
documents from.
If the Content Intelligence node is not visible, refer to Missing Content Intelligence node, page
48 to solve this issue.
2.
3.
Enter the host names for the production CIS server and the test CIS server. The host name is
made of the IP address or DNS name followed by the port number (optional), for example:
192.168.1.250:8079
CIS enables you to categorize documents in production mode or test mode; see Test processing
and production processing, page 69 for details. Although you can use the same CIS server for
both production and testing, separate servers are recommended for better performance and
availability.
The specified CIS server(s) must be running when you enable the repository.
4.
Enter the User Name and password for the CIS server to connect to the repository. The
authentication against the repository is required when retrieving documents and assigning
documents to categories.
5.
Click OK.
6.
Set the CIS processing options for the repository, as described in Modifying Content Intelligence
Services configuration, page 49.
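The host name format used in the procedure above (an IP address or DNS name, optionally followed by a port number) can be validated with a small helper before you enter it. The following Python sketch is an assumption-laden illustration: the function name is hypothetical, IPv6 literals are not handled, and no default port is assumed when the port is omitted.

```python
def parse_cis_host(value):
    """Split 'host[:port]' into (host, port); port is None when omitted."""
    value = value.strip()
    host, sep, port = value.rpartition(":")
    if not sep:
        # No colon at all: the whole value is the host name.
        if not value:
            raise ValueError("empty host name")
        return value, None
    if not host:
        raise ValueError(f"missing host name in {value!r}")
    if not port.isdigit():
        raise ValueError(f"invalid port in {value!r}")
    return host, int(port)


print(parse_cis_host("192.168.1.250:8079"))
print(parse_cis_host("cis-prod.example.com"))
```

The host name cis-prod.example.com is a made-up example.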
Depending on your use of the classification functionality, proceed with one of the following options:
You use CIS server to classify content: it is likely that the version of your
Documentum Administrator is more recent than the version of your CIS server. In this case,
we recommend that you upgrade CIS to the same version as Documentum Administrator.
You classify manually, without CIS server.
1.
In this case, download the Content Intelligence Services archive file from the Documentum
download center, for the same version as your Documentum Administrator.
2.
3.
2.
In the Content Intelligence Services box on the right, click the link Configure CIS.
The Configuration for Content Intelligence page appears.
3.
Update the host names, and optionally the port numbers, of the CIS production and test servers if
necessary.
CIS allows you to categorize documents in production mode or test mode; see Test processing
and production processing, page 69 for details. Although you can use the same CIS server for
both production and testing, separate servers are recommended for better performance and
availability.
The specified CIS server(s) must be running when you configure the repository.
4.
Specify whether CIS links assigned documents into a corresponding category folder. This option
is not selected by default.
If you do not select the Link assigned documents into category folders option, category
assignments are not returned as search results, and Documentum Webtop users can view
assignments only if you assign them as attributes.
Note: Selecting this option affects system performance during document processing and
classification. Do not select it unless you need the functionality it provides.
5.
Specify whether CIS adds assigned category names to document attributes by selecting or
deselecting the Update document attributes with category assignments option. This option is selected
by default.
Which attributes CIS updates is determined by the category classes of each category; see Defining
category classes, page 51.
6.
Enter the Documentum User Name and password for CIS server to use when connecting to
this repository.
Select a user account that has appropriate permissions for retrieving documents to process and
assigning documents to categories.
7.
Click OK to validate.
Building taxonomies
The term taxonomy refers to two related items in Content Intelligence Services. In most situations
it refers to the hierarchy of categories that divide up a particular subject area for content. For
example, the term is used in this sense when you refer to the Human Resources taxonomy or the
Pharmaceutical taxonomy. A taxonomy in this sense has a root level and any number of categories as
direct and indirect children.
Content Intelligence Services also uses the term taxonomy to refer to the Documentum object that
serves as the root level of the hierarchy. Taxonomy objects represent the top level, much as a cabinet
represents the top level of a hierarchy of folders.
The organizational structure of a taxonomy determines the navigation path that users follow to
locate documents in the category as well as certain types of inheritance: a category inherits some
default values from the taxonomy definition and can inherit evidence from its children categories, its
parent category, or any other category.
Taxonomies consist of three types of Documentum objects:
Taxonomy objects represent the root of a hierarchical tree of categories. The definition of
a taxonomy sets default values for its categories and can include property conditions that
documents must meet in order to be assigned to categories in the taxonomy. No documents are
assigned directly to the root of the taxonomy.
Categories are the headings under which documents are categorized. The definition of a category
includes the evidence that CIS server looks for in document content to determine whether
it belongs in the category.
Category classes define general types of categories. Every category is assigned to a class, which
specifies the default behavior of the category.
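To make the relationship between the three object types more concrete, the following Python sketch models a taxonomy as the root of a tree of categories, each tagged with a category class. It is a simplified in-memory model for illustration only: the class and function names, the sample category class name, and the "/" path separator are assumptions, not the Documentum object model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Category:
    name: str
    category_class: str                      # every category is assigned to a class
    parent: Optional["Category"] = None      # None for a direct child of the taxonomy
    children: List["Category"] = field(default_factory=list)


@dataclass(eq=False)
class Taxonomy:
    name: str                                # root of the hierarchy; holds no documents
    categories: List[Category] = field(default_factory=list)


def add_category(taxonomy, name, category_class, parent=None):
    """Create a category under the given parent, or directly under the root."""
    category = Category(name, category_class, parent)
    siblings = parent.children if parent is not None else taxonomy.categories
    siblings.append(category)
    return category


def category_path(taxonomy, category):
    """Return the navigation path from the taxonomy root down to the category."""
    parts = []
    node = category
    while node is not None:
        parts.append(node.name)
        node = node.parent
    parts.append(taxonomy.name)
    return "/".join(reversed(parts))


hr = Taxonomy("Human Resources")
benefits = add_category(hr, "Benefits", "Topic")
health = add_category(hr, "Health Insurance", "Topic", parent=benefits)
print(category_path(hr, health))
```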
In addition to building taxonomies using Documentum Administrator, you can import pre-built
taxonomies from XML files in taxonomy exchange format (TEF). The Content Intelligence Services
Administration Guide provides more information about importing taxonomies.
describes how to restrict the access to the Content Intelligence node to only some members using the
ci_taxonomy_manager_role.
2.
3.
In the Administrator Access Set properties, select the Content Intelligence node and the
role ci_taxonomy_manager_role.
4.
5.
6.
Click File > Add Member(s) and select the names of the users or groups you want to add to this
role.
7.
8.
9.
Select File > New > Category Class to create a new category class, or click the
category class whose properties you want to set.
The properties page for category classes appears. It has two tabs, one for general category class
information and the other for default values.
3.
4.
Identify the document attribute into which CIS writes the names of assigned categories.
The classification attribute must be an existing attribute for the object type of documents that will
be assigned to categories of this class, and it must be a repeating value attribute, for example,
keywords. Category names are written into the attribute only if this option is active; see
Modifying Content Intelligence Services configuration, page 49 for information about setting the
option. Note that the current values of the selected attribute are erased by CIS and replaced by
the result of the new categorization. Therefore, end users should not edit this attribute manually.
5.
6.
Specify how CIS treats the category name as an evidence term for the category.
a.
To have CIS add the category name as an evidence term, select the Include Category
Name as evidence term checkbox. If you deselect this option, the next two options are not
relevant and are grayed out. Skip to step 7.
b. To activate the stemming option on the category name, select the Use stemming checkbox.
c.
To enable the words in multi-word category names to appear in any order, select the
Recognize words in any order checkbox. When the checkbox is not selected, CIS server
recognizes the category name only if it appears exactly as entered.
7.
Set the default rules for using evidence from child or parent categories.
When a document is assigned to one category, CIS server can use that assignment as evidence
that the document also belongs in a related category. This type of evidence propagation is most
common between categories and their parent or children categories. See About category links,
page 46 for more information.
a.
To use evidence from parent or child categories by default, select the Use evidence from
child/parent checkbox. Deselect the checkbox to avoid evidence propagation.
b. From the drop-down list associated with the checkbox, select child to use evidence from
child categories as evidence for the current category or parent to use evidence from parent
categories.
Note: You cannot link to a category with a name that is not unique. If you define links to
categories with a non-unique name, the links will not be taken into account by CIS processing.
8.
2.
3.
4.
For category classes that are assigned to existing categories, select an alternate category class for
the categories.
When a category class is still in use, the confirmation message page enables you to select which
of the remaining category classes is assigned to categories that currently use the deleted class.
Choose the class from the Update categories to use the category class drop-down list.
5.
Defining taxonomies
You need to create a taxonomy object before you can create any of the categories in the hierarchy. The
taxonomy object sets certain default values for the categories below it.
Since the taxonomy object is the root of a complete hierarchy of categories, it is the object that you
work with when performing actions that affect the entire hierarchy, such as making the latest
definitions available to CIS server (synchronizing) or making the hierarchy available to users (bringing
the taxonomy online). For information about these operations, see Managing taxonomies, page 67.
Every CIS implementation needs to have at least one taxonomy to use for analyzing and processing
documents. Depending on the types of documents being categorized, you may want to create
multiple taxonomies. Generally you want one taxonomy for each distinct subject area or domain.
One advantage to separate taxonomies is that they can be maintained separately, by different subject
matter experts, for example.
The Properties page for a taxonomy object can have two or three tabs:
The Attributes tab displays the basic information about the taxonomy, most of which was entered
when the taxonomy was created.
The Property Rules tab lists conditions that documents must meet before CIS server will assign
them to any category under this taxonomy.
The Select Taxonomy Type tab is displayed if category or taxonomy is subtyped. Using this,
you can create your own subtype.
2.
To display only taxonomies that you own or only online taxonomies, choose one of the options
from the drop-down list in the upper right corner of the list page.
3.
Select File > New > Taxonomy to create a new taxonomy. To modify a taxonomy, select it and
then go to View > Properties > Info.
The properties page for taxonomies appears.
4.
In the Select Taxonomy Type tab, select the taxonomy type from the drop-down list to create
a subtype.
Click Next to proceed or click the Attributes tab.
The Attributes page displays the non-editable subtype of the taxonomy.
5.
Enter a name, title, and description for the taxonomy. Only the taxonomy name is mandatory,
and it must be unique. The title is optional and need not be unique.
By default, the taxonomy name is the text that appears in the list of taxonomies. However, it is
possible to display the taxonomy title instead of the taxonomy name. The procedure To display
the object titles instead of the object names:, page 61 describes how to switch from the category
and taxonomy names to the category and taxonomy titles.
6.
Click the Select owner link and choose the taxonomy owner. The taxonomy owner can be a
person, a list of persons, or groups.
7.
8.
Select the taxonomy language. The selected language must match the language of the
documents that you want to classify. If the language is different, the documents will never
be assigned to a category of this taxonomy.
If the language of a category is not defined, the language set for the taxonomy is used. If no
language is set for the taxonomy, the CIS server default language is used.
Select Any language in the drop-down list to match any document's language. For example, you
can use this option if you don't plan to activate stemming and your evidence terms are valid
in any language, such as patterns for social security numbers or acronyms like EMC. If the option
Any language is selected, stemming cannot be used on the evidence terms of
this taxonomy: the Use stemming option in the evidence term definition is disabled and
grayed out.
9.
If the taxonomy has never been synchronized, the status is Unknown. See Synchronizing
taxonomies, page 68 for information about synchronization.
The synchronization state is not displayed when you are creating a new taxonomy.
12. Click OK to close the properties page, or click the Property Rules tab to specify criteria that
all documents in this taxonomy must meet.
Add property rules to the taxonomy if you want to define rules specific to document attributes.
For help using the Property Rules tab, see Defining property rules, page 63.
13. Click OK to close the properties page.
14. To create or modify the categories in the taxonomy, see Defining categories, page 59 for
information about defining categories.
15. To synchronize the taxonomy if you have made any changes to it or its categories, see
Synchronizing taxonomies, page 68 for information about synchronization.
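The effect of the taxonomy language setting chosen in the procedure above can be summarized with a short Python sketch. It assumes that languages are compared as plain strings and uses hypothetical function names; it is an illustration of the stated rules, not CIS code.

```python
ANY_LANGUAGE = "Any language"


def language_allows_assignment(taxonomy_language, document_language):
    """A document qualifies only if its language matches the taxonomy language,
    unless the taxonomy is set to Any language."""
    if taxonomy_language == ANY_LANGUAGE:
        return True
    return taxonomy_language == document_language


def use_stemming_enabled(taxonomy_language):
    """The Use stemming option is disabled when Any language is selected."""
    return taxonomy_language != ANY_LANGUAGE


print(language_allows_assignment("French", "German"))
print(language_allows_assignment(ANY_LANGUAGE, "German"))
print(use_stemming_enabled(ANY_LANGUAGE))
```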
2.
In the DocApp Explorer, double-click the object type's name to open the type editor, and select the
Display Configuration tab.
Tip: Each row in the Scope field represents one scope. A scope does not have a name and is
instead identified by its set of scope definitions.
To learn more about the Scope field, refer to "Working with Object Types" in the Documentum
Application Builder User Guide.
3.
To create and modify tabs on which to display the attributes, perform these actions in the
Display Configuration List:
Note:
The object type's parent's tabs are inherited. Adding, deleting, or editing tabs, or changing the
order of the tabs, breaks inheritance; that is, changes made to the parent's tabs will not be
reflected in this type's tabs.
Tab names are also localizable.
Web Publisher does not have tabs, so it displays the display configurations as sections on
the same page.
For WDK applications, to display attributes (particularly mandatory ones) on the object
properties Info page, specify the Info category.
To add a new tab:
a.
Click Add.
b. Enter a new tab name or choose one of the defaults from the drop-down list.
c.
To add the tab to all EMC Documentum clients, check Add to all applications. This tab is
shared between all applications, and any changes to it are reflected in all applications.
d. Click OK.
Note: When you create tabs with identical names in different applications, DAB creates new
internal names for the second and subsequent tabs by appending an underscore and integer
(for example, dm_info_0) because the internal names must be unique for a type. The identical
names are still displayed because they are mapped to their corresponding internal names. When
you change locales, DAB displays the internal names, because you have not specified a name
to be displayed in that locale. It is recommended that you change them to more meaningful
names in the new locales.
Using one of the defaults automatically creates a tab with an identical name, because the default
is already used by another application.
Checking Add to all applications results in only one tab being created, rather than several tabs with
different internal names and identical display names, and all display names are mapped to
that one tab.
To remove a tab, select the tab name and click Remove.
To rename a tab, select the tab name and click Rename.
To change the order in which tabs are displayed, select the tab and click the up and down arrows.
4.
To modify the attributes displayed on a tab, perform these actions in the Attributes in Display
Configuration:
a.
In the Display Configuration List, select the tab in which the attributes you want to modify
are displayed. The attributes that are currently displayed on the tab are shown in the
Attributes in Display Configuration text box.
b. Click Edit.
c.
To specify which attributes are displayed on the tab and how they are displayed, perform
these actions in the Display Configuration dialog box:
To display attributes on the tab, select the attribute in the Available attributes text box
and click Add.
To delete attributes from the tab, select the attribute in the Current attribute list text box
and click Remove.
To change the order in which the attributes are displayed on the tab, select the attribute in
the Current attribute list text box and click up or down arrows.
To display a separator between two attributes, select the attribute above which you want
to add a separator and click Add Separator.
To delete a separator between two attributes, select the separator and click Remove
Separator.
If you have more attributes than can fit on a tab, you can force some attributes to be displayed on a
secondary page in Webtop: select the attribute and click Make Secondary.
To move a secondary attribute back onto the primary tab, select the attribute and click Make
Primary.
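The note above on identically named tabs says that DAB keeps internal names unique by appending an underscore and an integer (for example, dm_info_0). The following Python sketch shows one way such a naming scheme could work. It is not DAB's actual algorithm: the function name is hypothetical, and numbering from 0 is an assumption based only on the dm_info_0 example.

```python
def next_internal_name(base, existing):
    """Return base if unused, otherwise base_0, base_1, ... (first free suffix)."""
    if base not in existing:
        return base
    n = 0
    while f"{base}_{n}" in existing:
        n += 1
    return f"{base}_{n}"


used = set()
for _ in range(3):
    # Three applications each create a tab with the same display name.
    name = next_internal_name("dm_info", used)
    used.add(name)
    print(name)
```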
Click File > New > Category or File > New > Taxonomy.
When a new instance is created, Documentum Administrator launches the Info screen for the
new object. You can customize Documentum Administrator to create subtype instances which is
similar to category/taxonomy creation.
The info screen enables you to view and edit attributes of a particular taxonomy or category.
2.
Enter a name, title, and description for the category. Only the category name is mandatory, and it
must be unique between categories that have the same parent. The title is optional and
need not be unique.
By default, the category name is the text that appears in the list of categories and is the name of
the folder created to correspond to this category. However, it is possible to display the category
title instead of the category name. The procedure To display the object titles instead of the object
names:, page 61 describes how to switch from the category and taxonomy names to the category
and taxonomy titles.
3.
Click the Select owner link and choose the owner of this category.
The standard page for selecting user(s) or group(s) appears. The category owner is the user who
can approve or reject documents assigned to the category as a candidate requiring approval
from the category owner; see Reviewing categorized documents, page 73 for information about
the document review process. The user you select is added to the ci_category_owner_role
automatically, giving him or her access to the category through Documentum Administrator.
Note: If both a user and a group exist with the same name, the user cannot be selected as
a category owner, only the group.
4.
5.
6.
Click the CustomProp tab to create a custom tab for the subtypes.
2.
3.
Click OK to close the properties page, or click the Property Rules tab to specify criteria that
all documents in this taxonomy must meet.
Add property rules to the taxonomy if you want to apply specific criteria to all documents
before they are considered for categorization in this taxonomy. For help using the Property
Rules tab, see Defining property rules, page 63.
4.
Defining categories
When you create a category, you define its position in the hierarchy of categories by navigating into
the category that you want to be its parent. The category inherits default values for most of the
required attributes from the taxonomy object at the top of the hierarchy.
The procedure below describes how to create a category and set its basic properties. For information
about providing the evidence that CIS server uses to identify documents that belong in the category,
see Setting category rules, page 62.
Note: When you customize CIS and CenterStage to expose the taxonomy-based classification as
search filters in CenterStage clients, the result of the classification process is to store category names
as annotations (and not as category assignments). The maximum number of categories that can
be assigned to one document for this type of classification is 273. If a document matches more than
273 categories, no category names are stored for this document. Make sure that no document
matches more than 273 categories.
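The all-or-nothing behavior of the 273-category limit is easy to overlook, so here is a minimal Python sketch of it. The function name is hypothetical and the code only mirrors the rule stated in the note above; it says nothing about how CenterStage stores annotations.

```python
MAX_ANNOTATION_CATEGORIES = 273  # limit stated in this guide


def category_names_to_store(matched_categories):
    """Return the names to store as annotations.

    When a document matches more categories than the limit, nothing is
    stored at all (the list is not truncated).
    """
    if len(matched_categories) > MAX_ANNOTATION_CATEGORIES:
        return []
    return list(matched_categories)


within_limit = [f"category_{i}" for i in range(273)]
over_limit = within_limit + ["one_too_many"]
print(len(category_names_to_store(within_limit)))
print(len(category_names_to_store(over_limit)))
```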
To create a category:
1.
2.
To display only taxonomies that you own or only online taxonomies, choose one of the options
from the drop-down list in the upper right corner of the list page.
3.
Select a taxonomy and navigate to the location where you want the category to appear.
The right pane should display the contents of the category that will be the new category's parent.
4.
5.
If subtypes have been created, in the Select Category Type tab, select the category type from the
drop-down list to create a subtype.
Click Next to proceed or click the Attributes tab. If no subtypes have been created, go directly
to the Attributes tab.
The Attributes page displays the non-editable subtype of the category.
6.
Enter a name, title, and description for the category. Only the category name is mandatory, and it
must be unique between categories that have the same parent. The title is optional and
need not be unique.
By default, the category name is the text that appears in the list of categories and is the name of
the folder created to correspond to this category. However, it is possible to display the category
title instead of the category name. The procedure To display the object titles instead of the object
names:, page 61 describes how to switch from the category and taxonomy names to the category
and taxonomy titles.
The category name must not exceed 255 characters. The category
path, which includes the category name and the names of the parent categories, must not exceed
450 characters.
7.
Click the Select owner link and choose the owner of this category.
The standard page for selecting a user appears. The category owner is the user who can approve
or reject documents assigned to the category as a candidate requiring approval from the category
owner; see Reviewing categorized documents, page 73, for information about the document
review process. The user you select is added to the ci_category_owner_role automatically, giving
him or her access to the category through Documentum Administrator.
8.
9.
Select the category language. The selected language is used to filter the documents that you want
to classify. If the language is different, the documents will never be assigned to the category.
If the language of a category is not defined, the language set for the taxonomy is used, regardless of
the language of the parent category, if any. If no language is set for the taxonomy, the CIS server
default language is used.
Select Any language in the drop-down list to match any document's language. For example, you
can use this option if you don't plan to activate stemming and your evidence terms are valid
in any language, such as patterns for social security numbers or acronyms like EMC. If the option
Any language is selected, stemming cannot be used on the evidence terms of this
category: the Use stemming option is disabled and grayed out.
To have CIS add the category name as an evidence term, select the Include Category
Name as evidence term checkbox. If you deselect this option, the next two options are not
relevant and are grayed out.
b. To activate the stemming option on the category name, select the Use stemming checkbox.
This option is automatically disabled and grayed out if you selected Any language as the
category language.
c.
To enable the words in multi-word category names to appear in any order, select the
Recognize words in any order checkbox. When the checkbox is not selected, CIS server
recognizes the category name only if it appears exactly as entered.
12. Set the default rules for using evidence from child or parent categories.
When a document is assigned to one category, CIS server can use that assignment as evidence
that the document also belongs in a related category. This type of evidence propagation is most
common between categories and their parent or children categories. See About category links,
page 46 for more information.
a.
To use evidence from parent or child categories by default, select the Use evidence from
child/parent checkbox. Deselect the checkbox to avoid evidence propagation.
b. From the drop-down list associated with the checkbox, select child to use evidence from
child categories as evidence for the current category or parent to use evidence from parent
categories.
Note: You cannot link to a category with a name that is not unique. If you define links to
categories with a non-unique name, the links will not be taken into account by CIS processing.
13. Click the CustomProp tab to create a custom tab for the subtypes.
14. If the customization for a subtype is not available, Documentum Administrator uses the
closest supertype settings that are available for that subtype. For more information, refer
to Creating custom tab for the subtype, page 55.
15. Enter the custom type for the subtype.
16. Click OK.
The property page closes, and the category appears in the list.
17. Set the category rules.
For details, see Setting category rules, page 62.
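The difference between exact-order matching and the Recognize words in any order option used in the procedure above can be sketched as follows. This Python example is a deliberately naive approximation: the tokenization, the case folding, and the rule that any-order matching only requires every word to be present somewhere in the text are all assumptions, and the function names are hypothetical. The actual CIS matching logic is not described in this guide.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def name_matches(category_name, content, any_order=False):
    """Check whether a multi-word category name occurs in the content."""
    name_words = tokenize(category_name)
    doc_words = tokenize(content)
    if not name_words:
        return False
    if any_order:
        # Assumed behavior: every word must appear, in any position.
        return all(word in doc_words for word in name_words)
    # Default: the words must appear consecutively, in the order entered.
    n = len(name_words)
    return any(
        doc_words[i:i + n] == name_words
        for i in range(len(doc_words) - n + 1)
    )


text = "Insurance options that protect your health"
print(name_matches("Health Insurance", text))
print(name_matches("Health Insurance", text, any_order=True))
```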
2.
3.
4.
2.
3.
Click the
The rules page for the category appears. The right pane of the screen displays property rules for
the category; the left pane displays the evidence for the category.
4.
5.
From the Property Rules page, click the Edit link in the Category Property Rule box.
The Property Rules page appears.
2.
To require assigned documents to come from a specific folder, click the Select folder link next to
Look in: and navigate to the folder.
When you click OK after selecting the folder, the folder appears next to the Look in label.
3.
To require assigned documents to have a particular object type, click the Select type link next to
Type: and select the object type. The default object type is dm_sysobject. If you have created
custom object types, To display or hide an attribute:, page 65, describes how to make custom
object types available in the CIS component.
When you click OK after selecting the object type, the type name appears next to the Type label.
4.
To assign documents based on their attributes, select the Properties checkbox and enter the
criteria used to qualify documents.
a.
b. Select the repository attribute whose value you want to test. The list of attributes differs
according to the selected object type. If you have created custom attributes, To display or
hide an attribute:, page 65, describes how to display custom attributes.
c.
From the drop-down list in the middle, select the operator that will be used to compare the
selected attribute with the test value.
The available operators differ depending on the type of the attribute you selected in the
previous step. For example, for a Boolean attribute, the two operators are equal and not
equal and the possible values are true or false.
The operators contains and does not contain are only available for string attributes.
The operators greater than or less than can be used to select string values alphabetically. For
example, the string ABD is greater than ABC. You can then assign documents using their
title, their author or any other string attribute by alphabetical order, such as: all documents
with an author name greater than A and less than C (note that in this case, words starting
with C are ignored).
d. Enter the value to test against in the text box on the right. Values are not case sensitive and
accents are ignored.
To define a rule on the Format attribute, you must enter the value as it appears in the
document's Property page. For example, to match documents whose format is Microsoft
Word Office Word Document 8.0-2003 (Windows), enter the value msw8.
To define a rule on any date attribute, the corresponding value should comply with
Documentum date standards. Table 4, page 64 shows possible date formats
(non-exhaustive list).
Table 4. Date formats for property rules
Date format      Example
mm/dd/yy         02/15/1990
mon dd yyyy      Feb 15 1990
mm/yy            02/90
dd/mm/yyyy       15/02/1990
yyyy/mm          1990/02
yy/mm/dd         90/02/15
yyyy-mm-dd       1990-02-15
dd-mon-yy        15-Feb-90
month yyyy       February 1990
month dd yy      February 15 90
month, yyyy      February, 1990
Note that property rules on a date attribute do not take into account the time (hours, minutes,
seconds).
e.
5.
To add an additional condition, click the Add Property button and repeat steps b through d.
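The comparison rules described in this procedure (values are not case sensitive, accents are ignored, greater than and less than compare strings alphabetically, and date rules ignore the time) can be approximated in Python as shown below. This is a sketch under stated assumptions: the function names are hypothetical, accent stripping relies on Unicode decomposition, and the mapping of three Table 4 formats to strptime patterns is an interpretation made for this example, not Documentum's own date parser.

```python
import unicodedata
from datetime import datetime


def normalize(value):
    """Lowercase and strip accents so that comparisons ignore case and diacritics."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


OPERATORS = {
    "equal": lambda a, b: a == b,
    "not equal": lambda a, b: a != b,
    "contains": lambda a, b: b in a,
    "does not contain": lambda a, b: b not in a,
    "greater than": lambda a, b: a > b,
    "less than": lambda a, b: a < b,
}


def string_rule(attribute_value, operator, test_value):
    """Evaluate one string condition after normalizing both sides."""
    return OPERATORS[operator](normalize(attribute_value), normalize(test_value))


# Assumed strptime equivalents for three of the formats in Table 4.
DATE_FORMATS = {
    "yyyy-mm-dd": "%Y-%m-%d",
    "mon dd yyyy": "%b %d %Y",
    "dd-mon-yy": "%d-%b-%y",
}


def parse_rule_date(text, format_name):
    """Parse a rule value and keep only the date part (time is ignored)."""
    return datetime.strptime(text, DATE_FORMATS[format_name]).date()


# Authors greater than A and less than C: names starting with C are left out.
for author in ["Baker", "Álvarez", "Clark"]:
    selected = string_rule(author, "greater than", "A") and string_rule(author, "less than", "C")
    print(author, selected)
```

Month names in the date formats are parsed according to the current locale, so the sketch assumes an English or default C locale.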
2.
3.
Under the <attribute_list> element, you can add an entry for the type whose attribute display
you want to modify.
For example:
<attribute_list>
<type id='my_custom_type'>
Two <type id> elements already exist for the dm_sysobject and dm_document object types.
4.
Under the <type id> element, list the attributes that should not appear in the drop-down menu in the
<exclusion_attributes> element, and the attributes that should appear in the <inclusion_attributes> element.
By default, all the attributes of the selected object type are available; to hide them, add them
to the exclusion list.
Attributes that are hidden by default begin with r_, a_, or i_; to make them available, add them
to the inclusion list.
For example:
<attribute_list>
  <type id='my_custom_type'>
    <exclusion_attributes>
      <attribute>my_custom_attribute1</attribute>
      <attribute>my_custom_attribute2</attribute>
    </exclusion_attributes>
    <inclusion_attributes>
      <attribute>my_custom_attribute3</attribute>
    </inclusion_attributes>
  </type>
</attribute_list>
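To check the effect of such a configuration before deploying it, you could compute the resulting attribute list yourself. The following Python sketch parses a fragment like the one above and applies the rules stated in this procedure. The function name and the sample attribute names are hypothetical, and letting the exclusion list take precedence when an attribute appears in both lists is an assumption.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<attribute_list>
  <type id='my_custom_type'>
    <exclusion_attributes>
      <attribute>my_custom_attribute1</attribute>
    </exclusion_attributes>
    <inclusion_attributes>
      <attribute>r_creation_date</attribute>
    </inclusion_attributes>
  </type>
</attribute_list>
"""

# Attributes with these prefixes are hidden by default, per this guide.
HIDDEN_PREFIXES = ("r_", "a_", "i_")


def visible_attributes(config_xml, type_id, all_attributes):
    """Return the attributes that would be offered for the given object type."""
    root = ET.fromstring(config_xml.strip())
    excluded, included = set(), set()
    for type_el in root.iter("type"):
        if type_el.get("id") == type_id:
            excluded = {(a.text or "").strip()
                        for a in type_el.findall("exclusion_attributes/attribute")}
            included = {(a.text or "").strip()
                        for a in type_el.findall("inclusion_attributes/attribute")}
    result = []
    for name in all_attributes:
        if name in excluded:
            continue  # explicitly hidden (assumed to win over inclusion)
        if name.startswith(HIDDEN_PREFIXES) and name not in included:
            continue  # hidden by default and not explicitly shown
        result.append(name)
    return result


attributes = ["object_name", "my_custom_attribute1", "r_creation_date", "a_status"]
print(visible_attributes(CONFIG, "my_custom_type", attributes))
```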
Click the Add a new simple term link to add a new term, or click the
you want to modify.
The Evidence page appears. For a new term, the Use stemming and Recognize words in any
order checkboxes are set to the default values from the category class for this category.
2. To use a word or phrase as evidence for the category, click the Keyword option button and enter
the word or phrase in the adjacent text box.
A keyword is a text string that CIS server looks for in the documents it processes.
3. To include another category as evidence for this category, click the Category option button and
identify the category to use as evidence for this category.
A category link tells CIS server to use the evidence of another category as part of the definition
of this category.
To use this category's parent category, select Parent from the drop-down list.
To use this category's children categories, select Child.
To link to a selected category, select Category, then click the Select category link that appears
to the right of the drop-down list and select the related category from the page that appears.
Note: You cannot link to a category with a name that is not unique. If you define links to categories
with a non-unique name, the links are not taken into account by CIS processing.
See About category links, page 46 for more information about the types of category link.
4. Specify whether CIS server uses stemming on the evidence term by selecting or deselecting the
Use stemming checkbox. This option is automatically disabled and grayed out if you selected
Any language as the category language.
5. If the evidence term is a multi-word phrase, specify whether CIS server recognizes the words in
any order by selecting or deselecting the Recognize words in any order checkbox.
If the checkbox is not selected, CIS server recognizes the phrase only when the words appear in
exactly the order they are entered here.
6.
a. Deselect the Have the system automatically assign the confidence (HIGH) for me checkbox.
A pair of option buttons appears for setting the confidence level.
b. To select one of the system-defined confidence levels, click the System Defined Confidence
Level button and select a level from the drop-down list box. The system-defined levels are
described in About confidence values and score thresholds, page 44.
c. To set a custom confidence level, click the Custom Confidence Level button and enter a
number between 0 and 100 in the text box.
7.
8.
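The difference between the two settings of the Recognize words in any order checkbox can be illustrated with a small Python sketch. It is only a toy model: the function names are hypothetical, stemming is not modeled, and treating "any order" as "the same words, adjacent, in any permutation" is an assumption about behavior that this guide does not specify. Matching is case-insensitive.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (case-insensitive matching)."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(text, phrase, any_order=False):
    """Return True if the phrase occurs in text.

    With any_order=False the words must appear in exactly the order given.
    With any_order=True any permutation of the adjacent words is accepted
    (an assumed interpretation of the checkbox).
    """
    words = tokenize(text)
    terms = tokenize(phrase)
    if not terms:
        return False
    size = len(terms)
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if window == terms:
            return True
        if any_order and sorted(window) == sorted(terms):
            return True
    return False
```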
Managing taxonomies
When you create a taxonomy, it is offline by default. Offline taxonomies are available under the
Administration > Content Intelligence node for designing and building, but are not available for
users to see. To make the taxonomy available to users, you bring it online.
When you create or modify any part of a taxonomy, you need to make it available to CIS server
so that CIS server can use the new or updated taxonomy and category definitions to categorize
documents. This process is called synchronization.
Both of these operations are available for complete taxonomies only, not individual categories
or portions of the hierarchy.
2. Select the taxonomy you want to make available, then select View > Properties > Info.
3. When the properties page for the taxonomy appears, select the Attributes tab.
4.
5. Click OK.
The taxonomy now appears to users under the Category node and is available for categorization.
2. Select the taxonomy you want to take offline, then select View > Properties > Info.
3. When the properties page for the taxonomy appears, select the Attributes tab.
4.
5. Click OK.
The taxonomy is no longer visible to users. Existing documents remain in the categories.
Synchronizing taxonomies
The taxonomy and category definitions you create are saved in the repository. When you create or
modify any part of a taxonomy, you need to make it available to CIS server so that CIS server can
use the new or updated taxonomy and category definitions to categorize documents. This process
is called synchronization. Updates to the taxonomy are not reflected in automatic processing until
you synchronize them.
Note: If any of the categories in a taxonomy include links to categories in other taxonomies, all
related taxonomies must be synchronized to avoid possible errors.
2.
3.
4.
5.
The synchronization process starts, and the list of taxonomies reappears. If you receive any errors
or warnings, refer to the error log on CIS server for details. See the Content Intelligence Services
Administration Guide for information.
6.
To check the status of the synchronization process, click the View Jobs button at the bottom of
the page.
When the synchronization is complete, a message indicating its success or failure is sent to
your Documentum Inbox.
Deleting taxonomies
Deleting a taxonomy removes all categories within that taxonomy, except for categories
that are linked into other taxonomies. All assignments to those categories are also removed, although
the documents themselves are not.
To delete a taxonomy:
1.
2.
3.
4.
Processing documents
When your taxonomies and their category definitions are in place, you are ready to categorize
documents. Content Intelligence Services supports both automatic categorization, where CIS server
analyzes documents and assigns them to appropriate categories, and manual categorization, where a
person assigns documents to categories.
Documentum Administrator enables you to review the results of either type of categorization, and
to manually adjust them if necessary. For documents that CIS server could not definitively assign
to particular categories, category owners use Documentum Administrator to approve or reject the
candidate documents.
reviewing the results of a test run, you can clear the proposed categorizations, update the category
definitions, and run the test again. When CIS server is properly categorizing documents, you can
bring the taxonomy online to put it into production.
In production mode, CIS server updates documents and the repository based on the results of its
categorization. The nature of the updates depends on which configuration options are active: if Link
to Folders is active, CIS server links documents into the folders corresponding to the categories,
and if Assign as Attribute is active, CIS server writes the names of the assigned categories into each
document's attributes. Refer to Modifying Content Intelligence Services configuration, page 49 for
details about setting the options.
You can perform test processing on a separate CIS server from your production server. Offloading
test processing from the production server prevents your tests from competing for resources with
the production system. See Modifying Content Intelligence Services configuration, page 49 for
information about specifying the test and production servers.
You can view the documents assigned to a category either after a test processing or after a production
processing.
Navigate to the category for which you want to see the assigned documents. (Do not select the
category.)
2.
Select View > Page View > Test view to display the results of the category assignments after
a test run.
3. Repeat the previous step, selecting Production view instead, to return to the production view.
2.
Select File > New > Document Set to create a new document set, or select the document set you
want to modify then select View > Properties > Info.
The properties page for document sets appears.
3.
4. Select the document set language. The selected language must match the language of the
categories and taxonomies used for the classification. Documents are never assigned to
a category of a different language.
If the language of the document set is not defined, the language set for the document is used. If
no language is set for the document, the CIS server default language is used.
5.
6.
To include documents from a specific folder, click the Select link next to Look in: and navigate to
the folder containing the documents to process.
When you click OK after selecting the folder, the folder appears next to the Look in label.
7.
To specify the object type of the documents selected for processing, click the Select link next
to Type: and select the object type.
When you click OK after selecting the object type, the type name appears next to the Type label.
8.
The Properties checkbox is already selected to assign documents based on their attributes. Enter
the criteria used to select documents.
a.
b. From the drop-down list in the middle, select the operator to use to compare the selected
attribute to the test value.
The available operators differ depending on the attribute you selected in the previous step.
c.
Enter the value to test against in the text box on the right.
d. To add an additional condition, click the Add Property button and repeat steps a through c.
The document set will include only those documents whose attributes meet all of the
conditions.
9.
10. By default, the schedule is set to Inactive. To define a schedule, set the document set schedule
to Active.
An active document set is run according to its defined schedule. An inactive document set is
not run, and the remaining scheduling controls are grayed out.
11. For active document sets, specify when the documents in the set should be submitted to CIS
server for processing.
a.
Click the calendar icon next to the Start Date field to select the day on which the documents
will be first submitted to CIS server.
b. Set the time of day for the first run by selecting numbers from the Hour, Minute, and
Second drop-down lists.
The Hour setting uses a 24-hour clock.
c.
Specify how often this document set submits documents to CIS server by entering a number
in the Repeat box and picking the units (minutes, hours, days, weeks, or months) from
the drop-down list.
Each time the document set runs, it submits only new or revised documents to CIS server.
12. Click one of the Processing Mode option buttons to indicate whether to run this document set
in production mode or test mode.
See Test processing and production processing, page 69 for information about production and
test modes. Selecting the mode also determines which CIS server processes the document set: the
production server or the test server.
13. If you chose Test at step 12, click Select Taxonomy and select a taxonomy to run the test against.
For a test run, you can have CIS server only consider the categories in the taxonomy you are
testing. The taxonomy does not need to be online. For a production run, all synchronized
taxonomies are used for the classification.
14. Click OK to close the properties page.
15. Synchronize the document set to make it available to CIS server.
a.
16. To check the status of the synchronization process, click the View Jobs button at the bottom of
the page.
When the synchronization is complete, a message indicating its success or failure is sent to
your Documentum Inbox.
17. To view the documents that the document set will submit to CIS server, click the name of the
document set on the list page.
Documentum Administrator runs the query from the Document Set Builder tab and displays the
documents in the result set.
Note: Deleting a document from this page removes it from the repository, not just from the
document set.
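Two details of the procedure above lend themselves to a short Python sketch: the language fallback described in step 4 and the repeat schedule defined in step 11. The function names are hypothetical and nothing here calls a CIS API; the sketch only restates the rules in executable form. The months unit is left out on purpose, because a month has no fixed length and this guide does not say how CIS computes monthly repeats.

```python
from datetime import datetime, timedelta


def effective_language(docset_language, document_language, server_default):
    """Apply the fallback order: document set, then document, then server."""
    return docset_language or document_language or server_default


# Fixed-length repeat units only (months deliberately omitted, see above).
UNIT_TO_DELTA = {
    "minutes": timedelta(minutes=1),
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
    "weeks": timedelta(weeks=1),
}


def first_runs(start, repeat, unit, count=3):
    """Return the first `count` submission times of an active document set."""
    step = UNIT_TO_DELTA[unit] * repeat
    return [start + step * i for i in range(count)]


# Start at 23:30 (24-hour clock) and repeat every 2 hours.
for run in first_runs(datetime(2011, 1, 1, 23, 30, 0), 2, "hours"):
    print(run.isoformat())
```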
2.
2.
3.
4.
5.
2.
3. Navigate to the category to which you want to assign the document (in the node Administration >
Content Intelligence > Taxonomies). If you have not already done so, switch the page view to Production view.
The list of documents belonging to the selected category in Production view is displayed.
4. Select Edit > Assign here. The document is assigned to the category, and its status is set to
assigned_manual.
If the option Link assigned documents into category folders is enabled, a relationship is created
between the document and the category folder corresponding to the selected category.
If the option Update document attributes with category assignments is enabled, the name of the
category is added as a value of the keyword attribute for the document.
Pending documents are only available in Production mode, that is, when CIS server is configured as
the production server and not the test server.
Documents receive Pending status when the confidence score that CIS server assigns to the document
is higher than the category's candidate threshold but less than its on-target threshold. When you
approve or reject a Pending document assignment, CIS server saves this information and does not ask
you to approve or reject it again (unless you clear assignments).
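The threshold rule for Pending status can be written as a small Python sketch. The status labels and the function name are hypothetical, and the handling of scores that exactly equal a threshold is an assumption, since the text above only covers the strictly-between case.

```python
def assignment_status(score, candidate_threshold, on_target_threshold):
    """Classify a confidence score against a category's two thresholds."""
    if score >= on_target_threshold:
        # Assumption: reaching the on-target threshold assigns the document.
        return "assigned"
    if score > candidate_threshold:
        # Higher than the candidate threshold, less than on-target.
        return "pending"
    return "not assigned"
```

For example, with a candidate threshold of 50 and an on-target threshold of 80, a score of 60 falls between the two and yields a Pending assignment that the category owner must approve or reject.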
2.
Select My Categories with pending documents from the drop-down list in the upper right.
With this option selected, the list displays only categories that have Pending documents.
3.
Click the category Name to display the complete list of documents assigned to the category, or
click the value in the Total Candidates column to display only the Pending documents.
The list of assigned documents and their assignment status appears.
4.
5.
To approve the document in this category, select Tools > Content Intelligence > Approve and
click OK on the confirmation page that appears.
If you are only viewing the Pending documents, the approved document disappears from the
current view because it is no longer a candidate.
6. To reject the suggested categorization, select Tools > Content Intelligence > Reject Candidate
and click OK on the confirmation page that appears.
The document disappears from the current view because it is no longer a candidate.
7.
Repeat steps 3 through 6 for each candidate document in categories for which you are the
category owner.
Clearing assignments
You can clear assignments at the taxonomy level or at the category level. At the category level, you
can clear only the documents in that category, or those in the category and all of its children.
You can also clear the assignments for all documents belonging to a document set or for a single
document.
Clearing assignments is most common when running in test mode. If you clear assignments
made in production mode, any record of the category owner's approval or rejection of a proposed
assignment is also lost. As a result, CIS server may ask the category owner to approve or reject
category assignments again.
2.
Navigate to the category whose assignments you want to clear and select it.
3.
4.
a. Click one of the Clear assignments with status option buttons to indicate whether to clear all
assignments, only pending assignments, or only complete assignments.
b. Click one of the Clear assignments with type option buttons to indicate whether to clear test
assignments, active assignments, or both.
5.
To clear the assignments in all subcategories, select the Include subcategories? checkbox.
If the checkbox is not selected, only assignments in the current category are cleared.
6.
Click OK.
2.
Navigate to the document set whose assignments you want to clear and select it.
3.
4.
a. Click one of the Clear assignments with status option buttons to indicate whether to clear all
assignments, only pending assignments, or only complete assignments.
b. Click one of the Clear assignments with type option buttons to indicate whether to clear test
assignments, active assignments, or both.
5.
Click OK.
2.
Navigate to the document whose assignment you want to clear and select it by clicking the
checkbox next to its name.
3.
Compile a set of test documents and submit them to CIS server. The test set should include
representatives of the various types of documents you will be processing with Content Intelligence
Services. When processing is complete, review the resulting categorization. If CIS server does not
assign some documents to the categories you expect it to, you may need to revise the category
thresholds or the evidence associated with the categories.
If a document appears in a category it should not, it means that the evidence for that category is too
broad: consider adding additional terms. If a document does not appear in a category that it should,
it means that the evidence is too restrictive.
The rule of thumb is: Make the category definition simple and test it with your documents. If it works
in most cases leave it alone. If there are problems recognizing a category and more differentiating
data is necessary, then use compound terms as described in the topics of this section.
It is also possible to define patterns to match specific terms like phone numbers or social security
numbers. The Content Intelligence Services Administration Guide provides the detailed procedure
for defining patterns.
Selecting terms
The biggest challenge when defining categories is selecting the proper terms to serve as evidence
for them. If you define a category using only terms that are unique to that category, CIS server will
not recognize the category in documents that relate to it in an indirect way. On the other hand, if
you choose common words as evidence terms, CIS server may recognize the category when the
document does not in fact belong in it.
The challenge is to create category definitions that are just complete enough to trigger category
recognition without introducing ambiguity. It is just as important to keep misleading terms out of
category definitions as it is to make sure that all viable terms are included. You might think that OR is
a viable term as part of the definition of Oregon, but OR crops up in so many other contexts that OR
should not be part of the definition of Oregon.
Note: CIS server is not case sensitive for evidence terms. OR matches OR, Or, and or.
Navigate to the category whose properties you want to update. Select the category and then
select View > Properties > Info.
2.
3.
4.
To change the category owner, click the Select owner link and choose the new owner.
The standard page for selecting a user appears. The category owner is the user who can approve
or reject documents assigned to the category as a candidate requiring approval from the category
owner; see Reviewing categorized documents, page 73 for information about the document
review process.
5.
To change the category class, choose the category class from the drop-down list.
The category class determines default behavior for the new category as well as the document
attribute to which CIS server adds the category name if you are using the Assign as Attributes
option.
6.
7.
Click OK.
The property page closes.
Navigate to the category whose evidence you want to update and click the icon to display
the rules page.
2. Click the Add new compound evidence link to add a completely new compound term, or click the
Add additional term link next to a simple term that you want to convert into a compound term.
The Evidence page appears. It looks the same as the Evidence page for a simple term, except
that Prev, Next, and Finish buttons appear in place of the OK button at the bottom of the
page. These buttons enable you to navigate between the Evidence pages for each of the terms
that make up the compound term.
3.
Set the evidence properties for one of the simple terms in the compound term.
Follow steps 1 through 6 of the procedure for defining a simple term. The only difference when
defining part of a compound term is that the default system-assigned confidence level is Low.
4.
Click Next and repeat step 3 to add additional terms, or click Finish (or OK if you are converting
a simple term) to complete the compound term.
When you click Next, another instance of the Evidence page appears. The page title shows
which term you are now defining and the total number of evidence terms in the compound term
(Compound Evidence Term X of Y).
When you click Finish or OK, the individual terms of the compound term appear on a list page.
Click the Back to Rules Summary link to display again the Rules page of the category.
Click the
A list page appears with each individual term in the compound in a separate row.
2.
To add an additional term to the compound, select File > New > Evidence and set the evidence
properties for the new term.
Follow the procedure for defining a simple term. The only difference when defining part of a
compound term is that the default system-assigned confidence level is Low.
4.
To remove one or more terms from the compound, select the checkboxes next to the terms and
select File > Delete.
If removing the selected terms will result in only a single term remaining, a page appears asking
whether you want to convert the remaining term to a simple term or delete it as well.
Part 3
Configuration
Chapter 6
Configuring the Type of Content Processed
This chapter describes how to configure CIS to analyze Documentum object attributes in addition to,
or instead of, the content of the documents. This configuration is referred to as attribute processing.
Principles
By default, CIS analyzes the textual content of the documents to find the concepts defined in the
taxonomies or to extract entities. You can change the default behavior and have CIS analyze the
values of Documentum object attributes in addition to, or instead of, the content of the documents.
It is also possible to define a specific behavior for each document set. This configuration can only
be done through configuration files in the repository, not through the Content Intelligence node in
Documentum Administrator.
There are two main types of configuration files:
The default configuration file: default.properties, enables you to define the type of default
processing you want (text only, attributes only, or both) and contains the list of default attributes
to use if no attributes are defined in the custom configuration file.
Custom configuration files enable you to define the type of processing and the attributes to use if
need be. You can create a custom configuration file for:
A specific document set processing.
Queue-based processing.
Interactive processing.
There is one configuration file per document set, whereas queue-based processing and interactive
processing each use a single file. The name of the properties file depends on the type of processing.
The following section details the file names.
Create the configuration files as needed. To do so, you may reuse the content of the sample files.
If they do not exist, refer to the examples at the end of the procedure.
3.
Processing type             File name
Default (any processing)    default.properties
Document set processing     set_<docset_type>_<docset_name>.properties
Queue-based processing      queueproc.properties
In the document set file name, <docset_type> is repo for repository document sets or file
for file-based document sets (for CenterStage deployments only), and <docset_name> is the
name of the document set.
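The naming convention can be summarized in a few lines of Python. This is a minimal sketch: the function name and the processing labels ("default", "docset", "queue") are hypothetical, and the file name for interactive processing is not covered because it does not appear in the list above.

```python
def properties_file_name(processing, docset_type=None, docset_name=None):
    """Build the configuration file name for a type of processing."""
    if processing == "default":
        return "default.properties"
    if processing == "queue":
        return "queueproc.properties"
    if processing == "docset":
        # repo = repository document sets, file = file-based document sets
        if docset_type not in ("repo", "file"):
            raise ValueError("docset_type must be 'repo' or 'file'")
        if not docset_name:
            raise ValueError("docset_name is required for a document set")
        return f"set_{docset_type}_{docset_name}.properties"
    raise ValueError(f"unsupported processing type: {processing}")
```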
In this example, the processing is done by default on the attributes subject, object_name, and authors,
and on the content of the documents.
# Possible values: attributes_and_text, text_only and attributes_only
defaultInputSource=attributes_and_text
# List all attribute names to extract when processing with an input
# source of attributes_and_text or attributes_only.
# Non existing attributes on an object are ignored (is not an error)
defaultAttributes=subject, object_name, authors
In this example, the processing is done for a specific document set (or the queue, or an API) on the
attributes subject, authors, and product_id, but not on the content of the documents.
# Possible values: attributes_and_text, text_only and attributes_only
specificInputSource=attributes_only
# List of attribute names to extract in replacement of the attributes
# defined by defaultAttributes in the default configuration file.
specificAttributes=subject, authors, keywords
# List of attribute names to extract in addition to the attributes defined
# in defaultAttributes in the default configuration file or in
# specificAttributes above.
addedAttributes=product_id
# List of attribute names not to extract (assuming they were
# previously defined in defaultAttributes in the default configuration
# file, in specificAttributes or in addedAttributes above).
removedAttributes=keywords
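The way the four attribute properties combine can be checked with a short Python sketch. It is an illustration under stated assumptions, not the CIS implementation: the parser handles only simple key=value lines and # comments, the function names are hypothetical, and the order of operations (replace, then add, then remove) is inferred from the comments in the sample file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def split_list(value):
    """Split a comma-separated property value into clean names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def effective_attributes(default_props, custom_props):
    """Resolve the attributes that are analyzed for a custom configuration."""
    # specificAttributes replaces defaultAttributes when it is defined.
    names = split_list(custom_props.get("specificAttributes", ""))
    if not names:
        names = split_list(default_props.get("defaultAttributes", ""))
    # addedAttributes extends the list.
    for name in split_list(custom_props.get("addedAttributes", "")):
        if name not in names:
            names.append(name)
    # removedAttributes is applied last.
    removed = set(split_list(custom_props.get("removedAttributes", "")))
    return [name for name in names if name not in removed]


default_props = parse_properties("defaultAttributes=subject, object_name, authors")
custom_props = parse_properties(
    "specificAttributes=subject, authors, keywords\n"
    "addedAttributes=product_id\n"
    "removedAttributes=keywords\n"
)
print(effective_attributes(default_props, custom_props))
```

With the two sample files above, the sketch resolves to subject, authors, and product_id, which matches the attributes named in the description of the example.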
Chapter 7
Configuring Document Sets
2.
To configure all document sets with the same entities, edit the default configuration:
i.
Locate the default.xml configuration file. This configuration will also apply to future
document sets, that is, to any new space created in CenterStage.
ii. Locate the <docset-default type="file"> element. The changes made in this element apply
to all document sets created for CenterStage spaces.
Or
Locate the <docset-default type="repo"> element. The changes made in this element
apply to all document sets created in Documentum Administrator.
b. To configure only one document set, create a configuration file:
i.
ii. Locate the file space_docset_list.txt. This file lists all the document sets created
for CenterStage spaces. The first column indicates the space name, the second column
the space ID, and the third column the configuration file name for this document
set, such as <space_name>_<space_ID>.xml.
iii. Rename the configuration file using the file name indicated for this document set / space
in the list.
Caution: The configuration at the document set level overrides the configuration
made in the default.xml file for each section that it defines: <analysis-plan>,
<entity-detection>, <classification>, and <storage>. This means that, for example, you
can customize only the extraction of entities (and not the classification) for a given
document set.
For example, in the previous screenshot, the file annette_4_0b1109b680036a88.xml is likely
a configuration file for a space document set called annette.
The files default.xml and docset-sample.xml are available in Appendix C, Document
Set Configuration Files, of this guide.
Note that it is not necessary to restart CIS after the modification of a document set file; changes are
applied dynamically.
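The section-level override described in the Caution above can be modeled with plain dictionaries. This is a conceptual Python sketch only: it does not read default.xml, the section contents are placeholder strings, and the function name is hypothetical.

```python
# Sections that a document set file can override, as listed in the Caution.
OVERRIDABLE_SECTIONS = ("analysis-plan", "entity-detection", "classification", "storage")


def effective_sections(default_sections, docset_sections):
    """Start from the defaults and replace only the sections the
    document set configuration defines."""
    merged = dict(default_sections)
    for name in OVERRIDABLE_SECTIONS:
        if name in docset_sections:
            merged[name] = docset_sections[name]
    return merged


defaults = {
    "analysis-plan": "default plan",
    "entity-detection": "default entities",
    "classification": "default classification",
    "storage": "default storage",
}
# A document set that customizes only the extraction of entities.
custom = {"entity-detection": "custom entities"}
print(effective_sections(defaults, custom))
```

In this model, the classification and storage settings still come from the default configuration, because the document set file does not define those sections.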
The following table describes the elements that can be used in the configuration files.
Table 5. Descriptions of the xml elements in document set configuration files
Description
<docset-defaults>
<docset-default type>
<analysis-plan>
<classification-step/>
<entity-detection-step/>
<metadata-extraction-step/>
<entity-detection>
Contains the elements that define the entity types to be extracted by the
<entity-detection-step/> processing.
Possible children are one or more <analysis> elements.
<analysis name>
<entity>
<builtin-entity>
Used to define the default entities extracted by CIS for CenterStage clients.
Table 49, page 231 describes these default entities.
<entity levels>
<classification>
<repositorytaxonomy>
<metadata-extraction>
<rule-set>
<metadata>
Defines the name of the extracted metadata as it appears in the rule set.
<storage>
<analysis>
2.
Create the configuration file for the document set as described in To edit the configuration file of
the document sets:, page 87.
3.
4.
Where
The name attribute in the <analysis> element can be any name. It is reused later to define the
way the metadata element values are stored.
The value of the <rule-set> element is the name of the rules file without the format extension.
The value of the <metadata> element is the value of the name attribute of the <SetMetadata
name=""> element in the rules file.
5.
Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.
Other examples of document set configuration for the other types of processing are available: To
configure the classification for CenterStage spaces:, page 230 and To configure the document sets for
new entity types:, page 232.
On the CIS server machine, locate the convert_docset_configuration.bat file (on Windows hosts,
or convert_docset_configuration on Linux hosts); it can be found at <CIS installation directory>/bin.
2.
Part 4
Entity Extraction
This part describes entity extraction processing, which is one of the three types of content
analysis: extraction of entities, extraction of metadata, and classification.
It includes the following chapters:
Chapter 8, Entity Extraction
Chapter 9, Configuring Entity Extraction
Chapter 8
Entity Extraction
Entities are pieces of information identified in context by text analysis. Entities are extracted when the
information they convey is relevant to end users.
The extraction of entities performed by CIS is exposed in CenterStage clients. The entities are
available as filters when navigating or running a search. Entities can also be stored as annotations
and accessed using the Annotation API.
To extract entities, CIS relies on an entity extraction server, currently Temis Luxid, with a text analysis
cartridge. The cartridge contains extraction rules and dictionaries to allow the identification of
entities in various languages. The entity extraction server launches the extraction processes when
triggered by the automatic scheduling of the CIS server. The CIS server collects returned entities, and
stores them for the corresponding documents.
By default, the entity extraction server identifies the following entities:
People: names of people.
Company: companies, including organizations and media.
Place: geographical locations.
You can customize the entity extraction to extract other types of entities.
This chapter describes:
The installation of the entity extraction server
The entity extraction process
Chapter 9
Configuring Entity Extraction
In CenterStage deployments, the entity extraction does not require much configuration. There is no
document set configuration. The document sets are automatically defined, based on the spaces in
CenterStage, and the CIS server maintains one document set per space. Every half hour, the CIS
server automatically checks for new spaces in the repository and the entity extraction runs for all
spaces on new and modified objects. It is possible to modify the default scheduling by setting the
property cis.server.centerstage.interval as described in Table 2, page 28.
Apart from CenterStage deployments, you can also extract entities from the documents in the
repository. To do so, perform the following steps:
1.
Create a document set in DA, as described in Defining document sets, page 70.
2.
Configure the document set to store the entities as annotations, as described in Chapter 7,
Configuring Document Sets.
3.
Access the annotations using the Annotation API, as described in Chapter 16, Annotation API.
Start or stop the services using the icon in the notification area.
To start entity extraction services:
In the notification area, right-click Temis Luxid Started
If you select Quit instead of Stop, the Temis Luxid Started icon is no longer available in the
notification area.
To restore the Luxid icon:
On Windows hosts, select Start > All programs > Startup > Luxid Starter.
On Linux hosts, navigate to $DOCUMENTUM_SHARED/cis/Temis/Luxid/AnnotationFactory/adminserver/bin/
and run the LuxidStarter script.
If the icon in the notification area does not start the services correctly, on Windows hosts, you can
start the services manually as described in the following procedure.
2.
On the machine where CIS is installed, navigate to the installation folder for the entity extraction
server. By default, it is:
C:\Program Files\Documentum\CIS\Temis\Luxid on Windows hosts
$DOCUMENTUM_SHARED/cis/Temis/Luxid on Linux hosts
2.
3.
4.
5.
6.
7.
8.
Select an option for Luxid Annotation Factory icon location, then click Next.
9.
Click Choose and select the license file LAFLicense.txt provided by TEMIS, then click Next.
10. In Luxid Annotation Server address, enter the IP address of the main server host (the machine
on which CIS is installed), then click Next.
11. Check your installation parameters, then click Install.
12. Start the services as described in Manage entity extraction services, page 97. You can configure
the services to start automatically on system reboot.
13. On the machine where CIS is installed, modify the property cis.entity.luxid.annotation_server.cpu
in cis.properties, as described in Table 2, page 28, to indicate how many CPUs to use. You cannot
specify the number of CPUs for each machine.
It takes five to ten minutes for the main entity extraction server to detect the new node.
Stop Luxid:
On Windows hosts, stop the service Documentum CIS Luxid IDE Server V2.
On Linux hosts, in the notification area, right-click the Temis Luxid Started icon and select
Stop Luxid.
2.
3.
b. Open the customization file with an XML editor. Select the file depending on the type
of entities that you want to add.
Table 6. Customization files for entities
Entity type    Customization file
Company        Company-external-lex.scp
People         Person-external-lex.scp
Place          Location-external-lex.scp
c.
To add entities, locate the macro elements corresponding to this entity type and add the
entities in the <e></e> child element following the guidelines provided in this procedure.
Perform the steps corresponding to the type of entities you want to add:
Add a Company entity:
1.
2.
2.
2.
searchon="form" case="preserveFirst"
display="no">
2.
When you add only the first name or only the last name, the entity is extracted only if it
matches a pattern used by the extraction engine. For example, if the first name appears
only once in a document and the context does not make it possible to detect that it is a
first name, the entity is not extracted.
Add a Place entity:
To add a country:
1.
2.
Under the macro child element for the corresponding continent, add entities in the
<e></e> child element, for example:
<macro name="Africa">
<e>
Sampleland
| Sampletwoland
</e>
</macro>
To add a city:
1.
2.
3.
The entities in the <e></e> child element must comply with the following guidelines:
After the first entry, start each entry with a separator | (vertical bar).
In multi-word entries, separate each word with a slash /.
Write the ampersand character and the angle brackets in their XML-encoded form:
Table 7. Special characters in customization files
Special character    XML encoding
&                    &amp;
<                    &lt;
>                    &gt;
You can use regular expressions. As a consequence, to match the following characters literally,
escape them with a backslash: the period (.), the asterisk (*), the question mark (?), the plus
sign (+), and the exclamation mark (!).
Do not add the company extensions such as Inc. or Corp. The extensions are automatically
analyzed during the entity extraction.
By default, entity matching is case-sensitive only on the first character of the first word.
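The guidelines above can be applied mechanically when a list of names is prepared outside the customization file. The following Python sketch is illustrative only: the helper names format_entry and build_e_body are hypothetical and are not part of CIS or Luxid, and the line layout of the output is an assumption modeled on the Africa example.

```python
from xml.sax.saxutils import escape

# Characters that the guidelines say must be escaped with a backslash.
REGEX_SPECIALS = ".*?+!"


def format_entry(name):
    """Format one entity name for an <e></e> element (hypothetical helper)."""
    # Escape the regular-expression characters listed in the guidelines.
    escaped = "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in name)
    # Join the words of a multi-word entry with a slash.
    joined = "/".join(escaped.split())
    # Encode &, <, and > as XML entities.
    return escape(joined)


def build_e_body(names):
    """Return the text to place between <e> and </e>, one entry per line."""
    entries = [format_entry(name) for name in names]
    if not entries:
        return ""
    # Every entry after the first starts with the | separator.
    return "\n".join([entries[0]] + ["| " + entry for entry in entries[1:]])


if __name__ == "__main__":
    print(build_e_body(["Sampleland", "Sample Two Land", "Smith & Sons"]))
```

The sketch does not remove company extensions such as Inc. or Corp.; leave them out of the input list.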
4.
In your Environment Variables, add the following path in the Path system variable:
<installation path for the entity extraction server>/AnnotationFactory/jre/bin
such as
C:\Program Files\Documentum\CIS\Temis\Luxid\AnnotationFactory\jre\bin
5.
6.
Locate the script jscc.exe (on Windows hosts, or jscc on Linux hosts) at:
<installation path for the entity extraction
server>/AnnotationFactory/IDE/bin
Restart Luxid:
On Windows hosts, start the service Documentum CIS Luxid IDE Server V2.
On Linux hosts, in the notification area, right-click the Temis Luxid Started icon and select
Start Luxid.
2.
3.
Add each word or phrase on a new line taking into account the following guidelines:
Words or phrases are case-insensitive.
The spaces at the beginning or at the end of the line are not taken into account.
Regular expressions are not allowed.
Comments are allowed when beginning with a pound sign (#).
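To check a word list against these guidelines before deploying it, a small reader such as the following Python sketch can help. It is only an approximation of the rules listed above: the function name parse_word_list is hypothetical, and skipping blank lines is an assumption, since the guide does not say how the engine treats them.

```python
def parse_word_list(text):
    """Return the normalized words or phrases found in a word-list file."""
    words = []
    for line in text.splitlines():
        # Spaces at the beginning or end of the line are not significant.
        entry = line.strip()
        # Skip comments (lines beginning with #) and, by assumption, blank lines.
        if not entry or entry.startswith("#"):
            continue
        # Entries are case-insensitive, so normalize to lowercase.
        words.append(entry.lower())
    return words


if __name__ == "__main__":
    sample = "# sample list\n  Big Blue  \n\nAcme\n"
    print(parse_word_list(sample))
```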
4.
Part 5
Classification
This part describes classification, which is one of the three types of content analysis:
entity extraction, metadata extraction, and classification.
It includes the following chapters:
Chapter 10, Classification Process
Chapter 11, Configure CIS Standard Classification
Chapter 12, Use the Taxonomy Exchange Format (TEF)
Chapter 10
Classification Process
is not automatic but manually triggered by the CIS administrator. When a CIS server is set for
both Test and Production modes, two versions of the taxonomy snapshots are available: one for the
production mode and one for the test mode. The taxonomy snapshots are stored in the repository.
When the CIS server restarts, it checks the validity of the snapshots: if taxonomies have been
removed, the corresponding snapshots are deleted. However, note that if taxonomies have been
modified, the snapshots are only updated by a manual synchronization.
The document set definitions are read by the CIS server on CIS server restart. When a document set
definition is modified, the CIS server reads the updated definition when the document set is restarted
or when it is synchronized.
Resubmit documents
Sometimes you want the CIS server to reanalyze documents, for example, after you have modified
the documents of a document set. You can then resubmit the documents using one of the
procedures described in the previous section, Submit documents on demand, page 109.
If a submission schedule has been defined for the document set containing the modified documents,
then they will be automatically reprocessed the next time the schedule runs. When a schedule process
starts, it automatically retrieves the last update of the document set and it checks whether the version
of the documents has been modified since its last run.
If the documents have not been modified, the CIS server does not start any new process. To
process documents again against a new taxonomy, first use the Clear assignments function on
the document set.
It identifies the evidence the CIS server looks for to determine whether a document includes the
category or concept. The confidence values assigned to the various pieces of evidence determine
when the CIS server signals a hit for the category.
When the CIS server analyzes a document, it automatically computes a score for each keyword and a
score for the whole expression. The expression score is then compared to the category thresholds
to define the document as:
not assigned if the candidate threshold is not reached.
candidate or pending if the candidate threshold is reached but not the on-target threshold.
assigned if the on-target threshold is reached.
Finally, the CIS server compares the document score with the category thresholds to determine
whether the document is assigned to the category, left as a pending candidate, or not assigned.
Each evidence term in the category definition has a confidence value assigned to it. The confidence
value specifies how certain the CIS server can be about scoring a hit for a document when it contains
the term. For example, if a document includes the text IBM, the CIS server can be nearly certain that
the document relates to the category International Business Machines. Therefore, the confidence
level for the term IBM is High.
Other pieces of evidence may suggest that the category might be appropriate. For example, if a
document includes the text Big Blue, the CIS server cannot be certain that it refers to International
Business Machines. The confidence level is Low, meaning that the CIS server should score a hit
for the category International Business Machines only if it encounters the text Big Blue and other
evidence of the same category in the document.
You can also exclude evidence terms. For example, suppose you have a category for the company
Apple Computers. The term Apple is certainly evidence of the category. However, if the term
fruit appears in the same document, you can be fairly sure that Apple refers to the fruit and not
the company. To capture this fact, you would add fruit as an excluded evidence term to the Apple
Computers category.
Finally, you can define terms as required terms. In this case, the document must contain at least one
Required term. If only Required terms are defined for the category, then only one is sufficient to
assign the document to the category. If the evidence terms are not only Required terms, then the
document must contain one Required term and have a confidence score high enough for the category.
The confidence values for evidence terms are integers from 0 through 100.
When you set confidence values in Documentum Administrator, you can choose a predefined
confidence level or enter a number directly. The predefined values are:
High: Equivalent to the confidence level 75.
Medium: Equivalent to the confidence level 50.
Low: Equivalent to the confidence level 15.
Supporting: This evidence by itself does not cause the CIS server to score a hit for a document.
However, it increases the confidence level of other evidence found in the same document.
Exclude: If one of the evidence terms found in a document has this confidence level, then the
document will never be assigned to this category.
Required: These terms are must-have terms but they are not taken into account for the score
of a document.
If the resulting score exceeds or meets the on-target threshold of a category, the CIS server assigns the
document to the category. If the score is lower than the on-target threshold but higher than or equal to
the candidate threshold, the CIS server assigns the document to the category as a Pending candidate;
the category owner must review and approve the document before the assignment is complete. If the
score falls below the candidate threshold, the CIS server does not assign the document to the category.
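The threshold comparison described in the preceding paragraph can be summarized as a short Python sketch. It covers only the final comparison, not how the CIS server computes the score; the function name and the threshold values in the usage lines are hypothetical.

```python
def assignment_status(score, candidate_threshold, on_target_threshold):
    """Map a document score to the assignment status described in this guide."""
    if score >= on_target_threshold:
        # The score meets or exceeds the on-target threshold.
        return "assigned"
    if score >= candidate_threshold:
        # The score reaches the candidate threshold only: the category
        # owner must review the document.
        return "pending"
    return "not assigned"


if __name__ == "__main__":
    # Hypothetical thresholds: candidate 40, on-target 70.
    for score in (85, 55, 20):
        print(score, assignment_status(score, 40, 70))
```

When both thresholds are set to the same value, the sketch never returns "pending", which matches the statement that documents are then either categorized automatically or rejected.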
Stemming capability
CIS is fully Unicode-compliant, which enables the classification of documents written in any language.
Content Intelligence Services supports stemming for documents in English, French, German,
Spanish, Italian, Portuguese, Danish, Dutch, Norwegian, Swedish, Romanian, Russian, Finnish,
Hungarian, or Turkish. Stemming is the ability to recognize that fishing, fished, fish, and
fisher have the same root word, fish. The language dictionaries are embedded in CIS; you do
not need to download them separately.
Note: The CIS server processes documents using the stemming capability only in the configured
language.
Stemming mechanism
You need to supply CIS with the initialization data it requires. This initialization data includes the
following:
The language set at the category level and at the document level.
The indication of whether the stemming is activated or not.
Default settings are in the cis.properties file as shown in the following example.
Example 10-1. Extract of the cis.properties file
# The default language of the linguistic engine.
cis.linguistic.language.default=english
# Whether the word stemming feature is allowed globally (true/false).
cis.linguistic.stemming.allowed=true
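As a quick way to inspect these two settings outside CIS, the following Python sketch reads simple key=value lines. It is a minimal reader written for this example, not the parser used by CIS, and it handles only the subset of the properties syntax shown in the extract (comment lines starting with # and single-line key=value pairs).

```python
def read_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


SAMPLE = """\
# The default language of the linguistic engine.
cis.linguistic.language.default=english
# Whether the word stemming feature is allowed globally (true/false).
cis.linguistic.stemming.allowed=true
"""

settings = read_properties(SAMPLE)
default_language = settings["cis.linguistic.language.default"]
stemming_allowed = settings["cis.linguistic.stemming.allowed"] == "true"

if __name__ == "__main__":
    print(default_language, stemming_allowed)
```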
The language used for the stemming should be defined for the documents and for the categories.
When no language is defined, either for the documents or for the categories, the default language is
used.
When you specify the language of a document, the text of the document is analyzed and stemmed
according to this language. The result of the analysis is then compared with the evidence terms of
categories of the same language or of categories for which no language is defined. Defining a
language for a category acts as a filter: a document is never assigned to a category of a different
language.
To set the language for the documents that you want to classify, you can either set it for every
document or for an entire document set. When a language is set for a document set, it prevails
over the language set for individual documents. This behavior prevents classification errors
when the document language is not set correctly. You can select only one language per document.
If the document set contains documents in different languages, then the language must be
set at the document level and not at the document set level. When no language is defined for the
documents or for the document sets, if stemming is activated, the language used is the one
defined in the CIS server configuration.
You can also define the language of the categories used for the classification. The language can be
set for every category or for the entire taxonomy. If the language of a category is not specified, then
the language of the taxonomy is used; the category does not inherit the language of its parent
category, if any. When no language is defined, the language used is the one defined in the CIS
server configuration. You can also define the language as Any language, which means that
documents in any language can be assigned to this category.
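The precedence rules for document and category languages can be sketched as follows in Python. This is an illustration of the rules as stated in this section, not the CIS implementation; the function names and the ANY_LANGUAGE marker are hypothetical, and None stands for a language that is not defined.

```python
ANY_LANGUAGE = "any"  # hypothetical marker for the "Any language" setting


def effective_document_language(document_language, document_set_language, server_default):
    """The document set language prevails over the document language."""
    return document_set_language or document_language or server_default


def effective_category_language(category_language, taxonomy_language, server_default):
    """A category falls back to its taxonomy, never to its parent category."""
    return category_language or taxonomy_language or server_default


def can_assign(document_language, category_language):
    """A document is never assigned to a category of a different language."""
    return category_language == ANY_LANGUAGE or category_language == document_language


if __name__ == "__main__":
    doc = effective_document_language("french", None, "english")
    cat = effective_category_language(None, "french", "english")
    print(doc, cat, can_assign(doc, cat))
```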
2.
3.
4.
The language value can be either the full language name in lowercase and in English, such as
french, or the two-letter ISO 639-1 language code, such as fr.
5.
created when you define a category. Auto Categorization occurs automatically when documents
are processed. When a document is assigned to a category, if the option "Link assigned documents
into category folders" is enabled, the document is automatically linked to the repository folder
representing that category.
Pattern analysis
This section describes the Pattern analyzer feature and includes the following topics:
Patterns as evidence terms, page 114
Limitations, page 115
Use patterns in rules, page 114
Configure pattern analysis, page 115
Limitations
Patterns are defined using a standard regular expression language, which is not a natural
language. In addition, searching for many patterns can slow down performance significantly.
Two or three patterns do not cause noticeable performance degradation, but the cost grows with
each pattern you add.
You cannot define patterns using capturing groups; they are not supported and are replaced by
non-capturing groups. Instead, you can use a separator such as (-?|[\\. ]).
To define patterns:
1.
2.
3.
Since pattern analysis may affect performance of the content analysis, it can be turned off.
By default, the feature is enabled. To enable or disable pattern analysis, set the property
pattern.processing.enabled.
4.
To see additional tracing information in the log file, set the property tracing.enabled to true.
This information only applies to pattern loading and processing.
5.
For each parameter, indicate the incremented number of the pattern. For example, when
defining the third pattern, its parameters must end with 3 such as: pattern.scope.3,
pattern.value.3, and so on.
6.
7.
After you define patterns, you can use them as evidence terms for a category in Documentum
Administrator.
In this example, you can see that the pattern is made of three digits: \\d{3}, then two digits: \\d{2},
then four digits: \\d{4}, separated by hyphens: -.
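Before adding such a pattern to the configuration, you can try it against sample text. The following Python sketch uses Python's re module, whose syntax may differ in places from the regular expression engine used by CIS, so treat it as a rough check only. It assumes that the doubled backslash shown in the text (\\d) is the escaped form written in the configuration file and that the pattern itself uses a single backslash.

```python
import re

# Three digits, two digits, and four digits, separated by hyphens.
# The single-backslash form is an assumption about the unescaped pattern.
PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def contains_pattern(text):
    """Return True if the text contains a match for the pattern."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    print(contains_pattern("Reference 123-45-6789 on file"))
    print(contains_pattern("Reference 123-456-789 on file"))
```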
Classification information
The classification information indicates which documents have been processed or are being processed.
The classification results are primarily category assignments. Depending on the CIS configuration,
they can also include folder assignments and document metadata updates (Assign as Attributes
option). Classification information and category assignments are stored in the repository.
The classification is done per document tree. Only the CURRENT version of the documents is
categorized but the entire tree is assigned to a category. When clearing the assignments, only the
assignments of the current version of the documents are removed.
If you create a version of a document, remember that it inherits the metadata attributes of the
previous version. So, if you are using the Assign as Attributes option, the attribute values generated
by the classification of the previous version of the document may no longer be relevant.
creates a full link between the document and the folder in addition to its normal assignment
relation. This allows users to see the documents in the taxonomy hierarchy.
Assign as Attributes: When a document is categorized, CIS writes the names of assigned
categories in the attributes of the document. The category class definition specifies which
document attribute is updated for each matching category.
You can configure CIS to record category assignments in both of these ways, one of them, or neither.
If neither Link to Folders nor Assign as Attributes is active, Webtop users are not able to see the
category assignments.
You should select these options only when you know you need the functionality they provide.
Default CIS functionality is adequate in most cases.
Note: Category assignments are only exposed in Webtop clients and not in CenterStage clients.
Classification roles
This section describes the two roles involved in the classification process:
The taxonomy manager, page 117
The category owner, page 117
From the My Categories page in Documentum Administrator, the category owners can view all
documents assigned to the categories they own, or they can display just the documents assigned
to the category with a status of Pending. The category owners must approve or reject pending
documents, also called Candidate documents, that did not reach a score high enough to be
automatically categorized. If the threshold for automatic categorization is equal to the threshold for
Candidate documents, then there are no Candidate documents: documents are either automatically
categorized or rejected. Once the documents are categorized, either automatically or after approval,
they become viewable by end users. If a category owner rejects a pending document, this document
is not viewable by end users in the categories. For example, in Webtop, even if the document is
viewable in a cabinet folder, it is not viewable under the Categories node if it is not categorized.
The category owners also have the possibility to assign any document manually to a category.
Like the taxonomy manager, the category owners can clear assignments, for example, if they
mistakenly approved a pending document.
Note: If both a user and a group exist with the same name in the repository, the user cannot be
selected for category_owner, only the group.
Chapter 11
Configure CIS Standard Classification
Install Content Intelligence Services by following the instructions in the Content Intelligence
Services Installation Guide.
2.
Start the CIS server. See Chapter 3, Administer the CIS server, for details.
3.
If the repository has not been enabled for CIS use during the installation, enable CIS in the
repository using Documentum Administrator (DA). The Documentum Administrator User Guide
provides information on this procedure. The properties you specify when you enable CIS (such
as the hostname for the test and production servers) can later be modified in the Configure
CIS window in DA.
4.
5.
Create the necessary document sets to select the documents that you want to be automatically
categorized.
6.
Synchronize the taxonomy definitions in the Documentum repository to make them available to
the CIS server for the classification processing.
The CIS objects you import or create with Documentum Administrator are saved in the repository
that contains the documents, categories, and evidence terms that CIS uses in its processing. Synchronize
the taxonomies and the document sets so that the CIS server can use them. Each time the CIS
server processes a document set, the CIS server reads the document set definition. However, for
scheduled document sets, you need to synchronize them when the schedule has been modified. If
the document set is not synchronized, the CIS server does not know that the schedule has been
updated until the next time it tries to process the document set.
7.
If you want to integrate CIS with Web Publisher, configure Web Publisher so that it can locate
the CIS server.
The Web Publisher documentation provides details on how to integrate CIS with Web Publisher.
8.
When using CIS and RPS to apply policies on category folders, set the DFC used by CIS as
Privileged DFC client in Documentum Administrator, for the corresponding repository. The
Documentum Administrator User Guide provides more information on privileged DFC setup in the
Privileged Clients chapter.
The main server configuration file is cis.properties. The section To modify cis.properties, page 27,
describes how to modify the parameters in the cis.properties file.
Note: If the repository was previously enabled for another CIS server, you must reconfigure it in
DA to create an authentication file. Similarly, if you change the repository for a given CIS server,
reconfigure it to create an authentication file. The authentication files are stored in the directory
defined by the property cis.server.credentials.dir in cis.properties on the CIS server file system.
Each authentication file, called user_<repository_name>.properties, contains the login and encrypted
password of the CIS administrator for the given repository.
Chapter 12
Use the Taxonomy Exchange Format
(TEF)
This chapter describes the XML elements you can use to write a taxonomy in Taxonomy Exchange
Format (TEF).
A TEF file defines the structure of one or more taxonomies. You can import an entire taxonomy
structure or only part of it. In the latter case, you must also create a TEF action file that specifies what
actions to take. Importing an entire taxonomy is simpler and does not require a TEF action file
(script tef2repository). Exporting a taxonomy takes information from the repository and creates a
TEF XML file. The TEF XML schema accommodates subtypes and their attributes. This chapter
includes the following sections:
Import taxonomies in Taxonomy Exchange Format, page 121
Taxonomy Exchange Format action files, page 169
Note: Before importing taxonomies, make sure that you enabled CIS functionality in Documentum
Administrator. The Enabling Content Intelligence Services section in the Documentum Administrator User
Guide provides more information on this procedure.
2.
Locate the tef2repository.bat file (on Windows hosts, or tef2repository on Linux hosts); it can be
found in <CIS installation directory>/bin.
3.
Import the taxonomy using the import script with the following parameters:
If CIS is already configured for the repository (that is, enabled in Documentum Administrator):
tef2repository -TefFile:<filename>
where <filename> can be the TEF filename or the relative filepath to the TEF file. In this case,
the repository information is retrieved using the settings in cis.properties.
If you need to provide the credentials of the user:
tef2repository -Repository:<repository_name> -Username:<user_name> -Password:<user_password> -TefFile:<filename>
When running the script on a different CIS server machine, you can provide the absolute
directory paths to indicate where the file cis.properties and where the credentials file can
be found:
tef2repository -CisConfDir:confdir -CisHomeDir:homedir -TefFile:<filename>
These parameters are only required when you run the script from a different CIS server
machine.
Note: To increase the memory size for importing large taxonomies, edit the script file and add the
-Xmx argument. Refer to the following procedure for more details about the -Xmx argument.
Parameter values are case sensitive (but parameter names are case insensitive).
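If you launch the import from another tool, the three command forms above can be assembled programmatically. The following Python sketch only builds the argument list and does not run the script; the function name build_import_command is hypothetical, and requiring the repository, user name, and password together is an assumption based on the second command form.

```python
def build_import_command(tef_file, repository=None, username=None, password=None,
                         conf_dir=None, home_dir=None, script="tef2repository"):
    """Build the argument list for the tef2repository import script."""
    args = [script]
    credentials = (repository, username, password)
    if any(value is not None for value in credentials):
        # Assumption: the three credential parameters are always used together.
        if any(value is None for value in credentials):
            raise ValueError("repository, username, and password must be given together")
        args += ["-Repository:" + repository,
                 "-Username:" + username,
                 "-Password:" + password]
    if conf_dir is not None:
        args.append("-CisConfDir:" + conf_dir)
    if home_dir is not None:
        args.append("-CisHomeDir:" + home_dir)
    args.append("-TefFile:" + tef_file)
    return args


if __name__ == "__main__":
    # products.xml is a placeholder file name.
    print(build_import_command("products.xml"))
```

Passing the arguments as a list, for example to subprocess.run, avoids shell quoting problems with passwords or paths that contain spaces. On Windows hosts, pass script="tef2repository.bat".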
2.
3.
where <TEF action file path> is the relative file path of the TEF action file,
<docbase> is the name of the repository into which you want to import the taxonomy,
<login> and <password> are the Documentum user name and password for logging into the
repository.
For large taxonomies, you may need to allocate more Java memory. To do so, use the -Xmx argument
to increase the maximum allowed size for the Java heap. Append the letter k for kilobytes, or m for
megabytes. This argument comes before the classpath argument (-cp) in the command line.
To trace errors, add the option
-Dlog4j.configuration=<CIS installation directory>/config/log4j-script.xml.
This argument also comes before the classpath argument (-cp) in the command line.
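The placement rule for these options (before -cp) can be illustrated with a short Python sketch that edits a command line held as a list of arguments. The function name, the 512m heap size, and the placeholder classpath and class name in the usage lines are hypothetical; substitute the values from your own command line.

```python
def add_jvm_options(command, options):
    """Insert JVM options immediately before the -cp argument."""
    # list.index raises ValueError if the command has no -cp argument.
    index = command.index("-cp")
    return command[:index] + list(options) + command[index:]


if __name__ == "__main__":
    # Placeholder command line; the real classpath and class name differ.
    base = ["java", "-cp", "<classpath>", "<main class>"]
    print(add_jvm_options(base, ["-Xmx512m"]))
```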
Note: Given that TefUtil is a command line and not a script, it must be run from the directory
containing the files to import.
Using the TEF utility, you cannot set evidence propagation to @parent on a category which has
multiple parents, such as parents with category links to the category. This would generate an error at
the taxonomy synchronization in cis.log.
TEF elements
The following sections describe the TEF elements:
tef, page 126
class, page 127
details, page 129
description, page 130
categoryDefaults, page 131
impliedKeywordDefaults, page 133
keywordDefaults, page 135
evidencePropagation, page 137
categoryEvidenceDefaults, page 139
taxonomy, page 141
category, page 143
details, page 147
owners, page 149
owner, page 150
operations, page 151
operation, page 152
languageInfo, page 153
supportedLanguage, page 154
extended_attributes, page 155
attribute, page 156
value, page 157
definition, page 158
evidence, page 160
evidenceSet, page 162
keyword, page 163
categoryEvidence, page 164
tef
Purpose
Root element of a file in taxonomy exchange format
Diagram
Children
<class>
<taxonomy>
<category>
Parents
None
Usage notes
The <tef> element must be the first element in the file. All other elements must appear inside of it.
Example of <tef>
<tef>
<class name="Generic">
... [Class definition]
</class>
<taxonomy name="Products" className="Generic" taxonomyVersion="version">
... [Taxonomy definition]
</taxonomy>
</tef>
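A quick structural check of a TEF file can catch a misplaced root element before import. The following Python sketch uses the standard xml.etree.ElementTree module and verifies only what this page states: the root element is <tef> and its children are <class>, <taxonomy>, or <category>. It is not a validation against the TEF XML schema, and the function name and sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child elements that this page lists for <tef>.
ALLOWED_CHILDREN = {"class", "taxonomy", "category"}


def check_tef_root(xml_text):
    """Return a list of structural problems found at the top of a TEF file."""
    root = ET.fromstring(xml_text)
    if root.tag != "tef":
        return ["root element is <%s>, expected <tef>" % root.tag]
    return ["unexpected child element <%s>" % child.tag
            for child in root if child.tag not in ALLOWED_CHILDREN]


SAMPLE = """<tef>
  <class name="Generic"/>
  <taxonomy name="Products" className="Generic" taxonomyVersion="1.0"/>
</tef>"""

if __name__ == "__main__":
    print(check_tef_root(SAMPLE))
```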
class
Purpose
Defines a category class
Diagram
Attributes
Table 8. <class> Element Attributes
Attribute
Description
name
Children
<details>
<categoryDefaults>
Parents
<tef>
Usage notes
The <class> element defines a category class. Each CIS category is assigned to a class, which
determines the default confidence levels of the category and evidence propagation behavior. The
category class also identifies the document attribute to which category assignments are written if the
Assign as Attributes option is active.
<class> elements appear as children of the <tef> element, outside of the taxonomies and categories. A
<class> element has two subelements:
<details> provides descriptive information about the class and sets the document attribute into
which the CIS server writes category assignments when Assign as Attributes is active.
<categoryDefaults> sets default values for how the CIS server handles evidence for categories
of this category class.
Example of <class>
<class name="Generic">
<details source="Source" targetAttribute="keywords" title="Generic class">
<description>Category class for basic categories</description>
</details>
<categoryDefaults>
<impliedKeywordDefaults confidence="100" stem="true"
phraseOrderExact="false"/>
<keywordDefaults confidence="high" stem="true" phraseOrderExact="false"/>
<evidencePropagation type="@parent" confidence="medium"/>
<categoryEvidenceDefaults confidence="off"/>
</categoryDefaults>
</class>
details
Purpose
Provides descriptive information about its parent category class.
Diagram
Attributes
Table 9. <details> Element attributes
Attribute
Description
title
source
targetAttribute
Children
<description>
Parents
<class>
Note: A different <details> element is a subelement of <taxonomy> or <category>. See details,
page 147.
Usage notes
The <details> element provides a description of its parent category class. It also sets the document
attribute into which the CIS server writes the names of categories to which a document is assigned.
The <description> subelement contains a text description of the parent category class. The text
appears between the opening tag and the closing tag, not as an attribute as with other TEF elements.
See class, page 127 for an example that uses the <details> element.
description
Purpose
Provides a description of the parent element
Children
Text of the description
Parents
<details>
Usage notes
The <description> element is the only TEF element that includes plain text rather than subelements
between its opening and closing tags. See details, page 147 for an example.
categoryDefaults
Purpose
Provides default values for confidence levels and evidence propagation.
Defined at the category class level for all categories that reference this category class.
Diagram
Children
<impliedKeywordDefaults>
<keywordDefaults>
<evidencePropagation>
<categoryEvidenceDefaults>
Parents
<class>
Usage notes
Each piece of evidence for a category has three associated attributes that control how the CIS server
handles it:
confidence A confidence level that determines how much the CIS server adds to the score of a
document when the evidence is found
stem True or false setting that determines whether the CIS server uses stemming to recognize
other forms of the word. Corresponds to the Use stemming functionality in Documentum
Administrator.
phraseOrderExact True or false setting that determines whether the words in a multiple word
phrase must appear in exact order or in random order. Corresponds to the Recognize words in
any order functionality in Documentum Administrator.
The <categoryDefaults> element specifies the default values for these options. Each subelement
sets the default values for a particular type of evidence (keyword, implied keyword, propagated
evidence, and linked category evidence). The default values can be overridden by specifying a value
in the <keyword> or <categoryEvidence> element.
Example of <categoryDefaults>
<class name="Class">
<details source="Source" targetAttribute="Target" title="Title">
<description>Class description</description>
</details>
<categoryDefaults>
<impliedKeywordDefaults confidence="100" stem="true" phraseOrderExact="false"/>
<keywordDefaults confidence="high" stem="true" phraseOrderExact="false"/>
<evidencePropagation type="@parent" confidence="medium"/>
<categoryEvidenceDefaults confidence="off"/>
</categoryDefaults>
</class>
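The override behavior described in the usage notes, where a value set on a <keyword> or <categoryEvidence> element takes precedence over the class default, amounts to a simple merge. The following Python sketch illustrates the idea with dictionaries; it is not how CIS stores these settings, and the function name is hypothetical.

```python
def effective_settings(class_defaults, element_attributes):
    """Merge element-level attributes over the category class defaults."""
    settings = dict(class_defaults)
    # Only attributes actually specified on the element override a default.
    settings.update({name: value for name, value in element_attributes.items()
                     if value is not None})
    return settings


if __name__ == "__main__":
    # Defaults taken from the <keywordDefaults> element in the example above.
    keyword_defaults = {"confidence": "high", "stem": "true", "phraseOrderExact": "false"}
    # A keyword that sets only its own confidence.
    print(effective_settings(keyword_defaults, {"confidence": "low", "stem": None}))
```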
impliedKeywordDefaults
Purpose
Provides the defaults for handling implied keywords
Attributes
Table 10. <impliedKeywordDefaults> Element Attributes
Attribute
Description
confidence
stem
phraseOrderExact
Children
None
Parents
<categoryDefaults>
Usage notes
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain Equivalent to the confidence level 100.
High Equivalent to the confidence level 75.
Medium Equivalent to the confidence level 50.
Low Equivalent to the confidence level 15.
Supporting Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default.
-22 Equivalent to the confidence Required in Documentum Administrator. If only Required
terms are defined for the category, then the document must contain at least one Required term
and only one is sufficient to assign the document to the category. If the evidence terms are not
only Required terms, then the document must contain one Required term and have a confidence
score high enough for the category.
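The accepted values of the confidence attribute can be summarized as a lookup. The following Python sketch interprets a value according to the list above; the function name and the (kind, score) return convention are hypothetical, and accepting only numbers from 0 through 100 besides -22 is an assumption based on the range given for confidence values in Chapter 10.

```python
# Predefined levels and their numeric equivalents, as listed above.
PREDEFINED_LEVELS = {"certain": 100, "high": 75, "medium": 50, "low": 15}
# Values that change how evidence is handled instead of adding a score.
SPECIAL_VALUES = {"supporting", "exclude", "off"}
REQUIRED_CODE = -22


def resolve_confidence(value):
    """Return (kind, score) for a TEF confidence attribute value."""
    text = str(value).strip().lower()
    if text in PREDEFINED_LEVELS:
        return ("score", PREDEFINED_LEVELS[text])
    if text in SPECIAL_VALUES:
        return (text, None)
    number = int(text)  # raises ValueError for unknown words
    if number == REQUIRED_CODE:
        return ("required", None)
    if 0 <= number <= 100:
        return ("score", number)
    raise ValueError("confidence out of range: %r" % value)


if __name__ == "__main__":
    for sample in ("high", "100", "off", "-22"):
        print(sample, resolve_confidence(sample))
```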
keywordDefaults
Purpose
Provides the defaults for handling evidence keywords
Attributes
Table 11. <keywordDefaults> Element Attributes
Attribute
Description
confidence
stem
phraseOrderExact
Children
None
Parents
<categoryDefaults>
Usage notes
The corresponding functionality is not exposed in Documentum Administrator. It can be very useful
to define globally the default values and behaviors for all the keywords in the category instead of
setting these values and behaviors for each keyword.
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to confidence level 100.
High: Equivalent to confidence level 75.
Medium: Equivalent to confidence level 50.
Low: Equivalent to confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default value.
-22: Equivalent to the Required confidence level in Documentum Administrator. If the category
defines only Required terms, a document that contains at least one Required term is assigned
to the category. If the category also defines other evidence terms, the document must contain
a Required term and also reach a confidence score high enough for the category.
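The mapping from predefined names to numbers can be sketched in Python as follows. This is only an illustration, not how the CIS server itself parses the attribute: the function name resolve_confidence and the tuple return values are hypothetical, and case-insensitive matching of the names is an assumption.

```python
# Numeric equivalents of the predefined confidence names listed above.
PREDEFINED_LEVELS = {"certain": 100, "high": 75, "medium": 50, "low": 15}

# Values that change how evidence is handled instead of adding a fixed weight.
SPECIAL_VALUES = {"supporting", "exclude", "off"}

# Numeric code equivalent to Required in Documentum Administrator.
REQUIRED = -22


def resolve_confidence(raw):
    """Return ("level", n), ("special", name), or ("required", -22).

    Assumption: predefined names are matched case-insensitively.
    """
    text = raw.strip()
    key = text.lower()
    if key in PREDEFINED_LEVELS:
        return ("level", PREDEFINED_LEVELS[key])
    if key in SPECIAL_VALUES:
        return ("special", key)
    number = int(text)  # raises ValueError for unrecognized input
    if number == REQUIRED:
        return ("required", number)
    return ("level", number)
```

For example, resolve_confidence("High") and resolve_confidence("75") describe the same confidence level.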
evidencePropagation
Purpose
Provides the defaults for propagating evidence
Attributes
Table 12. <evidencePropagation> Element Attributes
Attribute
Description
type
confidence
Children
None
Parents
<categoryDefaults>
Usage notes
Evidence of one category or taxonomy can automatically be considered evidence for another category
or taxonomy. Sharing evidence across categories and taxonomies is called propagating evidence.
Evidence can only be propagated between categories and taxonomies that have a direct parent and
child relationship. The propagation direction can be either parent to child or child to parent.
For example, suppose a taxonomy has this hierarchical structure (showing just the top-level elements
for the categories and taxonomy):
<taxonomy name="United States" className="Country"
          taxonomyVersion="1.0">
  <category name="Missouri" className="State">
    <category name="St. Louis" className="City"/>
    <category name="Branson" className="City"/>
  </category>
</taxonomy>
In this structure, if the direction of propagation is child to parent, the categories St. Louis and Branson
can propagate their evidence to the Missouri category, because St. Louis and Branson are direct
children of the Missouri category. Similarly, Missouri can propagate its evidence to the taxonomy
United States, because Missouri is a direct child of United States. If the direction of propagation is
parent to child, the taxonomy United States can propagate its evidence to the category Missouri, and
Missouri can propagate its evidence to both St. Louis and Branson.
However, evidence can never be propagated automatically between the United States taxonomy
and the St. Louis or Branson categories, because those categories are only indirectly
contained within the <taxonomy> element. Additionally, evidence cannot be automatically
propagated between sibling categories. In the preceding example, this means that evidence for
St. Louis cannot automatically be propagated to Branson, nor can evidence for Branson be
propagated to St. Louis.
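The direct parent and child rule can be expressed as a small Python check. This is a sketch of the rule as described above, not the CIS server implementation; the function name can_propagate and the direction labels "child_to_parent" and "parent_to_child" are hypothetical.

```python
# Child -> parent map for the example hierarchy described above.
PARENT = {
    "Missouri": "United States",
    "St. Louis": "Missouri",
    "Branson": "Missouri",
}


def can_propagate(source, target, direction):
    """Return True if evidence may propagate from source to target.

    Propagation is allowed only across a direct parent and child link,
    in the configured direction.
    """
    if direction == "child_to_parent":
        return PARENT.get(source) == target
    if direction == "parent_to_child":
        return PARENT.get(target) == source
    raise ValueError("unknown direction: " + direction)
```

With this map, St. Louis can propagate to Missouri in the child to parent direction, but no direction allows propagation between St. Louis and Branson or between United States and Branson.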
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to confidence level 100.
High: Equivalent to confidence level 75.
Medium: Equivalent to confidence level 50.
Low: Equivalent to confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default value.
-22: Equivalent to the Required confidence level in Documentum Administrator. If the category
defines only Required terms, a document that contains at least one Required term is assigned
to the category. If the category also defines other evidence terms, the document must contain
a Required term and also reach a confidence score high enough for the category.
categoryEvidenceDefaults
Purpose
Provides the defaults for handling evidence from other categories linked into an evidence set.
Defined at the category class level for all categories that reference this category class.
Attributes
Table 13. <categoryEvidenceDefaults> Element Attributes
Attribute
Description
confidence
Children
None
Parents
<categoryDefaults>
Usage notes
The corresponding functionality is not exposed in Documentum Administrator.
For the confidence attribute, you can enter a predefined confidence level or enter a number directly.
The predefined values are:
Certain: Equivalent to confidence level 100.
High: Equivalent to confidence level 75.
Medium: Equivalent to confidence level 50.
Low: Equivalent to confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default value.
-22: Equivalent to the Required confidence level in Documentum Administrator. If the category
defines only Required terms, a document that contains at least one Required term is assigned
to the category. If the category also defines other evidence terms, the document must contain
a Required term and also reach a confidence score high enough for the category.
taxonomy
Purpose
Defines a taxonomy object
Diagram
Attributes
Table 14. <taxonomy> Element Attributes
Attribute
Description
name
className
taxonomyVersion
type
internalId    Internal use
Children
<details>
<definition>
<category>
<categoryLink>
Parents
<tef>
Usage notes
The <taxonomy> element represents the root of a hierarchical tree of categories. You can include
multiple <taxonomy> elements in a TEF file.
Some aspects of the <taxonomy> element establish default values for <category> elements that appear
inside of it.
A <taxonomy> element is divided into three main parts:
<details> provides descriptive information about the taxonomy.
<definition> specifies property rules that documents must meet in order for the CIS server to
assign them to categories in the taxonomy and provides default threshold values for categories in
the taxonomy.
<category> and <categoryLink> elements define the hierarchical structure of the taxonomy.
Example of <taxonomy>
<taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
  <details title="Products">
    <description>Products taxonomy</description>
  </details>
  <definition candidateThreshold="50" onTargetThreshold="80">
  </definition>
  <category name="Web Content Management Suite" className="Generic">
    ... [Category definition]
  </category>
  <category name="Enterprise Content Management Suite" className="Generic">
    ... [Category definition]
  </category>
</taxonomy>
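To inspect the hierarchy that a TEF file defines, you can walk the nested <category> elements with any XML parser. The following Python sketch uses the standard xml.etree.ElementTree module. The TEF content is a simplified variant of the example above (the Web Publisher subcategory and the empty category bodies are placeholders), and the outline function is hypothetical; it is not part of any Documentum tool.

```python
import xml.etree.ElementTree as ET

# Simplified TEF content; category bodies are left empty for illustration.
TEF_SNIPPET = """
<tef>
  <taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
    <definition candidateThreshold="50" onTargetThreshold="80"/>
    <category name="Web Content Management Suite" className="Generic">
      <category name="Web Publisher" className="Generic"/>
    </category>
    <category name="Enterprise Content Management Suite" className="Generic"/>
  </taxonomy>
</tef>
"""


def outline(node, depth=0):
    """Return (depth, name) pairs for a taxonomy and its nested categories."""
    rows = [(depth, node.get("name"))]
    for child in node.findall("category"):
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(TEF_SNIPPET)
taxonomy = root.find("taxonomy")
rows = outline(taxonomy)
for depth, name in rows:
    # Indent each name by its depth in the hierarchy.
    print("  " * depth + name)
```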
category
Purpose
Defines a category
Diagram
Attributes
Table 15. <category> Element Attributes
Attribute
Description
name
className
internalId    Internal use
The default for the type attribute is dm_category or dm_taxonomy. During import, the type
attribute determines the type of object to create in the repository. During export, all attributes
of the dm_category or dm_taxonomy subtype are exported to <extended_data>.
Children
<details>
<definition>
<category>
<categoryLink>
Parents
<taxonomy>
<category>
Usage notes
The <category> element represents a category. <category> elements are valid within <taxonomy>
elements and within other <category> elements. The nested structure defines the hierarchy of the
taxonomy.
Every category belongs to a category class. The class determines the default confidence levels used
for different types of evidence and the document attribute into which the CIS server writes the
name of the category when it assigns a document.
A <category> element is divided into three main parts:
<details> provides descriptive information about the category and how it is used and managed.
<definition> provides the evidence and property rules that the CIS server uses to determine
which documents to assign to the category.
<category> and <categoryLink> elements define subcategories.
Example of <category>
<category name="Web Content Management Suite">
  <details title="Web Content Management Suite">
    <description>The suite of Documentum products for managing Web content</description>
    <owners/>
    <operations>
      <operation type="user_browse"/>
      <operation type="manual_assignment"/>
    </operations>
    <languageInfo>
      <supportedLanguage languageCode="es" translatedName="translation 1"/>
      <supportedLanguage languageCode="jp" translatedName="translation 2"/>
    </languageInfo>
  </details>
  <definition candidateThreshold="60" onTargetThreshold="90">
    <evidence evidencePropagation="low" impliedKeyword="33">
      <evidenceSet>
        <keyword name="Web Content Management" confidence="high"
                 phraseOrderExact="true" stem="false"/>
        <keyword name="WCM" confidence="high"/>
      </evidenceSet>
    </evidence>
  </definition>
  <category name="Web Publisher" className="Generic">
    ... [Category definition]
  </category>
  ...
        <value>3.4</value>
      </attribute>
    </extended_attributes>
  </details>
  <definition>
  </definition>
</category>
details
Purpose
Groups together descriptive information about its parent category or taxonomy
Diagram
Attributes
Table 16. <details> Element attributes
Attribute
Description
title
Children
<description>
<owners>
<operations>
<languageInfo>
<extended_attributes>
Parents
<taxonomy>
<category>
Note: A different <details> element is a subelement of <class>. See details, page 129.
Usage notes
The <details> element provides a description of its parent category or taxonomy. The <details>
element can contain the following subelements:
<description> contains a text description of the parent category or taxonomy. The text appears
between the opening tag and the closing tag, not as an attribute as with other TEF elements.
<owners> lists the owners of the parent category or taxonomy. The <owners> element groups
together any number of <owner> elements, each of which gives the Documentum user name of an
owner for the parent category or taxonomy.
<operations> lists which user operations are available for the parent category or taxonomy. The
<operations> element groups together any number of <operation> elements, each of which
identifies a type of operation that is valid.
<languageInfo> provides translated names for the parent category or taxonomy. The
<languageInfo> element groups together any number of <supportedLanguage> elements, each of
which identifies a language and provides a translation of the name into that language.
<extended_attributes> populates the attributes of a category subtype.
Example of <details>
<category name="Web Content Management Suite">
  <details title="Web Content Management Suite">
    <description>The suite of Documentum products for
      managing Web content</description>
    <owners>
      <owner name="dmadmin"/>
    </owners>
    <operations>
      <operation type="user_browse"/>
      <operation type="manual_assignment"/>
    </operations>
    <languageInfo>
      <supportedLanguage languageCode="es" translatedName="translation 1"/>
      <supportedLanguage languageCode="jp" translatedName="translation 2"/>
    </languageInfo>
  </details>
  ... [The <definition> element and subcategories]
</category>
owners
Purpose
Groups together the owners assigned to the parent category or taxonomy
Diagram
Children
<owner>
Parents
<details>
Usage notes
The owners of a category are the Documentum users who can review candidate documents and
approve or reject their assignment to the category. Candidate documents are documents whose
confidence score exceeds the candidate threshold of the category but falls short of its on-target
threshold, or documents that are assigned to the category manually with the Manual Workflow
option active.
See details, page 147 for an example that uses the <owners> element.
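The relationship between a confidence score and the two thresholds can be sketched in Python. This is an illustration of the description above rather than the CIS server logic: the function name classify and the result labels are hypothetical, and the handling of scores that exactly equal a threshold is an assumption.

```python
def classify(score, candidate_threshold, on_target_threshold):
    """Classify a document score against the thresholds of a category.

    Assumptions: a score equal to the on-target threshold counts as on
    target, and a score equal to the candidate threshold does not make
    the document a candidate.
    """
    if score >= on_target_threshold:
        return "assigned"
    if score > candidate_threshold:
        return "candidate"  # owners review and approve or reject
    return "not assigned"
```

With candidateThreshold="60" and onTargetThreshold="90", a score of 75 makes the document a candidate that a category owner reviews.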
owner
Purpose
Identifies an owner of a category or taxonomy
Diagram
Attributes
Table 17. <owner> Element Attributes
Attribute
Description
name
Children
None
Parents
<owners>
Usage notes
The owners of a category are the Documentum users who can review candidate documents and
approve or reject their assignment to the category. Candidate documents are documents whose
confidence score exceeds the candidate threshold of the category but falls short of its on-target
threshold, or documents that are assigned to the category manually with the Manual Workflow
option active. For categories created using Documentum Administrator, the user who created the
category is an owner by default. See details, page 147 for an example that uses the <owner> element.
operations
Purpose
Groups together the set of operations available for the parent category or taxonomy
Diagram
Children
<operation>
Parents
<details>
Usage notes
The intent of the <operations> element is to specify what user operations are available for a category or
taxonomy. For example, you may not want users to see the documents assigned to certain categories.
In this release, the <operations> element does not affect standard CIS processing. The operations are
saved as part of the category definition, but Documentum applications do not refer to them.
operation
Purpose
Identifies an operation that is available for the parent category or taxonomy
Attributes
Table 18. <operation> Element Attributes
Attribute
Description
type
Children
None
Parents
<operations>
Usage notes
The intent of the <operation> element is to identify a user operation that is available for a category
or taxonomy. For example, you may include an operation that makes the category available for
browsing by users.
In this release, the <operation> element does not affect standard CIS processing. Any operations are
saved as part of the category definition, but Documentum applications do not refer to them.
languageInfo
Purpose
Reserved use. Groups together the translated names of the parent category or taxonomy
Diagram
Children
<supportedLanguage>
Parents
<details>
Usage notes
The subelements of <languageInfo> translate the category or taxonomy name into other languages.
Each <supportedLanguage> element identifies a language (using its Documentum language code) and
provides a translation of the name into that language. When a user views the category or taxonomy,
its name appears in the same language as the Documentum user interface if a translation is available.
See details, page 147 for an example that includes the <languageInfo> element.
supportedLanguage
Purpose
Reserved use. Provides a translated category or taxonomy name for a specified language
Attributes
Table 19. <supportedLanguage> Element Attributes
Attribute
Description
languageCode
translatedName
Children
None
Parents
<languageInfo>
Usage notes
The subelements of <languageInfo> translate the category or taxonomy name into other languages.
Each <supportedLanguage> element identifies a language (using its Documentum language code) and
provides a translation of the name into that language. When a user views the category or taxonomy,
its name appears in the same language as the Documentum user interface if a translation is available.
See details, page 147 for an example that includes the <languageInfo> element.
extended_attributes
Purpose
Attributes of the subtype category
Children
<attribute>
Parents
<details>
Usage notes
<extended_attributes> populates the attributes of the subtype. The type attribute determines
the type of object to create in the repository.
Attributes can be repeating and can be of different types: boolean, integer, string, id, time, and
double. See category, page 143, for an example of a category subtype.
attribute
Purpose
One of the attributes of the subtype category
Attributes
Table 20. <attribute> Element Attributes
Attribute
Description
name
Children
<value>
Parents
<extended_attributes>
Usage notes
<extended_attributes> populates the attributes of the subtype. The type attribute determines
the type of object to create in the repository.
Attributes can be repeating and can be of different types: boolean, integer, string, id, time, and
double. See category, page 143, for an example of a category subtype.
value
Purpose
Provides the value for an attribute of the subtype category
Children
Value of the attribute
Parents
<attribute>
Usage notes
See category, page 143, for an example of a category subtype.
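The nesting of <extended_attributes>, <attribute>, and <value> can be read with a few lines of Python. This is a sketch under stated assumptions: the attribute names review_score and reviewers are hypothetical, and representing a repeating attribute as several <value> children is an assumption that this guide does not confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtype attributes; only the element names come from the guide.
SNIPPET = """
<extended_attributes>
  <attribute name="review_score">
    <value>3.4</value>
  </attribute>
  <attribute name="reviewers">
    <value>alice</value>
    <value>bob</value>
  </attribute>
</extended_attributes>
"""


def read_extended_attributes(element):
    """Map each attribute name to the list of its <value> texts."""
    return {
        attribute.get("name"): [value.text for value in attribute.findall("value")]
        for attribute in element.findall("attribute")
    }


values = read_extended_attributes(ET.fromstring(SNIPPET))
```

All values are returned as strings; converting them to boolean, integer, double, or time values depends on the type of each subtype attribute.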
definition
Purpose
Identifies the set of documents belonging to a category
Diagram
Attributes
Table 21. <definition> Element Attributes
Attribute
Description
onTargetThreshold
candidateThreshold
keywordLanguage
Children
<evidence>, only if the parent element is <category>
<qualifiers>
Parents
<taxonomy>
<category>
Usage notes
The <definition> element supplies the criteria that the CIS server uses to determine which documents
to assign to the parent category or taxonomy. The <qualifiers> subelement defines property rules
that a document must meet to be assigned. The <evidence> subelement provides the evidence and
confidence values that the CIS server uses to assign a confidence score to the document.
If a <taxonomy> element includes any <qualifiers>, the specified property rules apply to all categories
in the taxonomy. If a document submitted for processing does not meet the property rules for
the taxonomy, the CIS server does not evaluate it for assignment into any of the categories in the
taxonomy.
The keywordLanguage attribute sets the language that is used when stemming is enabled. The
taxonomy or category language also acts as a filter: the language of the document (or of
the document set) must match the category language, or the document cannot be assigned. The
section "Setting the language used for the stemming" in the Documentum Administrator User Guide
provides more information about the stemming functionality.
Note: The <definition> under a <taxonomy> element should not include an <evidence> subelement.
Documents are not assigned to the root of the taxonomy.
Example of <definition>
<category name="Web Content Management Suite">
  <details title="Web Content Management Suite">
    ... [Category details]
  </details>
  <definition candidateThreshold="60" onTargetThreshold="90">
    <evidence evidencePropagation="low" impliedKeyword="33">
      <evidenceSet>
        <keyword name="Web Content Management" confidence="high"
                 phraseOrderExact="true" stem="false"/>
        <keyword name="WCM" confidence="high"/>
      </evidenceSet>
    </evidence>
  </definition>
</category>
evidence
Purpose
Identifies the evidence used to assign documents to the parent category
Diagram
Attributes
Table 22. <evidence> Element Attributes
Attribute
Description
impliedKeyword
evidencePropagation
Children
<evidenceSet>
Parents
<definition>
Usage notes
The <evidence> element provides the evidence and confidence values that the CIS server uses to
assign a confidence score for the parent category to the document. The evidence for a category is
organized into evidence sets, each of which defines a collection of evidence keywords that the CIS
server considers together when calculating the score of a document, relative to the category.
For the impliedKeyword and evidencePropagation attributes, you can enter a predefined confidence
level or enter a number directly. The predefined values are:
Certain: Equivalent to confidence level 100.
High: Equivalent to confidence level 75.
Medium: Equivalent to confidence level 50.
Low: Equivalent to confidence level 15.
Supporting: Deprecated. This evidence by itself does not cause the CIS server to assign the
document to the category. However, it increases the confidence level of other evidence found in
the same evidence set.
Exclude: This evidence by itself prevents the CIS server from assigning the document to the
category, regardless of how much other evidence appears.
Off: No weight is added to the document confidence score for this evidence. Use this value
when you do not want to set a default value.
-22: Equivalent to the Required confidence level in Documentum Administrator. If the category
defines only Required terms, a document that contains at least one Required term is assigned
to the category. If the category also defines other evidence terms, the document must contain
a Required term and also reach a confidence score high enough for the category.
Example of <evidence>
<evidence>
  <evidenceSet>
    <keyword name="Documentum"/>
  </evidenceSet>
  <evidenceSet>
    <keyword name="ECM"/>
    <keyword name="Enterprise Content Management" confidence="high"
             phraseOrderExact="true" stem="false"/>
    <categoryEvidence name="@parent" confidence="low"/>
  </evidenceSet>
</evidence>
evidenceSet
Purpose
Groups together a set of evidence that the CIS server considers together when analyzing documents
Diagram
Children
<keyword>
<categoryEvidence>
Parents
<evidence>
Usage notes
An evidence set is a collection of keywords that the CIS server uses together as evidence of a
particular concept. The keywords are identified using the <keyword> and <categoryEvidence>
subelements. A category can have multiple evidence sets that define separate sets of co-occurring
keywords. Confidence levels are not combined across evidence sets.
keyword
Purpose
Identifies a string for the CIS server to use as evidence for the parent category
Attributes
Table 23. <keyword> Element Attributes
Attribute
Description
name
confidence
stem
phraseOrderExact
Children
None
Parents
<evidenceSet>
Usage notes
The <keyword> element defines a piece of evidence that the CIS server looks for in the content of the
documents it processes. The text of the keyword can be one or more words. When the server finds the
keyword, it adds the confidence value for the keyword to the confidence score of the document for
the parent category.
If you do not include values for one or more of the attributes, their values are inherited from the
<keywordDefaults> element of the category class.
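The inheritance of omitted attributes can be sketched in Python. This illustrates the rule stated above and is not the CIS server implementation: the function name effective_keyword is hypothetical, and the default values shown on <keywordDefaults> are invented for the example.

```python
import xml.etree.ElementTree as ET

# Attributes that a <keyword> can inherit from <keywordDefaults>.
KEYWORD_ATTRIBUTES = ("confidence", "stem", "phraseOrderExact")


def effective_keyword(keyword, defaults):
    """Merge a <keyword> element with the <keywordDefaults> of its class."""
    merged = {"name": keyword.get("name")}
    for attr in KEYWORD_ATTRIBUTES:
        # A value set on the keyword wins; otherwise use the class default.
        merged[attr] = keyword.get(attr, defaults.get(attr))
    return merged


# Hypothetical class defaults, for illustration only.
defaults = ET.fromstring(
    '<keywordDefaults confidence="medium" stem="true" phraseOrderExact="false"/>'
)
keyword = ET.fromstring('<keyword name="WCM" confidence="high"/>')
result = effective_keyword(keyword, defaults)
```

Here the keyword keeps its own confidence value and takes stem and phraseOrderExact from the class defaults.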
categoryEvidence
Purpose
Includes evidence of another category as part of the evidence for the parent category
Attributes
Table 24. <categoryEvidence> Element Attributes
Attribute
Description
name
className
confidence
internalId    Internal use
Children
None
Parents
<evidenceSet>
Usage notes
Categories can include other categories as evidence: when a document is assigned to one category,
the CIS server can use that assignment as evidence for a related category. For example, when a
document is assigned to the category Documentum Content Intelligence Services, you might
want it also assigned to the category Documentum. To accomplish this, you link the category
Documentum Content Intelligence Services into an evidence set for the category Documentum.
Like all evidence, category link evidence has a confidence value associated with it, telling the CIS
server how much to add to the overall score of the document for the current category when the
document is assigned to the linked category.
If you do not include values for one or more of the attributes, their values are inherited from the
<categoryEvidenceDefaults> element of the category class.
qualifiers
Purpose
Groups together the property rules for the parent category
Diagram
Children
<qualifier>
Parents
<definition>
Usage notes
The definition of a category or taxonomy can include property rules that assigned documents must
meet. For the CIS server to assign a document to a category, the document must meet the property
rules for the category and the property rules for the taxonomy to which the category belongs.
Example of <qualifiers>
<definition candidateThreshold="50" onTargetThreshold="80">
  <evidence>
    ... [Category evidence]
  </evidence>
  <qualifiers>
    <qualifier tag="location" operation="equal" value="/MarketingCabinet"/>
    <qualifier tag="type" operation="not_equal" value="custom_type"/>
  </qualifiers>
</definition>
qualifier
Purpose
Defines a qualifying condition for documents assigned to the parent category or taxonomy
Attributes
Table 25. <qualifier> Element Attributes
Attribute
Description
tag
operation
value
Children
None
Parents
<qualifiers>
Usage notes
Before the CIS server assigns a document to a category, it verifies that the document meets the
property rules for the category and for the taxonomy. If a document fails to meet any condition, the
CIS server does not assign the document to the category regardless of the strength of the evidence. If
no evidence terms are defined and the document meets the property rule, then it is automatically
assigned.
A special property rule indicates whether a document must satisfy all the conditions or whether
one condition is enough for the document to be assigned, for example:
<qualifier tag="qualifiers_evaluation_policy" operation="equal" value="all"/>
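The effect of the evaluation policy can be sketched in Python. This is an illustration, not the CIS server logic. The operations equal and not_equal and the policy value all appear in this guide; the value "any" for the policy where one condition is enough, the default policy when no policy rule is present, and the function name meets_property_rules are assumptions.

```python
# Operations taken from the <qualifiers> example in this guide.
OPERATIONS = {
    "equal": lambda actual, expected: actual == expected,
    "not_equal": lambda actual, expected: actual != expected,
}

POLICY_TAG = "qualifiers_evaluation_policy"


def meets_property_rules(properties, qualifiers):
    """Evaluate (tag, operation, value) rules against document properties."""
    policy = "all"  # assumed default when no policy rule is present
    results = []
    for tag, operation, value in qualifiers:
        if tag == POLICY_TAG:
            policy = value  # the special rule only selects the policy
            continue
        results.append(OPERATIONS[operation](properties.get(tag), value))
    if policy == "all":
        return all(results)
    # Assumption: any other policy value means one condition is enough.
    return any(results)
```

For a document in /MarketingCabinet whose type is custom_type, the two rules from the <qualifiers> example fail under the all policy (the type rule is not met) but pass if one condition is enough.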
categoryLink
Purpose
Links an existing category into the hierarchy of the taxonomy
Attributes
Table 26. <categoryLink> Element Attributes
Attribute
Description
name
className
internalId
Internal use
Children
None
Parents
<taxonomy>
<category>
Usage notes
The <categoryLink> element enables you to include a category in more than one place in the
hierarchy. You use the <category> element to define the category and its evidence structure once, then
use <categoryLink> to link the category into other locations in the taxonomy. Linking a category
does not imply evidence propagation.
Example of <categoryLink>
<taxonomy name="Products" className="Generic" taxonomyVersion="Version 1">
  <details title="Products">
    <description>Products taxonomy</description>
  </details>
  <definition candidateThreshold="50" onTargetThreshold="80"/>
  <category name="Web Content Management Suite" className="Generic">
    ... [Category definition]
  </category>
  <categoryLink name="Enterprise Content Management Suite"
                className="Generic"/>
</taxonomy>
For information on how to run a TEF action file, see the step "Run the TefUtil utility" in the
procedure "To import a TEF taxonomy with TefUtil", page 123.
actions
Purpose
Root element of the TEF action file
Diagram
Children
<add>
<delete>
<relink>
<export>
Note: The <update> action, which appears in tefActionSchema.xsd, is not supported in this release.
Parents
None
Usage notes
<actions> is the root element of a TEF action file. Each of its subelements is an action you want
to perform on a taxonomy.
Example of <actions>
<actions>
<delete>
<categoryReference name="Maximum Taxonomy" className="Class"/>
<classReference name="Class"/>
<classReference name="Alternate Class"/>
</delete>
<add fileName="testTef.xml">
<classObject xPath="/tef/class"/>
<taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']"
branch="true"/>
<withinParentReference name="Maximum Taxonomy" className="Class">
<categoryObject xPath="/tef/category[@name='Maximum Category']"
branch="true"/>
</withinParentReference>
</add>
<export fileName="tefOut1.xml" xsdFileName="tefSchema.xsd">
<classReference name="Class"/>
<classReference name="Alternate Class"/>
</export>
<add fileName="testTef.xml">
<withinParentReference name="Maximum Taxonomy" className="Class">
<categoryObject xPath="/tef/category[@name='Maximum Category']"
branch="true"/>
</withinParentReference>
</add>
<export fileName="tefOut2.xml" xsdFileName="tefSchema.xsd">
<categoryReference name="Maximum Taxonomy" className="Class"
branchLevels="all" details="true" definitions="false"/>
<categoryReference name="Maximum Category" className="Class"
branchLevels="all" details="false" definitions="false"/>
<categoryReference name="Maximum Category" className="Class"
branchLevels="all" details="true" definitions="false"/>
</export>
<relink>
<categoryReference className="Alternate Class" name="Alternate Category">
<absoluteParentList>
<categoryReference className="Class" name="Maximum Taxonomy"/>
<categoryReference className="Class" name="Maximum Category"/>
</absoluteParentList>
</categoryReference>
</relink>
<export fileName="tefOut3.xml" xsdFileName="tefSchema.xsd">
<categoryReference name="Maximum Taxonomy" className="Class"
branchLevels="all" details="true" definitions="false"/>
</export>
<relink>
<categoryReference className="Alternate Class" name="Alternate Category2">
<addParentList>
<categoryReference className="Class" name="Minimum Category"/>
</addParentList>
<removeParentList>
<categoryReference className="Class" name="Maximum Category"/>
</removeParentList>
</categoryReference>
</relink>
<export fileName="tefOut4.xml" xsdFileName="tefSchema.xsd">
<categoryReference name="Maximum Taxonomy" className="Class"
branchLevels="all" details="true" definitions="false"/>
</export>
<delete>
<categoryReference name="Maximum Category" className="Class"/>
</delete>
<export fileName="tefOut5.xml" xsdFileName="tefSchema.xsd">
add
Purpose
Adds categories, taxonomies, or category classes to the repository
Diagram
Attributes
Table 27. <add> Element Attributes
Attribute
Description
fileName
Children
<classObject>
<taxonomyObject>
<withinParentReference>
Parents
<actions>
Usage notes
The <add> action adds new CIS objects to the repository based on definitions stored in a TEF file.
Each subelement identifies one or more objects to add. The <classObject> and <taxonomyObject>
elements identify category classes and taxonomies respectively using an XPath reference to elements
in the TEF file. The <withinParentReference> element identifies a position in the hierarchy where
the categories referred to inside of it are added.
If an object that appears inside the <add> action already exists in the repository, the TEF utility
does not add the object again or update it. It ignores the existing object and continues.
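The skip-if-present behavior can be pictured with a short Python sketch. It models the repository as a plain dictionary keyed by object name, which is a simplification for illustration only; the function name add_objects is hypothetical and is not part of the TEF utility.

```python
def add_objects(repository, new_objects):
    """Add objects that are not yet in the repository.

    Existing objects are left untouched (no update). Returns the names
    of the objects that were skipped.
    """
    skipped = []
    for name, definition in new_objects.items():
        if name in repository:
            skipped.append(name)  # ignore the existing object and continue
            continue
        repository[name] = definition
    return skipped


repository = {"Class": "existing definition"}
skipped = add_objects(
    repository,
    {"Class": "new definition", "Maximum Taxonomy": "taxonomy definition"},
)
```

After the call, the existing Class object still has its original definition and only Maximum Taxonomy has been added.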
Example of <add>
<actions>
  <add fileName="testTef.xml">
    <classObject xPath="/tef/class"/>
    <taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']"
                    branch="true"/>
    <withinParentReference name="Maximum Taxonomy" className="Class">
      <categoryObject xPath="/tef/category[@name='Maximum Category']"
                      branch="true"/>
    </withinParentReference>
  </add>
</actions>
classObject
Purpose
Identifies a category class in a TEF file to add to a repository
Attributes
Table 28. <classObject> Element Attributes
Attribute
Description
xPath
Children
None
Parents
<add>
Usage notes
The <classObject> element identifies a category class object from a TEF file using an XPath reference.
If the XPath reference selects multiple category classes, the TEF utility adds each class.
Example of <classObject>
<actions>
  <add fileName="testTef.xml">
    <classObject xPath="/tef/class"/>
  </add>
</actions>
taxonomyObject
Purpose
Identifies a taxonomy element in a TEF file
Attributes
Table 29. <taxonomyObject> Element Attributes
Attribute
Description
xPath
branch
Children
None
Parents
<add>
Usage notes
The <taxonomyObject> element identifies a taxonomy object using an XPath reference. If the XPath
reference matches more than one taxonomy, the TEF utility adds them all.
Example of <taxonomyObject>
<actions>
  <add fileName="testTef.xml">
    <taxonomyObject xPath="/tef/taxonomy[@name='Maximum Taxonomy']" branch="true"/>
  </add>
</actions>
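To check which elements an xPath value selects before you run an action file, you can evaluate the expression against the TEF file yourself. The following Python sketch uses the standard xml.etree.ElementTree module, which supports only a subset of XPath and accepts only relative paths on an element, so the helper rewrites the leading /tef step. The helper name select and the sample TEF content are hypothetical; the TEF utility may evaluate XPath differently.

```python
import xml.etree.ElementTree as ET

# Sample TEF content for illustration only.
TEF_SNIPPET = """
<tef>
  <class name="Class"/>
  <class name="Alternate Class"/>
  <taxonomy name="Maximum Taxonomy" className="Class" taxonomyVersion="1.0"/>
  <taxonomy name="Other Taxonomy" className="Class" taxonomyVersion="1.0"/>
</tef>
"""


def select(root, xpath):
    """Resolve an absolute path such as /tef/class against the root element."""
    prefix = "/" + root.tag
    if not xpath.startswith(prefix):
        return []
    rest = xpath[len(prefix):]
    if rest and not rest.startswith("/"):
        return []  # the first step names a different root element
    # ElementTree needs a relative path, so "/tef/class" becomes "./class".
    return root.findall("." + rest)


root = ET.fromstring(TEF_SNIPPET)
classes = [e.get("name") for e in select(root, "/tef/class")]
taxonomies = [
    e.get("name")
    for e in select(root, "/tef/taxonomy[@name='Maximum Taxonomy']")
]
```

In this sample, /tef/class matches both category classes, while the predicate on @name narrows the taxonomy selection to a single element.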
withinParentReference
Purpose
Identifies where in the hierarchy to add new categories
Diagram
Attributes
Table 30. <withinParentReference> Element Attributes
Attribute
Description
name
className
Children
<categoryObject>
Parents
<add>
Usage notes
The <withinParentReference> element identifies the parent category for one or more new categories
being added from a TEF file.
Example of <withinParentReference>
<actions>
<add fileName="testTef.xml">
<withinParentReference name="Maximum Taxonomy" className="Class">
<categoryObject xPath="/tef/category[@name='Maximum Category']"
branch="true"/>
</withinParentReference>
</add>
</actions>
categoryObject
Purpose
Identifies a category element in a TEF file
Attributes
Table 31. <categoryObject> Element Attributes
Attribute
Description
xPath
branch
Children
None
Parents
<withinParentReference>
Usage notes
The <categoryObject> element appears within an <add> action to identify one or more categories to
add from the TEF file to the repository. It identifies categories using an XPath reference. If the XPath
reference selects more than one category, the TEF utility adds them all.
<categoryObject> appears as a subelement of the <withinParentReference> element, which determines
where in the hierarchy the categories are added.
Example of <categoryObject>
<actions>
<add fileName="testTef.xml">
<withinParentReference name="Maximum Taxonomy" className="Class">
<categoryObject xPath="/tef/category[@name='Maximum Category']"
branch="true"/>
</withinParentReference>
</add>
</actions>
delete
Purpose
Removes category classes, categories, or taxonomies from the repository
Diagram
Children
<categoryReference>
<classReference>
Parents
<actions>
Usage notes
The <delete> action removes CIS objects from the repository.
To delete a category, the object referred to by <categoryReference> must not have any children. To
delete a category class, no existing categories or taxonomies can use the category class. For this
reason, the branch attribute is ignored when used with the <delete> action.
Example of <delete>
<actions>
<delete>
<categoryReference name="Maximum Taxonomy" className="Class"/>
<classReference name="Class"/>
<classReference name="Alternate Class"/>
</delete>
</actions>
classReference
Purpose
Identifies a category class element in the repository
Attributes
Table 32. <classReference> Element Attributes
Attribute
Description
name
Children
None
Parents
<delete>
<export>
Usage notes
The <classReference> element identifies an existing category class using its name.
Example of <classReference>
<actions>
<delete>
<classReference name="Class"/>
</delete>
</actions>
categoryReference
Purpose
Identifies a category in the repository to perform an action on.
Diagram
Note: This diagram applies only to <categoryReference> elements that appear within a <relink>
action. <categoryReference> has no child elements in the context of other actions.
Attributes
Table 33. <categoryReference> Element Attributes
Attribute
Description
name
className
branch
branchLevels
details
definition
Children
<absoluteParentList>
<addParentList>
<removeParentList>
Note: These child elements apply only to <categoryReference> elements that appear within a <relink>
action. <categoryReference> has no child elements in the context of other actions.
Parents
<delete>
<relink>
<export>
Usage notes
<categoryReference> identifies an existing category or taxonomy in the repository. Its attributes
specify whether the action applies only to the specified category or to the category and its children.
Example of <categoryReference>
<actions>
<delete>
<categoryReference name="Maximum Taxonomy" className="Class"/>
</delete>
</actions>
relink
Purpose
Links existing categories into new hierarchical locations
Diagram
Children
<categoryReference>
Parents
<actions>
Usage notes
The <relink> action links an existing category into an additional location in the hierarchy. The
<categoryReference> identifies the existing category and the new locations you want to link it to. The
<relink> action can also be used to remove category links; see removeParentList, page 190. However,
every category must be linked to at least one parent category, and the TEF utility gives an error if
you attempt to remove the final link.
Example of <relink>
<actions>
<relink>
<categoryReference className="Alternate Class" name="Alternate Category">
<absoluteParentList>
<categoryReference className="Class" name="Maximum Taxonomy"/>
</absoluteParentList>
</categoryReference>
</relink>
</actions>
absoluteParentList
Purpose
Provides a fixed list of parent categories for a linked category
Diagram
Children
<categoryReference>
Parents
<categoryReference>
Usage notes
<absoluteParentList> appears inside of a <relink> action, as a subelement of the <categoryReference>
that identifies the category being relinked. Its subelements identify the complete list of parent
categories for the relinked category.
The alternative to <absoluteParentList> is <addParentList> and <removeParentList>. These elements
identify which parents to add and remove for the relinked category rather than listing the complete
set of parent categories.
Example of <absoluteParentList>
<actions>
<relink>
<categoryReference className="Alternate Class"
name="Alternate Category">
<absoluteParentList>
<categoryReference className="Class" name="Maximum Taxonomy"/>
</absoluteParentList>
</categoryReference>
</relink>
</actions>
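Because <absoluteParentList> gives the complete list of parents, it can presumably name more than one parent category. The following sketch assumes that both parent categories already exist in the repository; the category names are reused from other examples in this guide for illustration only.

```xml
<actions>
  <relink>
    <!-- The category being relinked -->
    <categoryReference className="Alternate Class" name="Alternate Category">
      <!-- Complete set of parents after the relink (assumed to exist) -->
      <absoluteParentList>
        <categoryReference className="Class" name="Maximum Taxonomy"/>
        <categoryReference className="Class" name="Minimum Category"/>
      </absoluteParentList>
    </categoryReference>
  </relink>
</actions>
```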
addParentList
Purpose
Provides a list of new parent categories for a linked category in the repository
Diagram
Children
<categoryReference>
Parents
<categoryReference>
Usage notes
<addParentList> appears inside of a <relink> action, as a subelement of the <categoryReference> that
identifies the category being relinked. Its subelements identify new parent categories for the relinked
category. The category is not unlinked from any of its current positions; only the new parents are
added. Use <removeParentList> to remove existing links for the category.
The alternative to <addParentList> is <absoluteParentList>. This element identifies the complete set of
parent categories for the relinked category rather than identifying only new parents.
Example of <addParentList>
<actions>
<relink>
<categoryReference className="Alternate Class" name="Alternate Category2">
<addParentList>
<categoryReference className="Class" name="Minimum Category"/>
</addParentList>
<removeParentList>
<categoryReference className="Class" name="Maximum Category"/>
</removeParentList>
</categoryReference>
</relink>
</actions>
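The example above combines <addParentList> with <removeParentList>. As a minimal sketch of the add-only case described in the usage notes, the following action links the category to one more parent and leaves its existing links untouched. It assumes that the category and the new parent already exist in the repository.

```xml
<actions>
  <relink>
    <categoryReference className="Alternate Class" name="Alternate Category2">
      <!-- Only adds a link; existing parents are kept -->
      <addParentList>
        <categoryReference className="Class" name="Minimum Category"/>
      </addParentList>
    </categoryReference>
  </relink>
</actions>
```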
removeParentList
Purpose
Provides a list of parent categories to remove from a category
Diagram
Children
<categoryReference>
Parents
<categoryReference>
Usage notes
<removeParentList> appears inside of a relink action, as a subelement of the <categoryReference>
that identifies the category being relinked. Its subelements identify current parent categories to
remove from the relinked category.
Since every category must be linked to at least one parent category, the TEF utility gives an error if
you attempt to remove the final link.
Example of <removeParentList>
<actions>
<relink>
<categoryReference className="Alternate Class" name="Alternate Category2">
<addParentList>
<categoryReference className="Class" name="Minimum Category"/>
</addParentList>
<removeParentList>
<categoryReference className="Class" name="Maximum Category"/>
</removeParentList>
</categoryReference>
</relink>
</actions>
export
Purpose
Creates a TEF file containing the definitions of specified categories and category classes
Diagram
Attributes
Table 34. <export> Element Attributes
Attribute
Description
fileName
xsdFileName
Children
<classReference>
<categoryReference>
Parents
<actions>
Usage notes
The <export> action creates a TEF file containing elements for the selected classes and categories. The
<classReference> and <categoryReference> elements identify existing classes and categories from the
repository. The attributes of the <categoryReference> element determine what aspects of the category
definition are included in the TEF file; see categoryReference, page 184.
In most cases, the value of the xsdFileName attribute should be the standard Documentum TEF
schema file tefSchema.xsd. To create a TEF file without validation, set the xsdFileName attribute to
an empty string.
The category type or taxonomy type is always exported to the type attribute. The export action
automatically picks up all attributes from the dm_category or dm_taxonomy subtype and exports these
attributes to <extended_data>.
Example of <export>
<actions>
<export fileName="tefOut1.xml" xsdFileName="tefSchema.xsd">
<classReference name="Class"/>
<classReference name="Alternate Class"/>
</export>
</actions>
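As noted in the usage notes, setting xsdFileName to an empty string creates a TEF file without validation. A minimal sketch of that case follows; the output file name tefOut2.xml is a hypothetical name chosen for illustration.

```xml
<actions>
  <!-- Empty xsdFileName: the TEF file is created without schema validation -->
  <export fileName="tefOut2.xml" xsdFileName="">
    <classReference name="Class"/>
  </export>
</actions>
```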
Part 6
Metadata Extraction
This part describes metadata extraction, which is one of the three types of content analysis:
entity extraction, metadata extraction, and classification.
It includes the following chapters:
Chapter 13, Metadata Extraction
Chapter 14, Configuring Metadata Extraction
Chapter 13
Metadata Extraction
Analyze the structure of the documents. Identify the similarities and differences in their
structure.
2.
3.
4.
5.
Rules sample
In this example, we see various ways to define rules to extract some metadata elements from a
document.
We want to extract the date, the reference number, and the subject.
These simple rules can work for other similar documents. However, if documents in the document set
include small variations (the date is not always on the first line, the greeting is not always Sir/Madam),
these rules fail to extract the metadata elements, and you have to define more robust rules.
For the date element, for example, you can use a regular expression to match the date. The
date follows the pattern day / month / year, or 2 digits / 2 digits / 4 digits. The corresponding regular
expression is \d{2}/\d{2}/\d{4}, where \d means a digit and \d{2} means 2 digits.
In this case, the rule is the following:
<SetMetadata name="date">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</SetMetadata>
To make sure we match only a date located between the beginning of the document and the text Subject:,
we can modify the rule as follows:
<SetMetadata name="date">
<Block end="Subject:">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</Block>
</SetMetadata>
This rule can be read as "Put in the metadata element date the value returned by the sub-rule Block." The
sub-rule Block first reduces the target text to the zone from the beginning of the document to Subject:, then
processes its own sub-rule Pattern, and then returns the values returned by Pattern. The sub-rule
Pattern finds the first text matching the regular expression for the date and returns it (or returns no
value if no match is found).
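As a variation on the rule above, a more tolerant pattern can accept days and months written with one or two digits (for example 3/7/2011). This is only a sketch: it assumes that the regular expression engine used by CIS supports the {m,n} range quantifier, which this guide does not state explicitly.

```xml
<SetMetadata name="date">
  <!-- Restrict the search to the text before "Subject:" -->
  <Block end="Subject:">
    <!-- 1 or 2 digits / 1 or 2 digits / 4 digits -->
    <Pattern regex="\d{1,2}/\d{1,2}/\d{4}"/>
  </Block>
</SetMetadata>
```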
To better extract the reference element, we can define a block with start and end elements instead of
a line.
<SetMetadata name="reference">
<Block start="Ref.:" end="Subject:"/>
</SetMetadata>
Extracting the subject element is trickier because it depends on the greeting Sir/Madam, which may
vary from one document to another. We can first try to extract the block between Subject: and "Sir/Madam,"
and, if that block is not found, take the first 3 lines of text after Subject:.
<SetMetadata name="subject">
<First>
<Block start="Subject:" end="Sir/Madam,"/>
<Block start="Subject:">
<Line fromOccurrence="1" toOccurrence="3"/>
</Block>
</First>
</SetMetadata>
Chapter 14
Configuring Metadata Extraction
This chapter describes how to configure the extraction of metadata from the content, properties, or
repository attributes of documents including extraction rule definition and testing.
2.
3.
Add rules to the configuration file. The rules are added between the <MetadataExtractionRules>
and </MetadataExtractionRules> elements. Metadata extraction rules, page 201 describes the
rules available with their usage and provides some examples.
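As an illustration of where the rules go, the following sketch places two of the sample rules from the previous chapter between the <MetadataExtractionRules> tags. The rest of the configuration file is not shown here, and this fragment does not try to represent it.

```xml
<MetadataExtractionRules>
  <!-- Rules are evaluated in the order in which they are written -->
  <SetMetadata name="date">
    <Block end="Subject:">
      <Pattern regex="\d{2}/\d{2}/\d{4}"/>
    </Block>
  </SetMetadata>
  <SetMetadata name="reference">
    <Block start="Ref.:" end="Subject:"/>
  </SetMetadata>
</MetadataExtractionRules>
```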
Locate the extract_metadata.bat script (on Windows hosts, or extract_metadata on Linux hosts); it
can be found at <CIS installation directory>/bin.
2.
where
<test_document> is the filepath for the document to be tested.
<rule_file> is the filepath for the XML rule file.
<results_file> is the filepath for the generated text file that contains the extraction results.
<extracted_text_file> is the filepath for a file containing only the extracted text as returned by
the Oracle text extractor.
<extracted_properties_file> is the filepath for a file containing only the extracted properties
with the specified rules.
such as:
extract_metadata -doc "..\doc\metadata\document_sample.doc" -rules
"..\doc\metadata\rules_sample1.xml" -output "extracted_metadata.txt"
-extractedText "extracted_text.txt" -extractedProperties "extracted_properties.txt"
Locate the extract_metadata.bat script (on Windows hosts, or extract_metadata on Linux hosts); it
can be found at <CIS installation directory>/bin.
2.
where
<test_directory> is the filepath for the directory that contains the documents to be tested.
<rule_file> is the filepath for the XML rule file.
<results_file> is the filepath for the CSV file generated for the extraction results.
such as:
extract_metadata -docDir "..\docs" -rules "..\doc\metadata\rules_sample2.xml"
-output "extracted_metadata_report.csv"
You can use the extract_metadata script on a machine other than the one on which CIS is installed.
The following procedure describes the required steps.
Locate and run the build_metadata_extractor script in <CIS installation directory>/bin directory.
It creates a new directory <CIS installation directory>/metadata_extractor with all necessary
resources inside.
2.
3.
4.
Update the following lines in the extract_metadata script with the correct paths for the current
machine:
set CIS_CONF_DIR=C:\Program Files\Documentum\CIS\config
set CIS_HOME_DIR=C:\Program Files\Documentum\CIS
set CIS_LIB_DIR=C:\Program Files\Documentum\CIS\lib
set JH=C:\Program Files\Documentum\java\1.6.0_17
5.
Rules principles
Rules are evaluated in order.
The rules are evaluated in the order in which they appear in the file, so write them in the order they
have to be evaluated. The order also matters when you define a target text: a rule can reduce the
text zone of the target text but cannot enlarge it.
A rule can contain zero, one, or several sub-rules.
If a rule contains sub-rules, the sub-rules are processed first. Then the rule processes itself with the
result of the sub-rules and returns zero, one, or several values. A rule usually has zero or one sub-rule;
only operator rules can have several sub-rules.
A rule applies to a text zone.
The root rule applies to the entire document text, then rules can reduce the target text (text zone
on which the rule applies).
A rule returns zero, one, or several values.
The values returned, if any, are stored according to the document set configuration. Values are
available during the processing to evaluate other rules.
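The principles above can be sketched with a chain of nested rules, each of which reduces the target text for the rule it contains. The structure mirrors the Zone and Block examples given later in this chapter; the 500-character limit is an arbitrary value chosen for illustration.

```xml
<!-- Target text: entire document, reduced to the first 500 characters -->
<Zone startIndex="0" endIndex="500">
  <SetMetadata name="date">
    <!-- Target text reduced again: from the start of the zone to "Subject:" -->
    <Block end="Subject:">
      <!-- Searches only the reduced text; returns zero or one value -->
      <Pattern regex="\d{2}/\d{2}/\d{4}"/>
    </Block>
  </SetMetadata>
</Zone>
```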
Rules definitions
For all rules, the names of the metadata elements, document properties, or repository attributes
are case sensitive. They must comply with XML standards, which means that characters such as
underscores or spaces are allowed.
The target text could be the entire document or any part of the document defined by a rule.
SetMetadata
Purpose
This rule allows you to define a metadata element.
Attributes
Table 35. <SetMetadata> Element Attributes
Attribute
Description
name
Usage notes
This rule must have a sub-rule. The metadata element is set with the values returned by the sub-rule.
Once the metadata element is created, it can be accessed by other rules (such as GetMetadata) and it
can be stored as specified in the document set configuration. Make sure the name of the metadata
element is exactly the same in the document set configuration. The name is case-sensitive.
Example of <SetMetadata>
<SetMetadata name="reference">
<Block start="Ref.:" end="Subject:"/>
</SetMetadata>
GetMetadata
Purpose
This rule allows you to get the values of a metadata element.
Attributes
Table 36. <GetMetadata> Element Attributes
Attribute
Description
name
Usage notes
This rule has no sub-rule. It returns the values set for a metadata element, or no value if the metadata
element is not set. Refer to the SetMetadata rule to know how to set a metadata element.
The GetMetadata rule allows you to verify the existence of a metadata element and to start a new
rule to retrieve the value of this metadata element. The example of the Concat rule also provides
an example of GetMetadata usage. GetMetadata differs from the Exists condition, which only verifies the
existence of the metadata element.
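A minimal sketch of GetMetadata usage follows. The first rule sets the reference metadata element, and the second rule reuses its value to build another element. The element name label and the constant text are hypothetical and serve only as an illustration.

```xml
<MetadataExtractionRules>
  <!-- Must come first: GetMetadata can only read an element that is already set -->
  <SetMetadata name="reference">
    <Block start="Ref.:" end="Subject:"/>
  </SetMetadata>
  <!-- "label" is a hypothetical metadata element name -->
  <SetMetadata name="label">
    <Concat separator=" ">
      <Constant value="Reference"/>
      <GetMetadata name="reference"/>
    </Concat>
  </SetMetadata>
</MetadataExtractionRules>
```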
DocProperty
Purpose
This rule gets the value of a document property.
Attributes
Table 37. <DocProperty> Element Attributes
Attribute
Description
name
Usage notes
This rule has no sub-rule. It returns the value of a specific property extracted from the document,
such as the title of a PDF document, or no value if the property is not set in the document.
To make sure you set the exact name of the property, run the extract_metadata script on one document
with the -extractedProperties parameter. Look at the text as it is extracted by Oracle text extractor.
This allows you to know the name of the property as it is seen by the extractor. For example, you
may have an Author property in the application interface that appears as primaryauthor in the
extracted text.
Note: System properties may not be extracted depending on the file format. For example, if you
use Windows Explorer, the properties set in the Summary tab of the document Properties are not
extracted for the PDF documents.
Example of <DocProperty>
<SetMetadata name="author">
<DocProperty name="primaryauthor"/>
</SetMetadata>
DocRepositoryAttribute
Purpose
This rule gets the values of an attribute associated with the document in the Documentum repository.
Attributes
Table 38. <DocRepositoryAttribute> Element Attributes
Attribute
Description
name
Usage notes
This rule has no sub-rule. It returns the values of an attribute associated with the document in the
Documentum repository, such as the attributes title or keywords, or no value if the attribute is not set.
Example of <DocRepositoryAttribute>
<SetMetadata name="DocTitle">
<First>
<DocRepositoryAttribute name="title"/>
<DocProperty name="title"/>
<Line occurrence="1"/>
</First>
</SetMetadata>
Block
Purpose
This rule looks for a text block delimited by a start element and an end element.
Attributes
Table 39. <Block> Element Attributes
Attribute
Description
start
Specifies a start element for the block that can be plain text or regular
expression.
Optional, by default the start element is the beginning of the current
target text.
fromMetadataPosition
Specifies a start element for the block that is the name of a metadata
element. The block starts at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the start element is the beginning of the current
target text.
includeStart
end
Specifies an end element for the block that can be plain text or regular
expression.
Optional, by default the end element is the end of the current target text.
toMetadataPosition
Specifies an end element for the block that is the name of a metadata
element. The block ends at the metadata element position. The metadata
element has to be defined in previous rule.
Optional, by default the end element is the end of the current target text.
includeEnd
ignoreCase
Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.
occurrence
fromOccurrence
toOccurrence
allOccurrences
Usage notes
Both start and end elements can be plain text, a regular expression, or a metadata element defined in
a previous rule.
If either the start or end element is not found, then this rule returns no value and does not process
its sub-rule (if any). If there is no sub-rule, this rule returns one value: the text block matched.
If there is a sub-rule, it is invoked with a target text reduced to the text block matched by this rule.
Then the values returned by the sub-rule are returned by this rule.
Examples of <Block>
<Block start="Ref.:" end="Subject:"/>
<Block end="Subject:"/>
<Block fromMetadataPosition="Author" end="Date:"/>
<SetMetadata name="phones">
<Block start="Tel" end="Fax" allOccurrences="true">
<Pattern regex="\d{5}-\d{4}"/>
</Block>
</SetMetadata>
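The toMetadataPosition attribute is not covered by the examples above. The following sketch uses it to end a block at the position of a metadata element set by an earlier rule. The element name header is hypothetical, and this section does not specify whether the position refers to the start or the end of the matched metadata text.

```xml
<MetadataExtractionRules>
  <!-- The metadata element must be defined before it is used as a position -->
  <SetMetadata name="date">
    <Pattern regex="\d{2}/\d{2}/\d{4}"/>
  </SetMetadata>
  <!-- "header" is a hypothetical name: text from the beginning of the
       target text up to the position of the date element -->
  <SetMetadata name="header">
    <Block toMetadataPosition="date"/>
  </SetMetadata>
</MetadataExtractionRules>
```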
Line
Purpose
This rule looks for one or several lines.
Attributes
Table 40. <Line> Element Attributes
Attribute
Description
start
Specifies a start element for the block that can be plain text or regular
expression.
Optional, by default the start element is the beginning of the current
target text.
fromMetadataPosition
Specifies a start element for the block that is the name of a metadata
element. The block starts at the metadata element position. The metadata
element has to be defined in a previous rule.
Optional, by default the start element is the beginning of the current
target text.
includeStart
end
Specifies an end element for the block that can be plain text or regular
expression.
Optional, by default the end element is the end of the current target text.
toMetadataPosition
Specifies an end element for the block that is the name of a metadata
element. The block ends at the metadata element position. The metadata
element has to be defined in previous rule.
Optional, by default the end element is the end of the current target text.
includeEnd
ignoreCase
Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.
occurrence
fromOccurrence
toOccurrence
allOccurrences
Usage notes
The Line rule is similar to the Block rule, except that the default start element is the beginning of a line
and the default end element is the end of a line. This rule has the same attributes as the Block rule.
Examples of <Line>
<SetMetadata name="date">
<Line occurrence="1"/>
</SetMetadata>
<SetMetadata name="reference">
<Line start="Ref.:"/>
</SetMetadata>
<SetMetadata name="subject">
<Block start="Subject:">
<Line fromOccurrence="1" toOccurrence="3"/>
</Block>
</SetMetadata>
Pattern
Purpose
This rule looks for a text fragment matching a pattern.
Attributes
Table 41. <Pattern> Element Attributes
Attribute
Description
regex
Specifies the phrase in plain text or the regular expression to look for.
ignoreCase
Specifies whether to ignore the letter case of the plain text or regular
expression.
False by default.
occurrence
fromOccurrence
toOccurrence
allOccurrences
Usage notes
The pattern can be either a phrase in plain text or a regular expression.
If the pattern is not found, then this rule returns no value and does not process its sub-rule
(if any). If there is no sub-rule, this rule returns one value: the text fragment matched. If there is a
sub-rule, it is invoked with a target text reduced to the text fragment matched by this rule. Then the
values returned by the sub-rule are returned by this rule.
Examples of <Pattern>
<SetMetadata name="date">
<Block end="Subject:">
<Pattern regex="\d{2}/\d{2}/\d{4}"/>
</Block>
</SetMetadata>
<SetMetadata name="emails">
<Pattern regex="[a-z\._]+@[a-z\.]+" ignoreCase="true" allOccurrences="true"/>
</SetMetadata>
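Both examples above use regular expressions. Since the pattern can also be a phrase in plain text, a rule can look for a fixed word as in the following sketch. The element name classification and the phrase are hypothetical, and the phrase deliberately contains no characters that have a special meaning in regular expressions.

```xml
<!-- "classification" is a hypothetical metadata element name -->
<SetMetadata name="classification">
  <!-- Plain-text phrase; ignoreCase also matches "CONFIDENTIAL" or "confidential" -->
  <Pattern regex="Confidential" ignoreCase="true"/>
</SetMetadata>
```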
Zone
Purpose
This rule reduces the target text based on character indexes.
Attributes
Table 42. <Zone> Element Attributes
Attribute
Description
startIndex
Specifies the index of the first character (included) of the text zone to keep.
endIndex
Usage notes
Unlike Block or Pattern, this rule does not look for a text fragment; it directly reduces the target
text based on character indexes. For example, it may be necessary to limit the target
text to the first page when there is no visible text marker identifying the end of the first page.
To do that, you can limit the target text to the first 500 characters (roughly the first page).
If the start index is greater than the length of the current target text, then this rule returns no
value and does not process its sub-rule (if any). If there is no sub-rule, this rule returns one
value: the delimited text zone. If there is a sub-rule, it is invoked with a target text reduced to the
delimited text zone. Then the values returned by the sub-rule are returned by this rule.
Example of <Zone>
<Zone startIndex="0" endIndex="500">
<SetMetadata name="InventionTitle">
<Block start="Title of invention:" end="Name of Program"/>
</SetMetadata>
</Zone>
Constant
Purpose
This rule always returns the same constant value.
Attributes
Table 43. <Constant> Element Attributes
Attribute
Description
value
Usage notes
This rule has no sub-rule. It can be used to concatenate a constant value with another value to set a
metadata element, such as adding the symbol for a unit of measurement.
Example of <Constant>
<SetMetadata name="price">
<Concat>
<Pattern regex="\d+"/>
<Constant value="$"/>
</Concat>
</SetMetadata>
If
Purpose
This conditional rule evaluates conditions and, depending on the evaluation, processes a sub-rule.
Usage notes
This rule must have at least one condition and exactly one <Then> child element. It can optionally
have one <Else> child element. Each child element (<Then> or <Else>) must have one sub-rule. If all
the conditions are satisfied (implicit AND), then the sub-rule of the <Then> tag is processed. Else, the
sub-rule of the <Else> tag is processed (if any). The <If> rule returns the values returned by either the
<Then> sub-rule, or the <Else> sub-rule, or no value if there is no <Else>.
<If>
Conditions
<Then>
Sub-rules
</Then>
<Else>
Sub-rules
</Else>
</If>
Example of <If>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>
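The example above has a single condition and no <Else>. As a sketch of the implicit AND described in the usage notes, the following rule lists two conditions directly under <If> and provides an <Else> branch. It assumes that a subject metadata element was set by an earlier rule; the property name primaryauthor is taken from the DocProperty example.

```xml
<SetMetadata name="authors">
  <If>
    <!-- Both conditions must be satisfied (implicit AND) -->
    <Exists name="subject" source="metadata"/>
    <Contains name="subject" source="metadata" value="report" ignoreCase="true"/>
    <Then>
      <Line start="Authors:"/>
    </Then>
    <Else>
      <!-- Processed when at least one condition is not satisfied -->
      <DocProperty name="primaryauthor"/>
    </Else>
  </If>
</SetMetadata>
```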
Conditions
Conditions are not rules; they are used inside the <If> rule. Their evaluation always returns a Boolean.
And
Purpose
This operator evaluates all its sub-conditions with an AND.
Usage notes
The And operator must have at least one sub-condition. It is satisfied if and only if all its
sub-conditions are satisfied. If a sub-condition is not satisfied, the next sub-conditions are not
evaluated and the And operator is not satisfied.
The And operator has no attributes.
Example of <And>
<If>
<And>
<Contains name="subject" source="metadata" value="report"/>
<Equals name="format" source="docProperty" value="pdf"/>
</And>
<Then>
<SetMetadata name="authors">
<Line start="Authors:"/>
</SetMetadata>
</Then>
</If>
Or
Purpose
This operator evaluates all its sub-conditions with an OR.
Usage notes
The Or operator must have at least one sub-condition. It is satisfied if and only if at least one of its
sub-conditions is satisfied. If a sub-condition is satisfied, the next sub-conditions are not evaluated and
the Or operator is satisfied.
The Or operator has no attributes.
Example of <Or>
<If>
<Or>
<Contains name="subject" source="metadata" value="report"/>
<Equals name="format" source="docProperty" value="pdf"/>
</Or>
<Then>
<SetMetadata name="authors">
<Line start="Authors:"/>
</SetMetadata>
</Then>
</If>
Not
Purpose
This operator evaluates its single sub-condition and inverts the evaluation Boolean.
Usage notes
The Not operator must have one sub-condition. It is satisfied if and only if the sub-condition is not
satisfied.
The Not operator has no attributes.
Example of <Not>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>
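The And, Or, and Not operators take sub-conditions, so they can presumably be combined. The following sketch assumes that an operator is accepted as a sub-condition of another operator, which this section does not state explicitly. The format values pdf and doc are illustrative.

```xml
<If>
  <And>
    <!-- Subject has not been set yet ... -->
    <Not>
      <Exists name="Subject" source="metadata"/>
    </Not>
    <!-- ... and the document is in one of two formats -->
    <Or>
      <Equals name="format" source="docProperty" value="pdf"/>
      <Equals name="format" source="docProperty" value="doc"/>
    </Or>
  </And>
  <Then>
    <SetMetadata name="Subject">
      <Block start="Subject:" end="Sir/Madam,"/>
    </SetMetadata>
  </Then>
</If>
```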
Exists
Purpose
This condition is satisfied if a metadata element is defined and has a non-empty value.
Attributes
Table 44. <Exists> Element Attributes
Attribute
Description
name
source
Usage notes
If the metadata element is not found or has an empty value, then the Exists condition is not satisfied.
Example of <Exists>
<If>
<Not>
<Exists name="Subject" source="metadata"/>
</Not>
<Then>
<SetMetadata name="Subject">
<Block start="Subject:" end="Sir/Madam,"/>
</SetMetadata>
</Then>
</If>
Contains
Purpose
This condition is satisfied if the value of a metadata element contains a specific string.
Attributes
Table 45. <Contains> Element Attributes
Attribute
Description
name
source
Specifies the source of the metadata element to check. Possible values are:
metadata, docProperty, docRepositoryAttribute.
value
ignoreCase
Specifies whether to ignore the letter case of the string to look for.
False by default.
Usage notes
The Contains condition is satisfied if and only if the value of the metadata element contains the string given in the value attribute.
Example of <Contains>
<SetMetadata name="authors">
<If>
<Contains name="subject" source="metadata" value="report"/>
<Then>
<Line start="Authors:"/>
</Then>
<Else>
<Block fromMetadataPosition="title" end="Date:"/>
</Else>
</If>
</SetMetadata>
Equals
Purpose
This condition is satisfied if the value of a metadata element is equal to a specific string.
Attributes
Table 46. <Equals> Element Attributes
Attribute
Description
name
source
Specifies the source of the metadata element to check. Possible values are:
metadata, docProperty, docRepositoryAttribute.
value
ignoreCase
Specifies whether to ignore the letter case of the string to look for.
False by default.
Usage notes
The Equals condition is satisfied if and only if the value of the metadata element is equal to the string given in the value attribute.
Example of <Equals>
<SetMetadata name="authors">
<If>
<Equals name="subject" source="metadata" value="report"/>
<Then>
<Line start="Authors:"/>
</Then>
<Else>
<Block fromMetadataPosition="title" end="Date:"/>
</Else>
</If>
</SetMetadata>
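The ignoreCase attribute is false by default, so the comparison above is case-sensitive. As a sketch, the following variant sets ignoreCase to true so that a value such as PDF or Pdf also satisfies the condition. The property names format and primaryauthor are taken from other examples in this chapter and may differ for your documents.

```xml
<SetMetadata name="authors">
  <If>
    <!-- Case-insensitive comparison of the document property value -->
    <Equals name="format" source="docProperty" value="pdf" ignoreCase="true"/>
    <Then>
      <DocProperty name="primaryauthor"/>
    </Then>
    <Else>
      <Line start="Authors:"/>
    </Else>
  </If>
</SetMetadata>
```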
IsPositionBefore
Purpose
This condition is satisfied if a metadata element is positioned before another metadata element.
Attributes
Table 47. <IsPositionBefore> Element Attributes
Attribute
Description
metadata1
Specifies the name of the first metadata element with which to compare
the position.
metadata2
Specifies the name of the second metadata element with which to compare
the position.
Usage notes
This condition compares the position of two metadata elements already defined by previous rules.
It is satisfied if and only if both metadata elements are defined, have values, and the first metadata
element is positioned before the second metadata element. If one of the metadata elements is not
found, then the IsPositionBefore condition is not satisfied.
Example of <IsPositionBefore>
<SetMetadata name="title">
<If>
<IsPositionBefore metadata1="document_type" metadata2="date"/>
<Then>
<Block fromMetadataPosition="date" end="Version"/>
</Then>
<Else>
<Block fromMetadataPosition="document_type" end="Version"/>
</Else>
</If>
</SetMetadata>
Operator rules
Operator rules are special rules that may have multiple sub-rules. Operator rules define the way
all results returned by their sub-rules are processed so that the operator itself returns a single list
of values.
First
Purpose
This operator processes its sub-rules in order and returns the results of the first sub-rule
that returns non-empty values.
Usage notes
Once a sub-rule returns non-empty values, the next sub-rules are not processed, and the result of
the sub-rule is returned by the First operator. If all sub-rules return no value, then the First operator
returns no value.
The First rule has no attributes.
Example of <First>
<SetMetadata name="DocTitle">
<First>
<DocRepositoryAttribute name="title"/>
<DocProperty name="title"/>
<Line occurrence="1"/>
</First>
</SetMetadata>
All
Purpose
This operator processes all its sub-rules in order and returns all the results in a list of values.
Usage notes
The All operator returns a single list of values in which all the non-null values returned by all the
sub-rules are appended, in order. If all sub-rules return no value then the All operator returns no
value.
The All rule has no attributes.
Example of <All>
<Block start="Ricorso in Appello" end="Sentenza">
  <All>
    <SetMetadata name="AppealNumber">
      <Pattern regex="\d{7}-[A-Z]{3}"/>
    </SetMetadata>
    <SetMetadata name="AppealDate">
      <Pattern regex="\d{2}-\d{2}-\d{4}"/>
    </SetMetadata>
  </All>
</Block>
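The way the All operator merges results can be sketched in Python. This is a model of the documented behavior under stated assumptions: sub-rule results are represented as lists that may contain None, and the function name all_values is hypothetical.

```python
from typing import List, Optional


def all_values(sub_rule_results: List[List[Optional[str]]]) -> List[str]:
    """Append the non-null values of every sub-rule, in sub-rule order."""
    merged: List[str] = []
    for values in sub_rule_results:
        merged.extend(v for v in values if v is not None)
    return merged


combined = all_values([["1234567-ABC"], [None], ["01-02-2011", "03-04-2011"]])
```

An empty list stands for "no value" in this sketch, which is what all_values returns when every sub-rule result is empty or null.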
Concat
Purpose
This operator processes all its sub-rules in order and returns a single value with all the values of
the sub-rules concatenated.
Attributes
Table 48. <Concat> Element Attributes
Attribute    Description
separator    Specifies the string inserted between the concatenated values, for example ", ".
Usage notes
The Concat operator returns a single value that is the concatenation of all the non-null values returned
by all the sub-rules, in order. If all sub-rules return no value then the Concat operator returns no value.
Example of <Concat>
<SetMetadata name="person_in_charge">
  <Concat separator=", ">
    <GetMetadata name="last_name"/>
    <GetMetadata name="first_name"/>
  </Concat>
</SetMetadata>
<SetMetadata name="subject">
  <Concat>
    <Block start="Subject:">
      <Line fromOccurrence="1" toOccurrence="3"/>
    </Block>
  </Concat>
</SetMetadata>
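The concatenation behavior can be sketched in Python. This is an illustrative model, not CIS code. The function name concat is hypothetical, and the empty-string default used when no separator is given is an assumption, because the guide does not state the default.

```python
from typing import List, Optional


def concat(values: List[Optional[str]], separator: str = "") -> Optional[str]:
    """Join the non-null values in order; return None if there are none.
    The empty default separator is an assumption, not documented behavior."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return separator.join(non_null)


person_in_charge = concat(["Doe", "John"], separator=", ")
```

Null values are skipped rather than rendered, so a missing last name in this model yields only the first name, without a leading separator.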
MostFrequent
Purpose
This operator processes all its sub-rules in order and returns only the most frequent value.
Usage notes
The MostFrequent operator returns a single value: the most frequent value of all the non-null values
returned by all the sub-rules. If all sub-rules return no value, then the MostFrequent operator returns
no value. If all values have the same number of occurrences, then the first value is kept.
The MostFrequent rule has no attributes.
Example of <MostFrequent>
<SetMetadata name="most_frequent_email">
  <MostFrequent>
    <Pattern regex="[a-z\._]+@[a-z\.]+" allOccurrences="true"/>
  </MostFrequent>
</SetMetadata>
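The selection performed by MostFrequent can be sketched in Python. This is a model only: the function name most_frequent is hypothetical, and resolving every tie in favor of the value that appears first is an assumption that extends the documented case in which all values occur equally often.

```python
from collections import Counter
from typing import List, Optional


def most_frequent(values: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-null value, or None if there is none.
    Ties are resolved in favor of the value that appears first (assumption)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    counts = Counter(non_null)
    best = max(counts.values())
    return next(v for v in non_null if counts[v] == best)


winner = most_frequent(["a@x.com", "b@x.com", "a@x.com"])
```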
In the following example, the abstract metadata element is also defined by the Block rule as anything
between the word Abstract and the word Introduction, but to limit the length of the returned
value it only takes the first 50 characters.
<SetMetadata name="abstract">
  <Block start="Abstract" end="Introduction">
    <Zone startIndex="0" endIndex="50"/>
  </Block>
</SetMetadata>
Part 7
Exposing Content Intelligence Services
Results
This part describes various ways to expose the results of CIS processing:
Expose classification concepts in CenterStage: Classification concepts are category matches
found by Content Intelligence Services (CIS) and based on predefined taxonomies. Unlike the results
of standard CIS classification processing, they are not stored as category assignments. They can be
exposed in CenterStage as search filters.
Expose more entities in CenterStage: Like People, Place, and Organization entities that are
available out-of-the-box in CenterStage, CIS allows you to extract other entities that are relevant to
your company using Temis cartridges and expose them in CenterStage.
Access annotations: Annotations are a unique way to store entities, classification concepts, and
extracted metadata in the repository. The Annotation API allows you to access these annotations
and use them according to your needs.
Integrating CIS classification: There are several integration scenarios for CIS standard
classification processing.
Chapter 15
Expose Classification Concepts or
Entities in CenterStage Filters
This chapter describes the steps required for two customizations: exposing classification concepts in
CenterStage navigation filters and exposing additional entities in CenterStage navigation filters.
Extract classification concepts, page 229
Extract new entities, page 231
Add custom filters in CenterStage, page 233
Optional steps to test the customizations:
Clear previous entities, page 238
Clear the document status, page 239
The customizations described in this document require CenterStage version 1.1 and CIS version 6.6
installed for CenterStage. CenterStage and CIS must use the CIS DAR file (cis_artifacts.dar) version
6.6. It is assumed that these products are installed and running.
Extract classification concepts
1. Configure the taxonomies: create or import taxonomies in TEF format, then synchronize them in
Production mode. Refer to the Configure CIS for Classification chapter in the CIS Administration Guide and
the Content Intelligence Services chapter in the Documentum Administrator User Guide.
2. Configure the document sets for the classification as described in To configure the classification
for CenterStage spaces:, page 230, either for all spaces or only for a specific space. Each time a
space is created in CenterStage, a document set is automatically created for this space.
3. Define the new filters in CenterStage as described in To define new filters in CenterStage:, page
233. The filters are also mapped to the full-text indices.
4. (Optional) Reprocess the documents. If you do not reprocess the documents, the values in
the filter appear only when the documents are modified, which automatically triggers a new
processing. To force a reprocessing, clear the document status table, as described in Clear the
document status, page 239. To test the customization, you can also clear the previously extracted
entities as described in To clear extracted entities with the clear_entities script:, page 238.
To configure the classification for CenterStage spaces:
1. Edit the configuration file of the space of your choice as described in To edit the configuration
file of the document sets:, page 87.
2.
Where
The name attribute in the <analysis> element is any name; it is reused later to define the
way the entity values are stored.
The value of the <repository-taxonomy> element is the name of the taxonomy used. In the
example, two custom taxonomies are mapped. All values are displayed in the same filter.
4.
Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.
Process, or reprocess, the documents to store the results of the classification so that they can be
exposed in the new filter in CenterStage.
Extract new entities
1. For a cartridge other than TM360, set up the cartridge and the annotation plan as described in
the Luxid documentation.
2. Define the new entity types in the configuration file, either for all spaces or only for a specific
space, as described in To configure the document sets for new entity types:, page 232. Each time a
space is created in CenterStage, a document set is automatically created for this space.
3. To expose the new entities in CenterStage clients, define the new filters as described in To define
new filters in CenterStage:, page 233. The filters are also mapped to the full-text indices.
4. (Optional) Reprocess the documents. If you do not reprocess the documents, the values in
the filter appear only when the documents are modified, which automatically triggers a new
processing. To force a reprocessing, clear the document status table, as described in Clear the
document status, page 239. To test the customization, you can also clear the previously extracted
entities as described in To clear extracted entities with the clear_entities script:, page 238.
The following table specifies the internal and public names of the default CenterStage entities.
Table 49. Internal and public names of default entities
Public name            Internal name   Notes
People in text         CISPerson
Organization in text   CISCompany      The CISCompany entity includes values from the Company, Organization, and Media entities of the TM360 cartridge.
Place in text          CISLocation     You cannot map a taxonomy or a custom entity to the Place in text entity.
The following table specifies the names and descriptions of other TM360 entities that you can use.
Name
Description
StockIndex
Function
Postal Address
Fax Number
Phone Number
URL
UserDefined[09]
Time Expression, Money Expression, Measurements, and Relationships are not available in CIS
extraction.
To configure the document sets for new entity types:
1. Edit the configuration file of your choice as described in To edit the configuration file of the
document sets:, page 87.
2.
Where
The name attribute in the <analysis> element is any name; it is reused later to define the
way the entity values are stored.
The value of the <entity> element is the name of the entity in the cartridge, for example the
concept (not the subconcept) in the Temis cartridge TM360. Refer to Table 50, page 232
for the exact name of an entity from TM360, or refer to the Luxid documentation for entities
from other cartridges. If you want to use one of the default entities, use the <builtin-entity>
element instead of the <entity> element. Refer to Table 49, page 231 for the exact names of
the default entities.
4.
<analysis>custom_entity_1</analysis>
</annotation>
</storage>
Where
The code attribute in the <annotation> element is an index number higher than or equal to
1000 or the name of an existing entity type.
The value of the <analysis> element is the name of the analysis as defined in the previous step.
5. If the cartridge is not TM360, add the new annotation plan to the
cis.entity.luxid.annotation_plan.names property in the cis.properties file:
cis.entity.luxid.annotation_plan.names=TM360
By default, only the TM360 cartridge is defined. Separate cartridge names with a comma.
Process, or reprocess, the documents to store the results of the classification so that they can be
exposed in the new filter in CenterStage.
To define new filters in CenterStage:
If the file does not exist, create it by performing the following steps:
a.
Open your local copy of facet_definitions.xml for editing in a text or XML editor.
Chapter 9, Set CenterStage Application Options, of the CenterStage 1.2 Administration
Guide describes the customization of the app.xml file. The customization mechanism for
facet_definitions.xml is similar.
2. In each <facetdisplay> element, add a <facet> element for each new filter.
3. Set a value for the id parameter. This id is used later to define the filter.
In this example, the ids _facet_custom_project and _facet_custom_postal_address are defined:
<facetdisplay id="facets">
<facet id="_kw_location" visible="true"/>
<facet id="_kw_format" visible="true"/>
<facet id="r_modify_date" visible="true"/>
<facet id="r_modifier"/>
<facet id="r_full_content_size"/>
<facet id="kw_topic" visible="true"/>
<facet id="_facet_person"/>
<facet id="_facet_place" visible="true"/>
<facet id="_facet_company" visible="true"/>
4.
where
the value of <label> is the display label of the filter, here Project;
the value of <attribute name> must be dmftcustom/entities/custom_<index>, where
<index> is the index of the taxonomy or custom entity that you set in the configuration for the
document sets, here 1001 (refer to the <annotation> element in Step 4 of the procedure To
configure the classification for CenterStage spaces or Step 4 of the procedure To configure the
document sets for new entity types);
the value of <entity> is an arbitrary value that maps the filter to the index that identifies the
taxonomy or custom entity, here _custom_entity_project.
Similarly, for the Postal Address entity, the definition would be:
<facet id="_facet_custom_postal_address">
  <nlsbundle></nlsbundle>
  <label>Postal Address</label>
  <desc>Postal address entities</desc>
  <sort>FREQUENCY</sort>
  <maxvalues>8</maxvalues>
  <strategies>
    <strategy type="groupby">
      <required>
        <attribute name="r_object_id"/>
      </required>
    </strategy>
    <strategy type="dsearch">
      <required>
        <attribute name="dmftcustom/entities/custom_1002"/>
      </required>
    </strategy>
  </strategies>
  <entity>_custom_entity_postal_address</entity>
  <handler>com.emc.documentum.kw.data.facet.entities.FacetCustomHandler</handler>
  <queryhandler>com.emc.documentum.kw.data.facet.entities.PropertyExpressionHandler</queryhandler>
</facet>
5. Set the mapping between the new filter and the index you previously set for the taxonomy
as follows:
<entities>
  ...
  <entity id="_custom_entity_project">
    <code>1001</code>
    <prefix>X6US70M1001:</prefix>
    <alias>dmftcustom/entities/custom_1001</alias>
  </entity>
  ...
</entities>
where
the value of <entity id> is the value of the <entity> element that you set in the filter definition;
the value of <code> is the index of the taxonomy or custom entity that you set in the
configuration for the document sets;
the value of <prefix> is X6US70M<index>: (including the trailing colon);
the value of <alias> is dmftcustom/entities/custom_<index>.
Similarly, for the Postal Address entity, the mapping would be:
<entities>
  ...
  <entity id="_custom_entity_postal_address">
    <code>1002</code>
    <prefix>X6US70M1002:</prefix>
    <alias>dmftcustom/entities/custom_1002</alias>
  </entity>
  ...
</entities>
6. Save your changes and, if you were editing the file on your local file system, import the file to:
Cabinets/System/Applications/CenterStage Pro/config
After you add the new filter to CenterStage, it is populated by the custom entity values or by the
classification concepts. The following figure shows the result of the customization example used
in the previous procedure.
Figure 4. Example of a custom entity based on Luxid TM360 Postal Address
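Because the <code>, <prefix>, and <alias> values in the mapping step all derive from the same index, the <entity> block can be generated rather than typed by hand. The following Python sketch is a convenience under stated assumptions, not a CIS or CenterStage tool: the function name entity_mapping is hypothetical, and the check on the index reflects only the documented rule that custom indexes are higher than or equal to 1000.

```python
from xml.sax.saxutils import quoteattr


def entity_mapping(entity_id: str, index: int) -> str:
    """Build the <entity> mapping block for a custom index.
    The prefix and alias patterns are the ones given in the procedure."""
    if index < 1000:
        raise ValueError("custom indexes must be higher than or equal to 1000")
    return (
        f"<entity id={quoteattr(entity_id)}>\n"
        f"  <code>{index}</code>\n"
        f"  <prefix>X6US70M{index}:</prefix>\n"
        f"  <alias>dmftcustom/entities/custom_{index}</alias>\n"
        f"</entity>"
    )


postal_address = entity_mapping("_custom_entity_postal_address", 1002)
```

The generated text for index 1002 matches the Postal Address mapping shown in the procedure, and it can be pasted between the <entities> tags of facet_definitions.xml.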
With the xPlore indexer, you need to modify the index to add the name of the attribute in which
entities are stored. More details about xPlore indexing configuration are provided in the Documentum
xPlore Administration Guide.
2.
3.
4.
5.
6. In the section <category-definition>/<indexes>, add a new line for each new entity, such as:
<sub-path leading-wildcard="false" compress="true" boost-value="1.0"
description="Used by CenterStage to compute the custom facet 1001"
include-descendants="false" returning-contents="true" value-comparison="true"
full-text-search="true" enumerate-repeating-elements="false" type="string"
path="dmftcustom/entities/custom_1001"/>
You only have to update the path parameter, and optionally the description parameter. The path
value must be the value of the <alias> element set previously in the mapping configuration in
facet_definitions.xml.
7.
8.
9.
b. Click Rebuild Indexes. A message indicating the progress of the rebuild is displayed.
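Since only the path and description attributes of the <sub-path> line change from one custom entity to the next, the line can be produced from the index. The Python sketch below copies the other attribute values from the example in this procedure. The function name sub_path_for and the template constant are hypothetical, and the sketch does not modify any xPlore configuration file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Attribute values copied from the example above; only description and path vary.
SUB_PATH_TEMPLATE = (
    '<sub-path leading-wildcard="false" compress="true" boost-value="1.0" '
    "description={description} "
    'include-descendants="false" returning-contents="true" value-comparison="true" '
    'full-text-search="true" enumerate-repeating-elements="false" type="string" '
    "path={path}/>"
)


def sub_path_for(index: int) -> str:
    """Return the <sub-path> line for the custom entity stored under this index."""
    alias = f"dmftcustom/entities/custom_{index}"  # must equal the <alias> value
    description = f"Used by CenterStage to compute the custom facet {index}"
    return SUB_PATH_TEMPLATE.format(
        description=quoteattr(description), path=quoteattr(alias)
    )


# Parsing the generated line confirms that it is well-formed XML.
parsed = ET.fromstring(sub_path_for(1002))
```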
1. In the local system where you deployed the CenterStage WAR file, navigate to <CenterStagePro installation
directory>/WEB-INF/classes.
2.
where <xx> is the two-letter language code for your language. The root name of the files,
MyCustomLocalization, is arbitrary in this example but must be the same for all files. The
properties files are text files with the .properties file extension.
3. In every properties file, add one line for each filter label and one for each filter description to define a
mapping between the label or description and its translation for each language. The description
is the phrase displayed in the More... view in CenterStage.
MyCustomLocalization.properties:
FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1001_DISPLAY_DESC=Projects and products of MyCompany
FACET_CUSTOM_1002_DISPLAY_LABEL=Postal Address
FACET_CUSTOM_1002_DISPLAY_DESC=Postal address entities
These definitions are usually not visible in the graphical user interface.
MyCustomLocalization_en.properties:
FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1001_DISPLAY_DESC=Projects and products of MyCompany
FACET_CUSTOM_1002_DISPLAY_LABEL=Postal Address
FACET_CUSTOM_1002_DISPLAY_DESC=Entities based on postal addresses
4. Remove the hard-coded value of the label in the facet_definitions.xml file located at:
Cabinets/System/Applications/CenterStage Pro/config
a. Set the value of <nlsbundle> to the filename of the default properties file, without the .properties
extension.
b. Set the value of the <nlsid> elements to the label name and description name that you set in
the properties file.
<facet id="_facet_custom_finance">
<nlsbundle>MyCustomLocalization</nlsbundle>
<label><nlsid>FACET_CUSTOM_1001_DISPLAY_LABEL</nlsid></label>
<desc><nlsid>FACET_CUSTOM_1001_DISPLAY_DESC</nlsid></desc>
<strategies>
...
<facet id="_facet_custom_postal_address">
<nlsbundle>MyCustomLocalization</nlsbundle>
<label><nlsid>FACET_CUSTOM_1002_DISPLAY_LABEL</nlsid></label>
<desc><nlsid>FACET_CUSTOM_1002_DISPLAY_DESC</nlsid></desc>
<strategies>
...
5.
The following figure shows the result of the localization example used in the previous procedure.
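Every properties file in the localization procedure is expected to define the same label and description keys. A quick consistency check can be sketched in Python. The parser below is deliberately simplified and is an assumption of this example: it handles only key=value lines and # comments, not the full Java properties syntax (continuation lines, colon separators, or escapes). The function names are hypothetical.

```python
from typing import Dict, Set


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def missing_keys(default_text: str, localized_text: str) -> Set[str]:
    """Return the keys defined in the default file but absent from a localized file."""
    return set(parse_properties(default_text)) - set(parse_properties(localized_text))


DEFAULT = """FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
FACET_CUSTOM_1002_DISPLAY_LABEL=Postal Address
"""
LOCALIZED = """# incomplete translation
FACET_CUSTOM_1001_DISPLAY_LABEL=Projects
"""
missing = missing_keys(DEFAULT, LOCALIZED)
```

Here the localized text lacks the label for index 1002, so that key is reported as missing.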
1. On the CIS host machine, locate the clear_entities script (clear_entities.bat on Windows hosts,
clear_entities on Linux hosts) in <CIS installation directory>/bin.
2.
To remove all entities for one space (that is, for one document set):
clear_entities -Docset:<docset_id>
To delete the document set status for one document set (one CenterStage space):
DELETE FROM dm_docstatus WHERE st_docset_id=<docset_id>
Chapter 16
Annotation API
Annotations are a unique way to store entities, classification concepts, and extracted metadata in the
repository. The Annotation API allows you to access these annotations and use them according to
your needs.
Storing the results of analytics processing as annotations and accessing them through the API has
several benefits over storing the results as attribute values. Attribute storage has the following drawbacks:
The modification of an attribute value by CIS updates the last_modifier and
last_modified_date properties of the documents.
Attribute values that CIS also updates cannot be edited manually: the edit triggers a
CIS reprocessing of the document, which then overwrites the modification.
You can find the javadoc documentation at <CIS installation directory>/doc/cis_client_api. Refer to the
package com.documentum.cis.annotation. You do not need the packages com.documentum.ci and
com.documentum.services.classification to access annotations.
Chapter 17
Integrate CIS Classification
This chapter describes the most common integration scenarios for the classification.
The CIS server analyzes documents from a Documentum repository and extracts relevant information
about them. CIS can then use the results of the classification to do the following:
Automatically set values of document attributes (Auto Tagging).
Link the documents into appropriate repository folders (Auto Categorization).
Suggest attribute values to WDK-based application users (Web Publisher integration).
Any combination of these tasks.
Content Intelligence Services is just one piece of your broader Documentum content management
solution. There are four common integration scenarios for the classification:
Organize your library, page 244
Workflow and lifecycle processing, page 244
Web Publisher integration, page 244
Retention Policy Services integration, page 245
Once CIS processing is complete, you can use the results of the analysis to:
Improve searching: By extracting information from the document content and adding it to the
document attributes, you transform unstructured data into searchable structured data. Because
CIS adds attributes programmatically, you can be sure that they make consistent use of a standard
vocabulary.
Organize documents for easy navigation: Auto Categorization enables you to automatically link
documents into a repository folder structure that makes sense to users.
Support personalization: Personalization server platforms use document attributes to tailor
the content displayed to different users. Using CIS enriches the attributes of a document with
information based on the content of the document, making the subject matter available as a basis
for personalization.
When a new folder or cabinet is created under the templates section in Web Publisher, the default
value of the CIS node is EMPTY. Any value specified at the folder level overrides the name of the
CIS Node that is defined globally at the "Web Publisher Admin Settings" level.
Appendix A
Content Intelligence Services
Processing Diagram
This section provides a diagram of CIS processing and describes the two main flows:
Classification, based on repository document sets and taxonomies.
Entity extraction, only available in CenterStage deployments and based on file document sets.
Note that error conditions do not appear in the diagram.
The following figures describe the diagram legend and provide notes related to the diagram.
Figure 6. CIS processing diagram legend
Appendix B
Properties Extracted
This appendix identifies the properties extracted from documents. The list of properties differs
depending on the file format of the document. If no value can be extracted for a given property,
that property is not created for the document.
abstract        disposition   lastsavedby     receivedfrom
address         division      manager         revisiondate
attachments     doccomment    office          section
authorization   doctype       owner           source
category        editminutes   primaryauthor   subject
company*        editor        project         title
countpages      group         publisher       versionnotes
creationdate    keyword       purpose         versionnumber
department      language      reference
* These properties can only be used for metadata extraction and not for classification.
Appendix C
Document Set Configuration Files
default.xml
Modify this configuration file to define the default configuration for all document sets.
Example C-1. default.xml configuration file
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) 1998-2010 EMC Corporation.
All Rights Reserved. -->
<docset-defaults>
  <!-- ===================================================== -->
  <!-- Configuration for classic repository docsets.         -->
  <!-- By default, only the classification is activated.     -->
  <!-- ===================================================== -->
  <docset-default type="repo">
    <analysis-plan>
      <classification-step/>
    </analysis-plan>
  ...
    <analysis-plan>
      <entity-detection-step/>
    </analysis-plan>
    <entity-detection>
      <analysis name="person">
        <!-- This builtin entity has a post-processing filter
        to remove some wrong values. -->
        <builtin-entity>CISPerson</builtin-entity>
      </analysis>
      <analysis name="company">
        <!-- This builtin entity aggregates the Company, Organization
        and Media default entities, and does post-processing to remove
        some wrong values. -->
        <builtin-entity>CISCompany</builtin-entity>
      </analysis>
      <analysis name="location">
        <!-- This builtin entity is all Geopolitical values in
        default Location entity. -->
        <builtin-entity>CISLocation</builtin-entity>
      </analysis>
    </entity-detection>
    <storage>
      <annotation code="Person">
        <analysis>person</analysis>
      </annotation>
      <annotation code="Company">
        <analysis>company</analysis>
      </annotation>
      <annotation code="Location">
        <analysis>location</analysis>
      </annotation>
      <!-- Classification is not enabled by default, but we enable assigner
      to reduce configuration rework if using classification with file docset. -->
      <category-assignments>
        <all-repository-taxonomies />
      </category-assignments>
    </storage>
  </docset-default>
</docset-defaults>
docset-sample.xml
Create a copy of this file and modify it to configure a document set.
Example C-2. docset-sample.xml configuration file
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) 1998-2010 EMC Corporation. All Rights Reserved. -->
<!-- This is a sample configuration file for a specific docset.
To configure a docset, copy the content of this file as a base into a new file.
The file names that must be used to configure CenterStage docsets are available
in the file 'space_docset_list.txt' -->
<docset>
  <!-- This section defines the list of processing that must be executed
  on all documents of the docset. -->
  <analysis-plan>
    <classification-step/>
    <entity-detection-step/>
  </analysis-plan>
  <!-- This section defines some taxonomies that must be used in analysis 'foo'. -->
  <classification>
    <analysis name="foo">
      <repository-taxonomy>my-da-taxo</repository-taxonomy>
      <tef-taxonomy>my-direct-taxo</tef-taxonomy>
    </analysis>
  </classification>
  <!-- This section customizes the entity detection. It is possible
  to add entities that will be stored in addition to default entities. -->
  <entity-detection>
    <analysis name="bar">
      <entity>Function</entity>
    </analysis>
  </entity-detection>
  <!-- This section defines how the analysis results are persisted. -->
  <storage>
    <!-- Stores the analysis foo into the annotation with the code 1001. -->
    <annotation code="1001">
      <analysis>foo</analysis>
    </annotation>
    <!-- Stores the analysis bar into the Documentum attribute keywords. -->
    <attribute name="keywords">
      <analysis>bar</analysis>
    </attribute>
  </storage>
</docset>
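In a document set configuration file, each <analysis> value inside <storage> refers to an analysis name defined in the <classification> or <entity-detection> section. A misspelled name can be caught before deployment with a small Python check. This is a sketch under stated assumptions: the function name undefined_analyses is hypothetical, the sample is a shortened copy of docset-sample.xml, and the check looks at a single file only, so names that a document set takes from default.xml would be reported as undefined.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = """<docset>
  <classification>
    <analysis name="foo">
      <repository-taxonomy>my-da-taxo</repository-taxonomy>
    </analysis>
  </classification>
  <entity-detection>
    <analysis name="bar">
      <entity>Function</entity>
    </analysis>
  </entity-detection>
  <storage>
    <annotation code="1001"><analysis>foo</analysis></annotation>
    <attribute name="keywords"><analysis>bar</analysis></attribute>
  </storage>
</docset>"""


def undefined_analyses(xml_text: str) -> List[str]:
    """Return the analysis names used in <storage> that no section defines."""
    root = ET.fromstring(xml_text)
    defined = {
        analysis.get("name")
        for section in ("classification", "entity-detection")
        for analysis in root.findall(f"{section}/analysis")
    }
    referenced = [
        analysis.text.strip()
        for analysis in root.findall("storage/*/analysis")
        if analysis.text
    ]
    return [name for name in referenced if name not in defined]


problems = undefined_analyses(SAMPLE)
```

An empty result means that every storage entry in the file points at an analysis defined in the same file.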