############### CALL FOR PARTICIPATION ###################

NTCIR-9 Cross-lingual Link Discovery Task

http://ntcir.nii.ac.jp/CrossLink

Final workshop meeting at NTCIR-9: December 6-9, 2011 [NII, Tokyo]

############################################################

INTRODUCTION:

Cross-lingual link discovery (CLLD) is a way of automatically finding potential
links between documents in different languages. It is closely related to
traditional cross-lingual information retrieval (CLIR), because CLIR can be
viewed as a process of creating a virtual link between a cross-lingual query
and the retrieved documents. CLLD, however, actively recommends a set of
meaningful anchors in the source document and uses them, together with
contextual information from the text, as queries to establish actual links
with documents in other languages.

Wikipedia is an online multilingual encyclopaedia that contains a very large
number of articles covering most of the world's written languages, with
extensive hypertext links between documents in the same language for easy
reading and referencing. However, pages in different languages are rarely
linked to each other, except for the cross-lingual links between pages about
the same subject. This can pose serious difficulties for users who seek
information or knowledge from sources in different languages. Cross-lingual
link discovery therefore tries to break the language barrier in knowledge
sharing. With CLLD, users can discover documents in languages that they may or
may not be familiar with, or that have a richer set of documents than their
language of choice.

For English there are several link discovery tools that assist topic curators
in discovering prospective anchors and targets for a given document. No such
tools yet exist to support the cross-linking of documents in multiple
languages. This task aims to incubate technologies that assist CLLD and to
enhance the user experience of viewing or editing documents in a cross-lingual
manner. Language differences, ambiguities and other language issues such as
Chinese segmentation make this task even more challenging.

Researchers who are interested in cross-lingual link discovery are all welcome
to join us. In particular, researchers from the CLIR and link discovery
communities are encouraged to participate in this exciting task.

To participate, please visit the registration page:
http://research.nii.ac.jp/ntcir/ntcir-9/howto.html
You will also have to sign a user agreement form; details will be announced by
NII later.



TASK DEFINITION:

In general, a link between documents can be classified as either outgoing or
incoming. In this task we focus mainly on outgoing links that start from
English source documents and point to Chinese, Korean, and Japanese target
documents. The CLLD task comprises the following three subtasks:

- English to Chinese CLLD
- English to Japanese CLLD
- English to Korean CLLD

Participants may take part in one or more of the three subtasks above.

The English topics and the target corpus consist of actual Wikipedia pages in
XML format with rich structured information. To submit a run, participants are
required to choose the most suitable anchors from the topic document and, for
each anchor, to identify the most suitable documents in the test corpus. For
each topic we will allow up to 50 anchors, each with up to 5 targets, so there
is a total of up to 250 outgoing links per topic.
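
As an illustration of the limits above, the following is a minimal Python
sketch that trims a candidate run to at most 50 anchors per topic and at most
5 targets per anchor. The Anchor class, its field names and the helper
functions are assumptions made for this example only; they are not the
official submission format, which is described on the task web site.

```python
from dataclasses import dataclass, field
from typing import List

MAX_ANCHORS_PER_TOPIC = 50   # limit stated in the task definition
MAX_TARGETS_PER_ANCHOR = 5   # limit stated in the task definition


@dataclass
class Anchor:
    """Hypothetical in-memory representation of one anchor in a run."""
    text: str                                          # anchor text from the English topic
    targets: List[str] = field(default_factory=list)   # ranked target document IDs


def trim_run(anchors: List[Anchor]) -> List[Anchor]:
    """Keep the first 50 anchors and the first 5 targets of each anchor."""
    return [
        Anchor(a.text, a.targets[:MAX_TARGETS_PER_ANCHOR])
        for a in anchors[:MAX_ANCHORS_PER_TOPIC]
    ]


def count_links(anchors: List[Anchor]) -> int:
    """Total number of outgoing links in a run for one topic."""
    return sum(len(a.targets) for a in anchors)
```

After trimming, count_links never exceeds 50 x 5 = 250 for a single topic. The
sketch assumes that anchors and targets are already ranked best first, so that
truncation keeps the strongest candidates.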

TOPIC AND DOCUMENT COLLECTIONS:



Two sets of 25 articles chosen from the English Wikipedia will be used as
topics, one set for the dry run and one for the formal run. These topics will
be orphaned by removing all links to them (from the collection) and from them
(to the collection). The corresponding pages in Chinese, Japanese and Korean
will also be removed from those collections.
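
The orphaning step can be pictured on a simple link graph. The sketch below is
only an illustration: it assumes links are available as (source, target) pairs
of page names, which is a hypothetical representation and not the way the
organizers process the XML collection.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str]  # (source page, target page); hypothetical representation


def orphan_topics(links: Iterable[Link], topics: Set[str]) -> Set[Link]:
    """Drop every link that starts at, or points to, a topic page."""
    return {
        (src, dst)
        for src, dst in links
        if src not in topics and dst not in topics
    }
```

A link survives only if neither of its endpoints is a topic, which removes both
the links to the topics and the links from them.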

For all three subtasks, the training and test collections are exactly the
same. The collections consist of search-engine-friendly XML files created from
Wikipedia MySQL database dumps taken in June 2010. Details of the collections
are given below (language of the corpus, number of articles, size of the
corpus, and date of the dump):

Language    Articles    Size    Dump date
Chinese     318,736     2.7G    27/06/2010
Japanese    716,088     6.1G    24/06/2010
Korean      201,596     1.2G    28/06/2010

ASSESSMENT AND EVALUATION:

There will be two types of assessment: automatic assessment using the
Wikipedia ground truth (existing cross-lingual links), and manual assessment
by human assessors. For the latter, all submissions will be pooled and a GUI
tool for efficient assessment will be used. In manual assessment, either the
anchor candidate or the target link can be judged relevant or non-relevant.
Once an anchor candidate is assessed as non-relevant, all anchors and
associated links inside this anchor become non-relevant. After the assessment,
the performance of each cross-lingual link discovery system will be evaluated
using the Precision, Recall and Mean Average Precision metrics.
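
For readers unfamiliar with these metrics, the following Python sketch shows
the standard information retrieval definitions of precision, recall and mean
average precision. It is an assumption that the task applies them in this
textbook form; the official evaluation may compute them at a different level
(for example per anchor rather than per document). The function names and the
dictionary layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(ranked: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for one topic (ranked has no duplicates)."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average of average_precision over all assessed topics."""
    if not qrels:
        return 0.0
    return sum(
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ) / len(qrels)
```

Here runs maps each topic to the ranked list of links a system returned, and
qrels maps each topic to the set of links judged relevant, whether by the
Wikipedia ground truth or by the human assessors.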

FOR MORE DETAILS:

Please visit

http://ntcir.nii.ac.jp/CrossLink

Please also note that the registration deadline is December 20, 2010 (for all

NTCIR-9 tasks).

ORGANIZERS:

Shlomo Geva, Queensland University of Technology, Australia

Andrew Trotman, University of Otago, New Zealand

Yue Xu, Queensland University of Technology, Australia

Eric Tang, Queensland University of Technology, Australia

Darren Huang, Queensland University of Technology, Australia

If you have any questions, please contact Eric Tang (l4.tang@qut.edu.au) or

send an email to the task mailing list: crosslink@lists.otago.ac.nz
