Final Workshop Meeting at NTCIR-9: December 6-9, 2010 (NII, Tokyo) - Cross-Lingual Link Discovery Task

############### CALL FOR PARTICIPATION ###################
NTCIR-9 Cross-lingual Link Discovery Task
http://ntcir.nii.ac.jp/CrossLink
Final workshop meeting at NTCIR-9: December 6-9, 2010 [NII, Tokyo]
############################################################
INTRODUCTION:
Cross-lingual link discovery (CLLD) is a way of automatically finding potential
links between isolated documents in different languages. It is similar to
traditional cross-lingual information retrieval (CLIR), because CLIR
can be viewed as a process of creating a virtual link between the provided
cross-lingual query and the retrieved documents. CLLD, on the other hand,
actively recommends a set of meaningful anchors in the source document and
uses them as queries, together with contextual information from the text, to
establish actual links with documents in other languages.
Wikipedia is an online multilingual encyclopaedia that contains an enormous
number of articles in most of the written languages on this planet, and it
includes extensive hypertext links between documents in the same language for
easy reading and referencing. However, the pages in different languages are
rarely interrelated except for the cross-lingual links between pages about the
same subject. This could pose serious difficulties for users who seek
information or knowledge from sources in different languages. Cross-lingual
link discovery therefore tries to break the language barrier in knowledge
sharing. With CLLD, users can discover documents in languages which they may or
may not be familiar with, or which have a richer set of documents than their
language of choice.
For English there are several link discovery tools, which assist topic curators in
discovering prospective anchors and targets for a given document. No such
tools yet exist that support the cross-linking of documents in multiple
languages. This task aims to incubate technologies that assist CLLD and
enhance the user experience of viewing or editing documents in a cross-lingual
manner. Language differences, ambiguities and other language issues such
as Chinese word segmentation could all make this task even more challenging.
Researchers who are interested in cross-lingual link discovery are all welcome
to join us. In particular, researchers from the CLIR or link discovery
communities are encouraged to participate in this exciting task.
To participate, please visit the registration
page: http://research.nii.ac.jp/ntcir/ntcir-9/howto.html. You will also have to sign
a user agreement form; details will be announced by NII later.

TASK DEFINITION:
In general, links between documents can be classified as either outgoing or
incoming. In this task we focus mainly on outgoing links that start from
English source documents and point to Chinese, Japanese and Korean
target documents. The CLLD task comprises the following three
subtasks:
- English to Chinese CLLD
- English to Japanese CLLD
- English to Korean CLLD
Participants can choose to participate in one or more of these three subtasks.
The English topics and the target corpus consist of actual Wikipedia pages in XML
format with rich structured information. To submit a run, participants are required
to choose the most suitable anchors from the topic document, and for each
anchor identify the most suitable documents in the test corpus. For each topic we
will allow up to 50 anchors, each with up to 5 targets, so there is a maximum of
250 outgoing links per topic.
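The per-topic limits above can be illustrated with a short sketch. It assumes
a hypothetical in-memory layout (a mapping from anchor text to a ranked list
of target document IDs); this is not the official run format, and the function
names are invented for illustration only.

```python
from typing import Dict, List

MAX_ANCHORS_PER_TOPIC = 50   # limit stated in the task definition
MAX_TARGETS_PER_ANCHOR = 5   # limit stated in the task definition


def trim_run(anchors: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep at most 50 anchors, each with at most 5 targets.

    `anchors` maps anchor text to a ranked list of target document IDs.
    This layout is an assumption made for the sketch, and the anchors are
    assumed to be stored in ranked order (dicts keep insertion order).
    """
    trimmed: Dict[str, List[str]] = {}
    for anchor, targets in list(anchors.items())[:MAX_ANCHORS_PER_TOPIC]:
        trimmed[anchor] = targets[:MAX_TARGETS_PER_ANCHOR]
    return trimmed


def count_links(anchors: Dict[str, List[str]]) -> int:
    """Total number of outgoing links (anchor, target pairs) for one topic."""
    return sum(len(targets) for targets in anchors.values())


# A made-up topic with 60 candidate anchors and 8 candidate targets each.
example = {f"anchor {i}": [f"doc{i}_{j}" for j in range(8)] for i in range(60)}
run = trim_run(example)
print(len(run), count_links(run))  # 50 250
```

Trimming by rank is only one possible policy; a system might equally choose
to submit fewer anchors or targets than the limits allow.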
TOPIC AND DOCUMENT COLLECTIONS:

Two sets of 25 articles chosen from the English Wikipedia will be used as topics,
one set for the dry run and one for the formal run. These topics will be
orphaned by removing all links to them (from the collection) and from them (to the
collection). The corresponding pages in Chinese, Japanese and Korean will also
be removed from those collections.
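As a rough illustration of what orphaning means, the sketch below works on an
abstract link graph rather than on the actual XML collection. The
(source, target) pair representation, the page names and the function name are
all assumptions made for this example.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (source page, target page); hypothetical layout


def orphan_topics(links: Set[Link], topics: Set[str]) -> Set[Link]:
    """Drop every link that points to a topic page or starts from one."""
    return {
        (source, target)
        for (source, target) in links
        if source not in topics and target not in topics
    }


# Made-up page names, for illustration only.
links = {
    ("Tokyo", "Japan"),   # starts from the topic: removed
    ("Japan", "Tokyo"),   # points to the topic: removed
    ("Japan", "Asia"),    # does not involve the topic: kept
}
print(orphan_topics(links, {"Tokyo"}))  # {('Japan', 'Asia')}
```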
The training and test collections for the three subtasks are exactly the same. The
collections are formed from search-engine-friendly XML files created from Wikipedia
MySQL database dumps taken in June 2010. The details of the collections are
as follows:
Language   Articles   Size   Date of dump
Chinese    318,736    2.7G   27/06/2010
Japanese   716,088    6.1G   24/06/2010
Korean     201,596    1.2G   28/06/2010
ASSESSMENT AND EVALUATION:
There will be two types of assessment: automatic assessment using the
Wikipedia ground truth (existing cross-lingual links), and manual assessment
done by human assessors. For the latter, all submissions will be pooled and a
GUI tool for efficient assessment will be used. In manual assessment, either the
anchor candidate or the target link can be judged relevant or non-relevant.
Once an anchor candidate is assessed as non-relevant, all anchors and
associated links inside this anchor become non-relevant. After the
assessment, the performance of each cross-lingual link discovery system will be
evaluated using Precision, Recall and Mean Average Precision metrics.
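For readers unfamiliar with these metrics, the following sketch shows their
textbook definitions applied to a ranked list of links for one topic. It is
not the official evaluation tool: the exact formulation used by the task may
differ, the link identifiers are made up, and each ranked list is assumed to
contain no duplicates.

```python
from typing import List, Set


def precision(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the returned links that are relevant."""
    if not ranked:
        return 0.0
    return sum(1 for link in ranked if link in relevant) / len(ranked)


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant links that were returned."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant link occurs,
    divided by the total number of relevant links."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, link in enumerate(ranked, start=1):
        if link in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[List[str]], truths: List[Set[str]]) -> float:
    """Average of the per-topic average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(r, t) for r, t in zip(runs, truths)) / len(runs)


# One made-up topic: two of the four returned links are relevant,
# and one relevant link ("c") was not returned.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
print(precision(ranked, relevant))          # 0.5
print(recall(ranked, relevant))             # 2/3
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3
```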
FOR MORE DETAILS:
Please visit
http://ntcir.nii.ac.jp/CrossLink
Please also note that the registration deadline is December 20, 2010 (for all
NTCIR-9 tasks).
ORGANIZERS:
Shlomo Geva, Queensland University of Technology, Australia
Andrew Trotman, University of Otago, New Zealand
Yue Xu, Queensland University of Technology, Australia
Eric Tang, Queensland University of Technology, Australia
Darren Huang, Queensland University of Technology, Australia
If you have any questions, please contact Eric Tang (l4.tang@qut.edu.au) or
send an email to the task mailing list: crosslink@lists.otago.ac.nz
