Cross Language
Information Retrieval
Road Map
Cross Lingual IR
Motivation
Definition
General Issues With CLIR
Basic Approaches to CLIR
CLIR evaluation
CLIR applications
2021/4/15 3
Information Retrieval
Single language: both the user's query and the documents to be searched are in the same language.
Cross language: the documents are written in a language different from the language of the user's query.
documents
query
Growth rate of Internet language usage across the world's continents, 2000-2010 (data updated June 30, 2010)
The Internet Big Picture
World Regions | Population | Internet Users | Penetration (% population) | Users % of Table | Growth 2000-2015
Africa | 1,158,355,663 | 313,257,074 | % | % | 6,839%
Asia | 4,032,466,882 | 1,563,208,143 | % | % | 1,268%
Europe | 821,555,904 | 604,122,380 | % | % | 475%
Middle East | 236,137,235 | 115,823,882 | % | % | 3,426%
North America | 357,172,209 | 313,862,863 | % | % | 191%
Latin America | 617,776,105 | 333,115,908 | % | % | 1,743%
Oceania/Australia | 37,157,120 | 27,100,334 | % | % | 256%
World Total | 7,260,621,118 | 3,270,490,584 | 45% | 100% | 806%
World Internet Users and 2015 Population Stats
Usage of content languages
for websites
2002 | 2015
English 72% | English %
German 7% | Russian %
Japanese 6% | German %
Spanish 3% | Japanese %
French 3% | Spanish %
Italian 2% | French %
Dutch 2% | Portuguese %
Chinese 2% | Chinese %
Korean 1% | Italian %
Russian 1% | Polish %
Portuguese 1% | Turkish %
Cross Language IR
Motivation
Information unavailability in some languages
Language barrier
Definition:
Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
Example:
A user may pose a query in Chinese but retrieve relevant documents written in English.
Why do we need CLIR systems?
We need technologies that enable access to information regardless of geographic or language barriers.
The goal is to find, retrieve and understand relevant information in whatever language or form.
CLIR has become one of the key factors affecting knowledge sharing all over the world.
General Issues With CLIR
Multilingual text access (character sets, etc.)
Differences between languages
-stemming, compound words, breaks between words, etc.
Term ambiguity between languages
What to translate (query vs. document) and how
Matching strategies
No translation
(1) Cognate matching
Translation
(2) Query translation
(3) Document translation
(4) Interlingual techniques
Cognate matching
In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
The unchanged term can be expected to match a corresponding term in the other language if the two languages have a close linguistic relationship (for example, "generation" in English and French).
When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.
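One possible way to measure similarity between a term and a candidate cognate is a normalized edit distance. The sketch below is only an illustration under that assumption: the function names, the 0.7 threshold, and the tiny vocabulary are hypothetical and are not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cognate_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the lowercased strings are identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def match_cognates(query_term, target_vocab, threshold=0.7):
    """Return (term, score) pairs from the target vocabulary above the threshold."""
    scored = [(t, cognate_similarity(query_term, t)) for t in target_vocab]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


# Hypothetical French index vocabulary
matches = match_cognates("generation", ["génération", "maison", "generation"])
```

For very different scripts, the same comparison would be applied to a transliterated form of the term rather than to the raw strings.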
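Query translation
Figure: a Chinese query passes through the translation system to become a French query; the search engine runs it against the French document collection; the retrieved French documents are returned as results for the user to select and browse.
Process:
Translate the Chinese query into French
Search the French document collection
Translate the retrieved results into Chinese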
Query translation
Query translation is the most widely used matching strategy for CLIR due to its tractability.
The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
It is less computationally costly to translate a query than to translate a large set of documents.
Challenge: term ambiguity
‘queries are often short and short queries provide little context for disambiguation’
Term disambiguation will be discussed later.
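Pros and cons of query translation
Advantages
Simple
Easy to operate
Flexible
Saves time and space; highly efficient
Disadvantages
Lacks context
For short queries, translation ambiguity is high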
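Document translation
Figure: the translation system converts the French document collection into a Chinese document collection; the search engine runs the Chinese query against it and returns results for the user to select and browse.
Process:
Translate all the French documents into Chinese documents
Search the Chinese documents directly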
Document translation
Document translation has opposite advantages and
disadvantages from query translation.
In CLIR experiments, this approach is not usually
utilized, and query translation is dominant.
However, some researchers have used it to translate
large sets of documents since more varied context within
each document is available for translation, which can
improve translation quality.
Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a CLIR experiment from German to English.
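Pros and cons of document translation
Advantages
Translation is done only once
Documents provide richer context
Documents can be translated offline in advance
Disadvantages
Translation is slow
Consumes large amounts of space and time; low efficiency
Depends on the quality of the machine translation system
Query translation vs. document translation
The choice depends on the language resources available
Query translation is usually more widely used
Both approaches raise challenges for "interaction"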
Interlingual approach
An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
One type of interlingual approach uses the ‘‘synsets’’ provided in WordNet, a well-known machine-readable thesaurus.
For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset.
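The idea of matching through language-independent concept labels can be sketched as follows. This is a toy illustration only: the concept IDs (C1, C2, ...) and both lexicons are made up for the example and are not real WordNet synset numbers, and the overlap score is an assumed, simplistic matching function.

```python
# Hypothetical lexicons mapping words to language-independent concept IDs
EN_CONCEPTS = {"bank": {"C1", "C2"}, "river": {"C3"}, "money": {"C4"}}
FR_CONCEPTS = {"banque": {"C1"}, "rive": {"C2"}, "berge": {"C2"},
               "rivière": {"C3"}, "argent": {"C4"}}


def to_interlingua(terms, lexicon):
    """Convert a list of words into the set of concept IDs they may express."""
    concepts = set()
    for term in terms:
        concepts |= lexicon.get(term.lower(), set())
    return concepts


def concept_overlap(query_terms, doc_terms, query_lexicon, doc_lexicon):
    """Fraction of the query's concepts that also appear in the document."""
    q = to_interlingua(query_terms, query_lexicon)
    d = to_interlingua(doc_terms, doc_lexicon)
    if not q:
        return 0.0
    return len(q & d) / len(q)


# English query against a French document, compared in concept space
overlap = concept_overlap(["river", "bank"], ["rivière", "berge"],
                          EN_CONCEPTS, FR_CONCEPTS)
```

Neither side is translated into the other language here; both are mapped into the shared concept space and compared there.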
Translation techniques
Dictionary-based methods
Using a bilingual Machine Readable Dictionary (MRD).
Most retrieval systems are still based on so-called ‘‘bag-of-words’’ architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
Thus we can translate a query easily by replacing each query term with its translation equivalents appearing in a bilingual dictionary or a bilingual term list.
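A minimal sketch of term-by-term replacement with a bilingual term list is shown below. The dictionary entries and the function name are hypothetical; terms missing from the dictionary are passed through unchanged, which is an assumption in the spirit of cognate matching rather than something the slides prescribe.

```python
# Hypothetical English-to-French term list
EN_FR = {
    "oil": ["pétrole", "huile"],
    "survey": ["enquête", "levé"],
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary equivalents.

    Returns (translated_terms, untranslated_terms). Terms without an entry
    are kept as they are so that they can still match cognates or names.
    """
    translated, untranslated = [], []
    for term in query.lower().split():
        equivalents = dictionary.get(term)
        if equivalents:
            translated.extend(equivalents)
        else:
            translated.append(term)
            untranslated.append(term)
    return translated, untranslated


terms, missing = translate_query("Oil survey Lambsdorff", EN_FR)
```

Keeping every equivalent preserves recall but carries all the ambiguity of the dictionary into the target query.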
Figure: bilingual dictionary
Term translation
Figure: looking up the terms of a query in a bilingual dictionary.
Which translation to choose? oil / petroleum; probe / survey / take samples
No translation!
Word segmentation error: restrain / cymbidium goeringii
Some issues in term translation
Compound words, for example in German: decomposition
No boundaries between words, e.g. Chinese: segmentation
Specialized vocabulary not contained in the dictionary, e.g. named entities
Examples
Compound decomposition
Chinese word segmentation
新西兰花
新西兰 花 New Zealand flowers
新 西兰花 fresh broccoli
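The segmentation ambiguity in this example can be reproduced by enumerating every way of splitting the string into lexicon words. The four-word lexicon and the function name below are assumptions made for the illustration; a real segmenter would also score the alternatives.

```python
def segmentations(text, lexicon):
    """Return every split of text into consecutive words found in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Segment the remainder recursively and prepend the current word
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical mini-lexicon: "new/fresh", "New Zealand", "broccoli", "flower"
LEXICON = {"新", "新西兰", "西兰花", "花"}
candidates = segmentations("新西兰花", LEXICON)
```

Both readings are valid with respect to the lexicon, so the choice between them needs context or statistics, which a short query rarely provides.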
Corpora-based method
Parallel or comparable bilingual corpora are useful resources from which we can extract information beneficial for CLIR.
For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).
Parallel corpora
A parallel corpus (pl. corpora) is a document
collection composed of two or more disjoint
subsets, each written in a different language,
such that documents in each subset are
translations of documents in each other subset.
Very high accuracy
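Figure: the Rosetta Stone, with labels for its hieroglyphic, ancient Egyptian, and Greek scripts
The Rosetta Stone
The Rosetta Stone is a stele made in 196 BC, originally inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved on it in Greek, in ancient Egyptian script, and in the Demotic script of the time. Because the stone carries three versions of the same text, modern archaeologists were able to compare the versions and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. The stone has become an important milestone in the study of ancient Egyptian history.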
More parallel corpora
News:
DE-News (German-English)
Hong-Kong News, Xinhua News (Chinese-English)
Government documents:
Canadian-Hansards (French-English)
Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
UN Treaties (Russian, English, Arabic, …)
Bible (many, many languages)
Examples
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform
English: The discussion around the envisaged major tax reform continues .
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an .
English: The FDP economics expert, Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .
Comparable corpora
A comparable corpus is a pair of corpora in two different languages that come from the same domain.
The two corpora discuss the same topics.
Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
Some researchers extract phrase pairs from comparable corpora using a classifier approach.
Example
The WWW can provide rich and ubiquitous machine-
readable resources, from which we may be able to
automatically extract information useful for CLIR.
For example, Chen (2002) and Chen and Gey (2003)
made use of a general search engine on the Internet and
tried to find English translation equivalents of Chinese or
Japanese terms (mainly proper nouns) by analyzing
contexts of these terms in Chinese and Japanese Web
documents returned by the engine.
Term disambiguation techniques (translation ambiguity)
Disambiguation among multiple alternative term translations: how do we choose among several translations? E.g., Apple, Bank
Use of part-of-speech (POS) tags.
Use of a parallel corpus.
Use of co-occurrence statistics in the target corpus.
Use of the query expansion technique.
Use of part-of-speech tags
The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
This method requires that POS tagging software be available for both languages.
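The POS filter can be sketched as below, assuming the source term has already been tagged and that the dictionary stores a POS tag with each translation. The dictionary contents, the tag names, and the fallback to all candidates when nothing matches are assumptions for the illustration.

```python
# Hypothetical English-to-French dictionary with a POS tag per translation
EN_FR_POS = {
    "book": [("livre", "NOUN"), ("réserver", "VERB")],
    "light": [("lumière", "NOUN"), ("léger", "ADJ"), ("allumer", "VERB")],
}


def filter_by_pos(term, pos, dictionary):
    """Keep only translations whose POS equals the POS of the source term.

    If no translation carries that tag, fall back to every candidate
    instead of returning nothing (an assumed design choice).
    """
    candidates = dictionary.get(term, [])
    same_pos = [word for word, tag in candidates if tag == pos]
    return same_pos or [word for word, _ in candidates]


verb_reading = filter_by_pos("book", "VERB", EN_FR_POS)
noun_reading = filter_by_pos("light", "NOUN", EN_FR_POS)
```

The filter only removes candidates of the wrong word class; several translations with the same POS still have to be resolved by other means.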
Parallel corpus-based disambiguation
Davis (1997, 1998) used a parallel corpus to determine the ‘‘best’’ translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching the parallel corpus.
Translation probability
Figure: a source term (探测) with multiple candidate translations (survey, probe, sample, measure), each assigned a translation probability p
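The slides do not say how the translation probabilities are obtained. One common option, used here purely as an assumption, is to estimate them from a sentence-aligned parallel corpus with the EM procedure of IBM Model 1 (in this sketch without a NULL word). The three-sentence German-English corpus and all names are hypothetical.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate prob[source_word][target_word] from aligned sentence pairs."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Start from a uniform distribution over target words
    prob = {s: {t: 1.0 / len(tgt_vocab) for t in tgt_vocab} for s in src_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for src, tgt in pairs:
            for t in tgt:
                # E-step: share one count for t among the source words
                norm = sum(prob[s][t] for s in src)
                for s in src:
                    fraction = prob[s][t] / norm
                    count[s][t] += fraction
                    total[s] += fraction
        # M-step: renormalize the expected counts per source word
        for s in src_vocab:
            for t in tgt_vocab:
                prob[s][t] = count[s][t] / total[s]
    return prob


# Hypothetical sentence-aligned toy corpus (source, target)
PAIRS = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
prob = train_ibm1(PAIRS)
best_for_haus = max(prob["haus"], key=prob["haus"].get)
```

The resulting distribution for each source term can then be used either to pick the most probable translation or to weight all candidates in the translated query.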
Disambiguation based on co-occurrence statistics
The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
First, the two most related terms in the query are determined based on co-occurrence statistics in the source-language corpus; then the ‘‘best’’ translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
Note that these two corpora do not have to be parallel or comparable.
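The second step, choosing the best pair of translations for two related query terms, can be sketched as follows. The sketch assumes the two terms have already been picked, uses a plain document co-occurrence count as the statistic, and works on a made-up French corpus; all names are hypothetical.

```python
from itertools import product


def cooccurrence(a, b, docs):
    """Number of documents (sets of terms) that contain both a and b."""
    return sum(1 for doc in docs if a in doc and b in doc)


def best_translation_pair(candidates_a, candidates_b, target_docs):
    """Pick the pair of translations that co-occurs most often in the target corpus."""
    return max(product(candidates_a, candidates_b),
               key=lambda pair: cooccurrence(pair[0], pair[1], target_docs))


# Hypothetical target-language corpus, each document reduced to a set of terms
FR_DOCS = [
    {"banque", "intérêt", "taux"},
    {"banque", "intérêt", "prêt"},
    {"rive", "fleuve"},
    {"curiosité", "musée"},
]
# Candidate translations of the English query terms "bank" and "interest"
pair = best_translation_pair(["banque", "rive"], ["intérêt", "curiosité"], FR_DOCS)
```

Raw counts favor frequent words, so a normalized association measure could be substituted for `cooccurrence` without changing the rest of the sketch.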
Query expansion for
disambiguation
Pseudo relevance feedback (PRF), also known as blind
feedback, is widely recognized as an effective
technique for enhancing performance of information
retrieval. PRF also works effectively for CLIR tasks.
In the case of CLIR, two kinds of PRF are feasible:
Pre-translation feedback and
Post-translation feedback
Pre-translation feedback
If a corpus in the source language is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
Pre-translation feedback may improve precision because the PRF is performed using the entire query rather than each source term separately. That is, synonyms or related terms corresponding to the ‘‘correct’’ meaning of each source term within the context of the query are expected to be added automatically through the PRF process.
Post-translation feedback
After translation, standard PRF can be applied using the target document collection (post-translation feedback).
Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used for selecting ‘‘good’’ terms, namely the Rocchio method and the probabilistic method.
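A simplified Rocchio-style expansion for pseudo relevance feedback might look like the sketch below. It assumes raw term frequencies as document weights, treats the top-ranked documents as relevant, and omits the negative (non-relevant) component of the full Rocchio formula; the parameter values and names are illustrative only.

```python
from collections import Counter


def rocchio_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_new_terms=2):
    """Reweight the query and add the strongest terms from the top-ranked docs.

    new_weight(t) = alpha * query_weight(t) + beta * mean term frequency of t
    over top_docs, where top_docs is a list of tokenized documents.
    """
    weights = Counter({term: alpha for term in query_terms})
    centroid = Counter()
    for doc in top_docs:
        for term, tf in Counter(doc).items():
            centroid[term] += tf / len(top_docs)
    # Candidate expansion terms: highest centroid weight, ties broken by name
    new_terms = sorted((t for t in centroid if t not in weights),
                       key=lambda t: (-centroid[t], t))[:n_new_terms]
    for term in list(weights) + new_terms:
        weights[term] += beta * centroid[term]
    return dict(weights)


# Hypothetical top-ranked documents for the query "tax reform"
TOP_DOCS = [
    ["tax", "reform", "budget"],
    ["tax", "budget", "parliament"],
    ["reform", "budget", "tax"],
]
expanded = rocchio_expand(["tax", "reform"], TOP_DOCS, n_new_terms=1)
```

The same function serves both settings: run it on source-language documents before translation, or on target-language documents after translation.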
Bi-directional translation
Boughanem et al. (2002) explored a ‘‘bi-directional translation’’ technique in which a form of backward translation is used for ranking translation candidates.
Suppose that we need to translate English query terms into French ones. In ‘‘bi-directional translation,’’ a set of French equivalents for an English term is first found in an English–French dictionary. Next, using a French–English dictionary, each French equivalent is translated back into a set of English terms. Basically, if that set includes the original source term, the French translation equivalent is chosen as a preferred translation.
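The back-translation check described above can be sketched with two toy dictionaries. The entries and the function name are hypothetical, and the sketch only separates preferred from non-preferred candidates; it does not attempt to reproduce the exact ranking used by Boughanem et al.

```python
# Hypothetical forward (English-French) and backward (French-English) dictionaries
EN_FR = {"bank": ["banc", "banque", "rive"]}
FR_EN = {"banque": ["bank"], "rive": ["shore", "bank"], "banc": ["bench"]}


def bidirectional_rank(term, forward, backward):
    """Order translation candidates so that those translating back to term come first."""
    preferred, others = [], []
    for candidate in forward.get(term, []):
        back_translations = backward.get(candidate, [])
        if term in back_translations:
            preferred.append(candidate)
        else:
            others.append(candidate)
    return preferred + others


ranked = bidirectional_rank("bank", EN_FR, FR_EN)
```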
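CLIR evaluation
Information retrieval evaluation
Given a search topic, a document collection, and a set of relevant documents judged by humans
Judge the search results returned by the system
TREC CLIR (96-02): English to other languages
CLEF (00-): between European languages
NTCIR (99-): Asian languages and English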
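Given human relevance judgments for a topic, a ranked result list can be scored with standard measures. The slides do not name specific measures, so precision at k and average precision are used here as an assumption; the document IDs are made up.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical system output and relevance judgments for one topic
RANKED = ["d3", "d1", "d7", "d2"]
RELEVANT = {"d1", "d2"}
ap = average_precision(RANKED, RELEVANT)
p2 = precision_at_k(RANKED, RELEVANT, 2)
```

Averaging `average_precision` over all topics gives a single score per system, which makes it possible to compare a cross-language run with a monolingual run on the same collection.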
2021/4/15 46
跨语言检索评价模型
47
Applications of CLIR
Cross-language search engines
April 25, 2006: European search engine “Quaero”
The French President announced 90 million euros of support.
May 16, 2007: Google Translate
Provides CLIR for 12 languages
Goal: take "all the Web & translate into multiple langs."
May 5, 2008: Yahoo Babel Fish
Provides CLIR between 12 languages
It was AltaVista's project, later bought by Yahoo
Google Translate
Yahoo Babel Fish
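Question
Compare the cross-language search engines of Google and Yahoo!, and analyze the strengths and weaknesses of each.
Google: done in one step (translate & search); the search results are translated back into the source language. Advantages: fast, and easy for users to understand the results. Disadvantage: users cannot modify the translation.
Yahoo!: done in two steps (translate + search); the search results are not translated. Advantage: the intermediate step lets users modify the translation. Disadvantages: more complicated, and users cannot read the results.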
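Cross-language retrieval in digital libraries
WorldWideScience, a global cross-language search platform for science, was released on June 11, 2010 at the summer conference of ICSTI (International Council for Scientific and Technical Information) held in Helsinki, the capital of Finland.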
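WorldWideScience
The members of the alliance are all professional library and information institutions or leading organizations in scientific and technical information, such as the U.S. Department of Energy Office of Scientific and Technical Information (OSTI), the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
The platform can also automatically perform cross-language, cross-database searches.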
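Cross-language patent retrieval
According to the World Intellectual Property Organization (WIPO), patent documents contain 90%-95% of the world's research results, while other technical documents (papers, journals, etc.) contain only 5%-10% of research and development results.
Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.
In May 2010, WIPO released a beta version of its cross-language patent search system PATENTSCOPE, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
The system supports cross-language patent search only among five languages: English, French, German, Japanese, and Spanish.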
Cross-language image retrieval
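Applications in e-commerce
CINDOR is currently one of the more successful commercial cross-language information retrieval systems.
The CINDOR system has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.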
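References
Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005(41), pp. 433-455.
Ge Yundong (葛运东). Research on query translation techniques for cross-language information retrieval [D]. Soochow University, 2010.
Wang Xuwen (王序文). Research on cross-language information retrieval techniques based on topic pseudo relevance feedback [D]. Beijing University of Posts and Telecommunications, 2014.
Peng Lin (彭琳). Research on measuring the semantic similarity of Chinese words and its application in cross-language information retrieval [D]. Fudan University, 2010.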
Challenges for "interaction"
CLIR poses some unique challenges for
interaction
How do you help users select translated query
terms?
How do you help users select document terms for
query refinement?
How do you compensate for poor translation
quality?
Cross-Language Information Access (CLIA)
Figure: a CLIA system built around a CLIR system. Labels in the figure:
Need identification, need negotiation, need clarification, need instantiation
Question analysis, query formulation, request generation, source selection
CLIR system
Result processing, information extraction, result classification, result summarization
Result presentation, result visualization
Result selection, result examination
Query reformulation, relevance feedback
CLIA vs. CLIR
Cross-Language Information Retrieval
A narrow view of CLIA
CLIR is limited, good for developing matching
techniques
Cross-Language Information Access
Aims to help users find the information they want
Concerned with more than just the ranking of results
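Cross-language information access
User-centered
Focuses on the interaction between the user and the system
Relevance depends on the specific "user" and the specific "situation"
Interaction
Information needs cannot be fully understood
Language ambiguity
Broader range of needs and uses
Multimedia: images, audio, ...
Focused information: passage retrieval, question answering, ...
Condensed information: summarization, information extraction, ...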
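Cross-language information access life cycle
Figure: query formulation produces a query; query translation produces a translated query; retrieval produces a result list; document selection produces the documents to browse; document browsing produces the documents to deliver.
Feedback loops: query reformulation, translation re-selection, document re-selection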
Supporting query (re)formulation
Problems
Term mismatch: query translations vs. terms in docs
Translations are in a foreign language
How to display, interpret and control them
Is query translation an extra step?
Query reformulation
Where and how to get info
User-assisted query translation
Supporting document (re)selection
Selection needs translated surrogates
How to generate surrogates?
How to translate surrogates?
Examination needs translated documents
Surrogate generation
How to generate surrogates
First N words in docs (good for news articles)
Key Word In Context, automatic summarization
Passage retrieval
How to translate surrogates
Gloss translation: term by term translation
Phrase translation: only translate phrases in docs
Machine Translation
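Three of the options listed above (first N words, key word in context, gloss translation) are simple enough to sketch directly. The function names, the window size, and the two-entry gloss dictionary are assumptions for the illustration; the sample sentence is taken from the DE-News example earlier in the slides.

```python
def first_n_words(text, n=10):
    """Surrogate made of the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


def kwic(text, keyword, window=3):
    """Key Word In Context: each keyword occurrence with `window` words on both sides."""
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?") == keyword.lower():
            snippets.append(" ".join(words[max(0, i - window): i + window + 1]))
    return snippets


def gloss_translate(text, dictionary):
    """Term-by-term translation; words without an entry are left unchanged."""
    return " ".join(dictionary.get(word.lower(), word) for word in text.split())


TEXT = "Die Diskussion um die vorgesehene grosse Steuerreform dauert an ."
# Hypothetical German-to-English gloss dictionary
DE_EN = {"die": "the", "steuerreform": "tax reform"}

lead = first_n_words(TEXT, 3)
context = kwic(TEXT, "Steuerreform", window=1)
gloss = gloss_translate("die Steuerreform", DE_EN)
```

A gloss is usually enough for a relevance judgment on a short surrogate, while full document examination tends to need real machine translation.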
Surrogates help users select and judge documents