HUAWEI TECHNOLOGIES CO., LTD.
Hang Li
New Trend in Machine Learning:
Learning from Human Machine
Interactions
Machine Learning: The Most Effective
Approach to Artificial Intelligence
Hang Li's blog post: "Machine Learning Is Changing Our Work and Life"
My Book on Statistical Learning Methods
Talk Outline
• How Much Training Data Is Enough Data?
• Machine Learning Using Big Data Collected from
Human Machine Interactions
– Log Data Mining
– Crowdsourcing
– Human Computation
• Example
– Learning of Relevance Model from Click-through
Data
Using a Sufficient Amount of High
Quality Training Data Is Key to
Learning an Accurate Model
How Much Training Data Is
Enough Data?
Learning of Binary Classifier
• $x \in X,\ y \in Y = \{0, 1\}$
• $D = P(x, y)$
• $h \in C,\ y = h(x)$: hypothesis space
• $|h|$: complexity of model
• $S = \{(x_i, y_i)\}$: training samples
• $\mathrm{err}_D(h)$: expected error
• $\mathrm{err}_S(h)$: empirical error
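As a minimal sketch of these definitions, the empirical error $\mathrm{err}_S(h)$ is the fraction of training samples that a hypothesis $h$ misclassifies. The threshold classifier and the toy sample below are hypothetical, chosen only for illustration; the expected error $\mathrm{err}_D(h)$ cannot be computed this way because $D$ is unknown in practice.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, int]  # (x, y) with y in {0, 1}


def empirical_error(h: Callable[[float], int], samples: List[Sample]) -> float:
    """Fraction of samples (x, y) in S for which h(x) != y."""
    if not samples:
        raise ValueError("S must contain at least one sample")
    mistakes = sum(1 for x, y in samples if h(x) != y)
    return mistakes / len(samples)


# Hypothetical hypothesis: predict 1 when x exceeds a threshold of 0.5.
def threshold_classifier(x: float) -> int:
    return 1 if x > 0.5 else 0


# Hypothetical training set S; the last sample is labeled against the rule.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.7, 0)]
err_S = empirical_error(threshold_classifier, S)
```

Here one of the five samples is misclassified, so `err_S` is 0.2.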
Sample Complexity
• Theorem (Occam’s Razor)
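If
$$|S| \ge \frac{1}{2\epsilon^2}\left(\ln|h| + \ln\frac{2}{\delta}\right)$$
then with probability at least $1 - \delta$, for all $h \in C$,
$$|\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le \epsilon$$
• Example
For given $\epsilon$ and $\delta$, $|h| = 100$ requires $|S| = 50{,}000$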
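The bound can be evaluated numerically. The sketch below assumes the theorem reads $|S| \ge \frac{1}{2\epsilon^2}(\ln|h| + \ln\frac{2}{\delta})$. The function name is hypothetical, and $\epsilon = \delta = 0.01$ are assumed values, picked because with $|h| = 100$ they give a sample size close to the 50,000 quoted in the example.

```python
import math


def occam_sample_size(model_complexity: float, epsilon: float, delta: float) -> int:
    """Smallest integer |S| with |S| >= (ln|h| + ln(2/delta)) / (2 * epsilon**2)."""
    if model_complexity < 1 or not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("need |h| >= 1 and epsilon, delta in (0, 1)")
    bound = (math.log(model_complexity) + math.log(2.0 / delta)) / (2.0 * epsilon ** 2)
    return math.ceil(bound)


# |h| = 100 as in the example; epsilon and delta are assumed, not from the slide.
n = occam_sample_size(model_complexity=100, epsilon=0.01, delta=0.01)
```

Under these assumptions `n` comes out slightly below 50,000. Because the bound scales with $1/\epsilon^2$, halving $\epsilon$ roughly quadruples the required sample size.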
Training Data Size
• The more the better
• Empirically, at least $100 \times O(k)$ training examples
are needed to learn an accurate classifier
• $k$ is the number of parameters
Usually Data Is Labeled by Small
Number of Professionals
New Trend: Collecting Data from
Human Machine Interactions
New Trend
• Learning from Interactions with Humans
• Models
– Log Data Mining
– Crowdsourcing
– Human Computation
Example: Click-through Data in Web Search
• User submits query
• System returns URLs
• User clicks relevant
URLs
• Implicit feedback on
relevance from users
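One simple way to turn such a log into implicit relevance feedback is to count, for each (query, URL) pair, how often the URL was clicked when it was shown. This is only a sketch: the record layout, the function name, and the example URLs are hypothetical, and a raw click rate ignores effects such as position bias.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical log record: (query, URL shown to the user, whether it was clicked).
LogRecord = Tuple[str, str, bool]


def click_through_rates(log: Iterable[LogRecord]) -> Dict[Tuple[str, str], float]:
    """Estimate implicit relevance as clicks / impressions per (query, URL) pair."""
    impressions: Counter = Counter()
    clicks: Counter = Counter()
    for query, url, clicked in log:
        impressions[(query, url)] += 1
        if clicked:
            clicks[(query, url)] += 1
    return {pair: clicks[pair] / shown for pair, shown in impressions.items()}


log: List[LogRecord] = [
    ("machine learning", "example.org/ml-intro", True),
    ("machine learning", "example.org/ml-intro", True),
    ("machine learning", "example.org/cooking", False),
    ("machine learning", "example.org/ml-intro", False),
]
rates = click_through_rates(log)
```

In this toy log the first URL is clicked in two of three impressions and the second is never clicked, so the first receives the higher relevance estimate.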
Log Data Mining
• Definition
– The goal of log data mining is to extract
information from a log data set and transform it into
an understandable structure for further use
• Characteristics
– User behavior is recorded by the application system
– Users do not need to carry out any special operations
beyond using the system
Example: Amazon Mechanical Turk
• Market for labeling
tasks
• Requester provides
labeling task
• Workers perform the
labeling task and
receive monetary
rewards (usually a small
amount)
Crowdsourcing
• Definition
– The act of outsourcing tasks, traditionally
performed by an employee or contractor, to an
undefined, large group of people or community
through an open call
• Characteristics
– Market is formed
– Users receive monetary rewards, but may also be
motivated by learning and entertainment
Example: Game with a Purpose
ESP Game
• Two players tag images
• Rewarded when their tags
agree
• Agreed tags are assigned to
images for image search
Example agreed tags: grass, outdoor
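The agreement rule can be sketched as a set intersection over the two players' tags. The function name, the normalization (trimming and lower-casing), and the image identifier are assumptions made for illustration, not details of the actual game.

```python
from typing import Dict, Iterable, Set


def agreed_tags(tags_a: Iterable[str], tags_b: Iterable[str]) -> Set[str]:
    """Tags that both players entered for the same image (case-insensitive)."""

    def normalize(tags: Iterable[str]) -> Set[str]:
        return {t.strip().lower() for t in tags if t.strip()}

    return normalize(tags_a) & normalize(tags_b)


# Hypothetical round: two players tag the same image independently.
player_a = ["grass", "Outdoor", "dog"]
player_b = ["outdoor", "field", "grass"]

# Agreed tags are attached to the image so that image search can use them.
image_index: Dict[str, Set[str]] = {}
image_index.setdefault("image_001", set()).update(agreed_tags(player_a, player_b))
```

Only "grass" and "outdoor" are entered by both players, so only those two tags are stored for the image.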
Example: ReCAPTCHA
• Control word:
– Image of word rendered
by computer
– For identifying humans
• Unknown word:
– Image of word from OCR
– For correcting OCR errors
by humans
Luis von Ahn
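The two-word scheme can be sketched as follows: a response counts as a vote on the unknown word only if the control word is typed correctly. The class name, the majority-vote rule, and the `min_votes` threshold are hypothetical; the slides do not describe how the real service aggregates answers.

```python
from collections import Counter
from typing import Optional


class WordChallenge:
    """Pair a known control word with an unknown word that OCR could not read."""

    def __init__(self, control_answer: str, min_votes: int = 3) -> None:
        self.control_answer = control_answer.lower()
        self.min_votes = min_votes       # hypothetical agreement threshold
        self.votes: Counter = Counter()  # transcriptions of the unknown word

    def submit(self, control_guess: str, unknown_guess: str) -> bool:
        """Return True if the user passes; count their vote only in that case."""
        if control_guess.strip().lower() != self.control_answer:
            return False
        self.votes[unknown_guess.strip().lower()] += 1
        return True

    def resolved_word(self) -> Optional[str]:
        """Most common transcription once it has at least min_votes votes."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.min_votes else None


challenge = WordChallenge("morning")
challenge.submit("morning", "upon")   # passes, vote counted
challenge.submit("mourning", "upan")  # fails the control word, vote ignored
challenge.submit("Morning", "upon")
challenge.submit("morning", "Upon")
```

After three matching votes from users who passed the control word, `challenge.resolved_word()` returns the agreed transcription.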
Human Computation
• Definition
– Given a computational problem from a requester,
design a solution that uses both automated computers
and human computers
• Characteristics
– Humans as ‘computers’
– Humans intentionally perform basic operations
– Humans and computers jointly accomplish a task
Relations between Models
[Diagram: Log Data Mining, Crowdsourcing, and Human Computation compared along two dimensions]
• Users are paid vs. users are not paid
• Users intentionally participate vs. users unintentionally participate
Example: Learning of Relevance
Model from Click-through Data
Matching in Latent Space
• Motivation
– Perform robust matching between query and document
• Assumption
– Queries have similarity
– Documents have similarity
– Click-through data represent “similarity” relations between
queries and documents
• Approach
– Projection to latent space
– Regularization or constraints
• Results
– Automatically learn a relevance model
– Significantly enhance accuracy of query-document
matching
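The projection step can be sketched as follows: a query vector and a document vector are each mapped into a low-dimensional latent space and compared with a dot product. The matrix names `L_q` and `L_d`, the three-term vocabulary, and the hand-set matrix entries are assumptions for illustration; in the approach above the mappings are learned from click-through data under regularization or constraints, which this sketch does not do.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(matrix: Matrix, vector: Vector) -> List[float]:
    """Matrix-vector product: map a term vector into the latent space."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def match_score(L_q: Matrix, L_d: Matrix, query: Vector, doc: Vector) -> float:
    """Dot product of the projected query and the projected document."""
    q_latent = project(L_q, query)
    d_latent = project(L_d, doc)
    return sum(a * b for a, b in zip(q_latent, d_latent))


# Hypothetical vocabulary: ["car", "automobile", "banana"].
# Hand-set 2 x 3 mappings: the first latent dimension groups "car" and "automobile".
L_q = [[1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
L_d = [[1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

query_car = [1.0, 0.0, 0.0]
doc_automobile = [0.0, 1.0, 0.0]
doc_banana = [0.0, 0.0, 1.0]

related = match_score(L_q, L_d, query_car, doc_automobile)
unrelated = match_score(L_q, L_d, query_car, doc_banana)
```

Although the query and the first document share no term, they land on the same latent dimension and receive a positive score, while the unrelated document scores zero.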
Matching in Latent Space
[Figure: queries q1, q2, ..., qm in the query space and documents d1, d2, ..., dn in the document space are projected by $L_q$ and $L_d$ into a shared latent space]
IR Models as Similarity Functions
(Xu and Li 2010)
[Figure: queries q1, q2, ..., qm in the query space and documents d1, d2, ..., dn in the document space, both represented by unigrams, are mapped into a new space]
• VSM, BM25, LM, MRF
• Mapping functions are diagonal matrices
Problem with IR Models: Term
Mismatch
• Matching in Latent Space can solve the
problem by
– Reducing dimensionality of latent space (from
term level matching to semantic matching)
– Correlating semantically similar terms (matrices
are not diagonal)
– Automatically learning mapping functions from
data
• A generalized and learnable form of IR models
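The effect of a non-diagonal mapping can be seen in a small bilinear score $q^\top M d$, where $M$ is assumed here to stand for the product of the two mapping matrices. The vocabulary, the vectors, and the 0.8 correlation weight are hypothetical and hand-set, not learned from data.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def bilinear_score(query: Vector, M: Matrix, doc: Vector) -> float:
    """Compute q^T M d by summing q_i * M[i][j] * d_j over all i, j."""
    return sum(q_i * m_ij * d_j
               for q_i, row in zip(query, M)
               for m_ij, d_j in zip(row, doc))


# Hypothetical vocabulary: ["car", "automobile", "price"]
query = [1.0, 0.0, 1.0]  # "car price"
doc = [0.0, 1.0, 1.0]    # "automobile price"

# Diagonal M: each term only matches itself (term-level matching).
M_diagonal = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]

# Non-diagonal M: a hand-set weight of 0.8 correlates "car" and "automobile".
M_correlated = [[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]]

term_level = bilinear_score(query, M_diagonal, doc)
semantic = bilinear_score(query, M_correlated, doc)
```

With the diagonal matrix only the shared term "price" contributes, giving 1.0. The off-diagonal weight lets "car" also match "automobile", which raises the score to 1.8.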
Experimental Results
[Figure: experimental results for Enterprise Search and Web Search]
• RMLS and PLS work better than BM25, Random Walk, Latent
Semantic Indexing
• RMLS performs as well as PLS, with higher learning
efficiency and scalability
Example: Projecting Keywords and
Images into Latent Space
• Image Annotation Data
[Figure: keywords and images in the latent space; keywords shown: hook, fishing, singer, microphone, soldier, warrior]
Opportunities and Challenges
• Development of Effective Mechanisms for Data
Collection
– From crowds
– Big data
– High quality (vs. garbage in, garbage out)
– Useful for applications
• Development of Effective Methods for Machine
Learning
– Large scale
– High accuracy
Acknowledgement
• I thank Prof. Qiang Yang for discussions on
related issues
• I thank Wei Wu, Jun Xu, and Zhengdong Lv for
joint work on learning of relevance model
• I thank the organizers of SDCC for inviting me to
the conference
Thank you!