The Design and Construction of Twitter's
Machine Learning Platform
郭晓江 (Jack Guo)
Why Machine Learning?
• Mine useful information from data and predict the future
• Machine learning is becoming increasingly
powerful, driven by
– Data availability
– Computation power
– Advances in algorithms
More Data or Better Model?
Machine Learning is Important @Twitter
• ~80% of DAU (daily active users) is attributable to teams doing ML
• ~90% of revenue comes from ads backed by ML models
• ML platform supports many teams
– ads ranking
– ads targeting
– timeline ranking
– anti-spam
– recommendation
– moments ranking
– trends
Machine Learning is Large Scale
@Twitter
• Take ads ranking as an example
– ~10,000,000,000,000 predictions made daily
– ~10,000,000 weights per model
– ~10,000 features per training example
– ~1,000,000,000,000 bytes (~1 TB) of training data
Machine Learning is Realtime @Twitter
• Twitter is all about realtime: news, events, videos, trends
• Advertiser campaigns target realtime events and hashtags,
spanning as little as a few hours or even minutes
• ML needs to adapt to dynamically changing traffic
Ads Ranking
Timeline Ranking
• Most people miss most tweets
• Surface the most relevant tweets a user missed
Recommendation
Anti-spam
• Fake account detection
• Spam detection
• Abuse detection
• NSFW (not safe for work) detection
Scaling Challenges
• Organization scaling
– How to support client teams efficiently?
– How to enable client teams to prototype quickly?
• System scaling
– How to train and make inference efficiently?
– How to enable fast iteration and
experimentation?
Organization Scaling
• ML platform’s focus
– Define feature, transform, and model formats
– Provide framework and tooling
• data ETL, trainers, parameter search, serving runtime,
workflow management
– Onboard clients and provide support
• Client team’s focus
– Define and extract features
– Own and maintain training pipeline and serving runtime
System Scaling
• Data preparation
• Offline training, workflow
management
• Online serving, A/B testing
• Tooling
[Architecture diagram: Data Warehouse → Data Preparation → Offline Training → Online Serving → A/B Testing, with Workflow Management spanning the pipeline]
Data Format—Standardized Feature
Format
• Enable feature sharing across teams
• Make ML platform iteration easy
• Feature format
– Support 4 dense and 2 sparse feature types
– Use hashed ids instead of string names for efficient
serialization, storage, and compute
– Collocate the schema (id-to-name mapping) with the data
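The slides say hashed ids replace string names and the id-to-name schema travels with the data, but not which hash is used. A minimal Python sketch of that idea, with an illustrative SHA-256-based 64-bit id:

```python
import hashlib

def feature_id(name: str) -> int:
    """Map a feature name to a stable 64-bit id (illustrative scheme;
    the slides do not specify the actual hash function used)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Schema:
    """Id-to-name mapping collocated with the data, so records can
    carry compact ids while remaining human-decodable."""
    def __init__(self):
        self.id_to_name = {}

    def register(self, name: str) -> int:
        fid = feature_id(name)
        self.id_to_name[fid] = name
        return fid

    def name(self, fid: int) -> str:
        return self.id_to_name[fid]

# Records are keyed by compact ids; the schema recovers names on demand.
schema = Schema()
record = {schema.register("user.age"): 31.0,
          schema.register("tweet.len"): 140.0}
```

Because the hash is deterministic, any process holding the schema can decode a record without a central name registry.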
Data ETL Framework—DataAPI
• Make operations on distributed data painless for ML
practitioners
• Scala data ETL API
– Provide powerful abstractions for ML datasets and
operations
– Fluent API, enabling imperative programming
• Ensure data and metadata consistency across operations
Example
1. Take the dataset whose path is given by “input”
2. Sample 10% of it at random
3. Discretize it with the given discretizer
4. Left join with media labels on tweet id
5. Dump the result to the path given by “output”
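The real DataAPI is a Scala library over distributed data; the following Python toy only reproduces the fluent, chainable shape of the five steps above. All method and field names here are illustrative, not the platform's:

```python
import random

class Dataset:
    """Toy in-memory stand-in for a distributed fluent ETL API."""
    def __init__(self, rows):
        self.rows = list(rows)

    @staticmethod
    def read(table):                      # 1. take the input dataset
        return Dataset(table)

    def sample(self, rate, seed=42):      # 2. sample randomly
        rng = random.Random(seed)
        return Dataset(r for r in self.rows if rng.random() < rate)

    def discretize(self, discretizer):    # 3. apply the discretizer
        return Dataset({**r, "score": discretizer(r["score"])}
                       for r in self.rows)

    def left_join(self, other, key):      # 4. left join on a key
        index = {r[key]: r for r in other.rows}
        return Dataset({**r, **index.get(r[key], {})} for r in self.rows)

    def write(self, sink):                # 5. dump the result
        sink.extend(self.rows)
        return self

tweets = [{"tweet_id": i, "score": i / 10} for i in range(100)]
labels = [{"tweet_id": i, "label": i % 2} for i in range(100)]
out = []
(Dataset.read(tweets)
    .sample(0.1)
    .discretize(lambda s: int(s * 10) // 2)
    .left_join(Dataset(labels), "tweet_id")
    .write(out))
```

Each method returns a new `Dataset`, which is what makes the one-expression pipeline above possible.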
Trainer Portfolio
• Large scale logistic regression
– Vowpal Wabbit: open-source C++ trainer
– Lolly: JVM online learning trainer
• Discretizer
– Boosting tree: GBDT, AdaBoost
– Random forest
– MDL discretizer
• Deep Learning
– Torch based training libraries
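The slides list an MDL discretizer without further detail. One standard choice is entropy-based splitting with the Fayyad–Irani MDL stopping criterion, sketched here in Python (an assumption about the variant, not a description of Twitter's implementation):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return ent

def mdlp_cuts(values, labels):
    """Recursively split on the entropy-minimizing boundary, stopping
    when the information gain fails the Fayyad-Irani MDL test."""
    pairs = sorted(zip(values, labels))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    best = None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # can only cut between distinct values
        left, right = ys[:i], ys[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or cond < best[0]:
            best = (cond, i, left, right)
    if best is None:
        return []
    cond, i, left, right = best
    gain = entropy(ys) - cond
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(ys)
                                - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (log2(n - 1) + delta) / n:
        return []  # MDL says the split does not pay for itself
    cut = (xs[i - 1] + xs[i]) / 2
    return mdlp_cuts(xs[:i], left) + [cut] + mdlp_cuts(xs[i:], right)
```

The MDL test makes the discretizer parameter-free: it stops producing bins as soon as a split no longer reduces description length.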
PredictionEngine
• Large scale online SGD learning
• Used in both offline training and online
serving
• Architecture
– Transform: MDL, Decision tree
– Feature crossing
– Logistic Regression: Vowpal Wabbit or
in-house JVM learner
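The pipeline ends in a logistic regression trained online with SGD. A minimal sparse sketch in Python (the class name, learning rate, and fixed-rate schedule are illustrative, not the platform's):

```python
from math import exp

class OnlineLR:
    """Sparse online logistic regression trained one example at a time."""
    def __init__(self, lr=0.1):
        self.w = {}   # feature id -> weight, stored sparsely
        self.lr = lr

    def predict(self, features):
        z = sum(self.w.get(f, 0.0) * v for f, v in features.items())
        return 1.0 / (1.0 + exp(-z))

    def update(self, features, label):
        # gradient of the log-loss is (p - y) * x, applied per feature
        g = self.predict(features) - label
        for f, v in features.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * g * v

# Train on a trivially separable stream of sparse examples.
model = OnlineLR(lr=0.1)
for _ in range(200):
    model.update({1: 1.0}, 1)
    model.update({2: 1.0}, 0)
```

Because each update touches only the features present in the example, the same code path serves both offline training and online serving on realtime feedback.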
[Pipeline diagram: DataRecord → Transform × N → Cross → Logistic Regression → DataRecord]
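PredictionEngine crosses features between the transform and logistic-regression stages, operating on hashed ids. A sketch of on-the-fly crossing of two sparse feature maps, where the mixing constant and 22-bit output space are illustrative assumptions:

```python
def cross(fid_a: int, fid_b: int, mask: int = (1 << 22) - 1) -> int:
    """Combine two hashed feature ids into a crossed-feature id.
    The multiplier and mask size here are illustrative choices."""
    mixed = (fid_a * 0x9E3779B97F4A7C15 + fid_b) & 0xFFFFFFFFFFFFFFFF
    return mixed & mask

def cross_features(sparse_a, sparse_b):
    """Cartesian cross of two sparse feature maps {id: value}."""
    return {cross(a, b): va * vb
            for a, va in sparse_a.items()
            for b, vb in sparse_b.items()}

crossed = cross_features({1: 1.0, 2: 1.0}, {7: 2.0})
```

Crossing ids arithmetically, rather than concatenating strings, is what makes it cheap enough to do at serving time.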
PredictionEngine Optimization
• Reduce serialization cost
– Model collocation
– Batch request API
• Reduce compute cost
– Feature ids instead of string names
– Transform sharing across models
– Feature crossing done on the fly
PredictionEngine Optimization
• Training/Serving throughput
– Sharding for model updates
– Separation of training and prediction services
– Elastic load based on latency
• Realtime feedback
– Treat ad impressions as non-click events
• Fault tolerance
– Snapshot the model at fixed intervals
– Anomalous traffic detection
Tooling—Auto Hyper-parameter Tuning
• Problem: too many hyper-parameters
to tune
• Traditional methods
– Grid search
– Random search
• Auto hyper-parameter tuning tool
– Assume a Gaussian process prior
– Compute the posterior distribution of
the predictive function after each
observation
– Pick the parameter setting that
maximizes expected improvement
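The three bullets above describe Bayesian optimization. A compact sketch under stated assumptions: an RBF kernel, a zero-mean prior, a fixed grid of candidate settings, and a fixed three-point initial design (none of these details come from the slides):

```python
import math
import numpy as np

def rbf(xa, xb, length=0.15):
    """Squared-exponential kernel for the GP prior."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Posterior mean/std of the predictive function at candidates."""
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf(x_obs, x_cand)
    alpha = np.linalg.solve(k, y_obs)
    mu = k_s.T @ alpha
    v = np.linalg.solve(k, k_s)
    var = 1.0 - np.sum(k_s * v, axis=0)   # rbf(x, x) == 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

def tune(objective, n_iters=12):
    """After each observation, evaluate the candidate maximizing EI."""
    cand = np.linspace(0.0, 1.0, 101)
    xs = [0.0, 0.5, 1.0]                  # fixed initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iters):
        mu, sigma = gp_posterior(np.array(xs), np.array(ys), cand)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, max(ys)))]
        xs.append(float(x_next))
        ys.append(objective(x_next))
    return xs[int(np.argmax(ys))]

# Maximize a toy 1-D objective with its peak at 0.3.
best = tune(lambda x: -(x - 0.3) ** 2)
```

EI trades off exploitation (high posterior mean) against exploration (high posterior uncertainty), which is why it needs far fewer evaluations than grid or random search.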
Tooling
• Workflow management
• Insight and interpretation
– Inspect data/models in human-readable form
– Compute dataset statistics
– Visualize tree models
• Feature selection tool
– Forward/backward greedy search
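The forward direction of the greedy search can be sketched in a few lines. Here `score` stands in for whatever evaluation the tool runs per subset (e.g. a validation metric); the function name and signature are illustrative:

```python
def forward_greedy_select(features, score, max_features=None):
    """Greedy forward search: repeatedly add the feature whose
    addition most improves the score; stop when nothing helps."""
    selected = []
    remaining = list(features)
    best_score = score(selected)
    while remaining and (max_features is None
                         or len(selected) < max_features):
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_f = max(gains)
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_f)
        remaining.remove(top_f)
        best_score = top_score
    return selected

# Toy score: each feature contributes a fixed, independent utility.
util = {"a": 3.0, "b": 2.0, "c": 0.0}
selected = forward_greedy_select(list(util),
                                 lambda sub: sum(util[f] for f in sub))
```

Backward search is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.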
Work in Progress
• Balance scalability and flexibility
– Large-scale Torch-based ML
• Deep learning application in timeline/ads ranking
• Better tooling
– Visualization and interactive exploration
郭晓江 (Jack Guo)
Contact: jguo@