The Design and Construction of Twitter's
Machine Learning Platform
郭晓江 (Jack Guo)
Why Machine Learning?
• Mine useful information from data and predict the future
• Machine learning is becoming increasingly
powerful, driven by
– Data availability
– Computation power
– Advances in algorithms
More Data or Better Model?
Machine Learning is Important @Twitter
• ~80% of DAU (daily active users) is attributable to teams doing ML
• ~90% of revenue comes from ads backed by ML models
• ML platform supports many teams
– ads ranking
– ads targeting
– timeline ranking
– anti-spam
– recommendation
– moments ranking
– trends
Machine Learning is Large Scale
@Twitter
• Take ads ranking as an example
– ~10,000,000,000,000 predictions made daily
– ~10,000,000 weights per model
– ~10,000 features per training example
– ~1,000,000,000,000 bytes (~1 TB) of training data
Machine Learning is Realtime @Twitter
• Twitter is all about realtime: news, events, videos, trends
• Advertiser campaigns target realtime events and hashtags,
spanning as little as a few hours or even minutes
• ML needs to adapt to dynamically changing traffic
Ads Ranking
Timeline Ranking
• Most people miss most tweets
• Surface the most relevant tweets a user missed
Recommendation
Anti-spam
• Fake account detection
• Spam detection
• Abuse detection
• NSFW (not safe for work) detection
Scaling Challenges
• Organization scaling
– How to support client teams efficiently?
– How to enable client teams to prototype quickly?
• System scaling
– How to train and make inference efficiently?
– How to enable fast iteration and
experimentation?
Organization Scaling
• ML platform’s focus
– Define feature, transform, and model formats
– Provide framework and tooling
• data ETL, trainers, parameter search, serving runtime,
workflow management
– Onboard clients and provide support
• Client team’s focus
– Define and extract features
– Own and maintain training pipeline and serving runtime
System Scaling
• Data preparation
• Offline training, workflow
management
• Online serving, A/B testing
• Tooling
[Architecture diagram: Data Warehouse → Data Preparation → Offline Training → Online Serving → A/B Testing, with Workflow Management spanning the pipeline]
Data Format—Standardized Feature
Format
• Enable feature sharing across teams
• Make ML platform iteration easy
• Feature format
– Support 4 dense and 2 sparse feature types
– Use hashed ids instead of string names for efficient
serialization, storage, and compute
– Collocate the schema (id-to-name mapping) with the data
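The slides say hashed ids replace string names and the id-to-name schema travels with the data, but not which hash is used. A minimal Python sketch of that idea, with an illustrative SHA-256-based 64-bit id:

```python
import hashlib

def feature_id(name: str) -> int:
    """Map a feature name to a stable 64-bit id (illustrative scheme;
    the slides do not specify the actual hash function used)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Schema:
    """Id-to-name mapping collocated with the data, so records can
    carry compact ids while remaining human-decodable."""
    def __init__(self):
        self.id_to_name = {}

    def register(self, name: str) -> int:
        fid = feature_id(name)
        self.id_to_name[fid] = name
        return fid

    def name(self, fid: int) -> str:
        return self.id_to_name[fid]

# Records are keyed by compact ids; the schema recovers names on demand.
schema = Schema()
record = {schema.register("user.age"): 31.0,
          schema.register("tweet.len"): 140.0}
```

Because the hash is deterministic, any process holding the schema can decode a record without a central name registry.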
Data ETL Framework—DataAPI
• Make operations on distributed data painless for ML
practitioners
• Scala data ETL API
– Provide powerful abstractions for ML datasets and
operations
– Fluent API, enabling imperative programming
• Ensure data and metadata consistency across operations
Example
1. Take the dataset whose path is given by “input”
2. Sample 10% of it at random
3. Discretize it with the given discretizer
4. Left join with media labels on tweet id
5. Dump the result to the path given by “output”
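The real DataAPI is a Scala library over distributed data; the following Python toy only reproduces the fluent, chainable shape of the five steps above. All method and field names here are illustrative, not the platform's:

```python
import random

class Dataset:
    """Toy in-memory stand-in for a distributed fluent ETL API."""
    def __init__(self, rows):
        self.rows = list(rows)

    @staticmethod
    def read(table):                      # 1. take the input dataset
        return Dataset(table)

    def sample(self, rate, seed=42):      # 2. sample randomly
        rng = random.Random(seed)
        return Dataset(r for r in self.rows if rng.random() < rate)

    def discretize(self, discretizer):    # 3. apply the discretizer
        return Dataset({**r, "score": discretizer(r["score"])}
                       for r in self.rows)

    def left_join(self, other, key):      # 4. left join on a key
        index = {r[key]: r for r in other.rows}
        return Dataset({**r, **index.get(r[key], {})} for r in self.rows)

    def write(self, sink):                # 5. dump the result
        sink.extend(self.rows)
        return self

tweets = [{"tweet_id": i, "score": i / 10} for i in range(100)]
labels = [{"tweet_id": i, "label": i % 2} for i in range(100)]
out = []
(Dataset.read(tweets)
    .sample(0.1)
    .discretize(lambda s: int(s * 10) // 2)
    .left_join(Dataset(labels), "tweet_id")
    .write(out))
```

Each method returns a new `Dataset`, which is what makes the one-expression pipeline above possible.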
Trainer Portfolio
• Large scale logistic regression
– Vowpal Wabbit: open-source C++ trainer
– Lolly: JVM online learning trainer
• Discretizer
– Boosting tree: GBDT, AdaBoost
– Random forest
– MDL discretizer
• Deep Learning
– Torch based training libraries
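The slides list an MDL discretizer without further detail. One standard choice is entropy-based splitting with the Fayyad–Irani MDL stopping criterion, sketched here in Python (an assumption about the variant, not a description of Twitter's implementation):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return ent

def mdlp_cuts(values, labels):
    """Recursively split on the entropy-minimizing boundary, stopping
    when the information gain fails the Fayyad-Irani MDL test."""
    pairs = sorted(zip(values, labels))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    best = None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # can only cut between distinct values
        left, right = ys[:i], ys[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or cond < best[0]:
            best = (cond, i, left, right)
    if best is None:
        return []
    cond, i, left, right = best
    gain = entropy(ys) - cond
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(ys)
                                - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (log2(n - 1) + delta) / n:
        return []  # MDL says the split does not pay for itself
    cut = (xs[i - 1] + xs[i]) / 2
    return mdlp_cuts(xs[:i], left) + [cut] + mdlp_cuts(xs[i:], right)
```

The MDL test makes the discretizer parameter-free: it stops producing bins as soon as a split no longer reduces description length.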
PredictionEngine
• Large scale online SGD learning
• Used in both offline training and online
serving
• Architecture
– Transform: MDL, Decision tree
– Feature crossing
– Logistic Regression: Vowpal Wabbit or
in-house JVM learner
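The pipeline ends in a logistic regression trained online with SGD. A minimal sparse sketch in Python (the class name, learning rate, and fixed-rate schedule are illustrative, not the platform's):

```python
from math import exp

class OnlineLR:
    """Sparse online logistic regression trained one example at a time."""
    def __init__(self, lr=0.1):
        self.w = {}   # feature id -> weight, stored sparsely
        self.lr = lr

    def predict(self, features):
        z = sum(self.w.get(f, 0.0) * v for f, v in features.items())
        return 1.0 / (1.0 + exp(-z))

    def update(self, features, label):
        # gradient of the log-loss is (p - y) * x, applied per feature
        g = self.predict(features) - label
        for f, v in features.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * g * v

# Train on a trivially separable stream of sparse examples.
model = OnlineLR(lr=0.1)
for _ in range(200):
    model.update({1: 1.0}, 1)
    model.update({2: 1.0}, 0)
```

Because each update touches only the features present in the example, the same code path serves both offline training and online serving on realtime feedback.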
[Pipeline diagram: DataRecord → Transform × N → Cross → Logistic Regression → DataRecord]
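PredictionEngine crosses features between the transform and logistic-regression stages, operating on hashed ids. A sketch of on-the-fly crossing of two sparse feature maps, where the mixing constant and 22-bit output space are illustrative assumptions:

```python
def cross(fid_a: int, fid_b: int, mask: int = (1 << 22) - 1) -> int:
    """Combine two hashed feature ids into a crossed-feature id.
    The multiplier and mask size here are illustrative choices."""
    mixed = (fid_a * 0x9E3779B97F4A7C15 + fid_b) & 0xFFFFFFFFFFFFFFFF
    return mixed & mask

def cross_features(sparse_a, sparse_b):
    """Cartesian cross of two sparse feature maps {id: value}."""
    return {cross(a, b): va * vb
            for a, va in sparse_a.items()
            for b, vb in sparse_b.items()}

crossed = cross_features({1: 1.0, 2: 1.0}, {7: 2.0})
```

Crossing ids arithmetically, rather than concatenating strings, is what makes it cheap enough to do at serving time.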
PredictionEngine Optimization
• Reduce serialization cost
– Model collocation
– Batch request API
• Reduce compute cost
– Feature ids instead of string names
– Transform sharing across models
– Feature crossing done on the fly
PredictionEngine Optimization
• Training/Serving throughput
– Sharding for model updates
– Separation of training and prediction services
– Elastic load based on latency
• Realtime feedback
– Treat ad impressions as non-click events
• Fault tolerance
– Snapshot the model at fixed intervals
– Anomalous traffic detection
Tooling—Auto Hyper-parameter Tuning
• Problem: too many hyper-parameters
to tune
• Traditional methods
– Grid search
– Random search
• Auto hyper-parameter tuning tool
– Assume a Gaussian process prior
– Compute the posterior distribution of
the predictive function after each
observation
– Pick the parameter setting that
maximizes expected improvement
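The three bullets above describe Bayesian optimization. A compact sketch under stated assumptions: an RBF kernel, a zero-mean prior, a fixed grid of candidate settings, and a fixed three-point initial design (none of these details come from the slides):

```python
import math
import numpy as np

def rbf(xa, xb, length=0.15):
    """Squared-exponential kernel for the GP prior."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Posterior mean/std of the predictive function at candidates."""
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf(x_obs, x_cand)
    alpha = np.linalg.solve(k, y_obs)
    mu = k_s.T @ alpha
    v = np.linalg.solve(k, k_s)
    var = 1.0 - np.sum(k_s * v, axis=0)   # rbf(x, x) == 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

def tune(objective, n_iters=12):
    """After each observation, evaluate the candidate maximizing EI."""
    cand = np.linspace(0.0, 1.0, 101)
    xs = [0.0, 0.5, 1.0]                  # fixed initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iters):
        mu, sigma = gp_posterior(np.array(xs), np.array(ys), cand)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, max(ys)))]
        xs.append(float(x_next))
        ys.append(objective(x_next))
    return xs[int(np.argmax(ys))]

# Maximize a toy 1-D objective with its peak at 0.3.
best = tune(lambda x: -(x - 0.3) ** 2)
```

EI trades off exploitation (high posterior mean) against exploration (high posterior uncertainty), which is why it needs far fewer evaluations than grid or random search.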
Tooling
• Workflow management
• Insight and interpretation
– Inspect data/models in human-readable form
– Compute dataset statistics
– Visualize tree models
• Feature selection tool
– Forward/backward greedy search
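The forward direction of the greedy search can be sketched in a few lines. Here `score` stands in for whatever evaluation the tool runs per subset (e.g. a validation metric); the function name and signature are illustrative:

```python
def forward_greedy_select(features, score, max_features=None):
    """Greedy forward search: repeatedly add the feature whose
    addition most improves the score; stop when nothing helps."""
    selected = []
    remaining = list(features)
    best_score = score(selected)
    while remaining and (max_features is None
                         or len(selected) < max_features):
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_f = max(gains)
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_f)
        remaining.remove(top_f)
        best_score = top_score
    return selected

# Toy score: each feature contributes a fixed, independent utility.
util = {"a": 3.0, "b": 2.0, "c": 0.0}
selected = forward_greedy_select(list(util),
                                 lambda sub: sum(util[f] for f in sub))
```

Backward search is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.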
Work in Progress
• Balance scalability and flexibility
– Large-scale Torch-based ML
• Deep learning application in timeline/ads ranking
• Better tooling
– Visualization and interactive exploration
郭晓江 (Jack Guo)
Contact: jguo@