Informatica Big Data Management
Informatica Technical Consultant: Leng Peng (冷鹏)
A Few Words on Big Data
The Big Data Animal Kingdom
Copyright © 2003 , Inc. All rights reserved.
Key Factors Behind Big Data Project Failures
It is of no use
BDM Overview
Informatica Big Data Management
Common Problems in Big Data Projects
Laboratory
(insights)
Factory
(actions)
Data Lakes, DW, DM, NoSQL
Analytic Apps, Enterprise Apps, etc.
Improve predictive maintenance
Improve operational efficiency
Improve customer loyalty
Reduce security risk
Improve fraud detection
Machine, Device, Cloud
Documents and Emails
Relational, Mainframe
Social Media, Web Logs
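It takes too long to acquire and prepare data
Large amounts of untrusted data are mixed in
High technical difficulty
Too many one-off projects
Hard to develop and maintain
Many compliance requirements
Data analysts, data scientists, business users, data stewards, data modelers, data engineers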
An End-to-End Big Data Management Solution
©2016 Informatica. Proprietary and Confidential
Ingest, Govern, Transform, Discover
Hand-coding Hand-coding Hand-coding
Present
Hand-coding Hand-coding
Secure, Acquire, Consume
Applications
Data Mining
Dashboards
Files Relational
Social
Files
Device data
Weblogs
Informatica Big Data Management ()
Big Data Architecture and Traditional Data Architecture
The Three Pillars of Informatica Big Data Management
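Big Data Integration
• Simple visual development interface
• Flexible deployment and optimized job execution
• 100+ data connectors and transformations
• Publish/subscribe streaming data collection
Big Data Quality & Governance
• Multi-user collaboration
• Business glossary
• Data profiling and data quality
• 360° relationship view
• End-to-end data lineage analysis
Big Data Security
• Sensitive data discovery and classification
• Proliferation analysis
• Risk assessment
• Persistent and dynamic data masking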
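The difference between persistent and dynamic masking named in the security pillar can be illustrated with a minimal sketch. This is not Informatica's implementation; the key, the `data_steward` role name, and both function names are hypothetical. Persistent masking is shown as a keyed hash that always maps the same input to the same pseudonym, and dynamic masking as a read-time decision based on who is asking.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would manage this secret securely.
SECRET_KEY = b"demo-key-not-for-production"


def persistent_mask(value: str) -> str:
    """Replace a value with a stable pseudonym: same input, same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def dynamic_mask(value: str, role: str, visible: int = 4) -> str:
    """Decide at read time what the caller may see.

    The (hypothetical) privileged role gets the stored value unchanged;
    everyone else sees only the last `visible` characters.
    """
    if role == "data_steward":
        return value
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


card = "6222020200112233"
stored = persistent_mask(card)            # what would be written to the data lake
shown = dynamic_mask(card, "analyst")     # what an unprivileged reader sees
```

Because the persistent pseudonym is deterministic, masked columns can still be joined or counted, while the dynamic variant leaves the stored data untouched.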
Big Data Management
Informatica Big Data Management Technical Architecture
Data Connectivity, Data Quality, Data Masking, Data Integration, Metadata Management
Smart Executor
Hadoop Cluster (YARN): Hive (Hive on Tez, MapReduce), Blaze, Spark
SQL: Database Pushdown
Data Virtualization
Native: Native Data Transformation Engine
Blaze is Informatica’s new high-performance, cluster-aware data integration
engine, released as part of Informatica
Blaze is designed from the ground up to address big data computing requirements for
large scale data integration workloads. Blaze is a distributed engine which runs directly on
YARN to leverage all the compute nodes on a Hadoop cluster to provide the highest
throughput. Blaze achieves high data processing performance with automatic intelligent
data pipelining, job partitioning, and scaling for large concurrent workloads. It is
integrated with Hadoop security infrastructure for secure data processing.
Unlike other general-purpose data processing engines on YARN, Blaze is a specialized engine,
tuned for optimal data integration performance with complete data integration capability
based on 20+ years of enterprise market leading data integration expertise. Informatica
PowerCenter data flows are separated from the underlying execution engine which enables
existing data flows to benefit from Blaze’s processing performance or future data processing
engines without any code changes.
Blaze
Informatica Blaze Monitoring and Management
TPC-DS Benchmark: Blaze Compared with Spark and Hive on YARN
Performance Test Comparison
TPC-DS SF100 (elapsed time, h:mm:ss)
        tpcds q3   tpcds q15   tpcds q19
Hive    0:02:11    0:02:50     0:04:18
Spark   0:00:20    0:00:43     0:00:31
Blaze   0:00:11    0:00:11     0:00:16

TPC-DS SF300 (elapsed time, h:mm:ss)
        tpcds q3   tpcds q15   tpcds q19
Hive    0:02:44    0:03:24     0:05:13
Spark   0:00:35    0:01:32     0:00:36
Blaze   0:00:26    0:00:23     0:00:59

TPC-DS SF1000 (elapsed time, h:mm:ss)
        tpcds q3   tpcds q15   tpcds q19
Hive    0:06:39    0:07:58     0:09:47
Spark   0:01:48    0:04:09     0:02:12
Blaze   0:01:51    0:02:42     0:04:22
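To make the benchmark timings above easier to compare, the following sketch converts the h:mm:ss values into seconds and computes how many times faster one engine ran than another. The numbers are copied from the slide, with each chart title assigned to the timings listed just before it; the helper names `to_seconds` and `speedup` are illustrative only.

```python
QUERIES = ("q3", "q15", "q19")

# Elapsed times (h:mm:ss) as reported on the slide, per scale factor and engine.
RESULTS = {
    "SF100": {
        "Hive": ("0:02:11", "0:02:50", "0:04:18"),
        "Spark": ("0:00:20", "0:00:43", "0:00:31"),
        "Blaze": ("0:00:11", "0:00:11", "0:00:16"),
    },
    "SF300": {
        "Hive": ("0:02:44", "0:03:24", "0:05:13"),
        "Spark": ("0:00:35", "0:01:32", "0:00:36"),
        "Blaze": ("0:00:26", "0:00:23", "0:00:59"),
    },
    "SF1000": {
        "Hive": ("0:06:39", "0:07:58", "0:09:47"),
        "Spark": ("0:01:48", "0:04:09", "0:02:12"),
        "Blaze": ("0:01:51", "0:02:42", "0:04:22"),
    },
}


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(scale: str, baseline: str, engine: str = "Blaze") -> dict:
    """Return baseline_time / engine_time per query (values > 1 mean `engine` is faster)."""
    base_times = RESULTS[scale][baseline]
    engine_times = RESULTS[scale][engine]
    return {
        query: to_seconds(base) / to_seconds(other)
        for query, base, other in zip(QUERIES, base_times, engine_times)
    }


for scale in RESULTS:
    for baseline in ("Hive", "Spark"):
        ratios = speedup(scale, baseline)
        summary = ", ".join(f"{q}: {r:.2f}x" for q, r in ratios.items())
        print(f"{scale} Blaze vs {baseline}: {summary}")
```

On these figures Blaze is faster than Hive for every query and scale factor, while Spark is faster than Blaze on q19 at SF300 and on q3 and q19 at SF1000.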
Big Data Integration
Informatica Big Data Integration
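A Wide Variety of Data Connector Adapters
100+ prebuilt tokenization algorithms
200+ prebuilt data connector adapters
Out-of-the-box business rules and standardization libraries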
WebSphere MQ
JMS
MSMQ
SAP NetWeaver XI
JD Edwards
Lotus Notes
Oracle E-Business
PeopleSoft
Oracle
DB2 UDB
DB2/400
SQL Server
Sybase
ADABAS
Datacom
DB2
IDMS
IMS
Word, Excel
PDF
StarOffice
WordPerfect
Email (POP, IMAP)
HTTP
Informix
Teradata
Netezza
ODBC
JDBC
VSAM
C-ISAM
Binary Flat Files
Tape Formats…
Web Services
TIBCO
webMethods
Flat files
ASCII reports
HTML
RPG
ANSI
LDAP
EDI–X12
EDI-Fact
RosettaNet
HL7
HIPAA
XML
LegalXML
IFX
cXML
AST
FIX
SWIFT
Cargo IMP
MVR
Salesforce CRM
RightNow
NetSuite
ADP
Hewitt
SAP By Design
Oracle OnDemand
Facebook
Twitter
LinkedIn
Kapow
Pivotal
Vertica
Netezza
Teradata
Aster
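Prebuilt Parsers for Many Unstructured Data Formats
Data storage and interchange formats, industry-standard formats, enterprise document formats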
XML
JSON
Parquet
AVRO
Financial Services
Healthcare
EDI
Delimited
Files
PDF
Word
Excel
Hadoop Cluster
Informatica IDE
Example: Hand-Coding Data Processing on Hadoop
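The code shown on this slide did not survive in the text. As a stand-in, here is a hypothetical sketch of the kind of logic a developer writes by hand in a MapReduce-style job: counting records per country code from comma-delimited lines. The record layout (`customer_id,country_code`) and all function names are assumptions for illustration, and the three phases run in a single process rather than on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (country_code, 1) for every well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue  # skip malformed rows instead of failing the whole job
        yield fields[1].strip().upper(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


records = ["1001,cn", "1002,US", "1003,cn ", "bad-row"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Even this small job needs explicit parsing, error handling, grouping, and aggregation code, which is the maintenance burden the slide contrasts with a graphical tool.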
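Using the Informatica Big Data Integration Tools
Simple graphical development interface; automatically generates executable Hadoop jobs
Reuse existing development work
Easy deployment
Easily Push Data Processing Workloads to Hadoop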
Data Warehouse
Profile, Parse, Cleanse, Match
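Automatically offload and archive infrequently used data
Push ETL/ELT jobs down to Hadoop to improve data processing efficiency
Reuse existing development work and workflows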
Informatica VDS: Streaming Data Collection and Real-Time Analytics
VDS
DPI
Devices
Ultra Messaging Bus
Publish/Subscribe
High Performance Messaging Infrastructure
Transfers over one million data records per second
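The publish/subscribe pattern named on this slide can be sketched in a few lines. This is an in-process illustration only, not the VDS or Ultra Messaging API; the `MessageBus` class, the `weblog` topic, and the sample log lines are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class MessageBus:
    """Minimal topic-based publish/subscribe bus (single process, synchronous)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler to be called for every message on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver `message` to every subscriber of `topic`; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = MessageBus()
archive: List[str] = []   # e.g. a sink that lands every record for later analysis
alerts: List[str] = []    # e.g. a real-time consumer that reacts to server errors


def alert_on_server_error(message: str) -> None:
    if " 500 " in message:
        alerts.append(message)


bus.subscribe("weblog", archive.append)
bus.subscribe("weblog", alert_on_server_error)

bus.publish("weblog", "GET /index.html 200 12ms")
bus.publish("weblog", "GET /cart 500 87ms")
```

The publisher never references its consumers directly, so a new target such as a dashboard or an event-processing engine can be added by subscribing to the topic without changing the source side.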
Real-time Data Acquisition with
Vibe Data Stream
Real Time
Analysis,
Complex
Event
Processing
Customer
Data
Alert Interface
Dashboards
Campaign
Management
Systems
Sources Real-time Data Conversion and
Event Processing
Web Log
Email
IM
Location
Data streams
Switches
Load Balancers
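Informatica VDS Graphical Development Interface for Streaming Data Collection
Monitoring the Big Data Integration Runtime Platform
• Detailed statistics: source/target throughput over time (rows/bytes).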
End-to-End Metadata Management: Lineage Analysis
Big Data Governance & Quality
Informatica Big Data Governance & Quality
CUSTOMER_ID example COUNTRY CODE example
1. Profiling Stats: Min/Max Values, NULLs, Inferred Data Types, etc.
   Stats to identify outliers and anomalies in data
2. Value & Pattern Analysis of Hadoop Data
   Value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns
3. Drilldown Analysis (into Hadoop Data)
   Drill down into actual data values to inspect results across the entire data set, including potential duplicates
Hadoop data profiling results: exposed to anyone in the enterprise via browser
Profiling Hadoop Data
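The profiling statistics listed on this slide (min/max values, NULL counts, inferred data types, value and pattern frequency) can be sketched for a single column as follows. This is an illustrative, in-memory version, not the product's profiling engine; the function names and the sample `country_codes` column are assumptions, and the pattern notation (`A` for a letter, `9` for a digit) is a common convention chosen for the example.

```python
from collections import Counter
from typing import List, Optional


def infer_type(values: List[str]) -> str:
    """Infer the narrowest type that every non-null value can be cast to."""
    def all_castable(cast) -> bool:
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_castable(int):
        return "integer"
    if all_castable(float):
        return "decimal"
    return "string"


def pattern_of(value: str) -> str:
    """Map digits to 9 and letters to A, keeping punctuation as-is."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def profile(column: List[Optional[str]]) -> dict:
    """Compute basic profiling statistics; None and empty strings count as NULLs."""
    present = [v for v in column if v not in (None, "")]
    return {
        "rows": len(column),
        "nulls": len(column) - len(present),
        # min/max compare the raw strings, so the ordering is lexicographic
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "inferred_type": infer_type(present),
        "value_freq": Counter(present).most_common(3),
        "pattern_freq": Counter(pattern_of(v) for v in present).most_common(3),
    }


country_codes = ["US", "CN", "us", "U5", None, "CN", ""]
stats = profile(country_codes)
print(stats)
```

In this sample the pattern frequency isolates the suspicious value: four values follow the pattern `AA`, while `U5` is the only value with pattern `A9`, which is the kind of outlier a drill-down would then inspect.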
Data Quality Governance on Hadoop
Big Data cleansing, deduplication, parsing
Address Validation: address validation and geocoding enrichment across 260 countries
Standardize: standardization and reference data management
Parsing: parsing of unstructured data/text fields for all types of data (customer, product, social, logs)
DQ logic pushed down/run natively on Hadoop
Graphical Management of Data Quality Validation Rules
Context Menu
Collapse Tree
Shape Colors
Shape sizing
Fix and edit problem data
Finish to Exit Editing Session
Multiple Tab Support
Managing Data Verification and Auditing
Live Data Map: A Real-Time Intelligent Data Map
EIC
Relationships
Catalog
Statistics
Live Data Map
Rules
Glossary
Ratings
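• Exploration
• Semantic search
• Relationship discovery
Data discovery, sensitive data tracking, data governance, intelligent recommendations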
Live Data Map
A knowledge graph of all enterprise data assets
• Recommendations
• 360-degree view
• User ratings
All
Informatica
Repositories
3rd party – BI,
Modeling, Big Data,
RDBMS
Applications,
Business glossary &
context
User Ratings,
Feedback,
Operational Stats
Johnathon T. Smith
John Smith
JN@S!~H Johnathon T Smith John Smith
Resolve identities, persist
linkage and fuzzy match key
Incremental loads use
existing match key
. Smith
Jon Smyth
Real time fuzzy lookup
JN@S!~H    Johnathon T Smith    John Smith
DA%{SR     Darth Vader          Anakin Skywalker
D?R;OE&    Derrick Rose         D. Rose
M!JO#+N    Michael Jordan       .
WB%@CN     Bill Clinton         William Clinton
MA|*S{}    Matt Simpson         Mathew Sampson
Create Clusters and Groups
• Zip code 60640, Lifetime Spend Over $100
• Smith Household
• > 2 Product purchases last 18 months, > 2 Tweets About Company, Northeast Region
Big Data Relationship Management
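The idea of a persisted fuzzy match key, reused by incremental loads and real-time lookups, can be illustrated with a deliberately simple stand-in. The keys shown on the slide (for example `JN@S!~H`) come from the product's own algorithm, which is not described here; the sketch below instead builds a key from the first initial plus a simplified Soundex code of the last name. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Soundex consonant classes; vowels, H, W and Y carry no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    digits: List[str] = []
    previous = CODES.get(letters[0], "")
    for char in letters[1:]:
        code = CODES.get(char, "")
        if code and code != previous:
            digits.append(code)
        if char not in "HW":  # H and W do not separate repeated codes
            previous = code
    return (letters[0] + "".join(digits) + "000")[:4]


def match_key(full_name: str) -> str:
    """Build a fuzzy key from the first initial and the last name's Soundex code."""
    tokens = full_name.replace(".", " ").split()
    return f"{tokens[0][0].upper()}-{soundex(tokens[-1])}"


def build_index(names: List[str]) -> Dict[str, List[str]]:
    """Persist the linkage: every name is stored under its match key."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        index[match_key(name)].append(name)
    return index


index = build_index(["Johnathon T. Smith", "John Smith", "Matt Simpson"])

# Incremental load: a new record reuses the existing key instead of re-matching everything.
index[match_key("Mathew Sampson")].append("Mathew Sampson")

# Real-time fuzzy lookup: a misspelled name resolves to the same cluster.
candidates = index.get(match_key("Jon Smyth"), [])
print(candidates)
```

A phonetic key like this only links names that sound alike; pairs on the slide such as Darth Vader and Anakin Skywalker would need additional evidence (shared identifiers, addresses, relationships) that this sketch does not model.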
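For Data Scientists and Analysts
• Search and discovery of enterprise data assets
• Batch and real-time data collection from internal systems and cloud sources
• Recommendations of suitable data sets
• Excel-like data preparation and processing for large data sets
• Data publishing and sharing
For IT Staff
• A unified view of, and access to, data assets
• Analysis of data usage and user activity
• Unified operations monitoring for data integration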
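Product Screenshot 1
Product Screenshot 2
Product Screenshot 3
Product Screenshot 4
Product Screenshot 5
Product Screenshot 6
Product Screenshot 7
Product Screenshot 8
Product Screenshot 9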
Big Data Security
Informatica Big Data Security
Thank You!
Leng Peng (冷鹏)