DataX使用说明
1.DataX介绍
DataX 是一个异构数据源离线同步工具,致力于实现包括关系型数据库(MySQL、Oracle等)、HDFS、Hive、ODPS、HBase、FTP等各种异构数据源之间稳定高效的数据同步功能。
也就是数据库的数据同步工具,免费版没有web页面,需在控制台进行操作
2.DataX下载
DataX 下载地址:http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
DataX 快速启动地址:https://github.com/alibaba/DataX/blob/master/userGuid.md
DataX github地址:https://github.com/alibaba/DataX
下载请点击第一个地址下载,快速使用可以选择不看,下方会详细讲解
3.DataX使用
3.1 环境
- JDK1.8
- python 2.6.*(推荐)python3也可运行
windows或者macos安装好适配自己的环境(这里就自行百度啦)
3.2 解压
DataX下载完后,是一个datax.tat.gz的包,解压它,得到datax文件夹
进入datax
- bin:程序的执行目录,里面的data.py是控制台执行该程序的启动文件
- log:执行后的日志,每执行一次生成一个日志文件,可以清楚的看到报错信息
- plugin:reader和writer插件,比如reader里的mysqlreader为mysql的读取插件
3.3 使用
1.数据源文档(gitHub加载较慢,用流量看github稍微快一些)
类型 | 数据源 | Reader(读) | Writer(写) | 文档 |
---|---|---|---|---|
RDBMS 关系型数据库 | MySQL | √ | √ | 读 、写 |
Oracle | √ | √ | 读 、写 | |
OceanBase | √ | √ | 读 、写 | |
SQLServer | √ | √ | 读 、写 | |
PostgreSQL | √ | √ | 读 、写 | |
DRDS | √ | √ | 读 、写 | |
通用RDBMS(支持所有关系型数据库) | √ | √ | 读 、写 | |
阿里云数仓数据存储 | ODPS | √ | √ | 读 、写 |
ADS | √ | 写 | ||
OSS | √ | √ | 读 、写 | |
OCS | √ | √ | 读 、写 | |
NoSQL数据存储 | OTS | √ | √ | 读 、写 |
Hbase0.94 | √ | √ | 读 、写 | |
Hbase1.1 | √ | √ | 读 、写 | |
Phoenix4.x | √ | √ | 读 、写 | |
Phoenix5.x | √ | √ | 读 、写 | |
MongoDB | √ | √ | 读 、写 | |
Hive | √ | √ | 读 、写 | |
Cassandra | √ | √ | 读 、写 | |
无结构化数据存储 | TxtFile | √ | √ | 读 、写 |
FTP | √ | √ | 读 、写 | |
HDFS | √ | √ | 读 、写 | |
Elasticsearch | √ | 写 | ||
时间序列数据库 | OpenTSDB | √ | 读 | |
TSDB | √ | √ | 读 、写 |
2.配置json文件
使用dataX很简单,只需要配置一个json文件即可,配置内容大致如下:
- 配置需要读取的数据源
- 配置需要存储的数据源
- 配置线程数
而该json文件的空模板datax也提供给我们了,使用如下命令获取(首先进入datax/bin/下):
python datax.py -r {读取数据库的插件名} -w {写入数据库的插件名}
比如我想要用txt文件作为读取的数据源,mysql数据库作为写入的数据源,获取它的模板如下:
python datax.py -r txtfilereader -w mysqlwriter
- 所有的插件名可以去/plugin文件夹下找到名字
得到如下json:
{"job": {"content": [{"reader": {"name": "txtfilereader", "parameter": {"column": [], "encoding": "", "fieldDelimiter": "", "path": []}}, "writer": {"name": "mysqlwriter", "parameter": {"column": [], "connection": [{"jdbcUrl": "", "table": []}], "password": "", "preSql": [], "session": [], "username": "", "writeMode": ""}}}], "setting": {"speed": {"channel": ""}}}
}
里面具体的参数应该填写什么值,就看上方的那个表格(数据源文档),作为读的数据源就去看读的文档,作为写的数据源就看写的文档,将配置填入后,新建json文件,复制该json文本到json文件中。
3.运行该json文件(首先进入datax/bin/下)
python datax.py {你的json配置文件的路径}
3.4 示例
以txtfilereader(txt文件读取插件)和mysqlwriter(mysql写入插件)作为数据同步示例。
1.首先获取txtfilereader与mysqlwriter的空配置模板
python datax.py -r txtfilereader -w mysqlwriter
结果如下:
{"job": {"content": [{"reader": {"name": "txtfilereader", "parameter": {"column": [], "encoding": "", "fieldDelimiter": "", "path": []}}, "writer": {"name": "mysqlwriter", "parameter": {"column": [], "connection": [{"jdbcUrl": "", "table": []}], "password": "", "preSql": [], "session": [], "username": "", "writeMode": ""}}}], "setting": {"speed": {"channel": ""}}}
}
2.新建json文件,复制模板内容,名字可取需要同步的表的表名,我取名为comm_attachment.json
3.新建mysql表(作为写入的表)
create table comm_attachment
(id int auto_increment comment '主键'primary key,project_type varchar(100) null comment '项目名-说明该附件是属于哪个项目的',attach_type varchar(100) null comment '附件类型',attach_name varchar(200) null comment '附件名',attach_url varchar(500) null comment '附件下载地址',attach_key varchar(500) null comment '附件key',attach_hash varchar(500) null comment '附件hash',attach_size varchar(100) null comment '附件大小',created_by varchar(100) null comment '创建人',created_date timestamp null comment '创建时间',last_updated_by varchar(100) null comment '最后修改人',last_updated_date timestamp null comment '最后修改时间',version int null comment '乐观锁-版本号'
)comment '附件表';
4.新建csv文件,编造一些测试数据进去
我的mysql数据源的表comm_attachment有12列,那么csv的内容也和该表的列顺序一致(如果顺序不一致,更改配置中writer.column数组的顺序)
建好后,把csv后缀改为txt即可,txt内容如下:
1,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
2,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
3,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
4,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
5,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
6,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
7,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
8,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
9,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
10,sunnyDay,image/png,振动测量钉钉图标.png,http://qyn6nlamm.hd-bkt.clouddn.com/Frv7wnlpCWpjlUq-qWFPrjQdm1AI?attname=%E6%8C%AF%E5%8A%A8%E6%B5%8B%E9%87%8F%E9%92%89%E9%92%89%E5%9B%BE%E6%A0%87.png,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,Frv7wnlpCWpjlUq-qWFPrjQdm1AI,44kb,anonymous,2021-09-16 16:52:38,anonymous,2021-09-16 16:52:38,0
5.查阅文档后,编写json配置
类型 | 数据源 | Reader(读) | Writer(写) | 文档 |
---|---|---|---|---|
RDBMS 关系型数据库 | MySQL | √ | √ | 读 、写 |
无结构化数据存储 | TxtFile | √ | √ | 读 、写 |
comm_attachment.json:
{"job": {"content": [{"reader": {"name": "txtfilereader","parameter": {"column": [{"index": 0,"type": "long"},{"index": 1,"type": "string"},{"index": 2,"type": "string"},{"index": 3,"type": "string"},{"index": 4,"type": "string"},{"index": 5,"type": "string"},{"index": 6,"type": "string"},{"index": 7,"type": "string"},{"index": 8,"type": "string"},{"index": 9,"type": "date","format": "yyyy-MM-dd HH:mm:ss"},{"index": 10,"type": "string"},{"index": 11,"type": "date","format": "yyyy-MM-dd HH:mm:ss"},{"index": 12,"type": "long"},],"encoding": "UTF-8","fieldDelimiter": ",","path": ["/Users/cheng/Desktop/sunnyday_comm_attachment.txt"]}},"writer": {"name": "mysqlwriter","parameter": {"column": ["id","project_type","attach_type","attach_name","attach_url","attach_key","attach_hash","attach_size","created_by","created_date","last_updated_by","last_updated_date","version"],"connection": [{"jdbcUrl": "jdbc:mysql://localhost:3306/sunnyday?characterEncoding=utf-8&serverTimezone=Asia/Shanghai","table": ["comm_attachment"]}],"password": "12345678","preSql": ["delete from comm_attachment"],"session": [],"username": "root","writeMode": "insert"}}}],"setting": {"speed": {"channel": "5"}}}
}
- 配置文件需要告诉它读取源的列属性(txt没有列的属性定义)
- 配置文件并不需要告诉它写入源的列属性(自动翻译)
- writer.column是按照填写的顺序来接收读取源的数据的,顺序需要确定好
6.执行json配置
执行(在datax/bin下):
python datax.py /Users/cheng/Desktop/comm_attachment.json
部分结果:
2021-11-08 12:04:03.357 [job-0] INFO JobContainer - PerfTrace not enable!
2021-11-08 12:04:03.358 [job-0] INFO StandAloneJobContainerCommunicator - Total 10 records, 2741 bytes | Speed 274B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-11-08 12:04:03.359 [job-0] INFO JobContainer -
任务启动时刻 : 2021-11-08 12:03:52
任务结束时刻 : 2021-11-08 12:04:03
任务总计耗时 : 10s
任务平均流量 : 274B/s
记录写入速度 : 1rec/s
读出记录总数 : 10
读写失败总数 : 0
7.查看写入源的数据
写入成功