2024-01-22

BigData

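Hudi Overview & Installation

1. Hudi Overview

1.1 Introduction to Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a next-generation streaming data lake platform. It brings core warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping the data in open-source file formats.
Apache Hudi is well suited to streaming workloads, and it also lets you build efficient incremental batch pipelines.
Apache Hudi can be used on any cloud storage platform. Its performance optimizations speed up analytical workloads on popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.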

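1.2 History

2015: The core ideas and principles of incremental processing were published (O'Reilly article).
2016: Created at Uber, where it powered all database and business-critical feeds.
2017: Open-sourced by Uber, supporting a 100 PB data lake.
2018: Attracted a large number of users as cloud computing became widespread.
2019: Became an ASF incubator project and added more platform components.
2020: Graduated to an Apache top-level project; community, downloads, and adoption grew more than 10x.
2021: Supported Uber's 500 PB data lake; added SQL DML, Flink integration, indexing, a metaserver, and caching.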

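1.3 Hudi Features

Pluggable indexing for fast upserts and deletes.
Incremental pulls of table changes for downstream processing.
Transactional commits and rollbacks, with concurrency control.
SQL reads and writes from Spark, Presto, Trino, Hive, Flink, and other engines.
Automatic small-file management, clustering, compaction, and cleaning.
Streaming ingestion, with built-in CDC sources and tools.
Built-in metadata tracking for scalable storage access.
Backward-compatible schema evolution.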

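1.4 Use Cases

1) Near-real-time ingestion
Reduces reliance on a patchwork of separate tools.
Incrementally imports RDBMS data through CDC.
Limits the size and number of small files.
2) Near-real-time analytics
Uses fewer resources than second-level-latency stores (Druid, OpenTSDB).
Provides minute-level freshness, which supports more efficient queries.
Hudi is used as a library, so it is very lightweight.
3) Incremental pipelines
Distinguishes arrival time from event time to handle late data.
Shorter scheduling intervals reduce end-to-end latency (hours -> minutes) => incremental processing.
4) Incremental export
Replaces Kafka in some scenarios, exporting data to online serving stores, e.g. Elasticsearch (ES).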

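2. Build and Installation

2.1 Preparing the Build Environment

The component versions used in this article are:

Hadoop  3.1.3
Hive    3.1.2
Flink   1.13.6, Scala 2.12
Spark   3.2.2, Scala 2.12

1) Install Maven
(1) Upload apache-maven-3.6.1-bin.tar.gz to /opt/software, then extract it into /opt/module and rename the directory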

tar -zxvf /opt/software/apache-maven-3.6.1-bin.tar.gz -C /opt/module/
mv /opt/module/apache-maven-3.6.1 /opt/module/maven-3.6.1

(2) Add the environment variables to /etc/profile, then reload it and check the Maven version

sudo vim /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven-3.6.1
export PATH=$PATH:$MAVEN_HOME/bin

source /etc/profile
mvn -v
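
As a quick sanity check of the two exports above, the following sketch verifies that a Maven home contains an executable mvn and that its bin directory is on PATH. The function name check_maven_env is made up for this illustration; it is not part of Maven.

```shell
# Hypothetical helper: check that a Maven home is usable from the current shell.
# $1 = Maven home directory (defaults to $MAVEN_HOME if omitted).
check_maven_env() {
  cme_home="${1:-${MAVEN_HOME:-}}"
  if [ -z "$cme_home" ]; then
    echo "MAVEN_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$cme_home/bin/mvn" ]; then
    echo "no executable mvn under $cme_home/bin" >&2
    return 1
  fi
  # Wrap PATH in colons so the bin directory matches as a whole entry.
  case ":$PATH:" in
    *":$cme_home/bin:"*) echo "ok: $cme_home/bin is on PATH" ;;
    *) echo "$cme_home/bin is not on PATH" >&2; return 1 ;;
  esac
}
```

For example, after running source /etc/profile: check_maven_env /opt/module/maven-3.6.1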

2) Switch to the Aliyun mirror
(1) Edit settings.xml and add the Aliyun repository inside the <mirrors> element

vim /opt/module/maven-3.6.1/conf/settings.xml

<!-- Add the Aliyun mirror -->
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

2.2 Building Hudi

2.2.1 Upload the Source Package

Upload hudi-0.12.0.src.tgz to /opt/software and extract it there:

tar -zxvf /opt/software/hudi-0.12.0.src.tgz -C /opt/software

The source can also be downloaded from GitHub: https://github.com/apache/hudi/

2.2.2 Edit the pom File

vim /opt/software/hudi-0.12.0/pom.xml

1) Add a repository to speed up dependency downloads

<repository>
    <id>nexus-aliyun</id>
    <name>nexus-aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>

2) Change the versions of the dependent components

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>
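
To confirm that the edited properties took effect before starting a long build, a small check along these lines can print a property's value from the pom. This is a minimal sketch: pom_property is a hypothetical helper name, and it assumes each property sits on a single line, as in the snippet above.

```shell
# Hypothetical helper: print the first value of a <name>value</name> property in a pom.
# $1 = path to pom.xml, $2 = property name (for example hadoop.version).
pom_property() {
  # Keep only the text between the opening and closing property tags.
  sed -n "s|.*<$2>\(.*\)</$2>.*|\1|p" "$1" | head -n 1
}
```

For example: pom_property /opt/software/hudi-0.12.0/pom.xml hadoop.version should print 3.1.3 once the edit above has been made.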

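2.2.3 Patch the Source for Hadoop 3 Compatibility

Hudi depends on Hadoop 2 by default. To make it work with Hadoop 3, changing the version is not enough; the following code must also be modified:

vim /opt/software/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

On line 110, the call originally takes a single argument; add null as a second argument.
Otherwise the build fails with an error caused by the incompatibility between Hadoop 2.x and 3.x.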

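2.2.4 Manually Install the Kafka Dependencies

Several Kafka-related dependencies have to be installed by hand; otherwise the build fails with an error.

1) Download the jars
Download the archive from: http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
After extracting it, locate the following jars and upload them to the server hadoop1: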

common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar

2) Install them into the local Maven repository

mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
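
The four commands above differ only in the artifact name, so they can also be written as a loop. The sketch below is one way to do that; the function name install_confluent_jars and the MVN override variable are assumptions introduced here for illustration, not part of Maven or Hudi.

```shell
# Hypothetical helper: install the four Confluent 5.3.4 jars into the local Maven repository.
# $1 = directory containing the jars (defaults to the current directory).
# Set MVN=echo to preview the commands without running Maven.
install_confluent_jars() {
  icj_dir="${1:-.}"
  icj_version="5.3.4"
  for icj_artifact in common-config common-utils kafka-avro-serializer kafka-schema-registry-client; do
    icj_jar="$icj_dir/$icj_artifact-$icj_version.jar"
    # Stop early if a jar is missing instead of letting Maven fail later.
    if [ ! -f "$icj_jar" ]; then
      echo "missing jar: $icj_jar" >&2
      return 1
    fi
    "${MVN:-mvn}" install:install-file \
      -DgroupId=io.confluent \
      -DartifactId="$icj_artifact" \
      -Dversion="$icj_version" \
      -Dpackaging=jar \
      -Dfile="$icj_jar" || return 1
  done
}
```

For example, run install_confluent_jars . from the directory that holds the uploaded jars, or prefix the call with MVN=echo for a dry run.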

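2.2.5 Resolve the Dependency Conflict in the Spark Module

With the Hive version changed to 3.1.2, Hive brings in Jetty 9.3.x, while Hudi itself uses Jetty 9.4.x, so the two conflict.
1) Edit the pom file of hudi-spark-bundle to exclude the older Jetty and add the Jetty version that Hudi specifies:

vim /opt/software/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml

Around line 382, change the dependencies as follows: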

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Add the Jetty version configured by Hudi -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>

Otherwise, inserting data into a Hudi table with Spark fails with the following error:

java.lang.NoSuchMethodError: org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V

2) Edit the pom file of hudi-utilities-bundle to exclude the older Jetty and add the Jetty version that Hudi specifies:

vim /opt/software/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml

Around line 405, change the dependencies as follows:

<!-- Hoodie -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-client-common</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>


<!-- Hive -->
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <artifactId>servlet-api</artifactId>
      <groupId>javax.servlet</groupId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service-rpc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Add the Jetty version configured by Hudi -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>

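Otherwise, inserting data into a Hudi table with the DeltaStreamer tool also fails with a Jetty error.

2.2.6 Run the Build Command

Run the following from the source root (/opt/software/hudi-0.12.0):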

mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3

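2.2.7 Build Succeeded

After the build finishes, being able to start hudi-cli confirms that it succeeded.

The built packages are located in the module directories under packaging.
For example, the Flink bundle for Hudi:

None yet.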
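
To see what the build produced, a small helper along these lines can list the bundle jars under packaging. The function name list_bundle_jars and the file-name pattern hudi-*bundle*.jar are assumptions for illustration; adjust the pattern if your build names its artifacts differently.

```shell
# Hypothetical helper: list bundle jars found in <source root>/packaging/*/target.
# $1 = Hudi source root (defaults to the current directory).
list_bundle_jars() {
  lbj_root="${1:-.}"
  # Only look inside target directories and skip the -sources and -tests jars.
  find "$lbj_root/packaging" -type f -path '*/target/*' -name 'hudi-*bundle*.jar' \
    ! -name '*-sources.jar' ! -name '*-tests.jar' | sort
}
```

For example: list_bundle_jars /opt/software/hudi-0.12.0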