Installing DBImport

Installation

To install DBImport, please follow the steps on this page. That will set up DBImport and enable you to import your first database in no time.

Download the source package for the latest release from the GitHub Release Page. Create a directory and unpack the file:

mkdir /usr/local/dbimport
cd /usr/local/dbimport
tar -xf dbimport-VERSION.tar

There are a number of Python dependencies that need to be in place before DBImport can start. As for Python itself, DBImport is verified with Python 3.6 and 3.8. Other versions may work but are not tested. It is also preferable to use a virtual environment for Python, but creating a virtual environment is not part of this documentation. All required Python packages with working versions are available in a requirements.txt file, one for each verified version of Python. The version you select will also depend on what Spark version is installed in your cluster.

Python 3.6

This version of Python works with both Spark2 and Spark3. Please install the Python packages for version 3.6:

pip3.6 install -r requirements_python36.txt

Python 3.8

When running Python 3.8, you also need to run Spark3. If not, you will get errors in the ETL / Hive steps when running DBImport. Please install the Python packages for version 3.8:

pip3.8 install -r requirements_python38.txt

Base configuration

There is a file inside the conf directory called dbimport.cfg. This file contains the configuration that is required to connect to databases, HDFS and so on. Before running any DBImport commands, this file needs to be configured. The important settings that you must configure are the following (a minimal sketch of some of these sections follows the list):

  • Section [Database], all settings starting with mysql. This is the connection to the DBImport configuration database.

  • Section [REST_statistics]. Set post_column_data to false or configure a valid REST interface on rest_endpoint.

  • Section [Kerberos]. keytab and principal need to point to a valid keytab file and the principal that it contains.

  • Section [Hive]. hive_metastore_alchemy_conn must have a valid Alchemy connection string to the Hive Metastore SQL database.

  • Section [Hive]. servers, port and kerberos_service_name must all be set against a Hive server with binary transport mode.

  • Section [Spark]. servers, port and kerberos_service_name must all be set against a Hive server with binary transport mode.

  • Section [Airflow]. airflow_alchemy_conn must have a valid Alchemy connection string to the Airflow SQL database. Only needed if the Airflow integration will be used.

  • Section [Credentials]. Both public and private keys need to be valid. See the SSL keys section further down on this page for more information.
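
As a reference, a minimal sketch of a few of these settings is shown below. The section and key names are the ones mentioned in the list above; all values are placeholders and most keys from the default dbimport.cfg are left out, so treat this as an illustration rather than a complete configuration.

[Kerberos]
keytab = /etc/security/keytabs/dbimport.keytab
principal = dbimport@REALM.COM

[Hive]
hive_metastore_alchemy_conn = mysql://<user>:<password>@<metastore host>/<metastore database>

[Airflow]
airflow_alchemy_conn = mysql://<user>:<password>@<airflow host>/<airflow database>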

Note

If the hive_remove_locks_by_force setting (in the configuration table) is set to 1, the user configured for the hive_metastore_alchemy_conn must have delete permissions to the HIVE_LOCKS table. Other than that, the user only requires select permissions.
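
For a MariaDB/MySQL based Hive Metastore, those permissions could be granted roughly as below. This is only a sketch; the metastore schema name (metastore) and the user (dbimport) are assumptions and must be adapted to your environment.

GRANT SELECT ON metastore.* TO 'dbimport'@'%';
GRANT DELETE ON metastore.HIVE_LOCKS TO 'dbimport'@'%';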

SSL keys

As DBImport saves connection details for all databases it is importing from, this data needs to be encrypted and stored in a secure way. To do that, we have a private/public key pair that encrypts and decrypts the credentials to the source systems. These keys need to be configured. The default configuration is to have them under the conf directory. This can be changed in dbimport.cfg if needed. To create the keys with the default path, run the following commands:

openssl genrsa -out conf/dbimport_private_key.pem 8192
openssl rsa -in conf/dbimport_private_key.pem -out conf/dbimport_public_key.pem -outform PEM -pubout

Once created, make sure that the Unix permissions only allow the correct users to read these files.
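
For example, assuming DBImport runs as a local user called dbimport (adjust to your own service account), the permissions could be restricted like this:

chown dbimport conf/dbimport_private_key.pem conf/dbimport_public_key.pem
chmod 600 conf/dbimport_private_key.pem
chmod 644 conf/dbimport_public_key.pem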

Database Setup

As the configuration is all located in a MariaDB database, you need to create this database and give the DBImport user permissions to create tables; the setup tool will then create the required tables in that database. We assume that you have a working MariaDB server, as the setup of that is not part of this documentation.
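
A minimal sketch of creating the configuration database and user in MariaDB is shown below. The database name (dbimport), user name and allowed host are assumptions; use whatever matches the mysql settings in your dbimport.cfg.

CREATE DATABASE dbimport;
CREATE USER 'dbimport'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON dbimport.* TO 'dbimport'@'%';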

Note

Due to an incompatibility between MariaDB and SQLAlchemy, MariaDB version 10.2.8 or earlier in the 10.2 series is not supported. Either run 10.1.x or 10.2.9 or newer.

Create the configuration schema by running this command:

bin/setup --createDB

Database Configuration

There is also global configuration stored inside the newly created database. These settings are stored in the configuration table. There is one value that is important to configure correctly, as it requires the creation of a table in Hive if enabled, and that is hive_validate_before_execution. If this is set to 1 in the valueInt column, DBImport will run a select ... group by ... query against the table configured by the hive_validate_table setting. The default value is dbimport.validate_table. To create this table, please run the following in Hive:

create database dbimport;
create table dbimport.validate_table ( id integer, value string );
insert into dbimport.validate_table (id, value) values (1, 'Hive'), (2, 'LLAP'), (3, 'is'), (4, 'working'), (5, 'fine');
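
If you want to enable the validation, the flag is switched on in the configuration table of the DBImport database. The statement below is only a sketch; the name of the key column (assumed to be configKey here) may differ, so verify the table layout before running it.

update configuration set valueInt = 1 where configKey = 'hive_validate_before_execution';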

DBImport uses staging databases for creating Import Tables and Temporary Export Tables. These databases need to exist before running imports and/or exports. The names of the databases are configured by the import_staging_database and export_staging_database settings. To create the default databases, please run the following in Hive:

create database etl_import_staging;
create database etl_export_staging;

JDBC Driver location

In order to connect to a specific database type, the path to the JDBC JAR file must be updated. Inside the jdbc_connections_drivers table you will find the supported database types with the JDBC driver name and the classpath data. For a new and fresh installation of DBImport the classpath will be ‘add path to JAR file’. This is obviously not correct and needs to be updated to point to the full path of the jar file that includes the driver specified in the driver column. Enter the correct path to your jar file and also make sure that sqoop has these files in its classpath.
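
The update is done directly in the jdbc_connections_drivers table in the DBImport configuration database. Below is a hedged example for a MySQL driver; the column identifying the database type (assumed to be database_type here) and the jar file path are illustrations only, so verify the column names and the path in your own installation.

update jdbc_connections_drivers set classpath = '/usr/share/java/mysql-connector-java.jar' where database_type = 'MySQL';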

That's it. DBImport is now installed and ready for use. If this is your first time working with DBImport, the Quickstart guide can help you get up to speed fast.

Support for Spark

If you are planning to use Spark, there might be an incompatibility problem between python2 and python3 in the topology script file in HDFS.

Note

If you see the error "SyntaxError: Missing parentheses in call to 'print'. Did you mean print(rack)?" when using the spark tool in DBImport, you know that you will have to fix this.

The usual way to change this is the following:

  • Change the net.topology.script.file.name setting to another filename in core-site.xml. In this example we will use /etc/hadoop/conf/custom_topology_script.py

  • cp /etc/hadoop/conf/topology_script.py /etc/hadoop/conf/custom_topology_script.py

  • Change the first row from “#!/usr/bin/env python” to “#!/usr/bin/python2”

  • Make the same change on all nodes in the cluster.

  • Restart affected services (HDFS, Yarn and so on)

Note

If you don't want to create a separate topology file and configuration, you can of course just change the first row in the default topology file on all nodes in the cluster.

Once that is changed, you need to update the Spark settings in the configuration file. The following settings have been tested and work on HDP 2.6.5, HDP 3.1.4 and CDP 7.1.7 SP1.

Verified Spark settings for HDP 2.6.5:

path_append = /usr/hdp/current/spark2-client/python/, /usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip
jar_files =
py_files = /usr/hdp/current/spark2-client/python/lib/py4j-src.zip
master = yarn
deployMode = client
yarnqueue = default
dynamic_allocation = true
executor_memory = 2688M
hdp_version = 2.6.5.0-292
hive_library = HiveContext

Verified Spark settings for HDP 3.1.4:

path_append = /usr/hdp/current/spark2-client/python/, /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip, /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
jar_files = /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar
py_files = /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip
master = yarn
deployMode = client
yarnqueue = default
dynamic_allocation = true
executor_memory = 2688M
hdp_version = 3.1.4.0-315
hive_library = HiveWarehouseSession

Verified Spark settings for CDP 7.1.7 SP1:

path_append = /opt/cloudera/parcels/CDH/lib/spark/python, /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/spark/python/lib/py4j-0.10.7-src.zip, /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.7.1000-141.zip
jar_files = /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar
py_files = /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.1.7.1000-141.zip
master = yarn
deployMode = client
yarnqueue = default
dynamic_allocation = true
executor_memory = 2688M
hdp_version = 7.1.7.1000-141
hive_library = HiveWarehouseSession

Spark support for AWS S3 exports

If you are planning to use AWS S3 exports, you need to add the jar files to the Spark configuration. The two files needed are hadoop-aws-VERSION.jar and aws-java-sdk-s3-VERSION.jar. For the CDP 7.1.7 SP1 example above, replace the jar_files row with the following row:

jar_files = /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hive-warehouse-connector-assembly-1.0.0.7.1.7.1000-141.jar, /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/hadoop-aws-3.1.1.7.1.7.1000-141.jar, /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/jars/aws-java-sdk-s3-1.12.124.jar

Python version for Spark

Both the local DBImport code and the Spark code must run with the same Python version. To achieve this, set the PYSPARK_PYTHON variable to the Python version you are running before you execute imports or exports:

export PYSPARK_PYTHON=python3.6

Support for Spark3

The first step to enable Spark3 is of course to install Spark3 in the cluster. If you are using CDP, Spark3 is available as a separate parcel and will be installed in parallel with Spark2, with both being active in the cluster at the same time. Once Spark3 is installed, you need to tell DBImport to use the jar files from Spark3 instead of Spark2.

Verified Spark settings for CDP 7.1.8:

path_append = /opt/cloudera/parcels/SPARK3/lib/spark3/python, /opt/cloudera/parcels/SPARK3/lib/spark3/python/lib/py4j-0.10.9.5-src.zip, /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/pyspark_hwc-spark3-1.0.0.3.3.7180.5-1.zip
jar_files = /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/hive-warehouse-connector-spark3-assembly-1.0.0.3.3.7180.5-1.jar
py_files = /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/pyspark_hwc-spark3-1.0.0.3.3.7180.5-1.zip

master = yarn
deployMode = client
yarnqueue = default
dynamic_allocation = true
executor_memory = 2688M
hdp_version = 7.1.8.11-3
hive_library = HiveWarehouseSession

Iceberg

Note

Iceberg support is in tech-preview state and hasn't been used as much in stable production environments as normal imports have. Please keep this in mind when using it.

As Iceberg isn't fully supported yet by CDP Private Cloud Base Edition, we need to manually install Iceberg and enable support for it. Download the Iceberg Spark runtime jar from the Apache Iceberg homepage and upload that jar file to all nodes in the cluster. During development of Iceberg support in DBImport, the file and version used together with CDP 7.1.8 was iceberg-spark-runtime-3.3_2.12-1.1.0.jar. To enable DBImport to use the Iceberg file format, you need to add the jar file to the jar_files setting in the [Spark] section of the config file.

Verified Spark settings for CDP 7.1.8 with Iceberg:

path_append = /opt/cloudera/parcels/SPARK3/lib/spark3/python, /opt/cloudera/parcels/SPARK3/lib/spark3/python/lib/py4j-0.10.9.5-src.zip, /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/pyspark_hwc-spark3-1.0.0.3.3.7180.5-1.zip
jar_files = /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/hive-warehouse-connector-spark3-assembly-1.0.0.3.3.7180.5-1.jar, /opt/datalake/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar
py_files = /opt/cloudera/parcels/SPARK3/lib/hwc_for_spark3/pyspark_hwc-spark3-1.0.0.3.3.7180.5-1.zip

Due to a limitation in CDP regarding the TEZ-4248 implementation, CDP Private Cloud Base Edition can't access the Iceberg tables through Hive, but Impala and direct Spark access work fine. Regardless of whether Hive works or not, the following settings must be set in HiveServer2 in CDP for Iceberg to work, even when only using Impala:

hive.vectorized.execution.enabled = false
iceberg.engine.hive.enabled = true
tez.mrreader.config.update.properties = hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids

Atlas

There are three things you need to configure in order to use the Atlas integration.

Note

DBImport requires Atlas version 1.1.0 or higher

DBImport configuration file

In the configuration file for DBImport, you need to specify the address of the Atlas server. If Atlas is running with SSL, you also need to configure the certificate path for the CA file. See the default configuration file for the syntax of these properties.

Atlas Schema

DBImport uses its own schema in Atlas. This needs to be created manually. The json file for the schema is located under atlas/dbimportProcess.json. In order to create the schema, run the following command in a session where you have a valid Kerberos ticket with admin permissions in Atlas:

curl --negotiate -u : -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "https://<ATLAS SERVER HOSTNAME>:<PORT>/api/atlas/v2/types/typedefs" -d "@<FULL_PATH_TO_DBIMPORT>/atlas/dbimportProcess.json"

Atlas process Pictures

To get the correct representation of the DBImport lineage in Atlas, you need to copy the DBImport icon into the Atlas server directory for lineage icons. The picture is located under atlas/DBImport_Process.png. The target directory is server/webapp/atlas/img/entity-icon under the Atlas installation directory. Example:

HDP:

cp /usr/local/dbimport/atlas/DBImport_Process.png /usr/hdp/current/atlas-server/server/webapp/atlas/img/entity-icon

CDP/CDH:

cp /usr/local/dbimport/atlas/DBImport_Process.png /opt/cloudera/parcels/CDH/lib/atlas/server/webapp/atlas/img/entity-icon

DBImport Server

The server part of DBImport handles Atlas auto discovery of remote RDBMS schemas and the asynchronous copy of data between two or more DBImport instances. It also contains a simple REST interface in order to query the status of the DBImport Server for monitoring purposes.

Two directories are required for the DBImport server: one for log files and one for the pid file. These are configured in the DBImport configuration file in the [Server] section. Please create these two directories and change ownership to the user running the DBImport server. For the default user and paths, the following can be used:

mkdir /var/log/dbimport
mkdir /var/run/dbimport
chown dbimport /var/log/dbimport
chown dbimport /var/run/dbimport
chmod 700 /var/log/dbimport
chmod 700 /var/run/dbimport

distcp configuration for multi instance ingestion

The DBImport server uses distcp in the background to copy files between Hadoop clusters. This is done in parallel, and the number of parallel sessions is controlled by the distCP_threads setting under the [Server] section in the configuration file.

If the number of distcp commands running in parallel gets too high, the log file will be hard to read as all output ends up in one file. This behavior can be changed so that each thread logs to its own file. distCP_separate_logs controls this.
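
As an illustration, the two settings could look like this in the [Server] section of dbimport.cfg. The values shown are assumptions rather than recommendations; tune them to your own environment.

distCP_threads = 4
distCP_separate_logs = true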

Upgrading

In order to upgrade DBImport, all files from the source package need to be replaced with new versions. Many versions also require a configuration database schema upgrade. To be sure that you are on the latest schema for the configuration database, please run the following after upgrading the binary files:

./setup --upgradeDB

This command will only upgrade the schema if it is not already at the latest version, so it is safe to run even if there are no new schema changes.

There might also be changes in the conf/dbimport.cfg configuration file. Please compare your local version with the default config file available in the release. If you are missing certain settings in your local configuration, please add them. If you don't, you will most likely get an error saying that a specific key can't be found in the configuration.
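
A simple way to spot missing settings is to diff your local configuration against the default one shipped with the new release, for example like this (where /path/to/new/release is wherever you unpacked the new source package):

diff /usr/local/dbimport/conf/dbimport.cfg /path/to/new/release/conf/dbimport.cfg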