Database Tables

As this version of DBImport does not have the web admin tool available, the documentation is written against each column in the configuration database tables. The admin tool will later use the same fields, so what is said here will also apply to the admin tool.

Table - airflow_custom_dags

It is possible to construct a DAG that has no import, export or ETL definitions in it, but instead only Tasks from the airflow_tasks table. That can be useful, for example, to run custom Hive code in a separate DAG after an import is completed. Defining a DAG here also requires you to have at least one task in airflow_tasks defined “in main”. A minimal example follows below.
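A minimal sketch of registering such a custom DAG directly in the database. The DAG name and schedule below are made-up examples, and it is assumed that schedule_interval accepts a standard cron expression:

    INSERT INTO airflow_custom_dags (dag_name, schedule_interval, retries)
    VALUES ('custom_hive_maintenance', '0 5 * * *', 2);

At least one matching row with placement “in main” must also exist in airflow_tasks; see the example under that table further down.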

Column Documentation
dag_name Name of Airflow DAG.
schedule_interval Schedule for when to execute the DAG
retries How many retries each Task should do in Airflow before it fails
operator_notes Free text field to write a note about the custom DAG.
application_notes Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag 1 = The DAG will be auto regenerated by the manage command
airflow_notes Documentation that will be available in Airflow. Markdown syntax is supported
sudo_user All tasks in the DAG will use this user for sudo instead of the default
timezone Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.

Table - airflow_etl_dags

To create a DAG in Airflow for only ETL jobs, this is the table that holds all definitions of the DAG configuration, including the filter that defines what ETL jobs to run, schedule times, pool names and much more.

Column Documentation
dag_name Name of Airflow DAG.
schedule_interval Schedule for when to execute the DAG
filter_job Filter string for JOB in etl_jobs table
filter_task Filter string for TASK in etl_jobs table
filter_source_db Filter string for SOURCE_DB in etl_jobs table
filter_target_db Filter string for TARGET_DB in etl_jobs table
retries How many retries each Task should do in Airflow before it fails
trigger_dag_on_success <NOT USED>
operator_notes Free text field to write a note about the ETL job.
application_notes Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag 1 = The DAG will be auto regenerated by the manage command
airflow_notes Documentation that will be available in Airflow. Markdown syntax is supported
sudo_user All tasks in the DAG will use this user for sudo instead of the default
timezone Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.

Table - airflow_export_dags

To create a DAG in Airflow for Exports, this is the table that holds all definitions of the DAG configuration, including the filter that defines what tables to export, schedule times, pool names and much more.

Column Documentation
dag_name Name of Airflow DAG.
schedule_interval Schedule for when to execute the DAG
filter_dbalias Filter string for DBALIAS in export_tables
filter_target_schema Filter string for TARGET_SCHEMA in export_tables
filter_target_table Filter string for TARGET_TABLE in export_tables
use_python_dbimport Legacy use only. Always set this to 1
retries How many retries each Task should do in Airflow before it fails
trigger_dag_on_success Name of DAG to trigger if the export is successful. Comma separated list of DAGs (optional)
operator_notes Free text field to write a note about the export.
application_notes Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag 1 = The DAG will be auto regenerated by the manage command
airflow_notes Documentation that will be available in Airflow. Markdown syntax is supported
sudo_user All tasks in the DAG will use this user for sudo instead of the default
timezone Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.

Table - airflow_import_dags

To create a DAG in Airflow for Imports, this is the table that holds all definitions of the DAG configuration, including the filter that defines what tables to import, schedule times, pool names and much more.
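As a sketch, a DAG that imports every table in a given Hive database could be registered like this. The names and schedule are examples; the filter format follows the filter_hive column described below:

    INSERT INTO airflow_import_dags (dag_name, schedule_interval, filter_hive)
    VALUES ('import_sales_db', '0 3 * * *', 'sales_db.*');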

Column Documentation
dag_name Name of Airflow DAG.
schedule_interval Schedule for when to execute the DAG
filter_hive Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
use_python_dbimport Legacy use only. Always set this to 1
finish_all_stage1_first 1 = All Import phase jobs will be completed first, and when all are successful, the ETL phase starts
retries How many retries each Task should do in Airflow before it fails
retries_stage1 Specific number of retries for the Import Phase
retries_stage2 Specific number of retries for the ETL Phase
pool_stage1 Airflow pool used for stage1 tasks. NULL for the default Hostname pool
pool_stage2 Airflow pool used for stage2 tasks. NULL for the default DAG pool
operator_notes Free text field to write a note about the import.
application_notes Free text field that can be used for application documentation, notes or links.
auto_table_discovery <NOT USED>
auto_regenerate_dag 1 = The DAG will be auto regenerated by the manage command
run_import_and_etl_separate 1 = The Import and ETL phase will run in separate Tasks.
airflow_notes Documentation that will be available in Airflow. Markdown syntax is supported
sudo_user All tasks in the DAG will use this user for sudo instead of the default
metadata_import 1 = Run only getSchema and getRowCount, 0 = Run a normal import
timezone Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.

Table - airflow_tasks

All DAGs can be customized by adding Tasks to the DAG. Depending on the placement and type of the Tasks that are created, DBImport will add custom placeholders to keep the DAG separated into three different parts: Before Main, In Main and After Main. In Main is where all regular import, export or ETL jobs are executed. If you want to execute something before these, you place it in Before Main, and if you want to execute something after, you place it in After Main. Please check the Airflow Integration part of the documentation for more examples and a better understanding of the data you can put into this table. A sketch of adding a Task is shown below.
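A minimal sketch of adding a Task to the custom DAG from the airflow_custom_dags example above. The jdbc_dbalias value is made up, and it is assumed here that a ‘JDBC SQL’ Task takes its SQL statement from task_config and that placement accepts the “in main” wording used above:

    INSERT INTO airflow_tasks (dag_name, task_name, task_type, placement, jdbc_dbalias, task_config)
    VALUES ('custom_hive_maintenance', 'refresh_report_status', 'JDBC SQL', 'in main',
            'DW_ORACLE', 'UPDATE report_status SET refreshed = CURRENT_TIMESTAMP');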

Column Documentation
dag_name Name of DAG to add Tasks to
task_name Name of the Task in Airflow
task_type The type of the Task
placement Placement for the Task
jdbc_dbalias For ‘JDBC SQL’ Task Type, this specifies what database the SQL should run against
hive_db <NOT USED>
airflow_pool Airflow Pool to use.
airflow_priority Airflow Priority. Higher number, higher priority
include_in_airflow Enable or disable the Task in the DAG during creation of DAG file.
task_dependency_in_main If placement is In Main, this defines a dependency for the Task. Comma separated list
task_config The configuration for the Task. Depends on what Task type it is.
sensor_connection Name of Connection in Airflow
sensor_poke_interval Poke interval for sensors in seconds
sensor_timeout_minutes Timeout for sensors in minutes
sudo_user The task will use this user for sudo instead of default

Table - atlas_column_cache

Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.

Column Documentation
hostname Hostname for the database
port Port for the database
database_name Database name
schema_name Database schema
table_name Database table
column_name Name of the column
column_type Type of the column
column_length Length of the column
column_is_nullable Are null values allowed in the column
column_comment Comment on the column
table_comment Comment on the table
table_type Table type.
table_create_time Timestamp for when the table was created
default_value Default value of the column

Table - atlas_key_cache

Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.

Column Documentation
hostname Hostname for the database
port Port for the database
database_name Database name
schema_name Database schema
table_name Database table
constraint_name Name of the constraint
constraint_type Type of the constraint
column_name Name of the column
reference_schema_name Name of the schema that is referenced
reference_table_name Name of the table that is referenced
reference_column_name Name of the column that is referenced
col_key_position Position of the key

Table - configuration

This is the DBImport global configuration table. In here you can configure items such as the name of the staging database, disable global executions, max number of mappers and much more.
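Since the configuration keys are maintained by the setup tool, the safest way to interact with this table is to read it; a sketch:

    -- List all global configuration items and their current values
    SELECT configKey, valueInt, valueStr, valueDate, description
    FROM configuration
    ORDER BY configKey;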

Column Documentation
configKey Name of the configuration item. These are controlled and maintained by the setup tool. Don't change these manually
valueInt Integer based Value
valueStr String based Value
valueDate Date based Value
description Description on the setting

Table - copy_async_status

The status table for asynchronous copy between DBImport instances.

Column Documentation
table_id Reference to import_table.table_id
hive_db Hive Database
hive_table Hive Table to copy
destination DBImport instances to copy the imported data to
copy_status Status of the copy operation
last_status_update Last time the server changed progress on this copy
failures Number of failures on current state
hdfs_source_path HDFS path to copy from
hdfs_target_path HDFS path to copy to

Table - copy_tables

When the copy phase starts, it will look in this table to understand if it is going to copy its data and to which DBImport instances.
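A sketch of a copy definition. The destination name is made up and must match an entry in dbimport_instances, and the exact value accepted by data_transfer is an assumption based on the column description below:

    INSERT INTO copy_tables (hive_filter, destination, data_transfer)
    VALUES ('sales_db.*', 'dr_cluster', 'Asynchronous');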

Column Documentation
copy_id Auto Incremented PrimaryKey of the table
hive_filter Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
destination DBImport instances to copy the imported data to
data_transfer Synchronous will transfer the data as part of the Import. Asynchronous will transfer the data by a separate process and not part of the Import

Table - dbimport_instances

This table contains all DBImport instances that will receive data from this instance during the copy phase

Column Documentation
instance_id Auto Incremented PrimaryKey of the table
name Name of the DBImport instance
db_hostname MySQL Hostname to DBImport database
db_port MySQL Port to DBImport database
db_database MySQL Database to DBImport database
db_credentials MySQL Username and Password to DBImport database
hdfs_address HDFS address. Example hdfs://hadoopcluster
hdfs_basedir The base dir to write data to. Example /apps/dbimport
sync_credentials 0 = Credentials won't be synced, 1 = The credentials information will be synced to the other cluster

Table - etl_jobs

Column Documentation
job  
task  
job_id  
etl_type  
include_in_airflow  
source_db  
source_table  
target_db  
target_table  
operator_notes Free text field to write a note about the ETL job.

Table - export_columns

This table contains all columns that exist on all tables that we are exporting. Unlike the export_tables table, this one is created automatically by the export tool

Column Documentation
table_id Foreign Key to export_tables column ‘table_id’
column_id Unique identifier
column_name Name of column in target table. Don't change this manually
column_type Column type from Hive. Don't change this manually
column_order The order of the columns. Don't change this manually
hive_db Only used to make it easier to read the table. No real usage
hive_table Only used to make it easier to read the table. No real usage
target_column_name Override the name of column in the target system
target_column_type Override the column type in the target system
last_update_from_hive Timestamp of last schema update from Hive. Don't change this manually
last_export_time Timestamp of last export. Don't change this manually
selection <NOT USED>
include_in_export 1 = Include column in export, 0 = Exclude column in export
comment The column comment from the source system. Don't change this manually
operator_notes Free text field to write a note about the column.

Table - export_retries_log

Log of all retries that have happened.

Column Documentation
dbalias Database connection name that we export to
target_schema Schema on the target system
target_table Table on the target system
retry_time Time when the retry was started
stage The stage of the export that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in the Airflow DAGs
stage_description Description of the stage

Table - export_stage

The export tool keeps track of how far into the export it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an export is executed it will skip the first part and continue from where it ended in error on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the “manage --clearExportStage” tool. Keep in mind that clearing the stage of an incremental export might result in loss of data.
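Before clearing, it can be helpful to see where the export stopped; a sketch using the columns documented below:

    -- Check where a failed export currently stands
    SELECT dbalias, target_schema, target_table, stage, stage_description, stage_time
    FROM export_stage;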

Column Documentation
dbalias Database connection name that we export to
target_schema Schema on the target system
target_table Table on the target system
stage Current stage of the export. This is the internal stage number
stage_description Description of the stage
stage_time The date and time when the import entered the stage

Table - export_stage_statistics

As DBImport progresses through the different stages of the export, it also keeps track of the start and stop time for each stage together with the duration. That information is kept in this table

Column Documentation
dbalias Database connection name that we export to
target_schema Schema on the target system
target_table Table on the target system
stage Current stage of the export. This is the internal stage number
start Time when stage started
stop Time when stage was completed
duration Duration of stage

Table - export_statistics

At the end of each export, all statistics about how long each part took, as well as general information about the Hive database and table, the number of rows exported, the size of the export and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large for you, it's up to each user to delete or truncate this table as they see fit.
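If the table grows too large, one possible pruning approach (only a sketch, with an arbitrary retention period) is:

    -- Remove export statistics older than one year
    DELETE FROM export_statistics
    WHERE `start` < NOW() - INTERVAL 1 YEAR;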

Column Documentation
id Auto incremented PrimaryKey of the table
dbalias ID of the Database Connection
target_database Name of the target database
target_schema Name of the target schema
target_table Name of the target table
hive_db Hive Database
hive_table Hive Table
export_phase Export Phase method
incremental 0 = Full export, 1 = Incremental export
rows How many rows were exported
size The total size in bytes that was exported
sessions How many parallel sessions were used against the source (sqoop mappers)
duration Total duration in seconds
start Timestamp of start
stop Timestamp of stop
get_hive_tableschema_duration  
get_hive_tableschema_start  
get_hive_tableschema_stop  
clear_table_rowcount_duration  
clear_table_rowcount_start  
clear_table_rowcount_stop  
create_temp_table_duration  
create_temp_table_start  
create_temp_table_stop  
truncate_temp_table_duration  
truncate_temp_table_start  
truncate_temp_table_stop  
fetch_maxvalue_start  
fetch_maxvalue_stop  
fetch_maxvalue_duration  
insert_into_temp_table_duration  
insert_into_temp_table_start  
insert_into_temp_table_stop  
create_target_table_duration  
create_target_table_start  
create_target_table_stop  
truncate_target_table_duration  
truncate_target_table_start  
truncate_target_table_stop  
sqoop_duration  
sqoop_start  
sqoop_stop  
validate_duration  
validate_start  
validate_stop  
update_statistics_duration  
update_statistics_start  
update_statistics_stop  
update_target_table_duration  
update_target_table_start  
update_target_table_stop  
spark_duration  
spark_start  
spark_stop  
atlas_schema_duration  
atlas_schema_start  
atlas_schema_stop  

Table - export_statistics_last

The last entry in the export_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the export_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.
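A sketch of the typical usage, fetching the most recent statistics for one exported Hive table without grouping over export_statistics:

    SELECT dbalias, target_schema, target_table, `rows`, size, duration, `start`, `stop`
    FROM export_statistics_last
    WHERE hive_db = 'sales_db'
      AND hive_table = 'orders';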

Column Documentation
dbalias ID of the Database Connection
target_database Name of the target database
target_schema Name of the target schema
target_table Name of the target table
hive_db Hive Database
hive_table Hive Table
export_phase Export Phase method
incremental 0 = Full export, 1 = Incremental export
rows How many rows were exported
size The total size in bytes that was exported
sessions How many parallel sessions were used against the source (sqoop mappers)
duration Total duration in seconds
start Timestamp of start
stop Timestamp of stop
get_hive_tableschema_duration  
get_hive_tableschema_start  
get_hive_tableschema_stop  
clear_table_rowcount_duration  
clear_table_rowcount_start  
clear_table_rowcount_stop  
create_temp_table_duration  
create_temp_table_start  
create_temp_table_stop  
truncate_temp_table_duration  
truncate_temp_table_start  
truncate_temp_table_stop  
fetch_maxvalue_start  
fetch_maxvalue_stop  
fetch_maxvalue_duration  
insert_into_temp_table_duration  
insert_into_temp_table_start  
insert_into_temp_table_stop  
create_target_table_duration  
create_target_table_start  
create_target_table_stop  
truncate_target_table_duration  
truncate_target_table_start  
truncate_target_table_stop  
sqoop_duration  
sqoop_start  
sqoop_stop  
validate_duration  
validate_start  
validate_stop  
update_statistics_duration  
update_statistics_start  
update_statistics_stop  
update_target_table_duration  
update_target_table_start  
update_target_table_stop  
spark_duration  
spark_start  
spark_stop  
atlas_schema_duration  
atlas_schema_start  
atlas_schema_stop  

Table - export_tables

Main table where all tables that we can export are stored.
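A minimal sketch of registering a Hive table for export. The connection, schema and table names are examples; 'full' is one of the two export_type values mentioned below, and the remaining columns can usually be left at their defaults:

    INSERT INTO export_tables (dbalias, target_schema, target_table, hive_db, hive_table, export_type)
    VALUES ('DW_ORACLE', 'DW', 'ORDERS', 'sales_db', 'orders', 'full');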

Column Documentation
dbalias Database connection name that we export to
target_schema Schema on the target system
target_table Table on the target system
table_id Unique identifier of the table
hive_db Name of Hive Database that we export from
hive_table Name of Hive Table that we export from
export_type What export method to use. Only full and incr are supported.
export_tool What tool should be used for exporting data
last_update_from_hive Timestamp of last schema update from Hive
sql_where_addition Will be added AFTER the SQL WHERE. If it’s an incr export, this will be after the incr limit statements. Example “orderId > 1000”
include_in_airflow Will the table be included in Airflow DAG when it matches the DAG selection
notUsed01 <NOT USED>
forceCreateTempTable Force export to create a Hive table and export that instead. Useful when exporting views
notUsed02 <NOT USED>
validate_export 1 = Validate the export once it’s done. 0 = Disable validation
validationMethod Validation method to use
validationCustomQueryHiveSQL Custom SQL query for Hive table
validationCustomQueryTargetSQL Custom SQL query for target table
uppercase_columns -1 = auto (Oracle = uppercase, other databases = lowercase)
truncate_target 1 = Truncate the target table before we export the data. Not used by incremental exports
mappers -1 = auto, 0 = invalid. Auto updated by ‘export_main.sh’
hive_rowcount Number of rows in the Hive table. Don't change manually
target_rowcount Number of rows in the Target table. Don't change manually
validationCustomQueryHiveValue Used for validation. Don't change manually
validationCustomQueryTargetValue Used for validation. Don't change manually
incr_column The column in the Hive table that will be used to identify new rows for the incremental export. Must be a timestamp column
incr_validation_method full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)
incr_minvalue Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue Used by incremental exports to keep track of progress. Don't change manually
incr_minvalue_pending Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue_pending Used by incremental exports to keep track of progress. Don't change manually
sqoop_options Sqoop options to use during export.
sqoop_last_size Used to track sqoop operation. Don't change manually
sqoop_last_rows Used to track sqoop operation. Don't change manually
sqoop_last_mappers Used to track sqoop operation. Don't change manually
sqoop_last_execution Used to track sqoop operation. Don't change manually
create_target_table_sql SQL statement that was used to create the target table. Don't change manually
operator_notes Free text field to write a note about the export.
hive_javaheap Heap size for Hive
airflow_priority This will set priority_weight in Airflow
airflow_notes Documentation that will be available in Airflow. Markdown syntax

Table - import_columns

This table contains all columns that exist on all tables that we are importing. Unlike the import_tables table, this one is created automatically by the ‘Get Source TableSchema’ stage.

Column Documentation
table_id Foreign Key to import_tables column ‘table_id’
column_id Unique identifier of the column
column_order In what order the column exists in the source system.
column_name Name of column in Hive. Don't change this manually
hive_db Hive Database
hive_table Hive Table
source_column_name Name of column in source system. Don't change this manually
column_type Column type in Hive. Don't change this manually
source_column_type Column type in source system. Don't change this manually
source_database_type What database type the column was imported from
column_name_override Set a custom name of the column in Hive
column_type_override Set a custom column type in Hive
sqoop_column_type Used to create a correct --map-column-java setting for sqoop.
force_string If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in import_tables and jdbc_connections table
include_in_import 1 = Include column in import, 0 = Exclude column in import
source_primary_key Number starting from 1 listing the order of the column in the PK. Don't change this manually
last_update_from_source Timestamp of last schema update from source
comment The column comment from the source system
operator_notes Free text field to write a note about the column
sqoop_column_type_override Set the --map-column-java field to a fixed value and not calculated by DBImport
anonymization_function What anonymization function should be used with the data in this column

Table - import_failure_log

If there is an error or a warning during import, but the import still continues, these errors are logged in this table. An example could be that a column can't be altered, a foreign key is not created, no new columns can be added and such.

Column Documentation
hive_db Hive Database
hive_table Hive Table
eventtime Time when error/warning occurred
severity The Severity of the event.
import_type The import method used
error_text Text describing the failure

Table - import_foreign_keys

All foreign key definitions are saved in this table. The information in this table is recreated all the time, so no manual changes are allowed here. For a better understanding of this table, please use the view called import_foreign_keys_view instead

Column Documentation
table_id Table ID in import_tables that have the FK
column_id Column ID in import_columns that have the FK
fk_index Index of FK
fk_table_id Table ID in import_tables that the table is having a reference against
fk_column_id Column ID in import_columns that the table is having a reference against
key_position Position of the key

Table - import_retries_log

Log of all retries that have happened.

Column Documentation
hive_db Hive DB
hive_table Hive Table
retry_time Time when the retry was started
stage The stage of the import that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in the Airflow DAGs
stage_description Description of the stage

Table - import_stage

The import tool keeps track of how far into the import it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an import is executed it will skip the first part and continue from where it ended in error on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the “manage --clearImportStage” tool. Keep in mind that clearing the stage of an incremental import might result in loss of data.

Column Documentation
hive_db Hive Database
hive_table Hive Table
stage Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in the Airflow DAGs
stage_description Description of the stage
stage_time The date and time when the import entered the stage

Table - import_stage_statistics

As DBImport progresses through the different stages of the import, it also keeps track of the start and stop time for each stage together with the duration. That information is kept in this table

Column Documentation
hive_db Hive Database
hive_table Hive Table
stage Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in the Airflow DAGs
start Time when stage started
stop Time when stage was completed
duration Duration of stage

Table - import_statistics

At the end of each import, all statistics about how long each part took, as well as general information about the Hive database and table, the number of rows imported, the size of the import and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large for you, it's up to each user to delete or truncate this table as they see fit.

Column Documentation
id Auto incremented PrimaryKey of the table
hive_db Hive Database
hive_table Hive Table
importtype What kind of import type that was used
import_phase Import Phase method
copy_phase Copy Phase method
etl_phase ETL Phase method
incremental 0 = Full import, 1 = Incremental import
dbalias ID of the Database Connection
source_database Name of the source database
source_schema Name of the source schema
source_table Name of the source table
rows How many rows were imported
size The total size in bytes that was imported
sessions How many parallel sessions were used against the source (sqoop mappers)
duration Total duration in seconds
start Timestamp of start
stop Timestamp of stop
sqoop_duration  
sqoop_start  
sqoop_stop  
spark_duration  
spark_start  
spark_stop  
clear_hive_locks_duration  
clear_hive_locks_start  
clear_hive_locks_stop  
clear_table_rowcount_duration  
clear_table_rowcount_start  
clear_table_rowcount_stop  
connect_to_hive_duration  
connect_to_hive_start  
connect_to_hive_stop  
create_import_table_duration  
create_import_table_start  
create_import_table_stop  
create_target_table_duration  
create_target_table_start  
create_target_table_stop  
get_import_rowcount_duration  
get_import_rowcount_start  
get_import_rowcount_stop  
get_source_rowcount_duration  
get_source_rowcount_start  
get_source_rowcount_stop  
get_source_tableschema_duration  
get_source_tableschema_start  
get_source_tableschema_stop  
get_target_rowcount_duration  
get_target_rowcount_start  
get_target_rowcount_stop  
hive_import_duration  
hive_import_start  
hive_import_stop  
truncate_target_table_duration  
truncate_target_table_start  
truncate_target_table_stop  
merge_table_duration  
merge_table_start  
merge_table_stop  
create_history_table_duration  
create_history_table_start  
create_history_table_stop  
create_delete_table_duration  
create_delete_table_start  
create_delete_table_stop  
update_statistics_duration  
update_statistics_start  
update_statistics_stop  
validate_import_table_duration  
validate_import_table_start  
validate_import_table_stop  
validate_sqoop_import_duration  
validate_sqoop_import_start  
validate_sqoop_import_stop  
validate_target_table_duration  
validate_target_table_start  
validate_target_table_stop  
copy_data_duration  
copy_data_start  
copy_data_stop  
copy_schema_duration  
copy_schema_start  
copy_schema_stop  
atlas_schema_duration  
atlas_schema_start  
atlas_schema_stop  

Table - import_statistics_last

The last entry in the import_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the import_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.

Column Documentation
hive_db Hive Database
hive_table Hive Table
importtype What kind of import type that was used
import_phase Import Phase method
copy_phase Copy Phase method
etl_phase ETL Phase method
incremental 0 = Full import, 1 = Incremental import
dbalias ID of the Database Connection
source_database Name of the source database
source_schema Name of the source schema
source_table Name of the source table
rows How many rows were imported
size The total size in bytes that was imported
sessions How many parallel sessions were used against the source (sqoop mappers)
duration Total duration in seconds
start Timestamp of start
stop Timestamp of stop
sqoop_duration  
sqoop_start  
sqoop_stop  
spark_duration  
spark_start  
spark_stop  
clear_hive_locks_duration  
clear_hive_locks_start  
clear_hive_locks_stop  
clear_table_rowcount_duration  
clear_table_rowcount_start  
clear_table_rowcount_stop  
connect_to_hive_duration  
connect_to_hive_start  
connect_to_hive_stop  
create_import_table_duration  
create_import_table_start  
create_import_table_stop  
create_target_table_duration  
create_target_table_start  
create_target_table_stop  
get_import_rowcount_duration  
get_import_rowcount_start  
get_import_rowcount_stop  
get_source_rowcount_duration  
get_source_rowcount_start  
get_source_rowcount_stop  
get_source_tableschema_duration  
get_source_tableschema_start  
get_source_tableschema_stop  
get_target_rowcount_duration  
get_target_rowcount_start  
get_target_rowcount_stop  
hive_import_duration  
hive_import_start  
hive_import_stop  
truncate_target_table_duration  
truncate_target_table_start  
truncate_target_table_stop  
merge_table_duration  
merge_table_start  
merge_table_stop  
create_history_table_duration  
create_history_table_start  
create_history_table_stop  
create_delete_table_duration  
create_delete_table_start  
create_delete_table_stop  
update_statistics_duration  
update_statistics_start  
update_statistics_stop  
validate_import_table_duration  
validate_import_table_start  
validate_import_table_stop  
validate_sqoop_import_duration  
validate_sqoop_import_start  
validate_sqoop_import_stop  
validate_target_table_duration  
validate_target_table_start  
validate_target_table_stop  
copy_data_duration  
copy_data_start  
copy_data_stop  
copy_schema_duration  
copy_schema_start  
copy_schema_stop  
atlas_schema_duration  
atlas_schema_start  
atlas_schema_stop  

Table - import_tables

Main table where all tables that we can import are stored.
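A minimal sketch of registering a source table for import. The connection, schema and table names are examples; columns not listed fall back to their defaults:

    INSERT INTO import_tables (hive_db, hive_table, dbalias, source_schema, source_table)
    VALUES ('sales_db', 'orders', 'SALES_ORACLE', 'SALES', 'ORDERS');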

Column Documentation
hive_db Hive Database to import to
hive_table Hive Table to import to
table_id Unique identifier
dbalias Name of database connection from jdbc_connections table
source_schema Name of the schema in the remote database
source_table Name of the table in the remote database
import_phase_type What method to use for the Import phase
etl_phase_type What method to use for the ETL phase
import_type What import method to use
import_tool What tool should be used for importing data
last_update_from_source Timestamp of last schema update from source
sqoop_sql_where_addition Will be added AFTER the SQL WHERE. If it’s an incr import, this will be after the incr limit statements. Example “orderId > 1000”
nomerge_ingestion_sql_addition This will be added to the data ingestion of non-merge imports (full, full_direct and incr). Useful to filter out data from import tables to target tables
include_in_airflow Will the table be included in Airflow DAG when it matches the DAG selection
airflow_priority This will set priority_weight in Airflow
validate_import Should the import be validated
validationMethod Validation method to use
validate_source query = Run a ‘select count(*) from …’ to get the number of rows in the source table. sqoop = Use the number of rows imported by sqoop as the number of rows in the source table
validate_diff_allowed -1 = auto calculated diff allowed. If a positive number, this is the amount of rows that the diff is allowed to have
validationCustomQuerySourceSQL Custom SQL query for source table
validationCustomQueryHiveSQL Custom SQL query for Hive table
validationCustomQueryValidateImportTable 1 = Validate Import table, 0 = Don't validate Import table
truncate_hive <NOT USED>
mappers -1 = auto or a positive number for a fixed number of mappers. If Auto, then it's calculated based on the last sqoop import size
soft_delete_during_merge If 1, then the row will be marked as deleted instead of actually being removed from the table. Only used for Merge imports
source_rowcount Used for validation. Don't change manually
source_rowcount_incr  
hive_rowcount Used for validation. Don't change manually
validationCustomQuerySourceValue Used for validation. Don't change manually
validationCustomQueryHiveValue Used for validation. Don't change manually
incr_mode Incremental import mode
incr_column What column to use to identify new rows
incr_validation_method full or incr. Full means that the validation will check to total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)
incr_minvalue Used for incremental imports. Don't change manually
incr_maxvalue Used for incremental imports. Don't change manually
incr_minvalue_pending Used for incremental imports. Don't change manually
incr_maxvalue_pending Used for incremental imports. Don't change manually
pk_column_override Force the import and Hive table to define another PrimaryKey constraint. Comma separated list of columns
pk_column_override_mergeonly Force the import to use another PrimaryKey constraint during Merge operations. Comma separated list of columns
hive_merge_heap Should be a multiple of Yarn container size. If NULL then it will use the default specified in Yarn and TEZ
spark_executor_memory Memory used by spark when importing data. Overrides the default value in the global configuration
concatenate_hive_table <NOT USED>
split_by_column Column to split by when doing import with multiple sessions
sqoop_query Use a custom query in sqoop to read data from source table
sqoop_options Options to send to sqoop. Most commonly used for the --split-by option
sqoop_last_size Used to track sqoop operation. Don't change manually
sqoop_last_rows Used to track sqoop operation. Don't change manually
sqoop_last_mappers Used to track sqoop operation. Don't change manually
sqoop_last_execution Used to track sqoop operation. Don't change manually
sqoop_use_generated_sql 1 = Use the generated SQL that is saved in the generated_sqoop_query column
sqoop_allow_text_splitter Allow splits on text columns. Use with caution
force_string If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in jdbc_connections table
comment Table comment from source system. Don't change manually
generated_hive_column_definition Generated column definition for Hive create table. Don't change manually
generated_sqoop_query Generated query for sqoop. Don't change manually
generated_sqoop_options Generated options for sqoop. Don't change manually
generated_pk_columns Generated Primary Keys. Don't change manually
generated_foreign_keys <NOT USED>
datalake_source This value will come in the dbimport_source column if present. Overrides the same setting in jdbc_connections table
operator_notes Free text field to write a note about the import.
copy_finished Time when the last copy from the Master DBImport instance was completed. Don't change manually
copy_slave Defines if this table is a Master table or a Slave table. Don't change manually
create_foreign_keys -1 (default) = Get information from jdbc_connections table
airflow_notes Documentation that will be available in Airflow. Markdown syntax
custom_max_query You can use a custom SQL query that will get the Max value from the source database. This Max value will be used in an incremental import to know how much to read in each execution
mergeCompactionMethod Compaction method to use after import using merge is completed. Default means a major compaction if it is configured to do so in the configuration table
sourceTableType Type of table on the source system. This is a read-only column

Table - import_tables_indexes

Information table that contains what indexes are available on the source system. The data in the table is not used by DBImport itself; instead, it helps users know what indexes exist in order to make an effective import.

Column Documentation
table_id Foreign Key to import_tables column ‘table_id’
hive_db Hive Database
hive_table Hive Table
index_name Name of the index
index_type The type of the index. This is the name from the source database vendor
index_unique Does the index have unique values or not?
column Column name
column_type Column type in source system
column_order The order of the column in the index.
column_is_nullable 1 = Column may contain null values, 0 = Nulls are not allowed in the column

Table - jdbc_connections

Database connection definitions
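A sketch of defining a connection. The alias and JDBC URL are examples; the credentials column should be set with the saveCredentialTool mentioned below, not by hand:

    INSERT INTO jdbc_connections (dbalias, jdbc_url)
    VALUES ('SALES_ORACLE', 'jdbc:oracle:thin:@//dbserver.example.com:1521/SALESDB');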

Column Documentation
dbalias Name of the Database connection
private_key_path <NOT USED>
public_key_path <NOT USED>
jdbc_url The JDBC URL String
credentials Encrypted fields for credentials. Changed by the saveCredentialTool
datalake_source This value will come in the dbimport_source column if present. Priority is table, connection
max_import_sessions You can limit the number of parallel sessions during import with this value. If NULL, then Max will come from the configuration file
force_string If set to 1, all character based fields (char, varchar) will become string in Hive
create_datalake_import If set to 1, the datalake_import column will be created on all tables that use this dbalias
timewindow_start Start of the time window when we are allowed to run against this connection.
timewindow_stop End of the time window when we are allowed to run against this connection.
operator_notes Free text field to write a note about the connection
create_foreign_keys 1 = Create foreign keys, 0 = Don't create foreign keys
contact_info Contact information. Used by Atlas integration
description Description. Used by Atlas integration
owner Owner of system and/or data. Used by Atlas integration
atlas_discovery 1 = Discover tables/views on this connection, 0 = Don't use Atlas discovery on this connection
atlas_include_filter Include filter for Atlas discovery
atlas_exclude_filter Exclude filter for Atlas discovery
atlas_last_discovery  
environment Name of the Environment type
seed_file File that will be used to fetch the custom seed that will be used for anonymization functions on data from the connection

Table - jdbc_connections_drivers

Column Documentation
database_type Name of database type. Name is hardcoded into Python scripts, so use only defined values
version Free-text field with version. Has nothing to do with version of driver itself.
driver Java class for JDBC driver
classpath Full path to JDBC driver/jar file. If more than one file is required, separate them with : and no spaces

Table - jdbc_connections_environments

Environment types. Used in the jdbc_connections table to define what kind of connection type it is

Column Documentation
environment Name of the Environment type
description Description of the Environment type

Table - jdbc_table_change_history

This table keeps track of all changes that were done to a JDBC table after the initial creation. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table

Column Documentation
dbalias Database connection name
db_name Database name
schema_name Schema Name
table_name Table Name
column_name Column Name
eventtime Time and date of the event
event Event name.
previous_value Value before the change
value Value after the change
description Detailed description of the event

Table - json_to_rest

Temporary storage of JSON payloads that will be sent to a REST interface if the tool is configured to do so.

Column Documentation
id Unique Identifier
type The type of JSON data that is saved in the ‘jsondata’ column
create_time Time when the JSON data was created
status Internal status to keep track of the status of the transmission
jsondata The payload to send

Table - table_change_history

This table keeps track of all changes that were done to an object after the initial load. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table

Column Documentation
hive_db Hive Database
hive_table Hive Table
column_name Column Name
eventtime Time and date of the event
event Event name.
previous_value Value before the change
value Value after the change
description Detailed description of the event