Database Tables
As this version of DBImport doesn't include the web admin tool, the documentation covers each column in the configuration database tables. The admin tool will later use the same fields, so what is said here will also apply to the admin tool.
Table - airflow_custom_dags
It's possible to construct a DAG that has no import, export or ETL definitions in it, but instead only Tasks from the airflow_tasks table. That might be useful, for example, to run custom Hive code after an import is completed as a separate DAG. Defining a DAG here also requires you to have at least one task in airflow_tasks defined "in main".
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
retries | How many retries a Task should do in Airflow before it fails
operator_notes | Free text field to write a note about the custom DAG.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
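As a sketch, a custom DAG could be registered with an insert like the following. All values (DAG name, schedule, tags) are illustrative assumptions, not a tested configuration:

```sql
-- Hypothetical example: define a custom DAG that runs daily at 06:00
-- in the Europe/Stockholm timezone and retries failed tasks twice.
INSERT INTO airflow_custom_dags
    (dag_name, schedule_interval, retries, timezone, tags, auto_regenerate_dag)
VALUES
    ('custom_hive_maintenance', '0 6 * * *', 2, 'Europe/Stockholm', 'maintenance,hive', 1);
```

Remember that such a DAG only becomes useful once at least one matching task exists in airflow_tasks with placement "in main".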
Table - airflow_etl_dags
To create a DAG in Airflow for only ETL jobs, this table holds all definitions of the DAG configuration, including the filter that defines which ETL jobs to run, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_job | Filter string for JOB in the etl_jobs table
filter_task | Filter string for TASK in the etl_jobs table
filter_source_db | Filter string for SOURCE_DB in the etl_jobs table
filter_target_db | Filter string for TARGET_DB in the etl_jobs table
retries | How many retries a Task should do in Airflow before it fails
trigger_dag_on_success | <NOT USED>
operator_notes | Free text field to write a note about the ETL job.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
Table - airflow_export_dags
To create a DAG in Airflow for exports, this table holds all definitions of the DAG configuration, including the filter that defines which tables to export, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_dbalias | Filter string for DBALIAS in export_tables
filter_target_schema | Filter string for TARGET_SCHEMA in export_tables
filter_target_table | Filter string for TARGET_TABLE in export_tables
use_python_dbimport | Legacy use only. Always set this to 1
retries | How many retries a Task should do in Airflow before it fails
trigger_dag_on_success | Name of DAG to trigger if the export is successful. Comma separated list of DAGs (optional)
operator_notes | Free text field to write a note about the export.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
Table - airflow_import_dags
To create a DAG in Airflow for imports, this table holds all definitions of the DAG configuration, including the filter that defines which tables to import, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_hive | Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
use_python_dbimport | Legacy use only. Always set this to 1
finish_all_stage1_first | 1 = All Import phase jobs will be completed first, and when all are successful, the ETL phase starts
retries | How many retries a Task should do in Airflow before it fails
retries_stage1 | Specific number of retries for the Import phase
retries_stage2 | Specific number of retries for the ETL phase
pool_stage1 | Airflow pool used for stage1 tasks. NULL for the default Hostname pool
pool_stage2 | Airflow pool used for stage2 tasks. NULL for the default DAG pool
operator_notes | Free text field to write a note about the import.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_table_discovery | <NOT USED>
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
run_import_and_etl_separate | 1 = The Import and ETL phases will run as separate Tasks.
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
metadata_import | 1 = Run only getSchema and getRowCount, 0 = Run a normal import
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
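A minimal import DAG definition might look like the following sketch. The DAG name, schedule and filter are assumptions for illustration; filter_hive follows the wildcard syntax documented above:

```sql
-- Hypothetical example: a nightly DAG that imports every table
-- in the Hive database sales_db, completing all Import phase
-- jobs before any ETL phase job starts.
INSERT INTO airflow_import_dags
    (dag_name, schedule_interval, filter_hive, use_python_dbimport,
     finish_all_stage1_first, retries)
VALUES
    ('import_sales_daily', '0 2 * * *', 'sales_db.*', 1, 1, 3);
```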
Table - airflow_tasks
All DAGs can be customized by adding Tasks to the DAG. Depending on the placement and type of the Tasks that are created, DBImport will add custom placeholders to keep the DAG separated into three different parts: Before Main, In Main and After Main. In Main is where all regular import, export or ETL jobs are executed. If you want to execute something before these, you place it in Before Main, and if you want to execute something after, you place it in After Main. Please check the Airflow Integration part of the documentation for more examples and a better understanding of the data you can put into this table.
Column | Documentation
---|---
dag_name | Name of the DAG to add Tasks to
task_name | Name of the Task in Airflow
task_type | The type of the Task
placement | Placement of the Task
jdbc_dbalias | For the ‘JDBC SQL’ Task Type, this specifies which database the SQL should run against
hive_db | <NOT USED>
airflow_pool | Airflow Pool to use.
airflow_priority | Airflow Priority. Higher number, higher priority
include_in_airflow | Enable or disable the Task in the DAG during creation of the DAG file.
task_dependency_upstream | Defines the upstream dependencies for the Task. Comma separated list
task_dependency_downstream | Defines the downstream dependencies for the Task. Comma separated list
task_config | The configuration for the Task. Depends on the Task type.
sensor_connection | Name of the Connection in Airflow
sensor_poke_interval | Poke interval for sensors, in seconds
sensor_timeout_minutes | Timeout for sensors, in minutes
sensor_soft_fail | Setting this to 1 will add soft_fail=True on the sensor
sudo_user | The task will use this user for sudo instead of the default
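As a sketch, a custom Task placed before the regular jobs of a DAG could be added like this. The task_type value and the task_config contents are assumptions here; consult the Airflow Integration documentation for the task types your version supports:

```sql
-- Hypothetical example: run a SQL statement against a JDBC connection
-- before the regular imports of the DAG start ("before main" placement).
INSERT INTO airflow_tasks
    (dag_name, task_name, task_type, placement,
     jdbc_dbalias, task_config, include_in_airflow)
VALUES
    ('import_sales_daily', 'prepare_staging', 'JDBC SQL', 'before main',
     'dwh_oracle', 'TRUNCATE TABLE STAGING.WORK_AREA', 1);
```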
Table - atlas_column_cache
Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.
Column | Documentation
---|---
hostname | Hostname of the database
port | Port of the database
database_name | Database name
schema_name | Database schema
table_name | Database table
column_name | Name of the column
column_type | Type of the column
column_length | Length of the column
column_is_nullable | Whether null values are allowed in the column
column_comment | Comment on the column
table_comment | Comment on the table
table_type | Table type.
table_create_time | Timestamp for when the table was created
default_value | Default value of the column
Table - atlas_key_cache
Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.
Column | Documentation
---|---
hostname | Hostname of the database
port | Port of the database
database_name | Database name
schema_name | Database schema
table_name | Database table
constraint_name | Name of the constraint
constraint_type | Type of the constraint
column_name | Name of the column
reference_schema_name | Name of the schema that is referenced
reference_table_name | Name of the table that is referenced
reference_column_name | Name of the column that is referenced
col_key_position | Position of the key
Table - configuration
This is the DBImport global configuration table. In here you can configure items such as the name of the staging database, disable global executions, max number of mappers and much more.
Column | Documentation
---|---
configKey | Name of the configuration item. These are controlled and maintained by the setup tool. Don't change these manually
valueInt | Integer based value
valueStr | String based value
valueDate | Date based value
description | Description of the setting
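Since each configuration item populates only one of the three value columns, reading a setting is a matter of selecting the row and picking the non-NULL value. The configKey used below is a made-up example, not a documented key:

```sql
-- Hypothetical example: inspect one configuration item.
-- Which of valueInt/valueStr/valueDate is populated depends on the item.
SELECT configKey, valueInt, valueStr, valueDate, description
FROM configuration
WHERE configKey = 'import_staging_database';
```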
Table - copy_async_status
The status table for asynchronous copy between DBImport instances.
Column | Documentation
---|---
table_id | Reference to import_table.table_id
hive_db | Hive Database
hive_table | Hive Table to copy
destination | DBImport instances to copy the imported data to
copy_status | Status of the copy operation
last_status_update | Last time the server changed progress on this copy
failures | Number of failures in the current state
hdfs_source_path | HDFS path to copy from
hdfs_target_path | HDFS path to copy to
Table - copy_tables
When the copy phase starts, it will look in this table to determine whether it is going to copy its data, and to which DBImport instances.
Column | Documentation
---|---
copy_id | Auto incremented PrimaryKey of the table
hive_filter | Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
destination | DBImport instances to copy the imported data to
data_transfer | Synchronous will transfer the data as part of the Import. Asynchronous will transfer the data in a separate process, not as part of the Import
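Enabling copy for a set of tables is then a single row in this table. The filter and destination below are assumptions; destination should match the name of an instance defined in dbimport_instances:

```sql
-- Hypothetical example: copy everything imported into sales_db
-- to the instance named 'dr_cluster', outside the import itself.
INSERT INTO copy_tables (hive_filter, destination, data_transfer)
VALUES ('sales_db.*', 'dr_cluster', 'Asynchronous');
```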
Table - dbimport_instances
This table contains all DBImport instances that will receive data from this instance during the copy phase.
Column | Documentation
---|---
instance_id | Auto incremented PrimaryKey of the table
name | Name of the DBImport instance
db_hostname | MySQL hostname of the DBImport database
db_port | MySQL port of the DBImport database
db_database | MySQL database of the DBImport database
db_credentials | MySQL username and password for the DBImport database
hdfs_address | HDFS address. Example hdfs://hadoopcluster
hdfs_basedir | The base dir to write data to. Example /apps/dbimport
sync_credentials | 0 = Credentials won't be synced, 1 = The credentials information will be synced to the other cluster
Table - etl_jobs
Column | Documentation
---|---
job | 
task | 
job_id | 
etl_type | 
include_in_airflow | 
source_db | 
source_table | 
target_db | 
target_table | 
operator_notes | Free text field to write a note about the ETL job.
Table - export_columns
This table contains all columns that exist on all tables that we are exporting. Unlike the export_tables table, this one is created automatically by the export tool.
Column | Documentation
---|---
table_id | Foreign Key to export_tables column ‘table_id’
column_id | Unique identifier
column_name | Name of the column in the target table. Don't change this manually
column_type | Column type from Hive. Don't change this manually
column_order | The order of the columns. Don't change this manually
hive_db | Only used to make it easier to read the table. No real usage
hive_table | Only used to make it easier to read the table. No real usage
target_column_name | Override the name of the column in the target system
target_column_type | Override the column type in the target system
last_update_from_hive | Timestamp of the last schema update from Hive. Don't change this manually
last_export_time | Timestamp of the last export. Don't change this manually
selection | <NOT USED>
include_in_export | 1 = Include the column in the export, 0 = Exclude the column from the export
comment | The column comment from the source system. Don't change this manually
operator_notes | Free text field to write a note about the export.
Table - export_retries_log
Log of all retries that have happened.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
retry_time | Time when the retry was started
stage | The stage of the export that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
Table - export_stage
The export tool keeps track of how far into the export it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an export is executed it will skip the first part and continue from where it errored on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the "manage --clearExportStage" tool. Keep in mind that clearing the stage of an incremental export might result in loss of data.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
stage | Current stage of the export. This is the internal stage number
stage_description | Description of the stage
stage_time | The date and time when the export entered the stage
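Before clearing a stage, it can be worth checking where the export stopped. A sketch of such a lookup (the dbalias value is a placeholder):

```sql
-- Hypothetical example: see where a failed export stopped
-- before deciding whether to run manage --clearExportStage.
SELECT dbalias, target_schema, target_table,
       stage, stage_description, stage_time
FROM export_stage
WHERE dbalias = 'dwh_oracle';
```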
Table - export_stage_statistics
As DBImport progresses through the different stages of the export, it also keeps track of the start and stop time for each stage, together with the duration. That information is kept in this table.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
stage | Current stage of the export. This is the internal stage number
start | Time when the stage started
stop | Time when the stage was completed
duration | Duration of the stage
Table - export_statistics
At the end of each export, statistics about how long each part took, as well as general information about the Hive database and table, number of rows exported, size of the export and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large, it is up to each user to delete or truncate it as they see fit.
Column | Documentation
---|---
id | Auto incremented PrimaryKey of the table
dbalias | ID of the Database Connection
target_database | Name of the target database
target_schema | Name of the target schema
target_table | Name of the target table
hive_db | Hive Database
hive_table | Hive Table
export_phase | Export Phase method
incremental | 0 = Full export, 1 = Incremental export
rows | How many rows were exported
size | The total size in bytes that was exported
sessions | How many parallel sessions were used against the target (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
get_hive_tableschema_duration | 
get_hive_tableschema_start | 
get_hive_tableschema_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
create_temp_table_duration | 
create_temp_table_start | 
create_temp_table_stop | 
truncate_temp_table_duration | 
truncate_temp_table_start | 
truncate_temp_table_stop | 
fetch_maxvalue_start | 
fetch_maxvalue_stop | 
fetch_maxvalue_duration | 
insert_into_temp_table_duration | 
insert_into_temp_table_start | 
insert_into_temp_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
validate_duration | 
validate_start | 
validate_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
update_target_table_duration | 
update_target_table_start | 
update_target_table_stop | 
spark_duration | 
spark_start | 
spark_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - export_statistics_last
The last entry in the export_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the export_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.
Column | Documentation
---|---
dbalias | ID of the Database Connection
target_database | Name of the target database
target_schema | Name of the target schema
target_table | Name of the target table
hive_db | Hive Database
hive_table | Hive Table
export_phase | Export Phase method
incremental | 0 = Full export, 1 = Incremental export
rows | How many rows were exported
size | The total size in bytes that was exported
sessions | How many parallel sessions were used against the target (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
get_hive_tableschema_duration | 
get_hive_tableschema_start | 
get_hive_tableschema_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
create_temp_table_duration | 
create_temp_table_start | 
create_temp_table_stop | 
truncate_temp_table_duration | 
truncate_temp_table_start | 
truncate_temp_table_stop | 
fetch_maxvalue_start | 
fetch_maxvalue_stop | 
fetch_maxvalue_duration | 
insert_into_temp_table_duration | 
insert_into_temp_table_start | 
insert_into_temp_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
validate_duration | 
validate_start | 
validate_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
update_target_table_duration | 
update_target_table_start | 
update_target_table_stop | 
spark_duration | 
spark_start | 
spark_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - export_tables
Main table where all tables that we can export are stored.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
table_id | Unique identifier of the table
hive_db | Name of the Hive Database that we export from
hive_table | Name of the Hive Table that we export from
export_type | What export method to use. Only full and incr are supported.
export_tool | What tool should be used for exporting data
last_update_from_hive | Timestamp of the last schema update from Hive
sql_where_addition | Will be added AFTER the SQL WHERE. If it's an incr export, this will be after the incr limit statements. Example “orderId > 1000”
include_in_airflow | Will the table be included in the Airflow DAG when it matches the DAG selection
notUsed01 | <NOT USED>
forceCreateTempTable | Force the export to create a Hive table and export that instead. Useful when exporting views
notUsed02 | <NOT USED>
validate_export | 1 = Validate the export once it's done. 0 = Disable validation
validationMethod | Validation method to use
validationCustomQueryHiveSQL | Custom SQL query for the Hive table
validationCustomQueryTargetSQL | Custom SQL query for the target table
uppercase_columns | -1 = auto (Oracle = uppercase, other databases = lowercase)
truncate_target | 1 = Truncate the target table before we export the data. Not used by incremental exports
mappers | -1 = auto, 0 = invalid. Auto updated by ‘export_main.sh’
hive_rowcount | Number of rows in the Hive table. Don't change manually
target_rowcount | Number of rows in the Target table. Don't change manually
validationCustomQueryHiveValue | Used for validation. Don't change manually
validationCustomQueryTargetValue | Used for validation. Don't change manually
incr_column | The column in the Hive table that will be used to identify new rows for the incremental export. Must be a timestamp column
incr_validation_method | full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)
incr_minvalue | Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue | Used by incremental exports to keep track of progress. Don't change manually
incr_minvalue_pending | Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue_pending | Used by incremental exports to keep track of progress. Don't change manually
sqoop_options | Sqoop options to use during the export.
sqoop_last_size | Used to track the sqoop operation. Don't change manually
sqoop_last_rows | Used to track the sqoop operation. Don't change manually
sqoop_last_mappers | Used to track the sqoop operation. Don't change manually
sqoop_last_execution | Used to track the sqoop operation. Don't change manually
create_target_table_sql | SQL statement that was used to create the target table. Don't change manually
operator_notes | Free text field to write a note about the export.
hive_javaheap | Heap size for Hive
airflow_priority | This will set priority_weight in Airflow
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
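Putting the columns above together, an incremental export definition could be sketched like this. Connection name, schemas, the incr_column and the export_tool value are all illustrative assumptions:

```sql
-- Hypothetical example: incremental export of the Hive table
-- sales_db.orders to ORDERS in the DWH schema on 'dwh_oracle',
-- driven by a timestamp column and validated after each run.
INSERT INTO export_tables
    (dbalias, target_schema, target_table, hive_db, hive_table,
     export_type, incr_column, validate_export)
VALUES
    ('dwh_oracle', 'DWH', 'ORDERS', 'sales_db', 'orders',
     'incr', 'last_modified_ts', 1);
```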
Table - import_columns
This table contains all columns that exist on all tables that we are importing. Unlike the import_tables table, this one is created automatically by the ‘Get Source TableSchema’ stage.
Column | Documentation
---|---
table_id | Foreign Key to import_tables column ‘table_id’
column_id | Unique identifier of the column
column_order | In what order the column exists in the source system.
column_name | Name of the column in Hive. Don't change this manually
hive_db | Hive Database
hive_table | Hive Table
source_column_name | Name of the column in the source system. Don't change this manually
column_type | Column type in Hive. Don't change this manually
source_column_type | Column type in the source system. Don't change this manually
source_database_type | The database type the column was imported from
column_name_override | Set a custom name for the column in Hive
column_type_override | Set a custom column type in Hive
sqoop_column_type | Used to create a correct --map-column-java setting for sqoop.
force_string | If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in the import_tables and jdbc_connections tables
include_in_import | 1 = Include the column in the import, 0 = Exclude the column from the import
source_primary_key | Number starting from 1 listing the order of the column in the PK. Don't change this manually
last_update_from_source | Timestamp of the last schema update from the source
comment | The column comment from the source system
operator_notes | Free text field to write a note about the column
sqoop_column_type_override | Set the --map-column-java field to a fixed value instead of one calculated by DBImport
anonymization_function | What anonymization function should be used with the data in this column
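The override columns are the intended way to adjust what DBImport discovered. A sketch of renaming a sensitive column in Hive and attaching an anonymization function to it (table, column and function names are assumptions; check which anonymization functions your DBImport version provides):

```sql
-- Hypothetical example: rename a discovered column in Hive and
-- anonymize its contents on import.
UPDATE import_columns
SET column_name_override = 'customer_ssn_hash',
    anonymization_function = 'hash'
WHERE hive_db = 'sales_db'
  AND hive_table = 'customers'
  AND source_column_name = 'SSN';
```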
Table - import_failure_log
If there is an error or a warning during an import, but the import still continues, the errors are logged in this table. An example could be that a column can't be altered, a foreign key is not created, no new columns can be added, and so on.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
eventtime | Time when the error/warning occurred
severity | The severity of the event.
import_type | The import method used
error_text | Text describing the failure
Table - import_foreign_keys
All foreign key definitions are saved in this table. The information in this table is recreated all the time, so no manual changes are allowed here. For a better understanding of this table, please use the view called import_foreign_keys_view instead.
Column | Documentation
---|---
table_id | Table ID in import_tables that has the FK
column_id | Column ID in import_columns that has the FK
fk_index | Index of the FK
fk_table_id | Table ID in import_tables that the table has a reference against
fk_column_id | Column ID in import_columns that the table has a reference against
key_position | Position of the key
Table - import_retries_log
Log of all retries that have happened.
Column | Documentation
---|---
hive_db | Hive DB
hive_table | Hive Table
retry_time | Time when the retry was started
stage | The stage of the import that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
Table - import_stage
The import tool keeps track of how far into the import it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an import is executed it will skip the first part and continue from where it errored on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the "manage --clearImportStage" tool. Keep in mind that clearing the stage of an incremental import might result in loss of data.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
stage | Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
stage_time | The date and time when the import entered the stage
Table - import_stage_statistics
As DBImport progresses through the different stages of the import, it also keeps track of the start and stop time for each stage, together with the duration. That information is kept in this table.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
stage | Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
start | Time when the stage started
stop | Time when the stage was completed
duration | Duration of the stage
Table - import_statistics
At the end of each import, statistics about how long each part took, as well as general information about the Hive database and table, number of rows imported, size of the import and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large, it is up to each user to delete or truncate it as they see fit.
Column | Documentation
---|---
id | Auto incremented PrimaryKey of the table
hive_db | Hive Database
hive_table | Hive Table
importtype | What kind of import type was used
import_phase | Import Phase method
copy_phase | Copy Phase method
etl_phase | ETL Phase method
incremental | 0 = Full import, 1 = Incremental import
dbalias | ID of the Database Connection
source_database | Name of the source database
source_schema | Name of the source schema
source_table | Name of the source table
rows | How many rows were imported
size | The total size in bytes that was imported
sessions | How many parallel sessions were used against the source (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
spark_duration | 
spark_start | 
spark_stop | 
spark_etl_duration | 
spark_etl_start | 
spark_etl_stop | 
clear_hive_locks_duration | 
clear_hive_locks_start | 
clear_hive_locks_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
connect_to_hive_duration | 
connect_to_hive_start | 
connect_to_hive_stop | 
create_import_table_duration | 
create_import_table_start | 
create_import_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
get_import_rowcount_duration | 
get_import_rowcount_start | 
get_import_rowcount_stop | 
get_source_rowcount_duration | 
get_source_rowcount_start | 
get_source_rowcount_stop | 
get_source_tableschema_duration | 
get_source_tableschema_start | 
get_source_tableschema_stop | 
get_target_rowcount_duration | 
get_target_rowcount_start | 
get_target_rowcount_stop | 
hive_import_duration | 
hive_import_start | 
hive_import_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
merge_table_duration | 
merge_table_start | 
merge_table_stop | 
create_history_table_duration | 
create_history_table_start | 
create_history_table_stop | 
create_delete_table_duration | 
create_delete_table_start | 
create_delete_table_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
validate_import_table_duration | 
validate_import_table_start | 
validate_import_table_stop | 
validate_sqoop_import_duration | 
validate_sqoop_import_start | 
validate_sqoop_import_stop | 
validate_target_table_duration | 
validate_target_table_start | 
validate_target_table_stop | 
copy_data_duration | 
copy_data_start | 
copy_data_stop | 
copy_schema_duration | 
copy_schema_start | 
copy_schema_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - import_statistics_last
The last entry in the import_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the import_statistics table grows to many millions of rows, this saves a lot of CPU power on the database server.
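The saving described above can be illustrated with a sketch. It uses an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB) and models only a few of the documented columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("import_statistics", "import_statistics_last"):
    # "rows" is quoted because it collides with an SQL keyword in some engines.
    conn.execute(f'CREATE TABLE {t} (hive_db TEXT, hive_table TEXT, "rows" INTEGER, stop TIMESTAMP)')

# import_statistics keeps every run; import_statistics_last only the latest one.
conn.executemany("INSERT INTO import_statistics VALUES (?, ?, ?, ?)",
                 [("sales", "orders", 100, "2024-01-01"),
                  ("sales", "orders", 120, "2024-02-01")])
conn.execute("INSERT INTO import_statistics_last VALUES ('sales', 'orders', 120, '2024-02-01')")

# Without the helper table: sort the full history to find the latest entry.
slow = conn.execute('SELECT "rows" FROM import_statistics '
                    "WHERE hive_db = 'sales' AND hive_table = 'orders' "
                    "ORDER BY stop DESC LIMIT 1").fetchone()

# With import_statistics_last: a single-row lookup, no sorting or grouping needed.
fast = conn.execute('SELECT "rows" FROM import_statistics_last '
                    "WHERE hive_db = 'sales' AND hive_table = 'orders'").fetchone()
```

Both queries return the same latest run, but the second one never has to scan the growing history table.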
Column |
Documentation |
---|---|
hive_db |
Hive Database |
hive_table |
Hive Table |
importtype |
What kind of import type was used |
import_phase |
Import Phase method |
copy_phase |
Copy Phase method |
etl_phase |
ETL Phase method |
incremental |
0 = Full import, 1 = Incremental import |
dbalias |
ID of the Database Connection |
source_database |
Name of the source database |
source_schema |
Name of the source schema |
source_table |
Name of the source table |
rows |
Number of rows that were imported |
size |
The total size in bytes that was imported |
sessions |
Number of parallel sessions used against the source (sqoop mappers) |
duration |
Total duration in seconds |
start |
Timestamp of start |
stop |
Timestamp of stop |
sqoop_duration |
|
sqoop_start |
|
sqoop_stop |
|
spark_duration |
|
spark_start |
|
spark_stop |
|
spark_etl_duration |
|
spark_etl_start |
|
spark_etl_stop |
|
clear_hive_locks_duration |
|
clear_hive_locks_start |
|
clear_hive_locks_stop |
|
clear_table_rowcount_duration |
|
clear_table_rowcount_start |
|
clear_table_rowcount_stop |
|
connect_to_hive_duration |
|
connect_to_hive_start |
|
connect_to_hive_stop |
|
create_import_table_duration |
|
create_import_table_start |
|
create_import_table_stop |
|
create_target_table_duration |
|
create_target_table_start |
|
create_target_table_stop |
|
get_import_rowcount_duration |
|
get_import_rowcount_start |
|
get_import_rowcount_stop |
|
get_source_rowcount_duration |
|
get_source_rowcount_start |
|
get_source_rowcount_stop |
|
get_source_tableschema_duration |
|
get_source_tableschema_start |
|
get_source_tableschema_stop |
|
get_target_rowcount_duration |
|
get_target_rowcount_start |
|
get_target_rowcount_stop |
|
hive_import_duration |
|
hive_import_start |
|
hive_import_stop |
|
truncate_target_table_duration |
|
truncate_target_table_start |
|
truncate_target_table_stop |
|
merge_table_duration |
|
merge_table_start |
|
merge_table_stop |
|
create_history_table_duration |
|
create_history_table_start |
|
create_history_table_stop |
|
create_delete_table_duration |
|
create_delete_table_start |
|
create_delete_table_stop |
|
update_statistics_duration |
|
update_statistics_start |
|
update_statistics_stop |
|
validate_import_table_duration |
|
validate_import_table_start |
|
validate_import_table_stop |
|
validate_sqoop_import_duration |
|
validate_sqoop_import_start |
|
validate_sqoop_import_stop |
|
validate_target_table_duration |
|
validate_target_table_start |
|
validate_target_table_stop |
|
copy_data_duration |
|
copy_data_start |
|
copy_data_stop |
|
copy_schema_duration |
|
copy_schema_start |
|
copy_schema_stop |
|
atlas_schema_duration |
|
atlas_schema_start |
|
atlas_schema_stop |
Table - import_tables
Main table where all tables that can be imported are defined.
Column |
Documentation |
---|---|
hive_db |
Hive Database to import to |
hive_table |
Hive Table to import to |
table_id |
Unique identifier |
dbalias |
Name of database connection from jdbc_connections table |
source_schema |
Name of the schema in the remote database |
source_table |
Name of the table in the remote database |
import_phase_type |
What method to use for the import phase |
etl_phase_type |
What method to use for ETL phase |
import_type |
What import method to use |
import_tool |
What tool should be used for importing data |
etl_engine |
What engine will be used to process etl stage |
last_update_from_source |
Timestamp of last schema update from source |
sqoop_sql_where_addition |
Will be added after the SQL WHERE clause. If it’s an incremental import, this will be placed after the incremental limit statements. Example: “orderId > 1000” |
nomerge_ingestion_sql_addition |
This will be added to the data ingestion of non-merge imports (full, full_direct and incr). Useful for filtering out data from import tables to target tables |
include_in_airflow |
Will the table be included in Airflow DAG when it matches the DAG selection |
airflow_priority |
This will set priority_weight in Airflow |
validate_import |
Should the import be validated |
validationMethod |
Validation method to use |
validate_source |
query = Run a ‘select count(*) from …’ to get the number of rows in the source table. sqoop = Use the number of rows imported by sqoop as the number of rows in the source table |
validate_diff_allowed |
-1 = auto-calculated allowed diff. If a positive number, this is the number of rows that the diff is allowed to have |
validationCustomQuerySourceSQL |
Custom SQL query for source table |
validationCustomQueryHiveSQL |
Custom SQL query for Hive table |
validationCustomQueryValidateImportTable |
1 = Validate the import table, 0 = Don’t validate the import table |
truncate_hive |
<NOT USED> |
mappers |
-1 = auto, or a positive number for a fixed number of mappers. If auto, it is calculated based on the last sqoop import size |
soft_delete_during_merge |
If 1, then the row will be marked as deleted instead of actually being removed from the table. Only used for Merge imports |
source_rowcount |
Used for validation. Don’t change manually |
source_rowcount_incr |
|
hive_rowcount |
Used for validation. Don’t change manually |
validationCustomQuerySourceValue |
Used for validation. Don’t change manually |
validationCustomQueryHiveValue |
Used for validation. Don’t change manually |
incr_mode |
Incremental import mode |
incr_column |
What column to use to identify new rows |
incr_validation_method |
full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between the min and max values (the data that sqoop just wrote) |
incr_minvalue |
Used for incremental imports. Don’t change manually |
incr_maxvalue |
Used for incremental imports. Don’t change manually |
incr_minvalue_pending |
Used for incremental imports. Don’t change manually |
incr_maxvalue_pending |
Used for incremental imports. Don’t change manually |
pk_column_override |
Force the import and Hive table to define another PrimaryKey constraint. Comma-separated list of columns |
pk_column_override_mergeonly |
Force the import to use another PrimaryKey constraint during Merge operations. Comma-separated list of columns |
hive_merge_heap |
Should be a multiple of Yarn container size. If NULL then it will use the default specified in Yarn and TEZ |
hive_split_count |
Sets tez.grouping.split-count in the Hive session |
spark_executor_memory |
Memory used by Spark when importing data. Overrides the default value in the global configuration |
spark_executors |
Number of Spark executors to use. Overrides default value in global configuration |
concatenate_hive_table |
<NOT USED> |
split_by_column |
Column to split by when doing import with multiple sessions |
sqoop_query |
Use a custom query in sqoop to read data from source table |
sqoop_options |
Options to send to sqoop. Most commonly used for the --split-by option |
sqoop_last_size |
Used to track sqoop operation. Don’t change manually |
sqoop_last_rows |
Used to track sqoop operation. Don’t change manually |
sqoop_last_mappers |
Used to track sqoop operation. Don’t change manually |
sqoop_last_execution |
Used to track sqoop operation. Don’t change manually |
sqoop_use_generated_sql |
1 = Use the generated SQL that is saved in the generated_sqoop_query column |
sqoop_allow_text_splitter |
Allow splits on text columns. Use with caution |
force_string |
If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in jdbc_connections table |
comment |
Table comment from the source system. Don’t change manually |
generated_hive_column_definition |
Generated column definition for the Hive create table. Don’t change manually |
generated_sqoop_query |
Generated query for sqoop. Don’t change manually |
generated_sqoop_options |
Generated options for sqoop. Don’t change manually |
generated_pk_columns |
Generated Primary Keys. Don’t change manually |
generated_foreign_keys |
<NOT USED> |
datalake_source |
This value will come in the dbimport_source column if present. Overrides the same setting in jdbc_connections table |
operator_notes |
Free text field to write a note about the import. |
copy_finished |
Time when the last copy from the Master DBImport instance was completed. Don’t change manually |
copy_slave |
Defines if this table is a Master table or a Slave table. Don’t change manually |
create_foreign_keys |
-1 (default) = Get information from jdbc_connections table |
airflow_notes |
Documentation that will be available in Airflow. Markdown syntax |
custom_max_query |
You can use a custom SQL query that will get the Max value from the source database. This Max value will be used in an incremental import to know how much to read in each execution |
mergeCompactionMethod |
Compaction method to use after import using merge is completed. Default means a major compaction if it is configured to do so in the configuration table |
sourceTableType |
Type of table on the source system. This is a read-only column |
invalidate_impala |
-1 = Use default value from configuration table, otherwise 1 will invalidate the table in Impala and 0 will not |
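The incr_* columns above act as a watermark for incremental imports: incr_maxvalue records how far the previous execution read, and the next execution continues from there. A read-only sketch of inspecting that watermark, using an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB; only a few documented columns are modelled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE import_tables (
    hive_db TEXT, hive_table TEXT, incr_mode TEXT,
    incr_column TEXT, incr_minvalue TEXT, incr_maxvalue TEXT)""")
conn.execute("""INSERT INTO import_tables
                VALUES ('sales', 'orders', 'append', 'order_id', '1', '5000')""")

# List the incremental watermark per table. The next run reads source rows where
# order_id > 5000; these columns are maintained by DBImport, don't edit them manually.
watermarks = conn.execute("""SELECT hive_table, incr_column, incr_maxvalue
                             FROM import_tables
                             WHERE incr_column IS NOT NULL""").fetchall()
```

A query like this is useful when troubleshooting an incremental import that appears to be missing rows, before deciding whether a full re-import is needed.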
Table - import_tables_indexes
Information table that contains the indexes available on the source system. The data in this table is not used by DBImport itself; instead, it helps users know what indexes exist so they can configure an effective import.
Column |
Documentation |
---|---|
table_id |
Foreign Key to import_tables column ‘table_id’ |
hive_db |
Hive Database |
hive_table |
Hive Table |
index_name |
Name of the index |
index_type |
The type of the index. This is the name from the source database vendor |
index_unique |
Does the index have unique values or not? |
column |
Column name |
column_type |
Column type in source system |
column_order |
The order of the column in the index. |
column_is_nullable |
1 = Column may contain null values, 0 = Nulls are not allowed in the column |
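Since the data is informational only, a typical use is a manual query against it to pick a good split column before setting split_by_column in import_tables. Unique, non-nullable index columns usually split evenly across sessions. A sketch with an in-memory SQLite stand-in (column names follow this documentation; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "column" is quoted because it is an SQL keyword.
conn.execute('''CREATE TABLE import_tables_indexes (
    table_id INTEGER, hive_db TEXT, hive_table TEXT,
    index_name TEXT, index_type TEXT, index_unique INTEGER,
    "column" TEXT, column_type TEXT, column_order INTEGER, column_is_nullable INTEGER)''')
conn.executemany("INSERT INTO import_tables_indexes VALUES (?,?,?,?,?,?,?,?,?,?)", [
    (1, "sales", "orders", "PK_ORDERS", "CLUSTERED", 1, "order_id", "int", 1, 0),
    (1, "sales", "orders", "IX_CUSTOMER", "NONCLUSTERED", 0, "customer_id", "int", 1, 1)])

# Unique, non-nullable index columns are good candidates for split_by_column.
candidates = conn.execute('''SELECT index_name, "column" FROM import_tables_indexes
                             WHERE hive_db = 'sales' AND hive_table = 'orders'
                               AND index_unique = 1 AND column_is_nullable = 0
                             ORDER BY column_order''').fetchall()
```

Here only the primary key column qualifies, which matches the usual advice of splitting on the primary key when one exists.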
Table - jdbc_connections
Database connection definitions
Column |
Documentation |
---|---|
dbalias |
Name of the Database connection |
private_key_path |
<NOT USED> |
public_key_path |
<NOT USED> |
jdbc_url |
The JDBC URL String |
credentials |
Encrypted fields for credentials. Changed by the saveCredentialTool |
datalake_source |
This value will come in the dbimport_source column if present. Priority is table, connection |
max_import_sessions |
You can limit the number of parallel sessions during import with this value. If NULL, then Max will come from configuration file |
force_string |
If set to 1, all character based fields (char, varchar) will become string in Hive |
create_datalake_import |
If set to 1, the datalake_import column will be created on all tables that are using this dbalias |
timewindow_start |
Start of the time window when we are allowed to run against this connection. |
timewindow_stop |
End of the time window when we are allowed to run against this connection. |
timewindow_timezone |
Timezone used for timewindow_start and timewindow_stop columns. Use full text timezone, example Europe/Stockholm. |
operator_notes |
Free text field to write a note about the connection |
create_foreign_keys |
1 = Create foreign keys, 0 = Don’t create foreign keys |
contact_info |
Contact information. Used by Atlas integration |
description |
Description. Used by Atlas integration |
owner |
Owner of system and/or data. Used by Atlas integration |
atlas_discovery |
1 = Discover tables/views on this connection, 0 = Don’t use Atlas discovery on this connection |
atlas_include_filter |
Include filter for Atlas discovery |
atlas_exclude_filter |
Exclude filter for Atlas discovery |
atlas_last_discovery |
|
environment |
Name of the Environment type |
seed_file |
File that will be used to fetch the custom seed that will be used for anonymization functions on data from the connection |
Table - jdbc_connections_drivers
Column |
Documentation |
---|---|
database_type |
Name of database type. Name is hardcoded into Python scripts, so use only defined values |
version |
Free-text field with the version. Has nothing to do with the version of the driver itself. |
driver |
Java class for JDBC driver |
classpath |
Full path to JDBC driver/jar file. If more than one file is required, separate them with : and no spaces |
Table - jdbc_connections_environments
Environment types. Used in the jdbc_connections table to define what kind of connection type it is
Column |
Documentation |
---|---|
environment |
Name of the Environment type |
description |
Description of the Environment type |
Table - jdbc_table_change_history
This table keeps track of all changes that were done to a JDBC table after the initial creation. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table
Column |
Documentation |
---|---|
dbalias |
Database connection name |
db_name |
Database name |
schema_name |
Schema Name |
table_name |
Table Name |
column_name |
Column Name |
eventtime |
Time and date of the event |
event |
Event name. |
previous_value |
Value before the change |
value |
Value after the change |
description |
Detailed description of the event |
Table - json_to_send
Temporary storage of JSON payloads that will be sent to a REST interface if the tool is configured to do so.
Column |
Documentation |
---|---|
id |
Unique Identifier |
type |
The type of JSON data that is saved in the ‘jsondata’ column |
create_time |
Time when the JSON data was created |
status |
Internal status to keep track of the status of the transmission |
destination |
The destination to send the JSON to. |
jsondata |
The payload to send |
Table - table_change_history
This table keeps track of all changes that were done to an object after the initial load. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table
Column |
Documentation |
---|---|
hive_db |
Hive Database |
hive_table |
Hive Table |
column_name |
Column Name |
eventtime |
Time and date of the event |
event |
Event name. |
previous_value |
Value before the change |
value |
Value after the change |
description |
Detailed description of the event |
Table - yarn_statistics
Statistics about the Yarn applications that DBImport launches. Used for rough estimations of the resources used, and to connect a Yarn application ID to a DBImport job
Column |
Documentation |
---|---|
yarn_application_id |
Name of the application in Yarn |
operation_type |
What kind of DBImport job it is |
operation_tool |
The tool or operation that was used |
operation_timestamp |
The timestamp for the complete operation. Can be used to group together all applications for a specific operation, i.e. a complete import/export |
hive_db |
Name of the Hive DB. Used for imports |
hive_table |
Name of the Hive Table. Used for imports |
dbalias |
Name of the database alias. Used for exports |
target_schema |
Name of the target schema. Used for exports |
target_table |
Name for the target table. Used for exports |
application_start |
Timestamp when the operation started |
application_stop |
Timestamp when the operation was completed |
yarn_containers_total |
Number of containers used. For Spark and sqoop, this will be the maximum that was used during the operation |
yarn_containers_failed |
Number of containers that failed |
yarn_containers_killed |
Number of containers that were killed |
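The rough resource estimation this table supports can be done by aggregating the container counters per Hive table. A sketch with an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB; the application IDs and counts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE yarn_statistics (
    yarn_application_id TEXT, operation_type TEXT, hive_db TEXT, hive_table TEXT,
    yarn_containers_total INTEGER, yarn_containers_failed INTEGER,
    yarn_containers_killed INTEGER)""")
conn.executemany("INSERT INTO yarn_statistics VALUES (?,?,?,?,?,?,?)", [
    ("application_1700000000000_0001", "import", "sales", "orders", 8, 0, 0),
    ("application_1700000000000_0002", "import", "sales", "orders", 12, 1, 0)])

# Rough per-table resource usage across all recorded Yarn applications.
usage = conn.execute("""SELECT hive_db, hive_table,
                               SUM(yarn_containers_total) AS total_containers,
                               SUM(yarn_containers_failed) AS failed_containers
                        FROM yarn_statistics
                        GROUP BY hive_db, hive_table""").fetchall()
```

Grouping on operation_timestamp instead would give the per-operation view mentioned in the column documentation above.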