Database Tables

As this version of DBImport does not have the web admin tool available, the documentation is written against each column in the configuration database tables. The admin tool will later use the same fields, so what is said here will also apply to the admin tool.

Table - airflow_custom_dags

It is possible to construct a DAG that has no import, export or ETL definitions in it, but instead only Tasks from the airflow_tasks table. That might be useful, for example, to run custom Hive code after an import is completed as a separate DAG. Defining a DAG here also requires you to have at least one task in airflow_tasks defined “in main”. A minimal example follows the column list below.

Column

Documentation

dag_name

Name of Airflow DAG.

schedule_interval

Time to execute the DAG

retries

How many retries a Task should do in Airflow before it fails

operator_notes

Free text field to write a note about the custom DAG.

application_notes

Free text field that can be used for application documentation, notes or links.

auto_regenerate_dag

1 = The DAG will be auto regenerated by manage command

airflow_notes

Documentation that will be available in Airflow. Markdown syntax

sudo_user

All tasks in DAG will use this user for sudo instead of default

timezone

Timezone used for schedule_interval column. Use full text timezone, example Europe/Stockholm.

email

Email to send message to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses

email_on_failure

Send email on failures

email_on_retries

Send email on retries

tags

Comma separated list of Airflow tags that will be set on the DAG

sla_warning_time

Maximum time this DAG should run before Airflow triggers a SLA miss

retry_exponential_backoff

1 = Use the retry_exponential_backoff Airflow function, which makes the wait between retries of a failed task longer each time instead of a fixed time, 0 = Run with a fixed time of 5 min between the task retries

concurrency

Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in Airflow configuration
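
As an illustration only, here is a minimal sketch of defining such a DAG with plain SQL against the DBImport configuration database. The DAG name, schedule, connection alias and SQL text are hypothetical, the accepted strings for task_type and placement are described under the airflow_tasks table and in the Airflow Integration documentation, and all other columns keep their defaults.

    -- Hypothetical custom DAG that only runs tasks defined in airflow_tasks
    INSERT INTO airflow_custom_dags (dag_name, schedule_interval, retries, operator_notes)
    VALUES ('custom_hive_cleanup', '0 6 * * *', 2, 'Runs custom SQL after the nightly imports');

    -- The DAG needs at least one task placed "in main" (see the airflow_tasks table)
    INSERT INTO airflow_tasks (dag_name, task_name, task_type, placement, jdbc_dbalias, task_config)
    VALUES ('custom_hive_cleanup', 'cleanup_old_rows', 'JDBC SQL', 'in main',
            'DWH1', 'delete from cleanup_schema.old_rows where load_date < current_date - 30');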

Table - airflow_etl_dags

To create a DAG in Airflow for only ETL jobs, this is the table that holds all definitions of the DAG configuration, including the filter that defines what ETL jobs to run, schedules times, pool names and much more.

Column

Documentation

dag_name

Name of Airflow DAG.

schedule_interval

Time to execute the DAG

filter_job

Filter string for JOB in etl_jobs table

filter_task

Filter string for TASK in etl_jobs table

filter_source_db

Filter string for SOURCE_DB in etl_jobs table

filter_target_db

Filter string for TARGET_DB in etl_jobs table

retries

How many retries a Task should do in Airflow before it fails

trigger_dag_on_success

<NOT USED>

operator_notes

Free text field to write a note about the ETL job.

application_notes

Free text field that can be used for application documentation, notes or links.

auto_regenerate_dag

1 = The DAG will be auto regenerated by manage command

airflow_notes

Documentation that will be available in Airflow. Markdown syntax

sudo_user

All tasks in DAG will use this user for sudo instead of default

timezone

Timezone used for schedule_interval column. Use full text timezone, example Europe/Stockholm.

email

Email to send message to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses

email_on_failure

Send email on failures

email_on_retries

Send email on retries

tags

Comma separated list of Airflow tags that will be set on the DAG

sla_warning_time

Maximum time this DAG should run before Airflow triggers a SLA miss

retry_exponential_backoff

1 = Use the retry_exponential_backoff Airflow function, which makes the wait between retries of a failed task longer each time instead of a fixed time, 0 = Run with a fixed time of 5 min between the task retries

concurrency

Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in Airflow configuration

Table - airflow_export_dags

To create a DAG in Airflow for Exports, this is the table that holds all definitions of the DAG configuration, including the filter that defines what tables to export, schedules times, pool names and much more.

Column

Documentation

dag_name

Name of Airflow DAG.

schedule_interval

Time to execute the DAG

filter_dbalias

Filter string for DBALIAS in export_tables

filter_target_schema

Filter string for TARGET_SCHEMA in export_tables

filter_target_table

Filter string for TARGET_TABLE in export_tables

use_python_dbimport

Legacy use only. Always set this to 1

retries

How many retries a Task should do in Airflow before it fails

trigger_dag_on_success

Name of DAG to trigger if the export is successful. Comma separated list of DAGs (optional)

operator_notes

Free text field to write a note about the export.

application_notes

Free text field that can be used for application documentation, notes or links.

auto_regenerate_dag

1 = The DAG will be auto regenerated by manage command

airflow_notes

Documentation that will be available in Airflow. Markdown syntax

sudo_user

All tasks in DAG will use this user for sudo instead of default

timezone

Timezone used for schedule_interval column. Use full text timezone, example Europe/Stockholm.

email

Email to send message to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses

email_on_failure

Send email on failures

email_on_retries

Send email on retries

tags

Comma separated list of Airflow tags that will be set on the DAG

sla_warning_time

Maximum time this DAG should run before Airflow triggers a SLA miss

retry_exponential_backoff

1 = Use the retry_exponential_backoff Airflow function, which makes the wait between retries of a failed task longer each time instead of a fixed time, 0 = Run with a fixed time of 5 min between the task retries

concurrency

Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in Airflow configuration

Table - airflow_import_dags

To create a DAG in Airflow for Imports, this is the table that holds all definitions of the DAG configuration, including the filter that defines what tables to import, schedules times, pool names and much more.

Column

Documentation

dag_name

Name of Airflow DAG.

schedule_interval

Time to execute the DAG

filter_hive

Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE

use_python_dbimport

Legacy use only. Always set this to 1

finish_all_stage1_first

1 = All Import phase jobs will be completed first, and when all are successful, the ETL phase starts

retries

How many retries a Task should do in Airflow before it fails

retries_stage1

Specific retries number for Import Phase

retries_stage2

Specific retries number for ETL Phase

pool_stage1

Airflow pool used for stage1 tasks. NULL for the default Hostname pool

pool_stage2

Airflow pool used for stage2 tasks. NULL for the default DAG pool

operator_notes

Free text field to write a note about the import.

application_notes

Free text field that can be used for application documentation, notes or links.

auto_table_discovery

<NOT USED>

auto_regenerate_dag

1 = The DAG will be auto regenerated by manage command

run_import_and_etl_separate

1 = The Import and ETL phase will run in separate Tasks.

airflow_notes

Documentation that will be available in Airflow. Markdown syntax

sudo_user

All tasks in DAG will use this user for sudo instead of default

metadata_import

1 = Run only getSchema and getRowCount, 0 = Run a normal import

timezone

Timezone used for schedule_interval column. Use full text timezone, example Europe/Stockholm.

email

Email to send message to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses

email_on_failure

Send email on failures

email_on_retries

Send email on retries

tags

Comma separated list of Airflow tags that will be set on the DAG

sla_warning_time

Maximum time this DAG should run before Airflow triggers a SLA miss

retry_exponential_backoff

1 = Use the retry_exponential_backoff Airflow function, which makes the wait between retries of a failed task longer each time instead of a fixed time, 0 = Run with a fixed time of 5 min between the task retries

concurrency

Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in Airflow configuration
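
A hedged sketch of defining an import DAG directly in this table; the DAG name, schedule and Hive filter are made up and all other columns keep their defaults.

    -- Hypothetical import DAG covering every table in the Hive database sales_db
    INSERT INTO airflow_import_dags (dag_name, schedule_interval, filter_hive, finish_all_stage1_first, retries)
    VALUES ('import_sales_nightly', '0 2 * * *', 'sales_db.*', 1, 3);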

Table - airflow_tasks

All DAGs can be customized by adding Tasks to the DAG. Depending on the placement and type of the Tasks that are created, DBImport adds custom placeholders to keep the DAG separated into three different parts: Before Main, In Main and After Main. In Main is where all regular import, export or ETL jobs are executed. If you want to execute something before these, you place it in Before Main, and if you want to execute something after, you place it in After Main. Please check the Airflow Integration part of the documentation for more examples and a better understanding of the data you can put into this table. A short example follows the column list below.

Column

Documentation

dag_name

Name of DAG to add Tasks to

task_name

Name of the Task in Airflow

task_type

The type of the Task

placement

Placement for the Task

jdbc_dbalias

For ‘JDBC SQL’ Task Type, this specifies what database the SQL should run against

hive_db

<NOT USED>

airflow_pool

Airflow Pool to use.

airflow_priority

Airflow Priority. Higher number, higher priority

include_in_airflow

Enable or disable the Task in the DAG during creation of DAG file.

task_dependency_upstream

Defines the upstream dependency for the Task. Comma separated list

task_dependency_downstream

Defines the downstream dependency for the Task. Comma separated list

task_config

The configuration for the Task. Depends on what Task type it is.

sensor_connection

Name of Connection in Airflow

sensor_poke_interval

Poke interval for sensors in seconds

sensor_timeout_minutes

Timeout for sensors in minutes

sensor_soft_fail

Setting this to 1 will add soft_fail=True on sensor

sudo_user

The task will use this user for sudo instead of default
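
A sketch, with made-up names, of adding a custom task that runs before the regular jobs of an existing DAG; the exact strings accepted for task_type and placement are covered in the Airflow Integration documentation.

    -- Hypothetical task placed "before main" in an existing import DAG
    INSERT INTO airflow_tasks (dag_name, task_name, task_type, placement,
                               jdbc_dbalias, task_config, include_in_airflow)
    VALUES ('import_sales_nightly', 'prepare_source', 'JDBC SQL', 'before main',
            'SALES_SRC', 'update etl_control set import_started = now()', 1);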

Table - atlas_column_cache

Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.

Column

Documentation

hostname

Hostname for the database

port

Port for the database

database_name

Database name

schema_name

Database schema

table_name

Database table

column_name

Name of the column

column_type

Type of the column

column_length

Length of the column

column_is_nullable

Is null values allowed in the column

column_comment

Comment on the column

table_comment

Comment on the table

table_type

Table type.

table_create_time

Timestamp for when the table was created

default_value

Default value of the column

Table - atlas_key_cache

Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.

Column

Documentation

hostname

Hostname for the database

port

Port for the database

database_name

Database name

schema_name

Database schema

table_name

Database table

constraint_name

Name of the constraint

constraint_type

Type of the constraint

column_name

Name of the column

reference_schema_name

Name of the schema that is referenced

reference_table_name

Name of the table that is referenced

reference_column_name

Name of the column that is referenced

col_key_position

Position of the key

Table - configuration

This is the DBImport global configuration table. In here you can configure items such as the name of the staging database, disable global executions, max number of mappers and much more.

Column

Documentation

configKey

Name of the configuration item. These are controlled and maintained by the setup tool. Don’t change these manually

valueInt

Integer based Value

valueStr

String based Value

valueDate

Date based Value

description

Description on the setting
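
Since the configuration keys are created and maintained by the setup tool, the normal pattern is to read or adjust existing values rather than insert rows. A simple way to inspect the current settings:

    -- List all global configuration items and their current values
    SELECT configKey, valueInt, valueStr, valueDate, description
    FROM configuration
    ORDER BY configKey;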

Table - copy_async_status

The status table for asynchronous copy between DBImport instances.

Column

Documentation

table_id

Reference to import_table.table_id

hive_db

Hive Database

hive_table

Hive Table to copy

destination

DBImport instances to copy the imported data to

copy_status

Status of the copy operation

last_status_update

Last time the server changed progress on this copy

failures

Number of failures on current state

hdfs_source_path

HDFS path to copy from

hdfs_target_path

HDFS path to copy to

Table - copy_tables

When the copy phase starts, it will look in this table to determine whether it is going to copy its data and to which DBImport instances. An example follows the column list below.

Column

Documentation

copy_id

Auto Incremented PrimaryKey of the table

hive_filter

Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE

destination

DBImport instances to copy the imported data to

data_transfer

Synchronous will transfer the data as part of the Import. Asynchronous will transfer the data by a separate process and not part of the Import
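
A minimal sketch of enabling copy for a set of tables; the filter and destination are hypothetical, the destination must match a name in the dbimport_instances table (described next), and the accepted values for data_transfer follow the description above.

    -- Hypothetical copy definition: copy everything in sales_db to the 'dr_cluster' instance
    INSERT INTO copy_tables (hive_filter, destination, data_transfer)
    VALUES ('sales_db.*', 'dr_cluster', 'Asynchronous');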

Table - dbimport_instances

This table contains all DBImport instances that will receive data from this instance during the copy phase

Column

Documentation

instance_id

Auto Incremented PrimaryKey of the table

name

Name of the DBImport instance

db_hostname

MySQL Hostname to DBImport database

db_port

MySQL Port to DBImport database

db_database

MySQL Database to DBImport database

db_credentials

MySQL Username and Password to DBImport database

hdfs_address

HDFS address. Example hdfs://hadoopcluster

hdfs_basedir

The base dir to write data to. Example /apps/dbimport

sync_credentials

0 = Credentials won’t be synced, 1 = The credentials information will be synced to the other cluster

Table - etl_jobs

Column

Documentation

job

task

job_id

etl_type

include_in_airflow

source_db

source_table

target_db

target_table

operator_notes

Free text field to write a note about the ETL job.

Table - export_columns

This table contains all columns that exist on all tables that we are exporting. Unlike the export_tables table, this one gets created automatically by the export tool

Column

Documentation

table_id

Foreign Key to export_tables column ‘table_id’

column_id

Unique identifier

column_name

Name of column in target table. Don’t change this manually

column_type

Column type from Hive. Don’t change this manually

column_order

The order of the columns. Don’t change this manually

hive_db

Only used to make it easier to read the table. No real usage

hive_table

Only used to make it easier to read the table. No real usage

target_column_name

Override the name of column in the target system

target_column_type

Override the column type in the target system

last_update_from_hive

Timestamp of last schema update from Hive. Don’t change this manually

last_export_time

Timestamp of last export. Don’t change this manually

selection

<NOT USED>

include_in_export

1 = Include column in export, 0 = Exclude column in export

comment

The column comment from the source system. Don’t change this manually

operator_notes

Free text field to write a note about the column.
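
Since the rows in this table are created automatically, manual changes are normally limited to the override and include columns. A sketch with hypothetical table_id and column names:

    -- Rename a column in the target system and exclude another column from the export
    UPDATE export_columns SET target_column_name = 'customer_id'
    WHERE table_id = 12 AND column_name = 'cust_id';

    UPDATE export_columns SET include_in_export = 0
    WHERE table_id = 12 AND column_name = 'internal_notes';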

Table - export_retries_log

Log of all retries that have happened.

Column

Documentation

dbalias

Database connection name that we export to

target_schema

Schema on the target system

target_table

Table on the target system

retry_time

Time when the retry was started

stage

The stage of the export that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs

stage_description

Description of the stage

Table - export_stage

The export tool keeps track of how far in the export it has succeeded. So in case of an error, let’s say that Hive is not responding, the next time an export is executed it will skip the first part and continue from where it ended in error on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the “manage --clearExportStage” tool. Keep in mind that clearing the stage of an incremental export might result in the loss of data.

Column

Documentation

dbalias

Database connection name that we export to

target_schema

Schema on the target system

target_table

Table on the target system

stage

Current stage of the export. This is the internal stage number

stage_description

Description of the stage

stage_time

The date and time when the import entered the stage

Table - export_stage_statistics

As DBImport progresses through the different stages of the export, it also keeps track of start and stop time for each stage together with the duration. That information is kept in this table

Column

Documentation

dbalias

Database connection name that we export to

target_schema

Schema on the target system

target_table

Table on the target system

stage

Current stage of the export. This is the internal stage number

start

Time when stage started

stop

Time when stage was completed

duration

Duration of stage

Table - export_statistics

At the end of each export, all statistics about how long each part took, as well as general information about the Hive database and table, the number of rows exported, the size of the export and much more are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large for you, it’s up to each user to delete or truncate this table as you see fit.

Column

Documentation

id

Auto incremented PrimaryKey of the table

dbalias

ID of the Database Connection

target_database

Name of the target database

target_schema

Name of the target schema

target_table

Name of the target table

hive_db

Hive Database

hive_table

Hive Table

export_phase

Export Phase method

incremental

0 = Full export, 1 = Incremental export

rows

How many rows that were exported

size

The total size in bytes that was exported

sessions

How many parallel sessions were used against the source (sqoop mappers)

duration

Total duration in seconds

start

Timestamp of start

stop

Timestamp of stop

get_hive_tableschema_duration

get_hive_tableschema_start

get_hive_tableschema_stop

clear_table_rowcount_duration

clear_table_rowcount_start

clear_table_rowcount_stop

create_temp_table_duration

create_temp_table_start

create_temp_table_stop

truncate_temp_table_duration

truncate_temp_table_start

truncate_temp_table_stop

fetch_maxvalue_start

fetch_maxvalue_stop

fetch_maxvalue_duration

insert_into_temp_table_duration

insert_into_temp_table_start

insert_into_temp_table_stop

create_target_table_duration

create_target_table_start

create_target_table_stop

truncate_target_table_duration

truncate_target_table_start

truncate_target_table_stop

sqoop_duration

sqoop_start

sqoop_stop

validate_duration

validate_start

validate_stop

update_statistics_duration

update_statistics_start

update_statistics_stop

update_target_table_duration

update_target_table_start

update_target_table_stop

spark_duration

spark_start

spark_stop

atlas_schema_duration

atlas_schema_start

atlas_schema_stop

Table - export_statistics_last

The last entry in the export_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the export_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.

Column

Documentation

dbalias

ID of the Database Connection

target_database

Name of the target database

target_schema

Name of the target schema

target_table

Name of the target table

hive_db

Hive Database

hive_table

Hive Table

export_phase

Export Phase method

incremental

0 = Full export, 1 = Incremental export

rows

How many rows that were exported

size

The total size in bytes that was exported

sessions

How many parallel sessions were used against the source (sqoop mappers)

duration

Total duration in seconds

start

Timestamp of start

stop

Timestamp of stop

get_hive_tableschema_duration

get_hive_tableschema_start

get_hive_tableschema_stop

clear_table_rowcount_duration

clear_table_rowcount_start

clear_table_rowcount_stop

create_temp_table_duration

create_temp_table_start

create_temp_table_stop

truncate_temp_table_duration

truncate_temp_table_start

truncate_temp_table_stop

fetch_maxvalue_start

fetch_maxvalue_stop

fetch_maxvalue_duration

insert_into_temp_table_duration

insert_into_temp_table_start

insert_into_temp_table_stop

create_target_table_duration

create_target_table_start

create_target_table_stop

truncate_target_table_duration

truncate_target_table_start

truncate_target_table_stop

sqoop_duration

sqoop_start

sqoop_stop

validate_duration

validate_start

validate_stop

update_statistics_duration

update_statistics_start

update_statistics_stop

update_target_table_duration

update_target_table_start

update_target_table_stop

spark_duration

spark_start

spark_stop

atlas_schema_duration

atlas_schema_start

atlas_schema_stop

Table - export_tables

Main table where all tables that we can export are stored.

Column

Documentation

dbalias

Database connection name that we export to

target_schema

Schema on the target system

target_table

Table on the target system

table_id

Unique identifier of the table

hive_db

Name of Hive Database that we export from

hive_table

Name of Hive Table that we export from

export_type

What export method to use. Only full and incr are supported.

export_tool

What tool should be used for exporting data

last_update_from_hive

Timestamp of last schema update from Hive

sql_where_addition

Will be added AFTER the SQL WHERE. If it’s an incr export, this will be after the incr limit statements. Example “orderId > 1000”

include_in_airflow

Will the table be included in Airflow DAG when it matches the DAG selection

notUsed01

<NOT USED>

forceCreateTempTable

Force export to create a Hive table and export that instead. Useful when exporting views

notUsed02

<NOT USED>

validate_export

1 = Validate the export once it’s done. 0 = Disable validation

validationMethod

Validation method to use

validationCustomQueryHiveSQL

Custom SQL query for Hive table

validationCustomQueryTargetSQL

Custom SQL query for target table

uppercase_columns

-1 = auto (Oracle = uppercase, other databases = lowercase)

truncate_target

1 = Truncate the target table before we export the data. Not used by incremental exports

mappers

-1 = auto, 0 = invalid. Auto updated by ‘export_main.sh’

hive_rowcount

Number of rows in Hive table. Don’t change manually

target_rowcount

Number of rows in Target table. Don’t change manually

validationCustomQueryHiveValue

Used for validation. Don’t change manually

validationCustomQueryTargetValue

Used for validation. Don’t change manually

incr_column

The column in the Hive table that will be used to identify new rows for the incremental export. Must be a timestamp column

incr_validation_method

full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)

incr_minvalue

Used by incremental exports to keep track of progress. Don’t change manually

incr_maxvalue

Used by incremental exports to keep track of progress. Don’t change manually

incr_minvalue_pending

Used by incremental exports to keep track of progress. Don’t change manually

incr_maxvalue_pending

Used by incremental exports to keep track of progress. Don’t change manually

sqoop_options

Sqoop options to use during export.

sqoop_last_size

Used to track sqoop operation. Don’t change manually

sqoop_last_rows

Used to track sqoop operation. Don’t change manually

sqoop_last_mappers

Used to track sqoop operation. Don’t change manually

sqoop_last_execution

Used to track sqoop operation. Don’t change manually

create_target_table_sql

SQL statement that was used to create the target table. Don’t change manually

operator_notes

Free text field to write a note about the export.

hive_javaheap

Heap size for Hive

airflow_priority

This will set priority_weight in Airflow

airflow_notes

Documentation that will be available in Airflow. Markdown syntax
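
A hedged sketch of registering a new export; all names are hypothetical, the accepted strings for export_type and export_tool depend on your DBImport version, and columns not listed keep their defaults.

    -- Hypothetical full export of Hive table sales_db.customer to a target database connection
    INSERT INTO export_tables (dbalias, target_schema, target_table, hive_db, hive_table,
                               export_type, export_tool, validate_export)
    VALUES ('DWH1', 'dbo', 'customer', 'sales_db', 'customer', 'full', 'sqoop', 1);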

Table - import_columns

This table contains all columns that exist on all tables that we are importing. Unlike the import_tables table, this one gets created automatically by the ‘Get Source TableSchema’ stage.

Column

Documentation

table_id

Foreign Key to import_tables column ‘table_id’

column_id

Unique identifier of the column

column_order

The order in which the column exists in the source system.

column_name

Name of column in Hive. Don’t change this manually

hive_db

Hive Database

hive_table

Hive Table

source_column_name

Name of column in source system. Don’t change this manually

column_type

Column type in Hive. Don’t change this manually

source_column_type

Column type in source system. Don’t change this manually

source_database_type

What database type the column was imported from

column_name_override

Set a custom name of the column in Hive

column_type_override

Set a custom column type in Hive

sqoop_column_type

Used to create a correct --map-column-java setting for sqoop.

force_string

If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in import_tables and jdbc_connections table

include_in_import

1 = Include column in import, 0 = Exclude column in import

source_primary_key

Number starting from 1 listing the order of the column in the PK. Don’t change this manually

last_update_from_source

Timestamp of last schema update from source

comment

The column comment from the source system

operator_notes

Free text field to write a note about the column

sqoop_column_type_override

Set the --map-column-java field to a fixed value and not calculated by DBImport

anonymization_function

What anonymization function should be used with the data in this column
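
As the rows here are created by the ‘Get Source TableSchema’ stage, manual work is usually limited to the override columns. A sketch with hypothetical table_id and column names:

    -- Give a column a different name and type in Hive, and exclude another column from the import
    UPDATE import_columns SET column_name_override = 'order_created_ts', column_type_override = 'timestamp'
    WHERE table_id = 7 AND source_column_name = 'ORDER_CREATED';

    UPDATE import_columns SET include_in_import = 0
    WHERE table_id = 7 AND source_column_name = 'LEGACY_BLOB';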

Table - import_failure_log

If there is an error or a warning during import, but the import still continues, these errors are logged in this table. An example could be that a column can’t be altered, a foreign key was not created, no new columns can be added, and so on.

Column

Documentation

hive_db

Hive Database

hive_table

Hive Table

eventtime

Time when error/warning occurred

severity

The Severity of the event.

import_type

The import method used

error_text

Text describing the failure

Table - import_foreign_keys

All foreign key definitions are saved in this table. The information in this table is recreated all the time, so no manual changes are allowed here. For a better understanding of this table, please use the view called import_foreign_keys_view instead

Column

Documentation

table_id

Table ID in import_tables that have the FK

column_id

Column ID in import_columns that have the FK

fk_index

Index of FK

fk_table_id

Table ID in import_tables that the table is having a reference against

fk_column_id

Column ID in import_columns that the table is having a reference against

key_position

Position of the key

Table - import_retries_log

Log of all retries that have happened.

Column

Documentation

hive_db

Hive DB

hive_table

Hive Table

retry_time

Time when the retry was started

stage

The stage of the import that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs

stage_description

Description of the stage

Table - import_stage

The import tool keeps track of how far in the import it has succeeded. So in case of an error, let’s say that Hive is not responding, the next time an import is executed it will skip the first part and continue from where it ended in error on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the “manage --clearImportStage” tool. Keep in mind that clearing the stage of an incremental import might result in the loss of data.

Column

Documentation

hive_db

Hive Database

hive_table

Hive Table

stage

Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs

stage_description

Description of the stage

stage_time

The date and time when the import entered the stage

Table - import_stage_statistics

As DBImport progresses through the different stages of the import, it also keeps track of start and stop time for each stage together with the duration. That information is kept in this table

Column

Documentation

hive_db

Hive Database

hive_table

Hive Table

stage

Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs

start

Time when stage started

stop

Time when stage was completed

duration

Duration of stage

Table - import_statistics

At the end of each import, all statistics about how long each part took, as well as general information about the Hive database and table, the number of rows imported, the size of the import and much more are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large for you, it’s up to each user to delete or truncate this table as you see fit.

Column

Documentation

id

Auto incremented PrimaryKey of the table

hive_db

Hive Database

hive_table

Hive Table

importtype

What kind of import type that was used

import_phase

Import Phase method

copy_phase

Copy Phase method

etl_phase

ETL Phase method

incremental

0 = Full import, 1 = Incremental import

dbalias

ID of the Database Connection

source_database

Name of the source database

source_schema

Name of the source schema

source_table

Name of the source table

rows

How many rows that were imported

size

The total size in bytes that was imported

sessions

How many parallel sessions were used against the source (sqoop mappers)

duration

Total duration in seconds

start

Timestamp of start

stop

Timestamp of stop

sqoop_duration

sqoop_start

sqoop_stop

spark_duration

spark_start

spark_stop

spark_etl_duration

spark_etl_start

spark_etl_stop

clear_hive_locks_duration

clear_hive_locks_start

clear_hive_locks_stop

clear_table_rowcount_duration

clear_table_rowcount_start

clear_table_rowcount_stop

connect_to_hive_duration

connect_to_hive_start

connect_to_hive_stop

create_import_table_duration

create_import_table_start

create_import_table_stop

create_target_table_duration

create_target_table_start

create_target_table_stop

get_import_rowcount_duration

get_import_rowcount_start

get_import_rowcount_stop

get_source_rowcount_duration

get_source_rowcount_start

get_source_rowcount_stop

get_source_tableschema_duration

get_source_tableschema_start

get_source_tableschema_stop

get_target_rowcount_duration

get_target_rowcount_start

get_target_rowcount_stop

hive_import_duration

hive_import_start

hive_import_stop

truncate_target_table_duration

truncate_target_table_start

truncate_target_table_stop

merge_table_duration

merge_table_start

merge_table_stop

create_history_table_duration

create_history_table_start

create_history_table_stop

create_delete_table_duration

create_delete_table_start

create_delete_table_stop

update_statistics_duration

update_statistics_start

update_statistics_stop

validate_import_table_duration

validate_import_table_start

validate_import_table_stop

validate_sqoop_import_duration

validate_sqoop_import_start

validate_sqoop_import_stop

validate_target_table_duration

validate_target_table_start

validate_target_table_stop

copy_data_duration

copy_data_start

copy_data_stop

copy_schema_duration

copy_schema_start

copy_schema_stop

atlas_schema_duration

atlas_schema_start

atlas_schema_stop

Table - import_statistics_last

The last entry in the import_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the import_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.

Column

Documentation

hive_db

Hive Database

hive_table

Hive Table

importtype

What kind of import type that was used

import_phase

Import Phase method

copy_phase

Copy Phase method

etl_phase

ETL Phase method

incremental

0 = Full import, 1 = Incremental import

dbalias

ID of the Database Connection

source_database

Name of the source database

source_schema

Name of the source schema

source_table

Name of the source table

rows

How many rows that were imported

size

The total size in bytes that was imported

sessions

How many parallel sessions were used against the source (sqoop mappers)

duration

Total duration in seconds

start

Timestamp of start

stop

Timestamp of stop

sqoop_duration

sqoop_start

sqoop_stop

spark_duration

spark_start

spark_stop

spark_etl_duration

spark_etl_start

spark_etl_stop

clear_hive_locks_duration

clear_hive_locks_start

clear_hive_locks_stop

clear_table_rowcount_duration

clear_table_rowcount_start

clear_table_rowcount_stop

connect_to_hive_duration

connect_to_hive_start

connect_to_hive_stop

create_import_table_duration

create_import_table_start

create_import_table_stop

create_target_table_duration

create_target_table_start

create_target_table_stop

get_import_rowcount_duration

get_import_rowcount_start

get_import_rowcount_stop

get_source_rowcount_duration

get_source_rowcount_start

get_source_rowcount_stop

get_source_tableschema_duration

get_source_tableschema_start

get_source_tableschema_stop

get_target_rowcount_duration

get_target_rowcount_start

get_target_rowcount_stop

hive_import_duration

hive_import_start

hive_import_stop

truncate_target_table_duration

truncate_target_table_start

truncate_target_table_stop

merge_table_duration

merge_table_start

merge_table_stop

create_history_table_duration

create_history_table_start

create_history_table_stop

create_delete_table_duration

create_delete_table_start

create_delete_table_stop

update_statistics_duration

update_statistics_start

update_statistics_stop

validate_import_table_duration

validate_import_table_start

validate_import_table_stop

validate_sqoop_import_duration

validate_sqoop_import_start

validate_sqoop_import_stop

validate_target_table_duration

validate_target_table_start

validate_target_table_stop

copy_data_duration

copy_data_start

copy_data_stop

copy_schema_duration

copy_schema_start

copy_schema_stop

atlas_schema_duration

atlas_schema_start

atlas_schema_stop

Table - import_tables

Main table where all tables that we can import are stored.

Column

Documentation

hive_db

Hive Database to import to

hive_table

Hive Table to import to

table_id

Unique identifier

dbalias

Name of database connection from jdbc_connections table

source_schema

Name of the schema in the remote database

source_table

Name of the table in the remote database

import_phase_type

What method to use for the Import phase

etl_phase_type

What method to use for ETL phase

import_type

What import method to use

import_tool

What tool should be used for importing data

etl_engine

What engine will be used to process etl stage

last_update_from_source

Timestamp of last schema update from source

sqoop_sql_where_addition

Will be added AFTER the SQL WHERE. If it’s an incr import, this will be after the incr limit statements. Example “orderId > 1000”

nomerge_ingestion_sql_addition

This will be added to the data ingestion of non-Merge imports (full, full_direct and incr). Useful to filter out data from import tables to target tables

include_in_airflow

Will the table be included in Airflow DAG when it matches the DAG selection

airflow_priority

This will set priority_weight in Airflow

validate_import

Should the import be validated

validationMethod

Validation method to use

validate_source

query = Run a ‘select count(*) from …’ to get the number of rows in the source table. sqoop = Use the number of rows imported by sqoop as the number of rows in the source table

validate_diff_allowed

-1 = auto calculated diff allowed. If a positive number, this is the number of rows that the diff is allowed to have

validationCustomQuerySourceSQL

Custom SQL query for source table

validationCustomQueryHiveSQL

Custom SQL query for Hive table

validationCustomQueryValidateImportTable

1 = Validate Import table, 0 = Don’t validate Import table

truncate_hive

<NOT USED>

mappers

-1 = auto or a positive number for a fixed number of mappers. If auto, then it’s calculated based on the last sqoop import size

soft_delete_during_merge

If 1, then the row will be marked as deleted instead of actually being removed from the table. Only used for Merge imports

source_rowcount

Used for validation. Don’t change manually

source_rowcount_incr

hive_rowcount

Used for validation. Don’t change manually

validationCustomQuerySourceValue

Used for validation. Don’t change manually

validationCustomQueryHiveValue

Used for validation. Don’t change manually

incr_mode

Incremental import mode

incr_column

What column to use to identify new rows

incr_validation_method

full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)

incr_minvalue

Used for incremental imports. Don’t change manually

incr_maxvalue

Used for incremental imports. Don’t change manually

incr_minvalue_pending

Used for incremental imports. Don’t change manually

incr_maxvalue_pending

Used for incremental imports. Don’t change manually

pk_column_override

Force the import and Hive table to define another PrimaryKey constraint. Comma separated list of columns

pk_column_override_mergeonly

Force the import to use another PrimaryKey constraint during Merge operations. Comma separated list of columns

hive_merge_heap

Should be a multiple of Yarn container size. If NULL then it will use the default specified in Yarn and TEZ

hive_split_count

Sets tez.grouping.split-count in the Hive session

spark_executor_memory

Memory used by Spark when importing data. Overrides default value in global configuration

spark_executors

Number of Spark executors to use. Overrides default value in global configuration

concatenate_hive_table

<NOT USED>

split_by_column

Column to split by when doing import with multiple sessions

sqoop_query

Use a custom query in sqoop to read data from source table

sqoop_options

Options to send to sqoop. Most commonly used for the --split-by option

sqoop_last_size

Used to track sqoop operation. Don’t change manually

sqoop_last_rows

Used to track sqoop operation. Don’t change manually

sqoop_last_mappers

Used to track sqoop operation. Don’t change manually

sqoop_last_execution

Used to track sqoop operation. Don’t change manually

sqoop_use_generated_sql

1 = Use the generated SQL that is saved in the generated_sqoop_query column

sqoop_allow_text_splitter

Allow splits on text columns. Use with caution

force_string

If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in jdbc_connections table

comment

Table comment from source system. Don’t change manually

generated_hive_column_definition

Generated column definition for Hive create table. Don’t change manually

generated_sqoop_query

Generated query for sqoop. Don’t change manually

generated_sqoop_options

Generated options for sqoop. Don’t change manually

generated_pk_columns

Generated Primary Keys. Don’t change manually

generated_foreign_keys

<NOT USED>

datalake_source

This value will come in the dbimport_source column if present. Overrides the same setting in jdbc_connections table

operator_notes

Free text field to write a note about the import.

copy_finished

Time when last copy from Master DBImport instance was completed. Don’t change manually

copy_slave

Defines if this table is a Master table or a Slave table. Don’t change manually

create_foreign_keys

-1 (default) = Get information from jdbc_connections table

airflow_notes

Documentation that will be available in Airflow. Markdown syntax

custom_max_query

You can use a custom SQL query that will get the Max value from the source database. This Max value will be used in an incremental import to know how much to read in each execution

mergeCompactionMethod

Compaction method to use after import using merge is completed. Default means a major compaction if it is configured to do so in the configuration table

sourceTableType

Type of table on the source system. This is a read-only column

invalidate_impala

-1 = Use default value from configuration table, otherwise 1 will invalidate the table in Impala and 0 will not
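
A hedged sketch of registering a new table for import; all names are hypothetical, the valid strings for the phase, type and tool columns depend on your DBImport version, and columns not listed keep their defaults.

    -- Hypothetical import of source table SALES.CUSTOMER into Hive table sales_db.customer
    INSERT INTO import_tables (hive_db, hive_table, dbalias, source_schema, source_table,
                               import_phase_type, etl_phase_type, import_tool)
    VALUES ('sales_db', 'customer', 'SALES_SRC', 'SALES', 'CUSTOMER',
            'full', 'truncate_insert', 'sqoop');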

Table - import_tables_indexes

Information table that contains what indexes are available on the source system. The data in the table is not used by DBImport itself, instead it’s a help for the users to know what indexes exist to make an efficient import.

Column

Documentation

table_id

Foreign Key to import_tables column ‘table_id’

hive_db

Hive Database

hive_table

Hive Table

index_name

Name of the index

index_type

The type of the index. This is the name from the source database vendor

index_unique

Does the index have unique values or not?

column

Column name

column_type

Column type in source system

column_order

The order of the column in the index.

column_is_nullable

1 = Column may contain null values, 0 = Nulls are not allowed in the column

Table - jdbc_connections

Database connection definitions

Column

Documentation

dbalias

Name of the Database connection

private_key_path

<NOT USED>

public_key_path

<NOT USED>

jdbc_url

The JDBC URL String

credentials

Encrypted fields for credentials. Changed by the saveCredentialTool

datalake_source

This value will come in the dbimport_source column if present. Priority is table, connection

max_import_sessions

You can limit the number of parallel sessions during import with this value. If NULL, then Max will come from configuration file

force_string

If set to 1, all character based fields (char, varchar) will become string in Hive

create_datalake_import

If set to 1, the datalake_import column will be created on all tables that are using this dbalias

timewindow_start

Start of the time window when we are allowed to run against this connection.

timewindow_stop

End of the time window when we are allowed to run against this connection.

timewindow_timezone

Timezone used for timewindow_start and timewindow_stop columns. Use full text timezone, example Europe/Stockholm.

operator_notes

Free text field to write a note about the connection

create_foreign_keys

1 = Create foreign keys, 0 = Don’t create foreign keys

contact_info

Contact information. Used by Atlas integration

description

Description. Used by Atlas integration

owner

Owner of system and/or data. Used by Atlas integration

atlas_discovery

1 = Discover tables/views on this connection, 0 = Don’t use Atlas discovery on this connection

atlas_include_filter

Include filter for Atlas discovery

atlas_exclude_filter

Exclude filter for Atlas discovery

atlas_last_discovery

environment

Name of the Environment type

seed_file

File that will be used to fetch the custom seed that will be used for anonymization functions on data from the connection
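
A sketch of adding a connection; the alias, JDBC URL and notes are made up, and the credentials column is left out on purpose since it is managed by the saveCredentialTool.

    -- Hypothetical SQL Server connection used as an import source
    INSERT INTO jdbc_connections (dbalias, jdbc_url, timewindow_timezone, operator_notes)
    VALUES ('SALES_SRC', 'jdbc:sqlserver://saleshost:1433;database=SALES',
            'Europe/Stockholm', 'Source system for the sales imports');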

Table - jdbc_connections_drivers

Column

Documentation

database_type

Name of database type. Name is hardcoded into Python scripts, so use only defined values

version

Free-text field with version. Has nothing to do with version of driver itself.

driver

Java class for JDBC driver

classpath

Full path to JDBC driver/jar file. If more than one file is required, separate them with : and no spaces

Table - jdbc_connections_environments

Environments types. Used in jdbc_connections table to define what kind of connection type it is

Column

Documentation

environment

Name of the Environment type

description

Description of the Environment type

Table - jdbc_table_change_history

This table keeps track of all changes that were done to a JDBC table after the initial creation. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table

Column

Documentation

dbalias

Database connection name

db_name

Database name

schema_name

Schema Name

table_name

Table Name

column_name

Column Name

eventtime

Time and date of the event

event

Event name.

previous_value

Value before the change

value

Value after the change

description

Detailed description of the event

Table - json_to_send

Temporary storage of JSON payloads that will be sent to a REST interface if the tool is configured to do so.

Column

Documentation

id

Unique Identifier

type

The type of JSON data that is saved in the ‘jsondata’ column

create_time

Time when the JSON data was created

status

Internal status to keep track of what the status of the transmissions is

destination

The destination where to send the json.

jsondata

The payload to send

Table - table_change_history

This table keeps track of all changes that were done to an object after the initial load. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table

Column

Documentation

hive_db

Hive Database

hive_table

Hive Table

column_name

Column Name

eventtime

Time and date of the event

event

Event name.

previous_value

Value before the change

value

Value after the change

description

Detailed description of the event

Table - yarn_statistics

Statistics about Yarn applications that DBImport launches. Used for rough estimation of resources used and for connecting a Yarn application ID to a DBImport job

Column

Documentation

yarn_application_id

Name of the application in Yarn

operation_type

What kind of DBImport job it is

operation_tool

The tool or operation that was used

operation_timestamp

The timestamp for the complete operation. Can be used for grouping together all applications for a specific operation, i.e. a complete import/export

hive_db

Name of the Hive DB. Used for imports

hive_table

Name of the Hive Table. Used for imports

dbalias

Name of the database alias. Used for exports

target_schema

Name of the target schema. Used for exports

target_table

Name for the target table. Used for exports

application_start

Timestamp when the operation started

application_stop

Timestamp when the operation was completed

yarn_containers_total

Number of containers used. For spark and sqoop, this will be the maximum that was used during the operation

yarn_containers_failed

Number of containers that failed

yarn_containers_killed

Number of containers that were killed