Database Tables
As this version of DBImport doesn't include the web admin tool, the documentation covers each column in the configuration database tables. The admin tool will later use the same fields, so what is said here will also apply to the admin tool.
Table - airflow_custom_dags
It's possible to construct a DAG that has no import, export or ETL definitions in it, but instead only Tasks from the airflow_tasks table. That might be useful, for example, to run custom Hive code after an import is completed as a separate DAG. Defining a DAG here also requires you to have at least one task in airflow_tasks defined "in main".
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
retries | How many retries a Task should do in Airflow before it fails
operator_notes | Free text field to write a note about the custom DAG.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
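As a sketch, a custom DAG could be registered with an insert like the following. All values (DAG name, schedule, tags) are illustrative assumptions, not a tested configuration:

```sql
-- Hypothetical example: define a custom DAG that runs daily at 06:00
-- in the Europe/Stockholm timezone and retries failed tasks twice.
INSERT INTO airflow_custom_dags
    (dag_name, schedule_interval, retries, timezone, tags, auto_regenerate_dag)
VALUES
    ('custom_hive_maintenance', '0 6 * * *', 2, 'Europe/Stockholm', 'maintenance,hive', 1);
```

Remember that such a DAG only becomes useful once at least one matching task exists in airflow_tasks with placement "in main".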
Table - airflow_etl_dags
To create a DAG in Airflow for only ETL jobs, this table holds all definitions of the DAG configuration, including the filter that defines which ETL jobs to run, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_job | Filter string for JOB in the etl_jobs table
filter_task | Filter string for TASK in the etl_jobs table
filter_source_db | Filter string for SOURCE_DB in the etl_jobs table
filter_target_db | Filter string for TARGET_DB in the etl_jobs table
retries | How many retries a Task should do in Airflow before it fails
trigger_dag_on_success | <NOT USED>
operator_notes | Free text field to write a note about the ETL job.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
Table - airflow_export_dags
To create a DAG in Airflow for exports, this table holds all definitions of the DAG configuration, including the filter that defines which tables to export, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_dbalias | Filter string for DBALIAS in export_tables
filter_target_schema | Filter string for TARGET_SCHEMA in export_tables
filter_target_table | Filter string for TARGET_TABLE in export_tables
use_python_dbimport | Legacy use only. Always set this to 1
retries | How many retries a Task should do in Airflow before it fails
trigger_dag_on_success | Name of DAG to trigger if the export is successful. Comma separated list of DAGs (optional)
operator_notes | Free text field to write a note about the export.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
Table - airflow_import_dags
To create a DAG in Airflow for imports, this table holds all definitions of the DAG configuration, including the filter that defines which tables to import, schedule times, pool names and much more.
Column | Documentation
---|---
dag_name | Name of Airflow DAG.
schedule_interval | Time to execute the DAG
filter_hive | Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
use_python_dbimport | Legacy use only. Always set this to 1
finish_all_stage1_first | 1 = All Import phase jobs will be completed first, and when all are successful, the ETL phase starts
retries | How many retries a Task should do in Airflow before it fails
retries_stage1 | Specific number of retries for the Import phase
retries_stage2 | Specific number of retries for the ETL phase
pool_stage1 | Airflow pool used for stage1 tasks. NULL for the default Hostname pool
pool_stage2 | Airflow pool used for stage2 tasks. NULL for the default DAG pool
operator_notes | Free text field to write a note about the import.
application_notes | Free text field that can be used for application documentation, notes or links.
auto_table_discovery | <NOT USED>
auto_regenerate_dag | 1 = The DAG will be auto regenerated by the manage command
run_import_and_etl_separate | 1 = The Import and ETL phases will run as separate Tasks.
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
sudo_user | All tasks in the DAG will use this user for sudo instead of the default
metadata_import | 1 = Run only getSchema and getRowCount, 0 = Run a normal import
timezone | Timezone used for the schedule_interval column. Use a full text timezone, for example Europe/Stockholm.
email | Email to send messages to in case email_on_retry or email_on_failure is set to True. Comma separated list of email addresses
email_on_failure | Send email on failures
email_on_retries | Send email on retries
tags | Comma separated list of Airflow tags that will be set on the DAG
sla_warning_time | Maximum time this DAG should run before Airflow triggers an SLA miss
retry_exponential_backoff | 1 = Use the retry_exponential_backoff Airflow function, which makes the wait between task retries longer and longer each time instead of fixed, 0 = Run with a fixed time of 5 min between the task retries
concurrency | Set the max number of concurrent tasks in the DAG while executing. Overrides the default value specified in the Airflow configuration
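A minimal import DAG definition might look like the following sketch. The DAG name, schedule and filter are assumptions for illustration; filter_hive follows the wildcard syntax documented above:

```sql
-- Hypothetical example: a nightly DAG that imports every table
-- in the Hive database sales_db, completing all Import phase
-- jobs before any ETL phase job starts.
INSERT INTO airflow_import_dags
    (dag_name, schedule_interval, filter_hive, use_python_dbimport,
     finish_all_stage1_first, retries)
VALUES
    ('import_sales_daily', '0 2 * * *', 'sales_db.*', 1, 1, 3);
```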
Table - airflow_tasks
All DAGs can be customized by adding Tasks to the DAG. Depending on the placement and type of the Tasks that are created, DBImport will add custom placeholders to keep the DAG separated into three different parts: Before Main, In Main and After Main. In Main is where all regular import, export or ETL jobs are executed. If you want to execute something before these, you place it in Before Main, and if you want to execute something after, you place it in After Main. Please check the Airflow Integration part of the documentation for more examples and a better understanding of the data you can put into this table.
Column | Documentation
---|---
dag_name | Name of the DAG to add Tasks to
task_name | Name of the Task in Airflow
task_type | The type of the Task
placement | Placement of the Task
jdbc_dbalias | For the ‘JDBC SQL’ Task Type, this specifies which database the SQL should run against
hive_db | <NOT USED>
airflow_pool | Airflow Pool to use.
airflow_priority | Airflow Priority. Higher number, higher priority
include_in_airflow | Enable or disable the Task in the DAG during creation of the DAG file.
task_dependency_upstream | Defines the upstream dependencies for the Task. Comma separated list
task_dependency_downstream | Defines the downstream dependencies for the Task. Comma separated list
task_config | The configuration for the Task. Depends on the Task type.
sensor_connection | Name of the Connection in Airflow
sensor_poke_interval | Poke interval for sensors, in seconds
sensor_timeout_minutes | Timeout for sensors, in minutes
sensor_soft_fail | Setting this to 1 will add soft_fail=True on the sensor
sudo_user | The task will use this user for sudo instead of the default
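As a sketch, a custom Task placed before the regular jobs of a DAG could be added like this. The task_type value and the task_config contents are assumptions here; consult the Airflow Integration documentation for the task types your version supports:

```sql
-- Hypothetical example: run a SQL statement against a JDBC connection
-- before the regular imports of the DAG start ("before main" placement).
INSERT INTO airflow_tasks
    (dag_name, task_name, task_type, placement,
     jdbc_dbalias, task_config, include_in_airflow)
VALUES
    ('import_sales_daily', 'prepare_staging', 'JDBC SQL', 'before main',
     'dwh_oracle', 'TRUNCATE TABLE STAGING.WORK_AREA', 1);
```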
Table - atlas_column_cache
Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.
Column | Documentation
---|---
hostname | Hostname of the database
port | Port of the database
database_name | Database name
schema_name | Database schema
table_name | Database table
column_name | Name of the column
column_type | Type of the column
column_length | Length of the column
column_is_nullable | Whether null values are allowed in the column
column_comment | Comment on the column
table_comment | Comment on the table
table_type | Table type.
table_create_time | Timestamp for when the table was created
default_value | Default value of the column
Table - atlas_key_cache
Atlas discovery uses this table to cache values in order to detect changes instead of putting a heavy load on the Atlas server.
Column | Documentation
---|---
hostname | Hostname of the database
port | Port of the database
database_name | Database name
schema_name | Database schema
table_name | Database table
constraint_name | Name of the constraint
constraint_type | Type of the constraint
column_name | Name of the column
reference_schema_name | Name of the schema that is referenced
reference_table_name | Name of the table that is referenced
reference_column_name | Name of the column that is referenced
col_key_position | Position of the key
Table - configuration
This is the DBImport global configuration table. In here you can configure items such as the name of the staging database, disable global executions, max number of mappers and much more.
Column | Documentation
---|---
configKey | Name of the configuration item. These are controlled and maintained by the setup tool. Don't change these manually
valueInt | Integer based value
valueStr | String based value
valueDate | Date based value
description | Description of the setting
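Since each configuration item populates only one of the three value columns, reading a setting is a matter of selecting the row and picking the non-NULL value. The configKey used below is a made-up example, not a documented key:

```sql
-- Hypothetical example: inspect one configuration item.
-- Which of valueInt/valueStr/valueDate is populated depends on the item.
SELECT configKey, valueInt, valueStr, valueDate, description
FROM configuration
WHERE configKey = 'import_staging_database';
```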
Table - copy_async_status
The status table for asynchronous copy between DBImport instances.
Column | Documentation
---|---
table_id | Reference to import_table.table_id
hive_db | Hive Database
hive_table | Hive Table to copy
destination | DBImport instances to copy the imported data to
copy_status | Status of the copy operation
last_status_update | Last time the server changed progress on this copy
failures | Number of failures in the current state
hdfs_source_path | HDFS path to copy from
hdfs_target_path | HDFS path to copy to
Table - copy_tables
When the copy phase starts, it will look in this table to determine whether it is going to copy its data, and to which DBImport instances.
Column | Documentation
---|---
copy_id | Auto incremented PrimaryKey of the table
hive_filter | Filter string for database and table. ; separated. Wildcards (*) allowed. Example HIVE_DB.HIVE_TABLE; HIVE_DB.HIVE_TABLE
destination | DBImport instances to copy the imported data to
data_transfer | Synchronous will transfer the data as part of the Import. Asynchronous will transfer the data in a separate process, not as part of the Import
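Enabling copy for a set of tables is then a single row in this table. The filter and destination below are assumptions; destination should match the name of an instance defined in dbimport_instances:

```sql
-- Hypothetical example: copy everything imported into sales_db
-- to the instance named 'dr_cluster', outside the import itself.
INSERT INTO copy_tables (hive_filter, destination, data_transfer)
VALUES ('sales_db.*', 'dr_cluster', 'Asynchronous');
```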
Table - dbimport_instances
This table contains all DBImport instances that will receive data from this instance during the copy phase.
Column | Documentation
---|---
instance_id | Auto incremented PrimaryKey of the table
name | Name of the DBImport instance
db_hostname | MySQL hostname of the DBImport database
db_port | MySQL port of the DBImport database
db_database | MySQL database of the DBImport database
db_credentials | MySQL username and password for the DBImport database
hdfs_address | HDFS address. Example hdfs://hadoopcluster
hdfs_basedir | The base dir to write data to. Example /apps/dbimport
sync_credentials | 0 = Credentials won't be synced, 1 = The credentials information will be synced to the other cluster
Table - etl_jobs
Column | Documentation
---|---
job | 
task | 
job_id | 
etl_type | 
include_in_airflow | 
source_db | 
source_table | 
target_db | 
target_table | 
operator_notes | Free text field to write a note about the ETL job.
Table - export_columns
This table contains all columns that exist on all tables that we are exporting. Unlike the export_tables table, this one is created automatically by the export tool.
Column | Documentation
---|---
table_id | Foreign Key to export_tables column ‘table_id’
column_id | Unique identifier
column_name | Name of the column in the target table. Don't change this manually
column_type | Column type from Hive. Don't change this manually
column_order | The order of the columns. Don't change this manually
hive_db | Only used to make it easier to read the table. No real usage
hive_table | Only used to make it easier to read the table. No real usage
target_column_name | Override the name of the column in the target system
target_column_type | Override the column type in the target system
last_update_from_hive | Timestamp of the last schema update from Hive. Don't change this manually
last_export_time | Timestamp of the last export. Don't change this manually
selection | <NOT USED>
include_in_export | 1 = Include the column in the export, 0 = Exclude the column from the export
comment | The column comment from the source system. Don't change this manually
operator_notes | Free text field to write a note about the export.
Table - export_retries_log
Log of all retries that have happened.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
retry_time | Time when the retry was started
stage | The stage of the export that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
Table - export_stage
The export tool keeps track of how far into the export it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an export is executed it will skip the first part and continue from where it errored on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the "manage --clearExportStage" tool. Keep in mind that clearing the stage of an incremental export might result in loss of data.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
stage | Current stage of the export. This is the internal stage number
stage_description | Description of the stage
stage_time | The date and time when the export entered the stage
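Before clearing a stage, it can be worth checking where the export stopped. A sketch of such a lookup (the dbalias value is a placeholder):

```sql
-- Hypothetical example: see where a failed export stopped
-- before deciding whether to run manage --clearExportStage.
SELECT dbalias, target_schema, target_table,
       stage, stage_description, stage_time
FROM export_stage
WHERE dbalias = 'dwh_oracle';
```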
Table - export_stage_statistics
As DBImport progresses through the different stages of the export, it also keeps track of the start and stop time for each stage, together with the duration. That information is kept in this table.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
stage | Current stage of the export. This is the internal stage number
start | Time when the stage started
stop | Time when the stage was completed
duration | Duration of the stage
Table - export_statistics
At the end of each export, statistics about how long each part took, as well as general information about the Hive database and table, number of rows exported, size of the export and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large, it is up to each user to delete or truncate it as they see fit.
Column | Documentation
---|---
id | Auto incremented PrimaryKey of the table
dbalias | ID of the Database Connection
target_database | Name of the target database
target_schema | Name of the target schema
target_table | Name of the target table
hive_db | Hive Database
hive_table | Hive Table
export_phase | Export Phase method
incremental | 0 = Full export, 1 = Incremental export
rows | How many rows were exported
size | The total size in bytes that was exported
sessions | How many parallel sessions were used against the target (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
get_hive_tableschema_duration | 
get_hive_tableschema_start | 
get_hive_tableschema_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
create_temp_table_duration | 
create_temp_table_start | 
create_temp_table_stop | 
truncate_temp_table_duration | 
truncate_temp_table_start | 
truncate_temp_table_stop | 
fetch_maxvalue_start | 
fetch_maxvalue_stop | 
fetch_maxvalue_duration | 
insert_into_temp_table_duration | 
insert_into_temp_table_start | 
insert_into_temp_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
validate_duration | 
validate_start | 
validate_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
update_target_table_duration | 
update_target_table_start | 
update_target_table_stop | 
spark_duration | 
spark_start | 
spark_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - export_statistics_last
The last entry in the export_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the export_statistics table grows to many millions of rows, this saves a lot of CPU power for the database server.
Column | Documentation
---|---
dbalias | ID of the Database Connection
target_database | Name of the target database
target_schema | Name of the target schema
target_table | Name of the target table
hive_db | Hive Database
hive_table | Hive Table
export_phase | Export Phase method
incremental | 0 = Full export, 1 = Incremental export
rows | How many rows were exported
size | The total size in bytes that was exported
sessions | How many parallel sessions were used against the target (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
get_hive_tableschema_duration | 
get_hive_tableschema_start | 
get_hive_tableschema_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
create_temp_table_duration | 
create_temp_table_start | 
create_temp_table_stop | 
truncate_temp_table_duration | 
truncate_temp_table_start | 
truncate_temp_table_stop | 
fetch_maxvalue_start | 
fetch_maxvalue_stop | 
fetch_maxvalue_duration | 
insert_into_temp_table_duration | 
insert_into_temp_table_start | 
insert_into_temp_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
validate_duration | 
validate_start | 
validate_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
update_target_table_duration | 
update_target_table_start | 
update_target_table_stop | 
spark_duration | 
spark_start | 
spark_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - export_tables
Main table where all tables that we can export are stored.
Column | Documentation
---|---
dbalias | Database connection name that we export to
target_schema | Schema on the target system
target_table | Table on the target system
table_id | Unique identifier of the table
hive_db | Name of the Hive Database that we export from
hive_table | Name of the Hive Table that we export from
export_type | What export method to use. Only full and incr are supported.
export_tool | What tool should be used for exporting data
last_update_from_hive | Timestamp of the last schema update from Hive
sql_where_addition | Will be added AFTER the SQL WHERE. If it's an incr export, this will be after the incr limit statements. Example “orderId > 1000”
include_in_airflow | Will the table be included in the Airflow DAG when it matches the DAG selection
notUsed01 | <NOT USED>
forceCreateTempTable | Force the export to create a Hive table and export that instead. Useful when exporting views
notUsed02 | <NOT USED>
validate_export | 1 = Validate the export once it's done. 0 = Disable validation
validationMethod | Validation method to use
validationCustomQueryHiveSQL | Custom SQL query for the Hive table
validationCustomQueryTargetSQL | Custom SQL query for the target table
uppercase_columns | -1 = auto (Oracle = uppercase, other databases = lowercase)
truncate_target | 1 = Truncate the target table before we export the data. Not used by incremental exports
mappers | -1 = auto, 0 = invalid. Auto updated by ‘export_main.sh’
hive_rowcount | Number of rows in the Hive table. Don't change manually
target_rowcount | Number of rows in the Target table. Don't change manually
validationCustomQueryHiveValue | Used for validation. Don't change manually
validationCustomQueryTargetValue | Used for validation. Don't change manually
incr_column | The column in the Hive table that will be used to identify new rows for the incremental export. Must be a timestamp column
incr_validation_method | full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between min and max value (the data that sqoop just wrote)
incr_minvalue | Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue | Used by incremental exports to keep track of progress. Don't change manually
incr_minvalue_pending | Used by incremental exports to keep track of progress. Don't change manually
incr_maxvalue_pending | Used by incremental exports to keep track of progress. Don't change manually
sqoop_options | Sqoop options to use during the export.
sqoop_last_size | Used to track the sqoop operation. Don't change manually
sqoop_last_rows | Used to track the sqoop operation. Don't change manually
sqoop_last_mappers | Used to track the sqoop operation. Don't change manually
sqoop_last_execution | Used to track the sqoop operation. Don't change manually
create_target_table_sql | SQL statement that was used to create the target table. Don't change manually
operator_notes | Free text field to write a note about the export.
hive_javaheap | Heap size for Hive
airflow_priority | This will set priority_weight in Airflow
airflow_notes | Documentation that will be available in Airflow. Markdown syntax
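Putting the columns above together, an incremental export definition could be sketched like this. Connection name, schemas, the incr_column and the export_tool value are all illustrative assumptions:

```sql
-- Hypothetical example: incremental export of the Hive table
-- sales_db.orders to ORDERS in the DWH schema on 'dwh_oracle',
-- driven by a timestamp column and validated after each run.
INSERT INTO export_tables
    (dbalias, target_schema, target_table, hive_db, hive_table,
     export_type, incr_column, validate_export)
VALUES
    ('dwh_oracle', 'DWH', 'ORDERS', 'sales_db', 'orders',
     'incr', 'last_modified_ts', 1);
```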
Table - import_columns
This table contains all columns that exist on all tables that we are importing. Unlike the import_tables table, this one is created automatically by the ‘Get Source TableSchema’ stage.
Column | Documentation
---|---
table_id | Foreign Key to import_tables column ‘table_id’
column_id | Unique identifier of the column
column_order | In what order the column exists in the source system.
column_name | Name of the column in Hive. Don't change this manually
hive_db | Hive Database
hive_table | Hive Table
source_column_name | Name of the column in the source system. Don't change this manually
column_type | Column type in Hive. Don't change this manually
source_column_type | Column type in the source system. Don't change this manually
source_database_type | The database type the column was imported from
column_name_override | Set a custom name for the column in Hive
column_type_override | Set a custom column type in Hive
sqoop_column_type | Used to create a correct --map-column-java setting for sqoop.
force_string | If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in the import_tables and jdbc_connections tables
include_in_import | 1 = Include the column in the import, 0 = Exclude the column from the import
source_primary_key | Number starting from 1 listing the order of the column in the PK. Don't change this manually
last_update_from_source | Timestamp of the last schema update from the source
comment | The column comment from the source system
operator_notes | Free text field to write a note about the column
sqoop_column_type_override | Set the --map-column-java field to a fixed value instead of one calculated by DBImport
anonymization_function | What anonymization function should be used with the data in this column
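The override columns are the intended way to adjust what DBImport discovered. A sketch of renaming a sensitive column in Hive and attaching an anonymization function to it (table, column and function names are assumptions; check which anonymization functions your DBImport version provides):

```sql
-- Hypothetical example: rename a discovered column in Hive and
-- anonymize its contents on import.
UPDATE import_columns
SET column_name_override = 'customer_ssn_hash',
    anonymization_function = 'hash'
WHERE hive_db = 'sales_db'
  AND hive_table = 'customers'
  AND source_column_name = 'SSN';
```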
Table - import_failure_log
If there is an error or a warning during an import, but the import still continues, the errors are logged in this table. An example could be that a column can't be altered, a foreign key is not created, no new columns can be added, and so on.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
eventtime | Time when the error/warning occurred
severity | The severity of the event.
import_type | The import method used
error_text | Text describing the failure
Table - import_foreign_keys
All foreign key definitions are saved in this table. The information in this table is recreated all the time, so no manual changes are allowed here. For a better understanding of this table, please use the view called import_foreign_keys_view instead.
Column | Documentation
---|---
table_id | Table ID in import_tables that has the FK
column_id | Column ID in import_columns that has the FK
fk_index | Index of the FK
fk_table_id | Table ID in import_tables that the table has a reference against
fk_column_id | Column ID in import_columns that the table has a reference against
key_position | Position of the key
Table - import_retries_log
Log of all retries that have happened.
Column | Documentation
---|---
hive_db | Hive DB
hive_table | Hive Table
retry_time | Time when the retry was started
stage | The stage of the import that the retry started from. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
Table - import_stage
The import tool keeps track of how far into the import it has succeeded. So in case of an error, let's say that Hive is not responding, the next time an import is executed it will skip the first part and continue from where it errored on the previous run. If you want to rerun from the beginning, the information in this table needs to be cleared. This is done with the "manage --clearImportStage" tool. Keep in mind that clearing the stage of an incremental import might result in loss of data.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
stage | Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
stage_description | Description of the stage
stage_time | The date and time when the import entered the stage
Table - import_stage_statistics
As DBImport progresses through the different stages of the import, it also keeps track of the start and stop time for each stage, together with the duration. That information is kept in this table.
Column | Documentation
---|---
hive_db | Hive Database
hive_table | Hive Table
stage | Current stage of the import. This is an internal stage and has nothing to do with stage1 and stage2 in Airflow DAGs
start | Time when the stage started
stop | Time when the stage was completed
duration | Duration of the stage
Table - import_statistics
At the end of each import, statistics about how long each part took, as well as general information about the Hive database and table, number of rows imported, size of the import and much more, are logged in this table. This table grows and will never be truncated by DBImport itself. If it becomes too large, it is up to each user to delete or truncate it as they see fit.
Column | Documentation
---|---
id | Auto incremented PrimaryKey of the table
hive_db | Hive Database
hive_table | Hive Table
importtype | What kind of import type was used
import_phase | Import Phase method
copy_phase | Copy Phase method
etl_phase | ETL Phase method
incremental | 0 = Full import, 1 = Incremental import
dbalias | ID of the Database Connection
source_database | Name of the source database
source_schema | Name of the source schema
source_table | Name of the source table
rows | How many rows were imported
size | The total size in bytes that was imported
sessions | How many parallel sessions were used against the source (sqoop mappers)
duration | Total duration in seconds
start | Timestamp of start
stop | Timestamp of stop
sqoop_duration | 
sqoop_start | 
sqoop_stop | 
spark_duration | 
spark_start | 
spark_stop | 
spark_etl_duration | 
spark_etl_start | 
spark_etl_stop | 
clear_hive_locks_duration | 
clear_hive_locks_start | 
clear_hive_locks_stop | 
clear_table_rowcount_duration | 
clear_table_rowcount_start | 
clear_table_rowcount_stop | 
connect_to_hive_duration | 
connect_to_hive_start | 
connect_to_hive_stop | 
create_import_table_duration | 
create_import_table_start | 
create_import_table_stop | 
create_target_table_duration | 
create_target_table_start | 
create_target_table_stop | 
get_import_rowcount_duration | 
get_import_rowcount_start | 
get_import_rowcount_stop | 
get_source_rowcount_duration | 
get_source_rowcount_start | 
get_source_rowcount_stop | 
get_source_tableschema_duration | 
get_source_tableschema_start | 
get_source_tableschema_stop | 
get_target_rowcount_duration | 
get_target_rowcount_start | 
get_target_rowcount_stop | 
hive_import_duration | 
hive_import_start | 
hive_import_stop | 
truncate_target_table_duration | 
truncate_target_table_start | 
truncate_target_table_stop | 
merge_table_duration | 
merge_table_start | 
merge_table_stop | 
create_history_table_duration | 
create_history_table_start | 
create_history_table_stop | 
create_delete_table_duration | 
create_delete_table_start | 
create_delete_table_stop | 
update_statistics_duration | 
update_statistics_start | 
update_statistics_stop | 
validate_import_table_duration | 
validate_import_table_start | 
validate_import_table_stop | 
validate_sqoop_import_duration | 
validate_sqoop_import_start | 
validate_sqoop_import_stop | 
validate_target_table_duration | 
validate_target_table_start | 
validate_target_table_stop | 
copy_data_duration | 
copy_data_start | 
copy_data_stop | 
copy_schema_duration | 
copy_schema_start | 
copy_schema_stop | 
atlas_schema_duration | 
atlas_schema_start | 
atlas_schema_stop | 
Table - import_statistics_last
The last entry in the import_statistics table is also stored in this table. This makes it easier to find the latest data without first grouping to find the latest entry. When the import_statistics table grows to many millions of rows, this saves a lot of CPU power on the database server.
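The saving described above can be illustrated with a sketch. It uses an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB) and models only a few of the documented columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("import_statistics", "import_statistics_last"):
    # "rows" is quoted because it collides with an SQL keyword in some engines.
    conn.execute(f'CREATE TABLE {t} (hive_db TEXT, hive_table TEXT, "rows" INTEGER, stop TIMESTAMP)')

# import_statistics keeps every run; import_statistics_last only the latest one.
conn.executemany("INSERT INTO import_statistics VALUES (?, ?, ?, ?)",
                 [("sales", "orders", 100, "2024-01-01"),
                  ("sales", "orders", 120, "2024-02-01")])
conn.execute("INSERT INTO import_statistics_last VALUES ('sales', 'orders', 120, '2024-02-01')")

# Without the helper table: sort the full history to find the latest entry.
slow = conn.execute('SELECT "rows" FROM import_statistics '
                    "WHERE hive_db = 'sales' AND hive_table = 'orders' "
                    "ORDER BY stop DESC LIMIT 1").fetchone()

# With import_statistics_last: a single-row lookup, no sorting or grouping needed.
fast = conn.execute('SELECT "rows" FROM import_statistics_last '
                    "WHERE hive_db = 'sales' AND hive_table = 'orders'").fetchone()
```

Both queries return the same latest run, but the second one never has to scan the growing history table.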
Column |
Documentation |
---|---|
hive_db |
Hive Database |
hive_table |
Hive Table |
importtype |
What kind of import type was used |
import_phase |
Import Phase method |
copy_phase |
Copy Phase method |
etl_phase |
ETL Phase method |
incremental |
0 = Full import, 1 = Incremental import |
dbalias |
ID of the Database Connection |
source_database |
Name of the source database |
source_schema |
Name of the source schema |
source_table |
Name of the source table |
rows |
Number of rows that were imported |
size |
The total size in bytes that was imported |
sessions |
Number of parallel sessions used against the source (sqoop mappers) |
duration |
Total duration in seconds |
start |
Timestamp of start |
stop |
Timestamp of stop |
sqoop_duration |
|
sqoop_start |
|
sqoop_stop |
|
spark_duration |
|
spark_start |
|
spark_stop |
|
spark_etl_duration |
|
spark_etl_start |
|
spark_etl_stop |
|
clear_hive_locks_duration |
|
clear_hive_locks_start |
|
clear_hive_locks_stop |
|
clear_table_rowcount_duration |
|
clear_table_rowcount_start |
|
clear_table_rowcount_stop |
|
connect_to_hive_duration |
|
connect_to_hive_start |
|
connect_to_hive_stop |
|
create_import_table_duration |
|
create_import_table_start |
|
create_import_table_stop |
|
create_target_table_duration |
|
create_target_table_start |
|
create_target_table_stop |
|
get_import_rowcount_duration |
|
get_import_rowcount_start |
|
get_import_rowcount_stop |
|
get_source_rowcount_duration |
|
get_source_rowcount_start |
|
get_source_rowcount_stop |
|
get_source_tableschema_duration |
|
get_source_tableschema_start |
|
get_source_tableschema_stop |
|
get_target_rowcount_duration |
|
get_target_rowcount_start |
|
get_target_rowcount_stop |
|
hive_import_duration |
|
hive_import_start |
|
hive_import_stop |
|
truncate_target_table_duration |
|
truncate_target_table_start |
|
truncate_target_table_stop |
|
merge_table_duration |
|
merge_table_start |
|
merge_table_stop |
|
create_history_table_duration |
|
create_history_table_start |
|
create_history_table_stop |
|
create_delete_table_duration |
|
create_delete_table_start |
|
create_delete_table_stop |
|
update_statistics_duration |
|
update_statistics_start |
|
update_statistics_stop |
|
validate_import_table_duration |
|
validate_import_table_start |
|
validate_import_table_stop |
|
validate_sqoop_import_duration |
|
validate_sqoop_import_start |
|
validate_sqoop_import_stop |
|
validate_target_table_duration |
|
validate_target_table_start |
|
validate_target_table_stop |
|
copy_data_duration |
|
copy_data_start |
|
copy_data_stop |
|
copy_schema_duration |
|
copy_schema_start |
|
copy_schema_stop |
|
atlas_schema_duration |
|
atlas_schema_start |
|
atlas_schema_stop |
Table - import_tables
Main table where all tables that can be imported are defined.
Column |
Documentation |
---|---|
hive_db |
Hive Database to import to |
hive_table |
Hive Table to import to |
table_id |
Unique identifier |
dbalias |
Name of database connection from jdbc_connections table |
source_schema |
Name of the schema in the remote database |
source_table |
Name of the table in the remote database |
import_phase_type |
What method to use for the import phase |
etl_phase_type |
What method to use for ETL phase |
import_type |
What import method to use |
import_tool |
What tool should be used for importing data |
etl_engine |
What engine will be used to process etl stage |
last_update_from_source |
Timestamp of last schema update from source |
sqoop_sql_where_addition |
Will be added after the SQL WHERE clause. If it’s an incremental import, this will be placed after the incremental limit statements. Example: “orderId > 1000” |
nomerge_ingestion_sql_addition |
This will be added to the data ingestion of non-merge imports (full, full_direct and incr). Useful for filtering out data from import tables to target tables |
include_in_airflow |
Will the table be included in Airflow DAG when it matches the DAG selection |
airflow_priority |
This will set priority_weight in Airflow |
validate_import |
Should the import be validated |
validationMethod |
Validation method to use |
validate_source |
query = Run a ‘select count(*) from …’ to get the number of rows in the source table. sqoop = Use the number of rows imported by sqoop as the number of rows in the source table |
validate_diff_allowed |
-1 = auto-calculated allowed diff. If a positive number, this is the number of rows that the diff is allowed to have |
validationCustomQuerySourceSQL |
Custom SQL query for source table |
validationCustomQueryHiveSQL |
Custom SQL query for Hive table |
validationCustomQueryValidateImportTable |
1 = Validate the import table, 0 = Don’t validate the import table |
truncate_hive |
<NOT USED> |
mappers |
-1 = auto, or a positive number for a fixed number of mappers. If auto, it is calculated based on the last sqoop import size |
soft_delete_during_merge |
If 1, then the row will be marked as deleted instead of actually being removed from the table. Only used for Merge imports |
source_rowcount |
Used for validation. Don’t change manually |
source_rowcount_incr |
|
hive_rowcount |
Used for validation. Don’t change manually |
validationCustomQuerySourceValue |
Used for validation. Don’t change manually |
validationCustomQueryHiveValue |
Used for validation. Don’t change manually |
incr_mode |
Incremental import mode |
incr_column |
What column to use to identify new rows |
incr_validation_method |
full or incr. Full means that the validation will check the total number of rows up until maxvalue and compare source with target. Incr will only compare the rows between the min and max values (the data that sqoop just wrote) |
incr_minvalue |
Used for incremental imports. Don’t change manually |
incr_maxvalue |
Used for incremental imports. Don’t change manually |
incr_minvalue_pending |
Used for incremental imports. Don’t change manually |
incr_maxvalue_pending |
Used for incremental imports. Don’t change manually |
pk_column_override |
Force the import and Hive table to define another PrimaryKey constraint. Comma-separated list of columns |
pk_column_override_mergeonly |
Force the import to use another PrimaryKey constraint during Merge operations. Comma-separated list of columns |
hive_merge_heap |
Should be a multiple of Yarn container size. If NULL then it will use the default specified in Yarn and TEZ |
hive_split_count |
Sets tez.grouping.split-count in the Hive session |
spark_executor_memory |
Memory used by Spark when importing data. Overrides the default value in the global configuration |
spark_executors |
Number of Spark executors to use. Overrides default value in global configuration |
concatenate_hive_table |
<NOT USED> |
split_by_column |
Column to split by when doing import with multiple sessions |
sqoop_query |
Use a custom query in sqoop to read data from source table |
sqoop_options |
Options to send to sqoop. Most commonly used for the --split-by option |
sqoop_last_size |
Used to track sqoop operation. Don’t change manually |
sqoop_last_rows |
Used to track sqoop operation. Don’t change manually |
sqoop_last_mappers |
Used to track sqoop operation. Don’t change manually |
sqoop_last_execution |
Used to track sqoop operation. Don’t change manually |
sqoop_use_generated_sql |
1 = Use the generated SQL that is saved in the generated_sqoop_query column |
sqoop_allow_text_splitter |
Allow splits on text columns. Use with caution |
force_string |
If set to 1, all character based fields (char, varchar) will become string in Hive. Overrides the same setting in jdbc_connections table |
comment |
Table comment from the source system. Don’t change manually |
generated_hive_column_definition |
Generated column definition for the Hive create table. Don’t change manually |
generated_sqoop_query |
Generated query for sqoop. Don’t change manually |
generated_sqoop_options |
Generated options for sqoop. Don’t change manually |
generated_pk_columns |
Generated Primary Keys. Don’t change manually |
generated_foreign_keys |
<NOT USED> |
datalake_source |
This value will come in the dbimport_source column if present. Overrides the same setting in jdbc_connections table |
operator_notes |
Free text field to write a note about the import. |
copy_finished |
Time when the last copy from the Master DBImport instance was completed. Don’t change manually |
copy_slave |
Defines if this table is a Master table or a Slave table. Don’t change manually |
create_foreign_keys |
-1 (default) = Get information from jdbc_connections table |
airflow_notes |
Documentation that will be available in Airflow. Markdown syntax |
custom_max_query |
You can use a custom SQL query that will get the Max value from the source database. This Max value will be used in an incremental import to know how much to read in each execution |
mergeCompactionMethod |
Compaction method to use after import using merge is completed. Default means a major compaction if it is configured to do so in the configuration table |
sourceTableType |
Type of table on the source system. This is a read-only column |
invalidate_impala |
-1 = Use default value from configuration table, otherwise 1 will invalidate the table in Impala and 0 will not |
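The incr_* columns above act as a watermark for incremental imports: incr_maxvalue records how far the previous execution read, and the next execution continues from there. A read-only sketch of inspecting that watermark, using an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB; only a few documented columns are modelled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE import_tables (
    hive_db TEXT, hive_table TEXT, incr_mode TEXT,
    incr_column TEXT, incr_minvalue TEXT, incr_maxvalue TEXT)""")
conn.execute("""INSERT INTO import_tables
                VALUES ('sales', 'orders', 'append', 'order_id', '1', '5000')""")

# List the incremental watermark per table. The next run reads source rows where
# order_id > 5000; these columns are maintained by DBImport, don't edit them manually.
watermarks = conn.execute("""SELECT hive_table, incr_column, incr_maxvalue
                             FROM import_tables
                             WHERE incr_column IS NOT NULL""").fetchall()
```

A query like this is useful when troubleshooting an incremental import that appears to be missing rows, before deciding whether a full re-import is needed.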
Table - import_tables_indexes
Information table that contains the indexes available on the source system. The data in this table is not used by DBImport itself; instead, it helps users know what indexes exist so they can configure an effective import.
Column |
Documentation |
---|---|
table_id |
Foreign Key to import_tables column ‘table_id’ |
hive_db |
Hive Database |
hive_table |
Hive Table |
index_name |
Name of the index |
index_type |
The type of the index. This is the name from the source database vendor |
index_unique |
Does the index have unique values or not? |
column |
Column name |
column_type |
Column type in source system |
column_order |
The order of the column in the index. |
column_is_nullable |
1 = Column may contain null values, 0 = Nulls are not allowed in the column |
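Since the data is informational only, a typical use is a manual query against it to pick a good split column before setting split_by_column in import_tables. Unique, non-nullable index columns usually split evenly across sessions. A sketch with an in-memory SQLite stand-in (column names follow this documentation; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "column" is quoted because it is an SQL keyword.
conn.execute('''CREATE TABLE import_tables_indexes (
    table_id INTEGER, hive_db TEXT, hive_table TEXT,
    index_name TEXT, index_type TEXT, index_unique INTEGER,
    "column" TEXT, column_type TEXT, column_order INTEGER, column_is_nullable INTEGER)''')
conn.executemany("INSERT INTO import_tables_indexes VALUES (?,?,?,?,?,?,?,?,?,?)", [
    (1, "sales", "orders", "PK_ORDERS", "CLUSTERED", 1, "order_id", "int", 1, 0),
    (1, "sales", "orders", "IX_CUSTOMER", "NONCLUSTERED", 0, "customer_id", "int", 1, 1)])

# Unique, non-nullable index columns are good candidates for split_by_column.
candidates = conn.execute('''SELECT index_name, "column" FROM import_tables_indexes
                             WHERE hive_db = 'sales' AND hive_table = 'orders'
                               AND index_unique = 1 AND column_is_nullable = 0
                             ORDER BY column_order''').fetchall()
```

Here only the primary key column qualifies, which matches the usual advice of splitting on the primary key when one exists.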
Table - jdbc_connections
Database connection definitions
Column |
Documentation |
---|---|
dbalias |
Name of the Database connection |
private_key_path |
<NOT USED> |
public_key_path |
<NOT USED> |
jdbc_url |
The JDBC URL String |
credentials |
Encrypted fields for credentials. Changed by the saveCredentialTool |
datalake_source |
This value will come in the dbimport_source column if present. Priority is table, connection |
max_import_sessions |
You can limit the number of parallel sessions during import with this value. If NULL, then Max will come from configuration file |
force_string |
If set to 1, all character based fields (char, varchar) will become string in Hive |
create_datalake_import |
If set to 1, the datalake_import column will be created on all tables that are using this dbalias |
timewindow_start |
Start of the time window when we are allowed to run against this connection. |
timewindow_stop |
End of the time window when we are allowed to run against this connection. |
timewindow_timezone |
Timezone used for timewindow_start and timewindow_stop columns. Use full text timezone, example Europe/Stockholm. |
operator_notes |
Free text field to write a note about the connection |
create_foreign_keys |
1 = Create foreign keys, 0 = Don’t create foreign keys |
contact_info |
Contact information. Used by Atlas integration |
description |
Description. Used by Atlas integration |
owner |
Owner of system and/or data. Used by Atlas integration |
atlas_discovery |
1 = Discover tables/views on this connection, 0 = Don’t use Atlas discovery on this connection |
atlas_include_filter |
Include filter for Atlas discovery |
atlas_exclude_filter |
Exclude filter for Atlas discovery |
atlas_last_discovery |
|
environment |
Name of the Environment type |
seed_file |
File that will be used to fetch the custom seed that will be used for anonymization functions on data from the connection |
Table - jdbc_connections_drivers
Column |
Documentation |
---|---|
database_type |
Name of database type. Name is hardcoded into Python scripts, so use only defined values |
version |
Free-text field with the version. Has nothing to do with the version of the driver itself. |
driver |
Java class for JDBC driver |
classpath |
Full path to JDBC driver/jar file. If more than one file is required, separate them with : and no spaces |
Table - jdbc_connections_environments
Environment types. Used in the jdbc_connections table to define what kind of connection type it is
Column |
Documentation |
---|---|
environment |
Name of the Environment type |
description |
Description of the Environment type |
Table - jdbc_table_change_history
This table keeps track of all changes that were done to a JDBC table after the initial creation. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table
Column |
Documentation |
---|---|
dbalias |
Database connection name |
db_name |
Database name |
schema_name |
Schema Name |
table_name |
Table Name |
column_name |
Column Name |
eventtime |
Time and date of the event |
event |
Event name. |
previous_value |
Value before the change |
value |
Value after the change |
description |
Detailed description of the event |
Table - json_to_send
Temporary storage of JSON payloads that will be sent to a REST interface if the tool is configured to do so.
Column |
Documentation |
---|---|
id |
Unique Identifier |
type |
The type of JSON data that is saved in the ‘jsondata’ column |
create_time |
Time when the JSON data was created |
status |
Internal status to keep track of the status of the transmission |
destination |
The destination to send the JSON to. |
jsondata |
The payload to send |
Table - table_change_history
This table keeps track of all changes that were done to an object after the initial load. An example could be that a column type was changed from char(10) to varchar(10). That kind of information is logged in this table
Column |
Documentation |
---|---|
hive_db |
Hive Database |
hive_table |
Hive Table |
column_name |
Column Name |
eventtime |
Time and date of the event |
event |
Event name. |
previous_value |
Value before the change |
value |
Value after the change |
description |
Detailed description of the event |
Table - yarn_statistics
Statistics about the Yarn applications that DBImport launches. Used for rough estimations of the resources used, and to connect a Yarn application ID to a DBImport job
Column |
Documentation |
---|---|
yarn_application_id |
Name of the application in Yarn |
operation_type |
What kind of DBImport job it is |
operation_tool |
The tool or operation that was used |
operation_timestamp |
The timestamp for the complete operation. Can be used to group together all applications for a specific operation, i.e. a complete import/export |
hive_db |
Name of the Hive DB. Used for imports |
hive_table |
Name of the Hive Table. Used for imports |
dbalias |
Name of the database alias. Used for exports |
target_schema |
Name of the target schema. Used for exports |
target_table |
Name for the target table. Used for exports |
application_start |
Timestamp when the operation started |
application_stop |
Timestamp when the operation was completed |
yarn_containers_total |
Number of containers used. For Spark and sqoop, this will be the maximum that was used during the operation |
yarn_containers_failed |
Number of containers that failed |
yarn_containers_killed |
Number of containers that were killed |
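The rough resource estimation this table supports can be done by aggregating the container counters per Hive table. A sketch with an in-memory SQLite stand-in (the real configuration database is typically MySQL/MariaDB; the application IDs and counts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE yarn_statistics (
    yarn_application_id TEXT, operation_type TEXT, hive_db TEXT, hive_table TEXT,
    yarn_containers_total INTEGER, yarn_containers_failed INTEGER,
    yarn_containers_killed INTEGER)""")
conn.executemany("INSERT INTO yarn_statistics VALUES (?,?,?,?,?,?,?)", [
    ("application_1700000000000_0001", "import", "sales", "orders", 8, 0, 0),
    ("application_1700000000000_0002", "import", "sales", "orders", 12, 1, 0)])

# Rough per-table resource usage across all recorded Yarn applications.
usage = conn.execute("""SELECT hive_db, hive_table,
                               SUM(yarn_containers_total) AS total_containers,
                               SUM(yarn_containers_failed) AS failed_containers
                        FROM yarn_statistics
                        GROUP BY hive_db, hive_table""").fetchall()
```

Grouping on operation_timestamp instead would give the per-operation view mentioned in the column documentation above.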