Import methods

When specifying a table to import, one of the most important settings is the import method. It basically defines whether it’s a full or an incremental import, together with what happens afterwards: maybe a history table should be created, or rows in the Hive table should be deleted even if it’s an incremental import. These major settings are what we call an import method.
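
An import method is thus just the combination of an Import Phase and an ETL Phase on the table definition. As a rough, hedged sketch (the column names import_phase_type, etl_phase_type, hive_db and hive_table are assumptions about the import_tables layout and may differ in your DBImport version):

  -- Sketch only: pick an import method by combining an import phase with an ETL phase.
  -- Column names are assumptions, not the definitive schema.
  UPDATE import_tables
  SET    import_phase_type = 'incr',                -- full / incr / oracle_flashback / mssql_change_tracking
         etl_phase_type    = 'merge_history_audit'  -- truncate_insert / insert / merge / merge_history_audit / none
  WHERE  hive_db = 'mydb' AND hive_table = 'mytable';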

Note

The stage numbers seen in the documentation are the internal stage ids used by DBImport. These numbers are also used in the import_stage and the import_retries_log tables.

Note

The default Hive table type is a managed table using ORC, if nothing else is specified.

Full import

A full import reads the entire source table and makes it available in Hive. Depending on what ETL phase is used, the Target table can be created or updated in different ways. The simplest and most common way is a plain full import, but there are more complex imports that compare all values in all columns to create audit information.

Truncate and insert

This is the most common way to import the data. The entire table is loaded by DBImport and then the target table in Hive is truncated and all rows are inserted again.
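
For reference, the truncate_insert ETL phase corresponds roughly to the following Hive statements. This is a simplified sketch; the database and table names are examples only:

  -- Simplified sketch of the truncate_insert ETL phase (names are examples)
  TRUNCATE TABLE mydb.mytable;
  INSERT INTO TABLE mydb.mytable
  SELECT * FROM etl_import_staging.mydb__mytable;   -- the external Import table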

Setting Configuration
Import Type (old) full
Import Phase full
ETL Phase truncate_insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1050
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Truncate target table
    Clears the Hive target table
  8. Copy rows from import to target table
    Insert all rows from the import table to the target table
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Insert

A full import with Insert will just add data to the target table without truncating it first. This will most likely create duplicates in the target table. As the appended data still needs to be validated, create_datalake_import must be set to 1 in the jdbc_connections table for the connection that the appended table is using.
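
A hedged example of enabling this on the connection; the key column name (dbalias) is an assumption about the jdbc_connections layout:

  -- Sketch only: enable validation support for appended data on the connection.
  -- The key column name (dbalias) is an assumption.
  UPDATE jdbc_connections
  SET    create_datalake_import = 1
  WHERE  dbalias = 'my_source_connection';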

Setting Configuration
Import Type (old) full_insert
Import Phase full
ETL Phase insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1050
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Copy rows from import to target table
    Insert all rows from the import table to the target table
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Full Merge

Doing a Full Merge operation instead of a normal full import gives you one additional thing: a number of new columns that contain information about when each row was last changed. This is a great way to get only changed data from a table that has no way of identifying whether a row was changed or not. The merge creates a fairly large job in Hive and, depending on the cluster size, might take all resources available in the cluster.
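
Conceptually, the merge in Hive looks something like the sketch below. The staging database, column names and the audit column are illustrative assumptions; the point is that rows are matched on the primary key and only updated when some non-key column actually differs:

  -- Conceptual sketch of the full merge (all names are illustrative)
  MERGE INTO mydb.mytable AS T
  USING etl_import_staging.mydb__mytable AS S
  ON T.pk_col = S.pk_col
  WHEN MATCHED AND (T.col1 <> S.col1 OR T.col2 <> S.col2)
    THEN UPDATE SET col1 = S.col1, col2 = S.col2, last_changed = current_timestamp()  -- hypothetical audit column
  WHEN NOT MATCHED
    THEN INSERT VALUES (S.pk_col, S.col1, S.col2, current_timestamp());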

Setting Configuration
Import Type (old) full_merge_direct
Import Phase full
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3250
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the Delete table
    The Delete table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merges all data in the Import table into the Target table, based on the primary key and on whether any value has changed in any of the columns.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Full Merge with History Audit

This is one of the largest import methods you can use. It fetches all rows from the source system and, once available in the Import table, the data is merged into the Target table. To know which rows have changed, all columns are compared between the Import and the Target table. When that is done, a second merge runs to find the rows that exist in the Target table but not in the Import table. These are the rows that were deleted in the source system. Once they are identified, they are inserted into the History Audit table and then deleted from the Target table. Depending on the size of the table, the different merge commands can create very large jobs in Hive. Keep that in mind when you select a timeslot to run the job.
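
The delete handling can be pictured like this; both statements are conceptual sketches with made-up names, not the literal statements DBImport runs:

  -- 1. Rows that exist in the Target table but not in the Import table were deleted
  --    in the source system; save them to the History Audit table
  INSERT INTO TABLE mydb.mytable_history
  SELECT T.* FROM mydb.mytable T
  LEFT JOIN etl_import_staging.mydb__mytable S ON T.pk_col = S.pk_col
  WHERE S.pk_col IS NULL;

  -- 2. Remove the same rows from the Target table; the Delete table is assumed
  --    to hold the same deleted keys, populated in the same way
  MERGE INTO mydb.mytable T
  USING mydb.mytable_delete D
  ON T.pk_col = D.pk_col
  WHEN MATCHED THEN DELETE;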

Setting Configuration
Import Type (old) full_merge_direct_history
Import Phase full
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3250
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Creating the Delete table
    The Delete table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  9. Merge Import table with Target table
    Merges all data in the Import table into the Target table, based on the primary key and on whether any value has changed in any of the columns.
  10. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  11. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  12. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

None

This import will not load anything into Hive. It will only create an external Import table, nothing more than that. It’s used when you have a complex ETL process that needs to be executed after the import and loading the Target table would just be a waste of time and resources.
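
What remains after this import is just an external table over the files written in the import phase, roughly like this (a sketch; database, columns, format and location are examples):

  -- Sketch of the external Import table that the None ETL phase leaves behind
  CREATE EXTERNAL TABLE etl_import_staging.mydb__mytable (
    id INT,
    name STRING,
    updated_ts TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION '/apps/dbimport/import/mydb/mytable';   -- example HDFS path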

Setting Configuration
Import Phase full
ETL Phase none
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.

Incremental import

An incremental import keeps track of how much data has been read from the source table and only imports the new data. There are two different ways to do this:

Append: If data is added to the source table and there is an integer-based column that increases for every new row (AUTO_INCREMENT), then Append mode is the way to go.

Last Modified: If there is a column with a date or timestamp type, and it gets a new date/timestamp for every new row, then Last Modified is the correct option.
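
In both cases only the rows inside the incremental window are read, conceptually with a filter like the ones below (column names and boundary values are illustrative):

  -- Append mode: an ever-increasing integer column
  SELECT * FROM source_table
  WHERE  id > 1000       -- max value saved by the previous import
    AND  id <= 2500;     -- max value found when this import starts

  -- Last Modified mode: a date or timestamp column
  SELECT * FROM source_table
  WHERE  last_updated >  '2024-01-01 00:00:00'
    AND  last_updated <= '2024-01-02 00:00:00';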

Insert

The changed data is read from the source and, once it’s available in the Import table, an insert operation is triggered in Hive to insert the newly fetched rows into the Target table.

Setting Configuration
Import Type (old) incr
Import Phase incr
ETL Phase insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table, based on the min and max values that were used by sqoop. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1150
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Copy rows from import to target table
    Insert all rows from the import table to the target table
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    If the incremental validation method is ‘incr’, a select count(1) from ... where incr_column > min_value and incr_column <= max_value is executed on the target table to get the number of rows. If it is ‘full’, a normal select count(1) from ... without any where statement is executed instead
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table, based on the min and max values that were used by sqoop. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. Keep in mind that if rows are deleted in the source table, this import won’t fetch them.
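
Compared to the Full Merge, no column-by-column comparison is needed here; the merge is a plain upsert on the primary key, roughly like this sketch (names are illustrative):

  -- Conceptual sketch of the incremental merge (names are illustrative)
  MERGE INTO mydb.mytable AS T
  USING etl_import_staging.mydb__mytable AS S
  ON T.pk_col = S.pk_col
  WHEN MATCHED     THEN UPDATE SET col1 = S.col1, col2 = S.col2
  WHEN NOT MATCHED THEN INSERT VALUES (S.pk_col, S.col1, S.col2);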

Setting Configuration
Import Type (old) incr_merge_direct
Import Phase incr
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. Keep in mind that if rows are deleted in the source table, this import won’t fetch them. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase incr
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  12. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Oracle Flashback

This import method uses Oracle Flashback Version Query to fetch only the rows changed since the last import. Compared to a standard incremental import, the main differences are that deletes are detected as well and that a timestamp or an integer-based column with increasing values is not required. The downside is that the table must support Oracle Flashback Version Query and that the undo area must be large enough to keep the changes between imports. Once the data is available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists, delete the data if that happened in the source system, and insert it if it’s missing.

Note

Oracle Flashback only supports sqoop as the import tool
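
The fetch itself is a standard Oracle Flashback Version Query; conceptually it looks like this (the table name, its columns and the SCN values are examples):

  -- Sketch of an Oracle Flashback Version Query between two SCNs
  SELECT versions_operation, versions_startscn, id, name, updated_ts
  FROM   myschema.mytable
         VERSIONS BETWEEN SCN 1000 AND 2000
  WHERE  versions_endtime IS NULL          -- only the latest version of each row
    AND  versions_operation IS NOT NULL;   -- only rows that changed in the interval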

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing.

Setting Configuration
Import Phase oracle_flashback
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files. This is where the Oracle Flashback VERSIONS BETWEEN query is executed against the source system.
  4. Get source table rowcount
    Run a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_OPERATION IS NOT NULL AND VERSIONS_ENDTIME IS NULL on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_ENDTIME IS NULL AND (VERSIONS_OPERATION != 'D' OR VERSIONS_OPERATION IS NULL) is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1211
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet files that sqoop wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase oracle_flashback
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files. This is where the Oracle Flashback VERSIONS BETWEEN query is executed against the source system.
  4. Get source table rowcount
    Run a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_OPERATION IS NOT NULL AND VERSIONS_ENDTIME IS NULL on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_ENDTIME IS NULL AND (VERSIONS_OPERATION != 'D' OR VERSIONS_OPERATION IS NULL) is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1211
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet files that sqoop wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  12. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Microsoft Change Tracking

This import method uses the Microsoft Change Tracking function to fetch only the rows changed since the last import. Compared to a standard incremental import, the main differences are that deletes are detected as well and that a timestamp or an integer-based column with increasing values is not required. The downside is that Change Tracking must be enabled on the database and table in the source system. Once the data is available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists, delete the data if that happened in the source system, and insert it if it’s missing.

Note

Microsoft Change Tracking only supports spark as the import tool
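
The fetch is based on the standard CHANGETABLE function in SQL Server; conceptually it looks like this (the table, its primary key column id and the last sync version 1000 are examples):

  -- Sketch of a SQL Server Change Tracking query for rows changed since the last sync version
  SELECT ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_VERSION, t.*
  FROM   CHANGETABLE(CHANGES dbo.mytable, 1000) AS ct
  LEFT JOIN dbo.mytable AS t
         ON t.id = ct.id;   -- LEFT JOIN so deleted rows are still returned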

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing.

Setting Configuration
Import Phase mssql_change_tracking
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Checking MSSQL Change Tracking functions
    Connects to the source system and verifies that the change tracking version range that spark will read is still valid. If it’s not, the import is forced to do a full initial import and the Target table is truncated.
  4. Spark import
    Executes the spark import and saves the source table in Orc files.
  5. Get source table rowcount
    Fetches the number of rows in the source table and saves it to the DBImport configuration database
  6. Update Atlas
    Update Atlas with source system information and lineage.
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Orc files that spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compare the number of rows from the source table with the number of rows in the import table.
    If the validation fails, the next import will restart from stage 3601
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase mssql_change_tracking
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Checking MSSQL Change Tracking functions
    Connects to the source system and verifies that the change tracking version range that spark will read is still valid. If it’s not, the import is forced to do a full initial import and the Target table is truncated.
  4. Spark import
    Executes the spark import and saves the source table in Orc files.
  5. Get source table rowcount
    Fetches the number of rows in the source table and saves it to the DBImport configuration database
  6. Update Atlas
    Update Atlas with source system information and lineage.
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Orc files that spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compare the number of rows from the source table with the number of rows in the import table.
    If the validation fails, the next import will restart from stage 3651
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.