Import methods

When specifying a table to import, one of the most important settings is the import method. It basically defines whether it’s a full or an incremental import, together with what happens afterwards: maybe a history table should be created, or rows in the Hive table should be deleted even if it’s an incremental import. These major settings are what we call an import method.
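
An import method is thus just the combination of an Import Phase and an ETL Phase on the table definition. As a rough, hedged sketch (the column names import_phase_type, etl_phase_type, hive_db and hive_table are assumptions about the import_tables layout and may differ in your DBImport version):

  -- Sketch only: pick an import method by combining an import phase with an ETL phase.
  -- Column names are assumptions, not the definitive schema.
  UPDATE import_tables
  SET    import_phase_type = 'incr',                -- full / incr / oracle_flashback / mssql_change_tracking
         etl_phase_type    = 'merge_history_audit'  -- truncate_insert / insert / merge / merge_history_audit / none
  WHERE  hive_db = 'mydb' AND hive_table = 'mytable';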

Note

The stage numbers seen in the documentation are the internal stage ids used by DBImport. These numbers are also used in the import_stage and the import_retries_log tables.

Note

The default Hive table type is a managed table using ORC, if nothing else is specified.

Full import

A full import reads the entire source table and makes it available in Hive. Depending on what ETL phase is used, the Target table can be created or updated in different ways. The simplest and most common way is a plain full import, but there are more complex imports that compare all values in all columns to create audit information.

Truncate and insert

This is the most common way to import the data. The entire table is loaded by DBImport and then the target table in Hive is truncated and all rows are inserted again.
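
For reference, the truncate_insert ETL phase corresponds roughly to the following Hive statements. This is a simplified sketch; the database and table names are examples only:

  -- Simplified sketch of the truncate_insert ETL phase (names are examples)
  TRUNCATE TABLE mydb.mytable;
  INSERT INTO TABLE mydb.mytable
  SELECT * FROM etl_import_staging.mydb__mytable;   -- the external Import table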

Setting Configuration
Import Type (old) full
Import Phase full
ETL Phase truncate_insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1050
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Truncate target table
    Clears the Hive target table
  8. Copy rows from import to target table
    Insert all rows from the import table to the target table
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Insert

A full import with Insert will just add data to the target table without truncating it first. This will most likely create duplicates in the target table. As the appended data still needs to be validated, create_datalake_import must be set to 1 in the jdbc_connections table for the connection that the appended table is using.
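
A hedged example of enabling this on the connection; the key column name (dbalias) is an assumption about the jdbc_connections layout:

  -- Sketch only: enable validation support for appended data on the connection.
  -- The key column name (dbalias) is an assumption.
  UPDATE jdbc_connections
  SET    create_datalake_import = 1
  WHERE  dbalias = 'my_source_connection';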

Setting Configuration
Import Type (old) full_insert
Import Phase full
ETL Phase insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1050
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Copy rows from import to target table
    Insert all rows from the import table to the target table
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Full Merge

Doing a Full Merge operation instead of a normal full import gives you one additional thing: a number of new columns that contain information about when each row was last changed. This is a great way to get only changed data from a table that has no way of identifying whether a row was changed or not. The merge creates a fairly large job in Hive and, depending on the cluster size, might take all resources available in the cluster.
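
Conceptually, the merge in Hive looks something like the sketch below. The staging database, column names and the audit column are illustrative assumptions; the point is that rows are matched on the primary key and only updated when some non-key column actually differs:

  -- Conceptual sketch of the full merge (all names are illustrative)
  MERGE INTO mydb.mytable AS T
  USING etl_import_staging.mydb__mytable AS S
  ON T.pk_col = S.pk_col
  WHEN MATCHED AND (T.col1 <> S.col1 OR T.col2 <> S.col2)
    THEN UPDATE SET col1 = S.col1, col2 = S.col2, last_changed = current_timestamp()  -- hypothetical audit column
  WHEN NOT MATCHED
    THEN INSERT VALUES (S.pk_col, S.col1, S.col2, current_timestamp());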

Setting Configuration
Import Type (old) full_merge_direct
Import Phase full
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3250
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the Delete table
    The Delete table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merges all data in the Import table into the Target table, based on the primary key and on whether any value has changed in any of the columns.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

Full Merge with History Audit

This is one of the largest import methods you can use. It fetches all rows from the source system and, once available in the Import table, the data is merged into the Target table. To know which rows have changed, all columns are compared between the Import and the Target table. When that is done, a second merge runs to find the rows that exist in the Target table but not in the Import table. These are the rows that were deleted in the source system. Once they are identified, they are inserted into the History Audit table and then deleted from the Target table. Depending on the size of the table, the different merge commands can create very large jobs in Hive. Keep that in mind when you select a timeslot to run the job.
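
The delete handling can be pictured like this; both statements are conceptual sketches with made-up names, not the literal statements DBImport runs:

  -- 1. Rows that exist in the Target table but not in the Import table were deleted
  --    in the source system; save them to the History Audit table
  INSERT INTO TABLE mydb.mytable_history
  SELECT T.* FROM mydb.mytable T
  LEFT JOIN etl_import_staging.mydb__mytable S ON T.pk_col = S.pk_col
  WHERE S.pk_col IS NULL;

  -- 2. Remove the same rows from the Target table; the Delete table is assumed
  --    to hold the same deleted keys, populated in the same way
  MERGE INTO mydb.mytable T
  USING mydb.mytable_delete D
  ON T.pk_col = D.pk_col
  WHEN MATCHED THEN DELETE;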

Setting Configuration
Import Type (old) full_merge_direct_history
Import Phase full
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3250
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Creating the Delete table
    The Delete table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  9. Merge Import table with Target table
    Merges all data in the Import table into the Target table, based on the primary key and on whether any value has changed in any of the columns.
  10. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  11. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  12. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1054

None

This import will not load anything into Hive. It will only create an external Import table, nothing more than that. It’s used when you have a complex ETL process that needs to be executed after the import and loading the Target table would just be a waste of time and resources.
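
What remains after this import is just an external table over the files written in the import phase, roughly like this (a sketch; database, columns, format and location are examples):

  -- Sketch of the external Import table that the None ETL phase leaves behind
  CREATE EXTERNAL TABLE etl_import_staging.mydb__mytable (
    id INT,
    name STRING,
    updated_ts TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION '/apps/dbimport/import/mydb/mytable';   -- example HDFS path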

Setting Configuration
Import Phase full
ETL Phase none
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Get source table rowcount
    Run a select count(1) from ... on the source table to get the number of rows
  4. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1011
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.

Incremental import

An incremental import keeps track of how much data has been read from the source table and only imports the new data. There are two different ways to do this:

Append: If data is added to the source table and there is an integer-based column that increases for every new row (AUTO_INCREMENT), then Append mode is the way to go.

Last Modified: If there is a column with a date or timestamp type, and it gets a new date/timestamp for every new row, then Last Modified is the correct option.
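
In both cases only the rows inside the incremental window are read, conceptually with a filter like the ones below (column names and boundary values are illustrative):

  -- Append mode: an ever-increasing integer column
  SELECT * FROM source_table
  WHERE  id > 1000       -- max value saved by the previous import
    AND  id <= 2500;     -- max value found when this import starts

  -- Last Modified mode: a date or timestamp column
  SELECT * FROM source_table
  WHERE  last_updated >  '2024-01-01 00:00:00'
    AND  last_updated <= '2024-01-02 00:00:00';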

Insert

The changed data is read from the source and, once it’s available in the Import table, an insert operation is triggered in Hive to insert the newly fetched rows into the Target table.

Setting Configuration
Import Type (old) incr
Import Phase incr
ETL Phase insert
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table, based on the min and max values that were used by sqoop. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1150
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Copy rows from import to target table
    Insert all rows from the import table to the target table
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    If the incremental validation method is ‘incr’, a select count(1) from ... where incr_column > min_value and incr_column <= max_value is executed on the target table to get the number of rows. If it is ‘full’, a normal select count(1) from ... without any where statement is executed instead
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table, based on the min and max values that were used by sqoop. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. Keep in mind that if rows are deleted in the source table, this import won’t fetch them.
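
Compared to the Full Merge, no column-by-column comparison is needed here; the merge is a plain upsert on the primary key, roughly like this sketch (names are illustrative):

  -- Conceptual sketch of the incremental merge (names are illustrative)
  MERGE INTO mydb.mytable AS T
  USING etl_import_staging.mydb__mytable AS S
  ON T.pk_col = S.pk_col
  WHEN MATCHED     THEN UPDATE SET col1 = S.col1, col2 = S.col2
  WHEN NOT MATCHED THEN INSERT VALUES (S.pk_col, S.col1, S.col2);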

Setting Configuration
Import Type (old) incr_merge_direct
Import Phase incr
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. Keep in mind that if rows are deleted in the source table, this import won’t fetch them. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase incr
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files
  4. Get source table rowcount
    Run a select count(1) from ... where incr_column > min_value and incr_column <= max_value on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... without any where statement is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1111
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet or ORC files that sqoop or spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  12. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Oracle Flashback

This import method uses Oracle Flashback Version Query to fetch only the rows changed since the last import. Compared to a standard incremental import, the main differences are that deletes are detected as well and that a timestamp or an integer-based column with increasing values is not required. The downside is that the table must support Oracle Flashback Version Query and that the undo area must be large enough to keep the changes between imports. Once the data is available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists, delete the data if that happened in the source system, and insert it if it’s missing.

Note

Oracle Flashback only supports sqoop as the import tool
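
The fetch itself is a standard Oracle Flashback Version Query; conceptually it looks like this (the table name, its columns and the SCN values are examples):

  -- Sketch of an Oracle Flashback Version Query between two SCNs
  SELECT versions_operation, versions_startscn, id, name, updated_ts
  FROM   myschema.mytable
         VERSIONS BETWEEN SCN 1000 AND 2000
  WHERE  versions_endtime IS NULL          -- only the latest version of each row
    AND  versions_operation IS NOT NULL;   -- only rows that changed in the interval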

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing.

Setting Configuration
Import Phase oracle_flashback
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files. This is where the Oracle Flashback VERSIONS BETWEEN query is executed against the source system.
  4. Get source table rowcount
    Run a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_OPERATION IS NOT NULL AND VERSIONS_ENDTIME IS NULL on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_ENDTIME IS NULL AND (VERSIONS_OPERATION != 'D' OR VERSIONS_OPERATION IS NULL) is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1211
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet files that sqoop wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase oracle_flashback
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Sqoop import
    Executes the sqoop import and saves the source table in Parquet files. This is where the Oracle Flashback VERSIONS BETWEEN query is executed against the source system.
  4. Get source table rowcount
    Run a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_OPERATION IS NOT NULL AND VERSIONS_ENDTIME IS NULL on the source table to get the number of rows. Due to the where statement, it only validates the incremental rows
    If the incremental validation method is ‘full’, then a select count(1) from ... VERSIONS BETWEEN SCN <min_value> AND <max_value> WHERE VERSIONS_ENDTIME IS NULL AND (VERSIONS_OPERATION != 'D' OR VERSIONS_OPERATION IS NULL) is also executed against the source table.
  5. Validate sqoop import
    Validates that sqoop read the same number of rows that exist in the source system. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 1211
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Parquet files that sqoop wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3301
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Creating the History table
    The History table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  8. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  9. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  10. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  11. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  12. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Microsoft Change Tracking

This import method uses the Microsoft Change Tracking function to fetch only the rows changed since the last import. Compared to a standard incremental import, the main differences are that deletes are detected as well and that a timestamp or an integer-based column with increasing values is not required. The downside is that Change Tracking must be enabled on the database and table in the source system. Once the data is available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists, delete the data if that happened in the source system, and insert it if it’s missing.

Note

Microsoft Change Tracking only supports spark as the import tool
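
The fetch is based on the standard CHANGETABLE function in SQL Server; conceptually it looks like this (the table, its primary key column id and the last sync version 1000 are examples):

  -- Sketch of a SQL Server Change Tracking query for rows changed since the last sync version
  SELECT ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_VERSION, t.*
  FROM   CHANGETABLE(CHANGES dbo.mytable, 1000) AS ct
  LEFT JOIN dbo.mytable AS t
         ON t.id = ct.id;   -- LEFT JOIN so deleted rows are still returned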

Merge

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing.

Setting Configuration
Import Phase mssql_change_tracking
ETL Phase merge
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Checking MSSQL Change Tracking functions
    Connects to the source system and verifies that the change tracking version range that spark will read is still valid. If it’s not, the import is forced to do a full initial import and the Target table is truncated.
  4. Spark import
    Executes the spark import and saves the source table in Orc files.
  5. Get source table rowcount
    Fetches the number of rows in the source table and saves it to the DBImport configuration database
  6. Update Atlas
    Update Atlas with source system information and lineage.
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Orc files that spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compare the number of rows from the source table with the number of rows in the import table.
    If the validation fails, the next import will restart from stage 3601
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.

Merge with History Audit

The changed data is read from the source and, once it’s available in the Import table, a merge operation is executed in Hive. The merge is based on the Primary Keys and will update the information in the Target table if it already exists and insert it if it’s missing. After the merge is completed, all new and changed rows are also inserted into the History Audit Table, so it’s possible to track the changes in the table over time.

Setting Configuration
Import Phase mssql_change_tracking
ETL Phase merge_history_audit
  1. Getting source tableschema
    This stage connects to the source database, reads all columns, column types, primary keys, foreign keys and comments, and saves them to the configuration database.
  2. Clear table rowcount
    Removes the row count that was recorded during the previous import of the table
  3. Checking MSSQL Change Tracking functions
    Connects to the source system and verifies that the change tracking version range that spark will read is still valid. If it’s not, the import is forced to do a full initial import and the Target table is truncated.
  4. Spark import
    Executes the spark import and saves the source table in Orc files.
  5. Get source table rowcount
    Fetches the number of rows in the source table and saves it to the DBImport configuration database
  6. Update Atlas
    Update Atlas with source system information and lineage.
  1. Stage1 Completed
    This is just a mark saying that the stage 1 is completed. If you selected to run only a stage 1 import, this is where the import will end.
  1. Connecting to Hive
    Connects to Hive and runs a test to verify that Hive is working properly
  2. Creating the import table in the staging database
    The import table is created. This is an external table based on the Orc files that spark wrote. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  3. Get Import table rowcount
    Run a select count(1) from ... on the Import table in Hive to get the number of rows
  4. Validate import table
    Compare the number of rows from the source table with the number of rows in the import table.
    If the validation fails, the next import will restart from stage 3651
  5. Removing Hive locks by force
    Due to a bug in Hive, we need to remove the locks by force. This connects to the metadatabase and removes them from there
  6. Creating the Target table
    The target table is created. Any changes to the existing table, compared to the information received in the Getting source tableschema stage, are applied here.
  7. Merge Import table with Target table
    Merge all data in the Import table into the Target table based on PK.
  8. Update Hive statistics on target table
    Updates all the statistics in Hive for the table
  9. Get Target table rowcount
    Run a select count(1) from ... on the Target table in Hive to get the number of rows
  10. Validate import table
    Compares the number of rows in the source table with the number of rows in the import table. The counts don’t have to match 100%; the allowed difference is configured in the import_tables.validate_diff_allowed column.
    If the validation fails, the next import will restart from stage 3304
  11. Saving pending incremental values
    In order to start the next incremental import from the last entry that the current import read, we are saving the min and max values into the import_tables table. The next import will then start to read from the next record after the max we read this time.