Atlas Integration
Data governance and metadata are key resources in a data storage environment like Hadoop. The ability to get information such as schema, lineage, owners, contact information and audit data is extremely valuable for everyone who works with the data. The Atlas integration in DBImport helps with all of this information and makes sure that it gets imported into Atlas correctly.
Import/Export and Discovery
There are two different ways to get the information into Atlas.
Import/Export
When an import or export is running in DBImport, the process connects to a remote SQL database and reads the schema and data. The data ends up in Hive and the schema is used in the DBImport configuration database. This schema is also written to Atlas during the import or export, which means that only that single table gets imported into Atlas. So if there are 100 tables in the remote database and you are importing 5 of them to Hive, Atlas will only contain information about the 5 imported tables. The other 95 will be unavailable in Atlas.
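As a hedged illustration of what such a single-table registration could look like against the standard Atlas v2 REST API: the endpoint and the rdbms_table type are standard Atlas, but the host, credentials and qualifiedName format below are assumptions, not DBImport's actual code.

```python
# Minimal sketch: register one rdbms_table entity in Atlas over the v2 REST API.
import requests

ATLAS_URL = "http://atlas.example.com:21000"   # assumed Atlas host
AUTH = ("atlas_user", "secret")                # assumed basic-auth credentials

def register_table(db_qualified_name, table_name):
    """Create or update a single rdbms_table entity in Atlas."""
    payload = {
        "entity": {
            "typeName": "rdbms_table",
            "attributes": {
                # qualifiedName format is an assumption for illustration
                "qualifiedName": f"{table_name}@{db_qualified_name}",
                "name": table_name,
            },
            "relationshipAttributes": {
                # link the table to the database it belongs to
                "db": {
                    "typeName": "rdbms_db",
                    "uniqueAttributes": {"qualifiedName": db_qualified_name},
                }
            },
        }
    }
    resp = requests.post(f"{ATLAS_URL}/api/atlas/v2/entity", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()
```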
Discovery
If the DBImport server is started, it will connect to the different remote SQL databases configured in the jdbc_connections table, read the entire schema and import all tables and views into Atlas. Include and exclude filters can be specified in the jdbc_connections table to limit the number of items that get imported into Atlas. This is useful, for example, for excluding a complete schema that is used only for testing.
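To illustrate the filter logic, here is a minimal sketch assuming fnmatch-style wildcards where the exclude filter takes precedence over the include filter; the actual filter syntax used by DBImport may differ.

```python
# Hypothetical include/exclude evaluation during discovery.
from fnmatch import fnmatch

def discoverable(schema_table, include_filters, exclude_filters):
    """Return True if a schema.table name passes the filters."""
    # no include filters means everything is included by default (assumption)
    included = any(fnmatch(schema_table, f) for f in include_filters) \
               if include_filters else True
    excluded = any(fnmatch(schema_table, f) for f in exclude_filters)
    return included and not excluded

# Example: skip everything in a schema used only for testing
print(discoverable("TEST.CUSTOMERS", ["*"], ["TEST.*"]))  # False
print(discoverable("SALES.ORDERS", ["*"], ["TEST.*"]))    # True
```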
What will be stored in Atlas?
Regardless of whether the data comes from an import or export operation or is discovered by the DBImport server, the following information gets stored in Atlas.
rdbms_instance
This object contains information about where the instance is running, what databases exist in it and how to connect to the instance.
name | Name of the instance
databases | Links to the different databases available in the instance. These will be rdbms_db objects
hostname | Server that the instance is running on
port | Port of the instance
protocol | The protocol used to connect. Hardcoded to JDBC, as that is what DBImport uses
rdbms_type | What kind of instance it is. Examples are Oracle, MSSQL, MySQL, PostgreSQL and more
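As an illustration, an rdbms_instance entity built from these attributes could look like the sketch below. All values are made up and the qualifiedName format is an assumption; only the attribute names come from the table above.

```python
# Illustrative rdbms_instance entity as it could be submitted to Atlas.
rdbms_instance = {
    "typeName": "rdbms_instance",
    "attributes": {
        "qualifiedName": "mysql_src@cluster",     # assumed format
        "name": "mysql_src",
        "hostname": "dbserver1.example.com",
        "port": "3306",
        "protocol": "jdbc",                       # hardcoded by DBImport
        "rdbms_type": "mysql",
    },
}
```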
rdbms_db
This object contains information about the database, what tables exist in it and which instance it belongs to.
name | Name of the database
instance | Links to the rdbms_instance object that the database belongs to
tables | Links to the different tables available in the database. These will be rdbms_table objects
prodOrOther | Populated from the environment column in the jdbc_connections table
owner | Populated from the owner column in the jdbc_connections table
description | Populated from the description column in the jdbc_connections table
contact_info | Populated from the contact_info column in the jdbc_connections table
rdbms_table
This object contains information about the SQL table.
name | Name of the table
name_path | Schema of the table
db | Links to the rdbms_db object that the table belongs to
columns | Links to the different columns available in the table. These will be rdbms_column objects
comment | Table comment from the remote SQL database
type | Table or view
owner | Populated from the owner column in the jdbc_connections table
contact_info | Populated from the contact_info column in the jdbc_connections table
rdbms_column
This object contains information about the different columns that exist in a table.
name | Name of the column
table | Links to the rdbms_table object that the column belongs to
comment | Column comment from the remote SQL database
data_type | Type of the column
isNullable | True or False depending on whether the column allows NULL values
isPrimaryKey | True or False depending on whether the column is part of the primary key
length | Length of the column. If the type is varchar(45), the value 45 will be available here
owner | Populated from the owner column in the jdbc_connections table
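To illustrate the data_type and length attributes, here is a hypothetical helper that splits a remote column definition such as varchar(45) into those two values. It is a simplified illustration of the mapping, not DBImport's actual parsing code.

```python
# Split a SQL column definition into (data_type, length).
import re

def split_column_type(sql_type: str):
    """'varchar(45)' -> ('varchar', 45); 'int' -> ('int', None)."""
    match = re.fullmatch(r"(\w+)\((\d+)\)", sql_type.strip())
    if match:
        return match.group(1), int(match.group(2))
    # types without a length specifier carry no length attribute
    return sql_type.strip(), None

data_type, length = split_column_type("varchar(45)")
print(data_type, length)   # varchar 45
```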
DBImport_Process
This is the DBImport object that gets created in Atlas as part of the DBImport installation. The object contains information about the source and target and whether it is an import or an export, and it serves as the object that forms part of the lineage from the SQL table to the HDFS directory and further into Hive.
commandlineOpts | The entire command, including the parameters, that was used to run the DBImport tool
description | Description of what has been done in the process
startTime | Timestamp of when the DBImport process started
endTime | Timestamp of when the DBImport process stopped
importTool | Sqoop or Spark
input | Link to the input object. Usually rdbms_table for imports and hive_table for exports
output | Link to the output object. Usually hdfs_path for imports and rdbms_table for exports
operation | Import or Export
username | Name of the user executing the DBImport process
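As a sketch, a DBImport_Process entity for an import run could look like the following, using the attribute names from the table above. All values, the qualifiedName format and the ref() helper are made up for illustration. The input and output links are what tie the lineage together: rdbms_table to DBImport_Process to hdfs_path for an import.

```python
# Illustrative DBImport_Process entity for one import run.
def ref(type_name, qualified_name):
    # Hypothetical helper producing an Atlas object reference
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}

dbimport_process = {
    "typeName": "DBImport_Process",
    "attributes": {
        "qualifiedName": "DBImport.import.mysql_src.employees",  # assumed format
        "name": "DBImport import of mysql_src.employees",
        "operation": "import",
        "importTool": "spark",
        "commandlineOpts": "<full command line used to start the import>",
        "startTime": "2024-01-15T02:00:00Z",
        "endTime": "2024-01-15T02:04:31Z",
        "username": "dbimport",
        # Lineage links as documented above; Atlas's base Process type models
        # these as the plural inputs/outputs arrays.
        "input": ref("rdbms_table", "employees@mysql_src"),
        "output": ref("hdfs_path", "/data/import/mysql_src/employees@cluster"),
    },
}
```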
Relationships between objects
There are different relationships created for the rdbms objects. For example, for an rdbms_table to exist in an rdbms_db, there has to be a relationship between the two. Those kinds of relationships are documented in the tables above, on each object where they exist. In addition, there are relationships that get created depending on the remote SQL configuration. The following is a list of those relationships:
Foreign Keys
There can be foreign keys on rdbms_table objects that link two tables together in the same database. The information about these relationships comes from the remote SQL tables and reflects how the remote tables are linked together. A sketch of how such a relationship could be registered in Atlas follows the note below.
Note
Indexes will also be a kind of relationship that gets created. Support for indexes on rdbms objects in Atlas will come in a future version of DBImport.
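As a hedged sketch, a discovered foreign key could be registered through the standard Atlas v2 REST API using the rdbms_foreign_key type from Atlas's bundled RDBMS model. The relationship-attribute names, host, credentials and qualifiedName values below are assumptions for illustration, not DBImport's actual code.

```python
# Register a foreign key linking two rdbms_table objects in the same database.
import requests

payload = {
    "entity": {
        "typeName": "rdbms_foreign_key",
        "attributes": {
            "qualifiedName": "fk_orders_customer@mysql_src",  # assumed format
            "name": "fk_orders_customer",
        },
        "relationshipAttributes": {
            # assumed attribute names: the owning table and the referenced table
            "table": {"typeName": "rdbms_table",
                      "uniqueAttributes": {"qualifiedName": "orders@mysql_src"}},
            "references_table": {"typeName": "rdbms_table",
                                 "uniqueAttributes": {"qualifiedName": "customers@mysql_src"}},
        },
    }
}
resp = requests.post("http://atlas.example.com:21000/api/atlas/v2/entity",
                     json=payload, auth=("atlas_user", "secret"))
resp.raise_for_status()
```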
Graphical representation of what the relationships can look like in Atlas
Lineage
One of the important functions of Atlas is to show lineage, the graphical representation of how data flows through your system. DBImport creates lineage for all imports and exports. This connects the rdbms_table objects in the remote SQL databases with the Hadoop environment. No configuration is needed to use this functionality. All you need to do is make sure the Atlas integration is enabled, and a lineage like the example below will be created in Atlas.