Atlas Integration

Data governance and metadata are key resources in a data storage environment like Hadoop. The ability to get information such as schema, lineage, owners, contact information and audit data is extremely valuable for everybody who works with the data. The Atlas integration in DBImport handles all of this information and makes sure that it gets imported into Atlas correctly.

Import/Export and Discovery

There are two different ways to get the information into Atlas.

Import/Export

When an import or export is running in DBImport, the process will connect to a remote SQL database and read the schema and data. The data will end up in Hive and the schema will be used in the DBImport configuration database. This schema will also be written to Atlas during import and export. That means that only that single table gets imported into Atlas. So if there are 100 tables in the remote database and you are importing 5 of them to Hive, Atlas will only contain information about the 5 imported tables. The other 95 will be unavailable in Atlas.

Discovery

When the DBImport server is running, it connects to the different remote SQL databases configured in the jdbc_connections table, reads the entire schema and imports all tables and views into Atlas. Include and exclude filters can be specified in the jdbc_connections table in order to limit the number of items that get imported into Atlas. This is useful, for example, for excluding a complete schema that is used only for testing.
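To illustrate how such filters narrow down discovery, here is a minimal sketch. The function name and the wildcard pattern syntax are assumptions for illustration; the actual filter syntax is defined by the values stored in the jdbc_connections table.

```python
import fnmatch

def discovery_allowed(schema_table, include_filter=None, exclude_filter=None):
    """Return True if a 'schema.table' name passes the discovery filters.

    Hypothetical helper, not DBImport's real implementation. Wildcard
    ('*') matching is assumed for the filter patterns.
    """
    if include_filter and not fnmatch.fnmatch(schema_table, include_filter):
        return False
    if exclude_filter and fnmatch.fnmatch(schema_table, exclude_filter):
        return False
    return True

# Exclude a complete schema that is used only for testing
print(discovery_allowed("sales.orders", exclude_filter="test.*"))  # True
print(discovery_allowed("test.scratch", exclude_filter="test.*"))  # False
```

Tables failing the filter are simply skipped during discovery, so they never show up as rdbms_table objects in Atlas.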

What will be stored in Atlas?

Regardless of whether the data comes from an import or export operation or is discovered by the DBImport server, the following information gets stored in Atlas.

rdbms_instance

This object contains information about where the instance is running, what databases exist in it and how to connect to the instance.

name Name of the instance
databases Links to the different databases available in the instance. This will be rdbms_db objects
hostname Server that the instance is running at
port Port to the instance
protocol The protocol used to connect. Hardcoded to JDBC, as that's what DBImport uses
rdbms_type What kind of instance it is. Examples are Oracle, MSSQL, MySQL, Postgres and more
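As a rough sketch of how these attributes come together, the following shows what an rdbms_instance entity payload might look like when posted to Atlas. The qualifiedName convention and all values are assumptions for illustration, not DBImport's actual naming scheme.

```python
import json

# Sketch of an Atlas entity payload for an rdbms_instance, built from the
# attributes listed above. All values, including the qualifiedName, are
# hypothetical examples.
instance_entity = {
    "entity": {
        "typeName": "rdbms_instance",
        "attributes": {
            "qualifiedName": "mysql-prod@cluster1",  # hypothetical convention
            "name": "mysql-prod",
            "hostname": "db01.example.com",
            "port": "3306",
            "protocol": "jdbc",
            "rdbms_type": "MySQL",
        },
    }
}

print(json.dumps(instance_entity, indent=2))
```

The databases attribute is not set here; in Atlas the instance-to-database links are maintained as relationships to rdbms_db objects rather than plain attribute values.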

rdbms_db

This object contains information about the database, what tables exist in it and which instance it belongs to.

name Name of the database
instance Links to the rdbms_instance object that the database belongs to
tables Links to the different tables available in the database. This will be rdbms_table objects
prodOrOther Gets populated by the environment column in jdbc_connections table
owner Gets populated by the owner column in jdbc_connections table
description Gets populated by the description column in jdbc_connections table
contact_info Gets populated by the contact_info column in jdbc_connections table

rdbms_table

This object contains information about the SQL table.

name Name of the table
name_path Schema of the table
db Links to the rdbms_db object that the table belongs to
columns Links to the different columns available in the table. This will be rdbms_column objects
comment Table comment from remote SQL database
type Table or view
owner Gets populated by the owner column in jdbc_connections table
contact_info Gets populated by the contact_info column in jdbc_connections table

rdbms_column

This object contains information about the different columns that exist in a table.

name Name of the column
table Links to the rdbms_table object that the column belongs to
comment Column comment from the remote SQL database
data_type Type of column
isNullable True or False depending on whether the column is specified to allow null or not
isPrimaryKey True or False if the column is part of the primary key
length Length of the column. If the type is a varchar(45), the value 45 will be available here
owner Gets populated by the owner column in jdbc_connections table
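The relation between data_type and length can be sketched as follows. This is a hypothetical helper for illustration only; DBImport actually reads these values from the remote database's metadata rather than parsing type strings.

```python
import re

def split_sql_type(sql_type):
    """Split a SQL type string such as 'varchar(45)' into the data_type
    and length values stored on an rdbms_column.

    Hypothetical helper, not DBImport's real implementation.
    """
    match = re.fullmatch(r"(\w+)\((\d+)\)", sql_type.strip())
    if match:
        return match.group(1), int(match.group(2))
    return sql_type.strip(), None

print(split_sql_type("varchar(45)"))  # ('varchar', 45)
print(split_sql_type("int"))          # ('int', None)
```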

DBImport_Process

This is the DBImport object that gets created in Atlas as part of the DBImport installation. The object contains information about the source and target, if it’s an import or export and also serves as the object that will be part of the Lineage from the SQL table to the HDFS directory and further into Hive.

commandlineOpts The entire command, including the parameters, that was used to run the DBImport tool
description Description of what has been done in the process
startTime Timestamp of when the DBImport process started
endTime Timestamp of when the DBImport process stopped
importTool Sqoop or Spark
input Link to the input object. Usually rdbms_table for imports and hive_table for exports
output Link to the output object. Usually hdfs_path for imports and rdbms_table for exports
operation Import or Export
username Name of the user executing the DBImport process
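The attributes above can be pictured as an entity that sits between source and target, which is what makes the lineage work. The sketch below shows a DBImport_Process entity for an import; the qualifiedName values and the uniqueAttributes reference style are illustrative assumptions, not DBImport's actual payloads.

```python
# Sketch of a DBImport_Process entity for an import. The input points at
# the source rdbms_table and the output at the target hdfs_path, which is
# what ties the process into the Atlas lineage graph. All names are
# hypothetical examples.
process_entity = {
    "typeName": "DBImport_Process",
    "attributes": {
        "qualifiedName": "import.sales.orders@cluster1",  # hypothetical
        "operation": "Import",
        "importTool": "Spark",
        "username": "dbimport",
        "input": {
            "typeName": "rdbms_table",
            "uniqueAttributes": {"qualifiedName": "sales.orders@mysql-prod"},
        },
        "output": {
            "typeName": "hdfs_path",
            "uniqueAttributes": {"qualifiedName": "/data/sales/orders@cluster1"},
        },
    },
}

print(process_entity["attributes"]["input"]["typeName"],
      "->", process_entity["attributes"]["output"]["typeName"])
```

For an export the direction is reversed: the input is usually a hive_table and the output an rdbms_table, as listed above.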

Relationships between objects

Different relationships are created for the rdbms objects. For example, in order for an rdbms_table to exist in an rdbms_db, there is a relationship between the two. Those kinds of relationships are documented in the tables above on each object where they exist. In addition, there are relationships that get created depending on the remote SQL configuration. This is a list of those relationships:

Foreign Keys

There can be foreign keys on rdbms_tables that link two tables together in the same database. The information about these relationships comes from the remote SQL tables and is a reflection of how the remote tables are linked together.

Note

Indexes will also be a kind of relationship that gets created. Support for indexes on rdbms objects in Atlas will come in a future version of DBImport.

Graphical representation of what the relationships can look like in Atlas

_images/atlas_relationship.jpg

Lineage

One of the important functions of Atlas is to show lineage, the graphical representation of how data flows through your system. DBImport creates lineage for all imports and exports. This connects the rdbms_table objects in the remote SQL databases with the Hadoop environment. No configuration is needed to use this functionality. All you need to do is make sure that the Atlas integration is enabled, and lineage like the example below will be created in Atlas.

_images/atlas_lineage.jpg