Atlas Integration

Data governance and metadata are among the key resources in a data storage environment like Hadoop. The ability to get information such as schema, lineage, owners, contact information and audits is extremely valuable for everyone who works with the data. The Atlas integration in DBImport helps with all of this information and makes sure that it gets imported into Atlas correctly.

Import/Export and Discovery

There are two different ways to get the information into Atlas.

Import/Export

When an import or export is running in DBImport, the process connects to a remote SQL database and reads the schema and data. The data ends up in Hive and the schema is used in the DBImport configuration database. This schema is also written to Atlas during import and export, but only for the table being processed. So if there are 100 tables in the remote database and you are importing 5 of them to Hive, Atlas will only contain information about the 5 imported tables. The other 95 will be unavailable in Atlas.
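
To check that a table has reached Atlas after a run, you can query the Atlas REST API directly. The following is a minimal sketch using Python and the standard Atlas v2 basic search endpoint; the Atlas URL, the credentials and the table name customer_orders are placeholder assumptions, not values provided by DBImport.

    import requests

    ATLAS_URL = "https://atlas.example.com:21000"   # assumption: your Atlas endpoint
    AUTH = ("atlas_user", "atlas_password")         # assumption: basic authentication

    def find_rdbms_table(table_name):
        """Search Atlas for rdbms_table entities matching the given name."""
        response = requests.get(
            f"{ATLAS_URL}/api/atlas/v2/search/basic",
            params={"typeName": "rdbms_table", "query": table_name, "limit": 10},
            auth=AUTH,
            verify=False,   # only acceptable in test environments without proper TLS
        )
        response.raise_for_status()
        return response.json().get("entities", [])

    # Print qualifiedName and guid for every match
    for entity in find_rdbms_table("customer_orders"):
        print(entity.get("attributes", {}).get("qualifiedName"), entity["guid"])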

Discovery

If the DBImport server is started, it will connect to the different remote SQL databases configured in the jdbc_connections table, read the entire schema and import all tables and views into Atlas. Include and exclude filters can be specified in the jdbc_connections table in order to limit the number of items that get imported into Atlas. This is useful, for example, to exclude a complete schema that is only used for testing.
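
As a sketch of how such a filter could be configured, the snippet below updates the jdbc_connections table in the DBImport configuration database. The column names atlas_discovery, atlas_include_filter and atlas_exclude_filter, the wildcard syntax and the connection alias oracle_sales are assumptions; check your own configuration database for the exact names and filter format.

    import mysql.connector  # the DBImport configuration database is assumed to be MySQL/MariaDB

    conn = mysql.connector.connect(
        host="dbimport-config.example.com",   # assumption: configuration database host
        user="dbimport",
        password="secret",
        database="DBImport",
    )

    cursor = conn.cursor()
    cursor.execute(
        """
        UPDATE jdbc_connections
           SET atlas_discovery      = 1,          -- let the DBImport server discover this connection
               atlas_include_filter = '*',        -- discover everything ...
               atlas_exclude_filter = 'TEST.*'    -- ... except the schema used only for testing
         WHERE dbalias = 'oracle_sales'
        """
    )
    conn.commit()
    cursor.close()
    conn.close()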

What will be stored in Atlas?

Regardless of whether the data comes from an import or export operation or is discovered by the DBImport server, the following information gets stored in Atlas.

rdbms_instance

This object contains information about where the instance is running, what databases exist in it and how to connect to it.

name

Name of the instance

databases

Links to the different databases available in the instance. These will be rdbms_db objects

hostname

Server that the instance is running on

port

Port to the instance

protocol

The protocol used to connect. Hardcoded to JDBC, as that is what DBImport uses

rdbms_type

What kind of instance it is. Examples are Oracle, MSSQL, MySQL, Postgres and more

rdbms_db

This object contains information about the database, what tables exist in it and which instance it belongs to. A sketch covering both the instance and the database follows the attribute list below.

name

Name of the database

instance

Links to the rdbms_instance object that the database belongs to

tables

Links to the different tables available in the database. These will be rdbms_table objects

prodOrOther

Gets populated by the environment column in jdbc_connections table

owner

Gets populated by the owner column in jdbc_connections table

description

Gets populated by the description column in jdbc_connections table

contact_info

Gets populated by the contact_info column in jdbc_connections table
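
As a sketch of how the attributes in the two lists above could map onto Atlas entities, the snippet below creates an rdbms_instance together with one rdbms_db through the Atlas v2 bulk entity endpoint. The qualifiedName layout and all attribute values are illustrative assumptions, not the exact values DBImport generates.

    import requests

    ATLAS_URL = "https://atlas.example.com:21000"   # assumption: your Atlas endpoint
    AUTH = ("atlas_user", "atlas_password")         # assumption: basic authentication

    payload = {
        "entities": [
            {
                "typeName": "rdbms_instance",
                "guid": "-1",                                # temporary guid, used for linking below
                "attributes": {
                    "qualifiedName": "oracle_sales",         # illustrative format
                    "name": "oracle_sales",
                    "hostname": "oracle01.example.com",
                    "port": "1521",
                    "protocol": "jdbc",                      # DBImport always connects over JDBC
                    "rdbms_type": "Oracle",
                },
            },
            {
                "typeName": "rdbms_db",
                "guid": "-2",
                "attributes": {
                    "qualifiedName": "orders@oracle_sales",  # illustrative format
                    "name": "orders",
                    "prodOrOther": "prod",                   # from the environment column
                    "owner": "sales_team",                   # from the owner column
                    "description": "Order management database",
                    "contact_info": "sales-dba@example.com",
                },
                "relationshipAttributes": {
                    "instance": {"guid": "-1"},              # link back to the instance above
                },
            },
        ]
    }

    response = requests.post(
        f"{ATLAS_URL}/api/atlas/v2/entity/bulk",
        json=payload,
        auth=AUTH,
        verify=False,
    )
    response.raise_for_status()
    print(response.json().get("guidAssignments", {}))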

rdbms_table

This object contains information about the SQL table. A sketch of how these attributes map onto an Atlas entity follows the list below.

name

Name of the table

name_path

Schema of the table

db

Links to the rdbms_db object that the table belongs to

columns

Links to the different columns available in the table. These will be rdbms_column objects

comment

Table comment from remote SQL database

type

Table or view.

owner

Gets populated by the owner column in jdbc_connections table

contact_info

Gets populated by the contact_info column in jdbc_connections table
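
As a sketch of how the attributes above map onto Atlas, the following creates (or updates) an rdbms_table entity through the generic Atlas v2 entity endpoint. The qualifiedName layout, the attribute values and the reference to the rdbms_db object are illustrative assumptions; the exact values that DBImport generates may differ.

    import requests

    ATLAS_URL = "https://atlas.example.com:21000"   # assumption: your Atlas endpoint
    AUTH = ("atlas_user", "atlas_password")         # assumption: basic authentication

    table_entity = {
        "entity": {
            "typeName": "rdbms_table",
            "attributes": {
                "qualifiedName": "orders.customer_orders@oracle_sales",  # illustrative format
                "name": "customer_orders",
                "name_path": "ORDERS",                   # schema of the table
                "comment": "Orders placed by customers",
                "type": "table",                         # table or view
                "owner": "sales_team",
                "contact_info": "sales-dba@example.com",
            },
            "relationshipAttributes": {
                # Link the table to the rdbms_db object it belongs to
                "db": {
                    "typeName": "rdbms_db",
                    "uniqueAttributes": {"qualifiedName": "orders@oracle_sales"},
                },
            },
        }
    }

    response = requests.post(
        f"{ATLAS_URL}/api/atlas/v2/entity",
        json=table_entity,
        auth=AUTH,
        verify=False,
    )
    response.raise_for_status()
    print(response.json().get("guidAssignments", {}))

The rdbms_column objects described in the next section follow the same pattern, with a table relationship pointing back at the rdbms_table instead of a db relationship.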

rdbms_column

This object contains information about the different columns that exist in a table

name

Name of the column

table

Links to the rdbms_table object that the column belongs to

comment

Column comment from the remote SQL database

data_type

Type of column

isNullable

True or False depending on whether the column allows null values

isPrimaryKey

True or False depending on whether the column is part of the primary key

length

Length of the column. If the type is a varchar(45), the value 45 will be available here

owner

Gets populated by the owner column in jdbc_connections table

DBImport_Process

This is the DBImport object that gets created in Atlas as part of the DBImport installation. The object contains information about the source and target, whether it is an import or an export, and it also serves as the object that becomes part of the Lineage from the SQL table to the HDFS directory and further into Hive. A sketch of how to look up these process objects follows the attribute list below.

commandlineOpts

The entire command, including parameters, that was used to run the DBImport tool

description

Description of what has been done in the process

startTime

Timestamp of when the DBImport process started

endTime

Timestamp of when the DBImport process stopped

importTool

Sqoop or Spark

input

Link to the input object. Usually rdbms_table for imports and hive_table for exports

output

Link to the output object. Usually hdfs_path for imports and rdbms_table for exports

operation

Import or Export

username

Name of the user executing the DBImport process
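
As a sketch of how these process objects can be inspected, the snippet below lists recent DBImport_Process entities in Atlas and prints a few of the attributes described above. The URL and credentials are placeholder assumptions.

    import requests

    ATLAS_URL = "https://atlas.example.com:21000"   # assumption: your Atlas endpoint
    AUTH = ("atlas_user", "atlas_password")         # assumption: basic authentication

    # Find a handful of DBImport_Process entities through a basic search
    search = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/search/basic",
        params={"typeName": "DBImport_Process", "limit": 5},
        auth=AUTH,
        verify=False,
    )
    search.raise_for_status()

    for header in search.json().get("entities", []):
        # Fetch the full entity, since the search result only contains a header
        entity = requests.get(
            f"{ATLAS_URL}/api/atlas/v2/entity/guid/{header['guid']}",
            auth=AUTH,
            verify=False,
        ).json()["entity"]

        attrs = entity["attributes"]
        print(attrs.get("operation"), attrs.get("importTool"),
              attrs.get("startTime"), attrs.get("commandlineOpts"))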

Relationships between objects

There are different relationships created for the rdbms objects. For example, in order for an rdbms_table to exist in an rdbms_db, there is a relationship between the two. Those kinds of relationships are documented in the tables above on each object where they exist. In addition to that, there are also relationships that get created depending on the remote SQL configuration. The following is a list of those relationships.

Foreign Keys

There can be Foreign Keys on rdbms_tables that link two tables together in the same database. The information about these relationships comes from the remote SQL tables and is a reflection of how the remote tables are linked together.

Note

Indexes will also be a kind of relationship that gets created. Support for indexes on rdbms objects in Atlas will come in a future version of DBImport

Graphical representation of what the relationships can look like in Atlas

_images/atlas_relationship.jpg

Lineage

One of the important functions of Atlas is to show Lineage, the graphical representation of how data flows through your system. DBImport creates Lineage for all imports and exports. This connects the rdbms_table objects in the remote SQL databases with the Hadoop environment. Nothing needs to be configured in order to use this functionality. All you need to do is make sure the Atlas integration is enabled, and a Lineage like the example below will be created in Atlas. A sketch of how to fetch that lineage through the Atlas REST API follows the image.

_images/atlas_lineage.jpg
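
For reference, lineage can also be fetched programmatically. The sketch below resolves an rdbms_table through its qualifiedName and asks Atlas for everything upstream and downstream of it; the qualifiedName value and the Atlas endpoint are illustrative assumptions.

    import requests

    ATLAS_URL = "https://atlas.example.com:21000"   # assumption: your Atlas endpoint
    AUTH = ("atlas_user", "atlas_password")         # assumption: basic authentication

    # Resolve the table to a guid through its unique attribute
    table = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/entity/uniqueAttribute/type/rdbms_table",
        params={"attr:qualifiedName": "orders.customer_orders@oracle_sales"},
        auth=AUTH,
        verify=False,
    ).json()["entity"]

    # Ask Atlas for the lineage graph around the table, in both directions
    lineage = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/lineage/{table['guid']}",
        params={"direction": "BOTH", "depth": 3},
        auth=AUTH,
        verify=False,
    ).json()

    for guid, entity in lineage.get("guidEntityMap", {}).items():
        print(entity["typeName"], entity.get("displayText"))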