A

Wednesday, December 15th, 2021 6:11 PM

Collibra DQ and Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage.

Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.

Besides, we recommend using Athena when you have documents in Amazon S3 and you are working with Collibra DQ and Collibra DGC using DQ connector. DQ connector doesn’t support the integration with S3 directly, we have to do it through Athena.

In this link, you can find the known limitation of the DQ connector.

To start working with Athena, visit this link and follow the steps to configure databases and tables in Athena.

Prerequisites

To be able to bring the information from Collibra DQ to Collibra DQ related to the Athena connection, Edge and the DQ connector have to be configured. To do that, follow the steps of the post Complete DQ Connector Configuration.

Configure Athena connection in Collibra DQ

Athena is one of the connections certified by Collibra DQ (List of all the drivers supported).

Go to connections and click on “Add” in the Athena connection.

One of the most important part of the configuration is the URL, by default is:

jdbc:awsathena://AwsRegion=${region};User=${key};Password=${secret};S3OutputLocation=s3://${output_location};MetadataRetrievalMethod=Auto

You have to provide the following information:

  1. Region: the region where you have your S3 bucket.
  2. Key: This is the AWS Access Key.
  3. Secret: This is the AWS Secret Key
  4. Output_location: This is S3://output_location, it can be found in Settings of your Athena configuration.

  1. MetadataRetrievalMethod should be Query. If you keep “Auto” the connection will work but no tables will be shown in Explorer.

Then, select the correct drivers and change the name of the connection (without spaces).

Sometimes the parameters provided by the URL are not enough to configure the connection.

AWS uses IAM to authenticate into their systems. Identity and access management (IAM) is a framework of policies and technology that authenticates and authorizes access to applications, data, systems and cloud platforms. In basic terms, it helps ensure that the right people have the right access, for the right reasons.

There are more parameters that have to be considered in the connection and to do that we have to configure Named Profiles with the IAM Role in our DQ Server. To do that follow these steps:

  1. Login to your server.
  2. We have to create a file called “credentials” with the connection information in ~/.aws/credentials (Linux & Mac).
  3. Check the user who executed owl-webapp (use ps command).

This means that we have to save our credential file in /root/.aws.

  1. By default, our /root folder doesn’t have privileges to write so we can execute:

sudo chmod o+rwx /root

  1. Create .aws folder inside /root executing

sudo mkdir /root/.aws

  1. Create the credentials file with the following content:

  1. Create the URL adding the profile:

jdbc:awsathena://AwsRegion=us-east-1;Profile=testprofile;S3OutputLocation=s3://as-dataquality/;MetadataRetrievalMethod=Query

Notice that now we have the property “Profile” with the name of the profile configured in the server.

Connection to Athena in Collibra DGC

  • Fill the following fields:
    • Name: The name of this connection must be the same as the connection in Collibra DQ.
    • Description.
    • Connection provider: Generic JDBC connection.
    • Driver class name: For this driver the value es cdata.jdbc.amazonathena.AmazonAthenaDriver.
    • Driver jar: Upload the driver from the Marketplace.
    • Connection string: jdbc:amazonathena:
    • Connection properties: The following properties have to be added:
      • AWSAccessKey: The access key of your AWS account or for an authorized IAM user.
      • AWSSecrectKey -> The secret key of your AWS account or for an authorized IAM user.
      • DataSource -> The default value of this connection property is AwsDataCatalog.
      • S3StagingDirectory -> Specifies the location Amazon Athena will use to store the results of a query.
  • Click on “Save”
  • Click on “Test Connection” to be sure that your connection is working properly.
  1. Configure the capability

  • Click on “Capabilities”
  • Click on “Add capability”.
  • Configure the capability with the following information:

  • Name
  • Description.
  • Capability template: Catalog JDBC ingestion.
  • JDBC Connection: Select the created Athena connection.
  • JDBC datasource type: GENERIC.

Finally, complete all the steps listed in the post Complete DQ Connector Configuration as we have commented before.

Hope this helps!

3 years ago

This is fantastic work sister!

Loading...