
Wednesday, December 15th, 2021 4:05 PM

TO BE REVIEWED: How does the Collibra DQ Agent work exactly?

This was answered in conversation with Vadim Vaks, Principal Enterprise Architect:

  1. The Agent is responsible for working with a Connection.
    a. Each Connection must be associated with an Agent (see the doc links at the bottom of this post).

  2. The Agent polls the DQ Metastore for DQ Check jobs that have not yet been processed.

  3. The Agent runs the DQ Check on an Apache Spark platform and returns the results of the DQ Check to the DQ Metastore, where the Web UI displays them. (It is possible to use “pushdown” to avoid Apache Spark; that is a separate topic.)

  4. Lightweight: the Agent is a translator of the DQ Check into an actual Spark job, and it tracks the progress of that job.

  • The Agent is not activated until a job is launched with [Run] from the Web UI Explorer.
  • Jobs can also be launched from the REST API.
  • Jobs can also be launched from the Scala API.
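The poll-and-run cycle described in steps 2–4 can be pictured with a short sketch. This is a hypothetical, simplified model only: the in-memory "metastore", the field names, and the function names are all illustrative assumptions, not Collibra DQ's real schema or API.

```python
# Hypothetical stand-in for the DQ Metastore's job table; in the real
# product this lives in the Metastore database, not in memory.
pending_jobs = [{"id": 1, "dataset": "orders", "status": "PENDING"}]
results = {}

def poll_metastore():
    """Return the next unprocessed DQ Check job, if any."""
    for job in pending_jobs:
        if job["status"] == "PENDING":
            return job
    return None

def run_on_spark(job):
    """Stand-in for translating the DQ Check into a Spark job and running it."""
    return {"job_id": job["id"], "rules_passed": True}

def agent_tick():
    """One poll cycle: fetch a pending job, run it, write the results back."""
    job = poll_metastore()
    if job is None:
        return False  # nothing pending; the real agent sleeps ~5s and polls again
    job["status"] = "RUNNING"
    result = run_on_spark(job)   # the agent tracks the Spark job's progress
    results[job["id"]] = result  # results go back to the Metastore
    job["status"] = "DONE"       # so the Web UI can display them
    return True

while agent_tick():
    pass
```

The point of the sketch is the division of labor: the agent itself does no data-quality computation; it only moves work from the Metastore to Spark and results back.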
  1. The “Agent” actually polls the DQ Metastore (roughly every 5 seconds), looking for the OwlChecks that it needs to run.

  2. The Agent only activates when there is a DQ Check request in the Metastore event queue.

  3. DQ Checks get created and pushed into the DQ Metastore from either the DQ Web UI or the REST API.

  4. The Scala API doesn’t use an agent: you’re already in the compute environment and have the SparkContext.
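The two launch paths above can be contrasted in a short sketch. Again this is hypothetical: the function names, field names, and queue are illustrative assumptions, not the real Collibra DQ API.

```python
# Hypothetical stand-in for the Metastore's event queue.
metastore_queue = []

def submit_via_web_ui_or_rest(dataset, run_date):
    """[Run] in the Web UI and the REST API both just enqueue a DQ Check
    request; the agent picks it up on its next poll and runs it on Spark."""
    request = {"dataset": dataset, "run_date": run_date, "status": "PENDING"}
    metastore_queue.append(request)
    return request

def submit_via_scala_api(dataset, run_date, spark_context):
    """The Scala API path skips the agent entirely: the caller already holds
    a SparkContext, so the check runs in-process, synchronously."""
    return {"dataset": dataset, "run_date": run_date,
            "spark_context": spark_context, "status": "DONE"}

queued = submit_via_web_ui_or_rest("orders", "2021-12-15")
direct = submit_via_scala_api("orders", "2021-12-15", spark_context="sc")
```

The design consequence: Web UI and REST submissions are asynchronous (the caller gets a queued request back), while the Scala API is synchronous and never touches the agent or the event queue.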

  5. Agent logs are stored in the Agent Home; these are the Stage 1 logs.

  6. The DQ spark-submit JVM log is the Stage 2 log:

  • It covers the job between the compute space and the agent.
  • It is written to the log directory of the Agent Home.
  • Spark standalone mode writes some data to this log.
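The two-stage logging described above can be sketched as follows. This is a hypothetical illustration only: the file names, directory layout, and log messages are assumptions, not Collibra DQ's actual log format.

```python
import os
import tempfile

# Hypothetical stand-in for the Agent Home's log directory.
agent_home = tempfile.mkdtemp()

def run_dq_check(job_id):
    # Stage 1: the agent's own log (translating the DQ Check and
    # launching spark-submit), written under the Agent Home.
    stage1 = os.path.join(agent_home, f"agent-{job_id}.log")
    with open(stage1, "w") as f:
        f.write("stage 1: translated DQ Check, launching spark-submit\n")

    # Stage 2: output of the spark-submit JVM, i.e. the job between the
    # agent and the compute space, captured as a separate log file in the
    # same directory.
    stage2 = os.path.join(agent_home, f"spark-submit-{job_id}.log")
    with open(stage2, "w") as f:
        f.write("stage 2: spark driver started\n")

    return stage1, stage2

stage1_log, stage2_log = run_dq_check(42)
```

In practice this separation matters for troubleshooting: an error in the Stage 1 log points at the agent's translation/submission work, while an error in the Stage 2 log points at the Spark job itself.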

Here are some Collibra DQ Agent docs:

  1. Agent config: https://dq-docs.collibra.com/installation/agent-configuration

  2. Add Connection to Agent: https://dq-docs.collibra.com/connecting-to-dbs-in-owl-web/add-connection-to-agent
