Thursday, December 9th, 2021 4:58 PM

Question: How do I troubleshoot a Collibra DQ job failure?
Vadim Vaks answered this:
A: It helps to outline the Collibra DQ Job Lifecycle and where to find the logs for each phase. Every DQ job goes through a three-stage lifecycle:

  • Stage 1: The Agent picks up the job from the DQ Metastore and translates it into a valid Spark Submit request. This includes credential acquisition and injection for Cloud and Kerberos. If a job never makes it out of STAGING, the first thing to do is check the Agent logs (<INSTALL_HOME>/log/agent.log, or on K8s, kubectl logs <agent-pod-name> -n <namespace>).
  • Stage 2: The Agent hands off the DQ check to Spark via Spark Submit, keeping a handle on the Spark Submit request. At this point the job is in Spark's custody but not yet running (Spark Submit creates its own JVM to manage the submission of the Spark job to the cluster/runtime). If the job fails with a message like "Failed with reason NULL" on the Jobs page, check the Stage 2 logs (there is a separate log for each job). These can be found either on the Agent itself (<INSTALL_HOME>/log/<name-of-job>.log) or, where available, from the Actions dropdown on the job's entry on the Jobs page.
  • Stage 3: Spark Submit instantiates the job in the target Spark runtime (Hadoop/K8s/Spark Master). At this point, the DQ Core code is active and DQ is back in control of the job. If a job makes it to this stage, it will typically no longer be in STAGING status and you should see an error message on the Jobs page. The full Stage 3 log is usually required to troubleshoot a problem that happens in Core. Stage 3 logs can be obtained from the Actions dropdown for the job entry. If log extraction fails, job logs will need to be gathered from the Spark runtime directly (Hadoop Resource Manager, the K8s API via kubectl or a vendor-provided UI, the Spark Master UI, or directly from the Spark Master host).
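
The three stages above can be condensed into a quick shell reference. This is only a sketch: the install path, pod names, namespace, job name, and YARN application ID are placeholder assumptions you must replace with your own values, and each command is echoed as a dry run so that nothing runs against a live cluster by accident.

```shell
#!/usr/bin/env bash
# Placeholder values -- all of these are assumptions; substitute your own.
INSTALL_HOME=/opt/owl            # assumed DQ Agent install directory
AGENT_POD=agent-pod-name         # assumed Agent pod name on K8s
NAMESPACE=collibra-dq            # assumed K8s namespace
JOB_NAME=my-dq-job               # assumed DQ job name
APP_ID=application_0000000000000_0001   # assumed YARN application ID

# Stage 1 -- job stuck in STAGING: inspect the Agent log.
echo "tail -n 200 ${INSTALL_HOME}/log/agent.log"
echo "kubectl logs ${AGENT_POD} -n ${NAMESPACE}"

# Stage 2 -- \"Failed with reason NULL\": inspect the per-job Spark Submit log.
echo "tail -n 200 ${INSTALL_HOME}/log/${JOB_NAME}.log"

# Stage 3 -- log extraction failed: pull logs from the Spark runtime itself.
echo "yarn logs -applicationId ${APP_ID}"          # Hadoop/YARN runtime
echo "kubectl logs ${JOB_NAME}-driver -n ${NAMESPACE}"   # Spark on K8s (driver pod name is an assumption)
```

Remove the `echo` wrappers to run the commands for real once the placeholders match your environment.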