
Wednesday, July 5th, 2023 6:30 PM

DQ Life-cycle and Apache Spark

  1. How is Spark integrated with Collibra Data Quality and made available to customers? Are there any considerations or consequences when upgrading Spark that could impact Collibra Data Quality?

Answer:
Apache Spark is integrated at the compute-plane level. The Collibra DQ life-cycle below shows the steps and stages where Apache Spark is involved, so upgrading Spark could have an impact at each of the Spark-tagged steps:

  1. [DQ Metastore - Postgres] A DQ Check is requested via the DQ Web UI or the DQ REST API.

  2. [Spark] The DQ Agent associated with the request “pulls up” the data from the specified data source (e.g. over a JDBC connection), based on the “initial scope query” SQL provided by the user, and stores the results of the SELECT query in an Apache Spark DataFrame (Dataset). (See the first sketch after this list.)

  3. [Spark] An Apache Spark session is created using the supplied parameters, and Executors are instantiated and provisioned accordingly (see the second sketch after this list):
       - Number of Spark Executors
       - Number of cores (CPUs) per Executor
       - Amount of RAM per Executor

  4. [Spark] The DQ Agent runs the DQ Check against the Spark DataFrame (see the third sketch after this list):
       - Runs profile adaptive rules
       - Runs user-defined custom rules
       - Runs any advanced DQ features:
           - Patterns
           - Outliers
           - Dupes
           - Validate Source

  5. [DQ Metastore - Postgres] The DQ Agent stores the results of the above work in the DQ Metastore.

  6. [Spark] The Apache Spark DataFrame is destroyed (e.g. via Java VM garbage collection).
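
To make step 2 concrete, here is a minimal Spark (Scala) sketch of pulling data over JDBC into a DataFrame. The connection URL, credentials, table, and scope query are placeholders for illustration, not Collibra internals:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("dq-check")
  .getOrCreate()

// Hypothetical "initial scope query" as the user might supply it.
val initialScopeQuery = "SELECT * FROM orders WHERE order_date = '2023-07-05'"

// The results of the SELECT land in a Spark DataFrame (Dataset[Row]).
// The matching JDBC driver jar must be on the classpath.
val df: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder URL
  .option("query", initialScopeQuery)                    // JDBC "query" option (Spark 2.4+)
  .option("user", "dq_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()
```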
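
Step 3's three session parameters map directly onto standard Spark configuration keys. A minimal sketch with placeholder values follows; on a cluster manager such as YARN these govern how Executors are provisioned, while in local mode they are ignored:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dq-check")
  .config("spark.executor.instances", "4") // number of Spark Executors
  .config("spark.executor.cores", "2")     // cores (CPUs) per Executor
  .config("spark.executor.memory", "8g")   // RAM per Executor
  .getOrCreate()
```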
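
Steps 4 and 6 can be sketched the same way. The rule condition below is made up for illustration; Collibra DQ's actual rule engine is internal to the product, but a SQL-style custom rule ultimately evaluates as DataFrame operations like these. This continues from the `df` and `spark` defined in the first sketch:

```scala
// Illustrative custom rule: count rows that break the condition.
val totalRows    = df.count()
val breakingRows = df.filter("amount < 0 OR amount IS NULL").count()
val passRate     = 1.0 - breakingRows.toDouble / totalRows

// Step 6: once results are persisted to the DQ Metastore, the DataFrame
// is released and the session stopped; the unreferenced objects are then
// reclaimed by Java VM garbage collection.
df.unpersist()
spark.stop()
```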

Based on the above, we can see which areas of the DQ life-cycle involve Spark, and therefore where a Spark upgrade could have an impact.
