DQ Life-cycle and Apache Spark
- How is Spark integrated with Collibra Data Quality and made available to customers? Are there any considerations or consequences when upgrading Spark that could impact Collibra Data Quality?
Answer:
Apache Spark is integrated at the compute-plane level. The Collibra DQ life-cycle below shows the steps and stages where Apache Spark is involved, so upgrading Spark could have an impact at each of these Spark steps:
- [DQ Metastore - Postgres] A DQ Check is requested using the DQ Web UI or the DQ REST API.
- [Spark] The DQ Agent associated with the request "pulls up" the data from the specified data source (e.g. using a JDBC connection), based on the "initial scope query" SQL provided by the user, and stores the results of the `SELECT` query in an Apache Spark `DataFrame` (`Dataset`).
- [Spark] An Apache Spark session is created using the supplied parameters, and Executors are instantiated and provisioned accordingly:
  - Number of Spark Executors.
  - Number of cores (CPUs) per Executor.
  - Amount of RAM per Executor.
- [Spark] The DQ Agent runs the DQ Check:
  - Runs profile adaptive rules against the Spark `DataFrame`.
  - Runs user-defined custom rules against the Spark `DataFrame`.
  - Runs any advanced DQ features against the Spark `DataFrame`:
    - Patterns
    - Outliers
    - Dupes
    - Validate Source
- [DQ Metastore - Postgres] The DQ Agent stores the results of the above work in the DQ Metastore.
- [Spark] The Apache Spark `DataFrame` is destroyed (e.g. via Java VM garbage collection).
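The Executor parameters in the session-creation step map onto standard Spark submission options. As a hedged illustration only (in practice the DQ Agent assembles the actual submission; the master URL and application artifact below are hypothetical placeholders):

```shell
# Illustrative only: the DQ Agent assembles the real submission.
# The three Executor parameters above correspond to standard Spark options:
#   --num-executors   number of Spark Executors
#   --executor-cores  cores (CPUs) per Executor
#   --executor-memory RAM per Executor
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  dq-check.jar
```

Because these options are interpreted by Spark itself, a Spark upgrade is one of the points where their behavior (defaults, validation, dynamic allocation interplay) should be re-verified.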
Based on the above, we can see which areas of the DQ life-cycle involve Spark, and therefore where a Spark upgrade should be validated.
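To make the flow concrete, the life-cycle stages can be sketched in plain Python. All names here are hypothetical stand-ins, and a list of dicts stands in for the Spark `DataFrame`, so the sketch is runnable without a cluster; it is not Collibra DQ's actual implementation:

```python
# Sketch of the DQ life-cycle stages described above (stand-in names).

def pull_scope(rows, scope_filter):
    # [Spark] the agent's "initial scope query" narrows the source data
    return [r for r in rows if scope_filter(r)]

def run_adaptive_rules(frame):
    # [Spark] profile adaptive rules, e.g. flag NULLs in a required column
    return [r for r in frame if r.get("id") is None]

def run_custom_rules(frame, rules):
    # [Spark] user-defined custom rules applied row by row
    return [r for r in frame for rule in rules if not rule(r)]

def store_results(metastore, findings):
    # [DQ Metastore - Postgres] persist findings; the frame is then discarded
    metastore.extend(findings)

source = [{"id": 1, "amount": 10}, {"id": None, "amount": -5}]
metastore = []

frame = pull_scope(source, lambda r: True)          # full scope
findings = run_adaptive_rules(frame) + run_custom_rules(
    frame, [lambda r: r["amount"] >= 0]             # hypothetical custom rule
)
store_results(metastore, findings)                  # two findings persisted
```

The sketch mirrors why a Spark upgrade matters: every stage between the pull and the persist runs inside the Spark session, so a behavior change there touches all of them.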