
Wednesday, July 5th, 2023 6:30 PM

DQ Life-cycle and Apache Spark

  1. How is Spark integrated with Collibra Data Quality and made available to customers? Are there any considerations or consequences when upgrading Spark that could impact Collibra Data Quality?

Answer:
Apache Spark is integrated at the compute-plane level. The Collibra DQ life-cycle below shows the steps and stages where Apache Spark is involved, so upgrading Spark could have an impact at each of the Spark-tagged steps:

  1. [DQ Metastore - Postgres] A DQ Check is requested via the DQ Web UI or the DQ REST API.

  2. [Spark] The DQ Agent associated with the request “pulls up” the data from the specified data source (e.g. over a JDBC connection), based on the “initial scope query” SQL provided by the user, and stores the results of the SELECT query in an Apache Spark DataFrame (Dataset). (See the first sketch after this list.)

  3. [Spark] An Apache Spark session is created using the supplied parameters, and Executors are instantiated and provisioned accordingly (see the second sketch after this list):
       - Number of Spark Executors
       - Number of cores (CPUs) per Executor
       - Amount of RAM per Executor

  4. [Spark] The DQ Agent runs the DQ Check against the Spark DataFrame (see the third sketch after this list):
       - Runs profile adaptive rules
       - Runs user-defined custom rules
       - Runs any advanced DQ features:
           - Patterns
           - Outliers
           - Dupes
           - Validate Source

  5. [DQ Metastore - Postgres] The DQ Agent stores the results of the above work in the DQ Metastore.

  6. [Spark] The Apache Spark DataFrame is destroyed (e.g. via Java VM garbage collection).
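
To make step 2 concrete, here is a minimal Spark (Scala) sketch of pulling data over JDBC into a DataFrame. The connection URL, credentials, table, and scope query are placeholders for illustration, not Collibra internals:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("dq-check")
  .getOrCreate()

// Hypothetical "initial scope query" as the user might supply it.
val initialScopeQuery = "SELECT * FROM orders WHERE order_date = '2023-07-05'"

// The results of the SELECT land in a Spark DataFrame (Dataset[Row]).
// The matching JDBC driver jar must be on the classpath.
val df: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder URL
  .option("query", initialScopeQuery)                    // JDBC "query" option (Spark 2.4+)
  .option("user", "dq_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()
```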
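
Step 3's three session parameters map directly onto standard Spark configuration keys. A minimal sketch with placeholder values follows; on a cluster manager such as YARN these govern how Executors are provisioned, while in local mode they are ignored:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dq-check")
  .config("spark.executor.instances", "4") // number of Spark Executors
  .config("spark.executor.cores", "2")     // cores (CPUs) per Executor
  .config("spark.executor.memory", "8g")   // RAM per Executor
  .getOrCreate()
```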
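
Steps 4 and 6 can be sketched the same way. The rule condition below is made up for illustration; Collibra DQ's actual rule engine is internal to the product, but a SQL-style custom rule ultimately evaluates as DataFrame operations like these. This continues from the `df` and `spark` defined in the first sketch:

```scala
// Illustrative custom rule: count rows that break the condition.
val totalRows    = df.count()
val breakingRows = df.filter("amount < 0 OR amount IS NULL").count()
val passRate     = 1.0 - breakingRows.toDouble / totalRows

// Step 6: once results are persisted to the DQ Metastore, the DataFrame
// is released and the session stopped; the unreferenced objects are then
// reclaimed by Java VM garbage collection.
df.unpersist()
spark.stop()
```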

Based on the above, we can see which areas of the DQ life-cycle involve Spark, and therefore where a Spark upgrade could have an impact.
