
Wednesday, March 2nd, 2022 5:22 PM

REVIEWED-PENDING DOCUMENTATION: What is the workflow for a DQ Check Job, how are rules applied, and do rules hit the original data source?

Question: My understanding is that there are two possibilities for how rules run. One is that they run against the Apache Spark Dataset created by the DQ Check, so they do not touch the customer data source. The other is that a DQ Native SQL rule is created that targets a native database function. We need to understand the workflow.

Given the following DQ Check workflow:

  1. We run the DQ Check and the Scope SELECT statement creates a Spark Dataset.

  2. We run Adaptive Rules against this detached Dataset.

  3. Someone writes a custom rule and runs the DQ Check. Does this rule run against A) the customer data source, or B) the detached Dataset we created in step 1?

  4. Lastly, we use a Native SQL rule that has an Oracle function in it:
    WITH temporaryTable (averageValue) AS
      (SELECT AVG(Attr1)
       FROM Table)
    SELECT Attr1
    FROM Table, temporaryTable
    WHERE Table.Attr1 > temporaryTable.averageValue;

How can this be run against the detached Dataset? Wouldn't it have to go to the Oracle data source to hit that function?
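
To make steps 1-3 concrete, here is a minimal sketch (the table, the columns, and the @dataset placeholder used for non-native rules are illustrative assumptions, not confirmed by this thread):

    -- Step 1: the scope query is pushed to the source once, and its result
    -- becomes the in-memory Spark Dataset (table and columns are examples)
    SELECT Attr1, Attr2
    FROM Table;

    -- Steps 2-3: adaptive rules and custom non-native rules are then evaluated
    -- as Spark SQL over that detached Dataset, not against the source, e.g.:
    SELECT *
    FROM @dataset
    WHERE Attr1 IS NULL;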

Vadim Vaks helped clarify with his answer:
A: If you write a rule that is not Native SQL, the DQ Job applies that rule to the data that was pulled into the memory of the Spark session.

  • Any rules that are created as Native SQL will be sent to the original data source (Oracle/Postgres/Snowflake/etc.) for processing.

  • If a dataset has a mix of both rule types, then both of the preceding apply.
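
As a hedged side-by-side sketch of that distinction (the rule bodies and the @dataset placeholder are assumptions for illustration only):

    -- Non-native rule: evaluated by the Spark session against the in-memory
    -- Dataset, so it never reaches Oracle and only Spark SQL functions apply
    SELECT *
    FROM @dataset
    WHERE Attr1 > 100;

    -- Native SQL rule: sent to the original data source, so the Oracle CTE
    -- from the question runs inside Oracle itself rather than in Spark
    WITH temporaryTable (averageValue) AS
      (SELECT AVG(Attr1) FROM Table)
    SELECT Attr1
    FROM Table, temporaryTable
    WHERE Table.Attr1 > temporaryTable.averageValue;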
