
Tuesday, December 21st, 2021 2:22 PM

TO BE REVIEWED: How can I configure Collibra DQ to run a large DQ Check Job (containing millions of rows of data)?

Question: How can I configure Collibra DQ to run a large DQ Check Job (containing millions of rows of data)?
Answer: Several customers have shared their configurations of Apache Spark, Kubernetes (K8s), Docker, and more:

Collibra DQ ran on K8s Docker:

  • 150 Million rows, with 31 columns, in 45 minutes (details: using 18 Spark Executors, 12 GB RAM and 4 CPUs per Executor, and 4 GB RAM for the Driver). Success! (See the sketch after this list.)

  • 44 Million rows, with 495 columns, in 79 minutes (details: using 46 Spark Executors, 24 GB RAM and 8 CPUs per Executor, with Kerberos TGT, and 4 GB RAM for the Driver). Success!
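For anyone mapping these numbers onto a Spark job, here is a minimal sketch of the first configuration above expressed as standard Spark properties (plain PySpark, not Collibra-specific flags; the app name is hypothetical, and driver memory normally has to be set at submit time rather than in code):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the first K8s report above, written as standard
# Spark properties. The app name is hypothetical.
spark = (
    SparkSession.builder
    .appName("dq-check-150m-rows")              # hypothetical name
    .config("spark.executor.instances", "18")   # 18 Spark Executors
    .config("spark.executor.memory", "12g")     # 12 GB RAM per Executor
    .config("spark.executor.cores", "4")        # 4 CPUs per Executor
    .config("spark.driver.memory", "4g")        # 4 GB RAM for the Driver
    # note: spark.driver.memory usually only takes effect when set at
    # submit time, before the driver JVM starts
    .getOrCreate()
)
```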

Another customer's configuration uses a mix of node types in the cluster:

  • Spark Master: r5.xlarge (4 vCores, 32 GB RAM), 1 instance.
  • Spark Core: m5.2xlarge (8 vCores, 32 GB RAM), 10 instances.

The Collibra DQ job:

  • 65 Million rows, with 270 columns, ran in 4 hours! (details: using 10 Spark Executors, 25 GB RAM and 8 CPUs per Executor, and 4 GB RAM for the Driver). Success! (See the resource check below.)
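A quick resource check (referenced above) shows why this layout fits: assuming Spark's default executor memory overhead of max(384 MB, 10% of executor memory), each Executor needs roughly 27.5 GB, which lands one per 32 GB Core instance:

```python
# Back-of-the-envelope check that the job above fits the cluster.
# Assumes Spark's default executor memory overhead: max(384 MB, 10%).
# All figures come from the report above.
executor_mem_gb = 25
node_mem_gb = 32                                    # m5.2xlarge
overhead_gb = max(0.384, 0.10 * executor_mem_gb)    # ~2.5 GB
per_executor_gb = executor_mem_gb + overhead_gb     # ~27.5 GB

# ~27.5 GB < 32 GB and 8 requested cores == 8 vCores, so the 10
# Executors land one per Core instance with headroom for the OS.
assert per_executor_gb <= node_mem_gb
print(f"~{per_executor_gb:.1f} GB per Executor on {node_mem_gb} GB nodes")
```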

Another client, using the 2021.12 (December 2021) release:

  • 65 Million rows, 100 columns, on AWS EMR with 5 nodes and 20 Executors (4 CPUs and 9 GB RAM per Executor) => Success in 1.5 hours!

  • 246 Million rows, 79 columns, using a “LIMIT 25000” in the SQL query. Collibra DQ Check with 7 Spark Executors, 6 GB RAM and 9 Cores per Executor, and 4 GB RAM for the Driver: finished in 10 minutes! (See the bounded-query sketch after this list.)
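On the “LIMIT 25000” point: bounding the source query is a simple way to cap how many rows the DQ Check scans. Here is a minimal sketch of that pattern using Spark's JDBC query option; the connection URL, table name, and credentials are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-check-limited").getOrCreate()

# Hypothetical connection details and table name; the point is only
# the LIMIT pattern, which caps how many rows reach the DQ Check.
bounded_query = "SELECT * FROM transactions LIMIT 25000"

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/warehouse")  # hypothetical
    .option("query", bounded_query)  # "query" option exists in Spark >= 2.4
    .option("user", "dq_user")
    .option("password", "...")
    .load()
)
print(df.count())  # at most 25,000 rows are scanned by the check
```

Note that a LIMIT without an ORDER BY returns an arbitrary subset of rows, which is usually fine for a quick profiling pass but not for a full-coverage check.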

From the reports above, we can see that Collibra DQ successfully runs “large” DQ Check Jobs.


3 years ago

As you plan your processing, you may also want to check out this great post:
https://datacitizens.collibra.com/forum/t/to-be-reviewed-dq-ec2-size-and-type-class/993

2 years ago

This one just in from a large banking client of mine running Collibra DQ: “We are now able to run large jobs using m5.8xlarge (32 CPUs, 128 GB RAM) and 4 Spark Executors, with 4 GB RAM Executor memory and 4 GB ephemeral storage per Executor. 695K records and 350 columns (~7 GB of data); the DQ Job completes in ~15 minutes.”

2 years ago

A large client in a meeting just now reported: 400 million rows, 13 columns, DQ Job duration of 1 hour!
