Question: How can I configure Collibra DQ to run a large DQ Check Job (containing millions of rows of data)?
Answer: Several customers have shared reports with us, including their configurations of Apache Spark, Kubernetes (K8s), Docker, and more:
Collibra DQ running on K8s with Docker:
- 150 million rows, 31 columns, in 45 minutes (details: 18 Spark Executors, 12 GB RAM and 4 CPUs per Executor, 4 GB RAM for the Driver). Success!
- 44 million rows, 495 columns, in 79 minutes (details: 46 Spark Executors, 24 GB RAM and 8 CPUs per Executor, with a Kerberos TGT, 4 GB RAM for the Driver). Success!
Another customer's configuration uses a mix of node types in the cluster:
- Spark Master: r5.xlarge (4 vCore, 32 GB RAM), 1 instance.
- Spark Core: m5.2xlarge (8 vCore, 32 GB RAM), 10 instances.
The Collibra DQ job:
- 65 million rows, 270 columns, ran in 4 hours (details: 10 Spark Executors, 25 GB RAM and 8 CPUs per Executor, 4 GB RAM for the Driver). Success!
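A quick way to gauge whether a layout like the one above is even feasible is to compare the executors' total CPU and RAM against the core nodes' capacity. Here is a minimal sketch (the `fits` helper is ours, not a Collibra DQ or Spark API) using the numbers from that report:

```python
# Sanity-check: does a reported executor layout fit the reported cluster?
# Numbers below come from the post above (10 x m5.2xlarge core nodes).
def fits(nodes, vcpus_per_node, ram_gb_per_node,
         executors, cores_per_exec, ram_gb_per_exec):
    """Return True if the executors' total CPUs and RAM fit on the core nodes."""
    cpu_ok = executors * cores_per_exec <= nodes * vcpus_per_node
    ram_ok = executors * ram_gb_per_exec <= nodes * ram_gb_per_node
    return cpu_ok and ram_ok

# 10 x m5.2xlarge (8 vCPU, 32 GB) vs 10 executors x 8 CPUs x 25 GB each:
print(fits(10, 8, 32, 10, 8, 25))  # True: 80/80 CPUs, 250/320 GB RAM
```

In practice you would also leave some headroom per node for the OS and the resource manager (YARN/K8s) overhead, so a layout that exactly saturates the CPUs, as this one does, is at the aggressive end.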
Another client, on the 2021.12 (December 2021) release:
- 65 million rows, 100 columns, on AWS EMR with 5 nodes and 20 Executors (4 CPUs and 9 GB RAM per Executor) => success in 1.5 hours!
- 246 million rows, 79 columns, using a “LIMIT 25000” in the SQL. The Collibra DQ Check ran with 7 Spark Executors, 6 GB RAM and 9 cores per Executor, and 4 GB RAM for the Driver: finished in 10 minutes! (TEN!)
From the reports above we can see that Collibra DQ successfully handles “large” DQ Check Jobs.
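In plain Spark terms, the first K8s configuration above (18 Executors, 12 GB RAM and 4 CPUs each, 4 GB for the Driver) maps onto the standard submit options roughly as follows. Treat this as an illustration of the knobs involved, not as the exact mechanism: Collibra DQ typically sets these values through its own job options/agent rather than a raw `spark-submit`, and the `<cluster-endpoint>` and `<dq-check-job>` placeholders are hypothetical.

```shell
# Illustrative only: the reported 150M-row / 31-column config expressed
# as standard Spark submit options (Collibra DQ may expose these differently).
spark-submit \
  --master k8s://https://<cluster-endpoint> \
  --num-executors 18 \
  --executor-memory 12g \
  --executor-cores 4 \
  --driver-memory 4g \
  <dq-check-job>
```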
ericgerstner • 3 years ago
As you consider your processing, you may also want to check out this great post:
https://datacitizens.collibra.com/forum/t/to-be-reviewed-dq-ec2-size-and-type-class/993
laurentweichberger • 2 years ago
This one just in from a large banking client of mine running Collibra DQ: “We are now able to run large jobs using m5.8xlarge (32 CPUs, 128 GB RAM) and 4 Spark Executors, with 4 GB executor memory and 4 GB ephemeral storage per executor. 695K records and 350 columns (~7 GB of data); the DQ Job completes in ~15 minutes.”
laurentweichberger • 2 years ago
A large client in a meeting just now reported 400 million rows, 13 columns, with a DQ Job duration of 1 hour!