
Thursday, July 15th, 2021 5:36 PM

Technical: DQ Supported Data Types

Supported Data Types / General

Q: What types of data sources can Collibra DQ clean?
A: Collibra DQ is not a cleansing tool, but it can identify data quality issues across many systems, e.g. Hive, MongoDB, SAP HANA, and file stores.

Supported Data Types / Unstructured

Q: Do you support unstructured data?
A: Yes, if there is a driver that takes the horizontal structure and flattens it to a column-based structure, but that falls back into the JDBC world, with the driver doing the work. We do scan JSON files. Blob data types and deeply nested JSON might be more difficult.
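
To illustrate what "flattening to a column-based structure" means, here is a minimal sketch in plain Python. It is not Collibra DQ code or a JDBC driver; the `flatten` helper, the dotted column names, and the sample document are all assumptions made for illustration.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level of dotted column names (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars and lists are kept as-is; arrays would need a separate explode step.
            flat[column] = value
    return flat

# Illustrative document, not taken from any real data source.
doc = json.loads('{"id": 7, "device": {"type": "sensor", "location": {"site": "A1"}}}')
row = flatten(doc)
print(row)
```

Each additional level of nesting lengthens the column names and adds columns, which is one reason deeply nested JSON is harder to scan than flat files.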

Q: What is our support for scanning files (e.g. Excel) or DataFrames in Spark?
A: CSV (delimited text files), Avro, Parquet, JSON, and XML. These are Spark-supported formats; XLS is not supported.

Q: Can we read a JSON structure stored in a column of a table?
A: Only in our SDK (notebook) format. You can pass in a function to explode the JSON, but this is not easy without knowledge of these techniques. The quick answer is no, not out of the box.
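
For readers unfamiliar with the technique, the idea of an "explode" function can be sketched in plain Python. This is not the Collibra DQ SDK API; `explode_json_column`, the `payload` column, and the sample rows are hypothetical, and a real notebook would typically do the equivalent with Spark functions.

```python
import json

def explode_json_column(rows, column):
    """Replace a JSON-string column with one column per top-level key (hypothetical helper)."""
    exploded = []
    for row in rows:
        # Copy every column except the one holding the JSON text.
        out = {k: v for k, v in row.items() if k != column}
        raw = row.get(column)
        parsed = json.loads(raw) if raw else {}
        if isinstance(parsed, dict):
            for key, value in parsed.items():
                out[f"{column}_{key}"] = value
        else:
            # Non-object JSON (arrays, scalars) is kept in a single column.
            out[column] = parsed
        exploded.append(out)
    return exploded

# Illustrative rows, as if read from a table with a JSON text column.
rows = [
    {"order_id": 1, "payload": '{"status": "shipped", "items": 3}'},
    {"order_id": 2, "payload": None},
]
result = explode_json_column(rows, "payload")
print(result)
```

Rows with a missing or empty JSON value simply produce no extra columns, so downstream checks would see them as nulls.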

Q: How do we know when a customer's DQ requirements are NOT a good fit for Collibra DQ?
A: IoT sensor or SCADA data with proprietary connectors is out unless it is dumped to Kafka. Any streaming solution not using Kafka is out. Mainframe / EBCDIC data is out unless it has a JDBC connector and is converted to ASCII or a regular text format. VSAM is out. Log files are out (use Splunk). Data without a header or schema is out (JSON and XML are fine). Highly dimensional tables are not great. Star and snowflake tables in 3rd or 4th normal form don’t tell you much; you need a view or business-level table to get deeper insights.
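
As a rough illustration of the "converts to ASCII or regular text format" step for mainframe data, the sketch below decodes EBCDIC bytes with Python's built-in codecs. The `ebcdic_to_text` helper is hypothetical, and the `cp037` code page is an assumption; the correct code page depends on the source system, and binary fields such as packed decimals cannot be decoded this way.

```python
def ebcdic_to_text(raw: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC-encoded bytes to a regular Python string (hypothetical helper).

    cp037 is assumed here; other systems may use cp500, cp1140, or another page.
    """
    return raw.decode(codec)

# The bytes below spell "HELLO" in EBCDIC code page 037.
record = b"\xc8\xc5\xd3\xd3\xd6"
text = ebcdic_to_text(record)
print(text)
```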

Supported Data Types / Kafka

Q: What are the details around streaming Kafka files and how do the results get exposed within Collibra DQ?
A: See https://docs.owl-analytics.com/owlcheck-examples/owlcheck-kafka. The check runs as a listener on the topic, and the setup is slightly different from batch mode. It is invoked from the CLI. Results appear in the UI, and the functionality there is the same.

Q: When streaming data from a Kafka queue, since it is near real time, how do the rule results and scorecard get aggregated over one hour, one day, etc.?
A: There is no aggregation for Kafka. If you want aggregation, the best option is batch mode.

Q: Real-Time Data Quality with Collibra DQ and Kafka
