Tuesday, April 19th, 2022 8:50 AM

Data lineage automation using the OpenLineage framework

We aim to get the full story around our data with Collibra Data Lineage, using as much automation as possible.

I need to ingest data lineage metadata into DGC using the OpenLineage framework, in order to display real-time technical data lineage paths in the centralized DGC governance framework.

Have you already experimented with such a pipeline solution, or with any alternative tool as a metadata repository reference?

30 Messages

3 years ago

Hello @jean-luc.garnier.safrangroup.com ! I’m the Product Manager for technical lineage in Collibra.

At this time, OpenLineage provides “operational lineage” - you can observe what is happening with pipelines - e.g. when they last ran successfully.
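For readers who have not seen what this operational lineage looks like, here is a minimal sketch of a run event built as a plain Python dict, without any client library. The top-level field names follow the OpenLineage specification as far as I know it; the namespaces, job name, dataset names, producer URL, and the spec version in `schemaURL` are placeholders or assumptions to check against the current spec.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_namespace, job_name, run_id, inputs, outputs):
    """Build an OpenLineage-style run event as a plain dict (no client library)."""
    return {
        "eventType": event_type,  # e.g. "START", "COMPLETE", "FAIL"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Placeholder producer; the spec version below is an assumption.
        "producer": "https://example.com/my-scheduler",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
    }


run_id = str(uuid.uuid4())
event = make_run_event(
    "COMPLETE",
    "example-namespace",
    "load_table_b",
    run_id,
    inputs=[("oracle://example-host", "SCHEMA.TABLE_A")],
    outputs=[("oracle://example-host", "SCHEMA.TABLE_B")],
)
print(json.dumps(event, indent=2))
```

The event says which job ran, when, with what outcome, and which datasets it read and wrote. It does not describe column-level transformations.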

Collibra offers “definitional lineage” - you can see how your data flows through your environment. The use cases are impact analysis (“if I change this column’s name, what is affected by that?”) and security/privacy (“column X in table A is PII, and it’s transformed and loaded into column Y of table B, so table B/column Y should be marked as PII as well.”)
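Both use cases come down to walking a lineage graph downstream. The sketch below is only an illustration of that idea, not how Collibra implements it; the column names and the `LINEAGE` mapping are made up.

```python
from collections import deque

# Hypothetical column-level lineage: source column -> columns derived from it.
LINEAGE = {
    "A.X": ["B.Y"],
    "B.Y": ["C.Z", "D.W"],
    "C.Z": [],
    "D.W": [],
    "A.V": ["D.W"],
}


def downstream(column, lineage):
    """Return every column reachable from `column` (breadth-first walk)."""
    seen, queue = set(), deque([column])
    while queue:
        current = queue.popleft()
        for target in lineage.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def propagate_pii(pii_columns, lineage):
    """Tag every column derived from a PII column as PII as well."""
    tagged = set(pii_columns)
    for column in pii_columns:
        tagged |= downstream(column, lineage)
    return tagged


# Impact analysis: what is affected if A.X is renamed?
print(sorted(downstream("A.X", LINEAGE)))
# Privacy: which columns should inherit the PII tag from A.X?
print(sorted(propagate_pii({"A.X"}, LINEAGE)))
```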

We are thinking about adding operational lineage - we would integrate with OpenLineage, at least using their schema/framework - but we do not have any firm plans or commitments.

Hope that helps!
-Sheeri Cabral

3 Messages

2 years ago

Hi @sheeri.cabral , hope you are well. Are there any new updates or plans available to share? Thank you!

41 Messages

2 years ago

Hi,
We are also interested in such an integration.
We are planning a POC to hack Collibra in such a way as to integrate operational lineage, i.e., the inclusion of job information (job start and end times, and whether a job failed). We think it’s possible, though it will cost us; we would prefer an integrated solution, as this type of lineage is much needed.

Thanks,
Tom

30 Messages

Hello! It’s definitely possible to hack, especially with custom lineage - the transformation code is just a string, so you can append anything you want.
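A rough sketch of that append trick, assuming the transformation code is SQL (so `--` comments are safe to add). The `entry` dict and its keys are hypothetical and do not reflect Collibra's actual custom lineage schema; only the string manipulation is the point here.

```python
def annotate_transformation(source_code, job_name, started, ended, failed):
    """Append operational metadata to a transformation string as SQL comments."""
    status = "FAILED" if failed else "SUCCEEDED"
    note = "\n".join([
        "-- operational metadata (appended; not part of the original code)",
        f"-- job: {job_name}",
        f"-- last start: {started}",
        f"-- last end: {ended}",
        f"-- last status: {status}",
    ])
    return source_code.rstrip() + "\n" + note


# Hypothetical lineage entry; the key names are illustrative only.
entry = {
    "source": "SCHEMA.TABLE_A",
    "target": "SCHEMA.TABLE_B",
    "transformation_code": "INSERT INTO TABLE_B SELECT * FROM TABLE_A",
}
entry["transformation_code"] = annotate_transformation(
    entry["transformation_code"],
    job_name="load_table_b",
    started="2022-04-19T01:00:00",
    ended="2022-04-19T01:02:00",
    failed=False,
)
print(entry["transformation_code"])
```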

We have an open ideation ticket - we’d love votes, but even more so, your use case. Where do you want to see that information? What do you want to be able to do with it? E.g. you want job start time, job end time, and whether or not it failed. I assume that’s for the most recent run, not the entire history.
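If only the most recent run matters, the run history can be reduced to one record per job before anything is loaded. A minimal sketch, assuming run records with ISO-8601 timestamps; the record layout and job names are invented for illustration.

```python
from datetime import datetime

# Hypothetical run history, e.g. exported from a scheduler.
RUNS = [
    {"job": "load_table_b", "start": "2022-04-18T01:00:00",
     "end": "2022-04-18T01:10:00", "failed": False},
    {"job": "load_table_b", "start": "2022-04-19T01:00:00",
     "end": "2022-04-19T01:02:00", "failed": True},
    {"job": "refresh_dashboard", "start": "2022-04-19T02:00:00",
     "end": "2022-04-19T02:05:00", "failed": False},
]


def latest_run_per_job(runs):
    """Keep only the run with the latest start time for each job."""
    latest = {}
    for run in runs:
        current = latest.get(run["job"])
        if current is None or (
            datetime.fromisoformat(run["start"])
            > datetime.fromisoformat(current["start"])
        ):
            latest[run["job"]] = run
    return latest


for job, run in latest_run_per_job(RUNS).items():
    print(job, run["start"], run["end"], "failed" if run["failed"] else "ok")
```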

What sources are you using (e.g. Oracle, Matillion, Tableau, custom lineage) and where would that information come from?

https://productresources.collibra.com/ideation-platform/?id=PINT-I-60

@jean-luc.garnier.safrangroup.com @tom.kuppens.telenetgroup.be @miguel.guillen

41 Messages

Hi Sheeri,
Thanks for your message. We still need to refine what and how we are going to manage the operational lineage integration in Collibra.

From an overall perspective, having two places to manage lineage in Collibra is not very efficient. The DGC has a powerful semantic layer that can connect everything. The techlineage engine is powerful in getting the lineage out of the source systems and into the techlin module. Both very cool! But then it stops being cool, because the integration between the techlin module and the DGC is not ideal at the moment, to mention a few issues:

Because of the descriptive nature of operational lineage and the issue mentioned above, I think the logical place to have that information is in the DGC. We would use it to propagate the operational information on the actual jobs to everybody who is interested in it. The use cases are plentiful, e.g. if you are using a table, you want to know its provenance but also the jobs that it depends on (via the table provenance), and whether these jobs ran without hiccups. How the information is compartmentalized and managed is open for discussion.
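To make the table use case concrete, here is a small sketch that walks a table's provenance upstream, collects the jobs it depends on, and reports which of them failed on their last run. The table names, job names, and both mappings are invented; in practice they would come from the lineage and from the job framework.

```python
# Hypothetical provenance: table -> (job that writes it, tables that job reads).
PRODUCED_BY = {
    "REPORT_SALES": ("build_report_sales", ["DWH_SALES", "DWH_CUSTOMER"]),
    "DWH_SALES": ("load_dwh_sales", ["STG_SALES"]),
    "DWH_CUSTOMER": ("load_dwh_customer", ["STG_CUSTOMER"]),
}

# Hypothetical status of each job's most recent run.
LAST_RUN_OK = {
    "build_report_sales": True,
    "load_dwh_sales": False,
    "load_dwh_customer": True,
}


def upstream_jobs(table, produced_by):
    """Collect every job the table depends on, directly or indirectly."""
    jobs, stack, visited = [], [table], set()
    while stack:
        current = stack.pop()
        if current in visited or current not in produced_by:
            continue  # already handled, or a source table with no known job
        visited.add(current)
        job, sources = produced_by[current]
        if job not in jobs:
            jobs.append(job)
        stack.extend(sources)
    return jobs


def failed_upstream_jobs(table, produced_by, last_run_ok):
    """Upstream jobs whose last run failed; unknown jobs count as not OK."""
    return sorted(
        job for job in upstream_jobs(table, produced_by)
        if not last_run_ok.get(job, False)
    )


print(upstream_jobs("REPORT_SALES", PRODUCED_BY))
print(failed_upstream_jobs("REPORT_SALES", PRODUCED_BY, LAST_RUN_OK))
```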

We are doing Qlik Sense lineage (Nodegraph), SPSS modeler lineage (own development), Oracle Lineage (techlin Oracle), Datastage Lineage (techlin Oracle hack because the Datastage method does not give good lineage). We are using a couple of other in-house job frameworks and are looking into phasing out Datastage.

Thanks,
Tom KUPPENS

9 Messages

Yes, good news, thanks for your feedback.

What about the MANTA solution, please?

2 Messages

While it is tempting to bring operational metadata into Collibra, we should be asking “who” does “what” with the information. Some of the operational metadata that reports on freshness of data, usage, data quality, etc. does make sense to bring into Collibra, but you have to ask why “operational lineage” when even “definitional lineage” across an enterprise portfolio of applications is often challenging to represent in a data owner/steward-friendly manner. We have Collibra well established as our enterprise metadata and data governance tool. But we are now recognizing the need for a product platform metadata repository (we selected LinkedIn DataHub, but there are other options) which sources definitional metadata from Collibra but also operational metadata from the platform, primarily for data engineer and data ops use. A federated architecture for metadata is necessary for enabling enterprise-wide semantic consistency while also supporting product agility.

41 Messages

Hi Thomas,
Interesting!

We are also at the point where we understand that Collibra is not enough in terms of lineage management. As I described in one of my earlier posts in this thread, there are many things we struggle with at the moment, and our go-to solution is mostly: let’s get the data outside Collibra and do it ourselves. E.g. reporting: “Give me a list of tables that are used in a certain dashboard”, or “In which reports is this table used?”. We recently managed to get operational lineage into Collibra, but the engine and ingestion are built outside Collibra.

So, I’m really curious what your experience is with Datahub, and the other options you had a look at!

Best,
Tom Kuppens