Thursday, October 31st, 2024 10:06 AM

Best order to deploy Sampling, Profiling and Classification

We currently use Edge to harvest metadata and lineage from our Snowflake accounts using JDBC connections. We are looking at deploying the sampling, profiling and classification capabilities next. Is there a logical order to deploy them in, so we can plan our delivery? They all have similar data access requirements, but our Snowflake accounts have some very large tables, so I'd prefer to deploy in sequence from least to most demanding and monitor the load on our Edge side as we go. Thanks for any advice.

15 days ago

Profiling: some of the technologies let you limit the number of rows. Without this (e.g. DB2) I don't think it's workable. If you do use it, I recommend decoupling it from metadata ingestion so you can run it less often and at quieter times. We started testing with 100 rows and gave that a few weeks in PROD first (running at 2am on Saturdays) to get a feel for impact and timings. We avoided even ad hoc testing during working hours; it's not worth the chance of negative publicity. Make sure the DBAs, network team, etc. are aware of what you are doing. Scale up the profile row limit when you're ready.
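
To make the ramp-up concrete, here is a minimal planning sketch in Python. It does not call Edge or any Collibra API; it only works out dates and row limits that you would then enter by hand. The function names, the growth factor of 10 and the three-week soak period are my own assumptions, not product settings.

```python
from datetime import datetime, timedelta


def next_saturday_2am(now: datetime) -> datetime:
    """Return the first Saturday 02:00 strictly after `now`."""
    days_ahead = (5 - now.weekday()) % 7  # Monday=0 ... Saturday=5
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def ramp_plan(start_rows, max_rows, factor, weeks_per_stage, first_run):
    """List (run_start, row_limit) stages, growing the limit by `factor`.

    All parameters are hypothetical planning inputs, not Edge options.
    """
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    plan = []
    rows, run = start_rows, first_run
    while True:
        plan.append((run, rows))
        if rows >= max_rows:
            break
        rows = min(rows * factor, max_rows)
        run += timedelta(weeks=weeks_per_stage)
    return plan


if __name__ == "__main__":
    first = next_saturday_2am(datetime.now())
    for run_start, row_limit in ramp_plan(100, 10_000, 10, 3, first):
        print(run_start.strftime("%a %Y-%m-%d %H:%M"), row_limit)
```

Each stage keeps the same quiet-time slot, so timings from one stage are comparable with the next before you raise the limit.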

Sampling and Data Classification are lighter and we happily test whenever.

A tip: don't have ">" in your naming convention ... it breaks Data Classification!
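
If you want to check for this before switching classification on, a quick sketch like the one below would do, assuming you can export the relevant names to a plain list first. The function name and the sample names are made up for illustration.

```python
def names_with_gt(names):
    """Return the names that contain a '>' character."""
    return [name for name in names if ">" in name]


if __name__ == "__main__":
    # Hypothetical exported names; replace with your own list.
    exported = ["SALES>ORDERS", "CUSTOMERS", "FINANCE>LEDGER"]
    for name in names_with_gt(exported):
        print("rename before classifying:", name)
```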

15 days ago

One thing to remember is that with some of the JDBC drivers you cannot choose how many rows to profile; it has to be 100%.

We ran into this with some of our data when we tried to run profiling on our DEV instance. Even with the more efficient updates to Edge, it still takes 2 days and often errors out at random with what we assume is a timeout. We can't confirm that, because the logs stop when the job stops.

Technically you can limit the number of tables, but if you have systems with 2000+ tables the exclusion list becomes cumbersome to work with.
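
One way to make a long exclusion list less painful is to generate it rather than maintain it by hand. The sketch below assumes you already have the full list of table names (for example from the metadata you have ingested) and that a few wildcard patterns can describe the tables you do want profiled. How the resulting list gets into the Edge configuration is not covered here, and the function name and patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def split_tables(tables, include_patterns):
    """Split table names into (included, excluded) using shell-style wildcards."""
    included, excluded = [], []
    for name in tables:
        if any(fnmatchcase(name, pattern) for pattern in include_patterns):
            included.append(name)
        else:
            excluded.append(name)
    return included, excluded


if __name__ == "__main__":
    # Hypothetical table names and patterns, for illustration only.
    all_tables = ["DIM_CUSTOMER", "DIM_PRODUCT", "FACT_SALES", "TMP_LOAD_01", "STG_ORDERS"]
    to_profile, exclusion_list = split_tables(all_tables, ["DIM_*", "FACT_*"])
    print("profile:", to_profile)
    print("exclude:", exclusion_list)
```

Matching is case-sensitive here, so the patterns need to use the same casing as the table names.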
