Community_Alex | Thursday, March 13th, 2025 12:54 PM

Masterclass AMA Session | Best practices: Ensuring data reliability in the AI era

Hey Collibra Community, thanks for joining our session on ensuring data reliability in the AI era! 👋 Seth Clark and Brad Munday are here to answer any questions.

You can also let them know which tips from the session you plan to use in your organization. And because this is a community, please share any best practices that have worked for you so others can benefit from them!

(All questions are sourced directly from the live session to support our community in learning together!)

What are some best practices to embed data quality monitors in data pipelines?

Implement Monitoring at Multiple Stages:

  • Source Monitoring: Validate data as close to the source as possible to catch issues early.
  • Ingestion Monitoring: Check data as it enters the pipeline for format, schema, and basic validity.
  • Transformation Monitoring: Monitor data after each transformation step to ensure the logic is applied correctly and data quality is preserved.
  • Output Monitoring: Validate the final output data before it's used for analysis or model training.

Define Clear Data Quality Metrics & SLAs:

  • Identify Key Metrics: Determine the most critical data quality dimensions for your specific use case (e.g., completeness, accuracy, consistency, timeliness, validity, uniqueness).
  • Set Thresholds: Establish acceptable thresholds or ranges for each metric. These should be realistic and based on business needs.
  • Service Level Agreements (SLAs): Define SLAs for data quality, specifying acceptable levels of quality and consequences for breaches.

You can learn more at https://www.collibra.com/resources/collibra-data-quality-and-observability-best-practices-for-reliable-data-pipelines
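
To make the multi-stage idea concrete, here is a minimal sketch of what embedded checks could look like in plain Python with pandas. The column names, thresholds, and fail-fast behaviour are illustrative assumptions for the example, not Collibra DQ functionality.

```python
import pandas as pd

# Illustrative thresholds -- in practice these come from the metrics and SLAs
# agreed with the business, not hard-coded values.
THRESHOLDS = {"max_null_rate": 0.02, "min_rows": 1000}

# Expected structure of the incoming data (hypothetical order feed).
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64"}


def check_ingestion(df: pd.DataFrame) -> list:
    """Ingestion monitoring: schema, volume, and basic validity."""
    issues = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"unexpected dtype for {col}: {df[col].dtype}")
    if len(df) < THRESHOLDS["min_rows"]:
        issues.append(f"row count {len(df)} below minimum {THRESHOLDS['min_rows']}")
    return issues


def check_output(df: pd.DataFrame) -> list:
    """Output monitoring: completeness and uniqueness before downstream use."""
    issues = []
    null_rate = df["amount"].isna().mean()
    if null_rate > THRESHOLDS["max_null_rate"]:
        issues.append(f"null rate {null_rate:.2%} exceeds threshold")
    if df["order_id"].duplicated().any():
        issues.append("duplicate order_id values found")
    return issues


def run_stage(name, df, check):
    """Run one stage's checks and fail fast when expectations are breached."""
    issues = check(df)
    if issues:
        raise ValueError(f"[{name}] data quality check failed: {issues}")
    return df
```

The same pattern repeats at the source and transformation stages; the point is that each stage carries explicit, agreed expectations rather than implicit trust in upstream data.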

How do you use data observability to ensure policy compliance in the context of AI use cases?

One of the most obvious scenarios for ensuring policy compliance is making sure that the data you're putting into a model isn't breaking any rules or laws. Managing those use cases within something like Collibra's AI Gov product provides a common place to centralize those policies and make sure that any data assets used for a given application won't break them. This might include classifying your data to make sure PII, PCI, and PHI aren't included in the data used for a particular AI application. Another good example would be real-time monitoring of the prompts being submitted to an LLM application to make sure harmful questions aren't being asked.
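
As a rough illustration of the prompt-monitoring example (not how Collibra's products implement it), a minimal pre-flight screen in Python might look like the sketch below; the regex patterns and blocked-topic list are deliberately simplistic placeholders for a real classifier and policy engine.

```python
import re

# Crude illustrative patterns -- a real deployment would use a proper PII
# classifier and policy engine, not regexes like these.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

# Illustrative policy list of disallowed topics.
BLOCKED_TOPICS = ("how to build a weapon", "bypass security")


def classify_pii(text: str) -> list:
    """Return the PII categories detected in a text field or prompt."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def screen_prompt(prompt: str) -> None:
    """Block prompts containing PII or disallowed topics before they reach the LLM."""
    found = classify_pii(prompt)
    if found:
        raise PermissionError(f"prompt rejected: contains {found}")
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        raise PermissionError("prompt rejected: violates usage policy")


# Example: screen_prompt("My SSN is 123-45-6789") raises PermissionError.
```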

How do you get the different stakeholders (data stewards, engineers, scientists) to collaborate on data reliability?

Definitely not a one-size-fits-all answer! Depending on your organization, there are a few approaches that might make sense:

  1. Norming around data products can be a really effective way to collaborate; it's almost like having an SLA that data science teams can rely on (see the sketch after this list).
  2. Shared training between data scientists and data engineers on the types of observability each focuses on. Data observability and ML model observability can complement each other very nicely, but only if data engineering and data science teams are working closely together.
  3. Having data science teams share the business impacts of upstream data changes can also help data stewards understand which kinds of risks to prioritize and which are manageable.
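
To illustrate point 1, here is a hypothetical data product SLA written down as code; the fields and values are made up for the example, but capturing them explicitly is what lets stewards, engineers, and scientists norm around the same expectations.

```python
from dataclasses import dataclass, field


@dataclass
class DataProductSLA:
    """Hypothetical SLA a data science team could rely on for an upstream data product."""
    name: str
    owner: str                      # accountable steward / engineering team
    freshness_hours: int            # data must be no older than this
    max_null_rate: float            # completeness expectation
    schema_change_notice_days: int  # advance warning before breaking changes
    dimensions: list = field(default_factory=lambda: ["completeness", "timeliness"])


# Example: producers commit to this contract, consumers build against it.
orders_sla = DataProductSLA(
    name="orders_curated",
    owner="commerce-data-engineering",
    freshness_hours=24,
    max_null_rate=0.01,
    schema_change_notice_days=14,
)
```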

In which layer does Collibra DQ operate?

Great question. Within the context of a classical ML workflow, Collibra DQ would probably best fit just upstream of the "Data Prep and Pipelines" stage. Putting robust data quality in place on the highest-value data sources that data science teams use upstream of their model development workflows is really the best way to go. That gives data science teams confidence that the production data they build their features and inference pipelines on is high quality and consistent, and that the structure (schema) isn't changing without them knowing in advance!
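
As a hedged sketch of that last point, here is a lightweight schema-drift check that could run before feature and inference pipelines so structural changes surface before they silently break models downstream. The expected columns are invented for the example; this is the kind of monitoring the answer describes putting in place upstream rather than hand-rolling per pipeline.

```python
import pandas as pd

# Illustrative expected structure for a source table feeding feature pipelines.
EXPECTED_COLUMNS = {
    "customer_id": "int64",
    "signup_date": "datetime64[ns]",
    "lifetime_value": "float64",
}


def detect_schema_drift(df: pd.DataFrame) -> list:
    """Compare the live table against the expected schema and report any drift."""
    drift = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            drift.append(f"column dropped: {col}")
        elif str(df[col].dtype) != dtype:
            drift.append(f"type changed for {col}: {dtype} -> {df[col].dtype}")
    for col in df.columns:
        if col not in EXPECTED_COLUMNS:
            drift.append(f"new column added: {col}")
    return drift


# If detect_schema_drift(...) returns anything, alert the data science team
# before retraining or running inference against the changed table.
```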

What's your thought regarding the FAIRification of data in addition to traditional quality rules to get AI-ready data? Thanks!

FAIR (Findable, Accessible, Interoperable, Reusable) adds a data governance dimension on top of traditional data quality rules. While data quality focuses on accuracy, completeness, and consistency, FAIR principles emphasize findability, accessibility, interoperability, and reusability. They inherently introduce aspects of data governance because they address how data is managed, shared, and controlled to ensure its value and utility over time. In practice, that means adding a data catalog/marketplace, privacy and access controls, data contracts, and so on.
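
As a hypothetical illustration (not a Collibra schema), here is what a FAIR-oriented catalog entry might carry alongside the traditional quality rules applied to the dataset itself; every field name below is an assumption made up for the example.

```python
# Hypothetical catalog entry: FAIR metadata sits alongside, not instead of,
# the data quality checks run on the dataset.
catalog_entry = {
    # Findable: unique identifier plus rich, searchable metadata
    "id": "urn:example:dataset:claims-2024",
    "title": "Insurance claims, curated (2024)",
    "keywords": ["claims", "curated", "pii-redacted"],
    # Accessible: how the data is retrieved and who may retrieve it
    "access": {"protocol": "jdbc", "policy": "role:claims-analyst", "pii_redacted": True},
    # Interoperable: standard formats and shared vocabularies
    "format": "parquet",
    "schema_standard": "shared claims data model",
    # Reusable: licence, provenance, and the contract consumers can rely on
    "license": "internal-use-only",
    "lineage": ["raw.claims_2024", "staging.claims_clean"],
    "data_contract": {"freshness_hours": 24, "completeness_min": 0.98},
}
```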

How do you convince customers that classical ML is the right tool vs GenAI?

What are best practices to get AI from project to production? 

What role does data quality play in issues such as overfitting? 

Inaccurate or Biased Data:
  • If the training data contains errors, inconsistencies, or biases, the model will learn these flaws, leading to inaccurate predictions and potentially perpetuating biases in the AI's output.
  • Mitigation: identify and remove errors, inconsistencies, and outliers from the dataset.

Missing Data:
  • Incomplete datasets can lead to models making assumptions based on limited information, which can result in overfitting and inaccurate predictions.
  • Mitigation: implement checks to ensure that the data meets certain quality standards, such as completeness, accuracy, and consistency.

Lack of Diversity:
  • If the training data is not diverse enough, the model might not generalize to different scenarios or populations, leading to overfitting and poor performance on unseen data.
  • Mitigation: ensure that the training data is representative of the real-world scenarios where the model will be used, and evaluate the model's performance on different subsets of the data to assess its ability to generalize.
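
As a hedged sketch of that last mitigation (evaluating the model on different subsets of the data), the snippet below uses scikit-learn cross-validation plus per-segment scoring; the dataset, segment column, and model are placeholders chosen for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_generalization(df: pd.DataFrame, features: list, target: str, segment: str):
    """Check for overfitting via cross-validation and per-segment performance."""
    X, y = df[features], df[target]
    model = RandomForestClassifier(random_state=0)

    # Cross-validation: unstable or low validation scores hint at overfitting.
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

    # Per-segment evaluation: large variation across segments suggests the
    # training data is not diverse or representative enough.
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    model.fit(X_train, y_train)
    for value, subset in df.loc[X_test.index].groupby(segment):
        preds = model.predict(subset[features])
        print(f"{segment}={value}: accuracy {accuracy_score(subset[target], preds):.3f}")
```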