Monday, April 4th, 2022 5:23 AM

OWL Python libraries

Hi all,

Does anyone know where the location is for the owl python libraries? Testing out calling a DQ job from a python notebook and can’t find the libs :frowning:

Thanks
Shane :vulcan_salute:

@shane.dolley.mip.com.au Can you find what you are looking for here? Either way, check out this thread as well. Do either of these help?

Unfortunately not @kristen.freer :frowning:

The client is attempting to run Python from an Azure Databricks notebook to run queries/DQ jobs on their DQ server, but without the actual packages to import in the first place, we’re at a loss.

Hi @shane.dolley.mip.com.au, you will need the Collibra DQ Scala API JARs for this to work. We don’t have a Python API; instead, we use Py4J to wrap the Scala API, like this:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, DataFrame
from py4j.java_collections import SetConverter, MapConverter, ListConverter
from py4j.java_gateway import JavaGateway, java_import

# Load a sample dataset to run the DQ checks against
df = spark.read.format("csv").option("header", "true").load("s3a://s3-datasets/nyse.csv")
df.count()

# Connection details for the DQ metastore -- fill these in for your environment
pgHost = ""
pgDatabase = ""
pgSchema = ""
pgUser = ""
pgPass = ""

# Scala version: val pgPort = s"5432/$pgDatabase?currentSchema=$pgSchema"
# Python version:
pgPort = f"5432/{pgDatabase}?currentSchema={pgSchema}"

# The PySpark session exposes the py4j entry points into the JVM.
# These interfaces allow Python code to call Java/Scala objects.
jvm = spark._jvm
gateway = spark._sc._gateway

# Top-level DQ job options
opt = jvm.com.owl.common.options.OwlOptions()
opt.setDataset("nyse_python")
opt.setRunId("2018-01-10")
opt.setHost(pgHost)
opt.setPort(pgPort)
opt.setPgUser(pgUser)
opt.setPgPassword(pgPass)
opt.setDatasetSafeOff(True)

# Duplicate-detection options
optDup = jvm.com.owl.common.options.DupeOpt()
optDup.setOn(True)
optDup.setLowerBound(99)

# Current DQ API method signatures take a Java array; a Python list translates to a
# Java ArrayList, so we need to explicitly tell py4j to serialize it as a Java array.
dupeInc = gateway.new_array(jvm.java.lang.String, 1)
dupeInc[0] = "SYMBOL"
optDup.setInclude(dupeInc)
opt.setDupe(optDup)
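# Aside: py4j's ListConverter (imported above) turns a Python list into a
# java.util.ArrayList, e.g.
#   ListConverter().convert(["SYMBOL"], gateway._gateway_client)
# That only suits setters declared to take a java.util.List; the array-typed
# DQ setters used here need gateway.new_array as shown.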

# Outlier-detection options
optOut = jvm.com.owl.common.options.OutlierOpt()
optOut.setOn(True)
optOut.setRecord(True)
optOut.setLookback(12)
optOut.setDateColumn("TRADE_DATE")

# This method takes an enum value; use the jvm interface to access the enum
# embedded in the Java object.
optOut.setTimeBin(jvm.com.owl.common.options.OutlierOpt.TimeBin.DAY)

key = gateway.new_array(jvm.java.lang.String, 1)
key[0] = "SYMBOL"
optOut.setKey(key)
optOut.setMeasurementUnit("VOLUME=100000000,HIGH=0.1,LOW=0.1,OPEN=0.1,CLOSE=0.1")
opt.setOutlier(optOut)

# Pattern-detection options
optPat = jvm.com.owl.common.options.PatternOpt()
optPat.setOn(True)
key = gateway.new_array(jvm.java.lang.String, 1)
key[0] = "SYMBOL"
optPat.setKey(key)
optPat.setLookback(5)
optPat.setDateColumn("TRADE_DATE")
# Note: unlike the dupe and outlier options above, optPat is never attached to
# opt in this snippet; presumably an equivalent attach call is needed for the
# pattern check to actually run.

# Build the OWL context around the underlying Java dataframe, register it, and run the check
owl = jvm.com.owl.core.util.OwlUtils.OwlContext(df._jdf, opt)
owl.register(opt)
owl.owlCheck()

# The OWL context returns a Scala dataframe; use the pyspark.sql DataFrame wrapper
# to get a pyspark dataframe back instead.
sqlContext = SQLContext(spark.sparkContext)  # already predefined in Databricks notebooks
DataFrame(owl.getOutliers(), sqlContext._ssql_ctx).show()
DataFrame(owl.getDupes(), sqlContext._ssql_ctx).show()
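One more prerequisite: the jvm.com.owl.* lookups above only resolve if the Collibra DQ JARs are on the cluster's JVM classpath before the session starts. On Databricks, attach the JARs to the cluster as a library; for a plain PySpark session, a minimal sketch would look like the following (the JAR path is hypothetical, point it at wherever your DQ JARs live):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("owl-dq")
    # hypothetical path -- replace with the location of your Collibra DQ JARs
    .config("spark.jars", "/dbfs/FileStore/jars/owl-core.jar")
    .getOrCreate()
)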

Thanks Laurent. :slight_smile:
