PIP is a plugin for Postgres (being incorporated into MayBMS) that implements a probabilistic database supporting continuous distributions. PIP implements a hybrid of the C-Tables and VG-Functions approaches to probabilistic databases, allowing it to employ a sampling optimizer to ensure that statistical metrics are computed as efficiently as possible. More details on PIP's implementation can be found in our ICDE 2010 paper
This document is structured as follows:
PIP is included as part of the MayBMS Distribution.
http://maybms.sourceforge.net
PIP's core functionality is embedded in a standard Postgres plugin and may be used with any compatible installation of Postgres (it has been verified to work under Postgres 8.4). PIP also includes query-rewriting functionality that makes it possible to write queries that treat probabilistic data as regular data -- this allows for nearly seamless integration of probabilistic data into existing workflows. However, this rewriting functionality requires building a copy of Postgres with PIP's CType extensions. If you do not wish to use these extensions, skip directly to Installing the PIP Library.
$> cd maybms $> ./pip_plugin/scripts/patch_postgres.sh
$> cd postgresql-ctype $> ./configure $> make $> sudo make install $> export PGDATA=[/usr/share/postgres_db] $> export PATH=$PATH:/usr/local/postgres/bin $> mkdir $PGDATA $> initdb
$> cd pip_plugin $> make
$> sudo make_install
$> psql [my_database] -f install[.ctype].sql
$> psql template_1 -f install[.ctype].sql
All probabilistic data in PIP is expressed using the pip_eqn
datatype. New probabilistic data is introduced through the CREATE_VARIABLE
function. This function takes 2 parameters: the name of a distribution, and a vector of parameters to that distribution.
For example:
CREATE TABLE my_probabilistic_data( name varchar(100), data pip_eqn ); INSERT INTO my_probabilistic_data SELECT input.name, CREATE_VARIABLE("Normal", vector(input.mean, input.stddev)) FROM input;This example creates a random variable by iterating over each row of the table 'input'. The Normal distribution requires two parameters: a mean and a standard deviation, both of which are drawn from the corresponding row of the input table.
PIP comes with several probability distributions predefined, described below. The name of the distribution is passed as the first parameter to the CREATE_VARIABLE function.
Queries over probabilistic data fall into two stages. With the CType extensions, the first stage is nearly identical to querying normal data -- write your queries as you would for deterministic data. For example:
CREATE TABLE source_1(int id, measurement double precision, data pip_eqn); CREATE TABLE source_2(int id, data pip_eqn); CREATE TABLE source_3(int id, data pip_eqn); -- fill source_1, 2, and 3 CREATE TABLE results AS SELECT source_1.id, source_1.measurement, source_3.data FROM source_1, source_2, source_3 WHERE source_1.id = source_2.id AND source_2.id = source_3.id AND source_1.data > source_2.data + source_3.data;
However, the result of this query will not be a numerical result, but rather a compressed representation of the probabilistic formula that defines each cell of the result. PIP provides several operators to transform the result into a comprehendable form. These are defined below.
Without using the CType extensions, users must be aware of some of PIP's inner workings. Specifically, constraints over probabilistic data (e.g., source_1.data > source_2.data + source_3.data
) can not be reduced to booleans (as the result of the inequality is itself probabilistic), and must be included in the query result as data. The CType extensions rewrite queries so that the the query results are modified automatically, but without them you must write your queries accordingly.
These comparisons are of type pip_atom
.
What this means for you: Without the CType extensions, queries must be written with comparisons over probabilistic data in the SELECT
clause and not the WHERE
clause. For example, the exampe query above must be rewritten as:
CREATE TABLE source_1(int id, measurement double precision, data pip_eqn); CREATE TABLE source_2(int id, data pip_eqn); CREATE TABLE source_3(int id, data pip_eqn); -- fill source_1, 2, and 3 CREATE TABLE results AS SELECT source_1.id, source_1.measurement, source_3.data, source_1.data > source_2.data + source_3.data FROM source_1, source_2, source_3 WHERE source_1.id = source_2.id AND source_2.id = source_3.id;
Users define new probability distributions by declaring a function (in C) for generating samples from the distribution, as well as several metadata functions that the PIP sampling optimizer can use to improve sampling efficiency. New distributions can be introduced as follows:
.c
file in pip_plugin/src/dist
, which should #include "#pip.h"
.DECLARE_PIP_DISTRIBUTION([shortname]) = { .name = [stringname], .size = [paramsize], .init = [init_fn], .gen = [gen_fn], .pdf = [pdf_fn], .cdf = [cdf_fn], .icdf= [icdf_fn], .out = [output_fn], .in = [input_fn], .joint= false }See below for definitions of each bracketed term.
expectation(var, row)
SELECT results.id, expectation(results.data, results) FROM results;As a shorthand for the expectation function, you may use double angle brackets.
SELECT results.id, << results.data @ results >> FROM results;
conf_one(row)
source_1.data > source_2.data + source_3.data
).SELECT results.id, conf_one(results) FROM results;
expectation_sum(var, row)
SELECT expectation_sum(results.data, results) FROM results;
sum(value, row)
SELECT sum(results.measurement, results) FROM results;
expectation_max(var, row)
SELECT expectation_max(results.data, results) FROM results;
max(value, row)
SELECT max(results.measurement, results) FROM results;
CREATE_VARIABLE("Zero", vector())
CREATE_VARIABLE("Exponential", vector(lambda))
CREATE_VARIABLE("Normal", vector(mean, stddev))
CREATE_VARIABLE("Poisson", vector(lambda))
CREATE_VARIABLE("Uniform", vector(low, high))
The components of a PIP distribution are as follows:
shortname
stringname
CREATE_VARIABLE
.paramsize
2 * sizeof(float)
.init_fn
void init_fn(pip_var *var, HeapTupleHeader params)When your initializer is invoked,
var
will contain a pointer to initialized pip variable. var->group_state
will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. params
is a pointer to the postgres vector() of parameters. See below for information on utility functions for parsing the parameter vector.gen_fn
float8 gen_fn(pip_var *var, int64 seed)The generator function returns a value sampled from the distribution being defined, with parameters stored in
var->group_state
(as defined above). This function MUST be deterministic; Randomness is obtained from the randomly selected seed
parameter. See below for information on utility functions for generating random numbers from this seed value.pdf_fn
float8 pdf_fn(pip_var *var, float8 point)The pdf function returns the probability density at the indicated
point
of the distribution being defined, with parameters stored in var->group_state
(as defined above). If this pointer is NULL, optimizations involving the distribution's PDF will be ignored by PIP's sampling optimizer.cdf_fn
float8 cdf_fn(pip_var *var, float8 point)The cdf function returns the probability of selecting a sample less than or equal to the indicated
point
, given the parameters stored in var->group_state
(as defined above). If this pointer is NULL, optimizations involving the distribution's CDF will be ignored by PIP's sampling optimizer.icdf_fn
float8 icdf_fn(pip_var *var, float8 probability)The inverse cdf function returns a point in the domain of the distribution such that the likelihood of sampling a value less than or equal to the point is equal to the indicated
probability
(which is guaranteed to be between 0 and 1, inclusive), given the parameters stored in var->group_state
(as defined above). If both [cdf_fn] and [icdf_fn] are provided, then it MUST be true that x = cdf_fn(var, icdf_fn(var, x));If this pointer is NULL, optimizations involving the distribution's inverse CDF will be ignored by PIP's sampling optimizer.
output_fn
int input_fn(pip_var *var, int len, char *str)When invoked,
str
will contain a pointer to a (large) allocated, but uninitialized block of memory of size len
this function should fill with a C string containing a human-readable representation of the distribution's parameters (e.g., using snprintf), which should be stored in var->group_state
(as defined above). The function should return the length of the string, not counting the trailing null character (as snprintf). Note: Although this function is optional, be aware that not including it will prevent users from being able to export data defined in terms of this distribution in the more commonly compatible text format, potentially preventing you from being able to upgrade your PIP install. Consequently, this function MUST be included in any production system.
input_fn
int input_fn(pip_var *var, char *str)When the input function is invoked,
var
will contain a pointer to initialized pip variable. var->group_state
will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. str
references a C string containing a human-readable representation of this distribution's parameters, as used in [output_fn]. The input function should parse this string (e.g., using sscanf) and initialize var->group_state
in the same way as [init_fn]. Note: Although this function is optional, be aware that not including it will prevent users from being able to import data defined in terms of this distribution.
PIP includes a library of utility functions for use in defining distributions. These methods are defined in pip_plugin/src/include/dist.h
(which will be included as part of "pip.h"
)
float8 dist_param_float8(HeapTupleHeader params, int n, float8 default)
n
th element (starting with 0) of params
, cast and/or translated into a float8. If the params
has fewer than n
+1 elements, the default
value will be returned instead.int64 pip_prng_step(int64 seed)
seed
value and returns the next (random but deterministic) seed value.int64 pip_prng_int(int64 *seed)
seed
) integer value between 0 and 2^63-1. The seed
value is passed by reference, and is stepped to its next value automatically.
float8 pip_prng_float(int64 *seed)
seed
) float value between 0 and 1. The seed
value is passed by reference, and is stepped to its next value automatically.
void pip_box_muller(float8 *X, float8 *Y, int64 *seed)
seed
), independent, normally distributed floating point numbers (with mean of 0 and standard deviation of 1) using the Box-Muller method. The independent variables will be stored in X
and Y
-- either or both may be used. The seed
value is passed by reference, and is stepped to its next value automatically (note that the Box-Muller method requires two random numbers as input and as a consequence, seed
is actually stepped twice).