PIP is a plugin for Postgres (being incorporated into MayBMS) that implements a probabilistic database supporting continuous distributions. PIP implements a hybrid of the C-Tables and VG-Functions approaches to probabilistic databases, allowing it to employ a sampling optimizer to ensure that statistical metrics are computed as efficiently as possible. More details on PIP's implementation can be found in our ICDE 2010 paper

This document is structured as follows:

- Installing PIP
- Using PIP
- Creating Probabilistic Data
- Writing Probabilistic Queries - With CType Extensions
- Writing Probabilistic Queries - Without CType Extensions
- Defining Probability Distributions
- Reference: Expectation Operators
- Reference: Predefined Distributions
- Reference: Distribution Components
- Reference: Distribution Utility Functions

PIP is included as part of the MayBMS Distribution.

http://maybms.sourceforge.net

PIP's core functionality is embedded in a standard Postgres plugin and may be used with any compatible installation of Postgres (it has been verified to work under Postgres 8.4). PIP also includes query-rewriting functionality that makes it possible to write queries that treat probabilistic data as regular data -- this allows for nearly seamless integration of probabilistic data into existing workflows. However, this rewriting functionality requires building a copy of Postgres with PIP's CType extensions. If you do not wish to use these extensions, skip directly to Installing the PIP Library.

- Create a copy of MayBMS with the CType extensions by running the patch script.
$> cd maybms $> ./pip_plugin/scripts/patch_postgres.sh

- Build, install, and set up the patched MayBMS as normal. For example, on a standard configuration:
$> cd postgresql-ctype $> ./configure $> make $> sudo make install $> export PGDATA=[/usr/share/postgres_db] $> export PATH=$PATH:/usr/local/postgres/bin $> mkdir $PGDATA $> initdb

- Compile PIP.
$> cd pip_plugin $> make

- Install PIP. By default, PIP Installs the PIP library into /usr/local/lib/pgsql. To change the default install directory, modify the INSTALL_DIR variable in the Makefile.
$> sudo make_install

- Configure Postgres to use PIP. This step needs to be run once for each database that you intend to use PIP with. The makefile generates two scripts to automate this process: install.sql and install.ctype.sql, for use with standard postgres installations and installs with the CType patch described above, respectively.
$> psql [my_database] -f install[.ctype].sql

- To add PIP functionality to all databases created in the future, run the setup script (as above) on the database 'template_1'. Note that this does not affect any existing databases.
$> psql template_1 -f install[.ctype].sql

All probabilistic data in PIP is expressed using the `pip_eqn`

datatype. New probabilistic data is introduced through the `CREATE_VARIABLE`

function. This function takes 2 parameters: the name of a distribution, and a vector of parameters to that distribution.

For example:

CREATE TABLE my_probabilistic_data( name varchar(100), data pip_eqn ); INSERT INTO my_probabilistic_data SELECT input.name, CREATE_VARIABLE("Normal", vector(input.mean, input.stddev)) FROM input;This example creates a random variable by iterating over each row of the table 'input'. The Normal distribution requires two parameters: a mean and a standard deviation, both of which are drawn from the corresponding row of the input table.

PIP comes with several probability distributions predefined, described below. The name of the distribution is passed as the first parameter to the CREATE_VARIABLE function.

Queries over probabilistic data fall into two stages. With the CType extensions, the first stage is nearly identical to querying normal data -- write your queries as you would for deterministic data. For example:

CREATE TABLE source_1(int id, measurement double precision, data pip_eqn); CREATE TABLE source_2(int id, data pip_eqn); CREATE TABLE source_3(int id, data pip_eqn); -- fill source_1, 2, and 3 CREATE TABLE results AS SELECT source_1.id, source_1.measurement, source_3.data FROM source_1, source_2, source_3 WHERE source_1.id = source_2.id AND source_2.id = source_3.id AND source_1.data > source_2.data + source_3.data;

However, the result of this query will not be a numerical result, but rather a compressed representation of the probabilistic formula that defines each cell of the result. PIP provides several operators to transform the result into a comprehendable form. These are defined below.

Without using the CType extensions, users must be aware of some of PIP's inner workings. Specifically, constraints over probabilistic data (e.g., `source_1.data > source_2.data + source_3.data`

) can not be reduced to booleans (as the result of the inequality is itself probabilistic), and must be included in the query result as data. The CType extensions rewrite queries so that the the query results are modified automatically, but without them you must write your queries accordingly.
These comparisons are of type `pip_atom`

.

**What this means for you:** Without the CType extensions, queries must be written with comparisons over probabilistic data in the `SELECT`

clause and not the `WHERE`

clause. For example, the exampe query above must be rewritten as:

CREATE TABLE source_1(int id, measurement double precision, data pip_eqn); CREATE TABLE source_2(int id, data pip_eqn); CREATE TABLE source_3(int id, data pip_eqn); -- fill source_1, 2, and 3 CREATE TABLE results AS SELECT source_1.id, source_1.measurement, source_3.data,source_1.data > source_2.data + source_3.dataFROM source_1, source_2, source_3 WHERE source_1.id = source_2.id AND source_2.id = source_3.id;

Users define new probability distributions by declaring a function (in C) for generating samples from the distribution, as well as several metadata functions that the PIP sampling optimizer can use to improve sampling efficiency. New distributions can be introduced as follows:

- Create a new
`.c`

file in`pip_plugin/src/dist`

, which should`#include "#pip.h"`

. - Add the following code to the file:
DECLARE_PIP_DISTRIBUTION([shortname]) = { .name = [stringname], .size = [paramsize], .init = [init_fn], .gen = [gen_fn], .pdf = [pdf_fn], .cdf = [cdf_fn], .icdf= [icdf_fn], .out = [output_fn], .in = [input_fn], .joint= false }

See below for definitions of each bracketed term. - Define the necessary functions as described below.
- Compile and install PIP as normal. You may be prompted to upgrade your PIP install.

`expectation(var, row)`

- Compute the expectation of the specified variable. For technical reasons, you must also provide a reference to the table that the variable appears in.
SELECT results.id, expectation(results.data, results) FROM results;

As a shorthand for the expectation function, you may use double angle brackets.SELECT results.id, << results.data @ results >> FROM results;

`conf_one(row)`

- Compute the confidence of a row, or probability that the row is present in the database; Such nondeterminism can arise as the result of comparisons involving probabilistic data (e.g.
`source_1.data > source_2.data + source_3.data`

).SELECT results.id, conf_one(results) FROM results;

`expectation_sum(var, row)`

- Compute the expectation of the sum aggregate over the specified probabilistic data column. For technical reasons, you must also provide a reference to the table that the column appears in.
SELECT expectation_sum(results.data, results) FROM results;

`sum(value, row)`

- Compute the expectation of the sum aggregate over the specified non-probabilistic data column. This function should replace use of the standard sum aggregate. For technical reasons, you must also provide a reference to the table that the column appears in.
SELECT sum(results.measurement, results) FROM results;

`expectation_max(var, row)`

- Compute the expectation of the max aggregate over the specified column. For technical reasons, you must also provide a reference to the table that the column appears in.
SELECT expectation_max(results.data, results) FROM results;

`max(value, row)`

- Compute the expectation of the max aggregate over the specified non-probabilistic data column. This function should replace use of the standard max aggregate. For technical reasons, you must also provide a reference to the table that the column appears in.
SELECT max(results.measurement, results) FROM results;

- Zero
- A distribution that is always zero (i.e., the Dirac Delta as a distribution).
CREATE_VARIABLE("Zero", vector())

- Exponential
- The Exponential Distribution. The Exponential Distribution takes one parameter, lambda: the
**inverse**of the expectation of the variable being created.CREATE_VARIABLE("Exponential", vector(lambda))

- Normal
- The Normal Distribution. The Normal Distribution takes two parameters, the mean, and the standard deviation of the variable being created.
CREATE_VARIABLE("Normal", vector(mean, stddev))

- Poisson
- The Poisson Distribution. The Poisson distribution takes one parameter, lambda: the expectation of the variable being created.
CREATE_VARIABLE("Poisson", vector(lambda))

- Uniform
- The Uniform Distribution. The uniform distribution takes two parameters, the endpoints of the distribution; The endpoints may be provided in any order.
CREATE_VARIABLE("Uniform", vector(low, high))

The components of a PIP distribution are as follows:

`shortname`

- A short, unique, one-word name for the distribution. The short name must be a valid c identifier
`stringname`

- A C string containing a human-readable name for the distribution. This is the same name that will be used when an end-user calls
`CREATE_VARIABLE`

. `paramsize`

- The number of bytes required to store this distribution's parameters. For example, if your distribution is parametereized by two floats, this parameter should be
`2 * sizeof(float)`

. `init_fn`

- A pointer to a constructor function for your distribution with the following schema:
void init_fn(pip_var *var, HeapTupleHeader params)

When your initializer is invoked,`var`

will contain a pointer to initialized pip variable.`var->group_state`

will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes.`params`

is a pointer to the postgres vector() of parameters. See below for information on utility functions for parsing the parameter vector. `gen_fn`

- A pointer to a generator function for your distribution with the following schema:
float8 gen_fn(pip_var *var, int64 seed)

The generator function returns a value sampled from the distribution being defined, with parameters stored in`var->group_state`

(as defined above). This function**MUST**be deterministic; Randomness is obtained from the randomly selected`seed`

parameter. See below for information on utility functions for generating random numbers from this seed value. `pdf_fn`

- NULL, or a pointer to a function for computing the probability density function of the distribution being defined, with the following schema:
float8 pdf_fn(pip_var *var, float8 point)

The pdf function returns the probability density at the indicated`point`

of the distribution being defined, with parameters stored in`var->group_state`

(as defined above). If this pointer is NULL, optimizations involving the distribution's PDF will be ignored by PIP's sampling optimizer. `cdf_fn`

- NULL, or a pointer to a function for computing the cumulative distribution function of the distribution being defined, with the following schema:
float8 cdf_fn(pip_var *var, float8 point)

The cdf function returns the probability of selecting a sample less than or equal to the indicated`point`

, given the parameters stored in`var->group_state`

(as defined above). If this pointer is NULL, optimizations involving the distribution's CDF will be ignored by PIP's sampling optimizer. `icdf_fn`

- NULL, or a pointer to a function for computing the inverse of the cumulative distribution function of the distribution being defined, with the following schema:
float8 icdf_fn(pip_var *var, float8 probability)

The inverse cdf function returns a point in the domain of the distribution such that the likelihood of sampling a value less than or equal to the point is equal to the indicated`probability`

(which is guaranteed to be between 0 and 1, inclusive), given the parameters stored in`var->group_state`

(as defined above). If both [cdf_fn] and [icdf_fn] are provided, then it**MUST**be true thatx = cdf_fn(var, icdf_fn(var, x));

If this pointer is NULL, optimizations involving the distribution's inverse CDF will be ignored by PIP's sampling optimizer. `output_fn`

- NULL, or a pointer to a function for generating a human-readable representation of the distribution's parameters, with the following schema:
int input_fn(pip_var *var, int len, char *str)

When invoked,`str`

will contain a pointer to a (large) allocated, but uninitialized block of memory of size`len`

this function should fill with a C string containing a human-readable representation of the distribution's parameters (e.g., using snprintf), which should be stored in`var->group_state`

(as defined above). The function should return the length of the string, not counting the trailing null character (as snprintf).**Note:**Although this function is optional, be aware that not including it will prevent users from being able to export data defined in terms of this distribution in the more commonly compatible text format, potentially preventing you from being able to upgrade your PIP install. Consequently, this function**MUST**be included in any production system. `input_fn`

- NULL, or a pointer to a function for parsing a human-readable representation of the distribution's parameters, with the following schema:
int input_fn(pip_var *var, char *str)

When the input function is invoked,`var`

will contain a pointer to initialized pip variable.`var->group_state`

will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes.`str`

references a C string containing a human-readable representation of this distribution's parameters, as used in [output_fn]. The input function should parse this string (e.g., using sscanf) and initialize`var->group_state`

in the same way as [init_fn].**Note:**Although this function is optional, be aware that not including it will prevent users from being able to import data defined in terms of this distribution.

PIP includes a library of utility functions for use in defining distributions. These methods are defined in `pip_plugin/src/include/dist.h`

(which will be included as part of `"pip.h"`

)

`float8 dist_param_float8(HeapTupleHeader params, int n, float8 default)`

- Utility function for parsing a Postgres vector (e.g., in [init_fn]). Returns the
`n`

th element (starting with 0) of`params`

, cast and/or translated into a float8. If the`params`

has fewer than`n`

+1 elements, the`default`

value will be returned instead. `int64 pip_prng_step(int64 seed)`

- The central operation for generating random numbers, for use in a distribution's generator function. Takes a
`seed`

value and returns the next (random but deterministic) seed value. `int64 pip_prng_int(int64 *seed)`

- The canonical way to generate a random integer in a distribution's generator function. Takes a seed value, and returns a random (but deterministic based on
`seed`

) integer value between 0 and 2^63-1. The`seed`

value is passed by reference, and is stepped to its next value automatically. `float8 pip_prng_float(int64 *seed)`

- The canonical way to generate a random float in a distribution's generator function. Takes a seed value, and returns a random (but deterministic based on
`seed`

) float value between 0 and 1. The`seed`

value is passed by reference, and is stepped to its next value automatically. `void pip_box_muller(float8 *X, float8 *Y, int64 *seed)`

- Generate two random (but deterministic based on
`seed`

), independent, normally distributed floating point numbers (with mean of 0 and standard deviation of 1) using the Box-Muller method. The**independent**variables will be stored in`X`

and`Y`

-- either or both may be used. The`seed`

value is passed by reference, and is stepped to its next value automatically (note that the Box-Muller method requires two random numbers as input and as a consequence,`seed`

is actually stepped twice).