?

Log in

No account? Create an account
Building your own Aggregate Functions with PostGreSQL - Nick [entries|archive|friends|userinfo]
Nick

[ website | gagravarr.org ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

Building your own Aggregate Functions with PostGreSQL [Apr. 2nd, 2009|04:00 pm]
Nick
This used to be on the Torchbox Blog, but got lost in the site re-launch. As it's really handy, I've re-posted it here

In SQL, aggregate functions are powerful things that act upon a number of rows, and output a single value from them. With PostGreSQL 8.x, it's now possible to easily define your own aggregate functions, to aggreate your data however you need. One use of this is to generate powerful reports very easily.

Let us think of a very simple (if slightly stilted) database:
CREATE SEQUENCE s_types;
CREATE TABLE types (
id INTEGER NOT NULL DEFAULT NEXTVAL('s_types'),
name VARCHAR(16) NOT NULL,
CONSTRAINT pk_types PRIMARY KEY(id)
);
INSERT INTO types (name) VALUES ('coding');
INSERT INTO types (name) VALUES ('design');
INSERT INTO types (name) VALUES ('sysadmin');
CREATE SEQUENCE s_customers;
CREATE TABLE customers (
id INTEGER NOT NULL DEFAULT NEXTVAL('s_customers'),
name VARCHAR(255) NOT NULL,
CONSTRAINT pk_customers PRIMARY KEY(id) 
);
CREATE TABLE sales (
customer_id INTEGER NOT NULL,
sale_date timestamp NOT NULL DEFAULT now(),
type_id INTEGER NOT NULL,
CONSTRAINT fk_sales_customer FOREIGN KEY ("customer_id") REFERENCES "customers" ("id"),
CONSTRAINT fk_sales_type FOREIGN KEY ("type_id") REFERENCES "types" ("id")
);
CREATE INDEX sales_customer ON sales (customer_id);

All goes well with our database for a while. Someone asks for a report of the first and last sale for each customer, and we're able to produce that directly in SQL, with just one result row per customer. Then they ask for something fancier - they don't just want to know the first and last sale date for every customer, they also want the list of sale types too. The snag is they only want one line per customer, and ideally we don't want to have to do all the heavy lifting work in our own code.

 The first and last sale dates are easy:

SELECT id, name, MIN(sale_date) AS first_sale, MAX(sale_date) AS last_sale FROM customers INNER JOIN sales ON (id = customer_id) GROUP BY id,name

 id | name     | first_sale          | last_sale           
----+----------+---------------------+---------------------
1 | Torchbox | 2007-01-03 10:02:11 | 2007-10-06 09:44:21

But we also want an aggregate function that'll give us the list of sale types. We have a look through the list of standard postgresql aggregate functions, but we don't find anything that'll effectively turn a sequence of strings into a list or an array. After a quick google, we can't find a list aggregate or array aggregate function anywhere.

Luckily, it's possible to define your own PostGreSQL aggregate functions, without needing to resort to coding them in c (which is possible, by the way, just requires some work). We can ask PostGreSQL to build us a new aggregate function, simply by gluing together normal functions to act on the data as it goes past

For our use case, we want something that takes a sequence of strings (but possibly also numbers for a similar use), and returns us an array of the distinct values at the end. The distinct bit is easy - just do func(DISTINCT val) rather than just func(val) and PostGreSQL will do it for us. For the rest, we'll want a custom aggregate function that starts with a new array, and does an array append with new values. Say, something like:

CREATE AGGREGATE to_array (
sfunc = array_append,
basetype = anyelement,
stype = anyarray,
initcond = '{}'
)

We can now get our list of customers, their first and last sale, and all their sale types, all with one row per customer, all in sql:

SELECT customers.id, customers.name, MIN(sale_date) AS first_sale, MAX(sale_date) AS last_sale, to_array(DISTINCT types.name) AS sale_types FROM customers INNER JOIN sales ON (customers.id = customer_id) INNER JOIN types ON (type_id = types.id) GROUP BY customers.id,customers.name

 id | name     | first_sale          | last_sale           | sale_types       
----+----------+---------------------+---------------------+------------------
1 | Torchbox | 2007-01-03 10:02:11 | 2007-10-06 09:44:21 | {sysadmin,coding}


If we wanted to exclude null entries from going into our array, then we'd use a similar new aggregate function, with a custom function to power it:
CREATE FUNCTION array_append_not_null(anyarray,anyelement) RETURNS anyarray
  AS 'SELECT CASE WHEN $2 IS NULL THEN $1 ELSE array_append($1,$2) END '
  LANGUAGE SQL IMMUTABLE RETURNS NULL ON NULL INPUT;
CREATE AGGREGATE to_array_not_null (
sfunc = array_append_not_null,
basetype = anyelement,
stype = anyarray,
initcond = '{}'
)
There are lots of other possibilities too, and PostGreSQL provides loads of handy functions to build up your new aggregate with. If we start using our own functions to build up new aggregates, then the possibilities are almost endless!

Two handy links from the official postgresql documentation are Custom Aggregate Functions and Built-In Aggregate Functions
link