Fast forward to 2020: BigQuery has supported persistent, SQL-based UDFs for over a year, meaning we can finally create functions that persist indefinitely. Hoorah! 🎉
Persist what, where?
It’s all well and good that BigQuery now supports persistent UDFs, but where are they persisted? Just as other databases allow you to namespace your functions to a particular schema, with BigQuery your functions are created and tied to a particular dataset.
I like to create a separate fn dataset in my projects solely for this purpose; this way I can quickly scan all the functions I have available without other tables or views getting in the way.
Having a short and easily-identifiable name for your dataset also helps when it comes time to write your queries, especially as BigQuery requires you to prefix UDFs with the dataset they're from. As it turns out, I'm not alone in following this pattern: Google use the same naming convention for their publicly available UDFs. Speaking of which…
Check out BigQuery Utils on GitHub
Google have curated a number of useful UDFs and made them publicly available from their own Google Cloud project. Rather than having to mess around creating these UDFs yourself, you can go ahead and use these right now by referencing Google’s project/dataset and functions in your queries, like so:
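For example, the median function from Google's bqutil project (which also features later in this post) can be called by fully qualifying it with their project and dataset:

```sql
-- Call Google's publicly hosted median UDF directly, no setup required
SELECT bqutil.fn.median([1, 1, 2, 3, 5]) AS median  -- 2.0
```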
The one downside to this otherwise fantastic resource is that it’s only available within the US region, so those of us down under (and I guess other regions too) will have to resort to deploying these functions to our own project tenancy until such time as they’re made more widely available.
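A minimal sketch of that workaround, assuming you want the median function in an Australian region (the project name `my-project` is a placeholder; the function body is the same one shown later in this post):

```sql
-- Hypothetical: recreate a bqutil function in a dataset located in your own region
CREATE SCHEMA IF NOT EXISTS `my-project.fn`
  OPTIONS (location = "australia-southeast1");

CREATE OR REPLACE FUNCTION `my-project.fn.median`(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
```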
A few of my favourite UDFs
This is what you’re all here for, right? Listed below are the functions I’m particularly jazzed about. Each one includes a short introduction explaining what it does and why it’s useful, followed by its SQL definition, and finally a few examples to get you started.
In no particular order, here we go!
- 1. Median
- 2. Percentage change (increase/decrease) between two numbers
- 3. Percentage difference between two numbers
- 4. Evaluate whether two arrays are equal
- 5. Slice an array by index range
- 6. Generate bin intervals
- 7. Extract values from APPROX_TOP_COUNT
- 8. Last day of the month
- 9. Total number of days in a month
- 10. Lookup Australian timezone by postcode
1. Median

There isn't a week that goes by that I don't use this function. Often the data I work with is skewed in some form or another, making the median a much better choice for calculating the central tendency than using AVG for the mean.
It's quite clever how it works, too. It relies on BigQuery's ARRAY_AGG to aggregate together all the values we want to calculate the median for, then uses array indexing to find the value(s) that best line up with the middle of the array. Simple, but elegant.
I can’t take credit for this one, for that I have to thank the fine work done by Elliot Brossard (and Felipe Hoffa for bringing it to my attention).
```sql
CREATE OR REPLACE FUNCTION fn.median(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
```
Let’s look at a couple of examples to see how it works.
Here we have a list of users who have visited our website by date:
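The post's original table hasn't survived here, so as a stand-in, the `tbl` the queries below select from can be written inline (these values are my own illustrative placeholders, not the post's numbers):

```sql
-- Illustrative stand-in for the sample data; values are invented for demonstration
WITH tbl AS (
  SELECT * FROM UNNEST([
    STRUCT(DATE "2020-05-01" AS date, 18 AS user_count),
    (DATE "2020-05-02", 24),
    (DATE "2020-05-03", 21),
    (DATE "2020-05-04", 30)
  ])
)
SELECT * FROM tbl ORDER BY date
```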
To calculate the median number of users who have visited our website per day, we’d use our freshly minted median function as follows:
```sql
SELECT bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM tbl
```
| median users per day |
Because we have an even number of records, there's no one value that best represents the middle. The function therefore gets the median by calculating the mean of the two middlemost values.
Some of you may recognise that what we’re doing here is calculating the median on a dataset of discrete values (user counts). There’s a whole separate conversation to be had about the effect discrete and continuous variables have on calculating the median – this post is not that post. Suffice to say the function has performed admirably for our purposes and has given us a median we’d be proud to take home to our mothers.
| date | user category | user count |
This second example has users segmented by category. If we wanted to calculate the median number of users who’ve visited our website per day and per category, we’d do this:
```sql
SELECT
  user_category,
  bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM tbl
GROUP BY user_category
```
| user category | median users per day |
Here it's much easier for us to reconcile the above medians with the values from our original dataset. Since each group has only three records, the middle value is unambiguous – the function dutifully grabs the middlemost value for each group (they're treated as separate arrays since we're using GROUP BY) and returns the medians you see above.
2. Percentage change (increase/decrease) between two numbers
I'll be honest, the number of times I've had to reach out to Google² to tell me how to calculate this one (and the one below) – it's embarrassing. Turning this into a UDF was as much about keeping my code DRY as it was staving off my fear of inadequacy for not remembering such a basic formula.
Keep in mind this function calculates percentage change, which is different from percentage difference (whose function is listed below). Most of us are probably more familiar with percentage change; it's what we hear around us most often:
- “The sharemarket is down 7.2% from yesterday.”
- “This laptop’s battery lasts 15% longer than last year’s model.”
- “I just negotiated a 20% increase to my salary!”
Percentage change can be summarised then as a measure of the relative change between a variable’s before value (e.g. yesterday’s stock price) and its after value (today’s stock price).
```sql
CREATE OR REPLACE FUNCTION fn.percentage_change(val1 FLOAT64, val2 FLOAT64) RETURNS FLOAT64 AS (
  IF(
    ABS(val1) > 0 AND ABS(val2) > 0,
    (val2 - val1) / ABS(val1),
    NULL
  )
);
```
```sql
SELECT
  fn.percentage_change(1, 2),     -- 1.0 (100% increase)
  fn.percentage_change(0.5, 2),   -- 3.0 (300% increase)
  fn.percentage_change(100, 75),  -- -0.25 (25% decrease)
  fn.percentage_change(-20, -30)  -- -0.5 (50% decrease)
```
3. Percentage difference between two numbers
Calculating the percentage difference is useful when the relationship between two values doesn’t matter. It’s not relevant to say one is an increase or decrease over the other. They’re just… different.
When calculating percentage change, we use one of the values as a reference point from which to compare the other (e.g. old versus new), but when calculating the percentage difference it doesn’t make sense to use either value as the reference. Instead, we take the mid-point of the two values and use this as the reference. By doing this, you’ll notice that it no longer matters which value comes first in the equation, the result is the same – the measure is said to be non-directional.
So when is it appropriate to use percentage difference over percentage change? To steal an example from ol' faithful² – say you're comparing the size of two companies based on their number of employees. Company A has 93 employees and company B has 117; the percentage difference in size between these two companies comes out to be 22.86%. It's not correct to say that company A is 22.86% smaller than company B, or that company B is 22.86% larger than company A. They're different. How different? 22.86% different!
As the old adage goes, you should always pick the right tool for the job. The same applies here when choosing between percentage change or difference.
```sql
CREATE OR REPLACE FUNCTION fn.percentage_difference(val1 FLOAT64, val2 FLOAT64) RETURNS FLOAT64 AS (
  IF(
    val1 > 0 AND val2 > 0,
    ABS(val1 - val2) / ((val1 + val2) / 2),
    NULL
  )
);
```
```sql
SELECT
  fn.percentage_difference(1, 2),     -- 0.6667 (66.67%)
  fn.percentage_difference(2, 1),     -- 0.6667 (66.67%)
  fn.percentage_difference(100, 75),  -- 0.2857 (28.57%)
  fn.percentage_difference(-20, -30)  -- NULL
```
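Tying this back to the company-size example above, the same function reproduces the 22.86% figure (|93 − 117| / ((93 + 117) / 2) = 24 / 105 ≈ 0.2286):

```sql
SELECT fn.percentage_difference(93, 117) AS size_difference  -- ≈ 0.2286 (22.86%)
```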
4. Evaluate whether two arrays are equal
This one I don't use that often, but when I do I'm thankful to have it available in my arsenal of UDFs. It compares two arrays and evaluates to true/false depending on whether they contain the same values, regardless of their order.
See? I can keep these explanations short. 😉
```sql
CREATE OR REPLACE FUNCTION fn.equal_arrays(arr1 ANY TYPE, arr2 ANY TYPE) RETURNS BOOL AS (
  -- The length guard stops a shorter array matching a longer superset of itself
  ARRAY_LENGTH(arr1) = ARRAY_LENGTH(arr2)
  AND (SELECT LOGICAL_AND(v IN UNNEST(arr2)) FROM UNNEST(arr1) v)
);
```
```sql
SELECT
  fn.equal_arrays([1, 2, 3], [1, 2, 3]),              -- true
  fn.equal_arrays([1, 2, 3], [3, 1, 2]),              -- true
  fn.equal_arrays(['a', 'b', 'c'], ['c', 'b', 'a']),  -- true
  fn.equal_arrays(
    [STRUCT('A' AS a), STRUCT('B' AS b)],
    [STRUCT('A' AS a), STRUCT('B' AS b)]
  ),                                                  -- true
  fn.equal_arrays(
    [STRUCT('Z' AS a), STRUCT('B' AS b)],
    [STRUCT('A' AS a), STRUCT('B' AS b)]
  ),                                                  -- false
  fn.equal_arrays(['1', '2'], [1, 2])                 -- error: both arrays must share the same data type
```
5. Slice an array by index range
It’s easy to take for granted the functionality you get for free in other programming languages. Slicing an array is one such example – so trivial to do in a language like Python, but requires multiple lines of code to do the same thing in SQL.
```sql
CREATE OR REPLACE FUNCTION fn.slice_array(arr ANY TYPE, start_index INT64, end_index INT64)
RETURNS ARRAY<ANY TYPE> AS ((
  SELECT ARRAY_AGG(val)
  FROM UNNEST(arr) AS val WITH OFFSET idx
  WHERE idx >= (
    CASE
      WHEN start_index < 0 THEN ARRAY_LENGTH(arr) + start_index
      WHEN end_index < 0 AND start_index IS NULL THEN ARRAY_LENGTH(arr) + end_index
      ELSE IFNULL(start_index, end_index)
    END
  )
  AND idx <= (
    IF(
      end_index < 0,
      ARRAY_LENGTH(arr) + end_index,
      IFNULL(end_index, start_index)
    )
  )
));
```
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 0, 1),  -- ['a', 'b']
  fn.slice_array(['a', 'b', 'c', 'd'], 1, 2)   -- ['b', 'c']
```
You can also use a negative index for the upper bound to find the nth value from the right-hand side of the array. Like so:
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 2, -1),  -- ['c', 'd']
  fn.slice_array(['a', 'b', 'c', 'd'], 1, -2)   -- ['b', 'c']
```
Set both bounds to the same index to get a single-valued array. Alternatively, set one of the bounds to NULL and the other bound will be used for both.
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 2, 2),     -- ['c']
  fn.slice_array(['a', 'b', 'c', 'd'], 2, NULL),  -- ['c']
  fn.slice_array(['a', 'b', 'c', 'd'], -1, -1),   -- ['d']
  fn.slice_array(['a', 'b', 'c', 'd'], NULL, -1)  -- ['d']
```
Combine with something like REGEXP_EXTRACT_ALL for some next-level shenanigans!
```sql
DECLARE file_path STRING;
DECLARE regex STRING DEFAULT "/([\\w\\._-]+)";

SET file_path = "/home/chris/projects/bigquery-udfs.sql";

SELECT
  -- root folder
  ARRAY_TO_STRING(
    fn.slice_array(REGEXP_EXTRACT_ALL(file_path, regex), 0, NULL),
    ""
  ) AS root_folder,
  -- parent folder
  ARRAY_TO_STRING(
    fn.slice_array(REGEXP_EXTRACT_ALL(file_path, regex), NULL, -2),
    ""
  ) AS parent_folder,
  -- filename
  fn.slice_array(
    REGEXP_EXTRACT_ALL(file_path, regex), NULL, -1
  )[OFFSET(0)] AS filename
```
| root folder | parent folder | filename |
| --- | --- | --- |
| home | projects | bigquery-udfs.sql |
6. Generate bin intervals
This UDF was born out of frustration with Data Studio. Can you believe they don’t include a histogram chart? The audacity! I was forced to create my own bin intervals in SQL and use a bar chart like an animal.
This function groups values into interval ranges based on the bin size and limit you give it.
```sql
CREATE OR REPLACE FUNCTION fn.bin_value(val FLOAT64, bin_size FLOAT64, bin_max FLOAT64) AS (
  CASE
    WHEN val > bin_max THEN bin_max
    ELSE CAST(TRUNC(val / bin_size) * bin_size AS INT64)
  END
);
```
```sql
DECLARE bin_size INT64 DEFAULT 5;
DECLARE bin_max INT64 DEFAULT 20;
DECLARE numbers ARRAY<INT64>;

SET numbers = [1, 3, 7, 9, 11, 11, 14, 19, 22, 29];

WITH frequencies AS (
  SELECT
    fn.bin_value(num, bin_size, bin_max) AS bin,
    COUNT(1) AS frequency
  FROM UNNEST(numbers) num
  GROUP BY bin
)

SELECT
  bin,
  IF(
    bin = bin_max,
    CONCAT(bin_max, "+"),
    CONCAT(bin, "-", bin + bin_size)
  ) AS bin_range,
  frequency
FROM frequencies
ORDER BY bin
```
7. Extract values from APPROX_TOP_COUNT
APPROX_TOP_COUNT is great; it's saved me a ton of time and energy scaffolding sub-queries and CTEs to get the top N records of X, Y and Z. Sometimes, though, I'm only interested in knowing what the top values are – I don't care what their counts are, and I certainly don't care that the top value might be NULL.
This function returns a flattened array with the values from APPROX_TOP_COUNT (or APPROX_TOP_SUM), excluding those pesky NULLs.
```sql
CREATE OR REPLACE FUNCTION fn.extract_top_values(arr ANY TYPE) RETURNS ARRAY<STRING> AS ((
  SELECT ARRAY_AGG(a.value ORDER BY i)
  FROM UNNEST(arr) a WITH OFFSET i
  WHERE a.value IS NOT NULL
));
```
```sql
DECLARE favourite_colors ARRAY<STRING>;

SET favourite_colors = [
  NULL, NULL, NULL, NULL,
  'blue', 'blue', 'blue',
  'red', 'red',
  'green', 'purple', 'orange'
];

SELECT
  APPROX_TOP_COUNT(colours, 3),  -- [(NULL, 4), ("blue", 3), ("red", 2)]
  fn.extract_top_values(
    APPROX_TOP_COUNT(colours, 3)
  )[OFFSET(0)]                   -- "blue"
FROM UNNEST(favourite_colors) AS colours
```
8. Last day of the month
This is an example of a function that doesn’t take much brain power to put together, but given you’re probably using it as an input into some other calculation, it won’t take long before your query devolves into an indecipherable chain of function calls. Do yourself a favour and use a UDF instead.
Its usage is pretty self-explanatory. You give it a date, any date, and it will return a new date set to the last day of that month.
```sql
CREATE OR REPLACE FUNCTION fn.month_ceil(date DATE) RETURNS DATE AS (
  DATE_SUB(DATE_ADD(DATE_TRUNC(date, MONTH), INTERVAL 1 MONTH), INTERVAL 1 DAY)
);
```
```sql
SELECT
  fn.month_ceil(DATE("2012-12-25")),  -- 2012-12-31
  fn.month_ceil(DATE("2016-02-09")),  -- 2016-02-29 (leap year)
  fn.month_ceil(DATE("2019-02-18")),  -- 2019-02-28
  fn.month_ceil(DATE("2020-05-11"))   -- 2020-05-31
```
9. Total number of days in a month
Sometimes you need to know how many days there are in a month. It’s easy enough to stitch together the necessary functions to do this yourself, but once you’ve done it a few times, it just makes sense that it exists as a function, and so here we are.
An added feature of this function is that if the input date falls within the current month, it will return only the number of days that have elapsed so far.
An example of why this might be useful is when calculating the percentage of days a user is active in a month – calculating this without factoring in the current date could unfairly classify the user as inactive. This function allows you to pro-rate whatever calculation you’re working on and has the added benefit of saving some poor soul from receiving an unwarranted marketing email.
```sql
CREATE OR REPLACE FUNCTION fn.days_in_month(date DATE) RETURNS INT64 AS (
  CASE
    -- is the current month
    WHEN DATE_TRUNC(CURRENT_DATE(), MONTH) = DATE_TRUNC(date, MONTH)
      -- return days-to-date up to today
      THEN EXTRACT(DAY FROM CURRENT_DATE()) - 1
    ELSE EXTRACT(DAY FROM fn.month_ceil(date))
  END
);
```
```sql
SELECT CURRENT_DATE;  -- 2020-05-26

SELECT
  fn.days_in_month(DATE("2012-12-25")),  -- 31
  fn.days_in_month(DATE("2016-02-09")),  -- 29 (leap year)
  fn.days_in_month(DATE("2019-02-18")),  -- 28
  fn.days_in_month(DATE("2020-05-11"))   -- 25 (elapsed days in current month)
```
If you wanted to use it in some sort of pro-rated calculation, such as calculating the percentage of days a user is active for in a month, we can apply it like this:
```sql
SELECT CURRENT_DATE;  -- 2020-05-26

WITH user_activity_per_day AS (
  SELECT
    day AS date,
    1 AS user_id
  FROM UNNEST(
    GENERATE_DATE_ARRAY("2020-04-01", "2020-05-25", INTERVAL 1 DAY)
  ) AS day
),

user_activity_per_month AS (
  SELECT
    DATE_TRUNC(date, MONTH) AS month,
    user_id,
    COUNT(1) AS days_active
  FROM user_activity_per_day
  GROUP BY month, user_id
)

SELECT
  FORMAT_DATE("%b-%Y", month) AS month,
  user_id,
  days_active,
  days_active / fn.days_in_month(month) AS percent_active_in_month
FROM user_activity_per_month
```
| month | user id | days active | percent active in month |
| --- | --- | --- | --- |
| Apr-2020 | 1 | 30 | 1.0 |
| May-2020 | 1 | 25 | 1.0 |
10. Lookup Australian timezone by postcode
So this one’s pretty niche, I know – there’s a reason it’s listed last. If you ever want to know the timezone and all you have is a postcode, this one’s for you. It’s perfect for when you want to do some time-of-day analysis and all you have is a UTC timestamp and a user’s postcode.
```sql
CREATE OR REPLACE FUNCTION fn.postcode_to_timezone(postcode STRING) RETURNS STRING AS (
  CASE
    WHEN postcode = "2880" THEN 'Australia/Broken_Hill'
    WHEN postcode = "2898" THEN 'Australia/Lord_Howe'
    WHEN postcode = "2899" THEN 'Pacific/Norfolk'
    WHEN postcode = "6443" THEN 'Australia/Eucla'
    WHEN postcode = "6798" THEN 'Indian/Christmas'
    WHEN postcode = "6799" THEN 'Indian/Cocos'
    WHEN STARTS_WITH(postcode, "0") THEN 'Australia/Darwin'
    WHEN STARTS_WITH(postcode, "2") THEN 'Australia/Sydney'
    WHEN STARTS_WITH(postcode, "3") THEN 'Australia/Melbourne'
    WHEN STARTS_WITH(postcode, "4") THEN 'Australia/Brisbane'
    WHEN STARTS_WITH(postcode, "5") THEN 'Australia/Adelaide'
    WHEN STARTS_WITH(postcode, "6") THEN 'Australia/Perth'
    WHEN STARTS_WITH(postcode, "7") THEN 'Australia/Hobart'
  END
);
```
```sql
SELECT
  fn.postcode_to_timezone("2000"),  -- "Australia/Sydney"
  fn.postcode_to_timezone("3031"),  -- "Australia/Melbourne"
  fn.postcode_to_timezone("5045")   -- "Australia/Adelaide"
```
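To close the loop on the time-of-day use case, the returned timezone name plugs straight into BigQuery's date/time functions (the timestamp here is an arbitrary example; Sydney observes AEST, UTC+10, in May):

```sql
-- Convert a UTC timestamp to a user's local time via their postcode
SELECT
  DATETIME(
    TIMESTAMP "2020-05-26 03:30:00+00",
    fn.postcode_to_timezone("2000")
  ) AS local_time  -- 2020-05-26T13:30:00
```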
That’s all, folks
Hopefully some of those were useful to you, or at least gave you some ideas for UDFs of your own.
Back when UDFs were first introduced to BigQuery, they were only temporary. That meant you had to re-define and create them for every query (or more realistically, copy and paste them from that document you keep for just such an occasion). Persistent UDFs wouldn’t arrive until much later in mid-2019. ↩︎
Thank you Omni Calculator ❤️! ↩︎ ↩︎