While other databases have allowed users to define their own SQL functions since forever, it was only in 2015 that BigQuery first offered this functionality. Even then, you were limited to writing them in JavaScript – it wouldn’t be until 2018 that SQL-based UDFs became generally available, and they were only temporary¹ at that!
Fast forward to 2020: BigQuery has supported persistent, SQL-based UDFs for over a year, meaning we can finally create functions that stick around indefinitely. Hoorah! 🎉
Persist what, where?
It’s all well and good that BigQuery now supports persistent UDFs, but where are they persisted? Just as other databases allow you to namespace your functions to a particular schema, with BigQuery your functions are created and tied to a particular dataset.
I like to create a separate `fn` dataset in my projects solely for this purpose; this way I can quickly scan all the functions I have available without other tables or views getting in the way.
Having a short and easily-identifiable name for your dataset also helps when it comes time to write your queries, especially as BigQuery requires you to prefix UDFs with the dataset they’re from. As it turns out, I’m not alone in following this pattern – Google use the same naming convention for their publicly available UDFs. Speaking of which…
Check out BigQuery Utils on GitHub
Google have curated a number of useful UDFs and made them publicly available from their own Google Cloud project. Rather than having to mess around creating these UDFs yourself, you can go ahead and use these right now by referencing Google’s project/dataset and functions in your queries, like so:
SELECT bqutil.fn.int(1)
The one downside to this otherwise fantastic resource is that it’s only available within the US region, so those of us down under (and I guess other regions too) will have to resort to deploying these functions to our own project tenancy until such time as they’re made more widely available.
A few of my favourite UDFs
This is what you’re all here for, right? Listed below are the functions I’m particularly jazzed about. Each one includes a short introduction explaining what it does and why it’s useful, followed by its SQL definition, and finally a few examples to get you started.
In no particular order, here we go!
- 1. Median
- 2. Percentage change (increase/decrease) between two numbers
- 3. Percentage difference between two numbers
- 4. Evaluate whether two arrays are equal
- 5. Slice an array by index range
- 6. Generate bin intervals
- 7. Extract values from `APPROX_TOP_X` (excluding `NULL`s)
- 8. Last day of the month
- 9. Total number of days in a month
- 10. Lookup Australian timezone by postcode
1. Median
There isn’t a week that goes by that I don’t use this function. Often the data I work with is skewed in some form or another, making the median a much better choice for measuring central tendency than using `AVG` for the mean.
It’s quite clever how it works too. It relies on BigQuery’s `ARRAY_AGG` to aggregate together all the values we want to calculate the median for, and then uses array indexing to find the value(s) that best line up with the middle of the array. Simple, but elegant.
I can’t take credit for this one, for that I have to thank the fine work done by Elliot Brossard (and Felipe Hoffa for bringing it to my attention).
CREATE OR REPLACE FUNCTION fn.median(arr ANY TYPE)
AS ((
SELECT IF (
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
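If it helps to see that index arithmetic outside SQL, here’s a rough Python equivalent of the same even/odd logic (purely illustrative – not part of the UDF itself):

```python
def median(arr):
    """Median via sorting and index arithmetic, mirroring the SQL UDF."""
    arr = sorted(arr)          # ARRAY_AGG(x ORDER BY x)
    n = len(arr)
    mid = n // 2               # DIV(ARRAY_LENGTH(arr), 2)
    if n % 2 == 0:             # even: average the two middlemost values
        return (arr[mid - 1] + arr[mid]) / 2
    return arr[mid]            # odd: take the single middle value

print(median([7, 18, 20, 21, 25, 32]))  # 20.5
```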
Let’s look at a couple of examples to see how it works.
Example 1
Here we have a daily count of users who have visited our website:
| date | user count |
| --- | --- |
| 2020-05-19 | 7 |
| 2020-05-20 | 18 |
| 2020-05-21 | 20 |
| 2020-05-22 | 21 |
| 2020-05-23 | 25 |
| 2020-05-24 | 32 |
To calculate the median number of users who have visited our website per day, we’d use our freshly minted median function as follows:
SELECT
bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM
tbl
| median users per day |
| --- |
| 20.5 |
Because we have an even number of records, there’s no one value that best represents the middle. The function therefore calculates the median as the mean of the two middlemost values (`20` and `21`), returning us a median of `20.5`.
Some of you may recognise that what we’re doing here is calculating the median on a dataset of discrete values (user counts). There’s a whole separate conversation to be had about the effect discrete and continuous variables have on calculating the median – this post is not that post. Suffice to say the function has performed admirably for our purposes and has given us a median we’d be proud to take home to our mothers.
Example 2
| date | user category | user count |
| --- | --- | --- |
| 2020-05-19 | Free | 7 |
| 2020-05-20 | Free | 18 |
| 2020-05-21 | Free | 20 |
| 2020-05-19 | Premium | 21 |
| 2020-05-20 | Premium | 25 |
| 2020-05-21 | Premium | 32 |
This second example has users segmented by category. If we wanted to calculate the median number of users who’ve visited our website per day and per category, we’d do this:
SELECT
user_category,
bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM
tbl
GROUP BY
user_category
| user category | median users per day |
| --- | --- |
| Free | 18 |
| Premium | 25 |
Here it’s much easier for us to reconcile the above medians with the values from our original dataset. Since each group has only three records, the middle value is unambiguous – the function dutifully grabs the middlemost value for each group (they’re treated as separate arrays since we’re using `GROUP BY`) and returns us the medians you see above.
2. Percentage change (increase/decrease) between two numbers
I’ll be honest, the number of times I’ve had to reach out to Google² to tell me how to calculate this one (and the one below) – it’s embarrassing. Turning this into a UDF was as much about keeping my code DRY as it was staving off my fear of inadequacy for not remembering such a basic formula.
Keep in mind this function calculates percentage change, which is different from percentage difference (whose function is listed below). Most of us are probably more familiar with percentage change; it’s what we hear around us most often:
- “The sharemarket is down 7.2% from yesterday.”
- “This laptop’s battery lasts 15% longer than last year’s model.”
- “I just negotiated a 20% increase to my salary!”
Percentage change can be summarised, then, as a measure of the relative change between a variable’s before value (e.g. yesterday’s stock price) and its after value (today’s stock price).
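In Python-flavoured pseudocode, the formula boils down to the following (illustrative only – it mirrors the SQL UDF, including returning nothing when either value is zero):

```python
def percentage_change(val1, val2):
    # relative change from val1 (the "before") to val2 (the "after");
    # undefined (None, like SQL NULL) when either value is zero
    if val1 == 0 or val2 == 0:
        return None
    return (val2 - val1) / abs(val1)

print(percentage_change(100, 75))  # -0.25, i.e. a 25% decrease
```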
CREATE OR REPLACE FUNCTION fn.percentage_change(val1 FLOAT64, val2 FLOAT64)
RETURNS FLOAT64 AS (
IF(
ABS(val1) > 0 AND ABS(val2) > 0,
(val2 - val1) / ABS(val1),
NULL
)
);
Example
SELECT
fn.percentage_change(1, 2), -- 1.0 (100% increase)
fn.percentage_change(0.5, 2), -- 3.0 (300% increase)
fn.percentage_change(100, 75), -- -0.25 (25% decrease)
fn.percentage_change(-20, -30) -- -0.5 (50% decrease)
3. Percentage difference between two numbers
Calculating the percentage difference is useful when the direction of the relationship between two values doesn’t matter. It’s not relevant to say one is an increase or decrease over the other. They’re just… different.
When calculating percentage change, we use one of the values as a reference point from which to compare the other (e.g. old versus new), but when calculating the percentage difference it doesn’t make sense to use either value as the reference. Instead, we take the mid-point of the two values and use this as the reference. By doing this, you’ll notice that it no longer matters which value comes first in the equation, the result is the same – the measure is said to be non-directional.
So when is it appropriate to use percentage difference over percentage change? To steal an example from ol’ faithful² – say you’re comparing the size of two companies based on the number of their employees. Company `A` has 93 employees and company `B` has 117; the percentage difference in size between these two companies comes out to be 22.86%. It’s not correct to say that company `A` is 22.86% smaller than company `B`, or that company `B` is 22.86% larger than company `A`. They’re different. How different? 22.86% different!
As the old adage goes, you should always pick the right tool for the job. The same applies here when choosing between percentage change or difference.
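Sketching the midpoint-as-reference idea in Python makes the symmetry obvious (illustrative only – like the SQL UDF, it bails out on non-positive inputs):

```python
def percentage_difference(val1, val2):
    # non-directional: both values are measured against their midpoint,
    # so swapping the arguments gives the same answer
    if val1 <= 0 or val2 <= 0:
        return None
    return abs(val1 - val2) / ((val1 + val2) / 2)

print(round(percentage_difference(93, 117), 4))  # 0.2286
```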
CREATE OR REPLACE FUNCTION fn.percentage_difference(val1 FLOAT64, val2 FLOAT64) RETURNS FLOAT64 AS (
IF(
val1 > 0 AND val2 > 0,
ABS(val1 - val2) / ((val1 + val2) / 2),
NULL
)
);
Example
SELECT
fn.percentage_difference(1, 2), -- 0.6667 (66.67%)
fn.percentage_difference(2, 1), -- 0.6667 (66.67%)
fn.percentage_difference(100, 75), -- 0.2857 (28.57%)
fn.percentage_difference(-20, -30) -- NULL
4. Evaluate whether two arrays are equal
This one I don’t use that often, but when I do I’m thankful to have it available in my arsenal of UDFs. It compares two arrays and evaluates to true/false depending on whether they contain the same values – no matter their order.
See? I can keep these explanations short. 😉
CREATE OR REPLACE FUNCTION fn.equal_arrays(arr1 ANY TYPE, arr2 ANY TYPE)
RETURNS BOOL AS (
  -- the length check guards against arr1 being a strict subset of arr2
  ARRAY_LENGTH(arr1) = ARRAY_LENGTH(arr2)
  AND (
    SELECT LOGICAL_AND(v IN UNNEST(arr2))
    FROM UNNEST(arr1) v
  )
);
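For comparison, an order-insensitive equality check is a one-liner in Python using a multiset comparison (illustrative only – `Counter` also insists that duplicate counts match, which is stricter than a pure membership check):

```python
from collections import Counter

def equal_arrays(arr1, arr2):
    # multiset comparison: order-insensitive, duplicates must match too
    return Counter(arr1) == Counter(arr2)

print(equal_arrays([1, 2, 3], [3, 1, 2]))  # True
print(equal_arrays([1, 2], [1, 2, 3]))     # False
```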
Example
SELECT
fn.equal_arrays([1, 2, 3], [1, 2, 3]), -- true
fn.equal_arrays([1, 2, 3], [3, 1, 2]), -- true
fn.equal_arrays(['a', 'b', 'c'], ['c', 'b', 'a']), -- true
fn.equal_arrays(
[STRUCT('A' AS a), STRUCT('B' AS b)],
[STRUCT('A' AS a), STRUCT('B' AS b)]
), -- true
fn.equal_arrays(
[STRUCT('Z' AS a), STRUCT('B' AS b)],
[STRUCT('A' AS a), STRUCT('B' AS b)]
), -- false
fn.equal_arrays(
['1', '2'],
[1, 2]
) -- error: both arrays must share the same data type
5. Slice an array by index range
It’s easy to take for granted the functionality you get for free in other programming languages. Slicing an array is one such example – so trivial to do in a language like Python, but requires multiple lines of code to do the same thing in SQL.
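For reference, here are the Python slices being emulated – note that Python’s end index is exclusive, while the UDF below treats both bounds as inclusive:

```python
arr = ['a', 'b', 'c', 'd']

# Python: inclusive start, exclusive end; fn.slice_array: both inclusive
print(arr[0:2])   # ['a', 'b']  ~ fn.slice_array(arr, 0, 1)
print(arr[1:3])   # ['b', 'c']  ~ fn.slice_array(arr, 1, 2)
print(arr[-2:])   # ['c', 'd']  ~ fn.slice_array(arr, 2, -1)
```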
CREATE OR REPLACE FUNCTION fn.slice_array(arr ANY TYPE, start_index INT64, end_index INT64)
RETURNS ARRAY<ANY TYPE> AS ((
  SELECT ARRAY_AGG(val)
  FROM UNNEST(arr) AS val WITH OFFSET idx
  WHERE idx >= (
    CASE
      WHEN start_index < 0
        THEN ARRAY_LENGTH(arr) + start_index
      WHEN end_index < 0 AND start_index IS NULL
        THEN ARRAY_LENGTH(arr) + end_index
      ELSE
        IFNULL(start_index, end_index)
    END
  )
  AND idx <= (
    IF(
      end_index < 0,
      ARRAY_LENGTH(arr) + end_index,
      IFNULL(end_index, start_index)
    )
  )
));
Example
SELECT
fn.slice_array(['a', 'b', 'c', 'd'], 0, 1), -- ['a', 'b']
fn.slice_array(['a', 'b', 'c', 'd'], 1, 2) -- ['b', 'c']
You can also use a negative index for the upper bound to find the nth value from the right-hand side of the array. Like so:
SELECT
fn.slice_array(['a', 'b', 'c', 'd'], 2, -1), -- ['c', 'd']
fn.slice_array(['a', 'b', 'c', 'd'], 1, -2) -- ['b', 'c']
Set both bounds to the same index to get a single-valued array. Alternatively, set one of the bounds to `NULL`.
SELECT
fn.slice_array(['a', 'b', 'c', 'd'], 2, 2), -- ['c']
fn.slice_array(['a', 'b', 'c', 'd'], 2, NULL), -- ['c']
fn.slice_array(['a', 'b', 'c', 'd'], -1, -1), -- ['d']
fn.slice_array(['a', 'b', 'c', 'd'], NULL, -1) -- ['d']
Combine with something like `SPLIT` or `REGEXP_EXTRACT` for some next-level shenanigans!
DECLARE file_path STRING;
DECLARE regex STRING DEFAULT "/([\\w\\._-]+)";
SET file_path = "/home/chris/projects/bigquery-udfs.sql";
SELECT
-- root folder
ARRAY_TO_STRING(
fn.slice_array(
REGEXP_EXTRACT_ALL(file_path, regex),
0, NULL
),
"") AS root_folder,
-- parent folder
ARRAY_TO_STRING(
fn.slice_array(
REGEXP_EXTRACT_ALL(file_path, regex),
NULL, -2
),
"") AS parent_folder,
-- filename
fn.slice_array(
REGEXP_EXTRACT_ALL(file_path, regex),
NULL, -1
)[OFFSET(0)] AS filename
| root folder | parent folder | filename |
| --- | --- | --- |
| home | projects | bigquery-udfs.sql |
6. Generate bin intervals
This UDF was born out of frustration with Data Studio. Can you believe they don’t include a histogram chart? The audacity! I was forced to create my own bin intervals in SQL and use a bar chart like an animal.
This function groups values into interval ranges based on the bin size and limit you give it.
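The binning arithmetic is easier to see in a quick Python sketch (illustrative, and assuming non-negative values, where truncation and floor division agree):

```python
def bin_value(val, bin_size, bin_max):
    # clamp anything above the limit into the top bin,
    # otherwise snap down to the nearest multiple of bin_size
    if val > bin_max:
        return bin_max
    return int(val // bin_size) * bin_size

print([bin_value(n, 5, 20) for n in [1, 3, 7, 9, 11, 11, 14, 19, 22, 29]])
# [0, 0, 5, 5, 10, 10, 10, 15, 20, 20]
```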
CREATE OR REPLACE FUNCTION fn.bin_value(val FLOAT64, bin_size FLOAT64, bin_max FLOAT64)
RETURNS INT64 AS (
  CASE
    WHEN val > bin_max THEN CAST(bin_max AS INT64)
    ELSE CAST(TRUNC(val / bin_size) * bin_size AS INT64)
  END
);
Example
DECLARE bin_size INT64 DEFAULT 5;
DECLARE bin_max INT64 DEFAULT 20;
DECLARE numbers ARRAY<INT64>;
SET numbers = [1, 3, 7, 9, 11, 11, 14, 19, 22, 29];
WITH frequencies AS (
SELECT
fn.bin_value(num, bin_size, bin_max) AS bin,
COUNT(1) AS frequency
FROM
UNNEST(numbers) num
GROUP BY
bin
)
SELECT
bin,
IF(
bin = bin_max,
CONCAT(CAST(bin_max AS STRING), "+"),
CONCAT(CAST(bin AS STRING), "-", CAST(bin + bin_size AS STRING))
) AS bin_range,
frequency
FROM
frequencies
ORDER BY
bin
| bin | bin range | frequency |
| --- | --- | --- |
| 0 | 0-5 | 2 |
| 5 | 5-10 | 2 |
| 10 | 10-15 | 3 |
| 15 | 15-20 | 1 |
| 20 | 20+ | 2 |
7. Extract values from `APPROX_TOP_X` (excluding `NULL`s)
`APPROX_TOP_COUNT` is great – it’s saved me a ton of time and energy scaffolding sub-queries and CTEs to get the top N records of X, Y and Z. Sometimes, though, I’m only interested in knowing what the top values are; I don’t care what their counts are and I certainly don’t care that the top value might be `NULL`.
This function returns a flattened array with values from `APPROX_TOP_COUNT` (or `APPROX_TOP_SUM`), excluding those pesky `NULL`s.
CREATE OR REPLACE FUNCTION fn.extract_top_values(arr ANY TYPE)
RETURNS ARRAY<STRING> AS (
(
SELECT ARRAY_AGG(a.value ORDER BY i)
FROM UNNEST(arr) a WITH OFFSET i
WHERE a.value IS NOT NULL
)
);
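The same filtering is trivial to express in Python, treating the input as a list of (value, count) pairs, which is the shape `APPROX_TOP_COUNT` hands back (illustrative only):

```python
def extract_top_values(top_counts):
    # top_counts: (value, count) pairs, ordered most-frequent first;
    # keep only non-null values, preserving that order
    return [value for value, _count in top_counts if value is not None]

print(extract_top_values([(None, 4), ('blue', 3), ('red', 2)]))  # ['blue', 'red']
```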
Example
DECLARE favourite_colors ARRAY<STRING>;
SET favourite_colors = [
NULL, NULL, NULL, NULL,
'blue', 'blue', 'blue',
'red', 'red',
'green',
'purple',
'orange'
];
SELECT
APPROX_TOP_COUNT(colours, 3) -- [(NULL, 4), ("blue", 3), ("red", 2)]
,fn.extract_top_values(
APPROX_TOP_COUNT(colours, 3)
)[OFFSET(0)] -- "blue"
FROM
UNNEST(favourite_colors) AS colours
8. Last day of the month
This is an example of a function that doesn’t take much brain power to put together, but given you’re probably using it as an input into some other calculation, it won’t take long before your query devolves into an indecipherable chain of function calls. Do yourself a favour and use a UDF instead.
Its usage is pretty self-explanatory: you give it a date, any date, and it will return a new date set to the last day of that month.
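The first-of-next-month-minus-one-day trick the UDF uses reads like this in Python (illustrative; the `timedelta(days=32)` hop is just a dependency-free way to land in the next month):

```python
from datetime import date, timedelta

def month_ceil(d):
    # step to the first day of the next month, then back one day
    first = d.replace(day=1)
    first_of_next = (first + timedelta(days=32)).replace(day=1)
    return first_of_next - timedelta(days=1)

print(month_ceil(date(2016, 2, 9)))  # 2016-02-29
```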
CREATE OR REPLACE FUNCTION fn.month_ceil(date DATE)
RETURNS DATE AS (
DATE_SUB(DATE_ADD(DATE_TRUNC(date, MONTH), INTERVAL 1 MONTH), INTERVAL 1 DAY)
);
Example
SELECT
fn.month_ceil(DATE("2012-12-25")), -- 2012-12-31
fn.month_ceil(DATE("2016-02-09")), -- 2016-02-29 (leap year)
fn.month_ceil(DATE("2019-02-18")), -- 2019-02-28
fn.month_ceil(DATE("2020-05-11")) -- 2020-05-31
9. Total number of days in a month
Sometimes you need to know how many days there are in a month. It’s easy enough to stitch together the necessary functions to do this yourself, but once you’ve done it a few times, it just makes sense that it exists as a function, and so here we are.
An added feature of this function is that if the input date falls within the current month, it will return only the number of days that have elapsed so far.
An example of why this might be useful is when calculating the percentage of days a user is active in a month – calculating this without factoring in the current date could unfairly classify the user as inactive. This function allows you to pro-rate whatever calculation you’re working on and has the added benefit of saving some poor soul from receiving an unwarranted marketing email.
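Here is a Python sketch of the same behaviour – note `today` is a hard-coded stand-in for `CURRENT_DATE()` so the example stays reproducible:

```python
import calendar
from datetime import date

def days_in_month(d, today=date(2020, 5, 26)):
    # current month: elapsed days, excluding today
    if (d.year, d.month) == (today.year, today.month):
        return today.day - 1
    # any other month: its full length
    return calendar.monthrange(d.year, d.month)[1]

print(days_in_month(date(2020, 5, 11)))  # 25
print(days_in_month(date(2016, 2, 9)))   # 29
```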
CREATE OR REPLACE FUNCTION fn.days_in_month(date DATE)
RETURNS INT64 AS (
CASE
WHEN DATE_TRUNC(CURRENT_DATE(), MONTH) = DATE_TRUNC(date, MONTH) -- is the current month
THEN EXTRACT(DAY FROM CURRENT_DATE()) - 1 -- elapsed days, excluding today
ELSE EXTRACT(DAY FROM fn.month_ceil(date))
END
);
Example
SELECT CURRENT_DATE; -- 2020-05-26
SELECT
fn.days_in_month(DATE("2012-12-25")), -- 31
fn.days_in_month(DATE("2016-02-09")), -- 29 (leap year)
fn.days_in_month(DATE("2019-02-18")), -- 28
fn.days_in_month(DATE("2020-05-11")) -- 25 (elapsed days in current month)
If you wanted to use it in some sort of pro-rated calculation, such as the percentage of days a user is active in a month, you could apply it like this:
SELECT CURRENT_DATE; -- 2020-05-26
WITH user_activity_per_day AS (
SELECT
day AS date
,1 AS user_id
FROM
UNNEST(
GENERATE_DATE_ARRAY(
"2020-04-01",
"2020-05-25",
INTERVAL 1 DAY
)
) AS day
), user_activity_per_month AS (
SELECT
DATE_TRUNC(date, MONTH) AS month
,user_id
,COUNT(1) AS days_active
FROM
user_activity_per_day
GROUP BY
month
,user_id
)
SELECT
FORMAT_DATE("%b-%Y", month) AS month
,user_id
,days_active
,days_active / fn.days_in_month(month) AS percent_active_in_month
FROM
user_activity_per_month
| month | user id | days active | percent active in month |
| --- | --- | --- | --- |
| Apr-2020 | 1 | 30 | 1.0 |
| May-2020 | 1 | 25 | 1.0 |
10. Lookup Australian timezone by postcode
So this one’s pretty niche, I know – there’s a reason it’s listed last. It’s perfect for when you want to do some time-of-day analysis and all you have to work with is a UTC timestamp and a user’s postcode.
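The lookup logic is just “special-case postcodes first, then fall back on the state prefix”. In Python it might look like this (illustrative, using the same mapping as the UDF below; real-world postcode/timezone boundaries are messier than any single table):

```python
# special-case postcodes that don't follow their state's timezone
SPECIAL = {
    "2880": "Australia/Broken_Hill",
    "2898": "Australia/Lord_Howe",
    "2899": "Pacific/Norfolk",
    "6443": "Australia/Eucla",
    "6798": "Indian/Christmas",
    "6799": "Indian/Cocos",
}
# first digit of the postcode maps to a state's timezone
PREFIX = {
    "0": "Australia/Darwin",
    "2": "Australia/Sydney",
    "3": "Australia/Melbourne",
    "4": "Australia/Brisbane",
    "5": "Australia/Adelaide",
    "6": "Australia/Perth",
    "7": "Australia/Hobart",
}

def postcode_to_timezone(postcode):
    return SPECIAL.get(postcode) or PREFIX.get(postcode[0])

print(postcode_to_timezone("2880"))  # Australia/Broken_Hill
print(postcode_to_timezone("3031"))  # Australia/Melbourne
```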
CREATE OR REPLACE FUNCTION fn.postcode_to_timezone(postcode STRING)
RETURNS STRING AS (
CASE
WHEN postcode = "2880" THEN 'Australia/Broken_Hill'
WHEN postcode = "2898" THEN 'Australia/Lord_Howe'
WHEN postcode = "2899" THEN 'Pacific/Norfolk'
WHEN postcode = "6443" THEN 'Australia/Eucla'
WHEN postcode = "6798" THEN 'Indian/Christmas'
WHEN postcode = "6799" THEN 'Indian/Cocos'
WHEN STARTS_WITH(postcode, "0") THEN 'Australia/Darwin'
WHEN STARTS_WITH(postcode, "2") THEN 'Australia/Sydney'
WHEN STARTS_WITH(postcode, "3") THEN 'Australia/Melbourne'
WHEN STARTS_WITH(postcode, "4") THEN 'Australia/Brisbane'
WHEN STARTS_WITH(postcode, "5") THEN 'Australia/Adelaide'
WHEN STARTS_WITH(postcode, "6") THEN 'Australia/Perth'
WHEN STARTS_WITH(postcode, "7") THEN 'Australia/Hobart'
END
);
Example
SELECT
fn.postcode_to_timezone("2000"), -- "Australia/Sydney"
fn.postcode_to_timezone("3031"), -- "Australia/Melbourne"
fn.postcode_to_timezone("5045") -- "Australia/Adelaide"
That’s all, folks
Hopefully some of those were useful for you, or at least give you some ideas for UDFs of your own.
Cheers!
1. Back when UDFs were first introduced to BigQuery, they were only temporary. That meant you had to re-define and create them for every query (or, more realistically, copy and paste them from that document you keep for just such an occasion). Persistent UDFs wouldn’t arrive until much later, in mid-2019.
2. Thank you Omni Calculator ❤️!