Fast forward to 2020: BigQuery has supported persistent, SQL-based UDFs for over a year, meaning we can finally create functions that persist indefinitely. Hoorah! 🎉
Persist what, where?
It’s all well and good that BigQuery now supports persistent UDFs, but where are they persisted? Just as other databases allow you to namespace your functions to a particular schema, with BigQuery your functions are created and tied to a particular dataset.
I like to create a separate fn dataset in my projects solely for this purpose; this way I can quickly scan all the functions I have available without other tables or views getting in the way.
Having a short and easily-identifiable name for your dataset also helps when it comes time to write your queries, especially as BigQuery requires you to prefix UDFs with the dataset they're from. As it turns out, I'm not alone in following this pattern: Google use the same naming convention for their publicly available UDFs. Speaking of which…
Check out BigQuery Utils on GitHub
Google have curated a number of useful UDFs and made them publicly available from their own Google Cloud project. Rather than having to mess around creating these UDFs yourself, you can go ahead and use these right now by referencing Google’s project/dataset and functions in your queries, like so:
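For example, the median function from Google's bqutil project (which also features later in this post) can be called by fully qualifying it with their project and dataset:

```sql
-- Call Google's publicly hosted median UDF directly, no setup required
SELECT bqutil.fn.median([1, 1, 2, 3, 5]) AS median  -- 2.0
```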
The one downside to this otherwise fantastic resource is that it’s only available within the US region, so those of us down under (and I guess other regions too) will have to resort to deploying these functions to our own project tenancy until such time as they’re made more widely available.
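A minimal sketch of that workaround, assuming you want the median function in an Australian region (the project name `my-project` is a placeholder; the function body is the same one shown later in this post):

```sql
-- Hypothetical: recreate a bqutil function in a dataset located in your own region
CREATE SCHEMA IF NOT EXISTS `my-project.fn`
  OPTIONS (location = "australia-southeast1");

CREATE OR REPLACE FUNCTION `my-project.fn.median`(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
```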
A few of my favourite UDFs
This is what you’re all here for, right? Listed below are the functions I’m particularly jazzed about. Each one includes a short introduction explaining what it does and why it’s useful, followed by its SQL definition, and finally a few examples to get you started.
In no particular order, here we go!
- 1. Median
- 2. Percentage change (increase/decrease) between two numbers
- 3. Percentage difference between two numbers
- 4. Evaluate whether two arrays are equal
- 5. Slice an array by index range
- 6. Generate bin intervals
- 7. Extract values from APPROX_TOP_COUNT
- 8. Last day of the month
- 9. Total number of days in a month
- 10. Lookup Australian timezone by postcode
1. Median

There isn't a week that goes by that I don't use this function. Often the data I work with is skewed in some form or another, making the median a much better choice for calculating the central tendency than using AVG for the mean.
It's quite clever how it works, too. It relies on BigQuery's ARRAY_AGG to aggregate together all the values we want to calculate the median for, then uses array indexing to find the value(s) that best line up with the middle of the array. Simple, but elegant.
I can’t take credit for this one, for that I have to thank the fine work done by Elliot Brossard (and Felipe Hoffa for bringing it to my attention).
```sql
CREATE OR REPLACE FUNCTION fn.median(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
```
Let’s look at a couple of examples to see how it works.
Here we have a list of users who have visited our website by date:
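The post's original table hasn't survived here, so as a stand-in, the `tbl` the queries below select from can be written inline (these values are my own illustrative placeholders, not the post's numbers):

```sql
-- Illustrative stand-in for the sample data; values are invented for demonstration
WITH tbl AS (
  SELECT * FROM UNNEST([
    STRUCT(DATE "2020-05-01" AS date, 18 AS user_count),
    (DATE "2020-05-02", 24),
    (DATE "2020-05-03", 21),
    (DATE "2020-05-04", 30)
  ])
)
SELECT * FROM tbl ORDER BY date
```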
To calculate the median number of users who have visited our website per day, we’d use our freshly minted median function as follows:
```sql
SELECT bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM tbl
```
| median users per day |
Because we have an even number of records, there's no one value that best represents the middle. The function therefore gets the median by calculating the mean of the two middlemost values.
Some of you may recognise that what we’re doing here is calculating the median on a dataset of discrete values (user counts). There’s a whole separate conversation to be had about the effect discrete and continuous variables have on calculating the median – this post is not that post. Suffice to say the function has performed admirably for our purposes and has given us a median we’d be proud to take home to our mothers.
| date | user category | user count |
This second example has users segmented by category. If we wanted to calculate the median number of users who’ve visited our website per day and per category, we’d do this:
```sql
SELECT
  user_category,
  bqutil.fn.median(ARRAY_AGG(user_count)) AS median_users_per_day
FROM tbl
GROUP BY user_category
```
| user category | median users per day |
Here it's much easier for us to reconcile the above medians with the values from our original dataset. Since each group has only three records, the middle value is unambiguous – the function dutifully grabs the middlemost value for each group (they're treated as separate arrays since we're using GROUP BY) and returns the medians you see above.
2. Percentage change (increase/decrease) between two numbers
I'll be honest, the number of times I've had to reach out to Google² to tell me how to calculate this one (and the one below) – it's embarrassing. Turning this into a UDF was as much about keeping my code DRY as it was staving off my fear of inadequacy for not remembering such a basic formula.
Keep in mind this function calculates percentage change, which is different from percentage difference (whose function is listed below). Most of us are probably more familiar with percentage change; it's what we hear around us most often:
- “The sharemarket is down 7.2% from yesterday.”
- “This laptop’s battery lasts 15% longer than last year’s model.”
- “I just negotiated a 20% increase to my salary!”
Percentage change can be summarised then as a measure of the relative change between a variable’s before value (e.g. yesterday’s stock price) and its after value (today’s stock price).
```sql
CREATE OR REPLACE FUNCTION fn.percentage_change(val1 FLOAT64, val2 FLOAT64) RETURNS FLOAT64 AS (
  IF(
    ABS(val1) > 0 AND ABS(val2) > 0,
    (val2 - val1) / ABS(val1),
    NULL
  )
);
```
```sql
SELECT
  fn.percentage_change(1, 2),     -- 1.0 (100% increase)
  fn.percentage_change(0.5, 2),   -- 3.0 (300% increase)
  fn.percentage_change(100, 75),  -- -0.25 (25% decrease)
  fn.percentage_change(-20, -30)  -- -0.5 (50% decrease)
```
3. Percentage difference between two numbers
Calculating the percentage difference is useful when the relationship between two values doesn’t matter. It’s not relevant to say one is an increase or decrease over the other. They’re just… different.
When calculating percentage change, we use one of the values as a reference point from which to compare the other (e.g. old versus new), but when calculating the percentage difference it doesn’t make sense to use either value as the reference. Instead, we take the mid-point of the two values and use this as the reference. By doing this, you’ll notice that it no longer matters which value comes first in the equation, the result is the same – the measure is said to be non-directional.
So when is it appropriate to use percentage difference over percentage change? To steal an example from ol' faithful² – say you're comparing the size of two companies based on their number of employees. Company A has 93 employees and company B has 117; the percentage difference in size between these two companies comes out to be 22.86%. It's not correct to say that company A is 22.86% smaller than company B, or that company B is 22.86% larger than company A. They're different. How different? 22.86% different!
As the old adage goes, you should always pick the right tool for the job. The same applies here when choosing between percentage change or difference.
```sql
CREATE OR REPLACE FUNCTION fn.percentage_difference(val1 FLOAT64, val2 FLOAT64) RETURNS FLOAT64 AS (
  IF(
    val1 > 0 AND val2 > 0,
    ABS(val1 - val2) / ((val1 + val2) / 2),
    NULL
  )
);
```
```sql
SELECT
  fn.percentage_difference(1, 2),     -- 0.6667 (66.67%)
  fn.percentage_difference(2, 1),     -- 0.6667 (66.67%)
  fn.percentage_difference(100, 75),  -- 0.2857 (28.57%)
  fn.percentage_difference(-20, -30)  -- NULL
```
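Tying this back to the company-size example above, the same function reproduces the 22.86% figure (|93 − 117| / ((93 + 117) / 2) = 24 / 105 ≈ 0.2286):

```sql
SELECT fn.percentage_difference(93, 117) AS size_difference  -- ≈ 0.2286 (22.86%)
```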
4. Evaluate whether two arrays are equal
This one I don't use that often, but when I do I'm thankful to have it available in my arsenal of UDFs. It compares two arrays and evaluates to true/false depending on whether they contain the same values, regardless of their order.
See? I can keep these explanations short. 😉
```sql
CREATE OR REPLACE FUNCTION fn.equal_arrays(arr1 ANY TYPE, arr2 ANY TYPE) RETURNS BOOL AS (
  -- The length guard stops a shorter array matching a longer superset of itself
  ARRAY_LENGTH(arr1) = ARRAY_LENGTH(arr2)
  AND (SELECT LOGICAL_AND(v IN UNNEST(arr2)) FROM UNNEST(arr1) v)
);
```
```sql
SELECT
  fn.equal_arrays([1, 2, 3], [1, 2, 3]),              -- true
  fn.equal_arrays([1, 2, 3], [3, 1, 2]),              -- true
  fn.equal_arrays(['a', 'b', 'c'], ['c', 'b', 'a']),  -- true
  fn.equal_arrays(
    [STRUCT('A' AS a), STRUCT('B' AS b)],
    [STRUCT('A' AS a), STRUCT('B' AS b)]
  ),                                                  -- true
  fn.equal_arrays(
    [STRUCT('Z' AS a), STRUCT('B' AS b)],
    [STRUCT('A' AS a), STRUCT('B' AS b)]
  ),                                                  -- false
  fn.equal_arrays(['1', '2'], [1, 2])                 -- error: both arrays must share the same data type
```
5. Slice an array by index range
It’s easy to take for granted the functionality you get for free in other programming languages. Slicing an array is one such example – so trivial to do in a language like Python, but requires multiple lines of code to do the same thing in SQL.
```sql
CREATE OR REPLACE FUNCTION fn.slice_array(arr ANY TYPE, start_index INT64, end_index INT64)
RETURNS ARRAY<ANY TYPE> AS ((
  SELECT ARRAY_AGG(val)
  FROM UNNEST(arr) AS val WITH OFFSET idx
  WHERE idx >= (
    CASE
      WHEN start_index < 0 THEN ARRAY_LENGTH(arr) + start_index
      WHEN end_index < 0 AND start_index IS NULL THEN ARRAY_LENGTH(arr) + end_index
      ELSE IFNULL(start_index, end_index)
    END
  )
  AND idx <= (
    IF(
      end_index < 0,
      ARRAY_LENGTH(arr) + end_index,
      IFNULL(end_index, start_index)
    )
  )
));
```
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 0, 1),  -- ['a', 'b']
  fn.slice_array(['a', 'b', 'c', 'd'], 1, 2)   -- ['b', 'c']
```
You can also use a negative index for the upper bound to find the nth value from the right-hand side of the array. Like so:
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 2, -1),  -- ['c', 'd']
  fn.slice_array(['a', 'b', 'c', 'd'], 1, -2)   -- ['b', 'c']
```
Set both bounds to the same index to get a single-valued array. Alternatively, set one of the bounds to NULL and the other bound will be used for both.
```sql
SELECT
  fn.slice_array(['a', 'b', 'c', 'd'], 2, 2),     -- ['c']
  fn.slice_array(['a', 'b', 'c', 'd'], 2, NULL),  -- ['c']
  fn.slice_array(['a', 'b', 'c', 'd'], -1, -1),   -- ['d']
  fn.slice_array(['a', 'b', 'c', 'd'], NULL, -1)  -- ['d']
```
Combine with something like REGEXP_EXTRACT_ALL for some next-level shenanigans!
```sql
DECLARE file_path STRING;
DECLARE regex STRING DEFAULT "/([\\w\\._-]+)";

SET file_path = "/home/chris/projects/bigquery-udfs.sql";

SELECT
  -- root folder
  ARRAY_TO_STRING(
    fn.slice_array(REGEXP_EXTRACT_ALL(file_path, regex), 0, NULL),
    ""
  ) AS root_folder,
  -- parent folder
  ARRAY_TO_STRING(
    fn.slice_array(REGEXP_EXTRACT_ALL(file_path, regex), NULL, -2),
    ""
  ) AS parent_folder,
  -- filename
  fn.slice_array(
    REGEXP_EXTRACT_ALL(file_path, regex), NULL, -1
  )[OFFSET(0)] AS filename
```
| root folder | parent folder | filename |
| --- | --- | --- |
| home | projects | bigquery-udfs.sql |
6. Generate bin intervals
This UDF was born out of frustration with Data Studio. Can you believe they don’t include a histogram chart? The audacity! I was forced to create my own bin intervals in SQL and use a bar chart like an animal.
This function groups values into interval ranges based on the bin size and limit you give it.
```sql
CREATE OR REPLACE FUNCTION fn.bin_value(val FLOAT64, bin_size FLOAT64, bin_max FLOAT64) AS (
  CASE
    WHEN val > bin_max THEN bin_max
    ELSE CAST(TRUNC(val / bin_size) * bin_size AS INT64)
  END
);
```
```sql
DECLARE bin_size INT64 DEFAULT 5;
DECLARE bin_max INT64 DEFAULT 20;
DECLARE numbers ARRAY<INT64>;

SET numbers = [1, 3, 7, 9, 11, 11, 14, 19, 22, 29];

WITH frequencies AS (
  SELECT
    fn.bin_value(num, bin_size, bin_max) AS bin,
    COUNT(1) AS frequency
  FROM UNNEST(numbers) num
  GROUP BY bin
)

SELECT
  bin,
  IF(
    bin = bin_max,
    CONCAT(bin_max, "+"),
    CONCAT(bin, "-", bin + bin_size)
  ) AS bin_range,
  frequency
FROM frequencies
ORDER BY bin
```
7. Extract values from APPROX_TOP_COUNT
APPROX_TOP_COUNT is great; it's saved me a ton of time and energy scaffolding sub-queries and CTEs to get the top N records of X, Y and Z. Sometimes, though, I'm only interested in knowing what the top values are – I don't care what their counts are, and I certainly don't care that the top value might be NULL.
This function returns a flattened array with the values from APPROX_TOP_COUNT (or APPROX_TOP_SUM), excluding those pesky NULLs.
```sql
CREATE OR REPLACE FUNCTION fn.extract_top_values(arr ANY TYPE) RETURNS ARRAY<STRING> AS ((
  SELECT ARRAY_AGG(a.value ORDER BY i)
  FROM UNNEST(arr) a WITH OFFSET i
  WHERE a.value IS NOT NULL
));
```
```sql
DECLARE favourite_colors ARRAY<STRING>;

SET favourite_colors = [
  NULL, NULL, NULL, NULL,
  'blue', 'blue', 'blue',
  'red', 'red',
  'green', 'purple', 'orange'
];

SELECT
  APPROX_TOP_COUNT(colours, 3),  -- [(NULL, 4), ("blue", 3), ("red", 2)]
  fn.extract_top_values(
    APPROX_TOP_COUNT(colours, 3)
  )[OFFSET(0)]                   -- "blue"
FROM UNNEST(favourite_colors) AS colours
```
8. Last day of the month
This is an example of a function that doesn’t take much brain power to put together, but given you’re probably using it as an input into some other calculation, it won’t take long before your query devolves into an indecipherable chain of function calls. Do yourself a favour and use a UDF instead.
Its usage is pretty self-explanatory. You give it a date, any date, and it will return a new date set to the last day of that month.
```sql
CREATE OR REPLACE FUNCTION fn.month_ceil(date DATE) RETURNS DATE AS (
  DATE_SUB(DATE_ADD(DATE_TRUNC(date, MONTH), INTERVAL 1 MONTH), INTERVAL 1 DAY)
);
```
```sql
SELECT
  fn.month_ceil(DATE("2012-12-25")),  -- 2012-12-31
  fn.month_ceil(DATE("2016-02-09")),  -- 2016-02-29 (leap year)
  fn.month_ceil(DATE("2019-02-18")),  -- 2019-02-28
  fn.month_ceil(DATE("2020-05-11"))   -- 2020-05-31
```
9. Total number of days in a month
Sometimes you need to know how many days there are in a month. It’s easy enough to stitch together the necessary functions to do this yourself, but once you’ve done it a few times, it just makes sense that it exists as a function, and so here we are.
An added feature of this function is that if the input date falls within the current month, it will return only the number of days that have elapsed so far.
An example of why this might be useful is when calculating the percentage of days a user is active in a month – calculating this without factoring in the current date could unfairly classify the user as inactive. This function allows you to pro-rate whatever calculation you’re working on and has the added benefit of saving some poor soul from receiving an unwarranted marketing email.
```sql
CREATE OR REPLACE FUNCTION fn.days_in_month(date DATE) RETURNS INT64 AS (
  CASE
    -- is the current month
    WHEN DATE_TRUNC(CURRENT_DATE(), MONTH) = DATE_TRUNC(date, MONTH)
      -- return days-to-date up to today
      THEN EXTRACT(DAY FROM CURRENT_DATE()) - 1
    ELSE EXTRACT(DAY FROM fn.month_ceil(date))
  END
);
```
```sql
SELECT CURRENT_DATE;  -- 2020-05-26

SELECT
  fn.days_in_month(DATE("2012-12-25")),  -- 31
  fn.days_in_month(DATE("2016-02-09")),  -- 29 (leap year)
  fn.days_in_month(DATE("2019-02-18")),  -- 28
  fn.days_in_month(DATE("2020-05-11"))   -- 25 (elapsed days in current month)
```
If you wanted to use it in some sort of pro-rated calculation, such as calculating the percentage of days a user is active for in a month, we can apply it like this:
```sql
SELECT CURRENT_DATE;  -- 2020-05-26

WITH user_activity_per_day AS (
  SELECT
    day AS date,
    1 AS user_id
  FROM UNNEST(
    GENERATE_DATE_ARRAY("2020-04-01", "2020-05-25", INTERVAL 1 DAY)
  ) AS day
),

user_activity_per_month AS (
  SELECT
    DATE_TRUNC(date, MONTH) AS month,
    user_id,
    COUNT(1) AS days_active
  FROM user_activity_per_day
  GROUP BY month, user_id
)

SELECT
  FORMAT_DATE("%b-%Y", month) AS month,
  user_id,
  days_active,
  days_active / fn.days_in_month(month) AS percent_active_in_month
FROM user_activity_per_month
```
| month | user id | days active | percent active in month |
| --- | --- | --- | --- |
| Apr-2020 | 1 | 30 | 1.0 |
| May-2020 | 1 | 25 | 1.0 |
10. Lookup Australian timezone by postcode
So this one’s pretty niche, I know – there’s a reason it’s listed last. If you ever want to know the timezone and all you have is a postcode, this one’s for you. It’s perfect for when you want to do some time-of-day analysis and all you have is a UTC timestamp and a user’s postcode.
```sql
CREATE OR REPLACE FUNCTION fn.postcode_to_timezone(postcode STRING) RETURNS STRING AS (
  CASE
    WHEN postcode = "2880" THEN 'Australia/Broken_Hill'
    WHEN postcode = "2898" THEN 'Australia/Lord_Howe'
    WHEN postcode = "2899" THEN 'Pacific/Norfolk'
    WHEN postcode = "6443" THEN 'Australia/Eucla'
    WHEN postcode = "6798" THEN 'Indian/Christmas'
    WHEN postcode = "6799" THEN 'Indian/Cocos'
    WHEN STARTS_WITH(postcode, "0") THEN 'Australia/Darwin'
    WHEN STARTS_WITH(postcode, "2") THEN 'Australia/Sydney'
    WHEN STARTS_WITH(postcode, "3") THEN 'Australia/Melbourne'
    WHEN STARTS_WITH(postcode, "4") THEN 'Australia/Brisbane'
    WHEN STARTS_WITH(postcode, "5") THEN 'Australia/Adelaide'
    WHEN STARTS_WITH(postcode, "6") THEN 'Australia/Perth'
    WHEN STARTS_WITH(postcode, "7") THEN 'Australia/Hobart'
  END
);
```
```sql
SELECT
  fn.postcode_to_timezone("2000"),  -- "Australia/Sydney"
  fn.postcode_to_timezone("3031"),  -- "Australia/Melbourne"
  fn.postcode_to_timezone("5045")   -- "Australia/Adelaide"
```
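To close the loop on the time-of-day use case, the returned timezone name plugs straight into BigQuery's date/time functions (the timestamp here is an arbitrary example; Sydney observes AEST, UTC+10, in May):

```sql
-- Convert a UTC timestamp to a user's local time via their postcode
SELECT
  DATETIME(
    TIMESTAMP "2020-05-26 03:30:00+00",
    fn.postcode_to_timezone("2000")
  ) AS local_time  -- 2020-05-26T13:30:00
```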
That’s all, folks
Hopefully some of those were useful to you, or at least gave you some ideas for UDFs of your own.
Back when UDFs were first introduced to BigQuery, they were only temporary. That meant you had to re-define and create them for every query (or more realistically, copy and paste them from that document you keep for just such an occasion). Persistent UDFs wouldn’t arrive until much later in mid-2019. ↩︎
Thank you Omni Calculator ❤️! ↩︎ ↩︎