This year at work there’s been a push to brush up on relational database fundamentals. I’m generally pretty good at that, so I wanted to put my skills to the test. My preferred way to learn (or upskill) a new language in the month of December?
I’ve done AoC for the last four iterations starting in 2021, spanning a few languages – starting in Ruby, then Python, then Golang last year (though I got busy and didn’t finish), and finally a mix of Python and SQL. I get really sucked into these puzzles, so I wanted a challenge. So I fired up PostgreSQL and figured I’d take a stab at writing some of the solutions in SQL.
Repository Link
https://github.com/ty-porter/advent-of-code/tree/master/2024-sql
Methodology
I’m decent at writing SQL queries for business purposes, but writing them to solve complicated problems like AoC puzzles would definitely prove difficult and time-consuming. To preserve a little of my sanity, I chose to write the solutions out in Python before attempting them in SQL. Python is reasonably quick to prototype in, and crucially, solving each puzzle in Python first gave me a known-good answer to check my SQL against, without fear of running into AoC’s wrong-answer cooldown periods.
Once I had a working solution in Python, the SQL scripts could be created.
Meta-Tables, Functions, Batch Scripts
I created a few meta-scripts that handle the basic housekeeping: a solutions table for my answers, a raw_data table that holds the lines of the problem input, and update_solution, a function to upsert a record into the solutions table.
-- common/setup.sql
DROP TABLE IF EXISTS raw_data;
CREATE TABLE raw_data (
    position SERIAL PRIMARY KEY
    , raw_data VARCHAR
);

-- There's a separate script to drop this, but it's not included here. It should persist between solutions.
CREATE TABLE IF NOT EXISTS solutions (
    day INT
    , part INT
    , result VARCHAR
    , PRIMARY KEY (day, part)
);

CREATE OR REPLACE FUNCTION update_solution(newday INT, newpart INT, newresult ANYELEMENT)
RETURNS VOID AS $$
BEGIN
    INSERT INTO solutions (day, part, result)
    VALUES (newday, newpart, newresult::VARCHAR)
    ON CONFLICT (day, part)
    DO UPDATE
    SET result = EXCLUDED.result::VARCHAR;
END;
$$ LANGUAGE plpgsql;
Since I’m on Windows, I defined a batch file that seeds the raw_data table via psql and runs my solution scripts. There’s a whole host of scripts I used here, but this was the most common one:
:: scripts/run.bat
@echo off
set dayno=%1
set connection=postgresql://aoc:aoc@localhost:5432/aoc
psql -c "SET client_min_messages TO NOTICE;" %connection%
psql --quiet -f .\common\setup.sql %connection%
psql --quiet -c "COPY raw_data (raw_data) FROM 'C:\Users\tyler\development\advent-of-code\2024-sql\solutions\%dayno%\prompt.txt' WITH (FORMAT text)" %connection%
psql --quiet -f .\solutions\%dayno%\solution.sql %connection%
Then I can run a solution with scripts/run.bat <day>!
Solution Scripts
Now the fun can begin. Each solution script is designed to be idempotent – it builds up a schema that it will use for a solution, computes the result, writes it to the solutions table, and tears itself back down.
First I parsed the raw data rows into some sort of useful table. If this was straightforward enough, an INSERT INTO ... SELECT was the preferred approach, but for more complex data I needed to use a PL/pgSQL function (by default I named it process_raw_data). For instance, if I needed to parse my raw data rows into a 2D grid, I might do something like:
DROP TABLE IF EXISTS grid;
CREATE TABLE grid (
    id SERIAL PRIMARY KEY
    , x INT NOT NULL
    , y INT NOT NULL
    , c VARCHAR NOT NULL
);

CREATE OR REPLACE FUNCTION process_raw_data()
RETURNS VOID AS $$
DECLARE
    record RECORD;
    x INT := 0;
    y INT := 0;
    c VARCHAR;
    char_row VARCHAR[];
BEGIN
    FOR record IN SELECT * FROM raw_data ORDER BY position ASC
    LOOP
        char_row := string_to_array(record.raw_data, NULL);
        x := 0;

        FOREACH c IN ARRAY char_row
        LOOP
            INSERT INTO grid (x, y, c) VALUES (x, y, c);
            x := x + 1;
        END LOOP;

        y := y + 1;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT process_raw_data();
I honestly overused these functions. They were really only useful for looping over things; looking back, I can see that in some cases they either a) just wrapped regular UPDATE / INSERT queries or b) could easily have been converted to them.
For the solution calculation itself, I liked using CTEs to break up the individual queries. It makes it clear what is an intermediate calculation vs. what the result is. Since there are two parts to a problem, I can potentially re-use those calculations with very little effort.
My last few CTEs were usually called part1/part2 and were simply a step to get the part/result columns for the solutions table. After that, a UNION ALL of the two parts is readily inserted using my solution insert function.
WITH subcalculation1 AS (
    -- Some expensive calculation
)
, subcalculation2 AS (
    -- Some other calculation
)
, part1 AS (
    SELECT
        1 AS part
        , COUNT(*) AS result
    FROM subcalculation1
)
, part2 AS (
    SELECT
        2 AS part
        , COUNT(*) AS result
    FROM subcalculation2
)
SELECT
    update_solution(1, results.part, results.result)
FROM (
    SELECT * FROM part1
    UNION ALL
    SELECT * FROM part2
) AS results;
So How’d It Go?
Honestly, once you get used to it, SQL really isn’t that bad for doing AoC. I mean, it’s not the right tool for the job, but I expected it to be way worse. I thought it was a certainty that I’d need to stop around the end of the first week – at time of writing I’m at double that!
The Good
Leaning into set theory
Duh. That’s what SQL is made for.
There are lots of instances of this, but let’s take Day 7 as an example – we want to apply a series of operators to each row in a table. Well, we can do that: it’s a CROSS JOIN against the set of operators, and we can filter the results afterward.
WITH RECURSIVE bfs AS (
    SELECT
        eq.id
        , eq.tgt
        , eq.operands[1] AS current_val
        , eq.operands AS operands
        , 1 AS idx
        , array_length(eq.operands, 1) AS tgtidx
        , ARRAY[]::TEXT[] AS path
    FROM equations eq

    UNION ALL

    SELECT
        b.id
        , b.tgt
        , CASE
            WHEN op = '+' THEN b.current_val + b.operands[b.idx + 1]
            WHEN op = '*' THEN b.current_val * b.operands[b.idx + 1]
            WHEN op = '|' THEN CONCAT(b.current_val::TEXT, (b.operands[b.idx + 1])::TEXT)::NUMERIC
          END AS current_val
        , b.operands
        , b.idx + 1
        , b.tgtidx
        , path || op AS path
    FROM bfs b
    CROSS JOIN unnest(ARRAY['+', '*', '|']) AS op
    WHERE b.idx < array_length(b.operands, 1)
)
-- Use the result of this calculation...
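For comparison, here’s roughly the same check in Python, the language I prototyped in. This is a brute-force sketch rather than the CTE’s breadth-first expansion – solvable is an illustrative helper, not code from my repo, and the example values come from AoC’s published Day 7 example:

```python
from itertools import product

def solvable(target: int, operands: list[int]) -> bool:
    """Check whether some left-to-right combination of +, *, and digit
    concatenation over the operands reaches the target (AoC 2024 Day 7)."""
    for ops in product("+*|", repeat=len(operands) - 1):
        value = operands[0]
        for op, operand in zip(ops, operands[1:]):
            if op == "+":
                value += operand
            elif op == "*":
                value *= operand
            else:  # '|' -> digit concatenation, like the CONCAT branch above
                value = int(str(value) + str(operand))
        if value == target:
            return True
    return False

# 292 = 11 + 6 * 16 + 20 is solvable; 161011 is not (both from the AoC example).
print(solvable(292, [11, 6, 16, 20]))   # True
print(solvable(161011, [16, 10, 13]))   # False
```

The Python version explores full operator combinations depth-first, while the recursive CTE carries every partial prefix forward at once – same search space, different traversal order.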
Storing results
Also duh.
This is kind of obvious, but when you’re working a complex puzzle like this, you have lots of options at your fingertips for data storage – a database, after all, is intended for long-term storage. You can use this to your advantage by leveraging CTEs, materialized views, temp tables, and even permanent working tables.
In an imperative programming language, you’d be reaching for something like a Python dict, list, or some other data structure to represent the data. In Postgres, you really only have one option – a table (or a table-like structure). You don’t really have to think about it, either: if you’re using a CTE, you can just pop a MATERIALIZED on there to “store” the results.
The good thing about storage like this is that it’s easy to make multiple passes of intermediate calculations over the same working data.
The Bad
Text processing
Getting useful data out of the raw data table was a pain almost every time. It can be done, but transforming the data was pretty wonky.
Take a look at this solution from Day 9. It gets the job done, but it’s pretty messy. Most days relied on substr or some sort of regex function to parse the raw data rows; other languages would process this in a far more readable way.
Recursion / iteration
Both of these are pretty rough. You can do it, but it will not be enjoyable or particularly performant.
I skipped Day 6 Part 2 – the Part 1 solution I came up with iteratively created a path on a 2D grid. It ran in ~3s, but the Part 2 solution would have run that same code hundreds of times. It just wasn’t worth it.
I had some luck with recursively generating a set of values like in Day 8, as long as the list of values was very tightly bounded:
WITH RECURSIVE harmonics AS (
    SELECT
        *
    FROM antinodes

    UNION ALL

    SELECT
        x + dx AS x
        , y + dy AS y
        , dx
        , dy
    FROM harmonics h
    JOIN grid_bounds bounds
        ON bounds.min_x <= h.x AND bounds.min_y <= h.y
        AND bounds.max_x >= h.x AND bounds.max_y >= h.y
)
-- Do something with this calculation
I even got breadth-first searches working in a few cases (I overused BFS in my Python solutions, too):
WITH RECURSIVE bfs AS (
    SELECT
        id
        , x
        , y
        , n
        , x AS ox
        , y AS oy
    FROM grid
    WHERE n = 0

    UNION ALL

    SELECT
        g.id
        , g.x
        , g.y
        , g.n
        , b.ox
        , b.oy
    FROM bfs b
    CROSS JOIN directions d
    INNER JOIN grid g
        ON b.x + d.x = g.x
        AND b.y + d.y = g.y
        AND b.n + 1 = g.n
)
, part1 AS (
    SELECT
        1 AS part
        , COUNT(DISTINCT (x, y, ox, oy)) AS result
    FROM bfs
    WHERE n = 9
)
-- Write out the result
More on BFS in SQL later.
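For comparison, the same kind of search is a few lines in Python. This is a sketch of a Day 10-style BFS (the miniature grid below is AoC’s published example, not real input); like the recursive CTE, it skips a visited set and just deduplicates (x, y, ox, oy) tuples at the end:

```python
from collections import deque

def trailhead_scores(grid: list[list[int]]) -> int:
    """Sum, over every 0-cell, the number of distinct 9-cells reachable by
    stepping to orthogonal neighbors whose value increases by exactly 1."""
    rows, cols = len(grid), len(grid[0])
    reached: set[tuple[int, int, int, int]] = set()  # (x, y, ox, oy)
    queue = deque(
        (x, y, x, y) for y in range(rows) for x in range(cols) if grid[y][x] == 0
    )
    while queue:
        x, y, ox, oy = queue.popleft()
        if grid[y][x] == 9:
            reached.add((x, y, ox, oy))
            continue
        for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < cols and 0 <= ny < rows and grid[ny][nx] == grid[y][x] + 1:
                queue.append((nx, ny, ox, oy))
    return len(reached)

# AoC's small Day 10 example: one trailhead, one reachable 9.
print(trailhead_scores([[0, 1, 2, 3], [1, 2, 3, 4], [8, 7, 6, 5], [9, 8, 7, 6]]))  # 1
```

Termination is guaranteed the same way as in the CTE: every step strictly increases the cell value, so no path is longer than nine hops.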
The Ugly
Also known as “how the hell did that even work??”.
Every time you need a piece of data, you must query for it
Queries are expensive. There’s a reason the N + 1 query pattern is such a huge problem for a database: it racks up tons of IO and transit time rather than spending that time on work that actually matters. That means the programmer – especially a masochistic, puzzle-solving programmer – is heavily incentivized to make as few queries in the hot loop as possible.
Take for example Day 12, which is about identifying contiguous regions of letters and calculating their area (trivial), perimeter (fairly easy), and in Part 2, number of sides (now we’re talkin’).
My solution needed all the neighbors of each cell. To do that in one query, I aggregated each direction in a fixed order, emitting '1' if the neighbor exists and '0' if it doesn’t. That gives a length-8 string that can be cast to a BIT(8) – an 8-bit bitmap of whether a neighbor is present at each position. Can this be done better? I don’t know at this point, and I wasted hours trying to optimize it.
WITH adjacent_cells AS (
    SELECT
        ord, x, y
    FROM (
        VALUES
            (1, xi    , yi - 1), -- Up
            (2, xi + 1, yi    ), -- Right
            (3, xi    , yi + 1), -- Down
            (4, xi - 1, yi    ), -- Left
            (5, xi + 1, yi - 1), -- Up-Right
            (6, xi + 1, yi + 1), -- Down-Right
            (7, xi - 1, yi + 1), -- Down-Left
            (8, xi - 1, yi - 1)  -- Up-Left
    ) AS adjacent(ord, x, y)
)
, ordered_cells AS (
    SELECT
        ac.*
        , CASE
            WHEN g.id IS NOT NULL THEN '1'
            ELSE '0'
          END AS val
    FROM adjacent_cells ac
    LEFT JOIN grid g
        ON g.x = ac.x AND g.y = ac.y AND g.region_id = rid
    ORDER BY ac.ord ASC
)
SELECT
    string_agg(oc.val, '') AS bitmask
INTO
    neighbors -- A variable to be used later
FROM ordered_cells oc;
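In Python terms, the trick boils down to building the '1'/'0' string in a fixed neighbor order and reading it as a base-2 number. This is a sketch – neighbor_bitmask is a hypothetical helper mirroring the ordering in the VALUES list:

```python
# Fixed neighbor order: Up, Right, Down, Left, UR, DR, DL, UL
# (the same ordering as the SQL VALUES list).
OFFSETS = [(0, -1), (1, 0), (0, 1), (-1, 0), (1, -1), (1, 1), (-1, 1), (-1, -1)]

def neighbor_bitmask(cells: set[tuple[int, int]], x: int, y: int) -> int:
    """8-bit mask of which neighbors of (x, y) are present in the region."""
    bits = "".join("1" if (x + dx, y + dy) in cells else "0" for dx, dy in OFFSETS)
    return int(bits, 2)

# A cell with exactly one neighbor, directly above it:
region = {(0, 0), (0, -1)}
print(f"{neighbor_bitmask(region, 0, 0):08b}")  # 10000000
```

Once each cell carries this mask, counting corners (and therefore sides) becomes a lookup against bit patterns instead of another round of joins.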
Recursion / iteration
I put this in the Bad section because it does work – when you get everything right, the approach suits the data, and so on.
When it goes wrong, oh boy are you out of luck.
First of all, the syntax is really wonky: it requires a base case joined by a UNION ALL to the recursive portion:
WITH RECURSIVE subquery AS (
    -- Base case
    SELECT 1 AS i

    UNION ALL

    -- Recursive case
    SELECT i + 1 FROM subquery
    WHERE i < 10
)
SELECT * FROM subquery;
This is a trivial example, but it’s quite simple to write a case that will simply never stop recursing – the termination condition has to live inside the recursive member, and getting it wrong means an infinite loop. I ran into this many times, and because a recursive CTE isn’t inside a function, it’s not possible to RAISE NOTICE and effectively debug-log your way out of the problem. Recursion specifically is not very performant either, so if you need to check a lot of iterations, you’re better off creating your own functions.
To go back to that Day 12 example, I needed to write a custom BFS to define the regions. A recursive CTE simply timed out no matter what I tried (and I’m still not 100% sure what went wrong).
CREATE OR REPLACE FUNCTION define_regions()
RETURNS VOID AS $$
DECLARE
    rid INT := 1;
    region_start INT;
    rows_updated INT;
BEGIN
    WHILE (SELECT count(*) FROM grid WHERE region_id IS NULL) > 0 LOOP
        -- Seed a new region from the first unassigned cell.
        SELECT min(id) INTO region_start FROM grid WHERE region_id IS NULL;

        UPDATE grid
        SET region_id = rid
        WHERE id = region_start;

        -- Expand the region one ring of neighbors at a time.
        LOOP
            WITH adjacent AS (
                SELECT
                    adj.id
                FROM grid base
                INNER JOIN grid adj
                    ON base.c = adj.c
                WHERE base.region_id = rid
                AND adj.region_id IS NULL
                AND (abs(base.x - adj.x) + abs(base.y - adj.y)) = 1
            )
            UPDATE grid
            SET region_id = rid
            WHERE id IN (SELECT a.id FROM adjacent a);

            GET DIAGNOSTICS rows_updated = ROW_COUNT;

            IF rows_updated = 0 THEN
                EXIT;
            END IF;
        END LOOP;

        rid := rid + 1;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT define_regions();
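Stripped of the SQL, this is an ordinary flood fill. Here’s a compact Python sketch of the same algorithm (define_regions and the dict-based grid are illustrative, not my actual prototype):

```python
from collections import deque

def define_regions(grid: dict[tuple[int, int], str]) -> dict[tuple[int, int], int]:
    """Label contiguous same-letter regions with incrementing ids via BFS."""
    region_of: dict[tuple[int, int], int] = {}
    rid = 0
    for start in grid:
        if start in region_of:
            continue  # already claimed by an earlier region
        rid += 1
        region_of[start] = rid
        queue = deque([start])
        while queue:
            x, y = queue.popleft()
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if grid.get((nx, ny)) == grid[x, y] and (nx, ny) not in region_of:
                    region_of[nx, ny] = rid
                    queue.append((nx, ny))
    return region_of

# The AAAA / BBCD / BBCC / EEEC example splits into 5 regions:
rows = ["AAAA", "BBCD", "BBCC", "EEEC"]
grid = {(x, y): c for y, row in enumerate(rows) for x, c in enumerate(row)}
print(len(set(define_regions(grid).values())))  # 5
```

The SQL version expands a whole frontier per UPDATE instead of one cell per loop iteration, but the structure – seed, expand until nothing changes, bump the id – is identical.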
Honestly, you should read this entire solution for Day 12. Quite frankly I am in disbelief that it works at all – this seems like something Postgres is wholly unsuitable for:
# Breaking this:
AAAA
BBCD
BBCC
EEEC

# Into this:
+-+-+-+-+
|A A A A|
+-+-+-+-+     +-+
              |D|
+-+-+   +-+   +-+
|B B|   |C|
+   +   + +-+
|B B|   |C C|
+-+-+   +-+ +
          |C|
+-+-+-+   +-+
|E E E|
+-+-+-+

# Where:
# A - area: 4, perimeter: 10, sides: 4
# B - area: 4, perimeter: 8, sides: 4
# C - area: 4, perimeter: 10, sides: 8
# D - area: 1, perimeter: 4, sides: 4
# E - area: 4, perimeter: 8, sides: 4
The syntax
It’s ugly. I’m sorry. I know the math guys out there probably love it, but from a programming perspective it’s pretty clunky. Admittedly, puzzles make that worse – they are essentially adversarial inputs for what a database is trying to accomplish.
Even so, getting the number of rows updated inside a function should not look like this:
GET DIAGNOSTICS rows_updated = ROW_COUNT;

IF rows_updated = 0 THEN
    EXIT;
END IF;
I shouldn’t have to write my own modulus function because Postgres’s % and mod() use a truncated definition (the result takes the sign of the dividend), while Python’s % uses a floored definition (the result always takes the sign of the divisor):
-- % and mod() return negative numbers for negative dividend / positive divisor and vice versa.
CREATE OR REPLACE FUNCTION modulus(dividend INT, divisor INT)
RETURNS INT AS $$
DECLARE
    result INT;
BEGIN
    divisor := abs(divisor);
    result := mod(dividend, divisor);

    IF result < 0 THEN
        result := result + divisor;
    END IF;

    RETURN result;
END;
$$ LANGUAGE plpgsql;
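A quick sanity check of the mismatch from the Python side – math.fmod happens to follow the same truncated, sign-of-dividend definition as Postgres’s mod(), so it stands in for the Postgres behavior here:

```python
import math

# Python's % is floored: the result takes the sign of the divisor.
print(-7 % 3)            # 2
print(7 % -3)            # -2

# Postgres's mod() is truncated: the result takes the sign of the dividend.
# math.fmod uses the same truncated definition, so it mimics Postgres here.
print(math.fmod(-7, 3))  # -1.0
print(math.fmod(7, -3))  # 1.0
```

Porting a Python solution that indexes with negative offsets straight into SQL silently breaks on exactly this difference, which is how the modulus function above earned its keep.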
It’s weird that casting from BIT(8) to BIT(4) preserves the highest bits:
aoc=# SELECT '11110000'::BIT(8)::BIT(4);
 bit
------
 1111
I’m sure there’s more.
All in all, it was a lot of fun. As of this writing (December 16), I’ve made it to Day 14. I’m 2 days behind and not sure I’ll make up the difference – Day 12 was a significant effort, and Days 15 and 16 look to be of similar size.
Either way, I was surprised to make it this far and I actually did learn quite a lot.
That’s all for now. Thanks for reading!