How to Query Hierarchical Data in Snowflake
--
Querying hierarchical data is well documented for Snowflake. However, here is an extensive in-depth coverage of a very typical and frequent hierarchy, using self-joins on the same table. It is the parent-child hierarchy for fathers and sons, Facebook friends, employees and their managers etc. Each entity has one single identifier and a friendlier display name. And the hierarchy is given by another optional ID to its unique and optional parent.
Typical Scenario
The CSV file below — that you can easily save and upload in a Snowflake table — presents a hierarchy on the EMPNO-MGR pair, with ENAME used as display name for both an employee and the manager:
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7698,BLAKE,MANAGER,7839,1981-05-01,2850,,30
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300.5,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7900,JAMES,CLERK,7698,1981-12-03,950,,30
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7782,CLARK,MANAGER,7839,1981-06-09,2450.5,,10
7934,MILLER,CLERK,7782,1982-01-23,1300,,10
7566,JONES,MANAGER,7839,1981-04-02,2975.8,,20
7788,SCOTT,ANALYST,7566,1982-12-09,3000,,20
7876,ADAMS,CLERK,7788,1983-01-12,1100,,20
7902,FORD,ANALYST,7566,1981-12-03,3000,,20
7369,SMITH,CLERK,7902,1980-12-17,800,,20
In very simple apps I wrote and described here in the past, I displayed this hierarchy in different ways. As a GraphViz chart, as in the post feature image. Or an animated collapsible tree with D3:
In text mode it may require some semi-structured data like JSON:
{
"name": "KING",
"children": [{
"name": "BLAKE",
"children": [
{ "name": "MARTIN" },
{ "name": "ALLEN" },
{ "name": "WARD" },
{ "name": "JAMES" },
{ "name": "TURNER" }]
},{
"name": "CLARK",
"children": [{ "name": "MILLER" }]
},{
"name": "JONES",
"children": [{
"name": "SCOTT",
"children": [{ "name": "ADAMS" }]
},{
"name": "FORD",
"children": [{ "name": "SMITH" }]}
]}]}
Or even better, a simple hierarchy in YAML, from which the “name” property could be eliminated as well:
name: KING
- name: BLAKE
- name: MARTIN
- name: ALLEN
- name: WARD
- name: JAMES
- name: TURNER
- name: CLARK
- name: MILLER
- name: JONES
- name: SCOTT
- name: ADAMS
- name: FORD
- name: SMITH
But how can we get a similar hierarchy directly from a SQL query in Snowflake — this is the question? Something like this:
On top we have the single root node, with no parent at all. Followed by one of its children, with their children below and so on. All this suggested by an indent. The second column contains the full hierarchy for each node, and can be eventually used to sort the tabular data to get this specific format.
(1) Using Joins
A first solution is by using joins. But the obvious simple query below will render key-value pairs, in tabular format. You don’t see the hierarchy at all:
select e.ename as name, e.job as title, m.ename as manager
from emp e left join emp m on e.mgr = m.empno
order by e.ename;
If you have a fixed and low number of already known levels, you may add a left self-join for each level and that’s it. We’ve seen four levels max in our data, which can be displayed in the desired format with the following query:
with cte as (
select e.ename as name,
coalesce(coalesce(coalesce(' -> ' || m3.ename, '')
|| ' -> ' || m2.ename, '')
|| ' -> ' || m1.ename, '')
|| ' -> ' || e.ename as path,
e.job as title
from emp e
left join emp m1 on e.mgr = m1.empno
left join emp m2 on m1.mgr = m2.empno
left join emp m3 on m2.mgr = m3.empno
)
select repeat(' ', array_size(split(path, '->'))-1)
|| name as name, path, title
from cte
order by path;
- The embedded calls to COALESCE will build the path for each node.
- Using SPLIT and ARRAY_SIZE, we count the number of arrows in each path, and we get the indent — with REPEAT — for each node name.
The query is practical, but it doesn’t look very nice, and we need to know in advance the depth of our tree.
(2) Using CONNECT BY
The START WITH and CONNECT BY clauses have been introduced by Oracle long ago, and Snowflake tries to emulate almost exactly their behavior. The following recursive query with CONNECT BY renders the data in our required hierarchical format:
select repeat(' ', level) || ename as name,
sys_connect_by_path(ename, ' -> ') as path,
job as title
from emp
start with job = 'PRESIDENT'
connect by prior empno = mgr
order by path;
- We need to sort our data by the path — given by the built-in SYS_CONNECT_BY_PATH function call — otherwise the hierarchy is rendered breadth-first, not depth-first.
- CONNECT BY requires a PRIOR keyword before the field whose value must be collected from the previous iteration.
- We have a built-in LEVEL value, which is great and helps with the name indent.
- START WITH is like a WHERE clause to select the top node.
(3) Using Recursive CTEs
CONNECT BY queries work only on self-joined tables, like the one we had. Recursive CTEs are more flexible and powerful. Here is our equivalent query — to render the exact same hierarchical data format — with a recursive CTE:
with recursive cte (level, name, title, empno, mgr, path) as (
select 1, ename, job, empno, mgr, ' -> ' || ename
from emp where mgr is null
union all
select m.level + 1,
repeat(' ', level) || e.ename, e.job,
e.empno, e.mgr, path || ' -> ' || e.ename
from emp e join cte m on e.mgr = m.empno
)
select name, path, title
from cte
order by path;
- The first SELECT query is only for the top node, with no MGR at all.
- The second SELECT query is for the other nodes, and joins with the previous query.
- There is no built-in LEVEL function, but we assign a level variable to 1 for the root node, then we increment it on each in-depth iteration.
- We concatenate the path with the previous node name, and we use it to sort the whole UNION.
Conclusions
There are at least three proper ways to directly query hierarchical data in Snowflake and render it in a useful way, with no help needed from another tool or programming code.
While using joins may only work for a low number of levels, and when you know in advance the depth of your tree, recursive CTEs and the legacy CONNECT BY clause should be used instead.
Always sort your data by the path of each node, for a proper display with indentation on the node names.