PDTs vs. dbt in 2022 (Revisited)
Three years have passed since Dylan Baker's classic post where he argued that data transformation should generally be done using dbt rather than Looker persistent derived tables (PDTs).
dbt is to PDTs what good sourdough with cultured butter is to white sandwich bread with margarine: broadly the same thing, but one somewhat more palatable than the other.
Since then, Looker has added lots of features and improvements to PDTs, so you might be wondering—where do PDTs belong in 2022?
tl;dr: We still think dbt is the better tool for most data transformation use cases, but that doesn't mean you should abandon PDTs entirely.
Let’s review some quotes from Dylan's original post with an updated understanding of PDT features to see why Looker’s improved PDTs still don’t measure up to dbt models for most use cases.
Performance problems and incremental models
Incredibly often, the issue that brings people from PDTs to dbt is the performance of their models. Most of the time, performance problems stem from one of two problems. The first is an inability to build PDTs incrementally.
Good news! Looker now supports incremental PDTs. This addresses a big performance downside for PDTs, but head-to-head, dbt comes out on top with superior performance tuning options. For example, dbt allows you to specify a subset of columns for rows that are incrementally updated, potentially giving you further performance improvements. You can also customize how you want your data to be updated in cases where your table doesn’t have a stable unique key by specifying an incremental_strategy.
Development vs. production environments are also handled in a first class way in dbt, making your iterative cycles much faster when building incremental models. You can safely run models locally that get written to a dedicated schema separate from production. You can specify a --full-refresh flag arbitrarily on local runs to compare incremental runtimes against full build runtimes. In Looker, there is indeed some isolation between development and production environments, but we think the overall workflow is more complicated than what you get in dbt.
Testing your models
The ID column was supposed to be your unique primary key and now it isn't. There is currently no automated mechanism in Looker to test the output of that query.
Looker now supports data tests. However, this doesn’t replace the need for dbt tests. We think it’s best to use both testing frameworks together: dbt for schema tests and Looker for business logic tests.
Though you could define a primary key uniqueness test in Looker, testing for primary key uniqueness is easier with dbt. It’s more concise to write this test in dbt (just two lines in a schema file), and you’ll benefit from identifying these errors earlier in the development cycle, preventing issues from affecting non-Looker consumers of that data.
So where should you use data tests? Let’s say you define a new measure in Looker called revenue. You can use a Looker data test to verify it against a known value, say, last year’s revenue, which should remain fixed over time. This test will give you confidence that new logical changes don't affect historical reporting. You can even run your Looker tests in CI or on a schedule with Spectacles!
Making data usage-agnostic
Every time you make a change to the SQL in a PDT, Looker will rename the underlying table in your database.
Looker now supports stable PDT names for querying PDTs from outside of Looker. To use this feature, you add publish_as_db_view: yes to your derived table definition.
However, these table names, while stable, aren't exactly human readable. According to Looker's documentation, Looker uses the following format to generate the table name:
[scratch schema name].[connection registration key]_[model name]_[view name]
For example, if you have a PDT in the model faa and a view named e_flights_pdt, you'll only need the scratch schema name and the connection registration key. You can find both in the SQL tab in the Data section of any query against the PDT. Look in the CREATE TABLE or FROM clause of the SQL statement.
In Looker's example, the table name is rendered as something like tmp.LR$LBC5Q_faa_e_flights_pdt. This is a convoluted name and might be confusing for a data warehouse user, as it looks like a machine-generated table. You could create a new database view on top of that table with a more obvious name, but modeling in dbt will natively allow you to name your tables in a human-readable way.
DRY code
SQL is a notoriously WET (write every time) language. It's veritably soaking with repetition.
dbt’s macros framework is a great way to abstract SQL into re-usable blocks of code, and Looker still doesn’t have a comparable analogue. Even if you’re scared of writing your own dbt macros, there’s a good set of “public macros” that you can import into your project via dbt packages. A good example is the dbt utils pivot macro; I’ve used that function in practically every dbt project I’ve worked on.
When PDTs are best
There aren't many situations where Persistent Derived Tables are necessary over dbt models. However, there are many reasons why non-persisted, vanilla Derived Tables (DTs) are necessary. DTs practically function in a very similar way to PDTs, they just don't get materialised in the database.
This is still very true. Ephemeral derived tables are still necessary when you need to handle user input or re-aggregate data on a different grain for use in another explore.
One new use case for PDTs is aggregate awareness. If you have large tables and find yourself needing to create different aggregations for better query performance, but don’t want to build out a dbt model for each one, then aggregate awareness may be for you.
The code example below (taken from Looker’s docs) illustrates how enabling this feature is as simple as adding an aggregate_table argument to an explore:
In this example, if a user selects the same fields as in the aggregate_table parameter, Looker will query the pre-aggregated PDT instead of the raw table. Depending on how large your tables are, this could save you a lot of query processing time and is much more concise than building a dbt model for every possible aggregation.
Another good use case for PDTs is as a prototyping tool for analysts. They may have an easier time getting started with PDTs since the feature is already folded into the Looker platform, and they won't have to worry about setting up a new data connection or scheduling jobs. At larger organizations with more specialized roles, analysts may not even have access to write dbt models. In these cases, PDTs can serve as a quick fix for analyst modeling needs.
How to integrate dbt with Looker
You would be adding another tool to your stack. That adds complexity and extra management responsibilities. You would be adding another step to your development workflow. That requires a bit more work from analysts and engineers.
When Dylan wrote the original post, dbt wasn't yet fully embraced by the analytics community. Today, it’s a different story; dbt is the standard for data transformation, and there aren’t many reasons not to use dbt.
Teams face a new challenge: how do you integrate dbt effectively with downstream tools like Looker?
Imagine you want to rename the id column in your opportunities table to opportunity_id. How do you know what downstream objects in Looker will be affected by this change? Most teams do this manually by inspecting a few key dashboards and hoping for the best. But as the team creates more Looker views on top of dbt tables, the risk of making breaking dbt changes grows. Eventually, it's easy to inadvertently introduce new problems as you fix old ones.
The good news is the tooling has greatly improved. You can use the Spectacles dbt integration to test for breaking Looker changes on dbt pull requests. Spectacles will alert you to any dbt changes that will cause errors in Looker. In the example above, Spectacles would tell you in your Pull Request exactly which dimensions or measures would be impacted by renaming a column, allowing you to make an educated, integrated plan for rolling out your changes.
The bottom line
Though Looker has improved PDTs quite a bit since 2019, the benefits of dbt models outweigh the downsides. Robust incremental configuration, simple schema tests, user-friendly table names, and macros mean that dbt is still the better data transformation tool. Use Spectacles to de-risk the integration point and find potential errors, and save PDTs for aggregate awareness and analyst prototyping.