Guide to get you jump started on DBT for a complete newbie
Is your organization about to use DBT (data build tool) or are you about to start at a new company that uses it? This is a quick starter guide that will help you start your journey with DBT. It will be a great foundation when you go to the official doc to learn more in-depth. For the official introduction & definitions or for more comprehensive learning, please go to the related dbt sites that are linked at the end of this blog in the “other resources” section. \
What is DBT (Data Build Tool)?
Literally the first paragraph in the dbt doc shows what dbt is, so you can read the official wording there. In my mind, dbt is a tool that transforms raw data in your database warehouse using git, jinja (python – template language), and sql.
Why DBT?
One may ask, if it’s using sql, then why would I want to pay money and use dbt when I can just create tables and views? To that, I have my own answers for you.
You could use the free version of dbt, to start. But more than that,
Dbt will create dependencies automatically which will ensure all the tables and views get updated in the order that you need them to. For instance, if your table A is dependent on your table B, you need table B to be updated before table A gets updated. DBT handles that for you.
DBT will create a doc for you which includes DAG (directed acyclic graph) — it shows the dependencies we talked about in #2, in a diagram, showing the picture of the entire tables and views and how they relate to each other. That means, whenever you are altering a table, you can easily find out which tables/views will be affected by the change!
Git integration — it allows you to use version control (if your change needs to be reverted, you can easily do so), experiment in your branch without messing with master branch, implement mandatory code review before merging, etc.
Other cool functions like creating macros (functions in programming languages) , packages (essentially, libraries in programming languages), and tests (testing column values, etc. for you)
What is the general, high-level workflow using DBT?
If you join a company that uses a paid version of dbt (which was my case), then you will get access to their dbt cloud. When you sign in, it will likely have everything set up for you, such as database warehouse connection and your own schema, usually named: dbt_yourFirstName. Why is that, if you don’t have such schema in the database?
It’s because of git. So, by having your own schema, you can experiment in a branch that is connected to your schema only and see the result before you merge into master, which is the dbt production schema — the schema that everyone uses as the source of truth tables, meaning the production stage tables and views.
So simple workflow would be like this:
Go to a branch (create one if needed)
make changes you want to make or create new files, etc.
run dbt and then check your schema (dbt_yourFirstName) for the result of the changes you made
if everything looks good, commit
create a pull request
if everything looks good, the admin will approve your changes to be merged into master
The next time the dbt production schema runs, your changes will be implemented
****Note: depending on the database you are using, you may have to create a dbt_yourFirstName schema within the DB before running dbt. For instance, redshift will not create such a schema automatically when you run dbt, so you have to create it within redshift before you run it. If you are not sure what your own schema name is, go to your profile > Credentials > Project you are on (ex. Analytics). You should see the “schema” there.
Wait, what do you mean by “run dbt”?
When you run dbt, it will literally run all its files (unless specified) and put the result back into the database. This is why you are using your own schema by default to test your codes before you merge it into the master (production) schema. When you run dbt, connected to your own schema, it will put back the result into your schema dbt_yourFirstName and you can see if the result is what you desired.
You can specify what to run, of course, since running the entire code for all the tables and views modified using dbt does not always make sense to test one code. To do so, you will generally follow this format:
Dbt run -m folderNameWithinModels.fileName
ex) dbt run -m finance.contribution_actuals
For more information on “run”, you can visit dbt doc command reference For more variation of dbt run, you can visit dbt doc syntax overview
How do I create a new table or view?
Branch & Models & .sql
First and foremost, make sure you are on a branch, not master. Master is read only for a good reason. Now, you will see the folder named “models”. This is where the majority of your files live. If your organization has been using dbt and has been categorizing the files, then within it you will probably see multiple folders. If not, you may want to implement that as the number of files will keep growing and you will want to categorize them for better experience.
Whatever the case, go to the folder you want to create a file in (either “models” or any other folder within “models” folder) and hover – you will see three dots appear on the right. Click it and you can choose a new file or new folder. Let’s do a new file. It will prompt you to create a name – name it, but whatever you do, make sure it ends with “.sql” extension. If not, it will not know that it’s sql.
Jinja REF {{ref('table_name')}}
Then write your sql! The only thing that you need to do differently here in dbt, is to use jinja. Don’t fret — the only thing you need to know at this point is to use “ref”. In your sql, are you querying from a table or a view that’s already modified within dbt? Meaning, you can see the table/view in your models already? Then you use {{ref(‘table_name’)}} and put the table/view name. For instance, {{ref(‘contribution_actuals’)}} —– notice that we do not specify the schema here. Why do you have to use this, you ask? This is how DBT determines dependencies among tables and views to create 1. The right order of update based on the dependency and 2. DAG in dbt doc. More on the doc later!
If you are querying from a table/view that is not modified already using dbt, then you can just use your good ol’ sql FROM “schema.table_name” for now. The better way to do this though, is to use jinja “source”.
Jinja source???
Yes, yes – jinja source. It simply is to designate the source. To do so, you go to “sources.yml” file and add the schema and table information like this:
data:image/s3,"s3://crabby-images/f8866/f886663efe9cf232f20d5114fd4b7a071acca47a" alt=""
Then when you want to refer to these tables, you use {{source(‘schema name’, ‘table_name’)}}, like this: {{ source(‘finance_department’,’contribution_actuals’)}}
why would I go through all this when I can use "FROM schema.table_name" and it will be less work?
That’s a great point and I have a reason. When you use dbt doc (we will cover how to view doc in a bit) and go to DAG (the diagram that shows the dependencies of the tables/views), those regular from the statement table will not appear. However, if you added source in sources.yml file and referred to it using jinja source {{source()}} then the source table will appear in the doc. If the table you use will only be referred once, then it does not matter; but if the source table gets referred multiple times for different tables and views, this is how you can add the table in the DAG so you can find out which tables and views query from this particular table. This goes the same with {{ref(‘’)}} we mentioned above. Your code will technically run without using the jinja ref, but if you don’t use ref, then 1. It will not show in DAG and more importantly 2. The dependency will not work correctly to make sure your referring table gets updated before your current code.
Now tell me about the dbt doc already.
DBT doc is great because it will give you comments on tables/views if any, shows the list of columns and data types, and more importantly, DAG that shows you which tables/views are dependent on which tables/views.
So how do you access this awesome doc? You simply run the following on the command line (at the bottom of the browser) in your dbt cloud:
dbt docs generate
When you do, it will generate docs based on the current status. When it’s done “view docs” link will appear on the top left corner — screenshot:
data:image/s3,"s3://crabby-images/adb64/adb64e06d9415826034d7b2e48807fee91e62a11" alt=""
When you click that, it will get you to the doc. It will have a list of tables/views which you can click and learn about. If you want to see DAG, click the icon on the bottom right (screenshot below):
data:image/s3,"s3://crabby-images/1029f/1029f1ed90151264b6cf7c46347d5f5d95f98f17" alt=""
Final Thoughts
There is so much more to explore on DBT and to explain, like yml files, packages, macros, test, etc. But because it’s a DBT (Data Build Tool) Quick Starter Guide for absolute beginners to start using dbt, I will stop here. This guide should be enough to get you started and to make you comfortable testing things out on dbt. As I mentioned in the beginning, dbt is an interesting tool combining git, jinja (python), and sql — the more you explore, the more you will appreciate its values. As promised, I will end the blog with links to more resources.
Other resources
Dbt doc: https://docs.getdbt.com/docs/introduction Dbt tutorials: https://docs.getdbt.com/tutorial/setting-up Dbt learn (I recommend taking “Fundamental” course if you have time): https://learn.getdbt.com/ Dbt hub (for packages): https://hub.getdbt.com/
Comentários