Git Your SQL Together (with a Query Library)
Focusing on these points has led to my continuous adoption of a query library -- a git repository for saving and sharing commonly (and uncommonly) used queries, all while tracking any changes made to these queries over time.
If I could teach SQL to analysts who plan to work in industry data science, I'd start by sharing a few SQL Truths I've learned, and why I recommend tracking SQL queries in git. Here goes:
You will *always* need that query again
Queries are living artifacts that change over time
If it’s useful to you, it’s useful to others (and vice versa)
Focusing on these points has led to my continuous adoption of a query library -- a git repository for saving and sharing commonly (and uncommonly) used queries, all while tracking any changes made to these queries over time.
Anyone who is following my personal journey might know that I'm in the midst of building data science infrastructure at a start-up. In addition to setting up a data dictionary, setting up a query library is one of the first things I did to start documenting the institutional knowledge we have about our data.
Let’s talk more about these SQL Truths and why they've led me to git my SQL together with query libraries (and how to build one yourself!).
SQL Truth #1: You will always need that query again.
Have you ever written a query for a “one-off” analysis, deleted it (maybe even on purpose! *shudders*), and lost all memory of how to create said query -- just before being asked to re-run or tweak that analysis? Happens to the best of us.
Usually, this is a job for reproducibility. But, even if we take reproducibility seriously (and we do!), it’s easy for queries from “one-off” analyses to slip through the cracks because they can live outside of our normal project workflow. After all, if you’re only doing something once, there’s no need for it to be reproducible, right?
The sooner we can accept that it’s never just once, the sooner we can hit CTRL+S, put them in a query library, and move on to more interesting problems.
SQL Truth #2: Queries are living artifacts that change over time
Here’s a short list of reasons why your queries might change over time:
You’ve become more familiar with database(s)
You’ve gained a deeper understanding of your data
You’ve had to add filters and caveats over time
You’ve changed the way you calculate a metric
You’re answering slightly different questions
You’ve started collecting new data
You’ve found discrepancies or issues with tables or fields
Your business model has changed
You’ve gotten better at writing SQL
This list isn’t all-inclusive, but hopefully it gives you an idea of how and why your queries might change. This isn’t inherently good or bad; it just means that you’ll want to capture the way that you’re doing things both as documentation for the future and to ensure that changes or updates are applied as needed to keep things up-to-date.
SQL Truth #3: If it’s useful to you, it’s useful to others
Have you ever asked a coworker where some data lives and had them respond with a beautiful, hand-curated SQL query? It’s the best. You get to focus more on the fun analysis, or starting new projects, and as a bonus, your company isn’t paying two people to repeat the same process.
SQL queries contain built up domain knowledge -- the way you filter, join, and aggregate tables reflects knowledge you’ve collected about how the data is collected, stored, and should be used for practical application. Sharing queries with others is a nice thing to do, and a good way to spread institutional knowledge about your data (as is building a good data dictionary!).
When you start to compile and compare SQL queries, you might find discrepancies in the way that different people pull the same data. (This happened on a team I was on -- if you asked three of us to pull all live clients, we’d all do it with slightly different caveats that reflected our individual understanding of the data. Not a bad thing, but no wonder our numbers didn’t always match!) Creating and reviewing a query library is also good way to get everyone on the same page.
Building your own query library
Let's focus on how you can build your own query library in a few simple steps:
After writing a query that you’ll want to use again*, save it(!!) as a .sql file with a descriptive name. I tend to name my files based on what the query is accomplishing as output, and since naming things is cheap (and I never have to hand-type the names of these files), I use very descriptive (long) names, like activity_on_mobile_since_june2018_by_client.sql.
*This was a test -- every query is a query you might want to use again!Create a git repository in a shared location and upload your queries to it. Encourage your team to do the same, and discuss how to best organize queries with them. This is also a good time for a mini blameless post-mortem of existing queries (especially if they come from different analysts) by looking for discrepancies and using them as an opportunity to level-up everyone's understanding of the data.
Whenever you create a new query, take a few minutes to clean it up so that others can understand how and when they might want to use it, and upload it to the query repository. Whenever you update an existing query locally, make sure to commit those changes to your query library (ideally along with an explanation of why you made them).
Final Notes
The idea of the query library was introduced to me by Tim, who implemented one for our product analytics team while we were both at Web.com. (You might remember Tim from this post on the career "tour of duty" concept.) It was helpful for our team there, and I've happily implemented one on every team since.
I hope this post is helpful, and I'd love to hear from you if you have a data infrastructure tool or idea to share (or a question!); please write a comment below or ping me on Twitter. Thanks for reading!