Git Your SQL Together (with a Query Library)

If I could teach SQL to analysts who plan to work in industry data science, I’d start by sharing a few SQL Truths I’ve learned, and why I recommend tracking SQL queries in git. Here goes:

  1. You will *always* need that query again
  2. Queries are living artifacts that change over time
  3. If it’s useful to you, it’s useful to others (and vice versa)

Focusing on these points has led to my continuous adoption of a query library — a git repository for saving and sharing commonly (and uncommonly) used queries, all while tracking any changes made to these queries over time. 

Anyone who is following my personal journey might know that I’m in the midst of building data science infrastructure at a start-up. In addition to setting up a data dictionary, setting up a query library is one of the first things I did to start documenting the institutional knowledge we have about our data.

Let’s talk more about these SQL Truths and why they’ve led me to git my SQL together with query libraries (and how to build one yourself!). 

SQL Truth #1: You will always need that query again.

Have you ever written a query for a “one-off” analysis, deleted it (maybe even on purpose! *shudders*), and lost all memory of how to create said query — just before being asked to re-run or tweak that analysis? Happens to the best of us.

Screen Shot 2018-11-27 at 6.15.00 PM

Usually, this is a job for reproducibility. But, even if we take reproducibility seriously (and we do!), it’s easy for queries from “one-off” analyses to slip through the cracks because they can live outside of our normal project workflow. After all, if you’re only doing something once, there’s no need for it to be reproducible, right?

The sooner we can accept that it’s never just once, the sooner we can hit CTRL+S, put them in a query library, and move on to more interesting problems.

SQL Truth #2: Queries are living artifacts that change over time

Here’s a short list of reasons why your queries might change over time:

  • You’ve become more familiar with database(s)
  • You’ve gained a deeper understanding of your data
  • You’ve had to add filters and caveats over time
  • You’ve changed the way you calculate a metric
  • You’re answering slightly different questions
  • You’ve started collecting new data
  • You’ve found discrepancies or issues with tables or fields
  • Your business model has changed
  • You’ve gotten better at writing SQL

This list isn’t all-inclusive, but hopefully it gives you an idea of how and why your queries might change. This isn’t inherently good or bad; it just means that you’ll want to capture the way that you’re doing things both as documentation for the future and to ensure that changes or updates are applied as needed to keep things up-to-date.

SQL Truth #3: If it’s useful to you, it’s useful to others

Have you ever asked a coworker where some data lives and had them respond with a beautiful, hand-curated SQL query? It’s the best. You get to focus more on the fun analysis, or starting new projects, and as a bonus, your company isn’t paying two people to repeat the same process.

SQL queries contain built up domain knowledge — the way you filter, join, and aggregate tables reflects knowledge you’ve collected about how the data is collected, stored, and should be used for practical application. Sharing queries with others is a nice thing to do, and a good way to spread institutional knowledge about your data (as is building a good data dictionary!).

When you start to compile and compare SQL queries, you might find discrepancies in the way that different people pull the same data. (This happened on a team I was on — if you asked three of us to pull all live clients, we’d all do it with slightly different caveats that reflected our individual understanding of the data. Not a bad thing, but no wonder our numbers didn’t always match!) Creating and reviewing a query library is also good way to get everyone on the same page.

Building your own query library

Let’s focus on how you can build your own query library in a few simple steps:

  1. After writing a query that you’ll want to use again*, save it(!!) as a .sql file with a descriptive name. 

    I tend to name my files based on what the query is accomplishing as output, and since naming things is cheap (and I never have to hand-type the names of these files), I use very descriptive (long) names, like activity_on_mobile_since_june2018_by_client.sql.*This was a test — every query is a query you might want to use again!

  2. Create a git repository in a shared location and upload your queries to it. Encourage your team to do the same, and discuss how to best organize queries with them. 

    This is also a good time for a mini blameless post-mortem of existing queries (especially if they come from different analysts) by looking for discrepancies and using them as an opportunity to level-up everyone’s understanding of the data.

  3. Whenever you create a new query, take a few minutes to clean it up so that others can understand how and when they might want to use it, and upload it to the query repository. 

    Whenever you update an existing query locally, make sure to commit those changes to your query library (ideally along with an explanation of why you made them).

Final Notes

The idea of the query library was introduced to me by Tim, who implemented one for our product analytics team while we were both at Web.com. (You might remember Tim from this post on the career “tour of duty” concept.) It was helpful for our team there, and I’ve happily implemented one on every team since.

I hope this post is helpful, and I’d love to hear from you if you have a data infrastructure tool or idea to share (or a question!); please write a comment below or ping me on Twitter. Thanks for reading!

 

2 thoughts on “Git Your SQL Together (with a Query Library)”

  1. Thanks a lot for your post! we have just started to collect sql queries in a dedicated repository as well at my company. Although we have git repositories for each research project, we see the value in a dedicated SQL repository for two reason: firstly we can share queries not only among data scientists but support, etc as well. Secondly, many queries are used in multiple research projects in the same or very similar format.

    I would love to hear your thoughts about the following:

    – we already have sql queries in dedicated repositories for specific research projects along with R scripts where we actually use them. How would you decrease double work / ensure consistency between individual repositories and the central query repository?
    – a query might be used with slight modifications in different places. Would you store all versions for specific use cases or only the core part of the query?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s