One Year of R-Ladies Austin 🎉

Today marks one year of R-Ladies Austin! The Austin chapter started when Victoria Valencia and I emailed R-Ladies global (on the same day!) to ask about starting a local chapter. It was meant to be — the ladies from R-Ladies global introduced us, and the rest is history.

For anyone who hasn’t heard of R-Ladies, we are a global organization whose mission is to promote gender diversity in the R community by encouraging, inspiring, and empowering underrepresented minorities. We are doing this by building a collaborative global network of R leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide. There are over 60 R-Ladies chapters around the world and we continue to grow!

Here in Austin, it’s been a busy year. So far, we’ve hosted 16 meetups — including seven workshops, two book club meetings, two rounds of lightning talks, a handful of happy hours, a movie night, and a visit from NASA.


To celebrate our first R-Ladies anniversary, I thought it would be fun to answer some questions with Victoria around our journey so far:

What has been the best part of working with R-Ladies?

Victoria: The best part has been connecting with women in our community that share similar passions and interest in data! It has been so fun. Also, the R Ladies hex stickers are pretty awesome. 🙂

Caitlin: It’s been great to be a part of such a supportive community and to meet so many brilliant women, both here in Austin and in other cities. Since joining R-Ladies, I’ve built a great network, learned cool things, and had a lot of fun along the way.

Have you had a favorite meetup so far?

Victoria: My favorite meetup by far was our book club for Dear Data by Giorgia Lupi and Stefanie Posavec. We started by discussing the book and the types of visualizations and data Giorgia and Stefanie shared with each other. Some were funny and some were sad but all of them were inspiring! We followed by creating our own visualizations in a postcard format of the beer list at Thunderbird Coffee. Who knew a beer list could be so fun to visualize and that each of us would think to do it in such different ways! It was a blast.

Caitlin: I love the book club meetups too — it’s a great space because we can do anything from have deep discussions on the ethical impacts of algorithms in society (I’m looking at you, Weapons of Math Destruction) to getting really creative and using colored pencils to dream up artistic ways of visualizing data. I also loved having David Meza come down from NASA in Houston to talk about knowledge architecture. It would be an understatement to say that he’s been supportive since day one, because he actually reached out to us long before our first meeting. (I guess “supportive since day -75” doesn’t have quite the same ring to it, but it’s true.)

What’s the biggest thing you’ve learned after one year of organizing R-Ladies?

Victoria: That managing a meetup is a fair amount of work, but certainly worth the effort! I have also learned that the R Ladies community is strong and close knit and super supportive! It has been great connecting and learning from them.

Caitlin: I agree with Victoria’s take — managing is a lot of work but also *very* worth it. I’ve learned a lot about building community through collaboration. Working with other local meetups has helped us to expand our reach and provide more opportunities for the women in our group. It’s also been very cool to learn more about the tech community in Austin. We’ve been fortunate to receive lots of support from local companies and other tech groups, and it’s been nice to get more plugged in that way while building a distinct community that adds something new to the mix.

How has R-Ladies helped you (personally or professionally)?

Victoria: R-Ladies has helped me by giving me time to learn about cool R stuff I did not know before! It has helped me learn more efficient ways of coding by going through all of the chapters of R for Data Science, taught me how to relax with colored pencils, data, and beer, and opened my mind to different perspectives from fellow R-Ladies about the continually evolving and expanding world of data that surrounds us.

Caitlin: I can’t say enough good things about the R-Ladies community. The individual chapters help to build local communities and strong networks of highly-skilled women, and the global chapter works hard to promote the work of R-Ladies to the larger global community, including people who might not see that work otherwise. Especially since many women are one of only a few women on their team (or the only woman on their team), it’s great to have a network that can relate and provide feedback and advice (on all sorts of things) when you need it. On a personal level, I’ve built relationships with amazing women (both in real life and virtually) through R-Ladies, and it’s opened up some opportunities that would have taken a lot longer to find on my own.


The next 12 months

We’ve grown a lot this first year (we’re over 275-strong!), and we’re hoping to grow even more in the next 12 months. If you’re in Austin and haven’t made it out to a meetup yet, we’d love to meet you! We’re beginner friendly, positive, and dedicated to promoting gender diversity in the R community (and tech in Austin more generally). And even if you’re just interested in data and maybe learning more about R, we want you to join us as well!

If you’re not in Austin, but want to support R-Ladies, I’d encourage you to check out the R-Ladies directory the next time you’re looking for speakers or for local women to reach out to — there are lots of women out there doing amazing things, and R-Ladies is making it easier and easier to find and connect with them.

The two biggest things that we’ll need in the next 12 months are speakers and space. If you use R and have learned a cool thing, discovered a neat package, done an interesting analysis, or have anything else you want to share, we’d love to hear from you. And if you have space available, we’re always looking for new spaces to host the various types of meetups we put on. Please get in touch with us; we’d love to hear from you!

Thanks for a fantastic year, and looking forward to the next 12 months!

Caitlin and Victoria

Only 60% Sure

I have a fantastic coworker who I’ve been pair programming with a lot lately, and he does one thing that I wish everyone did:

He has a habit of stating something (usually an answer to a question I’ve asked), and then after a beat, saying something like, “I said that very confidently, but I’m only about 60% sure.” This is usually followed by a suggestion to firm up his answer, like “you should ask X person”, “you should try it and see what you think”, or “you should maybe research that more”.


Here’s why I love that follow-up response so much:

  1. This answer builds my trust in him (because I know he’ll admit if he doesn’t know something), and builds my confidence in his answers overall. On the flip side, if he makes a statement and doesn’t qualify it with some level of uncertainty, I trust it as-is and don’t feel like I need to research it or double-check afterwards.
  2. This models great behavior around questions for our org as a whole by making “I’m not sure” an acceptable way to answer a question. By including suggestions for ways I can get answers, he’s setting me up to find the best possible answer, which is better and more efficient than getting an incomplete answer from someone who might not be the best person to cover a given subject area.
  3. This encourages more questions. I’m not afraid to ask tough or opinion-based questions because I know I’ll get a thoughtful and balanced answer. Asking more questions has led to a deeper understanding of the technologies and products I’m working with — a win-win for him, for me, and for the company.

Since I’ve heard him say this, I’ve started incorporating it into my own conversations, both professional and personal. It’s a small thing, but it makes a big difference in the way we interact, and I would love to see more people adopt this habit.

Imposter Syndrome in Data Science

Lately I’ve been hearing and reading lots about imposter syndrome, and I wanted to share a few thoughts on why imposter syndrome is so prevalent in data science, how I deal with it personally, and ways we can encourage people who are feeling the impact.

Why is imposter syndrome so prevalent in data science?

Data science has a few characteristics which make it a fertile ground for imposter syndrome:

  • Data science is a new field.

    DJ Patil and Jeff Hammerbacher were the first people titled “data scientists” only about a decade ago (around 2008). Since then, as we’ve all been figuring out what data science *is*, differing definitions of “data scientist” have led to some confusion around what a data scientist should be (or know). Also, because “data science” wasn’t taught in colleges (as such) before then, the vast majority of data scientists do not have a diploma that says “data science”. So, most data scientists come from other fields.

  • Data science is a combination of other fields.

    Depending on who you ask, a data scientist is some combination of an analyst / statistician / engineer / machine learning pro / visualizer / database specialist / business expert. Each of these is a deep position in its own right, and it’s perfectly reasonable to expect that a person who comes to data science from any one of these fields will have significant gaps when it comes to the other fields on the list.

  • Data science is constantly expanding with new technologies.

    As computer memory becomes cheaper, open-source becomes more popular, and more people become interested in learning and contributing to data science and data-science-adjacent fields, the technology surrounding data science grows at a very healthy rate. This is fantastic for the community and for efficiency, but leads to lots of new technologies for data scientists to learn and a culture where there is pressure to stay “on top” of the field.

So, we have people from a variety of backgrounds coming to a new field with many applications whose boundaries aren’t clearly defined (thus causing inevitable gaps in their knowledge of that field as a whole), and where technology is changing faster than a single person can keep up with. That is the plight of a data scientist in 2018, and why so many people feel the effects of imposter syndrome.

My Secret for Dealing with Imposter Syndrome

Every single data scientist that I know (and you know) is learning on the job. It might be small stuff (like cool tools or keyboard shortcuts) or bigger stuff (like new algorithms or programming languages), but we’re all learning as we go, and I think it’s crucial that we acknowledge that. For me, it’s simultaneously really exciting to be in a field where everyone is learning, and also kind of intimidating (because what if the stuff I’m learning is stuff that everyone else already knows?), and that intimidation is a form of imposter syndrome.

The way that I’ve dealt with imposter syndrome is this: I’ve accepted that I will never be able to learn everything there is to know in data science — I will never know every algorithm, every technology, every cool package, or even every language — and that’s okay. The great thing about being in such a diverse field is that nobody will know all of these things (and that’s okay too!).

I also know that I know things that others don’t. I’ve built predictive models for dozens of colleges and non-profits, have experience on what it takes to create and analyze successful (and unsuccessful!) A/B tests, and am currently learning how to do machine learning models in production. These are not skills that everyone has — there are people who know more about computer science than I do, or machine learning, or Macbook shortcuts — and that’s okay. Diversity is a good thing, and I can learn from those people. There’s a great Venn diagram which illustrates the relationship between what you know and what other people know, and how they overlap. What you know is rarely a subset of what other people know; your knowledge overlaps with others and also sets you apart from others.

Community-wide Techniques for Reducing Imposter Syndrome

If we can agree that all data scientists are learning on the job, I think the best things that we can do for reducing imposter syndrome in the larger data science community are to be open in acknowledging it and to work towards fostering a healthy learning environment.

  • Get comfortable with “I don’t know”

    I love when people say “I don’t know”. It takes courage to admit when you don’t know something (especially in public) and I have a great deal of respect for people who do this. One way that we can make people more comfortable with not knowing things is to adopt good social rules (like not feigning surprise when someone doesn’t know a thing, and embracing them as one of today’s lucky 10,000 instead).

  • Don’t “fake it ‘til you make it”

    Sure, it’s good to be confident, but the actual definition of an imposter is someone who deceives, and I think we can do better than “faking it” on our way to becoming better data scientists. “Faking it” is stressful, and if you get caught in a lie, can potentially cause long-term damage and loss of trust.

  • Encourage questions

    The benefit to asking questions is two-fold:
    1) You gain knowledge through conversation around questions
    2) Asking questions publicly encourages others to ask questions too

    Asking questions is exactly the kind of thing data scientists should be doing, and we should work to encourage it.

  • Share what you’re learning

    When I see others share what they’re learning about, it helps me put my own learning in perspective — and whether I know much about the topic or not, it’s encouraging to see other people (especially more experienced people) talk about things that are new to them. I’ve started a personal initiative to track the things I’m learning each week on Twitter using the hashtag #DSlearnings. Feel free to have a look at the archives (I’d love to chat if you’re learning similar things!), and to add your own learnings to the hashtag.

A little bit of transparency goes a long way towards staving off imposter syndrome. We can embrace both being knowledgeable and not knowing things — and do so in public.

I’d love to hear ways that others deal with imposter syndrome, and about things you’re learning (feel free to use #DSlearnings or make your own hashtag!) along the way.

PS: Thank you to @jennybryan and @dataandme for the Venn diagram!

Reading List: 2017 Edition (and some thoughts on resolutions)

My not-so-secret secret? I was an English major.  I also majored in Stats and love math and data science, but I have always and forever loved reading. In an effort to read more often, each year I set a goal* of reading 25 books. So, in the spirit of Susan Fowler, and with the hope of getting good book suggestions, I want to share my 2017 reading list (with brief commentary). My top five recommended reads are designated with **.

1. The Great Gatsby, by F. Scott Fitzgerald
I don’t have much to say about The Great Gatsby that hasn’t been said already, but I can say that it was much more interesting than I remember it being in high school — and that I really, really want to go to a Gatsby party.

2. Men Explain Things to Me, by Rebecca Solnit
The word “mansplaining” was coined in reaction to Rebecca Solnit’s titular essay “Men Explain Things To Me”, which begins with a situation women might find vaguely familiar: after Solnit mentions the topic of her most recent book, a guy at a party asks if she’s heard of another *very important* book on the same topic, and it takes her friend repeating “That’s her book” three or four times for it to sink in and leave the man speechless. The essay is short, and definitely worth a read, and the book does a good job of adding color to mansplaining and other gendered issues through added data and commentary, including a thoughtful, well-researched take on domestic violence that I hadn’t heard before.

3. Lean In, by Sheryl Sandberg**
I realize that I’m several years late to the game, but after finally reading Lean In, I would recommend it at the level of “required reading” for women navigating corporate America (and the tech world in particular). Sheryl Sandberg provides solid examples (and data!) on gendered differences in salary negotiations, likability, speaking up, explaining success, applying for level-up positions, getting promotions, ambition, and so much more. Reading this book inspired me to speak up (even about the little things!), and I’ve saved many of the factoids for future reference as I’m navigating my own career. Seriously, ladies, read this book if you haven’t already!

4. Hillbilly Elegy, by J.D. Vance
Hillbilly Elegy interweaves two stories: J.D. Vance’s personal story of “making it out” of a Glass-Castle-esque “hillbilly” upbringing by joining the Marines, going to college, and eventually law school at Yale, and a more general look at the problems confronting the modern white working class (in Appalachia and similar regions). The most interesting piece, for me, was a specific example of a town whose blue-collar factory jobs eventually dried up, and the impact this has on the town (focusing on home prices, lack of mobility, and personal pride, to name a few). The book is eye-opening, and I liked that it focused on facts as well as personal experience to paint a picture of the modern-day hillbilly’s plight.

5. Ender’s Game, by Orson Scott Card
Ender’s Game makes consistent appearances on reddit must-read lists, so I finally gave it a whirl and ended up liking it. This sci-fi novel focuses on a future where Earth is attacked by aliens and specially selected children are given military tactical training through a series of battle simulations (“games”) to fight aliens and protect humankind (all in zero gravity!). The book follows Ender, one of the chosen children, from normal childhood life through battle school, with a twist ending to boot.

6. The Girl with the Lower Back Tattoo, by Amy Schumer
While there’s plenty I love about Amy Schumer’s comedy, her autobiography was mostly repetition of stories I’d heard from her standup / interviews / etc. I’d skip it and watch her skits instead.

7. Brave New World, by Aldous Huxley
Another high school assignment, another worthy re-read. I went through a heavy dystopian novel phase in 2016, and this was the tail end. From gene therapy to pharmaceuticals to race relations to hookup culture, I think Brave New World is still, 86 years later, an incredibly relevant (and surprisingly current!) take on “modern” issues. (It would also make a great episode of Black Mirror.)

8. The Rational Optimist, by Matt Ridley
The central argument of this book is that things are better now than they ever have been (and they’re continuing to get better) — mostly due to trade and specialization among tribes of humans. I read this not long after Sapiens, which definitely colored my thinking (they’re actually “You Might Also Like…” pairs on Amazon). The Rational Optimist doesn’t have the breadth of Sapiens, but it covers the history of trade and specialization in much greater depth, and provides interesting historically-informed commentary on modern-day hot topics like fossil fuels, government, and war. (Full disclosure: this is my boss’s favorite book and there’s something cool about reading your boss’s favorite book and seeing where it might impact their perspective.)

9. The Argonauts, by Maggie Nelson
I discovered Maggie Nelson via Bluets, her poetic lyrical essay about a woman who falls in love with the color blue (which I *loved*). The Argonauts is a completely different “family” of story — a genre-bending take on parenting and romance that focuses on Maggie’s own queer family and relationship with fluidly gendered Harry Dodge. Maggie is open, brutally honest, and thoughtful, and I appreciate her sharing such personal experiences.

10. The Girl on the Train, by Paula Hawkins
Thriller. Girl (woman) on train sees a couple out the train window every day, and daydreams about their “perfect” life  — UNTIL one day she sees the woman kiss another man and that woman goes missing…

If this sounds interesting to you, you’ll probably like this book. It’s a quick read, has a few twists, and was perfect for filling time on a flight from Austin to Boston.

11. How to Win Friends and Influence People, by Dale Carnegie
Dale Carnegie’s advice on making friends and influencing people is timeless. This book is still getting updated and reprinted more than 80 years after its original publication, and it’s still relevant (though some of the original examples are a little — charmingly — dated). If you’re interested in the basic techniques, the Wikipedia article does a good job of describing Carnegie’s basic system, but the book itself is a quick read and one I’d recommend.

12. The Undoing Project, by Michael Lewis
I love Michael Lewis books. The Undoing Project is another good one, focusing on the relationship between Daniel Kahneman and Amos Tversky, who together created the field of behavioral economics. Lewis, as usual, does a great job of explaining fairly technical concepts while weaving a really interesting story around the complex relationship between two men. This book is at turns triumphant and heartbreaking, and the research itself was interesting enough that I’d recommend it.

13. It, by Stephen King
My husband and I both read It this year in preparation for the new movie (which was so much better than the original!). This was my first Stephen King novel and won’t be my last.

14. The Heart, by Maylis de Kerangal**
I read this book solely based on Bill Gates’ recommendation and I’m so glad I did. This is technically the story of a heart transplant, but it is actually much more than that — a beautiful, gripping look at the fragility of life and family and relationships. The poetic language provides a strong contrast between the family whose life is forever changed and the matter-of-fact routine of the medical professionals involved in the story (for whom this is a normal “day at the office”). This book is an experience (I cried more than once), but a highly recommended one.

15. My Own Words, by Ruth Bader Ginsburg
There is so much about Ruth Bader Ginsburg that I find inspiring — her drive to make it to the Supreme Court, her work on gender equality and women’s rights, her relationship with her husband, her ability to see beyond political views and build cross-aisle friendships, and even her workout regimen (at age 84!) are all reasons to look up to RBG. Her book focuses on specific court cases and is peppered with interesting details about life on the Supreme Court. I found her book a bit repetitive (as some of the cases are cited multiple times) but overall a good in-depth look at the life of an important woman “in her own words”. I’m definitely a fan.

16. Sprint, by Jake Knapp
Sprint is like a time machine for business ideas: a process to get a team from concept to prototype with customer feedback in a single week. I read this book after hearing about the concept from the UX team where I work, who have developed several product features that are the direct result of sprints. If you work on a product team and are interested in ways to test and fast-track development ideas, I’d recommend this book.

17. Option B, by Sheryl Sandberg
Option B is a look at building resilience through loss, disappointment, and heartache. After the sudden loss of her husband, Sandberg took some time off to pick up the pieces — of her family, her job, and everything else — and this book tells that story. Like Lean In, Option B is a combination of experiences and research, and like Lean In, it’s full of interesting stories and practical advice — like how to be there for someone going through a loss (and the importance of questions like “What do you not want on a burger?”). This one resonated with me personally after the sudden loss of my father, and I spent so much time thinking about my mom that I eventually just sent her a copy. If you’ve ever wondered what to say or how to help someone who has experienced loss, check out this book.

18. Switch, by Chip and Dan Heath
The tagline of Switch is “How to change when change is hard”. This book focuses on an eight-step process for making change and introduces the idea of clearing a path for the elephant and the rider — the elephant being your emotions, big and hard to control, and the rider being your rational side, technically “in-control” but sometimes not enough to overcome your emotional side. Creating a clear path that addresses the needs and interests of both the elephant (emotions) and rider (rational thinking) is a means for making change “stick”. My boss and I both read this book and it provided a useful framework and shorthand that we’ve used while trying to make organization-level changes.

19. Mammother, by Zachary Schomburg**
Zachary Schomburg is one of my favorite poets (his book Scary, No Scary is an all-time favorite), and a few years ago, he announced that he was working on his first novel — I have been looking forward to reading Mammother ever since and it did not disappoint. Mammother is the story of a town suffering from a mysterious plague called God’s Finger that leaves its victims dead with a giant hole in their chest. There is a large cast of characters, plenty of magical realism (à la Márquez), and dense, beautiful language to support a surprisingly emotional story (I cried on a plane at the ending). If you like poetry, magical realism, or weird, cool reads, I highly, highly recommend this book (and all of Schomburg’s poetry, for that matter).

20. Between the World and Me, by Ta-Nehisi Coates**
This is one of those books that totally changed my perspective. Between the World and Me is a letter from Coates to his son about the experience and realities of being black in the United States. In addition to being beautifully written, this book covers territory I didn’t know existed — on the relationship between fear and violence, on Howard University, on how different race is experienced in the US than other countries without a history of slavery, on bodily harm, and so much more on “being black”. I wish this was required reading and can’t recommend it enough.

21. A Column of Fire, by Ken Follett
A Column of Fire is the third book in the widely spaced “Kingsbridge” series (which starts with The Pillars of the Earth, written in 1990). I was shocked to find that there was a third book in this series and downloaded it immediately. Each book is engaging historical fiction, and focuses on the political intersection of government and religion (and those who exploit either — or both). A Column of Fire is an apt addition — interesting storyline, lots of characters, and a cool take on historical events, particularly Mary, Queen of Scots. It’s worth noting that while this is part of a ‘series’, the books stand perfectly well on their own — though I would still start with The Pillars of the Earth if you’re interested.

22. Dear Data, by Giorgia Lupi and Stefanie Posavec
Dear Data is a project created by two women (Giorgia and Stefanie) getting to know each other by sending weekly postcards based on self-collected and visualized data — like “how many times I swore”, “every time I looked in the mirror”, and “how many times I said ‘sorry'”. The data is interesting, and the visualizations are engrossing — so much so that the collection of postcards was purchased by MoMA for display. We kicked off the R-Ladies Austin book club by reading this, and creating our own postcards, which was so much fun — and made us realize that the data collection and visualization process is not nearly as easy as it looks!

23. Milk and Honey by Rupi Kaur
Milk and Honey is a poetry book in four chapters: “the hurting”, “the loving”, “the breaking”, and “the healing”. Kaur provides an intense look at each emotion in turn. I have to admit that I didn’t love this book. While I appreciate brutal honesty in poetry (and there is plenty in here), the translation of feelings to language read a bit like teen angst, which was a turn-off.

24. Weapons of Math Destruction by Cathy O’Neil**
Weapons of Math Destruction is a look at the algorithms that pervade modern life — and the problems embedded within them. This book does a great job of explaining the ethical implications of collecting and using data to make decisions, and outlines a framework for creating responsible algorithms. After reading this, I’m noticing new algorithms and data issues almost weekly, so it’s definitely had an impact on my thinking and approach to creating algorithms and working with data. I think this will resonate with “data people” and everyone else (the examples jump from teachers to credit cards to court systems). Also, a quick shameless plug: this is the next book for the R-Ladies Austin book club, so if you want to discuss it in person, please join us on Jan 31!

25. A Little Princess by Frances Hodgson Burnett
I loved this movie growing up, and after my mom bought me the book, I devoured it during Christmas weekend. A Little Princess is a heartwarming story about a wealthy-but-charming girl who is sent to boarding school by her loving father (who is her only parent). Soon after, her father dies, and she lives a riches-to-rags story in which she is forced into labor to earn her keep as an orphan and ward of the school — all while imagining a better life as a princess. The book is just as magical as the movie, and I wholeheartedly recommend both.

*Resolutions vs. Goals (or, bonus thoughts on the whole “New Year’s” thing):

About four years ago, I changed my approach to New Year’s resolutions. While I love the theme of self-improvement, I don’t do well with resolutions that require doing something every day, or always, or never. A resolution like “read every day” is uninspiring, doesn’t allow room for the ebbs and flows of varied routines, and seems to be designed for failure — it only takes one slip-up to tarnish a “read every day” track record. Vague resolutions like “read more” are also tough. I appreciate a more concrete number or outcome to work toward so that I can track my progress through the year and know whether I’ve achieved each goal at the end of it.

So, I’ve replaced vague and overly stiff resolutions with quantifiable goals to be accomplished gradually over the course of the year. Now a resolution like “read every day” (or “read more”) becomes a goal: “read 25 books this year”. This allows room for vacations, sick days, and life to happen without making me feel like I’ve “failed” if ever I can’t find time to read. The number 25 was a good goal for me because it felt doable, easy to track (about a book every other week), and I knew I’d feel good about it at the end of the year, which is motivation to keep making progress.

If you were to track my progress towards this year’s goal, you’d find major spikes around vacations (I like reading on planes) and during weeks where I don’t have lots of events. All this is just to say that if you have a goal of reading more, or anything else that you haven’t been able to find time for, I’d highly recommend goals over resolutions.

Have you read anything moving lately? Have questions or a different take on any of these books? I’d love to hear about it. You can comment here or ping me on twitter.

Blue Christmas: A data-driven search for the most depressing Christmas song

Christmas music can be a lot of things — joyous, ironic, melancholy, cheerful, funny, and, in some cases, downright depressing. I personally realized this while watching an immensely sad scene in The Family Stone centered around Judy Garland singing ‘Have Yourself a Merry Little Christmas’ and haven’t yet fully recovered (or stopped noticing sad Christmas music). With this scene in mind, and without being able to think of any song that was more sad, I set out to use data to find the most depressing Christmas song. (Spoiler alert: I was wrong about Have Yourself A Merry Little Christmas.)

Data Collection

The data collection process broke down into three steps: choosing which songs to analyze, using Spotify to extract “musical” information about each song chosen, and using Genius (and Google) to collect the lyrics for each song.

Which Songs?

As it turns out, there are looooots of Christmas songs out there. (Just think of how many covers are released each year!) I was hoping for a Buzzfeed “Top 100 Christmas Songs of All Time” list but after checking FiveThirtyEight, Billboard, Spotify, and lots of Google searches, I wasn’t able to come up with anything satisfactory that included both title and artist (which I’d need to gather ‘musical’ track attributes). I settled on Spotify’s ‘Christmas Classics’ playlist. This 60-song playlist contains many classics (“Silver Bells”, “Sleigh Ride”) as well as some modern classics (I’m looking at you, Mariah Carey). While it doesn’t include all songs, I think it does a good job of picking the most popular version of each song chosen and handily satisfies my “title and artist” requirement.

Gathering ‘Musical’ Data (from Spotify)

We can get an idea of musical sadness by using data from the Spotify API, which allows you to extract various musical attributes for a given track (like danceability, speechiness, and liveness). For this analysis, I focused mainly on two attributes: energy and valence.

Valence is defined by Spotify as: “A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).” We can use this to determine how sad a track sounds (independent of lyrics).

Energy (as defined by Spotify) also ranges from 0 to 1 and represents a “perceptual measure of intensity and activity”. Heavy metal would rate high on the energy scale and slow acoustic tracks would rate low.

To gather this data, I used Charlie Thompson’s fantastic spotifyr package to interface with the Spotify API. This package can be installed and loaded via github like so:
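# a minimal install sketch, assuming the package still lives at charlie86/spotifyr on GitHub
# install.packages("devtools")   # if you don't already have devtools
devtools::install_github("charlie86/spotifyr")
library(spotifyr)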


Regardless of how you access it, the Spotify API requires that you set up a dev account (here) to create a client_id and client_secret. Save these as system variables (see below) and you’re ready to start gathering data!

Sys.setenv(SPOTIFY_CLIENT_ID = "your client id here")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your client secret here")

The spotifyr package allows you to pull track info based on a given artist, playlist, or album. Since I couldn’t pull the data from Spotify’s playlist directly, I copied all of the songs into my own playlist (“christmas_classics_spotify”) for easy access.

The steps I took were:
1) Get all of my playlist names (since I have more than one) using get_user_playlists
2) Get tracks from each playlist using get_playlist_tracks
3) Filter all tracks to just the tracks from the christmas_classics_spotify playlist
4) Use the get_track_audio_features function to get the features for the songs I care about.

Here’s what that code looks like:


library(dplyr)  # for the %>% pipe and filter()

# get all of my playlists, then pull the tracks from each one
playlists <- get_user_playlists("1216605385")
tracks <- get_playlist_tracks(playlists)

# keep only the tracks on the Christmas playlist
xmas_tracks <- tracks %>%
    filter(playlist_name == "christmas_classics_spotify")

# grab the audio features (energy, valence, etc.) for those tracks
track_features <- get_track_audio_features(xmas_tracks)

Gathering Lyrics

Originally, I planned to pull all of the song lyrics via the Genius API and Josiah Parry’s in-progress geniusR functions. This was a pretty good plan but I quickly realized that not all of the songs I wanted lyrics for were actually available on Genius (some are fairly old); so, I used a combination of geniusR and good ol’ fashioned copying and pasting to get the lyrics for all of the songs in my playlist. (Note: copying and pasting is not as boring as it sounds if you’re also watching old episodes of Parks and Recreation.)

If you do want to use the Genius API to gather data, you’ll need to create an account with Genius to get an API access token. Similar to what we did with the Spotify API, you can save this as an environment variable:

Sys.setenv(genius_token = "your access token here")

Since geniusR isn’t yet a full-fledged package, the best way to use its functions is to clone the geniusR repo and run each script, or copy and paste each one into your own script. Most of these functions are helper functions for the genius_lyrics function, which is the only one you’ll need. This function takes artist and song as arguments (like below):

jingle_bell_rock <- genius_lyrics(artist = "Bobby Helms", song = "Jingle Bell Rock")

You can loop through this function as needed to get any lyrics you’d like to analyze. Once you’ve collected all of your lyrics, you’re ready to move on to analysis!
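For reference, one way that loop might look (a quick sketch, assuming a hypothetical song_list data frame with artist and title columns; those names are placeholders, not columns from the playlist data above):

library(purrr)
library(dplyr)

# call genius_lyrics() once per song and keep the results in a list-column
all_lyrics <- song_list %>%
    mutate(lyrics = map2(artist, title, ~ genius_lyrics(artist = .x, song = .y)))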

Data Analysis: Quantifying Sadness

A song is made up of music and lyrics, and we’ll use both to create a Downer Index (a measure of a song’s sadness).

Musical Sadness

Earlier I mentioned a couple of useful features from the Spotify data we can use to quantify sadness — energy and valence. While these measures are useful individually, combining them gives us a better picture of how depressing a song might be. For example, a song that is high-valence but low-energy would definitely be happy, but might be considered more ‘peaceful’ or ‘calm’ than ‘joyous’. Likewise, a song that is low-valence but high-energy might be considered more ‘angry’ or ‘turbulent’ than ‘sad’. The most depressing songs will be both low-valence and low-energy (think Eeyore!). If we plot valence against energy, the sad songs will be the ones closest to the lowest-valence, lowest-energy point (0, 0):

[Interactive plot: energy vs. valence for each song in the playlist]
To quantify the musical sadness of each song, we calculate that song’s distance (in terms of valence and energy) from the point (0, 0) — the lower this distance is, the sadder the song.
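In code, that distance is just the Euclidean distance from the origin, computed from the valence and energy columns of the Spotify audio features (a quick sketch using the track_features data gathered earlier):

library(dplyr)

# distance from (0, 0); smaller values mean sadder-sounding songs
track_features <- track_features %>%
    mutate(distance = sqrt(valence^2 + energy^2))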

Based on musical sadness, the most depressing Christmas songs are:

[Table: the most depressing Christmas songs by musical sadness]

O Christmas Tree was definitely not one of my guesses for “most depressing Christmas song” (although a listen-through of this version might convince me otherwise), but fear not, we still have to take a look at the emotions conveyed in the lyrics…

Lyrical Sadness

To analyze lyrical sadness, I used Julia Silge and David Robinson’s tidytext package to perform sentiment analysis on each song. tidytext comes complete with a tokenizer (to break down long blocks of text into their individual words for analysis), a list of stop words (common words like “a”, “an”, and “the” which don’t carry much meaning and are therefore removed), and several sentiment catalogs we can use to analyze the feeling or emotion attached to a given word.
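As a rough sketch of the tokenizing step (assuming a hypothetical lyrics data frame with one row per song and the raw text in a lyric column):

library(tidytext)
library(dplyr)

# break each song's lyrics into one word per row for analysis
tidy_lyrics <- lyrics %>%
    unnest_tokens(word, lyric)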

Let’s get to it! After loading up the tidytext package, I created a list of sad words and a list of joy words from the NRC emotion lexicon.


# sad and joy word lists, built from tidytext's sentiments dataset (NRC lexicon)
sad_words <- sentiments %>%
    filter(lexicon == "nrc", sentiment == "sadness") %>%
    select(word) %>%
    mutate(sad = TRUE)

joy_words <- sentiments %>%
    filter(lexicon == "nrc", sentiment == "joy") %>%
    select(word) %>%
    mutate(joy = TRUE)


Next I removed stop words and left-joined the sad and joy word lists into my set of lyrics to calculate the percent of sad words and the percent of joy words that appeared in each song.

# join the sentiment flags onto the tokenized lyrics (tidy_lyrics: one row per word
# per song, as sketched above), then compute per-song percentages
with_sentiment <- tidy_lyrics %>%
    anti_join(stop_words) %>%
    left_join(sad_words) %>%
    left_join(joy_words) %>%
    group_by(track_name) %>%   # grouping column name assumed; one group per song
    summarise(pct_sad = round(sum(sad, na.rm = TRUE) / n(), 4),
              pct_joy = round(sum(joy, na.rm = TRUE) / n(), 4),
              sad_minus_joy = pct_sad - pct_joy)

You might have noticed that in the last line of code above, I subtracted the percent of joy words from the percent of sad words. Originally, I only looked at the percent of sad words, but I noticed that even happy songs (like Joy to the World) do have some sad words, while other songs had zero sad words. To account for this, I used the difference between the percent of sad words and the percent of joy words rather than the raw percent of sad words. (In fact, I thought it was interesting that only two songs have a higher percent of sad words than joy words — Blue Christmas and You’re a Mean One, Mr. Grinch.)

Based on lyrical sadness, the most depressing Christmas songs are:

[Table: the most depressing Christmas songs by lyrical sadness]

The Downer Index

In order to crown the most depressing Christmas song, we’ll have to combine the metrics for lyrical sadness (percent of sad words minus percent of joy words) and musical sadness (distance from (0, 0)). I’ve created a metric, the Downer Index, which does just that:


A downer index near 1 is a happier song and a downer index near 0 is a more depressing song. This index weights the musical and lyrical elements of the song equally, and both are on a (0, 1) scale such that a higher score represents a happier quantity. This metric (and blog post) is inspired by Charlie Thompson’s gloom index (and the accompanying blog article, which I highly recommend reading for a look into the sad songs of Radiohead).
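As a rough sketch, one formula consistent with that description might look like the following (the rescaling choices and the song_scores data frame are my own assumptions, not necessarily what produced the rankings below):

library(dplyr)
library(scales)

# song_scores is assumed to hold one row per song with the distance and sad_minus_joy metrics
downer <- song_scores %>%
    mutate(music_score  = rescale(distance),        # farther from (0, 0) sounds happier
           lyric_score  = rescale(-sad_minus_joy),  # fewer sad-minus-joy words reads happier
           downer_index = (music_score + lyric_score) / 2)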

Based on the data, the most depressing Christmas song is… Blue Christmas!

This doleful tune about unrequited love certainly delivers lyrically (Blue Christmas was the most lyrically sad song), which contributed highly to its ranking; it came in 28th overall for musical sadness with a score of .64. I also learned that Blue Christmas was not an Elvis original, though his is by far the most popular cover (thanks Wikipedia!).

And without further ado, here are the top ten most depressing Christmas songs:

[Table: the top ten most depressing Christmas songs by Downer Index]

While this approach isn’t perfect, I’m pretty happy with the results (except that my horse wasn’t even in the top ten!) and think the data does a fairly good job of capturing both the musical and lyrical sadness in the songs I analyzed.

Bonus: Christmas Song Superlatives

If you’re a “glass half full” kind of person, you might also be interested in some of the happier Christmas songs, which I also dug up while performing this analysis:

This data was a lot of fun to play with and I only scratched the surface on types of analyses you could do with it. If anyone is interested, I’m happy to share it.

Merry Christmas!



Data Meta-Metrics

Sometimes I work with great data: I know how and when it’s collected, it lives in a familiar database, and it represents exactly what I expect it to represent. Other times, I’ve had to work with less-than-stellar data — the kind of data that comes with an “oral history” and lots of caveats and exceptions when it comes to using it in practice.

When stakeholders ask data questions, they don’t know which type of data — great, or less-than-stellar — is available to answer them. When the data available falls into the latter camp, there is an additional responsibility on the analyst to use the data appropriately, and to communicate honestly. I can be very confident about the methodologies I’m using to analyze data, but if there are issues with the underlying dataset, I might not be so confident in the results of an analysis, or my ability to repeat the analysis. Ideally, we should be passing this information — our confidences and our doubts — on to stakeholders alongside any results or reports we share.

So, how do we communicate confidences and doubts about data to a non-technical audience (in a way that is efficient and easily interpretable)? Lately I’ve been experimenting with embedding a “state of the data” in presentations through red, yellow, and green data meta-metrics.


Recently my team wanted to know whether a new product feature was increasing sales. We thought of multiple ways to explore whether the new feature was having impact, including whether emails mentioning the new feature had higher engagement, and using trade show data to see whether there was more interest in the product after the feature was released. Before starting the analysis, we decided that we’d like this analysis to be repeatable — that is, we’d like to be able to refresh the results as needed to see the long-term impact of the feature on product sales.

Sounds easy, right? Collect data, write some code, and build a reproducible analysis. I thought so too, until I started talking to various stakeholders in 5+ different teams about the data they had available.

I found the data we wanted in a variety of states — anywhere from “lives in a familiar database and easy to explore” to “Anna* needs to download a report with very specific filters from a proprietary system and give you the data” to “Call Matt* and see if he remembers”. Eventually I was able to get some good (and not-so-good) data together and build out the necessary analyses.

While compiling all of the data and accompanying analyses together for a presentation, I realized that I needed some way to communicate what I had found along the way: not all of the data was equally relevant to the questions we were asking of it, not all of the data was trustworthy, and not all of the analysis was neatly reproducible.

The data meta-metrics rating system below is what I’ve used to convey the quality of the data and its collection process to technical and non-technical members of my team. It’s based on three components: relevance, trustworthiness, and repeatability. The slide below outlines the criteria I used for each score (green, yellow, red) in each category.

[Slide: scoring criteria (green, yellow, red) for relevance, trustworthiness, and repeatability]

Within the presentation, I added these scores to the bottom of every slide. In the below example, the data we had definitely answered the question we were asking of it (it was relevant), and I trusted the source and data collection mechanism, but the analysis wasn’t fully reproducible — in this case, I needed to manually run a report and export a text file before being able to use it as an input in an automated analysis. Overall, this data is pretty good and I think the rating system reflects that. The improvement that would take this data to green-green-green would be pretty simple — just writing the email data to a more easily accessible database, which becomes a roadmap item if we feel this report is valuable enough that we’ll want to repeat it.

[Slide: example analysis slide with the email data’s meta-metric scores in the footer]

Below is an example of a not-so-great data process. Trade shows are inherently pretty chaotic, and our data reflects that. It’s hard to tell what specifically makes a trade show attendee interested in a product, and tracking that journey in real time is much harder without records of interactions like demos, phone calls, etc. This becomes another roadmap item; if we want to dig deeper into trade show data and use it to guide product decisions, we need to implement better ways of collecting and storing that data.

[Slide: trade show data example with meta-metric scores in the footer]

Overall, this exercise was helpful for diagnosing the strengths and weaknesses of our data storage and collection across multiple teams. Providing this data in an easy-to-understand format allowed us to have informative conversations about the state of our data and what we could do to improve it. Getting the rest of the team involved in the data improvement process also helps my understanding of what data we do and don’t have, what we can and can’t collect, and makes my analyses more relevant to their needs.

The meta-metrics I used here are the ones we specifically cared about for this type of analysis; I could certainly see use cases where we might swap out or add another data meta-metric. If you’ve worked on conveying “the state of the data” or data meta-metrics to your team, I’d love to hear more about your process and the meta-metrics you’ve used in the comments.

*  Names have been changed to protect the innocent.