Open source is everywhere, and the data world in particular is seeing fast innovation and growth in the number of projects and tools. A year ago I decided that a top priority for 2020 was finding an open source project to contribute to. At the time, my assumed motivations were that:
- It would add breadth to my existing skills and knowledge
- I enjoy contributing to improving the workflow of others
- I could talk about my experience in interviews
- Internet points?
A year on I’ve been reflecting on my experience, what I wish I’d known before I started, and to what extent my initial expectations held true.
In early 2020 I’d just returned to university to complete my Master’s degree. My plan was to go travelling for a couple of months after graduation and start work again in August, but I was keen to keep building on my skills during the eight months away from a data engineering role. Contributing to an open source project seemed like a good idea. A few months earlier I had been introduced to dbt, a data transformation tool, and knew it was a hot project, so I decided to start there.
Maintainers of open source projects often tag issues to signal that they are accessible to first-time contributors. I scoured the issues on GitHub and found one tagged ‘Good First Issue’ that I thought was approachable enough to get stuck into. Here it is.
I was fortunate in this case as the project’s primary author, Drew, had provided some helpful pointers. I’d been a dbt user for several months but this codebase was unlike anything I’d ever seen. It seemed huge. How was I going to work my way through and understand thousands of lines of code?
After about an hour of scrolling and becoming increasingly scared off, I wondered if maybe I didn’t need to understand it all. Open source projects are large by nature as they have many contributors, and often solve new and complex problems. I now know that a large but well-written codebase will be designed with maintainability and testability in mind, which means that classes and functions should be narrow in scope and purpose. Still, this was totally foreign code to me.
Following the experience I’ve developed a set of steps I always follow when approaching a new codebase for the first time:
- Read any published contribution or development guides.
- Identify the concepts relevant to the area of the project I want to contribute to before trying to understand them in depth.
- Examine previously merged pull requests which added functionality to the same or similar area.
- If contributing to a CLI-based tool, find the entry point (in setup.py for a Python project) and work through the function calls. If it’s a library, locate the public API methods and classes and proceed in the same way.
- Once the key methods and classes have been identified, head to the tests. Well-written tests tell a story about the code they refer to.
A systematic approach will help avoid code overwhelm, and make it easier to stretch the process over several sessions, which will almost always be necessary for a first contribution.
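To make the entry point step concrete: in a setuptools project, a console script is usually declared in setup.py as a string of the form `name = package.module:function`. The sketch below shows one way to turn the part after the equals sign into the function it names, so you know where to start reading. The helper name `resolve_entry_point` is my own, and the spec used here is a standard library stand-in rather than any real project’s entry point.

```python
import importlib


def resolve_entry_point(spec: str):
    """Turn a 'package.module:function' string into the object it names.

    Hypothetical helper for illustration; setuptools does the equivalent
    when it generates the command line script for a project.
    """
    module_path, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_path)
    if attr_path:
        # The part after the colon may itself be dotted, e.g. "cli:App.run"
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# Stand-in spec from the standard library so the sketch runs anywhere.
# For a real project, copy the string from its setup.py instead.
entry = resolve_entry_point("json:dumps")
print(entry.__name__)
```

Once you have the function, your editor’s go-to-definition will take you to the file where the call chain begins.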
Lesson two - contributing to a new codebase just is hard, but a good workflow will make life easier.
I now had an idea of where the solution to the problem should be. But there were helper methods everywhere, references to objects I had no understanding of. Even the best engineer in the world doesn’t have a shortcut here. You will have to spend some time introspecting these objects and looking up function definitions. While we can’t speed up our brain’s ability to understand new code other than through practice, configuring your text editor and understanding its shortcuts makes navigating and looking up definitions faster.
If the codebase’s language is Python (which I’m 99% sure it is if this is your first contribution and you’re a data person), there’s a magical built-in function called breakpoint(), available from Python 3.7. When called, it pauses execution and opens an interactive debugger (pdb by default) in the current scope. Dig into variables, call functions, whatever you like. Once you’re done, continue execution of the program. Read more here.
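Here is a minimal sketch of what that looks like in practice. The function `summarise_orders` is made up for illustration and stands in for whatever unfamiliar code you are trying to understand.

```python
def summarise_orders(orders):
    """Hypothetical function standing in for unfamiliar project code."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    # Execution pauses here and drops into the debugger (pdb by default).
    # At the (Pdb) prompt, try `p totals` or `p orders` to print variables,
    # and `c` to continue running the program.
    breakpoint()
    return totals


# Run this from a terminal to land at the prompt:
# summarise_orders([("a", 10), ("b", 5), ("a", 2)])
```

Setting the environment variable `PYTHONBREAKPOINT=0` turns every `breakpoint()` call into a no-op, which is handy if you forget to remove one before running the test suite.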
Now you’re able to pause execution, examine variables and run arbitrary commands, you’re well on your way. Ideally, we’d write a test before writing any application code to say ‘The result of this function call with these inputs should be X’, see it fail, and then write the code needed to make it pass. Real life isn’t always that straightforward though, and in a first contribution the test suite will require some study too.
You’ve got two options:
- Hack away until you think you’ve added the functionality, and then add some tests to prove it.
- Start by adding some failing tests, and work toward making them pass.
The latter is known as test-driven development (TDD), and is written about extensively. In practice, I use both approaches in roughly equal proportion. I’d recommend experimenting with both, but including a test which addresses the issue you’re trying to solve makes the goal of the contribution clear to both you and the reviewer. Many open source projects have strict contribution criteria around testing, and it’s a great way to improve your understanding of what a high quality codebase and test suite look like.
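As a rough sketch of the test-first option: the test below would be written before the function exists, so the first run fails with a NameError. The implementation underneath is then the smallest change that makes it pass. Both `quote_identifier` and its behaviour are invented for this example and are not taken from dbt or any other project.

```python
def test_quote_identifier_escapes_embedded_quotes():
    # Written first: it states the behaviour the contribution should add.
    assert quote_identifier('my"col') == '"my""col"'


def quote_identifier(name: str) -> str:
    """Hypothetical helper: wrap a SQL identifier in double quotes,
    doubling any double quotes already inside it."""
    return '"' + name.replace('"', '""') + '"'


# A test runner such as pytest would collect and run the test function;
# calling it directly works too.
test_quote_identifier_escapes_embedded_quotes()
```

A test like this doubles as documentation for the reviewer, since it shows the intended input and output without them having to read the implementation first.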
Lesson three is the one I really wish I’d known from the start: contribute when you have a reason to. After the dbt work, I didn’t make any contributions for several months. While I’d enjoyed the process and learned a lot, it was challenging and the actual change wasn’t something I benefited from. Several months later, I started to use a tool called Great Expectations, which I’ve previously written about. This time I had a reason to make a contribution: I wanted to extend the alerting options from just Slack to include PagerDuty.
I was excited not just by the prospect of contributing to a tool I enjoyed using, but also by the use case that implementing the change would unlock for me. Part of what makes open source software so great is that the creators are users and vice versa. Having a personal reason to contribute made the process far more enjoyable, and the achievement is now something I’m reminded of every time I benefit from it.
Since then, I’ve become a maintainer of SQLFluff. SQLFluff is a dialect-flexible and configurable SQL linter. As a fan of both SQL and style consistency, I found it an immediately compelling project to contribute to. Designed with ELT applications in mind, SQLFluff also works with Jinja templating and dbt. If style consistency is your thing, give it a try. You will without doubt find either a bug (it’s still young) or a linting rule you’d like to implement. If SQLFluff excites you too, I’d love to help get you started with your first open source contribution, so if you have any questions reach out in the Slack workspace.
I started with the goal of making a contribution to open source, with a few ideas of why. A year on, I’ve learned what the real benefits are. So what are they, and what motivates me to continue?
- I’m learning to solve interesting problems associated with larger and more complex codebases
- I’m building relationships with interesting people across the world
- Satisfaction from building tools that improve others’ working lives
- Unique elements to my career story
- Recognition in the data community has proved extremely beneficial for professional networking and seeking advice.
And the three lessons I’d pass on:
- You don’t need to understand all the code
- Contributing to a new codebase just is hard, but a good workflow will make life easier
- Contribute when you have a reason to.