
Developing tools for exploratory visualization

I find it incredibly fulfilling to work on projects that enhance the productivity of other data scientists. After all, most data scientists need to maximize the time spent formulating, examining, and refining questions about data, and minimize the time spent translating those thoughts into machine code. I’ve talked at great length about how my work on plotly fits into this context: roughly speaking, it allows one to leverage many benefits of interactive web graphics without any knowledge of web technologies.

If you keep up with online media, the benefits of interactive web graphics are pretty clear: they can help grab the audience’s attention, enhance knowledge transfer, and allow the audience to further investigate detailed information. And yes, assuming you’re a web developer, we already have awesome tools for creating web-based data visualizations; but even if you’re a polyglot data science unicorn, you still shouldn’t be writing JavaScript/JSON to do exploratory analysis (where the most useful view/question of the data is not yet known). That’s because data exploration requires non-linear iteration between data transformation/modeling/visualization (which a web browser was not designed to do). That being said, interactive graphics are certainly useful for exploration,1 which is partially why we’ve seen a recent explosion in R/Python/Julia interfaces to JavaScript graphing libraries.

I feel very fortunate to have maintained the (completely free and open source) R interface to plotly.js for over 2 years now. I also feel very fortunate to be in a position where the plotly.js maintainers respond very well to my bug reports and feature requests.2 Now that I’ve graduated and had some time to reflect on the past few years, I’ve been thinking a bit more generally about data science software development, and how I can keep sustaining this type of work through my consulting services.

Generally speaking, designing data science software requires making hard decisions about which abstractions to make, and perhaps more importantly, which abstractions not to make (especially with respect to visualization). Doing this well requires an intimate knowledge of the most common/frustrating/difficult tasks within the given domain, which means data science software developers should:

  1. Be familiar with the vast ecosystem of existing tools or else we risk re-inventing the wheel.3
  2. Get our hands dirty analyzing data, and applying our tools to real data-driven problems, or else we risk working on insignificant problems.

This is why I spend a good chunk of my time researching the constantly expanding ecosystem of R packages and using some of those tools to do real data analysis. Doing so not only forces you to eat your own dog food, but also helps you focus on the bigger picture, rather than drowning in a sea of issues that don’t affect most users. For example, most of the new linking framework in plotly was motivated and (indirectly) supported by work on eechidna/pedestrians/bcviz. I hope to continue doing this sort of work with the help of more clients that have interesting data-driven problems and need better ways to explore/present their data visually.

Another thing I’ve learned through this work is that even the best programming interfaces do only about 80% of what you’d like them to do. That number is even smaller for general-purpose visualization software (e.g., plotly) since its scope is so impossibly large. Fortunately, working with open-source tools means that we are (usually) free to modify the software to fit a particular use case. That sounds great in theory, but in practice, it can be a time-consuming and ultimately futile effort without an intimate understanding of the software. Since I have an intimate understanding of plotly and many other R packages, I also offer my clients the ability to adapt/modify/fix existing open-source tools for a particular project.

  1. Assuming that the time spent iterating from one visualization to the next is relatively small.

  2. Having tracked development on similar R packages, like ggvis, I’m convinced that a strong relationship amongst development teams (i.e., R and JavaScript devs) is necessary to maintain such a project.

  3. That isn’t to say that “re-inventing the wheel” can’t be a good thing, especially when it leads to a better wheel. In my opinion, especially in academia, there is too much of a focus on whether tool(s) exist, and not nearly enough attention is paid to whether they are usable.