7 min read

R Stats NYC 2017

Friday and Saturday of this weekend was the 2017 RStatsNYC conference. With the small exception of remarkably uncomfortable chairs, the conference was fantastic (and great food!). There was a remarkable variety in what the speakers covered – from deep thoughts on the ethics and philosophy of doing data science, to some super useful RStudio tips and tricks and everything in between. I decided to take my notes from the two days and post below in case there’s things anyone else might find useful.

See, really good food!

One of the remarkable similarities between the speakers was the degree to which many talks focused on process of doing data science. It feels like there’s a lot of room for growth in terms of understanding and standardizing best practices of doing data science.

Deep Thoughts

Many of the speakers shared some great nuggets of insight into doing data science that extended far beyond simple lessons about R. Some of my favorites below.

The Past, Present, and Future of Data Science

  • Statistics has gradually added different parts of the statistical process to the field — but many process points remain to be integrated. We don’t have a theory of a good statistical workflow. Meta-systematization is needed. We need to systematize how we systematize things.
  • Why do we need theory? Theory is scalable. Simulations are not.
  • We should be specifying priors about effect sizes along with research designs.
  • Is having a core data science team at a company a good thing or a bad thing?
  • The history of software engineering is a great guide for the future of data science.
  • As data gets bigger, machine learning problems and questions will get bigger too. Our jobs are safe.
  • The hardest job of a data scientist is turning open-ended problems in the real world into practical problem statements that are solvable using data. Defining success is essential.
  • Software engineers build and grow software — the sky is the limit. Data scientists are scientists and are constrained by the real world.

Data Science as Business Intelligence (and Empathy)

  • Analysis doesn’t end at result delivery — it ends at developing and proselytizing new business strategies and innovation.
  • Automating Excel workflows can be a first step towards bringing R to a team.
  • Agile development tells user stories as an empathy hack. As a ______ I want ______ So I can ______
  • Person-level stories (the near) are always more meaningful than data stories (the far). We need to balance both as Data Scientists.
  • In development, we need to have an actual user in mind, rather than a theoretical user who wants everything.

How to Live-Tweet an Event (a great aside)

  • Three important elements in live tweet of event: conference hashtag #rstatsnyc, @speaker (put . first so it’s viewable publicly), always include a pic.
  • They can have one of a few purposes: intro person/topic, spread insight, link to resource, spread positivity!

Live tweet: conference hashtag, @speaker, picture, positivity

How to Hackathon

  • If you’re not at risk of failing a few days a quarter, you’re not pushing.
  • Keys to focused hacking: rock solid infrastrure, minimize context switching (away messages), parallel work, strong and clear conversation, timeline accountability (detailed to the half-hour), positive team dynamic, time-boxing decisions.
  • How to start an analysis:
    • Identify assumptions about a solution
    • Rank assumptions by how critical they are
    • Test assumptions with the simplest experiment possible
  • Fail fast. The worst thing about spending a few months barking up the wrong tree isn’t the time or money lost — it’s team demoralization.
  • Truths about data science:
    • Nerd != won’t engage with business
    • Shiny tool != hard problem
    • Shiny tool != being a smart data scientist

Open Source and Package-Building

  • The hardest step in joining the open source community is going from 0 contributions to 1.
  • Open source contributions: failing test with fix > failing test > bug report
    • Remember to include sessionInfo()
  • Package release ideas:
    • Release quickly, but also slowly — take time to fix dumb decisions
    • Blaze new trails, but don’t reinvent the wheel
    • Tell people about your package — then listen.
  • Package design: documentation > usability > performance > features

Interesting Research

There were only a few presentations of actual research projects — but the ones presented were really wonderful.

  • Cell tower data can reveal interesting patterns in movement — in particular return to normal flow of people after natural disasters.
  • Stack Overflow has a selection of data from stack overflow — can do text analysis on it. There’s also a jobs list and an api.
  • It is possible to predict which officers in police departments will have adverse incidents with the public in order to arrange interventions. Nashville and Charlottesville are doing this. This work can save lives.
  • ROpenSci does peer review for R packages. Many of their packages get data from scientific databases or interface with scientific equipment.

Statistical and Programming Tools

  • Convolutional Neural Net: used for image recognition
  • VPNs are good privacy tools, we should probably all use one.

R Packages to Investigate

There were so many neat R packages mentioned over the course of the conference. Here was a short list of ones I didn’t know and wanted to look into.

  • packrat: helps you create a package template file
  • plumber: allows you to easily turn regular R code into an API
  • RXKCD: add XKCD cartoons to stuff!
  • trelliscope: many-panel data vis
  • compareGroups: compare demographics and other aspects across groups
  • htmlWidgets: integrate javascript applets into R code. We want interactivity, and javascript is de-facto standard for everything having to do with interactivity.
    • ipywidgets: similar for python
  • broom: turn analytics output into tidy data frames
  • goodpractice: does a variety of checking for good package development practice
    • lintr: helps check for good code style
    • devtools::spellcheck()
  • solidify, highcharts: commonly used presentation packages
  • tidytext: help turn text into tidy data frames

Generally Helpful Programming Hints, tips, and tricks

  • rr-init: ROpenSci package to create skeleton of project
  • Run R script myscript.R from command line: Rscript myscript.R
  • RStudio Server runs on port 8787 by default
  • Anaconda can be used to download a local R version
  • Can install entire tidy verse at once: install.packages(“tidyverse”)
  • Always choose a license for your github repo. MIT is the loosest.
  • Can put an empty folder in a git repo by creating and adding a file to git touch folder/.gitkeep
  • make bash — can create a makefile (make.sh) with assigned dependencies and will only run files that have been updated since last run
  • Join R mailing list
  • Contributing to documentation (or improving unhelpful error messages) is a great way to get started on contributing to Open Source
  • In R save() saves multiple objects with their names as .Rda. Using load restores them into the environment. saveRds() saves a single object without its name. Restoring requires assignment a <- readRds().

RStudio Hints, Tips, Tricks

The RStudio folks gave a great presentation on some tricks and tips in RStudio. Some of my favorite takeaways:

  • Realtime inline TeX previews in RNotebooks: $EQN$
  • Notebook preview — doesn’t re-run, just renders what’s been run
  • Can include code chunks that run code other than R (SQL, Bash, Python). Instead of {r} in chunk header, put (e.g.) {sql connection = con, output.var = ‘varname’} and then can put SELECT statement in body that will be assigned to varname
  • If you want to include variables in SQL code, use ?varname in code.
  • RStudio git integration provides nice way to view diffs
  • Tab autocomplete is fuzzy — just type unique letters in command and tab for rest
  • Shift-Tab to insert code snippets — there are built in (tools > global options > code snippets)
  • Starting to type a command and then cmd + arrow up gives history of all times that command used
  • Can easily go from history to console or script
  • ctrl + . search files, fns, etc
  • CTRL + SHIFT + . search among tabs
  • ctrl + shift + f (“or if that’s too hard to remember, ctrl capital F”) to customize where searching
  • SHIFT + ALT + k list keyboard shortcuts