RStats NYC 2017
Friday and Saturday of this weekend was the 2017 RStatsNYC conference. With the small exception of remarkably uncomfortable chairs, the conference was fantastic (and great food!). There was a remarkable variety in what the speakers covered – from deep thoughts on the ethics and philosophy of doing data science, to some super useful RStudio tips and tricks and everything in between. I decided to take my notes from the two days and post below in case there’s things anyone else might find useful.
One of the remarkable similarities between the speakers was the degree to which many talks focused on process of doing data science. It feels like there’s a lot of room for growth in terms of understanding and standardizing best practices of doing data science.
Deep Thoughts
Many of the speakers shared some great nuggets of insight into doing data science that extended far beyond simple lessons about R
. Some of my favorites below.
The Past, Present, and Future of Data Science
- Statistics has gradually added different parts of the statistical process to the field — but many process points remain to be integrated. We don’t have a theory of a good statistical workflow. Meta-systematization is needed. We need to systematize how we systematize things.
- Why do we need theory? Theory is scalable. Simulations are not.
- We should be specifying priors about effect sizes along with research designs.
- Is having a core data science team at a company a good thing or a bad thing?
- The history of software engineering is a great guide for the future of data science.
- As data gets bigger, machine learning problems and questions will get bigger too. Our jobs are safe.
- The hardest job of a data scientist is turning open-ended problems in the real world into practical problem statements that are solvable using data. Defining success is essential.
- Software engineers build and grow software — the sky is the limit. Data scientists are scientists and are constrained by the real world.
Data Science as Business Intelligence (and Empathy)
- Analysis doesn’t end at result delivery — it ends at developing and proselytizing new business strategies and innovation.
- Automating Excel workflows can be a first step towards bringing
R
to a team. - Agile development tells user stories as an empathy hack.
As a ______
I want ______
So I can ______
- Person-level stories (the near) are always more meaningful than data stories (the far). We need to balance both as Data Scientists.
- In development, we need to have an actual user in mind, rather than a theoretical user who wants everything.
How to Live-Tweet an Event (a great aside)
- Three important elements in live tweet of event: conference hashtag
#rstatsnyc
, @speaker (put . first so it’s viewable publicly), always include a pic. - They can have one of a few purposes: intro person/topic, spread insight, link to resource, spread positivity!
Live tweet: conference hashtag, @speaker, picture, positivity {{< tweet 855792361527541760 >}}
How to Hackathon
- If you’re not at risk of failing a few days a quarter, you’re not pushing.
- Keys to focused hacking: rock solid infrastrure, minimize context switching (away messages), parallel work, strong and clear conversation, timeline accountability (detailed to the half-hour), positive team dynamic, time-boxing decisions.
- How to start an analysis:
- Identify assumptions about a solution
- Rank assumptions by how critical they are
- Test assumptions with the simplest experiment possible
- Fail fast. The worst thing about spending a few months barking up the wrong tree isn’t the time or money lost — it’s team demoralization.
- Truths about data science:
- Nerd != won’t engage with business
- Shiny tool != hard problem
- Shiny tool != being a smart data scientist
Open Source and Package-Building
- The hardest step in joining the open source community is going from 0 contributions to 1.
- Open source contributions: failing test with fix > failing test > bug report
- Remember to include
sessionInfo()
- Remember to include
- Package release ideas:
- Release quickly, but also slowly — take time to fix dumb decisions
- Blaze new trails, but don’t reinvent the wheel
- Tell people about your package — then listen.
- Package design: documentation > usability > performance > features
Interesting Research
There were only a few presentations of actual research projects — but the ones presented were really wonderful.
- Cell tower data can reveal interesting patterns in movement — in particular return to normal flow of people after natural disasters.
- Stack Overflow has a selection of data from stack overflow — can do text analysis on it. There’s also a jobs list and an api.
- It is possible to predict which officers in police departments will have adverse incidents with the public in order to arrange interventions. Nashville and Charlottesville are doing this. This work can save lives.
- ROpenSci does peer review for
R
packages. Many of their packages get data from scientific databases or interface with scientific equipment.
Statistical and Programming Tools
- Convolutional Neural Net: used for image recognition
- VPNs are good privacy tools, we should probably all use one.
R Packages to Investigate
There were so many neat R packages mentioned over the course of the conference. Here was a short list of ones I didn’t know and wanted to look into.
packrat
: helps you create a package template fileplumber
: allows you to easily turn regular R code into an APIRXKCD
: add XKCD cartoons to stuff!trelliscope
: many-panel data viscompareGroups
: compare demographics and other aspects across groupshtmlWidgets
: integratejavascript
applets intoR
code. We want interactivity, and javascript is de-facto standard for everything having to do with interactivity.ipywidgets
: similar forpython
broom
: turn analytics output into tidy data framesgoodpractice
: does a variety of checking for good package development practicelintr
: helps check for good code styledevtools::spellcheck()
solidify
,highcharts
: commonly used presentation packagestidytext
: help turn text into tidy data frames
Generally Helpful Programming Hints, tips, and tricks
rr-init
: ROpenSci package to create skeleton of project- Run
R
scriptmyscript.R
from command line:Rscript myscript.R
- RStudio Server runs on port 8787 by default
- Anaconda can be used to download a local R version
- Can install entire tidy verse at once:
install.packages(“tidyverse”)
- Always choose a license for your github repo. MIT is the loosest.
- Can put an empty folder in a git repo by creating and adding a file to git
touch folder/.gitkeep
make
bash — can create a makefile (make.sh
) with assigned dependencies and will only run files that have been updated since last run- Join R mailing list
- Contributing to documentation (or improving unhelpful error messages) is a great way to get started on contributing to Open Source
- In R
save()
saves multiple objects with their names as.Rda
. Usingload
restores them into the environment.saveRds()
saves a single object without its name. Restoring requires assignmenta <- readRds()
.
RStudio Hints, Tips, Tricks
The RStudio folks gave a great presentation on some tricks and tips in RStudio. Some of my favorite takeaways:
- Realtime inline
TeX
previews in RNotebooks:$EQN$
- Notebook preview — doesn’t re-run, just renders what’s been run
- Can include code chunks that run code other than R (SQL, Bash, Python). Instead of
{r}
in chunk header, put (e.g.){sql connection = con, output.var = ‘varname’}
and then can putSELECT
statement in body that will be assigned tovarname
- If you want to include variables in SQL code, use
?varname
in code. - RStudio git integration provides nice way to view diffs
- Tab autocomplete is fuzzy — just type unique letters in command and tab for rest
Shift-Tab
to insert code snippets — there are built in (tools > global options > code snippets
)- Starting to type a command and then
cmd + arrow up
gives history of all times that command used - Can easily go from history to console or script
ctrl + .
search files, fns, etcCTRL + SHIFT + .
search among tabsctrl + shift + f
(“or if that’s too hard to remember, ctrl capital F”) to customize where searchingSHIFT + ALT + k
list keyboard shortcuts