How to become a data scientist for statistics undergraduates:
I’ll cut through the words and give you the answer: ask questions about the world, look for data that can be used to investigate them, and build something to see if you were right. The same advice applies to anyone, statistics undergrad or not: following your own curiosity and producing something (or nothing, but trying) is the most valuable thing you can do in data science. Later on I’ll go into more depth on how to do this and where to get started, and I’ll tailor the advice for people who already have theoretical knowledge or are gaining it through a statistics undergraduate degree.
In my companion blog I go over why an undergraduate degree in statistics is useful, but not sufficient, for becoming a data scientist. You can read it in full here, but the gist is that while the degree is incredibly useful as a foundation and opens a lot of doors, the work you do outside of it counts for more. “What sort of work?” you may ask. Here goes:
Research which tools are currently being used
This could mean the tools used in your field or, if you’re not particularly invested in one field, simply whatever is most popular. For example, at the time of writing, the most widely used tools are Python libraries like pandas, Keras, and scikit-learn. For bioinformatics specifically, R is preferred in many cases, and for econometrics Stata is useful. However, this could all change in the future: Google is pushing Swift for data science, and newer languages such as Julia are also on the horizon.
A good place to look for this sort of advice is YouTube, where everyone and their mother has an opinion on which language to learn. If in doubt, just learn Python.
Learn by doing
You do not need much programming experience to make things, intimidating as that may seem. The best thing you can do to become a data scientist is to make things right from the get-go. You could make a poster, an interactive graph or other visualisation, a report, a video: anything where you try to find something out using data. While first learning, I decided to make a sentiment analysis tool in R that would produce a plot of how positive or negative a news article was. Did I know how to do this before starting? Did I know which packages to use before starting? No and no, and I cannot stress this enough. You learn by doing: you learn by going “I want to do X and I don’t know the best way to do it” and then googling how to do it. Over time you build up a large toolbox and, more importantly, you learn what can be done even if you don’t remember exactly how you did it.
This, however, is one of the reasons why I think just throwing yourself in at the deep end can be difficult, and perhaps inadvisable, for some. It’s important to feel confident googling your problems and to know where to go for answers. Equally important is developing good coding habits, like proper indentation and presentation. These things are best learnt from a source such as a book or web series. If you are learning R, then “R for Data Science” by Hadley Wickham and Garrett Grolemund is available online and is probably the best book for data scientists or students learning R from scratch.
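To make the sentiment analysis project mentioned above a little more concrete, here is a rough sketch of the core idea in Python. It is not the R tool I built, just one simple way such a thing could start: count positive and negative words in each sentence and subtract. The word lists and the function name `sentence_scores` are made up for illustration; a real project would swap in a published sentiment lexicon.

```python
import re

# Tiny hand-picked word lists, purely illustrative (an assumption of this
# sketch); a real project would use a published sentiment lexicon instead.
POSITIVE = {"good", "great", "growth", "win", "success", "improve", "hope"}
NEGATIVE = {"bad", "crisis", "loss", "fail", "decline", "fear", "worse"}


def sentence_scores(text):
    """Return one score per sentence: positive word count minus negative."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        pos = sum(word in POSITIVE for word in words)
        neg = sum(word in NEGATIVE for word in words)
        scores.append(pos - neg)
    return scores


# A made-up three-sentence "article" to try the function on.
article = (
    "The company reported great growth this quarter. "
    "Analysts fear a decline next year. "
    "The board meets on Monday."
)
scores = sentence_scores(article)
for i, score in enumerate(scores, start=1):
    print(f"sentence {i}: {score:+d}")
```

From here, the list of scores could be handed to whatever plotting library you are learning to draw the positive/negative shape of the article. Working out how to do that is exactly the kind of “I want to do X” question worth googling.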
Keep at it
Programming is hard. Programming is time-consuming. Programming will make you want to give up a lot.
On the other hand, programming is challenging. Programming is worth the investment. Programming will make you a more logical thinker and a better problem solver.