Geoscience to Data Science Starter Pack
Why Write This and Who Is It Targeting?
In early 2016, three years ago, I wrote a blog post on my website called LEARNING TO CODE. It summarized the different styles of learning you can pick from when trying to learn how to code. This blog post, like that one, was prompted by the realization that I had had the same conversation with two different people within a single week. They were asking the same questions, so I might as well write everything down.
This post is directed at Houston-based geoscience types starting off on a months-to-years-long process of improving their data science skills and maybe eventually getting a job in data science. It lays out the things I’ve found myself telling people in real life.
Step 1: Figure out if you’re interested in this type of thing…
I’ve not seen a lot of writing on the best way to do this. The best path forward may be a personal decision to a large degree.
If you have kids and want to involve them in your first steps, code.org and Scratch are two resources to try if you haven’t written any code. Both are designed for kids but still kinda cool. They’ll let you see what kind of logic writing code uses, often in a pictorial form that doesn’t require memorizing any syntax.
You might also want to try some shorter lessons, 1–5 hours long, on sites like Codecademy, or take your time going through an introduction to any of the languages on W3Schools.
If you’re more motivated by what you can eventually do, you might try watching a few videos of talks from any of the SciPy conferences or the machine-learning videos from PyCon. They’ll be partially over your head, but they can still be very interesting. You can also take a look at the blog posts by Agile Scientific summarizing the projects made during geology hackathons.
What Language to Learn?
Python is the favorite. Different computer languages are better for different tasks, and they change in popularity over time. There used to be Python-vs.-R debates for data science, but those have faded recently as Python has won over more people. Two libraries you’ll use often that also have good documentation and lots of video tutorials are SciPy and scikit-learn. If you want to try NLP (natural language processing), spaCy has maybe the best documentation of the major Python machine-learning libraries.
While Python tends to dominate the hard sciences and, to a decent extent, machine learning, R still leads in the social sciences. There is interesting geoscience computing done in R, but most is done in Python.
Other languages that don’t start with Python
C++ and Java are the languages most often learned by Computer Science majors in university. There are good reasons for this and not quite so good reasons for this. Certain things, like highly dependable applications, embedded applications, and low-level high-performance computing, are done in C++. If you are a geophysicist and did some C++ in school, it might be a place to continue. If not, it probably isn’t the place to start.
Some of the things that could be said about C++ could also be said about Java. There is a fair amount of machine-learning done using Java when it is done via distributed computing on big data. Spark is an important tool in that space to at least know about. If you’re interested in Spark but want to stick to Python, there is also PySpark.
For a more detailed and humorous explanation, there is always this infographic representing computer languages as Lord of the Rings characters, which never goes out of style assuming you’ve already seen the movie.
How to learn?
In-person Bootcamps, Online Courses, Online Lessons, etc.
As mentioned above, I previously wrote a blog post in 2016 about the different ways to learn how to code. We all have different learning style preferences. We also have different life constraints that affect which methods fit into our lifestyle. Much of what I wrote then ties in closely with this post, but from a more generic learning-to-code perspective that is less specific to data science.
Useful Things that Didn’t Exist (I think) in January 2016
One important thing to note is that Microsoft Azure Notebooks and Google Colab didn’t exist in January 2016. If they had, and I had known about them, I would have included them in the previous blog post. These are similar to a Jupyter Notebook but run in the cloud and are accessed via your browser. They will let you get started writing Python without having to deal (at least initially) with the often messy process of installing languages, editors, and code libraries locally on your computer. If you do install things on your local computer, the Anaconda installation method is probably the easiest path forward.
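To give a rough idea of what a first cell in one of these cloud notebooks might look like, here is a minimal sketch that uses only Python’s standard library, so nothing needs to be installed. The porosity values and the `summarize` function name are made up for illustration; they don’t come from any real dataset or library.

```python
import statistics

# Hypothetical porosity measurements (fraction of rock volume),
# invented purely for illustration.
porosity = [0.12, 0.18, 0.21, 0.09, 0.15]


def summarize(values):
    """Return the count, mean, minimum, and maximum of a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


summary = summarize(porosity)

# Print one statistic per line, e.g. "count: 5".
for name, value in summary.items():
    print(f"{name}: {value}")
```

Pasting something like this into a notebook cell and running it is about all it takes to get started; swapping in your own numbers is a reasonable first experiment.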
Build Things People Can Find
Start a GitHub Profile
Why? = Because if you’re self-taught you need to show evidence you can create things and write actual code. The commonly accepted way to do this is to give people a link to your GitHub profile where you have a bunch of public code projects. These can be data visualizations, baby-scale machine-learning projects, whatever. Make sure not all of them are forks or class work where you followed instructions.
If you’re not familiar with the terms, here are some definitions of Git and GitHub. There are services other than GitHub you can use, like GitLab or Bitbucket, but GitHub is the most common.
While on the topic of GitHub, I will note that this repository of “AWESOME OPEN GEOSCIENCE” code projects is something to check out. It lives on GitHub. It contains a wide variety of lesser-known geoscience-domain-specific tools you can use. It started as a conversation I had with others in the Software Underground Slack channel. It is one of the many “Awesome lists” out there for code in a specific domain or application area.
Why? = Because building a personal website is good web programming practice and shows you can build something. Additionally, it can be a way to do personal branding.
- http://justingosses.com/ : This is my personal website. I should probably update it, but I’m sharing it now so you don’t only look at flawless ones and get discouraged. It is mostly WordPress.
- http://kbroman.org/simple_site/pages/user_site.html : This github.io page is a nice template that you can use, substituting in your own content.
- https://medium.com/@svinkle/publish-and-share-your-own-website-for-free-with-github-2eff049a1cb5 : Another tutorial.
Active In-person Learning
Tutorials at Tech Conferences
Why? = Because they’re really good at getting as much of the information coming out of the firehose as possible to go directly into your brain. They can also serve as starter material for a project on your GitHub. Often the tutorials will be based around a library or a type of task. You’ll usually leave with a link to not just the slides but also all the code the instructor ran, which sets you up to learn it more deeply later on. Conferences can be a good way to network too.
Why? = Because hackathons are the fastest way to build things that demonstrate your ability to combine concepts and techniques to solve a real-world problem. They’re also great for networking and learning new things through collaborative problem solving.
The factors that have differentiated good from less good hackathons in my limited experience were a length of at least 5 hours if not 2 days, interesting project ideas, project ideas scaled to the time and skillsets of participants, most participants knowing how to code at least a little, and enough coffee/food that you don’t have to leave.
Good hackathons likely to be in Houston in the future:
- https://events.agilescientific.com/ : Agile runs several a year, typically around conferences. I can attest that the hackathons are quite good.
- http://houstonhackathon.com/ : I’ve never been able to go myself as I’ve always been out of town, but I’m told it’s worth your time.
Why? = Because there’s a reason schools spend a lot of time filling people’s heads via the single-speaker-at-front-of-room format. It is generally effective.
There are a variety of Houston meet-ups in the machine-learning, data science, and Python space. These meet-ups are almost always free. They vary in quality. When they’re not good, it is sometimes because they’ve turned into a vendor pitch or the content was different from what was listed. The Houston energy data science meet-up sometimes falls into the trap of speakers being just a bit too vendor-ish, but usually it is okay. SPE (Society of Petroleum Engineers) sometimes has oil and gas data science “meet-ups”, but they aren’t free, so I never go (hint, hint to anyone at SPE Gulf Coast Section).
Why? = Because not all meet-ups are just a person talking and that’s a good thing. Some of them are more about doing.
Sketch City regularly has people, local government agencies, and non-profits come in to share a bit about their open data and what problems/solutions/visualizations/predictions a data-literate member of the public might make from their data. It is a good meet-up to attend for getting project ideas and networking within the local civic tech or civic-tech-interested crowd.
The Houston Data Visualization Meet-up (disclaimer: I co-lead this one) has both single-speaker format and data-jam format meetings. Data-jams are often on Saturday morning and consist of 10–30 people working in small groups to visualize a dataset they were just given that morning. Often these datasets come from a local community group or the city of Houston, though we’ve also used non-local datasets like ChemCam data from the Mars rover Curiosity or a dataset of Russian bots’ posts on Twitter. In addition to being great starter projects for your portfolio and good networking, this type of meet-up exposes you to a wide variety of GUI and code-library data visualization toolsets. You’ll find out what tools are good for what use cases.
Filling Your Head Digitally
Once you reach a certain level of proficiency, learning will start to become more about keeping up and continuing to grow. The rate of “new” in data science greatly outstrips that in geology. It also occurs in different places. “New” in oil & gas geology tends to mostly occur in yearly conferences, monthly or quarterly journal publications, new corporate best practice documents from on high, and major software updates. “New” in data science occurs in those places too, but it occurs to a much larger extent on Slack, Twitter, podcasts, and Medium articles. New techniques, new results, and entirely new libraries are often announced via those methods before they are published in a journal or integrated into a GUI software application your organization might purchase. The flip side of using the methods below to ingest new data science content is that the deluge can sometimes get overwhelming.
Why? = Because your niche interest area may not overlap with the interests of the people you interact with on a daily basis. Even if it does, the number of people is going to be small. Slack is a way to expand that community discussion digitally. It is an asynchronous communication platform built around channels, each with a different topic. It is similar to older chat programs, but the user design works a lot better. The Software Underground Slack team is all about computing & geoscience. Anyone can join. A few example channels are geospatial, houston, js, kaggle, open-geoscience, python, r-users, reading, and viz.
- https://softwareunderground.org/slack/ : You can join at this link. Cheers to the Agile Scientific group for setting it up.
Why? = Because if Slack, podcasts, Medium, journals, etc. all have a frequency, Twitter vibrates the fastest. Often things appear here before they appear anywhere else. The woman who builds the crazy visualizations that inspire your next project? She’ll post drafts to Twitter. Someone who recently discovered a rarely used but super useful function for your domain in a general-purpose Python library? They’ll post about that on Twitter. Twitter isn’t just data science, of course. You’ll have to curate your feed by following people with good content, and that takes time, but it is an option for ingesting content at the cutting edge.
Why? = Because data science isn’t just in text form.
- Data Skeptic : Data Science Explanations & Discussion
- Undersampled Radio : Geology + Computing
Why? = Because getting a few things into your head via 5–30 minutes of reading is sometimes the exact right size of learning.
- https://hackernoon.com/@kozyrkov : Cassie Kozyrkov, Chief Data Intelligence Engineer at Google. She does a great job condensing the subject matter into small, useful bits of explanation you can use with other people, without becoming fluffy like so many pieces in Forbes or Business Insider that cover similar ground.
- https://medium.com/multiple-views-visualization-research-explained : Explains data visualization research just like the name says. Written by a collection of academic data visualization researchers.
Why? = Well, to be honest, I’m not sure I get that much from LinkedIn, but it is good for finding out about small conferences or meetings with a data science focus that intersect with geology or oil & gas. The groups below have mini-conferences or workshops that center on the intersection of oil & gas and analytics.
- https://www.linkedin.com/company/spe-gulf-coast-section/ : Society of Petroleum Engineers has an active data analytics section in Houston.
- https://www.linkedin.com/company/houston-geological-society/ : Houston Geological Society has an analytics mini-conference in Houston in 2019.
- https://www.linkedin.com/in/ricekenkennedy/detail/recent-activity/ : Rice University has workshops and talks that might be of interest. I’ve previously presented remotely at Rice Data Science day, which has had some geology & machine-learning talks.