An Introduction to Data Science in Python
Many thought leaders see data as the key to building the society of the future. Thanks to the open-source culture that has largely dominated information technology, both data and the tools to process it are widely available and accessible to everyone. However, choosing among those tools and mastering them is not always trivial. With so many options, which language to learn first or focus on is one of the questions newcomers ask most often.
Python, which started 25 years ago as a hobby project for its developer, is now widely taught at universities as the first programming language in introductory courses, surpassing much older and more established languages. It owes its popularity to several factors, including simplicity, intuitiveness, and a strong community. Although Python is used in many domains thanks to its large variety of libraries and frameworks, its spread among the data science community is especially noteworthy. It is a very popular language among data scientists, with uses ranging from everyday data analysis to big data. Many experts have been creating and perfecting data science libraries for Python as volunteers, which makes it possible to build state-of-the-art data processing and analysis tools in Python. Recently, big companies such as Google and Microsoft have also started to back those open-source efforts. Data science with Python has exploded as a result of this booming ecosystem.
Simplicity has been embedded in Python’s philosophy from the very beginning of the language. Python programming differs from much of the wider programming culture in that doing something in a “clever” way is not in itself seen as desirable. It’s preferred to do a given task in a definite, clear, and obvious style, so that programmers don’t have to choose among many methods for one purpose, and the implementation stays comprehensible to other people reading the code. There is also PEP 8, the style guide that aims to standardize how Python code is written.
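The contrast between “clever” and clear code can be made concrete with a small sketch. The two functions below are hypothetical examples written for this article, not taken from any library. Both check whether an integer is even, but only the second says so at a glance.

```python
def is_even_clever(n):
    # "Clever": relies on the lowest bit of the integer being 0 for even numbers.
    # Correct, but the reader has to stop and decode the bit trick.
    return not n & 1


def is_even_clear(n):
    # Clear: states the definition of an even number directly.
    return n % 2 == 0


# Both versions agree on every input; the difference is purely readability.
results_match = all(is_even_clever(n) == is_even_clear(n) for n in range(-10, 11))
print(results_match)
```

In Python culture, the second version would usually be preferred, even though both are correct.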
Python is also known for being a high-level language that resembles natural human language. The term “high-level” here means that in Python it is usually not necessary to deal with the details of how a script does its job internally, such as how it manages memory. A natural and fluent style of writing code is often called Pythonic. It’s quite common for a line of Python code that accomplishes a given task to sound almost like giving an order to an intelligent robot in plain English. Data science with Python is easier to do right as a result.
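As a small illustration of how close Pythonic code can come to plain English, here is a minimal sketch. The temperature values and variable names are made up for the example.

```python
# Hypothetical data, used only for illustration.
temperatures = {"Monday": 21.5, "Tuesday": 19.0, "Wednesday": 24.2}

# Reads almost like: "the days whose temperature is above 20".
warm_days = [day for day, temp in temperatures.items() if temp > 20]

# Reads almost like: "if Tuesday is not in the warm days, say so".
if "Tuesday" not in warm_days:
    print("Tuesday was not a warm day")

# Reads almost like: "the day with the maximum temperature".
hottest_day = max(temperatures, key=temperatures.get)
print(hottest_day)
```

Note that nothing in the snippet allocates or frees memory explicitly; the interpreter takes care of it.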
A programming language, no matter how magnificent it is in itself, can’t exist in a vacuum for long. Every programmer, whatever their skill level, will need support from time to time. A strong, involved, and widespread community is one of the greatest advantages of Python. Answers to most of your questions when writing in Python are only a Google search and a few clicks away. If that doesn’t solve your problem, there are always eager professionals ready to help you on platforms like Stack Overflow and Codementor. Apart from general problems related to the language itself, if you have questions about how to implement something with a specific Python library, you have a pretty good chance of solving it by asking for help on the GitHub page of that project. (Moreover, please don’t forget to help others in return when you’ve reached a certain level of proficiency!)
There are many great free resources online for learning Python. For those of you who haven’t done anything related to programming before, the Non-Programmer’s Tutorial for Python 3 is a good starting point. However, it will also benefit you greatly if you learn a bit about general, language-agnostic principles of programming. For example, software design patterns are useful (and sometimes necessary) tools to write well-structured applications in any language.
This free ebook is a nice reference. Programming Foundations with Python is a free video course with exercises that will help you grasp the fundamentals of both Python and programming in general (you can also find data science courses on Udacity taught by well-known experts). If you prefer to learn by actually writing code, I recommend Codecademy as a Python tutorial where you face coding challenges that progress from easy to more advanced.
To get the most out of Python in your data-related projects, start with the SciPy stack, a set of programming tools originally devised for scientific computing and now well known as a basic data science framework filled with helpful Python modules. It includes packages such as NumPy, which provides the necessary tools for working with vectors, matrices, linear algebra, and random variables. Matplotlib makes it possible to visualize data in various ways to make it more comprehensible. Seaborn is a powerful data visualization library for Python. Pandas provides data structures that are fast, reliable, and easy to use, and allows for easy data manipulation. IPython notebooks in the Anaconda environment help you create documents that combine Python code, output, and visuals, making it easy to modify snippets of Python code and see the results immediately. They are all powerful data science tools to add to your skillset.
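To give a feel for these tools, here is a minimal sketch that touches NumPy and pandas. It assumes both packages are installed; the matrix, the city names, and the temperature values are invented for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: matrices, vectors, and linear algebra.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)  # solves the linear system A @ x = b

# NumPy: random variables, here 1000 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# pandas: a small table of hypothetical measurements.
df = pd.DataFrame({
    "city": ["Ankara", "Istanbul", "Ankara", "Istanbul"],
    "temperature": [12.0, 15.0, 14.0, 17.0],
})

# Group the rows by city and average the temperature within each group.
mean_by_city = df.groupby("city")["temperature"].mean()
print(mean_by_city)
```

With Matplotlib installed, a result like `mean_by_city` could then be drawn as a chart, for example with `mean_by_city.plot(kind="bar")`.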
Machine Learning is one of the most prominent areas in Data Science. Data science with Python makes it easy to explore the fundamentals of machine learning. Once you’ve learned a few basic Machine Learning algorithms such as linear regression and logistic regression, Python’s scikit-learn library makes it surprisingly easy to build ready-to-use Machine Learning systems that you can train with the data at hand and use for prediction. As you advance further in Machine Learning and feel confident enough to customize things, you can move on to more advanced libraries such as Theano, Keras, and TensorFlow, which can be used for deep learning and neural networks.
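The train-then-predict workflow mentioned above can be sketched in a few lines. This is a minimal example that assumes NumPy and scikit-learn are installed; the data is synthetic and deliberately trivial, so it shows the shape of the workflow and is not a realistic model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, noise-free data (an assumption for illustration): y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)  # one feature, ten samples
y = 2.0 * X.ravel() + 1.0

# Linear regression: train on the data, then predict for an unseen input.
reg = LinearRegression().fit(X, y)
prediction = reg.predict(np.array([[10.0]]))[0]  # close to 21 for this data

# Logistic regression: classify whether x is at least 5 (label 1) or not (label 0).
labels = (X.ravel() >= 5).astype(int)
clf = LogisticRegression().fit(X, labels)
predicted_labels = clf.predict(np.array([[0.0], [9.0]]))

print(prediction, predicted_labels)
```

Both estimators share the same `fit` and `predict` methods, which is a large part of why swapping one algorithm for another in scikit-learn takes so little code.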
Besides Python, there are other languages widely used in data analytics, statistics, and Machine Learning. MATLAB is a professional language and environment used in all areas of science and engineering. Although it comes with very powerful tools, it’s not open source and is relatively expensive. There’s also R, which is open source and comes with its own rich set of libraries. However, unlike Python, which is a general-purpose language, R is more of a specialized tool aimed mostly at statistical computing. Because data analytics is rarely a pure and isolated process, this may make it necessary to use other languages besides R in certain applications. In comparison, data science with Python makes it possible to get pretty much everything done in a single environment. There are other general-purpose languages, such as Go, a brainchild of Google, which you can choose to use in data-related projects. The main advantage of Go compared to Python is its speed. Python is generally slower than Go (and than many other common languages) because it is an interpreted, dynamically typed, high-level language, and its implementation of multi-threading may be problematic for some: in the standard CPython interpreter, the global interpreter lock has traditionally prevented threads from running Python code in parallel. By contrast, Python’s established data science libraries and involved community are its most significant advantages over Go. Data science with Python is made easier by the great community support that comes with it.
To recap, Python is an ideal choice for those who are interested in scraping, retrieving, processing, and analyzing data. With so much widely available data and the powerful tools that come with Python, the possibilities of what you can do are practically endless. Who knows, maybe one day, after you master data science and programming, you can even win a Kaggle competition with the algorithm you’ve implemented in Python!
This article was written by Volkan Erdoğan, a content contributor for Codementor who studied Molecular Biology and Genetics at Boğaziçi University. As a graduate student, he applied Machine Learning algorithms to analyze the genetics of epilepsy. He’s proficient in several programming languages, including Python, and commonly uses it for data processing and analysis.