What is data science?
It’s all around you. Everywhere. You may not even realize how prevalent it is, but it’s there: on your cell phone, your PC, your camera, and heck, even in your watches and Fitbits!
I am talking about the one thing that today’s world runs on: data. For quite some time now, we’ve all been absolutely deluged by it. Everything we do, from posting on social media and texting to saving a document or running a query, generates huge amounts of data. Even a simple Google search!
Although data immersion is nothing new, you may have noticed that the phenomenon is accelerating. What were once tiny streams of data have turned into barrages of structured, semi-structured, and unstructured data streaming from almost every activity that takes place in both the digital and physical worlds. Welcome to the world of big data!
You might question the purpose of all this data collection. How is all of this data useful?
Just a decade ago, few organizations were in a position to make much use of the data they generated. Today, that has definitely changed. We now have two sets of super cool people: data engineers, who constantly find innovative and powerful new ways to capture, collate, and condense unimaginably massive volumes of data, and data scientists, who analyze this data and derive valuable insights from it to suggest actions that will actually make a big difference.
Data scientists are now plying their trade in what Harvard Business Review famously called the sexiest job of the 21st century.
What is data science? This is a common question whenever you bring up data scientists. Data science produces insights that are valuable to every person working in any industry. It helps you understand and improve your business, investments, planning, and even your personal health, lifestyle and social life.
Who can use data science?
The terms data science and data engineering are often used interchangeably. The two fields, though interdependent, are distinct domains of expertise.
Data science involves deriving actionable insights from raw datasets using computational methods. Data engineering, on the other hand, revolves around overcoming data-processing bottlenecks and data-handling problems for applications that use huge volumes, varieties, and velocities of data. Keeping these two definitions apart is handy for anyone trying to answer the question “what is data science?”
The two disciplines do have something in common: both involve working with the following three varieties of data:
- Structured data: Data that’s stored, processed, and manipulated in a traditional relational database management system and that is explicitly labelled in some way that is human-readable.
- Unstructured data: Data that’s commonly generated from human activities and that doesn’t fit into a structured database format.
- Semi-structured data: Data that doesn’t fit into a structured database system, but is nonetheless structured by tags that are useful for creating a form of order and hierarchy in the data.
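To make the three varieties a little more tangible, here is a minimal Python sketch. The sample records, field names, and the support note are all made up for illustration; CSV text stands in for a relational table and JSON stands in for tagged, semi-structured data.

```python
import csv
import io
import json

# Structured: rows with named columns, as in a relational table
# (a small CSV stands in for the database here).
structured = io.StringIO("customer_id,plan,monthly_fee\n1,basic,20.0\n2,premium,45.5\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no fixed table schema, but tags (keys) still give
# the data order and hierarchy.
semi_structured = json.loads(
    '{"customer_id": 1, "devices": [{"type": "phone"}, {"type": "watch"}]}'
)

# Unstructured: free text written by a person, with no labels to query.
unstructured = "Called support twice this week, still dropping calls near the office."

print(rows[0]["plan"])                        # look up a field by column name
print(semi_structured["devices"][1]["type"])  # walk the tag hierarchy
print(len(unstructured.split()))              # only crude operations, such as a word count
```

Notice how the amount of work needed to pull out a single fact grows as the structure falls away.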
It is a common misconception that only big organizations with massive funding use data science methodologies to improve and optimize their business. That is far from true.
The explosion of data has created a huge demand for insights, and that demand now runs through almost every aspect of business. Since most organizations, big or small, have realized that they need to function in a data-driven and super competitive environment, sound knowledge of data science is emerging as a requirement in almost every line of business. Every business needs to know the answer to the question “what is data science?”
What does this mean for the average person? First and foremost, we need to realize that our culture has changed and we need to keep up with the times. You don’t need to enroll in a full-time course and spend years earning a degree in statistics, computer science, or data science, but you do need to update your skill set. Setting aside some dedicated time to acquire these skills will keep you current. The internet has made learning far easier than it used to be, and resources are everywhere. When it comes to data science, you can learn just about everything through online courses. So who can use data science skills? Anyone with a bit of training and understanding. Everyone can use data insights to enhance their careers and lives: by exploring data science, you’ll learn how to ask the right questions of your data so that you can make smart business decisions and hit business goals.
Let’s look at the pieces of the data science puzzle
To really dive deep into the puzzle of “what is data science”, you have to break it into different pieces.
To practice data science, you need three things: a grounding in mathematics and statistics, the programming skills to work with data and implement machine learning techniques, and subject-matter expertise in a particular area. Without domain expertise, you might be able to call yourself a mathematician or a statistician or a programmer, but not a data scientist.
As the demand for deriving insights from data has increased, industry after industry has begun to adopt data science. Different types of data science have emerged, all of them industry- and domain-specific.
The following are just a few titles under which experts of every discipline are using data science — Ad Tech Data Scientist, Director of Banking Digital Analyst, Clinical Data Scientist, Geo-Engineer Data Scientist, Geospatial Analytics Data Scientist, Retail Personalization Data Scientist, and Clinical Informatics Analyst in Pharmacometrics.
Collecting, querying and consuming data
Data engineers capture and collate huge volumes of structured, unstructured, and semi-structured data that exceeds the processing capacity of conventional database systems. Again, data engineering tasks are separate from the work that’s performed in data science, which focuses more on analysis, prediction, and visualization. Despite this distinction, when a data scientist collects, queries, and consumes data during the analysis process, he or she performs work that’s very similar to that of a data engineer.
Although valuable insights can be generated from a single data source, oftentimes the combination of several relevant sources delivers the contextual information required to drive better data-informed decisions. A data scientist can work off of several datasets that are stored in one database, or even in several different data warehouses. Other times, source data is stored and processed on a cloud-based platform that’s been built by software and data engineers.
No matter how the data is combined or where it’s stored, if you’re doing data science, you almost always have to query the data, that is, write commands that pull the relevant records out of storage, before you can mine it for insights. Most of the time, you use Structured Query Language (SQL) to query data.
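Here is a small sketch of what querying with SQL can look like, using Python’s built-in sqlite3 module and an in-memory database. The table names, columns, and values are invented for illustration; a real project would connect to whatever database or warehouse holds the data.

```python
import sqlite3

# An in-memory database stands in for a real data store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE call_usage (customer_id INTEGER, minutes INTEGER);
    INSERT INTO customers VALUES (1, 'mid-Atlantic'), (2, 'mid-Atlantic'), (3, 'west');
    INSERT INTO call_usage VALUES (1, 300), (1, 120), (2, 50), (3, 700);
""")

# Combine two sources (customers and their usage) and summarize by region.
query = """
    SELECT c.region, SUM(u.minutes) AS total_minutes
    FROM customers AS c
    JOIN call_usage AS u ON u.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

The JOIN is where two separate datasets get combined, which is exactly the kind of context-building described above.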
The data that you access from various sources doesn’t come in an easily packaged form, ready for analysis. Quite the contrary. The raw data not only may vary substantially in format, but you may also need to transform it to make all the data sources cohesive and amenable to analysis. Transformation may require changing data types, the order in which data appears, and even the creation of data entries based on the information provided by existing entries. Throughout exploratory data analysis, you’ll have to think carefully about how the data is stored, processed, and modeled.
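The three kinds of transformation mentioned above (changing data types, changing the order of fields, and creating new entries from existing ones) can be sketched in a few lines of Python. The records, field names, and the fixed reference date are hypothetical and exist only to keep the example reproducible.

```python
from datetime import date

# Raw records as they might arrive from a source system: everything is text
# and the fields come in an inconvenient order.
raw_records = [
    {"signup": "2021-03-15", "monthly_fee": "45.50", "name": "Ana"},
    {"signup": "2019-11-02", "monthly_fee": "20", "name": "Ben"},
]

def transform(record, today=date(2024, 1, 1)):
    """Return a cleaned copy of one raw record."""
    signup = date.fromisoformat(record["signup"])  # change type: text to date
    fee = float(record["monthly_fee"])             # change type: text to number
    tenure_days = (today - signup).days            # new entry derived from existing ones
    # Change order: put the identifying field first.
    return {
        "name": record["name"],
        "signup": signup,
        "monthly_fee": fee,
        "tenure_days": tenure_days,
    }

clean = [transform(r) for r in raw_records]
for row in clean:
    print(row)
```

Writing the cleanup as a function means the same steps can be rerun every time new raw data arrives.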
You may need to use big data solutions like Hadoop to deal with amounts of data too massive for a single computer to handle. Thankfully, many of these tools are open source and free to use, and they are designed specifically for advanced analytics on large amounts of data.
Data science relies heavily on a practitioner’s math and statistics skills precisely because these are the skills needed to understand your data and its significance. They are also valuable because you can use them to carry out predictive forecasting, decision modeling, and hypothesis testing.
Though most of the concepts and formulae used in statistics are derived from the vast knowledge base of mathematics, statistics is treated as a separate discipline with many applications of its own. So it is important to understand the difference between the two fields in order to answer the question “what is data science?”
Mathematics uses deterministic numerical methods and deductive reasoning to form a quantitative description of the world. Statistics is a form of science that’s derived from mathematics but takes a stochastic approach (one based on probabilities) and uses inductive reasoning to form a quantitative description of the world.
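One way to feel the difference between the two approaches is to answer the same question both ways. The sketch below asks how likely it is that two dice sum to 7: the deterministic route counts outcomes exactly, while the stochastic route rolls simulated dice many times (a tiny Monte Carlo simulation). The number of trials and the random seed are arbitrary choices made for this example.

```python
import random

# Deterministic, deductive: 6 of the 36 equally likely outcomes sum to 7.
exact = 6 / 36

def simulate(trials, seed=42):
    """Stochastic, inductive: estimate the same probability from simulated rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / trials

estimate = simulate(100_000)
print(f"exact: {exact:.4f}, simulated: {estimate:.4f}")
```

The simulated figure will not match the exact one digit for digit, but it gets closer as the number of trials grows, and the same trick still works on problems where no neat exact answer exists.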
Data scientists use mathematical methods to build decision models, to generate approximations, and to make predictions about the future.
In data science, statistical methods are useful for getting a better understanding of your data’s significance, for validating hypotheses, for simulating scenarios, and for making predictive forecasts of future events. Advanced statistical skills are somewhat rare, even among quantitative analysts, engineers, and scientists. If you want to go places in data science, though, take some time to get up to speed in a few basic statistical methods, like linear regression, Bayes’ theorem and probability, inferential statistics, ordinary least squares regression, Monte Carlo simulations, and time series analysis.
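As a taste of one of these methods, here is a bare-bones ordinary least squares fit of a straight line, written in plain Python so that nothing is hidden. The monthly usage figures are invented for illustration, and in practice you would probably reach for a statistics library rather than code the formulas by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: data usage in GB over five months.
months = [1, 2, 3, 4, 5]
usage_gb = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(months, usage_gb)
forecast = slope * 6 + intercept  # a simple predictive forecast for month 6

print(f"slope={slope:.2f}, intercept={intercept:.2f}, month 6 forecast={forecast:.2f}")
```

The fitted slope is the average change in usage per month, and plugging in a future month gives a basic forecast.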
The good news is that you don’t have to know everything. It’s not like you need to go out and get a master’s degree in statistics to do data science. A few fundamental statistical concepts and approaches are enough to start solving problems and benefiting from data science skills.
A data scientist may need to know several programming languages in order to achieve specific goals. For example, you may need SQL knowledge to extract data from relational databases. Programming languages such as Python and R are important for writing scripts for data manipulation, analysis and visualization.
The immense datasets that data scientists rely on often require multiple levels of redundant processing to transform into useful processed data. Manually performing these tasks is time consuming and error prone, so programming presents the best method for achieving the goal of a coherent, usable data source.
Given the range of tasks most data scientists take on, a single programming language may not be enough.
You may have to choose other languages to fill out your toolkit. The languages you choose depend on a number of criteria. Here are the things you should consider:
- How you intend to use data science in your code (you have a number of tasks to consider, such as data analysis, classification, and regression)
- Your familiarity with the programming language
- The need to interact with other languages
- The availability of tools to enhance the development environment
- The availability of APIs and libraries to make performing tasks easier
Although coding is a requirement for data science, it really doesn’t have to be the big scary thing people make it out to be. Your coding can be as fancy and complex as you want it to be, but you can also take a rather simple approach. These skills are paramount to success, yet you can pretty easily learn enough coding to practice high-level data science through tutorials on sites such as Codecademy.
Data scientists are required to have strong subject-matter expertise in the area in which they’re working. Data scientists generate deep insights and then use their domain-specific expertise to understand exactly what those insights mean.
Assume you just landed a data science job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, previously huge growth in the wireless market has tapered off.
Communications companies are now engaged in battles to attract each other’s customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo’s vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.
Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially.
The lesson here is that when it comes to solving a data science problem, in many ways it is far more important to have a solid understanding of the data you are looking at than it is to have a PhD in statistics. In this case, you need a good understanding of the telecom industry to solve the problem.
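To make the MegaTelCo question a bit more concrete, here is one deliberately naive first cut in Python. It is only a sketch, not MegaTelCo’s method: it assumes a churn probability per customer is already available from some predictive model, and the offer cost, the offer success rate, and the customer figures are all hypothetical numbers chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    churn_probability: float  # assumed output of some predictive model
    annual_value: float       # assumed revenue if the customer stays

OFFER_COST = 50.0         # hypothetical cost of one retention offer
OFFER_SUCCESS_RATE = 0.3  # hypothetical chance the offer keeps a would-be churner

def expected_gain(customer):
    """Expected revenue saved by making the offer, minus what the offer costs."""
    saved = customer.churn_probability * OFFER_SUCCESS_RATE * customer.annual_value
    return saved - OFFER_COST

def choose_targets(customers, budget):
    """Pick customers by expected gain until the incentive budget runs out."""
    ranked = sorted(customers, key=expected_gain, reverse=True)
    chosen = []
    spent = 0.0
    for customer in ranked:
        if expected_gain(customer) <= 0 or spent + OFFER_COST > budget:
            break
        chosen.append(customer.customer_id)
        spent += OFFER_COST
    return chosen

customers = [
    Customer(1, 0.80, 300.0),
    Customer(2, 0.20, 2000.0),
    Customer(3, 0.90, 100.0),
    Customer(4, 0.50, 600.0),
]
targets = choose_targets(customers, budget=100.0)
print(targets)
```

In this toy data, the customer most likely to leave (customer 3) is not worth an offer, while a fairly loyal but high-value customer (customer 2) is. That is a small hint at why choosing who gets the deal is more complicated than simply ranking people by churn risk, and why knowing the business matters.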
Communicating data insights
Another skill set paramount to a data scientist’s success is communication. As a data scientist, you must have sharp oral and written communication skills. If you can’t communicate, all the knowledge and insight in the world will do nothing for your organization.
Data scientists need to be able to explain data insights in a way that staff members can understand. Not only that, they need to be able to produce clear and meaningful data visualizations and written narratives. Most of the time, people need to see something for themselves in order to understand. Data scientists must be creative and pragmatic in their means and methods of communication.
You’ll have to use your substantive expertise to defend any predictive models you build and the decision making that will have to come after your research into the data.
That’s the scoop on data science. I hope this post gave you a clear explanation of exactly what data science is.
The future belongs to the companies that figure out how to collect and use data successfully. Google, Amazon, Facebook, Netflix, and LinkedIn have all tapped their data and made that the core of their success. They were the vanguard, but now even small businesses are following their path. Whether it’s mining social media data, recommending products based on a user’s purchase history, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. There has never been a better time to understand the answer to the question “what is data science?”
The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.
Data science is, indeed, one of the most important things to understand in the coming years.