If you’re just starting out with data science, you’re likely learning a lot of new terminology. From Hadoop to munging, it can be hard to keep it all straight. That’s where a comprehensive data science glossary comes in. We’ve compiled a list of data science terms below, complete with input from experts in the field.
30 Popular Data Science Terms
Let’s start at the beginning.
Data science
At its essence, data science is a field that works with and analyzes large amounts of data to provide meaningful information that can be used to make decisions and solve problems. Data science includes work in computation, statistics, analytics, data mining, and programming.
Related: What Is Data Science?
Data scientist
An analytical data professional with a high degree of technical skill and knowledge, usually with expertise in programming languages such as R and Python. Data scientists help businesses collect, compile, interpret, format, model, make predictions about, and manipulate all kinds of data in all manner of ways. They’re experts at both construction and deconstruction. Even though the role of data scientist is relatively new, it’s in high demand and pays well.
DJ Patil, who built the first data science team at LinkedIn before becoming the first chief data scientist of the United States in 2015, coined the modern version of the term “data scientist” with Jeff Hammerbacher (Facebook’s early data science lead) in 2008.
Patil has put it this way: “A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data.”
Data analyst
An interpreter of data who typically specializes in identifying trends. They’re similar to data scientists but usually do less programming. One way to think about data analysts is that they’re junior data scientists on their way to becoming full-fledged data scientists.
Martin Schedlbauer, associate clinical professor and director of Northeastern University’s information, data science, and data analytics programs, explains, “Data scientists are quite different from data analysts; they’re much more technical and mathematical. They’ll have more of a background in computer science.”
Related: Career Comparison: Data Analyst vs. Data Scientist
Business analyst
Like a data analyst, but more invested in the actionable implications of data to promote the progress and development of a business.
A business analyst will recommend action based on his or her interpretation of data, such as whether or not a business should continue to sell a particular product. Business analysts can use the work of data scientists to communicate the business side of the data to the ultimate decision-makers.
Related: Career Comparison: Business Analyst vs. Data Analyst
Data engineer
Anyone who designs, QAs, and maintains the systems that data scientists employ daily. Whereas a data scientist might be focused on data analysis, a data engineer focuses more on data preparedness.
“If a data scientist’s job is to analyze and translate data into meaningful and contextual data, it is the data engineer’s job to ideate and build up the software architecture that will enable it,” says Jamie Cambell, former Google security engineer and founder at Gobestvpn.com.
Likewise, they ensure that quality data comes through the pipeline. After all, iron cannot be refined into gold, to paraphrase Mark Twain.
The data engineer is the pit crew to the race car driver. They make sure data scientists have a well-oiled data pipeline to perform their jobs adequately.
Related: Career Comparison: Data Engineer vs. Data Scientist
Data governance
The management of the overall quality, integrity, relevance, and security of available data.
The semantics fit here. It’s not a lot different from governing a place. To govern is to “conduct the policy, actions, and affairs of (a state, organization, or people),” according to Google’s dictionary. Replace “a state, organization, or people” with “data,” and that’s pretty close.
Data governance usually involves a governing body that validates the relevance of data and maintains the status quo to the degree that it prevents disruption of data quality, integrity, or security.
Data set
Quite simply, a collection of data, particularly one that is specifically structured. They can be small and simple to work with or large and complex. Yelp’s popular data set, for example, includes over 1.2 million business attributes like hours, parking, availability, and ambiance.
Related: 19 Free Public Data Sets for Your First Data Science Project
Data mining
A process that data scientists employ to find usable models and insights in data sets. They use numerous techniques to accomplish this task such as regression, classification, cluster analysis, and outlier analysis.
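To make one of these techniques concrete, here is a minimal sketch of outlier analysis using z-scores. The data, the function name `find_outliers`, and the threshold of 2.0 are all assumptions chosen for illustration; real projects often use more robust methods.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily order counts; one day is far from the rest.
daily_orders = [12, 14, 13, 15, 11, 14, 13, 95, 12, 13]
outliers = find_outliers(daily_orders)
```

With this sample, only the day with 95 orders sits more than two standard deviations from the mean.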
Related: Data Mining in Python: A Guide
Data visualization
Any attempt to make data more easily digestible by rendering it in a visual context. Data visualization includes charting, graphing, infographing, and can even include cartooning—in generic use cases.
Data modeling
Modeling is all about turning data into predictive and actionable information. “Building models that can predict and explain outcomes,” says Daniel Jebaraj, vice president at syncfusion.com, a company that provides enterprise-grade software to companies for such purposes as data integration and big data processing.
Often data modeling involves the process of visually documenting complex data using text and symbols.
Data wrangling
The process of formatting or restructuring raw data to suit specific needs or increase its decision-making power (sometimes referred to as data munging). Think in terms of livestock wrangling, if it helps. To wrangle livestock is to herd or move animals to a specific purpose. Rather than livestock, data scientists have, you guessed it, data. To rein in that raw data, whether for legibility or something else, it needs structure.
“This is typically messy work and takes time. Without adequate preparation, results are difficult to use,” Daniel Jebaraj says.
Data scientists often spend somewhere between 50 and 80 percent of their time data wrangling.
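As a rough illustration of what that work looks like, the sketch below cleans a small, messy export using only Python’s standard library. The sample data, the column names, and the choice to treat a blank spend as 0.0 are assumptions made for the example.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing value.
RAW = """name, city ,signup_date,spend
 Alice ,BOSTON,2021-03-01,120.50
bob,boston ,2021-03-04,
CAROL, Chicago,2021-03-09,87
"""

def wrangle(raw_text):
    """Return a list of cleaned row dicts with consistent keys and types."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        # Strip whitespace from headers and values alike.
        row = {key.strip(): value.strip() for key, value in row.items()}
        cleaned.append({
            "name": row["name"].title(),
            "city": row["city"].title(),
            "signup_date": row["signup_date"],
            # Assumption: a blank spend means 0.0; the right default depends on the analysis.
            "spend": float(row["spend"]) if row["spend"] else 0.0,
        })
    return cleaned

rows = wrangle(RAW)
```

After wrangling, every row has the same keys, consistently capitalized text, and a numeric spend, which makes the data far easier to analyze.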
Related: A Comprehensive Introduction to Data Wrangling
Big data
The rise of big data is often linked to Moore’s Law, the observation that the number of transistors on a chip, and with it computing power, doubles roughly every two years. That growth has led to massive data sets generated by millions of computers.
Put simply, big data is a collective term that describes data that is too large to fit on a single computer. Conventional tools like SQL and Excel are typically unable to handle big data, so new ones have been developed to take their place.
Algorithm
A series of repeatable steps, usually expressed mathematically, to accomplish a specific data science task or solve a problem. An important part of a data scientist’s job is his or her ability to recognize an algorithm’s suitability for certain tasks, as it’s impossible to rely on one algorithm as a panacea to all problems.
A few commonly used algorithms in data science include: linear and logistic regression, Naive Bayes, and KNN (K-Nearest Neighbors).
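To show what “a series of repeatable steps” can look like, here is a minimal sketch of KNN in plain Python. The training points, labels, and the function name `knn_predict` are hypothetical; production code would normally use a library implementation.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Step 1: sort labeled points by distance to the query and keep the closest k.
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    # Step 2: count the labels among those neighbors.
    votes = Counter(label for _, label in nearest)
    # Step 3: return the most common label.
    return votes.most_common(1)[0][0]

# Hypothetical training data: (height_cm, weight_kg) -> label.
training = [
    ((150, 50), "small"), ((155, 53), "small"), ((160, 58), "small"),
    ((180, 85), "large"), ((185, 90), "large"), ((190, 95), "large"),
]
prediction = knn_predict(training, (158, 55))
```

The same three steps apply to any labeled data set, which is part of why KNN is a common first algorithm to learn.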
Artificial intelligence
Well-known by its acronym, AI is the apparent ability of machines to act “intelligently” and has become an increasingly popular and useful area of computer science.
The definition of intelligence is broad here, and there’s disagreement about what constitutes machine intelligence. According to Science Daily, the modern definition of AI is “the study and design of intelligent agents,” where an agent is a system that perceives its environment and takes actions that maximize its chances of success.
AI is responsible for everything from the NPCs in your favorite AAA video games to the algorithms Facebook uses to single out and ban inappropriate content.
Machine learning
The computational process wherein a machine “learns” and adjusts its behaviors based on feedback from data. Usually manifesting as an adaptable algorithm, machine learning helps computers predict outcomes without explicit human input.
“Machines learn a function from data without the specific function being explicitly programmed. Given certain inputs what is the function that produces observed outputs? Such a function should also be able to handle previously unseen data (generalize),” adds Daniel Jebaraj.
As more data becomes available, machine learning uses statistical analysis to adjust and update behavior to more accurately predict the future.
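One of the simplest ways to “learn a function from data” is fitting a straight line by least squares. The sketch below is an illustration only: the observations are made up, and `fit_line` and `predict` are names chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that happen to follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the learned function to an input, including one not seen in training."""
    return slope * x + intercept
```

Nobody told the program that the rule was “double it and add one.” It recovered the rule from the examples, and `predict(10)` applies it to a value that was never in the training data.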
Machine learning engineer
A data scientist does the statistical analysis required to determine which machine learning approach to use, then models the algorithm and prototypes it for testing. At that point, a machine learning engineer takes the prototyped model and makes it work in a production environment at scale.
A machine learning engineer isn’t necessarily expected to understand the predictive models and their underlying mathematics the way a data scientist is. A machine learning engineer is, however, expected to master the software tools that make these models usable.
Related: How to Become a Machine Learning Engineer
Deep learning
A branch of machine learning that uses layered artificial neural networks, loosely modeled on the neurons and neural networks associated with thinking in human beings. It’s the villain of many a dystopian sci-fi novel, in which robots become smarter than humans and cause the downfall of mankind. We’re not quite there yet, but recent advances in artificial intelligence employ deep learning technology for speech recognition, translation, and image recognition software.
Supervised learning
A common branch of machine learning in which a data scientist trains the algorithm on labeled examples so that it learns to draw what he or she believes to be the correct conclusions.
“It’s similar to the way a child might learn arithmetic from a teacher,” writes Nikki Castle in this Datascience.com article.
This is distinctly different from unsupervised learning, which does not rely on human guidance. An example use case for supervised learning might include a data scientist training an algorithm to recognize images of female human beings using correctly labeled images of female human beings and their characteristics.
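Image recognition is too large for a short example, so the sketch below uses a simpler stand-in: a nearest-centroid classifier trained on a handful of labeled emails. The features, labels, and function names are all hypothetical; the point is only that a person supplies the correct answers up front.

```python
def train_centroids(examples):
    """Average the feature vectors of each label to get one 'typical' point per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}

def classify(centroids, features):
    """Pick the label whose centroid is closest to the new example."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical labeled examples: (exclamation marks, links) per email, labeled by a person.
labeled = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((5, 3), "spam"), ((7, 4), "spam"), ((6, 5), "spam"),
]
centroids = train_centroids(labeled)
```

Calling `classify(centroids, (6, 3))` then assigns a label to an email the algorithm has never seen, based entirely on the human-labeled examples.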
Unsupervised learning
A branch of machine learning where the algorithm does not rely on human input, and is, instead, self-learning. This more closely resembles what some experts call true artificial intelligence.
This form of machine learning is extremely complicated and is not always the go-to for simpler tasks. However, it can be used to solve complex problems that people would not normally undertake, according to Nikki Castle.
Whereas the supervised algorithm would accept and use the labels assigned to it to classify female human characteristics, an unsupervised algorithm would learn the differences on its own, without human-provided labels, and assign its own groupings to differentiate.
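For a small taste of learning without labels, here is a sketch of k-means clustering on a list of numbers. The data and the function name `k_means_1d` are assumptions for the example, and the way the starting centers are chosen is a simplification (it needs k of at least 2).

```python
def k_means_1d(values, k=2, iterations=10):
    """Group numbers into k clusters without any labels (requires k >= 2)."""
    ordered = sorted(values)
    # Simple deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements that fall into two natural groups.
measurements = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = k_means_1d(measurements)
```

No one tells the algorithm which values belong together. It discovers the two groups itself, though it is still up to a person to decide what those groups mean.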
Reinforcement learning
An area of machine learning, usually treated as distinct from both supervised and unsupervised learning, where the machine seeks to maximize reward. The machine, or “agent,” learns through trial and error as well as reward and punishment.
If you’ve heard of positive and negative reinforcement, those same principles are applied here. Reinforcement learning problems are usually explained in terms of games. Let’s take chess, for example. The machine’s goal is to win at chess. It’s positively reinforced when it makes moves that win material, such as capturing a pawn, and negatively reinforced when it makes moves that lose material, such as having a pawn captured. Combinations of these rewards and punishments result in a self-learning machine that improves at chess over time.
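Chess is far too big for a short example, so the sketch below swaps in a toy game: an agent on a five-square corridor that is rewarded only for reaching the last square. It uses tabular Q-learning, one common reinforcement learning technique. The environment, the constants, and the exhaustive way every move is tried on each pass are all simplifying assumptions.

```python
N_STATES = 5          # positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (assumed values)

def step(state, action):
    """Deterministic toy environment: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(sweeps=200):
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        # Try every move from every non-goal position (exhaustive trial and error).
        for s in range(N_STATES - 1):
            for a in ACTIONS:
                next_state, reward = step(s, a)
                done = next_state == N_STATES - 1
                best_next = 0.0 if done else max(q[(next_state, b)] for b in ACTIONS)
                # Q-learning update: nudge the estimate toward reward + discounted future value.
                q[(s, a)] += ALPHA * (reward + GAMMA * best_next - q[(s, a)])
    return q

q_table = train()
# The learned policy: the best-scoring move from each non-goal position.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
```

The agent is never told that moving right is correct. Moves that lead toward the reward simply accumulate higher scores, and the policy follows those scores.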
API
An acronym that stands for application programming interface. APIs provide users with a set of functions used to interact with and deploy the features of a specific application or service.
Facebook, for example, provides developers of software applications with access to Facebook features through its API. By hooking into the Facebook API, developers can allow users of their own applications to log in using Facebook, or they can access personal information stored in Facebook databases, such as date of birth or workplace.
Python
An object-oriented programming language often used in data science because users have developed an extensive array of tools applicable to the field. Python is free to use for commercial or personal projects, and it’s often commended for its learnability for programmers and non-programmers alike.
Related: An Introduction to Machine Learning in Python
R
An open-source language and environment for statistical computing and analysis. Like Python, R is often used in data science—and knowledge of it is often expected for job applicants. Sometimes considered more difficult to learn than languages like Python, R shines most brightly for its graphical and plotting capabilities and its many data science-driven packages.
Ruby
A scripting language that is also popular with data scientists, though not on the same level as Python and R. It does not contain the volume of specialized libraries available in R and Python, and reasons for using it are mostly historical.
SQL
SQL stands for structured query language, a programming language designed to interact with databases. Of course, where databases are involved, data scientists aren’t far away. SQL is another must-learn language for data scientists in the making.
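To give a feel for the language, the sketch below runs a SQL query from Python using the built-in sqlite3 module and an in-memory database. The `orders` table and its contents are made up for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total spend per customer, keeping only customers who spent more than 10.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()
```

The query describes the result you want (totals per customer, filtered and sorted) and leaves it to the database to work out how to compute it.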
Excel
One of the most used spreadsheet applications on the market. There’s no way you haven’t come into contact with Excel. It’s used in data science for obvious reasons, but it’s used in practically every professional environment and, at the very least, a familiarity with it is expected in any job you’ll encounter. Excel does great with crunching numbers; visualizing data; reading, importing, and exporting CSV files commonly used in data science; and much more.
Hadoop
An open-source software framework that allows data scientists to process big data using clusters of hardware running simple programming models. Many herald Hadoop as a solution to big data problems. It allows you to manage much more data than you can on a single computer.
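The programming model most associated with Hadoop is MapReduce, a name the definition above doesn’t mention. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process. It does not use Hadoop itself, and the function names and sample text are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# On a real cluster, each chunk could live on a different machine.
chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```

Because each chunk is mapped independently, the work can be spread across many machines, which is the idea that lets this style of processing scale past a single computer.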
Pandas
An open-source software library for Python. The library is widely used in the data science community for data manipulation and analysis, and it’s free to use and distribute under the BSD license.
Pandas processes large data sets much faster than Excel, and it has more functionality. You can clean data by applying programmatic methods to it with pandas. You can, for example, replace every missing value in the data set with a default value, such as zero, in one line of code.
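That one-line cleanup might look like the sketch below, which assumes pandas is installed. The sample table is made up, and “error value” is interpreted here as a missing value (NaN), which is what `fillna` targets.

```python
import pandas as pd

# Hypothetical sales table with one missing value.
df = pd.DataFrame({
    "region": ["north", "south", "east"],
    "sales": [120.0, None, 95.5],
})

# The one-liner: replace every missing value with 0.
cleaned = df.fillna(0)
```

`fillna` returns a new DataFrame by default, so the original `df` keeps its missing value unless you assign the result back.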
Decision tree
A tool of data scientists and related professions to visually lay out decisions and decision making. As the name suggests, the visual model for the decision-making process is a tree. It’s widely used in data mining and machine learning.
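In code, a decision tree can be represented as nested questions that lead to a final answer. The sketch below hand-builds a tiny tree rather than learning one from data. The tree, its questions, and the `decide` function are hypothetical.

```python
# Hypothetical tree for deciding whether to play tennis.
# Internal nodes ask a question; leaves (plain strings) hold the decision.
TREE = {
    "question": "outlook",
    "branches": {
        "sunny": {
            "question": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "question": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
    },
}

def decide(node, facts):
    """Walk from the root to a leaf, following the branch that matches each answer."""
    while isinstance(node, dict):
        answer = facts[node["question"]]
        node = node["branches"][answer]
    return node
```

For example, `decide(TREE, {"outlook": "sunny", "humidity": "high"})` follows the sunny branch, then the high-humidity branch, and lands on a decision. Machine learning libraries build trees like this automatically from training data.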
Unstructured data
Any data that does not fit a predefined data model. Often this data does not fit into the typical row-column structure of a database. Images, emails, videos, audio, and pretty much anything else that might be difficult to “tabify” might constitute examples of unstructured data.
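Working with unstructured data often starts with pulling a few structured fields out of it. The sketch below does that for a made-up email using regular expressions. The message, the patterns, and the field names are assumptions, and real emails are far messier than this.

```python
import re

# Hypothetical email: free-form text with no rows or columns.
EMAIL = """From: dana@example.com
Subject: Invoice 4821 overdue

Hi team, invoice 4821 for $1,250.00 was due on 2023-04-15. Please advise.
"""

def extract_fields(text):
    """Pull a sender, a dollar amount, and an ISO date out of free text."""
    sender = re.search(r"^From:\s*(\S+)", text, re.MULTILINE)
    amount = re.search(r"\$([\d,]+\.\d{2})", text)
    due = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    return {
        "sender": sender.group(1) if sender else None,
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "due_date": due.group(1) if due else None,
    }

record = extract_fields(EMAIL)
```

The resulting `record` is a tidy set of fields that fits in a table row, even though the email it came from does not.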
The field of data science is wildly complex and deep. These are just some of the data science terms you’ll encounter often, and they only represent a high-level discussion of the field. If you delve further into each of these data terms, you’ll find even deeper topics for discussion. Hopefully, this serves as a primer to pique the interests of aspiring data scientists, and a reference for those looking to keep things straight.