AI Prompts for Data Science: A Guide to Maximizing Results

In today’s data-driven world, the ability to seamlessly interact with information is paramount. Generative AI, especially tools like OpenAI’s ChatGPT, is revolutionizing how we approach data querying. No longer confined to the realms of data professionals, these AI tools are integrating into mainstream business applications, transforming how we understand and utilize data. From business intelligence platforms to CRM systems, the integration of generative AI is making data querying more intuitive and accessible.

The Rise of AI in Business Applications

As generative AI tools grow increasingly mainstream, everyone, including non-data professionals, must learn to write natural language prompts to query data. OpenAI’s release of the ChatGPT API enabled developers to integrate the chatbot into their apps and products.

Generative AI is becoming a mainstay in business intelligence (BI) tools, CRM systems, and marketing automation software. These base models are fine-tuned on the company’s proprietary data, making them more useful for internal querying.

Salesforce’s Einstein bot was unveiled at the TrailblazerDX developer conference earlier this year. During a demo, the bot helped a hypothetical account executive research a new lead, write an introductory email, and identify the products most suited to the new customer via the Tableau analytics platform. All of this was executed through a series of canned text prompts.

Many BI platforms feature AI and machine learning capabilities that eliminate the need for complex coding and SQL queries. When someone enters a search term, question, or phrase into the system, the software searches for relevant databases containing that keyword combination and generates an answer in plaintext, a report, or a chart.

Some software programs have built-in virtual assistants that auto-generate prompts based on the user’s permissions, role, and prior searches. A sales development rep might receive prompts like “How many discovery calls resulted in demos?” A marketing analyst could ask, “What were the conversion rates for our last email campaign?” and the AI will generate a corresponding data visualization.

Here are a few simple prompts someone might use to understand a dataset:

“Can you help me understand the distribution of my dataset’s feature X?”
“What are the key statistics of column Y in my dataset?”
“Show me a summary of the missing values in my dataset.”
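If you want to check the answers a chatbot gives to prompts like these, the same three questions can be answered directly in pandas. The following is a minimal sketch on a small made-up DataFrame; the column names `X` and `Y` simply mirror the placeholders in the prompts above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; "X" and "Y" stand in for real column names.
df = pd.DataFrame({
    "X": [1.0, 2.0, 2.0, 3.0, np.nan, 10.0],
    "Y": [5, 7, 7, 8, 9, 12],
})

# Distribution of feature X: counts per equal-width bin (missing values are skipped).
x_distribution = pd.cut(df["X"], bins=3).value_counts().sort_index()

# Key statistics of column Y: count, mean, std, min, quartiles, max.
y_stats = df["Y"].describe()

# Summary of missing values: number of missing entries per column.
missing_summary = df.isna().sum()

print(x_distribution)
print(y_stats)
print(missing_summary)
```

Running a few lines like this alongside the chatbot's answer is a quick way to confirm that the numbers it reports match the data.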

Understanding AI Prompts for Data Science

In many ways, generative AI makes data analytics more user-friendly and available to a broader audience. However, you still need strong statistical and programming skills to get the most out of it.

“People like to jump straight into data analysis when they use a tool like ChatGPT, but I think this is problematic because we can’t entirely trust the model’s output,” says Zwingmann.

Project planning is the most critical phase of any data science project. This is where you must:

Define a clear problem/opportunity statement
Define a solution
Define the approach

The success of an analytics project depends on how well you understand the situation, define the problem, and translate the business question into an analytical question.

Project Planning

Generative AI can help with planning, for example by creating a checklist of steps for an end-to-end project.

“ChatGPT is great for going from a high-level problem statement to a smart problem statement,” says Zwingmann. “You can prompt ChatGPT to suggest different issue trees or iterate through these issue trees, and it can help you analyze whether they are mutually exclusive and collectively exhaustive.”

An issue tree is a visual representation that breaks down complex problems into key issues and sub-issues while organizing the relationships between them. It serves as a framework for systematic problem-solving or decision-making.

For example, if you’re analyzing customer retention data for an e-commerce website, you might examine various facets, such as website user experience, product quality, and customer support. Each facet has sub-issues, such as excessive page load time or low social media engagement. The point of the issue tree is to explore cause-and-effect relationships comprehensively.
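One way to make an issue tree concrete before pasting it into a prompt is to write it down as a nested structure. The sketch below is only an illustration: the three branches and two of the sub-issues come from the example above, while the remaining sub-issue names and both helper functions are hypothetical. The overlap check is a rough proxy for mutual exclusivity, not a full test of it.

```python
# Hypothetical issue tree for a customer retention analysis.
# Branches follow the example in the text; most sub-issues are made up.
issue_tree = {
    "Website user experience": ["Excessive page load time", "Confusing checkout flow"],
    "Product quality": ["High return rate", "Negative reviews"],
    "Customer support": ["Slow response time", "Low social media engagement"],
}


def overlapping_sub_issues(tree):
    """Return sub-issues listed under more than one branch.

    An empty result is a necessary (but not sufficient) sign that the
    branches are mutually exclusive.
    """
    seen = {}
    for branch, leaves in tree.items():
        for leaf in leaves:
            seen.setdefault(leaf, []).append(branch)
    return {leaf: branches for leaf, branches in seen.items() if len(branches) > 1}


def to_prompt_text(tree, root):
    """Render the tree as indented text that can be pasted into a prompt."""
    lines = [root]
    for branch, leaves in tree.items():
        lines.append(f"  - {branch}")
        lines.extend(f"      - {leaf}" for leaf in leaves)
    return "\n".join(lines)


print(overlapping_sub_issues(issue_tree))
print(to_prompt_text(issue_tree, "Why is customer retention falling?"))
```

Whether the tree is collectively exhaustive still needs human judgment, which is where asking the model to critique the tree can help.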


“ChatGPT doesn’t completely understand what an issue tree is in the way data scientists use it,” says Zwingmann. “The best option is to give it a few examples of what an issue tree is–and maybe even feed it some bad examples that aren’t mutually exclusive and collectively exhaustive.”

You can also upload a dataset to ChatGPT using a plug-in (more on that later!) and ask: “What are the three most interesting trends in this data?” These trends will help you pinpoint questions you can answer through your analysis or potential hypotheses to test.

Structure of a Data Science Prompt

Explain the purpose of the dataset – Mention what the dataset is for and what’s in it.

“I have a loan dataset for a mid-sized bank. The loan dataset consists of 8,500 rows and 10 columns: [‘Loan_ID’, ‘Gender’, ‘Married’, ‘Dependents’, ‘Education’, ‘Self_Employed’, ‘#ApplicantIncome’, ‘#CoapplicantIncome’, ‘#LoanAmount’]. Can you list the steps I should follow to develop an end-to-end project for my portfolio?”

State the goal of your analysis

“Please account for class imbalance, and focus on accurately predicting whether a loan will not be paid back rather than whether it will be.”

State your requirements:

“We will create a web app using Gradio and deploy it on Spaces. We will not monitor the model in the production environment.”
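If you reuse this three-part structure often, it can be handy to assemble the prompt programmatically. The sketch below is one possible way to do that; `build_prompt` and its parameter names are hypothetical, and the example strings are shortened versions of the prompts above.

```python
def build_prompt(dataset_description, goal, requirements):
    """Join the three parts of a data science prompt into one message."""
    parts = [dataset_description, goal, requirements]
    # Skip empty parts and trim whitespace so the prompt stays tidy.
    return "\n\n".join(part.strip() for part in parts if part and part.strip())


prompt = build_prompt(
    dataset_description=(
        "I have a loan dataset for a mid-sized bank. "
        "Can you list the steps I should follow to develop an end-to-end "
        "project for my portfolio?"
    ),
    goal=(
        "Please account for class imbalance and focus on predicting "
        "whether a loan will not be paid back."
    ),
    requirements=(
        "We will create a web app using Gradio and deploy it on Spaces. "
        "We will not monitor the model in the production environment."
    ),
)
print(prompt)
```

Keeping the three parts separate makes it easier to change the goal or the requirements without rewriting the dataset description each time.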

How To Upload Datasets to ChatGPT

Developers have scrambled to build browser extensions and other workarounds to enable file uploads. Currently, there are four methods for uploading files to ChatGPT.

Plug-ins (ChatGPT Plus users only)

The Code Interpreter plug-in is a beta feature embedded within ChatGPT, so you don’t need to access the plug-in store to activate it. This is a version of ChatGPT that knows how to write and execute Python code and can work with file uploads. Simply go to Settings > Beta features and enable Code Interpreter. You’ll notice a + button in the text field. Click this button to upload a file.

“You can upload your data in a CSV file using Code Interpreter and ask it to come up with an EDA,” says Zwingmann. “You can copy-paste the Python code, double-check it, and make corrections where necessary.”

You can find other plug-ins, such as AskYourPDF, in the plug-in store. Locate the desired plug-in, install it, and activate it. When you type “upload a document” into ChatGPT’s text field, you’ll receive a link to upload your document. You’ll be given a unique document ID when the upload is completed. Copy-paste the document ID into the chatbot to begin querying your document.


Browser Extensions

ChatGPT File Uploader is a Chrome browser extension that supports the following file formats: .txt, .js, .py, .css, .json, .csv, .pdf, .doc, .docx, .xls, and .xlsx. This extension is useful for large datasets as it uploads your file in small chunks, bypassing ChatGPT’s 15,000-character limit. Once you install and enable the browser extension, you’ll see a green ‘Submit File’ button above the text field when you next sign into ChatGPT.

Copy-paste

You can copy-paste data directly into ChatGPT’s text field. However, bear in mind that ChatGPT’s character limit is 15,000 characters.
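If your data is longer than that, you can split it into pieces yourself before pasting. Here is a minimal sketch of that idea; `chunk_text` is a hypothetical helper, and the default limit is the 15,000-character figure quoted above. It breaks on line boundaries where it can, so CSV rows are not cut in half unless a single row is longer than the limit.

```python
def chunk_text(text, limit=15_000):
    """Split text into pieces of at most `limit` characters.

    Pieces end on line boundaries where possible; only a single line
    longer than the limit is cut mid-line.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Cut an over-long line into limit-sized pieces.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


# Made-up CSV content that is well over the limit.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(5000))
chunks = chunk_text(csv_text)
print(len(chunks), [len(chunk) for chunk in chunks])
```

When pasting chunks one after another, it helps to tell the model up front how many parts to expect and to ask it to wait until the last one before answering.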

Workarounds

Developers have found ways to implement file upload buttons by writing scripts. Here is one such tutorial on generating a JS script that creates a button with the text ‘Submit File’ and inserts it into the DOM.

Note: Before uploading a dataset to ChatGPT, ensure it does not contain sensitive information (e.g., personally identifiable information (PII), financial data, biometric data, or political and religious views) and is not proprietary. Information in text prompts and in files uploaded to ChatGPT may be ingested into the model to further its training, so data privacy is far from guaranteed.
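One simple precaution is to strip or mask sensitive columns locally before anything leaves your machine. The sketch below assumes a made-up customer table; the column names, the `SENSITIVE_COLUMNS` list, and the `pseudonymize` helper are all hypothetical. Note that hashing an identifier is pseudonymization rather than true anonymization, so treat this as a starting point, not a guarantee.

```python
import hashlib

import pandas as pd

# Hypothetical dataset containing personally identifiable information.
df = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "full_name": ["Ada Example", "Bo Sample"],
    "email": ["ada@example.com", "bo@example.com"],
    "loan_amount": [12000, 8500],
})

# Assumed list of columns to remove; adapt it to your own data.
SENSITIVE_COLUMNS = ["full_name", "email"]


def pseudonymize(value):
    """Replace an identifier with a short SHA-256 digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


# Drop the sensitive columns and mask the remaining identifier.
safe_df = df.drop(columns=SENSITIVE_COLUMNS)
safe_df["customer_id"] = safe_df["customer_id"].map(pseudonymize)

print(safe_df)
```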

ChatGPT recently launched a new feature enabling users to turn off chat history. This excludes specific conversations from being reused to train the model.

Use Cases for AI Prompting in Data Science

Data Cleaning Using ChatGPT

Messy, noisy data skews your analysis. After sifting through the data and identifying “noise”—inconsistencies, duplicates, lack of naming conventions—you can ask ChatGPT to write code that executes specific data cleaning tasks.

1. Duplicate removal:

“Can you help me write a script to identify and remove duplicate entries from my dataset?”

2. Missing values imputation:

“I have a dataset with missing values in some columns. How can I impute these missing values?”

3. Outlier detection:

“I’m dealing with outliers in my data that are affecting my analysis. How can I detect and handle outliers effectively?”

4. Data formatting:

“My dataset has inconsistent date formats. How can I standardize them to a single format? Generate the corresponding code in Python.”

5. Text data cleaning:

“I’m working with a text dataset with a lot of noise, like special characters and typos. What are some techniques to clean up this text data?”

6. Categorical variable cleaning:

“I have categorical variables with various naming conventions. How can I standardize these categories for consistent analysis?”

7. Data type conversion:

“My dataset has columns with incorrect data types. How can I convert them to the appropriate data types?”

8. Removing irrelevant columns:

“There are many columns in my dataset that I don’t need for analysis. What’s the best way to identify and remove irrelevant columns?”

9. Handling inconsistent data:

“My dataset has inconsistent values representing the same thing (e.g., ‘Male’ and ‘M’ for gender). How can I clean and consolidate these values?”

“The text data in my dataset contains a lot of spelling mistakes. How can I correct these errors to improve the data quality?”

“There are special characters and symbols in my data causing issues. How can I remove or replace these characters?”

10. Data transformation:

“I need to transform certain variables in my dataset (e.g., log-transform skewed numeric variables). How can I do this effectively?”
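To give a sense of the kind of code these prompts tend to produce, here is a minimal pandas sketch covering several of the tasks above: duplicate removal, type conversion, median imputation, IQR-based outlier flagging, date standardization, category consolidation, and a log transform. The dataset, the column names, the list of date formats, and the 1.5 × IQR rule are all assumptions chosen for illustration; the right choices depend on your data.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical messy dataset.
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Ben", "Cara", "Dev"],
    "gender": ["F", "F", "Male", "female", "M"],
    "signup_date": ["2023-01-05", "2023-01-05", "05/02/2023", "2023/03/03", "2023-04-10"],
    "income": ["52000", "52000", "61000", None, "1000000"],
})

# 1. Duplicate removal.
df = raw.drop_duplicates().reset_index(drop=True)

# 2. Data type conversion: income arrives as text.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# 3. Missing value imputation with the median.
df["income"] = df["income"].fillna(df["income"].median())

# 4. Outlier detection with the 1.5 * IQR rule (flag, don't delete).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# 5. Date formatting: try each assumed input format, emit ISO dates.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]


def standardize_date(text):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review


df["signup_date"] = df["signup_date"].map(standardize_date)

# 6. Consolidate inconsistent category labels.
GENDER_MAP = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}
df["gender"] = df["gender"].str.strip().str.lower().map(GENDER_MAP)

# 7. Log-transform the skewed numeric column.
df["log_income"] = np.log1p(df["income"])

print(df)
```

Whatever code the model generates, review each step against your own data before running it, since choices such as the imputation method or the outlier threshold can change the results of the analysis.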

Exploratory Data Analysis (EDA) Using ChatGPT

EDA is the process of running initial investigations on the data to discern patterns. The point is to discover the high-level contents of your dataset and identify aspects that are ripe for analysis. You also want to determine if the data is structured or unstructured, and how noisy it is. Doing this prep work makes data cleaning more effective because you can modify the data for a specific problem.

Sample Prompts for Exploratory Data Analysis

Data Overview:

What are the dimensions (rows and columns) of the dataset?
What are the features (variables) present in the dataset?
What target variable should I try to predict (e.g., loan approval)?
Are there any missing values in the dataset? How prevalent are they across features?

Univariate Analysis:

What is the distribution of the target variable? Is it balanced or imbalanced?
For categorical features (e.g., gender, education), what are the different categories and their frequencies?
For numerical features (e.g., income, loan amount), what are the summary statistics (mean, median, standard deviation)?
Are there any outliers in the numerical features? How might they affect my analysis?

Bivariate Analysis:

How does the target variable vary across different categories of categorical features? (E.g., loan approval by gender, education level)
Are there any correlations between numerical features and the target variable? (E.g., correlation between income and loan approval)
Are there any patterns or trends in scatter plots between pairs of numerical features?
Can you identify any significant differences in distributions when comparing target classes?

Multivariate Analysis:

Can you visualize interactions between multiple categorical features and the target variable?
Are there combinations of features that seem to influence loan approval more than individual features alone?
How do the relationships between features change when considering the target variable?
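Many of these questions map onto one or two lines of pandas, which is useful for checking what the model tells you. The sketch below uses a tiny made-up loan table; the feature names echo the loan dataset described earlier, while the `Loan_Status` target column and all of the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical loan data; "Loan_Status" is an assumed target column.
loans = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Education": ["Graduate", "Graduate", "Not Graduate",
                  "Not Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5000, 3000, 2500, 6000, None, 4000],
    "LoanAmount": [130, 66, 120, 141, 95, 110],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "Y"],
})

# Data overview: dimensions and share of missing values per feature.
n_rows, n_cols = loans.shape
missing_share = loans.isna().mean()

# Univariate: target balance and summary statistics of a numeric feature.
target_balance = loans["Loan_Status"].value_counts(normalize=True)
income_stats = loans["ApplicantIncome"].agg(["mean", "median", "std"])

# Bivariate: approval rate within each education level.
approval_by_education = pd.crosstab(
    loans["Education"], loans["Loan_Status"], normalize="index"
)

# Multivariate: approval rate for each gender/education combination.
approval_by_gender_education = (
    loans.assign(approved=loans["Loan_Status"].eq("Y"))
    .groupby(["Gender", "Education"])["approved"]
    .mean()
)

print(n_rows, n_cols)
print(missing_share)
print(target_balance)
print(income_stats)
print(approval_by_education)
print(approval_by_gender_education)
```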

Feature Selection and Model Evaluation

Feature selection is the process of selecting a subset of relevant features (variables) from your dataset to improve the performance of your machine learning models. Too many features can lead to overfitting and unnecessarily high computational cost.

Understanding the Dataset

ChatGPT can help you analyze the dataset to identify potential features that might be irrelevant or have little impact on the model’s performance.

Prompt: “Help me understand the characteristics of my dataset. It’s a [brief description of your dataset]. Tell me about the number of features, the types of features (numeric, categorical, text, etc.) and any initial insights you can provide.”

Correlation Analysis

ChatGPT can guide you through calculating correlation coefficients between features and the target variable to identify which features are more strongly related to the target.

Prompt: “Calculate the correlation coefficients between the features in my dataset and the target variable. I’d like to identify which features correlate more strongly with the target.”
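In pandas, a calculation along these lines might look like the following sketch. The data and column names are made up, and `approved` is an assumed binary target. Pearson correlation only captures linear relationships, so a low value does not prove that a feature is useless.

```python
import pandas as pd

# Hypothetical numeric dataset with a binary target column "approved".
data = pd.DataFrame({
    "income": [2500, 3000, 4000, 5000, 6000, 7500],
    "loan_amount": [120, 95, 110, 130, 141, 150],
    "dependents": [3, 0, 2, 1, 0, 1],
    "approved": [0, 0, 1, 1, 1, 1],
})

# Pearson correlation of every feature with the target,
# ranked by absolute strength.
correlations = (
    data.corr()["approved"]
    .drop("approved")
    .sort_values(key=abs, ascending=False)
)

print(correlations)
```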

Feature Importance

If your model can provide feature importance scores (e.g., in tree-based models), ChatGPT can help you interpret these scores to prioritize features.

Prompt: “I have trained a machine learning model and obtained feature importance scores for the input features. The model is based on [insert model type, e.g., Random Forest, XGBoost, etc.]. The feature importance scores indicate the contribution of each feature to the model’s predictions. Could you help me interpret these scores? Specifically, I would like to understand:

How to interpret feature importance scores in general.
What high feature importance means for a specific feature.
Whether a high score always means a feature is highly predictive.
How to use feature importance scores for feature selection or further analysis.”
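To see what such scores look like in practice, here is a small scikit-learn sketch on synthetic data. The three features ("signal", "weak", "noise") are invented so that the expected ranking is known in advance. It compares the impurity-based importances of a random forest with permutation importance measured on held-out data, which is one common way to sanity-check the scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on "signal",
# weakly on "weak", and not at all on "noise".
rng = np.random.default_rng(0)
n_rows = 400
signal = rng.normal(size=n_rows)
weak = rng.normal(size=n_rows)
noise = rng.normal(size=n_rows)
X = np.column_stack([signal, weak, noise])
y = (signal + 0.3 * weak + rng.normal(scale=0.3, size=n_rows) > 0).astype(int)
feature_names = ["signal", "weak", "noise"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_scores = dict(zip(feature_names, model.feature_importances_))

# Permutation importances: drop in test accuracy when a feature is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
permutation_scores = dict(zip(feature_names, perm.importances_mean))

print(impurity_scores)
print(permutation_scores)
```

Impurity-based scores are relative shares and can be inflated for some kinds of features, so a high score alone is not proof that a feature is highly predictive. Comparing the two sets of scores is one way to spot features whose importance does not hold up on unseen data.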

Generate Code for Selecting Features

ChatGPT can provide code examples using libraries like scikit-learn to implement feature selection techniques such as Recursive Feature Elimination (RFE), SelectKBest, and more.
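As an illustration of what such code might look like, the following sketch applies SelectKBest and RFE to a synthetic classification dataset. The dataset, the choice of three features to keep, the ANOVA F-test scoring function, and the logistic regression estimator are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# SelectKBest: keep the 3 features with the highest ANOVA F-scores.
kbest = SelectKBest(score_func=f_classif, k=3).fit(X, y)
kbest_idx = kbest.get_support(indices=True)

# RFE: repeatedly fit the estimator and drop the weakest feature
# until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

print("SelectKBest kept columns:", kbest_idx)
print("RFE kept columns:", rfe_idx)

# Reduced feature matrix for downstream modeling.
X_reduced = kbest.transform(X)
```

The two methods will not always agree, since SelectKBest scores each feature on its own while RFE judges features in the context of a fitted model.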

Data Visualizations Using ChatGPT (Requires Plug-In)

While ChatGPT can’t generate charts and graphs on its own, it can provide guidance and code snippets for data visualizations. For example, it can describe the steps to create visualizations using popular libraries like Matplotlib or Seaborn in Python. Upload your dataset using a plug-in, and ChatGPT will recommend the best way to visualize it, and even generate the corresponding code. You can also ask what data to visualize, particularly if you have a vast dataset.

Alternatively, plug-ins like Noteable enable you to upload datasets to ChatGPT and generate charts and graphs using a custom prompt. The plug-in also provides a notebook with all of the Python code used for the visualization. (Note: You must be a paid ChatGPT Plus user to enable plug-ins from the plug-in store.)

Figure: Cumulative returns on equities, ETFs, funds, indices, currencies, cryptocurrencies, and money markets. (Source: Noteable)

ChatGPT can read multiple datasets simultaneously if you provide the links. You must understand how to write Python code to use the Noteable plug-in. Simply provide a link to the dataset you want to use and link it to a new project on Noteable. You can check the complete report by accessing your Noteable workspace and opening the .ipynb file. This file contains all the code ChatGPT has written, with graphs and visualizations.


How To Structure a Data Visualization Prompt

Explain what type of data each dataset contains and what it’s for.

“Each dataset represents the rating for the game FIFA for the years 2017, 2018, 2019, 2020, and 2021.”

Specify what variables you’re interested in analyzing. If you want ChatGPT to disregard any rows or columns in your dataset, mention it here.

“Analyze soccer players only from the following countries: United States, Canada, England, Brazil, and Argentina.”

Define what type of visualization you want for each hypothesis you’re testing.

“Make a histogram and a boxplot to explore the average height of players in these countries” or “Make a scatterplot to see how the weight of players is distributed.”
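For reference, the code a prompt like this leads to might resemble the following Matplotlib sketch. The player table, its column names (`nationality`, `height_cm`, `weight_kg`), and all of the values are made up for illustration; a real FIFA ratings file will have its own schema.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical player data.
players = pd.DataFrame({
    "nationality": ["United States", "Canada", "England", "Brazil", "Argentina",
                    "France", "Brazil", "England", "Canada", "Argentina"],
    "height_cm": [183, 178, 185, 175, 170, 188, 180, 191, 176, 173],
    "weight_kg": [80, 74, 82, 70, 68, 85, 77, 88, 72, 69],
})

# Keep only the countries named in the prompt.
countries = ["United States", "Canada", "England", "Brazil", "Argentina"]
subset = players[players["nationality"].isin(countries)]

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of player height.
ax_hist.hist(subset["height_cm"], bins=5)
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Players")

# Boxplot of height per country.
grouped = list(subset.groupby("nationality"))
ax_box.boxplot([group["height_cm"].to_numpy() for _, group in grouped])
ax_box.set_xticklabels([name for name, _ in grouped], rotation=45)
ax_box.set_ylabel("Height (cm)")

# Scatterplot of weight against height.
ax_scatter.scatter(subset["height_cm"], subset["weight_kg"])
ax_scatter.set_xlabel("Height (cm)")
ax_scatter.set_ylabel("Weight (kg)")

fig.tight_layout()

# Save to an in-memory buffer; pass a file path instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```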

Wrapping Up

The integration of AI prompts into data science is revolutionizing the way businesses operate and make data-driven decisions. With tools like ChatGPT becoming more accessible, it’s essential for professionals across industries to harness the power of generative AI effectively. Crafting precise and effective prompts not only enhances the accuracy of data retrieval but also streamlines complex analytical processes. As we navigate the future of data science, the role of AI prompting will undoubtedly become even more pivotal, emphasizing the need for continuous learning and adaptation in this dynamic domain.
