{"id":12940,"date":"2021-10-08T11:52:13","date_gmt":"2021-10-08T18:52:13","guid":{"rendered":"https:\/\/www.springboard.com\/?p=12940"},"modified":"2023-06-29T02:15:33","modified_gmt":"2023-06-29T09:15:33","slug":"decision-tree-implementation-in-python","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/decision-tree-implementation-in-python\/","title":{"rendered":"Decision Tree Implementation in Python with Example"},"content":{"rendered":"\n<p>A decision tree is a simple representation for classifying examples. It is a supervised machine learning technique in which the data is repeatedly split according to the value of a chosen attribute. Decision tree analysis can help solve both classification &amp; regression problems. The decision tree algorithm breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. A complete decision tree consists of nodes (which test the value of a certain attribute), edges\/branches (which correspond to the outcome of a test and connect to the next node or leaf) &amp; leaf nodes (the terminal nodes that predict the outcome). In this blog post, we are going to learn about decision tree implementation in Python, using the scikit-learn package. 
<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-node.png\" alt=\"Decision Node\" class=\"wp-image-46797\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-node.png 600w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-node-400x267.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-node-380x253.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-node-380x253.png 420w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><figcaption class=\"wp-element-caption\">Source: <a href=\"http:\/\/Javatpoint.com\" target=\"_blank\" data-type=\"URL\" data-id=\"Javatpoint.com\" rel=\"noreferrer noopener\">Javatpoint<\/a><\/figcaption><\/figure>\n\n\n\n<p>For our analysis, we have chosen a dataset from the field of medical sciences that will help predict whether or not a patient has diabetes, based on the variables captured in the dataset. This information has been sourced from the National Institute of Diabetes and Digestive and Kidney Diseases and includes predictor variables like a patient\u2019s BMI, pregnancy details, insulin level, age, etc. Let\u2019s dig right into solving this problem using a decision tree algorithm for classification.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Decision Tree Implementation in Python<\/strong><\/h2>\n\n\n\n<p>As with any data analytics problem, we would normally start by cleaning the dataset and eliminating null and missing values. In this case, we are not dealing with erroneous data, which saves us this step. <\/p>\n\n\n\n<p>1. 
We import the required libraries for our decision tree analysis &amp; load the data.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Load libraries<br>import pandas as pd<br>from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier<br>from sklearn.model_selection import train_test_split # Import train_test_split function<br>from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation<br># Column names for the dataset, which has no header row<br>col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']<br># Load dataset<br>pima = pd.read_csv(\"pima-indians-diabetes.csv\", header=None, names=col_names)<\/pre>\n\n\n\n<p>Let\u2019s check out what the first few rows of this dataset look like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">pima.head()<\/pre>\n\n\n\n<p>2. After loading the data, we examine its structure &amp; variables, then determine the target &amp; feature variables (the dependent &amp; independent variables respectively). <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Split dataset into features and target variable<br>feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']<br>X = pima[feature_cols] # Features<br>y = pima.label # Target variable<\/pre>\n\n\n\n<p>3. Let\u2019s divide the data into training &amp; testing sets in the ratio of 70:30. <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Split dataset into training set and test set<br>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test<\/pre>\n\n\n\n<p>As a standard practice, you may use a split anywhere from 70:30 to 80:20 as needed. <\/p>\n\n\n\n<p>4. Perform the decision tree analysis using scikit-learn. 
<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Create Decision Tree classifier object<br>clf = DecisionTreeClassifier()<br># Train Decision Tree Classifier<br>clf = clf.fit(X_train,y_train)<br># Predict the response for test dataset<br>y_pred = clf.predict(X_test)<\/pre>\n\n\n\n<p>5. 
Next, we should estimate how accurately the classifier predicts the outcome. The accuracy is computed by comparing the actual test set values with the predicted values.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Model Accuracy, how often is the classifier correct?<br>print(\"Accuracy:\",metrics.accuracy_score(y_test, y_pred))<br><br>Accuracy: 0.6753246753246753<\/pre>\n\n\n\n<p>Our decision tree classifier has an accuracy of 67.53%. That is a modest score that leaves room for improvement, which we will pursue by tuning the tree\u2019s parameters. <\/p>\n\n\n\n<p>6. Now that we have created a decision tree, let&#8217;s see what it looks like when we visualise it.<\/p>\n\n\n\n<p>Scikit-learn&#8217;s export_graphviz function can help visualise the decision tree, and the result can be displayed directly in a Jupyter notebook. The code relies on the following libraries, which you may need to install:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Graphviz: renders the dot file that export_graphviz produces from the decision tree classifier<\/li>\n\n\n\n<li>Pydotplus: converts this dot file to PNG or another form displayable in Jupyter<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">from sklearn.tree import export_graphviz<br>from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases<br>from IPython.display import Image  <br>import pydotplus<br>dot_data = StringIO()<br>export_graphviz(clf, out_file=dot_data,  <br>                filled=True, rounded=True,<br>                special_characters=True,feature_names = feature_cols,class_names=['0','1'])<br>graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  <br>graph.write_png('diabetes.png')<br>Image(graph.create_png())<\/pre>\n\n\n\n<p>Is this the outcome you are getting too?<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"544\" 
src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-1200x544.jpeg\" alt=\"Decision Tree Implementation in Python\" class=\"wp-image-46799\" 
srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-1200x544.jpeg 1200w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-400x181.jpeg 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-768x348.jpeg 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-1536x696.jpeg 1536w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-380x172.jpeg 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-700x317.jpeg 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python.jpeg 1600w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-380x172.jpeg 420w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">Python Output<\/figcaption><\/figure>\n\n\n\n<p>You will notice that, in this extensive decision tree chart, each internal node has a decision rule that splits the data. But are all of these useful\/pure? <\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Measuring the Impurity of Nodes Created Via Decision Tree Analysis<\/strong><\/h3>\n\n\n\n<p>Gini, also referred to as the Gini index or Gini impurity, measures the impurity of a node in a decision tree. A node is pure when all of its records belong to the same class; such a node is not split any further and becomes a leaf node.<\/p>\n\n\n\n<p>The complete decision tree above is too complex to interpret easily. Pruning\/shortening a tree makes the outcome easier to understand and helps optimise it. 
This optimisation can be done by tuning one of three parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>criterion: string, optional (default=\u201cgini\u201d), the attribute selection measure<br><\/strong>This parameter lets us choose the attribute selection measure; supported values include \u201cgini\u201d for the Gini index and \u201centropy\u201d for information gain. <\/li>\n\n\n\n<li><strong>splitter: string, optional (default=\u201cbest\u201d), the split strategy<br><\/strong>This parameter lets us choose the split strategy. You may choose \u201cbest\u201d to choose the best split or \u201crandom\u201d to choose the best random split.<\/li>\n\n\n\n<li><strong>max_depth: int or None, optional (default=None), the maximum depth of the tree<br><\/strong>This parameter determines the maximum depth of the tree. A higher value can cause overfitting, and a lower value can cause underfitting.<\/li>\n<\/ul>\n\n\n\n<p>In our case, we will be varying the maximum depth of the tree as a control variable for pre-pruning. Let\u2019s try max_depth=3, and switch the criterion to entropy at the same time. 
<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Create Decision Tree classifier object<br>clf = DecisionTreeClassifier(criterion=\"entropy\", max_depth=3)<br><br># Train Decision Tree Classifier<br>clf = clf.fit(X_train,y_train)<br><br># Predict the response for test dataset<br>y_pred = clf.predict(X_test)<br><br># Model Accuracy, how often is the classifier correct?<br>print(\"Accuracy:\",metrics.accuracy_score(y_test, y_pred))<br><br>Accuracy: 0.7705627705627706<br><\/pre>\n\n\n\n<p>With pre-pruning, the accuracy of the decision tree classifier increased to 77.06%, which is clearly better than the previous model.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Decision Tree Implementation in Python: Visualising Decision Trees in Python<\/h2>\n\n\n\n<pre class=\"wp-block-preformatted\">from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases<br>from IPython.display import Image  <br>from sklearn.tree import export_graphviz<br>import pydotplus<br>dot_data = StringIO()<br>export_graphviz(clf, out_file=dot_data,  <br>                    filled=True, rounded=True,<br>                    special_characters=True, feature_names = feature_cols,class_names=['0','1'])<br>graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  <br>graph.write_png('diabetes.png')<br>Image(graph.create_png())<\/pre>\n\n\n\n<p>With this, your outcome should look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"515\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-1200x515.png\" alt=\"Decision Tree Implementation in Python, pruned model\" class=\"wp-image-46800\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-1200x515.png 1200w, 
https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-400x172.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-768x330.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-380x163.png 380w, 
https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-700x301.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model.png 1344w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2021\/10\/decision-tree-implementation-in-python-pruned-model-380x163.png 420w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">Python Output<\/figcaption><\/figure>\n\n\n\n<p>The pruned model is much easier to interpret. With this, we have been able to classify the data &amp; predict whether or not a person has diabetes. The decision tree is a very popular supervised learning algorithm in the field of machine learning (an important subset of <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\">data science<\/a>). However, the decision tree is not the only classification technique that you can use to extract this information; there are various other methods that you can explore as an ML engineer or <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" data-type=\"post\" data-id=\"24427\">data scientist<\/a>. <\/p>\n\n\n\n<p class=\"rm has-background\" style=\"background-color:#efeff6\"><strong>Since you\u2019re here\u2026<br><\/strong>Curious about a career in data science? Experiment with our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/resources\/guides\/data-science-process\/\" target=\"_blank\">free data science learning path<\/a>, or join our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/courses\/data-science-career-track\/\" target=\"_blank\">Data Science Bootcamp<\/a>, where you\u2019ll get your tuition back if you don&#8217;t land a job after graduating. 
We\u2019re confident because our courses work \u2013 check out our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/success\/\" target=\"_blank\">student success stories<\/a> to get inspired.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A decision tree is a simple representation for classifying examples. It is a supervised machine learning technique where the data is continuously split according to a certain parameter. Decision tree analysis can help solve both classification &amp; regression problems. The decision tree algorithm breaks down a dataset into smaller subsets; while during the same time, [&hellip;]<\/p>\n","protected":false},"author":100,"featured_media":8306,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-12940","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/12940"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=12940"}],"version-history":[{"count":3,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/12940\/revisions"}],"predecessor-version":[{"id":46803,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/12940\/revisions\/46803"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/8306"}],"wp:attachment":[{"href":"https:\/\/www.spring
board.com\/blog\/wp-json\/wp\/v2\/media?parent=12940"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=12940"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=12940"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=12940"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}