Springboard Mentor Dipanjan Sarkar: A Knowledge Sharing Advocate
Dipanjan (DJ) Sarkar has wanted to work with computers in some capacity since he was a little boy. After mostly playing games (shout out to Doom!), he began programming his computer to complete increasingly complex tasks with a few lines of code. Later, while getting his bachelor’s degree in computer science and engineering, he took some electives around data mining, soft computing, and artificial intelligence. He was hooked.
Dipanjan went on to earn a more data science-focused master’s degree, setting himself up to find a job that would let him “build real-world solutions and systems instead of just working on prototypes, lab experiments, and proof of concepts.”
Dipanjan is now a data scientist, consultant, trainer, and writer. He has consulted and worked for startups as well as Fortune 500 companies like Intel. His current area of focus is predictive analytics, machine learning (ML), deep learning (DL), natural language processing (NLP), and statistical modeling. He also leads training sessions around data science and artificial intelligence. He plans to venture into the world of open-source products to help developers improve their productivity.
A technical writer, Dipanjan has published several books about ML, DL, and NLP. He’s also a key contributor and editor at Towards Data Science. And he shares his data science knowledge on LinkedIn.
As if that weren’t enough, Dipanjan has been a Springboard data science mentor for more than a year.
We recently sat down with him to discuss his data science journey, his favorite professional project, and much more.
Why do you love data science?
I love getting, cleaning, analyzing, visualizing, and modeling data, and facing new challenges every day. I feel satisfied with my work only when my analysis helps realize concrete business value and impact.
I also love sharing my knowledge around data science in books, open-source projects, and articles. Receiving feedback from people that what I’ve shared has helped them in their own career and learning is the fuel I need to drive myself further in this field.
Can you tell me about a project you’ve worked on that you’re really proud of?
My favorite data science project was building a generic anomaly detection framework to detect potential anomalies and failures in key infrastructure like network devices, servers, applications, and client systems. The key motivation for this project was that existing monitoring systems are not dynamic, intelligent, or context-aware. Simple threshold-based monitoring no longer cuts it in enterprise-grade infrastructure, where a single spike in critical events can lead to a massive outage.
Anomaly detection is not new, and we did a lot of research on the existing literature on time-series anomaly detection methods. We came across interesting statistical and machine learning-based approaches. The best part was not blindly going with the hype of “what might supposedly be best,” but actually validating the candidates with multiple experiments on real-world data from our infrastructure systems.
We found that many off-the-shelf methods don’t work well in our environment, which could be due to the diverse nature of our systems. What worked for us was the SH-ESD algorithm, the Seasonal Hybrid Extreme Studentized Deviate, which Twitter packaged into a nice framework called AnomalyDetection. But, surprise! The package was written in R (widely used by data scientists, mostly for statistical analysis), while we extensively use Python (one of the leading programming languages in data science) in our production systems, from data ingestion and wrangling to modeling and predictions.
The reason I am really proud of this project is that I had to use my knowledge of inferential statistics and mathematics, coupled with my programming skills, to build the SH-ESD algorithm from scratch in Python! To be honest, this was the first time I saw a real-world implementation of a statistical test being used in real time in production for something besides regular A/B testing.
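To give a feel for the idea behind SH-ESD (this is not Twitter’s implementation, nor the production code from this project), here is a minimal stdlib-only Python sketch: it removes a per-period median seasonal component, then flags residuals using median/MAD scoring, the robust “hybrid” statistics the algorithm substitutes for mean and standard deviation. The full algorithm uses STL decomposition and an iterative generalized ESD test with t-distribution critical values, both of which this sketch deliberately omits.

```python
# Simplified sketch of the core idea behind SH-ESD (Seasonal Hybrid ESD).
# The real algorithm runs STL decomposition plus an iterative generalized
# ESD test; here we approximate with per-period medians for the seasonal
# component and a robust median/MAD score in place of the full test.
from statistics import median

def detect_anomalies(series, period, threshold=3.0):
    """Return indices of points whose robust residual score exceeds threshold."""
    # 1. Estimate the seasonal component as the median of each phase
    #    (e.g. every Monday if period == 7 on daily data).
    seasonal = [median(series[i::period]) for i in range(period)]
    # 2. Residuals: remove seasonality, then the overall median (trend proxy).
    deseasonalized = [x - seasonal[i % period] for i, x in enumerate(series)]
    overall = median(deseasonalized)
    residuals = [x - overall for x in deseasonalized]
    # 3. "Hybrid" scoring: median and MAD instead of mean and std, so a
    #    single large spike cannot inflate the scale and mask itself.
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        return []  # degenerate series: no spread to score against
    scale = 1.4826 * mad  # makes MAD consistent with std dev for normal data
    return [i for i, r in enumerate(residuals) if abs(r - center) / scale > threshold]

# Toy series with weekly seasonality, mild noise, and one injected spike:
data = [10, 12, 11, 14, 12, 11, 10,
        11, 13, 10, 15, 11, 12, 9,
        10, 12, 12, 14, 13, 11, 10,
        9, 12, 11, 13, 12, 10, 11]
data[15] = 40  # anomaly
print(detect_anomalies(data, period=7))  # -> [15]
```

Because the seasonal estimate and the scale are both medians, the injected spike barely perturbs either, which is exactly why the “hybrid” robust statistics matter for spiky infrastructure metrics.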
This project is currently in production, providing daily and near real-time anomaly alerts for a wide variety of infrastructure, and is easily extensible to other domains and data sources. Having been involved in this project from its ideation phase, through architecting, developing, and building the solution from the ground up, to now seeing it run successfully in production, is why it will always hold a special place in my heart.
What did you learn from the project?
From working on this project, I learned a few things. Always visualize your data, observe key trends, and don’t proceed blindly based on statistical measures alone (it’s easy to lie with statistics). Also, off-the-shelf products or frameworks might not work well for you, so don’t force-fit them. Build something if you need to, but never reinvent the wheel!
Also, in the real world, always have conversations with the key stakeholders so you know what you are working with and toward. Set key success criteria and work toward it, so you always keep in mind the business impact your project has.
Lastly, keep things simple—always follow the principle of Occam’s razor. Don’t build complex, fancy, and sophisticated models just for the sake of it; always make sure it is relevant to solving the problem at hand.
Do you use data science outside of a professional environment?
I used to try to forecast my overall spending on things like transport, food, utilities, and so on based on my historical data. But I’ve given up, considering how much I keep spending on food and unnecessary things!
I also used data science (clustering/grouping) to help my dad identify similar data points in financial/bank records some time back. That was a really nice mini-project for me and helped him a lot, too.
What publications do you follow to stay up to date on data science news and cutting-edge technologies?
The primary sources I use for research-oriented content are Elsevier, Springer, arXiv, ACM, and IEEE. Safari Books gives me a vast library of technical books from Springer, Apress, Packt, O’Reilly, Manning, etc. For online publications, I follow Towards Data Science, KDnuggets, Mashable, TechCrunch, Hacker News, Wired, and Hacker Noon whenever I have some extra time.
What are a few of your favorite things you’ve written?
I have published a couple of data science papers in IEEE and Elsevier journals on topics like image steganography and infrastructure fault prediction with machine learning. My books, available through Springer/Apress and Packt, cover machine learning, social media analytics, natural language processing, and deep learning using both R and Python. One of my favorites is my most recent, “Hands-On Transfer Learning with Python.” I’ve even open-sourced all of its examples on GitHub for everyone.
Occasionally, I’ll contribute articles to Towards Data Science. I wrote one on effective visualization of multi-dimensional data, which is definitely one of my favorite pieces. Some of my other articles can be found on Medium.
Since you’re here…
Thinking about a career in data science? Enroll in our Data Science Bootcamp, and we’ll get you hired in 6 months. If you’re just getting started, take a peek at our foundational Data Science Course, and don’t forget to peep our student reviews. The data’s on our side.