Brian Achaye
Data Scientist | Data Analyst | ODK/Kobo Toolbox Expert | BI Engineer | Data Solutions Consultant


How I Use ChatGPT as a Data Scientist (Without Getting Burned)


As a data scientist, I’ve found that ChatGPT is like a double-edged sword—it can dramatically accelerate my workflow, but if I rely on it blindly, it can lead to embarrassing mistakes (like the time it convinced me to use a fake Python library).

Over the past year, I’ve refined exactly how to use ChatGPT effectively in my data science work—from prototyping models to debugging TensorFlow errors. In this post, I’ll share my favorite use cases, real code examples, and hard-learned lessons on avoiding AI-generated pitfalls.

1. Rapid Prototyping for Machine Learning Models

Training a machine learning model involves a lot of boilerplate code—data splitting, preprocessing, baseline modeling, and evaluation. Instead of rewriting the same sklearn pipelines repeatedly, I now use ChatGPT to:

  • Generate starter code (e.g., “Write a PyTorch training loop for binary classification.”)
  • Suggest model architectures (e.g., “What’s a lightweight scikit-learn model for imbalanced data?”)
  • Explain hyperparameters (e.g., “What does max_depth actually do in a Random Forest?”)

Example Prompt:

“Give me Python code to compare Logistic Regression, Random Forest, and XGBoost on a binary classification task, with feature scaling and ROC curve plotting.”

ChatGPT spits out a 90% complete script—saving me 30+ minutes of typing.
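
A sketch of the kind of script that prompt tends to produce, using scikit-learn only (XGBoost is swapped for sklearn's GradientBoostingClassifier here to keep the example dependency-free, and the ROC plotting step is left out). Note that scaling lives inside a Pipeline, so cross-validation never leaks test-fold statistics — exactly the pitfall flagged below:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data stands in for a real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    # Scaler is re-fit on each training fold inside the pipeline: no leakage
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```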

⚠️ Critical Checkpoints:

  • Library versions matter! ChatGPT might use deprecated syntax (e.g., old fit_transform behavior).
  • Always verify cross-validation logic—I once caught it using train_test_split before scaling (a classic data leakage pitfall).

2. Debugging Inscrutable Errors

We’ve all been there: You’re staring at a cryptic TensorFlow error like:

InvalidArgumentError: Input to reshape is a tensor with X values, but the requested shape requires Y  

Instead of scrolling through GitHub issues for hours, I now:

  1. Paste the error + relevant code into ChatGPT.
  2. Ask: “What does this error mean, and how can I fix it?”

Real Example:

I once struggled with a ValueError in Keras when reshaping a CNN input. ChatGPT pointed out that my input shape was missing the channels dimension expected by the default channels_last data format, a fix I would have taken far longer to find on my own.
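
The underlying cause of errors like this is simple to reproduce: a reshape must preserve the total number of elements. A minimal, made-up NumPy demonstration:

```python
import numpy as np

x = np.arange(12)        # 12 values in total
ok = x.reshape(3, 4)     # 3 * 4 = 12 -> works

try:
    x.reshape(5, 3)      # 5 * 3 = 15 != 12 -> raises
except ValueError as e:
    print("ValueError:", e)
```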

🔗 Pro Tip: For niche libraries (e.g., PySpark), specify the version: “I’m using PySpark 3.5. How do I fix this?”

3. Automating Exploratory Data Analysis (EDA)

While tools like pandas-profiling are great, I use ChatGPT to:

  • Generate summary stats code (e.g., “Python code to check for outliers in all numeric columns.”)
  • Suggest visualizations (e.g., “What plots best show time-series seasonality?”)
  • Explain statistical tests (e.g., “When should I use a Mann-Whitney U test vs. a t-test?”)
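
On that last point, the short answer is that the t-test assumes roughly normal data while Mann-Whitney U is rank-based and suits skewed or ordinal samples. A quick SciPy sketch on synthetic (skewed) data shows how to run both:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.exponential(scale=1.0, size=100)   # skewed sample A
b = rng.exponential(scale=1.5, size=100)   # skewed sample B

t_stat, t_p = stats.ttest_ind(a, b)        # assumes normality
u_stat, u_p = stats.mannwhitneyu(a, b)     # rank-based, no normality assumption
print(f"t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```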

Example Workflow:

  1. I ask: “Give me Python code to visualize missing values and correlations in a DataFrame.”
  2. ChatGPT returns a heatmap + missingno matrix snippet—which I then tweak for my dataset.
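
A dependency-light version of that workflow, assuming only pandas (swap the prints for a seaborn heatmap and a missingno matrix once plotting libraries are in play; the DataFrame here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40_000, 52_000, 61_000, np.nan, 45_000],
    "score": [0.7, 0.8, 0.6, 0.9, 0.75],
})

missing = df.isna().sum()           # per-column missing counts (missingno-style view)
corr = df.corr(numeric_only=True)   # pairwise correlations (heatmap input)
print(missing)
print(corr.round(2))
```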

📌 Watch Out: ChatGPT sometimes suggests inappropriate tests (e.g., using Pearson’s R on ordinal data). Always check assumptions!

4. Translating Math into Code

Implementing algorithms from research papers can be painful. Now, I feed ChatGPT equations or pseudocode and ask:

“Convert this gradient descent update rule into Python.”
“How do I implement a custom loss function in Keras?”
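
For the first prompt, a hedged sketch of what the translation looks like: the update rule θ ← θ − η∇L(θ), applied to a toy quadratic loss L(θ) = (θ − 3)² whose gradient is 2(θ − 3):

```python
def gradient_descent(lr=0.1, steps=100):
    theta = 0.0
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)   # dL/dtheta for L = (theta - 3)^2
        theta -= lr * grad           # the update rule, line for line
    return theta

print(gradient_descent())  # converges toward the minimum at theta = 3
```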

Case Study:

I needed to code a custom weighted MSE loss for a regression problem. ChatGPT gave me a TensorFlow function that worked after minor tweaks:

import tensorflow as tf

def weighted_mse(y_true, y_pred, weights):
    # Element-wise squared error, scaled per sample by weights, then averaged
    return tf.reduce_mean(weights * tf.square(y_true - y_pred))

⚠️ Verify the Math! I once caught ChatGPT miscounting array dimensions in a backpropagation example.

5. Writing Documentation & Reports

Data science isn’t just code—it’s communicating insights. ChatGPT helps me:

  • Draft READMEs (e.g., “Summarize this ML pipeline’s steps for a technical audience.”)
  • Simplify jargon (e.g., “Explain PCA to a business team in 2 sentences.”)
  • Generate report outlines (e.g., “Structure a summary for an A/B test result.”)

Example Output:

“Principal Component Analysis (PCA) simplifies complex data by finding ‘summary’ directions (like shadows of a 3D object) that capture the most variation.”
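
That “shadow” intuition is easy to confirm in code. A toy scikit-learn example with synthetic 2-D points that mostly vary along one diagonal: a single component captures nearly all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
# Points lie near the direction (1, 2), plus a little noise
X = np.hstack([t, t * 2]) + rng.normal(scale=0.05, size=(200, 2))

pca = PCA(n_components=1).fit(X)
print(f"variance captured: {pca.explained_variance_ratio_[0]:.1%}")
```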

The Dark Side: When ChatGPT Gets It Wrong

Here’s where I’ve been burned:

  • Hallucinated APIs: It once told me to use tf.keras.metrics.f1_score (doesn’t exist).
  • Dangerous Advice: Suggested using accuracy for imbalanced medical data (terrible idea).
  • Subtle Bugs: Gave me a sns.boxplot snippet with incorrect hue ordering.
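
Why the accuracy advice is so dangerous takes three lines to show. On a made-up dataset with 5% positives, a “model” that always predicts the majority class scores 95% accuracy while detecting zero cases:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # imbalanced: only 5% positive class
y_pred = [0] * 100            # degenerate model: never predicts positive

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- reveals the failure
```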

My Safeguards:

  • Small-scale testing (run code on a sample first).
  • Cross-reference docs (always check sklearn/PyTorch official sources).
  • Never trust stats explanations blindly (verify with Wikipedia/Stack Exchange).

Final Verdict: A Supercharged Intern, Not a Colleague

I treat ChatGPT like a brilliant but sloppy intern: great for drafts, terrible for final answers. The key is knowing when to trust it—and when to double-check.

How do you use AI in your data science work? Let’s discuss in the comments!
