
Abhishek Deore
Data Scientist
About Me
Experience
Research Assistant
University of Arizona, Department of Hydrology and Atmospheric Sciences.
Jan 2025 - Feb 2025
- Developed an AI-driven genomic data extraction pipeline using Large Language Models (LLMs) and AI agents, replacing manual research paper data extraction with an automated process.
- Implemented Retrieval-Augmented Generation (RAG) and Named Entity Recognition (NER) to extract key genomic entities such as gene expressions, mutations, and clinical annotations with high accuracy.
- Optimized data ingestion and structuring by converting unstructured text into structured formats using transformers, fine-tuned LLMs, and semantic similarity scoring for enhanced retrieval.
- Reduced processing time from hours to minutes by enabling batch extraction of multiple records, integrating vector databases (FAISS / ChromaDB) for efficient indexing, and leveraging cloud-based distributed processing.
- Enhanced scalability and accessibility by designing a modular workflow using Python, LangChain, and cloud storage, significantly accelerating genomic research and database expansion.
Graduate Research Assistant
University of Arizona, College of Nursing.
Sep 2024 - Dec 2024
- Developed AI-powered interactive 3D characters by designing and rigging three virtual family members of digital patients using Character Creator 4 (CC4) and integrating them into Unity for a simulated healthcare training environment.
- Implemented real-time conversational AI by connecting the 3D models to ConvAI and LLM-based dialogue systems, enabling lifelike interactions where characters could provide patient histories and respond contextually to healthcare professionals' queries.
- Integrated AI-driven behavior and decision-making using Unity ML-Agents and state-driven architectures, allowing characters to dynamically adapt their responses based on user interactions, patient conditions, and simulated medical procedures.
- Designed and optimized a scalable simulation pipeline by leveraging Unity's animation and physics systems, ensuring realistic interactions, facial expressions, and gestures aligned with speech synthesis for an immersive training experience.
- Enhanced training for nurses and medical professionals by creating an interactive 3D patient care simulation, allowing them to practice medical decision-making, patient interaction, and procedural workflows in a fully AI-powered virtual environment.
Data Scientist
University of Arizona, AI Core
May 2024 - Aug 2024
- Engineered a natural-language-to-SQL Retrieval-Augmented Generation (RAG) architecture in Python that converts natural language queries into SQL, improving query accuracy and response time.
- Implemented RAG-powered chatbots for multiple clients, including a feedback analysis system for Eller College of Management, leveraging SQL and data visualization to enhance user interaction and data retrieval, resulting in a 40% increase in engagement.
- Created an architecture that resolved numerical data hallucination issues by bypassing traditional RAG vector databases, reducing hallucination by 25% in prediction and classification models.
- Leveraged AWS cloud technologies and agile methodologies to deploy, scale, and refine machine learning models and chatbot applications, incorporating user feedback for continuous improvement.
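The general idea of answering numerical questions through SQL rather than through text retrieved from a vector database can be sketched as follows. This is only an illustration, not the deployed architecture: the `feedback` table, its columns, and the `question_to_sql` helper are hypothetical, and the helper uses fixed templates where a real system would call a language model.

```python
import sqlite3


def question_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM step that turns a question into SQL."""
    templates = {
        "average score": "SELECT AVG(score) FROM feedback",
        "number of responses": "SELECT COUNT(*) FROM feedback",
    }
    for phrase, sql in templates.items():
        if phrase in question.lower():
            return sql
    raise ValueError(f"no SQL template for: {question!r}")


def answer(conn: sqlite3.Connection, question: str):
    """Run the generated query so the number comes from the database itself."""
    sql = question_to_sql(question)
    (value,) = conn.execute(sql).fetchone()
    return value


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory table with made-up rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feedback (student TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO feedback VALUES (?, ?)",
        [("a", 80.0), ("b", 90.0), ("c", 70.0)],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    print(answer(demo, "What is the average score?"))
    print(answer(demo, "What is the number of responses?"))
```

Because the figure is computed by the query, the language model only has to produce valid SQL and never has to reproduce a number from memory or from retrieved text.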
Skills
Programming Languages
Machine Learning & AI
Data Science & Analytics
Libraries & Frameworks
Tools & Platforms
Projects

Multi-Modal Retrieval Engine with RAG
A sophisticated search and question-answering system that works across different content types (text and images) with intelligent response generation powered by Google's Gemini API.
- Built a cross-format search system using specialized neural embeddings for text and images.
- Implemented vector similarity search with ChromaDB for efficient content retrieval.
- Integrated Gemini API to generate intelligent answers from retrieved documents.
- Created a Streamlit interface for intuitive content uploading, searching, and questioning.
Tech Stack: Python, Google Generative AI (Gemini API), Streamlit, SentenceTransformers, MobileNetV2, ChromaDB (vector database), PIL, NumPy, Matplotlib
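Vector similarity search of the kind described above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the project uses ChromaDB for this step, while the version below does a brute-force cosine-similarity ranking over a hypothetical in-memory `index` of made-up two-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k items whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    # Made-up embeddings; text and image items share one vector space here.
    index = {
        "cat.txt": [1.0, 0.0],
        "dog.png": [0.9, 0.1],
        "car.txt": [0.0, 1.0],
    }
    print(top_k([1.0, 0.0], index, k=2))
```

A vector database replaces the linear scan with an index so that lookups stay fast as the collection grows; the ranking criterion is the same.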

File Organization Agent
An intelligent file management assistant that leverages natural language processing to automate file organization and searching tasks. The project demonstrates the practical application of AI agents in solving everyday file management challenges.
- Developed a sophisticated agent using Google's Gemini AI that translates conversational commands into precise file system operations, enabling users to organize and find files through intuitive language.
- Implemented a robust system with distinct modules for understanding user intent, planning actions, executing file operations, and maintaining contextual memory, showcasing advanced AI agent design principles.
- Created both a command-line interface and a Streamlit web application, providing users with flexible interaction methods and a visually engaging file management experience with real-time feedback.
- Engineered advanced file organization capabilities including sorting by type, date, and name, with the ability to search and locate files across multiple directories using natural language queries.
Tech Stack: Python, Google Gemini AI, Streamlit, Pathlib, OS modules
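Sorting by type, one of the operations listed above, might look like the sketch below. It is an assumption-laden illustration rather than the agent's actual code: `plan_by_extension` is a hypothetical helper that only builds a plan (file extension to file names) and moves nothing, which leaves room for a confirmation step before any file system change.

```python
from pathlib import Path


def plan_by_extension(file_names: list[str]) -> dict[str, list[str]]:
    """Group file names by lowercase extension without touching the disk."""
    plan: dict[str, list[str]] = {}
    for name in sorted(file_names):
        # Path.suffix is "" for names without an extension.
        suffix = Path(name).suffix.lower().lstrip(".") or "no_extension"
        plan.setdefault(suffix, []).append(name)
    return plan


if __name__ == "__main__":
    for folder, names in plan_by_extension(["b.PDF", "a.pdf", "notes.txt", "README"]).items():
        print(folder, names)
```

Separating planning from execution keeps the step that interprets a user's request from directly modifying files.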

AI-Enhanced Chatbot for Feedback Analysis
Designed and implemented a chatbot for Eller College of Management professors to query and analyze student performance feedback using advanced RAG with Natural Language to SQL architecture.
- Developed an intelligent agent for generating and iterating SQL queries
- Integrated the chatbot into the Radian 360 Business Communications feedback platform
- Enhanced analysis of over 6 years of curated interviews and presentations data
Tech Stack: RAG, Natural Language Processing, SQL, AWS RDS, LangChain, VectorDB, Agentic AI

Credit Card Fraud Detection
Implemented advanced anomaly detection techniques using Isolation Forest and Local Outlier Factor (LOF) to identify fraudulent credit card transactions.
- Achieved AUC scores of 0.8942 (Isolation Forest) and 0.8640 (LOF)
- Handled class imbalance using SMOTE
- Performed comprehensive EDA and feature engineering
- Utilized Python, scikit-learn, and Jupyter Notebook
Tech Stack: Python, scikit-learn, Jupyter Notebook, SMOTE
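For context on the AUC figures above, the sketch below shows how a ROC AUC can be computed from anomaly scores: it is the fraction of (fraud, non-fraud) pairs in which the fraudulent transaction receives the higher score, with ties counted as half. The labels and scores are made up, and `roc_auc` is an illustrative helper rather than the project's evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]            # 1 = fraud (made-up data)
    scores = [0.1, 0.4, 0.35, 0.8]   # higher = more anomalous
    print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, so it suits small checks only; library implementations compute the same quantity from sorted scores.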

Student Academic Performance Prediction using Machine Learning
Developed a machine learning system to predict student academic outcomes (Dropout, Enrolled, Graduate) using demographic, socio-economic, and academic performance data, achieving 77% accuracy with XGBoost.
- Engineered a comprehensive data preprocessing pipeline handling 4,424 student records with 36 features, including SMOTE implementation for class imbalance, feature scaling, and encoding for optimal model performance.
- Implemented and compared multiple machine learning models (Random Forest, XGBoost, Logistic Regression, Neural Network) with hyperparameter tuning, ultimately achieving 77% prediction accuracy using XGBoost.
- Conducted in-depth feature importance analysis that revealed academic performance metrics as the strongest predictors, followed by financial and administrative factors, providing actionable insights for student success.
- Developed a robust evaluation framework using precision, recall, and F1-scores across different models, incorporating cross-validation and specialized handling for imbalanced class distribution to ensure reliable predictions.
Tech Stack: Python, pandas, scikit-learn, XGBoost, SMOTE, matplotlib, seaborn, Random Forest, Logistic Regression, Neural Network, Cross Validation, GridSearchCV
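The precision, recall, and F1 metrics mentioned above matter for imbalanced classes because accuracy alone can hide poor performance on a minority class. A minimal sketch of how they are derived for one class, using made-up labels and a hypothetical `precision_recall_f1` helper:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1]  # made-up labels
    y_pred = [1, 0, 0, 1, 1]
    print(precision_recall_f1(y_true, y_pred))
```

For a three-class problem such as Dropout, Enrolled, Graduate, the same function can be called once per class by passing that class as `positive`.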

Brain Stroke Detection using Machine Learning
Designed a machine learning model to predict brain stroke likelihood using demographic and health data, applying data balancing techniques and classification models to improve accuracy and reliability.
- Built a data preprocessing pipeline by cleaning missing values, removing irrelevant features, and applying class balancing techniques to enhance model robustness.
- Developed and fine-tuned multiple classification models, including Logistic Regression, Decision Trees, and Random Forest, optimizing hyperparameters for peak performance.
- Analyzed key stroke risk factors through exploratory data analysis (EDA), uncovering significant correlations between age, BMI, smoking status, and stroke occurrence.
- Designed a rigorous evaluation framework leveraging accuracy, precision, recall, and confusion matrices to ensure high reliability in stroke prediction outcomes.
Tech Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn, Logistic Regression, Decision Tree, Random Forest, SMOTE, Feature Engineering, Hyperparameter Tuning

SONAR Rock or Mine Predictor
Developed a machine learning model to classify SONAR signals as either rocks or mines, demonstrating effective data preprocessing and model evaluation techniques.
- Implemented and compared various data normalization techniques
- Achieved 85.71% testing accuracy using Logistic Regression with Box-Cox transformation
- Demonstrated effective handling of skewed data
- Utilized Python, scikit-learn, and Jupyter Notebook
Tech Stack: Python, scikit-learn, Jupyter Notebook
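The Box-Cox transformation mentioned above reduces skew in strictly positive data by mapping each value x to (x^λ - 1) / λ, or to ln(x) when λ = 0. The sketch below applies it for a given λ; the λ values and inputs are illustrative assumptions, since the source does not state which λ was used or how it was chosen.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transform with parameter `lam` to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


if __name__ == "__main__":
    skewed = [1.0, 4.0, 9.0]  # made-up, right-skewed values
    print(box_cox(skewed, 0.5))
    print(box_cox(skewed, 0))
```

In practice λ is usually estimated from the training data, and the same fitted value is then applied to the test data.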

Toxicity Tracker: AI-Powered Detection of Harmful Online Comments
Designed and implemented an AI tool using Python and TensorFlow to detect and classify toxic comments in real time across digital platforms.
- Leveraged natural language processing and deep learning techniques
- Managed substantial datasets, employing Pandas and NLTK for data preprocessing and analysis
- Achieved robust model accuracy and efficiency for dynamic content moderation
Tech Stack: Python, TensorFlow, Pandas, NLTK

Diwali Sales EDA
Conducted an in-depth Exploratory Data Analysis (EDA) on Diwali sales data to derive actionable insights for business strategy and marketing optimization.
- Uncovered key customer demographics: women aged 26-35 and professionals in Healthcare, IT, and Government jobs
- Identified top-performing states: Maharashtra, Karnataka, and Uttar Pradesh
- Analyzed product performance, highlighting Fashion, Electronics, and Home & Kitchen as top-selling categories
Tech Stack: Python, Pandas, Matplotlib, Seaborn

Ecommerce Analysis Dashboard using Power BI
Created an interactive dashboard to track and analyze online sales data across India for an E-commerce store owner.
- Used complex parameters to drill down into worksheets and customized views with filters and slicers.
- Created connections, joined new tables, and added calculations to manipulate data and enable user-driven parameters for visualizations.
Tech Stack: Power BI