Josh Kraus
Data Scientist, Former Musician & Educator
Browse my work below.
Built an end-to-end machine learning pipeline in Python to predict NCAA tournament
outcomes and generate optimized bracket picks. Engineered models in AutoGluon with
walk-forward cross-validation and backtesting. Developed a Monte Carlo simulation
framework using importance-weighted bracket selection via binomial log-likelihood, and
built feature pipelines drawing on web-scraped KenPom and Sports Reference data.
Stack: Python, AutoGluon, BeautifulSoup, scikit-learn
Key techniques: Automated Machine Learning, walk-forward cross-validation, class weights, Monte Carlo simulation
Highlight: 63% backtested accuracy over 10 years
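The Monte Carlo bracket-selection idea can be sketched as follows. This is a simplified, hypothetical illustration (independent games, no round structure, toy probabilities), not the project's actual implementation: sample many brackets from the model's win probabilities, score each by its binomial log-likelihood, and keep the most likely one.

```python
import math
import random

def simulate_bracket(win_probs):
    """Sample one bracket: each game resolves per its model win probability."""
    return [random.random() < p for p in win_probs]

def binomial_log_likelihood(bracket, win_probs):
    """Log-likelihood of a sampled bracket under the model's probabilities."""
    return sum(
        math.log(p if picked else 1.0 - p)
        for picked, p in zip(bracket, win_probs)
    )

def best_bracket(win_probs, n_sims=10_000, seed=42):
    """Monte Carlo search: draw many brackets, keep the most likely one."""
    random.seed(seed)
    candidates = (simulate_bracket(win_probs) for _ in range(n_sims))
    return max(candidates, key=lambda b: binomial_log_likelihood(b, win_probs))

probs = [0.9, 0.65, 0.55, 0.7]   # toy model win probabilities
picks = best_bracket(probs)       # with independent games, favors every favorite
```

In a real bracket the games are not independent (later matchups depend on earlier winners), which is where the full simulation framework earns its keep.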
Developed a production-grade async web scraper in Python to extract and analyze flight
pricing data. Architected a modular batch processing system supporting concurrent scraping
with configurable rate limiting, retry logic, and error handling, achieving a 97.5% success rate.
Implemented a comprehensive test suite with 300 unit tests (95% code coverage) and
established CI/CD pipeline for automated testing and versioned releases.
Stack: Python, Playwright, pytest, GitHub Actions
Key techniques: Async batch architecture, retry/error handling, CI/CD, unit testing
Highlight: 97.5% scrape success rate across hundreds of routes; 300 unit tests at 95% coverage
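The core pattern behind the batch architecture can be sketched with stdlib asyncio alone. The `fetch_price` stub below stands in for a Playwright page scrape and is entirely hypothetical; the point is the semaphore-bounded concurrency and exponential-backoff retry loop.

```python
import asyncio

async def fetch_price(route, attempt):
    """Stand-in for a real Playwright scrape; first attempt may fail."""
    await asyncio.sleep(0.01)  # simulate network latency
    if attempt == 0 and hash(route) % 3 == 0:
        raise ConnectionError(f"transient failure on {route}")
    return {"route": route, "price": 199}

async def scrape_with_retry(route, sem, max_retries=3):
    """Bounded concurrency via semaphore; exponential backoff between tries."""
    async with sem:
        for attempt in range(max_retries):
            try:
                return await fetch_price(route, attempt)
            except ConnectionError:
                await asyncio.sleep(0.01 * 2 ** attempt)  # backoff
        return None  # retries exhausted; caller logs the failure

async def scrape_batch(routes, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(scrape_with_retry(r, sem) for r in routes))

routes = [f"JFK-LAX-{d}" for d in range(10)]
results = asyncio.run(scrape_batch(routes))
```

Bounding concurrency with a semaphore is what makes the rate limiting configurable: one parameter controls how many pages are in flight at once.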
Deep learning models have shown strong accuracy in medical image classification, but their black-box nature makes them unviable in clinical settings: they can't be interpreted, validated, or approved by the FDA as diagnostic tools. This project explored whether interpretable traditional ML models could serve as a feasible alternative, using the NIH Chest X-ray dataset (112,000+ radiographs) to classify pleural effusion versus no finding. Seven classifiers were trained and tuned, including kernel-SVM, Gradient Boosting, an ANN, and Naïve Bayes, across balanced and imbalanced training sets, with four data augmentation techniques tested systematically. The kernel-SVM on balanced data achieved an AUC of 0.75, matching published deep learning benchmarks on the same dataset at a fraction of the computational cost and with full interpretability. A key finding was the profound impact of class imbalance: models trained on imbalanced data achieved misleadingly high accuracy while completely failing to detect the minority class.
Stack: Python, scikit-learn, Keras, ImgAug, NumPy
Key techniques: Kernel-SVM, Gradient Boosting, ANN, hyperparameter tuning, data augmentation
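The class-imbalance finding is worth making concrete. A toy sketch (illustrative numbers, not the project's data): a degenerate model that always predicts the majority class scores high on accuracy while never detecting a single positive case.

```python
# Toy labels: 95% "no finding" (0), 5% "effusion" (1), mirroring heavy imbalance.
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = (
    sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    / sum(t == 1 for t in y_true)
)

print(accuracy)         # 0.95 — looks strong on paper
print(minority_recall)  # 0.0  — misses every effusion case
```

This is why the balanced training sets, not raw accuracy, drove the model comparison: per-class recall and AUC expose what a headline accuracy number hides.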
Identifying zip codes with strong short-term rental returns requires synthesizing property values, occupancy data, and local socioeconomic factors. This project built a machine learning pipeline to classify U.S. zip codes as profitable or not for Airbnb investment, defined as an annualized rental yield of 18% or higher. Data was scraped from Rabbu using Selenium across 4,300+ zip codes and merged with Zillow home value data and U.S. Census socioeconomic variables, resulting in a 73-feature dataset with significant class imbalance (only 5.5% profitable). After evaluating 12 classifiers via PyCaret and testing resampling techniques including SMOTE and GANs, a LightGBM model achieved 81.1% precision on the minority class. The model identified average daily rate, home value, and occupancy as the top predictors, and surfaced a counterintuitive finding: the most profitable zip codes tended to have lower-than-average home values, suggesting that high-priced markets are often harder to generate returns in despite commanding higher nightly rates.
Stack: Python, PySpark, LightGBM, PyCaret, Selenium, Pandas, scikit-learn, imbalanced-learn
Key techniques: Web scraping, data integration, class imbalance (SMOTE, GANs), LightGBM, feature importance
Highlight: 81.1% minority-class precision across 4,300+ U.S. zip codes; identified key drivers of STR profitability with actionable insights for real estate investors
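The classification target can be sketched in a few lines. The exact yield formula used in the project isn't stated here, so this is an assumed gross-yield definition (daily rate × 365 × occupancy, over home value) with hypothetical example numbers.

```python
def annualized_yield(avg_daily_rate, occupancy, home_value):
    """Gross annualized rental yield: yearly revenue over purchase price."""
    yearly_revenue = avg_daily_rate * 365 * occupancy
    return yearly_revenue / home_value

def label_profitable(avg_daily_rate, occupancy, home_value, threshold=0.18):
    """Binary target used for classification: yield at or above 18%."""
    return annualized_yield(avg_daily_rate, occupancy, home_value) >= threshold

# A $150k home at $120/night and 60% occupancy:
# 120 * 365 * 0.6 = 26,280 / 150,000 ≈ 0.175 → just below the 18% bar
```

Framing profitability as a threshold on a continuous yield is also what produces the 5.5% positive rate: few zip codes clear an 18% gross return, hence the imbalance handling above.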
Before transitioning to data science, my master's thesis examined a persistent gap in music education: most college music students report feeling unprepared by their high school programs for collegiate music theory and aural skills courses, yet educators largely report teaching these subjects regularly. Was this a real discrepancy, and if so, why? To investigate, I designed and administered an IRB-approved quantitative survey to 102 public school music educators across North Carolina and Florida, analyzing responses using descriptive statistics, Spearman's rho correlation tests, and repeated measures ANOVA. Results showed no significant discrepancy between perceived importance and reported implementation, but revealed that educators' lack of confidence in their students' enjoyment of these subjects was the barrier most strongly associated with reduced teaching frequency, a finding with direct implications for curriculum design and teacher training.
Methods: Survey design, Spearman's correlation, repeated measures ANOVA with Tukey-Kramer post-hoc tests
Highlight: This project predates my data science career, but demonstrates defining a measurable research question, designing a rigorous data collection instrument, and drawing defensible conclusions from noisy, real-world data
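For the statistically curious, Spearman's rho, the workhorse of the survey analysis, is just the Pearson correlation of rank-transformed data. A minimal pure-Python sketch (tie handling via average ranks; not the thesis code, which used standard statistical software):

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A perfectly monotone (if nonlinear) relationship yields rho = 1.0:
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))
```

Rank-based correlation suits ordinal Likert-scale survey responses, where distances between answer categories aren't meaningful but their ordering is.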