You have not been coached through setting up a Python environment. If using Python, you will need to set up a Python environment and install the necessary packages to run this code — plan on about 15 minutes; see https://quarto.org/docs/projects/virtual-environments.html. Alternatively, delete the Python code and keep only the R code that is provided. You can see the executed Python output at my GitHub Pages site: https://flyaflya.github.io/randomForestChallenge/.
The Problem: Can Many Weak Learners Beat One Strong Learner?
Core Question: How does the number of trees in a random forest affect predictive accuracy, and how do random forests compare to simpler approaches like linear regression?
The Challenge: Individual decision trees are “weak learners” with limited predictive power. Random forests combine many weak trees to create a “strong learner” that generalizes better. But how many trees do we need? Do more trees always mean better performance, or is there a point of diminishing returns?
Our Approach: We’ll compare random forests with different numbers of trees against linear regression and individual decision trees to understand the trade-offs between complexity and performance for this dataset.
⚠️ AI Partnership Required
This challenge pushes boundaries intentionally. You’ll tackle problems that normally require weeks of study, but with Cursor AI as your partner (and your brain keeping it honest), you can accomplish more than you thought possible.
The new reality: The four stages of competence are Ignorance → Awareness → Learning → Mastery. AI lets us produce Mastery-level work while operating primarily in the Awareness stage. I focus on awareness training, you leverage AI for execution, and together we create outputs that used to require years of dedicated study.
Data and Methodology
We analyze the Ames Housing dataset, which contains detailed information about residential properties sold in Ames, Iowa from 2006 to 2010. This dataset is ideal for our analysis because:
Anticipated Non-linear Relationships: Real estate prices have complex, non-linear relationships between features (e.g., square footage in wealthy vs. poor zip codes affects price differently)
Mixed Data Types: Contains both categorical (zipCode) and numerical variables
Real-world Complexity: Captures the kind of messy, real-world data where ensemble methods excel
Since we anticipate non-linear relationships, random forests are well-suited to model the relationship between features and sale price.
Parameters

| Parameter | Value |
|---|---|
| n_estimators | 5000 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 3 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
| oob_score | False |
| n_jobs | None |
| random_state | 123 |
| verbose | 0 |
| warm_start | False |
| ccp_alpha | 0.0 |
| max_samples | None |
| monotonic_cst | None |
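The parameter listing above matches a scikit-learn RandomForestRegressor. As a sketch, it can be re-created in Python by passing only the non-default values; everything else in the table (max_depth=None, bootstrap=True, and so on) is the library default.

```python
# Re-creating the model from the parameter table: only the non-default
# settings are passed explicitly; the rest are scikit-learn defaults.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=5000,  # a very large ensemble (see the tree-count analysis below)
    max_features=3,     # consider 3 randomly chosen features at each split
    random_state=123,   # fixed seed for reproducibility
)
```

Setting max_features well below the total feature count is what decorrelates the trees: each split sees a different random subset of predictors, so the trees make different mistakes and averaging cancels them out.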
Results: The Power of Ensemble Learning
Our analysis reveals a clear pattern: more trees consistently improve performance. Let’s examine the results and understand why this happens.
Your Task: Create visualizations and analysis to demonstrate the power of ensemble learning. You’ll need to create three key components:
1. The Power of More Trees Visualization
Create a visualization showing:
- RMSE vs Number of Trees (both training and test data)
- R-squared vs Number of Trees
- Do not echo the code that creates the visualization
Add a brief discussion of the visualization:
- Discuss where the most dramatic improvement in performance occurs as you add more trees. How dramatic is it?
- Discuss the diminishing returns as you add more trees
📊 Visualization Requirements
Create two plots:
1. RMSE Plot: Show how RMSE decreases with more trees (both training and test)
2. R-squared Plot: Show how R-squared increases with more trees
Use log scale on x-axis to better show the relationship across the range of tree counts.
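One way to sketch this sweep is shown below. It uses synthetic stand-in data (the Ames loading code lives in the starter repository), refits the forest at several ensemble sizes, and plots RMSE and R-squared against tree count on a log x-axis.

```python
# Sketch of the tree-count sweep: train a forest at several ensemble sizes,
# record train/test RMSE and test R-squared, then plot on a log x-axis.
# Synthetic non-linear data stands in for the Ames features.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
X = rng.uniform(0, 10, size=(1000, 3))
y = X[:, 0] ** 2 + 5 * np.sin(X[:, 1]) + rng.normal(0, 2, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=123)

tree_counts = [1, 5, 10, 50, 100, 500]
results = []
for n in tree_counts:
    rf = RandomForestRegressor(n_estimators=n, random_state=123).fit(X_tr, y_tr)
    results.append({
        "n": n,
        "train_rmse": mean_squared_error(y_tr, rf.predict(X_tr)) ** 0.5,
        "test_rmse": mean_squared_error(y_te, rf.predict(X_te)) ** 0.5,
        "test_r2": r2_score(y_te, rf.predict(X_te)),
    })

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(tree_counts, [r["train_rmse"] for r in results], marker="o", label="train")
ax1.plot(tree_counts, [r["test_rmse"] for r in results], marker="o", label="test")
ax1.set(xscale="log", xlabel="Number of trees", ylabel="RMSE")
ax1.legend()
ax2.plot(tree_counts, [r["test_r2"] for r in results], marker="o")
ax2.set(xscale="log", xlabel="Number of trees", ylabel="Test R-squared")
fig.tight_layout()
```

The log x-axis matters because almost all of the improvement happens over the first handful of trees; on a linear axis that steep early drop would be squashed against the y-axis.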
2. Overfitting Visualization and Analysis
Your Task: Compare decision trees vs random forests in terms of overfitting.
Create one visualization with two side-by-side plots showing:
- Decision trees: How performance changes with tree complexity (max depth)
- Random forests: How performance changes with number of trees
Your analysis should explain:
- Why individual decision trees overfit as they become more complex
- Why random forests don’t suffer from the same overfitting problem
- The mechanisms that prevent overfitting in random forests (bootstrap sampling, random feature selection, averaging)
📊 Overfitting Analysis Requirements
Create a side-by-side comparison showing:
1. Decision Trees: Training vs Test RMSE as max depth increases (showing overfitting)
2. Random Forests: Training vs Test RMSE as number of trees increases (no overfitting)
Use the same y-axis limits for both side-by-side plots so it clearly shows whether random forests outperform decision trees.
Do not echo the code that creates the visualization
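The two sweeps behind this comparison can be sketched as follows, again on synthetic stand-in data rather than the Ames dataset itself. Each sweep records training and test RMSE so the two curves can be plotted side by side with shared y-axis limits.

```python
# Sketch of the overfitting comparison: deep single trees drive training
# error toward zero while test error stalls or worsens; adding trees to a
# forest never reopens that gap. Synthetic data stands in for Ames.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)
X = rng.uniform(0, 10, size=(800, 3))
y = X[:, 0] ** 2 + 5 * np.sin(X[:, 1]) + rng.normal(0, 2, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=123)

def rmses(model):
    """Fit and return (training RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    return (mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5,
            mean_squared_error(y_te, model.predict(X_te)) ** 0.5)

depths = [1, 2, 4, 8, 16, 32]
tree_curves = [rmses(DecisionTreeRegressor(max_depth=d, random_state=123))
               for d in depths]
n_trees = [1, 5, 25, 100, 250]
forest_curves = [rmses(RandomForestRegressor(n_estimators=n, random_state=123))
                 for n in n_trees]
# tree_curves: training RMSE collapses toward zero at depth 32 while test RMSE
# stops improving -- the classic overfitting signature.
# forest_curves: test RMSE falls and then plateaus as trees are added.
# Plot each curve pair side by side, with the SAME y-limits on both panels.
```

Note that the two x-axes measure different things (model complexity for the single tree, ensemble size for the forest), which is exactly the point: cranking up the forest's "knob" does not cause the training/test gap that cranking up the tree's does.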
3. Linear Regression vs Random Forest Comparison
Your Task: Compare random forests to linear regression baseline.
Create a comparison table showing:
- Linear Regression RMSE
- Random Forest (1 tree) RMSE
- Random Forest (100 trees) RMSE
- Random Forest (1000 trees) RMSE
Your analysis should address:
- The improvement in RMSE when going from 1 tree to 100 trees
- Whether switching from linear regression to a 100-tree random forest shows a similar improvement
- When random forests are worth the added complexity vs linear regression
- The trade-offs between interpretability and performance
📊 Comparison Requirements
Create a clear table comparing:
- Linear Regression
- Random Forest (1 tree)
- Random Forest (100 trees)
- Random Forest (1000 trees)
Include percentage improvements over linear regression for each random forest model.
Challenge Requirements 📋
Minimum Requirements for Any Points on Challenge
Create a GitHub Pages Site: Use the starter repository (see the Repository Setup section below) to begin with a working template. The repository includes all the analysis code and visualizations above. Use just one language for the analysis and visualizations; delete the other language and omit the panel tabsets.
Add Analysis and Visualizations: Complete the three analysis sections above with your own code and insights.
GitHub Repository: Use your forked repository (from the starter repository) named “randomForestChallenge” in your GitHub account.
GitHub Pages Setup: The repository should be set as the source of your GitHub Pages site:
Go to your repository settings (click the “Settings” tab in your GitHub repository)
Scroll down to the “Pages” section in the left sidebar
Under “Source”, select “Deploy from a branch”
Choose “main” branch and “/ (root)” folder
Click “Save”
Your site will be available at: https://[your-username].github.io/randomForestChallenge/
Note: It may take a few minutes for the site to become available after enabling Pages
Step 2: Clone your fork locally using Cursor (or VS Code)
Step 3: You’re ready to start! The repository includes pre-loaded data and a working template with all the analysis above.
💡 Why Use the Starter Repository?
Benefits:
Pre-loaded data: All required data and analysis code is included
Working template: Basic Quarto structure (index.qmd) is ready
No setup errors: Avoid common data loading issues
Focus on analysis: Spend time on the visualizations and analysis, not data preparation
Getting Started Tips
🎯 Navy SEALs Motto
“Slow is Smooth and Smooth is Fast”
Take your time to understand the random forest mechanics, plan your approach carefully, and execute with precision. Rushing through this challenge will only lead to errors and confusion.
💾 Important: Save Your Work Frequently!
Before you start: Make sure to commit your work often using the Source Control panel in Cursor (Ctrl+Shift+G or Cmd+Shift+G). This prevents the AI from overwriting your progress and ensures you don’t lose your work.
Commit after each major step:
After adding your visualizations
After adding your analysis
After rendering to HTML
Before asking the AI for help with new code
How to commit:
Open Source Control panel (Ctrl+Shift+G)
Stage your changes (+ button)
Write a descriptive commit message
Click the checkmark to commit
Remember: Frequent commits are your safety net!
Grading Rubric 🎓
📊 What You’re Really Being Graded On
This is an investigative report, not a coding exercise. You’re analyzing random forest models and reporting your findings like a professional analyst would. Think of this as a brief you’d write for a client or manager about the power of ensemble learning and when to use random forests vs simpler approaches.
What makes a great report:
Clear narrative: Tell the story of what you discovered about ensemble learning
Insightful analysis: Focus on the most interesting findings about random forest performance
Professional presentation: Clean, readable, and engaging
Concise conclusions: No AI babble or unnecessary technical jargon
Human insights: Your interpretation of what the performance improvements actually mean
Practical implications: When random forests are worth the added complexity
What we’re looking for: A compelling 2-3 minute read that demonstrates both the power of ensemble learning and the importance of choosing the right tool for the job.
Questions to Answer for 75% Grade on Challenge
Power of More Trees Analysis: Provide a clear, well-reasoned analysis of how random forest performance improves with more trees. Your analysis should demonstrate understanding of ensemble learning principles and diminishing returns.
Questions to Answer for 85% Grade on Challenge
Overfitting Analysis: Provide a thorough analysis comparing decision trees vs random forests in terms of overfitting. Your analysis should explain why individual trees overfit while random forests don’t, and the mechanisms that prevent overfitting in ensemble methods.
Questions to Answer for 95% Grade on Challenge
Linear Regression Comparison: Your analysis should include a clear comparison table and discussion of when random forests are worth the added complexity vs linear regression. Focus on practical implications for real-world applications.
Questions to Answer for 100% Grade on Challenge
Professional Presentation: Your analysis should be written in a professional, engaging style that would be appropriate for a business audience. Use clear visualizations and focus on practical insights rather than technical jargon.