# ***The Impact of Feature Randomness in Random Forests***

To illustrate the impact of feature randomness in random forests, we'll build and compare two random forests: one that considers all features at each split, and another that considers a random subset of features at each split.



## ***Step 1: Import Necessary Libraries***

In [None]:
# 1. Import Necessary Libraries:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

## ***Step 2: Generate Synthetic Dataset***

We'll create a synthetic dataset using the `make_classification` function with 1,000 samples (`n_samples`) and 20 features (`n_features`). We will set the number of informative features (`n_informative`) to 8 for now. `n_redundant=0` means no features are generated as linear combinations of the informative ones, so the remaining 12 features are random noise.

`n_clusters_per_class` specifies the number of distinct clusters to generate within each class. A value of 1 would create a dataset that is too simple, so we'll set it to 2 to make the effects of feature randomness visible.

In [None]:
# 2. Generate Synthetic Dataset:

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, n_redundant=0,
                           n_clusters_per_class=2, random_state=707)
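
To see which columns carry signal, here is a minimal sketch that regenerates the data with `shuffle=False`. With shuffling off, `make_classification` places the informative features first, so columns 0 to 7 are informative and columns 8 to 19 are noise. The names `X_demo`, `y_demo`, and `rf_demo`, and the choice of 100 trees, are assumptions made for illustration. This variant is not the same dataset used in the rest of the walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same settings as above, but with shuffle=False so the feature order is known:
# the first n_informative columns are informative, the rest are noise.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=8, n_redundant=0,
                                     n_clusters_per_class=2, shuffle=False,
                                     random_state=707)

# Fit a forest and compare how much importance goes to each group of columns.
rf_demo = RandomForestClassifier(n_estimators=100, random_state=707)
rf_demo.fit(X_demo, y_demo)

importances = rf_demo.feature_importances_
informative_share = importances[:8].sum()
noise_share = importances[8:].sum()

print(f'Importance on informative features: {informative_share:.3f}')
print(f'Importance on noise features:       {noise_share:.3f}')
```

The noise columns should receive only a small share of the total importance, which is why the number of informative features matters when choosing `max_features` later on.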

## ***Step 3: Splitting data***

In [None]:
# 3. Split the Dataset:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=707)

## ***Step 4: Training a Random Forest without Feature Randomness***

We'll make sure that this model considers all features at each split by setting `max_features` to 20, the total number of features.

In [None]:
# 4. Train Random Forest without Feature Randomness:

rf_no_feature_randomness = RandomForestClassifier(n_estimators=200,
                                                  max_features=20,
                                                  random_state=707)
rf_no_feature_randomness.fit(X_train, y_train)
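
As a quick check that `max_features=20` really means "all features" here, the sketch below compares it with `max_features=None`, which scikit-learn documents as using every feature. The 10-tree forests and the names `rf_explicit` and `rf_none` are assumptions chosen to keep the check fast; they are not part of the walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, n_redundant=0,
                           n_clusters_per_class=2, random_state=707)

# Small forests (10 trees) keep this check fast; the tree count is illustrative.
rf_explicit = RandomForestClassifier(n_estimators=10, max_features=20,
                                     random_state=707).fit(X, y)
rf_none = RandomForestClassifier(n_estimators=10, max_features=None,
                                 random_state=707).fit(X, y)

# Each fitted tree stores the resolved number of candidate features per split.
resolved_explicit = {tree.max_features_ for tree in rf_explicit.estimators_}
resolved_none = {tree.max_features_ for tree in rf_none.estimators_}

print('max_features=20   resolves to:', resolved_explicit)
print('max_features=None resolves to:', resolved_none)
```

Even with every feature available at each split, the trees still differ from one another because `RandomForestClassifier` draws a bootstrap sample of the rows for each tree by default.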

## ***Step 5: Evaluating Model with `classification_report`***

`classification_report` key metrics:
- **Precision:** Indicates the accuracy of positive predictions for a class. It is calculated as the ratio of true positives to the sum of true positives and false positives.

- **Recall (Sensitivity or True Positive Rate):** Measures the model's ability to identify all actual positive instances of a class. It is computed as the ratio of true positives to the sum of true positives and false negatives. High recall indicates that the model captures most of the actual positives for that class.

- **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances both concerns. An F1-score close to 1 signifies a model with both high precision and recall for that class.

- **Support:** Denotes the number of actual occurrences of each class in the dataset. It reflects how many instances of each class are present and is essential for understanding the context of the other metrics.

- **Accuracy:** The overall proportion of correct predictions across all classes.

- **Macro Average:** The unweighted mean of the metrics for all classes, treating each class equally regardless of its support.

- **Weighted Average:** The mean of the metrics, weighted by the support of each class, accounting for class imbalance.
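
The definitions above can be checked by hand on a tiny example. The labels below are made up purely for illustration (they are not model output), and the positive class is assumed to be `1`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, invented for this illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Count outcomes for the positive class (label 1).
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = int(np.sum(y_true == 1))  # actual occurrences of class 1

print(f'precision={precision:.2f} recall={recall:.2f} '
      f'f1={f1:.2f} support={support}')

# The hand-computed values should agree with scikit-learn's metric functions.
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```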



In [None]:
# 5. Evaluating Model with classification_report

y_pred_no_fr = rf_no_feature_randomness.predict(X_test)
accuracy_no_fr = accuracy_score(y_test, y_pred_no_fr)
print(f'Accuracy without feature randomness: {accuracy_no_fr:.4f}')
print('Classification Report without feature randomness:')
print(classification_report(y_test, y_pred_no_fr))

## ***Step 6: Training a Random Forest with Feature Randomness***

We'll start by limiting `max_features` to 4 (approximately `sqrt(20)`). Your task is then to experiment with different `max_features` values. Remember that the best `max_features` value tends to be linked to the number of informative features.

In [None]:
# 6. Train Random Forest with Feature Randomness:

rf_with_feature_randomness = RandomForestClassifier(n_estimators=200,
                                                    max_features=4,
                                                    random_state=707)
rf_with_feature_randomness.fit(X_train, y_train)
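
Besides an integer, `max_features` also accepts the strings `"sqrt"` and `"log2"` or a float fraction. The sketch below shows how many features each setting resolves to for this 20-feature dataset. The 10-tree forests and the helper name `resolved_max_features` are assumptions introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, n_redundant=0,
                           n_clusters_per_class=2, random_state=707)


def resolved_max_features(setting):
    """Fit a small forest and return the per-split feature count it uses."""
    # 10 trees is an illustrative choice to keep the check fast.
    rf = RandomForestClassifier(n_estimators=10, max_features=setting,
                                random_state=707).fit(X, y)
    return rf.estimators_[0].max_features_


resolved = {str(setting): resolved_max_features(setting)
            for setting in ['sqrt', 'log2', 0.5]}

for setting, n_candidates in resolved.items():
    print(f'max_features={setting!s:>5} -> {n_candidates} features per split')
```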

## ***Step 7: Evaluating Model with `accuracy_score` and `classification_report`***

Play around with the number of features considered at each split by controlling the `max_features` parameter in the code. Then, use the `classification_report` to see whether it helps the model.

In [None]:
# 7. Evaluate Model:
y_pred_with_fr = rf_with_feature_randomness.predict(X_test)
accuracy_with_fr = accuracy_score(y_test, y_pred_with_fr)
print(f'Accuracy with feature randomness: {accuracy_with_fr:.4f}')
print('Classification Report with feature randomness:')
print(classification_report(y_test, y_pred_with_fr))

## ***Experiment***

Experimentation suggestions:

- Try different `max_features` values in Step 6 and evaluate the changes with Step 7.
- Vary `n_informative` to see how the number of informative features influences the best `max_features`.
- Try increasing `n_clusters_per_class` to see whether feature randomness improves the model on more complex data.
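
One possible way to run the first experiment is to loop over several `max_features` values instead of editing Step 6 by hand. This is a minimal sketch: the grid `candidate_values` and the use of `n_jobs=-1` are assumptions, and the accuracies depend on the single train/test split used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, n_redundant=0,
                           n_clusters_per_class=2, random_state=707)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=707)

# Illustrative grid, from a single feature per split up to all 20 features.
candidate_values = [1, 2, 4, 8, 12, 20]

accuracy_by_max_features = {}
for m in candidate_values:
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                random_state=707, n_jobs=-1)
    rf.fit(X_train, y_train)
    accuracy_by_max_features[m] = accuracy_score(y_test, rf.predict(X_test))

for m, acc in accuracy_by_max_features.items():
    print(f'max_features={m:>2}: accuracy={acc:.4f}')
```

Changing `n_informative` or `n_clusters_per_class` in the `make_classification` call and rerunning the loop covers the other two suggestions.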




## ***What did you find?***

- You'll likely find that, in general, considering fewer features at each split leads to a better-performing model.

- The best `max_features` value tends to hover around the number of informative features.
  - This could in turn mean that if all the features in a dataset are informative, feature randomness may offer little benefit. However, with complex real-world datasets, this is unlikely to be the case.
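
Because a single train/test split can be noisy, one way to test these findings is to compare `max_features` values with cross-validation for several `n_informative` settings. The sketch below swaps the single split for 3-fold `cross_val_score`; the grids, the 50-tree forests, and the fold count are all assumptions chosen to keep the run short, so treat the output as a rough indication only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidate_max_features = [2, 4, 8, 20]  # illustrative grid
best_by_n_informative = {}

for n_informative in [4, 8, 16]:
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=n_informative, n_redundant=0,
                               n_clusters_per_class=2, random_state=707)

    # Mean cross-validated accuracy for each candidate max_features value.
    mean_scores = {}
    for m in candidate_max_features:
        rf = RandomForestClassifier(n_estimators=50, max_features=m,
                                    random_state=707, n_jobs=-1)
        mean_scores[m] = cross_val_score(rf, X, y, cv=3).mean()

    best = max(mean_scores, key=mean_scores.get)
    best_by_n_informative[n_informative] = best
    print(f'n_informative={n_informative:>2}: best max_features={best} '
          f'(mean accuracy={mean_scores[best]:.4f})')
```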