Alex Janss
The GEO database is home to thousands of microarray datasets. These datasets measure the relative gene expression across thousands of selected genes for a given patient sample. Each value in the output roughly corresponds to the abundance of an RNA molecule of a given type (corresponding to the given gene) for a particular sample, with higher values indicating more RNA molecules.
Microarrays are an extremely important source of data, as they are cheap and easy to run and yield a great deal of information about the state of the cells in a sample. The gene expression data obtained from the cells reveals information about the cells' activity and can be used to distinguish types of cells (e.g. brain vs. liver) and states of cells (e.g. healthy vs. diseased).
The dataset used in this experiment has 174 different samples, each from one of nine different types of cancer (Breast, Central Nervous System, Colon, Leukemia, Melanoma, Non-Small Cell Lung, Ovarian, Prostate, and Renal). My goal is to create a model that can successfully classify a sample into one of the nine cancer types based on its microarray expression data.
Importing the necessary libraries and data:
import numpy as np
import random
import pickle
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
np.set_printoptions(precision=4)
path = ''
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/My Drive/Colab Notebooks/'
Let's load the data and take a look at the first few rows:
# load data
try:
    # load cached copies if they exist
    labels = pickle.load(open(path + "labels.pickle", "rb"))
    data = pickle.load(open(path + "data.pickle", "rb"))
except (OSError, IOError):
    df = pd.read_csv(path + 'GDS4296_table.csv', low_memory=False, index_col=0)
    df = df.T
    df.drop(columns='#NAME?', inplace=True)  # drop columns w/ missing names
    df = df.sort_values(by='Cell_type')      # sort by label
    data = df.iloc[:, 1:]                    # separate data and labels
    labels = df.iloc[:, 0]
    data = (data - data.mean()) / data.std() # rescale (normalize) the variables
    pickle.dump(data, open(path + "data.pickle", "wb"))
    pickle.dump(labels, open(path + "labels.pickle", "wb"))
print(data.head())
Each row ("GSM...") represents a single patient sample that we will try to classify into one of the 9 types of cancers. There are 174 samples.
The labels
variable designates which type of cancer each sample came from.
The remaining columns labeled by gene name represent the level of a particular RNA molecule within the given sample*. There are 54,676 genes. Note that I have rescaled (normalized) the genes accross the 174 samples.
To state the obvious, this is an extremely high-dimensional dataset, with 54,676 predictors and only 174 data points. This will be a major challenge in fitting models to this data.
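As a quick sanity check (not shown in the original notebook), one could confirm these dimensions and inspect the class balance directly. This is only a sketch; it assumes the data and labels objects loaded above.
# sanity check: dataset dimensions and class balance
print(data.shape)             # (samples, genes) -- expected 174 rows
print(labels.value_counts())  # number of samples per cancer type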
Let's take a look at the distributions of a few of the variables:
*Note: the value represents the relative expression of that RNA; it cannot be used to determine the absolute quantity of an RNA molecule and cannot be compared across columns. Thus not much information is lost in rescaling.
# histograms
sns.set()
gene1, gene2, gene3 = 100, 200, 300
fig, axes = plt.subplots(ncols=3, sharex=True, sharey=True, figsize=(8,4))
data.iloc[:,gene1].plot.hist(ax=axes[0])
data.iloc[:,gene2].plot.hist(ax=axes[1])
data.iloc[:,gene3].plot.hist(ax=axes[2])
axes[0].set_xlabel('Level of gene ' + data.iloc[:,gene1].name)
axes[1].set_xlabel('Level of gene ' + data.iloc[:,gene2].name)
axes[2].set_xlabel('Level of gene ' + data.iloc[:,gene3].name)
axes[0].set_ylabel('# of samples')
The distribution of the samples appears to be (somewhat) normal for these few genes. Now that we have some idea of what the data looks like, we will begin classification.
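As a rough check on that impression (not part of the original analysis), one could run a Shapiro-Wilk normality test on the same three columns using the scipy.stats module imported earlier; this is only an illustrative sketch.
# rough normality check on the same three genes (illustrative only)
for g in (gene1, gene2, gene3):
    stat, p = stats.shapiro(data.iloc[:, g])
    print(f"column {g}: W={stat:.3f}, p={p:.3f}")  # large p = no strong evidence against normality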
Naive Bayes is arguably one of the simplest classification techniques in machine learning. There are no hyperparameters to tune; one simply runs the model. Here I use the Gaussian version of the algorithm, as the predictors are continuous rather than categorical.
Let's test this method via 5-fold cross validation. I use the stratified version of K-fold splitting, as the classes are not all equally represented.
# GNB, 5-fold cross validation.
random.seed(711)
gnb_mod = GaussianNB() # Create a Gaussian Classifier
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
error = []
for train_index, test_index in skf.split(data, labels):
    X_train, X_test = data.iloc[train_index], data.iloc[test_index]  # train-test split
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    gnb_mod.fit(X_train, y_train)  # fit on train
    y_pred = gnb_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
print(sum(error)/len(error)) # average accuracy
96.8% prediction accuracy! This is impressive for a model that took no work to tune, on a difficult dataset.
Random forest is an extremely popular out-of-the-box ML tool that works very well with very little tuning on a variety of datasets. It works by averaging a large number of high-variance decision trees. Here we will test RF with 500, 1000, and 2000 trees using 5-fold CV.
# RF, 5-fold CV
random.seed(731)
num_trees = [500,1000,2000] # hyperparameters to test
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for param in num_trees:
    rf_mod = RandomForestClassifier(n_estimators=param)  # create RF model
    error = []  # reset accuracy list for each setting
    for train_index, test_index in skf.split(data, labels):
        X_train, X_test = data.iloc[train_index], data.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        rf_mod.fit(X_train, y_train)  # fit on train
        y_pred = rf_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print("Accuracy:", sum(error)/len(error), "with", param, "trees")  # average accuracy
Our best model achieved 98.8% accuracy. This is an incredible performance for a model that also took very little tuning. RF models are often used as a baseline for testing the performance of more complex models, and 98.8% is a very high bar to beat! I will now attempt to beat it with other models, but even if none succeed, this is stellar performance given the data.
Tree boosting is a powerful method that successively combines many weak learners (small trees) into an aggregate model. Each model takes much longer to run and tune than the methods used above, as the model-building process is computationally expensive and there are more hyperparameters. As an illustration, I run a single boosting model with 5-fold CV using scikit-learn's implementation.
# Boosting, 5-fold CV
random.seed(1699)
boost_mod = GradientBoostingClassifier(n_estimators=100,learning_rate=.1,max_depth=2) # create boosting model
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
error = []
for train_index, test_index in skf.split(data, labels):
    X_train, X_test = data.iloc[train_index], data.iloc[test_index]  # train-test split
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    boost_mod.fit(X_train, y_train)  # fit on train
    y_pred = boost_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
print(sum(error)/len(error)) # average accuracy
As is evident, this model needs much more tuning to perform well. I will now perform a grid search using a single train-test split.
# Boosting, Grid search
random.seed(711)
n_trees = np.array([50,100,500])
rate = np.array([.001, .01, .1])
depth = np.array([2,4])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T # grid search parameters
X_train, X_test, y_train, y_test = train_test_split(data, labels, stratify=labels, test_size=0.2) # single train-test split
error = []
# create boosting model for each combination of hyperparameters
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))
    boost_mod.fit(X_train, y_train)
    y_pred = boost_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(p, error[-1])
print(error)
This grid search ended up being unhelpful. Because of the single train-test split, the test set contains only about 35 samples, so accuracy can only change in steps of roughly 3%, and most of the parameter combinations yielded identical scores. Unfortunately, we will need to use 5-fold CV to evaluate the hyperparameters more precisely.
# Boosting, 5-fold CV, Grid search
random.seed(711)
n_trees = np.array([50,100,500])
rate = np.array([.001, .01, .1])
depth = np.array([2,4])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T # grid search params
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))  # create boosting model
    error = []
    for train_index, test_index in skf.split(data, labels):
        X_train, X_test = data.iloc[train_index], data.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
Here the run was manually terminated after just 5 sets of hyperparameters had been tested; even this took over 4 hours to run on my machine. The problem is quickly becoming intractable: even with this very coarse grid, fitting 18 x 5 = 90 boosting models on 50,000+ variables would take far too long.
The runtime is driven by the 50,000+ variables that the model has to evaluate. In order to reduce the dimensionality and make this problem more tractable, we need to remove some of the variables.
The sklearn random forest implementation comes with a feature_importances_ attribute. It is determined by assessing the mean decrease in node impurity across the hundreds of trees: variables that split the classes with greater purity are assigned a higher importance. We will use these importance scores to select which variables to keep.
# RF variable importance
rf_mod = RandomForestClassifier(n_estimators=1000, random_state=3314) # create RF model
_ = rf_mod.fit(data, labels) # fit on entire dataset
#@title
var_import = np.sort(rf_mod.feature_importances_)[::-1]  # importance scores, sorted descending
total, result = 0.0, 0
for i in range(len(var_import)):
    total = total + var_import[i]
    if total > .90:  # importances sum to 1, so this is 90% of cumulative importance
        result = i
        break
plt.plot(np.arange(len(var_import)), var_import)
plt.fill_between(np.arange(len(var_import)), var_import)
line1 = plt.axvline(x=result, color='gray')
plt.legend((line1,), ("90% of cumulative importance",))
plt.title("Variable Importances")
Above are the variable importance scores, sorted, for all 54,623 variables. As we can see, the vast majority of the total importance is concentrated in the top few thousand variables, with the vertical line indicating where the cumulative importance reaches 90%.
Below I extract the 500, 1000, and 2000 most important features (as determined by the above score).
# extract most important features
var_import_index = np.argsort(rf_mod.feature_importances_)[::-1] # sort by feature importance
data_best500 = data.iloc[:,var_import_index[:500]] # top 500 features
data_best1000 = data.iloc[:,var_import_index[:1000]] # top 1000
data_best2000 = data.iloc[:,var_import_index[:2000]] # top 2000
print(data_best500.iloc[:5,:5])
A quick Google search of these genes is very encouraging. According to NCBI.gov: "Mutations in the PAX8 gene have been associated with thyroid dysgenesis, thyroid follicular carcinomas and atypical follicular thyroid adenomas." The top gene is associated with thyroid cancer! Below I plot the distributions of a few of these variables across the 9 classes.
fig, axes = plt.subplots(ncols=3, sharey=True, figsize=(16,6),)
class_labels = np.unique(labels)  # the 9 cancer types
pax8_box = [data_best500[labels == c].iloc[:, 0] for c in class_labels]
tmem101_box = [data_best500[labels == c].iloc[:, 1] for c in class_labels]
trim15_box = [data_best500[labels == c].iloc[:, 3] for c in class_labels]
_= axes[0].boxplot(pax8_box)
axes[0].set_title("PAX8")
_=axes[1].boxplot(tmem101_box)
axes[1].set_title("TMEM101")
_=axes[2].boxplot(trim15_box)
_=axes[2].set_title("TRIM15")
These variables with high importance are clearly very useful in separating the classes. PAX8 has high expression in classes 7 and 9, TMEM101 has low expression for class 7, and TRIM15 has high expression in class 3.
Now I will attempt the same grid search as above, with 5-fold CV, using only the top 500 variables.
# Boosting 500 features, 5-fold CV, Grid search
random.seed(711)
n_trees = np.array([100, 500, 900]) # grid search parameters
rate = np.array([.1, .01, .001])
depth = np.array([2, 4])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))  # create boosting model
    error = []
    for train_index, test_index in skf.split(data_best500, labels):
        X_train, X_test = data_best500.iloc[train_index], data_best500.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
This code ran much faster, but the results are not encouraging. The top score was 94% accuracy, which doesn't even beat Naive Bayes. Perhaps 500 variables are not enough; we attempt the same with 1000 variables.
#@title Boosting, top 1000 features
# Boosting after feature selection, 5-fold CV, Grid search, best 1000
random.seed(711)
n_trees = np.array([100, 500, 900]) # grid search parameters
rate = np.array([.1, .01, .001])
depth = np.array([2, 4])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))  # create boosting model
    error = []
    for train_index, test_index in skf.split(data_best1000, labels):
        X_train, X_test = data_best1000.iloc[train_index], data_best1000.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
This is better. The best model had 98.3% accuracy, with parameters n_estimators=900, learning_rate=.1, max_depth=2. This is getting close to our Random Forest benchmark of 98.8%.
I will now grid-search around these parameters: I will try n_estimators=[800, 1000, 1200] and learning_rate=[.05, .1, .2] with max_depth=2.
#@title Grid Search, top 1000 features
# Boosting after feature selection, 5-fold CV, Grid search, best 1000
random.seed(711)
n_trees = np.array([800, 1000, 1200]) # grid search parameters
rate = np.array([.05, .1, .2])
depth = np.array([2])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))  # create boosting model
    error = []
    for train_index, test_index in skf.split(data_best1000, labels):
        X_train, X_test = data_best1000.iloc[train_index], data_best1000.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
None of these settings improved performance. I will retry the original search parameters, but with 2000 variables.
#@title Boosting, top 2000 features
# Boosting after feature selection, 5-fold CV, Grid search, best 2000
random.seed(711)
n_trees = np.array([100, 500, 900]) # grid search parameters
rate = np.array([.1, .01, .001])
depth = np.array([2, 4])
params = np.array(np.meshgrid(n_trees, rate, depth)).reshape(3,-1).T
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = GradientBoostingClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]))  # create boosting model
    error = []
    for train_index, test_index in skf.split(data_best2000, labels):
        X_train, X_test = data_best2000.iloc[train_index], data_best2000.iloc[test_index]  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
None of these models outperformed the models with only 1000 features, and this is beginning to take significantly more time to run. We will try to optimize the approach.
XGBoost is the Python industry standard for boosted trees, with several useful features including stochastic gradient boosting and GPU acceleration. This will allow me to run models faster.
Another important method often used in practice is random search (as opposed to grid search). For the same number of trials, it explores more of the parameter space.
With grid search, nine trials test only three distinct values of the important parameter; with random search, all nine trials explore distinct values. Figure credit: Peter Worcester.
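To make the figure's point concrete, here is a minimal sketch (not from the original notebook) comparing the two schemes for nine trials; the learning-rate range and depth values shown are examples only.
# grid search: 3 learning rates x 3 depths = 9 trials, but only 3 distinct learning rates
grid = np.array(np.meshgrid([.001, .01, .1], [2, 4, 6])).reshape(2, -1).T
print(np.unique(grid[:, 0]).size)    # -> 3 distinct learning-rate values across 9 trials
# random search: 9 trials, 9 distinct learning rates drawn from the same range
random_rates = stats.uniform(loc=.001, scale=.099).rvs(9)
print(np.unique(random_rates).size)  # -> 9 distinct learning-rate values across 9 trials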
I will attempt a random search through the parameters and will use the stochastic boosting feature, meaning that only a random portion of the variables and data points is used to fit each small tree. For this trial I will use 10% of the rows and 10% of the columns, which will allow the models to run in a reasonable amount of time.
#XGboost, RandomSearch, stochastic gradient
random.seed(711)
n_iter = 10
n_trees = np.array([100, 200, 300, 500, 700])
rate = np.array([0.1, .07, .05, .03, 0.01])
depth = np.array([2,3,4])
params = np.array([np.random.choice(n_trees,n_iter),np.random.choice(rate,n_iter),np.random.choice(depth,n_iter)]).T # grid search params
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = XGBClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=int(p[2]), colsample_bytree=.1, subsample=.1, tree_method="gpu_hist")  # create boosting model
    error = []
    for train_index, test_index in skf.split(data, labels):
        X_train, X_test = data.iloc[train_index].to_numpy(), data.iloc[test_index].to_numpy()  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print(sum(error)/len(error), p)  # average accuracy
My initial trial is not too promising, with a best score of 93%, but I will try again without stochastic boosting (100% of rows and columns used). In order to speed things up, I will use only the top 500 features selected previously.
#@title XGBoost, RandomSearch, top 500 features
#Boosting, Random grid search, XGboost, feature selection
random.seed(711)
# create a default XGBoost classifier
boost_mod = XGBClassifier(tree_method="gpu_hist")
# create K-fold object
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
# Create the grid search parameter grid and scoring function
param_grid = {
"learning_rate": [0.1, .07, .05, .03, 0.01],
"max_depth": [2, 3, 4],
"n_estimators": [100, 200, 300, 500, 700],
}
# create the grid search object
grid = RandomizedSearchCV(estimator=boost_mod, param_distributions=param_grid,cv=skf,scoring='accuracy',n_jobs=-1,n_iter=10)
# fit grid search
best_model = grid.fit(data_best500.to_numpy(), labels)
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')
pd.DataFrame(best_model.cv_results_).iloc[:,[4,5,6,-3]]
Results are slightly better, with a top score of 94%. I will repeat the above search with the top 1000 and 2000 features.
#@title XGBoost, RandomSearch, top 1000 features
#Boosting, Random grid search, XGboost, best 1000 features
random.seed(711)
# create a default XGBoost classifier
boost_mod = XGBClassifier(tree_method="gpu_hist")
# create K-fold object
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
# Create the grid search parameter grid and scoring function
param_grid = {
"learning_rate": [0.1, .07, .05, .03, 0.01],
"max_depth": [2, 3, 4],
"n_estimators": [100, 200, 300, 500, 700],
}
scoring = 'accuracy'
# create the grid search object
grid = RandomizedSearchCV(
estimator=boost_mod,
param_distributions=param_grid,
cv=skf,
scoring=scoring,
n_jobs=-1,
n_iter=10,
)
# fit grid search
best_model = grid.fit(data_best1000.to_numpy(), labels)
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')
#@title XGBoost, RandomSearch, top 2000 features
#Boosting, Random grid search, XGboost, best 2000 feature selection
random.seed(711)
# create a default XGBoost classifier
boost_mod = XGBClassifier(tree_method="gpu_hist")
# create K-fold object
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
# Create the grid search parameter grid and scoring function
param_grid = {
"learning_rate": [0.1, .07, .05, .03, 0.01],
"max_depth": [2, 3, 4],
"n_estimators": [100, 200, 300, 500, 700],
}
scoring = 'accuracy'
# create the grid search object
grid = RandomizedSearchCV(
estimator=boost_mod,param_distributions=param_grid,cv=skf,scoring=scoring,n_jobs=-1,n_iter=10)
# fit grid search
best_model = grid.fit(data_best2000.to_numpy(), labels)
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')
This is much better. With 1000 variables, 97.1% was achieved, and with 2000, 98.2%. I will now try these parameters with all of the features:
# XGBoost, 5-fold CV, all features used
random.seed(1699)
boost_mod = XGBClassifier(n_estimators=700,learning_rate=.05,max_depth=4, tree_method="gpu_hist") # create boosting model
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
error = []
for train_index, test_index in skf.split(data, labels):
    X_train, X_test = data.iloc[train_index].to_numpy(), data.iloc[test_index].to_numpy()  # train-test split
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    boost_mod.fit(X_train, y_train)  # fit on train
    y_pred = boost_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
print(sum(error)/len(error)) # average accuracy
Adding in the rest of the features did not help, and in fact hurt quite a bit, likely due to higher variance. I notice that higher numbers of trees seem to perform well (700 trees, at the upper end of the search), so I now run a search with higher numbers of trees, between 600 and 1200.
# Another grid search, more trees.
#XGBoost, RandomSearch, top 2000 features
random.seed(711)
# create a default XGBoost classifier
boost_mod = XGBClassifier(tree_method="gpu_hist")
# create K-fold object
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
# Create the grid search parameter grid and scoring function
param_grid = {
"learning_rate": stats.uniform(loc=.01,scale=.99),
"max_depth": [2, 4],
"n_estimators": np.arange(600,1200),
}
# create the grid search object
grid = RandomizedSearchCV(estimator=boost_mod, param_distributions=param_grid, cv=skf, scoring='accuracy', n_jobs=-1, n_iter=10)
# fit grid search
best_model = grid.fit(data_best2000.to_numpy(), labels)
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')
Again, good, but no improvement over the previous results. We still have not beaten random forest at 98.8%.
Now comes a twist in my tale: by chance I ran a model with the following parameters, using some mild stochastic boosting.
# XGBoost, 5-fold CV, all features
random.seed(1699)
boost_mod = XGBClassifier(n_estimators=100,learning_rate=.1,max_depth=2, tree_method="gpu_hist", colsample_bytree=.7, subsample=.7) # create boosting model
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
error = []
for train_index, test_index in skf.split(data, labels):
    X_train, X_test = data.iloc[train_index].to_numpy(), data.iloc[test_index].to_numpy()  # train-test split
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    boost_mod.fit(X_train, y_train)  # fit on train
    y_pred = boost_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
print(sum(error)/len(error)) # average accuracy
This is the highest score yet! I now do a RandomSearch around those parameters:
# random search around the lucky numbers
# XGBoost, all features
random.seed(711)
n_iter = 10 # n random searches
# search parameters
n_trees = np.arange(70,130)
rate = np.linspace(.05,.5,50)
cols_samp = np.linspace(.5,1,50)
subsamp = np.linspace(.3,1,50)
params = np.array([np.random.choice(n_trees,n_iter),np.random.choice(rate,n_iter),np.random.choice(cols_samp,n_iter),np.random.choice(subsamp,n_iter)]).T
# 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
for p in params:
    boost_mod = XGBClassifier(n_estimators=int(p[0]), learning_rate=p[1], max_depth=2, colsample_bytree=p[2], subsample=p[3], tree_method="gpu_hist")  # create boosting model
    error = []
    for train_index, test_index in skf.split(data, labels):
        X_train, X_test = data.iloc[train_index].to_numpy(), data.iloc[test_index].to_numpy()  # train-test split
        y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
        boost_mod.fit(X_train, y_train)  # fit on train
        y_pred = boost_mod.predict(X_test)  # predict on test
        error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
    print("Accuracy:", sum(error)/len(error), "params:", p)  # average accuracy
100% accuracy with 5-fold CV! To verify, I repeat with those parameters and a different seed:
# Boosting, 5-fold CV, XGBoost
random.seed(7211)
boost_mod = XGBClassifier(n_estimators=74,learning_rate=.2429,max_depth=2, tree_method="gpu_hist", colsample_bytree=.6939, subsample=.5857) # create boosting model
skf = StratifiedKFold(n_splits=5, shuffle=True) # 5-fold
error = []
for train_index, test_index in skf.split(data, labels):
    X_train, X_test = data.iloc[train_index].to_numpy(), data.iloc[test_index].to_numpy()  # train-test split
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    boost_mod.fit(X_train, y_train)  # fit on train
    y_pred = boost_mod.predict(X_test)  # predict on test
    error.append(metrics.accuracy_score(y_test, y_pred))  # model accuracy
print(sum(error)/len(error)) # average accuracy
Incredible! Boosting (with a lot of tuning) was able to beat random forest by a small margin.
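One natural follow-up (not part of the original notebook) is to check where any remaining errors fall, for example with a confusion matrix on a single held-out split. The sketch below reuses the tuned parameters from above; the split and output are illustrative only.
# per-class breakdown of the final model on one held-out split (illustrative only)
X_train, X_test, y_train, y_test = train_test_split(data, labels, stratify=labels, test_size=0.2)
final_mod = XGBClassifier(n_estimators=74, learning_rate=.2429, max_depth=2,
                          colsample_bytree=.6939, subsample=.5857, tree_method="gpu_hist")
final_mod.fit(X_train.to_numpy(), y_train)
y_pred = final_mod.predict(X_test.to_numpy())
print(metrics.confusion_matrix(y_test, y_pred))       # rows = true class, columns = predicted class
print(metrics.classification_report(y_test, y_pred))  # per-class precision and recall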
Below is a simple summary of the best result for each type of model.
# Results Summary
print("Model:\t\t\t", "Accuracy:")
print("Naive Bayes\t\t\t96.8%")
print("Random Forest\t\t\t98.8%")
print("Boosting\t\t\t92.3%")
print("Boosting, Var selection\t\t98.4%")
print("Boosting, stochastic\t\t99.4%")
From a biological standpoint, the most useful part of this study is the variable importance: identifying which genes matter most for which types of cancer. Even then, it would have been more useful to compare cancerous cells of one type to non-cancerous cells of the same type. As it stands, it is unclear whether the important genes are unique to the cancer type or simply to the cell type (e.g. genes unique to the brain vs. genes unique to the liver). But that was not the goal of this model: oncogene identification is another thing entirely.
From a data science standpoint, these models did unexpectedly well given the relatively low sample size, high dimensionality, and large number of classes. Random Forest performed incredibly well with almost no tuning. Boosting took an incredible amount of tuning to outperform RF, but eventually did beat it by a small margin. Interestingly, feature selection ended up not being used in the final model.