rename_duplicates
rename_duplicates(old)
Description:
A simple helper function that adds a numeric suffix to duplicate string entries.
Example:
for w in rename_duplicates(['Atom','Electron','Atom','Neutron','Atom']):
    print(w)
>> Atom
   Electron
   Atom_1
   Neutron
   Atom_2
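The behaviour shown in the example can be reproduced with a short generator. This is only a sketch inferred from the output above, not the library's actual implementation; the name rename_duplicates_sketch is hypothetical.

```python
from collections import Counter


def rename_duplicates_sketch(old):
    """Yield each string, adding _1, _2, ... to repeats of an earlier entry."""
    seen = Counter()
    for name in old:
        count = seen[name]
        seen[name] += 1
        # The first occurrence is kept unchanged; later ones get a numeric suffix.
        yield name if count == 0 else f"{name}_{count}"


result = list(rename_duplicates_sketch(['Atom', 'Electron', 'Atom', 'Neutron', 'Atom']))
print(result)  # ['Atom', 'Electron', 'Atom_1', 'Neutron', 'Atom_2']
```

Note that this sketch does not guard against a generated name such as Atom_1 colliding with an entry that is already in the input; whether the real function handles that case is not stated here.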
run_classifiers
run_classifiers(X,y,clf_lst=[LogisticRegression(C=0.1, n_jobs=-1)],names=None,num_runs=10,test_frac=0.2,scaling=True,metric='accuracy',runtime=False,verbose=True)
Description:
Fits every classifier in the list a given number of times and collects the requested metric for each run.
Args:
X: numpy.ndarray, feature array in the shape of (M X N).
If an array with shape (M,) is passed, the function coerces it to (M X 1) shape
y: numpy.ndarray, output array in the shape of (M X 1).
If an array with shape (M,) is passed, the function coerces it to (M X 1) shape
clf_lst: list/tuple, A list/tuple of Scikit-learn estimator objects (classifiers)
names: list/tuple of strings, Human-readable names/descriptions of the estimators,
e.g. 'Support Vector Machine with Linear Kernel and C=0.025' for an estimator object SVC(kernel="linear", C=0.025).
If not supplied explicitly, the function tries to derive a name from the estimator class, but the result may be less descriptive.
num_runs: int, Number of runs (fitting) per model
test_frac: float, Test set fraction
scaling: bool, flag to run StandardScaler on the data, default True
metric: str, name of the ML metric the user is interested in. Currently either 'accuracy' or 'f1'
runtime: bool, if True, calculates and returns the fitting time (in milliseconds) along with the ML metric
verbose: bool, if True, prints a single-line message after each estimator finishes num_runs runs
Returns:
df_scores: A Pandas DataFrame of scores, i.e. the requested ML metric, for all the runs.
If num_runs=10, this DataFrame has 10 rows. Each classifier/estimator is a separate column.
df_runtimes: A Pandas DataFrame of the training times (in milliseconds) for all the runs and estimators (see the runtime argument).
Example:
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from .mldsutils import run_classifiers
X1, y1 = make_classification(n_features=20,
                             n_samples=2000,
                             n_redundant=0,
                             n_informative=20,
                             n_clusters_per_class=1)
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=10),]
clf_names = ['k-Nearest Neighbors(3)',
             'Support Vector Machine with Linear Kernel',
             'Support Vector Machine with RBF Kernel']
d1,d2 = run_classifiers(X1,y1,
                        clf_lst=classifiers,
                        names=clf_names,
                        metric='f1',runtime=True,verbose=True)
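To make the shape of the returned DataFrames concrete, the repeated fit-and-score procedure described above can be sketched as follows. This is an illustrative approximation, not the library's code: repeated_scores is a hypothetical name, and fitting the scaler on the training split only is an assumption, since the documentation does not say where scaling happens.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def repeated_scores(X, y, estimators, names, num_runs=10, test_frac=0.2,
                    scaling=True, metric="accuracy"):
    """Hypothetical helper: one row per run, one column per estimator."""
    score_fn = f1_score if metric == "f1" else accuracy_score
    scores = {name: [] for name in names}
    runtimes = {name: [] for name in names}
    for name, est in zip(names, estimators):
        for _ in range(num_runs):
            # A fresh random train/test split for every run.
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac)
            if scaling:
                # Assumption: the scaler is fitted on the training part only.
                scaler = StandardScaler().fit(X_tr)
                X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
            start = time.perf_counter()
            est.fit(X_tr, y_tr)
            runtimes[name].append((time.perf_counter() - start) * 1000.0)  # ms
            scores[name].append(score_fn(y_te, est.predict(X_te)))
    return pd.DataFrame(scores), pd.DataFrame(runtimes)


X, y = make_classification(n_features=20, n_samples=500, n_redundant=0,
                           n_informative=20, n_clusters_per_class=1,
                           random_state=0)
df_scores, df_runtimes = repeated_scores(
    X, y,
    [KNeighborsClassifier(3), SVC(kernel="linear", C=0.025)],
    ["kNN(3)", "Linear SVC"],
    num_runs=5, metric="f1")
print(df_scores.shape)  # (5, 2)
```

With num_runs=5 and two estimators, both DataFrames have five rows and two columns, which matches the layout described under Returns.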
plot_bars
plot_bars(d,t1='Mean accuracy score of algorithms',t2='Std.dev of the accuracy scores of algorithms')
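No description is given for plot_bars, but its default titles suggest that it summarizes a score DataFrame by the per-estimator mean and standard deviation. Under that assumption, the quantities it would plot can be computed with pandas as below; the sample scores are made up for illustration.

```python
import pandas as pd

# Hypothetical scores: one row per run, one column per estimator.
d = pd.DataFrame({
    "kNN(3)": [0.90, 0.92, 0.94],
    "Linear SVC": [0.80, 0.85, 0.90],
})

# Column-wise mean and (sample) standard deviation across runs.
summary = pd.DataFrame({"mean": d.mean(), "std": d.std()})
print(summary)
```

Each column of summary could then be drawn as a bar chart, for instance with pandas' own Series.plot.bar, which requires matplotlib.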
run_regressors
run_regressors(X,y,reg_lst=[LinearRegression(n_jobs=-1)],names=None,num_runs=10,test_frac=0.2,scaling=True,metric='rmse',runtime=False,verbose=True)
Description:
Fits every regressor in the list a given number of times and collects the requested metric for each run.
Args:
X: numpy.ndarray, feature array in the shape of (M X N).
If an array with shape (M,) is passed, the function coerces it to (M X 1) shape
y: numpy.ndarray, output array in the shape of (M X 1).
If an array with shape (M,) is passed, the function coerces it to (M X 1) shape
reg_lst: list/tuple, A list/tuple of Scikit-learn estimator objects (regressors)
names: list/tuple of strings, Human-readable names/descriptions of the estimators,
e.g. 'LASSO regression with alpha=0.1' for an estimator object Lasso(alpha=0.1).
If not supplied explicitly, the function tries to derive a name from the estimator class, but the result may be less descriptive.
num_runs: int, Number of runs (fitting) per model
test_frac: float, Test set fraction
scaling: bool, flag to run StandardScaler on the data, default True
metric: str, name of the ML metric the user is interested in. Currently either 'rmse' or 'r2'
runtime: bool, if True, calculates and returns the fitting time (in milliseconds) along with the ML metric
verbose: bool, if True, prints a single-line message after each estimator finishes num_runs runs
Returns:
df_scores: A Pandas DataFrame of scores, i.e. the requested ML metric, for all the runs.
If num_runs=10, this DataFrame has 10 rows. Each regressor/estimator is a separate column.
df_runtimes: A Pandas DataFrame of the training times (in milliseconds) for all the runs and estimators (see the runtime argument).
Example:
from .mldsutils import *
from sklearn.linear_model import LinearRegression, Lasso, Ridge
import numpy as np
reg_names = ["Linear regression","L1 (LASSO) regression","Ridge regression"]
regressors = [LinearRegression(n_jobs=-1),Lasso(alpha=0.1),Ridge(alpha=0.1)]
X = np.random.normal(size=200)
y = 2*X+3+np.random.uniform(1,2,size=200)
d1 = run_regressors(X,y,regressors,names=reg_names,metric='r2',runtime=False,verbose=True)
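The example passes a one-dimensional X of shape (200,), relying on the coercion to (M X 1) described under Args. A minimal sketch of that reshaping, together with the two supported metrics written out in NumPy, might look like this. The helper names as_column, rmse and r2 are hypothetical, and the library may well compute the metrics through Scikit-learn instead.

```python
import numpy as np


def as_column(a):
    """Hypothetical helper: reshape a 1-D array of shape (M,) to (M, 1)."""
    a = np.asarray(a)
    return a.reshape(-1, 1) if a.ndim == 1 else a


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.ravel(y_true) - np.ravel(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


X = as_column(np.arange(5.0))
print(X.shape)  # (5, 1)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
print(rmse(y_true, y_pred))  # 1.0
print(r2(y_true, y_pred))
```

Lower is better for rmse, while r2 is at most 1 and higher is better, which matters when comparing the columns of df_scores across the two metrics.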