datascience's Introduction

This is a Data Science practice project, created to increase the level of expertise.

TECHNOLOGY USED

	1. PySpark
	2. Pandas
	3. R
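
A minimal PySpark sketch (an illustration added here, not code from this repo): starting a local Spark session and loading a CSV dataset such as the Iris data ("iris.csv" is a hypothetical path).

	#Python
		from pyspark.sql import SparkSession
		# Start (or reuse) a local Spark session
		spark = SparkSession.builder.appName("datascience-practice").getOrCreate()
		# Read a CSV file into a Spark DataFrame
		df = spark.read.csv("iris.csv", header=True, inferSchema=True)
		df.printSchema()
		df.show(5)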

PROJECTS USED FOR PRACTICE

	1. Iris Data
	2. Titanic Data
	3. Loan Prediction Data
	4. Bigmart Sales Data
	5. Boston Housing Data
	6. Human Activity Recognition Data
	7. Black Friday Data
	8. Siam Competition Data
	9. Trips History Data
	10. Million Songs Data
	11. Census Data
	12. Movie Lens Data
	13. Identify your Digits
	14. Yelp Data
	15. ImageNet Data
	16. KDD Cup 1998
	17. Chicago Crime Data
	18. New York Yellow Taxi Data
	19. New York Green Taxi Data
	20. Chipotle Dataset

MODELS USED

A. Supervised Learning
B. Unsupervised Learning
C. Semi-supervised Learning
D. Reinforcement Learning
Machine Learning Algorithms:
1. Linear Regression
				
	A. Advantages
		1. Simple to explain
		2. Highly interpretable
		3. Model training and prediction are fast
		4. No tuning is required
		5. Features don't need scaling
		6. Can perform well with a small number of observations
		7. Well understood
	B. Disadvantages
		1. Presumes a linear relationship between the features and the response
		2. Performance is not competitive with the best supervised learning methods
		3. Can't automatically learn feature interactions
	
	#Python
				
		from sklearn import linear_model
		# Placeholders: replace with your actual training and test data
		x_train = input_variables_values_training_datasets
		y_train = target_variables_values_training_datasets
		x_test = input_variables_values_test_datasets
		# Fit a linear regression model and score it on the training data
		linear = linear_model.LinearRegression()
		linear.fit(x_train, y_train)
		linear.score(x_train, y_train)
		# Predict on the test data
		predicted = linear.predict(x_test)
	
	#R 
		# Placeholders: replace with your actual training and test data
		x_train <- input_variables_values_training_datasets
		y_train <- target_variables_values_training_datasets
		x_test <- input_variables_values_test_datasets
		# Combine features and target, fit a linear model, and inspect it
		x <- cbind(x_train, y_train)
		linear <- lm(y_train ~ ., data = x)
		summary(linear)
		# Predict on the test data
		predicted <- predict(linear, x_test)

2. Logistic Regression
	
	A. Advantages
		1. Highly interpretable
		2. Model training and prediction are fast
		3. No tuning is required
		4. Features don't need scaling
		5. Can perform well with a small number of observations
		6. Outputs well-calibrated predicted probabilities (see the predict_proba sketch after the R example below)
	B. Disadvantages
		1. Presumes a linear relationship between the features and the log-odds of the response
		2. Performance is not competitive with the best supervised learning methods
		3. Can't automatically learn feature interactions
		
	#Python
		from sklearn.linear_model import LogisticRegression
		# X, y and x_test are placeholders for the training features, labels and test features
		model = LogisticRegression()
		model.fit(X, y)
		model.score(X, y)
		predicted = model.predict(x_test)
	
	#R					
		x <- cbind(x_train,y_train)
		logistic <- glm(y_train ~ ., data = x,family='binomial')
		summary(logistic)
		predicted= predict(logistic,x_test)
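
	Regarding advantage 6 above: a minimal sketch (assuming the fitted scikit-learn model from the Python example) of obtaining the calibrated class probabilities.

	#Python
		# One probability per class for each test row; [:, 1] is the positive-class probability
		probabilities = model.predict_proba(x_test)
		positive_class_prob = probabilities[:, 1]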
				
3. Decision Trees
	A. Advantages
		1. Can be used for regression or classification
		2. Can be displayed graphically (see the plot_tree sketch after the R example below)
		3. Highly interpretable
		4. Can be specified as a series of rules, and more closely approximate human decision-making than other models
		5. Prediction is fast
		6. Features don't need scaling
		7. Automatically learns feature interactions
		8. Tends to ignore irrelevant features
		9. Non-parametric (will outperform linear models if relationship between features and response is highly non-linear)
	B. Disadvantages
		1. Performance is (generally) not competitive with the best supervised learning methods
		2. Can easily overfit the training data (tuning is required)
		3. Small variations in the data can result in a completely different tree (high variance)
		4. Recursive binary splitting makes "locally optimal" decisions that may not result in a globally optimal tree
		5. Doesn't tend to work well if the classes are highly unbalanced
		6. Doesn't tend to work well with very small datasets
		
	#Python			
		from sklearn import tree
		model = tree.DecisionTreeClassifier(criterion='gini')
		model.fit(X, y)
		model.score(X, y)
		predicted= model.predict(x_test)
	#R 
		library(rpart)
		x <- cbind(x_train,y_train)
		fit <- rpart(y_train ~ ., data = x,method="class")
		summary(fit)
		predicted= predict(fit,x_test)
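
	Regarding advantage 2 above: a minimal sketch of displaying the fitted tree graphically with scikit-learn's plot_tree (assumes matplotlib is installed and uses the model from the Python example above).

	#Python
		import matplotlib.pyplot as plt
		from sklearn import tree
		# Draw the fitted decision tree; filled=True colours nodes by majority class
		tree.plot_tree(model, filled=True)
		plt.show()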
						
4. SVM (Support Vector Machine)
				
	#Python
		from sklearn import svm
		# SVC is the support vector classifier; X, y and x_test are placeholders
		model = svm.SVC()
		model.fit(X, y)
		model.score(X, y)
		predicted = model.predict(x_test)
			
	#R 
		library(e1071)
		x <- cbind(x_train,y_train)
		fit <-svm(y_train ~ ., data = x)
		summary(fit)
		predicted= predict(fit,x_test)		
				
5. Naive Bayes
															
					P(c|X) = P(x1|c) * P(x2|c) * ... * P(xn|c) * P(c)
	A. Advantages
		1. Model training and prediction are very fast
		2. Somewhat interpretable
		3. No tuning is required
		4. Features don't need scaling
		5. Insensitive to irrelevant features
		6. Performs better than logistic regression when the training set is very small
	B. Disadvantages
		1. Predicted probabilities are not well-calibrated
		2. Correlated features can be problematic
		3. Can't handle negative features with Multinomial Naive Bayes
		4. Has a higher "asymptotic error" than logistic regression
		
	#Python			
		from sklearn.naive_bayes import GaussianNB
		# Instantiate the model before fitting; X, y and x_test are placeholders
		model = GaussianNB()
		model.fit(X, y)
		predicted = model.predict(x_test)
	#R			
		library(e1071)
		x <- cbind(x_train,y_train)
		fit <-naiveBayes(y_train ~ ., data = x)
		summary(fit)
		predicted= predict(fit,x_test)
		
6. KNN (K-Nearest Neighbors)	
	A. Advantages:
		1. Simple to understand and explain
		2. Model training is fast
		3. Can be used for classification and regression
	
	B. Disadvantages:
		1. Must store all of the training data
		2. Prediction phase can be slow when N is large
		3. Sensitive to irrelevant features
		4. Sensitive to the scale of the data (see the scaling sketch after the R example below)
		5. Accuracy is not competitive with the best supervised learning methods
		
	#Python
		from sklearn.neighbors import KNeighborsClassifier
		model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
		model.fit(X, y)
		predicted = model.predict(x_test)
	#R 
		library(class)
		# knn() from the 'class' package predicts the test set directly from the training data
		predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
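
	Because KNN is sensitive to feature scale (disadvantage 4), features are usually standardised first; a minimal scikit-learn sketch using a Pipeline (X, y and x_test are placeholders).

	#Python
		from sklearn.pipeline import make_pipeline
		from sklearn.preprocessing import StandardScaler
		from sklearn.neighbors import KNeighborsClassifier
		# Standardise features, then fit KNN on the scaled data
		model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
		model.fit(X, y)
		predicted = model.predict(x_test)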
				
7. K-Means

	#Python
		from sklearn.cluster import KMeans
		# Fit 3 clusters and assign each test point to its nearest cluster centre
		model = KMeans(n_clusters=3, random_state=0)
		model.fit(X)
		predicted = model.predict(x_test)	
	#R 
		library(cluster)
		fit <- kmeans(X, 3) # 3 cluster solution
			
8. Random Forests
	A. Advantages(compared to decision trees):
		1. Performance is competitive with the best supervised learning methods
		2. Provides a more reliable estimate of feature importance
		3. Allows you to estimate out-of-sample error without using a train/test split or cross-validation (via the out-of-bag score; see the sketch below)
	B. Disadvantages
		1. Less interpretable
		2. Slower to train
		3. Slower to predict
		
	#Python
		from sklearn.ensemble import RandomForestClassifier
		model= RandomForestClassifier()
		model.fit(X, y)
		predicted= model.predict(x_test)
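
	Regarding advantage 3 above: a minimal sketch of the out-of-bag (OOB) estimate in scikit-learn, which scores each tree on the bootstrap samples it did not see during training.

	#Python
		from sklearn.ensemble import RandomForestClassifier
		model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
		model.fit(X, y)
		print(model.oob_score_)  # out-of-sample accuracy estimate without a separate test split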
				
9. Dimensionality Reduction Algorithms

	#Python
		from sklearn import decomposition
		# Create the PCA object before using it; k is a placeholder for the number of components to keep
		pca = decomposition.PCA(n_components=k)
		train_reduced = pca.fit_transform(train)
		test_reduced = pca.transform(test)
	#R 
		library(stats)
		pca <- princomp(train, cor = TRUE)
		train_reduced  <- predict(pca,train)
		test_reduced  <- predict(pca,test)
		
10. Gradient Boosting Algorithms
	
	10.1. GBM
			
		#Python
			from sklearn.ensemble import GradientBoostingClassifier
			model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
			model.fit(X, y)
			predicted= model.predict(x_test)
			
		#R 
			library(caret)
			x <- cbind(x_train,y_train)
			fitControl <- trainControl( method = "repeatedcv", number = 4, repeats = 4)
			fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
			predicted= predict(fit,x_test,type= "prob")[,2]
			
	10.2. XGBoost
		
		#Python
			from xgboost import XGBClassifier
			from sklearn.model_selection import train_test_split
			from sklearn.metrics import accuracy_score
			# dataset is a placeholder NumPy array: the first 10 columns are features, column 10 is the label
			X = dataset[:,0:10]
			Y = dataset[:,10]
			seed = 1
			X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
			model = XGBClassifier()
			model.fit(X_train, y_train)
			y_pred = model.predict(X_test)
			accuracy = accuracy_score(y_test, y_pred)
			
		#R 
			require(caret)
			x <- cbind(x_train,y_train)
			TrainControl <- trainControl( method = "repeatedcv", number = 10, repeats = 4)
			model <- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl, verbose = FALSE)
			predicted <- predict(model, x_test)
			
			
	10.3. LightGBM
			a. Faster training speed and higher efficiency
			b. Lower memory usage
			c. Better accuracy
			d. Parallel and GPU learning supported
			e. Capable of handling large-scale data
		#Python 
			import numpy as np
			import lightgbm as lgb
			# Random placeholder data: 500 rows, 10 features, binary labels
			data = np.random.rand(500, 10) 
			label = np.random.randint(2, size=500)
			train_data = lgb.Dataset(data, label=label)
			test_data = train_data.create_valid('test.svm')
			param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
			param['metric'] = 'auc'
			num_round = 10
			bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
			bst.save_model('model.txt')
			# Predict on 7 new random rows
			data = np.random.rand(7, 10)
			ypred = bst.predict(data)		
		#R 
			library(RLightGBM)
			data(example.binary)
			num_iterations <- 100
			config <- list(objective = "binary",  metric="binary_logloss,auc", learning_rate = 0.1,
					num_leaves = 63, tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, 
					bagging_fraction = 0.8, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)
			handle.data <- lgbm.data.create(x)
			lgbm.data.setField(handle.data, "label", y)
			handle.booster <- lgbm.booster.create(handle.data, lapply(config, as.character))
			lgbm.booster.train(handle.booster, num_iterations, 5)
			pred <- lgbm.booster.predict(handle.booster, x.test)
			sum(y.test == (pred > 0.5)) / length(y.test)
			lgbm.booster.save(handle.booster, filename = "/tmp/model.txt")
			
			
	10.4. Catboost
		#R 
			set.seed(1)
			require(titanic)
			require(caret)
			require(catboost)
			tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]
			data <- as.data.frame(as.matrix(tt), stringsAsFactors = TRUE)
			drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
			x <- data[,!(names(data) %in% drop_columns)]
			y <- data[,c("Survived")]
			fit_control <- trainControl(method = "cv", number = 4,classProbs = TRUE)
			grid <- expand.grid(depth = c(4, 6, 8),learning_rate = 0.1,iterations = 100, l2_leaf_reg = 1e-3, 
					rsm = 0.95, border_count = 64)
			report <- train(x, as.factor(make.names(y)),method = catboost.caret,verbose = TRUE,
					preProc = NULL,tuneGrid = grid, trControl = fit_control)
			print(report)
			importance <- varImp(report, scale = FALSE)
			print(importance)
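
		Only an R example is shown above; a minimal Python sketch with the catboost package (X, y, x_test and cat_feature_indices are placeholders for your data and the indices of the categorical columns).

		#Python
			from catboost import CatBoostClassifier
			# Categorical columns are passed by index via cat_features
			model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1, verbose=False)
			model.fit(X, y, cat_features=cat_feature_indices)
			predicted = model.predict(x_test)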
