
Khalid's Projects

ai-capstone-project-on-e-commerce-amazon-domain-

Project Task: Week 1 - Class Imbalance Problem
1. Perform an EDA on the dataset.
   a) See what a positive, negative, and neutral review looks like.
   b) Check the count for each class. This is a class imbalance problem.
2. Convert the reviews into TF-IDF scores.
3. Run a multinomial Naive Bayes classifier. Because of the class imbalance, everything will be classified as positive (see the baseline sketch after this section).

Project Task: Week 2 - Tackling the Class Imbalance Problem
Oversampling or undersampling can be used to tackle the class imbalance problem. Under class imbalance, use the following metrics to evaluate model performance: precision, recall, F1-score, and the AUC-ROC curve. Use the F1-score as the evaluation criterion for this project. Use tree-based classifiers such as Random Forest and XGBoost.
Note: Tree-based classifiers follow one of two paradigms, bagging or boosting, and have tuning parameters that take care of the imbalanced classes.

Project Task: Week 3 - Model Selection
Apply multi-class SVMs and neural nets. Try ensemble techniques such as XGBoost + oversampled multinomial NB. Assign a score to the sentence sentiment (engineer a feature called sentiment score), use this engineered feature in the model, check for improvements, and draw insights.

Project Task: Week 4 - Applying LSTM
1. Use an LSTM for the previous problem (tune LSTM parameters such as top words, embedding length, dropout, epochs, number of layers, etc.). Hint: GRU (Gated Recurrent Units), another variation of LSTM, can be tried as well.
2. Compare the accuracy of neural nets with traditional ML-based algorithms.
3. Find the settings of the LSTM and GRU that best classify the reviews as positive, negative, and neutral. Hint: Use techniques like grid search, cross-validation, and random search.

Optional Tasks: Week 4 - Topic Modeling
1. Cluster similar reviews. Note: Some reviews may talk about the device as a gift option; others may be about product looks, and some may highlight battery and performance. Try naming the clusters.
2. Perform topic modeling. Hint: Use scikit-learn's Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).

Download the datasets from here.
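
A minimal sketch of the Week 1 baseline, assuming a pandas DataFrame with hypothetical column names review_text and sentiment and a hypothetical file name (the actual names depend on the downloaded dataset):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    df = pd.read_csv("amazon_reviews.csv")  # hypothetical file name

    X_train, X_test, y_train, y_test = train_test_split(
        df["review_text"], df["sentiment"],
        test_size=0.2, random_state=42, stratify=df["sentiment"])

    # Convert the reviews into TF-IDF scores
    vec = TfidfVectorizer(max_features=5000, stop_words="english")
    X_train_tfidf = vec.fit_transform(X_train)
    X_test_tfidf = vec.transform(X_test)

    # Multinomial Naive Bayes baseline
    nb = MultinomialNB().fit(X_train_tfidf, y_train)

    # With the heavy positive skew, nearly every review should be predicted
    # as positive, which the per-class recall in the report will expose.
    print(classification_report(y_test, nb.predict(X_test_tfidf)))

The per-class precision and recall from this report are what motivate the Week 2 switch to F1-score and resampling.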

combating-twitter-hate-speech-using-ml-and-nlp

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

Problem Statement: Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate, and you, as a data scientist at Twitter, will help identify tweets with hate speech and remove them from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and make a robust model.

Domain: Social Media

Analysis to be done: Clean up the tweets and build a classification model using NLP techniques, cleanup specific to tweet data, and regularization and hyperparameter tuning with stratified k-fold cross-validation to get the best model.

Content:
id: identifier number of the tweet
label: 0 (non-hate) / 1 (hate)
tweet: the text in the tweet

Tasks (an end-to-end sketch follows below):
1. Load the tweets file using the read_csv function from the pandas package. Get the tweets into a list for easy text cleanup and manipulation.
2. Cleanup:
   - Normalize the casing.
   - Using regular expressions, remove user handles (these begin with '@').
   - Using regular expressions, remove URLs.
   - Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
   - Remove stop words and redundant terms like 'amp', 'rt', etc.
   - Remove '#' symbols from the tweets while retaining the terms.
   - As extra cleanup, remove terms with a length of 1.
3. Check out the top terms in the tweets: get all the tokenized terms into one large list, then use a Counter to find the 10 most common terms.
4. Data formatting for predictive modeling: join the tokens back into strings (required for the vectorizers), assign X and y, and perform train_test_split using sklearn.
5. Use TF-IDF values for the terms as features to get into a vector space model: import the TF-IDF vectorizer from sklearn, instantiate it with a maximum of 5000 terms in your vocabulary, fit and apply it on the train set, and apply it on the test set.
6. Model building: ordinary logistic regression. Instantiate LogisticRegression from sklearn with default parameters, fit on the train data, and make predictions for the train and test sets.
7. Model evaluation: accuracy, recall, and F1 score. Report the accuracy on the train set; report the recall on the train set (decent, high, or low); get the F1 score on the train set.
8. It looks like you need to adjust for the class imbalance, as the model seems to focus on the 0s: set the appropriate class weight in the LogisticRegression model, train again on the train set with the adjustment, and evaluate the predictions (accuracy, recall, and F1 score).
9. Regularization and hyperparameter tuning: import GridSearchCV and StratifiedKFold (because of the class imbalance), provide the parameter grid for the 'C' and 'penalty' parameters, and use a balanced class weight when instantiating the logistic regression. Find the parameters with the best recall in cross-validation: choose 'recall' as the scoring metric and a stratified 4-fold cross-validation scheme, then fit on the train set. What are the best parameters?
10. Predict and evaluate using the best estimator: use the best estimator from the grid search to make predictions on the test set. What is the recall on the test set for the hateful tweets? What is the F1 score?
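
A compact end-to-end sketch of these tasks, assuming a hypothetical file name tweets.csv with the label and tweet columns described above (NLTK's stopwords corpus must be downloaded once via nltk.download("stopwords")):

    import re
    from collections import Counter

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import TweetTokenizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, recall_score
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

    df = pd.read_csv("tweets.csv")  # hypothetical file name

    tokenizer = TweetTokenizer()
    stops = set(stopwords.words("english")) | {"amp", "rt"}

    def clean(tweet):
        tweet = tweet.lower()                           # normalize casing
        tweet = re.sub(r"@\w+", "", tweet)              # remove user handles
        tweet = re.sub(r"http\S+|www\.\S+", "", tweet)  # remove URLs
        tweet = tweet.replace("#", "")                  # drop '#' but keep the term
        terms = tokenizer.tokenize(tweet)
        return [t for t in terms if t not in stops and len(t) > 1]

    tokens = df["tweet"].apply(clean)
    print(Counter(t for toks in tokens for t in toks).most_common(10))

    X_train, X_test, y_train, y_test = train_test_split(
        tokens.str.join(" "), df["label"],
        test_size=0.3, random_state=42, stratify=df["label"])

    vec = TfidfVectorizer(max_features=5000)
    X_train_tfidf = vec.fit_transform(X_train)
    X_test_tfidf = vec.transform(X_test)

    # Balanced class weight counters the imbalance toward the 0s;
    # grid-search C and penalty with recall as the scoring metric.
    grid = GridSearchCV(
        LogisticRegression(class_weight="balanced", solver="liblinear"),
        param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
        scoring="recall",
        cv=StratifiedKFold(n_splits=4))
    grid.fit(X_train_tfidf, y_train)
    print(grid.best_params_)

    pred = grid.best_estimator_.predict(X_test_tfidf)
    print("recall:", recall_score(y_test, pred), "f1:", f1_score(y_test, pred))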

comcast---telecom--concumer-complain-analysis

Analysis Task
To perform these tasks, you can use any of the standard Python libraries, such as NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.
- Import the data into the Python environment.
- Provide a trend chart for the number of complaints at monthly and daily granularity levels.
- Provide a table with the frequency of complaint types. Which complaint types are most frequent, i.e., around internet, network issues, or other domains?
- Create a new categorical variable with the values Open and Closed: Open and Pending are to be categorized as Open, and Closed and Solved are to be categorized as Closed.
- Provide the state-wise status of complaints in a stacked bar chart, using the categorized variable from the previous step (see the sketch after this list). Provide insights on which state has the maximum complaints and which state has the highest percentage of unresolved complaints.
- Provide the percentage of complaints resolved to date that were received through the Internet and through customer care calls.
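
A minimal pandas/matplotlib sketch of the status recoding and the state-wise stacked bar chart, assuming hypothetical column names Status and State and a hypothetical file name (the actual dataset headers may differ):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Comcast_telecom_complaints_data.csv")  # hypothetical file name

    # Collapse Open/Pending into Open and Closed/Solved into Closed
    df["NewStatus"] = df["Status"].map(
        {"Open": "Open", "Pending": "Open", "Closed": "Closed", "Solved": "Closed"})

    # State-wise stacked bar chart of complaint status
    status_by_state = df.groupby(["State", "NewStatus"]).size().unstack(fill_value=0)
    status_by_state.plot(kind="bar", stacked=True, figsize=(14, 6))
    plt.ylabel("Number of complaints")
    plt.tight_layout()
    plt.show()

    # Percentage of unresolved (Open) complaints per state
    unresolved = status_by_state["Open"] / status_by_state.sum(axis=1) * 100
    print(unresolved.sort_values(ascending=False).head())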

face_docker

As a senior-level engineer, build a Docker image that takes a YouTube video and extracts unique faces from any soccer video. Finally, make a pull request to this repository. Sample video on YouTube: https://www.youtube.com/watch?v=JriaiYZZhbY&t=4s
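
A minimal sketch of the face-extraction step the image could run, assuming the video has already been downloaded to video.mp4 (e.g., with a tool such as yt-dlp) and using OpenCV's bundled Haar cascade; truly de-duplicating faces would additionally need a face-embedding model, which this sketch omits:

    import os
    import cv2

    # Frontal-face Haar cascade shipped with the opencv-python package
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    os.makedirs("faces", exist_ok=True)
    cap = cv2.VideoCapture("video.mp4")  # assumed pre-downloaded
    saved = frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1
        if frame_idx % 30:  # sample roughly one frame per second at 30 fps
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.3, 5):
            cv2.imwrite(f"faces/face_{saved}.jpg", frame[y:y + h, x:x + w])
            saved += 1
    cap.release()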

feast

Feature Store for Machine Learning

mercedes-benz--greener-manufacturing_khalid-yusuf-liman

DESCRIPTION
Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario: Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations, including the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company's engineers have developed a robust testing system. As one of the world's biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz's production lines. However, optimizing the speed of the testing system across many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without lowering Mercedes-Benz's standards.

The following actions should be performed (a sketch follows below):
- If the variance of any column(s) is equal to zero, remove those variable(s).
- Check for null and unique values in the test and train sets.
- Apply a label encoder.
- Perform dimensionality reduction.
- Predict your test_df values using XGBoost.

Find the datasets here.
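
A minimal sketch of these steps, assuming the Kaggle-style train.csv/test.csv layout for this challenge with target column y and identifier column ID (adjust the names if the provided files differ):

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBRegressor

    train_df = pd.read_csv("train.csv")  # assumed file names
    test_df = pd.read_csv("test.csv")

    # Check for nulls in both sets
    print(train_df.isnull().sum().sum(), test_df.isnull().sum().sum())

    # Remove zero-variance (constant) columns
    zero_var = [c for c in train_df.columns if train_df[c].nunique() == 1]
    train_df = train_df.drop(columns=zero_var)
    test_df = test_df.drop(columns=zero_var, errors="ignore")

    # 'y' and 'ID' follow the Kaggle naming; adjust if the dataset differs
    y = train_df.pop("y")
    train_df = train_df.drop(columns=["ID"], errors="ignore")
    test_df = test_df.drop(columns=["ID"], errors="ignore")

    # Label-encode categorical columns, fitting on train and test values together
    for col in train_df.select_dtypes(include="object").columns:
        le = LabelEncoder().fit(pd.concat([train_df[col], test_df[col]]))
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])

    # Dimensionality reduction; 12 components is an arbitrary illustrative choice
    pca = PCA(n_components=12, random_state=42)
    X_train = pca.fit_transform(train_df)
    X_test = pca.transform(test_df)

    model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
    model.fit(X_train, y)
    predictions = model.predict(X_test)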

opencv

Open Source Computer Vision Library

pet-classification-model-using-cnn

DESCRIPTION
Build a CNN model that correctly classifies the given pet images into dog and cat images. The project scope document specifies the requirements for the project "Pet Classification Model Using CNN." Apart from specifying the functional and nonfunctional requirements, it also serves as an input for project scoping.

Project Description and Scope
You are provided with the following resources that can be used as inputs for your model:
1. A collection of images of pets, that is, cats and dogs. These images are of different sizes with varied lighting conditions.
2. A code template containing the following code blocks:
   a. Import modules (part 1)
   b. Set hyperparameters (part 2)
   c. Read image data set (part 3)
   d. Run TensorFlow model (part 4)
You are expected to write the code for the CNN image classification model (between parts 3 and 4) using TensorFlow, training on the data and calculating the accuracy score on the test data.

Project Guidelines
Begin by extracting the ipynb file and the data into the same folder. The CNN model (cnn_model_fn) should have the following layers:
● Input layer
● Convolutional layer 1 with 32 filters of kernel size [5,5]
● Pooling layer 1 with pool size [2,2] and stride 2
● Convolutional layer 2 with 64 filters of kernel size [5,5]
● Pooling layer 2 with pool size [2,2] and stride 2
● Dense layer whose output size is fixed in the hyperparameter fc_size=32
● Dropout layer with dropout probability 0.4
Predict the class by applying a softmax to the output of the dropout layer. This should be followed by training and evaluation:
● For the training step, define the loss function and minimize it
● For the evaluation step, calculate the accuracy
Run the program for 100, 200, and 300 iterations, respectively, then report the final accuracy and loss on the evaluation data. A sketch of the layer stack follows below.
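
A minimal sketch of the specified layer stack using the Keras API (the original template targets TensorFlow's estimator-style cnn_model_fn; the input size and channel count here are assumptions):

    import tensorflow as tf

    IMG_SIZE = 128  # assumed input size; the template may differ
    fc_size = 32    # dense layer size fixed in the hyperparameters

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
        tf.keras.layers.Conv2D(32, kernel_size=(5, 5), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        tf.keras.layers.Conv2D(64, kernel_size=(5, 5), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(fc_size, activation="relu"),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(2, activation="softmax"),  # dog vs. cat
    ])

    # Training step: define the loss and minimize it
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # model.fit(train_images, train_labels, epochs=...)  # run for 100/200/300 iterations
    # model.evaluate(test_images, test_labels)           # evaluation step: accuracy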

pipreqs

pipreqs - Generate pip requirements.txt file based on imports of any project. Looking for maintainers to move this project forward.

predict_patients_readmission

Predict the risk of hospital readmission for patients who have been discharged from the hospital within the previous 30 days, given the information in the attached dataset (taken from US Medicare patients).
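
A minimal baseline sketch for this task; the file name and the binary target column readmitted are entirely hypothetical, since the attached dataset's schema is not shown here:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("readmissions.csv")  # hypothetical file name

    y = df.pop("readmitted")              # hypothetical binary target column
    X = pd.get_dummies(df)                # one-hot encode categorical features

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)

    # Report AUC, which handles the typically imbalanced readmission classes
    print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))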

twitter-hate

Same problem statement, data fields, and task list as combating-twitter-hate-speech-using-ml-and-nlp above (see that project's description and the accompanying sketch).

woe_scoring

Monotone Weight of Evidence transformer and LogisticRegression model with a scikit-learn API

zomato-rating

Zomato Rating

DESCRIPTION
Using NLP and machine learning, make a model to predict the rating of a review based on the content of the text review. This will help identify cases where the rating and the review mismatch.

Problem Statement: Zomato is India's largest platform for discovering restaurants and ordering food. It operates in India as well as in a few cities internationally. Bangalore is one of Zomato's biggest customer and restaurant bases, with 4 to 5 million users on the platform each month. Users can post reviews of restaurants and provide a rating accompanying the review. The content of a review should ideally reflect the rating provided by the customer, but in many cases there is a mismatch, owing to multiple reasons, where the rating does not match the review. The match between reviews and ratings is very important, as it builds customer trust in the platform and helps users get an accurate picture of the restaurant. You, as a data scientist, need to enable the identification and cleanup of such cases to ensure that ratings reflect reviews and that reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text.

Domain: Hospitality and internet

Analysis to be done: Perform specific data cleanup and build a rating prediction model using the Random Forest technique and NLP.

Content:
rating: the rating given by the customer
review_text: the text in the review

Steps to perform: Clean up the data, tweaking the stop words (negative terms are important). Follow up with a Random Forest Regressor to predict the star rating given by the customers.

Tasks (a pipeline sketch follows below):
1. Load the data using the read_csv function from the pandas package.
2. Are there null values in the review text? Remove the records where the review text is null.
3. Perform cleanup on the data: normalize the casing, remove extra line breaks from the text, remove stop words (note: terms like 'no', 'not', 'don', and 'won' are important; don't remove them), and remove punctuation.
4. Separate into train and test sets: use the train-test method to divide the data into two sets, train and test, with a 70-30 split.
5. Use TF-IDF values for the terms as features to get into a vector space model: import the TF-IDF vectorizer from sklearn, instantiate it with a maximum of 5000 terms in your vocabulary, fit and apply it on the train set, and apply it on the test set.
6. Model building: Random Forest Regressor. Instantiate RandomForestRegressor from sklearn (set a random seed), fit on the train data, and make predictions for the train set.
7. Model evaluation: report the root mean square error.
8. Hyperparameter tuning: import GridSearchCV and provide the parameter grid to choose from: max_features ('auto', 'sqrt', 'log2') and max_depth (10, 15, 20, 25). Find the parameters with the best mean square error in cross-validation, choosing the appropriate scoring metric and a stratified 5-fold cross-validation scheme, then fit on the train set. What are the best parameters?
9. Predict and evaluate using the best estimator: use the best estimator from the grid search to make predictions on the test set. What is the root mean squared error on the test set?
10. Can you identify mismatch cases? Make a rule based on the predicted and actual values that identifies mismatch cases (e.g., the difference between actual and predicted being more than a cutoff). How many such cases do you see? Are all of these mismatch cases genuine?
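
A compact sketch of this pipeline, using the rating and review_text columns described above; the file name and the 1.5-star mismatch cutoff are assumptions, and plain 5-fold cross-validation is used here (the brief calls for stratified folds, but StratifiedKFold expects class-like targets, which a continuous regression target may not be):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV, train_test_split

    df = pd.read_csv("zomato_reviews.csv")   # hypothetical file name
    df = df.dropna(subset=["review_text"])   # drop null review texts

    # Keep negation terms out of the stop-word list; they carry sentiment.
    # Punctuation is dropped by the vectorizer's default tokenizer.
    stops = list(ENGLISH_STOP_WORDS - {"no", "not", "don", "won"})

    # Normalize casing and collapse extra line breaks/whitespace
    text = df["review_text"].str.lower().str.replace(r"\s+", " ", regex=True)

    X_train, X_test, y_train, y_test = train_test_split(
        text, df["rating"], test_size=0.3, random_state=42)

    vec = TfidfVectorizer(max_features=5000, stop_words=stops)
    X_train_tfidf = vec.fit_transform(X_train)
    X_test_tfidf = vec.transform(X_test)

    grid = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid={"max_features": ["sqrt", "log2"],  # 'auto' was removed in newer sklearn
                    "max_depth": [10, 15, 20, 25]},
        scoring="neg_mean_squared_error",
        cv=5)
    grid.fit(X_train_tfidf, y_train)
    print(grid.best_params_)

    pred = grid.best_estimator_.predict(X_test_tfidf)
    print("test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))

    # Flag mismatches where prediction and actual rating diverge by > 1.5 stars
    mismatch = np.abs(pred - y_test.to_numpy()) > 1.5
    print("mismatch cases:", mismatch.sum())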
