
DSCI-633: Foundations of Data Science https://github.com/hil-se/fds

License: MIT License

Python 46.96% Jupyter Notebook 53.04%
course-materials data-science machine-learning software-engineering rit

fds's Introduction


Syllabus | Slides and Assignments | Project | Instructor

Course Description

A foundation course in data science, emphasizing both concepts and techniques. The course provides an overview of data analysis tasks and the associated challenges, spanning data preprocessing, model building, model validation, and evaluation. Major families of data analysis techniques covered include classification, clustering, association analysis, anomaly detection, and statistical testing. This is a practice-driven course: it includes a series of programming assignments that involve implementing specific techniques on practical datasets from diverse application domains, reinforcing the concepts and techniques covered in lectures. The best way to learn an algorithm is to implement and apply it yourself, and you will experience that in this course.

Course Learning Outcomes

Students completing this course are expected to

  • Gain a brief understanding of basic data mining and machine learning techniques.
  • Develop the ability to solve real-world problems using machine learning.

Syllabus and Policies

The course uses GitHub for assignment submission, discussion, and questions. Slides, assignments, and recorded videos will be posted here.

Prerequisites: The course does not have formal prerequisites, but some background knowledge will help you be successful. Since the course has a substantial programming component, solid programming skills will be beneficial. Also note that Python and GitHub are required for submitting assignments, and Assignment 0 provides learning materials to help students with both.

Textbook: We will be using Pang-Ning Tan's "Introduction to Data Mining (Second Edition)" (ISBN-13: 978-0133128901) throughout the course. However, there is no need to buy the book, since the slides will cover all the content you need. To dig deeper into the material, searching online for research papers and blog articles is often more effective than reading the textbook.

Grading: Evaluation will be based on the following distribution: 70% assignments, 30% project. A detailed grading policy can be found at the end of the description of each assignment and project.

Grade  Points        Grade  Points
A      93 or above   B-     80 – 82
A-     90 – 92       C+     77 – 79
B+     87 – 89       C      70 – 76
B      83 – 86       F      Below 70

Time management: Besides the 2.5 hours/week spent on lectures, students are expected to spend 1 to 5 hours per week on assignments (for the first 9 weeks) or on the project (for the remaining weeks), depending on their proficiency in coding and machine learning.

Late work policy: The TA will start grading assignments after the due date. Late work in assignments will not be graded. Exceptions to this policy will be made only in extraordinary circumstances, almost always involving a family or medical emergency---with your academic advisor or the Dean of Student Affairs requesting the exception on your behalf. Accommodations for travel (e.g., for interviews) might be possible if requested at least 3 days in advance.

Team work: Students will be assigned to small study groups. Discussions within the group are welcome, but each student should complete their assignments and project independently. Identical or extremely similar submissions within a study group will still be considered cheating. Questions that cannot be resolved within the study group should be posted as a new issue on this repo for discussion.

Academic Integrity: Students are encouraged to discuss the assignments and projects with each other, especially within their study group. But do not copy finished assignments or projects from other students' GitHub repos. Up to 90% of the learning in this course comes from completing the assignments and the project; skipping them is a huge waste of the effort you spend on this course. Furthermore, students will need to explain their work to the TA or the instructor if their submissions are found to be too similar.

Generative AI tools: Coding solutions must be your own work, which means you cannot use generative AI tools in any manner to write your programs. When learning fundamental skills, you need to ensure that you master the basics. If I doubt authorship, I may ask you to explain the code or re-create aspects of the code in one of our labs – you must show that you have mastered the fundamentals.

Accommodations for students with disabilities: If you have a disability and have an accommodations letter from the Disability Resources office, we encourage you to discuss your accommodations and needs with us as early in the semester as possible. We will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, we encourage you to contact them at [email protected].

A note on self-care: Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful. If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 585-475-2261 for urgent matters or email [email protected] for non-urgent cases. Please also consider reaching out to a friend, faculty, or family member you trust for help getting connected to the support that can help.


fds's Issues

Assignment 9

Hi Professor,

I would like to keep this page for all concerns about assignment 9, if you do not mind.

One concern is that a single loop variable is declared twice in mutate(), which could result in a conflict; kindly see the code below:

for i, x in enumerate(self.generation):
    new_x = list(x)
    for i in range(len(x)):

The variable i is declared twice; please let me know whether or not this would cause an issue.
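Should the shadowing need fixing, one option is simply to rename the inner index. A minimal sketch (with a hypothetical stand-in for self.generation and a placeholder for the actual mutation step):

```python
# Hypothetical stand-in for self.generation in my_GA
generation = [(1, 2), (3, 4)]

mutated = []
for i, x in enumerate(generation):
    new_x = list(x)
    for j in range(len(x)):  # renamed from i so the outer index is not shadowed
        new_x[j] = new_x[j]  # placeholder for the actual mutation of position j
    mutated.append(new_x)
```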

DO NOT post RIT ID in the readme.md file of your repo

Thanks to the students who raised this concern. Assignment 1 has now been modified so that your RIT ID is not required in the readme.md file.
Instead, please put your GitHub ID there; that way, the TA can still associate your repo with you.

For those who have already posted their RIT ID, my apologies for the risk, but please change it to your GitHub ID.

Assignment 9: k-fold cross-validation

Hello Prof. Zhe,

In assignment 9, we are using the k-fold cross-validation technique to find the average metrics over k-subsets. So, I feel we should average over the number of folds (k in this case) instead of the number of data points.

Current code:

objs_crossval = objs_crossval / float(len(self.data_y))

My suggestion:

objs_crossval = objs_crossval / float(self.crossval_fold)

This change causes the metrics from the 2 folds (the default value passed to my_GA) to be averaged correctly.

Please let me know if this change is correct.

Thank you
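The averaging in question can be sketched as follows (hypothetical fold scores; crossval_fold plays the role of self.crossval_fold):

```python
# Hypothetical per-fold objective scores from k = 2 folds
fold_scores = [0.93, 0.95]
crossval_fold = len(fold_scores)

# Average over the number of folds, not the number of data points
objs_crossval = sum(fold_scores) / float(crossval_fold)
print(objs_crossval)
```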

Zhe Yu (Instructor)

Hello Everyone,

I am your instructor for this course. I am an assistant professor in the Department of Software Engineering at RIT, and I graduated from NC State University in 2020.

The foundations of data science was my favorite course during my Ph.D. study, and I hope you will feel the same way. It is a pity that we cannot meet in person, but feel free to contact me with any questions through GitHub issues (public) or email (private).

Zhe

Hello, I'm Chenliang

Hello, my name is Chenliang Cao, but you can call me George. I'm from Shanghai, China, but I stayed in Rochester after graduating from U of R two years ago. I have a degree in applied mathematics, and I am very new to this data science area.
I'm excited about this course, and hopefully everything can get back on track from this COVID-19 situation.

Grading evaluation distribution adds up to more than 100%

Describe the bug
In the README file, the Grading distribution adds up to 103% with 63% for assignments and 40% for the project.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the Syllabus and Policies section of the README page.
  2. Scroll down to the "Grading" point.
  3. This grading point mentions that the distribution will be as follows: 63% assignments, 40% project. This adds up to 103%.

Expected behavior
The grading distribution should add up to 100% instead of 103%.

Screenshots
Grading Distribution adds to 103%

Desktop (please complete the following information):

  • OS: N/A
  • Browser: N/A
  • Version: N/A

Smartphone (please complete the following information):

  • Device: N/A
  • OS: N/A
  • Browser: N/A
  • Version: N/A

Additional context
N/A

Nishant Nair

Hello everyone,

Hope you're doing well and keeping safe during this pandemic.

I am Nishant Nair and I'm from India. I have been working in the field of data and analytics for over 5 years now and have undertaken various roles during this stint which includes data analyst, business analyst, business intelligence engineer and decision scientist to name a few. I wanted to dig much deeper into data science in the space of statistical methods, optimisation techniques, visualisation and automation. My journey in the field of analytics and prevalent interest in the "Science" of data, naturally led me to take up MS in Data Science at RIT. Additionally, with the flexibility that RIT offers and wide variety of courses that one can choose from, really puts it above all for me, personally.

Apart from work, I also like to indulge myself in doing some regular exercises and yoga just to manage day-to-day physical or mental workload. I like to listen to a lot of songs and keep exploring new artists on the go. I would like to believe that I have a good taste in them, so if you're looking for some recommendations, please feel free to reach out.

Hope to see you soon in person, by Spring. I am positive that the current situation will turnaround !! Let's plan to grow and develop together both professionally and personally.

Thanks and best regards

"Questions About the Course" task requires a reply

Hello,

In myCourses, the "Getting Started -> Questions About the Course" task requires a "reply" in order to mark the task as completed. However, the specific instruction, in the task's description, is to not reply to the task. Please advise.

Thank you

Mrinal Chaudhari

Hello,

I am from Pune, India. I have a degree in Information Technology. I interned at GTT, where I worked as a developer, and I have two years of experience at Tata Consultancy Services as a Database Administrator. I have always been fascinated by data and how it is handled in today's era, because data science draws on various backgrounds such as statistics, applied mathematics, and many more, and I would love to explore this area further. I am glad that I am part of this class and very excited to get to know you all.

class labels for the testing data set

Can we have the class labels for the testing set to measure the accuracy of our model?
The current testing set doesn't have class labels. I tried to measure accuracy by splitting the training set into training and testing subsets, but there are not enough samples. It would be good to have class labels for the current testing set, such as values in the "Species" column of Iris_test.csv.

Mayuresh Nene

Hi,

I am from Pune, India. Glad I will be starting this program this week, as it's been on my radar for a year now. This was mainly because I participated in an idea at a Hackathon at my internship. That sparked my interest in diving deeper into the field.

This has definitely been a roller coaster year for the most part and I am sure this new phase will bring a lot of interesting new experiences for us all! Can't wait to learn and have fun at the same time :)

Assignment 3, do not forget about the smoothing factor self.alpha

As stated in the __init__() function:

        # alpha: smoothing factor
        # P(xi = t | y = c) = (N(t,c) + alpha) / (N(c) + n(i)*alpha)
        # where n(i) is the number of available categories (values) of feature i
        # Setting alpha = 1 is called Laplace smoothing

Do not forget to use it when calculating P(xi|yj)
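As a worked example of the formula in the comment above (with hypothetical counts for a feature value t under class c):

```python
# P(xi = t | y = c) = (N(t,c) + alpha) / (N(c) + n(i) * alpha)
N_tc = 3     # hypothetical count of value t within class c
N_c = 10     # hypothetical count of class c
n_i = 4      # hypothetical number of categories of feature i
alpha = 1.0  # Laplace smoothing

p = (N_tc + alpha) / (N_c + n_i * alpha)
print(p)  # 4/14 ≈ 0.2857
```

Note that with alpha > 0, the estimate stays nonzero even when N(t,c) = 0, which is the point of the smoothing.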

Austin Simmons

Hi, I am Austin Simmons. I am currently staying in Rochester. I finished my undergrad at RIT last year and am excited to start working on my masters. Looking forward to learning more about data science and finding out what interests me the most.

Prajwal Krishna

Hey All,

I am from India living in Bengaluru right now.
Well as far as Covid situation is concerned I am sure it has become a new normal for all of us.

Amidst all this I am very excited to start this Master's journey with RIT and absolutely gutted that we can not be present in Rochester physically and have fun together. I am sure we all get to meet in Spring 2021 ( !2020 ) :D

Data science is a very new thing for me and I am sure the foundations of data science will put all of us on the right track.
Let us all collaborate together and make the most of the facilities given by RIT for remote learning.

Hello, I am Yash

Hey there. I am Yash Mahesh Bangera. I hope you guys are doing well. I am currently in Mumbai, India, and am affected badly by the Corona pandemic. Everything here is shut and under lockdown. I am, however, keen on learning new things each day and look forward to this course too. My goal after graduation is to be able to introduce solutions that make use of Data Science, Machine Learning, and Artificial Intelligence that could help the community as much as possible.

Assignment 7 update

Please update every file in the Assignment7 folder.
Now the pca function returns the principal components instead of the transformed data X_pca.
This is good for applying the same transformation to the test data.
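A minimal sketch of why returning the components helps (hypothetical data, not the hint file's exact code): the components are computed from the training data only, then the same projection is reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((20, 5))  # hypothetical training data
X_test = rng.random((5, 5))    # hypothetical test data

# Principal components from the centered training data only
mean = X_train.mean(axis=0)
cov = np.cov((X_train - mean).T)
vals, vecs = np.linalg.eigh(cov)
W = vecs[:, np.argsort(vals)[::-1][:2]]  # top-2 components

# The same components transform both training and test data
X_train_pca = (X_train - mean) @ W
X_test_pca = (X_test - mean) @ W
```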

Hello everyone, I'm Sultan

Hi everyone!!

I'm Sultan Almassari, a Software Engineering student, currently in Rochester. As an experience, I used to work as Quality Assurance engineer for 3 years.

The reason why I opt to be educated in Data science is that it will give me a sound knowledge of how to implement its application with my interest. This will give me an opportunity to integrate several fields(e.g. data science with QA) to come up with a good mined and predictable data. So, I am so excited to know the recipe and the secret of data science!

I believe that no one is happy with this pandemic, but we need to adapt ourselves. Anyway, looking forward to knowing all of you guys!

Hey, I am Vinayak Sengupta

Hi everyone,
I am from Mumbai, India.

The COVID-19 situation, like for most people here had made my plan for masters quite uncertain as well, but a career in the field of Data Science was something that I always planned on pursuing.

I did my undergrad studies in Computer Science and Engineering, so learning the foundations of data science, are relatively familiar for me, but I believe I have a long journey ahead of me yet.

Hope to connect with you all and meet up in Spring! Till then, All the best and Cheers!

Hello, I'm Neha.

Hello everyone,

I am Neha, currently residing in Patna, India.
I was captivated by how much sense it makes when data is analyzed and processed effectively. I cannot think of any sector which cannot be favored by Data Science. I am looking forward for all the amazing years here at RIT.

I am not an exception when it comes to COVID crisis.
Hope we get a vaccine soon for COVID-19! See you all soon!

Assignment 8 output is different from the expected values

Hello Prof. Zhe and other students,

I'm getting slightly different outputs for the "iris-setosa" class but the same outputs for the remaining classes:

{'Iris-setosa': {'prec': 1.0, 'recall': 1.0, 'f1': 1.0, 'auc': 1.0}, 'Iris-versicolor': {'prec': 0.8979591836734694, 'recall': 0.9777777777777777, 'f1': 0.9361702127659575, 'auc': 0.98}, 'Iris-virginica': {'prec': 0.975609756097561, 'recall': 0.8888888888888888, 'f1': 0.9302325581395349, 'auc': 0.9587654320987653}}
Average F1 scores: 
{'macro': 0.9562045824755032, 'micro': 0.9555555555555556, 'weighted': 0.9562045824755032}

Is there a possibility of getting different output, given that we are not introducing any randomness?

Are any others getting the same output?

Thank you

Assignment 4 updated hint file

my_KNN_hint.py has been updated to handle the scenario where the features of test data do not match with those of training data.

This update will not affect the results given the data files used in A4, but you are welcome to pull it.
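One way such a mismatch can be handled is by reindexing the test columns against the training columns (a sketch with hypothetical frames, not necessarily what my_KNN_hint.py does internally):

```python
import pandas as pd

X_train = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical training features
X_test = pd.DataFrame({"b": [5], "c": [6]})         # hypothetical test features

# Keep only the training features; fill missing ones with 0, drop extras
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test_aligned.columns))  # ['a', 'b']
```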

Infinite loop in assignment 9 tune method

Hello Prof. Zhe,

I'm running into an infinite loop in the tune() method.

while self.life > 0 or self.iter < self.max_generation:
    self.select()
    if self.compete(self.pf, self.pf_best):
        self.life = self.max_life
    else:
        self.life -= 1

Here, self.compete always returns True for me with the default values passed to my_GA, so self.life is never decremented. The following change in the while loop condition fixes it for me:

while self.life > 0 and self.iter < self.max_generation:

Changing or to and ensures that the loop breaks if self.life is less than 1 or if self.iter crosses the self.max_generation limit.

Please let me know if this is correct.

Thank you

Sanket Waghmare

Hello all,
I am from Mumbai,India.
The amount of Information that can be collected, analysed and acquired from data is something that has always fascinated me.
I'm so glad that I am going into a field that I am passionate about and excited to see how RIT helps me in delving deeper into it.
COVID-19 has definitely been an obstacle that we have been surviving through, but connecting and working together, we can make this semester worthwhile.
See you all soon.

Rigved Rakshit

Hello,

I'm from Mumbai, India. I was a System Architect at my previous company. I have been interested in Data Science (and Natural Language Processing specifically) for a few years now and I'm absolutely delighted to start my Data Science journey with this course! COVID-19 has wreaked havoc throughout the world but everyone is learning to mitigate its spread. I hope to meet everyone soon in-person!

Rigved

Question about Assignment 2

Sir, I have written this code :
def find_best_split(self, pop, X, labels):
    # Find the best split
    # Inputs:
    # pop: indices of data in the node
    # X: independent variables of training data
    # labels: dependent variables of training data
    # Output: tuple(best feature to split, weighted impurity score of best split, splitting point of the feature, [indices of data in left node, indices of data in right node], [weighted impurity score of left node, weighted impurity score of right node])
    ######################
    best_feature = None
    cans_arr = []
    df_below_threshold = []
    df_above_threshold = []
    df_per_midpoint = []
    labels_below_threshold = []
    labels_above_threshold = []
    midpoints = []
    impurity_values = []
    impurity_values_b = []
    impurity_values_a = []
    for feature in X.keys():
        cans = np.array(X[feature][pop])
        cans_arr.append(cans)
    first_switch = np.where(labels[:-1] != labels[1:])[0][0]
    for i in range(0, len(cans_arr), 1):
        mid_value = (cans_arr[i][first_switch] + cans_arr[i][first_switch + 1]) / 2
        midpoints.append(mid_value)
    for i in range(0, len(midpoints), 1):
        df_below_threshold.append(X.loc[X[X.keys()[i]] <= midpoints[i]])
        df_above_threshold.append(X.loc[X[X.keys()[i]] > midpoints[i]])
        df_per_midpoint.append([df_below_threshold, df_above_threshold])
        labels_below_threshold.append(labels[X[X.keys()[i]] <= midpoints[i]])
        labels_above_threshold.append(labels[X[X.keys()[i]] > midpoints[i]])

( Just the starting portion of the code)
But I am getting the following error in Python when I call clf.fit(X, y), and I do not understand why:
IndexError Traceback (most recent call last)
in
----> 1 clf.fit(X,y)

~\DSCI-633\assignments\assignment2\my_DT_hint.py in fit(self, X, y)
168 else:
169 # Find the best split using find_best_split function
--> 170 best_feature = self.find_best_split(current_pop, X, labels)
171 if (current_impure - best_feature[1]) > self.min_impurity_decrease * N:
172 # Split the node

~\DSCI-633\assignments\assignment2\my_DT_hint.py in find_best_split(self, pop, X, labels)
75 first_switch = np.where(labels[:-1] != labels[1:])[0][0]
76 for i in range(0,len(cans_arr),1):
---> 77 mid_value = ((cans_arr[i])[first_switch] + (cans_arr[i])[first_switch+1])/2
78 midpoints.append(mid_value)
79 for i in range(0,len(midpoints),1):

IndexError: index 44 is out of bounds for axis 0 with size 39

Entropy can be larger than 1

The entropy score can be larger than 1 when more than 2 classes are present.
For three classes, if each class has the same number of samples, entropy ≈ 1.58.
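A quick check of that number:

```python
import numpy as np

# Entropy of a 3-class split with equal class proportions
p = np.array([1 / 3, 1 / 3, 1 / 3])
entropy = -np.sum(p * np.log2(p))
print(entropy)  # log2(3) ≈ 1.585, which exceeds 1
```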

Duplicate data points showing up in pf_best in assignment 9

Hello Prof. Zhe,

Sometimes when I run A9.py, I get duplicate data points in pf_best, as mentioned by another student during the lecture today. The result of that is I get duplicate precision and recall values as shown below:

[array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597]), array([0.95041809, 0.96006597])]
[0.96479161]

If we check for duplicate points before adding into pf_best, this doesn't occur.

Original code:

if not_dominated:
    to_add.append(j)
    modified = True

Suggestion:

if not_dominated and pf_new[j] not in pf_best:
    to_add.append(j)
    modified = True

Does this suggestion look good?

Thank you

Abdullah Alsaleh

Hello everyone,
This is Abdullah Alsaleh from Saudi Arabia, currently in Rochester, NY.
I have a bachelor's degree in information technology, and my main areas of interest are Data applications and Web development. Back in Saudi Arabia, I have worked as a web developer and system administrator in the Education sector for almost 3 years.
Over the years, my passion for gaining experience, enhancing my skills, and expanding my knowledge in big data and analytics has grown. I have always believed in data science as one of the most important success factors in all fields.

I'm very excited about this course! With all that is going on with the Coronavirus, I hope we enjoy the course together and hope to see you soon in the spring...

It is now required to change your DSCI-633 repo to private

As many students have suggested, it is better to keep the DSCI-633 repos private and only accessible by the instructor and the TA.

Keeping repos private has the following advantages:

  1. Others cannot see your work until you decide to make it public.
  2. Cheating would become much more difficult.
  3. Private repos can now be created with free GitHub accounts, so why not?

If you have already created a public repo, I will demonstrate on Aug 25 how to change it to private.

If you have not created a repo yet, follow the new version of Assignment 1 to create a private repo.

Now everyone needs to also invite the instructor and the TA (Github ID: azhe825, ketakisbarde) as collaborators to their private repos.

Shrunali Paygude

Hello everyone,
I am from Pune, India.
The COVID situation had made my plan for masters a bit dicey but learning Data Science was something that I always wanted to pursue.

I did my undergrad in Electronics and Telecommunication, so learning the foundations of data science, which is very new for me, will prove crucial in my journey.

Hope to see you all in Spring! Till then, happy learning :)

set() in Python gives a differently ordered list every time it is run

Hello Prof. Zhe,

In my_AdaBoost_hint.py, we use self.classes_ = list(set(list(y))). This creates a list with a different order on each run. For example, I'll run this code with self.n_estimators = 10:

self.classes_ = list(set(list(y)))
print('my_AdaBoost')
print(self.classes_)

for i in range(self.n_estimators):
    sample = np.random.choice(n, n, p=w)
    sampled = X.iloc[sample]
    sampled.index = range(len(sample))
    self.estimators[i].fit(sampled, labels[sample])
    print('sklearn_AdaBoost')
    print(self.estimators[i].classes_)

This will give the following output from two consecutive runs.

Run 1:

my_AdaBoost
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Run 2:

my_AdaBoost
['Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sklearn_AdaBoost
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

In run 1, the order in the list is the same as that of the base classifiers. However, in run 2, the order is different.

Now, if I run print(self.estimators[i].predict_proba(X)) for any test dataset, the probabilities will have class label columns in the same order as the sklearn base classifier, instead of the class label columns in the order of my_AdaBoost_hint.py. Thus, the ensemble classifier (my_AdaBoost_hint.py) of run 2 will output incorrect probabilities because column 0 in the ensemble classifier is Iris-virginica instead of the column 0 in the base classifier (sklearn.tree.DecisionTreeClassifier) that refers to Iris-setosa.

Further, this column name ordering is assumed in the following line from the def predict(self, X) line of my_AdaBoost_hint.py file:

predictions = [self.classes_[np.argmax(prob)] for prob in probs.to_numpy()]

One solution to this problem is to use the following code to generate self.classes_:

self.classes_ = y.unique().tolist()

This code will always generate the class label list in the same order as sklearn's classifiers.

What are your thoughts on this, Prof.?

Note: This problem will arise for every ensemble learning algorithm where we use a base classifier from sklearn.
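A deterministic alternative worth considering (a sketch, not the hint file's code): sklearn classifiers store classes_ as the sorted unique labels (via np.unique), so building the list the same way yields a matching order on every run.

```python
import numpy as np

# Hypothetical label list; order of appearance is not sorted
y = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# np.unique returns the sorted unique labels, matching sklearn's classes_ ordering
classes_ = np.unique(y).tolist()
print(classes_)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```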

Question about grading policy for assignment 2

Hello Prof. Zhe,

Is the grading policy for assignment 2 based on the number of same probability predictions from the predict_proba method as well? I'm asking because I have implemented Hunt's algorithm from the slides with the assumption that each feature will be evaluated on only one node. Once a feature is selected as the decision condition for a node, subsequent nodes will only use one of the remaining features to make a decision. This assumption leads to a more generalized decision tree than the one calculated by sklearn. For more context, here are the two trees, one from my program and the second from sklearn:

[['PetalLengthCm', 3.0, 'l', 'Iris-setosa'], ['PetalWidthCm', 1.8, 'l', 'Iris-versicolor'], ['SepalLengthCm', 6.0, 'ge', 'Iris-virginica'], ['SepalWidthCm', 3.2, 'l', 'Iris-virginica']]
|--- feature_2 <= 2.45
|   |--- class: Iris-setosa
|--- feature_2 >  2.45
|   |--- feature_3 <= 1.75
|   |   |--- feature_2 <= 4.95
|   |   |   |--- feature_3 <= 1.65
|   |   |   |   |--- class: Iris-versicolor
|   |   |   |--- feature_3 >  1.65
|   |   |   |   |--- class: Iris-virginica
|   |   |--- feature_2 >  4.95
|   |   |   |--- feature_3 <= 1.55
|   |   |   |   |--- class: Iris-virginica
|   |   |   |--- feature_3 >  1.55
|   |   |   |   |--- feature_2 <= 5.45
|   |   |   |   |   |--- class: Iris-versicolor
|   |   |   |   |--- feature_2 >  5.45
|   |   |   |   |   |--- class: Iris-virginica
|   |--- feature_3 >  1.75
|   |   |--- feature_2 <= 4.85
|   |   |   |--- feature_1 <= 3.10
|   |   |   |   |--- class: Iris-virginica
|   |   |   |--- feature_1 >  3.10
|   |   |   |   |--- class: Iris-versicolor
|   |   |--- feature_2 >  4.85
|   |   |   |--- class: Iris-virginica

My code makes the same predictions on the given test data, but with a lower confidence, which can be seen when the data from predict_proba is printed.

Intro

Hello everyone, my name is Simran Deshmukh and I am from Mumbai, India. I completed my B.E. in Computer Engineering and was hired by Accenture as an ASE after my undergraduate studies. I have also completed an internship with Robokart. The COVID-19 situation in my locality is pretty much under control; however, we are still under lockdown for safety purposes. I am very excited to start this course, as it will help me learn new skills and concepts so that I can become a good data scientist after graduation.

Hi There, I'm Skye

Hello!
I'm Skye and I recently moved into my apartment at The Province from Long Island. Very much looking forward to this semester despite the current situation.

Assignment 9 Issue

Hello Everyone:

Can someone please help me debug my code, I'm getting this error.

[array([0.94016572, 0.93962375]), array([0.80635777, 0.73373702]), array([0.9482311, 0.9482311]), array([0.93395306, 0.93749212]), array([0.94127852, 0.94144122]), array([0.94154915, 0.94390909]), array([0.92430898, 0.92892568]), array([0.95596545, 0.95723743]), array([0.78782617, 0.71472826]), array([0.93281628, 0.93376601])]
Traceback (most recent call last):
File "/Users/pk/Desktop/RIT/Fall-2020/DSCI-633/assignments/assignment9/A9.py", line 41, in
best = ga2.tune()
File "/Users/pk/Desktop/RIT/Fall-2020/DSCI-633/assignments/assignment9/my_GA.py", line 236, in tune
self.crossover()
File "/Users/pk/Desktop/RIT/Fall-2020/DSCI-633/assignments/assignment9/my_GA.py", line 197, in crossover
new_point = cross(self.generation[ids[0]], self.generation[ids[1]])
File "/Users/pk/Desktop/RIT/Fall-2020/DSCI-633/assignments/assignment9/my_GA.py", line 188, in cross
if((np.random.randint(a,2))==0):
File "mtrand.pyx", line 746, in numpy.random.mtrand.RandomState.randint
File "_bounded_integers.pyx", line 1269, in numpy.random._bounded_integers._rand_int64
File "_bounded_integers.pyx", line 656, in numpy.random._bounded_integers._rand_int64_broadcast
numpy.core._exceptions.UFuncTypeError: ufunc 'less' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('bool')
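The `dtype('<U21')` in the last line suggests that the lower bound `a` passed to `np.random.randint` in `cross` is a string (for example, a value read straight from the chromosome) rather than an integer, so NumPy ends up trying to compare string arrays. A minimal sketch of the likely fix, where `a = "0"` is a hypothetical stand-in for whatever value `cross` receives:

```python
import numpy as np

# Hypothetical: `a` arrived as a string instead of an int,
# e.g. read from a chromosome that mixes types.
a = "0"

# Casting the bound to int before calling randint avoids the
# string-comparison ufunc error shown in the traceback.
coin = np.random.randint(int(a), 2)  # draws 0 or 1
print(coin in (0, 1))
```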

Hi! I'm Yuvraj

Hello everyone,

I'm Yuvraj Singh. I'm currently staying in Rochester. Unfortunately, because of COVID, I'm tied down with either work or school, since there isn't much anyone can do right now. I just started my BS/MS in Computer Science and I plan to specialize in Big Data and Data Management. I want to learn more about how to handle large sets of data and what one can learn from them.

Hello, I'm Rochelle

Hi everyone,
I'm Rochelle and I am currently in Bangalore, India.
This pandemic has been challenging, to say the least, and I'm really hoping I get to meet you all in person in Spring 2021!
I wanted to dive into data science because there is so much that can be learned from data and applied in every field.

Can't wait to learn and meet you all soon!

Assignment 6: Fix a bug on my_KMeans_hint.py

On Line 101 in the predict() function, np.argmax should be np.argmin since we are assigning data points to their closest cluster centers.

This has been corrected in the my_KMeans_hint.py file. Please update.

@AbdullahA93 Thanks for finding this bug!
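For anyone comparing against their own implementation, a minimal sketch of the corrected assignment step (not the hint file's actual code): each point is assigned to the center with the smallest distance, which is why `np.argmin` is needed here.

```python
import numpy as np

X = np.array([[0.0, 0.0], [10.0, 10.0]])       # two data points
centers = np.array([[1.0, 1.0], [9.0, 9.0]])   # two cluster centers

# Pairwise Euclidean distances: rows are points, columns are centers.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

# argmin picks the closest center; argmax would pick the farthest one.
labels = np.argmin(dists, axis=1)
print(labels)  # [0 1]
```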

Prajwal Krishna

Hey All,

I am from India, living in Bengaluru right now.
As far as the COVID situation is concerned, I am sure it has become the new normal for all of us.

Amidst all this, I am very excited to start this Master's journey with RIT, and absolutely gutted that we cannot be physically present in Rochester and have fun together. I am sure we will all get to meet in Spring 2021 ( !2020 ) :D

Data science is very new to me, and I am sure the foundations of data science will put all of us on the right track.
Let us all collaborate and make the most of the facilities RIT provides for remote learning.

MyGithub: prajwal2495

F1 score

For the F1 score, do we have to compute precision and recall again, as mentioned in the hint file, instead of calling the existing functions?
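For reference, the standard definition only needs the two already-computed values; a minimal sketch (not the hint file's exact structure):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.5))  # 0.6153846153846154
```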

Assignment 5 - random state

Sir, our fit function doesn't set the seed of the random state. As a result, we always get a different output, which differs from sklearn's default AdaBoostClassifier. Is this expected?
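One common way to make such runs reproducible (assuming the assignment code draws from NumPy's global RNG) is to seed it at the top of fit. The function below is a hypothetical illustration, not the assignment's actual fit:

```python
import numpy as np

def fit_with_seed(seed=0):
    # Seeding makes every subsequent random draw (e.g. AdaBoost's
    # bootstrap sampling) identical across runs.
    np.random.seed(seed)
    return np.random.randint(0, 100, size=3)

a = fit_with_seed(42)
b = fit_with_seed(42)
print(np.array_equal(a, b))  # True
```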
