[CORE FEATURE] Devise DB schema/architecture (omshub/website)

Comments (25)

williarm commented on September 18, 2024 4

GitHub doesn't like .sql extensions for some reason. Conceptual design attached.

Lookup schema is for static values (or data that doesn't change often). It allows us to data-drive the UI.
Core schema is the transactional schema.

We would need to add indexes and keys to this model. Which ones would depend on how the UI accesses the database, so we can optimize the queries. The data volume wouldn't really require any type of partitioning.

Direct link to diagram: https://dbdiagram.io/d/627b26367f945876b6f49b78


baheckman commented on September 18, 2024 2

I'm not sure if this fits under this issue, but I think it would be very helpful to capture some of the user's background. Maybe something like "professional_developer: boolean" or "years_professional_experience: int" and possibly some tech background (like do they know Python or C prior to the course).

Courses like GIOS can be a breeze for some devs that have worked with C and gRPC while they can be a struggle for others. That makes a big difference in terms of how readers align to the reviews.


rkrishnan47 commented on September 18, 2024 1

A couple of questions on the core.users table:

Can we change password to password_hash?
Should we allow users to sign in using OAuth, which would require some changes to this table (this should probably be a separate issue, but would affect the structure of the users table).


williarm commented on September 18, 2024 1

> Can you elaborate on what active_flg stands for?

active_flg is a logical boolean that helps control what data is displayed in the UI. For example, if we have a drop-down that we populate from a lookup table (e.g. specialization), we can filter the drop-down values via active_flg. This allows us to stage new data without it becoming visible until we set the flag to true, or turn values off and on without having to delete them from the database.
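
A minimal sketch of that active_flg pattern, using Python's sqlite3 purely for illustration. The column names and the staged sample row are assumptions, not the final schema:

```python
import sqlite3

# In-memory database; table and column names are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lk_specialization (
        specialization_id   INTEGER PRIMARY KEY,
        specialization_desc TEXT NOT NULL,
        active_flg          INTEGER NOT NULL DEFAULT 0  -- 0 = staged/hidden, 1 = visible
    );
    INSERT INTO lk_specialization VALUES
        (1, 'Computing Systems', 1),
        (2, 'Machine Learning', 1),
        (3, 'New Track (staged)', 0);  -- made-up row, loaded but not yet visible
""")

def dropdown_values(conn):
    """Return only the values the UI drop-down should show."""
    rows = conn.execute(
        "SELECT specialization_desc FROM lk_specialization "
        "WHERE active_flg = 1 ORDER BY specialization_desc"
    )
    return [r[0] for r in rows]

visible_before = dropdown_values(conn)

# Turning a value on is an UPDATE; nothing is inserted or deleted.
conn.execute("UPDATE lk_specialization SET active_flg = 1 WHERE specialization_id = 3")
visible_after = dropdown_values(conn)
```

The staged row is already in the table before it appears in the drop-down, which is the "stage new data" behaviour described above.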


williarm commented on September 18, 2024 1

> This is fantastic @williarm! Thanks for making it! A few comments/suggestions/questions.

Thanks for reviewing. This was a rough draft of a data model and I anticipated tweaks so nothing is set in stone. I intentionally made it more normalized to demonstrate the ideas and assume we will flatten it out.

core_course_review:

> I think overall_rating/workload_hours should be integers. I don't think we need people to enter 10.43 hours/week or a rating of 3.5 or something. Let's just stick to integers.

Works for me.

> We will need to consider other fields to allow for more aspects of a course to review, e.g. on MSCSHub they have textbook, lectures, professor, piazza support. Not saying we should do all of those, but something to keep in mind

Yea, depending on the aspect we can just add new columns, or create a child table of course_review to hold the new aspect (with course_review_id as an FK in the child table).

> And not a schema question, but should consider what scale we want, out of 5, out of 7, out of 10? My older brother loves a -5 to 5 scale

I think the original OMSCentral site used a scale of 5. That probably fits here, but either way the model supports it.

> I assume this table is for the individual reviews, for the aggregate reviews I don't see a schema for those. Do you think it would be such that those would be calculated every time the site loads? I think it might be better to have another table with the aggregated summaries averages/totals for the reviews that gets updated any time someone adds a review rather than recalculating each time, but would need to discuss further. If it's minimal load on the server to do so could go either way.

I don't think the volume of data will be so high that we can't calculate on the fly. My idea here is that we have views sitting on top of the data model that the UI references for reads. We can have a view that does the aggregations anytime someone accesses the main page. I think I saw a screenshot of the OMSCentral page where the total number of reviews was less than 5000, so that should be pretty small from a data standpoint. The benefit of doing it this way is that we see updated averages and totals as soon as someone submits a review. We also don't need to manage a refresh process to load the aggregate table. If we start to have a ton of data volume, then it might make sense to build a dimensional model to do some aggregations. But for now, I don't recommend that. If others disagree then it's worth a discussion.

lookup_lk_user_type and role

> What would be the difference between type and role of a user?

Just demonstrating the options here, we don't have to go with this. Depends on how granular we want to get with roles and access control. An example where this makes sense would be if we want to have user types Internal and External, where Internal roles are Admin, User, etc.

specialization tables

> It might be best to use something other than "specialization" for the tables, as that's fairly OMSCS-specific language. OMSA and OMSCY call them "tracks"; perhaps something like "concentration"?

That works.

> Is there any reason not to combine lk_degree_program_specialization and lk_specialization? The specialization_desc could just be added to the former, no need for another table.

I separated them out for uniqueness in case we have multiple degree programs with the same specialization name. Again, maybe too normalized so if flattening this out works better we can go that route.

> I think it might be worth having a discussion about how to treat OMSCS vs OMSA vs OMSCY specifics and the various tracks/specializations within.

Part of the reason why I went with the more normalized approach. Allows us to be very granular with how we map the degree programs and the tracks/specializations. If we want to see everything under OMSCS for example, we can just drive to the data using the OMSCS program and filter out all OMSA and OMSCY stuff. Additionally if we ever expand this into new degree programs (or, if it expands beyond GT), we just add new rows to the lookup tables for the degree program.

lk_department

> Do we need a department table? I think it could just be a column on the lk_course table

Similar reasoning as lk_degree_program_specialization and lk_specialization. Depends on how granular we want to get and what works for the site.

lk_course

> We might want some more details in each course for other aspects of a course to call out, again using MSCSHub as an example, they have some useful tags at the top, e.g. exams, homework, peer-reviewed, etc... https://mscshub.com/courses/Algorithms:%20Techniques%20and%20Theory

I think we could add these attributes to the course_review table or build child tables to track the information.

course_offering & lk_semester

> Again as much as I like normalization, should we just add semester_desc to the course_offering table?

Depends on if we want to track semesters as their own entity. I lean towards keeping the more normalized approach for data integrity, but open to alterations.


williarm commented on September 18, 2024 1

> This looks incredibly comprehensive, thanks for putting it together @williarm!
>
> I've noticed that we're storing the user's full name and email address in the core.users table. These are considered PII under the GDPR regulations, and subject to the right to erasure. I don't think we'll have any major issues with this, but it does add maintenance overhead and will require us to build a mechanism where users can request to be forgotten.
>
> Do we have a use case for persisting this data? The OMSCentral reviews appear to be anonymous.

We don't have to store PII like name. I included it as an example of what a user entity might look like. Good call out on the GDPR stuff. Not sure how it would work if someone uses an identifiable email address like [email protected]?


mccormick-wooden commented on September 18, 2024 1

@williarm I think this looks really good overall. How you've modeled the domain makes a lot of sense to me. Just some thoughts on a few conventions I'm seeing:

I'm not sure I agree that active_flg buys a whole lot, or at least I think it increases complexity and encourages less-than-ideal deployment strategies. Ideally, a feature that relies on some data should be deployed with that data (via migration, deployment script, etc). If we want to be able to hide/show entities while maintaining referential integrity, I think a better strategy would be to have removed_by/removed_at audit fields (similar to the created_by/created_at you already have). This will also allow us to kill the deleted_flgs and make the field that controls presence consistent across the data model. The big benefit of making this consistent is that base data access logic can generically check for presence of these fields without needing to bake that logic across the application into various derived access logic (or views).
I think it's a good idea to start with consistent types for all primary keys (probably int). If these values are "meaningful" to the domain it makes them unnecessarily difficult to change later. If the plan is to abstract the raw db with a view layer, that would be another reason why using meaningful PKs is unnecessary.
It may be worth thinking about modeling lk_course_specialization as lk_course_offering_specialization since I think some courses have changed specialization credit over time. But whether or not this is a good idea depends on how/where you want to surface this (e.g. if you just care about surfacing the specialization state of a course "today", then don't bother).
I read above you aren't married to having both user_type and user_role - I think that user_type is probably unnecessary and that only simple roles are needed (honestly, probably just an "Admin" and "Normal" user type is sufficient).
I'm not sure that blob is right for course_review.comments - you probably just need text which will support storing markdown. This will make anything related to searching/analyzing review text easier down the road.
If you want to support adding degree specialization to a user profile, I think the right way to do it is to add a FK to core.users that references degree_program_specialization, and ensure that there is an Undecided specialization that can be mapped to all degree_programs.
This is definitely a nitpick but in general, I think names like created_by / created_at for audit fields are a little more conventional and cleaner - I think prefixing with datetime here is probably unnecessary.
Another nitpick, but I think lk_ prefixes are probably unnecessary when tables exist in a lookup schema.

Also, re: aggregated reviews, you said above:

> We can have a view that does the aggregations anytime someone accesses the main page.

100% agree, I think any type of aggregation on the reviews should be computed dynamically (in a view layer or further down the stack potentially) - refreshing a dedicated aggregation table is a headache that you don't want or need.


driscoll42 commented on September 18, 2024 1

To help frame some of these discussions and decisions, here is the temporary submission form we will be sending to the community to use in the interim period. Worth looking over to consider how the data would fit.

https://docs.google.com/forms/d/e/1FAIpQLSc1xXBa3nnPECvoAKLMC4X3iXbZOghOiIQv6p8xAwR5gysBSA/viewform?usp=sf_link


williarm commented on September 18, 2024 1

> > So it will include combinations of all courses and specializations?
>
> Yep. The main purpose is to be able to model courses that are valid for multiple specializations. Take Machine Learning for example - because it counts for several different specializations, the model allows you to do something like this to figure out what specializations it counts for:
>
> select spec.Description as [Machine Learning Specializations]
> from lk_specialization spec
> join lk_course_specialization coursespec on coursespec.specialization_id = spec.specialization_id
> join lk_course course on course.course_id = coursespec.course_id
> where course.course_desc = 'Machine Learning'
> and <various active_flg checks>
>
> which would result in (something like):
>
> Foundational
> Interactive Intelligence
> Machine Learning
> Computational Perception & Robotics
>
> (I don't know if folks intend to model whether something is a specialization "elective" or "core", or whether "Foundational" is considered a specialization, so this is an approximation of the results)

We can add the elective / core and foundational flags in the junction table.


ilanbarshir commented on September 18, 2024

We should probably consider a user picture in core.users.
An alternative to storing a password is to force authentication through Google/Facebook/GitHub. That makes the security aspects easier to handle.


ilanbarshir commented on September 18, 2024

Can you elaborate on what active_flg stands for?


driscoll42 commented on September 18, 2024

This is fantastic @williarm! Thanks for making it! A few comments/suggestions/questions.

core_course_review:

I think overall_rating/workload_hours should be integers. I don't think we need people to enter 10.43 hours/week or a rating of 3.5 or something. Let's just stick to integers.
We will need to consider other fields to allow for more aspects of a course to review, e.g. on MSCSHub they have textbook, lectures, professor, piazza support. Not saying we should do all of those, but something to keep in mind
And not a schema question, but should consider what scale we want, out of 5, out of 7, out of 10? My older brother loves a -5 to 5 scale
I assume this table is for the individual reviews, for the aggregate reviews I don't see a schema for those. Do you think it would be such that those would be calculated every time the site loads? I think it might be better to have another table with the aggregated summaries averages/totals for the reviews that gets updated any time someone adds a review rather than recalculating each time, but would need to discuss further. If it's minimal load on the server to do so could go either way.

lookup_lk_user_type and role

What would be the difference between type and role of a user?

specialization tables

It might be best to use something other than "specialization" for the tables, as that's fairly OMSCS-specific language. OMSA and OMSCY call them "tracks"; perhaps something like "concentration"?
Is there any reason not to combine lk_degree_program_specialization and lk_specialization? The specialization_desc could just be added to the former, no need for another table.
I think it might be worth having a discussion about how to treat OMSCS vs OMSA vs OMSCY specifics and the various tracks/specializations within.

lk_department

Do we need a department table? I think it could just be a column on the lk_course table

lk_course

We might want some more details in each course for other aspects of a course to call out, again using MSCSHub as an example, they have some useful tags at the top, e.g. exams, homework, peer-reviewed, etc... https://mscshub.com/courses/Algorithms:%20Techniques%20and%20Theory

course_offering & lk_semester

Again as much as I like normalization, should we just add semester_desc to the course_offering table?


williarm commented on September 18, 2024

> A couple of questions on the core.users table:
>
> Can we change password to password_hash?
>
> Should we allow users to sign in using OAuth, which would require some changes to this table (this should probably be a separate issue, but would affect the structure of the users table).

Yea, no issues renaming to password_hash. The OAuth question is a broader design question as you mentioned.


driscoll42 commented on September 18, 2024

I need to reply in more detail later. I appreciate your feedback and agree with most of it, but one more thing to consider is "aliases", e.g. searching for "BD4H" instead of CSE-6250 or Big Data for Healthcare, and similarly ML for Machine Learning, DL for Deep Learning, etc.
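
One way the alias idea could be modeled is a child table of lk_course. This is only a sketch in Python/sqlite3: the lk_course_alias table, its columns, and the find_course helper are all hypothetical and not part of the proposed diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lk_course (
        course_id   INTEGER PRIMARY KEY,
        course_code TEXT NOT NULL,   -- assumed column name
        course_desc TEXT NOT NULL
    );
    -- Hypothetical alias table: many aliases per course.
    CREATE TABLE lk_course_alias (
        course_id INTEGER NOT NULL REFERENCES lk_course(course_id),
        alias     TEXT NOT NULL,
        PRIMARY KEY (course_id, alias)
    );
    INSERT INTO lk_course VALUES (1, 'CSE-6250', 'Big Data for Healthcare');
    INSERT INTO lk_course_alias VALUES (1, 'BD4H');
""")

def find_course(conn, term):
    """Match a search term against alias, course code, or description (case-insensitive)."""
    return conn.execute(
        """
        SELECT c.course_code, c.course_desc
        FROM lk_course AS c
        LEFT JOIN lk_course_alias AS a ON a.course_id = c.course_id
        WHERE a.alias = :term COLLATE NOCASE
           OR c.course_code = :term COLLATE NOCASE
           OR c.course_desc = :term COLLATE NOCASE
        LIMIT 1
        """,
        {"term": term},
    ).fetchone()

by_alias = find_course(conn, "bd4h")
by_code = find_course(conn, "CSE-6250")
missing = find_course(conn, "no such course")
```

Keeping aliases in their own table means a new nickname is just a new row, with no change to lk_course itself.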


disposedtrolley commented on September 18, 2024

This looks incredibly comprehensive, thanks for putting it together @williarm!

I've noticed that we're storing the user's full name and email address in the core.users table. These are considered PII under the GDPR regulations, and subject to the right to erasure. I don't think we'll have any major issues with this, but it does add maintenance overhead and will require us to build a mechanism where users can request to be forgotten.

Do we have a use case for persisting this data? The OMSCentral reviews appear to be anonymous.


disposedtrolley commented on September 18, 2024

> Not sure how it would work if someone uses an identifiable email address like [email protected]?

Yeah, me neither. I think the best we can do is avoid storing anything that could be considered PII if at all possible. This might be the case if we use an authentication service like Auth0, which can manage all the user data.


driscoll42 commented on September 18, 2024

> I'm not sure if this fits under this issue, but I think it would be very helpful to capture some of the user's background. Maybe something like "professional_developer: boolean" or "years_professional_experience: int" and possibly some tech background (like do they know Python or C prior to the course).
>
> Courses like GIOS can be a breeze for some devs that have worked with C and gRPC while they can be a struggle for others. That makes a big difference in terms of how readers align to the reviews.

I do like this. I think we'd need to play with it a bit, but it's on the right track. Here are a couple of other fields that would be useful to have people put on their reviews:

Finished Course (BOOLEAN)
Final Grade (VARCHAR) - limited to a drop down of A/B/C/D/F/Do not want to share (probably a better letter/acronym for the last one)

It would be interesting to compare difficulty/rating based on grade/if they even finished it.

These kinds of fields are probably a 2.0 thing, not 1.0.

Adding to this, as part of the user profile, they could say whether they are an OMSCS, OMSA, or OMSCY student. However, I know there are people who take multiple master's degrees, so maybe just record which program they took the course under?


kewellcjj commented on September 18, 2024

I'm going to create some csv files with fake reviews based on @williarm's design.


williarm commented on September 18, 2024

> @williarm I think this looks really good overall. How you've modeled the domain makes a lot of sense to me. Just some thoughts on a few conventions I'm seeing:
>
> I'm not sure I agree that active_flg buys a whole lot, or at least I think it increases complexity and encourages less-than-ideal deployment strategies. Ideally, a feature that relies on some data should be deployed with that data (via migration, deployment script, etc). If we want to be able to hide/show entities while maintaining referential integrity, I think a better strategy would be to have removed_by/removed_at audit fields (similar to the created_by/created_at you already have). This will also allow us to kill the deleted_flgs and make the field that controls presence consistent across the data model. The big benefit of making this consistent is that base data access logic can generically check for presence of these fields without needing to bake that logic across the application into various derived access logic (or views).

I'm confused on the access pattern here - are you suggesting UI would filter out the inactive/logically deleted rows based on the audit columns? Assuming one of those would be a date?

> I think it's a good idea to start with consistent types for all primary keys (probably int). If these values are "meaningful" to the domain it makes them unnecessarily difficult to change later. If the plan is to abstract the raw db with a view layer, that would be another reason why using meaningful PKs is unnecessary.

This is fine.

> It may be worth thinking about modeling lk_course_specialization as lk_course_offering_specialization since I think some courses have changed specialization credit over time. But whether or not this is a good idea depends on how/where you want to surface this (e.g. if you just care about surfacing the specialization state of a course "today", then don't bother).

If this is the case, we would just insert a new row into lk_course_specialization and set the active_flg to false if the course is no longer part of the specialization. This way we retain the history of that course.

> I read above you aren't married to having both user_type and user_role - I think that user_type is probably unnecessary and that only simple roles are needed (honestly, probably just an "Admin" and "Normal" user type is sufficient).

Yea, really depends on how we want to do access. I'm fine either way.

> I'm not sure that blob is right for course_review.comments - you probably just need text which will support storing markdown. This will make anything related to searching/analyzing review text easier down the road.

This is up for review as well - I went with blob because some RDBMSs limit the number of characters in text fields. Whatever works here is fine with me, as long as we're in agreement that we need a wide field for the comments that allows for different formats.

> If you want to support adding degree specialization to a user profile, I think the right way to do it is to add a FK to core.users that references degree_program_specialization, and ensure that there is an Undecided specialization that can be mapped to all degree_programs.

Either this or a junction table that allows a user to be tied to multiple specializations in case GT ever allows dual specializations or the same user is in more than one program.

> This is definitely a nitpick but in general, I think names like created_by / created_at for audit fields are a little more conventional and cleaner - I think prefixing with datetime here is probably unnecessary.

That's fine with me, we just need to have audit columns.

> Another nitpick, but I think lk_ prefixes are probably unnecessary when tables exist in a lookup schema.

I like to keep them because when writing a query it's easier to see that lk_ represents the type of table and the type of data contained in it.

> Also, re: aggregated reviews, you said above:
>
> > We can have a view that does the aggregations anytime someone accesses the main page.
>
> 100% agree, I think any type of aggregation on the reviews should be computed dynamically (in a view layer or further down the stack potentially) - refreshing a dedicated aggregation table is a headache that you don't want or need.

Yea, the data volume is going to be low enough that this makes sense.
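
A rough sketch of the "aggregate in a view" approach, in Python/sqlite3 for illustration only. The view name vw_course_review_summary and the reduced course_review columns are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course_review (
        course_review_id INTEGER PRIMARY KEY,
        course_id        INTEGER NOT NULL,
        overall_rating   INTEGER NOT NULL,  -- integers, as agreed earlier in the thread
        workload_hours   INTEGER NOT NULL
    );
    -- Hypothetical read view: aggregates are computed at query time,
    -- so there is no separate summary table to refresh.
    CREATE VIEW vw_course_review_summary AS
    SELECT course_id,
           COUNT(*)            AS review_count,
           AVG(overall_rating) AS avg_rating,
           AVG(workload_hours) AS avg_workload_hours
    FROM course_review
    GROUP BY course_id;
""")
conn.executemany(
    "INSERT INTO course_review (course_id, overall_rating, workload_hours) VALUES (?, ?, ?)",
    [(1, 4, 10), (1, 5, 20), (2, 3, 8)],
)

def summary(conn, course_id):
    """Read the aggregated row for one course from the view."""
    return conn.execute(
        "SELECT review_count, avg_rating, avg_workload_hours "
        "FROM vw_course_review_summary WHERE course_id = ?",
        (course_id,),
    ).fetchone()

before = summary(conn, 1)
# A new review shows up in the averages immediately, with no refresh step.
conn.execute(
    "INSERT INTO course_review (course_id, overall_rating, workload_hours) VALUES (1, 3, 15)"
)
after = summary(conn, 1)
```

With only a few thousand reviews, recomputing these aggregates on each read should be cheap; a stored summary table would only be worth revisiting if volume grows substantially.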


mccormick-wooden commented on September 18, 2024

> I'm confused on the access pattern here - are you suggesting UI would filter out the inactive/logically deleted rows based on the audit columns? Assuming one of those would be a date?

"views" was referencing the db view layer that was proposed earlier.

I'm saying the data access logic (on the server, I guess, or whatever is decided) could generically look at one field (a "removed_by") instead of several ("active" / "deleted").
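
To make the "one generic check" idea concrete, here is a small sketch in Python/sqlite3. The table definitions, the KNOWN_TABLES mapping, and both helper functions are hypothetical; the sketch tests removed_at, though checking removed_by would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specialization (
        specialization_id   INTEGER PRIMARY KEY,
        specialization_desc TEXT NOT NULL,
        removed_by          INTEGER,  -- NULL while the row is present
        removed_at          TEXT      -- NULL while the row is present
    );
    CREATE TABLE semester (
        semester_id   INTEGER PRIMARY KEY,
        semester_desc TEXT NOT NULL,
        removed_by    INTEGER,
        removed_at    TEXT
    );
    INSERT INTO specialization (specialization_id, specialization_desc)
        VALUES (1, 'Computing Systems'), (2, 'Machine Learning');
    INSERT INTO semester (semester_id, semester_desc) VALUES (1, 'Spring 2014');
""")

# Table name -> primary key column. Identifiers cannot be bound as SQL
# parameters, so they only ever come from this fixed mapping.
KNOWN_TABLES = {"specialization": "specialization_id", "semester": "semester_id"}

def fetch_present(conn, table):
    """Same presence rule for every table: a row is visible while removed_at IS NULL."""
    pk = KNOWN_TABLES[table]
    return conn.execute(
        f"SELECT * FROM {table} WHERE removed_at IS NULL ORDER BY {pk}"
    ).fetchall()

def soft_remove(conn, table, row_id, user_id):
    """Hide a row without deleting it, recording who removed it and when."""
    pk = KNOWN_TABLES[table]
    conn.execute(
        f"UPDATE {table} SET removed_by = ?, removed_at = datetime('now') WHERE {pk} = ?",
        (user_id, row_id),
    )

soft_remove(conn, "specialization", 2, user_id=0)
present = fetch_present(conn, "specialization")
total_rows = conn.execute("SELECT COUNT(*) FROM specialization").fetchone()[0]
```

The removed row stays in the table, so foreign keys that point at it remain valid, while every read path applies the same filter.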


kewellcjj commented on September 18, 2024

I created some fake data downloadable at https://gatech.box.com/s/1ti7r0fvheuitbgi3pccdnf5yfq2b4jj.

The csv files closely follow the structure proposed by @williarm.

Some notes:

The review text comes from Yelp reviews, not from OMSCentral.
All primary keys are integers named with an _id suffix.
For simplicity, I set active_flg=True everywhere (except for lk_course), the created/changed datetimes (except for the course reviews) to 2014-01-01, and the create/change user to 0.
There are 105 courses, 3 degree programs, 9 specializations, 10,000 reviews, 3,000 users, and 27 semesters (from spring 2014 to fall 2022). Id numbering counts from 0.
lk_course_specialization, lk_user_role, lk_user_type, and users are not available yet; I'm not sure what the sets for user_role and user_type are.

It should let us start on the database and related backend work. Please take a look and feel free to make suggestions/changes.


williarm commented on September 18, 2024

> I created some fake data downloadable at https://gatech.box.com/s/1ti7r0fvheuitbgi3pccdnf5yfq2b4jj.
>
> The csv files closely follow the structure proposed by @williarm.
>
> Some notes:
>
> The review text comes from Yelp reviews, not from OMSCentral.
>
> All primary keys are integers named with an _id suffix.
>
> For simplicity, I set active_flg=True everywhere (except for lk_course), the created/changed datetimes (except for the course reviews) to 2014-01-01, and the create/change user to 0.
>
> There are 105 courses, 3 degree programs, 9 specializations, 10,000 reviews, 3,000 users, and 27 semesters (from spring 2014 to fall 2022). Id numbering counts from 0.
>
> lk_course_specialization, lk_user_role, lk_user_type, and users are not available yet; I'm not sure what the sets for user_role and user_type are.
>
> It should let us start on the database and related backend work. Please take a look and feel free to make suggestions/changes.

lk_course_specialization is a junction between lk_course and lk_specialization.

So for example, if we had the following row in lk_specialization:
1|Computing Systems|True

And the following row in lk_course:
1|CS|6200|Graduate Intro to Operating Systems

Then the row in lk_course_specialization would be:
1|1|1|True
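
The three example rows above can be loaded and joined as follows. This is a sketch in Python/sqlite3; the column names, and the order of the two foreign keys in the junction row, are assumptions, since only the pipe-separated values are given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lk_specialization (
        specialization_id   INTEGER PRIMARY KEY,
        specialization_desc TEXT NOT NULL,
        active_flg          INTEGER NOT NULL
    );
    CREATE TABLE lk_course (
        course_id       INTEGER PRIMARY KEY,
        department_code TEXT NOT NULL,   -- assumed column name
        course_number   TEXT NOT NULL,   -- assumed column name
        course_desc     TEXT NOT NULL
    );
    -- Junction table: one row per (course, specialization) pairing.
    CREATE TABLE lk_course_specialization (
        course_specialization_id INTEGER PRIMARY KEY,
        course_id         INTEGER NOT NULL REFERENCES lk_course(course_id),
        specialization_id INTEGER NOT NULL REFERENCES lk_specialization(specialization_id),
        active_flg        INTEGER NOT NULL
    );
    INSERT INTO lk_specialization VALUES (1, 'Computing Systems', 1);
    INSERT INTO lk_course VALUES (1, 'CS', '6200', 'Graduate Intro to Operating Systems');
    INSERT INTO lk_course_specialization VALUES (1, 1, 1, 1);
""")

def specializations_for(conn, course_desc):
    """List the active specializations a course is mapped to via the junction table."""
    rows = conn.execute(
        """
        SELECT spec.specialization_desc
        FROM lk_specialization AS spec
        JOIN lk_course_specialization AS link
          ON link.specialization_id = spec.specialization_id
        JOIN lk_course AS course ON course.course_id = link.course_id
        WHERE course.course_desc = ?
          AND spec.active_flg = 1
          AND link.active_flg = 1
        ORDER BY spec.specialization_desc
        """,
        (course_desc,),
    )
    return [r[0] for r in rows]

gios_specs = specializations_for(conn, "Graduate Intro to Operating Systems")
unmapped = specializations_for(conn, "Machine Learning")
```

A course that counts toward several specializations would simply have several rows in lk_course_specialization.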


kewellcjj commented on September 18, 2024

I see, so basically an outer join regardless of whether the specialization matches the course. So it will include combinations of all courses and specializations? I assume active_flg indicates whether the course falls under the specialization.


mccormick-wooden commented on September 18, 2024

> So it will include combinations of all courses and specializations?

Yep. The main purpose is to be able to model courses that are valid for multiple specializations. Take Machine Learning for example - because it counts for several different specializations, the model allows you to do something like this to figure out what specializations it counts for:

select spec.Description as [Machine Learning Specializations]
from lk_specialization spec
join lk_course_specialization coursespec on coursespec.specialization_id = spec.specialization_id
join lk_course course on course.course_id = coursespec.course_id
where course.course_desc = 'Machine Learning'
and <various active_flg checks>

which would result in (something like):

Foundational
Interactive Intelligence
Machine Learning
Computational Perception & Robotics

(I don't know if folks intend to model whether something is a specialization "elective" or "core", or whether "Foundational" is considered a specialization, so this is an approximation of the results)


awpala commented on September 18, 2024

Closing for obsolescence. Project has elected to use Firebase for back end service as of July 2022.
