ds-unit-4-sprint-1-nlp's Introduction
ds-unit-4-sprint-1-nlp's People
Forkers
johnpharmd aaptedata krsmith zangell44 joshdsolis crawftv rick1270 danielmartinalarcon quinn-dougherty carlos-gutier manjulamishra danielleromanoff shreyasjothish invegat hughjafro cocoisland samirgadkari valogonor zarrinan chrisseiler96 brit228 albert-h-wong brittonwinterrose captmoonshot axrd 0xoddrey themultitude veritaem extrajp2014 donw385 lambdaschool-colejhudson tristan-paul donaldocelaj macscheffer wel51x sealuwee wjarvis2 derek-shing ssingh1187 mbrady4 livjab damerei rowebyrowe standroidbeta will-cotton4 danhorsley granero0011 nickwinters1 pwalis dustiny5 nolanole tomfox1 bkrant jaytheopensourcerer nicomontoya sokjc bundickm jazzathoth tbradshaw91 higgins2718 connorpheraty asinani tortas mkirby42 dpgofast ndoshi83 tjhendrixx dwightchurchill smsinclair valerielangat mohamad-ali-nasser khaloodi danielcalimayor mmastin macr lilysu nikux willhk ridleyleisy alvinwalker314 kevwebb jaavion jefntungila chefdarek jtkernan7 llpk79 rtrey29 lambdaschool-forks nchibana gyhou nov05 mjh09 ewuerfel66 mikvikpik chancedurr alqu7095 nrvanwyck joshfowlkes ianforrest11 mauney
ds-unit-4-sprint-1-nlp's Issues
Table Comparing sklearn API to Gensim
Link in module 1 assignment still points to "Master" instead of "main"
The data link in https://github.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/blob/main/module1-text-data/LS_DS_411_Text_Data_Assignment.ipynb should point to main instead of master.
module 1 lecture notebook regex
re.sub(r'[^a-zA-Z ^0-9]', '', sample)
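That regex pattern will not remove any '^' in the string: only the first '^' inside the brackets negates the character class, so the second '^' is a literal and ends up in the set of characters that are kept.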
re.sub(r'[^a-zA-Z ^0-9]', '', 'Hi, Joe, look up ^ there.')
'Hi Joe look up ^ there'
should be
re.sub(r'[^a-zA-Z 0-9]', '', sample)
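The difference between the two patterns can be checked directly. This is a small sketch that reuses the example string from this issue; the variable names are only for illustration.

```python
import re

sample = 'Hi, Joe, look up ^ there.'

# Pattern from the lecture notebook: the second '^' is a literal inside the
# character class, so '^' counts as a character to keep.
original = re.sub(r'[^a-zA-Z ^0-9]', '', sample)

# Proposed pattern: only letters, digits, and spaces are kept.
fixed = re.sub(r'[^a-zA-Z 0-9]', '', sample)

print(repr(original))  # 'Hi Joe look up ^ there'
print(repr(fixed))     # 'Hi Joe look up  there'
```

Note that the fixed pattern leaves a double space where the '^' used to be, since the surrounding spaces are both kept.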
Downgrade Python Version
Thinc and Spacy require Python 3.7 - the conda environment instructions say to use Python 3.8.
Training the model with gridsearch outside of the pipeline.
Because vectorizing raw text data is computationally expensive, training the model via the full pipeline means the training data is revectorized unnecessarily for every param combo in the gridsearch. Also, the .fit method for pipeline does not allow for passing parameters to do more robust things such as early stopping.
Time Difference:
full pipeline - several minutes to train one param combo
outside pipeline - 2 seconds to train one param combo all the way to early stopping limit
Proposed Solution:
Create a preprocessing pipeline -> preprocess the training data -> train the model -> add best_estimator_ to the pipeline via nesting
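A minimal sketch of the proposed flow, assuming scikit-learn is installed. The toy corpus, the choice of TfidfVectorizer and LogisticRegression, and the parameter grid are all hypothetical stand-ins; the notebook's actual vectorizer and model may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training text.
texts = [
    "great phone love it", "terrible battery hate it",
    "love the screen great buy", "hate the price terrible value",
    "great sound love the bass", "terrible support hate waiting",
    "love this great camera", "hate this terrible case",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 1. Preprocess once, outside the grid search.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# 2. Search over model parameters only, reusing the vectorized matrix.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X, labels)

# 3. Nest the fitted vectorizer and the best estimator in one pipeline
#    so that raw text can be passed in at prediction time.
final_pipeline = Pipeline([
    ("vect", vectorizer),
    ("clf", search.best_estimator_),
])
preds = final_pipeline.predict(["love this great phone"])
print(search.best_params_, preds)
```

One trade-off to keep in mind: fitting the vectorizer on all of the training data before cross-validation lets each validation fold influence the vocabulary and IDF weights, so the cross-validation scores can be slightly optimistic compared with the full-pipeline approach.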
Unused variables in compute_coherence_values
In your function for computing coherence values in LS_DS_414_Topic_Modeling_Lecture.ipynb, both tokens and stream are unused, which means that the parameter path is also unused. Recommend removing them.
Typo in Module1 Objective 1
This doesn't work.
Sketch out NLP project for students
Requirements - update
Update numpy to numpy==1.16 to avoid gensim breaking.
add:
import warnings
warnings.filterwarnings('ignore')
to avoid this:
DeprecationWarning: Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead.
score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
This happens about 100 times.
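As a possible refinement, the filter can be limited to DeprecationWarning so that other warnings stay visible. The sketch below uses only the standard library; noisy_step is a hypothetical stand-in for the library call that emits the warning.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a call that emits the DeprecationWarning.
    warnings.warn("Calling np.sum(generator) is deprecated", DeprecationWarning)
    warnings.warn("something else worth seeing", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    # Ignore only DeprecationWarning instead of silencing everything.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy_step()

remaining = [w.category.__name__ for w in caught]
print(remaining)  # ['UserWarning']
```

In a notebook, the single filterwarnings line with category=DeprecationWarning would be enough; the catch_warnings block is only here to show which warnings still get through.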
Litterary Typo
Litterary == literary
Update Pandas code to 1.0
From Maxie:
pointer:
df['brand'] = df['brand'].apply(lambda x: x.lower())
can be simplified to
df['brand'] = df['brand'].str.lower()
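A short comparison of the two forms, assuming pandas and numpy are installed. The brand values are made up for illustration. Besides being shorter, the .str accessor passes missing values through, which the lambda version does not.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one missing brand.
df = pd.DataFrame({'brand': ['Amazon', 'SAMSUNG', np.nan]})

# Vectorized string method: missing values stay missing.
via_str = df['brand'].str.lower()

# The lambda version would fail on the float NaN (it has no .lower()),
# so the missing row is dropped first for this comparison.
via_apply = df['brand'].dropna().apply(lambda x: x.lower())

print(via_str.tolist())
print(via_apply.tolist())
```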
Pip installation issues
pip install -r requirements.txt
Results in...
ERROR: Could not find a version that satisfies the requirement mkl-fft (from -r requirements.txt (line 45)) (from versions: none)
ERROR: No matching distribution found for mkl-fft
And then also...
ERROR: Could not find a version that satisfies the requirement mkl-random (from -r requirements.txt (line 52)) (from versions: none)
ERROR: No matching distribution found for mkl-random (from -r requirements.txt (line 52))
Gensim + Generators
Remind students that gensim can use more than generators
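One way to read this reminder: a corpus can be any iterable of documents, such as a list or an object that defines __iter__, and those can be traversed more than once, while a plain generator is exhausted after a single pass. The sketch below uses only the standard library and makes no gensim calls; TokenStream and the sample documents are hypothetical.

```python
class TokenStream:
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc.lower().split()


docs = ["Topic models need tokens", "Tokens come from text"]

# A generator can be consumed only once.
gen = (doc.lower().split() for doc in docs)
gen_first = list(gen)
gen_second = list(gen)  # empty: the generator is already exhausted

# The class-based stream can be iterated as many times as needed.
stream = TokenStream(docs)
stream_first = list(stream)
stream_second = list(stream)

print(len(gen_first), len(gen_second))        # 2 0
print(len(stream_first), len(stream_second))  # 2 2
```

This matters for anything that makes several passes over the corpus, for example building a vocabulary and then training on the same documents.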
Fix Submissions No in M3
Max submissions should be 20 instead of 2 in the notebook copy.