Jupyter Notebook 81.14% HTML 18.84% Dockerfile 0.01%

ds-unit-4-sprint-1-nlp's Introduction

DS-Unit-4-Sprint-1-NLP

ds-unit-4-sprint-1-nlp's People

Forkers

johnpharmd aaptedata krsmith zangell44 joshdsolis crawftv rick1270 danielmartinalarcon quinn-dougherty carlos-gutier manjulamishra danielleromanoff shreyasjothish invegat hughjafro cocoisland samirgadkari valogonor zarrinan chrisseiler96 brit228 albert-h-wong brittonwinterrose captmoonshot axrd 0xoddrey themultitude veritaem extrajp2014 donw385 lambdaschool-colejhudson tristan-paul donaldocelaj macscheffer wel51x sealuwee wjarvis2 derek-shing ssingh1187 mbrady4 livjab damerei rowebyrowe standroidbeta will-cotton4 danhorsley granero0011 nickwinters1 pwalis dustiny5 nolanole tomfox1 bkrant jaytheopensourcerer nicomontoya sokjc bundickm jazzathoth tbradshaw91 higgins2718 connorpheraty asinani tortas mkirby42 dpgofast ndoshi83 tjhendrixx dwightchurchill smsinclair valerielangat mohamad-ali-nasser khaloodi danielcalimayor mmastin macr lilysu nikux willhk ridleyleisy alvinwalker314 kevwebb jaavion jefntungila chefdarek jtkernan7 llpk79 rtrey29 lambdaschool-forks nchibana gyhou nov05 mjh09 ewuerfel66 mikvikpik chancedurr alqu7095 nrvanwyck joshfowlkes ianforrest11 mauney

ds-unit-4-sprint-1-nlp's Issues

Table Comparing the sklearn API to the Gensim API

Link in module 1 assignment still points to "Master" instead of "main"

The data link in https://github.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/blob/main/module1-text-data/LS_DS_411_Text_Data_Assignment.ipynb should point to main instead of master.

module 1 lecture notebook regex

re.sub(r'[^a-zA-Z ^0-9]', '', sample)

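That regex pattern will not remove any '^' in the string. Only a '^' at the very start of a character class negates it; the second '^' is a literal, so it joins the set of characters that are kept.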

re.sub(r'[^a-zA-Z ^0-9]', '', 'Hi, Joe, look up ^ there.')

'Hi Joe look up ^ there'

The pattern should be
re.sub(r'[^a-zA-Z 0-9]', '', sample)
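
The difference between the two patterns can be checked directly. This is a minimal sketch using the sample sentence from the report above; the variable names are only illustrative.

```python
import re

sample = 'Hi, Joe, look up ^ there.'

# Pattern from the lecture notebook: the second '^' is a literal inside
# the character class, so '^' is treated as a character to keep.
notebook_result = re.sub(r'[^a-zA-Z ^0-9]', '', sample)

# Proposed pattern: only letters, digits, and spaces are kept.
fixed_result = re.sub(r'[^a-zA-Z 0-9]', '', sample)

print(repr(notebook_result))  # 'Hi Joe look up ^ there'
print(repr(fixed_result))     # 'Hi Joe look up  there'
```

Note that the fixed pattern leaves two consecutive spaces where the '^' used to be, since the surrounding spaces are both kept.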

Downgrade Python Version

Thinc and spaCy require Python 3.7, but the conda environment instructions say to use Python 3.8.

Training the model with gridsearch outside of the pipeline.

Vectorizing raw text data is computationally expensive, so if the model is trained via the full pipeline, the training data is re-vectorized unnecessarily for every param combo in the gridsearch. Also, the pipeline's .fit method makes it awkward to pass the extra parameters needed for more robust things such as early stopping.

Time Difference:
full pipeline - several minutes to train one param combo
outside pipeline - 2 seconds to train one param combo all the way to early stopping limit

Proposed Solution:
create preprocess pipeline -> preprocess training data -> train model -> add best_estimator_ to pipeline via nesting
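
The proposed steps might look roughly like the sketch below. The toy corpus, the choice of TfidfVectorizer and LogisticRegression, and the parameter grid are all assumptions made for illustration; the issue does not say which vectorizer or model the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only.
texts = [
    "great movie loved it", "wonderful acting and story",
    "really enjoyable film", "best film this year",
    "terrible movie hated it", "awful acting and plot",
    "really boring film", "worst film this year",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Preprocessing pipeline: fitted and applied to the training data once.
preprocess = Pipeline([("vect", TfidfVectorizer())])
X_vec = preprocess.fit_transform(texts)

# 2. Grid search over the model only, on the already vectorized features.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=2,
)
search.fit(X_vec, labels)

# 3. Nest the fitted preprocessing pipeline and the best model so that
#    raw text can be passed straight to predict().
full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", search.best_estimator_),
])
predictions = full_pipeline.predict(["this movie was wonderful"])
```

One trade-off of this layout: the vectorizer is fitted on all of the training data before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. That is the price paid for not re-vectorizing on every parameter combination.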

to avoid this:

DeprecationWarning: Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.fromiter(generator)) or the python sum builtin instead.
score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)

happening about 100 times.
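
One way to keep a repeated warning like this out of notebook output is to filter it by message around the noisy call. The sketch below uses only the standard library; `noisy_library_call` is a hypothetical stand-in for the library routine that emits the warning, not the actual code involved.

```python
import warnings


def noisy_library_call():
    """Hypothetical stand-in for a library routine that warns on every call."""
    warnings.warn(
        "Calling np.sum(generator) is deprecated, and in the future "
        "will give a different result.",
        DeprecationWarning,
    )
    return 1.0


# record=True collects any warning that gets past the filters, which makes
# it easy to confirm that nothing slipped through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only this specific message; other warnings are still shown.
    warnings.filterwarnings(
        "ignore",
        message=r"Calling np\.sum\(generator\) is deprecated",
        category=DeprecationWarning,
    )
    results = [noisy_library_call() for _ in range(100)]

print(len(results), len(caught))
```

Filtering by message rather than by category alone keeps unrelated deprecation warnings visible. The filter is restored to its previous state when the `with` block exits.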

pip install -r requirements.txt

Results in...

ERROR: Could not find a version that satisfies the requirement mkl-fft (from -r requirements.txt (line 45)) (from versions: none)
ERROR: No matching distribution found for mkl-fft

And then also...

ERROR: Could not find a version that satisfies the requirement mkl-random (from -r requirements.txt (line 52)) (from versions: none)
ERROR: No matching distribution found for mkl-random (from -r requirements.txt (line 52))
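
A possible workaround, assuming the mkl-fft and mkl-random entries are only needed in a conda environment and can be skipped under pip, is to filter them out before installing. The helper name `drop_blocked_requirements` and the sample lines below are hypothetical; the real requirements.txt contents are not shown in the report.

```python
import re


def drop_blocked_requirements(lines, blocked=("mkl-fft", "mkl-random")):
    """Return the requirement lines whose package name is not in `blocked`."""
    kept = []
    for line in lines:
        # Take the text before any version specifier, extra, or marker.
        name = re.split(r"[=<>!~\[; @]", line.strip(), maxsplit=1)[0]
        # Treat underscores and hyphens as equivalent in package names.
        if name.lower().replace("_", "-") in blocked:
            continue
        kept.append(line)
    return kept


# Hypothetical sample lines, for illustration only.
sample_requirements = [
    "numpy==1.19.2",
    "mkl-fft==1.2.0",
    "mkl_random==1.1.1",
    "spacy==2.3.5",
]
filtered = drop_blocked_requirements(sample_requirements)
print(filtered)  # ['numpy==1.19.2', 'spacy==2.3.5']
```

The filtered lines could then be written to a separate file and passed to pip install -r in place of the original.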

Gensim + Generators

Remind students that gensim can use more than generators

Fix Submissions No in M3

Max submissions should be 20 instead of 2 in the notebook copy.

bloominstituteoftechnology / ds-unit-4-sprint-1-nlp
