lxmls / lxmls-toolkit

Machine Learning applied to Natural Language Processing Toolkit used in the Lisbon Machine Learning Summer School

License: Other


lxmls-toolkit's Introduction


LxMLS 2023

Machine learning toolkit for natural language processing. Written for Lisbon Machine Learning Summer School (lxmls.it.pt). This covers

  • Scientific Python and Mathematical background
  • Linear Classifiers
  • Sequence Models
  • Structured Prediction
  • Syntax and Parsing
  • Feed-forward models in deep learning
  • Sequence models in deep learning
  • Reinforcement Learning


Instructions for Students

Install with Anaconda or pip

If you are new to Python, the simplest method is to use Anaconda to handle your packages; just go to

https://www.anaconda.com/download/

and follow the instructions. We strongly recommend using at least Python 3.

If you prefer pip to Anaconda, you can install the toolkit in a way that does not interfere with your existing installation. For this you can use a virtual environment as follows:

virtualenv venv
source venv/bin/activate (on Windows: .\venv\Scripts\activate)
pip install pip setuptools --upgrade
pip install --editable . 

This will install the toolkit in a modifiable (editable) way. If you also want to virtualize your Python version (e.g. you are stuck with Python 2 on your system), have a look at pyenv.

Bear in mind that the main purpose of the toolkit is educational. You may resort to other toolboxes if you are looking for efficient implementations of the algorithms described.

Running

  • Run from the project root directory. If an import error occurs, try first adding the current path to the PYTHONPATH environment variable, e.g.:
    • export PYTHONPATH=.

Development

To run all the tests, install tox and pytest

pip install tox pytest

and run

tox

Note: to combine the coverage data from all the tox environments, run:

  • Windows
    set PYTEST_ADDOPTS=--cov-append
    tox
    
  • Other
    PYTEST_ADDOPTS=--cov-append tox
    

lxmls-toolkit's People

Contributors

andre-martins, antoniogois, askinkaty, christopherbrix, davidbp, dcferreira, e-bug, filippoc, gonmelo, gracaninja, hershaw, ibenes, israfelsr, joaolages, kelina, kepler, luispedro, madrugado, marianaalmeida, miguelbalmeida, pedrobalage, pschydlo, q0o0p, ramon-astudillo, robertodessi, tnunes, venelink, zmarinho


lxmls-toolkit's Issues

Change to Python3

Aside from the obvious reasons, the Jupyter notebook in Anaconda is giving more and more problems.

emission scores outside the logsum in the backward algorithm?

I find the run_forward and run_backward algorithms a bit messy, since the pseudocode provided uses probabilities while the code uses scores.

Shouldn't the emission scores be outside the logsum?

In sequence_classification_decoder.py, in run_backward, we have:

backward[pos, current_state] = logsum(backward[pos+1, :]
                                      + transition_scores[pos, :, current_state]
                                      + emission_scores[pos+1, :])

In the run_forward algorithm, the emission_scores are OUTSIDE the logsum function:

forward[pos, current_state] = logsum(forward[pos-1, :] + transition_scores[pos-1, current_state, :])
forward[pos, current_state] += emission_scores[pos, current_state]

My intuition suggests that the forward algorithm is correct. I have derived the following formula (which we might put in the document to facilitate the comprehension of the algorithm):

forward[pos, current_state] = logsum(forward[pos-1, :] + transition_scores[pos-1, current_state, :]) + emission_scores[pos, current_state]

that is, logsum(forward + transition_scores) + emission_scores, with the emission score outside the logsum.
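
For reference, a minimal sketch of the log-space forward recursion with the emission scores outside the logsum. This is a sketch only: it mirrors the code above but is not the toolkit's actual implementation, and it uses scipy's logsumexp in place of the toolkit's logsum:

import numpy as np
from scipy.special import logsumexp

def run_forward(initial_scores, transition_scores, emission_scores, final_scores):
    # initial_scores: (S,), transition_scores: (N-1, S, S),
    # emission_scores: (N, S), final_scores: (S,)
    length, num_states = emission_scores.shape
    forward = np.zeros((length, num_states))
    forward[0, :] = emission_scores[0, :] + initial_scores
    for pos in range(1, length):
        for current_state in range(num_states):
            # logsum over the previous states...
            forward[pos, current_state] = logsumexp(
                forward[pos - 1, :] + transition_scores[pos - 1, current_state, :])
            # ...and the emission score added OUTSIDE the logsum
            forward[pos, current_state] += emission_scores[pos, current_state]
    # return the trellis and the sequence log-likelihood
    return forward, logsumexp(forward[-1, :] + final_scores)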

Unclear state of `develop` branch

We are piling up some last-minute changes in develop. At the same time, there have been some direct fixes in student/master.

It is unclear if all changes in develop should be propagated to master, for example these files:

labs/images_for_notebooks/parsing/eisner_comp_left_2.svg
labs/images_for_notebooks/parsing/eisner_comp_right_2.svg
labs/images_for_notebooks/parsing/eisner_inc_left.svg
labs/images_for_notebooks/parsing/eisner_inc_left_2.svg
labs/images_for_notebooks/parsing/eisner_inc_right.svg
labs/images_for_notebooks/parsing/eisner_inc_right_2.svg
labs/images_for_notebooks/parsing/eisner_init.svg
labs/images_for_notebooks/parsing/eisner_pseudocode.png

@dcferreira, are these needed? I see no reference to them.

Mira algorithm step size

MIRA with regularizer=1 gives no update to the parameters for the Amazon dataset. It looked better when we changed the floor division '//' to '/' in the step size formula. Was there a reason behind using floor division?
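
A small illustration of the suspected problem, assuming the usual MIRA step size min(regularizer, loss / ||x||²) (the numbers are made up):

loss = 0.7        # hinge-style loss for the current example
sq_norm = 25.0    # squared norm of the feature vector
regularizer = 1.0

stepsize_floor = min(regularizer, loss // sq_norm)  # floor division: 0.0, so no update
stepsize_plain = min(regularizer, loss / sq_norm)   # true division: 0.028, a small update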

Feature: Unit testing derived from notebooks

Current unit tests need a number of changes:

  • Since we now have notebooks, a big advantage would be to create the tests from these notebooks, similarly to what labs/convert_notebooks.sh does for scripts

  • This will also partially solve the upcoming #71, since tests will have content-driven names instead of day order (e.g. day1)

solve.py is not compatible with Python3

The solve.py script in the student branch is not compatible with Python 3 in several respects (see the sketch after the list):

  • raising exceptions (Python 2 raise syntax)
  • print called as a statement rather than a function
  • urllib2.urlopen needs to be replaced with urllib.request.urlopen
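
A sketch of what the three fixes look like in Python 3 (the fetch function is hypothetical; the actual lines in solve.py differ):

from urllib.request import urlopen  # replaces urllib2.urlopen

def fetch(url):
    # print is a function in Python 3, not a statement
    print("downloading", url)
    try:
        return urlopen(url).read()
    except OSError as err:
        # Python 3 raise syntax: raise X(...) [from err], not raise X, "..."
        raise RuntimeError("download failed: " + url) from err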

On the other hand, the lab guide expects students to use Python 3, which is why I propose fixing solve.py to make it Python 3 compatible. I've already fixed the above issues locally, but I don't feel confident enough to make a pull request.

Thanks!

NOTEBOOK: Basic Tutorials

Write the exercise text and code for the Basic Tutorials day into a notebook under labs/notebooks/basic_tutorials/, using labs/notebooks/non_linear_classifiers/ as a reference.

Windows users have codec problems when invoking pdb

At least one Windows user reported problems when adding the line "import pdb; pdb.set_trace()" to a file on Day 1. This would yield a codecs-related error message.

Changing that to "import ipdb; ipdb.set_trace()" fixed the issue. This requires installing ipdb beforehand with "pip install ipdb", run from the command line.

We should test this on a Windows machine and, if confirmed, change the guide to add this remark for Windows users.

Update Student Branch

After this year's updates are finished, we will need to create the student branch.

  • Update README.md to tell students to install the toolkit symbolically à la python setup.py develop, see #73 (comment)
  • Check that the code for the exercises is removed and up to date
  • Fix solve.py to use long names following #71 instead of e.g. day1, and update the code accordingly

(Day 5) No en_perline001.txt in the student branch.

The Day 5 guide asks students to run

python wordcount.py en_perline001.txt > results.txt

but there is no such file inside the big_data folder. Also, the guide says to navigate to the "wordcount" folder while the script is actually inside the "big_data" folder, but this can easily be fixed in the guide.

Exercise 2.7 in notebooks is not clear

In Exercise 2.7 the students are supposed to add smoothing by changing the argument to train_supervised. But the way the notebook is phrased makes it seem like students are supposed to implement train_supervised themselves, or add smoothing to the implementation.

MrJob word count example doesn't work on Windows

For some reason, MrJob doesn't run the reducer step correctly on Windows. This has been verified on several Windows laptops during the labs. A quick fix is adding a combiner implementation to the MRJob subclass that does the same thing as the reducer:

def combiner(self, word, counts):
    yield (word, sum(counts))

I'm not sure why this works. Nevertheless, it has been verified with both mrjob v0.4.4 and v0.5.0-dev, on Anaconda Python 2.7.9, 64-bit.
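
For context, a self-contained word count with the combiner workaround applied (a sketch; the lab's actual wordcount.py may differ):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Same aggregation as the reducer; works around the Windows issue.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()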

NOTEBOOK: Parsing

Write the exercise text and code for the Parsing day into a notebook under labs/notebooks/parsing/, using labs/notebooks/non_linear_classifiers/ as a reference.

Encoding needs to be specified everywhere a file is read

The encoding used by open() in Python is platform dependent, so we should specify it everywhere.

@pedrobalage reported this in:

conll_file = open(path.join(base_deppars_dir, language + "_train.conll"))

But I guess this happens in more places.
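
The fix is to pass the encoding explicitly, assuming the data files are UTF-8 (base_deppars_dir and language come from the surrounding code):

# same call, with an explicit encoding instead of the platform default
conll_file = open(path.join(base_deppars_dir, language + "_train.conll"),
                  encoding="utf-8")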

Find the randomness in non-linear classifiers

The accuracies for the non-linear classifier exercises (numpy and pytorch) vary in a range of ±2 across executions.
For this reason, the unit tests for these days use a high tolerance factor (2).

You should check why the results are not the same across different executions of these classifiers. After finding the problem (possibly a random initialization in some function), fix it (possibly by defining a seed) in order to allow better unit tests.

You may then change the unit tests to check the correct accuracy results with a lower tolerance (1e-2).
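
A minimal sketch of the likely fix: fixing the seeds before training (42 is an arbitrary choice, and where exactly to set them in the toolkit is to be determined):

import numpy as np
import torch

np.random.seed(42)     # fixes numpy-based random initializations
torch.manual_seed(42)  # fixes pytorch-based random initializations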

Small problems with python basics notebook

Hi,

I have noticed a few small problems with the python basics notebook:

0.2

There is a difference between the code in the lab guide and the code in the notebook. The "a += 1" line should NOT be indented, according to the guide, and should thus lead to an infinite loop. In the notebook the code is the same as the example right above it.

0.3 Exceptions

More of a usage comment than an actual error: in order to actually get a ValueError, I need to insert something like "a" (the string, with the quotes). If I just insert the character a (without quotes), the notebook interprets it as a variable and I get a NameError for an undefined variable rather than the ValueError and the exception handling.

0.6 and 0.7

The code for 0.6 and 0.7 is duplicated (there are two 0.6 sections, the first one having the code for both 0.6 and 0.7).

Feature: Non-ambiguous names for days

Using day0, day1, etc. is ambiguous and error-prone since we change the order often. It is also difficult to relate exercises to tests. I propose the following:

  • Use a clearer naming scheme for each chapter, e.g. linear_classifiers
  • Keep the name in the references in the guide
  • Use a clearer naming scheme for each exercise, e.g. backpropagation_numpy
  • Keep the name in the references in the guide

Data reader fails when installing with pip

What happens:

When installing the toolkit with pip, the absolute paths are different and lxmls.readers does not find the data.

How to solve this:

Dependencies on local files inside the code should be avoided. One pythonic solution would be to force the user to give the data path when instantiating the reader (see the sketch below).
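
For example, a sketch of that solution with a hypothetical data_path argument (SentimentCorpus does not currently accept it; this only illustrates the proposed API):

import lxmls.readers.sentiment_reader as srs

# data_path is a hypothetical argument pointing the reader at the data explicitly
corpus = srs.SentimentCorpus("books", data_path="/path/to/lxmls-toolkit/data")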

Way to reproduce:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
pip install .    # As opposed to python setup.py develop

then run the code

import lxmls.readers.sentiment_reader as srs
corpus = srs.SentimentCorpus("books")

Note that python setup.py develop will work.

Fix build failure due to deprecated code in parsing day

Build fails in travis-ci, see

https://travis-ci.org/LxMLS/lxmls-toolkit/jobs/509740861

PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra

We should change numpy.matrix to numpy arrays. If this is not done, we will have to disable the test for the parsing day (which is left out this year).
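
A sketch of the kind of change needed (illustrative values; the parsing code's actual matrices differ):

import numpy as np

M = np.matrix([[1.0, 2.0], [3.0, 4.0]])  # deprecated matrix subclass
A = np.asarray(M)                        # plain ndarray instead
B = A @ A                                # '@' replaces the matrix class's '*'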

Fewer than 2000 documents read in day 1

Some students, instead of reading 2000 total documents and 1600 training documents, reported seeing the following numbers:
840
672

At least two operating systems had this problem:
Win 7 64-bit
Ubuntu 14.04 64-bit

MIRA and SVM are implemented in student branch

According to the guide, the students should implement MIRA and SVM (Exercises 1.3 and 1.5, respectively), but these algorithms are already implemented in the student branch. In fact, they seem to be the same implementations as in the master branch, since "git diff master student" outputs nothing for both mira.py and svm.py.

[Hugo] lxmls.readers.sentiment_reader - encoding problem

Attached is the corrected lxmls/readers/sentiment_reader.py file.
On "windoze" Python uses a different encoding from Mac and Linux (UTF-8) and cannot read the sentiment/books files correctly.
The solution is to indicate the desired encoding in the call to "open".

NOTEBOOK: Structured Predictors

Write the exercise text and code for the Structured Predictors day into a notebook under labs/notebooks/structured_predictors/, using labs/notebooks/non_linear_classifiers/ as a reference.

small error on the perceptron code

Small "bug" in the perceptron and the Mira algorithms.

The permutation of the data is done at each epoch according to the documentation. Nevertheless the code performs a permutation to the data that reamins the same across all epochs.

The code is currently:

    # # Randomize the examples
    perm = np.random.permutation(nr_x)
    for epoch_nr in xrange(self.nr_epochs):
        for nr in xrange(nr_x):
            ...

but it should be

    for epoch_nr in xrange(self.nr_epochs):
        # # Randomize the examples
        perm = np.random.permutation(nr_x)
        for nr in xrange(nr_x):
            ...

The documentation:

[screenshot of the lab guide, 2015-05-29, showing that the permutation should be redone at each epoch]

Same permutation error in MIRA (as in the perceptron)

I have changed this in the master and student branches.
I'll do a pull request. Waiting for the guide itself to be updated: instead of "Implement the MIRA algorithm", the title should be "Use the MIRA implementation and do parts 1 to 4 as in the previous exercise".

toolkit compiling error

Hi all,

I recently tried to install the dependencies for the LxMLS toolkit and ran into the following problems. Thank you in advance.

~$ git clone https://github.com/LxMLS/lxmls-toolkit.git
Cloning into 'lxmls-toolkit'...
remote: Counting objects: 1800, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 1800 (delta 6), reused 0 (delta 0), pack-reused 1784
Receiving objects: 100% (1800/1800), 22.31 MiB | 57.00 KiB/s, done.
Resolving deltas: 100% (1091/1091), done.
Checking connectivity... done.
~$ cd lxmls-toolkit
~/lxmls-toolkit$ pip install -r pip-requirements.txt
Downloading/unpacking configparser==3.2.0r3 (from -r pip-requirements.txt (line 1))
  Downloading configparser-3.2.0r3.tar.gz
  Running setup.py (path:/tmp/pip_build_iarroyof/configparser/setup.py) egg_info for package configparser

Downloading/unpacking pyyaml (from -r pip-requirements.txt (line 2))
  Downloading PyYAML-3.11.tar.gz (248kB): 248kB downloaded
  Running setup.py (path:/tmp/pip_build_iarroyof/pyyaml/setup.py) egg_info for package pyyaml

Downloading/unpacking nltk (from -r pip-requirements.txt (line 3))
  Downloading nltk-3.0.3.tar.gz (1.0MB): 1.0MB downloaded
  Running setup.py (path:/tmp/pip_build_iarroyof/nltk/setup.py) egg_info for package nltk

    warning: no files found matching 'Makefile' under directory '*.txt'
    warning: no previously-included files matching '*~' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/lib/python2.7/dist-packages (from -r pip-requirements.txt (line 4))
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages (from -r pip-requirements.txt (line 5))
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7 (from -r pip-requirements.txt (line 6))
Downloading/unpacking mrjob (from -r pip-requirements.txt (line 7))
  Downloading mrjob-0.4.4.tar.gz (186kB): 186kB downloaded
  Running setup.py (path:/tmp/pip_build_iarroyof/mrjob/setup.py) egg_info for package mrjob

    no previously-included directories found matching 'docs'
    warning: no files found matching '*.sh' under directory 'bootstrap'
Downloading/unpacking theano (from -r pip-requirements.txt (line 8))
  Downloading Theano-0.7.0.tar.gz (2.0MB): 2.0MB downloaded
  Running setup.py (path:/tmp/pip_build_iarroyof/theano/setup.py) egg_info for package theano

    warning: manifest_maker: MANIFEST.in, line 8: 'recursive-include' expects <dir> <pattern1> <pattern2> ...

Downloading/unpacking ordereddict (from configparser==3.2.0r3->-r pip-requirements.txt (line 1))
  Downloading ordereddict-1.1.tar.gz
  Running setup.py (path:/tmp/pip_build_iarroyof/ordereddict/setup.py) egg_info for package ordereddict

Downloading/unpacking unittest2 (from configparser==3.2.0r3->-r pip-requirements.txt (line 1))
  Downloading unittest2-1.1.0-py2.py3-none-any.whl (96kB): 96kB downloaded
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/lib/python2.7/dist-packages (from matplotlib->-r pip-requirements.txt (line 6))
Requirement already satisfied (use --upgrade to upgrade): tornado in /usr/lib/python2.7/dist-packages (from matplotlib->-r pip-requirements.txt (line 6))
Requirement already satisfied (use --upgrade to upgrade): pyparsing>=1.5.6 in /usr/lib/python2.7/dist-packages (from matplotlib->-r pip-requirements.txt (line 6))
Requirement already satisfied (use --upgrade to upgrade): nose in /usr/lib/python2.7/dist-packages (from matplotlib->-r pip-requirements.txt (line 6))
Downloading/unpacking boto>=2.2.0 (from mrjob->-r pip-requirements.txt (line 7))
  Downloading boto-2.38.0-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Downloading/unpacking filechunkio (from mrjob->-r pip-requirements.txt (line 7))
  Downloading filechunkio-1.6.tar.gz
  Running setup.py (path:/tmp/pip_build_iarroyof/filechunkio/setup.py) egg_info for package filechunkio

Requirement already satisfied (use --upgrade to upgrade): simplejson>=2.0.9 in /usr/lib/python2.7/dist-packages (from mrjob->-r pip-requirements.txt (line 7))
Requirement already satisfied (use --upgrade to upgrade): six>=1.4 in /usr/lib/python2.7/dist-packages (from unittest2->configparser==3.2.0r3->-r pip-requirements.txt (line 1))
Downloading/unpacking traceback2 (from unittest2->configparser==3.2.0r3->-r pip-requirements.txt (line 1))
  Downloading traceback2-1.4.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): argparse in /usr/lib/python2.7 (from unittest2->configparser==3.2.0r3->-r pip-requirements.txt (line 1))
Downloading/unpacking linecache2 (from traceback2->unittest2->configparser==3.2.0r3->-r pip-requirements.txt (line 1))
  Downloading linecache2-1.0.0-py2.py3-none-any.whl
Installing collected packages: configparser, pyyaml, nltk, mrjob, theano, ordereddict, unittest2, boto, filechunkio, traceback2, linecache2
  Running setup.py install for configparser
    error: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/configparser_helpers.py'
    Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_iarroyof/configparser/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-PSZTAo-record/install-record.txt --single-version-externally-managed --compile:
    running install

running build

running build_py

creating build

creating build/lib.linux-x86_64-2.7

copying configparser.py -> build/lib.linux-x86_64-2.7

copying configparser_helpers.py -> build/lib.linux-x86_64-2.7

running install_lib

copying build/lib.linux-x86_64-2.7/configparser_helpers.py -> /usr/local/lib/python2.7/dist-packages

error: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/configparser_helpers.py'

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_iarroyof/configparser/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-PSZTAo-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_iarroyof/configparser
Storing debug log for failure in /home/iarroyof/.pip/pip.log

After that, I supposed su privileges were needed, but that didn't work either:

~/lxmls-toolkit$ sudo pip install -r pip-requirements.txt
[sudo] password for iarroyof: 
Downloading/unpacking configparser==3.2.0r3 (from -r pip-requirements.txt (line 1))
  Downloading configparser-3.2.0r3.tar.gz
  Running setup.py (path:/tmp/pip_build_root/configparser/setup.py) egg_info for package configparser

Downloading/unpacking pyyaml (from -r pip-requirements.txt (line 2))
  Downloading PyYAML-3.11.tar.gz (248kB): 248kB downloaded
  Running setup.py (path:/tmp/pip_build_root/pyyaml/setup.py) egg_info for package pyyaml

Downloading/unpacking nltk (from -r pip-requirements.txt (line 3))
  Downloading nltk-3.0.3.tar.gz (1.0MB): 1.0MB downloaded
  Running setup.py (path:/tmp/pip_build_root/nltk/setup.py) egg_info for package nltk

    warning: no files found matching 'Makefile' under directory '*.txt'
    warning: no previously-included files matching '*~' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/lib/python2.7/dist-packages (from -r pip-requirements.txt (line 4))
Requirement already satisfied (use --upgrade to upgrade): scipy in /usr/lib/python2.7/dist-packages (from -r pip-requirements.txt (line 5))
Requirement already satisfied (use --upgrade to upgrade): matplotlib in /usr/lib/pymodules/python2.7 (from -r pip-requirements.txt (line 6))
Downloading/unpacking mrjob (from -r pip-requirements.txt (line 7))
  Downloading mrjob-0.4.4.tar.gz (186kB): 186kB downloaded
  Running setup.py (path:/tmp/pip_build_root/mrjob/setup.py) egg_info for package mrjob

    no previously-included directories found matching 'docs'
    warning: no files found matching '*.sh' under directory 'bootstrap'
Downloading/unpacking theano (from -r pip-requirements.txt (line 8))
  Downloading Theano-0.7.0.tar.gz (2.0MB): 1.3MB downloaded
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1197, in prepare_files
    do_download,
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1375, in unpack_url
    self.session,
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 572, in unpack_http_url
    download_hash = _download_url(resp, link, temp_location)
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 433, in _download_url
    for chunk in resp_read(4096):
  File "/usr/lib/python2.7/dist-packages/pip/download.py", line 421, in resp_read
    chunk_size, decode_content=False):
  File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 225, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 174, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out

Storing debug log for failure in /home/iarroyof/.pip/pip.log
~/lxmls-toolkit$ 

Any help would be much appreciated.

NOTEBOOK: Sequence Models

Write the exercise text and code for the Sequence Models day into a notebook under labs/notebooks/sequence_models/, using labs/notebooks/non_linear_classifiers/ as a reference.

Notebooks for every day

As @davidbp has been suggesting for a while, we should have all days in Jupyter notebooks.

I did this for the new pytorch deep learning days, see labs/notebooks/non_linear_classifiers/ and labs/notebooks/non_linear_sequence_classifiers/.

There is also a script, labs/convert_notebooks.sh, to automatically create the *.py versions for those who want to work remotely.

It seems a good idea to try to derive the unit tests for each day in a similar way, see #72

svm is provided in student branch

(I'm making an issue so we don't forget for next year.)

The lab guide asks students to implement the SVM primal, but it is provided in the student branch in classifiers/svm.py.

If we remove it, solve.py should download the master version; right now it does not do that.
solve.py also downloads perceptron.py, but that is unnecessary since it is already provided.

lxmls.classifiers.mira

In lxmls.classifiers.mira there are several references to 'y[inst:inst+1,0]', which is equivalent to 'y[inst,0]'.
If the 1 is constant, it would be better to simplify.
If we want it to vary, it would be better to use a variable.
Or add a comment explaining which variations would be possible...
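
A small illustration: the two forms pick out the same value but differ in shape, which may matter for broadcasting:

import numpy as np

y = np.arange(6).reshape(3, 2)
print(y[1:2, 0])  # array([2]) -- a length-1 array
print(y[1, 0])    # 2          -- a scalar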

NOTEBOOK: Linear Classifiers

Write the exercise text and code for the Linear Classifiers day into a notebook under labs/notebooks/linear_classifiers/, using labs/notebooks/non_linear_classifiers/ as a reference.

Ex. 5.5 breaks with theano.config.floatX=float32

Currently Exercise 5.5 (Theano MLP with batch) only works if floatX is set to "float64", even though floatX is usually float32.

It breaks because the data set is loaded as float64:

train_x = scr.train_X.T
train_y = scr.train_y[:, 0]
test_x = scr.test_X.T
test_y = scr.test_y[:, 0]

This fixes it:

train_x = train_x.astype(theano.config.floatX)
train_y = train_y.astype("int32")
test_x = test_x.astype(theano.config.floatX)
test_y = test_y.astype("int32")
