explosion / spacy-course Goto Github PK

View Code? Open in Web Editor NEW

2.3K 61.0 366.0 9.21 MB

👩‍🏫 Advanced NLP with spaCy: A free online course

Home Page: https://course.spacy.io

License: MIT License

Python 80.33% JavaScript 8.24% CSS 7.82% Makefile 0.11% Dockerfile 0.12% Sass 3.00% Shell 0.38%

spacy nlp natural-language-processing online-course course gatsby gatsbyjs jupyter binder machine-learning

spacy-course's Introduction

Advanced NLP with spaCy: A free online course

This repo contains both an online course, as well as its modern open-source web framework. In the course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches. The front-end is powered by Gatsby, Reveal.js and Plyr, and the back-end code execution uses Binder 💖 It's all open-source and published under the MIT license (code and framework) and CC BY-NC (spaCy course materials).

This course is mostly intended for self-study. Yes, you can cheat – the solutions are all in this repo, there's no penalty for clicking "Show hints" or "Show solution", and you can mark an exercise as done when you think it's done.

💬 Languages and Translations

Language	Text Examples¹	Source	Authors
English	English	`chapters/en`, `exercises/en`	@ines
German	German	`chapters/de`, `exercises/de`	@ines, @Jette16
Spanish	Spanish	`chapters/es`, `exercises/es`	@mariacamilagl, @damian-romero
French	French	`chapters/fr`, `exercises/fr`	@datakime
Japanese	Japanese	`chapters/ja`, `exercises/ja`	@tamuhey, @hiroshi-matsuda-rit, @icoxfog417, @akirakubo, @forest1988, @ao9mame, @matsurih, @HiromuHota, @mei28, @polm
Chinese	Chinese	`chapters/zh`, `exercises/zh`	@crownpku
Portuguese	English	`chapters/pt`, `exercises/pt`	@Cristianasp

If you spot a mistake, I always appreciate pull requests!

1. This is the language used for the text examples and resources used in the exercises. For example, the German version of the course also uses German text examples and models. It's not always possible to translate all code examples, so some translations may still use and analyze English text as part of the course.

Related resources

📚 Prefer notebooks? Check out the Jupyter notebook version of this course, put together by @cristianasp.

💁 FAQ

Is this related to the spaCy course on DataCamp?

I originally developed the content for DataCamp, but I wanted to make a free version to make it available to more people, and so you don't have to sign up for their service. As a weekend project, I ended up putting together my own little app to present the exercises and content in a fun and interactive way.

Can I use this to build my own course?

Probably, yes! If you've been looking for a DIY way to publish your materials, I hope that my little framework can be useful. Because so many people expressed interest in this, I put together some starter repos that you can fork and adapt:

🐍 Python: ines/course-starter-python
🇷 R: ines/course-starter-r

Why the different licenses?

The source of the app, UI components and Gatsby framework for building interactive courses is licensed as MIT, like pretty much all of my open-source software. The course materials themselves (slides and chapters), are licensed under CC BY-NC. This means that you can use them freely – you just can't make money off them.

I want to help translate this course into my language. How can I contribute?

First, thanks so much, this is really cool and valuable to the community 🙌 I've tried to set up the course structure so it's easy to add different languages: language-specific files are organized into directories in exercises and chapters, and other language specific texts are available in locale.json. If you want to contribute, there are two different ways to get involved:

Start a community translation project. This is the easiest, no-strings-attached way. You can fork the repo, copy-paste the English version, change the language code, start translating and invite others to contribute (if you like). If you're looking for contributors, feel free to open an issue here or tag @spacy_io on Twitter so we can help get the word out. We're also happy to answer your questions on the issue tracker.
Make us an offer. We're open to commissioning translations for different languages, so if you're interested, email us at [email protected] and include your offer, estimated time schedule and a bit about you and your background (and any technical writing or translation work you've done in the past, if available). It doesn't matter where you're based, but you should be able to issue invoices as a freelancer or similar, depending on your country.

I want to help create an audio/video tutorial for an existing translation. How can I get involved?

Again, thanks, this is super cool! While the English and German videos also include a video recording, it's not a requirement and we'd be happy to just provide an audio track alongside the slides. We'd take care of the postprocessing and video editing, so all we need is the audio recording. If you feel comfortable recording yourself reading out the slide notes in your language, email us at [email protected] and make us an offer and include a bit about you and similar work you've done in the past, if available.

🎛 Usage & API

Running the app

To start the local development server, install Gatsby and then all other dependencies, then use npm run dev to start the development server. Make sure you have at least Node 10.15 installed.

npm install -g gatsby-cli  # Install Gatsby globally
npm install                # Install dependencies
npm run dev                # Run the development server

If running with docker just run make build and then make gatsby-dev

How it works

When building the site, Gatsby will look for .py files and make their contents available to query via GraphQL. This lets us use the raw code within the app. Under the hood, the app uses Binder to serve up an image with the package dependencies, including the spaCy models. By calling into JupyterLab, we can then execute code using the active kernel. This lets you edit the code in the browser and see the live results. Also see my juniper repo for more details on the implementation.

To validate the code when the user hits "Submit", I'm currently using a slightly hacky trick. Since the Python code is sent back to the kernel as a string, we can manipulate it and add tests – for example, exercise exc_01_02_01.py will be validated using test_01_02_01.py (if available). The user code and test are combined using a string template. At the moment, the testTemplate in the meta.json looks like this:

from wasabi import msg
__msg__ = msg
__solution__ = """${solution}"""
${solution}

${test}
try:
    test()
except AssertionError as e:
    __msg__.fail(e)

If present, ${solution} will be replaced with the string value of the submitted user code. In this case, we're inserting it twice: once as a string so we can check whether the submission includes something, and once as the code, so we can actually run it and check the objects it creates. ${test} is replaced by the contents of the test file. I'm also making wasabi's printer available as __msg__, so we can easily print pretty messages in the tests. Finally, the try/accept block checks if the test function raises an AssertionError and if so, displays the error message. This also hides the full error traceback (which can easily leak the correct answers).

A test file could then look like this:

def test():
    assert "spacy.load" in __solution__, "Are you calling spacy.load?"
    assert nlp.meta["lang"] == "en", "Are you loading the correct model?"
    assert nlp.meta["name"] == "core_web_sm", "Are you loading the correct model?"
    assert "nlp(text)" in __solution__, "Are you processing the text correctly?"
    assert "print(doc.text)" in __solution__, "Are you printing the Doc's text?"

    __msg__.good(
        "Well done! Now that you've practiced loading models, let's look at "
        "some of their predictions."
    )

With this approach, it's not always possible to validate the input perfectly – there are too many options and we want to avoid false positives.

Running automated tests

The automated tests make sure that the provided solution code is compatible with the test file that's used to validate submissions. The test suite is powered by the pytest framework and runnable test files are generated automatically in a directory __tests__ before the test session starts. See the conftest.py for implementation details.

# Install requirements
pip install -r binder/requirements.txt
# Run the tests (will generate the files automatically)
python -m pytest __tests__

If running with docker just run make build and then make pytest

Directory Structure

├── binder
|   └── requirements.txt  # Python dependency requirements for Binder
├── chapters              # chapters, grouped by language
|   ├── en                # English chapters, one Markdown file per language
|   |   └── slides        # English slides, one Markdown file per presentation
|   └── ...               # other languages
├── exercises             # code files, tests and assets for exercises
|   ├── en                # English exercises, solutions, tests and data
|   └── ...               # other languages
├── public                # compiled site
├── src                   # Gatsby/React source, independent from content
├── static                # static assets like images, available in slides/chapters
├── locale.json           # translations of meta and UI text
├── meta.json             # course metadata
└── theme.sass            # UI theme colors and settings

Setting up Binder

The requirements.txt in the repository defines the packages that are installed when building it with Binder. For this course, I'm using the source repo as the Binder repo, as it allows to keep everything in one place. It also lets the exercises reference and load other files (e.g. JSON), which will be copied over into the Python environment. I build the binder from a branch binder, though, which I only update if Binder-relevant files change. Otherwise, every update to master would trigger an image rebuild.

You can specify the binder settings like repo, branch and kernel type in the "juniper" section of the meta.json. I'd recommend running the very first build via the interface on the Binder website, as this gives you a detailed build log and feedback on whether everything worked as expected. Enter your repository URL, click "launch" and wait for it to install the dependencies and build the image.

File formats

Chapters

Chapters are placed in /chapters and are Markdown files consisting of <exercise> components. They'll be turned into pages, e.g. /chapter1. In their frontmatter block at the top of the file, they need to specify type: chapter, as well as the following meta:

---
title: The chapter title
description: The chapter description
prev: /chapter1 # exact path to previous chapter or null to not show a link
next: /chapter3 # exact path to next chapter or null to not show a link
id: 2 # unique identifier for chapter
type: chapter # important: this creates a standalone page from the chapter
---

Slides

Slides are placed in /slides and are markdown files consisting of slide content, separated by ---. They need to specify the following frontmatter block at the top of the file:

---
type: slides
---

The first and last slide use a special layout and will display the headline in the center of the slide. Speaker notes (in this case, the script) can be added at the end of a slide, prefixed by Notes:. They'll then be shown on the right next to the slides. Here's an example slides file:

---
type: slide
---

# Processing pipelines

Notes: This is a slide deck about processing pipelines.

---

# Next slide

- Some bullet points here
- And another bullet point

<img src="/image.jpg" alt="An image located in /static" />

Custom Elements

When using custom elements, make sure to place a newline between the opening/closing tags and the children. Otherwise, Markdown content may not render correctly.

`<exercise>`

Container of a single exercise.

Argument	Type	Description
`id`	number / string	Unique exercise ID within chapter.
`title`	string	Exercise title.
`type`	string	Optional type. `"slides"` makes container wider and adds icon.
children	-	The contents of the exercise.

<exercise id="1" title="Introduction to spaCy">

Content goes here...

</exercise>

`<codeblock>`

Argument	Type	Description
`id`	number / string	Unique identifier of the code exercise.
`source`	string	Name of the source file (without file extension). Defaults to `exc_${id}` if not set.
`solution`	string	Name of the solution file (without file extension). Defaults to `solution_${id}` if not set.
`test`	string	Name of the test file (without file extension). Defaults to `test_${id}` if not set.
children	string	Optional hints displayed when the user clicks "Show hints".

<codeblock id="02_03">

This is a hint!

</codeblock>

`<slides>`

Container to display slides interactively using Reveal.js and a Markdown file.

Argument	Type	Description
`source`	string	Name of slides file (without file extension).

<slides source="chapter1_01_introduction-to-spacy">
</slides>

`<choice>`

Container for multiple-choice question.

Argument	Type	Description
`id`	string / number	Optional unique ID. Can be used if more than one choice question is present in one exercise.
children	nodes	Only `<opt>` components for the options.

<choice>

<opt text="Option one">You have selected option one! This is not good.</opt>
<opt text="Option two" correct="true">Yay! </opt>

</choice>

`<opt>`

A multiple-choice option.

Argument	Type	Description
`text`	string	The option text to be displayed. Supports inline HTML.
`correct`	string	`"true"` if the option is the correct answer.
children	string	The text to be displayed if the option is selected (explaining why it's correct or incorrect).

spacy-course's People

Contributors

Stargazers

Watchers

Forkers

barseghyanartur tspannhw foton263 vnijs ttimbers hodwilliam pankajmehar vck hack121 nwilliam868 shrugj vesterde shaunstanislauslau stiero hhy5277 engmux alabarga sjoerdapp allorimd petartodorov aldabbagh gakkilovemath jonmcalder mreishus programmicon aculich nirupam1sharma akayeshmantha anneshachowdhury lucian-whu motazsaad santhosh-ks pluketic oztiger dorianbrown thaneacheron shantanu0304 bradparks zhiephieforks w7yuu mbrukman juandes dengziming emailhy sumehta paulgb derekdjia m8e meitianjinbu curapersona interventionlabs johnlaudun theainerd austinlasseter giuliorossetti ninabina921 nawazzeeshan salrivas jonleslie rajacsp mnhacohen kmk028 shinde-rahul grayowlshuck giogkarakis milog17 navneethc littleserendipity skylaronomics jastor11 rongpenl pubfork timopheym themushrr00m applied-machinelearning hajderr statdataanalyzer yhjohn163 r4gis geosapere stcybrdgs abiraja2004 sateesh110 k1mmie amir22010 datalayer-externals inkenbrandt bayethiernodiop ghanima conradbm rohit-singh-lbb 73percent vivek2319 lorarjohns hitman56 jgxxjg bytearchive todun amitmahi shalevy1

spacy-course's Issues

isinstance(similarity, float) not working for np.float32

Problem

Chapter 2, section 10, part 2. Test file here. The check isinstance(similarity, float) fails.

Edit: Also part 3

I dug a bit around, and found that Token.similarity() returns a numpy.float32, not a python float.

In [2]: type(similarity)                                                                                    
Out[2]: numpy.float32

In [3]: isinstance(similarity, float)                                                                       
Out[3]: False

Solution proposals

Now, I'm unsure how to best proceed: It does return a floating point value, just not the exact kind that the test expects. Here are two suggestions:

1: Use `np.floating` in addition to `float`

In [17]: isinstance(similarity, np.floating)                
Out[17]: True

In [18]: isinstance(1.0, np.floating)  # Why we need to ALSO check isinstance(x, float)
Out[18]: False

It is surprisingly hard to find docs to link for numpy.floating 🤔
Instead, here is its docstring: "Abstract base class of all floating-point scalar types."

2: Try casting to float

Check the value's range while we're there

In [22]: 0 <= float(similarity) <= 1                        
Out[22]: True

Edit edit: For what it is worth, using this in the solution passes the test:

...
similarity = span1.similarity(span2)
similarity = float(similarity)
print(similarity)

Additional considerations

Maybe the docs should reflect that minuscule difference? I'm not sure if it is important in practice.
Along with #7, this is a case where the pre-made solution fails on its associated test. We could perhaps make a script for checking each solution against its test? I'll spin off a separate issue for this.

Advanced Spacy Course: Bowie is a word

How to reproduce the behaviour

Your Environment

Operating System:
Python Version Used:
spaCy Version Used:
Environment Information:

Provide a unique Jupyter per chapter

As an enhancement I suggest to provide as well an unique jupyter notebook with all the code per chapter.
Some of us learn just better with the jupyter in front and playing around with the code and making our own annotations.
thanks

Chapter 1 q11 Matcher matches all text.

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

produces
Matches: ['New', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders', 'by', 'mistake']

Spacy version 2.0.16

Turkish Translation would be great idea !

Hello spaCy Community, 💥

I would love to work on Turkish translation in this beautiful course. Any ideas about that? :)

Auto-format solution code with Black in testTemplate

@thorbjornwolf suggested in #7:

Just a side-thought: Could it be interesting to use Black to normalize the solution code formatting before checking it? I see this test checks against the two main string quotes " and ': if that is a recurring thing, Black might help by forcing alllll the code to adhere to a certain, deterministic formatting choice. On the other hand, applying it just might increase complexity, and may reduce the readability of the tests 🤷‍♂️

I replied:

Yessss, so cool you mentioned this – I actually had the exact same idea yesterday! 👍 This would definitely make it so much easier to validate the string code, because we can make more assumptions about it – and the user could still write their code however they want to. I'll need to play around with this a bit, though, to make sure there are no unintended side-effects (and how to best do it from within Python – I've only ever used Black via a plugin or on the command line before).

Missing json file

iphone.json file is missing from the exercises folder. Please check it!

feature request: example for text classification

First of all: thank you. I've learned an amazing amount in a super short time from your work here.

I do have a small request that might help others. After the course I wanted to try to train a text classification example and I could not figure what I was doing wrong (i created a stackoverflow question here) even though training for entity detection worked just fine.

A code example/assignment on text classification on an entire document might be nice to add to the course as an example (there's a lot of use-cases for it) and it might help others too.

Should I get the code to work, I wouldn't mind giving a try for a PR either, should that be welcome.

Diagram for the Span object

Data Structures(2): Doc, Span and Token: https://course.spacy.io/en/chapter2
Section 2 on the Span Object

The [diagram] says Span(1,4). The slice from index 1 - 4 is highlighted in yellow. while it should be from index 1 to 3 since the end index is exclusive

.

"Export slides deck to pdf" functionality

Hi,
this project is indeed very nice.

What about implementing an "export to pdf" functionality for slides deck?
I feel that it might be something useful to make teaching resources available also offline.

Ch 2, section 7: `result` not in code

It's me again! Let me know if this gets annoying, or if you prefer feedback in a different channel.

The first multiple-choice option in chapter 2, section 7, part 1 refers to a result object, which is not in the code. Looks like a leftover from a refactor :-)

The text is this: "The tokens in the result should be converted back to Token objects. This will let you reuse them in spaCy."

Chapter 1 Lesson 9 requires student to name the entity iterator variable 'ent' -- unnecessarily

Submitting the code with the iterator named anything other than ent results in an error message saying "✘ Are you iterating over the entities?"

But the iterator name has no impact on the generated output and shouldn't matter.

Looks like the solution does in fact search for 'ent' explicitly as the var name currently.
https://github.com/ines/spacy-course/blob/master/exercises/en/test_01_09.py

About sharing knowledge

What is the problem with Chinese README's

Firstly, we congratulate you for getting so much star by sharing this repository with humanity.

But it is very disappointing for non-Chinese speakers when one couldn't understand what a trending repository is about.

When we see such a repo on trending, our minds are blurring like Gollum's.

There is a way you can help to solve this disappointment which I believe is experienced by many people who want to know more about your valuable work and appreciate it.

What we want:

Please add English translation of your README so you are sharing your work and knowledge with more people.

How this will help you:

More feedback to fix and improve your project.
New ideas about your project.
Greater fame.

“Sharing knowledge is the most fundamental act of friendship. Because it is a way you can give something without loosing something.”

— Richard Stallman

Thank you!

This issue created by us/english-please script. Please report on any error. Thank you!

Phrase Matcher fails on custom tokens

currently, my functionality is depends on Phrase Matcher, I create custom Phrase Matcher and add my custom tokens
self.matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
text = 'thermoplastic'
patterns = [nlp(text.lower())]
self.matcher.add(matcher_object['type'], None, *patterns)

it works when I try to find word like 'thermoplastic' 'thermoplastics' but when I try with multiple words
'islamid thermoplastics' it failes.
any clue what I am doing wrong.

Chapter 4, section 3-4: Training data labelling misses bare "iPhone"

Exercises 04_03 and 04_04 both miss the "iPhone" entity in the sentence "Your iPhone goes up to 11 today". I first thought this was on purpose, to teach us something about inspecting the labels.
However, seeing section 7's gadgets file, it looks like the "iPhone" is intended to be included 😄

test_02_09 looks for the small, not the medium, model

The exercise in chapter 2, section 9 where we want to inspect the "bananas" word vector has an imperfect test. It checks for the presence of "spacy.load('en_core_web_sm')", and not the model we've learned to use, namely en_core_web_md.

The file is here.

Very easy to read test module, though, I like it!

Just a side-thought: Could it be interesting to use Black to normalize the solution code formatting before checking it? I see this test checks against the two main string quotes " and ': if that is a recurring thing, Black might help by forcing alllll the code to adhere to a certain, deterministic formatting choice. On the other hand, applying it just might increase complexity, and may reduce the readability of the tests 🤷‍♂️

Trouble running code for Exercise 5 in Chapter 4

Hi! I recently took this course on DataCamp and did manage to run the code in the exercise. For future references, I copied the exercise and the answer to my own notebook using Jupyter. However, when I tried to run the code, I encountered an error.

This is the original code:

import random

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        
        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

And this is the error I got:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-aa733910c3ca> in <module>
     18 
     19         # Update the model
---> 20         nlp.update(texts, annotations, losses=losses)
     21         #example = Example.from_dict(nlp.make_doc(texts), annotations)
     22         #nlp.update([example])

~\anaconda3\lib\site-packages\spacy\language.py in update(self, examples, _, drop, sgd, losses, component_cfg, exclude)
   1086         """
   1087         if _ is not None:
-> 1088             raise ValueError(Errors.E989)
   1089         if losses is None:
   1090             losses = {}

ValueError: [E989] `nlp.update()` was called with two positional arguments. This may be due to a backwards-incompatible change to the format of the training data in spaCy 3.0 onwards. The 'update' function should now be called with a batch of Example objects, instead of `(text, annotation)` tuples.

I'm sorry if it looks a bit confusing, but this error has been bugging me for weeks now, and I tried to find the solution through stackoverflow and applied them here, but nothing seems to work out. Really appreciate it if the author herself or anyone here can explain what happened.

Thank you.

How to remove a component from nlp pipeline? or should I create(maybe load) nlp object with same statistical model for every different pipeline?

I am a newbie for spacy...

Issue with matcher in Excercise 1.12.2

The following line of code and the instructions for this exercise may be wrong. As seen in the attached screenshot the target words are tagged as "INTJ"

https://github.com/ines/spacy-course/blob/90feb4983573db71163ffda56d8fea78a02d747f/exercises/es/solution_01_12_02.py#L15

IMO, the idea was to match proper nouns and the example could be fine, meaning the issue is on the Spacy POS tagger. I may be missing something, so certainly, someone can determine if the POS tagger needs to be fixed or the example updated.

Regards
-Gon

Chapter 1, section 5: Add instructions on installing `en_core_web_sm`

Spun off from #4 (comment)

In chapter 1, section 5, slide 3, the following code block appears for the first time:

import spacy

nlp = spacy.load('en_core_web_sm')

I personally like to jot down code along with the slides, so this is where I got "challenged", as follows (details at the bottom of the comment):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[virtualenv dir]/lib/python3.7/site-packages/spacy/__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "[virtualenv dir]/lib/python3.7/site-packages/spacy/util.py", line 136, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Luckily, the answer is in the spacy docs, in the INSTALLATION box on the right: $ python -m spacy download en_core_web_sm

After doing so, and starting a new python session:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>>

🎉 ☺️

My suggested solution would be to add the installation instructions in the slides, or no later than the interactive code in section 7 Loading models. Otherwise, people who also run spacy locally will have to look it up themselves... which is not a big deal, just a tiny bump in the road :-)

virtualenv setup

$ python3.7 -m virtualenv tmp-venv
$ source tmp-venv/bin/activate
$ pip install spacy

pip freeze output

$ pip freeze
blis==0.2.4
certifi==2019.3.9
chardet==3.0.4
cymem==2.0.2
idna==2.8
jsonschema==2.6.0
murmurhash==1.0.2
numpy==1.16.2
plac==0.9.6
preshed==2.0.1
requests==2.21.0
spacy==2.1.3
srsly==0.0.5
thinc==7.0.4
tqdm==4.31.1
urllib3==1.24.2
wasabi==0.2.1

python version

$ python --version
Python 3.7.1

French translation?

Hi, is there a French translation for the course in progress? I looked through issues and PRs but might have missed something. If not, I'm happy to start working on this. Thanks for the great course!

Add utility to run exercise tests on premade exercise solutions

Issues #7 and #9 both pertain to broken exercise tests.

In this issue, I'd like to suggest that we automate away the boring part of checking that the exercise solutions pass their tests.

I haven't yet grokked how the student solutions are put together with the exercise tests, but it looks like the test has access to both the student code as a string, and to the variable defined within. Perhaps we could leverage that pre-made combination, just substituting the student solution with the instructor solution files?

This is, of course, only if you think it'll be worth the effort. Luckily, the course is made so the student can just move past any broken tests 😄

[Typo] Chapter 2, section 7: PRONOUN before VERB

A minor typo in part 2 of the section 7 questions.

If a proper noun after a verb is found, print the token.text.

should be, given the original question:

If a proper noun before a verb is found, print the token.text.

Repeated failed connections trying to access course

Hi, I've been doing chapters 1 and 2 of the course today and have been having repeated errors where the code cells either fail in their connection (giving an error message of: "Connecting failed. Please reload and try again."), don't respond to anything.

Upon closer inspection using the web developer tools the error is "CORS header 'Access-Control-Allow-Origin' missing".

Attached is a screenshot and the associated HAR trace.
course.spacy.io_Archive [21-11-01 16-01-31].har.zip

Trouble setting up a working JS environment for the course

First of all thanks for sharing this lovely course.

I was going to play around with the course today but I am having trouble setting up the JS requirements on my Mac (node 10, node 12, node 14) and an Ubuntu (node 12) docker image, I am a JS novice so that could be the problem.

I went down a rabbit hole trying to get sharp installed which led me to try and bump sharp up from [email protected] based on gatsbyjs/gatsby#13781
Next I played npm whack-a-mole with trying making sure all dependencies' sharp version gets bumped which led to other dependencies needing to be changed.

Is there a way to resolve all these dependencies in the JS world? Python packaging seems like heaven compared to what I had to go through.

I have managed to create a working Dockerfile with all the updated packages where I can run gatsby in dev mode and run the python tests - I will create a PR to merge these.

--- BELOW is my debugging process as a JS novice ---

On both my Mac and the Ubuntu docker image following:

npm install -g gatsby-cli  # Install Gatsby globally
npm install                # Install dependencies

During npm install installing [email protected] fails with an error like this:

> [email protected] install /host/node_modules/sharp
> (node install/libvips && node install/dll-copy && prebuild-install) || (node-gyp rebuild && node install/dll-copy)

info sharp Downloading https://github.com/lovell/sharp-libvips/releases/download/v8.7.0/libvips-8.7.0-linux-x64.tar.gz
prebuild-install WARN install No prebuilt binaries found (target=12.16.3 runtime=node arch=x64 libc= platform=linux)
make: Entering directory '/host/node_modules/sharp/build'
  TOUCH Release/obj.target/libvips-cpp.stamp
  CXX(target) Release/obj.target/sharp/src/common.o
In file included from ../src/common.cc:27:0:
../src/common.h:78:20: error: 'Handle' is not a member of 'v8'
   bool HasAttr(v8::Handle<v8::Object> obj, std::string attr);
                    ^~~~~~
../src/common.h:78:37: error: expected primary-expression before '>' token
   bool HasAttr(v8::Handle<v8::Object> obj, std::string attr);
.
.
.

Then I used to hunt down dependencies using sharp and what version are coming through:

npm ls | grep -C 10 --color sharp

Then using npm-remote-ls looked for what the updated sharp version would be for example with gatsby-plugin-manifest:

npm-remote-ls [email protected] | grep sharp

Tried to harmonise the version bumps to use one version of sharp as multiple sharp versions seem to potentially cause challenges.

Finally I have tried to resolve all conflicts from the previous bumps. As a complete JS Novice I am not sure if my approach is sensible I will make a PR where you can review

JS code insdie <script> tags is not exectued

Originally reported in my comment here ines/course-starter-python#4 (comment), but I realized that this is a more general issue that also affects the spacy course, not just the Python course starter. It seems that js code included in slides via <script> tags is not executed correctly in the spacy course framework. I tried in the latest version of this GitHub repo by modifying the first slideshow (chapter1_01_introduction-to-spacy.md) to include the following just under the bullet points:

<script>console.log('This does not work')</script>

Nothing is written to the console and no errors are raised either. This is not a problem with reveal.js because downloading their example slideshow and including the same code in the index.html works just fine (more elaborate examples such as vegalite plots also work fine in reveal.js).

Curiously, it seems that javascript is enabled and working fine in the spacy course slides, because <noscript> tags do not execute either and if I include the same js as part of a button, the console does correctly show log messages when this button is pressed:

<button type="button" onclick="console.log('This works')">Click me</button>

However, if the button references as function that has been defined inside <script> tags, the console throws an error that the function is not defined.

<script>
function myFunction() {
  console.log('This does not work either')
}
</script>

<button type="button" onclick="myFunction()">Click me too</button>

I thought this meant that the <script> tags were filtered out from the source file at some point, but if I highlight the text in a slide and click "View source for highlighted", I can see that they are still there, so there must be something wrong elsewhere such as in their execution.

I tried grepping this git repo for anything related to these tags or the slides, but I didn't see anything that seemed to indicate that script js code execution was disabled. I am not familiar with js so I would very much appreciate how to troubleshoot this further. Eventually, I want to use this for displaying Vegalite plots and the old workaroudn for this did not work for me, probably because vegalite has changed how the plots are displayed and the jupyter plugin has been deprecated as it is no longer needed.

Chapter 4.4 evaluates correct with no entities

I put doc.ents instead of entities but it was still marked correct

Chapter 3.3 Inspecting Pipelines

I'm running spacy version 2.1.4

Running this code:

import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

provides this error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-56-c8e8c6cbcb4d> in <module>
      2 
      3 # Load the en_core_web_sm model
----> 4 nlp = spacy.load("en_core_web_sm")
      5 
      6 # Print the names of the pipeline components

/usr/local/lib/python3.6/dist-packages/spacy/__init__.py in load(name, **overrides)
     25     if depr_path not in (True, False, None):
     26         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 27     return util.load_model(name, **overrides)
     28 
     29 

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model(name, **overrides)
    129             return load_model_from_link(name, **overrides)
    130         if is_package(name):  # installed as package
--> 131             return load_model_from_package(name, **overrides)
    132         if Path(name).exists():  # path to model data directory
    133             return load_model_from_path(Path(name), **overrides)

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_package(name, **overrides)
    150     """Load a model from an installed package."""
    151     cls = importlib.import_module(name)
--> 152     return cls.load(**overrides)
    153 
    154 

/usr/local/lib/python3.6/dist-packages/en_core_web_sm/__init__.py in load(**overrides)
     10 
     11 def load(**overrides):
---> 12     return load_model_from_init_py(__file__, **overrides)

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_init_py(init_file, **overrides)
    188     if not model_path.exists():
    189         raise IOError(Errors.E052.format(path=path2str(data_path)))
--> 190     return load_model_from_path(data_path, meta, **overrides)
    191 
    192 

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides)
    171             component = nlp.create_pipe(name, config=config)
    172             nlp.add_pipe(component, name=name)
--> 173     return nlp.from_disk(model_path)
    174 
    175 

/usr/local/lib/python3.6/dist-packages/spacy/language.py in from_disk(self, path, exclude, disable)
    789             # Convert to list here in case exclude is (default) tuple
    790             exclude = list(exclude) + ["vocab"]
--> 791         util.from_disk(path, deserializers, exclude)
    792         self._path = path
    793         return self

/usr/local/lib/python3.6/dist-packages/spacy/util.py in from_disk(path, readers, exclude)
    628         # Split to support file names like meta.json
    629         if key.split(".")[0] not in exclude:
--> 630             reader(path / key)
    631     return path
    632 

/usr/local/lib/python3.6/dist-packages/spacy/language.py in <lambda>(p)
    779         deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
    780         deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
--> 781         deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
    782         for name, proc in self.pipeline:
    783             if name in exclude:

tokenizer.pyx in spacy.tokenizer.Tokenizer.from_disk()

tokenizer.pyx in spacy.tokenizer.Tokenizer.from_bytes()

/usr/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

/usr/lib/python3.6/re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

/usr/lib/python3.6/sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

/usr/lib/python3.6/sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

/usr/lib/python3.6/sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

/usr/lib/python3.6/sre_parse.py in _parse(source, state, verbose, nested, first)
    525                     break
    526                 elif this[0] == "\\":
--> 527                     code1 = _class_escape(source, this)
    528                 else:
    529                     code1 = LITERAL, _ord(this)

/usr/lib/python3.6/sre_parse.py in _class_escape(source, escape)
    334         if len(escape) == 2:
    335             if c in ASCIILETTERS:
--> 336                 raise source.error('bad escape %s' % escape, len(escape))
    337             return LITERAL, ord(escape[1])
    338     except ValueError:

error: bad escape \p at position 257

Building a training loop - update example from spacy101 v2.x tutorial

Hello, can you please indicate how to change the code below to align with Spacy's current v3.2.x?

Thanks for your support,

Achilleas

import spacy
import random
import json
   
with open("exercises/en/gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.add_pipe('ner')
ner.add_label("GADGET")

nlp.vocab.vectors.name = 'example'

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Batch the examples and iterate over them
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
        print("{0:.10f}".format(losses['ner']) )

error in builidng docker

Hi there,
I just want to build docker and I got this error for installing gatsby :

npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">=18.0.0"} (current: {"node":"12.22.12","npm":"6.14.16"})
npm WARN notsup Not compatible with your version of node/npm: [email protected]
npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">=14.14"} (current: {"node":"12.22.12","npm":"6.14.16"})
npm WARN notsup Not compatible with your version of node/npm: [email protected]
npm WARN notsup Unsupported engine for [email protected]: wanted: {"node":">=18.0.0"} (current: {"node":"12.22.12","npm":"6.14.16"})
npm WARN notsup Not compatible with your version of node/npm: [email protected]

npm ERR! Unexpected end of JSON input while parsing near '...ZqkK80Tm1Uq1pOOfPE3cW'

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2023-06-12T13_51_02_133Z-debug.log

amendments suggestion

This is a draft, and I will continue to update it as needed.

"ID" may be a little bit confusing.

https://github.com/ines/spacy-course/blame/master/chapters/en/slides/chapter1_02_statistical-models.md#L104

not method, but function?

https://github.com/ines/spacy-course/blame/master/chapters/en/slides/chapter1_02_statistical-models.md#L192

add backquotes to words and spaces? (inconsistent to L113)

https://github.com/ines/spacy-course/blame/master/chapters/en/chapter2.md#L127

not "shared nlp object", but "shared nlp.vocab object"?

https://github.com/ines/spacy-course/blame/master/chapters/en/chapter2.md#L155

check if index < len(doc) - 1?

https://github.com/ines/spacy-course/blame/master/chapters/en/chapter2.md#L196

not one or more, but zero or more

https://github.com/ines/spacy-course/blame/master/chapters/en/slides/chapter2_02_data-structures-2.md#L51

what this "easier" means? In terms of the computational cost?

https://github.com/ines/spacy-course/blame/master/chapters/en/slides/chapter4_02_training-loop.md#L32

Wrong word segmentation in chinese course

The sentence is just splited by character.

# 导入spacy并创建中文nlp对象
import spacy

nlp = spacy.blank("zh")

# 处理文本
doc = nlp("我喜欢老虎和狮子。")

# 遍历打印doc中的内容
for i, token in enumerate(doc):
    print(i, token.text)

# 截取Doc中"老虎"的部分
laohu = doc[2:3]
print(laohu.text)

# 截取Doc中"老虎和狮子"的部分(不包括"。")
laohu_he_shizi = doc[2:5]
print(laohu_he_shizi.text)

Output

0 我
1 喜
2 欢
3 老
4 虎
5 和
6 狮
7 子
8 。
欢
欢老虎

Reproducing Chapter 1:10 slide code matches on the whole sentence

When I run the following code in spaCy 2.0.12:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

I get:

New
iPhone
X
release
date
leaked
.

I somehow match on every individual word of doc instead of just 'iPhone X'. Not sure why it is.

Chapter 1, section 10-11: "ORTH" vs "TEXT" in pattern matching

Hi Ines!
This is 😍
Great UI and content. Really great work!

So far (Chapter 1, section 11), I've only been confused twice: Once when I had to install the en_core_web_sm myself (I don't mind, though, it was easy to find out how), and now in section 11.
In section 10, we learned to use the pattern key ORTH to do exact text matching, but section 11 expects the newer v2.1 TEXT key (nice docs by the way).

I think the two sections should be aligned, or we/you should tell students that they can use either. Having not used spacy before, I strongly prefer the new TEXT key (what is ORTH even?) 😁

My initial investigation used the very neat spacy.explain feature, which does not yet know either word - I'm not sure if it is meant to also explain the pattern matching keys.

At any rate, thank you for your very nice and accessible work!

Why does chapter 2 section 7 solution use iteration instead of matching?

Could someone please explain why does the section on Data structures best practices use iteration instead of matching for finding proper noun before a verb? Is it simply to make the pairing between naive solution and the recommended one more direct or is there additional rationale on when to use iteration instead of the excellent matching functionality provided by SpaCy?

Here's the relevant solution code snippet provided:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

And here's an example of how I would construct a matcher to extract proper noun followed by a verb:

doc = nlp("Berlin is a nice city")
matcher = Matcher(nlp.vocab)
matcher.add("Proper nouns", None, [{"POS": "PROPN"}, {"POS":"VERB"}])

matches = matcher(doc)
for match in matches:
    print("Found proper noun before a verb:", doc[match[1]])

Chapter 2 § 4 prose disagrees with code

In the section entitled "The Span object (2)", the prose states:

To add an entity label to the span, we first need to look up the string in the string store. We can then provide it to the span as the label argument.

However the code just passes in the string directly:

span_with_label = Span(doc, 0, 2, label="GREETING")

It seems like the code assumes spaCy 2.1 which added support for passing in a Unicode string but the prose was written before 2.1 was released?

videos

can we embed youtube or video links or playable audio in the slides?

Missing implementation to output matches from the matcher at chapter1_03_rule-based-matching.md

Phenomenon

There is the output of the matches from the matcher at chapter1_03_rule-based-matching.md Matching lexical attributes, but the implementation to get it is missing.

doc = nlp("2018 FIFA World Cup: France won!")

(The implementation is missing to output the following 2018 FIFA World Cup: from doc.

2018 FIFA World Cup:

Proposal

Add the implementation to get the output. The following code is one of the candidate.

matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

But a little verbose. We can avoid the repetition of the code by function.

def print_matches(doc, pattern);
    matcher = Matcher(nlp.vocab)
    matcher.add("PATTERN", None, pattern)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        print(matched_span.text)

Then, we can write as following.

doc = nlp("2018 FIFA World Cup: France won!")
print_matches(doc, pattern)

Chapter 2, section 11 --> example is wrong

# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

The second pattern is not added to the matcher.

Use Japanese model in Japanese translation

Since spacy has released Japanese model in v2.3, it would be good to use the Japanese model in Japanese version of this course. (Now en model is used)
One of the contributors of the Japanese model @hiroshi-matsuda-rit agrees with this idea.
What do you think, @ines?