vxern / wiktionary-scraper

🇬🇧 An extensible, robust and lightweight (45 kB) Wiktionary.org scraper to fetch detailed information about words in various languages.

Home Page: https://npmjs.com/package/wiktionary-scraper

License: MIT License

TypeScript 100.00%
definitions dictionary english etymology javascript language parser scraper typescript wiktionary

wiktionary-scraper's Introduction

A lightweight scraper to fetch information about words in various languages from Wiktionary.

Usage

To start using the scraper, first install it using the following command:

npm install wiktionary-scraper

The simplest way of using the scraper is as follows:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("word");

You can change the language of the target word by setting the lemmaLanguage:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("o", {
  lemmaLanguage: "Romanian",
});
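If the same spelling exists in several languages, one lookup per language can be run side by side. The following is a minimal sketch and not part of the library: `lookupInLanguages` and the `Lookup` type are hypothetical names, and the lookup function is passed in as a parameter so that the only thing assumed about it is the call shape shown above.

```typescript
// Hypothetical shape of the lookup function: a word plus an options object
// carrying lemmaLanguage, as in the example above. Nothing else is assumed.
type Lookup<T> = (word: string, options: { lemmaLanguage: string }) => Promise<T>;

// Runs one lookup per language concurrently and collects the results
// in a Map keyed by language name.
async function lookupInLanguages<T>(
  get: Lookup<T>,
  word: string,
  languages: string[],
): Promise<Map<string, T>> {
  const entries = await Promise.all(
    languages.map(async (language): Promise<[string, T]> => [
      language,
      await get(word, { lemmaLanguage: language }),
    ]),
  );
  return new Map(entries);
}
```

With the library imported as above, a call might look like `await lookupInLanguages(Wiktionary.get, "o", ["Romanian", "Portuguese"])`, provided the library's option types accept plain strings for lemmaLanguage.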

You can specify if redirects should be followed by setting followRedirects to true:

import * as Wiktionary from "wiktionary-scraper";

// Redirects to and returns results for "Germany".
const results = await Wiktionary.get("germany", {
  followRedirects: true,
});
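To find out whether a lookup only succeeded because of a redirect, one option is to try the exact title first and retry with followRedirects afterwards. This is a rough sketch under stated assumptions: `getWithRedirectFallback` is a hypothetical helper, and because this README does not say what get() returns for a missing entry, the caller supplies the `isEmpty` check.

```typescript
// Hypothetical shape of the lookup function; only followRedirects is assumed.
type Get<T> = (word: string, options: { followRedirects: boolean }) => Promise<T>;

// Tries the exact title first. If the caller-supplied isEmpty() check says
// nothing was found, repeats the lookup with redirects enabled.
async function getWithRedirectFallback<T>(
  get: Get<T>,
  word: string,
  isEmpty: (result: T) => boolean,
): Promise<{ result: T; usedRedirects: boolean }> {
  const direct = await get(word, { followRedirects: false });
  if (!isEmpty(direct)) {
    return { result: direct, usedRedirects: false };
  }
  // The second attempt may still come back empty; usedRedirects only records
  // that the redirect-following lookup was the one returned.
  const followed = await get(word, { followRedirects: true });
  return { result: followed, usedRedirects: true };
}
```

This costs a second request whenever the first lookup comes back empty, so it is only worth doing when the distinction matters to the caller.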

By default, requests are sent with a User-Agent header that identifies wiktionary-scraper.

To remove the header, set userAgent to undefined.

To send your own value instead, specify userAgent:

import * as Wiktionary from "wiktionary-scraper";

const results = await Wiktionary.get("word", {
  userAgent: "Your App (https://example.com)",
});
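A descriptive User-Agent usually names the application and gives a way to contact its operator, which is also what Wikimedia asks of automated clients. The helper below is a small sketch of how such a string could be assembled; `buildUserAgent` and its parameter names are hypothetical and not part of the library.

```typescript
// Hypothetical helper: composes a User-Agent string of the form
// "Name/version (contact)", or "Name (contact)" when no version is given.
function buildUserAgent(app: { name: string; version?: string; contact: string }): string {
  // Collapse line breaks so the value cannot span multiple header lines.
  const clean = (value: string): string => value.replace(/[\r\n]+/g, " ").trim();
  const name = clean(app.name);
  const product = app.version ? `${name}/${clean(app.version)}` : name;
  return `${product} (${clean(app.contact)})`;
}
```

The returned string can then be passed as the userAgent option, for example `buildUserAgent({ name: "Your App", contact: "https://example.com" })`.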

You can also parse the HTML of a Wiktionary page directly, bypassing the fetch step.

ℹ️ Note that, unlike get(), parse() is synchronous:

import * as Wiktionary from "wiktionary-scraper";

const results = Wiktionary.parse(html);
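One reason to call parse() yourself is to control how the page is downloaded, for instance to add caching. The sketch below shows one way to wire that up. `entryUrl` and `fetchAndParse` are hypothetical helpers, the downloader and parser are passed in as parameters, and the URL pattern is the public English Wiktionary one, which is an assumption about the page to download rather than a statement about what get() requests internally.

```typescript
// Builds the URL of an English Wiktionary entry page for the given word.
function entryUrl(word: string): string {
  // MediaWiki page titles use underscores in place of spaces.
  const title = word.trim().replace(/ /g, "_");
  return `https://en.wiktionary.org/wiki/${encodeURIComponent(title)}`;
}

// Downloads the entry page with the supplied fetchText function and hands
// the HTML to the supplied parse function (for example Wiktionary.parse).
async function fetchAndParse<T>(
  word: string,
  fetchText: (url: string) => Promise<string>,
  parse: (html: string) => T,
): Promise<T> {
  const html = await fetchText(entryUrl(word));
  return parse(html);
}
```

In a runtime with a global fetch, the downloader could be as small as `(url) => fetch(url).then((response) => response.text())`.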

Completeness

This library currently only supports the English version of Wiktionary.

Features

  • Parses both single- and multiple-etymology entries.
  • Recognises standard, non-standard and some explicitly disallowed parts of speech, as defined here. In total, there are 60+ recognised parts of speech, which should cover the vast majority of definitions.
    • Note, however, that the library may still fail to recognise certain niche, non-standard parts of speech. Should you come across any, please open an issue.

Section support

  • Description
  • Glyph origin
  • Etymology
  • Pronunciation
  • Production
  • Definitions
  • Usage notes
  • Reconstruction notes
  • Inflection sections:
    • Inflection
    • Conjugation
    • Declension
  • Mutation
  • Quotations
  • Alternative forms
  • Alternative reconstructions
  • Relations:
    • Synonyms
    • Antonyms
    • Hypernyms
    • Hyponyms
    • Meronyms
    • Holonyms
    • Comeronyms
    • Troponyms
    • Parasynonyms
    • Coordinate terms
    • Derived terms
    • Related terms
  • Translations
  • Trivia
  • See also
  • References
  • Further reading
  • Anagrams
  • Examples

Recognised parts of speech

Parts of speech
  • Adjective
  • Adverb
  • Ambiposition
  • Article
  • Circumposition
  • Classifier
  • Conjunction
  • Contraction
  • Counter
  • Determiner
  • Ideophone
  • Interjection
  • Noun
  • Numeral
  • Participle
  • Particle
  • Postposition
  • Preposition
  • Pronoun
  • Proper noun
  • Verb
Morphemes
  • Circumfix
  • Combining form
  • Infix
  • Interfix
  • Prefix
  • Root
  • Suffix
Symbols
  • Diacritical mark
  • Letter
  • Ligature
  • Number
  • Punctuation mark
  • Syllable
  • Symbol
Phrases
  • Phrase
  • Proverb
  • Prepositional phrase
Han characters and language-specific varieties
  • Han character
  • Hanzi
  • Kanji
  • Hanja
Other
  • Romanization
  • Logogram
  • Determinative
Explicitly disallowed parts of speech

You know, just in case somebody didn't follow the rules on Wiktionary.

  • Abbreviation
  • Acronym
  • Initialism
  • Cardinal-number
  • Ordinal-number
  • Cardinal-numeral
  • Ordinal-numeral
  • Clitic
  • Gerund
  • Idiom
Library additions
  • Adposition
  • Affix
  • Character
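When working with headings taken from Wiktionary pages, it can be handy to know which of the groups above a heading falls into. The table below simply restates the lists from this section. `partsOfSpeech` and `categoryOf` are illustrative names and say nothing about how the library represents parts of speech internally.

```typescript
// Groupings copied from the lists above, keyed by group name.
const partsOfSpeech: Record<string, readonly string[]> = {
  "Parts of speech": [
    "Adjective", "Adverb", "Ambiposition", "Article", "Circumposition",
    "Classifier", "Conjunction", "Contraction", "Counter", "Determiner",
    "Ideophone", "Interjection", "Noun", "Numeral", "Participle", "Particle",
    "Postposition", "Preposition", "Pronoun", "Proper noun", "Verb",
  ],
  "Morphemes": ["Circumfix", "Combining form", "Infix", "Interfix", "Prefix", "Root", "Suffix"],
  "Symbols": [
    "Diacritical mark", "Letter", "Ligature", "Number", "Punctuation mark",
    "Syllable", "Symbol",
  ],
  "Phrases": ["Phrase", "Proverb", "Prepositional phrase"],
  "Han characters and language-specific varieties": ["Han character", "Hanzi", "Kanji", "Hanja"],
  "Other": ["Romanization", "Logogram", "Determinative"],
  "Explicitly disallowed parts of speech": [
    "Abbreviation", "Acronym", "Initialism", "Cardinal-number", "Ordinal-number",
    "Cardinal-numeral", "Ordinal-numeral", "Clitic", "Gerund", "Idiom",
  ],
  "Library additions": ["Adposition", "Affix", "Character"],
};

// Returns the group a heading belongs to, ignoring case and surrounding
// whitespace, or undefined if the heading is not in any list.
function categoryOf(heading: string): string | undefined {
  const needle = heading.trim().toLowerCase();
  for (const category of Object.keys(partsOfSpeech)) {
    const names = partsOfSpeech[category] ?? [];
    if (names.some((name) => name.toLowerCase() === needle)) {
      return category;
    }
  }
  return undefined;
}
```

A heading that returns undefined here is a candidate for the issue report mentioned under Features.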
