surkal / wiktionnaireparser

A library for parsing the French Wiktionary

License: GNU General Public License v3.0

Makefile 0.25% Python 99.75%
francais french python python3 wiktionary wiktionary-parser

wiktionnaireparser's Issues

Error with section "Erreur*"

58 word pages contain a section named Erreur*.

Because of the asterisk in the name, the pyquery selector lookup crashes, so this section must be removed.

Here's the error:

Traceback (most recent call last):

  File "D:\dev\Python\Python387\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-1-0f4a11aaa388>", line 3, in <module>
    page.get_word_data["partOfSpeech"]

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 77, in get_word_data
    'partOfSpeech': self.get_parts_of_speech(),

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 140, in get_parts_of_speech
    nice_section_name = self._real_section_name(section_name)

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 128, in _real_section_name
    section = self._query.find(section_name)

  File "D:\dev\Python\Python387\lib\site-packages\pyquery\pyquery.py", line 677, in find
    xpath = self._css_to_xpath(selector)

  File "D:\dev\Python\Python387\lib\site-packages\pyquery\pyquery.py", line 282, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 436, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 544, in parse_simple_selector
    raise SelectorSyntaxError(

  File "<string>", line unknown
SelectorSyntaxError: Expected selector, got <DELIM '*' at 7>

Here are some affected words:

  • malette
  • pillier
  • léthal
  • trippe
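
One possible workaround, sketched here as an assumption rather than as the fix adopted in the library: the traceback reports the `*` at position 7, which suggests the selector passed to `find` is `#Erreur*`. Rewriting the fragment as a quoted attribute selector keeps the asterisk inside a string, where the CSS parser should not treat it as a delimiter. The helper name `id_selector` is hypothetical.

```python
def id_selector(fragment):
    """Turn a '#id' fragment into an [id="..."] attribute selector.

    Assumption: section names reach the parser as '#' followed by the
    element id, e.g. '#Erreur*'. Quoting the id avoids the bare '*'.
    """
    ident = fragment[1:] if fragment.startswith('#') else fragment
    # Escape backslashes and double quotes so the quoted string stays valid.
    escaped = ident.replace('\\', '\\\\').replace('"', '\\"')
    return f'[id="{escaped}"]'


selector = id_selector('#Erreur*')
print(selector)  # [id="Erreur*"]
```

The resulting string could then be passed to `self._query.find(...)` in place of the raw section name, assuming the installed cssselect accepts quoted attribute values.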

Error with related words

Hi,

I've been scraping the French Wiktionary and I've found an issue with WiktionnaireParser: for some words (6,326 out of 1,874,000), "get_word_data" crashes and returns no word data.

After some investigation, the error comes from the cssselect module (even though I don't know why it happens on these specific words), and I've hotfixed the code with a try/except around "for p_ in p:" (line 146 of the parser).

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
     51 ERROR_PAGE = df_all_words[df_all_words["status"] == "ERROR_PAGE"]
     52 page = wiktp.from_source("lithotypographier")
---> 53 page.get_word_data

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
     57             'title'       : self.get_title(),
     58             'etymologies' : self.get_etymology(),
---> 59             'partOfSpeech': self.get_parts_of_speech(),
     60         }
     61 

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    144         ]
    145         try:
--> 146             for p_ in p:
    147                 related = self.get_related_words(p_)
    148                 if related:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_related_words(self, related_word)
    302 
    303             section = section.getparent().getnext()
--> 304             if 'Notes' in value:
    305                 related = self.get_notes(section)
    306             else:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\utils.py in extract_related_words(section)
     47     url = '/wiki/'
     48     while section.tag != 'h3' and section.tag != 'h4':
---> 49         for link in section.cssselect('a'):
     50             if 'Annexe:' in link.attrib.get('href'):
     51                 continue

src\lxml\etree.pyx in lxml.etree._Element.cssselect()

src\lxml\xpath.pxi in lxml.etree.XPath.__call__()

src\lxml\apihelpers.pxi in lxml.etree._rootNodeOrRaise()

ValueError: Input object is not an XML element: HtmlComment

And here are some words that fail:

  • à croupeton
  • lithotypographier
  • piloris
  • pied au plancher
  • clochepied
  • cloîtres

Thank you anyway for your work, it's awesome :)
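
A narrower alternative to the try/except, offered as a sketch: the `ValueError` says the node being queried is an `HtmlComment`, so the sibling walk could skip comment nodes instead. Comments carry a non-string `tag` in both lxml and the standard library's ElementTree, so the example below uses ElementTree to stay self-contained. `is_real_element` and `collect_links` are hypothetical names, not part of the library.

```python
import xml.etree.ElementTree as ET


def is_real_element(node):
    # Comments and processing instructions have a callable as .tag,
    # while ordinary elements have a string tag name.
    return isinstance(node.tag, str)


def collect_links(nodes):
    """Collect href values from <a> descendants, skipping comment nodes."""
    hrefs = []
    for node in nodes:
        if not is_real_element(node):
            continue  # e.g. an HTML comment sitting between two sections
        for link in node.iter('a'):
            href = link.attrib.get('href')
            # Also guard against <a> tags that have no href at all.
            if href is None or 'Annexe:' in href:
                continue
            hrefs.append(href)
    return hrefs


# Build a small tree by hand: a list, a comment, then a paragraph.
root = ET.Element('div')
li = ET.SubElement(ET.SubElement(root, 'ul'), 'li')
ET.SubElement(li, 'a', href='/wiki/piloris')
root.append(ET.Comment(' parser cache note '))
p = ET.SubElement(root, 'p')
ET.SubElement(p, 'a', href='/wiki/Annexe:Glossaire')
ET.SubElement(p, 'a')  # anchor without href

hrefs = collect_links(list(root))
print(hrefs)  # ['/wiki/piloris']
```

In `extract_related_words`, the same `isinstance(section.tag, str)` test could guard the `section.cssselect('a')` call, assuming the loop advances to the next sibling either way.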

Incorrect etymology

The etymology of malette should be an empty string, but the parser returns the "missing etymology" placeholder text:

{'title': 'malette',
 'etymologies': '(Date à préciser) Étymologie manquante ou incomplète. Si vous la connaissez, vous pouvez l’ajouter en cliquant ici.',
 'partOfSpeech': {}}
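
A minimal sketch of how the placeholder could be filtered out, assuming the marker text shown in the output above is stable across pages. `clean_etymology` and `MISSING_ETYMOLOGY_MARKER` are hypothetical names, not part of the library.

```python
# Marker copied from the placeholder text in the output above.
MISSING_ETYMOLOGY_MARKER = 'Étymologie manquante ou incomplète'


def clean_etymology(text):
    """Return '' when the etymology is absent or only the placeholder."""
    if text is None or MISSING_ETYMOLOGY_MARKER in text:
        return ''
    return text.strip()


placeholder = (
    '(Date à préciser) Étymologie manquante ou incomplète. '
    'Si vous la connaissez, vous pouvez l’ajouter en cliquant ici.'
)
print(repr(clean_etymology(placeholder)))  # ''
```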

Error with definitions

Hello,

For 373 words, the parser crashes in the "get_definitions" method: line 235 reads "while text.tag != 'ol'", but for these words text becomes None inside the loop.

The traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
      113 page = wiktp.from_source("longouse")
----> 114 page = page.get_word_data["partOfSpeech"]

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
     57             'title'       : self.get_title(),
     58             'etymologies' : self.get_etymology(),
---> 59             'partOfSpeech': self.get_parts_of_speech(),
     60         }
     61 

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    121         for section_name in sections:
    122             nice_section_name = self._real_section_name(section_name)
--> 123             parts_of_speech[nice_section_name] = self.get_definitions(section_name)
    124             # Translations ?
    125             if self._language == 'Français':

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_definitions(self, part_of_speech)
    233         text = self._query.find(part_of_speech)[0]
    234         text = text.getparent()
--> 235         while text.tag != 'ol':
    236             # ligne de forme
    237             if text.tag == 'p' or text.tag == 'span':

AttributeError: 'NoneType' object has no attribute 'tag'

Proposed fix:

while text is not None and text.tag != 'ol':
...
if text is not None:
  for i, definition_bloc in enumerate(text.getchildren()):
...

Examples of words that fail:

  • longouse
  • lundi
  • maçonnes
  • légèrement
  • pauvrement
  • passaient
  • octaveur
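
The guard proposed above can be isolated into a small helper. This is only a sketch: it assumes the loop advances with `getnext()`, which the traceback does not show, and it uses a hypothetical stand-in `Node` class so the example runs without lxml.

```python
class Node:
    """Hypothetical stand-in for an lxml element: a tag and a next sibling."""

    def __init__(self, tag, next_node=None):
        self.tag = tag
        self._next = next_node

    def getnext(self):
        return self._next


def next_sibling_with_tag(node, tag):
    """Walk forward through siblings; return the first match or None."""
    while node is not None and node.tag != tag:
        node = node.getnext()
    return node


# A section that has a definition list: p -> span -> ol
with_list = Node('p', Node('span', Node('ol')))
# A section without one: p -> span, then nothing
without_list = Node('p', Node('span'))

found = next_sibling_with_tag(with_list, 'ol')
missing = next_sibling_with_tag(without_list, 'ol')
print(found.tag, missing)  # ol None
```

The caller then checks for None before iterating over the children, as in the proposed fix.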

Error with text_content

Hi,

Last bug, I guess: for words like "lundi" and "pauvrement", the module crashes with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\workspaces\WiktionnaireParser\tmp.py in <module>
      3 page = wiktp.from_source("pauvrement")
----> 4 page = page.get_word_data["partOfSpeech"]
      5 page

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_word_data(self)
     75             'title': self.get_title(),
     76             'etymologies': self.get_etymology(),
---> 77             'partOfSpeech': self.get_parts_of_speech(),
     78         }
     79 

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    150                 if not re.match(r'#Traductions', value):
    151                     continue
--> 152                 translation = self.get_translations(value)
    153                 parts_of_speech[nice_section_name]['translations'] = translation
    154 

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_translations(self, translation_id)
    275 
    276         for line in lines:
--> 277             language = line.find('span').text_content()
    278             transl = []
    279             links = line.find('a')

AttributeError: 'NoneType' object has no attribute 'text_content'
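
The error means `line.find('span')` returned None, i.e. some translation lines have no `span` child. A possible guard, sketched with the standard library's ElementTree so it runs on its own (with lxml.html, `text_content()` would replace the `itertext()` join); `language_of` is a hypothetical helper and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET


def language_of(line):
    """Return the language label of a translation line, or None."""
    span = line.find('span')
    if span is None:
        return None  # line without a language span: let the caller skip it
    return ''.join(span.itertext()).strip()


# Illustrative markup only: one regular line and one without a span.
lines = ET.fromstring(
    '<ul>'
    '<li><span>Anglais</span> : <a href="/wiki/poorly">poorly</a></li>'
    '<li><a href="/wiki/x">x</a></li>'
    '</ul>'
)

languages = [language_of(li) for li in lines]
print(languages)  # ['Anglais', None]
```

In `get_translations`, the loop could `continue` whenever the helper returns None.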
