surkal / wiktionnaireparser

A library for parsing the French Wiktionary

License: GNU General Public License v3.0

Makefile 0.25% Python 99.75%

francais french python python3 wiktionary wiktionary-parser

Forkers: lrosique

wiktionnaireparser's Issues

Error with section "Erreur*"

On the pages of 58 words, there is a section named Erreur*.

Because of the asterisk in the name, the pyquery call crashes, so this section must be removed.

Here is the error:

Traceback (most recent call last):

  File "D:\dev\Python\Python387\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-1-0f4a11aaa388>", line 3, in <module>
    page.get_word_data["partOfSpeech"]

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 77, in get_word_data
    'partOfSpeech': self.get_parts_of_speech(),

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 140, in get_parts_of_speech
    nice_section_name = self._real_section_name(section_name)

  File "d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py", line 128, in _real_section_name
    section = self._query.find(section_name)

  File "D:\dev\Python\Python387\lib\site-packages\pyquery\pyquery.py", line 677, in find
    xpath = self._css_to_xpath(selector)

  File "D:\dev\Python\Python387\lib\site-packages\pyquery\pyquery.py", line 282, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 436, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)

  File "D:\dev\Python\Python387\lib\site-packages\cssselect\parser.py", line 544, in parse_simple_selector
    raise SelectorSyntaxError(

  File "<string>", line unknown
SelectorSyntaxError: Expected selector, got <DELIM '*' at 7>

Here are some affected words:

malette
pillier
léthal
trippe

Error with related words

Hi,

I've been scraping the French Wiktionary and I've found an issue with WiktionnaireParser: for some words (6,326 out of 1,874,000) the method "get_word_data" crashes and cannot return the word data.

After some investigation, the error comes from the cssselect module (even though I don't know why it happens on these specific words). I have hotfixed the code with a try/except around "for p_ in p:" (line 146 of the parser).

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
     51 ERROR_PAGE = df_all_words[df_all_words["status"] == "ERROR_PAGE"]
     52 page = wiktp.from_source("lithotypographier")
---> 53 page.get_word_data

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
     57             'title'       : self.get_title(),
     58             'etymologies' : self.get_etymology(),
---> 59             'partOfSpeech': self.get_parts_of_speech(),
     60         }
     61 

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    144         ]
    145         try:
--> 146             for p_ in p:
    147                 related = self.get_related_words(p_)
    148                 if related:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_related_words(self, related_word)
    302 
    303             section = section.getparent().getnext()
--> 304             if 'Notes' in value:
    305                 related = self.get_notes(section)
    306             else:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\utils.py in extract_related_words(section)
     47     url = '/wiki/'
     48     while section.tag != 'h3' and section.tag != 'h4':
---> 49         for link in section.cssselect('a'):
     50             if 'Annexe:' in link.attrib.get('href'):
     51                 continue

src\lxml\etree.pyx in lxml.etree._Element.cssselect()

src\lxml\xpath.pxi in lxml.etree.XPath.__call__()

src\lxml\apihelpers.pxi in lxml.etree._rootNodeOrRaise()

ValueError: Input object is not an XML element: HtmlComment

Here are some words that fail:

à croupeton
lithotypographier
piloris
pied au plancher
clochepied
cloîtres

Thank you anyway for your work, it's awesome :)
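
The traceback above ends with cssselect being called on an HtmlComment node that the sibling walk ran into. A narrower fix than a try/except could be to skip nodes that are not real elements. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without lxml. In both libraries a comment node's `tag` is not a string, so `isinstance(node.tag, str)` tells elements and comments apart. The names `is_real_element` and `collect_links` are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_real_element(node) -> bool:
    """Return True for ordinary elements, False for comment nodes."""
    return isinstance(node.tag, str)


def collect_links(parent):
    """Collect href values of <a> children, ignoring comment nodes."""
    hrefs = []
    for child in parent:
        if not is_real_element(child):
            continue  # a comment: nothing to select inside it
        if child.tag == 'a':
            hrefs.append(child.get('href'))
    return hrefs


# Keep comments in the tree so the example reproduces the situation.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(
    '<div><!-- note --><a href="/wiki/mot">mot</a></div>', parser=parser
)
links = collect_links(root)
print(links)  # ['/wiki/mot']
```

In the parser itself, the same check would go at the top of the `while` loop in `extract_related_words`, before `section.cssselect('a')` is called.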

Incorrect etymology

The etymology of malette should be an empty string, but the parser returns Wiktionary's "missing etymology" placeholder text instead:

{'title': 'malette',
 'etymologies': '(Date à préciser) Étymologie manquante ou incomplète. Si vous la connaissez, vous pouvez l’ajouter en cliquant ici.',
 'partOfSpeech': {}}
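
A minimal sketch of one way to get the expected result, assuming the placeholder can be recognised by the phrase "Étymologie manquante ou incomplète" that appears in the output above. The function `clean_etymology` and the constant name are hypothetical and not part of the library.

```python
# Phrase taken from the placeholder shown in the output above.
MISSING_ETYMOLOGY_MARKER = 'Étymologie manquante ou incomplète'


def clean_etymology(text: str) -> str:
    """Return '' when the text is Wiktionary's placeholder, else the text."""
    if MISSING_ETYMOLOGY_MARKER in text:
        return ''
    return text.strip()


placeholder = (
    '(Date à préciser) Étymologie manquante ou incomplète. '
    'Si vous la connaissez, vous pouvez l’ajouter en cliquant ici.'
)
print(repr(clean_etymology(placeholder)))  # ''
```

Matching on a fixed phrase is fragile if the wording of the placeholder changes, so checking for the template's markup in the page source may be a more robust choice.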

Error with definitions

Hello,

For 373 words the parser crashes in the "get_definitions" method: line 235 reads "while text.tag != 'ol'", but for these words text becomes None inside the loop.

The error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
      113 page = wiktp.from_source("longouse")
----> 114 page = page.get_word_data["partOfSpeech"]

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
     57             'title'       : self.get_title(),
     58             'etymologies' : self.get_etymology(),
---> 59             'partOfSpeech': self.get_parts_of_speech(),
     60         }
     61 

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    121         for section_name in sections:
    122             nice_section_name = self._real_section_name(section_name)
--> 123             parts_of_speech[nice_section_name] = self.get_definitions(section_name)
    124             # Translations ?
    125             if self._language == 'Français':

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_definitions(self, part_of_speech)
    233         text = self._query.find(part_of_speech)[0]
    234         text = text.getparent()
--> 235         while text.tag != 'ol':
    236             # ligne de forme
    237             if text.tag == 'p' or text.tag == 'span':

AttributeError: 'NoneType' object has no attribute 'tag'

Proposed fix:

while text is not None and text.tag != 'ol':
...
if text is not None:
  for i, definition_bloc in enumerate(text.getchildren()):
...

Examples of affected words:

longouse
lundi
maçonnes
légèrement
pauvrement
passaient
octaveur
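
The guard proposed above can be illustrated in isolation. The sketch below walks a chain of siblings until it finds a given tag and returns None when the chain ends first, instead of raising AttributeError. `Node` is a hypothetical stand-in exposing only `tag` and `getnext()`, the two members of an lxml element that the loop relies on. `chain` and `next_with_tag` are likewise illustrative names.

```python
class Node:
    """Hypothetical stand-in for an lxml element (tag + getnext only)."""

    def __init__(self, tag):
        self.tag = tag
        self._next = None

    def getnext(self):
        return self._next


def chain(*tags):
    """Build a linked run of sibling nodes and return the first one."""
    nodes = [Node(tag) for tag in tags]
    for current, following in zip(nodes, nodes[1:]):
        current._next = following
    return nodes[0]


def next_with_tag(node, tag):
    """Follow siblings until `tag` is found; return None if it never is."""
    while node is not None and node.tag != tag:
        node = node.getnext()
    return node


found = next_with_tag(chain('h3', 'p', 'ol'), 'ol')
missing = next_with_tag(chain('h3', 'p'), 'ol')
print(found.tag, missing)  # ol None
```

The caller then checks the result for None before iterating over its children, which is the second half of the proposed fix.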

Error when requesting a random page with a language code that does not exist

Error with text_content

Hi,

Last bug, I guess: for words like "lundi" and "pauvrement", the module crashes with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\workspaces\WiktionnaireParser\tmp.py in <module>
      3 page = wiktp.from_source("pauvrement")
----> 4 page = page.get_word_data["partOfSpeech"]
      5 page

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_word_data(self)
     75             'title': self.get_title(),
     76             'etymologies': self.get_etymology(),
---> 77             'partOfSpeech': self.get_parts_of_speech(),
     78         }
     79 

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    150                 if not re.match(r'#Traductions', value):
    151                     continue
--> 152                 translation = self.get_translations(value)
    153                 parts_of_speech[nice_section_name]['translations'] = translation
    154 

d:\workspaces\WiktionnaireParser\wiktionnaireparser\parser.py in get_translations(self, translation_id)
    275 
    276         for line in lines:
--> 277             language = line.find('span').text_content()
    278             transl = []
    279             links = line.find('a')

AttributeError: 'NoneType' object has no attribute 'text_content'
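
The failure comes from `line.find('span')` returning None for a translation line that has no `span` child. A minimal sketch of a guard follows, written against the standard library's `xml.etree.ElementTree` so that it runs without lxml. `''.join(span.itertext())` stands in for lxml's `text_content()`, and the helper name `language_of` is hypothetical.

```python
import xml.etree.ElementTree as ET


def language_of(line):
    """Return the language label of a translation line, or None."""
    span = line.find('span')
    if span is None:
        return None  # no language label on this line: let the caller skip it
    return ''.join(span.itertext()).strip()


with_span = ET.fromstring('<li><span>Anglais</span> : <a>poorly</a></li>')
without_span = ET.fromstring('<li><a>poorly</a></li>')

print(language_of(with_span))     # Anglais
print(language_of(without_span))  # None
```

In `get_translations`, a line for which the helper returns None could simply be skipped with `continue`.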
