t3nsor / quora-backup

Python scripts to download Quora answers and convert them into a more portable form.
License: GNU General Public License v2.0
I tried executing the JavaScript in the console of two browsers, Mozilla Firefox and Chrome. It opens a new tab with a blank screen. What is the issue?
The crawler worked beautifully for me, but converter.py crashed.
Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
Traceback (most recent call last):
File "./quora-backup/converter.py", line 233, in <module>
document = parser.parse(page_html, encoding='utf-8')
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 235, in parse
self._parse(stream, False, None, *args, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 85, in _parse
self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_tokenizer.py", line 36, in __init__
self.stream = HTMLInputStream(stream, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
When I experimented by removing encoding from the **kwargs, I got this:
Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
[WARNING] Unrecognized node
Traceback (most recent call last):
File "./quora-backup/converter.py", line 279, in <module>
saved_page.write(serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False).render(walker))
AttributeError: module 'html5lib.serializer' has no attribute 'htmlserializer'
I wonder if maybe it's an html5lib version thing or similar.
Here's my configuration:
(venv) $ python3 --version
Python 3.5.2
(venv) $ pip freeze
html5lib==0.999999999
six==1.10.0
webencodings==0.5
Happy to help triangulate, and while I don't code every weekend, I'd offer a pull request eventually if I can figure out how to get it working.
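This does look like an html5lib API change: newer releases of html5lib no longer accept an encoding keyword on the input stream (the project's own documentation uses transport_encoding instead), and HTMLSerializer is exposed directly on html5lib.serializer rather than through a htmlserializer attribute. As a version-tolerant sketch (the helper name parse_with_encoding is made up; parse_fn stands for the parser.parse call in converter.py):

```python
def parse_with_encoding(parse_fn, data, encoding='utf-8'):
    """Call an html5lib-style parse function, trying the old
    'encoding' keyword first and falling back to the newer
    'transport_encoding' name when that raises TypeError."""
    try:
        return parse_fn(data, encoding=encoding)            # older html5lib
    except TypeError:
        return parse_fn(data, transport_encoding=encoding)  # newer html5lib
```

The second crash would need a parallel change: serializer.htmlserializer.HTMLSerializer becomes serializer.HTMLSerializer in recent html5lib versions.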
$ ../software-git/quora-backup/converter.py answers-en answers-en-ready
Found 2503 answers
Filename: 2015-01-18 What-are-some-of-the-worst-baby-names.html
Traceback (most recent call last):
File "../software-git/quora-backup/converter.py", line 216, in
print('[WARNING] Failed to locate answer on page (Source URL was %s)' % url, file=sys.stderr)
NameError: name 'url' is not defined
Quora answers may have metadata strings like "Added 12am", and the script fails to parse such strings, giving me xxxx-xx-xx. Just need to add another regex; should be easy to fix.
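Such a regex could be sketched as follows, assuming the metadata looks like "Added 12am" / "Added 3pm" for same-day answers (the function name is hypothetical):

```python
import re

def parse_hour_meta(meta):
    """Extract the hour from a same-day 'Added 12am' / 'Added 3pm'
    style string; returns a 24-hour integer, or None if no match."""
    m = re.search(r'Added\s+(\d{1,2})\s*(am|pm)', meta, re.IGNORECASE)
    if m is None:
        return None
    hour = int(m.group(1)) % 12        # maps 12am -> 0, 12pm -> 0 before adjustment
    if m.group(2).lower() == 'pm':
        hour += 12
    return hour
```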
Hi, please help. I'm downloading from Quora; when running crawler.py, the error below happens:
Traceback (most recent call last):
File "./crawler.py", line 100, in
answers = json.load(input_file)
File "/usr/lib/python3.7/json/init.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.7/json/init.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 5572 (char 5571)
In the current code, the user has to run two different JavaScript snippets: one for the list of all answers and another for the times. They can be combined: store the time and offset as the last two elements of each answer entry, and the user will only have to run the JS command once.
Of course, crawler.py has to be changed accordingly.
This will also eliminate the need for the timezone and timestamp arguments, since they will be available directly in the JSON itself.
Styles have changed: in particular the node with QuestionArea has been removed:
<div class="CenteredQuestionPage NewGridQuestionPage QuestionAnswerPageMain"><div class="grid_page"><div class="layout_centered_2col_main"><div class="ans_page_question_header"><div id="MVzBgP"><div class="QuestionArea"><div id="VoNHyN "><div class="question_text_edit"><h1><span id="RymIQB"><a class="question_link" href="/What-is-the-etymology-of-Gylippus-It-has-to-do-with-h orses-but-what-else" action_mousedown="QuestionLinkClickthrough" id="__w2_QKath9N_link">
is now:
<div class='CenteredQuestionPage NewGridQuestionPage QuestionAnswerPageMain'><div class='grid_page'><div class='layout_centered_2col_main'><div class='ans_page_question_header'><div class='_type_serif_title_xlarge pass_color_to_child_links'><div id='GgjOTP'><a class='question_link' href='/What-is-the-etymology-of-Gylippus-It-has-to-do-with-horses-but-what-else' target='_top' action_mousedown='QuestionLinkClickthrough' id='__w2_z5KqY6h_link'>
As a result, converter.py cannot function: QuestionArea is no longer the class to look for.
When trying to run the crawler as directed, I get the following error:
[DEBUG] Loading input file content.json
Traceback (most recent call last):
File "./crawler.py", line 97, in
answers = json.load(input_file)
File "/usr/lib/python3.4/json/init.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.4/json/init.py", line 318, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.4/json/decoder.py", line 359, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 133 (char 132)
Looking at the actual JSON file, it seems likely to me that the problem here is that I'm registered on Quora as "Eivind Kjørstad", and the ø gets encoded in the URLs in a way the crawler does not approve of. Specifically, the start of my JSON file looks like this:
[["https://www.quora.com/What-can-I-do-to-prevent-myself-from-getting-into-a-cycle-of-mediocrity/answer/Eivind-Kj%C3%B8rstad","Added
1h
ago"],["https://www.quora.com/Laws-in-India/If-I-get-a-signed-document-from-my-wife-and-her-parents-saying-that-No-dowry-has-been-given-to-the-groom-and-his-family-in-this-marriage-and-other-necessary-details-can-my-wife-or-her-family-members-still-file-a-case-against-me-under-the-Dowry-Act/answer/Eivind-Kj%C3%B8rstad","Added
2h
ago"],["https://www.quora.com/Why-do-so-many-men-on-Quora-use-the-word-females-to-refer-to-human-women/answer/Eivind-Kj%C3%B8rstad","Added
Wed"]
At a guess, the crawler disapproves of the "%C3%B8" part in my name. If I run sed on the JSON to take out those characters, the crawler runs, but of course the URLs it then tries to fetch are incorrect and nothing is fetched.
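For what it's worth, the "Invalid control character" error usually comes from literal newlines inside the JSON strings (the "Added / 1h / ago" metadata spans several lines), not from the percent-encoded ø: "%C3%B8" is plain ASCII and perfectly valid JSON. Python's json module can be told to tolerate raw control characters inside strings via strict=False, so a possible one-line change in crawler.py could be sketched like this (not tested against the real script):

```python
import json

# strict=False lets the decoder accept raw control characters
# (such as literal newlines) inside JSON string values.
raw = '["https://www.quora.com/...", "Added\n1h\nago"]'
entry = json.loads(raw, strict=False)
```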
I am a Quora user in English, Portuguese and Spanish. While I have been able to use the script with my English answers, it fails with the answers in other languages because they indicate dates in a quite different way.
I'd be okay patching the scripts myself, keeping one version for each language, since I only need to run them a few times.
Would you please help me adapt the script so it recognises international dates?
This is the way it is in Portuguese:
less than 24 hours: "Adicionado há 12h"
less than a week: "Adicionado quinta-feira" (the days of the week are "domingo, segunda-feira, terça-feira, quarta-feira, quinta-feira, sexta-feira, sábado")
less than a year: "Adicionado 8 de outubro" (the months of the year are "janeiro, fevereiro, março, abril, maio, junho, julho, agosto, setembro, outubro, novembro, dezembro")
more than a year: "Adicionado 30/07/2018" (date format is dd/mm/yyyy and the leading zero is used).
This is the way it is in Spanish:
less than 24 hours: "Añadido hace 12h"
less than a week: "Añadido viernes" (the days of the week are "domingo, lunes, martes, miércoles, jueves, viernes, sábado")
less than a year: "Añadido el 8 de octubre" (the months of the year are "enero, febrero, marzo, abril, mayo, junio, julio, agosto, septiembre, octubre, noviembre, diciembre")
more than a year: "Añadido el 30/9/2018" (date format is dd/mm/yyyy but the leading zero isn't used).
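For the "less than a year" case, a month-name table built from the Portuguese and Spanish lists above could be sketched like this (the helper name is made up):

```python
import re

# Month names per language, as listed above; list index + 1 is the month number.
MONTHS = {
    'pt': ['janeiro', 'fevereiro', 'março', 'abril', 'maio', 'junho',
           'julho', 'agosto', 'setembro', 'outubro', 'novembro', 'dezembro'],
    'es': ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
           'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'],
}

def parse_day_month(meta, lang):
    """Parse 'Adicionado 8 de outubro' / 'Añadido el 8 de octubre'
    into a (day, month) tuple, or None if the string doesn't match."""
    m = re.search(r'(\d{1,2}) de (\w+)', meta)
    if m is None:
        return None
    name = m.group(2).lower()
    if name not in MONTHS[lang]:
        return None
    return int(m.group(1)), MONTHS[lang].index(name) + 1
```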
Pasting the JS function into the Chrome console results in the following errors:
Uncaught ReferenceError: $ is not defined
at <anonymous>:1:70
window.open().document.write(JSON.stringify(Array.prototype.map.call($('.UserContentList .pagedlist_item'), function (e) { return [$(e).find('a')[0].href, $(e).find('.metadata').text()] })))
VM263:1 Uncaught TypeError: Array.prototype.map called on null or undefined
at map (native)
at <anonymous>:1:65
(anonymous) @ VM263:1
main-thumb-33627499-25-jhhwtnwmpleskklqhoepgdoyodyclyqv.jpeg:1 GET https://qph.ec.quoracdn.net/main-thumb-33627499-25-jhhwtnwmpleskklqhoepgdoyodyclyqv.jpeg 403 (Forbidden)
Using Chrome version 55.0.2883.95 (64-bit) / Mac.
If an answer was already downloaded, do not download it again. That way, if users run the script repeatedly, only the newly posted answers will be downloaded. An interactive option could also be added which, when enabled, asks the user what to do with each answer already present in the folder; a user may want to download an answer again in case they have updated it.
Again, I can implement this if it makes sense.
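The non-interactive part could be as small as this sketch, assuming each answer is saved under a filename derived from its URL as crawler.py already does (the function name is made up):

```python
import os

def should_download(filename, output_dir, redownload=False):
    """Skip answers whose output file already exists, unless the
    caller explicitly asked to re-download them."""
    path = os.path.join(output_dir, filename)
    return redownload or not os.path.exists(path)
```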
When calculating the max file length, Python's len only counts characters, but the OS counts bytes, so when there's a non-ASCII character in a filename that's too long, the file can't be saved.
For reference, the file I tried to save was Which-is-more-likely-China-emerges-as-a-xenophobic-chauvinistic-force-bitter-and-hostile-to-the-West-because-it-tried-to-slow-down-or-abort-its-development-”-or-“educated-and-involved-in-the-ways-of-the-world-more-cosmopolitan-more-internationalized-.html
(note the curly quotes), which is 255 characters long, but is over 255 bytes.
I have a preliminary fix on my fork, but this only works with UTF-8.
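That fix could look something like this UTF-8-only sketch: truncate the encoded name to the byte limit, and let an errors='ignore' decode drop any partial multi-byte sequence left at the cut:

```python
def truncate_filename(name, max_bytes=255):
    """Truncate a filename to at most max_bytes bytes of UTF-8,
    without splitting a multi-byte character at the boundary."""
    encoded = name.encode('utf-8')
    if len(encoded) <= max_bytes:
        return name
    # errors='ignore' silently drops a trailing partial
    # multi-byte sequence produced by the byte-level cut.
    return encoded[:max_bytes].decode('utf-8', errors='ignore')
```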
crawler.py can be used to retrieve blogs from Quora, not just answers. But if it is, the constraint that the fetched URL needs to match quora.com/answer/... needs to be relaxed:
# Get the part of the URL indicating the question title; we will save under this name
m1 = re.search('quora\.com/([^/]+)/answer', url)
# if there's a context topic
m2 = re.search('quora\.com/[^/]+/([^/]+)/answer', url)
filename = added_time + ' '
if not m1 is None:
    filename += m1.group(1)
elif not m2 is None:
    filename += m2.group(1)
else:
    print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
    continue
I change the last two lines (the print and continue under the else) to:
    # blog post
    m3 = re.search('quora\.com/([^/]+)', url)
    filename += m3.group(1)
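Putting the three patterns together, the whole selection could be sketched as one fall-through helper (the function name is made up; note that quora\.com/([^/]+) alone would also match answer URLs, so it must stay last):

```python
import re

def question_part(url):
    """Return the path segment to save under: the question slug for
    answer URLs, or the first path segment for blog-post URLs."""
    for pattern in (r'quora\.com/([^/]+)/answer',        # plain answer
                    r'quora\.com/[^/]+/([^/]+)/answer',  # answer with context topic
                    r'quora\.com/([^/]+)'):              # blog post fallback
        m = re.search(pattern, url)
        if m is not None:
            return m.group(1)
    return None
```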
The metadata element of an anonymous answer, as fetched by the JavaScript that collects all answer timestamps, looks like:
"Anonymous · Added Mar 1"
That middle dot is non-ASCII and makes crawler.py barf at answers = json.load(input_file).
You might want to filter non-ASCII characters out of the loaded JSON to preempt that; the middle dot can straightforwardly be replaced with a period.