t3nsor / quora-backup

Python scripts to download Quora answers and convert them into a more portable form.
License: GNU General Public License v2.0
I tried executing the JavaScript in the console of two browsers, Mozilla Firefox and Chrome. It opens a new tab with a blank screen. What is the issue?
The crawler worked beautifully for me, but converter.py crashed.
Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
Traceback (most recent call last):
File "./quora-backup/converter.py", line 233, in <module>
document = parser.parse(page_html, encoding='utf-8')
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 235, in parse
self._parse(stream, False, None, *args, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 85, in _parse
self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_tokenizer.py", line 36, in __init__
self.stream = HTMLInputStream(stream, **kwargs)
File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
When I experimented by removing encoding from the **kwargs, I got this:
Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
[WARNING] Unrecognized node
Traceback (most recent call last):
File "./quora-backup/converter.py", line 279, in <module>
saved_page.write(serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False).render(walker))
AttributeError: module 'html5lib.serializer' has no attribute 'htmlserializer'
I wonder if maybe it's an html5lib version thing or similar.
Here's my configuration:
(venv) $ python3 --version
Python 3.5.2
(venv) $ pip freeze
html5lib==0.999999999
six==1.10.0
webencodings==0.5
Happy to help triangulate, and while I don't code every weekend, I'd offer a pull request eventually if I can figure out how to get it working.
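This does look like an html5lib API change: newer releases of html5lib no longer accept an encoding keyword on the input stream (the project's own documentation uses transport_encoding instead), and HTMLSerializer is exposed directly on html5lib.serializer rather than through a htmlserializer attribute. As a version-tolerant sketch (the helper name parse_with_encoding is made up; parse_fn stands for the parser.parse call in converter.py):

```python
def parse_with_encoding(parse_fn, data, encoding='utf-8'):
    """Call an html5lib-style parse function, trying the old
    'encoding' keyword first and falling back to the newer
    'transport_encoding' name when that raises TypeError."""
    try:
        return parse_fn(data, encoding=encoding)            # older html5lib
    except TypeError:
        return parse_fn(data, transport_encoding=encoding)  # newer html5lib
```

The second crash would need a parallel change: serializer.htmlserializer.HTMLSerializer becomes serializer.HTMLSerializer in recent html5lib versions.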
$ ../software-git/quora-backup/converter.py answers-en answers-en-ready
Found 2503 answers
Filename: 2015-01-18 What-are-some-of-the-worst-baby-names.html
Traceback (most recent call last):
File "../software-git/quora-backup/converter.py", line 216, in
print('[WARNING] Failed to locate answer on page (Source URL was %s)' % url, file=sys.stderr)
NameError: name 'url' is not defined
Quora answers may have metadata strings like "Added 12am", and the script fails to parse such strings, giving me xxxx-xx-xx. Just need to add another regex; should be easy to fix.
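Such a regex could be sketched as follows, assuming the metadata looks like "Added 12am" / "Added 3pm" for same-day answers (the function name is hypothetical):

```python
import re

def parse_hour_meta(meta):
    """Extract the hour from a same-day 'Added 12am' / 'Added 3pm'
    style string; returns a 24-hour integer, or None if no match."""
    m = re.search(r'Added\s+(\d{1,2})\s*(am|pm)', meta, re.IGNORECASE)
    if m is None:
        return None
    hour = int(m.group(1)) % 12        # maps 12am -> 0, 12pm -> 0 before adjustment
    if m.group(2).lower() == 'pm':
        hour += 12
    return hour
```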
Hi, please help. I'm downloading from Quora; when running crawler.py, the error below happens:
Traceback (most recent call last):
File "./crawler.py", line 100, in
answers = json.load(input_file)
File "/usr/lib/python3.7/json/init.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.7/json/init.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 5572 (char 5571)
In the current code, the user has to run two different JavaScript snippets: one for the list of all answers and another for the times. They can be combined: store the time and offset as the last two elements of each answer entry, and the user will only have to run the JS command once.
Of course, crawler.py has to be changed accordingly.
This will also eliminate the need for the timezone and timestamp arguments, since they will be available directly in the JSON itself.
Styles have changed: in particular the node with QuestionArea has been removed:
<div class="CenteredQuestionPage NewGridQuestionPage QuestionAnswerPageMain"><div class="grid_page"><div class="layout_centered_2col_main"><div class="ans_page_question_header"><div id="MVzBgP"><div class="QuestionArea"><div id="VoNHyN "><div class="question_text_edit"><h1><span id="RymIQB"><a class="question_link" href="/What-is-the-etymology-of-Gylippus-It-has-to-do-with-h orses-but-what-else" action_mousedown="QuestionLinkClickthrough" id="__w2_QKath9N_link">
is now:
<div class='CenteredQuestionPage NewGridQuestionPage QuestionAnswerPageMain'><div class='grid_page'><div class='layout_centered_2col_main'><div class='ans_page_question_header'><div class='_type_serif_title_xlarge pass_color_to_child_links'><div id='GgjOTP'><a class='question_link' href='/What-is-the-etymology-of-Gylippus-It-has-to-do-with-horses-but-what-else' target='_top' action_mousedown='QuestionLinkClickthrough' id='__w2_z5KqY6h_link'>
As a result, converter.py cannot function: QuestionArea is no longer the class to look for.
When trying to run the crawler as directed, I get the following error:
[DEBUG] Loading input file content.json
Traceback (most recent call last):
File "./crawler.py", line 97, in
answers = json.load(input_file)
File "/usr/lib/python3.4/json/init.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.4/json/init.py", line 318, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.4/json/decoder.py", line 359, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1 column 133 (char 132)
Looking at the actual JSON file, it seems likely to me that the problem here is that I'm registered on Quora as "Eivind Kjørstad", and the ø gets encoded in the URLs in a way the crawler does not approve of. Specifically, the start of my JSON file looks like this:
[["https://www.quora.com/What-can-I-do-to-prevent-myself-from-getting-into-a-cycle-of-mediocrity/answer/Eivind-Kj%C3%B8rstad","Added
1h
ago"],["https://www.quora.com/Laws-in-India/If-I-get-a-signed-document-from-my-wife-and-her-parents-saying-that-No-dowry-has-been-given-to-the-groom-and-his-family-in-this-marriage-and-other-necessary-details-can-my-wife-or-her-family-members-still-file-a-case-against-me-under-the-Dowry-Act/answer/Eivind-Kj%C3%B8rstad","Added
2h
ago"],["https://www.quora.com/Why-do-so-many-men-on-Quora-use-the-word-females-to-refer-to-human-women/answer/Eivind-Kj%C3%B8rstad","Added
Wed"]
At a guess, the crawler disapproves of the "%C3%B8" part in my name. If I run sed on the JSON to take out those characters, the crawler runs, but of course the URLs it then tries to fetch are incorrect and nothing is fetched.
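For what it's worth, the "Invalid control character" error usually comes from literal newlines inside the JSON strings (the "Added / 1h / ago" metadata spans several lines), not from the percent-encoded ø: "%C3%B8" is plain ASCII and perfectly valid JSON. Python's json module can be told to tolerate raw control characters inside strings via strict=False, so a possible one-line change in crawler.py could be sketched like this (not tested against the real script):

```python
import json

# strict=False lets the decoder accept raw control characters
# (such as literal newlines) inside JSON string values.
raw = '["https://www.quora.com/...", "Added\n1h\nago"]'
entry = json.loads(raw, strict=False)
```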
I am a Quora user in English, Portuguese and Spanish. While I have been able to use the script with my English answers, it fails with the answers in other languages because they indicate dates in a quite different way.
I'd be okay patching the scripts myself, keeping one version for each language, since I only need to run them a few times.
Would you please help me adapt the script so it recognises international dates?
This is the way it is in Portuguese:
less than 24 hours: "Adicionado há 12h"
less than a week: "Adicionado quinta-feira" (the days of the week are "domingo, segunda-feira, terça-feira, quarta-feira, quinta-feira, sexta-feira, sábado")
less than a year: "Adicionado 8 de outubro" (the months of the year are "janeiro, fevereiro, março, abril, maio, junho, julho, agosto, setembro, outubro, novembro, dezembro")
more than a year: "Adicionado 30/07/2018" (date format is dd/mm/yyyy and the leading zero is used).
This is the way it is in Spanish:
less than 24 hours: "Añadido hace 12h"
less than a week: "Añadido viernes" (the days of the week are "domingo, lunes, martes, miércoles, jueves, viernes, sábado")
less than a year: "Añadido el 8 de octubre" (the months of the year are "enero, febrero, marzo, abril, mayo, junio, julio, agosto, septiembre, octubre, noviembre, diciembre")
more than a year: "Añadido el 30/9/2018" (date format is dd/mm/yyyy but the leading zero isn't used).
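For the "less than a year" case, a month-name table built from the Portuguese and Spanish lists above could be sketched like this (the helper name is made up):

```python
import re

# Month names per language, as listed above; list index + 1 is the month number.
MONTHS = {
    'pt': ['janeiro', 'fevereiro', 'março', 'abril', 'maio', 'junho',
           'julho', 'agosto', 'setembro', 'outubro', 'novembro', 'dezembro'],
    'es': ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
           'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre'],
}

def parse_day_month(meta, lang):
    """Parse 'Adicionado 8 de outubro' / 'Añadido el 8 de octubre'
    into a (day, month) tuple, or None if the string doesn't match."""
    m = re.search(r'(\d{1,2}) de (\w+)', meta)
    if m is None:
        return None
    name = m.group(2).lower()
    if name not in MONTHS[lang]:
        return None
    return int(m.group(1)), MONTHS[lang].index(name) + 1
```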
Pasting the JS function into the Chrome console results in the following errors:
Uncaught ReferenceError: $ is not defined
at <anonymous>:1:70
window.open().document.write(JSON.stringify(Array.prototype.map.call($('.UserContentList .pagedlist_item'), function (e) { return [$(e).find('a')[0].href, $(e).find('.metadata').text()] })))
VM263:1 Uncaught TypeError: Array.prototype.map called on null or undefined
at map (native)
at <anonymous>:1:65
(anonymous) @ VM263:1
main-thumb-33627499-25-jhhwtnwmpleskklqhoepgdoyodyclyqv.jpeg:1 GET https://qph.ec.quoracdn.net/main-thumb-33627499-25-jhhwtnwmpleskklqhoepgdoyodyclyqv.jpeg 403 (Forbidden)
Using Chrome version 55.0.2883.95 (64-bit) / Mac.
If an answer was already downloaded, do not download it again. That way, if users run the script repeatedly, only the newly posted answers will be downloaded. An interactive option could also be added which, when enabled, asks the user what to do with each answer already present in the folder; a user may want to download an answer again in case they have updated it.
Again, I can implement this if it makes sense.
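The non-interactive part could be as small as this sketch, assuming each answer is saved under a filename derived from its URL as crawler.py already does (the function name is made up):

```python
import os

def should_download(filename, output_dir, redownload=False):
    """Skip answers whose output file already exists, unless the
    caller explicitly asked to re-download them."""
    path = os.path.join(output_dir, filename)
    return redownload or not os.path.exists(path)
```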
When calculating the max file length, Python's len only counts characters, but the OS counts bytes, so when there's a non-ASCII character in a filename that's too long, the file can't be saved.
For reference, the file I tried to save was Which-is-more-likely-China-emerges-as-a-xenophobic-chauvinistic-force-bitter-and-hostile-to-the-West-because-it-tried-to-slow-down-or-abort-its-development-”-or-“educated-and-involved-in-the-ways-of-the-world-more-cosmopolitan-more-internationalized-.html
(note the curly quotes), which is 255 characters long, but is over 255 bytes.
I have a preliminary fix on my fork, but this only works with UTF-8.
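That fix could look something like this UTF-8-only sketch: truncate the encoded name to the byte limit, and let an errors='ignore' decode drop any partial multi-byte sequence left at the cut:

```python
def truncate_filename(name, max_bytes=255):
    """Truncate a filename to at most max_bytes bytes of UTF-8,
    without splitting a multi-byte character at the boundary."""
    encoded = name.encode('utf-8')
    if len(encoded) <= max_bytes:
        return name
    # errors='ignore' silently drops a trailing partial
    # multi-byte sequence produced by the byte-level cut.
    return encoded[:max_bytes].decode('utf-8', errors='ignore')
```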
crawler.py can be used to retrieve blogs from Quora, not just answers. But if it is, the constraint that the fetched URL needs to match quora.com/answer/... needs to be relaxed:
# Get the part of the URL indicating the question title; we will save under this name
m1 = re.search('quora\.com/([^/]+)/answer', url)
# if there's a context topic
m2 = re.search('quora\.com/[^/]+/([^/]+)/answer', url)
filename = added_time + ' '
if not m1 is None:
    filename += m1.group(1)
elif not m2 is None:
    filename += m2.group(1)
else:
    print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
    continue
I change the last two lines (the print and continue under the else) to:
    # blog post
    m3 = re.search('quora\.com/([^/]+)', url)
    filename += m3.group(1)
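Putting the three patterns together, the whole selection could be sketched as one fall-through helper (the function name is made up; note that quora\.com/([^/]+) alone would also match answer URLs, so it must stay last):

```python
import re

def question_part(url):
    """Return the path segment to save under: the question slug for
    answer URLs, or the first path segment for blog-post URLs."""
    for pattern in (r'quora\.com/([^/]+)/answer',        # plain answer
                    r'quora\.com/[^/]+/([^/]+)/answer',  # answer with context topic
                    r'quora\.com/([^/]+)'):              # blog post fallback
        m = re.search(pattern, url)
        if m is not None:
            return m.group(1)
    return None
```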
The metadata element of an anonymous answer, as fetched by the JavaScript that collects all answer timestamps, looks like:
"Anonymous · Added Mar 1"
That middle dot is non-ASCII and makes crawler.py barf at answers = json.load(input_file).
You might want to filter non-ASCII characters out of the loaded JSON to preempt that; the middle dot can straightforwardly be replaced with a period.