csu / pyquora
A Python module for fetching and parsing data from Quora.
Home Page: http://christopher.su/pyquora/
License: Other
This doesn't need to check the functionality of the methods/API (they would just be aliases to methods that are tested elsewhere in the test suite); it only needs to check that the old API/methods exist and can be called with the correct parameters.
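A minimal sketch of such an existence-and-signature check, assuming Python 3.3+ for inspect.signature. The Demo class and the names fetch and old_fetch are hypothetical stand-ins, not pyquora's actual API.

```python
import inspect


class Demo(object):
    """Hypothetical stand-in for a class that keeps a legacy alias."""

    @staticmethod
    def fetch(user):
        # New-style method (placeholder body for illustration).
        return {'user': user}

    @staticmethod
    def old_fetch(user):
        # Legacy name kept as a thin alias of the new method.
        return Demo.fetch(user)


def assert_legacy_method(cls, name, expected_params):
    """Check that cls.<name> exists, is callable, and takes expected_params."""
    method = getattr(cls, name, None)
    assert callable(method), '%s is missing or not callable' % name
    params = list(inspect.signature(method).parameters)
    assert params == expected_params, '%s takes %r' % (name, params)


# Only the presence and the parameter names are checked, not the behaviour.
assert_legacy_method(Demo, 'old_fetch', ['user'])
```

The check deliberately stops at the signature, since the aliased behaviour is already covered by the tests for the new method.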
Because of Quora's recent UI change, Quora.get_user_activity does not scrape data correctly.
A direct consequence on quora-api can be observed by making a GET request on:
http://quora-api.herokuapp.com/users//activity/answers
where an empty array is returned.
https://github.com/csu/pyquora/edit/master/quora/pyquora.py#L45
We should fix this, but maintain the legacy API so things that currently use pyquora don't need to be rewritten. We can drop the legacy support at a certain milestone, such as v2.0.
>>> from quora import Quora
>>> Quora.get_random_answers(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "quora/quora.py", line 139, in get_random_answers
answer = Quora.get_one_answer(question)
File "quora/quora.py", line 50, in get_one_answer
return Quora.scrape_one_answer(soup)
File "quora/quora.py", line 54, in scrape_one_answer
answer = soup.find('div', id = re.compile('_answer_content$')).find('div', id = re.compile('_container'))
AttributeError: 'NoneType' object has no attribute 'find'
>>>
No longer the same as the markdown/GitHub readme.
If I'm not wrong, the tests only check whether data has been scraped. They do not check whether it has been scraped correctly.
How about adding selected HTML pages into the test folder rather than loading the page from Quora every time the test is run? This way, we can check if there is a difference between what was expected and what was received.
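A rough sketch of what a fixture-based test could look like. To stay dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup; the fixture markup, the scrape_answer_count function, and the test name are all invented for illustration, and a real fixture would be a saved page read from the test folder.

```python
from html.parser import HTMLParser

# Hypothetical fixture: in practice this would be a page saved in the test
# folder and read from disk. The markup here is invented for illustration.
FIXTURE_HTML = """
<html><body>
  <div class="answer_count">4 Answers</div>
</body></html>
"""


class AnswerCountParser(HTMLParser):
    """Collect the text inside <div class="answer_count">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('class') == 'answer_count':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self._inside = False

    def handle_data(self, data):
        # Keep only the first piece of text found inside the target div.
        if self._inside and self.text is None:
            self.text = data.strip()


def scrape_answer_count(html):
    """Return the answer count as an int, or None if the div is missing."""
    parser = AnswerCountParser()
    parser.feed(html)
    if not parser.text:
        return None
    return int(parser.text.split()[0])


def test_answer_count_matches_fixture():
    # Compare against the value known to be in the saved page.
    assert scrape_answer_count(FIXTURE_HTML) == 4
```

Because the input never changes, a failing assertion points at the scraper rather than at a change on Quora's side.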
Not absolutely necessary, but could improve readability.
So csu/quora-backup#6 doesn't happen again.
e.g. the end API usage should be like
user = Quora.User('Christopher-J-Su')
activity = user.activity
print activity.activity_type
I.e. we shouldn't have to call a method to get activity, rather, it should be an attribute of the User class like the other statistics (followers, following, edits, etc.).
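One way this could be sketched is with a lazily evaluated property. The constructor arguments and the fake_fetch helper below are assumptions made for the example; in pyquora the fetch step would be whatever currently scrapes the activity feed.

```python
class Activity(object):
    """Simplified stand-in for the Activity class."""

    def __init__(self, answers):
        self.answers = answers


class User(object):
    def __init__(self, username, fetch_activity):
        self.username = username
        # fetch_activity is a hypothetical callable: username -> Activity.
        self._fetch_activity = fetch_activity
        self._activity = None

    @property
    def activity(self):
        # Fetch on first access, then reuse the cached result.
        if self._activity is None:
            self._activity = self._fetch_activity(self.username)
        return self._activity


def fake_fetch(username):
    # Placeholder for the real scraping call, so the sketch runs offline.
    return Activity(answers=['answer by ' + username])


user = User('Christopher-J-Su', fake_fetch)
print(user.activity.answers)
```

Callers then read user.activity like any other attribute, and the network request only happens if the attribute is actually used.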
Answers aren't being parsed from the feed properly.
Test code:
from quora import Quora, Activity
quora = Quora()
activity = quora.get_activity('Christopher-J-Su')
print activity.answers
Results:
(env)csu:pyquora (master)$ python debug.py
[]
Also, from quora-api:
{
"items": []
}
@rohithpr and @aaronwinter have contributed enough and are familiar enough with the codebase to directly push to pyquora and quora-api, as well as review and accept pull requests. An org should be created to grant them push access to the repositories.
Fetch the number of views, edits, followers, etc. for a question, but not the content (for now, just to be safe).
This happens when the answer's author has a number at the end of their username.
Ex: the username is Foo-Bar-23, but we make the function call as get_one_answer(question, 'Foo-Bar').
One way to overcome this would be to check for invalid dicts and keep making function calls such as get_one_answer(question, 'Foo-Bar-1'), get_one_answer(question, 'Foo-Bar-2') and so on until a valid dict is received, but that is highly inefficient.
So we need to find another way to get these answers.
As I've stated here, Quora is blocking some (all?) scripts.
from bs4 import BeautifulSoup
import requests
url = 'http://www.quora.com/search?q=flowers'
soup = BeautifulSoup(requests.get(url).text)
print soup
<html>
<head>
<title>503 Service Unavailable</title>
</head>
<body>
<h1>503 Service Unavailable</h1>
The server is currently unavailable. Please try again at a later time.<br/><br/>
Our automated scripts have detected a possible scraper. If you feel we have made an error, please email [email protected]. Sorry for the inconvenience. Thanks.
</body>
</html>
stats = quora.get_user_stats('Christopher-J-Su')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "../quora/quora.py", line 143, in get_user_stats
return User.get_user_stats(u)
File "../quora/user.py", line 156, in get_user_stats
user_dict = {'answers' : data_stats[1],
IndexError: list index out of range
Right now, my example tests (written just to get CI working) only check whether any of the activity attributes (answers, questions, etc.) return an empty list. That check isn't always valid.
For example, if someone hasn't posted a review in a long time, their activity.review_requests will be empty, even if pyquora is working properly.
Using Quora.get_one_answer('6hARL'), I can only get one answer for a question.
How can I get all the answers to a question?
Thanks.
I ran into a little something when I was writing tests and made changes to quora.py.
Those changes have no effect, because nosetests is importing quora from the venv.
Is this the expected behaviour or am I doing something wrong?
Try Is-there-a-proof-of-the-Four-Color-Theorem-that-does-not-involve-substantial-computation.
GET: http://quora-api.herokuapp.com/questions/Is-there-a-proof-of-the-Four-Color-Theorem-that-does-not-involve-substantial-computation
Output:
{
"answer_count": 4,
"answer_wiki": "<div class=\"hidden\" id=\"answer_wiki\"><div id=\"ld_ebgwib_28688\"><div id=\"__w2_sHb6iqm_wiki\"></div></div></div>",
"question_details": null,
"question_text": "Is there a proof of the Four Color Theorem that does not involve substantial computation?",
"topics": [
"Science, Engineering, and Technology",
"Science",
"Formal Sciences",
"Mathematics"
],
"want_answers": 1
}
question_details is null, but the question has details on Quora.
What possible use cases are there for get_random_answers? I don't think it's necessary/useful. Plus, we're importing string and random just for it.
I don't see any endpoint. Is there any method to do that?
Thanks.
Currently the USAGE instruction in README.md is like this:
from quora import Quora, Activity
quora = new Quora()
# get user activity
activity = get_activity('Christopher-J-Su')
But it should be like this:
from quora import Quora, Activity
quora = Quora()
# get user activity
activity = quora.get_activity('Christopher-J-Su')
When the get_question_stats() method is called with an invalid question, an unhandled exception occurs. Here's a dump of the error:
question = Quora.get_question_stats('Medicine-and-Healthcare')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/quora/quora.py", line 115, in get_question_stats
return Quora.scrape_question_stats(soup)
File "/usr/local/lib/python2.7/dist-packages/quora/quora.py", line 125, in scrape_question_stats
answer_count = soup.find('div', attrs={'class' : 'answer_count'}).next.split()[0]
AttributeError: 'NoneType' object has no attribute 'next'
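A sketch of how the missing element could be turned into a descriptive error instead of an AttributeError. QuestionNotFound, require, and EmptyPage are hypothetical names introduced for this example; scrape_answer_count only assumes its argument behaves like a BeautifulSoup object, where find() returns a node or None and nodes offer get_text().

```python
class QuestionNotFound(Exception):
    """Hypothetical error for pages that lack the expected markup."""


def require(node, description):
    """Return node, or raise a descriptive error if the lookup found nothing."""
    if node is None:
        raise QuestionNotFound('could not find %s on the page' % description)
    return node


def scrape_answer_count(soup):
    # Same lookup as in the traceback above, but checked before use.
    div = require(soup.find('div', attrs={'class': 'answer_count'}),
                  'the answer_count div')
    return int(div.get_text().split()[0])


class EmptyPage(object):
    """Minimal stand-in for a parsed page with no matching elements."""

    def find(self, *args, **kwargs):
        return None


try:
    scrape_answer_count(EmptyPage())
except QuestionNotFound as error:
    print(error)
```

Callers such as quora-api could then catch one well-defined exception and report the invalid question, rather than failing on an internal detail of the scraper.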
It's still not working for me. It's also not working at http://quora-api.herokuapp.com/answers/How-can-I-join-Open-Source-Rails-projects/Tobias-Sandelius.
>>> from quora import Quora
>>> Quora.get_one_answer('How-can-I-join-Open-Source-Rails-projects', 'Tobias-Sandelius')
{}
Currently it is in readme.md. Wouldn't it be better to move it from there to another folder with code examples?
Aiming for Python 2.6+ and Python 3.3+ compatibility.
print Quora.get_question_stats('What-are-the-best-Cyanide-Happiness-comics')
{'want_answers': 2, 'question_text': u'What are the best Cyanide & Happiness comics?', 'topics': [u'Communication', u'Writing', u'Books', u'Publishing', u'Comics (narrative art form)'], 'question_details': None, 'answer_count': 474, 'answer_wiki': None}
want_answers should've been 2k!
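A possible fix is to expand abbreviated counts before converting them to integers. The sketch below assumes the page renders large counts with a trailing k or m (as in "2k"); the exact formats Quora uses are an assumption, and parse_count is a hypothetical helper, not part of pyquora.

```python
import re

# Assumed abbreviations; extend if other suffixes turn up.
_SUFFIXES = {'k': 1000, 'm': 1000000}


def parse_count(text):
    """Convert strings such as '474', '2k' or '1.5k' to an int."""
    match = re.match(r'^\s*([\d.,]+)([kKmM]?)\b', text)
    if match is None:
        raise ValueError('no count found in %r' % text)
    number = float(match.group(1).replace(',', ''))
    multiplier = _SUFFIXES.get(match.group(2).lower(), 1)
    # round() guards against float artifacts such as 2299.9999999999995.
    return int(round(number * multiplier))


print(parse_count('2k'))
print(parse_count('474'))
```

Taking only the first whitespace-separated token and calling int() on it would silently truncate "2k" to 2 or fail outright, which matches the symptom above.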
In light of new Quora UI changes, we need to fix how we detect question follows in user activity.
Looks like Quora is masking usernames when you view an answer without logging in. "Quora User" is shown in place of the user's actual name. This is also affecting answers fetched by requests.
I haven't checked to see the extent to which this is applied.
PS: It's not an issue with the user being banned or anything, it shows the name properly after logging in.
There are more and more helper functions; we should organize them into separate classes/subclasses to improve readability.
help(quora) is pretty unhelpful!
There are three ways of calling this function, and each one returns a different value of question.
Or some other name instead of Question; this class will be responsible for questions and answers.