aaronhall / spypy
A Solr-powered search engine for Python documentation (under construction)
Not sure this behavior is necessarily bad, but it heavily favors the highly weighted exact-match fields (like f_name, f_parents).
By default, dismax apparently scores a document using only its highest-scoring field instead of some combination of scores across all fields, so there are lots of "ties" in the results. The tie (tie breaker) parameter looks like the way to change this: with a value above its default of 0.0, the scores of the other matching fields are scaled by tie and added to the top score. See http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29
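To make the tie behavior concrete, here is a minimal sketch of how a disjunction-max score is combined: the best field score plus the tie value times the sum of the remaining field scores. The per-field scores are made-up numbers for illustration, not output from this index, and `dismax_score` is a hypothetical helper, not part of Solr or sunburnt.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does.

    With tie=0.0 only the best field counts; with tie=1.0 the result is
    the plain sum of all field scores.
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical per-field scores for two documents (illustrative only).
doc_a = [5.0, 0.0, 0.0]   # matches in one field
doc_b = [5.0, 2.0, 1.0]   # same best field, plus weaker matches elsewhere

print(dismax_score(doc_a), dismax_score(doc_b))            # tied when tie=0.0
print(dismax_score(doc_a, 0.1), dismax_score(doc_b, 0.1))  # doc_b now ranks higher
```

With tie at 0.0 the two documents are indistinguishable, which matches the "ties" described above; any small positive value lets the secondary fields break them.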
nltk might have something
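Low word distance in multiterm queries yields poor results.
See: "string to lower case", "string to lowercase", "string to lower"
This is partially a synonym issue, but mostly the distance is too low. (Note: in Lucene's query string output, a ~N after a parenthesized boolean query is the minimum number of optional clauses that must match, so the ~4 below may reflect the dismax mm setting requiring all four terms rather than a word distance.)
Example query (notice the ~4 at the tail of the query):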
<str name="parsedquery">
+((DisjunctionMaxQuery((f_fqn:string^50.0 | title:string^10.0 | text:string | f_definition:string^3.0 | f_name:string^10.0 | f_parents:string^5.0)) DisjunctionMaxQuery((f_fqn:to^50.0 | f_name:to^10.0 | f_parents:to^5.0)) DisjunctionMaxQuery((f_fqn:lower^50.0 | title:lower^10.0 | text:lower | f_definition:lower^3.0 | f_name:lower^10.0 | f_parents:lower^5.0)) DisjunctionMaxQuery((f_fqn:case^50.0 | title:case^10.0 | text:case | f_definition:case^3.0 | f_name:case^10.0 | f_parents:case^5.0)))~4)
</str>
<str name="parsedquery_toString">
+(((f_fqn:string^50.0 | title:string^10.0 | text:string | f_definition:string^3.0 | f_name:string^10.0 | f_parents:string^5.0) (f_fqn:to^50.0 | f_name:to^10.0 | f_parents:to^5.0) (f_fqn:lower^50.0 | title:lower^10.0 | text:lower | f_definition:lower^3.0 | f_name:lower^10.0 | f_parents:lower^5.0) (f_fqn:case^50.0 | title:case^10.0 | text:case | f_definition:case^3.0 | f_name:case^10.0 | f_parents:case^5.0))~4)
</str>
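If the trailing ~4 is read as a minimum-should-match count (which is how Lucene prints that setting on a boolean query), one thing to try is relaxing the dismax mm parameter so that not every term is required. The sketch below only builds the request parameters; the field boosts are copied from the parsed query above, while the mm and tie values and the `build_dismax_params` helper are assumptions chosen for illustration, not tuned settings.

```python
from urllib.parse import urlencode

# Field boosts copied from the parsed query above.
QF = "f_fqn^50.0 title^10.0 text f_definition^3.0 f_name^10.0 f_parents^5.0"


def build_dismax_params(user_query, mm="75%", tie=0.1):
    """Return Solr request parameters for a dismax query.

    mm="75%" asks for three of four terms to match; tie=0.1 lets
    lower-scoring fields contribute a little. Both values are guesses.
    """
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": QF,
        "mm": mm,
        "tie": tie,
        "debugQuery": "true",  # keeps the parsedquery output for comparison
    }


params = build_dismax_params("string to lower case")
query_string = urlencode(params)
print(query_string)
```

Comparing the parsedquery output before and after the change should show whether the ~4 suffix drops to ~3.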
See requests package for http stuff
Check out http://docs.python-requests.org/en/latest/index.html
Solr complains:
Traceback (most recent call last):
File "parser.py", line 251, in <module>
raise e
sunburnt.schema.SolrError: ({'status': '400', 'content-length': '1551', 'server': 'Jetty(6.1.24)', 'cache-control': 'must-revalidate,no-cache,no-store', 'date': 'Wed, 18 Jul 2012 19:02:00 GMT', 'content-type': 'text/html; charset=iso-8859-1'}, '<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>\n<title>Error 400 ERROR: [doc=dc3d4d77f79e4021e338d8cfaa73f216cd036b57] Error adding field \'f_parents\'=\'codeop\'</title>\n</head>\n<body><h2>HTTP ERROR 400</h2>\n<p>Problem accessing /solr/spypy/update/. Reason:\n<pre> ERROR: [doc=dc3d4d77f79e4021e338d8cfaa73f216cd036b57] Error adding field \'f_parents\'=\'codeop\'</pre></p><hr /><i><small>Powered by Jetty://
f_parents key is commented out for now
Definition lists (<dl>) can contain multiple items (function names and signatures) for the same definition: http://docs.python.org/library/string.html#string.ljust
The parser naively combines these into a single doc.
Not sure how to automate this
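One possible way to split such groups, sketched with the standard library's html.parser: collect consecutive <dt> items and, when the shared <dd> closes, emit one (term, description) pair per <dt>. The `DefinitionListSplitter` class and the sample markup are hypothetical; the markup is hand-written and simplified, not copied from the real page, and nested definition lists inside a <dd> are not handled.

```python
from html.parser import HTMLParser


class DefinitionListSplitter(HTMLParser):
    """Collect (term, description) pairs, one per <dt>, from <dl> markup."""

    def __init__(self):
        super().__init__()
        self.docs = []        # finished (term, description) pairs
        self._pending = []    # <dt> texts still waiting for their <dd>
        self._buffer = []     # text chunks of the element being read
        self._current = None  # "dt", "dd", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of some inline element; keep reading
        text = " ".join("".join(self._buffer).split())
        if tag == "dt":
            self._pending.append(text)
        else:
            # The <dd> closes the group: share it with every pending <dt>.
            self.docs.extend((term, text) for term in self._pending)
            self._pending = []
        self._current = None


# Simplified sample markup (an assumption about the page structure).
SAMPLE = """
<dl class="function">
  <dt>string.ljust(s, width[, fillchar])</dt>
  <dt>string.rjust(s, width[, fillchar])</dt>
  <dt>string.center(s, width[, fillchar])</dt>
  <dd>Left-justify, right-justify and center a string in a field.</dd>
</dl>
"""

splitter = DefinitionListSplitter()
splitter.feed(SAMPLE)
for term, description in splitter.docs:
    print(term, "->", description)
```

Each signature then becomes its own doc while still carrying the shared description text.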
The spider should write entries out to a database. The indexer will use the database as its source.
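A minimal sketch of that split, assuming SQLite as the store: the spider calls a save function per parsed entry, and the indexer iterates over stored rows and turns them into dicts it can send to Solr. The table layout, column names, and function names are hypothetical; only the field ideas (fqn, name, definition, text) come from the schema fields seen in the queries above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and if needed create) the entry store shared by both sides."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " fqn TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " definition TEXT,"
        " text TEXT)"
    )
    return conn


def save_entry(conn, entry):
    """Spider side: insert or overwrite one parsed entry."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entries (fqn, name, definition, text)"
            " VALUES (:fqn, :name, :definition, :text)",
            entry,
        )


def iter_entries(conn):
    """Indexer side: yield stored entries as dicts."""
    cursor = conn.execute(
        "SELECT fqn, name, definition, text FROM entries ORDER BY fqn"
    )
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


conn = open_store()
save_entry(conn, {
    "fqn": "string.ljust",
    "name": "ljust",
    "definition": "string.ljust(s, width[, fillchar])",
    "text": "Left-justify a string in a field of given width.",
})
for doc in iter_entries(conn):
    print(doc["fqn"], doc["definition"])
```

Keying on the fully qualified name means a re-crawl overwrites an entry instead of duplicating it, and the indexer can be rerun without touching the network.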