abhinav-upadhyay / apropos_replacement

GSoC project for NetBSD. The aim of the project is to develop a replacement for apropos(1) using mandoc and the FTS engine of SQLite.

Home Page: http://netbsd-soc.sourceforge.net/projects/apropos_replacement/

Languages: C 99.76%, JavaScript 0.24%
Topics: apropos, c, man-page, ranking, search

apropos_replacement's People

Contributors: 0-wiz-0, abhinav-upadhyay, jsonn

Forkers: 0mp

apropos_replacement's Issues

Replace fprintf with err/errx/warn/warnx

Joerg suggested replacing the fprintf calls in the error-handling code with the err/errx/warn/warnx functions: they also print the name of the program, which is useful in shell scripts, and they are the convention in most NetBSD code.
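A minimal sketch of the change (file name and message text made up for illustration):

    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
        const char *file = "/usr/share/man/man1/ls.1";

        /* Before: bare fprintf, no program name in the output */
        fprintf(stderr, "could not open %s\n", file);

        /* After: warnx(3) prefixes the program name; warn(3) would
         * additionally append the current errno message. */
        warnx("could not open %s", file);
        return 0;
    }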

Avoid hardlinks

As reported in Issue #11, one of the reasons for duplicate entries in the database is hard links among the man page sources. They should be avoided at any cost to prevent unnecessary noise in the database.

Joerg suggested on IRC generating a hash of each man page and comparing the hashes to avoid this.

I have uploaded a new branch with a commit for this. I will merge it into master after some review.
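A minimal sketch of the hashing idea, assuming NetBSD's MD5File(3) from libc; the database lookup is stubbed out as a hypothetical helper:

    #include <md5.h>
    #include <stdio.h>

    /* hypothetical helper: would run something like
     * "SELECT 1 FROM mandb WHERE md5_hash = :hash" */
    static int
    already_indexed(const char *hash)
    {
        (void)hash;
        return 0;
    }

    int
    main(void)
    {
        char buf[33];    /* 32 hex digits + NUL */

        if (MD5File("/usr/share/man/man1/ls.1", buf) == NULL) {
            perror("MD5File");
            return 1;
        }
        if (!already_indexed(buf))
            printf("new page, hash %s\n", buf);    /* index it */
        return 0;
    }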

Do VACUUM

Running VACUUM brings down the size of the database substantially: from 45M to 30M. So it should be added to the optimize step.
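A minimal sketch of running VACUUM through the C API (database file name as used elsewhere in these notes):

    #include <sqlite3.h>
    #include <stdio.h>

    int
    main(void)
    {
        sqlite3 *db;
        char *errmsg = NULL;

        if (sqlite3_open("apropos.db", &db) != SQLITE_OK)
            return 1;
        /* VACUUM rebuilds the database file, reclaiming free pages */
        if (sqlite3_exec(db, "VACUUM", NULL, NULL, &errmsg) != SQLITE_OK) {
            fprintf(stderr, "VACUUM failed: %s\n", errmsg);
            sqlite3_free(errmsg);
        }
        sqlite3_close(db);
        return 0;
    }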

Implement ManRank

Initial aim is to calculate a static weight for each page:

  • A basic version could simply count how many times a particular man page is referenced by other man pages.
  • A more advanced version could properly mimic PageRank: if a page is referenced by a page of significantly higher weight, the referenced page gets a larger boost to its weight than a page referenced by a page of lower weight (relative weights?). We will need to figure out how to bootstrap this, because initially all the pages would have the same weight: 0. A toy sketch of the iteration follows this list.
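A toy sketch of a PageRank-style iteration over a hard-coded three-page reference graph; in reality the graph would come from the cross references in the parsed pages, and the damping factor is an assumption borrowed from PageRank. One way around the bootstrap problem is to start from uniform weights rather than 0:

    #include <stdio.h>

    #define N 3    /* number of pages in the toy graph */

    int
    main(void)
    {
        /* links[i][j] = 1 if page i references page j (made-up data) */
        int links[N][N] = { {0, 1, 1}, {0, 0, 1}, {1, 0, 0} };
        double w[N], next[N];
        double d = 0.85;    /* damping factor, as in PageRank */
        int i, j, iter, out;

        for (i = 0; i < N; i++)
            w[i] = 1.0 / N;    /* uniform start instead of 0 */

        for (iter = 0; iter < 20; iter++) {
            for (j = 0; j < N; j++)
                next[j] = (1.0 - d) / N;
            for (i = 0; i < N; i++) {
                for (out = 0, j = 0; j < N; j++)
                    out += links[i][j];
                /* a page passes its weight on to the pages it references */
                for (j = 0; j < N; j++)
                    if (links[i][j])
                        next[j] += d * w[i] / out;
            }
            for (j = 0; j < N; j++)
                w[j] = next[j];
        }
        for (j = 0; j < N; j++)
            printf("page %d: %.4f\n", j, w[j]);
        return 0;
    }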

Manage symlinks/hardlinks with the database

This idea was part of the original project idea as given on the GSoC Ideas page.

It was also part of my proposal, in the optional tasks list.

The idea is basically to index all the symlinks/hardlinks to the man pages in the database. The links to a man page are identified by parsing the multiple .Nm entries: the extra .Nm entries are all names that link to the current page, and we store them in a separate table (sketched below).

This would allow us to remove all these hard/soft links from the file system. The next time man(1) is invoked to open a man page which is a link to another page, it would look up the database, find the target page for the link, and render it.
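A hypothetical sketch of what that separate table could look like (table and column names assumed, not the project's actual schema):

    #include <sqlite3.h>

    int
    main(void)
    {
        sqlite3 *db;

        if (sqlite3_open("apropos.db", &db) != SQLITE_OK)
            return 1;
        /* one row per link name, pointing at the page carrying the text */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS mandb_links ("
            "  link TEXT,"      /* e.g. "MD2Update" */
            "  target TEXT)",   /* e.g. "MD2Data" */
            NULL, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }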

Improve the performance of makemandb

makemandb is taking a lot of time building the index and computing the weights of all the terms. The main reason is that all the processing steps are currently separated into different C functions, each of which individually accesses the database to fetch data, performs its computation, and then stores the result.

The performance could improve significantly if we joined these separate processing tasks into a single SQL query, as SQLite would then be able to better optimize the query execution plan.

Index the section number of the man pages as well

This is required for several reasons:

  • The output of apropos should also show the section number to which each matched result belongs
  • Indexing the section number will allow searches restricted to specific sections as well (e.g. maybe the user is only searching for standard library functions)
  • It may prove helpful in the future for ranking the results as well.

Problem:

As I look at the man page sources, only the .Dt macro contains the section number. I did try to parse it in makemandb, but the handler for .Dt does not seem to be invoked by libmandoc at all.

Remove call to strlen in lower

The call to strlen in the lower function in apropos-utils.c is needless: the end of the string can simply be detected by checking for the terminating '\0' (see the sketch below).
Thanks to Joerg for the feedback. :)
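A sketch of the idea (not the exact apropos-utils.c code): the loop terminates at the NUL byte, so no separate strlen() pass is needed.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *
    lower(const char *str)
    {
        char *s, *p;

        if ((s = strdup(str)) == NULL)
            return NULL;
        for (p = s; *p != '\0'; p++)    /* the NUL check replaces strlen() */
            *p = tolower((unsigned char)*p);
        return s;
    }

    int
    main(void)
    {
        char *s = lower("APROPOS");

        printf("%s\n", s);    /* prints "apropos" */
        free(s);
        return 0;
    }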

Implement compress/uncompress functions

In order to reduce the size of the FTS database, we need to enable the compress and uncompress options of the FTS module and implement the corresponding compression/decompression functions.

I am going to use zlib for this, with the compress and uncompress functions from that library.
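A hedged sketch of the compress half as a custom SQL function for the FTS4 compress= option (function name "zip" is an example; error handling trimmed; a real version would also need to record the original size somewhere, e.g. prepended to the blob, so the uncompress side knows how much to allocate):

    #include <stdlib.h>
    #include <sqlite3.h>
    #include <zlib.h>

    /* registered as "zip" and named in: CREATE VIRTUAL TABLE mandb
     * USING fts4(..., compress=zip, uncompress=unzip) */
    static void
    zip(sqlite3_context *ctx, int nval, sqlite3_value **apval)
    {
        const Bytef *src = sqlite3_value_blob(apval[0]);
        uLong srclen = sqlite3_value_bytes(apval[0]);
        uLongf dstlen = compressBound(srclen);
        Bytef *dst = malloc(dstlen);

        (void)nval;
        if (dst == NULL || compress(dst, &dstlen, src, srclen) != Z_OK) {
            free(dst);
            sqlite3_result_error_nomem(ctx);
            return;
        }
        /* hand the compressed blob to SQLite; free() is its destructor */
        sqlite3_result_blob(ctx, dst, (int)dstlen, free);
    }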

Implement a stopword tokenizer

SQLite's built-in tokenizers, simple and porter, do not support filtering of stopwords; as a result the index contains stopwords like "the", "a", "and", etc.

These stopwords are usually a source of noise in the search results, and it would be beneficial to prevent them from being indexed.

I have extended the porter stemmer tokenizer from the source of Sqlite (fts3_porter.c) to filter any stopwords it encounters.

I merely added a new function at the end of the source file and a couple of headers at the top. If the SQLite community releases an update to fts3_porter.c in the future, it should be easy to drop it in; that is why I did not change the coding conventions used in the file.

Note: As a side effect, the stopword tokenizer also reduces the overall size of the db by 7MB.

apropos.c: Bring the global variable idf inside main

The IDF for a query needs to be calculated only once, while the other ranking factors, like tf, are calculated for each document. With this intention I had declared idf as a global variable so as to maintain its state. But a better approach is to declare it as a local variable inside main and pass it down as an argument to the search function (while still maintaining its state).
Thanks to Joerg for pointing out the improvement.

Repeated records in the database

I just ran a few sample queries to test the functionality of apropos, and it turned out that certain man pages had been inserted into the database multiple times.

After looking around the man page sources, I found at least two main reasons for this:

  • There are architecture-specific man pages. For example, we have different man pages for boot(8) on different architectures:
/usr/share/man/man8/acorn32/boot.8
/usr/share/man/man8/alpha/boot.8
/usr/share/man/man8/amd64/boot.8
/usr/share/man/man8/amiga/boot.8
/usr/share/man/man8/atari/boot.8
/usr/share/man/man8/cobalt/boot.8
/usr/share/man/man8/dreamcast/boot.8
/usr/share/man/man8/hp300/boot.8
/usr/share/man/man8/hp700/boot.8
/usr/share/man/man8/hpcarm/boot.8
/usr/share/man/man8/hpcmips/boot.8
/usr/share/man/man8/hpcsh/boot.8
/usr/share/man/man8/i386/boot.8
/usr/share/man/man8/mac68k/boot.8
/usr/share/man/man8/macppc/boot.8
/usr/share/man/man8/mvme68k/boot.8
/usr/share/man/man8/next68k/boot.8
/usr/share/man/man8/pmax/boot.8
/usr/share/man/man8/prep/boot.8
/usr/share/man/man8/sgimips/boot.8
/usr/share/man/man8/sparc/boot.8
/usr/share/man/man8/sparc64/boot.8
/usr/share/man/man8/sun2/boot.8
/usr/share/man/man8/sun3/boot.8
/usr/share/man/man8/vax/boot.8
/usr/share/man/man8/x68k/boot.8
/usr/share/man/man8/boot.8
  • Another reason, I believe, is aliases (I don't know what else to call them). For example, we have man pages for csh with the following different names:
/usr/share/man/man1/bg.1
/usr/share/man/man1/csh.1
/usr/share/man/man1/dirs.1
/usr/share/man/man1/fg.1
/usr/share/man/man1/foreach.1
/usr/share/man/man1/history.1
/usr/share/man/man1/jobs.1
/usr/share/man/man1/limit.1
/usr/share/man/man1/popd.1
/usr/share/man/man1/pushd.1
/usr/share/man/man1/rehash.1
/usr/share/man/man1/repeat.1
/usr/share/man/man1/source.1
/usr/share/man/man1/stop.1
/usr/share/man/man1/suspend.1

There might be other possible reasons for this.

Temporary solution

For now, I am thinking of adding a check to the db insertion code to make sure there is not already a record with the same name, to prevent duplicate entries (sketched below).
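A hedged sketch of that check with a prepared statement (table and column names as used elsewhere in these notes; the helper itself is hypothetical):

    #include <sqlite3.h>

    /* returns 1 if a page with this name is already indexed, else 0 */
    static int
    page_exists(sqlite3 *db, const char *name)
    {
        sqlite3_stmt *stmt;
        int found = 0;

        if (sqlite3_prepare_v2(db,
            "SELECT 1 FROM mandb WHERE name = :name", -1,
            &stmt, NULL) != SQLITE_OK)
            return 0;
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
        if (sqlite3_step(stmt) == SQLITE_ROW)    /* at least one row */
            found = 1;
        sqlite3_finalize(stmt);
        return found;
    }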

Add an option to makemandb to forcefully rebuild the database

The default behaviour of makemandb after fixing issue #38 is to update the database with newly installed pages and to remove pages which are no longer on the file system.

But there should be an option allowing the user to forcefully rebuild the database, should there be any inconsistency or other issue.

Replace calls to fgets() with getline()

makemandb.c uses fgets() in its code. It should be replaced with getline():

    while (fgets(line, MAXLINE, file) != NULL) {
        /* remove the newline character from the string */
        line[strlen(line) - 1] = '\0';
        traversedir(line);
    }
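A sketch of the same loop with getline(3), which allocates the buffer itself (no MAXLINE limit) and returns the line length, so stripping the newline needs no strlen() call:

    char *line = NULL;
    size_t linesize = 0;
    ssize_t len;

    while ((len = getline(&line, &linesize, file)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';    /* strip the newline */
        traversedir(line);
    }
    free(line);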

Add option 'o' to makemandb to optimize the database

SQLite supports merging its B-tree indices into one large B-tree to support faster lookups. This is an expensive operation, hence it is being added as an option to makemandb.

Option 'o' would force makemandb to also optimize the database. The performance boost comes at the expense of slightly more disk usage: for example, on my system the database size grew from 42M to 45M.
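A minimal sketch: FTS3/4 exposes the merge as a special 'optimize' command issued through an INSERT on the table itself.

    #include <sqlite3.h>
    #include <stdio.h>

    int
    main(void)
    {
        sqlite3 *db;
        char *errmsg = NULL;

        if (sqlite3_open("apropos.db", &db) != SQLITE_OK)
            return 1;
        /* merge all FTS b-tree segments of mandb into one */
        if (sqlite3_exec(db, "INSERT INTO mandb(mandb) VALUES('optimize')",
            NULL, NULL, &errmsg) != SQLITE_OK) {
            fprintf(stderr, "optimize failed: %s\n", errmsg);
            sqlite3_free(errmsg);
        }
        sqlite3_close(db);
        return 0;
    }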

Handling .Nm macros

The value of the .Nm macro is specified at only one place in the man page source; wherever .Nm is used again in the rest of the page, it is replaced with the previously specified value (that's how I perceive it to be, I might be wrong).

For example

.Nm ls

.Nm is used to display the list of files and directories

this should be parsed as

ls is used to display the list of files and directories.

But the present code leads to an output like this:

is used to display the list of files and directories.

We need to rectify it; otherwise, when performing a search, the page's ranking will suffer because of the smaller number of occurrences of the command name in its own man page.

Normalize section numbers

It is possible for section numbers to be strings like "3f", "3p", etc. instead of single-character strings like "1" or "2". So it is better to normalize the string to only its numeric portion (sketched below).

Also, I should be using the '=' operator in the SQL queries to match section numbers, but for some reason, with the compress option enabled for the FTS table, that does not seem to work, so I had to resort to the LIKE operator. Normalizing the section numbers means I do not need wildcard characters in the LIKE queries.
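A minimal sketch of the normalization, keeping only the leading numeric portion so that "3f" and "3p" both become "3":

    #include <ctype.h>
    #include <stdio.h>

    static void
    normalize_section(char *sec)
    {
        size_t i;

        for (i = 0; isdigit((unsigned char)sec[i]); i++)
            continue;
        sec[i] = '\0';    /* truncate after the numeric prefix */
    }

    int
    main(void)
    {
        char sec[] = "3f";

        normalize_section(sec);
        printf("%s\n", sec);    /* prints "3" */
        return 0;
    }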

Fix parsing of NAME section for man(7) pages

The present code is not good enough to parse all the variations in the NAME section of man(7) pages.

For example, the code does not seem to work well for pages whose NAME section is distributed among multiple nodes (in the tree representation). I don't have a concrete example at hand, but IIRC one of the pages had something like the following in its NAME section:

.SH NAME
.LP
foo-bar
\-
a sample page
...

Also, there are a lot of variations in how the NAME section is defined. It is not limited to a comma-separated list of names, then a '-', and then the one-line description. Some examples I have seen:

.SH NAME
\&    foo-bar \- blah blah blah

.SH NAME
foo-bar \-\- blah blah blah

.SH NAME
foo-bar - blah blah blah

.SH NAME
 foo-bar \- blah blah blah

.SH NAME
foo-bar blah blah blah

Parse additional sections and store them in separate columns [man]

It would be useful, and probably beneficial, to parse additional sections from the page, like LIBRARY, SYNOPSIS, ERRORS, EXIT VALUES, etc., and store them in different columns. This has two main objectives:

  • Improve search by giving each section a different weight depending on its usefulness
  • Display some additional information along with the search results.

Change the option for section specific search

It would be ideal to be able to search within multiple sections without making the command syntax very complex or requiring the user to repeat the same option.

The -s option (#32) was OK for specifying a single section to search in, but for specifying multiple sections, the section numbers themselves are better candidates for options.

So the new syntax should be something like this:

apropos -123 "my search query"

and apropos would look in sections 1, 2 and 3 only.
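A sketch of parsing the digits as options with getopt(3) (option handling only; the query handling is omitted):

    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        int ch, i;
        int sections[10] = { 0 };

        /* each digit is its own flag, so -123 marks sections 1, 2, 3 */
        while ((ch = getopt(argc, argv, "123456789")) != -1) {
            if (ch < '1' || ch > '9')
                return 1;    /* unknown option */
            sections[ch - '0'] = 1;
        }
        for (i = 1; i <= 9; i++)
            if (sections[i])
                printf("will search section %d\n", i);
        return 0;
    }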

libmandoc assertion failures

While running makemandb, the indexing process halts midway because of a libmandoc assertion failure on a particular set of man pages. I believe this is probably caused by a bug in libmandoc, since running mandoc(1) against the faulty man page produces the same assertion failure:

parsing /usr/share/man/man4/atari/floppy.4
assertion "' ' != buf[*pos]" failed: file "/usr/src/external/bsd/mdocml/lib/libmandoc/../../dist/mdoc_argv.c", line 282, function "mdoc_argv"
Abort trap (core dumped)

With mandoc(1):

$ mandoc /usr/share/man/man4/atari/floppy.4
assertion "' ' != buf[*pos]" failed: file "/usr/src/external/bsd/mdocml/lib/libmandoc/../../dist/mdoc_argv.c", line 282, function "mdoc_argv"
Abort trap (core dumped)

Unable to parse escape sequences

The mandoc(3) related code for parsing the man pages in mandocdb.c, in its present state, is unable to identify escape sequences and returns them as they are.

This needs to be fixed.

Precompute the term weights

The new tf-idf based ranking algorithm (see issue #17) has slowed down the search considerably. It might be a better idea to precompute the weights of the different terms to improve the run time of search.

Joerg suggested precomputing the weight of each term beforehand (while building the index) and storing it in the database. This seems like an attractive idea to me, but there are a couple of concerns.

First, here is the code of the ranking function at the moment, to put things in context:

    /* Loop through each phrase in the search query */
    for (iPhrase = 0; iPhrase < nPhrase; iPhrase++) {
        int iCol;    /* Current column */

        /* Now iterate through each column in the user's query. For each
        ** column, increment the relevancy score by:
        **
        **   (<hit count> / <global hit count>) * <column weight>
        **
        ** aPhraseinfo[] points to the start of the data for phrase iPhrase,
        ** so the hit count and global hit count for each column are found
        ** in aPhraseinfo[iCol*3] and aPhraseinfo[iCol*3+1], respectively.
        */
        int *aPhraseinfo = &aMatchinfo[2 + iPhrase * nCol * 3];
        for (iCol = 2; iCol < nCol - 2; iCol++) {
            int nHitCount = aPhraseinfo[3 * iCol];
            int nGlobalHitCount = aPhraseinfo[3 * iCol + 1];
            double weight = sqlite3_value_double(apVal[iCol + 1]);
            int nDocsHitCount = aPhraseinfo[3 * iCol + 2];

            if (nHitCount > 0)
                tf += ((double)nHitCount / nGlobalHitCount) * weight;
            if (nGlobalHitCount > 0)
                idf += log(ndoc / nDocsHitCount) / log(ndoc);
        }
        /* The final score */
        score = tf * idf;
    }

Concerns

  • We need access to per-document statistics (like how many times a term occurs in a particular document), which SQLite provides only while executing a query; the matchinfo() function exposes this kind of information.
  • For corpus-wide statistics, SQLite already provides the fts4aux module (http://www.sqlite.org/fts3.html#fts4aux), a read-only table with the corpus-wide statistics for each unique term in the corpus.

To overcome the first problem, Joerg suggested performing a query for each unique term to get the document-specific statistics.

I still have some concerns about how to use this information when executing the ranking query (for reference, see the code above).

I will play with it and see how well it goes.

Implement a tf-idf based content ranking

tf-idf based weighting schemes are fairly common in Information Retrieval applications.

tf (Term Frequency) is the number of times a particular term from the search query appears in a particular document.
idf (Inverse Document Frequency) is based on the number of documents in which a given term appears:

idf = log(n/ni)
where n = total number of documents in the corpus
and ni = number of documents in which the ith term appears.

Term frequency is a local factor: it is concerned only with the number of occurrences of the search terms in one particular document at a time.

Inverse document frequency is a global factor in the sense that it indicates the discriminating power of a term: if a term appears in only a small set of documents, that term separates this set of documents from the rest.

So a useful way to generate a content-based score for ranking the documents is

tf * idf

It's also a good idea to normalise the two factors.

Term frequency can be optimally normalised by dividing it by the frequency of the most frequent word in the corpus

tf(ti) = Number of occurrences of the ith term / Max(tf)

Inverse document frequency can be normalised by dividing it by the log of the number of documents in the corpus.

idf(ti) = log(n/ni)/log(n)
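A worked sketch of the normalized score in C, with made-up numbers:

    #include <math.h>
    #include <stdio.h>

    /* normalized tf-idf as described above */
    static double
    tfidf(double nhits, double maxtf, double n, double ni)
    {
        double tf = nhits / maxtf;          /* tf(ti) = occurrences / max(tf) */
        double idf = log(n / ni) / log(n);  /* idf(ti) = log(n/ni) / log(n) */

        return tf * idf;
    }

    int
    main(void)
    {
        /* made-up numbers: the term occurs 4 times, the most frequent
         * term 20 times; 7596 documents, the term appears in 12 */
        printf("score = %f\n", tfidf(4, 20, 7596, 12));
        return 0;
    }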

ENOMEM in traversedir() during execution

The function traversedir() in makemandb.c is returning ENOMEM beyond a certain depth. I believe it is due to the recursive code (maybe we are overflowing the process stack).
The current implementation of traversedir() looks like this:

static void
traversedir(const char *file)
{
    struct stat sb;
    struct dirent *dirp;
    DIR *dp;
    char *buf;

    if (stat(file, &sb) < 0) {
        fprintf(stderr, "stat failed: %s", file);
        return;
    }

    /* if it is a regular file, pass it to the parser */
    if (S_ISREG(sb.st_mode)) {
        pmdoc(file);
        printf("parsing %s\n", file);
        if (insert_into_db() < 0)
            fprintf(stderr, "Error indexing: %s\n", file);
        return;
    }

    /* if it is a directory, traverse it recursively */
    else if (S_ISDIR(sb.st_mode)) {
        if ((dp = opendir(file)) == NULL) {
            fprintf(stderr, "opendir error: %s", file);
            return;
        }

        while ((dirp = readdir(dp)) != NULL) {
            /* Avoid . and .. entries in a directory */
            if (strncmp(dirp->d_name, ".", 1)) {
                if (asprintf(&buf, "%s/%s", file, dirp->d_name) == -1) {
                    /* buf is not valid after a failed asprintf; do not
                     * close dp here, or the loop would keep reading
                     * from a closed directory */
                    if (errno == ENOMEM)
                        fprintf(stderr, "ENOMEM\n");
                    continue;
                }
                traversedir(buf);
                free(buf);
            }
        }

        closedir(dp);
    }
    else
        fprintf(stderr, "unknown file type: %s\n", file);
}       

Maybe we can translate it into plain iterative code; a sketch follows.
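A hedged sketch of an iterative traversal using fts(3), which walks the hierarchy without recursing in our own code (the parsing call is reduced to a printf placeholder):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fts.h>
    #include <stdio.h>

    int
    main(void)
    {
        char root[] = "/usr/share/man";
        char *argv[] = { root, NULL };
        FTS *ftsp;
        FTSENT *fe;

        if ((ftsp = fts_open(argv, FTS_PHYSICAL | FTS_NOCHDIR, NULL)) == NULL) {
            perror("fts_open");
            return 1;
        }
        while ((fe = fts_read(ftsp)) != NULL) {
            if (fe->fts_info == FTS_F)    /* a regular file */
                printf("parsing %s\n", fe->fts_path);    /* pmdoc() here */
        }
        fts_close(ftsp);
        return 0;
    }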

Multiple values for the .Nm macro in some man pages

Some manual pages contain multiple names under the .Nm macro. Our code, in its current state, only takes the first value of the .Nm macro; the rest of the values are ignored.

For example, the MD2Data.3 man page has multiple .Nm macros:
.Nm MD2Init ,
.Nm MD2Update ,
.Nm MD2Final ,
.Nm MD2End ,
.Nm MD2File ,
.Nm MD2Data

Problem:

This will lead to multiple rows in the database with the same Name column. Also, all the other names will be lost, resulting in poor quality of the search results.

Possible Solution:

Currently the code for the .Nm macro looks like this:

    if (n->sec == SEC_NAME && n->child->type == MDOC_TEXT) {
        if ((name = strdup(n->child->string)) == NULL) {
            fprintf(stderr, "Memory allocation error");
            return;
        }
    }

Instead of directly picking up the first text value available, we should loop over the sibling nodes and store all the names in an array, as sketched below.
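A self-contained toy model of that loop, using a minimal stand-in for the mdoc_node sibling chain (string/next fields only) just to show the shape of the walk:

    #include <stdio.h>

    struct node {               /* stand-in for struct mdoc_node */
        const char *string;
        struct node *next;
    };

    int
    main(void)
    {
        struct node n3 = { "MD2Final", NULL };
        struct node n2 = { "MD2Update", &n3 };
        struct node n1 = { "MD2Init", &n2 };
        struct node *p;

        /* collect every .Nm value instead of stopping at the first */
        for (p = &n1; p != NULL; p = p->next)
            printf("name: %s\n", p->string);
        return 0;
    }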

Add support to recognize section specific keywords in the query

I think one possible way of improving the search results could be to scan the user's query, try to recognize whether they are looking for results from specific sections, and give more weight to results found in those sections.

For example:

apropos "system call for file status"

Here the phrase 'system call' gives a clear indication that the user is looking for the stat(2) system call and not the stat(1) utility.

Use sqlite3_exec to reduce the unnecessary C code

sqlite3_exec [http://www.sqlite.org/c3ref/exec.html] is a utility function which, in most cases, can be used to reduce a lot of repetitive SQLite-related C code. Replace the elaborate SQLite code with sqlite3_exec where possible; a sketch follows.
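A minimal sketch: sqlite3_exec() folds the prepare/step/finalize cycle into one call, invoking a callback once per result row (table and column names as used elsewhere in these notes):

    #include <sqlite3.h>
    #include <stdio.h>

    static int
    print_row(void *arg, int ncol, char **values, char **names)
    {
        int i;

        (void)arg;
        for (i = 0; i < ncol; i++)
            printf("%s = %s\n", names[i], values[i] ? values[i] : "NULL");
        return 0;    /* non-zero would abort the query */
    }

    int
    main(void)
    {
        sqlite3 *db;
        char *errmsg = NULL;

        if (sqlite3_open("apropos.db", &db) != SQLITE_OK)
            return 1;
        if (sqlite3_exec(db, "SELECT name, section FROM mandb LIMIT 5",
            print_row, NULL, &errmsg) != SQLITE_OK) {
            fprintf(stderr, "query failed: %s\n", errmsg);
            sqlite3_free(errmsg);
        }
        sqlite3_close(db);
        return 0;
    }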

Add a pager

It should be possible to display more than 10 results at a time and page through them with a pager like more or less.

I am adding a new option 'p' to apropos for this. It means the user wants to view more than 10 results and page through them; the default pager is more.
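A minimal sketch of piping the output through the pager with popen(3):

    #include <stdio.h>

    int
    main(void)
    {
        FILE *pager;
        int i;

        if ((pager = popen("more", "w")) == NULL)    /* default pager */
            return 1;
        for (i = 0; i < 100; i++)
            fprintf(pager, "result %d\n", i);    /* search results here */
        pclose(pager);
        return 0;
    }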

Unable to parse automatically generated man pages

Problem:

Some of the man pages are automatically generated and have markup different from the other man pages. On encountering such pages, our code gives up and moves on to the next man page in the list.

We need to find a way to be able to parse these man pages as well.

Examples of some of the automatically generated pages:
cc.1
addr2line.1
ar.1
as.1
c++.1
c++filt.1
cccp.1
gdb.1
lex.1

Fix parsing of .Nd macro in mdoc(7) pages

Passing over the nodes iteratively in the NAME section (specifically for the .Nd macro) causes many documents to be missed, so, as suggested by Kristaps, the solution is to pass over the nodes recursively and collect data from all the child nodes in the section.

Implement a stop words filter for search

I believe the quality of search will improve drastically if we implement a stop word filter in apropos.c to filter out stop words (a, an, the, to, etc.). I noticed that SQLite's FTS does not do this for us, so if the search query contains stop words, SQLite will try to find matches for them as well.

Implementation Idea:

I think we can create a hash table containing all the stop words we would like to filter out. When the user enters the query, we scan it one word at a time in a loop and look each word up in the hash table. If a word is found there, it is a stop word and we remove it from the query.

I think we can use the hcreate(3), hsearch(3) and hdestroy(3) functions; a sketch follows.
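A minimal sketch with hcreate(3)/hsearch(3); the word lists are made up:

    #include <search.h>
    #include <stdio.h>

    int
    main(void)
    {
        char *stopwords[] = { "a", "an", "the", "is", "how", NULL };
        char *query[] = { "how", "to", "copy", "a", "file", NULL };
        ENTRY e;
        int i;

        hcreate(64);    /* sized generously for the stopword list */
        for (i = 0; stopwords[i] != NULL; i++) {
            e.key = stopwords[i];
            e.data = NULL;
            hsearch(e, ENTER);
        }
        for (i = 0; query[i] != NULL; i++) {
            e.key = query[i];
            if (hsearch(e, FIND) == NULL)    /* keep only non-stopwords */
                printf("%s ", query[i]);
        }
        printf("\n");    /* prints: to copy file */
        hdestroy();
        return 0;
    }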

Avoid using global variables

makemandb.c uses three char * global variables to hold the data extracted from the parsed man pages.

After a man page is parsed, we call the db insertion function (insert_into_db), which inserts the data into the database and then frees the memory used by the variables.

Global variables are a convenience, but I believe that in the longer run they will be more pain than anything.

Problem:

In the traversedir() function, the following bits are relevant:

    if (S_ISREG(sb.st_mode)) {
        pmdoc(file);
        printf("parsing %s\n", file);
        if (insert_into_db() < 0)
            fprintf(stderr, "Error indexing: %s\n", file);
        return;
    }

pmdoc is called to parse the man page. When it is done, the parsed data is stored in three global variables declared at the top:

static char *name = NULL; // for storing the name of the man page
static char *name_desc = NULL; // for storing the one line description (.Nd)
static char *desc = NULL; // for storing the DESCRIPTION section

After pmdoc finishes, insert_into_db() is called, which simply inserts these values as a row in the db table.

The global variables could be avoided by declaring them locally inside the traversedir function and passing them as (char **) parameters to the parsing-related functions, which are:

static void pmdoc_Nd(const struct mdoc_node *);
static void pmdoc_Sh(const struct mdoc_node *);

We could have passed a (char **) parameter, but libmandoc requires these functions to have exactly this signature, and only these functions have access to the parsed data from the different sections of the man pages, so global variables seem to be a necessity.

DB Size too big

After implementing the code for parsing man(7) pages (#25), I ran makemandb again. There were now 7596 documents in the index, which made the size of the DB grow drastically.

Previously, with 3000 documents, the size was 23M, but now it had grown to 99M.

It turned out that most of the space was taken by the mandb_weights table, which stored the weight of each term in each individual document in which it occurred; it had around 1,222,500 records.

So the most straightforward way out was to remove the code related to the precomputation of weights, drop the mandb_weights table, and do the weight computation on the fly while searching.

Large deviation in pre-computed weights and weights computed on the fly

After fixing issue #18, I observed a degradation in the quality of the search results. The obvious reason was that there were bugs in the code computing the term weights stored in the database.

Some of the problems were:

SQLite stems words using the Porter stemming algorithm before indexing them, so we need to perform the same stemming when fetching the precomputed weights for the terms in the user's query. For this I added a new source file, porter_stemmer.c, which implements the Porter stemming algorithm.

Besides this, there were issues in the calculation of the tf and idf weights.

Parse additional sections and decompose them into separate columns [mdoc]

It would be useful, and probably beneficial, to parse additional sections from the page, like LIBRARY, SYNOPSIS, ERRORS, EXIT VALUES, etc., and store them in different columns. This has two main objectives:

  • Improve search by giving each section a different weight depending on its usefulness
  • Display some additional information along with the search results.

Add support for incremental updates

Adding support for updating the database as new man pages are installed was on the TODO list of my proposal. Joerg's suggestion was to add support for incremental updates.

  • The idea is to first traverse the file hierarchy and cache the md5 hash of every man page in a temporary table.
  • Then add to the index the pages whose md5 hashes the database does not yet have.
  • And drop the pages whose md5 hashes the database has but which are not present in the temporary table (most likely those pages were removed). A sketch of the two diff queries follows this list.
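A hedged sketch of the two diff queries (the temporary table name mandb_tmp and its columns are assumptions):

    #include <sqlite3.h>

    int
    main(void)
    {
        sqlite3 *db;

        if (sqlite3_open("apropos.db", &db) != SQLITE_OK)
            return 1;
        /* pages on disk whose hash the index does not know yet */
        sqlite3_exec(db,
            "SELECT file FROM mandb_tmp WHERE md5_hash NOT IN "
            "(SELECT md5_hash FROM mandb)", NULL, NULL, NULL);
        /* indexed pages whose hash no longer exists on disk */
        sqlite3_exec(db,
            "DELETE FROM mandb WHERE md5_hash NOT IN "
            "(SELECT md5_hash FROM mandb_tmp)", NULL, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }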

makemandb returns an error if the database already contains the table

makemandb tries to open the database (apropos.db); if it doesn't exist, it creates it, and if it exists, it opens it. After opening the database, it creates a new virtual table (mandb).

The problem occurs if there is already a database file with an existing mandb table. In this case the SQLite call fails and our program terminates.

We need to handle this in a better way.

Possible Solutions:

  • If we encounter an error while creating the table, we could assume that the only possible cause is the existence of the table, run a "drop table" statement, and then try to create the table again. One possible drawback of this strategy is that we may end up in a cycle of creating and dropping tables if the cause of the error was not an already existing table but something else.
  • The other solution is to simply delete the database file when makemandb starts. This seems a much cleaner way to handle things; maybe I will go for this (sketched below).
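A minimal sketch of the second option: remove the old database file up front, treating only a real unlink failure as an error:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        if (unlink("apropos.db") == -1 && errno != ENOENT) {
            perror("unlink");    /* a real error, not just "no db yet" */
            return 1;
        }
        /* ... proceed with sqlite3_open() and CREATE VIRTUAL TABLE ... */
        return 0;
    }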

Compiler errors

The following compiler errors were reported by clang with warning=4 (thanks to Joerg for testing):

rm -f .gdbinit
echo "set solib-absolute-prefix /home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64" > .gdbinit
#   compile  apropos_replacement/makemandb.o
/home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2  -std=gnu99  -Wno-sign-compare -Wno-pointer-sign  -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare  -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare  -Wpointer-sign  -Werror    --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS  -c    makemandb.c
makemandb.c:437:10: error: assigning to 'char *' from 'const char [73]' discards qualifiers [-Werror]
                sqlstr = "insert into mandb values (:section, :name, :name_desc, :desc, :md5_hash)";
                       ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
makemandb.c:538:9: error: assigning to 'char *' from 'const char [99]' discards qualifiers [-Werror]
        sqlstr = "create virtual table mandb using fts4(section, name, name_desc, desc, \
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
makemandb.c:519:6: error: unused variable 'idx' [-Werror,-Wunused-variable]
        int idx = -1;
            ^
makemandb.c:630:9: error: assigning to 'char *' from 'const char [47]' discards qualifiers [-Werror]
        sqlstr = "select * from mandb where md5_hash = :md5_hash";
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4 errors generated.

*** Failed target:  makemandb.o
*** Failed command: /home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2 -std=gnu99 -Wno-sign-compare -Wno-pointer-sign -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare -Wpointer-sign -Werror --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c makemandb.c
*** Error code 1 (continuing)
#   compile  apropos_replacement/sqlite3.o
/home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2  -std=gnu99  -Wno-sign-compare -Wno-pointer-sign  -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare  -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare  -Wpointer-sign  -Werror    --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS  -c    sqlite3.c
sqlite3.c:13900:5: error: initializing 'char *' with an expression of type 'const char [10]' discards qualifiers [-Werror]
    FUNCTION(julianday,        -1, 0, 0, juliandayFunc ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:191:1: note: instantiated from:
"julianday"
^~~~~~~~~~~
sqlite3.c:13901:5: error: initializing 'char *' with an expression of type 'const char [5]' discards qualifiers [-Werror]
    FUNCTION(date,             -1, 0, 0, dateFunc      ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:192:1: note: instantiated from:
"date"
^~~~~~
sqlite3.c:13902:5: error: initializing 'char *' with an expression of type 'const char [5]' discards qualifiers [-Werror]
    FUNCTION(time,             -1, 0, 0, timeFunc      ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:193:1: note: instantiated from:
"time"
^~~~~~
sqlite3.c:13903:5: error: initializing 'char *' with an expression of type 'const char [9]' discards qualifiers [-Werror]
    FUNCTION(datetime,         -1, 0, 0, datetimeFunc  ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:194:1: note: instantiated from:
"datetime"
^~~~~~~~~~
sqlite3.c:13904:5: error: initializing 'char *' with an expression of type 'const char [9]' discards qualifiers [-Werror]
    FUNCTION(strftime,         -1, 0, 0, strftimeFunc  ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:195:1: note: instantiated from:
"strftime"
^~~~~~~~~~
sqlite3.c:13905:5: error: initializing 'char *' with an expression of type 'const char [13]' discards qualifiers [-Werror]
    FUNCTION(current_time,      0, 0, 0, ctimeFunc     ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:196:1: note: instantiated from:
"current_time"
^~~~~~~~~~~~~~
sqlite3.c:13906:5: error: initializing 'char *' with an expression of type 'const char [18]' discards qualifiers [-Werror]
    FUNCTION(current_timestamp, 0, 0, 0, ctimestampFunc),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:197:1: note: instantiated from:
"current_timestamp"
^~~~~~~~~~~~~~~~~~~
sqlite3.c:13907:5: error: initializing 'char *' with an expression of type 'const char [13]' discards qualifiers [-Werror]
    FUNCTION(current_date,      0, 0, 0, cdateFunc     ),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:9381:45: note: instantiated from:
   SQLITE_INT_TO_PTR(iArg), 0, xFunc, 0, 0, #zName, 0, 0}
                                            ^
<scratch space>:198:1: note: instantiated from:
"current_date"
^~~~~~~~~~~~~~
sqlite3.c:18954:17: error: assigning to 'char *' from 'const char [4]' discards qualifiers [-Werror]
          bufpt = "NaN";
                ^ ~~~~~
sqlite3.c:18966:21: error: assigning to 'char *' from 'const char [5]' discards qualifiers [-Werror]
              bufpt = "-Inf";
                    ^ ~~~~~~
sqlite3.c:18968:21: error: assigning to 'char *' from 'const char [5]' discards qualifiers [-Werror]
              bufpt = "+Inf";
                    ^ ~~~~~~
sqlite3.c:18970:21: error: assigning to 'char *' from 'const char [4]' discards qualifiers [-Werror]
              bufpt = "Inf";
                    ^ ~~~~~
sqlite3.c:19103:17: error: assigning to 'char *' from 'const char [1]' discards qualifiers [-Werror]
          bufpt = "";
                ^ ~~
sqlite3.c:19122:29: error: assigning to 'char *' from 'const char *' discards qualifiers [-Werror]
        if( isnull ) escarg = (xtype==etSQLESCAPE2 ? "NULL" : "(NULL)");
                            ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:25124:8: error: assigning to 'char *' from 'const char [1]' discards qualifiers [-Werror]
  zErr = "";
       ^ ~~
sqlite3.c:54835:48: error: passing 'const char [16]' to parameter of type 'char *' discards qualifiers [-Werror]
            get4byte(&pBt->pPage1->aData[36]), "Main freelist: ");
                                               ^~~~~~~~~~~~~~~~~
sqlite3.c:54486:9: note: passing argument to parameter 'zContext' here
  char *zContext        /* Context for error messages */
        ^
sqlite3.c:54846:38: error: passing 'const char [21]' to parameter of type 'char *' discards qualifiers [-Werror]
    checkTreePage(&sCheck, aRoot[i], "List of tree roots: ", NULL, NULL);
                                     ^~~~~~~~~~~~~~~~~~~~~~
sqlite3.c:54570:9: note: passing argument to parameter 'zParentContext' here
  char *zParentContext, /* Parent context */
        ^
sqlite3.c:57919:13: error: assigning to 'char *' from 'const char [7]' discards qualifiers [-Werror]
        zP4 = "(blob)";
            ^ ~~~~~~~~
sqlite3.c:60916:13: error: initializing 'char *' with an expression of type 'const char [1]' discards qualifiers [-Werror]
      = {0, "", (double)0, {0}, 0, MEM_Null, SQLITE_NULL, 0,
            ^~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

*** Failed target:  sqlite3.o
*** Failed command: /home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2 -std=gnu99 -Wno-sign-compare -Wno-pointer-sign -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare -Wpointer-sign -Werror --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c sqlite3.c
*** Error code 1 (continuing)
#   compile  apropos_replacement/apropos.o
/home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2  -std=gnu99  -Wno-sign-compare -Wno-pointer-sign  -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare  -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare  -Wpointer-sign  -Werror    --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS  -c    apropos.c
apropos.c:80:9: error: assigning to 'char *' from 'const char [173]' discards qualifiers [-Werror]
        sqlstr = "select section, name, snippet(mandb, \"\033[1m\", \"\033[0m\", \"...\" )"
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
apropos.c:131:23: error: initializing 'char *' with an expression of type 'const char [2]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                             ^~~
apropos.c:131:28: error: initializing 'char *' with an expression of type 'const char [6]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                  ^~~~~~~
apropos.c:131:37: error: initializing 'char *' with an expression of type 'const char [5]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                           ^~~~~~
apropos.c:131:45: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                                   ^~~~~
apropos.c:131:52: error: initializing 'char *' with an expression of type 'const char [3]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                                          ^~~~
apropos.c:131:58: error: initializing 'char *' with an expression of type 'const char [8]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                                                ^~~~~~~~~
apropos.c:131:69: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                                                           ^~~~~
apropos.c:131:76: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        char *stopwords[] = {"a", "about", "also", "all", "an", "another", "and", "are", 
                                                                                  ^~~~~
apropos.c:132:2: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
        ^~~~~
apropos.c:132:9: error: initializing 'char *' with an expression of type 'const char [3]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
               ^~~~
apropos.c:132:15: error: initializing 'char *' with an expression of type 'const char [3]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                     ^~~~
apropos.c:132:21: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                           ^~~~~
apropos.c:132:28: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                                  ^~~~~
apropos.c:132:35: error: initializing 'char *' with an expression of type 'const char [5]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                                         ^~~~~~
apropos.c:132:43: error: initializing 'char *' with an expression of type 'const char [5]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                                                 ^~~~~~
apropos.c:132:51: error: initializing 'char *' with an expression of type 'const char [6]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                                                         ^~~~~~~
apropos.c:132:60: error: initializing 'char *' with an expression of type 'const char [4]' discards qualifiers [-Werror]
        "how", "is", "or", "the", "how", "what", "when", "which", "why", NULL};
                                                                  ^~~~~
apropos.c:248:14: error: assigning to 'int *' from 'unsigned int *' converts between pointers to integer types with different sign [-Werror,-Wpointer-sign]
  aMatchinfo = (unsigned int *)sqlite3_value_blob(apVal[0]);
             ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

*** Failed target:  apropos.o
*** Failed command: /home/joerg/work/NetBSD/obj/cvs/tools/bin/x86_64--netbsd-clang -O2 -std=gnu99 -Wno-sign-compare -Wno-pointer-sign -Wall -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-sign-compare -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Wsign-compare -Wpointer-sign -Werror --sysroot=/home/joerg/work/NetBSD/obj/cvs/amd64/destdir.amd64 -I/home/joerg/work/NetBSD/cvs/src/external/bsd/mdocml/dist -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_FTS3_PARENTHESIS -c apropos.c
*** Error code 1 (continuing)
`all' not remade because of errors.
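For makemandb.c and apropos.c, these are -Wwrite-strings violations: a string literal has type const char[], so the pointer it is assigned to must be const-qualified. A sketch of the fix:

    #include <stdio.h>

    int
    main(void)
    {
        /* before: char *sqlstr; -- discards the literal's const qualifier */
        const char *sqlstr =
            "insert into mandb values (:section, :name, :name_desc, "
            ":desc, :md5_hash)";
        /* likewise: const char *stopwords[] = { "a", "about", ... }; */
        printf("%s\n", sqlstr);
        return 0;
    }

The sqlite3.c errors come from the bundled SQLite amalgamation itself; those would need the warning relaxed for that file rather than source changes.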

Bug in porter_stemmer.c

There is a blunder in the stem_word() function in porter_stemmer.c:

  1. It might not return a value in some cases.
  2. On encountering a number as the value of the string parameter, it runs into an infinite loop.

I have fixed (1) temporarily (this just allows the code to compile); it is still possible to hit some unexpected problems.

There will still be weird behaviour on encountering a number, if not an infinite loop.

A better solution would be to replace porter_stemmer.c with libstemmer, which provides a better interface for stemming individual words.

Improve term weighting as per Salton & Buckley 1988

Salton and Buckley wrote a seminal paper on term weighting in text retrieval [0], which analysed different term weighting schemes and their effects.

As per that paper, the following scheme for generating term weights was the most effective.

For the weight w(i, j) of the jth term in the ith document:

w(i, j) = ((log(tf(i, j)) + 1) * idf(j)) / sqrt( sum over k = 1..t of [(log(tf(i, k)) + 1) * idf(k)]^2 )

where
w(i, j) = weight of term j in document i
tf(i, j) = term frequency of term j in document i
idf(j) = inverse document frequency of term j
and the sum in the denominator runs over all t terms of document i (cosine normalization).

I am implementing the same thing, but in a slightly less complicated manner to avoid computational overhead, and it seems to perform very well.

My implementation:

w(i, j) = (tf(i, j) * idf(j)) / sum over k = 1..t of [tf(i, k) * idf(k)]

Actually, the mandb_weights table already contains the term weights in the form (tf * idf), so the only thing I need to do while calculating the rank is to perform the division in the formula quoted above (a worked sketch follows).
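A worked sketch of that division, with made-up tf*idf values for three query terms:

    #include <stdio.h>

    int
    main(void)
    {
        /* made-up precomputed tf*idf products for three query terms */
        double w[] = { 0.42, 0.10, 0.25 };
        double sum = 0.0;
        size_t i, n = sizeof(w) / sizeof(w[0]);

        for (i = 0; i < n; i++)
            sum += w[i];
        for (i = 0; i < n; i++)    /* the normalized weights sum to 1 */
            printf("w[%zu] = %.4f\n", i, w[i] / sum);
        return 0;
    }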

[0]: Salton and Buckley 1988, paper in PDF

Make the size of the buffer as an explicit argument to concat

As per Joerg's feedback, the idea is to make the size of the source buffer an explicit argument to concat. He alternatively suggested resizing the buffers in larger chunks. I don't completely understand the second idea, so I am going with the first one, which seems like a good idea.
