Comments (16)

artunit commented on June 19, 2024

I think one approach to this is to use Solr's Data Import Handler to leverage the database but pull the full text from the file system; the FileReaderDataSource is sort of an example of this.

molikd commented on June 19, 2024

We are going to be switching to a Postgres endpoint, so the Data Import Handler might work well. Eric?

artunit commented on June 19, 2024

Solr normally uses a timestamp column to calculate delta entries, and in SQLite this could be maintained with a trigger. Postgres would work well for this too; it also has an awesome foreign data wrapper.

artunit commented on June 19, 2024

I managed to get my laptop trapped in transit between a repair shop and the campus before the start of the pandemic, so these examples are from a borrowed Windows machine. I used the SQLite table id as the identifier here, but this should be whatever meaningful identifier is possible with the dataset (sha looks like it has duplicates?). So here a full import would use the identifiers from the articles table to determine which file to import. I thought this could be done with JSON files, but I think it has to be done with XML inputs. There are some Solr-specific functions we would want to position the data for anyway, so I have some Python code for setting up the XML input; basically the idea is for the database to be linked with the indexing (to allow "looping through"). A Solr delta-import, which is typically what you set up in cron or whatever, would keep the index in sync (see the SQL triggers below).

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource name="ddr1" 
       url="jdbc:sqlite:C:/util/shared/DistantReader/test/test.db" 
       driver="org.sqlite.JDBC"/>
  <dataSource name="ddr2" type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- processor for database interactions -->
    <entity name="edr1" dataSource="ddr1" pk="id" processor="SqlEntityProcessor" 
         query="SELECT id from articles"
         deltaQuery="SELECT id from articles WHERE timeStamp > '${dataimporter.last_index_time}'"
         deletedPkQuery="SELECT deleted_id as id from deletes WHERE timeStamp > '${dataimporter.last_index_time}'">
        <!-- processor for filesystem interactions -->
        <entity name="edr2" dataSource="ddr2" 
             processor="XPathEntityProcessor" useSolrAddSchema="true" 
             stream="true" forEach="/doc" 
             url="C:\tmp\xml\${edr1.id}.xml"/> 
    </entity>
  </document>
</dataConfig>
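
For what it's worth, here is a minimal sketch of the kind of script that could produce those per-article XML files in Solr's add-doc layout (this is not the actual Python code mentioned above; the database path, output directory, and column list are assumptions):

import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

DB_PATH = "test.db"      # assumed path to the SQLite database
OUT_DIR = Path("xml")    # assumed output directory matching url="...${edr1.id}.xml"

OUT_DIR.mkdir(exist_ok=True)
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row

for row in conn.execute("SELECT id, title, journal, date, abstract, doi FROM articles"):
    # useSolrAddSchema expects Solr's <add><doc><field name="..."/></doc></add> layout
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for column in row.keys():
        field = ET.SubElement(doc, "field", name=column)
        field.text = "" if row[column] is None else str(row[column])
    ET.ElementTree(add).write(OUT_DIR / f"{row['id']}.xml",
                              encoding="utf-8", xml_declaration=True)

conn.close()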

There is a deletes table and I added 2 triggers to the database schema:

CREATE TABLE articles (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        sha       TEXT,
        title     TEXT,
        journal   TEXT,
        date      TEXT,
        abstract  TEXT,
        doi       TEXT,
        timeStamp DATE
);
CREATE TRIGGER insert_article_timeStamp AFTER INSERT ON articles
BEGIN
  UPDATE articles SET timeStamp = DATETIME('NOW')
  WHERE rowid = new.rowid;
END;
CREATE TABLE deletes (
        id        INTEGER PRIMARY KEY,
        deleted_id INTEGER,
        timeStamp DATE
);
CREATE TRIGGER insert_after_delete AFTER DELETE ON articles
BEGIN
  INSERT INTO deletes (deleted_id,timeStamp) VALUES (OLD.id,DATETIME('NOW'));
END;
CREATE TABLE authors (
        sha     TEXT,
        author  TEXT
);
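
A quick way to sanity-check the two triggers, assuming the schema above has been loaded into a throwaway copy of the database (the file name here is made up):

import sqlite3

conn = sqlite3.connect("scratch.db")  # assumed throwaway copy of the schema above

# the AFTER INSERT trigger should stamp the new row with the current time
conn.execute("INSERT INTO articles (sha, title) VALUES ('deadbeef', 'test article')")
print(conn.execute("SELECT id, timeStamp FROM articles WHERE sha = 'deadbeef'").fetchall())

# the AFTER DELETE trigger should record the deleted id in the deletes table
conn.execute("DELETE FROM articles WHERE sha = 'deadbeef'")
print(conn.execute("SELECT deleted_id, timeStamp FROM deletes").fetchall())

conn.commit()
conn.close()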

The last time I was involved with something like this, the Solr index was built entirely outside of the DataImportHandler, and delta-imports were used to keep ongoing, and typically more minor, updates in sync.
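
The cron side can be a plain HTTP request against the dataimport handler; a hedged sketch, with the host, port, and core name as placeholders:

from urllib.request import urlopen

# delta-import picks up rows whose timeStamp is newer than the last index time,
# plus anything recorded in the deletes table (per the deltaQuery/deletedPkQuery above)
SOLR_DIH = "http://localhost:8983/solr/cord/dataimport"  # host, port, and core are assumptions

with urlopen(SOLR_DIH + "?command=delta-import&commit=true") as response:
    print(response.read().decode("utf-8"))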

artunit commented on June 19, 2024

Sorry to be slow on the uptake; I received an email message for each comment. I don't think much could touch SQLite in terms of ease of deployment. Postgres has some unique options for leveraging Solr capabilities at the SQL level, but it might be overkill for this purpose. You definitely don't need Postgres for the Data Import Handler, which in itself is really just a way to keep a Solr index in sync with a database. I will go through a setup on the shared file system; I wanted to sort out the data handler stuff on my laptop because I tend to lean a lot on the web-based Solr admin interface for setting something like this up. The data handler doesn't impact the index definition at all; it can be as faceted and as extensive as Solr schemas allow.

artunit commented on June 19, 2024

"…let's use Solr to index the content of a database I've already filled. It is an SQLite database, and it is located on our shared file system"

I ran through a simple example on the shared file system. I used an XML version of the records, but I didn't realize the full text of the content was now included in the database. SQLite could be the source of all of the content in the index(es) if that's where all of the content from the dataset ends up.
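
A quick check of what the database already carries (the path is an assumption) is just to list the columns of the articles table:

import sqlite3

conn = sqlite3.connect("test.db")  # assumed path to the shared SQLite database

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk)
for column in conn.execute("PRAGMA table_info(articles)"):
    print(column[1], column[2])

conn.close()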

artunit commented on June 19, 2024

Sorry, it's a long weekend here in Canada and I am just catching up on today's email. The Solr instance is in /export/scratch/solr/solr-8.5.1, and I ran the import again with the PlainTextEntityProcessor. I misread the source column in the database last night and thought it was the full text, but I think the full text for Solr should come from the file system (since you already have it nicely processed), and everything else could come directly from the database. I am running this version of Solr on port 8984, and you can see the results of the import with:

curl http://localhost:8984/solr/cord/dataimport?command=status

It took just over 11 minutes, but that's only the full text. The status report claims there are no skipped documents, but there are documents that did not resolve to a file. The current config for the data handler stuff is in /export/scratch/solr/solr-8.5.1/server/solr/cord/conf/DIHconfigfile.xml. The field layout still needs to be worked out, as well as the faceting and so on, but Solr works well with a database for this kind of thing, and I don't think the indexing will be terribly onerous. I haven't worked on optimizing Solr at all, and I have never done much with SolrCloud or running multiple servers.
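
Since the status report doesn't flag them, a small script along these lines could list the rows that did not resolve to a file (the database path and the full-text directory layout are assumptions):

import sqlite3
from pathlib import Path

DB_PATH = "test.db"          # assumed path to the SQLite database
TEXT_DIR = Path("fulltext")  # assumed directory of processed full-text files, one per id

conn = sqlite3.connect(DB_PATH)
missing = [row[0] for row in conn.execute("SELECT id FROM articles")
           if not (TEXT_DIR / f"{row[0]}.txt").exists()]
conn.close()

print(f"{len(missing)} articles have no full-text file")
for article_id in missing:
    print(article_id)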

ericleasemorgan commented on June 19, 2024

Art, great work. 'More real soon....

ericleasemorgan commented on June 19, 2024

I think we can consider this "done"; we are now successfully able to create study carrels against subsets of CORD.
