Comments (16)
I think one approach to this is to use Solr's Data Import Handler to leverage the database but pull the full text from the file system; the FileReaderDataSource is sort of an example of this.
We are going to be switching to a Postgres endpoint, so the Data Import Handler might work well. Eric?
Solr normally uses a timestamp column to calculate delta entries; this could be maintained with a trigger in SQLite. Postgres would work well for this too, and it also has an awesome foreign data wrapper.
I managed to get my laptop trapped in transit between a repair shop and the campus before the start of the pandemic, so these examples are from a borrowed Windows machine. I used the SQLite table id as the identifier here, but this should be whatever meaningful identifier is possible with the dataset (sha looks like it has duplicates?). So here a full import would use the identifiers from the articles table to determine which file to import.

I thought this could be done with JSON files, but I think it has to be done with XML inputs. There are some Solr-specific functions that I think we would want to position the data for anyway, so I have some Python code for setting up the XML input; basically, the idea here is for the database to be linked with the indexing (to allow "looping through"), and a sketch of such a script follows the config below. A Solr delta-import, which is typically what you set up in cron or whatever, would keep the index in sync (see the SQL triggers below).
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource name="ddr1" type="JdbcDataSource"
              url="jdbc:sqlite:C:/util/shared/DistantReader/test/test.db"
              driver="org.sqlite.JDBC"/>
  <dataSource name="ddr2" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- processor for database interactions -->
    <entity name="edr1" dataSource="ddr1" pk="id" processor="SqlEntityProcessor"
            query="SELECT id FROM articles"
            deltaQuery="SELECT id FROM articles WHERE timeStamp > '${dataimporter.last_index_time}'"
            deletedPkQuery="SELECT deleted_id AS id FROM deletes WHERE timeStamp > '${dataimporter.last_index_time}'">
      <!-- processor for file system interactions -->
      <entity name="edr2" dataSource="ddr2"
              processor="XPathEntityProcessor" useSolrAddSchema="true"
              stream="true" forEach="/doc"
              url="C:\tmp\xml\${edr1.id}.xml"/>
    </entity>
  </document>
</dataConfig>
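The Python I mention isn't shown here, so the following is only a minimal sketch of the kind of script that could stage the XML inputs: it writes one Solr add-format file per row of the articles table, which is what the XPathEntityProcessor entity above (useSolrAddSchema="true") expects to find at C:\tmp\xml\${edr1.id}.xml. The fulltext directory and its sha-based file names are assumptions.

import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

DB  = "C:/util/shared/DistantReader/test/test.db"  # same file as dataSource ddr1
OUT = Path("C:/tmp/xml")                           # matches the entity url above
TXT = Path("C:/tmp/fulltext")                      # assumed location of plain-text files

def add_field(doc, name, value):
    # skip NULL columns so empty fields aren't indexed
    if value is not None:
        ET.SubElement(doc, "field", name=name).text = str(value)

OUT.mkdir(parents=True, exist_ok=True)
con = sqlite3.connect(DB)
for id_, sha, title, journal, date, abstract, doi in con.execute(
        "SELECT id, sha, title, journal, date, abstract, doi FROM articles"):
    # one <add><doc>...</doc></add> document per article row
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in [("id", id_), ("sha", sha), ("title", title),
                        ("journal", journal), ("date", date),
                        ("abstract", abstract), ("doi", doi)]:
        add_field(doc, name, value)
    txt = TXT / f"{sha}.txt"                       # assumed: one file per article, named by sha
    if sha and txt.exists():
        add_field(doc, "fulltext", txt.read_text(encoding="utf-8"))
    ET.ElementTree(add).write(OUT / f"{id_}.xml",
                              encoding="utf-8", xml_declaration=True)
con.close()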
There is a deletes table, and I added two triggers to the database schema:
CREATE TABLE articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sha TEXT,
    title TEXT,
    journal TEXT,
    date TEXT,
    abstract TEXT,
    doi TEXT,
    timeStamp DATE
);

CREATE TRIGGER insert_article_timeStamp AFTER INSERT ON articles
BEGIN
    UPDATE articles SET timeStamp = DATETIME('NOW')
    WHERE rowid = new.rowid;
END;

CREATE TABLE deletes (
    id INTEGER PRIMARY KEY,
    deleted_id INTEGER,
    timeStamp DATE
);

CREATE TRIGGER insert_after_delete AFTER DELETE ON articles
BEGIN
    INSERT INTO deletes (deleted_id, timeStamp) VALUES (old.id, DATETIME('NOW'));
END;

CREATE TABLE authors (
    sha TEXT,
    author TEXT
);
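One gap worth noting: the two triggers cover inserts and deletes, but if a row in articles is ever edited, timeStamp is not refreshed, so the deltaQuery in the config would never pick the change up. A third trigger in the same style would close that gap (a sketch, not part of the original schema; SQLite's recursive_triggers pragma is off by default, so the body's UPDATE does not re-fire the trigger):

CREATE TRIGGER update_article_timeStamp AFTER UPDATE ON articles
BEGIN
    -- refresh the timestamp so the deltaQuery sees edited rows
    UPDATE articles SET timeStamp = DATETIME('NOW')
    WHERE rowid = new.rowid;
END;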
The last time I was involved with something like this, the Solr index was built totally outside of the DataImportHandler, and delta-imports were used to keep ongoing, and typically more minor, updates in sync.
Sorry to be slow on the uptake; I received an email message for each comment. I don't think much could touch SQLite in terms of ease of deployment. Postgres has some unique options to leverage Solr capabilities at the SQL level, but it might be overkill for this purpose. You definitely don't need Postgres for the Data Import Handler, which in itself is really just a way to keep a Solr index in sync with a database.

I will go through a setup on the shared file system; I wanted to sort out the data handler stuff on my laptop because I tend to lean a lot on the web-based Solr admin interface for setting something like this up. The data handler doesn't impact the index definition at all; it can be as faceted and as extensive as Solr schemas allow.
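For example (illustrative only; no schema has been settled on in this thread), the fields from the articles table could be laid out in schema.xml along these lines, with docValues on the fields meant for faceting:

<!-- illustrative field layout, not an actual schema from this project -->
<field name="id"       type="string"       indexed="true" stored="true" required="true"/>
<field name="title"    type="text_general" indexed="true" stored="true"/>
<field name="journal"  type="string"       indexed="true" stored="true" docValues="true"/>
<field name="date"     type="string"       indexed="true" stored="true" docValues="true"/>
<field name="abstract" type="text_general" indexed="true" stored="true"/>
<field name="doi"      type="string"       indexed="true" stored="true"/>
<field name="fulltext" type="text_general" indexed="true" stored="false"/>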
...let's use Solr to index the content of a database I've already filled. It is an SQLite database, and it is located on our shared file system.
I ran through a simple example on the shared file system. I used an XML version of the records, but I didn't realize the full text of the content was now included in the database. SQLite could be the source of all of the content in the index(es) if that's where all of the content from the dataset ends up.
Sorry, it's a long weekend here in Canada and I am just catching up on today's email. The Solr instance is in /export/scratch/solr/solr-8.5.1, and I ran the import again with the PlainTextEntityProcessor. I misread the source column in the database last night and thought it was the full text, but I think the full text for Solr should come from the file system (since you already have it nicely processed), and everything else could come directly from the database. I am running this version of Solr on port 8984, and you can see the results of the import with:
curl http://localhost:8984/solr/cord/dataimport?command=status
It took just over 11 minutes, but that's only the full text. The status report claims there are no skipped documents, but there are documents which did not resolve to a file. The current config for the data handler stuff is in /export/scratch/solr/solr-8.5.1/server/solr/cord/conf/DIHconfigfile.xml. The field layout still needs to be worked out, as well as the faceting and so on, but Solr works well with a database for this kind of thing, and I don't think the indexing will be terribly onerous. I haven't worked on optimizing Solr at all, and I have never done much with SolrCloud or running multiple servers.
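For what it's worth, the delta-import mentioned earlier would hit the same endpoint with a different command; a cron entry along these lines would keep the index in sync (the 15-minute schedule is just an assumption):

# run a DIH delta-import against the cord core every 15 minutes
*/15 * * * * curl -s "http://localhost:8984/solr/cord/dataimport?command=delta-import&clean=false" > /dev/null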
Art, great work. 'More real soon....
I think we can consider this "done"; we are now successfully able to create study carrels against subsets of CORD.