
gleanerio / gleaner


Gleaner: JSON-LD and structured data on the web harvesting

Home Page: https://gleaner.io

License: Apache License 2.0

Languages: Go 87.10%, Makefile 0.71%, Dockerfile 0.24%, Shell 6.66%, Python 0.21%, HTML 2.93%, CSS 2.14%
Topics: rdf, schema-org, semantic-web

gleaner's People

Contributors: ashepherd, dependabot[bot], fils, nein09, valentinedwv


gleaner's Issues

Async approaches

To further bring Gleaner into orchestration environments, we need an async approach for queuing resources to process.

To stay with native Go, I am looking at https://github.com/mcmathja/curlyq and wrapping the consumer in a goroutine with a semaphore count/limit.

A bit heavier, with a Redis dependency, is https://github.com/hibiken/asynq.

I prefer the curlyq native Go approach but might try to evaluate both.

Basically this would replace Gleaner's main and perhaps be the "millstone" code. It would set up the consumer, producer, and queue. The consumer would then simply send a resource item ([]string) to the miller.

The producer might be a variant of the summoner that runs periodically (once a day, for example) with a set of provider URLs and a rolling date window (i.e., the last 24 hours in this case).
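
As a rough sketch of the consumer side (deliberately not the curlyq API; processResource and the channel-based queue are assumptions for illustration), a goroutine wrapped with a counting semaphore could look like this:

package main

import (
	"fmt"
	"sync"
)

// processResource stands in for handing a resource off to the miller;
// the name and signature are assumptions for this sketch.
func processResource(url string) {
	fmt.Println("milling", url)
}

// consume drains the queue, limiting concurrent work with a counting
// semaphore so downstream services are not overwhelmed.
func consume(queue <-chan string, limit int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for url := range queue {
		sem <- struct{}{} // acquire a slot
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			processResource(u)
		}(url)
	}
	wg.Wait()
}

func main() {
	queue := make(chan string, 2)
	go func() {
		queue <- "https://example.org/dataset/1"
		queue <- "https://example.org/dataset/2"
		close(queue)
	}()
	consume(queue, 3)
}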

Sitemap improvements

The code in fence.gleaner.io can correctly parse sitemaps based on a threshold date. This code now needs to be ported over into Gleaner.

CIDs for missing @id in some graphs

Based on a discussion in ESIP's schema.org Slack channel, I am noting a possible enhancement:

  • Take a data graph and look for certain (all?) branches of the main data graph.
  • Using approaches like JSON-LD frames that result in a subgraph, we can validate with a SHACL shape. If valid, normalize and then generate a CID (sha256 plus sugar, I guess); see the sketch after this list.
  • Use this for the @id.
  • Recursive SHA calculation issue? The CID would only be on the body of the data graph, it would seem; it's a property of the graph.
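
A minimal sketch of the hash step, assuming the json-gold processor the project already uses; the urn:gleaner:id prefix and the helper name are made up, and a true CID would add multihash/multibase wrapping on top of the sha256:

package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"

	"github.com/piprate/json-gold/ld"
)

// normSHA returns a sha256 over the URDNA2015-normalized form of a JSON-LD
// document; this is the value that could back a generated @id.
func normSHA(jsonld string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(jsonld), &doc); err != nil {
		return "", err
	}
	proc := ld.NewJsonLdProcessor()
	opts := ld.NewJsonLdOptions("")
	opts.Algorithm = "URDNA2015"
	opts.Format = "application/n-quads"
	norm, err := proc.Normalize(doc, opts)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256([]byte(norm.(string)))), nil
}

func main() {
	doc := `{"@context": {"@vocab": "https://schema.org/"}, "name": "example"}`
	h, err := normSHA(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("urn:gleaner:id:" + h) // hypothetical @id pattern
}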


Better sitemap support

Need to be able to deal with multi-sitemap sites (sitemap index -> sitemaps).

Also inspect the XML for updates, and perhaps support logging of resources to enable modification-date logic for deciding whether to index. This could also help with, or be connected to, PROV milling.
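
For the multi-sitemap case, a small sketch of detecting a sitemap index and collecting its child sitemap URLs with the standard library (the URL in main is just an example):

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// sitemapIndex models a <sitemapindex> document that points at child sitemaps.
type sitemapIndex struct {
	Sitemaps []struct {
		Loc     string `xml:"loc"`
		LastMod string `xml:"lastmod"`
	} `xml:"sitemap"`
}

// childSitemaps fetches a URL and, if it is a sitemap index, returns the
// child sitemap locations; an empty result suggests it was a plain sitemap.
func childSitemaps(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	var idx sitemapIndex
	if err := xml.Unmarshal(body, &idx); err != nil {
		return nil, err
	}
	var locs []string
	for _, s := range idx.Sitemaps {
		locs = append(locs, strings.TrimSpace(s.Loc)) // tolerate padded <loc> values
	}
	return locs, nil
}

func main() {
	locs, err := childSitemaps("https://samples.earth/sitemap.xml") // example URL
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range locs {
		fmt.Println(l)
	}
}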

Remove HTML in string literals

Some people are putting HTML in the descriptive text. We could remove this with something like the following gist.

I'm not really sure yet how I feel about modifying the documents that others author, though, especially something like the descriptions.

gist link: https://gist.github.com/g10guang/04f11221dadf1ed019e0d3cf3e82caf3

package utils

import (
	"regexp"
	"sort"
	"strings"
)

// match html tag and replace it with ""
func RemoveHtmlTag(in string) string {
	// regex to match html tag
	const pattern = `(<\/?[a-zA-Z]+?[^>]*\/?>)*`
	r := regexp.MustCompile(pattern)
	groups := r.FindAllString(in, -1)
	// should replace long string first
	sort.Slice(groups, func(i, j int) bool {
		return len(groups[i]) > len(groups[j])
	})
	for _, group := range groups {
		if strings.TrimSpace(group) != "" {
			in = strings.ReplaceAll(in, group, "")
		}
	}
	return in
}
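
A quick usage check for the helper above (the import path is hypothetical; it just assumes the gist lives in a local utils package):

package main

import (
	"fmt"

	"example.com/gleaner/utils" // hypothetical module path for the utils package above
)

func main() {
	desc := "<p>Radar data from <b>Siple Dome</b>, Antarctica.</p>"
	// Expected output: "Radar data from Siple Dome, Antarctica."
	fmt.Println(utils.RemoveHtmlTag(desc))
}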

context location in new config

gleaner_base.yaml
was:
./schemaorg-current-https.jsonld

needs to be
./configs/local/schemaorg-current-https.jsonld

or maybe just pull down and put in
./configs/schemaorg-current-https.jsonld

issue with headless chrome

One call to headless chrome happens, then the run stops.

2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full 
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$ 

The source

- sourcetype: sitemap
  name: r2r
  logo: https://www.rvdata.us/images/Logo.4b1519be.png
  url: https://service-dev.rvdata.us/api/sitemap/
  headless: true
  pid: http://www.re3data.org/repository/r3d100010735
  propername: Rolling Deck to Repository Program (R2R)
  domain: https://www.rvdata.us/
  active: true
  credentialsfile: ""
  other: {}

console from run

Using gleaner config file: /home/ubuntu/glcon/configs/geocodes/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/geocodes/nabu
batch called
2022/03/03 17:05:06 Building organization graph.
2022/03/03 17:05:06 The specified bucket does not exist.
2022/03/03 17:05:06 Sitegraph(s) processed
2022/03/03 17:05:06 Summoner start time: 2022-03-03 17:05:06.353164344 +0000 UTC m=+0.184392504 
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:06:20 We are not a sitemap index, check to see if we are a sitemap
2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full 
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$ 

glcon config - add sourcesSource

Add a block to servers/config.yaml to tell glcon config where to pull sources from. While there is a --sources flag, the dataset does not normally change, so this is a better way to configure it.

default:

sourcesSource:
   type: csv
   getFrom: sources.csv

add

sourcesSource:
   type: googlesheetcsv
   getFrom: csvdownloadurl
sourcesSource:
   type: excel
   getFrom: filename with sheet
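
A sketch of how glcon could read such a block, assuming viper (which the project already uses for config); the struct fields and file path are assumptions:

package main

import (
	"fmt"
	"log"

	"github.com/spf13/viper"
)

// sourcesSource mirrors the proposed YAML block above; field names are assumptions.
type sourcesSource struct {
	Type    string `mapstructure:"type"`
	GetFrom string `mapstructure:"getFrom"`
}

func main() {
	v := viper.New()
	v.SetConfigFile("servers.yaml") // path assumed for this sketch
	if err := v.ReadInConfig(); err != nil {
		log.Fatal(err)
	}
	var ss sourcesSource
	if err := v.UnmarshalKey("sourcesSource", &ss); err != nil {
		log.Fatal(err)
	}
	switch ss.Type {
	case "csv":
		fmt.Println("load sources from local CSV:", ss.GetFrom)
	case "googlesheetcsv":
		fmt.Println("download CSV export from:", ss.GetFrom)
	case "excel":
		fmt.Println("read Excel workbook:", ss.GetFrom)
	default:
		log.Fatalf("unknown sourcesSource type %q", ss.Type)
	}
}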

Gleaner does not fully implement incremental indexing correctly at this time.

At present Gleaner is not doing incremental indexing (--mode diff) correctly.

While Gleaner does know not to re-index URLs previously retrieved (summoned) in this mode, it currently does not know whether the associated object has been updated and hence stored as a new object under an updated SHA hash. This can result in extraneous objects in the object store when multiple full indexes are done. These could then get synchronized to the triplestore or other indexes, resulting in extra or older resources sitting alongside the newer ones.

Also, the sitegraph workflow, unlike the sitemap workflow, is not yet (I believe) leveraging the --mode diff (i.e., incremental) indexing, so this will need to be added.

One or both of these points is likely the cause of this extra object getting in.


Previous work

Previously, I was leveraging the generated prov records to address this via S3Select calls. This had the benefit of using existing tooling, so no additional technical debt. Unfortunately, this approach is far too slow as the resource count begins to grow. For some emerging work with a couple of communities, where the scale will approach a million records and more, it was never going to scale well.

Bolt KV store

To address this, I have already started to integrate a KV store to hold a record of the previously visited resources, which can then be used to check for and skip such records. At this time, that capacity has been merged into the dev (and master) branches and should be working and fully replacing the S3Select-based approach.

Note that this does generate a gleaner.db file during the run, so this file should be noted in the development of the Docker files.

Note that if this file is lost or removed, all that is lost is the record that supports incremental indexing, not data or other information. One would simply have to do a full index again to rebuild it and restore the capacity for incremental indexing.

Existing limitation to be addressed

The second major issue right now is that the code does not take into account that URLs might be removed. So a "prune" option needs to be added that will look for URLs in the warehouse that are not in the domain's sitemap and remove them, along with the associated object that was downloaded from that now-deleted URL.

  • Code to do this prune operation is the next development work; several support functions for it are already done.

Regarding sitegraphs

The sitegraph concept can be implemented in the same manner as a sitemap. However, since it is one large file, any change will result in a new hash, so this is a case where the sitegraph approach has operational implications that a more fine-grained approach like sitemaps does not suffer from.

Pattern thoughts

Just a few notes, more for myself, on the steps the code needs to take when running in full, incremental, and prune modes.

(inc)
if sitemap url in kv:
	skip
else:
	index
	put sitemap url in kv


(full)
if sitemap url in kv:
	index
	delete old kv value (object sha)
	replace kv value (object sha)
else:
	index
	put url in kv


(prune)
if kvurl in sitemap:
	skip
else:
	delete associated object
	delete kvurl entry
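
To make the (inc) branch concrete, here is a minimal sketch against bbolt of the check-then-record step. The bucket name "summoned" and the stored value are assumptions; a real implementation would store the object SHA as the value so the (full) branch can compare and replace it.

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// shouldSummon reports whether a sitemap URL needs to be fetched on an
// incremental (--mode diff) run: anything already recorded in the KV
// store is skipped, anything new is indexed and recorded.
func shouldSummon(db *bolt.DB, url string) (bool, error) {
	summon := false
	err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("summoned"))
		if err != nil {
			return err
		}
		if b.Get([]byte(url)) != nil {
			return nil // seen before: skip on an incremental run
		}
		summon = true
		return b.Put([]byte(url), []byte("seen")) // value could hold the object SHA
	})
	return summon, err
}

func main() {
	db, err := bolt.Open("gleaner.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	ok, err := shouldSummon(db, "https://example.org/data/1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("summon?", ok)
}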

type organization support

Look at the domain of each provider (which may be different from the data distribution domain) and add a check for schema.org type Organization in the index page.

load this into the graph as a connection to the prov graph?

Get JSON-LD context files by URL

Like we do with the SHACL shapes, we should optionally download the JSON-LD context by allowing the user to provide the URL for resolving the context in the config file, rather than having them provide it in the file system.
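
A minimal sketch of the download step, assuming the context URL and destination path come from the config (the schema.org URL here is just an example of what a user might supply):

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

// fetchContext downloads a JSON-LD context document and caches it locally,
// so the user only has to supply a URL in the config rather than a file.
func fetchContext(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("context fetch %s: %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return os.WriteFile(dest, body, 0644)
}

func main() {
	// URL and destination are examples; both would come from the config file.
	err := fetchContext("https://schema.org/version/latest/schemaorg-current-https.jsonld",
		"./configs/local/schemaorg-current-https.jsonld")
	if err != nil {
		log.Fatal(err)
	}
}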

sitemap: no headless if 404

If an item does not exist, then headless is still called.

2021/12/02 20:31:09 Direct access failed, trying headless for  https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Software-ERDDAP-SDO.JSON 
2021/12/02 20:31:09 Direct access failed, trying headless for  https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Service-IRIS-fsdnEvent-JSON.json 
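
A sketch of the intended check: bail out before the headless fallback when the direct fetch comes back with a 4xx, since there is nothing for Chrome to render (the print just marks where the existing headless call would go):

package main

import (
	"fmt"
	"log"
	"net/http"
)

// directThenMaybeHeadless decides whether a headless render is worth trying.
// A 4xx response (404 in particular) means the resource does not exist, so
// calling headless Chrome would just waste a render.
func directThenMaybeHeadless(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 && resp.StatusCode < 500 {
		return fmt.Errorf("%s returned %s; skipping headless", url, resp.Status)
	}
	if resp.StatusCode != http.StatusOK {
		fmt.Println("direct access failed, trying headless for", url) // hand-off point
	}
	return nil
}

func main() {
	url := "https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Software-ERDDAP-SDO.JSON"
	if err := directThenMaybeHeadless(url); err != nil {
		log.Println(err)
	}
}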

Self signed certs

Review to see if this is an issue

(base) ➜ run_polar git:(master) curl https://www.polardata.ca/metadata.json
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

We may need to catch and report the error better (since, if this is the case, I don't do anything at this time).
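
A sketch of catching and reporting the certificate error explicitly, with a place where an opt-in skip-verify flag (an assumption, presumably per source in the config) could hook in:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// fetchReportingTLS surfaces self-signed / unknown-CA failures as their own
// error instead of a generic fetch error.
func fetchReportingTLS(url string, insecure bool) error {
	client := &http.Client{
		Transport: &http.Transport{
			// insecure would come from a per-source config flag (an assumption here).
			TLSClientConfig: &tls.Config{InsecureSkipVerify: insecure},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		var certErr x509.UnknownAuthorityError
		if errors.As(err, &certErr) {
			return fmt.Errorf("self-signed or unknown CA for %s: %w", url, err)
		}
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	if err := fetchReportingTLS("https://www.polardata.ca/metadata.json", false); err != nil {
		fmt.Println(err)
	}
}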

Review IndexNow

Need to review https://www.indexnow.org/faq

If we want PUSH notifications, this would be the way to do it. However, it is a bit at odds with the current structured-data-on-the-web approach, and diffing sitemaps is not that hard (I mean, we are doing it).

So I worry that at the scale Gleaner normally works at this is a bit of a digression (at best) or regression (at worst).

It needs to be reviewed though, and it would be a neat side tool that could then invoke Gleaner with a set of URLs. Bypassing the sitemaps et al. and providing a URL or set of URLs to index is something I felt might be nice for testing purposes or for developers/publishers to have. That aspect would not be hard to implement in the Gleaner code.

The "indexnow" server would then be separate from that and would be responsible for calling Gleaner.

Provide option to skip validation

The task is to scrape for biomedical markup data; the data extends the schema.org schema, and therefore we don't need the validation that Gleaner does.

Check stack at start of (batch... other) commands.

We had a misconfiguration issue.

A 'Bucket not found' error appeared because stack changes meant an https redirect was occurring.
The code should have just stopped.

batch and other commands that require access to the stack need to run check before fully executing.

We need additional options:

  • --cron: running in a cronjob; if there is an error during check, don't stop
  • --nocheck: do not run check
  • --noheadlesscheck: do not check to see if headless Chrome is running; good for development and probably tutorials

getNormSHA produces a blank string for certain documents

This one is puzzling me. In my logs for crawling http://nsidc.org/, I have a bunch of non-identical JSON-LD objects which are getting the same hash generated for them. I poked around and figured out that this is because proc.Normalize (line 38 in calcShaNorm.go) is generating an empty string, and when you calculate the SHA of a bunch of identical empty strings, it's going to be the same.

logger: acquire.go:206: #4 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2553 for http://nsidc.org/data/NSIDC-0051/versions/1
logger: acquire.go:219: #4 thread for http://nsidc.org/data/NSIDC-0051/versions/1 
logger: acquire.go:206: #14 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2495 for http://nsidc.org/data/NSIDC-0076/versions/1
logger: acquire.go:219: #14 thread for http://nsidc.org/data/NSIDC-0076/versions/1 
logger: acquire.go:206: #31 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3046 for http://nsidc.org/data/NSIDC-0037/versions/1
logger: acquire.go:219: #31 thread for http://nsidc.org/data/NSIDC-0037/versions/1 
logger: acquire.go:206: #15 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3667 for http://nsidc.org/data/NSIDC-0042/versions/1

Here's the config to crawl that site:

- name: nsidc
  url: https://nsidc.org/sitemap.xml
  headless: false
  properName: National Snow and Ice Data Center
  domain: https://nsidc.org

AND, also, they have their context specified with no trailing slash, and not https, so you need to add this to contextmaps:

- prefix: "http://schema.org"
  file: "./schemaorg-current-https.jsonld"

Is that a clue, there? Is json-gold not able to normalize a json-ld object that is set up this way?

I'm also finding that once I am able to get unique JSON-LD objects for each of the AADC sites in their sitemap*, it only generates 3 different SHAs for the whole set of them. I haven't looked into that much further.

  • because each page has two json-ld files, one which is for the organization, and one which is for the actual data page itself. I'm working on a change that finds only the relevant json-ld and grabs just that one, but it's kind of tricky, and I don't know how broadly applicable it is.

robots.txt

Gleaner should also read a robots.txt file to obtain a sitemap (optionally based on the agent string) and also read a delay value for harvesting.
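
A minimal sketch of pulling those two hints from robots.txt (not agent-specific; the site URL in main is an example):

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// robotsHints extracts the declared sitemaps and a crawl delay from a
// site's robots.txt.
func robotsHints(site string) (sitemaps []string, delay time.Duration, err error) {
	resp, err := http.Get(strings.TrimRight(site, "/") + "/robots.txt")
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(strings.ToLower(line), "sitemap:"):
			sitemaps = append(sitemaps, strings.TrimSpace(line[len("sitemap:"):]))
		case strings.HasPrefix(strings.ToLower(line), "crawl-delay:"):
			if secs, e := strconv.Atoi(strings.TrimSpace(line[len("crawl-delay:"):])); e == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
	}
	return sitemaps, delay, scanner.Err()
}

func main() {
	maps, delay, err := robotsHints("https://www.rvdata.us")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("sitemaps:", maps, "delay:", delay)
}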

spatial index review...

We need to review the spatial indexer.
There may be some issues creeping in related to resources with multiple types, like IEDA with BBOX and POINT.

Some searches are giving odd results; this may also be related to old items in the index. We need to rebuild the geohash index from scratch.

SIGSEGV - segmentation violation

Hi,

I have an interesting issue with the 'gleaner' command, which is a static exe file.

The gleaner command compiled from source has the same SIGSEGV runtime panic message as the v2.0.25 and v2.0.22 builds I downloaded from GitHub here.

The messages are:

v2.0.22

main.go:30: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

v2.0.25

main.go:34: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

self compiled version

main.go:35: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

I tried to execute the gleaner command with and without root permission; the results are the same.
Does anyone have, or has anyone had, similar issues in the past and a hint to solve this?

OS: Ubuntu 20.04
GOLANG: go version go1.13.8 linux/amd64
Environment variables: GLEANER_BASE=/tmp/gleaner GLEANER_OBJECTS=/tmp/gleaner/datavol/s3 GLEANER_GRAPH=/tmp/gleaner/datavol/graph DATAVOL=/tmp/gleaner/datavol and directories exists.

Thanks in advance, Andreas

Bucket ID in config file not used

This has never been coded in. It makes running test indexing or indexing on AWS difficult.

We just need to track the item in the config through to the various locations in the code.

Error in nqtoNTctx with tika

When running the Tika miller, we get an error from time to time with nqToNTctx. Likely due to Unicode characters?

glcon fixes

glcon config init
wiped out the config files, copying from configs/{cfgName} rather than configs/template.

glcon gleaner batch
needs to be able to run a single source to enable quick configuration testing.

Modify main to allow selection of one source or sitemap

This will be useful in cases where we want one config file but still need to be able to select a specific source to index via a flag.

Good for testing, but also for scheduled indexing where frequency might change from source to source.
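
A sketch of the selection itself, with a trimmed-down source struct standing in for the real config type and --source as a hypothetical flag name:

package main

import (
	"flag"
	"fmt"
)

// source is a trimmed-down stand-in for the config struct.
type source struct {
	Name string
	URL  string
}

// filterSources keeps only the named source when --source is given,
// otherwise it returns everything (the current behavior).
func filterSources(all []source, name string) []source {
	if name == "" {
		return all
	}
	var out []source
	for _, s := range all {
		if s.Name == name {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	one := flag.String("source", "", "index only this source name")
	flag.Parse()
	all := []source{
		{Name: "r2r", URL: "https://service-dev.rvdata.us/api/sitemap/"},
		{Name: "nsidc", URL: "https://nsidc.org/sitemap.xml"},
	}
	fmt.Println(filterSources(all, *one))
}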

Headless rework

r2r has long rendering issues. The present code waits for DOMContentEventFired,
but that is not long enough. I added a headlesswait and thread sleep, but even then that does not get everything.

This uses mafredri/cdp ("github.com/mafredri/cdp/protocol/network"), so it looks like a starting point:
https://github.com/Aleksandr-Kai/articles_parser/blob/18c4cd2c90600e0eb7b628853a3959995e514dbd/pkg/browserapp/browser.go

Two changes:

  1. Get the wait-until-network-idle / page rendering correct.
  2. Add testing so that we can verify it.

I think it would be hard to test, as the page renderer puts items into minio directly.

Grab URLs from r2r and IEDA:
https://dev.rvdata.us/search/fileset/100748
https://dev.rvdata.us/search/fileset/101773

Miller is dropping dataset metadata but keeping things like person, organization and logo for NSIDC

I'm not sure what's going on here, but I'm trying to index the NSIDC, and running into this. Any help or insight would be appreciated - at this point, I don't know if the bug is in Gleaner, or elsewhere.

I'm able to get json-ld out that looks like this, for example:

{
   "@context": { 
            "vocab": "https://schema.org/" 
     },
    "@graph": [
        {
            "@type": "Dataset",
            "provider": {
                "@type": "Organization",
                "name": "NSIDC: National Snow and Ice Data Center",
                "url": "https://nsidc.org",
                "logo": {
                    "@type": "ImageObject",
                    "representativeOfPage": "True",
                    "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
                    "width": "60 px",
                    "height": "60 px"
                }
            },
            "@id": "NSIDC-0303",
            "name": "Radar Investigations of Antarctic Ice Stream Margins, Siple Dome, 1998, Version 1",
            "version": "1",
            "description": "This data set consists of surface-based radar measurements, including geometry of the bed, surface, and internal layers, and bed reflectivity measurements at two sites along ice stream margins at Siple Dome, Antarctica. The research is a radar examination of bed reflection characteristics and internal layer geometry in two inter-ice-stream ridges, the Shabtaie Ridge (Ridge D/E) and the Engelhardt Ridge (Ridge B/C), and across margins with the adjacent ice streams, the MacAyeal Ice Stream (Ice Stream E) and the Whillans Ice Stream (Ice Stream B). Investigators collected these radar data from 14 November through 13 December 1998. Data are in Microsoft Word, PDF, ASCII text, MATLAB, binary, and various image formats. Investigators have also provided code for MATLAB routines that they used to view the radar data. Data are available via FTP.",
            "temporalCoverage": "1998-11-14 00:00:00 to 1998-12-13 00:00:00",
            "spatialCoverage": "N: -80.1678, S: -83.3528, E: -138.3697, W: -141.6722",
            "identifier": "https://doi.org/10.7265/N52B8VZP",
            "keywords": "Radar \u0026gt; Radar Reflectivity \u0026gt; Bed Reflectivity, Radar \u0026gt; Radar Imagery \u0026gt; Bed, Surface, and Internal Layer Geometry",
            "author": {
                "@type": "Person",
                "name": [
                    "Nadine Nereson",
                    "Charles Raymond"
                ]
            },
            "publisher": {
                "@type": "Organization",
                "@id": "https://nsidc.org",
                "name": "National Snow and Ice Data Center",
                "url": "https://nsidc.org",
                "logo": {
                    "@type": "ImageObject",
                    "representativeOfPage": "True",
                    "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
                    "width": "60 px",
                    "height": "60 px"
                }
            },
            "url": "https://nsidc.org/data/NSIDC-0303/versions/1"
        }
    ]
}

But what comes out of the miller for this metadata is missing a lot of the stuff I want, namely the Dataset and its description and all that good stuff. I just get the logo and the authors.

<https://nsidc.org> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
<https://nsidc.org> <http://schema.org/logo> _:bc932hau7ho5ajldeot1g .
<https://nsidc.org> <http://schema.org/name> "National Snow and Ice Data Center" .
<https://nsidc.org> <http://schema.org/url> <https://nsidc.org> .
_:bc932hau7ho5ajldeot1g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot1g <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot1g <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot1g <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot1g <http://schema.org/width> "60 px" .
_:bc932hau7ho5ajldeot20 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Nadine Nereson" .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Charles Raymond" .
_:bc932hau7ho5ajldeot2g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:bc932hau7ho5ajldeot2g <http://schema.org/logo> _:bc932hau7ho5ajldeot30 .
_:bc932hau7ho5ajldeot2g <http://schema.org/name> "NSIDC: National Snow and Ice Data Center" .
_:bc932hau7ho5ajldeot2g <http://schema.org/url> <https://nsidc.org/data/NSIDC-0303/versions/1> .
_:bc932hau7ho5ajldeot30 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot30 <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot30 <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot30 <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot30 <http://schema.org/width> "60 px" .

glcon rename servers.yaml generate.yaml

Pondering renaming servers.yaml to generate.yaml or configure.yaml to reflect that it is used to generate the config files, and will allow for adding other items.

sourcemaps:
    type: csv
    file:
    url:
gleaner:
    runid: pattern_{{date}}

build config files for Gleaner, Nabu from CSV/Google Sheet/Excel

With many repositories, it can be a slight pain when the config needs to be updated.
A spreadsheet can be used to manage a list of sites, which can also be used to render an HTML page.

So: write a Python tool to generate the Gleaner and Nabu configs.
It could also be used in the future to generate configs from a database or a website.

Object store based indexing prov

This is an issue to document a new approach for incremental indexing that will remove the boltdb in favor of an object-store-based approach.

  • Create: /sitemap/[name]/sitemaplatest.xml
  • We still will not know what URLs in a sitemap had JSON-LD (only prov knows that)
  • Need functions like record() and diff(). Do these as interfaces for a change (for the better).

When we pull down a new sitemap, we can get the array of URLs from it, compare it to the last stored sitemap in the bucket, and then only index the diff for incremental runs. The failure point is that we don't know if a URL has updated or newly added JSON-LD metadata; however, that is what the occasional full index is for, versus the incremental indexes.

Sequence-wise we would:

  • pull down the new sitemap
  • compare to the current sitemap in the object store
  • get the diff and pass it to the summoner (see the sketch after this list)
  • store the new XML and remove/rename the older sitemap (depending on how much history we wish to store)
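
The diff() piece is just a set difference over the URL arrays; a minimal sketch (record() would presumably be the step that stores the new sitemap XML in the bucket):

package main

import "fmt"

// diff returns the URLs present in the new sitemap but absent from the
// previously stored one; only these go to the summoner on an incremental run.
func diff(newURLs, oldURLs []string) []string {
	seen := make(map[string]struct{}, len(oldURLs))
	for _, u := range oldURLs {
		seen[u] = struct{}{}
	}
	var added []string
	for _, u := range newURLs {
		if _, ok := seen[u]; !ok {
			added = append(added, u)
		}
	}
	return added
}

func main() {
	old := []string{"https://example.org/a", "https://example.org/b"}
	cur := []string{"https://example.org/a", "https://example.org/b", "https://example.org/c"}
	fmt.Println(diff(cur, old)) // [https://example.org/c]
}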

EarthCube resource registry: Gleaner hits some issues

It looks like it does in a prov call and an object2rdf call.
Also, milled files are getting named with a .json extension, but they are .rdf files.

configuration
ecrr.zip

ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 18:17:06 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 18:17:06 Building organization graph.
2021/11/09 18:17:06 Miller start time: 2021-11-09 18:17:06.905400154 +0000 UTC m=+0.198000722 
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr_examples
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr_examples
  37% |███████████████                            | (107/289, 100 it/s) [0s:1s]2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
  78% |█████████████████████████████████          | (228/289, 113 it/s) [1s:0s]2021/11/09 18:17:08 invalid character 'u' looking for beginning of value
2021/11/09 18:17:08 obj2RDF invalid character 'u' looking for beginning of value
 100% |███████████████████████████████████████████| (289/289, 152 it/s)        
2021/11/09 18:17:08 Assembling result graph for prefix: summoned/ecrr to: milled/ecrr
2021/11/09 18:17:08 Result graph will be at: results/rr1/ecrr_graph.nq
2021/11/09 18:17:08 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
   0% |                                               | (0/0, 0 it/min) [0s:0s]
2021/11/09 18:17:10 Assembling result graph for prefix: summoned/ecrr_examples to: milled/ecrr_examples
2021/11/09 18:17:10 Result graph will be at: results/rr1/ecrr_examples_graph.nq
2021/11/09 18:17:10 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
2021/11/09 18:17:10 Miller end time: 2021-11-09 18:17:10.184253951 +0000 UTC m=+3.476854554 
2021/11/09 18:17:10 Miller run time: 0.054648 

namespace in XML sitemaps

So

<ns0:sitemapindex xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
		<loc> https://geoconnex.us/sitemap/namespaces/nmwdi/nmwdi_ose_ids__0.xml </loc>
		<lastmod> 2021-10-19 19:25:50.620909 </lastmod>

is valid but doesn't work

Currently the sitemap parsing works with

<?xml version='1.0' encoding='utf-8'?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://samples.earth/sitemap0.xml</loc>
  <lastmod>2021-04-11</lastmod>
  </sitemap>
</sitemapindex>

The namespace approach is valid XML though, so this needs to be resolved.

Issue with sitemap

Gleaner is not pulling in this sitemap: https://github.com/earthcube/GeoCODES-Metadata/blob/main/sitemap.xml

It reads it, but nothing is created/summoned.

ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 19:48:17 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 19:48:17 Building organization graph.
2021/11/09 19:48:18 Sitegraph(s) processed
2021/11/09 19:48:18 Summoner start time: 2021-11-09 19:48:18.096928621 +0000 UTC m=+0.172947508 
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false  Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false  Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 We are not a sitemap index, check to see if we are a sitemap
2021/11/09 19:48:18 geocodes_examples sitemap size is : 4 queuing: 4 mode: full 
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false  Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false  Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 Thread count 5 delay 0
 100% |████████████████████████████████████████████████| (4/4, 15 it/s)
2021/11/09 19:48:18 Wrote log size 547
2021/11/09 19:48:20 Summoner end time: 2021-11-09 19:48:20.160551705 +0000 UTC m=+2.236570592 
2021/11/09 19:48:20 Summoner run time: 0.034394 
2021/11/09 19:48:20 Miller start time: 2021-11-09 19:48:20.160614155 +0000 UTC m=+2.236633059 
2021/11/09 19:48:20 Adding bucket to milling list: summoned/geocodes_examples
2021/11/09 19:48:20 Adding bucket to prov building list: prov/geocodes_examples
   0% |                                               | (0/0, 0 it/min) [0s:0s]
2021/11/09 19:48:20 Assembling result graph for prefix: summoned/geocodes_examples to: milled/geocodes_examples
2021/11/09 19:48:20 Result graph will be at: results/rr1/geocodes_examples_graph.nq
2021/11/09 19:48:20 Start pipe reader / writer sequence
2021/11/09 19:48:20 Pipe copy for graph done
2021/11/09 19:48:20 Miller end time: 2021-11-09 19:48:20.35733808 +0000 UTC m=+2.433356968 
2021/11/09 19:48:20 Miller run time: 0.003279 

logs to object store

At the end of a run, send logs to the object store. This will let non-administrators and repository owners see the results.

{bucket}/logs/lastrun
perhaps with a 'lastrun' file recording what the command was

At the start of a run, move lastrun to a folder.
If we cronjob/automate, we may want a logrotate option.
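
A sketch of the push using minio-go, which Gleaner already depends on; the endpoint, credentials, bucket, and log content here are placeholders:

package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// pushRunLog copies the text of the last run into {bucket}/logs/lastrun so
// repository owners can read results without server access.
func pushRunLog(mc *minio.Client, bucket, runLog string) error {
	r := strings.NewReader(runLog)
	_, err := mc.PutObject(context.Background(), bucket, "logs/lastrun",
		r, int64(r.Len()), minio.PutObjectOptions{ContentType: "text/plain"})
	return err
}

func main() {
	mc, err := minio.New("localhost:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := pushRunLog(mc, "gleaner", "batch called\nSummoner run time: 0.03"); err != nil {
		log.Fatal(err)
	}
}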
