
gleanerio / gleaner


Gleaner: JSON-LD and structured data on the web harvesting

Home Page: https://gleaner.io

License: Apache License 2.0

Languages: Go 87.10%, Makefile 0.71%, Dockerfile 0.24%, Shell 6.66%, Python 0.21%, HTML 2.93%, CSS 2.14%
Topics: rdf, schema-org, semantic-web

gleaner's People

Contributors: ashepherd, dependabot[bot], fils, nein09, valentinedwv


gleaner's Issues

Async approaches

To further bring Gleaner into orchestration environments, we need an async approach for queuing resources to process.

To stay with native Go, I am looking at https://github.com/mcmathja/curlyq and wrapping the consumer in a goroutine with a semaphore count/limit.

A bit heavier, with a Redis dependency, is https://github.com/hibiken/asynq.

I prefer the curlyq native Go approach but might try to evaluate both.

Basically this would replace Gleaner's main and perhaps be the "millstone" code. It would set up the consumer, producer, and queue. The consumer would then simply send a resource item ([]string) to the miller.

The producer might be a variant of the summoner that runs periodically (once a day, for example) with a set of provider URLs and a rolling date window (i.e., the last 24 hours in this case).
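
As a rough sketch of the consumer side (deliberately not the curlyq API; processResource and the channel-based queue are assumptions for illustration), a goroutine wrapped with a counting semaphore could look like this:

package main

import (
	"fmt"
	"sync"
)

// processResource stands in for handing a resource off to the miller;
// the name and signature are assumptions for this sketch.
func processResource(url string) {
	fmt.Println("milling", url)
}

// consume drains the queue, limiting concurrent work with a counting
// semaphore so downstream services are not overwhelmed.
func consume(queue <-chan string, limit int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for url := range queue {
		sem <- struct{}{} // acquire a slot
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			processResource(u)
		}(url)
	}
	wg.Wait()
}

func main() {
	queue := make(chan string, 2)
	go func() {
		queue <- "https://example.org/dataset/1"
		queue <- "https://example.org/dataset/2"
		close(queue)
	}()
	consume(queue, 3)
}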

Sitemap improvements

The code in fence.gleaner.io can correctly parse sitemaps based on a threshold date. This code now needs to be ported over into Gleaner.

CIDs for missing @id in some graphs

Based on a discussion in ESIP's schema.org Slack channel, I am noting a possible enhancement:

  • Take a data graph and look for certain (all?) branches of the main data graph.
  • Using approaches like JSON-LD frames that result in a subgraph, we can validate with a SHACL shape. If valid, normalize and then generate a CID (sha256 plus sugar, I guess); see the sketch after this list.
  • Use this for the @id.
  • Recursive SHA calculation issue? The CID would only be on the body of the data graph, it would seem; it's a property of the graph.
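
A minimal sketch of the hash step, assuming the json-gold processor the project already uses; the urn:gleaner:id prefix and the helper name are made up, and a true CID would add multihash/multibase wrapping on top of the sha256:

package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"

	"github.com/piprate/json-gold/ld"
)

// normSHA returns a sha256 over the URDNA2015-normalized form of a JSON-LD
// document; this is the value that could back a generated @id.
func normSHA(jsonld string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(jsonld), &doc); err != nil {
		return "", err
	}
	proc := ld.NewJsonLdProcessor()
	opts := ld.NewJsonLdOptions("")
	opts.Algorithm = "URDNA2015"
	opts.Format = "application/n-quads"
	norm, err := proc.Normalize(doc, opts)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256([]byte(norm.(string)))), nil
}

func main() {
	doc := `{"@context": {"@vocab": "https://schema.org/"}, "name": "example"}`
	h, err := normSHA(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("urn:gleaner:id:" + h) // hypothetical @id pattern
}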


Better sitemap support

Need to be able to deal with multi-sitemap sites (sitemap index -> sitemaps).

Also inspect the XML for updates, and perhaps support logging of resources to enable modification-date logic for deciding whether to index. This could also help with, or be connected to, PROV milling.
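
For the multi-sitemap case, a small sketch of detecting a sitemap index and collecting its child sitemap URLs with the standard library (the URL in main is just an example):

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

// sitemapIndex models a <sitemapindex> document that points at child sitemaps.
type sitemapIndex struct {
	Sitemaps []struct {
		Loc     string `xml:"loc"`
		LastMod string `xml:"lastmod"`
	} `xml:"sitemap"`
}

// childSitemaps fetches a URL and, if it is a sitemap index, returns the
// child sitemap locations; an empty result suggests it was a plain sitemap.
func childSitemaps(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	var idx sitemapIndex
	if err := xml.Unmarshal(body, &idx); err != nil {
		return nil, err
	}
	var locs []string
	for _, s := range idx.Sitemaps {
		locs = append(locs, strings.TrimSpace(s.Loc)) // tolerate padded <loc> values
	}
	return locs, nil
}

func main() {
	locs, err := childSitemaps("https://samples.earth/sitemap.xml") // example URL
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range locs {
		fmt.Println(l)
	}
}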

Remove HTML in string literals

Some people are putting HTML in the descriptive text. We could remove this with something like the following gist.

I'm not really sure yet how I feel about modifying the documents that others author, though, especially something like the descriptions.

gist link: https://gist.github.com/g10guang/04f11221dadf1ed019e0d3cf3e82caf3

package utils

import (
	"regexp"
	"sort"
	"strings"
)

// match html tag and replace it with ""
func RemoveHtmlTag(in string) string {
	// regex to match html tag
	const pattern = `(<\/?[a-zA-Z]+?[^>]*\/?>)*`
	r := regexp.MustCompile(pattern)
	groups := r.FindAllString(in, -1)
	// should replace long string first
	sort.Slice(groups, func(i, j int) bool {
		return len(groups[i]) > len(groups[j])
	})
	for _, group := range groups {
		if strings.TrimSpace(group) != "" {
			in = strings.ReplaceAll(in, group, "")
		}
	}
	return in
}
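
A quick usage check for the helper above (the import path is hypothetical; it just assumes the gist lives in a local utils package):

package main

import (
	"fmt"

	"example.com/gleaner/utils" // hypothetical module path for the utils package above
)

func main() {
	desc := "<p>Radar data from <b>Siple Dome</b>, Antarctica.</p>"
	// Expected output: "Radar data from Siple Dome, Antarctica."
	fmt.Println(utils.RemoveHtmlTag(desc))
}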

context location in new config

gleaner_base.yaml
was:
./schemaorg-current-https.jsonld

needs to be
./configs/local/schemaorg-current-https.jsonld

or maybe just pull down and put in
./configs/schemaorg-current-https.jsonld

issue with headless chrome

One call to headless chrome happens, then the run stops.

2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full 
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$ 

The source

- sourcetype: sitemap
  name: r2r
  logo: https://www.rvdata.us/images/Logo.4b1519be.png
  url: https://service-dev.rvdata.us/api/sitemap/
  headless: true
  pid: http://www.re3data.org/repository/r3d100010735
  propername: Rolling Deck to Repository Program (R2R)
  domain: https://www.rvdata.us/
  active: true
  credentialsfile: ""
  other: {}

console from run

Using gleaner config file: /home/ubuntu/glcon/configs/geocodes/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/geocodes/nabu
batch called
2022/03/03 17:05:06 Building organization graph.
2022/03/03 17:05:06 The specified bucket does not exist.
2022/03/03 17:05:06 Sitegraph(s) processed
2022/03/03 17:05:06 Summoner start time: 2022-03-03 17:05:06.353164344 +0000 UTC m=+0.184392504 
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true  map[other:map[]]}]
2022/03/03 17:06:20 We are not a sitemap index, check to see if we are a sitemap
2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full 
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$ 

glcon config - add sourcesSource

Add a block to servers/config.yaml to tell glcon config where to pull sources from. While there is a --sources flag, the dataset does not normally change, so this is a better way to configure it.

default:

sourcesSource:
   type: csv
   getFrom: sources.csv

add

sourcesSource:
   type: googlesheetcsv
   getFrom: csvdownloadurl
sourcesSource:
   type: excel
   getFrom: filename with sheet
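
A sketch of how glcon could read such a block, assuming viper (which the project already uses for config); the struct fields and file path are assumptions:

package main

import (
	"fmt"
	"log"

	"github.com/spf13/viper"
)

// sourcesSource mirrors the proposed YAML block above; field names are assumptions.
type sourcesSource struct {
	Type    string `mapstructure:"type"`
	GetFrom string `mapstructure:"getFrom"`
}

func main() {
	v := viper.New()
	v.SetConfigFile("servers.yaml") // path assumed for this sketch
	if err := v.ReadInConfig(); err != nil {
		log.Fatal(err)
	}
	var ss sourcesSource
	if err := v.UnmarshalKey("sourcesSource", &ss); err != nil {
		log.Fatal(err)
	}
	switch ss.Type {
	case "csv":
		fmt.Println("load sources from local CSV:", ss.GetFrom)
	case "googlesheetcsv":
		fmt.Println("download CSV export from:", ss.GetFrom)
	case "excel":
		fmt.Println("read Excel workbook:", ss.GetFrom)
	default:
		log.Fatalf("unknown sourcesSource type %q", ss.Type)
	}
}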

Gleaner does not fully implement incremental indexing correctly at this time.

At present Gleaner is not doing incremental indexing (--mode diff) correctly.

While Gleaner does know not to re-index URLs previously retrieved (summoned) in this mode, it currently does not know whether the associated object has been updated and hence stored as a new object under an updated SHA hash. This can result in extraneous objects in the object store when multiple full indexes are done. These could then get synchronized to the triplestore or other indexes, resulting in extra or older resources sitting alongside the newer ones.

Also, the sitegraph workflow, unlike the sitemap workflow, is not yet (I believe) leveraging the --mode diff (i.e., incremental) indexing, so this will need to be added.

One or both of these points is likely the cause of this extra object getting in.


Previous work

Previously, I was leveraging the generated prov records to address this via S3Select calls. This had the benefit of using existing tooling, so no additional technical debt. Unfortunately, this approach is far too slow as the resource count begins to grow. For some emerging work with a couple of communities, where the scale will approach a million records and more, it was never going to scale well.

Bolt KV store

To address this, I have already started to integrate a KV store to hold a record of the previously visited resources, which can then be used to check for and skip such records. At this time, that capacity has been merged into the dev (and master) branches and should be working and fully replacing the S3Select-based approach.

Note that this does generate a gleaner.db file during the run, so this file should be noted in the development of the Docker files.

Note that if this file is lost or removed, all that is lost is the record that supports incremental indexing, not data or other information. One would simply have to do a full index again to rebuild it and restore the capacity for incremental indexing.

Existing limitation to be addressed

The second major issue right now is that the code does not take into account that URLs might be removed. So a "prune" option needs to be added that will look for URLs in the warehouse that are not in the domain's sitemap and remove them, along with the associated object that was downloaded from that now-deleted URL.

  • Code to do this prune operation is the next development work; several support functions for it are already done.

Regarding sitegraphs

The sitegraph concept can be implemented in the same manner as a sitemap. However, since it is one large file, any change will result in a new hash, so this is a case where the sitegraph approach has operational implications that a more fine-grained approach like sitemaps does not suffer from.

Pattern thoughts

Just a few notes, more for myself, on the steps the code needs to take when running in full, incremental, and prune modes.

(inc)
if sitemap url in kv:
	skip
else:
	index
	put sitemap url in kv


(full)
if sitemap url in kv:
	index
	delete old kv value (object sha)
	replace kv value (object sha)
else:
	index
	put url in kv


(prune)
if kvurl in sitemap:
	skip
else:
	delete associated object
	delete kvurl entry
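
To make the (inc) branch concrete, here is a minimal sketch against bbolt of the check-then-record step. The bucket name "summoned" and the stored value are assumptions; a real implementation would store the object SHA as the value so the (full) branch can compare and replace it.

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// shouldSummon reports whether a sitemap URL needs to be fetched on an
// incremental (--mode diff) run: anything already recorded in the KV
// store is skipped, anything new is indexed and recorded.
func shouldSummon(db *bolt.DB, url string) (bool, error) {
	summon := false
	err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("summoned"))
		if err != nil {
			return err
		}
		if b.Get([]byte(url)) != nil {
			return nil // seen before: skip on an incremental run
		}
		summon = true
		return b.Put([]byte(url), []byte("seen")) // value could hold the object SHA
	})
	return summon, err
}

func main() {
	db, err := bolt.Open("gleaner.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	ok, err := shouldSummon(db, "https://example.org/data/1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("summon?", ok)
}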

type organization support

Look at the domain of each provider (which may be different from the data distribution domain) and add a check for schema.org type Organization in the index page.

load this into the graph as a connection to the prov graph?

Get JSON-LD context files by URL

Like we do with the SHACL shapes, we should optionally download the JSON-LD context by allowing the user to provide the URL for resolving the context in the config file, rather than having them provide it in the file system.
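
A minimal sketch of the download step, assuming the context URL and destination path come from the config (the schema.org URL here is just an example of what a user might supply):

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

// fetchContext downloads a JSON-LD context document and caches it locally,
// so the user only has to supply a URL in the config rather than a file.
func fetchContext(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("context fetch %s: %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	return os.WriteFile(dest, body, 0644)
}

func main() {
	// URL and destination are examples; both would come from the config file.
	err := fetchContext("https://schema.org/version/latest/schemaorg-current-https.jsonld",
		"./configs/local/schemaorg-current-https.jsonld")
	if err != nil {
		log.Fatal(err)
	}
}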

sitemap: no headless if 404

If an item does not exist, then headless is still called.

2021/12/02 20:31:09 Direct access failed, trying headless for  https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Software-ERDDAP-SDO.JSON 
2021/12/02 20:31:09 Direct access failed, trying headless for  https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Service-IRIS-fsdnEvent-JSON.json 
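
A sketch of the intended check: bail out before the headless fallback when the direct fetch comes back with a 4xx, since there is nothing for Chrome to render (the print just marks where the existing headless call would go):

package main

import (
	"fmt"
	"log"
	"net/http"
)

// directThenMaybeHeadless decides whether a headless render is worth trying.
// A 4xx response (404 in particular) means the resource does not exist, so
// calling headless Chrome would just waste a render.
func directThenMaybeHeadless(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 && resp.StatusCode < 500 {
		return fmt.Errorf("%s returned %s; skipping headless", url, resp.Status)
	}
	if resp.StatusCode != http.StatusOK {
		fmt.Println("direct access failed, trying headless for", url) // hand-off point
	}
	return nil
}

func main() {
	url := "https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Software-ERDDAP-SDO.JSON"
	if err := directThenMaybeHeadless(url); err != nil {
		log.Println(err)
	}
}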

Self signed certs

Review to see if this is an issue

(base) ➜ run_polar git:(master) curl https://www.polardata.ca/metadata.json
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

We may need to catch and report the error better (since, if this is the case, I don't do anything at this time).
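
A sketch of catching and reporting the certificate error explicitly, with a place where an opt-in skip-verify flag (an assumption, presumably per source in the config) could hook in:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// fetchReportingTLS surfaces self-signed / unknown-CA failures as their own
// error instead of a generic fetch error.
func fetchReportingTLS(url string, insecure bool) error {
	client := &http.Client{
		Transport: &http.Transport{
			// insecure would come from a per-source config flag (an assumption here).
			TLSClientConfig: &tls.Config{InsecureSkipVerify: insecure},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		var certErr x509.UnknownAuthorityError
		if errors.As(err, &certErr) {
			return fmt.Errorf("self-signed or unknown CA for %s: %w", url, err)
		}
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	if err := fetchReportingTLS("https://www.polardata.ca/metadata.json", false); err != nil {
		fmt.Println(err)
	}
}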

Review IndexNow

Need to review https://www.indexnow.org/faq

If we want PUSH notifications, this would be the way to do it. However, it is a bit at odds with the current structured-data-on-the-web approach, and diffing sitemaps is not that hard (I mean, we are doing it).

So I worry that at the scale Gleaner normally works at this is a bit of a digression (at best) or regression (at worst).

It needs to be reviewed though, and it would be a neat side tool that could then invoke Gleaner with a set of URLs. Bypassing the sitemaps et al. and providing a URL or set of URLs to index is something I felt might be nice for testing purposes or for developers/publishers to have. That aspect would not be hard to implement in the Gleaner code.

The "indexnow" server would then be separate from that and would be responsible for calling Gleaner.

Provide option to skip validation

The task is to scrape for biomedical markup data; the data extends the schema.org schema, and therefore we don't need the validation that Gleaner does.

Check stack at start of (batch... other) commands.

We had a misconfiguration issue.

A 'Bucket not found' error appeared because stack changes meant an https redirect was occurring.
The code should have just stopped.

batch and other commands that require access to the stack need to run check before fully executing.

We need additional options:

  • --cron: running in a cronjob; if there is an error during check, don't stop
  • --nocheck: do not run check
  • --noheadlesscheck: do not check to see if headless Chrome is running; good for development and probably tutorials

getNormSHA produces a blank string for certain documents

This one is puzzling me. In my logs for crawling http://nsidc.org/, I have a bunch of non-identical JSON-LD objects which are getting the same hash generated for them. I poked around and figured out that this is because proc.Normalize (line 38 in calcShaNorm.go) is generating an empty string, and when you calculate the SHA of a bunch of identical empty strings, it's going to be the same.

logger: acquire.go:206: #4 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2553 for http://nsidc.org/data/NSIDC-0051/versions/1
logger: acquire.go:219: #4 thread for http://nsidc.org/data/NSIDC-0051/versions/1 
logger: acquire.go:206: #14 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2495 for http://nsidc.org/data/NSIDC-0076/versions/1
logger: acquire.go:219: #14 thread for http://nsidc.org/data/NSIDC-0076/versions/1 
logger: acquire.go:206: #31 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3046 for http://nsidc.org/data/NSIDC-0037/versions/1
logger: acquire.go:219: #31 thread for http://nsidc.org/data/NSIDC-0037/versions/1 
logger: acquire.go:206: #15 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3667 for http://nsidc.org/data/NSIDC-0042/versions/1

Here's the config to crawl that site:

- name: nsidc
  url: https://nsidc.org/sitemap.xml
  headless: false
  properName: National Snow and Ice Data Center
  domain: https://nsidc.org

AND, also, they have their context specified with no trailing slash, and not https, so you need to add this to contextmaps:

- prefix: "http://schema.org"
  file: "./schemaorg-current-https.jsonld"

Is that a clue, there? Is json-gold not able to normalize a json-ld object that is set up this way?

I'm also finding that once I am able to get unique JSON-LD objects for each of the AADC sites in their sitemap*, it only generates 3 different SHAs for the whole set of them. I haven't looked into that much further.

  • because each page has two json-ld files, one which is for the organization, and one which is for the actual data page itself. I'm working on a change that finds only the relevant json-ld and grabs just that one, but it's kind of tricky, and I don't know how broadly applicable it is.

robots.txt

Gleaner should also read a robots.txt file to obtain a sitemap (optionally based on the agent string) and also read a delay value for harvesting.
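
A minimal sketch of pulling those two hints from robots.txt (not agent-specific; the site URL in main is an example):

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// robotsHints extracts the declared sitemaps and a crawl delay from a
// site's robots.txt.
func robotsHints(site string) (sitemaps []string, delay time.Duration, err error) {
	resp, err := http.Get(strings.TrimRight(site, "/") + "/robots.txt")
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(strings.ToLower(line), "sitemap:"):
			sitemaps = append(sitemaps, strings.TrimSpace(line[len("sitemap:"):]))
		case strings.HasPrefix(strings.ToLower(line), "crawl-delay:"):
			if secs, e := strconv.Atoi(strings.TrimSpace(line[len("crawl-delay:"):])); e == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
	}
	return sitemaps, delay, scanner.Err()
}

func main() {
	maps, delay, err := robotsHints("https://www.rvdata.us")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("sitemaps:", maps, "delay:", delay)
}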

spatial index review...

We need to review the spatial indexer.
There may be some issues creeping in related to resources with multiple types, like IEDA with BBOX and POINT.

Some searches are giving odd results; this may also be related to old items in the index. We need to rebuild the geohash index from scratch.

SIGSEGV - segmentation violation

Hi,

I have an interesting issue with the 'gleaner' command, which is a static exe file.

The gleaner command compiled from source has the same SIGSEGV runtime panic message as the v2.0.25 and v2.0.22 builds I downloaded from GitHub here.

The messages are:

v2.0.22

main.go:30: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

v2.0.25

main.go:34: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

self compiled version

main.go:35: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference

I tried to execute the gleaner command with and without root permission; the results are the same.
Does anyone have, or has anyone had, similar issues in the past and a hint to solve this?

OS: Ubuntu 20.04
GOLANG: go version go1.13.8 linux/amd64
Environment variables: GLEANER_BASE=/tmp/gleaner GLEANER_OBJECTS=/tmp/gleaner/datavol/s3 GLEANER_GRAPH=/tmp/gleaner/datavol/graph DATAVOL=/tmp/gleaner/datavol and directories exists.

Thanks in advance, Andreas

Bucket ID in config file not used

This has never been coded in. It makes running test indexing or indexing on AWS difficult.

We just need to track the item in the config through to the various locations in the code.

Error in nqtoNTctx with tika

When running the Tika miller, we get an error from time to time with nqToNTctx. Likely due to Unicode characters?

glcon fixes

glcon config init
wiped out the config files, copying from configs/{cfgName} rather than configs/template.

glcon gleaner batch
needs to be able to run a single source to enable quick configuration testing.

Modify main to allow selection of one source or sitemap

This will be useful in cases where we want one config file but still need to be able to select a specific source to index via a flag.

Good for testing, but also for scheduled indexing where frequency might change from source to source.
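
A sketch of the selection itself, with a trimmed-down source struct standing in for the real config type and --source as a hypothetical flag name:

package main

import (
	"flag"
	"fmt"
)

// source is a trimmed-down stand-in for the config struct.
type source struct {
	Name string
	URL  string
}

// filterSources keeps only the named source when --source is given,
// otherwise it returns everything (the current behavior).
func filterSources(all []source, name string) []source {
	if name == "" {
		return all
	}
	var out []source
	for _, s := range all {
		if s.Name == name {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	one := flag.String("source", "", "index only this source name")
	flag.Parse()
	all := []source{
		{Name: "r2r", URL: "https://service-dev.rvdata.us/api/sitemap/"},
		{Name: "nsidc", URL: "https://nsidc.org/sitemap.xml"},
	}
	fmt.Println(filterSources(all, *one))
}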

Headless rework

r2r has long rendering issues. The present code waits for DOMContentEventFired,
but that is not long enough. I added a headlesswait and thread sleep, but even then that does not get everything.

This uses mafredri/cdp ("github.com/mafredri/cdp/protocol/network"), so it looks like a starting point:
https://github.com/Aleksandr-Kai/articles_parser/blob/18c4cd2c90600e0eb7b628853a3959995e514dbd/pkg/browserapp/browser.go

Two changes:

  1. Get the wait-until-network-idle / page rendering correct.
  2. Add testing so that we can verify it.

I think it would be hard to test, as the page renderer puts items into minio directly.

Grab URLs from r2r and IEDA:
https://dev.rvdata.us/search/fileset/100748
https://dev.rvdata.us/search/fileset/101773

Miller is dropping dataset metadata but keeping things like person, organization and logo for NSIDC

I'm not sure what's going on here, but I'm trying to index the NSIDC, and running into this. Any help or insight would be appreciated - at this point, I don't know if the bug is in Gleaner, or elsewhere.

I'm able to get json-ld out that looks like this, for example:

{
   "@context": { 
            "vocab": "https://schema.org/" 
     },
    "@graph": [
        {
            "@type": "Dataset",
            "provider": {
                "@type": "Organization",
                "name": "NSIDC: National Snow and Ice Data Center",
                "url": "https://nsidc.org",
                "logo": {
                    "@type": "ImageObject",
                    "representativeOfPage": "True",
                    "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
                    "width": "60 px",
                    "height": "60 px"
                }
            },
            "@id": "NSIDC-0303",
            "name": "Radar Investigations of Antarctic Ice Stream Margins, Siple Dome, 1998, Version 1",
            "version": "1",
            "description": "This data set consists of surface-based radar measurements, including geometry of the bed, surface, and internal layers, and bed reflectivity measurements at two sites along ice stream margins at Siple Dome, Antarctica. The research is a radar examination of bed reflection characteristics and internal layer geometry in two inter-ice-stream ridges, the Shabtaie Ridge (Ridge D/E) and the Engelhardt Ridge (Ridge B/C), and across margins with the adjacent ice streams, the MacAyeal Ice Stream (Ice Stream E) and the Whillans Ice Stream (Ice Stream B). Investigators collected these radar data from 14 November through 13 December 1998. Data are in Microsoft Word, PDF, ASCII text, MATLAB, binary, and various image formats. Investigators have also provided code for MATLAB routines that they used to view the radar data. Data are available via FTP.",
            "temporalCoverage": "1998-11-14 00:00:00 to 1998-12-13 00:00:00",
            "spatialCoverage": "N: -80.1678, S: -83.3528, E: -138.3697, W: -141.6722",
            "identifier": "https://doi.org/10.7265/N52B8VZP",
            "keywords": "Radar \u0026gt; Radar Reflectivity \u0026gt; Bed Reflectivity, Radar \u0026gt; Radar Imagery \u0026gt; Bed, Surface, and Internal Layer Geometry",
            "author": {
                "@type": "Person",
                "name": [
                    "Nadine Nereson",
                    "Charles Raymond"
                ]
            },
            "publisher": {
                "@type": "Organization",
                "@id": "https://nsidc.org",
                "name": "National Snow and Ice Data Center",
                "url": "https://nsidc.org",
                "logo": {
                    "@type": "ImageObject",
                    "representativeOfPage": "True",
                    "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
                    "width": "60 px",
                    "height": "60 px"
                }
            },
            "url": "https://nsidc.org/data/NSIDC-0303/versions/1"
        }
    ]
}

But what comes out of the miller for this metadata is missing a lot of the stuff I want, namely the Dataset and its description and all that good stuff. I just get the logo and the authors.

<https://nsidc.org> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
<https://nsidc.org> <http://schema.org/logo> _:bc932hau7ho5ajldeot1g .
<https://nsidc.org> <http://schema.org/name> "National Snow and Ice Data Center" .
<https://nsidc.org> <http://schema.org/url> <https://nsidc.org> .
_:bc932hau7ho5ajldeot1g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot1g <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot1g <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot1g <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot1g <http://schema.org/width> "60 px" .
_:bc932hau7ho5ajldeot20 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Nadine Nereson" .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Charles Raymond" .
_:bc932hau7ho5ajldeot2g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:bc932hau7ho5ajldeot2g <http://schema.org/logo> _:bc932hau7ho5ajldeot30 .
_:bc932hau7ho5ajldeot2g <http://schema.org/name> "NSIDC: National Snow and Ice Data Center" .
_:bc932hau7ho5ajldeot2g <http://schema.org/url> <https://nsidc.org/data/NSIDC-0303/versions/1> .
_:bc932hau7ho5ajldeot30 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot30 <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot30 <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot30 <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot30 <http://schema.org/width> "60 px" .

glcon rename servers.yaml generate.yaml

Pondering renaming servers.yaml to generate.yaml or configure.yaml to reflect that it is used to generate the config files, and will allow for adding other items.

sourcemaps:
    type: csv
    file:
    url:
gleaner:
    runid: pattern_{{date}}

build config files for Gleaner, Nabu from CSV/Google Sheet/Excel

With many repositories, it can be a slight pain when the config needs to be updated.
A spreadsheet can be used to manage a list of sites, which can also be used to render an HTML page.

So: write a Python tool to generate the Gleaner and Nabu configs.
It could also be used in the future to generate configs from a database or a website.

Object store based indexing prov

This is an issue to document a new approach for incremental indexing that will remove the boltdb in favor of an object-store-based approach.

  • Create: /sitemap/[name]/sitemaplatest.xml
  • We still will not know what URLs in a sitemap had JSON-LD (only prov knows that)
  • Need functions like record() and diff(). Do these as interfaces for a change (for the better).

When we pull down a new sitemap, we can get the array of URLs from it, compare it to the last stored sitemap in the bucket, and then only index the diff for incremental runs. The failure point is that we don't know if a URL has updated or newly added JSON-LD metadata; however, that is what the occasional full index is for, versus the incremental indexes.

Sequence-wise we would:

  • pull down the new sitemap
  • compare to the current sitemap in the object store
  • get the diff and pass it to the summoner (see the sketch after this list)
  • store the new XML and remove/rename the older sitemap (depending on how much history we wish to store)
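
The diff() piece is just a set difference over the URL arrays; a minimal sketch (record() would presumably be the step that stores the new sitemap XML in the bucket):

package main

import "fmt"

// diff returns the URLs present in the new sitemap but absent from the
// previously stored one; only these go to the summoner on an incremental run.
func diff(newURLs, oldURLs []string) []string {
	seen := make(map[string]struct{}, len(oldURLs))
	for _, u := range oldURLs {
		seen[u] = struct{}{}
	}
	var added []string
	for _, u := range newURLs {
		if _, ok := seen[u]; !ok {
			added = append(added, u)
		}
	}
	return added
}

func main() {
	old := []string{"https://example.org/a", "https://example.org/b"}
	cur := []string{"https://example.org/a", "https://example.org/b", "https://example.org/c"}
	fmt.Println(diff(cur, old)) // [https://example.org/c]
}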

EarthCube resource registry: Gleaner hits some issues

It looks like it does in a prov call and an object2rdf call.
Also, milled files are getting named with a .json extension, but they are .rdf files.

configuration
ecrr.zip

ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 18:17:06 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 18:17:06 Building organization graph.
2021/11/09 18:17:06 Miller start time: 2021-11-09 18:17:06.905400154 +0000 UTC m=+0.198000722 
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr_examples
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr_examples
  37% |███████████████                            | (107/289, 100 it/s) [0s:1s]2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
  78% |█████████████████████████████████          | (228/289, 113 it/s) [1s:0s]2021/11/09 18:17:08 invalid character 'u' looking for beginning of value
2021/11/09 18:17:08 obj2RDF invalid character 'u' looking for beginning of value
 100% |███████████████████████████████████████████| (289/289, 152 it/s)        
2021/11/09 18:17:08 Assembling result graph for prefix: summoned/ecrr to: milled/ecrr
2021/11/09 18:17:08 Result graph will be at: results/rr1/ecrr_graph.nq
2021/11/09 18:17:08 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
   0% |                                               | (0/0, 0 it/min) [0s:0s]
2021/11/09 18:17:10 Assembling result graph for prefix: summoned/ecrr_examples to: milled/ecrr_examples
2021/11/09 18:17:10 Result graph will be at: results/rr1/ecrr_examples_graph.nq
2021/11/09 18:17:10 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
2021/11/09 18:17:10 Miller end time: 2021-11-09 18:17:10.184253951 +0000 UTC m=+3.476854554 
2021/11/09 18:17:10 Miller run time: 0.054648 

namespace in XML sitemaps

So

<ns0:sitemapindex xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
		<loc> https://geoconnex.us/sitemap/namespaces/nmwdi/nmwdi_ose_ids__0.xml </loc>
		<lastmod> 2021-10-19 19:25:50.620909 </lastmod>

is valid but doesn't work

Currently the sitemap parsing works with

<?xml version='1.0' encoding='utf-8'?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://samples.earth/sitemap0.xml</loc>
  <lastmod>2021-04-11</lastmod>
  </sitemap>
</sitemapindex>

The namespace approach is valid XML though, so this needs to be resolved.

Issue with sitemap

Gleaner is not pulling in this sitemap: https://github.com/earthcube/GeoCODES-Metadata/blob/main/sitemap.xml

It reads it, but nothing is created/summoned.

ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 19:48:17 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 19:48:17 Building organization graph.
2021/11/09 19:48:18 Sitegraph(s) processed
2021/11/09 19:48:18 Summoner start time: 2021-11-09 19:48:18.096928621 +0000 UTC m=+0.172947508 
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false  Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false  Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 We are not a sitemap index, check to see if we are a sitemap
2021/11/09 19:48:18 geocodes_examples sitemap size is : 4 queuing: 4 mode: full 
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false  Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false  Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples  https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false  GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 Thread count 5 delay 0
 100% |████████████████████████████████████████████████| (4/4, 15 it/s)
2021/11/09 19:48:18 Wrote log size 547
2021/11/09 19:48:20 Summoner end time: 2021-11-09 19:48:20.160551705 +0000 UTC m=+2.236570592 
2021/11/09 19:48:20 Summoner run time: 0.034394 
2021/11/09 19:48:20 Miller start time: 2021-11-09 19:48:20.160614155 +0000 UTC m=+2.236633059 
2021/11/09 19:48:20 Adding bucket to milling list: summoned/geocodes_examples
2021/11/09 19:48:20 Adding bucket to prov building list: prov/geocodes_examples
   0% |                                               | (0/0, 0 it/min) [0s:0s]
2021/11/09 19:48:20 Assembling result graph for prefix: summoned/geocodes_examples to: milled/geocodes_examples
2021/11/09 19:48:20 Result graph will be at: results/rr1/geocodes_examples_graph.nq
2021/11/09 19:48:20 Start pipe reader / writer sequence
2021/11/09 19:48:20 Pipe copy for graph done
2021/11/09 19:48:20 Miller end time: 2021-11-09 19:48:20.35733808 +0000 UTC m=+2.433356968 
2021/11/09 19:48:20 Miller run time: 0.003279 

logs to object store

At the end of a run, send logs to the object store. This will let non-administrators and repository owners see the results.

{bucket}/logs/lastrun
perhaps with a 'lastrun' file recording what the command was

At the start of a run, move lastrun to a folder.
If we cronjob/automate, we may want a logrotate option.
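
A sketch of the push using minio-go, which Gleaner already depends on; the endpoint, credentials, bucket, and log content here are placeholders:

package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// pushRunLog copies the text of the last run into {bucket}/logs/lastrun so
// repository owners can read results without server access.
func pushRunLog(mc *minio.Client, bucket, runLog string) error {
	r := strings.NewReader(runLog)
	_, err := mc.PutObject(context.Background(), bucket, "logs/lastrun",
		r, int64(r.Len()), minio.PutObjectOptions{ContentType: "text/plain"})
	return err
}

func main() {
	mc, err := minio.New("localhost:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := pushRunLog(mc, "gleaner", "batch called\nSummoner run time: 0.03"); err != nil {
		log.Fatal(err)
	}
}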
