This is just the repo for the web site at gleaner.io. It is a shameless copy of the Tailwind Toolbox landing page template.
It's deployed to a Google object store served via GROW.
Gleaner: JSON-LD and structured data on the web harvesting
Home Page: https://gleaner.io
License: Apache License 2.0
To further bring Gleaner into orchestration environments, we need an async approach for queuing resources to process.
To stay with native Go, I am looking at https://github.com/mcmathja/curlyq and wrapping the consumer in a goroutine with a semaphore count/limit.
A heavier option, with a Redis dependency, is https://github.com/hibiken/asynq
I prefer the curlyq native-Go approach but may evaluate both.
Basically this would replace gleaner's main and perhaps become the "millstone" code. It would set up the consumer, producer, and queue. The consumer would simply send a resource ([]string) item on to miller.
The producer might be a variant of summoner that runs periodically (once a day, for example) with a set of provider URLs and a rolling date window (i.e., the last 24 hours in that case).
Patching the JSON-LD data graphs may become important as we incorporate approaches to "update" indexing.
Some references for JSON-LD "diff" options include
The code in fence.gleaner.io can correctly parse sitemaps based on a threshold date. This code needs to be ported over into Gleaner now.
Based on a discussion in ESIP's schema.org Slack channel, I am noting a possible enhancement.
References:
Need to be able to deal with multi-sitemap sites (sitemap index -> sitemaps).
Also inspect the XML for updates, and perhaps support logging of resources so that modified-on logic can decide whether to index. This could also help with, or be connected to, PROV milling.
Some people are putting HTML in the descriptive text. We could remove this with something like the following gist.
Not really sure how I feel yet about modifying documents that others author, though; especially something like the descriptions.
gist link: https://gist.github.com/g10guang/04f11221dadf1ed019e0d3cf3e82caf3
package utils

import (
	"regexp"
	"sort"
	"strings"
)

// RemoveHtmlTag matches HTML tags and replaces them with "".
func RemoveHtmlTag(in string) string {
	// regex to match an HTML tag (note: the original gist had [a-zA-A], a typo for [a-zA-Z])
	const pattern = `(<\/?[a-zA-Z]+?[^>]*\/?>)*`
	r := regexp.MustCompile(pattern)
	groups := r.FindAllString(in, -1)
	// replace longer matches first
	sort.Slice(groups, func(i, j int) bool {
		return len(groups[i]) > len(groups[j])
	})
	for _, group := range groups {
		if strings.TrimSpace(group) != "" {
			in = strings.ReplaceAll(in, group, "")
		}
	}
	return in
}
Due to:
schemaorg/schemaorg#2578
the manner of getting the context has been updated. Note the Link header use as an option (?) for resolving the context. We may not be able to simply content negotiate; we will need to look for the Link header too.
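As a sketch, pulling a context URL out of an HTTP Link header could look like the following. The header value shown is illustrative only; check what schema.org actually sends (e.g. via `resp.Header.Get("Link")` on the real response):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// contextFromLinkHeader pulls the target URL out of a Link header
// entry whose rel is "alternate" and type is application/ld+json.
func contextFromLinkHeader(link string) string {
	// each entry looks like <URL>; param; param — split on the <…> target
	re := regexp.MustCompile(`<([^>]+)>([^,]*)`)
	for _, m := range re.FindAllStringSubmatch(link, -1) {
		params := strings.ToLower(m[2])
		if strings.Contains(params, `rel="alternate"`) &&
			strings.Contains(params, "application/ld+json") {
			return m[1]
		}
	}
	return ""
}

func main() {
	// Illustrative header value only, not necessarily what the server returns.
	h := `<https://schema.org/docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"`
	fmt.Println(contextFromLinkHeader(h))
}
```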
gleaner_base.yaml
was:
./schemaorg-current-https.jsonld
needs to be
./configs/local/schemaorg-current-https.jsonld
or maybe just pull down and put in
./configs/schemaorg-current-https.jsonld
One call to headless chrome happens, then the run stops.
2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$
The source
- sourcetype: sitemap
  name: r2r
  logo: https://www.rvdata.us/images/Logo.4b1519be.png
  url: https://service-dev.rvdata.us/api/sitemap/
  headless: true
  pid: http://www.re3data.org/repository/r3d100010735
  propername: Rolling Deck to Repository Program (R2R)
  domain: https://www.rvdata.us/
  active: true
  credentialsfile: ""
  other: {}
console from run
Using gleaner config file: /home/ubuntu/glcon/configs/geocodes/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/geocodes/nabu
batch called
2022/03/03 17:05:06 Building organization graph.
2022/03/03 17:05:06 The specified bucket does not exist.
2022/03/03 17:05:06 Sitegraph(s) processed
2022/03/03 17:05:06 Summoner start time: 2022-03-03 17:05:06.353164344 +0000 UTC m=+0.184392504
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true map[other:map[]]}]
2022/03/03 17:05:06 [{sitemap r2r https://www.rvdata.us/images/Logo.4b1519be.png https://service-dev.rvdata.us/api/sitemap/ true http://www.re3data.org/repository/r3d100010735 Rolling Deck to Repository Program (R2R) https://www.rvdata.us/ true map[other:map[]]}]
2022/03/03 17:06:20 We are not a sitemap index, check to see if we are a sitemap
2022/03/03 17:06:21 r2r sitemap size is : 42497 queuing: 42497 mode: full
2022/03/03 17:06:21 Headless chrome call to: r2r
ubuntu@ec-testbed-containers:~/glcon$
Add a block to servers/config.yaml to tell config where to pull sources from. While there is a --sources flag, the dataset does not normally change, so configuring this in config is a better approach.
default:
  sourcesSource:
    type: csv
    getFrom: sources.csv

add

  sourcesSource:
    type: googlesheetcsv
    getFrom: csvdownloadurl

  sourcesSource:
    type: excel
    getFrom: filename with sheet
While Gleaner does know not to index URLs previously retrieved (summoned) in this mode, it currently does not know if the associated retrieved object has been updated, and hence stores a new object under an updated SHA hash. This can result in extraneous objects in the object store when multiple full indexes are done. These could then get synchronized to the triplestore or other indexes, resulting in extra or older resources in the indexes alongside the newer ones.
Also, the sitegraph workflow, unlike the sitemap workflow, is not yet (I believe) leveraging the --mode diff (i.e., incremental) indexing, so this will need to be added.
One or both of these points is likely the cause of these extra objects getting in.
Previously, I was leveraging the generated prov records to address this via S3Select calls. This had the benefit of using existing tooling, so no additional technical debt. Unfortunately, this approach is far too slow as the resource count grows. For some emerging work with a couple of communities, where the scale will approach a million records and more, it was never going to scale well.
To address this, I have started integrating a KV store to hold a record of the previously visited resources. This can then be used to check and skip such records. At this time, that capacity has been merged into the dev (and master) branches and should be working and fully replacing the S3Select-based approach.
Note that this generates a gleaner.db file during the run, so this file should be accounted for in the development of the Docker files.
Note that if this file is lost or removed, all that is lost is the record that supports incremental indexing; no data or other information. One would simply have to do a full index again to rebuild it and restore incremental indexing.
The second major issue right now is that the code does not take into account that URLs might be removed. So a "prune" option needs to be added that will look for URLs in the warehouse that are not in the domain's sitemap and remove them, along with the associated object that was downloaded from that now-deleted URL.
The sitegraph concept can be implemented in the same manner as a sitemap. However, since it is one large file, any change will result in a new hash, so this is a case where the sitegraph approach has operational implications that a more fine-grained approach like sitemaps does not suffer from.
Just a few notes, more for myself, on the steps the code needs to take when running in full, incremental, and prune modes.
(inc)
  if sitemap url in kv:
    skip
  else:
    index
    put sitemap url in kv

(full)
  if sitemap url in kv:
    index
    delete old kv value (object sha)
    replace kv value (object sha)
  else:
    index
    put url in kv

(prune)
  if kv url in sitemap:
    skip
  else:
    delete associated object
    delete kv url entry
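A minimal Go sketch of those three code paths, with a plain map standing in for the KV store and stub index/delete operations (the real code would talk to the object store and gleaner.db):

```go
package main

import "fmt"

// kv maps a summoned URL to the SHA of its stored object.
var kv = map[string]string{}

// index and deleteObject are stubs for the real summon/object-store calls.
func index(url string) string  { fmt.Println("index:", url); return "sha-of-" + url }
func deleteObject(sha string)  { fmt.Println("delete object:", sha) }

// incremental: skip anything we have already summoned.
func incremental(sitemap []string) {
	for _, u := range sitemap {
		if _, seen := kv[u]; seen {
			continue
		}
		kv[u] = index(u)
	}
}

// full: (re)index everything, replacing any stored object SHA.
func full(sitemap []string) {
	for _, u := range sitemap {
		kv[u] = index(u) // overwrites the old object SHA if present
	}
}

// prune: drop KV entries (and objects) for URLs no longer in the sitemap.
func prune(sitemap []string) {
	current := map[string]bool{}
	for _, u := range sitemap {
		current[u] = true
	}
	for u, sha := range kv {
		if !current[u] {
			deleteObject(sha)
			delete(kv, u)
		}
	}
}

func main() {
	full([]string{"a", "b"})
	incremental([]string{"a", "b", "c"}) // only "c" is indexed
	prune([]string{"a"})                 // "b" and "c" are removed
	fmt.Println(len(kv))                 // 1
}
```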
Look at the domain of each provider (it may differ from the data distribution domain) and add a check for Organization-typed schema.org markup on the index page.
Load this into the graph as a connection to the prov graph?
Rather than on/off flags, maybe we make the workflows separate?
batch would do both according to configuration.
since we are using cobra, we could build up some run in sequence commands.
As we do with the SHACL shapes, we should optionally download the JSON-LD context, allowing the user to provide the URL for resolving the context in the config file rather than having them provide it in the file system.
Passing source maps to glcon config generate does not make generate use the source maps.
If an item does not exist, then headless is still called.
2021/12/02 20:31:09 Direct access failed, trying headless for https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Software-ERDDAP-SDO.JSON
2021/12/02 20:31:09 Direct access failed, trying headless for https://raw.githubusercontent.com/earthcube/ecrro/tree/master/Examples/Service-IRIS-fsdnEvent-JSON.json
cmd/gleaner> gleaner
gleaner: Exec format error. Binary file not executable.
The code has a half-dozen imports that start with: http://github.com/earthcubearchitecture-project418/gleaner/internal/check
Review to see if this is an issue.
(base) ➜ run_polar git:(master) curl https://www.polardata.ca/metadata.json
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
We may need to catch and report this error better (since, if this is the case, we don't do anything at this time).
Need to review https://www.indexnow.org/faq
If we want PUSH notification, this would be the way to do it. However, this is a bit at odds with the current structured-data-on-the-web approach, and diffing sitemaps is not that hard (we are already doing it).
So I worry that at the scale Gleaner normally works at, this is a bit of a digression (at best) or regression (at worst).
It needs to be reviewed, though, and would be a neat side tool that could then invoke Gleaner with a set of URLs. Bypassing the sitemaps et al. and providing a URL or set of URLs to index has been something I felt might be nice for testing purposes or for developers/publishers to have. That aspect would not be hard to implement in the Gleaner code.
The "indexnow" server would then be separate from that and be responsible for calling Gleaner.
The task is to scrape for biomedical markup data; the data extends the schema.org schema, and therefore we don't need the validation that Gleaner does.
Had a misconfiguration issue.
'Bucket not found': stack changes meant an https redirect was occurring.
The code should have just stopped and forced an error.
batch and other commands that require access to the stack need to run check before fully executing.
Need to add additional options:
This one is puzzling me. So, in my logs for crawling http://nsidc.org/, I have a bunch of non-identical JSON-LD objects which are getting the same hash generated for them. I poked around and figured out that this is because proc.Normalize (line 38 in calcShaNorm.go) is generating an empty string, and when you calculate the SHA of a bunch of identical empty strings, it's going to be the same.
logger: acquire.go:206: #4 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2553 for http://nsidc.org/data/NSIDC-0051/versions/1
logger: acquire.go:219: #4 thread for http://nsidc.org/data/NSIDC-0051/versions/1
logger: acquire.go:206: #14 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2495 for http://nsidc.org/data/NSIDC-0076/versions/1
logger: acquire.go:219: #14 thread for http://nsidc.org/data/NSIDC-0076/versions/1
logger: acquire.go:206: #31 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3046 for http://nsidc.org/data/NSIDC-0037/versions/1
logger: acquire.go:219: #31 thread for http://nsidc.org/data/NSIDC-0037/versions/1
logger: acquire.go:206: #15 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3667 for http://nsidc.org/data/NSIDC-0042/versions/1
Here's the config to crawl that site:
- name: nsidc
  url: https://nsidc.org/sitemap.xml
  headless: false
  properName: National Snow and Ice Data Center
  domain: https://nsidc.org
AND, also, they have their context specified with no trailing slash, and not https, so you need to add this to contextmaps:
  - prefix: "http://schema.org"
    file: "./schemaorg-current-https.jsonld"
Is that a clue, there? Is json-gold not able to normalize a json-ld object that is set up this way?
I'm also finding that once I am able to get unique JSON-LD objects for each of the AADC sites in their sitemap*, it only generates 3 different SHAs for the whole set of them. I haven't looked into that much further.
Gleaner should also be able to take a robots.txt file to obtain a sitemap (optionally based on agent string) and also read a delay value for harvesting.
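A minimal stdlib sketch of pulling those two values out of a robots.txt body is below. This ignores per-agent grouping (Crawl-delay is a de facto extension scoped to a User-agent group); a real implementation might use a library such as github.com/temoto/robotstxt instead:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// robotsInfo holds the two fields Gleaner would care about.
type robotsInfo struct {
	Sitemaps   []string
	CrawlDelay float64 // seconds; Crawl-delay is a non-standard extension
}

// parseRobots scans a robots.txt body for Sitemap and Crawl-delay lines,
// without regard to which User-agent group they appear in.
func parseRobots(body string) robotsInfo {
	var info robotsInfo
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		k, v, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		v = strings.TrimSpace(v)
		switch strings.ToLower(strings.TrimSpace(k)) {
		case "sitemap":
			info.Sitemaps = append(info.Sitemaps, v)
		case "crawl-delay":
			if d, err := strconv.ParseFloat(v, 64); err == nil {
				info.CrawlDelay = d
			}
		}
	}
	return info
}

func main() {
	body := "User-agent: *\nCrawl-delay: 5\nSitemap: https://example.org/sitemap.xml\n"
	fmt.Println(parseRobots(body))
}
```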
We need to review the spatial indexer...
There may be some issues creeping in related to resources with multiple types, like IEDA with BBOX and POINT.
Some searches are giving odd results; this may also be related to old items in the index. Need to rebuild the geohash index from scratch.
Hi,
I have an interesting issue with the 'gleaner' command, which is a static exe file.
The gleaner cmd compiled from source has the same SIGSEGV runtime panic message as the v2.0.25 and v2.0.22 I downloaded from GitHub here.
The messages are:
v2.0.22
main.go:30: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference
v2.0.25
main.go:34: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference
self compiled version
main.go:35: EarthCube Gleaner
panic: runtime error: invalid memory address or nil pointer dereference
I tried to execute the gleaner cmd with and without root permission. The results are the same.
Does anyone have, or has anyone had, similar issues in the past and a hint to solve this?
OS: Ubuntu 20.04
GOLANG: go version go1.13.8 linux/amd64
Environment variables: GLEANER_BASE=/tmp/gleaner GLEANER_OBJECTS=/tmp/gleaner/datavol/s3 GLEANER_GRAPH=/tmp/gleaner/datavol/graph DATAVOL=/tmp/gleaner/datavol and directories exists.
Thanks in advance, Andreas
This has never been coded in. It makes running test indexing, or indexing on AWS, difficult.
Just need to track the item in the config through to the various locations in the code.
When running the Tika miller we get an error from time to time with nqToNTctx. Likely due to Unicode characters?
At present the resource registry puts generated files in a Google Drive.
So, summoning from a Google Drive:
https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd
This has been synced using rclone, but that means the usermeta is not attached.
So: create a Google Drive harvester.
Something like https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/adding-a-license-to-a-repository ? (I think maybe also needed in other projects nearby here?)
glcon config init
wiped out config files, copying from configs/{cfgName} rather than configs/template.
glcon gleaner batch
needs to be able to run a single source to enable quick configuration testing.
Check if there is a name;
put that in as a property.
If a desc, put in the first 100... characters (maybe);
otherwise put in "Un-named by facility".
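That fallback logic is small; a sketch follows (the 100-character cut and the exact fallback string are taken from the notes above; the byte-based slice would need a rune-aware cut for non-ASCII descriptions):

```go
package main

import "fmt"

// displayName picks a label for a resource: its name if present,
// else the first 100 characters of its description, else a fixed fallback.
func displayName(name, desc string) string {
	if name != "" {
		return name
	}
	if desc != "" {
		if len(desc) > 100 {
			return desc[:100] // byte-based cut; use runes for non-ASCII text
		}
		return desc
	}
	return "Un-named by facility"
}

func main() {
	fmt.Println(displayName("", ""))           // Un-named by facility
	fmt.Println(displayName("My Dataset", "")) // My Dataset
}
```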
The config has changed a bit, and we need to make sure we have an always-current template for the config file, a simple demo, and perhaps documentation on all the sections.
This will be useful in cases where we want one config file but then be able to select a specific source to index from a flag.
Good for testing but also for scheduled indexing that might change from source to source in frequency.
I am using Blast also, and your project looks useful for science and medicine.
I work in medicine and think that the JSON-LD and RDF aspects are really useful to allow researchers to collaborate better.
I was wondering if it makes sense to change this:
https://github.com/earthcubearchitecture-project418/gleaner/blob/master/internal/millers/millerbleve/bleve.go#L32
to use this:
https://github.com/minio/dsync
It's not proper for Gleaner itself to use named graphs, since publishers may or may not use them.
Also, this is coming: https://w3c.github.io/json-ld-syntax/#graph-containers
TODO: remove all use of quads by Gleaner itself, but allow JSON-LD to quads via the libraries, of course.
r2r has long rendering issues. The present code waits for DOMContentEventFired,
but that is not long enough. Added a headlesswait and thread sleep, but even then that does not get everything.
This uses mafredri, so it looks like a starting point:
https://github.com/Aleksandr-Kai/articles_parser/blob/18c4cd2c90600e0eb7b628853a3959995e514dbd/pkg/browserapp/browser.go (uses github.com/mafredri/cdp/protocol/network)
Two changes:
Think it would be hard to test, as the page renderer puts items into minio directly.
Grab URLs from r2r and ieda.
https://dev.rvdata.us/search/fileset/100748
https://dev.rvdata.us/search/fileset/101773
A standard for geospatial metadata has been the web accessible folder (WAF), aka the Apache directory listing.
https://ioos.github.io/catalog/pages/registry/waf_creation/
Example code from:
https://github.com/Esri/geoportal-server-harvester/tree/master/geoportal-connectors/geoportal-harvester-waf/src/main/java/com/esri/geoportal/harvester/waf
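A WAF harvester is mostly "list the links, fetch the XML". A stdlib sketch of extracting candidate metadata links from an Apache-style listing is below (the sample page and filter rules are illustrative; a real harvester would likely use a proper HTML parser and resolve relative URLs):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

// listWAF pulls candidate metadata links out of an Apache-style
// directory listing, keeping only .xml files and skipping sort links.
func listWAF(html string) []string {
	var out []string
	for _, m := range hrefRe.FindAllStringSubmatch(html, -1) {
		h := m[1]
		if strings.HasSuffix(h, ".xml") && !strings.HasPrefix(h, "?") {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	page := `<html><body><h1>Index of /waf</h1>
<a href="?C=N;O=D">Name</a> <a href="/">Parent Directory</a>
<a href="record1.xml">record1.xml</a> <a href="record2.xml">record2.xml</a>
</body></html>`
	fmt.Println(listWAF(page)) // [record1.xml record2.xml]
}
```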
I'm not sure what's going on here, but I'm trying to index the NSIDC, and running into this. Any help or insight would be appreciated - at this point, I don't know if the bug is in Gleaner, or elsewhere.
I'm able to get json-ld out that looks like this, for example:
{
  "@context": {
    "vocab": "https://schema.org/"
  },
  "@graph": [
    {
      "@type": "Dataset",
      "provider": {
        "@type": "Organization",
        "name": "NSIDC: National Snow and Ice Data Center",
        "url": "https://nsidc.org",
        "logo": {
          "@type": "ImageObject",
          "representativeOfPage": "True",
          "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
          "width": "60 px",
          "height": "60 px"
        }
      },
      "@id": "NSIDC-0303",
      "name": "Radar Investigations of Antarctic Ice Stream Margins, Siple Dome, 1998, Version 1",
      "version": "1",
      "description": "This data set consists of surface-based radar measurements, including geometry of the bed, surface, and internal layers, and bed reflectivity measurements at two sites along ice stream margins at Siple Dome, Antarctica. The research is a radar examination of bed reflection characteristics and internal layer geometry in two inter-ice-stream ridges, the Shabtaie Ridge (Ridge D/E) and the Engelhardt Ridge (Ridge B/C), and across margins with the adjacent ice streams, the MacAyeal Ice Stream (Ice Stream E) and the Whillans Ice Stream (Ice Stream B). Investigators collected these radar data from 14 November through 13 December 1998. Data are in Microsoft Word, PDF, ASCII text, MATLAB, binary, and various image formats. Investigators have also provided code for MATLAB routines that they used to view the radar data. Data are available via FTP.",
      "temporalCoverage": "1998-11-14 00:00:00 to 1998-12-13 00:00:00",
      "spatialCoverage": "N: -80.1678, S: -83.3528, E: -138.3697, W: -141.6722",
      "identifier": "https://doi.org/10.7265/N52B8VZP",
      "keywords": "Radar \u0026gt; Radar Reflectivity \u0026gt; Bed Reflectivity, Radar \u0026gt; Radar Imagery \u0026gt; Bed, Surface, and Internal Layer Geometry",
      "author": {
        "@type": "Person",
        "name": [
          "Nadine Nereson",
          "Charles Raymond"
        ]
      },
      "publisher": {
        "@type": "Organization",
        "@id": "https://nsidc.org",
        "name": "National Snow and Ice Data Center",
        "url": "https://nsidc.org",
        "logo": {
          "@type": "ImageObject",
          "representativeOfPage": "True",
          "url": "https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png",
          "width": "60 px",
          "height": "60 px"
        }
      },
      "url": "https://nsidc.org/data/NSIDC-0303/versions/1"
    }
  ]
}
But what comes out of the miller for this metadata is missing a lot of the stuff I want, namely the DataSet and its description and all that good stuff. I just get the logo and the authors.
<https://nsidc.org> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
<https://nsidc.org> <http://schema.org/logo> _:bc932hau7ho5ajldeot1g .
<https://nsidc.org> <http://schema.org/name> "National Snow and Ice Data Center" .
<https://nsidc.org> <http://schema.org/url> <https://nsidc.org> .
_:bc932hau7ho5ajldeot1g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot1g <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot1g <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot1g <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot1g <http://schema.org/width> "60 px" .
_:bc932hau7ho5ajldeot20 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Nadine Nereson" .
_:bc932hau7ho5ajldeot20 <http://schema.org/name> "Charles Raymond" .
_:bc932hau7ho5ajldeot2g <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:bc932hau7ho5ajldeot2g <http://schema.org/logo> _:bc932hau7ho5ajldeot30 .
_:bc932hau7ho5ajldeot2g <http://schema.org/name> "NSIDC: National Snow and Ice Data Center" .
_:bc932hau7ho5ajldeot2g <http://schema.org/url> <https://nsidc.org/data/NSIDC-0303/versions/1> .
_:bc932hau7ho5ajldeot30 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
_:bc932hau7ho5ajldeot30 <http://schema.org/height> "60 px" .
_:bc932hau7ho5ajldeot30 <http://schema.org/representativeOfPage> "True" .
_:bc932hau7ho5ajldeot30 <http://schema.org/url> <https://nsidc.org/sites/nsidc.org/files/images/nsidc-logo.png> .
_:bc932hau7ho5ajldeot30 <http://schema.org/width> "60 px" .
Pondering renaming servers.yaml to generate.yaml or configure.yaml to reflect that it is used to generate the config files; this will also allow for adding other items.
sourcemaps:
  type: csv
  file:
  url:

gleaner:
  runid: pattern_{{date}}
Finish integrating the robots.txt support.
With many repositories, it can be a slight pain when the config needs to be updated.
A spreadsheet can be used to manage a list of sites, which can also be used to render an HTML page.
So: write a Python tool to generate the Gleaner and Nabu configs.
It could also be used for future generation from a database or a website.
This is an issue to document a new approach for incremental indexing that will remove the BoltDB in favor of an object-store-based approach.
When we pull down a new sitemap, we can get the array of URLs from it, compare it to the last stored sitemap in the bucket, and then only index the diff for incremental runs. The failure point is that we don't know if a URL has updated or newly added JSON-LD metadata; however, that is what the occasional full index is for vs the incremental indexes.
Sequence wise we would
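The diff step itself could be as simple as a set difference between the freshly fetched URL list and the one stored from the previous run (a sketch; the stored-sitemap fetch from the bucket is elided):

```go
package main

import "fmt"

// diff returns the URLs in current that were not in previous —
// the only ones an incremental run needs to queue.
func diff(previous, current []string) []string {
	seen := map[string]bool{}
	for _, u := range previous {
		seen[u] = true
	}
	var fresh []string
	for _, u := range current {
		if !seen[u] {
			fresh = append(fresh, u)
		}
	}
	return fresh
}

func main() {
	prev := []string{"https://example.org/a", "https://example.org/b"}
	curr := []string{"https://example.org/a", "https://example.org/b", "https://example.org/c"}
	fmt.Println(diff(prev, curr)) // [https://example.org/c]
}
```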
Something got lost.
Remote source in localConfig.yaml is not working.
It looks like it does work in a prov call and an object2rdf call.
Also, milled files are getting named with a .json extension, but they are .rdf files.
configuration
ecrr.zip
ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 18:17:06 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 18:17:06 Building organization graph.
2021/11/09 18:17:06 Miller start time: 2021-11-09 18:17:06.905400154 +0000 UTC m=+0.198000722
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr
2021/11/09 18:17:06 Adding bucket to milling list: summoned/ecrr_examples
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr
2021/11/09 18:17:06 Adding bucket to prov building list: prov/ecrr_examples
37% |███████████████ | (107/289, 100 it/s) [0s:1s]2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
2021/11/09 18:17:07 invalid character '0' in string escape code
2021/11/09 18:17:07 obj2RDF invalid character '0' in string escape code
78% |█████████████████████████████████ | (228/289, 113 it/s) [1s:0s]2021/11/09 18:17:08 invalid character 'u' looking for beginning of value
2021/11/09 18:17:08 obj2RDF invalid character 'u' looking for beginning of value
100% |███████████████████████████████████████████| (289/289, 152 it/s)
2021/11/09 18:17:08 Assembling result graph for prefix: summoned/ecrr to: milled/ecrr
2021/11/09 18:17:08 Result graph will be at: results/rr1/ecrr_graph.nq
2021/11/09 18:17:08 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
0% | | (0/0, 0 it/min) [0s:0s]
2021/11/09 18:17:10 Assembling result graph for prefix: summoned/ecrr_examples to: milled/ecrr_examples
2021/11/09 18:17:10 Result graph will be at: results/rr1/ecrr_examples_graph.nq
2021/11/09 18:17:10 Start pipe reader / writer sequence
2021/11/09 18:17:10 Pipe copy for graph done
2021/11/09 18:17:10 Miller end time: 2021-11-09 18:17:10.184253951 +0000 UTC m=+3.476854554
2021/11/09 18:17:10 Miller run time: 0.054648
So this:
<ns0:sitemapindex xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc> https://geoconnex.us/sitemap/namespaces/nmwdi/nmwdi_ose_ids__0.xml </loc>
    <lastmod> 2021-10-19 19:25:50.620909 </lastmod>
is valid but doesn't work.
Currently the sitemap parsing works with
<?xml version='1.0' encoding='utf-8'?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://samples.earth/sitemap0.xml</loc>
    <lastmod>2021-04-11</lastmod>
  </sitemap>
</sitemapindex>
The namespace approach is valid XML though, so this needs to be resolved.
Gleaner is not pulling in this sitemap: https://github.com/earthcube/GeoCODES-Metadata/blob/main/sitemap.xml
It reads it, but nothing is created/summoned.
ubuntu@ec-testbed-containers:~/glcon$ ./glcon gleaner batch --cfgName ecrr
2021/11/09 19:48:17 EarthCube Gleaner
Using gleaner config file: /home/ubuntu/glcon/configs/ecrr/gleaner
Using nabu config file: /home/ubuntu/glcon/configs/ecrr/nabu
batch called
2021/11/09 19:48:17 Building organization graph.
2021/11/09 19:48:18 Sitegraph(s) processed
2021/11/09 19:48:18 Summoner start time: 2021-11-09 19:48:18.096928621 +0000 UTC m=+0.172947508
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 We are not a sitemap index, check to see if we are a sitemap
2021/11/09 19:48:18 geocodes_examples sitemap size is : 4 queuing: 4 mode: full
2021/11/09 19:48:18 [{sitemap ecrr_submitted https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png rclone https://drive.google.com/drive/u/0/folders/1TacUQqjpBbGsPQ8JPps47lBXMQsNBRnd false Earthcube Resource Registry http://www.earthcube.org/resourceregistry/ false} {sitemap ecrr_examples https://www.earthcube.org/sites/default/files/doc-repository/logo_earthcube_full_horizontal.png https://raw.githubusercontent.com/earthcube/ecrro/master/Examples/sitemap.xml false Earthcube Resource Registry Examples http://www.earthcube.org/resourceregistry/examples false} {sitemap geocodes_examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 [{sitemap geocodes_examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/main/sitemap.xml false GeoCodes Tools Examples https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/ true}]
2021/11/09 19:48:18 Thread count 5 delay 0
100% |████████████████████████████████████████████████| (4/4, 15 it/s)
2021/11/09 19:48:18 Wrote log size 547
2021/11/09 19:48:20 Summoner end time: 2021-11-09 19:48:20.160551705 +0000 UTC m=+2.236570592
2021/11/09 19:48:20 Summoner run time: 0.034394
2021/11/09 19:48:20 Miller start time: 2021-11-09 19:48:20.160614155 +0000 UTC m=+2.236633059
2021/11/09 19:48:20 Adding bucket to milling list: summoned/geocodes_examples
2021/11/09 19:48:20 Adding bucket to prov building list: prov/geocodes_examples
0% | | (0/0, 0 it/min) [0s:0s]
2021/11/09 19:48:20 Assembling result graph for prefix: summoned/geocodes_examples to: milled/geocodes_examples
2021/11/09 19:48:20 Result graph will be at: results/rr1/geocodes_examples_graph.nq
2021/11/09 19:48:20 Start pipe reader / writer sequence
2021/11/09 19:48:20 Pipe copy for graph done
2021/11/09 19:48:20 Miller end time: 2021-11-09 19:48:20.35733808 +0000 UTC m=+2.433356968
2021/11/09 19:48:20 Miller run time: 0.003279
At the end of a run, send logs to the object store. This will let non-administrators and repository owners see the results.
{bucket}/logs/lastrun
Perhaps with a 'lastrun' file recording what the command was.
At the start of a run, move lastrun to a folder.
If we cronjob/automate, we may want a logrotate option.