The genscrape from rootsdev

FS Tree: Get attached sources

Will need to update existing source references (or add new ones) to correctly point to the source descriptions.

MyHeritage records

Examples for testing:

SSDI: https://www.myheritage.com/research/record-10002-10248383/stephen-a-zierak-in-us-social-security-death-index-ssdi
BillionGraves: https://www.myheritage.com/research/record-10147-4078167/anne-h-sarvay-in-billiongraves
BillionGraves Relations: https://www.myheritage.com/research/record-10147-52111725/janet-k-smith-in-billiongraves
FamilySearch Family Tree: https://www.myheritage.com/research/record-40001-171359631/katherine-zierak-in-familysearch-family-tree
1940 US Census: https://www.myheritage.com/research/record-10053-90714854/helen-g-yurkiewicz-in-1940-united-states-federal-census
1920 US Census: https://www.myheritage.com/research/record-10133-113306106/grace-t-kiefer-in-1920-united-states-federal-census
Bremen Departure Lists: https://www.myheritage.com/research/record-30240-12326/mopsche-turner-in-germany-bremen-passenger-departure-lists
Mexico Baptisms: https://www.myheritage.com/research/record-30039-1799709-F/j-ysac-gonzalez-in-mexico-baptisms
Scotland Baptisms: https://www.myheritage.com/research/record-30226-2401107-F/issabel-cameron-in-scotland-births-baptisms
California County Marriages: https://www.myheritage.com/research/record-30244-1399039/charles-e-york-and-frances-m-osborn-in-california-county-marriages
Germany Marriages: https://www.myheritage.com/research/record-30038-1968627-F/jacobine-friederike-van-der-horst-and-heinrich-kuckes-in-germany-marriages
Spain Marriages: https://www.myheritage.com/research/record-30057-1934741-F/cecilia-alabori-pujeu-and-juan-clota-y-clota-in-spain-marriages
Massachusetts Marriages: https://www.myheritage.com/research/record-30033-1113561/henry-leroy-york-and-elevia-belle-harriman-in-massachusetts-marriages
Denmark 1930 Census: https://www.myheritage.com/research/record-10181-2434583/terry-andersen-in-1930-denmark-census
1901 England Census: https://www.myheritage.com/research/record-10156-101511824/samuel-jeremy-in-1901-england-wales-census
1870 US Census: https://www.myheritage.com/research/record-10128-41588411/martha-snark-in-1870-united-states-federal-census
1910 US Census: https://www.myheritage.com/research/record-10132-53702700/john-s-townsend-in-1910-united-states-federal-census

Can we stop explicitly listing all available scrapers in main.js?

https://github.com/rootsdev/genscrape/blob/master/src/main.js#L71-L73

Remove lodash dependency

It's mostly used for forEach, isArray, isFunction, find. It ought to be easy to remove.
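
A minimal sketch of native stand-ins for the four helpers named above. The function names are hypothetical, and the sketch assumes callers only pass real arrays (lodash also accepts objects and array-likes, which these do not). Array.prototype.find requires an ES2015 runtime.

```javascript
// Hypothetical native replacements for the lodash helpers mentioned above.
// Assumption: every caller passes a real array, never an object or array-like.
function forEach(list, fn) {
  list.forEach(fn);
}

function isArray(value) {
  return Array.isArray(value);
}

function isFunction(value) {
  return typeof value === 'function';
}

function find(list, predicate) {
  // Array.prototype.find is ES2015; older runtimes would need a loop instead.
  return list.find(predicate);
}
```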

Find A Grave: birth place included with birth date

http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=30903512
http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=30903497

Schema

Two choices:

A basic schema such as what gen-search uses or a version of that which allows multiple assertions (which roots-search will use in the future).
A more complete schema that accounts for non-vital facts, sources, more relationships, etc.

In GedcomX, is there a way to state that the data was assembled by genscrape?

It could be useful to state that the data was assembled by genscrape and even specify which version of genscrape. Does the GedcomX model allow for that?

Find A Grave: support sh pages

http://www.findagrave.com/cgi-bin/fg.cgi?page=sh&GRid=92732955&
http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GRid=92732955&

Those two pages are the same profile. The only difference is that the sh page has a banner that says "You are taking a random walk through our online cemetery." The Find A Grave scraper currently only works on the gr pages.
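
One possible approach, sketched under the assumption that a profile is fully identified by the GRid query parameter: read the query string and accept either page value. getMemorialId is a hypothetical helper, not part of the current scraper.

```javascript
// Hypothetical helper: returns the memorial ID for both "gr" and "sh" pages,
// or null when the URL is some other kind of Find A Grave page.
// Assumption: GRid identifies the profile regardless of the page parameter.
function getMemorialId(url) {
  var query = url.split('?')[1] || '';
  var params = {};
  query.split('&').forEach(function (pair) {
    if (!pair) return; // skip the empty piece left by a trailing "&"
    var parts = pair.split('=');
    params[parts[0]] = parts[1];
  });
  if (params.page !== 'gr' && params.page !== 'sh') {
    return null;
  }
  return params.GRid || null;
}
```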

Add lib version in comment of packaged files

We used to include the version number in the file name, but CDNs expect the file name to be the same every time (rightly so), so we stopped doing that. Instead, let's add the version number in a comment at the beginning of the packaged files.
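
A rough sketch of what such a build step could look like. addVersionBanner is a hypothetical function; it assumes the version string is read from package.json elsewhere and that the bundle is available as text.

```javascript
// Hypothetical build step: prepend a version banner to the packaged source.
// Assumption: `version` comes from package.json and `source` is the bundle text.
function addVersionBanner(source, version) {
  return '/* genscrape v' + version + ' */\n' + source;
}
```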

Error thrown on some Find A Grave memorials

Ancestry beta

Get ready for the new version of Ancestry

gender

We're not looking for the gender yet. We'll need to do that eventually, before 1.0.

Document how to submit a new scraper

Document how to write it, how to add it to the genscrape lib, and how to test it.

Don't serialize ExtensibleData.links

Only FS uses them. We don't need them. They just bloat the data.

We can get rid of them easily by removing links from GedcomX.ExtensibleData.jsonProps

Person/birthDate from Open Archives not processed

While testing with WikiTree X and RootsSearch I noticed that the Person/birthDate value isn't recognized or handled. The scraper is similar to the Genealogie Online scraper, which does handle the Person/birthDate value. The microdata on both sites is the same (and valid).

Test page: https://www.openarch.nl/show.php?archive=srt&identifier=4d23eb16-04c8-c35d-e0c0-8504b2342a22&200&lang=en

Add a destroy() method?

There are two general use cases:

1. Inject genscrape and listen for all events (AJAX trees could fire multiple events as the user browses)
2. Inject genscrape, wait for one event, then tear down

genscrape supports use case (1) right now but doesn't support (2).

Should we add a configuration option that only allows one event to be fired, or should we add a destroy() method that apps are expected to call when they want use case (2)? Perhaps both.
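
To make the "perhaps both" option concrete, here is a sketch of a tiny scraper object that offers a once option and a destroy() method together. Everything here (createScraper, the once option, the emit method) is hypothetical and is not genscrape's current API.

```javascript
// Hypothetical sketch; none of these names exist in genscrape today.
function createScraper(options) {
  var listeners = [];
  var destroyed = false;
  var once = Boolean(options && options.once);

  return {
    // Register a handler that receives scraped data.
    on: function (handler) {
      listeners.push(handler);
      return this;
    },
    // Would be called internally whenever a page yields data.
    emit: function (data) {
      if (destroyed) return false;
      listeners.slice().forEach(function (handler) { handler(data); });
      if (once) this.destroy(); // single-event mode tears itself down
      return true;
    },
    // Explicit teardown for apps that want the second use case.
    destroy: function () {
      destroyed = true;
      listeners = [];
    }
  };
}
```

With once set, the second use case needs no extra call from the app; without it, the app calls destroy() itself when it has what it needs.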

Ancestry tree: Add name types

We know that the first name will be the BirthName and all others will be AlsoKnownAs
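
A small sketch of that rule using the GEDCOM X name type URIs. toTypedNames is a hypothetical helper and assumes the scraper has already collected the names as full-text strings in page order.

```javascript
// GEDCOM X name type URIs.
var BIRTH_NAME = 'http://gedcomx.org/BirthName';
var ALSO_KNOWN_AS = 'http://gedcomx.org/AlsoKnownAs';

// Hypothetical helper: `names` is an array of full-name strings in page order.
// The first becomes the BirthName; every later one becomes AlsoKnownAs.
function toTypedNames(names) {
  return names.map(function (fullText, index) {
    return {
      type: index === 0 ? BIRTH_NAME : ALSO_KNOWN_AS,
      nameForms: [{ fullText: fullText }]
    };
  });
}
```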

Failing test in node v4+

Why does the urlPatternToRegex test fail in node v4 and up?

Handle new URL structure in FS tree

Used to be https://familysearch.org/tree/#view=ancestor&person=K2HD-1TC

New URLs are https://familysearch.org/tree/person/K2HD-1TC/details
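
A sketch of extracting the person ID from either URL style. getPersonId is a hypothetical helper, and the regular expressions are based only on the two example URLs above.

```javascript
// Hypothetical helper covering both FamilySearch tree URL styles shown above.
function getPersonId(url) {
  // New style: /tree/person/K2HD-1TC/details
  var newMatch = url.match(/\/tree\/person\/([A-Z0-9-]+)/i);
  if (newMatch) return newMatch[1];

  // Old style: /tree/#view=ancestor&person=K2HD-1TC
  var oldMatch = url.match(/[#&]person=([A-Z0-9-]+)/i);
  return oldMatch ? oldMatch[1] : null;
}
```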

Add support for WikiTree

Create a src/lib directory

All files currently in src except for main.js will be moved into it. That will make the code a little more organized.

Remove jQuery dependency

As we work to rewrite all parsers for the schema upgrade, let's remove the dependency on jQuery.

document error event

Ancestry: incomplete names and nicknames

When the first name was blank and the last name was known, the last name was incorrectly populated into the first name field on WikiTree.

A nickname in "Quotes" was ignored and not populated into the nicknames field.

http://www.wikitree.com/g2g/321570/use-the-chrome-web-browser-try-the-new-wikitree-x-extension?show=326013#c326013
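
For the nickname half of this report, a sketch of separating a quoted nickname from the rest of the name. splitNickname is hypothetical, and it assumes the nickname appears in straight double quotes inside the displayed name, which has not been checked against Ancestry's actual markup.

```javascript
// Hypothetical helper: pulls a double-quoted nickname out of a display name.
// Assumption: nicknames are wrapped in straight double quotes.
function splitNickname(displayName) {
  var match = displayName.match(/"([^"]+)"/);
  var nickname = match ? match[1] : null;
  // Remove the quoted part, then collapse the whitespace left behind.
  var name = displayName.replace(/"[^"]*"/, ' ').replace(/\s+/g, ' ').trim();
  return { name: name, nickname: nickname };
}
```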

Standardize citations

As we work on #23, we are adding citations via a SourceDescription. They're not very standard right now. Once it ships, let's revisit citations, including exploring citation templates.

Ancestry might be broken

Test with http://person.ancestry.com/tree/34211057/person/19227526747/facts

Family names are parsed incorrectly. HTML might have changed.

Ancestry profiles no longer getting birth and death dates?

A user reported that this is happening.

Add information about the originating site

Apps working with genscrape might want to programmatically detect and display information about the site that the data originates from. They shouldn't have to generate their own URL matching algorithm for all the different websites because we already do that.

SourceDescriptions have a repository property which is a URI that resolves to an Agent which can be used to describe an owner of a repository.

TODO:

UTF8

Make sure we handle UTF8.

Test: https://familysearch.org/tree/#view=ancestor&person=L5CB-71D&section=details

Find A Grave: add the memorial number to the citation

Current:

Find A Grave, database and images (http://findagrave.com : accessed 2 February 2017), memorial page for Lucy Brancheau Loranger (1824 - 1870) - Find A Grave Memorial.

Requested:

Find A Grave, database and images (http://findagrave.com : accessed 2 February 2017), memorial #93209858 for Lucy Brancheau Loranger (1824 - 1870) - Find A Grave Memorial.
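
A sketch of building the requested citation string. buildCitation is a hypothetical helper; it assumes the memorial number, page title, and access date have already been extracted.

```javascript
// Hypothetical helper: assembles the requested citation format.
// Assumption: memorialId, title, and accessed are already scraped or computed.
function buildCitation(memorialId, title, accessed) {
  return 'Find A Grave, database and images (http://findagrave.com : accessed ' +
    accessed + '), memorial #' + memorialId + ' for ' + title + '.';
}
```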

Expose the originating entity ID

For example, FamilySearch Family Tree person ID or Find A Grave memorial number.

We will set the persons' IDs to be the ID we want to expose (instead of the auto-incrementing IDs we've been using).

Related to #33

findmypast

Update Find A Grave to work with https URLs

Find A Grave recently enabled https and now forwards all traffic to it.

genscrape/src/scrapers/findagrave.js

Line 6 in 1af4027

utils.urlPatternToRegex("http://www.findagrave.com/cgi-bin/fg.cgi*")
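
To illustrate the intended matching behavior, here is a stand-in regular expression that accepts both protocols. It is written by hand and is not the output of utils.urlPatternToRegex, whose pattern syntax is not shown here; isFindAGraveUrl is a hypothetical name.

```javascript
// Stand-in regex, written by hand; not produced by utils.urlPatternToRegex.
var FIND_A_GRAVE = /^https?:\/\/www\.findagrave\.com\/cgi-bin\/fg\.cgi/;

// Hypothetical helper: true for both http and https Find A Grave URLs.
function isFindAGraveUrl(url) {
  return FIND_A_GRAVE.test(url);
}
```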

Where should SourceDescriptions be attached?

In working on #33 I see that we need to be more deliberate about where we attach SourceDescriptions. FamilySearch generates its own SourceDescriptions and attaches them to the root GEDCOM X element. For all other sites we have been generating one SourceDescription and attaching it to all persons and relationships in the document. We did that before realizing that the root element had a description property.

I believe we should use the root level description property. But should we continue attaching that same SourceDescription to all persons and relationships?

Reveal list of supported sites for programmatic access

Internally, genscrape only has a list of regexes for matching against urls. To make this work we would need to add another parameter in the register function that accepts an id/name.
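
A sketch of what a register function with an extra id parameter might look like, together with the list it would make possible. The signature, the registry array, and both lookup functions are hypothetical and do not reflect the current implementation.

```javascript
// Hypothetical registry; the real register function's signature may differ.
var registry = [];

// `id` is the new parameter proposed above.
function register(id, patterns, scraper) {
  registry.push({ id: id, patterns: patterns, scraper: scraper });
}

// The programmatic list of supported sites.
function supportedSites() {
  return registry.map(function (entry) { return entry.id; });
}

// Returns the id of the first site whose patterns match, or null.
function findSiteId(url) {
  for (var i = 0; i < registry.length; i++) {
    var matched = registry[i].patterns.some(function (pattern) {
      return pattern.test(url);
    });
    if (matched) return registry[i].id;
  }
  return null;
}
```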

Schema upgrade to GEDCOM X

genscrape uses the gensearch schema. It's a simple schema designed for searching on genealogy websites. But genscrape has many more use cases and ought to have a more advanced schema with first-class support for multiple persons, relationships, and sources.

{
    "persons": [],
    "relationships": [],
    "sources": []
}

That looks very similar to the GEDCOM X JSON format, the only exception being that GEDCOM X has sourceDescriptions instead of sources. Despite how much I dislike the term sourceDescriptions and its schema, I _really_ like the idea of not having to create my own data format.

Testing

I figure we'll use PhantomJS for testing since we need to simulate a browser.

Authentication will be an issue since most trees and records are behind a paywall, or at least require a login. We can't put credentials in this public repo so we'll probably need a script of some sort that prompts the developer for relevant auth credentials before running the test suite.

BillionGraves

Figure out why it seems to be broken

Ancestry: support other country domains beyond .com

Tree:

Records:

Or perhaps just the English ones for now because the Ancestry parser only works with English labels.

Don't commit built js file

We are currently committing the built genscrape.js file. Let's stop doing that.

Document which parts of the GEDCOM X model we use

Use pieces of the RS and Records spec
Persons and relationships (all that are available)
Person.principal to denote the focus person
GedcomX.description points to a SourceDescription for the data
SourceDescription.repository points to an Agent representing the website origin
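
The pieces listed above could fit together roughly as follows. This is an illustrative document with placeholder values (example.com, "Example Person", "Example Site"), not real genscrape output, and getRepositoryAgent is a hypothetical helper that follows the description and repository references.

```javascript
// Illustrative document with placeholder values; not real genscrape output.
var doc = {
  description: '#source', // GedcomX.description points to a SourceDescription
  persons: [
    {
      id: 'p1',
      principal: true, // Person.principal marks the focus person
      names: [{ nameForms: [{ fullText: 'Example Person' }] }]
    }
  ],
  relationships: [],
  sourceDescriptions: [
    {
      id: 'source',
      about: 'https://example.com/record/1',
      repository: { resource: '#agent' } // points to an Agent
    }
  ],
  agents: [
    { id: 'agent', names: [{ value: 'Example Site' }] }
  ]
};

// Resolve a local "#id" reference within a list.
function findById(list, ref) {
  var id = ref.replace(/^#/, '');
  return (list || []).filter(function (item) { return item.id === id; })[0] || null;
}

// Hypothetical helper: root description -> SourceDescription -> repository Agent.
function getRepositoryAgent(gedcomx) {
  var description = findById(gedcomx.sourceDescriptions, gedcomx.description);
  if (!description || !description.repository) return null;
  return findById(gedcomx.agents, description.repository.resource);
}
```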

Update dev dependencies

They are severely out of date, though we might want to wait until istanbul has a final 1.0 release.

https://david-dm.org/rootsdev/genscrape#info=devDependencies&view=table

Use gensites for populating repository information

Related to work done in #33. We could use gensites to generate that info.

Detect schema.org

Option to get all the data

Using a common schema will inherently lead to some data loss. Add an option to get the complete source-specific schema.

Another thought is to use the source schema by default and have an option for converting it into a shared schema.

API

This library will just be a utility. It is designed to function in multiple environments, such as a browser extension or a node.js app. We will need to devise an API that functions well for common use cases.

It needs to be async since websites such as FamilySearch and MyHeritage load data via AJAX on some pages.

Also, in an environment like a browser extension where a user might navigate to multiple pages, there will be more than one data or load event and might even be a no-data event. This is just begging for an event driven system. Perhaps something like node's common method of on('event_name', function(data){ }).
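
A minimal sketch of the event-driven shape described above, with chained on() calls. createEmitter and the event names 'data' and 'noData' are assumptions for illustration; the emit calls at the end stand in for whatever the library would do after a page or AJAX load finishes.

```javascript
// Hypothetical event-driven wrapper; the event names are assumptions.
function createEmitter() {
  var handlers = {};
  return {
    on: function (eventName, handler) {
      (handlers[eventName] = handlers[eventName] || []).push(handler);
      return this; // allow chaining
    },
    emit: function (eventName, payload) {
      (handlers[eventName] || []).forEach(function (handler) {
        handler(payload);
      });
    }
  };
}

// Usage: one listener per event, fired as often as the page produces data.
var log = [];
var scraper = createEmitter()
  .on('data', function (data) { log.push('data:' + data.persons.length); })
  .on('noData', function () { log.push('noData'); });

scraper.emit('data', { persons: [{}, {}] });
scraper.emit('noData');
```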

It would be nice if it pulled the Marriage date and location when present in a profile from Ancestry.com

http://www.wikitree.com/g2g/321570/use-the-chrome-web-browser-try-the-new-wikitree-x-extension?show=326020#c326020

Add searches of the Gesher Galicia records API

Yeah, I know, it's a very niche website, but hey! Free open records API! May as well include it too.

Full docs here:
http://docs.geshergalicia.apiary.io/
