rootsdev / genscrape

JavaScript library that aids in scraping person data off of genealogy websites

License: MIT License

JavaScript 3.73% HTML 96.27%

genscrape's Issues

MyHeritage records

Examples for testing:

Schema

Two choices:

  1. A basic schema such as what gen-search uses or a version of that which allows multiple assertions (which roots-search will use in the future).
  2. A more complete schema that accounts for non-vital facts, sources, more relationships, etc.

Add lib version in comment of packaged files

We used to include the version number in the file name, but CDNs expect the file name to stay the same across releases (rightly so), so we stopped doing that. Instead, let's add the version number in a comment at the beginning of the packaged files.
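One way to do this is a small build step that prepends a banner comment to each bundled file. The sketch below is only an illustration: `addVersionBanner` is a hypothetical helper, and the name and version passed to it are placeholders rather than values from the real build.

```javascript
// Hypothetical helper: prepend a version banner comment to bundled source text.
// The "/*!" prefix is a common convention that many minifiers preserve.
function addVersionBanner(source, name, version) {
  const banner = `/*! ${name} v${version} */\n`;
  return banner + source;
}

// Placeholder bundle contents and version, for illustration only.
const bundled = addVersionBanner('window.genscrape = {};', 'genscrape', '0.0.0');
console.log(bundled);
```

The file name stays constant for CDNs while the version remains discoverable by opening the file.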

gender

We're not extracting gender yet. We'll need to do that eventually, before 1.0.

Don't serialize ExtensibleData.links

Only FS uses them. We don't need them. They just bloat the data.

We can get rid of them easily by removing links from GedcomX.ExtensibleData.jsonProps
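A rough sketch of the idea follows. It does not use the real GEDCOM X library: `ExtensibleData` here is a stand-in object, and it assumes `jsonProps` is an array of property names that the serializer copies into its output. `dropJsonProp` and `toJSON` are hypothetical helpers.

```javascript
// Stand-in for GedcomX.ExtensibleData (assumption: jsonProps is an array of
// property names that the serializer copies into the JSON output).
const ExtensibleData = { jsonProps: ['id', 'links'] };

// Remove one property name from the serialization list.
function dropJsonProp(type, prop) {
  type.jsonProps = type.jsonProps.filter((name) => name !== prop);
  return type.jsonProps;
}

// Toy serializer: copy only the listed properties that are actually set.
function toJSON(type, instance) {
  const json = {};
  for (const prop of type.jsonProps) {
    if (instance[prop] !== undefined) json[prop] = instance[prop];
  }
  return json;
}

dropJsonProp(ExtensibleData, 'links');

// Made-up instance data; the links object is no longer serialized.
const person = { id: 'p1', links: { self: { href: 'https://example.org/p1' } } };
console.log(toJSON(ExtensibleData, person));
```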

Add a destroy() method?

There are two general use cases:

  1. Inject genscrape and listen for all events (AJAX trees could fire multiple events as the user browses)
  2. Inject genscrape, wait for one event, then tear down

genscrape supports use case (1) right now but doesn't support (2).

Should we add a configuration option that allows only one event to be fired, or should we add a destroy() method that apps are expected to call when they want use case (2)? Perhaps both.
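To make the two options concrete, here is a minimal sketch that supports both. `ScrapeSession`, its `once` option, and its `emit` method are hypothetical names for illustration, not genscrape's actual API.

```javascript
// Hypothetical session object supporting both a one-shot option and destroy().
class ScrapeSession {
  constructor(options = {}) {
    this.once = Boolean(options.once); // tear down after the first event
    this.listeners = {};
    this.destroyed = false;
  }

  on(event, handler) {
    if (!this.listeners[event]) this.listeners[event] = [];
    this.listeners[event].push(handler);
    return this;
  }

  // Returns false when the session has already been torn down.
  emit(event, data) {
    if (this.destroyed) return false;
    (this.listeners[event] || []).slice().forEach((handler) => handler(data));
    if (this.once) this.destroy();
    return true;
  }

  // Drop all listeners and ignore any later events.
  destroy() {
    this.listeners = {};
    this.destroyed = true;
  }
}

// Use case (2): wait for one event, then tear down automatically.
const oneShot = new ScrapeSession({ once: true });
oneShot.on('data', (data) => console.log('got', data.id));
oneShot.emit('data', { id: 'p1' });
oneShot.emit('data', { id: 'p2' }); // ignored, session already destroyed
```

With `once` left unset, the same object covers use case (1), and the app calls `destroy()` itself when it is done.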

Create a src/lib directory

All files currently in src except for main.js will be moved into it. That will make the code a little more organized.

Standardize citations

As we work on #23, we are adding citations via a SourceDescription. They're not very standardized right now. Once it ships, let's revisit citations, including exploring citation templates.

Add information about the originating site

Apps working with genscrape might want to programmatically detect and display information about the site that the data originates from. They shouldn't have to write their own URL-matching logic for all the different websites because we already do that.

A SourceDescription has a repository property, which is a URI that resolves to an Agent. That Agent can describe the owner of the repository, which for us is the originating website.

TODO:

  • #34 Where should SourceDescriptions be attached?
  • Create Agent and attach to the root SourceDescription
    • Ancestry tree
    • Ancestry record
    • BillionGraves
    • FamilySearch ancestor
    • FamilySearch record
    • Find A Grave
    • findmypast record
    • findmypast tree
    • Genealogie Online
    • OpenArch
    • WeRelate
    • WikiTree

Expose the originating entity ID

For example, FamilySearch Family Tree person ID or Find A Grave memorial number.

We will set the persons' IDs to be the ID we want to expose (instead of the auto-incrementing IDs we've been using).

Related to #33

  • Ancestry tree
  • Ancestry record
  • BillionGraves
  • FamilySearch ancestor
  • FamilySearch record
  • Find A Grave
  • findmypast record
  • findmypast tree
  • Genealogie Online
  • OpenArch
  • WeRelate
  • WikiTree
  • Document Identifiers

Where should SourceDescriptions be attached?

In working on #33 I see that we need to be more deliberate about where we attach SourceDescriptions. FamilySearch generates its own SourceDescriptions and attaches them to the root GEDCOM X element. For all other sites we have been generating one SourceDescription and attaching it to all persons and relationships in the document. We did that before realizing that the root element had a description property.

I believe we should use the root level description property. But should we continue attaching that same SourceDescription to all persons and relationships?

Schema upgrade to GEDCOM X

genscrape uses the gensearch schema. It's a simple schema designed for searching on genealogy websites. But genscrape has many more use cases and ought to have a more advanced schema with first-class support for multiple persons, relationships, and sources.

{
    "persons": [],
    "relationships": [],
    "sources": []
}

That looks very similar to the GEDCOM X JSON format, with the only exception being that GEDCOM X has sourceDescriptions instead of sources. Despite how much I dislike the term sourceDescriptions and its schema, I _really_ like the idea of not having to create my own data format.

Testing

I figure we'll use PhantomJS for testing since we need to simulate a browser.

Authentication will be an issue since most trees and records are behind a paywall, or at least require a login. We can't put credentials in this public repo so we'll probably need a script of some sort that prompts the developer for relevant auth credentials before running the test suite.

Ancestry: support other country domains beyond .com

Tree:

  • United Kingdom: .co.uk
  • Canada: .ca
  • Australia: .com.au
  • Germany: .de
  • Italy: .it
  • France: .fr
  • Sweden: .se
  • Mexico: .mx

Records:

  • United Kingdom: .co.uk
  • Canada: .ca
  • Australia: .com.au
  • Germany: .de
  • Italy: .it
  • France: .fr
  • Sweden: .se
  • Mexico: .mx

Or perhaps just the English ones for now because the Ancestry parser only works with English labels.
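A hostname check covering the suffixes listed above could look roughly like this. `isAncestryHost` is a hypothetical helper, and the assumption that every site lives at `ancestry.<suffix>` (optionally behind a subdomain such as `www`) is mine, not something confirmed here.

```javascript
// Country suffixes from the lists above, plus the original .com.
const ANCESTRY_SUFFIXES = ['com', 'co.uk', 'ca', 'com.au', 'de', 'it', 'fr', 'se', 'mx'];

// Build one anchored pattern: optional subdomains, then "ancestry.<suffix>".
const ANCESTRY_HOST = new RegExp(
  '(^|\\.)ancestry\\.(' +
    ANCESTRY_SUFFIXES.map((suffix) => suffix.replace(/\./g, '\\.')).join('|') +
    ')$',
  'i'
);

// Hypothetical helper: true when the hostname belongs to a listed Ancestry domain.
function isAncestryHost(hostname) {
  return ANCESTRY_HOST.test(hostname);
}

console.log(isAncestryHost('www.ancestry.co.uk'));
console.log(isAncestryHost('ancestry.com.evil.org'));
```

Restricting support to the English-language sites would then only mean trimming the suffix list.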

Document which parts of the GEDCOM X model we use

  • Use pieces of the RS and Records spec
  • Persons and relationships (all that are available)
  • Person.principal to denote the focus person
  • GedcomX.description points to a SourceDescription for the data
  • SourceDescription.repository points to an Agent representing the website origin
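As a sketch of how those pieces fit together, the snippet below builds a small GEDCOM X-style document and resolves the focus person and the originating site from it. All IDs, names, and URLs are invented, and `byRef`, `originAgent`, and `principalPerson` are hypothetical helpers, not part of genscrape or the GEDCOM X library.

```javascript
// Illustrative GEDCOM X-style document; all IDs, names, and URLs are made up.
const doc = {
  description: '#sd1', // root-level pointer to the SourceDescription
  persons: [
    { id: 'p1', principal: true }, // the focus person
    { id: 'p2' }
  ],
  relationships: [
    {
      type: 'http://gedcomx.org/ParentChild',
      person1: { resource: '#p2' }, // parent
      person2: { resource: '#p1' } // child
    }
  ],
  sourceDescriptions: [
    { id: 'sd1', repository: { resource: '#a1' } }
  ],
  agents: [
    {
      id: 'a1',
      names: [{ value: 'Example Genealogy Site' }],
      homepage: { resource: 'https://example.org' }
    }
  ]
};

// Resolve a local "#id" reference against a list of objects with ids.
function byRef(list, ref) {
  const id = String(ref || '').replace(/^#/, '');
  return (list || []).find((item) => item.id === id);
}

// Follow GedcomX.description -> SourceDescription.repository -> Agent.
function originAgent(document) {
  const description = byRef(document.sourceDescriptions, document.description);
  if (!description || !description.repository) return undefined;
  return byRef(document.agents, description.repository.resource);
}

// Find the person flagged as the focus of the page.
function principalPerson(document) {
  return (document.persons || []).find((person) => person.principal === true);
}

console.log(principalPerson(doc).id, originAgent(doc).names[0].value);
```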

Option to get all the data

Using a common schema will inherently lead to some data loss. Add an option to get the complete source-specific schema.

Another thought is to use the source schema by default and have an option for converting it into a shared schema.

API

This library will just be a utility. It is designed to function in multiple environments, such as a browser extension or a node.js app. We will need to devise an API that functions well for common use cases.

It needs to be async since websites such as FamilySearch and MyHeritage load data via AJAX on some pages.

Also, in an environment like a browser extension where a user might navigate to multiple pages, there will be more than one data or load event, and there might even be a no-data event. This is just begging for an event-driven system. Perhaps something like node's common method of on('event_name', function(data){ }).
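A minimal sketch of that shape, built on Node's `EventEmitter`, might look like the following. The `scrape` function, the `fetchPage` callback, and the `data` / `noData` event names are assumptions made for illustration; the eventual API may differ.

```javascript
const { EventEmitter } = require('events');

// Hypothetical entry point: runs an async page fetch and reports the outcome
// through events instead of a return value.
function scrape(fetchPage) {
  const emitter = new EventEmitter();
  Promise.resolve()
    .then(fetchPage)
    .then((result) => {
      if (result) emitter.emit('data', result);
      else emitter.emit('noData');
    })
    .catch((err) => emitter.emit('error', err));
  return emitter;
}

// Listeners are attached synchronously, before the async work completes.
const session = scrape(() => ({ persons: [{ id: 'p1' }] }));
session.on('data', (data) => console.log('persons found:', data.persons.length));
session.on('noData', () => console.log('no data on this page'));
session.on('error', (err) => console.error('scrape failed:', err.message));
```

Because the emitter is returned immediately, the same pattern works whether a page yields one event, several, or none.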
