Introduction

Download the data

  • CSV
  • Stata

Purpose

A scraper for the Adelsvapen genealogy website, which documents the noble families of Sweden.

Output

Click here to see what the data looks like.

Planning

This is what the website looks like:

And these are the pages with the information that we want to get, including:

  • Name of the family
  • Dates of introduction into the nobility and of extinction
  • The first person in the family and their biography
  • The biography of their children

Ideally we would want to process this text in such a way as to keep the bolded names, and to extract birth places and dates.

Robots.txt

First we check if they have asked us not to scrape the website.

robots.txt is a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit. This relies on voluntary compliance.

They have an extensive list of bots they do not allow, but do not forbid us from scraping the /genealogi subsection. So as long as we use a good wait time between hits so as not to overwhelm their server, we will be okay!
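We can also check this programmatically. A quick sketch using the robotstxt package (this check is an addition, not part of the scraping pipeline itself):

library(robotstxt)

# should return TRUE if generic crawlers may visit the /genealogi subsection
paths_allowed(
  paths  = "/genealogi/",
  domain = "www.adelsvapen.com",
  bot    = "*"
)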

Starting point

We want to get a list of the noble families and a link to their family trees, as shown in the table below:

Noble families of Sweden
Abrahamsson - von Döbeln link
von Ebbertz - Järnefelt link
Kafle - Oxehufwud link
Pahl - Sölfvermusköt link
Tallberg - Östner link
Source: Adelsvapen
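The code further down assumes a tibble noble_families with a url column holding these index pages. A minimal sketch of how it might be built (the index URL and the link filter are assumptions, not confirmed from the site):

library(rvest)
library(tidyverse)

index_url <- "https://www.adelsvapen.com/genealogi/"

index_links <- read_html(index_url) %>% html_nodes("a")

noble_families <- tibble(
  name = index_links %>% html_text(),
  url  = index_links %>% html_attr("href")
) %>%
  # keep only internal genealogy links and make them absolute
  filter(str_detect(url, "^/genealogi/")) %>%
  mutate(url = str_c("https://www.adelsvapen.com", url))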

Branches

Next we want to scrape the links to the branches of family trees for each family. They look like this:

Ideally we would loop through the list and keep a record of each branch of the family tree with a link to it.

# let's write a function to do this

library(rvest)
library(tidyverse) # dplyr, purrr, stringr and readr verbs are used below

get_branches <- function(url_in) {
  message("Getting branches of family tree from: ", url_in)
  Sys.sleep(5)
  html <- read_html(url_in)

  family_links <- html %>%
    html_nodes("a") %>%
    html_attr("href")

  family_links_filtered <- family_links %>%
    as_tibble() %>%
    rename(branch_url = value) %>%
    # keep links that start with /genealogi/ and contain a digit (the branch pages)
    filter(
      str_detect(branch_url, "/genealogi/.+\\d"),
      !str_detect(branch_url, "Adelsvapen-Wiki"),
      !str_detect(branch_url, "Special:")
    )
  
  return(family_links_filtered)
}

noble_families_branches <- noble_families %>%
  # possibly() returns NULL on failure; unnest() below simply drops those rows
  mutate(branches = map(url, possibly(get_branches, otherwise = NULL)))

noble_families_branches %>% 
  unnest(branches) %>% 
  write_rds(here::here("links", "family_links.rds"))

So now we have a link to each of the 2,335 branches of the family trees.

A sample of these is shown in the table below:

Sample of branches
Noble family Branch name
Pahl - Sölfvermusköt Palmencrona nr 1559
von Ebbertz - Järnefelt Gripensch%C3%BCtz nr 1613
von Ebbertz - Järnefelt Falck nr 156
Abrahamsson - von Döbeln Arnell nr 1885
Pahl - Sölfvermusköt Patkull nr 237
Kafle - Oxehufwud Mannerskantz nr 2082
von Ebbertz - Järnefelt Ihre nr 2043
Abrahamsson - von Döbeln De Frietzcky nr 1375
Abrahamsson - von Döbeln Bj%C3%B6rkenstam nr 2269
Kafle - Oxehufwud Olivecrona nr 1626
Abrahamsson - von Döbeln Von Brandten nr 1376
Kafle - Oxehufwud Ogilwie nr 277
von Ebbertz - Järnefelt Von Graffenthal nr 822
Abrahamsson - von Döbeln Didron nr 440
Kafle - Oxehufwud Lenck nr 448
Abrahamsson - von Döbeln Berch nr 1774
Kafle - Oxehufwud Abrahamsson - von D%C3%B6beln
von Ebbertz - Järnefelt Grubbenhielm nr 708
Pahl - Sölfvermusköt Von Ebbertz - J%C3%A4rnefelt
Abrahamsson - von Döbeln Bergstedt nr 2199
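Note that some branch names are still percent-encoded (e.g. Gripensch%C3%BCtz), because they come straight from the scraped href attributes. A small cleanup sketch using base R's URLdecode() (an extra step, not part of the pipeline above):

library(purrr)

# URLdecode() handles one string at a time, so map over the vector
decode_name <- function(x) map_chr(x, URLdecode)

decode_name("Gripensch%C3%BCtz")
#> [1] "Gripenschütz"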

Information on each branch of family tree

Next we want to get the data from each of these branches, including the title, information about the origin individual, and information on their children.

# Again, let's write a function for this
url_in <- "https://www.adelsvapen.com/genealogi/Abrahamsson_nr_1817" # matches the output saved below

get_info_from_family <- function(url_in) {
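  # NB: add a polite wait here (e.g. Sys.sleep(10), as in the Conclusion)
  # before looping this over all branches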
  html <- read_html(url_in)

  fam_abb <- str_remove(url_in, "https://www.adelsvapen.com/genealogi/")

  title <- html %>%
    html_nodes(".title") %>%
    html_text()

  bio <- html %>%
    html_nodes("p") %>%
    html_text() %>%
    str_squish() %>%
    as_tibble() %>%
    rename(info = value) %>%
    filter(str_detect(info, "\\d\\d\\d\\d")) %>%
    nest(bio = everything())

  children <- html %>%
    html_nodes("ul") %>%
    html_text() %>%
    str_squish() %>%
    as_tibble() %>%
    rename(children_bio = value) %>%
    filter(str_detect(children_bio, "\\d\\d\\d\\d")) %>%
    nest(children = everything())

  # image
  # img_src <- html %>%
  #   html_node(".image") %>%
  #   html_elements("img") %>%
  #   html_attr("src")
  # 
  # img_url <- img_src %>% str_c("https://www.adelsvapen.com", .)
  # 
  # download.file(img_url, destfile = here::here(glue::glue("scraped_images/{fam_abb}.jpg")), mode = "wb")

  return(tibble(title, bio, children))
}

data <- get_info_from_family(url_in)

data %>% 
  write_rds(here::here("temp", "Abrahamsson.rds"))

data <- read_rds(here::here("temp", "Abrahamsson.rds"))

library(gt) # for the presentation table below

data %>%
  unnest(bio) %>%
  unnest(children) %>%
  mutate(across(where(is.numeric), as.character)) %>% 
  pivot_longer(everything(), names_to = "Section", values_to = "Text") %>%
  mutate(Section = str_to_title(str_replace(Section, "_", " "))) %>%
  arrange(desc(Section)) %>%
  distinct() %>%
  group_by(Section) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "#191970"),
      cell_text(color = "white")
    ),
    locations = cells_row_groups(groups = everything())
  ) %>%
  tab_header(title = md("**Example of data on individual family branches**")) %>%
  tab_footnote(md("Source: [Adelsvapen](https://www.adelsvapen.com/genealogi/Abrahamsson_nr_1817)")) %>% 
  gtsave(here::here("temp", "Abrahamsson_info.png"))

Abrahamsson 1817 Family information

Images of family crest

We can also include the family crest:

Abrahamsson 1817 Family crest
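The commented-out block inside get_info_from_family() already sketches this. Pulled out as a standalone helper, with the same selectors and URL prefix as that block (untested here):

library(rvest)
library(tidyverse)

get_crest <- function(url_in) {
  fam_abb <- str_remove(url_in, "https://www.adelsvapen.com/genealogi/")

  # find the crest image and rebuild its absolute URL
  img_src <- read_html(url_in) %>%
    html_node(".image") %>%
    html_element("img") %>%
    html_attr("src")
  img_url <- str_c("https://www.adelsvapen.com", img_src)

  download.file(
    img_url,
    destfile = here::here(glue::glue("scraped_images/{fam_abb}.jpg")),
    mode = "wb"
  )
}

get_crest("https://www.adelsvapen.com/genealogi/Abrahamsson_nr_1817")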

Conclusion

To get the information on each of the family branches, we need to loop through our list of 2,335 branches. This will take some time: at 10 seconds of wait between hits, that is 2,335 × 10 s, or roughly 6.5 hours.

We also need to think about how to structure the data that we scrape before going through the whole scraping process.
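To make that concrete, here is a sketch of the final loop, reusing get_info_from_family() and the saved links (the wrapper and output file name are assumptions; the URL prefix handles links that were stored as relative paths):

library(tidyverse)

all_branches <- read_rds(here::here("links", "family_links.rds"))

scrape_branch <- function(u) {
  Sys.sleep(10) # the polite wait discussed above
  get_info_from_family(u)
}

family_data <- all_branches %>%
  mutate(
    full_url = if_else(
      str_starts(branch_url, "http"),
      branch_url,
      str_c("https://www.adelsvapen.com", branch_url)
    ),
    # possibly() returns NULL for any branch that fails, so one bad
    # page does not kill a six-hour run
    info = map(full_url, possibly(scrape_branch, otherwise = NULL))
  )

family_data %>%
  write_rds(here::here("temp", "family_data.rds"))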


Issues

Scraper problem: source missing for some values

Description

For some records, the source is missing because I've used a regex that looks for the plural heading (Källor, "sources") and not the singular (Källa, "source").

This can be fixed; let's do them all together rather than scraping again too soon.
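In code, the fix could be as small as broadening the regex to match both headings:

# match the singular "Källa" as well as the plural "Källor"
str_detect(info, "Käll(a|or)")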

Scrape Adelsvapen

One issue: we need to fix the missing records that didn't get scraped the first time, and then add them.
