hadro / hathi_analysis

Usage analysis of HathiTrust

Home Page: https://hadro.github.io/hathi_analysis/

HTML 96.56% Jupyter Notebook 3.44% Python 0.01%

hathi_analysis's Issues

Access oriented graph

Via NK:

what if we did these graphs with just open volumes, so
all volumes in hathi
all open volumes in hathi
all accessed open volumes in hathi
on a big overlaid bar chart
and then the bottom panel would just be a time series of the % of open volumes accessed.
It would give a more continuous idea of "if we do the work to open this item, what is the chance it will be accessed?"
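
A minimal sketch of the three per-year series and the percentage for the bottom panel might look like the following. The record layout (publication year, open flag, access count) and the sample rows are assumptions for illustration only, not the project's actual data; plotting is left out.

```python
from collections import Counter

# Hypothetical per-volume records: (publication_year, is_open, access_count).
# The real data would come from the hathifiles plus the usage logs.
volumes = [
    (1900, True, 12),
    (1900, True, 0),
    (1900, False, 0),
    (1950, True, 3),
    (1950, False, 0),
    (1950, False, 0),
]

all_vols = Counter()            # all volumes in Hathi, per year
open_vols = Counter()           # all open volumes, per year
accessed_open_vols = Counter()  # open volumes accessed at least once, per year

for year, is_open, access_count in volumes:
    all_vols[year] += 1
    if is_open:
        open_vols[year] += 1
        if access_count > 0:
            accessed_open_vols[year] += 1

# Bottom-panel series: percentage of open volumes accessed, per year.
# Years with no open volumes are skipped to avoid dividing by zero.
pct_open_accessed = {
    year: 100.0 * accessed_open_vols[year] / open_vols[year]
    for year in sorted(open_vols)
}

for year in sorted(all_vols):
    print(year, all_vols[year], open_vols[year],
          accessed_open_vols[year], pct_open_accessed.get(year))
```

The three counters map onto the three overlaid bars, and `pct_open_accessed` is the series for the lower panel.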

CC @nkrabben

Comparison of CCE to HT

As a user, I would like to compare the number of CCE registrations for books published in the US with HT's data of US published books so that I can determine whether CCE registrations are a good representation for all books published in the US for a particular time.

I have book registration data for 1923-1952 (1953-onward includes non-book type registrations).

AC1: For each year from 1923 to 1952, please count the number of unique titles in HT.

select count(distinct(hathitrust_record_number)) from hathifiles where (publication_date = '1949' or publication_date = '1949.0') and bibliograhic_format = 'BK' and publication_place LIKE '__u'

select count(distinct(oclc_number)) from hathifiles where (publication_date = '1949' or publication_date = '1949.0') and bibliograhic_format = 'BK' and publication_place LIKE '__u'
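
To cover every year from 1923 to 1952 in one pass rather than one query per year, the two counts could be grouped by year. The sketch below assumes an SQLite copy of the table, with column names (including the `bibliograhic_format` spelling) taken from the queries above and made-up sample rows. It relies on SQLite's `CAST(... AS INTEGER)` reading `'1949.0'` as 1949; other databases may need a different conversion.

```python
import sqlite3

# In-memory stand-in for the real hathifiles table (hypothetical sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hathifiles (
        hathitrust_record_number TEXT,
        oclc_number TEXT,
        publication_date TEXT,
        bibliograhic_format TEXT,
        publication_place TEXT
    )
""")
rows = [
    ("r1", "o1", "1949", "BK", "nyu"),
    ("r1", "o1", "1949", "BK", "nyu"),    # second volume of the same record
    ("r2", "o2", "1949.0", "BK", "mau"),  # date stored with a trailing ".0"
    ("r3", "o3", "1949", "SE", "nyu"),    # not a book, excluded
    ("r4", "o4", "1949", "BK", "enk"),    # place code does not end in "u", excluded
    ("r5", "o5", "1923", "BK", "ilu"),
]
conn.executemany("INSERT INTO hathifiles VALUES (?, ?, ?, ?, ?)", rows)

# One row per publication year with both distinct counts.
query = """
    SELECT CAST(publication_date AS INTEGER) AS year,
           COUNT(DISTINCT hathitrust_record_number) AS records,
           COUNT(DISTINCT oclc_number) AS oclc_numbers
    FROM hathifiles
    WHERE CAST(publication_date AS INTEGER) BETWEEN 1923 AND 1952
      AND bibliograhic_format = 'BK'
      AND publication_place LIKE '__u'
    GROUP BY year
    ORDER BY year
"""
counts = conn.execute(query).fetchall()

for year, records, oclc_numbers in counts:
    print(year, records, oclc_numbers)
```

Years with no matching rows do not appear in the result, so they would need to be filled in as zero before comparing against the CCE registration counts.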

Access relative to supply

I'd be interested in seeing how the top40k items represent access proportional to the amount of material in hathi. Some pseudoish code to explain it.

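import pandas as pd

# data: one row per year with (year, items_in_top40, items_in_hathi, items_open)
df = pd.DataFrame(data, columns=['year', 'items_in_top40', 'items_in_hathi', 'items_open'])

# Top-40k items relative to everything in hathi for that year
df['rel_total'] = df.items_in_top40 / df.items_in_hathi
df.plot(x='year', y='rel_total', kind='scatter')

# Top-40k items relative to the open items for that year
df['rel_open'] = df.items_in_top40 / df.items_open
df.plot(x='year', y='rel_open', kind='scatter')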

Add in the usage

It would be good to see if there are relationships between publication year and usage amount on a volume level. This might need some binning to be useful.

Quick sketch
Axis 1: Years, maybe binned into decades or centuries
Axis 2: Access level, binned into ... 0, 1, 2-5, 6-20, ... 1,000-1,000,000 (not really sure about the bins)
Axis 3: Either number of volumes in year/access bin, or percentage of volumes in year/access bin compared to that entire year's volumes
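
A rough sketch of the binning described above, using only the standard library. The bin edges are assumptions: the issue leaves the middle bins open, so the "21-999" and "1,000+" bins here are placeholders, as are the sample (year, access count) pairs.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical access bins; each edge is the lower bound of its bin.
ACCESS_EDGES = [0, 1, 2, 6, 21, 1000]
ACCESS_LABELS = ["0", "1", "2-5", "6-20", "21-999", "1,000+"]


def access_bin(count):
    """Return the label of the access bin that a non-negative count falls into."""
    return ACCESS_LABELS[bisect_right(ACCESS_EDGES, count) - 1]


def decade(year):
    """Bin a publication year into its decade, e.g. 1907 -> 1900."""
    return year // 10 * 10


# Hypothetical per-volume data: (publication_year, access_count).
volumes = [(1905, 0), (1907, 3), (1901, 0), (1953, 25), (1958, 1), (1950, 1200)]

# Axis 3, option 1: number of volumes in each (decade, access bin) cell.
counts = Counter((decade(y), access_bin(n)) for y, n in volumes)

# Axis 3, option 2: share of each decade's volumes that fall in each access bin.
per_decade = Counter(decade(y) for y, _ in volumes)
share = {cell: n / per_decade[cell[0]] for cell, n in counts.items()}

for (dec, label), n in sorted(counts.items()):
    print(dec, label, n, round(share[(dec, label)], 3))
```

Either `counts` or `share` could feed a heatmap with decades on one axis and access bins on the other.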
