tiny-crawler

  • csdn: you can log in to www.csdn.com with this script
  • libgen: replace the keyword to search for books and papers on libgen.io. The code of libgen.py is very short, so I did not format it
  • arxiv_search_pdfDownload: replace the keyword to search for papers on arxiv.org; the download links and paper filenames are saved to the corresponding txt file
  • arxiv_0704-now_wAbstract: gets the paper metadata by month, from 2007.04 to now
  • arxiv_9108-0703_wAbstract.py: gets the paper metadata from 1991.08 to 2007.03
    PS: arXiv changed its URL scheme after 2007.03, so two different scripts are needed to scrape the data.
  • arxiv_byArchive_woAbstract: downloads the paper metadata in bulk by accessing the arXiv archive, but it cannot get the papers' abstracts
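
The split between the two date-range scripts lines up with arXiv's identifier change in April 2007, when old-style ids such as `hep-th/9108001` gave way to new-style ids such as `0704.0001`. The following is a minimal sketch, not taken from the repository's scripts, of telling the two schemes apart. The names `id_scheme` and `abs_url` are hypothetical, and the regular expressions are an approximation of the two id formats.

```python
import re

# New-style ids (April 2007 onward): YYMM.NNNN or YYMM.NNNNN, optional version suffix.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style ids (before April 2007): archive[.SUBJECT]/YYMMNNN, optional version suffix.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def id_scheme(arxiv_id: str) -> str:
    """Return "new", "old", or "unknown" for an arXiv identifier."""
    if NEW_STYLE.match(arxiv_id):
        return "new"
    if OLD_STYLE.match(arxiv_id):
        return "old"
    return "unknown"


def abs_url(arxiv_id: str) -> str:
    """Build the abstract-page URL for an identifier of either scheme."""
    return "https://arxiv.org/abs/" + arxiv_id


if __name__ == "__main__":
    for sample in ("hep-th/9108001", "0704.0001"):
        print(sample, id_scheme(sample), abs_url(sample))
```

A scraper covering both periods could dispatch on `id_scheme` instead of keeping two separate scripts, although the month listing pages may still differ between the periods.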

paperMeta4arxiv

Because arXiv does not support regular-expression search, I scraped the paper metadata here.
The format is as follows:

              <id> \t <paper name> \t <subject> \t <authors>   

You can find the paper metadata from 2008.01 to 2018.04.
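
Once the metadata is in this tab-separated form, a regular-expression search can be run locally. The following is a minimal sketch under that assumption. `Paper`, `parse_line`, and `regex_search` are hypothetical names, and the sample rows in the usage part are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Paper(NamedTuple):
    arxiv_id: str
    title: str
    subject: str
    authors: str


def parse_line(line: str) -> Paper:
    """Split one '<id> \\t <paper name> \\t <subject> \\t <authors>' line."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return Paper(*fields)


def regex_search(lines: Iterable[str], pattern: str,
                 flags: int = re.IGNORECASE) -> Iterator[Paper]:
    """Yield the papers whose title matches the regular expression."""
    rx = re.compile(pattern, flags)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        paper = parse_line(line)
        if rx.search(paper.title):
            yield paper


if __name__ == "__main__":
    # Made-up rows; a real run would iterate over an open metadata file.
    rows = [
        "0801.0001\tGraph methods for testing\tcs.LG\tA. Author\n",
        "0801.0002\tQuantum gravity notes\thep-th\tB. Author\n",
    ]
    for hit in regex_search(rows, r"gr(aph|avity)"):
        print(hit.arxiv_id, hit.title)
```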

A better way to get books from libgen.io

  1. Download libgen_content.rar. After decompression you get libgen_content.csv, which contains the metadata of all 2319076 digital books, with these columns:
'id', 'title', 'volumeinfo', 'series', 'periodical', 'author', 'year', 'edition', 'publisher', 'city', 'pages', 'language', 'topic', 'library', 'issue', 'identifier', 'issn', 'asin', 'udc', 'lbc', 'ddc', 'lcc', 'doi',  'googlebookid', 'openLibraryid', 'commentary', 'dpi', 'color', 'cleaned', 'orientation', 'paginated', 'scanned', 'bookmarked', 'searchable', 'filesize', 'extension', 'md5', 'generic', 'visible', 'locator', 'local', 'timeadded', 'timelastmodified', 'coverurl','identifierwodash', 'tags', 'pagesinfile'
  2. Replace the confusing escaped-quote sequences (\") with spaces:
sed 's/\\"/ /g' libgen_content.csv > libgen_content1.csv
# sed -i '/"ban"/d;/"del"/d;/"Russian"/d' libgen_content1.csv # should not run this
  3. Use grep to keep only the lines you want:
grep -i mathematics libgen_content1.csv > result.csv
  4. Use "libgen_createDownloadlink.py" to create "libgen.io.{keyword}.txt". Each line in the txt file contains the raw book info and the download links for the different mirrors.
python libgen_createDownloadlink.py result.csv

  5. Because the libgen.pw website changes its download links frequently, another script is needed to update the libgen.pw download links in "result.csv":

python libgen_updateLibgenPWLink.py -f result.csv -n 20
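
The sed and grep steps above can also be done in one pass in Python, which may be convenient where those tools are not available. This is a minimal sketch and not part of the repository. `clean_line`, `filter_lines`, and `clean_and_filter` are hypothetical names, and a plain case-insensitive substring test stands in for `grep -i`, which matches it only for keywords without regex metacharacters.

```python
from typing import Iterable, Iterator


def clean_line(line: str) -> str:
    """Mirror sed 's/\\\\"/ /g': replace each backslash-quote pair with a space."""
    return line.replace('\\"', ' ')


def filter_lines(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield cleaned lines that contain the keyword, ignoring case."""
    needle = keyword.lower()
    for line in lines:
        cleaned = clean_line(line)
        if needle in cleaned.lower():
            yield cleaned


def clean_and_filter(src_path: str, dst_path: str, keyword: str) -> int:
    """Stream src_path to dst_path, keeping matching lines; return the count."""
    kept = 0
    # errors="replace" is an assumption: it avoids crashing on stray bytes.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_lines(src, keyword):
            dst.write(line)
            kept += 1
    return kept
```

For example, `clean_and_filter("libgen_content.csv", "result.csv", "mathematics")` would produce a result.csv comparable to the one from steps 2 and 3. The file is streamed line by line, so the whole CSV never has to fit in memory.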

PS: this awk command turns the tab-separated arXiv metadata file into a Markdown reference list:

awk -F'\t' '{print "- " $4 " .["$2"](https://arxiv.org/pdf/"$1") [J]. arXiv preprint arXiv:"$1"."}' file.txt >file1.txt
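
The awk one-liner above can be mirrored in Python when the formatting needs to live inside a larger script. This is a minimal sketch. `to_reference` is a hypothetical name, and the field order (id, paper name, subject, authors) is assumed to be the paperMeta4arxiv format described earlier.

```python
def to_reference(line: str) -> str:
    """Format one tab-separated metadata line as a Markdown reference entry."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so missing fields become empty strings, as in awk.
    fields += [""] * (4 - len(fields))
    arxiv_id, title, _subject, authors = fields[:4]
    return (
        f"- {authors} .[{title}](https://arxiv.org/pdf/{arxiv_id}) "
        f"[J]. arXiv preprint arXiv:{arxiv_id}."
    )


if __name__ == "__main__":
    # Made-up row for illustration only.
    print(to_reference("0801.0001\tSome Title\tcs.LG\tA. Author\n"))
```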

tiny-crawler's People

Contributors

chanchichoi
