gunkpunk / vk-url-scraper

This project forked from bellingcat/vk-url-scraper

Scrape VK URLs to fetch info and media - python API or command line tool.

Home Page: https://pypi.org/project/vk-url-scraper/

License: MIT License

Shell 1.65% Python 97.06% Makefile 1.28%

vk-url-scraper's Introduction

vk-url-scraper

Python library to scrape data, and especially media links like videos and photos, from vk.com URLs.

You can use it from the command line or as a Python library; see the documentation for details.

Installation

You can install the most recent release from PyPI with pip install vk-url-scraper.

To use the library you will need a valid username/password combination for vk.com.
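
One way to avoid hard-coding the username/password combination in scripts is to read it from the environment. The following is a minimal sketch, not part of the library: the variable names VK_USERNAME and VK_PASSWORD and the helper load_credentials are hypothetical and chosen only for illustration.

```python
import os


def load_credentials(env=None):
    """Return (username, password) read from an environment-style mapping.

    VK_USERNAME and VK_PASSWORD are hypothetical names chosen for this
    sketch; the library itself does not read them.
    """
    env = os.environ if env is None else env
    try:
        return env["VK_USERNAME"], env["VK_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None
```

The returned pair can then be passed on to the command line tool or to the Python library in place of literal strings.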

Command line usage

# run this to learn more about the parameters
vk_url_scraper --help

# scrape a URL and get the JSON result in the console
vk_url_scraper --username "username here" --password "password here" --urls https://vk.com/wall12345_6789
# OR
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789
# you can also have multiple urls
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789 https://vk.com/photo-12345_6789 https://vk.com/video12345_6789

# you can also pass a token to avoid authenticating every time
# and possibly getting captcha prompts
# you can find the token in the generated vk_config.v2.json file by searching for "access_token"
vk_url_scraper -u "username" -p "password" -t "vktoken goes here" --urls https://vk.com/wall12345_6789

# save the JSON output into a file
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789 > output.json

# download any photos or videos found in these URLs
# this will use or create an output/ folder and save the files there
vk_url_scraper -u "username here" -p "password here" --download --urls https://vk.com/wall12345_6789
# or
vk_url_scraper -u "username here" -p "password here" -d --urls https://vk.com/wall12345_6789
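
If you want to drive the command line tool from a script, the flags shown above can be assembled programmatically. This is a rough sketch under two assumptions: that vk_url_scraper is on your PATH, and that it prints only JSON to standard output (as the output.json redirect above suggests). The helpers build_command and scrape_via_cli are hypothetical names, not part of the package.

```python
import json
import subprocess


def build_command(username, password, urls, token=None, download=False):
    """Assemble the vk_url_scraper argument list from the flags shown above."""
    cmd = ["vk_url_scraper", "-u", username, "-p", password]
    if token:
        cmd += ["-t", token]
    if download:
        cmd.append("-d")
    cmd += ["--urls", *urls]
    return cmd


def scrape_via_cli(username, password, urls, **options):
    """Run the CLI and parse its stdout, assuming it prints only JSON."""
    cmd = build_command(username, password, urls, **options)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with passwords that contain spaces or special characters.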

Python library usage

from vk_url_scraper import VkScraper

vks = VkScraper("username", "password")

# scrape any "photo" URL
res = vks.scrape("https://vk.com/photo1_278184324?rev=1")

# scrape any "wall" URL
res = vks.scrape("https://vk.com/wall-1_398461")

# scrape any "video" URL
res = vks.scrape("https://vk.com/video-6596301_145810025")
print(res[0]["text"])  # e.g. get the text of the first result

# Every scrape* function returns a list of dicts shaped like:
# {
#     "id": "wall_id",
#     "text": "text in this post",
#     "datetime": <UTC datetime of the post>,
#     "attachments": {
#         # present only if a photo, video, or link exists
#         "photo": [<list of URLs with max quality>],
#         "video": [<list of URLs with max quality>],
#         "link": [<list of URLs with max quality>],
#     },
#     "payload": <original JSON response converted to a dict, which you can parse for more data>
# }

See the docs for all available functions.
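
To illustrate working with the returned list, here is a small sketch that gathers the photo and video URLs from results in the shape described above. The helper collect_media_urls and the sample data are hypothetical; the sample is hand-written to match the documented shape and is not real scraper output.

```python
from datetime import datetime, timezone


def collect_media_urls(results):
    """Flatten the photo and video URLs from a list of scrape results."""
    urls = []
    for item in results:
        attachments = item.get("attachments", {})
        for kind in ("photo", "video"):
            urls.extend(attachments.get(kind, []))
    return urls


# Hand-written sample in the documented shape, not real scraper output.
sample = [
    {
        "id": "wall_id",
        "text": "text in this post",
        "datetime": datetime(2022, 1, 1, tzinfo=timezone.utc),
        "attachments": {"photo": ["https://example.com/a.jpg"]},
        "payload": {},
    }
]
print(collect_media_urls(sample))
```

Using .get with a default keeps the helper tolerant of results that have no attachments of a given kind.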

TODO

  • scrape album links
  • scrape profile links
  • docs online from sphinx

Development

(more info in CONTRIBUTING.md).

  1. Set up the dev environment with pip install -r dev-requirements.txt or pipenv install -r dev-requirements.txt
  2. Set up the runtime environment with pip install -r requirements.txt or pipenv install -r requirements.txt
  3. To run all checks, use make run-checks (fixes style), or run them individually:
    1. To fix style: black . and isort . -> flake8 . to validate lint
    2. To do type checking: mypy .
    3. To test: pytest . (pytest -v --color=yes --doctest-modules tests/ vk_url_scraper/ to use verbose output, colors, and test docstring examples)
  4. make docs to generate sphinx docs -> edit config.py if needed

To test the command line interface available in __main__.py, pass the -m option to python, like so: python -m vk_url_scraper -u "" -p "" --urls ...

Releasing new version

  1. edit version.py with proper versioning
  2. run ./scripts/release.sh to create a tag and push, alternatively
    1. git tag vx.y.z to tag version
    2. git push origin vx.y.z -> this will trigger workflow and put project on pypi
  3. go to https://readthedocs.org/ to deploy the new docs version (if the webhook is not set up)
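
Since pushing the tag is what triggers the release workflow, it can help to check the tag name before running git tag. The sketch below assumes plain vX.Y.Z tags with numeric parts, matching the vx.y.z pattern in the steps above; parse_release_tag is a hypothetical helper and is not part of the repository's scripts.

```python
import re

# Assumes plain vX.Y.Z tags with numeric parts, as in "git tag vx.y.z" above;
# pre-release suffixes are deliberately not handled in this sketch.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a vX.Y.Z tag, or raise ValueError."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z tag: {tag!r}")
    return tuple(int(part) for part in match.groups())
```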

Fixing a failed release

If for some reason the GitHub Actions release workflow failed with an error that needs to be fixed, you'll have to delete both the tag and corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with

git tag -l | xargs git tag -d && git fetch -t

Then repeat the steps above.

vk-url-scraper's People

Contributors

dependabot[bot], loganwilliams, msramalho
