
senadev42 / atwitterscraper


Scrapes a Twitter account for posts/tweets and saves them into a Postgres database. API docs with Swagger, emails with Mailgun.


atwitterscraper's Introduction

Twitter Scraping Server

This project is a Node.js application for scraping tweets from a given Twitter account, storing them in a PostgreSQL database, and serving them over a simple REST API. It uses Puppeteer for the scraping, Express for the server, Postgres for the database, and Swagger for the API docs.

Almost the entirety of the scraping logic is in utilities/scraper.js while everything else is set up like a typical express server.

Hosted version

A hosted version of this API, running on Render's free tier, can be found at https://atwitterscraper.onrender.com with the Swagger API documentation hosted at https://atwitterscraper.onrender.com/api-docs.

This version is limited in that:

  • it takes time to spin up, since Render spins down servers after a period of inactivity, so it will not already be running when you first reach it.
  • limited CI/CD and no shell access mean Puppeteer can't get at Chromium on the Render server without doing some truly hacky stuff.

While /api/tweets will serve the tweets scraped so far on the hosted version, a full test requires setting the project up locally or in a controlled environment.

Getting Started

Prerequisites

  • Node.js installed on your system
  • PostgreSQL installed and running, or a managed database like Neon.
  • A Mailgun account and API key (for the email functionality).

Setup

  1. Clone this repository:
git clone https://github.com/senadev42/atwitterscraper.git
  2. Install dependencies:
cd 
npm install
  3. This project uses Postgres as its database. You can use your own Postgres database or a managed one like Neon (see Prerequisites). Create a .env file and add a Postgres connection string. Refer to the .env.example file if needed.
POSTGRES_CONN_STRING = "postgresql://<username>:<password>@<dbhost>/<dbname>?sslmode=require"

Note: The sslmode=require parameter is required if you're going to use this with a managed database like Neon.

  4. This project uses Mailgun as its email provider. You'll need to create a Mailgun account and get an API key and a domain.
MAILGUN_API_KEY = ""
MAILGUN_DOMAIN = ""

EMAIL_TARGET = ""

If you are using the sandbox domain (https://app.mailgun.com/mg/sending/domains) provided by Mailgun, you will also need to:

  • add the email account you're testing with to Authorized Recipients
  • confirm the email in your inbox to become verified


  5. Start the server:
npm run dev
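
Before starting the server it can help to confirm that the .env values from the steps above are all present. The following is only a minimal sketch, not code from this repository: `checkEnv` is a hypothetical helper, and it assumes the variables have already been loaded into an object such as `process.env` (for example by a loader like dotenv, which is itself an assumption about how the project reads .env). It treats a missing sslmode=require as a warning rather than an error, since that parameter only matters for managed databases.

```javascript
// Hypothetical helper (not part of the repository): checks the variables
// named in the setup steps and reports problems instead of throwing.
const REQUIRED_VARS = [
  "POSTGRES_CONN_STRING",
  "MAILGUN_API_KEY",
  "MAILGUN_DOMAIN",
  "EMAIL_TARGET",
];

function checkEnv(env) {
  const errors = [];
  const warnings = [];

  // Every variable from the setup steps must be set to a non-empty value.
  for (const name of REQUIRED_VARS) {
    if (!env[name] || env[name].trim() === "") {
      errors.push(`${name} is missing or empty`);
    }
  }

  // The connection string should parse as a URL; sslmode=require is only
  // needed for managed databases such as Neon, so it is reported as a warning.
  if (env.POSTGRES_CONN_STRING) {
    try {
      const url = new URL(env.POSTGRES_CONN_STRING);
      if (url.searchParams.get("sslmode") !== "require") {
        warnings.push("POSTGRES_CONN_STRING does not set sslmode=require");
      }
    } catch {
      errors.push("POSTGRES_CONN_STRING is not a valid URL");
    }
  }

  return { errors, warnings };
}

// Example usage (commented out so the sketch has no side effects):
// const { errors, warnings } = checkEnv(process.env);
// warnings.forEach((w) => console.warn(w));
// if (errors.length > 0) throw new Error(errors.join("; "));
```

Returning the problems as lists, rather than throwing on the first one, lets a startup script print everything that needs fixing in a single pass.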

Usage

Once the server is running, you can access the API and Swagger documentation. The port is 3000 by default but you can change it by setting a PORT value in the .env file.

  • Swagger Documentation: /api-docs
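
The port behaviour described above could be expressed roughly as follows. This is a sketch under assumptions, not the server's actual startup code: `resolvePort` and `docsUrl` are hypothetical names, and `localhost` assumes a local run.

```javascript
// Hypothetical helpers illustrating the default-port rule: use PORT from the
// environment when it is a valid TCP port, otherwise fall back to 3000.
function resolvePort(env, fallback = 3000) {
  const parsed = Number.parseInt(env.PORT, 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : fallback;
}

// Builds the local address of the Swagger documentation (/api-docs).
function docsUrl(env) {
  return `http://localhost:${resolvePort(env)}/api-docs`;
}

// Example usage (commented out so the sketch has no side effects):
// console.log(docsUrl(process.env));
```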

Scraping

The server is configured to scrape periodically using node-scheduler. Each run grabs what it can and then generates a hash for each tweet from the author, time and other details, which is used as the tweet's unique identifier.
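
As an illustration of that identifier scheme, here is a minimal sketch using Node's built-in crypto module. It is not the code in utilities/scraper.js: the field names (`author`, `time`, `text`), the choice of SHA-256, and the helper names `tweetId` and `keepNew` are all assumptions made for the example.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical: derive a stable identifier from the fields of a scraped tweet.
// JSON.stringify on an array keeps field boundaries unambiguous, so
// ["ab", "c"] and ["a", "bc"] produce different hashes.
function tweetId(tweet) {
  const parts = [tweet.author, tweet.time, tweet.text];
  return createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}

// Hypothetical: keep only tweets whose identifier has not been seen before,
// which is how repeated scrapes could avoid storing duplicates.
function keepNew(tweets, seenIds = new Set()) {
  const fresh = [];
  for (const tweet of tweets) {
    const id = tweetId(tweet);
    if (!seenIds.has(id)) {
      seenIds.add(id);
      fresh.push({ id, ...tweet });
    }
  }
  return fresh;
}
```

Because the identifier depends only on the tweet's content, scraping the same tweet on consecutive hourly runs yields the same hash, so a unique constraint on that column in the database would achieve the same effect as the in-memory set shown here.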

The scrape runs once an hour. For testing purposes, you can use GET /api/tweets/manualscrape to trigger an immediate scrape.
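
A small client sketch for calling these endpoints from a script is shown below. It assumes Node 18 or later (for the global `fetch`) and a server running locally on the default port; `apiUrl`, `triggerManualScrape` and `getTweets` are hypothetical names, and because the response format is not described here, the bodies are returned as raw text rather than parsed.

```javascript
// Hypothetical helper: builds an absolute URL for an API path.
// The default base assumes a local server on the default port 3000.
function apiUrl(path, base = "http://localhost:3000") {
  return new URL(path, base).href;
}

// Asks the server to scrape immediately instead of waiting for the hourly run.
async function triggerManualScrape(base) {
  const res = await fetch(apiUrl("/api/tweets/manualscrape", base));
  if (!res.ok) {
    throw new Error(`Manual scrape failed with HTTP ${res.status}`);
  }
  return res.text();
}

// Fetches the tweets scraped so far; the body is returned unparsed because
// its shape is not specified in this document.
async function getTweets(base) {
  const res = await fetch(apiUrl("/api/tweets", base));
  if (!res.ok) {
    throw new Error(`Fetching tweets failed with HTTP ${res.status}`);
  }
  return res.text();
}

// Example usage (commented out so the sketch makes no network calls):
// triggerManualScrape().then(() => getTweets()).then(console.log);
```

Passing `"https://atwitterscraper.onrender.com"` as `base` would point `getTweets` at the hosted version instead, keeping in mind the limitations listed under Hosted version.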

