webGrabber is a config-based web scraper and browser automation tool for extracting data from websites and automating repetitive browsing tasks. Its features include custom actions, memory interpolation, and the ability to run specific grabs. It is aimed at data analysts, researchers, and web developers.
npm install
If you have trouble with Chromium on macOS, you can try installing Chrome for Puppeteer with:
npx puppeteer browsers install chrome
Alternatively, set the path to your Chrome executable in the options file. Grabber passes these options to Puppeteer:
export default {
  executablePath: '/path/to/Chrome'
}
Create a grab config file (.json, .yml, or .yaml) in the src/grabs directory of the project.
Hello World example: hello-world.json
{
  "name": "hello-world",
  "actions": [
    {
      "name": "log",
      "params": {
        "text": "Hello World!"
      }
    }
  ]
}
Hello World example: hello-world.yml
name: hello-world
actions:
  - name: log
    params:
      text: "Hello World!"
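Both Hello World files describe the same structure: a name plus a list of actions, each with its own name and params. As a rough illustration of that shape, the sketch below checks a parsed config object against it. The helper validateGrabConfig is hypothetical and not part of webGrabber, and treating params as optional is an assumption inferred from the example, so the tool's real validation may differ.

```javascript
// Sketch: check that an object has the shape used by the Hello World example.
// validateGrabConfig is a hypothetical helper, not part of webGrabber.
function validateGrabConfig(config) {
  const errors = [];
  if (config === null || typeof config !== 'object' || Array.isArray(config)) {
    return ['config must be an object'];
  }
  if (typeof config.name !== 'string' || config.name.length === 0) {
    errors.push('"name" must be a non-empty string');
  }
  if (!Array.isArray(config.actions)) {
    errors.push('"actions" must be an array');
    return errors;
  }
  config.actions.forEach((action, i) => {
    if (action === null || typeof action !== 'object' || typeof action.name !== 'string') {
      errors.push(`actions[${i}] must be an object with a string "name"`);
      return;
    }
    // Assumption: "params" is optional, but must be an object when present.
    if ('params' in action && (action.params === null || typeof action.params !== 'object')) {
      errors.push(`actions[${i}].params must be an object when present`);
    }
  });
  return errors;
}

const helloWorld = {
  name: 'hello-world',
  actions: [{ name: 'log', params: { text: 'Hello World!' } }],
};
console.log(validateGrabConfig(helloWorld)); // prints []
```

An empty array means no problems were found. Any other result lists what differs from the Hello World shape.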
Run the app and all the grabs in the src/grabs directory will be executed:
npm run start
Run a specific grab:
npm run start hello-world
Run the app in server mode to start an HTTP server that accepts grab configurations as JSON payloads sent to an HTTP POST endpoint:
npm run start:server
- Endpoint: /grab
- Method: POST
- Payload: The endpoint expects a JSON payload containing the grab configuration.
- Server Port: The server runs on the port specified in the PORT environment variable, with a default fallback to port 3000 if not set.
Send a POST request with a JSON payload to this endpoint to trigger the grab process.
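As a minimal sketch of such a request, the snippet below builds the URL and fetch options for posting the Hello World config to a locally running server. The helper buildGrabRequest, the localhost base URL, and the Content-Type header are assumptions made for illustration. The shape of the server's response is not described here, so the usage comment only logs the status code.

```javascript
// Sketch: build the URL and fetch options for POSTing a grab config in server mode.
// buildGrabRequest and the localhost base URL are illustrative assumptions.
function buildGrabRequest(config, baseUrl = 'http://localhost:3000') {
  return {
    url: new URL('/grab', baseUrl).toString(),
    options: {
      method: 'POST',
      // Assumption: the server expects a JSON content type for the payload.
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(config),
    },
  };
}

const request = buildGrabRequest({
  name: 'hello-world',
  actions: [{ name: 'log', params: { text: 'Hello World!' } }],
});
console.log(request.url); // prints http://localhost:3000/grab

// With the server running (npm run start:server) and Node.js 18 or later:
// fetch(request.url, request.options).then((res) => console.log(res.status));
```

If PORT is set to a different value, pass the matching base URL as the second argument.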
A full list of actions can be found in Actions
An example of how to add custom actions is found in the custom file
Environment variables can be set in a .env file in the root of the project
All variables prefixed with GRABBER_ are loaded into memory and can be accessed in the config files
Values in memory can be accessed in the config files using the {{variable}} syntax
An action can return a value that can be used in the next action by using the INPUT keyword
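To make the idea of memory interpolation and INPUT chaining concrete, here is a small sketch of how such a mechanism could work. It is not webGrabber's implementation: interpolate, runActions, the upper action, the GREETING variable, and the return value of log are all hypothetical and exist only for this example.

```javascript
// Sketch of {{variable}} interpolation and INPUT chaining.
// interpolate, runActions and the handlers are hypothetical, not webGrabber's code.
function interpolate(text, memory) {
  // Replace each {{name}} with memory[name]; leave unknown names untouched.
  return text.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(memory, name) ? String(memory[name]) : match
  );
}

function runActions(actions, handlers, memory) {
  for (const action of actions) {
    const params = {};
    for (const [key, value] of Object.entries(action.params || {})) {
      params[key] = typeof value === 'string' ? interpolate(value, memory) : value;
    }
    // The return value of each action becomes INPUT for the next one.
    memory.INPUT = handlers[action.name](params);
  }
  return memory;
}

const logged = [];
const handlers = {
  upper: ({ text }) => text.toUpperCase(), // hypothetical action
  log: ({ text }) => { logged.push(text); return text; },
};
const memory = { GREETING: 'hello' };
runActions(
  [
    { name: 'upper', params: { text: '{{GREETING}} world' } },
    { name: 'log', params: { text: 'Result: {{INPUT}}' } },
  ],
  handlers,
  memory
);
console.log(logged[0]); // prints Result: HELLO WORLD
```

The first action reads GREETING from memory, and the second reads the first action's return value through INPUT.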
The following variable names are reserved and should be used in the config files with caution:
- INPUT
- PARAMS
- INDENTATION
- CURRENT_DIR
- BASE_DIR
- PAYLOAD_ID