This project forked from jfgiraud/scan-public-directory

a tool to extract links from a « public directory » page returned by a web server

Shell 4.83% Python 95.17%

scan-public-directory

scan-public-directory extracts links from a "public directory" listing page returned by a web server.

It lets you:

  • specify filters to accept or reject links
  • define a command to run on each selected link (for example, wget)

command result

$ ./scan-public-directory url http://example.com/images/ 
2009-02-01 11:13|2.4M|http://example.com/images/02-09.jpg
2010-11-03 10:36|3.4M|http://example.com/images/021110.jpg
2007-03-01 08:21|4.1M|http://example.com/images/03-07.jpg
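
Each output line holds a date, a size and a URL separated by `|`. A minimal sketch of reading such lines back into structured values follows; the `Entry` type and `parse_line` helper are hypothetical names used for illustration, not part of the tool.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    """One line of output (hypothetical helper type)."""
    date: datetime
    size: str
    url: str


def parse_line(line: str) -> Entry:
    # Split on the first two pipes only, so a '|' inside the URL is preserved.
    date_text, size, url = line.rstrip("\n").split("|", 2)
    return Entry(datetime.strptime(date_text, "%Y-%m-%d %H:%M"), size, url)


sample = "2009-02-01 11:13|2.4M|http://example.com/images/02-09.jpg"
entry = parse_line(sample)
print(entry.date.year, entry.size, entry.url)
```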

configuration file

A configuration file, .spdrc, is created the first time the program runs.

It defines the regular expressions to search for after each line has been cleaned.

You can define additional formats so that links in other listing layouts are detected and extracted.
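
To make the roles of these settings concrete, here is a rough sketch that applies the `a_regex`, `dir_regex` and `clean_line` values from the sample .spdrc to one listing line. The order of operations is an assumption: because the cleaning patterns strip the `<a>` element, the sketch captures the link from the raw line first and cleans afterwards. `split_listing_line` and the sample HTML are hypothetical.

```python
import re

# Patterns copied from the sample .spdrc (YAML "\\s" is \s in a raw string).
A_REGEX = re.compile(r'<a href="([^"]*[^\/])">[^<]*</a>')
DIR_REGEX = re.compile(r'<a href="([^"]*/)">[^<]*</a>')
CLEAN_LINE = [
    r"\s*<img[^>]+>\s*",
    r"\s*<a[^>]+>[^<]*</a>\s*",
    r"\s*<td[^>]*>\s*",
    r"\s*<tr[^>]*>\s*",
    r"\s*</td>\s*",
    r"\s*</tr>\s*",
    r"&nbsp;",
]


def split_listing_line(line):
    """Return (href, remaining_text) for a file link, or None if there is none."""
    match = A_REGEX.search(line)
    if match is None:
        return None
    cleaned = line
    for pattern in CLEAN_LINE:
        # Assumption: text matching a cleaning pattern is simply deleted.
        cleaned = re.sub(pattern, "", cleaned)
    return match.group(1), cleaned.strip()


# Hypothetical line in the style of an Apache directory index.
sample = ('<img src="/icons/image2.gif" alt="[IMG]"> '
          '<a href="02-09.jpg">02-09.jpg</a>   01-Feb-2009 11:13  2.4M')
print(split_listing_line(sample))
```

What remains after cleaning is the date and size text, which can then be matched against the configured date formats.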

usage

The first parameter is either url or file.

usage: scan-public-directory url [-h] [--max-depth LEVELS]
                                 [--min-depth LEVELS] [--min-size SIZE]
                                 [--max-size SIZE] [--accept LIST]
                                 [--reject LIST] [--verbose]
                                 [--after DATETIME] [--before DATETIME]
                                 [--lines LINES] [--banner] [--print FORMAT]
                                 [--exec COMMAND]
                                 URL

positional arguments:
  URL                 The URL to operate on

optional arguments:
  -h, --help          show this help message and exit
  --max-depth LEVELS  Descend at most LEVELS levels
  --min-depth LEVELS  Ignore file links at levels less than levels
  --min-size SIZE     Ignore file with size less than the specified size
  --max-size SIZE     Ignore file with size greater than the specified size
  --accept LIST       Accept only the files with the specified file name
                      suffixes or patterns (comma separated list). If any of
                      the wildcard characters, *, ?, [ or ], appear in an
                      element of the list, it will be treated as a pattern,
                      rather than a suffix
  --reject LIST       Reject the files with the specified file name suffixes
                      or patterns (comma separated list). If any of the
                      wildcard characters, *, ?, [ or ], appear in an element
                      of the list, it will be treated as a pattern, rather
                      than a suffix
  --verbose           Display more messages
  --after DATETIME    Reject the files with date/time before the specified
                      datetime (YYYY-MM-DD hh:mm)
  --before DATETIME   Reject the files with date/time after the specified
                      datetime (YYYY-MM-DD hh:mm)
  --lines LINES       Select specified lines matching all filters (N, N-, N-M,
                      -M)
  --banner            Display banner
  --print FORMAT      Print the given string replacing patterns {index} {date}
                      {size} {url} by their respective values
  --exec COMMAND      Execute the given shell command (ex: echo "#{index}
                      {date} // {size} // {url}"). If command starts with : it
                      is considered as an alias
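
The --accept and --reject help text describes a comma-separated list in which an element containing `*`, `?`, `[` or `]` is treated as a pattern and any other element as a suffix. The sketch below is one way to read that rule using Python's `fnmatch`; it is not the tool's actual code, `matches` is a hypothetical name, and matching against the file name rather than the full URL is an assumption.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")


def matches(name: str, rules: str) -> bool:
    """True if name matches any element of a comma-separated rule list."""
    for rule in rules.split(","):
        rule = rule.strip()
        if not rule:
            continue
        if WILDCARDS & set(rule):
            # Element contains a wildcard: treat it as a shell-style pattern.
            if fnmatchcase(name, rule):
                return True
        elif name.endswith(rule):
            # No wildcard: treat it as a file name suffix.
            return True
    return False


for name in ["02-09.jpg", "021110.jpg", "03-07.jpg"]:
    print(name, matches(name, "*-0*"))
```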

examples of use

$ ./scan-public-directory url http://example.com/images/ | tee photo.txt
2009-02-01 11:13|2.4M|http://example.com/images/02-09.jpg
2010-11-03 10:36|3.4M|http://example.com/images/021110.jpg
2007-03-01 08:21|4.1M|http://example.com/images/03-07.jpg
$ ./scan-public-directory file photo.txt --accept '*-0*' --before '2015-12-25 19:34' --max-size '3M'
2009-02-01 11:13|2.4M|http://example.com/images/02-09.jpg
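
The sketch below reproduces the filtering of this example in plain Python to show why only one line survives. It rests on several assumptions: size suffixes are read as binary multiples (1M = 1024 * 1024 bytes), the --before and --max-size bounds are inclusive, and the pattern is matched against the file name. `parse_size` and `keep` are hypothetical helpers.

```python
from datetime import datetime
from fnmatch import fnmatchcase

FMT = "%Y-%m-%d %H:%M"
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumption: binary multiples


def parse_size(text: str) -> float:
    """Convert a size such as '2.4M' to a number of bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def keep(line: str, pattern: str, before: datetime, max_size: float) -> bool:
    date_text, size, url = line.split("|", 2)
    name = url.rsplit("/", 1)[-1]
    return (fnmatchcase(name, pattern)
            and datetime.strptime(date_text, FMT) <= before
            and parse_size(size) <= max_size)


lines = [
    "2009-02-01 11:13|2.4M|http://example.com/images/02-09.jpg",
    "2010-11-03 10:36|3.4M|http://example.com/images/021110.jpg",
    "2007-03-01 08:21|4.1M|http://example.com/images/03-07.jpg",
]
before = datetime.strptime("2015-12-25 19:34", FMT)
selected = [l for l in lines if keep(l, "*-0*", before, parse_size("3M"))]
print("\n".join(selected))
```

Here 021110.jpg fails the `*-0*` pattern and 03-07.jpg exceeds the size limit, which leaves 02-09.jpg.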

After filtering, you can run a shell command on each selected link (see the --exec option under usage).
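
As an illustration of how the `{index}`, `{date}`, `{size}` and `{url}` placeholders and a `:`-prefixed alias might be expanded, here is a minimal sketch. The alias table mirrors the `aliases` entry of the sample .spdrc; `expand_alias` and `render` are hypothetical names, and the lookup rule is an assumption. The sketch only builds the command string and does not execute it.

```python
import re

# Mirrors the aliases section of the sample .spdrc.
ALIASES = {":wget": "wget -c '{url}'"}

PLACEHOLDER = re.compile(r"\{(index|date|size|url)\}")


def expand_alias(command: str) -> str:
    # Assumption: a command starting with ':' is looked up by its full name.
    if command.startswith(":"):
        return ALIASES[command]
    return command


def render(template: str, fields: dict) -> str:
    # Single pass, so text inserted for one placeholder is never re-expanded.
    return PLACEHOLDER.sub(lambda m: str(fields[m.group(1)]), template)


fields = {
    "index": 1,
    "date": "2009-02-01 11:13",
    "size": "2.4M",
    "url": "http://example.com/images/02-09.jpg",
}
print(render(expand_alias(":wget"), fields))
print(render('echo "#{index} {date} // {size} // {url}"', fields))
```

A regular expression is used instead of `str.format` so that braces unrelated to the four placeholders, which are common in shell commands, pass through untouched.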

Configuration file sample

$ cat ~/.spdrc 
dir_regex: '<a href="([^"]*/)">[^<]*</a>'
a_regex: '<a href="([^"]*[^\/])">[^<]*</a>'
clean_line:
  - "\\s*<img[^>]+>\\s*"
  - "\\s*<a[^>]+>[^<]*</a>\\s*"
  - "\\s*<td[^>]*>\\s*"
  - "\\s*<tr[^>]*>\\s*"
  - "\\s*</td>\\s*"
  - "\\s*</tr>\\s*"
  - "&nbsp;"
formats:
  - "DD-MMM-YYYY HH:mm"
  - "YYYY-MMM-DD HH:mm"
  - "M/D/YYYY h:mm A"
  - "YYYY-MM-DD HH:mm"
  - "dddd, MMMM DD, YYYY h:mm A"
a2strptime:
  'dddd': '%A'
  'ddd': '%a'
  'DD': '%d'
  'D': '%d'
  'MMMM': '%B'
  'MMM': '%b'
  'MM': '%m'
  'M': '%m'
  'YYYY': '%Y'
  'HH': '%H'
  'H': '%H'
  'hh': '%I'
  'h': '%I'
  'A': '%p'
  'a': '%p'
  'mm': '%M'
  'm': '%M'
aliases:
  :wget: "wget -c '{url}'"
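
The `a2strptime` table suggests that each entry in `formats` is translated into a Python `strptime` pattern. One possible way to do that translation is sketched below; `to_strptime` is a hypothetical name and the tool may proceed differently. Tokens are matched longest first and in a single pass, so `MMMM` is not read as `MMM` followed by `M`, and the `M` in an emitted `%M` is not translated again.

```python
import re
from datetime import datetime

# Same mapping as the a2strptime section of the sample .spdrc.
A2STRPTIME = {
    "dddd": "%A", "ddd": "%a", "DD": "%d", "D": "%d",
    "MMMM": "%B", "MMM": "%b", "MM": "%m", "M": "%m",
    "YYYY": "%Y", "HH": "%H", "H": "%H", "hh": "%I", "h": "%I",
    "A": "%p", "a": "%p", "mm": "%M", "m": "%M",
}

# Longest tokens first so the alternation prefers "MMMM" over "MMM" or "M".
TOKEN = re.compile("|".join(
    re.escape(token) for token in sorted(A2STRPTIME, key=len, reverse=True)))


def to_strptime(fmt: str) -> str:
    """Translate a format such as 'YYYY-MM-DD HH:mm' to '%Y-%m-%d %H:%M'."""
    return TOKEN.sub(lambda m: A2STRPTIME[m.group(0)], fmt)


for fmt in ["DD-MMM-YYYY HH:mm", "M/D/YYYY h:mm A", "dddd, MMMM DD, YYYY h:mm A"]:
    print(fmt, "->", to_strptime(fmt))

print(datetime.strptime("2009-02-01 11:13", to_strptime("YYYY-MM-DD HH:mm")))
```

Note that this simple approach also rewrites any literal letter that happens to be a token, so it only suits formats made of tokens and punctuation, like those in the sample.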
