Home Page: http://metacpan.org/pod/Email::Extractor

NAME

Email::Extractor - Fast email crawler

VERSION

version 0.04

SYNOPSIS

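use Email::Extractor;

my $website = 'https://example.com';
my $crawler = Email::Extractor->new( only_language => 'ru', timeout => 30 );

# High-level search: give up after 5 GET requests
my $emails = $crawler->search_until_attempts( $website, 5 );

# Or step by step: scan the page itself, then its contact links
my $arrayref = $crawler->get_emails_from_uri($website);

if ( !@$arrayref ) {
    my $urls_array = $crawler->extract_contact_links || [];
    for my $url (@$urls_array) {
        $arrayref = $crawler->get_emails_from_uri($url);
        last if @$arrayref;
    }
}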

DESCRIPTION

A fast crawler for extracting email addresses from HTML pages

Tuned for the Russian language, but you are welcome to send a merge request with support for your own language: just modify "contacts" in Email::Extractor and "url_with_contacts" in Email::Extractor

AUTHORS

Pavel Serikov [email protected]

new

Constructor

Params:

only_lang - array of languages to check, defaults to ru
timeout   - timeout of each request in seconds, defaults to 20

search_until_attempts

Search for emails until the specified number of GET requests is reached

my $emails = $crawler->search_until_attempts( $uri, 5 );

Returns an ARRAYREF, or undef if no emails are found
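
The request budget can be pictured with a minimal sketch. It is not the module's implementation: search_with_budget, %pages and the in-memory lookup that stands in for real GET requests are hypothetical, and the email regexp is a deliberate simplification.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": URL => page body (stands in for GET requests)
my %pages = (
    '/'         => 'Welcome! See our contacts page.',
    '/about'    => 'We make things.',
    '/contacts' => 'Write to sales@example.com',
);

# Visit URLs in order until an email is found or the budget is spent.
# Returns an ARRAYREF of emails, or undef (in scalar context) if none was found.
sub search_with_budget {
    my ( $urls, $max_requests ) = @_;
    my $requests = 0;
    for my $url (@$urls) {
        last if $requests >= $max_requests;
        $requests++;
        my $body = $pages{$url} // '';

        # Simplified pattern, not the RFC 822 matcher provided by Email::Find
        my @found = $body =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
        return \@found if @found;
    }
    return;
}

my $found  = search_with_budget( [ '/', '/about', '/contacts' ], 3 );
my $missed = search_with_budget( [ '/', '/about', '/contacts' ], 2 );
printf "budget 3: %s\n", $found  ? "@$found"  : 'nothing';
printf "budget 2: %s\n", $missed ? "@$missed" : 'nothing';
```

With a budget of 3 the third page is reached and its address is returned; with a budget of 2 the search stops before that page and yields undef.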

get_emails_from_uri

High-level function that uses Email::Find

Finds all emails (by a regexp following the RFC 822 standard) in an HTML page

$emails = $crawler->get_emails_from_uri('https://example.com');
$emails = $crawler->get_emails_from_uri('user/test.html');

Accepts either an http(s) URI or a file path

Returns an ARRAYREF (can be empty)

extract_contact_links

Extract links that may contain company contacts

$crawler->get_emails_from_uri('http://example.com');
$crawler->extract_contact_links;

Or you can load the HTML manually and pass it to this method:

$crawler->extract_contact_links($html)

In that case the method will not remove external links or make links absolute

Technically, this method does three things:

  1. Extract all links from the HTML document (accepted as a string).

  2. Remove external links.

  3. Store links that are assumed to be contact links separately. The assumption is made by looking at the href and body of <a> tags.

Supports both absolute and relative links

Currently uses Mojo::DOM

Variables for debugging:

$crawler->{last_all_links}     # all links collected at the start of extract_contact_links
$crawler->{non_contact_links}  # links assumed not to contain company contacts
$crawler->{last_uri}

Returns an ARRAYREF, or undef if no contact links are found
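
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: split_links and @contact_words are hypothetical names, a regexp replaces Mojo::DOM for brevity, and relative links are kept as they are rather than made absolute.

```perl
use strict;
use warnings;

# Hypothetical keyword list; the module keeps its own in contacts()
# and url_with_contacts()
my @contact_words = ( 'contact', 'kontakt' );

# Runs steps 1-3 on an HTML string; $base_host is the host being crawled
sub split_links {
    my ( $html, $base_host ) = @_;
    my ( @contact, @other );

    # 1. Extract the href and body of every <a> tag
    while ( $html =~ m{<a\s[^>]*href=["']([^"']+)["'][^>]*>(.*?)</a>}gis ) {
        my ( $href, $text ) = ( $1, $2 );

        # 2. Drop external links: absolute URLs that point at another host
        if ( $href =~ m{\Ahttps?://([^/:]+)}i ) {
            next if lc $1 ne lc $base_host;
        }

        # 3. A link is a contact candidate if its href or body has a keyword
        if ( grep { $href =~ /\Q$_\E/i || $text =~ /\Q$_\E/i } @contact_words ) {
            push @contact, $href;
        }
        else {
            push @other, $href;
        }
    }
    return ( \@contact, \@other );
}

my $html = <<'HTML';
<a href="/contacts">Contact us</a>
<a href="https://example.com/about">About</a>
<a href="https://other.example.org/contact">Partner</a>
<a href="/team">Kontakt</a>
HTML

my ( $contact, $other ) = split_links( $html, 'example.com' );
print "contact: @$contact\n";
print "other:   @$other\n";
```

The second list plays the role of the non_contact_links debug variable: internal links that matched no keyword.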

contacts

Returns a hash of words meaning "contacts" in different languages

perl -Ilib -E "use Email::Extractor; use Data::Dumper; print Dumper Email::Extractor::contacts();"

url_with_contacts

Returns an array of words that may appear in the URL of a contact page

perl -Ilib -E "use Email::Extractor; use Data::Dumper; print Dumper Email::Extractor::url_with_contacts();"

get_exceptions

Returns an array of addresses that Email::Find considers to be emails but that in fact are not

perl -Ilib -E "use Email::Extractor; use Data::Dumper; print Dumper Email::Extractor::exceptions();"
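
One way such a list could be applied is to filter candidate addresses against it. The sketch below is an assumption about usage, not the module's code: drop_exceptions, the sample entries and the case-insensitive comparison are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical false positives (file names that merely look like emails);
# the real list is the one returned by the module
my @exceptions = ( 'logo@2x.png', 'sprite@1x.jpg' );

# Return an ARRAYREF of candidates that are not on the exception list,
# comparing case-insensitively
sub drop_exceptions {
    my ( $candidates, $exceptions ) = @_;
    my %skip = map { ( lc($_) => 1 ) } @$exceptions;
    return [ grep { !$skip{ lc($_) } } @$candidates ];
}

my $clean = drop_exceptions( [ 'info@example.com', 'Logo@2x.png' ], \@exceptions );
print "@$clean\n";
```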

get_encoding

Returns the encoding of the last loaded HTML

Detection uses "encoding_from_html_document" in HTML::Encoding

$self->get_encoding;
$self->get_encoding($some_html_code);

If called without parameters, it returns the encoding of the last text loaded by load_addr_to_str()

Links are checked against the words returned by contacts in both uppercase and lowercase

AUTHOR

Pavel Serikov [email protected]

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Pavel Serikov.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
