=head1 NAME

WWW::Crawler::Mojo - A web crawling framework for Perl

=head1 SYNOPSIS

    use strict;
    use warnings;
    use utf8;
    use WWW::Crawler::Mojo;
    use 5.10.0;
    
    my $bot = WWW::Crawler::Mojo->new;
    
    $bot->on(res => sub {
        my ($bot, $browse, $job, $res) = @_;
        
        $browse->();
    });
    
    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        
        $enqueue->();
    });
    
    $bot->enqueue('http://example.com/');
    $bot->crawl;

=head1 DESCRIPTION

L<WWW::Crawler::Mojo> is a web crawling framework for those who are familiar
with the L<Mojo>::* APIs.

Note that this module is aimed at simple crawling tasks over a moderate number
of web pages, so DO NOT use it for persistent crawler jobs.

=head1 ATTRIBUTES

L<WWW::Crawler::Mojo> inherits all attributes from L<Mojo::EventEmitter> and
implements the following new ones.

=head2 ua

A L<Mojo::UserAgent> instance.

    my $ua = $bot->ua;
    $bot->ua(Mojo::UserAgent->new);

=head2 ua_name

The name of the crawler, used for the User-Agent header.

    $bot->ua_name('my-bot/0.01 (+https://example.com/)');
    say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'

=head2 active_conn

The number of currently active connections.

    $bot->active_conn($bot->active_conn + 1);
    say $bot->active_conn;

=head2 active_conns_per_host

The number of currently active connections per host.

    $bot->active_conns_per_host($bot->active_conns_per_host + 1);
    say $bot->active_conns_per_host;

=head2 depth

The maximum depth to crawl. Note that depth here means the number of HTTP
requests needed to reach a URI, counting from the first job; it does not mean
the depth of the URI path as separated by slashes.

    $bot->depth(5);
    say $bot->depth; # 5

=head2 fix

A hash whose keys are MD5 hashes of enqueued URLs.
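
You can use it, for example, to check whether a URL has already been enqueued.
A minimal sketch (assuming the keys are hex MD5 digests of the URL strings,
which may not match the module's exact normalization):

    use Digest::MD5 'md5_hex';
    say 'already enqueued' if $bot->fix->{md5_hex('http://example.com/')};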

=head2 max_conn

The maximum number of concurrent connections.

    $bot->max_conn(5);
    say $bot->max_conn; # 5

=head2 max_conn_per_host

The maximum number of concurrent connections per host.

    $bot->max_conn_per_host(5);
    say $bot->max_conn_per_host; # 5

=head2 peeping_port

A port number for the peeping monitor. It is also evaluated as a boolean to
enable or disable the feature. Defaults to undef, meaning disabled.

    $bot->peeping_port(3001);
    say $bot->peeping_port; # 3001

=head2 peeping_max_length

The maximum content length of the peeping monitor.

    $bot->peeping_max_length(100000);
    say $bot->peeping_max_length; # 100000

=head2 queue

A FIFO array containing L<WWW::Crawler::Mojo::Job> objects.

    push(@{$bot->queue}, WWW::Crawler::Mojo::Job->new(...));
    my $job = shift @{$bot->queue};

=head2 shuffle

An interval in seconds at which to shuffle the job queue. It is also evaluated
as a boolean to enable or disable the feature. Defaults to undef, meaning
disabled.

    $bot->shuffle(5);
    say $bot->shuffle; # 5

=head1 EVENTS

L<WWW::Crawler::Mojo> inherits all events from L<Mojo::EventEmitter> and
implements the following new ones.

=head2 res

Emitted when the crawler receives a response from a server.

    $bot->on(res => sub {
        my ($bot, $browse, $job, $res) = @_;
        if (...) {
            $browse->();
        } else {
            # DO NOTHING
        }
    });

=head2 refer

Emitted when a new URI is found. You can conditionally enqueue the URI with
the callback.

    $bot->on(refer => sub {
        my ($bot, $enqueue, $job, $context) = @_;
        if (...) {
            $enqueue->();
        } elsif (...) {
            $enqueue->(...); # maybe different url
        } else {
            # DO NOTHING
        }
    });

=head2 empty

Emitted when the queue length reaches zero. The length is checked every 5 seconds.

    $bot->on(empty => sub {
        my ($bot) = @_;
        say "Queue is drained out.";
    });

=head2 error

Emitted when the user agent returns no status code for a request, typically
caused by network errors or unresponsive servers.

    $bot->on(error => sub {
        my ($bot, $error, $job) = @_;
        say "error: $error";
        if (...) { # e.g. retry until failure occurs 3 times
            $bot->requeue($job);
        }
    });

Note that server errors such as 404 or 500 cannot be caught with this event.
Consider the res event for that use case instead.
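
For example, a rough sketch of handling such statuses in the res event:

    $bot->on(res => sub {
        my ($bot, $browse, $job, $res) = @_;
        if ($res->code && $res->code >= 400) {
            # handle 404, 500 and other server-side errors here
        } else {
            $browse->();
        }
    });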

=head2 start

Emitted right before crawling starts.

    $bot->on(start => sub {
        my $self = shift;
        ...
    });

=head1 METHODS

L<WWW::Crawler::Mojo> inherits all methods from L<Mojo::EventEmitter> and
implements the following new ones.

=head2 crawl

Starts the crawling loop.

    $bot->crawl;

=head2 init

Initializes the crawler settings.

    $bot->init;

=head2 process_job

Processes a job.

    $bot->process_job;

=head2 say_start

Displays startup messages on STDOUT.

    $bot->say_start;

=head2 peeping_handler

The peeping API dispatcher.

    $bot->peeping_handler($loop, $stream);

=head2 browse

Parses a web page and discovers the links in it. Each link is appended to the
FIFO queue.

    $bot->browse($res, $job);

=head2 enqueue

Appends one or more URIs or L<WWW::Crawler::Mojo::Job> objects to the queue.

    $bot->enqueue('http://example.com/index1.html');

OR

    $bot->enqueue($job1, $job2);

OR

    $bot->enqueue(
        'http://example.com/index1.html',
        'http://example.com/index2.html',
        'http://example.com/index3.html',
    );

=head2 requeue

Appends one or more URLs or jobs for retry. This accepts the same arguments as
the enqueue method.

    $bot->on(error => sub {
        my ($bot, $msg, $job) = @_;
        if (...) { # e.g. retry until failure occurs 3 times
            $bot->requeue($job);
        }
    });

=head2 collect_urls_html

Collects URLs out of HTML.

    WWW::Crawler::Mojo::collect_urls_html($dom, sub {
        my ($uri, $dom) = @_;
    });
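
A slightly fuller sketch (here C<$res> and C<$base> are illustrative
placeholders for a response object and its base URL):

    use Mojo::DOM;

    my $dom = Mojo::DOM->new($res->body);
    WWW::Crawler::Mojo::collect_urls_html($dom, sub {
        my ($uri, $dom) = @_;
        say WWW::Crawler::Mojo::resolve_href($base, $uri);
    });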

=head2 collect_urls_css

Collects URLs out of CSS.

    WWW::Crawler::Mojo::collect_urls_css($dom, sub {
        my $uri = shift;
    });

=head2 guess_encoding

Guesses the encoding of HTML or CSS from a given L<Mojo::Message::Response>
instance.

    my $encode = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';

=head2 resolve_href

Resolves a URL against a base URL.

    WWW::Crawler::Mojo::resolve_href($base, $uri);

=head1 CONSTANTS

=head2 %tag_attributes

A catalog of HTML tag attributes that may contain URLs.

    script  => ['src'],
    link    => ['href'],
    a       => ['href'],
    img     => ['src'],
    area    => ['href', 'ping'],
    embed   => ['src'],
    frame   => ['src'],
    iframe  => ['src'],
    input   => ['src'],
    object  => ['data'],
    form    => ['action'],
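
A minimal sketch for inspecting the catalog (assuming the hash is accessible
as a package variable of WWW::Crawler::Mojo):

    for my $tag (sort keys %WWW::Crawler::Mojo::tag_attributes) {
        say "$tag: @{$WWW::Crawler::Mojo::tag_attributes{$tag}}";
    }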

=head1 EXAMPLE

L<https://github.com/jamadam/WWW-Flatten>

=head1 AUTHOR

Sugama Keita, E<lt>sugama@jamadam.comE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) jamadam

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=cut