
creekorful / bathyscaphe
90 stars · 6 watchers · 24 forks · 850 KB

Fast, highly configurable, cloud native dark web crawler.

Home Page: https://blog.creekorful.com/building-fast-modern-web-crawler/

License: GNU General Public License v3.0

Go 96.81% Shell 1.21% Python 1.98%
web-crawler golang elasticsearch crawling kibana crawler hidden-services tor architecture

bathyscaphe's People

Contributors

creekorful, ffroztt, gaganbhat, smithalc


bathyscaphe's Issues

Improve build step

Since we copy the project root directory when building the Dockerfiles, a change in one process forces a rebuild of all the other ones.

Allow posting a URL through the API

I think it's a good idea to keep the queue from being used by too many processes.
The API should be the single point of entry for the whole system, except where performance (async) justifies otherwise.

Therefore, I think we should add another endpoint to the API to allow posting a URL.
The API will simply put the URL in the corresponding queue.

Once that's done, we should refactor the feeder process to use this endpoint instead of the queue.

At the same time, it will be time to allow the API to be reached from outside the Docker network.
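A minimal sketch of what such an endpoint could look like, assuming the API is built on echo and the queue is NATS (both show up in this project's logs); the port, subject name, and JSON body shape are illustrative assumptions, not the project's actual values:

package main

import (
    "encoding/json"
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to the NATS server used by the other processes.
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        panic(err)
    }
    defer nc.Close()

    e := echo.New()

    // POST /v1/urls: accept a JSON-encoded URL and push it onto the crawling queue.
    e.POST("/v1/urls", func(c echo.Context) error {
        var url string
        if err := json.NewDecoder(c.Request().Body).Decode(&url); err != nil {
            return c.NoContent(http.StatusUnprocessableEntity)
        }
        // "urlsQueue" is an assumed subject name.
        if err := nc.Publish("urlsQueue", []byte(url)); err != nil {
            return c.NoContent(http.StatusInternalServerError)
        }
        return c.NoContent(http.StatusOK)
    })

    e.Logger.Fatal(e.Start(":8080"))
}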

Better ACL for API

Use a combination of verb + path, for example:

  • GET /v1/resources
  • POST /v1/resources
  • POST /v1/urls

etc...


  • Create an endpoint to generate users? It would consume rights, etc.
  • Rights will be stored in the JWT token (see the sketch after this list)
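A minimal sketch of what the verb + path check could look like as echo middleware, assuming the rights decoded from the JWT have already been stored in the request context as "VERB /path" strings (the claim layout and context key are assumptions):

package main

import (
    "net/http"

    "github.com/labstack/echo/v4"
)

// aclMiddleware lets the request through only if the rights extracted from
// the JWT (stored earlier in the context under "rights") contain the
// "VERB /path" combination of the current request.
func aclMiddleware(next echo.HandlerFunc) echo.HandlerFunc {
    return func(c echo.Context) error {
        rights, _ := c.Get("rights").([]string)
        needed := c.Request().Method + " " + c.Path()
        for _, right := range rights {
            if right == needed {
                return next(c)
            }
        }
        return c.NoContent(http.StatusForbidden)
    }
}

func main() {
    e := echo.New()
    e.Use(aclMiddleware)
    e.GET("/v1/resources", func(c echo.Context) error {
        return c.NoContent(http.StatusOK)
    })
    e.Logger.Fatal(e.Start(":8080"))
}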

create release.sh script

  • Take the tag as parameter (without the v prefix)
  • Show the diff for the maintainer to confirm the changes
  • Create a signed commit (format: Release v$tag)
  • Call build.sh with $tag as parameter
  • Call build.sh with no parameter (latest build)
  • Remind the maintainer to check the details and then run: git push && git push --tags && ./push.sh $tag && ./push.sh latest

Create cache of 'ignored' resources

  • We can put 'down' domains in it so we won't spend time trying to crawl them again
  • We can manually ignore domains by adding them to the list
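A minimal in-memory sketch of such a cache; all names here are illustrative, and a real deployment would likely want it in a shared store (Redis or similar) so every process sees the same list:

package cache

import (
    "sync"
    "time"
)

// IgnoreCache holds domains that should not be crawled, either because they
// were recently seen down or because they were manually blacklisted.
type IgnoreCache struct {
    mu      sync.RWMutex
    entries map[string]time.Time // domain -> expiry (zero time = permanent)
}

func New() *IgnoreCache {
    return &IgnoreCache{entries: map[string]time.Time{}}
}

// IgnoreFor marks a domain as ignored for a while (e.g. a 'down' domain).
func (c *IgnoreCache) IgnoreFor(domain string, d time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[domain] = time.Now().Add(d)
}

// IgnoreForever marks a domain as permanently ignored (manual blacklist).
func (c *IgnoreCache) IgnoreForever(domain string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[domain] = time.Time{}
}

// IsIgnored reports whether the scheduler should skip a domain.
func (c *IgnoreCache) IsIgnored(domain string) bool {
    c.mu.RLock()
    defer c.mu.RUnlock()
    expiry, ok := c.entries[domain]
    if !ok {
        return false
    }
    return expiry.IsZero() || time.Now().Before(expiry)
}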

Feeder to API unmarshaling error

I can't get that new feeder to work; am I doing it wrong?

./cmd/feeder/feeder --api-uri http://localhost:15005 --url https://www.facebookcorewwwi.onion
INFO[0000] Starting trandoshan-feeder v0.0.1
INFO[0000] URL https://www.facebookcorewwwi.onion successfully sent to the crawler
api_1            | time="2020-08-08T03:15:21Z" level=error msg="Error while un-marshaling url: invalid character 'h' looking for beginning of value"

Maybe I'm missing some JSON encoding? I'm not sure; I tried passing JSON-encoded values too, but it didn't like them any better than a raw URL.
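The "invalid character 'h' looking for beginning of value" message suggests the API tries to JSON-unmarshal the request body while the feeder sends the raw URL string. A sketch of a feeder-side fix under that assumption (the /v1/urls path is taken from another issue's logs; the helper name is made up):

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

// publishURL sends the given URL to the API. A raw "https://..." body makes
// the server-side JSON decoder fail on the leading 'h', so the URL is
// JSON-encoded first (producing "\"https://...\"").
func publishURL(apiURI, url string) error {
    body, err := json.Marshal(url)
    if err != nil {
        return err
    }
    resp, err := http.Post(apiURI+"/v1/urls", "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}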

Elasticsearch Crashing with Code 127

I'm running this on a Google Cloud Platform compute instance with 8GB RAM and 2 cores.

When I open the Kibana dashboard and create a canvas with a data table of the crawled content from resources *, it lags for a brief moment and then gives me 401 unauthorized errors.

In the console, I see that docker_elasticsearch_1 exited with code 127.

Memory usage at the time of the crash doesn't seem high either, with around 2/8 GB RAM in use.

scheduler_1      | time="2020-09-07T03:53:12Z" level=info msg="Successfully initialized tdsh-scheduler. Waiting for URLs"
torproxy_1       | WARNING: no logs are available with the 'none' log driver
// After opening kibana dashboard and waiting about 20 seconds
docker_elasticsearch_1 exited with code 127

Duplicate URLs in ElasticSearch DB

Hi there,

I've been playing with this Tor crawler for some time and generally it works pretty well. However, I have a problem with duplicate URLs. It has been running for 4 days and has recorded over 4000 hits, but the count of unique URLs is only around 1000.

I noticed that there is a query method in the scheduler that asks the Elasticsearch DB whether a found URL already exists:

// Encode the normalized URL so it can be passed as a query parameter.
b64URI := base64.URLEncoding.EncodeToString([]byte(normalizedURL.String()))
apiURL := fmt.Sprintf("%s/v1/resources?url=%s", apiURI, b64URI)

// Ask the API whether the resource is already indexed.
var urls []proto.ResourceDto
r, err := httpClient.JSONGet(apiURL, &urls)
...
// Only schedule the URL if it is not indexed yet.
if len(urls) == 0 {
...

I've copied this method into the crawler and persister as well, to check todo URLs and resource URLs. However, it still only gets around 1000 unique URLs out of over 4000 hits.

Does anyone have any idea of how to fix this problem? Any hint would be greatly appreciated.
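For what it's worth, the check-then-insert pattern above is racy once several processes handle the same URL concurrently, and Elasticsearch only makes new documents searchable after its refresh interval, so the existence check can miss documents written moments earlier. One possible mitigation (a sketch, not the project's actual fix): derive the Elasticsearch document ID from the normalized URL, so re-indexing the same URL overwrites the old document instead of duplicating it. The resources index name is an assumption:

package main

import (
    "bytes"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "net/http"
)

// indexResource stores a crawled resource under an ID derived from its
// normalized URL. Indexing the same URL twice then overwrites the previous
// document instead of creating a duplicate, even if the existence check raced.
func indexResource(esURI, normalizedURL string, doc []byte) error {
    sum := sha256.Sum256([]byte(normalizedURL))
    id := hex.EncodeToString(sum[:])

    // PUT /<index>/_doc/<id> is Elasticsearch's standard index API.
    req, err := http.NewRequest(http.MethodPut,
        fmt.Sprintf("%s/resources/_doc/%s", esURI, id), bytes.NewReader(doc))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}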

Request to Elasticsearch failed: {"error":{}}

So I've had all containers running overnight without exiting, and there is certainly a lot of activity, but something doesn't seem quite right between Kibana and Elasticsearch. Kibana is only showing me 8 entries and giving this error:

Request to Elasticsearch failed: {"error":{}}

Error: Request to Elasticsearch failed: {"error":{}}
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4900279
    at Function._module.service.Promise.try (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2504083)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503457
    at Array.map (<anonymous>)
    at Function._module.service.Promise.map (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503414)
    at callResponseHandlers (http://x.x.x.x:15004/bundles/commons.bundle.js:3:4898793)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4881154
    at processQueue (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:204190)
    at http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:205154
    at Scope.$digest (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:215159)

Switch to new architecture

The current architecture of Trandoshan is not flexible: messages can be read by only one consumer (to prevent duplicates), etc.

It could be interesting to switch to an event-driven architecture: each process pushes its own events through queues (no consumer uniqueness), and everyone who cares about a message just needs to subscribe and do whatever they want with it.

This of course introduces a problem: messages would be consumed several times by replicas of the same process (e.g. the crawler process, which is generally scaled). To prevent this we would need a UNIQUE crawler process reading from the queue and pushing each message to a private queue that the other crawler processes subscribe to (i.e. forwarding the message).

I don't know if the implementation makes sense at this time, but the general idea seems pretty good.
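A sketch of the forwarding idea with NATS (which the deployment logs show the project uses); the subject and queue-group names are made up:

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // The unique forwarder: every event on the shared subject is
    // republished on a private work subject for the crawler replicas.
    if _, err := nc.Subscribe("events.url.found", func(m *nats.Msg) {
        if err := nc.Publish("crawler.work", m.Data); err != nil {
            log.Printf("forward failed: %s", err)
        }
    }); err != nil {
        log.Fatal(err)
    }

    // Each crawler replica joins the "crawlers" queue group, so any given
    // message is delivered to exactly one of them.
    if _, err := nc.QueueSubscribe("crawler.work", "crawlers", func(m *nats.Msg) {
        log.Printf("crawling %s", string(m.Data))
    }); err != nil {
        log.Fatal(err)
    }

    select {} // block forever
}

Note that NATS queue groups already give deliver-to-one-of-N semantics on a shared subject, so they might remove the need for the dedicated forwarder entirely.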

Error while submitting a new URL through the API

Hi, I'm quite interested in this crawler but I got an error when I tried to start it. So I just added

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://localhost:15005 --url http://torlinkbgs6aabns.onion/

to the docker-compose.yml and executed ./scripts/start.sh. But the feeder didn't work properly and returned the following message:

feeder_1 | time="2020-08-26T11:43:50Z" level=error msg="Unable to publish URL: Post \"http://localhost:15005/v1/urls\": dial tcp 127.0.0.1:15005: connect: connection refused"

I searched online but failed to solve this problem. Could anyone give me some hints? Thank you!

test on a list of onion websites

Hi,

Hope you are all well !

Just a quick suggestion: you could bulk-test trandoshan on the list of onion websites available at https://github.com/onionltd/oniontree

I am curious how to bulk-add them, and how many will fail because of a captcha challenge.

Thanks in advance for your insights and inputs on that.

Cheers,
X

Build Error invalid argument "creekorful/"

Hi, our Docker version is 19.03.12. We tried to build trandoshan using the build.sh script, but it gives the following error:

invalid argument "creekorful/" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.

Where can we find the documentation for this project? Thanks

Kibana server is not ready yet

I tried the new project but can't get past "Kibana server is not ready yet". I used the packaged build and start scripts. Are there additional steps or an installation guide somewhere?

Edit: Everything appeared to start OK; here's my output:
Starting deployments_nats_1 ... done
Starting deployments_elasticsearch_1 ... done
Starting deployments_torproxy_1 ... done
Starting deployments_scheduler_1 ... done
Starting deployments_crawler_1 ... done
Starting deployments_kibana_1 ... done
Starting deployments_api_1 ... done
Starting deployments_persister_1 ... done
Attaching to deployments_torproxy_1, deployments_nats_1, deployments_elasticsearch_1, deployments_scheduler_1, deployments_crawler_1, deployments_api_1, deployments_kibana_1, deployments_persister_1
torproxy_1 | WARNING: no logs are available with the 'none' log driver
nats_1 | WARNING: no logs are available with the 'none' log driver
elasticsearch_1 | WARNING: no logs are available with the 'none' log driver
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Starting trandoshan-scheduler v0.0.1"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using NATS server at: nats"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using API server at: http://api:8080"
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Successfully initialized trandoshan-scheduler. Waiting for URLs"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Starting trandoshan-crawler v0.0.1"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using NATS server at: nats"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using TOR proxy at: torproxy:9050"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Successfully initialized trandoshan-crawler. Waiting for URLs"
api_1 | {"time":"2020-08-05T21:39:33.269084605Z","level":"INFO","prefix":"echo","file":"api.go","line":"73","message":"Starting trandoshan-api v0.0.1"}
api_1 | {"time":"2020-08-05T21:39:33.269182929Z","level":"DEBUG","prefix":"echo","file":"api.go","line":"75","message":"Using elasticsearch server at: http://elasticsearch:9200"}
api_1 | {"time":"2020-08-05T21:39:33.295324468Z","level":"INFO","prefix":"echo","file":"api.go","line":"88","message":"Successfully initialized trandoshan-api. Waiting for requests"}
api_1 | ⇨ http server started on [::]:8080
kibana_1 | WARNING: no logs are available with the 'none' log driver
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Starting trandoshan-persister v0.0.1"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using NATS server at: nats"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using API server at: http://api:8080"
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Successfully initialized trandoshan-persister. Waiting for resources"
deployments_elasticsearch_1 exited with code 1

Add switches to docker-compose allowing detaching

Add the -t and -i switches to the docker command to allow detaching. Right now I have to restart the whole project if I want to attach/detach.

Alternatively, you could assign unique detach keys:

--detach-keys "ctrl-a,a"

Sorry, I was trying to add an "enhancement" label, but I don't think I can.

Create dashboard application

The idea is to have a simple JS (Angular? React? Vue?) application that will talk to the API to get insights from the crawler.

  • Resource page to view / search resources using input
  • Page to submit URL to crawl
  • ?

If anyone has a suggestion, feel free to comment on this PR!

%!s(<nil>) response

I'm having that previous issue again with the new build:

scheduler_1 | time="2020-08-08T17:17:49Z" level=debug msg="Processing URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=debug msg="Successfully published URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=error msg="Error getting response: %!s()"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Error while searching URL: %!s()"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Received status code: 500"
