
creekorful / bathyscaphe
90 stars · 6 watchers · 24 forks · 850 KB

Fast, highly configurable, cloud native dark web crawler.

Home Page: https://blog.creekorful.com/building-fast-modern-web-crawler/

License: GNU General Public License v3.0

Go 96.81% Shell 1.21% Python 1.98%
web-crawler golang elasticsearch crawling kibana crawler hidden-services tor architecture

bathyscaphe's People

Contributors

creekorful, ffroztt, gaganbhat, smithalc


bathyscaphe's Issues

Improve build step

Since we copy the project root directory when building the Dockerfiles, a change in one process forces a rebuild of all the other ones.

Allow posting a URL through the API

I think it's a good idea to keep the queue from being used by too many processes.
The API should be the single point of entry for the whole system, except where performance (async) justifies otherwise.

Therefore, I think we should add another endpoint to the API to allow posting a URL.
The API will simply put the URL in the corresponding queue.

Once that's done, we should refactor the feeder process to use this endpoint instead of the queue.

At the same time, it will be time to allow the API to be reached from outside the Docker network.
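A minimal sketch of what such an endpoint could look like, assuming the API is built on echo and the queue is NATS (both show up in this project's logs); the port, subject name, and JSON body shape are illustrative assumptions, not the project's actual values:

package main

import (
    "encoding/json"
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/nats-io/nats.go"
)

func main() {
    // Connect to the NATS server used by the other processes.
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        panic(err)
    }
    defer nc.Close()

    e := echo.New()

    // POST /v1/urls: accept a JSON-encoded URL and push it onto the crawling queue.
    e.POST("/v1/urls", func(c echo.Context) error {
        var url string
        if err := json.NewDecoder(c.Request().Body).Decode(&url); err != nil {
            return c.NoContent(http.StatusUnprocessableEntity)
        }
        // "urlsQueue" is an assumed subject name.
        if err := nc.Publish("urlsQueue", []byte(url)); err != nil {
            return c.NoContent(http.StatusInternalServerError)
        }
        return c.NoContent(http.StatusOK)
    })

    e.Logger.Fatal(e.Start(":8080"))
}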

Better ACL for API

Use a combination of verb + path, for example:

  • GET /v1/resources
  • POST /v1/resources
  • POST /v1/urls

etc...


  • Create an endpoint to generate users? It would consume rights, etc.
  • Rights will be stored in the JWT token (see the sketch after this list)
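A minimal sketch of what the verb + path check could look like as echo middleware, assuming the rights decoded from the JWT have already been stored in the request context as "VERB /path" strings (the claim layout and context key are assumptions):

package main

import (
    "net/http"

    "github.com/labstack/echo/v4"
)

// aclMiddleware lets the request through only if the rights extracted from
// the JWT (stored earlier in the context under "rights") contain the
// "VERB /path" combination of the current request.
func aclMiddleware(next echo.HandlerFunc) echo.HandlerFunc {
    return func(c echo.Context) error {
        rights, _ := c.Get("rights").([]string)
        needed := c.Request().Method + " " + c.Path()
        for _, right := range rights {
            if right == needed {
                return next(c)
            }
        }
        return c.NoContent(http.StatusForbidden)
    }
}

func main() {
    e := echo.New()
    e.Use(aclMiddleware)
    e.GET("/v1/resources", func(c echo.Context) error {
        return c.NoContent(http.StatusOK)
    })
    e.Logger.Fatal(e.Start(":8080"))
}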

create release.sh script

  • Take the tag as parameter (without the v prefix)
  • Show the diff for the maintainer to confirm the changes
  • Create a signed commit (format: Release v$tag)
  • Call build.sh with $tag as parameter
  • Call build.sh with no parameter (latest build)
  • Remind the maintainer to check the details and then run: git push && git push --tags && ./push.sh $tag && ./push.sh latest

Create cache of 'ignored' resources

  • We can put 'down' domains in it so we won't spend time trying to crawl them again
  • We can manually ignore domains by adding them to the list
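A minimal in-memory sketch of such a cache; all names here are illustrative, and a real deployment would likely want it in a shared store (Redis or similar) so every process sees the same list:

package cache

import (
    "sync"
    "time"
)

// IgnoreCache holds domains that should not be crawled, either because they
// were recently seen down or because they were manually blacklisted.
type IgnoreCache struct {
    mu      sync.RWMutex
    entries map[string]time.Time // domain -> expiry (zero time = permanent)
}

func New() *IgnoreCache {
    return &IgnoreCache{entries: map[string]time.Time{}}
}

// IgnoreFor marks a domain as ignored for a while (e.g. a 'down' domain).
func (c *IgnoreCache) IgnoreFor(domain string, d time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[domain] = time.Now().Add(d)
}

// IgnoreForever marks a domain as permanently ignored (manual blacklist).
func (c *IgnoreCache) IgnoreForever(domain string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[domain] = time.Time{}
}

// IsIgnored reports whether the scheduler should skip a domain.
func (c *IgnoreCache) IsIgnored(domain string) bool {
    c.mu.RLock()
    defer c.mu.RUnlock()
    expiry, ok := c.entries[domain]
    if !ok {
        return false
    }
    return expiry.IsZero() || time.Now().Before(expiry)
}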

Feeder to API unmarshaling error

I can't get that new feeder to work; am I doing it wrong?

./cmd/feeder/feeder --api-uri http://localhost:15005 --url https://www.facebookcorewwwi.onion
INFO[0000] Starting trandoshan-feeder v0.0.1
INFO[0000] URL https://www.facebookcorewwwi.onion successfully sent to the crawler
api_1            | time="2020-08-08T03:15:21Z" level=error msg="Error while un-marshaling url: invalid character 'h' looking for beginning of value"

Maybe I'm missing some JSON encoding? I'm not sure; I tried passing JSON-encoded values too, but it didn't like them any better than a raw URL.
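The "invalid character 'h' looking for beginning of value" message suggests the API tries to JSON-unmarshal the request body while the feeder sends the raw URL string. A sketch of a feeder-side fix under that assumption (the /v1/urls path is taken from another issue's logs; the helper name is made up):

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

// publishURL sends the given URL to the API. A raw "https://..." body makes
// the server-side JSON decoder fail on the leading 'h', so the URL is
// JSON-encoded first (producing "\"https://...\"").
func publishURL(apiURI, url string) error {
    body, err := json.Marshal(url)
    if err != nil {
        return err
    }
    resp, err := http.Post(apiURI+"/v1/urls", "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}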

Elasticsearch Crashing with Code 127

I'm running this on a Google Cloud Platform compute instance with 8GB RAM and 2 cores.

When I open the Kibana dashboard and create a canvas with a data table of the crawled content from resources *, it lags for a brief moment and then gives me 401 unauthorized errors.

In the console, I see that docker_elasticsearch_1 exited with code 127.

Memory usage at the time of the crash doesn't seem high either, with around 2/8 GB RAM in use.

scheduler_1      | time="2020-09-07T03:53:12Z" level=info msg="Successfully initialized tdsh-scheduler. Waiting for URLs"
torproxy_1       | WARNING: no logs are available with the 'none' log driver
// After opening kibana dashboard and waiting about 20 seconds
docker_elasticsearch_1 exited with code 127

Duplicate URLs in ElasticSearch DB

Hi there,

I've been playing with this Tor crawler for some time and generally it works pretty well. However, I have a problem with duplicate URLs. It has been running for 4 days and has recorded over 4000 hits, but the count of unique URLs is only around 1000.

I noticed that there is a query method in the scheduler that asks the Elasticsearch DB whether a found URL already exists:

// Encode the normalized URL so it can be passed as a query parameter.
b64URI := base64.URLEncoding.EncodeToString([]byte(normalizedURL.String()))
apiURL := fmt.Sprintf("%s/v1/resources?url=%s", apiURI, b64URI)

// Ask the API whether the resource is already indexed.
var urls []proto.ResourceDto
r, err := httpClient.JSONGet(apiURL, &urls)
...
// Only schedule the URL if it is not indexed yet.
if len(urls) == 0 {
...

I've copied this method into the crawler and persister as well, to check todo URLs and resource URLs. However, it still only gets around 1000 unique URLs out of over 4000 hits.

Does anyone have any idea of how to fix this problem? Any hint would be greatly appreciated.
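For what it's worth, the check-then-insert pattern above is racy once several processes handle the same URL concurrently, and Elasticsearch only makes new documents searchable after its refresh interval, so the existence check can miss documents written moments earlier. One possible mitigation (a sketch, not the project's actual fix): derive the Elasticsearch document ID from the normalized URL, so re-indexing the same URL overwrites the old document instead of duplicating it. The resources index name is an assumption:

package main

import (
    "bytes"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "net/http"
)

// indexResource stores a crawled resource under an ID derived from its
// normalized URL. Indexing the same URL twice then overwrites the previous
// document instead of creating a duplicate, even if the existence check raced.
func indexResource(esURI, normalizedURL string, doc []byte) error {
    sum := sha256.Sum256([]byte(normalizedURL))
    id := hex.EncodeToString(sum[:])

    // PUT /<index>/_doc/<id> is Elasticsearch's standard index API.
    req, err := http.NewRequest(http.MethodPut,
        fmt.Sprintf("%s/resources/_doc/%s", esURI, id), bytes.NewReader(doc))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}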

Request to Elasticsearch failed: {"error":{}}

So I've had all containers running overnight without exiting, and there is certainly a lot of activity, but something doesn't seem quite right between Kibana and Elasticsearch. Kibana is only showing me 8 entries and giving this error:

Request to Elasticsearch failed: {"error":{}}

Error: Request to Elasticsearch failed: {"error":{}}
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4900279
    at Function._module.service.Promise.try (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2504083)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503457
    at Array.map (<anonymous>)
    at Function._module.service.Promise.map (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503414)
    at callResponseHandlers (http://x.x.x.x:15004/bundles/commons.bundle.js:3:4898793)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4881154
    at processQueue (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:204190)
    at http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:205154
    at Scope.$digest (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:215159)

Switch to new architecture

The current architecture of Trandoshan is not flexible: messages can be read by only one consumer (to prevent duplicates), etc.

It could be interesting to switch to an event-driven architecture: each process pushes its own events through queues (no consumer uniqueness), and everyone who cares about a message just needs to subscribe and do whatever they want with it.

This of course introduces a problem: messages would be consumed several times by replicas of the same process (e.g. the crawler process, which is generally scaled). To prevent this we would need a UNIQUE crawler process reading from the queue and pushing each message to a private queue that the other crawler processes subscribe to (i.e. forwarding the message).

I don't know if the implementation makes sense at this time, but the general idea seems pretty good.
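A sketch of the forwarding idea with NATS (which the deployment logs show the project uses); the subject and queue-group names are made up:

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // The unique forwarder: every event on the shared subject is
    // republished on a private work subject for the crawler replicas.
    if _, err := nc.Subscribe("events.url.found", func(m *nats.Msg) {
        if err := nc.Publish("crawler.work", m.Data); err != nil {
            log.Printf("forward failed: %s", err)
        }
    }); err != nil {
        log.Fatal(err)
    }

    // Each crawler replica joins the "crawlers" queue group, so any given
    // message is delivered to exactly one of them.
    if _, err := nc.QueueSubscribe("crawler.work", "crawlers", func(m *nats.Msg) {
        log.Printf("crawling %s", string(m.Data))
    }); err != nil {
        log.Fatal(err)
    }

    select {} // block forever
}

Note that NATS queue groups already give deliver-to-one-of-N semantics on a shared subject, so they might remove the need for the dedicated forwarder entirely.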

Error while submitting a new URL through the API

Hi, I'm quite interested in this crawler but I got an error when I tried to start it. So I just added

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://localhost:15005 --url http://torlinkbgs6aabns.onion/

to the docker-compose.yml and executed ./scripts/start.sh. But the feeder didn't work properly and returned the following message:

feeder_1 | time="2020-08-26T11:43:50Z" level=error msg="Unable to publish URL: Post \"http://localhost:15005/v1/urls\": dial tcp 127.0.0.1:15005: connect: connection refused"

I searched online but failed to solve this problem. Could anyone give me some hints? Thank you!

test on a list of onion websites

Hi,

Hope you are all well !

Just a quick suggestion: you could bulk-test trandoshan on the list of onion websites available at https://github.com/onionltd/oniontree

I am curious how to bulk-add them, and how many will fail because of a captcha challenge.

Thanks in advance for your insights and inputs on that.

Cheers,
X

Build Error invalid argument "creekorful/"

Hi, our Docker version is 19.03.12. We tried to build trandoshan using the build.sh script, but it gives the following error:

invalid argument "creekorful/" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.

Where can we find the documentation for this project? Thanks

Kibana server is not ready yet

I tried the new project but can't get past "Kibana server is not ready yet". I used the packaged build and start scripts. Are there additional steps or an installation guide somewhere?

Edit: Everything appeared to start OK; here's my output:
Starting deployments_nats_1 ... done
Starting deployments_elasticsearch_1 ... done
Starting deployments_torproxy_1 ... done
Starting deployments_scheduler_1 ... done
Starting deployments_crawler_1 ... done
Starting deployments_kibana_1 ... done
Starting deployments_api_1 ... done
Starting deployments_persister_1 ... done
Attaching to deployments_torproxy_1, deployments_nats_1, deployments_elasticsearch_1, deployments_scheduler_1, deployments_crawler_1, deployments_api_1, deployments_kibana_1, deployments_persister_1
torproxy_1 | WARNING: no logs are available with the 'none' log driver
nats_1 | WARNING: no logs are available with the 'none' log driver
elasticsearch_1 | WARNING: no logs are available with the 'none' log driver
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Starting trandoshan-scheduler v0.0.1"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using NATS server at: nats"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using API server at: http://api:8080"
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Successfully initialized trandoshan-scheduler. Waiting for URLs"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Starting trandoshan-crawler v0.0.1"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using NATS server at: nats"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using TOR proxy at: torproxy:9050"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Successfully initialized trandoshan-crawler. Waiting for URLs"
api_1 | {"time":"2020-08-05T21:39:33.269084605Z","level":"INFO","prefix":"echo","file":"api.go","line":"73","message":"Starting trandoshan-api v0.0.1"}
api_1 | {"time":"2020-08-05T21:39:33.269182929Z","level":"DEBUG","prefix":"echo","file":"api.go","line":"75","message":"Using elasticsearch server at: http://elasticsearch:9200"}
api_1 | {"time":"2020-08-05T21:39:33.295324468Z","level":"INFO","prefix":"echo","file":"api.go","line":"88","message":"Successfully initialized trandoshan-api. Waiting for requests"}
api_1 | ⇨ http server started on [::]:8080
kibana_1 | WARNING: no logs are available with the 'none' log driver
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Starting trandoshan-persister v0.0.1"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using NATS server at: nats"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using API server at: http://api:8080"
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Successfully initialized trandoshan-persister. Waiting for resources"
deployments_elasticsearch_1 exited with code 1

Add switches to docker-compose allowing detaching

Add the -t and -i switches to the docker command to allow detaching. Right now I have to restart the whole project if I want to attach/detach.

Alternatively, you could assign unique detach keys:

--detach-keys "ctrl-a,a"

Sorry, I was trying to add an "enhancement" label, but I don't think I can.

Create dashboard application

The idea is to have a simple JS (Angular? React? Vue?) application that will talk to the API to get insights from the crawler.

  • Resource page to view / search resources using input
  • Page to submit URL to crawl
  • ?

If anyone has a suggestion, feel free to comment on this PR!

%!s(<nil>) response

I'm having that previous issue again with the new build:

scheduler_1 | time="2020-08-08T17:17:49Z" level=debug msg="Processing URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=debug msg="Successfully published URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=error msg="Error getting response: %!s()"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Error while searching URL: %!s()"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Received status code: 500"
