Comments (20)

j-mendez commented on May 21, 2024

@sebs this is now available in 1.42.0.

> crawling multiple domains as one for the URLs https://rssea.fr and https://loto.rsseau.fr

Thank you for the issue!
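
A minimal sketch of grouping the two domains into one crawl with the crate, assuming the with_external_domains builder and link accessors shown later in this thread (exact signatures may differ between versions):

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Seed domain; loto.rsseau.fr is pulled into the same crawl group.
        let mut website: Website = Website::new("https://rsseau.fr");
        website.with_external_domains(Some(
            vec!["https://loto.rsseau.fr"].into_iter().map(|s| s.to_string()),
        ));

        website.crawl().await;

        // Links from both domains end up in one collected set.
        for link in website.get_links() {
            println!("{:?}", link.as_ref());
        }
    }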

sebs commented on May 21, 2024

Ah, I love this so much ;) You are solving a big problem for me. I'm trying to build a URL dataset of 20 million for a coding challenge ;)

I really appreciate this as it saves me a ton of time.

sebs commented on May 21, 2024

Maybe make it possible to add a * to extract all external domain links?

Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.

j-mendez commented on May 21, 2024

Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney have their main domain as the root page while treating every link they care about as a different DNS name on the page. That pattern makes it hard to gather all of the website data.

j-mendez commented on May 21, 2024

@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.

sebs commented on May 21, 2024

I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?

j-mendez commented on May 21, 2024

@sebs it's not available in the CLI at the moment. Not all features carry over 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.

j-mendez commented on May 21, 2024

Now available in the CLI v1.45.10. Example below to group domains:

    spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o

The -E flag can also be written as --external-domains.
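
For reference, a hypothetical equivalent using the long-form flag (assuming it simply spells out the -E option above):

    spider --domain https://rsseau.fr --external-domains https://loto.rsseau.fr/ crawl -o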

j-mendez commented on May 21, 2024

@sebs done via 1.46.0. Thank you!

sebs commented on May 21, 2024

<3

apsaltis commented on May 21, 2024

Hi,
Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

    spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though one of the pages crawled, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:

Download our one-pager for more information

The output from the scrape command looks like this for that page:

    {
      "html": "",
      "links": [],
      "url": "https://www.theconsortium.cloud/application-consulting-services-page"
    },

Is there a way, either programmatically or via the CLI, to have a spider detect all of the links on a page? Thanks in advance.

scientiac commented on May 21, 2024

How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I'm trying to make a crawler with self-discovery of new sites from one seed.

j-mendez commented on May 21, 2024

> How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I'm trying to make a crawler with self-discovery of new sites from one seed.

Use website.external_domains to add domains into the group for discovery.
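
A minimal sketch of that approach, assuming the same with_external_domains builder and link accessors used elsewhere in this thread; the domain names below are placeholders, not sites from this issue:

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Hypothetical seed site.
        let seed = "https://example.com";
        let mut website: Website = Website::new(seed);

        // Add the external domains that should join the crawl group.
        website.with_external_domains(Some(
            vec!["https://blog.example.org"].into_iter().map(|s| s.to_string()),
        ));

        website.crawl().await;

        // Naive prefix check to pull out links that point away from the seed domain.
        for link in website.get_links() {
            if !link.as_ref().starts_with(seed) {
                println!("external: {:?}", link.as_ref());
            }
        }
    }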

scientiac commented on May 21, 2024

I mean catching all the websites that aren't under the same domain, not just the ones I specify, like using -E * as a catch-all.

j-mendez commented on May 21, 2024

> I mean catching all the websites that aren't under the same domain, not just the ones I specify, like using -E * as a catch-all.

Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.

scientiac commented on May 21, 2024

I don't think it is a thing.

j-mendez commented on May 21, 2024

> I don't think it is a thing.

CASELESS_WILD_CARD external domains handling

#135 (comment) looks like it was done. Use website.with_external_domains.

scientiac commented on May 21, 2024

[screenshot]

It asks me to provide an argument.

j-mendez commented on May 21, 2024

> [screenshot]
> It asks me to provide an argument.

Correct, follow the type for the function. Set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.

scientiac commented on May 21, 2024

I used this:

        .with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

        .with_external_domains(Some(std::iter::once("*".to_string())));

This compiles just fine but doesn't give me any external links from the site:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");

    // Respect robots.txt, include subdomains, and try a wildcard for external domains.
    website.with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    // Collect everything the crawl found.
    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    for link in links {
        println!("{:?}", link.as_ref());
    }
}

I don't think I understand what the wildcard for this is.
