Comments (20)

j-mendez commented on May 21, 2024

@sebs this is now available in 1.42.0.

> crawling multiple domains as one for the URLs https://rssea.fr and https://loto.rsseau.fr

Thank you for the issue!
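
A minimal sketch of grouping the two domains into one crawl with the crate, assuming the with_external_domains builder and link accessors shown later in this thread (exact signatures may differ between versions):

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Seed domain; loto.rsseau.fr is pulled into the same crawl group.
        let mut website: Website = Website::new("https://rsseau.fr");
        website.with_external_domains(Some(
            vec!["https://loto.rsseau.fr"].into_iter().map(|s| s.to_string()),
        ));

        website.crawl().await;

        // Links from both domains end up in one collected set.
        for link in website.get_links() {
            println!("{:?}", link.as_ref());
        }
    }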

sebs commented on May 21, 2024

Ah, I love this so much ;) You are solving a big problem for me. I'm trying to build a URL dataset of 20 million for a coding challenge ;)

I really appreciate this as it saves me a ton of time.

sebs commented on May 21, 2024

Maybe make it possible to add a * to extract all external domain links?

Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.

j-mendez commented on May 21, 2024

Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney have their main domain as the root page while treating every link they care about as a different DNS name on the page. That pattern makes it hard to gather all of the website data.

j-mendez commented on May 21, 2024

@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.

sebs commented on May 21, 2024

I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?

j-mendez commented on May 21, 2024

@sebs it's not available in the CLI at the moment. Not all features carry over 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.

j-mendez commented on May 21, 2024

Now available in the CLI v1.45.10. Example below to group domains:

    spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o

The -E flag can also be written as --external-domains.
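
For reference, a hypothetical equivalent using the long-form flag (assuming it simply spells out the -E option above):

    spider --domain https://rsseau.fr --external-domains https://loto.rsseau.fr/ crawl -o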

j-mendez commented on May 21, 2024

@sebs done via 1.46.0. Thank you!

sebs commented on May 21, 2024

<3

apsaltis commented on May 21, 2024

Hi,
Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

    spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though one of the pages crawled, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:

Download our one-pager for more information

The output from the scrape command looks like this for that page:

    {
      "html": "",
      "links": [],
      "url": "https://www.theconsortium.cloud/application-consulting-services-page"
    },

Is there a way, either programmatically or via the CLI, to have a spider detect all of the links on a page? Thanks in advance.

scientiac commented on May 21, 2024

How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I'm trying to make a crawler with self-discovery of new sites from one seed.

j-mendez commented on May 21, 2024

> How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I'm trying to make a crawler with self-discovery of new sites from one seed.

Use website.external_domains to add domains into the group for discovery.
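
A minimal sketch of that approach, assuming the same with_external_domains builder and link accessors used elsewhere in this thread; the domain names below are placeholders, not sites from this issue:

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Hypothetical seed site.
        let seed = "https://example.com";
        let mut website: Website = Website::new(seed);

        // Add the external domains that should join the crawl group.
        website.with_external_domains(Some(
            vec!["https://blog.example.org"].into_iter().map(|s| s.to_string()),
        ));

        website.crawl().await;

        // Naive prefix check to pull out links that point away from the seed domain.
        for link in website.get_links() {
            if !link.as_ref().starts_with(seed) {
                println!("external: {:?}", link.as_ref());
            }
        }
    }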

scientiac commented on May 21, 2024

I mean catching all the websites that aren't under the same domain, not just the ones I specify, like using -E * as a catch-all.

j-mendez commented on May 21, 2024

> I mean catching all the websites that aren't under the same domain, not just the ones I specify, like using -E * as a catch-all.

Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.

scientiac commented on May 21, 2024

I don't think it is a thing.

j-mendez commented on May 21, 2024

> I don't think it is a thing.

CASELESS_WILD_CARD external domains handling

#135 (comment) looks like it was done. Use website.with_external_domains.

scientiac commented on May 21, 2024

[screenshot]

It asks me to provide an argument.

j-mendez commented on May 21, 2024

> [screenshot]
> It asks me to provide an argument.

Correct, follow the type for the function. Set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.

scientiac commented on May 21, 2024

I used this:

        .with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

        .with_external_domains(Some(std::iter::once("*".to_string())));

This compiles just fine but doesn't give me any external links from the site:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");

    // Respect robots.txt, include subdomains, and try a wildcard for external domains.
    website.with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    // Collect everything the crawl found.
    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    for link in links {
        println!("{:?}", link.as_ref());
    }
}

I don't think I understand what the wildcard for this is.
