Comments (20)
@sebs this is now available in 1.42.0.
Thank you for the issue!
from spider.
Ah, I love this so much ;) You are solving a big problem for me: I am trying to build a URL dataset of 20 million pages for a coding challenge ;)
I really appreciate this, as it saves me a ton of time.
Maybe make it possible to add a * to extract all external domain links?
Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.
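The link-map idea above (page A links to page B, with edges possibly crossing domains) can be sketched independently of the crawler: given a page's URL and the links found on it, partition them into same-domain and external edges. A minimal illustration in plain Rust; `host_of` and `partition_links` are hypothetical helper names for this sketch, not part of the spider API, and the naive string slicing stands in for a proper URL parser:

```rust
// Extract the host portion of an http(s) URL by plain string slicing.
// Illustrative only; a real crawler would use a proper URL parser.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    rest.split('/').next()
}

// Split the links found on `page` into (same-domain, external) edges,
// the two halves of a "page A links to page B" link map.
fn partition_links<'a>(page: &str, links: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let page_host = host_of(page);
    links.iter().copied().partition(|l| host_of(l) == page_host)
}

fn main() {
    let (internal, external) = partition_links(
        "https://example.com/a",
        &["https://example.com/b", "https://other.org/c"],
    );
    println!("internal edges: {:?}", internal);
    println!("external edges: {:?}", external);
}
```

Feeding every crawled page through such a partition yields exactly the cross-domain edges the wildcard feature discussed below is meant to surface.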
Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney have their main domain as the root page while serving every link they care about under a different DNS name. That pattern makes it hard to gather all of a website's data.
@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.
I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?
@sebs it is not available in the CLI at the moment. Features do not carry over 1:1; if they fit the CLI, they need to be added there separately. Going to re-open this issue for the CLI.
Now available in the CLI v1.45.10. Example below to group domains:

spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o

The -E flag can also be written as --external-domains.
@sebs done via 1.46.0. Thank you!
<3
Hi,
Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:
spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape
I never see any URLs from the external domain, even though one of the crawled pages, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot ("Download our one-pager for more information"). The output from the scrape command looks like this for that page:
{
  "html": "",
  "links": [],
  "url": "https://www.theconsortium.cloud/application-consulting-services-page"
},
Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.
How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.
> How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I am trying to make a crawler with self-discovery of new sites from one seed.
Use website.external_domains to add domains into the group for discovery.
I mean to catch all the websites that are not under the same domain, not just the ones I specify, like using -E *, a catch-all.
> I mean to catch all the websites that are not under the same domain, not just the ones I specify, like using -E *, a catch-all.
Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.
I don't think it is a thing.
> I don't think it is a thing.
#135 (comment) looks like it was done. Use website.with_external_domains.
It asks me to provide an argument.
> asks me to provide an argument
Correct, follow the type of the function and set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.
I used this:

.with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

.with_external_domains(Some(std::iter::once("*".to_string())));

Both compile just fine but do not give me any external links from the site:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");
    website
        .with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));
    website.crawl().await;

    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);
    for link in links {
        println!("{:?}", link.as_ref());
    }
}
I don't think I understand what the wildcard for this is supposed to be.