sromexs / get-sitemap-links

This package fetches and crawls sitemap pages recursively and collects every link found inside <loc> tags.

TypeScript 100.00%
sitemap sitemap-links typescript sitemap-xml sitemapper get-sitemap fetch-sitemap crawl-sitemap sitemap-crawler

get-sitemap-links's Introduction

Get Sitemap Links

Get Sitemap Links is a TypeScript library that fetches all links recursively from a sitemap page. It can be used in both Node.js and TypeScript applications.

Installation

You can install the package using npm:

npm i get-sitemap-links

Example

A simple usage example that retrieves all links from a sitemap URL:

const array = await GetSitemapLinks(
  "https://example.com/sitemap.xml"
);

// Output:
// array = [
//      "https://example.com/post/1",
//      "https://example.com/post/2",
//      "https://example.com/post/3",
//      "https://example.com/post/4",
//      ...
//  ]

With Node.js:

const GetSitemapLinks = require("get-sitemap-links").default;

(async () => {
  const array = await GetSitemapLinks(
    "https://nexload.ir/wp-sitemap-posts-post-1.xml"
  );
  console.log(array.length);
})();

With TypeScript:

import GetSitemapLinks from "get-sitemap-links";

(async () => {
  const array = await GetSitemapLinks(
    "https://nexload.ir/wp-sitemap-posts-post-1.xml"
  );
  console.log(array.length);
})();

Options

(async () => {
  const array = await GetSitemapLinks("https://nexload.ir/wp-sitemap.xml", {
    filterIndexes: "posts",
    // Only keep index entries whose URL includes the string "posts".
    // This option only applies when the given sitemap link is an index page, such as example.com/sitemap.xml.
  });
  console.log(array.length);
})();
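
The comments above describe filterIndexes as a substring filter over the entries of a sitemap index. The sketch below only illustrates that described behaviour; filterIndexUrls and the sample URLs are hypothetical and are not taken from the library's source.

```typescript
// Hypothetical helper (not part of get-sitemap-links): keeps only the
// sitemap index entries whose URL contains the given substring.
// When no filter is supplied, every entry is kept.
function filterIndexUrls(indexUrls: string[], filterIndexes?: string): string[] {
  if (!filterIndexes) {
    return indexUrls;
  }
  return indexUrls.filter((url) => url.includes(filterIndexes));
}

// Made-up index entries, for illustration only.
const sampleIndexUrls: string[] = [
  "https://example.com/wp-sitemap-posts-post-1.xml",
  "https://example.com/wp-sitemap-posts-page-1.xml",
  "https://example.com/wp-sitemap-taxonomies-category-1.xml",
];

const postsOnly = filterIndexUrls(sampleIndexUrls, "posts");
console.log(postsOnly.length);
```

Under this reading, a page sitemap passed directly (rather than an index page) has no index entries to filter, which would explain why the option has no effect there.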

get-sitemap-links's Issues

The links should parse XML data formats like <![CDATA[ ]]> when parsing sitemaps

First of all, thanks for your work here :)

I am using this library for an internal tool and realized that it fails to extract URLs correctly when the loc data is wrapped in a CDATA section, as below:

<sitemap>
  <loc><![CDATA[https://example.com/post-sitemap.xml]]></loc>
  <lastmod><![CDATA[2020-11-16T18:13:33+00:00]]></lastmod>
</sitemap>

In this case, the expected return value is https://example.com/post-sitemap.xml, but instead we get <![CDATA[https://example.com/post-sitemap.xml]]>

We may need to add a regex somewhere to extract the data inside the CDATA section.
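
One possible shape for such a regex, as a minimal sketch: unwrapCdata is a hypothetical helper name, and where it would hook into the library is an assumption, since the library's parsing code is not shown here.

```typescript
// Hypothetical helper (not part of get-sitemap-links): if the raw <loc>
// content is wrapped in a CDATA section, return the text inside it;
// otherwise return the content unchanged. Whitespace is trimmed either way.
function unwrapCdata(raw: string): string {
  const match = raw.trim().match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (match ? match[1] : raw).trim();
}

const wrapped = unwrapCdata("<![CDATA[https://example.com/post-sitemap.xml]]>");
const plain = unwrapCdata("  https://example.com/post-sitemap.xml  ");
console.log(wrapped === plain);
```

Applying the helper to every extracted loc value would leave plain URLs untouched, so it should be safe for sitemaps that do not use CDATA.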

Something wrong with regex pattern

The regex rules sometimes mismatch.
For example, the code below may return the following as a single URL: https://www.orangemarketing.com</loc><lastmod>2023-01-30</lastmod></url><url><loc>https://www.orangemarketing.com/555

The result contains part of the sitemap XML. It looks like an issue with a missing trailing slash.

const GetSitemapLinks = require("get-sitemap-links").default;

(async () => {
  const array = await GetSitemapLinks(
    "https://www.orangemarketing.com/sitemap.xml"
  );
  
  console.table(array);
})();
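
One plausible cause of a result that spans several tags is a pattern that matches past the first closing </loc>. That is an assumption, since the library's actual pattern is not shown in this report. The sketch below uses a hypothetical extractLocs helper with a lazy quantifier, so each match stops at the nearest </loc> whether or not the URL ends with a slash.

```typescript
// Hypothetical helper (not part of get-sitemap-links): collects the text
// of every <loc> element. The lazy quantifier `*?` makes each match end
// at the nearest closing </loc> instead of running on to a later one.
function extractLocs(xml: string): string[] {
  const locPattern = /<loc>\s*([\s\S]*?)\s*<\/loc>/g;
  const links: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Fragment modelled on the output quoted in the report above.
const sampleXml =
  "<url><loc>https://www.orangemarketing.com</loc><lastmod>2023-01-30</lastmod></url>" +
  "<url><loc>https://www.orangemarketing.com/555</loc></url>";

const locs = extractLocs(sampleXml);
console.table(locs);
```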
