GithubHelp home page GithubHelp logo

usshen / angleparse Goto Github PK

View Code? Open in Web Editor NEW

This project forked from kamome283/angleparse

0.0 1.0 0.0 75 KB

HTML parsing and processing tool for PowerShell.

License: Apache License 2.0

C# 81.04% Batchfile 0.25% PowerShell 14.28% Shell 4.43%

angleparse's Introduction

AngleParse

Easy to use HTML parsing and processing tool for PowerShell.

# Articles from PowerShell dev blog
iwr 'https://devblogs.microsoft.com/powershell/' |
Select-HtmlContent 'div.entry-box.container', @{
  Title = 'h5.entry-title'
  Author = 'span.entry-author-link', ([regex]'(\w+) \w+')
  PostDate = 'span.entry-post-date', { [DateTime]::Parse($_) }
} | select -first 3

# PostDate           Title                                                   Author
# --------           -----                                                   ------
# 2020/07/07 0:00:00 PowerShellGet 3.0 Preview 6 Release                     Sydney
# 2020/06/26 0:00:00 Native Commands in PowerShell – A New Approach – Part 2 Jim
# 2020/06/22 0:00:00 Native Commands in PowerShell – A New Approach          Jim

Package

https://www.powershellgallery.com/packages/AngleParse

Before Use

  1. Install-Package AngleParse
  2. Import-Module AngleParse

Usage

gc ./foobar.html -raw | Select-HtmlContent ([AngleParse.Attr]::Class)

Select-HtmlContent command receives string content from pipeline, interprets given content as HTML DOM tree, then processes with given selectors which are specified in the command's first argument.

About Selector

There are 5 kinds of selectors. They are CSS selector, attribute selector, regex selector, scriptblock selector and hashtable selector. All selectors receive one input and output multiple (includes 0 and 1) items. And by specifying multiple selectors, you can combine selectors which works as PowerShell's pipeline.

'<div><span>abc</span></div>' | Select-HtmlContent "div > span", ([regex]'a(bc)')
# bc

# Similar work as below.
'<div><span>abc</span></div>' |
Select-HtmlContent "div > span" |
% { $_ | Select-HtmlContent ([regex]'a(bc)') }
# bc

When selector outputs single item, output array is unified like PowerShell's default behaviour for ease of use.

iwr "https://b.hatena.ne.jp/" | Select-HtmlContent "div.entrylist-contents", @{ 
    Title = "h3.entrylist-contents-title > a"
    Tags = "a[rel=tag]"
} | select -first 1 | Format-List

# Title contains only one string item so that array is unified.
# Title : Go To トラベル 感染を広げないためには(忽那賢志) - 個人 - Y...
# Tags  : {COVID-19, 旅行, 社会, 医療…}    

CSS Selector

'<div><span class="foo">text content here</span></div>' | Select-HtmlContent "div > span.foo"
# text content here

String is interpreted as CSS selector. This selector receives DOM element and outputs DOM elements which matches given selector.

Attribute Selector

'<a href="https://foo.go.jp">bar</a>' | Select-HtmlContent ([AngleParse.Attr]::Href)
# https://foo.go.jp

Enum value of AngleParse.Attr class is interpreted as attribute selector. There are 11 kinds of attributes as below.

Element
InnerHtml
OuterHtml
TextContent
Id
Class
SplitClasses
Href
Src
Title
Name

This selector receives DOM element and outputs matched attributes as string, excluding Element attribute which is introduced in ScriptBlock Selector section. SplitClasses attribute outputs classes which is split by space, although Class attribute does not do any special work.

# Class
'<div class="foo bar">' | Select-HtmlContent ([AngleParse.Attr]::Class)
# foo bar

# SplitClasses
'<div class="foo bar">' | Select-HtmlContent ([AngleParse.Attr]::SplitClasses)
# foo
# bar

Regex Selector

Regex value is interpreted as regex selector. This selector receives DOM element or string and outputs captured strings. When you pass DOM element to this selector, this selector operate matching on the element's inner text content.

# Not captured the day part so that outputs are year and month.
'<span>2020/07/22</span>' | Select-HtmlContent ([regex]'(\d{4})/(\d{2})/\d{2}')
# 2020
# 07

ScriptBlock Selector

'<span>7</span>' | Select-HtmlContent { [int]$_ * 6; [int]$_ * 7 }
# 42
# 49

ScriptBlock is interpreted as scriptblock selector. This selector receives any kind of objects and outputs evaluated objects. Passed object is bound to $_. When you pass DOM element to this selector, the element is implicitly converted to the element's inner text content which has string type. If you do not want this conversion, pipe Element attribute selector before using scriptblock selector.

'<div><span>a</span><span>b</span></div>' | 
Select-HtmlContent ([AngleParse.Attr]::Element), { $_.ChildElementCount }
# 2

Hashtable Selector

'<div class="a">1a</div><div class="b">2b</div>' |
select-htmlcontent "> div",
@{ Class = ([AngleParse.Attr]::Class);
   NumPlus1 = ([regex]'(\d)\w'), { [int]$_ + 1 } }

# Class NumPlus1
# ----- --------
# a            2
# b            3

Hashtable is interpreted as hashtable selector. Each value of the hashtable must be valid selctor(s). This selector processes input with given selectors in the hashtable and bound to the corresponding key.

Other Resources

PowerShellから簡単にスクレイピングするためのツールを作った(Japanese)

Special Thanks To

and all the support.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.