topfunky / hpple

An XML/HTML parser for Objective-C, inspired by Hpricot.

Home Page: http://topfunky.com

License: MIT License

Objective-C 59.25% HTML 39.70% Ruby 1.05%

hpple's People

Contributors

3lvis, bitdeli-chef, davydotcom, draganjovev, imrekel, lax, macserv, mattjgalloway, premedios, rdougan, readmecritic, saleh-hosseinkhani, tobihagemann, topfunky, trupin, wanghui9309, wess

hpple's Issues

Missing content string when parsing HTML

Hi!

I want to use hpple in my little iPhone project and stumbled upon a problem. When parsing a table in a website, I noticed that text within a tag was not parsed. This is the example I tested with:






Un couple épatant

(2002)

aka "Trilogy: Two" - International (English title) , UK

aka "An Amazing Couple" - International (English title)

aka "Two" - UK

The text (2002) should be content of the <td> tag but querying the respective element's content resulted in an empty string. I debugged and found the cause, although I am not sure whether it broke anything else, which is why I would kindly ask you to look into it.

The problem lies in XPathQuery's DictionaryForNode method:

if ([[resultForNode objectForKey:@"nodeName"] isEqual:@"text"] && parentResult)
{
[parentResult setObject:[currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:@"nodeContent"];
return nil;
}

If I got it right, the parser checks the currentNode's name first and if this is set to "text" (I assume libxml does that) it sets the parentNode's content to that string. The problem here is, it doesn't check whether there is already a string. Instead it replaces a potentially present string, even with an empty string. Here is my solution (again not really tested yet):

if ([[resultForNode objectForKey:@"nodeName"] isEqual:@"text"] && parentResult)
{
    currentNodeContent = [currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
    NSString *parentNodeContent = [parentResult objectForKey:@"nodeContent"];
    if (parentNodeContent) {
        currentNodeContent = [parentNodeContent stringByAppendingString:currentNodeContent];
    }
    [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
    return nil;
}

Please note that I don't remove newline characters. I'm not sure what the best approach is here, but a string without any separators doesn't seem like a good idea to me.

I hope this really is a bug. Feedback is appreciated.

Cheers,
Chris
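
For anyone following along, the append-instead-of-overwrite idea from this issue can be sketched in isolation. The snippet below is plain C (it also compiles as Objective-C) and is not hpple's code: `append_text_chunk` and the fixed-size buffer are hypothetical, chosen only for illustration. It trims spaces and tabs from each text chunk but keeps newlines, as proposed above, and appends to whatever the parent already holds.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of hpple: appends one text chunk to the
 * parent's accumulated content instead of overwriting it.
 * Leading/trailing spaces and tabs are trimmed; newlines are kept.
 * Returns 0 on success, -1 if the result would not fit in `cap` bytes. */
static int append_text_chunk(char *parent, size_t cap, const char *chunk)
{
    const char *start = chunk;
    const char *end = chunk + strlen(chunk);

    while (start < end && (*start == ' ' || *start == '\t')) start++;
    while (end > start && (end[-1] == ' ' || end[-1] == '\t')) end--;

    size_t used = strlen(parent);
    size_t add = (size_t)(end - start);
    if (used + add + 1 > cap) return -1;   /* leave parent untouched */

    memcpy(parent + used, start, add);
    parent[used + add] = '\0';
    return 0;
}
```

Because newlines survive the trim, chunks that ended a line in the HTML source stay visually separated after concatenation; whether that is the right separator policy is exactly the open question raised in the issue.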

Recreate a blog structure

Hi, I am currently working on a blog reader and use hpple to parse HTML. I would like to use it to get the full structure of a blog post, that is, to know the order in which the different tags follow one another. What I would like to accomplish in the end is, instead of showing a blog post in a UIWebView, to get all its elements and show them in native views. The problem is that the structure varies from post to post, so I need an algorithm that scans each post's HTML and gives me a clear structure. I hope you can help me with it. You can also check this question on Stack Overflow: http://stackoverflow.com/questions/24041858/recreating-blog-post-structure-in-a-native-ios-view

Doesn't show embedded videos

When I use this code to parse a website that has an embedded video, it doesn't include the video. Is this a problem with the parser?

Not support UTF8 parsing

I tried to parse Korean text in HTML but it isn't working. I think it uses a C string rather than NSString when parsing, so it only works if every character is within the ASCII range.
P.S. It also cannot parse Unicode.

Memory leak in searchWithXPathQuery?

Hi! I am using this cool lib in my app to parse HTML. It works fine, but I found a memory leak each time I use searchWithXPathQuery. I've searched all over the internet, but I still have leaks.

In Xcode Instruments, the table of leaked objects shows:

Responsible library - libxml2.2.dylib
Responsible frame - xmlStrndup

[TFHppleElement raw] returning bad results?

I'm comparing the results of [rootElement raw] with the actual HTML, and the results are similar, but not the same. Specifically, the characters "]]" have been appended inside each "script" tag.

Um...this isn't correct behavior, is it? Just checking to make sure I understand how the "raw" method is supposed to work.

[Update: Okay, I see why: There's a "![CDATA[" that's been added at the beginning of each script tag. Which is causing the HTML to not behave properly.]

HTML with quotes around content causes it to not exist in the hpple element.

When I search for the span via simple query string, the following html which looks like this:
"3097"
"Designer: someDesigner"

ends up coming out like this in the hpple element.

code:

Basically it doesn't see anything between quotes. It looks like a C parsing issue to me, but I don't understand the code well enough to tell.

TFhpple memory leak solution

xmlChar *nodeContent = xmlNodeGetContent(currentNode);
    if (nodeContent != NULL) {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)nodeContent
                                                          encoding:NSUTF8StringEncoding];
        if ([resultForNode[@"nodeName"] isEqual:@"text"] && parentResult) {
            if (parentContent) {
                NSCharacterSet *charactersToTrim = [NSCharacterSet whitespaceAndNewlineCharacterSet];
                parentResult[@"nodeContent"] = [currentNodeContent stringByTrimmingCharactersInSet:charactersToTrim];
                /** Memory leak point release, Prevent memory leak */
                xmlFree(nodeContent);
                /** Memory leak point release, Prevent memory leak */
                return nil;
            }
            if (currentNodeContent != nil) {
                resultForNode[@"nodeContent"] = currentNodeContent;
            }
            /** Memory leak point release, Prevent memory leak */
            xmlFree(nodeContent);
            /** Memory leak point release, Prevent memory leak */
            return resultForNode;
        } else {
            resultForNode[@"nodeContent"] = currentNodeContent;
        }
        xmlFree(nodeContent);
    }
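
The point of the fix above is that the buffer returned by xmlNodeGetContent has to be released with xmlFree on every return path. One common way to make that hard to get wrong in C is to route all exits through a single cleanup label. The sketch below is only an illustration of that pattern, not hpple's code: it uses malloc/free as stand-ins for the libxml2 calls so that it is self-contained, and `copy_content`, `process_node` and `live_buffers` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Counts buffers that were allocated but not yet released (for checking). */
static int live_buffers = 0;

/* Stand-in for xmlNodeGetContent: returns a heap copy the caller must free. */
static char *copy_content(const char *text)
{
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, text, n);
        live_buffers++;
    }
    return copy;
}

/* Hypothetical routine with several outcomes but one release point.
 * Returns 1 for a text node that has a parent, 0 for any other node,
 * and -1 if the allocation failed. */
static int process_node(const char *text, int is_text_node, int has_parent)
{
    int result = 0;
    char *content = copy_content(text);
    if (content == NULL) return -1;      /* nothing to release yet */

    if (is_text_node && has_parent) {
        result = 1;
        goto cleanup;                    /* early exit still frees below */
    }
    /* ... other handling of `content` would go here ... */

cleanup:
    free(content);                       /* with libxml2: xmlFree(content) */
    live_buffers--;
    return result;
}
```

With this shape, adding another early exit later only requires another `goto cleanup`, rather than remembering to duplicate the free call.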

Not available through cocoapods

The library is not available through cocoapods. After adding it to Podfile if I run pod install it says [!] Unable to find a specification for Hpple.

How to use it with kCFStringEncodingGB_18030_2000

How can I use it to parse GB2312 HTML? Thank you.

NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData encoding:[NSString stringWithFormat:@"%lu",(unsigned long)encoding]];

But this doesn't work for me.

encoding problem

I have to use 'EUC-KR',
but hpple assumes 'NSUTF8StringEncoding'.

How can I make it handle EUC-KR (-2147481280)?

XML Woes

Cool lib TopFunky

Out of interest, I couldn't get it to work for this xml at all :(

Every XPath query I try returns an empty array. Tried debugging it too, can't see what's going on. Also tried stripping out the xmlns stuff. No joy. Response string looks good too, hpple just don't dig it :(

        ASIFormDataRequest *storeRequest;
    NSString *urlString = @"http://clients.multimap.com/API/search/1.2/asdanew?countryCode=GB&returnFields=wal_mart_no,name&routeModes=driving&lat=53.799720&lon=-1.549170&orderByFields=Distance&count=10";    

    storeRequest = [ASIHTTPRequest requestWithURL:[NSURL URLWithString:urlString]]; 
    [storeRequest startSynchronous];    
    NSError *error = [storeRequest error];
    if (!error) {       
        NSString *response = [storeRequest responseString];     
        NSData *data = [response dataUsingEncoding:NSUTF8StringEncoding];       

        //NOT WORKING - TFHpple Not Parsing At All :(
        TFHpple *doc = [[TFHpple alloc] initWithXMLData:data];
        NSArray *records = [doc search:@"//Field" ];
        NSLog(@"%@", records);
        }

innerHTML?

Is it possible to get innerHTML?

I have some html like
<span>text<br>text2</span>, and only text2 shows up in [element content].

So I don't know how to get the full content. I remember Hpricot offers something like innerHtml.

Thank you!

about the html Escape character

My HTML is:
<html><body><div id='nativerich'>!;&amp;(*)</div></body></html>

When I parse it with hpple, I get:

div raw string: <div id="nativerich">!;&amp;(*)</div>
div content string: "!;&(*)"

How can I make the div's raw string match its content string, that is, have &amp; parsed as &?

Seeking for Help in Emergency

How to iterate/parse all children for given tag

Hi,

I have tried various xpath expressions to get all images, within lists, divs, etc. But, not all images (or script tags) are found.

Not sure if its an xpath issue, but how can the entire tree be iterated through, searching for these tags ... while it may be slower, brute force sometimes works.

Thanks,

Peter
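
A brute-force walk of the kind asked about here is a short depth-first recursion. The sketch below is plain C with a hypothetical `node` struct (first child plus next sibling, similar in shape to libxml2's `children` and `next` links); it is not hpple or libxml2 code, and `count_tag` is an illustrative name. It shows how every node in a tree can be visited without relying on an XPath expression.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal node: a tag name, a first child and a next sibling. */
typedef struct node {
    const char *tag;
    struct node *first_child;
    struct node *next_sibling;
} node;

/* Depth-first walk: visits `n`, its following siblings, and all of their
 * descendants, counting nodes whose tag equals `tag`. */
static size_t count_tag(const node *n, const char *tag)
{
    size_t count = 0;
    for (; n != NULL; n = n->next_sibling) {
        if (strcmp(n->tag, tag) == 0) count++;
        count += count_tag(n->first_child, tag);
    }
    return count;
}
```

Called on a root node (which has no siblings), this covers the whole document. The same loop could collect matching nodes into a list instead of counting them.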

Link content in label

When I try and retrieve text from an html link, I can output it correctly in the NSLog but when I try and set a label's text to it, nothing shows up. This is confusing because I see the data in the log but I cannot make it appear in a label. It works if I put it in a text field which is strange but I want a label. Thanks for the support.

Query works in Chrome, not in hpple

Trying to use a query that works in Chrome console but not in hpple

"//ul//li/span[normalize-space(text())='10']"

Any ideas on how to replicate this to work with hpple would be greatly appreciated.

Memory leak in XPathQuery

Hi,

First thank you for your work on Hpple, it helps a lot.

In an unusual case in my code, when the query doesn't match, I hit a memory leak. Please see this diff.

-- seb

No way to parse text content with no parent html tag

Many websites have the text content of paragraphs as bare text (shown in double quotes below) instead of enclosing it in HTML tags. hpple does not parse such sites correctly.

For the following HTML :-

"A new " report "by intelligence firm " Recorded Future "examined Al-Qaeda's changes in encryption in response to the Snowden leaks, noting "an increased pace of innovation, specifically new competing jihadist platforms and three major new encryption tools from three different organizations - GIMF, Al-Fajr Technical Committee, and ISIS - within a three to five-month time frame of the leaks."

It is not possible to extract the text because TFHppleElement node with class 'article' has all the text without parent HTML tag in its content.

It has various child nodes with "a", "em" tags, the rest is in its content.

See http://appleinsider.com/articles/14/08/02/al-qaeda-prefers-android-over-apples-ios for example.

TFHppleElement * div = [[doc searchWithXPathQuery:@"//div[@class='article']"] objectAtIndex:0];
po div.content (it won't print text which is hyperlinked or italicised).

TFHpple Testcases

Looks like all the test cases are outdated. They call methods that don't exist any more. Shouldn't this be corrected?

Search in TFHppleElement

I'm wondering if support for searchWithXPathQuery will be available on the TFHppleElement itself.

Let's say I parse a complete HTML document with hpple and select a repeating part, like the tr's of a table, through XPath.
I then get an array of TFHppleElements for these tr's.
I would then like to search within this subset for some child nodes with XPath to get more detailed information.

Will this be possible?

Add Support for other xmlXPathObjectTypes

I'd like to see support for types other than XPATH_NODESET such as XPATH_STRING, XPATH_NUMBER, and XPATH_BOOLEAN. Perhaps a more general object wrapper could be created to store other data types and be returned instead of TFHppleElements only.
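
One possible shape for such a general wrapper is a tagged union: a `kind` field that says which XPath result type is stored, plus a union holding the value. The sketch below is plain C and purely hypothetical (none of these names exist in hpple); the four kinds mirror libxml2's XPATH_NODESET, XPATH_BOOLEAN, XPATH_NUMBER and XPATH_STRING, and a node count stands in for a real node list.

```c
#include <stddef.h>

/* Hypothetical wrapper, not hpple API: one value type for all result kinds. */
typedef enum {
    RESULT_NODESET,
    RESULT_BOOLEAN,
    RESULT_NUMBER,
    RESULT_STRING
} result_kind;

typedef struct {
    result_kind kind;
    union {
        size_t      node_count; /* stand-in for a real node list */
        int         boolean;
        double      number;
        const char *string;
    } as;
} xpath_result;

static xpath_result result_from_number(double value)
{
    xpath_result r;
    r.kind = RESULT_NUMBER;
    r.as.number = value;
    return r;
}

static xpath_result result_from_string(const char *value)
{
    xpath_result r;
    r.kind = RESULT_STRING;
    r.as.string = value;
    return r;
}

/* Callers check the kind before reading the union. Returns 1 and stores the
 * number in *out on success, 0 if the result holds another kind. */
static int result_get_number(const xpath_result *r, double *out)
{
    if (r->kind != RESULT_NUMBER) return 0;
    *out = r->as.number;
    return 1;
}
```

In an Objective-C API the same idea would more likely surface as a small class with a type property, but the check-the-kind-then-read contract would be the same.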

Crashes when selecting certain elements on certain pages

The following code crashes:

NSData* data = [NSData dataWithContentsOfURL:[NSURL URLWithString:@"http://tvtropes.org/pmwiki/pmwiki.php/Comicbook/TheAvengers"]];
    TFHpple* hpple = [TFHpple hppleWithHTMLData:data];
    NSArray* array = [hpple searchWithXPathQuery:@"//div"];

with output:

'NSInvalidArgumentException', reason: '-[__NSCFDictionary setObject:forKey:]: attempt to insert nil value (key: nodeContent)'

Using queries like @"//table", @"//td", @"//tr" also crashes, while @"//span", @"//a" do not.

I've put this in a simple, separate class to isolate the issue. The search I'm actually trying to do is @"//div[@class='indent']", which crashes on the above URL and many other TVTropes pages, but does work on, for example, http://tvtropes.org/pmwiki/pmwiki.php/Film/TheAvengers. Plain @"//div" doesn't work there either.

Please add semantic version tags

I’ve recently added hpple to the CocoaPods package manager repo.

CocoaPods is a tool for managing dependencies for OSX and iOS Xcode projects and provides a central repository for iOS/OSX libraries. This makes adding libraries to a project and updating them extremely easy and it will help users to resolve dependencies of the libraries they use.

However, hpple doesn't have any version tags. I’ve added the current HEAD as version 0.0.1, but a version tag will make dependency resolution much easier.

Semantic version tags (instead of plain commit hashes/revisions) allow for resolution of cross-dependencies.

In case you didn’t know this yet; you can tag the current HEAD as, for instance, version 1.0.0, like so:

$ git tag -a 1.0.0 -m "Tag release 1.0.0"
$ git push --tags

Bump the version

Could you bump the version so the latest code can be pulled in through CocoaPods?

If you no longer want to maintain this I can fork and push it into CocoaPods as a separate project. Just wanted to check what you want to do first.

Thanks.

Can it also parse XML or not?

Hello, you stated within the title that it can be used for HTML and XML.
Can I use it for XML now?
I did a test but it does not seem to work. HTML works well.

Please tell me, what's missing if it is possible to parse XML.

Compile issue when using Apple LLVM 5.1 - Language -> Compile Sources As -> Objective-C++

For one of my projects I have a lot of C++ code which requires me to specify the Apple LLVM 5.1 - Language as 'Objective-C++' rather than leave it as 'According to file type'.
This produces two compilation errors in XPathQuery.m:
XPathQuery.m:192:11: No matching function for call to 'htmlReadMemory'
XPathQuery.m:218:11: No matching function for call to 'xmlReadMemory'
where the detail is (for the first one):
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator7.1.sdk/usr/include/libxml2/libxml/HTMLparser.h:206:3: Candidate function not viable: cannot convert argument of incomplete type 'const void *' to 'const char *'
NB when I am able to leave the compiler setting as 'According to file type' I get no errors.

Encoding should be passed as NSStringEncoding, not NSString

Now that we can initialize a TFHpple with an encoding parameter, one has to search through the libxml2 code to find what literal values are allowed for the encoding. This is not in keeping with the way string encoding is handled throughout Objective-C. Instead hppleWithHTMLData: encoding: and similar methods should take an NSStringEncoding enum, such as NSUTF8StringEncoding or NSISOLatin1StringEncoding, and then pass the appropriate encoding string to libxml2.

Unfortunately this would break compatibility of the API for anyone already using the new with-encoding methods; perhaps still more methods like hppleWithHTMLData: enumEncoding: should be created?
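
The translation step this issue asks for is essentially a lookup from an encoding constant to a charset name string. A minimal sketch in plain C follows; the `text_encoding` enum and `charset_name_for` are hypothetical stand-ins for NSStringEncoding values and are not hpple API. On Apple platforms, Core Foundation's CFStringConvertNSStringEncodingToEncoding and CFStringConvertEncodingToIANACharSetName can perform this lookup instead of a hand-written table. Whether libxml2 then accepts a given name (beyond its built-in ones such as UTF-8 and ISO-8859-1) is assumed to depend on how it was built.

```c
#include <stddef.h>

/* Hypothetical enum standing in for NSStringEncoding values. */
typedef enum {
    ENC_UTF8,
    ENC_ISO_LATIN1,
    ENC_EUC_KR,
    ENC_GB18030
} text_encoding;

/* Maps the enum to an IANA charset name, the kind of string an encoding
 * parameter of type "char *" expects. Returns NULL for unknown values so the
 * caller can fall back to letting the parser detect the encoding. */
static const char *charset_name_for(text_encoding enc)
{
    switch (enc) {
    case ENC_UTF8:       return "UTF-8";
    case ENC_ISO_LATIN1: return "ISO-8859-1";
    case ENC_EUC_KR:     return "EUC-KR";
    case ENC_GB18030:    return "GB18030";
    }
    return NULL;
}
```

Keeping the existing string-based methods and adding enum-based ones that call such a lookup would avoid the compatibility break mentioned above.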

Memory leak in TFHpple and TFHppleElement

I used Profile to check whether my app leaks memory, and I think the following code has a memory leak bug:

TFHpple.m

+ (TFHpple *) hppleWithData:(NSData *)theData isXML:(BOOL)isDataXML {
    return [[[self class] alloc] initWithData:theData isXML:isDataXML];
}

should be:

+ (TFHpple *) hppleWithData:(NSData *)theData isXML:(BOOL)isDataXML {
    return [[[[self class] alloc] initWithData:theData isXML:isDataXML] autorelease];
}

TFHppleElement.m

+ (TFHppleElement *) hppleElementWithNode:(NSDictionary *) theNode {
    return [[[self class] alloc] initWithNode:theNode];
}

should be:

+ (TFHppleElement *) hppleElementWithNode:(NSDictionary *) theNode {
    return [[[[self class] alloc] initWithNode:theNode] autorelease];
}

Add license for XPathQuery files

This is what's used currently:

//
//  XPathQuery.h
//  FuelFinder
//
//  Created by Matt Gallagher on 4/08/08.
//  Copyright 2008 __MyCompanyName__. All rights reserved.
//

Original article: https://www.cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html

Here's the license specified: https://www.cocoawithlove.com/about/

// Copyright © 2008-2018 Matt Gallagher ( http://cocoawithlove.com ). All rights reserved.

// Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee
// is hereby granted, provided that the above copyright notice and this permission notice appear in
// all copies.

// THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
// REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
// AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
// INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
// FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
// NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH
// THE USE OR PERFORMANCE OF THIS SOFTWARE.

how to ignore <br />

Dear topfunky, how should I ignore <br />?
The HTML file looks like this:
<div class="Fiche">
<p>Every Sunday, our Chef proposes a buffet high in color.<br />
<br />
A brunch <br />
<br />
Every <br />
<br />
Information/ reservations : (377) 98 06 03 60</p>

When I parse the text between <p> and </p>, all I get is: Information/ reservations : (377) 98 06 03 60.
How should I ignore <br />? Thanks a lot!

My code:
NSArray *array4Soustitre = [xpathParser search:@"//div [@class='Fiche']/p"];
TFHppleElement *ele= [array4Soustitre objectAtIndex:0];
NSLog(@"content is %@ ",[ele content]);

Get element's innerHtml without element's tags?

How can I get the raw HTML of an element, not including the tags of that element itself? For example, if I were getting <body>, and body looks like this:

<body>stuff in body</body>

How can I get only:

stuff in body

Currently the raw property includes the tags of the target element itself.

node's parent is nil

my html code is like this:

And my objc code is:
TFHpple* hpple = [TFHpple hppleWithHTMLData:data];
NSArray* links = [hpple searchWithXPathQuery:@"//a[@href!='']"];
for (TFHppleElement* link in links) {
NSLog(@"link's parent: %@", link.parent);
}

The output shows that the link's parent is nil, while it should be the <li> node in this case. So is this a bug?

Missing content when parsing html

    hi,
    first thank you for hpple. It helps a lot.

    Recently when I use hpple to parse a html file, the content, which should have something in it, is null.
    Here is part of the html file:

            发信人:
                znslm
                (小白), 信区: Pictures 标 题: Re: 你十六岁喜欢的那个人怎么样了?发信站: 南京大学小百合站 (Mon Nov 7 21:49:51 2011)    小孩上小学了
                
                [:D]
                
                喜欢我的那位呢/ 我猜的 --
            
            
The XPath string is "//tr/td/pre/a". The result is supposed to be "znslm", but it's null. How can I fetch the string?

help

    how can I get the image url in this string ?
    8c6fa975-8978-4d6f-894f-86766912e536

Memory leak FIX in : NSDictionary *DictionaryForNode

    Hi,

    thx for this library, I was using it and I saw a memory leak while using it.
    I had a quick look at it, please test the fix below (lines with "//RNR")

    hope it helps, it corrected my issue :-)

    NSDictionary *DictionaryForNode(xmlNodePtr currentNode, NSMutableDictionary *parentResult,BOOL parentContent)
    {
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];
    if (currentNode->name) {
    NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->name
    encoding:NSUTF8StringEncoding];
    resultForNode[@"nodeName"] = currentNodeContent;
    }

    xmlChar *nodeContent = xmlNodeGetContent(currentNode);
    if (nodeContent != NULL) {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)nodeContent
                                                          encoding:NSUTF8StringEncoding];
        if ([resultForNode[@"nodeName"] isEqual:@"text"] && parentResult) {
            if (parentContent) {
                NSCharacterSet *charactersToTrim = [NSCharacterSet whitespaceAndNewlineCharacterSet];
                parentResult[@"nodeContent"] = [currentNodeContent stringByTrimmingCharactersInSet:charactersToTrim];
                xmlFree(nodeContent); //RNR
                return nil;
            }
            if (currentNodeContent != nil) {
                resultForNode[@"nodeContent"] = currentNodeContent;
            }
            xmlFree(nodeContent); //RNR
            return resultForNode;
        } else {
            resultForNode[@"nodeContent"] = currentNodeContent;
        }
        xmlFree(nodeContent);
    }
    

    Rachid

hpple on AppleWatch

Hi,
I use your class in my app to display the images of a gallery in a blog hosted on wordpress.com.

On iPhone it works perfectly, but on Apple Watch it works in the simulator and fails on a physical device.
When the parsing reaches these lines of code

    TFHpple *parser = [TFHpple hppleWithHTMLData:appleWatchHtmlData];
    NSLog(@"parser %@", parser);
    NSString *tutorialsXpathQueryString = @"//div[@class='gallery galleryid-1461 gallery-columns-3 gallery-size-thumbnail']/dl/dt/a/img";
    
    NSArray *imageFilesArray = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
    NSMutableArray *imagesArray = [[NSMutableArray alloc] initWithCapacity:0];
    

the NSLog output says
    2015-10-25 18:19:13.939 WatchApp Extension[398:328494] parser <TFHpple: 0x1655d7e0>
    2015-10-25 18:19:13.947 WatchApp Extension[398:328494] Unable to parse.

Perhaps it is due to the Apple Watch's limited resources, but how can I solve this issue?

    Thank you so much
