
Comments (2)

laughingclouds commented on June 9, 2024

I experimented with bs4 a little. There are many ways to separate the content from the rest of the HTML document.

One way might be

# p.text is the text inside each paragraph tag
for p in soup.find_all("p"):
    print(p.text)

But when I ran this against a document, the output was garbled.
I checked the document and found many stray '\n' characters inside the paragraphs.
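To illustrate the problem, here is a minimal sketch using a made-up string in place of a real `p.text` value (the sample text is an assumption, not taken from the actual document). It shows how embedded newlines and indentation break up the output, and how collapsing whitespace restores a single readable line.

```python
# Hypothetical paragraph text, as it might come back from p.text
# when the HTML source wraps lines inside a <p> tag.
raw = "The quick\n   brown fox\n\n jumps over\n the lazy dog."

# str.split() with no arguments splits on any run of whitespace,
# including '\n', so joining with single spaces collapses the breaks.
clean = " ".join(raw.split())

print(repr(raw))  # shows the embedded '\n' characters
print(clean)
```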

What we could do is reformat the text within every paragraph.
That is, we collect the tags we want and insert them all into an HTML template.

I was also thinking of storing that "template" HTML code, along with the style rules, in a separate place.
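A rough sketch of the template idea, using only the standard library. The names `STYLE_RULES`, `PAGE_TEMPLATE`, and `build_page` are hypothetical, and the template and style rules are kept as module constants here only for brevity; they could just as well be read from separate files.

```python
from html import escape
from string import Template

# Hypothetical style rules; these could live in their own .css file.
STYLE_RULES = "p { margin: 0 0 1em; line-height: 1.5; }"

# Hypothetical page template; this could live in its own .html file.
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n"
    '<meta charset="utf-8">\n'
    "<title>$title</title>\n"
    "<style>$style</style>\n"
    "</head>\n"
    "<body>\n$body\n</body>\n"
    "</html>\n"
)


def build_page(title: str, paragraphs: list[str]) -> str:
    """Wrap already-cleaned paragraph strings in <p> tags and fill the template."""
    # escape() keeps characters such as '&' and '<' from breaking the markup
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return PAGE_TEMPLATE.substitute(title=escape(title), style=STYLE_RULES, body=body)


page = build_page("Chapter 1", ["First paragraph.", "Tom & Jerry <3"])
print(page)
```

Keeping the template and the style rules apart from the scraping code means the look of the output can change without touching the parsing logic.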

from scrapia-world.

laughingclouds commented on June 9, 2024

This piece of code does a good job of handling the text formatting, though it still needs improvement.

from bs4 import BeautifulSoup


def fixLine(lineText: str) -> str:
    """lineText is a single line of a paragraph"""
    # split() with no arguments already drops all whitespace,
    # so no extra filtering of the words is needed
    return " ".join(lineText.split())


def fixPara(pText: str) -> str:
    """pText is text within a paragraph tag"""
    # Collapse every run of whitespace (including '\n') into a single space
    return " ".join(pText.split())


fName = "HTML_FILE_NAME"
with open(fName) as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Build the output with one cleaned paragraph per line
s = ""
for p in soup.find_all("p"):
    s += fixPara(p.text) + "\n"
s = s.rstrip("\n")

