opendatade / html2csv
This project forked from r-barnes/html2csv
Reads HTML files, converting tables into CSV files
Currently, if --ignoreempty is active, all tables are still created on disk and the empty ones are deleted afterwards. A better approach (which would also help with providing html2csv as a module) would be to determine which tables will be created before creating them, possibly by holding them in memory, though this could become an issue for very large pages.
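A minimal sketch of the "decide before writing" idea, assuming each table has already been parsed into a list of rows of strings. The helper names `is_empty_table` and `write_tables`, the `ignore_empty` parameter, and the `table_N.csv` naming scheme are hypothetical and not taken from the html2csv source.

```python
import csv
from pathlib import Path


def is_empty_table(rows):
    """Return True if no cell in the table holds non-whitespace text."""
    return not any(cell.strip() for row in rows for cell in row)


def write_tables(tables, out_dir, ignore_empty=False):
    """Write each table (a list of rows of strings) to its own CSV file.

    Empty tables are filtered out before any file is created, so nothing
    has to be deleted afterwards. Returns the paths actually written.
    """
    out_dir = Path(out_dir)
    written = []
    for index, rows in enumerate(tables):
        if ignore_empty and is_empty_table(rows):
            continue  # skip without touching the disk
        path = out_dir / f"table_{index}.csv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written
```

This keeps every parsed table in memory until it is written, which is the trade-off for very large pages mentioned above.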
From source:
...
#TODO: In HTML documents that use tables for layout (yuck!) add option to detect table titles in preceding sibling tables of size 1x1
...
As described in source:
...
#Removes nested tables. for handling the sins of 1990's web pages.
#TODO: Add an argument to enable/disable table de-nesting
[t.extract() for t in table.findAll("table")]
#This would grab all TRs regardless of depth without the above line removing nested tables
for row in table.findAll('tr'):
...
The table de-nesting hack is currently the default behavior. For better backwards compatibility and flexibility, it may be better to make it off by default and add a flag to enable it.
See title.
Currently, HTML must be provided as a local file. Add the ability to fetch the page with an HTTP request.
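One possible shape for this, using only the standard library: treat the input as a URL when it has an http or https scheme, and as a local path otherwise. The function name `read_html` and the `timeout` default are assumptions for illustration, not part of html2csv.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def read_html(source, timeout=10):
    """Return HTML text from an http(s) URL or a local file path."""
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    # Anything else is treated as a local file, assumed to be UTF-8.
    with open(source, encoding="utf-8") as handle:
        return handle.read()
```

With this approach the existing command-line argument could accept either form, so no separate flag would be needed.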
From source:
...
#TODO: Add a -v/--verbose option for the excessive print statements below. Consider both `-v` `-V` (AKA: VERY VERBOSE, which is currently the default)
...
for row in table.findAll('tr'):
    print(f"Processing row number {rowcount}")
    rowcount += 1
    cols = row.findAll(['td','th'])
    print(f"Found {len(cols)} columns.")
...
Output during processing is currently a little excessive. Some of it is useful for debugging but should be off by default. Adding verbosity flags gives the user some flexibility here.
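A sketch of how this could look with `argparse` and `logging`. It swaps the separate `-v`/`-V` pair from the TODO for a single repeatable `-v` (so `-vv` plays the role of "very verbose"); that choice, along with the function names `build_parser` and `configure_logging`, is an assumption rather than the project's design.

```python
import argparse
import logging


def build_parser():
    """Build a parser with a repeatable -v/--verbose flag."""
    parser = argparse.ArgumentParser(description="Convert HTML tables to CSV")
    parser.add_argument(
        "-v", "--verbose", action="count", default=0,
        help="increase output; repeat (-vv) for per-row detail",
    )
    return parser


def configure_logging(verbosity):
    """Map the -v count to a logging level and return that level."""
    level = {0: logging.WARNING, 1: logging.INFO}.get(verbosity, logging.DEBUG)
    logging.basicConfig(level=level, format="%(levelname)s: %(message)s")
    return level
```

The per-row `print` calls would then become `logging.debug(...)` calls, which stay silent unless `-vv` is given.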
From source:
...
#TODO: Add option to compress multi-line categorical headers (eg: headers with colspan>1) into concatenated-naming-style single row headers
# EG: "Population" header spanning above sub-headers "Number" and "Percentage" produces two single-column headers of "Population - Number"
# and "Population - Percentage". Delimiter can be configurable with a sensible default set.
...
IIRC the CSV spec only allows for one line of headers, so this is also a specification compliance issue.
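The compression described in the TODO can be sketched as follows, assuming each header row has already been reduced to `(text, colspan)` pairs. The names `expand_row` and `flatten_headers` and the input shape are hypothetical; `rowspan` is not handled here.

```python
def expand_row(cells):
    """Repeat each (text, colspan) cell so the row has one entry per column."""
    expanded = []
    for text, colspan in cells:
        expanded.extend([text] * colspan)
    return expanded


def flatten_headers(header_rows, delimiter=" - "):
    """Collapse several header rows into one row of joined column names."""
    expanded = [expand_row(row) for row in header_rows]
    if not expanded:
        return []
    width = max(len(row) for row in expanded)
    flat = []
    for col in range(width):
        parts = []
        for row in expanded:
            label = row[col] if col < len(row) else ""
            # Skip blanks and avoid repeating the same label twice in a row.
            if label and (not parts or parts[-1] != label):
                parts.append(label)
        flat.append(delimiter.join(parts))
    return flat


header = flatten_headers([
    [("Region", 1), ("Population", 2)],
    [("", 1), ("Number", 1), ("Percentage", 1)],
])
# header == ["Region", "Population - Number", "Population - Percentage"]
```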
From source:
...
print(f"Found {len(cols)} columns.")
#TODO: Detect and discard empty tables (those that contain 1 row, 1 column, consisting of an empty string...I think?)
#TODO: Add a warning for non-rectangular tables.
if cols:
    cols = [str(x.text).strip() for x in cols]
...
Currently several of the generated files contain no meaningful content (sometimes the whole file is ""). These tables are likely artifacts of layout tables and can be discarded in most cases. This may be worth acknowledging in verbose output and enabling/disabling via a flag.
From source:
...
cols = row.findAll(['td','th'])
print(f"Found {len(cols)} columns.")
#TODO: Detect and discard empty tables (those that contain 1 row, 1 column, consisting of an empty string...I think?)
#TODO: Add a warning for non-rectangular tables.
...
Tables with varying numbers of columns per row may be a sign of layout tables, errors, or multicolumn headers (not currently supported). Produce a warning when these are seen, and provide command-line flags to either disable the warning or make the script abort.
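A rough sketch of such a check, assuming the rows of a table have been collected into a list of lists. The function name `check_rectangular` and the `table_name` and `strict` parameters are illustrative only; `strict=True` stands in for the proposed abort flag.

```python
import warnings
from collections import Counter


def check_rectangular(rows, table_name="table", strict=False):
    """Return True if every row has the same number of columns.

    Otherwise emit a warning, or raise ValueError when strict is set.
    """
    widths = Counter(len(row) for row in rows)
    if len(widths) <= 1:
        return True
    summary = dict(sorted(widths.items()))  # {row width: number of rows}
    message = f"{table_name} is not rectangular; row widths: {summary}"
    if strict:
        raise ValueError(message)
    warnings.warn(message)
    return False
```

Disabling the warning could then be left to Python's standard warning filters, or to a flag that skips the call entirely.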