mdr's Introduction

MDR

MDR is a library to detect and extract listing data from HTML pages. It is based on Finding and Extracting Data Records from Web Pages, but replaces the similarity measure with the tree alignment proposed in Web Data Extraction Based on Partial Tree Alignment and Automatic Wrapper Adaptation by Tree Edit Distance Matching.

Requires

numpy and scipy must be installed to build this package.

Usage

Detect listing data

MDR assumes the data records are close to the elements that have the most text nodes:

[1]: import requests
[2]: from mdr import MDR
[3]: mdr = MDR()
[4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
[5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
...

[8]: [doc.getpath(c) for c in candidates[:10]]
 ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
 '/html/body/div[2]',
 '/html/body/div[2]/div[4]/div/div[1]']

Extract data record

MDR finds repeating patterns by applying tree matching under a given candidate DOM tree, then builds a mapping from each HTML element to the other matched elements of the DOM tree.

Used with annotation (optional)

You can annotate the seed elements with any tool you like (e.g. scrapely); mdr will then be able to find the other matched elements on the page.

For example, you can find this demo page here. The colored data in the first row is annotated manually; the rest is extracted by MDR.

Author

Terry Peng

License

MIT

mdr's People

Contributors

shaneaevans, tpeng

mdr's Issues

ImportError: cannot import name 'MDR'

docker run -ti python bash
apt-get update
apt-get install -y python-numpy cython python-scipy
git clone https://github.com/scrapinghub/mdr.git
cd mdr
pip install -r requirements.txt
python setup.py build
python setup.py install
python tests/test_mdr.py

Traceback (most recent call last):
  File "tests/test_mdr.py", line 3, in <module>
    from mdr import MDR, Record
  File "/usr/local/lib/python3.6/site-packages/mdr-0.0.1-py3.6-linux-x86_64.egg/mdr/__init__.py", line 1, in <module>
    from mdr import MDR, Record, RecordFinder, RecordAligner
ImportError: cannot import name 'MDR'
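
A likely cause, judging only from the traceback above: on Python 3, the line `from mdr import MDR` inside `mdr/__init__.py` is an absolute import, so it resolves to the half-initialised package itself rather than to the sibling module `mdr/mdr.py`. The sketch below reproduces that behaviour with two throwaway packages; the names `demo_abs` and `demo_rel` are hypothetical stand-ins, not part of mdr.

```python
import importlib
import os
import sys
import tempfile


def make_package(root, name, init_line):
    """Create package `name` holding a same-named submodule that defines MDR."""
    pkg_dir = os.path.join(root, name)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(init_line + "\n")
    with open(os.path.join(pkg_dir, name + ".py"), "w") as f:
        f.write("class MDR(object):\n    pass\n")


def import_outcome(name):
    """Import `name` and report whether it succeeded."""
    try:
        importlib.import_module(name)
        return "ok"
    except ImportError:
        return "ImportError"


root = tempfile.mkdtemp()
# Same shape as the failing line in the traceback: absolute import of the package name.
make_package(root, "demo_abs", "from demo_abs import MDR")
# Explicit relative import of the sibling module.
make_package(root, "demo_rel", "from .demo_rel import MDR")
sys.path.insert(0, root)
importlib.invalidate_caches()

results = {name: import_outcome(name) for name in ("demo_abs", "demo_rel")}
print(results)
```

If this diagnosis applies, changing the import in `mdr/__init__.py` to the explicit relative form (`from .mdr import ...`) would be the corresponding fix; that has not been verified against the repository.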

ValueError: The number of observations cannot be determined on an empty distance matrix.

this simple example test.py:

text = """
<html>
<body>
    <table>
        <tr><td>p1</td><th>v1</th></tr>
        <tr><td>p2</td><td>v2</td></tr>
        <tr><td>p3</td><td>v3</td></tr>
    </table>
</body>
</html>
"""

from mdr import MDR
mdr = MDR()
candidates, doc = mdr.list_candidates(text)
print([doc.getpath(c) for c in candidates])
print(mdr.extract(candidates[0]))

results in this exception:

$ python test.py
['/html/body/table/tr[1]/th', '/html/body/table']
Traceback (most recent call last):
  File "test.py", line 17, in <module>
    print(mdr.extract(candidates[0]))
  File "build/bdist.macosx-10.12-x86_64/egg/mdr/mdr.py", line 134, in extract
  File "build/bdist.macosx-10.12-x86_64/egg/mdr/mdr.py", line 167, in hcluster
  File "/Users/david/.virtualenvs/py2-data/lib/python2.7/site-packages/scipy/cluster/hierarchy.py", line 660, in linkage
    n = int(distance.num_obs_y(y))
  File "/Users/david/.virtualenvs/py2-data/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1718, in num_obs_y
    raise ValueError("The number of observations cannot be determined on "
ValueError: The number of observations cannot be determined on an empty distance matrix.

Any idea?
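
One possible reading of the traceback, stated as an assumption: `candidates[0]` is the `<th>` element, which has no child elements, so there is nothing to compare pairwise and the distance matrix passed to `linkage` is empty. The sketch below only counts child pairs with the standard library's ElementTree; it does not call mdr, and `pair_count` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEXT = """
<html>
<body>
    <table>
        <tr><td>p1</td><th>v1</th></tr>
        <tr><td>p2</td><td>v2</td></tr>
        <tr><td>p3</td><td>v3</td></tr>
    </table>
</body>
</html>
"""


def pair_count(n):
    """Number of pairwise distances among n items (condensed matrix length)."""
    return n * (n - 1) // 2


root = ET.fromstring(TEXT.strip())
th = root.find("./body/table/tr/th")   # first candidate in the report
table = root.find("./body/table")      # second candidate in the report

th_pairs = pair_count(len(th))         # len() of an element is its child count
table_pairs = pair_count(len(table))
print(th_pairs, table_pairs)
```

Under that assumption, choosing a candidate with at least two children (here the table, `candidates[1]`) before calling `mdr.extract` would avoid the empty matrix.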

Bug in the algorithm implementation: RecordAligner.align misses one of the children

from lxml.html import etree
from mdr import RecordAligner, Record

def toString(tree):
    return etree.tostring(tree, pretty_print=True)

t1 = etree.XML("""<root><a><a1/></a><b/><c/></root>""")
t2 = etree.XML("""<root><a/><b><b1/></b><c/></root>""")

seed, mappings = RecordAligner().align([Record(t1), Record(t2)])
print toString(seed[0])

seed, mappings = RecordAligner().align([Record(t2), Record(t1)])
print toString(seed[0])


# <root>
#  <a/>
#  <b>
#    <b1/>
#  </b>
#  <c/>
# </root>
#
# <root>
#  <a>
#    <a1/>
#  </a>
#  <b/>
#  <c/>
# </root>
#
# shouldn't it be:
# <root>
#  <a>
#    <a1/>
#  </a>
#  <b>
#    <b1/>
#  </b>
#  <c/>
# </root>

is this a problem in the algorithm or in the implementation?
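
To make the expected result concrete, here is a minimal sketch of an order-independent merge. It is not MDR's partial tree alignment: it simply matches siblings by tag (assuming each tag occurs at most once among siblings) and copies over whatever is missing. `merge_into` is a hypothetical helper written with the standard library's ElementTree.

```python
import copy
import xml.etree.ElementTree as ET


def merge_into(seed, other):
    """Recursively copy into `seed` every child of `other` whose tag is missing."""
    by_tag = {child.tag: child for child in seed}
    for child in other:
        match = by_tag.get(child.tag)
        if match is None:
            # Unmatched subtree: append a copy (position is not preserved).
            seed.append(copy.deepcopy(child))
        else:
            merge_into(match, child)
    return seed


t1 = ET.fromstring("<root><a><a1/></a><b/><c/></root>")
t2 = ET.fromstring("<root><a/><b><b1/></b><c/></root>")

merged_12 = ET.tostring(merge_into(copy.deepcopy(t1), t2), encoding="unicode")
merged_21 = ET.tostring(merge_into(copy.deepcopy(t2), t1), encoding="unicode")
print(merged_12)
print(merged_21)
```

Both orders give a tree containing `a1` and `b1`, which is the shape the report asks for. Whether `RecordAligner.align` is meant to behave this way is exactly the open question.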

documentation on README.rst on how to use the result of mdr.extract

this is the example you have:

import requests
from mdr import MDR
mdr = MDR()
r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
[doc.getpath(c) for c in candidates[:10]]

how to use mdr.extract? can you provide a representative example?

seed_records, mappings = mdr.extract(candidates[2])
from lxml import etree
print(etree.tostring(seed_records[0], pretty_print=True))
???

No module named _tree

git clone https://github.com/scrapinghub/mdr.git .
ls
nano requirements.txt
pip install -r requirements.txt
sudo apt-get install python-numpy
sudo apt-get install cython
sudo apt-get install python-scipy
python setup.py build
python setup.py install --user
nano test.py
python test.py
ls
./update_c.sh
python test.py
python setup.py build
python setup.py install --user
python test
python test.py
ls
nano setup.py
python setup.py install
python test.py
ls
nano CHANGES.txt
ls
python test.py
history

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    from mdr import MDR
  File "/root/mdr/mdr/__init__.py", line 1, in <module>
    from mdr import MDR, Record, RecordFinder, RecordAligner
  File "/root/mdr/mdr/mdr.py", line 13, in <module>
    from ._tree import tree_size
ImportError: No module named _tree
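
The traceback shows the package being imported from the source checkout (`/root/mdr/mdr/`) rather than from the installed copy. Assuming `_tree` is a compiled extension module that only exists after a build, a quick check is to look for a compiled file next to the sources. The sketch below is a Python 3 diagnostic; `has_compiled_extension` and the `_tree.pyx` file name are hypothetical.

```python
import os
import tempfile
from importlib.machinery import EXTENSION_SUFFIXES


def has_compiled_extension(package_dir, module_name):
    """Return True if package_dir holds a compiled build of module_name."""
    return any(
        os.path.exists(os.path.join(package_dir, module_name + suffix))
        for suffix in EXTENSION_SUFFIXES
    )


# Simulate a source checkout that has only the uncompiled source file.
package_dir = tempfile.mkdtemp()
open(os.path.join(package_dir, "_tree.pyx"), "w").close()
before_build = has_compiled_extension(package_dir, "_tree")

# Simulate the file an in-place extension build would leave behind.
open(os.path.join(package_dir, "_tree" + EXTENSION_SUFFIXES[0]), "w").close()
after_build = has_compiled_extension(package_dir, "_tree")
print(before_build, after_build)
```

If the check fails for the real checkout, running the test script from outside the source directory, or building the extension in place with `python setup.py build_ext --inplace`, are the usual things to try.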
