alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

License: MIT License

Python 100.00%
scraping scraper scrape webscraping crawler web-scraping ai artificial-intelligence python webautomation

autoscraper's Introduction

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python


This project is made for automatic web scraping to make scraping easy. It gets a URL or the HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, a URL, or any HTML tag value of that page. It learns the scraping rules and returns similar elements. You can then use this learned object with new URLs to get similar content or the exact same elements from those new pages.

Installation

It's compatible with Python 3.

  • Install latest version from git repository using pip:
$ pip install git+https://github.com/alirezamika/autoscraper.git
  • Install from PyPI:
$ pip install autoscraper
  • Install from source:
$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here's the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get the related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

Getting exact result

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

Note that you should update the wanted_list if you want to copy this code, as the content of the page changes dynamically.

You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
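
Similarly, custom headers can go through request_args (a minimal sketch; the header value is only illustrative):

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)",
}

result = scraper.build(url, wanted_list, request_args=dict(headers=headers))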

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

You may want to get other info as well. For example, if you want to get the market cap too, you can just append it to the wanted list. The get_result_exact method retrieves the data in the same order as the wanted list.
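
For example (a hedged sketch; the market-cap value below is a placeholder and must match whatever the page displays when you build):

wanted_list = ["124.81", "2.10T"]  # current price and market cap as shown on the page

scraper = AutoScraper()
scraper.build(url, wanted_list)

# returns [price, market cap] for MSFT, in the same order as wanted_list
scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')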

Another example: say we want to scrape the about text, number of stars, and the link to issues of GitHub repo pages:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.5k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')
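
For example, in a later session you can load the saved model and reuse it directly:

from autoscraper import AutoScraper

scraper = AutoScraper()
scraper.load('yahoo-finance')
print(scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/'))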

Tutorials

Issues

Feel free to open an issue if you have any problem using the module.

Support the project

Buy Me A Coffee

Happy Coding ♥️

autoscraper's People

Contributors

alirezamika, cthulhu-irl, gsakkis, jasonleonhard, maxbachmann, narasimha1997, picknickchock


autoscraper's Issues

ERROR: Package 'autoscraper' requires a different Python: 2.7.16 not in '>=3.6'

All 3 listed installation methods return the error shown in the issue title and cause an installation failure. There is no change when using the pip or pip3 command.
I tried running the following 2 commands to get around the pre-commit issue, but with no change in the result:
$ pip uninstall pre-commit # uninstall from Python2.7
$ pip3 install pre-commit # install with Python3

AutoScraper for scraping dynamic websites!

Hello @alirezamika!
I wanted to ask what to do if the text in the wanted list is no longer available on the website.

from autoscraper import AutoScraper

scrapper = AutoScraper()

scrape_data1 = [
    ("https://theprint.in", ["Japan envoy holds talks with senior Taliban members in Kabul"]),
    ("https://theprint.in", ["India records 9,119 new Covid cases, active infections lowest in 539 days"]),
    ("https://theprint.in", ["Farm laws debate missed a lot. Neither supporters nor Modi govt identified the real problem"]),
    ("https://theprint.in", ["Punjab’s Dalits are shifting state politics, flocking churches, singing Chamar pride"]),
]

for get_url, data in scrape_data1:
    scrapper.build(url=get_url, wanted_list=data, update=True)
    Main_news = scrapper.get_result_similar(url="https://theprint.in", grouped=True, group_by_alias=True, unique=True)

print(Main_news)

The code above scrapes a news website, but if I run it a few hours later, after the news has been updated, the scraper returns
{}
or something else depending on the text found.
I want to know how to adapt the code for dynamic websites where the text gets updated.

Website Structure

Hello! Thank you so much for sharing your work!

I wanted to ask: if I trained my model on some website, and this website then changes its structure and styling, will it still work? Can I get the same data, or will I need to re-train it again?

Import Error

Hi, when I launch python3 autoscraper.py,
I get the following error:
ImportError: cannot import name 'AutoScraper' from partially initialized module 'autoscraper' (most likely due to a circular import)

Have you any idea of the fix?
Thanks for your reply.

Regards

Progression of errors while installing

1

santiago@santiago-Aspire-A515-51:~$ pip install git+https://github.com/alirezamika/autoscraper.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/alirezamika/autoscraper.git
Cloning https://github.com/alirezamika/autoscraper.git to /tmp/pip-req-build-zjd5pn9g
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-zjd5pn9g/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-zjd5pn9g/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-du5i409j
cwd: /tmp/pip-req-build-zjd5pn9g/
Complete output (7 lines):
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-req-build-zjd5pn9g/setup.py", line 7, in
with open(path.join(here, 'README.rst'), encoding='utf-8') as f:
File "/usr/lib/python3.6/codecs.py", line 897, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-zjd5pn9g/README.rst'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

2 - Cloned the repository manually with git, installed setuptools manually, then README.rst not found

santiago@santiago-Aspire-A515-51:~/Devel/autoscraper$ python3.8 -m pip install setuptools
Collecting setuptools
Cache entry deserialization failed, entry ignored
Downloading https://files.pythonhosted.org/packages/b0/8b/379494d7dbd3854aa7b85b216cb0af54edcb7fce7d086ba3e35522a713cf/setuptools-50.0.0-py3-none-any.whl (783kB)
100% |████████████████████████████████| 788kB 615kB/s
Installing collected packages: setuptools
Successfully installed setuptools-50.0.0
santiago@santiago-Aspire-A515-51:~/Devel/autoscraper$ python setup.py install
Traceback (most recent call last):
File "setup.py", line 7, in
with open(path.join(here, 'README.rst'), encoding='utf-8') as f:
File "/usr/lib/python3.8/codecs.py", line 905, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/home/santiago/Devel/autoscraper/README.rst'

3 - Renamed Readme.md to README.rst

santiago@santiago-Aspire-A515-51:~/Devel/autoscraper$ python setup.py install
running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

[Errno 13] Permission denied: '/usr/lib/python3.8/site-packages'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

/usr/lib/python3.8/site-packages/

This directory does not currently exist. Please create it and try again, or
choose a different installation directory (using the -d or --install-dir
option).

wanted_list presupposes knowledge of page

Perhaps we want to let the scraper find the data we know will be in the page, we just don't know the value of it. Using your first example for the Stack Overflow related questions, what if we could instead modify the list as such:

wanted_list = ["Related"]

and this would still return the same output:
'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 'How to call an external command?',..., 'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'

Similarly, for the stock price example we need to know the stock's value ahead of time. What if we want to grab those values using the scraper?

wanted_list = ["Previous Close", "Day's Range"]

This also allows get_result_exact to feed us the same values given another stock's yahoo url.

I'm not familiar enough with BeautifulSoup to know how simple it would be to parse out these elements' values, but it could be worth thinking about.

Cheers!

About removing duplicate result

I'm sorry to add this issue; I don't know whether this really is an issue.

In my code I don't want to remove duplicate results, and I tried commenting out some code, but it doesn't seem to work, so I opened this issue.

Sorry for this issue again. Please tell me if this is not an issue and I will delete it.

Add docstrings

Add docstrings to the functions to make them more self-explanatory and easier to understand when working with code editors.

ssl.SSLCertVerificationError:

I followed all the instructions and ran the sample program using AutoScraper, as shown below.
from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

But I get the following error:
============ RESTART: D:/PythonCode-1/Web Scraping/AutoSraper 001.py ===========
Traceback (most recent call last):
File "C:\Python39\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "C:\Python39\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "C:\Python39\lib\site-packages\urllib3\connectionpool.py", line 1010, in validate_conn
conn.connect()
File "C:\Python39\lib\site-packages\urllib3\connection.py", line 416, in connect
self.sock = ssl_wrap_socket(
File "C:\Python39\lib\site-packages\urllib3\util\ssl
.py", line 449, in ssl_wrap_socket
ssl_sock = ssl_wrap_socket_impl(
File "C:\Python39\lib\site-packages\urllib3\util\ssl
.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\Python39\lib\ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "C:\Python39\lib\ssl.py", line 1040, in _create
self.do_handshake()
File "C:\Python39\lib\ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python39\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "C:\Python39\lib\site-packages\urllib3\util\retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/2081586/web-scraping-with-python (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:/PythonCode-1/Web Scraping/AutoSraper 001.py", line 11, in
result = scraper.build(url, wanted_list )
File "C:\Python39\lib\site-packages\autoscraper\auto_scraper.py", line 227, in build
soup = self._get_soup(url=url, html=html, request_args=request_args)
File "C:\Python39\lib\site-packages\autoscraper\auto_scraper.py", line 119, in _get_soup
html = cls._fetch_html(url, request_args)
File "C:\Python39\lib\site-packages\autoscraper\auto_scraper.py", line 105, in _fetch_html
res = requests.get(url, headers=headers, **request_args)
File "C:\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python39\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python39\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Python39\lib\site-packages\requests\adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/2081586/web-scraping-with-python (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)')))

Training text with extra spaces before and after while predicted text does not

I am dealing with Q&A pages where some paragraphs contain extra spaces before and after the span (visible when inspecting the source), while other spans do not. E.g.:
(With extra space) https://www.sfc.hk/en/faqs/intermediaries/licensing/Associated-entities#0FCC1339F7B94DF69DD1DF73DB5F7DCA
(No extra space) https://www.sfc.hk/en/faqs/intermediaries/licensing/Family-Offices#F919B6DCE05349D8A9E8CEE8CA9C7750

As a result, it seems that a model trained on the former does not predict the latter as similar. In fact, even during the "build" process, questions with extra spaces do not treat those without spaces as similar.

Another question is about the expanded part of the text (the "A: " answer text). It doesn't expand unless a "+" sign is clicked. In that case, is there any way to get the exact result including the answer part?

Thanks for the great work.

Pagination

Hi, how can I handle pagination, for example if I want to fetch comments and reviews?
And is there a way to detect/handle consecutive pages other than by listing them, like how general scrapers have a click function to move to different pages or actions?
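
AutoScraper itself has no click or navigation support, so a minimal approach is to iterate over the page URLs yourself and reuse one built scraper (a hedged sketch; the URL pattern and model name are hypothetical):

from autoscraper import AutoScraper

scraper = AutoScraper()
scraper.load('reviews-model')  # hypothetical model built earlier on one page of reviews

all_reviews = []
for page in range(1, 6):
    page_url = f'https://example.com/product/reviews?page={page}'  # hypothetical pagination scheme
    all_reviews.extend(scraper.get_result_similar(page_url))

print(all_reviews)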

wanted_dict hanging?

I tried to use wanted_dict in scraper.build() and it is hanging.

scraper.build(url, wanted_dict=wanted_dict)

If I use wanted_list, it is fine.
It is hanging here:

Traceback (most recent call last):
  File "test1.py", line 28, in <module>
    result = scraper.build(url, wanted_dict=wanted_dict, request_args=request_args)
  File "site-packages\autoscraper\auto_scraper.py", line 222, in build
    result, stack = self._get_result_for_child(child, soup, url)
  File "site-packages\autoscraper\auto_scraper.py", line 268, in _get_result_for_child
    result = self._get_result_with_stack(stack, soup, url, 1.0)
  File "site-packages\autoscraper\auto_scraper.py", line 326, in _get_result_with_stack
    getattr(i, 'child_index', 0)) for i in parents]
  File "site-packages\autoscraper\auto_scraper.py", line 326, in <listcomp>
    getattr(i, 'child_index', 0)) for i in parents]
  File "site-packages\bs4\element.py", line 1441, in __getattr__
    if len(tag) > 3 and tag.endswith('Tag'):
KeyboardInterrupt

How to scrape a dynamic website?

I am trying to export a localhost website that is generated with this project:

https://github.com/HBehrens/puncover

The project generates a localhost website, and each time the user clicks a link, the project receives a GET request and generates the HTML. This means the HTML is generated every time the user accesses a link through their browser. At the moment the project does not export the website to HTML or PDF. For this reason I want to know how I could recursively get all the hyperlinks and then generate the HTML version. Would this be possible with autoscraper?
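
The README notes that URLs can be used as wanted items, so one hedged sketch is to teach a scraper the link pattern on one page and then crawl with it, fetching each page yourself (the puncover URLs below are hypothetical, and it is an assumption that the learned rule returns absolute URLs):

import requests
from autoscraper import AutoScraper

base = 'http://localhost:5000'  # hypothetical puncover address

scraper = AutoScraper()
# train the link rule on a couple of URLs visible on the start page (hypothetical examples)
scraper.build(base + '/', wanted_list=[base + '/path/to/symbol_a', base + '/path/to/symbol_b'])

seen, queue = set(), [base + '/']
while queue:
    url = queue.pop()
    if url in seen or not url.startswith(base):
        continue
    seen.add(url)
    html = requests.get(url).text  # save this HTML however you like
    queue.extend(scraper.get_result_similar(url=url))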

Won't find special characters

When trying to find anything that contains a . in it I get no results.

from autoscraper import AutoScraper

url = 'https://pastebin.com/APSMFRLL'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["."]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

I would've expected to get:

one.two
three.four
five.six
seven.eight

Maybe I'm just not doing something correctly.

HTML Parameter

I read a previous post that mentioned support for the HTML parameter, with which I could render a JS application using another tool (BS or Selenium) and pass the HTML data to AutoScraper to parse. Does anyone have steps or documentation on how to use this parameter?
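
A minimal sketch of the pattern, assuming you already have rendered HTML from Selenium or another tool; build accepts html= as documented above, and passing html= to the get_result_* methods is assumed to work the same way:

from autoscraper import AutoScraper

# html_content = driver.page_source  # e.g. from Selenium, after the JS has rendered
html_content = '<html>...rendered page...</html>'  # placeholder

scraper = AutoScraper()
scraper.build(html=html_content, wanted_list=['Some value visible in the rendered page'])

# later, for another rendered page:
# results = scraper.get_result_similar(html=another_rendered_html)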

Google Trend Scraping

url="https://trends.google.com/trends/trendingsearches/realtime?geo=IN&category=all"
wanted_list= ['Tamil Nadu • Lockdown • Coronavirus','https://www.sentinelassam.com/national-news/tn-police-makes-ockdown-violators-sit-with-fake-covid-positive-patients-on-their-two-wheelers-539462']
scraper=AutoScraper()
result=scraper.build(url,wanted_list)
print(result)

Output >> [ ]

Scraping skips item

Hello,

I am really delighted with this fantastic module. However, I want to report a little annoying bug:

  1. I am trying to scrape wsj.com.
  2. In the wanted list I put the left column's 1st article title and the left column's 1st article summary.
  3. The results list does not include the head (top center) article summary, but it does include its title.
  4. The number of list elements is 33 where it should be 34 (17 titles + 17 summaries).

I suspect that since the head summary is center-aligned while the other article summaries are left-aligned, this might cause the skipping.

Thank you indeed, all the best

Pierre-Emmanuel FEGA

Add support for incremental learning

As of now, the rules are formed at once based on the targets specified in wanted_list, and the stack list is generated for those targets. Sometimes there are scenarios where I have to update the existing stack list with new rules learnt from a different set of targets on the same URL. As seen in the build method, a new stack list is created every time build is called. Provide an update method that updates the stack list by simply appending the new rules learnt from the new set of targets.
This would be a very useful piece of functionality because it would allow developers to incrementally add new targets while retaining the older rules.
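
For reference, the update=True flag used in other issues on this page appears to do exactly this, keeping the previously learnt rules and appending new ones (a hedged sketch; the URL and targets are placeholders):

from autoscraper import AutoScraper

url = 'https://example.com/page'

scraper = AutoScraper()
scraper.build(url, wanted_list=['First target text'])

# assumption: update=True appends newly learnt rules instead of replacing the stack list
scraper.build(url, wanted_list=['Second target text'], update=True)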

Add support for specifying text encoding.

I'm working with a legacy Chinese site that uses Big5 text encoding, and I'm not able to set the text encoding by passing arguments through request_args, because requests doesn't support it.

So the results I get was garbled, like this: '¡ ̧ÔÚÕâ ̧öÊÀ1⁄2ç ̧æÖÕÒÔÇ°©¤©¤A¡1-promise/result-'.

Encoding can only be set by writing to the encoding property of the requests response object (according to this).

So maybe adding an encoding param and setting the encoding in _get_soup in auto_scraper.py would be a good idea.
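
Until then, a possible workaround is to fetch the page yourself, force the encoding on the requests response, and pass the decoded HTML to build via the html parameter (a hedged sketch; the URL and wanted text are placeholders):

import requests
from autoscraper import AutoScraper

response = requests.get('https://example.com.tw/legacy-page')  # hypothetical Big5-encoded page
response.encoding = 'big5'  # force the correct decoding before reading .text
html_content = response.text

scraper = AutoScraper()
result = scraper.build(html=html_content, wanted_list=['Sample wanted text from the page'])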

Asynchronous methods for fetching URLs, parsing HTML, and exporting data

Introduction

I was looking over the code for this project and am impressed with its simplicity of design and brilliant approach to this problem. However, one thing that jumped out at me was the lack of asynchronous methods to allow for a huge speed increase, especially as the number of pages to scrape increases. I am quite familiar with the standard libraries used to meet this goal and propose the following changes:

Let me know your thoughts and if you're interested in the idea. The performance gains would be immense! Thanks!


Technical changes and additions proposal

  • 1. Subclass AutoScraper with AsyncAutoScraper, which would require the packages aiohttp, aiofiles, and aiosql, along with a few purely optional others to increase speed - uvloop, brotlipy, cchardet, and aiodns

  • 2. Refactor the _get_soup method by extracting an async method to download HTML asynchronously using aiohttp

  • 3. Refactor the get_results* and _build* functions to also be async (simply adding the keyword) and then making sure to call them by using a multiprocessing/threading pool

    • a. The get_* functions should handle the calling of these in an executor set to aforementioned pool
    • b. Pools are created using concurrent.futures.*
    • c. Inner-method logic should remain untouched since parsing is a CPU-bound task
  • 4. Use aiofiles for the save method to be able to export many individual JSON files quickly if desired, same for the load method if multiple sources are being used

  • 5. Add functionality for exporting to an SQL database asynchronously using aiosql


References

@alirezamika
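
In the meantime, a hedged interim sketch: the existing synchronous API can already be driven concurrently from a concurrent.futures thread pool, since fetching is network-bound (reusing one scraper across threads is assumed to be safe because the get_result_* calls only read the learnt rules):

from concurrent.futures import ThreadPoolExecutor

from autoscraper import AutoScraper

scraper = AutoScraper()
scraper.load('yahoo-finance')  # model saved earlier, as in the README example

urls = [
    'https://finance.yahoo.com/quote/MSFT/',
    'https://finance.yahoo.com/quote/GOOG/',
    'https://finance.yahoo.com/quote/AMZN/',
]

# run the blocking calls in worker threads
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scraper.get_result_exact, urls))

print(results)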

Defining large block of text as wanted list

When our target value is a large block of text, it becomes messy.
Instead, could a feature be added so that we can define the text in a shorter form?

For example:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum

can be defined as:
Lorem ipsum(...)est laborum

Nonbreaking spaces lead to surprising behavior

I tried using autoscraper to scrape items from the Hacker News home page. The scraper had issues with the nonbreaking space in the comments link on each list item. I was eventually able to work around the issue by using '\xa0' in the wanted_list string. That matched the comments field but then returned incorrect results anyway. My guess is that something is not matching the nonbreaking space in the "stack" analysis (but I didn't invest the time to find the root cause).

This project is an interesting idea, but I recommend unit tests and some documentation about the matching algorithm to help users help you with diagnosing bugs.

Scraper can't find requested data even though site is well-structured and consistent

This site is consistent and well-structured, with easily located selectors, but autoscraper struggles to scrape the data. I trained it with a few examples, which found the data successfully, but subsequent attempts to scrape other pages yield missing data even though the markup is the same for these pages as for the training pages.

from autoscraper import AutoScraper
scraper = AutoScraper()
scraper.build("https://www.weedsta.com/strains/blue-dream", ["Blue Dream", "Hybrid", "24.49%", "0.19%", "10 Reviews"], update=True)
scraper.build("https://www.weedsta.com/strains/trainwreck", ["Trainwreck", "Sativa", "18.63%", "0.53%", "3 Reviews"], update=True)
scraper.build("https://www.weedsta.com/strains/sour-diesel", ["Sour Diesel", "Sativa", "22.2%", "0.31%", "8 Reviews"], update=True)

Here's an example where you can see that the percentages are not returned.

>>> scraper.get_result_similar('https://www.weedsta.com/strains/banana-kush', grouped=True)
{'rule_dt56': ['Banana Kush'], 'rule_l8fu': ['Banana Kush'], 'rule_7m0b': [], 'rule_fq5s': ['1 Reviews'], 'rule_4lqv': [], 'rule_bq2d': [], 'rule_mgmx': [], 'rule_pshq': [], 'rule_cnvq': ['Banana Kush'], 'rule_bmx8': ['Banana Kush'], 'rule_3npf': [], 'rule_7ko7': [], 'rule_tfnf': [], 'rule_ia0h': []}
>>> 

Getting the candidate value back when scraping

This is my code

from autoscraper import AutoScraper

url = 'https://www.thedailystar.net/news/bangladesh/diplomacy/news/rohingya-repatriation-countries-should-impose-sanctions-pressurise-myanmar-2922581'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
scraper = AutoScraper()
wanted_list = ["Many of our development partners are selling arms to Myanmar: Foreign Minister"]
scraper1 = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

This is the result (shown as a screenshot in the original issue): I am getting the value of the candidate itself, i.e. wanted_list = ["Many of our development partners are selling arms to Myanmar: Foreign Minister"], as the result.

I am new to autoscraper (I actually just started trying it out today). Is this the usual result I should expect, or should I get the content of the whole webpage?

Need help to get price from the web site.

Hi, I can't get the price from this URL, even though I tried adding a request header to it.

Here is the code:

import requests
from autoscraper import AutoScraper

url = "https://www.lazada.com.my/products/pensonic-pb-7511-multifunctional-hand-blender-white-i15717885-s19365971.html"
wanted_list = ["105.24"]  # the price is 105.24
s = requests.session()
scraper = AutoScraper()
s.get(url, headers=scraper.request_headers)
result = scraper.build(url, wanted_list, request_args={'cookies': s.cookies.get_dict()})

and the output is always None.
How can I get the price?

scraper.build returns a blank list

Here is the code to reproduce:

    from autoscraper import AutoScraper
    
    class Scraper():
        
        wanted_list = ["0.79"]
        origUrl = 'https://www.sec.gov/Archives/edgar/data/0001744489/000174448921000105/fy2021_q2xprxex991.htm'
        newUrl = 'https://www.sec.gov/Archives/edgar/data/0001744489/000174448921000179/fy2021_q3xprxex991.htm'
        path="Alpaca/Scraper/sec/file.txt"
        
        def scrape(self):
            scraper = AutoScraper()
            result = scraper.build(self.origUrl, self.wanted_list)
            print(result)
            result = scraper.get_result_exact(self.newUrl)
            print(result)
    
    if __name__ == '__main__':
        scraper = Scraper()
        scraper.scrape()

Here is the log:

    []
    []

Expected to be:

    [0.79]
    [0.80]

Github issue numbers

Thanks for creating such a cool project! It looks like it's exactly what I need, but I'm having trouble getting it to work for GitHub issue numbers.

Example code with this project's own issues page:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper/issues?q=is%3Aissue'

wanted_list = ["#47"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

The result here is blank, and checking the stack_list, there is nothing there. The formatting of the element is a bit strange, with lots of whitespace and newlines, so I tried copying the whole element in directly with triple quotes, but that has the same result.

wanted_list = """
          #47
            by """

Which when evaluated by python becomes

['\n          #47\n            by ']

Originally I also tried just using the number, as that would be the most convenient, but no beans. I was able to get it to work easily with the actual text of the issue, so I fear it's something weird with the way it's formatted.

Is this an issue with whitespace, or am I messing up something basic? Thanks!

authentication

How does this tool work with authentication? I'm interested in using it with content behind a Shibboleth SAML login, so I need to store session variables in the process.
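
One approach that fits the existing API is to log in with a requests.Session and hand the resulting cookies to AutoScraper through request_args (a hedged sketch; the login endpoint and credentials are hypothetical, and a Shibboleth/SAML flow usually needs extra redirect/POST steps before the session cookies are valid):

import requests
from autoscraper import AutoScraper

session = requests.Session()
session.post('https://example.org/login', data={'username': 'me', 'password': 'secret'})  # hypothetical login

scraper = AutoScraper()
result = scraper.build(
    'https://example.org/protected-page',
    wanted_list=['Some value visible only when logged in'],
    request_args={'cookies': session.cookies.get_dict()},
)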

build method with wanted_dict does not work.

Tested with autoscraper 1.1.6.

When calling the build method with wanted_dict, the method behaves badly: it treats the searched string as an array of individual letters. The culprit is around line 204 of auto_scraper.py, as the data structure does not behave the same as when you use the wanted_list option.

Besides this, I find the work done so far super interesting and promising. Keep up the good work.

R.
