Comments (23)
Selenium + Chromium + Firefox are now in master. Closing as completed! :)
from auto-gpt.
Okay, thanks for the context. I'll see what I can do.
from auto-gpt.
@Torantulino I see that you're using BeautifulSoup for processing the content of the site. This won't handle data that has to be injected into a site via, say, JavaScript. I'm not sure exactly how, but some of the RESTful/GraphQL/etc. calls a page makes could be helpful for summarizing it.
We could also consider pulling the metadata from the page and using that to determine how to prompt the summarization. To be fair, I haven't looked at the prompting code yet and don't know if you're already doing this.
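To illustrate the metadata idea, here is a minimal sketch; the helper name and the chosen set of tags are my own, not anything already in the repo:

```python
from bs4 import BeautifulSoup

def extract_page_metadata(html):
    """Collect <title> and common <meta> tags; these hint at what the page
    is about and could be folded into the summarization prompt."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    if soup.title and soup.title.string:
        meta["title"] = soup.title.string.strip()
    for tag in soup.find_all("meta"):
        # Standard meta tags use name=..., Open Graph tags use property=...
        key = tag.get("name") or tag.get("property")
        if key in ("description", "keywords", "og:title", "og:description"):
            meta[key] = tag.get("content", "")
    return meta
```

The returned dict could then be prepended to the summarization prompt so the model knows what kind of page it is looking at.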
from auto-gpt.
I also modified the above code to handle JavaScript using Selenium.
pip install selenium
"""
This module is designed to scrape text and hyperlinks from a given URL, summarize the text,
and return a limited number of hyperlinks. It uses the requests and Selenium libraries for making
HTTP requests, BeautifulSoup for parsing HTML, and a custom 'browse' module for text summarization.
"""
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import browse
def browse_website(url, max_links=5, retries=3):
# Use a session object for connection pooling and better performance
with requests.Session() as session:
scraped_data = scrape_data(url, session, retries)
summary = get_text_summary(scraped_data['text'])
links = get_hyperlinks(scraped_data['links'], max_links)
result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
return result
def get_text_summary(text):
# Use the custom 'browse' module to summarize the text
summary = browse.summarize_text(text)
return f' "Result" : {summary}'
def get_hyperlinks(links, max_links):
# Limit the number of hyperlinks returned to the specified maximum
return links[:max_links]
def scrape_data(url, session, retries):
# Make requests to the specified URL and parse the HTML content
for i in range(retries):
try:
# Set up a headless Chrome browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
# Navigate to the URL and render the JavaScript content
driver.get(url)
html = driver.page_source
# Parse the rendered HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
text = ' '.join([p.get_text() for p in soup.find_all('p')])
links = [a.get('href') for a in soup.find_all('a')]
driver.quit()
return {'text': text, 'links': links}
except Exception as e:
# Retry the request if an exception occurs, up to the specified retry limit
if i == retries - 1:
raise e
from auto-gpt.
Here's a pull request for a fix. It uses Pyppeteer to navigate JavaScript sites, which gives lots of adaptability and flexibility.
from auto-gpt.
+1 for a Selenium browser. It supports JS injection, and it may work better than the Google console if you're running it from a residential IP. Just add jitter and a wait time before reading, and you can easily save page_source as HTML or parse it later with bs4.
OpenCV is overkill for this and resource-intensive; there's no point in using it, except for images, and as I said, even then there's little point.
I just found out about this project; I might look into it further over the weekend.
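A rough sketch of the jitter-then-read approach described above; the delay values and helper names are my own guesses, not tuned against any particular site:

```python
import random
import time

def jittered_delay(base=2.0, jitter=3.0):
    # Random extra wait so page loads don't happen at machine-regular intervals.
    return base + random.uniform(0, jitter)

def fetch_rendered_html(url):
    """Open the page in headless Chrome, wait a jittered amount of time for
    JS-injected content, then return page_source for bs4 to parse later."""
    # Imported lazily so jittered_delay() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(jittered_delay())  # let injected content settle before reading
        return driver.page_source
    finally:
        driver.quit()  # always release the browser
```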
from auto-gpt.
This is currently a major limitation on Auto-GPT's capabilities.
It is essentially unable to reliably browse the web, and it frequently misses information even when it visits the correct URLs.
from auto-gpt.
Current Implementation
Web browsing is currently handled in the following way:
def browse_website(url):
    summary = get_text_summary(url)
    links = get_hyperlinks(url)

    # Limit links to 5
    if len(links) > 5:
        links = links[:5]

    result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
    return result


def get_text_summary(url):
    text = browse.scrape_text(url)
    summary = browse.summarize_text(text)
    return """ "Result" : """ + summary


def get_hyperlinks(url):
    link_list = browse.scrape_links(url)
    return link_list
Where scrape_text uses BeautifulSoup to extract the text contained within a webpage.
The Problem
The summarize_text function feeds GPT-3.5 this scraped text, and whilst GPT-3.5 does a great job of summarising it, it doesn't know what we're specifically after.
This leads to instances where AutoGPT can be looking to find some news on CNN, and instead receives a summary of what CNN is.
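One way to attack this, sketched below, is to thread the agent's current objective into the summarization prompt so the model extracts what we're after instead of describing the site. The function name, prompt wording, and character limit are illustrative, not the project's actual prompt:

```python
def build_summary_prompt(page_text, question, max_chars=8000):
    """Build a question-focused summarization prompt from scraped page text.

    Truncates the text to stay within the model's context window (the
    8000-character default is a placeholder, not a tuned value)."""
    snippet = page_text[:max_chars]
    return (
        f'Using the following website text, answer this: "{question}"\n'
        "Quote specific facts (headlines, names, dates) rather than "
        "describing what the website is.\n\n"
        f"Text:\n{snippet}"
    )
```

With the CNN example above, the question would be something like "find a current technology news story", which steers the summary toward actual headlines.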
An Illustrated Example
- Here, Auto-GPT is trying to browse a news site to find a technology related news story.
- They select https://www.theverge.com/tech, a screenshot from which can be seen below:
- Even though there are clearly news stories contained within this page, the summary does not include them specifically, just mentions they are there.
Summary:
Website Content Summary: "Result" : The Tech subpage of The Verge provides news on hardware, apps, and technology from various companies including Google and Apple as well as smaller startups. It includes articles on various tech topics and links to the website's legal and privacy information.
from auto-gpt.
Might be easier to add 3rd party support for example: https://www.algolia.com/pricing/
If their API can do the heavy lifting, it will be easier, there are some oss plugins as well
from auto-gpt.
A few tips for scraping here: use Selenium with OpenCV, and make sure to force a scroll to the bottom of the page so everything loads. Actually, the current ChatGPT sometimes gives correct scraping results instead of code to scrape, so it's able to do it; it's just jailed for some reason.
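The force-scroll idea can be sketched as a loop that keeps scrolling until the document height stops growing, so lazy-loaded content is present before page_source is read. The parameter values are guesses; the function works with any object exposing Selenium's execute_script:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll repeatedly until the page height stops growing (or max_rounds
    is hit), so lazy-loaded content appears before we read page_source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy loaders a moment to fire
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; we're at the real bottom
        last_height = new_height
```

Call it between driver.get(url) and reading driver.page_source.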
from auto-gpt.
Selenium or Puppeteer (or similar) would be optimal for web UI logins, I guess, IMHO.
from auto-gpt.
Has anyone had any luck extracting key information from a website's text with gpt3.5?
GPT4 does it easily, but it's too expensive for this task.
What we need is for the key information to be extracted from a webpage, rather than its general purpose being described.
from auto-gpt.
Can you provide more context, please?
I've tried "extract key information from this webpage text", given it Hacker News text, and it did a decent job.
from auto-gpt.
I tried to get it to work, but I don't have access to the GPT-4 API. With gpt-3.5-turbo it works for some pages and not for others, even with strict prompting; GPT-4 should be able to do it with strict prompting.
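For what it's worth, "strict prompting" for gpt-3.5-turbo can be sketched as a chat-message builder like the one below; the system prompt wording is my own attempt at a tight constraint, not a tested recipe:

```python
def build_extraction_messages(page_text, objective):
    """Build chat messages with a strict system prompt, the kind of hard
    constraint that helps gpt-3.5-turbo stay on task during extraction."""
    system = (
        "You extract facts from raw webpage text. Return ONLY facts relevant "
        "to the stated objective, one per line. If nothing is relevant, "
        "return 'NONE'. Never describe the website itself."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Objective: {objective}\n\nText:\n{page_text}"},
    ]
```

The returned list can be passed straight to a chat-completion call with model="gpt-3.5-turbo".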
from auto-gpt.
Might be easier to add 3rd party support for example: https://www.algolia.com/pricing/
If their API can do the heavy lifting, it will be easier, there are some oss plugins as well
https://github.com/neuml/txtai
from auto-gpt.
What about bs4 (BeautifulSoup4)?
"""
This module is designed to scrape text and hyperlinks from a given URL, summarize the text,
and return a limited number of hyperlinks. It uses the requests library for making HTTP requests,
BeautifulSoup for parsing HTML, and a custom 'browse' module for text summarization.
"""
import requests
from bs4 import BeautifulSoup
import browse
def browse_website(url, max_links=5, retries=3):
# Use a session object for connection pooling and better performance
with requests.Session() as session:
scraped_data = scrape_data(url, session, retries)
summary = get_text_summary(scraped_data['text'])
links = get_hyperlinks(scraped_data['links'], max_links)
result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
return result
def get_text_summary(text):
# Use the custom 'browse' module to summarize the text
summary = browse.summarize_text(text)
return f' "Result" : {summary}'
def get_hyperlinks(links, max_links):
# Limit the number of hyperlinks returned to the specified maximum
return links[:max_links]
def scrape_data(url, session, retries):
# Make requests to the specified URL and parse the HTML content
for i in range(retries):
try:
response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'})
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
text = ' '.join([p.get_text() for p in soup.find_all('p')])
links = [a.get('href') for a in soup.find_all('a')]
return {'text': text, 'links': links}
except requests.exceptions.RequestException as e:
# Retry the request if an exception occurs, up to the specified retry limit
if i == retries - 1:
raise e
from auto-gpt.
This is the single biggest issue I'm facing with gpt-3.5-turbo and Auto-GPT:
it is not reliably able to go to a website and pull out and summarize key information.
My use case is going to a job posting and pulling out a summary to compare to my resume. This would be a game changer.
from auto-gpt.
Looking for feedback on #507. It's my first time making an "official" PR, @Torantulino, but the project is beyond compelling and I had to get this out to the community. It's magical what happens when it's able to access information from as recent as today (April 8th, 2023) in its analysis, reasoning, and logic.
edit: I made a 26-minute video showing what's possible with this PR. This isn't a minor incremental bug fix; it's a MAJOR unlock!
https://youtu.be/yM_yxVn4y2I
from auto-gpt.
Pix2Struct to capture structure, plus plain OCR, could give acceptable results if everything else fails.
from auto-gpt.
Where are we at for logging into websites during web-crawling?
from auto-gpt.
Mechanize would work well, although it's not functional with JS.
from auto-gpt.
How much flexibility do we have to configure the default code to direct the AI where we want it to go and grab what we want along the way?
from auto-gpt.