Comments (4)
from thepipe.
I'm running thepipe locally to extract some page URLs for processing with GPT4o, and it seems that the image generated for each page only captures the content above the fold (see example below). Is there a way to have it capture the entire page (perhaps an argument such as fullPage=True/False)?
My token limit for GPT4o as part of my plan is 10M, so I'm not overly concerned with hitting limits.
Example image: https://imgur.com/a/a06g3lh
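A minimal sketch of what such an option could wrap, assuming the page is rendered with Playwright's Python API (the thread does not confirm this at this point). Playwright's `Page.screenshot` accepts `full_page=True`; the helper name `capture_full_page` is hypothetical and is not part of thepipe.

```python
from typing import Any


def capture_full_page(page: Any, path: str) -> bytes:
    """Save one screenshot of the whole scrollable page and return its bytes.

    `page` is assumed to be a Playwright sync-API Page; this helper is
    hypothetical and not part of thepipe.
    """
    # full_page=True asks Playwright to capture beyond the visible viewport.
    return page.screenshot(path=path, full_page=True)
```

In a real script this would be called after `page.goto(url)` inside a `sync_playwright()` session.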
Hey @michael-supreme, this should be the default behaviour already. In extractor.py there is:
# Get the viewport size and document size to scroll
viewport_height = page.viewport_size['height']
total_height = page.evaluate("document.body.scrollHeight")
# Scroll to the bottom of the page and take screenshots
current_scroll_position = 0
scrolldowns, max_scrolldowns = 0, 10 # in case of infinite scroll
while current_scroll_position < total_height and scrolldowns < max_scrolldowns:
    # rest of code...
    current_scroll_position += viewport_height
    page.evaluate(f"window.scrollTo(0, {current_scroll_position})")
    scrolldowns += 1
If it is not scrolling automatically for you, you can post the link you're trying to extract and I can take a closer look.
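To make the loop's arithmetic easier to reason about, here is a small standalone sketch that reproduces only the scroll offsets it would visit, with no browser involved. The function name `scroll_positions` is hypothetical, and it assumes the elided "rest of code" takes one screenshot per iteration before scrolling.

```python
from typing import List


def scroll_positions(total_height: int, viewport_height: int,
                     max_scrolldowns: int = 10) -> List[int]:
    """Return the scroll offsets the loop would visit, in order."""
    positions = []
    current = 0
    scrolldowns = 0
    while current < total_height and scrolldowns < max_scrolldowns:
        positions.append(current)  # assumed: a screenshot is taken here
        current += viewport_height
        scrolldowns += 1
    return positions


print(scroll_positions(2000, 720))          # [0, 720, 1440]
print(len(scroll_positions(100_000, 720)))  # 10
```

Two things follow from the arithmetic: if the measured `total_height` is at most two viewports tall, the loop stops after two screenshots, and no page yields more than `max_scrolldowns` screenshots regardless of its length.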
@emcf Seems it works on some pages but not others. For example, on this contact us page, I get the full page captured in multiple screenshots for every 720px of page height.
But on this homepage, it stops after the second chunk (wondering if it fails due to scripts or animations on the page?).
Also, the homepage in the original post has the same issue, where it stops after the second screenshot.
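One possible cause, not confirmed in this thread, is that the page height is measured once before scripts or lazy-loaded content have expanded the page, or that `document.body.scrollHeight` under-reports the document height on that layout. The sketch below re-measures the height after every scroll. It assumes `page` is a Playwright sync-API `Page` with a viewport set; `capture_chunks`, `HEIGHT_JS`, and `settle_ms` are hypothetical names, not part of thepipe.

```python
from typing import Any, List

# Take the larger of two common height measurements; on some layouts
# document.body.scrollHeight can be smaller than the full document height.
HEIGHT_JS = (
    "Math.max(document.body.scrollHeight, "
    "document.documentElement.scrollHeight)"
)


def capture_chunks(page: Any, max_scrolldowns: int = 10,
                   settle_ms: int = 500) -> List[bytes]:
    """Screenshot a page one viewport at a time, re-measuring its height."""
    viewport_height = page.viewport_size["height"]
    screenshots = []
    position = 0
    scrolldowns = 0
    while scrolldowns < max_scrolldowns:
        screenshots.append(page.screenshot())
        position += viewport_height
        page.evaluate(f"window.scrollTo(0, {position})")
        # Give lazy content and animations a moment to settle.
        page.wait_for_timeout(settle_ms)
        scrolldowns += 1
        # Re-measure after scrolling, since scripts may have grown the page.
        if position >= page.evaluate(HEIGHT_JS):
            break
    return screenshots
```

The `max_scrolldowns` cap is kept so that infinite-scroll pages still terminate.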
@michael-supreme Thanks for providing these links to replicate. I'm still investigating this issue.
@emcf Just wanted to let you know that the issue also happens when setting the extraction to text_only=True. It appears to extract the text content for only the first 720px of the page.