
jimmc414 / 1filellm


Specify a GitHub or local repo, GitHub pull request, arXiv or Sci-Hub paper, YouTube transcript, or documentation URL on the web, and scrape it into a text file and the clipboard for easier LLM ingestion.

License: MIT License

Python 100.00%
arxiv github llm tiktoken youtube-transcript-api doi ipynb pdf pmid sci-hub

1filellm's People

Contributors

aliciusschroeder, jimmc414, sammcj


1filellm's Issues

Timeout problem when processing a repo

Traceback (most recent call last):
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 207, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7e56c099f250>, 'Connection to api.github.com timed out. (connect timeout=None)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/z-huang/InnerTune/contents/fastlane/metadata/android/es?ref=dev (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7e56c099f250>, 'Connection to api.github.com timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/surprise/new/new.py", line 348, in <module>
main()
File "/home/user/surprise/new/new.py", line 316, in main
process_github_repo(input_path, output_file)
File "/home/user/surprise/new/new.py", line 115, in process_github_repo
process_directory(contents_url, output)
File "/home/user/surprise/new/new.py", line 79, in process_directory
process_directory(file["url"], output)
File "/home/user/surprise/new/new.py", line 79, in process_directory
process_directory(file["url"], output)
File "/home/user/surprise/new/new.py", line 79, in process_directory
process_directory(file["url"], output)
[Previous line repeated 1 more time]
File "/home/user/surprise/new/new.py", line 55, in process_directory
response = requests.get(url, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/surprise/.venv/lib/python3.11/site-packages/requests/adapters.py", line 507, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/z-huang/InnerTune/contents/fastlane/metadata/android/es?ref=dev (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7e56c099f250>, 'Connection to api.github.com timed out. (connect timeout=None)'))
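
The traceback shows requests.get(url, headers=headers) being called with no timeout (connect timeout=None), and a single failed connection to api.github.com aborting the whole recursive walk. One possible mitigation, sketched below, is to retry the failing call a few times with exponential backoff. The helper name call_with_retries and its parameters are hypothetical and not part of 1filellm.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper, not part of 1filellm. The built-in TimeoutError
    and the exceptions raised by requests are all subclasses of OSError,
    so the default retry_on covers the ConnectTimeout seen above.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside process_directory, the request could then be wrapped as call_with_retries(lambda: requests.get(url, headers=headers, timeout=(5, 30))), where the timeout tuple gives the connect and read timeouts in seconds. The values 5 and 30 are illustrative only.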

TODOs

  • If the pull request or issue includes references to other issues, pull requests, or external resources, extract and include those references or links in the output to provide more context.

  • In the pull request output, separate the diff and the review comments into different sections for better readability. For example, one section could hold the diff and another the review comments, with references to the relevant file paths and line numbers.
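
A minimal sketch of the first TODO, assuming the pull request or issue body is already available as a plain string. The function name extract_references and the two regular expressions are illustrative assumptions, not existing 1filellm code; they cover only "#123", "owner/repo#123" and plain http(s) links.

```python
import re

# Matches "#123" and "owner/repo#123" style references.
ISSUE_REF = re.compile(r"(?P<repo>[\w.-]+/[\w.-]+)?#(?P<num>\d+)\b")
# Matches http(s) links up to whitespace or a closing bracket.
URL = re.compile(r"https?://[^\s<>()\]]+")


def extract_references(text):
    """Return issue/PR references and external URLs found in text."""
    urls = []
    for match in URL.finditer(text):
        url = match.group(0).rstrip(".,;:")  # drop trailing punctuation
        if url not in urls:
            urls.append(url)

    # Blank out URLs first so a fragment such as "page#3" is not
    # mistaken for an issue reference.
    text_without_urls = URL.sub(" ", text)
    issues = []
    for match in ISSUE_REF.finditer(text_without_urls):
        ref = f"{match.group('repo') or ''}#{match.group('num')}"
        if ref not in issues:
            issues.append(ref)

    return {"issues": issues, "urls": urls}
```

The returned references could be appended to the output as their own section, which also fits the second TODO's idea of keeping the diff, the review comments and any extra context in separate sections.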

Test issue for test_onefilellm.py

This is a test issue for use in test_onefilellm.py.

import os
import gensim.downloader as api
import numpy as np
from scipy import spatial
from nltk import word_tokenize
from nltk.corpus import stopwords
import configparser
import pickle
import shutil

# Read settings from ini file
config = configparser.ConfigParser()
config.read('settings.ini')
folder_path = config.get('paths', 'txt_documents')

# Prompt user for filename
input_file = input("Enter the name of the text file: ")
input_file_path = os.path.join(folder_path, input_file)

# Set similarity threshold
similarity_threshold = float(input("Enter similarity threshold (e.g. 0.5): "))

"Science progresses one funeral at a time" - Max Planck

exact file in processing repo

Thanks for the code.
Currently we can modify the script to add or remove file extensions.
But if .md is selected, I need only README.md, not every file ending with .md.
How can I do that?
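
One way to get this behaviour is to check an exact filename list alongside the extension list. The sketch below assumes repository paths use forward slashes, as in GitHub API responses. The function name should_include and both default lists are hypothetical, not the project's actual settings.

```python
import posixpath


def should_include(path, allowed_extensions=(".py", ".txt"),
                   exact_filenames=("README.md",)):
    """Decide whether a repository file should be processed.

    A file is kept if its name matches one of exact_filenames, or if
    its extension is in allowed_extensions. Leaving ".md" out of
    allowed_extensions means README.md is the only Markdown file kept.
    Comparisons are case-insensitive.
    """
    name = posixpath.basename(path)
    if name.lower() in {n.lower() for n in exact_filenames}:
        return True
    extension = posixpath.splitext(name)[1]
    return extension.lower() in {e.lower() for e in allowed_extensions}
```

With these defaults, docs/README.md and src/main.py are kept, while docs/CHANGELOG.md is skipped.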

Thank you for your contribution!

When I run this command: python test_onefilellm.py

[nltk_data] Downloading package stopwords to
[nltk_data] /Users/player/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Traceback (most recent call last):
File "/Users/player/Downloads/1filellm-main/test_onefilellm.py", line 5, in <module>
from onefilellm import process_github_repo, process_arxiv_pdf, process_local_folder, fetch_youtube_transcript, crawl_and_extract_text
File "/Users/player/Downloads/1filellm-main/onefilellm.py", line 30, in <module>
raise EnvironmentError("GITHUB_TOKEN environment variable not set.")
OSError: GITHUB_TOKEN environment variable not set.
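
The error is raised at import time because onefilellm.py requires the GITHUB_TOKEN environment variable, so setting it in the shell before running the tests (for example, export GITHUB_TOKEN=<your token>) avoids the failure. As an alternative, the sketch below builds the request headers lazily and falls back to unauthenticated requests when no token is set. The function name github_headers is hypothetical and this is not how the project currently behaves.

```python
import os


def github_headers(env=None):
    """Build GitHub API request headers from the environment.

    Returns an Authorization header when GITHUB_TOKEN is set, and an
    empty dict otherwise. Unauthenticated requests still work for
    public repositories, but with a much lower rate limit.
    """
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        return {}
    return {"Authorization": f"token {token}"}
```

Because the lookup happens when the function is called rather than when the module is imported, functions that never touch the GitHub API could then be imported and tested without a token.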

Incorrect tiktoken Version Specified in `requirements.txt`

While reviewing the requirements.txt, I noticed that the specified version of tiktoken (0.0.3) does not exist. This could lead to issues when installing dependencies. I've tested the functionality with tiktoken version 0.6.0, and everything seems to work flawlessly. I suggest updating the version in requirements.txt accordingly to avoid future installation problems.
