bellingcat / edgar

Tool for the retrieval of corporate and financial data from the SEC

Home Page: https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb

License: GNU General Public License v3.0

Languages: Python 88.91%, Jupyter Notebook 10.31%, Dockerfile 0.78%
Topics: command-line, financial-data, open-source-research, python, securities-and-exchange-commission

edgar's People

Contributors

galenreich, georgedyer, jackcollins91, jordan-gillard, msramalho, nauelserraino, opskov

edgar's Issues

Searching by category is broken.

Currently, the category argument is passed to the API as a URL parameter derived from the TEXT_SEARCH_FILING_CATEGORIES_MAPPING:

&category=form-cat3

which doesn't appear to affect the search results.

Instead, TEXT_SEARCH_CATEGORY_FORM_GROUPINGS should be used to set the forms URL parameter:

&forms=10-12B,10-12G,18-12B,20FR12B, ...

Additionally, perhaps users should be able to set the forms directly, either by providing a category from the mapping or by giving a list of forms of interest.
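
As a rough sketch of the proposed fix (the mapping excerpt and helper below are illustrative, not the repo's actual definitions; the real form lists live in TEXT_SEARCH_CATEGORY_FORM_GROUPINGS):

from urllib.parse import urlencode

# Illustrative excerpt only; the real mapping lives in the project's constants
TEXT_SEARCH_CATEGORY_FORM_GROUPINGS = {
    "form-cat3": ["10-12B", "10-12G", "18-12B", "20FR12B"],  # truncated for illustration
}


def build_forms_param(category: str) -> str:
    """Translate a filing category into the forms URL parameter the endpoint expects."""
    forms = TEXT_SEARCH_CATEGORY_FORM_GROUPINGS[category]
    return urlencode({"forms": ",".join(forms)})


print(build_forms_param("form-cat3"))
# forms=10-12B%2C10-12G%2C18-12B%2C20FR12B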

Write test suite

I need to preface this issue by saying I love the work you do and that I think this software is very cool.

I notice that this repo doesn't have unit tests. As a result, breaking changes and regressions are difficult to catch. I know you guys want to publish EDGAR to PyPI, so this is really no bueno. Ideally we'd want a robust test suite that runs in CI for every PR. I think this is especially relevant to PR #10, since it involves major code changes and therefore a strong potential to break existing behavior.

I'm interested in helping Bellingcat, so I'd be happy to work on this or take on an advisory role.
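
A sketch of what a first pytest file could look like, just to show the shape (the import path and class name are taken from the code excerpts quoted elsewhere in this tracker; everything else is a placeholder):

import pytest

from src.text_search import EdgarTextSearcher  # path assumed from the excerpts in this tracker


def test_searcher_constructs_without_arguments():
    # Smoke test: a regression in __init__ surfaces immediately in CI
    assert EdgarTextSearcher() is not None


@pytest.mark.skip(reason="placeholder: needs fetch_page mocked so CI never hits the SEC")
def test_text_search_writes_csv(tmp_path):
    # Sketch only: the real test would mock the network layer and assert on the CSV produced
    pass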

General Refactor

There are quite a few places where code can be de-duplicated or lifted out of nested if/else blocks. A refactor would help make the codebase more accessible.

An error is thrown when no results are found

Currently, if a query returns no results, an error is thrown:

src.browser.PageCheckFailedError: Page check failed, page load seems to have failed

It would be better if this were avoided or handled, and an easy-to-interpret message given to the user.

To reproduce run:

python main.py text_search Bellingcat --start_date "2023-01-01" --exact_search
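
One hedged option for handling it at the CLI layer, as a sketch only (the wrapper function is hypothetical; the exception's import path is taken from the traceback above):

from src.browser import PageCheckFailedError


def run_search(searcher, **kwargs):
    """Wrap the existing text_search call and report an empty result set plainly (sketch)."""
    try:
        return searcher.text_search(**kwargs)
    except PageCheckFailedError:
        # Treat a failed page check on the first results page as "no results" rather than crashing
        print("No results found for this query. Try broadening the dates or keywords.")
        return None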

Add Ongoing Monitoring

One possible feature would be a mechanism for running the tool regularly for a set of keywords.

This could be accomplished with a cron job or a service worker (I don't know what the cross-platform solutions to this are).
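
As a rough illustration only (not a proposal for how the feature should ship), the existing CLI could be re-run on an interval from a small Python loop; a system scheduler such as cron or Windows Task Scheduler invoking the same command would avoid keeping a process alive:

import subprocess
import time

# Illustrative polling loop; the CLI invocation mirrors the examples elsewhere in this tracker
KEYWORDS = ["Volcano", "Monitoring"]
INTERVAL_SECONDS = 24 * 60 * 60  # once a day

while True:
    subprocess.run(
        ["python", "main.py", "text_search", *KEYWORDS, "--start_date", "2023-01-01"],
        check=False,
    )
    time.sleep(INTERVAL_SECONDS)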

Enable searching for a mixture of exact and inexact keywords

Currently, keyword search is either exact (the keyword order must match for all keywords) or inexact (keywords may match in any order).

# Join search keywords into a single string
keywords = " ".join(keywords)
keywords = f'"{keywords}"' if exact_search else keywords

It would be good if searches could use a mix of exact and inexact keyword matches:

e.g. "John Doe" Pharmaceuticals
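
One possible shape for this, sketched under the assumption that any keyword containing whitespace is meant as an exact phrase (the helper name is hypothetical):

from typing import List


def join_keywords(keywords: List[str]) -> str:
    """Quote multi-word phrases for exact matching and leave single words inexact (sketch)."""
    parts = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    return " ".join(parts)


print(join_keywords(["John Doe", "Pharmaceuticals"]))
# "John Doe" Pharmaceuticals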

Location-based Search

Currently, searching by "Principal executive offices in" and "Incorporated in" is not supported by the tool. Both filters are provided by the SEC tool:

[Screenshot of the SEC full-text search interface showing the location filters]

They are handled by passing the following fields to the API:

locationType=incorporated
locationCode=AL
locationCodes=AL

If locationType is omitted the biz_states field is searched.
If locationType=incorporated the inc_states field is searched.

locationCode doesn't appear to do anything; instead, locationCodes (note the plural) has the effect. Multiple values can be given, and the endpoint appears to return matches for any of the terms.

The TEXT_SEARCH_LOCATIONS_MAPPING object should be used to support this.
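
A sketch of how the two filters might translate into URL parameters, based purely on the behaviour described above (the helper is hypothetical; a real implementation should validate the codes against TEXT_SEARCH_LOCATIONS_MAPPING):

from typing import Optional
from urllib.parse import urlencode


def build_location_params(peo_in: Optional[str], inc_in: Optional[str]) -> str:
    """Build locationType/locationCodes parameters from comma-separated state codes (sketch)."""
    if peo_in:
        # Omitting locationType means the biz_states field is searched
        return urlencode({"locationCodes": peo_in})
    if inc_in:
        return urlencode({"locationType": "incorporated", "locationCodes": inc_in})
    return ""


print(build_location_params(None, "AL,NY"))
# locationType=incorporated&locationCodes=AL%2CNY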

Replace Selenium

I'm not sure that Selenium is necessary, and it adds some friction to the tool.

Would GET requests to the endpoint work? There will still need to be some HTML parsing of the results, as I don't believe there is a nicely structured API for this.
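
A quick experiment to answer that might look like the sketch below. The endpoint URL is an assumption to verify against TEXT_SEARCH_BASE_URL in the repo, and the SEC requires a descriptive User-Agent header; whether the payload comes back as structured JSON or needs HTML parsing is exactly what this would reveal:

import requests

# Assumed endpoint -- verify against TEXT_SEARCH_BASE_URL in the repo before relying on it
SEARCH_URL = "https://efts.sec.gov/LATEST/search-index?q=%22Bellingcat%22"

response = requests.get(
    SEARCH_URL,
    headers={"User-Agent": "edgar test script (contact@example.com)"},
    timeout=30,
)
response.raise_for_status()
print(response.headers.get("Content-Type"))
print(response.text[:500])  # inspect whether the payload is JSON or HTML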

Replace sys.exit with exceptions and handling

At the moment, a failure to load the first page of results is handled by calling sys.exit(1).

I think it would be better to raise and handle a specific exception than to call sys.exit(1); if someone used the text_search class, they might not expect a call to sys.exit.

EDGAR/src/text_search.py

Lines 478 to 480 in 685865b

except PageCheckFailedError as e:
print(e)
sys.exit(1)

EDGAR/src/text_search.py

Lines 486 to 489 in 685865b

except Exception as e:
print(f"Execution aborting due to a {e.__class__.__name__} error raised "
f"while parsing number of results for first page at URL {url}: {e}")
sys.exit(1)
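
As a sketch of the alternative: the library stops catching (or re-raises) PageCheckFailedError, and only the CLI entry point converts it into an exit code. Apart from PageCheckFailedError, the names below are placeholders:

import sys

from src.browser import PageCheckFailedError  # import path taken from the traceback earlier in this tracker


def run_text_search() -> None:
    """Placeholder for the real CLI wiring; it would call EdgarTextSearcher.text_search here."""
    raise PageCheckFailedError("Page check failed, page load seems to have failed")


def main() -> int:
    # The library raises; only the entry point decides the process exit code
    try:
        run_text_search()
    except PageCheckFailedError as e:
        print(e)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())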

Page limit Parameter

Hi team, it would be helpful to expose a page limit parameter so that generic queries like 'climate' can stop after a certain number of pages if the user so chooses. This could be the expected behaviour:

class EdgarTextSearcher:
    
    #---> perhaps adjust like this? 

    def _fetch_search_request_results(
        self,
        search_request_url_args: str,
        min_wait_seconds: float,
        max_wait_seconds: float,
        retries: int,
        max_pages: Optional[int] = None  # New parameter
    ) -> Iterator[Iterator[Dict[str, Any]]]:
        """
        Fetches the results for the given search request and paginates through the results.

        :param search_request_url_args: URL-encoded request arguments string to concatenate to the SEC website URL
        :param min_wait_seconds: minimum number of seconds to wait for the request to complete
        :param max_wait_seconds: maximum number of seconds to wait for the request to complete
        :param retries: number of times to retry the request before failing
        :param max_pages: maximum number of pages to fetch (optional)
        :return: Iterator of dictionaries representing the parsed table rows
        """

        # Fetch first page, verify that the request was successful by checking the results table appears on the page
        self.json_response = fetch_page(
            f"{TEXT_SEARCH_BASE_URL}{search_request_url_args}",
            min_wait_seconds,
            max_wait_seconds,
            retries,
        )(
            lambda json_response: json_response.get("error") is None
            and json_response.get("hits", {}).get("hits", 0) != 0,
            f"First search request failed for URL {TEXT_SEARCH_BASE_URL}{search_request_url_args} ...",
        )

        # Get number of pages
        num_pages = self._compute_number_of_pages()
        
        # Limit the number of pages if max_pages is specified
        if max_pages is not None:
            num_pages = min(num_pages, max_pages)
            print(f"Limiting search to {num_pages} pages")

        for i in range(1, num_pages + 1):
            paginated_url = f"{TEXT_SEARCH_BASE_URL}{search_request_url_args}&page={i}&from={100*(i-1)}"
            try:
                self.json_response = fetch_page(
                    paginated_url,
                    min_wait_seconds,
                    max_wait_seconds,
                    retries,
                )(
                    lambda json_response: json_response.get("error") is None,
                    f"Search request failed for page {i} at URL {paginated_url}, skipping page...",
                )
                if self.json_response.get("hits", {}).get("hits", 0) == 0:
                    raise ResultsTableNotFoundError()
                page_results = self._parse_table_rows(paginated_url)
                yield page_results
            except PageCheckFailedError as e:
                print(e)
                continue
            except ResultsTableNotFoundError:
                print(
                    f"Could not find results table on page {i} at URL {paginated_url}, skipping page..."
                )
                continue
            except Exception as e:
                print(
                    f"Unexpected {e.__class__.__name__} error occurred while fetching page {i} at URL {paginated_url}, skipping page: {e}"
                )
                continue

    def text_search(
        self,
        keywords: List[str],
        entity_id: Optional[str],
        filing_form: Optional[str],
        single_forms: Optional[str],
        start_date: date,
        end_date: date,
        min_wait_seconds: float,
        max_wait_seconds: float,
        retries: int,
        destination: str,
        peo_in: Optional[str],
        inc_in: Optional[str],
        max_pages: Optional[int] = None  # New parameter
    ) -> None:
        """
        Searches the SEC website for filings based on the given parameters.

        :param keywords: Search keywords to input in the "Document word or phrase" field
        :param entity_id: Entity/Person name, ticker, or CIK number to input in the "Company name, ticker, or CIK" field
        :param filing_form: Group to select within the filing category dropdown menu, defaults to None
        :param single_forms: List of single forms to search for (e.g. ['10-K', '10-Q']), defaults to None
        :param start_date: Start date for the custom date range
        :param end_date: End date for the custom date range
        :param min_wait_seconds: Minimum number of seconds to wait for the request to complete
        :param max_wait_seconds: Maximum number of seconds to wait for the request to complete
        :param retries: Number of times to retry the request before failing
        :param destination: Name of the CSV file to write the results to
        :param peo_in: Search principal executive offices in a location (e.g. "NY,OH")
        :param inc_in: Search incorporated in a location (e.g. "NY,OH")
        :param max_pages: Maximum number of pages to fetch (optional)
        """
        self._generate_search_requests(
            keywords=keywords,
            entity_id=entity_id,
            filing_form=filing_form,
            single_forms=single_forms,
            start_date=start_date,
            end_date=end_date,
            min_wait_seconds=min_wait_seconds,
            max_wait_seconds=max_wait_seconds,
            retries=retries,
            peo_in=peo_in,
            inc_in=inc_in,
        )

        search_requests_results: List[Iterator[Iterator[Dict[str, Any]]]] = []
        for r in self.search_requests:

            # Run generated search requests and paginate through results
            try:
                all_pages_results: Iterator[Iterator[Dict[str, Any]]] = (
                    self._fetch_search_request_results(
                        search_request_url_args=r,
                        min_wait_seconds=min_wait_seconds,
                        max_wait_seconds=max_wait_seconds,
                        retries=retries,
                        max_pages=max_pages  # Pass the new parameter
                    )
                )
                search_requests_results.append(all_pages_results)

            except Exception as e:
                print(
                    f"Skipping search request due to an unexpected {e.__class__.__name__} for request parameters '{r}': {e}"
                )
        if not search_requests_results:
            raise NoResultsFoundError("No results found for the search query")
        write_results_to_file(
            itertools.chain(*search_requests_results),
            destination,
            TEXT_SEARCH_CSV_FIELDS_NAMES,
        )

and it would then be called like this:

edgar_searcher = EdgarTextSearcher()
edgar_searcher.text_search(
    keywords=["Resignations"],
    entity_id=None,
    filing_form=None,
    single_forms=None,
    start_date=date(2024, 1, 1),
    end_date=date(2024, 12, 31),
    min_wait_seconds=0.1,
    max_wait_seconds=0.5,
    retries=3,
    destination="results.csv",
    peo_in=None,
    inc_in=None,
    max_pages=5  # This will limit the search to the first 5 pages
)

Thanks for the nice tool!

No link to data file?

In the README I see: "I've built a table containing most income statement, balance sheet, and cash flow statement data for every company traded publicly in the U.S. This table is updated periodically, and available here for download as a .CSV file. "

I cannot find a link to this .CSV file. Where is it?

Package as Python Library

It would be great if the tool were packaged as a Python library with reusable components. This would also let us release it to PyPI.

CSV output contains duplicates

I noticed that the output contains a ton of duplicates. Is this behaviour expected?

Reference:

import pandas as pd

CSV = r"C:\EDGAR\test\edgar_volcano_monitoring.csv"  # Output of text_search
df = pd.read_csv(CSV)

cols = df.columns.to_list()
df_no_dup = df.drop_duplicates(subset=cols)

print(f"Shape of df: {df.shape}")
print(f"Shape of df_no_dup: {df_no_dup.shape}")

>>> Shape of df: (1400, 16)
>>> Shape of df_no_dup: (101, 16)

Reproducibility:

poetry run edgar-tool text_search Volcano Monitoring

If this behaviour is not expected, I would be happy to work on it.
