bellingcat / edgar

Tool for the retrieval of corporate and financial data from the SEC

Home Page: https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb

License: GNU General Public License v3.0

Languages: Python 88.91%, Jupyter Notebook 10.31%, Dockerfile 0.78%
Topics: command-line, financial-data, open-source-research, python, securities-and-exchange-commission

edgar's People

Contributors

galenreich, georgedyer, jackcollins91, jordan-gillard, msramalho, nauelserraino, opskov

edgar's Issues

Searching by category is broken.

Currently, the category argument is passed to the API as a URL parameter derived from the TEXT_SEARCH_FILING_CATEGORIES_MAPPING:

&category=form-cat3

which doesn't appear to affect the search results.

Instead, TEXT_SEARCH_CATEGORY_FORM_GROUPINGS should be used to set the forms URL parameter:

&forms=10-12B,10-12G,18-12B,20FR12B, ...

Additionally, perhaps users should be able to set the forms directly, either by providing a category from the mapping or by giving a list of forms of interest.
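
As a rough sketch of the proposed fix (the mapping excerpt and helper below are illustrative, not the repo's actual definitions; the real form lists live in TEXT_SEARCH_CATEGORY_FORM_GROUPINGS):

from urllib.parse import urlencode

# Illustrative excerpt only; the real mapping lives in the project's constants
TEXT_SEARCH_CATEGORY_FORM_GROUPINGS = {
    "form-cat3": ["10-12B", "10-12G", "18-12B", "20FR12B"],  # truncated for illustration
}


def build_forms_param(category: str) -> str:
    """Translate a filing category into the forms URL parameter the endpoint expects."""
    forms = TEXT_SEARCH_CATEGORY_FORM_GROUPINGS[category]
    return urlencode({"forms": ",".join(forms)})


print(build_forms_param("form-cat3"))
# forms=10-12B%2C10-12G%2C18-12B%2C20FR12B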

Write test suite

I need to preface this issue by saying I love the work you do and that I think this software is very cool.

I notice that this repo doesn't have unit tests. As a result, breaking changes and regressions are difficult to catch. I know you guys want to publish EDGAR to PyPI, so this is really no bueno. Ideally we'd want a robust test suite that runs in CI for every PR. I think this is especially relevant to PR #10, since it involves major code changes and therefore a strong potential to break existing behavior.

I'm interested in helping Bellingcat, so I'd be happy to work on this or take on an advisory role.
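
A sketch of what a first pytest file could look like, just to show the shape (the import path and class name are taken from the code excerpts quoted elsewhere in this tracker; everything else is a placeholder):

import pytest

from src.text_search import EdgarTextSearcher  # path assumed from the excerpts in this tracker


def test_searcher_constructs_without_arguments():
    # Smoke test: a regression in __init__ surfaces immediately in CI
    assert EdgarTextSearcher() is not None


@pytest.mark.skip(reason="placeholder: needs fetch_page mocked so CI never hits the SEC")
def test_text_search_writes_csv(tmp_path):
    # Sketch only: the real test would mock the network layer and assert on the CSV produced
    pass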

General Refactor

There are quite a few places where code can be de-duplicated or lifted out of nested if/else blocks. A refactor would help make the codebase more accessible.

An error is thrown when no results are found

Currently, if a query returns no results, an error is thrown:

src.browser.PageCheckFailedError: Page check failed, page load seems to have failed

It would be better if this were avoided or handled, and an easy-to-interpret message given to the user.

To reproduce run:

python main.py text_search Bellingcat --start_date "2023-01-01" --exact_search
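
One hedged option for handling it at the CLI layer, as a sketch only (the wrapper function is hypothetical; the exception's import path is taken from the traceback above):

from src.browser import PageCheckFailedError


def run_search(searcher, **kwargs):
    """Wrap the existing text_search call and report an empty result set plainly (sketch)."""
    try:
        return searcher.text_search(**kwargs)
    except PageCheckFailedError:
        # Treat a failed page check on the first results page as "no results" rather than crashing
        print("No results found for this query. Try broadening the dates or keywords.")
        return None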

Add Ongoing Monitoring

One possible feature would be a mechanism for running the tool regularly for a set of keywords.

This could be accomplished with a cron job or a service worker (I don't know what the cross-platform solutions to this are).
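
As a rough illustration only (not a proposal for how the feature should ship), the existing CLI could be re-run on an interval from a small Python loop; a system scheduler such as cron or Windows Task Scheduler invoking the same command would avoid keeping a process alive:

import subprocess
import time

# Illustrative polling loop; the CLI invocation mirrors the examples elsewhere in this tracker
KEYWORDS = ["Volcano", "Monitoring"]
INTERVAL_SECONDS = 24 * 60 * 60  # once a day

while True:
    subprocess.run(
        ["python", "main.py", "text_search", *KEYWORDS, "--start_date", "2023-01-01"],
        check=False,
    )
    time.sleep(INTERVAL_SECONDS)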

Enable searching for a mixture of exact and inexact keywords

Currently, keyword search is either exact (the keyword order must match for all keywords) or inexact (keywords may match in any order).

# Join search keywords into a single string
keywords = " ".join(keywords)
keywords = f'"{keywords}"' if exact_search else keywords

It would be good if searches could use a mix of exact and inexact keyword matches:

e.g. "John Doe" Pharmaceuticals
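
One possible shape for this, sketched under the assumption that any keyword containing whitespace is meant as an exact phrase (the helper name is hypothetical):

from typing import List


def join_keywords(keywords: List[str]) -> str:
    """Quote multi-word phrases for exact matching and leave single words inexact (sketch)."""
    parts = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    return " ".join(parts)


print(join_keywords(["John Doe", "Pharmaceuticals"]))
# "John Doe" Pharmaceuticals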

Location-based Search

Currently, searching by "Principal executive offices in" and "Incorporated in" is not supported by the tool. Both filters are provided by the SEC tool:

[Screenshot of the SEC full-text search interface showing the location filters]

They are handled by passing the following fields to the API:

locationType=incorporated
locationCode=AL
locationCodes=AL

If locationType is omitted the biz_states field is searched.
If locationType=incorporated the inc_states field is searched.

locationCode doesn't appear to do anything; instead, locationCodes (note the plural) has the effect. Multiple values can be given, and the endpoint appears to return matches for any of the terms.

The TEXT_SEARCH_LOCATIONS_MAPPING object should be used to support this.
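
A sketch of how the two filters might translate into URL parameters, based purely on the behaviour described above (the helper is hypothetical; a real implementation should validate the codes against TEXT_SEARCH_LOCATIONS_MAPPING):

from typing import Optional
from urllib.parse import urlencode


def build_location_params(peo_in: Optional[str], inc_in: Optional[str]) -> str:
    """Build locationType/locationCodes parameters from comma-separated state codes (sketch)."""
    if peo_in:
        # Omitting locationType means the biz_states field is searched
        return urlencode({"locationCodes": peo_in})
    if inc_in:
        return urlencode({"locationType": "incorporated", "locationCodes": inc_in})
    return ""


print(build_location_params(None, "AL,NY"))
# locationType=incorporated&locationCodes=AL%2CNY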

Replace Selenium

I'm not sure that Selenium is necessary, and it adds some friction to the tool.

Would GET requests to the endpoint work? There will still need to be some HTML parsing of the results, as I don't believe there is a nicely structured API for this.
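
A quick experiment to answer that might look like the sketch below. The endpoint URL is an assumption to verify against TEXT_SEARCH_BASE_URL in the repo, and the SEC requires a descriptive User-Agent header; whether the payload comes back as structured JSON or needs HTML parsing is exactly what this would reveal:

import requests

# Assumed endpoint -- verify against TEXT_SEARCH_BASE_URL in the repo before relying on it
SEARCH_URL = "https://efts.sec.gov/LATEST/search-index?q=%22Bellingcat%22"

response = requests.get(
    SEARCH_URL,
    headers={"User-Agent": "edgar test script (contact@example.com)"},
    timeout=30,
)
response.raise_for_status()
print(response.headers.get("Content-Type"))
print(response.text[:500])  # inspect whether the payload is JSON or HTML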

Replace sys.exit with exceptions and handling

At the moment, a failure to load the first page of results is handled by calling sys.exit(1).

I think it would be better to raise and handle a specific exception than to call sys.exit(1); if someone used the text_search class, they might not expect a call to sys.exit.

EDGAR/src/text_search.py

Lines 478 to 480 in 685865b

except PageCheckFailedError as e:
print(e)
sys.exit(1)

EDGAR/src/text_search.py

Lines 486 to 489 in 685865b

except Exception as e:
print(f"Execution aborting due to a {e.__class__.__name__} error raised "
f"while parsing number of results for first page at URL {url}: {e}")
sys.exit(1)
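
As a sketch of the alternative: the library stops catching (or re-raises) PageCheckFailedError, and only the CLI entry point converts it into an exit code. Apart from PageCheckFailedError, the names below are placeholders:

import sys

from src.browser import PageCheckFailedError  # import path taken from the traceback earlier in this tracker


def run_text_search() -> None:
    """Placeholder for the real CLI wiring; it would call EdgarTextSearcher.text_search here."""
    raise PageCheckFailedError("Page check failed, page load seems to have failed")


def main() -> int:
    # The library raises; only the entry point decides the process exit code
    try:
        run_text_search()
    except PageCheckFailedError as e:
        print(e)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())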

Page limit Parameter

Hi team, it would be helpful to expose a page limit parameter so that generic queries like 'climate' can stop after a certain number of pages if the user so chooses. This could be the expected behaviour:

class EdgarTextSearcher:
    
    #---> perhaps adjust like this? 

    def _fetch_search_request_results(
        self,
        search_request_url_args: str,
        min_wait_seconds: float,
        max_wait_seconds: float,
        retries: int,
        max_pages: Optional[int] = None  # New parameter
    ) -> Iterator[Iterator[Dict[str, Any]]]:
        """
        Fetches the results for the given search request and paginates through the results.

        :param search_request_url_args: URL-encoded request arguments string to concatenate to the SEC website URL
        :param min_wait_seconds: minimum number of seconds to wait for the request to complete
        :param max_wait_seconds: maximum number of seconds to wait for the request to complete
        :param retries: number of times to retry the request before failing
        :param max_pages: maximum number of pages to fetch (optional)
        :return: Iterator of dictionaries representing the parsed table rows
        """

        # Fetch first page, verify that the request was successful by checking the results table appears on the page
        self.json_response = fetch_page(
            f"{TEXT_SEARCH_BASE_URL}{search_request_url_args}",
            min_wait_seconds,
            max_wait_seconds,
            retries,
        )(
            lambda json_response: json_response.get("error") is None
            and json_response.get("hits", {}).get("hits", 0) != 0,
            f"First search request failed for URL {TEXT_SEARCH_BASE_URL}{search_request_url_args} ...",
        )

        # Get number of pages
        num_pages = self._compute_number_of_pages()
        
        # Limit the number of pages if max_pages is specified
        if max_pages is not None:
            num_pages = min(num_pages, max_pages)
            print(f"Limiting search to {num_pages} pages")

        for i in range(1, num_pages + 1):
            paginated_url = f"{TEXT_SEARCH_BASE_URL}{search_request_url_args}&page={i}&from={100*(i-1)}"
            try:
                self.json_response = fetch_page(
                    paginated_url,
                    min_wait_seconds,
                    max_wait_seconds,
                    retries,
                )(
                    lambda json_response: json_response.get("error") is None,
                    f"Search request failed for page {i} at URL {paginated_url}, skipping page...",
                )
                if self.json_response.get("hits", {}).get("hits", 0) == 0:
                    raise ResultsTableNotFoundError()
                page_results = self._parse_table_rows(paginated_url)
                yield page_results
            except PageCheckFailedError as e:
                print(e)
                continue
            except ResultsTableNotFoundError:
                print(
                    f"Could not find results table on page {i} at URL {paginated_url}, skipping page..."
                )
                continue
            except Exception as e:
                print(
                    f"Unexpected {e.__class__.__name__} error occurred while fetching page {i} at URL {paginated_url}, skipping page: {e}"
                )
                continue

    def text_search(
        self,
        keywords: List[str],
        entity_id: Optional[str],
        filing_form: Optional[str],
        single_forms: Optional[str],
        start_date: date,
        end_date: date,
        min_wait_seconds: float,
        max_wait_seconds: float,
        retries: int,
        destination: str,
        peo_in: Optional[str],
        inc_in: Optional[str],
        max_pages: Optional[int] = None  # New parameter
    ) -> None:
        """
        Searches the SEC website for filings based on the given parameters.

        :param keywords: Search keywords to input in the "Document word or phrase" field
        :param entity_id: Entity/Person name, ticker, or CIK number to input in the "Company name, ticker, or CIK" field
        :param filing_form: Group to select within the filing category dropdown menu, defaults to None
        :param single_forms: List of single forms to search for (e.g. ['10-K', '10-Q']), defaults to None
        :param start_date: Start date for the custom date range
        :param end_date: End date for the custom date range
        :param min_wait_seconds: Minimum number of seconds to wait for the request to complete
        :param max_wait_seconds: Maximum number of seconds to wait for the request to complete
        :param retries: Number of times to retry the request before failing
        :param destination: Name of the CSV file to write the results to
        :param peo_in: Search principal executive offices in a location (e.g. "NY,OH")
        :param inc_in: Search incorporated in a location (e.g. "NY,OH")
        :param max_pages: Maximum number of pages to fetch (optional)
        """
        self._generate_search_requests(
            keywords=keywords,
            entity_id=entity_id,
            filing_form=filing_form,
            single_forms=single_forms,
            start_date=start_date,
            end_date=end_date,
            min_wait_seconds=min_wait_seconds,
            max_wait_seconds=max_wait_seconds,
            retries=retries,
            peo_in=peo_in,
            inc_in=inc_in,
        )

        search_requests_results: List[Iterator[Iterator[Dict[str, Any]]]] = []
        for r in self.search_requests:

            # Run generated search requests and paginate through results
            try:
                all_pages_results: Iterator[Iterator[Dict[str, Any]]] = (
                    self._fetch_search_request_results(
                        search_request_url_args=r,
                        min_wait_seconds=min_wait_seconds,
                        max_wait_seconds=max_wait_seconds,
                        retries=retries,
                        max_pages=max_pages  # Pass the new parameter
                    )
                )
                search_requests_results.append(all_pages_results)

            except Exception as e:
                print(
                    f"Skipping search request due to an unexpected {e.__class__.__name__} for request parameters '{r}': {e}"
                )
        if not search_requests_results:
            raise NoResultsFoundError("No results found for the search query")
        write_results_to_file(
            itertools.chain(*search_requests_results),
            destination,
            TEXT_SEARCH_CSV_FIELDS_NAMES,
        )

and it would then be called like this:

edgar_searcher = EdgarTextSearcher()
edgar_searcher.text_search(
    keywords=["Resignations"],
    entity_id=None,
    filing_form=None,
    single_forms=None,
    start_date=date(2024, 1, 1),
    end_date=date(2024, 12, 31),
    min_wait_seconds=0.1,
    max_wait_seconds=0.5,
    retries=3,
    destination="results.csv",
    peo_in=None,
    inc_in=None,
    max_pages=5  # This will limit the search to the first 5 pages
)

Thanks for the nice tool!

No link to data file?

In the README I see: "I've built a table containing most income statement, balance sheet, and cash flow statement data for every company traded publicly in the U.S. This table is updated periodically, and available here for download as a .CSV file. "

I cannot find a link to this .CSV file. Where is it?

Package as Python Library

It would be great if the tool were packaged as a Python library with reusable components. This would also let us release it to PyPI.

CSV output contains duplicates

I noticed that the output contains a ton of duplicates. Is this behaviour expected?

Reference:

import pandas as pd

CSV = r"C:\EDGAR\test\edgar_volcano_monitoring.csv"  # Output of text_search
df = pd.read_csv(CSV)

cols = df.columns.to_list()
df_no_dup = df.drop_duplicates(subset=cols)

print(f"Shape of df: {df.shape}")
print(f"Shape of df_no_dup: {df_no_dup.shape}")

>>> Shape of df: (1400, 16)
>>> Shape of df_no_dup: (101, 16)

Reproducibility:

poetry run edgar-tool text_search Volcano Monitoring

If this behaviour is not expected, I would be happy to work on it.
