This is not an issue with gtfs-lite per se, but rather with a particular GTFS file I tried to load, resulting in the following error:
/home/ghsci/work/process/data/transit_feeds/bilbao_gtfs/20230509_010334_RENFE_AVLD
Traceback (most recent call last):
File "/env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'end_date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 266, in <module>
main()
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 81, in main
loaded_feeds = gtfslite.GTFS.load_zip(f'{gtfsfeed_path}.zip')
File "/env/lib/python3.10/site-packages/gtfslite/gtfs.py", line 280, in load_zip
calendar["end_date"] = pd.to_datetime(calendar["end_date"]).dt.date
File "/env/lib/python3.10/site-packages/pandas/core/frame.py", line 3760, in __getitem__
indexer = self.columns.get_loc(key)
File "/env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
raise KeyError(key) from err
KeyError: 'end_date'
The apparent cause is that this calendar.txt file contains trailing spaces before the newline character, in the header as well as in the data rows, as seen in this text editor showing all characters:
I don't believe the data rows are the issue, as read_csv is called with skipinitialspace=True (and I confirmed this resolves the spaces-after-dates issue).
However, the last column in this file ends up with the trailing spaces included in its name, so it can't be matched as simply 'end_date', as per the screenshot below showing this file read into pandas:
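The same behaviour can be reproduced without the feed itself. The following is a minimal sketch using a made-up two-line excerpt (not the actual RENFE calendar.txt); the only assumption is that the header line ends with trailing spaces, as described above.

```python
import io

import pandas as pd

# Made-up excerpt: the header and the data row both end with two trailing spaces.
raw = "service_id,start_date,end_date  \nA1,20230508,20230609  \n"

# skipinitialspace only skips spaces that follow a delimiter,
# so the trailing spaces in the header stay part of the last column name.
calendar = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(calendar.columns))          # the last name still carries the spaces
print("end_date" in calendar.columns)  # the plain name is not found
```

Looking up calendar["end_date"] on this frame raises the same KeyError as in the traceback above.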
One possibility, if you did want to handle these kinds of inconsistencies, would be to call str.strip() on the columns after loading each dataframe, as per https://stackoverflow.com/a/36082588/4636357, e.g.
calendar.columns = calendar.columns.str.strip()
I confirmed that the above code resolves the issue in this case:
without this addition:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230509_010334_RENFE_AVLD.zip')
Traceback (most recent call last):
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'end_date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\gtfs-lite\gtfslite\gtfs.py", line 277, in load_zip
calendar["start_date"] = pd.to_datetime(calendar["start_date"]).dt.date
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'end_date'
with the addition:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230509_010334_RENFE_AVLD.zip')
>>> test
<gtfslite.gtfs.GTFS object at 0x0000024EFD75FE50>
>>> test.calendar
service_id monday tuesday wednesday ... saturday sunday start_date end_date
0 2023-05-082023-06-09001651 True True True ... True True 2023-05-08 1970-01-01
1 2023-05-082023-06-09001653 True True True ... True True 2023-05-08 1970-01-01
2 2023-05-082023-06-30001901 True True True ... True True 2023-05-08 1970-01-01
3 2023-05-082023-06-30001902 True True True ... True True 2023-05-08 1970-01-01
4 2023-05-082023-06-30001931 True True True ... True True 2023-05-08 1970-01-01
... ... ... ... ... ... ... ... ... ...
3425 2023-05-082023-05-28389071 True True True ... True True 2023-05-08 1970-01-01
3426 2023-05-082023-05-28389081 True True True ... True True 2023-05-08 1970-01-01
3427 2023-05-082023-05-28389091 True True True ... True True 2023-05-08 1970-01-01
3428 2023-05-082023-12-09941841 True True True ... True True 2023-05-08 1970-01-01
3429 2023-05-082023-12-09942751 True True True ... True True 2023-05-08 1970-01-01
[3430 rows x 10 columns]
If you were to implement this, to help robustness in loading GTFS feeds with slight validity issues, it would probably be a good idea to do this for all loaded frames.
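As a rough illustration of what "all loaded frames" could look like, here is a sketch under stated assumptions: strip_column_names is a hypothetical helper (not part of gtfs-lite), and the dictionary of tables is a made-up stand-in for however the loader holds its dataframes.

```python
import io

import pandas as pd


def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with whitespace stripped from column names."""
    df = df.copy()
    df.columns = df.columns.str.strip()
    return df


# Made-up tables with stray whitespace in their headers (not real feed data).
raw_tables = {
    "calendar": "service_id,start_date,end_date  \nA1,20230508,20230609\n",
    "routes": "route_id ,route_short_name\nR1,1\n",
}

# Apply the same clean-up to every frame right after it is read.
frames = {
    name: strip_column_names(pd.read_csv(io.StringIO(text), dtype=str))
    for name, text in raw_tables.items()
}

print(list(frames["calendar"].columns))
print(list(frames["routes"].columns))
```

Doing the clean-up in one place right after reading keeps the rest of the loading code free to look columns up by their specification names.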
I'm not sure whether it's of interest for you to implement this feature, as it's a problem with some GTFS files, not the software. However, I suspect other GTFS readers must do something similar, as a colleague was able to read this GTFS feed using urbanaccess, as per this thread: https://github.com/global-healthy-liveable-cities/global-indicators/issues/275 (that thread focuses on a different issue, which I'm scoping to see whether using GTFS-Lite can resolve it).
In case it helps, I'll look into drafting a pull request implementing this change.