We use the Reuters dataset, which contains 21578 Reuters news documents from 1987. The documents were labeled manually by Reuters personnel. The labels belong to 5 different category classes, such as 'people', 'places' and 'topics'. There are 672 categories in total, but many of them occur only very rarely. Some documents belong to many different categories, others to only one, and some have no category at all.
The dataset (7.8 MB) can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz
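As a minimal sketch, the archive can be fetched and unpacked with the Python standard library alone. The function name `fetch_reuters` and the target directory `reuters21578` are hypothetical names chosen for this example, and the sketch assumes the archive stores the documents in files ending in `.sgm`.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "reuters21578-mld/reuters21578.tar.gz"
)


def fetch_reuters(url=URL, dest="reuters21578"):
    """Download the archive (unless already present) and unpack it.

    Returns the sorted list of extracted .sgm files.
    `dest` is a hypothetical directory name, not one prescribed by the dataset.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / "reuters21578.tar.gz"

    # Skip the download when the archive is already on disk.
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)

    # Unpack next to the archive.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)

    # Assumption: the documents live in files with the .sgm extension.
    return sorted(dest_dir.rglob("*.sgm"))
```

Calling `fetch_reuters()` with no arguments downloads into the current working directory; a second call reuses the archive that is already there.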
I used Apache Spark to predict topics for the documents that were not assigned to any category.
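Before any model is trained, the documents have to be separated into those that already carry a topic (usable as training data) and those that do not (the ones to classify). The sketch below shows one way to do that split in plain Python; it is not the Spark job itself. It assumes the SGML layout of the `.sgm` files, where each document is a `<REUTERS>` element and each label sits in a `<D>` tag inside its category class. The sample text is made up for illustration, and `parse_record` and `split_by_topic` are hypothetical helper names.

```python
import re

# The five category classes, as tag names (assumed from the .sgm layout).
CATEGORY_CLASSES = ("TOPICS", "PLACES", "PEOPLE", "ORGS", "EXCHANGES")

RECORD_RE = re.compile(r"<REUTERS\b.*?</REUTERS>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)
LABEL_RE = re.compile(r"<D>(.*?)</D>")


def parse_record(record):
    """Return the labels per category class and the body text of one record."""
    labels = {}
    for cls in CATEGORY_CLASSES:
        block = re.search(rf"<{cls}>(.*?)</{cls}>", record, re.DOTALL)
        labels[cls.lower()] = LABEL_RE.findall(block.group(1)) if block else []
    body = BODY_RE.search(record)
    return {"labels": labels, "body": body.group(1).strip() if body else ""}


def split_by_topic(sgml_text):
    """Split records into (labeled, unlabeled) by whether they have a topic."""
    labeled, unlabeled = [], []
    for match in RECORD_RE.finditer(sgml_text):
        doc = parse_record(match.group(0))
        (labeled if doc["labels"]["topics"] else unlabeled).append(doc)
    return labeled, unlabeled


# Made-up sample in the style of the .sgm files, for illustration only.
SAMPLE = """
<REUTERS NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<TEXT><TITLE>Sample cocoa story</TITLE>
<BODY>Cocoa prices rose on Monday.</BODY></TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<TEXT><TITLE>Sample untagged story</TITLE>
<BODY>A company announced quarterly results.</BODY></TEXT>
</REUTERS>
"""

labeled, unlabeled = split_by_topic(SAMPLE)
print(len(labeled), len(unlabeled))  # prints "1 1"
```

The `labeled` list would then serve as training data for a classifier, and the `unlabeled` list holds the documents whose topics are to be predicted.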