armand33 / wikidatasets
Break Wikidata dumps into smaller knowledge graphs
License: Other
I gave your sample code a go and it seems it would take about 240 hours to run. It has gone through about 2.2M entities in 7 h 41 m and will probably keep going for days and days. Is this just the speed it runs at?
import pickle

from wikidatasets.processFunctions import get_subclasses, query_wikidata_dump, build_dataset

path = 'humans/'  # this will contain the files output through the process
dump_path = r'F:\latest-all.json.bz2'  # raw string so '\l' is not treated as an escape
n_lines = 0  # this can be an upper bound on the number of lines to read

test_entities = get_subclasses('Q5')  # Q5: human
query_wikidata_dump(dump_path, path, n_lines,
                    test_entities=test_entities, collect_labels=True)

labels = pickle.load(open(path + 'labels.pkl', 'rb'))
build_dataset(path, labels)
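For reference, a back-of-envelope estimate of the total runtime from the figures above. This assumes a steady processing rate and a dump of roughly 56M lines (one JSON entity per line; the exact total depends on the dump date, so the number is an assumption, not a measurement):

```python
# Hypothetical runtime estimate, assuming a constant rate and ~56M lines.
lines_done = 2_200_000
hours_elapsed = 7 + 41 / 60
rate = lines_done / hours_elapsed   # lines processed per hour
total_lines = 56_000_000            # assumed dump size
estimate = total_lines / rate       # hours to finish at this rate
print(f"~{estimate:.0f} hours total")
```

Depending on the assumed dump size this lands in the low hundreds of hours, consistent with the days-long runtime reported.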
I was trying to extract the sub-graph for power tools following the sample code in the documentation. Everything went well until the last step, saving the txt file for nodes, where it throws an encoding error. Is it because the pandas version is too old? It is 0.24.0, while the latest is 1.0.1.
COMMAND:
import pickle
from wikidatasets.processFunctions import (
get_subclasses,
query_wikidata_dump,
build_dataset,
)
path = "power_tools/" # this will contain the files output through the process
dump_path = "latest-all.json.bz2" # path to the bz2 dump file
n_lines = 56208653 # this can be an upper bound
test_entities = get_subclasses("Q1327701")
# Q1327701 refers to power tool : tool that is actuated by an additional power source and mechanism other than by hand alone
query_wikidata_dump(
dump_path, path, n_lines, test_entities=test_entities, collect_labels=True
)
labels = pickle.load(open(path + "labels.pkl", "rb"))
build_dataset(path, labels)
ERROR:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-1-0a38d065a0ef> in <module>
17
18 labels = pickle.load(open(path + "labels.pkl", "rb"))
---> 19 build_dataset(path, labels)
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\wikidatasets\processFunctions.py in build_dataset(path, labels, return_, dump_date)
169 write_csv(edges, path + 'edges.txt')
170 write_csv(attributes, path + 'attributes.txt')
--> 171 write_ent_dict(nodes, path + 'nodes.txt')
172 write_ent_dict(entities, path + 'entities.txt')
173 write_rel_dict(relations, path + 'relations.txt')
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\wikidatasets\utils.py in write_ent_dict(df, name)
195 f.write('# Entities: {}\n'.format(len(df)))
196 f.write('entityID\twikidataID\tlabel\n')
--> 197 df.to_csv(f, sep='\t', header=False, index=False)
198
199
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\pandas\core\generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
3018 doublequote=doublequote,
3019 escapechar=escapechar, decimal=decimal)
-> 3020 formatter.save()
3021
3022 if path_or_buf is None:
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\pandas\io\formats\csvs.py in save(self)
170 self.writer = UnicodeWriter(f, **writer_kwargs)
171
--> 172 self._save()
173
174 finally:
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\pandas\io\formats\csvs.py in _save(self)
286 break
287
--> 288 self._save_chunk(start_i, end_i)
289
290 def _save_chunk(self, start_i, end_i):
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\site-packages\pandas\io\formats\csvs.py in _save_chunk(self, start_i, end_i)
312 quoting=self.quoting)
313
--> 314 libwriters.write_csv_rows(self.data, ix, self.nlevels,
315 self.cols, self.writer)
pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()
~\AppData\Local\Continuum\miniconda3\envs\prd_knw_gph\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 14: character maps to <undefined>
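The traceback ends in `cp1252.py`, Windows's default text encoding: the output file is opened without an explicit encoding, so `to_csv` has to encode labels such as 'ł' (U+0142, common in Polish names) through cp1252, which has no mapping for it. The pandas version is not the cause. A minimal sketch of the failure and the usual fix, an explicit UTF-8 encoding:

```python
label = "Bia\u0142ystok"  # a Wikidata label containing U+0142 ('ł')

# Reproduces the error: cp1252 cannot represent U+0142.
try:
    label.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# The fix: open the file with an explicit UTF-8 encoding before handing
# it to DataFrame.to_csv, e.g. open(name, 'w', encoding='utf-8') inside
# the library's write function. Running Python with PYTHONUTF8=1
# (Python 3.7+) has the same effect without patching anything.
encoded = label.encode("utf-8")  # succeeds
```

Either patching the `open` call or setting `PYTHONUTF8=1` in the environment before launching Python should let the run finish on Windows.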