LOD Laundromat is a standards-compliant implementation for crawling and cleaning Linked Open Data (LOD). Since data cleaning is often reported as comprising 80% of a data analyst’s workload, automating at least a large part of it would be a significant gain.
With the LOD Laundromat, data is cleaned in parallel by an arbitrary number of Washing Machine threads. The collection of washing machines can be monitored from a main Laundromat thread.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>$(URI)</loc>
<lastmod>$(DATE_TIME)</lastmod>
<changefreq>$(DURATION)</changefreq>
<priority>$(FLOAT)</priority>
</url>
</urlset>
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<sc:dataset>
<sc:datasetLabel>$(STRING)</sc:datasetLabel>
<sc:dataDumpLocation>$(URI)</sc:dataDumpLocation>
<sc:linkedDataPrefix>$(IRI)</sc:linkedDataPrefix>
<changefreq>$(DURATION)</changefreq>
</sc:dataset>
</urlset>
Adding a new URI to the seedlist results in the following registration:
$(URI_HASH){
relative: $(BOOL),
status: added,
uri: $(URI)
}
We store seedlist registrations in a SWI-Prolog dictionary data
structure. SWI dictionaries are very similar to the JSON format
for data exchange. The main difference is the $(HASH), which we
will use to link seedlist registrations to one another.
The value of $(HASH) is computed in the following way:
- Take the value of $(URI).
- Map uppercase characters that appear in the scheme or host components to their corresponding lowercase characters[fn::See §6.2.2.1 of RFC 3986 (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
- Map lowercase characters that denote hexadecimal digits within a percent-encoded octet to their corresponding uppercase characters[fn::See §6.2.2.1 of RFC 3986 (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
- Decode percent-encoded octets that denote unreserved characters[fn::See §6.2.2.2 of RFC 3986 (https://tools.ietf.org/html/rfc3986#section-6.2.2.2)].
- Remove the relative path references . and .. by applying reference resolution[fn::See §6.2.2.3 of RFC 3986 (https://tools.ietf.org/html/rfc3986#section-6.2.2.3)].
- Take the MD5 hash.
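The normalization and hashing steps listed above can be sketched in Python. This is a minimal illustration, not the project's implementation: the function names are hypothetical, the dot-segment removal is a simplified version of the RFC 3986 algorithm, and hashing the UTF-8 bytes of the normalized URI into a hexadecimal digest is an assumption.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters according to RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PERCENT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_octet(match):
    """Decode unreserved octets; uppercase the hex digits of all others."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def remove_dot_segments(path):
    """Simplified removal of '.' and '..' path segments."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Never pop the empty segment that stands for the leading "/".
            if output and output != [""]:
                output.pop()
            continue
        output.append(segment)
    if segments[-1] in (".", ".."):
        output.append("")  # keep the trailing slash
    return "/".join(output)


def normalize_uri(uri):
    """Apply the case, percent-encoding and path normalization steps."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Lowercase the host (and port), but leave any userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()

    def fix(text):
        return PERCENT_OCTET.sub(_normalize_octet, text)

    path = remove_dot_segments(fix(parts.path))
    return urlunsplit(
        (scheme, fix(netloc), path, fix(parts.query), fix(parts.fragment))
    )


def uri_hash(uri):
    """MD5 of the normalized URI (UTF-8 bytes and hex digest are assumptions)."""
    return hashlib.md5(normalize_uri(uri).encode("utf-8")).hexdigest()
```

With this sketch, `HTTP://Example.COM/a/./b/../%7euser` and `http://example.com/a/~user` receive the same hash, which is the point of normalizing before hashing.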
We allow relative URIs to be added to the seedlist, denoted by the
Boolean property relative. We cannot do anything useful with
relative URIs, because their download location is unknown due to a
missing host machine name. Still, we want to be able to quantify how
often a dataset is erroneously denoted by a relative URI.
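As an illustration, the relative flag could be derived as follows. This is a sketch under an assumption: a URI counts as relative here when it lacks a scheme or a host, which matches the "unknown download location" reading above but is stricter than the RFC 3986 definition of a relative reference. The function names are hypothetical, and a Python dict stands in for the SWI dictionary.

```python
from urllib.parse import urlsplit


def is_relative(uri):
    """Assumption: a URI without scheme or host cannot be downloaded."""
    parts = urlsplit(uri)
    return not parts.scheme or not parts.netloc


def new_registration(uri):
    """Build the properties of a fresh seedlist registration."""
    return {"relative": is_relative(uri), "status": "added", "uri": uri}
```

For example, `new_registration("data/dump.nt.gz")` is flagged as relative, while `new_registration("http://example.org/dump.nt.gz")` is not.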
The status property keeps track of the URI throughout the
data cleaning process. Its initial state is added, which means that
the URI has been added to the seedlist.
The uri property stores the URI itself. interval denotes the time
between consecutive crawls, expressed in seconds. processed
denotes the time of the last crawl.
$(URI_HASH){
added: $(ADDED),
child: $(HASH),
interval: $(INTERVAL),
processed: $(PROCESSED),
uri: $(URI)
}
This stage takes seeds that match the pattern in (1), and changes them
to match pattern (2) during the download process. If the download
fails, we only have metadata $(HTTP_META) about the TCP and/or HTTP
communication process, resulting in a seed record with pattern (3).
If the download succeeds, there is also content metadata
$(CONTENT_META), resulting in a seed record with pattern (4).
A seed is stale, and therefore a candidate for re-downloading, if
$(PROCESSED) + $(INTERVAL) < $(NOW)
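The staleness condition translates directly into code. The sketch below assumes that processed and the current time are both given in seconds since the epoch and that interval is a number of seconds; the function name is hypothetical.

```python
import time


def is_stale(processed, interval, now=None):
    """Return True if the seed is due for re-downloading."""
    if now is None:
        now = time.time()  # current time in seconds since the epoch
    return processed + interval < now
```

Passing `now` explicitly keeps the check deterministic, for example `is_stale(1000.0, 60.0, now=1061.0)` is true while `is_stale(1000.0, 60.0, now=1060.0)` is false.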
While downloading:
$(DOWNLOAD_HASH){
parent: $(SEED_HASH),
status: downloading
}
After downloading:
$(DOWNLOAD_HASH){
http: [$(HTTP_META)],
newline: $(NEWLINE), %
number_of_bytes: $(NONNEG), %
number_of_chars: $(NONNEG), %
number_of_lines: $(NONNEG), %
parent: $(SEED_HASH),
status: filed,
timestamp: $(BEGIN)-$(END)
}
The record includes the $(BEGIN) and $(END) times of the
download.
$(HTTP_META) has the following form:
http{
headers: $(HTTP_HEADERS),
status: $(STATUS_CODE),
uri: $(URI),
version: version{major: $(NONNEG), minor: $(NONNEG)},
walltime: $(FLOAT)
}
This stage is started for each seed that matches [1]. If the seed
denotes a downloaded file that is an archive, the resulting seed
record will include a pointer to each directly included ‘child’ file as
in [3]. Status depleted denotes that no more files are enclosed
within this file. For each child, a new seed record of the form [4]
is added to the seedlist.
If the seed denotes a downloaded file that contains data, its seed
record is updated to have status unarchived. We must determine the
character encoding of the data file in order to be able to read it.
Unfortunately, this can only be determined heuristically. We perform
the following steps:
- We look for a Unicode Byte Order Mark (BOM), which indicates that the file has a Unicode encoding.
- If no BOM is present, we use uchardet in order to guess the encoding. If the encoding is incompatible with Unicode[fn::An example of a common encoding that is compatible with Unicode is (US-)ASCII.], we recode the entire file using iconv.
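The two steps above can be sketched with the Python standard library. This is an illustration only: `guess_encoding` is a hypothetical callback standing in for the external guessing tool, and Python's own decoding and encoding stands in for iconv. The UTF-32 BOMs are checked before the UTF-16 ones, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Each BOM is paired with a Python codec that strips the BOM when decoding.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data):
    """Return a codec name if the data starts with a Unicode BOM, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None


def to_utf8(data, guess_encoding):
    """Recode data to UTF-8, using the BOM if present and a guess otherwise.

    guess_encoding is a hypothetical callback: bytes in, codec name out.
    """
    encoding = detect_bom(data) or guess_encoding(data)
    return data.decode(encoding).encode("utf-8")
```

A call such as `to_utf8(raw_bytes, lambda data: "latin-1")` shows the fallback path with a fixed guess; a real guesser would inspect the bytes.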
Candidates for the unpacking stage have the following form:
$(ARCHIVE_HASH){status: filed}
While unpacking:
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: unarchiving}
After unpacking:
$(ENTRY_HASH){status: unarchived} % leaf node
$(ARCHIVE_HASH){children: [$(ENTRY_HASH)], status: depleted} % non-leaf node
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: filed} % future processing
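To make the record patterns for the unpacking stage concrete, the following sketch produces the depleted archive record and one filed record per directly included entry. It makes several assumptions that are not taken from the text: the archive is a ZIP file, Python dicts stand in for SWI dictionaries, and the entry hash is a hypothetical MD5 over the archive hash and the entry name.

```python
import hashlib
import io
import zipfile


def entry_hash(archive_hash, entry_name):
    """Hypothetical entry identifier; the real derivation is not specified."""
    key = f"{archive_hash}/{entry_name}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def unpack_records(archive_hash, archive_bytes):
    """Return the seed records that result from unpacking one ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        names = [
            info.filename for info in archive.infolist() if not info.is_dir()
        ]
    children = [entry_hash(archive_hash, name) for name in names]
    # Non-leaf node: the archive itself, with pointers to its children.
    records = {archive_hash: {"children": children, "status": "depleted"}}
    # Each child is filed, so that it is unpacked in turn later on.
    for child in children:
        records[child] = {"parent": archive_hash, "status": "filed"}
    return records
```

Because every child is filed again, nested archives are handled by running the same stage on the child records.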
$(ENTRY_HASH){status: unarchived}
$(ENTRY_HASH){status: guessing}
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
$(FORMAT) is one of the following values:
- JSON-LD
- N-Quads
- N-Triples
- RDF/XML
- RDFa
- TriG
- Turtle
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
$(ENTRY_HASH){status: parsing}
$(CLEAN_HASH){dirty: $(ENTRY_HASH), status: cleaned} % clean file
$(ENTRY_HASH){clean: $(CLEAN_HASH), status: parsed} % dirty file