⚡ Web | ✍ Blog | 🐦 Twitter | 🎞 Youtube | ☕ Coffee
🔭 Currently working on gathering texts on the Web and detecting word trends
🖩 First programs written on a TI-83 Plus in TI-BASIC
Name: Adrien Barbaresi
Type: User
Company: Berlin-Brandenburg Academy of Sciences (BBAW)
Bio: Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.
Twitter: adbarbaresi
Location: Berlin
Blog: adrien.barbaresi.eu
A simple Python wrapper for the archive.is capturing service
A collection of awesome web crawlers and spiders in different languages
Visualization of the most frequent words in the German federal election in 2021
Universal character encoding detector
Material for building a German-language web corpus dedicated to COVID-19
Explore, visualize and publish corpora as CSS/XHTML documents
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
SFST/SMOR/DWDS-based German Morphology
Automatically exported from code.google.com/p/equipe-crawler
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
Extraction of a German Reddit Corpus
Automatically exported from code.google.com/p/gps-corpus-builder
Fast and robust date extraction from web pages, with Python or on the command-line
Experiments to modernize the LaTeX class of the JLCL
A readability parser which can extract title, content, images from html pages
Heuristic-based boilerplate removal tool
LAnguage-CLassified OpenSubtitles
Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
Faster, modernized fork of the language identification tool langid.py
Fast Python port of arc90's readability tool, updated to match the latest readability.js
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Old prototype for toponym extraction in historical texts written in German
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Diverse tools used with Twitter data