Comments (2)
@isensee-bastian I've been thinking about this, and it could really be the killer feature of the kraken. I think many people must be just as annoyed with all the duplicates on these platforms as we are. A way to cut through that noise would probably add significant value and drive some adoption.
I've also been thinking about how to do it...
- Given that we do have a real DB (#8)
- We could actually visit the URL to the listing, which we are so far only copying (the 'href')
- We then copy the listing's text (introducing a new per-strategy selector)
- Load into memory all listings of the last two days (from the database)
- Calculate text similarity values for the listing at hand versus all the listings in memory
- If none of the similarity values is higher than the threshold, write it to the DB as a new entry
- If one of them passes the threshold and the listing is thus identified as a duplicate, append it to an "also-seen-on" list/array field on the already existing entry
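To make the steps above concrete, here is a minimal sketch in Python (chosen only for brevity, the idea is language-agnostic). Everything in it is an assumption for illustration: the `Listing` shape, the `also_seen_on` field name, the `store_or_merge` helper, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure. A plain list stands in for the database from #8.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from difflib import SequenceMatcher

# All names and values here are hypothetical, not taken from the kraken codebase.
SIMILARITY_THRESHOLD = 0.9      # assumed value, would need tuning on real listings
LOOKBACK = timedelta(days=2)    # "all listings of the last two days"


@dataclass
class Listing:
    url: str
    text: str
    seen_at: datetime
    also_seen_on: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def store_or_merge(new: Listing, db: list[Listing]) -> Listing:
    """Add `new` to `db`, or record its URL on the most similar recent entry."""
    cutoff = new.seen_at - LOOKBACK
    recent = [entry for entry in db if entry.seen_at >= cutoff]
    best = max(recent, key=lambda entry: similarity(entry.text, new.text), default=None)
    if best is not None and similarity(best.text, new.text) >= SIMILARITY_THRESHOLD:
        # Same listing seen elsewhere: keep one entry, remember the extra URL.
        if new.url != best.url and new.url not in best.also_seen_on:
            best.also_seen_on.append(new.url)
        return best
    db.append(new)
    return new
```

Comparing every new listing against every recent one is quadratic in the listing text length with `SequenceMatcher`, so a cheaper measure (e.g. token-set overlap) might be needed if the two-day window grows large.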
This would be nice for the board notifiers/UIs, because you get a list of sites where this listing has been posted. You can then go and apply on the site you like most (e.g. freelance.de instead of freelancermap.de). If we just drop duplicates silently, we lose the value of this information.
On the other hand, this is something that only works for the board UIs. We'd have to see how we can wring the most value out of this for notifiers like Telegram or Slack.
from re-employment-kraken.
@uschtwill I totally agree. Manually identifying duplicates requires time and mental energy that could be spent in a better way.
The steps you listed make total sense. I think it is a good blueprint to follow for implementation.
About the tracking of duplicates: You are right about the value of storing multiple URLs per project. It helps to have a choice of where to apply. Moreover, the number of duplicates gives a rough indication of how high or low the chances of winning a project are.
True, we need a different concept for notifying about duplicates in messengers vs board apps.