👋 Hi there!
In this project, you'll build a simple API that fetches some info about a given URL/webpage and makes the results accessible. The goal of this project is to see how you tackle a problem and a set of requirements with few constraints on implementation.
To get started, make sure you have Node.js installed; we recommend the active LTS release. Then clone this repository. The project contains an empty `index.js` file you're free to begin working in. If you have another approach in mind, just delete this file.
FYI, our stack is largely based on TypeScript & Node.js. We use PostgreSQL as our primary database, but any relational database is fine. How you tackle this project is entirely up to you!
Develop a RESTful API to complete the following:
- Add an endpoint that accepts a URL in the request body and creates and returns a new `Reference` record as JSON.
  - During this process, you should also initiate an asynchronous task to fetch data from the URL saved in the `Reference`. More information on fetching data is below.
  - Note: the endpoint should return the `Reference` record without waiting for it to be processed.
- Implement an async worker function that processes the reference. This function should take a `Reference` as an argument.
  - Given the `Reference`'s `url` field, get the text content from the page's `title` and any `meta` elements (if they exist), with their names and values serialized to create a semi-structured representation of the page's title & metadata.
  - Return the data as an object and create a new `Result` record in the database, storing the info as JSON or a serialized string in the record's `data` column.
- Add a new GET endpoint that allows a user to fetch results for a given `Reference` ID. This endpoint should return a list of saved `Result`s for a given `Reference` as JSON. Don't forget to keep it RESTful and keep resource-naming best practices in mind as you go.
In your processing task, you'll need to fetch the contents of a webpage and extract information from its DOM. To do this, we recommend fetching and working with the page content using a browser automation tool like Puppeteer or Playwright.
Fetching HTML via plain HTTP and extracting your information without any additional effort is becoming less common these days, with the rise of JS-dependent rendering, SPAs, and other complexities like bot detection and browser fingerprinting. If you'd like to challenge yourself a bit further, check out ToScrape, which has a number of great scenarios already laid out and designed to be extracted!
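Once the rendered HTML is in hand (for example, the string returned by Playwright's `page.content()`), the title/meta serialization step can be illustrated in isolation. The regexes below are a deliberately naive sketch for static HTML — a real implementation should query the browser's DOM (or use a proper HTML parser) instead:

```typescript
// Semi-structured representation of a page's title & metadata.
export interface PageMetadata {
  title: string | null;
  meta: Record<string, string>;
}

// Sketch: pull the <title> text and any name/content (or property/content)
// pairs out of <meta> tags in an HTML string. Assumes double-quoted
// attributes; not robust against real-world markup.
export function extractMetadata(html: string): PageMetadata {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const meta: Record<string, string> = {};
  for (const tag of html.match(/<meta\b[^>]*>/gi) ?? []) {
    const name = tag.match(/(?:name|property)\s*=\s*"([^"]+)"/i)?.[1];
    const content = tag.match(/content\s*=\s*"([^"]*)"/i)?.[1];
    if (name !== undefined && content !== undefined) meta[name] = content;
  }
  return { title: titleMatch ? titleMatch[1].trim() : null, meta };
}
```

The returned object is ready to be stored as JSON (or a serialized string) in a `Result`'s `data` column.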
A reference is created when a user makes a call to `POST /references`.
| Field | Type | Description |
| --- | --- | --- |
| `id` | primary key | the reference identifier |
| `url` | string | a valid web address |
| `created_at` | timestamp | reference created time |
A result is created after a data fetching task for a `Reference` is completed.
| Field | Type | Description |
| --- | --- | --- |
| `id` | primary key | the result identifier |
| `reference_id` | foreign key | the related reference |
| `data` | json | result from the fetching task |
| `created_at` | timestamp | result created time |
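In PostgreSQL, the two tables above might be declared roughly as follows. This is a sketch only — the column types (`uuid`, `jsonb`, `timestamptz`) and defaults are assumptions, and any relational database is fine:

```sql
-- "references" is quoted because REFERENCES is a reserved word in PostgreSQL.
CREATE TABLE "references" (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  url         text NOT NULL,
  created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE results (
  id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  reference_id  uuid NOT NULL REFERENCES "references" (id) ON DELETE CASCADE,
  data          jsonb NOT NULL,
  created_at    timestamptz NOT NULL DEFAULT now()
);
```

`gen_random_uuid()` is built into PostgreSQL 13+; serial integer keys would work just as well.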
Other things that are not required, but we would love to see:
- Test coverage (We tend to use Jest)
- Additional validations
- More endpoints (fetch all references, delete a reference & its results, etc.)
- Make use of an actual job queue (Redis, ElasticMQ, etc.)
- Scheduling/interval-based reprocessing of existing references to monitor changes
- Anything else you can think of!
If you don't implement the bonus items, no worries. Feel free to share some notes on what you might do and how you might have gone about it given more time.
When you have finished the exercise, please create a bundle of your work by running `npm run bundle` in the project root.
This will create a bundle file called `take-home-challenge.bundle` based on your local `main` branch. Send the file to us via email, or, if you received a submission link from your hiring manager, please upload it there.
Thank you, and good luck! 🍀