
Comments (9)

karlhorky commented on July 21, 2024

And good luck with your MVP! Lunr is quite easy to integrate for a basic use case

Nice, thanks! I think the MVP will be good for a start, and then I'll probably go to a full-blown database in the mid-term, and get the searching logic out of JS altogether.

from highlight-words.

tricinel commented on July 21, 2024

Hi @karlhorky! Thanks for the kind words and for your proposal! I'd like to understand it better before doing anything :)

So, your requirement is that you have a large piece of text. And you want to short-circuit the search for matches after X characters. So a simple example is:

Given this text:

My dog is a very good boy and is always eating his lunch.

And the search for is.

When you want all the results, you would get is, is and his.
When you want to stop the search after 10 characters of text, you are essentially searching for is in My dog is so you get one is.
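To make the example concrete, here is a minimal sketch that counts the matches in the full text and in the first 10 characters. `countMatches` is a hypothetical helper written for this illustration; it is not part of highlight-words.

```javascript
// Count case-insensitive occurrences of `query` in `text`.
// `countMatches` is a hypothetical helper, not part of highlight-words.
function countMatches(text, query) {
  // Escape regex metacharacters so the query is matched literally.
  const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(escaped, 'gi'));
  return matches ? matches.length : 0;
}

const text = 'My dog is a very good boy and is always eating his lunch.';

// 3: "is", "is", and the "is" inside "his"
const inFullText = countMatches(text, 'is');

// 1: only "My dog is " is searched
const inFirstTen = countMatches(text.slice(0, 10), 'is');
```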

If you don't care about whether a chunk is a match or not, then you are essentially splitting the text at 90 characters into two slices: (1) you care about the chunks (match or not) in the first slice and (2) don't care at all about the chunks (match or not) in the second slice.

If that is indeed the case, then you can split the text at 90 characters to start with. For example:

const chunks = highlightWords({
  text: content.match(/.{1,90}/g)[0],
  query: searchQuery,
  clipBy: 3,
})

If you do care about the number of chunks that match, then you want to do a combination of: as long as the chunk is a match and I have less than 90 characters from all the chunks, keep going. This is what I think your filter does. They seem like two different use cases, so I wanted to better understand what you're after.

My questions are:

  1. Have you tested the performance limits? Meaning...how large a text is too large? Where's the turning point where we can see the problems?
  2. Is your requirement specifically for the browser? I would guess yes, because you mention pixel sizes and font sizes, but want to be sure. Since we can use the same library in a server environment, I want to be sure we can handle performance issues in both cases.

I think that while your solution could work...I want to fix the underlying issue if possible. :) If that's a performance problem, we should fix that rather than extend the API.

Could you provide an example (codesandbox, whatever) where we can see this? I will test it myself as well in a node env.

Thanks again!


karlhorky commented on July 21, 2024

Ok, I was a bit confused by the wording above, but I think I understand after reading it a few times.

There are two use cases that you're asking about. I will call them A and B:

A: If you don't care about whether a chunk is a match or not ... splitting the text at 90 characters

Right, this would be if you want the first 90 character part of the text (regardless of whether there are matches in it or not). This is not my use case.

B: If you do care about the number of chunks that match, then you want to do a combination of: as long as the chunk is a match and I have less than 90 characters from all the chunks, keep going.

This is my use case.

I can tell you a bit more about what I'm doing (I'm creating a simple search MVP - this code is in Node.js):

  1. I have an array of many long strings extracted from HTML files
  2. I first filter this array to make sure that I only have strings that have at least one match of the search query
  3. I map over the filtered array, and split these long strings using highlight-words. Here I would like the chunk.text from all chunks to add up to a string of maximum 90 characters, so that it doesn't look too crazy in the autocomplete popup. I want highlight-words to be able to short-circuit internally and avoid taking extra CPU time.
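As a rough illustration of step 3, the sketch below caps the combined `chunk.text` length after the chunks already exist. It assumes chunks shaped like `{ text, match }`, and `limitChunks` is a hypothetical helper, not part of highlight-words. Note that it trims the output only; it does not save the work of splitting the whole string, which is what the short-circuit request is about.

```javascript
// Keep chunks until their combined text length reaches `maxLength`.
// Chunks are assumed to look like { text: string, match: boolean }.
// `limitChunks` is a hypothetical helper, not part of highlight-words.
function limitChunks(chunks, maxLength) {
  const result = [];
  let remaining = maxLength;
  for (const chunk of chunks) {
    if (remaining <= 0) break; // budget spent: ignore all later chunks
    const text = chunk.text.slice(0, remaining); // trim the chunk that crosses the limit
    result.push({ ...chunk, text });
    remaining -= text.length;
  }
  return result;
}

const chunks = [
  { text: 'My dog ', match: false },
  { text: 'is', match: true },
  { text: ' a very good boy and ', match: false },
  { text: 'is', match: true },
  { text: ' always eating his lunch.', match: false },
];

const limited = limitChunks(chunks, 12);
const preview = limited.map((chunk) => chunk.text).join('');
// preview === 'My dog is a '
```

For the autocomplete popup described above, the limit would be 90 instead of 12.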

I want to make sure that as we scale up and add more documents, we do the least amount of work possible. In the mid term, we will be switching away from this JS-based filtering approach to a proper full text indexed search.

It's actually less about my use case though, because I think controls that allow your users to avoid performance penalties on large texts are very useful generally.


As to your questions:

  1. Have you tested the performance limits? Meaning...how large a text is too large? Where's the turning point where we can see the problems?

I haven't. It's actually pretty fast right now (for a single user on an M1 development machine), but the point is that I want to avoid doing a bunch of extra work.

  2. Is your requirement specifically for the browser? I would guess yes, because you mention pixel sizes and font sizes, but want to be sure. Since we can use the same library in a server environment, I want to be sure we can handle performance issues in both cases.

See my search MVP use case above. It is actually code that runs on a server, but the final result will be displayed in a constrained design in the browser.


tricinel commented on July 21, 2024

Cool, gotcha! Yes, we're on the same page then with the two use cases.

By the way, the necessity for this utility came out of exactly your use case. I wanted an autocomplete that searches as you type and, to tell the user why a specific search result shows up in the autocomplete, highlights the matched terms in each result (quite similar to the Algolia search used on https://www.gatsbyjs.com/ or https://tailwindcss.com/).

I, however, never envisioned that I would want to highlight big chunks of text in an autocomplete. Let me describe how I worked with this, maybe it helps you decide as well (hint: this is a bit beside the point of this specific issue).

I use lunr search to index a bunch of blog articles, picking just the title and a summary from the fields (so I skip the content, which can be larger than usual, and I wouldn't want to put it in the autocomplete). Then as the user types, with a debounce, lunr will return the results and these are passed through to highlight-words for highlighting. The end result looks exactly like Algolia.

I guess the difference is that I have a summary for each post, while you might want to look into the entire content - so I can imagine that the autocomplete might have a lot of matches for common words. I would say, for your specific use case, I would not implement a maxLength. There are better ways to fix that - one being a search index that gives proper weights to things and limits the results that reach the front-end.

As a side note, the way this is currently implemented is with a regex that will split any text into matches and non-matches (and I just mark a chunk as match or non-match)...so to implement a short-circuit like you want would mean a bigger rewrite than I have time for now.
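A minimal sketch of that kind of regex split is shown below. It is an assumption about the general technique, not the library's actual code; `splitIntoChunks` is a hypothetical stand-in and expects a non-empty query. Because `split` walks the entire string before returning anything, there is no natural point at which to stop early, which is why a short-circuit would need a different approach.

```javascript
// Split `text` into match and non-match chunks with a single regex.
// `splitIntoChunks` is a hypothetical stand-in, not the library's actual code.
// Assumes `query` is a non-empty string.
function splitIntoChunks(text, query) {
  // Escape regex metacharacters so the query is matched literally.
  const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // The capturing group makes String.prototype.split keep the matched parts.
  const pattern = new RegExp(`(${escaped})`, 'gi');
  return text
    .split(pattern)
    .filter((part) => part !== '') // drop empty strings at the edges
    .map((part) => ({
      text: part,
      match: part.toLowerCase() === query.toLowerCase(),
    }));
}

const parts = splitIntoChunks('My dog is a good boy', 'is');
// [
//   { text: 'My dog ', match: false },
//   { text: 'is', match: true },
//   { text: ' a good boy', match: false },
// ]
```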

However, I admit I have never investigated how performant this utility is. So I would like to do that first.

I hope this sounds agreeable :) I'd like to close this issue and open a new one where we can investigate any performance issues that might arise in large pieces of text. And go from there.


karlhorky commented on July 21, 2024

By the way, the necessity for this utility came out of exactly your use case

I use lunr search to index a bunch of blog articles, picking just the title and a summary from the fields

Ah nice, lunr was also something that I was looking at before deciding to push the more polished version to mid-term and just get something working for now.

implemented is with a regex that will split any text into matches and non-matches

to implement a short-circuit like you want would mean a bigger rewrite than I have time for now

Ok, understood - makes sense.

I'd like to close this issue and open a new one where we can investigate any performance issues that might arise in large pieces of text

Sure, if that's the path forward here, then by all means :) I guess given your approach you mentioned above, performance may not even be a concern...


tricinel commented on July 21, 2024

Thanks! Definitely looking into this next. If it turns out to be a problem (or maybe I can find an elegant way to short-circuit the process early), then I'm coming back to this issue.

I'll keep you updated nevertheless! Thanks!

And good luck with your MVP! Lunr is quite easy to integrate for a basic use case - it will only take you about an afternoon to get everything up and running. Here's the entire thing for me: https://gist.github.com/tricinel/85a0ba3333c36a20196728340df9ec26. You can reuse it to create any index you want. You just need a way to get the content loaded (my getPosts function).


karlhorky commented on July 21, 2024

Ah just thinking more about my use case, maybe it would be possible to get an option clipByLength in addition to clipBy, which would clip with an ellipsis not by words but by number of characters? An alternative would be to keep clipBy for the actual number value, and add an option clipByType with possible values of "words" and "characters" (defaults to "words")

Otherwise, with only clipBy, you may have super long "words" such as URLs that don't get clipped reasonably.
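To illustrate the difference between the two proposed modes, here is a small sketch. `clipText` and its `type` parameter are hypothetical and only mirror the idea of the proposed `clipByType` option; this is not how the library's `clipBy` is implemented, and the sample URL is made up.

```javascript
// Hypothetical helper illustrating the proposed "words" vs "characters" modes.
// This is not the library's clipBy implementation.
function clipText(text, count, type = 'words') {
  if (type === 'characters') {
    // Cut at a fixed number of characters, regardless of word boundaries.
    return text.length > count ? `${text.slice(0, count)}...` : text;
  }
  // Default: keep the first `count` space-separated words.
  const words = text.split(' ');
  return words.length > count ? `${words.slice(0, count).join(' ')}...` : text;
}

const sample = 'see https://example.com/a/very/long/path/that/keeps/going for details';

const byWords = clipText(sample, 2, 'words');
// 'see https://example.com/a/very/long/path/that/keeps/going...'

const byCharacters = clipText(sample, 15, 'characters');
// 'see https://exa...'
```

Clipping by words keeps the whole URL because it counts as a single "word", while clipping by characters produces a predictable maximum length.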

What do you think? If you think it's an ok idea, should I open a new issue for this?


tricinel commented on July 21, 2024

Hm, I see. That won't solve the use case you described above, so it could be a new feature. I'll consider it, sure. Would you mind opening a new issue though? It seems unrelated to the short-circuit thing we were discussing.


karlhorky commented on July 21, 2024

Sure, here's one: #8

