GithubHelp home page GithubHelp logo

Comments (6)

waltzofpearls avatar waltzofpearls commented on June 2, 2024 1

Thanks for sharing this information! I'm planning for a new version of this rust crate, and it might involve a complete rewrite to improve parsing speed and provide more flexible configuration options for the parser. I will take your use case into consideration.

When I first wrote this parser, it was part of a cli tool, and I made some opinionated decisions on the library's api design. As time went by, I've noticed a growing demand for increased flexibility. Addressing these requests would require some degree of redesign. And I've always wanted to change the way how the date time string is parsed, transitioning away from regex towards a character by character parsing approach.

from dateparser.

waltzofpearls avatar waltzofpearls commented on June 2, 2024 1

Agreed on what you suggested. Thanks, Hayden 🙏

I will first try statically declare regex, and replace the use of lazy_static macro. Profiling is another good idea. And yes, regex is easier to understand and maintain.

For overall speed, I just ran make bench again, and the results on Mac with M1 Pro is reasonable, compare to a few years ago when I did that on Intel i9. Now the medians for different date formats range from 35 ns to 1140 ns. What I really hope is to even out those numbers, and nothing should be over 500 ns.

from dateparser.

waltzofpearls avatar waltzofpearls commented on June 2, 2024

Hey @haydenflinner, pull requests are always welcome. I had a look at datetime.rs, and found 2 Parse methods that didn't use default_time, one in hms_z and another in month_md_hms. If I made them use default_time, would that meet your requirement?

from dateparser.

haydenflinner avatar haydenflinner commented on June 2, 2024

Unfortunately not, and I don't think this lib will be made to fit my use-case as easily as I thought it would during prototyping. I really need to know what was found, whether it was a time without a date, or a date without a time, or a real timestamp. Once I know that, falling back to a default date or time is easy, and I can use the context of what time or date was parsed to decide what my fallback value is. With the API as-is, I need to know my default ahead of time, and at return-time I can't tell if what I got back was my default or just a value which happened to match it. For example, suppose I'm parsing a list of timestamps, and they suddenly change from 23:59:00 to 00:01. Probably, it is a new day. Detecting that with default_date set requires doing some math like 'did my timestamps suddenly jump back >20 hours).

For now, I copy-pasted the whole thing into my wd repo along with some other tweaks that I needed. It is an awesome resource already, thanks for this!

Btw, I added support for fractional seconds in fn hms, had to change the regex and add %.f to the parse_from_str call. Copied that from another time parsing thing later on in that file which had support for fractionals.
haydenflinner/wd@4af850b#diff-e587ba841624ffc4321d69ca6ae1830b3bd221f539df0388668c9d8f59b53969R332

from dateparser.

haydenflinner avatar haydenflinner commented on June 2, 2024

That's cool news, thanks! On the regex note, it would be neat if Rust had compile-time regexes rather than always requiring the runtime, but I suspect the reason it doesn't is that regexing is rarely the bottleneck. Make sure you've profiled (cargo flamegraph) before throwing out the easy to use approach 😄

From a performance perspective, there are two main things that I noticed. One is that most users who care about speed are likely parsing large sequences of the same data, so if we were to guess the right format the first time we'd be far faster than always iterating through the whole list (if their format is towards the bottom). To that end just remembering which format was last parsed in this parser would be 95+% of the way there, if you really wanted to overkill you could keep some sort of data structure to sort them by observed frequency.

The other thing that might help is joining all of the regexes into one big one (using |) and using capturing groups to figure out which one matched. Once in this singular big regex, common prefixes might be more obvious to the dev, but they might also get compiled out by the regex engine, don't know for sure. Either way it more closely matches what we're trying to do, find which of a bunch of expressions matches. Make sure you set up some benches before trying this though because it is always possible that this is slower, and the regex support for many matches was more-so intended for searching within large strings. Users of the lib should probably be responsible for ensuring they give you close to the minimal string that would contain the timestamp, as they have the domain knowledge (for example I am taking the first 3 "words" from a potentially multi-kb line, by space-splitting). But if you are interested in supporting arbitrary search within large strings, the singular regex would most likely be faster, because you're visiting all of the data only once, and this is what regex engines are optimized for, using SIMD and all sorts of other tricks to do it faster and more expressively than a naive char-by-char approach.

from dateparser.

haydenflinner avatar haydenflinner commented on June 2, 2024

Awesome, gonna be hard to beat that! Have you given any thought to how the API redesign will look? Maybe returning a struct containing Option Option? This is not yet at the top of my priorities either, I can always just make my users type in the exact date/time they're looking for 😄 but it is a little weird in my log-viewer when someone types "2023/04/10" and the viewer seeks to "current time" on that date haha.

from dateparser.

Related Issues (12)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.