Comments (10)
In case you want to compare to other languages, here's the behavior in R:
strsplit(c("abc", "ẞ", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "ẞ"
#>
#> [[3]]
#> character(0)
stringr::str_split(c("abc", "ẞ", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "ẞ"
#>
#> [[3]]
#> character(0)
from polars.
I discussed this with @orlp, and we indeed want to go for the expected behavior listed in the issue description.
The empty string input will be a special case that splits the string into its characters. Splitting an empty string this way will result in a list containing one empty string.
from polars.
Seems like this is the default behavior of Rust's split though, so... maybe my expectations are incorrect?
fn main() {
    let v: Vec<&str> = "Hello world!".split("").collect();
    println!("{:?}", v);
}
["", "H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d", "!", ""]
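For comparison, Python's `re.split` (3.7 and later) treats an empty pattern much like the Rust output above: it splits at every empty match, including the ones at the very start and very end of the string. A small sketch; the helper name `split_on_empty_match` is made up for illustration:

```python
import re


def split_on_empty_match(text: str) -> list[str]:
    # An empty pattern matches between every pair of characters,
    # and also at the start and at the end of the string, which is
    # where the leading and trailing "" come from.
    return re.split("", text)


print(split_on_empty_match("abc"))        # ['', 'a', 'b', 'c', '']
print(split_on_empty_match("abc")[1:-1])  # ['a', 'b', 'c']
```

Dropping the first and last element recovers the plain list of characters.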
from polars.
Also, Python does not allow an empty separator:
"Hello World!".split("")
# > ValueError: empty separator
# python "solution"
list("Hello World!")
# ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!']
Which somehow makes sense, because you can't split on "nothing".
You can only split on something you can actually find.
Maybe don't allow empty separators, and force the user to pass None to mean "split into chars"?
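A minimal sketch of what that suggestion could look like in plain Python. The function name `split_strict` and its signature are hypothetical, not an existing polars or Python API. Note that this deliberately differs from `str.split(None)`, which splits on whitespace:

```python
from typing import Optional


def split_strict(text: str, sep: Optional[str] = None) -> list[str]:
    """Hypothetical helper: None means "split into chars", "" is rejected."""
    if sep is None:
        # No separator given: return the individual characters.
        return list(text)
    if sep == "":
        # Mirror str.split, which refuses an empty separator.
        raise ValueError("empty separator")
    return text.split(sep)


print(split_strict("abc"))        # ['a', 'b', 'c']
print(split_strict("a,b", ","))   # ['a', 'b']
print(split_strict(""))           # []
```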
from polars.
I had resorted to using .extract_all for this (although it produces an empty list for the last case).
s.str.extract_all("(?s).")
# shape: (3,)
# Series: '' [list[str]]
# [
# ["a", "b", "c"]
# ["ẞ"]
# []
# ]
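The same pattern can be tried outside polars with Python's standard `re` module, which behaves the same way on these three inputs. This is only an analogy to the `extract_all` call above, and the helper name `chars_via_regex` is made up for illustration:

```python
import re


def chars_via_regex(text: str) -> list[str]:
    # (?s) makes "." match newlines too, so every character is captured.
    return re.findall(r"(?s).", text)


for text in ["abc", "ẞ", ""]:
    print(chars_via_regex(text))
# ['a', 'b', 'c']
# ['ẞ']
# []
```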
from polars.
I discussed this with @orlp, and we indeed want to go for the expected behavior listed in the issue description.
The empty string input will be a special case that splits the string into its characters. Splitting an empty string this way will result in a list containing one empty string.
@stinodego
I am not convinced that splitting an empty string should return a list containing one empty string. I think that is not the expected behaviour; it should be an empty list, just like in the example from @cmdlineluser.
There is no "split by nothing", so I would assume this special case instead means "iterate over chars"?!
Python:
for line in ["abc", "ß", ""]:
    print(f'{line:5} -> {list(line)}')
# abc   -> ['a', 'b', 'c']
# ß     -> ['ß']
#       -> []
Rust:
vec!["abc", "ß", ""]
    .iter()
    .map(|line| line.chars().collect::<Vec<char>>())
    .collect::<Vec<_>>();
// [['a', 'b', 'c'], ['ß'], []]
from polars.
@JulianCologne I feel it's a bit of a 0 to the 0th power situation. Is that 1 or is that 0? It depends on which side you approach "".split("") from:
"bar".split("") -> ["b", "a", "r"] # Desired as discussed.
"ba".split("") -> ["b", "a"] # Desired as discussed.
"b".split("") -> ["b"] # Desired as discussed.
"".split("") -> ?
"".split("b") -> [""] # Defined by Python, want to be consistent with.
"".split("ba") -> [""] # Defined by Python, want to be consistent with.
"".split("bar") -> [""] # Defined by Python, want to be consistent with.
from polars.
@orlp Interesting thoughts, however...
splitting by nothing is not defined.
So the new idea becomes "iterate over the chars", and IMO the expected behaviour is to have as many items in the list as the text has characters.
List length should be equal to the UTF-8 character count:
- "bar" -> 3 chars
- "ba" -> 2 chars
- "b" -> 1 char
- "" -> 0 chars
Also, the examples in your second half have a different meaning!
Splitting by something that is not there results in the original string being returned:
"XXX".split('abc')
-> ['XXX']
Logic 1) if sep is empty -> special case -> list of all chars
"bar".split("") -> ["b", "a", "r"] # length: 3
"ba".split("") -> ["b", "a"] # length: 2
"b".split("") -> ["b"] # length: 1
"".split("") -> [] # length: 0
Logic 2) if sep is not found -> special case -> keep original string
"XXX".split("bar") -> ["XXX"]
"XXX".split("ba") -> ["XXX"]
"XXX".split("b") -> ["XXX"]
"XXX".split("") -> ["X", "X", "X"] # Different case! Cannot search for an empty string, so this requires different logic from the above! :)
Conclusion
"".split("") must follow "Logic 1", as you cannot search for an empty string in your text.
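The two rules above can be sketched in a few lines of Python. The name `split_or_chars` is hypothetical and this is not how polars implements it; it only illustrates the proposed semantics. It also assumes "chars" means Unicode code points, which is what iterating a Python `str` yields:

```python
def split_or_chars(text: str, sep: str) -> list[str]:
    if sep == "":
        # Logic 1: an empty separator cannot be searched for,
        # so return one list item per character.
        return list(text)
    # Logic 2 falls out of str.split: if sep is not found,
    # the original string comes back as the only item.
    return text.split(sep)


for text in ["bar", "ba", "b", ""]:
    # Under Logic 1 the list length always equals the character count.
    print(repr(text), "->", split_or_chars(text, ""))

print(split_or_chars("XXX", "bar"))  # ['XXX']
```

With these rules, `split_or_chars("", "")` gives `[]`, while `split_or_chars("", "b")` still gives `[""]`, matching Python's `"".split("b")`.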
from polars.
I would actually tend to agree with @JulianCologne here - returning an empty list in that special case would be more useful.
from polars.
I'm fine with it, let's make it an empty list.
from polars.