Comments (12)

gforsyth commented on June 6, 2024

Ok, I think I have a fix for this.

This is a horrible bit of bookkeeping. For context, this is what is happening:

We set the catalog using a context manager, and we also set the database using a context manager. What's currently happening to you is this weird edge case:

set catalog to comms_media_dev (succeeds)
set database to dart_extensions within comms_media_dev (succeeds)

then we write the table, great! Now we try to change catalog and database back in reverse order and...

set database to default (the previous value that we saved) within comms_media_dev (fails, you do not have permission to access that)

set catalog back to spark_catalog (or the previous value, but we never get here because of the previous error)

So I think what we need to do is instead:

set catalog
set database
write table
set catalog back
set database back

It would be really great if spark would allow for setting both of these values at the same time, but that is apparently not a thing.
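
For illustration, here's a rough sketch of that ordering as a single context manager. This is a hypothetical helper, not the actual ibis implementation, and it assumes PySpark >= 3.4 (where setCurrentCatalog exists):

from contextlib import contextmanager

@contextmanager
def active_catalog_database(session, catalog, db):
    # Remember the current settings so we can restore them afterwards.
    prev_catalog = session.catalog.currentCatalog()
    prev_db = session.catalog.currentDatabase()
    try:
        if catalog is not None:
            session.catalog.setCurrentCatalog(catalog)
        session.catalog.setCurrentDatabase(db)
        yield
    finally:
        # Restore the catalog first, then the database, so we never try to
        # switch to a database that isn't accessible in the temporary catalog.
        if catalog is not None:
            session.catalog.setCurrentCatalog(prev_catalog)
        session.catalog.setCurrentDatabase(prev_db)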

gforsyth commented on June 6, 2024

If you want to try out that PR, @mark-druffel, that would be a huge help until I can get a much more complicated pyspark testing setup put together.

gforsyth commented on June 6, 2024

Hey @mark-druffel -- let's take the conversation over to #9042 -- I think I know what I missed in that PR that is still causing you errors. Thanks for helping us test this out!

EDIT: let's continue over in #9067 where I'm trying to fix this

gforsyth commented on June 6, 2024

Hey @mark-druffel -- we've merged in my fixes from #9067 so hopefully main will be working now -- definitely let us know if things are still failing, and thanks for your help in testing this out!

mark-druffel commented on June 6, 2024

@gforsyth I'm still trying to debug this myself to understand it, but I wanted to post here as well. Let me know if I should open a new issue instead of posting on the closed one.

The fix doesn't seem to be working on my end, assuming I'm using it correctly. I double-checked my Ibis version to make sure I'm on the right one: I installed with ibis-framework[pyspark] @ git+https://github.com/ibis-project/ibis.git@2c1a58e25575f9f0d9876e37b49154e276558526, and the version shown in the environment was:

%pip show ibis-framework
Name: ibis-framework
Version: 9.0.0.dev677
Summary: The portable Python dataframe library
Home-page: https://ibis-project.org
Author: Ibis Maintainers
Author-email: [email protected]
License: Apache-2.0
Location: /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages
Requires: atpublic, bidict, numpy, pandas, parsy, pyarrow, pyarrow-hotfix, python-dateutil, pytz, rich, sqlglot, toolz, typing-extensions
Required-by: 

I tested with the following code and got an error that made me think it was trying to split a string into a tuple, but I had already provided a tuple:

import ibis
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session = spark)

df = ispark.read_parquet("abfss://media_meas_campaign_info/")
ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=["comms_media_dev", "dart_extensions"], overwrite=True)

ValueError: oops
File , line 7
4 ispark = ibis.pyspark.connect(session = spark)
6 df = ispark.read_parquet("abfss://media_meas_campaign_info/")
----> 7 ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=["comms_media_dev", "dart_extensions"], overwrite=True)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:497, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
492 if temp is True:
493 raise NotImplementedError(
494 "PySpark backend does not yet support temporary tables"
495 )
--> 497 table_loc = self._to_sqlglot_table(database)
498 catalog, db = self._to_catalog_db_tuple(table_loc)
500 if obj is not None:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/sql/__init__.py:561, in SQLBackend._to_sqlglot_table(self, database)
559 database = sg.exp.Table(catalog=catalog, db=db)
560 else:
--> 561 raise ValueError("oops")
563 return database

So I tried providing the catalog and database as a string with a dot separator, and the error looks similar to the one I got when I opened the issue initially. It seems like it accepted my catalog argument but dropped my database argument and substituted default:

import ibis
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session = spark)

df = ispark.read_parquet("abfss://media_meas_campaign_info.parquet")
ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database="comms_media_dev.dart_extensions", overwrite=True)

Py4JJavaError: An error occurred while calling o435.setCurrentDatabase.
: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User does not have USE SCHEMA on Schema 'comms_media_dev.default'.
at com.databricks.managedcatalog.UCReliableHttpClient.reliablyAndTranslateExceptions(UCReliableHttpClient.scala:87)
at com.databricks.managedcatalog.UCReliableHttpClient.get(UCReliableHttpClient.scala:139)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$getSchema$1(ManagedCatalogClientImpl.scala:540)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$2(ManagedCatalogClientImpl.scala:4400)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$1(ManagedCatalogClientImpl.scala:4399)
at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException(ErrorDetailsHandler.scala:25)
at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException$(ErrorDetailsHandler.scala:23)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.wrapServiceException(ManagedCatalogClientImpl.scala:151)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.recordAndWrapException(ManagedCatalogClientImpl.scala:4396)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.getSchema(ManagedCatalogClientImpl.scala:533)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.shouldUpdateSchemaMetadata(ManagedCatalogCommon.scala:2199)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.getSchemaMetadataInternal(ManagedCatalogCommon.scala:2652)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.$anonfun$getSchemaMetadata$3(ManagedCatalogCommon.scala:282)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.getSchemaMetadata(ManagedCatalogCommon.scala:282)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.schemaExists(ManagedCatalogCommon.scala:287)
at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$schemaExists$1(ProfiledManagedCatalog.scala:143)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.catalyst.MetricKeyUtils$.measure(MetricKey.scala:672)
at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$profile$1(ProfiledManagedCatalog.scala:60)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.profile(ProfiledManagedCatalog.scala:59)
at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.schemaExists(ProfiledManagedCatalog.scala:143)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.databaseExists(ManagedCatalogSessionCatalog.scala:625)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.requireScExists(ManagedCatalogSessionCatalog.scala:275)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.setCurrentDatabase(ManagedCatalogSessionCatalog.scala:486)
at com.databricks.sql.DatabricksCatalogManager.setCurrentNamespace(DatabricksCatalogManager.scala:156)
at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:100)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:240, in Backend._active_database(self, name)
239 self._session.catalog.setCurrentDatabase(name)
--> 240 yield
241 finally:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:507, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
506 df = self._session.sql(query)
--> 507 df.write.saveAsTable(name, format=format, mode=mode)
508 elif schema is not None:
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
File /databricks/spark/python/pyspark/sql/readwriter.py:1841, in DataFrameWriter.saveAsTable(self, name, format, mode, partitionBy, **options)
1840 self.format(format)
-> 1841 self._jwrite.saveAsTable(name)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
1356 answer, self.gateway_client, self.target_id, self.name)
1358 for temp_arg in temp_args:
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
227 if not isinstance(converted, UnknownException):
228 # Hide where the exception came from that shows a non-Pythonic
229 # JVM exception message.
--> 230 raise converted from None
231 else:
AnalysisException: [RequestId=5899623e-983f-4972-812a-dbdc7706c8a3 ErrorClass=INVALID_PARAMETER_VALUE.MANAGED_TABLE_FORMAT] Only Delta is supported for managed tables

During handling of the above exception, another exception occurred:
Py4JJavaError Traceback (most recent call last)
File , line 7
4 ispark = ibis.pyspark.connect(session = spark)
6 df = ispark.read_parquet("abfss://lmedia_meas_campaign_info/")
----> 7 ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database="comms_media_dev.dart_extensions", overwrite=True)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:504, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
502 query = self.compile(table)
503 mode = "overwrite" if overwrite else "error"
--> 504 with self._active_catalog(catalog), self._active_database(db):
505 self._run_pre_execute_hooks(table)
506 df = self._session.sql(query)
File /usr/lib/python3.10/contextlib.py:153, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
151 value = typ()
152 try:
--> 153 self.gen.throw(typ, value, traceback)
154 except StopIteration as exc:
155 # Suppress StopIteration unless it's the same exception that
156 # was passed to throw(). This prevents a StopIteration
157 # raised inside the "with" statement from being suppressed.
158 return exc is not value
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:242, in Backend._active_database(self, name)
240 yield
241 finally:
--> 242 self._session.catalog.setCurrentDatabase(current)
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
File /databricks/spark/python/pyspark/sql/catalog.py:193, in Catalog.setCurrentDatabase(self, dbName)
183 def setCurrentDatabase(self, dbName: str) -> None:
184 """
185 Sets the current default database in this session.
186
(...)
191 >>> spark.catalog.setCurrentDatabase("default")
192 """
--> 193 return self._jcatalog.setCurrentDatabase(dbName)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
1349 command = proto.CALL_COMMAND_NAME +\
1350     self.command_header +\
1351     args_command +\
1352     proto.END_COMMAND_PART
1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
1356 answer, self.gateway_client, self.target_id, self.name)
1358 for temp_arg in temp_args:
1359 if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:224, in capture_sql_exception.<locals>.deco(*a, **kw)
222 def deco(*a: Any, **kw: Any) -> Any:
223 try:
--> 224 return f(*a, **kw)
225 except Py4JJavaError as e:
226 converted = convert_exception(e.java_exception)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))

gforsyth commented on June 6, 2024

Hey @mark-druffel -- I don't have multiple catalogs set up, so it's very possible I missed something.

That first error you got is because it needs to be a tuple, not a list -- we can almost certainly relax that requirement (and also add a much better error message). So it would be:

ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=("comms_media_dev", "dart_extensions"), overwrite=True)

That said, the second way should be equivalent to the tuple way, so something is a bit wrong. I will try to figure out what's going sideways.
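
For illustration, both forms should normalize to the same (catalog, database) pair. Here's a minimal sketch of that normalization (a hypothetical helper, not ibis's actual parsing, which goes through sqlglot):

def split_database_arg(database):
    # Accept either a ("catalog", "database") tuple or a "catalog.database"
    # string and return a (catalog, db) pair.
    if isinstance(database, tuple):
        catalog, db = database
    else:
        catalog, _, db = database.partition(".")
    return catalog, db

# Both of these resolve to ("comms_media_dev", "dart_extensions"):
split_database_arg(("comms_media_dev", "dart_extensions"))
split_database_arg("comms_media_dev.dart_extensions")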

mark-druffel commented on June 6, 2024

That makes sense. I just tried that, and the error looks the same as the second attempt above. Please let me know if there's anything I can do to help, and thanks so much for your quick response!

ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=("comms_media_dev", "dart_extensions"), overwrite=True)
> Py4JJavaError: An error occurred while calling o435.setCurrentDatabase.
> : com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User does not have USE SCHEMA on Schema 'comms_media_dev.default'.
> 	at com.databricks.managedcatalog.UCReliableHttpClient.reliablyAndTranslateExceptions(UCReliableHttpClient.scala:87)
> 	at com.databricks.managedcatalog.UCReliableHttpClient.get(UCReliableHttpClient.scala:139)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$getSchema$1(ManagedCatalogClientImpl.scala:540)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$2(ManagedCatalogClientImpl.scala:4400)
> 	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$1(ManagedCatalogClientImpl.scala:4399)
> 	at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException(ErrorDetailsHandler.scala:25)
> 	at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException$(ErrorDetailsHandler.scala:23)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.wrapServiceException(ManagedCatalogClientImpl.scala:151)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.recordAndWrapException(ManagedCatalogClientImpl.scala:4396)
> 	at com.databricks.managedcatalog.ManagedCatalogClientImpl.getSchema(ManagedCatalogClientImpl.scala:533)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogCommon.shouldUpdateSchemaMetadata(ManagedCatalogCommon.scala:2199)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogCommon.getSchemaMetadataInternal(ManagedCatalogCommon.scala:2652)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogCommon.$anonfun$getSchemaMetadata$3(ManagedCatalogCommon.scala:282)
> 	at scala.Option.getOrElse(Option.scala:189)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogCommon.getSchemaMetadata(ManagedCatalogCommon.scala:282)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogCommon.schemaExists(ManagedCatalogCommon.scala:287)
> 	at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$schemaExists$1(ProfiledManagedCatalog.scala:143)
> 	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
> 	at org.apache.spark.sql.catalyst.MetricKeyUtils$.measure(MetricKey.scala:672)
> 	at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$profile$1(ProfiledManagedCatalog.scala:60)
> 	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
> 	at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.profile(ProfiledManagedCatalog.scala:59)
> 	at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.schemaExists(ProfiledManagedCatalog.scala:143)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.databaseExists(ManagedCatalogSessionCatalog.scala:625)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.requireScExists(ManagedCatalogSessionCatalog.scala:275)
> 	at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.setCurrentDatabase(ManagedCatalogSessionCatalog.scala:486)
> 	at com.databricks.sql.DatabricksCatalogManager.setCurrentNamespace(DatabricksCatalogManager.scala:156)
> 	at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:100)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
> 	at py4j.Gateway.invoke(Gateway.java:306)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
> 	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
> 	at java.lang.Thread.run(Thread.java:750)
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:240, in Backend._active_database(self, name)
>     239     self._session.catalog.setCurrentDatabase(name)
> --> 240     yield
>     241 finally:
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:507, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
>     506         df = self._session.sql(query)
> --> 507         df.write.saveAsTable(name, format=format, mode=mode)
>     508 elif schema is not None:
> File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
>      46 try:
> ---> 47     res = func(*args, **kwargs)
>      48     logger.log_success(
>      49         module_name, class_name, function_name, time.perf_counter() - start, signature
>      50     )
> File /databricks/spark/python/pyspark/sql/readwriter.py:1841, in DataFrameWriter.saveAsTable(self, name, format, mode, partitionBy, **options)
>    1840     self.format(format)
> -> 1841 self._jwrite.saveAsTable(name)
> File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
>    1354 answer = self.gateway_client.send_command(command)
> -> 1355 return_value = get_return_value(
>    1356     answer, self.gateway_client, self.target_id, self.name)
>    1358 for temp_arg in temp_args:
> File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
>     227 if not isinstance(converted, UnknownException):
>     228     # Hide where the exception came from that shows a non-Pythonic
>     229     # JVM exception message.
> --> 230     raise converted from None
>     231 else:
> AnalysisException: [RequestId=952d4d82-e41a-4892-83ba-d52cbbfce80e ErrorClass=INVALID_PARAMETER_VALUE.MANAGED_TABLE_FORMAT] Only Delta is supported for managed tables
> 
> During handling of the above exception, another exception occurred:
> Py4JJavaError                             Traceback (most recent call last)
> File <command-1657592028371427>, line 7
>       4 ispark = ibis.pyspark.connect(session = spark)
>       6 df = ispark.read_parquet("abfss://media_meas_campaign_info")
> ----> 7 ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=("comms_media_dev", "dart_extensions"), overwrite=True)
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:504, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
>     502 query = self.compile(table)
>     503 mode = "overwrite" if overwrite else "error"
> --> 504 with self._active_catalog(catalog), self._active_database(db):
>     505     self._run_pre_execute_hooks(table)
>     506     df = self._session.sql(query)
> File /usr/lib/python3.10/contextlib.py:153, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
>     151     value = typ()
>     152 try:
> --> 153     self.gen.throw(typ, value, traceback)
>     154 except StopIteration as exc:
>     155     # Suppress StopIteration *unless* it's the same exception that
>     156     # was passed to throw().  This prevents a StopIteration
>     157     # raised inside the "with" statement from being suppressed.
>     158     return exc is not value
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-59b1aa40-b629-4cf3-82ae-6f44d5b1b2f6/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:242, in Backend._active_database(self, name)
>     240     yield
>     241 finally:
> --> 242     self._session.catalog.setCurrentDatabase(current)
> File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
>      45 start = time.perf_counter()
>      46 try:
> ---> 47     res = func(*args, **kwargs)
>      48     logger.log_success(
>      49         module_name, class_name, function_name, time.perf_counter() - start, signature
>      50     )
>      51     return res
> File /databricks/spark/python/pyspark/sql/catalog.py:193, in Catalog.setCurrentDatabase(self, dbName)
>     183 def setCurrentDatabase(self, dbName: str) -> None:
>     184     """
>     185     Sets the current default database in this session.
>     186 
>    (...)
>     191     >>> spark.catalog.setCurrentDatabase("default")
>     192     """
> --> 193     return self._jcatalog.setCurrentDatabase(dbName)
> File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
>    1349 command = proto.CALL_COMMAND_NAME +\
>    1350     self.command_header +\
>    1351     args_command +\
>    1352     proto.END_COMMAND_PART
>    1354 answer = self.gateway_client.send_command(command)
> -> 1355 return_value = get_return_value(
>    1356     answer, self.gateway_client, self.target_id, self.name)
>    1358 for temp_arg in temp_args:
>    1359     if hasattr(temp_arg, "_detach"):
> File /databricks/spark/python/pyspark/errors/exceptions/captured.py:224, in capture_sql_exception.<locals>.deco(*a, **kw)
>     222 def deco(*a: Any, **kw: Any) -> Any:
>     223     try:
> --> 224         return f(*a, **kw)
>     225     except Py4JJavaError as e:
>     226         converted = convert_exception(e.java_exception)
> File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
>     324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
>     325 if answer[1] == REFERENCE_TYPE:
> --> 326     raise Py4JJavaError(
>     327         "An error occurred while calling {0}{1}{2}.\n".
>     328         format(target_id, ".", name), value)
>     329 else:
>     330     raise Py4JError(
>     331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
>     332         format(target_id, ".", name, value))

gforsyth commented on June 6, 2024

Question that might seem a bit odd, but does the table show up in the appropriate place in spite of the error message?
In the output of something like:

ispark.list_tables(database=("comms_media_dev", "dart_extensions"))

mark-druffel commented on June 6, 2024

It does show up
[screenshot]

mark-druffel commented on June 6, 2024

Not sure if this helps at all, but if I set the catalog & db from the Spark session and pass the parameter, it appears the db switches back to default.
[screenshot]

mark-druffel commented on June 6, 2024

^ Sorry, disregard that -- I didn't see your last comment pop through. Yeah, Spark not allowing both to be set at the same time is really annoying, IMHO.

mark-druffel commented on June 6, 2024

Sorry for the delay, Databricks takes forever to start... Now it says the schema can't be found, but I provided an obj parameter 🤔 I also added format='delta' because this time I got an error without it saying managed tables must be in Delta format. I had forgotten to add it in prior tests, but that's a valid error that I wasn't getting before.

import ibis
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session = spark)
df = ispark.read_parquet("abfss://media_meas_campaign_info/")

print(f"Tables: {ispark.list_tables(database = ('comms_media_dev','dart_extensions'))}\n")
print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=('comms_media_dev','dart_extensions'), overwrite=True, format = "delta")
print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
Tables: ['ibis_read_parquet_472gdhsajjakrgoq2mzf7ffz7u', 'ibis_read_parquet_73xgg7oaunet5oyv5rmderp7wa', 'ibis_read_parquet_dzbw5jngqngsxpg6ug7u266w2i', 'ibis_read_parquet_g2kop6usdncf3k67qgk4i7igpi', 'ibis_read_parquet_j6q3xnj7uzcg5ecfsmdty6l4xa', 'ibis_read_parquet_wifrr4hijbevvdhhlv5kivn2ey', 'ibis_read_parquet_xzhqoneiorfqhfiqdk7nqmpe4u', 'raw_media_meas_offer_info', 'raw_target_history', 'standardized_media_meas_campaign_info', 'standardized_media_meas_offer_info', 'standardized_target_history']

Current Catalog: hive_metastore

Current Database: default

[[SCHEMA_NOT_FOUND](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#schema_not_found)] The schema `dart_extensions` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS. SQLSTATE: 42704
File <command-4437199335976496>, line 10
      8 print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
      9 print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
---> 10 ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=('comms_media_dev','dart_extensions'), overwrite=True, format = "delta")
     11 print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
     12 print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:532, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
    529 else:
    530     raise com.IbisError("The schema or obj parameter is required")
--> 532 return self.table(name, database=db)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/sql/__init__.py:137, in SQLBackend.table(self, name, schema, database)
    134     catalog = table_loc.catalog or None
    135     database = table_loc.db or None
--> 137 table_schema = self.get_schema(name, catalog=catalog, database=database)
    138 return ops.DatabaseTable(
    139     name,
    140     schema=table_schema,
    141     source=self,
    142     namespace=ops.Namespace(catalog=catalog, database=database),
    143 ).to_expr()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:459, in Backend.get_schema(self, table_name, catalog, database)
    457 table_loc = self._to_sqlglot_table((catalog, database))
    458 catalog, db = self._to_catalog_db_tuple(table_loc)
--> 459 with self._active_catalog_database(catalog, db):
    460     df = self._session.table(table_name)
    461     struct = PySparkType.to_ibis(df.schema)
File /usr/lib/python3.10/contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    133 del self.args, self.kwds, self.func
    134 try:
--> 135     return next(self.gen)
    136 except StopIteration:
    137     raise RuntimeError("generator didn't yield") from None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:254, in Backend._active_catalog_database(self, catalog, db)
    252     if not PYSPARK_LT_34 and catalog is not None:
    253         self._session.catalog.setCurrentCatalog(catalog)
--> 254     self._session.catalog.setCurrentDatabase(db)
    255     yield
    256 finally:
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.perf_counter()
     46 try:
---> 47     res = func(*args, **kwargs)
     48     logger.log_success(
     49         module_name, class_name, function_name, time.perf_counter() - start, signature
     50     )
     51     return res
File /databricks/spark/python/pyspark/sql/catalog.py:193, in Catalog.setCurrentDatabase(self, dbName)
    183 def setCurrentDatabase(self, dbName: str) -> None:
    184     """
    185     Sets the current default database in this session.
    186 
   (...)
    191     >>> spark.catalog.setCurrentDatabase("default")
    192     """
--> 193     return self._jcatalog.setCurrentDatabase(dbName)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
   1349 command = proto.CALL_COMMAND_NAME +\
   1350     self.command_header +\
   1351     args_command +\
   1352     proto.END_COMMAND_PART
   1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
   1356     answer, self.gateway_client, self.target_id, self.name)
   1358 for temp_arg in temp_args:
   1359     if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
    226 converted = convert_exception(e.java_exception)
    227 if not isinstance(converted, UnknownException):
    228     # Hide where the exception came from that shows a non-Pythonic
    229     # JVM exception message.
--> 230     raise converted from None
    231 else:
    232     raise
