Connect Spark to Oracle using JDBC and pyspark

Oct 13, 2018  │  m. Jul 29, 2023 by manuhortet  │  #spark   #python  

1. Drivers

To use the Java Database Connectivity (JDBC) API, we’ll need to provide our application with the correct drivers. The drivers can be found on this Oracle repository. You’ll need to know the version of the database you want to connect to; you can check it with this SQL statement:

SELECT * FROM V$VERSION

To provide the downloaded driver to your application, add this flag when invoking it:

--driver-class-path oracle/ojdbc8.jar
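
For example, a full launch could look like this (your_script.py is just a placeholder). --driver-class-path makes the jar visible to the driver, and passing --jars as well ships it to the executors, which run the actual JDBC reads:

spark-submit --driver-class-path oracle/ojdbc8.jar --jars oracle/ojdbc8.jar your_script.py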

2. Reading using JDBC

To fill a Spark DataFrame directly from a database using JDBC, we will need a DataFrameReader object:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataFrameReader = spark.read
```

and to define the JDBC-related values we will pass as options when reading from the database:

driver = "oracle.jdbc.OracleDriver"
url = "jdbc:oracle:thin:@//<HOST>:<PORT>/service_name"
user = "your_user"
password = "your_pass"

Note: The JDBC URL is a relatively versatile entity that you have probably seen in a few different formats. In my experience, the one shown above is the only format that works properly with pyspark.
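
As an illustration, a filled-in URL might look like this (the host and service name are made up; 1521 is Oracle’s default listener port):

```python
url = "jdbc:oracle:thin:@//db.example.com:1521/ORCLPDB1"
```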

There’s still one more option to define: dbtable. This option is used as the source of the SELECT statement Spark runs under the hood, so it can be initialized with any valid SQL source. Put simply: dbtable can be the name of a table or a query we want to execute over that table.

dbtable = "your_table"
dbtable_2 = "(SELECT * FROM your_table WHERE time > 10)"

Finally, to fill your dataFrame:

df = dataFrameReader.format("jdbc") \
                    .option("url", url) \
                    .option("driver", driver) \
                    .option("dbtable", dbtable) \
                    .option("user", user) \
                    .option("password", password) \
                    .load()

Now df is a Spark DataFrame filled directly from an Oracle database using JDBC!
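
To sanity-check the load, you can inspect the DataFrame. A minimal sketch; what it prints depends entirely on your table:

```python
df.printSchema()   # columns and types Spark mapped from the Oracle schema
print(df.count())  # total number of rows fetched over JDBC
df.show(5)         # preview the first five rows
```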