---
title: "Using python for PostgreSQL uploads"
author: "James P. Gilbert"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Python based uploads}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# Introduction

Note, this feature should be considered experimental until further notice.

Using DatabaseConnector with postgresql without the bulk upload functionality will often be slow, and enabling bulk uploads requires the installation and configuration of postgresql binaries on the system, which may be challenging or restricted in many environments. Similarly, that method requires writing a `data.frame` to disk, which is needlessly slow if the data is already in csv format (it must first be read into R, only to be written out again). As a consequence, this package supports bulk uploading through python with a small amount of configuration. This approach uses no R memory; csv files are transferred directly through python and uploads will be considerably faster.

This process uses the `psycopg2` python library, which can be installed via compilation or in binary form. This vignette demonstrates usage with the `psycopg2-binary` package.

# Installing psycopg2

## Using a virtualenv

ResultModelManager provides an interactive function for enabling the python uploads. If `psycopg2` is already installed, this function will do nothing. However, if there is no available binary (and the `reticulate` package is not installed), you will be asked to install these packages. To do this, run the following:

```{r eval = F}
ResultModelManager::enablePythonUploads()
```

Alternatively, you can trigger the installation manually:

```{r eval = F}
ResultModelManager::install_psycopg2()
```

## Using conda or system python installs

Please consult the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html) on how to install the `psycopg2-binary` package.

# Usage within functions

By default, this functionality is not enabled when uploading tables, and the function `pyUploadCsv` will fail. To enable it and directly upload a csv, try the following example code:

```{r eval = FALSE}
ResultModelManager::enablePythonUploads()

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql",
  server = "myserver.com/some_db", # for postgresql the server string is <host>/<database>
  port = 5432,
  user = "me",
  password = "s"
)
connection <- DatabaseConnector::connect(connectionDetails)

# Write a large data.frame to disk as a csv
bigData <- data.frame(
  id = 1:1e6,
  label = paste(1:1e6, "bottle(s) on the wall")
)
readr::write_csv(bigData, "my_massive_csv.csv")

ResultModelManager::pyUploadCsv(connection,
  table = "my_table",
  filepath = "my_massive_csv.csv",
  schema = "my_schema"
)
```

Note that you are not required to call `ResultModelManager::enablePythonUploads()` every time. As an alternative, add the following line to your .Renviron file (note that this will automatically assume that the setup of the python libraries has been completed):

```
RMM_USE_PYTHON_UPLOADS=TRUE
```
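If you would rather not edit your .Renviron, the same variable can be set for the current session only. A minimal sketch; as with the .Renviron approach, this assumes the python libraries have already been set up:

```{r eval=FALSE}
# Equivalent to the .Renviron entry, but applies only to the current R session;
# assumes psycopg2/python setup has already been completed
Sys.setenv(RMM_USE_PYTHON_UPLOADS = "TRUE")
```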
The astute reader will realize that this approach still requires an IO call, writing the csv to disk. In many situations this will be a major bottleneck. A much more sensible approach is to use a string buffer. Thankfully, the author of this package has provided such an interface!

```{r eval=F}
ResultModelManager::pyUploadDataframe(connection,
  table = "my_table",
  data = bigData, # pass the data.frame itself; the argument name may differ, see ?pyUploadDataframe
  schema = "my_schema"
)
```

Note that this approach is already implemented for you when you use the `uploadResults` functionality. That's right - if you call `ResultModelManager::enablePythonUploads()` (and you are using postgres) you will be able to upload your large R `data.frames` to postgres!

```{r eval=F}
ResultModelManager::enablePythonUploads()

ResultModelManager::uploadResults(
  connectionDetails,
  schema = "my_schema",
  resultsFolder = "my_results_folder",
  tablePrefix = "cm_",
  purgeSiteDataBeforeUploading = FALSE,
  specifications = getResultsDataModelSpec()
)
```

Better yet, calling `ResultModelManager::enablePythonUploads()` before uploading results from any OHDSI package will automatically give you this fast(er) upload functionality.
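For the curious, the sketch below illustrates, in broad strokes, the kind of psycopg2 `COPY` call that makes this approach fast. It is illustrative only, not the package's actual implementation, and it reuses the hypothetical connection settings and csv file from the examples above:

```{r eval=FALSE}
# Illustrative only: streaming a csv into postgres via psycopg2's COPY
# support avoids row-by-row INSERT statements, which is where the speed
# advantage comes from
psycopg2 <- reticulate::import("psycopg2")
builtins <- reticulate::import_builtins()

con <- psycopg2$connect(
  host = "myserver.com",
  port = 5432L,
  user = "me",
  password = "s",
  dbname = "some_db"
)
cursor <- con$cursor()
csvFile <- builtins$open("my_massive_csv.csv", "r")
cursor$copy_expert("COPY my_schema.my_table FROM STDIN WITH CSV HEADER", csvFile)
con$commit()
csvFile$close()
con$close()
```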