
Property Risk Analysis Pilot Using Databricks

By Riley Nastase
June 21, 2023

If you’ve been following the news recently, you’ve likely seen many articles about shifting real estate trends around the world. The premise of many of these articles is that the expensive, high-class office buildings that tend to be the flagships of metro areas are becoming less and less popular. As interest rates rise and work-from-home becomes more common, one can only wonder how these real estate trends will evolve from here and what impact they might have on property owners.

At Rearc, we want to do more than just wonder. We want to use the data and tools we have at our disposal to help our partners and clients make informed decisions about their properties. Recently, for the Databricks Data + AI Summit 2023, we decided to zoom in on San Francisco, the host city, and build a product that showcases the power of the Rearc Data Platform, Databricks, and Delta Sharing in the context of these trends.

What if sentiment analysis, locally relevant economic metrics, reports of closing businesses, and more could be used not just to remind us of an uncertain industry, but to help property owners make better decisions about their commercial properties and tenants? Thanks to a vast amount of privately and publicly available data, as well as the remarkable capabilities of Databricks, we’ve been able to create a pilot default risk score.

What We Built

The results of our analysis are presented below. Here you can interact with a selection of properties across San Francisco. If you hover over the graphic, you’ll find an assortment of descriptive information, including the property’s address, the name of the business that occupies the property, and risk scores computed according to each of the following categories:

Each of these risk scores falls between 0 and 1, where the risk score is intended to communicate the likelihood that a given tenant/property will default. The higher-risk properties are indicated by a darker shade of red, while relatively safe properties are represented with green, with varying shades of yellow covering the properties in between.
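
As a rough illustration of the color scale described above, the sketch below maps a score between 0 and 1 onto a green, yellow, and dark red ramp. The function name `risk_color`, the three anchor colors, and the 0.5 midpoint are assumptions made for this example; they are not the values used in the actual visualization.

```python
# Hypothetical anchor colors (RGB); the real visualization may use different shades.
GREEN, YELLOW, DARK_RED = (0, 128, 0), (255, 255, 0), (139, 0, 0)


def risk_color(score: float) -> str:
    """Map a risk score in [0, 1] to a hex color on a green-yellow-red ramp."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must be between 0 and 1")
    if score <= 0.5:
        # Lower half of the scale: blend from green to yellow.
        start, end, t = GREEN, YELLOW, score / 0.5
    else:
        # Upper half of the scale: blend from yellow to dark red.
        start, end, t = YELLOW, DARK_RED, (score - 0.5) / 0.5
    # Linear interpolation of each RGB channel.
    r, g, b = (round(s + (e - s) * t) for s, e in zip(start, end))
    return f"#{r:02x}{g:02x}{b:02x}"


for score in (0.0, 0.5, 1.0):
    print(score, risk_color(score))
```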

You can find a full-screen visualization, complete with San Francisco building footprints, at Rearc’s website. It may take a minute or two to load.

Why Understanding Property Risk Matters

Understanding property risk is crucial for property owners and investors as it enables them to make informed decisions about their commercial properties and tenants. The applications of this analysis stretch far and wide, but here are a few examples of how property risk analysis can be used to inform decision-making:

In summary, understanding property risk is vital for owners seeking to navigate the evolving world of real estate. However, a comprehensive property risk solution requires more than just understanding the importance of risk analysis. It necessitates the sourcing, transformation, and harmonization of a diverse data landscape. This is where Rearc’s capabilities shine. With our expertise in all stages of the data life cycle, we make it our mission to minimize the complexities of this process.

How We Built It

Obtaining the Data

The Rearc Data team has a robust data platform built on top of Apache Airflow, which we have used in collaboration with multiple partners to deliver a variety of complex data requests over the years. We were already sourcing and publishing data to Unity Catalog from several of the sources used in this analysis, including the Bureau of Labor Statistics, the Bureau of Economic Analysis, and the Federal Reserve.

We also used data from the San Francisco Government Open Data project. From this source, we extracted building footprints, local business information, and other locally relevant datasets, which we were easily able to bring into our workflow with our Data Platform and Delta Sharing. To use these alongside our own data, we simply load the Delta tables from Unity Catalog using the Databricks notebook interface.

# `spark` is the SparkSession that Databricks notebooks provide by default.

# Load Rearc datasets
interest_rates = spark.sql(
  "SELECT * FROM rearc_catalog.fs_federalreserveboard.frb_h15")
sector_employment = spark.sql(
  "SELECT * FROM rearc_catalog.stat_bls.bls_employment_national_data_monthly")
metro_gdp = spark.sql(
  "SELECT * FROM rearc_catalog.stat_employ_usa.employ_usa_gdp_by_county_metro_yearly_bea")

# Load San Francisco Open Data
buildings = spark.sql(
  "SELECT * FROM rearc_catalog.stat_lnd.lnd_usa_sanfrancisco_building_footprints_static_sfgov")
businesses = spark.sql(
  "SELECT * FROM rearc_catalog.stat_lnd.lnd_usa_sanfrancisco_registered_businesses_static_sfgov")

Finally, we incorporated GDELT, a news “firehose” dataset, into our analysis. GDELT is a very large dataset (more than 8 trillion datapoints!) that indexes almost every news item in the world. This would be quite difficult to source using standard methods, but the data can be found and accessed via the Databricks Marketplace.

For this project, we heavily utilized Databricks, a unified analytics platform that integrates Apache Spark and provides collaborative tools for processing and analyzing large-scale data. In addition, the Databricks Marketplace includes a wide assortment of datasets to add to our analysis. Because we have existing pipelines that publish our data to Databricks, this tool was a natural choice for the data collection phase of this work. By pulling data from Rearc’s data catalog using the Marketplace and Delta Sharing, we gain seamless access to diverse and up-to-date data sources, greatly accelerating our analysis.

Generating Risk Scores using Databricks

The goal of this product is to show how Databricks and Delta Sharing can help estimate a property risk score. With this score, we want to provide property owners with valuable insights to inform their decision-making. Because we were already using Databricks to centralize all of our data, we decided to keep using its capabilities (particularly notebooks, Spark, and SQL) to generate the risk scores. We walk through the process below.

1. Incorporating Historical Features

Historical features such as interest rates, GDP growth, and others were crucial in our risk analysis. To incorporate trends, we utilized a combination of aggregation and time-series methods, allowing us to capture important historical patterns and their potential impact on property risk.
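
To make the idea of combining aggregation with time-series methods more concrete, here is a minimal sketch in plain Python. It summarizes the most recent observations of a time-ordered series with a trailing mean and a change over the window. The function name `trend_features`, the window length, and the sample values are all hypothetical; the pilot's actual feature definitions are not described in this post.

```python
from statistics import mean


def trend_features(values, window=3):
    """Summarize the last `window` points of a time-ordered numeric series."""
    if len(values) < window + 1:
        raise ValueError("need at least window + 1 observations")
    recent = values[-window:]
    # The observation just before the window is the comparison baseline.
    baseline = values[-window - 1]
    change = values[-1] - baseline
    return {
        "rolling_mean": mean(recent),  # aggregation over the window
        "change": change,              # absolute move across the window
        "pct_change": change / baseline if baseline else None,
    }


# Made-up monthly values, oldest first (illustrative only, not real data).
monthly_rates = [1.0, 2.0, 3.0, 4.0, 5.0]
print(trend_features(monthly_rates, window=3))
```

In a Databricks notebook the same quantities could be computed with Spark window functions or pandas; plain Python is used here only to keep the sketch self-contained.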

2. Handling GDELT Data

GDELT is a large-scale dataset, and its size presented challenges. However, using Databricks’ capabilities with the code below, we were able to efficiently scan the GDELT database for news items relating to the businesses we found in San Francisco.

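from pyspark.sql.functions import col

# companies_list is assumed to be defined earlier: the business names to match
gdelt_extract = (
  spark.sql("""
    SELECT DATE, TONE, EXPLODE(SPLIT(ORGANIZATIONS, ';')) AS organization
    FROM `external_shares_gdelt`.`<user_catalog>`.`gkg_v1_daily`
    """
  )
  # Keep only the news items that mention one of our businesses
  .where(col('organization').isin(companies_list))
  # Collect the filtered rows to the driver as a pandas DataFrame
  .toPandas()
)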

Additionally, we created two sentiment scores: visibility (measuring the level of recognition for a company on a scale of 0 to 1) and perception (evaluating whether a company is perceived positively or negatively). See the plot below for an example from March 2023.
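
One way such scores could be derived from (organization, tone) pairs is sketched below. This is an assumption for illustration, not the formula used in the pilot: visibility is taken as a company's mention count divided by the largest mention count in the batch, and perception as the mean tone of its mentions. The function name `sentiment_scores` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sentiment_scores(records):
    """Compute per-organization visibility and perception from (name, tone) pairs."""
    tones = defaultdict(list)
    for organization, tone in records:
        tones[organization].append(tone)
    if not tones:
        return {}
    # Assumed definition: visibility is relative to the most-mentioned company.
    max_mentions = max(len(t) for t in tones.values())
    return {
        org: {
            "visibility": len(t) / max_mentions,  # falls in (0, 1]
            "perception": mean(t),                # sign shows positive or negative
        }
        for org, t in tones.items()
    }


# Made-up records for two fictional companies.
records = [("acme", 2.0), ("acme", -1.0), ("acme", 2.0), ("globex", -3.0)]
for org, scores in sentiment_scores(records).items():
    print(org, scores)
```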

3. Creating Synthetic Tenants/Properties Data

In addition to the wealth of publicly available data, many property owners also store their own internal data about the histories of their tenants and/or properties. To demonstrate how property owners could utilize their own data to enrich this solution, we generated synthetic data which includes:

4. Scaling and Harmonizing Data

To ensure consistency across different risk categories and datasets, we applied scaling and harmonization techniques. These methods allowed us to normalize and standardize the data, facilitating a comprehensive assessment of property risk.
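
As a hedged example of what scaling and harmonization can look like, the sketch below applies min-max scaling to bring a raw feature into the 0 to 1 range and then combines per-category scores with a weighted average. Min-max scaling, the weights, and the category names (`economic`, `sentiment`, `tenant_history`) are assumptions chosen for illustration; the post does not specify which techniques or weights the pilot used.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no ranking information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(category_scores, weights):
    """Weighted average of category scores that are already on a 0-1 scale."""
    total_weight = sum(weights.values())
    weighted_sum = sum(category_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Made-up raw values for one feature across three properties.
print(min_max_scale([10, 20, 30]))

# Hypothetical category scores and weights for a single property.
scores = {"economic": 0.4, "sentiment": 0.7, "tenant_history": 0.1}
weights = {"economic": 2, "sentiment": 1, "tenant_history": 1}
print(combine_scores(scores, weights))
```

Because every input to the weighted average lies between 0 and 1 and the weights are positive, the combined score stays in the same range, which keeps it comparable with the individual category scores.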

Conclusion

By harnessing the capabilities of the Rearc Data Platform, Databricks, and Delta Sharing, we can provide property owners and investors around the world with the tools they need to make informed decisions. Our data-driven risk analysis facilitates proactive risk management and enables individuals to navigate the rental property market with confidence.

Our ability to support your data endeavors doesn’t stop here, though. Every day in the data world brings transformative advancements: Generative AI, machine learning, data marketplaces, data clean rooms, and the list of cutting-edge technologies goes on. The data landscape is changing rapidly, and Rearc is a partner poised to support organizations on their journey not simply to survive these changes, but to thrive in them. Our capabilities and experience enable us to adapt quickly to these shifting tides, and we’d like to help our partners do the same so they can harness the full potential of data-driven insights.

More information about this project can be found at our Product Page. Additionally, if you would like access to the content and slides we presented at the Databricks Data + AI Summit 2023, or if you would like to know more about how Rearc Data can help you advance your data capabilities, fill out the form below and we’ll be in touch!

Source Data Sets