# Working With the Data

This section provides guidelines for getting the most out of the Proximity dataset. These tips are useful for modeling, visualizing, and understanding the data.

## Aggregation

Analyses commonly require aggregating Proximity scores, for example, when looking to get an overall score for a radius around a location.

The recommended way of aggregating data is to use a weighted average where the weight is the Volume Index included in the data. The process below details how to calculate the weighted average score for a given segment. For an expanded version of this process, please see our weighted average example spreadsheet.

### Example Aggregation

In this example, we are going to calculate the weighted average score for the Bookish segment (EA01).

| Block Group ID | Trade Area | EA01: Bookish | Volume Index | Volume * Segment |
| --- | --- | --- | --- | --- |
| 100030149073 | TA1 | 8.12 | 0.09 | 0.73 |
| 100030149081 | TA1 | 27.27 | 1.08 | 29.45 |
| 100030149091 | TA1 | 4.14 | 0.29 | 1.20 |
| 100030149092 | TA1 | 2.35 | 0.69 | 1.62 |
| 100030149093 | TA1 | 0.14 | 3.74 | 0.52 |
| 100030149094 | TA1 | 5.55 | 1.73 | 9.60 |
| 100030150001 | TA1 | 8.12 | 0.09 | 0.73 |

This table shows an example of aggregating Proximity data. Column 3 lists the scores for one segment, column 4 lists the volume index for each of the given block groups, and column 5 is the product of columns 3 and 4.

$Weighted \, Average = \frac{sum(volume \times segment)}{sum(volume)} = \frac{43.86}{7.71} = 5.69$

Note: In this example, each of these block groups intersects with a radius ring. Accordingly, they are all part of TA1 (trade area 1).
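The weighted average above can be sketched in a few lines of Python. This is a minimal illustration using the rows from the example table; the function name is our own, not part of the dataset tooling.

```python
# Rows from the example table: (block group ID, EA01 "Bookish" score, volume index)
rows = [
    ("100030149073", 8.12, 0.09),
    ("100030149081", 27.27, 1.08),
    ("100030149091", 4.14, 0.29),
    ("100030149092", 2.35, 0.69),
    ("100030149093", 0.14, 3.74),
    ("100030149094", 5.55, 1.73),
    ("100030150001", 8.12, 0.09),
]

def weighted_average(rows):
    """Volume-weighted average of a segment score across block groups."""
    weighted_sum = sum(score * volume for _, score, volume in rows)
    total_volume = sum(volume for _, _, volume in rows)
    return weighted_sum / total_volume

print(round(weighted_average(rows), 2))
```

Using the unrounded products keeps the result slightly more precise than summing the rounded column-5 values.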

When aggregating Proximity data, the size of the selected radius or geography can have a large effect on results. For best results (especially in a modeling environment), try multiple radii and compare how each performs: smaller or larger radii suit different use cases, and using small and large radii together can strengthen a model and surface different insights.

While there is definitely room for experimentation on what provides the best performance for your particular model, the upper limit that we usually recommend is a 5-mile radius. The larger the radius, the more likely that aggregate values will score closer to the national average. We typically use radii no smaller than 0.25 miles, though it is possible to go below that. Aggregating across custom trade areas can be appropriate as well.

Our standard recommendation is to start by testing 0.25-, 0.5-, and 1-mile radii, looking for the best performance.
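One way to compare radii is to filter block-group centroids by great-circle distance from a site and compute the volume-weighted score at each radius. This is a sketch under our own assumptions (block groups represented as centroid points; function names and data are hypothetical), not part of the dataset tooling:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def score_within_radius(site, block_groups, radius_miles):
    """Volume-weighted segment score over block groups whose centroid
    falls within radius_miles of the site."""
    selected = [
        (score, volume)
        for lat, lon, score, volume in block_groups
        if haversine_miles(site[0], site[1], lat, lon) <= radius_miles
    ]
    total_volume = sum(v for _, v in selected)
    return sum(s * v for s, v in selected) / total_volume

# Hypothetical block-group centroids: (lat, lon, segment score, volume index)
block_groups = [
    (40.00, -75.00, 10.0, 1.0),  # at the site
    (40.05, -75.00, 2.0, 3.0),   # roughly 3.5 miles north
]
site = (40.00, -75.00)
for radius in (0.25, 0.5, 1, 5):
    print(radius, score_within_radius(site, block_groups, radius))
```

In practice a spatial join against block-group polygons is more accurate than centroid distance, but the centroid approach is a quick first pass.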

## Regional vs. National Datasets

National datasets are commonly used with Proximity data, and strong relationships can often be discovered between performance at a national level and Proximity data. Similar relationships frequently appear between variables at a regional level as well. Given a large nationwide dataset, we recommend exploring the data at both levels.

There is observed regional variation in Proximity data. For example, the Praise and Worship segment is known to score higher in the Southern US.

## Variable Selection: How many variables are appropriate?

Proximity data comes with scores for up to 72 unique social segments and 19 indexes. How many to use in a predictive model varies by use case. In creating these segments and indexes, we took care to select segments that reflect real-life behaviors and interests without being so similar to one another that they pick up on the exact same types of behavior.

Put succinctly, while there are relationships between segments, they are all distinct and represent different patterns of behavior.

When looking to limit the variables selected for a model, common practices apply. Feature importances, correlations, recursive feature elimination, greedy stepwise selection, and many other techniques are all valid ways to work with the data.
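As one illustration of the correlation-based approach, segments can be ranked by the absolute Pearson correlation of their scores with the target variable. This is a minimal pure-Python sketch with made-up data; the segment names are reused from the examples in this document:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical segment scores per location, plus a target metric (e.g. sales)
segments = {
    "Bookish": [2.0, 4.0, 6.0, 8.0],
    "Coffee Connoisseur": [1.0, 1.0, 2.0, 1.0],
}
target = [1.0, 2.0, 3.0, 4.0]

# Rank segments by |correlation| with the target, strongest first
ranked = sorted(segments, key=lambda s: abs(pearson(segments[s], target)), reverse=True)
print(ranked)
```

The top-ranked segments are natural candidates to keep; the same ranking idea extends to feature importances from a fitted model.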

## Can variables be combined?

While there are relationships between variables, each is distinct and represents a unique pattern of behavior found in social media. In our taxonomy, variables are grouped for organization and readability purposes. In most use cases, it is recommended to use segments individually. However, it can be useful to create indexes or use dimensionality reduction techniques for modeling (such as PCA). One example index might be:

Brand Index: combine the top three variables that show up around a brand’s locations. For example, for the retail store Shinola, the “Shinola Index”:

$Shinola \, Index = \frac{Coffee \, Connoisseur + Bookish + Girl \, Squad}{3}$
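The index formula above is just an equal-weight mean of the chosen segment scores, which can be computed per block group. A minimal sketch (the scores here are made up for illustration):

```python
# Hypothetical segment scores for one block group around a store location
scores = {"Coffee Connoisseur": 6.0, "Bookish": 3.0, "Girl Squad": 9.0}

def brand_index(scores, segments):
    """Simple equal-weight index: the mean of the chosen segment scores."""
    return sum(scores[s] for s in segments) / len(segments)

print(brand_index(scores, ["Coffee Connoisseur", "Bookish", "Girl Squad"]))
```

Unequal weights (or a PCA component) are straightforward variations on the same idea.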

Often, standard indexes of variables come with the data. You can learn more about these in the Dataset Structure section.

## Building a Model with Proximity Data

Our datasets are designed to be as easy as possible to use for modeling. In the exploratory phase, it can be helpful to run correlations between Proximity data and the target variable.

In the modeling stage, common ways to evaluate the data are:

1. Build a model with just Geosocial data.
2. Integrate Geosocial data with other data sources into a model (possibly an existing model).
3. Use Geosocial data to predict the residuals of an existing model.

In particular, we recommend that users always attempt #2, as relationships between Geosocial data and other data sources can be particularly powerful.
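Approach #3 (predicting residuals) can be sketched with simple one-variable least squares. All data here is made up, and in practice the base model would be an existing production model rather than a toy fit:

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Existing (non-geosocial) predictor, a geosocial segment score, and a target
x_base = [1.0, 2.0, 3.0, 4.0]
x_geo = [1.0, -1.0, -1.0, 1.0]
y = [5.0, 1.0, 3.0, 11.0]

# Step 1: fit the base model and compute its residuals
a, b = ols_fit(x_base, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x_base, y)]

# Step 2: regress the geosocial variable on the residuals; a meaningful
# slope indicates signal the base model is missing
c, d = ols_fit(x_geo, residuals)
print(round(d, 4))
```

A nonzero slope on the residuals suggests the geosocial variable explains variance the existing model does not, which is exactly the question approach #3 is asking.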