Methodology
Overview
The Geosocial Proximity dataset was created using unsupervised machine learning to identify common patterns in social media posts. Once these patterns are identified, each area is scored on how strongly it exhibits them. The dataset reports these scores as percentiles to make them easier to interpret.
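As a rough illustration of turning raw scores into percentiles, the sketch below ranks each area's score against all others. The function name and the "percentage of scores less than or equal to this one" definition are assumptions made for illustration; the exact percentile method used for the dataset is not specified here.

```python
from bisect import bisect_right


def percentile_ranks(raw_scores):
    """Map each raw score to the percentage of scores less than or equal to it.

    Assumption: this "less than or equal" definition is one common way to
    compute a percentile rank, not necessarily the one used for the dataset.
    """
    ordered = sorted(raw_scores)
    n = len(ordered)
    # bisect_right counts how many sorted scores are <= the given score.
    return [100.0 * bisect_right(ordered, score) / n for score in raw_scores]


# Hypothetical raw scores for four areas.
ranks = percentile_ranks([0.2, 0.8, 0.5, 0.5])
```

Under this definition, tied raw scores receive the same percentile and the highest-scoring area always receives 100.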
Raw Data
Raw data comes from Twitter, Instagram, Foursquare, and other platforms. The vast majority of it, however, comes from Twitter and Instagram, which are more commonly used for geotagged posts. Every social media post used has an associated latitude and longitude.
Missing Values
A score is provided for a location only if we can collect enough data to score it accurately. We have found that, for a model to produce a score that accurately reflects the social media it was given, it needs at least 10 unique posts, 5 unique users, and 100 unique data points. By "data point", we mean a contextualized word. Words are better understood in full context. For example, the word "coffee" in isolation could refer to anything from an artisanal coffee shop to the color of a dress, but in the post "Need my coffee this morning", the context tells the model that the post should score high for the Daily Grind segment.
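The minimum-data rule can be sketched as a simple check. The three thresholds come from the text above; the function and parameter names are hypothetical, and the use of `None` to stand for a missing score is an assumption about how missing values might be represented.

```python
# Thresholds stated in the methodology.
MIN_UNIQUE_POSTS = 10
MIN_UNIQUE_USERS = 5
MIN_UNIQUE_DATA_POINTS = 100


def is_scorable(unique_posts, unique_users, unique_data_points):
    """Return True only when all three minimum-data thresholds are met."""
    return (
        unique_posts >= MIN_UNIQUE_POSTS
        and unique_users >= MIN_UNIQUE_USERS
        and unique_data_points >= MIN_UNIQUE_DATA_POINTS
    )


def score_or_missing(score, unique_posts, unique_users, unique_data_points):
    """Return the score, or None (assumed missing-value marker) if data is too sparse."""
    if is_scorable(unique_posts, unique_users, unique_data_points):
        return score
    return None


# A location with plenty of posts but only 4 unique users gets no score.
result = score_or_missing(0.7, unique_posts=50, unique_users=4, unique_data_points=500)
```

All three conditions must hold together, so a location with many posts from very few users is still treated as missing.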
Timespan/Vintage
Scores are generated from a 2-year timespan, and each dataset is named for the period it covers. For example, a dataset labeled “jan01_2020...” is generated from social media posted after December 31st, 2017 and before January 1st, 2020. In testing, this timespan has proven the most useful for analytics and location-based use cases.
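To make the window concrete, the sketch below computes the first and last days covered by a dataset from its vintage date. The function name is hypothetical, and the code assumes the vintage date also exists two years earlier (so it does not handle a February 29 vintage).

```python
from datetime import date, timedelta


def vintage_window(vintage):
    """Return (first_day, last_day), inclusive, of the 2-year window before `vintage`.

    Assumption: `vintage` is the date in the dataset label, and posts dated
    on or after it are excluded.
    """
    # Same month and day, two years earlier (raises ValueError for Feb 29
    # when the earlier year is not a leap year).
    first_day = vintage.replace(year=vintage.year - 2)
    # The window ends the day before the vintage date.
    last_day = vintage - timedelta(days=1)
    return first_day, last_day


first_day, last_day = vintage_window(date(2020, 1, 1))
```

For a January 1st, 2020 vintage this gives January 1st, 2018 through December 31st, 2019, matching the example in the text.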
Buffering
When scoring a given geography, we use the social media generated inside that geography. In most cases, no social media from outside the geography is used to determine the score. In the special cases where it is, an additional variable called "BUFFER_METERS" is included to indicate that a buffer was used to take in social media from outside the geography itself.
For example, if the "BUFFER_METERS" variable is included with a value of 400, a 400-meter buffer was applied to the geography, and all social media inside the buffer was also included in the score calculation.
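When reading a dataset, one way to handle the optional variable is to treat its absence as "no buffer". The sketch below assumes each row is available as a dictionary keyed by column name; the helper name and the "GEOID" column in the usage lines are hypothetical.

```python
def buffer_meters(row):
    """Return the buffer distance in meters for a row, or 0.0 if none was applied.

    Assumption: a missing or empty "BUFFER_METERS" value means the score used
    only social media generated inside the geography.
    """
    value = row.get("BUFFER_METERS")
    if value is None or value == "":
        return 0.0
    return float(value)


# Hypothetical rows: one from a buffered dataset, one from an unbuffered one.
buffered = buffer_meters({"GEOID": "A1", "BUFFER_METERS": "400"})
unbuffered = buffer_meters({"GEOID": "B2"})
```

Normalizing to a number up front lets downstream code compare buffered and unbuffered datasets without checking for the column each time.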
Non-USA countries
Some special considerations apply to data generated for countries outside of the USA.
Canada
The Proximity dataset was created using data generated in the USA. Because English is the dominant language in most of Canada, the scoring methodology was applied to Canada with minimal change. The geography level used for Canadian datasets is Dissemination Area (DA).
Mexico
To generate scores for Mexico, automatic Spanish-to-English translation was applied to the raw social media data before areas were scored. The geography level used for Mexican datasets is Área geoestadística básica (AGEB).
United Kingdom
The geography level used for UK datasets is Lower Layer Super Output Area (LSOA).