Applied Data Science

Opening a Bubble Tea franchise in Singapore

A quest to find the best neighbourhood to start a Xing Fu Tang Bubble Tea franchise in Singapore.

Bryan Choi
21 min read · Apr 28, 2021

Foreword

Here, we’ll look for the optimal location for the next Xing Fu Tang (a Taiwanese bubble tea franchise) branch in Singapore.

For a more detailed view, you’re welcome to read the Jupyter Notebook here. The GitHub repository for this project can be found here, and all code snippets can be found here.

Introduction

Bubble Tea in Singapore

What comes to mind when you hear of a looming lockdown?

For most Singaporeans, the answer might be bubble tea.

On 21 April 2020, tightened circuit-breaker measures were announced in Singapore to curb the spread of COVID-19. In response, Singaporeans went out in droves to get their fix of bubble tea. Searches for the drink on Google spiked, and the few bubble tea shops that were still allowed to operate during the lockdown found themselves running out of tapioca pearls within hours of opening.

Singaporeans googled “bubble tea” the most on 21 April for the whole of 2020

So relevant is this tea drink in the lives of Singaporeans, that Channel News Asia (a Singapore-based news network) published an article on mental wellness titled “Life without bubble tea: How to cope with change during the pandemic”.

Business Problem — Expanding the Xing Fu Tang Franchise in Singapore

Despite the drink’s popularity, players in this industry are not immune to the impacts of the COVID-19 pandemic. As the retail sector gradually recovers in 2021, the mainstays of the Singaporean bubble tea scene should anticipate a return to normalcy and use the opportunity to scale mindfully.

Photo by George Zheng on Unsplash

Xing Fu Tang is one such mainstay.

With its signature boba pearls stir-fried in brown sugar, this Taiwanese bubble tea company entered the Singaporean market in mid-2019 with its first outpost in Century Square.

It remains relevant in the Southeast Asian city-state, often garnering positive reviews in food publications.

As of April 2021, Xing Fu Tang has 10 outlets in Singapore.

We’ll use data science tools to fetch, visualise and analyse geolocation, demographic, and commercial data for this business problem.

We begin by locating bubble tea stores in Singapore, and checking if they are often located near shopping malls and Mass Rapid Transit (MRT) stations. We’ll then import and visualise population, income and household dwelling data of Singapore’s Planning Areas/Subzones.

With the above, we’ll use a clustering model to find neighbourhoods with similar characteristics, finding patterns in the data that’ll show us where best to open a new Xing Fu Tang outlet. Promising areas should also be situated away from existing Xing Fu Tang outlets.

Data

We’ve defined our business problem (the customer order). Now let’s look at the data (the ingredients) and the data science tools (the utensils) we’ll need.

Data — The Ingredients

Data Science Tools — The Utensils

The project is Python-based and thus most tools are Python functions/packages.

  • Pandas — Data manipulation and analysis library
  • Folium — Mapping module that visualises data on Leaflet.js maps
  • Beautiful Soup — Used for web scraping
  • Geopandas — Adds support for geographic data to pandas objects
  • geopy.geocoders.Nominatim — Function that returns OpenStreetMap data when given an address
  • Shapely — Geometric object manipulation
  • sklearn — Machine learning library
  • Matplotlib — Data visualisation library
  • Branca — Folium element/colour manipulation
  • requests — To send HTTP requests
  • Miscellaneous: math, os

Boundary Data Preparations

Let’s begin by preparing our canvas, namely a map of Singapore with her Planning Area and Subzone boundaries drawn.

We’ll read the .geojson (an open standard format designed for representing simple geographical features) files into their geopandas dataframes, and plot our first map using Folium.

Folium makes it easy to visualise data that’s been manipulated in Python on an interactive Leaflet map.

A reminder that copy-able code snippets can be found here.

Singapore with Planning Areas (black lined) and Subzones (grey lined)

We’ve plotted our first Folium graph. Note that the maps on the Jupyter Notebook preview here are interactive. This applies to all maps moving forward.

Bubble Tea Shop Locations

We’ll follow up by getting the locations of Xing Fu Tang and other bubble tea brand outlets.

Scraping Xing Fu Tang Locations in Singapore

We could get Xing Fu Tang’s outlet locations via the Foursquare API, but for the sake of variety and increased accuracy, we’ll scrape their addresses from xingfutangsg.com using Beautiful Soup.

Beautiful Soup is a library that makes it easy to scrape information from web pages.

We’ll use the Nominatim geolocator from OpenStreetMap. As its search criteria are somewhat strict, our input to the search function will be the name of the mall or general area in which each outlet is located.

Output: ['Ang Mo Kio ', 'Bukit Merah', 'Causeway Point', 'Century Square', 'Compass One', 'Hillion Mall', 'Paya Lebar Square', 'Plaza Singapura', 'Takashimaya', 'Northpoint City']

In the output, we see the names of the 10 Xing Fu Tang locations.

Surprisingly, searching for “Ang Mo Kio MRT” returns no results. However, searching for “Ang Mo Kio” returns the MRT station. We’ll use that instead.
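The fallback described above (try the specific query first, then a broader one) can be sketched as a small helper. This is an illustrative sketch and not the notebook’s code: the helper name `geocode_with_fallback`, the stand-in geocoder `fake_geocode` and its coordinates are all hypothetical, so that the example runs without a network connection.

```python
from typing import Callable, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def geocode_with_fallback(
    queries: Iterable[str],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Return the coordinates of the first query the geocoder can resolve."""
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
    return None  # none of the queries matched


# Hypothetical stand-in for a real geocoder so the sketch runs offline.
# The coordinates are rough illustrative values, not the notebook's output.
FAKE_INDEX = {"Ang Mo Kio, Singapore": (1.37, 103.85)}


def fake_geocode(query: str) -> Optional[Coordinates]:
    return FAKE_INDEX.get(query)


location = geocode_with_fallback(
    ["Ang Mo Kio MRT, Singapore", "Ang Mo Kio, Singapore"], fake_geocode
)
print(location)
```

With geopy, the `geocode` argument could be a thin wrapper around `Nominatim(user_agent=...).geocode(query)`, which returns a location object exposing `latitude` and `longitude`, or `None` when nothing matches.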

Other Bubble Tea Shop Locations in Singapore

Now we’ll use the Foursquare API to retrieve the store locations of some well-known bubble tea brands. I’ve referred to this top-10 list for this project.

The Foursquare Places API provides location-based experiences with diverse information about venues, users, photos, and check-ins.

To access the Foursquare API, you can sign up for a Foursquare Developer account here. The free plan is limited, but its features are enough for this purpose.

Note that we use the Category ID for “Bubble Tea Shop”, which is “52e81612bcbc57f1066b7a0c”. A list of Category IDs can be found here.

Dataframe of most bubble tea shop locations in Singapore

A dataframe is akin to a table: a 2D labelled data structure whose columns represent data features (i.e. attributes) and whose rows represent instances of data.
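For readers who want to see what a category search might look like, here is a hedged sketch that only builds the request URL and does not send it. It assumes the version 2 `venues/search` endpoint with client ID and secret authentication, which was available when this article was written; Foursquare has since released a newer Places API, so check the current documentation before relying on it. The function name `build_search_url`, the default radius and limit, and the placeholder credentials are all illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: the v2 venues/search endpoint (check current Foursquare docs).
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"
BUBBLE_TEA_CATEGORY_ID = "52e81612bcbc57f1066b7a0c"  # Category ID quoted in the article


def build_search_url(lat, lng, category_id, client_id, client_secret,
                     radius_m=1000, limit=50, version="20210401"):
    """Build (but do not send) a venue search URL around one coordinate."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,               # API version date in YYYYMMDD form
        "ll": f"{lat},{lng}",       # latitude,longitude of the search centre
        "categoryId": category_id,  # restrict results to one venue category
        "radius": radius_m,         # search radius in metres (illustrative default)
        "limit": limit,             # maximum number of venues (illustrative default)
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Approximate centre of Singapore, with placeholder credentials.
url = build_search_url(1.3521, 103.8198, BUBBLE_TEA_CATEGORY_ID,
                       client_id="YOUR_CLIENT_ID",
                       client_secret="YOUR_CLIENT_SECRET")
print(parse_qs(urlparse(url).query)["categoryId"])
```

The resulting URL could then be fetched with `requests.get(url)`. Because each query returns a capped number of venues, covering the whole island would likely need several search centres.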

Plotting Locations of Xing Fu Tang and other Bubble Tea outlets

Now that we have location data for Xing Fu Tang outlets and stores of other bubble tea franchises, we can plot them onto our map.

Map of Xing Fu Tang and other bubble tea shop locations

Perfect. We can see that bubble tea outlets are concentrated in the central region, branching mainly towards the north and east. Otherwise, we don’t have much to go on. Let’s look at our next datasets.

MRT Stations and Malls Location

It’s not uncommon to find a bubble tea store close to a mall or a mass transportation facility. Therefore, it’s worth exploring the relationship between MRT/mall locations and bubble tea outlets.

Let’s retrieve, plot and compare these locations. To avoid having too many dots on the map, we’ll use the Folium HeatMap function to plot an MRT HeatMap.

Credit goes to yxlee245 on Kaggle for compiling the coordinates of MRT stations in Singapore.

MRT Heatmap

Here we see that the locations of bubble tea shops in Singapore are somewhat correlated with the locations of MRT stations.

There are some areas with a high density of MRT stations but with minimal/no bubble tea stores. Some examples include:

  • Bukit Timah/Novena/Tanglin
  • Kallang/Geylang boundary
  • Tuas/Pioneer

One might assume that these are predominately residential areas with abundant MRT connections but minimal commercial zones, or industrial areas with basic commercial enterprises.

Nonetheless, the majority of bubble tea outlets are situated in the vicinity of an MRT station.

Shopping Mall Locations vs. Bubble Tea Shop Locations

We proceed by querying the locations of Shopping Malls from the Foursquare API. The Category ID for “Shopping Mall” is “4bf58dd8d48988d1fd941735”.

First 5 rows of mall locations details dataframe

Similarly, we plot a heatmap of mall locations in Singapore.

Malls Heatmap

Here we also see a relationship between shopping mall and bubble tea shop locations. Some examples of this relationship include:

  • Orchard (Planning Area)
  • Downtown Core (Planning Area)
  • Tampines East
  • Geylang East
  • Jurong Gateway

We see that most shopping malls have at least 1 bubble tea shop nearby, but a sizable group of bubble tea shops isn’t close to any shopping mall.

Compiling Location Data

To give some features (characteristics) to our Subzones so that we may group them later on, we’ll need to count how many bubble tea shops, Xing Fu Tang outlets, MRT stations, and malls are in a given Subzone.

We’d like a function that returns 3 when given Lavender’s polygon and the coordinates of the 3 outlets inside that Subzone.

For this, we define a function that counts the number of locations that are in a given area. Inputs include the coordinates of a point location (e.g. bubble tea outlets) and the area boundary (e.g. Lavender subzone).

Great. The notebook code loops through all Subzones and counts the number of bubble tea stores in each. This is repeated for Xing Fu Tang outlets, MRT stations, and malls.

First 5 rows of the Subzone dataframe

In the end, we’ll get something like the above.
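To make the counting step concrete, here is a dependency-free sketch of a function that counts how many points fall inside an area boundary. It uses the classic ray-casting test on a simple polygon without holes. The notebook itself may well rely on Shapely (for example `Polygon.contains`) instead; the square “Subzone” and the point coordinates below are made up for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def count_points_in_area(points: Sequence[Point], polygon: Sequence[Point]) -> int:
    """Count how many of the points lie inside the polygon."""
    return sum(point_in_polygon(p, polygon) for p in points)


# Hypothetical square "Subzone" and five hypothetical outlet locations.
subzone = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
outlets = [(1.0, 1.0), (2.0, 3.0), (3.5, 0.5), (5.0, 5.0), (-1.0, 2.0)]

print(count_points_in_area(outlets, subzone))
```

Three of the five hypothetical outlets sit inside the square, so the function reports 3, mirroring the Lavender example. Real Subzone boundaries can be multi-part polygons, which is one reason a geometry library is the safer choice in practice.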

Demographic Data

Besides location data, we’d like to also use some demographic data to cluster the Subzones.

The metrics we’ll look at are:

  • Population of 20–44 year olds by Subzones
  • Dwelling Type by Subzones
  • Median Income by Planning Area

Let’s start with Population data.

Data of Population of 20–44 year olds by Subzones

To have a more accurate clustering of Subzones, it’s worth identifying the age segment(s) that are more relevant to our studies.

A market study conducted in China in 2019 shows that the majority of bubble tea consumers were born between 1980 and 1999, which translates to an age range of 22 to 41 in 2021.

Let’s assume that:

  1. Bubble tea consumption patterns are similar in both China and Singapore.
  2. The age range remains similar in both 2019 and 2021.

We’ll select the most relevant ages in the data below and sum them by age group and Subzone. The data, sourced from the Singapore Department of Statistics (DOS), is segmented by Subzone/Planning Area, Age Group, Sex, and Type of Dwelling.

For population data, let’s create a new dataframe for Subzone and age segmentation. The age brackets relevant to us are ‘20_to_24’, ‘25_to_29’, ‘30_to_34’, ‘35_to_39’ and ‘40_to_44’. We’ll create a column called “pop_total20_44” to sum up the numbers in these ranges by Subzone.

First 10 rows of age dataframe

We’ve now gotten a dataframe that contains population segmented by Subzone and age, as well as generated a column with the total population of Subzone residents aged between 20 and 44 years.

Data of Type of Dwelling by Subzones

Another data feature that may help us cluster similar Subzones is the Type of Dwelling. Information about the Types of Dwellings can be found here.

This data property (pun intended) may give insights into the type of residents, real estate value, and general wealth of a Subzone.

Let’s give some weights to the type of dwellings found in Singapore:

  • Others: 1
  • HDB 1- and 2-Room Flats: 2
  • HDB 3-Room Flats: 3
  • HDB 4-Room Flats: 4
  • HDB 5-Room and Executive Flats: 5
  • HUDC Flats (excluding those privatised): 6
  • Condominiums and Other Apartments: 7
  • Landed Properties: 8

These dwelling weights are arbitrary (a 4-room HDB flat might not cost twice as much as a 1- or 2-room HDB flat), but they are unequally weighted to represent relative values. One area for improvement would be to adjust the weights so they better represent property values.

We’ll calculate a Dwelling Index weighted by the population of residents living in each type of dwelling. This is not to be confused with the dwelling weights above.

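The Dwelling Index is a weighted average: for each Subzone, we multiply each dwelling type’s weight by the number of residents living in that type, sum the products, and divide by the Subzone’s total number of residents.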

First 5 rows of updated Subzone dataframe

Here we see that Fort Canning has a Dwelling Index of 7, hinting at it being a region of relatively higher wealth.
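A small sketch may help here. It reads the Dwelling Index as a population-weighted average of the dwelling weights listed earlier, that is, the sum of (weight × residents) over all dwelling types divided by the total residents. That reading is an assumption based on the description in the text, and the resident counts below are invented for illustration.

```python
# Dwelling weights as listed in the article.
DWELLING_WEIGHTS = {
    "Others": 1,
    "HDB 1- and 2-Room Flats": 2,
    "HDB 3-Room Flats": 3,
    "HDB 4-Room Flats": 4,
    "HDB 5-Room and Executive Flats": 5,
    "HUDC Flats (excluding those privatised)": 6,
    "Condominiums and Other Apartments": 7,
    "Landed Properties": 8,
}


def dwelling_index(residents_by_type):
    """Population-weighted average of dwelling weights for one Subzone."""
    total_residents = sum(residents_by_type.values())
    if total_residents == 0:
        return None  # no residents, so the index is undefined
    weighted_sum = sum(
        DWELLING_WEIGHTS[dwelling] * residents
        for dwelling, residents in residents_by_type.items()
    )
    return weighted_sum / total_residents


# Hypothetical Subzone where every resident lives in a condominium.
all_condo = {"Condominiums and Other Apartments": 500}

# Hypothetical Subzone with a mix of flats and condominiums.
mixed = {
    "HDB 3-Room Flats": 1000,
    "HDB 4-Room Flats": 3000,
    "Condominiums and Other Apartments": 1000,
}

print(dwelling_index(all_condo))
print(dwelling_index(mixed))
```

Under this reading, an index of exactly 7 would arise when all of a Subzone’s residents live in condominiums and other apartments, while the mixed example works out to (3,000 + 12,000 + 7,000) / 5,000 = 4.4.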

Plotting of Population Data

Now that we’ve gotten our Population (20–44 yo) and Dwelling Index data, let’s plot them onto maps. We’ll use the Folium choropleth function to generate the following maps.

Choropleth Maps display divided geographical areas or regions that are coloured, shaded or patterned in relation to a data variable.

Choropleth of population (20–44 yo)

On the map, we see that Singaporeans aged between 20 and 44 years mostly live in Subzones away from the central region, mainly in Jurong West, Woodlands, Yishun, Sengkang/Punggol and Tampines/Bedok, among other Subzones.

We also see that current Xing Fu Tang outlets are close to the majority of these population hotspots, suggesting that our methodology is in line with Xing Fu Tang’s Singapore strategy.

Let’s now plot the Dwelling Index choropleth.

Plotting of Dwelling Index Data

Choropleth of dwelling index

On the map above, we see signs of 3 trends.

Firstly, we see that 7 of Xing Fu Tang’s 10 current outlets are located in areas of medium dwelling index (the higher the index, the “wealthier” the Subzone). These are also generally areas with large populations of 20–44 year olds.

Secondly, we see that 3 of Xing Fu Tang’s 10 current outlets are within the central region, in areas of medium-high dwelling index.

Thirdly, we rarely see any bubble tea outlets in areas of high dwelling index. These are also generally areas of lower population of 20–44 year olds.

These trends apply to other bubble tea brand outlets as well.

We can hypothesise that most bubble tea brands cater towards areas of medium wealth, with the exception being in the city centre (where they potentially cater to white-collar workers or retain a presence for brand relevance). Areas of high dwelling index may also be less densely populated as landed properties don’t house as many people as flats.

Median Income by Planning Area

To add substance to our analysis on Dwelling Types, we’ll also analyse the median income of residents by Planning Area.

The income data segments residents into monthly income (SGD) brackets ranging from “Below 1,000” to “5,000–5,999” to “12,000 & Over”. The data is sourced from the Singapore Department of Statistics (DOS).

Screenshot of .csv file

To get the median income, we sum the bracket counts across each row to get the total population, then divide the total by 2 (giving the median population number). The median income bracket is the bracket in which the running cumulative count first reaches the median population number.

Taking Ang Mo Kio as an example, we see that the total population is 101,300, half of that is 50,650, which falls into the income bracket of 3,000–3,999 (i.e. median income of SGD3,500).
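The bracket lookup can be sketched with a running total. In this illustration only the total of 101,300 comes from the example above; the shortened bracket list and the per-bracket counts are invented so that the arithmetic is easy to follow.

```python
from itertools import accumulate


def median_income_bracket(brackets, counts):
    """Return the bracket in which the cumulative count first reaches half the total."""
    total = sum(counts)
    if total == 0:
        return None  # no residents recorded
    half = total / 2  # the "median population number"
    for bracket, running_total in zip(brackets, accumulate(counts)):
        if running_total >= half:
            return bracket
    return None  # not reached when counts are non-negative


# Shortened, hypothetical bracket list (the real data has more brackets).
BRACKETS = ["Below 1,000", "1,000-1,999", "2,000-2,999",
            "3,000-3,999", "4,000-4,999", "5,000 & Over"]

# Invented counts that add up to the 101,300 quoted for Ang Mo Kio.
counts = [12_000, 14_000, 16_000, 15_000, 12_000, 32_300]

print(sum(counts) / 2)
print(median_income_bracket(BRACKETS, counts))
```

With these invented counts the running totals are 12,000, 26,000, 42,000 and then 57,000, which is the first to reach 50,650, so the lookup lands on the 3,000–3,999 bracket.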

Plotting of Median Income Data

Choropleth of median income

Plotting the median income onto the map, we see that the trend agrees with the trends we see in the Dwelling Index, namely that bubble tea outlets are mainly situated in middle-income areas.

So far, we’ve retrieved all the data required and visualised them to explore their contents. In the Analysis chapter, we’ll prepare the dataframes to be used in the machine learning algorithm.

Methodology

The goal of this report is to identify optimal Subzones for Xing Fu Tang’s next branch in Singapore.

We’ve retrieved/calculated the following data:

  • Number of existing Xing Fu Tang outlets by Subzones
  • Number of bubble tea shops from other franchises by Subzones
  • Number of Mass Rapid Transit (MRT) Stations by Subzones
  • Number of Shopping Malls by Subzones
  • Population of Target Demographic (20–44 years old) by Subzones
  • Median Income of Residents by Planning Area
  • Aggregation of Dwelling Types (Dwelling Index) by Subzones
  • Boundary Data for Subzones and Planning Area

Next, we’ll use:

  • K-nearest Neighbour algorithm to fill some of the missing values.
  • StandardScaler to scale the data accordingly (to avoid biasing the model towards features with large values).
  • K-means method to cluster Subzones into similar groups of Subzones (clusters).
  • A range of K values, Silhouette Score and the Elbow method to determine the best K to use for K-means.

We’ll present these findings graphically and come to a final decision.

Analysis

Data Review

We begin by reviewing the data we have and determining whether it is fit for processing (garbage in = garbage out).

Our work dataframe shows missing values

A preview of what we have now shows that:

  1. We have a mixture of columns with information (area names, geodata) and numeric values. We will have to split this dataframe into cluster_info and cluster_val as our machine learning algorithm takes only numeric values.
  2. We have some missing values for Dwelling Index and Median Income, so we’ll have to plug those holes.

Data Cleansing

First, we split the columns.

We’ll also have to remove the “pop_total” column as that data is linearly related to “pop_total20_44”. Keeping “pop_total” in would result in multicollinearity, which gives unwanted additional weight to the population data.

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.

Second, let’s clean up some missing values.

For rows with 0 or NaN values only (e.g. row 1), we’ll fill the NaN values with 0.

For rows with NaN and other values, we’ll perform missing data imputation using the K-nearest Neighbour algorithm.

In short, K-nearest Neighbour works by approximating the value of a missing datapoint based on the values of neighbouring datapoints.

“Birds of a feather flock together”
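As a rough illustration of the idea, the sketch below fills a missing value with the average of that feature over the k most similar rows, measuring similarity only on features that both rows have. It is a simplified, dependency-free stand-in and not the notebook’s code; scikit-learn’s `KNNImputer` is the usual library choice. The function name `knn_impute` and the toy rows are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue  # a neighbour must have a value to lend
                # Compare the two rows only on columns observed in both.
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [neighbour_value for _, neighbour_value in candidates[:k]]
            if nearest:
                filled[i][col] = sum(nearest) / len(nearest)
    return filled


# Hypothetical rows: [population aged 20-44, dwelling index]
rows = [
    [1000, 3.0],
    [1100, 3.5],
    [5000, 7.0],
    [1050, None],  # dwelling index missing
]

print(knn_impute(rows, k=2)[3])
```

The last row’s population is closest to the first two rows, so its missing dwelling index becomes the average of 3.0 and 3.5. In a real pipeline, features on very different scales would normally be scaled before distances are computed.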

Great, let’s move on.

Data Scaling

Next, we scale the data to avoid bias towards features with larger numbers (e.g. population, median income).

StandardScaler works by subtracting each feature’s mean from its datapoints and scaling the datapoints to unit variance.
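For a single feature, that transformation amounts to z = (x − mean) / standard deviation. Here is a minimal sketch for one column of made-up numbers. It uses the population standard deviation, which is also what scikit-learn’s `StandardScaler` uses; the function name `standardize` is illustrative.

```python
from statistics import pstdev


def standardize(values):
    """Scale one feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(value - mean) / std for value in values]


# Made-up feature values with mean 5 and standard deviation 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = standardize(feature)

print(scaled)
```

After scaling, a feature measured in the tens of thousands (such as population) and one measured in single digits (such as the Dwelling Index) contribute on a comparable footing to the distance calculations in K-means.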

Finding the Best K value for K-means

Despite the similar name, K-nearest Neighbour isn’t the same as K-means.

“K” in K-nearest Neighbour refers to the number of nearest neighbours of an unlabelled datapoint that the algorithm checks in order to label that datapoint.

“K” in K-means refers to the number of clusters the algorithm will attempt to “group” a dataset into. K = 5 means you’ll get 5 clusters in the end.

For our case, we’ll be using the Elbow method (plotted in blue below) and the Silhouette Score method (plotted in red below) to find the best K value for K-means.

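The Elbow method plots the Sum of Squared Distances, which measures the error between each cluster’s centre and its datapoints, against K. The smaller the value the better, but since it always shrinks as K grows, we look for the “elbow” where the improvement levels off.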

Silhouette Score method studies the separation distance between clusters. The larger the value the better.

Potential Silhouette scores range from -1 to 1:

  • -1: the point is in the wrong cluster
  • 0: the point is on the decision boundary between 2 neighbouring clusters
  • +1: the point is far away from neighbouring clusters. (Great!)

Credits: Tony Xu’s materials on clustering similar neighbourhoods have been a large help

In the graph above, the Silhouette score peaks at the optimal K value. Our ideal number of clusters for K-means clustering is 11.
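To show what the two metrics measure, here is a dependency-free sketch that computes both for a given assignment of points to clusters. It is written for illustration only (scikit-learn provides `silhouette_score`, and a fitted `KMeans` exposes the sum of squared distances as `inertia_`). The four toy points and the two labelings are made up, and the sketch assumes at least two clusters.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def sum_of_squared_distances(points, labels):
    """Squared distance from every point to its own cluster centre, summed."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centre = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(distance(p, centre) ** 2 for p in members)
    return total


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(distance(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]  # groups the nearby points together
bad_labels = [0, 1, 0, 1]   # pairs each point with a far-away one

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, sum_of_squared_distances(points, labels),
          round(silhouette_score(points, labels), 3))
```

The sensible grouping gives a small sum of squared distances and a Silhouette score close to 1, while the poor grouping gives a large sum and a negative score. Repeating the calculation for a range of K values yields the two curves used to pick K.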

K-means Clustering with Optimal K

Let’s run the algorithm with the optimal K.

First 5 rows of clustered Subzone dataframe

In the dataframe above, we see that each Subzone has been assigned to a cluster.
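For intuition on how cluster labels like these come about, here is a compact sketch of the K-means loop (Lloyd’s algorithm): assign each point to its nearest centre, move each centre to the mean of its points, and repeat until nothing changes. It is a teaching sketch, not the notebook’s code, which would more typically call scikit-learn’s `KMeans`. The six toy points and the deliberately poor starting centres are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centres, max_iterations=100):
    """Plain K-means; K is the number of initial centres supplied."""
    centres = [tuple(c) for c in initial_centres]
    k = len(centres)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged: centres stopped moving
        centres = new_centres
    return labels, centres


# Two obvious groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Deliberately poor start: both initial centres sit in the first group.
labels, centres = kmeans(points, initial_centres=[(0, 0), (0, 1)])

print(labels)
print(centres)
```

Even from the poor start, the loop separates the two groups within a few iterations on this toy data. On real data the outcome can depend on the starting centres, which is why library implementations usually run several initialisations and keep the best result.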

Plotting Clusters onto Map

Let’s colour-code the clusters and plot them onto a map. We’ll use branca.colormap (imported here as cmp), a utility module to define the cluster colours.

Map of 332 Subzones clustered into 11 clusters

Voila, that’s Singapore’s Subzones clustered based on the traits we specified above.

Comparing Clusters with Bubble Tea Locations

Let’s look at the clusters based on bubble tea locations.

Map of clusters vs. bubble tea outlet locations

Here, we see that Subzones containing Xing Fu Tang outlets fall into clusters 1 (light grey) and 6 (dark green), with cluster 6 being very competitive for Xing Fu Tang.

Cluster 8 (dark red) contains Subzones with many bubble tea outlets but no Xing Fu Tang outlets.

Clusters 2 (dark grey), 4 (dark blue), 10 (dark orange) and 11 (lilac) have a distinctively low number of bubble tea outlets.

Comparing Clusters with MRT/Mall Locations

Let’s now look at clusters vs. MRT/Mall locations.

Map of clusters vs. MRT/mall locations

Since we’ve established that bubble tea shops are often found near MRT stations and Malls, it’s no surprise to see similar trends here.

Scoring Clusters to find Best Clusters

We’re not done. Currently, we do not have a metric to rank the clusters from best (good locations for Xing Fu Tang’s new branch) to worst (locations Xing Fu Tang should avoid).

We can give each Subzone in a cluster a score based on the number of MRT/Malls (positive scores) and the number of bubble tea shops (negative scores) in a Subzone.

I’ve elected to go with the following scores:

  • MRT: +3 (I assume 1 MRT station isn’t fully saturated until it has 3 bubble tea shops)
  • Mall: +5 (I assume 1 Mall isn’t fully saturated until it has 5 bubble tea shops)
  • Bubble tea shop from other brands: -1
  • Existing Xing Fu Tang outlet: -10
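Put together, a Subzone’s score is 3 × MRT stations + 5 × malls − 1 × other bubble tea shops − 10 × Xing Fu Tang outlets. A minimal sketch of that arithmetic follows; the key names and the two example Subzones with their counts are hypothetical.

```python
# Score weights as chosen in the article.
SCORE_WEIGHTS = {
    "mrt": 3,
    "mall": 5,
    "bubble_tea": -1,     # shops from other brands
    "xing_fu_tang": -10,  # existing Xing Fu Tang outlets
}


def subzone_score(counts):
    """Weighted sum of venue counts for one Subzone (missing counts are 0)."""
    return sum(
        weight * counts.get(venue, 0) for venue, weight in SCORE_WEIGHTS.items()
    )


# Hypothetical counts, not taken from the notebook.
well_connected = {"mrt": 1, "mall": 2, "bubble_tea": 4, "xing_fu_tang": 0}
already_served = {"mrt": 1, "mall": 1, "bubble_tea": 2, "xing_fu_tang": 1}

print(subzone_score(well_connected))
print(subzone_score(already_served))
```

The first hypothetical Subzone scores 3 + 10 − 4 = 9, while the second drops to 3 + 5 − 2 − 10 = −4 because it already hosts a Xing Fu Tang outlet.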

Both Bedok North and Aljunied scored highly as they have relatively few bubble tea shops for the number of MRT/Malls within their boundaries.

However tempting it is to conclude our report using just the individual Subzone scores, we need to look at clusters as a whole, since clusters take into account other features such as Dwelling Index, Population (20 to 44 yo) and Median Income.

Clusters with Best Average Scores

To get the best cluster, we “merge” all Subzones into their clusters by averaging their feature values. We pick the best cluster by the largest average Subzone score.

Dataframe of clusters scored

Description of data in cluster 7

We’ve found our top cluster (cluster 7). It has 11 Subzones in it (count = 11) with an average Subzone score of 6.3.
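The ranking step can be sketched as a group-and-average over (cluster, score) pairs. With pandas this would typically be a `groupby` followed by `mean`; the plain-Python version below shows the same arithmetic on made-up cluster labels and scores.

```python
from collections import defaultdict


def average_score_by_cluster(scored_subzones):
    """Average the Subzone scores within each cluster."""
    totals = defaultdict(lambda: [0.0, 0])  # cluster -> [score sum, Subzone count]
    for cluster, score in scored_subzones:
        totals[cluster][0] += score
        totals[cluster][1] += 1
    return {
        cluster: score_sum / count
        for cluster, (score_sum, count) in totals.items()
    }


# Hypothetical (cluster label, Subzone score) pairs.
scored_subzones = [(7, 9), (7, 8), (7, 2), (6, -4), (6, 1), (9, 0)]

averages = average_score_by_cluster(scored_subzones)
best_cluster = max(averages, key=averages.get)

print(averages)
print(best_cluster)
```

Averaging keeps a single high-scoring Subzone from dominating: a cluster ranks well only if its Subzones score well as a group.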

Feature Analysis of Best Cluster

Let’s explore our best cluster even further. First, we’ll compare the best cluster with its counterparts on all 7 features (e.g. Median Income, Dwelling Index etc.)

Credits: Tony Xu for a great bar chart function

Feature analysis of cluster 7 against other clusters

In the bar charts above, we see Cluster 7 ranked 3rd in terms of the average Population of 20–44 year olds.

In terms of average median income and dwelling index, Cluster 7 is consistent with our previous considerations that bubble tea shops are usually located in areas of medium wealth.

Cluster 7 has high connectivity owing to it having the highest average MRT count of 2+.

Xing Fu Tang can expect moderate competition, as Subzones in Cluster 7 have on average 2+ bubble tea outlets.

Cluster 9 is worth discussing. It has the highest average Population of 20–44 year olds and has a relatively low number of bubble tea outlets. However, it doesn’t rank highly due to a lack of malls and MRT stations, which indicates it being largely residential with low commercial foot traffic.

Creme de la Creme (Top 5 Promising Subzones)

Let’s further narrow down Subzones in Cluster 7 to the Top 5 best candidates for Xing Fu Tang’s new outlet.

Tied for first place are Bedok North and Aljunied with Subzone scores of 9 each. In second place we have Lavender, Serangoon Central and Jurong West Central each with 8 points.

Plotting Best Subzones

Let’s plot these Subzones on a map, along with the rest of Cluster 7.

Let’s plot the bubble tea shops locations onto the map to compare further.

Except for Aljunied, the Top 5 Subzones are located at midpoints between Xing Fu Tang outlets or, in the case of Jurong West Central, in a potential region to expand into.

We can also see that Aljunied is selected due to a relative lack of Bubble Tea Outlets in the Subzone.

Let’s now plot the MRT/Mall locations onto the map.

Consistent with the notion that bubble tea shops are usually situated close to MRT/Malls, we see most selected Subzones having an abundance of these 2 locations.

This concludes our analysis on the best Subzone to locate a new Xing Fu Tang.

Results and Discussion

In this report, we’ve clustered Singapore’s 332 Subzones into 11 clusters using the K-means method. The clustering is based on 7 features, namely the following:

  • Number of existing Xing Fu Tang outlets by Subzones
  • Number of bubble tea shops from other franchises by Subzones
  • Number of Mass Rapid Transit (MRT) Stations by Subzones
  • Number of Shopping Malls by Subzones
  • Population of Target Demographic (20–44 years old) by Subzones
  • Median Income of Residents by Planning Area
  • Aggregation of Dwelling Types (Dwelling Index) by Subzones

Data visualisation and initial exploratory data analysis show a potential relationship between locations of Bubble Tea Shops with MRT Stations and Shopping Malls.

With additional demographic and types of dwelling data, we can infer that Subzones with a high population of 20–44 year olds and a medium amount of wealth (medium income and moderate dwelling types) have a higher number of MRT stations and Malls. We can postulate that these Subzones are more densely populated with medium-cost flats, and see larger commercial foot traffic.

This, in turn, attracts a larger number of Bubble Tea shops.

By scoring each Subzone positively by the number of MRT stations and malls it contains, and scoring it negatively when bubble tea shops are present, we can determine promising Subzones for Xing Fu Tang’s next branch.

Based on the data given, we’ve found that the 5 most promising Subzones are (in descending order) Bedok North, Aljunied, Lavender, Serangoon Central and Jurong West Central.

This study only serves as an initial recommendation for further insights gathering. As the study only looks at the 7 aforementioned features, the accuracy of the model is limited. Other factors such as real-time commercial traffic data, education levels, occupation data, and strongly correlated venue types (e.g. a large overlap between bubble tea and hotpot restaurant consumers) can be considered to improve the accuracy of the model.

Additionally, while the Subzone division is the smallest census division for Singapore, other methods to divide Singapore into smaller units of division (e.g. hexagonal grid) may offer insights on a better level of resolution, giving accuracy to the level of metres.

Conclusion

The goal of this report is to identify promising areas to locate a Xing Fu Tang outlet in Singapore. Dividing Singapore into its 332 subzones, we fetched Subzone data for the relevant features and clustered similar Subzones into 11 clusters.

We scored each cluster by the number of MRT stations, malls, and bubble tea outlets it contains. The promising Subzones identified had higher counts of MRT stations and malls and a lower count of bubble tea outlets. The 5 most promising Subzones are (in descending order) Bedok North, Aljunied, Lavender, Serangoon Central and Jurong West Central.

The next steps include further scrutiny of these 5 Subzones for their suitability and additional considerations for other factors that might improve the model’s accuracy.

Code

https://github.com/BryanJian/Coursera_Capstone/blob/master/Battle%20of%20the%20Neighbourhoods%20(Singapore).ipynb

References

Data Sources
