Analyzing Swiss Baby Names with R
The Swiss Federal Statistical Office (SFSO) publishes some nice data to play around with. To hone my R programming skills, I grabbed a recently updated dataset of female and male first names given to babies in Switzerland in 2019. You can find the datasets here.
In fact, they have a dataset in px-format that covers the years 2000 to 2019. Here you’ll find a description of the px-format.
The first challenge was to find out how to work with px-files. Thankfully, this is easy: the pxR package takes care of it. It imports a file in px-format and produces a data frame that you can use like any other data frame.
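As a rough sketch of that import step (not necessarily the exact code I used): `read.px()` and the `as.data.frame()` method come from pxR, while the helper name `read_px_as_df` and the file name `babynames.px` are made up for illustration. The guards only keep the snippet from failing when the package or the file is missing.

```r
# Sketch: read a px-file into an ordinary data frame with pxR.
# `read_px_as_df` and "babynames.px" are hypothetical names.
read_px_as_df <- function(path) {
  # Return NULL instead of failing if the file or the package is missing.
  if (!file.exists(path) || !requireNamespace("pxR", quietly = TRUE)) {
    return(NULL)
  }
  px_obj <- pxR::read.px(path)  # parse the px-file into a "px" object
  as.data.frame(px_obj)         # flatten it into a regular data frame
}

names_df <- read_px_as_df("babynames.px")
if (!is.null(names_df)) {
  str(names_df)  # check which columns and types came out of the import
}
```

Running `str()` right after the import is worth it, because the column names are taken from the px-file and may not be what you expect.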
The second challenge was one of the original column names, “Sprachregion / Kanton”. Filtering on this column did not work and kept giving me either a “column name not found” error or an empty dataset. So I changed the column name in the original file to “Kanton”, and it worked.
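An alternative to editing the file would be to deal with the awkward name inside R. The sketch below uses a made-up toy data frame, since I have not checked exactly how the column is spelled after the import (spaces may turn into dots, for example), so printing `names()` first is part of the idea.

```r
# Toy data frame standing in for the imported data (values are made up).
df <- data.frame(
  region = c("Schweiz", "Bern", "Uri"),
  value  = c(10, 4, 1),
  stringsAsFactors = FALSE
)
names(df)[1] <- "Sprachregion / Kanton"  # a non-syntactic column name

# Inspect the exact spelling first; stray spaces or dots are easy to miss.
print(names(df))

# Option 1: refer to the column by its name as a string with [[ ]].
bern_1 <- df[df[["Sprachregion / Kanton"]] == "Bern", ]

# Option 2: rename the column in R, then filter as usual.
names(df)[names(df) == "Sprachregion / Kanton"] <- "Kanton"
bern_2 <- subset(df, Kanton == "Bern")
```

If you filter with dplyr instead, wrapping the original name in backticks inside `filter()` should do the same job as option 1.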
I thought I would start with a density plot to see if it tells me anything about the names:
The names to the left are the ones chosen by few parents, but there are an awful lot of them; let's call them rare names.
The names to the right are the ones chosen by many parents, but there are not a lot of them; let's call them common names.
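To show how such a plot comes together, here is a minimal base-R sketch. The counts are invented for illustration and are not SFSO figures, and the log scale is my own choice here to keep the few common names from squashing everything else against the left edge.

```r
# Made-up number of babies per name: many rare names, a few common ones.
counts <- c(rep(1:3, times = c(60, 25, 15)), 120, 250, 400)

# Kernel density estimate on a log10 scale.
d <- density(log10(counts))

plot(d,
     main = "Babies per name (toy data)",
     xlab = "log10(number of babies with the name)")
rug(log10(counts))  # mark the individual names along the x-axis
```

The tall bump on the left corresponds to the rare names, and the long, flat tail on the right to the common ones.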
At first glance, 2019 appears to be the year with the highest diversity of chosen baby names in this period (2000-2019), for both male and female names.
Some number crunching: for 2019, the dataset contains 2765 unique female names and 2702 unique male names. You can read the SFSO press release (no English version) to find out more about the most common names in 2019.
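Counting unique names per sex can be sketched as follows. The toy data frame and its column names (`year`, `sex`, `name`, `value`) are assumptions for illustration, as is the filter on `value > 0`, which only matters if the dataset lists a name for a year in which nobody received it.

```r
# Toy data with hypothetical column names; the values are made up.
names_df <- data.frame(
  year  = c(2019, 2019, 2019, 2019, 2018),
  sex   = c("female", "female", "male", "male", "female"),
  name  = c("Mia", "Emma", "Noah", "Liam", "Mia"),
  value = c(5, 0, 7, 3, 4),
  stringsAsFactors = FALSE
)

# Keep only names that were actually given in 2019.
given_2019 <- subset(names_df, year == 2019 & value > 0)

# Count distinct names separately for each sex.
unique_per_sex <- tapply(given_2019$name, given_2019$sex,
                         function(x) length(unique(x)))
print(unique_per_sex)
```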
If you want to have a look at the code I wrote, you can find it on GitHub.