Every year I try to pick a side project to work on, to maintain/expand my Python and data skills. In January 2020 I found out that Seattle makes its pet license data openly available on this site, so I thought it might be fun to play with that.
This page serves as the summary, covering the time series, stats, and mapping I did. The specific notebooks are all in the Git repo as well.
Given 11,000 unique pet names extracted from Seattle's pet registration data, about 51% are "people" names, meaning they have appeared in the US Census at some point.
The actual distribution of names among 49,000 pets, however, shows that 80% of pets have "people" names.
If you're wondering why it might matter, a recent (2022) study showed that the type of name had an effect on how long it took shelter pets to be adopted. And I'd like to do some further research on what types of names there are -- I can already see some trends in terms of food names, color names, "punny" names, and pop culture names!
Initially I made a time series to show the number of adoptions year over year.
The main finding is that there are a lot more adoptions recorded in the last couple of years! Is it a record-keeping issue? An explosion of bureaucracy? Just that many more pets? Who knows.
I also wondered if there were observable spikes in pet adoptions in any given month, or day of the week. To find this out, I plotted another time series. From my initial observations, it didn't look like anything much was happening on that front.
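The approach can be sketched with pandas like so. The column names here are assumptions about the export schema, and the rows are toy data standing in for the real license file:

```python
import pandas as pd

# Toy rows standing in for the license data; "License Issue Date" is an
# assumed column name -- check it against the actual CSV export.
df = pd.DataFrame({
    "License Issue Date": pd.to_datetime(
        ["2017-03-01", "2019-06-15", "2019-07-04", "2019-07-05"]
    ),
    "Species": ["Cat", "Dog", "Dog", "Cat"],
})

# Licenses per year (the year-over-year series)
per_year = df["License Issue Date"].dt.year.value_counts().sort_index()

# Counts by calendar month and by day of week, to look for seasonal spikes
per_month = df["License Issue Date"].dt.month.value_counts().sort_index()
per_day = df["License Issue Date"].dt.day_name().value_counts()

print(per_year)
```

With the real data, plotting `per_year` (or `per_month`/`per_day`) gives the time series described above.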
Another way to slice and dice the data was to run some basic stats on it. Since ZIP codes were available in the pet data, I hypothesized, for example, that zip codes with higher numbers of rentals would have a correspondingly higher percentage of cats (as opposed to any other species). After aggregating the data and running the numbers, this hypothesis was disproved. Oh, well. I'll still run through how I did it.
I first made a pivot table using Pandas, cleaning the zip code data and aggregating it to show the number of cats and dogs per zip code. I then used USZipCode to add other data about each zip code.
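A minimal sketch of that aggregation step, using `pd.crosstab` as the count pivot. The column names are assumptions, and the uszipcode enrichment is left as a commented sketch rather than run here:

```python
import pandas as pd

# Toy rows standing in for the cleaned license data (column names are assumptions)
df = pd.DataFrame({
    "ZIP Code": ["98101", "98101", "98103", "98103", "98103"],
    "Species": ["Cat", "Dog", "Cat", "Cat", "Dog"],
})

# Count pivot: one row per zip code, one column per species
pivot = pd.crosstab(df["ZIP Code"], df["Species"])

# Enrichment sketch (not run here) -- uszipcode exposes census-style fields:
# from uszipcode import SearchEngine
# search = SearchEngine()
# pivot["population"] = [search.by_zipcode(z).population for z in pivot.index]

print(pivot)
```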
After completing the pivot table, I moved to R in order to make some scatterplots and run some correlation tests. I thought, for example, that the population of a zip code might predict the total number of pets in it, and so forth. There is a weak correlation between population size and total pets, fwiw.
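The same kind of correlation check can be sketched in Python (I did mine in R with `cor.test`); the numbers below are made-up stand-ins for per-zip population and pet totals, chosen just to show the shape of the test:

```python
from scipy.stats import pearsonr

# Made-up per-zip values, purely illustrative -- not the real Seattle numbers
population = [12000, 34000, 56000, 23000, 45000]
total_pets = [300, 800, 1100, 500, 950]

# Pearson correlation coefficient and its p-value
r, p = pearsonr(population, total_pets)
print(f"r = {r:.2f}, p = {p:.4f}")
```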
I also wanted to see how the pet data mapped out. I made two choropleth maps: one showing just the Seattle area, the other showing all the zip codes that appeared throughout Washington state.
I haven't yet finished playing with this data, and I have a lot of language-related questions about it. How many of these pet names are "human-sounding"? How many are food-related? How many are in another language? How many have last names? How many are "punny"?
My plan was to start with the "human" part, because my data science friend advised finding a variable that could slice the data roughly in half (which this one might?).
I've cleaned and deduped the name data, and I even made up an algorithm for deciding whether the names were "human-sounding": I started by comparing the names against US baby names for the last 50 years. Unfortunately, there are a lot of pet names and the site keeps kicking me out, so I think I need to clean the names a bit more, and probably find a way to batch process.
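A local, batch version of that check might look like this. The name lists are toy stand-ins -- in practice `baby_names` would be loaded from the SSA baby-name files or census data rather than hard-coded:

```python
# Toy stand-ins: load baby_names from the SSA/census name files in practice
baby_names = {"luna", "bella", "max", "charlie", "oliver"}
pet_names = ["Luna", "Mochi", "MAX ", "Captain Whiskers", "Bella"]

def is_human_sounding(name: str) -> bool:
    """Normalize a pet name and test membership in the human-name set."""
    parts = name.strip().lower().split()
    # Judge multi-word names ("Captain Whiskers") on their first word
    return bool(parts) and parts[0] in baby_names

human = [n for n in pet_names if is_human_sounding(n)]
print(human)
```

Doing the comparison locally like this, instead of through a website, sidesteps the rate-limiting problem entirely.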
And there are a lot of other things I want to do! Map stuff, for example, is on my list. And whatever else crosses my mind.