After a few weeks of the Data Management and Analytics class and of working with R, I attended the DBS Analytics Society meeting on the 22nd of October.
Thanks to Darren, we had some pastries for breakfast. Eating them with a double-shot coffee woke me up on a Saturday morning.
Darren prepared four different quizzes for us. Although I could finish only two of them in two hours, it was a very helpful meeting for practicing R with different data sets.
The first quiz was about basic R commands and how to use them. It was easier than the second quiz. I have uploaded my code, along with the questions, to my GitHub account. I made one mistake on my first attempt: the first question asked for the sum of the output, but I gave the output itself as the answer.
The second quiz was more challenging. It involved analyzing baby names in the US. We had two data sets: Baby Names US 1900 and US Baby Names 2000. The first few questions were about reading the data into R, calculating how many babies were born in a given year, and finding the 10 most popular names by year and gender. The most difficult question for me was the 12th, which asked us to find the 10 most changed male baby names. Here is my code for that question, written with a bit of help from Darren.
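# Which are the 10 most changed male baby names?
# Rename only the third (count) column so the two years stay distinct after merging
names(pop1900)[3] <- 'n1900'
names(pop2000)[3] <- 'n2000'
# Full outer join keeps names that appear in only one of the two years
merged_data <- merge(pop1900, pop2000, by = c('Name', 'Sex'), all = TRUE)
# A name missing from one year counts as 0 births in that year
merged_data$n1900[is.na(merged_data$n1900)] <- 0
merged_data$n2000[is.na(merged_data$n2000)] <- 0
merged_data$diff <- abs(merged_data$n1900 - merged_data$n2000)
# Largest change first, ties broken alphabetically by name
ordered_diff <- merged_data[order(-merged_data$diff, merged_data$Name), ]
ordered_diff_male <- subset(ordered_diff, Sex == 'M')
head(ordered_diff_male, 10)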
I renamed the third column in each data set to n1900 and n2000 before merging so that I wouldn’t end up with two columns with the same name. After that, I merged the two data sets and changed the NA values to 0. To find the change for a name between 1900 and 2000, I subtracted the number of births in 2000 from the number of births in 1900. To avoid negative values, I also used the abs function to get the absolute value of each difference.
After that, I ordered the data by difference and created a new variable holding the subset of male names. Here are the top 10 most changed male names between 1900 and 2000:
“Jacob, Michael, Matthew, Joshua, Christopher, Nicholas, Andrew, Daniel, Tyler, Brandon”
The next question asked for the same thing for females. Here are the top 10 most changed female names between 1900 and 2000:
“Emily, Hannah, Madison, Ashley, Alexis, Samantha, Jessica, Sarah, Taylor, Lauren”
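Since the male and female questions differ only in the value of Sex, the filter, sort and top-10 steps could be wrapped in one small function. This is only a sketch: top_changed is a hypothetical helper name, and the toy data frame uses made-up names and counts rather than the quiz data, just to show the shape of the input after merging.

```r
# Hypothetical helper: top n rows for one sex, largest change first,
# ties broken alphabetically by name.
top_changed <- function(merged, sex, n = 10) {
  one_sex <- merged[merged$Sex == sex, ]
  ordered <- one_sex[order(-one_sex$diff, one_sex$Name), ]
  head(ordered, n)
}

# Made-up counts (not from the quiz data), already merged and NA-filled.
toy <- data.frame(
  Name  = c("Ann", "Ann", "Bea", "Cal", "Dee"),
  Sex   = c("F", "M", "F", "M", "F"),
  n1900 = c(100, 0, 10, 50, 0),
  n2000 = c(40, 3, 10, 90, 70),
  stringsAsFactors = FALSE
)
toy$diff <- abs(toy$n1900 - toy$n2000)

top_f <- top_changed(toy, "F", n = 2)
top_m <- top_changed(toy, "M", n = 2)
print(top_f$Name)
print(top_m$Name)
```

With the real merged data, the same call with "M" or "F" and n = 10 would cover both questions.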
Do you think it is a coincidence that AshleyMadison.com, the biggest affair website, uses two of the top four most changed female baby names between 1900 and 2000? I don’t know, but let’s find out.
# How many Ashleys and Madisons were there in 1900 and 2000?
# Wrapping an assignment in parentheses also prints the result
> (Ashley1900 <- subset(pop1900, pop1900$Name == 'Ashley'))
       Name Sex n1900
3550 Ashley   M     5
> (Ashley2000 <- subset(pop2000, pop2000$Name == 'Ashley'))
        Name Sex n2000
4     Ashley   F 17993
19136 Ashley   M    82
> (Madison1900 <- subset(pop1900, pop1900$Name == 'Madison'))
        Name Sex n1900
2840 Madison   M    17
> (Madison2000 <- subset(pop2000, pop2000$Name == 'Madison'))
         Name Sex n2000
3     Madison   F 19966
18694 Madison   M   138
There were no female babies named Ashley or Madison in the 1900 data, while there were 17,993 Ashleys and 19,966 Madisons in 2000. So, do you think Darren Morgenstern checked any data when he chose the name AshleyMadison.com while establishing the platform back in 2001?
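As a quick cross-check, the counts printed above can be put through the same merge-and-difference steps on their own. This is a minimal sketch: the numbers are copied from the console output, while the data frame names am1900, am2000 and am are made up for the illustration.

```r
# Counts copied from the console output above; object names are illustrative.
am1900 <- data.frame(Name  = c("Ashley", "Madison"),
                     Sex   = c("M", "M"),
                     n1900 = c(5, 17),
                     stringsAsFactors = FALSE)
am2000 <- data.frame(Name  = c("Ashley", "Ashley", "Madison", "Madison"),
                     Sex   = c("F", "M", "F", "M"),
                     n2000 = c(17993, 82, 19966, 138),
                     stringsAsFactors = FALSE)

# Full outer join: the female rows have no 1900 match, so n1900 is NA there.
am <- merge(am1900, am2000, by = c("Name", "Sex"), all = TRUE)

# Treat a missing name/sex pair as zero recorded births.
am$n1900[is.na(am$n1900)] <- 0
am$n2000[is.na(am$n2000)] <- 0

am$diff <- abs(am$n1900 - am$n2000)
am <- am[order(-am$diff, am$Name), ]
print(am)
```

The two female rows come out on top with differences of 19,966 and 17,993, because their 1900 counts were filled in as 0. The male rows change by only 121 and 77.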
Also, here are two graphs that compare names between 1900 and 2000 for each gender.
Although I have done only two of the four quizzes so far, I am planning to finish the rest. I will post my challenges to the blog when I finish them.