Migrants visualized

First of all, this is in no way a statement on the immigration crisis in Europe. I do have opinions. But it is more a reaction or reflection on three maps I saw on this page.

Danish television channel TV2 is illustrating the number of refugees, or perhaps rather immigrants, received in EU member states in the period 2015 to 2017. This is the map showing the number of immigrants to the EU in 2015

Note Germany. Germany welcomed the highest absolute number of immigrants. What piqued my interest, though, is that while this might be a good illustration of the numbers, it is not really the relevant comparison. Yes, Germany welcomed more refugees than Denmark did. But Germany is a rather larger country than Denmark. For a given value of “fair”, it is only fair that Germany takes more refugees than smaller countries.

A more relevant comparison might be the number of refugees relative to population. Or area. Sweden saw (at that time) no problem with welcoming a huge number of migrants because, as they said, there is a lot of unpopulated space in Sweden, plenty of room for everyone! Or perhaps GDP is a better measure. Richer countries should shoulder a larger part of the challenge than poorer countries.

I’m not concerned here with what is fair. What concerns me is that the graphic is misleading. Let’s make an attempt at fixing that. Or at least present a slightly different perspective on the data.

I’ll try to illustrate the number of migrants as a proportion of the population in the different countries. The data is “stolen” directly from the news channel. They have it from UNHCR, Eurostat and the European Parliament.

The first step will be to get the data.

url <- "http://nyheder.tv2.dk/udland/2018-06-28-se-kortet-saa-mange-asylansoegere-har-de-forskellige-eu-lande-taget"
data <- readLines(url)

By inspection, I can see that the relevant data is in these three lines:

dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]

There is a small problem. The Danish characters are escaped. Let’s fix that:

library(stringr)
data <- str_replace_all(data,"\\\\u00d8", "Ø")
data <- str_replace_all(data,"\\\\u00e6", "æ")
data <- str_replace_all(data,"\\\\u00f8", "ø")

And load it into the separate variables again:

dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]

Next, I need to get the names of the countries, and the number of migrants received in each country.

The data I’m after looks like this:

\"Bulgarien\":{\"valueheat\":20365,\"valuecolored\":\"none\",\"description\":\"\"},

And the regular expression picking that out of the data ought to be:

'\"(\p{L}+)\":{\"valueheat\":(\d+|\"\"),'

For some reason that is not working. I probably should try to figure out why, but I’m on vacation, and would rather drink cold white wine than dig too deep into the weirdness that is regular expressions in R.

Instead I’m going to use this simpler pattern:

pattern <- '(\")(\\w+)(.*?)(\\d+|\\\"\\\")'

And then fix problems later.

Extracting the data:

dat.2015 <- unlist(str_extract_all(dat.2015, pattern))
dat.2016 <- unlist(str_extract_all(dat.2016, pattern))
dat.2017 <- unlist(str_extract_all(dat.2017, pattern))

Inspecting the data reveals that the interesting parts are elements 23 to 111.

dat.2015 <- dat.2015[23:111]
dat.2016 <- dat.2016[23:111]
dat.2017 <- dat.2017[23:111]

An example of two lines:

dat.2015[11:12]
## [1] "\"Danmark\":{\"valueheat\":20935"              
## [2] "\"valuecolored\":\"none\",\"description\":\"\""

First, let’s get rid of the second line. There is one of those for each country.

Secondly, I’ll remove the first two characters of the first line, \" to be precise.

And thirdly, I’ll split that line on \":{\"valueheat\":

Using the nice pipes makes that easy:

library(purrr)
dat.2015 <- dat.2015 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

That should give me a list where each element is a vector with two elements:

dat.2015[[8]]
## [1] "Finland" "32345"

Neat! Let’s do that with the other years:

dat.2016 <- dat.2016 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

dat.2017 <- dat.2017 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

Now, let’s get these data into some dataframes. First I unlist the data, then I pour it into a matrix to get the right shape. And then I convert the matrices to dataframes:

dat.2017 <- data.frame(matrix(unlist(dat.2017), ncol=2, byrow=T), stringsAsFactors = F)
dat.2016 <- data.frame(matrix(unlist(dat.2016), ncol=2, byrow=T), stringsAsFactors = F)
dat.2015 <- data.frame(matrix(unlist(dat.2015), ncol=2, byrow=T), stringsAsFactors = F)

There is a slight problem. Albania was the first country in the list. And the structure of the raw data was a bit different.

dat.2015[1,1]
## [1] "regions\":{\"Albanien"

Let’s fix that:

dat.2015[1,1] <- "Albanien"
dat.2016[1,1] <- "Albanien"
dat.2017[1,1] <- "Albanien"

And let me just change the column names:

colnames(dat.2015) <- c("Land", "2015")
colnames(dat.2016) <- c("Land", "2016")
colnames(dat.2017) <- c("Land", "2017")

“Land” is Danish for “country”.

I’m going to need just one dataframe. I get that by joining the three dataframes:

library(dplyr)
total <- left_join(dat.2015, dat.2016, by="Land")
total <- left_join(total, dat.2017, by="Land")

The numbers are saved as characters. Converting them to numeric:

total$`2015` <- as.numeric(total$`2015`)
## Warning: NAs introduced by coercion
total$`2016` <- as.numeric(total$`2016`)
## Warning: NAs introduced by coercion
total$`2017` <- as.numeric(total$`2017`)
## Warning: NAs introduced by coercion

That introduced some NAs, for the countries where there is no data.

Inspecting the data, I can see that for some countries there is data for all three years. For other countries, there is no data at all. The function complete.cases() returns TRUE for a row without NAs.
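A quick demonstration on a toy dataframe:

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)
## [1]  TRUE FALSE FALSE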

Using that to get rid of countries where we don’t have complete data:

total <- total[complete.cases(total),]

Next is getting some figures for the populations.

The relevant page on Wikipedia is:

url <- 'https://da.wikipedia.org/wiki/Verdens_landes_befolkningsst%C3%B8rrelser'

Getting that:

library(XML)
library(httr)
r <- GET(url)
doc <- readHTMLTable(
  doc=content(r, "text"))
tabellen <- doc$'NULL'

colnames(tabellen) <-  apply(tabellen[1,],2,as.character)

I’m only interested in the country name, and the population:

tabellen <- tabellen %>%
  select(`Land (eller territorium)`, Population)

Renaming the columns:

colnames(tabellen) <- c("Land", "Population")
tabellen$Population <- as.character(tabellen$Population)
tabellen$Population <- as.numeric(str_remove_all(tabellen$Population, fixed(".")))
## Warning: NAs introduced by coercion

And while I’m at it: the second line gets rid of the factors, and the third removes the thousand separators (“.”).

Now I can join the dataframe containing population figures, with the dataframe containing countries and number of migrants:

total <- left_join(total, tabellen, by="Land")
## Warning: Column `Land` joining character vector and factor, coercing into
## character vector

There are three smaller problems: Cyprus, France and Ireland. The country names I get from Wikipedia contain a footnote. I might be able to get rid of that with code; I’m going to do it manually.

total[10,5] <- 4635400
total[7,5] <- 67286000
total[3,5] <- 847000
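For completeness, a hedged sketch of what the code route might have looked like, applied before the join. This assumes the notes appear as trailing bracketed references in the country name, which I have not checked:

# Hypothetical clean-up: strip a trailing bracketed note from the country name
tabellen$Land <- str_remove(as.character(tabellen$Land), "\\[.*\\]$")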

Now I have a nice dataframe with the names of the countries (in Danish), the number of migrants received in 2015, 2016 and 2017, and the population in 2018.

Now it is time to look at some maps.

library(ggplot2)
library(rworldmap)

I am going to match the countries in my dataframe with the countries I get from the map data. That requires that I have the English names for the countries in my dataframe.

This is the translation:

enland <- c("Belgium", "Bulgaria", "Cyprus", "Denmark", "Estonia", "Finland", "France", "Greece", "Netherlands", "Ireland",
          "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Poland", "Portugal", "Romania", "Slovakia",
          "Slovenia", "Spain", "United Kingdom", "Sweden", "Czech Rep.", "Germany", "Hungary", "Austria"  )

Getting that into the data frame:

total$enland <- enland

Next, lets get the map:

worldmap <- getMap()

That retrieves data for the entire world. I’m only interested in the EU:

EU <- worldmap[which(worldmap$NAME %in% enland),]
EU <- map_data(EU)

The first line extracts the part of the world map that has names in the list of countries that I have data for.
map_data() converts that into a nice structure that is suitable for entering into ggplot.

Next step is calculating the number of migrants received in each country as a proportion of that country’s population:

total <- total %>%
  mutate(`2015` = `2015`/Population*100, `2016` = `2016`/Population*100, `2017`=`2017`/Population*100)

I’m mutating the columns 2015-2017 by dividing by population. And multiplying by 100 to get percentages.

The almost final step, is to join my migrant-proportions with the map data:

total <- left_join(total,EU, by=c("enland"="region") )

The map data does not call the countries “countries”. Rather, their names are saved in the variable “region”.

And now the final step. I’m going to need the data in tidy form. So I’m loading tidyr.

Then I pass the data frame to select(), where I pick out the variables I need: long(itude), lat(itude), 2015, 2016 and 2017, and the name of the country.
That is passed to gather(), which makes a new row for each year, with the proportions in the new variables year and prop.

All that is passed to ggplot, and a layer where the polygons describing the countries are plotted. They are filled with a colour matching the proportions. And grouped by “group”. This is important: grouping by country name gives weird results. I’ll get back to that. color="white" draws the outlines of the polygons in white.

Finally, I facet the data on year.

library(tidyr)
total %>%
  select(long,lat,`2015`,`2016`,`2017`, group) %>%
  gather(year, prop,  `2015`:`2017`)%>%
  ggplot() +
    geom_polygon(aes(long, lat, fill = prop, group = group), color = "white") +
    theme_void() +
    facet_wrap(~year, ncol=2)

(Plot: the three maps, migrants as a percentage of population, faceted by year.)

That’s it!
And now the picture is slightly different. What is interesting is that Germany still takes a higher proportion of the migrants than other countries. But in 2015, they didn’t. That was the year when the German chancellor Angela Merkel said the famous words “Wir schaffen das”, we’ll manage. But it was also the year when Hungary and Sweden welcomed migrants in numbers equalling 1.79% and 1.65% of their populations respectively. Compare that with the fact that Germany the same year received migrants equalling 0.58% of its population.

A cynic might claim that it is no surprise that Sweden and Hungary closed their borders late in 2015.

Anyway, that is a different subject. I just think that these three maps are slightly more informative than what TV2 provided.

Also, I promised to get back to the group thingy.

Making the same plot, but grouping on country names:

total %>%
  select(long,lat,`2015`,`2016`,`2017`, group, enland) %>%
  gather(year, prop,  `2015`:`2017`)%>%
  ggplot() +
    geom_polygon(aes(long, lat, fill = prop, group = enland), color = "white") +
    theme_void()

(Plot: the same map, but grouped by country name, with distorted polygons.)

What happens is that the polygons describing Italy are grouped in a way that connects the parts describing Sicily to the northern part of Italy. That looks weird. The same happens with Sardinia.

Finally: I have not been very consistent in my use of words. I have used “received” and “welcomed” interchangeably. Hungary and Denmark have not been very welcoming. But we are talking about real humans here, and “welcomed” simply sounds nicer than “received”. Complicating the situation is the fact that a lot of the arrivals were not actually what we would normally call refugees, at least not refugees from war. So I have not been consistent in the use of “migrant” vs “refugee” either. That is not really my point. The point is that we should always think about how these kinds of numbers are presented.


Briefly unavailable for scheduled maintenance.

Oh the horror.

A couple of days ago something happened while updating one of my sites. “Briefly unavailable for scheduled maintenance.” was all that the page told me.

How to fix that, given that the page in question is made with WordPress?

Log on to the server serving your page. Delete the file “.maintenance”. Log out. Refresh.
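If the only tool you happen to have at hand is an R session on the server, the same thing can be done in R (assuming the working directory is the WordPress root; normally you would just delete the file over SSH or FTP):

# Hypothetical: remove WordPress's maintenance flag file
file.remove(".maintenance")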

If that doesn’t help, you have bigger problems.

Why is it happening? When you update something on a WordPress site, the site is placed in maintenance mode. It’s a bad idea to have people entering comments into the database at the same time you’re doing maintenance on it.

The site is placed in maintenance mode – every request for the site is given the message “Briefly unavailable for scheduled maintenance”. And that is controlled by the file “.maintenance”. Normally that file is deleted when the update is over. But sometimes it survives, even when the update itself went fine: the update finished, the file is still there, and your site is unavailable.

Mann-Whitney-Wilcoxon test

The Mann-Whitney U test,

also known as the Mann-Whitney-Wilcoxon test, is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than, or greater than, a randomly selected value from a second sample.

Intuitively I think I can see how this translates to “the two samples are from populations that are actually different”. I have a harder time describing that intuition.

The test is nearly as efficient as a t-test. But the advantage is that it does not require us to assume that the two samples are normally distributed.

It can also be used to test if two independent samples were selected from populations that have the same distribution.

The idea is that we have two samples. If we pick a random value from sample A, and a random value from sample B, there is some probability that the value from sample A is larger than the value from sample B. Is that probability the same as the probability that the value from sample A is smaller than the value from sample B?
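That probability can be estimated directly by resampling. A toy example with simulated data (the numbers here are made up purely for illustration):

set.seed(1)
A <- rnorm(50, mean = 0)    # sample A
B <- rnorm(50, mean = 0.5)  # sample B, shifted upwards
a <- sample(A, 10000, replace = TRUE)
b <- sample(B, 10000, replace = TRUE)
mean(a > b)   # estimate of the probability that a random value from A exceeds one from B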

How do we run the actual test in R?

Let us look at some data. mtcars is a dataset with some facts about various 1974 US cars.
One of the numbers we have is mpg, miles per gallon, or fuel efficiency. Another is about the transmission: is there a manual or an automatic gearbox? In the column am, a “1” indicates a manual gearbox, and a “0” an automatic gearbox.

We can now extract two sets of data on the fuel efficiency, based on the type of transmission:

man <- mtcars[which(mtcars$am==1),]$mpg
aut <- mtcars[which(mtcars$am==0),]$mpg

In “aut” we have the fuel efficiency of the cars with an automatic gearbox, and in “man” the same, but for the, in Europe, more familiar manual gearbox.

How to do the Wilcoxon test?

wilcox.test(man,aut)
## Warning in wilcox.test.default(man, aut): cannot compute exact p-value with
## ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  man and aut
## W = 205, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

The p-value is 0.001871. If we take a 0.05 significance level, we can conclude that the two sets of values are from non-identical populations. Had it been larger (quite a bit larger), we would not have been able to reject the null hypothesis that the two samples come from identical populations.

There is a different way to do the test, where we don’t split the dataset in two:

wilcox.test(mpg ~ am, data=mtcars)
## Warning in wilcox.test.default(x = c(21.4, 18.7, 18.1, 14.3, 24.4, 22.8, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

We tell the function that we want to test mpg (fuel efficiency) as a function of am (the type of transmission), and that the dataset we are working on is mtcars. The function then splits the dataset in two based on the value of am. Had there been more than two values, things would probably stop working.

The result is the same though.

The warning

We get a warning. That is because the algorithm has a problem when there are identical values in the two datasets. There is a way around that: we can add “jitter” to the two datasets. jitter() adds small random values to the values in the dataset. Often that solves the problem:

wilcox.test(jitter(man),jitter(aut))
## 
##  Wilcoxon rank sum test
## 
## data:  jitter(man) and jitter(aut)
## W = 205, p-value = 0.001214
## alternative hypothesis: true location shift is not equal to 0

On the other hand, we are no longer comparing the actual values in the dataset. We can see the difference here:

man - jitter(man)
##  [1] -0.06068697  0.06873662  0.05276257  0.06029616 -0.04756118
##  [6] -0.03489005 -0.01479096 -0.04547827 -0.06275216  0.01547043
## [11] -0.02540212 -0.05201828  0.01065800

It would probably be a bad idea to add random noise to the data. On the other hand, it is not very likely that two cars have exactly the same fuel efficiency. The values have probably been “binned”, and the addition of random noise would probably not change the values we are working on too much. But you should always consider why the warning arises, whether it is actually a problem and not just a warning, and whether it is appropriate to solve it in this way.

An engineers’ dream come true

Project Euler, problem 263 – An engineers’ dream come true

This is, at the time of coding, the highest-numbered Project Euler problem I’ve tried to tackle. With a difficulty rating of 75% it is also the most difficult. At least on paper. But “An engineers’ dream come true”? How can I, as an engineer, not try to solve it?

We are looking for an n for which it holds that:

  • n-9 and n-3 must be consecutive primes
  • n-3 and n+3 must also be consecutive primes
  • n+3 and n+9 must also be consecutive primes.

These are prime pairs that are “sexy”, that is, pairs of primes with a difference of 6.

Also, n, n-8, n-4, n+4 and n+8 must be practical numbers, that is, numbers where every number from 1 to n can be written as a sum of distinct divisors of n.
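A quick sanity check of the definition, using divisors() from the numbers library:

library(numbers)
divisors(8)
## [1] 1 2 4 8
# 3 = 1+2, 5 = 1+4, 6 = 2+4, 7 = 1+2+4, so 8 is practical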

So if a number n gives sexy prime pairs, and it and its neighbours are very practical – that is what an engineer dreams of – hence the paradise.

The fastest way seems to be to generate a list of primes, sieve out those that conform to the requirements for consecutive primes, and then test those numbers for practicality.

Let’s get started!

The trusty numbers library provides the primes up to 1000000. Then, for each of those primes, return the n-value if the differences between the sets of consecutive primes are 6.

library(numbers)

primenumbers <- Primes(1000000)
primecandidates <- numeric()
for(i in 1:(length(primenumbers)-4)){
  if((primenumbers[i+1] - primenumbers[i] == 6) & (primenumbers[i+2] - primenumbers[i+1] == 6) & (primenumbers[i+3] - primenumbers[i+2] == 6)){
    primecandidates <- c(primecandidates, (primenumbers[i]+9))
  }
}

What we get from this is not prime numbers, but values of n that give the consecutive primes we need.

Now I have a list of candidate n’s based on the requirements for primes. Next I need to check for practicality.

First I tried a naive way. Generate the list of numbers that I need to sum to using distinct divisors, for a given x.
Then get the divisors of x. I don’t need to check whether I can sum to the numbers that are themselves divisors, so excluding them leaves me with a slightly smaller set. Then I take all the ways I can pick two divisors from the set of divisors, sum them, and exclude the sums from the list of numbers. I continue with all the possible combinations of 2, 3, 4 etc. divisors, based on how many there are. If there are no numbers left in the vector of numbers that I need to be able to sum to, I was able to express all those numbers as sums of distinct divisors. And then x is a practical number.

practical <- function(x){
  test <- 1:x                  # the numbers we need to express as sums
  divs <- divisors(x)
  test <- setdiff(test, divs)  # the divisors themselves are trivially covered
  for(i in 2:length(divs)){
    test <- setdiff(test, combn(divs, i, sum))  # remove every sum of i distinct divisors
  }
  !as.logical(length(test))    # TRUE if nothing is left
}

Two problems. One can be fixed: I continue to generate combinations of divisors and sum them, even if I have already found ways to sum all the numbers. The second problem is more serious: when I have to test a number with very many divisors, it takes a long time. Also, handling a vector containing all the numbers 1:1000000 takes a lot of time.

I need a faster way of checking for practicality.

Wikipedia to the rescue. There exists a way of checking. I have no idea why it works. But it does.

For a number x, larger than 1, and where the first prime factor is 2: order all the prime factors. Each prime factor has to be smaller than or equal to the sum of the divisors of the product of all the smaller prime factors, plus one. Oh! And that product – if 3 is a prime factor twice, that is, if 3² divides x, then 3 should enter the product as 3².
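Stated a bit more compactly (this is the criterion as Wikipedia gives it): write x = p1^a1 · p2^a2 ⋯ pk^ak with p1 < p2 < … < pk and p1 = 2. Then x is practical exactly when, for every i from 2 to k,

pi ≤ 1 + σ(p1^a1 · p2^a2 ⋯ p(i-1)^a(i-1))

where σ() is the sum-of-divisors function.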

That sounds simple.

For a number x, get the prime factors. Use table() to get the counts of the prime factors, i.e. that 3 is there twice. Those counts are the values of the result from the table() function. The names of the table are the prime factors.

For each factor from number 2 to the last one, take the names of the prime factors from number 1 to just before the factor we are looking at (as numeric). Exponentiate with the values from the table – that is, how many times each prime factor occurs. Take the product, get the divisors of that product, sum them, and add 1. If the factor we are looking at is larger than that, we have determined that x is not practical, and can return FALSE. If x gets through that process, it is practical.

I need to handle the case where there is only one prime factor – 2. Those numbers are practical, but the way I have done the check breaks when there is only one prime factor. It’s simple enough: just check if there is only one distinct prime factor, and return TRUE in that case.

practical <- function(x){
  all_factors <- table(factors(x))            # counts of each prime factor
  primes_x <- as.numeric(names(all_factors))  # the distinct prime factors
  counts_x <- as.numeric(all_factors)         # their multiplicities
  n_factors <- length(all_factors)
  if(n_factors == 1){                         # a power of 2; always practical here
    return(TRUE)
  }
  for(i in 2:n_factors){
    if(primes_x[i] > sum(divisors(prod(primes_x[1:(i-1)]**counts_x[1:(i-1)]))) + 1){
      return(FALSE)
    }
  }
  TRUE
}

So, quite a bit faster!

Now I can take my candidate n’s, based on the requirements for prime pairs, and keep only the n’s that are themselves practical. And where n-8, n-4, n+4 and n+8 are also practical:

library(purrr)

eng_numbers <- primecandidates %>%
  keep(function(x) practical(x-8)) %>%
  keep(function(x) practical(x-4)) %>%
  keep(function(x) practical(x)) %>%
  keep(function(x) practical(x+4)) %>%
  keep(function(x) practical(x+8))

eng_numbers
## numeric(0)

OK. There were no n’s in this set.

This is kind of a problem. The n’s we are looking for are actually pretty large. I know this because this write-up was written after I found the solution. So it is not because the code is flawed.

So now we have the challenge that we need to get hold of rather large numbers.

primenumbers <- Primes(500000000)
primecandidates <- numeric()
for(i in 1:(length(primenumbers)-4)){
if((primenumbers[i+1] - primenumbers[i] == 6) & (primenumbers[i+2] - primenumbers[i+1]== 6) & (primenumbers[i+3] - primenumbers[i+2]== 6)){
  primecandidates <- c(primecandidates,(primenumbers[i]+9))
  }
}
eng_numbers <- primecandidates %>%
  keep(function(x) practical(x-8)) %>%
  keep(function(x) practical(x-4)) %>%
  keep(function(x) practical(x))   %>%
  keep(function(x) practical(x+4)) %>%
  keep(function(x) practical(x+8))

length(eng_numbers)
## [1] 3

That gives us three. We need four.

Now – I could just test larger and larger sets of primes, but I run into memory problems when I try that. Then I could write some code that generates prime numbers between two limits, and increase those limits until I get number four.

I have not done that. Instead I just tried again and again, until I found number four. The upper limit for the primes needs to be roughly four times larger than the one I just used. Anyway, I found number four. And got the green tickmark.
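For the record, a sketch of what that windowed search could look like. This is untested, and assumes that numbers::Primes(n1, n2) generates the primes between n1 and n2, and that a window of 10⁸ is a reasonable size:

library(numbers)

# Hypothetical windowed search: scan blocks of primes instead of one huge vector
lo <- 2
found <- numeric()
while(length(found) < 4){
  p <- Primes(lo, lo + 1e8)        # one window of primes
  d <- diff(p)
  idx <- which(d == 6)             # a gap of 6 to the next prime
  idx <- idx[idx + 3 <= length(p)] # room for a full quadruple
  ok <- idx[d[idx + 1] == 6 & d[idx + 2] == 6]
  cand <- p[ok] + 9
  cand <- cand[sapply(cand, function(n)
    all(sapply(c(-8, -4, 0, 4, 8), function(k) practical(n + k))))]
  found <- c(found, cand)
  lo <- lo + 1e8                   # next window; ignores quadruples straddling a boundary
}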

Lesson learned

Sometimes you really need to look into the math. This would have been impossible if I had not found a quicker way to check for practicality.

Also, I should refactor this code. The check for practicality could certainly be optimised a bit.

And – I should probably check for primality, rather than generating the list of primes.

Project Euler 84

Monopoly – Project Euler problem 84

This is basically a question of building a Monopoly simulator.

We begin at square GO. Roll two six-sided dice, and advance.

If we land on the “Go to jail” (G2J) square, we go to the JAIL square.

If we land on a CC square, we draw a card from the “community chest” stack of cards. There are 16 cards in it, and only two change the position of the player: “Advance to GO” and “Go to Jail”.

If we land on a CH square, we draw a card from the chance stack. There are 16 cards here too, and only ten change the position:

  • Advance to GO
  • Go to JAIL
  • Go to C1
  • Go to E3
  • Go to H2
  • Go to R1
  • Go to next R (railway company)
  • Go to next R
  • Go to next U (utility company)
  • Go back 3 squares.

There are three CH squares, and three CC.

A player being sent to jail is assumed to buy himself out immediately. And if a player rolls three consecutive doubles, he does not advance by the result of the third roll, but goes directly to jail.

The two stacks of cards are shuffled at the beginning.

Find the probability of landing on a given square, and return the three squares that a player is most likely to land on.

Numbering the squares from 00 to 39, the answer is the concatenated string representing these three squares. For 6 sided dice, the most likely squares to end on after a roll of the dice, are squares 10, 24 and 00. The answer is 102400.

I am going to need some dice:

dice <- function() sample(1:6, 2, replace=TRUE)

I’m gonna need a community chest function. And I am going to take a chance. That is, I am going to assume that the whole “shuffle the deck at the start, and place drawn cards at the bottom” procedure is equivalent to just picking a random card:

pool <- rep(0,14)
pool <- c(pool,1,11)
cc <- function(cs){
  res <- sample(pool,1)
  ifelse(res==0,cs,res)
}

The function takes the current square. The pool is a vector containing 14 zeroes, one instance of “1” corresponding to GO, and one of “11” corresponding to jail. It samples one of those values; a zero means the card does not move us, so the current square is returned.

And a chance function. Again, I am going to make the same assumption as with the community chest function:

ch_pool <- rep(0,6)
ch_pool <- c(ch_pool,1:10)
ch <- function(cs){
 chance <- sample(ch_pool,1) 
 if(chance==0) res <- cs
 if(chance==1) res <- 1
 if(chance==2) res <- 11
 if(chance==3) res <- 12
 if(chance==4) res <- 25
 if(chance==5) res <- 40
 if(chance==6) res <- 6
 if(chance==7&cs==8) res <- 16
 if(chance==7&cs==23) res <- 26
 if(chance==7&cs==37) res <- 6
 if(chance==8&cs==8) res <- 16
 if(chance==8&cs==23) res <- 26
 if(chance==8&cs==37) res <- 6
 if(chance==9&cs==8) res <- 13
 if(chance==9&cs==23) res <- 29
 if(chance==9&cs==37) res <- 13
 if(chance==10) res <- cs-3
 return(res)
}

This is basically the same as the community function. Just a bit longer.

I am also going to need a structure for the playing board:

hit <- numeric(40)

A third chance I’m taking: that the part about hitting three doubles in a row, and then going to jail, will also disappear in the noise if I just run this long enough:

curr_square <- 1

for(i in 1:10000000){
  curr_square <- ((curr_square + sum(dice())-1)%%40 + 1) 
  if(curr_square == 31) curr_square <- 11
  if(curr_square %in% c(2,18,34)) curr_square <- cc(curr_square)
  if(curr_square %in% c(8,23,37)) curr_square <- ch(curr_square)
  hit[curr_square] <- hit[curr_square] + 1
}

I begin at square 1. And then I am going to repeat the following 10.000.000 times:
Roll the dice, add the result to the current square. If the result is more than 40, the modulo arithmetic wraps around and continues from square 1 (for example, from square 39 a roll of 5 gives ((39 + 5 - 1) mod 40) + 1 = 4).

If the square I’m landing on is square 31, set new square to 11 – that is the jail.

If it is 2, 18 or 34, run the cc – community chest function. That will return the square that we land on after drawing a card from that stack.

If it is 8, 23 or 37, we have landed on a chance square. Run the ch – chance function. That will return the square that we land on after drawing a card from that stack.

That should give me the frequencies with which the player has landed on each square.

Now I can rank the results:

freq <- rank(hit)

That gives me the ranks. Rank 40 is the square that was hit the most times.
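A quick illustration of what rank() does, on made-up numbers:

rank(c(10, 50, 20))
## [1] 1 3 2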

I need to translate that to the numbering of the squares that is used for the result.

squares <- c("00","01","02","03","04","05","06","07","08","09",10:39)
paste(squares[order(freq, decreasing=TRUE)[1:3]], collapse="")
## [1] "102400"

Pasting together the squares hit the most, second-most, and third-most times, in that order (simply indexing with %in% would return them in board order instead). That is fortunately the result given in the problem.

Now let’s try with a new set of dice:

dice <- function() sample(1:4, 2, replace=TRUE)

And the same code:

hit <- numeric(40)
curr_square <- 1

for(i in 1:10000000){
  curr_square <- ((curr_square + sum(dice())-1)%%40 + 1) 
  if(curr_square == 31){ curr_square <- 11}
  if(curr_square %in% c(2,18,34)) curr_square <- cc(curr_square)
  if(curr_square %in% c(8,23,37)) curr_square <- ch(curr_square)
  hit[curr_square] <- hit[curr_square] + 1
}

freq <- rank(hit)

answer <- paste(squares[order(freq, decreasing=TRUE)[1:3]], collapse="")

And we are rewarded with the nice green tickmark.

This was not as difficult as I thought. But maybe it would be wise to look at the forum for the problem, and study some of the non-brute force solutions.

Oh – another lesson learned. In the code above, I run through 10.000.000 throws of the dice. I actually wanted to throw the dice 1.000.000 times. Always count your zeroes!

Also – 100.000 throws is enough to get the right result.

Project Euler 179

Project Euler 179

14 has four divisors: 1, 2, 7 and 14.

14+1, ie. 15, also has four divisors: 1, 3, 5 and 15.

How many integers, n, where 1 < n < 10⁷, have the same number of divisors as n + 1?

That is surprisingly simple. Start by figuring out how many divisors all the numbers have.

n <- 10000000
t <- integer(n)

for(s in 1:n){
  t[seq(s, n, by=s)] <- t[seq(s, n, by=s)] + 1
}

First I make a vector of length 10⁷ containing only zeroes.

Element 1 corresponds to n = 1, element 47 to n = 47.

Every single element in that vector has 1 as a divisor.

Every other element has 2 as a divisor.

Every third element has 3 as a divisor, and so on.

So for every element in the sequence 2 to 10⁷, in steps of 2, I add 1.

The same for every element in the sequence 3 to 10⁷, in steps of 3.

And so on.

That takes a loooong time to do.

Now I have a vector t, where the value in t[n] is the number of divisors in n.

I now need to compare them.

If I have two identical vectors q and r, and add a 0 to the beginning of one of them, and a 0 at the end of the other, I get two vectors where element i+1 in the first is the same as element i in the second (or the other way around). So when I compare q[47] with r[47], I actually compare t[46] with t[47]. That is the comparison I need. When those two values are identical, I have an integer with the same number of divisors as the following integer.
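A tiny illustration with made-up divisor counts:

d <- c(5, 5, 7, 9, 9)
sum(c(0, d) == c(d, 0))
## [1] 2

The two matches correspond to the two neighbouring pairs (5, 5) and (9, 9).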

Apologies for the not-so-fluent prose. It’s pretty warm and I’m rather tired.

Anyway:

q <- c(0,t)
r <- c(t,0)

I now just need to compare them, and sum the result of that comparison:

answer <- sum(q==r)

Lesson learned

Just because a problem has a high number, doesn’t mean that it is particularly difficult.

dplyr::count()

dplyr::count()

A short note for myself.

Given a dataframe, e.g. mtcars:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

These cars have 4, 6 or 8 cylinders. How many cars fall in each category?

count(mtcars, cyl)
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1    4.    11
## 2    6.     7
## 3    8.    14

They have 3, 4 or 5 gears.

How many cars have 8 cylinders and 3 gears?

count(mtcars, cyl, gear)
## # A tibble: 8 x 3
##     cyl  gear     n
##   <dbl> <dbl> <int>
## 1    4.    3.     1
## 2    4.    4.     8
## 3    4.    5.     2
## 4    6.    3.     2
## 5    6.    4.     4
## 6    6.    5.     1
## 7    8.    3.    12
## 8    8.    5.     2

This is equivalent to first grouping the observations by cylinders and gears. And then counting using tally():

mtcars %>% 
  group_by(cyl,gear) %>%
  tally()
## # A tibble: 8 x 3
## # Groups:   cyl [?]
##     cyl  gear     n
##   <dbl> <dbl> <int>
## 1    4.    3.     1
## 2    4.    4.     8
## 3    4.    5.     2
## 4    6.    3.     2
## 5    6.    4.     4
## 6    6.    5.     1
## 7    8.    3.    12
## 8    8.    5.     2

If, for some obscure reason, we want to sum the number of carburetors in each category, we can do that with the wt argument:

count(mtcars, cyl, gear, wt=carb)
## # A tibble: 8 x 3
##     cyl  gear     n
##   <dbl> <dbl> <dbl>
## 1    4.    3.    1.
## 2    4.    4.   12.
## 3    4.    5.    4.
## 4    6.    3.    2.
## 5    6.    4.   16.
## 6    6.    5.    6.
## 7    8.    3.   37.
## 8    8.    5.   12.

The cars with 8 cylinders, and 3 gears, have among them a total of 37 carburetors.

Project Euler – 58

Project Euler 58 – Spiral Primes

All right. Let’s make a square by jotting down all numbers from 1 in a spiral like this:

matrix(c(37, 36, 35, 34, 33, 32, 31,
         38, 17, 16, 15, 14, 13, 30,
         39, 18,  5,  4,  3, 12, 29,
         40, 19,  6,  1,  2, 11, 28,
         41, 20,  7,  8,  9, 10, 27,
         42, 21, 22, 23, 24, 25, 26,
         43, 44, 45, 46, 47, 48, 49), 7, 7, byrow=TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   37   36   35   34   33   32   31
## [2,]   38   17   16   15   14   13   30
## [3,]   39   18    5    4    3   12   29
## [4,]   40   19    6    1    2   11   28
## [5,]   41   20    7    8    9   10   27
## [6,]   42   21   22   23   24   25   26
## [7,]   43   44   45   46   47   48   49

We’re going anti-clockwise.

What Project Euler thinks is interesting, is that 8 out of the 13 numbers lying on the two diagonals, are prime.

That gives a prime ratio of 8/13, or about 62%.

We can continue adding layers to the spiral, and make squares with increasing side length.

We are now tasked with finding the side length of the square spiral for which the ratio of primes along the two diagonals first falls below 10%.

The task boils down to: Figure out what numbers are on the diagonals of a square with a given side length. What is the proportion of primes in that set of numbers.

The final number in the spiral of an n×n square is n². That number sits in the south-eastern corner of the square.

The number in the south-western corner is n² - (n-1). The number in the north-western corner is n² - 2(n-1). And the one in the north-eastern corner is n² - 3(n-1).

The numbers on the diagonals are all corner-numbers in a square. If I want all the numbers on the diagonals in the 7×7 square above, I calculate the corner-numbers for n = 3, n = 5 and n = 7, and collect them all in a vector, along with the center piece, 1.
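A quick check against the 7×7 square above:

n <- 7
c(n^2, n^2 - (n-1), n^2 - 2*(n-1), n^2 - 3*(n-1))
## [1] 49 43 37 31

Those are exactly the four corners of the printed spiral.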

I’m going to define a numbers-vector containing the numbers on the diagonal, and add 1 to it.

numbers <- numeric()
numbers <- c(1, numbers)

Having that vector, I need to calculate the fraction of the elements in it, that are prime. The numbers library is handy for that:

library(numbers)

prime_fraction <- function(v){
  sum(isPrime(v))/length(v)
}

The function takes a vector as input, runs isPrime() on it. That function returns a logical vector, with TRUE where the element is prime. The sum of that vector is the number of primes in the input vector. Dividing that with the number of elements in the vector, length(), returns the fraction of primes in the input vector.

Now all that remains is adding diagonal numbers to the vector, check the fraction of primes, and continue until that fraction is below 0.1. I’m going to run the first iteration by hand. The fraction of primes in the initial numbers vector is 0, and that is of course less than 0.1, but not the answer I’m looking for.

i <- 3
numbers <- c(numbers, i*i, i*i-(i-1), i*i-2*(i-1), (i*i-3*(i-1)))
i <- i + 2
while(prime_fraction(numbers)>0.1){
  numbers <- c(numbers, i*i, i*i-(i-1), i*i-2*(i-1), i*i-3*(i-1))
  i <- i + 2
}
answer <- i - 2

OK. That gives me the correct answer. One problem: it takes forever. Well, not forever, but too long. The problem is the isPrime() function. A small test tells me that checking the primality of a vector containing 5385 elements takes about 70 milliseconds. That is pretty quick. But when I do it several thousand times, it adds up.
Also, I’m destined to spend eternity in the first circle of R-hell for growing the vector in perhaps the most inefficient way possible.

Is there a simpler way?

What I’m really looking for is two numbers. The number of primes in the diagonals. And the total number of numbers in the diagonals.

Each diagonal in an n×n square contains n numbers. The center is counted twice, so the total number is given by 2*n - 1.

What about the number of primes? If I, instead of adding the four corners to the numbers vector and testing the whole thing for primes, just test the four new numbers and add the number of primes among them to a counter, that should be quicker, as I’m only checking each number for primality once. What should make it even faster is that I only have to check three numbers: the south-eastern number is always a square, so definitely not prime.

Let’s try again. As before, I’m priming the pump, and beginning with the 5×5 square:

i <- 5
primes <- 3

while(primes/(2*i - 1) > 0.1){
  primes <- primes +sum(isPrime(c(i*i-(i-1), i*i-2*(i-1), i*i-3*(i-1))))
  i <- i + 2
}
answer <- i

Much, much faster!

Lessons learned

There usually is a faster way to do things. Also, and this was a surprise: getting the length of a 5385-element vector is not that much slower than evaluating 2*2693 - 1.

test <- runif(5385)
leng <- function(){
  length(test)
}

i <- 2693
mult <- function(){
  2*i - 1
}

library(microbenchmark)
mbm <- microbenchmark(mult, leng, times = 10000)
## Warning in microbenchmark(mult, leng, times = 10000): Could not measure a
## positive execution time for 71 evaluations.
mbm
## Unit: nanoseconds
##  expr min lq     mean median uq    max neval
##  mult   0  0 193.7454      1  1 373167 10000
##  leng   0  0 197.7259      0  1 767366 10000

Depending on what else is happening on the computer, it might even be faster to take the length than to do simple arithmetic.

library(ggplot2)
autoplot(mbm)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 9946 rows containing non-finite values (stat_ydensity).

(Plot: violin plot of the microbenchmark timings.)

Not that it makes that much of a difference. But I find it interesting.

Project Euler 35

Project Euler 35 – Circular primes

197 is an example of a circular prime. It is itself a prime. And if we rotate the digits, 971 and 719 are also primes.

Given that 2, 3, 5 and 7 are by definition circular primes – not that we rotate them much – how many circular primes are there below one million?

Let’s begin by generating a vector containing all primes below 1000000:

library(numbers)
prim <- Primes(1000000)
length(prim)
## [1] 78498

That is 78498 numbers. Is there any way to reduce that number?

There is. Any prime containing an even digit will rotate to a number ending in an even digit. That number cannot be a prime.

The same applies to primes containing the digit 5.

Of course there are two special cases, 2 and 5.

Let’s reduce the number of primes we need to consider. First, make a function checking if there are any “illegal” digits in the number:

poss <- function(x){
  !any(unlist(strsplit(as.character(x),"")) %in% c(0,2,4,5,6,8))  
}

The function splits the number into individual digits, checks if they are in the list of forbidden digits, and returns FALSE if any digit of the number is in that list.

Now paring down the list of primes:

prim <- prim[sapply(prim,poss)]

That leaves us with 1111 possible primes.

Now we need to figure out how to rotate the numbers:

rotate <- function(x){
  res <- integer()
  while(length(res)<as.integer(log10(x)+1)){
    x <- x%/%10 + x%%10*(10**(as.integer(log10(x)+1)-1))
    res <- c(res,x)
  }
  all(res %in% prim)
}

Lets take a four digit number, 1234, as an example.

I begin by initializing a variable res to keep the rotated numbers.

as.integer(log10(x)+1) gives me the number of digits in the number. Here, 4.

Rotating a four digit number will yield four numbers. As long as there are less than four numbers in the result-vector res, I need to rotate at least once more.

I want the first rotation to give the number 4123.

Performing an integer division by 10 gives me the first three digits of the number, here 123. Modulo 10 gives me the last, here 4.

I get the rotation by adding the first three digits of the number to the last number multiplied by 10 to the power of the number of digits in the original number minus 1.

10³ is 1000; multiplying by 4 gives 4000, and adding that to 123 returns 4123. Repeat until there are four numbers in the result-vector.
Then check if all the elements of the result-vector are in the list of primes. If they all are, return TRUE, otherwise return FALSE.
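The arithmetic for one rotation of 1234, by hand:

x <- 1234
x %/% 10 + x %% 10 * 10^(as.integer(log10(x) + 1) - 1)
## [1] 4123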

Now it is simply a question of getting rid of all the remaining candidate primes that do not rotate to primes:

prim <- prim[sapply(prim,rotate)]

The length of that vector is the number of primes that are circular as defined above. I just need to add the two special cases 2 and 5, and I have the answer:

answer <- length(prim) + 2