Analytics and Ethics

Following the release of the Cambridge Analytica exposé, which revealed the underhanded tactics the company uses to influence and manipulate people through a combination of data analytics, espionage and ‘honeypotting’, I think it’s incredibly important to discuss how the analytical community can ensure a commitment to honest and unbiased analytics, and to ethically sourced data.

Analysts should always keep in mind how their work may be used and interpreted by everyone who consumes the analysis they’ve produced. This underlines the importance of clarity in data visualisation. Producing data visualisations that have little to no confounding elements (such as dual axes and scales that don’t start at zero) is important to ensure that our analysis can be correctly and easily interpreted by our audience, and to minimise the risk that our audience draws false conclusions from the visualisation. If there are any particular nuances to the data that may cause an incorrect interpretation, it’s important that they are made clear to the user and aren’t allowed to cloud their judgement. Techniques like visualising uncertainty are therefore incredibly useful when showing analysis such as forecasting or correlation analysis, which relies heavily on confidence intervals.

Evidently, when it comes to putting these methods into practice, there is wide scope for interpreting how clarity should be achieved. This makes data visualisation subjective, which opens a gap that is ripe for ethical dilemmas! One thing that we, as analysts, have to be mindful of is making sure we don’t impose our biases on the analyses we carry out. When ensuring our audience doesn’t mistakenly draw false conclusions from the data we’re presenting to them, we need to make sure that these ‘false’ conclusions aren’t simply the conclusions that we don’t want them to make because they don’t align with our world view. This is especially important in the land of data journalism. The vast majority of publications come with a litany of biases: which side of the political spectrum they lean toward, their views on particular hot-button topics, and whoever sponsors their content. It’s important that the data being presented in these publications isn’t being twisted to create misleading and false narratives that align with the publication’s biases. Doing so is morally reprehensible and incredibly manipulative, but we see it time and time again, and members of the general public who aren’t well versed in statistics and good data visualisation will fall for these cheap tricks again and again.

This is only a small part of the scope of this topic. It’s obvious that despite my misgivings, analytics will still be used for a variety of dishonest and malicious practices – hell, most marketing companies use analytics to make sure us chumps in the public are enticed as much as possible by their product/message/service and that’s not likely to stop any time soon. Hopefully, in time and likely after a large enough scandal with big data (as the Enron scandal did for accounting, influencing the profession’s push for a greater emphasis on ethics, and as the Manhattan Project did for physics), we’ll solidify a code of ethics that all analysts will strive to follow.


Florida Ticket Price Comparisons

I’m going on holiday soon! I’m headed to Florida with my parents and, being the cost-savvy Brits that we are, we’re always on the hunt for a bargain. We wanted to do some of the theme parks (we settled on Disney and/or Universal Studios) and went about finding the cheapest ticket rates we could. During the internet hunt my Mum kept using the Calculator on her PC to do things like exchange rate conversions and to calculate price differences – but, being the Excel monkey that I am, I proposed a much better solution: let’s whack it all in Excel and let me make some pretty charts!

We wanted to do around 4 days in total: either all in Disney, all in Universal, or 2 days in one and 2 in the other. If that was far too expensive, just 2 days in one of them may have been necessary. For our 2-day and 4-day prices we’ve used the listed price on the American site portal for Universal Studios Orlando and Walt Disney World. We also found that there were pretty good deals for a 14-day ticket for both parks on American Attractions, with an extra £8 off by using the code MSE8 in the checkout (found on Money Saving Expert). The 14-day tickets would afford us a touch more flexibility if we really wanted to do more than 4 days. Due to parking prices for both parks, we’ve also looked at how expensive it would be for one person to buy an annual pass (which would be especially useful as my parents go to Florida often), as these annual passes often give free parking as an extra. We also found that the Universal Studios annual pass gives you 15% off multi-day passes bought at the front gate, so I’ve factored that into our calculations (one issue is that Universal Studios are currently offering up to $20 off for buying online, but I couldn’t find the price before the discount to use as the ‘at-the-gate’ price, so this may be a little underestimated).

So, after calculating the prices using an exchange rate of 1.40 USD to 1 GBP and adding sales taxes to the American prices, here are our findings – presented using lovely bullet charts (click here for my first post on bullet charts if you want to find out more):

Price Per Visit

Price Per Visit calculated using the total price for 3 people.

Total Price

The cheapest option is to buy a 4-day pass per person for Universal Studios, but the option that provides the most flexibility, and I think for us the best value (I just love how vague value is as a concept), is to get the annual pass for one person and 14-day tickets for the other two people. For 4 visits, it’s about £30 more expensive per visit than the 4-day tickets plus annual pass, but as soon as we go a 5th time it becomes better value at £148.63 per visit. So, it saves us from being locked into only going 4 times; offers us discounts in general around the parks; gives us the option to go to CityWalk and restaurants like the Hard Rock Café, or to go to the cinema (which we would also get discounts at); and if we’d like to go to the parks after going for food, for example, we could.
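
For anyone who would rather script this than use Excel, here is a rough sketch of the two calculations behind the charts: adding sales tax to a US list price and converting it to pounds, then dividing a total by the number of visits. The 1.40 exchange rate is the one quoted above. The 6.5% tax rate and the $280 ticket are assumptions made up for illustration (the spreadsheet's actual inputs aren't reproduced here), and the total for the annual pass plus 14-day tickets is simply back-calculated from the £148.63 figure at 5 visits.

```python
USD_PER_GBP = 1.40       # exchange rate quoted in the post
SALES_TAX_RATE = 0.065   # ASSUMPTION: the post does not state the tax rate used


def usd_to_gbp_with_tax(price_usd, tax_rate=SALES_TAX_RATE, usd_per_gbp=USD_PER_GBP):
    """Add sales tax to a US list price, then convert the result to pounds."""
    return price_usd * (1 + tax_rate) / usd_per_gbp


def price_per_visit(total_price_gbp, visits):
    """Spread a fixed total (for all 3 people) over the number of park visits."""
    return total_price_gbp / visits


# Hypothetical $280 list price, purely to show the conversion step.
example_ticket_gbp = usd_to_gbp_with_tax(280.00)
print(f"Example ticket: £{example_ticket_gbp:.2f}")

# Total back-calculated from the post's £148.63 per visit at 5 visits.
flexible_total_gbp = 148.63 * 5
for visits in (4, 5, 6):
    per_visit = price_per_visit(flexible_total_gbp, visits)
    print(f"{visits} visits: £{per_visit:.2f} per visit")
```

Because the total is fixed, every extra visit pulls the per-visit price down, which is why this option overtakes the 4-day tickets from the 5th visit onwards.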

Here’s a link to my spreadsheet. I’ve done the calculations for up to 3 people, but if you wanted to piggyback off my work it wouldn’t be too hard to add in the calculations for more than 3 people provided you’re comfortable using Excel.

If you have any of your own money saving tips or sites please do mention them below, and if you find any better deals definitely comment, we’ve still got a couple weeks until we need to purchase the tickets, so it’ll be really useful for us too!


Splatfest Analysis (Part 3)

In this final Splatfest Analysis, we will be taking a closer look at the distributions of the 9 games I had with the Splattershot to determine whether its consistent place in the top 3 was due to me consistently playing well with the weapon, or whether the averages were skewed by particularly high leverage outliers.

The points scored with the Splattershot were relatively consistent at around 800-900. A particularly good game where I scored 1180 points skewed the mean upwards, but the swing wasn’t dramatic enough to pull the mean far from the median; it stayed near where most of the points were clustered.

The number of kills I got each game with the Splattershot was also consistent, at around 5-6 kills per game. A slightly low game with 3 kills did skew the mean downwards a little, but again it stayed close to the median and to where most points were clustered.

The number of deaths each game also showed very little variation, with the most frequent number of deaths being 0. It ranged from 0-4 with a fairly even spread, meaning no individual point had a particularly large influence on the mean of 1.67.
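
The check described above boils down to comparing the mean against the median, and seeing how far the mean moves when the most extreme game is left out. A minimal sketch in Python is below. The nine scores are invented to resemble the description (clustered around 800-900 with one 1180-point game); they are not the real match data, which lives in the Excel file.

```python
from statistics import mean, median

# INVENTED points for 9 games, shaped like the description above.
points = [810, 830, 850, 860, 870, 880, 890, 900, 1180]

avg = mean(points)
mid = median(points)

# Drop the single best game to see how much it pulls the mean upwards.
avg_without_best = mean(sorted(points)[:-1])

print(f"mean: {avg:.1f}, median: {mid}")
print(f"mean without the best game: {avg_without_best:.1f}")
print(f"mean sits {100 * (avg - mid) / mid:.1f}% above the median")
```

If the mean barely moves away from the median, as here, the outlier isn't doing much of the work and the average is a fair summary of a typical game.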

In conclusion, it does seem that I was consistently good with the Splattershot – meaning it was well deserving of its top 3 positions despite its lower sample size.

And with that this utterly riveting multi-part analysis comes to a close! My main take-aways from a data visualisation and analysis stand-point when carrying out these analyses were that:

  • Box and Whisker Plots in Excel 2016 are fickle and I’ll likely avoid them in future!
  • A more heavily annotated set of graphics may have been more suitable, instead of writing up everything in these report-style blog posts. This would allow the analyses to better live as separate entities from this blog, which is especially important when doing analyses of sensitive data to make sure the data can’t easily be divorced from the analysts’ findings and any caveats to the data.
  • Using Excel to do ‘proper’ data analysis with the Analysis ToolPak is very easy. Until starting this I was unaware Excel could do things like linear regression (almost) natively; usually I would move my data over to a more specialised analytical software package like SPSS (and I have a particularly large disdain for SPSS!).

Thanks to those of you who took the time to read my ramblings and please don’t be afraid to leave any feedback you have in the comments section.

Here is the Excel file I used to create all of the graphics for this analysis: Link.


Zipf’s Law

After watching a long session of ‘VSauce’ videos (great brain food videos, albeit very addictive!), I came across this video discussing ‘Zipf’s Law’. Zipf’s law states that in any corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table, so the second most common word appears roughly half as often as the most common one, the third roughly a third as often, and so on. This Zipfian distribution applies to many different types of data studied across a variety of fields (the video discusses a large variety of these instances). Zipfian distributions also roughly follow the ‘Pareto Principle’, the 80-20 rule: around 80% of the words used in a corpus come from only around 20% of the unique words.
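
To make the rank-frequency idea concrete, here is a small Python sketch that counts words, ranks them by frequency and prints rank × count for each, which Zipf’s law says should stay roughly constant. The function name, the simple word-splitting rule and the toy sentence are all illustrative assumptions; a corpus this tiny won’t follow the law closely, so it only shows the mechanics.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokeniser: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Toy corpus for illustration only; swap in the text you want to test.
sample_text = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample_text)

for rank, word, count in table:
    # Under Zipf's law, rank * count is roughly the same for every word.
    print(rank, word, count, rank * count)
```

Plotting count against rank on log-log axes is the usual way to eyeball the result: a Zipfian corpus gives something close to a straight line sloping downwards.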

After being utterly captivated by this phenomenon, I decided it would be fun to check whether my blog, being a corpus of natural language, followed Zipf’s law.

To my delight it does! I also thought it would be a nice opportunity for me to experiment with colour, utilising a dark midnight blue as the background instead of the usual white. Using the complementary orange for the data points allows them to ‘pop’ from the background, which I think produces a nice effect.

If you disagree or have any other non-traditional colour combinations for charts feel free to let me know in the comment section!


Splatfest Analysis (Part 2)

Up next in this multi-part analysis is a look at my performance with each weapon I used during the Splatfest. I’ll take a closer look at my Win to Loss ratio with each weapon and how well I did overall with them, regardless of the outcome of the game.

Weapon Distribution

To get things started we’ll take a look at the number of games in which I used each weapon – this is necessary to help contextualise the rest of this analysis. Weapons with a high number of uses are much more likely to be reliable when drawing conclusions about how good or bad I was with them, as they are much less likely to be influenced by random chance. For example, with the Krak-On Splat Roller I can’t be sure that my results are because I’m bad with the weapon: with only 2 uses, my performance could be the result of a number of factors like my team composition (what weapons my team mates were using); the enemy team’s composition (did I just get unlucky and fight against those weapon types that counter Roller weapons?); or whether I was just having a bad/off game. Weapons like the Sploosh-o-Matic are more likely to have the effects of those factors smoothed out simply by having a larger sample size – I’m much less likely to have an unfavourable team composition every game for 21 games than I am for 2 games, as an example.

Now let’s look at my Win to Loss ratio with each weapon:

[Chart: Win to Loss ratio by weapon]

Ketchup vs Mayo, indicating my best (ketchup-coloured) and worst (mayo-coloured) Win:Loss ratios, yum.

Those weapons I did worst with (Blaster, Tentatek Splattershot and .52 Gal) are all weapons I used fewer than 5 times, so it’s hard to draw any concrete conclusions about whether I would have averaged out higher or lower with more games – but I can tell you that I did purposefully stop using these weapons because I felt completely useless with them (when we look at the next couple of charts we’ll see whether I actually was useless with them). The Tri-Slosher is the clear winner out of these – there are rumblings within the Splatoon 2 community that this weapon is overpowered, so I’m not massively surprised by this outcome. Its large ink spread, ability to 2-hit kill from a decent range and lack of ‘RNG’ (it will always have the same ink spread every shot, unlike other guns) mean it’s much easier to consistently do well with, which is most likely why I did so much better with it than other weapons over the course of the Splatfest. Of the weapons I used a lot (> 5 uses), the lowest win to loss ratio was with the Kelp Splat Charger, which, as discussed in my previous post, I got significantly (statistically speaking!) worse with throughout the Splatfest, overall losing a game for every game I won.
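
For reference, the ratio itself is just wins divided by losses per weapon, with a flag for weapons that have too few games to trust. The sketch below uses invented win and loss counts: only the Kelp Splat Charger’s 1:1 ratio and the fewer-than-5-games cut-off come from the text above, and the real numbers are in the Excel file.

```python
MIN_GAMES = 5  # weapons used fewer than 5 times are treated as small samples

# INVENTED (wins, losses) per weapon, for illustration only.
results = {
    "Tri-Slosher": (12, 4),
    "Kelp Splat Charger": (8, 8),
    "Blaster": (1, 3),
}


def win_loss_ratio(wins, losses):
    """Wins per loss; infinite if the weapon never lost."""
    return wins / losses if losses else float("inf")


summary = {}
for weapon, (wins, losses) in results.items():
    games = wins + losses
    summary[weapon] = {
        "ratio": win_loss_ratio(wins, losses),
        "games": games,
        "small_sample": games < MIN_GAMES,
    }

for weapon, stats in summary.items():
    note = " (small sample)" if stats["small_sample"] else ""
    print(f"{weapon}: {stats['ratio']:.2f} over {stats['games']} games{note}")
```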

Mayo-coloured bars are those weapons I did particularly poorly with… Or you could think of it as the colour for the Krak-On Splat Roller, either works.

There’s one clear conclusion we can draw from these… I suck with the Krak-On Splat Roller. As you can see, I definitely was not doing well with this weapon, with the lowest average points, KDA and average number of kills (and highest average number of deaths). I stopped using that weapon with good reason! Even accounting for potential bad team compositions or enemy counters, I don’t think I would have done any better; I was absolutely terrible with it.

Shooter weapons like the Splattershot and Sploosh-o-matic tended to be at the top of the average points overall, and as these weapons are generally recommended for Turf War it makes sense.

Despite my high win to loss ratio with the Tri-Slosher it ended up fairly middle of the pack for average points scored, being at the bottom of the group of weapons that I used frequently. However, it topped the average number of kills and was very close to the front of the pack with regards to KDA, potentially being why I won more games with it.

The other weapons I used fewer than 5 times also don’t seem to have performed particularly awfully. The .52 Gal was consistently at the bottom of the pack but was a country mile better than my performance with the Krak-On Splat Roller, and wasn’t too far from the rest of the pack, while the others were fairly spread out, with the Blaster even being the weapon with the 2nd most kills and 3rd highest KDA. The Blaster’s low point scores were its major downside, meaning I may be good with it, but it’s not particularly suited for Turf War, so I may have been correct in swapping out from it.

The Splattershot maintained a top 3 position in each of these metrics, which is probably why it averaged out to a positive win to loss ratio too. Being the all-rounder weapon it fits in well here – especially as it was the first weapon I used a good amount in both Splatoon 1 and during the first week after Splatoon 2 was released. The lower sample size (9) makes it hard to properly compare it with my other most used weapons, so I may take a look at its distributions for each of these metrics in a future part to see if any high leverage points may have skewed the average heavily for this weapon.

So in conclusion, I should never ever use the Krak-On Splat Roller (without a good amount of practice) and I was right to only use it twice. I tended to do best with the Tri-Slosher, with high amounts of kills and a decent KDA. My most consistently good weapon seems to be the Splattershot, keeping top 3 positions across all the metrics I looked at (although its lower sample size means it’s less easily comparable with my other most used weapons).

That’s it for this part! I’ll be taking a deeper dive into some of the specifics I brought up within this part of the analysis and looking at how I performed when winning and losing in the next exciting (… ahem) installment of this analysis.



Splatfest Analysis (Part 1)

Splatoon 2 came out on the Nintendo Switch a couple weeks back, and they’ve had their first ‘Splatfest’ since the game was released. ‘Splatfests’ are worldwide events that pit two similar things against each other in an opinion poll (for this Splatfest it was Mayo vs Ketchup). After voting for your favourite, you’ll then be playing the game mode ‘Turf War’ against members of the opposite team, to determine which side is the best at Splatoon. I picked Ketchup, and I battled valiantly to prove that Mayo is bad and everyone who likes it deserves to lose at Splatoon.

During this Splatfest, I used Nintendo’s phone app to collect the data of my Splatfest matches in order to analyse my performance over the 24-hour event. I’ll be using this data for a multi-part analysis, covering a variety of topics like which weapons I was best with, which stages I was better at and more!

Today’s post will be covering whether I improved at the game throughout the Splatfest, mainly focused on correlation analysis.

Did I improve my performance during Splatoon 2’s ‘Splatfest’ event?

Correlation Matrix

I wanted to check whether I had improved at all throughout the Splatfest – I did this by looking at my points scored, number of kills and number of deaths, and seeing whether any improved as I played more.

Looking at the correlation matrix and the P-Values for each of the options, most of my performance stayed relatively similar and the main reason my performance tended to vary was down to random chance, not me showing marked improvements or deterioration; apart from two things:

  1. I actually got worse at covering turf the more I played with the Charger weapon type (shown by the fairly strong negative correlation coefficient of -0.63). This is unlikely to be down to random chance (and was therefore statistically significant), as its P-Value was below the Alpha value of 0.05.
  2. I tended to die less the more I played. There was a weak negative correlation (with a coefficient of -0.27) between the number of games I had played in the Splatfest and the number of deaths I had. This is also unlikely to be down to random chance, as the P-Value was again less than the Alpha value.
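
As a rough illustration of what sits behind a correlation coefficient and its P-Value, the Python sketch below computes Pearson’s r between game number and deaths, then estimates a P-Value with a permutation test. Two caveats: the permutation test is a swapped-in technique chosen because it needs nothing beyond the standard library, not the method Excel uses, and the deaths data is invented to show a downward trend rather than taken from my matches.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Two-sided P-Value: how often shuffled data correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


ALPHA = 0.05
game_number = list(range(1, 13))
deaths = [5, 4, 5, 3, 4, 3, 3, 2, 3, 1, 2, 1]  # INVENTED data

r = pearson_r(game_number, deaths)
p = permutation_p_value(game_number, deaths)
print(f"r = {r:.2f}, p = {p:.4f}, significant: {p < ALPHA}")
```

A negative r means deaths fall as the game number rises, and a P-Value below the Alpha of 0.05 means a trend that strong would rarely appear if the order of the games made no difference.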

I could tell that both things were happening as I played. When I first started playing with the ‘Kelp Splat Charger’ I was having absolutely amazing games, often being top of my team’s scoreboard, but as I played with it more, and when playing during the second day of the Splatfest I could definitely tell my aim was getting worse and I was generally not playing as tightly as when I started. As the Splatfest went on I could also tell that I was slowly becoming more conservative in my play style; I felt that my team mates were always going in and dying constantly and I thought it would be best to generally stay back and make sure my team could super-jump to me (it rarely ever happened though… my team mates were playing like they had never played the game before, despite most being a similar level to me).

I also did a quick check to see whether points scored, number of kills and number of deaths all correlated, and it wasn’t really a surprise to see that as I scored more points I tended to get more kills and die less, but it’s nice to confirm it!


London Population Analysis

Spurred on by conversations with my brother and various Reddit posts making claims about immigrant populations in the UK, I decided to check the ONS website to find some of their recent analyses on the UK population (which are fantastic, by the way).¹

One claim I remember specifically wanting to check out from a Reddit user was that London was now over 50% non-British. Surprisingly (to me at least) this is true for some of the Local Authorities in London (by place of birth as opposed to Nationality), although for the wider picture of London this isn’t the case:



Data source: ONS, Population of the United Kingdom by Country of Birth and Nationality 2015²

Through these, we can see that Kensington and Chelsea has the highest percentage of Non-British nationals in Inner London, and Brent has the highest in Outer London. Both Inner London and Outer London have similar percentages for their British and Non-British nationals and places of birth. Kensington and Chelsea, Newham and Westminster are the Local Authorities (LAs) within Inner London that have 50% or more of their population born outside of the UK, and Brent and Harrow are the LAs within Outer London where 50% or more of the population was born outside of the UK.

The ONS data set also contains the top 5 places of birth and Nationality by Region (unfortunately not at LA level so we can’t investigate further into those 5 LAs with more than 50% of their population born outside the UK):

[Chart: Top 5 countries of birth and nationalities in London]

Source: ONS, Population of the United Kingdom by Country of Birth and Nationality 2015²

The two countries that appear in both lists are India and Poland, so we can see that of the 293,000 India-born people residing in London, 173,000 or 59.0% (not taking into account the confidence intervals, so this is really just an estimate!!) are British Nationals, whereas of the 170,000 Poland-born people residing in London, a much, much smaller percentage are British Nationals (I’m unable to even estimate this, due to the confidence intervals and estimates overlapping). This is echoed throughout the UK, where the ONS state the following: “This reflects that EU nationals have the freedom of movement between EU countries, whereas for non-EU nationals there is an incentive to acquire British nationality. This may also reflect the length of time that individuals have lived in the UK and the numbers born to UK nationals living abroad.”¹
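
The 59.0% figure above is just one point estimate divided by the other, as the short Python sketch below shows. It deliberately ignores the ONS confidence intervals, so treat the result as a ballpark and not a precise share; the variable names are mine.

```python
# Point estimates quoted above (ONS, London, 2015).
india_born_in_london = 293_000
india_born_british_nationals = 173_000

# Simple ratio of the two estimates; no confidence intervals applied.
share = india_born_british_nationals / india_born_in_london
print(f"{share:.1%} of India-born London residents are British Nationals")
```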

¹ Office for National Statistics, 2016

² Office for National Statistics, 2016