“We try to hire the best people and don’t believe in having talented people doing repetitive tasks,” says Facebook’s director of data centre operations Niall McEntegart.
Earlier this week, Facebook opened a new 120,000 sq-foot EMEA headquarters at Grand Canal Square in Dublin where it employs some 500 people who represent 50 nationalities. The company first established its Dublin operations in 2009 in the area now known locally as Silicon Docks.
Some 1.3bn people across the planet use the social network. Of those, 80pc access the site via mobile devices.
While most people know Facebook as a social network, the company is really at the coalface of major advances in enterprise computing and data-centre technology.
Three years ago, Facebook started the Open Compute Project (OCP) with the goal of building one of the most efficient computing infrastructures at the lowest possible cost.
As a result, the company has saved US$1.3bn by focusing on efficient designs and using open-source technologies.
Can you outline the breadth and scope of the technology rollout across your organisation and what improvements it will bring to the company?
As director of data centre operations for Europe and the eastern half of the US, I don’t get involved in pure business IT-related projects as much as a CIO would. I am more aligned with production and infrastructure, delivering the product to end users at massive scale, rather than being internally focused as a lot of CIOs would be.
Do you have a large in-house IT team, or do you look to strategically outsource where possible?
The Dublin office has a considerable engineering group: about a fifth of the office, more than 100 people, are technical engineers of one description or another.
The roles vary across disciplines, ranging from security and privacy, to the developer support engineering team that supports the 300,000-plus third-party developers (including players such as King.com and Zynga) who build apps on the Facebook platform, through to IT support for the business and analytics.
The biggest group by a distance is infrastructure engineering, which is based in Dublin. We specialise in delivering content to the people who use Facebook every day on their desktops or mobiles: how that information travels from those devices across our network to the data centres where it is processed and stored, and back to them, every time they use Facebook.
We look after everything from the network infrastructure, rolling out network capacity in cities and countries closer to those people to give them better performance, through to operating the data centres themselves.
It is about ensuring we have computing capacity as we grow, and that will have to increase considerably into the future as data usage rises.
We also do things such as database operations, and our automation teams focus very heavily on how we bring up new computing capacity and manage the lifecycle through to decommissioning old equipment; we automate all of that.
Plus we have the Web Foundation teams, which track the performance of Facebook, identify issues as they arise and proactively do whatever needs to be done.
The infrastructure teams in Dublin are there to make sure everything runs smoothly; when we do that well, people tend not to know that we are there.
How has the Open Compute initiative transformed Facebook’s entire data infrastructure?
That is key to everything. Three or four years ago we decided to roll the dice, cross the Rubicon and adopt a relatively new approach. We looked at our computing infrastructure from an entire technology stack perspective: everything from the software that runs on the hardware, how it runs and the resources it uses, to the hardware itself, storage hardware, power infrastructure and network, through to the actual cooling and the data centre building.
We took all of those individual pieces and tried to optimise them as an entire technology stack, to make it more efficient and cost less.
We started two and a half years ago by making some of those designs available through the Open Compute Project, and I think we have 300 members at this stage, all leading industry organisations, manufacturers, designers and integrators, all leaders in their own fields.
The bits we’re good at, we have contributed to the community, and likewise we can consume other people’s contributions and collaborate with them.
It has worked extremely well. We have saved about US$1.3bn over the last two to three years from using Open Compute. That is a very significant amount of money and has contributed to the service.
What are the main points of your company’s IT strategy?
We need to be as efficient and cost effective as we possibly can in how we do things. Facebook is a free service and it will remain that way; as a result, we have to be efficient, and it goes back to our culture as a company.
We call ourselves a hacker company and it is very much in our DNA. We don’t believe in always doing things the way they have been done and simply consuming what’s there; if there’s a better way of doing something, we try to figure that out and do it as effectively and cost efficiently as we can.
Open Compute is definitely part of that, one of the main drivers.
There are a number of data centres, located mainly in the US, but also one in northern Sweden that went live last year. Again, it’s a pretty cutting-edge facility: it operates at a PUE of 1.07 and uses 100pc renewable energy.
It uses 100pc Open Compute hardware and it is performing very well. The applications are spread out globally for redundancy.
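For context, PUE (power usage effectiveness) is the ratio of total facility energy to the energy that actually reaches the IT equipment, so a PUE of 1.07 leaves only about 7pc overhead for cooling and power distribution. The sketch below is a minimal illustration of that arithmetic: the 10MW IT load is a made-up figure, and only the 1.07 ratio comes from the interview.

```python
# Illustrative PUE arithmetic; only the 1.07 ratio is from the interview.
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT equipment power."""
    return total_facility_kw / it_equipment_kw

it_load_kw = 10_000              # hypothetical 10MW of IT load
facility_kw = it_load_kw * 1.07  # total draw implied by a PUE of 1.07

overhead_kw = facility_kw - it_load_kw
print(f"PUE: {pue(facility_kw, it_load_kw):.2f}, "
      f"overhead: {overhead_kw:.0f} kW ({overhead_kw / it_load_kw:.0%} of IT load)")
# -> PUE: 1.07, overhead: 700 kW (7% of IT load)
```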
What are the big trends and challenges in your sector, and how do you plan to use IT to address them?
Mobile is driving change but it is also about coping with vast quantities of data.
The biggest challenge is the growth of data and the cost of storing it for long periods of time.
Mobile devices are really driving the volume of data being stored and processed, because people can so easily create hi-res photos, so the files are getting bigger, and people are generating huge volumes of video.
If you walk forward 10, 20 or 30 years from now, how will that look?
Facebook is in a very privileged position, we count ourselves very lucky, and we are very proud of the fact that people use Facebook as a life album for events in their lives. It’s their own personal lives, a photo book of what’s happening in life. If you look at where we are, 10 years old now with 1.35bn people using Facebook, and walk forward 10 or 20 years from now, that’s a huge volume of information that needs to be stored long term.
How do you do that? We’re looking at technologies such as cold storage. Take photographs, for example: today we call them hot data, and over time they get colder as people access them less and less.
But people may want to go back and look at them, and that access has to remain fast. People expect that information to be stored safely and to be accessible when they need it, so how do you do that over long periods of time?
We have had to develop some very interesting solutions to deal with this. One of these is storage built on Open Compute; it’s an unconventional storage solution, and we actually built specific data centres to house this equipment.
The data centres themselves are very much bare-bones data centres. They don’t need back-up power or all sorts of redundancy, to the point that if the utility goes down, the data centre will go down, but it does so gracefully and then comes back up once power is restored. You’re talking 99.999pc availability anyway, so it’s very rare that that would happen.
That has given us the scalability for very long-term storage of data that isn’t accessed very often.
We have been able to reduce costs to 20pc of the cost of a traditional data centre, and these facilities are also able to hold eight times the storage of a standard storage solution.
We can store 2 petabytes per rack while drawing only 2kW of power, and those buildings hold about 1,500 racks.
That’s our first generation of cold storage: tiered data, redundancy and power efficiency.
We are looking even beyond that – but it is one excellent solution that is working well.
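Those per-rack figures add up quickly. The following is a rough, illustrative calculation that uses only the numbers quoted above (2 petabytes and 2kW per rack, roughly 1,500 racks per building); it is an editorial sketch, not a figure Facebook has published.

```python
# Back-of-the-envelope cold-storage arithmetic using only the figures quoted above.
racks_per_building = 1_500
petabytes_per_rack = 2
kw_per_rack = 2

total_pb = racks_per_building * petabytes_per_rack        # 3,000 PB
total_eb = total_pb / 1_000                               # about 3 exabytes per building
rack_power_mw = racks_per_building * kw_per_rack / 1_000  # about 3 MW across the storage racks

print(f"~{total_eb:.0f} EB of cold storage per building on ~{rack_power_mw:.0f} MW of rack power")
```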
What does the future of storage look like?
Most storage today is still based on traditional hardware technologies, such as hard drives. Regardless of what you do, you still have limited capacity, so we are investing in interesting solutions such as Blu-ray.
Ten thousand Blu-ray discs hold a petabyte of data, which is pretty amazing, and the interesting thing is there is quite a bit of Blu-ray production capacity out there.
Use of Blu-ray is decreasing quickly; it’s becoming an old technology. People are using streaming much more, so there is excess Blu-ray production capacity in the world, and this could work out very well for Blu-ray manufacturers and people who need storage.
We are looking at building prototypes of entire Blu-ray libraries. If you think about how that would work: they cost 50pc of the cost of our current cold storage solution, and compared with conventional storage solutions you’re talking about one-sixth of the cost of traditional storage.
The interesting thing, as well, is really the power and longevity. Blu-ray discs could last up to 50 years without an issue, plus you only really use a small amount of power when doing the original write to the disc; after that it’s just sitting there, not being used, so it doesn’t need any power at all.
And on the rare occasions you are using it, reading takes only a fraction of the power. The power reduction is 80pc compared with the current cold storage solution, more than 400 watts, which is tiny.
So you start to get to a position where it becomes much more cost effective and easier to do long-term storage of data.
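The figures quoted can be combined into a rough, illustrative sketch; the disc count per petabyte and the cost ratios come from the interview, while the per-disc capacity simply falls out of the arithmetic.

```python
# Rough arithmetic combining the Blu-ray figures quoted above.
discs_per_petabyte = 10_000
gb_per_disc = 1_000_000 / discs_per_petabyte  # 1 PB / 10,000 discs = 100 GB per disc

cold_storage_cost = 1.0                # current cold storage tier, normalised
bluray_cost = 0.5 * cold_storage_cost  # "50pc of the cost of our current cold storage"
traditional_cost = bluray_cost * 6     # "one-sixth of the cost of traditional storage"

print(f"{gb_per_disc:.0f} GB per disc; Blu-ray at roughly "
      f"{bluray_cost / traditional_cost:.0%} of the cost of traditional storage")
# -> 100 GB per disc; Blu-ray at roughly 17% of the cost of traditional storage
```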
That is unconventional thinking, but it is how we need to start thinking if we are going to be able to do this. All companies need to think this way and consider their data challenges going forward.
We all need to start thinking of interesting and non-conventional solutions, otherwise the cost of storage is going to increase.
You have a lot of responsibility but also latitude to try new things. Is this because your CEO is a coder and Facebook has embraced hacker culture?
It’s a hacker culture and we’re proud of it. Even as we’ve grown, we’ve tried extremely hard to keep that and not develop a lot of the bureaucratic problems companies have as they grow in scale. It is something we put a lot of energy into, making sure we actually stay on track.
We tend to use far fewer people than a lot of other companies. When you talk about automation and that mentality, if you look at our data centres, we use automation everywhere. We try to hire the best people and don’t believe in having talented people doing repetitive tasks.
Industry failure rates are 4-5pc annually, but our data centre in northern Sweden runs at 0.15pc, which is far lower, plus we’ve built automation to fix issues in the data centre.
When failures happen in our data centres, more than 50pc are fixed by tools we built internally, called Cyborgs, without being touched by a human. And for the ones a Cyborg can’t fix, we generate a ticket that includes full diagnostics.
Our engineers know exactly what parts are needed and where they are needed, and they’re not going back and forth trying to figure out what’s wrong. We have 100pc fix rates. Combining things like this allows us to scale at unprecedented levels compared to other companies.
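McEntegart doesn’t describe how the Cyborg tooling is implemented, but the flow he outlines (attempt an automated fix first, and only open a fully diagnosed ticket for a technician when that fails) might look roughly like the sketch below; every name and function in it is hypothetical, purely to illustrate the pattern.

```python
# Hypothetical sketch of the remediation flow described above: automated fixes
# first, and a fully diagnosed ticket for a human technician only when they fail.
from dataclasses import dataclass

@dataclass
class Failure:
    server_id: str
    symptom: str       # e.g. "disk_unresponsive"
    diagnostics: dict  # data gathered automatically when the failure is detected

def attempt_auto_remediation(failure: Failure) -> bool:
    """Stand-in for Cyborg-style tooling: returns True if an automated fix succeeded."""
    known_fixes = {"disk_unresponsive": "remount_and_rescan",
                   "service_hung": "restart_service"}
    return failure.symptom in known_fixes  # pretend known fixes always work

def handle_failure(failure: Failure) -> str:
    if attempt_auto_remediation(failure):
        return f"{failure.server_id}: fixed automatically, no human involved"
    # Otherwise raise a ticket that already carries the diagnostics, so the
    # technician isn't going back and forth figuring out what's wrong.
    return (f"{failure.server_id}: ticket opened with diagnostics "
            f"{failure.diagnostics} for on-site repair")

print(handle_failure(Failure("rack42-node7", "disk_unresponsive", {"smart_errors": 12})))
```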
We have reached a point where in Sweden we have one person overseeing 30,000 servers without incident and that’s pretty impressive.