KnowForGrow

Big Data Storage

Hyperscale:-

At root, the key requirements of big data storage are that it can handle very large amounts of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
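To make the IOPS requirement concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the figures used (10,000 IOPS, 64 KB per operation, a 1 TB data set) are illustrative assumptions, not measurements from any real system.

```python
def throughput_mb_per_s(iops: int, io_size_kb: int) -> float:
    """Approximate sustained throughput in MB/s for a given IOPS and I/O size."""
    return iops * io_size_kb / 1024


def scan_time_hours(dataset_tb: float, iops: int, io_size_kb: int) -> float:
    """Approximate time in hours to read a whole data set once."""
    dataset_mb = dataset_tb * 1024 * 1024
    return dataset_mb / throughput_mb_per_s(iops, io_size_kb) / 3600


# Hypothetical figures: 10,000 IOPS at 64 KB per operation, 1 TB of data.
rate = throughput_mb_per_s(10_000, 64)
hours = scan_time_hours(1, 10_000, 64)
print(f"{rate:.0f} MB/s, {hours:.2f} hours to scan 1 TB")
```

The point of the arithmetic is simply that capacity and IOPS have to grow together: doubling the data set at the same IOPS doubles the time an analytics job waits on storage.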

The largest big data practitioners, such as Google, Facebook and Apple, run what are known as hyperscale computing environments.

These comprise vast numbers of commodity servers with direct-attached storage (DAS). Redundancy is at the level of the entire compute/storage unit: if any component in a unit fails, the unit fails over to its mirror and is then replaced wholesale.
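The unit-level redundancy described above can be sketched as a mirrored pair in which any component failure takes the whole unit out of service. This is a minimal illustration only; the `Unit` and `MirroredPair` names and the node labels are hypothetical and do not reflect how any particular hyperscale operator implements failover.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One commodity compute/storage server (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class MirroredPair:
    active: Unit
    standby: Unit
    replaced: list = field(default_factory=list)

    def report_failure(self, spare_name: str) -> str:
        """Handle a failure of any component in the active unit.

        The mirror takes over immediately, and the failed unit is swapped
        out wholesale for a fresh spare rather than being repaired in place.
        """
        failed = self.active
        failed.healthy = False
        self.active = self.standby          # fail over to the mirror
        self.standby = Unit(spare_name)     # whole-unit replacement
        self.replaced.append(failed.name)
        return self.active.name


pair = MirroredPair(Unit("node-a"), Unit("node-b"))
print(pair.report_failure("node-c"))  # the mirror is now serving requests
```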

Such environments run analytics engines such as Hadoop and NoSQL databases such as Cassandra, and typically have PCIe flash storage in the server, either alone or in addition to disk, to cut storage latency to a minimum. There's no shared storage in this type of configuration.

Hyperscale computing environments have been the preserve of the largest web-based operations to date, but it is highly probable that such compute/storage architectures will bleed down into more mainstream enterprises in the coming years.

The appetite for building hyperscale systems will depend on the ability of an enterprise to take on a lot of in-house hardware building and maintenance, and on whether it can justify such systems to handle limited tasks alongside more traditional enterprise environments that run large numbers of applications on less specialised systems.

NAS:-

Hyperscale is not the only way. Many enterprises, and even quite small businesses, can take advantage of big data analytics. They will need the ability to handle relatively large data sets and handle them quickly, but may not need quite the same response times as those organisations that use it to push adverts out to users within a few seconds.

So the key type of big data storage system with the attributes required will often be scale-out or clustered NAS. This is file access shared storage that can scale out to meet capacity or increased compute requirements. It uses parallel file systems that are distributed across many storage nodes and can handle billions of files without the kind of performance degradation that happens with ordinary file systems as they grow.
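As a simplified illustration of how a file system can spread files across many storage nodes, the sketch below hashes each file path to pick a node. This is an assumption made for the sake of example: real scale-out NAS products use their own, more sophisticated placement and rebalancing schemes, and the node names here are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Sequence


def node_for(path: str, nodes: Sequence[str]) -> str:
    """Pick a storage node for a file by hashing its path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["nas-node-1", "nas-node-2", "nas-node-3", "nas-node-4"]
paths = [f"/data/logs/file-{i}.csv" for i in range(10_000)]

# Count how many files land on each node; the spread should be roughly even.
print(Counter(node_for(p, nodes) for p in paths))
```

Because placement depends only on the path, any node can work out where a file lives without consulting a central directory tree. One limitation of this naive modulo scheme is that adding a node moves most files, which is one reason real systems use more elaborate approaches.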

For some time, scale-out or clustered NAS was a distinct product category, with specialised suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought relatively recently by big storage suppliers - EMC and Hitachi Data Systems, respectively.

Meanwhile, clustered NAS has gone mainstream, and the big change here was with NetApp incorporating true clustering and petabyte/parallel file system capability into its Data ONTAP OS in its FAS filers.

Object Storage:-

The other storage format that is built for very large numbers of files is object storage. This tackles the same challenge as scale-out NAS - that traditional tree-like file systems become unwieldy when they contain large numbers of files. Object-based storage gets around this by giving each file a unique identifier and indexing the data and its location. It's more like the DNS way of doing things on the internet than the kind of file system we're used to.
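The idea of a flat namespace keyed by unique identifiers can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not any vendor's API: the `ObjectStore` class, its `put` and `get` methods and the three-node placement rule are all hypothetical.

```python
import uuid


class ObjectStore:
    """Toy object store: no directory tree, only IDs and an index."""

    def __init__(self, node_count: int = 3):
        self._node_count = node_count
        self._index = {}  # object ID -> (location, metadata)
        self._blobs = {}  # location -> raw bytes

    def put(self, data: bytes, **metadata) -> str:
        """Store data and return the unique identifier assigned to it."""
        object_id = uuid.uuid4().hex
        node = len(self._blobs) % self._node_count  # naive placement
        location = f"node-{node}/{object_id}"
        self._blobs[location] = data
        self._index[object_id] = (location, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Look up the location in the index, then fetch the data."""
        location, _metadata = self._index[object_id]
        return self._blobs[location]


store = ObjectStore()
oid = store.put(b"quarterly sales figures", name="sales.csv")
print(oid, store.get(oid))
```

Retrieval is a single index lookup regardless of how many objects exist, which is the property that lets this approach avoid the slowdown of walking a deep directory tree.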

Object storage systems can scale to very high capacity and to billions of files, so are another option for enterprises that want to take advantage of big data. Having said that, object storage is a less mature technology than scale-out NAS.

So, to sum up, big data storage needs to be able to handle capacity and provide low latency for analytics work. You can choose to do it like the big boys in hyperscale environments, or adopt scale-out NAS or object storage in more traditional IT departments to do the job.
