Big Data Storage
Hyperscale:-
At root, the key requirements of big data storage are that it can handle very large amounts of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
The largest big data practitioners - Google, Facebook, Apple, etc - run what are known as hyperscale computing environments.
These comprise vast numbers of commodity servers with direct-attached storage (DAS). Redundancy is at the level of the entire compute/storage unit: if any component in a unit fails, the workload fails over to the unit's mirror and the failed unit is replaced wholesale.
Such environments run the likes of Hadoop and NoSQL databases such as Cassandra as analytics engines, and typically use PCIe flash storage in the server, either alone or in addition to disk, to cut storage latency to a minimum. There is no shared storage in this type of configuration.
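The unit-level redundancy described above can be illustrated with a minimal Python sketch. It is a toy model, not how any particular hyperscale operator does it: the names `Unit`, `MirroredPair`, `fail_over` and `replace_standby` are hypothetical, and the "storage" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A whole compute/storage unit (hypothetical model: server plus its DAS)."""
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)


class MirroredPair:
    """Two units holding the same data; one is active, the other is its mirror."""

    def __init__(self, primary: Unit, mirror: Unit):
        self.active = primary
        self.standby = mirror

    def write(self, key, value):
        # Every write goes to both units so the mirror is always current.
        self.active.data[key] = value
        self.standby.data[key] = value

    def fail_over(self):
        # Swap roles: the mirror takes over, the failed unit awaits replacement.
        if not self.standby.healthy:
            raise RuntimeError("both units in the pair have failed")
        self.active, self.standby = self.standby, self.active

    def read(self, key):
        # Any fault in the active unit triggers failover of the entire unit.
        if not self.active.healthy:
            self.fail_over()
        return self.active.data[key]

    def replace_standby(self, new_unit: Unit):
        # The failed unit is not repaired; it is swapped out wholesale
        # and the new unit is re-mirrored from the active one.
        new_unit.data = dict(self.active.data)
        self.standby = new_unit


pair = MirroredPair(Unit("unit-a"), Unit("unit-b"))
pair.write("clicks:2013-01-01", 1250)
pair.active.healthy = False                # simulate a component failure
value = pair.read("clicks:2013-01-01")     # served by the mirror
pair.replace_standby(Unit("unit-c"))       # failed unit replaced wholesale
```

The point of the design is that nothing inside a unit is diagnosed or repaired in place; the unit as a whole is the unit of failure and replacement.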
Hyperscale computing environments have been the preserve of the largest web-based operations to date, but it is highly probable that such compute/storage architectures will bleed down into more mainstream enterprises in the coming years.
The appetite for building hyperscale systems will depend on an enterprise's ability to take on a lot of in-house hardware building and maintenance, and on whether it can justify such systems for a limited set of tasks alongside more traditional enterprise environments that run large numbers of applications on less specialised systems.
NAS:-
Hyperscale is not the only way. Many enterprises, and even quite small businesses, can take advantage of big data analytics. They will need the ability to handle relatively large data sets and handle them quickly, but may not need quite the same response times as those organisations that use it to push adverts out to users within a few seconds.
So the key type of big data storage system with the attributes required will often be scale-out or clustered NAS. This is file-access shared storage that can scale out to meet capacity or increased compute requirements. It uses parallel file systems that are distributed across many storage nodes and can handle billions of files without the kind of performance degradation that ordinary file systems suffer as they grow.
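To make the scale-out idea concrete, here is a minimal Python sketch of one way files could be spread across storage nodes so that adding a node moves only part of the data. It uses consistent hashing, which is an assumption chosen for illustration; real parallel file systems use their own, more sophisticated placement schemes. The class `HashRing`, the node names and the file paths are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring that assigns file paths to storage nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def node_for(self, path: str) -> str:
        # A file belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._points)
        return self._owners[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
paths = [f"/data/file{i}" for i in range(1000)]
before = {p: ring.node_for(p) for p in paths}

ring.add_node("node-d")    # scale out by one node
after = {p: ring.node_for(p) for p in paths}
moved = [p for p in paths if before[p] != after[p]]
```

In this sketch, only the files that now belong to the new node change location; everything else stays where it was, which is what lets such a cluster grow without a full reshuffle of its data.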
For some time, scale-out or clustered NAS was a distinct product category, with specialised suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought relatively recently by big storage suppliers - EMC and Hitachi Data Systems, respectively.
Meanwhile, clustered NAS has gone mainstream. The big change here came when NetApp incorporated true clustering and petabyte-scale parallel file system capability into the Data ONTAP OS on its FAS filers.
Object Storage:-
The other storage format that is built for very large numbers of files is object storage. This tackles the same challenge as scale-out NAS - that traditional tree-like file systems become unwieldy when they contain large numbers of files. Object-based storage gets around this by giving each file a unique identifier and indexing the data and its location. It's more like the DNS way of doing things on the internet than the kind of file system we're used to.
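The identifier-plus-index approach described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's API: the class `ObjectStore`, its methods and the node names are hypothetical, the nodes are in-memory dictionaries, and placement by hashing the identifier is just one simple choice.

```python
import hashlib
import uuid


class ObjectStore:
    """Toy flat-namespace object store: no directory tree, only IDs and an index."""

    def __init__(self, node_names):
        # Each storage node is modelled as a dict of object ID -> bytes.
        self.nodes = {name: {} for name in node_names}
        # The index records where each object lives, plus its metadata.
        self.index = {}

    def put(self, data: bytes, metadata=None) -> str:
        object_id = uuid.uuid4().hex          # unique identifier for the object
        names = sorted(self.nodes)
        digest = hashlib.sha256(object_id.encode()).digest()
        node = names[int.from_bytes(digest[:8], "big") % len(names)]
        self.nodes[node][object_id] = data
        self.index[object_id] = (node, dict(metadata or {}))
        return object_id

    def locate(self, object_id: str) -> str:
        # Much like a DNS lookup: resolve an identifier to a location.
        return self.index[object_id][0]

    def get(self, object_id: str) -> bytes:
        return self.nodes[self.locate(object_id)][object_id]


store = ObjectStore(["node-a", "node-b", "node-c"])
oid = store.put(b"2013-01-01,clicks,1250", {"type": "log"})
payload = store.get(oid)
```

A caller never supplies a path. It keeps the identifier returned by `put` and hands it back to `get`, so there is no tree to traverse however many objects are stored.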
Object storage systems can scale to very high capacity and to billions of files, so they are another option for enterprises that want to take advantage of big data. Having said that, object storage is a less mature technology than scale-out NAS.
So, to sum up, big data storage needs to be able to handle capacity and provide low latency for analytics work. You can choose to do it like the big boys in hyperscale environments, or adopt scale-out NAS or object storage in more traditional IT departments to do the job.