The rise of web applications (websites that replace the functions of software traditionally installed on a personal computer) is one of the hottest topics in the tech industry. Huge numbers of “Web 2.0” startups are competing for user attention, and many observers predict rapid growth for web applications (Rubicon Consulting, Inc. 2007). Usage of a web application can outpace initial expectations. Growth is good for business, but it creates real challenges in keeping everything running quickly; in particular, increasing traffic places heavy demands on database servers. End users are also becoming more and more sensitive to the quality of the services offered. Meeting their expectations requires pushing quality of service (QoS) requirements into database processing and making the database system scalable (Ye 2002). Web applications, however, suffer from unpredictable load, especially due to events such as breaking news (e.g., Hurricane Katrina) and sudden popularity spikes (e.g., the “Slashdot Effect”) (Manjhi et al. 2009). Investing in a server farm that can accommodate such high loads is not only expensive (particularly after factoring in management costs) but also risky, because the expected customers might not show up. Content Delivery Networks (CDNs) address part of this problem by maintaining a large, shared infrastructure that absorbs the load spikes that may occur for any individual application. However, CDNs currently do not provide a way to scale the database component of a web application; they only mitigate network load. The CDN solution is therefore not sufficient when the database system is the bottleneck, as it is in many web applications.
Currently, most of these organizations store structured data in relational databases for subsequent access and analysis. The structure of data in a relational database is predefined by the layout of the tables and the fixed names and types of the columns. Scaling is done by distributing the data and load across multiple powerful and expensive servers (Leavitt 2010, 13). When traffic is unpredictable, it is very difficult for companies to justify investing in large server farms. According to Jeremy Zawodny, “Relational databases don’t work easily in a distributed manner because joining their tables across a distributed system is difficult”. Stephen O’Grady, an analyst with market research firm RedMonk, states that “relational databases aren’t designed to function with data partitioning, so distributing their functionality is a chore”. In addition, with relational databases, users must convert all data into tables. When the data doesn’t fit easily into a table, the database’s structure can be complex, difficult, and slow to work with, as is the case with most web applications (Leavitt 2010). Stefan Edlich, professor at the Beuth University of Applied Sciences in Berlin, argues that SQL can entail large amounts of complex code and doesn’t work well with modern, agile development. As such, relational databases do not lend themselves well to the fast-moving Web 2.0 era (Ye 2002).
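The difficulty of joining tables across a distributed system can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the shard count, the `users` and `posts` tables, and all function names are hypothetical, and in-memory dictionaries stand in for separate database servers. It shows that once rows are partitioned by key, a join that a single relational server would perform internally has to be assembled by the application, one cross-shard lookup at a time.

```python
import zlib

NUM_SHARDS = 3  # hypothetical number of database servers

# Each dictionary stands in for one server holding a slice of both tables.
shards = [{"users": {}, "posts": {}} for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """Map a key to a shard index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def insert_user(user_id: str, name: str) -> None:
    shards[shard_for(user_id)]["users"][user_id] = name


def insert_post(post_id: str, user_id: str, title: str) -> None:
    # Posts are partitioned by post_id, so a post and its author
    # will often live on different shards.
    shards[shard_for(post_id)]["posts"][post_id] = (user_id, title)


def posts_with_author():
    """Application-level join: visit every shard, then look up each
    author on whichever shard holds that user."""
    results = []
    for shard in shards:
        for post_id, (user_id, title) in shard["posts"].items():
            # On real servers this lookup would be a network round trip.
            author = shards[shard_for(user_id)]["users"][user_id]
            results.append((post_id, title, author))
    return sorted(results)


insert_user("u1", "alice")
insert_user("u2", "bob")
insert_post("p1", "u1", "Hello")
insert_post("p2", "u2", "Scaling")
insert_post("p3", "u1", "Sharding")
joined = posts_with_author()
```

In this sketch the join touches every shard and issues one extra lookup per post, which hints at why distributing relational functionality is described above as a chore.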
As a web application grows, the limitations of relational databases hold back performance and become the single largest bottleneck in the application architecture. This has been demonstrated again and again. For example, Reddit, a popular social news aggregation and link-sharing site, has seen growth of almost 20% month on month (Edberg 2010). Its architecture currently relies on a PostgreSQL database that is sharded (partitioned) across multiple data stores. Even with caching solutions such as Memcached, the site has been unable to keep up with this growth, and according to its administrators it has reached the point where it is forced to look into other technical solutions.
Another example of the limitations of a traditional structured database straining application performance is the Center for Technology and Innovation (CTI). Its shield application currently stores its data in a MySQL database, and since the database has grown to more than 4 GB of data it has had serious performance problems. One of the main reasons is that the current database structure stores the information in a single table, so sharding it across multiple servers is not an option. Because the table is larger than the RAM available on the server, the server has to perform slow and expensive disk I/O to retrieve the information rather than reading it from caches that live in RAM. Unlike at Reddit, a caching solution such as Memcached will not help here either, because cached data would not be hit more than once: owing to the nature of the reports, each one makes multiple references to the data and its queries have to be computed each time. The only option for scalability in this scenario is vertical scaling, that is, throwing more RAM at the problem. But with CTI having single tables of more than 60 GB, rather expensive high-end servers are needed to cache the data, and CTI is running into the physical limits of the hardware itself.
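The contrast between the two workloads can be illustrated with a minimal cache simulation. This is a sketch under assumptions, not a model of either system: the `LRUCache` class, the capacity of 100 entries, and the two synthetic access patterns are all hypothetical. It shows that a cache pays off when a small set of keys is requested repeatedly, and contributes nothing when each item is requested only once.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = load(key)  # stands in for a slow database query
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value


def hit_rate(keys, capacity: int = 100) -> float:
    """Replay a sequence of key requests and return the cache hit rate."""
    cache = LRUCache(capacity)
    for key in keys:
        cache.get(key, load=lambda k: k * 2)
    return cache.hits / (cache.hits + cache.misses)


# Hypothetical workloads: ten popular items requested over and over,
# versus a report-style scan that touches every item exactly once.
popular_pages = [i % 10 for i in range(1000)]
one_pass_report = list(range(1000))

popular_rate = hit_rate(popular_pages)
report_rate = hit_rate(one_pass_report)
```

Under these assumptions almost every request in the first workload is served from the cache, while the second workload never produces a hit, so adding a cache in front of the database does nothing to reduce its load.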
Because of these limitations, vendors and users are increasingly turning to NoSQL databases. An early introduction came in 2007, when Amazon published a paper describing Dynamo, a distributed NoSQL system built for internal use. Amazon was thus one of the first major companies to store its data in a non-relational database. Soon thereafter, the Apache Software Foundation developed HBase, a distributed, open source database that emulates Google’s Bigtable. Facebook, faced with its own data storage problems, implemented the high-performance Cassandra database to help power its website (Leavitt 2010). At the time of writing of this proposal, a number of highly trafficked websites and web applications, such as Twitter (Pronsc 2010) and Digg (Quinn 2010), are in the process of moving to NoSQL solutions. Although NoSQL might seem a straightforward option for scalability, NoSQL databases face several challenges. Because NoSQL databases don’t work with SQL, they require manual query programming, which can be fast for simple tasks but time-consuming for others. In addition, complex query programming for these databases can be difficult (Leavitt 2010). Another problem is that relational databases natively support ACID (atomicity, consistency, isolation, and durability), while NoSQL databases don’t, so NoSQL databases don’t natively offer the degree of reliability that ACID provides. If users want NoSQL databases to apply ACID constraints to a data set, they must perform additional programming; without such manual support, consistency could be compromised. Giving up consistency enables better performance and scalability but is a problem for certain types of applications and transactions, such as those involved in banking.
Another major disadvantage is that most organizations are unfamiliar with NoSQL databases and thus may not feel knowledgeable enough to choose one, or even to determine whether the approach might be better suited to their purposes.
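The manual query programming mentioned above can be illustrated by computing the same per-customer total in two ways. The sketch below is hypothetical throughout: the `orders` data, the key naming scheme, and the function names are invented for illustration, Python's built-in `sqlite3` module stands in for a relational database, and a plain dictionary stands in for a key-value NoSQL store rather than any particular product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order data: (customer, amount)
orders = [("alice", 30), ("bob", 15), ("alice", 20)]


def totals_sql(rows):
    """Relational version: the aggregation is declared in one SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    result = dict(cursor)  # each row is a (customer, total) pair
    conn.close()
    return result


def totals_kv(store):
    """Key-value version: the application walks the records and
    performs the grouping and summing itself."""
    totals = defaultdict(int)
    for record in store.values():
        totals[record["customer"]] += record["amount"]
    return dict(totals)


# The same orders stored as documents under hypothetical keys.
kv_store = {
    f"order:{i}": {"customer": customer, "amount": amount}
    for i, (customer, amount) in enumerate(orders)
}

sql_totals = totals_sql(orders)
kv_totals = totals_kv(kv_store)
```

For a query this simple the hand-written loop is short, but every additional filter, join, or grouping has to be coded and tested by the application developer, which is the trade-off described above.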
Project Aims and Objectives
The research aims to explore and investigate the challenges web applications face in scaling their databases. It also aims to identify scalability strategies, based on NoSQL solutions, that mitigate these challenges so that businesses can provide an acceptable quality of service to their customers with minimal business impact.
To achieve this aim, the following objectives need to be met:
a. Review existing web application database architectures and their scalability pitfalls
b. Review the business impacts of the scalability problems these applications face
c. Review current NoSQL solutions and their implementations
d. Identify strategies for scaling the web application database architecture with NoSQL solutions
e. Compile a set of best practices for moving web applications from traditional relational database systems to NoSQL solutions
References
Manjhi, Amit, Phillip B. Gibbons, Anastassia Ailamaki, and Charles Garrod. Invalidation Clues for Database Scalability Services. National Science Foundation, 2009, 1-8.
Edberg, Jeremy. And a fun weekend was had by all… March 1, 2010. http://www.reddit.com/r/programming/comments/b81v1/i_was_hoping_we_could_get_a_good_technical/ (accessed March 12, 2010).
Leavitt, Neal. “Will NoSQL Databases Live Up.” TECHNOLOGY NEWS, 2010: 13-17.
Pronsc, Mitchell. Cassandra Usurping MySQL on Twitter. March 03, 2010. http://architects.dzone.com/articles/cassandra-usurping-mysql?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+javalobby/frontpage+(Javalobby+/+Java+Zone)&utm_content=Google+Reader (accessed March 12, 2010).
Quinn, John. Saying Yes to NoSQL; Going Steady with Cassandra. March 9, 2010. http://about.digg.com/node/564 (accessed March 12, 2010).
Rubicon Consulting, Inc. Growth of web applications in the US: Rapid adoption, but only when there’s a real benefit, Status and implications for the tech industry. Rubicon Consulting, 2007.
Ye, Haiwei. “Towards Database Scalability through Efficient Data Distribution.” Proceedings of the 3rd International Symposium on Electronic Commerce (ISEC’02). IEEE, 2002. 1.