Information System
- At the most basic level, an information system (IS) is a set of components that work together to manage data processing and storage. Its role is to support the key aspects of running an organization, such as communication, record-keeping, decision making, data analysis and more. Companies use this information to improve their business operations, make strategic decisions and gain a competitive edge.
- Information systems typically include a combination of software, hardware and telecommunication networks. For example, an organization may use customer relationship management systems to gain a better understanding of its target audience, acquire new customers and retain existing clients. This technology allows companies to gather and analyze sales activity data, define the exact target group of a marketing campaign and measure customer satisfaction.
The Benefits of Information Systems
- Modern technology can significantly boost your company's performance and productivity. Information systems are no exception. Organizations worldwide rely on them to research and develop new ways to generate revenue, engage customers and streamline time-consuming tasks.
- With an information system, businesses can save time and money while making smarter decisions. A company's internal departments, such as marketing and sales, can communicate better and share information more easily.
- Since this technology is automated and uses complex algorithms, it reduces human error. Furthermore, employees can focus on the core aspects of a business rather than spending hours collecting data, filling out paperwork and doing the manual analysis.
- Thanks to modern information systems, team members can access massive amounts of data from one platform. For example, they can gather and process information from different sources, such as vendors, customers, warehouses and sales agents, with a few mouse clicks.
Uses and Applications
There are different types of information systems and each has a different role. Business intelligence (BI) systems, for instance, can turn data into valuable insights.
This kind of technology allows for faster, more accurate reporting, better business decisions, and more efficient resource allocation. Another major benefit is data visualization, which enables analysts to interpret large amounts of information, predict future events and find patterns in historical data.
Organizations can also use enterprise resource planning (ERP) software to collect, manage and analyze data across different areas, from manufacturing to finance and accounting. This type of information system consists of multiple applications that provide a 360-degree view of business operations. NetSuite ERP, PeopleSoft, Odoo, and Intacct are just a few examples of ERP software.
Like other information systems, ERP provides actionable insights and helps you decide on the next steps. It also makes it easier to achieve regulatory compliance, increase data security and share information between departments. Additionally, it helps to ensure that all of your financial records are accurate and up-to-date.
In the long run, ERP software can reduce operational costs, improve collaboration and boost your revenue. Nearly half of the companies that implement this system report major benefits within six months.
At the end of the day, information systems can give you a competitive advantage and provide the data you need to make faster, smarter business decisions. Depending on your needs, you can opt for transaction processing systems, knowledge management systems, decision support systems and more. When choosing one, consider your budget, industry and business size. Look for an information system that aligns with your goals and can streamline your day-to-day operations.
Persistent Data
- Persistent data is data that’s considered durable at rest, surviving the coming and going of software and devices: master data that is stable, set, and recoverable whether held in flash or in memory.
- Here's what we heard when we asked, "How do you define persistent data?":
- The opposite of dynamic—it doesn’t change and is not accessed very frequently.
- Core information, also known as dimensional information in data warehousing. Demographics of entities—customers, suppliers, orders.
- Master data that’s stable.
- Data that exists from one instance to another. Data that exists across time independent of the systems that created it. Now there’s always a secondary use for data, so there’s more persistent data. A persistent copy may be made or it may be aggregated. The idea of persistence is becoming more fluid.
- Stored in its actual format and stays there, versus in-memory data, where you have it once, close the file, and it’s gone. You can retrieve persistent data again and again. Data that’s written to the disk; however, the speed of the disks is a bottleneck for the database. Trying to move to memory because it’s 16X faster.
- Every client has their own threshold for criticality (e.g. financial services don’t want to lose any debits or credits). Now, with much more data from machines and sensors, there is greater transactionality. The meta-data is as important as the data itself. Meta-data must be transactional.
- Non-volatile. Persists in the face of a power outage.
- Any data stored in a way that it stays stored for an extended period versus in-memory data. Stored in the system modeled and structured to endure power outages. Data doesn’t change at all.
- Data considered durable at rest with the coming and going of hardware and devices. There’s a persistence layer at which you hold your data at risk.
- Data that is set and recoverable whether in flash or memory backed.
- With persistent data, there is reasonable confidence that changes will not be lost and the data will be available later. Depending on the requirements, in-cloud or in-memory systems can qualify. We care most about the "data" part. If it’s data, we want to enable customers to read, query, transform, write, add-value, etc.
- A way to persist data to disk or storage. Multiple options to do so with one replica across data centers in any combination with and without persistence. Snapshot data to disk or snapshot changes. Write to disk every second or every write. Users can choose between all options. Persistence is part of a high availability suite which provides replication and instant failover. Registered over multiple clouds. Host thousands of instances over multiple data centers with only two node failures per day. Users can choose between multiple data centers and multiple geographies. We are the company behind Redis. Others treat as a cache and not a database. Multiple nodes - data written to disks. You can’t do that with regular open source. If you don’t do high availability, as recommended, you can lose your data.
- Anything that goes to a relational or NoSQL database in between.
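To make the contrast concrete, here is a minimal Java sketch (the file name and record format are illustrative assumptions): state held only in memory disappears when the process exits, while state written to disk can be retrieved again and again.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PersistenceDemo {
    public static void main(String[] args) throws IOException {
        String record = "customer=42,balance=100.00";

        // In-memory only: this value is gone once the JVM exits
        String inMemory = record;
        System.out.println("In memory: " + inMemory);

        // Persistent: written to disk, survives restarts and power cycles
        Path file = Path.of("ledger.txt");  // illustrative file name
        Files.writeString(file, record);

        // A later run (or another program) can read it back
        List<String> lines = Files.readAllLines(file);
        System.out.println("Recovered: " + lines.get(0));
    }
}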
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today's computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most basic digital format.
The concept of data in the context of computing has its roots in the work of Claude Shannon, an American mathematician known as the father of information theory. He ushered in binary digital concepts based on applying two-value Boolean logic to electronic circuits. Binary digit formats underlie the CPUs, semiconductor memories and disk drives, as well as many of the peripheral devices common in computing today. Early computer input for both control and data took the form of punch cards, followed by magnetic tape and the hard disk.
Early on, data's importance in business computing became apparent by the popularity of the terms "data processing" and "electronic data processing," which, for a time, came to encompass the full gamut of what is now known as information technology. Over the history of corporate computing, specialization occurred, and a distinct data profession emerged along with the growth of corporate data processing.
How data is stored
Computers represent data, including video, images, sounds, and text, as binary values using patterns of just two numbers: 1 and 0. A bit is the smallest unit of data and represents just a single value. A byte is eight binary digits long. Storage and memory are measured in megabytes and gigabytes.
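As a small illustration in Java (the sample values are arbitrary), a letter and a number are just patterns of bits underneath:
public class BinaryDemo {
    public static void main(String[] args) {
        char letter = 'A';   // stored as the 8-bit pattern 01000001
        int number = 300;
        System.out.println(Integer.toBinaryString(letter)); // 1000001
        System.out.println(Integer.toBinaryString(number)); // 100101100
        System.out.println(Byte.SIZE); // a byte is 8 bits
    }
}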
The units of data measurement continue to grow as the amount of data collected and stored grows. The relatively new term "brontobyte," for example, denotes data storage equal to 10 to the 27th power bytes.
Data can be stored in file formats, as in mainframe systems using ISAM and VSAM. Other file formats for data storage, conversion and processing include comma-separated values. These formats continued to find uses across a variety of machine types, even as more structured-data-oriented approaches gained a footing in corporate computing.
With greater specialization, the database, the database management system and relational database technology arose to organize information.
Types of data
The growth of the web and smartphones over the past decade led to a surge in digital data creation. Data now includes text, audio, and video information, as well as log and web activity records. Much of that is unstructured data.
The term big data has been used to describe data in the petabyte range or larger. A shorthand take depicts big data with 3Vs -- volume, variety, and velocity. As web-based e-commerce has spread, big data-driven business models have evolved which treat data as an asset in itself. Such trends have also spawned greater preoccupation with the social uses of data and data privacy.
Data has meaning beyond its use in computing applications oriented toward data processing. For example, in electronic component interconnection and network communication, the term data is often distinguished from "control information," "control bits," and similar terms to identify the main content of a transmission unit. Moreover, in science, the term data is used to describe a gathered body of facts. That is also the case in fields such as finance, marketing, demographics, and health.
Data management and use
With the proliferation of data in organizations, added emphasis has been placed on ensuring data quality by reducing duplication and guaranteeing the most accurate, current records are used. The many steps involved with modern data management include data cleansing, as well as extract, transform and load (ETL) processes for integrating data. Data for processing has come to be complemented by metadata, sometimes referred to as "data about data," that helps administrators and users understand database and other data.
Analytics that combine structured and unstructured data have become useful, as organizations seek to capitalize on such information. Systems for such analytics increasingly strive for real-time performance, so they are built to handle incoming data consumed at high ingestion rates, and to process data streams for immediate use in operations.
Over time, the idea of the database for operations and transactions has been extended to the database for reporting and predictive data analytics. A chief example is the data warehouse, which is optimized to process questions about operations for business analysts and business leaders. Increasing emphasis on finding patterns and predicting business outcomes has led to the development of data mining techniques.
Data professionals
The database administrator profession is an offshoot of IT. These database experts work on designing, tuning and maintaining the database.
The data profession took firm root as the relational database management system (RDBMS) gained wide use in corporations, beginning in the 1980s. The relational database's rise was enabled in part by the Structured Query Language (SQL). Later, non-SQL databases, known as NoSQL databases, arose as an alternative to established RDBMSes.
Database (DB)
A database is a collection of information that is organized so that it can be easily accessed, managed and updated.
Data is organized into rows, columns and tables, and it is indexed to make it easier to find relevant information. Data gets updated, expanded and deleted as new information is added. A database processes workloads to create and update itself, querying the data it contains and running applications against it.
Computer databases typically contain aggregations of data records or files, such as sales transactions, product catalogs and inventories, and customer profiles.
Typically, a database manager provides users with the ability to control read/write access, specify report generation and analyze usage. Some databases offer ACID (atomicity, consistency, isolation, and durability) compliance to guarantee that data is consistent and that transactions are complete.
Databases are prevalent in large mainframe systems but are also present in smaller distributed workstations and midrange systems, such as IBM's AS/400 and personal computers.
Evolution of databases
Databases have evolved since their inception in the 1960s, beginning with hierarchical and network databases, through the 1980s with object-oriented databases, and today with SQL and NoSQL databases and cloud databases.
In one view, databases can be classified according to content type: bibliographic, full text, numeric and images. In computing, databases are sometimes classified according to their organizational approach. There are many different kinds of databases, ranging from the most prevalent approach, the relational database, to a distributed database, cloud database or NoSQL database.
Relational database
A relational database, invented by E.F. Codd at IBM in 1970, is a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways.
Relational databases are made up of a set of tables with data that fits into a predefined category. Each table has at least one data category in a column, and each row has a certain data instance for the categories which are defined in the columns.
The Structured Query Language (SQL) is the standard user and application program interface for a relational database. Relational databases are easy to extend, and a new data category can be added after the original database creation without requiring that you modify all the existing applications.
Distributed database
A distributed database is a database in which portions of the database are stored in multiple physical locations, and in which processing is dispersed or replicated among different points in a network.
Distributed databases can be homogeneous or heterogeneous. All the physical locations in a homogeneous distributed database system have the same underlying hardware and run the same operating systems and database applications. The hardware, operating systems or database applications in a heterogeneous distributed database may be different at each of the locations.
Cloud database
A cloud database is a database that has been optimized or built for a virtualized environment, either in a hybrid cloud, public cloud or private cloud. Cloud databases provide benefits such as the ability to pay for storage capacity and bandwidth on a per-user basis, and they provide scalability on demand, along with high availability.
A cloud database also gives enterprises the opportunity to support business applications in a software-as-a-service deployment.
NoSQL database
NoSQL databases are useful for large sets of distributed data.
NoSQL databases are effective for big data performance issues that relational databases aren't built to solve. They are most effective when an organization must analyze large chunks of unstructured data or data that are stored across multiple virtual servers in the cloud.
Object-oriented database
Items created using object-oriented programming languages are often stored in relational databases, but object-oriented databases are well-suited for those items.
An object-oriented database is organized around objects rather than actions, and data rather than logic. For example, a multimedia record in a relational database can be a definable data object, as opposed to an alphanumeric value.
Graph database
A graph-oriented database, or graph database, is a type of NoSQL database that uses graph theory to store, map and query relationships. Graph databases are basically collections of nodes and edges, where each node represents an entity, and each edge represents a connection between nodes.
Graph databases are growing in popularity for analyzing interconnections. For example, companies might use a graph database to mine data about customers from social media.
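A minimal sketch of the node-and-edge model in plain Java (all names are made up) shows the idea: each node is an entity, each edge records a connection, and connections can then be traversed to analyze interconnections.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GraphSketch {
    // adjacency list: node -> set of connected nodes
    private final Map<String, Set<String>> edges = new HashMap<>();

    void connect(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    Set<String> connectionsOf(String node) {
        return edges.getOrDefault(node, Set.of());
    }

    public static void main(String[] args) {
        GraphSketch g = new GraphSketch();
        g.connect("alice", "bob");   // e.g., customer alice follows bob
        g.connect("alice", "carol");
        System.out.println(g.connectionsOf("alice")); // [bob, carol] in some order
    }
}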
Accessing the database: DBMS and RDBMS
A database management system (DBMS) is a type of software that allows you to define, manipulate, retrieve and manage data stored within a database.
Database Management System (DBMS)
A database management system (DBMS) is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.
A DBMS makes it possible for end users to create, read, update and delete data in a database. The DBMS essentially serves as an interface between the database and end users or application programs, ensuring that data is consistently organized and remains easily accessible.
The DBMS manages three important things: the data; the database engine, which allows data to be accessed, locked and modified; and the database schema, which defines the database’s logical structure. These three foundational elements help provide concurrency, security, data integrity and uniform administration procedures. Typical database administration tasks supported by the DBMS include change management, performance monitoring/tuning, and backup and recovery. Many database management systems are also responsible for automated rollbacks, restarts and recovery, as well as the logging and auditing of activity.
The DBMS is perhaps most useful for providing a centralized view of data that can be accessed by multiple users, from multiple locations, in a controlled manner. A DBMS can limit what data the end user sees, as well as how that end user can view the data, providing many views of a single database schema. End users and software programs are free from having to understand where the data is physically located or on what type of storage media it resides because the DBMS handles all requests.
The DBMS can offer both logical and physical data independence. That means it can protect users and applications from needing to know where data is stored or having to be concerned about changes to the physical structure of data (storage and hardware). As long as programs use the application programming interface (API) for the database that is provided by the DBMS, developers won't have to modify programs just because changes have been made to the database.
With relational DBMSs (RDBMSs), this API is SQL, a standard programming language for defining, protecting and accessing data in an RDBMS.
Popular types of DBMSes
Popular database models and their management systems include:
Relational database management system (RDBMS) - adaptable to most use cases, but RDBMS Tier-1 products can be quite expensive.
NoSQL DBMS - well-suited for loosely defined data structures that may evolve over time.
In-memory database management system (IMDBMS) - provides faster response times and better performance.
Columnar database management system (CDBMS) - well-suited for data warehouses that have a large number of similar data items.
Cloud-based data management system - the cloud service provider is responsible for providing and maintaining the DBMS.
Advantages of a DBMS
Using a DBMS to store and manage data comes with advantages, but also overhead. One of the biggest advantages of using a DBMS is that it lets end users and application programmers access and use the same data while managing data integrity. Data is better protected and maintained when it can be shared using a DBMS instead of creating new iterations of the same data stored in new files for every new application. The DBMS provides a central store of data that can be accessed by multiple users in a controlled manner.
Central storage and management of data within the DBMS provide:
Data abstraction and independence
Data security
A locking mechanism for concurrent access
An efficient handler to balance the needs of multiple applications using the same data
The ability to swiftly recover from crashes and errors, including restartability and recoverability
Robust data integrity capabilities
Logging and auditing of activity
Simple access using a standard application programming interface (API)
Uniform administration procedures for data
Another advantage of a DBMS is that it can be used to impose a logical, structured organization on the data. A DBMS delivers economy of scale for processing large amounts of data because it is optimized for such operations.
A DBMS can also provide many views of a single database schema. A view defines what data the user sees and how that user sees the data. The DBMS provides a level of abstraction between the conceptual schema that defines the logical structure of the database and the physical schema that describes the files, indexes and other physical mechanisms used by the database. When a DBMS is used, systems can be modified much more easily when business requirements change. New categories of data can be added to the database without disrupting the existing system and applications can be insulated from how data is structured and stored.
Of course, a DBMS must perform additional work to provide these advantages, thereby bringing with it the overhead. A DBMS will use more memory and CPU than a simple file storage system. And, of course, different types of DBMSes will require different types and levels of system resources.
Database Server
A database server is a server which houses a database application that provides database services to other computer programs or to computers, as defined by the client-server model. Database management systems (DBMSs) frequently provide database-server functionality, and some database management systems (such as MySQL) rely exclusively on the client-server model for database access (while others, e.g. SQLite, are meant for use as an embedded database).
Users access a database server either through a "front end" running on the user's computer – which displays requested data – or through the "back end", which runs on the server and handles tasks such as data analysis and storage.
In a master-slave model, database master servers are central and primary locations of data while database slave servers are synchronized backups of the master acting as proxies.
Most database applications respond to a query language. Each database understands its query language and converts each submitted query to server-readable form and executes it to retrieve results.
Examples of proprietary database applications include Oracle, DB2, Informix, and Microsoft SQL Server. Examples of free-software database applications include PostgreSQL and, under the GNU General Public License, Ingres and MySQL.
Every server uses its own query logic and structure. The SQL (Structured Query Language) query language is more or less the same on all relational database applications.
For clarification, a database server is simply a server that maintains services related to clients via database applications.
DB-Engines lists over 300 DBMSs in its ranking.
Difference Between File and Database
File
A data file is a collection of related records stored on a storage medium such as a hard disk or optical disc. A Student file at a school might consist of thousands of individual student records. Each student record in the file contains the same fields; each field, however, contains different data. A small sample Student file might contain four student records, each with eleven fields. A database includes a group of related data files.
Database
A database is a collection of data organized in a manner that allows access, retrieval, and use of that data. Data is a collection of unprocessed items, which can include text, numbers, images, audio, and video. Information is processed data; that is, it is organized, meaningful, and useful. Computers process data in a database into information. A database at a school, for example, contains data about students, e.g., student data, class data, etc. A computer at the school processes new student data and then sends advising appointment and ID card information to the printers.
Database - Advantages & Disadvantages
Advantages
- Reduced data redundancy
- Reduced updating errors and increased consistency
- Greater data integrity and independence from applications programs
- Improved data access to users through the use of host and query languages
- Improved data security
- Reduced data entry, storage, and retrieval costs
- Facilitated development of new applications program
Disadvantages
- Database systems are complex, difficult and time-consuming to design
- Substantial hardware and software start-up costs
- Damage to database affects virtually all applications programs
- Extensive conversion costs in moving from a file-based system to a database system
- Initial training required for all programmers and users
File system - Advantages & Disadvantages
Advantages
1. It supports heterogeneous operating systems, including all flavors of the UNIX operating system as well as Linux and Windows.
2. Multiple client machines can access a single resource simultaneously.
3. It enables sharing of common application binaries and read-only information instead of putting them on every single machine, which reduces overall disk storage cost and administration overhead.
4. It gives groups of users access to uniform data.
5. It is useful when many users exist on many systems; instead of locating each user's home directory on every single machine, network file systems allow you to keep all users' home directories on a single machine under /home.
Disadvantages
1. Program-Data Dependence. File descriptions are stored within each application program that accesses a given file.
2. Duplication of Data. Applications are developed independently in file processing systems leading to unplanned duplicate files. Duplication is wasteful as it requires additional storage space and changes in one file must be made manually in all files. This also results in a loss of data integrity. It is also possible that the same data item may have different names in different files, or the same name may be used for different data items in different files.
3. Limited data sharing. Each application has its own private files with little opportunity to share data outside their own applications. A requested report may require data from several incompatible files in separate systems.
4. Lengthy Development Times. There is little opportunity to leverage previous development efforts. Each new application requires the developer to start from scratch by designing new file formats and descriptions.
5. Excessive Program Maintenance. The preceding factors create a heavy program maintenance load.
6. Integrity Problem. The problem of integrity is the problem of ensuring that the data in the database is accurate and consistent.
- Big data involves huge volume, high velocity, and an extensible variety of data. It comes in three types: structured data, semi-structured data, and unstructured data.
Structured Data
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Structured data has a relational key and can easily be mapped into pre-designed fields. Today, it is the most commonly processed kind of data in development and the simplest way to manage information. Example: relational data.
Semi-Structured Data
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database, though this can be very hard for some kinds of semi-structured data. Example: XML data.
Unstructured Data
Unstructured data is data that is not organized in a pre-defined manner and does not have a pre-defined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
Differences between Structured, Semi-structured and Unstructured data:
- Structured data has a rigid, pre-defined schema of rows and columns and is easily queried with SQL.
- Semi-structured data has no rigid schema but carries organizational markers, such as the tags in XML.
- Unstructured data has no pre-defined model at all and needs alternative platforms for storage and analysis.
Types of Databases
1. Centralized Database
The information (data) is stored at a centralized location, and users from different locations can access this data. This type of database contains application procedures that help users access the data even from a remote location. Various kinds of authentication procedures are applied for the verification and validation of end users; likewise, the application procedures provide a registration number that keeps track of and records data usage.
2. Distributed Database
Just the opposite of the centralized database concept, the distributed database has contributions from the common database as well as information captured by local computers. The data is not in one place and is distributed at various sites of an organization. These sites are connected to each other with the help of communication links, which help them access the distributed data easily. You can imagine a distributed database as one in which various portions of a database are stored in multiple different (physical) locations, along with application procedures that are replicated and distributed among various points in a network.
There are two kinds of distributed database: homogeneous and heterogeneous. Databases that have the same underlying hardware and run over the same operating systems and application procedures at all physical locations are known as homogeneous DDBs, whereas a DDB in which the operating systems, underlying hardware and application procedures can differ at the various sites is known as a heterogeneous DDB.
3. Personal Database
Data is collected and stored on personal computers, which are small and easily manageable. The data is generally used by the same department of an organization and is accessed by a small group of people.
4. End User Database
The end user is usually not concerned with the transactions or operations done at various levels and is only aware of the product, which may be a software application. Therefore, this is a shared database specifically designed for end users, such as managers at different levels. A summary of the whole information is collected in this database.
5. Commercial Database
These are paid versions of huge databases designed uniquely for users who want to access the information for help. These databases are subject-specific, and one cannot afford to maintain such huge information oneself. Access to such databases is provided through commercial links.
6. NoSQL Database
These are used for large sets of distributed data. Some big data performance issues that are not handled effectively by relational databases are easily managed by NoSQL databases. They are very efficient at analyzing large amounts of unstructured data that may be stored on multiple virtual servers in the cloud.
7. Operational Database
Information related to the operations of an enterprise is stored in this database. Functional lines such as marketing, employee relations and customer service require such databases.
8. Relational Databases
These databases are categorized by a set of tables where data fits into a pre-defined category. A table consists of rows and columns, where a column holds the entries for a specific data category and each row contains an instance of the data defined by the columns. The Structured Query Language (SQL) is the standard user and application program interface for a relational database. Various simple operations can be applied over the tables, which makes these databases easy to extend: two tables with a common relation can be joined, and new data categories can be added without modifying all existing applications.
9. Cloud Databases
Nowadays, data is often stored in the cloud, also known as a virtual environment, whether a hybrid, public or private cloud. A cloud database is a database that has been optimized or built for such a virtualized environment. Cloud databases offer various benefits, such as the ability to pay for storage capacity and bandwidth on a per-user basis, scalability on demand, and high availability. A cloud database also gives enterprises the opportunity to support business applications in a software-as-a-service deployment.
10. Object-Oriented Databases
An object-oriented database combines object-oriented programming with relational database principles. Various items created using object-oriented programming languages like C++ and Java can be stored in relational databases, but object-oriented databases are well-suited for those items. An object-oriented database is organized around objects rather than actions, and data rather than logic. For example, a multimedia record in a relational database can be a definable data object, as opposed to an alphanumeric value.
11. Graph Databases
A graph is a collection of nodes and edges, where each node represents an entity and each edge describes a relationship between entities. A graph-oriented database, or graph database, is a type of NoSQL database that uses graph theory to store, map and query relationships. Graph databases are mainly used for analyzing interconnections. For example, companies might use a graph database to mine data about customers from social media.
Difference between Big Data and Data Warehouse
- Data warehousing has been a common term for the last 10-20 years, whereas big data has been a hot trend for the last 5-10 years. Both hold a lot of data, are used for reporting, and are managed by electronic storage devices. Because of this, many people assume that big data will soon replace the older data warehouse. But big data and data warehousing are not interchangeable, as they are used for entirely different purposes.
Application To Files/DB
- Files and DBs are external components
- They exist outside the software system
- Software can connect to the files/DBs to perform CRUD operations on data
•File – File path, URL
•DB – connection string
- To process data in DB
•SQL statements
•Prepared statements
•Callable statements
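As a minimal sketch of the connection step (assuming the MySQL JDBC driver is on the classpath; the host, port, database name and credentials are hypothetical), a Java program obtains a connection from a connection string like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectDemo {
    public static void main(String[] args) {
        // Hypothetical connection string: host, port and database name are assumptions
        String url = "jdbc:mysql://localhost:3306/testdb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected: " + !conn.isClosed());
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}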
Statements, PreparedStatement and CallableStatement
Once a connection is obtained, we can interact with the database. The JDBC Statement, CallableStatement, and PreparedStatement interfaces define the methods and properties that enable you to send SQL or PL/SQL commands and receive data from your database. They also define methods that help bridge data-type differences between Java and the SQL data types used in a database.
The following summary of each interface's purpose helps you decide which one to use:
- Statement: for general-purpose access to your database; useful when you are using static SQL statements at runtime, as it cannot accept parameters.
- PreparedStatement: for SQL statements you plan to use many times; it accepts input parameters at runtime.
- CallableStatement: for accessing database stored procedures; it can also accept runtime input parameters.
The Statement Objects
- Creating Statement Object
Before you can use a Statement object to execute a SQL statement, you need to create one using the Connection object's createStatement( ) method, as in the following example −
Statement stmt = null;
try {
stmt = conn.createStatement( );
. . .
}
catch (SQLException e) {
. . .
}
finally {
. . .
}
Once you've created a Statement object, you can then use it to execute an SQL statement with one of its three execute methods.
- boolean execute (String SQL): Returns a boolean value of true if a ResultSet object can be retrieved; otherwise, it returns false. Use this method to execute SQL DDL statements or when you need to use truly dynamic SQL.
- int executeUpdate (String SQL): Returns the number of rows affected by the execution of the SQL statement. Use this method to execute SQL statements for which you expect to get a number of rows affected - for example, an INSERT, UPDATE, or DELETE statement.
- ResultSet executeQuery (String SQL): Returns a ResultSet object. Use this method when you expect to get a result set, as you would with a SELECT statement.
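For instance, here is a minimal sketch (assuming conn is an open Connection and an Employees table with id and first columns exists, as in the later examples) that runs a query and walks the ResultSet:
Statement stmt = null;
ResultSet rs = null;
try {
   stmt = conn.createStatement();
   rs = stmt.executeQuery("SELECT id, first FROM Employees");
   while (rs.next()) {
      // read each row by column name
      System.out.println(rs.getInt("id") + ": " + rs.getString("first"));
   }
}
catch (SQLException e) {
   e.printStackTrace();
}
finally {
   if (rs != null) rs.close();
   if (stmt != null) stmt.close();
}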
Closing Statement Object
Just as you close a Connection object to save database resources, you should also close the Statement object. A simple call to the close() method will do the job. If you close the Connection object first, it will close the Statement object as well. However, you should always explicitly close the Statement object to ensure proper cleanup.
Statement stmt = null;
try {
stmt = conn.createStatement( );
. . .
}
catch (SQLException e) {
. . .
}
finally {
stmt.close();
}
For a better understanding, we suggest you study the Statement - Example tutorial.
The PreparedStatement Objects
The PreparedStatement interface extends the Statement interface, which gives you added functionality with a couple of advantages over a generic Statement object. This statement gives you the flexibility of supplying arguments dynamically.
- Creating PreparedStatement Object
PreparedStatement pstmt = null;
try {
String SQL = "Update Employees SET age = ? WHERE id = ?";
pstmt = conn.prepareStatement(SQL);
. . .
}
catch (SQLException e) {
. . .
}
finally {
. . .
}
All parameters in JDBC are represented by the ? symbol, which is known as the parameter marker. You must supply values for every parameter before executing the SQL statement.
The setXXX() methods bind values to the parameters, where XXX represents the Java data type of the value you wish to bind to the input parameter. If you forget to supply the values, you will receive an SQLException.
Each parameter marker is referred to by its ordinal position. The first marker represents position 1, the next position 2, and so forth. This method differs from that of Java array indices, which start at 0.
All of the Statement object's methods for interacting with the database (a) execute(), (b) executeQuery(), and (c) executeUpdate() also work with the PreparedStatement object. However, the methods are modified to use SQL statements that can input the parameters.
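For example, a short sketch (assuming conn is an open Connection and the Employees table from above; the bound values are arbitrary) that binds both parameters and runs the update:
PreparedStatement pstmt = null;
try {
   String SQL = "Update Employees SET age = ? WHERE id = ?";
   pstmt = conn.prepareStatement(SQL);
   pstmt.setInt(1, 35);   // bind age to the first marker
   pstmt.setInt(2, 102);  // bind id to the second marker (102 is an assumed id)
   int rows = pstmt.executeUpdate();
   System.out.println("Rows updated: " + rows);
}
catch (SQLException e) {
   e.printStackTrace();
}
finally {
   if (pstmt != null) pstmt.close();
}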
- Closing PreparedStatement Object
Just as you close a Statement object, for the same reason you should also close the PreparedStatement object.
A simple call to the close() method will do the job. If you close the Connection object first, it will close the PreparedStatement object as well. However, you should always explicitly close the PreparedStatement object to ensure proper cleanup.
PreparedStatement pstmt = null;
try {
String SQL = "Update Employees SET age = ? WHERE id = ?";
pstmt = conn.prepareStatement(SQL);
. . .
}
catch (SQLException e) {
. . .
}
finally {
pstmt.close();
}
For a better understanding, let us study Prepare - Example Code.
The CallableStatement Objects
Just as a Connection object creates the Statement and PreparedStatement objects, it also creates the CallableStatement object, which would be used to execute a call to a database stored procedure.
- Creating CallableStatement Object
Suppose, you need to execute the following Oracle stored procedure −
CREATE OR REPLACE PROCEDURE getEmpName
(EMP_ID IN NUMBER, EMP_FIRST OUT VARCHAR) AS
BEGIN
SELECT first INTO EMP_FIRST
FROM Employees
WHERE ID = EMP_ID;
END;
NOTE: The above stored procedure has been written for Oracle, but we are working with a MySQL database, so let us write the same stored procedure for MySQL as follows, creating it in the EMP database −
DELIMITER $$
DROP PROCEDURE IF EXISTS `EMP`.`getEmpName` $$
CREATE PROCEDURE `EMP`.`getEmpName`
(IN EMP_ID INT, OUT EMP_FIRST VARCHAR(255))
BEGIN
SELECT first INTO EMP_FIRST
FROM Employees
WHERE ID = EMP_ID;
END $$
DELIMITER ;
Three types of parameters exist: IN, OUT, and INOUT. The PreparedStatement object only uses the IN parameter. The CallableStatement object can use all the three.
Here are the definitions of each −
- IN: a parameter whose value is unknown when the SQL statement is created; you bind values to IN parameters with the setXXX() methods.
- OUT: a parameter whose value is supplied by the SQL statement it returns; you retrieve values from OUT parameters with the getXXX() methods.
- INOUT: a parameter that provides both input and output values; you bind variables with the setXXX() methods and retrieve values with the getXXX() methods.
The following code snippet shows how to employ the Connection.prepareCall() method to instantiate a CallableStatement object based on the preceding stored procedure −
CallableStatement cstmt = null;
try {
String SQL = "{call getEmpName (?, ?)}";
cstmt = conn.prepareCall (SQL);
. . .
}
catch (SQLException e) {
. . .
}
finally {
. . .
}
The String variable SQL represents the stored procedure, with parameter placeholders.
Using the CallableStatement objects is much like using PreparedStatement objects. You must bind values to all the parameters before executing the statement, or you will receive an SQLException.
If you have IN parameters, just follow the same rules and techniques that apply to a PreparedStatement object; use the setXXX() method that corresponds to the Java data type you are binding.
When you use OUT and INOUT parameters you must employ an additional CallableStatement method, registerOutParameter(). The registerOutParameter() method binds the JDBC data type, to the data type that the stored procedure is expected to return.
Once you call your stored procedure, you retrieve the value from the OUT parameter with the appropriate getXXX() method. This method casts the retrieved value of SQL type to a Java data type.
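Putting it together, here is a brief sketch (assuming conn is an open Connection and the getEmpName procedure created above; the employee id is an assumed value) that binds the IN parameter, registers the OUT parameter and reads the result:
CallableStatement cstmt = null;
try {
   String SQL = "{call getEmpName (?, ?)}";
   cstmt = conn.prepareCall(SQL);
   cstmt.setInt(1, 102);  // bind the IN parameter (102 is an assumed employee id)
   cstmt.registerOutParameter(2, java.sql.Types.VARCHAR);  // register the OUT parameter
   cstmt.execute();
   String empName = cstmt.getString(2);  // retrieve the OUT parameter value
   System.out.println("Employee name: " + empName);
}
catch (SQLException e) {
   e.printStackTrace();
}
finally {
   if (cstmt != null) cstmt.close();
}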
- Closing CallableStatement Object
Just as you close other Statement objects, for the same reason you should also close the CallableStatement object.
A simple call to the close() method will do the job. If you close the Connection object first, it will close the CallableStatement object as well. However, you should always explicitly close the CallableStatement object to ensure proper cleanup.
CallableStatement cstmt = null;
try {
String SQL = "{call getEmpName (?, ?)}";
cstmt = conn.prepareCall (SQL);
. . .
}
catch (SQLException e) {
. . .
}
finally {
cstmt.close();
}
For a better understanding, I would suggest studying Callable - Example Code.
Statement Vs PreparedStatement Vs CallableStatement In Java:
- Statement executes static SQL built at runtime and cannot accept parameters.
- PreparedStatement is precompiled, suits statements executed many times, and accepts IN parameters.
- CallableStatement calls stored procedures and can accept IN, OUT, and INOUT parameters.
ORM
Before we talk about what an Object-Relational-Mapper is, it might be better to talk about Object-Relational-Mapping as a concept first.
Unless you’ve worked exclusively with NoSQL databases, you’ve likely written your fair share of SQL queries. They usually look something like this:
SELECT * FROM users WHERE email = 'test@test.com';
Object-relational-mapping is the idea of being able to write queries like the one above, as well as much more complicated ones, using the object-oriented paradigm of your preferred programming language.
Long story short, we are trying to interact with our database using our language of choice instead of SQL.
Here’s where the Object-relational-mapper comes in. When most people say “ORM” they are referring to a library that implements this technique. For example, the above query would now look something like this:
var orm = require('generic-orm-library');
var user = orm("users").where({ email: 'test@test.com' });
As you can see, we are using an imaginary ORM library to execute the exact same query, except we can write it in JavaScript (or whatever language you’re using). We can use the same languages we know and love, and also abstract away some of the complexity of interfacing with a database.
As with any technique, there are tradeoffs that should be considered when using an ORM.
Let’s take a look at some of the pros and cons!
Pros
- You get to write in the language you are already using anyway. If we’re being honest, we probably aren’t the greatest at writing SQL statements. SQL is a ridiculously powerful language, but most of us don’t write in it often. We do, however, tend to be much more fluent in one language or another and being able to leverage that fluency is awesome!
- It abstracts away the database system so that switching from MySQL to PostgreSQL, or whatever flavor you prefer, is easy-peasy.
- Depending on the ORM you get a lot of advanced features out of the box, such as support for transactions, connection pooling, migrations, seeds, streams, and all sorts of other goodies.
- Many of the queries an ORM generates will perform better than ones you would have written yourself.
Cons
- If you are a master at SQL, you can probably get more performant queries by writing them yourself.
- There is overhead involved in learning how to use any given ORM.
- The initial configuration of an ORM can be a headache.
- As a developer, it is important to understand what is happening under the hood. Since ORMs can serve as a crutch to avoid understanding databases and SQL, it can make you a weaker developer in that portion of the stack.
Popular ORMs
Wikipedia has a great list of ORMs that exist for just about any language. That list is missing JavaScript, which is my language of choice so I will throw my hat in the ring for Knex.js.
They’re not paying me to say that, I’ve simply enjoyed working with their software and I don’t have any experience with other JavaScript ORMs. This article might provide more insightful feedback for JavaScript specifically.
The ORM debate isn’t about technology; it’s about values. People tell you that you should or shouldn’t use ORM based on what they think matters more: clean data access, or clean code.
The first camp — thinks that software should concentrate on the model level and treat the data store as incidental. People in this camp say you should use ORM because it lets you describe at a high level how your program’s data should be stored and retrieved from the database with little code using the language of the program without making the model “leak” SQL.
The second camp — thinks that software should concentrate on the persistence level and treat the code as incidental. People in this camp don’t use ORM because they think the program should be written so its data storage and access patterns fit naturally with the underlying data store, and all data access is controlled and explicit.
So Who’s Right?
No one is right, and no one is wrong. Both camps have good points and bad points, and both approaches can lead to good and bad code. But in the debate, most people choose a side based on personal preferences and values rather than evidence.
So here are some best practices from both camps:
- If you’re going to use ORM, you should make your model objects as simple as possible. Be more vigilant about simplicity to make sure your model objects really are just Plain Ol’ Data. Otherwise, you may end up wrestling with your ORM to make sure the persistence works like you expect it to, and it’s not looking for methods and properties that aren’t actually there.
- If you’re not going to use ORM, you should probably define DAOs or persistence and query methods to avoid coupling the model layer with the persistence layer. Otherwise, you end up with SQL in your model objects and a dependency on the data store forced throughout your project.
- If you know your data access patterns are generally going to be simple (like basic object retrieval) but you don’t know all of them up front, you should think about using an ORM. While ORMs can make building complex queries confusing to build and difficult to debug, an ORM can save you huge amounts of time if your queries are generally pretty simple.
- If you know your data access pattern is going to be complex or you plan to use a lot of database-specific features, you may not want to use an ORM. While many ORMs (like Hibernate) let you access the underlying data source connection pretty easily , if you know you’re going to have to throw around a lot of custom SQL, you may not get a lot of value out of ORM to begin with because you’re constantly going to have to break out of it.
- If it absolutely, positively, has to, has to, has to go fast, you may not want to use ORM. The only way to be absolutely sure all your queries consistently go fast is to plan your database structure carefully, manage your data access pattern with extreme prejudice, commit to one data store, and write your own queries optimized against that data store.
- Having mentioned some of the scenarios depending on which you could decide on whether or not to go with ORM, let me also point out a few Pros and Cons of ORM in general.
PROS
Facilitates implementing domain model pattern.
A huge reduction in code.
Takes care of vendor-specific code by itself.
Cache Management — Entities are cached in memory thereby reducing the load on the DB.
CONS
Increased startup time due to metadata preparation (not good for desktop applications).
A huge learning curve.
Relatively hard to fine-tune and debug generated SQL.
Not suitable for applications without a clean domain object model.
Whether or not you should use ORM isn’t about other people’s values or even your own. It’s about choosing the right technique for your application based on its technical requirements. Use ORM or don’t based not on personal values but on what your app needs more: control over data access, or less code to maintain.
POJO
POJO stands for Plain Old Java Object. It is an ordinary Java object, not bound by any special restriction other than those forced by the Java Language Specification, and it does not require any special classpath. POJOs are used to increase the readability and reusability of a program. POJOs have gained wide acceptance because they are easy to write and understand. The term gained prominence with EJB 3.0 from Sun Microsystems.
A POJO should not:
Extend prespecified classes, Ex: public class GFG extends javax.servlet.http.HttpServlet { … } is not a POJO class.
Implement prespecified interfaces, Ex: public class Bar implements javax.ejb.EntityBean { … } is not a POJO class.
Contain prespecified annotations, Ex: @javax.persistence.Entity public class Baz { … } is not a POJO class.
A POJO basically defines an entity. For example, if you want an Employee class in your program, you can create a POJO as follows:
// Employee POJO class to represent entity Employee
public class Employee
{
// default field
String name;
// public field
public String id;
// private salary
private double salary;
//arg-constructor to initialize fields
public Employee(String name, String id,
double salary)
{
this.name = name;
this.id = id;
this.salary = salary;
}
// getter method for name
public String getName()
{
return name;
}
// getter method for id
public String getId()
{
return id;
}
// getter method for salary
public Double getSalary()
{
return salary;
}
}
The above example is a well-defined POJO class. As you can see, there is no restriction on the access modifiers of fields: they can be private, default, protected, or public. It is also not necessary to include any constructor.
A POJO is an object which encapsulates business logic. In a typical layered application, controllers interact with your business logic, which in turn interacts with POJOs to access the database. Here, a database entity is represented by a POJO, which has the same members as the database entity.
Java Beans
Beans are a special type of POJO. There are some restrictions on a POJO for it to be a bean.
All JavaBeans are POJOs but not all POJOs are JavaBeans.
They should be serializable, i.e., they should implement the Serializable interface. Still, some POJOs that don’t implement the Serializable interface are called POJOs, because Serializable is a marker interface and therefore not much of a burden.
Fields should be private, to provide complete control over them.
Fields should have getters or setters or both.
A no-arg constructor should be there in a bean.
Fields should be accessed only via constructors or getter and setter methods.
Getters and setters have some special names depending on the field name. For example, if the field is String someProperty, then its getter preferably will be:
public String getSomeProperty()
{
return someProperty;
}
and setter will be
public void setSomeProperty(String someProperty)
{
this.someProperty = someProperty;
}
The visibility of getters and setters is generally public. Getters and setters give you complete control over the fields. For example, consider the property below:
Integer age;
If you make age public, any object can set it directly. Suppose you want age never to be 0; with a public field you have no control, since any object can set it to 0. With a setter method, however, you can enforce that condition. Similarly, if you want the getter to return null whenever age is 0, you can achieve that in the getter method, as in the following example:
// Java program to illustrate JavaBeans
class Bean
{
    // private field property
    private Integer property;

    Bean()
    {
        // no-arg constructor
    }

    // setter method for property
    public void setProperty(Integer property)
    {
        if (property == 0)
        {
            // reject 0: leave the field unchanged
            return;
        }
        this.property = property;
    }

    // getter method for property
    public Integer getProperty()
    {
        if (property == null || property == 0)
        {
            // return null if property was never set (or is 0)
            return null;
        }
        return property;
    }
}

// Class to test the above bean
public class GFG
{
    public static void main(String[] args)
    {
        Bean bean = new Bean();
        bean.setProperty(0);
        System.out.println("After setting to 0: " +
                           bean.getProperty());
        bean.setProperty(5);
        System.out.println("After setting to valid" +
                           " value: " + bean.getProperty());
    }
}
Output:
After setting to 0: null
After setting to valid value: 5
POJO vs Java Bean
In short: every Java Bean is a POJO, but a POJO is a Java Bean only if it is serializable, keeps its fields private, exposes them through getters and setters, and provides a no-arg constructor.
JPA
The Java Persistence API (JPA) is a specification, with an accompanying collection of classes and methods, for persistently storing large amounts of data in a relational database; the specification is maintained by Oracle Corporation.
Where to use JPA?
To reduce the burden of writing code for object-relational management, a programmer can rely on a "JPA provider" framework, which handles the interaction with the database instance on the application's behalf.
JPA Providers
JPA is an open specification, so various enterprise vendors such as Oracle, Red Hat, the Eclipse Foundation, and others provide products that add the JPA persistence flavor to their offerings. Some of these products include:
Hibernate, EclipseLink, TopLink, Spring Data JPA, etc.
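Regardless of the provider chosen, the programming model looks the same. As a minimal sketch (not tied to any one provider), here is a JPA entity and how it might be persisted; the persistence-unit name "employee-pu" and the field names are assumptions for illustration:
// A minimal JPA entity: @Entity and @Id tell the JPA provider
// how to map this class to a database table.
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
public class Employee {
    @Id
    @GeneratedValue
    private Long id;       // primary key, generated by the provider

    private String name;
    private double salary;

    protected Employee() { }  // JPA requires a no-arg constructor

    public Employee(String name, double salary) {
        this.name = name;
        this.salary = salary;
    }
}

// Usage sketch ("employee-pu" is a hypothetical persistence unit
// configured in persistence.xml):
class JpaDemo {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("employee-pu");
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();
        em.persist(new Employee("Alice", 50000.0)); // provider generates the INSERT
        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}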
.NET ORM Tools
- A family of .NET database connectivity tools with ORM support. It includes the following ORM tools and solutions:
- Entity Developer is a powerful modeling and code generation tool for LinqConnect, Telerik Data Access, NHibernate, and ADO.NET Entity Framework.

- You can design an entity model from scratch or reverse-engineer an existing database. The model is used to generate C# or Visual Basic code with predefined or custom code templates.
- dotConnect is an enhanced data connectivity solution built on the ADO.NET architecture, and a development framework with a number of innovative technologies and support for ORM solutions such as Entity Framework and LinqConnect. dotConnect includes high-performance data providers for databases and cloud applications and offers a complete solution for developing data-related applications and web sites.

- LINQ Insight is a powerful Visual Studio add-in for LINQ development that allows you to execute LINQ queries at design time directly from Visual Studio without starting a debug session and provides a powerful ORM profiler for Entity Framework, NHibernate, LINQ to SQL, and LinqConnect.
- It profiles the data access layer of your projects and tracks all the ORM calls and SQL queries from the ORM.

- LinqConnect - a fast and easy-to-use ORM solution, closely modeled on Microsoft's LINQ to SQL technology. In addition to LINQ to SQL features, LinqConnect provides its own advanced functionality. LinqConnect supports SQL Server, Oracle, MySQL, PostgreSQL, and SQLite.

Java ORM Tools
Object-relational mapping (ORM, O/RM, or O/R mapping) in computer software is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools.
Data management tasks in object-oriented (OO) programming are typically implemented by manipulating objects that are almost always non-scalar values.
Many popular database products such as structured query language database management systems (SQL DBMS) can only store and manipulate scalar values such as integers and strings organized within tables. The programmer must either convert the object values into groups of simpler values for storage in the database (and convert them back upon retrieval), or only use simple scalar values within the program. Object-relational mapping is used to implement the first approach.
The heart of the problem is translating the logical representation of the objects into an atomized form that is capable of being stored on the database, while somehow preserving the properties of the objects and their relationships so that they can be reloaded as an object when needed. If this storage and retrieval functionality is implemented, the objects are then said to be persistent.
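To make the problem concrete, here is a rough sketch of the first approach done by hand with plain JDBC: the object's fields are decomposed into scalar column values on the way in and reassembled on the way out. This is exactly the boilerplate an ORM tool generates for you; the in-memory H2 URL and table layout are assumptions for the example.
// Manual object-relational conversion with plain JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class ManualMappingDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE employee (name VARCHAR(100), salary DOUBLE)");
            }
            // object fields -> scalar columns (storage)
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO employee (name, salary) VALUES (?, ?)")) {
                ps.setString(1, "Alice");
                ps.setDouble(2, 50000.0);
                ps.executeUpdate();
            }
            // scalar columns -> object fields (retrieval)
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT name, salary FROM employee")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " earns " + rs.getDouble("salary"));
                }
            }
        }
    }
}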
Hibernate

Hibernate's primary feature is mapping from Java classes to database tables (and from Java data types to SQL data types). Hibernate also provides data query and retrieval facilities. Hibernate generates the SQL calls and attempts to relieve the developer from manual result set handling and object conversion and keep the application portable to all supported SQL databases with little performance overhead.
Features of Hibernate:
Transparent persistence without byte code processing
Object-oriented query language
Object / Relational mappings
Automatic primary key generation
Object/Relational mapping definition
HDLCA (Hibernate Dual-Layer Cache Architecture)
High performance
J2EE integration
JMX support, Integration with J2EE architecture
Hibernate may not be the best solution for data-centric applications that use only stored procedures to implement the business logic in the database; it is most useful with object-oriented domain models and business logic in a Java-based middle tier. However, Hibernate can certainly help you remove or encapsulate vendor-specific SQL code, and it streamlines the common task of translating result sets from a tabular representation to a graph of objects.
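As a rough usage sketch, reusing the Employee POJO from earlier (and assuming a hibernate.cfg.xml and a mapping for it exist; the names here are illustrative, not prescriptive):
// Typical Hibernate flow: build a SessionFactory, open a Session,
// and let Hibernate generate the SQL for saves and queries.
import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class HibernateDemo {
    public static void main(String[] args) {
        SessionFactory factory = new Configuration().configure().buildSessionFactory();
        Session session = factory.openSession();
        try {
            session.beginTransaction();
            session.save(new Employee("Alice", "E1", 50000.0)); // Hibernate emits the INSERT
            session.getTransaction().commit();

            // HQL: an object-oriented query language translated to SQL
            List<Employee> all = session.createQuery("from Employee", Employee.class).list();
            for (Employee e : all) {
                System.out.println(e.getName());
            }
        } finally {
            session.close();
            factory.close();
        }
    }
}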
IBatis / MyBatis
iBATIS is a persistence framework which automates the mapping between SQL databases and objects in Java, .NET, and Ruby on Rails. In Java, the objects are POJOs (Plain Old Java Objects). The mappings are decoupled from the application logic by packaging the SQL statements in XML configuration files. The result is a significant reduction in the amount of code that a developer needs to access a relational database using lower level APIs like JDBC and ODBC.
Other persistence frameworks such as Hibernate allow the creation of an object model (in Java, say) by the user, and create and maintain the relational database automatically. iBATIS takes the reverse approach: the developer starts with an SQL database and iBATIS automates the creation of the Java objects. Both approaches have advantages, and iBATIS is a good choice when the developer does not have full control over the SQL database schema.
For example, an application may need to access an existing SQL database used by other software or access a new database whose schema is not fully under the application developer's control, such as when a specialized database design team has created the schema and carefully optimized it for high performance.
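A sketch of this SQL-first style using MyBatis annotations (the mapper interface, table, and column names are assumptions, and the SqlSessionFactory configuration is elided):
// In MyBatis the developer writes the SQL; the framework maps the
// result columns onto a POJO such as the Employee class shown earlier.
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

interface EmployeeMapper {
    @Select("SELECT name, id, salary FROM employee WHERE id = #{id}")
    Employee findById(String id);
}

class MyBatisDemo {
    static void printEmployee(SqlSessionFactory factory, String id) {
        try (SqlSession session = factory.openSession()) {
            Employee e = session.getMapper(EmployeeMapper.class).findById(id);
            System.out.println(e.getName());
        }
    }
}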
Features of iBATIS:
Support for unit of work / object-level transactions
In-memory object filtering
Provides an ODMG-compliant API and/or OCL and/or OPath
Supports multiple servers (clustering) and simultaneous access by other applications without loss of transaction integrity
Built-in support for caching
Supports disconnected operations
Support for remoting and distributed objects
The iBatis framework is a lightweight data mapping framework and persistence API that can be used to quickly leverage a legacy database schema to generate a database persistence layer for your Java application.
TopLink

TopLink Essentials is the reference implementation of the EJB 3.0 Java Persistence API (JPA) and the open-source community edition of Oracle's TopLink product.
TopLink Essentials is a limited version of the proprietary product. For example, TopLink Essentials doesn't provide cache synchronization between clustered applications, some cache invalidation policies, or the Query Cache.
Features of TopLink:
Query framework that supports an object-oriented expression framework, Query by Example (QBE), EJB QL, SQL, and stored procedures
Object-level transaction framework
Caching to ensure object identity
Set of direct and relational mappings
EIS/JCA support for non-relational data sources
Visual mapping editor (Mapping Workbench)
Database and JEE Architecture independent
Oracle TopLink delivers a proven, standards-based enterprise Java solution for all relational and XML persistence needs, built on high performance and scalability, developer productivity, and flexibility in architecture and design.
PHP ORM Tools
- Doctrine ORM
- RedBeanPHP
- Eloquent ORM
- Propel
- Analogue ORM
- Sheep Orm
- RedBeanPHP4
NoSQL
NoSQL encompasses a wide variety of different database technologies that were developed in response to the demands presented in building modern applications:
Developers are working with applications that create massive volumes of new, rapidly changing data types — structured, semi-structured, unstructured and polymorphic data.
Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times every day.
Applications that once served a finite audience are now delivered as services that must be always-on, accessible from many different devices and scaled globally to millions of users.
Organizations are now turning to scale-out architectures using open software technologies, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.
Relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today.
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality (see the sketch after this list).
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
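To make the key-value model concrete, here is a minimal sketch using the Jedis client for Redis; the key-naming scheme is an assumption for illustration.
// Key-value access with Redis via the Jedis client:
// every item is just a key paired with a value.
import redis.clients.jedis.Jedis;

public class KeyValueDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("user:1:name", "Alice");            // store
            System.out.println(jedis.get("user:1:name")); // retrieve -> Alice
            jedis.incr("user:1:logins");                  // typed value: integer counter
        }
    }
}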
The Benefits of NoSQL
When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:
Large volumes of rapidly changing structured, semi-structured, and unstructured data
Agile sprints, quick schema iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Geographically distributed scale-out architecture instead of expensive, monolithic architecture
Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
The 4 Modules of Hadoop
Hadoop is made up of "modules", each of which carries out a particular task essential for a computer system designed for big data analytics.
1. Distributed File-System
The most important two are the Distributed File System, which allows data to be stored in an easily accessible format, across a large number of linked storage devices, and the MapReduce - which provides the basic tools for poking around in the data.
(A "file system" is the method used by a computer to store data, so it can be found and used. Normally this is determined by the computer's operating system, however a Hadoop system uses its own file system which sits "above" the file system of the host computer - meaning it can be accessed using any computer running any supported OS).
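As a brief sketch, an application reaches this layered file system through the HDFS client API rather than through the host operating system; the cluster address and paths below are hypothetical.
// Copying a local file into HDFS through the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical cluster address
        try (FileSystem fs = FileSystem.get(conf)) {
            // the file is split into blocks and distributed across datanodes
            fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/data/remote.txt"));
        }
    }
}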
2. MapReduce
MapReduce is named after the two basic operations this module carries out: reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations, e.g., counting the number of males aged 30+ in a customer database (reduce).
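The canonical example is word count: map reads raw text and emits a (word, 1) pair for each word, and reduce sums those counts per word. Here is a condensed sketch against the Hadoop MapReduce API (the job driver and configuration are elided):
// Word count with the Hadoop MapReduce API.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // sum the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}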
3. Hadoop Common
The other module is Hadoop Common, which provides the tools (in Java) needed for the user's computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.
4. YARN
The final module is YARN, which manages resources of the systems storing the data and running the analysis.
Various other procedures, libraries, and features have come to be considered part of the Hadoop "framework" over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common, and Hadoop YARN are the principal four.
How Hadoop Came About
Development of Hadoop began when forward-thinking software engineers realised it was quickly becoming necessary to store and analyze datasets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk).
This is partly because as physical storage devices become bigger it takes longer for the component that reads the data from the disk (which in a hard disk, would be the "head") to move to a specified segment. Instead, many smaller devices working in parallel are more efficient than one large one.
It was released in 2005 by the Apache Software Foundation, a non-profit organization which produces open source software which powers much of the Internet behind the scenes. And if you're wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!
The Usage of Hadoop
The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily-available parts from any IT vendor.
Today, it is the most widely used system for providing data storage and processing across "commodity" hardware - relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than half of the companies in the Fortune 500 make use of it.
Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the "official" product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.
In its "raw" state, using the basic modules supplied by Apache at http://hadoop.apache.org/, Hadoop can be very complex, even for IT professionals. This is why various commercial versions such as Cloudera have been developed, which simplify the task of installing and running a Hadoop system as well as offering training and support services.
So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business expands. And the support and enthusiasm of the open source community behind it have led to great strides towards making big data analysis more accessible for everyone.
Hadoop Architecture
Hadoop works in a master-slave fashion. There is one master node and there are n slave nodes, where n can be in the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes.
Hadoop Daemons
Namenode – It runs on master node for HDFS.
Datanode – It runs on slave nodes for HDFS.
ResourceManager – It runs on master node for Yarn.
NodeManager – It runs on slave node for Yarn.
Hadoop Flavors
Apache – Vanilla flavor, as the actual code is residing in Apache repositories.
Hortonworks – Popular distribution in the industry.
Cloudera – It is the most popular in the industry.
MapR – It has rewritten HDFS and its HDFS is faster as compared to others.
IBM – Its proprietary distribution is known as BigInsights.
Hadoop Ecosystem
HDFS – Distributed storage layer for Hadoop.
YARN – Resource management layer introduced in Hadoop 2.x.
Map-Reduce – Parallel processing layer for Hadoop.
HBase – A column-oriented NoSQL database that runs on top of HDFS; it does not use a structured query language.
Hive – Apache Hive is a data warehousing infrastructure based on Hadoop that enables easy data summarization using SQL-like queries (see the JDBC sketch after this list).
Pig – A high-level scripting language used with Hadoop. Pig enables writing complex data processing flows without Java programming.
Flume – It is a reliable system for efficiently collecting large amounts of log data from many different sources in real-time.
Sqoop – A tool designed to transfer huge volumes of data between Hadoop and relational databases (RDBMS).
Oozie – A Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
Zookeeper – A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Mahout – A library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
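As one concrete entry point into this ecosystem, Hive can be queried from Java over JDBC through HiveServer2; the host, port, and table in this sketch are assumptions for illustration.
// Querying Hive over JDBC (requires the hive-jdbc driver on the classpath).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // "hive-host" and the logs table are hypothetical
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                "SELECT page, COUNT(*) FROM logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}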
Information Retrieval System
An Information Retrieval system is part and parcel of a communication system. The main objective of information retrieval is to supply the right information into the hands of the right user at the right time. Various materials and methods are used for retrieving our desired information. The term "information retrieval" was first introduced by Calvin Mooers in 1951.
Definition:
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. It is a part of information science, which studies the activities relating to the retrieval of information. Searches can be based on metadata or on full-text indexing.
The elements of Information Retrieval:
Information Retrieval mainly consists of four elements:
a. Information carrier.
b. Descriptor.
c. Document address.
d. Transmission of information.
Now, these elements are briefly discussed below:
Information carrier: A carrier is something that carries, holds, or conveys something; an information carrier, accordingly, is something that carries or stores information, for example film, magnetic tape, CDs, and DVDs.
Descriptor: A term used to search for information in storage is known as a descriptor. Descriptors are the keywords we use to search for information on a storage device.
Document address: Every document must have an address that identifies its location. Here, a document address means an ISBN, ISSN, code number, call number, shelf number, or file number that helps us retrieve the information.
Transmission of information: Transmission of information means delivering a document into the hands of users when needed. An information retrieval system uses various communication channels and networking tools to do this.
The functions of Information Retrieval:
The main function of an Information Retrieval system is to supply the right information into the hands of the right user at the right time. The functions of an Information Retrieval system are discussed below:
Acquisition: This is the first and main function of information retrieval. Acquisition means collecting information from various sources, such as books, documents, databases, and journals.
Content analysis: The second step of the information retrieval system is to analyze the acquired information; at this step, a decision may be made as to whether the collected document is valuable or not.
Content presentation: Information presentation is the system for presenting information to the user. Information should be presented clearly and effectively so that users can understand it easily. For this purpose, catalogs, bibliographies, indexes, and current awareness services help a lot.
Creation of file/store: At this stage, the library authority creates a new file for storing the collected information, ready for presentation, and organizes those files in a systematic way.
Creation of search methods: At this stage, the authority decides what kind of search logic to use for searching and retrieving information.
Dissemination: The last stage of the information retrieval system is dissemination, the act of spreading information widely. At this stage, the library authority disseminates information to users in a systematic way.
Techniques of Information retrieval:
The main objective of a Library and Information Center is to supply the right information into the hands of the right user at the right time. For this, they use two techniques for retrieving information more effectively:
1. The traditional system, and
2. Non-traditional system.
Now those are briefly discussed below.
1. Traditional system: The main functions of a Library and Information Center are the acquisition, organization, preservation, and dissemination of materials. For this, they follow some traditional systems, such as cataloging, classification, indexing, abstracting, bibliographies, and authority files. Some of them are briefly discussed below:
Index: An index is an alphabetical list at the back of a book saying where particular things are mentioned in the book. It is a very important tool for information retrieval.
Abstract: An abstract is a concise and accurate representation of the contents of a document. It also serves as a retrieval tool.
Bibliography: A bibliography is also a list of books, not confined to a particular library. It acts as a retrieval tool.
Authority file: An authority file is a list of files containing call numbers and class numbers without any specific rules. It is also an important tool for retrieving information.
2. Non-traditional/modern system: Besides the traditional systems, in the present time we also use some non-traditional/modern systems for information retrieval. They come in two types:
a. Semi-automatic system: A combination of man and machine used for retrieving information. Besides the use of machines, human intelligence and physical labor are also needed. Catalog cards, punch cards, edge-notched cards, aperture cards, etc. are examples of the semi-automatic system.
b. Automatic system: Automatic retrieval tools were first introduced in the late 1980s. In this system, all retrieval work is done with the help of computers and other related modern technologies, and human physical labor is largely unnecessary. Some of the automatic tools are:
computers, modems, CD-ROMs, hard disks, floppy disks, the Internet, etc.
Computer: an electronic machine that can store and work with large amounts of information.
CD-ROM: a compact disc used as a read-only optical memory device for a computer system.
Hard Disk: a rigid non-removable magnetic disk with a large data storage capacity.
Floppy Disk: a flexible removable magnetic disk (typically encased in a hard plastic shell) for storing data.
Internet: a global computer network providing a variety of information and communication facilities, consisting of interconnected networks using standardized communication protocols.
Basic Retrieval Tools
- Bibliographies
- Catalogs
- Indexes
- Finding Aids
- Registers
- Online Databases
Written by Hansi