Tuesday, 5 January 2016

SharePoint and Hadoop integration

Hi Friend today I am going to explain how to integrate two vast technologies that are SharePoint and Hadoop. Actually I started learning Hadoop to impress a girl but unfortunately I failed to impress her but I succeed in learning hadoop. Millions of thanks to my dream girl she is the inspiration for me to learn hadoop. Hadoop (Big Data) is like a teenage Sex. Everyone talks about it, nobody really knows how to do it ,everyone  thinks everyone else is doing it, so everyone claims they are doing it…   From last couple of years I fall into love with SharePoint many times but I don’t have chance to marry SharePoint... J So I decided to integrate two oceans with the help of MSBI.

Below is the screen short for your reference.
Below is the architecture diagram of Hadoop for your reference:


Import data from RDBMS to HDFS/Hadoop using Sqoop:
Sqoop has the notion of connectors, which contain the specialized logic to read and write to external systems. Sqoop comes with two classes of connectors: a common connector for regular reads and writes, and a "fast" connector that uses database-proprietary batch mechanisms for efficient imports. Below Figure shows these two classes of connectors and the databases that they support.
Sqoop connectors
In this technique, we will look at how to use Sqoop as a simple mechanism to bring relational data into Hadoop clusters. We'll walk through the process of importing data from MySQL into Sqoop.
I use a Hadoop cloud with 1 master node and 4 slave nodes, all running on Ubuntu. The Hadoop runs on user 'Hadoop' with password ' Hadoop'

MySQL
The first step you have to do is to make some changes in your mysql db if you haven't already done it.
Enable remote access mode for the database:
This step is to make sure that all the slave nodes can reach the database to enable the import process running in parallel.

Open the file :  "/etc/mysql/my.cnf"
         vim /etc/mysql/my.cnf

Change the bind address from "localhost" to IP address of the machine.
        bind-address  = <IP Address>

 Restart MySQL daemon:
     /etc/init.d/mysqld restart

  Ensure that MySQl is running:
   ps aux | grep mysqld

2. Create user for each of the remote node on Mysql DB This step is to make sure that all the users coming from various IPs would be able to connect to the database.

  Connect with the root password:
       mysql -uroot -p<root_pass>

 In the Mysql Shell, start creating users. I created 5 users hadoop on different domains.
  create user 'hadoop'@'<ip of master>' IDENTIFIED BY 'hadoop';
  * The first 'hadoop' in the above query is username and second is password
 
Grant permissions to this user. I granted all the permissions. You can define yours.     GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'<ip of master>' WITH GRANT OPTION;

  Repeat this process for all the slaves.

SQOOP:
Install sqoop from Apache's project link:  http://www.apache.org/dyn/closer.cgi/sqoop/

1.
Set the HADOOP_HOME to hadoop's root directory.
2. If you are not using HBASE, and you don't set the variable HBASE_HOME , sqoop would complain to start.

To make it work, open file

    $SQOOP_HOME/bin/configure-sqoop
and comment out all the lines which have HBASE_HOME or HBASE . Save the file and Close.
3. Now start Sqoop: $SQOOP_HOME/bin/sqoop.

SQOOP and MySQL
The next step is to link sqoop and MySQl by installing the driver.

Since I installed the Apache's Sqoop, by default it doesn't come with the mysql driver. However, Cloudera versions of Sqoop have the driver and this step would not be required.

1
. Download the database driver from mysql website: http://dev.mysql.com/downloads/connector/j/

2
. Untar the tarball and move the .jar file inside $SQOOP_HOME/ lib/

SQOOP and Hadoop
We are almost done. The last thing left is to fire the correct command on sqoop shell.
The skeleton of the command I used :

   $SQOOP_HOME/bin/sqoop  import  --connect jdbc:mysql://<ip address of mysql server> :3306/<database_name>  --username hadoop  --password hadoop  --table <table to import>  --split-by <field to split the data> -m 8 --fields-terminated-by '\t'


Here I explain the each option in the command:
  --connect  jdbc:mysql://<ip address of mysql server> :<port of mysql>/<database_name>
       This specifies the connection parameters i.e IP address and port of MySQL. If your
       mysql installation is running on a port other than 3306 (default port) , then reflect the change  accordingly. You have to specify the name of the database from which data needs to be imported.

--username hadoop  --password hadoop

    The username and password to connect to MySQL. We created these explicitly in earlier steps.
--table  <table to import>
     The name of the table from which data will be imported. You can also import output of specific queries from the table. This is not explained in this write-up.
--split-by <field to split the data> -m 8

    since I am using hadoop, I want to use its parallelization advantage so that importing can be done in parallel. I specify the number of mappers by -m 8. In my case, I use 8 mappers.
The split-by field splits the data for each mapper. You can specify this field based on the data you want to process. Choose this field carefully to exploit the advantage of parallelization. It’s better to choose a field with values well spread over. You can even give string fields as the splitter. Internally, SQOOP will  calculate the MAX and the MIN of this field, and would create partitions for each mapper based on this   information
--fields-terminted-by '\t'
     I specify that the fields would be separated by a tab in the final file.
The final result of the above process was the creation of the folder /user/hadoop/<table> in HDFS with 8 files, each from one mapper.
Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources. Data analysts use Hive to explore, structure and analyze that data, then turn it into actionable business insight.
Advantages of using Hive for enterprise SQL in Hadoop:
Feature
Description
Familiar
Query data with a SQL-based language
Fast
Interactive response times, even over huge datasets
Scalable and Extensible
As data variety and volume grows, more commodity machines can be added, without a corresponding reduction in performance

Now we use HIVE to process the data from HDFS. Before this we have to know how HIVE works below are the details explanation:
The tables in Hive are similar to tables in a relational database, and data units are organized in taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.
Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
The next step is to transform the files with Hive (which we have loaded into HDFS in the previous steps).

Design the Hive schema and database

First, you need to understand how the data is laid out in these files. Issue the following command in the Linux terminal window:
$ head -2 Filename.ext
The head -2 command enables you to see the first two lines (header plus the first line) of your file. In Linux, you can use different ways to compare files (diff, etc.), but for now, simply look at the column names and data using the following
Click the InfoSphere BigInsights shell, then click Hive Shell.
I typically create text files for each table, write the SQL and Data Manipulation Language (DML) code there, save the tables, then paste the code into the Hive shell. This method is far easier than typing SQL and DML code line by line in a command shell.
After the tables have been created, they are ready to be populated using the following command:
LOAD DATA INPATH '/user/hadoop/<table> '
OVERWRITE INTO TABLE A.tbl_6005;
select * from A.tbl_6005 limit 10;
select count(*) from A.tbl_6005;
After populating the tables and running some simple select statements, the data should be correct. You can delete the header files before loading them into the HDFS or Hive table.
Note: Remember to include row format-delimited fields terminated by single quotation marks (') when you create the tables. If you leave out that key part, your data will load incorrectly. Build the master table to integrate the data after building the core tables; the next step is to build the master table that integrates the data from all tables. This task can be complicated, depending on your sources.
Now we have to use MSBI(SSIS) technology to import data into SQL database from Hive database with the help ODBC connector  for detail explanation please follow this link thanks Matt Masson for the great article. Once you implemented the things which are mentioned in the blog, then we are ready to create the Cube with the help of data base which we have imported using ODBC.
Creating cube (SSAS):
Multidimensional cubes and transactional databases are two very different things. From experience I have seen even veteran DBAs avoid the subject of cubes completely because it is too much of an unknown area for them. Often cube work is passed on to developers because of their comfort with using Visual Studio. This is a great pity because it is, in reality, not very difficult at all to create an OLAP cube. It is safe to say that most of the work needs to be done in a traditional SQL Server database engine / SSIS environment from creating the data warehouse model to keeping it fed with ETL packages.

Creating an SSAS Project in SSDT:

This is one of the most advanced examples in the book, so get ready for some fun. You’ll build a cube in SSAS, which gives you high-speed multidimensional analysis capability. This one will use UDM, but you’ll get a chance to use BISM in a little bit. Building your cube will require several steps: You’ll need to build a data source, a data source view, some dimensions, and some measures before your cube can be realized.
Start a New Project
To build an SSAS cube, you must first start a project by following these steps:
1.       Open the SQL Server Data Tools and create a new project.
2.       In the New Project dialog box under Installed Templates on the left, choose Business Intelligence Analysis Services.
3.       In the main pane, select Analysis Services Multidimensional and Data Mining Project, as you can see in Figure.
Figure:The New Project dialog with Analysis Services Multidimensional and Data Mining Project selected
4. Name your project FirstCube and click OK.
You’re now presented with an empty window, which seems like a rare beginning to a project with a template; really, you have nothing to start with, so it’s time to start creating. The first component you’ll need is somewhere to retrieve data from: a data source.

Building a Data Source

To create the data source you’ll use for your first cube, follow these steps:
1.       Navigate to the Solution Explorer pane on the right, right-click Data Sources, and click New Data Source. This will bring up the Data Source Wizard, which will walk you through the creation process just as you’d expect.
2.       Before you skip by the opening screen as you usually would, though, take note of what it says (just this once. . .you can skip it later). I won’t re-type it here, but it’s giving you a heads-up about the next component you’ll create: the data source view.
3.       Meanwhile, go ahead and click Next to continue creating your data source. In this next screen, it’s time to set up a connection string.
4.       If your database is visible as a selection already, go ahead and choose it; if not, click New.
5.       For your server name, enter (local), and then drop down the box labeled Select or Enter a Database Name and choose imported Database from previous steps I named it as PolaSoft.
6.       Click OK to return to the wizard and then click Next.
7.       You can now enter the user you want SSAS to impersonate when it connects to this data source. Select Use the Service Account and click Next. Using the service account (the account that runs the SQL Server Analysis Server service) is fairly common even in production, but make sure that service account has privileges to read your data source.
8.       For your data source name, type your name and then click Finish.

Building a Data Source View

Now that you’ve created a data source, you’ll need a data source view (as the Data Source Wizard suggested). Follow these steps:
1.       Right-click Data Source Views and choose New Data Source View. Predictably, up comes the Data Source View Wizard to walk you through the process. Click Next.
2.       Make sure the PolaSoft data source is selected and then click Next.
3.       On the Select Tables and Views screen, choose FactInternetSales under Available objects and then click the right arrow to move it into the Included Objects column on the right.
4.       To add its related dimensions, click the Add Related Tables button as shown in Figure and then click Next. Note that one of the related tables is a fact, not a dimension. There’s no distinction made at this level. Later, you will be able to select and edit dimensions individually.
Adding tables to the view
1.       On the last screen, name your data source view according to its contents: Internet Sales.
2.       Click Finish to create the Internet Sales data source view, and you’ll see it in the content pane, looking something like Figure  (your exact layout may vary).
The finished Internet Sales view

Creating Your First Cube

Now for the exciting part…you get to create your first cube.
1.       Right-click Cubes in the Solution Explorer and select New Cube to bring up the Cube Wizard. This will walk you through choosing measure groups (which you currently known as fact tables), the measures within them, and your dimensions for this cube. Don’t worry about the word “cube” here and think you just have to stick with three dimensions, either; cube is just a metaphor, and you can create a four-dimensional hypercube, a tesseract, or an unnamed higher-dimensional object if you want (and you’re about to do so!). To begin, click Next.
2.       On the Select Creation Method screen, make sure Use Existing Tables is selected, and click Next.
3.      
The wizard will now want you to tell it where to find measure groups. You could help it out by telling it those are in your fact tables, but never mind — it’s smart enough to figure it out. If you click Suggest, it will automatically select the correct tables. Do so (the result is shown in Figure and then click Next.

Selecting Measure Group Tables
1.       Now the wizard would like to know which measures from your measure groups (fact tables) you’d like to store in the cube. By default it’s got them all selected; go ahead and accept this by clicking Next.
2.       At this point, you have measures, but you still need dimensions; the wizard will select the dimension tables from your data source view and invite you to create them as new dimensions in the UDM. Again, by default they’re all selected, and you can click Next.
3.       The wizard is now ready to complete. Verify you have something that looks like Figure, and go back to make corrections if you need. If everything appears to be in order, click Finish.
Completing the Cube Wizard

Making Your Cube User-Friendly

Right about now, you’re probably expecting something like “congratulations, you’re done!” After all, you’ve built up the connection, designated the measures and dimensions, and defined your cube, so it would be nifty if you could just start browsing it, but you’re not quite there yet. First you’ll want to make some of your dimensions a little more friendly; they’re currently just defined by their keys because SSAS doesn’t know which fields in your dimension tables to use as labels. Once you’ve settled that, you’ll need to deploy and process your cube for the first time before it’s ready to use.
1.       In the Solution Explorer under Dimensions, double-click DimDate. The Dimension Editor will come up, allowing you to make this dimension a bit more useable.
2.       To make the date attributes available, highlight all of them (except DateKey, which as you can see is already in the attribute list) and drag them to the attribute list.
3.       Date, of course, is a dimension that can be naturally defined as a hierarchy (like you did quite manually in the T-SQL grouping examples). Drag Fiscal Quarter from the Attributes pane to the Hierarchies pane to start creating a hierarchy.
4.       Drag Month Number of Year to the tag under Fiscal Quarter, and DateKey similarly below that.
5.       Finally, rename the hierarchy (right-click it and choose Rename) to Fiscal Quarter - Month. The result should look something like Figure .


Renaming The Hierarchy
1.       Save the DimDate dimension and close the dimension editor. You will be prompted to save changes to your cube along with the new dimension changes; do so.
2.       For each of the other dimensions, don’t create hierarchies for now, but bring all the interesting text columns into the Attributes pane (you can bring over all the non-key columns except for the Large Photo column in the Products table), and re-save the dimensions.

Deploying the Cube

There’s more you can do to create useful hierarchies, but for now it’s time to build, deploy, and process the cube. This process can be started by following these steps.
1.       Select Deploy First Cube on the Build menu. You’ll see a series of status messages as the cube is built, deployed, and processed for the first time. You’ll receive a few warnings when you deploy FirstCube, and if they’re warnings and not errors, you can safely ignore them for now.
2.       When it’s done and you see Deployment Completed Successfully in the lower right, your first cube is ready to browse.
NOTE When you set up a user in your data source view, you chose the service user — this is the user that’s running the Analysis Services. If that user doesn’t have a login to your SQL Server, you’re going to receive an error when you try to process your cube.
In addition, this example bypasses a step that’s important for processing hierarchies in cubes with large amounts of data: creating attribute relationships. The cube will still successfully process (though you will receive a warning),For larger data volumes, you will need to address this warning. For more information on how to do that, consult the more complete SSAS text.
  1. In the Solution Explorer pane, double-click the Internet Sales cube and then look in the tabs above the main editing pane for the Browser tab and click that.
  2. Now you can drag and drop your measures (such as ExtendedAmount) and your dimensions and hierarchies (look for the Fiscal Quarter - Month hierarchy under the Due Date dimension) into the query pane, and voilà — your data is sliced and diced as quickly as you please.

How It Works

Whew! That was a lot of setup, but the payoff is pretty good too. What you’ve done is to build your first cube, and under the hood you’ve created a UDM-based semantic model queryable through the MDX language. This cube isn’t fully complete — you’d probably want to add some aggregations, attribute relationships, and other elements, but it’s an impressive start.
It started when you chose your project type. The Multidimensional project types build the UDM-based data models, whereas the Tabular Project types build your model in BISM. Because I plan to bring you through PowerPivot shortly (which is BISM-based), I let you down the UDM route here. You’ll find that for basic operations the two are equally capable.
Once you had your project put together, you had a few components to create on the way to browsing your cube. Let’s call a few out.
  • Data source: Your data source is a connection to an individual place where data for your BI reporting can be found. While this one was a SQL Server data source, you can use any number of providers, both included and third-party. Nothing here should surprise you too much; this is a similar kind of list to what you’d find in SSIS, for example.
  • Data source views: A data source view is a much more interesting animal. Using a data source, the data source view contains a set of tables or views, and defines the relationships among them. Each DSV is usually built around a business topic, and contains any tables related to that topic.
  • Cubes: While the next thing you proceeded to create was a cube, the Cube Wizard went ahead and built measure groups and dimensions for you along the way. Without those, you haven’t got much of a cube. The cube isn’t a pass-through directly to your source data. To update the data in the cube, you must process the cube; you can do this through a regularly scheduled job with SQL Agent or, of course, manually. In this case, the wizard took care of a lot of the details for you, but you’ll read more about what the cube really is in a few minutes.
Once Cube gets ready with data, users can run queries on Cube created in SSAS. SSRS Reports and Excel Pivoting/Power Pivot can use OLAP Cube as source of data instead of OLTP database to get performance for
Complex Queries. SSRS Reports; Excel Power Pivot can be used for visualization/analysis of data from cube.


Based on the cube we can Create SSRS reports and deploy it to SharePoint to do these please follow my previous article here
Conclusion: Now we have Cube ready we can now create SSRS reports based on cubes ,or PPS dashboards based on cubes, Excel reports or we can integrated the IBM cognos or SAP BI. And also we can create a web service based on the database which we have imported from HIVE database.
once again thanks to my one side love… J Hope this will help you thank for reading…








20 comments:

  1. hey do u have any idea how to get the data from sharepoint to hadoop , as the reverse of wat you have explained here

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. href="https://istanbulolala.biz/">https://istanbulolala.biz/
    5JF

    ReplyDelete
  4. <li style="-webkit-tap-highlight-color: transparen

    ReplyDelete