Big Data Privacy and Security - Case Study

👤 Diwas Poudel    🕒 23 May 2019    📁 TECH


Driven by the combination of massively parallel computing power, advanced communications and scalable platforms, the amount of data generated by social networking sites, the internet, sensor networks, smart IoT devices, health applications and many other sources is increasing rapidly day by day. Big data is a large volume of data that is mined or generated from various sources, in various formats and at very high speed. It has become a prominent topic, and big data tools are used not only to analyze patterns but also to predict the likelihood of an event. Conventional techniques cannot handle data of this scale, so we must either improve existing techniques or adopt new ones.

From small organizations to big ones, all use big data. Popular e-commerce sites like Amazon and Alibaba use big data to learn our browsing habits and shopping behaviour. Social networking sites such as Twitter, Facebook and Instagram store information about our personal relationships and lives. Popular video sites such as YouTube, Netflix and IMDB use big data to recommend videos based on our search history. Job sites like LinkedIn recommend suitable jobs based on our search history, profile and connections. Education sites like Coursera and Udemy recommend suitable courses. Dating sites like Tinder use our preferences to find the best partner for us. App stores like Apple's App Store and Google Play learn our app downloading patterns and recommend apps to us. To make these recommendations, big data companies gather our personal details and search history, then store and reuse them, and in the process our privacy is put at great risk. Big data companies earn commercial profits from this data, so they should protect our privacy.

This data must therefore not be compromised, and it must be handled in a manner that respects the privacy of the individuals it represents.


Among the many concerns in big data mining approaches, security and privacy are major ones. The main aim of this report is to empower people to protect their privacy.

Other specific objectives for this project are:

  • To make everyone aware of the importance of their data.
  • To provide information about how to protect their data.
  • To provide knowledge to people about how big data companies use their data for their own benefit.


3.1 Data Generation/Collection

Data generation, or collection, is a key phase of the big data life cycle. The sources of data can be distributed across various sites, and the data generation phase is the first step in gathering that data. According to the sixth edition of DOMO’s report, “Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.” It is therefore hard for traditional systems to handle this data.

3.2 Data Storage

In this phase, collected data are stored in large-scale data sets. A data storage system consists of hardware infrastructure and data management software tools. Hardware infrastructure refers to the information and communications technology (ICT) resources used for tasks like distributed storage. Data management software tools are the collection of software tools stacked on top of the hardware infrastructure to query and manage large data sets; they help to reduce complexity and drive performance. The data storage system should also provide interfaces to interact with and examine the stored data.

3.3 Data Analytics

The big data analytics phase involves complex processing that examines large volumes of stored data to uncover information such as hidden patterns and unknown correlations. This enables companies to better understand their products, customers and market trends, and helps them make appropriate decisions. In this phase, data mining algorithms such as classification, clustering and association rule mining are applied, and these can also extract sensitive data.
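
To make association rule mining a little more concrete, here is a minimal sketch of the two measures it usually relies on, support and confidence. The shopping baskets and item names are invented for illustration and do not come from any real data set.

```python
# Hypothetical shopping baskets; every item name is made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule "bread -> milk": how often do baskets with bread also contain milk?
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule with high confidence, such as "people who buy X also buy Y", is useful for recommendations, but it can also reveal sensitive habits, which is why this phase matters for privacy.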

3.4 Knowledge creation phase

Here, valuable information obtained from the data analytics phase is used by decision makers to make decisions related to their business.


The most important challenges that should be taken into consideration while dealing with big data are:

4.1 Random Distribution

A major challenge for big data solutions is distributing the data properly, because the data arrive from random sources.

4.2 Privacy

Privacy is also a big concern for organizations with big data stores. The biggest challenge for a big data company is to protect individuals' data so that it is not leaked, exploited or altered.

4.3 Computations

Computations and other digital data in a distributed framework such as Hadoop's MapReduce mostly lack security protection. The two main preventive measures are securing the mappers and protecting the data in the presence of an unauthorized mapper.

4.4 Communication

Data are stored on many different nodes, which may belong to many different clusters distributed across different parts of the world. Communication between the nodes of a cluster, or from cluster to cluster over a public or private network, is therefore a challenging task.

4.5 Access Control

Access control is the selective restriction of access to, or use of, the resources of a computing environment. In a big data environment, access to the data should be managed by a strong access control mechanism, such as locks and login credentials, to deny anonymous users access to the storage servers.
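
As a rough illustration of "deny anonymous users", the sketch below checks an action against a role table and refuses anything it does not recognise. The roles, permissions and the `is_allowed` helper are hypothetical names chosen for this example, not part of any real system.

```python
# Hypothetical roles and permissions; none of these names come from a real system.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

def is_allowed(user, action):
    """Deny by default: anonymous users (None) and unknown roles get no access."""
    if user is None:
        return False
    return action in ROLE_PERMISSIONS.get(user.get("role"), set())

print(is_allowed({"name": "asha", "role": "analyst"}, "read"))   # True
print(is_allowed({"name": "asha", "role": "analyst"}, "write"))  # False
print(is_allowed(None, "read"))                                  # False
```

The design choice worth noting is the default: a request is rejected unless a rule explicitly permits it.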

5 Big Privacy and Security Concerns in Big Data

Among the many challenges, information privacy is one of the most important. It governs how data is collected, shared and used. One serious user privacy issue is the identification of personal data and information during the transmission of data over the World Wide Web [1]. Big data privacy matters for: 1) limiting power, 2) respect for individuals, 3) reputation management, 4) trust, and 5) freedom of thought and speech.

5.1 Big data Privacy and Security in Data Generation Phase

There are two kinds of data generation: passive and active. In active data generation, a third party gets the data from the data owner by asking permission [2], while in passive data generation, data produced by the data owner are gathered by a third party, and the data owner may not know that the data are being gathered.

Protecting data starts in this first phase of the life cycle. Data owners should hide or encrypt their sensitive personal data as much as possible when allowing a third party to collect it. In the data generation phase, privacy violations can be reduced by encrypting the sensitive data, by restricting access to the resource, or by falsifying the data.

5.1.1 Access restriction 

If a data owner has sensitive information that is not supposed to be shared but is being shared passively, a few measures can protect it: anti-tracking extensions like Ghostery and Privacy Badger, advertisement or script blockers like AdBlock Plus and ScriptBlock, and encryption tools like Cryptocat, Mailvelope and MiniLock.

5.1.2 Falsifying data

If a third party can access sensitive information, falsifying the data becomes necessary, and many users do it so that their true information is not revealed to the third party. Data owners use the following methods to falsify data:

A sockpuppet is used to present a false online identity for a user. When the same user operates multiple accounts, it can be harmful: it skews discussion and lets fake news be propagated very confidently. In the era of fake news, detecting sockpuppets is important. Sockpuppetry takes many forms, such as:

  a) Logging out to make problematic edits as an IP address
  b) Creating new or fake accounts to avoid detection
  c) Using another person's account
  d) Reusing old unused accounts

An individual's identity can be masked by using certain security tools, such as:

10minutemail: This is useful when the data owner needs to give an email ID to register on a site or just to access it. Such sites provide a fake number that can be used to create an email ID for verification.

Mask Me: This is useful when the owner of the data needs to give online shopping and credit card details.

Users should also protect themselves from phishing, spamming and spoofing so that their data stay secure. Attackers use these techniques to hijack data providers and gain access to their data.

5.2 Big data Privacy and Security in Data Storage Phase

Big data are stored in data centres, which can be in a distributed environment. Data stored in a data centre should not be compromised, because disclosure of the information can be harmful. We therefore need to ensure that stored data are secured and protected against such damage and threats. To keep collected data safe from data mining based attacks and unauthorized data access, organizations use security measures such as data partitioning, data anonymization and permutation.

Data anonymization Approaches

Data anonymization is the process of sanitizing data with the main intention of protecting privacy. It either removes personally identifiable information from data sets or encrypts the data so that the people whom the data describe remain anonymous. Data anonymization technology is mainly used for trajectory privacy, location privacy and database privacy, but here we consider all privacy related to cloud storage. Some data anonymization tools are CA Data Manager, Compuware and IBM Security Guardium.

There are many privacy models that can be used to strengthen data anonymization. Some of them are K-anonymity, L-diversity and T-closeness.


In K-anonymity, each record is indistinguishable from at least k-1 other records when an attempt is made to identify it. This approach guarantees that each sensitive attribute is hidden within a group of at least k records, so the probability of recognizing an individual record does not exceed 1/k. The level of privacy depends on the size of k.
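
A minimal sketch of how the k of a table could be measured: group the records by their quasi-identifiers and take the size of the smallest group. The records, the field names and the `k_anonymity` helper are assumptions made for this example.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical, already generalised records (age ranges, masked zip codes).
records = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "asthma"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]

k = k_anonymity(records, ["age", "zip"])
print(k)  # 2, so the chance of singling out one record is at most 1/2
```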


L-diversity [3] guarantees that each group contains at least L different values of the sensitive attribute. This means that an attacker has a probability of at most 1/L of recognizing a user’s sensitive personal data and information.
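
The check can be sketched in the same spirit: for each group of records with identical quasi-identifiers, count the distinct sensitive values and report the smallest count. The sample records and the `l_diversity` helper are hypothetical and only illustrate the simplest "distinct values" reading of the model.

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical records; the field names are assumptions for this example.
records = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "asthma"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]

print(l_diversity(records, ["age", "zip"], "disease"))  # 2
```

A group in which everyone has the same disease would give a result of 1: the table could still be k-anonymous, yet the sensitive value would be known for every member of that group.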


T‑closeness is a further refinement of L-diversity group based anonymization. Here [4], the distribution of the sensitive attribute is taken into consideration: the difference between the distribution of the sensitive attribute within any group and its distribution in the whole data set must not exceed a threshold T.
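
The sketch below compares each group's distribution of the sensitive attribute with the overall distribution. The text above does not say which distance measure is used, so this example assumes total variation distance as a simple stand-in; the original T-closeness proposal uses Earth Mover's Distance. The records and helper names are hypothetical.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def total_variation(p, q):
    """Half the sum of absolute differences between two distributions (assumed measure)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def max_group_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any group's distribution and the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(total_variation(distribution(v), overall) for v in groups.values())

# Hypothetical records: group A is all "flu", group B is half "flu", half "cancer".
records = [
    {"group": "A", "disease": "flu"},
    {"group": "A", "disease": "flu"},
    {"group": "B", "disease": "flu"},
    {"group": "B", "disease": "cancer"},
]

d = max_group_distance(records, ["group"], "disease")
print(d)         # 0.25
print(d <= 0.3)  # True: the table satisfies the threshold T = 0.3 under this measure
```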

5.2.1 Cloud Computing And its issues in Privacy

Cloud computing is an emerging technology for the next generation of IT applications. Most organizations use it because it is the most popular model for supporting large and complex data. Cloud computing reduces an organization's costs by using cheap infrastructure resources, and its elasticity makes it suitable for managing the ever-growing data sets of a big data application. However, it has vulnerabilities and potential risks. Privacy and security concerns are among the main barriers to shifting to cloud computing: because of multi-tenancy, processing or sharing private data in the cloud may lead to privacy problems. Data anonymization and encryption are two widely accepted ways to combat privacy breaches.

The papers [4][5] present some of the privacy issues on the cloud, which are as follows:

      • Lack of control over the data by the owner
      • Lack of expertise and training
      • Legal uncertainty
      • Unauthorized secondary usage
      • Litigations
      • Compelled disclosure to the government
      • Disclosure of breaches and data security
      • Location of data, retention and transfer
      • Data accessibility

5.2.2 Approaches to Privacy Preservation Storage On Cloud

A few approaches for safeguarding the privacy of the owner when data are stored on the cloud are as follows:

a) Attribute-based Encryption

In attribute-based encryption (ABE), the data owner defines access policies and rules, and the data are encrypted under those policies. The data can only be decrypted by a user whose attributes satisfy the access policies defined by the data owner. When dealing with big data, access policies may need to change often, because the data owner may have to share the data with different organizations.
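
The policy-matching idea can be sketched without any cryptography. In a real ABE scheme the policy is enforced mathematically by the ciphertext itself; the code below only models the question "do this user's attributes satisfy the owner's policy?". The policy format, the attribute names and the `satisfies` helper are assumptions made for this example.

```python
def satisfies(policy, attributes):
    """Evaluate a policy tree against a set of user attributes.

    A policy is a nested tuple: ("attr", name), ("and", p1, p2, ...)
    or ("or", p1, p2, ...). This format is hypothetical.
    """
    kind = policy[0]
    if kind == "attr":
        return policy[1] in attributes
    if kind == "and":
        return all(satisfies(p, attributes) for p in policy[1:])
    if kind == "or":
        return any(satisfies(p, attributes) for p in policy[1:])
    raise ValueError(f"unknown policy node: {kind!r}")

# Hypothetical policy: a doctor who works in cardiology or in emergency care.
policy = ("and", ("attr", "doctor"),
                 ("or", ("attr", "cardiology"), ("attr", "emergency")))

print(satisfies(policy, {"doctor", "cardiology"}))  # True
print(satisfies(policy, {"nurse", "emergency"}))    # False
```

Changing who may read the data then means changing the policy tree, which is why frequent policy updates are a practical concern for big data.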

b) Homomorphic Encryption

Privacy breaches occur in public clouds because of multi-tenancy and virtualization [7]. Cloud users may share the same physical space, and in such a scenario the chance of data leakage is very high. One way to protect data on the cloud is to encrypt the data before storing them and allow the cloud to perform computations over the encrypted data. Encryption mainly ensures the confidentiality of data. Homomorphic encryption is a type of encryption that allows functions to be computed on encrypted data without knowing the private key, that is, without decryption; the client is the only holder of the secret key. When the result of an operation is decrypted, it is the same as if the calculation had been carried out on the raw data.
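
To see the property in action, here is a toy sketch of the Paillier cryptosystem, one well-known additively homomorphic scheme. The text above does not name a particular scheme, so this choice is an assumption, and the tiny hard-coded primes make the example completely insecure; it only demonstrates that a server can add two numbers it cannot read.

```python
import math
import random

def keygen(p=293, q=433):
    """Build a toy key pair from two tiny hard-coded primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu, n)

def encrypt(n, m):
    """Encrypt an integer m (0 <= m < n) using the generator g = n + 1."""
    n_sq = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be invertible modulo n
        r = random.randrange(1, n)
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(private_key, c):
    """Recover the plaintext from ciphertext c with the private key."""
    lam, mu, n = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))  # done without the private key
print(decrypt(private_key, c_sum))  # 42
```

Only the holder of `private_key` can read the result, which matches the description above: the cloud computes on ciphertexts and the client decrypts.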

5.2.3 Integrity Verification of big data storage in the cloud

When cloud computing is used for big data storage, the data owner loses control over the data. A cloud-based server may not be fully trusted, so the outsourced data are at risk. Data owners need strong assurance that the cloud is storing their data properly, in line with the service level agreement. One way to ensure privacy for cloud users is to provide a mechanism that lets data owners verify that their data stored on the cloud have not been damaged. Data integrity verification is therefore critically important.

Some integrity verification schemes are given below:

a) POR (Proofs of Retrievability)

With this technique, data stored in the cloud can be validated efficiently, and some degree of data recovery is possible. Its main feature is that POR guarantees correct data possession.

b) PDP (Provable Data Possession)

Ateniese et al. proposed the Provable Data Possession (PDP) protocol in [10]. By adding a forward error correction code, the PDP protocol can be transformed into a POR protocol. By combining a sampling strategy with RSA-based homomorphic authenticators, the PDP scheme lets a verifier check the data without downloading it. PDP works well with static data.

c) Public Auditing

In public auditing, a third party performs the audit. BLS (Boneh-Lynn-Shacham) signatures are used to generate the authentication values. These schemes have been proved to be secure.
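
The sampling idea shared by these schemes can be sketched with ordinary keyed hashes. This is a deliberately simplified spot check, not PDP, POR or BLS-based auditing: here the sampled blocks themselves are inspected, whereas the real schemes use homomorphic authenticators so that a short proof is enough. The block size, key and function names are assumptions made for this example.

```python
import hashlib
import hmac
import random

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow

def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def tag_block(key, index, block):
    """MAC over the block index and content, so blocks cannot be swapped around."""
    return hmac.new(key, index.to_bytes(8, "big") + block, hashlib.sha256).digest()

def audit(key, stored_blocks, stored_tags, sample_size, rng=random):
    """Check a random sample of stored blocks against their tags."""
    indices = rng.sample(range(len(stored_blocks)), sample_size)
    return all(
        hmac.compare_digest(tag_block(key, i, stored_blocks[i]), stored_tags[i])
        for i in indices
    )

key = b"owner-secret-key"  # kept by the data owner, never given to the cloud
blocks = split_blocks(b"big data privacy")
tags = [tag_block(key, i, b) for i, b in enumerate(blocks)]

print(audit(key, blocks, tags, sample_size=2))  # True while the data are intact
```

Checking only a sample keeps each audit cheap; repeating audits over time raises the chance that any damaged block is eventually caught.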

5.3 Big data Privacy and Security in Data Analyzing Phase

Because data mining algorithms are used in this phase, data mining based attacks may occur and lead to a security breach. Every stage of the analysis must be protected against such attacks, and only valid users should have access in this phase. Suggested defences are to partition the data sets (vertically and horizontally), use access control, use attribute encryption, follow correct analysis procedures, and document, audit and review the process.
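
Vertical and horizontal partitioning can be sketched as follows. Vertical partitioning splits the columns, so that sensitive fields live in a separate store linked only by an identifier; horizontal partitioning splits the rows, for example by region. The records, field names and helper functions are hypothetical.

```python
def vertical_partition(records, id_field, sensitive_fields):
    """Split columns: general fields in one list, sensitive fields in another."""
    general, sensitive = [], []
    for r in records:
        general.append({k: v for k, v in r.items() if k not in sensitive_fields})
        sensitive.append(
            {k: v for k, v in r.items() if k == id_field or k in sensitive_fields}
        )
    return general, sensitive

def horizontal_partition(records, key_field):
    """Split rows into groups that share the same value of key_field."""
    parts = {}
    for r in records:
        parts.setdefault(r[key_field], []).append(r)
    return parts

# Hypothetical patient records.
records = [
    {"id": 1, "region": "EU", "age": 30, "diagnosis": "flu"},
    {"id": 2, "region": "US", "age": 41, "diagnosis": "asthma"},
    {"id": 3, "region": "EU", "age": 25, "diagnosis": "flu"},
]

general, sensitive = vertical_partition(records, "id", {"diagnosis"})
by_region = horizontal_partition(records, "region")
print(general[0])    # {'id': 1, 'region': 'EU', 'age': 30}
print(sensitive[0])  # {'id': 1, 'diagnosis': 'flu'}
print(len(by_region["EU"]))  # 2
```

An analyst who is only given `general`, or only one region's rows, sees less than the full data set, which limits what a data mining based attack can learn.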

5.4 Big data Privacy and Security in Knowledge Creation Phase

Sometimes the decision finally obtained from knowledge creation may not be accurate, and implementing it may lead to a security breach. Decisions should therefore be implemented only after a proper brainstorming process.

6. Facebook Case Study in Short

Facebook is the world’s largest social networking website, and most of us use it to share posts, comments, photographs and other interesting content from our everyday lives with family, relatives and friends. Nowadays we hear that Facebook is no longer secure, since we are also sharing this content, directly or indirectly, with its advertisers. Recently, Facebook has caused a stir among those interested in online privacy, security and data protection.

Various political organizations have attempted to influence public opinion using information obtained via a data breach. Political events for which politicians paid Cambridge Analytica to use information from the data breach include the following [8]:

  • The 2015 and 2016 campaigns of US politicians Ted Cruz and Donald Trump
  • The 2018 general election in Mexico, for the Institutional Revolutionary Party
  • The 2016 British exit (Brexit) vote

Facebook determines user behaviour through facial recognition, tracking cookies, tag suggestions, analysis of likes and so on, so we must think twice before giving any kind of information to any big company that looks after our private data.

7. Conclusion and Future Work

In this report, we conclude that security and privacy-related tasks are challenging for big data. We summarize the recommendations of this report as follows: network traffic in big data systems should be encrypted with suitable measures and standards; employees should be checked and authorized before accessing systems; multi-layer authentication should be used; access to devices should be checked; analysis should be performed on anonymised data; data ownership should be respected; secure channels should be used for communication to prevent data leakage; and the network should be monitored for viruses, threats and breaches. In the near future, big data privacy, security and safety could become big issues, so existing technology should be improved or new techniques should be adopted so that accurate information can be obtained. It is hoped that this study helps in understanding big data, its privacy and security, and its ecosystem in a better way, and helps in developing better systems and solutions not only for today but also for upcoming generations.

8. References

[1] Porambage P, et al. The quest for privacy in the internet of things. IEEE Cloud Comp. 2016;3(2):36–45.

[2] Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information security in big data: privacy and data mining. IEEE Access. 2014;2:1149–76.

[3] A. Machanavajjhala, J. Gehrke, D. Kifer, et al., “ℓ-diversity: Privacy beyond k-anonymity”, in Proc. of ICDE, Apr. 2006.

[4] Mohammed, A., AlSudiari, T., & Vasista, T. G. K. 2012. Cloud Computing and Privacy Regulations: An Exploratory Study on Issues and Implications. Advanced Computing: An International Journal (ACIJ), 3(2), 159-169.

[5] Pearson, S. 2012. Privacy, Security and Trust in Cloud Computing. Privacy and Security for Cloud Computing, 3-42.

[6] H. Cheng, C. Rong, K. Hwang, W. Wang, and Y. Li, “Secure big data storage and sharing scheme for cloud tenants,” China Communications, vol. 12, no. 6, pp. 106–115, Jun. 2015.

[7] Nivedita W. Wasankar, A. V. Deorankar, “A study paper on Homomorphic encryption in cloud computing”.





[9] Alyson L. Young, Anabel Quan-Haase, Information Revelation and Internet Privacy Concerns on Social Network Sites: A Case Study of Facebook

[10] G. Ateniese, R. C. Burns, R. Curtmola, et al. Provable data possession at untrusted stores. Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, 2007, 598-609.