
Data Profiling

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

14.1 Application Contexts for Data Profiling

A retrospective look at the wave of consolidations among vendors in the data quality and data management industry shows one specific similarity across the board – the transition to building an end-to-end data management suite is incomplete without the incorporation of a data profiling product. The reason for this is that many good data management practices are based on a clear understanding of “content,” ranging from specific data values and the characteristics of the data elements holding those values to relationships between data elements across records in one table and associations across multiple tables.

It is worth reviewing some basic application contexts in which profiling plays a part, and that will ultimately help to demonstrate how a collection of relatively straightforward analytic techniques can be combined to shed light on the fundamental perspective of information utility for multiple purposes. We will then provide greater detail in subsequent sections for each of these applications.

14.1.1 Data Reverse Engineering

The absence of documented knowledge about a data set [which drives the need for anomaly analysis] accounts for the need for a higher-level understanding of the definitions, reference data, and structure of the data set – its metadata. Data reverse engineering is used to review the structure of a data set for which there is little or no existing metadata or for which the existing metadata are suspect, for the purpose of discovering and documenting the actual current state of its metadata.

In this situation, data profiling is employed to incrementally build up a knowledge base associated with data element structure and use. Column values are analyzed to determine whether there are commonly used value domains and whether those domains map to known conceptual value domains, to review the size and types of each data element, to identify any embedded pattern structures associated with any data element, and to identify keys and how those keys are used to refer to other data entities.
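To make this concrete, the following is a minimal Python sketch [not from Loshin's text] of the kind of column-level analysis described above; the sample columns, the pattern encoding, and the threshold used to flag a likely value domain are illustrative assumptions.

import re
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect basic reverse-engineering statistics for one column."""
    non_null = series.dropna().astype(str)
    # Frequency of distinct values hints at an underlying value domain.
    distinct = non_null.nunique()
    # Rough pattern signature: digits -> 9, letters -> A, everything else kept.
    patterns = non_null.map(lambda v: re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)))
    return {
        "inferred_type": str(series.dtype),
        "null_pct": 100.0 * series.isna().mean(),
        "distinct_values": distinct,
        "looks_like_domain": distinct <= 20,   # small domains suggest reference data
        "top_patterns": patterns.value_counts().head(3).to_dict(),
        "max_length": int(non_null.str.len().max()) if len(non_null) else 0,
    }

# Usage with a hypothetical data set:
df = pd.DataFrame({"state": ["NY", "CA", "NY", None], "zip": ["10001", "94105", "1000A", "60601"]})
for col in df.columns:
    print(col, profile_column(df[col]))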

The results of this reverse engineering process can be used to populate a metadata repository. The discovered metadata can be used to facilitate dependent development activities, such as business process renovation, enterprise data architecture, or data migrations.

14.1.2 Anomaly Analysis

One might presume that when operating in a well-controlled data management framework, data analysts will have some understanding of what types of issues and errors exist within various data sets. But even in these types of environments there is often little visibility into data peculiarities in relation to existing data dependencies, let alone the situation in which data sets are reused for alternate and new purposes.

So to get a handle on data set usability, there must be a process to establish a baseline measure of the quality of the data set, even distinct from specific downstream application uses. Anomaly analysis is a process for empirically analyzing the values in a data set to look for unexpected behavior to provide that initial baseline review. Essentially, anomaly analysis:

Executes a statistical review of the data values stored in all the data elements in the data set,

Examines value frequency distributions,

Examines the variance of values,

Logs the percentage of data attributes populated,

Explores relationships between columns, and

Explores relationships across data sets

to reveal potentially flawed data values, data elements, or records. Discovered flaws are typically documented and can be brought to the attention of the business clients to determine whether each flaw has any critical business impact.
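As a rough, hypothetical illustration of such a baseline review, the Python sketch below computes population percentages, value frequency distributions, and a simple variance check over a pandas DataFrame; the completeness and rarity thresholds are arbitrary assumptions rather than values prescribed by the chapter.

import pandas as pd

def baseline_anomaly_review(df: pd.DataFrame, rare_pct: float = 0.5) -> list:
    """Empirical first-pass review: flag sparsely populated columns,
    rare values, and numeric values far from the column mean."""
    findings = []
    for col in df.columns:
        populated_pct = 100.0 * df[col].notna().mean()
        if populated_pct < 95.0:                 # arbitrary completeness expectation
            findings.append((col, f"only {populated_pct:.1f}% populated"))
        freq = df[col].value_counts(normalize=True) * 100.0
        for value, pct in freq.items():
            if pct < rare_pct:                   # very rare values may be data entry errors
                findings.append((col, f"rare value {value!r} ({pct:.2f}%)"))
        if pd.api.types.is_numeric_dtype(df[col]):
            z = (df[col] - df[col].mean()) / df[col].std()
            outliers = df.index[z.abs() > 3].tolist()
            if outliers:
                findings.append((col, f"{len(outliers)} values beyond 3 standard deviations"))
    return findings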

14.1.3 Data Quality Rule Discovery

The need to observe dependencies within a data set manifests itself through the emergence [either by design or organically through use] of data quality rules. In many situations, though, there is no documentation of the rules for a number of reasons.

As one example, the rules are deeply embedded in application code and have never been explicitly associated with the data. As another example, the system may inadvertently have constrained the user from being able to complete a task, and user behavior has evolved to observe unwritten rules that enable the task to be performed.

Data profiling can be used to examine a data set to identify and extract embedded business rules, whether they are intentional but undocumented, or purely unintentional. These rules can be combined with predefined data quality expectations, as described in chapter 9, and used as the targets for data quality auditing and monitoring.
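One simple form of rule discovery is scanning for candidate dependencies between columns. The sketch below is an illustrative assumption rather than the book's algorithm: it looks for column pairs where one column's value determines the other's in nearly all records, producing candidates that would then be reviewed before being adopted as data quality rules.

from itertools import permutations
import pandas as pd

def candidate_dependencies(df: pd.DataFrame, tolerance: float = 0.99) -> list:
    """Return (determinant, dependent, support) triples where each determinant
    value maps to a single dependent value in at least `tolerance` of groups."""
    candidates = []
    for a, b in permutations(df.columns, 2):
        grouped = df.dropna(subset=[a, b]).groupby(a)[b]
        if grouped.ngroups == 0:
            continue
        # Share of determinant values that map to exactly one dependent value.
        support = (grouped.nunique() == 1).mean()
        if support >= tolerance:
            candidates.append((a, b, round(float(support), 3)))
    return candidates

# Usage: rules such as "zip determines state" would surface here for review.
df = pd.DataFrame({"zip": ["10001", "10001", "94105"], "state": ["NY", "NY", "CA"]})
print(candidate_dependencies(df))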

14.1.4 Metadata Compliance and Data Model Integrity

The results of profiling can also be used as a way of determining the degree to which the data actually observes any already existent metadata, ranging from data element specifications to validity rules associated with table consistency [such as uniqueness of a primary key], as well as demonstrating that referential integrity constraints are enforced properly in the data set.
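A small sketch of how such compliance checks might be automated is shown below; the customer/order tables and column names are hypothetical, introduced only to illustrate primary key uniqueness and referential integrity validation.

import pandas as pd

def check_primary_key(df: pd.DataFrame, key_cols: list) -> bool:
    """True when the declared key columns uniquely identify every row."""
    return not df.duplicated(subset=key_cols).any()

def check_referential_integrity(child: pd.DataFrame, fk: str,
                                parent: pd.DataFrame, pk: str) -> pd.Series:
    """Return the child foreign-key values that have no matching parent key."""
    orphans = ~child[fk].isin(parent[pk])
    return child.loc[orphans, fk]

# Usage with hypothetical customer/order tables:
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 9]})
print(check_primary_key(customers, ["customer_id"]))                                  # True
print(check_referential_integrity(orders, "customer_id", customers, "customer_id"))   # 9 is orphaned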


URL: //www.sciencedirect.com/science/article/pii/B9780123737175000142

Anomaly detection

Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing over IoT Data Streams, 2022

3.1 Introduction to anomaly detection

Data points that are inconsistent with the major data distribution are called anomalies [2]. Originally, the problem of anomaly analysis was tackled by the statistics community, and fundamental results have been published in the literature in the field [60,31,53].

According to Barnett and Lewis' definition [60], an anomaly is “an observation [or subset of observations], which appears to be inconsistent with the remainder of that set of data”. Hawkins [31] defined an anomaly as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Anomalies are also referred to as rare events, abnormalities, deviants, or outliers.

More recently, anomaly detection has received significant attention from many other research communities, such as machine learning and data mining, networking, health, security, fraud detection, etc., due to the insights that rare events can provide about the phenomenon under study. For example, outlier detection techniques can be used to monitor a wireless sensor network to identify faulty sensors or interesting behavior patterns. The availability of data used for the anomaly detection task depends on the properties of the data set. In static datasets, anomaly detection can be conducted over the whole data set, in which all observations are available.

In continuous data stream scenarios, not all observations are available at any given moment; they arrive sequentially. The observations in data streams can be seen only once, and anomalies should be detected in real time. In environments where the distribution changes over time [nonstationary], traditional detection methods cannot be applied directly because the underlying models change. Therefore, adaptive models need to be considered to deal with dynamically changing characteristics and to detect anomalies in the evolving time series.
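As a simplified sketch of such an adaptive model, the code below processes each observation exactly once and maintains an exponentially weighted estimate of the mean and variance, so the notion of "normal" drifts with a nonstationary stream; the smoothing factor, threshold, and warm-up length are illustrative assumptions, not values from the chapter.

import math

class AdaptiveDetector:
    """Single-pass detector: each observation is seen once and the
    baseline adapts to nonstationary behaviour via exponential smoothing."""

    def __init__(self, alpha: float = 0.05, threshold: float = 3.0, warmup: int = 5):
        self.alpha = alpha          # how quickly the baseline forgets the past
        self.threshold = threshold  # flag points this many std deviations away
        self.warmup = warmup        # observations seen before flagging starts
        self.n = 0
        self.mean = 0.0
        self.var = 0.0

    def update(self, x: float) -> bool:
        self.n += 1
        if self.n == 1:             # first observation initialises the baseline
            self.mean = x
            return False
        deviation = x - self.mean
        is_anomaly = (self.n > self.warmup and self.var > 0
                      and abs(deviation) > self.threshold * math.sqrt(self.var))
        # Update the running mean/variance (EWMA) so the model tracks drift.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly

# Usage: feed the stream one value at a time.
detector = AdaptiveDetector()
for value in [10, 11, 10, 12, 11, 45, 11, 10]:
    if detector.update(value):
        print("anomaly:", value)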

In recent years, several algorithms have been proposed to detect anomalies in data sets or data streams. Some of them consider scenarios where a sequence with one or more characteristics unfolds over time, while others focus on more complex scenarios in which streaming elements with one or more characteristics have causal or noncausal relationships with each other.

Extensive research on the detection of outliers in static data related to various application scenarios can be found in [4,8,13,14,30].


URL: //www.sciencedirect.com/science/article/pii/B9780128238189000134

Bringing It All Together

David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011

20.3.1 Data Profiling

Data profiling incorporates a collection of analysis and assessment algorithms that provide empirical insight about potential data issues, and it has become a ubiquitous set of tools employed for data quality processes supporting numerous information management programs, including assessment, validation, metadata management, data integration processing, migrations, and modernization projects. Chapter 14 discussed the analyses and algorithms that profiling tools employ and how those analyses provide value in a number of application contexts. Profiling plays a part in a collection of relatively straightforward analytic techniques that, when combined, shed light on the fundamental perspective of information utility for multiple purposes, such as:

Data reverse engineering: Data reverse engineering is used to review the structure of a data set for which there is little or no existing metadata, or for which the existing metadata are suspect, for the purpose of discovering and documenting the actual current state of its metadata. Data profiling is used to grow a knowledge base associated with data element structure and use. Column values are analyzed to determine whether there are commonly used value domains, to reveal whether those domains map to known conceptual value domains, to review the size and types of each data element, to identify any embedded pattern structures associated with any data element, and to identify keys and how those keys are used to refer to other data entities. The metadata discovered as a result of this reverse engineering process can be used to facilitate dependent development activities such as business process renovation, enterprise data architecture, or data migrations.

Anomaly analysis: There is often little visibility into data peculiarities in relation to existing data dependencies, especially when data sets are reused. Profiling is used to establish baseline measures of data set quality, even distinct from specific downstream application uses. Anomaly analysis is a process for empirically analyzing the values in a data set to look for unexpected behavior to provide that initial baseline review and is used to reveal potentially flawed data values, data elements, or records. Discovered flaws are typically documented and can be brought to the attention of the business clients to determine whether each flaw has any critical business impact.

Data quality rule discovery: The need to observe dependencies within a data set manifests itself through the emergence [either by design or organically through use] of data quality rules. In many situations, though, there is no documentation of the rules for a number of reasons. Data profiling can be used to examine a data set to identify and extract embedded business rules, whether they are intentional but undocumented, or purely unintentional. These rules can be combined with predefined data quality expectations, as described in chapter 9, and used as the targets for data quality auditing and monitoring.

Metadata compliance and data model integrity: The results of profiling can also be used as a way of determining the degree to which the data actually observes any already existent metadata, ranging from data element specifications to validity rules associated with table consistency [such as uniqueness of a primary key]. The results can also be used to demonstrate that referential integrity constraints are enforced properly in the data set.

Review chapter 14 to see how data profiling tools are engineered and how they support aspects of the data quality program. When evaluating data profiling products, it is valuable to first assess the business needs for a data profiling tool [as a by-product of the data requirements analysis process and determination of remediation as described in chapters 9 and 12]. In general, when evaluating data profiling tools, consider these capabilities discussed in chapter 14:

Column profiling

Cross-column [dependency]

Cross-table [redundancy]

Structure analysis

Business rules discovery

Business rules management

Metadata management

Historical tracking

Proactive auditing

Business rule importing

Business rule exporting

Metadata importing

Metadata exporting


URL: //www.sciencedirect.com/science/article/pii/B9780123737175000208

Data Quality and MDM

David Loshin, in Master Data Management, 2009

5.4 Employing Data Quality and Data Integration Tools

Data quality and data integration tools have evolved from simple standardization and pattern matching into suites of tools for complex automation of data analysis, standardization, matching, and aggregation. For example, data profiling has matured from a simplistic distribution analysis into a suite of complex automated analysis techniques that can be used to identify, isolate, monitor, audit, and help address anomalies that degrade the value of an enterprise information asset. Early uses of data profiling for anomaly analysis have been superseded by more complex uses that are integrated into proactive information quality processes. When coupled with other data quality technologies, these processes provide a wide range of functional capabilities. In fact, there is a growing trend to employ data profiling for identification of master data objects in their various instantiations across the enterprise.

A core expectation of the MDM program is the ability to consolidate multiple data sets representing a master data object [such as “customer”] and to resolve variant representations into a conceptual “best representation” whose presentation is promoted as representing a master version for all participating applications. This capability relies on consulting metadata and data standards that have been discovered through the data profiling and discovery process to parse, standardize, match, and resolve the surviving data values from identified replicated records. More relevant is that the tools and techniques used to identify duplicate data and to identify data anomalies are exactly the same ones used to facilitate an effective strategy for resolving those anomalies within an MDM framework. The fact that these capabilities are available from traditional data cleansing vendors is indicated by the numerous consolidations, acquisitions, and partnerships between data integration vendors and data quality tools vendors, but this underscores the conventional wisdom that data quality tools are required for a successful MDM implementation.
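The sketch below illustrates, in a very reduced form, the parse/standardize/match/survive pattern described here: records are normalized, grouped by a crude matching key, and resolved to the most complete record. The field names, normalization rules, and survivorship rule are assumptions for illustration only and do not reflect the behaviour of any particular MDM or data quality product.

import re
from collections import defaultdict

def standardize(record: dict) -> dict:
    """Very small standardization step: case, whitespace, common abbreviations."""
    name = re.sub(r"\s+", " ", record["name"].strip().upper())
    name = name.replace(" ST ", " STREET ").replace(" INC.", " INC")
    return {**record, "name": name}

def match_key(record: dict) -> tuple:
    """Blocking/matching key: a crude stand-in for real parsing and matching logic."""
    return (record["name"].split(" ")[0], record.get("postcode", ""))

def consolidate(records: list) -> list:
    """Group candidate duplicates and keep the most complete record as survivor."""
    groups = defaultdict(list)
    for rec in map(standardize, records):
        groups[match_key(rec)].append(rec)
    survivors = []
    for group in groups.values():
        survivors.append(max(group, key=lambda r: sum(1 for v in r.values() if v)))
    return survivors

# Usage with two variant representations of the same customer:
records = [
    {"name": "Acme Inc.", "postcode": "10001", "phone": ""},
    {"name": "ACME  INC", "postcode": "10001", "phone": "212-555-0100"},
]
print(consolidate(records))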

Most important is the ability to transparently aggregate data in preparation for presenting a uniquely identifiable representation via a central authority and to provide access for applications to interact with the central authority. In the absence of a standardized integration strategy [and its accompanying tools], the attempt to transition to an MDM environment would be stymied by the need to modernize all existing production applications. Data integration products have evolved to the point where they can adapt to practically any data representation framework and can provide the means for transforming existing data into a form that can be materialized, presented, and manipulated via a master data system.


URL: //www.sciencedirect.com/science/article/pii/B9780123742254000059

Proactive Security and Reputational Ranking

Eric Cole, in Advanced Persistent Threat, 2013

Advanced

The adversary understands security and knows how to defeat or get around most traditional security measures. Not to depress anyone, but if an organization has security devices that were purchased more than 3 years ago and have a standard/default configuration, in most cases they will be ineffective against the APT. The reason is that these devices were not built or configured to deal with this level of threat; they were meant to deal with traditional worms and viruses that had distinct signatures that could be tracked and detected on a network. The good news is that it is only the standard configuration of these older devices that was set to look for the standard threat: with proper tuning, many of these devices, in coordination with other devices, can be used as part of the solution. The second important point is that security devices that are focused on data, data flow, and anomaly analysis are much more effective at dealing with the APT than anything that is based on signatures or specific instances of an attack.

It is important, when we talk about the advanced nature of the attack, to remember that the attacker still has to follow the general principle of finding a weakness and exploiting it. Yes, the APT is good, but it is not superhuman, nor is it something that, no matter what you do, will still break in and go undetected. The mistake that is often made is that many organizations give the APT more credit than it deserves and say that no matter what you do it is unstoppable and undetectable. That statement is just not true. An attack has to take advantage of a weakness in a system and make modifications to that system once it is compromised. Those two factors mean that the APT can be prevented and detected, if we know which vulnerabilities to close down and where to look. The correct phrase to use when talking about the APT is that it is unstoppable and undetectable using traditional security and normal remediation methods. However, if we change how we approach security, in much the same way that the APT changed how systems are compromised, organizations do have a chance of properly dealing with the threat. It is also important to remember that, because of its persistent nature, an organization might not be able to prevent all attacks; some will break in, which is why detection is so important.

Now the main problem with the APT is that it takes advantage of vulnerabilities in things the organization needs in order to function. This makes it very difficult to track and close down the attack vector. With traditional attacks, the vulnerability that was exploited was an extraneous service or an unpatched system, which meant that if an organization focused on the correct area it could remediate the threat with minimal impact to the enterprise. Today, the APT is taking advantage of email attachments and employees within the organization. An organization could only remediate this 100% by shutting down email and firing all employees. While that might be tempting, it is not practical. Therefore, organizations have to be very creative in how they deal with the threat.


URL: //www.sciencedirect.com/science/article/pii/B9781597499491000103

Focusing in on the Right Security

Eric Cole, in Advanced Persistent Threat, 2013

What is the Problem That is Being Solved?

Growing up, one of the jokes people would make is that every Miss America candidate, when asked what her dream goals are, would state “to solve world hunger” or “to bring peace to the world.” These are high-level goals, but in order to achieve them you have to identify the problem you are trying to solve. Every great journey begins with a first step, and an organization cannot identify that first step if it does not know what problem it is trying to address. We have worked with many clients, and when asked what they are trying to accomplish they would state “to be 100% secure” or “to never be compromised by the APT.” While these are notable goals, they are as nebulous as solving world hunger. The million dollar question is: where do you start? What are the problems and associated project plans that need to be developed to ultimately maintain forward progress toward the goal? Goals are hard to achieve by themselves, but problems can be solved with the right focus.

Every organization should always perform an assessment within their environment to identify the main problems that need to be solved in order to do the right thing in terms of defending against the APT. The good news is the high-risk problems for many organizations typically are similar. In shifting from doing good to doing the right things, the following are typical problems that organizations are ignoring but need to solve to make forward progress in dealing with the APT:

The ability to detect compromised systems—One of the biggest problems today is that organizations put most of their focus and attention on inbound prevention. Trying to stop attacks is important and should continue. Even though the APT is very stealthy, if an organization can prevent 20% of initial attacks, that is still better than nothing and means 20% fewer attacks that have to be detected after the fact. However, the APT is persistent and will continue to target an organization until it accomplishes its goal. The bottom line is that regardless of what portion of attacks can be prevented on inbound traffic, it is not going to be 100%. Organizations must examine outbound traffic looking for signs of a compromised system. One of the million dollar questions is: if an organization had a compromised system, would it be able to detect it, and how long would it take? In some cases organizations can take 6–8 months to detect a compromised system. Think about how much information is leaving the organization every day. Even if a compromise were detected within 3 months instead of 6, that would still mean a lot less damage and exposure to the organization. Any skill, including security, takes time. Ideally you want to be able to detect attacks as soon as they occur and contain/control them immediately or within a short period of time, like 12 h; however, the most basic question is, regardless of time, would your organization have any chance at all of detecting a compromised system? The immediate response is “of course,” but step back for a second. If a system on your network were compromised today and leaking information, how would you know? It is critical in combatting the APT that organizations solve the problem of being able to detect compromised systems. Once an organization has that capability, it needs to continually decrease the amount of time it would take to detect the attack. Every minute a compromised system goes undetected increases the damage and exposure to the organization.

Being able to identify anomalies from known baselines—Anomaly detection is critical to dealing with and detecting the APT, but the fundamental question is: how can something be an anomaly if you do not know what is normal? An organization must track usage, network patterns, connectivity, and bandwidth to understand and build a profile of what is normal and expected to be seen in the organization. Depending on the type of attack it might be very visible or it might be subtle, but if someone compromises a system they are going to act differently than a normal user. If their behavior is exactly the same, then either they are not an attacker or your normal users are attackers. Organizations need to figure out which activities to build baselines against, but the ones that work well against the APT are: [1] length of the connection; [2] amount of outbound data; [3] external IPs being connected to. In almost all cases that we have seen, the APT creates an obvious anomaly across all three areas that is quite different from normal traffic. The powerful component of anomaly analysis is the correlation across multiple sources [a toy sketch combining these three metrics appears after this list]. While one item might give a little insight into something being an anomaly, the real value is when the results are compared across 3–5 different variables. When this is done it becomes quite obvious that something is an anomaly. Creating baselines is quite easy if you have the traffic, sniffer output, or logs, but the base analysis has to be done for it to be of value. The big problem for many organizations is that they have the data but it has never been normalized. It is like having an oil deposit under your house that is worth millions: because you never checked, you never realized it was there and therefore were never able to take advantage of its value.

Properly segmented networks—Flat networks make it easy for users to access data, but they also make it extremely easy for attackers to access that information. What is ironic is that properly segmented networks can be configured to remain very easy for users to access data while being very difficult for attackers to use to cause harm. The question for your organization is whether you want option [1], easy for the user and easy for the attacker, or option [2], easy for the user and hard for the attacker. Life is full of adventures, so when it comes to protecting information, let us stay away from excitement and pick option 2. The big problem with many organizations is that they cannot properly control who can access which systems, so once a client system is compromised, it is very easy for the attacker to gain full access to any information they want. Gateways and control points have been used for many years in the physical security realm to protect valuable possessions, and the same concepts can be used with electronic information to protect critical data.

Better correlation to prevent/detect attacks—The APT is like a puzzle: in order to solve it, you need to have all of the pieces. With a large puzzle, if you only have one piece you really have no idea what the picture is; the more pieces you have, the clearer the picture. A puzzle piece is an entry in a log file on a single device. If you just look at the logs or information on a single computer, you only have a small number of pieces; the more devices you gather and correlate information from, the clearer the picture and the more useful the data. With the APT, correlation is king. In many cases better detection can lead to improved prevention. Once an organization is able to detect an attack and see what it missed, that information can be used to build better defensive measures in the future. Proper detection should lead to improved metrics and better information for enhanced prevention of similar future attacks. The critical piece is to correlate the information and look for general patterns that can be blocked, not specific signatures. Since the attacker is always changing, signatures will provide minimal protection in dealing with the APT. While looking at specific log entries is good for detailed analysis, most of the energy and effort in dealing with the APT should be focused on high-level correlation, tied with anomaly analysis.

Proper incident response to prevent reinfection—When an organization finds out that it has been compromised for 6–8 months and the compromise went undetected, the response is not usually calm and peaceful; it is usually people freaking out. The typical response is to make the problem go away as fast as possible. Most organizations forget about the six-step process for handling an incident, skip steps, and put all of their focus on recovery: get the systems back up and running as quickly as possible. Since people are under stress they often forget the obvious. If the attacker compromised a system once, there is a very good chance they will compromise it a second time. While catching an attacker is important, reinfection is deadly. If you eventually caught the attacker the first time they broke in, they are not happy. They will break back in, but they will work even harder to be stealthy and not get caught. If it took an organization 6–8 months to detect the attacker the first time, and now the attacker is really trying to be stealthy, how long do you think it will take the organization to catch them the second time? When it comes to a compromise, do it right the first time; there are no second chances. Many organizations that have not been compromised take a logical look at the problem and say that the longer a system is down, the more money it is going to cost the organization; therefore, the quicker we can recover, the better off the organization is. The problem is that it is better to be down once, for a longer period of time, and fix the problem, than to recover quickly but become reinfected and suffer additional data loss. However, it is important to point out that there are some systems for which availability is critical. In these cases systems might need to be brought up before they are fully remediated, but this should be a business decision, and all systems should be carefully monitored and controlled in these circumstances. The important lesson with an incident, but especially with a devastating attack like the APT, is do it right the first time and fix the problem. There are no second chances.
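As a toy illustration of the baseline-and-correlate idea raised in the list above [not taken from the book], the sketch below compares a host's connection duration, outbound volume, and count of distinct external IPs against a learned baseline and alerts only when several metrics deviate together; the metric names, sample values, and thresholds are assumptions.

import statistics

def build_baseline(history: dict) -> dict:
    """history maps metric name -> list of past per-host daily values."""
    return {metric: (statistics.mean(values), statistics.stdev(values))
            for metric, values in history.items()}

def anomaly_score(observed: dict, baseline: dict, z_cutoff: float = 3.0) -> int:
    """Count how many metrics deviate strongly from their baseline."""
    flags = 0
    for metric, value in observed.items():
        mean, stdev = baseline[metric]
        if stdev > 0 and abs(value - mean) / stdev > z_cutoff:
            flags += 1
    return flags

# Baseline built from past observations of one host (hypothetical numbers).
history = {
    "connection_minutes": [3, 4, 2, 5, 3, 4],
    "outbound_mb": [20, 25, 18, 22, 24, 21],
    "distinct_external_ips": [5, 6, 4, 5, 7, 6],
}
baseline = build_baseline(history)

today = {"connection_minutes": 240, "outbound_mb": 900, "distinct_external_ips": 3}
# Correlating across metrics: require at least two strong deviations before alerting.
if anomaly_score(today, baseline) >= 2:
    print("host behaviour deviates from baseline on multiple metrics")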

Security is very challenging and it is always important to make sure the problem you are fixing is the highest priority problem.


URL: //www.sciencedirect.com/science/article/pii/B9781597499491000115

Data Mining Trends and Research Frontiers

Jiawei Han, ... Jian Pei, in Data Mining [Third Edition], 2012

13.1.3 Mining Other Kinds of Data

In addition to sequences and graphs, there are many other kinds of semi-structured or unstructured data, such as spatiotemporal, multimedia, and hypertext data, which have interesting applications. Such data carry various kinds of semantics, are either stored in or dynamically streamed through a system, and call for specialized data mining methodologies. Thus, mining multiple kinds of data, including spatial data, spatiotemporal data, cyber-physical system data, multimedia data, text data, web data, and data streams, are increasingly important tasks in data mining. In this subsection, we overview the methodologies for mining these kinds of data.

Mining Spatial Data

Spatial data mining discovers patterns and knowledge from spatial data. Spatial data, in many cases, refer to geospace-related data stored in geospatial data repositories. The data can be in “vector” or “raster” formats, or in the form of imagery and geo-referenced multimedia. Recently, large geographic data warehouses have been constructed by integrating thematic and geographically referenced data from multiple sources. From these, we can construct spatial data cubes that contain spatial dimensions and measures, and support spatial OLAP for multidimensional spatial data analysis. Spatial data mining can be performed on spatial data warehouses, spatial databases, and other geospatial data repositories. Popular topics on geographic knowledge discovery and spatial data mining include mining spatial associations and co-location patterns, spatial clustering, spatial classification, spatial modeling, and spatial trend and outlier analysis.

Mining Spatiotemporal Data and Moving Objects

Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include discovering the evolutionary history of cities and lands, uncovering weather patterns, predicting earthquakes and hurricanes, and determining global warming trends. Spatiotemporal data mining has become increasingly important and has far-reaching implications, given the popularity of mobile phones, GPS devices, Internet-based map services, weather services, and digital Earth, as well as satellite, RFID, sensor, wireless, and video technologies.

Among many kinds of spatiotemporal data, moving-object data [i.e., data about moving objects] are especially important. For example, animal scientists attach telemetry equipment on wildlife to analyze ecological behavior, mobility managers embed GPS in cars to better monitor and guide vehicles, and meteorologists use weather satellites and radars to observe hurricanes. Massive-scale moving-object data are becoming rich, complex, and ubiquitous. Examples of moving-object data mining include mining movement patterns of multiple moving objects [i.e., the discovery of relationships among multiple moving objects such as moving clusters, leaders and followers, merge, convoy, swarm, and pincer, as well as other collective movement patterns]. Other examples of moving-object data mining include mining periodic patterns for one or a set of moving objects, and mining trajectory patterns, clusters, models, and outliers.

Mining Cyber-Physical System Data

A cyber-physical system [CPS] typically consists of a large number of interacting physical and information components. CPS systems may be interconnected so as to form large heterogeneous cyber-physical networks. Examples of cyber-physical networks include a patient care system that links a patient monitoring system with a network of patient/medical information and an emergency handling system; a transportation system that links a transportation monitoring network, consisting of many sensors and video cameras, with a traffic information and control system; and a battlefield commander system that links a sensor/reconnaissance network with a battlefield information analysis system. Clearly, cyber-physical systems and networks will be ubiquitous and form a critical component of modern information infrastructure.

Data generated in cyber-physical systems are dynamic, volatile, noisy, inconsistent, and interdependent, containing rich spatiotemporal information, and they are critically important for real-time decision making. In comparison with typical spatiotemporal data mining, mining cyber-physical data requires linking the current situation with a large information base, performing real-time calculations, and returning prompt responses. Research in the area includes rare-event detection and anomaly analysis in cyber-physical data streams, reliability and trustworthiness in cyber-physical data analysis, effective spatiotemporal data analysis in cyber-physical networks, and the integration of stream data mining with real-time automated control processes.

Mining Multimedia Data

Multimedia data mining is the discovery of interesting patterns from multimedia databases that store and manage large collections of multimedia objects, including image data, video data, audio data, as well as sequence data and hypertext data containing text, text markups, and linkages. Multimedia data mining is an interdisciplinary field that integrates image processing and understanding, computer vision, data mining, and pattern recognition. Issues in multimedia data mining include content-based retrieval and similarity search, and generalization and multidimensional analysis. Multimedia data cubes contain additional dimensions and measures for multimedia information. Other topics in multimedia mining include classification and prediction analysis, mining associations, and video and audio data mining [Section 13.2.3].

Mining Text Data

Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. A substantial portion of information is stored as text such as news articles, technical papers, books, digital libraries, email messages, blogs, and web pages. Hence, research in text mining has been very active. An important goal is to derive high-quality information from text. This is typically done through the discovery of patterns and trends by means such as statistical pattern learning, topic modeling, and statistical language modeling. Text mining usually requires structuring the input text [e.g., parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database]. This is followed by deriving patterns within the structured data, and evaluation and interpretation of the output. “High quality” in text mining usually refers to a combination of relevance, novelty, and interestingness.

Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling [i.e., learning relations between named entities]. Other examples include multilingual data mining, multidimensional text analysis, contextual text mining, and trust and evolution analysis in text data, as well as text mining applications in security, biomedical literature analysis, online media analysis, and analytical customer relationship management. Various kinds of text mining and analysis software and tools are available in academic institutions, open-source forums, and industry. Text mining often also uses WordNet, Semantic Web, Wikipedia, and other information sources to enhance the understanding and mining of text data.

Mining Web Data

The World Wide Web serves as a huge, widely distributed, global information center for news, advertisements, consumer information, financial management, education, government, and e-commerce. It contains a rich and dynamic collection of information about web page contents with hypertext structures and multimedia, hyperlink information, and access and usage information, providing fertile sources for data mining. Web mining is the application of data mining techniques to discover patterns, structures, and knowledge from the Web. According to analysis targets, web mining can be organized into three main areas: web content mining, web structure mining, and web usage mining.

Web content mining analyzes web content such as text, multimedia data, and structured data [within web pages or linked across web pages]. This is done to understand the content of web pages, provide scalable and informative keyword-based page indexing, entity/concept resolution, web page relevance and ranking, web page content summaries, and other valuable information related to web search and analysis. Web pages can reside either on the surface web or on the deep Web. The surface web is that portion of the Web that is indexed by typical search engines. The deep Web [or hidden Web] refers to web content that is not part of the surface web. Its contents are provided by underlying database engines.

Web content mining has been studied extensively by researchers, search engines, and other web service companies. Web content mining can build links across multiple web pages for individuals; therefore, it has the potential to inappropriately disclose personal information. Studies on privacy-preserving data mining address this concern through the development of techniques to protect personal privacy on the Web.

Web structure mining is the process of using graph and network mining theory and methods to analyze the nodes and connection structures on the Web. It extracts patterns from hyperlinks, where a hyperlink is a structural component that connects a web page to another location. It can also mine the document structure within a page [e.g., analyze the treelike structure of page structures to describe HTML or XML tag usage]. Both kinds of web structure mining help us understand web contents and may also help transform web contents into relatively structured data sets.

Web usage mining is the process of extracting useful information [e.g., user click streams] from server logs. It finds patterns related to general or particular groups of users; understands users' search patterns, trends, and associations; and predicts what users are looking for on the Internet. It helps improve search efficiency and effectiveness, as well as promotes products or related information to different groups of users at the right time. Web search companies routinely conduct web usage mining to improve their quality of service.
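As a small, hypothetical illustration of web usage mining, the sketch below parses a few Common Log Format lines from a server log and counts the most frequent page-to-page transitions per client; the log lines, the regular expression, and the use of the client IP as a stand-in for a user session are simplifying assumptions.

import re
from collections import Counter, defaultdict

LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) HTTP/[\d.]+" \d+ \d+')

log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:00:12 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:20 +0000] "GET /checkout HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:00:25 +0000] "GET /products HTTP/1.1" 200 1024',
]

# Group page requests into per-user click streams (ordered as they appear in the log).
clickstreams = defaultdict(list)
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:
        ip, page = match.groups()
        clickstreams[ip].append(page)

# Count page-to-page transitions, the raw material for navigation-pattern mining.
transitions = Counter()
for pages in clickstreams.values():
    transitions.update(zip(pages, pages[1:]))

print(transitions.most_common(3))   # e.g. the ('/home', '/products') transition appears for both users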

Mining Data Streams

Stream data refer to data that flow into a system in vast volumes, change dynamically, are possibly infinite, and contain multidimensional features. Such data cannot be stored in traditional database systems. Moreover, most systems may only be able to read the stream once in sequential order. This poses great challenges for the effective mining of stream data. Substantial research has led to progress in the development of efficient methods for mining data streams, in the areas of mining frequent and sequential patterns, multidimensional analysis [e.g., the construction of stream cubes], classification, clustering, outlier analysis, and the online detection of rare events in data streams. The general philosophy is to develop single-scan or a-few-scan algorithms using limited computing and storage capabilities.

This includes collecting information about stream data in sliding windows or tilted time windows [where the most recent data are registered at the finest granularity and the more distant data are registered at a coarser granularity], and exploring techniques like microclustering, limited aggregation, and approximation. Many applications of stream data mining can be explored—for example, real-time detection of anomalies in computer network traffic, botnets, text streams, video streams, power-grid flows, web searches, sensor networks, and cyber-physical systems.


URL: //www.sciencedirect.com/science/article/pii/B9780123814791000137

Flow-based intrusion detection: Techniques and challenges

Muhammad Fahad Umer, ... Yaxin Bi, in Computers & Security, 2017

5.1.3 Time-series statistical techniques

Time-series based statistical techniques use previously observed values to forecast new values. Sperotto et al. [2008] used time-series analysis for anomaly characterization in flow traffic. Nguyen et al. [2008] apply the Holt–Winters forecasting method to detect anomalies in flow traffic. They use four flow metrics: total bytes, total packets, the number of flows with a similar volume to the same destination socket, and the number of flows with a similar volume and the same source and destination address but different destination ports. These four metrics are used to detect three types of anomalies: flooding, TCP SYN, and port scan. The Holt–Winters method keeps track of the normal metric values and raises an anomaly flag if any value goes out of range. The technique is limited to only three anomalies and can be bypassed if the attacker keeps the flow metric values within range.

A high-speed flow-level intrusion detection system [HiFIND] is presented in Li et al. [2010]. The use of flow information for high-speed and DoS-resilient intrusion detection was initially proposed in Li et al. [2005] and Gao et al. [2006]. HiFIND uses a small set of packet header fields, including source/destination IP and source/destination ports. It focuses on three types of attacks: SYN flooding, horizontal scan, and vertical scan. The authors use Holt–Winters double exponential smoothing and EWMA with season indices for change detection in network traffic. HiFIND is applied in three phases, and false positives are reduced by separating intrusions from network anomalies caused by misconfiguration. The performance evaluation of HiFIND is carried out using both simulation and on-site deployment. A custom dataset of one-day traffic traces consisting of 900M flow records is used. The authors compared HiFIND with other statistical detection techniques for flow-based detection, and the results show that HiFIND has similar accuracy but is memory efficient in worst-case scenarios. HiFIND is one of the few models to address the security of the intrusion detection system itself [Sadre et al., 2012]. The use of Holt–Winters double exponential smoothing and EWMA with season indices can have the drawback of a statistical seasoning effect. The authors only used a 4-feature NetFlow record without including the protocol field; therefore, the system may not be able to detect attacks sent over UDP packets. Another limitation of the HiFIND system is its inability to detect small and slow-ramped attacks.
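The sketch below illustrates the general idea behind these forecasting-based detectors using simple double exponential smoothing [level plus trend] over a single flow metric; it is a simplified stand-in for the Holt–Winters and EWMA variants used in the systems above, and the smoothing constants, tolerance band, and sample counts are illustrative assumptions.

def detect_with_double_smoothing(series, alpha=0.3, beta=0.1, band=4.0):
    """Flag points whose one-step forecast error exceeds `band` x the typical error."""
    level, trend = series[0], series[1] - series[0]
    mae = max(abs(trend), 1.0)                         # initial error scale
    anomalies = []
    for t in range(2, len(series)):
        forecast = level + trend                       # one-step-ahead forecast
        error = series[t] - forecast
        if abs(error) > band * mae and t > 5:          # small warm-up period
            anomalies.append((t, series[t]))
            observed = forecast                        # keep the anomaly out of the baseline
        else:
            observed = series[t]
            mae = 0.9 * mae + 0.1 * abs(error)         # smoothed "normal" error scale
        # Holt's double exponential smoothing updates for level and trend.
        new_level = alpha * observed + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return anomalies

# Usage on a hypothetical per-minute SYN-packet count with one flood-like spike.
syn_counts = [100, 102, 101, 103, 105, 104, 106, 900, 107, 105]
print(detect_with_double_smoothing(syn_counts))        # [(7, 900)]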


URL: //www.sciencedirect.com/science/article/pii/S0167404817301165

Big data analytics for wireless and wired network design: A survey

Mohammed S. Hadi, ... Jaafar M.H. Elmirghani, in Computer Networks, 2018

7 Big data analytics in the industry

Throughout our survey, we came across several companies that offer network solutions based on big data analytics. These companies and solutions are highlighted in Table 3. It should be noted that these solutions are enabled by research conducted in their corresponding areas. We have added academic research papers related to each solution in Table 3.

Table 3. Big data analytics-powered industrial solutions.

[Columns: No. | Manufacturer | Solution name | Related academic papers | Usage, functions, and capabilities]

1. Juniper – NetReflex IP [27,73,108]: Eliminates network errors. Monitors QoS/QoE. Capacity planning, traffic routing, caching, and other optimizations.

Juniper – NetReflex MPLS: Segments and trends MPLS and VPN usage to plan for congestion. Identifies traffic utilization and trends to optimize operational cost. Ability to slice network performance according to VPN, Class of Service [CoS], and Provider Edge [PE]–PE, enabling more efficient planning.

2. Nokia – Traffica [69,109]: Real-time issue detection and network troubleshooting. Gains real-time, end-to-end insight on traffic, network, devices, and subscribers.

Nokia – Wireless Network Guardian [110]: Improves end-to-end network analytics and reporting with real-time subscriber-level information. Detects anomalies and reports airtime, signaling, and bandwidth resource consumption. Proactive detection of issues, including automatic detection of user anomalies and low QoE score alerts.

Nokia – Preventive Complaint Analysis [111]: Detects network elements’ behavior anomalies. Predicts where customer complaints might arise and prioritizes network optimization accordingly.

Nokia – Predictive Care [110,112]: Used for network elements; proved its effectiveness by helping Shanghai Mobile become more agile and responsive. Accuracy of the simplified alerts is around 98 percent, reducing operational workload.

3. HP [HPE] – Vertica [64,113]: Provides CDR analysis that can help Communication Service Providers [CSPs]. Examines dropped call records and other maintenance data to determine where to invest in infrastructure. Failure prediction and proactive maintenance.

4. Amdocs – Deep Network Analytics [114]: Combines RAN information with BSS and customer data to deploy the network proactively. Predictive maintenance.

5. Apervi – Apervi's Real-time Log Analytics Solution [ARLAS] [115–117]: Collects, aggregates, and stores log data in real time.

Due to the proprietary nature of industrial products, the exact algorithms or methods behind these products are not available in the open literature. Therefore, academic papers with related concepts are cited. NetReflex IP and NetReflex MPLS utilize big data analytics [27,73,108] to provide services like anomaly analysis and traffic analysis. Nokia provides several solutions targeting the wireless field. For example, Traffica presents itself as a real-time traffic monitoring tool that analyzes user behavior to gain network insights; similar approaches were presented in academia by the authors of [69,109]. The Wireless Network Guardian detects user anomalies in mobile networks; a comparable topic was discussed in [110]. Preventive Complaint Analysis makes use of big data analytics to detect behavioral anomalies in mobile network elements; the authors in [111] provided a similar approach. Predictive Care utilizes big data analytics to identify anomalies in network elements before they affect the user; a comparable academic approach is presented in [110,112]. HP presented Vertica, a solution that exploits CDRs for network planning, optimization, and fault prediction purposes.

The authors in [64,113] researched similar approaches. Amdocs' Deep Network Analytics provides predictive maintenance and proactive network deployment for cellular networks. The authors in [114] presented a similar approach. Log analytics can be used for a variety of purposes. Apervi's ARLAS solution provides real-time collection and storage of network logs. Related academic research was presented by the authors in [115–117].

Examining the above solutions, one can note that the majority of the solutions are in the wireless field. This, in fact, coincides with the orientation of the academically researched topics. Sampling through the offered solutions, we noticed increased interest in anomaly prediction and network node deployment, thus offering the customer a service that is as close to optimal as possible while minimizing network expansion expenditures.


URL: //www.sciencedirect.com/science/article/pii/S1389128618300239

Data quality challenges in large-scale cyber-physical systems: A systematic review

Ahmed Abdulhasan Alwan, ... Paolo Falcarin, in Information Systems, 2022

7 RQ2: Data mining and data quality management in large-scale CPSs

This section answers the second SLR review question [RQ2] listed in Table 2, based on the results of the SLR.

Data quality assessment in large-scale CPS applications using traditional methods is no longer efficient because of the heterogeneous, large volume of data that these systems typically exchange [57]. Such systems usually rely on numerous sensor nodes that stream large volumes of data in real time, which requires high-performance, scalable, and flexible tools to provide effective real-time data processing and analysis mechanisms [44,59,80,88]. Based on the results of the SLR data extraction process illustrated in Table A.10, many statistical, technical, and machine-learning models were proposed, tested, and evaluated, mostly for identifying data quality issues, decreasing their occurrence probability, and overcoming their impact on the system. Most of these proposed solutions, methods, or models were developed to enhance the reliability of a particular system by improving its data quality based on prior knowledge extracted from the data itself, a process known as data mining. Considering the SLR empirical studies only, it is possible to categorise all the adopted data quality assessment/management methods, techniques, or solutions into three primary groups:

Data mining.

Technical solutions/models.

Mathematical models.

Fig. 7 shows the usage ratio of the methods of each of the above groups, indicating that data mining methods are the most widely used compared to other technical or mathematical techniques.

Data mining is the process of automatically discovering knowledge, patterns, or models from large volumes of data using advanced data analysis methods [89]. Data mining techniques are essential for data analysis in large-scale CPS applications, which rely on sensor node networks that typically stream a continuous flow of spatiotemporal data at relatively high speed and dynamicity [90]. Focusing on the SLR primary studies that adopted data mining techniques for tackling data quality challenges in large-scale CPS applications reveals that these methods are mainly divided into statistical and machine-learning-based methods. Furthermore, it reveals that the most popular data mining techniques used in large-scale CPS applications are anomaly analysis, predictive analysis, and clustering analysis, as shown in Fig. 8. Moreover, these three leading data mining techniques were applied to address various data quality issues associated with the main data quality dimensions, as shown in Fig. 9.

Fig. 7. The most popular data quality assessment/management methods or techniques in large-scale CPS applications, based on the number of SLR studies.

Fig. 10 shows a holistic diagram of the main data quality management/assessment techniques and data quality dimensions based on the SLR results.

Fig. 8. The most popular data mining techniques in large-scale CPS applications, based on the number of SLR studies.

Fig. 9. Data mining techniques and the main data quality dimensions in large-scale CPS applications.

Fig. 10. A holistic diagram of the main data quality management/assessment methods/techniques and data quality dimensions based on the SLR results.

7.1 Anomaly analysis for data quality management

Anomaly analysis, also called outlier detection, is the process of identifying unusual patterns in datasets that do not comply with well-established normal behaviour [90]. If the absolute value of the deviation degree of a sensor node’s observation is higher than a pre-calculated threshold value, then this observation is an outlier [91]. As shown in Figs. 8 and 9, anomaly analysis is a significant research field in the context of data quality assessment in large-scale CPSs, and it is mainly investigated using statistical and machine-learning-based outlier detection techniques: for example, Deep Neural Networks [DNN] [92], the K-Nearest Neighbours algorithm [KNN] [92], and the K-means clustering algorithm [93] as machine-learning-based outlier detection methods, and standard deviation, the correlation coefficient [94], and DBSCAN [88,95,96] as statistical outlier detection methods.
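The deviation-degree rule described above can be expressed in a few lines; the sketch below applies it to a hypothetical list of readings from one sensor node, with the threshold expressed as a multiple of the standard deviation [an illustrative choice rather than the cited papers' exact formulation].

import statistics

def deviation_outliers(observations, k=3.0):
    """Flag observations whose absolute deviation from the mean exceeds
    a pre-calculated threshold (here k standard deviations)."""
    mean = statistics.mean(observations)
    stdev = statistics.stdev(observations)
    threshold = k * stdev
    return [x for x in observations if abs(x - mean) > threshold]

# Hypothetical temperature readings from one sensor node (°C).
readings = [21.2, 21.5, 21.3, 21.6, 21.4, 35.0, 21.5, 21.3]
print(deviation_outliers(readings, k=2.0))   # the 35.0 reading stands out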

Outlier detection relies on the assumption that the values of sensor nodes’ observations are correlated spatially, temporally, or both spatially and temporally. However, these assumptions are not necessarily always valid, especially in large-scale CPS applications where the correlations between sensor nodes may be affected by many parameters, such as the size of the deployment environment and the geographical distribution of sensor nodes [97]. For example, the assumption of spatial continuity cannot be applied directly to real-world temperature observations collected from the temperature sensor nodes distributed around London because of a phenomenon known as urban heat islands. According to the Meteorological Office [Met Office], the heat island phenomenon is caused by many associated factors, such as the heat released from industrial and domestic facilities, and from concrete and other building materials that absorb solar heat during the day and release it back during the night. The phenomenon of urban heat islands may cause up to 5 degrees of [unexpected] deviation among sensor node observations at the same point in time, which violates the spatial continuity constraints [98] among sensor node observations. The heat profile map of London is shown in Fig. 11, where temperatures in central London may reach 11 °C while dropping by 6 °C in the suburbs at the same point in time [98,99].

7.1.1 Clustering-based outlier detection

Clustering-based outlier detection relies on comparing individual correlated sensors’ observations with the centroid value of their clusters. Therefore, it needs no prior knowledge of the sensor nodes’ historical data. Clustering-based outlier detection can be implemented, for example, using the DBSCAN clustering algorithm to detect error, noise, and failure outliers in high-speed, non-stationary, large-volume WSN data [88,96]. However, according to [91], clustering cannot be considered a reliable anomaly detection technique in real-world scenarios; it can only be used as an outlier filtering mechanism, due to the challenges in determining both the optimum number of sensor nodes in each cluster and the centroid value of each cluster.
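As a brief illustration of this approach, the sketch below runs scikit-learn's DBSCAN over a hypothetical batch of two-dimensional sensor observations and treats the points labelled as noise [label -1] as outlier candidates; the eps and min_samples values are illustrative and would need tuning for real WSN data.

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (temperature, humidity) observations from correlated sensor nodes.
observations = np.array([
    [21.1, 40.2], [21.3, 40.5], [21.2, 40.1], [21.4, 40.4],
    [21.2, 40.3], [21.3, 40.6], [35.0, 12.0],   # last point is a faulty reading
])

# Points that DBSCAN cannot assign to any dense cluster are labelled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(observations)
outlier_candidates = observations[labels == -1]
print(outlier_candidates)        # the (35.0, 12.0) reading is filtered out as noise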

Fig. 11. The heat profile map of London highlighting the impact of urban heat islands [98].

7.1.2 Predictive analysis based outlier detection

Predictive analysis is the process of mining current and historical data to identify patterns and to forecast the future values of time series [5,100]. Predictive analysis might be conducted using statistical or machine-learning-based techniques [101]. For example, a machine learning model based on the Random Forest Prediction [Random Forest Regression] method was adopted by [12] to develop an automated data quality control mechanism for weather data. Another example is based on statistical predictive analysis using a one-step-forward autoregressive moving average [ARMA] model to tackle the inevitable challenge of sensor and sensor network failures in power terminals [82]. Furthermore, some applications required a mixed-methods approach, where both machine-learning and statistical methods were adopted to tackle a particular data quality challenge. For example, [61] investigated the use of artificial neural networks and linear regression for calibrating low-cost environmental monitoring sensors to improve the accuracy of their observations. Predictive analysis methods rely on predictive models developed using historical data as a training data set. Therefore, using predictive analysis in real-time [online mode] applications raises performance concerns because of the complexity and volume of the required training data set [78,102]. Using predictive analysis is a challenge in real-time large-scale CPS applications, since it may require analysing hundreds of sensor node data streams in a relatively short time [103,104]. Furthermore, the training process for predictive analysis requires relatively long and valid [anomaly-free] time series, which cannot be guaranteed in real-world scenarios [91].
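To make the one-step-forward idea concrete, the sketch below fits a simple autoregressive predictor by least squares over a sliding window of history and flags observations whose prediction residual is unusually large; it is a simplified stand-in for the ARMA-based approach cited above, and the model order, window length, threshold, and synthetic signal are illustrative assumptions.

import numpy as np

def ar_residual_anomalies(series, order=2, window=30, k=4.0):
    """One-step-ahead AR(order) forecasts over a sliding window; points whose
    residual exceeds k times the recent residual scale are flagged."""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        # Build the lagged design matrix and fit AR coefficients by least squares.
        X = np.column_stack([history[i:len(history) - order + i] for i in range(order)])
        y = history[order:]
        design = np.column_stack([X, np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        forecast = series[t - order:t] @ coef[:order] + coef[-1]
        residuals = y - design @ coef
        scale = np.std(residuals) + 1e-9
        if abs(series[t] - forecast) > k * scale:
            anomalies.append(t)
    return anomalies

# Hypothetical sensor stream: a smooth signal with one injected fault.
t = np.arange(200)
signal = 20 + 2 * np.sin(t / 10) + np.random.default_rng(0).normal(0, 0.1, 200)
signal[150] += 5.0
print(ar_residual_anomalies(signal))     # index 150 should be among the flagged points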


URL: //www.sciencedirect.com/science/article/pii/S0306437921001484
