Tag: Machine Learning Systems

  • How Our Favorite Sites and Tools Use Machine Learning | Salesforce Guide

    As companies collect more and more data about their customers, an increased amount of duplicate information starts appearing in the data as well, causing a lot of confusion among internal teams. Since it would be impossible to manually go through all of the data and delete the duplicates, companies have come up with machine learning solutions that perform such work for them. Today we would like to take a look at some interesting uses of machine learning to catch duplicates in all kinds of environments. Before we dive right in, let’s take a look at how machine learning systems work.

    How Do Machine Learning Systems Identify Duplicates?

    When a person looks at two images or two strings of data, it is fairly easy for them to determine whether or not they are duplicates. However, how would you train a machine to spot such duplicates? A good starting point would be to identify all of the similarities, but then you would need to explain exactly what “similar” means. Are there gradations of similarity? To overcome such challenges, researchers use string metrics to train machine learning models.

    There are many string metrics to choose from. The following is a list of some of the most frequently used string metrics:

    • Hamming Distance – This method counts the number of substitutions that are required to turn one string into another.
    • Levenshtein Distance – This string metric expands on the Hamming Distance by allowing operations such as deletion and insertion in addition to substitution.
    • Jaro-Winkler Distance – This string metric measures the edit distance between two sequences, giving extra weight to characters that match at the beginning of the strings.
    • Learnable Distance – This one takes into consideration that different edit operations have varying significance in different domains.
    • Sørensen–Dice coefficient – This one measures how similar two strings are in terms of the number of common bigrams (a bigram is a pair of adjacent letters in the string).
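
    To make one of these metrics concrete, here is a minimal Python sketch of the Sørensen–Dice coefficient described above. The function names are our own, and this is an illustration rather than code from any particular deduplication tool:

```python
def bigrams(s):
    """Return the list of adjacent letter pairs in a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice_coefficient(a, b):
    """Sorensen-Dice similarity: 2 * |shared bigrams| / (total bigrams)."""
    ba, bb = bigrams(a.lower()), bigrams(b.lower())
    if not ba and not bb:
        return 1.0  # two empty (or one-letter) strings are trivially identical
    # Count bigrams shared between the two strings (multiset intersection).
    shared = 0
    remaining = list(bb)
    for bg in ba:
        if bg in remaining:
            remaining.remove(bg)
            shared += 1
    return 2 * shared / (len(ba) + len(bb))
```

    For example, "night" and "nacht" share only the bigram "ht" out of eight bigrams total, giving a similarity of 0.25, while identical strings score 1.0.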

    All of this may sound complicated, so let’s take a look at some real-world examples of using machine learning for deduplication.

    Detecting Duplicate Questions in Quora and Reddit

    Quora and Reddit are two very popular platforms used by millions of people all over the world. This scale creates a significant problem of duplicate questions and posts, since one user is most likely unaware that somebody else has already asked the same question. Because it is not possible to manually sift through all of the questions, these platforms use machine learning to identify duplicates. As an example, let’s use the questions below:

    • How do I book cheap hotel rooms?
    • What are the best ways to find cheap hotel deals?

    To train its machine learning algorithms to identify whether or not these questions are duplicates, Quora uses a massive training dataset of 404,290 question pairs and a test set of 2,345,795 question pairs. So many questions are needed because so many factors must be considered, such as capitalization, abbreviations, and establishing the ground truth for what counts as a duplicate. All of this is meant to surface high-quality answers to questions, resulting in a better experience for all Quora users across the board.

    Deduping Ads on Craigslist

    Craigslist is another very popular platform for posting all kinds of advertisements, but sellers will often tweak and repost an ad if they are not satisfied with its performance, which creates near-duplicate listings.

    A machine learning system will be able to use the string metrics we mentioned earlier to determine the distance between each string and the operations necessary to turn one string into another. It will then be able to flag all of the duplicate ads.

    Deduping Lines of Code

    Even people who are not IT professionals have heard of GitHub, a popular resource where developers can host, share, and discover software. With more than 190 million repositories and more than 40 million users on GitHub, it is easy for duplicate code to appear. In fact, research into this issue shows that 93% of JavaScript files on GitHub are duplicates. There are many different classifications of duplicates, ranging from completely identical to those that are semantically similar but syntactically different. GitHub relies on machine learning to parse through all the code submitted by users and detect the duplicates that are either exactly the same or perform the same functions.

    Using Machine Learning to Dedupe Salesforce

    Machine learning is a much better alternative to the traditional rule-based approach used to dedupe Salesforce. It is far more effective at identifying fuzzy duplicates, since it is not possible to create a rule for every possible scenario. Instead, machine learning takes the string metrics mentioned above, along with many others, and learns to replicate the human thought process. The system presents you with a pair of records, which you label as either unique or duplicates, and it automatically learns from your decisions to adjust its deduplication algorithms to better match your data.

    Another big benefit of the machine learning approach is that it is much more scalable. Consider that if you start with a modest number of records, such as 50,000, adding another 5,000 will require 250,000,000 comparisons to be made. To extrapolate further, even if a computer manages to perform 10,000 comparisons per second (which would require enormous computational power), it would still take almost seven hours to complete the full comparison and identify the duplicates:

    250,000,000 comparisons ÷ 10,000 comparisons per second ÷ 3,600 seconds per hour ≈ 6.94 hours

    Machine learning takes a much smarter approach by blocking together records with specific similarities and only checking within these blocks for duplicates. This results in significantly fewer comparisons, saving you a lot of time.
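
    The blocking idea above can be sketched in a few lines of Python. The blocking key here (the first three letters of a name, lowercased) is a hypothetical choice for illustration; real systems configure or learn their own keys:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical blocking key: first three letters of the name, lowercased.
    return record["name"][:3].lower()

def candidate_pairs(records):
    """Group records into blocks and only generate pairs within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

records = [
    {"name": "Ron Burgundy"},
    {"name": "Ronald Burgundy"},
    {"name": "Veronica Corningstone"},
]
# Naive all-pairs comparison would check 3 pairs; blocking checks only 1.
```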

    The Role of Machine Learning In Deduplication Will Only Increase

    Duplicate data can carry severe ramifications for your CRM or database, which is why a range of techniques has been developed to identify duplicates. As we have shown, machine learning is much more efficient and smarter than creating filters, which is time-consuming and ultimately a futile effort. Therefore, start using machine learning to dedupe your data today.

  • How Do Artificial Intelligence Systems Compare Salesforce Records?

    When you compare two Salesforce records side by side, you can easily determine whether or not they are duplicates. However, even if you have a relatively small number of records, say fewer than 100,000, it would be almost impossible to sift through them one by one and perform such a comparison. This is why companies have developed various tools that automate such processes, but to do a good job, the machines need to be able to recognize all of the similarities and differences between the records. In this article, we will take a closer look at some of the methods used by data scientists to train machine learning systems to identify duplicates.

    How Can Machine Learning Systems Compare and Contrast Records? 

    One of the main tools researchers use is string metrics. This is when you take two strings of data and return a value that is low if the strings are similar and high if they are different. How does this work in practice? Well, let’s take a look at the two records below:

    First Name   Last Name   E-mail                  Company Name
    Ron          Burgundy    ron.burgundy@acme.com   Acme
    Ronald       burgundy    ron.burgundy@acme.com   Acme Corp

    If a human were to look at these two records, it would be pretty obvious that these are duplicates. However, machines rely on string metrics to replicate the human thought process, which is what AI is all about. One of the most famous string metrics is the Hamming distance which measures the number of substitutions that need to be made in order to turn one string into another. For example, if we return to the two records above, there would only need to be one substitution made to turn “burgundy” into “Burgundy”, therefore the Hamming distance would be 1. 
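
    As a quick illustration, the Hamming distance can be computed in a few lines of Python; this sketch is ours, not code from any specific AI system:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

# "burgundy" -> "Burgundy" needs exactly one substitution.
print(hamming_distance("burgundy", "Burgundy"))  # 1
```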

    There are many other string metrics that measure the similarity between two strings, and what separates them is the operations they allow. For example, the Hamming distance only allows substitutions, meaning that it can only be applied to strings of equal length. The Levenshtein distance, by contrast, allows deletion, insertion, and substitution.
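
    The Levenshtein distance is typically computed with dynamic programming. Below is a compact, illustrative Python version (a sketch, not any vendor’s implementation):

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the current prefix of a to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein_distance("Ron", "Ronald"))  # 3 (insert "a", "l", "d")
```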

    How Can All of This Be Used to Dedupe Salesforce? 

    There are a couple of ways an AI system can approach Salesforce deduplication. One of the ways is the blocking method, which is illustrated below: 

    Record 1: Ron Burgundy, ron.burgundy@acme.com, Acme
    Record 2: Ronald burgundy, ron.burgundy@acme.com, Acme Corp

    Such blocking methodology is what makes this approach scalable. Whenever you upload new records into your Salesforce, the system will automatically block together records that look “similar”. The blocking criterion can be something like the first three letters of the first name, or any other shared characteristic.

    This is very beneficial because it reduces the number of comparisons that need to be made. For example, let’s say that you have 100,000 records in your Salesforce and you would like to upload an Excel spreadsheet that contains 50,000 records. A traditional rule-based deduplication app would need to compare each new record with every existing one, which amounts to 5,000,000,000 comparisons (100,000 x 50,000). Imagine how long this would take and how much it increases the probability of an error. Also, keep in mind that 100,000 records is a fairly modest number for Salesforce; there are lots of organizations that have hundreds of thousands or even millions of records. The traditional approach simply does not scale to accommodate such volumes.

    The other option would be to compare each field individually: 

    Record 1 Record 2
    First Name Ron Ronald
    Last Name Burgundy burgundy
    Email ron.burgundy@acme.com ron.burgundy@acme.com
    Company Acme Acme Corp

    Once the system has blocked together “similar” records, it will then proceed to analyze each record field by field. This is where all of the string metrics we talked about earlier come into play. In addition, the system will assign each field a particular “weight”, or importance. For example, let’s say that for your dataset, the “E-mail” field is the most important. You can either adjust the algorithms yourself or, as you label records as duplicates (or not), let the system automatically learn the correct weights. This is called Active Learning and is preferable, since the system can precisely calculate the importance of one field over another.
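
    To make the field-weighting idea concrete, here is a toy Python sketch. The weights and the similarity rule are invented for illustration; in an Active Learning system they would be learned from your labels rather than hard-coded:

```python
def field_similarity(a, b):
    # Toy rule: exact match = 1.0, case-insensitive match = 0.8, otherwise 0.0.
    if a == b:
        return 1.0
    if a.lower() == b.lower():
        return 0.8
    return 0.0

# Hypothetical weights: the email field matters far more than the name fields.
WEIGHTS = {"first_name": 0.1, "last_name": 0.2, "email": 0.5, "company": 0.2}

def duplicate_score(rec1, rec2):
    """Weighted sum of per-field similarities; higher means more likely a duplicate."""
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

ron = {"first_name": "Ron", "last_name": "Burgundy",
       "email": "ron.burgundy@acme.com", "company": "Acme"}
ronald = {"first_name": "Ronald", "last_name": "burgundy",
          "email": "ron.burgundy@acme.com", "company": "Acme Corp"}
```

    With these toy weights, the Ron/Ronald pair scores about 0.66, with most of that score coming from the matching email address rather than the names.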

    What are the Advantages of the Machine Learning Approach? 

    The biggest benefit machine learning can offer is that it does all of the work for you. The Active Learning aspect we described in the previous section will apply all of the necessary weights to each field automatically. What this means is that there is no complicated setup process or rules to create. Let’s look at the following scenario. Imagine that one of the sales reps discovers a duplicate and notifies the Salesforce admin about the problem. The Salesforce admin will then proceed to create a rule that will prevent such duplicates from occurring in the future. This process would have to be repeated over and over again every time a new duplicate is discovered making such a process unsustainable. 

    Also, we need to remember that the built-in deduplication in Salesforce is also rule-based, just very limited: for example, you can only merge three records at a time, there is no support for custom objects, and there are many other restrictions. Machine learning is simply the smarter way to go, since rule creation is basic automation, whereas AI and machine learning try to recreate the human thought process. More about the differences between machine learning and automation is discussed in this article. It would not make sense to choose a deduplication product that simply expands Salesforce’s functionality instead of fixing the entire process. This is why the machine learning approach is the best way to go.

  • Artificial Intelligence vs Automation: Which is Better for Salesforce Deduplication?

    While a lot of businesses are looking to streamline processes, this can be done in many different ways via automation, machine learning, and artificial intelligence. In terms of deduping your Salesforce environment, you have these same choices: rule-based deduplication, which is like automation, and a machine learning approach. In this article, we will tell you about the difference between these two approaches and why machine learning is the best way to go. First, let’s start with automation. 

    What is Automation? 

    Automation has been around for centuries. We can trace it back to eras like the Industrial Revolution and even earlier to medieval times when water was used in traditional milling to replace human labor in turning the millstone. This offers us an insight into what automation really is. Basically, automation is the process of machines replicating human tasks, but the machines will not have the ability to dynamically respond to any changes. Therefore, in many ways, the automation we use today serves the same purpose as it did back in the middle ages. 

    If we look at the rule-based Salesforce deduplication tools, we can see this same type of automation. For example, let’s say that one of the matching rules is (Company OR Email) AND Phone. Instead of asking your sales reps or other Salesforce users to spot records matching the above match equation, you can simply create a rule that will do this for you. However, when we think about all of the possible fuzzy duplicates, it is almost impossible to create a rule for every scenario. This is why machine learning is the better alternative.

    How Does the Machine Learning Approach Work? 

    Whenever you label two records as duplicates (or not) the system automatically learns from your choices and will apply the same logic to future records. For example, let’s take a look at the records below: 

    First Name   Last Name   Phone            Email
    Joseph       Smith       (555) 431-0221   joseph.smith@acme.com
    Joe          Smith       (555) 431-0221   joseph.smith@acme.com

    To a human, it would be pretty obvious that these records are duplicates, but what exactly gives that away? “Joe” is a common short form of “Joseph”, and “Smith” is a very common last name, so technically these could be two different people. However, since the records share the same phone number and email, this is a much stronger indication that they are duplicates. In other words, we can say that the “Phone” field and the “Email” field carry more weight than fields like “First Name” and “Last Name”.

    This is how machines learn to identify duplicate records as well, since they are able to replicate the human thought process. However, it also goes beyond human computational capabilities. If we return to the example above, we established that the “Email” field is more important than the “Last Name” field, but would you be able to quantify by exactly how much? Is it 3 times more important, or 2.5? The system can not only calculate something like this, but also apply the necessary weight to every field and dynamically adjust those weights as new records are added to the system.
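
    One simple way a system could learn such weights from user labels is a perceptron-style update, sketched below. This is a toy illustration of the learning idea, not how any specific deduplication product is implemented:

```python
def update_weights(weights, field_sims, label, lr=0.1):
    """Nudge field weights toward the user's duplicate (1) / unique (0) label."""
    score = sum(weights[f] * s for f, s in field_sims.items())
    prediction = 1 if score >= 0.5 else 0
    error = label - prediction
    for f, s in field_sims.items():
        weights[f] += lr * error * s  # only fields that matched get adjusted
    return weights

# Start with equal weights; the Joe/Joseph pair matches on everything but first name.
weights = {"first_name": 0.1, "last_name": 0.1, "phone": 0.1, "email": 0.1}
sims = {"first_name": 0.0, "last_name": 1.0, "phone": 1.0, "email": 1.0}
weights = update_weights(weights, sims, label=1)  # user marks the pair as duplicates
# Matching fields (last name, phone, email) gain weight; first name is unchanged.
```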

    How Are the Machine Learning Systems Created? 

    It would be useful to look at the process of creating a machine learning system as a pyramid. At the base, you have all of the data used to train the system. The data in your Salesforce environment is used as training data since the system needs to adjust to your individual situation. Therefore, when you are labeling records as either duplicates or unique, you are actually training the deduplication algorithm. Then comes the analytics stage which occurs when the system is able to manipulate the digitized data, allowing it to extract some meaningful insights. The system can now differentiate the duplicates from other records. 

    Next, we come to the machine learning stage. Basically, the machine is able to take what it has learned and apply this knowledge and analysis to new data without any explicit programming. Any new records that come in will be deduped based on the field weights, string metrics, and other criteria that the system learned in the previous stage. It is worth pointing out that the learning never ends: the system will continue to learn from new data and user actions on the go.

    Finally, we get to the ultimate level, which is AI. Even though machine learning is a big part of AI, it goes a level beyond machine learning by reproducing human capabilities; in our case, that means identifying duplicates on its own.

    Why is Machine Learning the Best Approach to Deduping Salesforce? 

    Machine learning is the best way to go because it does all of the work for you. There are no complex rules to set up, and you don’t have to standardize your data or perform any other configuration. This approach is also much more scalable. For example, let’s say that you already have 100,000 records in your Salesforce and you would like to upload a spreadsheet with 5,000 additional ones. A rule-based system would have to compare all of the incoming records with the existing ones, which is 500,000,000 comparisons. A machine learning system takes the smarter approach of blocking together records that have something in common. This could be something like the first three letters of the “First Name” field, the same email address, or any other shared characteristic.

    Try the Machine Learning Approach to Deduplication 

    If you are tired of setting up rules, or you notice that duplicates keep finding their way into your Salesforce, consider switching to the machine learning approach. It is a lot more comprehensive and will significantly simplify your life and the job of your sales professionals.