How a Software Update Crashed Computers Globally


And why the CrowdStrike outage is proving difficult to resolve.

On Friday 19 July, the world experienced a rare and massive global IT outage. These events, while infrequent, can cause significant disruption. They often originate from errors in centralized systems, such as cloud services or server farms. However, this particular outage was unique and has proven to be difficult and time-consuming to resolve. The culprit? A faulty software update was pushed directly to PCs by CrowdStrike, a leading cybersecurity firm serving over half of the Fortune 500 companies.

 

Windows Global IT Outage: The Beginning

The outage began when CrowdStrike distributed faulty code to Windows machines. The update caused affected machines to enter an endless reboot loop, rendering them offline and virtually unusable. The severity of the problem was compounded by the inability to issue a fix remotely.

 

Immediate Impacts of the IT Outage

In the immediate aftermath, Windows servers and PCs went down on a wide scale. Systems across various industries were disrupted, highlighting the dependency on stable cybersecurity measures. With computers stuck in an endless cycle of reboots, normal business operations ground to a halt, creating a ripple effect that was felt globally.

 

The Challenges of a Remote Fix

Why the Global IT Outage is Harder to Fix

One of the most significant challenges in this global IT outage is the inability to resolve the issue remotely. The faulty code rendered remote fixes ineffective, necessitating manual intervention. This meant that each affected machine had to be individually accessed to remove the problematic update.

 

Manual vs. Automated Fixes

Unless experts can devise a method to fix the machines remotely, the process will be painstakingly slow. CrowdStrike is exploring ways to automate the repair process, which would significantly expedite resolution. However, the complexity of the situation means that even an automated solution is not guaranteed to be straightforward.

 

 

Broader Implications of the Outage

Understanding the Broader Impact

The Windows global IT outage has exposed vulnerabilities in how updates are managed and deployed. This incident serves as a stark reminder of the potential risks associated with centralized update systems. Businesses worldwide are now reevaluating their dependence on single-point updates to avoid similar disruptions in the future.

 

Preventing Future IT Outages

Moving forward, organizations could implement more rigorous testing protocols and fail-safes to prevent such widespread disruptions. Additionally, there may be a shift towards more decentralized update mechanisms to minimize the risk of a single point of failure.
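
One way to picture the testing and fail-safe idea is a staged (canary) rollout, in which an update reaches a small group of machines first and only continues if that group stays healthy. The sketch below is illustrative only: the function name, the `apply_update` callback, the wave sizes, and the failure threshold are all assumptions, not a description of how CrowdStrike or any other vendor deploys updates.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    wave_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.02,
) -> List[str]:
    """Update hosts in growing waves; stop when a wave fails too often.

    apply_update is a hypothetical callback that returns True when the host
    is healthy after the update. Returns the hosts that received the update.
    """
    updated: List[str] = []
    for fraction in wave_fractions:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        if not wave:
            continue
        failures = 0
        for host in wave:
            ok = apply_update(host)
            updated.append(host)
            if not ok:
                failures += 1
        if failures / len(wave) > max_failure_rate:
            # Fail-safe: halt so the remaining hosts never get the update.
            break
    return updated
```

With 200 hosts and an update that breaks every machine, this sketch stops after the first wave of two hosts instead of reaching all 200.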

 

Conclusion

The global IT outage caused by a faulty CrowdStrike update serves as a critical lesson for the tech industry. The incident underscores the need for more resilient and fail-safe update mechanisms to ensure that such disruptions do not occur again. As organizations worldwide continue to grapple with the consequences, the focus will undoubtedly shift towards preventing future occurrences through improved practices and technologies.

 

FAQs

What caused the global IT outage?

The outage was caused by a faulty CrowdStrike software update, which caused affected computers to enter an endless reboot loop.

 

How widespread was the outage?

The outage was global, affecting businesses and systems across various industries worldwide.

 

Why is it difficult to fix the outage?

The affected machines cannot be remotely fixed due to the nature of the faulty code. Each computer needs to be manually accessed to remove the problematic update.

 

Is there a way to automate the fix?

CrowdStrike is exploring automated solutions, but the complexity of the issue means that a straightforward automated fix may not be feasible.

 

What are the broader implications of the outage?

The incident highlights the vulnerabilities in centralized update systems and may lead to more rigorous testing protocols and decentralized update mechanisms.

 

How can future IT outages be prevented?

Implementing more robust testing procedures and decentralized update systems can help prevent similar outages in the future.

Microsoft Windows Outage: CrowdStrike Falcon Sensor Update


 

Like millions of others, I tried to go on vacation, only to have two flights delayed because of IT issues. As an engineer who enjoys problem-solving, and as CEO of the company, nothing amps me up more than a worldwide IT issue, and what frustrates me most is the lack of clear information.

 

From the announcements on their website and on social media, CrowdStrike issued an update and that update was defective, causing a Microsoft outage. The computers that downloaded the update go into a debug loop: attempt to boot, error, attempt repair, restore system files, boot, repeat.

 

The update affects only Windows systems; Linux and Macs are unaffected.

 

The widespread impact, and the focus on Windows servers going down, comes from how deeply CrowdStrike's Falcon sensor is tied into Windows: it runs with kernel-level access to the operating system, so a defective update can take down the whole machine.

 

Microsoft and CrowdStrike Responses

 

Microsoft reported continuous improvements and ongoing mitigation actions, directing users to its admin center and status page for more details. Meanwhile, CrowdStrike acknowledged that recent crashes on Windows systems were linked to issues with the Falcon sensor.

 

The company stated that symptoms included Windows hosts going down with a blue screen error related to the Falcon Sensor, and assured customers that its engineering teams were actively working on a resolution to this IT outage.

 

There is a deeper problem here, one that will impact us worldwide until we address it. The technology world is becoming too intertwined, with too little testing or accountability, leading to a decrease in durability and stability and an increase in outages.

 

Global Impact on Microsoft Windows Users

 

Windows users worldwide, including those in the US, Europe, and India, experienced the blue screen of death, rendering their systems unusable. Users reported their PCs randomly restarting and entering the blue screen error mode, interrupting their workday. Social media posts showed screens stuck on the recovery page with messages indicating Windows didn’t load correctly and offering options to restart the PC.

 

If Windows systems had not depended on a third-party module with this level of access to the operating system, this outage wouldn’t have occurred. Too many vendors build their products by assembling a hodgepodge of tools, leading to outages when one tool fails.

 

The global IT outage caused by CrowdStrike’s Falcon Sensor has highlighted the vulnerability of interconnected systems.

 

I see it in the MSP industry all the time; most (if not all) of our competitors use outsourced support tools, outsourced ticket systems, outsourced hosting, outsourced technology stack, and even outsourced staff.  If everything is outsourced then how do you maintain quality?

 

We are very different, which is why component outages like the one occurring today do not impact us. The tools we use all run on servers we built; those servers run in clusters we own, which run in dedicated data centers we control. We plan for failures to occur, which to clients translates into unbelievable uptime, and that translates into unbelievable Net Promoter Scores.

 

The Net Promoter Score is an industry client “happiness” score. For the MSP industry, the average score is 32-38; at Protected Harbor, our score is over 90.

 

Because we own our own stack, because all our staff are employees with no outsourcing, and because 85%+ of our staff are engineers, we can deliver amazing support and uptime, which translates into customer happiness.

 

If you are not a customer of ours and your systems are affected by this global IT outage, wait. Microsoft will issue an update soon that will help alleviate this issue; however, a manual update process might be required. If your local systems are not impacted yet, turn them off right now and wait a couple of hours for Microsoft to issue an update. For clients of ours: go to work, everything is working. If your local systems or home system are impacted, contact support and we will get you running.

 

 

Preventing Outages with High Availability (HA)


High Availability (HA) is a fundamental part of data management, ensuring that critical data remains accessible and operational despite unforeseen challenges. It’s a comprehensive approach that employs various strategies and technologies to prevent outages, minimize downtime, and maintain continuous data accessibility. The following are seven areas that make up a powerful HA deployment.

Redundancy and Replication:  Redundancy and replication involve maintaining multiple copies of data across geographically distributed locations or redundant hardware components. For instance, in a private cloud environment, data may be replicated across multiple data centers. This redundancy ensures that if one copy of the data becomes unavailable due to hardware failures, natural disasters, or other issues, another copy can seamlessly take its place, preventing downtime and ensuring data availability. The same principle applies in the public cloud: Amazon Web Services (AWS), for example, offers services like Amazon S3 (Simple Storage Service) and Amazon RDS (Relational Database Service) that can replicate data across multiple availability zones within a region, providing high availability and durability.

Fault Tolerance:  Fault tolerance is the ability of a system to continue operating and serving data even in the presence of hardware failures, software errors, or network issues. One common example of fault tolerance is automatic failover in database systems. For instance, in a master-slave database replication setup, if the master node fails, operations are automatically redirected to one of the slave nodes, ensuring uninterrupted access to data. This ensures that critical services remain available even in the event of hardware failures or other disruptions.
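
The failover decision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the node names and the `is_healthy` callback are hypothetical, and a real database failover also has to handle promotion, replication lag, and split-brain protection, none of which is shown here.

```python
from typing import Callable, Optional, Sequence


def choose_active_node(
    nodes: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy node in priority order (primary first)."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None  # no healthy node left: the service is down


# Hypothetical node names; the primary is listed first.
nodes = ["db-primary", "db-replica-1", "db-replica-2"]
down = {"db-primary"}  # simulate a primary failure
active = choose_active_node(nodes, lambda n: n not in down)
print(active)  # db-replica-1
```

Listing nodes in priority order keeps the rule simple: traffic returns to the primary as soon as its health check passes again.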

Automated Monitoring and Alerting:  Automated monitoring and alerting systems continuously monitor the health and performance of data storage systems, databases, and other critical components. These systems use metrics such as CPU utilization, disk space, and network latency to detect anomalies or potential issues. For example, monitoring tools like PRTG and Grafana can be configured to track key performance indicators (KPIs) and send alerts via email, SMS, or other channels when thresholds are exceeded or abnormalities are detected. This proactive approach allows IT staff to identify and address potential issues before they escalate into outages, minimizing downtime and ensuring data availability.

For example, we write custom monitoring scripts for our clients that alert us to database processing pressure, long-running queries, and errors. Good monitoring is critical for production database performance and end-user usability.
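
As a rough sketch of threshold-based alerting, the check below compares a set of metric readings against limits and reports the ones that are exceeded. The metric names and threshold values are assumptions chosen for illustration; this is not the configuration of PRTG, Grafana, or any script described above.

```python
from typing import Dict, List

# Illustrative thresholds (assumed values, in percent or milliseconds).
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 90.0,
    "disk_used_percent": 85.0,
    "network_latency_ms": 200.0,
}


def check_metrics(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not reported are skipped rather than alerted on.
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts
```

In practice the returned messages would be handed to whatever channel the team uses, such as email or SMS.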

Load Balancing:  Load balancing distributes incoming requests for data across multiple servers or nodes to ensure optimal performance and availability. For example, a web application deployed across multiple servers may use a load balancer to distribute incoming traffic among the servers evenly. If one server becomes overloaded or unavailable, the load balancer redirects traffic to the remaining servers, ensuring that the application remains accessible and responsive. Load balancing is crucial in preventing overload situations that could lead to downtime or degraded performance.
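
A simple way to see the redirect behavior is a round-robin balancer that skips servers marked as unavailable. The class and method names below are hypothetical, and production load balancers use richer health checks and weighting; this is only a sketch of the idea.

```python
from itertools import cycle
from typing import Iterable, Set


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any marked as unavailable."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers = list(servers)
        self._ring = cycle(self._servers)
        self._down: Set[str] = set()

    def mark_down(self, server: str) -> None:
        self._down.add(server)

    def mark_up(self, server: str) -> None:
        self._down.discard(server)

    def next_server(self) -> str:
        # Try each server at most once per call.
        for _ in range(len(self._servers)):
            server = next(self._ring)
            if server not in self._down:
                return server
        raise RuntimeError("no servers available")
```

Calling `mark_down("b")` on a balancer built from servers `a`, `b`, and `c` means later calls to `next_server()` alternate between `a` and `c` until `b` is marked up again.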

Data Backup and Recovery:  Data backup and recovery mechanisms protect against data loss caused by accidental deletion, corruption, or other unforeseen events. Regular backups are taken of critical data and stored securely, allowing organizations to restore data quickly in the event of a failure or data loss incident.

Continuous Software Updates and Patching:  Keeping software systems up to date with the latest security patches and updates is essential for maintaining Data High Availability. For example, database vendors regularly release patches to address security vulnerabilities and software bugs. Automated patch management systems can streamline the process of applying updates across distributed systems, ensuring that critical security patches are applied promptly. By keeping software systems up-to-date, organizations can mitigate the risk of security breaches and ensure the stability and reliability of their data infrastructure.

Disaster Recovery Planning:  Disaster recovery planning involves developing comprehensive plans and procedures for recovering data and IT systems in the event of a catastrophic failure or natural disaster. For example, organizations may implement multi-site disaster recovery strategies, where critical data and applications are replicated across geographically dispersed data centers. These plans typically outline roles and responsibilities, communication protocols, backup and recovery procedures, and alternative infrastructure arrangements to minimize downtime and data loss in emergencies.

We develop automatic database failover and disaster recovery procedures for clients, and we work with programmers and IT departments to help them understand the importance of HA and how to change their code to make the best use of High Availability.

An Essential Tool

Data High Availability is essential for preventing outages and ensuring continuous data accessibility in modern IT environments. By employing the strategies we outlined, you can mitigate the risk of downtime, maintain business continuity, and ensure the availability and reliability of critical data and services.

High Availability is available on all modern database platforms and requires a thoughtful approach. We’d be happy to show you how we can help your organization and make your applications and systems fly without disruption. Call us today.

Meta Global Outage

Meta’s Global Outage: What Happened and How Users Reacted

Meta, the parent company of social media giants Facebook and Instagram, recently faced a widespread global outage that left millions of users unable to access their platforms. The disruption, which occurred on a Wednesday, prompted frustration and concern among users worldwide.

Andy Stone, Communications Director at Meta, issued an apology for the inconvenience caused by the outage, acknowledging the technical issue and assuring users that it had been resolved as quickly as possible.

“Earlier today, a technical issue caused people to have difficulty accessing some of our services. We resolved the issue as quickly as possible for everyone who was impacted, and we apologize for any inconvenience,” said Stone.

The outage had a significant impact globally, with users reporting difficulties accessing Facebook and Instagram, platforms they rely on for communication, networking, and entertainment.

Following the restoration of services, users expressed relief and gratitude for the swift resolution of the issue. Many took to social media to share their experiences and express appreciation for Meta’s timely intervention.

However, during the outage, users encountered various issues such as being logged out of their Facebook accounts and experiencing problems refreshing their Instagram feeds. Additionally, Threads, an app developed by Meta, experienced a complete shutdown, displaying error messages upon launch.

Reports on DownDetector, a website that tracks internet service outages, surged rapidly for all three platforms following the onset of the issue. Despite widespread complaints, Meta initially did not officially acknowledge the problem.

However, Andy Stone later addressed the issue on Twitter, acknowledging the widespread difficulties users faced in accessing the company’s services. Stone’s tweet reassured users that Meta was actively working to resolve the problem.

The outage serves as a reminder of the dependence many users have on social media platforms for communication and entertainment. It also highlights the importance of swift responses from companies like Meta when technical issues arise.

 

Update from Meta

Meta spokesperson Andy Stone acknowledged the widespread connectivity problems, stating, “We’re aware of the issues affecting access to our services. Rest assured, we’re actively addressing this.” Following the restoration of services, Stone issued an apology, acknowledging the inconvenience caused by the outage. “Earlier today, a technical glitch hindered access to some of our services. We’ve swiftly resolved the issue for all affected users and extend our sincere apologies for any disruption,” he tweeted.

However, X (formerly Twitter) owner Elon Musk couldn’t resist poking fun at Meta, quipping, “If you’re seeing this post, it’s because our servers are still up.” This lighthearted jab underscores the frustration experienced by users during the Facebook worldwide outage, emphasizing the impact of technical hiccups on social media platforms.

In a recent incident, Meta experienced a significant outage that left users without social media for six hours, causing widespread disruption across its platforms, including Facebook, Instagram, and WhatsApp. The prolonged downtime had a massive financial impact, with Mark Zuckerberg’s Meta losing $3 billion in market value. This outage highlighted the vulnerability of relying on a single company for multiple social media services, prompting discussions about the resilience and reliability of Meta’s infrastructure.

 

In conclusion, while the global outage caused inconvenience for millions of users, the swift resolution of the issue and Meta’s acknowledgment of the problem have helped restore confidence among users. It also underscores the need for continuous improvement in maintaining the reliability and accessibility of online services.

How to Prevent Crashes and Outages?


Today’s workforce relies heavily on computers for day-to-day tasks. If a computer crashes, we tend to get more than just a little agitated.

Fear of being unable to work and get our jobs done for the day races through our minds while anger takes its place in the forefront of trying to fix whatever went wrong, throwing all logic out the window.

When a system abruptly ceases to work, it crashes. The scope of a system failure can vary significantly from one that affects all subsystems to one that is just limited to a particular device or just the kernel itself.

System hang-ups are a related occurrence in which the operating system is nominally loaded, but the system stops responding to input from any user or device and ceases producing output. Such a system is also described as frozen.

This blog will explain how to prevent crashes and outages in 6 easy steps.

 

What is a System Crash and an Outage?

A system crash is a term used to describe a situation in which a computer system fails, usually due to an error or a bug in the software. A crash may also be caused by an application program, system software, driver, hardware malfunction, power outage, or another factor.

“A system freeze,” “system hang,” or “the blue screen of death” are the other terms for a system crash.

An outage is a general term for an unexpected interruption to a service or network. Outages can be planned (for example, during maintenance) or unplanned (a fault occurs). Outages can last for minutes, hours, days, or even weeks.

 

Main Reasons for Crashes and Outages

System outages can be caused by various factors, from hardware failures to software glitches. In many cases, outages are the result of a combination of factors. The following are some of the most common causes of system outages:

  • Hardware failures: A defective component can cause an entire system to fail. Servers, hard drives, and other components can fail, leading to an outage.
  • Software glitches: Software glitches can also cause system outages. A coding error or a bug in the software can disrupt the system’s regular operation.
  • Power outages: A power outage can cause the entire system to fail. The system may be damaged permanently if the power is not restored quickly.
  • Natural disasters: Natural disasters such as hurricanes, tornadoes, and earthquakes can damage or destroy critical components of the system.

System crashes can be caused by various things, from software defects to hardware failures. Sometimes, the crash may even be caused by something as simple as a power outage or due to a more severe issue, such as a virus or malware infection.

  • Overheating: When a computer’s CPU or graphics card gets too hot, it can cause the system to freeze or crash. This is often the result of inadequate cooling or dust and dirt buildup inside the computer.
  • Bad drivers: If a driver is outdated, corrupt, or incompatible with the operating system, it can cause the system to crash. In some cases, this can even lead to data loss or permanent damage to the computer.

Preventing Crashes and Outages

Nobody wants their computer to crash, but it will happen eventually. Here are a few ways to help prevent them and keep your computer running smoothly.

 

1.    Keep Your Software Up to Date by Installing Updates

One of the best ways to prevent crashes and outages is by updating your software. This means installing updates as soon as they become available. You should also keep your operating system and programs up to date. These updates can fix bugs and security vulnerabilities, so installing them as soon as they are released is essential.

 

2.    Avoid Clicking on Links or Downloading Files from Unknown Sources

It’s essential to be proactive in preventing crashes and outages. One way to do this is to avoid clicking on links or downloading files from unknown sources, as these can often contain malware that can harm your computer or network. Additionally, you should routinely back up your data to recover it if something goes wrong.

 

3. Make Sure You Have Good Antivirus and Anti-Malware Programs

One of the most important things you can do to prevent crashes is to ensure that your antivirus and anti-malware programs are up to date. These programs can help protect your computer from malware infections, which can cause crashes.

 

4.    Close Programs You’re Not Using

One of the best ways to prevent crashes and outages is to close any programs you’re not using. When too many programs are open, your computer’s performance can suffer, leading to crashes and outages.

 

5.    Delete Unwanted Files

Another way to improve your computer’s performance is to regularly delete files you no longer need. This will free up space on your hard drive, allowing your computer to run more efficiently.

 

6.    Try a Trusted Disk Clean-Up to Free Up Some Space

This will help your computer run faster and smoother. You can even defragment your hard drive occasionally to keep it organized and running smoothly.

Remember to install updates for your operating system and software as soon as they are available. Keeping your computer clean and organized will help prevent crashes and outages.

 

Final Words

Don’t forget that you are the one running the computer, not the other way around. Therefore, it is your top priority to maintain the computers for improved performance and to continually check for any disruptions that could result in computer failures.

Try to pay attention to the little warnings your system sends you so you can save not just your computer but also yourself from a mental spiral.

Now that you know what causes crashes and outages, you can stay on top of them by following a few simple rules. Regularly monitoring your system resources, updating your software, keeping your system up to date, and having a good antivirus are the best ways to keep your computer running smoothly and keep both crashes and outages at bay.

Taking care of your data can help you to protect it from crashes and outages. You can get expert help from Protected Harbor to manage and maintain your systems and data. Protected Harbor provides an added layer of security that helps to ensure the uninterrupted flow of business-critical data. Additionally, our expert team monitors and detects any threats or updates to your system in order to ensure a smooth, efficient operation that saves it from crashing.

We help you to avoid the most common causes of data loss and system outages. These include network issues due to malicious activity, viruses, and system overload; natural disasters; power outages; and accidental deletion or corruption of data. You’re less likely to experience a system outage or lose critical data if you have a backup, plus 99.99% uptime is our guarantee.
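
To put an uptime percentage in concrete terms, it helps to convert it into the amount of downtime it still allows. The short calculation below does that; it assumes availability is measured over a 365-day year, which the text above does not specify, and the function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 365.0) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.99% over a 365-day year leaves roughly 52.6 minutes of downtime.
print(round(allowed_downtime_minutes(99.99), 1))
```

By the same arithmetic, 99.9% over the same period allows about 525.6 minutes, or close to nine hours.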

 

Sign up now and get a free consultation to learn more about how Protected Harbor can keep your company’s data secure and your business up and running.

 

Outages and Downtime; Is it a big deal?


Downtime and outages are costly affairs for any company. According to industry research by Gartner, businesses lose as much as $300,000 per hour of downtime on average. As a business owner, safeguarding your online presence from unexpected outages is a high priority. Imagine how your clients feel when they visit your website only to find an “Error: website down” or “Server error” message. Or imagine that half your office is unable to log in and work.
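
As a back-of-the-envelope illustration of that hourly figure, the calculation below scales it to an outage of any length. The $300,000 default is the average cited above; real costs vary widely by business, and the function name is an assumption for this sketch.

```python
def downtime_cost(minutes_down: float, cost_per_hour: float = 300_000.0) -> float:
    """Estimated cost of an outage, scaled linearly from an hourly cost."""
    return minutes_down / 60 * cost_per_hour


print(downtime_cost(30))  # 150000.0
```

Even a half-hour outage at that rate comes to $150,000, which is why short interruptions still matter.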

You may think that some downtime once in a while wouldn’t do much harm to your business. But let me tell you, it’s a big deal.

Downtime and outages are hostile to your business

Whether you’re a large company or a small business, IT outages can cost you exorbitantly. With time, more businesses are becoming dependent on technology and cloud infrastructure. Also, the customer’s expectations are increasing, which means if your system is down and they can’t reach you, they will move elsewhere. Since every customer is valuable, you don’t want to lose them due to an outage. Outages and downtime affect your business in many underlying ways.

Hampers Brand Image

Of all the ways outages impact your business, this is the worst, and it affects you in the long run. It can demolish a reputation that took a long time to build. For example, suppose a customer regularly experiences outages that make your services and products difficult to use. In that case, they will switch to another company and share their negative experiences with others on social platforms. Poor word of mouth may push away potential customers, and your business’s reputation takes a hit.

Loss of productivity and business opportunities

If your servers crash or IT infrastructure is down, productivity and profits follow. Employees and other parties are left stranded without the resources to complete their work. Network outages can bring down the overall productivity, which we call a domino effect. This disrupts the supply chain, which multiplies the impact of downtime. For example, a recent outage of AWS (Amazon Web Services) affected millions of people, their supply chain, and delivery of products and services across all of their platforms and third-party companies sharing the same platform.

For companies that depend on online sales, server outages and downtime are a nightmare. Any loss of networking means customers won’t have access to your products or services online, which leads to fewer customers and lower revenue. The best-case scenario is that the outage is resolved quickly, but imagine if the downtime persists for hours or days and affects a significant number of online customers. A broken sales funnel discourages customers from doing business with you again. The effects of outages can be disastrous.

So how do you prevent system outages?

Downtime and outages are directly related to the capabilities of your servers and IT infrastructure. Prevention can be simplified into Anticipation, Monitoring, and Response. To cover these aspects, we created a foolproof strategy called AOA (Application Outage Avoidance), which we also call Always on Availability. In AOA, we set up several things to prevent and tackle outages.

  • First of which is to anticipate and be proactive. We prepare in advance for possible scenarios and keep them in check.
  • The second thing is in-depth monitoring of the servers. We don’t just check if a server is up or down; we look at RAM, CPU, disk performance, and application performance metrics such as page life expectancy inside of SQL. Then we tie the antivirus directly into our monitoring system. If Windows Defender detects an infected file, it triggers an alert in our monitoring system so we can respond within 5 minutes and quarantine or clean the infected file.
  • The final big piece of this is geo-blocking and blacklisting. Our edge firewalls block entire countries and block bad IPs by reading and updating public IP blacklists every 4 hours to keep up with the latest known attacks. We also use a Windows failover cluster, which eliminates a single point of failure. For example, the client will remain online if a host goes down.
  • Other features include protection against ransomware, viruses, and phishing attacks, complete IT support, and a private cloud backup, all of which have led to us achieving 99.99% uptime for our clients.
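
The blacklist check in the list above can be sketched with Python's standard `ipaddress` module. This is a simplified illustration, not the firewall logic itself: the list format (one address or CIDR range per line, with `#` comments), the function names, and the sample entries are assumptions, and fetching a fresh list every 4 hours is left out.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_blocklist(lines: Iterable[str]) -> List[Network]:
    """Parse one address or CIDR range per line, skipping blanks and # comments."""
    networks: List[Network] = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False accepts entries whose host bits are set.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip: str, networks: Iterable[Network]) -> bool:
    """Return True if the address falls inside any blocklisted network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


# Sample entries use address ranges reserved for documentation.
blocklist = parse_blocklist([
    "203.0.113.0/24  # example range",
    "",
    "198.51.100.7",
])
print(is_blocked("203.0.113.42", blocklist))  # True
print(is_blocked("192.0.2.1", blocklist))     # False
```

Parsing the list into network objects once and reusing it keeps each lookup cheap, which matters when the check runs on every incoming connection.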

These features are implemented in Protected Harbor’s systems and solutions to enable an optimum level of control and advanced safety and security. IT outages can be frustrating, but we actively listen to clients to build a structure that supports your business and workflow, achieving the right mix of IT infrastructure and business operations.

Visit Protected Harbor to end outages and downtime once and for all.